
Preprint
Assumption 1. $f_\mathrm{D}: \mathcal{X} \to \mathcal{Y}$ is a universal function approximator; that is, for any $\epsilon > 0$ and any continuous function $g: \mathcal{X} \to \mathcal{Y}$, there exists $\theta_\mathrm{D} \in \Theta_\mathrm{D}$ satisfying $\sup_{x \in S_\mathcal{X}} \| f_\mathrm{D}(x; \theta_\mathrm{D}) - g(x) \| < \epsilon$, where $S_\mathcal{X} \subset \mathcal{X}$ is some compact set.
Assumption 2. $C(f_\mathrm{T}, f_\mathrm{D}; \cdot): \mathcal{X} \to \mathcal{Y}$ is also a universal function approximator; that is, for any $\epsilon > 0$, $\theta_\mathrm{T} \in \Theta_\mathrm{T}$, and any continuous function $g': \mathcal{X} \to \mathcal{Y}$, there exists $\theta_\mathrm{D} \in \Theta_\mathrm{D}$ satisfying $\sup_{x \in S_\mathcal{X}} \| C(f_\mathrm{T}, f_\mathrm{D}; x) - g'(x) \| < \epsilon$.
Remark 1. We assume the universal approximation property merely to make the discussion rigorous. Even without it, as long as $f_\mathrm{D}$ and $C$ are much more expressive than $f_\mathrm{T}$, the discussions below would approximately hold in practice.
Assumption 3. $f_\mathrm{T}$ and $f_\mathrm{D}$ are Lipschitz continuous with respect to $(x, \theta_\mathrm{T})$ and $(x, \theta_\mathrm{D})$, respectively.
2.2 Why Grey-box?
Deep grey-box models are powerful function approximators with a certain level of inherent interpretability owing to $f_\mathrm{T}$, a human-understandable model backed by a theory. A typical use case is to estimate a grey-box model on data for which our theory is essentially incomplete, and then inspect the estimated model to glimpse insights, e.g., where the incomplete theory is or is not valid, how the missing part approximated by $f_\mathrm{D}$ behaves, and so on.
Deep grey-box models can also be advantageous in generalization capability and robustness to extrapolation, as reported empirically so far (Qian et al., 2021; Takeishi and Kalousis, 2021; Wehenkel et al., 2022; Yin et al., 2021). Such improvements are natural to expect because the presence of $f_\mathrm{T}$ in the model should reduce the sample complexity of the learning problem, and $f_\mathrm{T}$ is supposed to work well in the out-of-data regime (in other words, this is a requirement for a model to be regarded as theory-driven). However, rigorous analysis of generalization is challenging for models involving deep neural nets. In any case, given the previous studies, we do not touch on such performance aspects of deep grey-box models, so the comparison to non-grey-box models is out of the paper's scope.
2.3 Empirical Risk Minimization Cannot Select $\theta_\mathrm{T}$
There is a natural consequence of deep grey-box modeling: the theory-driven model's parameter, $\theta_\mathrm{T}$, cannot be chosen solely by minimizing an empirical risk of prediction. For example, suppose we learn $C(f_\mathrm{T}, f_\mathrm{D}; x) = f_\mathrm{T}(x) + f_\mathrm{D}(x)$ by minimizing the mean squared error, $L = \| y - (f_\mathrm{T}(x) + f_\mathrm{D}(x)) \|_2^2$. The empirical risk can be minimized to a similar extent for any $\theta_\mathrm{T} \in \Theta_\mathrm{T}$ because $f_\mathrm{D}$, a deep neural net, can by itself approximate any function on the training set (as assumed in Assumption 1) and thus also the function $y - f_\mathrm{T}(x)$. We formally state this fact as follows:
Proposition 1. Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a training set. Let $L_{(x,y)}(\theta_\mathrm{T}, \theta_\mathrm{D})$ be a Lipschitz continuous loss function between the prediction (i.e., the value of $C(f_\mathrm{T}, f_\mathrm{D}; x)$) and the target (i.e., $y$). Let $\mathcal{L}_S(\theta_\mathrm{T}, \theta_\mathrm{D}) = \sum_{(x,y) \in S} L_{(x,y)}(\theta_\mathrm{T}, \theta_\mathrm{D})$ be the empirical risk on the training set. Suppose that Assumptions 1–3 hold. Then, for any $\epsilon' > 0$, $\theta_\mathrm{D} \in \Theta_\mathrm{D}$, and $\theta_\mathrm{T}, \theta'_\mathrm{T} \in \Theta_\mathrm{T}$ where $\theta_\mathrm{T} \neq \theta'_\mathrm{T}$, there exists $\theta'_\mathrm{D} \in \Theta_\mathrm{D}$ that satisfies
$$| \mathcal{L}_S(\theta_\mathrm{T}, \theta_\mathrm{D}) - \mathcal{L}_S(\theta'_\mathrm{T}, \theta'_\mathrm{D}) | < \epsilon'. \quad (2)$$
Proof. From the assumptions, for any $\epsilon > 0$, $\theta_\mathrm{D} \in \Theta_\mathrm{D}$, and $\theta_\mathrm{T} \neq \theta'_\mathrm{T} \in \Theta_\mathrm{T}$, there exists $\theta'_\mathrm{D} \in \Theta_\mathrm{D}$ that satisfies $\sup_{x \in \{x_1, \ldots, x_n\}} \| C(f_\mathrm{T}, f_\mathrm{D}; x) - C(f'_\mathrm{T}, f'_\mathrm{D}; x) \| < \epsilon$, where $f'_i$ is parameterized by $\theta'_i$ for $i = \mathrm{T}, \mathrm{D}$. Since $L$ is Lipschitz continuous, $\sup_{(x,y) \in S} | L_{(x,y)}(\theta_\mathrm{T}, \theta_\mathrm{D}) - L_{(x,y)}(\theta'_\mathrm{T}, \theta'_\mathrm{D}) | < K\epsilon$, where $K$ is $L$'s Lipschitz constant. Therefore, with $\epsilon' := |S| K \epsilon$, $| \mathcal{L}_S(\theta_\mathrm{T}, \theta_\mathrm{D}) - \mathcal{L}_S(\theta'_\mathrm{T}, \theta'_\mathrm{D}) | < \epsilon'$.
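As an informal numerical illustration of Proposition 1 (not part of the paper's experiments), the sketch below uses a polynomial least-squares fit as a stand-in for the expressive data-driven part $f_\mathrm{D}$. Because the polynomial family contains the toy theory $f_\mathrm{T}(x; \theta_\mathrm{T}) = \theta_\mathrm{T} x$, the residual $y - f_\mathrm{T}(x)$ is absorbed equally well for very different $\theta_\mathrm{T}$, leaving the training risk essentially unchanged. All names and the setup here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.01 * rng.standard_normal(50)  # toy training targets

def f_T(x, theta_T):
    # toy theory-driven model: a single-parameter linear trend
    return theta_T * x

def fit_f_D(x, residual, degree=9):
    # flexible data-driven stand-in: polynomial least squares on the residual
    X = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    return X @ coef

def empirical_risk(theta_T):
    # mean squared error of the combined model f_T + f_D on the training set
    residual = y - f_T(x, theta_T)
    f_D = fit_f_D(x, residual)
    return np.mean((y - (f_T(x, theta_T) + f_D)) ** 2)

# two very different theory parameters reach nearly identical training risk,
# because f_D absorbs the residual y - f_T(x) in either case
print(empirical_risk(0.5), empirical_risk(5.0))
```

The two printed risks agree to within numerical precision, mirroring the proposition: the empirical risk alone carries no information about $\theta_\mathrm{T}$.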
2.4 Regularized Risk Minimization
Proposition 1 states that any $\theta_\mathrm{T} \in \Theta_\mathrm{T}$ can be equally good solely under the empirical risk. This necessitates regularizing the problem; we should instead optimize $\mathcal{L}_S + \lambda R$, where $\lambda \geq 0$ is a regularization hyperparameter, and $R$ is some regularizer that should reflect our inductive biases on how we should combine the theory- and data-driven models. Let us, for example, consider the linear combination case, $C(f_\mathrm{T}, f_\mathrm{D}; x) = f_\mathrm{T}(x) + f_\mathrm{D}(x)$.