Preprint
Deep Grey-Box Modeling With Adaptive Data-Driven Models
Toward Trustworthy Estimation of Theory-Driven Models
Naoya Takeishi naoya.takeishi@hesge.ch
Geneva School of Business Administration
University of Applied Sciences and Arts Western Switzerland (HES-SO)
Alexandros Kalousis alexandros.kalousis@hesge.ch
Geneva School of Business Administration
University of Applied Sciences and Arts Western Switzerland (HES-SO)
Abstract
The combination of deep neural nets and theory-driven models, which we call deep grey-box modeling, can be inherently interpretable to some extent thanks to the theory backbone. Deep grey-box models are usually learned via regularized risk minimization to prevent the theory-driven part from being overwritten and ignored by a deep neural net. However, an estimate of the theory-driven part obtained by uncritically optimizing a regularizer can hardly be trustworthy when we are unsure which regularizer is suitable for the given data, which may harm the interpretability. Toward a trustworthy estimation of the theory-driven part, we should analyze regularizers' behavior to compare different candidates and to justify a specific choice. In this paper, we present a framework that enables us to analyze a regularizer's behavior empirically with a slight change in the neural net's architecture and the training objective.
1 INTRODUCTION
Grey-box modeling in general refers to the combination of theory-driven structures and data-driven components (see, e.g., Sohlberg and Jacobsen, 2008). In this paper, we are interested in combining theory-driven models, such as mathematical models of physical phenomena, with data-driven machine-learning models. For example, for regressing from x to y, we are interested in models like

y = f_T(x; θ_T) + f_D(x; θ_D) + e,

where f_T and f_D denote theory- and data-driven models parameterized by θ_T and θ_D, respectively, and e is noise. A more general model will appear in Section 2 and thereafter. We discuss how we should (or should not) cast the estimation problem of a grey-box model's parameters.
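The additive form above can be sketched concretely. The following minimal Python sketch is our own illustration, not the authors' implementation: the theory part is a one-parameter linear law and a tiny polynomial stands in for the deep neural net f_D; all function names and the specific forms are assumptions for exposition.

```python
import math

def f_theory(x, theta_T):
    """Theory-driven part f_T(x; theta_T): a simple law with one interpretable parameter."""
    return theta_T * x

def f_data(x, theta_D):
    """Data-driven part f_D(x; theta_D); a small feature model stands in for a deep net."""
    w1, w2 = theta_D
    return w1 * x**2 + w2 * math.sin(x)

def grey_box_predict(x, theta_T, theta_D):
    """Additive grey-box composition: y = f_T(x; theta_T) + f_D(x; theta_D)."""
    return f_theory(x, theta_T) + f_data(x, theta_D)
```

For instance, `grey_box_predict(1.0, 2.0, (0.5, 0.0))` returns 2.5: the theory part contributes 2.0 and the data-driven residual 0.5.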
We are particularly concerned with the interpretability of grey-box models. They can be interpretable because a part of the model rests on a theory backbone and often has a small number of parameters. They are valuable tools for gleaning insights from data for which our theory is essentially incomplete. However, as we discuss later, interpretability is not a free lunch, and we need to pay attention to how we perform the estimation to secure the trustworthiness of the interpretation.
We are interested in cases where the data-driven model, f_D, is a deep neural network. In such a case, care must be taken so that the theory-based model, f_T, is not overwritten and ignored by f_D due to the expressive power of the latter. Such models have been studied recently (Qian et al., 2021; Takeishi and Kalousis, 2021; Wehenkel et al., 2022; Yin et al., 2021; see also Section 4). Deep grey-box models are typically learned by solving optimization problems like

minimize_{θ_T, θ_D}  L + λR,
arXiv:2210.13103v1 [cs.LG] 24 Oct 2022
where L is a prediction loss, and R is a regularizer needed to prevent f_T from being ignored. Although this optimization may result in a model with good prediction performance, we cannot judge whether such an estimate is worth interpreting. There can be multiple parameter values that achieve similar performance, and the optimization's solution tells us nothing about the properties of R, so it provides insufficient information to justify a choice of θ_T to be interpreted. For the same reason, we cannot compare different Rs solely based on the optimization's solution. Instead, we should analyze the behavior of R, at least empirically, toward obtaining a trustworthy estimate and its interpretation.
Our idea is simple: we withhold the estimation of θ_T, at least in the first trial of data analysis. To this end, we learn f_D adaptively with realizations of θ_T, essentially marginalizing out θ_T. Such a slight change in the formulation allows us to analyze R empirically by examining its landscape without any need for re-training. In this paper, we take up the aforementioned argument about the estimation's trustworthiness for discussion, formulate the above idea, and conduct an empirical investigation of its effectiveness.
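The idea of learning f_D adaptively over realizations of θ_T can be sketched as follows. This is a hedged, schematic illustration under our own assumptions (the sampling scheme, the helper names `train_adaptive_fD` and `scan_regularizer`, and the toy closures are all ours): f_D is updated against freshly drawn θ_T values during training, so that the regularizer's landscape over θ_T can be scanned afterwards without retraining.

```python
def train_adaptive_fD(sample_theta_T, fit_step, n_iters=100):
    """At each step, draw a realization of theta_T and update f_D for that value,
    effectively marginalizing theta_T out of the training of f_D."""
    for _ in range(n_iters):
        theta_T = sample_theta_T()
        fit_step(theta_T)  # one optimization step of f_D conditioned on this theta_T

def scan_regularizer(R, theta_T_grid):
    """Evaluate the regularizer landscape R(theta_T) over candidate values,
    with no retraining needed because f_D already adapts to any theta_T."""
    return {theta_T: R(theta_T) for theta_T in theta_T_grid}
```

With a trained adaptive f_D, `scan_regularizer` exposes how R ranks candidate θ_T values, which is exactly the empirical analysis the paper advocates before committing to an interpretation.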
A useful byproduct of the proposed formulation is that the optimization of L and R can now be decoupled. This allows us to use different optimizers for the two objectives, as well as to use unlabeled test data to optimize R. Moreover, we found that such decoupled optimization makes the optimization much less sensitive to the hyperparameter, λ.
2 PRELIMINARY
2.1 Definition
We define deep grey-box models as compositions of theory-driven models and data-driven models, with the latter being deep neural networks. For the sake of discussion, we suppose regression problems where y ∈ Y is to be predicted from x ∈ X, though the extension to other problems is straightforward. We denote such a model in general by

y = C(f_T, f_D; x),   (1)

where C is a functional that takes the two types of functions and an input variable as arguments. The two functions, f_T and f_D, are a theory-driven model and a deep neural network, with unknown parameters θ_T ∈ Θ_T and θ_D ∈ Θ_D, respectively. We may write f_T(x; θ_T) to manifest f_T's dependency on θ_T, or write simply f_T(x) though it still depends on θ_T (and analogously for f_D and θ_D). Note that not only f_D but also f_T may have unknown parameters to be inferred. We usually expect dim θ_T ≪ dim θ_D. The functional, C, evaluates f_T and f_D with a given (θ_T, θ_D, x) and then mixes up their outputs to give the final output of the model.
We try to keep C general; it may include general function compositions and their arbitrary transformations:

C(f_T, f_D; x) = SomeTransformation[f_D(f_T(x), x)].

Meanwhile, one of the most prevalent forms of C in the literature is the additive grey-box ODE:

C(f_T, f_D; x) = ODESolve[ṡ = f_T(s) + f_D(s) | s_0 = x],

where s is the state variable of the dynamics, s_0 is the initial condition, and ODESolve denotes an operation that numerically solves initial value problems. Such grey-box (ordinary or partial) differential equations have been studied by researchers such as Qian et al. (2021), Sasaki et al. (2019), Takeishi and Kalousis (2021), and Yin et al. (2021).
We are particularly interested in cases where C inherits the expressive power of f_D as a function approximator. This is not the case, for example, when C only contains compositions such as f_T(f_D(x), x), that is, when f_T is "outside" f_D (e.g., Arık et al., 2020; Raissi et al., 2019; Schnell et al., 2022). Estimation of such models is less challenging because f_T cannot be ignored by construction. In contrast, we address the more difficult cases where f_T is "inside" f_D, for which we should be careful that f_T is not overwritten and ignored by f_D. We put the following assumptions on the model:
Assumption 1. f_D: X → Y is a universal function approximator; that is, for any ε > 0 and any continuous function g: X → Y, there exists θ_D ∈ Θ_D satisfying sup_{x∈S_X} ‖f_D(x; θ_D) − g(x)‖ < ε, where S_X ⊂ X is some compact set.

Assumption 2. C(f_T, f_D; ·): X → Y is also a universal function approximator; that is, for any ε > 0, θ_T ∈ Θ_T, and any continuous function g′: X → Y, there exists θ_D ∈ Θ_D satisfying sup_{x∈S_X} ‖C(f_T, f_D; x) − g′(x)‖ < ε.

Remark 1. We assume the universal approximation property only to construct the discussion rigorously. Even without this property, as long as f_D and C are much more expressive than f_T, the discussions below would approximately hold in practice.

Assumption 3. f_T and f_D are Lipschitz continuous with regard to (x, θ_T) and (x, θ_D), respectively.
2.2 Why Grey-box?
Deep grey-box models are powerful function approximators with a certain level of inherent interpretability owing to f_T, a human-understandable model with a theory as its backbone. A typical use case would be to estimate a grey-box model on data for which our theory is essentially incomplete, and then inspect the estimated model to glean insights, e.g., whether the incomplete theory is correct, how the missing part approximated by f_D behaves, and so on.
Deep grey-box models can also be advantageous in generalization capability and robustness to extrapolation, as has been reported empirically (Qian et al., 2021; Takeishi and Kalousis, 2021; Wehenkel et al., 2022; Yin et al., 2021). It is natural to expect such improvements because the presence of f_T in the model should reduce the sample complexity of the learning problem, and f_T is supposed to work well in the out-of-data regime (in other words, this is a requirement for a model to be regarded as theory-driven). However, rigorous analysis of generalization is challenging for models involving deep neural nets. In any case, given the previous studies, we do not touch on such performance aspects of deep grey-box models, so the comparison to non-grey-box models is out of the paper's scope.
2.3 Empirical Risk Minimization Cannot Select θ_T
There is a natural consequence of deep grey-box modeling: the theory-driven model's parameter, θ_T, cannot be chosen solely by minimizing an empirical risk of prediction. For example, suppose we learn C(f_T, f_D; x) = f_T(x) + f_D(x) by minimizing the mean squared error, L = ‖y − (f_T(x) + f_D(x))‖_2^2. The empirical risk can be minimized to a similar extent for any θ_T ∈ Θ_T because f_D, a deep neural net, can by itself approximate any function on the training set (as assumed in Assumption 1), and thus also the function y − f_T(x). We formally state this fact as follows:
Proposition 1. Let S = {(x_1, y_1), ..., (x_n, y_n)} be a training set. Let L_{(x,y)}(θ_T, θ_D) be a Lipschitz continuous loss function between the prediction (i.e., the value of C(f_T, f_D; x)) and the target (i.e., y). Let L_S(θ_T, θ_D) = Σ_{(x,y)∈S} L_{(x,y)}(θ_T, θ_D) be the empirical risk on the training set. Suppose that Assumptions 1–3 hold. Then, for any ε′ > 0, θ_D ∈ Θ_D, and θ_T, θ′_T ∈ Θ_T where θ_T ≠ θ′_T, there exists θ′_D ∈ Θ_D that satisfies

|L_S(θ_T, θ_D) − L_S(θ′_T, θ′_D)| < ε′.   (2)

Proof. From the assumptions, for any ε > 0, θ_D ∈ Θ_D, and θ_T ≠ θ′_T ∈ Θ_T, there exists θ′_D ∈ Θ_D that satisfies sup_{x∈{x_1,...,x_n}} ‖C(f_T, f_D; x) − C(f′_T, f′_D; x)‖ < ε, where f′_i is parameterized by θ′_i for i = T, D. Since L is Lipschitz continuous, sup_{(x,y)∈S} |L_{(x,y)}(θ_T, θ_D) − L_{(x,y)}(θ′_T, θ′_D)| < Kε, where K is L's Lipschitz constant. Therefore, with ε′ := |S|Kε, |L_S(θ_T, θ_D) − L_S(θ′_T, θ′_D)| < ε′.
2.4 Regularized Risk Minimization
Proposition 1 states that any θ_T ∈ Θ_T can be equally likely solely under the empirical risk. This necessitates regularizing the problem; we should optimize L_S + λR instead, where λ ≥ 0 is a regularization hyperparameter, and R is some regularizer that should reflect our inductive biases on how the theory- and data-driven models should be combined. Let us, for example, consider the linear combination case, C(f_T, f_D; x) = f_T(x) + f_D(x).
One common way of thinking is that f_T should explain the x–y relation as accurately as possible, and f_D should have the least possible effect. This idea can be operationalized by defining R = ‖f_D‖, where the norm is a function norm. Though such an R has been a popular choice, it is not the only possibility. For example, when one wants the two models' outputs to be uncorrelated, one can use R = |⟨f_T(x), f_D(x)⟩|.¹
We assume that R depends only on x. This is natural because the role of R is not to fit the x–y relation. We will recall this assumption, when necessary, by writing R_{S_X}, where S_X = {x_1, ..., x_n} is the set of xs extracted from S. We do not suppose, at least explicitly, any further specification of R beyond this assumption.
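The two candidate regularizers just mentioned can be estimated empirically on a set of inputs S_X, with no labels needed since R depends only on x. The sketch below is a hedged illustration; the Monte Carlo estimates and the function names are our own assumptions.

```python
def reg_small_fD(f_D, xs):
    """R = ||f_D||: penalize the magnitude of the data-driven part
    (empirical squared L2 norm over the inputs xs)."""
    return sum(f_D(x) ** 2 for x in xs) / len(xs)

def reg_uncorrelated(f_T, f_D, xs):
    """R = |<f_T(x), f_D(x)>|: penalize correlation between the two parts' outputs
    (absolute empirical inner product over the inputs xs)."""
    return abs(sum(f_T(x) * f_D(x) for x in xs) / len(xs))
```

Note that different regularizers can favor different θ_T: a constant f_T and a zero-mean f_D score zero under `reg_uncorrelated` yet may score high under `reg_small_fD`, which is one reason for comparing candidate Rs empirically.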
The regularized estimation problem can be cast as follows:
Inductive Learning The simplest formulation is

θ*_T, θ*_D = argmin_{θ_T, θ_D} L_S(θ_T, θ_D) + λ R_{S_X}(θ_T, θ_D).   (3)

In this formulation, not only L but also R suffers a generalization gap. Also, λ needs to be tuned somehow.
Transductive Learning Since we assume that R only depends on x, it is reasonable to consider transductive learning (Gammerman et al., 1998). Let S′_X be some set of xs that may include S_X as a subset. The idea is to minimize the unsupervised part of the objective not on the training data S_X but rather on S′_X:

θ*_T, θ*_D = argmin_{θ_T, θ_D} L_S(θ_T, θ_D) + λ R_{S′_X}(θ_T, θ_D).   (4)

The generalization gap disappears for R when (a subset of) S′_X is the test set, but λ still needs to be tuned.
3 TOWARD TRUSTWORTHY ESTIMATION
3.1 Challenges of Deep Grey-box Model Estimation
Deep grey-box models have been studied mainly in terms of empirical generalization and extrapolation capability (Qian et al., 2021; Takeishi and Kalousis, 2021; Wehenkel et al., 2022; Yin et al., 2021). However, when the model's interpretation is concerned, prediction performance does not say much; there can be multiple parameter values that perform similarly (cf. Rashomon sets), and we cannot judge which one we should interpret. The solution of the optimization in Eq. (3) or (4) tells us nothing about the analytical properties of R, so we can hardly understand the full picture of how the optimization selects θ_T. Instead of uncritically optimizing the regularizer, R, we should know its properties in order to gain more information to explain the choice of θ_T to be interpreted. We contrast this situation with, for example, the estimators of linear regression models, which have been extensively analyzed and thus are trustworthy in some sense. We do not suggest analyzing our Rs analytically, as that is too problem-dependent, but analyzing them at least empirically would help us make θ_T's estimation more trustworthy.
The challenge of not knowing R's properties stands out even more when we do not know which R is suitable for the given data and need to compare different candidate Rs, which is often the case since we do not know the whole data-generating process. The point estimation via Eq. (3) or (4) would not tell much about the goodness of R, since different Rs could achieve similar prediction performance. This viewpoint also supports the need to analyze R at least empirically, to gain information for comparing different Rs.
Another, more technical challenge is the choice of the regularization hyperparameter, λ. It can be tricky because, in Eq. (3) or (4), it controls two things at the same time: "which θ_T should be selected" and "how much f_D should be regularized." These are different, if interrelated, problems, and thus decoupling them would be beneficial.
3.2 Proposed Formulation
As we argued above, analyzing R empirically can be a useful first step toward a trustworthy estimation of θ_T. More specifically, we aim to explore the landscape of R, that is, to evaluate the values of R for different
¹ Suggesting a specific R for each application or in general is out of the scope of this paper; on the contrary, our proposal in Section 3 is for cases where we cannot specify R a priori.