
Preprint
Assumption 1. $f_\mathrm{D}: \mathcal{X} \to \mathcal{Y}$ is a universal function approximator; that is, for any $\epsilon > 0$ and any continuous function $g: \mathcal{X} \to \mathcal{Y}$, there exists $\theta_\mathrm{D} \in \Theta_\mathrm{D}$ satisfying $\sup_{x \in S_\mathcal{X}} \| f_\mathrm{D}(x; \theta_\mathrm{D}) - g(x) \| < \epsilon$, where $S_\mathcal{X} \subset \mathcal{X}$ is some compact set.
Assumption 2. $C(f_\mathrm{T}, f_\mathrm{D}; \cdot): \mathcal{X} \to \mathcal{Y}$ is also a universal function approximator; that is, for any $\epsilon > 0$, $\theta_\mathrm{T} \in \Theta_\mathrm{T}$, and any continuous function $g': \mathcal{X} \to \mathcal{Y}$, there exists $\theta_\mathrm{D} \in \Theta_\mathrm{D}$ satisfying $\sup_{x \in S_\mathcal{X}} \| C(f_\mathrm{T}, f_\mathrm{D}; x) - g'(x) \| < \epsilon$.
Remark 1. We assume the universal approximation property merely to make the discussion rigorous. Even without it, as long as $f_\mathrm{D}$ and $C$ are much more expressive than $f_\mathrm{T}$, the discussions below would approximately hold in practice.
Assumption 3. $f_\mathrm{T}$ and $f_\mathrm{D}$ are Lipschitz continuous with respect to $(x, \theta_\mathrm{T})$ and $(x, \theta_\mathrm{D})$, respectively.
2.2 Why Grey-box?
Deep grey-box models are powerful function approximators with a certain level of inherent interpretability owing to $f_\mathrm{T}$, a human-understandable model backed by a theory. A typical use case is to estimate a grey-box model on data for which our theory is essentially incomplete, and then inspect the estimated model to glimpse insights, e.g., where the incomplete theory is or is not valid, how the missing part approximated by $f_\mathrm{D}$ behaves, and so on.
Deep grey-box models can also be advantageous in generalization capability and robustness to extrapolation, as reported empirically so far (Qian et al., 2021; Takeishi and Kalousis, 2021; Wehenkel et al., 2022; Yin et al., 2021). Such improvements are natural to expect because the presence of $f_\mathrm{T}$ in the model should reduce the sample complexity of the learning problem, and $f_\mathrm{T}$ is supposed to work well in the out-of-data regime (in other words, this is a requirement for a model to be regarded as theory-driven). However, rigorous analysis of generalization is challenging for models involving deep neural nets. In any case, given the previous studies, we do not touch on such performance aspects of deep grey-box models, so the comparison to non-grey-box models is out of the paper's scope.
2.3 Empirical Risk Minimization Cannot Select $\theta_\mathrm{T}$
There is a natural consequence of deep grey-box modeling: the theory-driven model's parameter, $\theta_\mathrm{T}$, cannot be chosen solely by minimizing an empirical risk of prediction. For example, suppose we learn $C(f_\mathrm{T}, f_\mathrm{D}; x) = f_\mathrm{T}(x) + f_\mathrm{D}(x)$ by minimizing the mean squared error, $L = \| y - (f_\mathrm{T}(x) + f_\mathrm{D}(x)) \|_2^2$. The empirical risk can be minimized to a similar extent for any $\theta_\mathrm{T} \in \Theta_\mathrm{T}$ because $f_\mathrm{D}$, a deep neural net, can by itself approximate any function on the training set (as assumed in Assumption 1) and thus also the function $y - f_\mathrm{T}(x)$. We formally state this fact as follows:
Proposition 1. Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a training set. Let $L_{(x,y)}(\theta_\mathrm{T}, \theta_\mathrm{D})$ be a Lipschitz continuous loss function between the prediction (i.e., the value of $C(f_\mathrm{T}, f_\mathrm{D}; x)$) and the target (i.e., $y$). Let $\mathcal{L}_S(\theta_\mathrm{T}, \theta_\mathrm{D}) = \sum_{(x,y) \in S} L_{(x,y)}(\theta_\mathrm{T}, \theta_\mathrm{D})$ be the empirical risk on the training set. Suppose that Assumptions 1–3 hold. Then, for any $\epsilon' > 0$, $\theta_\mathrm{D} \in \Theta_\mathrm{D}$, and $\theta_\mathrm{T}, \theta'_\mathrm{T} \in \Theta_\mathrm{T}$ where $\theta_\mathrm{T} \neq \theta'_\mathrm{T}$, there exists $\theta'_\mathrm{D} \in \Theta_\mathrm{D}$ that satisfies
$$| \mathcal{L}_S(\theta_\mathrm{T}, \theta_\mathrm{D}) - \mathcal{L}_S(\theta'_\mathrm{T}, \theta'_\mathrm{D}) | < \epsilon'. \quad (2)$$
Proof. From the assumptions, for any $\epsilon > 0$, $\theta_\mathrm{D} \in \Theta_\mathrm{D}$, and $\theta_\mathrm{T} \neq \theta'_\mathrm{T} \in \Theta_\mathrm{T}$, there exists $\theta'_\mathrm{D} \in \Theta_\mathrm{D}$ that satisfies $\sup_{x \in \{x_1, \ldots, x_n\}} \| C(f_\mathrm{T}, f_\mathrm{D}; x) - C(f'_\mathrm{T}, f'_\mathrm{D}; x) \| < \epsilon$, where $f'_i$ is parameterized by $\theta'_i$ for $i = \mathrm{T}, \mathrm{D}$. Since $L$ is Lipschitz continuous, $\sup_{(x,y) \in S} | L_{(x,y)}(\theta_\mathrm{T}, \theta_\mathrm{D}) - L_{(x,y)}(\theta'_\mathrm{T}, \theta'_\mathrm{D}) | < K\epsilon$, where $K$ is $L$'s Lipschitz constant. Therefore, with $\epsilon' := |S| K \epsilon$, $| \mathcal{L}_S(\theta_\mathrm{T}, \theta_\mathrm{D}) - \mathcal{L}_S(\theta'_\mathrm{T}, \theta'_\mathrm{D}) | < \epsilon'$.
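As an informal numerical illustration of Proposition 1 (not part of the paper's experiments), the sketch below uses a polynomial least-squares fit as a stand-in for the expressive data-driven part $f_\mathrm{D}$. Because the polynomial family contains the toy theory $f_\mathrm{T}(x; \theta_\mathrm{T}) = \theta_\mathrm{T} x$, the residual $y - f_\mathrm{T}(x)$ is absorbed equally well for very different $\theta_\mathrm{T}$, leaving the training risk essentially unchanged. All names and the setup here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.01 * rng.standard_normal(50)  # toy training targets

def f_T(x, theta_T):
    # toy theory-driven model: a single-parameter linear trend
    return theta_T * x

def fit_f_D(x, residual, degree=9):
    # flexible data-driven stand-in: polynomial least squares on the residual
    X = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    return X @ coef

def empirical_risk(theta_T):
    # mean squared error of the combined model f_T + f_D on the training set
    residual = y - f_T(x, theta_T)
    f_D = fit_f_D(x, residual)
    return np.mean((y - (f_T(x, theta_T) + f_D)) ** 2)

# two very different theory parameters reach nearly identical training risk,
# because f_D absorbs the residual y - f_T(x) in either case
print(empirical_risk(0.5), empirical_risk(5.0))
```

The two printed risks agree to within numerical precision, mirroring the proposition: the empirical risk alone carries no information about $\theta_\mathrm{T}$.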
2.4 Regularized Risk Minimization
Proposition 1 states that any $\theta_\mathrm{T} \in \Theta_\mathrm{T}$ can be equally good solely under the empirical risk. This necessitates regularizing the problem; we should instead optimize $\mathcal{L}_S + \lambda R$, where $\lambda \geq 0$ is a regularization hyperparameter, and $R$ is some regularizer that should reflect our inductive biases on how we should combine the theory- and data-driven models. Let us, for example, consider the linear combination case, $C(f_\mathrm{T}, f_\mathrm{D}; x) = f_\mathrm{T}(x) + f_\mathrm{D}(x)$.