poor surrogate model. In summary, none of the approaches
that are based on surrogate models provide a reliable bound
on the true failure probability. Furthermore, all these ap-
proaches require end-to-end measurements from the real
system, ignoring the composite structure of the system.
In practice, however, the system output $S(\cdot)$ in Eq. (1) refers to a complex system that often has a composite structure. That is, global inputs $x$ propagate through an arrangement of subsystems or components, oftentimes termed a function network; see Fig. 1. Exploiting such a structure is expected
to have a notable impact on the target task, be it experi-
mental design (Marque-Pucheu et al., 2019), calibration and
optimization (Astudillo and Frazier, 2019, 2021; Kusakawa
et al., 2022; Xiao et al., 2022), uncertainty quantification
(Sanson et al., 2019), or system validation as presented here.
In the context of Bayesian Optimization (BO), for example,
Astudillo and Frazier (2021) construct a surrogate system
of Gaussian Processes (GPs) that mirrors the compositional
structure of the system. Similarly, Sanson et al. (2019)
discuss similarities of such structured surrogate models to
Deep GPs (Damianou and Lawrence, 2013), and extend this
framework to local evaluations of constituent components.
However, learning (probabilistic) models of inaccuracies
(Sanson et al., 2019; Riedmaier et al., 2021) introduces
further modeling assumptions and cannot account for data shifts. Instead, we aim for model-free worst-case statements.
Marque-Pucheu et al. (2019) showed that a composite func-
tion can be efficiently modeled from local evaluations of
constituent components in a sequential design approach.
Friedman et al. (2021) extend this framework to cyclic struc-
tures of composite systems for adaptive experimental design.
They derive bounds on the simulation error in composite sys-
tems, although assuming knowledge of Lipschitz constants
as well as uniformly bounded component-wise errors.
Chau et al. (2021) analyzed how to stitch together different datasets covering different parts of a larger mechanism without losing the causal relations, and constructed corresponding models; however, they did not analyze the quality of the statements that can be made about the real mechanism.
Jiang et al. (2022) empirically analyzed bounding the test error of models under input data shift by investigating the disagreement between different models. Although they
find a correlation between disagreement and test error, the
authors do not provide a rigorous bound on the test error
(Sec. 3.3) and also cannot incorporate an existing simulation
model into the analysis.
3 METHOD
3.1 Setup: Composite System Validation
We consider a (real) system or system under test $S$ that is composed of subsystems $S^c$ ($c = 1, 2, \ldots, C$), over which we have only limited information. The validation task is to determine whether $S$ conforms to a given specification, such as whether the system output $y = S(x)$ stays below a given threshold $\tau$ for typical inputs $x$, or whether the system's probability of failure, defined as violating the threshold, is sufficiently low, see Eq. (1). Our approach to this task is built on a model $M$ (typically a simulation, with no analytic form) of $S$ that is similarly composed of corresponding submodels $M^c$. The main challenge in assessing the system's failure probability lies in determining how closely $M$ approximates $S$ in the case where the system data originate from disparate component measurements, which cannot be combined into consistent end-to-end data.
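For intuition only, if end-to-end samples $y \sim S(\cdot \mid x)$ were available, the failure probability of Eq. (1) could be estimated by naive Monte Carlo, as in the Python sketch below. The placeholders `sample_input` and `system` are hypothetical; the point of our setting is precisely that such end-to-end access to $S$ is unavailable.

```python
import numpy as np

def estimate_failure_probability(sample_input, system, tau, n=10_000, rng=None):
    """Naive Monte Carlo estimate of P[S(x) > tau] under the input distribution.

    Illustrative only: in our setting, end-to-end samples y = S(x) from the
    real system are unavailable, so this estimator cannot actually be run.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    failures = 0
    for _ in range(n):
        x = sample_input(rng)    # draw a "typical" input x
        y = system(x)            # end-to-end output sample y ~ S(.|x)
        failures += int(y > tau)
    return failures / n
```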
Components and signals. Mathematically, each component of $S$ (and similarly for $M$) is a (potentially stochastic) map $S^c$, which upon input of a signal $x^c$ produces an output signal (sample) $y^c \sim S^c(\cdot \mid x^c)$ according to the conditional distribution $S^c$. The stochasticity allows for aleatoric system behavior or unmodeled influences. We consider the case where all signals are tuples $x^c = (x^c_1, \ldots, x^c_{d^c_{\mathrm{in}}})$, such as real vectors. The allowed “compositions” of the subsystems $S^c$ must be such that, upon input of any signal (stimulus) $x$, an output sample $y \sim S(\cdot \mid x)$ can be produced by iterating through the components $S^c$ in order $c = 1, 2, \ldots, C$. More precisely, we assume that the input signal $x^c$ into $S^c$ is a concatenation of some entries $x|_{0 \to c}$ of the overall input tuple $x$ and entries $y^{c'}|_{c' \to c}$ of some preceding outputs $y^{c'}$ (with $c' = 1, \ldots, c - 1$); thus, $S^c$ is ready to be queried right after $S^{c-1}$. We assume the overall system output $y = y^C \in \mathbb{R}$ to be real-valued, as multiple technical performance indicators (TPIs) could be considered separately or combined by a weighted mean, etc. The simplest example of such a composite system is a linear chain $S = S^C \circ \cdots \circ S^2 \circ S^1$, where $x \equiv x^1$ is the input into $S^1$ and the output of each component is fed into the next, i.e. $x^{c+1} \equiv y^c$. Another example is shown in Fig. 1, where $x^3$ is concatenated from both outputs $y^1$ and $y^2$. We assume the identical compositional structure for the model $M$ with components $M^c$.
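To make this routing concrete, the following Python sketch (our own illustration, not an implementation from the paper; the `wiring` data structure and all names are expository assumptions) evaluates a composite system by iterating through the components in order $c = 1, \ldots, C$, concatenating for each component the selected entries of the global input with the selected entries of preceding outputs.

```python
import numpy as np

def evaluate_composite(components, wiring, x):
    """Evaluate a composite system by querying components in order.

    components : list of C maps; each takes a 1-D array and returns a 1-D
                 array (a stochastic component may sample internally).
    wiring     : per component, a pair (input_idx, preds): input_idx selects
                 the entries x|_{0->c} of the global input, and preds is a
                 list of (c', idx) pairs selecting entries y^{c'}|_{c'->c}
                 of preceding outputs (0-based component indices).
    x          : global input tuple as a 1-D array.
    """
    outputs = []                                        # collects y^1, ..., y^C
    for S_c, (input_idx, preds) in zip(components, wiring):
        parts = [np.asarray(x, dtype=float)[input_idx]] # entries taken from x
        for c_prime, idx in preds:                      # entries from earlier outputs
            parts.append(outputs[c_prime][idx])
        x_c = np.concatenate(parts)                     # concatenated input signal x^c
        outputs.append(np.atleast_1d(S_c(x_c)))         # S^c is ready after S^{c-1}
    return outputs[-1].item()                           # real-valued output y = y^C

# Linear chain S = S^3 ∘ S^2 ∘ S^1: each component consumes the previous output.
chain = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: v - 3.0]
wiring = [([0], []), ([], [(0, [0])]), ([], [(1, [0])])]
print(evaluate_composite(chain, wiring, np.array([1.0])))  # ((1+1)*2)-3 = 1.0
```

The same `wiring` mechanism covers the network of Fig. 1, where the input $x^3$ would list entries from both $y^1$ and $y^2$.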
Validation data. An essential characteristic of our setup is that neither $S$ nor the subsystem maps $S^c$ are known explicitly, and that “end-to-end” measurements $(x, y)$ from the full system $S$ are unavailable (see Sec. 1). Rather, we assume that validation data are available only for every subsystem $S^c$, i.e. pairs $(x^c_v, y^c_v)$ of inputs $x^c_v$ and corresponding output samples $y^c_v \sim S^c(\cdot \mid x^c_v)$ ($v = 1, \ldots, V^c$). Such validation data may have been obtained by measuring subsystem $S^c$ in isolation on some inputs $x^c_v$, without needing the full system $S$; note that the inputs $x^c_v$ do not necessarily follow the distribution induced by preceding components. In the same spirit, the models $M^c$ may also have been trained from such “local” system data; we assume $M^c$ and $M$ to be given from the start.
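As a small illustration of the data at hand (a sketch under our own assumptions, not the validation method itself), the local validation pairs of a component can be used to compute residuals of the corresponding submodel; for stochastic components these residuals mix model error with aleatoric noise.

```python
import numpy as np

def component_residuals(M_c, X_val, Y_val):
    """Residuals ||M^c(x^c_v) - y^c_v|| of submodel M^c on local validation pairs.

    X_val : array of shape (V_c, d_in)  -- validation inputs x^c_v
    Y_val : array of shape (V_c, d_out) -- measured output samples y^c_v
    """
    preds = np.stack([np.atleast_1d(M_c(x)) for x in X_val])
    return np.linalg.norm(preds - Y_val, axis=1)

# Hypothetical usage: a submodel and three local measurements of S^1.
M_1 = lambda v: 2.0 * v                      # submodel M^1
X_val = np.array([[0.5], [1.0], [2.0]])      # inputs x^1_v
Y_val = np.array([[1.1], [1.9], [4.2]])      # samples y^1_v ~ S^1(.|x^1_v)
print(component_residuals(M_1, X_val, Y_val))  # approx. [0.1, 0.1, 0.2]
```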
Probability distributions. We aim for probabilistic validation statements, namely that the system fails or violates its requirements only with low probability. For this, we