
3. Theoretical insights
3.1. Problem setup
Consider a supervised setting of predicting labels $Y \in \mathcal{Y}$ from input samples $X \in \mathcal{X}$ with a classifier $f_\theta: \mathcal{X} \to \mathcal{Y}$ parameterized by $\theta \in \Theta$. Following [37], let $(X^e, Y^e) \sim P^e$, where $X^e \in \mathcal{X}$ and $Y^e \in \mathcal{Y}$ denote the input random variable and the corresponding label, respectively, $e \in \mathcal{E} = \{1, 2, \dots, E\}$ denotes the index of an environment, $P^e$ is the corresponding distribution, and the set $\mathcal{E}$ covers all possible environments. We further assume that $\mathcal{E}$ is divided into training environments $\mathcal{E}_{\mathrm{train}}$ and unseen test environments $\mathcal{E}_{\mathrm{test}}$, i.e. $\mathcal{E} = \mathcal{E}_{\mathrm{train}} \cup \mathcal{E}_{\mathrm{test}}$.
Given a loss function $\ell: \mathcal{X} \times \mathcal{Y} \times \Theta \to \mathbb{R}_+$, the standard empirical risk minimization (ERM) protocol minimizes the expected loss over a training environment $e \in \mathcal{E}_{\mathrm{train}}$:
$$\hat{\theta}_{\mathrm{ERM}} = \arg\min_{\theta}\; \mathbb{E}_{(X^e, Y^e) \sim \hat{P}^e}\,\ell(X^e, Y^e; \theta), \quad (1)$$
where $\hat{P}^e$ is the empirical distribution over the training data. Our goal is to learn a model that performs well on out-of-distribution (OOD) samples from $e \in \mathcal{E}_{\mathrm{test}}$.
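As a point of reference, the following sketch illustrates how the pooled ERM objective in Eq. (1) is typically minimized in practice. It is a minimal PyTorch-style example under our own assumptions; the `model`, `loss_fn`, `optimizer`, and `batches_per_env` objects are hypothetical placeholders, not part of the paper's setup.

```python
import torch

def erm_step(model, loss_fn, optimizer, batches_per_env):
    """One ERM update: pool mini-batches from all training environments
    and minimize the average empirical loss, as in Eq. (1)."""
    optimizer.zero_grad()
    losses = []
    for x_e, y_e in batches_per_env:           # one mini-batch per e in E_train
        losses.append(loss_fn(model(x_e), y_e))
    loss = torch.stack(losses).mean()          # empirical risk over the pooled data
    loss.backward()
    optimizer.step()
    return loss.item()
```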
3.2. Motivating example
We conjecture that neural networks trained by ERM rely indiscriminately on all predictive features, including spuriously correlated ones [34].
To verify this conjecture, we present a simple binary classification example with $(X^e, Y^e) \sim P^e$, where $Y^e \in \mathcal{Y} = \{-1, 1\}$ is the target label and a sample $X^e \in \mathcal{X} = \{-1, 1\}^{D+1} \subset \mathbb{R}^{D+1}$ consists of an invariant feature $Z^e_{\mathrm{inv}} \in \{-1, 1\}$ and spurious features $Z^e_{\mathrm{sp}} \in \{-1, 1\}^D$, i.e. $X^e = (Z^e_{\mathrm{inv}}, Z^e_{\mathrm{sp}})$. Let $Z^e_{\mathrm{sp},i}$ denote the $i$-th component of $Z^e_{\mathrm{sp}}$. Note that we assume $D \gg 1$ to simulate a model that relies heavily on the spurious features $Z^e_{\mathrm{sp}}$ [26, 37].
We consider the setting where the training environments $e \in \mathcal{E}_{\mathrm{train}}$ are highly biased. In other words, we suppose that $Z^e_{\mathrm{inv}} = Y^e$, and that each spurious feature component $Z^e_{\mathrm{sp},i}$ is an independent and identically distributed (i.i.d.) Bernoulli variable: $Z^e_{\mathrm{sp},i}$ independently takes the value $Y^e$ with probability $p^e$ and $-Y^e$ with probability $1 - p^e$, where $p^e \in (0.5, 1]$ for all $e \in \mathcal{E}_{\mathrm{train}}$. Note that $p^e \to 1$ as the environment becomes severely biased. A test environment $e \in \mathcal{E}_{\mathrm{test}}$ is assumed to have $p^e = 0.5$, which implies that the spurious features are completely independent of $Y^e$. We then introduce a linear classifier $f$ parameterized by a weight vector $w = (w_{\mathrm{inv}}, w_{\mathrm{sp}}) \in \mathbb{R}^{D+1}$, where $w_{\mathrm{inv}} \in \mathbb{R}$ and $w_{\mathrm{sp}} \in \mathbb{R}^D$. In this example, we consider a class of pretrained classifiers parameterized by $\tilde{w}(t) = (\tilde{w}_{\mathrm{inv}}(t), \tilde{w}_{\mathrm{sp},1}(t), \dots, \tilde{w}_{\mathrm{sp},D}(t))$, where $t < T$ is a finite pretraining time for some sufficiently large $T$. The time $t$ will often be omitted from the notation for simplicity.
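The data-generating process above is straightforward to simulate. Below is a minimal NumPy sketch (the function name and sampling code are our own illustration, not from the paper) that draws $(X^e, Y^e)$ with $Z^e_{\mathrm{inv}} = Y^e$ and i.i.d. Bernoulli spurious features agreeing with $Y^e$ with probability $p^e$.

```python
import numpy as np

def sample_environment(n, D, p_e, rng=np.random.default_rng(0)):
    """Draw n samples from the synthetic environment with bias p_e.

    Y in {-1, +1}, Z_inv = Y, and each Z_sp,i equals Y with probability p_e
    and -Y with probability 1 - p_e, independently across i.
    """
    y = rng.choice([-1, 1], size=n)                     # labels
    z_inv = y.copy()                                    # invariant feature equals the label
    agree = rng.random((n, D)) < p_e                    # which spurious features agree with Y
    z_sp = np.where(agree, y[:, None], -y[:, None])     # spurious features
    x = np.concatenate([z_inv[:, None], z_sp], axis=1)  # X = (Z_inv, Z_sp)
    return x.astype(np.float64), y

# Example: a highly biased training environment vs. an unbiased test environment.
x_tr, y_tr = sample_environment(n=1000, D=50, p_e=0.95)
x_te, y_te = sample_environment(n=1000, D=50, p_e=0.5)
```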
Our goal is to obtain the optimal sparse classifier from a highly biased training dataset. To achieve this, we introduce a binary weight pruning mask $m = (m_{\mathrm{inv}}, m_{\mathrm{sp}}) \in \{0, 1\}^{D+1}$ for the pretrained weights, which is a significant departure from the theoretical setting in [37]. Specifically, let $m_{\mathrm{inv}} \sim \mathrm{Bern}(\pi_{\mathrm{inv}})$, where $\pi_{\mathrm{inv}}$ and $1 - \pi_{\mathrm{inv}}$ represent the probabilities of preserving (i.e. $m_{\mathrm{inv}} = 1$) and pruning out (i.e. $m_{\mathrm{inv}} = 0$) the weight, respectively. Similarly, let $m_{\mathrm{sp},i} \sim \mathrm{Bern}(\pi_{\mathrm{sp},i})$ for all $i$. Our optimization goal is then to estimate the pruning probability parameters $\pi = (\pi_1, \dots, \pi_{D+1}) = (\pi_{\mathrm{inv}}, \pi_{\mathrm{sp},1}, \dots, \pi_{\mathrm{sp},D})$, where $m \sim P(\pi)$ is a mask sampled with probability parameters $\pi$. Accordingly, our main loss function for the pruning parameters in environment $e$ is defined as:
$$\ell^e(\pi) = \tfrac{1}{2}\,\mathbb{E}_{X^e, Y^e, m}\big[1 - Y^e \hat{Y}^e\big] = \tfrac{1}{2}\,\mathbb{E}_{X^e, Y^e, m}\Big[1 - Y^e \cdot \mathrm{sgn}\big(\tilde{w}^\top (X^e \odot m)\big)\Big], \quad (2)$$
where $\hat{Y}^e$ is the prediction of the binary classifier, $\tilde{w}$ is the pretrained weight vector, $\mathrm{sgn}(\cdot)$ denotes the sign function, and $\odot$ denotes the element-wise product.
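In practice, Eq. (2) can be approximated by Monte Carlo sampling of masks. The sketch below is our own illustration (assuming a pretrained weight vector `w_tilde` and data `x`, `y` from one environment, e.g. generated as above), not code from the paper.

```python
import numpy as np

def pruning_loss(pi, w_tilde, x, y, n_mask_samples=100, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the 0-1 loss in Eq. (2).

    pi:       (D+1,) keep probabilities (pi_inv, pi_sp_1, ..., pi_sp_D)
    w_tilde:  (D+1,) pretrained weights
    x, y:     samples from environment e; x is (n, D+1), y is (n,) in {-1, +1}
    """
    errs = []
    for _ in range(n_mask_samples):
        m = (rng.random(pi.shape) < pi).astype(float)   # m ~ Bern(pi), sampled per weight
        y_hat = np.sign((x * m) @ w_tilde)              # sgn(w^T (X ⊙ m))
        errs.append(0.5 * np.mean(1.0 - y * y_hat))     # (1/2) E[1 - Y Ŷ]
    return float(np.mean(errs))
```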
We first derive an upper bound on the training loss $\ell^e(\pi)$ to illustrate the difficulty of learning optimal pruning parameters in a biased data setting.
Theorem 1. (Training and test bounds) Assume that $p^e > 1/2$ in a biased training environment $e \in \mathcal{E}_{\mathrm{train}}$, and define $\tilde{w}(t)$ as the weights pretrained for a finite time $t < T$. Then the error in the training environment is upper-bounded with respect to the pruning parameters $\pi$ as:
$$\ell^e(\pi) \le 2 \exp\!\left(-\,\frac{2\big(\pi_{\mathrm{inv}} + (2p^e - 1)\sum_{i=1}^{D} \alpha_i(t)\,\pi_{\mathrm{sp},i}\big)^2}{4\sum_{i=1}^{D} \alpha_i(t)^2 + 1}\right), \quad (3)$$
where the weight ratio $\alpha_i(t) = \tilde{w}_{\mathrm{sp},i}(t)/\tilde{w}_{\mathrm{inv}}(t)$ is bounded below by some positive constant. Given a test environment $e \in \mathcal{E}_{\mathrm{test}}$ with $p^e = \tfrac{1}{2}$, the error in the test environment is upper-bounded with respect to $\pi$ as:
$$\ell^e(\pi) \le 2 \exp\!\left(-\,\frac{2\pi_{\mathrm{inv}}^2}{4\sum_{i=1}^{D} \alpha_i(t)^2 + 1}\right), \quad (4)$$
which implies that there is an unavoidable gap between the training bound and the test bound.
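To make the gap concrete, the following sketch evaluates the right-hand sides of Eqs. (3) and (4) for a dense classifier that keeps every weight ($\pi_{\mathrm{inv}} = \pi_{\mathrm{sp},i} = 1$); the numbers for $D$, $\alpha_i$, and $p^e$ are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def train_bound(pi_inv, pi_sp, alpha, p_e):
    """Right-hand side of Eq. (3)."""
    num = 2.0 * (pi_inv + (2 * p_e - 1) * np.sum(alpha * pi_sp)) ** 2
    den = 4.0 * np.sum(alpha ** 2) + 1.0
    return 2.0 * np.exp(-num / den)

def test_bound(pi_inv, alpha):
    """Right-hand side of Eq. (4), i.e. Eq. (3) with p_e = 1/2."""
    return 2.0 * np.exp(-2.0 * pi_inv ** 2 / (4.0 * np.sum(alpha ** 2) + 1.0))

D = 50
alpha = np.full(D, 0.5)   # assumed weight ratios alpha_i(t)
pi_sp = np.ones(D)        # keep every spurious weight

print(train_bound(1.0, pi_sp, alpha, p_e=0.95))  # near 0: the biased training bound looks excellent
print(test_bound(1.0, alpha))                    # near 2: the same pi gives a vacuous test bound
```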
The detailed proof of Theorem 1 is provided in the supplementary material. This mismatch of the bounds is attributed to the contribution of $\pi_{\mathrm{sp},i}$ to the training bound (3). Intuitively, the networks prefer to preserve both $\tilde{w}_{\mathrm{inv}}$ and $\tilde{w}_{\mathrm{sp},i}$ in the presence of strong spurious correlations due