
Related Works on robustness to spurious correlations. Recent works aim to develop methods that are robust to spurious correlations, including learning invariant representations (Arjovsky et al., 2019; Guo et al., 2021; Khezeli et al., 2021; Koyama and Yamaguchi, 2020; Krueger et al., 2021; Yao et al., 2022), weighting/sampling (Shimodaira, 2000; Japkowicz and Stephen, 2002; Buda et al., 2018; Cui et al., 2019; Sagawa et al., 2020), and distributionally robust optimization (DRO) (Ben-Tal et al., 2013; Namkoong and Duchi, 2017; Oren et al., 2019). Rather than learning a "one-shot model", we take a different strategy, proposed in Kirichenko et al. (2022), that conducts regression on the test environment.
Related Works on representation learning. Learning a good representation is essential for the success of deep learning models (Bengio et al., 2013). Representation learning has been studied in the settings of autoencoders (He et al., 2021), transfer learning (Du et al., 2020; Tripuraneni et al., 2020, 2021; Deng et al., 2021; Yao et al., 2021; Yang et al., 2022), topic modeling (Arora et al., 2016; Ke and Wang, 2022; Wu et al., 2022), algorithmic fairness (Zemel et al., 2013; Madras et al., 2018; Burhanpurkar et al., 2021), and self-supervised learning (Lee et al., 2020; Ji et al., 2021; Tian et al., 2021; Nakada et al., 2023).
2 Preliminary
Throughout the paper, we consider the classification task $\mathcal{X} \to \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} = [K]$. Here we use $[N]$ to denote the set $\{1, \cdots, N\}$. We denote the set of all possible distributions over a set $E$ as $\Delta(E)$. We assume that the distribution of $(x, y)$ is $\mathcal{D}_{tr}$ in the training environment $\mathcal{E}_{tr}$ and $\mathcal{D}_{te}$ in the test environment $\mathcal{E}_{te}$.
Spurious correlation. Learning under spurious correlation is a special kind of Out-Of-Distribution (OOD) learning where $\mathcal{D}_{tr} \neq \mathcal{D}_{te}$. We use the term feature for a mapping $\phi(\cdot): \mathcal{X} \mapsto \mathbb{R}^m$ that captures some property of $\mathcal{X}$. We say $\phi$ is core (robust) if $y \mid \phi(x)$ has the same distribution across $\mathcal{E}_{tr}$ and $\mathcal{E}_{te}$. Otherwise, it is spurious.
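This distinction can be checked empirically by comparing how $y$ relates to a feature in the two environments. The sketch below (our own Python illustration, using a simple linear fit as a proxy for the conditional distribution $y \mid \phi(x)$; the synthetic data and function names are hypothetical) shows a core feature whose relationship to $y$ is stable across environments and a spurious feature whose relationship changes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_env(n, spurious_strength):
    """Synthetic environment: y depends on the core feature; the spurious
    feature depends on y with an environment-specific strength."""
    x_core = rng.normal(size=(n, 1))
    y = x_core[:, 0] + 0.1 * rng.normal(size=n)                              # core noise
    x_spu = spurious_strength * y[:, None] + 0.5 * rng.normal(size=(n, 1))   # spurious noise
    return x_core, x_spu, y

def conditional_slope(feature, y):
    """Summary of y | phi(x): slope of a linear regression of y on the feature."""
    return LinearRegression().fit(feature, y).coef_[0]

# Training and test environments differ only in the spurious correlation.
xc_tr, xs_tr, y_tr = make_env(10_000, spurious_strength=1.0)
xc_te, xs_te, y_te = make_env(10_000, spurious_strength=0.0)

print("core feature slope:     train %.2f, test %.2f"
      % (conditional_slope(xc_tr, y_tr), conditional_slope(xc_te, y_te)))   # stable (~1.0)
print("spurious feature slope: train %.2f, test %.2f"
      % (conditional_slope(xs_tr, y_tr), conditional_slope(xs_te, y_te)))   # changes across envs
```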
Non-realizable noise. Learning under noise has been widely explored in the machine learning literature, but is rarely considered when spurious correlations exist. Following Bühlmann (2020) and Arjovsky et al. (2019), we consider non-realizable noise as the randomness along a generating process (it can be either on features or on labels). Specifically, in the causal path $\phi_{\mathrm{core}}(x) \to y$, we treat the label noise on $y$ as the non-realizable noise and call it "core noise", as it is relevant to the core features; in the causal path $y \to \phi_{\mathrm{spu}}(x)$, we treat the feature noise on $\phi_{\mathrm{spu}}(x)$ as the non-realizable noise and call it "spurious noise", as it is relevant to the spurious features. As we will show, the non-realizable noise influences the model's learning preference.
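For concreteness, consider the scalar case with Gaussian noise (taken here only for illustration), writing $\eta_{\mathrm{core}}$ and $\eta_{\mathrm{spu}}$ for the respective noise levels as in Section 3.1:
$$
y = \phi_{\mathrm{core}}(x) + \epsilon_{\mathrm{core}}, \quad \epsilon_{\mathrm{core}} \sim \mathcal{N}(0, \eta_{\mathrm{core}}^2),
\qquad
\phi_{\mathrm{spu}}(x) = y + \epsilon_{\mathrm{spu}}, \quad \epsilon_{\mathrm{spu}} \sim \mathcal{N}(0, \eta_{\mathrm{spu}}^2).
$$
The core noise corrupts the label itself, while the spurious noise only corrupts the spurious feature; intuitively, when $\eta_{\mathrm{core}} < \eta_{\mathrm{spu}}$, the core feature carries the cleaner signal about $y$, anticipating the regime analyzed in Section 3.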
Goal. Our goal is to minimize the prediction error in $\mathcal{E}_{te}$, where the spurious correlations differ from those in $\mathcal{E}_{tr}$. In this paper, we consider the new strategy proposed in Kirichenko et al. (2022) that trains a feature learner on $\mathcal{E}_{tr}$ and linearly probes the learned features on $\mathcal{E}_{te}$, which we call test-time probing (or last layer retraining). No knowledge about $\mathcal{E}_{te}$ is available during the first training stage. When deploying the model to $\mathcal{E}_{te}$, we are given a small test dataset $\{x_i, y_i\}_{i=1}^{n}$ sampled from $\mathcal{D}_{te}$, and we are allowed to conduct logistic/linear regression on $\phi(x)$ and $y$ to obtain our final prediction function. The goal is that, after probing on the learned features, the model performs well in $\mathcal{E}_{te}$ under various possible feature noise settings.
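A minimal sketch of this test-time probing (last layer retraining) step is given below. It assumes a frozen feature extractor `phi` (e.g., the penultimate layer of a network trained on $\mathcal{E}_{tr}$); the function names and toy data are illustrative rather than the exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_time_probe(phi, x_test, y_test, reg_strength=1.0):
    """Fit a linear head on frozen features phi(x) using the small labeled
    sample {(x_i, y_i)}_{i=1}^n drawn from the test environment."""
    feats = phi(x_test)                                   # shape (n, m): learned features
    head = LogisticRegression(C=reg_strength, max_iter=1000)
    head.fit(feats, y_test)
    return lambda x_new: head.predict(phi(x_new))         # final prediction function

# Toy usage with an identity feature map and synthetic test data.
phi = lambda x: x
x_te = np.random.randn(200, 5)
y_te = (x_te[:, 0] > 0).astype(int)
predict = test_time_probe(phi, x_te, y_te)
print("probing accuracy on the test sample:", (predict(x_te) == y_te).mean())
```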
3 Theory: Understand Learned Features
under Spurious Correlation
In this section, we theoretically show why core features can still be learned by ERM in spite of spurious correlations, and why non-realizable noises are crucial. Roughly speaking, only when the core noise is smaller than the spurious noise can the features learned by ERM guarantee good downstream probing performance. All proofs are in Appendix D.
3.1 Problem Setup
Data generation mechanism. To capture the spurious correlations and non-realizable noises, we assume the data $(x, y)$ is generated from the following mechanism:
$$
x_1 \sim P \in \Delta(\mathbb{R}^{1 \times d_1}), \qquad y = x_1 \beta + \epsilon_{\mathrm{core}},
$$
$$
x_2 = \begin{cases} y \gamma^\top + \epsilon_{\mathrm{spu}} & \text{in } \mathcal{E}_{tr} \\ \epsilon_{\mathrm{spu}} & \text{in } \mathcal{E}_{te} \end{cases} \in \mathbb{R}^{1 \times d_2}, \qquad x = (x_1, x_2) \in \mathbb{R}^{1 \times d}.
$$
Here $x_1$ is the core feature with an invertible covariance matrix $\Sigma \triangleq \mathbb{E}[x_1^\top x_1]$, and $x_2$ is the spurious feature that is differently distributed in $\mathcal{E}_{tr}$ and $\mathcal{E}_{te}$. $\epsilon_{\mathrm{core}} \in \mathbb{R}$ and $\epsilon_{\mathrm{spu}} \in \mathbb{R}^{1 \times d_2}$ are independent core and spurious noises with mean zero and variance (covariance matrix) $\eta_{\mathrm{core}}^2$ and $\eta_{\mathrm{spu}}^2 I$, respectively. $\beta \in \mathbb{R}^{d_1 \times 1}$ and $\gamma \in \mathbb{R}^{d_2 \times 1}$ are normalized coefficients with unit $\ell_2$ norm. We assume that there exists some $k \in \mathbb{N}$ such that the top-$k$ eigenvalues of $\Sigma$ are larger than the noise variances $\eta_{\mathrm{spu}}^2, \eta_{\mathrm{core}}^2$, and that $\beta$ lies in the span of the top-$k$ eigenvectors of $\Sigma$. This ensures that the signal along $\beta$ is salient enough to be learned. For technical simplicity, we also assume that all eigenvalues of $\Sigma$ are distinct.
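The sketch below simulates one instance of this mechanism; the Gaussian choice of $P$, the dimensions, and the noise levels are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 5, 3, 1000
eta_core, eta_spu = 0.1, 0.5                               # core and spurious noise levels

beta = rng.normal(size=(d1, 1)); beta /= np.linalg.norm(beta)     # unit l2-norm coefficients
gamma = rng.normal(size=(d2, 1)); gamma /= np.linalg.norm(gamma)

def sample(n, environment):
    """Draw (x, y) from the mechanism; P is taken to be standard Gaussian here."""
    x1 = rng.normal(size=(n, d1))                          # core features
    y = x1 @ beta + eta_core * rng.normal(size=(n, 1))     # y = x1 beta + eps_core
    eps_spu = eta_spu * rng.normal(size=(n, d2))
    if environment == "train":
        x2 = y @ gamma.T + eps_spu                         # spurious features track y in E_tr
    else:
        x2 = eps_spu                                       # pure noise in E_te
    return np.hstack([x1, x2]), y

x_tr, y_tr = sample(n, "train")                            # training environment
x_te, y_te = sample(n, "test")                             # test environment
```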
Our data generation mechanism is motivated by Arjovsky et al. (2019) (Figure 3), and we extend their data model. We allow core features to be drawn from any distribution $P$ so long as $\Sigma$ is invertible, while Arjovsky et al. (2019) only consider a specific form of $P$. In addition, in our mechanism, labels depend on core features and spurious features depend on labels. However, our theorems and algorithms can be easily applied to another setting where both core and spurious features depend on labels. This is because the difference between the two settings can be summarized as the