Freeze then Train: Towards Provable Representation Learning under
Spurious Correlations and Feature Noise
Haotian Ye (haotianye@pku.edu.cn), Peking University
James Zou (jamesz@stanford.edu), Stanford University
Linjun Zhang (linjun.zhang@rutgers.edu), Rutgers University
Abstract
The existence of spurious correlations, such as image backgrounds in the training environment, can make empirical risk minimization (ERM) perform badly in the test environment. To address this problem, Kirichenko et al. (2022) empirically found that the core features that are related to the outcome can still be learned well even in the presence of spurious correlations. This opens a promising strategy: first train a feature learner rather than a classifier, and then perform linear probing (last layer retraining) in the test environment. However, a theoretical understanding of when and why this approach works is lacking. In this paper, we find that core features are only learned well when their associated non-realizable noise is smaller than that of spurious features, which is not necessarily true in practice. We provide both theory and experiments to support this finding and to illustrate the importance of non-realizable noise. Moreover, we propose an algorithm called Freeze then Train (FTT), which first freezes certain salient features and then trains the rest of the features using ERM. We theoretically show that FTT preserves features that are more beneficial to test-time probing. Across two commonly used spurious correlation datasets, FTT outperforms ERM, IRM, JTT and CVaR-DRO, with substantial improvement in accuracy (by 4.5%) when the feature noise is large. FTT also performs better on general distribution shift benchmarks.
1 Introduction
Real-world datasets are riddled with features that are “right
for wrong reasons” (Zhou et al., 2021).

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206. Copyright 2023 by the author(s).

Figure 1: The improvement of last layer retraining accuracy (%) before vs. after ERM training on the Dominoes dataset (Shah et al., 2020). The model is initialized with ImageNet pretrained parameters. The x-axis and y-axis represent noise levels of the spurious and core features, respectively. ERM training helps/harms the performance when the non-realizable noise of the core features is smaller/greater than that of the spurious features. Experiment settings are in Section 5.

For instance, in Waterbirds (Sagawa et al., 2019), the bird type can be highly
correlated with the spurious feature of image backgrounds, and in CelebA (Liu et al., 2015) the hair color can be relevant to the gender. These features are referred to as spurious features (Hovy and Søgaard, 2015; Blodgett et al., 2016; Hashimoto et al., 2018): they are predictive for most of the training examples, but are not truly correlated with the intrinsic labeling function. Machine learning models that minimize the average loss on a training set (ERM) rely on these spurious features and will suffer high errors in environments where the spurious correlation changes. Most previous works seek to avoid learning spurious features by minimizing subpopulation group loss (Duchi et al., 2019), by up-weighting samples that are misclassified (Liu et al., 2021), by selectively mixing samples (Yao et al., 2022), and so on. The general goal is to recover the core features under spurious correlations.
Recently, Kirichenko et al. (2022) empirically found that
ERM can still learn the core features well even with the
presence of spurious correlations. They show that by sim-
ply retraining the last layer using a small set of data with
little spurious correlation, one can reweight the core features and achieve state-of-the-art performance on popular benchmark datasets.

arXiv:2210.11075v2 [cs.LG] 11 Apr 2023

Figure 2: An illustration of our method, Freeze then Train (FTT). We start with a pretrained feature extractor (e.g. a CNN) and find dataset-specific salient features using any unsupervised method, such as contrastive learning or PCA (the orange part). We then freeze these features and learn the rest of the features using any supervised method, such as ERM or a robust training algorithm (the blue part). In the test environment, the last layer is retrained. The pseudo-code can be found in Appendix A.

This method is called Deep Feature
Reweighting (DFR), and it points to a promising new strategy to overcome spurious correlation: learn a feature extractor rather than a classifier, and then perform linear probing on the test environment data. This strategy is also used in many real-world applications in NLP, where the pipeline is to learn a large pretrained model and conduct linear probing on downstream tasks (Brown et al., 2020). It simply requires a CPU-based logistic regression on a small number of samples from the deployed environment.
However, several problems regarding this strategy remain open. First, it is unclear when and why the core features can and cannot be learned during training and recovered in test-time probing. Moreover, in settings where the DFR strategy does not work well, is there an alternative strategy to learn the core features and make test-time probing work again?
In this paper, we first present a theoretical framework to quantify this phenomenon in a two-layer linear network and give both upper and lower bounds on the probing error in Theorems 1 and 2. Our theories analyze the effect of training and retraining, which is highly nontrivial due to the non-convex nature of the problem. Our theories point out an essential factor of this strategy: the feature-dependent non-realizable noise (abbreviated as non-realizable noise). Noise is common and inevitable in the real world (Frénay and Verleysen, 2013). For example, labels can have intrinsic variance and are imperfect, and human experts may also assign incorrect labels; in addition, noise is often heterogeneous and feature-dependent (Zhang et al., 2021), and spurious features can be better correlated with labels in the training environment (Yan et al., 2014; Veit et al., 2017). Our theories show that in order to learn core features, ERM requires the non-realizable noise of core features to be much smaller than that of spurious features. As illustrated in Figure 1, when this condition is violated, the features learned by ERM perform even worse than the pretrained features. The intuition is that models typically learn a mixture of different features, where the proportion depends on the trade-off between information and noise: features with larger noise are used less. During last-layer probing, when the proportion of the core feature is small, we must amplify it more, and the attached noise is amplified with it. Our theories and experiments suggest that the scenario in Kirichenko et al. (2022) is incomplete, and the strategy can sometimes be ineffective.
Inspired by this understanding, we propose an algorithm, called Freeze then Train (FTT), which first learns salient features in an unsupervised way and freezes them, and then trains the rest of the features via supervised learning. We illustrate it in Figure 2. Based on our finding that linear probing fails when the non-realizable noise of spurious features is smaller (since labels incentivize ERM to focus more on features with smaller noise), we propose to learn features both with and without the guidance of labels. This exploits the information provided by labels, while still preserving useful features that might not be learned in supervised training. We show in Theorem 3 that FTT attains near-optimal performance in our theoretical framework, providing an initial proof of its effectiveness.
We conduct extensive experiments to show that: (1) in real-world datasets the phenomenon matches our theories well; (2) on three spurious correlation datasets, FTT outperforms other algorithms by 1.4%, 0.3%, and 4.1% on average, and by up to 4.5%, 0.4%, and 9%; (3) on more general OOD tasks such as three distribution shift datasets, FTT outperforms other OOD algorithms by 1.1%, 0.8%, and 2.1% on average; (4) we also conduct fine-grained ablation experiments to study FTT under different unsupervised feature fractions and different numbers of learned features.
Together, we give a theoretical understanding of the probing strategy and propose FTT, which is more suitable for test-time probing and outperforms existing algorithms on various benchmarks. Even under spurious correlation and non-realizable noise, by combining ERM with unsupervised methods, we can still perform well in the test environment.
Related Works on robustness to spurious correlations.
Recent works aim to develop methods that are robust to
spurious correlations, including learning invariant repre-
sentations (Arjovsky et al., 2019; Guo et al., 2021; Khezeli
et al., 2021; Koyama and Yamaguchi, 2020; Krueger et al.,
2021; Yao et al., 2022), weighting/sampling (Shimodaira,
2000; Japkowicz and Stephen, 2002; Buda et al., 2018; Cui
et al., 2019; Sagawa et al., 2020), and distributionally ro-
bust optimization (DRO) (Ben-Tal et al., 2013; Namkoong
and Duchi, 2017; Oren et al., 2019). Rather than learning a “one-shot model”, we take a different strategy, proposed in Kirichenko et al. (2022), that conducts regression in the test environment.
Related Works on representation learning. Learning a good representation is essential for the success of deep learning models (Bengio et al., 2013). Representation learning has been studied in the settings of autoencoders (He et al., 2021), transfer learning (Du et al., 2020; Tripuraneni et al., 2020, 2021; Deng et al., 2021; Yao et al., 2021; Yang et al., 2022), topic modeling (Arora et al., 2016; Ke and Wang, 2022; Wu et al., 2022), algorithmic fairness (Zemel et al., 2013; Madras et al., 2018; Burhanpurkar et al., 2021) and self-supervised learning (Lee et al., 2020; Ji et al., 2021; Tian et al., 2021; Nakada et al., 2023).
2 Preliminary
Throughout the paper, we consider the classification task $\mathcal{X} \to \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = [K]$. Here we use $[N]$ to denote the set $\{1, \cdots, N\}$. We denote the set of all possible distributions over a set $E$ as $\Delta(E)$. Assume that the distribution of $(x, y)$ is $\mathcal{E}_{tr}$ in the training environment and $\mathcal{E}_{te}$ in the test environment.
Spurious correlation. Learning under spurious correlation is a special kind of Out-Of-Distribution (OOD) learning where $\mathcal{D}_{tr} \neq \mathcal{D}_{te}$. We use the term feature for a mapping $\phi(\cdot): \mathcal{X} \mapsto \mathbb{R}^m$ that captures some property of $\mathcal{X}$. We say $\phi$ is core (robust) if $y \mid \phi(x)$ has the same distribution across $\mathcal{E}_{tr}$ and $\mathcal{E}_{te}$. Otherwise, it is spurious.
Non-realizable noise. Learning under noise has been widely explored in the machine learning literature, but is barely considered when spurious correlations exist. Following Bühlmann (2020) and Arjovsky et al. (2019), we consider non-realizable noise as the randomness along a generating process (it can act either on features or on labels). Specifically, in the causal path $\phi_{core}(x) \to y$, we treat the label noise on $y$ as the non-realizable noise and call it “core noise”, as it is associated with the core features; in the causal path $y \to \phi_{spu}(x)$, we treat the feature noise on $\phi_{spu}(x)$ as the non-realizable noise and call it “spurious noise”, as it is associated with the spurious features. As we will show, the non-realizable noise influences the model's learning preference.
Goal. Our goal is to minimize the prediction error in $\mathcal{E}_{te}$, where the spurious correlations differ from those in $\mathcal{E}_{tr}$. In this paper, we consider the strategy proposed in Kirichenko et al. (2022) that trains a feature learner on $\mathcal{E}_{tr}$ and linearly probes the learned features on $\mathcal{E}_{te}$, which we call test-time probing (or last layer retraining). No knowledge about $\mathcal{E}_{te}$ is available during the first training stage. When deploying the model to $\mathcal{E}_{te}$, we are given a small test dataset $\{x_i, y_i\}_{i=1}^n$ sampled from $\mathcal{D}_{te}$, and we are allowed to conduct logistic/linear regression on $\phi(x)$ and $y$ to obtain our final prediction function. The goal is that after probing on the learned features, the model performs well in $\mathcal{E}_{te}$ under various possible feature noise settings.
3 Theory: Understand Learned Features
under Spurious Correlation
In this section, we theoretically show when core features can still be learned by ERM in spite of spurious correlations, and why non-realizable noise is crucial. Roughly speaking, only when the core noise is smaller than the spurious noise do the features learned by ERM guarantee good downstream probing performance. All proofs are in Appendix D.
3.1 Problem Setup
Data generation mechanism. To capture the spurious correlations and non-realizable noises, we assume the data $(x, y)$ is generated from the following mechanism:
$$x_1 \sim P \in \Delta(\mathbb{R}^{1\times d_1}), \qquad y = x_1\beta + \epsilon_{core},$$
$$x_2 = \begin{cases} y\gamma^\top + \epsilon_{spu} & \text{in } \mathcal{E}_{tr} \\ \epsilon_{spu} & \text{in } \mathcal{E}_{te} \end{cases} \;\in \mathbb{R}^{1\times d_2}, \qquad x = (x_1, x_2) \in \mathbb{R}^{1\times d}.$$
Here $x_1$ is the core feature with an invertible covariance matrix $\Sigma \triangleq \mathbb{E}[x_1^\top x_1]$, and $x_2$ is the spurious feature that is differently distributed in $\mathcal{E}_{tr}$ and $\mathcal{E}_{te}$. $\epsilon_{core} \in \mathbb{R}$ and $\epsilon_{spu} \in \mathbb{R}^{1\times d_2}$ are independent core and spurious noises with mean zero and variance (covariance matrix) $\eta_{core}^2$ and $\eta_{spu}^2 I$, respectively. $\beta \in \mathbb{R}^{d_1\times 1}$ and $\gamma \in \mathbb{R}^{d_2\times 1}$ are normalized coefficients with unit $\ell_2$ norm. We assume that there exists some $k \in \mathbb{N}$ such that the top-$k$ eigenvalues of $\Sigma$ are larger than the noise variances $\eta_{spu}^2, \eta_{core}^2$, and that $\beta$ lies in the span of the top-$k$ eigenvectors of $\Sigma$. This ensures that the signal along $\beta$ is salient enough to be learned. For technical simplicity, we also assume that all eigenvalues of $\Sigma$ are distinct.
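This mechanism is easy to simulate. The sketch below uses a Gaussian $P$, unit-norm $\beta$ and $\gamma$, and arbitrary dimensions and noise levels (all hypothetical choices, not values from the paper), with the spurious branch switching off in the test environment:

```python
import numpy as np

def generate(n, d1=3, d2=2, eta_core=0.1, eta_spu=0.5, train=True, seed=0):
    """Sample (x, y) from the mechanism: y = x1 @ beta + eps_core;
    x2 = y * gamma^T + eps_spu in E_tr, but x2 = eps_spu in E_te."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d1) / np.sqrt(d1)        # unit-norm coefficients
    gamma = np.ones(d2) / np.sqrt(d2)
    x1 = rng.normal(size=(n, d1))           # P = standard Gaussian here
    y = x1 @ beta + eta_core * rng.normal(size=n)
    eps_spu = eta_spu * rng.normal(size=(n, d2))
    x2 = y[:, None] * gamma[None, :] + eps_spu if train else eps_spu
    return np.hstack([x1, x2]), y

X_tr, y_tr = generate(10000, train=True)
X_te, y_te = generate(10000, train=False, seed=1)
```

In the training sample the spurious coordinates are strongly correlated with $y$; in the test sample that correlation vanishes, which is exactly the shift the theory studies.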
Our data generation mechanism is motivated by Arjovsky
et al. (2019) (Figure 3), where we extend their data model.
We allow core features to be drawn from any distribution P
so long as Σis invertible, while Arjovsky et al. (2019) only
consider a specific form of P. In addition, in our mecha-
nism, labels depend on core features and spurious features
depend on labels. However, our theorems and algorithms
can be easily applied to another setting where both core and
spurious features depend on labels. This is because the difference between the two settings can be summarized as a difference in $\Sigma$, while the techniques we use do not rely on the concrete form of $\Sigma$.
Models. To capture the property of features and retraining, we consider a regression task using a two-layer linear network $f(x) = xWb$, where $W \in \mathbb{R}^{d\times m}$ is the feature learner and $b \in \mathbb{R}^{m\times 1}$ is the last layer that will be retrained in $\mathcal{E}_{te}$. We assume that the model learns a low-dimensional representation ($m \ll d$), but is able to capture the ground-truth signal ($m \geq k$). Notice that the optimization over $(W, b)$ is non-convex, and there is no closed-form solution. This two-layer network model has been commonly used in the machine learning theory literature (Arora et al., 2018; Gidel et al., 2019; Kumar et al., 2022). The major technical difficulty in our setting is analyzing the learned features and controlling probing performance under this non-convexity with spurious correlations. We assume the parameters are initialized according to the Xavier uniform distribution¹.
Optimization. During the training stage, we minimize the $\ell_2$-loss $\ell_{tr}(W, b) = \frac{1}{2n}\|f(X) - Y\|^2$, where $X = (x_1^\top, \cdots, x_n^\top)^\top$ and $Y = (y_1, \cdots, y_n)^\top$. For clarity of analysis, we consider two extremes that simplify the optimization while still maintaining our key intuition. First, we take an infinitely small learning rate such that the optimization process becomes a gradient flow (Gunasekar et al., 2017; Du et al., 2018). Denote the parameters $W, b$ at training time $t$ as $W(t), b(t)$, and let $v(t) = W(t)b(t)$. Second, we consider the infinite data setting ($n \to \infty$). This is a widely used simplification to avoid the influence of sample randomness (Kim et al., 2019; Ghorbani et al., 2021). The parameters are updated as
$$\partial_t W(t) = -\nabla_W \ell_{tr}(W(t),b(t)) = -\Big(\frac{X^\top X}{n} W(t)b(t) - \frac{X^\top Y}{n}\Big) b(t)^\top = \big(\mathbb{E}[x^\top y] - \mathbb{E}[x^\top x]\, v(t)\big)\, b(t)^\top,$$
$$\partial_t b(t) = -\nabla_b \ell_{tr}(W(t),b(t)) = -W(t)^\top \Big(\frac{X^\top X}{n} W(t)b(t) - \frac{X^\top Y}{n}\Big) = W(t)^\top \big(\mathbb{E}[x^\top y] - \mathbb{E}[x^\top x]\, v(t)\big).$$
In the test stage, we retrain the last layer $b$ to minimize the test loss, i.e.
$$\ell_{te}(W) = \min_b\; \mathbb{E}_{\mathcal{E}_{te}}\, \frac{1}{2}\|xWb - y\|^2.$$
In the test stage, the spurious correlation is broken, i.e. $x_2 = \epsilon_{spu}$. The minimum error in the test stage is $err_{te}^* = \eta_{core}^2/2$, attained when $v = (\beta^\top, 0)^\top$.
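The whole pipeline can be sketched numerically: small-step gradient descent on the empirical moments stands in for the gradient flow, followed by last-layer retraining in the test environment. Dimensions and noise levels below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m = 3, 2, 4
beta = np.ones(d1) / np.sqrt(d1)
gamma = np.ones(d2) / np.sqrt(d2)
eta_core, eta_spu = 0.1, 0.5            # core noise smaller here

def sample(n, train):
    x1 = rng.normal(size=(n, d1))
    y = x1 @ beta + eta_core * rng.normal(size=n)
    eps = eta_spu * rng.normal(size=(n, d2))
    x2 = (y[:, None] * gamma + eps) if train else eps
    return np.hstack([x1, x2]), y

X, y = sample(20000, train=True)
d = d1 + d2
W = rng.uniform(-1, 1, size=(d, m)) / np.sqrt(d)   # Xavier-style init
b = rng.uniform(-1, 1, size=m) / np.sqrt(m)
Sxx, sxy = X.T @ X / len(X), X.T @ y / len(X)      # empirical moments
for _ in range(20000):                              # small-lr GD ~ gradient flow
    g = sxy - Sxx @ (W @ b)                         # shared residual term
    # dW = (E[x'y] - E[x'x] v) b',  db = W' (E[x'y] - E[x'x] v)
    W, b = W + 1e-2 * np.outer(g, b), b + 1e-2 * W.T @ g

X_te, y_te = sample(5000, train=False)              # spurious correlation broken
err_before = 0.5 * np.mean((X_te @ W @ b - y_te) ** 2)
b_new, *_ = np.linalg.lstsq(X_te @ W, y_te, rcond=None)  # retrain last layer
err_after = 0.5 * np.mean((X_te @ W @ b_new - y_te) ** 2)
```

This run uses $\eta_{core} < \eta_{spu}$, the regime where the upcoming Theorem 1 predicts probing works well; swapping the two noise levels moves the example into the Theorem 2 regime.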
3.2 Theoretical Analysis: Noises Matter
We are now ready to introduce our theoretical results on when the core features can and cannot be learned by ERM under different levels of non-realizable noise. One important intuition for why core features can still be learned well despite the (possibly more easily learned) spurious features is that the loss can be further reduced by using both core and spurious features simultaneously.

¹Our theorems can be easily applied to various initializations.
Lemma 1 For all $W \in \mathbb{R}^{d\times m}$ and $b \in \mathbb{R}^m$, we have
$$\ell_{tr}(W, b) \;\geq\; \frac{1}{2}\,\mathbb{E}\|x v_{tr}^* - y\|_2^2 \;=\; \frac{\eta_{core}^2\,\eta_{spu}^2}{2(\eta_{core}^2 + \eta_{spu}^2)} \;\triangleq\; err_{tr}^*,$$
where $v_{tr}^* = (\alpha\beta^\top, (1-\alpha)\gamma^\top)^\top$ is the optimal coefficient for training, and $\alpha = \frac{\eta_{spu}^2}{\eta_{core}^2 + \eta_{spu}^2}$.
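Lemma 1 can be checked numerically: solving the population normal equations $\mathbb{E}[x^\top x]\,v = \mathbb{E}[x^\top y]$ for the mechanism with $\Sigma = I$ recovers exactly $v_{tr}^* = (\alpha\beta^\top, (1-\alpha)\gamma^\top)^\top$. The noise levels below are hypothetical:

```python
import numpy as np

d1, d2 = 3, 2
eta_core, eta_spu = 0.3, 0.4             # hypothetical noise levels
beta = np.ones(d1) / np.sqrt(d1)
gamma = np.ones(d2) / np.sqrt(d2)

# Population moments of the mechanism with Sigma = I (Gaussian x1):
# E[x^T x] in block form, and E[x^T y].
Ey2 = 1 + eta_core ** 2                  # E[y^2] = beta' Sigma beta + eta_core^2
Sxx = np.block([
    [np.eye(d1),            np.outer(beta, gamma)],
    [np.outer(gamma, beta), Ey2 * np.outer(gamma, gamma) + eta_spu**2 * np.eye(d2)],
])
sxy = np.concatenate([beta, Ey2 * gamma])

v_star = np.linalg.solve(Sxx, sxy)       # population least squares
alpha_hat = v_star[:d1] @ beta           # weight on the core direction
alpha = eta_spu**2 / (eta_core**2 + eta_spu**2)
```

The recovered weight on $\beta$ matches $\alpha$ exactly, and plugging $v^*$ back into the loss reproduces $err_{tr}^*$.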
Lemma 1 shows that the loss is minimized by assigning an $\alpha$ fraction of the weight to the core direction $\beta$ and the rest to $\gamma$. This implies that the model will learn a mixture of both features even under large spurious correlations. More importantly, the magnitude of $\alpha$ largely influences the probing performance. During the test stage, $x_2$ becomes useless, and the trained $W(t)b(t)$ can only recover an $\alpha$ fraction of $y$, which induces a large approximation error. To this end, during retraining the last-layer coefficients should scale up in order to predict $y$ well. Meanwhile, this also scales up the weight on $x_2$, which is now merely harmful noise, resulting in a trade-off between learning accurate core features and removing spurious features. When the core noise is small, i.e., $\alpha \approx 1$, the noise on $x_2$ will not be scaled up much. The Waterbirds dataset considered in Kirichenko et al. (2022) has $\eta_{core} = 0\% < \eta_{spu} = 5\%$, falling into this regime. The following theorem tells how well ERM with last-layer probing works in this regime.
Theorem 1 (Upper Bound) Assume that $v(t)$ is bounded away from $0$ throughout the whole optimization², i.e. $\|v(t)\|_2 > c_0 > 0$. Then, for any $0 < \eta_{core} < \eta_{spu}$ and any time $t$, we have
$$\ell_{te}(W(t)) \;\leq\; \Big(1 + \frac{\eta_{core}^2}{\eta_{spu}^2}\Big)\, err_{te}^* + O(t^{-1}). \quad (1)$$
Here $err_{te}^* = \eta_{core}^2/2$ is the optimal test error, and $O(\cdot)$ hides the dependency on $\eta_{core}$, $\eta_{spu}$, $c_0$ and the initialized parameters. When $\eta_{core}/\eta_{spu} \to 0$, this theorem suggests that test-time probing achieves near-optimal error.
Theorem 1 gives a theoretical explanation of the last layer retraining phenomenon. It shows that the test error after retraining can be close to $err_{te}^*$ over time. However, this guarantee holds only when $\eta_{core} < \eta_{spu}$. The following theorem shows that when the core features have large noise, the representation learned by ERM produces degraded performance after linear probing.
²This is to guarantee that our gradient flow does not fail to converge to a minimum, in which case the theorem is meaningless.
Figure 3: A toy example illustrating when and why ERM can perform well after retraining in $\mathcal{E}_{te}$ ($d = 2$, $m = 1$). Assume the core feature $\beta$ is vertical and the spurious feature $\gamma$ is horizontal. Both features can predict $y$ in $\mathcal{E}_{tr}$, while $\gamma$ is useless in $\mathcal{E}_{te}$ since $x_2 = \epsilon_{spu}$. We initialize our single feature $W(0)$, and obtain $W(t)$ after training on $\mathcal{E}_{tr}$. We then retrain the last layer (probing) on $\mathcal{E}_{te}$, i.e. rescale $W(t)$, and obtain $v_{test}$. When $\eta_{core} < \eta_{spu}$, $W(t)$ will use $\beta$ more (the blue flow); after probing, $v_{test}$ can recover $\beta$ (small approximation error) without suffering much from the spurious noise along the direction of $\gamma$ (small spurious noise error). On the contrary, when $\eta_{core}$ is large, $W(t)$ will follow the red flow; this leads to a trade-off between the two error terms. In this case, ERM performs much worse. Notice that the flows in the figure are just for illustration. In practice, probing can either lengthen or shorten $W(t)$, depending on the concrete form of the two error terms.
Theorem 2 (Lower Bound) Assume that in the limit, $W_1(\infty) \triangleq \lim_{t\to\infty} W_1(t)$ has full column rank, which almost surely holds when $m < d$. Then for any $\eta_{core} > \eta_{spu} > 0$, we have
$$\lim_{t\to\infty} \frac{\ell_{te}(W(t))}{err_{te}^*} \;\geq\; 1 + \frac{\eta_{core}^2}{2\eta_{spu}^2} \wedge \frac{1}{2\,\eta_{spu}^2\,\|\Sigma^{-1}\|^2\,\big\|W_1^\dagger(\infty)\big\|_2^2}. \quad (2)$$
Here $A^\dagger$ is the Moore–Penrose inverse of $A$, and $a \wedge b$ takes the minimum over $a, b$. When $\eta_{spu}/\eta_{core} \to 0$, the last layer retraining error is much larger than the optimal error.
Theorem 2 implies that the error can be $\eta_{core}^2/\eta_{spu}^2$ times larger than $err_{te}^*$ when $\eta_{core} > \eta_{spu}$, showing that ERM with last layer retraining does not work in this scenario: the features learned by ERM are insufficient to recover near-optimal performance. In summary, we prove that test-time probing performance relies largely on the non-realizable noises, and it only works when the core noise is relatively smaller. We illustrate the two theorems in Figure 3.
4 Method: Improving Test-Time Probing
Our theories raise a natural question: can we improve the learned features and make the test-time probing strategy effective under various noise conditions? A feature can be better correlated with labels in $\mathcal{E}_{tr}$ than others, but the correlation may be spurious and may even disappear in $\mathcal{E}_{te}$. Without concrete knowledge about $\mathcal{E}_{te}$ and the spurious correlations, it is impossible to determine whether or not a learned feature is informative only in $\mathcal{E}_{tr}$, especially given that there are innumerably many features. This problem comes from treating the label as an absolute oracle, and it is unlikely to be addressed by switching to other supervised robust training methods that still depend on labels. We experimentally verify this in Section 5.2.
In order to perform well in test-time probing under different noise conditions, we should also learn salient features that are selected without relying on labels. This helps preserve features that are useful in the testing stage, but are ruled out because they are less informative than other features w.r.t. the labels. By learning features both with and without the help of labels, we can extract informative features and simultaneously maximize diversity. To this end, we propose the Freeze then Train (FTT) algorithm, which first freezes certain salient features unsupervisedly and then trains the rest of the features supervisedly. The algorithm is illustrated in Figure 2, and we describe the details below.
4.1 Method: Freeze then Train
Algorithm 1 Freeze Then Train
Input: dataset $S = \{x_i, y_i\}_{i=1}^n$, initialized feature extractor $M: \mathcal{X} \mapsto \mathbb{R}^m$, unsupervised fraction $p$, number of classes $K$.
1: Conduct PCA on $\{M(x_i)\}_{i=1}^n$ with dimension $pm$, obtaining the transform matrix $W_{ul} \in \mathbb{R}^{m \times pm}$.
2: Set the unsupervised model $M_{ul}(x) = M(x)W_{ul}$ and freeze its parameters (including $W_{ul}$).
3: Set $M_{sl}(x) = M(x)W_{sl}$ and initialize a linear head $h: \mathbb{R}^m \mapsto \mathbb{R}^K$.
4: Supervisedly train the model $M_{FTT}(x) = h((M_{ul}(x), M_{sl}(x)))$ on $S$ using ERM, updating $M_{sl}$, $W_{sl}$, and $h$ until convergence.
Output: $M_{FTT}$
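Algorithm 1 can be sketched end-to-end. The snippet below is a minimal linear instantiation under stated assumptions: PCA for the freeze stage, plain gradient descent on a squared loss standing in for the supervised classification stage, and a random matrix standing in for the pretrained extractor $M$; all dimensions are illustrative:

```python
import numpy as np

def ftt(feats, y, p=0.5, steps=10000, lr=5e-3, seed=0):
    """Freeze then Train, with linear maps standing in for networks.
    `feats` plays the role of the pretrained features M(x)."""
    rng = np.random.default_rng(seed)
    n, m = feats.shape
    k = int(p * m)                              # number of frozen features
    # Step 1: unsupervised freeze stage -- PCA on the pretrained features.
    C = np.cov(feats, rowvar=False)
    _, vecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
    W_ul = vecs[:, -k:]                         # top-k principal directions, frozen
    Z_ul = feats @ W_ul
    # Step 2: supervised stage -- train W_sl and the linear head h with ERM.
    W_sl = rng.normal(size=(m, m - k)) / np.sqrt(m)
    h = rng.normal(size=m) / np.sqrt(m)         # head over (frozen, trainable)
    for _ in range(steps):
        Z = np.hstack([Z_ul, feats @ W_sl])
        r = Z @ h - y                           # residual of the squared loss
        h -= lr * Z.T @ r / n
        W_sl -= lr * feats.T @ np.outer(r, h[k:]) / n   # W_ul is never updated
    return W_ul, W_sl, h

# Toy usage with a random "pretrained" extractor.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))
y = X[:, 0] + 0.1 * rng.normal(size=2000)
M = rng.normal(size=(6, 4)) / np.sqrt(6)
feats = X @ M
W_ul, W_sl, h = ftt(feats, y)
```

Only `W_sl` and `h` receive gradient updates; `W_ul` is computed once and never touched, which is the whole point of the freeze stage.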
Step 1. Unsupervised freeze stage. FTT starts with a model $M_{init}$ pretrained on large datasets such as ImageNet or a language corpus. Given a training set $S_{tr} \sim \mathcal{D}_{tr}$, we use an unsupervised method such as contrastive learning or principal component analysis (PCA) to learn $pm$ features, where $m$ is the total number of features and $p \in [0, 1]$ is a