Freeze then Train: Towards Provable Representation Learning under
Spurious Correlations and Feature Noise
Haotian Ye (haotianye@pku.edu.cn), Peking University
James Zou (jamesz@stanford.edu), Stanford University
Linjun Zhang (linjun.zhang@rutgers.edu), Rutgers University
Abstract
The existence of spurious correlations, such as image backgrounds in the training environment, can make empirical risk minimization (ERM) perform badly in the test environment. To address this problem, Kirichenko et al. (2022) empirically found that the core features that are related to the outcome can still be learned well even in the presence of spurious correlations. This opens a promising strategy: first train a feature learner rather than a classifier, and then perform linear probing (last layer retraining) in the test environment. However, a theoretical understanding of when and why this approach works is lacking. In this paper, we find that core features are only learned well when their associated non-realizable noise is smaller than that of spurious features, which is not necessarily true in practice. We provide both theory and experiments to support this finding and to illustrate the importance of non-realizable noise. Moreover, we propose an algorithm called Freeze then Train (FTT), which first freezes certain salient features and then trains the rest of the features using ERM. We theoretically show that FTT preserves features that are more beneficial to test-time probing. Across two commonly used spurious correlation datasets, FTT outperforms ERM, IRM, JTT and CVaR-DRO, with substantial improvement in accuracy (by 4.5%) when the feature noise is large. FTT also performs better on general distribution shift benchmarks.
1 Introduction
Real-world datasets are riddled with features that are “right
for wrong reasons” (Zhou et al., 2021).

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206. Copyright 2023 by the author(s).

Figure 1: The improvement of last layer retraining accuracy (%) before vs. after ERM training on the Dominoes dataset (Shah et al., 2020). The model is initialized with ImageNet pretrained parameters. The x-axis and y-axis represent noise levels of the spurious and core features, respectively. ERM training helps/harms the performance when the non-realizable noise of the core features is smaller/greater than that of the spurious features. Experiment settings are in Section 5.

For instance, in Waterbirds (Sagawa et al., 2019), the bird type can be highly
correlated with the spurious feature of image backgrounds, and in CelebA (Liu et al., 2015) the hair color can be relevant to the gender. These features are referred to as spurious features (Hovy and Søgaard, 2015; Blodgett et al., 2016; Hashimoto et al., 2018): they are predictive for most of the training examples, but are not truly correlated with the intrinsic labeling function. Machine learning models that minimize the average loss on a training set (ERM) rely on these spurious features and will suffer high errors in environments where the spurious correlation changes. Most previous works seek to avoid learning spurious features by minimizing subpopulation group loss (Duchi et al., 2019), by up-weighting samples that are misclassified (Liu et al., 2021), by selectively mixing samples (Yao et al., 2022), and so on. The general goal is to recover the core features under spurious correlations.
Recently, Kirichenko et al. (2022) empirically found that
ERM can still learn the core features well even with the
presence of spurious correlations. They show that by sim-
ply retraining the last layer using a small set of data with
little spurious correlation, one can reweight the core features and achieve state-of-the-art performance on popular benchmark datasets.

arXiv:2210.11075v2 [cs.LG] 11 Apr 2023

Figure 2: An illustration of our method, Freeze then Train (FTT). We start with a pretrained feature extractor (e.g. a CNN) and find dataset-specific salient features using any unsupervised method, such as contrastive learning or PCA (the orange part). We then freeze these features and learn the rest of the features using any supervised method, such as ERM or a robust training algorithm (the blue part). In the test environment, the last layer is retrained. The pseudo-code can be found in Appendix A.

This method is called Deep Feature
Reweighting (DFR), and it points to a promising new strategy to overcome spurious correlation: learn a feature extractor rather than a classifier, and then perform linear probing on the test environment data. This strategy is also used in many real-world applications in NLP, where the pipeline is to learn a large pretrained model and conduct linear probing on downstream tasks (Brown et al., 2020). It simply requires a CPU-based logistic regression on a small number of samples from the deployed environment.
However, several problems regarding this strategy remain open. First, it is unclear when and why the core features can and cannot be learned during training and recovered in test-time probing. Moreover, in settings where the DFR strategy does not work well, is there an alternative strategy to learn the core features and make test-time probing work again?
In this paper, we first present a theoretical framework to quantify this phenomenon in a two-layer linear network and give both upper and lower bounds on the probing error in Theorems 1 and 2. Our theories analyze the effect of training and retraining, which is highly nontrivial due to the non-convex nature of the problem. Our theories point out an essential factor of this strategy: the feature-dependent non-realizable noise (abbreviated as non-realizable noise). Noise is common and inevitable in the real world (Frénay and Verleysen, 2013). For example, labels can have intrinsic variance and are imperfect, and human experts may also assign incorrect labels; in addition, noise is often heterogeneous and feature-dependent (Zhang et al., 2021), and spurious features can be better correlated with labels in the training environment (Yan et al., 2014; Veit et al., 2017). Our theories show that in order to learn core features, ERM requires the non-realizable noise of core features to be much smaller than that of spurious features. As illustrated in Figure 1, when this condition is violated, the features learned by ERM perform even worse than the pretrained features. The intuition is that models typically learn a mixture of different features, where the proportion depends on the trade-off between information and noise: features with larger noise are used less. During last-layer probing, when the proportion of the core feature is small, we must amplify it more, and the attached noise is amplified with it. Our theories and experiments suggest that the scenario in Kirichenko et al. (2022) is incomplete, and the strategy can sometimes be ineffective.
Inspired by this understanding, we propose an algorithm, called Freeze then Train (FTT), which first learns salient features in an unsupervised way and freezes them, and then trains the rest of the features via supervised learning. We illustrate it in Figure 2. Based on our finding that linear probing fails when the non-realizable noise of spurious features is smaller (since labels incentivize ERM to focus more on features with smaller noise), we propose to learn features both with and without the guidance of labels. This exploits the information provided by labels, while still preserving useful features that might not be learned in supervised training. We show in Theorem 3 that FTT attains near-optimal performance in our theoretical framework, providing an initial proof of its effectiveness.
We conduct extensive experiments to show that: (1) in real-world datasets the phenomenon matches our theories well; (2) on three spurious correlation datasets, FTT outperforms other algorithms by 1.4%, 0.3%, and 4.1% on average, and by up to 4.5%, 0.4%, and 9%; (3) on more general OOD tasks such as three distribution shift datasets, FTT outperforms other OOD algorithms by 1.1%, 0.8%, and 2.1% on average; (4) we also conduct fine-grained ablation experiments to study FTT under different unsupervised feature fractions and different numbers of learned features.
Together, we give a theoretical understanding of the probing strategy and propose FTT, which is more suitable for test-time probing and outperforms existing algorithms on various benchmarks. Even under spurious correlation and non-realizable noise, by combining ERM with unsupervised methods, we can still perform well in the test environment.
Related Works on robustness to spurious correlations.
Recent works aim to develop methods that are robust to
spurious correlations, including learning invariant repre-
sentations (Arjovsky et al., 2019; Guo et al., 2021; Khezeli
et al., 2021; Koyama and Yamaguchi, 2020; Krueger et al.,
2021; Yao et al., 2022), weighting/sampling (Shimodaira,
2000; Japkowicz and Stephen, 2002; Buda et al., 2018; Cui
et al., 2019; Sagawa et al., 2020), and distributionally ro-
bust optimization (DRO) (Ben-Tal et al., 2013; Namkoong
and Duchi, 2017; Oren et al., 2019). Rather than learning a “one-shot model”, we take a different strategy, proposed in Kirichenko et al. (2022), that conducts regression in the test environment.
Related Works on representation learning. Learning a good representation is essential for the success of deep learning models (Bengio et al., 2013). Representation learning has been studied in the settings of autoencoders (He et al., 2021), transfer learning (Du et al., 2020; Tripuraneni et al., 2020, 2021; Deng et al., 2021; Yao et al., 2021; Yang et al., 2022), topic modeling (Arora et al., 2016; Ke and Wang, 2022; Wu et al., 2022), algorithmic fairness (Zemel et al., 2013; Madras et al., 2018; Burhanpurkar et al., 2021) and self-supervised learning (Lee et al., 2020; Ji et al., 2021; Tian et al., 2021; Nakada et al., 2023).
2 Preliminary
Throughout the paper, we consider the classification task $\mathcal{X} \to \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = [K]$. Here we use $[N]$ to denote the set $\{1, \cdots, N\}$. We denote the set of all possible distributions over a set $E$ as $\Delta(E)$. Assume that the distribution of $(x, y)$ is $\mathcal{E}_{tr}$ in the training environment and $\mathcal{E}_{te}$ in the test environment.
Spurious correlation. Learning under spurious correlation is a special kind of Out-Of-Distribution (OOD) learning where $\mathcal{D}_{tr} \neq \mathcal{D}_{te}$. We use the term feature for a mapping $\phi(\cdot): \mathcal{X} \mapsto \mathbb{R}^m$ that captures some property of $\mathcal{X}$. We say $\phi$ is core (robust) if $y \mid \phi(x)$ has the same distribution across $\mathcal{E}_{tr}$ and $\mathcal{E}_{te}$. Otherwise, it is spurious.
Non-realizable noise. Learning under noise has been widely explored in the machine learning literature, but is barely considered when spurious correlations exist. Following Bühlmann (2020) and Arjovsky et al. (2019), we consider non-realizable noise as the randomness along a generating process (it can act either on features or on labels). Specifically, in the causal path $\phi_{core}(x) \to y$, we treat the label noise on $y$ as the non-realizable noise and call it “core noise”, as it is associated with the core features; in the causal path $y \to \phi_{spu}(x)$, we treat the feature noise on $\phi_{spu}(x)$ as the non-realizable noise and call it “spurious noise”, as it is associated with the spurious features. As we will show, the non-realizable noise influences the model's learning preference.
Goal. Our goal is to minimize the prediction error in $\mathcal{E}_{te}$, where the spurious correlations differ from those in $\mathcal{E}_{tr}$. In this paper, we consider the strategy proposed in Kirichenko et al. (2022) that trains a feature learner on $\mathcal{E}_{tr}$ and linearly probes the learned features on $\mathcal{E}_{te}$, which we call test-time probing (or last layer retraining). No knowledge about $\mathcal{E}_{te}$ is available during the first training stage. When deploying the model to $\mathcal{E}_{te}$, we are given a small test dataset $\{x_i, y_i\}_{i=1}^n$ sampled from $\mathcal{D}_{te}$, and we are allowed to conduct logistic/linear regression on $\phi(x)$ and $y$ to obtain our final prediction function. The goal is that after probing on the learned features, the model performs well in $\mathcal{E}_{te}$ under various possible feature noise settings.
3 Theory: Understand Learned Features
under Spurious Correlation
In this section, we theoretically show when core features can still be learned by ERM in spite of spurious correlations, and why non-realizable noise is crucial. Roughly speaking, only when the core noise is smaller than the spurious noise do the features learned by ERM guarantee good downstream probing performance. All proofs are in Appendix D.
3.1 Problem Setup
Data generation mechanism. To capture the spurious correlations and non-realizable noises, we assume the data $(x, y)$ is generated from the following mechanism:
$$x_1 \sim P \in \Delta(\mathbb{R}^{1\times d_1}), \qquad y = x_1\beta + \epsilon_{core},$$
$$x_2 = \begin{cases} y\gamma^\top + \epsilon_{spu} & \text{in } \mathcal{E}_{tr} \\ \epsilon_{spu} & \text{in } \mathcal{E}_{te} \end{cases} \;\in \mathbb{R}^{1\times d_2}, \qquad x = (x_1, x_2) \in \mathbb{R}^{1\times d}.$$
Here $x_1$ is the core feature with an invertible covariance matrix $\Sigma \triangleq \mathbb{E}[x_1^\top x_1]$, and $x_2$ is the spurious feature that is differently distributed in $\mathcal{E}_{tr}$ and $\mathcal{E}_{te}$. $\epsilon_{core} \in \mathbb{R}$ and $\epsilon_{spu} \in \mathbb{R}^{1\times d_2}$ are independent core and spurious noises with mean zero and variance (covariance matrix) $\eta_{core}^2$ and $\eta_{spu}^2 I$, respectively. $\beta \in \mathbb{R}^{d_1\times 1}$ and $\gamma \in \mathbb{R}^{d_2\times 1}$ are normalized coefficients with unit $\ell_2$ norm. We assume that there exists some $k \in \mathbb{N}$ such that the top-$k$ eigenvalues of $\Sigma$ are larger than the noise variances $\eta_{spu}^2, \eta_{core}^2$, and that $\beta$ lies in the span of the top-$k$ eigenvectors of $\Sigma$. This ensures that the signal along $\beta$ is salient enough to be learned. For technical simplicity, we also assume that all eigenvalues of $\Sigma$ are distinct.
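This mechanism is easy to simulate. The sketch below uses a Gaussian $P$, unit-norm $\beta$ and $\gamma$, and arbitrary dimensions and noise levels (all hypothetical choices, not values from the paper), with the spurious branch switching off in the test environment:

```python
import numpy as np

def generate(n, d1=3, d2=2, eta_core=0.1, eta_spu=0.5, train=True, seed=0):
    """Sample (x, y) from the mechanism: y = x1 @ beta + eps_core;
    x2 = y * gamma^T + eps_spu in E_tr, but x2 = eps_spu in E_te."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d1) / np.sqrt(d1)        # unit-norm coefficients
    gamma = np.ones(d2) / np.sqrt(d2)
    x1 = rng.normal(size=(n, d1))           # P = standard Gaussian here
    y = x1 @ beta + eta_core * rng.normal(size=n)
    eps_spu = eta_spu * rng.normal(size=(n, d2))
    x2 = y[:, None] * gamma[None, :] + eps_spu if train else eps_spu
    return np.hstack([x1, x2]), y

X_tr, y_tr = generate(10000, train=True)
X_te, y_te = generate(10000, train=False, seed=1)
```

In the training sample the spurious coordinates are strongly correlated with $y$; in the test sample that correlation vanishes, which is exactly the shift the theory studies.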
Our data generation mechanism is motivated by Arjovsky
et al. (2019) (Figure 3), where we extend their data model.
We allow core features to be drawn from any distribution P
so long as Σis invertible, while Arjovsky et al. (2019) only
consider a specific form of P. In addition, in our mecha-
nism, labels depend on core features and spurious features
depend on labels. However, our theorems and algorithms
can be easily applied to another setting where both core and
spurious features depend on labels. This is because the difference between the two settings can be summarized as a difference in $\Sigma$, while the techniques we use do not rely on the concrete form of $\Sigma$.
Models. To capture the property of features and retraining, we consider a regression task using a two-layer linear network $f(x) = xWb$, where $W \in \mathbb{R}^{d\times m}$ is the feature learner and $b \in \mathbb{R}^{m\times 1}$ is the last layer that will be retrained in $\mathcal{E}_{te}$. We assume that the model learns a low-dimensional representation ($m \ll d$), but is able to capture the ground-truth signal ($m \geq k$). Notice that the optimization over $(W, b)$ is non-convex, and there is no closed-form solution. This two-layer network model has been commonly used in the machine learning theory literature (Arora et al., 2018; Gidel et al., 2019; Kumar et al., 2022). The major technical difficulty in our setting is analyzing the learned features and controlling probing performance under this non-convexity with spurious correlations. We assume the parameters are initialized according to the Xavier uniform distribution¹.
Optimization. During the training stage, we minimize the $\ell_2$-loss $\ell_{tr}(W, b) = \frac{1}{2n}\|f(X) - Y\|^2$, where $X = (x_1^\top, \cdots, x_n^\top)^\top$ and $Y = (y_1, \cdots, y_n)^\top$. For clarity of analysis, we consider two extremes that simplify the optimization while still maintaining our key intuition. First, we take an infinitely small learning rate such that the optimization process becomes a gradient flow (Gunasekar et al., 2017; Du et al., 2018). Denote the parameters $W, b$ at training time $t$ as $W(t), b(t)$, and let $v(t) = W(t)b(t)$. Second, we consider the infinite data setting ($n \to \infty$). This is a widely used simplification to avoid the influence of sample randomness (Kim et al., 2019; Ghorbani et al., 2021). The parameters are updated as
$$\partial_t W(t) = -\nabla_W \ell_{tr}(W(t),b(t)) = -\Big(\frac{X^\top X}{n} W(t)b(t) - \frac{X^\top Y}{n}\Big) b(t)^\top = \big(\mathbb{E}[x^\top y] - \mathbb{E}[x^\top x]\, v(t)\big)\, b(t)^\top,$$
$$\partial_t b(t) = -\nabla_b \ell_{tr}(W(t),b(t)) = -W(t)^\top \Big(\frac{X^\top X}{n} W(t)b(t) - \frac{X^\top Y}{n}\Big) = W(t)^\top \big(\mathbb{E}[x^\top y] - \mathbb{E}[x^\top x]\, v(t)\big).$$
In the test stage, we retrain the last layer $b$ to minimize the test loss, i.e.
$$\ell_{te}(W) = \min_b\; \mathbb{E}_{\mathcal{E}_{te}}\, \frac{1}{2}\|xWb - y\|^2.$$
In the test stage, the spurious correlation is broken, i.e. $x_2 = \epsilon_{spu}$. The minimum error in the test stage is $err_{te}^* = \eta_{core}^2/2$, attained when $v = (\beta^\top, 0)^\top$.
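The whole pipeline can be sketched numerically: small-step gradient descent on the empirical moments stands in for the gradient flow, followed by last-layer retraining in the test environment. Dimensions and noise levels below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m = 3, 2, 4
beta = np.ones(d1) / np.sqrt(d1)
gamma = np.ones(d2) / np.sqrt(d2)
eta_core, eta_spu = 0.1, 0.5            # core noise smaller here

def sample(n, train):
    x1 = rng.normal(size=(n, d1))
    y = x1 @ beta + eta_core * rng.normal(size=n)
    eps = eta_spu * rng.normal(size=(n, d2))
    x2 = (y[:, None] * gamma + eps) if train else eps
    return np.hstack([x1, x2]), y

X, y = sample(20000, train=True)
d = d1 + d2
W = rng.uniform(-1, 1, size=(d, m)) / np.sqrt(d)   # Xavier-style init
b = rng.uniform(-1, 1, size=m) / np.sqrt(m)
Sxx, sxy = X.T @ X / len(X), X.T @ y / len(X)      # empirical moments
for _ in range(20000):                              # small-lr GD ~ gradient flow
    g = sxy - Sxx @ (W @ b)                         # shared residual term
    # dW = (E[x'y] - E[x'x] v) b',  db = W' (E[x'y] - E[x'x] v)
    W, b = W + 1e-2 * np.outer(g, b), b + 1e-2 * W.T @ g

X_te, y_te = sample(5000, train=False)              # spurious correlation broken
err_before = 0.5 * np.mean((X_te @ W @ b - y_te) ** 2)
b_new, *_ = np.linalg.lstsq(X_te @ W, y_te, rcond=None)  # retrain last layer
err_after = 0.5 * np.mean((X_te @ W @ b_new - y_te) ** 2)
```

This run uses $\eta_{core} < \eta_{spu}$, the regime where the upcoming Theorem 1 predicts probing works well; swapping the two noise levels moves the example into the Theorem 2 regime.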
3.2 Theoretical Analysis: Noises Matter
We are now ready to introduce our theoretical results on when the core features can and cannot be learned by ERM under different levels of non-realizable noise. One important intuition for why core features can still be learned well despite the (possibly more easily learned) spurious features is that the loss can be further reduced by using both core and spurious features simultaneously.

¹Our theorems can be easily applied to various initializations.
Lemma 1 For all $W \in \mathbb{R}^{d\times m}$ and $b \in \mathbb{R}^m$, we have
$$\ell_{tr}(W, b) \;\geq\; \frac{1}{2}\,\mathbb{E}\|x v_{tr}^* - y\|_2^2 \;=\; \frac{\eta_{core}^2\,\eta_{spu}^2}{2(\eta_{core}^2 + \eta_{spu}^2)} \;\triangleq\; err_{tr}^*,$$
where $v_{tr}^* = (\alpha\beta^\top, (1-\alpha)\gamma^\top)^\top$ is the optimal coefficient for training, and $\alpha = \frac{\eta_{spu}^2}{\eta_{core}^2 + \eta_{spu}^2}$.
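Lemma 1 can be checked numerically: solving the population normal equations $\mathbb{E}[x^\top x]\,v = \mathbb{E}[x^\top y]$ for the mechanism with $\Sigma = I$ recovers exactly $v_{tr}^* = (\alpha\beta^\top, (1-\alpha)\gamma^\top)^\top$. The noise levels below are hypothetical:

```python
import numpy as np

d1, d2 = 3, 2
eta_core, eta_spu = 0.3, 0.4             # hypothetical noise levels
beta = np.ones(d1) / np.sqrt(d1)
gamma = np.ones(d2) / np.sqrt(d2)

# Population moments of the mechanism with Sigma = I (Gaussian x1):
# E[x^T x] in block form, and E[x^T y].
Ey2 = 1 + eta_core ** 2                  # E[y^2] = beta' Sigma beta + eta_core^2
Sxx = np.block([
    [np.eye(d1),            np.outer(beta, gamma)],
    [np.outer(gamma, beta), Ey2 * np.outer(gamma, gamma) + eta_spu**2 * np.eye(d2)],
])
sxy = np.concatenate([beta, Ey2 * gamma])

v_star = np.linalg.solve(Sxx, sxy)       # population least squares
alpha_hat = v_star[:d1] @ beta           # weight on the core direction
alpha = eta_spu**2 / (eta_core**2 + eta_spu**2)
```

The recovered weight on $\beta$ matches $\alpha$ exactly, and plugging $v^*$ back into the loss reproduces $err_{tr}^*$.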
Lemma 1 shows that the loss is minimized by assigning an $\alpha$ fraction of the weight to the core direction $\beta$ and the rest to $\gamma$. This implies that the model will learn a mixture of both features even under large spurious correlations. More importantly, the magnitude of $\alpha$ largely influences the probing performance. During the test stage, $x_2$ becomes useless, and the trained $W(t)b(t)$ can only recover an $\alpha$ fraction of $y$, which induces a large approximation error. To this end, during retraining the last-layer coefficients should scale up in order to predict $y$ well. Meanwhile, this also scales up the weight on $x_2$, which is now merely harmful noise, resulting in a trade-off between learning accurate core features and removing spurious features. When the core noise is small, i.e., $\alpha \approx 1$, the noise on $x_2$ will not be scaled up much. The Waterbirds dataset considered in Kirichenko et al. (2022) has $\eta_{core} = 0\% < \eta_{spu} = 5\%$, falling into this regime. The following theorem tells how well ERM with last-layer probing works in this regime.
Theorem 1 (Upper Bound) Assume that $v(t)$ is bounded away from $0$ throughout the whole optimization², i.e. $\|v(t)\|_2 > c_0 > 0$. Then, for any $0 < \eta_{core} < \eta_{spu}$ and any time $t$, we have
$$\ell_{te}(W(t)) \;\leq\; \Big(1 + \frac{\eta_{core}^2}{\eta_{spu}^2}\Big)\, err_{te}^* + O(t^{-1}). \quad (1)$$
Here $err_{te}^* = \eta_{core}^2/2$ is the optimal test error, and $O(\cdot)$ hides the dependency on $\eta_{core}$, $\eta_{spu}$, $c_0$ and the initialized parameters. When $\eta_{core}/\eta_{spu} \to 0$, this theorem suggests that test-time probing achieves near-optimal error.
Theorem 1 gives a theoretical explanation of the last layer retraining phenomenon. It shows that the test error after retraining can be close to $err_{te}^*$ over time. However, this guarantee holds only when $\eta_{core} < \eta_{spu}$. The following theorem shows that when the core features have large noise, the representation learned by ERM produces degraded performance after linear probing.
²This is to guarantee that our gradient flow does not fail to converge to a minimum, in which case the theorem is meaningless.
Figure 3: A toy example illustrating when and why ERM can perform well after retraining in $\mathcal{E}_{te}$ ($d = 2$, $m = 1$). Assume the core feature $\beta$ is vertical and the spurious feature $\gamma$ is horizontal. Both features can predict $y$ in $\mathcal{E}_{tr}$, while $\gamma$ is useless in $\mathcal{E}_{te}$ since $x_2 = \epsilon_{spu}$. We initialize our single feature $W(0)$, and obtain $W(t)$ after training on $\mathcal{E}_{tr}$. We then retrain the last layer (probing) on $\mathcal{E}_{te}$, i.e. rescale $W(t)$, and obtain $v_{test}$. When $\eta_{core} < \eta_{spu}$, $W(t)$ will use $\beta$ more (the blue flow); after probing, $v_{test}$ can recover $\beta$ (small approximation error) without suffering much from the spurious noise along the direction of $\gamma$ (small spurious noise error). On the contrary, when $\eta_{core}$ is large, $W(t)$ will follow the red flow; this leads to a trade-off between the two error terms. In this case, ERM performs much worse. Notice that the flows in the figure are just for illustration. In practice, probing can either lengthen or shorten $W(t)$, depending on the concrete form of the two error terms.
Theorem 2 (Lower Bound) Assume that in the limit, $W_1(\infty) \triangleq \lim_{t\to\infty} W_1(t)$ has full column rank, which almost surely holds when $m < d$. Then for any $\eta_{core} > \eta_{spu} > 0$, we have
$$\lim_{t\to\infty} \frac{\ell_{te}(W(t))}{err_{te}^*} \;\geq\; 1 + \frac{\eta_{core}^2}{2\eta_{spu}^2} \wedge \frac{1}{2\,\eta_{spu}^2\,\|\Sigma^{-1}\|^2\,\big\|W_1^\dagger(\infty)\big\|_2^2}. \quad (2)$$
Here $A^\dagger$ is the Moore–Penrose inverse of $A$, and $a \wedge b$ takes the minimum over $a, b$. When $\eta_{spu}/\eta_{core} \to 0$, the last layer retraining error is much larger than the optimal error.
Theorem 2 implies that the error can be $\eta_{core}^2/\eta_{spu}^2$ times larger than $err_{te}^*$ when $\eta_{core} > \eta_{spu}$, showing that ERM with last layer retraining does not work in this scenario: the features learned by ERM are insufficient to recover near-optimal performance. In summary, we prove that test-time probing performance relies largely on the non-realizable noises, and it only works when the core noise is relatively smaller. We illustrate the two theorems in Figure 3.
4 Method: Improving Test-Time Probing
Our theories raise a natural question: can we improve the learned features and make the test-time probing strategy effective under various noise conditions? A feature can be better correlated with labels in $\mathcal{E}_{tr}$ than others, but the correlation may be spurious and may even disappear in $\mathcal{E}_{te}$. Without concrete knowledge about $\mathcal{E}_{te}$ and the spurious correlations, it is impossible to determine whether or not a learned feature is informative only in $\mathcal{E}_{tr}$, especially given that there are innumerably many features. This problem comes from treating the label as an absolute oracle, and it is unlikely to be addressed by switching to other supervised robust training methods that still depend on labels. We experimentally verify this in Section 5.2.
In order to perform well in test-time probing under different noise conditions, we should also learn salient features that are selected without relying on labels. This helps preserve features that are useful in the testing stage, but are ruled out because they are less informative than other features w.r.t. the labels. By learning features both with and without the help of labels, we can extract informative features and simultaneously maximize diversity. To this end, we propose the Freeze then Train (FTT) algorithm, which first freezes certain salient features unsupervisedly and then trains the rest of the features supervisedly. The algorithm is illustrated in Figure 2, and we describe the details below.
4.1 Method: Freeze then Train
Algorithm 1 Freeze Then Train
Input: dataset $S = \{x_i, y_i\}_{i=1}^n$, initialized feature extractor $M: \mathcal{X} \mapsto \mathbb{R}^m$, unsupervised fraction $p$, number of classes $K$.
1: Conduct PCA on $\{M(x_i)\}_{i=1}^n$ with dimension $pm$, obtaining the transform matrix $W_{ul} \in \mathbb{R}^{m \times pm}$.
2: Set the unsupervised model $M_{ul}(x) = M(x)W_{ul}$ and freeze its parameters (including $W_{ul}$).
3: Set $M_{sl}(x) = M(x)W_{sl}$ and initialize a linear head $h: \mathbb{R}^m \mapsto \mathbb{R}^K$.
4: Supervisedly train the model $M_{FTT}(x) = h((M_{ul}(x), M_{sl}(x)))$ on $S$ using ERM, updating $M_{sl}$, $W_{sl}$, and $h$ until convergence.
Output: $M_{FTT}$
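Algorithm 1 can be sketched end-to-end. The snippet below is a minimal linear instantiation under stated assumptions: PCA for the freeze stage, plain gradient descent on a squared loss standing in for the supervised classification stage, and a random matrix standing in for the pretrained extractor $M$; all dimensions are illustrative:

```python
import numpy as np

def ftt(feats, y, p=0.5, steps=10000, lr=5e-3, seed=0):
    """Freeze then Train, with linear maps standing in for networks.
    `feats` plays the role of the pretrained features M(x)."""
    rng = np.random.default_rng(seed)
    n, m = feats.shape
    k = int(p * m)                              # number of frozen features
    # Step 1: unsupervised freeze stage -- PCA on the pretrained features.
    C = np.cov(feats, rowvar=False)
    _, vecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
    W_ul = vecs[:, -k:]                         # top-k principal directions, frozen
    Z_ul = feats @ W_ul
    # Step 2: supervised stage -- train W_sl and the linear head h with ERM.
    W_sl = rng.normal(size=(m, m - k)) / np.sqrt(m)
    h = rng.normal(size=m) / np.sqrt(m)         # head over (frozen, trainable)
    for _ in range(steps):
        Z = np.hstack([Z_ul, feats @ W_sl])
        r = Z @ h - y                           # residual of the squared loss
        h -= lr * Z.T @ r / n
        W_sl -= lr * feats.T @ np.outer(r, h[k:]) / n   # W_ul is never updated
    return W_ul, W_sl, h

# Toy usage with a random "pretrained" extractor.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))
y = X[:, 0] + 0.1 * rng.normal(size=2000)
M = rng.normal(size=(6, 4)) / np.sqrt(6)
feats = X @ M
W_ul, W_sl, h = ftt(feats, y)
```

Only `W_sl` and `h` receive gradient updates; `W_ul` is computed once and never touched, which is the whole point of the freeze stage.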
Step 1. Unsupervised freeze stage. FTT starts with a model $M_{init}$ pretrained on large datasets such as ImageNet or a language corpus. Given a training set $S_{tr} \sim \mathcal{D}_{tr}$, we use an unsupervised method such as contrastive learning or principal component analysis (PCA) to learn $pm$ features, where $m$ is the total number of features and $p \in [0, 1]$ is a