
Lovering et al. [54] explore the factors which affect the extractability of features after pre-training and fine-tuning of NLP models. Kaushik et al. [36] construct counterfactually augmented sentiment analysis and natural language inference datasets (CAD) and show that combining CAD with the original data reduces the reliance on spurious correlations on the corresponding benchmarks. Kaushik et al. [38] explain the efficacy of CAD and show that while adding noise to causal features degrades both in-distribution and out-of-distribution performance, adding noise to non-causal features improves robustness. Eisenstein [17] and Veitch et al. [86] formally define and study spurious features in NLP from the perspective of causality.
Kirichenko et al. [40] show that models trained with standard ERM often learn high-quality representations of the core features, and propose the DFR procedure (see Section 3), which we use extensively in this paper. Related observations have also been reported in other works in the context of spurious correlations [58], domain generalization [73] and long-tail classification [35].
While we build on the observations of Kirichenko et al. [40], our work provides significant new insights and greatly expands the scope of their analysis. In particular, we investigate the feature representations learned by methods beyond standard ERM, and the role of model architecture, pre-training, regularization and data augmentation in learning semantic structure. We also extend our analysis beyond the standard spurious correlation benchmarks studied by Kirichenko et al. [40] by considering challenging real-world satellite imaging and chest X-ray datasets.
In independent and concurrent work, Shi et al. [81] also propose an evaluation framework for out-of-distribution generalization based on last layer retraining, inspired by the observations of Kirichenko et al. [40] and Kang et al. [35]. They focus on comparing supervised, self-supervised and unsupervised training methods, providing observations complementary to ours.
3 Background
Preliminaries.
We consider classification tasks with inputs $x \in \mathcal{X}$ and classes $y \in \mathcal{Y}$. We assume that the data distribution consists of groups $\mathcal{G}$ which are not equally represented in the training data. The distribution of groups can change between the training and test distributions, with majority groups becoming less common or minority groups becoming more common. Because of the imbalance in the training data, models trained with ERM often exhibit a gap between average and worst group performance at test time. Throughout this paper, we study the worst group accuracy (WGA), i.e., the lowest test accuracy across all the groups $\mathcal{G}$. For most problems considered in this paper, we assume that each data point has an attribute $s \in \mathcal{S}$ which is spuriously correlated with the label $y$, and the groups are defined by combinations of the label and the spurious attribute: $\mathcal{G} = \mathcal{Y} \times \mathcal{S}$. In the test distribution, $s$ may no longer be correlated with $y$, so a model that has learned to rely on the spurious feature $s$ during training will perform poorly at test time. Models that rely on spurious features typically achieve poor worst group accuracy, while models that rely on core features have more uniform accuracies across the groups. In Appendix A, we describe the groups, spurious features and core features in the datasets that we use in this paper.
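To make the metric concrete, the sketch below shows how WGA can be computed from model predictions and group annotations; this is a minimal NumPy illustration under our notation, and the function name `worst_group_accuracy` is our own rather than part of any particular codebase.

```python
import numpy as np

def worst_group_accuracy(preds, labels, attrs):
    """Lowest test accuracy over the groups (y, s) in Y x S.

    preds, labels, attrs: 1-D integer arrays of equal length, holding
    predicted classes, true classes and spurious attributes.
    """
    accs = []
    for y in np.unique(labels):
        for s in np.unique(attrs):
            mask = (labels == y) & (attrs == s)
            if mask.any():  # skip (y, s) combinations absent from the data
                accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)
```

For instance, with binary labels and a binary spurious attribute there are four groups, and WGA is the accuracy on the worst of the four.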
In order to perform controlled experiments, we assume that we have access to the spurious attributes $s$ (or group labels) for the training or validation data, which we use to train group robustness baselines and to evaluate feature quality. However, we emphasize that our results on the features learned by ERM hold generally, even when spurious attributes are unknown, as ERM does not use any information about the spurious features: we only use the spurious attributes to perform analysis.
Deep feature reweighting.
Suppose we are given a model $m: \mathcal{X} \to \mathcal{C}$, where $\mathcal{X}$ is the input space and $\mathcal{C}$ is the set of classes. Kirichenko et al. [40] assume that the model $m$ consists of a feature extractor (typically, a sequence of convolutional or transformer layers) followed by a classification head (typically, a single linear layer): $m = h \circ e$, where $e: \mathcal{X} \to \mathcal{F}$ is the feature extractor and $h: \mathcal{F} \to \mathcal{C}$ is the classification head. They discard the classification head and use the feature extractor $e$ to compute the set of embeddings $\hat{\mathcal{D}}_e = \{(e(x_i), y_i)\}_{i=1}^n$ of all the datapoints in the reweighting dataset $\hat{\mathcal{D}}$; the reweighting dataset is used to retrain the last layer of the model, and contains group-balanced data where the spurious correlation does not hold. Finally, they train a logistic regression classifier $l: \mathcal{F} \to \mathcal{C}$ on the dataset $\hat{\mathcal{D}}_e$.^1 The final model used on new test data is then the composition $l \circ e$.
^1 For stability, the logistic regression models are trained 10 times on different random group-balanced subsets of the reweighting dataset $\hat{\mathcal{D}}$, and the weights of the learned logistic regression models are averaged. See Appendix B of Kirichenko et al. [40] for full details on the DFR procedure.
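To make the procedure concrete, the sketch below implements DFR last-layer retraining with scikit-learn, assuming the embeddings of the reweighting dataset have already been computed with $e$; the function name `dfr_head` is our own, and we omit the feature preprocessing and regularization-strength tuning of the full procedure (see Appendix B of Kirichenko et al. [40]).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dfr_head(feats, labels, attrs, n_runs=10, seed=0):
    """Sketch of DFR: fit a logistic regression head on group-balanced
    subsets of the reweighting embeddings and average the learned
    weights over n_runs random subsets for stability (see footnote 1).
    """
    rng = np.random.default_rng(seed)
    # Assign each point to its group g = (y, s).
    groups = np.stack([labels, attrs], axis=1)
    uniq, gid = np.unique(groups, axis=0, return_inverse=True)
    n_min = np.bincount(gid).min()  # size of the smallest group

    coefs, intercepts = [], []
    for _ in range(n_runs):
        # Subsample n_min points from every group, without replacement.
        idx = np.concatenate([
            rng.choice(np.where(gid == g)[0], n_min, replace=False)
            for g in range(len(uniq))
        ])
        clf = LogisticRegression(max_iter=1000).fit(feats[idx], labels[idx])
        coefs.append(clf.coef_)
        intercepts.append(clf.intercept_)

    # Average the linear heads; the averaged head l replaces the
    # original classification head h.
    head = LogisticRegression(max_iter=1000)
    head.coef_ = np.mean(coefs, axis=0)
    head.intercept_ = np.mean(intercepts, axis=0)
    head.classes_ = clf.classes_
    return head
```

Averaging over several group-balanced subsets reduces the variance introduced by subsampling down to the size of the smallest group.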