On Feature Learning in the Presence of Spurious Correlations

Pavel Izmailov, Polina Kirichenko, Nate Gruver, Andrew Gordon Wilson
New York University
Abstract
Deep classifiers are known to rely on spurious features — patterns which are correlated with the target on the training data but not inherently relevant to the learning
problem, such as the image backgrounds when classifying the foregrounds. In this
paper we evaluate the amount of information about the core (non-spurious) features
that can be decoded from the representations learned by standard empirical risk
minimization (ERM) and specialized group robustness training. Following recent
work on Deep Feature Reweighting (DFR), we evaluate the feature representations
by re-training the last layer of the model on a held-out set where the spurious
correlation is broken. On multiple vision and NLP problems, we show that the
features learned by simple ERM are highly competitive with the features learned
by specialized group robustness methods targeted at reducing the effect of spurious
correlations. Moreover, we show that the quality of learned feature representations
is greatly affected by the design decisions beyond the training method, such as the
model architecture and pre-training strategy. On the other hand, we find that strong
regularization is not necessary for learning high quality feature representations.
Finally, using insights from our analysis, we significantly improve upon the best
results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97%, 92% and 50% worst-group accuracies, respectively.
1 Introduction
In classification problems, a feature is spurious if it is predictive of the label without being causally related to it. Models that exploit the predictive power of spurious features can achieve strong average performance on training and in-distribution test data but often perform poorly on sub-groups of the data where the spurious correlation does not hold [18]. For example, neural networks trained on ImageNet are known to rely on backgrounds [93] or texture [19], which are often correlated with labels without being causally significant. Similarly, in natural language processing, models often rely on specific words and syntactic heuristics when predicting the sentiment of a sentence or the relationship between a pair of sentences [56, 22]. In extreme cases, neural networks completely ignore task-relevant core features and only use spurious features in their predictions [97, 79], achieving zero accuracy on the subgroups of the data where the spurious correlation does not hold.
In recent work, Kirichenko et al. [40] showed that, surprisingly, standard Empirical Risk Minimization (ERM) learns a high-quality representation of the core features on datasets with spurious correlations, even when the model primarily relies on spurious features to make predictions. Moreover, they showed that it is often possible to recover state-of-the-art performance on benchmark spurious correlation problems by simply retraining the last layer of the model on a small held-out dataset where the spurious correlation does not hold. This procedure is called Deep Feature Reweighting (DFR).
Equal contribution.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.11369v1 [cs.LG] 20 Oct 2022
In this paper, we provide an in-depth study of the factors that affect the quality of learned representations in the presence of spurious correlations: how accurately we can decode the core features from the learned representations. Following Kirichenko et al. [40], we break the problem of training a robust classifier into two tasks: extracting feature representations and training a linear classifier on these features. In order to study feature learning in isolation, we use the DFR procedure to learn an optimal linear classifier on the feature representations, and evaluate the features learned with different training methods, neural network architectures, and hyper-parameters.
First, on a range of problems with spurious correlations we show that while specialized group robustness methods such as group distributionally robust optimization (group DRO) [76] can significantly outperform the standard ERM training, the quality of the features learned by ERM is highly competitive: by applying the DFR procedure to the features learned by ERM and group DRO we achieve similar performance. Furthermore, we show that the performance improvements of group DRO are largely explained by the better weighting of the learned features in the last classification layer, and not by learning a better representation of the core features. This observation has high practical significance, as the problem of training the last layer of the model is much simpler both conceptually and computationally than training the full model to avoid spurious correlations [40].
Next, focusing on ERM training, we explore the effect of model class, pretraining strategy and regularization on feature learning. We find a linear dependence between the in-distribution accuracy of the model and the worst group accuracy after applying DFR, meaning that on natural datasets good generalization typically implies good feature learning, even in the presence of spurious features. Further, we show that the pre-training strategy has a very significant effect on the quality of the learned features, while strong regularization does not significantly improve the feature representations on most benchmarks.
Finally, by finetuning a pretrained state-of-the-art ConvNeXt model [52], we significantly outperform the best reported results on the popular Waterbirds [76], CelebA hair color and WILDS-FMOW [43] spurious correlation benchmarks, using only simple ERM training followed by DFR.
Our code is available at github.com/izmailovpavel/spurious_feature_learning.
2 Related Work
Numerous works describe how neural networks can rely on spurious correlations in real world problems. In vision, neural networks can learn to rely on an image’s background [93, 76, 61], secondary objects [45, 72, 83, 80, 2, 60], object textures [19] and other semantically irrelevant features [9, 49]. Spurious correlations are especially problematic in high-risk domains such as medical imaging, where it was shown that neural networks can use hospital-specific metal tokens [97] or cues of disease treatment [65] rather than symptoms to perform automated diagnosis on chest X-ray images. Spurious features are also extremely prevalent in NLP, where models can achieve good performance on benchmarks without properly solving them, e.g. by using simple syntactic heuristics such as lexical overlap between the two sentences in order to classify the relationship between them [64, 22, 37, 56]. For a comprehensive survey of the area, see Geirhos et al. [19].
Because of the high practical significance of spurious correlations, many group robustness methods have been proposed. These methods aim to reduce the reliance of deep learning models on spurious correlations and improve worst group performance. Group DRO [76] is the state-of-the-art group robustness method, which minimizes the worst-group loss instead of the average loss. Other works focus on automatically identifying the minority group examples [50, 62, 13, 100], learning several diverse classifiers that use different features [48, 66, 85] or using partially available group labels [84, 63]. Group subsampling was shown to be a strong baseline for some benchmarks [34, 78].
In this work, we focus on feature learning in the presence of spurious correlations. Hermann and Lampinen [31] perform a conceptually similar study, but focusing on synthetic datasets. Similar to their work, we explore how well the different features of the data can be decoded from the representations learned by deep neural networks, but on large-scale natural datasets. Hermann et al. [30] explore feature learning in the context of texture bias [19], finding that data augmentation has a profound effect on the texture bias while architectures and training objectives have a relatively small effect. Ghosal et al. [20] show that Vision Transformer models pretrained on ImageNet22k [44] significantly outperform standard CNN models on several spurious correlation benchmarks.
Lovering et al. [54] explore the factors which affect the extractability of features after pre-training and fine-tuning of NLP models. Kaushik et al. [36] construct counterfactually augmented sentiment analysis and natural language inference datasets (CAD) and show that combining CAD with the original data reduces the reliance on spurious correlations on the corresponding benchmarks. Kaushik et al. [38] explain the efficacy of CAD and show that while adding noise to causal features degrades in-distribution and out-of-distribution performance, adding noise to non-causal features improves robustness. Eisenstein [17] and Veitch et al. [86] formally define and study spurious features in NLP from the perspective of causality.
Kirichenko et al. [40] show that models trained with standard ERM often learn high-quality representations of the core features, and propose the DFR procedure (see Section 3), which we use extensively in this paper. Related observations have also been reported in other works in the context of spurious correlations [58], domain generalization [73] and long-tail classification [35]. While we build on the observations of Kirichenko et al. [40], our work provides profound new insights and greatly expands on the scope of their work. In particular, we investigate the feature representations learned by methods beyond standard ERM, and the role of model architecture, pre-training, regularization and data augmentation in learning semantic structure. We also extend our analysis beyond the standard spurious correlation benchmarks studied by Kirichenko et al. [40], by considering the challenging real-world satellite imaging and chest X-ray datasets.
In an independent and concurrent work, Shi et al. [81] also propose an evaluation framework for out-of-distribution generalization based on last layer retraining, inspired by the observations of Kirichenko et al. [40] and Kang et al. [35]. They focus on the comparison of supervised, self-supervised and unsupervised training methods, providing complementary observations to our work.
3 Background
Preliminaries.
We consider classification tasks with inputs x ∈ X and classes y ∈ Y. We assume that the data distribution consists of groups G which are not equally represented in the training data. The distribution of groups can change between the training and test distributions, with majority groups becoming less common or minority groups becoming more common. Because of the imbalance in training data, models trained with ERM often have a gap between average and worst group performance on test. Throughout this paper, we will be studying worst group accuracy (WGA), i.e. the lowest test accuracy across all the groups G. For most problems considered in this paper, we assume that each data point has an attribute s ∈ S which is spuriously correlated with the label y, and the groups are defined by a combination of the label and spurious attribute: G = Y × S. In the test distribution we might find that s is no longer correlated with y, and thus a model that has learned to rely on the spurious feature s during training will perform poorly at test time. Models that rely on the spurious features will typically achieve poor worst group accuracy, while models that rely on core features will have more uniform accuracies across the groups. In Appendix A, we describe the groups, spurious and core features in the datasets that we use in this paper.

In order to perform controlled experiments, we assume that we have access to the spurious attributes s (or group labels) for training or validation data, which we use for training group robustness baselines and for feature quality evaluation. However, we emphasize that our results on the features learned by ERM hold generally, even when spurious attributes are unknown, as ERM does not use the information about the spurious features: we only use the spurious attributes to perform analysis.
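To make the metric concrete, here is a minimal sketch of how worst group accuracy can be computed from per-example predictions and group labels; the function and array names are our own illustrative choices, not part of any benchmark API.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Lowest accuracy over the groups in G.

    y_true, y_pred and groups are 1-D arrays of equal length; each entry of
    `groups` identifies the (label, spurious attribute) combination of the
    corresponding example.
    """
    accuracies = []
    for g in np.unique(groups):
        mask = groups == g
        accuracies.append((y_pred[mask] == y_true[mask]).mean())
    return min(accuracies)

# Example: two classes and two spurious attributes give four groups.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
groups = np.array([0, 1, 2, 3, 3, 0])
print(worst_group_accuracy(y_true, y_pred, groups))  # accuracy of the worst group
```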
Deep feature reweighting.
Suppose we are given a model m : X → C, where X is the input space and C is the set of classes. Kirichenko et al. [40] assume that the model m consists of a feature extractor (typically, a sequence of convolutional or transformer layers) followed by a classification head (typically, a single linear layer): m = h ∘ e, where e : X → F is a feature extractor and h : F → C is a classification head. They discard the classification head, and use the feature extractor e to compute the set of embeddings D̂_e = {(e(x_i), y_i)}_{i=1}^n of all the datapoints in the reweighting dataset D̂; the reweighting dataset is used to retrain the last layer of the model, and contains group-balanced data where the spurious correlation does not hold. Finally, they train a logistic regression classifier l : F → C on the dataset D̂_e.¹ Then, the final model used on new test data is given by m_l = l ∘ e. Throughout this paper, we use a group-balanced held-out dataset (a subset of the validation dataset where each group has the same number of datapoints) as the reweighting dataset D̂; Kirichenko et al. [40] denote this variation of the method as DFR_Tr^Val.

¹For stability, logistic regression models are trained 10 times on different random group-balanced subsets of the reweighting dataset D̂, and the weights of the learned logistic regression models are averaged. See Appendix B of Kirichenko et al. [40] for full details on the DFR procedure.
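As a rough sketch of last layer retraining in this spirit, the code below trains a logistic regression classifier on group-balanced subsets of the embeddings and averages the learned weights. It is written under our own simplifying assumptions (plain scikit-learn logistic regression and feature standardization) and is not the exact DFR implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def dfr_last_layer(embeddings, labels, groups, n_repeats=10, C=1.0):
    """Retrain the last layer on group-balanced embeddings and average the weights."""
    scaler = StandardScaler().fit(embeddings)
    embeddings = scaler.transform(embeddings)

    coefs, intercepts = [], []
    group_ids = np.unique(groups)
    min_size = min((groups == g).sum() for g in group_ids)
    for _ in range(n_repeats):
        # Subsample an equal number of examples from every group.
        idx = np.concatenate([
            np.random.choice(np.where(groups == g)[0], min_size, replace=False)
            for g in group_ids
        ])
        clf = LogisticRegression(C=C, max_iter=1000).fit(embeddings[idx], labels[idx])
        coefs.append(clf.coef_)
        intercepts.append(clf.intercept_)

    # Average the learned linear classifiers for stability.
    clf.coef_ = np.mean(coefs, axis=0)
    clf.intercept_ = np.mean(intercepts, axis=0)
    return clf, scaler
```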
4 Experimental Setup and Evaluation Procedure
In this section, we describe the datasets, models and evaluation procedure that we use throughout the
paper.
Datasets.
In order to cover a broad range of practical scenarios, we consider four image classification and two text classification problems.
Waterbirds [76] is a binary image classification problem, where the class corresponds to the type of the bird (landbird or waterbird), and the background is spuriously correlated with the class. Namely, most landbirds are shown on land, and most waterbirds are shown over water.
CelebA hair color [51] is a binary image classification problem, where the goal is to predict whether a person shown in the image is blond; the gender of the person serves as a spurious feature, as 94% of the images with the “blond” label depict females.
WILDS-FMOW [12, 43, 77] is a satellite image classification problem, where the classes correspond to one of 62 land use or building types, and the spurious attribute s corresponds to the region (Africa, Americas, Asia, Europe, Oceania or Other; the “Other” region is not used in the evaluation). We note that for the FMOW datasets the groups G_s are defined by the value of the spurious attribute, and not the combination of the spurious attribute and the class label G_{y,s}, as described in Section 3. Moreover, on FMOW there is also a domain shift: the images for test and validation data (used for last layer retraining) are collected in 2016 and 2017, while the training data is collected before 2016. For more details, please see Appendix A.
CXR-14 [89] is a dataset with chest X-ray images for which we focus on a binary classification problem of pneumothorax prediction. Oakden-Rayner et al. [65] showed that there is a hidden stratification in the dataset such that most images from the positive class contain a chest drain, which is a non-causal feature related to treatment of the disease. While for all other benchmarks we report WGA on test data, for this dataset, following prior work [65, 48, 70], we report worst group AUC because of the heavy class imbalance.²
MultiNLI [91, 76] is a text classification problem, where the task is to classify the relationship between a given pair of sentences as a contradiction, entailment or neither of them. In this dataset, the presence of negation words (e.g. “never”) in the second sentence is spuriously correlated with the “contradiction” class.
CivilComments [8, 43] is a text classification problem, where the goal is to classify whether a given comment is toxic. We follow Idrissi et al. [34] and use the coarse version of the dataset both for training and evaluation, where the spurious attribute is s = 1 if the comment mentions at least one of the following categories: male, female, LGBT, black, white, Christian, Muslim, other religion; otherwise, the spurious label is 0. The presence of the eight categories above is spuriously correlated with the comment being classified as toxic.
The Waterbirds, CelebA, CivilComments and MultiNLI datasets are commonly used to benchmark the performance of group robustness methods [see e.g. 34, 50, 63]. The FMOW and CXR-14 datasets present challenging real-world problems with spurious correlations. In these datasets, the inputs do not resemble natural images from datasets such as ImageNet [75], so models have to learn the relevant features from data to achieve good performance, and cannot simply rely on feature transfer. We provide detailed descriptions of the data and show example datapoints in Appendix A, Figures 7 and 8.
²We take the minimum of the two scores on the test set: the AUC for classifying the negative class against positive examples with a chest drain, and the AUC for the negative class against positive examples without a chest drain.
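For illustration, a small sketch of this worst-group AUC computation with scikit-learn is shown below; the function name, array names and the chest-drain indicator are placeholders we introduce for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def worst_group_auc(y_true, scores, has_drain):
    """Minimum AUC over positives with / without a chest drain, each vs. all negatives."""
    neg = y_true == 0
    aucs = []
    for drain_value in (True, False):
        mask = neg | ((y_true == 1) & (has_drain == drain_value))
        aucs.append(roc_auc_score(y_true[mask], scores[mask]))
    return min(aucs)
```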
Models.
Following prior work [76, 34] we use a ResNet-50 [26] model pretrained on ImageNet1k [75] on Waterbirds, CelebA and FMOW. For the NLP problems, we use a BERT model [14] pretrained on Book Corpus and English Wikipedia data. On CXR-14, following prior work [e.g. 70, 65, 48] we use a DenseNet-121 model [32] pretrained on ImageNet1k. In Section 6, we provide an extensive study of the effect of architecture and pretraining on the image classification problems, and in Appendix E we perform a similar study on the MultiNLI text classification problem.
Evaluation strategy.
We use DFR to evaluate the quality of the learned feature representations, as described in Section 3: we measure how well the core features can be decoded from the learned representations with last layer retraining. In some of the experiments, we also train a linear classifier to predict the spurious attribute s instead of the class label y. Using this classifier, we can evaluate the decodability of the spurious feature from the learned feature representation. We refer to this procedure as s-DFR and the corresponding worst group accuracy (in predicting the spurious attribute s) as DFR s-WGA. Additionally, we evaluate the worst group accuracy and mean³ accuracy of the base model without applying DFR, which we refer to as base WGA and base accuracy respectively.
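In terms of the sketches above, s-DFR amounts to reusing the same last layer retraining routine with the spurious attribute as the prediction target; a hypothetical usage, with placeholder array names, might look like this.

```python
# s-DFR: retrain the last layer to predict the spurious attribute s rather than
# the class label y, then report the worst group accuracy of that classifier.
# `dfr_last_layer` and `worst_group_accuracy` are the illustrative sketches above;
# the embedding and attribute arrays are placeholders.
clf_s, scaler_s = dfr_last_layer(val_embeddings, s_val, groups_val)
s_pred = clf_s.predict(scaler_s.transform(test_embeddings))
dfr_s_wga = worst_group_accuracy(s_test, s_pred, groups_test)
```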
5 ERM vs Group Robustness Training
Multiple methods have been proposed for training classifiers which are more robust to spurious
correlations, with significant improvements in worst group accuracy compared to standard training.
In this section, we use DFR to investigate whether the improvements of group robustness methods
are caused by better feature representations or by better weighting of the learned features.
Methods.
We consider 4 methods for learning the features. ERM, or Empirical Risk Minimization, is standard training on the original training data, without any techniques targeted at improving worst group performance. RWG reweights the loss on each of the groups according to the size of the group, and RWY reweights the loss on each class according to the size of the class [34]. Group DRO [76] is a state-of-the-art method which uses the group information on the training data to minimize the worst group loss instead of the average loss. Group DRO is often considered an oracle method, or an upper bound on the worst group performance under spurious correlations [50, 13].
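To illustrate how these objectives differ from plain ERM, the sketch below contrasts a group-reweighted loss (in the spirit of RWG) with a simplified worst-group objective (in the spirit of group DRO); it is a per-batch approximation under our own assumptions and omits details of the actual group DRO algorithm, such as its online group weights and generalization adjustment.

```python
import torch
import torch.nn.functional as F

def group_reweighted_loss(logits, labels, groups, group_counts):
    """RWG-style objective: weight each example inversely to its group size."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = 1.0 / group_counts[groups].float()
    return (weights * per_example).sum() / weights.sum()

def worst_group_loss(logits, labels, groups, num_groups):
    """Simplified group-DRO-style objective: loss of the worst group in the batch."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in range(num_groups):
        mask = groups == g
        if mask.any():
            group_losses.append(per_example[mask].mean())
    return torch.stack(group_losses).max()
```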
On the CXR dataset the group labels are not available on the train data, so we cannot apply RWG or
group DRO; on this dataset we compare ERM to RWY. On several datasets, the performance of RWY
and RWG methods deteriorates during training. For these datasets, we additionally report the results
for the checkpoint obtained with early stopping (RWY-ES and RWG-ES). For group DRO, we report
the performance with early stopping on all datasets except for CXR (GDRO-ES). In all cases, early
stopping is performed based on the worst-group accuracy on the validation set.
Hyper-parameter selection.
We train the ERM, RWG and RWY models with the same hyper-parameters shared between all the image datasets (apart from the batch size, which is set to 32 on Waterbirds and 100 on the other datasets) and between the natural language datasets. We do not tune the hyper-parameters of these methods for worst group accuracy. For group DRO, we run a grid search over the values of the generalization adjustment C, weight decay and learning rate hyper-parameters, and select the best combination according to the worst-group accuracy on validation data with early stopping. For details, please see Appendix B.
5.1 Results
We compare feature learning methods on all datasets in Figure 1. As expected, the worst group accuracy of group robustness methods is significantly better than the ERM worst group accuracy on most datasets. For example, on Waterbirds, ERM only gets 68.8% WGA, while group DRO with early stopping gets 90.6%. However, after applying DFR the performance of ERM and group DRO is very close, with a slight advantage for ERM (91.1% for ERM and 90% for group DRO), and similar observations hold on all datasets.
The results for RWG and RWY are analogous. Namely, when combined with early stopping, these
methods outperform ERM on base model performance. Once we apply DFR, however, the gap in
³Following Sagawa et al. [76] and Kirichenko et al. [40], we evaluate the mean accuracy according to the group distribution in the training set. This way, mean accuracy represents the in-distribution generalization of the model.