
Lovering et al. [54] explore the factors which affect the extractability of features after pre-training and fine-tuning of NLP models. Kaushik et al. [36] construct counterfactually augmented sentiment analysis and natural language inference datasets (CAD) and show that combining CAD with the original data reduces the reliance on spurious correlations on the corresponding benchmarks. Kaushik et al. [38] explain the efficacy of CAD and show that while adding noise to causal features degrades both in-distribution and out-of-distribution performance, adding noise to non-causal features improves robustness. Eisenstein [17] and Veitch et al. [86] formally define and study spurious features in NLP from the perspective of causality.
Kirichenko et al. [40] show that models trained with standard ERM often learn high-quality representations of the core features, and propose the DFR procedure (see Section 3), which we use extensively in this paper. Related observations have also been reported in other works in the context of spurious correlations [58], domain generalization [73] and long-tail classification [35].
While we build on the observations of Kirichenko et al. [40], our work provides significant new insights and greatly expands the scope of their analysis. In particular, we investigate the feature representations learned by methods beyond standard ERM, and the role of model architecture, pre-training, regularization and data augmentation in learning semantic structure. We also extend our analysis beyond the standard spurious correlation benchmarks studied by Kirichenko et al. [40] by considering challenging real-world satellite imaging and chest X-ray datasets.
In independent and concurrent work, Shi et al. [81] also propose an evaluation framework for out-of-distribution generalization based on last layer retraining, inspired by the observations of Kirichenko et al. [40] and Kang et al. [35]. They focus on comparing supervised, self-supervised and unsupervised training methods, providing observations complementary to ours.
3 Background
Preliminaries.
We consider classification tasks with inputs $x \in \mathcal{X}$ and classes $y \in \mathcal{Y}$. We assume that the data distribution consists of groups $\mathcal{G}$ which are not equally represented in the training data. The distribution of groups can change between the training and test distributions, with majority groups becoming less common or minority groups becoming more common. Because of the imbalance in the training data, models trained with ERM often exhibit a gap between average and worst group performance at test time. Throughout this paper, we study the worst group accuracy (WGA), i.e., the lowest test accuracy across all the groups $\mathcal{G}$. For most problems considered in this paper, we assume that each data point has an attribute $s \in \mathcal{S}$ which is spuriously correlated with the label $y$, and the groups are defined by combinations of the label and the spurious attribute: $\mathcal{G} = \mathcal{Y} \times \mathcal{S}$. In the test distribution, $s$ may no longer be correlated with $y$, so a model that has learned to rely on the spurious feature $s$ during training will perform poorly at test time. Models that rely on spurious features typically achieve poor worst group accuracy, while models that rely on core features have more uniform accuracies across the groups. In Appendix A, we describe the groups, spurious features and core features in the datasets that we use in this paper.
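To make the metric concrete, the sketch below shows how WGA can be computed from model predictions and group annotations; this is a minimal NumPy illustration under our notation, and the function name `worst_group_accuracy` is our own rather than part of any particular codebase.

```python
import numpy as np

def worst_group_accuracy(preds, labels, attrs):
    """Lowest test accuracy over the groups (y, s) in Y x S.

    preds, labels, attrs: 1-D integer arrays of equal length, holding
    predicted classes, true classes and spurious attributes.
    """
    accs = []
    for y in np.unique(labels):
        for s in np.unique(attrs):
            mask = (labels == y) & (attrs == s)
            if mask.any():  # skip (y, s) combinations absent from the data
                accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)
```

For instance, with binary labels and a binary spurious attribute there are four groups, and WGA is the accuracy on the worst of the four.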
In order to perform controlled experiments, we assume that we have access to the spurious attributes $s$ (or group labels) for the training or validation data, which we use to train group robustness baselines and to evaluate feature quality. However, we emphasize that our results on the features learned by ERM hold generally, even when spurious attributes are unknown, as ERM does not use any information about the spurious features: we only use the spurious attributes to perform analysis.
Deep feature reweighting.
Suppose we are given a model $m: \mathcal{X} \to \mathcal{C}$, where $\mathcal{X}$ is the input space and $\mathcal{C}$ is the set of classes. Kirichenko et al. [40] assume that the model $m$ consists of a feature extractor (typically, a sequence of convolutional or transformer layers) followed by a classification head (typically, a single linear layer): $m = h \circ e$, where $e: \mathcal{X} \to \mathcal{F}$ is the feature extractor and $h: \mathcal{F} \to \mathcal{C}$ is the classification head. They discard the classification head and use the feature extractor $e$ to compute the set of embeddings $\hat{\mathcal{D}}_e = \{(e(x_i), y_i)\}_{i=1}^n$ of all the datapoints in the reweighting dataset $\hat{\mathcal{D}}$; the reweighting dataset is used to retrain the last layer of the model, and contains group-balanced data where the spurious correlation does not hold. Finally, they train a logistic regression classifier $l: \mathcal{F} \to \mathcal{C}$ on the dataset $\hat{\mathcal{D}}_e$.^1 The final model used on new test data is then the composition $l \circ e$.
^1 For stability, the logistic regression models are trained 10 times on different random group-balanced subsets of the reweighting dataset $\hat{\mathcal{D}}$, and the weights of the learned logistic regression models are averaged. See Appendix B of Kirichenko et al. [40] for full details on the DFR procedure.
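To make the procedure concrete, the sketch below implements DFR last-layer retraining with scikit-learn, assuming the embeddings of the reweighting dataset have already been computed with $e$; the function name `dfr_head` is our own, and we omit the feature preprocessing and regularization-strength tuning of the full procedure (see Appendix B of Kirichenko et al. [40]).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dfr_head(feats, labels, attrs, n_runs=10, seed=0):
    """Sketch of DFR: fit a logistic regression head on group-balanced
    subsets of the reweighting embeddings and average the learned
    weights over n_runs random subsets for stability (see footnote 1).
    """
    rng = np.random.default_rng(seed)
    # Assign each point to its group g = (y, s).
    groups = np.stack([labels, attrs], axis=1)
    uniq, gid = np.unique(groups, axis=0, return_inverse=True)
    n_min = np.bincount(gid).min()  # size of the smallest group

    coefs, intercepts = [], []
    for _ in range(n_runs):
        # Subsample n_min points from every group, without replacement.
        idx = np.concatenate([
            rng.choice(np.where(gid == g)[0], n_min, replace=False)
            for g in range(len(uniq))
        ])
        clf = LogisticRegression(max_iter=1000).fit(feats[idx], labels[idx])
        coefs.append(clf.coef_)
        intercepts.append(clf.intercept_)

    # Average the linear heads; the averaged head l replaces the
    # original classification head h.
    head = LogisticRegression(max_iter=1000)
    head.coef_ = np.mean(coefs, axis=0)
    head.intercept_ = np.mean(intercepts, axis=0)
    head.classes_ = clf.classes_
    return head
```

Averaging over several group-balanced subsets reduces the variance introduced by subsampling down to the size of the smallest group.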