on group information during training, which can
be expensive to obtain and unavailable for less pop-
ular datasets. On the other hand, methods such
as DRO with Conditional Value-at-Risk (CVaR
DRO, Duchi et al. 2019; Levy et al. 2020), Learn-
ing from Failure (LfF, Nam et al. 2020), Predict
then Interpolate (PI, Bao et al. 2021), Spectral De-
coupling (SD, Pezeshki et al. 2021), Just Train
Twice (JTT, Liu et al. 2021), and RWY and SUBY from Idrissi et al. (2022) all aim to minimize worst-group loss without group information.
CVaR DRO minimizes worst-case loss over all
subpopulations of a specific size and requires com-
puting the worst-case loss at each step. LfF trains
an intentionally biased model and upweights the
minority examples. PI interpolates distributions
of correct and incorrect predictions and can min-
imize worst-case loss over all interpolations. SD
replaces the L2 weight decay term in the cross-entropy loss with an L2 penalty on the logits. RWY reweights sam-
pling probabilities so that mini-batches are class-
balanced. SUBY subsamples large classes so that
every class is the same size as the smallest class.
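For concreteness, the two class-balancing baselines can be sketched as follows (a minimal Python sketch in our own notation; the function names and the get_label accessor are ours, and Idrissi et al. 2022 describe the exact procedures):

    import random
    from collections import defaultdict

    def suby(train_set, get_label):
        # SUBY: subsample every class down to the size of the smallest class.
        by_class = defaultdict(list)
        for ex in train_set:
            by_class[get_label(ex)].append(ex)
        n_min = min(len(exs) for exs in by_class.values())
        return [ex for exs in by_class.values()
                for ex in random.sample(exs, n_min)]

    def rwy_weights(train_set, get_label):
        # RWY: per-example sampling weight inversely proportional to class
        # size, so mini-batches are class-balanced in expectation (e.g., fed
        # to a weighted random sampler).
        counts = defaultdict(int)
        for ex in train_set:
            counts[get_label(ex)] += 1
        return [1.0 / counts[get_label(ex)] for ex in train_set]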
JTT simply obtains misclassified examples (the error set) from the training set once and upweights the fixed set of erroneous examples. We focus on JTT due to its simplicity and relative effectiveness and because it does not require group information for improving worst-group accuracy. While Idrissi et al. (2022)’s SUBY and RWY also follow JTT in
improving worst-group accuracies, their methods
target only datasets with imbalanced classes, and
are not applicable to class-balanced datasets such
as MultiNLI (Williams et al., 2018).
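The two-phase JTT procedure itself can be sketched in the same style (a simplified sketch: train_erm stands for ordinary ERM training, and upweighting is shown as duplication, following the high-level description in Liu et al. 2021):

    def jtt(train_set, train_erm, lambda_up):
        # Phase 1: train an identification model with plain ERM.
        ident_model = train_erm(train_set)
        # The error set: training examples the identification model
        # misclassifies, collected once and then held fixed.
        error_set = [ex for ex in train_set
                     if ident_model.predict(ex.x) != ex.y]
        # Phase 2: retrain from scratch with the error set upweighted,
        # so each erroneous example appears lambda_up times in total.
        upweighted = list(train_set) + (lambda_up - 1) * error_set
        return train_erm(upweighted)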
We propose further enhancing JTT by removing
outliers from the error set before upweighting it.
The outliers might be examples that are difficult
to learn, such as annotation errors. Keeping them
from being upweighted allows the model to train on
a cleaner error set and thus better show the intended
effect of the original
JTT. We focus on the worst-group performance degradation caused by spurious correlations involving negation words and evaluate on datasets susceptible to this type of correlation. Our
experiments on the FEVER and MultiNLI datasets
show that our method can outperform
JTT in terms
of either the average or the worst-group accuracy
while maintaining the same level of performance
for the other groups.
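In terms of the sketch above, the proposed change inserts one filtering step between the two phases (detect_outliers is a hypothetical placeholder here; the concrete detection criterion is the one described in Section 3):

    def jtt_with_outlier_removal(train_set, train_erm, lambda_up,
                                 detect_outliers):
        # Phase 1 is unchanged: collect the ERM model's training errors.
        ident_model = train_erm(train_set)
        error_set = [ex for ex in train_set
                     if ident_model.predict(ex.x) != ex.y]
        # New step: drop suspected outliers (e.g., annotation errors) so
        # that only a cleaner error set is upweighted. detect_outliers
        # returns the indices of outliers within error_set.
        outlier_idx = set(detect_outliers(error_set))
        cleaned = [ex for i, ex in enumerate(error_set)
                   if i not in outlier_idx]
        # Phase 2 as in JTT, but over the cleaned error set.
        upweighted = list(train_set) + (lambda_up - 1) * cleaned
        return train_erm(upweighted)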
Our contributions are as follows. We devise a
method for improving worst-group accuracy with-
out group information during training based on
JTT (Section 3). We show that by removing out-
liers from the error set being upweighted, we can
achieve similar or better overall and worst-group
performance (Section 4.2). Our examination of the
outliers being removed also suggests that the im-
provement may come from removing annotation
errors in the upweighted error set (Section 4.3).
2 Background
Spurious correlations and minority groups
We investigate the spurious correlations occurring
in two natural-language datasets: FEVER (Thorne et al., 2018) and MultiNLI (Williams et al., 2018).
The task for FEVER involves retrieving docu-
ments related to a given claim, finding sentences that serve as evidence, and then clas-
sifying the claim on the basis of the evidence into
three classes: SUPPORTS (SUP), REFUTES (REF),
or NOT ENOUGH INFORMATION (NEI). We fo-
cus on improving the worst-group classification
performance for the final part of the task. The
task for MultiNLI is to classify whether the hy-
pothesis is entailed by, neutral with respect to, or contradicted by the premise. We use Schuster et al.
(2021)’s preprocessing of both datasets, contain-
ing 178,059/11,620/11,710 training/dev/test exam-
ples for FEVER and 392,702/9,832 training/test
examples for MultiNLI.
Attributes known to cause spurious correlations
for these datasets are negation words (Gururangan et al., 2018) and verbs that suggest negating actions (Schuster et al., 2019). We merge these two sources
of negation words into a single set: {no, never,
nothing, nobody, not, yet, refuse, refuses, refused,
fail, fails, failed, only, incapable, unable, neither,
none}. Each class can be split into two groups
based on whether each claim/hypothesis contains
a spurious attribute (i.e., the negation words listed
above). Models tend to perform well on groups
where the attributes are highly correlated with the
label. Groups where the correlation between the
label and the attribute does not hold are called mi-
nority groups or worst groups, since models often
fail to classify their examples correctly. For exam-
ple, the claim “Luis Fonsi does not go by his given
name on stage.”, labeled SUPPORTS, belongs to the
worst group [SUP, neg].
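The group assignment amounts to the following check (a sketch; the whitespace tokenization is our simplification of the actual matching):

    # Negation words merged from Gururangan et al. (2018) and
    # Schuster et al. (2019), as listed above.
    NEGATION_WORDS = {
        "no", "never", "nothing", "nobody", "not", "yet", "refuse",
        "refuses", "refused", "fail", "fails", "failed", "only",
        "incapable", "unable", "neither", "none",
    }

    def group_of(label, text):
        # A group is a (label, attribute) pair, e.g. [SUP, neg] for
        # SUPPORTS claims that contain a negation word.
        has_neg = any(tok in NEGATION_WORDS for tok in text.lower().split())
        return (label, "neg" if has_neg else "no-neg")

    # group_of("SUP", "Luis Fonsi does not go by his given name on stage.")
    # -> ("SUP", "neg")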
Table 1(a) shows that most claims containing
negation are from the class REFUTES. The rela-
tively small number of examples from the groups