
High PN means that removing the feature from an example would flip the label (e.g. removing “not” will flip the label); high PS means that adding the feature to an example would produce the label (e.g. adding “the movie is brilliant” to a neutral review is likely to make it positive). Under this framework, we define two types of spurious features (Section 2): irrelevant features (e.g. the director name) that have low PN and low PS, and necessary features (e.g. the negation word) that have high PN despite low PS.
Next, we describe the challenges in evaluating and improving robustness to necessary spurious features (Section 4). First, necessary features compose with other features in the context to influence the label. Thus, evaluating whether the model relies solely on the necessary feature requires perturbing the context. This process often introduces new features and leads to inconsistent results depending on how the context is perturbed.
Second, we analyze the effectiveness of two classes of methods—data balancing and representation debiasing—on the two types of spurious features. Data balancing breaks the correlation between the label and the spurious feature (e.g. Sagawa et al. (2020)); representation debiasing directly removes the spurious feature from the learned representation (e.g. Ravfogel et al. (2020)). Although they are effective for irrelevant features, we show that for necessary spurious features, (i) data balancing does not lead to invariant performance with respect to the spurious feature (Section 5.1); and (ii) removing the spurious feature from the representation significantly hurts performance (Section 5.2).
In sum, this work provides a formal characterization of spurious features in natural language. We highlight that many common spurious features in NLU are necessary (despite not being sufficient) to predict the label, which introduces new challenges to both evaluation and learning.
2 Categorization of Spurious Features
2.1 Causal Models
We describe a structural causal model for text classification to illustrate the relation between different spurious features and the label. Let $X = (X_1, X_2, \ldots, X_n)$ denote a sequence of input words/features¹ and $Y$ the output label. We assume the data-generating model shown in Figure 1a. There is a common cause $C$ of the input (e.g. a review writer, a PCFG, or a semantic representation of the sentence), conditioned on which the words are independent of each other. Each word $X_i$ may causally affect the label $Y$.

¹For illustration purposes, we assume that each feature is a word in the input text. However, the same model and analysis apply to cases where $X_i$ denotes a more complex feature (e.g. named entities or text length) extracted from the input.
Under this model, the dependence between $Y$ and a feature $X_i$ can be induced by two processes. The type 1 dependence is induced by a confounder (in this case $C$) influencing both $Y$ and $X_i$ due to biases in data collection, e.g. search engines return positive reviews for famous movies; we denote this non-causal association by the red path in Figure 1b. The type 2 dependence is induced by input words that causally affect $Y$ (the red path in Figure 1c), e.g. negating an adjective affects the sentiment. Importantly, the two processes can and often do happen simultaneously. For example, in NLI datasets, the association between negation words and the label is also induced by crowdworkers’ inclination to negate the premise to create a contradiction example.
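As an illustration, the toy simulation below instantiates this data-generating process in Python; the variable names and probabilities are hypothetical. One word (“not”) causally flips the label (a type 2 dependence), while a collection bias that discards most negative reviews of famous movies ties “Titanic” to the label only through $C$ (a type 1 dependence).

import random

rng = random.Random(0)

def sample_example():
    # Common cause C: whether the review is about a famous movie.
    famous = rng.random() < 0.5
    # Given C, the words are conditionally independent.
    has_title = famous and rng.random() < 0.9   # "Titanic" appears
    brilliant = rng.random() < 0.8              # "brilliant" appears
    negated = rng.random() < 0.3                # "not" appears
    # Type 2 path: some words causally determine the label.
    label = brilliant != negated                # negation flips the sentiment
    return famous, has_title, negated, label

def collect_dataset(n=10_000):
    # Type 1 path: biased collection. Most negative reviews of famous
    # movies are dropped, so C influences both the words and Y.
    data = []
    while len(data) < n:
        famous, has_title, negated, label = sample_example()
        if famous and not label and rng.random() < 0.7:
            continue
        data.append((has_title, negated, label))
    return data

def positive_rate(rows):
    return sum(label for *_, label in rows) / len(rows)

data = collect_dataset()
print(positive_rate([r for r in data if r[0]]))      # with "Titanic": inflated (type 1)
print(positive_rate([r for r in data if not r[0]]))  # without "Titanic": near the base rate
print(positive_rate([r for r in data if r[1]]))      # with "not": deflated (type 2)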
A type 1 dependence (“Titanic”-sentiment) is clearly spurious because the feature and $Y$ are associated through $C$ while having no causal relationship.² In contrast, a type 2 dependence (“not”-sentiment) is not spurious per se—even a human needs to rely on negation words to predict the label.
Now, how do we measure and differentiate the two types of feature-label dependence? In the following, we describe fine-grained notions of the relationship between a feature and a label, which will allow us to define the spuriousness of a feature.
2.2 Sufficiency and Necessity of a Feature
We borrow notions from causality to describe whether a feature is a necessary or sufficient cause of a label (Pearl, 1999; Wang and Jordan, 2021). Consider the examples in Table 1: intuitively, “not” is necessary for the contradiction label because in its absence (e.g. removing it or replacing it with other syntactically correct words) the example would no longer be a contradiction; in contrast, “the movie is brilliant” is sufficient to produce the positive label because adding the sentence to a negative review is likely to increase its sentiment score. Thus, the feature’s effect on the label relies on counterfactual outcomes.
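These intuitions can be made precise by specializing the counterfactual definitions of Pearl (1999) to a binary feature indicator; the following is a sketch for $X_i \in \{0, 1\}$ (feature absent/present), where $Y_{X_i = v}$ denotes the label the example would have if $X_i$ were set to $v$:

\[
\mathrm{PN} = P\big(Y_{X_i = 0} \neq y \mid X_i = 1,\ Y = y\big), \qquad
\mathrm{PS} = P\big(Y_{X_i = 1} = y \mid X_i = 0,\ Y \neq y\big).
\]

In words, PN is the probability that removing the feature from an example labeled $y$ would change the label, and PS is the probability that adding the feature to an example not labeled $y$ would produce $y$.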
²The two types of dependencies are also discussed in Veitch et al. (2021), where the type 1 dependence is called “purely spurious”.