
Robustifying Sentiment Classification
by Maximally Exploiting Few Counterfactuals
Maarten De Raedt♢♣  Fréderic Godin♢  Chris Develder♣  Thomas Demeester♣
♢Sinch Chatlayer ♣Ghent University
{maarten.deraedt, chris.develder, thomas.demeester}@ugent.be
frederic.godin@sinch.com
Abstract
For text classification tasks, finetuned language models perform remarkably well. Yet, they tend to rely on spurious patterns in training data, thus limiting their performance on out-of-distribution (OOD) test data. Among recent models aiming to avoid this spurious pattern problem, adding extra counterfactual samples to the training data has proven to be very effective. Yet, counterfactual data generation is costly since it relies on human annotation. Thus, we propose a novel solution that only requires annotation of a small fraction (e.g., 1%) of the original training data, and uses automatic generation of extra counterfactuals in an encoding vector space. We demonstrate the effectiveness of our approach in sentiment classification, using IMDb data for training and other sets for OOD tests (i.e., Amazon, SemEval and Yelp). We achieve noticeable accuracy improvements by adding only 1% manual counterfactuals: +3% compared to adding +100% in-distribution training samples, +1.3% compared to alternate counterfactual approaches.
1 Introduction and Related Work
For a wide range of text classification tasks, finetuning large pretrained language models (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020; Lewis et al., 2020) on task-specific data has proven very effective. Yet, analysis has shown that their predictions tend to rely on spurious patterns (Poliak et al., 2018; Gururangan et al., 2018; Kiritchenko and Mohammad, 2018; McCoy et al., 2019; Niven and Kao, 2019; Zmigrod et al., 2019; Wang and Culotta, 2020), i.e., features that from a human perspective are not indicative of the classifier's label. For instance, Kaushik et al. (2019) found the rather neutral words "will", "my" and "has" to be important for a positive sentiment classification. Such reliance on spurious patterns was suspected to degrade performance on out-of-distribution (OOD)
[Figure: original source documents (+/−) and manually created counterfactuals (each paired with an original), shown as points in a document representation vector space, together with artificially created counterfactual representations.]
Fig. 1: We propose to generate counterfactuals in representation space, learning, from only a few manually created counterfactuals, a mapping function t to transform a document representation φ(x) to a counterfactual one (having the opposite classification label). Illustration for positively labeled originals only.
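The mapping in Fig. 1 can be illustrated with a minimal sketch. Here we assume a simple least-squares linear map fitted on the few annotated (original, counterfactual) representation pairs; the paper's actual encoder φ and mapping function t may differ, and random vectors stand in for document representations.

```python
import numpy as np

def fit_counterfactual_map(orig_reps, cf_reps):
    """Learn a least-squares linear map (with bias) that sends the few
    annotated original representations to their counterfactual ones."""
    X = np.hstack([orig_reps, np.ones((len(orig_reps), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, cf_reps, rcond=None)
    return W

def apply_counterfactual_map(W, reps):
    """Map unpaired original representations to artificial counterfactual
    representations; these get the opposite classification label."""
    X = np.hstack([reps, np.ones((len(reps), 1))])
    return X @ W

# Toy demo: random vectors as stand-ins for encoded documents phi(x).
rng = np.random.default_rng(0)
d = 16
orig = rng.normal(size=(8, d))         # few originals with manual counterfactuals
cf = orig + rng.normal(size=(d,))      # their annotated counterfactual encodings
W = fit_counterfactual_map(orig, cf)
# Augment the remaining (unannotated) training representations.
new_cf = apply_counterfactual_map(W, rng.normal(size=(100, d)))
```

A classifier can then be trained on the original representations plus the artificial counterfactual ones with flipped labels, mirroring how manually created counterfactuals are used in CAD.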
test data, distributionally different from training data (Quiñonero-Candela et al., 2008). Specifically for sentiment classification, this suspicion has been confirmed by Kaushik et al. (2019, 2020).
For mitigating the spurious pattern effect, generic methods include regularization of masked language models, which limits over-reliance on a limited set of keywords (Moon et al., 2021). Alternatively, to improve robustness in imbalanced data settings, additional training samples can be automatically created (Han et al., 2021). Other approaches rely on adding extra training data by human annotation. Specifically to avoid spurious patterns, Kaushik et al. (2019) proposed Counterfactually Augmented Data (CAD), where annotators minimally revise training data to flip their labels: training on both original and counterfactual samples reduced spurious patterns. Rather than editing existing samples, Katakkar et al. (2021) propose to annotate them with text spans supporting the assigned labels as a "rationale" (Pruthi et al., 2020; Jain et al., 2020), thus achieving increased performance on OOD data. Similar in spirit, Wang and Culotta (2020) have an expert annotate spurious vs. causal sentiment words and use word-level classification (spurious vs. genuine) to train
arXiv:2210.11805v1 [cs.CL] 21 Oct 2022