
2 Related Work
This work relates to a line of research using behavioral methods to understand model behavior (Ribeiro et al., 2020; Elazar et al., 2021a; Vig et al., 2020), and more specifically to the extent to which models rely on specific heuristics for prediction (Poliak et al., 2018; Naik et al., 2018; McCoy et al., 2019; Jacovi et al., 2021; Elazar et al., 2021b). The use of heuristics such as lexical overlap also typically points to the lexical under-sensitivity of models (Welbl et al., 2020), which has been studied in different setups (Iyyer et al., 2018; Jia and Liang, 2017; Gan and Ng, 2019; Misra et al., 2020; Belinkov and Bisk, 2018; Ribeiro et al., 2020; Ebrahimi et al., 2018; Bandel et al., 2022).
Our work suggests that the size of PLMs affects their inductive bias (Haussler, 1988) toward the preferred strategy. Studies of inductive biases in NLP have gained attention recently (Dyer et al., 2016; Battaglia et al., 2018; Dhingra et al., 2018; Ravfogel et al., 2019; McCoy et al., 2020a). While Warstadt et al. (2020) studied the effect of the amount of pre-training data on linguistic generalization, we show that additional training iterations and larger model sizes affect models' inductive biases with respect to the use of lexical overlap heuristics.
Finally, Tu et al. (2020) show that the adoption of heuristics can be explained by the ability to generalize from the few examples that are not aligned with the heuristics. They show that larger model size and longer fine-tuning can marginally increase a model's ability to generalize from minority groups, an insight also discussed by Djolonga et al. (2021). Our exclusive focus on lexical heuristics, together with a robust, fine-grained experimental setup, leads us to conclude, unlike Tu et al. (2020), that the change in lexical-heuristic adoption is consistent and non-marginal.
3 Measuring Reliance on Lexical Overlap
Definition: Reliance on Lexical Overlap
We say that a model makes use of lexical heuristics if it relies on superficial lexical cues in the text by making use of their identity rather than the semantics of the text.
3.1 Experimental Design
We focus on tasks that involve some inference over text pairs. We estimate the usage of the lexical-overlap heuristic by training a model on a standard training set for the task, and then inspecting the model's predictions on a diagnostic set containing high lexical overlap instances,² where half of the instances are consistent with the heuristic and half are inconsistent (e.g., the first two rows of each dataset in Figure 2, respectively). Models that rely on the heuristic will perform well on the consistent group and significantly worse on the inconsistent one.
Metric
We define the HEURistic score (HEUR) of a model on a diagnostic set as the difference between its performance on the consistent examples and its performance on the inconsistent examples. Higher HEUR values indicate heavier use of the lexical overlap heuristic (bad), while lower values indicate lower reliance on the heuristic (good).
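As a minimal sketch (our own illustration, not the authors' released code), HEUR can be computed from per-example predictions given a flag marking whether each diagnostic example is consistent with the heuristic; all names below are hypothetical.

```python
# Sketch of the HEUR score: accuracy on heuristic-consistent examples minus
# accuracy on heuristic-inconsistent examples.
def heur_score(preds, golds, consistent):
    def accuracy(keep_consistent):
        hits = [p == g for p, g, c in zip(preds, golds, consistent)
                if c == keep_consistent]
        return sum(hits) / len(hits)
    return accuracy(True) - accuracy(False)

# A model that always follows the heuristic is right on consistent examples
# and wrong on inconsistent ones, yielding HEUR = 1.0.
print(heur_score(["ent", "ent"], ["ent", "non-ent"], [True, False]))  # 1.0
```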
3.2 Data
For the training sets we use the MNLI dataset (Williams et al., 2018) for Natural Language Inference (NLI; Dagan et al., 2005; Bowman et al., 2015), the Quora Question Pairs (QQP; Sharma et al., 2019) for paraphrasing, and SQuAD 2.0 (Rajpurkar et al., 2018) for Reading Comprehension (RC). The corresponding high lexical-overlap diagnostic sets are described below:
HANS
(McCoy et al., 2019) is an NLI dataset designed as a challenging test set for reliance on different heuristics. Models that predict entailment based solely on one of these heuristics will fail on the half of the examples where the heuristic does not hold. We focus on the lexical overlap heuristic part, termed HANS-Overlap.
PAWS
(Zhang et al., 2019) is a paraphrase detection dataset containing paraphrase and non-paraphrase pairs with high lexical overlap. The pairs were generated by controlled word swapping and back-translation, followed by a manual validation step to filter them. We use the Quora Question Pairs (PAWS-QQP) part of the dataset.
ALSQA
To test the use of the lexical overlap heuristic in Reading Comprehension models, we create a new test set: Analyzing Lexically Similar QA (ALSQA). We augment the SQuAD 2.0 dataset (Rajpurkar et al., 2018) by asking crowdworkers
² We define the lexical overlap between a pair of texts as the number of unique words that appear in both texts divided by the number of unique words in the shorter text. The definition for ALSQA is similar, but we consider lemmas instead of words, and ignore function words and proper names.
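To make the footnote's measure concrete, the following is a minimal sketch of the word-level overlap computation (our own illustration). It assumes whitespace tokenization and that the "shorter text" is the one with fewer unique words; the ALSQA variant (lemmas, with function words and proper names excluded) is not implemented here.

```python
def lexical_overlap(text_a: str, text_b: str) -> float:
    """Unique words shared by both texts, divided by the unique-word count of the shorter text."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    shared = words_a & words_b
    return len(shared) / min(len(words_a), len(words_b))

# An overlap of 1.0 means every unique word of the shorter text also appears
# in the longer one, even when the two texts differ in meaning.
print(lexical_overlap("the doctor saw the lawyer", "the lawyer saw the doctor"))  # 1.0
```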