
but they only apply to a narrow subset of tasks.
In this work, we propose Likelihood Splits, a
general-purpose method to create challenging OOD
splits for existing datasets. The principle behind
Likelihood Splits is that any system that claims
to reliably process natural language must be able
to generalize from more common utterances seen
during training to the long tail of rare utterances
at test time. Generalization, not merely memoriza-
tion, is necessary because even a very large training
dataset cannot exhaustively cover all possible long-
tail examples that may be encountered in the real
world. Moreover, standard annotation procedures
tend to over-sample examples from the head of the
distribution, further ignoring the challenge posed
by infrequent examples. We identify tail exam-
ples using the likelihood under the GPT-2 language
model (Radford et al., 2019). Examples with low
likelihood under GPT-2 are placed in the held-out
evaluation sets and the high likelihood examples
are used as the training set (see Figure 1).
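
As a concrete illustration, the sketch below (not taken from the paper) scores each utterance by a length-normalized GPT-2 log-likelihood using the Hugging Face transformers library and routes the lowest-scoring examples to the held-out evaluation set. The normalization choice, the 20% evaluation fraction, and the "utterance" field name are illustrative assumptions rather than the authors' exact configuration.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_likelihood(text: str) -> float:
    # model(...).loss is the mean per-token cross-entropy, i.e. the
    # negative log-likelihood averaged over the sequence.
    ids = tokenizer(text, return_tensors="pt").input_ids
    return -model(ids, labels=ids).loss.item()

def likelihood_split(examples, eval_fraction=0.2):
    # Sort from least to most likely under GPT-2: the tail (lowest
    # likelihood) becomes the held-out evaluation set, the head the
    # training set.
    scored = sorted(examples, key=lambda ex: avg_log_likelihood(ex["utterance"]))
    n_eval = int(len(scored) * eval_fraction)
    return scored[n_eval:], scored[:n_eval]  # (train_set, eval_set)

Averaging the log-likelihood over tokens is one reasonable way to avoid simply penalizing longer utterances; the splitting criterion used in the paper may differ.
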
Likelihood Splits offer a novel, widely applicable
strategy for creating interesting generalization
benchmarks at no additional annotation cost. They
are more challenging than a random split across
a wide range of tasks: error rates relative to ran-
dom splits increase by 59% for T5 (Raffel et al.,
2020) on SPIDER (Yu et al., 2018), 93% for ELEC-
TRA (Clark et al., 2020) on SNLI (Bowman et al.,
2015), and 33% for ROBERTA (Liu et al., 2019)
on BOOLQ (Clark et al., 2019). Moreover, the
proposed splits do not unfairly penalize the GPT-2
model used to create the splits when it is used as a
task model, thus avoiding one of the downsides of
adversarial filtering. We identify many distinct
challenges posed by Likelihood Splits, including
generalizing to rare words, complex programs, and
syntactically complex sentences. We encourage
future benchmark creators to release Likelihood
Splits as a complement to the standard IID eval-
uation in order to better test out-of-distribution
generalization performance. We will release the
splits discussed in this work along with the code to
easily create Likelihood Splits of other datasets.1
1 github.com/ameyagodbole/long-tail-likelihood-splits
2 Related Work
Generalizing to the long tail.
Evaluating sys-
tems on long-tail phenomena is important, espe-
cially because many datasets over-sample the head
of the distribution. For example, some question-
answering (QA) datasets limit their purview to pop-
ular web pages (Yang et al., 2018) or frequent user
queries (Kwiatkowski et al., 2019). Lewis et al.
(2021) and Liu et al. (2021) demonstrate that mod-
els trained on these datasets often fail on exam-
ples that do not match the most frequent training
cases. Similar observations have been made in en-
tity linking to rare entities (Orr et al., 2021; Chen
et al., 2021), information retrieval for open-domain
QA (Sciavolino et al., 2021), relation extraction
for rare relations (Sabo et al., 2021), and lexicon
induction for rare senses in machine translation
(Czarnowska et al., 2019). Zero-shot performance
of large LMs on numerical reasoning and factoid
questions is also correlated with the frequency of
occurrence of the facts in the pre-training corpus
(Razeghi et al., 2022; Kandpal et al., 2022; Elazar
et al., 2022). Rather than testing whether models
can memorize long-tail knowledge, we test
whether models can process long-tail sentences.
Naik et al. (2022) note that it is challenging to cata-
logue and evaluate generalization along micro-level
dimensions and instead propose benchmarks that
vary along macro-level dimensions (such as the lan-
guage and domain) as a proxy. We hypothesize that
LMs learn which micro-level phenomena are rare,
as this would improve their overall language mod-
eling objective. In this work, we present a recipe
that leverages LMs to evaluate tail generalization
for any language task.
Task-specific test sets.
Ribeiro et al. (2020) use
templated queries to evaluate model performance
under various linguistic perturbations. This method
requires dataset designers to define phenomena of
interest and axes of perturbation along which labels
may be preserved or changed. Naik et al. (2018)
analyze model errors and instantiate tests that ex-
plicitly evaluate models on more examples from
each error class. Gardner et al. (2020) check for
model consistency under local perturbations of test
set examples. All of these approaches require anno-
tators to create new examples, whereas we propose
a method to resplit existing datasets.
Adversarial approaches.
Søgaard et al. (2021)
argue that random splits overestimate model per-
formance on new in-domain data and recommend
the use of adversarial and heuristically challenging
splits to estimate generalizability. Adversarial data
collection promotes the creation of difficult exam-
ples by encouraging annotators to fool a model-