Text Characterization Toolkit
Daniel Simig*, Tianlu Wang*, Verna Dankers*+, Peter Henderson‡*,
Khuyagbaatar Batsuren†, Dieuwke Hupkes*, Mona Diab*
*Meta AI, +University of Edinburgh, †National University of Mongolia, ‡Stanford University
{danielsimig,dieuwkehupkes,mdiab}@fb.com
Abstract
In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that – especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations – deeper results analysis should become the de-facto standard when presenting new models or benchmarks. We present a tool that researchers can use to study properties of a dataset and the influence of those properties on their models' behaviour. Our Text Characterization Toolkit includes both an easy-to-use annotation tool and off-the-shelf scripts that can be used for specific analyses. We also present use cases from three different domains: we use the tool to predict which examples are difficult for given well-known trained models and to identify (potentially harmful) biases and heuristics that are present in a dataset.
1 Introduction
NLP technology has progressed tremendously over the recent decades, with significant advances in algorithms and modeling. Yet, by comparison, our understanding of the datasets that contribute to model performance (including all dataset types in the model life cycle: training, validation, and evaluation) lags behind significantly. This is mostly due to the lack of frameworks, methods, and tools for drawing insights from datasets, especially at scale.
Most NLP models, to date, are evaluated using a relatively small number of readily available evaluation benchmarks that are often created automatically or via crowd-sourcing (e.g. Bowman et al., 2015; Wang et al., 2018; Williams et al., 2018; Zellers et al., 2018). It is well known that most popular (evaluation) datasets are rife with biases, dataset artefacts, and spurious correlations, and are prone to be solved with shortcuts (Gardner et al., 2021; Kiela et al., 2021). Presenting models with adversarial examples for which those biases or correlations do not hold often results in stark performance drops (e.g. Linzen, 2020; McCoy et al., 2019; Jia and Liang, 2017; Chen et al., 2016; Poliak et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019). At best, using datasets with such known issues might result in overestimation of a model's capability on the task in question, which may not be reflective of how well it can execute this task in more realistic scenarios. More worrying, however, is that training or finetuning on datasets that contain biases and artefacts may result in models implementing undesired, biased behaviour (e.g. Rudinger et al., 2018; Blodgett et al., 2016).
Additionally, datasets are usually treated as homogeneous collections of text, performance on which is captured in a single number – even though there is often a substantial difference between the difficulty/complexity of different examples in a dataset (e.g. Sugawara et al., 2022). Research papers rarely report thorough analyses of performance broken down by characteristics of the dataset examples, ignoring underlying patterns that the performance numbers may reflect. The problem is exacerbated by the pervasiveness of benchmarks coupled with a competitive leaderboard culture, where what counts most is system rank.
In part, this may be due to the fact that deeper analysis of results – especially when a number of different datasets are involved – is complex and time-consuming, and there are no standard frameworks or protocols that practitioners can resort to. The problem is even more pervasive when we curate datasets for development and evaluation: how we curate, create, and select data plays a critical role in understanding our models. Many NLP models (even beyond text) require up- or down-sampling of specific types of data. These processes should rely on a principled characterization of the data for any given model.
Towards this end, we believe that the existence of a standard toolkit that provides an easy-to-use set