Text Characterization Toolkit
Daniel Simig*, Tianlu Wang*, Verna Dankers*+, Peter Henderson‡*,
Khuyagbaatar Batsuren†, Dieuwke Hupkes*, Mona Diab*
*Meta AI, +University of Edinburgh, †National University of Mongolia, ‡Stanford University
{danielsimig,dieuwkehupkes,mdiab}@fb.com
Abstract
In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that – especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations – deeper results analysis should become the de-facto standard when presenting new models or benchmarks. We present a tool that researchers can use to study properties of a dataset and the influence of those properties on their models' behaviour. Our Text Characterization Toolkit includes both an easy-to-use annotation tool and off-the-shelf scripts that can be used for specific analyses. We also present use cases from three different domains: we use the tool to predict which examples are difficult for well-known trained models, and to identify (potentially harmful) biases and heuristics that are present in a dataset.
1 Introduction
NLP technology has progressed tremendously over recent decades, with significant advances in algorithms and modeling. Yet, by comparison, our understanding of the datasets that contribute to model performance (including all dataset types in the model life cycle: training, validation, and evaluation) lags behind significantly. This is mostly due to the lack of frameworks, methods, and tools to draw insights from datasets, especially at scale.
Most NLP models, to date, are evaluated using a relatively small number of readily available evaluation benchmarks that are often created automatically or via crowd-sourcing (e.g. Bowman et al., 2015; Wang et al., 2018; Williams et al., 2018; Zellers et al., 2018). It is well known that most popular (evaluation) datasets are rife with biases, dataset artefacts, and spurious correlations, and are prone to being solved with shortcuts (Gardner et al., 2021; Kiela et al., 2021). Presenting models with adversarial examples for which those biases or correlations do not hold often results in stark performance drops (e.g. Linzen, 2020; McCoy et al., 2019; Jia and Liang, 2017; Chen et al., 2016; Poliak et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019). At best, using datasets with such known issues might result in overestimation of a model's capability on the task in question, which may not reflect how well it can execute the task in more realistic scenarios. More worrying, however, is that training or finetuning on datasets that contain biases and artefacts may result in models implementing undesired, biased behavior (e.g. Rudinger et al., 2018; Blodgett et al., 2016).
Additionally, datasets are usually treated as homogeneous collections of text, performance on which is captured in a single number – even though there is often a substantial difference between the difficulty/complexity of different examples in a dataset (e.g. Sugawara et al., 2022). Research papers rarely report thorough analyses of performance broken down by characteristics of the dataset examples, ignoring underlying patterns that performance numbers may reflect. The problem is exacerbated by the pervasiveness of benchmarks coupled with a competitive leaderboard culture, where what counts most is system rank.
In part, this may be because deeper analysis of results – especially when a number of different datasets is involved – is complex and time-consuming, and there are no standard frameworks or protocols that practitioners can resort to. The problem is even more pervasive when we curate datasets for development and evaluation. How we curate, create, and select data plays a critical role in understanding our models. Many NLP models (even beyond text) require up/down-sampling of specific types of data. These processes should rely on principled characterization of data for any given model.
Towards this end, we believe that such deeper analysis could become more commonplace given a standard toolkit that provides an easy-to-use set of tools and metrics, allowing researchers to analyze and systematically characterize the datasets involved in the model life cycle while gaining insights into the relationship between model performance and data properties.
In this paper, we introduce the Text Characterization Toolkit1 (TCT), which aims to enable researchers to gain a detailed understanding of the datasets and models they create – with minimal effort. TCT is inspired by the Coh-Metrix toolkit (Graesser et al., 2004), a collection of over 100 diverse text characteristics intended for text analysis in various applications. TCT offers these capabilities at scale by design. While TCT can process a dataset of 20,000 paragraphs in less than a minute using a single command on a MacBook Pro laptop, the very same library can, for instance, also be used as part of a PySpark pipeline to compute text characteristics for a full snapshot of Common Crawl2 (3.1B web pages) in a matter of hours. In this paper we present:
1. A repository of text metrics that can help reveal (hidden) patterns in datasets, coupled with model performance on these datasets;
2. A set of off-the-shelf analysis tools that researchers can use in a simple notebook to study properties of the dataset and the influence of those properties on model behaviour;
3. A framework that enables the community to share, reuse, and standardize metrics and analysis methods/tools;
4. Use cases that demonstrate the efficacy of TCT in practice, covering Language Model prompting, Translation, and Bias Detection.
With these contributions, we aspire to improve how we assess NLP models and to get closer to a scenario where providing detailed results analyses becomes the standard for NLP research.
2 The Text Characterization Toolkit
TCT consists of two main components:
• A framework for defining and computing text characteristics.
• A collection of analysis tools that help users interpret text characteristics and evaluate results with respect to these characteristics.
1 https://github.com/facebookresearch/text_characterization_toolkit
2 https://commoncrawl.org

Figure 1: The Text Characterization Toolkit extends model evaluation to provide insights about the role of data.

As illustrated by Figure 1, the workflow for extending a standard evaluation process with TCT is typically the following (a minimal code sketch of these steps follows the list):
• Given a dataset, define how to extract text fragments from each data point: for a QA dataset, text fragments could be individual questions, whereas in document summarization, the text fragments would be the documents themselves.
• Use TCT to compute characteristics of the text fragments. One might use the default characteristics already included in the framework or define their own specific metric.
• Load the computed characteristics and other evaluation-specific data into a Python notebook for analysis using TCT. One might analyze the dataset itself (e.g. to identify spurious correlations or biases) or jointly analyze model evaluation metrics and text characteristics (e.g. through correlation analysis between TCT features and models' test set accuracy).
• Use the results of the analysis to improve the dataset, the model, or the evaluation protocol – for example by extending evaluation data with examples where a model is expected to perform poorly, or by focusing evaluation on a challenging subset of the test data.
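To make this concrete, below is a minimal sketch of the workflow in Python on a toy QA dataset. Simple hand-rolled features and scikit-learn stand in for TCT's own metric and analysis helpers, so the names and structure are illustrative assumptions rather than the toolkit's actual API.

# Illustrative sketch of the TCT workflow; hand-rolled features and
# scikit-learn stand in for TCT's own metric and analysis helpers.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy QA dataset with per-example correctness of some evaluated model.
dataset = [
    {"question": "What is the capital of France?", "correct": 1},
    {"question": "Which enzyme catalyses the phosphorylation of glucose?", "correct": 0},
    {"question": "Who wrote Hamlet?", "correct": 1},
    {"question": "What quantity is conserved in an isentropic process?", "correct": 0},
]

# Step 1: extract one text fragment per data point (here: the question).
fragments = [ex["question"] for ex in dataset]

# Step 2: compute text characteristics for each fragment.
features = pd.DataFrame({
    "word_count": [len(f.split()) for f in fragments],
    "mean_word_length": [sum(len(w) for w in f.split()) / len(f.split()) for f in fragments],
})

# Step 3: jointly analyze characteristics and evaluation outcomes.
outcomes = [ex["correct"] for ex in dataset]
clf = LogisticRegression().fit(features, outcomes)
print(dict(zip(features.columns, clf.coef_[0])))

# Step 4: act on the analysis, e.g. flag fragments predicted to be hard.
hard = features[clf.predict_proba(features)[:, 1] < 0.5]
print(len(hard), "potentially hard examples")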
Concrete examples of the workflow above are described in §3 and in Appendix B. The rest of this section provides more details on the two main components of the framework.
2.1 Text Characteristics
While the majority of the characteristics found in TCT are motivated by metric classes in Coh-Metrix (Graesser et al., 2004), we have included new databases for existing metrics and added entirely new metrics. At the time of writing, there are 61 characteristics implemented in TCT. An overview of the main categories of currently implemented characteristics can be found in Table 1.

Category           Example Metrics
Descriptive        Word Count, Sentence Length
Lexical Diversity  Type-Token Ratio, MTLD
Complexity         Left Embeddedness, # of NP modifiers
Incidence Scores   Different POS tags, Types of connectives
Word Property      Age of Acquisition, Concreteness

Table 1: Categories of characteristics currently implemented. See Appendix A for an exhaustive list.

The toolkit provides a standardized framework to implement,
configure, and compute these metrics. Adding a new metric is as simple as implementing two functions: one that loads any required resource (such as a word database) and initializes computation, and one that computes the metric given these resources and an input text.
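As an illustration of this two-function pattern, here is a minimal sketch of a custom type-token-ratio metric. The function names are assumptions made for brevity; the real toolkit wires such functions into its own framework.

# Hypothetical two-function metric following the pattern described above;
# TCT's actual interface may differ in names and structure.

def load_resources():
    # Load any word databases or models the metric needs. The type-token
    # ratio needs none, so we return an empty dict.
    return {}

def compute_type_token_ratio(resources, text):
    # Ratio of unique word types to total tokens in the text fragment.
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

resources = load_resources()
print(compute_type_token_ratio(resources, "the cat sat on the mat"))  # ~0.83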
2.2 Analysis tools
To further decrease the effort required to carry out analyses based on text characteristics, we provide an initial set of analysis tools that users can use out of the box. We encourage users to contribute their own implementations of TCT-based analyses to the toolkit, to allow for re-use in the future development of datasets and models. The current functionality of the toolkit, as illustrated in Figure 2, includes:
1. Visualising distributions of different characteristics;
2. Visualising a pairwise correlation matrix for the characteristics;
3. Visualising correlations between individual characteristics and outcomes (e.g., accuracy);
4. Fitting a model that maps all characteristics to outcomes (logistic regression and random forests are currently supported) and analyzing that model's predictive power and coefficients; a sketch of this kind of analysis follows the list.
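The sketch below shows the gist of item 4 with plain scikit-learn on synthetic data; TCT wraps this kind of regression analysis in its own helpers, so nothing here should be read as the toolkit's API.

# Fit a model from text characteristics to a binary outcome and inspect its
# predictive power and coefficients. Plain scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Two synthetic characteristics and an outcome that depends (noisily) on the first.
word_count = rng.integers(5, 60, size=n)
concreteness = rng.normal(3.0, 0.5, size=n)
outcome = (word_count + rng.normal(0, 10, size=n) < 30).astype(int)

X = np.column_stack([word_count, concreteness])
clf = LogisticRegression(max_iter=1000).fit(X, outcome)

print("CV accuracy:", cross_val_score(clf, X, outcome, cv=5).mean())
print("coefficients:", dict(zip(["word_count", "concreteness"], clf.coef_[0])))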
3 Example Use Cases
In order to demonstrate the ability of TCT to produce meaningful and actionable insights, we provide three examples of its use on real-world data. For each of these use cases, a thorough description of the experimental setup and results is included in Appendix B, and reference notebooks are provided in the examples directory of the TCT repository.
Figure 2: TCT analysis tools in action: (a) correlations between text characteristics; (b) model performance w.r.t. some characteristics; (c) results of a regression analysis (coefficients and fit). See Appendix B for detailed explanations and high-resolution images.
Predicting Accuracy of OPT Baselines
We use the logistic regression analysis tool to fit a model that predicts the accuracy of the 6.7B OPT model (Zhang et al., 2022) on the HellaSwag task (Zellers et al., 2019) based on simple characteristics such as mean word length and concreteness. Using this model we identify subsets of the test data with precision as low as 40% and as high as 90%.
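A rough sketch of how such subsets can be surfaced is shown below; synthetic per-example correctness stands in for real OPT evaluation results, and the column names are illustrative assumptions.

# Bucket test examples by a fitted model's predicted difficulty and compare
# observed accuracy per bucket. Data and column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "mean_word_length": rng.normal(4.5, 0.8, size=n),
    "concreteness": rng.normal(3.0, 0.6, size=n),
})
# Synthetic stand-in for per-example correctness of the evaluated model.
df["correct"] = (rng.random(n) < 1 / (1 + np.exp(df["mean_word_length"] - 4.5))).astype(int)

features = df[["mean_word_length", "concreteness"]]
clf = LogisticRegression(max_iter=1000).fit(features, df["correct"])

# Split the test set into quartiles of predicted probability of success.
df["p_correct"] = clf.predict_proba(features)[:, 1]
df["bucket"] = pd.qcut(df["p_correct"], 4, labels=["hardest", "hard", "easy", "easiest"])
print(df.groupby("bucket", observed=True)["correct"].mean())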
Gender Bias in Co-reference Resolution
By computing genderedness metrics on co-reference labels and using these metrics as inputs to the analysis tools, we reproduce the results of Zhao et al. (2018), showing that models perform much worse when the stereotypically associated gender of an occupation does not match the gender of the pronominal reference.
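In its simplest form, such a comparison reduces to grouping examples by whether the pronoun matches the occupation's stereotypical gender; the toy rows and column names below are illustrative only.

# Compare model accuracy on pro- vs. anti-stereotypical coreference examples.
# Toy data; real genderedness scores would come from a computed characteristic.
import pandas as pd

examples = pd.DataFrame({
    "occupation": ["nurse", "nurse", "developer", "developer"],
    "pronoun_gender": ["female", "male", "male", "female"],
    "stereotype_gender": ["female", "female", "male", "male"],
    "correct": [1, 0, 1, 0],
})
examples["pro_stereotypical"] = (
    examples["pronoun_gender"] == examples["stereotype_gender"]
)
print(examples.groupby("pro_stereotypical")["correct"].mean())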
Fluctuations in Translation Performance
We show how translation performance of the NLLB model (Costa-jussà et al., 2022), used through the HuggingFace pipeline (Wolf et al., 2019), fluctuates as a function of sample characteristics, like the number of sentences. This performance heterogeneity can be fixed by segmenting samples into sentences before using the pipeline, showing that TCT can help debug model pipelines even with many layers of abstraction.
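A sketch of that segmentation workaround is shown below. The checkpoint name and language codes correspond to the public NLLB release, but the exact pipeline arguments and the naive regex splitter are assumptions to keep the example self-contained.

# Translate a multi-sentence sample one sentence at a time instead of as a
# single long input. A naive regex splitter stands in for a proper sentence
# segmenter; pipeline arguments are assumptions based on the public NLLB models.
import re
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
)

sample = ("The toolkit computes text characteristics at scale. "
          "It also ships a set of ready-made analysis tools.")
sentences = [s for s in re.split(r"(?<=[.!?])\s+", sample.strip()) if s]

outputs = translator(sentences)
print(" ".join(o["translation_text"] for o in outputs))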
4 Related Work
Multiple existing tools offer functionality similar to TCT's: DataLab (Xiao et al., 2022) is a tool for detailed data analysis that, among other things, allows users to inspect datasets through the lens of a few text characteristics such as text length, lexical diversity, and gender-related features. The Know Your Data3 tool allows for inspection of image data; it surfaces spurious correlations, biases, and imbalances in datasets. However, neither tool connects model behavior to properties of datasets. Collins et al. (2018) predict the overall hardness of classification datasets based on label statistics and a few text characteristics such as readability and lexical diversity. ExplainaBoard (Liu et al., 2021) focuses on model performance analysis and provides a model performance breakdown by simple attributes of data points such as text length, providing the functionality most similar to our work.
Our toolkit distinguishes itself by including a much wider range of text characteristics and multi-variable analysis tools that can identify larger variations in model accuracy. By packaging our toolkit as a simple Python library used in notebooks – in contrast to the previously described feature-rich systems – we also intend to minimize the effort needed both to use it and to contribute to it (crowd-sourcing more functionality).
The Coh-Metrix tool (Graesser et al., 2004) collected the most diverse set of text characteristics to our knowledge, designed for various use cases in linguistics and pedagogy. The tool, developed in 2004, is slow because it is designed to process a single document at a time, is relatively difficult to access, and relies on word databases that are outdated. Our toolkit aims to make a subset of these metrics easily accessible to the NLP community.
5 Future Work
As illustrated in §2, we envision TCT to be a framework and an associated tool that allows for community contributions, crowd-sourcing even more functionality and use cases. Future work involves usage of the tool:
Firstly, we encourage creators of new datasets to use TCT as a data annotation tool, to extract a wide range of dataset statistics in a straightforward manner and report them in academic publications for transparency about the contents of their datasets. Such statistics could be included in datasheets and data cards (Gebru et al., 2021), and they can aid in outlier detection during data cleaning.
3 https://knowyourdata.withgoogle.com/
We also prompt dataset creators to perform statistical analyses capturing which features are predictive of the gold targets before training computational models, to ensure one is aware of potential shortcut-learning opportunities due to biases in the dataset (a minimal sketch of such a check follows this paragraph). Naturally, not all correlations are bad or avoidable – e.g. sentences containing the word ‘fantastic’ are likely to have a positive label in sentiment analysis – but others are good to be aware of when working with a dataset – e.g. consider a natural language inference task where all sentences with the label ‘entailed’ have an atypical average word length. Such analyses could be included in a ‘cautions’ section with a dataset’s release.
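One way to run such a check is sketched below on synthetic NLI-style data in which average word length leaks the label; the feature, the label construction, and the comparison against a majority baseline are all illustrative assumptions.

# Check whether a surface characteristic alone predicts the gold label,
# which would signal a shortcut-learning opportunity. Synthetic data.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 1000
labels = rng.integers(0, 2, size=n)  # 0 = not entailed, 1 = entailed
# Leaky characteristic: entailed hypotheses are systematically shorter here.
mean_word_length = rng.normal(5.0 - 0.6 * labels, 0.5, size=n)

X = mean_word_length.reshape(-1, 1)
feature_acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
majority_acc = cross_val_score(DummyClassifier(strategy="most_frequent"), X, labels, cv=5).mean()

print(f"majority baseline: {majority_acc:.2f}, single-feature model: {feature_acc:.2f}")
# A large gap suggests the label leaks into this characteristic.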
A third type of usage would be by owners of new models, who, on the one hand, use TCT to measure whether some dataset characteristics are predictive of success and failure of their model, and, on the other hand, report performance on subclasses of samples. One may already know that model performance is lower for longer sentences, but what about performance across different readability classes, classes with varying amounts of causal connectives, or different ratings for syntactic complexity (e.g. SYNLE)? TCT will help answer those questions. Understanding how model performance fluctuates across different data subsets provides further understanding of model robustness and can, in turn, improve dataset quality if model owners report back on biases identified in datasets. It should be noted that TCT could be an effective tool for data selection for both training and evaluation, in particular at scale.
Limitations
Text characteristics in our framework have varying levels of coverage depending on their type. Word-property-based characteristics, for example, are limited by the coverage of the word databases that back them – this can be limited even for English. While we plan to extend the framework to multiple languages in the near future, the language coverage of backing word databases and NLP pipelines such as WordNet (Miller, 1995) or SpaCy (Honnibal et al., 2020) will affect our ability to scale the number of languages supported.
References
Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019.