Text Characterization Toolkit
Daniel Simig*, Tianlu Wang*, Verna Dankers*+, Peter Henderson‡*,
Khuyagbaatar Batsuren†, Dieuwke Hupkes*, Mona Diab*
*Meta AI, +University of Edinburgh, †National University of Mongolia, ‡Stanford University
{danielsimig,dieuwkehupkes,mdiab}@fb.com
Abstract
In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that – especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations – deeper results analysis should become the de-facto standard when presenting new models or benchmarks. We present a tool that researchers can use to study properties of a dataset and the influence of those properties on their models' behaviour. Our Text Characterization Toolkit includes both an easy-to-use annotation tool and off-the-shelf scripts that can be used for specific analyses. We also present use cases from three different domains: we use the tool to predict which examples are difficult for given well-known trained models and to identify (potentially harmful) biases and heuristics that are present in a dataset.
1 Introduction
NLP technology has progressed tremendously over the recent decades, with significant advances in algorithms and modeling. Yet, by comparison, our understanding of the datasets that contribute to model performance (including all dataset types in the model life cycle: training, validation, and evaluation) lags behind significantly. This is mostly due to the lack of frameworks, methods, and tools for drawing insights from datasets, especially at scale.
Most NLP models, to date, are evaluated using a relatively small number of readily available evaluation benchmarks that are often created automatically or via crowd-sourcing (e.g. Bowman et al., 2015; Wang et al., 2018; Williams et al., 2018; Zellers et al., 2018). It is well known that most popular (evaluation) datasets are rife with biases, dataset artefacts, and spurious correlations, and are prone to be solved with shortcuts (Gardner et al., 2021; Kiela et al., 2021). Presenting models with adversarial examples for which those biases or correlations do not hold often results in stark performance drops (e.g. Linzen, 2020; McCoy et al., 2019; Jia and Liang, 2017; Chen et al., 2016; Poliak et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019). At best, using datasets with such known issues might result in overestimation of a model's capability on the task in question, which may not be reflective of how well it can execute this task in more realistic scenarios. More worrying, however, is that training or finetuning on datasets that contain biases and artefacts may result in models implementing undesired, biased behaviour (e.g. Rudinger et al., 2018; Blodgett et al., 2016).
Additionally, datasets are usually treated as homogeneous collections of text, performance on which is captured in a single number – even though there is often a substantial difference between the difficulty/complexity of different examples in a dataset (e.g. Sugawara et al., 2022). Research papers rarely report thorough analyses of performance broken down by characteristics of the dataset examples, ignoring underlying patterns that the performance numbers may reflect. The problem is exacerbated by the pervasiveness of benchmarks coupled with a competitive leaderboard culture, where what counts most is system rank.
In part, this may be due to the fact that deeper analysis of results – especially when a number of different datasets are involved – is complex and time-consuming, and there are no standard frameworks or protocols that practitioners can resort to. The problem is even more pervasive when we curate datasets for development and evaluation: how we curate, create, and select data plays a critical role in understanding our models. Many NLP models (even beyond text) require up- or down-sampling of specific types of data. These processes should rely on a principled characterization of the data for any given model.
Towards this end, we believe that the existence of a standard toolkit that provides an easy-to-use set