Analyzing Text Representations under Tight Annotation Budgets:
Measuring Structural Alignment
César González-Gutiérrez Audi Primadhanty Francesco Cazzaro Ariadna Quattoni
Universitat Politècnica de Catalunya, Barcelona, Spain
{cesar.gonzalez.gutierrez, audi.primadhanty, francesco.cazzaro}@upc.edu
Abstract

Annotating large collections of textual data can be time-consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation budgets the choice of data representation is key. The goal of this paper is to better understand why this is so. With this goal in mind, we propose a metric that measures the extent to which a given representation is structurally aligned with a task. We conduct experiments on several text classification datasets, testing a variety of models and representations. Using our proposed metric, we show that an efficient representation for a task (i.e. one that enables learning from few samples) is a representation that induces a good alignment between latent input structure and class structure.
1 Introduction
With the emergence of deep learning models, recent years have witnessed significant progress in supervised learning of text classifiers. The caveat is that most of these methods require large amounts of training data. Annotating large collections of textual data can be time-consuming and expensive. Because of this, whenever a new NLP application needs to be developed, data annotation becomes a bottleneck, and the ability to train models with limited annotation budgets is of great importance.

In this context, it has been shown that when annotated data is scarce the choice of data representation is crucial. More specifically, previous work showed that representations based on pre-trained contextual word embeddings significantly outperform classical sparse bag-of-words representations for text classification with small annotation budgets. This prior work used linear SVM models for the experimental comparisons. Our experiments further complement these conclusions by showing that the superiority of pre-trained contextual word embeddings holds both for simple linear classifiers and for more complex models.
The goal of this paper is to better understand why the choice of representation is crucial when annotated training data is scarce. Clearly, a few samples in a high-dimensional space will provide a very sparse coverage of the input domain. However, if the representation space is properly aligned with the class structure, even a small sample can be representative. To illustrate this idea, imagine a classification problem involving two classes. Suppose that we perform a clustering on a given representation space that results in a few pure clusters (i.e. clusters such that all samples belong to the same class). Then any training set that 'hits' all the clusters can be representative. Notice that there is a trade-off between the number of clusters and their purity: a well-aligned representation is one for which we can obtain a clustering with a small number of highly pure clusters. Based on this insight, we propose a metric that measures the extent to which a given representation is structurally aligned with a task.
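To make the purity intuition concrete, here is a minimal sketch (our own illustration, not the paper's metric): cluster a toy two-class representation with k-means and report the average fraction of samples carrying each cluster's majority label. The toy data, the cluster count, and the purity definition are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_purity(X, y, n_clusters, seed=0):
    """Average purity of a k-means clustering of representation X:
    a cluster's purity is the fraction of its samples carrying the
    cluster's majority label (1.0 = perfectly pure)."""
    assign = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    purities = [np.bincount(y[assign == c]).max() / np.sum(assign == c)
                for c in np.unique(assign)]
    return float(np.mean(purities))

# Two well-separated classes: a small number of clusters is already pure,
# so any training set that 'hits' every cluster is representative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(3.0, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
purity = cluster_purity(X, y, n_clusters=2)
```

Note the trade-off discussed above: with enough clusters any representation looks pure (singletons are trivially pure), so what matters is reaching high purity with few clusters.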
We conduct experiments on several text classification datasets comparing different representations. Our results show that there is a clear correlation between the structural alignment induced by a representation and performance with few training samples. This provides an answer to the main question addressed in this work: an efficient representation for a task (i.e. one that enables learning from a few samples) is a representation that induces a good alignment between latent input structure and class structure.
In summary, the main contributions of this paper
are:
• We show that using pre-trained word embeddings significantly improves performance under low annotation budgets, for both simple and complex models.
arXiv:2210.05721v1 [cs.CL] 11 Oct 2022
Figure 1: SAM general schema. A dendrogram is first constructed from a hierarchical clustering of the representation. As we traverse the tree vertically, each level yields a clustering: branches correspond to clusters, merging from isolated samples (bottom) to a single group (top). For each clustering, we measure the alignment against the label classes. These scores trace a characteristic curve, and the area under it constitutes our final measure.
• We propose a metric to measure the extent to which a representation space is aligned with a given class structure.
• We conduct experiments on several text classification datasets and show that the most efficient representations are those with an input latent structure that is well aligned with the class structure.
The paper is organized as follows: Section 2 discusses related work, Section 3 presents the preliminary experiments showing that representation choice is key for good performance, Section 4 presents our proposed metric to measure representation quality, Section 5 presents our main experimental results over four text classification datasets, and finally Section 6 concludes the paper and discusses future work.
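The procedure summarized in the Figure 1 caption can be sketched in code. This is a simplified reconstruction under stated assumptions: we use Ward linkage, score each dendrogram level with sample-weighted majority-label purity, and average the per-level scores as the "area under the curve"; the paper's exact alignment score and normalization may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def structural_alignment(X, y):
    """Sketch of the SAM idea: build a dendrogram over the representation,
    score the clustering at every level against the labels, and average
    the scores (the area under the level-vs-alignment curve).
    The per-level score used here (sample-weighted majority-label purity)
    is an assumption for illustration."""
    Z = linkage(X, method="ward")  # agglomerative clustering (assumed linkage)
    n = len(X)
    scores = []
    for k in range(n, 0, -1):  # sweep from isolated samples to a single group
        assign = fcluster(Z, t=k, criterion="maxclust")
        majority = sum(np.bincount(y[assign == c]).max() for c in np.unique(assign))
        scores.append(majority / n)
    return float(np.mean(scores))

# Toy check: a representation whose latent clusters match the classes
# should score higher than the same representation with shuffled labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(3.0, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
aligned = structural_alignment(X, y)
shuffled = structural_alignment(X, rng.permutation(y))
```

The sweep over all n levels captures the purity-versus-cluster-count trade-off: a well-aligned representation stays pure even at coarse levels of the tree, so its curve, and hence the area under it, is higher.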
2 Related Work
The importance of representation choice has lately received a significant amount of attention from the active learning (AL) community (Schröder and Niekler, 2020; Zhang et al., 2017). Most of the research in AL attempts to quantify which representation is best when training the initial model for active learning, which is usually referred to as the cold-start problem (Lu et al., 2019). The importance of word embeddings has also been studied in the context of highly imbalanced data scenarios (Sahan et al., 2021; Naseem et al., 2021; Hashimoto et al., 2016; Kholghi et al., 2016).
Most of the research by the AL community regarding textual representations focuses on quantifying which representations enable higher performance for a given task. In contrast, the focus of our paper is to understand why a given representation performs better on a given task, with special attention to low annotation budget scenarios.
Since the objective of our contribution is to study properties of different textual representations, this work is also related to recent work on evaluating the general capabilities of word embeddings. In this line of research, many studies test the behaviour of such models using probing tasks that signal different linguistic capabilities (Conneau et al., 2018; Conneau and Kiela, 2018; Marvin and Linzen, 2018; Tenney et al., 2019; Miaschi and Dell'Orletta, 2020). Others have targeted the capacity of word embeddings to transfer linguistic content (Ravishankar et al., 2019; Conneau et al., 2020).
Aside from probing tasks, other approaches analyze the properties of representations directly, without an intermediate probing task. A correlation method called Singular Vector Canonical Correlation Analysis (Saphra and Lopez, 2019) has been used to compare representations during consecutive pre-training stages. Analysing the geometric properties of contextual embeddings is also an active line of work (Reif et al., 2019; Ethayarajh, 2019; Hewitt and Manning, 2019).
The main difference between these works and ours is that previous work has focused on analysing geometric properties of the representations independently of a task, while our focus is on studying the relationship between a representation and the labels of a downstream target task.
Dataset  Samples  Labels (−/+)
IMDB     50K      25K / 25K
WT       224K     202K / 21K
CC       2M       1.84M / 160K
s140     1.6M     800K / 800K

Table 1: Dataset statistics: number of samples and number of labels per class, negative and positive (−/+).
A recent contribution whose objective is more closely related to ours is that of Yauney and Mimno (2021a). In this work the authors present a method to measure the alignment between documents (in a given representation space) and labels for a downstream classification task, based on a data complexity measure developed in the learning-theory community. The main idea is to exploit a dual representation of the documents (i.e. each document is represented by its similarity to other documents) and test for label linear separability in this space. There are three main differences between their work and ours: 1) Our approach does not test for linear separability (in a dual space); instead, we measure the alignment of the latent structure in the given representation space with the label class structure (potentially testing for more complex decision surfaces). 2) Their empirical study focuses on binary text classification tasks with balanced label distributions, while we study both balanced and highly imbalanced scenarios. 3) Our study focuses on the low annotation budget scenario (it is in this low-budget scenario that our experiments show that representation is critical, independently of the classification model).
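The dual-representation separability test described above can be sketched roughly as follows. The choice of cosine similarity and of training accuracy as the separability proxy are our assumptions for illustration, not details taken from Yauney and Mimno (2021a).

```python
import numpy as np
from sklearn.svm import LinearSVC

def dual_separability(X, y):
    """Re-represent each document by its similarity to every other
    document (here: cosine similarity, an assumption) and check how
    linearly separable the labels are in that dual space, using
    training accuracy as a rough separability proxy."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T  # n x n dual representation
    clf = LinearSVC(C=1.0).fit(S, y)
    return clf.score(S, y)

# Two classes pointing in different directions are separable in the dual space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.0, 0.0], 0.1, (50, 2)),
               rng.normal([0.0, 1.0], 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
sep = dual_separability(X, y)
```

By contrast, our metric clusters the original representation space and scores the clusters against the labels, so it does not depend on a linear decision surface in the dual space.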
3 Learning Under an Annotation Budget:
Choice of Representation is Key
In this section we investigate the importance of pre-trained word representations when learning with a few training samples (i.e. learning under an annotation budget). To study this, we compare several models of different levels of complexity, trained with and without pre-trained word embeddings. More specifically, we consider the following models:

• Max-Entropy: a standard max-entropy model trained with l2 regularization.
• WFA: in essence equivalent to an RNN with a linear activation function (Quattoni and Carreras, 2020).
Figure 2: Performance of different models and textual representations (ME-BOW, ME-BERT, WFA-BOW, WFA-BERT, BERT-NPT, BERT-PT) when learning with a limited annotation budget on the IMDB dataset. Performance (roughly 0.5 to 0.9) is plotted against the number of training samples (0 to 1000).
• BERT: a BERT-base uncased model (110M parameters) (Devlin et al., 2019) pre-trained on BooksCorpus and Wikipedia.

Each of the models described above is trained in two settings: 1) with a sparse bag-of-words (BOW) representation and 2) with pre-trained word embeddings (PRE-BERT). For BERT, training with BOW means training without pre-trained word embeddings.
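A minimal sketch of this setup for the BOW setting, using scikit-learn's logistic regression as the l2-regularized max-entropy model; the toy corpus, the stratified budget draw, and evaluation on the full collection are illustrative assumptions rather than the paper's protocol.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def budget_curve(texts, labels, budgets, seed=0):
    """Train an l2-regularized max-entropy model (logistic regression) on
    growing annotation budgets over a sparse BOW representation, and
    report accuracy on the full collection as a proxy for performance."""
    rng = np.random.default_rng(seed)
    y = np.asarray(labels)
    X = CountVectorizer().fit_transform(texts)  # the BOW setting; a PRE-BERT
    # run would substitute a dense matrix of pre-trained document embeddings
    scores = []
    for b in budgets:
        # stratified draw so every budget sees both classes
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), b // 2, replace=False)
                              for c in np.unique(y)])
        clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X[idx], y[idx])
        scores.append(clf.score(X, y))
    return scores

# Hypothetical toy corpus: sentiment snippets, 20 per class
texts = ["great movie", "loved it", "wonderful film", "terrific acting",
         "awful movie", "hated it", "terrible film", "dreadful acting"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5
scores = budget_curve(texts, labels, budgets=[8, 32])
```

Sweeping `budgets` traces a learning curve like those in Figure 2; swapping the count matrix for a dense embedding matrix compares the two representations under the same budget.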
For the experiments in this section and Section 5 we use four text classification datasets with both balanced and imbalanced label distributions, covering a range of tasks and input lengths. More precisely, we run experiments on:

• IMDB (Maas et al., 2011): movie reviews annotated with sentiment. This dataset has a balanced distribution of labels.
• s140 (Go et al., 2009): a collection of short messages on Twitter annotated with sentiment. This dataset has a balanced distribution of labels.
• WT (Wulczyn et al., 2017): a collection of Wikipedia comments annotated with toxicity labels. This dataset has a highly imbalanced label distribution: less than 15% of the labels correspond to toxic comments.
• CC (Borkan et al., 2019): a collection of comments posted on the Civil Comments platform annotated with respect to toxic behaviour. This dataset has a highly imbalanced label distribution: less than 10% of the labels correspond to toxic behaviour.