Analyzing Text Representations under Tight Annotation Budgets:
Measuring Structural Alignment
César González-Gutiérrez Audi Primadhanty Francesco Cazzaro Ariadna Quattoni
Universitat Politècnica de Catalunya, Barcelona, Spain
{cesar.gonzalez.gutierrez, audi.primadhanty, francesco.cazzaro}@upc.edu
Abstract

Annotating large collections of textual data can be time-consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation budgets the choice of data representation is key. The goal of this paper is to better understand why this is so. With this goal in mind, we propose a metric that measures the extent to which a given representation is structurally aligned with a task. We conduct experiments on several text classification datasets, testing a variety of models and representations. Using our proposed metric, we show that an efficient representation for a task (i.e. one that enables learning from few samples) is a representation that induces a good alignment between latent input structure and class structure.
1 Introduction
With the emergence of deep learning models, recent years have witnessed significant progress in supervised learning of text classifiers. The caveat is that most of these methods require large amounts of training data. Annotating large collections of textual data can be time-consuming and expensive. Because of this, whenever a new NLP application needs to be developed, data annotation becomes a bottleneck, and the ability to train models with limited annotation budgets is of great importance.

In this context, it has been shown that when annotated data is scarce the choice of data representation is crucial. More specifically, previous work showed that representations based on pre-trained contextual word embeddings significantly outperform classical sparse bag-of-words representations for text classification with small annotation budgets. This prior work used linear SVM models for the experimental comparisons. Our experiments further complement these conclusions by showing that the superiority of pre-trained contextual word embeddings holds both for simple linear classifiers and for more complex models.
The goal of this paper is to better understand why the choice of representation is crucial when annotated training data is scarce. Clearly, a few samples in a high-dimensional space will provide a very sparse coverage of the input domain. However, if the representation space is properly aligned with the class structure, even a small sample can be representative. To illustrate this idea, imagine a classification problem involving two classes. Suppose that we perform a clustering on a given representation space that results in a few pure clusters (i.e. clusters such that all samples belong to the same class). Then any training set that 'hits' all the clusters can be representative. Notice that there is a trade-off between the number of clusters and their purity: a well-aligned representation is one for which we can obtain a clustering with a small number of highly pure clusters. Based on this insight, we propose a metric that measures the extent to which a given representation is structurally aligned with a task.
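To make the purity intuition concrete, here is a minimal sketch (our own illustration, not the paper's metric): cluster a toy two-class representation with k-means and report the average fraction of samples carrying each cluster's majority label. The toy data, the cluster count, and the purity definition are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_purity(X, y, n_clusters, seed=0):
    """Average purity of a k-means clustering of representation X:
    a cluster's purity is the fraction of its samples carrying the
    cluster's majority label (1.0 = perfectly pure)."""
    assign = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    purities = [np.bincount(y[assign == c]).max() / np.sum(assign == c)
                for c in np.unique(assign)]
    return float(np.mean(purities))

# Two well-separated classes: a small number of clusters is already pure,
# so any training set that 'hits' every cluster is representative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(3.0, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
purity = cluster_purity(X, y, n_clusters=2)
```

Note the trade-off discussed above: with enough clusters any representation looks pure (singletons are trivially pure), so what matters is reaching high purity with few clusters.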
We conduct experiments on several text classification datasets comparing different representations. Our results show that there is a clear correlation between the structural alignment induced by a representation and performance with few training samples. This provides an answer to the main question addressed in this work: an efficient representation for a task (i.e. one that enables learning from a few samples) is a representation that induces a good alignment between latent input structure and class structure.
In summary, the main contributions of this paper
are:
• We show that using pre-trained word embeddings significantly improves performance under low annotation budgets, for both simple and complex models.
arXiv:2210.05721v1 [cs.CL] 11 Oct 2022
Figure 1: SAM general schema. A dendrogram is first constructed from a hierarchical clustering of the representation. As we traverse the tree vertically, each level yields a clustering: branches correspond to clusters, merging from isolated samples (bottom) to a single group (top). For each clustering, we measure the alignment against the label classes. These scores trace a characteristic curve, and the area under it constitutes our final measure.
• We propose a metric to measure the extent to which a representation space is aligned with a given class structure.
• We conduct experiments on several text classification datasets and show that the most efficient representations are those with an input latent structure that is well aligned with the class structure.
The paper is organized as follows: Section 2 discusses related work, Section 3 presents the preliminary experiments showing that representation choice is key for good performance, Section 4 presents our proposed metric to measure representation quality, Section 5 presents our main experimental results over four text classification datasets, and finally Section 6 concludes the paper and discusses future work.
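The procedure summarized in the Figure 1 caption can be sketched in code. This is a simplified reconstruction under stated assumptions: we use Ward linkage, score each dendrogram level with sample-weighted majority-label purity, and average the per-level scores as the "area under the curve"; the paper's exact alignment score and normalization may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def structural_alignment(X, y):
    """Sketch of the SAM idea: build a dendrogram over the representation,
    score the clustering at every level against the labels, and average
    the scores (the area under the level-vs-alignment curve).
    The per-level score used here (sample-weighted majority-label purity)
    is an assumption for illustration."""
    Z = linkage(X, method="ward")  # agglomerative clustering (assumed linkage)
    n = len(X)
    scores = []
    for k in range(n, 0, -1):  # sweep from isolated samples to a single group
        assign = fcluster(Z, t=k, criterion="maxclust")
        majority = sum(np.bincount(y[assign == c]).max() for c in np.unique(assign))
        scores.append(majority / n)
    return float(np.mean(scores))

# Toy check: a representation whose latent clusters match the classes
# should score higher than the same representation with shuffled labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(3.0, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
aligned = structural_alignment(X, y)
shuffled = structural_alignment(X, rng.permutation(y))
```

The sweep over all n levels captures the purity-versus-cluster-count trade-off: a well-aligned representation stays pure even at coarse levels of the tree, so its curve, and hence the area under it, is higher.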
2 Related Work
The importance of representation choice has lately received a significant amount of attention from the active learning (AL) community (Schröder and Niekler, 2020; Zhang et al., 2017). Most of the research in AL attempts to quantify which representation is best when training the initial model for active learning, which is usually referred to as the cold-start problem (Lu et al., 2019). The importance of word embeddings has also been studied in the context of highly imbalanced data scenarios (Sahan et al., 2021; Naseem et al., 2021; Hashimoto et al., 2016; Kholghi et al., 2016).
Most of the research by the AL community regarding textual representations focuses on quantifying which representations enable higher performance for a given task. In contrast, the focus of our paper is to understand why a given representation performs better on a given task, with special attention to low annotation budget scenarios.
Since the objective of our contribution is to study properties of different textual representations, this work is also related to recent work on evaluating the general capabilities of word embeddings. In this line of research, many studies test the behaviour of such models using probing tasks that signal different linguistic capabilities (Conneau et al., 2018; Conneau and Kiela, 2018; Marvin and Linzen, 2018; Tenney et al., 2019; Miaschi and Dell'Orletta, 2020). Others have targeted the capacity of word embeddings to transfer linguistic content (Ravishankar et al., 2019; Conneau et al., 2020).
Aside from probing tasks, other approaches analyze the properties of representations directly, without an intermediate probing task. A correlation method called Singular Vector Canonical Correlation Analysis (Saphra and Lopez, 2019) has been used to compare representations during consecutive pre-training stages. Analysing the geometric properties of contextual embeddings is also an active line of work (Reif et al., 2019; Ethayarajh, 2019; Hewitt and Manning, 2019).
The main difference between these works and ours is that previous work has focused on analysing geometric properties of the representations independently of a task, while our focus is on studying the relationship between a representation and the labels of a downstream target task.
Dataset  Samples  Labels (−/+)
IMDB     50K      25K / 25K
WT       224K     202K / 21K
CC       2M       1.84M / 160K
s140     1.6M     800K / 800K

Table 1: Dataset statistics: number of samples and number of labels per class, negative and positive (−/+).
A recent contribution whose objective is more closely related to ours is that of Yauney and Mimno (2021a). In this work the authors present a method to measure the alignment between documents (in a given representation space) and labels for a downstream classification task, based on a data complexity measure developed in the learning-theory community. The main idea is to exploit a dual representation of the documents (i.e. each document is represented by its similarity to other documents) and test for label linear separability in this space. There are three main differences between their work and ours: 1) Our approach does not test for linear separability (in a dual space); instead, we measure the alignment of the latent structure in the given representation space with the label class structure (potentially testing for more complex decision surfaces). 2) Their empirical study focuses on binary text classification tasks with balanced label distributions, while we study both balanced and highly imbalanced scenarios. 3) Our study focuses on the low annotation budget scenario (it is in this low-budget scenario that our experiments show that representation is critical, independently of the classification model).
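The dual-representation separability test described above can be sketched roughly as follows. The choice of cosine similarity and of training accuracy as the separability proxy are our assumptions for illustration, not details taken from Yauney and Mimno (2021a).

```python
import numpy as np
from sklearn.svm import LinearSVC

def dual_separability(X, y):
    """Re-represent each document by its similarity to every other
    document (here: cosine similarity, an assumption) and check how
    linearly separable the labels are in that dual space, using
    training accuracy as a rough separability proxy."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T  # n x n dual representation
    clf = LinearSVC(C=1.0).fit(S, y)
    return clf.score(S, y)

# Two classes pointing in different directions are separable in the dual space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.0, 0.0], 0.1, (50, 2)),
               rng.normal([0.0, 1.0], 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
sep = dual_separability(X, y)
```

By contrast, our metric clusters the original representation space and scores the clusters against the labels, so it does not depend on a linear decision surface in the dual space.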
3 Learning Under an Annotation Budget:
Choice of Representation is Key
In this section we investigate the importance of pre-trained word representations when learning with a few training samples (i.e. learning under an annotation budget). To study this, we compare several models of different levels of complexity, trained with and without pre-trained word embeddings. More specifically, we consider the following models:

• Max-Entropy: a standard max-entropy model trained with l2 regularization.
• WFA: in essence equivalent to an RNN with a linear activation function (Quattoni and Carreras, 2020).
Figure 2: Performance of different models and textual representations (ME-BOW, ME-BERT, WFA-BOW, WFA-BERT, BERT-NPT, BERT-PT) when learning with a limited annotation budget on the IMDB dataset. Performance (roughly 0.5 to 0.9) is plotted against the number of training samples (0 to 1000).
• BERT: a BERT-base uncased model (110M parameters) (Devlin et al., 2019) pre-trained on BooksCorpus and Wikipedia.

Each of the models described above is trained in two settings: 1) with a sparse bag-of-words (BOW) representation and 2) with pre-trained word embeddings (PRE-BERT). For BERT, training with BOW means training without pre-trained word embeddings.
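A minimal sketch of this setup for the BOW setting, using scikit-learn's logistic regression as the l2-regularized max-entropy model; the toy corpus, the stratified budget draw, and evaluation on the full collection are illustrative assumptions rather than the paper's protocol.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def budget_curve(texts, labels, budgets, seed=0):
    """Train an l2-regularized max-entropy model (logistic regression) on
    growing annotation budgets over a sparse BOW representation, and
    report accuracy on the full collection as a proxy for performance."""
    rng = np.random.default_rng(seed)
    y = np.asarray(labels)
    X = CountVectorizer().fit_transform(texts)  # the BOW setting; a PRE-BERT
    # run would substitute a dense matrix of pre-trained document embeddings
    scores = []
    for b in budgets:
        # stratified draw so every budget sees both classes
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), b // 2, replace=False)
                              for c in np.unique(y)])
        clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X[idx], y[idx])
        scores.append(clf.score(X, y))
    return scores

# Hypothetical toy corpus: sentiment snippets, 20 per class
texts = ["great movie", "loved it", "wonderful film", "terrific acting",
         "awful movie", "hated it", "terrible film", "dreadful acting"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5
scores = budget_curve(texts, labels, budgets=[8, 32])
```

Sweeping `budgets` traces a learning curve like those in Figure 2; swapping the count matrix for a dense embedding matrix compares the two representations under the same budget.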
For the experiments in this section and Section 5 we use four text classification datasets with both balanced and imbalanced label distributions, covering a range of tasks and input lengths. More precisely, we run experiments on:

• IMDB (Maas et al., 2011): movie reviews annotated with sentiment. This dataset has a balanced distribution of labels.
• s140 (Go et al., 2009): a collection of short messages on Twitter annotated with sentiment. This dataset has a balanced distribution of labels.
• WT (Wulczyn et al., 2017): a collection of Wikipedia comments annotated with toxicity labels. This dataset has a highly imbalanced label distribution: less than 15% of the labels correspond to toxic comments.
• CC (Borkan et al., 2019): a collection of comments posted on the Civil Comments platform annotated with respect to toxic behaviour. This dataset has a highly imbalanced label distribution: less than 10% of the labels correspond to toxic behaviour.