Analyzing Text Representations under Tight Annotation Budgets:
Measuring Structural Alignment
César González-Gutiérrez, Audi Primadhanty, Francesco Cazzaro, Ariadna Quattoni
Universitat Politècnica de Catalunya, Barcelona, Spain
{cesar.gonzalez.gutierrez, audi.primadhanty, francesco.cazzaro}@upc.edu
Abstract
Annotating large collections of textual data can be time-consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation budgets the choice of data representation is key. The goal of this paper is to better understand why this is so. With this goal in mind, we propose a metric that measures the extent to which a given representation is structurally aligned with a task. We conduct experiments on several text classification datasets, testing a variety of models and representations. Using our proposed metric, we show that an efficient representation for a task (i.e., one that enables learning from few samples) is a representation that induces a good alignment between latent input structure and class structure.
1 Introduction
With the emergence of deep learning models, recent years have witnessed significant progress in supervised learning of text classifiers. The caveat is that most of these methods require large amounts of training data. Annotating large collections of textual data can be time-consuming and expensive. Because of this, whenever a new NLP application needs to be developed, data annotation becomes a bottleneck, and the ability to train models with limited annotation budgets is of great importance.
In this context, it has been shown that when annotated data is scarce the choice of data representation is crucial. More specifically, previous work showed that representations based on pre-trained contextual word embeddings significantly outperform classical sparse bag-of-words representations for text classification with small annotation budgets. This prior work used linear SVM models for the experimental comparisons. Our experiments further complement these conclusions by showing that the superiority of pre-trained contextual word embeddings holds both for simple linear classifiers and for more complex models.
The goal of this paper is to better understand why the choice of representation is crucial when annotated training data is scarce. Clearly, a few samples in a high-dimensional space will provide a very sparse coverage of the input domain. However, if the representation space is properly aligned with the class structure, even a small sample can be representative. To illustrate this idea, imagine a classification problem involving two classes. Suppose that we perform a clustering on a given representation space that results in a few pure clusters (i.e., clusters such that all samples belong to the same class). Then any training set that ‘hits’ all the clusters can be representative. Notice that there is a trade-off between the number of clusters and their purity: a well-aligned representation is one for which we can obtain a clustering with a small number of highly pure clusters. Based on this insight, we propose a metric that measures the extent to which a given representation is structurally aligned with a task.
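The cluster-purity intuition above can be sketched in a few lines of code. The following is an illustrative sketch only, not the metric defined in this paper: it assumes scikit-learn's KMeans, uses a hypothetical helper `cluster_purity` and toy two-class data of our own construction, and measures the fraction of points whose label matches the majority label of their cluster as the number of clusters grows, exposing the trade-off between cluster count and purity.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_purity(X, y, k, seed=0):
    """Cluster the representation space into k clusters and return the
    fraction of points whose label matches their cluster's majority label.
    Illustrative helper, not the paper's actual alignment metric."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    correct = 0
    for c in range(k):
        members = y[labels == c]
        if len(members):
            # Majority-label count within cluster c.
            correct += np.bincount(members).max()
    return correct / len(y)

# Toy data: two classes, each made of two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.1, size=(50, 2))
               for m in [(0, 0), (3, 0), (0, 3), (3, 3)]])
y = np.array([0] * 100 + [1] * 100)

# Purity generally rises with the number of clusters; a well-aligned
# representation reaches high purity already at a small k.
for k in (2, 4, 8):
    print(k, round(cluster_purity(X, y, k), 2))
```

With four blobs, four clusters already recover the class structure perfectly, whereas two clusters may merge blobs from different classes; a representation is better aligned with the task the smaller the k at which purity saturates.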
We conduct experiments on several text classification datasets comparing different representations. Our results show a clear correlation between the structural alignment induced by a representation and performance with few training samples. This provides an answer to the main question addressed in this work: an efficient representation for a task (i.e., one that enables learning from a few samples) is a representation that induces a good alignment between latent input structure and class structure.
In summary, the main contributions of this paper
are:
• We show that using pre-trained word embeddings significantly improves performance under low annotation budgets, for both simple and complex models.
arXiv:2210.05721v1 [cs.CL] 11 Oct 2022