Are Deep Sequence Classifiers Good at Non-Trivial
Generalization?
Francesco Cazzaro, Ariadna Quattoni
Universitat Politècnica de Catalunya, Barcelona, Spain
{francesco.cazzaro, ariadna.julieta.quattoni}@upc.edu

Xavier Carreras
dMetrics, Brooklyn, NY
Abstract
Recent advances in deep learning models for sequence classification have greatly improved their classification accuracy, especially when large training sets are available. However, several works have suggested that under some settings the predictions made by these models are poorly calibrated. In this work we study binary sequence classification problems and we look at model calibration from a different perspective by asking the question: are deep learning models capable of learning the underlying target class distribution? We focus on sparse sequence classification, that is, problems in which the target class is rare, and compare three deep learning sequence classification models. We develop an evaluation that measures how well a classifier is learning the target class distribution. In addition, our evaluation disentangles good performance achieved by mere compression of the training sequences from performance achieved by proper model generalization. Our results suggest that in this binary setting the deep learning models are indeed able to learn the underlying class distribution in a non-trivial manner, i.e. by proper generalization beyond data compression.
1 Introduction
Recent advances in deep learning models for sequence classification have greatly improved their classification accuracy, especially when large training sets are available. However, several works have suggested that under some settings the predictions made by these models are poorly calibrated: a model might in general provide correct predictions but give bad estimates of the confidence scores $\Pr(y|x)$. Having well-calibrated estimates of $\Pr(y|x)$ is crucial to many machine learning applications. For example, this is critical in systems in which an end user needs to make a decision based on the prediction of the model. Here a good estimate is as important as a good prediction.
In this work we study binary sequence classification, with a focus on sparse sequence classification tasks in which only a very small fraction of the sequences in the domain belong to the target class. These types of problems naturally arise in several applications, for example in NLP or in Computational Biology, where distinguishing relevant segments in DNA sequences is challenging, as it is postulated that only about 1–3% of segments have any biological significance [Eskin, 2002].
While traditional evaluations of model calibration have focused on the quality of the conditional prediction $\Pr_M(y|x)$, in this paper we look at model evaluation from a different perspective by asking the question: are deep learning models capable of learning the underlying target distribution? The main idea is quite simple: if we have access to large quantities of unlabeled data that accurately represent the input distribution, we can estimate the joint distribution $\Pr_M(y, x)$ implicitly induced by the conditional model and then compare it to the true distribution.
Workshop on Robustness in Sequence Modeling, 36th Conference on Neural Information Processing Systems (NeurIPS 2022). arXiv:2210.13082v2 [cs.LG], 11 Nov 2022.
We conduct experiments using three deep learning models: an RNN, a Transformer and a WFA (Weighted Finite Automaton, a form of RNN with linear activations). Our results suggest that in the binary classification setting the deep learning models are indeed able to learn the underlying class distribution in a non-trivial manner. These results seem to be consistent with previous studies showing that deep learning models provide calibrated predictions for binary classification problems [Niculescu-Mizil and Caruana, 2005]. Our main contributions are: 1) we present a novel evaluation framework for sequence prediction models that exploits unlabeled data to evaluate the model with respect to the implicitly induced joint distribution; 2) our evaluation distinguishes performance over components of the distribution seen in the training data from performance over unseen components, explicitly differentiating compression from generalization; 3) our experiments on a sparse sequence classification task show that deep learning architectures are able to induce good distributions in a non-trivial manner.
2 Related Work
In the recent literature several works have addressed the problem of model calibration. Some of these studies have shown that for some deep learning architectures the predictions produced by the models are not well calibrated [Guo et al., 2017; Kumar and Sarawagi, 2019], especially for non-pre-trained Transformers [Desai and Durrett, 2020]. In contrast, for the case of binary classification some previous work has suggested that they are indeed well calibrated [Niculescu-Mizil and Caruana, 2005]. To our knowledge we are the first to study calibration by looking at the joint distribution $\Pr(x, y)$ induced by the learned classifier instead of the quality of the conditional class predictions.

Calibration aside, several works have compared the classification performance of different deep learning architectures: CNNs and RNNs [Józefowicz et al., 2016; Yin et al., 2017], Transformers and RNNs [Karita et al., 2019; Lakew et al., 2018], Transformers and CNNs [Kolesnikov et al., 2021; Pinto et al., 2021; Bai et al., 2021], WFAs and CNNs [Quattoni and Carreras, 2020], and WFAs and RNNs [Quattoni and Carreras, 2019].
Finally, an orthogonal related problem is that of developing deep learning models for density estimation. These include Transformers [Fakoor et al., 2020], Autoregressive Networks [Uria et al., 2016; Oliva et al., 2018] and Flow Models [Durkan et al., 2019; De Cao et al., 2020].
3 Evaluating Deep-Learning Models as Moment Predictors
Our goal is to evaluate models defined over sequences of discrete symbols. More precisely, we consider an alphabet $\Sigma$ and the set of all possible sequences $\Sigma^*$. In general, we can think of probabilistic binary sequence classifiers as functions from $\Sigma^*$ to $[0, 1]$. In our setting we assume that we have access to a large set of sequences $U = \{x^{(1)}, \dots, x^{(u)}\}$ sampled according to the underlying distribution over $\Sigma^*$. We can think of $U$ as a large set of unlabeled sequences that represent the domain of the sequence classification task. The target class $Y$ that we wish to learn is a subset of the sequences in $U$, that is, $Y \subseteq U$. We are particularly interested in cases where the target class is rare, that is, cases in which $|Y|$ is significantly smaller than $|U|$.
We will create a labeled training set $T = \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$ of size $m$ by sampling sequences from $U$ and labeling them as positive instances ($y = 1$) if they appear in $Y$, and otherwise as negative instances ($y = 0$). Section 4 contains further details about how we create the task data from an existing dataset, for various training sizes $m$. Given a training set we will train a sequence classifier $M$ that defines a distribution $\Pr_M(y|x)$.
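To make the setup concrete, the following is a minimal sketch of this labeling procedure in Python. The representation of $U$ as a list of symbol tuples, the set representation of $Y$, and the function name are illustrative assumptions on our part, not the authors' code.

```python
import random

def make_training_set(U, Y, m, seed=0):
    """Sample m sequences from the unlabeled pool U and label each
    as positive (y=1) if it belongs to the target class Y, else 0."""
    rng = random.Random(seed)                       # reproducible sampling
    sample = rng.sample(U, m)                       # draw m sequences from U
    return [(x, 1 if x in Y else 0) for x in sample]
```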
In this work, we consider three sequence classification models: an RNN, a Transformer and a WFA. As the recurrent neural network (RNN) we employ a multi-layer LSTM [Hochreiter and Schmidhuber, 1997] with a binary classification feed-forward layer on top. For the Transformer we select the BERT architecture [Devlin et al., 2019]; we do not use the pre-trained weights, and we expand the embeddings with new randomly initialized ones to handle the protein dataset vocabulary. Both models output the conditional probability $\Pr(l|x)$, where $x$ is a sentence and $l$ a label. We also evaluate a WFA (Weighted Finite Automaton, which in essence is an RNN with linear activation functions [Rabusseau et al., 2019]), employing the ensemble proposed in Quattoni and Carreras [2020].
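As an illustration of the kind of RNN classifier just described, here is a minimal PyTorch sketch; the layer sizes, the use of the last hidden state, and the class name are hypothetical choices of ours, not the authors' configuration.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Multi-layer LSTM with a feed-forward binary head, in the spirit
    of the RNN model described above (sizes are illustrative)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (batch, seq_len) symbol ids
        h, _ = self.lstm(self.embed(x))    # h: (batch, seq_len, hidden_dim)
        logit = self.head(h[:, -1])        # last hidden state -> binary logit
        return torch.sigmoid(logit).squeeze(-1)   # Pr_M(y=1 | x)
```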
3.1 Evaluation Metrics
The sequence model $M$ that we wish to evaluate defines a conditional distribution $\Pr_M(y|x)$. Generally, we would like to compute the true error of $M$ by evaluating the model on the true distribution $\Pr(x, y)$. In this section, we will use the set $U$ as a proxy for $\Pr(x)$, and specifically we will estimate the moments of $M$ on the joint distribution $\Pr_M(x, y)$ that is implicitly induced by the model over the target class.¹ More precisely, consider the set:

\[ Z_n = \{ z_n \mid z_n \in \Sigma^n, \ \exists x \in U \text{ such that } z_n \sqsubseteq x \} \tag{1} \]
where $z_n$ is a sub-sequence of size $n$, $z_n \sqsubseteq x$ indicates that $z_n$ is a sub-sequence of a domain sequence $x$, and therefore $Z_n$ is the set of all sub-sequences of size $n$ observed in $U$. We use these sub-sequences as the support set to compute moments of a model, and we define the moment function $E_M : Z_n \times U \to \mathbb{R}$ as:

\[ E_M(z_n, U) = \frac{1}{|U|} \sum_{x \in U} \Pr_M(y|x) \, \#[z_n \sqsubseteq x] \tag{2} \]
where $z_n$ is a sub-sequence of length $n$ and $\#[z_n \sqsubseteq x]$ is a function that counts the number of times that $z_n$ appears in sequence $x$. From now on, we will assume a fixed $U$ and drop it from the expression, and we will refer to the sub-sequences $z_n$ and their values $E_M(z_n)$ as the moments of $M$. Since we assume that $U$ is a good representation of the domain, $E_M(z_n)$ can be regarded as an estimate of the expected number of times that we should observe $z_n$ in a sample from $\Pr_M(x, y)$.
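The moment computation of Eqs. (1)–(2) can be sketched as follows. We assume, for illustration, that sequences are tuples of symbols, that sub-sequences are contiguous, and that `prob_y_given_x` stands in for the trained model's conditional $\Pr_M(y=1|x)$.

```python
from collections import Counter

def subsequence_counts(x, n):
    """Count every contiguous sub-sequence of length n in x,
    i.e. the values #[z_n under x] used in Eq. (2)."""
    return Counter(tuple(x[i:i + n]) for i in range(len(x) - n + 1))

def model_moments(U, prob_y_given_x, n):
    """Estimate E_M(z_n) for all z_n in Z_n (Eqs. (1)-(2))."""
    moments = Counter()
    for x in U:
        p = prob_y_given_x(x)                        # Pr_M(y=1 | x)
        for z, count in subsequence_counts(x, n).items():
            moments[z] += p * count
    return {z: v / len(U) for z, v in moments.items()}
```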
Given sets $U$ and $Y$, the moments from the true joint distribution $\Pr_Y(x, y)$ can be computed similarly as:

\[ E_Y(z_n) = \frac{1}{|U|} \sum_{x \in Y} \#[z_n \sqsubseteq x] \tag{3} \]
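Continuing the sketch above, the gold moments of Eq. (3) only require counts over the target set $Y$, under the same representation assumptions:

```python
def true_moments(U, Y, n):
    """Compute E_Y(z_n) of Eq. (3): sub-sequence counts accumulated
    over the target sequences Y, normalized by |U|."""
    moments = Counter()                  # Counter from the sketch above
    for x in Y:
        moments.update(subsequence_counts(x, n))
    return {z: v / len(U) for z, v in moments.items()}
```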
Putting it all together, if we have a domain represented by a set of unlabeled sequences $U$, a subset of target sequences $Y$ and a model $\Pr_M(y|x)$, we can compute the model's moment function $E_M(z_n)$ and compare it to the true moments $E_Y(z_n)$. Notice that for a model that learns the target class perfectly (i.e. a model that gives accurate predictions and that is perfectly calibrated) we will have $E_M(z_n) = E_Y(z_n)$ for all moments $z_n$ and all lengths $n$.
Now that we have the necessary functions, we propose to evaluate a model $M$ by comparing $E_M$ and $E_Y$. To compare the moment functions we propose the following metrics:
MSPC: This metric measures the Spearman rank correlation between gold and model moments of a fixed length. The Spearman correlation coefficient measures the strength and direction of association between ranked variables. In our evaluation, a high MSPC means that the model sorts the moments of the distribution in a way that is similar to the gold ordering. The MSPC is computed as:

\[ \frac{\mathrm{cov}(\mathrm{rank}[E_M(z_n)], \mathrm{rank}[E_Y(z_n)])}{\sigma(\mathrm{rank}[E_M(z_n)]) \cdot \sigma(\mathrm{rank}[E_Y(z_n)])} \tag{4} \]

where $\mathrm{rank}[E(z_n)]$ are the raw function scores for all $z_n \in Z_n$ converted to ranks. In essence, this metric measures the agreement on the partial orderings induced by the model and gold moment functions. That is, given any two moments of a fixed length $z_n$ and $z'_n$, whenever $E_Y(z_n) > E_Y(z'_n)$ we want $E_M(z_n) > E_M(z'_n)$.
MSPCP: While MSPC is a useful metric when evaluating the class distribution induced by a model, we need to keep in mind that for longer moments $E_Y(z_n)$ will be sparse, and for most inputs it will be 0 since we are targeting a rare class. So in this context we compute the Spearman rank correlation between the gold non-zero moments and the model predictions for those moments; this gives an idea of how well the model would sort the moments of the target distribution if it knew the true support of the target moment function.
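MSPCP then differs from the MSPC sketch above only in restricting the support to the gold non-zero moments, e.g.:

```python
def mspcp(em, ey):
    """MSPCP: Spearman correlation restricted to the gold support,
    i.e. only sub-sequences with E_Y(z_n) > 0."""
    support = [z for z, v in ey.items() if v > 0]
    m = [em.get(z, 0.0) for z in support]
    y = [ey[z] for z in support]
    return spearmanr(m, y).correlation   # spearmanr as imported above
```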
MR: As MSPC might be overly harsh, a complementary alternative metric in this case is the mean rank given to the target non-zero moments:

\[ \frac{1}{|\{z_n : E_Y(z_n) > 0\}|} \sum_{z_n : E_Y(z_n) > 0} \frac{\mathrm{rank}[E_M(z_n)]}{|Z_n|} \tag{5} \]
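A possible reading of Eq. (5) in code, under our assumptions that rank 1 goes to the largest model moment and that ranks are normalized by $|Z_n|$; the paper's exact rank convention is not recoverable from the text, so this is a sketch only:

```python
def mean_rank(em, ey):
    """MR (Eq. (5)): mean normalized rank of the model moments over
    the gold non-zero moments; lower is better under this convention."""
    ordering = sorted(em, key=em.get, reverse=True)    # all z_n in Z_n
    rank = {z: i + 1 for i, z in enumerate(ordering)}  # 1 = largest E_M
    positives = [z for z, v in ey.items() if v > 0]
    total = len(ordering)
    return sum(rank.get(z, total) / total for z in positives) / len(positives)
```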
¹We highlight that the set $U$ is assumed to be sampled from the true $\Pr(x)$ and will contain repetitions of each distinct $x$ proportional to its likelihood, and so does $Y$. This detail is relevant because we can think of $U$ as an empirical estimate of $\Pr(x)$, which is necessary for the definitions of moments in this section.