We conduct experiments using three deep-learning models: an RNN, a Transformer, and a WFA (i.e., a
Weighted Finite Automaton, a form of RNN with linear activations). Our results suggest that in the
binary classification setting the deep-learning models are indeed able to learn the underlying class
distribution in a non-trivial manner. These results seem to be consistent with previous studies showing
that deep-learning models provide calibrated predictions for binary classification problems [Niculescu-
Mizil and Caruana, 2005]. Our main contributions are: 1) We present a novel evaluation framework for
sequence prediction models. By exploiting unlabeled data, we evaluate the model with respect to the
implicitly induced joint distribution. 2) Our evaluation distinguishes performance over components
of the distribution seen in the training data from performance over unseen components, explicitly
differentiating compression from generalization. 3) Our experiments on a sparse sequence classification task show
that deep learning architectures are able to induce good distributions in a non-trivial manner.
2 Related Work
In the recent literature, several works have addressed the problem of model calibration. Some of
these studies have shown that for some deep-learning architectures the predictions produced by the
models are not well calibrated [Guo et al., 2017] [Kumar and Sarawagi, 2019], especially for non
pre-trained transformers [Desai and Durrett, 2020]. In contrast, for the case of binary classification,
some previous work has suggested that they are indeed well calibrated [Niculescu-Mizil and Caruana,
2005]. To our knowledge, we are the first to study calibration by looking at the joint distribution
Pr(x, y) induced by the learned classifier instead of the quality of the conditional class predictions.
Calibration aside, several works have compared the classification performance of different deep-
learning architectures: CNNs and RNNs [Józefowicz et al., 2016] [Yin et al., 2017], Transformers
and RNNs [Karita et al., 2019] [Lakew et al., 2018], Transformers and CNNs [Kolesnikov et al.,
2021] [Pinto et al., 2021] [Bai et al., 2021], WFAs and CNNs [Quattoni and Carreras, 2020], and
WFAs and RNNs [Quattoni and Carreras, 2019].
Finally, an orthogonal but related problem is that of developing deep-learning models for density
estimation. These include Transformers [Fakoor et al., 2020], Autoregressive Networks [Uria
et al., 2016] [Oliva et al., 2018] and Flow Models [Durkan et al., 2019] [De Cao et al., 2020].
3 Evaluating Deep-Learning Models as Moment Predictors
Our goal is to evaluate models defined over sequences of discrete symbols. More precisely, we
consider an alphabet Σ and the set of all possible sequences Σ*. In general, we can think of
probabilistic binary sequence classifiers as functions Σ* → [0, 1]. In our setting we assume that
we have access to a large set of sequences U = {x^(1), . . . , x^(u)} sampled according to the underlying
distribution over Σ*. We can think of U as a large set of unlabeled sequences that represents the
domain of the sequence classification task. The target class Y that we wish to learn is a subset of
the sequences in U, that is, Y ⊂ U. We are particularly interested in cases where the target class is
rare, that is, cases in which |Y| is significantly smaller than |U|.
We will create a labeled training set T = {(x^(1), y^(1)), . . . , (x^(m), y^(m))} of size m by sampling
sequences from U and labeling them as positive instances (y = 1) if they appear in Y, and as
negative instances (y = 0) otherwise. Section 4 contains further details about how we create the task
data from an existing dataset, for various training sizes m. Given a training set, we train a sequence
classifier M that defines a distribution Pr_M(y | x).
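As a minimal sketch of this labeling protocol (assuming U is given as a list of symbol sequences and Y as a set of positive sequences; the function name, seed, and toy data are illustrative, not from the paper), the training set T can be built as follows:

```python
import random

def build_training_set(U, Y, m, seed=0):
    """Sample m sequences from U and label each one 1 if it
    belongs to the target class Y, else 0 (illustrative sketch)."""
    rng = random.Random(seed)
    sample = rng.sample(U, m)
    return [(x, 1 if x in Y else 0) for x in sample]

# Toy example over the alphabet {a, b}
U = ["ab", "ba", "aa", "bb", "aba", "bab"]
Y = {"aa", "aba"}  # a rare target class: |Y| much smaller than |U|
T = build_training_set(U, Y, m=4)
```

In the rare-class regime of interest, sampling uniformly from U yields mostly negative instances, which is why the paper varies the training size m.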
In this work, we consider three sequence classification models: an RNN, a Transformer, and a WFA.
As the recurrent neural network (RNN) we employ a multi-layer LSTM [Hochreiter and Schmidhuber,
1997] with a binary-classification feed-forward layer on top. For the Transformer we select the BERT
architecture [Devlin et al., 2019]; we do not use the pre-trained weights, and we expand the embeddings
with new randomly initialized ones to handle the protein dataset vocabulary. Both models output
the conditional probability Pr(l | x), where x is a sequence and l a label. We also evaluate a WFA
(Weighted Finite Automaton, which in essence is an RNN with linear activation functions [Rabusseau
et al., 2019]), employing the ensemble proposed in Quattoni and Carreras [2020].
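To make the WFA view concrete: a WFA scores a sequence by multiplying one transition matrix per symbol between initial and final weight vectors, which is exactly an RNN whose activation is the identity. A minimal numpy sketch (state dimension, random initialization, and the sigmoid squashing are illustrative assumptions, not the ensemble used in the paper):

```python
import numpy as np

class WFA:
    """Weighted finite automaton: score(x) = alpha^T A_{x_1} ... A_{x_n} omega.
    Equivalent to a linear-activation RNN with state h_t = h_{t-1} A_{x_t}."""

    def __init__(self, alphabet, n_states, seed=0):
        rng = np.random.default_rng(seed)
        # one transition matrix per symbol in the alphabet
        self.A = {s: rng.normal(size=(n_states, n_states)) / n_states
                  for s in alphabet}
        self.alpha = rng.normal(size=n_states)  # initial weights
        self.omega = rng.normal(size=n_states)  # final weights

    def score(self, x):
        h = self.alpha
        for symbol in x:          # linear "RNN" update, no nonlinearity
            h = h @ self.A[symbol]
        return float(h @ self.omega)

    def prob(self, x):
        # squash the real-valued score into [0, 1] to obtain a
        # probabilistic binary classifier (an assumed design choice)
        return 1.0 / (1.0 + np.exp(-self.score(x)))
```

For the empty sequence the score reduces to alpha^T omega, and each additional symbol is a single matrix multiplication, mirroring one recurrent step.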