
that these models can be especially fragile in the face of distribution shift [45]. In other words, when a model trained on a source dataset $D_s$ is tested on another dataset, $D_t$, collected and annotated independently and with a distribution shift, the classification accuracy drops significantly [62].
This paper presents our findings and observations that negative pairs (i.e., non-paraphrase pairs) in
the training set, as opposed to the positive pairs, do not generalize well to out-of-distribution test
pairs. Intuitively, negative pairs only represent a limited perspective of how the meanings of sentences
can be different (and indeed it is practically infeasible to represent every possible perspective). We
conjecture that negative pairs are so specific to the dataset that they adversely encourage the model
to learn biased representations. We illustrate this observation in Figure 1. Quora Question Pairs (QQP)² extracts its negative pairs from questions on similar topics. Paraphrase Adversaries from Word Scrambling (PAWS) [62] generates negative pairs primarily by word swapping. The WMT Metrics Task 2017 (WMT) [8] treats poor machine translations as negative examples.
We therefore hypothesize that biases introduced by the different ways negative pairs are mined are
major causes of the poor generalizability of paraphrase identification models.
Based on this observation, we would like to be able to control the reliance on negative pairs for out-of-distribution prediction. To achieve this, we propose to explicitly train two separate models for the positive and negative pairs (we refer to them as the positive and negative models, respectively), which gives us the option to choose when to use the negative model. It is well known that training on positive pairs alone can lead to a degenerate solution [21, 59, 42, 13]; e.g., a constant-function network would still achieve a perfect training loss. To prevent this, we propose a novel generative framework built on an autoregressive transformer [39, 49], specifically BART [30]. Given two sentences, we condition the prediction of the next token in the second sentence on the first sentence and the previous tokens. In a Bayesian sense, this means the predicted tokens have a higher probability of forming a positive or negative pair with the first sentence under the positive and negative model, respectively. This learning strategy has no degenerate solutions even when the positive and negative models are trained separately. We call our proposed approach GAP, which stands for Generalized Autoregressive Paraphrase-Identification. One potential pitfall of GAP is that it ignores the "interplay" between positive and negative pairs that would otherwise be learned if they were used together in training. This interplay is especially important when the test pairs are in-distribution. To overcome this, we use an extra discriminative model, trained on both positive and negative pairs, to capture it. We call this extension GAPX (pronounced "Gaps") to reflect the eXtra discriminative model.
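To make the conditional formulation concrete, the sketch below scores a sentence pair under two separately fine-tuned BART models with Hugging Face's transformers library. The checkpoint paths and the helper conditional_log_likelihood are hypothetical placeholders, and the decision rule shown is a simplification rather than the full GAP inference procedure.

```python
# Minimal sketch: score p(s2 | s1) under a positive and a negative BART model.
# Checkpoint paths and the helper name are illustrative assumptions.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def conditional_log_likelihood(model, s1: str, s2: str) -> float:
    """Log-likelihood of s2 given s1 under an autoregressive encoder-decoder."""
    enc = tokenizer(s1, return_tensors="pt")
    dec = tokenizer(s2, return_tensors="pt")
    with torch.no_grad():
        # Teacher forcing: each token of s2 is predicted from s1 plus the
        # previous tokens of s2, exactly the conditioning described above.
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    labels=dec.input_ids)
    # out.loss is the mean per-token cross-entropy; undo the mean to get
    # the sentence-level log-likelihood.
    return -out.loss.item() * dec.input_ids.size(1)

# Hypothetical checkpoints: one BART fine-tuned on positive pairs only,
# one on negative pairs only.
pos_model = BartForConditionalGeneration.from_pretrained("path/to/gap-positive")
neg_model = BartForConditionalGeneration.from_pretrained("path/to/gap-negative")

s1 = "How do I learn Python?"
s2 = "What is the best way to learn Python?"
# Simplified decision: the pair is a paraphrase when the positive model
# explains s2 better than the negative model does.
is_paraphrase = (conditional_log_likelihood(pos_model, s1, s2) >
                 conditional_log_likelihood(neg_model, s1, s2))
```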
For all practical purposes, the weights placed on the positive, negative, and/or discriminative models in GAP and GAPX need to be determined automatically during inference. For in-distribution pairs, we want to use all of them, while for out-of-distribution pairs, we want to rely much more heavily on the positive model. This naturally raises the question of how to determine whether a given pair is in or out of distribution [9, 14, 43, 17, 20]. During testing, our method ensembles the positive model, the negative model, and the discriminative model based on the degree of similarity of the test pair to the training pairs, and we find that this works well for our purpose. We measure this degree of similarity with the cumulative distribution function (CDF) of perplexity [25], and show that it is superior to other measures.
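As an illustration of this mechanism, the following sketch fits an empirical CDF over the perplexities of training pairs and maps a test pair's perplexity to an ensembling weight. The function names and the exact weighting rule are simplifying assumptions rather than the precise formulation used in our experiments.

```python
# Sketch: perplexity-CDF-based weighting for automatic ensembling.
import numpy as np

def empirical_cdf(train_perplexities: np.ndarray):
    """Return F where F(x) is the fraction of training perplexities <= x."""
    sorted_ppl = np.sort(train_perplexities)
    def cdf(x: float) -> float:
        return np.searchsorted(sorted_ppl, x, side="right") / len(sorted_ppl)
    return cdf

def in_distribution_weight(cdf, test_ppl: float) -> float:
    """Near 1 when the test pair's perplexity is typical of training pairs;
    near 0 in the high-perplexity tail, i.e., likely out-of-distribution."""
    return 1.0 - cdf(test_ppl)

# Usage (schematic): blend the full in-distribution ensemble with the
# positive-only score, leaning on the positive model as w shrinks.
#   w = in_distribution_weight(cdf, test_ppl)
#   score = w * score_full_ensemble + (1 - w) * score_positive_only
```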
In summary, our contributions are as follows:
1. We report new research insights, supported by empirical results, that the negative pairs of a dataset could potentially introduce biases that will prevent a paraphrase identification model from generalizing to out-of-distribution pairs.
2. To overcome this, we propose a novel autoregressive modeling approach to train both a positive and a negative model, and ensemble them automatically during inference. Further, we observe that the interplay between positive and negative pairs is important for in-distribution inference, for which we add a discriminative model. We then introduce a new perplexity-based approach to determine whether a given pair is out-of-distribution, enabling automatic ensembling.
3. We support our proposal with strong empirical results. Compared to state-of-the-art transformers in out-of-distribution performance, our model achieves an average of 11.5% and 8.4% improvement in terms of macro F1 and accuracy, respectively, over 7 different out-
²https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs