GAPX: Generalized Autoregressive
Paraphrase-Identification X
Yifei Zhou
Cornell University
yz639@cornell.edu
Renyu Li
Cornell University
rl626@cornell.edu
Hayden Housen
Cornell University
hth33@cornell.edu
Ser-nam Lim
Meta AI
sernamlim@fb.com
Abstract
Paraphrase Identification is a fundamental task in Natural Language Processing. While much progress has been made in the field, the performance of many state-of-the-art models often suffers from distribution shift at inference time. We verify that a major source of this performance drop comes from biases introduced by negative examples. To overcome these biases, we propose in this paper to train two separate models, one that only utilizes the positive pairs and the other the negative pairs. This gives us the option of deciding how much to utilize the negative model, for which we introduce a perplexity-based out-of-distribution metric that we show can effectively and automatically determine how much weight it should be given during inference. We support our findings with strong empirical results.¹
1 Introduction
Paraphrases are sentences or phrases that convey the same meaning using different wording, and are fundamental to the understanding of languages [7]. Paraphrase Identification is a well-studied task of identifying whether a given pair of sentences has the same meaning [51, 47, 56, 57, 31], and has many important downstream applications such as machine translation [61, 44, 40, 27] and question-answering [11, 35].
[Figure 1 examples of negative pairs:
QQP (similar topic): "What causes stool color to change to yellow?" / "What can cause stool to come out as little balls?"
PAWS (same bag-of-words): "Emma Townshend is represented by David Godwin at DGA Associates." / "David Godwin is represented at DGA Associates by Emma Townshend."
WMT (poor machine translation): "But it changed the name and IDN and there was nothing to deal with." / "The company then changed its company ID number, and that was it."
PIT (similar topics): "That amber alert was getting annoying" / "Why do I get amber alerts tho"]
Figure 1: Negative pairs are mined differently in different datasets, and can lead to significant biases during training.
Recently, researchers have observed that neural network architectures trained on different datasets could achieve state-of-the-art performances for the task of paraphrase identification [52, 16, 50]. While these advances are encouraging for the research community, it has however been observed that these models can be especially fragile in the face of distribution shift [45]. In other words, when a model trained on a source dataset $D_s$ is tested on another dataset, $D_t$, collected and annotated independently, and with a distribution shift, the classification accuracy drops significantly [62].

¹Our code is publicly available at: https://github.com/YifeiZhou02/generalized_paraphrase_identification
This paper presents our findings and observations that negative pairs (i.e., non-paraphrase pairs) in the training set, as opposed to the positive pairs, do not generalize well to out-of-distribution test pairs. Intuitively, negative pairs only represent a limited perspective of how the meanings of sentences can be different (and indeed it is practically infeasible to represent every possible perspective). We conjecture that negative pairs are so specific to the dataset that they adversely encourage the model to learn biased representations. We show this observation in Figure 1. Quora Question Pairs (QQP)² extracts its negative pairs from similar topics. Paraphrase Adversaries from Word Scrambling (PAWS) [62] generates negative pairs primarily from word swapping. The WMT Metrics Task 2017 (WMT) [8] considers negative examples as poor machine translations. We therefore hypothesize that biases introduced by the different ways negative pairs are mined are major causes of the poor generalizability of paraphrase identification models.
Based on this observation, we would like to be able to control the reliance on negative pairs for out-of-distribution prediction. In order to achieve this, we propose to explicitly train two separate models for the positive and negative pairs (we will refer to them as the positive and negative models respectively), which gives us the option to choose when to use the negative model. It is well known that training on positive pairs alone can lead to a degenerate solution [21, 59, 42, 13]; e.g., a constant-function network would still produce a perfect training loss. To prevent this, we propose a novel generative framework where we use an autoregressive transformer [39, 49], specifically BART [30]. Given two sentences, we condition the prediction of the next token in the second sentence on the first sentence and the previous tokens. In a Bayesian sense, this means that the predicted next token has a higher probability of forming a positive/negative pair with the first sentence under the positive and negative model respectively. This learning strategy has no degenerate solutions even when we train the positive and negative models separately. We call our proposed approach GAP, which stands for Generalized Autoregressive Paraphrase-Identification. One potential pitfall of GAP is that it ignores the "interplay" between positive and negative pairs that would otherwise be learned if they were trained on together. This is especially important when the test pairs are in-distribution. To overcome this, we utilize an extra discriminative model, trained with both positive and negative pairs, to capture the interplay. We call this extension GAPX (pronounced "Gaps") to capture the eXtra discriminative model used.
For all practical purposes, the weights we place on the positive, negative, and/or discriminative models in GAP and GAPX need to be determined automatically during inference. For in-distribution pairs, we want to use them all, while for out-of-distribution pairs, we want to rely much more heavily on the positive model. This naturally raises the question of how to determine whether a given pair is in or out of distribution [9, 14, 43, 17, 20]. During testing, our method ensembles the positive model, the negative model, and the discriminative model based on the degree of similarity of the test pair to the training pairs, which we found works well for our purpose. We measure this degree of similarity with the cumulative distribution function (cdf) of the perplexity [25], and show that it is superior to other measures.
In summary, our contributions are as follows:

1. We report new research insights, supported by empirical results, that the negative pairs of a dataset could potentially introduce biases that prevent a paraphrase identification model from generalizing to out-of-distribution pairs.

2. To overcome this, we propose a novel autoregressive modeling approach to train both a positive and a negative model, and ensemble them automatically during inference. Further, we observe that the interplay between positive and negative pairs is important for in-distribution inference, for which we add a discriminative model. We then introduce a new perplexity-based approach to determine whether a given pair is out-of-distribution to achieve automatic ensembling.

3. We support our proposal with strong empirical results. Compared to state-of-the-art transformers in out-of-distribution performance, our model achieves an average of 11.5% and 8.4% improvement in terms of macro F1 and accuracy respectively over 7 different out-of-distribution scenarios. Our method is especially robust to paraphrase adversarials like PAWS, while keeping comparable performance for in-distribution prediction.

²https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
2 Related Works
2.1 Distribution Shift and Debiasing Models in NLP
The issue of dataset bias and model debiasing has been widely studied in many fields of NLP, such as Natural Language Inference [23, 3] and Question Answering [36, 2, 4]. Notable works [22, 5, 10, 15, 48] utilize ensembling to reduce models' reliance on dataset bias. These models share the same paradigm, where they break down a given sample $x$ into a signal feature $x_s$ and a biased feature $x_b$, in the hope of preventing their model from relying on $x_b$, which has been shown to be the limiting factor preventing the model from generalizing to out-of-distribution samples [5]. Here, a separate model is first either trained on $x_b$ or on datasets with known biases [10, 15, 22], or acquired from models known to have limited generalization capability. They then train their main model with a regularization term that encourages the main model to produce predictions that deviate from those of the "biased model". However, this type of approach has shown limited success [45] in debiasing paraphrase identification models. In contrast, our method is based on our observation that negative pairs limit the generalization of paraphrase identification models.
2.2 Out-of-distribution Detection
Another line of work relevant to this paper is the task of detecting out-of-distribution samples [9, 14, 43, 17, 20, 55]. Researchers have proposed methods to detect anomalous samples by examining the softmax scores [32] or energy scores [33, 60] produced by discriminative models, while others take a more probabilistic approach and estimate the probability density [14, 25, 66, 1] or reconstruction error [38]. In this paper, we introduce a novel perplexity-based out-of-distribution detection method that we show empirically to work well for our purpose. Specifically, during inference, an out-of-distribution score is used to weigh the contributions from the positive and negative models: the higher the confidence that the sample is out-of-distribution, the smaller the negative model's contribution.
2.3 Text Generation Metrics
Finally, we would like to note the difference between our work and autoregressive methods that have been explored for evaluating text generation [58, 46]. Our work differs as follows: 1) paraphrase identification seeks to assign a label of paraphrase or not, while text generation metrics seek to assign a score measuring the similarity of sentences; 2) current text generation metrics either cannot be trained to fit a specific distribution [61, 58, 63] or are limited to the i.i.d. setting [44, 41] of the training distribution. In contrast, our method not only significantly improves out-of-distribution performance but is also competitive with state-of-the-art paraphrase identification methods for in-distribution predictions.
3 Methodology
We observe that negative pairs in paraphrase identification constitute the main source of bias. To overcome this, we propose the following training paradigm to learn a significantly less biased paraphrase identification model. We employ autoregressive conditional sentence generators with a transformer architecture as the backbone of our model. Specifically, we train a positive and a negative model to estimate the distribution of positive and negative pairs in a dataset respectively. During testing, the two models are ensembled based on how likely the input pair is out of distribution. This section provides details on our method.
3.1 Separation of Dependence on Positive and Negative Pairs
Let $S$ be the space of all sentences, $X = (s_1, s_2)$ be the random variable representing a sample pair from $S$, and $Y$ the random variable representing the labels, with $Y = 1$ indicating that $s_1$ and $s_2$ are paraphrases and $Y = 0$ otherwise. We seek to separate the dependence between the distribution of positive and negative pairs, motivated by the observation of the presence of bias in the negative pairs.

[Figure 2 diagram: Training Set → Positive Distribution → Positive Model; Training Set → Negative Distribution → Negative Model; Training Set → Distribution Model; Validation Set → Weibull Distribution.]
Figure 2: An overview of the training procedure of our model GAP. GAPX ensembles GAP with another discriminative model.
To begin, we model the distribution of sentences by splitting the sentence $s_2$ of length $n$ into the autoregressive product of individual words, where $w_2^{(i)}$ denotes the $i$th word in $s_2$. Applying Bayes' rule, we have:

$$P(Y=y \mid s_1, s_2) = P(Y=y \mid s_1)\,\frac{\prod_{i=1}^{n} P\bigl(w_2^{(i)} \mid s_1, Y=y, w_2^{(1:i-1)}\bigr)}{\prod_{i=1}^{n} P\bigl(w_2^{(i)} \mid w_2^{(1:i-1)}, s_1\bigr)}. \tag{1}$$
Taking the logarithm of Eqn. 1 for $Y=1$ and $Y=0$ and subtracting (the shared denominator cancels), we get:

$$
\begin{aligned}
&\log P(Y=1 \mid s_1, s_2) - \log P(Y=0 \mid s_1, s_2) \\
&\quad = \bigl(\log P(Y=1 \mid s_1) - \log P(Y=0 \mid s_1)\bigr) \\
&\qquad + \Bigl(\sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=1, w_2^{(1:i-1)}\bigr) - \sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=0, w_2^{(1:i-1)}\bigr)\Bigr) \\
&\quad = \bigl(\log P(Y=1) - \log P(Y=0)\bigr) \\
&\qquad + \Bigl(\sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=1, w_2^{(1:i-1)}\bigr) - \sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=0, w_2^{(1:i-1)}\bigr)\Bigr),
\end{aligned} \tag{2}
$$

where the last equality treats the label prior as independent of $s_1$.
In this way, we break the probability inference in Eqn. 2 into three terms: (1) $\log P(Y=1) - \log P(Y=0)$, which should just be a constant; (2) $\sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=1, w_2^{(1:i-1)}\bigr)$, which depends only on the distribution of positive pairs; and (3) $-\sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=0, w_2^{(1:i-1)}\bigr)$, which depends only on the distribution of negative pairs. We define the score of confidence as follows:
$$S(s_1, s_2) = \underbrace{\sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=1, w_2^{(1:i-1)}\bigr)}_{\text{Positive Model}} - \underbrace{\sum_{i=1}^{n} \log P\bigl(w_2^{(i)} \mid s_1, Y=0, w_2^{(1:i-1)}\bigr)}_{\text{Negative Model}}. \tag{3}$$
In the above, we are now left with two terms, the first representing the positive model and the second the negative model. If we were to train the two terms together, the effects of the negative pairs on the resulting model could never be removed during inference, which we have observed to be a major source of bias. To avoid this, we propose to train the first term and the second term separately, and then subsequently ensemble them based on the degree to which a given pair is out of distribution. We train each model on top of the pretrained autoregressive transformer described in [30], known as BART. Given $s_1$ and $s_2$, we feed $s_1$ into the encoder as the condition, shift $s_2$ to the right by one token, and feed the shifted $s_2$ to the decoder. As the decoder proceeds autoregressively, we record the next-word probability distribution, and update the model parameters with the cross entropy between the next-word probability distribution and the target token in $s_2$. Note that a similar Bayesian formulation has been raised in previous work such as Moore and Lewis [37], but to the best of our knowledge, we are the first to use this formulation to control the reliance on the different components of the model.
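To make the training procedure concrete, below is a minimal sketch of how the positive model could be fine-tuned with the HuggingFace transformers implementation of BART; the negative model would be trained identically on negative pairs. The checkpoint name, example pair, and hyperparameters are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch (not the released code): fine-tune BART so the decoder models
# P(w2_i | s1, w2_{1:i-1}) on positive pairs only.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Placeholder data: (s1, s2) paraphrase pairs from the training set.
positive_pairs = [("How do I learn Python quickly?",
                   "What is the fastest way to learn Python?")]

model.train()
for s1, s2 in positive_pairs:
    # s1 is the condition fed to the encoder; s2 is the generation target.
    inputs = tokenizer(s1, return_tensors="pt", truncation=True)
    targets = tokenizer(s2, return_tensors="pt", truncation=True)
    # Passing labels makes the model shift them right internally and return the
    # token-level cross-entropy, i.e. the average of -log P(w2_i | s1, w2_{1:i-1}).
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    labels=targets["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```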
3.2 Ensembling
To combine the positive and negative models: if we know a priori that $D_t$ follows the same distribution as $D_s$, we can directly substitute the predictions of the positive and negative models into Eqn. 3. We will refer to this as the In-distribution Predictor (IDP). If we have reason to believe that there is a significant distribution shift between $D_s$ and $D_t$ (e.g., different corpus sources and different dataset collection procedures), we observe empirically that we should only utilize the positive model and disregard the negative model due to the bias it introduces. We will refer to this as the Out-of-distribution Predictor (OODP).
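For concreteness, here is a sketch of how the IDP could score a pair with Eqn. 3, given the two fine-tuned BART models from the previous sketch. The helper names are ours, and the loss-to-log-probability conversion is approximate (it includes special tokens).

```python
# Sketch (assumed setup, not the released code): Eqn. 3 as the difference of
# summed token log-probabilities of s2 under the positive and negative models.
import torch
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def sentence_log_prob(model, s1, s2):
    """Return sum_i log P(w2_i | s1, w2_{1:i-1}) under the given seq2seq model."""
    inputs = tokenizer(s1, return_tensors="pt", truncation=True)
    targets = tokenizer(s2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    labels=targets["input_ids"])
    # out.loss is the mean token cross-entropy; negate and rescale to a sum of log-probs.
    num_tokens = targets["input_ids"].shape[1]
    return -out.loss.item() * num_tokens

def idp_score(positive_model, negative_model, s1, s2):
    # Eqn. 3: positive-model log-likelihood minus negative-model log-likelihood.
    return (sentence_log_prob(positive_model, s1, s2)
            - sentence_log_prob(negative_model, s1, s2))

# Predict "paraphrase" when the score is non-negative (the decision rule of Section 3.2.1):
# is_paraphrase = idp_score(pos_model, neg_model, s1, s2) >= 0
```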
3.2.1 Automatic Ensembling
However, in most cases, we have little or no knowledge of the testing distribution, in which case we need to automatically decide how important the negative model is by detecting to what degree a test pair comes from the same distribution as the training set. We adopt a weighted interpolation between a constant and the negative model, in addition to the positive model, as follows:

$$S(s_1, s_2) = \log P(s_2 \mid s_1, Y=1) - \bigl(1 - \lambda(s_1, s_2)\bigr) \log P(s_2 \mid s_1, Y=0) - \lambda(s_1, s_2)\, C, \tag{4}$$

where $\lambda(s_1, s_2)$ is a weight parameter depending on $s_1$ and $s_2$, and $C$ is a constant that achieves a regularization effect (see the Appendix for ablations on how $C$ can be set). $P(s_2 \mid s_1, Y=1)$ and $P(s_2 \mid s_1, Y=0)$ are the same terms as in Eqn. 3. To automatically assign $\lambda(s_1, s_2)$ for different sentence pairs, we measure an out-of-distribution score for $(s_1, s_2)$ with regard to the training distribution.
Specifically, we use the same set of training data, comprising both positive and negative pairs, from $D_s$, on which we train another autoregressive model, which we will refer to as the distribution model. The distribution model is trained by feeding an empty string into the encoder and the concatenation of $s_1$ and $s_2$ into the decoder, with the training goal of predicting the next token. We measure the perplexity of each sentence pair $(s_1, s_2)$ using the distribution model based on the following formula, $w_i$ being the $i$th token of the concatenated $(s_1, s_2)$ of length $n$:

$$PP(s_1, s_2) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(w_i \mid w_{1:i-1})}}. \tag{5}$$
We then fit a Weibull distribution to the perplexities of a held-back set of validation data, so that it can better model the right-skewed property of the distribution. We derive the exponential parameter $a$, the shape parameter $c$, the location parameter $loc$, and the scale parameter $scale$. During testing, $\lambda(s_1, s_2)$ can now be determined as:

$$\lambda(s_1, s_2) = \mathrm{cdf}\bigl(PP(s_1, s_2),\ \mathrm{Weibull}(a, c, loc, scale)\bigr). \tag{6}$$
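A sketch of how $\lambda$ could be computed in practice, assuming SciPy's exponentiated Weibull (scipy.stats.exponweib), whose parameters line up with the $a$, $c$, $loc$, and $scale$ above; the validation perplexities below are placeholder values, and the authors' exact fitting routine may differ.

```python
# Sketch (assumed implementation): fit an (exponentiated) Weibull to validation-set
# perplexities from the distribution model, then map a test pair's perplexity to
# lambda in [0, 1] via the fitted cdf (Eqns. 5-6).
import numpy as np
from scipy.stats import exponweib

def perplexity(token_log_probs):
    """Eqn. 5: exp of the negative mean token log-probability of the concatenated (s1, s2)."""
    return float(np.exp(-np.mean(token_log_probs)))

# Perplexities of held-back validation pairs under the distribution model (placeholders).
val_perplexities = np.array([12.3, 18.7, 25.1, 14.9, 40.2, 22.6, 17.4, 30.8])

# fit returns the exponent a, the shape c, and the location and scale parameters.
a, c, loc, scale = exponweib.fit(val_perplexities)

def ood_weight(test_perplexity):
    """Eqn. 6: lambda(s1, s2) = cdf of the test pair's perplexity under the fitted Weibull."""
    return float(exponweib.cdf(test_perplexity, a, c, loc=loc, scale=scale))
```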
For the final prediction, we predict the sentence pair to be a paraphrase if $S(s_1, s_2) \geq 0$ and a non-paraphrase otherwise. This forms what we referred to earlier as GAP (Generalized Autoregressive Paraphrase-Identification).
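Putting Eqns. 4-6 together, the GAP decision rule can be sketched as follows; the value of C shown is a placeholder, and the inputs correspond to the log-likelihoods and out-of-distribution weight defined above.

```python
# Sketch: the GAP prediction rule (Eqn. 4) with the sign test on S(s1, s2).
C = 0.0  # placeholder regularization constant; see the paper's Appendix for ablations

def gap_predict(pos_log_prob, neg_log_prob, lam):
    """pos/neg_log_prob: log P(s2 | s1, Y=1/0); lam: output of ood_weight."""
    score = pos_log_prob - (1.0 - lam) * neg_log_prob - lam * C
    return score >= 0  # True -> paraphrase, False -> non-paraphrase
```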
3.2.2 Capturing Interplay Between Positive and Negative Pairs
In practice, training a positive and negative model separately disregards the interplay between the positive and negative pairs, which could be important when the test pairs are in-distribution. To capture such interplay, we utilize both positive and negative pairs to train a discriminative model for sequence classification. Specifically, we first define a thresholding function based on the value of $\lambda$:

$$\tau(\lambda) = \begin{cases} 0 & \lambda < 0.9 \\ 1 & \text{otherwise.} \end{cases} \tag{7}$$
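As a small illustration, the thresholding function in Eqn. 7 is a hard switch on the out-of-distribution weight; the variable name below is ours.

```python
# Sketch of Eqn. 7: tau is 1 when the out-of-distribution weight lambda is at
# least 0.9, and 0 otherwise.
def tau(lam: float) -> int:
    return 0 if lam < 0.9 else 1
```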