
that these models can be especially fragile in the face of distribution shift [45]. In other words, when a model trained on a source dataset $D_s$ is tested on another dataset, $D_t$, collected and annotated independently and with a distribution shift, the classification accuracy drops significantly [62].
This paper presents our findings and observations that negative pairs (i.e., non-paraphrase pairs) in
the training set, as opposed to the positive pairs, do not generalize well to out-of-distribution test
pairs. Intuitively, negative pairs only represent a limited perspective of how the meanings of sentences
can be different (and indeed it is practically infeasible to represent every possible perspective). We
conjecture that negative pairs are so specific to the dataset that they adversely encourage the model
to learn biased representations. We illustrate this observation in Figure 1. Quora Question Pairs (QQP)² extracts its negative pairs from questions on similar topics. Paraphrase Adversaries from Word Scrambling (PAWS) [62] generates negative pairs primarily by word swapping. The WMT Metrics Task 2017 (WMT) [8] treats poor machine translations as negative examples.
We therefore hypothesize that biases introduced by the different ways negative pairs are mined are
major causes of the poor generalizability of paraphrase identification models.
Based on this observation, we would like to be able to control the reliance on negative pairs for out-of-distribution prediction. To achieve this, we propose to explicitly train two separate models for the positive and negative pairs (we refer to them as the positive and negative models, respectively), which gives us the option to choose when to use the negative model. It is well known that training on positive pairs alone can lead to a degenerate solution [21, 59, 42, 13]; e.g., a constant-function network would still achieve a perfect training loss. To prevent this, we propose a novel generative framework built on an autoregressive transformer [39, 49], specifically BART [30]. Given two sentences, we condition the prediction of the next token in the second sentence on the first sentence and the previous tokens. In a Bayesian sense, this means the predicted tokens have a higher probability of forming a positive or negative pair with the first sentence under the positive and negative model, respectively. This learning strategy has no degenerate solutions even when the positive and negative models are trained separately. We call our proposed approach GAP, which stands for Generalized Autoregressive Paraphrase-Identification. One potential pitfall of GAP is that it ignores the "interplay" between positive and negative pairs that would otherwise be learned if they were used together in training. This interplay is especially important when the test pairs are in-distribution. To overcome this, we use an extra discriminative model, trained on both positive and negative pairs, to capture it. We call this extension GAPX (pronounced "Gaps") to reflect the eXtra discriminative model.
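To make the conditional formulation concrete, the sketch below scores a sentence pair under two separately fine-tuned BART models with Hugging Face's transformers library. The checkpoint paths and the helper conditional_log_likelihood are hypothetical placeholders, and the decision rule shown is a simplification rather than the full GAP inference procedure.

```python
# Minimal sketch: score p(s2 | s1) under a positive and a negative BART model.
# Checkpoint paths and the helper name are illustrative assumptions.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def conditional_log_likelihood(model, s1: str, s2: str) -> float:
    """Log-likelihood of s2 given s1 under an autoregressive encoder-decoder."""
    enc = tokenizer(s1, return_tensors="pt")
    dec = tokenizer(s2, return_tensors="pt")
    with torch.no_grad():
        # Teacher forcing: each token of s2 is predicted from s1 plus the
        # previous tokens of s2, exactly the conditioning described above.
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    labels=dec.input_ids)
    # out.loss is the mean per-token cross-entropy; undo the mean to get
    # the sentence-level log-likelihood.
    return -out.loss.item() * dec.input_ids.size(1)

# Hypothetical checkpoints: one BART fine-tuned on positive pairs only,
# one on negative pairs only.
pos_model = BartForConditionalGeneration.from_pretrained("path/to/gap-positive")
neg_model = BartForConditionalGeneration.from_pretrained("path/to/gap-negative")

s1 = "How do I learn Python?"
s2 = "What is the best way to learn Python?"
# Simplified decision: the pair is a paraphrase when the positive model
# explains s2 better than the negative model does.
is_paraphrase = (conditional_log_likelihood(pos_model, s1, s2) >
                 conditional_log_likelihood(neg_model, s1, s2))
```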
For all practical purposes, the weights placed on the positive, negative, and/or discriminative models in GAP and GAPX need to be determined automatically during inference. For in-distribution pairs, we want to use all of them, while for out-of-distribution pairs, we want to rely much more heavily on the positive model. This naturally raises the question of how to determine whether a given pair is in or out of distribution [9, 14, 43, 17, 20]. During testing, our method ensembles the positive model, the negative model, and the discriminative model based on the degree of similarity of the test pair to the training pairs, and we find that this works well for our purpose. We measure this degree of similarity with the cumulative distribution function (CDF) of perplexity [25], and show that it is superior to other measures.
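As an illustration of this mechanism, the following sketch fits an empirical CDF over the perplexities of training pairs and maps a test pair's perplexity to an ensembling weight. The function names and the exact weighting rule are simplifying assumptions rather than the precise formulation used in our experiments.

```python
# Sketch: perplexity-CDF-based weighting for automatic ensembling.
import numpy as np

def empirical_cdf(train_perplexities: np.ndarray):
    """Return F where F(x) is the fraction of training perplexities <= x."""
    sorted_ppl = np.sort(train_perplexities)
    def cdf(x: float) -> float:
        return np.searchsorted(sorted_ppl, x, side="right") / len(sorted_ppl)
    return cdf

def in_distribution_weight(cdf, test_ppl: float) -> float:
    """Near 1 when the test pair's perplexity is typical of training pairs;
    near 0 in the high-perplexity tail, i.e., likely out-of-distribution."""
    return 1.0 - cdf(test_ppl)

# Usage (schematic): blend the full in-distribution ensemble with the
# positive-only score, leaning on the positive model as w shrinks.
#   w = in_distribution_weight(cdf, test_ppl)
#   score = w * score_full_ensemble + (1 - w) * score_positive_only
```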
In summary, our contributions are as follows:
1. We report new research insights, supported by empirical results, that the negative pairs of a dataset could potentially introduce biases that will prevent a paraphrase identification model from generalizing to out-of-distribution pairs.
2. To overcome this, we propose a novel autoregressive modeling approach to train both a positive and a negative model, and ensemble them automatically during inference. Further, we observe that the interplay between positive and negative pairs is important for in-distribution inference, for which we add a discriminative model. We then introduce a new perplexity-based approach to determine whether a given pair is out-of-distribution, enabling automatic ensembling.
3. We support our proposal with strong empirical results. Compared to state-of-the-art transformers in out-of-distribution performance, our model achieves an average of 11.5% and 8.4% improvement in terms of macro F1 and accuracy, respectively, over 7 different out-
²https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs