Improving Large-scale Paraphrase Acquisition and Generation
Yao Dou, Chao Jiang, Wei Xu
School of Interactive Computing
Georgia Institute of Technology
{douy, chaojiang}@gatech.edu; wei.xu@cc.gatech.edu
http://twitter-paraphrase.com/
Abstract

This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MULTIPITCROWD) and expert (MULTIPITEXPERT) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MULTIPITNMR) and a large automatically constructed training set (MULTIPITAUTO) for paraphrase generation. With improved data annotation quality and a task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that paraphrase generation models trained on MULTIPITAUTO generate more diverse and higher-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
1 Introduction

Paraphrases are alternative expressions that convey a similar meaning (Bhagat and Hovy, 2013). Studying paraphrase facilitates research in both natural language understanding and generation. For instance, identifying paraphrases on social media is important for tracking the spread of misinformation (Bakshy et al., 2011) and capturing emerging events (Vosoughi and Roy, 2016). On the other hand, paraphrase generation improves the linguistic diversity of conversational agents (Li et al., 2016) and machine translation (Thompson and Post, 2020). It has also been successfully applied in data augmentation to improve information extraction (Zhang et al., 2015; Ferguson et al., 2018) and question answering systems (Gan and Ng, 2019).
[Figure 1 shows two sets of example paraphrases. A formal, similar set is grouped by a shared news URL (e.g., "In Tibet, climate change causes bigger, faster avalanches." / "Bigger, Faster Avalanches, Triggered by Climate Change in Tibet"); an informal, diverse set is grouped under the trending topic #FinallyFriday (e.g., "It's finally Friday I'm so happiiiiiii." / "I've never been so happy that it's finally Friday").]
Figure 1: Two sets of paraphrases in MULTIPIT, discussing a trending topic or a news article, respectively.
Many researchers have been leveraging Twitter data to study paraphrase, given its lexical and style diversity as well as its coverage of up-to-date events. However, existing Twitter-based paraphrase datasets, namely PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017), suffer from quality issues such as topic imbalance and annotation noise,[1] which limit the performance of the models trained on them. Moreover, past efforts on creating paraphrase corpora only consider one paraphrase criterion, without taking into account the fact that the desired "strictness" of semantic equivalence in paraphrases varies from task to task (Bhagat and Hovy, 2013; Liu and Soh, 2022). For example, for the purpose of tracking unfolding events, "A tsunami hit Haiti." and "303 people died because of the tsunami in Haiti" are sufficiently close to be considered paraphrases; whereas for paraphrase generation, the extra information "303 people dead" in the latter sentence may lead models to learn to hallucinate and generate more unfaithful content.

[1] 63% of sentences in Twitter-URL are related to the 2016 US presidential election, and 58% of sentences in PIT-2015 are about the NFL draft (more detailed analysis in §2.4).
Our Multi-Topic Paraphrase in Twitter (MULTIPITCROWD) dataset:

| Source | Topic Domain | #Train | #Dev | #Test | Sent/Tweet Len | %Paraphrase | #Trends/URLs | #Uniq Sent | %Multi-Ref |
|--------|---------------|--------|--------|--------|----------------|-------------|--------------|------------|------------|
| Trends | Sports | 25,255 | 3,157 | 3,157 | 10.24 / 13.79 | 40.52% | 1,201 | 34,786 | 17.89% |
| Trends | Entertainment | 11,547 | 1,443 | 1,444 | 10.44 / 13.80 | 62.33% | 610 | 15,784 | 18.11% |
| Trends | Event | 8,624 | 1,078 | 1,079 | 10.86 / 15.32 | 82.83% | 359 | 11,746 | 17.75% |
| Trends | Others | 17,751 | 2,219 | 2,219 | 10.41 / 14.56 | 67.16% | 817 | 24,286 | 18.33% |
| URL | Science/Tech | 7,384 | 923 | 923 | 10.94 / 19.17 | 46.13% | 1,032 | 10,327 | 17.74% |
| URL | Health | 9,123 | 1,140 | 1,141 | 11.29 / 21.68 | 46.78% | 1,298 | 12,772 | 17.86% |
| URL | Politics | 7,981 | 998 | 998 | 10.95 / 18.48 | 56.56% | 1,063 | 10,999 | 17.68% |
| URL | Finance | 4,552 | 569 | 569 | 11.19 / 23.08 | 18.96% | 554 | 5,907 | 20.13% |
| | Total | 92,217 | 11,527 | 11,530 | 10.62 / 16.10 | 53.73% | 6,934 | 124,438 | 18.65% |

Our MULTIPITEXPERT dataset and existing Twitter paraphrase datasets:

| Dataset | #Train | #Dev | #Test | Sent/Tweet Len | %Paraphrase | #Trends/URLs | #Uniq Sent | %Multi-Ref |
|---------|--------|------|-------|----------------|-------------|--------------|------------|------------|
| MULTIPITEXPERT | 4,458 | 555 | 557 | 12.08 / 17.02 | 53.11% | 200 | 5,743 | 100% |
| PIT-2015 (Xu et al.) | 13,063 | 4,727 | 972 | 11.9 / – | 30.60% | 420 | 19,297 | 24.67% |
| Twitter URL (Lan et al.) | 42,200 | – | 9,324 | – / 14.8 | 22.77% | 5,187 | 48,906 | 23.91% |

Table 1: Statistics of the MULTIPITCROWD and MULTIPITEXPERT datasets. The sentence/tweet lengths are calculated based on the number of tokens per unique sentence/tweet. %Multi-Ref denotes the percentage of source sentences with more than one paraphrase. Compared with prior work, our MULTIPITCROWD dataset has a significantly larger size, a higher portion of paraphrases, and a more balanced topic distribution.
In this paper, we present an effective data collection and annotation method to address these issues. We curate the Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus, which includes MULTIPITCROWD, a large crowdsourced set of 125K sentence pairs that is useful for tracking information on Twitter, and MULTIPITEXPERT, an expert-annotated set of 5.5K sentence pairs using a stricter definition that is more suitable for acquiring paraphrases for generation purposes. Compared to PIT-2015 and Twitter-URL, our corpus contains more than twice as much data, with a more balanced topic distribution and better annotation quality. Two sets of examples from MULTIPIT are shown in Figure 1.

We extensively evaluate several state-of-the-art neural language models on our datasets to demonstrate the importance of having a task-specific paraphrase definition. Our best model achieves 84.2 F1 for automatic paraphrase identification. In addition, we construct a continually growing paraphrase dataset, MULTIPITAUTO, by applying the automatic identification model to unlabelled Twitter data. Empirical results and analysis show that generation models fine-tuned on MULTIPITAUTO generate more diverse and higher-quality paraphrases compared to models trained on other corpora, such as MSCOCO (Lin et al., 2014), ParaNMT (Wieting and Gimpel, 2018), and Quora.[2] We hope our MULTIPIT corpus will facilitate future innovation in paraphrase research.

[2] https://www.kaggle.com/c/quora-question-pairs
2 Multi-Topic PIT Corpus

In this section, we present our data collection and annotation methodology for creating the MULTIPITCROWD and MULTIPITEXPERT datasets. The data statistics are detailed in Table 1.
2.1 Collection of Tweets

To gather paraphrases about a diverse set of topics as illustrated in Figure 1, we first group tweets that contain the same trending topic[3] (years 2014–2015) or the same URL (years 2017–2019), retrieved through Twitter public APIs[4] over a long time period. Specifically, for the URL-based method, we extract the URLs embedded in tweets posted by 15 news agency accounts (e.g., NYTScience, CNNPolitics, and ForbesTech). To get cleaner paraphrases, we split the tweets into sentences, eliminating the extra noise caused by multi-sentence tweets. More details of the improvements we made to address the data preprocessing issues in prior work are described in Appendix B.

[3] https://www.twitter.com/explore/tabs/trending
[4] https://developer.twitter.com/en/docs/twitter-api
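To make the grouping step concrete, the snippet below is a minimal Python sketch of bucketing tweets by embedded URL and splitting them into sentences. The tweet dictionaries, field names, and example values are hypothetical; the actual pipeline retrieves tweets through the Twitter public APIs and applies the additional preprocessing described in Appendix B.

```python
import re
from collections import defaultdict

# Hypothetical tweet records; real ones would come from the Twitter API.
tweets = [
    {"text": "Bigger, faster avalanches in Tibet. Climate change is the trigger.",
     "urls": ["https://example.com/tibet-avalanches"]},
    {"text": "In Tibet, climate change causes bigger, faster avalanches.",
     "urls": ["https://example.com/tibet-avalanches"]},
]

def split_sentences(text):
    # Naive splitter on sentence-final punctuation; the paper's pipeline
    # handles multi-sentence tweets more carefully (Appendix B).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

groups = defaultdict(list)  # URL -> candidate sentences for annotation
for tweet in tweets:
    for url in tweet["urls"]:
        groups[url].extend(split_sentences(tweet["text"]))
```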
2.2 Topic Classification and Balancing

To avoid a single type of topic dominating the entire dataset as in prior work (Xu et al., 2015; Lan et al., 2017), we manually categorize the topics for each group of tweets and balance their distribution. For trending topics, we ask three in-house annotators to classify them into 4 different categories: sports, entertainment, event, and others. All three annotators are college students with varied linguistic annotation experience, and each received an hour-long training session. For URLs, most of them link to news articles and have already been categorized by the news agency.[5] We include the tweets grouped by URLs that belong to the science/tech, health, politics, and finance categories.

[5] For example, the URL https://www.nytimes.com/2019/08/09/science/komodo-dragon-genome.html belongs to the science topic.
Figure 2: Topic breakdown of 100 randomly sampled sentence pairs from MULTIPITCROWD, PIT-2015, and Twitter-URL. Our MULTIPITCROWD corpus has a more balanced topic distribution.
2.3 Candidate Selection

The PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017) corpora contain only 31% and 23% sentence pairs that are paraphrases, respectively. To increase the portion of paraphrases and improve annotation efficiency, we introduce an additional step to filter out the tweet groups that contain either too much noise or too few paraphrases, and adaptively select sentence pairs for annotation (§2.4). For each of the trend-based groups, we first select the top 2 sentences using a simple ranking algorithm (Xu et al., 2015) based on the averaged probability of words. We pair each of these two sentences with 10 other sentences that are randomly sampled from the top 20 in each group. Among these 20 sentence pairs, if the annotators find n ∈ [4, 6], [7, 9], [10, 12], or [13, 20] pairs to be paraphrases, we further deploy 20, 30, 40, or 50 sentence pairs for annotation, respectively, pairing one of the top 5 ranked sentences with 10 sentences randomly selected from those ranked between top 6 and top 50. Since the URL-based groups generally contain fewer sentences, we instead select the top 11 sentences and ask annotators to choose one as the seed sentence that can be paired with the other 10 sentences to produce at least 3 paraphrase pairs. If such a seed sentence exists, we pair it with the remaining 10 sentences and deploy them for annotation. Otherwise, we skip the entire group.
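The sketch below illustrates this adaptive selection logic for trend-based groups. The ranking function approximates the "averaged probability of words" criterion of Xu et al. (2015) with within-group unigram frequencies; all function names and data structures are our own illustrative choices, not the paper's code.

```python
import random
from collections import Counter

def rank_sentences(sentences):
    """Rank sentences by the averaged unigram probability of their words,
    a rough stand-in for the ranking algorithm of Xu et al. (2015)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())

    def avg_prob(sentence):
        words = sentence.lower().split()
        return sum(counts[w] / total for w in words) / max(len(words), 1)

    return sorted(sentences, key=avg_prob, reverse=True)

def initial_pairs(sentences):
    """Pair each of the top-2 ranked sentences with 10 sentences sampled
    from the top 20, yielding the first 20 pairs sent for annotation."""
    ranked = rank_sentences(sentences)
    seeds, pool = ranked[:2], ranked[:20]
    return [(seed, other)
            for seed in seeds
            for other in random.sample([s for s in pool if s != seed], 10)]

def extra_pairs_to_deploy(n_paraphrases_found):
    """Map the paraphrase count among the first 20 pairs to the number of
    additional pairs deployed for annotation (Section 2.3)."""
    for lo, hi, n_extra in [(4, 6, 20), (7, 9, 30), (10, 12, 40), (13, 20, 50)]:
        if lo <= n_paraphrases_found <= hi:
            return n_extra
    return 0  # too few paraphrases found: stop annotating this group
```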
2.4 Crowd Annotation for Paraphrase Identification

We then annotate the selected sentence pairs using the crowdsourcing platform Figure Eight[6] to construct MULTIPITCROWD.

Annotation Process. We design a 1-vs-1 annotation schema, where we present one sentence pair to workers at a time and ask them to annotate whether it is a paraphrase pair or not. A screenshot of the annotation interface is provided in Appendix A.1. We collect 6 judgments for every sentence pair and pay $0.2 per annotation (>$7 per hour). For creating MULTIPITCROWD, with the purpose of identifying similar sentences and tracking information spreading on Twitter in mind, we consider two sentences as paraphrases even if one contains some new information that does not appear in the other sentence (see Figure 3 for examples). As a side note, because these sentences are grouped under the same trend or URL, the new information is always relevant and based on the context; otherwise, we consider them non-paraphrases.

Quality Control. In every five sentence pairs, we embed one hidden test sentence pair that is pre-labeled by one of the authors, and constantly monitor the workers' performance. Whenever annotators make a mistake on a test pair, they are alerted and provided with an explanation. Workers can continue in the task if they achieve >85% accuracy on the test pairs and >0.2 Cohen's kappa (Cohen, 1960) when compared with the majority vote of other workers. All workers are in the U.S.

Inter-Annotator Agreement. The average Cohen's kappa is 0.75 for URL-sourced sentence pairs, 0.69 for Trends-sourced ones, and 0.70 overall. We also sample 400 sentence pairs and hire two experienced in-house annotators to label them.

[6] https://www.appen.com/
[Figure 3 schematic: Sentence2 is a paraphrase of Sentence1 if Sentence1 implies Sentence2, allowing simplification, added commonsense, and added world knowledge. For tracking information on Twitter (MULTIPITCROWD), Sentence2 may additionally add information that requires fact-checking; for generation (MULTIPITEXPERT), it may not. Examples:
A1: "Sweden's first female PM Magdalena Andersson, resigns on DAY ONE!!" / A2: "Swedish PM Magdalena Andersson resigns hours after taking job." (simplification; add world knowledge)
B1: "Facebook announces it will be changing its name to Meta." / B2: "Facebook relaunches itself as 'Meta' in a clear bid to dominate the metaverse." (add info that requires fact-checking)
C1: "100% of the 140,000 U.S. jobs lost in December were held by women." / C2: "In fact women lost 111% of the jobs in December because men gained 16,000 jobs." (add info that requires fact-checking)]

Figure 3: Two different paraphrase definitions used for creating MULTIPITCROWD and MULTIPITEXPERT, with examples. The difference between the two criteria is whether Sentence2 containing new information that requires fact-checking is considered a paraphrase of Sentence1.[7]

[7] Examples C1 and C2 are on the more extreme side of the "loose" paraphrase criterion from the linguistic perspective; more average cases are shown in Figure 1.
Assuming the in-house annotation is gold, the F1 of the crowdworkers' majority vote is 89.1.
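A minimal sketch of this style of agreement analysis using scikit-learn is shown below; the label vectors are hypothetical placeholders rather than MULTIPIT data.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical binary labels (1 = paraphrase, 0 = not) on the same pairs.
worker_a = [1, 0, 1, 1, 0, 1]
worker_b = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(worker_a, worker_b))  # pairwise Cohen's kappa

majority_vote = [1, 0, 1, 1, 0, 1]  # aggregated crowd label per pair
expert_gold = [1, 0, 1, 0, 0, 1]    # in-house annotation treated as gold
print(f1_score(expert_gold, majority_vote))   # F1 of the crowd majority vote
```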
Assessing Topic Diversity. We manually examine 100 sentence pairs randomly sampled from MULTIPITCROWD, PIT-2015 (Xu et al., 2015), and Twitter-URL (Lan et al., 2017). Figure 2 shows the results of the manual inspection. MULTIPITCROWD has a much more balanced topic distribution, compared to prior work where 58% of sentences in PIT-2015 are about sports and 63% of sentences in Twitter-URL are politics-related. This improvement can be attributed to the long collection time period (§2.1) and the topic classification step (§2.2) in our data collection process. In contrast, PIT-2015 was collected within only 10 days (04/24/2013 – 05/03/2013) that were dominated by a popular sports event, the 2013 NFL draft (04/25 – 04/27), and Twitter-URL was collected during the 3 months of the 2016 US presidential election.
2.5 Expert Annotation for Paraphrase Generation

Text generation models are prone to memorizing training data and generating unfaithful hallucinations (Maynez et al., 2020; Carlini et al., 2021). Including paraphrase pairs that contain extra information beyond world or commonsense knowledge in the training data only worsens the problem, as shown in Table 15 in Appendix F. For the purpose of paraphrase generation, we further create MULTIPITEXPERT with expert annotations, using a stricter paraphrase definition than the one used in MULTIPITCROWD. The different paraphrase criteria used for creating these two datasets, and their corresponding examples, are illustrated in Figure 3.
Data Selection. To create a high-quality corpus that focuses on differentiating strict paraphrases from more loosely defined ones, we first use our best paraphrase identifier (§3) fine-tuned on MULTIPITCROWD to filter the sentence pairs, and then have experienced in-house annotators further annotate them. Specifically, we gather sentence pairs that are identified as paraphrases by the automatic classifier from 9,762 trending topic groups (from Oct–Dec 2021) and 181,254 URL groups (from Jan 2020–Jun 2021). To improve the diversity of our dataset, instead of presenting these pairs directly to the experts for annotation, we cluster the sentences by treating the paraphrase relationship as transitive, i.e., if sentence pairs (s1, s2) and (s2, s3) are both identified as paraphrases, then (s1, s2, s3) forms a cluster. For each trend or URL, we show two seed sentences paired with up to 30 sentences in the largest cluster for the experts to annotate. In total, we have 5,570 sentence pairs annotated for MULTIPITEXPERT, in which 100 sentences sourced by trend and 100 sourced by URL have at least 8 corresponding paraphrases. We use these 200 sets to form MULTIPITNMR, the first multi-reference test set for paraphrase generation evaluation (§4).
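The clustering step above amounts to taking connected components over identified paraphrase pairs. Below is a minimal union-find sketch; the data structure choice is ours, as the paper does not specify an implementation.

```python
class UnionFind:
    """Disjoint-set structure over sentence ids/strings."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster_paraphrases(pairs):
    """Treat the paraphrase relation as transitive: edges (s1, s2) and
    (s2, s3) place s1, s2, s3 in one cluster."""
    uf = UnionFind()
    for s1, s2 in pairs:
        uf.union(s1, s2)
    clusters = {}
    for s1, s2 in pairs:
        for s in (s1, s2):
            clusters.setdefault(uf.find(s), set()).add(s)
    return list(clusters.values())
```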
Expert Annotation. We ask two experienced annotators with linguistic backgrounds and rich annotation experience to annotate each sentence pair as paraphrase or not. The annotators thoroughly discuss pairs with inconsistent judgments until reaching an agreement. A screenshot of the updated annotation instructions is provided in Appendix A.2.
| Model | #Param. | LR | Precision | Recall | F1 | Accuracy | LR | Precision | Recall | F1 | Accuracy |
|-------|---------|------|-----------|--------|-------|----------|------|-----------|--------|-------|----------|
| ESIM | 17M | 4e-4 | 89.55 | 70.15 | 78.67 | 82.15 | 4e-4 | 47.07 | 91.73 | 62.22 | 49.19 |
| Infersent | 47M | 1e-3 | 87.03 | 87.57 | 87.29 | 86.47 | 1e-3 | 45.87 | 98.43 | 62.58 | 46.32 |
| T5-base | 220M | 1e-4 | 89.21 | 93.76 | 91.43 | 90.67 | 1e-4 | 71.96 | 83.86 | 77.45 | 77.74 |
| T5-large | 770M | 1e-5 | 90.36 | 93.58 | 91.94 | 91.29 | 1e-4 | 79.78 | 85.43 | 82.51 | 83.48 |
| BERT-base | 109M | 3e-5 | 88.59 | 91.24 | 89.90 | 89.12 | 2e-5 | 71.66 | 86.61 | 78.43 | 78.28 |
| BERT-large | 335M | 2e-5 | 88.73 | 93.17 | 90.90 | 90.10 | 2e-5 | 72.22 | 87.01 | 78.93 | 78.82 |
| RoBERTa-large | 355M | 2e-5 | 90.81 | 92.70 | 91.74 | 91.14 | 2e-5 | 77.01 | 83.07 | 79.92 | 80.97 |
| BERTweet-large | 355M | 2e-5 | 89.72 | 93.95 | 91.79 | 91.08 | 2e-5 | 82.47 | 81.50 | 81.98 | 83.66 |
| ALBERT-V2-xxlarge | 235M | 1e-5 | 90.36 | 92.96 | 91.64 | 91.00 | 2e-5 | 82.68 | 82.68 | 82.68 | 84.20 |
| DeBERTaV3-large | 400M | 5e-6 | 90.46 | 93.59 | 92.00 | 91.36 | 5e-6 | 82.56 | 83.86 | 83.20 | 84.56 |

Table 2: Results on the test sets of MULTIPITCROWD (left LR/Precision/Recall/F1/Accuracy block) and MULTIPITEXPERT (right block). Models are fine-tuned on the corresponding training set. DeBERTaV3-large performs the best on both datasets. LR: learning rate.
3 Paraphrase Identification

Paraphrase identification is the task of determining whether two given sentences are paraphrases or not. The two paraphrase definitions used in MULTIPITCROWD and MULTIPITEXPERT suit different downstream applications: tracking information on Twitter, and acquiring high-quality paraphrase pairs for training generation models. Paraphrase identification models trained on our datasets achieve over 84 F1 in each case.

Experimental Setup. As each sentence pair in MULTIPITCROWD has six judgments, we use 3 as the threshold: pairs with >3 paraphrase judgments are labeled as paraphrases, and those with <3 paraphrase judgments are labeled as non-paraphrases. We split MULTIPITCROWD and MULTIPITEXPERT into 80/10/10% train/dev/test partitions by time, such that the oldest data are used for training. More details on the implementation and hyperparameter tuning are in Appendix C.
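A small sketch of this labeling rule follows; how an exact 3–3 split is handled is not stated above, so the sketch leaves that case open.

```python
def label_from_judgments(n_paraphrase_votes, threshold=3):
    """Map the six crowd judgments for a pair to a binary label (Section 3)."""
    if n_paraphrase_votes > threshold:
        return 1  # paraphrase
    if n_paraphrase_votes < threshold:
        return 0  # non-paraphrase
    return None   # 3-3 tie; treatment not specified in the text above
```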
3.1 Models

We consider an encoder-decoder language model, T5 (Raffel et al., 2020), and five masked language models: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), BERTweet (Nguyen et al., 2020), and DeBERTaV3 (He et al., 2021). We also include two competitive BiLSTM-based models, Infersent (Conneau et al., 2017) and ESIM (Chen et al., 2017), to establish a comparison with pre-BERT era work.
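As an illustration, a sentence-pair classifier like those above can be fine-tuned with the Hugging Face transformers library roughly as follows. This is a minimal sketch, not the paper's training code: the toy in-memory data, batch size, and epoch count are assumptions, while the learning rate for DeBERTaV3-large follows Table 2.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for a MULTIPIT training split (sentence1, sentence2, label).
train = Dataset.from_dict({
    "sentence1": ["It's finally Friday and that's all that matters."],
    "sentence2": ["So so so so thankful it's finally Friday."],
    "label": [1],
})

name = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def encode(batch):
    # Encode the sentence pair jointly, as in standard pair classification.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

train = train.map(encode, batched=True)

args = TrainingArguments(output_dir="out",
                         learning_rate=5e-6,  # LR for DeBERTaV3-large (Table 2)
                         per_device_train_batch_size=16,  # assumed
                         num_train_epochs=3)              # assumed
Trainer(model=model, args=args, train_dataset=train).train()
```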
3.2 Results

Table 2 presents results for the models fine-tuned on each dataset. DeBERTaV3-large achieves the best results, with 92.0 F1 on MULTIPITCROWD and 83.2 F1 on MULTIPITEXPERT. Transformer-based models consistently outperform BiLSTM-based models, especially on MULTIPITEXPERT.
| Method | Data | P. | R. | F1 | Acc. |
|--------|------|-------|-------|-------|-------|
| Fine-tuning | MC | 61.81 | 88.58 | 72.82 | 69.84 |
| Fine-tuning | ME | 82.56 | 83.86 | 83.20 | 84.56 |
| Fine-tuning | MC + ME | 62.99 | 87.80 | 73.36 | 70.92 |
| + Filtering | MC + ME | 77.24 | 88.19 | 82.35 | 82.76 |
| + Flipping | MC + ME | 83.40 | 85.04 | 84.21 | 85.46 |

Table 3: Results of different methods on the test set of MULTIPITEXPERT. MC: MULTIPITCROWD; ME: MULTIPITEXPERT. We use DeBERTaV3-large in these experiments.
Beyond Fine-tuning. As MULTIPITCROWD is a large-scale dataset annotated with a loose paraphrase definition, we test whether leveraging these "noisy" data improves model performance on MULTIPITEXPERT. To reduce the noise that comes from the difference in definitions, we first adjust the labeling threshold for MULTIPITCROWD from 3 to 4. Then we consider two noisy-training techniques adopted in prior work (Xie et al., 2020; Zhang and Sabuncu, 2018), namely filtering and flipping. Specifically, we fine-tune a teacher model on MULTIPITEXPERT and use it to go through MULTIPITCROWD as follows: for each sentence pair p, if its label is i (0 for non-paraphrase, 1 for paraphrase) and P_teacher(y = i | p) ≤ λ, we either filter out p or flip its label to 1 − i (i.e., 0 ↔ 1).[8] Next, we fine-tune a new model on the combination of MULTIPITEXPERT and the re-labeled MULTIPITCROWD. The experimental results are shown in Table 3. Compared to fine-tuning […]

[8] We perform a small grid search on λ over {0.05, 0.15, 0.25, 0.35, 0.45}, and find 0.35 works well for the filtering method and 0.25 for the flipping method.
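A minimal sketch of the filter/flip relabeling under these definitions is shown below; the data structures are illustrative, and the teacher probabilities are assumed to be precomputed for each pair.

```python
def relabel_with_teacher(pairs, labels, teacher_probs, lam, mode="filter"):
    """Apply the filtering/flipping rule: if the teacher assigns probability
    <= lam to a pair's current label, drop the pair ("filter") or flip its
    label 0 <-> 1 ("flip"). teacher_probs[j][i] = P_teacher(y = i | pair j).
    Footnote 8 suggests lam = 0.35 for filtering and 0.25 for flipping."""
    kept_pairs, kept_labels = [], []
    for pair, label, probs in zip(pairs, labels, teacher_probs):
        if probs[label] > lam:      # teacher is confident enough: keep as-is
            kept_pairs.append(pair)
            kept_labels.append(label)
        elif mode == "flip":        # teacher disagrees strongly: flip label
            kept_pairs.append(pair)
            kept_labels.append(1 - label)
        # mode == "filter": drop the pair entirely
    return kept_pairs, kept_labels
```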