
| Topic Domains | #Train | #Dev | #Test | Sent/Tweet Len | %Paraphrase | #Trends/URLs | #Uniq Sent | %Multi-Ref |
|---|---|---|---|---|---|---|---|---|
| Our Multi-Topic Paraphrase in Twitter (MULTIPITCROWD) Dataset | | | | | | | | |
| Trends: Sports | 25,255 | 3,157 | 3,157 | 10.24 / 13.79 | 40.52% | 1,201 | 34,786 | 17.89% |
| Trends: Entertainment | 11,547 | 1,443 | 1,444 | 10.44 / 13.80 | 62.33% | 610 | 15,784 | 18.11% |
| Trends: Event | 8,624 | 1,078 | 1,079 | 10.86 / 15.32 | 82.83% | 359 | 11,746 | 17.75% |
| Trends: Others | 17,751 | 2,219 | 2,219 | 10.41 / 14.56 | 67.16% | 817 | 24,286 | 18.33% |
| URL: Science/Tech | 7,384 | 923 | 923 | 10.94 / 19.17 | 46.13% | 1,032 | 10,327 | 17.74% |
| URL: Health | 9,123 | 1,140 | 1,141 | 11.29 / 21.68 | 46.78% | 1,298 | 12,772 | 17.86% |
| URL: Politics | 7,981 | 998 | 998 | 10.95 / 18.48 | 56.56% | 1,063 | 10,999 | 17.68% |
| URL: Finance | 4,552 | 569 | 569 | 11.19 / 23.08 | 18.96% | 554 | 5,907 | 20.13% |
| Total | 92,217 | 11,527 | 11,530 | 10.62 / 16.10 | 53.73% | 6,934 | 124,438 | 18.65% |
| Our MULTIPITEXPERT Dataset | 4,458 | 555 | 557 | 12.08 / 17.02 | 53.11% | 200 | 5,743 | 100% |
| Existing Twitter Paraphrase Datasets | | | | | | | | |
| PIT-2015 (Xu et al., 2015) | 13,063 | 4,727 | 972 | 11.9 / – | 30.60% | 420 | 19,297 | 24.67% |
| Twitter URL (Lan et al., 2017) | 42,200 | – | 9,324 | – / 14.8 | 22.77% | 5,187 | 48,906 | 23.91% |
Table 1: Statistics of the MULTIPITCROWD and MULTIPITEXPERT datasets. Sentence/tweet lengths are calculated as the number of tokens per unique sentence/tweet. %Multi-Ref denotes the percentage of source sentences with more than one paraphrase. Compared with prior work, our MULTIPITCROWD dataset has a significantly larger size, a higher proportion of paraphrases, and a more balanced topic distribution.
hallucinate and generate more unfaithful content.
In this paper, we present an effective data collection and annotation method to address these issues. We curate the Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus, which includes MULTIPITCROWD, a large crowdsourced set of 125K sentence pairs that is useful for tracking information on Twitter, and MULTIPITEXPERT, an expert-annotated set of 5.5K sentence pairs using a stricter definition that is more suitable for acquiring paraphrases for generation purposes. Compared to PIT-2015 and Twitter-URL, our corpus contains more than twice as much data, with a more balanced topic distribution and better annotation quality. Two sets of examples from MULTIPIT are shown in Figure 1.
We extensively evaluate several state-of-the-art neural language models on our datasets to demonstrate the importance of having a task-specific paraphrase definition. Our best model achieves 84.2 F1 for automatic paraphrase identification. In addition, we construct a continually growing paraphrase dataset, MULTIPITAUTO, by applying the automatic identification model to unlabelled Twitter data. Empirical results and analysis show that generation models fine-tuned on MULTIPITAUTO generate more diverse and higher-quality paraphrases compared to models trained on other corpora, such as MSCOCO (Lin et al., 2014), ParaNMT (Wieting and Gimpel, 2018), and Quora.[2] We hope our MULTIPIT corpus will facilitate future innovation in paraphrase research.
[2] https://www.kaggle.com/c/quora-question-pairs
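The construction of a continually growing dataset from unlabelled data can be sketched as follows. This is a minimal illustration, not the authors' implementation: `paraphrase_score` is a toy token-overlap stand-in for the fine-tuned identification model, and the `threshold` value is an assumption.

```python
from itertools import combinations

def paraphrase_score(a, b):
    """Toy stand-in for a trained paraphrase identification model:
    Jaccard overlap of lowercased token sets, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def build_auto_pairs(sentence_groups, threshold=0.5):
    """Score all candidate pairs within each topic/URL group and keep
    those the model labels as paraphrases (score >= threshold)."""
    pairs = []
    for group in sentence_groups:
        for a, b in combinations(group, 2):
            score = paraphrase_score(a, b)
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs

groups = [["the team won the final",
           "the team won the championship final"]]
auto_pairs = build_auto_pairs(groups)
```

Because new tweets arrive continuously, rerunning this filter over freshly grouped tweets is what lets the dataset keep growing without further human annotation.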
2 Multi-Topic PIT Corpus
In this section, we present our data collection and annotation methodology for creating the MULTIPITCROWD and MULTIPITEXPERT datasets. The data statistics are detailed in Table 1.
2.1 Collection of Tweets
To gather paraphrases about a diverse set of topics, as illustrated in Figure 1, we first group tweets that contain the same trending topic[3] (years 2014–2015) or the same URL (years 2017–2019), retrieved through the Twitter public APIs[4] over a long time period. Specifically, for the URL-based method, we extract the URLs embedded in tweets posted by 15 news agency accounts (e.g., NYTScience, CNNPolitics, and ForbesTech). To get cleaner paraphrases, we split the tweets into sentences, eliminating the extra noise introduced by multi-sentence tweets. More details of the improvements we made to address the data preprocessing issues in prior work are described in Appendix B.
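The URL-based grouping step can be sketched as follows. This is a simplified illustration under stated assumptions: the regex-based URL extraction and the `group_by_url` helper are ours, and in practice shortened links (e.g., t.co redirects) would first need to be resolved to a canonical URL before grouping.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")

def group_by_url(tweets):
    """Group tweet texts that embed the same URL; each group is a set
    of candidate paraphrases about the same underlying news article."""
    groups = defaultdict(list)
    for tweet in tweets:
        for url in URL_RE.findall(tweet):
            # Strip the URL itself so only the tweet text remains.
            text = URL_RE.sub("", tweet).strip()
            groups[url].append(text)
    return groups

tweets = [
    "Breakthrough in battery tech https://ex.ample/abc",
    "New battery breakthrough announced https://ex.ample/abc",
]
candidate_groups = group_by_url(tweets)
```

Each resulting group then feeds into sentence splitting and pairwise annotation; tweets sharing a trending topic are grouped analogously, keyed on the topic string instead of the URL.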
2.2 Topic Classification and Balancing
To avoid a single type of topic dominating the entire dataset, as in prior work (Xu et al., 2015; Lan et al., 2017), we manually categorize the topics for each group of tweets and balance their distribution. For trending topics, we ask three in-house annotators to classify them into 4 different categories: sports, entertainment, event, and others. All three
[3] https://www.twitter.com/explore/tabs/trending
[4] https://developer.twitter.com/en/docs/twitter-api