Improving Large-scale Paraphrase Acquisition and Generation
Yao Dou, Chao Jiang, Wei Xu
School of Interactive Computing
Georgia Institute of Technology
{douy, chaojiang}@gatech.edu; wei.xu@cc.gatech.edu
http://twitter-paraphrase.com/
Abstract

This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MULTIPITCROWD) and expert (MULTIPITEXPERT) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MULTIPITNMR) and a large automatically constructed training set (MULTIPITAUTO) for paraphrase generation. With improved data annotation quality and a task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that paraphrase generation models trained on MULTIPITAUTO generate more diverse and higher-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
1 Introduction

Paraphrases are alternative expressions that convey a similar meaning (Bhagat and Hovy, 2013). Studying paraphrase facilitates research in both natural language understanding and generation. For instance, identifying paraphrases on social media is important for tracking the spread of misinformation (Bakshy et al., 2011) and capturing emerging events (Vosoughi and Roy, 2016). On the other hand, paraphrase generation improves the linguistic diversity of conversational agents (Li et al., 2016) and machine translation (Thompson and Post, 2020). It has also been successfully applied in data augmentation to improve information extraction (Zhang et al., 2015; Ferguson et al., 2018) and question answering systems (Gan and Ng, 2019).
[Figure 1 shows two sets of example paraphrases. A formal, similar set is grouped by a shared news URL (e.g., "In Tibet, climate change causes bigger, faster avalanches." / "Bigger, Faster Avalanches, Triggered by Climate Change in Tibet"); an informal, diverse set is grouped under the trending topic #FinallyFriday (e.g., "It's finally Friday I'm so happiiiiiii." / "I've never been so happy that it's finally Friday").]
Figure 1: Two sets of paraphrases in MULTIPIT, discussing a trending topic or a news article, respectively.
Many researchers have been leveraging Twitter data to study paraphrase, given its lexical and style diversity as well as its coverage of up-to-date events. However, existing Twitter-based paraphrase datasets, namely PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017), suffer from quality issues such as topic imbalance and annotation noise,[1] which limit the performance of the models trained on them. Moreover, past efforts on creating paraphrase corpora only consider one paraphrase criterion, without taking into account the fact that the desired "strictness" of semantic equivalence in paraphrases varies from task to task (Bhagat and Hovy, 2013; Liu and Soh, 2022). For example, for the purpose of tracking unfolding events, "A tsunami hit Haiti." and "303 people died because of the tsunami in Haiti" are sufficiently close to be considered paraphrases; whereas for paraphrase generation, the extra information "303 people dead" in the latter sentence may lead models to learn to hallucinate and generate more unfaithful content.

[1] 63% of sentences in Twitter-URL are related to the 2016 US presidential election, and 58% of sentences in PIT-2015 are about the NFL draft (more detailed analysis in §2.4).
Our Multi-Topic Paraphrase in Twitter (MULTIPITCROWD) dataset:

| Source | Topic Domain | #Train | #Dev | #Test | Sent/Tweet Len | %Paraphrase | #Trends/URLs | #Uniq Sent | %Multi-Ref |
|--------|---------------|--------|--------|--------|----------------|-------------|--------------|------------|------------|
| Trends | Sports | 25,255 | 3,157 | 3,157 | 10.24 / 13.79 | 40.52% | 1,201 | 34,786 | 17.89% |
| Trends | Entertainment | 11,547 | 1,443 | 1,444 | 10.44 / 13.80 | 62.33% | 610 | 15,784 | 18.11% |
| Trends | Event | 8,624 | 1,078 | 1,079 | 10.86 / 15.32 | 82.83% | 359 | 11,746 | 17.75% |
| Trends | Others | 17,751 | 2,219 | 2,219 | 10.41 / 14.56 | 67.16% | 817 | 24,286 | 18.33% |
| URL | Science/Tech | 7,384 | 923 | 923 | 10.94 / 19.17 | 46.13% | 1,032 | 10,327 | 17.74% |
| URL | Health | 9,123 | 1,140 | 1,141 | 11.29 / 21.68 | 46.78% | 1,298 | 12,772 | 17.86% |
| URL | Politics | 7,981 | 998 | 998 | 10.95 / 18.48 | 56.56% | 1,063 | 10,999 | 17.68% |
| URL | Finance | 4,552 | 569 | 569 | 11.19 / 23.08 | 18.96% | 554 | 5,907 | 20.13% |
| | Total | 92,217 | 11,527 | 11,530 | 10.62 / 16.10 | 53.73% | 6,934 | 124,438 | 18.65% |

Our MULTIPITEXPERT dataset and existing Twitter paraphrase datasets:

| Dataset | #Train | #Dev | #Test | Sent/Tweet Len | %Paraphrase | #Trends/URLs | #Uniq Sent | %Multi-Ref |
|---------|--------|------|-------|----------------|-------------|--------------|------------|------------|
| MULTIPITEXPERT | 4,458 | 555 | 557 | 12.08 / 17.02 | 53.11% | 200 | 5,743 | 100% |
| PIT-2015 (Xu et al.) | 13,063 | 4,727 | 972 | 11.9 / – | 30.60% | 420 | 19,297 | 24.67% |
| Twitter URL (Lan et al.) | 42,200 | – | 9,324 | – / 14.8 | 22.77% | 5,187 | 48,906 | 23.91% |

Table 1: Statistics of the MULTIPITCROWD and MULTIPITEXPERT datasets. The sentence/tweet lengths are calculated based on the number of tokens per unique sentence/tweet. %Multi-Ref denotes the percentage of source sentences with more than one paraphrase. Compared with prior work, our MULTIPITCROWD dataset has a significantly larger size, a higher portion of paraphrases, and a more balanced topic distribution.
In this paper, we present an effective data collection and annotation method to address these issues. We curate the Multi-Topic Paraphrase in Twitter (MULTIPIT) corpus, which includes MULTIPITCROWD, a large crowdsourced set of 125K sentence pairs that is useful for tracking information on Twitter, and MULTIPITEXPERT, an expert-annotated set of 5.5K sentence pairs using a stricter definition that is more suitable for acquiring paraphrases for generation purposes. Compared to PIT-2015 and Twitter-URL, our corpus contains more than twice as much data, with a more balanced topic distribution and better annotation quality. Two sets of examples from MULTIPIT are shown in Figure 1.

We extensively evaluate several state-of-the-art neural language models on our datasets to demonstrate the importance of having a task-specific paraphrase definition. Our best model achieves 84.2 F1 for automatic paraphrase identification. In addition, we construct a continually growing paraphrase dataset, MULTIPITAUTO, by applying the automatic identification model to unlabelled Twitter data. Empirical results and analysis show that generation models fine-tuned on MULTIPITAUTO generate more diverse and higher-quality paraphrases compared to models trained on other corpora, such as MSCOCO (Lin et al., 2014), ParaNMT (Wieting and Gimpel, 2018), and Quora.[2] We hope our MULTIPIT corpus will facilitate future innovation in paraphrase research.

[2] https://www.kaggle.com/c/quora-question-pairs
2 Multi-Topic PIT Corpus

In this section, we present our data collection and annotation methodology for creating the MULTIPITCROWD and MULTIPITEXPERT datasets. The data statistics are detailed in Table 1.
2.1 Collection of Tweets

To gather paraphrases about a diverse set of topics as illustrated in Figure 1, we first group tweets that contain the same trending topic[3] (years 2014–2015) or the same URL (years 2017–2019), retrieved through Twitter public APIs[4] over a long time period. Specifically, for the URL-based method, we extract the URLs embedded in tweets posted by 15 news agency accounts (e.g., NYTScience, CNNPolitics, and ForbesTech). To get cleaner paraphrases, we split the tweets into sentences, eliminating the extra noise caused by multi-sentence tweets. More details of the improvements we made to address the data preprocessing issues in prior work are described in Appendix B.

[3] https://www.twitter.com/explore/tabs/trending
[4] https://developer.twitter.com/en/docs/twitter-api
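To make the grouping step concrete, the snippet below is a minimal Python sketch of bucketing tweets by embedded URL and splitting them into sentences. The tweet dictionaries, field names, and example values are hypothetical; the actual pipeline retrieves tweets through the Twitter public APIs and applies the additional preprocessing described in Appendix B.

```python
import re
from collections import defaultdict

# Hypothetical tweet records; real ones would come from the Twitter API.
tweets = [
    {"text": "Bigger, faster avalanches in Tibet. Climate change is the trigger.",
     "urls": ["https://example.com/tibet-avalanches"]},
    {"text": "In Tibet, climate change causes bigger, faster avalanches.",
     "urls": ["https://example.com/tibet-avalanches"]},
]

def split_sentences(text):
    # Naive splitter on sentence-final punctuation; the paper's pipeline
    # handles multi-sentence tweets more carefully (Appendix B).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

groups = defaultdict(list)  # URL -> candidate sentences for annotation
for tweet in tweets:
    for url in tweet["urls"]:
        groups[url].extend(split_sentences(tweet["text"]))
```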
2.2 Topic Classification and Balancing

To avoid a single type of topic dominating the entire dataset as in prior work (Xu et al., 2015; Lan et al., 2017), we manually categorize the topics for each group of tweets and balance their distribution. For trending topics, we ask three in-house annotators to classify them into 4 different categories: sports, entertainment, event, and others. All three annotators are college students with varied linguistic annotation experience, and each received an hour-long training session. For URLs, most of them link to news articles and have already been categorized by the news agency.[5] We include the tweets grouped by URLs that belong to the science/tech, health, politics, and finance categories.

[5] For example, the URL https://www.nytimes.com/2019/08/09/science/komodo-dragon-genome.html belongs to the science topic.
Figure 2: Topic breakdown of 100 randomly sampled sentence pairs from MULTIPITCROWD, PIT-2015, and Twitter-URL. Our MULTIPITCROWD corpus has a more balanced topic distribution.
2.3 Candidate Selection

The PIT-2015 (Xu et al., 2015) and Twitter-URL (Lan et al., 2017) corpora contain only 31% and 23% sentence pairs that are paraphrases, respectively. To increase the portion of paraphrases and improve annotation efficiency, we introduce an additional step to filter out the tweet groups that contain either too much noise or too few paraphrases, and adaptively select sentence pairs for annotation (§2.4). For each of the trend-based groups, we first select the top 2 sentences using a simple ranking algorithm (Xu et al., 2015) based on the averaged probability of words. We pair each of these two sentences with 10 other sentences that are randomly sampled from the top 20 in each group. Among these 20 sentence pairs, if the annotators find n ∈ [4, 6], [7, 9], [10, 12], or [13, 20] pairs to be paraphrases, we further deploy 20, 30, 40, or 50 sentence pairs for annotation, respectively, pairing one of the top 5 ranked sentences with 10 sentences randomly selected from those ranked between top 6 and top 50. Since the URL-based groups generally contain fewer sentences, we instead select the top 11 sentences and ask annotators to choose one as the seed sentence that can be paired with the other 10 sentences to produce at least 3 paraphrase pairs. If such a seed sentence exists, we pair it with the remaining 10 sentences and deploy them for annotation. Otherwise, we skip the entire group.
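The sketch below illustrates this adaptive selection logic for trend-based groups. The ranking function approximates the "averaged probability of words" criterion of Xu et al. (2015) with within-group unigram frequencies; all function names and data structures are our own illustrative choices, not the paper's code.

```python
import random
from collections import Counter

def rank_sentences(sentences):
    """Rank sentences by the averaged unigram probability of their words,
    a rough stand-in for the ranking algorithm of Xu et al. (2015)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())

    def avg_prob(sentence):
        words = sentence.lower().split()
        return sum(counts[w] / total for w in words) / max(len(words), 1)

    return sorted(sentences, key=avg_prob, reverse=True)

def initial_pairs(sentences):
    """Pair each of the top-2 ranked sentences with 10 sentences sampled
    from the top 20, yielding the first 20 pairs sent for annotation."""
    ranked = rank_sentences(sentences)
    seeds, pool = ranked[:2], ranked[:20]
    return [(seed, other)
            for seed in seeds
            for other in random.sample([s for s in pool if s != seed], 10)]

def extra_pairs_to_deploy(n_paraphrases_found):
    """Map the paraphrase count among the first 20 pairs to the number of
    additional pairs deployed for annotation (Section 2.3)."""
    for lo, hi, n_extra in [(4, 6, 20), (7, 9, 30), (10, 12, 40), (13, 20, 50)]:
        if lo <= n_paraphrases_found <= hi:
            return n_extra
    return 0  # too few paraphrases found: stop annotating this group
```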
2.4 Crowd Annotation for Paraphrase Identification

We then annotate the selected sentence pairs using the crowdsourcing platform Figure Eight[6] to construct MULTIPITCROWD.

Annotation Process. We design a 1-vs-1 annotation schema, where we present one sentence pair to workers at a time and ask them to annotate whether it is a paraphrase pair or not. A screenshot of the annotation interface is provided in Appendix A.1. We collect 6 judgments for every sentence pair and pay $0.2 per annotation (>$7 per hour). For creating MULTIPITCROWD, with the purpose of identifying similar sentences and tracking information spreading on Twitter in mind, we consider two sentences as paraphrases even if one contains some new information that does not appear in the other sentence (see Figure 3 for examples). As a side note, because these sentences are grouped under the same trend or URL, the new information is always relevant and based on the context; otherwise, we consider them non-paraphrases.

Quality Control. In every five sentence pairs, we embed one hidden test sentence pair that is pre-labeled by one of the authors, and constantly monitor the workers' performance. Whenever annotators make a mistake on a test pair, they are alerted and provided with an explanation. Workers can continue in the task if they achieve >85% accuracy on the test pairs and >0.2 Cohen's kappa (Cohen, 1960) when compared with the majority vote of other workers. All workers are in the U.S.

Inter-Annotator Agreement. The average Cohen's kappa is 0.75 for URL-sourced sentence pairs, 0.69 for Trends-sourced ones, and 0.70 overall. We also sample 400 sentence pairs and hire two experienced in-house annotators to label them.

[6] https://www.appen.com/
[Figure 3 schematic: Sentence2 is a paraphrase of Sentence1 if Sentence1 implies Sentence2, allowing simplification, added commonsense, and added world knowledge. For tracking information on Twitter (MULTIPITCROWD), Sentence2 may additionally add information that requires fact-checking; for generation (MULTIPITEXPERT), it may not. Examples:
A1: "Sweden's first female PM Magdalena Andersson, resigns on DAY ONE!!" / A2: "Swedish PM Magdalena Andersson resigns hours after taking job." (simplification; add world knowledge)
B1: "Facebook announces it will be changing its name to Meta." / B2: "Facebook relaunches itself as 'Meta' in a clear bid to dominate the metaverse." (add info that requires fact-checking)
C1: "100% of the 140,000 U.S. jobs lost in December were held by women." / C2: "In fact women lost 111% of the jobs in December because men gained 16,000 jobs." (add info that requires fact-checking)]

Figure 3: Two different paraphrase definitions used for creating MULTIPITCROWD and MULTIPITEXPERT, with examples. The difference between the two criteria is whether Sentence2 containing new information that requires fact-checking is considered a paraphrase of Sentence1.[7]

[7] Examples C1 and C2 are on the more extreme side of the "loose" paraphrase criterion from the linguistic perspective; more average cases are shown in Figure 1.
Assuming the in-house annotation is gold, the F1 of the crowdworkers' majority vote is 89.1.
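A minimal sketch of this style of agreement analysis using scikit-learn is shown below; the label vectors are hypothetical placeholders rather than MULTIPIT data.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical binary labels (1 = paraphrase, 0 = not) on the same pairs.
worker_a = [1, 0, 1, 1, 0, 1]
worker_b = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(worker_a, worker_b))  # pairwise Cohen's kappa

majority_vote = [1, 0, 1, 1, 0, 1]  # aggregated crowd label per pair
expert_gold = [1, 0, 1, 0, 0, 1]    # in-house annotation treated as gold
print(f1_score(expert_gold, majority_vote))   # F1 of the crowd majority vote
```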
Assessing Topic Diversity. We manually examine 100 sentence pairs randomly sampled from MULTIPITCROWD, PIT-2015 (Xu et al., 2015), and Twitter-URL (Lan et al., 2017). Figure 2 shows the results of the manual inspection. MULTIPITCROWD has a much more balanced topic distribution, compared to prior work where 58% of sentences in PIT-2015 are about sports and 63% of sentences in Twitter-URL are politics-related. This improvement can be attributed to the long collection time period (§2.1) and the topic classification step (§2.2) in our data collection process. In contrast, PIT-2015 was collected within only 10 days (04/24/2013 – 05/03/2013) that were dominated by a popular sports event, the 2013 NFL draft (04/25 – 04/27), and Twitter-URL was collected during the 3 months of the 2016 US presidential election.
2.5 Expert Annotation for Paraphrase Generation

Text generation models are prone to memorizing training data and generating unfaithful hallucinations (Maynez et al., 2020; Carlini et al., 2021). Including paraphrase pairs that contain extra information beyond world or commonsense knowledge in the training data only worsens the problem, as shown in Table 15 in Appendix F. For the purpose of paraphrase generation, we further create MULTIPITEXPERT with expert annotations, using a stricter paraphrase definition than the one used in MULTIPITCROWD. The different paraphrase criteria used for creating these two datasets, and their corresponding examples, are illustrated in Figure 3.
Data Selection. To create a high-quality corpus that focuses on differentiating strict paraphrases from more loosely defined ones, we first use our best paraphrase identifier (§3) fine-tuned on MULTIPITCROWD to filter the sentence pairs, and then have experienced in-house annotators further annotate them. Specifically, we gather sentence pairs that are identified as paraphrases by the automatic classifier from 9,762 trending topic groups (from Oct–Dec 2021) and 181,254 URL groups (from Jan 2020–Jun 2021). To improve the diversity of our dataset, instead of presenting these pairs directly to the experts for annotation, we cluster the sentences by treating the paraphrase relationship as transitive, i.e., if sentence pairs (s1, s2) and (s2, s3) are both identified as paraphrases, then (s1, s2, s3) forms a cluster. For each trend or URL, we show two seed sentences paired with up to 30 sentences in the largest cluster for the experts to annotate. In total, we have 5,570 sentence pairs annotated for MULTIPITEXPERT, in which 100 sentences sourced by trend and 100 sourced by URL have at least 8 corresponding paraphrases. We use these 200 sets to form MULTIPITNMR, the first multi-reference test set for paraphrase generation evaluation (§4).
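The clustering step above amounts to taking connected components over identified paraphrase pairs. Below is a minimal union-find sketch; the data structure choice is ours, as the paper does not specify an implementation.

```python
class UnionFind:
    """Disjoint-set structure over sentence ids/strings."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster_paraphrases(pairs):
    """Treat the paraphrase relation as transitive: edges (s1, s2) and
    (s2, s3) place s1, s2, s3 in one cluster."""
    uf = UnionFind()
    for s1, s2 in pairs:
        uf.union(s1, s2)
    clusters = {}
    for s1, s2 in pairs:
        for s in (s1, s2):
            clusters.setdefault(uf.find(s), set()).add(s)
    return list(clusters.values())
```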
Expert Annotation. We ask two experienced annotators with linguistic backgrounds and rich annotation experience to annotate each sentence pair as paraphrase or not. The annotators thoroughly discuss pairs with inconsistent judgments until reaching an agreement. A screenshot of the updated annotation instructions is provided in Appendix A.2.
| Model | #Param. | LR | Precision | Recall | F1 | Accuracy | LR | Precision | Recall | F1 | Accuracy |
|-------|---------|------|-----------|--------|-------|----------|------|-----------|--------|-------|----------|
| ESIM | 17M | 4e-4 | 89.55 | 70.15 | 78.67 | 82.15 | 4e-4 | 47.07 | 91.73 | 62.22 | 49.19 |
| Infersent | 47M | 1e-3 | 87.03 | 87.57 | 87.29 | 86.47 | 1e-3 | 45.87 | 98.43 | 62.58 | 46.32 |
| T5-base | 220M | 1e-4 | 89.21 | 93.76 | 91.43 | 90.67 | 1e-4 | 71.96 | 83.86 | 77.45 | 77.74 |
| T5-large | 770M | 1e-5 | 90.36 | 93.58 | 91.94 | 91.29 | 1e-4 | 79.78 | 85.43 | 82.51 | 83.48 |
| BERT-base | 109M | 3e-5 | 88.59 | 91.24 | 89.90 | 89.12 | 2e-5 | 71.66 | 86.61 | 78.43 | 78.28 |
| BERT-large | 335M | 2e-5 | 88.73 | 93.17 | 90.90 | 90.10 | 2e-5 | 72.22 | 87.01 | 78.93 | 78.82 |
| RoBERTa-large | 355M | 2e-5 | 90.81 | 92.70 | 91.74 | 91.14 | 2e-5 | 77.01 | 83.07 | 79.92 | 80.97 |
| BERTweet-large | 355M | 2e-5 | 89.72 | 93.95 | 91.79 | 91.08 | 2e-5 | 82.47 | 81.50 | 81.98 | 83.66 |
| ALBERT-V2-xxlarge | 235M | 1e-5 | 90.36 | 92.96 | 91.64 | 91.00 | 2e-5 | 82.68 | 82.68 | 82.68 | 84.20 |
| DeBERTaV3-large | 400M | 5e-6 | 90.46 | 93.59 | 92.00 | 91.36 | 5e-6 | 82.56 | 83.86 | 83.20 | 84.56 |

Table 2: Results on the test sets of MULTIPITCROWD (left LR/Precision/Recall/F1/Accuracy block) and MULTIPITEXPERT (right block). Models are fine-tuned on the corresponding training set. DeBERTaV3-large performs the best on both datasets. LR: learning rate.
3 Paraphrase Identification

Paraphrase identification is the task of determining whether two given sentences are paraphrases or not. The two paraphrase definitions used in MULTIPITCROWD and MULTIPITEXPERT suit different downstream applications: tracking information on Twitter, and acquiring high-quality paraphrase pairs for training generation models. Paraphrase identification models trained on our datasets achieve over 84 F1 in each case.

Experimental Setup. As each sentence pair in MULTIPITCROWD has six judgments, we use 3 as the threshold: pairs with >3 paraphrase judgments are labeled as paraphrases, and those with <3 paraphrase judgments are labeled as non-paraphrases. We split MULTIPITCROWD and MULTIPITEXPERT into 80/10/10% train/dev/test partitions by time, such that the oldest data are used for training. More details on the implementation and hyperparameter tuning are in Appendix C.
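A small sketch of this labeling rule follows; how an exact 3–3 split is handled is not stated above, so the sketch leaves that case open.

```python
def label_from_judgments(n_paraphrase_votes, threshold=3):
    """Map the six crowd judgments for a pair to a binary label (Section 3)."""
    if n_paraphrase_votes > threshold:
        return 1  # paraphrase
    if n_paraphrase_votes < threshold:
        return 0  # non-paraphrase
    return None   # 3-3 tie; treatment not specified in the text above
```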
3.1 Models

We consider an encoder-decoder language model, T5 (Raffel et al., 2020), and five masked language models: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), BERTweet (Nguyen et al., 2020), and DeBERTaV3 (He et al., 2021). We also include two competitive BiLSTM-based models, Infersent (Conneau et al., 2017) and ESIM (Chen et al., 2017), to establish a comparison with pre-BERT era work.
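As an illustration, a sentence-pair classifier like those above can be fine-tuned with the Hugging Face transformers library roughly as follows. This is a minimal sketch, not the paper's training code: the toy in-memory data, batch size, and epoch count are assumptions, while the learning rate for DeBERTaV3-large follows Table 2.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for a MULTIPIT training split (sentence1, sentence2, label).
train = Dataset.from_dict({
    "sentence1": ["It's finally Friday and that's all that matters."],
    "sentence2": ["So so so so thankful it's finally Friday."],
    "label": [1],
})

name = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def encode(batch):
    # Encode the sentence pair jointly, as in standard pair classification.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

train = train.map(encode, batched=True)

args = TrainingArguments(output_dir="out",
                         learning_rate=5e-6,  # LR for DeBERTaV3-large (Table 2)
                         per_device_train_batch_size=16,  # assumed
                         num_train_epochs=3)              # assumed
Trainer(model=model, args=args, train_dataset=train).train()
```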
3.2 Results

Table 2 presents results for the models fine-tuned on each dataset. DeBERTaV3-large achieves the best results, with 92.0 F1 on MULTIPITCROWD and 83.2 F1 on MULTIPITEXPERT. Transformer-based models consistently outperform BiLSTM-based models, especially on MULTIPITEXPERT.
| Method | Data | P. | R. | F1 | Acc. |
|--------|------|-------|-------|-------|-------|
| Fine-tuning | MC | 61.81 | 88.58 | 72.82 | 69.84 |
| Fine-tuning | ME | 82.56 | 83.86 | 83.20 | 84.56 |
| Fine-tuning | MC + ME | 62.99 | 87.80 | 73.36 | 70.92 |
| + Filtering | MC + ME | 77.24 | 88.19 | 82.35 | 82.76 |
| + Flipping | MC + ME | 83.40 | 85.04 | 84.21 | 85.46 |

Table 3: Results of different methods on the test set of MULTIPITEXPERT. MC: MULTIPITCROWD; ME: MULTIPITEXPERT. We use DeBERTaV3-large in these experiments.
Beyond Fine-tuning. As MULTIPITCROWD is a large-scale dataset annotated with a loose paraphrase definition, we test whether leveraging these "noisy" data improves model performance on MULTIPITEXPERT. To reduce the noise that comes from the difference in definitions, we first adjust the labeling threshold for MULTIPITCROWD from 3 to 4. Then we consider two noisy-training techniques adopted in prior work (Xie et al., 2020; Zhang and Sabuncu, 2018), namely filtering and flipping. Specifically, we fine-tune a teacher model on MULTIPITEXPERT and use it to go through MULTIPITCROWD as follows: for each sentence pair p, if its label is i (0 for non-paraphrase, 1 for paraphrase) and P_teacher(y = i | p) ≤ λ, we either filter out p or flip its label to 1 − i (i.e., 0 ↔ 1).[8] Next, we fine-tune a new model on the combination of MULTIPITEXPERT and the re-labeled MULTIPITCROWD. The experimental results are shown in Table 3. Compared to fine-tuning […]

[8] We perform a small grid search on λ over {0.05, 0.15, 0.25, 0.35, 0.45}, and find 0.35 works well for the filtering method and 0.25 for the flipping method.
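A minimal sketch of the filter/flip relabeling under these definitions is shown below; the data structures are illustrative, and the teacher probabilities are assumed to be precomputed for each pair.

```python
def relabel_with_teacher(pairs, labels, teacher_probs, lam, mode="filter"):
    """Apply the filtering/flipping rule: if the teacher assigns probability
    <= lam to a pair's current label, drop the pair ("filter") or flip its
    label 0 <-> 1 ("flip"). teacher_probs[j][i] = P_teacher(y = i | pair j).
    Footnote 8 suggests lam = 0.35 for filtering and 0.25 for flipping."""
    kept_pairs, kept_labels = [], []
    for pair, label, probs in zip(pairs, labels, teacher_probs):
        if probs[label] > lam:      # teacher is confident enough: keep as-is
            kept_pairs.append(pair)
            kept_labels.append(label)
        elif mode == "flip":        # teacher disagrees strongly: flip label
            kept_pairs.append(pair)
            kept_labels.append(1 - label)
        # mode == "filter": drop the pair entirely
    return kept_pairs, kept_labels
```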