CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media
Momchil Hardalov¹  Anton Chernyavskiy²  Ivan Koychev¹  Dmitry Ilvovsky²  Preslav Nakov³
¹Sofia University “St. Kliment Ohridski”, Bulgaria
²HSE University, Russia
³Mohamed bin Zayed University of Artificial Intelligence, UAE
{hardalov, koychev}@fmi.uni-sofia.bg
{acherniavskii, dilvovsky}@hse.ru
preslav.nakov@mbzuai.ac.ae
Abstract
While there has been substantial progress in developing systems to automate fact-checking, they still lack credibility in the eyes of the users. Thus, an interesting approach has emerged: to perform automatic fact-checking by verifying whether an input claim has been previously fact-checked by professional fact-checkers, and to return an article that explains their decision. This is a sensible approach, as people trust manual fact-checking, and as many claims are repeated multiple times. Yet, a major issue when building such systems is the small number of known tweet–verifying article pairs available for training. Here, we aim to bridge this gap by making use of crowd fact-checking, i.e., mining claims in social media for which users have responded with a link to a fact-checking article. In particular, we mine a large-scale collection of 330,000 tweets paired with a corresponding fact-checking article. We further propose an end-to-end framework to learn from this noisy data based on modified self-adaptive training, in a distant supervision scenario. Our experiments on the CLEF'21 CheckThat! test set show improvements over the state of the art by two points absolute. Our code and datasets are available at https://github.com/mhardalov/crowdchecked-claims
1 Introduction
The massive spread of disinformation online, especially in social media, has been counteracted by major efforts to limit the impact of false information, not only by journalists and fact-checking organizations, but also by governments, private companies, researchers, and ordinary Internet users. This includes building systems for automatic fact-checking (Zubiaga et al., 2016; Derczynski et al., 2017; Nakov et al., 2021a; Gu et al., 2022; Guo et al., 2022; Hardalov et al., 2022), fake news detection (Ferreira and Vlachos, 2016; Nguyen et al., 2022), and fake news website detection (Baly et al., 2020; Stefanov et al., 2020; Panayotov et al., 2022).
Figure 1: Crowd fact-checking thread on Twitter. The first tweet (Post w/ claim) makes the claim that Ivermectin causes sterility in men, which then receives replies. A (crowd) fact-checker replies with a link to a verifying article from a fact-checking website. We pair the article with the tweet that made this claim (the first post, ✓), as it is irrelevant (✗) to the other replies.
Unfortunately, fully automatic systems still lack credibility, and thus it was proposed to focus on detecting previously fact-checked claims instead: given a user comment, detect whether the claim it makes was previously fact-checked with respect to a collection of verified claims and their corresponding articles (see Table 1). This task is an integral part of an end-to-end fact-checking pipeline (Hassan et al., 2017), and it is also an important task in its own right, as people often repeat the same claim (Barrón-Cedeño et al., 2020b; Vo and Lee, 2020; Shaar et al., 2021). Research on this problem is limited by data scarceness, with datasets typically having about 1,000 tweet–verifying article pairs (Barrón-Cedeño et al., 2020b; Shaar et al., 2020, 2021), with the notable exception of (Vo and Lee, 2020), which contains 19K claims about images matched against 3K fact-checking articles.
We propose to bridge this gap using crowd fact-checking to create a large collection of tweet–verifying article pairs, which we then label (deciding whether the pair is correctly matched) automatically using distant supervision. An example is shown in Figure 1.
Our contributions are as follows:
• we mine a large-scale collection of 330,000 tweets paired with fact-checking articles;
• we propose two distant supervision strategies to label the CrowdChecked dataset;
• we propose a novel method to learn from this data using modified self-adaptive training;
• we demonstrate sizable improvements over the state of the art on a standard test set.
2 Our Dataset: CrowdChecked
2.1 Dataset Collection
We use Snopes as our target fact-checking website, due to its popularity among both Internet users and researchers (Popat et al., 2016; Hanselowski et al., 2019; Augenstein et al., 2019; Tchechmedjiev et al., 2019). We further use Twitter as the source for collecting user messages, which could contain claims and fact-checks of these claims.
Our data collection setup is similar to the one in (Vo and Lee, 2019). First, we form a query to select tweets that contain a link to a fact-check from Snopes (url:snopes.com/fact-check/), which is either a reply or a quote tweet, and not a retweet. An example result from the query is shown in Figure 1, where the tweet from the crowd fact-checker contains a link to a fact-checking article. We then assess its relevance to the claim (if any) made in the first tweet (the root of the conversation) and in the last reply, in order to obtain tweet–verified article pairs. We analyze the conversational structure of these threads in more detail in Section 2.2.
We collected all tweets matching our query from October 2017 until October 2021, obtaining a total of 482,736 unique hits. We further collected 148,503 reply tweets and 204,250 conversation (root) tweets.¹ Finally, we filter out malformed pairs, i.e., tweets linking to themselves, empty tweets, non-English ones, tweets with no resolved URLs in the Twitter object ('entities'), tweets with broken links to the fact-checking website, and all tweets in the CheckThat '21 dataset. We ended up with 332,660 unique tweet–article pairs (shown in the first row of Table 5), 316,564 unique tweets, and 10,340 fact-checking articles from Snopes that they point to.

¹The sum of the unique replies and of the conversation tweets is not equal to the total number of fact-checking tweets, as more than one tweet might reply to the same comment.
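To make the filtering step concrete, the sketch below shows one way it could be implemented; this is illustrative code rather than the authors' pipeline. The field names (`id`, `lang`, `entities`, `expanded_url`) follow the standard Twitter API tweet object, and `checkthat21_ids` is an assumed set of tweet IDs from the CheckThat '21 dataset.

```python
# Illustrative sketch of the filtering described above (not the authors' exact code).
# Each `tweet` is assumed to be a dict parsed from the Twitter API JSON response.

def is_valid_pair(tweet, target_id, checkthat21_ids):
    """Return True if the tweet-article pair should be kept."""
    # Drop tweets that reply to / quote themselves.
    if target_id == tweet["id"]:
        return False
    # Drop empty and non-English tweets.
    if not tweet.get("text", "").strip() or tweet.get("lang") != "en":
        return False
    # Drop tweets whose URL entities were not resolved by Twitter.
    urls = tweet.get("entities", {}).get("urls", [])
    if not urls:
        return False
    # Keep only tweets with a well-formed link to a Snopes fact-check.
    if not any("snopes.com/fact-check/" in (u.get("expanded_url") or "") for u in urls):
        return False
    # Drop anything that overlaps with the CheckThat '21 dataset.
    if tweet["id"] in checkthat21_ids or target_id in checkthat21_ids:
        return False
    return True
```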
User Post w/ Claim: Sen. Mitch McConnell: “As recently as October, now-President Biden said you can’t legislate by executive action unless you are a dictator. Well, in one week, he signed more than 30 unilateral actions.” [URL] — Forbes (@Forbes) January 28, 2021

Verified Claims and their Corresponding Articles
(1) When he was still a candidate for the presidency in October 2020, U.S. President Joe Biden said, “You can’t legislate by executive order unless you’re a dictator.” http://snopes.com/fact-check/biden-executive-order-dictator/ ✓
(2) U.S. Sen. Mitch McConnell said he would not participate in 2020 election debates that include female moderators. http://snopes.com/fact-check/mitch-mcconnell-debate-female/ ✗

Table 1: Illustrative examples for the task of detecting previously fact-checked claims. The post contains a claim (related to legislation and dictatorship), and the Verified Claims are part of a search collection of previous fact-checks. In row (1), the fact-check is a correct match for the claim made in the tweet (✓), whereas in (2), the claim still discusses Sen. Mitch McConnell, but it is a different claim (✗), and thus this is an incorrect pair.
More details about the process of collecting the fact-checking articles, as well as detailed statistics, are given in Appendix B.1 and in Figure 2.
2.2 Tweet Collection (Conversation Structure)
It is important to note that the ‘fact-checking’ tweet can be part of a multi-turn conversational thread; therefore, the post that it replies to (the previous turn) does not always express the claim that the current tweet targets. In order to better understand this, we performed a manual analysis of some conversational threads. Conversational threads on Twitter are organized as shown in Figure 1: the root is the first comment, then there can be a long discussion, followed by a fact-checking comment (i.e., one with a link to a fact-checking article on Snopes). In our analysis, we identified four patterns: (i) the current tweet verifies a claim in the tweet it replies to, (ii) the tweet verifies the root of the conversation, (iii) the tweet does not verify any claim in the chain (a common scenario), and (iv) the fact-check targets a claim that was expressed neither in the root nor in the closest tweet (this happened in very few cases). This analysis suggests that for the task of detecting previously fact-checked claims, it is sufficient to collect the triplet of the fact-checking tweet, the root of the conversation (conversation), and the tweet that the target tweet is replying to (reply).
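As an illustration of how such triplets could be assembled, a minimal sketch is shown below. The field names `in_reply_to_status_id` and `conversation_id` are assumptions about the tweet objects returned by the Twitter API, and `lookup` is a hypothetical mapping from tweet IDs to tweet objects.

```python
from collections import namedtuple

# One collection unit: the fact-checking tweet, the conversation root,
# and the tweet it directly replies to.
Triplet = namedtuple("Triplet", ["fact_check", "conversation", "reply"])

def build_triplet(fc_tweet, lookup):
    """Assemble the (fact-checking tweet, conversation, reply) triplet.

    `lookup` maps tweet IDs to tweet dicts; tweets that could not be
    retrieved yield None for the corresponding slot.
    """
    reply = lookup.get(fc_tweet.get("in_reply_to_status_id"))
    conversation = lookup.get(fc_tweet.get("conversation_id"))
    return Triplet(fact_check=fc_tweet, conversation=conversation, reply=reply)
```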
Dataset               Tweets (unique)   Words per tweet (mean / median / max)   Vocab (unique)
CrowdChecked (Ours)   316,564           12.2 / 11 / 60                          114,727
CheckThat '21         1,399             17.5 / 16 / 62                          9,007

Table 2: Statistics about our dataset vs. CheckThat '21. The number of unique tweets is lower than the total number of tweet–article pairs, as an input tweet could be fact-checked by multiple articles.
2.3 Comparison to Existing Datasets
We compare our dataset to a closely related dataset from the CLEF-2021 CheckThat! lab on Detecting Previously Fact-Checked Claims in Tweets (Shaar et al., 2021), to which we will refer as CheckThat '21 in the rest of the paper. There exist other related datasets, but they are smaller (Barrón-Cedeño et al., 2020b), come from a different domain (Shaar et al., 2021), are not in English (Elsayed et al., 2019), or are multi-modal (Vo and Lee, 2020).
Table 2 compares our CrowdChecked to CheckThat '21 in terms of number of examples, length of the tweets, and vocabulary size. Before calculating these statistics, we lowercased the text and removed all URLs, Twitter handles, English stop words, and punctuation. We can see in Table 2 that CrowdChecked contains two orders of magnitude more examples and slightly shorter tweets (but the maximum length stays approximately the same, which can be explained by the length limit of Twitter), and that its vocabulary is an order of magnitude larger. Note, however, that many examples in CrowdChecked are incorrect matches (see Section 2.1), and thus we use distant supervision to label them (see Section 2.4); the resulting numbers of matching pairs are shown in Table 5. Here, we want to emphasize that there is absolutely no overlap between CrowdChecked and CheckThat '21 in terms of tweets/claims.
In terms of topics, the claims in both our dataset and CheckThat '21 are quite diverse, including fact-checks on a broad set of topics related, but not limited, to politics (e.g., the Capitol Hill riots, US elections), pop culture (e.g., famous performers and actors such as Drake and Leonardo DiCaprio), brands (e.g., McDonald's and Disney), and COVID-19, among many others. Illustrative examples of the claim/topic diversity can be found in Tables 1 and 10 (in the Appendix). Moreover, the collection of Snopes articles contains almost 14K different fact-checks on an even wider range of topics, which further diversifies the set of tweet–article pairs.
Figure 2: Histogram of the year of publication of
the Snopes articles included in CrowdChecked (our
dataset) vs. those in CheckThat ’21.
Finally, we compare the set of Snopes fact-checking articles referenced by the crowd fact-checkers to the ones included in the CheckThat '21 competition. We can see that the tweets in CrowdChecked refer to fewer articles (namely 10,340) compared to CheckThat '21, which consists of 13,835 articles. A total of 8,898 articles are present in both datasets. Since CheckThat '21 was collected earlier, it includes fewer articles from recent years compared to CrowdChecked, and its distribution peaks at 2016/2017. Nevertheless, for CheckThat '21, the number of Snopes articles that actually appear in a claim–article pair is far smaller than for our dataset (even after filtering out unrelated pairs), as it is capped by the number of tweets in that dataset (1.4K).
More details about the process of collecting the fact-checking articles are given in Appendix B.1.
2.4 Data Labeling (Distant Supervision)
To label our examples, we experiment with two distant supervision approaches: (i) based on the Jaccard similarity between the tweet and the target fact-checking article, and (ii) based on the predictions of a model trained on CheckThat '21.

Jaccard Similarity
In this approach, we first pre-process the texts by converting them to lowercase, removing all URLs, and replacing all numbers with a single zero. Then, we tokenize them using NLTK's Twitter tokenizer (Loper and Bird, 2002), and we strip all handles and user mentions. Finally, we filter out all stop words and punctuation (including quotes and special symbols), and we stem all tokens using the Porter stemmer (Porter, 1980).
Range (Jaccard)   Examples (%)   Correct Pairs, Reply (%)   Correct Pairs, Conv. (%)
[0.0; 0.1)        62.57          5.88                       0.00
[0.1; 0.2)        18.98          36.36                      14.29
[0.2; 0.3)        10.21          46.67                      50.00
[0.3; 0.4)        4.17           76.47                      78.57
[0.4; 0.5)        2.33           92.86                      92.86
[0.5; 0.6)        1.08           94.12                      94.12
[0.6; 0.7)        0.43           80.00                      80.00
[0.7; 0.8)        0.11           92.31                      92.31
[0.8; 0.9)        0.05           91.67                      92.86
[0.9; 1.0]        0.02           100.00                     100.00

Table 3: Proportion of examples in different bins based on the average Jaccard similarity between the tweet and the title/subtitle, together with manual annotations of the percentage of correct pairs.
In order to obtain a numerical score for each tweet–article pair, we calculate the Jaccard similarity (jac) between the normalized tweet text and each of the title and the subtitle from the Snopes article (i.e., the intersection over the union of the unique tokens). Both fields present a summary of the fact-checked claim, and thus should contain more condensed information. Finally, we average these two similarity values to obtain a more robust score. Statistics are shown in Table 3.
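A minimal sketch of this scoring is shown below, assuming the tweet text and the article's title and subtitle are available as plain strings; this is illustrative code, and the exact preprocessing in the paper may differ in detail (e.g., the precise set of special symbols removed).

```python
import re
import string

from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))
punct = set(string.punctuation) | {"``", "''", "“", "”", "‘", "’"}

def preprocess(text):
    """Lowercase, drop URLs, map numbers to 0, tokenize, remove stop words, stem."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"\d+", "0", text)
    tokens = tokenizer.tokenize(text)
    return {stemmer.stem(t) for t in tokens if t not in stop_words and t not in punct}

def jaccard(a, b):
    """Intersection over union of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_score(tweet_text, title, subtitle):
    """Average of the tweet's Jaccard similarity to the article title and subtitle."""
    tweet_tokens = preprocess(tweet_text)
    return 0.5 * (jaccard(tweet_tokens, preprocess(title))
                  + jaccard(tweet_tokens, preprocess(subtitle)))
```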
Semi-Supervision
Here, we train a Sentence-BERT (Reimers and Gurevych, 2019) model, as described in Section 3, using the manually annotated data from CheckThat '21. The model shows strong performance on the CheckThat '21 test set (see Table 6), and thus we expect it to have good precision at detecting matching fact-checked pairs. In particular, we calculate the cosine similarity between the embeddings of the fact-checked tweet and the fields from the Snopes article. Statistics about the scores are shown in Table 4.
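For illustration, the cosine-based labeling could be computed with the sentence-transformers library roughly as follows; the model path below is a placeholder for the SBERT model fine-tuned on CheckThat '21, and the choice to concatenate the article title and subtitle is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder path: assumed to point to an SBERT model fine-tuned on CheckThat '21.
model = SentenceTransformer("path/to/sbert-finetuned-on-checkthat21")

def cosine_scores(tweets, article_fields):
    """Cosine similarity between each tweet and its candidate article text.

    `article_fields` holds one string per tweet, e.g. the concatenated
    title and subtitle of the linked Snopes article.
    """
    tweet_emb = model.encode(tweets, convert_to_tensor=True, normalize_embeddings=True)
    article_emb = model.encode(article_fields, convert_to_tensor=True, normalize_embeddings=True)
    # Pairwise similarities between aligned tweet/article rows lie on the diagonal.
    return util.cos_sim(tweet_emb, article_emb).diagonal()
```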
2.5 Feasibility Evaluation
To evaluate the feasibility of the obtained labels, we
performed manual annotation, aiming to estimate
the number of correct pairs (i.e., tweet–article pairs,
where the article fact-checks the claim in the tweet).
Our prior observations of the data suggested that
unbiased sampling from the pool of tweets was
not suitable, as it would include mostly pairs that
have very few overlapping words, which is often
an indicator that the texts are not related. Thus, we
sample the candidates for annotation based on their
Jaccard similarity.
Range (Cosine)   Examples (%)   Correct Pairs (%)
[-0.4; 0.1)      37.83          0.00
[0.1; 0.2)       16.50          6.67
[0.2; 0.3)       12.28          41.46
[0.3; 0.4)       10.12          36.36
[0.4; 0.5)       8.58           63.16
[0.5; 0.6)       6.69           70.00
[0.6; 0.7)       4.47           84.21
[0.7; 0.8)       2.48           96.15
[0.8; 0.9)       0.97           93.10
[0.9; 1.0]       0.08           100.00

Table 4: Proportion of examples in different bins based on the cosine similarity from Sentence-BERT trained on CheckThat '21, together with manual annotations of the percentage of correct pairs.
We divided the range of possible values [0; 1] into 10 equally sized bins and sampled 15 examples from each bin, resulting in 150 conversation–reply–tweet triples. Afterwards, the appropriateness of each reply–article and conversation–article pair was annotated independently by three annotators. The annotators had a good level of inter-annotator agreement: 0.75 in terms of Fleiss kappa (Fleiss, 1971) (see Appendix C).
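A short sketch of this stratified sampling, assuming a list of (example, Jaccard score) tuples as input; the random seed and the handling of under-populated bins are illustrative choices, not details from the paper.

```python
import random

def sample_by_bins(scored_pairs, n_bins=10, per_bin=15, seed=42):
    """Sample `per_bin` examples from each equally sized similarity bin in [0, 1].

    `scored_pairs` is a list of (example, score) tuples; the last bin is
    closed on the right so that a score of exactly 1.0 is not dropped.
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for example, score in scored_pairs:
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append(example)
    return [rng.sample(b, min(per_bin, len(b))) for b in bins]
```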
Tables 3 and 4 show the resulting estimates of correct pairs for both the Jaccard-based and the cosine-based labeling. In the case of Jaccard, we can see that the expected proportion of correct examples is very high (over 90%) in the range [0.4–1.0], and then it drastically decreases, going to almost zero when the similarity is less than 0.1. Similarly, for the cosine score, we can see a high number of matches in the top four bins ([0.6–1.0]), although the proportion of matches remains relatively high (between 36% and 63%) in the interval [0.2–0.6), and again gets close to zero for the lower-score bins. We analyze the distribution of the Jaccard scores in CheckThat '21 in more detail in Appendix B.2.
3 Method
General Scheme
As a base for our models, we use Sentence-BERT (SBERT). It uses a Siamese network trained with a Transformer (Vaswani et al., 2017) encoder to obtain sentence-level embeddings. We keep the base architecture proposed by Reimers and Gurevych (2019), but we use additional features, training tricks, and losses, described in the next sections.
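As an illustration of such a bi-encoder setup (not the exact configuration from the paper), a Sentence-BERT model can be fine-tuned on tweet–article pairs with the sentence-transformers library. The base checkpoint, the loss, and the toy training pairs below are placeholders.

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Illustrative base encoder; the paper builds on Sentence-BERT (Reimers and Gurevych, 2019).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Toy positive tweet-article pairs; in practice these would come from CheckThat '21
# or from the distantly labeled CrowdChecked data.
train_pairs = [
    ("Biden said you can't legislate by executive action unless you are a dictator",
     "Joe Biden said you can't legislate by executive order unless you're a dictator."),
    ("Ivermectin makes men sterile",
     "Does Ivermectin cause sterility in men?"),
]
train_examples = [InputExample(texts=[tweet, article]) for tweet, article in train_pairs]

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
# In-batch negatives: a common contrastive loss for retrieval-style Siamese training.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```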