CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media
Momchil Hardalov¹  Anton Chernyavskiy²  Ivan Koychev¹  Dmitry Ilvovsky²  Preslav Nakov³
¹Sofia University “St. Kliment Ohridski”, Bulgaria
²HSE University, Russia
³Mohamed bin Zayed University of Artificial Intelligence, UAE
{hardalov, koychev}@fmi.uni-sofia.bg
{acherniavskii, dilvovsky}@hse.ru
preslav.nakov@mbzuai.ac.ae
Abstract
While there has been substantial progress in developing systems to automate fact-checking, they still lack credibility in the eyes of the users. Thus, an interesting approach has emerged: to perform automatic fact-checking by verifying whether an input claim has been previously fact-checked by professional fact-checkers, and to return an article that explains their decision. This is a sensible approach, as people trust manual fact-checking, and as many claims are repeated multiple times. Yet, a major issue when building such systems is the small number of known tweet–verifying article pairs available for training. Here, we aim to bridge this gap by making use of crowd fact-checking, i.e., mining claims in social media for which users have responded with a link to a fact-checking article. In particular, we mine a large-scale collection of 330,000 tweets paired with a corresponding fact-checking article. We further propose an end-to-end framework to learn from this noisy data based on modified self-adaptive training, in a distant supervision scenario. Our experiments on the CLEF'21 CheckThat! test set show improvements over the state of the art by two points absolute. Our code and datasets are available at https://github.com/mhardalov/crowdchecked-claims
1 Introduction
The massive spread of disinformation online, especially in social media, has been counteracted by major efforts to limit the impact of false information, not only by journalists and fact-checking organizations, but also by governments, private companies, researchers, and ordinary Internet users. This includes building systems for automatic fact-checking (Zubiaga et al., 2016; Derczynski et al., 2017; Nakov et al., 2021a; Gu et al., 2022; Guo et al., 2022; Hardalov et al., 2022), fake news detection (Ferreira and Vlachos, 2016; Nguyen et al., 2022), and fake news website detection (Baly et al., 2020; Stefanov et al., 2020; Panayotov et al., 2022).
Figure 1: Crowd fact-checking thread on Twitter. The first tweet (Post w/ claim) makes the claim that Ivermectin causes sterility in men, which then receives replies. A (crowd) fact-checker replies with a link to a verifying article from a fact-checking website. We pair the article with the tweet that made this claim (the first post, ✓), as it is irrelevant (✗) to the other replies.
Unfortunately, fully automatic systems still lack credibility, and thus it was proposed to focus on detecting previously fact-checked claims instead: given a user comment, detect whether the claim it makes was previously fact-checked with respect to a collection of verified claims and their corresponding articles (see Table 1). This task is an integral part of an end-to-end fact-checking pipeline (Hassan et al., 2017), and it is also an important task in its own right, as people often repeat the same claim (Barrón-Cedeño et al., 2020b; Vo and Lee, 2020; Shaar et al., 2021). Research on this problem is limited by data scarceness, with datasets typically having about 1,000 tweet–verifying article pairs (Barrón-Cedeño et al., 2020b; Shaar et al., 2020, 2021), with the notable exception of (Vo and Lee, 2020), which contains 19K claims about images matched against 3K fact-checking articles.
We propose to bridge this gap using crowd fact-checking to create a large collection of tweet–verifying article pairs, which we then label (deciding whether the pair is correctly matched) automatically using distant supervision. An example is shown in Figure 1.
Our contributions are as follows:
• we mine a large-scale collection of 330,000 tweets paired with fact-checking articles;
• we propose two distant supervision strategies to label the CrowdChecked dataset;
• we propose a novel method to learn from this data using modified self-adaptive training;
• we demonstrate sizable improvements over the state of the art on a standard test set.
2 Our Dataset: CrowdChecked
2.1 Dataset Collection
We use Snopes as our target fact-checking website, due to its popularity among both Internet users and researchers (Popat et al., 2016; Hanselowski et al., 2019; Augenstein et al., 2019; Tchechmedjiev et al., 2019). We further use Twitter as the source for collecting user messages, which could contain claims and fact-checks of these claims.
Our data collection setup is similar to the one in (Vo and Lee, 2019). First, we form a query to select tweets that contain a link to a fact-check from Snopes (url:snopes.com/fact-check/), which is either a reply or a quote tweet, and not a retweet. An example result from the query is shown in Figure 1, where the tweet from the crowd fact-checker contains a link to a fact-checking article. We then assess its relevance to the claim (if any) made in the first tweet (the root of the conversation) and in the last reply, in order to obtain tweet–verified article pairs. We analyze the conversational structure of these threads in more detail in Section 2.2.
We collected all tweets matching our query from October 2017 until October 2021, obtaining a total of 482,736 unique hits. We further collected 148,503 reply tweets and 204,250 conversation (root) tweets.¹ Finally, we filter out malformed pairs, i.e., tweets linking to themselves, empty tweets, non-English ones, tweets with no resolved URLs in the Twitter object ('entities'), tweets with broken links to the fact-checking website, and all tweets in the CheckThat '21 dataset. We ended up with 332,660 unique tweet–article pairs (shown in the first row of Table 5), 316,564 unique tweets, and 10,340 fact-checking articles from Snopes that they point to.

¹The sum of the unique replies and of the conversation tweets is not equal to the total number of fact-checking tweets, as more than one tweet might reply to the same comment.
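To make the filtering step concrete, the sketch below shows one way it could be implemented; this is illustrative code rather than the authors' pipeline. The field names (`id`, `lang`, `entities`, `expanded_url`) follow the standard Twitter API tweet object, and `checkthat21_ids` is an assumed set of tweet IDs from the CheckThat '21 dataset.

```python
# Illustrative sketch of the filtering described above (not the authors' exact code).
# Each `tweet` is assumed to be a dict parsed from the Twitter API JSON response.

def is_valid_pair(tweet, target_id, checkthat21_ids):
    """Return True if the tweet-article pair should be kept."""
    # Drop tweets that reply to / quote themselves.
    if target_id == tweet["id"]:
        return False
    # Drop empty and non-English tweets.
    if not tweet.get("text", "").strip() or tweet.get("lang") != "en":
        return False
    # Drop tweets whose URL entities were not resolved by Twitter.
    urls = tweet.get("entities", {}).get("urls", [])
    if not urls:
        return False
    # Keep only tweets with a well-formed link to a Snopes fact-check.
    if not any("snopes.com/fact-check/" in (u.get("expanded_url") or "") for u in urls):
        return False
    # Drop anything that overlaps with the CheckThat '21 dataset.
    if tweet["id"] in checkthat21_ids or target_id in checkthat21_ids:
        return False
    return True
```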
User Post w/ Claim: Sen. Mitch McConnell: “As recently as October, now-President Biden said you can’t legislate by executive action unless you are a dictator. Well, in one week, he signed more than 30 unilateral actions.” [URL] — Forbes (@Forbes) January 28, 2021

Verified Claims and their Corresponding Articles
(1) When he was still a candidate for the presidency in October 2020, U.S. President Joe Biden said, “You can’t legislate by executive order unless you’re a dictator.” http://snopes.com/fact-check/biden-executive-order-dictator/ ✓
(2) U.S. Sen. Mitch McConnell said he would not participate in 2020 election debates that include female moderators. http://snopes.com/fact-check/mitch-mcconnell-debate-female/ ✗

Table 1: Illustrative examples for the task of detecting previously fact-checked claims. The post contains a claim (related to legislation and dictatorship), and the Verified Claims are part of a search collection of previous fact-checks. In row (1), the fact-check is a correct match for the claim made in the tweet (✓), whereas in (2), the claim still discusses Sen. Mitch McConnell, but it is a different claim (✗), and thus this is an incorrect pair.
More details about the process of collecting the fact-checking articles, as well as detailed statistics, are given in Appendix B.1 and in Figure 2.
2.2 Tweet Collection (Conversation Structure)
It is important to note that the ‘fact-checking’ tweet can be part of a multi-turn conversational thread; therefore, the post that it replies to (the previous turn) does not always express the claim that the current tweet targets. In order to better understand this, we performed a manual analysis of some conversational threads. Conversational threads on Twitter are organized as shown in Figure 1: the root is the first comment, then there can be a long discussion, followed by a fact-checking comment (i.e., one with a link to a fact-checking article on Snopes). In our analysis, we identified four patterns: (i) the current tweet verifies a claim in the tweet it replies to, (ii) the tweet verifies the root of the conversation, (iii) the tweet does not verify any claim in the chain (a common scenario), and (iv) the fact-check targets a claim that was expressed neither in the root nor in the closest tweet (this happened in very few cases). This analysis suggests that for the task of detecting previously fact-checked claims, it is sufficient to collect the triplet of the fact-checking tweet, the root of the conversation (conversation), and the tweet that the target tweet is replying to (reply).
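As an illustration of how such triplets could be assembled, a minimal sketch is shown below. The field names `in_reply_to_status_id` and `conversation_id` are assumptions about the tweet objects returned by the Twitter API, and `lookup` is a hypothetical mapping from tweet IDs to tweet objects.

```python
from collections import namedtuple

# One collection unit: the fact-checking tweet, the conversation root,
# and the tweet it directly replies to.
Triplet = namedtuple("Triplet", ["fact_check", "conversation", "reply"])

def build_triplet(fc_tweet, lookup):
    """Assemble the (fact-checking tweet, conversation, reply) triplet.

    `lookup` maps tweet IDs to tweet dicts; tweets that could not be
    retrieved yield None for the corresponding slot.
    """
    reply = lookup.get(fc_tweet.get("in_reply_to_status_id"))
    conversation = lookup.get(fc_tweet.get("conversation_id"))
    return Triplet(fact_check=fc_tweet, conversation=conversation, reply=reply)
```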
Dataset               Tweets (unique)   Words per tweet (mean / median / max)   Vocab (unique)
CrowdChecked (Ours)   316,564           12.2 / 11 / 60                          114,727
CheckThat '21         1,399             17.5 / 16 / 62                          9,007

Table 2: Statistics about our dataset vs. CheckThat '21. The number of unique tweets is lower than the total number of tweet–article pairs, as an input tweet could be fact-checked by multiple articles.
2.3 Comparison to Existing Datasets
We compare our dataset to a closely related dataset from the CLEF-2021 CheckThat! lab on Detecting Previously Fact-Checked Claims in Tweets (Shaar et al., 2021), to which we will refer as CheckThat '21 in the rest of the paper. There exist other related datasets, but they are smaller (Barrón-Cedeño et al., 2020b), come from a different domain (Shaar et al., 2021), are not in English (Elsayed et al., 2019), or are multi-modal (Vo and Lee, 2020).
Table 2 compares our CrowdChecked to CheckThat '21 in terms of number of examples, length of the tweets, and vocabulary size. Before calculating these statistics, we lowercased the text and removed all URLs, Twitter handles, English stop words, and punctuation. We can see in Table 2 that CrowdChecked contains two orders of magnitude more examples and slightly shorter tweets (but the maximum length stays approximately the same, which can be explained by the length limit of Twitter), and that its vocabulary is an order of magnitude larger. Note, however, that many examples in CrowdChecked are incorrect matches (see Section 2.1), and thus we use distant supervision to label them (see Section 2.4); the resulting numbers of matching pairs are shown in Table 5. Here, we want to emphasize that there is absolutely no overlap between CrowdChecked and CheckThat '21 in terms of tweets/claims.
In terms of topics, the claims in both our dataset and CheckThat '21 are quite diverse, including fact-checks on a broad set of topics related, but not limited, to politics (e.g., the Capitol Hill riots, US elections), pop culture (e.g., famous performers and actors such as Drake and Leonardo DiCaprio), brands (e.g., McDonald's and Disney), and COVID-19, among many others. Illustrative examples of the claim/topic diversity can be found in Tables 1 and 10 (in the Appendix). Moreover, the collection of Snopes articles contains almost 14K different fact-checks on an even wider range of topics, which further diversifies the set of tweet–article pairs.
Figure 2: Histogram of the year of publication of
the Snopes articles included in CrowdChecked (our
dataset) vs. those in CheckThat ’21.
Finally, we compare the set of Snopes fact-checking articles referenced by the crowd fact-checkers to the ones included in the CheckThat '21 competition. We can see that the tweets in CrowdChecked refer to fewer articles (namely 10,340) compared to CheckThat '21, which consists of 13,835 articles. A total of 8,898 articles are present in both datasets. Since CheckThat '21 was collected earlier, it includes fewer articles from recent years compared to CrowdChecked, and its distribution peaks at 2016/2017. Nevertheless, for CheckThat '21, the number of Snopes articles that actually appear in a claim–article pair is far smaller than for our dataset (even after filtering out unrelated pairs), as it is capped by the number of tweets in that dataset (1.4K).
More details about the process of collecting the fact-checking articles are given in Appendix B.1.
2.4 Data Labeling (Distant Supervision)
To label our examples, we experiment with two distant supervision approaches: (i) based on the Jaccard similarity between the tweet and the target fact-checking article, and (ii) based on the predictions of a model trained on CheckThat '21.

Jaccard Similarity
In this approach, we first pre-process the texts by converting them to lowercase, removing all URLs, and replacing all numbers with a single zero. Then, we tokenize them using NLTK's Twitter tokenizer (Loper and Bird, 2002), and we strip all handles and user mentions. Finally, we filter out all stop words and punctuation (including quotes and special symbols), and we stem all tokens using the Porter stemmer (Porter, 1980).
Range (Jaccard)   Examples (%)   Correct Pairs, Reply (%)   Correct Pairs, Conv. (%)
[0.0; 0.1)        62.57          5.88                       0.00
[0.1; 0.2)        18.98          36.36                      14.29
[0.2; 0.3)        10.21          46.67                      50.00
[0.3; 0.4)        4.17           76.47                      78.57
[0.4; 0.5)        2.33           92.86                      92.86
[0.5; 0.6)        1.08           94.12                      94.12
[0.6; 0.7)        0.43           80.00                      80.00
[0.7; 0.8)        0.11           92.31                      92.31
[0.8; 0.9)        0.05           91.67                      92.86
[0.9; 1.0]        0.02           100.00                     100.00

Table 3: Proportion of examples in different bins based on the average Jaccard similarity between the tweet and the title/subtitle, together with manual annotations of the percentage of correct pairs.
In order to obtain a numerical score for each tweet–article pair, we calculate the Jaccard similarity (jac) between the normalized tweet text and each of the title and the subtitle from the Snopes article (i.e., the intersection over the union of the unique tokens). Both fields present a summary of the fact-checked claim, and thus should contain more condensed information. Finally, we average these two similarity values to obtain a more robust score. Statistics are shown in Table 3.
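A minimal sketch of this scoring is shown below, assuming the tweet text and the article's title and subtitle are available as plain strings; this is illustrative code, and the exact preprocessing in the paper may differ in detail (e.g., the precise set of special symbols removed).

```python
import re
import string

from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))
punct = set(string.punctuation) | {"``", "''", "“", "”", "‘", "’"}

def preprocess(text):
    """Lowercase, drop URLs, map numbers to 0, tokenize, remove stop words, stem."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"\d+", "0", text)
    tokens = tokenizer.tokenize(text)
    return {stemmer.stem(t) for t in tokens if t not in stop_words and t not in punct}

def jaccard(a, b):
    """Intersection over union of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_score(tweet_text, title, subtitle):
    """Average of the tweet's Jaccard similarity to the article title and subtitle."""
    tweet_tokens = preprocess(tweet_text)
    return 0.5 * (jaccard(tweet_tokens, preprocess(title))
                  + jaccard(tweet_tokens, preprocess(subtitle)))
```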
Semi-Supervision
Here, we train a Sentence-BERT (Reimers and Gurevych, 2019) model, as described in Section 3, using the manually annotated data from CheckThat '21. The model shows strong performance on the CheckThat '21 test set (see Table 6), and thus we expect it to have good precision at detecting matching fact-checked pairs. In particular, we calculate the cosine similarity between the embeddings of the fact-checked tweet and the fields from the Snopes article. Statistics about the scores are shown in Table 4.
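For illustration, the cosine-based labeling could be computed with the sentence-transformers library roughly as follows; the model path below is a placeholder for the SBERT model fine-tuned on CheckThat '21, and the choice to concatenate the article title and subtitle is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder path: assumed to point to an SBERT model fine-tuned on CheckThat '21.
model = SentenceTransformer("path/to/sbert-finetuned-on-checkthat21")

def cosine_scores(tweets, article_fields):
    """Cosine similarity between each tweet and its candidate article text.

    `article_fields` holds one string per tweet, e.g. the concatenated
    title and subtitle of the linked Snopes article.
    """
    tweet_emb = model.encode(tweets, convert_to_tensor=True, normalize_embeddings=True)
    article_emb = model.encode(article_fields, convert_to_tensor=True, normalize_embeddings=True)
    # Pairwise similarities between aligned tweet/article rows lie on the diagonal.
    return util.cos_sim(tweet_emb, article_emb).diagonal()
```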
2.5 Feasibility Evaluation
To evaluate the feasibility of the obtained labels, we
performed manual annotation, aiming to estimate
the number of correct pairs (i.e., tweet–article pairs,
where the article fact-checks the claim in the tweet).
Our prior observations of the data suggested that
unbiased sampling from the pool of tweets was
not suitable, as it would include mostly pairs that
have very few overlapping words, which is often
an indicator that the texts are not related. Thus, we
sample the candidates for annotation based on their
Jaccard similarity.
Range (Cosine)   Examples (%)   Correct Pairs (%)
[-0.4; 0.1)      37.83          0.00
[0.1; 0.2)       16.50          6.67
[0.2; 0.3)       12.28          41.46
[0.3; 0.4)       10.12          36.36
[0.4; 0.5)       8.58           63.16
[0.5; 0.6)       6.69           70.00
[0.6; 0.7)       4.47           84.21
[0.7; 0.8)       2.48           96.15
[0.8; 0.9)       0.97           93.10
[0.9; 1.0]       0.08           100.00

Table 4: Proportion of examples in different bins based on the cosine similarity from Sentence-BERT trained on CheckThat '21, together with manual annotations of the percentage of correct pairs.
We divided the range of possible values [0; 1] into 10 equally sized bins and sampled 15 examples from each bin, resulting in 150 conversation–reply–tweet triples. Afterwards, the appropriateness of each reply–article and conversation–article pair was annotated independently by three annotators. The annotators had a good level of inter-annotator agreement: 0.75 in terms of Fleiss kappa (Fleiss, 1971) (see Appendix C).
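A short sketch of this stratified sampling, assuming a list of (example, Jaccard score) tuples as input; the random seed and the handling of under-populated bins are illustrative choices, not details from the paper.

```python
import random

def sample_by_bins(scored_pairs, n_bins=10, per_bin=15, seed=42):
    """Sample `per_bin` examples from each equally sized similarity bin in [0, 1].

    `scored_pairs` is a list of (example, score) tuples; the last bin is
    closed on the right so that a score of exactly 1.0 is not dropped.
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for example, score in scored_pairs:
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append(example)
    return [rng.sample(b, min(per_bin, len(b))) for b in bins]
```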
Tables 3 and 4 show the resulting estimates of correct pairs for both the Jaccard-based and the cosine-based labeling. In the case of Jaccard, we can see that the expected proportion of correct examples is very high (over 90%) in the range [0.4–1.0], and then it drastically decreases, going to almost zero when the similarity is less than 0.1. Similarly, for the cosine score, we can see a high number of matches in the top four bins ([0.6–1.0]), although the proportion of matches remains relatively high (between 36% and 63%) in the interval [0.2–0.6), and again gets close to zero for the lower-score bins. We analyze the distribution of the Jaccard scores in CheckThat '21 in more detail in Appendix B.2.
3 Method
General Scheme
As a base for our models, we use Sentence-BERT (SBERT). It uses a Siamese network trained with a Transformer (Vaswani et al., 2017) encoder to obtain sentence-level embeddings. We keep the base architecture proposed by Reimers and Gurevych (2019), but we use additional features, training tricks, and losses, described in the next sections.
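As an illustration of such a bi-encoder setup (not the exact configuration from the paper), a Sentence-BERT model can be fine-tuned on tweet–article pairs with the sentence-transformers library. The base checkpoint, the loss, and the toy training pairs below are placeholders.

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Illustrative base encoder; the paper builds on Sentence-BERT (Reimers and Gurevych, 2019).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Toy positive tweet-article pairs; in practice these would come from CheckThat '21
# or from the distantly labeled CrowdChecked data.
train_pairs = [
    ("Biden said you can't legislate by executive action unless you are a dictator",
     "Joe Biden said you can't legislate by executive order unless you're a dictator."),
    ("Ivermectin makes men sterile",
     "Does Ivermectin cause sterility in men?"),
]
train_examples = [InputExample(texts=[tweet, article]) for tweet, article in train_pairs]

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
# In-batch negatives: a common contrastive loss for retrieval-style Siamese training.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```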