MiDe22: An Annotated Multi-Event Tweet Dataset for
Misinformation Detection
Cagri Toraman1*, Oguzhan Ozcelik2,3, Furkan Şahinuç4*, Fazli Can2
1Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
2Department of Computer Engineering, Bilkent University, Ankara, Turkey
3Aselsan, Ankara, Turkey
4Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt
ctoraman@ceng.metu.edu.tr
oguzhan.ozcelik@bilkent.edu.tr
furkan.sahinuc@tu-darmstadt.de
canf@cs.bilkent.edu.tr
Abstract
The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful
consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action
is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22,
having 5,284 English and 5,064 Turkish tweets with their misinformation labels for several recent events between
2020 and 2022, including the Russia-Ukraine war, COVID-19 pandemic, and Refugees. The dataset includes user
engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data
analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.
Keywords: Human-annotation, Misinformation detection, Multi-event dataset, Tweet
1. Introduction
With the growth of online social networks, people develop new behaviors and
trends. An example is the amount of news consumed in these networks, and
eventually the phrase “social media” was coined. However, considering their
popularity and easy accessibility, it is inevitable to observe different kinds
of content on social media platforms, e.g., information manipulation, fake
news, and misinformation/disinformation spread¹. Twitter (rebranded as X in
July 2023) is one of the platforms where misinformation can spread widely, as
observed in the U.S. Elections (Grinberg et al., 2019), so much so that
“fake news” became the Word of the Year in 2017 (CollinsDictionary, 2017).
Misinformation is spread in many domains including but not limited to health,
politics, and disasters. Once misinformation is spread, the consequences can
be devastating (Islam et al., 2020b; Reuters, 2022). For instance, many people
died because of false rumors claiming that the cure for COVID-19 is drinking
methanol (Islam et al., 2020b). Another example is that Ukraine sought an
emergency order from the International Court of Justice due to the false
claims of genocide against Russian speakers in Ukraine (Reuters, 2022).
Considering the importance of misinformation spread in society and the ugly
truth of unavoidable diffusion and beliefs, misinformation detection becomes a
critical task that requires advanced methods and datasets.
*Work partially done in Aselsan, Ankara, Turkey.
¹ We use misinformation as an umbrella term that refers to all instances where
information contains falsehoods.
A straightforward solution for misinformation is to avoid the spread in
advance. However, people can resist changing their beliefs even when
corrections exist, and attempts to correct falsehoods may fail to stop their
spread and sometimes even help their diffusion (Nyhan and Reifler, 2010).
Moreover, targeted advertising to increase user engagement can help
misinformation spread, which may be a source of revenue for social media
platforms (Neumann et al., 2022).
We have four main observations on existing social media collections for
misinformation detection. Although they mostly cover a limited number of
topics (Ma et al., 2017), these topics remain too high-level to provide an
opportunity to systematically examine which types of incidents trigger the
spread of misinformation. The availability of fine-grained event-specific
information can play a significant role in capturing different user behaviors
for detecting and preventing misinformation. Furthermore, the existing
datasets focus on widely used languages such as English (D’Ulizia et al.,
2021), while they are very limited for low-resource languages. Lastly, user
engagements (like, reply, retweet, and quote) and media elements (image and
video) in false tweets can be useful to analyze different types of
information diffusion and detection methods (e.g., multimodal), but not all
types are always included in the datasets.
In order to bridge these gaps, we present an annotated multi-event tweet
dataset for Misinformation Detection covering several recent events from 2020
to 2022, called MiDe22. It includes English and Turkish tweets with four types
of user engagements: likes, replies, retweets, and quotes.
Figure 1: The topics (inner circle) and events (outer circle) in MiDe22 for
English (left) and Turkish (right). The areas are proportional to the number
of tweets they have.
arXiv:2210.05401v2 [cs.SI] 11 Jul 2024
1.1. Dataset Contents
The MiDe22 dataset² consists of three parts: (i) Topics and Events,
(ii) Tweets, and (iii) Engagements. Each part exists for both English
(MiDe22-EN) and Turkish (MiDe22-TR).
Topics and Events. We consider the issues occupying the world’s agenda in
recent years as the topics of our dataset. Then, we extract the significant
events with the highest spread of misinformation. Figure 1 presents an
overview of the structure of our dataset. The inner circle shows the topics:
the COVID-19 pandemic, the 2022 War between Russia and Ukraine, Refugees
(Immigration), and Miscellaneous events that are not categorized under the
previous topics. Overall, these topics contain 40 newsworthy events, shown in
the outer circles of the figure. We also provide the event titles along with
their topics online².
Note that we prefer well-known recent events for both languages. The reason is
that some misinformation events can be global and observed in several
countries, such as “COVID-19 vaccines contain Human Immunodeficiency Virus
(HIV)”. These common events can provide an opportunity to inspect how
misinformation is spread in different languages. On the other hand, there are
local events that have influence in specific regions. The details on the
events are given in Section 3.1.
Tweets. The dataset has tweets related to the events. The crawling process is
explained in Section 3.1. Each tweet is labeled according to three classes:
False information, True information, and Other. The Other class includes
tweets that cannot be categorized under false and true information. The
annotation process is explained in Section 3.2.
² The dataset and all other related documents can be accessed at
https://github.com/metunlp/MiDe22
User Engagements and Media. We provide the user engagements with all tweets.
Engagements are provided in separate splits by type: like, reply, retweet,
and quote. We also provide the media elements of the tweets, i.e., images and
videos, where they exist.
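The three parts described above can be pictured as a simple record layout. The following is a minimal sketch only: the class names, field names, and sample values are hypothetical and are not taken from the released files, whose actual format is documented in the repository.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    """The three annotation classes named in the text."""
    FALSE = "False information"
    TRUE = "True information"
    OTHER = "Other"


@dataclass
class Engagements:
    """One list per engagement type (hypothetical: lists of user IDs)."""
    likes: List[str] = field(default_factory=list)
    replies: List[str] = field(default_factory=list)
    retweets: List[str] = field(default_factory=list)
    quotes: List[str] = field(default_factory=list)

    def total(self) -> int:
        return (len(self.likes) + len(self.replies)
                + len(self.retweets) + len(self.quotes))


@dataclass
class Tweet:
    """Hypothetical record joining topic, event, label, and engagements."""
    tweet_id: str
    language: str  # "EN" or "TR"
    topic: str
    event: str
    label: Label
    engagements: Engagements = field(default_factory=Engagements)
    media: List[str] = field(default_factory=list)  # image/video references


def label_distribution(tweets: List[Tweet]) -> Counter:
    """Count tweets per (language, label) pair."""
    return Counter((t.language, t.label) for t in tweets)


# Made-up sample records, for illustration only.
sample = [
    Tweet("1", "EN", "COVID-19", "hypothetical event A", Label.FALSE,
          Engagements(likes=["u1", "u2"], retweets=["u3"])),
    Tweet("2", "TR", "Refugees", "hypothetical event B", Label.TRUE),
    Tweet("3", "EN", "COVID-19", "hypothetical event A", Label.OTHER),
]
dist = label_distribution(sample)
```

Keeping each engagement type in its own list mirrors the separate splits described in the text, so a study can use one type (for example, retweets only) without touching the others.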
1.2. Contributions
Our contribution is a novel tweet dataset for misinformation detection in two
languages with various topics and user engagements. The languages are
English, a widely used language, and Turkish, a low-resource language. The
topics of the dataset cover several recent events, such as the 2022
Russia-Ukraine War and the COVID-19 pandemic. The dataset includes the user
engagements with all tweets in terms of likes, replies, retweets, and quotes.
It can be used in many studies such as misinformation, event, and topic
detection. Additionally, we conduct experiments to provide initial baseline
scores from different model families, e.g., bag-of-words, neural, and
transformer-based models. Apart from demonstrating the quality and utility of
our dataset, these baselines also provide a benchmark for researchers to
compare against and to improve upon. The variety of baseline models is rich
enough to perform statistical tests and interpret the results properly.
Dataset Name | Langs. | Domain | Topics | Date of Data | Engagements | Size | Labels
LIAR (Wang, 2017) | En | Statements | MISC | 2007-2016 | None | 12.8k | Annotated
FakeNewsNet (Shu et al., 2020) | En | News, tweets | MISC | n/a | None | 23.1k, 1.9m | Query
CoAID (Cui and Lee, 2020) | En | News, tweets | C19 | 2019-2020 | Reply | 4.2k, 160k | Query
COVIDLies (Hossain et al., 2020b) | En | Tweets | C19 | 2020 | None | 6.7k | Annotated
CMU-MisCOV19 (Memon and Carley, 2020) | En | Tweets | C19 | 2020 | None | 4.5k | Annotated
MM-COVID (Li et al., 2020) | 6 langs. | Tweets | C19 | n/a | Reply, retweet | 105.3k | Query
VaccineLies (Weinzierl and Harabagiu, 2022) | En | Tweets | C19, HPV | 2019-2021 | None | 14.6k | Annotated
MuMin (Nielsen and McConville, 2022) | 41 langs. | Tweets | MISC | n/a | Reply, retweet | 21.5m | Query
MR2 (Hu et al., 2023) | En, Zh | Tweets, Weibo | MISC | 2017-2022 | Reply, retweet | 14.7k | Annotated
MiDe22 (this study) | En, Tr | Tweets | RUW, C19, IMM, MISC | 2020-2022 | Reply, retweet, like, quote | 10.3k | Annotated
Table 1: Related misinformation studies. RUW stands for Russia-Ukraine War, C19 for COVID-19,
IMM for Immigration and Refugees, HPV for Human Papilloma Virus, and MISC for Miscellaneous. The
last column shows if tweets are annotated by humans, or labeled by the output of queries to Twitter API.
Size is given in terms of number of tweets.
2. Related Work
In this section, we provide a brief review of the
existing literature and explore the methods used for
the analysis and detection of misinformation, the
available datasets for research purposes, and the
various interventions implemented to combat the
spread of misinformation.
2.1. Misinformation Analysis
Misinformation analysis is the process of identifying, evaluating, and
understanding the spread and impact of false, misleading, or inaccurate
information. Misinformation modeling covers temporal patterns of information
diffusion to analyze spread (Shin et al., 2018; Rosenfeld et al., 2020), as
well as the analysis of misinformation spread during important events such as
the 2016 U.S. Election (Grinberg et al., 2019), the COVID-19 Pandemic
(Ferrara et al., 2020), and the 2020 BLM Movement (Toraman et al., 2022a).
2.2. Misinformation Detection
Misinformation detection is a challenging task given the dynamics of
misinformation spread. The task is also studied as fake news detection (Zhou
and Zafarani, 2020), rumor detection (Zubiaga et al., 2018), and fact/claim
verification (Bekoulis et al., 2021; Guo et al., 2022).
There are two important aspects of misinformation detection. First, the task
mostly depends on supervised learning with a labeled dataset. Second, existing
studies rely on different feature types for automated misinformation detection
(Wu et al., 2016). Text contents are represented in a vector or embedding
space by natural language processing (Oshikawa et al., 2020), and the task is
formulated as classification or regression, mostly solved by deep learning
models (Islam et al., 2020a). The features extracted from user profiles can be
used to detect the spreaders (Lee et al., 2011). Besides contents, there are
efforts to extract features from the network structure such as network
diffusion models (Kwon and Cha, 2014; Shu et al., 2019a) and graph neural
networks (Mehta et al., 2022). Lastly, external knowledge sources (Shi and
Weninger, 2016; Toraman et al., 2022b) and the social context among
publishers, news, and users (Shu et al., 2019b) can be integrated into the
learning phase.
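To make the text-based formulation concrete, the following is a minimal sketch of supervised classification over a bag-of-words representation, using a multinomial Naive Bayes classifier with add-one smoothing written in plain Python. It is an illustration under stated assumptions, not the baseline implementation of this study or of the cited works: the function names are hypothetical and the four training sentences are made-up toy data, not tweets from any dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training data, for illustration only.
texts = [
    "drinking methanol cures the virus",
    "vaccines contain a hidden virus",
    "health agency publishes vaccine trial results",
    "officials report new case numbers",
]
labels = ["false", "false", "true", "true"]
model = train_naive_bayes(texts, labels)
prediction = predict(model, "methanol cures everything")
```

With real data, the same two steps (vectorize the text, then fit a classifier on labeled examples) apply, with the word counts typically replaced by weighted term vectors or learned embeddings.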
Rather than identifying the content with misinformation, there are efforts to
detect the user accounts that would spread undesirable content such as spam
and misinformation. Social honeypot (Lee et al., 2011) is a method to identify
such users by attracting them to engage with a fake account, called a
honeypot. There are also bots producing computer-generated content to promote
misinformation (Himelein-Wachowiak et al., 2021).
2.3. Misinformation Datasets
There are several efforts in the literature to construct a dataset for
misinformation detection. The LIAR dataset (Wang, 2017) includes short
statements from different backgrounds, annotated by PolitiFact API. Shu et al.
(2020) compose a dataset of news and related tweets for fact-checked events.
Recently, global events and their repercussions in social media have led to
the emergence of new misinformation datasets. For instance, Memon and Carley
(2020) annotate tweets according to misinformation categories such as fake
treatments for COVID-19. Li et al. (2020) investigate news sources for fake
news in different languages. Hossain et al. (2020b) retrieve common
misconceptions about COVID-19, and label tweets according to their stances
against misconceptions. Weinzierl and Harabagiu (2022) compose the vaccine
version of the same dataset. Other datasets include COVID-19 healthcare
misinformation (Cui and Lee, 2020), and large-scale multimodal misinformation
(Nielsen and McConville, 2022). Hu et al. (2023) curate an annotated
multimodal social media dataset for two widely spoken languages (English and
Chinese), providing reply and retweet engagements. Lastly, datasets for
low-resource languages are very limited (Hossain et al., 2020a; Lucas et al.,
2022), and none exists for Turkish.