MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection

Cagri Toraman1*, Oguzhan Ozcelik2,3, Furkan Şahinuç4*, Fazli Can2

1Department of Computer Engineering, Middle East Technical University, Ankara, Turkey

2Department of Computer Engineering, Bilkent University, Ankara, Turkey

3Aselsan, Ankara, Turkey

4Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt

ctoraman@ceng.metu.edu.tr

oguzhan.ozcelik@bilkent.edu.tr

furkan.sahinuc@tu-darmstadt.de

canf@cs.bilkent.edu.tr

Abstract

The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation labels for several recent events between 2020 and 2022, including the Russia-Ukraine war, COVID-19 pandemic, and Refugees. The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.

Keywords: Human-annotation, Misinformation detection, Multi-event dataset, Tweet

1. Introduction

With the growth of online social networks, people develop new behaviors and trends. An example is the amount of news consumed in these networks, and eventually the phrase “social media” is coined. However, considering their popularity and easy accessibility, it is inevitable to observe different kinds of content in social media platforms, e.g., information manipulations, fake news, and misinformation/disinformation spread. Twitter (rebranded as X in July 2023) is one of the platforms where misinformation can be widely spread, as observed in the U.S. Elections (Grinberg et al., 2019), so that “fake news” became the Word of the Year in 2017 (CollinsDictionary, 2017).

Misinformation is spread in many domains including but not limited to health, politics, and disasters. Once misinformation is spread, the consequences can be devastating (Islam et al., 2020b; Reuters, 2022). For instance, many people died because of false rumors claiming that the cure for COVID-19 is drinking methanol (Islam et al., 2020b). Another example is that Ukraine sought an emergency order from the International Court of Justice due to the false claims of genocide against Russian speakers in Ukraine (Reuters, 2022). Considering the importance of misinformation spread in society and the ugly truth of unavoidable diffusion and beliefs, misinformation detection becomes a critical task that requires advanced methods and datasets.

*Work partially done in Aselsan, Ankara, Turkey.

We use misinformation as an umbrella term that refers to all instances where information contains falsehoods.

A straightforward solution for misinformation is to avoid the spread in advance. However, people can be biased against changing their beliefs even if corrections exist, and the attempts to correct falsehoods may not stop their spread and may sometimes even help their diffusion (Nyhan and Reifler, 2010). Moreover, targeted advertising to increase user engagement can help misinformation spread, which may be a source of revenue for social media platforms (Neumann et al., 2022).

We have four main observations on existing social media collections for misinformation detection. Although they mostly cover a limited number of topics (Ma et al., 2017), these topics remain too high-level to provide an opportunity to systematically examine which types of incidents trigger the misinformation spread. The availability of fine-grained event-specific information can play a significant role in capturing different user behaviors for detecting and preventing misinformation. Furthermore, the existing datasets focus on widely used languages such as English (D’Ulizia et al., 2021), while they are very limited for low-resource languages. Lastly, user engagements (like, reply, retweet, and quote) and media elements (image and video) in false tweets can be useful to analyze different types of information diffusion and detection methods (e.g., multimodal), but not all types are always included in the datasets.

In order to bridge these gaps, we present an annotated multi-event tweet dataset for Misinformation Detection under several recent events from 2020 to 2022, called MiDe22, including English and Turkish tweets with four types of user engagements: likes, replies, retweets, and quotes.

Figure 1: The topics (inner circle) and events (outer circle) in MiDe22 for English (left) and Turkish (right). The areas are proportional to the number of tweets they have.

arXiv:2210.05401v2 [cs.SI] 11 Jul 2024

1.1. Dataset Contents

The MiDe22 dataset consists of three parts: (i) Topics and Events, (ii) Tweets, and (iii) Engagements. Each part exists for both English (MiDe22-EN) and Turkish (MiDe22-TR).

Topics and Events. We consider the issues occupying the world’s agenda in recent years as the topics of our dataset. Then, we extract the significant events with the highest spread of misinformation. Figure 1 presents an overview of the structure of our dataset. The inner circle indicates the topics: the COVID-19 pandemic, the 2022 War between Russia and Ukraine, Refugees (Immigration), and Miscellaneous events that are not categorized under the previous topics. Overall, these topics contain 40 newsworthy events, shown in the outer circles of the figure. We also provide the event titles along with their topics online (see the dataset repository at https://github.com/metunlp/MiDe22).

Note that we prefer well-known recent events for both languages. The reason is that some misinformation events can be global and observed in several countries, such as “COVID-19 vaccines contain Human Immunodeficiency Virus (HIV)”. These common events can provide an opportunity to inspect how misinformation is spread in different languages. On the other hand, there are local events that have influence in specific regions. The details on the events are given in Section 3.1.

Tweets. The dataset has tweets related to the events. The crawling process is explained in Section 3.1. Each tweet is labeled according to three classes: False information, True information, and Other. The Other class includes tweets that cannot be categorized under false and true information. The annotation process is explained in Section 3.2.

The dataset and all other related documents can be accessed at https://github.com/metunlp/MiDe22

User Engagements and Media. We provide the user engagements with all tweets. Separate engagement splits are provided for each type: like, reply, retweet, and quote. We also provide the media elements in our dataset, i.e., images and videos, if they exist in the tweets.
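To make the three parts described above concrete, the following is a minimal sketch of how one tweet record could be represented in memory. It is an illustration only: the class and field names (`Tweet`, `tweet_id`, `engagements`, `media`) and the dictionary layout are assumptions made for this example and are not taken from the released files, whose actual schema should be checked in the dataset repository.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Label(Enum):
    """The three annotation classes named in the text."""
    FALSE = "False information"
    TRUE = "True information"
    OTHER = "Other"


def _empty_engagements() -> Dict[str, List[str]]:
    # One list per engagement type named in the text.
    return {"like": [], "reply": [], "retweet": [], "quote": []}


@dataclass
class Tweet:
    # All field names below are hypothetical, chosen for illustration.
    tweet_id: str
    language: str   # "EN" or "TR"
    topic: str      # e.g. "COVID-19"
    event: str      # event title under the topic
    label: Label
    # engagement type -> identifiers of the engaging items (assumed layout)
    engagements: Dict[str, List[str]] = field(default_factory=_empty_engagements)
    # references to images or videos, if any (assumed layout)
    media: List[str] = field(default_factory=list)


def label_counts(tweets: List[Tweet]) -> Counter:
    """Count how many tweets carry each label."""
    return Counter(t.label for t in tweets)
```

With records in this shape, per-language or per-topic label distributions reduce to filtering the list before calling `label_counts`.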

1.2. Contributions

Our contribution involves the development of a novel tweet dataset for misinformation detection in two languages with various topics and user engagements. The languages are English, a widely used language, and Turkish, a low-resource language. The topics of the dataset cover several recent events, such as the 2022 Russia-Ukraine War and the COVID-19 pandemic. The dataset includes the user engagements with all tweets in terms of likes, replies, retweets, and quotes. It can be used in many studies such as misinformation, event, and topic detection. Additionally, we conduct experiments to provide initial baseline scores from different model families, e.g., bag-of-words, neural, and transformer-based models. Apart from demonstrating the quality and utility of our dataset, these baselines also provide a benchmark for researchers to compare against and further enhance their developments. The variety of baseline models is rich enough to perform statistical tests and interpret the results properly.

| Dataset Name | Langs. | Domain | Topics | Date of Data | Engagements | Size | Labels |
|---|---|---|---|---|---|---|---|
| LIAR (Wang, 2017) | En | Statements | MISC | 2007-2016 | None | 12.8k | Annotated |
| FakeNewsNet (Shu et al., 2020) | En | News, tweets | MISC | n/a | None | 23.1k, 1.9m | Query |
| CoAID (Cui and Lee, 2020) | En | News, tweets | C19 | 2019-2020 | Reply | 4.2k, 160k | Query |
| COVIDLies (Hossain et al., 2020b) | En | Tweets | C19 | 2020 | None | 6.7k | Annotated |
| CMU-MisCOV19 (Memon and Carley, 2020) | En | Tweets | C19 | 2020 | None | 4.5k | Annotated |
| MM-COVID (Li et al., 2020) | 6 langs. | Tweets | C19 | n/a | Reply, retweet | 105.3k | Query |
| VaccineLies (Weinzierl and Harabagiu, 2022) | En | Tweets | C19, HPV | 2019-2021 | None | 14.6k | Annotated |
| MuMin (Nielsen and McConville, 2022) | 41 langs. | Tweets | MISC | n/a | Reply, retweet | 21.5m | Query |
| MR2 (Hu et al., 2023) | En, Zh | Tweets, Weibo | MISC | 2017-2022 | Reply, retweet | 14.7k | Annotated |
| MiDe22 (this study) | En, Tr | Tweets | RUW, C19, IMM, MISC | 2020-2022 | Reply, retweet, like, quote | 10.3k | Annotated |

Table 1: Related misinformation studies. RUW stands for Russia-Ukraine War, C19 for COVID-19, IMM for Immigration and Refugees, HPV for Human Papilloma Virus, and MISC for Miscellaneous. The last column shows if tweets are annotated by humans, or labeled by the output of queries to Twitter API. Size is given in terms of number of tweets.

2. Related Work

In this section, we provide a brief review of the existing literature and explore the methods used for the analysis and detection of misinformation, the available datasets for research purposes, and the various interventions implemented to combat the spread of misinformation.

2.1. Misinformation Analysis

Misinformation analysis is the process of identifying, evaluating, and understanding the spread and impact of false, misleading, or inaccurate information. Misinformation modeling covers temporal patterns of information diffusion to analyze spread (Shin et al., 2018; Rosenfeld et al., 2020), and also the analysis of misinformation spread during important events such as the 2016 U.S. Election (Grinberg et al., 2019), the COVID-19 Pandemic (Ferrara et al., 2020), and the 2020 BLM Movement (Toraman et al., 2022a).

2.2. Misinformation Detection

Misinformation detection is a challenging task when the dynamics of misinformation spread are considered. The task is also studied as fake news detection (Zhou and Zafarani, 2020), rumor detection (Zubiaga et al., 2018), and fact/claim verification (Bekoulis et al., 2021; Guo et al., 2022).

There are two important aspects of misinformation detection. First, the task mostly depends on supervised learning with a labeled dataset. Second, existing studies rely on different feature types for automated misinformation detection (Wu et al., 2016). Text contents are represented in a vector or embedding space by natural language processing (Oshikawa et al., 2020), and the task is formulated as classification or regression, mostly solved by deep learning models (Islam et al., 2020a). The features extracted from user profiles can be used to detect the spreaders (Lee et al., 2011). Besides contents, there are efforts to extract features from the network structure such as network diffusion models (Kwon and Cha, 2014; Shu et al., 2019a) and graph neural networks (Mehta et al., 2022). Lastly, external knowledge sources (Shi and Weninger, 2016; Toraman et al., 2022b) and the social context among publishers, news, and users (Shu et al., 2019b) can be integrated into the learning phase.
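As an illustration of the supervised, text-based formulation described above, the sketch below represents each text as a bag of words and classifies it with a multinomial Naive Bayes model using add-one smoothing. This is a generic textbook technique written with the standard library only; it is not the baseline implementation used in this study, and the example sentences are invented toy inputs, not tweets from any dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class BagOfWordsNaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)     # documents per class
        self.word_counts = defaultdict(Counter)  # class -> word frequencies
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log(
                    (counts[tok] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
train_texts = [
    "drinking methanol cures covid",
    "methanol is a miracle cure",
    "vaccines reduce severe covid illness",
    "health agency approves vaccines",
]
train_labels = ["false", "false", "true", "true"]
model = BagOfWordsNaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict("methanol cure")` scores the text under each class and returns the higher-scoring label. Neural and transformer-based models replace the word-count representation with learned embeddings, but the supervised setup stays the same.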

Rather than identifying the content with misinformation, there are efforts to detect the user accounts that would spread undesirable content such as spamming and misinformation. Social honeypot (Lee et al., 2011) is a method to identify such users by attracting them to engage with a fake account, called honeypot. There are also bots producing computer-generated content to promote misinformation (Himelein-Wachowiak et al., 2021).

2.3. Misinformation Datasets

There are several efforts in the literature to construct a dataset for misinformation detection. The LIAR dataset (Wang, 2017) includes short statements from different backgrounds, annotated by the PolitiFact API. Shu et al. (2020) compose a dataset of news and related tweets for fact-checked events. Recently, global events and their repercussions in social media have led to the emergence of new misinformation datasets. For instance, Memon and Carley (2020) annotate tweets according to misinformation categories such as fake treatments for COVID-19. Li et al. (2020) investigate news sources for fake news in different languages. Hossain et al. (2020b) retrieve common misconceptions about COVID-19, and label tweets according to their stances against misconceptions. Weinzierl and Harabagiu (2022) compose the vaccine version of the same dataset. Other datasets include COVID-19 healthcare misinformation (Cui and Lee, 2020), and large-scale multimodal misinformation (Nielsen and McConville, 2022). Hu et al. (2023) curate an annotated multimodal social media dataset for two widely spoken languages (English and Chinese), providing reply and retweet engagements. Lastly, there are very few datasets for low-resource languages (Hossain et al., 2020a; Lucas et al., 2022), and none exist for Turkish.
