HUMSET: Dataset of Multilingual Information Extraction and
Classification for Humanitarian Crisis Response
Selim Fekih¹  Nicolò Tamagnone¹  Benjamin Minixhofer³  Ranjan Shrestha²
Ximena Contla¹  Ewan Oglethorpe¹  Navid Rekabsaz³
¹Data Friendly Space  ²ToggleCorp Solutions
³Johannes Kepler University Linz, LIT AI Lab, Austria
{selim, nico, ximena, ewan}@datafriendlyspace.org
ranjan.shrestha@togglecorp.com
{benjamin.minixhofer, navid.rekabsaz}@jku.at
Abstract
Timely and effective response to humanitarian
crises requires quick and accurate analysis of
large amounts of text data – a process that can
highly benefit from expert-assisted NLP sys-
tems trained on validated and annotated data
in the humanitarian response domain. To enable the creation of such NLP systems, we intro-
duce and release HUMSET, a novel and rich
multilingual dataset of humanitarian response
documents annotated by experts in the human-
itarian response community. The dataset pro-
vides documents in three languages (English,
French, Spanish) and covers a variety of hu-
manitarian crises from 2018 to 2021 across the
globe. For each document, HUMSET provides
selected snippets (entries) as well as assigned
classes to each entry annotated using com-
mon humanitarian information analysis frame-
works. HUMSET also provides novel and chal-
lenging entry extraction and multi-label entry
classification tasks. In this paper, we take a
first step towards approaching these tasks and
conduct a set of experiments on Pre-trained
Language Models (PLM) to establish strong
baselines for future research in this domain.
The dataset is available at https://blog.thedeep.io/humset/.
1 Introduction
During humanitarian crises caused by events ranging from natural disasters and wars to epidemics such as COVID-19, a timely and effective humanitarian response highly depends on fast and accurate analysis of relevant data to yield key information.
Early in the response phase, namely in the first 72 hours after a disaster strikes, the humanitarian response analysts in international organizations¹ review large amounts of data loosely or strongly relevant to the crisis to gain situational awareness.
¹ Such as the International Federation of Red Cross (IFRC), the United Nations High Commissioner for Refugees (UNHCR), or the United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA).
A large portion of this data appears in the form of secondary data sources, i.e., reports, news, and other forms of text data, and is integral in revealing which types of relief activities to undertake. Analy-
sis in this phase involves extracting key information
and organizing it according to sets of pre-defined
domain-specific structures and guidelines, referred
to as humanitarian analysis frameworks.
While typically only small workforces are avail-
able to analyze such information, an automatic doc-
ument processing system can significantly help ana-
lysts save time in the overall humanitarian response
cycle. To facilitate such systems, we introduce
and release HUMSET, a unique and rich dataset
of document analysis in the humanitarian response
domain. HUMSET is curated by humanitarian ana-
lysts and covers various disasters around the globe
that occurred from 2018 to 2021 in 46 humani-
tarian response projects. The dataset consists of
approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available resources.² For
each document, analysts have identified informa-
tive snippets (entries) with respect to common hu-
manitarian frameworks and assigned one or more classes to each entry (details in §2).
HUMSET provides a large dataset for the training
and evaluation of entry extraction and classification
models, enabling the research and development of
further NLP systems in the humanitarian response
domain. We take the first step in this direction, by
studying the performance of a set of strong base-
line models (details in §3). Our released dataset
expands the previously provided collection by Yela-
Bello et al. (2021) with a more recent and compre-
hensive set of projects, as well as additional classi-
fication labels. Among other similar datasets in the humanitarian domain, Imran et al. (2016) present human-annotated Twitter corpora collected during 19 different crises between 2013 and 2015, Alam et al.
(2021) provide a combination of various social-
media crisis-related existing datasets, and Adel and
Wang (2020) and later Alharbi and Lee (2021) pub-
lish Arabic Twitter classification datasets for crisis
events. HUMSET, in contrast to the current resources, which mostly originated from social media, is created by humanitarian experts through an annotation process on official documents and news from the most recognized humanitarian agencies, conferring high reliability, continuous updating, and accurate geolocation information.
² https://app.thedeep.io/terms-and-privacy/
2 HUMSET Dataset
The collection originated from a multi-organizational platform called the Data Entry and Exploration Platform (DEEP),³ developed and maintained by Data Friendly Space (DFS).⁴ The platform facilitates classifying primarily qualitative information with respect to analysis frameworks and allows for collaborative classification and annotation of secondary data. The dataset is available at https://blog.thedeep.io/humset/.
³ https://thedeep.io/
⁴ Data Friendly Space is a U.S.-based international non-governmental organization (INGO): https://datafriendlyspace.org/
2.1 Dataset Overview
HUMSET consists of data used to inform 46 humanitarian response operations across the globe: 24 responses were in Central/South America, 14 in Africa, and 8 in Asia (the detailed list of countries can be found in Table 6 in the Appendix).
For each project, documents related to a particular humanitarian crisis, referred to as leads, are collected, analyzed, and annotated. The annotated documents in the dataset mostly consist of recently released information, with 79% of the documents being released in 2020 and 2021 (Table 5 in the Appendix), and 90% of all documents being sourced from websites (see Table 4 in the Appendix for the most commonly used platforms). Documents are selected from different sources, ranging from official reports by humanitarian organizations to international and national media articles. Overall, documents consist of files in PDF format (70.4%) and HTML pages (29.6%), with an average length of 2K words. The number of documents analyzed per project varies, ranging from 2 to 2,266.
Figure 1: (a) Distribution of documents per project. (b) Log-scale distribution of tokens⁵ per document. (c) Log-scale distribution of tokens per entry.
⁵ Tokenized using word_tokenize in the NLTK v3.7 library (Bird and Loper, 2004).
The relevant snippets of text, referred to as entries, in each document are annotated by humanitarian experts. The dataset provides an average of 10 entries per document and an average length of 65 words per entry. Overall, HUMSET is
composed of 148,621 tagged entries, selected from
16,857 documents, and in three languages: English
(61.3%), French (20.4%), and Spanish (18.3%). The list of projects, as well as the number of documents and annotated entries per project, is reported in Table 7 in the Appendix. Figure 1 shows the distribution of the number of tagged documents per project, as well as the number of tokens per document and entry.
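For reference, the token counts in Figure 1 rely on NLTK's word_tokenize (see the figure footnote). The following is a minimal sketch of how such per-document and per-entry length statistics could be reproduced; the example strings are illustrative, not taken from the dataset:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models required by word_tokenize

# Illustrative texts; in practice these would be the HUMSET documents and entries.
documents = ["Heavy flooding has displaced thousands of households across the region."]
entries = ["thousands of households displaced"]

doc_lengths = [len(word_tokenize(text)) for text in documents]
entry_lengths = [len(word_tokenize(text)) for text in entries]

print(f"average tokens per document: {sum(doc_lengths) / len(doc_lengths):.1f}")
print(f"average tokens per entry:    {sum(entry_lengths) / len(entry_lengths):.1f}")
```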
2.2 Humanitarian Analysis Frameworks and
Data Annotation Process
The concept of analytical frameworks originated in
the social sciences (Ragin and Amoroso, 2011), but
can be considered foundational and indispensable
in numerous research fields. An analytical frame-
work is a set of methodologies and guidelines to
facilitate data collection, collation, and analysis,
helping to understand what information will be
useful and what can be discarded.
In the humanitarian domain, an analytical framework (or analysis framework) not only assists decision-makers to speed up humanitarian response and disaster relief but also enables various groups to share resources (Zhang et al., 2002). When starting
a response or project, humanitarian organizations create an analysis framework, or more often use an existing one, covering both the generic and the specific needs of the work. Our data originally contained 11 different frameworks. As there are high similarities across frameworks, we created a common framework, which we refer to as the humanitarian analysis framework and which covers the framework dimensions of all projects. We build our custom set of tags by mapping the original tags in the other frameworks to ours. More specifically, our analysis framework consists of three categories: Sectors (11 tags), Subpillars 1D (33 tags), and Subpillars 2D (18 tags). Pillars/Subpillars 1D and 2D have a hierarchical structure, consisting of a two-leveled tree hierarchy (Pillars to Subpillars). The list and the number of tags present for each category are reported in Table 1.

Table 1: Overview of the humanitarian analysis framework.
Category      | # Tags | Tags
Sectors       | 11     | Agriculture, Cross-sector, Education, Food Security, Health, Livelihoods, Logistics, Nutrition, Protection, Shelter, WASH (Water, Sanitation & Hygiene)
Pillars 1D    | 7      | Context, COVID-19, Displacement, Humanitarian Access, Information & Communication, Casualties, Shock/Event
Subpillars 1D | 33     | Details in Table 8 in the Appendix
Pillars 2D    | 6      | Capacities & Response, Humanitarian Conditions, Impact, At Risk, Priority Needs, Priority Interventions
Subpillars 2D | 18     | Details in Table 9 in the Appendix
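To make the structure concrete, the following is a minimal sketch of the two-level Pillars-to-Subpillars hierarchy and of the tag-mapping step described above. The pillar names are taken from Table 1, while the subpillar names, project identifiers, and original source tags are hypothetical placeholders (the actual subpillar lists are given in Tables 8 and 9 in the Appendix):

```python
# Two-level hierarchy: pillar -> list of subpillars.
# Pillar names come from Table 1; the subpillar names below are hypothetical placeholders.
PILLARS_1D = {
    "Displacement": ["Type/numbers of displacement", "Push factors"],
    "Humanitarian Access": ["Physical constraints"],
}

# Mapping from a project-specific framework tag onto the common framework
# (project identifiers and original tag names are hypothetical as well).
TAG_MAPPING = {
    ("project_a", "IDP figures"): ("Subpillars 1D", "Displacement", "Type/numbers of displacement"),
    ("project_b", "Road blockages"): ("Subpillars 1D", "Humanitarian Access", "Physical constraints"),
}

def to_common_framework(project: str, original_tag: str):
    """Resolve a project-specific tag to (category, pillar, subpillar) in the common framework."""
    return TAG_MAPPING.get((project, original_tag))

print(to_common_framework("project_a", "IDP figures"))
```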
For each project, documents relevant to under-
standing the situation, unmet needs, and underly-
ing factors are captured and uploaded to the DEEP
platform. From these sources, entries of text are
selected and categorized into an analysis frame-
work. Humanitarian annotators are trained in specific projects to follow analytical standards and thinking when reviewing secondary data.
This process eventually results in annotating and
organizing the data according to the humanitarian
analysis framework. As the HUMSET dataset is
created in a real-world scenario, the distribution
of annotated entries is skewed, with 33 tags being present in less than 2% of the data. Tables 10, 11, and 12 in the Appendix show the detailed numbers and proportions of the annotated entries in Sectors, Subpillars 1D, and Subpillars 2D, respectively. Figure 2 in the Appendix reports the distribution of tags in the dataset.
2.3 NLP Tasks
Entry Extraction Task.
The first step for hu-
manitarian taggers in analyzing a document is find-
ing entries containing relevant information. A piece of text or information is considered relevant if it can be meaningfully assigned at least one tag from the given humanitarian analytical framework. Since
documents often contain a large amount of infor-
mation (Figure 1), it is extremely beneficial to au-
tomate the process of entry identification, and this
is the first task of this research. This can be seen as an extractive summarization task, i.e., selecting a subset of passages that contain relevant information from the given document. However, the entries do not necessarily follow common units of text such as sentences or paragraphs and can appear in various lengths. In fact, only 38.8% of entries consist of full sentences, and the rest are snippets that are shorter or longer than sentences. This limits the direct applicability of prior approaches to extractive summarization (Liu and Lapata, 2019; Zhou et al., 2018) and makes the task particularly challenging for NLP research.
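Because entries are arbitrary spans rather than whole sentences, one plausible baseline framing, sketched below under assumptions and not necessarily the setup evaluated in §3, is token-level binary tagging with a multilingual pre-trained encoder, labeling each token as inside or outside an annotated entry. The document text and character offsets are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # multilingual encoder covering English, French, Spanish
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

document = "Heavy rains destroyed crops. Thousands of households need food assistance."
entry_spans = [(29, 74)]  # illustrative character offsets of one annotated entry

enc = tokenizer(document, return_offsets_mapping=True, return_tensors="pt", truncation=True)
offsets = enc.pop("offset_mapping")[0]

# Label a token 1 if it overlaps any annotated entry span, else 0
# (special tokens have empty offsets and therefore stay 0).
labels = torch.zeros(offsets.size(0), dtype=torch.long)
for start, end in entry_spans:
    for i, (tok_start, tok_end) in enumerate(offsets.tolist()):
        if tok_end > tok_start and tok_start < end and tok_end > start:
            labels[i] = 1

outputs = model(**enc, labels=labels.unsqueeze(0))
outputs.loss.backward()  # one backward pass; optimizer and training loop omitted
print(f"training loss: {outputs.loss.item():.3f}")
```

At inference time, consecutive tokens predicted as relevant can be merged back into candidate entry spans, which accommodates entries that are shorter or longer than a sentence.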
Multi-label Entry Classification Task.
After
selecting the most relevant entries within a docu-
ment, the next step is to categorize them according
to the humanitarian analysis framework (Table 1).
An automatic suggestion of which tags to choose from a large number of possibilities can be decisive
in speeding up the annotation process. For each
category, more than one tag can be assigned to an
entry. Hence, we can view this task as multi-label
classification.
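A minimal sketch of such a multi-label classifier for the Sectors category follows, using the multi-label head of the Hugging Face transformers library (sigmoid outputs trained with a binary cross-entropy loss). The Sectors tag names come from Table 1; the model choice, the example entry, and its gold tags are assumptions rather than the paper's exact configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SECTORS = [
    "Agriculture", "Cross-sector", "Education", "Food Security", "Health",
    "Livelihoods", "Logistics", "Nutrition", "Protection", "Shelter", "WASH",
]

MODEL_NAME = "xlm-roberta-base"  # multilingual encoder covering English, French, Spanish
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(SECTORS),
    problem_type="multi_label_classification",  # uses a BCE-with-logits loss internally
)

entry = "Thousands of displaced households lack access to safe drinking water."
gold = {"WASH", "Health"}  # illustrative gold tags for this entry

labels = torch.tensor([[1.0 if s in gold else 0.0 for s in SECTORS]])
enc = tokenizer(entry, return_tensors="pt", truncation=True)

outputs = model(**enc, labels=labels)
probs = torch.sigmoid(outputs.logits)[0]
# Threshold the per-tag probabilities; the untrained head's predictions are
# meaningless here and only illustrate the decoding step.
predicted = [s for s, p in zip(SECTORS, probs.tolist()) if p > 0.5]
print(f"loss: {outputs.loss.item():.3f}  predicted: {predicted}")
```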
3 Experiments and Results
To conduct a set of baseline experiments on HUMSET according to the mentioned tasks, we split the data into training, validation, and test sets for all our experiments (80%, 10%, and 10%, respectively). We apply stratified splitting (Szymanski and Kajdanowicz, 2017) to maintain the same distribution of labels for each set.
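One implementation of iterative multi-label stratification by the same authors is available in the scikit-multilearn library; the following is a minimal sketch of producing an 80%/10%/10% split with it, where the feature matrix X and the binary label matrix Y are random placeholders standing in for the entries and their tags:

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Placeholder data: X holds entry indices (any per-entry features would do),
# Y is a binary multi-label matrix of shape (n_entries, n_tags).
rng = np.random.default_rng(0)
X = np.arange(1000).reshape(-1, 1)
Y = (rng.random((1000, 11)) < 0.2).astype(int)

# First carve out 20% of the data, then split that part in half
# to obtain the validation and test sets.
X_train, Y_train, X_rest, Y_rest = iterative_train_test_split(X, Y, test_size=0.2)
X_val, Y_val, X_test, Y_test = iterative_train_test_split(X_rest, Y_rest, test_size=0.5)

print(len(X_train), len(X_val), len(X_test))  # roughly 800 / 100 / 100
```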
Implementation details of Entry Extraction (Section 3.1) and Entry