HUMSET: Dataset of Multilingual Information Extraction and
Classification for Humanitarian Crisis Response
Selim Fekih¹  Nicolò Tamagnone¹  Benjamin Minixhofer³  Ranjan Shrestha²
Ximena Contla¹  Ewan Oglethorpe¹  Navid Rekabsaz³
¹Data Friendly Space  ²ToggleCorp Solutions
³Johannes Kepler University Linz, LIT AI Lab, Austria
{selim, nico, ximena, ewan}@datafriendlyspace.org
ranjan.shrestha@togglecorp.com
{benjamin.minixhofer, navid.rekabsaz}@jku.at
Abstract
Timely and effective response to humanitarian
crises requires quick and accurate analysis of
large amounts of text data – a process that can
highly benefit from expert-assisted NLP sys-
tems trained on validated and annotated data
in the humanitarian response domain. To enable the creation of such NLP systems, we intro-
duce and release HUMSET, a novel and rich
multilingual dataset of humanitarian response
documents annotated by experts in the human-
itarian response community. The dataset pro-
vides documents in three languages (English,
French, Spanish) and covers a variety of hu-
manitarian crises from 2018 to 2021 across the
globe. For each document, HUMSET provides
selected snippets (entries) as well as assigned
classes to each entry annotated using com-
mon humanitarian information analysis frame-
works. HUMSET also provides novel and chal-
lenging entry extraction and multi-label entry
classification tasks. In this paper, we take a
first step towards approaching these tasks and
conduct a set of experiments on Pre-trained
Language Models (PLM) to establish strong
baselines for future research in this domain.
The dataset is available at https://blog.thedeep.io/humset/.
1 Introduction
During humanitarian crises caused by events ranging from natural disasters and wars to epidemics such as COVID-19, a timely and effective humanitarian response highly depends on fast and accurate analysis of relevant data to yield key information.
Early in the response phase, namely in the first 72 hours after a disaster strikes, the humanitarian response analysts in international organizations¹ review large amounts of data loosely or strongly relevant to the crisis to gain situational awareness.
¹ Such as the International Federation of Red Cross (IFRC), the United Nations High Commissioner for Refugees (UNHCR), or the United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA).
A large portion of this data appears in the form of secondary data sources, i.e., reports, news, and other forms of text data, and is integral in revealing which types of relief activities to undertake. Analy-
sis in this phase involves extracting key information
and organizing it according to sets of pre-defined
domain-specific structures and guidelines, referred
to as humanitarian analysis frameworks.
While typically only small workforces are avail-
able to analyze such information, an automatic doc-
ument processing system can significantly help ana-
lysts save time in the overall humanitarian response
cycle. To facilitate such systems, we introduce
and release HUMSET, a unique and rich dataset
of document analysis in the humanitarian response
domain. HUMSET is curated by humanitarian ana-
lysts and covers various disasters around the globe
that occurred from 2018 to 2021 in 46 humani-
tarian response projects. The dataset consists of
approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available resources.² For
each document, analysts have identified informa-
tive snippets (entries) with respect to common hu-
manitarian frameworks and assigned one or more classes to each entry (details in §2).
HUMSET provides a large dataset for the training
and evaluation of entry extraction and classification
models, enabling the research and development of
further NLP systems in the humanitarian response
domain. We take the first step in this direction, by
studying the performance of a set of strong base-
line models (details in §3). Our released dataset
expands the previously provided collection by Yela-
Bello et al. (2021) with a more recent and compre-
hensive set of projects, as well as additional classi-
fication labels. Among other similar datasets in the humanitarian domain, Imran et al. (2016) present human-annotated Twitter corpora collected during 19 different crises between 2013 and 2015, Alam et al.
(2021) provide a combination of various social-
media crisis-related existing datasets, and Adel and
Wang (2020) and later Alharbi and Lee (2021) pub-
lish Arabic Twitter classification datasets for crisis
events. HUMSET, in contrast to the current resources, which mostly originated from social media, is created by humanitarian experts through an annotation process on official documents and news from the most recognized humanitarian agencies, conferring high reliability, continuous updating, and accurate geolocation information.
² https://app.thedeep.io/terms-and-privacy/
2 HUMSET Dataset
The collection originated from a multi-organizational platform called the Data Entry and Exploration Platform (DEEP),³ developed and maintained by Data Friendly Space (DFS).⁴ The platform facilitates classifying primarily qualitative information with respect to analysis frameworks and allows for collaborative classification and annotation of secondary data. The dataset is available at https://blog.thedeep.io/humset/.
³ https://thedeep.io/
⁴ Data Friendly Space is a U.S.-based international non-governmental organization (INGO): https://datafriendlyspace.org/
2.1 Dataset Overview
HUMSET consists of data used to inform 46 humanitarian response operations across the globe: 24 responses were in Central/South America, 14 in Africa, and 8 in Asia (the detailed list of countries can be found in Table 6 in the Appendix).
For each project, documents related to a particular humanitarian crisis, referred to as leads, are collected, analyzed, and annotated. The annotated documents in the dataset mostly consist of recently released information, with 79% of the documents being released in 2020 and 2021 (Table 5 in the Appendix), and 90% of all documents being sourced from websites (see Table 4 in the Appendix for the most commonly used platforms). Documents are selected from different sources, ranging from official reports by humanitarian organizations to international and national media articles. Overall, documents consist of files in PDF format (70.4%) and HTML pages (29.6%), with an average length of 2K words. The number of documents analyzed per project varies, ranging from 2 to 2,266.
Figure 1: (a) Distribution of documents per project. (b) Log-scale distribution of tokens⁵ per document. (c) Log-scale distribution of tokens per entry.
⁵ Tokenized using word_tokenize in the NLTK v3.7 library (Bird and Loper, 2004).
The relevant snippets of text, referred to as entries, in each document are annotated by humanitarian experts. The dataset provides an average of 10 entries per document and an average length of 65 words per entry. Overall, HUMSET is
composed of 148,621 tagged entries, selected from
16,857 documents, and in three languages: English
(61.3%), French (20.4%), and Spanish (18.3%). The list of projects, as well as the number of documents and annotated entries per project, is reported in Table 7 in the Appendix. Figure 1 shows the distribution of the number of tagged documents per project, as well as the number of tokens per document and entry.
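For reference, the token counts in Figure 1 rely on NLTK's word_tokenize (see the figure footnote). The following is a minimal sketch of how such per-document and per-entry length statistics could be reproduced; the example strings are illustrative, not taken from the dataset:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models required by word_tokenize

# Illustrative texts; in practice these would be the HUMSET documents and entries.
documents = ["Heavy flooding has displaced thousands of households across the region."]
entries = ["thousands of households displaced"]

doc_lengths = [len(word_tokenize(text)) for text in documents]
entry_lengths = [len(word_tokenize(text)) for text in entries]

print(f"average tokens per document: {sum(doc_lengths) / len(doc_lengths):.1f}")
print(f"average tokens per entry:    {sum(entry_lengths) / len(entry_lengths):.1f}")
```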
2.2 Humanitarian Analysis Frameworks and
Data Annotation Process
The concept of analytical frameworks originated in
the social sciences (Ragin and Amoroso, 2011), but
can be considered foundational and indispensable
in numerous research fields. An analytical frame-
work is a set of methodologies and guidelines to
facilitate data collection, collation, and analysis,
helping to understand what information will be
useful and what can be discarded.
In the humanitarian domain, an analytical framework (or analysis framework) not only assists decision-makers to speed up humanitarian response and disaster relief but also enables various groups to share resources (Zhang et al., 2002). When starting
a response or project, humanitarian organizations create an analysis framework, or more often use an existing one, covering both the generic and the specific needs of the work. Our data originally contained 11 different frameworks. As there are high similarities across frameworks, we created a common framework, which we refer to as the humanitarian analysis framework and which covers the framework dimensions of all projects. We build our custom set of tags by mapping the original tags in the other frameworks to ours. More specifically, our analysis framework consists of three categories: Sectors (11 tags), Subpillars 1D (33 tags), and Subpillars 2D (18 tags). Pillars/Subpillars 1D and 2D have a hierarchical structure, consisting of a two-leveled tree hierarchy (Pillars to Subpillars). The list and the number of tags present for each category are reported in Table 1.

Table 1: Overview of the humanitarian analysis framework.
Category      | # Tags | Tags
Sectors       | 11     | Agriculture, Cross-sector, Education, Food Security, Health, Livelihoods, Logistics, Nutrition, Protection, Shelter, WASH (Water, Sanitation & Hygiene)
Pillars 1D    | 7      | Context, COVID-19, Displacement, Humanitarian Access, Information & Communication, Casualties, Shock/Event
Subpillars 1D | 33     | Details in Table 8 in the Appendix
Pillars 2D    | 6      | Capacities & Response, Humanitarian Conditions, Impact, At Risk, Priority Needs, Priority Interventions
Subpillars 2D | 18     | Details in Table 9 in the Appendix
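To make the structure concrete, the following is a minimal sketch of the two-level Pillars-to-Subpillars hierarchy and of the tag-mapping step described above. The pillar names are taken from Table 1, while the subpillar names, project identifiers, and original source tags are hypothetical placeholders (the actual subpillar lists are given in Tables 8 and 9 in the Appendix):

```python
# Two-level hierarchy: pillar -> list of subpillars.
# Pillar names come from Table 1; the subpillar names below are hypothetical placeholders.
PILLARS_1D = {
    "Displacement": ["Type/numbers of displacement", "Push factors"],
    "Humanitarian Access": ["Physical constraints"],
}

# Mapping from a project-specific framework tag onto the common framework
# (project identifiers and original tag names are hypothetical as well).
TAG_MAPPING = {
    ("project_a", "IDP figures"): ("Subpillars 1D", "Displacement", "Type/numbers of displacement"),
    ("project_b", "Road blockages"): ("Subpillars 1D", "Humanitarian Access", "Physical constraints"),
}

def to_common_framework(project: str, original_tag: str):
    """Resolve a project-specific tag to (category, pillar, subpillar) in the common framework."""
    return TAG_MAPPING.get((project, original_tag))

print(to_common_framework("project_a", "IDP figures"))
```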
For each project, documents relevant to under-
standing the situation, unmet needs, and underly-
ing factors are captured and uploaded to the DEEP
platform. From these sources, entries of text are
selected and categorized into an analysis frame-
work. Humanitarian annotators are trained in specific projects to follow analytical standards and thinking when reviewing secondary data.
This process eventually results in annotating and
organizing the data according to the humanitarian
analysis framework. As the HUMSET dataset is
created in a real-world scenario, the distribution
of annotated entries is skewed, with 33 tags being present in less than 2% of the data. Tables 10, 11, and 12 in the Appendix show the detailed numbers and proportions of the annotated entries in Sectors, Subpillars 1D, and Subpillars 2D, respectively. Figure 2 in the Appendix reports the distribution of tags in the dataset.
2.3 NLP Tasks
Entry Extraction Task.
The first step for hu-
manitarian taggers in analyzing a document is find-
ing entries containing relevant information. A piece of text or information is considered relevant if it can be meaningfully assigned at least one tag from the given humanitarian analytical framework. Since
documents often contain a large amount of infor-
mation (Figure 1), it is extremely beneficial to au-
tomate the process of entry identification, and this
is the first task of this research. This can be seen as an extractive summarization task, i.e., selecting a subset of passages that contain relevant information from the given document. However, the entries do not necessarily follow common units of text such as sentences or paragraphs and can appear in various lengths. In fact, only 38.8% of entries consist of full sentences, and the rest are snippets that are shorter or longer than sentences. This limits the direct applicability of prior approaches to extractive summarization (Liu and Lapata, 2019; Zhou et al., 2018) and makes the task particularly challenging for NLP research.
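Because entries are arbitrary spans rather than whole sentences, one plausible baseline framing, sketched below under assumptions and not necessarily the setup evaluated in §3, is token-level binary tagging with a multilingual pre-trained encoder, labeling each token as inside or outside an annotated entry. The document text and character offsets are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # multilingual encoder covering English, French, Spanish
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

document = "Heavy rains destroyed crops. Thousands of households need food assistance."
entry_spans = [(29, 74)]  # illustrative character offsets of one annotated entry

enc = tokenizer(document, return_offsets_mapping=True, return_tensors="pt", truncation=True)
offsets = enc.pop("offset_mapping")[0]

# Label a token 1 if it overlaps any annotated entry span, else 0
# (special tokens have empty offsets and therefore stay 0).
labels = torch.zeros(offsets.size(0), dtype=torch.long)
for start, end in entry_spans:
    for i, (tok_start, tok_end) in enumerate(offsets.tolist()):
        if tok_end > tok_start and tok_start < end and tok_end > start:
            labels[i] = 1

outputs = model(**enc, labels=labels.unsqueeze(0))
outputs.loss.backward()  # one backward pass; optimizer and training loop omitted
print(f"training loss: {outputs.loss.item():.3f}")
```

At inference time, consecutive tokens predicted as relevant can be merged back into candidate entry spans, which accommodates entries that are shorter or longer than a sentence.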
Multi-label Entry Classification Task.
After
selecting the most relevant entries within a docu-
ment, the next step is to categorize them according
to the humanitarian analysis framework (Table 1).
An automatic suggestion of which tags to choose from a large number of possibilities can be decisive
in speeding up the annotation process. For each
category, more than one tag can be assigned to an
entry. Hence, we can view this task as multi-label
classification.
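A minimal sketch of such a multi-label classifier for the Sectors category follows, using the multi-label head of the Hugging Face transformers library (sigmoid outputs trained with a binary cross-entropy loss). The Sectors tag names come from Table 1; the model choice, the example entry, and its gold tags are assumptions rather than the paper's exact configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SECTORS = [
    "Agriculture", "Cross-sector", "Education", "Food Security", "Health",
    "Livelihoods", "Logistics", "Nutrition", "Protection", "Shelter", "WASH",
]

MODEL_NAME = "xlm-roberta-base"  # multilingual encoder covering English, French, Spanish
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(SECTORS),
    problem_type="multi_label_classification",  # uses a BCE-with-logits loss internally
)

entry = "Thousands of displaced households lack access to safe drinking water."
gold = {"WASH", "Health"}  # illustrative gold tags for this entry

labels = torch.tensor([[1.0 if s in gold else 0.0 for s in SECTORS]])
enc = tokenizer(entry, return_tensors="pt", truncation=True)

outputs = model(**enc, labels=labels)
probs = torch.sigmoid(outputs.logits)[0]
# Threshold the per-tag probabilities; the untrained head's predictions are
# meaningless here and only illustrate the decoding step.
predicted = [s for s, p in zip(SECTORS, probs.tolist()) if p > 0.5]
print(f"loss: {outputs.loss.item():.3f}  predicted: {predicted}")
```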
3 Experiments and Results
To conduct a set of baseline experiments on HUMSET according to the mentioned tasks, we split the data into training, validation, and test sets for all our experiments (80%, 10%, and 10%, respectively). We apply stratified splitting (Szymanski and Kajdanowicz, 2017) to maintain the same distribution of labels for each set.
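One implementation of iterative multi-label stratification by the same authors is available in the scikit-multilearn library; the following is a minimal sketch of producing an 80%/10%/10% split with it, where the feature matrix X and the binary label matrix Y are random placeholders standing in for the entries and their tags:

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Placeholder data: X holds entry indices (any per-entry features would do),
# Y is a binary multi-label matrix of shape (n_entries, n_tags).
rng = np.random.default_rng(0)
X = np.arange(1000).reshape(-1, 1)
Y = (rng.random((1000, 11)) < 0.2).astype(int)

# First carve out 20% of the data, then split that part in half
# to obtain the validation and test sets.
X_train, Y_train, X_rest, Y_rest = iterative_train_test_split(X, Y, test_size=0.2)
X_val, Y_val, X_test, Y_test = iterative_train_test_split(X_rest, Y_rest, test_size=0.5)

print(len(X_train), len(X_val), len(X_test))  # roughly 800 / 100 / 100
```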
Implementation details of Entry Extraction (Section 3.1) and Entry