Named Entity Recognition in Twitter:
A Dataset and Analysis on Short-Term Temporal Shifts
Asahi Ushio1, Leonardo Neves2, Vítor Silva2, Francesco Barbieri2, Jose Camacho-Collados1
1Cardiff NLP, School of Computer Science and Informatics, Cardiff University, United Kingdom
{UshioA,CamachoColladosJ}@cardiff.ac.uk
2Snap Inc., Santa Monica, CA, United States
{lneves,vsilvasousa,fbarbieri}@snap.com
Abstract
Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested in well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different, as its noisy and dynamic nature adds another layer of complexity. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and perform an analysis of language model performance on the task, especially analyzing the impact of different time periods. In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative when recently labeled data is lacking. TweetNER7 is released publicly¹ along with the models fine-tuned on it².
¹ https://huggingface.co/datasets/tner/tweetner7
² NER models have been integrated into TweetNLP (Camacho-Collados et al., 2022) and can be found at https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper
1 Introduction
Named Entity Recognition (NER) is a long-standing NLP task that consists of identifying entities in a sentence or document and classifying each of them into an entity type from a fixed type set. One of the most common and successful approaches to NER is to fine-tune pre-trained language models (LMs) on a human-annotated NER dataset with token-wise classification (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018, 2019; Devlin et al., 2019). Remarkably, NER models based on LM fine-tuning (Yamada et al., 2020; Li et al., 2020) already achieve over 90% F1 score on standard NER datasets such as CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes5 (Hovy et al., 2006). However, NER is far from being solved: specialized domains such as financial news (Salinas Alvarado et al., 2015), biochemical (Collier and Kim, 2004), or biomedical (Wei et al., 2015; Li et al., 2016) text still pose additional challenges (Ushio and Camacho-Collados, 2021). Lower performance in these domains may be attributed to various factors, such as the use of domain-specific terminology that LMs have not seen during pre-training (Lee et al., 2020).
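To make "token-wise classification" concrete, the sketch below converts entity spans into one label per token. The IOB2 tagging scheme, the function name `spans_to_iob2`, and the toy sentence are assumptions made for illustration; the text above only states that each token is classified.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, entity_type); a hypothetical span format.
Span = Tuple[int, int, str]


def spans_to_iob2(tokens: List[str], spans: List[Span]) -> List[str]:
    """Turn entity spans into per-token labels (IOB2 scheme, assumed here)."""
    labels = ["O"] * len(tokens)  # "O" marks tokens outside any entity
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"  # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"  # remaining tokens of the entity
    return labels


tokens = ["Herbie", "Hancock", "plays", "in", "Cardiff"]
spans = [(0, 2, "person"), (4, 5, "location")]
labels = spans_to_iob2(tokens, spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A fine-tuned LM would then predict one such label for every token, and the predicted labels are decoded back into entity spans.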
Among recent studies, social media has been acknowledged as one of the most challenging domains for NER (Derczynski et al., 2016, 2017). Social media texts are generally noisier and less formal than conventional written language, in addition to having a specific vocabulary. Social media has another particular feature that needs to be addressed: the presence of (quick) temporal shifts in text semantics (Rijhwani and Preotiuc-Pietro, 2020), where the meaning of words is constantly changing or evolving over time. This is a general issue with language models (Lazaridou et al., 2021), but it is especially relevant given the dynamic landscape and immediacy present in social media (Del Tredici et al., 2019). There have been a few specific approaches to deal with temporal shifts in social media. For instance, Loureiro et al. (2022) addressed this issue by pre-training language models on a large tweet collection from different time periods, highlighting the importance of having an up-to-date language model. Agarwal and Nenkova (2022) studied temporal shift in various NLP tasks including NER and analyzed methods to overcome it with strategies such as self-labeling.
arXiv:2210.03797v2 [cs.CL] 15 Nov 2022
In this paper, we propose a new NER dataset for Twitter (TweetNER7 henceforth). TweetNER7 contains tweets from diverse topics that are distributed uniformly from September 2019 to August 2021. It contains 11,382 annotated tweets in total, spanning seven entity types (person, location, corporation, creative work, group, product, and event). To the best of our knowledge, TweetNER7 is the largest Twitter NER dataset with a high coverage of entity types: TTC (Rijhwani and Preotiuc-Pietro, 2020) contains about the same number of annotations but covers only three entity types, while WNUT17 (Derczynski et al., 2017) has six entity types but only a small number of annotations. The tweets for TweetNER7 were collected by querying tweets with weekly trending keywords so that the tweet collection covers various topics within the period, and we further removed near-duplicate tweets and irrelevant tweets without any specific topic in order to improve the quality of the collection. We provide baseline results with language model fine-tuning that showcase the difficulty of TweetNER7, especially when dealing with time shifts. Finally, we provide a temporal analysis with different strategies including self-labeling, which does not prove highly beneficial in our context, and provide insights into the models' inner workings and potential biases.
2 Related Work
There is a large variety of NER datasets in the literature. CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes5 (Hovy et al., 2006) are widely used NER datasets whose texts are collected from public news, blogs, and dialogues. WikiAnn (Pan et al., 2017) and MultiNERD (Tedeschi and Navigli, 2022) are both multilingual NER datasets where the training set is constructed by distant supervision on Wikipedia and BabelNet. As far as domain-specific NER datasets are concerned, FIN (Salinas Alvarado et al., 2015) is an NER dataset of financial news, while BioNLP2004 (Collier and Kim, 2004) and BioCreative (Wei et al., 2015; Li et al., 2016) are both constructed from scientific documents of the biochemical and biomedical domains. However, none of these datasets address the challenges posed by the social media domain.
In the social media domain, the pioneering Broad Twitter Corpus (BTC) NER dataset (Derczynski et al., 2016) included users with different demographics with the aim of investigating spatial and temporal shifts of semantics in NER. More recently, the test set of WNUT2017 (Derczynski et al., 2017) contained entities unseen in the training set, drawn from broader social media including Twitter, Reddit, YouTube, and StackExchange. The recent TweeBankNER dataset (Jiang et al., 2022) annotated TweeBank (Liu et al., 2018) with entity labels to investigate the interaction between syntax and NER. The most similar dataset to ours is the Temporal Twitter Corpus (TTC) NER dataset (Rijhwani and Preotiuc-Pietro, 2020), which was also aimed at analysing temporal effects on NER in social media. For this dataset, 2,000 tweets from every year from 2014 to 2019 were annotated. In general, however, these social media datasets suffer from limited data, non-uniform distribution over time, or limited entity types (see Subsection 3.3 for more details). In this paper, we contribute a new NER dataset (TweetNER7) based on recent data until 2021, which is specifically designed to analyze temporal shifts in social media.
3 TweetNER7: Dataset Construction, Statistics and Baselines
In this section, we present our time-aware NER dataset built from publicly available tweets with seven general entity types, which we refer to as TweetNER7. In the following subsections, we describe the data collection (Subsection 3.1) and annotation (Subsection 3.2) processes. We also share relevant statistics (Subsection 3.3) and baseline results (Subsection 3.4) for our dataset.
3.1 Data Collection
This NER dataset annotates a tweet collection similar to the one used to construct TweetTopic (Antypas et al., 2022). The main data consists of tweets from September 2019 to August 2021, with roughly the same number of tweets in each month. This collection period makes it suitable for our purpose of evaluating short-term temporal shift of NER on Twitter. The original tweets were filtered by leveraging weekly trending topics as well as various other types of filtering (see Antypas et al. (2022) for more details on the collection and filtering process). The collected tweets were then split into two periods: September 2019 to August 2020 (2020-set) and September 2020 to August 2021 (2021-set).
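The period split described above can be sketched as a simple date check. The function name `assign_period` and the exact boundary days (first and last day of each month) are assumptions; the text only gives the months.

```python
from datetime import date
from typing import Optional

# Assumed inclusive boundaries; the source states months, not days.
PERIODS = {
    "2020-set": (date(2019, 9, 1), date(2020, 8, 31)),
    "2021-set": (date(2020, 9, 1), date(2021, 8, 31)),
}


def assign_period(posted_on: date) -> Optional[str]:
    """Return the period a tweet date falls into, or None if out of range."""
    for name, (first_day, last_day) in PERIODS.items():
        if first_day <= posted_on <= last_day:
            return name
    return None


print(assign_period(date(2020, 3, 15)))
print(assign_period(date(2021, 1, 2)))
```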
3.2 Dataset Annotation
Annotation. To obtain named-entity annotations over the tweets, we conducted a manual annotation on Amazon Mechanical Turk with the interface shown in Figure 1. We split tweets into two periods, September 2019 to August 2020 (2020-set) and September 2020 to August 2021 (2021-set), and randomly sampled 6,000 tweets from each period. Each tweet was annotated by three annotators, yielding 36,000 annotations in total. As entity types, we employed seven labels: person, location, corporation, creative work, group, product, and event. We followed Derczynski et al. (2017) for the selection of the first six labels, and additionally included event, as we found a large number of event entities in our collected tweets.
Pre-processing. We pre-process tweets before the annotation to normalize some artifacts, converting URLs into a special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we replace the mention with the account's display name wrapped in the symbols {@ and @}. For example, the tweet

    Get the all-analog Classic Vinyl Edition
    of "Takin'Off" Album from @herbiehancock
    via @bluenoterecords link below:
    http://bluenote.lnk.to/AlbumOfTheWeek

is transformed into the following text:

    Get the all-analog Classic Vinyl Edition
    of "Takin'Off" Album from {@Herbie Hancock@}
    via {{USERNAME}} link below: {{URL}}

We ask annotators to ignore those special tokens but label the verified users’ mentions.
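The normalization above can be approximated with two regular-expression passes. This is a minimal sketch, not the authors' script: the regular expressions, the function name `normalize_tweet`, and the `verified` lookup table (handle to display name) are assumptions, since the text does not say how verified accounts were identified.

```python
import re
from typing import Dict

URL_RE = re.compile(r"https?://\S+")  # assumed URL pattern
MENTION_RE = re.compile(r"@(\w+)")    # assumed mention pattern


def normalize_tweet(text: str, verified: Dict[str, str]) -> str:
    """Replace URLs and mentions with the special tokens described above.

    `verified` maps lowercase handles of verified accounts to display names.
    """
    text = URL_RE.sub("{{URL}}", text)

    def replace_mention(match: "re.Match[str]") -> str:
        display_name = verified.get(match.group(1).lower())
        if display_name is None:
            return "{{USERNAME}}"              # non-verified account
        return "{@" + display_name + "@}"      # verified account

    return MENTION_RE.sub(replace_mention, text)


raw = (
    'Get the all-analog Classic Vinyl Edition of "Takin\'Off" Album from '
    "@herbiehancock via @bluenoterecords link below: "
    "http://bluenote.lnk.to/AlbumOfTheWeek"
)
normalized = normalize_tweet(raw, {"herbiehancock": "Herbie Hancock"})
print(normalized)
```

URLs are replaced first so that any `@` inside a link is never treated as a mention.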
Quality Control. Since we have three annotations per tweet, we control the quality of the annotation by taking the agreement into account. We disregard an annotation if the agreement is 1/3, and manually validate it if the agreement is 2/3, which happens for roughly half of the instances.
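The agreement rule can be sketched as a vote count over the three annotators. Several details here are assumptions: annotations are modelled as exact-match `(start, end, type)` spans, the function name `triage_annotations` is hypothetical, and accepting a span with 3/3 agreement without review is implied by the text rather than stated.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type), assumed format


def triage_annotations(per_annotator: List[Set[Span]]) -> Dict[str, List[Span]]:
    """Bucket each span by how many of the three annotators marked it."""
    assert len(per_annotator) == 3, "the rule is defined for three annotators"
    votes = Counter(span for spans in per_annotator for span in spans)
    buckets: Dict[str, List[Span]] = {"accept": [], "review": [], "discard": []}
    for span, count in sorted(votes.items()):
        if count == 3:
            buckets["accept"].append(span)   # full agreement (assumed accepted)
        elif count == 2:
            buckets["review"].append(span)   # 2/3: manual validation
        else:
            buckets["discard"].append(span)  # 1/3: disregarded
    return buckets


example = [
    {(0, 2, "person"), (4, 5, "location")},
    {(0, 2, "person"), (4, 5, "location")},
    {(0, 2, "person"), (3, 5, "location")},
]
result = triage_annotations(example)
print(result)
```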
3.3 Statistics
This subsection provides a statistical analysis of (i) our dataset, (ii) our dataset in comparison with other Twitter NER datasets, and (iii) our dataset distribution over time.
Statistics of TweetNER7. TweetNER7 contains 5,768 and 5,614 tweets annotated in the 2020 and 2021 periods, respectively, which are then split into training / validation / test sets for each year. Since the 2020-set is for model development, we use 80% of it as the training set and 10% each for the validation and test sets. Meanwhile, the 2021-set is mainly devised for model evaluation to measure temporal adaptability, so we take the largest share of the 2021-set (50%) as the test set and split the rest into training and validation sets with the same training-to-validation ratio as in the 2020-set.

Period                      2020-set                2021-set
Split                  Train   Valid    Test   Train   Valid    Test
Number of Entities
- corporation          1,700     203     191     902     102     900
- creative work        1,661     208     179     690      74     731
- event                2,242     256     265     968     131   1,097
- group                2,242     227     311   1,313     227   1,516
- location             1,259     181     165     697      72     716
- person               4,666     598     596   2,362     283   2,712
- product              1,850     241     220     926     111     972
All                   15,620   1,914   1,927   8,864   1,000   8,644
Entity Diversity (%)
- corporation           69.9    92.6    90.1    72.1    85.3    74.3
- creative work         80.1    92.8    91.6    89.0    93.2    91.0
- event                 71.1    90.6    84.2    75.9    89.3    70.9
- group                 66.7    86.8    81.7    66.0    86.3    66.2
- location              66.4    80.7    81.2    67.9    88.9    64.9
- person                68.4    85.6    83.6    77.3    90.1    77.7
- product               56.2    71.4    76.4    60.3    79.3    56.6
Number of Tweets       4,616     576     576   2,495     310   2,807

Table 1: Number of entities, tweets, and entity diversity in each data split and period, where the 2020-set is from September 2019 to August 2020, while the 2021-set is from September 2020 to August 2021.
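The split ratios can be sketched as follows. This is an illustration under stated assumptions: the function name `split_period`, the random shuffling, the seed, and the rounding are all hypothetical, so the sizes it produces need not match the published counts exactly.

```python
import random
from typing import Dict, List, Sequence


def split_period(
    items: Sequence[str],
    test_fraction: float,
    train_to_valid: float = 8.0,  # 80% : 10% in the 2020-set gives 8 : 1
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Hold out `test_fraction` for testing, split the rest train:valid."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    n_test = round(len(shuffled) * test_fraction)
    n_rest = len(shuffled) - n_test
    n_valid = round(n_rest / (train_to_valid + 1))
    n_train = n_rest - n_valid
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


tweet_ids = [f"tweet-{i}" for i in range(1000)]  # toy identifiers
split_2020 = split_period(tweet_ids, test_fraction=0.1)  # 80 / 10 / 10
split_2021 = split_period(tweet_ids, test_fraction=0.5)  # half held out
print({name: len(part) for name, part in split_2020.items()})
print({name: len(part) for name, part in split_2021.items()})
```

Keeping the 8:1 training-to-validation ratio in both periods means the 2021-set training and validation sets are scaled-down versions of the 2020-set ones, while most of the 2021 data is reserved for evaluation.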
Table 1 summarizes the number of entities as well as the number of instances in each subset of TweetNER7. We can observe a large gap between frequent entity types such as person and rare entity types such as location, while the distribution of the entities is roughly balanced across subsets. We also report entity diversity, which we define as the percentage of unique entities with respect to the total number of entities. Entity types such as product contain a relatively large number of duplicates (ranging between 56.2% and 76.4% entity diversity scores), while other types such as creative work are more diverse (ranging between 80.1% and 93.2%).
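Entity diversity as defined above is the number of unique entities divided by the total number of entity mentions, expressed as a percentage. A minimal sketch follows; treating two mentions as the same entity only when their surface strings match exactly is an assumption, as is the function name `entity_diversity`.

```python
from typing import Sequence


def entity_diversity(mentions: Sequence[str]) -> float:
    """Percentage of unique entities among all entity mentions."""
    if not mentions:
        return 0.0  # convention chosen here for an empty subset
    return 100.0 * len(set(mentions)) / len(mentions)


# Toy example: four product mentions, three of them distinct.
product_mentions = ["iPhone", "iPhone", "Pixel", "iPad"]
diversity = entity_diversity(product_mentions)
print(f"{diversity:.1f}")
```

A lower score means more repeated entities, which is why frequently re-mentioned types such as product score lower than creative work.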
Comparison with other Twitter NER Datasets. In Table 2, we compare TweetNER7 against existing NER datasets for Twitter, which highlights the large number of annotations of TweetNER7 for our covered period. TweetNER7 and TTC are the overall largest datasets with more than 10k annotations, but TTC covers only three entity types, which may be insufficient for certain practical use cases given the diversity of text in the social media context (Derczynski et al., 2017). In contrast, TweetNER7 has the highest coverage of entity types among all