
3.2 Dataset Annotation
Annotation.
To obtain named-entity annotations over the tweets, we conducted a manual annotation on Amazon Mechanical Turk with the interface shown in Figure 1. We split the tweets into two periods, September 2019 to August 2020 (2020-set) and September 2020 to August 2021 (2021-set), and randomly sampled 6,000 tweets from each period. Each tweet was annotated by three annotators, yielding 36,000 annotations in total. As entity types, we employed seven labels: person, location, corporation, creative work, group, product, and event. We followed Derczynski et al. (2017) for the selection of the first six labels, and additionally included event, as we found a large number of event entities in our collected tweets.
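For concreteness, these seven types can be turned into a sequence-labeling label set. The snippet below is a minimal sketch assuming an IOB2 encoding; the encoding and the underscored type names are our choices for illustration, not details specified here.

```python
# Minimal sketch: build an IOB2 label set over the seven TweetNER7 entity types.
# The IOB2 scheme itself is an assumption made for illustration.
ENTITY_TYPES = [
    "person", "location", "corporation", "creative_work",
    "group", "product", "event",
]

labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}  # 15 labels in total
```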
Pre-processing.
We pre-process tweets before the annotation to normalize some artifacts, converting URLs into a special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we replace the mention with the account's display name wrapped in the symbols {@ and @}. For example, the tweet
Get the all-analog Classic Vinyl Edition
of "Takin'Off" Album from @herbiehancock
via @bluenoterecords link below:
http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition
of "Takin'Off" Album from {@Herbie Hancock@}
via {{USERNAME}} link below: {{URL}}
We ask annotators to ignore those special tokens
but label the verified users’ mentions.
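This normalization can be approximated with a few regular-expression substitutions. The snippet below is an illustrative sketch rather than the exact pipeline used for the dataset; in particular, the verified_users mapping from handles to display names is a hypothetical input.

```python
import re

def normalize_tweet(text: str, verified_users: dict) -> str:
    """Approximate the pre-processing: mask URLs and non-verified user
    mentions, and wrap verified users' display names in {@ ... @}."""
    # Replace URLs with the special token {{URL}}.
    text = re.sub(r"https?://\S+", "{{URL}}", text)

    def replace_mention(match: re.Match) -> str:
        handle = match.group(1).lower()
        if handle in verified_users:
            # Verified account: substitute its display name wrapped in {@ ... @}.
            return "{@" + verified_users[handle] + "@}"
        return "{{USERNAME}}"  # non-verified account

    return re.sub(r"@(\w+)", replace_mention, text)

# Reproduces the example above (the display name lookup is assumed).
tweet = ('Get the all-analog Classic Vinyl Edition of "Takin\'Off" Album from '
         "@herbiehancock via @bluenoterecords link below: "
         "http://bluenote.lnk.to/AlbumOfTheWeek")
print(normalize_tweet(tweet, {"herbiehancock": "Herbie Hancock"}))
```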
Quality Control.
Since we have three annotations per tweet, we control annotation quality by taking inter-annotator agreement into account: we discard an annotation if the agreement is 1/3, and manually validate it if the agreement is 2/3, which happens for roughly half of the instances.
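As a rough sketch of this agreement-based filtering (the exact unit of agreement, e.g. span-level exact match, is our assumption for illustration):

```python
from collections import Counter

def triage_annotation(judgments: list) -> str:
    """Decide what to do with one candidate entity span given the three
    annotators' judgments (None means the annotator did not mark the span).

    Agreement of 3/3 -> keep, 2/3 -> manual validation, 1/3 -> discard.
    """
    votes = Counter(j for j in judgments if j is not None)
    if not votes:
        return "discard"
    _, count = votes.most_common(1)[0]
    if count == 3:
        return "keep"
    if count == 2:
        return "validate"  # roughly half of the instances fall into this case
    return "discard"       # only one annotator marked the span (1/3 agreement)

# Two of the three annotators tagged the span as a person entity.
print(triage_annotation(["person", "person", None]))  # -> "validate"
```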
3.3 Statistics
This subsection provides a statistical analysis of (i) our dataset, (ii) our dataset in comparison with other Twitter NER datasets, and (iii) our dataset distribution over time.
Statistics of TweetNER7.
TweetNER7 contains 5,768 and 5,614 annotated tweets in the 2020 and 2021 periods, respectively, which are then split into training / validation / test sets for each year. Since the 2020-set is intended for model development, we use 80% of it as the training set and 10% each for the validation and test sets.
Period                   2020-set                2021-set
Split                 Train   Valid   Test    Train   Valid   Test
Number of Entities
 - corporation        1,700     203     191     902     102     900
 - creative work      1,661     208     179     690      74     731
 - event              2,242     256     265     968     131   1,097
 - group              2,242     227     311   1,313     227   1,516
 - location           1,259     181     165     697      72     716
 - person             4,666     598     596   2,362     283   2,712
 - product            1,850     241     220     926     111     972
 All                 15,620   1,914   1,927   8,864   1,000   8,644
Entity Diversity (%)
 - corporation         69.9    92.6    90.1    72.1    85.3    74.3
 - creative work       80.1    92.8    91.6    89.0    93.2    91.0
 - event               71.1    90.6    84.2    75.9    89.3    70.9
 - group               66.7    86.8    81.7    66.0    86.3    66.2
 - location            66.4    80.7    81.2    67.9    88.9    64.9
 - person              68.4    85.6    83.6    77.3    90.1    77.7
 - product             56.2    71.4    76.4    60.3    79.3    56.6
Number of Tweets      4,616     576     576   2,495     310   2,807
Table 1: Number of entities, tweets, and entity diversity
in each data split and period, where the 2020-set is from
September 2019 to August 2020, while the 2021-set is
from September 2020 to August 2021.
Meanwhile, the 2021-set is mainly devised for model evaluation, i.e. to measure temporal adaptability, so we take half of the 2021-set (50%) as the test set and split the rest into training and validation sets with the same ratio as in the 2020-set.
Table 1 summarizes the number of entities as well as the number of instances in each subset of TweetNER7. We can observe a large gap between frequent entity types such as person and rarer entity types such as location, while the distribution of entities is roughly balanced across subsets. We also report entity diversity, which we define as the percentage of unique entities with respect to the total number of entities. Entity types such as product contain a relatively large number of duplicates (ranging between 56.2% and 76.4% entity diversity), while other types such as creative work are more diverse (ranging between 80.1% and 93.2%).
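The entity diversity score follows directly from this definition; the sketch below illustrates it (whether surface forms are case-folded before counting duplicates is our assumption):

```python
def entity_diversity(entity_surfaces: list) -> float:
    """Percentage of unique entity surface forms over all entity mentions."""
    if not entity_surfaces:
        return 0.0
    unique = {e.lower() for e in entity_surfaces}  # case-folding is an assumption
    return 100.0 * len(unique) / len(entity_surfaces)

# e.g. 3 unique surface forms among 4 product mentions -> 75.0
print(entity_diversity(["iPhone 12", "PS5", "iphone 12", "Galaxy S21"]))
```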
Comparison with other Twitter NER Datasets.
In Table 2, we compare TweetNER7 against existing NER datasets for Twitter, which highlights the large number of annotations in TweetNER7 for the period it covers. TweetNER7 and TTC are the largest datasets overall with more than 10k annotations, but TTC covers only three entity types, which may be insufficient for certain practical use cases given the diversity of text in social media contexts (Derczynski et al., 2017). In contrast, TweetNER7
has the highest coverage of entity types among all