
3.2 Dataset Annotation
Annotation.
To obtain named-entity annotations over the tweets, we conducted a manual annotation on Amazon Mechanical Turk with the interface shown in Figure 1. We split the tweets into two periods, September 2019 to August 2020 (2020-set) and September 2020 to August 2021 (2021-set), and randomly sampled 6,000 tweets from each period. Each tweet was annotated by three annotators, yielding 36,000 annotations in total. As entity types, we employed seven labels: person, location, corporation, creative work, group, product, and event. We followed Derczynski et al. (2017) for the selection of the first six labels, and additionally included event, as we found a large number of event entities in our collected tweets.
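For concreteness, these seven types can be turned into a sequence-labeling label set. The snippet below is a minimal sketch assuming an IOB2 encoding; the encoding and the underscored type names are our choices for illustration, not details specified here.

```python
# Minimal sketch: build an IOB2 label set over the seven TweetNER7 entity types.
# The IOB2 scheme itself is an assumption made for illustration.
ENTITY_TYPES = [
    "person", "location", "corporation", "creative_work",
    "group", "product", "event",
]

labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}  # 15 labels in total
```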
Pre-processing.
We pre-process tweets before the annotation to normalize some artifacts, converting URLs into a special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we replace the mention with the account's display name wrapped in the symbols {@ and @}. For example, the tweet
Get the all-analog Classic Vinyl Edition
of "Takin'Off" Album from @herbiehancock
via @bluenoterecords link below:
http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition
of "Takin'Off" Album from {@Herbie Hancock@}
via {{USERNAME}} link below: {{URL}}
We ask annotators to ignore those special tokens
but label the verified users’ mentions.
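This normalization can be approximated with a few regular-expression substitutions. The snippet below is an illustrative sketch rather than the exact pipeline used for the dataset; in particular, the verified_users mapping from handles to display names is a hypothetical input.

```python
import re

def normalize_tweet(text: str, verified_users: dict) -> str:
    """Approximate the pre-processing: mask URLs and non-verified user
    mentions, and wrap verified users' display names in {@ ... @}."""
    # Replace URLs with the special token {{URL}}.
    text = re.sub(r"https?://\S+", "{{URL}}", text)

    def replace_mention(match: re.Match) -> str:
        handle = match.group(1).lower()
        if handle in verified_users:
            # Verified account: substitute its display name wrapped in {@ ... @}.
            return "{@" + verified_users[handle] + "@}"
        return "{{USERNAME}}"  # non-verified account

    return re.sub(r"@(\w+)", replace_mention, text)

# Reproduces the example above (the display name lookup is assumed).
tweet = ('Get the all-analog Classic Vinyl Edition of "Takin\'Off" Album from '
         "@herbiehancock via @bluenoterecords link below: "
         "http://bluenote.lnk.to/AlbumOfTheWeek")
print(normalize_tweet(tweet, {"herbiehancock": "Herbie Hancock"}))
```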
Quality Control.
Since we have three annotations per tweet, we control annotation quality by taking inter-annotator agreement into account: we discard an annotation if the agreement is 1/3, and manually validate it if the agreement is 2/3, which happens for roughly half of the instances.
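As a rough sketch of this agreement-based filtering (the exact unit of agreement, e.g. span-level exact match, is our assumption for illustration):

```python
from collections import Counter

def triage_annotation(judgments: list) -> str:
    """Decide what to do with one candidate entity span given the three
    annotators' judgments (None means the annotator did not mark the span).

    Agreement of 3/3 -> keep, 2/3 -> manual validation, 1/3 -> discard.
    """
    votes = Counter(j for j in judgments if j is not None)
    if not votes:
        return "discard"
    _, count = votes.most_common(1)[0]
    if count == 3:
        return "keep"
    if count == 2:
        return "validate"  # roughly half of the instances fall into this case
    return "discard"       # only one annotator marked the span (1/3 agreement)

# Two of the three annotators tagged the span as a person entity.
print(triage_annotation(["person", "person", None]))  # -> "validate"
```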
3.3 Statistics
This subsection provides a statistical analysis of (i) our dataset, (ii) our dataset in comparison with other Twitter NER datasets, and (iii) our dataset distribution over time.
Statistics of TweetNER7.
TweetNER7 contains 5,768 and 5,614 annotated tweets in the 2020 and 2021 periods, respectively, which are then split into training / validation / test sets for each year. Since the 2020-set is intended for model development, we use 80% of it as the training set and 10% each for the validation and test sets.
Period                   2020-set                2021-set
Split                 Train   Valid   Test    Train   Valid   Test
Number of Entities
 - corporation        1,700     203     191     902     102     900
 - creative work      1,661     208     179     690      74     731
 - event              2,242     256     265     968     131   1,097
 - group              2,242     227     311   1,313     227   1,516
 - location           1,259     181     165     697      72     716
 - person             4,666     598     596   2,362     283   2,712
 - product            1,850     241     220     926     111     972
 All                 15,620   1,914   1,927   8,864   1,000   8,644
Entity Diversity (%)
 - corporation         69.9    92.6    90.1    72.1    85.3    74.3
 - creative work       80.1    92.8    91.6    89.0    93.2    91.0
 - event               71.1    90.6    84.2    75.9    89.3    70.9
 - group               66.7    86.8    81.7    66.0    86.3    66.2
 - location            66.4    80.7    81.2    67.9    88.9    64.9
 - person              68.4    85.6    83.6    77.3    90.1    77.7
 - product             56.2    71.4    76.4    60.3    79.3    56.6
Number of Tweets      4,616     576     576   2,495     310   2,807
Table 1: Number of entities, tweets, and entity diversity
in each data split and period, where the 2020-set is from
September 2019 to August 2020, while the 2021-set is
from September 2020 to August 2021.
Meanwhile, the 2021-set is mainly devised for model evaluation, i.e. to measure temporal adaptability, so we take half of the 2021-set (50%) as the test set and split the rest into training and validation sets with the same ratio as in the 2020-set.
Table 1 summarizes the number of entities as well as the number of instances in each subset of TweetNER7. We can observe a large gap between frequent entity types such as person and rarer entity types such as location, while the distribution of entities is roughly balanced across subsets. We also report entity diversity, which we define as the percentage of unique entities with respect to the total number of entities. Entity types such as product contain a relatively large number of duplicates (ranging between 56.2% and 76.4% entity diversity), while other types such as creative work are more diverse (ranging between 80.1% and 93.2%).
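The entity diversity score follows directly from this definition; the sketch below illustrates it (whether surface forms are case-folded before counting duplicates is our assumption):

```python
def entity_diversity(entity_surfaces: list) -> float:
    """Percentage of unique entity surface forms over all entity mentions."""
    if not entity_surfaces:
        return 0.0
    unique = {e.lower() for e in entity_surfaces}  # case-folding is an assumption
    return 100.0 * len(unique) / len(entity_surfaces)

# e.g. 3 unique surface forms among 4 product mentions -> 75.0
print(entity_diversity(["iPhone 12", "PS5", "iphone 12", "Galaxy S21"]))
```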
Comparison with other Twitter NER Datasets.
In Table 2, we compare TweetNER7 against existing NER datasets for Twitter, which highlights the large number of annotations in TweetNER7 for the period it covers. TweetNER7 and TTC are the largest datasets overall with more than 10k annotations, but TTC covers only three entity types, which may be insufficient for certain practical use cases given the diversity of text in social media contexts (Derczynski et al., 2017). In contrast, TweetNER7
has the highest coverage of entity types among all