For generalization analyses, we sampled identity-specific corpora from datasets large
enough to contain a minimum number of hate speech instances directed at a sufficient
variety of identity groups (described in Section 4.1). These are the first 4 datasets noted
in Table 1. All datasets are used in the analysis of
removing dominant groups (Section 6.2).
Datasets are resampled to a 30/70 ratio of hate to non-hate to eliminate a source of
variance among hate speech datasets known to affect generalization (Swamy et al., 2019).
Non-hate instances are upsampled or downsampled to meet this ratio, which was chosen as
typical of hate speech datasets (Vidgen and Derczynski, 2020). If a dataset does not
already contain a binary hate speech label, its labels are binarized as described in
Appendix A.
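A minimal sketch of this resampling step, assuming each dataset is loaded as a pandas DataFrame with a binary hate column (the column name and function below are illustrative assumptions, not the code used in this work):

```python
import pandas as pd

def resample_to_ratio(df: pd.DataFrame, hate_frac: float = 0.3, seed: int = 0) -> pd.DataFrame:
    """Up- or downsample non-hate instances so hate makes up `hate_frac` of the corpus."""
    hate = df[df["hate"] == 1]
    non_hate = df[df["hate"] == 0]
    # Number of non-hate instances needed for a hate_frac / (1 - hate_frac) ratio.
    n_non_hate = int(len(hate) * (1 - hate_frac) / hate_frac)
    # Sample with replacement (upsampling) only when there are too few non-hate instances.
    sampled = non_hate.sample(
        n=n_non_hate, replace=len(non_hate) < n_non_hate, random_state=seed
    )
    return pd.concat([hate, sampled]).sample(frac=1, random_state=seed)
```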
3.1 Target identity label normalization
Annotations for targeted identities vary consider-
ably across datasets. Some of these differences
are variations in naming conventions for identity
groups with significant similarity (‘Caucasian’ and
‘white people’, for example). Other identities are
subsets of broader identities, such as ‘trans men’ as
a specific group within ‘LGBTQ+ people’.
To construct identity-based corpora across
datasets, we normalized and grouped identities an-
notated in each dataset. One of the authors, who has
taken graduate-level courses on language and iden-
tity, manually normalized the most common iden-
tity labels in each dataset and assigned these nor-
malized identity labels into broader identity groups
(such as ‘LGBTQ+ people’). Intersectional iden-
tities, such as ‘Chinese women’, were assigned to
multiple groups (in this case ‘Asian people’ and
‘women’). Hate speech was often directed at con-
flated, problematic groupings such as ‘Muslims and
Arabs’. Though we do not condone these group-
ings, we use them as the most accurate descriptors
of identities targeted.
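One way such a normalization mapping could be represented is sketched below; the label strings are illustrative examples, not the full mapping used in this work:

```python
# Normalized identity labels mapped to broader identity groups; entries are illustrative.
# Intersectional labels map to multiple groups; conflated annotations are kept as-is.
IDENTITY_GROUPS = {
    "caucasian": ["white people"],
    "white people": ["white people"],
    "trans men": ["LGBTQ+ people"],
    "chinese women": ["Asian people", "women"],
    "muslims and arabs": ["Muslims and Arabs"],
}

def normalize(raw_label: str) -> list[str]:
    """Map a raw dataset annotation to zero or more broader identity groups."""
    return IDENTITY_GROUPS.get(raw_label.strip().lower(), [])
```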
4 Cross-Identity Generalization
We examine variation among hate speech target-
ing different identities in a bottom-up, empirical
fashion. In order to do this, we construct corpora
of hate speech directed at the most commonly an-
notated target identities, grouped and normalized
as described in Section 3.1. We then train hate speech classifiers on each target identity
corpus and evaluate them on corpora targeting other identities.
Along with practical implications for hate speech
classification generalization, this analysis suggests
which similarities and differences among identities
are most relevant for differentiating hate speech.
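Conceptually, this reduces to a train/evaluate loop over identity corpora. The sketch below assumes hypothetical train_fn and eval_fn callables standing in for the classifier pipeline described in Section 4.2:

```python
from typing import Callable, Dict, Tuple

def cross_identity_scores(
    corpora: Dict[str, dict],   # identity -> {"train": ..., "test": ...}
    train_fn: Callable,         # fits a classifier on a train split
    eval_fn: Callable,          # returns a score (e.g., F1) on a test split
) -> Dict[Tuple[str, str], float]:
    """Train on each identity corpus and evaluate on every identity's test split."""
    scores = {}
    for train_id, corpus in corpora.items():
        model = train_fn(corpus["train"])
        for eval_id, other in corpora.items():
            scores[(train_id, eval_id)] = eval_fn(model, other["test"])
    return scores
```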
4.1 Data sampling
In order to have enough data targeting many iden-
tities and to generalize beyond the particularities
of specific datasets, we assembled identity-specific
corpora from multiple source datasets. To mitigate
dataset-specific effects, we uniformly sampled hate
speech instances directed toward target identities
from the first 4 datasets listed in Table 1. We selected these datasets because they
contain enough data to train classifiers for a sufficient variety of target identities.
The corpus for each target identity contains
an equal amount of hate speech drawn from each of
these datasets, though the total number of instances
may differ among corpora. Negative instances were
also uniformly sampled across datasets, and were
restricted to those which had no target identity an-
notation or an annotation that matched the target
identity of the hate speech.
We selected target identities that contained a
minimum of 900 instances labeled as hate across
these four datasets after grouping and normaliza-
tion. We selected this threshold as a balance be-
tween including a sufficient number of identities
and having enough examples of hate speech toward
each identity to train classifiers. In order to in-
clude a variety of identities in the analysis while
maintaining uniform samples for each dataset, we
upsampled identity-specific hate speech from individual datasets up to 2 times where
needed. Corpora were divided into 60/40 train/test splits. Selected target
identities and the size of each corpus can be found
in Table 2. These identity-specific corpora, which
are samples of existing publicly available datasets,
are available at https://osf.io/53tfs/.
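A rough sketch of this corpus assembly under illustrative assumptions: the column names, the per-dataset sample size rule, and the omission of negative-instance sampling are simplifications, not the exact procedure used in this work:

```python
import pandas as pd

MIN_HATE_INSTANCES = 900   # minimum hate instances across the four source datasets
MAX_UPSAMPLE = 2           # per-dataset upsampling cap

def build_identity_corpus(datasets: dict[str, pd.DataFrame], identity: str, seed: int = 0):
    """Assemble an identity-specific corpus with equal hate counts from each dataset.

    Assumes (illustratively) that each DataFrame has binary `hate` and normalized
    `target_group` columns; negative-instance sampling is omitted for brevity.
    """
    per_dataset = {
        name: df[(df["hate"] == 1) & (df["target_group"] == identity)]
        for name, df in datasets.items()
    }
    if sum(len(df) for df in per_dataset.values()) < MIN_HATE_INSTANCES:
        return None  # identity excluded from the analysis

    # Draw the same number of hate instances from each dataset, upsampling with
    # replacement by at most MAX_UPSAMPLE where a dataset has too few instances.
    n_per_dataset = min(MAX_UPSAMPLE * len(df) for df in per_dataset.values())
    corpus = pd.concat(
        [
            df.sample(n=n_per_dataset, replace=len(df) < n_per_dataset, random_state=seed)
            for df in per_dataset.values()
        ]
    ).reset_index(drop=True)

    train = corpus.sample(frac=0.6, random_state=seed)   # 60/40 train/test split
    test = corpus.drop(train.index)
    return train, test
```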
4.2 Cross-identity hate speech classification
Due to the high performance of BERT-based mod-
els on hate speech classification (Mozafari et al.,
2019; Samghabadi et al., 2020), we trained and evaluated a DistilBERT model (Sanh et al.,
2019), which has been shown to perform very similarly to BERT on hate speech detection
with fewer parameters (Vidgen et al., 2021). Models were trained
with early stopping after no improvement for 5
epochs on a development set of 10% of the training
set. An Adam optimizer was used with an initial
learning rate of $10^{-6}$. Input data was lowercased