ment (Welch et al., 2022b).
Efstathiadis et al. (2021) examined the classifica-
tion of verdicts at both the post and comment levels,
finding that posts were more difficult to classify.
Botzer et al. (2022) also constructed a classifier
to predict the verdict given the text from a com-
ment and used it to study the behavior of users in
different subreddits. De Candia (2021) found that
the subreddits where a user has previously posted
can help predict how they will assign judgements.
The author manually categorized posts into five
categories: family, friendships, work, society, and
romantic relationships. They found that posts about
society, defined as “any situation concerning pol-
itics, racism or gender questions,” were the most
controversial. Several works have also examined how the demographic factors or framing of posts affect the judgements received (Zhou et al., 2021; De Candia, 2021; Botzer et al., 2022).
3.2 Personalization
Many different approaches and tasks have used some form of personalization. These methods draw on demographic factors (Hovy, 2015), personality traits (Lynn et al., 2017), extra-linguistic information such as context or community factors (Bamman and Smith, 2015), or previously written text. Personalization resembles annotator modeling in that the most common approach appears to be the use of author IDs, which have been applied, for instance, to sentiment analysis (Mireshghallah et al., 2021), sarcasm detection (Kolchinski and Potts, 2018), and query auto-completion (Jaech and Ostendorf, 2018).
King and Cook (2020) evaluated methods of per-
sonalized language modeling, including priming,
interpolation, and fine-tuning of n-gram and neu-
ral language models. Wu et al. (2020) modeled
users by predicting their behaviors online. Sim-
ilarly, one’s use of language can be viewed as a
behavior. Welch et al. (2020b) modeled users by
learning separate embedding matrices for each user
in a shared embedding space. Welch et al. (2022a)
explored how to model users based on their simi-
larity to others. They used the perplexity of person-
alized models and the predictions of an authorship
attribution classifier to generate user representa-
tions. In social media in particular, a community
graph structure can be used to model relationships
between users and their linguistic patterns (Yang
and Eisenstein, 2017).
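As a concrete illustration of the ID-based personalization common to these works, the following minimal sketch conditions a classifier on a learned per-user embedding; the encoder, dimensions, and label set are illustrative assumptions rather than the architecture of any cited system.

```python
import torch
import torch.nn as nn

class UserConditionedClassifier(nn.Module):
    """Condition a text classifier on a learned per-user embedding.

    Illustrative sketch only: the encoder, dimensions, and label set are
    assumptions, not the architecture of any specific cited system.
    """

    def __init__(self, num_users, text_dim=768, user_dim=64, num_labels=2):
        super().__init__()
        # One learned vector per author ID, i.e. the ID itself is the feature.
        self.user_embedding = nn.Embedding(num_users, user_dim)
        self.classifier = nn.Linear(text_dim + user_dim, num_labels)

    def forward(self, text_repr, user_ids):
        # text_repr: (batch, text_dim) pooled output of any text encoder
        # user_ids:  (batch,) integer author IDs
        user_repr = self.user_embedding(user_ids)
        return self.classifier(torch.cat([text_repr, user_repr], dim=-1))
```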
3.3 Annotator Disagreement
There has been a shift toward viewing annotator disagreement as positive rather than negative (Aroyo and Welty, 2015). Disagreement between annotators is often resolved through majority voting (Nowak and Rüger, 2010); in some cases labels are averaged (Sabou et al., 2014), or disagreements are resolved through adjudication (Waseem and Hovy, 2016). Majority voting, the most common approach, takes away the voice of underrepresented groups among the annotators, for instance older crowd workers (Díaz et al., 2019), and aggregation in general obscures the causes of lower model performance and removes the perspectives of certain sociodemographic groups (Prabhakaran et al., 2021). On the other hand, Geva et al. (2019) use annotator identifiers as features during training to improve model performance. They note that annotator bias is a factor that needs additional thought when creating a dataset.
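To make the aggregation schemes described above concrete, and to show how they discard minority votes, here is a minimal Python sketch; the verdict labels are only examples.

```python
from collections import Counter

def majority_vote(labels):
    """Resolve disagreement by majority vote; ties are broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

def average_label(scores):
    """Resolve disagreement over numeric ratings by averaging instead."""
    return sum(scores) / len(scores)

# Three annotators disagree; aggregation keeps only the majority view.
print(majority_vote(["YTA", "NTA", "NTA"]))  # -> "NTA"
print(average_label([1, 0, 0]))              # -> 0.333...
```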
Fornaciari et al. (2021) predict soft labels for each annotator to model disagreement, which mitigates overfitting and improves performance on aggregated labels across tasks, including less subjective tasks like part-of-speech tagging. Davani et al. (2021) developed a multi-task model that predicts every annotator's judgement, finding that this achieves performance similar to or better than models trained on majority-vote labels. They note that a model that predicts multiple labels can also be used to measure uncertainty. They experiment with two datasets that have fewer than a hundred annotators each, which allows them to model all annotators, though they note that training their model on corpora with thousands of annotators, like ours, is not computationally viable.
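The general setup described here, one output head per annotator over a shared text representation, can be sketched as follows. This is an assumption-laden illustration rather than the specific architecture of Davani et al. (2021); it also makes clear why the number of heads becomes impractical with thousands of annotators.

```python
import torch
import torch.nn as nn

class PerAnnotatorHeads(nn.Module):
    """One classification head per annotator over a shared text representation.

    Illustrative sketch with assumed dimensions and a fixed annotator set;
    the number of heads is what makes thousands of annotators impractical.
    """

    def __init__(self, num_annotators, text_dim=768, num_labels=2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(text_dim, num_labels) for _ in range(num_annotators)]
        )

    def forward(self, text_repr):
        # text_repr: (batch, text_dim) -> logits: (batch, num_annotators, num_labels)
        return torch.stack([head(text_repr) for head in self.heads], dim=1)

    def disagreement(self, text_repr):
        # Variance of predicted label probabilities across annotator heads,
        # usable as a per-item proxy for uncertainty/disagreement.
        probs = self.forward(text_repr).softmax(dim=-1)
        return probs.var(dim=1).mean(dim=-1)
```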
Most work models annotators using their IDs only. Basile et al. (2021a) have called for extra information about annotators to be taken into account. Some annotation tasks have collected demographic information about annotators (e.g., Sap et al., 2021) or used annotator confidence as extra information (Cabitza et al., 2020).
4 Dataset
We use the dataset of Welch et al. (2022b), who
collected data from Reddit, an online platform with
many separate, focused communities called subred-
dits. The data is from the AITA subreddit, where
members describe a social situation they are in-
volved in, and ask members of the community for