lexical variation [7] of a banned word to avoid automatically being filtered. The moderators then have to update their filter by adding these variations [20]. While such false negatives are dealt with by updating the rules, false positives are more burdensome because they require moderators to manually reverse each issue [40]. In this study, we explore the effectiveness of a moderation support system that allows its users to predict the false positives and false negatives of a rule-based automated content moderation tool applied to the content of their own community, so that moderators can improve their rules to prevent future false positives and false negatives.
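As a toy illustration of this evasion-and-update cycle (our own sketch, not an artifact of the paper), a literal word filter misses a lexical variation until the moderator adds it to the rule:

```python
# Toy illustration (ours, not the paper's): a literal word filter misses a
# lexical variation until the moderator adds it to the rule.
banned_words = {"spam"}

def is_filtered(comment: str) -> bool:
    tokens = comment.lower().split()
    return any(word in tokens for word in banned_words)

print(is_filtered("this is spam"))  # True: caught by the rule
print(is_filtered("this is sp4m"))  # False: the variation evades the filter
banned_words.add("sp4m")            # moderator updates the filter
print(is_filtered("this is sp4m"))  # True after the update
```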
2.2 Designing a System for Content Moderation
In the context of online content moderation, many studies have introduced machine learning-based classifiers to detect harmful comments and malicious users in the online space. Such classifiers target cyberbullying [15], profanities and insults [43], pornographic content [41], hate speech [13], and abusive behaviors [10, 32]. As machine learning techniques evolve, recent studies have made classifiers multimodal and community-specific, so that the classifiers can reflect each community's preferences and culture. For example, Chancellor et al. [6] developed a multimodal classification model to detect images and text that promote eating disorders, which do not fall into the traditional categories of harmful
content. Furthermore, Chandrasekharan et al. [8] trained classifiers on both macro norms and community-specific norms to make the classifier
more suitable for each community. While previous studies have focused on supervised learning to classify behaviors generally considered harmful, our study adopts an embedding model pretrained on a large language corpus to find relevant comments from only a few examples that represent the individual moderator's intention. Our system combines the filtering results with the comments' semantic similarities to these examples to surface possible false positives and false negatives.
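As a concrete sketch of this combination (our own illustration; the model name, threshold, and helper structure are assumptions, not the paper's implementation), the following Python snippet cross-references a rule-based filter's decisions with embedding similarity to moderator-provided examples:

```python
# A minimal sketch (ours, not the paper's implementation) of surfacing
# possible false positives and false negatives by cross-referencing a
# rule-based filter's decisions with embedding similarity to a handful of
# moderator-provided example violations. Model and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed pretrained model

def flag_possible_errors(comments, filter_decisions, violation_examples,
                         threshold=0.5):
    """comments: list of comment strings.
    filter_decisions: list of bools; True if the rule filtered the comment.
    violation_examples: a few comments exemplifying what should be caught.
    """
    comment_emb = model.encode(comments, convert_to_tensor=True)
    example_emb = model.encode(violation_examples, convert_to_tensor=True)
    # For each comment, its highest similarity to any example violation.
    sims = util.cos_sim(comment_emb, example_emb).max(dim=1).values

    possible_fps, possible_fns = [], []
    for comment, filtered, sim in zip(comments, filter_decisions, sims):
        if filtered and sim < threshold:
            # Filtered, yet unlike every example: candidate false positive.
            possible_fps.append(comment)
        elif not filtered and sim >= threshold:
            # Unfiltered, yet close to an example: candidate false negative.
            possible_fns.append(comment)
    return possible_fps, possible_fns
```

Comments the rule removed but that sit far from every example surface as possible false positives; unfiltered comments close to an example surface as possible false negatives.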
Although there have been studies on algorithmic support for online content moderation, few have proposed visualizing the actual content of a community, such as its comments, to help configure automated rules and support the moderation process. CommentIQ [33] is an interactive visualization tool for online news comment moderators that helps find high-quality comments for readers. Users can filter comments by criteria, location, and time through brushing and linking on distribution visualizations. The system also lets users reflect their preferences for high-quality comments in the sorting order by setting weights for predefined criteria. Recently, FilterBuddy [21] introduced a tool that helps YouTube creators moderate the comments on their videos. Users can customize word filters to hide or remove comments containing specific words from existing filter lists, and the system uses existing comments to show what and how many comments would be filtered, helping users evaluate a filter's performance. In this work, we focus on designing a system that helps community moderators configure a rule-based automated tool supporting combinations of word filters to find posts that violate community rules. Our system shows the expected results of the configured tool on existing posts from a real community and visualizes the relationship between the posts and the configuration to help users analyze each filter.
2.3 Background: Reddit AutoModerator
Reddit AutoModerator is a rule-based automated moderation tool developed by one of the Reddit moderators, Chad Birch, in 2013 [20]. By configuring AutoModerator in YAML, Reddit moderators can create their own automated rules suited to each subreddit's preferences and culture.
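To illustrate the configuration format, the snippet below is a minimal rule of our own, written against the documented AutoModerator YAML syntax; the banned words and the action are placeholders, and real subreddit configurations are typically far more elaborate.

```yaml
# Hypothetical rule: remove any comment containing either placeholder word.
type: comment
body (includes-word): ["badword1", "badword2"]
action: remove
action_reason: "Matched banned-word filter"
```

A subreddit's AutoModerator page can hold many such rules, separated by `---` lines.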
In 2015, Reddit officially integrated AutoModerator into the platform as part of the default moderation tools. According to Reddit transparency reports [34], AutoModerator removed about 103.6M pieces of content in 2021, which is 20.9% more than in 2020 and accounts for 58.9% of all content removed by moderators.
AutoModerator acts on all the posts and comments in a subreddit according to the automated rules that a human moderator last saved. In other words, once moderators change their rules, AutoModerator