
Intent | Policy Definition | Examples and Slot Annotations
Dehumanisation | Posts containing _dehumanising comparisons_ targeted to an individual or group based on their protected characteristics violate the policy. | Immigrants are parasites [Target Immigrants],[ProtectedCharacteristic Immigrants],[DehumanisingComparison are parasites]
Threatening | Posts containing _threatening language_ _targeted_ to an individual or group based on their _protected characteristics_ violate the policy. | "Women should be shot? Agreed!"* [Target Women],[ProtectedCharacteristic Women],[ThreateningSpeech should be shot]
Derogation | Posts containing _derogatory terms or insults_ _targeted_ to an individual or group based on their _protected characteristics_ violate the policy. | "Truly, I have been to Poland and there’s nothing to see there. Polish people are so nasty as well." [Target Polish people],[ProtectedCharacteristic Polish],[DerogatoryOpinion are so nasty]
Animosity | Posts containing _implicit abusive language_ _targeted_ to an individual or group based on their _protected characteristics_ violate the policy. | "I don’t want to have to be around a breastfeeding woman, if she wants to do that she can do it at home" [Target a breastfeeding woman],[ProtectedCharacteristic she],[NegativeOpinion can do it at home]
Pro Hate Crime | Posts that glorify, _support_ or deny _hateful_ actions, events, organisations and _individuals_. | "Hitler was right all along. We are witnessing it at home EVERY day." [HateEntity Hitler],[Support was right all along]

Table 2: Definition of policy guidelines, intents, and slots associated with them. Example posts and their annotations. Wording in the guidelines which is mapped onto slots is underlined.
veal that they are good at exploiting spurious correlations in the data but unreliable in more realistic scenarios (Calabrese et al., 2021; Röttger et al., 2021). Although explainability is considered a critical capability (Mishra et al., 2019) in the context of abuse detection, to our knowledge, Sarwar et al. (2022) represent the only explainable approach. Their model justifies its predictions by returning the k nearest neighbours that determined the classification outcome. However, such “explanations” may not be easily understandable to humans, who are less skilled at detecting patterns than transformers (Vaswani et al., 2017).
In our work, we formalize the problem of policy-aware abuse detection as an instance of intent classification and slot filling (ICSF), where slots are properties like “target” and “protected characteristic” and intents are policy rules or guidelines (e.g., “dehumanisation”). While Ahmad et al. (2021) use ICSF to parse and explain the content of a privacy policy, we are not aware of any work that infers policy violations in text with ICSF. State-of-the-art models developed for ICSF are sequence-to-sequence transformers built on top of pretrained architectures like BART (Aghajanyan et al., 2020), and also represent the starting point for our modeling approach.
3 Problem Formulation
Given a policy for the moderation of abusive content, and a post $p$, our task is to decide whether $p$ is abusive. We further note that policies are often expressed as a set of guidelines $R = \{r_1, r_2, \dots, r_N\}$, as shown in Table 2, and a post $p$ is abusive when its content violates any $r_i \in R$. Aside from deciding whether a guideline has been violated, we also expect our model to return a human-readable explanation which should be specific to $p$ (i.e., an extract from the policy describing the guideline being violated is not an explanation), since customised explanations can help moderators make more informed decisions, and developers better understand model behaviour.
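
The decision rule above (a post is abusive when it violates at least one guideline in R) can be illustrated with a minimal Python sketch. The `Guideline` class, the `violates` callables, and the placeholder markers in `toy_policy` are hypothetical stand-ins introduced only for illustration; they are not the detection model described in this paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Guideline:
    """One policy guideline r_i, paired with a hypothetical violation detector."""
    name: str
    violates: Callable[[str], bool]  # assumption: any function post -> bool


def violated_guidelines(post: str, policy: List[Guideline]) -> List[str]:
    """Return the names of all guidelines r_i in R that the post violates."""
    return [g.name for g in policy if g.violates(post)]


def is_abusive(post: str, policy: List[Guideline]) -> bool:
    """A post is abusive when its content violates any guideline in the policy."""
    return len(violated_guidelines(post, policy)) > 0


# Toy policy: each detector only looks for an artificial placeholder marker.
toy_policy = [
    Guideline("Dehumanisation", lambda p: "<dehumanising-comparison>" in p),
    Guideline("Threatening", lambda p: "<threat>" in p),
]
```

Returning the list of violated guideline names, rather than a single boolean, keeps the link between the decision and the guideline that triggered it. Such a list is still not an explanation in the sense required above, because it is not specific to the post.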
Intent Classification and Slot Filling. The generation of post-specific explanations requires detection systems to be able to reason over the content of the policy. To facilitate this process, we draw inspiration from previous work (Gupta et al., 2018) on intent classification and slot filling (ICSF), a task where systems have to classify the intent of a query (e.g., IN:CREATE_CALL for the query “Call John”) and fill the slots associated with it (e.g., “Call” is the filler for the slot SL:METHOD and “John” for SL:CONTACT). For our task, we decompose policies into a collection of intents corresponding to the guidelines mentioned above, and each intent is characterized by a set of properties, i.e., slots (see Table 2).
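
As an illustration of the ICSF output for the “Call John” query, the following Python sketch stores an intent together with its slot fillers. The `Frame` class and the bracketed string produced by `to_brackets` are assumptions made for this example; the exact serialisation used by ICSF systems may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Frame:
    """An intent plus its (slot name, filler) pairs; a hypothetical container."""
    intent: str
    slots: List[Tuple[str, str]] = field(default_factory=list)

    def to_brackets(self) -> str:
        """Serialise the frame as a bracketed string (assumed format)."""
        inner = " ".join(f"[{name} {filler}]" for name, filler in self.slots)
        return f"[{self.intent} {inner}]" if inner else f"[{self.intent}]"


# The query "Call John": intent IN:CREATE_CALL with two filled slots.
call = Frame("IN:CREATE_CALL", [("SL:METHOD", "Call"), ("SL:CONTACT", "John")])
print(call.to_brackets())  # [IN:CREATE_CALL [SL:METHOD Call] [SL:CONTACT John]]
```

The same container could hold a policy intent such as “Dehumanisation” with slots like Target and ProtectedCharacteristic, following the annotations in Table 2.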
The canonical output of ICSF systems is a tree structure. Multiple representations have been defined, each with a different trade-off between expressivity and ease of parsing. For our use case, we adopt the decoupled representation proposed in Aghajanyan et al. (2020): non-terminal nodes are either slots or intents, the root node is an intent, and terminal nodes are words attested in the post (see Figure 1). In this representation, it is not necessary for all input words to appear in the tree (i.e., in-order traversal of the tree cannot reconstruct the original utterance). Although this ultimately renders the parsing task harder, it is crucial for our domain where words can be associated with multiple slots or no slots, and reasoning over