Explainable Abuse Detection as Intent Classification and Slot Filling
Agostina Calabrese and Björn Ross and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
{a.calabrese@,b.ross@,mlap@inf.}ed.ac.uk
Abstract

To proactively offer social media users a safe online experience, there is a need for systems that can detect harmful posts and promptly alert platform moderators. In order to guarantee the enforcement of a consistent policy, moderators are provided with detailed guidelines. In contrast, most state-of-the-art models learn what abuse is from labelled examples and as a result base their predictions on spurious cues, such as the presence of group identifiers, which can be unreliable. In this work we introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone. We propose a machine-friendly representation of the policy that moderators wish to enforce, by breaking it down into a collection of intents and slots. We collect and annotate a dataset of 3,535 English posts with such slots, and show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.¹

¹ Accepted at TACL. Our code and data are available at https://github.com/Ago3/PLEAD.
1 Introduction

The central goal of online content moderation is to offer users a safer experience by taking actions against abusive behaviours, such as hate speech. Researchers have been developing supervised classifiers to detect hateful content, starting from a collection of posts known to be abusive and non-abusive. To successfully accomplish this task, models are expected to learn complex concepts from previously flagged examples. For example, hate speech has been defined as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender or sexual orientation” (Warner and Hirschberg, 2012), but there is no clear definition of what constitutes abusive speech.
Recent research (Dixon et al., 2018) has shown that supervised models fail to grasp these complexities; instead, they exploit spurious correlations in the data, becoming overly reliant on low-level lexical features and flagging posts because of, for instance, the presence of group identifiers alone (e.g., women or gay). Efforts to mitigate these problems focus on regularization, e.g., preventing the model from paying attention to group identifiers during training (Kennedy et al., 2020; Zhang et al., 2020); however, they do not seem effective at producing better classifiers (Calabrese et al., 2021). Social media companies, on the other hand, give moderators detailed guidelines to help them decide whether a post should be deleted, and these guidelines also help ensure consistency in their decisions (see Table 1). Models are not given access to these guidelines, and arguably this is the reason for many of their documented weaknesses.
Let us illustrate this with the following example. Assume we are shown two posts, the abusive “Immigrants are parasites” and the non-abusive “I love artists”, and are asked to judge whether a new post “Artists are parasites” is abusive. While the post is insulting, it does not contain hate speech, as professions are not usually protected, but we cannot know that without access to moderation guidelines. Based on these two posts alone, we might struggle to decide which label to assign. We are then given more examples, specifically the non-abusive “I hate artists” and the abusive “I hate immigrants”. In the absence of any other information, we would probably label the post “Artists are parasites” as non-abusive. The example highlights that 1) the current problem formulation (i.e., given post p and a collection of labelled examples C, decide whether p is abusive) is not adequate, since even humans would struggle to agree on the correct classification, and 2) relying on group identifiers is a natural consequence of the problem definition, and often not incorrect. Note that the difficulty does not arise due to the lack of data annotated with real moderator decisions, who would presumably be making labelling decisions according to the policy. Rather, models are not able to distinguish between necessary and sufficient conditions for making a decision based on examples alone (Balkir et al., 2022).
Post: Artists are parasites
Policy: Posts containing dehumanising comparisons targeted to a group based on their protected characteristics violate the policy. Protected characteristics include race, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, serious disease and immigration status.
Old Formulation: Is the post abusive?
Our Formulation: Does the post violate the policy?

Table 1: While it is hard to judge whether a post is abusive based solely on its content, taking the policy into account facilitates decision making. The example is based on the Facebook Community Standards.
In this work we depart from the common approach that aims to mitigate undesired model behaviour by adding artificial constraints (e.g., ignoring group identifiers when judging hate speech) and instead re-define the task through the concept of policy-awareness: given post p and policy P, decide whether p violates P. This entails models are given policy-related information in order to classify posts like “Artists are parasites”; e.g., they know that posts containing dehumanising comparisons targeted to a group based on their protected characteristics violate the policy, and that profession is not listed among the protected characteristics (see Table 1). To enable models to exploit the policy, we formalize the task as an instance of intent classification and slot filling and create a machine-friendly representation of a policy for hate speech by decomposing it into a collection of intents and corresponding slots. For instance, the policy in Table 1 expresses the intent “Dehumanisation” and has three slots: “target”, “protected characteristic”, and “dehumanising comparison”. All slots must be present for a post to violate a policy. Given this definition, the post in Table 1 contains a target (“Artists”) and a dehumanising comparison (“are parasites”) but does not violate the policy since it does not have a value for protected characteristic.
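To make the intent-and-slot reading of the policy concrete, here is a minimal sketch, assuming a dictionary of required slots per intent and a simple all-slots-present check; both the layout and the helper are illustrative and not the released PLEAD format or code.

```python
# Minimal sketch (assumed format): one guideline expressed as an intent with the
# slots that must all be filled for a post to violate it.
REQUIRED_SLOTS = {
    "Dehumanisation": {"Target", "ProtectedCharacteristic", "DehumanisingComparison"},
}

def violates(intent, filled_slots):
    """All slots of the intent must be present for the post to violate the policy."""
    return REQUIRED_SLOTS[intent] <= set(filled_slots)

# "Artists are parasites": a target and a comparison are found, but no protected
# characteristic, so the guideline is not violated.
frame = {"Target": "Artists", "DehumanisingComparison": "are parasites"}
print(violates("Dehumanisation", frame))  # False
```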
We create and make publicly available the Policy-aware Explainable Abuse Detection (PLEAD) dataset which contains (intent and slot) annotations for 3,535 abusive and non-abusive posts. To decide whether a post violates the policy and explain the decision, we design a sequence-to-sequence model that generates a structured representation of the input by first detecting and then filling slots. Intent is assigned deterministically based on the filled slots, leading to the final abusive/non-abusive classification. Experiments show our model is more reliable than classification-only approaches, as it delivers transparent predictions.
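As a rough illustration of the deterministic step from filled slots to intent, the sketch below matches a predicted slot set against per-intent slot signatures paraphrased from Table 2; the exact decision procedure here is an assumption for exposition, not the model's implementation.

```python
# Hypothetical slot signatures per intent, paraphrased from Table 2.
INTENT_SIGNATURES = {
    "Dehumanisation": {"Target", "ProtectedCharacteristic", "DehumanisingComparison"},
    "Threatening": {"Target", "ProtectedCharacteristic", "ThreateningSpeech"},
    "Derogation": {"Target", "ProtectedCharacteristic", "DerogatoryOpinion"},
    "Animosity": {"Target", "ProtectedCharacteristic", "NegativeOpinion"},
    "ProHateCrime": {"HateEntity", "Support"},
}

def assign_intent(predicted_slots):
    """Return the first intent whose required slots are all present, else None."""
    for intent, signature in INTENT_SIGNATURES.items():
        if signature <= set(predicted_slots):
            return intent
    return None  # no guideline matched -> post classified as non-abusive

print(assign_intent({"Target", "ProtectedCharacteristic", "ThreateningSpeech"}))  # Threatening
print(assign_intent({"Target", "NegativeOpinion"}))  # None
```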
2 Related Work

We use abuse as an umbrella term covering any kind of harmful content on the Web, as this is accepted practice in the field (Vidgen et al., 2019; Waseem et al., 2017). Abuse is hard to recognise, due to ambiguity in its definition and differences in annotator sensitivity (Ross et al., 2016). Recent research suggests embracing disagreements by developing multi-annotator architectures that capture differences in annotator perspective (Davani et al., 2022; Basile et al., 2021; Uma et al., 2021). While this approach better models how abuse is perceived, it is not suitable for content moderation, where one has to decide whether to remove a post and a prescriptive paradigm is preferable (Rottger et al., 2022).

Zufall et al. (2020) adopt a more objective approach, as they aim to detect content that is illegal according to EU legislation. However, as they explain, illegal content constitutes only a tiny portion of abusive content, and no explicit knowledge about the legal framework is provided to their model. The problem is framed as the combination of two binary tasks: whether a post contains a protected characteristic, and whether it incites violence. The authors also create a dataset which, however, is not publicly available.
Most existing work ignores these annotation difficulties and models abuse detection with transformer-based models (Vidgen et al., 2021b; Kennedy et al., 2020; Mozafari et al., 2019). Despite impressive F1-scores, these models are black-box and not very informative for moderators.
Intent: Dehumanisation
Policy Definition: Posts containing dehumanising comparisons targeted to an individual or group based on their protected characteristics violate the policy.
Example: Immigrants are parasites
Slot Annotations: [Target Immigrants], [ProtectedCharacteristic Immigrants], [DehumanisingComparison are parasites]

Intent: Threatening
Policy Definition: Posts containing threatening language targeted to an individual or group based on their protected characteristics violate the policy.
Example: "Women should be shot? Agreed!"*
Slot Annotations: [Target Women], [ProtectedCharacteristic Women], [ThreateningSpeech should be shot]

Intent: Derogation
Policy Definition: Posts containing derogatory terms or insults targeted to an individual or group based on their protected characteristics violate the policy.
Example: "Truly, I have been to Poland and there’s nothing to see there. Polish people are so nasty as well."
Slot Annotations: [Target Polish people], [ProtectedCharacteristic Polish], [DerogatoryOpinion are so nasty]

Intent: Animosity
Policy Definition: Posts containing implicit abusive language targeted to an individual or group based on their protected characteristics violate the policy.
Example: "I don’t want to have to be around a breastfeeding woman, if she wants to do that she can do it at home"
Slot Annotations: [Target a breastfeeding woman], [ProtectedCharacteristic she], [NegativeOpinion can do it at home]

Intent: Pro Hate Crime
Policy Definition: Posts that glorify, support or deny hateful actions, events, organisations and individuals.
Example: "Hitler was right all along. We are witnessing it at home EVERY day."
Slot Annotations: [HateEntity Hitler], [Support was right all along]

Table 2: Definition of policy guidelines, intents, and slots associated with them. Example posts and their annotations. Wording in the guidelines which is mapped onto slots is underlined.
Efforts to shed light on their behaviour reveal that they are good at exploiting spurious correlations in the data but unreliable in more realistic scenarios (Calabrese et al., 2021; Röttger et al., 2021). Although explainability is considered a critical capability (Mishra et al., 2019) in the context of abuse detection, to our knowledge, Sarwar et al. (2022) represent the only explainable approach. Their model justifies its predictions by returning the k nearest neighbours that determined the classification outcome. However, such “explanations” may not be easily understandable to humans, who are less skilled at detecting patterns than transformers (Vaswani et al., 2017).
In our work, we formalize the problem of policy-aware abuse detection as an instance of intent classification and slot filling (ICSF), where slots are properties like “target” and “protected characteristic” and intents are policy rules or guidelines (e.g., “dehumanisation”). While Ahmad et al. (2021) use ICSF to parse and explain the content of a privacy policy, we are not aware of any work that infers policy violations in text with ICSF. State-of-the-art models developed for ICSF are sequence-to-sequence transformers built on top of pretrained architectures like BART (Aghajanyan et al., 2020), and also represent the starting point for our modeling approach.
3 Problem Formulation

Given a policy for the moderation of abusive content, and a post p, our task is to decide whether p is abusive. We further note that policies are often expressed as a set of guidelines R = {r1, r2, ..., rN}, as shown in Table 2, and a post p is abusive when its content violates any ri ∈ R. Aside from deciding whether a guideline has been violated, we also expect our model to return a human-readable explanation which should be specific to p (i.e., an extract from the policy describing the guideline being violated is not an explanation), since customised explanations can help moderators make more informed decisions, and developers better understand model behaviour.
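To illustrate what "specific to p" means in practice, the sketch below renders a violated guideline and its slot fillers into an explanation sentence; the template and slot names are assumptions for exposition, not the explanations produced in our experiments.

```python
# Hypothetical template that turns a violated guideline and its slot fillers
# into a post-specific explanation (illustrative only).
def explain(intent, slots):
    if intent == "Dehumanisation":
        return (
            f"The post violates the '{intent}' guideline: it targets "
            f"'{slots['Target']}', identified by the protected characteristic "
            f"'{slots['ProtectedCharacteristic']}', with the dehumanising "
            f"comparison '{slots['DehumanisingComparison']}'."
        )
    return f"No template defined for intent '{intent}'."

print(explain("Dehumanisation", {
    "Target": "Immigrants",
    "ProtectedCharacteristic": "Immigrants",
    "DehumanisingComparison": "are parasites",
}))
```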
Intent Classification and Slot Filling The generation of post-specific explanations requires detection systems to be able to reason over the content of the policy. To facilitate this process, we draw inspiration from previous work (Gupta et al., 2018) on intent classification and slot filling (ICSF), a task where systems have to classify the intent of a query (e.g., IN:CREATE_CALL for the query “Call John”) and fill the slots associated with it (e.g., “Call” is the filler for the slot SL:METHOD and “John” for SL:CONTACT). For our task, we decompose policies into a collection of intents corresponding to the guidelines mentioned above, and each intent is characterized by a set of properties, i.e., slots (see Table 2).
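As a rough illustration, the frame for “Call John” can be serialized as a bracketed string in the style of TOP-like semantic parsing datasets; the exact serialization below is an assumption for exposition rather than the format used by any particular ICSF system.

```python
# Assumed bracketed serialization of an ICSF frame (illustrative format only).
def serialize(intent, slots):
    inner = " ".join(f"[SL:{name} {filler} ]" for name, filler in slots)
    return f"[IN:{intent} {inner} ]"

print(serialize("CREATE_CALL", [("METHOD", "Call"), ("CONTACT", "John")]))
# -> [IN:CREATE_CALL [SL:METHOD Call ] [SL:CONTACT John ] ]
```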
The canonical output of ICSF systems is a tree structure. Multiple representations have been defined, each with a different trade-off between expressivity and ease of parsing. For our use case, we adopt the decoupled representation proposed in Aghajanyan et al. (2020): non-terminal nodes are either slots or intents, the root node is an intent, and terminal nodes are words attested in the post (see Figure 1). In this representation, it is not necessary for all input words to appear in the tree (i.e., in-order traversal of the tree cannot reconstruct the original utterance). Although this ultimately renders the parsing task harder, it is crucial for our domain where words can be associated with multiple slots or no slots, and reasoning over …