Explainable Abuse Detection as Intent Classification and Slot Filling
Agostina Calabrese and Björn Ross and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
{a.calabrese@,b.ross@,mlap@inf.}ed.ac.uk
Abstract

To proactively offer social media users a safe online experience, there is a need for systems that can detect harmful posts and promptly alert platform moderators. In order to guarantee the enforcement of a consistent policy, moderators are provided with detailed guidelines. In contrast, most state-of-the-art models learn what abuse is from labelled examples and as a result base their predictions on spurious cues, such as the presence of group identifiers, which can be unreliable. In this work we introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone. We propose a machine-friendly representation of the policy that moderators wish to enforce, by breaking it down into a collection of intents and slots. We collect and annotate a dataset of 3,535 English posts with such slots, and show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.¹

¹ Accepted at TACL. Our code and data are available at https://github.com/Ago3/PLEAD.
1 Introduction

The central goal of online content moderation is to offer users a safer experience by taking actions against abusive behaviours, such as hate speech. Researchers have been developing supervised classifiers to detect hateful content, starting from a collection of posts known to be abusive and non-abusive. To successfully accomplish this task, models are expected to learn complex concepts from previously flagged examples. For example, hate speech has been defined as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender or sexual orientation” (Warner and Hirschberg, 2012), but there is no clear definition of what constitutes abusive speech.
Recent research (Dixon et al., 2018) has shown that supervised models fail to grasp these complexities; instead, they exploit spurious correlations in the data, becoming overly reliant on low-level lexical features and flagging posts because of, for instance, the presence of group identifiers alone (e.g., women or gay). Efforts to mitigate these problems focus on regularization, e.g., preventing the model from paying attention to group identifiers during training (Kennedy et al., 2020; Zhang et al., 2020); however, they do not seem effective at producing better classifiers (Calabrese et al., 2021). Social media companies, on the other hand, give moderators detailed guidelines to help them decide whether a post should be deleted, and these guidelines also help ensure consistency in their decisions (see Table 1). Models are not given access to these guidelines, and arguably this is the reason for many of their documented weaknesses.
Let us illustrate this with the following example. Assume we are shown two posts, the abusive “Immigrants are parasites” and the non-abusive “I love artists”, and are asked to judge whether a new post “Artists are parasites” is abusive. While the post is insulting, it does not contain hate speech, as professions are not usually protected, but we cannot know that without access to moderation guidelines. Based on these two posts alone, we might struggle to decide which label to assign. We are then given more examples, specifically the non-abusive “I hate artists” and the abusive “I hate immigrants”. In the absence of any other information, we would probably label the post “Artists are parasites” as non-abusive. The example highlights that 1) the current problem formulation (i.e., given post p and a collection of labelled examples C, decide whether p is abusive) is not adequate, since even humans would struggle to agree on the correct classification, and 2) relying on group identifiers is a natural consequence of the problem definition, and often not incorrect. Note that the difficulty does not arise due to the lack of data annotated with real moderator decisions, who would presumably be making labelling decisions according to the policy. Rather, models are not able to distinguish between necessary and sufficient conditions for making a decision based on examples alone (Balkir et al., 2022).
Post: Artists are parasites
Policy: Posts containing dehumanising comparisons targeted to a group based on their protected characteristics violate the policy. Protected characteristics include race, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, serious disease and immigration status.
Old Formulation: Is the post abusive?
Our Formulation: Does the post violate the policy?

Table 1: While it is hard to judge whether a post is abusive based solely on its content, taking the policy into account facilitates decision making. The example is based on the Facebook Community Standards.
In this work we depart from the common approach that aims to mitigate undesired model behaviour by adding artificial constraints (e.g., ignoring group identifiers when judging hate speech) and instead re-define the task through the concept of policy-awareness: given post p and policy P, decide whether p violates P. This entails models are given policy-related information in order to classify posts like “Artists are parasites”; e.g., they know that posts containing dehumanising comparisons targeted to a group based on their protected characteristics violate the policy, and that profession is not listed among the protected characteristics (see Table 1). To enable models to exploit the policy, we formalize the task as an instance of intent classification and slot filling and create a machine-friendly representation of a policy for hate speech by decomposing it into a collection of intents and corresponding slots. For instance, the policy in Table 1 expresses the intent “Dehumanisation” and has three slots: “target”, “protected characteristic”, and “dehumanising comparison”. All slots must be present for a post to violate a policy. Given this definition, the post in Table 1 contains a target (“Artists”) and a dehumanising comparison (“are parasites”) but does not violate the policy since it does not have a value for protected characteristic.
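To make the intent-and-slot reading of the policy concrete, here is a minimal sketch, assuming a dictionary of required slots per intent and a simple all-slots-present check; both the layout and the helper are illustrative and not the released PLEAD format or code.

```python
# Minimal sketch (assumed format): one guideline expressed as an intent with the
# slots that must all be filled for a post to violate it.
REQUIRED_SLOTS = {
    "Dehumanisation": {"Target", "ProtectedCharacteristic", "DehumanisingComparison"},
}

def violates(intent, filled_slots):
    """All slots of the intent must be present for the post to violate the policy."""
    return REQUIRED_SLOTS[intent] <= set(filled_slots)

# "Artists are parasites": a target and a comparison are found, but no protected
# characteristic, so the guideline is not violated.
frame = {"Target": "Artists", "DehumanisingComparison": "are parasites"}
print(violates("Dehumanisation", frame))  # False
```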
We create and make publicly available the Policy-aware Explainable Abuse Detection (PLEAD) dataset which contains (intent and slot) annotations for 3,535 abusive and non-abusive posts. To decide whether a post violates the policy and explain the decision, we design a sequence-to-sequence model that generates a structured representation of the input by first detecting and then filling slots. Intent is assigned deterministically based on the filled slots, leading to the final abusive/non-abusive classification. Experiments show our model is more reliable than classification-only approaches, as it delivers transparent predictions.
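As a rough illustration of the deterministic step from filled slots to intent, the sketch below matches a predicted slot set against per-intent slot signatures paraphrased from Table 2; the exact decision procedure here is an assumption for exposition, not the model's implementation.

```python
# Hypothetical slot signatures per intent, paraphrased from Table 2.
INTENT_SIGNATURES = {
    "Dehumanisation": {"Target", "ProtectedCharacteristic", "DehumanisingComparison"},
    "Threatening": {"Target", "ProtectedCharacteristic", "ThreateningSpeech"},
    "Derogation": {"Target", "ProtectedCharacteristic", "DerogatoryOpinion"},
    "Animosity": {"Target", "ProtectedCharacteristic", "NegativeOpinion"},
    "ProHateCrime": {"HateEntity", "Support"},
}

def assign_intent(predicted_slots):
    """Return the first intent whose required slots are all present, else None."""
    for intent, signature in INTENT_SIGNATURES.items():
        if signature <= set(predicted_slots):
            return intent
    return None  # no guideline matched -> post classified as non-abusive

print(assign_intent({"Target", "ProtectedCharacteristic", "ThreateningSpeech"}))  # Threatening
print(assign_intent({"Target", "NegativeOpinion"}))  # None
```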
2 Related Work

We use abuse as an umbrella term covering any kind of harmful content on the Web, as this is accepted practice in the field (Vidgen et al., 2019; Waseem et al., 2017). Abuse is hard to recognise, due to ambiguity in its definition and differences in annotator sensitivity (Ross et al., 2016). Recent research suggests embracing disagreements by developing multi-annotator architectures that capture differences in annotator perspective (Davani et al., 2022; Basile et al., 2021; Uma et al., 2021). While this approach better models how abuse is perceived, it is not suitable for content moderation, where one has to decide whether to remove a post and a prescriptive paradigm is preferable (Rottger et al., 2022).

Zufall et al. (2020) adopt a more objective approach, as they aim to detect content that is illegal according to EU legislation. However, as they explain, illegal content constitutes only a tiny portion of abusive content, and no explicit knowledge about the legal framework is provided to their model. The problem is framed as the combination of two binary tasks: whether a post contains a protected characteristic, and whether it incites violence. The authors also create a dataset which, however, is not publicly available.
Most existing work ignores these annotation difficulties and models abuse detection with transformer-based models (Vidgen et al., 2021b; Kennedy et al., 2020; Mozafari et al., 2019). Despite impressive F1-scores, these models are black-box and not very informative for moderators.
Intent: Dehumanisation
Policy Definition: Posts containing dehumanising comparisons targeted to an individual or group based on their protected characteristics violate the policy.
Example: Immigrants are parasites
Slot Annotations: [Target Immigrants], [ProtectedCharacteristic Immigrants], [DehumanisingComparison are parasites]

Intent: Threatening
Policy Definition: Posts containing threatening language targeted to an individual or group based on their protected characteristics violate the policy.
Example: "Women should be shot? Agreed!"*
Slot Annotations: [Target Women], [ProtectedCharacteristic Women], [ThreateningSpeech should be shot]

Intent: Derogation
Policy Definition: Posts containing derogatory terms or insults targeted to an individual or group based on their protected characteristics violate the policy.
Example: "Truly, I have been to Poland and there’s nothing to see there. Polish people are so nasty as well."
Slot Annotations: [Target Polish people], [ProtectedCharacteristic Polish], [DerogatoryOpinion are so nasty]

Intent: Animosity
Policy Definition: Posts containing implicit abusive language targeted to an individual or group based on their protected characteristics violate the policy.
Example: "I don’t want to have to be around a breastfeeding woman, if she wants to do that she can do it at home"
Slot Annotations: [Target a breastfeeding woman], [ProtectedCharacteristic she], [NegativeOpinion can do it at home]

Intent: Pro Hate Crime
Policy Definition: Posts that glorify, support or deny hateful actions, events, organisations and individuals.
Example: "Hitler was right all along. We are witnessing it at home EVERY day."
Slot Annotations: [HateEntity Hitler], [Support was right all along]

Table 2: Definition of policy guidelines, intents, and slots associated with them. Example posts and their annotations. Wording in the guidelines which is mapped onto slots is underlined.
Efforts to shed light on their behaviour reveal that they are good at exploiting spurious correlations in the data but unreliable in more realistic scenarios (Calabrese et al., 2021; Röttger et al., 2021). Although explainability is considered a critical capability (Mishra et al., 2019) in the context of abuse detection, to our knowledge, Sarwar et al. (2022) represent the only explainable approach. Their model justifies its predictions by returning the k nearest neighbours that determined the classification outcome. However, such “explanations” may not be easily understandable to humans, who are less skilled at detecting patterns than transformers (Vaswani et al., 2017).
In our work, we formalize the problem of policy-aware abuse detection as an instance of intent classification and slot filling (ICSF), where slots are properties like “target” and “protected characteristic” and intents are policy rules or guidelines (e.g., “dehumanisation”). While Ahmad et al. (2021) use ICSF to parse and explain the content of a privacy policy, we are not aware of any work that infers policy violations in text with ICSF. State-of-the-art models developed for ICSF are sequence-to-sequence transformers built on top of pretrained architectures like BART (Aghajanyan et al., 2020), and also represent the starting point for our modeling approach.
3 Problem Formulation

Given a policy for the moderation of abusive content, and a post p, our task is to decide whether p is abusive. We further note that policies are often expressed as a set of guidelines R = {r1, r2, ..., rN}, as shown in Table 2, and a post p is abusive when its content violates any ri ∈ R. Aside from deciding whether a guideline has been violated, we also expect our model to return a human-readable explanation which should be specific to p (i.e., an extract from the policy describing the guideline being violated is not an explanation), since customised explanations can help moderators make more informed decisions, and developers better understand model behaviour.
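To illustrate what "specific to p" means in practice, the sketch below renders a violated guideline and its slot fillers into an explanation sentence; the template and slot names are assumptions for exposition, not the explanations produced in our experiments.

```python
# Hypothetical template that turns a violated guideline and its slot fillers
# into a post-specific explanation (illustrative only).
def explain(intent, slots):
    if intent == "Dehumanisation":
        return (
            f"The post violates the '{intent}' guideline: it targets "
            f"'{slots['Target']}', identified by the protected characteristic "
            f"'{slots['ProtectedCharacteristic']}', with the dehumanising "
            f"comparison '{slots['DehumanisingComparison']}'."
        )
    return f"No template defined for intent '{intent}'."

print(explain("Dehumanisation", {
    "Target": "Immigrants",
    "ProtectedCharacteristic": "Immigrants",
    "DehumanisingComparison": "are parasites",
}))
```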
Intent Classification and Slot Filling The generation of post-specific explanations requires detection systems to be able to reason over the content of the policy. To facilitate this process, we draw inspiration from previous work (Gupta et al., 2018) on intent classification and slot filling (ICSF), a task where systems have to classify the intent of a query (e.g., IN:CREATE_CALL for the query “Call John”) and fill the slots associated with it (e.g., “Call” is the filler for the slot SL:METHOD and “John” for SL:CONTACT). For our task, we decompose policies into a collection of intents corresponding to the guidelines mentioned above, and each intent is characterized by a set of properties, i.e., slots (see Table 2).
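As a rough illustration, the frame for “Call John” can be serialized as a bracketed string in the style of TOP-like semantic parsing datasets; the exact serialization below is an assumption for exposition rather than the format used by any particular ICSF system.

```python
# Assumed bracketed serialization of an ICSF frame (illustrative format only).
def serialize(intent, slots):
    inner = " ".join(f"[SL:{name} {filler} ]" for name, filler in slots)
    return f"[IN:{intent} {inner} ]"

print(serialize("CREATE_CALL", [("METHOD", "Call"), ("CONTACT", "John")]))
# -> [IN:CREATE_CALL [SL:METHOD Call ] [SL:CONTACT John ] ]
```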
The canonical output of ICSF systems is a tree structure. Multiple representations have been defined, each with a different trade-off between expressivity and ease of parsing. For our use case, we adopt the decoupled representation proposed in Aghajanyan et al. (2020): non-terminal nodes are either slots or intents, the root node is an intent, and terminal nodes are words attested in the post (see Figure 1). In this representation, it is not necessary for all input words to appear in the tree (i.e., in-order traversal of the tree cannot reconstruct the original utterance). Although this ultimately renders the parsing task harder, it is crucial for our domain where words can be associated with multiple slots or no slots, and reasoning over …