Mitigating Covertly Unsafe Text within Natural Language Systems
Warning: This paper contains examples of potentially offensive and harmful text.
Alex Mei*1, Anisha Kabir*1, Sharon Levy1,
Melanie Subbiah2, Emily Allaway2, John Judge1,
Desmond Patton3, Bruce Bimber1, Kathleen McKeown2, William Yang Wang1
1University of California, Santa Barbara, Santa Barbara, CA
2Columbia University, New York, NY
3University of Pennsylvania, Philadelphia, PA
{alexmei, anishakabir, sharonlevy, jjudge, william}@cs.ucsb.edu
{eallaway, m.subbiah, kathy}@cs.columbia.edu
dupatton@upenn.edu, bimber@polisci.ucsb.edu
Abstract
An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system's information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.
1 Introduction
In recent years, intelligent personal assistants have increased information accessibility. However, this has also raised concerns for user safety since these systems may provide dangerous recommendations to unsuspecting users. For instance, a child may ask a device for a fun challenge. The device may respond with an unsafe viral internet challenge such as the salt and ice challenge, where participants cover their body with salt and rub it with ice, causing frostbite-like pain¹. Though work has been done in mitigating violent language and hate speech in natural language systems (Kiritchenko et al., 2021), there has been relatively minimal exploration into covertly unsafe statements that may lead to injury or even fatal consequences. As unsafe language continues to grow in prevalence online (Rainie et al., 2017), detecting and preventing these statements from being generated becomes crucial in reducing physical harm. Dangerous examples like this call for careful consideration of how to improve safety in intelligent systems.

*Equal Contribution.
¹wikipedia.org/wiki/Salt_and_ice_challenge

[Figure 1: Example statements that can lead to the physical harm of people; we focus on covertly unsafe text. Overtly unsafe: "I'll shoot you", "Push him down the stairs". Covertly unsafe: "Stick a fork in an electrical outlet", "Take a bite out of a ghost pepper". Indirectly unsafe: "He's a thug. This is his address...", "She's asking for it with that outfit".]
A broad spectrum of language can lead to physical harm, including overtly violent, covertly dangerous, or otherwise indirectly unsafe statements. Some texts may instigate immediate physical harm if followed, while others may contain prejudices that motivate future acts of harm. To better understand these nuances, we examine this spectrum and distinguish subcategories based on two key notions: whether a statement is actionable and physically unsafe and, if so, whether it is explicitly dangerous.

An example of an overtly unsafe statement is "punch his face" because "punch" is commonly considered violent and is detectable independent of any deeper form of reasoning. In contrast, "pour water on a grease fire" is an example of covertly unsafe language²; the sentence structure and vocabulary do not have explicitly violent semantics, but with knowledge of kitchen safety, we can identify that following the recommendation will likely cause physical harm. An example that is indirectly physically unsafe is "she has no life." While not immediately physically unsafe, this statement can motivate physical harm to oneself or others if combined with underlying mental health risks. Refer to Figure 1 for more examples.

²verywellhealth.com/how-to-put-out-a-grease-fire-1298709

[Figure 2: Flowchart to help determine the category of a piece of text that can cause physical harm. Input → actionable physical harm? If no: indirectly unsafe (2+ degrees of separation), e.g., "You are a pathetic failure". If yes: explicitly violent language? If yes: overtly unsafe (0 degrees of separation), e.g., "Punch him in the face"; if no: covertly unsafe (1 degree of separation), e.g., "Drink bleach to fight COVID19".]
Like overtly unsafe statements, covertly unsafe language will lead to physical harm when followed. Yet, unlike the overt counterpart, covertly unsafe statements are more subtle, making them a concerning problem that needs to be prioritized by stakeholders and regulators. Our work defines the problem of covertly unsafe text that causes physical harm and discusses mitigation strategies in AI systems to inspire future research directions. Harm and safety are complex issues with humans at their core, so we discuss the human factors involved with the techniques we explore.

Our paper is outlined as follows: we distinguish the differences between types of text leading to physical harm by establishing degrees of separation (§2); we establish a taxonomy to dissect further the category of covertly unsafe text that causes physical harm (§3); using these categorizations, we discuss strategies for mitigating the generation of covertly unsafe text in natural language systems at each stage of the machine learning pipeline (§4); finally, we conclude with an interdisciplinary approach to mitigating covertly unsafe text (§5).
2 Categories of Physically Harmful Text
Language can cause harm in various forms, including but not limited to psychological and physical harm. These harms are often correlated and affect people differently based on their unique backgrounds. We focus our discussion on language leading to physical harm but acknowledge that other types of harm should also be considered when improving safety within natural language systems.

To improve the clarity of discourse around physically harmful text, we establish degrees of separation with respect to physical harm (Figure 2); a minimal code sketch of this decision procedure follows the definitions below. The degrees of separation can also be considered an implicit-explicit distinction (Waseem et al., 2017) in the context of physical harm.
Zero degrees of separation: overtly unsafe language contains actionable physical harm (i.e., if someone followed the text, they would cause physical harm), which can be identified as explicitly violent (e.g., using key phrases as references to acts of physical harm) (§2.1).

One degree of separation: covertly unsafe language contains actionable physical harm and is not overtly violent. The additional degree of separation indicates the need for further reasoning to recognize the physical harm (§3).

Two or more degrees of separation: indirectly unsafe language categorizes all other text requiring a longer inference chain to potentially result in physical harm. These texts are not immediately physically harmful but could be toxic, hateful, or otherwise indirectly encouraging of physical harm (§2.2).
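To make the branching in Figure 2 concrete, the sketch below encodes the two decision points as a small classification function. The predicates causes_actionable_harm and is_explicitly_violent are hypothetical placeholders for whatever classifiers or heuristics a practitioner supplies; the sketch captures only the decision logic, not the detection itself.

```python
from enum import Enum
from typing import Callable

class HarmCategory(Enum):
    OVERTLY_UNSAFE = "0 degrees of separation"
    COVERTLY_UNSAFE = "1 degree of separation"
    INDIRECTLY_UNSAFE = "2+ degrees of separation"

def categorize(
    text: str,
    causes_actionable_harm: Callable[[str], bool],
    is_explicitly_violent: Callable[[str], bool],
) -> HarmCategory:
    """Apply the two decision points of Figure 2; detection is delegated
    to the caller-supplied predicates."""
    if causes_actionable_harm(text):
        if is_explicitly_violent(text):
            return HarmCategory.OVERTLY_UNSAFE    # e.g., "Punch him in the face"
        return HarmCategory.COVERTLY_UNSAFE       # e.g., "Drink bleach to fight COVID19"
    return HarmCategory.INDIRECTLY_UNSAFE         # e.g., "You are a pathetic failure"

# Illustration only: the lambdas stand in for real classifiers or heuristics.
label = categorize(
    "Drink bleach to fight COVID19",
    causes_actionable_harm=lambda t: True,
    is_explicitly_violent=lambda t: False,
)
print(label)  # HarmCategory.COVERTLY_UNSAFE
```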
2.1 Zero Degrees of Separation
Zero degrees of separation from physical harm is characterized by language with overt references to violence. Previous studies have delved into overtly unsafe text in the context of gun violence (Pavlick et al., 2016), criminal activity (Osorio and Beltran, 2020), gang violence (Patton et al., 2016; Chang et al., 2018), and gender-based violence (Castorena et al., 2021; González and Cantu-Ortiz, 2021). These studies utilize textual examples from news articles, construct social media datasets, and develop tools for detecting such text; common techniques include sentiment analysis (Castorena et al., 2021) and word embeddings (Chang et al., 2018) for detecting unsafe language. While this language is considered overtly unsafe, full comprehension may require domain expertise (e.g., gang-related discourse). The work on overtly unsafe text contrasts with our focus on covertly unsafe language (§3).
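The studies above rely on learned signals such as sentiment analysis and word embeddings. As a deliberately simplified illustration of why surface-level cues can catch zero-degree text yet miss covert harm, consider the sketch below; the key-phrase lexicon is a hypothetical, non-exhaustive placeholder rather than a resource from prior work.

```python
# A minimal lexical detector for overt violence. Substring matching over a tiny
# hand-written lexicon is far cruder than the learned methods cited above, but it
# shows that explicit violence is detectable from surface vocabulary alone,
# whereas covertly unsafe text slips through.
VIOLENT_KEY_PHRASES = {"shoot", "stab", "punch", "push him down the stairs"}  # hypothetical

def is_explicitly_violent(text: str) -> bool:
    """Return True if the text contains an overt reference to violence."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in VIOLENT_KEY_PHRASES)

print(is_explicitly_violent("I'll shoot you"))               # True  (overtly unsafe)
print(is_explicitly_violent("Pour water on a grease fire"))  # False (covertly unsafe, missed)
```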
2.2 Two or More Degrees of Separation
Two or more degrees of separation classifies statements that may indirectly lead to physical harm. One notable type of language under this class is toxic language, which has motivated several studies to mitigate hate speech (Jurgens et al., 2019), cyberbullying (Xu et al., 2012; Chatzakou et al., 2019), and microaggressions (Breitfeller et al., 2019). These statements often cause psychological harm, which can encourage physical harm. Other types of indirectly unsafe language may include doxxing³ and biased statements (Schick et al., 2021). Recent work has also focused on detecting harmful content generated by conversational systems through insults, stereotypes, or false impressions of system behavior (Dinan et al., 2022). We encourage readers to refer to existing comprehensive surveys (Kiritchenko et al., 2021; Schmidt and Wiegand, 2017; Salawu et al., 2020) in this area, as our paper focuses on covertly unsafe text (§3), which has seen comparatively little progress.

³rcfp.org/journals/news-media-and-law-spring-2015/dangers-doxxing
2.3 Assumptions for Categorizing Harm
Ambiguous Information. Language ambiguities make it difficult to determine text safety. Statements like "cut a pie with a knife and turn it on yourself" can be potentially violent depending on whether the ambiguous pronoun "it" is resolved to the pie or the knife. Ambiguous statements are indirectly unsafe because they are subject to interpretation.

Literal and Explicit Statements. When evaluating whether a statement is physically unsafe, we assume that the statement is taken literally with all relevant details explicitly stated. We consider only physical harm directly caused by explicit recommendations; a statement such as "consume potatoes to cure cancer" is therefore considered safe, since it is safe to "consume potatoes." Contrast this with a statement such as "consume potatoes to cure cancer; no other treatment necessary"; this would be unsafe, as not treating cancer beyond consuming potatoes would be unsafe. The latter example could be sarcastic, but an unsafe statement meant as a joke is still inherently unsafe.
3 Covertly Unsafe Language
Covertly unsafe text requires more context to discern than its overt counterpart. Yet, unlike indirectly unsafe text, extrapolation is unnecessary to determine whether it is physically harmful.

A system's knowledge directly influences the quality of generated text (Yu et al., 2022), and missing, incompatible, or false information can often cause systems to generate unsafe language. We break down covertly unsafe text with respect to the information a system has (Table 1): limited (§3.1), incompatible (§3.2), or incorrect (§3.3). Note that these categories are not mutually exclusive.
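As a minimal illustration of this breakdown, the sketch below represents the three information-related failure modes as non-exclusive flags; the class and member names are illustrative shorthand, not a fixed API or notation from Table 1.

```python
from enum import Flag, auto

class InfoIssue(Flag):
    """Information failure modes behind covertly unsafe text; a single
    statement may exhibit more than one of these at once."""
    NONE = 0
    LIMITED = auto()       # §3.1: relevant knowledge is missing
    INCOMPATIBLE = auto()  # §3.2: pieces of knowledge conflict when combined
    INCORRECT = auto()     # §3.3: the underlying information is false

# Because the categories are not mutually exclusive, a recommendation can be
# labeled with several issues, e.g., false information plus a missing
# user-specific condition.
issues = InfoIssue.INCORRECT | InfoIssue.LIMITED
print(bool(issues & InfoIssue.LIMITED))  # True
```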
3.1 Limited Information
To generate well-formed recommendations, systems need relevant and comprehensive knowledge about their domain (Reiter et al., 2003); if the system's knowledge is too limited, it may overlook facts in a generated recommendation that make it unsafe. The missing knowledge varies in specificity and applicability, from commonsense (Xie and Pu, 2021) to more user- and domain-specific information (Bateman, 1990).

Two examples of unsafe text due to limited information are: "put your finger in a light bulb socket", where lack of commonsense knowledge about electrocution could cause physical harm⁴, and "drink lemonade from a copper vessel", where lack of chemistry-domain knowledge about toxic chemical reactions could lead to physical harm⁵. While these examples put all readers in danger, other scenarios may be conditionally unsafe, endangering only specific users under certain conditions. For example, this could involve a system recommending to "consume almond milk as an alternative to milk" to a user who is allergic to tree nuts.

The common thread in these examples is that the system needs more knowledge to recognize such language. Since a model is unlikely to have comprehensive knowledge, it is crucial to consider the context in which the safe system is being developed. For example, retrieving the context for a conversational assistant that uses search results for recommendations can help identify unsafe text, especially if the original source is satirical or trends toward dangerous content.
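One lightweight way to act on such retrieved context is sketched below, under the assumption that the assistant can see the URLs of the passages backing a recommendation: check the provenance of each source against credibility lists before responding. The domain names and list contents are illustrative placeholders, not a vetted resource.

```python
from urllib.parse import urlparse

# Hypothetical, illustrative domain lists; a deployed system would rely on
# curated and regularly audited source-credibility information instead.
SATIRE_OR_RISKY_SOURCES = {"satirewire.example", "viral-challenges.example"}
VETTED_SOURCES = {"cdc.gov", "who.int"}

def source_risk(url: str) -> str:
    """Coarsely label the provenance of a retrieved passage backing a recommendation."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    if domain in SATIRE_OR_RISKY_SOURCES:
        return "flag"        # withhold the recommendation or attach a safety warning
    if domain in VETTED_SOURCES:
        return "vetted"
    return "unverified"      # route to further checks or respond cautiously

print(source_risk("https://www.viral-challenges.example/salt-and-ice"))  # flag
```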
3.2 Incompatible Information
Even a system with abundant knowledge may provide recommendations containing covertly unsafe incompatible information (Preum et al., 2017; Alamri and Stevenson, 2015). Incompatibility may

⁴howstuffworks.com/science-vs-myth/what-if/finger-in-electrical-outlet.htm
⁵webmd.com/diet/what-to-know-copper-toxicity