Mitigating Covertly Unsafe Text within Natural Language Systems
Warning: This paper contains examples of potentially offensive and harmful text.
Alex Mei*1, Anisha Kabir*1, Sharon Levy1,
Melanie Subbiah2, Emily Allaway2, John Judge1,
Desmond Patton3, Bruce Bimber1, Kathleen McKeown2, William Yang Wang1
1University of California, Santa Barbara, Santa Barbara, CA
2Columbia University, New York, NY
3University of Pennsylvania, Philadelphia, PA
{alexmei, anishakabir, sharonlevy, jjudge, william}@cs.ucsb.edu
{eallaway, m.subbiah, kathy}@cs.columbia.edu
dupatton@upenn.edu, bimber@polisci.ucsb.edu
Abstract
An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system's information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.
1 Introduction
In recent years, intelligent personal assistants have increased information accessibility. However, this has also raised concerns for user safety since these systems may provide dangerous recommendations to unsuspecting users. For instance, a child may ask a device for a fun challenge. The device may respond with an unsafe viral internet challenge such as the salt and ice challenge, where participants cover their body with salt and rub it with ice, causing frostbite-like pain¹. Though work has been done in mitigating violent language and hate speech in natural language systems (Kiritchenko et al., 2021), there has been relatively minimal exploration into covertly unsafe statements that may lead to injury or even fatal consequences. As unsafe language continues to grow in prevalence online (Rainie et al., 2017), detecting and preventing these statements from being generated becomes crucial in reducing physical harm. Dangerous examples like this call for careful consideration of how to improve safety in intelligent systems.

*Equal Contribution.
¹wikipedia.org/wiki/Salt_and_ice_challenge

[Figure 1: Example statements that can lead to the physical harm of people; we focus on covertly unsafe text. Overtly unsafe: "I'll shoot you", "Push him down the stairs". Covertly unsafe: "Stick a fork in an electrical outlet", "Take a bite out of a ghost pepper". Indirectly unsafe: "He's a thug. This is his address...", "She's asking for it with that outfit".]
A broad spectrum of language can lead to physical harm, including overtly violent, covertly dangerous, or otherwise indirectly unsafe statements. Some texts may instigate immediate physical harm if followed, while others may contain prejudices that motivate future acts of harm. To better understand these nuances, we examine this spectrum and distinguish subcategories based on two key notions: whether a statement is actionable and physically unsafe and, if so, whether it is explicitly dangerous.

An example of an overtly unsafe statement is "punch his face" because "punch" is commonly considered violent and is detectable independent of any deeper form of reasoning. In contrast, "pour water on a grease fire" is an example of covertly unsafe language²; the sentence structure and vocabulary do not have explicitly violent semantics, but with knowledge of kitchen safety, we can identify that following the recommendation will likely cause physical harm. An example that is indirectly physically unsafe is "she has no life." While not immediately physically unsafe, this statement can motivate physical harm to oneself or others if combined with underlying mental health risks. Refer to Figure 1 for more examples.

²verywellhealth.com/how-to-put-out-a-grease-fire-1298709

[Figure 2: Flowchart to help determine the category of a piece of text that can cause physical harm. Input → actionable physical harm? If no: indirectly unsafe (2+ degrees of separation), e.g., "You are a pathetic failure". If yes: explicitly violent language? If yes: overtly unsafe (0 degrees of separation), e.g., "Punch him in the face"; if no: covertly unsafe (1 degree of separation), e.g., "Drink bleach to fight COVID19".]
Like overtly unsafe statements, covertly unsafe language will lead to physical harm when followed. Yet, unlike the overt counterpart, covertly unsafe statements are more subtle, making them a concerning problem that needs to be prioritized by stakeholders and regulators. Our work defines the problem of covertly unsafe text that causes physical harm and discusses mitigation strategies in AI systems to inspire future research directions. Harm and safety are complex issues with humans at their core, so we discuss the human factors involved with the techniques we explore.

Our paper is outlined as follows: we distinguish the differences between types of text leading to physical harm by establishing degrees of separation (§2); we establish a taxonomy to dissect further the category of covertly unsafe text that causes physical harm (§3); using these categorizations, we discuss strategies for mitigating the generation of covertly unsafe text in natural language systems at each stage of the machine learning pipeline (§4); finally, we conclude with an interdisciplinary approach to mitigating covertly unsafe text (§5).
2 Categories of Physically Harmful Text
Language can cause harm in various forms, including but not limited to psychological and physical harm. These harms are often correlated and affect people differently based on their unique backgrounds. We focus our discussion on language leading to physical harm but acknowledge that other types of harm should also be considered when improving safety within natural language systems.

To improve the clarity of discourse around physically harmful text, we establish degrees of separation with respect to physical harm (Figure 2); a minimal code sketch of this decision procedure follows the definitions below. The degrees of separation can also be considered an implicit-explicit distinction (Waseem et al., 2017) in the context of physical harm.
Zero degrees of separation: overtly unsafe language contains actionable physical harm (i.e., if someone followed the text, they would cause physical harm), which can be identified as explicitly violent (e.g., using key phrases as references to acts of physical harm) (§2.1).

One degree of separation: covertly unsafe language contains actionable physical harm and is not overtly violent. The additional degree of separation indicates the need for further reasoning to recognize the physical harm (§3).

Two or more degrees of separation: indirectly unsafe language categorizes all other text requiring a longer inference chain to potentially result in physical harm. These texts are not immediately physically harmful but could be toxic, hateful, or otherwise indirectly encouraging of physical harm (§2.2).
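To make the branching in Figure 2 concrete, the sketch below encodes the two decision points as a small classification function. The predicates causes_actionable_harm and is_explicitly_violent are hypothetical placeholders for whatever classifiers or heuristics a practitioner supplies; the sketch captures only the decision logic, not the detection itself.

```python
from enum import Enum
from typing import Callable

class HarmCategory(Enum):
    OVERTLY_UNSAFE = "0 degrees of separation"
    COVERTLY_UNSAFE = "1 degree of separation"
    INDIRECTLY_UNSAFE = "2+ degrees of separation"

def categorize(
    text: str,
    causes_actionable_harm: Callable[[str], bool],
    is_explicitly_violent: Callable[[str], bool],
) -> HarmCategory:
    """Apply the two decision points of Figure 2; detection is delegated
    to the caller-supplied predicates."""
    if causes_actionable_harm(text):
        if is_explicitly_violent(text):
            return HarmCategory.OVERTLY_UNSAFE    # e.g., "Punch him in the face"
        return HarmCategory.COVERTLY_UNSAFE       # e.g., "Drink bleach to fight COVID19"
    return HarmCategory.INDIRECTLY_UNSAFE         # e.g., "You are a pathetic failure"

# Illustration only: the lambdas stand in for real classifiers or heuristics.
label = categorize(
    "Drink bleach to fight COVID19",
    causes_actionable_harm=lambda t: True,
    is_explicitly_violent=lambda t: False,
)
print(label)  # HarmCategory.COVERTLY_UNSAFE
```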
2.1 Zero Degrees of Separation
Zero degrees of separation from physical harm is characterized by language with overt references to violence. Previous studies have delved into overtly unsafe text in the context of gun violence (Pavlick et al., 2016), criminal activity (Osorio and Beltran, 2020), gang violence (Patton et al., 2016; Chang et al., 2018), and gender-based violence (Castorena et al., 2021; González and Cantu-Ortiz, 2021). These studies utilize textual examples from news articles, construct social media datasets, and develop tools for detecting such text; common techniques include sentiment analysis (Castorena et al., 2021) and word embeddings (Chang et al., 2018) for detecting unsafe language. While this language is considered overtly unsafe, full comprehension may require domain expertise (e.g., gang-related discourse). The work on overtly unsafe text contrasts with our focus on covertly unsafe language (§3).
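The studies above rely on learned signals such as sentiment analysis and word embeddings. As a deliberately simplified illustration of why surface-level cues can catch zero-degree text yet miss covert harm, consider the sketch below; the key-phrase lexicon is a hypothetical, non-exhaustive placeholder rather than a resource from prior work.

```python
# A minimal lexical detector for overt violence. Substring matching over a tiny
# hand-written lexicon is far cruder than the learned methods cited above, but it
# shows that explicit violence is detectable from surface vocabulary alone,
# whereas covertly unsafe text slips through.
VIOLENT_KEY_PHRASES = {"shoot", "stab", "punch", "push him down the stairs"}  # hypothetical

def is_explicitly_violent(text: str) -> bool:
    """Return True if the text contains an overt reference to violence."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in VIOLENT_KEY_PHRASES)

print(is_explicitly_violent("I'll shoot you"))               # True  (overtly unsafe)
print(is_explicitly_violent("Pour water on a grease fire"))  # False (covertly unsafe, missed)
```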
2.2 Two or More Degrees of Separation
Two or more degrees of separation classifies statements that may indirectly lead to physical harm. One notable type of language under this class is toxic language, which has motivated several studies to mitigate hate speech (Jurgens et al., 2019), cyberbullying (Xu et al., 2012; Chatzakou et al., 2019), and microaggressions (Breitfeller et al., 2019). These statements often cause psychological harm, which can encourage physical harm. Other types of indirectly unsafe language may include doxxing³ and biased statements (Schick et al., 2021). Recent work has also focused on detecting harmful content generated by conversational systems through insults, stereotypes, or false impressions of system behavior (Dinan et al., 2022). We encourage readers to refer to existing comprehensive surveys (Kiritchenko et al., 2021; Schmidt and Wiegand, 2017; Salawu et al., 2020) in this area, as our paper focuses on covertly unsafe text (§3), which has seen comparatively little progress.

³rcfp.org/journals/news-media-and-law-spring-2015/dangers-doxxing
2.3 Assumptions for Categorizing Harm
Ambiguous Information. Language ambiguities make it difficult to determine text safety. Statements like "cut a pie with a knife and turn it on yourself" can be potentially violent depending on whether the ambiguous pronoun "it" is resolved to the pie or the knife. Ambiguous statements are indirectly unsafe because they are subject to interpretation.

Literal and Explicit Statements. When evaluating whether a statement is physically unsafe, we assume that the statement is taken literally with all relevant details explicitly stated. We consider only physical harm directly caused by explicit recommendations; a statement such as "consume potatoes to cure cancer" is therefore considered safe, since it is safe to "consume potatoes." Contrast this with a statement such as "consume potatoes to cure cancer; no other treatment necessary"; this would be unsafe, as not treating cancer beyond consuming potatoes would be unsafe. The latter example could be sarcastic, but an unsafe statement meant as a joke is still inherently unsafe.
3 Covertly Unsafe Language
Covertly unsafe text requires more context to discern than its overt counterpart. Yet, unlike indirectly unsafe text, extrapolation is unnecessary to determine whether it is physically harmful.

A system's knowledge directly influences the quality of generated text (Yu et al., 2022), and missing, incompatible, or false information can often cause systems to generate unsafe language. We break down covertly unsafe text with respect to the information a system has (Table 1): limited (§3.1), incompatible (§3.2), or incorrect (§3.3). Note that these categories are not mutually exclusive.
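As a minimal illustration of this breakdown, the sketch below represents the three information-related failure modes as non-exclusive flags; the class and member names are illustrative shorthand, not a fixed API or notation from Table 1.

```python
from enum import Flag, auto

class InfoIssue(Flag):
    """Information failure modes behind covertly unsafe text; a single
    statement may exhibit more than one of these at once."""
    NONE = 0
    LIMITED = auto()       # §3.1: relevant knowledge is missing
    INCOMPATIBLE = auto()  # §3.2: pieces of knowledge conflict when combined
    INCORRECT = auto()     # §3.3: the underlying information is false

# Because the categories are not mutually exclusive, a recommendation can be
# labeled with several issues, e.g., false information plus a missing
# user-specific condition.
issues = InfoIssue.INCORRECT | InfoIssue.LIMITED
print(bool(issues & InfoIssue.LIMITED))  # True
```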
3.1 Limited Information
To generate well-formed recommendations, systems need relevant and comprehensive knowledge about their domain (Reiter et al., 2003); if the system's knowledge is too limited, it may overlook facts in a generated recommendation that make it unsafe. The missing knowledge varies in specificity and applicability, from commonsense (Xie and Pu, 2021) to more user- and domain-specific information (Bateman, 1990).

Two examples of unsafe text due to limited information are: "put your finger in a light bulb socket", where lack of commonsense knowledge about electrocution could cause physical harm⁴, and "drink lemonade from a copper vessel", where lack of chemistry-domain knowledge about toxic chemical reactions could lead to physical harm⁵. While these examples put all readers in danger, other scenarios may be conditionally unsafe, endangering only specific users under certain conditions. For example, this could involve a system recommending to "consume almond milk as an alternative to milk" to a user who is allergic to tree nuts.

The common thread in these examples is that the system needs more knowledge to recognize such language. Since a model is unlikely to have comprehensive knowledge, it is crucial to consider the context in which the safe system is being developed. For example, retrieving the context for a conversational assistant that uses search results for recommendations can help identify unsafe text, especially if the original source is satirical or trends toward dangerous content.
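One lightweight way to act on such retrieved context is sketched below, under the assumption that the assistant can see the URLs of the passages backing a recommendation: check the provenance of each source against credibility lists before responding. The domain names and list contents are illustrative placeholders, not a vetted resource.

```python
from urllib.parse import urlparse

# Hypothetical, illustrative domain lists; a deployed system would rely on
# curated and regularly audited source-credibility information instead.
SATIRE_OR_RISKY_SOURCES = {"satirewire.example", "viral-challenges.example"}
VETTED_SOURCES = {"cdc.gov", "who.int"}

def source_risk(url: str) -> str:
    """Coarsely label the provenance of a retrieved passage backing a recommendation."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    if domain in SATIRE_OR_RISKY_SOURCES:
        return "flag"        # withhold the recommendation or attach a safety warning
    if domain in VETTED_SOURCES:
        return "vetted"
    return "unverified"      # route to further checks or respond cautiously

print(source_risk("https://www.viral-challenges.example/salt-and-ice"))  # flag
```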
3.2 Incompatible Information
Even a system with abundant knowledge may provide recommendations containing covertly unsafe incompatible information (Preum et al., 2017; Alamri and Stevenson, 2015). Incompatibility may

⁴howstuffworks.com/science-vs-myth/what-if/finger-in-electrical-outlet.htm
⁵webmd.com/diet/what-to-know-copper-toxicity