2 Background
2.1 Important Questions for AI Safety
AI Safety.
The fundamental goal of AI safety is to ensure that AI models do not harm humans
(Bostrom and Yudkowsky, 2014; Russell, 2019; Tegmark, 2017; Hendrycks et al., 2021d). AI
systems are trained to optimize given objectives. However, it is not easy to define a perfect objective,
because correct, formal specifications require us to express many of the human values that are in
the background of simple objectives. When we ask a robot to fetch coffee, for instance, we do not
mean: fetch coffee no matter what it takes. We mean something more like: fetch coffee, if coffee or a
reasonable substitute is available at a reasonable price, within a reasonable time frame, and when
the fetching will not have a non-trivial expectation of endangering other agents or impeding more
important goals, weighing my goals as somewhat more important than those of others. AI safety
researchers point out that human objectives and their associated values are often too complex to
capture and express (Bostrom and Yudkowsky, 2014; Russell, 2019).
However, recent research in the field of cognitive science has begun to reveal that human values
indeed have a systematic and predictable structure (Mikhail, 2011; Greene, 2014; Kleiman-Weiner
et al., 2015). Of course, values vary across cultures – and even across individuals within a single
culture. Sometimes even the same individual can hold conflicting values or make contradictory
judgments. Despite this important and pervasive variation in human moral judgment, it is still
possible to describe systematic ways that a particular population of humans responds to morally
charged cases. In this paper we draw on recent advances in the cognitive science of moral judgment
which reveal the structure behind human value-guided judgment (Levine et al., 2018; Awad et al.,
2022b). Integrating models of value-driven human decisions in AI systems can bring us closer to the
goal of aligning AI with human values.
An Urgent Need for Safe LLMs.
AI safety research in NLP has become increasingly urgent due to
the recent advancement of LLMs (Radford et al., 2018, 2019; Devlin et al., 2019; Liu et al., 2019;
Brown et al., 2020) and their broad applications to many tasks (Chen et al., 2021; Stiennon et al.,
2020; Ram et al., 2018; Fan et al., 2019). Existing AI safety work in NLP includes (1) high-level
methodology design (Irving et al., 2018; Ziegler et al., 2019; Askell et al., 2021), (2) training analysis
such as the scaling effect (Rae et al., 2021), (3) identification of challenging tasks such as mathematics
(Hendrycks et al., 2021c; Cobbe et al., 2021), coding (Hendrycks et al., 2021a), and truthful question
answering (Lin et al., 2021), (4) analysis of undesired behaviors of LLMs such as toxicity (Gehman
et al., 2020; Perez et al., 2022), misinformation harms and other risk areas (Weidinger et al., 2021), (5)
risks arising from misspecification (Kenton et al., 2021), and (6) improvements such as encouraging
LLMs to explicitly retrieve evidence (Borgeaud et al., 2021; Talmor et al., 2020), among many others.
In this context, our MoralExceptQA work intersects with (3)–(6): we address the potential risk that LLMs might follow human-misspecified rules or commands too literally, which might trigger dangerous failure modes (for (5)); contribute a challenge set for predicting human moral judgment in cases where a rule may permissibly be broken (for (3)); analyze how and why current LLMs fail on moral flexibility questions (for (4)); and propose a MORALCOT prompting strategy to improve the reliability of moral reasoning in LLMs (for (6)).
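As a minimal illustration of what such a multi-step prompting strategy can look like, the sketch below queries a model with intermediate sub-questions about a rule-breaking scenario before asking for a final permissibility judgment. It is purely illustrative: the `query_llm` callable and the specific sub-questions are placeholders of our own choosing, not the exact MORALCOT prompt.

```python
from typing import Callable, List

# Illustrative placeholders: these are NOT the exact MoralCoT sub-questions.
SUB_QUESTIONS: List[str] = [
    "What is the purpose of this rule?",
    "Who would be affected, and how, if the rule were broken in this case?",
    "Does breaking the rule in this case defeat the rule's purpose?",
]

def moral_cot_judgment(scenario: str, query_llm: Callable[[str], str]) -> str:
    """Ask an LLM intermediate questions about a rule-breaking scenario, then
    ask for a final permissibility judgment conditioned on its own answers.

    `query_llm` is any text-in/text-out completion function (an assumption
    here), e.g. a thin wrapper around whichever LLM API is available.
    """
    prompt = scenario
    for question in SUB_QUESTIONS:
        prompt += "\nQ: " + question + "\nA:"
        # Feed the model's own intermediate reasoning back into the prompt.
        prompt += " " + query_llm(prompt).strip()
    prompt += ("\nQ: Taking all of the above into account, is it OK to break "
               "the rule in this case? Answer yes or no.\nA:")
    return query_llm(prompt).strip()

# Example usage with a queue-cutting scenario like those in Section 2.2:
# answer = moral_cot_judgment(
#     "There is a rule against cutting in line at airport security. Someone "
#     "asks to cut the line because they are about to miss their flight. "
#     "Is it OK for them to cut the line?",
#     query_llm=my_llm_client,  # hypothetical wrapper, not provided here
# )
```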
2.2 The Human Moral Mind Is Flexible
Insights from Cognitive Science.
The last few decades of research in moral psychology have
revealed that rules are critical to the way that the human mind makes moral decisions. Nearly every
contemporary theory of moral psychology has some role for rules (Cushman, 2013; Greene, 2014;
Holyoak and Powell, 2016; Nichols, 2004; Haidt, 2013). While rules are often thought of as fixed
and strict, more recent work in moral psychology has begun to investigate the human capacity to
understand rules in flexible terms – the ability to decide when it would be permissible to break a
rule, update a rule, or create a rule when none existed before (Levine et al., 2020; Awad et al., 2022b;
Levine et al., 2018; Weld and Etzioni, 1994; Rudinger et al., 2020).
The flexibility of rules is obvious upon reflection. Although there is an explicit rule against cutting
in line (“jumping the queue”), for example, there are also myriad exceptions to the rule where
cutting is perfectly permissible. It may be OK to cut a line at a deli if you were given the wrong order,
or to cut a bathroom line if you are about to be sick, or to cut an airport security line if you are the