Goal Misgeneralization: Why Correct Specifications
Aren’t Enough For Correct Goals
Rohin Shah∗†
rohinmshah@deepmind.com

Vikrant Varma∗†
vikrantvarma@deepmind.com

Ramana Kumar†   Mary Phuong†   Victoria Krakovna†   Jonathan Uesato†   Zac Kenton†

∗Equal contribution.   †DeepMind.

Preprint. Under review. arXiv:2210.01790v2 [cs.LG] 2 Nov 2022
Abstract
The field of AI alignment is concerned with AI systems that pursue unintended
goals. One commonly studied mechanism by which an unintended goal might arise
is specification gaming, in which the designer-provided specification is flawed in
a way that the designers did not foresee. However, an AI system may pursue an
undesired goal even when the specification is correct, in the case of goal misgen-
eralization. Goal misgeneralization is a specific form of robustness failure for
learning algorithms in which the learned program competently pursues an unde-
sired goal that leads to good performance in training situations but bad performance
in novel test situations. We demonstrate that goal misgeneralization can occur in
practical systems by providing several examples in deep learning systems across a
variety of domains. Extrapolating forward to more capable systems, we provide
hypotheticals that illustrate how goal misgeneralization could lead to catastrophic
risk. We suggest several research directions that could reduce the risk of goal
misgeneralization for future systems.
1 Introduction
Recent years have seen a rise in concern about catastrophic risk from AI misalignment, where a highly capable AI system that pursues an unintended goal determines that it can better achieve its goal by disempowering humanity [48, 6, 3]. But how do we get into a situation in which an AI system pursues an unintended goal? Much work considers the case where the designers provide an incorrect specification, e.g. an incorrect reward function for reinforcement learning (RL) [33, 24]. Recent work [29, 20] suggests that, in the case of learning systems, there is another pathway by which the system may pursue an unintended goal: even if the specification is correct, the system may coherently pursue an unintended goal that agrees with the specification during training, but differs from the specification at deployment.
Consider the example illustrated in Figure 1 using the MEDAL-ADR agent and environment from CGI et al. [10]. An agent is trained with RL to visit a set of coloured spheres in some order that is initially
unknown to the agent. To encourage the agent to learn from other actors in the environment (“cultural
transmission”), the environment initially contains an expert bot that visits the spheres in the correct
order. In such cases, the agent can determine the correct order by observing the expert, rather than
doing its own costly exploration. Indeed, by imitating the expert, the final trained agent typically
visits the target locations correctly on its first try (Figure 1a).
Figure 1: Goal misgeneralization in a 3D environment. The agent (blue) must visit the coloured spheres in an order that is randomly generated at the start of the episode. The agent receives a positive reward when visiting the correct next sphere, and a negative reward when visiting an incorrect sphere. A partner bot (pink) follows a predetermined policy. We visualize agent and partner paths as coloured trails that start thin and become thicker as the episode progresses. A player’s total reward is displayed above their avatar, and past reward is directly available to the agent as an observation.
(a) Training: The agent is partnered with an “expert” that visits the spheres in the correct order. The agent learns to visit the spheres in the correct order, closely mimicking the expert’s path.
(b) Capability misgeneralization: When we vertically flip the agent’s observation at test time, it gets stuck in a location near the top of the map.
(c) Goal misgeneralization: At test time, we replace the expert with an “anti-expert” that always visits the spheres in an incorrect order. The agent continues to follow the anti-expert’s path, despite receiving negative rewards, demonstrating clear capabilities but an unintended goal.
(d) Intended generalization: Ideally, the agent initially follows the anti-expert to the yellow and purple spheres. Upon entering the purple sphere, it observes that it gets a negative reward, and now explores to discover the correct sphere order instead of following the anti-expert.

What happens when we pair the agent with an “anti-expert” that visits the spheres in an incorrect order? Intuitively, as depicted in Figure 1d, we would want the agent to notice that it receives
a negative reward (which is available as an observation) when using the order suggested by the
anti-expert, and then switch to exploration in order to determine the correct order. However, in
practice the agent simply continues to follow the anti-expert path, accumulating more and more
negative reward (Figure 1c). Note that the agent still displays an impressive ability to navigate an
environment full of obstacles: the problem is that these capabilities have been put to use towards
the undesired goal of following its partner, rather than the intended goal of visiting the spheres in
the correct order. This problem arose even though the agent was only ever rewarded for visiting the
spheres in the correct order: there was no reward misspecification.
Goal misgeneralization refers to this pathological behaviour, in which a learned model behaves as though it is optimizing an unintended goal, despite receiving correct feedback during training. This makes goal misgeneralization a specific kind of robustness or generalization failure, in which the model’s capabilities generalize to the test setting, but the pursued goal does not. Note that goal misgeneralization is a strict subset of generalization failures. It excludes situations in which the model “breaks” or “acts randomly” or otherwise no longer demonstrates competent capabilities. In our running example, if we flip the agent’s observations vertically at test time, it simply gets stuck in a location and doesn’t seem to do anything coherent (Figure 1b), so this is misgeneralization but not goal misgeneralization. Relative to these “random” failures, goal misgeneralization can lead to significantly worse outcomes: following the anti-expert leads to significant negative reward, while doing nothing or acting randomly would usually lead to a reward of 0 or 1. With more powerful systems, coherent behaviour towards an unintended goal can produce catastrophic outcomes [6, 54].
In this paper, we advance our understanding of goal misgeneralization through four contributions:
• We provide an operationalization of goal misgeneralization (Section 2) that does not require the RL framework assumed in Di Langosco et al. [20], nor the structural assumptions used in Hubinger et al. [29].
• We show that goal misgeneralization can occur in practice by presenting several new examples in hand-designed (Sections 3.1-3.3) and “in-the-wild” (Sections 3.4-3.5) settings.
• We apply the lens of goal misgeneralization for the first time to agent-induced distribution shifts (Sections 3.1-3.2) and few-shot learning without RL (Section 3.3).
• We describe through concrete hypotheticals how goal misgeneralization provides a mechanism by which powerful AI systems could pose a catastrophic risk to humanity (Section 4).
2 A model for goal misgeneralization
We present a general model for misgeneralization and then discuss the properties that characterize
goal misgeneralization in particular. We will focus on the case of deep learning since all of our main
examples in Section 3 use deep learning. However, our model is more general and can apply to any
learning system. We discuss a concrete example without deep learning in Appendix A.
2.1 Standard misgeneralization framework
We consider the standard picture for misgeneralization within the empirical risk minimization framework. We aim to learn some function $f: \mathcal{X} \to \mathcal{Y}$ that maps inputs $x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. For example, in classification problems $\mathcal{X}$ is the set of inputs, and $\mathcal{Y}$ is the set of labels. In reinforcement learning (RL), $\mathcal{X}$ is the set of states or observation histories, and $\mathcal{Y}$ is the set of actions.

We consider a parameterized family of functions $F_\Theta$, such as those implemented by deep neural networks. Functions are selected based on a scoring function $s(f_\theta, \mathcal{D}_{\text{train}})$ that evaluates the performance of $f_\theta$ on the given dataset $\mathcal{D}_{\text{train}}$.³ Misgeneralization can occur when there are two parameterizations $\theta_1$ and $\theta_2$ such that $f_{\theta_1}$ and $f_{\theta_2}$ both perform well on $\mathcal{D}_{\text{train}}$ but differ on $\mathcal{D}_{\text{test}}$. Depending on which of $\theta_1$ and $\theta_2$ is chosen, we may then get very bad scores on $\mathcal{D}_{\text{test}}$. Whether we get $f_{\theta_1}$ or $f_{\theta_2}$ depends on the inductive biases of the model and random effects (such as the random initialization of model parameters).

³The ‘dataset’ consists of the inputs over which losses and gradients are calculated. For example, in many RL algorithms, the training dataset consists of the $(s, a, r, s')$ transitions used to compute the surrogate loss.

Note that while $\mathcal{D}_{\text{test}}$ is sometimes assumed to be sampled from the same distribution as $\mathcal{D}_{\text{train}}$, in this paper we primarily consider cases where it is sampled from a different distribution, known as distribution shift. This further increases the risk of misgeneralization.
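To make this concrete, here is a minimal sketch (our own toy illustration, not taken from the paper) of two parameterizations that score identically on $\mathcal{D}_{\text{train}}$ yet diverge on a shifted $\mathcal{D}_{\text{test}}$. The "spurious" second feature plays the same role as the expert bot in Figure 1: it happens to agree with the intended rule during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(f, X, y):
    """Scoring function s(f, D): accuracy of f's predictions on the dataset."""
    return np.mean(f(X) == y)

# Two candidate parameterizations from the model family.
def f_theta1(X): return X[:, 0]   # intended rule: predict feature 0
def f_theta2(X): return X[:, 1]   # agrees with it only while a spurious correlation holds

# Training distribution: feature 1 always copies feature 0.
a = rng.integers(0, 2, size=1000)
X_train, y_train = np.stack([a, a], axis=1), a

# Shifted test distribution: feature 1 becomes independent noise.
b = rng.integers(0, 2, size=1000)
X_test, y_test = np.stack([a, b], axis=1), a

print(score(f_theta1, X_train, y_train), score(f_theta2, X_train, y_train))  # 1.0, 1.0
print(score(f_theta1, X_test, y_test), score(f_theta2, X_test, y_test))      # 1.0, ~0.5
# Which function a learner returns depends on inductive biases and random
# initialization, not on anything visible in D_train alone.
```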
2.2 Goal misgeneralization
We now characterize goal misgeneralization. Intuitively, goal misgeneralization occurs when we learn a function $f_{\theta_{\text{bad}}}$ that has robust capabilities but pursues an undesired goal.

It is quite challenging to define what a “capability” is in the context of neural networks. We provide a provisional definition following Chen et al. [11]. We say that the model is capable of some task $X$ in setting $Y$ if it can be quickly tuned to perform task $X$ well in setting $Y$ (relative to learning $X$ from scratch). For example, tuning could be done by prompt engineering or by fine-tuning on a small quantity of data [52]. We emphasize that this is a provisional definition and hope that future work will provide better definitions of what it means for a model to have a particular “capability”.
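Read procedurally, this provisional definition compares quick tuning against an equal-budget from-scratch baseline. The sketch below is our own illustration; the helper functions (`tune`, `train_from_scratch`, `evaluate`) and the margin are hypothetical placeholders, not an algorithm from the paper.

```python
def is_capable(model, task, setting, tune, train_from_scratch, evaluate,
               budget=100, margin=0.2):
    """Illustrative reading of the provisional capability definition.

    `tune` lightly adapts the model (e.g. prompting or a few fine-tuning
    steps) within `budget`; `train_from_scratch` spends the same budget on a
    fresh model. The model counts as capable of `task` in `setting` if quick
    tuning gets it substantially further than learning from scratch.
    """
    tuned = tune(model, task, setting, budget)
    baseline = train_from_scratch(task, setting, budget)
    return evaluate(tuned, task, setting) >= evaluate(baseline, task, setting) + margin
```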
Inspired by the intentional stance [19], we say that the model’s behaviour is consistent with a goal to perform task $X$ in setting $Y$ if its behaviour in setting $Y$ can be viewed as solving $X$, i.e. it performs task $X$ well in setting $Y$ (without any further tuning). Consistent goals are exactly those capabilities of a model that are exhibited without any tuning. We call a goal that is consistent with the training (resp. test) setting a train (resp. test) goal. Note that there may be multiple goals that are consistent with the model’s behaviour in a given setting. Our definition does not require that the model has an internal representation of a goal, or a “desire” to pursue it.

Goal misgeneralization occurs if, in the test setting $Y_{\text{test}}$, the model’s capabilities include those necessary to achieve the intended goal (given by the scoring function $s$), but the model’s behaviour is not consistent with the intended goal $s$ and is consistent with some other goal (the misgeneralized goal).

Table 1: Goals and capabilities for the examples of Section 3. Both the intended and misgeneralized goals are training goals, but only the misgeneralized goal is a test goal.

Example | Intended goal | Misgeneralized goal | Capabilities
Monster Gridworld | Collect apples and avoid being attacked by monsters | Collect apples and shields | Collecting apples; collecting shields; dodging monsters
Tree Gridworld | Chop trees sustainably | Chop trees as fast as possible | Chopping trees at a given speed
Evaluating Expressions | Compute expression with minimal user interaction | Ask questions then compute expression | Querying the user; performing arithmetic
Cultural Transmission | Navigate to rewarding points | Imitate demonstration | Traversing the environment; imitating another agent
InstructGPT | Be helpful, truthful, and harmless | Be informative, even when harmful | Answering questions; grammar
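Putting the definition together procedurally (a sketch of our own, not a method from the paper): treat the intended goal and candidate misgeneralized goals, such as those in Table 1, as tasks, then check capability and behavioural consistency in the test setting. The predicates `is_capable` and `is_consistent_with` are hypothetical placeholders; the former could be the provisional test sketched above.

```python
def classify_test_behaviour(model, intended_goal, candidate_goals,
                            test_setting, is_capable, is_consistent_with):
    """Illustrative reading of the Section 2.2 definition (not from the paper)."""
    # No capability for the intended task in the test setting: a plain
    # capability generalization failure, not goal misgeneralization.
    if not is_capable(model, intended_goal, test_setting):
        return "capability generalization failure"
    # Behaviour already consistent with the intended goal: nothing went wrong.
    if is_consistent_with(model, intended_goal, test_setting):
        return "intended generalization"
    # Capable, yet coherently pursuing something else: goal misgeneralization.
    for goal in candidate_goals:
        if is_consistent_with(model, goal, test_setting):
            return ("goal misgeneralization", goal)  # `goal` is the misgeneralized goal
    return "incoherent behaviour"
```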
Related models of goal misgeneralization. Di Langosco et al. [20] say that goal misgeneralization occurs when the policy acts in a goal-directed manner but does not achieve high reward according to $s$. They formalize the goal-directedness of the policy in the reinforcement learning (RL) setting using the Agents and Devices framework [41]. Our definition of goal misgeneralization is more general and applies to any learning framework, rather than being restricted to RL. It also includes an additional criterion that the model has to be capable of carrying out the intended goal in the test environment. Intuitively, if the model is not capable of pursuing the intended goal, we would call this a capability generalization failure. Thus, our definition more precisely identifies the situations that we are concerned about.
3 Examples of goal misgeneralization
In this section we provide several examples of goal misgeneralization, summarized in Table 1. Existing examples in the literature are discussed in Appendix B. We strongly recommend watching videos of agent behaviour alongside this section, available at sites.google.com/view/goal-misgeneralization. Our examples meet the following desiderata:

P1. Misgeneralization. The model should be trained to behave well in the training setting, and then should behave badly zero-shot in the deployment setting.

P2. Robust capabilities. The model should have clear capabilities that it visibly retains in the deployment setting, despite producing bad behaviour.

P3. Attributable goal. We should be able to attribute some goal to the model in the deployment setting: there should be some non-trivial task on which the model achieves a near-optimal score.
3.1 Example: Monster Gridworld
This RL environment is a 2D fully observed gridworld. The agent must collect apples (+1 reward)
while avoiding monsters that chase it (-1 reward on collision). The agent may also pick up shields for
protection. When a monster collides with a shielded agent, both the monster and shield are destroyed.
See Appendix C.2 for further details. The optimal policy focuses on shields while monsters are
present, and on apples when there are no monsters.
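To pin down the incentive structure, here is a minimal sketch of the per-event reward (our reconstruction from the text; the environment details are in Appendix C.2, and the zero reward for a shielded collision is our assumption, since the text only says the monster and shield are destroyed).

```python
def step_reward(event: str) -> int:
    """Per-event reward in Monster Gridworld, as described in Section 3.1.

    Only the +1 / -1 values come from the text; everything else here is a
    simplifying assumption. Note that collecting a shield is never directly
    rewarded, which matters for the goal misgeneralization discussed below.
    """
    rewards = {
        "apple_collected": +1,
        "monster_collision_unshielded": -1,
        "monster_collision_shielded": 0,  # monster and shield destroyed (assumed 0 reward)
        "shield_collected": 0,            # shields are only instrumentally useful
    }
    return rewards.get(event, 0)
```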
Figure 2: Monster Gridworld. We visualize summary statistics for different agents over the course of an episode, averaging over 100 episodes of 200 steps. Agent trainN is trained on an episode length of N steps, and random is a random agent. Note that the lines corresponding to train100 and train200 are nearly identical and mostly overlap.
Our main agent of interest, train25, is trained on short episodes of length 25, but tested on long episodes of length 200. As shown in Figure 2, relative to train200, which is trained directly on episodes of length 200, train25 collects more shields and fewer apples.

Why does this happen? During the first 25 steps, monsters are almost always present and agents focus on collecting shields. This leads to a spectrum of goals in the training setting for train25: prefer shields over apples always (maximally misgeneralized), or only when monsters are present (intended). Note that the information required to distinguish these goals is present during training: the agent does consume some apples and get positive reward, and the agent is never positively rewarded for getting a shield. Nonetheless, agents pursuing the misgeneralized goal would perform well in the training situation, and this is sufficient to make goal misgeneralization possible.

After 25 steps, trained agents often destroy all the monsters, inducing a distribution shift for the train25 agent. It then continues to capably collect shields (higher blue curve) at the expense of apples (lower green curve). Thus, the test goal is in the middle of the spectrum: train25 collects somewhat fewer shields and more apples, but not as few shields or as many apples as train200.

Increasing diversity in training fixes the issue: the train100 agent encounters situations with no monsters, and so generalizes successfully, behaving almost identically to the train200 agent. These agents collect shields at roughly the same rate as a random agent once the monsters are destroyed.
3.2 Example: Tree Gridworld
In this 2D fully observed gridworld, the agent collects reward by chopping trees, which removes
the trees from the environment. Trees respawn at a rate that increases with the number of trees left.
When there are no trees left, the respawn rate is very small but positive (see Appendix C.3).
We consider the online, reset-free setting, in which the agent acts and learns in the environment, without any train/test distinction and without an ability to reset the environment (preventing episodic learning). To cast this within our framework, we say that at a given timestep, $\mathcal{D}_{\text{train}}$ consists of all past experience the agent has accumulated. The optimal policy in this setting is to chop trees sustainably: the agent should chop fewer trees when they are scarce, to keep the respawn rate high.
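The sustainability trade-off can be made concrete with a toy simulation (entirely our own parameterization; the real respawn rates are in Appendix C.3). Under any respawn rule that rewards leaving trees standing, chopping as fast as possible collapses the forest, while a throttled policy keeps the respawn rate, and hence the long-run reward, high.

```python
import numpy as np

def respawn_probability(num_trees: int, base: float = 0.001, per_tree: float = 0.05) -> float:
    """Per-step probability that one tree respawns. Illustrative numbers only:
    the rate grows with the number of trees left and stays small but positive
    at zero trees (the actual rates are in the paper's Appendix C.3)."""
    return base + per_tree * num_trees

def average_reward(chop_prob: float, steps: int = 100_000, start_trees: int = 10,
                   max_trees: int = 10, seed: int = 0) -> float:
    """Average trees chopped per step for a policy that chops an available
    tree with probability `chop_prob` on each step."""
    rng = np.random.default_rng(seed)
    trees, chopped = start_trees, 0
    for _ in range(steps):
        if trees > 0 and rng.random() < chop_prob:
            trees -= 1
            chopped += 1
        if trees < max_trees and rng.random() < respawn_probability(trees):
            trees += 1
    return chopped / steps

print("chop as fast as possible:", average_reward(chop_prob=1.0))  # forest collapses, roughly 0.001/step
print("chop sustainably:        ", average_reward(chop_prob=0.3))  # stays near roughly 0.3/step
```

With these illustrative numbers, the greedy policy quickly exhausts the trees and thereafter earns reward only at the tiny zero-tree respawn rate, mirroring the collapse described for the trained agent below.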