
not goal misgeneralization. Relative to these “random” failures, goal misgeneralization can lead to
significantly worse outcomes: following the anti-expert incurs a large negative reward, while
doing nothing or acting randomly would usually lead to a reward of 0 or 1. With more powerful
systems, coherent behaviour towards an unintended goal can produce catastrophic outcomes [6, 54].
In this paper, we advance our understanding of goal misgeneralization through four contributions:
• We provide an operationalization of goal misgeneralization (Section 2) that does not require
the RL framework assumed in Di Langosco et al. [20], nor the structural assumptions used
in Hubinger et al. [29].
• We show that goal misgeneralization can occur in practice by presenting several new
examples in hand-designed (Sections 3.1-3.3) and “in-the-wild” (Sections 3.4-3.5) settings.
• We apply the lens of goal misgeneralization for the first time to agent-induced distribution
shifts (Sections 3.1-3.2) and few-shot learning without RL (Section 3.3).
• We describe through concrete hypotheticals how goal misgeneralization provides a mecha-
nism by which powerful AI systems could pose a catastrophic risk to humanity (Section 4).
2 A model for goal misgeneralization
We present a general model for misgeneralization and then discuss the properties that characterize
goal misgeneralization in particular. We will focus on the case of deep learning since all of our main
examples in Section 3 use deep learning. However, our model is more general and can apply to any
learning system. We discuss a concrete example without deep learning in Appendix A.
2.1 Standard misgeneralization framework
We consider the standard picture for misgeneralization within the empirical risk minimization
framework. We aim to learn some function $f^* : \mathcal{X} \to \mathcal{Y}$ that maps inputs
$x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. For example, in classification problems,
$\mathcal{X}$ is the set of inputs and $\mathcal{Y}$ is the set of labels. In reinforcement
learning (RL), $\mathcal{X}$ is the set of states or observation histories, and $\mathcal{Y}$ is
the set of actions.
We consider a parameterized family of functions $\mathcal{F}_\Theta$, such as those implemented by
deep neural networks. Functions are selected based on a scoring function $s(f_\theta, \mathcal{D}_{\text{train}})$
that evaluates the performance of $f_\theta$ on the given dataset $\mathcal{D}_{\text{train}}$³.
Misgeneralization can occur when there are two parameterizations $\theta_1$ and $\theta_2$ such that
$f_{\theta_1}$ and $f_{\theta_2}$ both perform well on $\mathcal{D}_{\text{train}}$ but differ on
$\mathcal{D}_{\text{test}}$. Depending on which of $\theta_1$ and $\theta_2$ is chosen, we may then
get very bad scores on $\mathcal{D}_{\text{test}}$. Whether we get $f_{\theta_1}$ or $f_{\theta_2}$
depends on the inductive biases of the model and random effects (such as the random initialization
of model parameters).
Note that while sometimes $\mathcal{D}_{\text{test}}$ is assumed to be sampled from the same
distribution as $\mathcal{D}_{\text{train}}$, in this paper we primarily consider cases where it is
sampled from a different distribution, known as distribution shift. This further increases the risk
of misgeneralization.
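For concreteness, the toy sketch below (our own illustration, not an example from Section 3; the
feature names x1 and x2 and the data-generating process are assumptions) constructs two functions
that score perfectly on $\mathcal{D}_{\text{train}}$ yet diverge on a $\mathcal{D}_{\text{test}}$
whose distribution shift breaks a spurious correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: two binary features that are perfectly correlated,
# so the label can be predicted equally well from either one.
x1_train = rng.choice([-1.0, 1.0], size=1000)
x2_train = x1_train.copy()                # spurious proxy, correlated on D_train
y_train = (x1_train > 0).astype(float)    # ground truth depends only on x1

# Two parameterizations: stand-ins for functions f_theta1 and f_theta2 that
# the learning process could plausibly have selected.
def f_theta1(x1, x2):
    return (x1 > 0).astype(float)   # reads the intended feature

def f_theta2(x1, x2):
    return (x2 > 0).astype(float)   # reads the spurious proxy

def accuracy(f, x1, x2, y):
    return np.mean(f(x1, x2) == y)

print(accuracy(f_theta1, x1_train, x2_train, y_train))  # 1.0
print(accuracy(f_theta2, x1_train, x2_train, y_train))  # 1.0: indistinguishable on D_train

# Test distribution under shift: the correlation between x1 and x2 is broken.
x1_test = rng.choice([-1.0, 1.0], size=1000)
x2_test = rng.choice([-1.0, 1.0], size=1000)
y_test = (x1_test > 0).astype(float)

print(accuracy(f_theta1, x1_test, x2_test, y_test))  # still 1.0
print(accuracy(f_theta2, x1_test, x2_test, y_test))  # roughly 0.5: misgeneralization
```

Which of the two functions is selected depends only on inductive biases and randomness, since the
scoring function cannot distinguish them on $\mathcal{D}_{\text{train}}$.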
2.2 Goal misgeneralization
We now characterize goal misgeneralization. Intuitively, goal misgeneralization occurs when we
learn a function $f_{\theta_{\text{bad}}}$ that has robust capabilities but pursues an undesired goal.
It is quite challenging to define what a “capability” is in the context of neural networks. We
provide a provisional definition following Chen et al. [11]. We say that the model is capable of
some task $X$ in setting $Y$ if it can be quickly tuned to perform task $X$ well in setting $Y$
(relative to learning $X$ from scratch). For example, tuning could be done by prompt engineering or
by fine-tuning on a small quantity of data [52]. We emphasize that this is a provisional definition
and hope that future work will provide better definitions of what it means for a model to have a
particular “capability”.
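As a toy illustration of the “quickly tuned” criterion (our own sketch, not taken from Chen et al.
[11]; the linear tasks, dimensions, learning rate, and step counts are all arbitrary assumptions),
the code below compares tuning a pretrained linear model on a handful of examples from a related
task against learning the same task from scratch on those examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20  # input dimension

def make_task(w_true, n):
    """Linear binary classification task whose labels are determined by w_true."""
    X = rng.normal(size=(n, d))
    y = (X @ w_true > 0).astype(float)
    return X, y

def tune(X, y, w_init, lr=0.1, steps=100):
    """A modest number of logistic-regression gradient steps starting from w_init."""
    w = w_init.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return np.mean((X @ w > 0) == y)

# A pretraining task and a closely related downstream task (hypothetical stand-ins).
w_pre = rng.normal(size=d)
w_down = w_pre + 0.1 * rng.normal(size=d)

X_pre, y_pre = make_task(w_pre, n=2000)
X_few, y_few = make_task(w_down, n=5)        # only a handful of downstream examples
X_test, y_test = make_task(w_down, n=5000)   # held-out downstream evaluation

w_pretrained = tune(X_pre, y_pre, np.zeros(d), steps=500)

w_tuned = tune(X_few, y_few, w_pretrained)    # quick tuning of the pretrained model
w_scratch = tune(X_few, y_few, np.zeros(d))   # learning the downstream task from scratch

print("tuned from pretrained:", accuracy(w_tuned, X_test, y_test))
print("trained from scratch: ", accuracy(w_scratch, X_test, y_test))
# Starting from the pretrained model typically yields much higher held-out accuracy
# from the same few examples, i.e. the pretrained model was already "capable" of the task.
```

The gap between the two held-out scores, obtained from the same small tuning budget, is what this
criterion treats as evidence of a capability.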
Inspired by the intentional stance [19], we say that the model’s behaviour is consistent with a
goal to perform task $X$ in setting $Y$ if its behaviour in setting $Y$ can be viewed as solving
$X$, i.e. it performs
³The ‘dataset’ consists of the inputs over which losses and gradients are calculated. For example, in many
RL algorithms, the training dataset consists of the $(s, a, r, s')$ transitions used to compute the surrogate loss.