
not goal misgeneralization. Relative to these “random” failures, goal misgeneralization can lead to
significantly worse outcomes: following the anti-expert incurs a large negative reward, while
doing nothing or acting randomly would usually lead to a reward of 0 or 1. With more powerful
systems, coherent behaviour towards an unintended goal can produce catastrophic outcomes [6, 54].
In this paper, we advance our understanding of goal misgeneralization through four contributions:
• We provide an operationalization of goal misgeneralization (Section 2) that does not require
the RL framework assumed in Di Langosco et al. [20], nor the structural assumptions used
in Hubinger et al. [29].
• We show that goal misgeneralization can occur in practice by presenting several new
examples in hand-designed (Sections 3.1-3.3) and “in-the-wild” (Sections 3.4-3.5) settings.
• We apply the lens of goal misgeneralization for the first time to agent-induced distribution
shifts (Sections 3.1-3.2) and few-shot learning without RL (Section 3.3).
• We describe through concrete hypotheticals how goal misgeneralization provides a mecha-
nism by which powerful AI systems could pose a catastrophic risk to humanity (Section 4).
2 A model for goal misgeneralization
We present a general model for misgeneralization and then discuss the properties that characterize
goal misgeneralization in particular. We will focus on the case of deep learning since all of our main
examples in Section 3 use deep learning. However, our model is more general and can apply to any
learning system. We discuss a concrete example without deep learning in Appendix A.
2.1 Standard misgeneralization framework
We consider the standard picture for misgeneralization within the empirical risk minimization
framework. We aim to learn some function $f^* : \mathcal{X} \to \mathcal{Y}$ that maps inputs
$x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. For example, in classification problems,
$\mathcal{X}$ is the set of inputs and $\mathcal{Y}$ is the set of labels. In reinforcement
learning (RL), $\mathcal{X}$ is the set of states or observation histories, and $\mathcal{Y}$ is
the set of actions.
We consider a parameterized family of functions $\mathcal{F}_\Theta$, such as those implemented by
deep neural networks. Functions are selected based on a scoring function $s(f_\theta, \mathcal{D}_{\text{train}})$
that evaluates the performance of $f_\theta$ on the given dataset $\mathcal{D}_{\text{train}}$³.
Misgeneralization can occur when there are two parameterizations $\theta_1$ and $\theta_2$ such that
$f_{\theta_1}$ and $f_{\theta_2}$ both perform well on $\mathcal{D}_{\text{train}}$ but differ on
$\mathcal{D}_{\text{test}}$. Depending on which of $\theta_1$ and $\theta_2$ is chosen, we may then
get very bad scores on $\mathcal{D}_{\text{test}}$. Whether we get $f_{\theta_1}$ or $f_{\theta_2}$
depends on the inductive biases of the model and random effects (such as the random initialization
of model parameters).
Note that while sometimes $\mathcal{D}_{\text{test}}$ is assumed to be sampled from the same
distribution as $\mathcal{D}_{\text{train}}$, in this paper we primarily consider cases where it is
sampled from a different distribution, known as distribution shift. This further increases the risk
of misgeneralization.
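For concreteness, the toy sketch below (our own illustration, not an example from Section 3; the
feature names x1 and x2 and the data-generating process are assumptions) constructs two functions
that score perfectly on $\mathcal{D}_{\text{train}}$ yet diverge on a $\mathcal{D}_{\text{test}}$
whose distribution shift breaks a spurious correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: two binary features that are perfectly correlated,
# so the label can be predicted equally well from either one.
x1_train = rng.choice([-1.0, 1.0], size=1000)
x2_train = x1_train.copy()                # spurious proxy, correlated on D_train
y_train = (x1_train > 0).astype(float)    # ground truth depends only on x1

# Two parameterizations: stand-ins for functions f_theta1 and f_theta2 that
# the learning process could plausibly have selected.
def f_theta1(x1, x2):
    return (x1 > 0).astype(float)   # reads the intended feature

def f_theta2(x1, x2):
    return (x2 > 0).astype(float)   # reads the spurious proxy

def accuracy(f, x1, x2, y):
    return np.mean(f(x1, x2) == y)

print(accuracy(f_theta1, x1_train, x2_train, y_train))  # 1.0
print(accuracy(f_theta2, x1_train, x2_train, y_train))  # 1.0: indistinguishable on D_train

# Test distribution under shift: the correlation between x1 and x2 is broken.
x1_test = rng.choice([-1.0, 1.0], size=1000)
x2_test = rng.choice([-1.0, 1.0], size=1000)
y_test = (x1_test > 0).astype(float)

print(accuracy(f_theta1, x1_test, x2_test, y_test))  # still 1.0
print(accuracy(f_theta2, x1_test, x2_test, y_test))  # roughly 0.5: misgeneralization
```

Which of the two functions is selected depends only on inductive biases and randomness, since the
scoring function cannot distinguish them on $\mathcal{D}_{\text{train}}$.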
2.2 Goal misgeneralization
We now characterize goal misgeneralization. Intuitively, goal misgeneralization occurs when we
learn a function $f_{\theta_{\text{bad}}}$ that has robust capabilities but pursues an undesired goal.
It is quite challenging to define what a “capability” is in the context of neural networks. We
provide a provisional definition following Chen et al. [11]. We say that the model is capable of
some task $X$ in setting $Y$ if it can be quickly tuned to perform task $X$ well in setting $Y$
(relative to learning $X$ from scratch). For example, tuning could be done by prompt engineering or
by fine-tuning on a small quantity of data [52]. We emphasize that this is a provisional definition
and hope that future work will provide better definitions of what it means for a model to have a
particular “capability”.
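As a toy illustration of the “quickly tuned” criterion (our own sketch, not taken from Chen et al.
[11]; the linear tasks, dimensions, learning rate, and step counts are all arbitrary assumptions),
the code below compares tuning a pretrained linear model on a handful of examples from a related
task against learning the same task from scratch on those examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20  # input dimension

def make_task(w_true, n):
    """Linear binary classification task whose labels are determined by w_true."""
    X = rng.normal(size=(n, d))
    y = (X @ w_true > 0).astype(float)
    return X, y

def tune(X, y, w_init, lr=0.1, steps=100):
    """A modest number of logistic-regression gradient steps starting from w_init."""
    w = w_init.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return np.mean((X @ w > 0) == y)

# A pretraining task and a closely related downstream task (hypothetical stand-ins).
w_pre = rng.normal(size=d)
w_down = w_pre + 0.1 * rng.normal(size=d)

X_pre, y_pre = make_task(w_pre, n=2000)
X_few, y_few = make_task(w_down, n=5)        # only a handful of downstream examples
X_test, y_test = make_task(w_down, n=5000)   # held-out downstream evaluation

w_pretrained = tune(X_pre, y_pre, np.zeros(d), steps=500)

w_tuned = tune(X_few, y_few, w_pretrained)    # quick tuning of the pretrained model
w_scratch = tune(X_few, y_few, np.zeros(d))   # learning the downstream task from scratch

print("tuned from pretrained:", accuracy(w_tuned, X_test, y_test))
print("trained from scratch: ", accuracy(w_scratch, X_test, y_test))
# Starting from the pretrained model typically yields much higher held-out accuracy
# from the same few examples, i.e. the pretrained model was already "capable" of the task.
```

The gap between the two held-out scores, obtained from the same small tuning budget, is what this
criterion treats as evidence of a capability.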
Inspired by the intentional stance [19], we say that the model’s behaviour is consistent with a
goal to perform task $X$ in setting $Y$ if its behaviour in setting $Y$ can be viewed as solving
$X$, i.e. it performs
³The ‘dataset’ consists of the inputs over which losses and gradients are calculated. For example, in many
RL algorithms, the training dataset consists of the $(s, a, r, s')$ transitions used to compute the surrogate loss.