
Our main results are empirically validated functional forms for the gold reward model score $R$ as a function of the Kullback–Leibler divergence from the initial policy to the optimized policy, $\mathrm{KL} := D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})$, which depend on the method of optimization used. This KL divergence between the initial and optimized policies increases monotonically during RL training (fig. 14) and can be computed analytically as a function of $n$ for BoN. Further, because it is a quadratic metric of distance [Bai et al., 2022, Section 4.3], we define $d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}$ and write our functional forms in terms of $d$.
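As an illustrative sketch (not the paper's code), the analytic BoN KL and the corresponding distance $d$ can be computed as below; the closed form $\mathrm{KL}_{\mathrm{bon}}(n) = \log n - (n-1)/n$ is the standard expression for best-of-$n$ sampling when candidates are ranked by a continuous score.

```python
import numpy as np

def kl_bon(n: int) -> float:
    """Analytic KL divergence of best-of-n sampling from the initial policy.

    Uses the standard closed form KL_bon(n) = log(n) - (n - 1)/n, which holds
    when the n samples are ranked by a continuous proxy score.
    """
    return np.log(n) - (n - 1) / n

def d_from_kl(kl: float) -> float:
    """Distance d := sqrt(KL), the quantity the functional forms are written in."""
    return np.sqrt(kl)

# Example: distance reached by best-of-16 sampling.
print(d_from_kl(kl_bon(16)))  # = sqrt(log 16 - 15/16)
```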
We find empirically that for best-of-$n$ (BoN) sampling,
$$R_{\mathrm{bon}}(d) = d(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}} d),$$
and for reinforcement learning,¹
$$R_{\mathrm{RL}}(d) = d(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d).$$
Here, $R(0) := 0$ by definition, and $\alpha_{\mathrm{RL}}$, $\beta_{\mathrm{RL}}$, $\alpha_{\mathrm{bon}}$, and $\beta_{\mathrm{bon}}$ are parameters that may depend on the number of proxy reward model parameters, the size of the proxy reward model dataset, and so on.
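For concreteness, the two functional forms can be written directly in code. The sketch below uses placeholder coefficient names, and the commented `curve_fit` call is only one illustrative way to fit them to observed (d, gold score) pairs, not the paper's fitting procedure.

```python
import numpy as np

def r_bon(d, alpha_bon, beta_bon):
    """Predicted gold RM score for best-of-n sampling: R_bon(d) = d * (alpha - beta * d)."""
    return d * (alpha_bon - beta_bon * d)

def r_rl(d, alpha_rl, beta_rl):
    """Predicted gold RM score for RL: R_RL(d) = d * (alpha - beta * log d).

    Note: this form has infinite slope at d = 0, so it is not intended to
    hold exactly near the origin (see footnote 1).
    """
    return d * (alpha_rl - beta_rl * np.log(d))

# Given observed (d, gold score) pairs, the coefficients could be fit, e.g. with
# scipy.optimize.curve_fit:
# from scipy.optimize import curve_fit
# (alpha_bon, beta_bon), _ = curve_fit(r_bon, d_observed, gold_scores)
```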
We see that these scaling laws make accurate predictions.
We also find the following.
• RL versus best-of-n. As a function of the KL divergence, reinforcement learning tends to be slower than best-of-$n$ sampling at both optimization and overoptimization. This suggests inadequacies in using KL to compare the amount of (over)optimization across methods. However, the relationship between the proxy reward model score and the gold reward model score is similar for both methods.
• Smooth coefficient scaling. The $\alpha$ and $\beta$ coefficients in the BoN and RL functional forms vary smoothly with the number of proxy reward model parameters, following approximate logarithmic trends.² This allows prediction of the attained gold RM score; see the sketch after this list.
• Weak dependence on policy size. While larger policies perform better overall and benefit less from optimization against an RM, as measured by the increase in gold reward, they lead to very similar amounts of overoptimization, as measured both by the gap between the proxy and gold scores (which indicates the shortfall between predicted and actual reward) and by the KL distance at which the maximum gold RM score is attained.
• KL penalty ineffectiveness. In our reinforcement learning setup, using a KL penalty increases the proxy reward model score that can be achieved for a given KL divergence, but this does not correspond to a measurable improvement in the gold RM score–$\mathrm{KL}_{\mathrm{RL}}$ frontier. However, we note this result could be particularly sensitive to hyperparameters.
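As a sketch of the coefficient-scaling point above (referenced in the smooth coefficient scaling bullet), one can model a fitted coefficient as an affine function of the log RM parameter count and extrapolate. The parameter counts and coefficient values below are purely illustrative placeholders, not results from the paper.

```python
import numpy as np

# Hypothetical fitted coefficients for several proxy RM sizes; real values
# would come from fitting R_bon / R_RL to the experimental sweep.
rm_params = np.array([3e6, 3e7, 3e8, 3e9])          # proxy RM parameter counts
beta_bon_fits = np.array([0.30, 0.22, 0.15, 0.08])  # placeholder values

# Approximate logarithmic trend: beta_bon(N) ~ c0 + c1 * log(N).
c1, c0 = np.polyfit(np.log(rm_params), beta_bon_fits, deg=1)

def predict_beta_bon(n_params: float) -> float:
    """Extrapolate beta_bon to an unseen RM size via the fitted log-linear trend."""
    return c0 + c1 * np.log(n_params)

print(predict_beta_bon(3e10))
```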
Finally, we discuss the implications of these findings for Reinforcement Learning From Human
Feedback (RLHF), existing models of Goodhart’s law, and AI Alignment more broadly.
2 Methodology
The setting used throughout this paper is the same as for InstructGPT [Ouyang et al., 2022]. In
our environment, the observations are text prompts and the policy is used to generate a response to
the prompt. The prompts are drawn from a broad range of natural language instructions describing
different language model tasks. Then, a learned RM is used to provide the reward signal for the
response, which is used by either RL or BoN for optimization.
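To make the BoN procedure concrete, here is a minimal sketch; `policy.sample` and `proxy_rm.score` are hypothetical interfaces standing in for the actual sampling and reward-model scoring code, which is not specified at this level of detail.

```python
def best_of_n(prompt, policy, proxy_rm, n=16):
    """Best-of-n (BoN) sampling against a proxy reward model.

    Draws n responses from the policy and returns the one with the highest
    proxy RM score. `policy.sample` and `proxy_rm.score` are hypothetical
    interfaces used only for illustration.
    """
    responses = [policy.sample(prompt) for _ in range(n)]
    scores = [proxy_rm.score(prompt, r) for r in responses]
    return responses[scores.index(max(scores))]
```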
For all experiments, we use pretrained GPT-3 series language models as the initial checkpoint [Brown
et al., 2020]. All initial policies are trained with supervised fine-tuning (SFT) on human-generated
InstructGPT demonstrations [Ouyang et al., 2022] for 2 epochs. All RMs also use the GPT-3
architecture but have an added scalar head to output the reward.
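Below is a minimal sketch of the "transformer plus scalar head" reward model described above, assuming a Hugging Face causal backbone (here GPT-2) as a stand-in for the GPT-3 architecture, which is not publicly available. Reading the reward from the last non-padding token's hidden state is a common convention rather than a detail given in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # GPT-2 used as a stand-in backbone

class ScalarHeadRewardModel(nn.Module):
    """Transformer backbone with an added scalar head that outputs the reward."""

    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Take the hidden state of the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        # One scalar reward per (prompt, response) sequence.
        return self.reward_head(pooled).squeeze(-1)
```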
¹ We note that this form likely does not hold near the origin, as it has infinite slope there. We experimented with a number of different forms, but found worse fits and extrapolation. See appendix B for more details.
² The coefficient $\alpha_{\mathrm{RL}}$ in particular is nearly independent of RM parameter count.