Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton
OpenAI
Abstract
In reinforcement learning from human feedback, it is common to optimize against
a reward model trained to predict human preferences. Because the reward model
is an imperfect proxy, optimizing its value too much can hinder ground truth
performance, in accordance with Goodhart’s law. This effect has been frequently
observed, but not carefully measured due to the expense of collecting human
preference data. In this work, we use a synthetic setup in which a fixed “gold-
standard” reward model plays the role of humans, providing labels used to train a
proxy reward model. We study how the gold reward model score changes as we
optimize against the proxy reward model using either reinforcement learning or
best-of-n sampling. We find that this relationship follows a different functional
form depending on the method of optimization, and that in both cases its coefficients
scale smoothly with the number of reward model parameters. We also study the
effect on this relationship of the size of the reward model dataset, the number of
reward model and policy parameters, and the coefficient of the KL penalty added
to the reward in the reinforcement learning setup. We explore the implications of
these empirical results for theoretical considerations in AI alignment.
1 Introduction
Goodhart’s law is an adage that states, “When a measure becomes a target, it ceases to be a good
measure.” In machine learning, this effect arises with proxy objectives provided by static learned
models, such as discriminators and reward models. Optimizing too much against such a model
eventually hinders the true objective, a phenomenon we refer to as overoptimization. It is important to
understand the size of this effect and how it scales, in order to predict how much a learned model can
be safely optimized against. Moreover, studying this effect empirically could aid in the development
of theoretical models of Goodhart’s law for neural networks, which could be critical for avoiding
dangerous misalignment of future AI systems.
In this work, we study overoptimization in the context of large language models fine-tuned as
reward models trained to predict which of two options a human will prefer. Such reward models
have been used to train language models to perform a variety of complex tasks that are hard to
judge automatically, including summarization [Stiennon et al., 2020], question-answering [Nakano
et al., 2021, Menick et al., 2022], and general assistance [Ouyang et al., 2022, Bai et al., 2022,
Glaese et al., 2022]. Typically, the reward model score is optimized using either policy gradient-
based reinforcement learning or best-of-n sampling, also known as rejection sampling or reranking.
Overoptimization can occur with both methods, and we study both to better understand whether and
how overoptimization behaves differently across the two methods.
A major challenge in studying overoptimization in this context is the expense of collecting human
preference labels. A large number of labels are required to accurately estimate overall preference
probabilities, and this is exacerbated by small effect sizes and the need to take many measurements in
order to fit scaling laws. To overcome this, we use a synthetic setup that is described in Section 2, in
which labels are supplied by a “gold-standard” reward model (RM) instead of humans.
Our main results are empirically validated functional forms for the gold reward model score $R$
as a function of the Kullback–Leibler divergence from the initial policy to the optimized policy,
$\mathrm{KL} := D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})$; the form depends on the method of optimization used. This KL divergence
between the initial and optimized policies increases monotonically during RL training (fig. 14),
and can be computed analytically as a function of $n$ for BoN. Further, because it is a quadratic
metric of distance [Bai et al., 2022, Section 4.3], we define $d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}$ and write
our functional forms in terms of $d$.
We find empirically that for best-of-n (BoN) sampling,

$$R_{\mathrm{bon}}(d) = d(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}} d),$$

and for reinforcement learning,[1]

$$R_{\mathrm{RL}}(d) = d(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d).$$
Here, $R(0) := 0$ by definition, and $\alpha_{\mathrm{RL}}$, $\beta_{\mathrm{RL}}$, $\alpha_{\mathrm{bon}}$, and $\beta_{\mathrm{bon}}$ are parameters that may depend on the
number of proxy reward model parameters, the size of the proxy reward model dataset, and so on.
We see that these scaling laws make accurate predictions.
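As a concrete reading of these functional forms, the sketch below (Python; the coefficient values are purely illustrative, not fitted values from our experiments) evaluates the predicted gold RM score from a measured KL divergence.

```python
import numpy as np

def gold_score_bon(d, alpha_bon, beta_bon):
    # Predicted gold RM score under best-of-n: R_bon(d) = d * (alpha - beta * d)
    return d * (alpha_bon - beta_bon * d)

def gold_score_rl(d, alpha_rl, beta_rl):
    # Predicted gold RM score under RL: R_RL(d) = d * (alpha - beta * log d)
    # Note: this form has infinite slope at d = 0, so it is not meant to hold near the origin.
    return d * (alpha_rl - beta_rl * np.log(d))

kl = np.linspace(0.1, 10.0, 100)   # KL divergence in nats
d = np.sqrt(kl)                    # d := sqrt(D_KL(pi || pi_init))

# Illustrative coefficients only.
print(gold_score_bon(d, alpha_bon=1.0, beta_bon=0.25).max())
print(gold_score_rl(d, alpha_rl=1.0, beta_rl=0.35).max())
```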
We also find the following.
• RL versus best-of-n. As a function of the KL divergence, reinforcement learning tends to
be slower than best-of-n sampling at both optimization and overoptimization. This suggests
inadequacies with using KL to compare the amount of (over)optimization across methods.
However, the relationship between the proxy reward model score and the gold reward model
score is similar for both methods.

• Smooth coefficient scaling. The $\alpha$ and $\beta$ coefficients in the BoN and RL functional forms
vary smoothly with the number of proxy reward model parameters, following approximate
logarithmic trends.[2] This allows prediction of the attained gold RM score.

• Weak dependence on policy size. While larger policies perform better overall and benefit
less from optimization against an RM as measured by the increase in gold reward, they lead to
very similar amounts of overoptimization, as measured through the gap between the proxy
and gold scores (which indicates the shortfall between predicted and actual reward) and the KL
distance at which the maximum gold RM score is attained.

• KL penalty ineffectiveness. In our reinforcement learning setup, using a KL penalty
increases the proxy reward model score that can be achieved for a given KL divergence, but
this does not correspond to a measurable improvement in the gold RM score–$\mathrm{KL}_{\mathrm{RL}}$ frontier.
However, we note this result could be particularly sensitive to hyperparameters. (A sketch of
how such a penalty is typically folded into the RL reward appears after this list.)
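The sketch below is a generic illustration of how a KL penalty is commonly added to the proxy reward in RLHF-style PPO training; it is not the exact implementation used here, and `kl_coef` and the single-sample KL estimate are simplifications.

```python
def kl_penalized_reward(proxy_reward, logprob_policy, logprob_init, kl_coef):
    """Shape the proxy RM reward with a KL penalty toward the initial policy.

    logprob_policy and logprob_init are log-probabilities of the sampled tokens
    under the current and initial policies; their difference is a sample-based
    estimate of the KL term.
    """
    kl_estimate = logprob_policy - logprob_init
    return proxy_reward - kl_coef * kl_estimate
```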
Finally, we discuss the implications of these findings for Reinforcement Learning From Human
Feedback (RLHF), existing models of Goodhart’s law, and AI Alignment more broadly.
2 Methodology
The setting used throughout this paper is the same as for InstructGPT [Ouyang et al., 2022]. In
our environment, the observations are text prompts and the policy is used to generate a response to
the prompt. The prompts are drawn from a broad range of natural language instructions describing
different language model tasks. Then, a learned RM is used to provide the reward signal for the
response, which is used by either RL or BoN for optimization.
For all experiments, we use pretrained GPT-3 series language models as the initial checkpoint [Brown
et al., 2020]. All initial policies are trained with supervised fine-tuning (SFT) on human-generated
InstructGPT demonstrations [Ouyang et al., 2022] for 2 epochs. All RMs also use the GPT-3
architecture but have an added scalar head to output the reward.
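A reward model of this kind can be summarized as a transformer backbone with a scalar head. The snippet below is a schematic PyTorch-style illustration, not the actual GPT-3-based implementation; it assumes the backbone returns hidden states of shape (batch, seq_len, hidden_dim).

```python
import torch.nn as nn

class ScalarHeadRewardModel(nn.Module):
    """Schematic RM: pretrained transformer backbone plus a scalar reward head."""

    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone                 # assumed to return (batch, seq, hidden_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask)
        last_token = hidden[:, -1, :]            # reward read off the final token's state
        return self.reward_head(last_token).squeeze(-1)
```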
[1] We note that this form likely does not hold near the origin, as it has infinite slope there. We experimented
with a number of different forms, but found worse fits and extrapolation. See appendix B for more details.
[2] The coefficient $\alpha_{\mathrm{RL}}$ in particular is nearly independent of RM parameter count.
Figure 1: Reward model (RM) parameter size scaling experiments using the InstructGPT environment,
for (a) BoN and (b) RL. Policy size is held constant (1.2B), while reward model size is varied. The
x-axes have a square-root scale. Note that the plots have different x-axes. The gold reward represents
the ground truth reward; we observe that when we optimize for a learned proxy of the gold reward,
the gold reward initially increases and later decreases. We show that our functional forms fit this
effect well.
Figure 2: Diagram of the real and synthetic RM training setups. Human labellers generate comparison
data. In the real RLHF setting, this data is used to train a proxy RM that is optimized by RL/BoN. In
our synthetic setting, we instead use a “Gold RM” as our ground truth. In both settings, the proxy
RM is a proxy for the ground truth process generating the labels (either the human or gold RM).
The RL experiments use Proximal Policy Optimization (PPO) [Schulman et al., 2017]. The KL penalty for
all RL experiments is set to 0 except in section 3.6. See appendix C for all other hyperparameters.
We mostly use defaults for the PPO hyperparameters; thus, it is possible that there exist different
trends for other hyperparameter configurations.
In BoN, we generate $n$ trajectories for the policy and use the reward model to pick the one with the
highest proxy RM score. We use the unbiased estimator from Nakano et al. [2021, Appendix I] to
compute all of the gold and proxy scores for intermediate $n$ between 1 and the maximum $n$; this has
lower variance and is more efficient than the naive estimator of repeatedly sampling $n$ trajectories
with replacement and taking the mean of the maximum gold and proxy RM scores. The KL distances
for BoN are computed analytically: $\mathrm{KL}_{\mathrm{bon}} = \log n - \frac{n-1}{n}$ [Stiennon et al., 2020, Appendix G.3].
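The analytic KL expression and the basic BoN selection step are simple to reproduce. Here is a minimal sketch; it uses naive argmax selection rather than the unbiased estimator of Nakano et al. [2021].

```python
import numpy as np

def kl_best_of_n(n):
    # KL_bon = log(n) - (n - 1) / n, in nats
    n = np.asarray(n, dtype=float)
    return np.log(n) - (n - 1.0) / n

def best_of_n(samples, proxy_scores):
    # Return the sample with the highest proxy RM score.
    return samples[int(np.argmax(proxy_scores))]

print(kl_best_of_n([1, 1000, 60000]))   # ~0, ~5.9, ~10.0 nats
```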
2.1 Synthetic Data Setup
Because getting a ground truth gold reward signal from human labellers is expensive, we instead
use a synthetic task where the ground truth is defined to be the output of a particular large “gold”
RM. The 6B reward model from Ouyang et al. [2022] is used as the gold RM, and our proxy RMs
vary from 3M to 3B parameters.[3] This synthetic gold reward is used to label pairs of rollouts from
the policy given the same prompt to create synthetic RM training data. The synthetic comparisons
are created deterministically by always marking the trajectory with the higher gold RM score as
preferred.[4] We generate 100,000 synthetic comparisons and reserve 10% of these as a held-out test
set for computing the validation loss of RMs.
See fig. 2 for a diagram of the synthetic setup.
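The labeling step can be summarized by the following sketch; function and field names are placeholders, and `gold_rm_score` stands in for a call to the 6B gold RM.

```python
def make_synthetic_comparison(prompt, policy_sample, gold_rm_score):
    # Sample two rollouts for the same prompt and label them with the gold RM.
    response_a = policy_sample(prompt)
    response_b = policy_sample(prompt)
    score_a = gold_rm_score(prompt, response_a)
    score_b = gold_rm_score(prompt, response_b)
    # Deterministic (hard) label: the higher-scoring rollout is always marked preferred.
    if score_a >= score_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```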
2.2 Recalibration
The RM scores are translation-invariant, so to ensure comparability across different reward models,
we recenter each RM such that the average reward of the initial policy is 0. We also unit normalize
the variance of the gold RM scores.[5] Because our hard-thresholding synthetic data setup produces
labels that are miscalibrated (since they do not incorporate the gold RM’s confidence), we recalibrate
the proxy RMs by rescaling the logits to minimize cross-entropy loss using a validation set of soft
labels. All renormalization and recalibration is applied after the experiments; this does not affect
BoN at all, and likely has no impact on RL because Adam is loss-scale invariant, though it is possible
that there are slight differences due to algorithmic details.
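As a rough illustration of the recentering and logit-rescaling steps (a minimal sketch; the optimizer used for the rescaling is not specified above, so scipy's bounded scalar minimization is assumed here):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def recenter(rm_scores, init_policy_scores):
    # Shift scores so the initial policy's average reward is 0 (RM scores are translation-invariant).
    return rm_scores - np.mean(init_policy_scores)

def fit_logit_scale(proxy_logit_diffs, gold_soft_labels):
    # Find a scalar multiplier for the proxy RM's logit differences that minimizes
    # cross-entropy against soft labels derived from the gold RM.
    def cross_entropy(scale):
        p = 1.0 / (1.0 + np.exp(-scale * proxy_logit_diffs))
        eps = 1e-9
        return -np.mean(gold_soft_labels * np.log(p + eps)
                        + (1.0 - gold_soft_labels) * np.log(1.0 - p + eps))
    return minimize_scalar(cross_entropy, bounds=(1e-3, 1e3), method="bounded").x
```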
[3] We originally trained two additional RMs smaller than 3M parameters, which achieved near-chance
accuracy and were off-trend, and so were excluded.
[4] We had experimented with sampling for creating labels, but observed noisier results.
[5] We later decided this was unnecessary but chose not to change it.
3 Results
3.1 Fitting and validating functional forms
We chose our functional forms through experimentation with all RM data and parameter scaling
curves in the remainder of this paper.
The BoN functional form was hypothesized using data up to $n = 1000$. In order to validate the
functional forms, we performed a BoN experiment with up to $n = 60{,}000$ (KL $\approx$ 10 nats), after
only having seen data up to $n = 1000$ (KL $\approx$ 6 nats). As this experiment was conducted after the
functional form was hypothesized based on data up to 6 nats, this was a true advance prediction.

We also test extrapolation of the BoN and RL functional forms from low KLs to unseen larger
KLs; see fig. 26 for details.
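This kind of extrapolation test can be mimicked with a standard least-squares fit: fit the BoN form on data below roughly 6 nats, then predict at roughly 10 nats. The sketch below uses synthetic stand-in data, not our measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def r_bon(d, alpha, beta):
    return d * (alpha - beta * d)

rng = np.random.default_rng(0)
kl_seen = np.linspace(0.1, 6.0, 40)              # "seen" regime, up to ~6 nats
d_seen = np.sqrt(kl_seen)
gold_seen = r_bon(d_seen, 1.0, 0.25) + rng.normal(0, 0.01, d_seen.shape)  # stand-in data

(alpha_hat, beta_hat), _ = curve_fit(r_bon, d_seen, gold_seen)
print("prediction at KL = 10 nats:", r_bon(np.sqrt(10.0), alpha_hat, beta_hat))
```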
We also attempted to model the proxy scores but were unable to obtain a satisfactory fit. For BoN,
despite visual similarity, a linear fit for the proxy BoN score did not work well (fig. 20). The proxy
score predictions for RL and BoN are not as easily modelled as the gold score predictions. We leave
a better understanding of the proxy RM score behavior to future work.
3.2 Scaling with RM Parameter Count
We hold policy size (1.2B) and data size (90,000) constant (fig. 1). We observe that for the gold RM
scores, $\alpha_{\mathrm{bon}}$ and $\beta_{\mathrm{bon}}$ change smoothly with RM size (figs. 3a and 3b). For RL, we find that we can
hold $\alpha_{\mathrm{RL}}$ constant across all RM sizes, resulting in a clean scaling curve for $\beta_{\mathrm{RL}}$ (fig. 3c). These
scaling laws allow us to predict properties of training runs; for instance, we can also predict the peak
gold RM scores for different RM sizes (fig. 12).
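Footnote 2 and fig. 3 describe approximately logarithmic trends in the coefficients; a fit of that shape can be done as in the sketch below. The RM sizes and coefficient values here are hypothetical placeholders, not our fitted numbers.

```python
import numpy as np

# Hypothetical (RM size, beta_RL) pairs; real values would come from fits like fig. 3c.
rm_params = np.array([3e6, 12e6, 50e6, 300e6, 1.2e9, 3e9])
beta_rl = np.array([0.30, 0.26, 0.22, 0.18, 0.14, 0.12])

# Fit beta_RL ~ a + b * log(N) and extrapolate to an unseen RM size.
b, a = np.polyfit(np.log(rm_params), beta_rl, 1)
print("predicted beta_RL at 6B params:", a + b * np.log(6e9))
```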
When modelled using the same functional forms as the respective gold scores, the proxy score fits
have much lower values of $\beta_{\mathrm{bon}}$. We also see smooth scaling in the proxy score’s $\alpha_{\mathrm{bon}}$ and $\beta_{\mathrm{bon}}$.
However, for the reasons in section 3.1, we are less confident about these fits. For both BoN and RL,
we observe systematic underestimates of the proxy reward model score when extrapolated to higher KLs.
Both appear to eventually grow roughly linearly in KL, as in Bai et al. [2022].
Figure 3: The values of $\alpha_{\mathrm{bon}}$ (panel a), $\beta_{\mathrm{bon}}$ (panel b), and $\beta_{\mathrm{RL}}$ (panel c) in the BoN and RL
overoptimization scaling laws for both proxy (dashed line) and gold (solid line) rewards as they scale
with parameter count.
3.3 Scaling with RM Data Size
We hold RM size constant (12M) and sweep RM data size for both RL and BoN.[6] Overall, the results
are consistent with intuition: more data leads to better gold scores and less goodharting. The scaling
of $\alpha$ and $\beta$ with data size is not as cleanly described as for RM size scaling (fig. 17, fig. 18).

For all RM sizes, we observe that for amounts of data less than around 2,000 comparisons,[7] there is
very little improvement over near-chance loss (fig. 6). This is also reflected in gold scores after
optimization (fig. 21). After this threshold, all models improve with more data, though larger RMs
[6] For BoN, we actually sweep all combinations of RM size and data size; see fig. 10. For a version of fig. 4a
against a 3B RM, see fig. 19.
[7] To test the hypothesis that some minimum number of RM finetuning steps is needed, we control for the
number of SGD steps by running multiple epochs and observe that running 4 epochs instead of 1 yields no
change in gold score whatsoever, whereas 1 epoch of 4 times as much data performs substantially better (fig. 13).