Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton
OpenAI
Abstract
In reinforcement learning from human feedback, it is common to optimize against
a reward model trained to predict human preferences. Because the reward model
is an imperfect proxy, optimizing its value too much can hinder ground truth
performance, in accordance with Goodhart’s law. This effect has been frequently
observed, but not carefully measured due to the expense of collecting human
preference data. In this work, we use a synthetic setup in which a fixed “gold-
standard” reward model plays the role of humans, providing labels used to train a
proxy reward model. We study how the gold reward model score changes as we
optimize against the proxy reward model using either reinforcement learning or
best-of-n sampling. We find that this relationship follows a different functional
form depending on the method of optimization, and that in both cases its coefficients
scale smoothly with the number of reward model parameters. We also study the
effect on this relationship of the size of the reward model dataset, the number of
reward model and policy parameters, and the coefficient of the KL penalty added
to the reward in the reinforcement learning setup. We explore the implications of
these empirical results for theoretical considerations in AI alignment.
1 Introduction
Goodhart’s law is an adage that states, “When a measure becomes a target, it ceases to be a good
measure.” In machine learning, this effect arises with proxy objectives provided by static learned
models, such as discriminators and reward models. Optimizing too much against such a model
eventually hinders the true objective, a phenomenon we refer to as overoptimization. It is important to
understand the size of this effect and how it scales, in order to predict how much a learned model can
be safely optimized against. Moreover, studying this effect empirically could aid in the development
of theoretical models of Goodhart’s law for neural networks, which could be critical for avoiding
dangerous misalignment of future AI systems.
In this work, we study overoptimization in the context of large language models fine-tuned as
reward models trained to predict which of two options a human will prefer. Such reward models
have been used to train language models to perform a variety of complex tasks that are hard to
judge automatically, including summarization [Stiennon et al., 2020], question-answering [Nakano
et al., 2021, Menick et al., 2022], and general assistance [Ouyang et al., 2022, Bai et al., 2022,
Glaese et al., 2022]. Typically, the reward model score is optimized using either policy gradient-
based reinforcement learning or best-of-n sampling, also known as rejection sampling or reranking.
Overoptimization can occur with both methods, and we study both to better understand whether and
how overoptimization behaves differently across the two methods.
A major challenge in studying overoptimization in this context is the expense of collecting human
preference labels. A large number of labels are required to accurately estimate overall preference
probabilities, and this is exacerbated by small effect sizes and the need to take many measurements in
order to fit scaling laws. To overcome this, we use a synthetic setup that is described in Section 2, in
which labels are supplied by a “gold-standard” reward model (RM) instead of humans.
Our main results are empirically validated functional forms for the gold reward model score $R$
as a function of the Kullback–Leibler divergence from the initial policy to the optimized policy,
$\mathrm{KL} := D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})$; the form depends on the method of optimization used. This KL divergence
between the initial and optimized policies increases monotonically during RL training (fig. 14),
and can be computed analytically as a function of $n$ for BoN. Further, because it is a quadratic
metric of distance [Bai et al., 2022, Section 4.3], we define $d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}$ and write
our functional forms in terms of $d$.
We find empirically that for best-of-n (BoN) sampling,

$$R_{\mathrm{bon}}(d) = d(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}} d),$$

and for reinforcement learning,[1]

$$R_{\mathrm{RL}}(d) = d(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d).$$
Here, $R(0) := 0$ by definition, and $\alpha_{\mathrm{RL}}$, $\beta_{\mathrm{RL}}$, $\alpha_{\mathrm{bon}}$, and $\beta_{\mathrm{bon}}$ are parameters that may depend on the
number of proxy reward model parameters, the size of the proxy reward model dataset, and so on.
We see that these scaling laws make accurate predictions.
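As a concrete reading of these functional forms, the sketch below (Python; the coefficient values are purely illustrative, not fitted values from our experiments) evaluates the predicted gold RM score from a measured KL divergence.

```python
import numpy as np

def gold_score_bon(d, alpha_bon, beta_bon):
    # Predicted gold RM score under best-of-n: R_bon(d) = d * (alpha - beta * d)
    return d * (alpha_bon - beta_bon * d)

def gold_score_rl(d, alpha_rl, beta_rl):
    # Predicted gold RM score under RL: R_RL(d) = d * (alpha - beta * log d)
    # Note: this form has infinite slope at d = 0, so it is not meant to hold near the origin.
    return d * (alpha_rl - beta_rl * np.log(d))

kl = np.linspace(0.1, 10.0, 100)   # KL divergence in nats
d = np.sqrt(kl)                    # d := sqrt(D_KL(pi || pi_init))

# Illustrative coefficients only.
print(gold_score_bon(d, alpha_bon=1.0, beta_bon=0.25).max())
print(gold_score_rl(d, alpha_rl=1.0, beta_rl=0.35).max())
```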
We also find the following.
• RL versus best-of-n. As a function of the KL divergence, reinforcement learning tends to
be slower than best-of-n sampling at both optimization and overoptimization. This suggests
inadequacies with using KL to compare the amount of (over)optimization across methods.
However, the relationship between the proxy reward model score and the gold reward model
score is similar for both methods.

• Smooth coefficient scaling. The $\alpha$ and $\beta$ coefficients in the BoN and RL functional forms
vary smoothly with the number of proxy reward model parameters, following approximate
logarithmic trends.[2] This allows prediction of the attained gold RM score.

• Weak dependence on policy size. While larger policies perform better overall and benefit
less from optimization against an RM as measured by the increase in gold reward, they lead to
very similar amounts of overoptimization, as measured through the gap between the proxy
and gold scores (which indicates the shortfall between predicted and actual reward) and the KL
distance at which the maximum gold RM score is attained.

• KL penalty ineffectiveness. In our reinforcement learning setup, using a KL penalty
increases the proxy reward model score that can be achieved for a given KL divergence, but
this does not correspond to a measurable improvement in the gold RM score–$\mathrm{KL}_{\mathrm{RL}}$ frontier.
However, we note this result could be particularly sensitive to hyperparameters. (A sketch of
how such a penalty is typically folded into the RL reward appears after this list.)
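The sketch below is a generic illustration of how a KL penalty is commonly added to the proxy reward in RLHF-style PPO training; it is not the exact implementation used here, and `kl_coef` and the single-sample KL estimate are simplifications.

```python
def kl_penalized_reward(proxy_reward, logprob_policy, logprob_init, kl_coef):
    """Shape the proxy RM reward with a KL penalty toward the initial policy.

    logprob_policy and logprob_init are log-probabilities of the sampled tokens
    under the current and initial policies; their difference is a sample-based
    estimate of the KL term.
    """
    kl_estimate = logprob_policy - logprob_init
    return proxy_reward - kl_coef * kl_estimate
```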
Finally, we discuss the implications of these findings for Reinforcement Learning From Human
Feedback (RLHF), existing models of Goodhart’s law, and AI Alignment more broadly.
2 Methodology
The setting used throughout this paper is the same as for InstructGPT [Ouyang et al., 2022]. In
our environment, the observations are text prompts and the policy is used to generate a response to
the prompt. The prompts are drawn from a broad range of natural language instructions describing
different language model tasks. Then, a learned RM is used to provide the reward signal for the
response, which is used by either RL or BoN for optimization.
For all experiments, we use pretrained GPT-3 series language models as the initial checkpoint [Brown
et al., 2020]. All initial policies are trained with supervised fine-tuning (SFT) on human-generated
InstructGPT demonstrations [Ouyang et al., 2022] for 2 epochs. All RMs also use the GPT-3
architecture but have an added scalar head to output the reward.
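A reward model of this kind can be summarized as a transformer backbone with a scalar head. The snippet below is a schematic PyTorch-style illustration, not the actual GPT-3-based implementation; it assumes the backbone returns hidden states of shape (batch, seq_len, hidden_dim).

```python
import torch.nn as nn

class ScalarHeadRewardModel(nn.Module):
    """Schematic RM: pretrained transformer backbone plus a scalar reward head."""

    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone                 # assumed to return (batch, seq, hidden_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask)
        last_token = hidden[:, -1, :]            # reward read off the final token's state
        return self.reward_head(last_token).squeeze(-1)
```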
[1] We note that this form likely does not hold near the origin, as it has infinite slope there. We experimented
with a number of different forms, but found worse fits and extrapolation. See appendix B for more details.
[2] The coefficient $\alpha_{\mathrm{RL}}$ in particular is nearly independent of RM parameter count.
Figure 1: Reward model (RM) parameter size scaling experiments using the InstructGPT environment,
for (a) BoN and (b) RL. Policy size is held constant (1.2B), while reward model size is varied. The
x-axes have a square-root scale. Note that the plots have different x-axes. The gold reward represents
the ground truth reward; we observe that when we optimize for a learned proxy of the gold reward,
the gold reward initially increases and later decreases. We show that our functional forms fit this
effect well.
Figure 2: Diagram of the real and synthetic RM training setups. Human labellers generate comparison
data. In the real RLHF setting, this data is used to train a proxy RM that is optimized by RL/BoN. In
our synthetic setting, we instead use a “Gold RM” as our ground truth. In both settings, the proxy
RM is a proxy for the ground truth process generating the labels (either the human or gold RM).
The RL experiments use Proximal Policy Optimization (PPO) [Schulman et al., 2017]. The KL penalty for
all RL experiments is set to 0 except in section 3.6. See appendix C for all other hyperparameters.
We mostly use defaults for the PPO hyperparameters; thus, it is possible that there exist different
trends for other hyperparameter configurations.
In BoN, we generate $n$ trajectories for the policy and use the reward model to pick the one with the
highest proxy RM score. We use the unbiased estimator from Nakano et al. [2021, Appendix I] to
compute all of the gold and proxy scores for intermediate $n$ between 1 and the maximum $n$; this has
lower variance and is more efficient than the naive estimator of repeatedly sampling $n$ trajectories
with replacement and taking the mean of the maximum gold and proxy RM scores. The KL distances
for BoN are computed analytically: $\mathrm{KL}_{\mathrm{bon}} = \log n - \frac{n-1}{n}$ [Stiennon et al., 2020, Appendix G.3].
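The analytic KL expression and the basic BoN selection step are simple to reproduce. Here is a minimal sketch; it uses naive argmax selection rather than the unbiased estimator of Nakano et al. [2021].

```python
import numpy as np

def kl_best_of_n(n):
    # KL_bon = log(n) - (n - 1) / n, in nats
    n = np.asarray(n, dtype=float)
    return np.log(n) - (n - 1.0) / n

def best_of_n(samples, proxy_scores):
    # Return the sample with the highest proxy RM score.
    return samples[int(np.argmax(proxy_scores))]

print(kl_best_of_n([1, 1000, 60000]))   # ~0, ~5.9, ~10.0 nats
```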
2.1 Synthetic Data Setup
Because getting a ground truth gold reward signal from human labellers is expensive, we instead
use a synthetic task where the ground truth is defined to be the output of a particular large “gold”
RM. The 6B reward model from Ouyang et al. [2022] is used as the gold RM, and our proxy RMs
vary from 3M to 3B parameters.[3] This synthetic gold reward is used to label pairs of rollouts from
the policy given the same prompt to create synthetic RM training data. The synthetic comparisons
are created deterministically by always marking the trajectory with the higher gold RM score as
preferred.[4] We generate 100,000 synthetic comparisons and reserve 10% of these as a held-out test
set for computing the validation loss of RMs.
See fig. 2 for a diagram of the synthetic setup.
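The labeling step can be summarized by the following sketch; function and field names are placeholders, and `gold_rm_score` stands in for a call to the 6B gold RM.

```python
def make_synthetic_comparison(prompt, policy_sample, gold_rm_score):
    # Sample two rollouts for the same prompt and label them with the gold RM.
    response_a = policy_sample(prompt)
    response_b = policy_sample(prompt)
    score_a = gold_rm_score(prompt, response_a)
    score_b = gold_rm_score(prompt, response_b)
    # Deterministic (hard) label: the higher-scoring rollout is always marked preferred.
    if score_a >= score_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```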
2.2 Recalibration
The RM scores are translation-invariant, so to ensure comparability across different reward models,
we recenter each RM such that the average reward of the initial policy is 0. We also unit normalize
the variance of the gold RM scores.[5] Because our hard-thresholding synthetic data setup produces
labels that are miscalibrated (since they do not incorporate the gold RM’s confidence), we recalibrate
the proxy RMs by rescaling the logits to minimize cross-entropy loss using a validation set of soft
labels. All renormalization and recalibration is applied after the experiments; this does not affect
BoN at all, and likely has no impact on RL because Adam is loss-scale invariant, though it is possible
that there are slight differences due to algorithmic details.
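As a rough illustration of the recentering and logit-rescaling steps (a minimal sketch; the optimizer used for the rescaling is not specified above, so scipy's bounded scalar minimization is assumed here):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def recenter(rm_scores, init_policy_scores):
    # Shift scores so the initial policy's average reward is 0 (RM scores are translation-invariant).
    return rm_scores - np.mean(init_policy_scores)

def fit_logit_scale(proxy_logit_diffs, gold_soft_labels):
    # Find a scalar multiplier for the proxy RM's logit differences that minimizes
    # cross-entropy against soft labels derived from the gold RM.
    def cross_entropy(scale):
        p = 1.0 / (1.0 + np.exp(-scale * proxy_logit_diffs))
        eps = 1e-9
        return -np.mean(gold_soft_labels * np.log(p + eps)
                        + (1.0 - gold_soft_labels) * np.log(1.0 - p + eps))
    return minimize_scalar(cross_entropy, bounds=(1e-3, 1e3), method="bounded").x
```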
[3] We originally trained two additional RMs smaller than 3M parameters, which achieved near-chance
accuracy and were off-trend, and so were excluded.
[4] We had experimented with sampling for creating labels, but observed noisier results.
[5] We later decided this was unnecessary but chose not to change it.
3 Results
3.1 Fitting and validating functional forms
We chose our functional forms through experimentation with all RM data and parameter scaling
curves in the remainder of this paper.
The BoN functional form was hypothesized using data up to $n = 1000$. In order to validate the
functional forms, we performed a BoN experiment with up to $n = 60{,}000$ (KL $\approx$ 10 nats), after
only having seen data up to $n = 1000$ (KL $\approx$ 6 nats). As this experiment was conducted after the
functional form was hypothesized based on data up to 6 nats, this was a true advance prediction.

We also test extrapolation of the BoN and RL functional forms from low KLs to unseen larger
KLs; see fig. 26 for details.
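This kind of extrapolation test can be mimicked with a standard least-squares fit: fit the BoN form on data below roughly 6 nats, then predict at roughly 10 nats. The sketch below uses synthetic stand-in data, not our measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def r_bon(d, alpha, beta):
    return d * (alpha - beta * d)

rng = np.random.default_rng(0)
kl_seen = np.linspace(0.1, 6.0, 40)              # "seen" regime, up to ~6 nats
d_seen = np.sqrt(kl_seen)
gold_seen = r_bon(d_seen, 1.0, 0.25) + rng.normal(0, 0.01, d_seen.shape)  # stand-in data

(alpha_hat, beta_hat), _ = curve_fit(r_bon, d_seen, gold_seen)
print("prediction at KL = 10 nats:", r_bon(np.sqrt(10.0), alpha_hat, beta_hat))
```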
We also attempted to model the proxy scores but were unable to obtain a satisfactory fit. For BoN,
despite visual similarity, a linear fit for the proxy BoN score did not work well (fig. 20). The proxy
score predictions for RL and BoN are not as easily modelled as the gold score predictions. We leave
a better understanding of the proxy RM score behavior to future work.
3.2 Scaling with RM Parameter Count
We hold policy size (1.2B) and data size (90,000) constant (fig. 1). We observe that for the gold RM
scores, $\alpha_{\mathrm{bon}}$ and $\beta_{\mathrm{bon}}$ change smoothly with RM size (figs. 3a and 3b). For RL, we find that we can
hold $\alpha_{\mathrm{RL}}$ constant across all RM sizes, resulting in a clean scaling curve for $\beta_{\mathrm{RL}}$ (fig. 3c). These
scaling laws allow us to predict properties of training runs; for instance, we can also predict the peak
gold RM scores for different RM sizes (fig. 12).
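Footnote 2 and fig. 3 describe approximately logarithmic trends in the coefficients; a fit of that shape can be done as in the sketch below. The RM sizes and coefficient values here are hypothetical placeholders, not our fitted numbers.

```python
import numpy as np

# Hypothetical (RM size, beta_RL) pairs; real values would come from fits like fig. 3c.
rm_params = np.array([3e6, 12e6, 50e6, 300e6, 1.2e9, 3e9])
beta_rl = np.array([0.30, 0.26, 0.22, 0.18, 0.14, 0.12])

# Fit beta_RL ~ a + b * log(N) and extrapolate to an unseen RM size.
b, a = np.polyfit(np.log(rm_params), beta_rl, 1)
print("predicted beta_RL at 6B params:", a + b * np.log(6e9))
```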
When modelled using the same functional forms as the respective gold scores, the proxy score fits
have much lower values of $\beta_{\mathrm{bon}}$. We also see smooth scaling in the proxy score’s $\alpha_{\mathrm{bon}}$ and $\beta_{\mathrm{bon}}$.
However, for the reasons in section 3.1, we are less confident about these fits. For both BoN and RL,
we observe systematic underestimates of the proxy reward model score when extrapolated to higher KLs.
Both appear to eventually grow roughly linearly in KL, as in Bai et al. [2022].
Figure 3: The values of $\alpha_{\mathrm{bon}}$ (panel a), $\beta_{\mathrm{bon}}$ (panel b), and $\beta_{\mathrm{RL}}$ (panel c) in the BoN and RL
overoptimization scaling laws for both proxy (dashed line) and gold (solid line) rewards as they scale
with parameter count.
3.3 Scaling with RM Data Size
We hold RM size constant (12M) and sweep RM data size for both RL and BoN.[6] Overall, the results
are consistent with intuition: more data leads to better gold scores and less goodharting. The scaling
of $\alpha$ and $\beta$ with data size is not as cleanly described as for RM size scaling (fig. 17, fig. 18).

For all RM sizes, we observe that for amounts of data less than around 2,000 comparisons,[7] there is
very little improvement over near-chance loss (fig. 6). This is also reflected in gold scores after
optimization (fig. 21). After this threshold, all models improve with more data, though larger RMs
[6] For BoN, we actually sweep all combinations of RM size and data size; see fig. 10. For a version of fig. 4a
against a 3B RM, see fig. 19.
[7] To test the hypothesis that some minimum number of RM finetuning steps is needed, we control for the
number of SGD steps by running multiple epochs and observe that running 4 epochs instead of 1 yields no
change in gold score whatsoever, whereas 1 epoch of 4 times as much data performs substantially better (fig. 13).