Preprint. Under review.
REWARD LEARNING WITH TREES:
METHODS AND EVALUATION
Tom Bewley1, Jonathan Lawry1, Arthur Richards1, Rachel Craddock2 & Ian Henderson2
1University of Bristol and 2Thales, United Kingdom
{firstname.surname}@{1bristol.ac.uk,1uk.thalesgroup.com}
ABSTRACT
Recent efforts to learn reward functions from human feedback have tended to use
deep neural networks, whose lack of transparency hampers our ability to explain
agent behaviour or verify alignment. We explore the merits of learning intrinsi-
cally interpretable tree models instead. We develop a recently proposed method
for learning reward trees from preference labels, and show it to be broadly com-
petitive with neural networks on challenging high-dimensional tasks, with good
robustness to limited or corrupted data. Having found that reward tree learning
can be done effectively in complex settings, we then consider why it should be
used, demonstrating that the interpretable reward structure gives significant scope
for traceability, verification and explanation.
1 INTRODUCTION
For a reinforcement learning (RL) agent to reliably achieve a goal or desired behaviour, this objective
must be encoded as a reward function. However, manual reward design is widely understood to be
challenging, with risks of under-, over-, and mis-specification leading to undesirable, unsafe and
variable outcomes (Pan et al., 2022). For this reason, there has been growing interest in enabling RL
agents to learn reward functions from normative feedback provided by humans (Leike et al., 2018).
These efforts have proven successful from a technical perspective, but an oft-unquestioned aspect of
the approach creates a roadblock to practical applications: reward learning typically uses black-box
neural networks (NNs), which resist human scrutiny and interpretation.
For advocates of explainable AI (XAI), this is a problematic state of affairs. The XAI community is
vocal about the safety and accountability risks of opaque learning algorithms (Rudin, 2019), but an
inability to interpret even the objective that an agent is optimising places us in yet murkier epistemic
territory, in which an understanding of the causal origins of learnt behaviour, and their alignment
with human preferences, becomes virtually unattainable. Black-box reward learning could also be
seen as a missed scientific opportunity. A learnt reward function is a tantalising object of study
from an XAI perspective, due to its triple status as (1) an explanatory model of revealed human
preferences, (2) a normative model of agent behaviour, and (3) a causal link between the two.
The approach proposed by Bewley & Lecue (2022) provides a promising way forward. Here, human
preference labels over pairs of agent behaviours are used to learn tree-structured reward functions
(reward trees), which are hierarchies of local rules that admit visual and textual representation and
can be leveraged to monitor and debug agent learning. In this paper, we adapt and extend the
method (including by integrating it with model-based RL agents), and compare it to NN-based re-
ward learning in a challenging aircraft handling domain. We find it to be broadly competitive on both
quantitative metrics and qualitative assessments, with our new modification to tree growth yielding
significant improvements. The resultant trees are small enough to be globally interpretable (≤ 20
leaves), and we demonstrate how they can be analysed, verified, and used to generate explanations.
The primary contribution of this paper is positive evidence that reward learning can be performed
effectively using interpretable models such as trees, even in complex, high-dimensional continuous
environments. Our secondary contributions are improvements to the originally-proposed learning
algorithm, as well as metrics and methods for reward evaluation and interpretability that may be
useful to others working in what remains a somewhat preparadigmatic field. After reviewing the
necessary background and related work in Sections 2 and 3, we present our refinement of reward
tree learning in Section 4, and describe how we deploy it online with a model-based agent in Section
5. Section 6 contains our experiments and results, which consider both quantitative and qualitative
aspects of learning performance, and an illustrative analysis of learnt tree structures. Finally, Section
7 concludes and discusses avenues for future work.
2 BACKGROUND AND RELATED WORK
Markov Decision Processes (MDPs) In this canonical formulation of sequential decision making,
the state of a system at discrete time $t$, $s_t \in \mathcal{S}$, and the action of an agent, $a_t \in \mathcal{A}$, condition the
successor state $s_{t+1}$ according to a dynamics function $D : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ (we use $\Delta(\cdot)$ to denote
the set of all probability distributions over a set). A reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ then
outputs a scalar reward $r_{t+1}$ given $s_t$, $a_t$ and $s_{t+1}$. RL algorithms use exploratory data collection to
learn action-selection policies $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, with the goal of maximising the expected discounted
sum of future reward $\mathbb{E}_{D}\big[\sum_{h=0}^{\infty} \gamma^h r_{t+h+1}\big]$, $\gamma \in [0, 1]$.
Reward Learning In the usual MDP framing, Ris an immutable property of the environment,
which belies the practical fact that AI objectives originate in the uncertain goals and preferences of
fallible humans (Russell, 2019). Reward learning (or modelling) (Leike et al., 2018) replaces hand-
specified reward functions with models learnt from humans via revealed preference cues such as
demonstrations (Ng et al., 2000), scalar evaluations (Knox & Stone, 2008), approval labels (Griffith
et al., 2013), corrections (Bajcsy et al., 2017), and preference rankings (Christiano et al., 2017).
XAI for RL (XRL) Surveys of XAI for RL (Puiutta & Veith, 2020; Heuillet et al., 2021) tax-
onomise a diverse and expanding range of methods. A key division is between intrinsic approaches,
which imbue agents with structure such as object-oriented representations (Zhu et al., 2018) or
symbolic policy primitives (Verma et al., 2018), and post hoc (often visual) analyses of learnt rep-
resentations (Zahavy et al., 2016), including computing feature importance/saliency (Huber et al.,
2019). Spatiotemporal scope varies from the local explanation of single actions (van der Waa et al.,
2018) to the global summary of entire policies by showing representative trajectories (Amir & Amir,
2018) or critical states (Huang et al., 2018). While most post hoc methods focus on single policies,
some provide insight into the dynamics of agent learning (Dao et al., 2018; Bewley et al., 2022).
Explainable Reward Functions At the intersection of reward learning and XRL lie efforts to im-
prove human understanding of reward functions and their effects on action selection. While this
area is “less developed” than other XRL sub-fields (Glanois et al., 2021), a distinction has again
emerged between intrinsic approaches which create rewards that decompose into semantic com-
ponents (Juozapaitis et al., 2019) or optimise for sparsity (Devidze et al., 2021), and post hoc ap-
proaches which apply feature importance analysis (Russell & Santos, 2019), counterfactual probing
(Michaud et al., 2020), or simplifying transformations (Jenner & Gleave, 2022). Sanneman & Shah
(2022) use a set of human-oriented metrics to compare the efficacy of reward explanation techniques.
Trees for Explainable Agency Prior uses of tree models in XRL again divide into intrinsic meth-
ods, in which an agent’s policy (Silva et al., 2020), value function (Liu et al., 2018; Roth et al., 2019)
or dynamics model (Jiang et al., 2019) is implemented as a tree, and post hoc tree approximations of
an existing agent’s (usually NN) policy (Bastani et al., 2018; Coppens et al., 2019) or transitions in
the environment (Bewley et al., 2022). Related to our focus on human-centric learning: Cobo et al.
(2012) learn tree-structured MDP abstractions from human demonstrations; Lafond et al. (2013) use
tree models to model expert judgements in a naval air defence setting; and Tambwekar et al. (2021)
warm-start RL by converting a natural language specification into a differentiable tree policy.
3 PREFERENCE-BASED REWARD LEARNING
We adopt the preference-based approach to reward learning, in which a human is presented with
pairs of agent trajectories (sequences of state, action, next state transitions) and expresses which of
the two they prefer as a solution to a given task of interest. A reward function is then learnt to explain
the pattern of preferences. This approach is popular in the existing literature (Wirth et al., 2016;
Christiano et al., 2017; Lee et al., 2021b) and has a firm psychological basis. Experimental results
indicate that humans find it cognitively easier to make relative (vs. absolute) quality judgements
(Kendall, 1975; Wilde et al., 2020) and exhibit lower variance when doing so (Guo et al., 2018).
This is due in part to the lack of requirement for an absolute scale to be maintained in working
memory, which is liable to induce bias as it shifts over time (Eric et al., 2007).
We formalise a trajectory $\xi^i$ as a sequence $(x^i_1, \dots, x^i_{T_i})$, where $x^i_t = \phi(s^i_{t-1}, a^i_{t-1}, s^i_t) \in \mathbb{R}^F$ represents
a single transition as an $F$-dimensional feature vector. Given $N$ trajectories, $\Xi = \{\xi^i\}_{i=1}^{N}$, we
assume the human provides $K \le N(N-1)/2$ pairwise preference labels, $\mathcal{L} = \{(i, j)\}_{k=1}^{K}$, each
of which indicates that the $j$th trajectory is preferred to the $i$th (denoted by $\xi^j \succ \xi^i$). Figure 1 (left)
shows how a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$ can be viewed as a directed graph.
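To make this formalism concrete, the sketch below shows one possible in-memory representation of a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$. The class and field names are our own illustrative choices, not those of any released implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class PreferenceDataset:
    """D = (Xi, L): featurised trajectories plus pairwise preference labels."""
    trajectories: list = field(default_factory=list)  # Xi: each entry is a (T_i, F) array of transition features
    preferences: list = field(default_factory=list)   # L: each entry (i, j) means trajectory j is preferred to i

    def add_trajectory(self, features: np.ndarray) -> int:
        """Add a trajectory (T_i x F feature matrix) and return its index."""
        assert features.ndim == 2
        self.trajectories.append(features)
        return len(self.trajectories) - 1

    def add_preference(self, i: int, j: int):
        """Record that trajectory j is preferred to trajectory i (an edge i -> j)."""
        self.preferences.append((i, j))
```

Each edge $(i, j)$ of the directed graph in Figure 1 (left) then corresponds to one entry of `preferences`.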
Figure 1: Left: The input to preference-based reward learning is a directed graph over a trajectory
set $\Xi = \{\xi^i\}_{i=1}^{N}$, where each edge $(i, j)$ represents a preference $\xi^j \succ \xi^i$. Each member of $\Xi$ is a
sequence in $\mathbb{R}^F$ (blue connectors show mapping). Right: The four model induction stages: return
estimation (Section 4.1), leaf-level reward prediction (4.2), tree growth (4.3), and pruning (4.4).
To learn a reward function from D, we must assume a generative model for the preference labels.
Typically, it is assumed that the human produces labels in Boltzmann-rational accordance with the
sum of rewards (or return) output by a latent reward function over the feature space, $R : \mathbb{R}^F \to \mathbb{R}$.
This is formalised by adapting the classic preference model of Bradley & Terry (1952):
$$P(\xi^j \succ \xi^i \mid R) = \frac{1}{1 + \exp\!\big(\frac{1}{\beta}\big(G(\xi^i \mid R) - G(\xi^j \mid R)\big)\big)}, \quad \text{where } G(\xi^i \mid R) = \sum_{t=1}^{T_i} R(x^i_t), \quad (1)$$
and $\beta > 0$ is a temperature coefficient. The objective of reward learning is to approximate $R$ within
some learnable function class $\mathcal{R}$. This is often formalised as minimising the negative log-likelihood
(NLL) loss over $\mathcal{L}$. Wirth et al. (2016) also use the discrete 0-1 loss, which considers only the
directions of predicted preferences rather than their strengths. These two losses are defined as:
$$\ell_{\text{NLL}}(\mathcal{D}, R) = -\sum_{(i,j) \in \mathcal{L}} \log P(\xi^j \succ \xi^i \mid R); \qquad \ell_{0\text{-}1}(\mathcal{D}, R) = \sum_{(i,j) \in \mathcal{L}} \mathbb{I}\big[P(\xi^j \succ \xi^i \mid R) \le 0.5\big]. \quad (2)$$
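As a concrete reference, the following sketch evaluates Equations 1 and 2 for an arbitrary transition-level reward model, reusing the `PreferenceDataset` fields sketched above; the function names are illustrative, and the reward model is passed in as a plain Python callable mapping a $(T, F)$ feature matrix to per-transition rewards.

```python
import numpy as np

def returns(trajectories, reward_fn):
    """G(xi | R): sum of predicted rewards over each trajectory's transitions."""
    return np.array([np.sum(reward_fn(xi)) for xi in trajectories])

def preference_prob(g_i, g_j, beta=1.0):
    """Equation 1: Boltzmann-rational probability that trajectory j is preferred to i."""
    return 1.0 / (1.0 + np.exp((g_i - g_j) / beta))

def losses(dataset, reward_fn, beta=1.0):
    """Equation 2: NLL and 0-1 losses of a reward model on a preference dataset."""
    g = returns(dataset.trajectories, reward_fn)
    nll, zero_one = 0.0, 0
    for i, j in dataset.preferences:
        p = preference_prob(g[i], g[j], beta)
        nll += -np.log(max(p, 1e-12))   # guard against log(0)
        zero_one += int(p <= 0.5)       # counted as an error if the predicted direction is wrong
    return nll, zero_one
```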
4 REWARD TREE INDUCTION
In prior work, the function class $\mathcal{R}$ has been that of linear models $R(x) = w \cdot x$ (Sadigh et al., 2017),
which have very limited expressivity, or deep NNs (Christiano et al., 2017), which resist human
interpretation. As an intermediate option, Bewley & Lecue (2022) (henceforth BL) propose the reward
tree model. Here, the parameter space consists of node-level splitting rules and reward predictions
for an axis-aligned decision tree, whose leaves induce a hyperrectangular partition of $\mathbb{R}^F$. These
parameters are non-differentiable, making end-to-end optimisation of the losses in Equation 2 com-
putationally intractable. BL introduce a bespoke, multi-stage model induction method that greedily
optimises a proxy objective at each stage. The four stages outlined in this section, and depicted in
Figure 1 (right), represent a refinement of this approach with simplified notation.
4.1 TRAJECTORY-LEVEL RETURN ESTIMATION
This stage considers the $N$ trajectories as atomic units, and uses the preference graph to construct a
vector of return estimates $g \in \mathbb{R}^N$, which should be higher for more commonly preferred trajectories
(blue in Figure 1 (4.1), cf. red). This is a vanilla preference learning problem of the kind routinely
faced in AI, psychology and economics, and thus admits a standard solution. BL's original proposal
finds the least squares solution for $g$ under Thurstone's Case V preference model (Gulliksen, 1956).
For consistency with prior work, we instead minimise the NLL loss under the Bradley-Terry model
(we find it makes little difference to the result). Concretely, the proxy objective for this stage is
$$\underset{g \in \mathbb{R}^N}{\arg\min} \Big[ \sum_{(i,j) \in \mathcal{L}} -\log \frac{1}{1 + \exp(g_i - g_j)} \Big], \quad \text{subject to } \big(\min(g) = 0 \text{ or } \max(g) = 0\big) \text{ and } \mathrm{std}(g) = \beta, \quad (3)$$
where $\beta$ is the mean trajectory length in $\Xi$, $\sum_{i=1}^{N} T_i / N$. The min-or-max constraint ensures that
all return estimates have the same sign (positive or negative), which aids both policy learning and
interpretability (see Appendix A.1). We optimise Equation 3 by unconstrained gradient descent with
the Adam optimiser (Kingma & Ba, 2014), followed by post-normalisation to meet the constraints.
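A minimal sketch of this stage is given below, using PyTorch's Adam optimiser for the unconstrained NLL minimisation followed by the post-normalisation step. The learning rate, iteration count and `anchor` argument are placeholder assumptions, not the paper's exact settings.

```python
import torch

def estimate_returns(preferences, N, beta, lr=0.1, steps=2000, anchor="min"):
    """Equation 3: fit trajectory-level return estimates g from pairwise preferences."""
    g = torch.zeros(N, requires_grad=True)
    idx_i = torch.tensor([i for i, _ in preferences])
    idx_j = torch.tensor([j for _, j in preferences])
    opt = torch.optim.Adam([g], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # NLL under the Bradley-Terry model: -log(1 / (1 + exp(g_i - g_j))) = softplus(g_i - g_j)
        loss = torch.nn.functional.softplus(g[idx_i] - g[idx_j]).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        g = g - (g.min() if anchor == "min" else g.max())  # enforce min(g) = 0 or max(g) = 0
        g = g * (beta / g.std())                           # enforce std(g) = beta (mean trajectory length)
    return g.numpy()
```

Scaling by a positive constant preserves the zero anchor, so the two constraints can be applied sequentially as shown.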
4.2 LEAF-LEVEL REWARD ESTIMATION
The vector $g$ estimates trajectory-level returns, but the aim of reward learning is to decompose these
into sums of rewards for the constituent transitions. BL's novel contribution is to do this using a tree
model $T$, consisting of a hierarchy of rules that partition the transition-level feature space $\mathbb{R}^F$ into
$L_T$ hyperrectangular subsets called leaves. Let $\mathrm{leaf}_T : \mathbb{R}^F \to \{1..L_T\}$ be a function that maps a
feature vector $x \in \mathbb{R}^F$ to the leaf in which it resides by propagating it through the rule hierarchy. The
predicted reward for any $x$ lying in leaf $l$ is defined as an average over $g$, weighted by the proportion
of timesteps that each trajectory in $\Xi$ spends in $l$:
$$R_T(x) = r_{\mathrm{leaf}_T(x)}, \quad \text{where } r_l = \sum_{i=1}^{N} \frac{g_i}{T_i} \cdot \frac{\sum_{t=1}^{T_i} \mathbb{I}[\mathrm{leaf}_T(x^i_t) = l]}{\sum_{i'=1}^{N} \sum_{t=1}^{T_{i'}} \mathbb{I}[\mathrm{leaf}_T(x^{i'}_t) = l]}, \quad 1 \le l \le L_T. \quad (4)$$
In essence, the effect of Equation 4 is to predict higher reward in leaves that contain more timesteps
from trajectories with high $g$ values. While ostensibly naïve, BL find that this time-weighted credit
assignment is more robust than several more sophisticated alternatives. It reduces the number of free
parameters in subsequent induction stages, permits fast implementation, and provides an intuitive
interpretation of predicted reward that is traceable back to the contribution of each $\xi^i \in \Xi$. Figure 1
(4.2) shows how timesteps from $\xi^1$, $\xi^3$ and $\xi^4$ contribute to the reward prediction for one leaf.
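A direct NumPy transcription of Equation 4 might look as follows; `leaf_ids[i]` is assumed to hold the leaf index of every timestep of trajectory $i$ under the current tree, and assigning zero reward to empty leaves is our own convention for the sketch.

```python
import numpy as np

def leaf_rewards(g, leaf_ids, num_leaves):
    """Equation 4: time-weighted credit assignment of trajectory returns to leaves.

    g         : (N,) array of trajectory-level return estimates
    leaf_ids  : list of N integer arrays; leaf_ids[i][t] is the leaf containing x^i_t
    num_leaves: L_T, the number of leaves in the current tree
    """
    num = np.zeros(num_leaves)  # sum over i of (g_i / T_i) * (# timesteps of trajectory i in leaf l)
    den = np.zeros(num_leaves)  # total # timesteps (over all trajectories) in leaf l
    for g_i, ids in zip(g, leaf_ids):
        counts = np.bincount(ids, minlength=num_leaves)
        num += (g_i / len(ids)) * counts
        den += counts
    return np.where(den > 0, num / np.maximum(den, 1), 0.0)  # empty leaves get zero reward (our convention)
```

Prediction then reduces to a lookup: $R_T(x) = r_{\mathrm{leaf}_T(x)}$.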
4.3 TREE GROWTH
Recall that the objective of preference-based reward learning is to find a reward model that optimises
a measure of fidelity to D, such as the losses in Equation 2. When the model is a tree, this is achieved
by the discrete operations of growth (adding partitioning rules) and pruning (removing rules). Given
a tree $T$, a new rule has the effect of splitting the $l$th leaf with a hyperplane at a location $c \in C_f$ along
the $f$th feature dimension (where $C_f \subset \mathbb{R}$ is a set of candidate split thresholds, e.g. all midpoints
between unique values in $\Xi$). Let $T + [lfc]$ denote the newly-enlarged tree. Splitting recursively
creates an increasingly fine partition of $\mathbb{R}^F$. Figure 1 (4.3) shows an example with 23 leaves.
A central issue is the criterion for selecting the next rule to add. BL use the proxy objective of
minimising the local variance of reward predictions, which is exactly the CART algorithm (Breiman
et al., 2017). While very fast, this criterion is only loosely aligned with fidelity to D. We additionally
consider the more direct criterion of greedily maximising the immediate reduction in $\ell_{0\text{-}1}$:
$$\underset{1 \le l \le L_T,\; 1 \le f \le F,\; c \in C_f}{\arg\max}\ \ell_{0\text{-}1}(\mathcal{D}, R_T) - \ell_{0\text{-}1}(\mathcal{D}, R_{T+[lfc]}). \quad (5)$$
In Section 6.1, we show that switching to this criterion consistently improves performance on most
quantitative measures (we also tried a criterion based on $\ell_{\text{NLL}}$, but found it to be far more computationally
costly and prone to overfitting). Recursive splitting stops when either no reduction in $\ell_{0\text{-}1}$
can be achieved by any single split, or a tree size limit $L_T = L_{\max}$ is reached.
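The sketch below illustrates the greedy 0-1 criterion of Equation 5 for a single split decision, in a simplified setting where the tree is represented only by per-timestep leaf assignments; all argument names are our own, and a practical implementation would restrict candidate thresholds and cache statistics for speed rather than recomputing leaf rewards from scratch.

```python
import numpy as np

def zero_one_loss(preferences, traj_returns):
    """Number of preference labels whose direction the model gets wrong (Equation 2)."""
    return sum(int(traj_returns[j] <= traj_returns[i]) for i, j in preferences)

def best_split_for_leaf(X, traj_index, leaf_ids, l, g, preferences, num_leaves):
    """Greedy search over (f, c) for splitting leaf l so as to maximise the reduction
    in 0-1 loss (Equation 5), in a flat representation of the partition.

    X          : (M, F) array of all transition features, concatenated over trajectories
    traj_index : (M,) array giving the trajectory index of each transition
    leaf_ids   : (M,) array of current leaf assignments
    """
    in_leaf = leaf_ids == l
    T = np.bincount(traj_index)                    # trajectory lengths

    def loss_for(assign):                          # 0-1 loss under a candidate leaf assignment
        num = np.zeros(num_leaves + 1)
        den = np.zeros(num_leaves + 1)
        np.add.at(num, assign, g[traj_index] / T[traj_index])
        np.add.at(den, assign, 1.0)
        r = np.where(den > 0, num / np.maximum(den, 1), 0.0)   # leaf rewards (Equation 4)
        returns = np.bincount(traj_index, weights=r[assign], minlength=len(g))
        return zero_one_loss(preferences, returns)

    base = loss_for(leaf_ids)
    best = (0, None, None)                         # (loss reduction, feature, threshold)
    for f in range(X.shape[1]):
        vals = np.unique(X[in_leaf, f])
        for c in (vals[:-1] + vals[1:]) / 2:       # midpoints between unique values
            assign = np.where(in_leaf & (X[:, f] >= c), num_leaves, leaf_ids)  # index of the new leaf
            reduction = base - loss_for(assign)
            if reduction > best[0]:
                best = (reduction, f, c)
    return best                                    # reduction 0 means no improving split exists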
4.4 TREE PRUNING
Growth is followed by a pruning sweep which reduces the size of the tree by rule removal. Such
reduction is beneficial for both performance (Tien et al. (2022) find that increased model capacity
raises the risk of causal confusion in preference-based reward learning) and human comprehension
(in the language of Jenner & Gleave (2022), it is a form of “processing for interpretability”). Given
a tree T, one pruning operation has the effect of merging two leaves into one by removing the rule at
the common parent node. Let $\mathcal{T}$ denote the sequence of nested subtrees induced by pruning the tree
recursively back to its root, at each step removing the rule that minimises the next subtree's $\ell_{0\text{-}1}$. We
select the $T \in \mathcal{T}$ that minimises $\ell_{0\text{-}1}$, additionally regularised by a term that encourages small trees:
$\arg\min_{T \in \mathcal{T}} \big[\ell_{0\text{-}1}(\mathcal{D}, R_T) + \alpha L_T\big]$, where $\alpha \ge 0$. Note that even with $\alpha = 0$ pruning may still yield
a reduced tree, as unlike in traditional decision tree induction, the effect of individual rules on $\ell_{0\text{-}1}$
depends on the order in which they are added or removed. In the example in Figure 1 (4.4), pruning
yields a final tree with 3 leaves, for which indicative leaf-level reward predictions are shown.
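The pruning sweep can be sketched at a high level as follows. The tree is accessed through a duck-typed interface (`copy()`, `prunable_nodes()`, `remove_rule(node)`, `num_leaves`) and a caller-supplied `loss_fn`; these names are assumptions for illustration, not part of any released API.

```python
def prune(tree, loss_fn, alpha=0.0):
    """Pruning sweep (Section 4.4): greedily remove rules back to the root, then select
    the subtree minimising 0-1 loss + alpha * (number of leaves).
    loss_fn(tree) is assumed to return the 0-1 loss of the tree's reward predictions."""
    candidates = [(loss_fn(tree), tree.num_leaves, tree)]
    current = tree
    while current.num_leaves > 1:
        # Remove whichever single rule gives the next subtree the lowest 0-1 loss.
        best = None
        for node in current.prunable_nodes():
            subtree = current.copy()
            subtree.remove_rule(node)
            loss = loss_fn(subtree)
            if best is None or loss < best[0]:
                best = (loss, subtree)
        candidates.append((best[0], best[1].num_leaves, best[1]))
        current = best[1]
    # Regularised selection over the nested sequence of subtrees.
    return min(candidates, key=lambda c: c[0] + alpha * c[1])[2]
```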
5 ONLINE LEARNING SETUP
5.1 ITERATED POLICY AND REWARD LEARNING
Sections 3 and 4 do not discuss the origins of the trajectories Ξ, or how reward learning relates to
the downstream objective of learning a policy for the underlying task. Following most recent work
since Christiano et al. (2017), we resolve both questions with an online bootstrapped approach.
Assuming an episodic MDP, the $i$th episode of policy learning produces a new trajectory $\xi^i$ to add to
$\Xi$. We immediately connect $\xi^i$ to the preference graph by asking the human to compare it to $K_{\text{batch}}$
random trajectories from the existing set (while Sadigh et al. (2017) and others have proposed active
querying schemes, that is not our focus here, and this simple strategy performs satisfactorily). We
then update the reward tree on the full preference graph via the four stages given in Section 4. We
find that BL's original method of starting growth from the current state of the tree causes lock-in to
poor initial solutions, so instead re-grow from scratch on each update. The rule structure nonetheless
tends to stabilise, as the enlarging preference graph becomes increasingly similar for later updates.
For the $(i+1)$th episode, the policy learning agent then attempts to optimise for the newly-updated
reward. By iterating this process up to a total preference budget $K_{\max}$ and/or episode budget $N_{\max}$,
we hope to converge to both a reward tree that reflects the human’s preferences, and an agent policy
that satisfies those preferences. Appendix A.2 contains pseudocode for the online algorithm.
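In outline (deferring to Appendix A.2 for the exact pseudocode), the online loop interleaves episodes, preference queries and tree updates roughly as sketched below. `agent.run_episode`, `oracle.prefers` and `induce_reward_tree` are assumed placeholder interfaces standing in for the components described above, not functions from a released codebase.

```python
import random

def online_reward_tree_learning(env, agent, oracle, dataset, induce_reward_tree,
                                K_batch, K_max, N_max):
    """Iterated policy and reward learning (Section 5.1), schematically."""
    reward_tree, K = None, 0
    for episode in range(N_max):
        # 1. Roll out the current policy, optimising the current reward tree.
        xi = agent.run_episode(env, reward_tree)
        i = dataset.add_trajectory(xi)
        # 2. Connect the new trajectory to the preference graph by querying the human/oracle
        #    against K_batch randomly chosen existing trajectories.
        for j in random.sample(range(i), min(K_batch, i)):
            if K >= K_max:
                break
            if oracle.prefers(dataset.trajectories[i], dataset.trajectories[j]):
                dataset.add_preference(j, i)   # (j, i): new trajectory i is preferred
            else:
                dataset.add_preference(i, j)
            K += 1
        # 3. Re-grow and prune the reward tree from scratch on the full preference graph (stages 4.1-4.4).
        reward_tree = induce_reward_tree(dataset)
    return reward_tree
```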
5.2 INTEGRATION WITH MODEL-BASED RL
Reward learning methods are generally agnostic to the structure of the policy learning agent; this
modularity is hailed as an advantage over other human-agent teaching paradigms (Leike et al., 2018).
In line with most recent works, BL use a model-free RL agent, specifically soft actor-critic (SAC)
(Haarnoja et al., 2018). However, other works (Reddy et al., 2020; Rahtz et al., 2022) use model-
based RL (MBRL) agents that leverage learnt dynamics models and planning. MBRL is attractive in
the reward learning context because it disentangles the predictive and normative aspects of decision-
making. Since (assuming no changes to the environment) dynamics remain stationary during online
reward learning, the amount of re-learning required is reduced, and with it the risk of pitfalls
such as manipulation (Armstrong et al., 2020) and premature convergence. MBRL agents also
provide a richer target for XAI methods due to “the explicit nature of their world knowledge and of
the reasoning performed to take decisions” (Hoffmann & Magazzeni, 2019). In future work, we plan
to explore how this could be synergistic with the adoption of reward trees (see Section 7). Finally,
MBRL can be very data-efficient; we find that switching from SAC to a model-based algorithm
called PETS (Chua et al., 2018) reduces training steps by multiple orders of magnitude, and cuts
wall-clock runtime (see Appendix B). PETS selects actions by decision-time planning through a
learnt dynamics model $D' : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ up to a horizon $H$. In state $s$, planning searches for a
sequence of $H$ future actions that maximise return under the current reward model:
$$\underset{(a_0, \dots, a_{H-1}) \in \mathcal{A}^H}{\arg\max}\ \mathbb{E}_{D'} \Big[ \sum_{h=0}^{H-1} \gamma^h R_T\big(\phi(s_h, a_h, s_{h+1})\big) \Big], \quad \text{where } s_0 = s,\ s_{h+1} \sim D'(s_h, a_h). \quad (6)$$
The first action $a = a_0$ is executed, and then the agent re-plans on the next timestep. In practice, $D'$
is an ensemble of probabilistic NNs, the expectation over $D'$ is replaced by a Monte Carlo estimate,
and the optimisation is approximated by the iterative cross-entropy method.
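A stripped-down version of this planning loop (Monte Carlo return estimates through a sampled dynamics model, refined by the cross-entropy method) is sketched below; it omits the ensemble bookkeeping and particle-propagation details of full PETS, and all hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def cem_plan(s0, sample_dynamics, reward_fn, phi, act_dim, H=20, gamma=0.99,
             iters=5, pop=200, elites=20):
    """Equation 6, approximately: search for an H-step action sequence maximising
    discounted reward under a learnt dynamics model, via the cross-entropy method.

    sample_dynamics(s, a) -> next-state sample from D'(s, a)
    reward_fn(x)          -> scalar reward for a transition feature vector x = phi(s, a, s')
    """
    mean = np.zeros((H, act_dim))
    std = np.ones((H, act_dim))
    for _ in range(iters):
        plans = mean + std * np.random.randn(pop, H, act_dim)
        returns = np.zeros(pop)
        for p in range(pop):                 # Monte Carlo rollout of each candidate plan
            s = s0
            for h in range(H):
                s_next = sample_dynamics(s, plans[p, h])
                returns[p] += (gamma ** h) * reward_fn(phi(s, plans[p, h], s_next))
                s = s_next
        elite = plans[np.argsort(returns)[-elites:]]   # refit the sampling distribution to the elites
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]                           # execute the first action, then re-plan next timestep
```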
6 EXPERIMENTS AND RESULTS
In this section, we combine quantitative and qualitative evaluations to assess the performance of
reward tree learning, specifically in comparison to the standard approach of using NNs. We also
illustrate how the intrinsic interpretability of reward trees allows us to analyse what they have learnt.
Our experiments consider three episodic tasks in an aircraft handling environment, in which the
agent must manoeuvre an aircraft (the ego jet, EJ) in a desired manner relative to a second reference
jet (RJ) whose motion, if any, is considered part of the environment dynamics. These tasks provide
useful test cases for reward learning because each has a large space of plausible reward functions,
which may reflect the divergent priorities and stylistic preferences of domain experts. Such expert
knowledge is often tacit and difficult to codify by hand (Sternberg & Horvath, 1999), motivating a
data-driven approach. Figure 2 shows a schematic of each task, and Appendix C contains a broader
justification for this experimental domain alongside task and implementation details.
Figure 2: Handling tasks (state = [positions, attitudes, velocities, accelerations]; action = [pitch, roll,
yaw, thrust]). Follow: turn to fly in formation with RJ on a linear path. Chase: maintain
distance/line of sight to RJ as it turns randomly. Land: approach a runway using RJ as a reference.
In place of costly human-in-the-loop evaluation, our experiments use synthetic oracle preferences
with respect to nominal reward functions of varying complexity, which are given in Appendix C.3.
This approach is popular (Griffith et al., 2013; Christiano et al., 2017; Reddy et al., 2020; Lindner
et al., 2021) as it enables scalable systematic comparison, with the ability to quantify performance
(and in our case, appraise learnt trees) in terms of reconstruction of a known ground truth. However,
emulating a human with an oracle that responds with perfect rationality is unrealistic (Lee et al.,
2021a). For this reason, Section 6.3 examines the performance impacts of noisy and myopic oracles,
and a restricted data budget. Experimental details and hyperparameters are given in Appendix D.
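For reference, a synthetic oracle of this kind can be implemented in a few lines: a perfectly rational oracle prefers the trajectory with higher ground-truth return, while the noisy and myopic variants studied in Section 6.3 perturb this rule. The exact noise model and myopia window below are illustrative assumptions rather than the paper's precise settings.

```python
import numpy as np

class SyntheticOracle:
    """Answers pairwise preference queries from a nominal (ground-truth) reward function."""
    def __init__(self, true_reward_fn, noise_temp=0.0, myopia_window=None, seed=0):
        self.R = true_reward_fn             # maps a (T, F) feature matrix to per-timestep rewards
        self.noise_temp = noise_temp        # 0 = perfectly rational; > 0 = Boltzmann-noisy responses
        self.myopia_window = myopia_window  # e.g. 10 = judge only the last 10 timesteps
        self.rng = np.random.default_rng(seed)

    def _return(self, xi):
        r = self.R(xi)
        if self.myopia_window is not None:
            r = r[-self.myopia_window:]
        return float(np.sum(r))

    def prefers(self, xi_a, xi_b):
        """True if trajectory a is preferred to trajectory b."""
        g_a, g_b = self._return(xi_a), self._return(xi_b)
        if self.noise_temp == 0.0:
            return g_a > g_b
        p_a = 1.0 / (1.0 + np.exp((g_b - g_a) / self.noise_temp))
        return self.rng.random() < p_a
```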