Preprint. Under review.
REWARD LEARNING WITH TREES:
METHODS AND EVALUATION
Tom Bewley1, Jonathan Lawry1, Arthur Richards1, Rachel Craddock2 & Ian Henderson2
1University of Bristol and 2Thales, United Kingdom
{firstname.surname}@{1bristol.ac.uk,1uk.thalesgroup.com}
ABSTRACT
Recent efforts to learn reward functions from human feedback have tended to use
deep neural networks, whose lack of transparency hampers our ability to explain
agent behaviour or verify alignment. We explore the merits of learning intrinsi-
cally interpretable tree models instead. We develop a recently proposed method
for learning reward trees from preference labels, and show it to be broadly com-
petitive with neural networks on challenging high-dimensional tasks, with good
robustness to limited or corrupted data. Having found that reward tree learning
can be done effectively in complex settings, we then consider why it should be
used, demonstrating that the interpretable reward structure gives significant scope
for traceability, verification and explanation.
1 INTRODUCTION
For a reinforcement learning (RL) agent to reliably achieve a goal or desired behaviour, this objective
must be encoded as a reward function. However, manual reward design is widely understood to be
challenging, with risks of under-, over-, and mis-specification leading to undesirable, unsafe and
variable outcomes (Pan et al., 2022). For this reason, there has been growing interest in enabling RL
agents to learn reward functions from normative feedback provided by humans (Leike et al., 2018).
These efforts have proven successful from a technical perspective, but an oft-unquestioned aspect of
the approach creates a roadblock to practical applications: reward learning typically uses black-box
neural networks (NNs), which resist human scrutiny and interpretation.
For advocates of explainable AI (XAI), this is a problematic state of affairs. The XAI community is
vocal about the safety and accountability risks of opaque learning algorithms (Rudin, 2019), but an
inability to interpret even the objective that an agent is optimising places us in yet murkier epistemic
territory, in which an understanding of the causal origins of learnt behaviour, and their alignment
with human preferences, becomes virtually unattainable. Black-box reward learning could also be
seen as a missed scientific opportunity. A learnt reward function is a tantalising object of study
from an XAI perspective, due to its triple status as (1) an explanatory model of revealed human
preferences, (2) a normative model of agent behaviour, and (3) a causal link between the two.
The approach proposed by Bewley & Lecue (2022) provides a promising way forward. Here, human
preference labels over pairs of agent behaviours are used to learn tree-structured reward functions
(reward trees), which are hierarchies of local rules that admit visual and textual representation and
can be leveraged to monitor and debug agent learning. In this paper, we adapt and extend the
method (including by integrating it with model-based RL agents), and compare it to NN-based re-
ward learning in a challenging aircraft handling domain. We find it to be broadly competitive on both
quantitative metrics and qualitative assessments, with our new modification to tree growth yielding
significant improvements. The resultant trees are small enough to be globally interpretable (≤ 20
leaves), and we demonstrate how they can be analysed, verified, and used to generate explanations.
The primary contribution of this paper is positive evidence that reward learning can be performed
effectively using interpretable models such as trees, even in complex, high-dimensional continuous
environments. Our secondary contributions are improvements to the originally-proposed learning
algorithm, as well as metrics and methods for reward evaluation and interpretability that may be
useful to others working in what remains a somewhat preparadigmatic field. After reviewing the
necessary background and related work in Sections 2 and 3, we present our refinement of reward
tree learning in Section 4, and describe how we deploy it online with a model-based agent in Section
5. Section 6 contains our experiments and results, which consider both quantitative and qualitative
aspects of learning performance, and an illustrative analysis of learnt tree structures. Finally, Section
7 concludes and discusses avenues for future work.
2 BACKGROUND AND RELATED WORK
Markov Decision Processes (MDPs) In this canonical formulation of sequential decision making,
the state of a system at discrete time $t$, $s_t \in \mathcal{S}$, and the action of an agent, $a_t \in \mathcal{A}$, condition the
successor state $s_{t+1}$ according to a dynamics function $D : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ (we use $\Delta(\cdot)$ to denote
the set of all probability distributions over a set). A reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ then
outputs a scalar reward $r_{t+1}$ given $s_t$, $a_t$ and $s_{t+1}$. RL algorithms use exploratory data collection to
learn action-selection policies $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, with the goal of maximising the expected discounted
sum of future reward $\mathbb{E}_{D}\big[\sum_{h=0}^{\infty} \gamma^h r_{t+h+1}\big]$, $\gamma \in [0, 1]$.
Reward Learning In the usual MDP framing, Ris an immutable property of the environment,
which belies the practical fact that AI objectives originate in the uncertain goals and preferences of
fallible humans (Russell, 2019). Reward learning (or modelling) (Leike et al., 2018) replaces hand-
specified reward functions with models learnt from humans via revealed preference cues such as
demonstrations (Ng et al., 2000), scalar evaluations (Knox & Stone, 2008), approval labels (Griffith
et al., 2013), corrections (Bajcsy et al., 2017), and preference rankings (Christiano et al., 2017).
XAI for RL (XRL) Surveys of XAI for RL (Puiutta & Veith, 2020; Heuillet et al., 2021) tax-
onomise a diverse and expanding range of methods. A key division is between intrinsic approaches,
which imbue agents with structure such as object-oriented representations (Zhu et al., 2018) or
symbolic policy primitives (Verma et al., 2018), and post hoc (often visual) analyses of learnt rep-
resentations (Zahavy et al., 2016), including computing feature importance/saliency (Huber et al.,
2019). Spatiotemporal scope varies from the local explanation of single actions (van der Waa et al.,
2018) to the global summary of entire policies by showing representative trajectories (Amir & Amir,
2018) or critical states (Huang et al., 2018). While most post hoc methods focus on single policies,
some provide insight into the dynamics of agent learning (Dao et al., 2018; Bewley et al., 2022).
Explainable Reward Functions At the intersection of reward learning and XRL lie efforts to im-
prove human understanding of reward functions and their effects on action selection. While this
area is “less developed” than other XRL sub-fields (Glanois et al., 2021), a distinction has again
emerged between intrinsic approaches which create rewards that decompose into semantic com-
ponents (Juozapaitis et al., 2019) or optimise for sparsity (Devidze et al., 2021), and post hoc ap-
proaches which apply feature importance analysis (Russell & Santos, 2019), counterfactual probing
(Michaud et al., 2020), or simplifying transformations (Jenner & Gleave, 2022). Sanneman & Shah
(2022) use a set of human-oriented metrics to compare the efficacy of reward explanation techniques.
Trees for Explainable Agency Prior uses of tree models in XRL again divide into intrinsic meth-
ods, in which an agent’s policy (Silva et al., 2020), value function (Liu et al., 2018; Roth et al., 2019)
or dynamics model (Jiang et al., 2019) is implemented as a tree, and post hoc tree approximations of
an existing agent’s (usually NN) policy (Bastani et al., 2018; Coppens et al., 2019) or transitions in
the environment (Bewley et al., 2022). Related to our focus on human-centric learning: Cobo et al.
(2012) learn tree-structured MDP abstractions from human demonstrations; Lafond et al. (2013) use
tree models to model expert judgements in a naval air defence setting; and Tambwekar et al. (2021)
warm-start RL by converting a natural language specification into a differentiable tree policy.
3 PREFERENCE-BASED REWARD LEARNING
We adopt the preference-based approach to reward learning, in which a human is presented with
pairs of agent trajectories (sequences of state, action, next state transitions) and expresses which of
the two they prefer as a solution to a given task of interest. A reward function is then learnt to explain
the pattern of preferences. This approach is popular in the existing literature (Wirth et al., 2016;
Christiano et al., 2017; Lee et al., 2021b) and has a firm psychological basis. Experimental results
indicate that humans find it cognitively easier to make relative (vs. absolute) quality judgements
(Kendall, 1975; Wilde et al., 2020) and exhibit lower variance when doing so (Guo et al., 2018).
This is due in part to the lack of requirement for an absolute scale to be maintained in working
memory, which is liable to induce bias as it shifts over time (Eric et al., 2007).
We formalise a trajectory $\xi^i$ as a sequence $(x^i_1, \dots, x^i_{T_i})$, where $x^i_t = \phi(s^i_{t-1}, a^i_{t-1}, s^i_t) \in \mathbb{R}^F$ represents
a single transition as an $F$-dimensional feature vector. Given $N$ trajectories, $\Xi = \{\xi^i\}_{i=1}^{N}$, we
assume the human provides $K \le N(N-1)/2$ pairwise preference labels, $\mathcal{L} = \{(i, j)\}_{k=1}^{K}$, each
of which indicates that the $j$th trajectory is preferred to the $i$th (denoted by $\xi^j \succ \xi^i$). Figure 1 (left)
shows how a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$ can be viewed as a directed graph.
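To make this formalism concrete, the sketch below shows one possible in-memory representation of a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$. The class and field names are our own illustrative choices, not those of any released implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class PreferenceDataset:
    """D = (Xi, L): featurised trajectories plus pairwise preference labels."""
    trajectories: list = field(default_factory=list)  # Xi: each entry is a (T_i, F) array of transition features
    preferences: list = field(default_factory=list)   # L: each entry (i, j) means trajectory j is preferred to i

    def add_trajectory(self, features: np.ndarray) -> int:
        """Add a trajectory (T_i x F feature matrix) and return its index."""
        assert features.ndim == 2
        self.trajectories.append(features)
        return len(self.trajectories) - 1

    def add_preference(self, i: int, j: int):
        """Record that trajectory j is preferred to trajectory i (an edge i -> j)."""
        self.preferences.append((i, j))
```

Each edge $(i, j)$ of the directed graph in Figure 1 (left) then corresponds to one entry of `preferences`.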
Figure 1: Left: The input to preference-based reward learning is a directed graph over a trajectory
set $\Xi = \{\xi^i\}_{i=1}^{N}$, where each edge $(i, j)$ represents a preference $\xi^j \succ \xi^i$. Each member of $\Xi$ is a
sequence in $\mathbb{R}^F$ (blue connectors show mapping). Right: The four model induction stages: return
estimation (Section 4.1), leaf-level reward prediction (4.2), tree growth (4.3), and pruning (4.4).
To learn a reward function from D, we must assume a generative model for the preference labels.
Typically, it is assumed that the human produces labels in Boltzmann-rational accordance with the
sum of rewards (or return) output by a latent reward function over the feature space, $R : \mathbb{R}^F \to \mathbb{R}$.
This is formalised by adapting the classic preference model of Bradley & Terry (1952):
$$P(\xi^j \succ \xi^i \mid R) = \frac{1}{1 + \exp\!\big(\frac{1}{\beta}\big(G(\xi^i \mid R) - G(\xi^j \mid R)\big)\big)}, \quad \text{where } G(\xi^i \mid R) = \sum_{t=1}^{T_i} R(x^i_t), \quad (1)$$
and $\beta > 0$ is a temperature coefficient. The objective of reward learning is to approximate $R$ within
some learnable function class $\mathcal{R}$. This is often formalised as minimising the negative log-likelihood
(NLL) loss over $\mathcal{L}$. Wirth et al. (2016) also use the discrete 0-1 loss, which considers only the
directions of predicted preferences rather than their strengths. These two losses are defined as:
$$\ell_{\text{NLL}}(\mathcal{D}, R) = -\sum_{(i,j) \in \mathcal{L}} \log P(\xi^j \succ \xi^i \mid R); \qquad \ell_{0\text{-}1}(\mathcal{D}, R) = \sum_{(i,j) \in \mathcal{L}} \mathbb{I}\big[P(\xi^j \succ \xi^i \mid R) \le 0.5\big]. \quad (2)$$
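As a concrete reference, the following sketch evaluates Equations 1 and 2 for an arbitrary transition-level reward model, reusing the `PreferenceDataset` fields sketched above; the function names are illustrative, and the reward model is passed in as a plain Python callable mapping a $(T, F)$ feature matrix to per-transition rewards.

```python
import numpy as np

def returns(trajectories, reward_fn):
    """G(xi | R): sum of predicted rewards over each trajectory's transitions."""
    return np.array([np.sum(reward_fn(xi)) for xi in trajectories])

def preference_prob(g_i, g_j, beta=1.0):
    """Equation 1: Boltzmann-rational probability that trajectory j is preferred to i."""
    return 1.0 / (1.0 + np.exp((g_i - g_j) / beta))

def losses(dataset, reward_fn, beta=1.0):
    """Equation 2: NLL and 0-1 losses of a reward model on a preference dataset."""
    g = returns(dataset.trajectories, reward_fn)
    nll, zero_one = 0.0, 0
    for i, j in dataset.preferences:
        p = preference_prob(g[i], g[j], beta)
        nll += -np.log(max(p, 1e-12))   # guard against log(0)
        zero_one += int(p <= 0.5)       # counted as an error if the predicted direction is wrong
    return nll, zero_one
```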
4 REWARD TREE INDUCTION
In prior work, the function class $\mathcal{R}$ has been that of linear models $R(x) = w \cdot x$ (Sadigh et al., 2017),
which have very limited expressivity, or deep NNs (Christiano et al., 2017), which resist human
interpretation. As an intermediate option, Bewley & Lecue (2022) (henceforth BL) propose the reward
tree model. Here, the parameter space consists of node-level splitting rules and reward predictions
for an axis-aligned decision tree, whose leaves induce a hyperrectangular partition of $\mathbb{R}^F$. These
parameters are non-differentiable, making end-to-end optimisation of the losses in Equation 2 com-
putationally intractable. BL introduce a bespoke, multi-stage model induction method that greedily
optimises a proxy objective at each stage. The four stages outlined in this section, and depicted in
Figure 1 (right), represent a refinement of this approach with simplified notation.
4.1 TRAJECTORY-LEVEL RETURN ESTIMATION
This stage considers the $N$ trajectories as atomic units, and uses the preference graph to construct a
vector of return estimates $g \in \mathbb{R}^N$, which should be higher for more commonly preferred trajectories
(blue in Figure 1 (4.1), cf. red). This is a vanilla preference learning problem of the kind routinely
faced in AI, psychology and economics, and thus admits a standard solution. BL's original proposal
finds the least squares solution for $g$ under Thurstone's Case V preference model (Gulliksen, 1956).
For consistency with prior work, we instead minimise the NLL loss under the Bradley-Terry model
(we find it makes little difference to the result). Concretely, the proxy objective for this stage is
$$\underset{g \in \mathbb{R}^N}{\arg\min} \Big[ \sum_{(i,j) \in \mathcal{L}} -\log \frac{1}{1 + \exp(g_i - g_j)} \Big], \quad \text{subject to } \big(\min(g) = 0 \text{ or } \max(g) = 0\big) \text{ and } \mathrm{std}(g) = \beta, \quad (3)$$
where $\beta$ is the mean trajectory length in $\Xi$, $\sum_{i=1}^{N} T_i / N$. The min-or-max constraint ensures that
all return estimates have the same sign (positive or negative), which aids both policy learning and
interpretability (see Appendix A.1). We optimise Equation 3 by unconstrained gradient descent with
the Adam optimiser (Kingma & Ba, 2014), followed by post-normalisation to meet the constraints.
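A minimal sketch of this stage is given below, using PyTorch's Adam optimiser for the unconstrained NLL minimisation followed by the post-normalisation step. The learning rate, iteration count and `anchor` argument are placeholder assumptions, not the paper's exact settings.

```python
import torch

def estimate_returns(preferences, N, beta, lr=0.1, steps=2000, anchor="min"):
    """Equation 3: fit trajectory-level return estimates g from pairwise preferences."""
    g = torch.zeros(N, requires_grad=True)
    idx_i = torch.tensor([i for i, _ in preferences])
    idx_j = torch.tensor([j for _, j in preferences])
    opt = torch.optim.Adam([g], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # NLL under the Bradley-Terry model: -log(1 / (1 + exp(g_i - g_j))) = softplus(g_i - g_j)
        loss = torch.nn.functional.softplus(g[idx_i] - g[idx_j]).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        g = g - (g.min() if anchor == "min" else g.max())  # enforce min(g) = 0 or max(g) = 0
        g = g * (beta / g.std())                           # enforce std(g) = beta (mean trajectory length)
    return g.numpy()
```

Scaling by a positive constant preserves the zero anchor, so the two constraints can be applied sequentially as shown.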
4.2 LEAF-LEVEL REWARD ESTIMATION
The vector $g$ estimates trajectory-level returns, but the aim of reward learning is to decompose these
into sums of rewards for the constituent transitions. BL's novel contribution is to do this using a tree
model $T$, consisting of a hierarchy of rules that partition the transition-level feature space $\mathbb{R}^F$ into
$L_T$ hyperrectangular subsets called leaves. Let $\mathrm{leaf}_T : \mathbb{R}^F \to \{1..L_T\}$ be a function that maps a
feature vector $x \in \mathbb{R}^F$ to the leaf in which it resides by propagating it through the rule hierarchy. The
predicted reward for any $x$ lying in leaf $l$ is defined as an average over $g$, weighted by the proportion
of timesteps that each trajectory in $\Xi$ spends in $l$:
$$R_T(x) = r_{\mathrm{leaf}_T(x)}, \quad \text{where } r_l = \sum_{i=1}^{N} \frac{g_i}{T_i} \cdot \frac{\sum_{t=1}^{T_i} \mathbb{I}[\mathrm{leaf}_T(x^i_t) = l]}{\sum_{i'=1}^{N} \sum_{t=1}^{T_{i'}} \mathbb{I}[\mathrm{leaf}_T(x^{i'}_t) = l]}, \quad 1 \le l \le L_T. \quad (4)$$
In essence, the effect of Equation 4 is to predict higher reward in leaves that contain more timesteps
from trajectories with high $g$ values. While ostensibly naïve, BL find that this time-weighted credit
assignment is more robust than several more sophisticated alternatives. It reduces the number of free
parameters in subsequent induction stages, permits fast implementation, and provides an intuitive
interpretation of predicted reward that is traceable back to the contribution of each $\xi^i \in \Xi$. Figure 1
(4.2) shows how timesteps from $\xi^1$, $\xi^3$ and $\xi^4$ contribute to the reward prediction for one leaf.
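A direct NumPy transcription of Equation 4 might look as follows; `leaf_ids[i]` is assumed to hold the leaf index of every timestep of trajectory $i$ under the current tree, and assigning zero reward to empty leaves is our own convention for the sketch.

```python
import numpy as np

def leaf_rewards(g, leaf_ids, num_leaves):
    """Equation 4: time-weighted credit assignment of trajectory returns to leaves.

    g         : (N,) array of trajectory-level return estimates
    leaf_ids  : list of N integer arrays; leaf_ids[i][t] is the leaf containing x^i_t
    num_leaves: L_T, the number of leaves in the current tree
    """
    num = np.zeros(num_leaves)  # sum over i of (g_i / T_i) * (# timesteps of trajectory i in leaf l)
    den = np.zeros(num_leaves)  # total # timesteps (over all trajectories) in leaf l
    for g_i, ids in zip(g, leaf_ids):
        counts = np.bincount(ids, minlength=num_leaves)
        num += (g_i / len(ids)) * counts
        den += counts
    return np.where(den > 0, num / np.maximum(den, 1), 0.0)  # empty leaves get zero reward (our convention)
```

Prediction then reduces to a lookup: $R_T(x) = r_{\mathrm{leaf}_T(x)}$.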
4.3 TREE GROWTH
Recall that the objective of preference-based reward learning is to find a reward model that optimises
a measure of fidelity to D, such as the losses in Equation 2. When the model is a tree, this is achieved
by the discrete operations of growth (adding partitioning rules) and pruning (removing rules). Given
a tree $T$, a new rule has the effect of splitting the $l$th leaf with a hyperplane at a location $c \in C_f$ along
the $f$th feature dimension (where $C_f \subset \mathbb{R}$ is a set of candidate split thresholds, e.g. all midpoints
between unique values in $\Xi$). Let $T + [lfc]$ denote the newly-enlarged tree. Splitting recursively
creates an increasingly fine partition of $\mathbb{R}^F$. Figure 1 (4.3) shows an example with 23 leaves.
A central issue is the criterion for selecting the next rule to add. BL use the proxy objective of
minimising the local variance of reward predictions, which is exactly the CART algorithm (Breiman
et al., 2017). While very fast, this criterion is only loosely aligned with fidelity to D. We additionally
consider the more direct criterion of greedily maximising the immediate reduction in $\ell_{0\text{-}1}$:
$$\underset{1 \le l \le L_T,\; 1 \le f \le F,\; c \in C_f}{\arg\max}\ \ell_{0\text{-}1}(\mathcal{D}, R_T) - \ell_{0\text{-}1}(\mathcal{D}, R_{T+[lfc]}). \quad (5)$$
In Section 6.1, we show that switching to this criterion consistently improves performance on most
quantitative measures (we also tried a criterion based on $\ell_{\text{NLL}}$, but found it to be far more computationally
costly and prone to overfitting). Recursive splitting stops when either no reduction in $\ell_{0\text{-}1}$
can be achieved by any single split, or a tree size limit $L_T = L_{\max}$ is reached.
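The sketch below illustrates the greedy 0-1 criterion of Equation 5 for a single split decision, in a simplified setting where the tree is represented only by per-timestep leaf assignments; all argument names are our own, and a practical implementation would restrict candidate thresholds and cache statistics for speed rather than recomputing leaf rewards from scratch.

```python
import numpy as np

def zero_one_loss(preferences, traj_returns):
    """Number of preference labels whose direction the model gets wrong (Equation 2)."""
    return sum(int(traj_returns[j] <= traj_returns[i]) for i, j in preferences)

def best_split_for_leaf(X, traj_index, leaf_ids, l, g, preferences, num_leaves):
    """Greedy search over (f, c) for splitting leaf l so as to maximise the reduction
    in 0-1 loss (Equation 5), in a flat representation of the partition.

    X          : (M, F) array of all transition features, concatenated over trajectories
    traj_index : (M,) array giving the trajectory index of each transition
    leaf_ids   : (M,) array of current leaf assignments
    """
    in_leaf = leaf_ids == l
    T = np.bincount(traj_index)                    # trajectory lengths

    def loss_for(assign):                          # 0-1 loss under a candidate leaf assignment
        num = np.zeros(num_leaves + 1)
        den = np.zeros(num_leaves + 1)
        np.add.at(num, assign, g[traj_index] / T[traj_index])
        np.add.at(den, assign, 1.0)
        r = np.where(den > 0, num / np.maximum(den, 1), 0.0)   # leaf rewards (Equation 4)
        returns = np.bincount(traj_index, weights=r[assign], minlength=len(g))
        return zero_one_loss(preferences, returns)

    base = loss_for(leaf_ids)
    best = (0, None, None)                         # (loss reduction, feature, threshold)
    for f in range(X.shape[1]):
        vals = np.unique(X[in_leaf, f])
        for c in (vals[:-1] + vals[1:]) / 2:       # midpoints between unique values
            assign = np.where(in_leaf & (X[:, f] >= c), num_leaves, leaf_ids)  # index of the new leaf
            reduction = base - loss_for(assign)
            if reduction > best[0]:
                best = (reduction, f, c)
    return best                                    # reduction 0 means no improving split exists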
4.4 TREE PRUNING
Growth is followed by a pruning sweep which reduces the size of the tree by rule removal. Such
reduction is beneficial for both performance (Tien et al. (2022) find that increased model capacity
raises the risk of causal confusion in preference-based reward learning) and human comprehension
(in the language of Jenner & Gleave (2022), it is a form of “processing for interpretability”). Given
a tree T, one pruning operation has the effect of merging two leaves into one by removing the rule at
the common parent node. Let $\mathcal{T}$ denote the sequence of nested subtrees induced by pruning the tree
recursively back to its root, at each step removing the rule that minimises the next subtree's $\ell_{0\text{-}1}$. We
select the $T \in \mathcal{T}$ that minimises $\ell_{0\text{-}1}$, additionally regularised by a term that encourages small trees:
$\arg\min_{T \in \mathcal{T}} \big[\ell_{0\text{-}1}(\mathcal{D}, R_T) + \alpha L_T\big]$, where $\alpha \ge 0$. Note that even with $\alpha = 0$ pruning may still yield
a reduced tree, as unlike in traditional decision tree induction, the effect of individual rules on $\ell_{0\text{-}1}$
depends on the order in which they are added or removed. In the example in Figure 1 (4.4), pruning
yields a final tree with 3 leaves, for which indicative leaf-level reward predictions are shown.
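The pruning sweep can be sketched at a high level as follows. The tree is accessed through a duck-typed interface (`copy()`, `prunable_nodes()`, `remove_rule(node)`, `num_leaves`) and a caller-supplied `loss_fn`; these names are assumptions for illustration, not part of any released API.

```python
def prune(tree, loss_fn, alpha=0.0):
    """Pruning sweep (Section 4.4): greedily remove rules back to the root, then select
    the subtree minimising 0-1 loss + alpha * (number of leaves).
    loss_fn(tree) is assumed to return the 0-1 loss of the tree's reward predictions."""
    candidates = [(loss_fn(tree), tree.num_leaves, tree)]
    current = tree
    while current.num_leaves > 1:
        # Remove whichever single rule gives the next subtree the lowest 0-1 loss.
        best = None
        for node in current.prunable_nodes():
            subtree = current.copy()
            subtree.remove_rule(node)
            loss = loss_fn(subtree)
            if best is None or loss < best[0]:
                best = (loss, subtree)
        candidates.append((best[0], best[1].num_leaves, best[1]))
        current = best[1]
    # Regularised selection over the nested sequence of subtrees.
    return min(candidates, key=lambda c: c[0] + alpha * c[1])[2]
```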
5 ONLINE LEARNING SETUP
5.1 ITERATED POLICY AND REWARD LEARNING
Sections 3 and 4 do not discuss the origins of the trajectories Ξ, or how reward learning relates to
the downstream objective of learning a policy for the underlying task. Following most recent work
since Christiano et al. (2017), we resolve both questions with an online bootstrapped approach.
Assuming an episodic MDP, the $i$th episode of policy learning produces a new trajectory $\xi^i$ to add to
$\Xi$. We immediately connect $\xi^i$ to the preference graph by asking the human to compare it to $K_{\text{batch}}$
random trajectories from the existing set (while Sadigh et al. (2017) and others have proposed active
querying schemes, that is not our focus here, and this simple strategy performs satisfactorily). We
then update the reward tree on the full preference graph via the four stages given in Section 4. We
find that BL's original method of starting growth from the current state of the tree causes lock-in to
poor initial solutions, so instead re-grow from scratch on each update. The rule structure nonetheless
tends to stabilise, as the enlarging preference graph becomes increasingly similar for later updates.
For the $(i+1)$th episode, the policy learning agent then attempts to optimise for the newly-updated
reward. By iterating this process up to a total preference budget $K_{\max}$ and/or episode budget $N_{\max}$,
we hope to converge to both a reward tree that reflects the human’s preferences, and an agent policy
that satisfies those preferences. Appendix A.2 contains pseudocode for the online algorithm.
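In outline (deferring to Appendix A.2 for the exact pseudocode), the online loop interleaves episodes, preference queries and tree updates roughly as sketched below. `agent.run_episode`, `oracle.prefers` and `induce_reward_tree` are assumed placeholder interfaces standing in for the components described above, not functions from a released codebase.

```python
import random

def online_reward_tree_learning(env, agent, oracle, dataset, induce_reward_tree,
                                K_batch, K_max, N_max):
    """Iterated policy and reward learning (Section 5.1), schematically."""
    reward_tree, K = None, 0
    for episode in range(N_max):
        # 1. Roll out the current policy, optimising the current reward tree.
        xi = agent.run_episode(env, reward_tree)
        i = dataset.add_trajectory(xi)
        # 2. Connect the new trajectory to the preference graph by querying the human/oracle
        #    against K_batch randomly chosen existing trajectories.
        for j in random.sample(range(i), min(K_batch, i)):
            if K >= K_max:
                break
            if oracle.prefers(dataset.trajectories[i], dataset.trajectories[j]):
                dataset.add_preference(j, i)   # (j, i): new trajectory i is preferred
            else:
                dataset.add_preference(i, j)
            K += 1
        # 3. Re-grow and prune the reward tree from scratch on the full preference graph (stages 4.1-4.4).
        reward_tree = induce_reward_tree(dataset)
    return reward_tree
```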
5.2 INTEGRATION WITH MODEL-BASED RL
Reward learning methods are generally agnostic to the structure of the policy learning agent; this
modularity is hailed as an advantage over other human-agent teaching paradigms (Leike et al., 2018).
In line with most recent works, BL use a model-free RL agent, specifically soft actor-critic (SAC)
(Haarnoja et al., 2018). However, other works (Reddy et al., 2020; Rahtz et al., 2022) use model-
based RL (MBRL) agents that leverage learnt dynamics models and planning. MBRL is attractive in
the reward learning context because it disentangles the predictive and normative aspects of decision-
making. Since (assuming no changes to the environment) dynamics remain stationary during online
reward learning, the amount of re-learning required is reduced, and with it the risk of pitfalls
such as manipulation (Armstrong et al., 2020) and premature convergence. MBRL agents also
provide a richer target for XAI methods due to “the explicit nature of their world knowledge and of
the reasoning performed to take decisions” (Hoffmann & Magazzeni, 2019). In future work, we plan
to explore how this could be synergistic with the adoption of reward trees (see Section 7). Finally,
MBRL can be very data-efficient; we find that switching from SAC to a model-based algorithm
called PETS (Chua et al., 2018) reduces training steps by multiple orders of magnitude, and cuts
wall-clock runtime (see Appendix B). PETS selects actions by decision-time planning through a
learnt dynamics model $D' : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ up to a horizon $H$. In state $s$, planning searches for a
sequence of $H$ future actions that maximise return under the current reward model:
$$\underset{(a_0, \dots, a_{H-1}) \in \mathcal{A}^H}{\arg\max}\ \mathbb{E}_{D'} \Big[ \sum_{h=0}^{H-1} \gamma^h R_T\big(\phi(s_h, a_h, s_{h+1})\big) \Big], \quad \text{where } s_0 = s,\ s_{h+1} \sim D'(s_h, a_h). \quad (6)$$
The first action $a = a_0$ is executed, and then the agent re-plans on the next timestep. In practice, $D'$
is an ensemble of probabilistic NNs, the expectation over $D'$ is replaced by a Monte Carlo estimate,
and the optimisation is approximated by the iterative cross-entropy method.
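A stripped-down version of this planning loop (Monte Carlo return estimates through a sampled dynamics model, refined by the cross-entropy method) is sketched below; it omits the ensemble bookkeeping and particle-propagation details of full PETS, and all hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def cem_plan(s0, sample_dynamics, reward_fn, phi, act_dim, H=20, gamma=0.99,
             iters=5, pop=200, elites=20):
    """Equation 6, approximately: search for an H-step action sequence maximising
    discounted reward under a learnt dynamics model, via the cross-entropy method.

    sample_dynamics(s, a) -> next-state sample from D'(s, a)
    reward_fn(x)          -> scalar reward for a transition feature vector x = phi(s, a, s')
    """
    mean = np.zeros((H, act_dim))
    std = np.ones((H, act_dim))
    for _ in range(iters):
        plans = mean + std * np.random.randn(pop, H, act_dim)
        returns = np.zeros(pop)
        for p in range(pop):                 # Monte Carlo rollout of each candidate plan
            s = s0
            for h in range(H):
                s_next = sample_dynamics(s, plans[p, h])
                returns[p] += (gamma ** h) * reward_fn(phi(s, plans[p, h], s_next))
                s = s_next
        elite = plans[np.argsort(returns)[-elites:]]   # refit the sampling distribution to the elites
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]                           # execute the first action, then re-plan next timestep
```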
6 EXPERIMENTS AND RESULTS
In this section, we combine quantitative and qualitative evaluations to assess the performance of
reward tree learning, specifically in comparison to the standard approach of using NNs. We also
illustrate how the intrinsic interpretability of reward trees allows us to analyse what they have learnt.
Our experiments consider three episodic tasks in an aircraft handling environment, in which the
agent must manoeuvre an aircraft (the ego jet, EJ) in a desired manner relative to a second reference
jet (RJ) whose motion, if any, is considered part of the environment dynamics. These tasks provide
useful test cases for reward learning because each has a large space of plausible reward functions,
which may reflect the divergent priorities and stylistic preferences of domain experts. Such expert
knowledge is often tacit and difficult to codify by hand (Sternberg & Horvath, 1999), motivating a
data-driven approach. Figure 2 shows a schematic of each task, and Appendix C contains a broader
justification for this experimental domain alongside task and implementation details.
Figure 2: Handling tasks (state = [positions, attitudes, velocities, accelerations]; action = [pitch, roll,
yaw, thrust]). Follow: turn to fly in formation with RJ on a linear path. Chase: maintain
distance/line of sight to RJ as it turns randomly. Land: approach a runway using RJ as a reference.
In place of costly human-in-the-loop evaluation, our experiments use synthetic oracle preferences
with respect to nominal reward functions of varying complexity, which are given in Appendix C.3.
This approach is popular (Griffith et al., 2013; Christiano et al., 2017; Reddy et al., 2020; Lindner
et al., 2021) as it enables scalable systematic comparison, with the ability to quantify performance
(and in our case, appraise learnt trees) in terms of reconstruction of a known ground truth. However,
emulating a human with an oracle that responds with perfect rationality is unrealistic (Lee et al.,
2021a). For this reason, Section 6.3 examines the performance impacts of noisy and myopic oracles,
and a restricted data budget. Experimental details and hyperparameters are given in Appendix D.
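For reference, a synthetic oracle of this kind can be implemented in a few lines: a perfectly rational oracle prefers the trajectory with higher ground-truth return, while the noisy and myopic variants studied in Section 6.3 perturb this rule. The exact noise model and myopia window below are illustrative assumptions rather than the paper's precise settings.

```python
import numpy as np

class SyntheticOracle:
    """Answers pairwise preference queries from a nominal (ground-truth) reward function."""
    def __init__(self, true_reward_fn, noise_temp=0.0, myopia_window=None, seed=0):
        self.R = true_reward_fn             # maps a (T, F) feature matrix to per-timestep rewards
        self.noise_temp = noise_temp        # 0 = perfectly rational; > 0 = Boltzmann-noisy responses
        self.myopia_window = myopia_window  # e.g. 10 = judge only the last 10 timesteps
        self.rng = np.random.default_rng(seed)

    def _return(self, xi):
        r = self.R(xi)
        if self.myopia_window is not None:
            r = r[-self.myopia_window:]
        return float(np.sum(r))

    def prefers(self, xi_a, xi_b):
        """True if trajectory a is preferred to trajectory b."""
        g_a, g_b = self._return(xi_a), self._return(xi_b)
        if self.noise_temp == 0.0:
            return g_a > g_b
        p_a = 1.0 / (1.0 + np.exp((g_b - g_a) / self.noise_temp))
        return self.rng.random() < p_a
```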