
2 BACKGROUND AND RELATED WORK
Markov Decision Processes (MDPs) In this canonical formulation of sequential decision making,
the state of a system at discrete time $t$, $s_t \in \mathcal{S}$, and the action of an agent, $a_t \in \mathcal{A}$, condition the
successor state $s_{t+1}$ according to a dynamics function $D : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ (we use $\Delta(\cdot)$ to denote
the set of all probability distributions over a set). A reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ then
outputs a scalar reward $r_{t+1}$ given $s_t$, $a_t$ and $s_{t+1}$. RL algorithms use exploratory data collection to
learn action-selection policies $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, with the goal of maximising the expected discounted
sum of future reward $\mathbb{E}_{D,\pi}\!\left[\sum_{h=0}^{\infty} \gamma^h r_{t+h+1}\right]$, $\gamma \in [0,1]$.
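As a concrete illustration (a minimal Python sketch with placeholder rewards and discount factor, not taken from the paper), the objective above reduces to the following computation on a single sampled finite trajectory:

```python
# Minimal sketch: the discounted return that the expectation above averages
# over trajectories sampled from the dynamics D and policy pi.
# `rewards` = [r_{t+1}, r_{t+2}, ...] and `gamma` are illustrative placeholders.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^h * r_{t+h+1} over a (finite) sampled trajectory."""
    return sum((gamma ** h) * r for h, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ~0.81
```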
Reward Learning In the usual MDP framing, $R$ is an immutable property of the environment,
which belies the practical fact that AI objectives originate in the uncertain goals and preferences of
fallible humans (Russell, 2019). Reward learning (or modelling) (Leike et al., 2018) replaces hand-
specified reward functions with models learnt from humans via revealed preference cues such as
demonstrations (Ng et al., 2000), scalar evaluations (Knox & Stone, 2008), approval labels (Griffith
et al., 2013), corrections (Bajcsy et al., 2017), and preference rankings (Christiano et al., 2017).
XAI for RL (XRL) Surveys of XAI for RL (Puiutta & Veith, 2020; Heuillet et al., 2021) tax-
onomise a diverse and expanding range of methods. A key division is between intrinsic approaches,
which imbue agents with structure such as object-oriented representations (Zhu et al., 2018) or
symbolic policy primitives (Verma et al., 2018), and post hoc (often visual) analyses of learnt rep-
resentations (Zahavy et al., 2016), including computing feature importance/saliency (Huber et al.,
2019). Spatiotemporal scope varies from the local explanation of single actions (van der Waa et al.,
2018) to the global summary of entire policies by showing representative trajectories (Amir & Amir,
2018) or critical states (Huang et al., 2018). While most post hoc methods focus on single policies,
some provide insight into the dynamics of agent learning (Dao et al., 2018; Bewley et al., 2022).
Explainable Reward Functions At the intersection of reward learning and XRL lie efforts to im-
prove human understanding of reward functions and their effects on action selection. While this
area is “less developed” than other XRL sub-fields (Glanois et al., 2021), a distinction has again
emerged between intrinsic approaches which create rewards that decompose into semantic com-
ponents (Juozapaitis et al., 2019) or optimise for sparsity (Devidze et al., 2021), and post hoc ap-
proaches which apply feature importance analysis (Russell & Santos, 2019), counterfactual probing
(Michaud et al., 2020), or simplifying transformations (Jenner & Gleave, 2022). Sanneman & Shah
(2022) use a set of human-oriented metrics to compare the efficacy of reward explanation techniques.
Trees for Explainable Agency Prior uses of tree models in XRL again divide into intrinsic meth-
ods, in which an agent’s policy (Silva et al., 2020), value function (Liu et al., 2018; Roth et al., 2019)
or dynamics model (Jiang et al., 2019) is implemented as a tree, and post hoc tree approximations of
an existing agent’s (usually NN) policy (Bastani et al., 2018; Coppens et al., 2019) or transitions in
the environment (Bewley et al., 2022). Related to our focus on human-centric learning: Cobo et al.
(2012) learn tree-structured MDP abstractions from human demonstrations; Lafond et al. (2013) use
tree models to model expert judgements in a naval air defence setting; and Tambwekar et al. (2021)
warm-start RL by converting a natural language specification into a differentiable tree policy.
3 PREFERENCE-BASED REWARD LEARNING
We adopt the preference-based approach to reward learning, in which a human is presented with
pairs of agent trajectories (sequences of state, action, next state transitions) and expresses which of
the two they prefer as a solution to a given task of interest. A reward function is then learnt to explain
the pattern of preferences. This approach is popular in the existing literature (Wirth et al., 2016;
Christiano et al., 2017; Lee et al., 2021b) and has a firm psychological basis. Experimental results
indicate that humans find it cognitively easier to make relative (vs. absolute) quality judgements
(Kendall, 1975; Wilde et al., 2020) and exhibit lower variance when doing so (Guo et al., 2018).
This is partly because relative judgements do not require an absolute scale to be held in working
memory, where it is liable to shift over time and induce bias (Eric et al., 2007).
We formalise a trajectory $\xi^i$ as a sequence $(x^i_1, \ldots, x^i_{T_i})$, where $x^i_t = \phi(s^i_{t-1}, a^i_{t-1}, s^i_t) \in \mathbb{R}^F$ repre-
sents a single transition as an $F$-dimensional feature vector. Given $N$ trajectories, $\Xi = \{\xi^i\}_{i=1}^{N}$, we
assume the human provides $K \leq N(N-1)/2$ pairwise preference labels, $\mathcal{L} = \{(i, j)\}_{k=1}^{K}$, each
of which indicates that the $j$th trajectory is preferred to the $i$th (denoted by $\xi^j \succ \xi^i$). Figure 1 (left)
shows how a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$ can be viewed as a directed graph.
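For illustration only (the variable names and random features below are ours, not the paper's), a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$ of this kind can be assembled and read as a directed graph roughly as follows:

```python
import numpy as np

# Illustrative sketch of the preference dataset D = (Xi, L); names are placeholders.
# Each trajectory xi^i is stored as a (T_i, F) matrix of transition features
# x^i_t = phi(s_{t-1}, a_{t-1}, s_t).
F = 4  # feature dimensionality (placeholder)
Xi = [np.random.rand(T, F) for T in (10, 12, 8)]  # N = 3 trajectories

# Preference labels L = {(i, j)}: each pair means trajectory j is preferred to i.
L = [(0, 1), (0, 2), (2, 1)]
N, K = len(Xi), len(L)
assert K <= N * (N - 1) // 2  # at most one label per unordered pair

# Directed-graph view (cf. Figure 1, left): here we draw an edge i -> j for each
# label (i, j), i.e. pointing towards the preferred trajectory.
adjacency = {i: [] for i in range(N)}
for i, j in L:
    adjacency[i].append(j)
print(adjacency)  # {0: [1, 2], 1: [], 2: [1]}
```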