traditional Q-learning: to optimize the overall reward function $R(s, a, s') = \sum_{c \in C} R_c(s, a, s')$. HRA achieves this by training several Q-functions $Q_c(s, a)$ that only account for rewards related to their component $c$. For choosing an action in the next step, the HRA agent uses the sum of these individual Q-functions: $Q_{\mathrm{HRA}}(s, a) := \sum_{c \in C} Q_c(s, a)$.
For the update target $Y^{\mathrm{DoubleDQN}}$, each head is used individually. If the underlying agent uses deep Q-learning, the different Q-functions $Q_c(s, a)$ can share multiple lower-level layers of the neural network. In this case, the collection of Q-functions, each of which accounts for one type of reward, can be viewed as a single agent with multiple heads, such that each head calculates the action-values of the current state under its own reward function.
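To make the multi-head architecture concrete, below is a minimal PyTorch sketch, assuming a flat state vector and a shared torso; the class and function names, layer sizes, and the `select_action` helper are illustrative assumptions rather than the implementation used in this work.

```python
import torch
import torch.nn as nn

class MultiHeadQNetwork(nn.Module):
    """Shared lower layers with one Q-value head per reward component (HRA-style)."""
    def __init__(self, state_dim, n_actions, n_components, hidden=256):
        super().__init__()
        # Lower-level layers shared by all reward components.
        self.torso = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head per component c, each estimating Q_c(s, .).
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_components)]
        )

    def forward(self, state):
        z = self.torso(state)
        # Returns a tensor of shape (n_components, batch, n_actions).
        return torch.stack([head(z) for head in self.heads])

def select_action(net, state):
    """Greedy action w.r.t. Q_HRA(s, a) = sum_c Q_c(s, a)."""
    with torch.no_grad():
        q_per_component = net(state.unsqueeze(0))      # (C, 1, A)
        q_hra = q_per_component.sum(dim=0).squeeze(0)  # (A,)
    return int(q_hra.argmax())
```

Each head can then be trained on its own component reward, while action selection always uses the summed Q-values.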
HRA was originally proposed to make the learning pro-
cess more efficient. However, Juozapaitis et al. (2019) sug-
gested the use of Reward Decomposition (RD) as a local ex-
planation method. Traditional Q-values do not give any in-
sight into the positive and negative factors contributing to the
agent’s decision since the individual reward components are
mixed into a single reward scalar. Showing the individual Q-
values $Q_c(s, a)$ for each reward component $c$ can explicitly
expose the different types of rewards that affect the agent’s
behavior.
This increase in explainability should not come at the cost of decreased performance. Van Seijen et al. (2017) already
showed that HRA can even result in increased performance
for Pacman. We additionally conducted a sanity check in the
Highway environment to verify that HRA results in com-
parable learning to that obtained without decomposing the
reward function (see Appendix A).
3.2 Policy Summaries
Agent strategy summarization (Amir, Doshi-Velez, and
Sarne 2019) is a paradigm for conveying the global behavior
of an agent. In this paradigm, the agent’s policy is demon-
strated through a carefully selected set of world states. The
goal of strategy summarization is to choose the subset of
state-action pairs that best describes the agent’s policy. Formally, Amir & Amir (Amir and Amir 2018) defined the set $T = \langle t_1, \ldots, t_k \rangle$ as the trajectories that are included in the summary, where each trajectory is composed of a sequence of $l$ consecutive states and the actions taken in those states, $\langle (s_i, a_i), \ldots, (s_{i+l-1}, a_{i+l-1}) \rangle$. Since it is not feasible for people to review the behavior of an agent in all possible states, the summary size is limited to $k$, i.e., $|T| = k$.
We use a summarization approach called HIGHLIGHTS
(Amir and Amir 2018) that extracts the most “important”
states from execution traces of the agent. The importance of
a state $s$ is denoted by $I(s)$ and is defined differently across environments, as it is influenced by the actions that are possible in each environment. The general idea is that a state is important if the choice of action in that state has a large impact on the agent’s outcomes.
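One common way to instantiate this notion, used by HIGHLIGHTS, is to score a state by the gap between the Q-values of its best and worst actions, and then keep short trajectories around the top-scoring states. The sketch below assumes the multi-head network from above (summing the heads before computing the gap) and a trace given as a list of (state, action) pairs; it omits details such as enforcing diversity between the selected trajectories.

```python
import torch

def state_importance(net, state):
    """I(s) = max_a Q(s, a) - min_a Q(s, a): a state matters more when
    choosing a bad action in it would cost a lot relative to the best one."""
    with torch.no_grad():
        q = net(state.unsqueeze(0)).sum(dim=0).squeeze(0)  # sum heads -> (A,)
    return float(q.max() - q.min())

def highlights_summary(net, trace, k=5, l=5):
    """Return the k length-l trajectories (consecutive state-action pairs)
    centered on the most important states of an execution trace."""
    scores = [(state_importance(net, s), i) for i, (s, _) in enumerate(trace)]
    chosen = sorted(i for _, i in sorted(scores, reverse=True)[:k])
    half = l // 2
    trajectories = []
    for i in chosen:
        start = max(0, min(i - half, len(trace) - l))
        trajectories.append(trace[start:start + l])
    return trajectories
```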
4 Integrating Policy Summaries and Reward
Decomposition
We combined HIGHLIGHTS as a global explanation with
reward decomposition as a local explanation. We used
HIGHLIGHTS to find the most important states during the
agents’ gameplay. For each state that was chosen, we cre-
ated reward decomposition bars that depict the decomposed
Q-values for actions in the chosen state (see Figures 1 and
2). We chose to combine these two types of explanations be-
cause we believe they complement each other. Reward de-
composition reflects the intentions of an agent while HIGH-
LIGHTS gives a broader perspective on the agent’s deci-
sions.
HIGHLIGHTS summaries are typically shown as videos.
However, the reward decomposition bars are static, and vary
for each state. Therefore, when integrating the two methods,
we used HIGHLIGHTS to extract the important states, but
displayed them using static images rather than videos.
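As a concrete sketch of this integration (the function below and its use of matplotlib are our own illustration, not the study’s code), each selected state can be rendered as a static image with one group of bars per action, showing the decomposed Q-values:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reward_decomposition(q_per_component, actions, components):
    """Grouped bar chart of Q_c(s, a) for a single selected state.

    q_per_component: array of shape (n_components, n_actions).
    """
    q = np.asarray(q_per_component)
    n_components, n_actions = q.shape
    x = np.arange(n_actions)
    width = 0.8 / n_components
    for c in range(n_components):
        plt.bar(x + c * width, q[c], width, label=components[c])
    plt.xticks(x + width * (n_components - 1) / 2, actions)
    plt.ylabel("Decomposed Q-value")
    plt.legend()
    plt.tight_layout()
    plt.show()
```

Producing one such chart for every HIGHLIGHTS-selected state yields the static images described above.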
5 Empirical Methodology
To evaluate the benefits of integrating HIGHLIGHTS with
reward decomposition as well as their respective contribu-
tions to users’ understanding of agents’ behavior, we con-
ducted two user studies in which participants were asked
to evaluate the preferences of different agents. We hypoth-
esized that the combined explanations would best support
participants’ ability to correctly identify agents’ preferences
and that both the local and global explanations would be bet-
ter than the baseline information.
5.1 Experimental Environments and Agent
Training
Highway Environment We used a multi-lane Highway
environment (shown in the top part of Figure 1) for our first
experiments. The environment can be modified by setting
different variables such as the number of vehicles, vehicle
density, rewards, speed range, and more. Our settings can be
found in the appendix.
In the environment, the RL agent controls the green ve-
hicle. The objective of the agent is to maximize its reward
by navigating a multi-lane highway while driving alongside
other (blue) vehicles. Positive rewards can be given for each
of the following actions: changing lanes (CL), speeding up
(SU), and moving to the right-most lane (RML). Therefore, we set $|C| = 3$, with one reward component per positive reward type. We
trained the agents as described in Section 3. The network in-
put is an array of size 25 ($5 \times 5$) that represents the state. The input layer is followed by two fully connected hidden layers with 256 units each. The second of these layers is connected to three heads. Each head consists of a linear layer and outputs a Q-value vector of length 5 containing the action-values for the following actions: lane left, idle, lane right, faster, slower.
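For completeness, the described network can be written down as a specialization of the generic multi-head sketch from Section 3; the class name and the PyTorch framing are our assumptions.

```python
import torch.nn as nn

class HighwayHRANetwork(nn.Module):
    """5x5 state input, two shared 256-unit layers, and three reward heads
    (CL, SU, RML), each outputting 5 action-values:
    lane left, idle, lane right, faster, slower."""
    def __init__(self):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Flatten(),                      # 5x5 observation -> 25 features
            nn.Linear(25, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Linear(256, 5) for _ in range(3)])

    def forward(self, obs):
        z = self.torso(obs)
        return [head(z) for head in self.heads]  # one Q-vector per component
```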
We trained the following four RL agents which differ in
their policies:
1. The Good Citizen - Highest reward for being in the right-most lane, then for changing lanes, and lastly for speeding up.
2. Fast And Furious - Highest reward for speeding up, then for changing lanes, and lastly for being in the right-most lane.