Integrating Policy Summaries with Reward Decomposition for Explaining Reinforcement Learning Agents
Yael Septon,1 Tobias Huber,2 Elisabeth André,2 Ofra Amir1
1Technion, Israel Institute of Technology
2Chair for Human-Centered Artificial Intelligence, University of Augsburg
{yael123, oamir}@technion.ac.il, {tobias.huber, andre}@informatik.uni-augsburg.de

arXiv:2210.11825v1 [cs.LG] 21 Oct 2022
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
Explaining the behavior of reinforcement learning agents operating in sequential decision-making settings is challenging, as their behavior is affected by a dynamic environment and delayed rewards. Methods that help users understand the behavior of such agents can roughly be divided into local explanations that analyze specific decisions of the agents and global explanations that convey the general strategy of the agents. In this work, we study a novel combination of local and global explanations for reinforcement learning agents. Specifically, we combine reward decomposition, a local explanation method that exposes which components of the reward function influenced a specific decision, and HIGHLIGHTS, a global explanation method that shows a summary of the agent's behavior in decisive states. We conducted two user studies to evaluate the integration of these explanation methods and their respective benefits. Our results show significant benefits for both methods. In general, we found that the local reward decomposition was more useful for identifying the agents' priorities. However, when there was only a minor difference between the agents' preferences, the global information provided by HIGHLIGHTS additionally improved participants' understanding.
1 Introduction
Artificial Intelligence (AI) agents are being deployed in a variety of domains such as self-driving cars, medical care, home assistance, and more. With the advancement of such agents, the need to improve people's understanding of their behavior has become more apparent. In this work, we focus on explaining the behavior of agents that operate in sequential decision-making settings and are trained in a deep reinforcement learning (RL) framework. This is challenging, as the behavior of RL agents is affected by the dynamics of the environment, the reward specification, and their ability to attribute delayed outcomes to their actions.
We study the effectiveness of providing users with global and local explanations of the behavior of RL agents. Global explanations convey the general behavior of the agent, e.g., by describing decision rules or strategies. In contrast, local explanations try to explain specific decisions that an agent makes. While local explanations can provide detailed information about single decisions of an agent, they do not provide any information about its behavior in different contexts.
Prior work investigated combining the complementary benefits of global and local explanations by pairing local saliency maps, which highlight relevant features within the input, with global policy summaries that demonstrate the behavior of agents in a selected set of world states (Huber et al. 2021). While the results were promising, the local saliency maps fell short because they were hard for users to interpret correctly.
In this paper, we propose and evaluate a novel combination of global policy summaries with local explanations based on reward decomposition. Reward decomposition aims to reveal the agent's reasoning in particular situations by decomposing the rewards into reward components, explicitly revealing which reward components the agent expects from each action (Juozapaitis et al. 2019). For example, in a driving environment, such reward components could be a reward for driving safely, a reward for driving fast, etc. Since reward decomposition is more explicit in reflecting the agent's decision-making than saliency maps, we hypothesized that it would be easier to interpret. Furthermore, reward decomposition is incorporated directly into the agent's underlying decision model through its training. In contrast, saliency maps are generated after the agent is trained and might not be faithful to the agent's underlying decision model (Rudin 2019; Huber, Limmer, and André 2022). Therefore, we hypothesized that a combination of policy summaries with reward decomposition would improve users' understanding of the agents' strategy compared to only using local or global explanations.
We conducted two user studies in which participants were randomly assigned to one of four different conditions that vary in the combination of global and local information: (1) being presented or not presented with a local explanation (reward decomposition), and (2) being presented with a global explanation in the form of a HIGHLIGHTS policy summary (Amir and Amir 2018) or being presented with frequent states the agent encounters (a baseline for conveying global information). We used a Highway and a Pacman environment and trained agents that varied in their priorities by modifying the reward function. Participants were asked to determine the priorities of these agents based on the explanations in their condition.
Our results show that the use of reward decomposition as a local explanation helped users comprehend the agents' preferences. In addition, the HIGHLIGHTS global explanation helped users understand the agents' preferences in the Pacman environment. While we found that the benefit of presenting reward decomposition was greater than that of providing HIGHLIGHTS summaries, the combined explanations further helped users distinguish between the agents' priorities when there was only a minor difference between the agents' preferences.
2 Related Work
Explainable reinforcement learning methods can broadly be divided into two classes based on their scope: local and global explanations (Molnar 2022). Local explanations analyze specific actions of the agent. They are often further divided into post-hoc and intrinsic explainability methods. Post-hoc methods analyze the agent after training, for example by creating saliency maps (Hilton et al. 2020; Huber, Schiller, and André 2019; Puri et al. 2020). Intrinsic methods are built into the agent's underlying decision model to make it more explainable. In reinforcement learning, for example, this is done for causal explanations (Madumal et al. 2020) or by reward decomposition (Juozapaitis et al. 2019). In this paper, we focus on reward decomposition as a local explanation method. In reward decomposition, the Q-value is decomposed into several components (Van Seijen et al. 2017). This approach has been used for explainability (Juozapaitis et al. 2019), as the decomposition of the reward function into meaningful reward types can help reveal which rewards an agent expects from different actions. A user study exploring the usefulness of different local RL explainability methods showed that reward decomposition contributed to people's understanding of agent behavior (Anderson et al. 2019).
Global explanations attempt to describe the high-level policy of an agent, for example by extracting logical rules that describe the agent's strategy (Booth, Muise, and Shah 2019). Other work aimed to enable users to correctly anticipate a robot's behavior in novel situations by selecting states that are optimized to allow reconstruction of the agent's policy (Huang et al. 2019). In this work, we utilize strategy summarization (Amir, Doshi-Velez, and Sarne 2019) as a global explanation method. Strategy summaries demonstrate an agent's behavior in carefully selected world states. The states can be selected based on different criteria, e.g., state importance (Amir and Amir 2018) or machine teaching approaches (Lage et al. 2019).
Most closely related to our work is a study by Huber et al. (2021), which combined local and global explanation methods for RL agents. They used strategy summaries (global explanation) with saliency maps (local explanation). Since this study showed that saliency maps fall short as local explanations, we study the integration of reward decomposition as an alternative local explanation, together with policy summaries. In addition, reward decomposition has the advantage of faithfully representing the underlying decision model, while saliency maps are post-hoc explanations that may not accurately reflect what the model learned (Rudin 2019; Huber, Limmer, and André 2022).
3 Background
We assume a Markov Decision Process (MDP) setting. Formally, an MDP is defined by a tuple $\langle S, A, R, Tr \rangle$:
S: the set of states.
A: the set of actions.
$R(s, a, s')$: the reward received after transitioning from state $s$ to state $s'$ due to action $a$.
$Tr$: a transition probability function $Tr(s, a, s'): S \times A \times S \to [0, 1]$ defining the probability of transitioning to state $s'$ after taking action $a$ in $s$.
The Q-function is defined as the expected value of taking action $a$ in state $s$ under a policy $\pi$ throughout an infinite horizon, using a discount factor $\gamma$:
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \mid s_t = s, a_t = a\right]$.
To learn this Q-function, we used deep Q-networks (DQN) (Mnih et al. 2015). A DQN is a multi-layered neural network that, for a given state $s$ and action $a$, outputs the Q-value $Q(s, a; \theta)$, where $\theta$ are the parameters of the network. During training, the DQN contains two networks, the target network and the value network. The target network, with parameters $\theta^{-}$, is the same as the value network except that its parameters are copied from the value network every $\tau$ steps, i.e., $\theta^{-}_{t} = \theta_{t}$, and kept fixed on all other steps. The value network is trained by minimizing the sequence of loss functions $L_t = \mathbb{E}_{s, a, r, s'}\big[(Y^{DQN}_{t} - Q(s, a; \theta_t))^2\big]$, where the target $Y^{DQN}_{t}$ is given by the target network: $Y^{DQN}_{t} \equiv R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta^{-}_{t})$. In this work, we use an improvement of the DQN called Double DQN (Van Hasselt, Guez, and Silver 2016). Double DQN replaces the target $Y^{DQN}_{t}$ with $Y^{DoubleDQN}_{t} \equiv R_{t+1} + \gamma\, Q(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1}, a; \theta_t); \theta^{-}_{t})$. The update to the target network stays unchanged from DQN and remains a periodic copy of the value network.
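To make the Double DQN update concrete, the following minimal PyTorch sketch (our illustration, not the authors' code; `value_net`, `target_net`, and the batch tensors are assumed inputs) computes the target: the value network selects the next action and the target network evaluates it.

```python
import torch

def double_dqn_target(value_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Y_t^DoubleDQN = R_{t+1} + gamma * Q(S_{t+1}, argmax_a Q(S_{t+1}, a; theta_t); theta_t^-)."""
    with torch.no_grad():
        # The value network selects the greedy action in the next state ...
        best_actions = value_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the periodically copied target network evaluates that action.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        # No bootstrapping through terminal transitions.
        return rewards + gamma * (1.0 - dones) * next_q
```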
The exact architecture of the networks we used was specific to each environment and will be described in the corresponding sections.
3.1 Reward Decomposition
Van Seijen et al. (2017) proposed the Hierarchical Reward Architecture (HRA) model. HRA takes as input a decomposed reward function and learns a separate Q-function for each reward component. In a game like Pacman, such reward components could, for instance, correspond to dying or reaching specific goals. Because each component typically depends only on a subset of all features, the corresponding Q-function can be approximated more easily by a low-dimensional representation, enabling more effective learning.
This can be incorporated in the MDP formulation by specifying a set of reward components $C$ and decomposing the reward function $R$ into $|C|$ reward functions $R_c(s, a, s')$. The objective for the HRA agent remains the same as for traditional Q-learning: to optimize the overall reward function $R(s, a, s') = \sum_{c \in C} R_c(s, a, s')$. HRA achieves this by training several Q-functions $Q_c(s, a)$ that only account for rewards related to their component $c$. When choosing an action for the next step, the HRA agent uses the sum of these individual Q-functions: $Q_{HRA}(s, a) := \sum_{c \in C} Q_c(s, a)$. For the update, $Y^{DoubleDQN}_{t}$ is applied to each head individually. If the underlying agent uses deep Q-learning, the different Q-functions $Q_c(s, a)$ can share multiple lower-level layers of the neural network. In this case, the collection of Q-functions, each accounting for one type of reward, can be viewed as a single agent with multiple heads, such that each head calculates the action-values of the current state under its own reward function.
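To illustrate the multi-head view described above, the sketch below (a simplification under our own assumptions about class and layer names, not the HRA reference implementation) shares two hidden layers across the per-component heads and sums their Q-values for action selection.

```python
import torch
import torch.nn as nn

class HRAQNetwork(nn.Module):
    """Multi-head Q-network: one head per reward component c, sharing lower layers."""
    def __init__(self, state_dim, n_actions, n_components, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Each head outputs Q_c(s, a) for all actions of its reward component.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_components)]
        )

    def forward(self, state):
        z = self.shared(state)
        # Shape (n_components, batch, n_actions); each head is updated individually.
        return torch.stack([head(z) for head in self.heads])

    def q_hra(self, state):
        # Q_HRA(s, a) = sum_c Q_c(s, a); the agent acts greedily on this sum.
        return self.forward(state).sum(dim=0)
```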
HRA was originally proposed to make the learning process more efficient. However, Juozapaitis et al. (2019) suggested the use of Reward Decomposition (RD) as a local explanation method. Traditional Q-values do not give any insight into the positive and negative factors contributing to the agent's decision, since the individual reward components are mixed into a single reward scalar. Showing the individual Q-values $Q_c(s, a)$ for each reward component $c$ can explicitly expose the different types of rewards that affect the agent's behavior.
This increase in explainability should not come at the cost of performance. Van Seijen et al. (2017) already showed that HRA can even improve performance in Pacman. We additionally conducted a sanity check in the Highway environment to verify that HRA results in learning comparable to that obtained without decomposing the reward function (see Appendix A).
3.2 Policy Summaries
Agent strategy summarization (Amir, Doshi-Velez, and Sarne 2019) is a paradigm for conveying the global behavior of an agent. In this paradigm, the agent's policy is demonstrated through a carefully selected set of world states. The goal of strategy summarization is to choose the subset of state-action pairs that best describes the agent's policy. Formally, Amir and Amir (2018) defined the set $T = \langle t_1, \ldots, t_k \rangle$ as the trajectories included in the summary, where each trajectory is composed of a sequence of $l$ consecutive states and the actions taken in those states, $\langle (s_i, a_i), \ldots, (s_{i+l-1}, a_{i+l-1}) \rangle$. Since it is not feasible for people to review the behavior of an agent in all possible states, $k$ is defined as the size of the summary, i.e., $|T| = k$.
We use a summarization approach called HIGHLIGHTS (Amir and Amir 2018) that extracts the most “important” states from execution traces of the agent. The importance of a state $s$ is denoted by $I(s)$ and is defined differently for each environment, as it is influenced by the actions that are possible in that environment. The general idea is that a state is important if the choice of action in that state has a large impact on the outcome.
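As a concrete example of importance-based selection, the sketch below assumes the Q-value-gap importance used by Amir and Amir (2018), $I(s) = \max_a Q(s,a) - \min_a Q(s,a)$, and omits the surrounding-context trajectories and diversity handling of the full HIGHLIGHTS algorithm; the function names are ours.

```python
import heapq

def importance(q_values):
    """I(s): gap between the best and worst action value in state s."""
    return max(q_values) - min(q_values)

def highlights_states(trace, k):
    """Return the k most important (state, q_values) pairs from an execution trace.

    This keeps only the importance ranking; the full algorithm also stores short
    trajectories around each selected state and avoids near-duplicate states.
    """
    return heapq.nlargest(k, trace, key=lambda item: importance(item[1]))
```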
4 Integrating Policy Summaries and Reward Decomposition
We combined HIGHLIGHTS as a global explanation with reward decomposition as a local explanation. We used HIGHLIGHTS to find the most important states during the agents' gameplay. For each chosen state, we created reward decomposition bars that depict the decomposed Q-values for the actions in that state (see Figures 1 and 2). We chose to combine these two types of explanations because we believe they complement each other: reward decomposition reflects the intentions of an agent, while HIGHLIGHTS gives a broader perspective on the agent's decisions.

HIGHLIGHTS summaries are typically shown as videos. However, the reward decomposition bars are static and vary for each state. Therefore, when integrating the two methods, we used HIGHLIGHTS to extract the important states but displayed them as static images rather than videos.
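The glue between the two methods can be as simple as the following sketch (our own assumptions about the data layout and function names, not the authors' exact pipeline): for each state image selected by HIGHLIGHTS, plot a grouped bar chart of the decomposed Q-values per action next to the state.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_decomposition_bars(state_image, q_per_component, actions, components):
    """Show a selected state next to its decomposed Q-values.

    q_per_component: array of shape (n_components, n_actions) with Q_c(s, a).
    """
    fig, (ax_state, ax_bars) = plt.subplots(1, 2, figsize=(10, 4))
    ax_state.imshow(state_image)
    ax_state.axis("off")

    # One group of bars per action, one bar per reward component.
    width = 0.8 / len(components)
    x = np.arange(len(actions))
    for i, name in enumerate(components):
        ax_bars.bar(x + i * width, q_per_component[i], width, label=name)
    ax_bars.set_xticks(x + 0.4 - width / 2)
    ax_bars.set_xticklabels(actions)
    ax_bars.set_ylabel("Decomposed Q-value")
    ax_bars.legend()
    return fig
```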
5 Empirical Methodology
To evaluate the benefits of integrating HIGHLIGHTS with reward decomposition, as well as their respective contributions to users' understanding of agents' behavior, we conducted two user studies in which participants were asked to evaluate the preferences of different agents. We hypothesized that the combined explanations would best support participants' ability to correctly identify agents' preferences and that both the local and global explanations would be better than the baseline information.
5.1 Experimental Environments and Agent Training
Highway Environment We used a multi-lane Highway environment (shown in the top part of Figure 1) for our first experiments. The environment can be modified by setting different variables such as the number of vehicles, vehicle density, rewards, speed range, and more. Our settings can be found in the appendix.
In this environment, the RL agent controls the green vehicle. The objective of the agent is to maximize its reward by navigating a multi-lane highway while driving alongside other (blue) vehicles. Positive rewards can be given for each of the following actions: changing lanes (CL), speeding up (SU), and moving to the right-most lane (RML). We therefore set $|C| = 3$, one component per positive reward type. We trained the agents as described in Section 3. The network input is an array of size 25 (5×5) that represents the state. The input layer is followed by two fully connected hidden layers of size 256. The last of these two layers is connected to three heads. Each head consists of a linear layer and outputs a Q-value vector of length 5 containing the following actions: lane left, idle, lane right, faster, slower.
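For concreteness, this corresponds to instantiating the multi-head sketch from Section 3.1 (our hypothetical `HRAQNetwork`, not the authors' code) with the sizes described above.

```python
# Flattened 5x5 observation, two shared hidden layers of 256 units, and three heads
# of five Q-values each: [lane_left, idle, lane_right, faster, slower].
highway_net = HRAQNetwork(state_dim=25, n_actions=5, n_components=3, hidden=256)
```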
We trained the following four RL agents, which differ in their policies:
1. The Good Citizen - Highest reward for being in the right-most lane, then for changing lanes, and lastly for speeding up.
2. Fast And Furious - Highest reward for speeding up, then for changing lanes, and lastly for being in the right-most lane.