traditional Q-learning: to optimize the overall reward function $R(s, a, s') = \sum_{c \in C} R_c(s, a, s')$. HRA achieves this by training several Q-functions $Q_c(s, a)$ that only account for rewards related to their component $c$. For choosing an action in the next step, the HRA agent uses the sum of these individual Q-functions: $Q_{\mathrm{HRA}}(s, a) := \sum_{c \in C} Q_c(s, a)$.
For the update target $Y^{\mathrm{DoubleDQN}}$, each head is used individually. If the underlying agent uses deep Q-learning, the different Q-functions $Q_c(s, a)$ can share multiple lower-level layers of the neural network. In this case, the collection of Q-functions, each of which accounts for one type of reward, can be viewed as a single agent with multiple heads, such that each head calculates the action-values of the current state under its own reward function.
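To make the multi-head architecture concrete, below is a minimal PyTorch sketch, assuming a flat state vector and a shared torso; the class and function names, layer sizes, and the `select_action` helper are illustrative assumptions rather than the implementation used in this work.

```python
import torch
import torch.nn as nn

class MultiHeadQNetwork(nn.Module):
    """Shared lower layers with one Q-value head per reward component (HRA-style)."""
    def __init__(self, state_dim, n_actions, n_components, hidden=256):
        super().__init__()
        # Lower-level layers shared by all reward components.
        self.torso = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head per component c, each estimating Q_c(s, .).
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_components)]
        )

    def forward(self, state):
        z = self.torso(state)
        # Returns a tensor of shape (n_components, batch, n_actions).
        return torch.stack([head(z) for head in self.heads])

def select_action(net, state):
    """Greedy action w.r.t. Q_HRA(s, a) = sum_c Q_c(s, a)."""
    with torch.no_grad():
        q_per_component = net(state.unsqueeze(0))      # (C, 1, A)
        q_hra = q_per_component.sum(dim=0).squeeze(0)  # (A,)
    return int(q_hra.argmax())
```

Each head can then be trained on its own component reward, while action selection always uses the summed Q-values.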
HRA was originally proposed to make the learning pro-
cess more efficient. However, Juozapaitis et al. (2019) sug-
gested the use of Reward Decomposition (RD) as a local ex-
planation method. Traditional Q-values do not give any in-
sight into the positive and negative factors contributing to the
agent’s decision since the individual reward components are
mixed into a single reward scalar. Showing the individual Q-
values $Q_c(s, a)$ for each reward component $c$ can explicitly
expose the different types of rewards that affect the agent’s
behavior.
This increase in explainability should not come at the cost of decreased performance. Van Seijen et al. (2017) already
showed that HRA can even result in increased performance
for Pacman. We additionally conducted a sanity check in the
Highway environment to verify that HRA results in com-
parable learning to that obtained without decomposing the
reward function (see Appendix A).
3.2 Policy Summaries
Agent strategy summarization (Amir, Doshi-Velez, and
Sarne 2019) is a paradigm for conveying the global behavior
of an agent. In this paradigm, the agent’s policy is demon-
strated through a carefully selected set of world states. The
goal of strategy summarization is to choose the subset of
state-action pairs that best describes the agent’s policy. Formally, Amir & Amir (Amir and Amir 2018) defined the set $T = \langle t_1, \ldots, t_k \rangle$ as the trajectories that are included in the summary, where each trajectory is composed of a sequence of $l$ consecutive states and the actions taken in those states, $\langle (s_i, a_i), \ldots, (s_{i+l-1}, a_{i+l-1}) \rangle$. Since it is not feasible for people to review the behavior of an agent in all possible states, the summary size is limited to $k$, i.e., $|T| = k$.
We use a summarization approach called HIGHLIGHTS
(Amir and Amir 2018) that extracts the most “important”
states from execution traces of the agent. The importance of
a state $s$ is denoted by $I(s)$ and is defined differently across environments, as it is influenced by the actions that are possible in each environment. The general idea is that a state is important if the choice of action in that state has a large impact on the agent’s outcomes.
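One common way to instantiate this notion, used by HIGHLIGHTS, is to score a state by the gap between the Q-values of its best and worst actions, and then keep short trajectories around the top-scoring states. The sketch below assumes the multi-head network from above (summing the heads before computing the gap) and a trace given as a list of (state, action) pairs; it omits details such as enforcing diversity between the selected trajectories.

```python
import torch

def state_importance(net, state):
    """I(s) = max_a Q(s, a) - min_a Q(s, a): a state matters more when
    choosing a bad action in it would cost a lot relative to the best one."""
    with torch.no_grad():
        q = net(state.unsqueeze(0)).sum(dim=0).squeeze(0)  # sum heads -> (A,)
    return float(q.max() - q.min())

def highlights_summary(net, trace, k=5, l=5):
    """Return the k length-l trajectories (consecutive state-action pairs)
    centered on the most important states of an execution trace."""
    scores = [(state_importance(net, s), i) for i, (s, _) in enumerate(trace)]
    chosen = sorted(i for _, i in sorted(scores, reverse=True)[:k])
    half = l // 2
    trajectories = []
    for i in chosen:
        start = max(0, min(i - half, len(trace) - l))
        trajectories.append(trace[start:start + l])
    return trajectories
```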
4 Integrating Policy Summaries and Reward
Decomposition
We combined HIGHLIGHTS as a global explanation with
reward decomposition as a local explanation. We used
HIGHLIGHTS to find the most important states during the
agents’ gameplay. For each state that was chosen, we cre-
ated reward decomposition bars that depict the decomposed
Q-values for actions in the chosen state (see Figures 1 and
2). We chose to combine these two types of explanations be-
cause we believe they complement each other. Reward de-
composition reflects the intentions of an agent while HIGH-
LIGHTS gives a broader perspective on the agent’s deci-
sions.
HIGHLIGHTS summaries are typically shown as videos.
However, the reward decomposition bars are static, and vary
for each state. Therefore, when integrating the two methods,
we used HIGHLIGHTS to extract the important states, but
displayed them using static images rather than videos.
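As a concrete sketch of this integration (the function below and its use of matplotlib are our own illustration, not the study’s code), each selected state can be rendered as a static image with one group of bars per action, showing the decomposed Q-values:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reward_decomposition(q_per_component, actions, components):
    """Grouped bar chart of Q_c(s, a) for a single selected state.

    q_per_component: array of shape (n_components, n_actions).
    """
    q = np.asarray(q_per_component)
    n_components, n_actions = q.shape
    x = np.arange(n_actions)
    width = 0.8 / n_components
    for c in range(n_components):
        plt.bar(x + c * width, q[c], width, label=components[c])
    plt.xticks(x + width * (n_components - 1) / 2, actions)
    plt.ylabel("Decomposed Q-value")
    plt.legend()
    plt.tight_layout()
    plt.show()
```

Producing one such chart for every HIGHLIGHTS-selected state yields the static images described above.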
5 Empirical Methodology
To evaluate the benefits of integrating HIGHLIGHTS with
reward decomposition as well as their respective contribu-
tions to users’ understanding of agents’ behavior, we con-
ducted two user studies in which participants were asked
to evaluate the preferences of different agents. We hypoth-
esized that the combined explanations would best support
participants’ ability to correctly identify agents’ preferences
and that both the local and global explanations would be bet-
ter than the baseline information.
5.1 Experimental Environments and Agent
Training
Highway Environment We used a multi-lane Highway
environment (shown in the top part of Figure 1) for our first
experiments. The environment can be modified by setting
different variables such as the number of vehicles, vehicle
density, rewards, speed range, and more. Our settings can be
found in the appendix.
In the environment, the RL agent controls the green ve-
hicle. The objective of the agent is to maximize its reward
by navigating a multi-lane highway while driving alongside
other (blue) vehicles. Positive rewards can be given for each
of the following actions: changing lanes (CL), speeding up
(SU), and moving to the right-most lane (RML). Therefore, we set $|C| = 3$, with one reward component per positive reward type. We
trained the agents as described in Section 3. The network in-
put is an array of size 25 ($5 \times 5$) that represents the state. The input layer is followed by two fully connected hidden layers with 256 units each. The second of these layers is connected to three heads. Each head consists of a linear layer and outputs a Q-value vector of length 5 containing the action-values for the following actions: lane left, idle, lane right, faster, slower.
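For completeness, the described network can be written down as a specialization of the generic multi-head sketch from Section 3; the class name and the PyTorch framing are our assumptions.

```python
import torch.nn as nn

class HighwayHRANetwork(nn.Module):
    """5x5 state input, two shared 256-unit layers, and three reward heads
    (CL, SU, RML), each outputting 5 action-values:
    lane left, idle, lane right, faster, slower."""
    def __init__(self):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Flatten(),                      # 5x5 observation -> 25 features
            nn.Linear(25, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Linear(256, 5) for _ in range(3)])

    def forward(self, obs):
        z = self.torso(obs)
        return [head(z) for head in self.heads]  # one Q-vector per component
```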
We trained the following four RL agents which differ in
their policies:
1. The Good Citizen - Highest reward for being in the right-most lane, then for changing lanes, and lastly for speeding up.
2. Fast And Furious - Highest reward for speeding up, then for changing lanes, and lastly for being in the right-most lane.