Causal Explanation for Reinforcement Learning:
Quantifying State and Temporal Importance
Xiaoxiao Wang,1 Fanyu Meng,1 Xin Liu,1 Zhaodan Kong,1 Xin Chen2
1University of California, Davis
2Georgia Institute of Technology
{xxwa, fymeng, xinliu, zdkong}@ucdavis.edu, xinchen@gatech.edu
Abstract
Explainability plays an increasingly important role in machine learning. Furthermore, humans view the world through a causal lens and thus prefer causal explanations over associational ones. Therefore, in this paper, we develop a causal explanation mechanism that quantifies the causal importance of states on actions and such importance over time. We also demonstrate the advantages of our mechanism over state-of-the-art associational methods in terms of RL policy explanation through a series of simulation studies, including crop irrigation, Blackjack, collision avoidance, and lunar lander.
1 Introduction
Reinforcement learning (RL) is increasingly being considered in domains with significant social and safety implications such as healthcare, transportation, and finance. This growing societal-scale impact has raised a set of concerns, including trust, bias, and explainability. For example, can we explain how an RL agent arrives at a certain decision? When a policy performs well, can we explain why? These concerns mainly arise from two factors. First, many popular RL algorithms, particularly deep RL, utilize neural networks, which are essentially black boxes whose inner workings are opaque not only to lay persons but also to data scientists. Second, RL is a trial-and-error learning algorithm in which an agent tries to find a policy that maximizes a long-term reward by repeatedly interacting with its environment. Temporal information, such as relationships between states at different time instances, plays a key role in RL and thus adds another layer of complexity compared to supervised learning.
The field of explainable RL (XRL), a sub-field of explainable AI (XAI), aims to partially address these concerns by providing explanations as to why an RL agent arrives at a particular conclusion or action. While still in its infancy, XRL has made good progress over the past few years, particularly by taking advantage of existing XAI methods (Puiutta and Veith 2020; Heuillet, Couthouis, and Díaz-Rodríguez 2021; Wells and Bednarz 2021). For instance, inspired by the saliency map method (Simonyan, Vedaldi, and Zisserman 2014) in supervised learning, which explains image classifiers by highlighting "important" pixels in terms of classifying images, some XRL methods attempt to explain the decisions made by an RL agent by generating maps that highlight "important" state features (Iyer et al. 2018; Greydanus et al. 2018; Mott et al. 2019). However, there exist at least two major limitations in state-of-the-art XRL methods.
First, the majority of them take an associational perspective. For instance, the aforementioned studies quantify the "importance" of a feature by calculating the correlation between the state feature and an action. Since it is well known that "correlation doesn't imply causation" (Pearl 2009), features with a high correlation may not necessarily be the real "cause" of the action, resulting in a misleading explanation that can lead to user skepticism and possibly even rejection of the RL system. Second, temporal information is generally not considered: temporal effects, such as the interaction between states and actions over time, which, as mentioned previously, are essential in RL, are not taken into account.
Figure 1: Causal graph of the crop irrigation problem. Endogenous and exogenous states are denoted by dashed and solid rectangles, respectively, while actions are denoted by circles. More details about causal graphs can be found in the Preliminaries section.
In this paper, we propose a causal XRL mechanism. Specifically, we explain an RL policy by incorporating a causal model that we have about the relationship between states and actions. To best illustrate the key features of our XRL mechanism, we use a concrete crop irrigation problem as an example, as shown in Fig. 1 (more details can be found in the Evaluation section). In this problem, an RL policy π controls the amount of irrigation water (I_t) based on the following endogenous (observed) state variables: humidity (H_t), crop weight (C_t), and radiation (D_t). Its goal is to maximize the crop yield during harvest.
Crop growth is also affected by some other features, including the observed precipitation (P_t) and other exogenous (unobserved) variables U_t. To explain why policy π arrives at a particular action I_t at the current state, our XRL method quantifies the causal importance of each state feature, such as H_t, in the context of this action I_t via counterfactual reasoning (Byrne 2019; Miller 2019), i.e., by calculating how the action would have changed if the feature had been different.
Our proposed XRL mechanism addresses the aforementioned limitations as follows. First, our method can generate inherently causal explanations. More specifically, the importance measures used in associational methods can only capture direct effects, while our causal importance measures capture total causal effects. For example, for the state feature H_t, our method can account for two causal chains: the direct effect chain H_t → I_t and the indirect effect chain H_t → C_t → I_t, while associational methods only consider the former. Second, our method can quantify the temporal effect between actions and states, such as the effect of today's humidity H_t on tomorrow's irrigation I_{t+1}. In contrast, associational methods, such as saliency maps (Greydanus et al. 2018), cannot measure how previous state features affect the current action because their models only formulate the relationship between state and action within a single time step and ignore temporal relations. To the best of our knowledge, our XRL mechanism is the first work that explains RL policies by causally explaining their actions based on causal state and temporal importance. Studies have shown that humans are more receptive to contrastive explanations, i.e., humans answer a "Why X?" question through the answer to the often only implied counterfactual "Why not Y instead?" (Hilton 2007; Miller 2019). Because our causal explanations are based on contrastive samples, users may find our explanations more intuitive.
2 Related Work
Explainable RL (XRL) Based on how an XRL algorithm generates its explanation, we can categorize existing XRL methods into state-based, reward-based, and global surrogate explanations (Puiutta and Veith 2020; Heuillet, Couthouis, and Díaz-Rodríguez 2021; Wells and Bednarz 2021). State-based methods explain an action by highlighting state features that are important in terms of generating the action (Greydanus et al. 2018; Puri et al. 2019). Reward-based methods generally apply reward decomposition and identify the sub-rewards that contribute the most to decision making (Juozapaitis et al. 2019). Global surrogate methods generally approximate the original RL policy with a simpler and transparent (also called intrinsically explainable) surrogate model, such as a decision tree, and then generate explanations with the surrogate model (Verma et al. 2018). In the context of state-based methods, there are generally two ways to quantify feature importance: (i) gradient-based methods, such as simple gradients (Simonyan, Vedaldi, and Zisserman 2013) and integrated gradients (Sundararajan, Taly, and Yan 2017), and (ii) sensitivity-based methods, such as LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017). Our work belongs to the category of state-based methods. However, instead of using associations to calculate importance, as is generally done in existing state-based methods, our method adopts a causal perspective. The benefits of such a causal approach have been discussed in the Introduction section.
Causal Explanation Causality has already been utilized in XAI, mainly in supervised learning settings. Most existing studies quantify feature importance either by using Granger causality (Schwab and Karlen 2019) or average/individual causal effect metrics (Chattopadhyay et al. 2019), or by applying random-valued interventions (Datta, Sen, and Zick 2016). Two recent studies, (Madumal et al. 2020) and (Olson et al. 2021), focus on causal explanations in an RL setting. Compared with (Madumal et al. 2020), the main difference is that we provide a different type of explanation. Our method produces an importance vector that quantifies the impact of each state feature, while (Madumal et al. 2020) provides a causal chain starting from the action. We also demonstrate the ability of our approach to provide temporal importance explanations that capture the impact of a past state feature or action on a future state or action; this aspect is discussed in the crop irrigation experiment in Section 6.1. Additionally, we construct structural causal models (SCMs) differently. While the action is modeled as an edge in the SCM of (Madumal et al. 2020), our method formulates the action as a vertex in the SCM, allowing us to quantify the impact of state features on the action. As for (Olson et al. 2021), our approach is unique in that it can calculate the temporal importance of a state, which is not achievable by their method. Furthermore, we provide a Q-value-based importance definition that differs from their method. Another significant difference between our approach and (Olson et al. 2021) is the underlying assumption: our method takes into account intra-state relations, which are ignored in their work. Neglecting intra-state causality is more likely to result in an invalid state after the intervention, leading to inaccurate estimates of importance. Therefore, our approach considers the causal relationships between state features to provide a more accurate and comprehensive explanation.
3 Preliminaries
We introduce the notations used throughout the paper. We use capital letters such as X to denote a random variable and lowercase letters such as x for its value. Bold letters such as X denote a vector of random variables, and superscripts such as X^{(i)} denote its i-th element. Calligraphic letters such as 𝒳 denote sets. For a given natural number n, [n] denotes the set {1, 2, · · · , n}.
Causal Graph and Skeleton Causal graphs are probabilistic graphical models that define data-generating processes (Pearl 2009). Each vertex of the graph represents a variable. Given a set of variables V = {V_i, i ∈ [n]}, a directed edge from a variable V_j to V_i denotes that V_i responds to changes in V_j when all other variables are held constant. Variables connected to V_i through directed edges are defined as the parents of V_i, or "direct causes of V_i," and the set of all such variables is denoted by Pa_i. The skeleton of a causal graph is defined as the topology of the graph. The skeleton can be obtained using background knowledge or learned using causal discovery algorithms, such as the classical constraint-based PC algorithm (Spirtes et al. 2000) and those based on linear non-Gaussian models (Shimizu et al. 2006). In this work, we assume the skeleton is given.
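To make the skeleton concrete, the following is a minimal sketch of how a causal-graph skeleton could be represented in code and queried for the parent set Pa_i of a vertex. The edge list is a hypothetical fragment loosely based on the crop irrigation example in Fig. 1, not the paper's exact graph.

```python
# Represent a causal-graph skeleton as a directed graph and look up parents.
# The edges below are illustrative assumptions, not the paper's actual skeleton.
import networkx as nx

edges = [
    ("H_t", "I_t"),   # humidity -> irrigation (policy-defined relation)
    ("C_t", "I_t"),   # crop weight -> irrigation
    ("D_t", "I_t"),   # radiation -> irrigation
    ("H_t", "C_t"),   # humidity -> crop weight (intra-state relation)
]
skeleton = nx.DiGraph(edges)

def parents(graph: nx.DiGraph, vertex: str) -> list:
    """Return Pa_i, the direct causes of the given vertex."""
    return list(graph.predecessors(vertex))

print(parents(skeleton, "I_t"))   # parents of the action: H_t, C_t, D_t
print(parents(skeleton, "C_t"))   # parents of crop weight: H_t
```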
SCM In a causal graph, we can define the value of each variable V_i as a function of its parents and exogenous variables. Formally, we have the following definition of an SCM: let V = {V_i, i ∈ [n]} be a set of endogenous (observed) variables and U = {U_i, i ∈ [n]} be a set of exogenous (unobserved) variables. An SCM (Pearl 2009) is defined as a set of structural equations of the form

V_i = f_i(Pa_i, U_i),  Pa_i ⊂ V,  U_i ⊂ U,  i ∈ [n],   (1)

where the function f_i represents a causal mechanism that determines the value of V_i using its parents and the exogenous variables.
Intervention and Do-operation An SCM can be used for causal interventions, denoted by the do(·) operator. do(V_i = v) means setting the value of V_i to a constant v regardless of its structural equation in the SCM, i.e., ignoring the edges into the vertex V_i. Note that the do-operation differs from the conditioning operation in statistics: conditioning on a variable implies information about its parent variables due to correlation.
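As an illustration of Eq. (1) and the do(·) operator, the following is a minimal sketch of a hand-written additive-noise SCM with three variables and an intervention on one of them. The variable names and functional forms are illustrative assumptions, not the structural functions learned later in the paper.

```python
# Toy additive-noise SCM: H has no endogenous parent, C = f_C(H) + U_C,
# A = f_A(H, C) + U_A. do(H = h) ignores H's own equation and only
# propagates the fixed value downstream.
import numpy as np

rng = np.random.default_rng(0)

def sample_observational(n: int) -> dict:
    """Draw n samples by evaluating the structural equations in topological order."""
    u_h, u_c, u_a = rng.normal(size=(3, n))     # exogenous variables
    h = u_h
    c = 0.5 * h + u_c
    a = np.tanh(2.0 * h - c) + u_a
    return {"H": h, "C": c, "A": a}

def sample_do_h(h_value: float, n: int) -> dict:
    """Samples under the intervention do(H = h_value)."""
    u_c, u_a = rng.normal(size=(2, n))
    h = np.full(n, h_value)                     # H is set by fiat; its equation is ignored
    c = 0.5 * h + u_c
    a = np.tanh(2.0 * h - c) + u_a
    return {"H": h, "C": c, "A": a}
```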
Counterfactual Reasoning Counterfactual reasoning allows us to answer "what if" questions. For example, assume that the state is X_t = x and the action is A_t = a. We are interested in knowing what would have happened if the state had been a different value x′. This is a counterfactual question (Pearl 2009). The counterfactual outcome of A_t can be represented as A_{t, X_t = x′} | X_t = x, A_t = a. Given an SCM, we can perform counterfactual reasoning based on intervention through the following two steps:
1. Recover the value of the exogenous variable U as u through the structural function f and the observed values X_t = x, A_t = a;
2. Calculate the counterfactual outcome as A_t | do(X_t = x′), U = u. More specifically, in the SCM, we set the value of X_t to x′, substitute all exogenous variable values into the right-hand side of the structural functions, and obtain the counterfactual outcome A_t.
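The two-step procedure can be written compactly for a single additive-noise equation A_t = f_a(X_t) + U_a (the form adopted in Eq. (2) below). The structural function and the observed values in this sketch are placeholders used only to show the abduction and intervention steps.

```python
# Counterfactual reasoning for an additive-noise equation A_t = f_a(X_t) + U_a.
import numpy as np

def f_a(x: np.ndarray) -> float:
    return float(np.tanh(x).sum())     # placeholder structural function

x_obs = np.array([0.2, -0.4])          # observed state X_t = x
a_obs = 0.1                            # observed action A_t = a

# Step 1 (abduction): recover the exogenous variable U_a = a - f_a(x).
u_a = a_obs - f_a(x_obs)

# Step 2 (intervention): set X_t to the alternative value x' and re-evaluate
# the structural equation with the recovered U_a.
x_cf = np.array([0.8, -0.4])           # counterfactual state x'
a_cf = f_a(x_cf) + u_a                 # counterfactual outcome A_t | do(X_t = x'), U = u
```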
MDP and RL An infinite-horizon Markov Decision Process (MDP) is a tuple (S, A, P, R), where S ⊆ R^m and A ⊆ R are finite sets of states and actions, P(s, a, s′) is the probability of transitioning from state s to state s′ after taking action a, and R(s, a) is the reward for taking a in s. An RL policy π returns an action to take at state s, and its associated Q-function, Q_π(s, a), provides the expected infinite-horizon γ-discounted cumulative reward for taking action a at state s and following π thereafter.
4 Problem Formulation
Our focus is on policy explainability, and we assume that the policy π and its associated Q-function, Q_π(s, a), are given. Note that the policy may or may not be optimal. We require a dataset containing trajectories of the agent interacting with the MDP using the policy π; a single trajectory consists of a sequence of (s, a, r, s′) tuples. Additionally, we assume that the skeleton of the causal graph, such as the one shown in Fig. 1 for the crop irrigation problem, is known. We do not assume that the SCM, more specifically its structural functions, is given. We assume additive noise for the SCM but not linearity (discussed in Eq. (2) in Section 5.1). The goal is to answer the question "why does the policy π select the current action a at the current state s?" We provide causal explanations for this question from two perspectives: state importance and temporal importance.
Importance vector for state The first aspect of our explanation identifies the state features that are important for the current action. Specifically, we seek to construct an importance vector for the state, where each dimension measures the impact of the corresponding state feature on the action. For instance, in the crop irrigation problem, we can answer the question "why does the RL agent irrigate more water today?" by stating that "the impact of humidity, crop weight, and radiation on the current irrigation decision is quantified as [0.8, 0.1, 0.1], respectively." Formally, we have the following definition of the importance vector for state explanation. Given state s_t and policy π, the importance of each feature of s_t for the current action a_t is quantified as w_t. The explanation is that the features in state s_t have causal importance w_t on policy π's selection of action a_t at state s_t.
Temporal importance of action/state The second aspect of our explanation considers the temporal aspect of RL. Here, we measure how actions and states in the past impact the current action. We can generalize the importance vector above to past states and actions. Formally, given state s_t, policy π, and the history trajectory of the agent H_t := {(s_τ, a_τ), τ ≤ t}, we define the effect of a past action a_τ on the current action a_t as w_t^{a_τ}. Similarly, for a past state s_τ, we define the temporal importance vector w_t^τ, in which each dimension measures the impact of the corresponding state feature at time step τ on the current action a_t. We then use w_t^{a_τ} and w_t^τ to quantify the impact of past actions and states.
5 Explanation
5.1 Importance Vector for State
Our mechanism implements the following two steps to obtain the importance vector w_t:
1. Train the SCM structural functions between the states and actions using the data of historical trajectories of the RL agent;
2. Compute the importance vector by intervening in the learned SCM.
First, we notice that there are three types of causal relations between the states and actions: intra-state, policy-defined, and transition-defined relations. As shown in Fig. 2, the green directed edges represent the intra-state relations, which are defined by the underlying causal mechanism. The orange edges describe the policy and represent how the state variables affect the action. The third type of relation, shown as blue edges, is the causal relationship between states across different time steps; these edges represent the dynamics of the environment and depend on the transition probability P(s_t, a_t, s_{t+1}) in the MDP.
Figure 2: Example causal graph between the state and action. S_t^{(i)} is the i-th dimension of the state of interest S at time t. Each vertex also has a corresponding exogenous variable, which has no parent and whose only child is the associated endogenous variable. Per causality conventions, the exogenous variables are omitted in the graph.
We assume that the intra-state and transition-defined causal relations are captured by the causal graph skeleton. For the policy-defined relations, we assume the general case where all state features are causal parents of the action. In the causal graph, each edge defines a causal relation, and each vertex defines a variable V with a causal structural function f. Then we only need to learn the causal structural functions between the vertices. To achieve this, we can learn each vertex's function separately. For a vertex V_i and its parents Pa_i, based on Eq. (1), we make an additive noise assumption to simplify the problem and formulate the function mapping between V_i and Pa_i as

V_i = f_i(Pa_i) + U_i,   (2)

where U_i is an exogenous variable. We note that the additive noise assumption is widely used in the causal discovery literature (Hoyer et al. 2008; Peters et al. 2014). We then use supervised learning to learn the function mapping among the vertices. Specifically, f_a for the action a_t is defined as

A_t = f_a(S_t^{(1)}, · · · , S_t^{(m)}, U_a),

where m is the dimension of the state and U_a is the exogenous variable for the action.

For the state variables, we denote all exogenous variables as a vector U_S := [U_1, · · · , U_m] and learn the structural functions. Intuitively, the exogenous variables U_a and U_S represent not only random noise but also hidden features or the stochasticity of the policy for the intra-state and policy-defined causal relations. For transition-defined relations, the exogenous variables can be regarded as the stochasticity of the environment.
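As a sketch of the supervised-learning step, each structural function in Eq. (2) can be fit by regressing a vertex on its parents and treating the residual as the exogenous variable. The regressor choice and array shapes below are assumptions, not the paper's exact training setup.

```python
# Fit one structural function f_i from logged trajectories (Eq. (2)):
# regress the vertex on its parents; residuals estimate the exogenous U_i.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_structural_function(parent_values: np.ndarray, child_values: np.ndarray):
    """parent_values: (N, |Pa_i|) array, child_values: (N,) array.
    Returns the fitted f_i; child - f_i(parents) estimates U_i."""
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    model.fit(parent_values, child_values)
    return model

# Example usage (hypothetical arrays from the trajectory dataset):
# states: (N, m) array of s_t, actions: (N,) array of a_t.
# f_a = fit_structural_function(states, actions)
# u_a = actions - f_a.predict(states)   # recovered exogenous values for the action
```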
5.2 Action-based Importance
Given a state s_t and an action a_t, the importance vector w_t is calculated by applying an intervention on the learned SCM. Based on the additive noise assumption, we recover the values of the exogenous variables U_S and U_a according to the values of a_t, s_t and the learned causal structural functions. Then we define w_t using the intervention operation (counterfactual reasoning). Specifically, we define the importance vector w_t = [w_t^{(1)}, · · · , w_t^{(m)}] as

w_t^{(i)} = | (A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t) − a_t | / δ,   (3)

where | · | is a vector norm (e.g., the absolute-value norm) and δ is a small perturbation value. The term A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t represents the counterfactual outcome of A_t if we set S_t^{(i)} = s_t^{(i)} + δ. In our case, the values of the exogenous variables can be recovered using the additive noise assumption, so this counterfactual outcome can be determined. We interpret the result as follows: features with a larger w_t^{(i)} have a more significant causal impact on the agent's action a_t. Note that in the simulations, we compute the importance for both positive and negative δ and return the average as the final score. The perturbation amount δ is a hyperparameter and should be selected according to each problem setting.
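The sketch below illustrates Eq. (3) for a single feature i under a simplifying assumption: the perturbed feature has no intra-state children, so the counterfactual only flows through the learned action function f_a. In the full method, the perturbation would also be propagated through the intra-state structural functions so that total causal effects are captured. The function signature is hypothetical.

```python
# Action-based importance of feature i (Eq. (3)), averaged over +delta and -delta,
# assuming additive noise A_t = f_a(S_t) + U_a and no intra-state edges out of
# feature i (otherwise propagate the perturbation through the SCM first).
import numpy as np

def action_importance(f_a, s_t: np.ndarray, a_t: float, i: int, delta: float) -> float:
    u_a = a_t - f_a(s_t)                 # abduction: recover the exogenous U_a
    scores = []
    for d in (delta, -delta):
        s_cf = s_t.copy()
        s_cf[i] += d                     # intervene: set S_t^(i) = s_t^(i) + d
        a_cf = f_a(s_cf) + u_a           # counterfactual action
        scores.append(abs(a_cf - a_t) / delta)
    return float(np.mean(scores))        # larger value => stronger causal impact on a_t
```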
5.3 Q-value-based Importance
While action-based importance can capture the causal impact of states on the change of the action, it may not capture more subtle causal importance when the selected action does not change, especially when the action space is discrete. Specifically, A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t may not change after a perturbation of δ, which results in w_t^{(i)} = 0. However, this is different from the case where there are no causal paths from feature S_t^{(i)} to the action A_t, which also results in w_t^{(i)} = 0. Therefore, we also define a Q-value-based importance as follows:

Qw_t^{(i)} = | Q_π^{perturb} − Q_π(s_t, a_t) | / δ,   (4)

where Q_π^{perturb} = Q_π(S_{t, S_t^{(i)} = s_t^{(i)} + δ}, A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t). In detail, we use counterfactual reasoning to compute the counterfactual outcomes of A_t and S_t after setting S_t^{(i)} = s_t^{(i)} + δ and then substitute them into Q_π to evaluate the corresponding Q-value. Similar to the action-based importance, we account for both positive and negative perturbations in practice. See the Blackjack experiment in Section 6.3 for a comparison between Eq. (3) and Eq. (4) on an example with a discrete action space.
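A matching sketch of Eq. (4) is given below, reusing the same simplified counterfactual as in the previous sketch; q_pi is assumed to be the given Q-function Q_π(s, a) of the policy, and the signature is hypothetical.

```python
# Q-value-based importance of feature i (Eq. (4)): perturb feature i, form the
# counterfactual state and action, and measure the change in Q_pi.
import numpy as np

def q_value_importance(f_a, q_pi, s_t: np.ndarray, a_t: float, i: int, delta: float) -> float:
    u_a = a_t - f_a(s_t)                  # recover the exogenous U_a
    q_base = q_pi(s_t, a_t)
    scores = []
    for d in (delta, -delta):
        s_cf = s_t.copy()
        s_cf[i] += d                      # counterfactual state S_t
        a_cf = f_a(s_cf) + u_a            # counterfactual action A_t
        scores.append(abs(q_pi(s_cf, a_cf) - q_base) / delta)
    return float(np.mean(scores))
```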
In most RL algorithms, the Q-value critically impacts which action is chosen. Therefore, we consider Q-value-based importance as an explanation of the action through the Q-value. However, we note that the Q-value-based importance method sometimes cannot reflect which features the policy really depends on. Some features may contribute largely to the Q-values of all state-action pairs ({Q(s_t, a_t), a_t ∈ A}), but not to the decision-making process - the action with