Causal Explanation for Reinforcement Learning:
Quantifying State and Temporal Importance
Xiaoxiao Wang,1 Fanyu Meng,1 Xin Liu,1 Zhaodan Kong,1 Xin Chen2
1University of California, Davis
2Georgia Institute of Technology
{xxwa, fymeng, xinliu, zdkong}@ucdavis.edu, xinchen@gatech.edu
Abstract
Explainability plays an increasingly important role in machine learning. Furthermore, humans view the world through a causal lens and thus prefer causal explanations over associational ones. Therefore, in this paper, we develop a causal explanation mechanism that quantifies the causal importance of states on actions and such importance over time. We also demonstrate the advantages of our mechanism over state-of-the-art associational methods in terms of RL policy explanation through a series of simulation studies, including crop irrigation, Blackjack, collision avoidance, and lunar lander.
1 Introduction
Reinforcement learning (RL) is increasingly being considered in domains with significant social and safety implications such as healthcare, transportation, and finance. This growing societal-scale impact has raised a set of concerns, including trust, bias, and explainability. For example, can we explain how an RL agent arrives at a certain decision? When a policy performs well, can we explain why? These concerns mainly arise from two factors. First, many popular RL algorithms, particularly deep RL, utilize neural networks, which are essentially black boxes whose inner workings are opaque not only to lay persons but also to data scientists. Second, RL is a trial-and-error learning algorithm in which an agent tries to find a policy that maximizes a long-term reward by repeatedly interacting with its environment. Temporal information, such as relationships between states at different time instances, plays a key role in RL and thus adds another layer of complexity compared to supervised learning.
The field of explainable RL (XRL), a sub-field of explainable AI (XAI), aims to partially address these concerns by providing explanations as to why an RL agent arrives at a particular conclusion or action. While still in its infancy, XRL has made good progress over the past few years, particularly by taking advantage of existing XAI methods (Puiutta and Veith 2020; Heuillet, Couthouis, and Díaz-Rodríguez 2021; Wells and Bednarz 2021). For instance, inspired by the saliency map method (Simonyan, Vedaldi, and Zisserman 2014) in supervised learning, which explains image classifiers by highlighting "important" pixels in terms of classifying images, some XRL methods attempt to explain the decisions made by an RL agent by generating maps that highlight "important" state features (Iyer et al. 2018; Greydanus et al. 2018; Mott et al. 2019). However, there exist at least two major limitations in state-of-the-art XRL methods.
First, the majority of them take an associational perspective. For instance, the aforementioned studies quantify the "importance" of a feature by calculating the correlation between the state feature and an action. Since it is well known that "correlation doesn't imply causation" (Pearl 2009), features with a high correlation may not necessarily be the real "cause" of the action, resulting in a misleading explanation that can lead to user skepticism and possibly even rejection of the RL system. Second, temporal information is generally not considered: temporal effects, such as the interaction between states and actions over time, which, as mentioned previously, are essential in RL, are not taken into account.
Figure 1: Causal graph of the crop irrigation problem. Endogenous and exogenous states are denoted by dashed and solid rectangles, respectively, while actions are denoted by circles. More details about causal graphs can be found in the Preliminaries section.
In this paper, we propose a causal XRL mechanism. Specifically, we explain an RL policy by incorporating a causal model that we have about the relationship between states and actions. To best illustrate the key features of our XRL mechanism, we use a concrete crop irrigation problem as an example, as shown in Fig. 1 (more details can be found in the Evaluation section). In this problem, an RL policy π controls the amount of irrigation water (I_t) based on the following endogenous (observed) state variables: humidity (H_t), crop weight (C_t), and radiation (D_t). Its goal is to maximize the crop yield during harvest.
Crop growth is also affected by some other features, including the observed precipitation (P_t) and other exogenous (unobserved) variables U_t. To explain why policy π arrives at a particular action I_t at the current state, our XRL method quantifies the causal importance of each state feature, such as H_t, in the context of this action I_t via counterfactual reasoning (Byrne 2019; Miller 2019), i.e., by calculating how the action would have changed if the feature had been different.
Our proposed XRL mechanism addresses the aforementioned limitations as follows. First, our method can generate inherently causal explanations. More specifically, the importance measures used in associational methods can only capture direct effects, while our causal importance measures capture total causal effects. For example, for the state feature H_t, our method can account for two causal chains: the direct effect chain H_t → I_t and the indirect effect chain H_t → C_t → I_t, while associational methods only consider the former. Second, our method can quantify the temporal effect between actions and states, such as the effect of today's humidity H_t on tomorrow's irrigation I_{t+1}. In contrast, associational methods, such as saliency maps (Greydanus et al. 2018), cannot measure how previous state features affect the current action because their models only formulate the relationship between state and action within a single time step and ignore temporal relations. To the best of our knowledge, our XRL mechanism is the first work that explains RL policies by causally explaining their actions based on causal state and temporal importance. Studies have shown that humans are more receptive to contrastive explanations, i.e., humans answer a "Why X?" question through the answer to the often only implied counterfactual "Why not Y instead?" (Hilton 2007; Miller 2019). Because our causal explanations are based on contrastive samples, users may find our explanations more intuitive.
2 Related Work
Explainable RL (XRL) Based on how an XRL algorithm generates its explanation, we can categorize existing XRL methods into state-based, reward-based, and global surrogate explanations (Puiutta and Veith 2020; Heuillet, Couthouis, and Díaz-Rodríguez 2021; Wells and Bednarz 2021). State-based methods explain an action by highlighting state features that are important in terms of generating the action (Greydanus et al. 2018; Puri et al. 2019). Reward-based methods generally apply reward decomposition and identify the sub-rewards that contribute the most to decision making (Juozapaitis et al. 2019). Global surrogate methods generally approximate the original RL policy with a simpler and transparent (also called intrinsically explainable) surrogate model, such as a decision tree, and then generate explanations with the surrogate model (Verma et al. 2018). In the context of state-based methods, there are generally two ways to quantify feature importance: (i) gradient-based methods, such as simple gradients (Simonyan, Vedaldi, and Zisserman 2013) and integrated gradients (Sundararajan, Taly, and Yan 2017), and (ii) sensitivity-based methods, such as LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017). Our work belongs to the category of state-based methods. However, instead of using associations to calculate importance, as is generally done in existing state-based methods, our method adopts a causal perspective. The benefits of such a causal approach have been discussed in the Introduction section.
Causal Explanation Causality has already been utilized in XAI, mainly in supervised learning settings. Most existing studies quantify feature importance either by using Granger causality (Schwab and Karlen 2019) or average/individual causal effect metrics (Chattopadhyay et al. 2019), or by applying random-valued interventions (Datta, Sen, and Zick 2016). Two recent studies, (Madumal et al. 2020) and (Olson et al. 2021), focus on causal explanations in an RL setting. Compared with (Madumal et al. 2020), the main difference is that we provide a different type of explanation. Our method produces an importance vector that quantifies the impact of each state feature, while (Madumal et al. 2020) provides a causal chain starting from the action. We also demonstrate the ability of our approach to provide temporal importance explanations that capture the impact of a past state feature or action on a future state or action; this aspect is discussed in the crop irrigation experiment in Section 6.1. Additionally, we construct structural causal models (SCMs) differently. While the action is modeled as an edge in the SCM of (Madumal et al. 2020), our method formulates the action as a vertex in the SCM, allowing us to quantify the impact of state features on the action. As for (Olson et al. 2021), our approach is unique in that it can calculate the temporal importance of a state, which is not achievable by their method. Furthermore, we provide a Q-value-based importance definition that differs from their method. Another significant difference between our approach and (Olson et al. 2021) is the underlying assumption: our method takes into account intra-state relations, which are ignored in their work. Neglecting intra-state causality is more likely to result in an invalid state after the intervention, leading to inaccurate estimates of importance. Therefore, our approach considers the causal relationships between state features to provide a more accurate and comprehensive explanation.
3 Preliminaries
We introduce the notations used throughout the paper. We use capital letters such as X to denote a random variable and lowercase letters such as x for its value. Bold letters such as X denote a vector of random variables, and superscripts such as X^{(i)} denote its i-th element. Calligraphic letters such as 𝒳 denote sets. For a given natural number n, [n] denotes the set {1, 2, · · · , n}.
Causal Graph and Skeleton Causal graphs are probabilistic graphical models that define data-generating processes (Pearl 2009). Each vertex of the graph represents a variable. Given a set of variables V = {V_i, i ∈ [n]}, a directed edge from a variable V_j to V_i denotes that V_i responds to changes in V_j when all other variables are held constant. Variables connected to V_i through directed edges are defined as the parents of V_i, or "direct causes of V_i," and the set of all such variables is denoted by Pa_i. The skeleton of a causal graph is defined as the topology of the graph. The skeleton can be obtained using background knowledge or learned using causal discovery algorithms, such as the classical constraint-based PC algorithm (Spirtes et al. 2000) and those based on linear non-Gaussian models (Shimizu et al. 2006). In this work, we assume the skeleton is given.
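To make the skeleton concrete, the following is a minimal sketch of how a causal-graph skeleton could be represented in code and queried for the parent set Pa_i of a vertex. The edge list is a hypothetical fragment loosely based on the crop irrigation example in Fig. 1, not the paper's exact graph.

```python
# Represent a causal-graph skeleton as a directed graph and look up parents.
# The edges below are illustrative assumptions, not the paper's actual skeleton.
import networkx as nx

edges = [
    ("H_t", "I_t"),   # humidity -> irrigation (policy-defined relation)
    ("C_t", "I_t"),   # crop weight -> irrigation
    ("D_t", "I_t"),   # radiation -> irrigation
    ("H_t", "C_t"),   # humidity -> crop weight (intra-state relation)
]
skeleton = nx.DiGraph(edges)

def parents(graph: nx.DiGraph, vertex: str) -> list:
    """Return Pa_i, the direct causes of the given vertex."""
    return list(graph.predecessors(vertex))

print(parents(skeleton, "I_t"))   # parents of the action: H_t, C_t, D_t
print(parents(skeleton, "C_t"))   # parents of crop weight: H_t
```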
SCM In a causal graph, we can define the value of each variable V_i as a function of its parents and exogenous variables. Formally, we have the following definition of an SCM: let V = {V_i, i ∈ [n]} be a set of endogenous (observed) variables and U = {U_i, i ∈ [n]} be a set of exogenous (unobserved) variables. An SCM (Pearl 2009) is defined as a set of structural equations of the form

V_i = f_i(Pa_i, U_i),  Pa_i ⊂ V,  U_i ⊂ U,  i ∈ [n],   (1)

where the function f_i represents a causal mechanism that determines the value of V_i using its parents and the exogenous variables.
Intervention and Do-operation An SCM can be used for causal interventions, denoted by the do(·) operator. do(V_i = v) means setting the value of V_i to a constant v regardless of its structural equation in the SCM, i.e., ignoring the edges into the vertex V_i. Note that the do-operation differs from the conditioning operation in statistics: conditioning on a variable implies information about its parent variables due to correlation.
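As an illustration of Eq. (1) and the do(·) operator, the following is a minimal sketch of a hand-written additive-noise SCM with three variables and an intervention on one of them. The variable names and functional forms are illustrative assumptions, not the structural functions learned later in the paper.

```python
# Toy additive-noise SCM: H has no endogenous parent, C = f_C(H) + U_C,
# A = f_A(H, C) + U_A. do(H = h) ignores H's own equation and only
# propagates the fixed value downstream.
import numpy as np

rng = np.random.default_rng(0)

def sample_observational(n: int) -> dict:
    """Draw n samples by evaluating the structural equations in topological order."""
    u_h, u_c, u_a = rng.normal(size=(3, n))     # exogenous variables
    h = u_h
    c = 0.5 * h + u_c
    a = np.tanh(2.0 * h - c) + u_a
    return {"H": h, "C": c, "A": a}

def sample_do_h(h_value: float, n: int) -> dict:
    """Samples under the intervention do(H = h_value)."""
    u_c, u_a = rng.normal(size=(2, n))
    h = np.full(n, h_value)                     # H is set by fiat; its equation is ignored
    c = 0.5 * h + u_c
    a = np.tanh(2.0 * h - c) + u_a
    return {"H": h, "C": c, "A": a}
```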
Counterfactual Reasoning Counterfactual reasoning allows us to answer "what if" questions. For example, assume that the state is X_t = x and the action is A_t = a. We are interested in knowing what would have happened if the state had been a different value x′. This is a counterfactual question (Pearl 2009). The counterfactual outcome of A_t can be represented as A_{t, X_t = x′} | X_t = x, A_t = a. Given an SCM, we can perform counterfactual reasoning based on intervention through the following two steps:
1. Recover the value of the exogenous variable U as u through the structural function f and the observed values X_t = x, A_t = a;
2. Calculate the counterfactual outcome as A_t | do(X_t = x′), U = u. More specifically, in the SCM, we set the value of X_t to x′, substitute all exogenous variable values into the right-hand side of the structural functions, and obtain the counterfactual outcome A_t.
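The two-step procedure can be written compactly for a single additive-noise equation A_t = f_a(X_t) + U_a (the form adopted in Eq. (2) below). The structural function and the observed values in this sketch are placeholders used only to show the abduction and intervention steps.

```python
# Counterfactual reasoning for an additive-noise equation A_t = f_a(X_t) + U_a.
import numpy as np

def f_a(x: np.ndarray) -> float:
    return float(np.tanh(x).sum())     # placeholder structural function

x_obs = np.array([0.2, -0.4])          # observed state X_t = x
a_obs = 0.1                            # observed action A_t = a

# Step 1 (abduction): recover the exogenous variable U_a = a - f_a(x).
u_a = a_obs - f_a(x_obs)

# Step 2 (intervention): set X_t to the alternative value x' and re-evaluate
# the structural equation with the recovered U_a.
x_cf = np.array([0.8, -0.4])           # counterfactual state x'
a_cf = f_a(x_cf) + u_a                 # counterfactual outcome A_t | do(X_t = x'), U = u
```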
MDP and RL An infinite-horizon Markov Decision Process (MDP) is a tuple (S, A, P, R), where S ⊆ R^m and A ⊆ R are finite sets of states and actions, P(s, a, s′) is the probability of transitioning from state s to state s′ after taking action a, and R(s, a) is the reward for taking a in s. An RL policy π returns an action to take at state s, and its associated Q-function, Q_π(s, a), provides the expected infinite-horizon γ-discounted cumulative reward for taking action a at state s and following π thereafter.
4 Problem Formulation
Our focus is on policy explainability, and we assume that the policy π and its associated Q-function, Q_π(s, a), are given. Note that the policy may or may not be optimal. We require a dataset containing trajectories of the agent interacting with the MDP using the policy π; a single trajectory consists of a sequence of (s, a, r, s′) tuples. Additionally, we assume that the skeleton of the causal graph, such as the one shown in Fig. 1 for the crop irrigation problem, is known. We do not assume that the SCM, more specifically its structural functions, is given. We assume additive noise for the SCM but not linearity (discussed in Eq. (2) in Section 5.1). The goal is to answer the question "why does the policy π select the current action a at the current state s?" We provide causal explanations for this question from two perspectives: state importance and temporal importance.
Importance vector for state The first aspect of our explanation identifies the state features that are important for the current action. Specifically, we seek to construct an importance vector for the state, where each dimension measures the impact of the corresponding state feature on the action. For instance, in the crop irrigation problem, we can answer the question "why does the RL agent irrigate more water today?" by stating that "the impact of humidity, crop weight, and radiation on the current irrigation decision is quantified as [0.8, 0.1, 0.1], respectively." Formally, we have the following definition of the importance vector for state explanation. Given state s_t and policy π, the importance of each feature of s_t for the current action a_t is quantified as w_t. The explanation is that the features in state s_t have causal importance w_t on policy π's selection of action a_t at state s_t.
Temporal importance of action/state The second aspect of our explanation considers the temporal aspect of RL. Here, we measure how actions and states in the past impact the current action. We can generalize the importance vector above to past states and actions. Formally, given state s_t, policy π, and the history trajectory of the agent H_t := {(s_τ, a_τ), τ ≤ t}, we define the effect of a past action a_τ on the current action a_t as w_t^{a_τ}. Similarly, for a past state s_τ, we define the temporal importance vector w_t^τ, in which each dimension measures the impact of the corresponding state feature at time step τ on the current action a_t. We then use w_t^{a_τ} and w_t^τ to quantify the impact of past actions and states.
5 Explanation
5.1 Importance Vector for State
Our mechanism implements the following two steps to obtain the importance vector w_t:
1. Train the SCM structural functions between the states and actions using the data of historical trajectories of the RL agent;
2. Compute the importance vector by intervening in the learned SCM.
First, we notice that there are three types of causal relations between the states and actions: intra-state, policy-defined, and transition-defined relations. As shown in Fig. 2, the green directed edges represent the intra-state relations, which are defined by the underlying causal mechanism. The orange edges describe the policy and represent how the state variables affect the action. The third type of relation, shown as blue edges, is the causal relationship between states across different time steps; these edges represent the dynamics of the environment and depend on the transition probability P(s_t, a_t, s_{t+1}) in the MDP.
Figure 2: Example causal graph between the state and action. S_t^{(i)} is the i-th dimension of the state of interest S at time t. Each vertex also has a corresponding exogenous variable, which has no parent and whose only child is the associated endogenous variable. Per causality conventions, the exogenous variables are omitted in the graph.
We assume that the intra-state and transition-defined causal relations are captured by the causal graph skeleton. For the policy-defined relations, we assume the general case where all state features are causal parents of the action. In the causal graph, each edge defines a causal relation, and each vertex defines a variable V with a causal structural function f. Then we only need to learn the causal structural functions between the vertices. To achieve this, we can learn each vertex's function separately. For a vertex V_i and its parents Pa_i, based on Eq. (1), we make an additive noise assumption to simplify the problem and formulate the function mapping between V_i and Pa_i as

V_i = f_i(Pa_i) + U_i,   (2)

where U_i is an exogenous variable. We note that the additive noise assumption is widely used in the causal discovery literature (Hoyer et al. 2008; Peters et al. 2014). We then use supervised learning to learn the function mapping among the vertices. Specifically, f_a for the action a_t is defined as

A_t = f_a(S_t^{(1)}, · · · , S_t^{(m)}, U_a),

where m is the dimension of the state and U_a is the exogenous variable for the action.

For the state variables, we denote all exogenous variables as a vector U_S := [U_1, · · · , U_m] and learn the structural functions. Intuitively, the exogenous variables U_a and U_S represent not only random noise but also hidden features or the stochasticity of the policy for the intra-state and policy-defined causal relations. For transition-defined relations, the exogenous variables can be regarded as the stochasticity of the environment.
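As a sketch of the supervised-learning step, each structural function in Eq. (2) can be fit by regressing a vertex on its parents and treating the residual as the exogenous variable. The regressor choice and array shapes below are assumptions, not the paper's exact training setup.

```python
# Fit one structural function f_i from logged trajectories (Eq. (2)):
# regress the vertex on its parents; residuals estimate the exogenous U_i.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_structural_function(parent_values: np.ndarray, child_values: np.ndarray):
    """parent_values: (N, |Pa_i|) array, child_values: (N,) array.
    Returns the fitted f_i; child - f_i(parents) estimates U_i."""
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    model.fit(parent_values, child_values)
    return model

# Example usage (hypothetical arrays from the trajectory dataset):
# states: (N, m) array of s_t, actions: (N,) array of a_t.
# f_a = fit_structural_function(states, actions)
# u_a = actions - f_a.predict(states)   # recovered exogenous values for the action
```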
5.2 Action-based Importance
Given a state s_t and an action a_t, the importance vector w_t is calculated by applying an intervention on the learned SCM. Based on the additive noise assumption, we recover the values of the exogenous variables U_S and U_a according to the values of a_t, s_t and the learned causal structural functions. Then we define w_t using the intervention operation (counterfactual reasoning). Specifically, we define the importance vector w_t = [w_t^{(1)}, · · · , w_t^{(m)}] as

w_t^{(i)} = | (A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t) − a_t | / δ,   (3)

where | · | is a vector norm (e.g., the absolute-value norm) and δ is a small perturbation value. The term A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t represents the counterfactual outcome of A_t if we set S_t^{(i)} = s_t^{(i)} + δ. In our case, the values of the exogenous variables can be recovered using the additive noise assumption, so this counterfactual outcome can be determined. We interpret the result as follows: features with a larger w_t^{(i)} have a more significant causal impact on the agent's action a_t. Note that in the simulations, we compute the importance for both positive and negative δ and return the average as the final score. The perturbation amount δ is a hyperparameter and should be selected according to each problem setting.
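The sketch below illustrates Eq. (3) for a single feature i under a simplifying assumption: the perturbed feature has no intra-state children, so the counterfactual only flows through the learned action function f_a. In the full method, the perturbation would also be propagated through the intra-state structural functions so that total causal effects are captured. The function signature is hypothetical.

```python
# Action-based importance of feature i (Eq. (3)), averaged over +delta and -delta,
# assuming additive noise A_t = f_a(S_t) + U_a and no intra-state edges out of
# feature i (otherwise propagate the perturbation through the SCM first).
import numpy as np

def action_importance(f_a, s_t: np.ndarray, a_t: float, i: int, delta: float) -> float:
    u_a = a_t - f_a(s_t)                 # abduction: recover the exogenous U_a
    scores = []
    for d in (delta, -delta):
        s_cf = s_t.copy()
        s_cf[i] += d                     # intervene: set S_t^(i) = s_t^(i) + d
        a_cf = f_a(s_cf) + u_a           # counterfactual action
        scores.append(abs(a_cf - a_t) / delta)
    return float(np.mean(scores))        # larger value => stronger causal impact on a_t
```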
5.3 Q-value-based Importance
While action-based importance can capture the causal impact of states on the change of the action, it may not capture more subtle causal importance when the selected action does not change, especially when the action space is discrete. Specifically, A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t may not change after a perturbation of δ, which results in w_t^{(i)} = 0. However, this is different from the case where there are no causal paths from feature S_t^{(i)} to the action A_t, which also results in w_t^{(i)} = 0. Therefore, we also define a Q-value-based importance as follows:

Qw_t^{(i)} = | Q_π^{perturb} − Q_π(s_t, a_t) | / δ,   (4)

where Q_π^{perturb} = Q_π(S_{t, S_t^{(i)} = s_t^{(i)} + δ}, A_{t, S_t^{(i)} = s_t^{(i)} + δ} | S_t = s_t, A_t = a_t). In detail, we use counterfactual reasoning to compute the counterfactual outcomes of A_t and S_t after setting S_t^{(i)} = s_t^{(i)} + δ and then substitute them into Q_π to evaluate the corresponding Q-value. Similar to the action-based importance, we account for both positive and negative perturbations in practice. See the Blackjack experiment in Section 6.3 for a comparison between Eq. (3) and Eq. (4) on an example with a discrete action space.
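A matching sketch of Eq. (4) is given below, reusing the same simplified counterfactual as in the previous sketch; q_pi is assumed to be the given Q-function Q_π(s, a) of the policy, and the signature is hypothetical.

```python
# Q-value-based importance of feature i (Eq. (4)): perturb feature i, form the
# counterfactual state and action, and measure the change in Q_pi.
import numpy as np

def q_value_importance(f_a, q_pi, s_t: np.ndarray, a_t: float, i: int, delta: float) -> float:
    u_a = a_t - f_a(s_t)                  # recover the exogenous U_a
    q_base = q_pi(s_t, a_t)
    scores = []
    for d in (delta, -delta):
        s_cf = s_t.copy()
        s_cf[i] += d                      # counterfactual state S_t
        a_cf = f_a(s_cf) + u_a            # counterfactual action A_t
        scores.append(abs(q_pi(s_cf, a_cf) - q_base) / delta)
    return float(np.mean(scores))
```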
In most RL algorithms, the Q-value critically impacts which action is chosen. Therefore, we consider Q-value-based importance as an explanation of the action through the Q-value. However, we note that the Q-value-based importance method sometimes cannot reflect which features the policy really depends on. Some features may contribute largely to the Q-values of all state-action pairs ({Q(s_t, a_t), a_t ∈ A}), but not to the decision-making process - the action with