is also affected by some other features, including the observed precipitation ($P_t$) and other exogenous (unobserved) variables $U_t$. To explain why policy $\pi$ arrives at a particular action $I_t$ at the current state, our XRL method quantifies the causal importance of each state feature, such as $H_t$, in the context of this action $I_t$ via counterfactual reasoning (Byrne 2019; Miller 2019), i.e., by calculating how the action would have changed if the feature had been different.
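To make this counterfactual question concrete, the minimal sketch below perturbs a single state feature and re-queries the policy; the toy threshold policy, the feature ordering, and the candidate humidity values are hypothetical placeholders, not the paper's actual environment.

```python
# Minimal sketch of the counterfactual question above: how would the chosen
# action have changed if one state feature (e.g., humidity H_t) had been
# different?  The toy policy and feature layout are hypothetical.
import numpy as np

def counterfactual_actions(policy, state, feature_idx, alt_values):
    """Return the factual action and the actions under each counterfactual
    value of one feature, with all other features held at their observed values."""
    factual = policy(np.asarray(state, dtype=float))
    counterfactuals = []
    for v in alt_values:
        cf_state = np.array(state, dtype=float)
        cf_state[feature_idx] = v            # intervene on the chosen feature only
        counterfactuals.append(policy(cf_state))
    return factual, counterfactuals

# Toy usage: a hand-written policy over [humidity, crop_stage] that irrigates
# (action 1) only when humidity is low.
toy_policy = lambda s: 1 if s[0] < 0.4 else 0
print(counterfactual_actions(toy_policy, [0.7, 0.3], feature_idx=0,
                             alt_values=[0.2, 0.5, 0.9]))
```

Because the other features are held at their observed values, this naive probe only reflects the direct dependence of the action on the perturbed feature; the following paragraphs explain how the proposed method instead accounts for causal chains among state features.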
Our proposed XRL mechanism addresses the aforementioned limitations as follows. First, our method can generate inherently causal explanations. In essence, importance measures used in associational methods can only capture direct effects, while our causal importance measures capture total causal effects. For example, for the state feature $H_t$, our method can account for two causal chains: the direct-effect chain $H_t \to I_t$ and the indirect-effect chain $H_t \to C_t \to I_t$, while associational methods only consider the former. Second, our method can quantify the temporal effect between actions and states, such as the effect of today's humidity $H_t$ on tomorrow's irrigation $I_{t+1}$. In contrast, associational methods, such as saliency maps (Greydanus et al. 2018), cannot measure how previous state features affect the current action because their models only relate state and action within a single time step and ignore temporal relations. To the best of our knowledge, our XRL mechanism is the first to explain RL policies by causally explaining their actions based on causal state and temporal importance. It has been argued that humans are more receptive to contrastive explanations, i.e., humans answer a “Why X?” question through the answer to the often only implied counterfactual “Why not Y instead?” (Hilton 2007; Miller 2019). Because our causal explanations are based on contrastive samples, users may find them more intuitive.
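As a concrete illustration of the total-effect and temporal-effect points above, the toy linear SCM below uses made-up coefficients: humidity $H_t$ influences the irrigation action $I_t$ both directly and through crop growth $C_t$, and part of $H_t$ persists to the next step and influences $I_{t+1}$. The structure and numbers are illustrative assumptions, not the paper's learned model.

```python
# Toy linear SCM illustrating total vs. direct effects and the temporal effect.
def simulate(h_t, a=0.8, b=-0.5, c=-0.3, d=-0.2):
    """One pass through the toy SCM; returns (C_t, I_t, I_{t+1})."""
    c_t = a * h_t                  # chain H_t -> C_t
    i_t = b * h_t + c * c_t        # direct chain H_t -> I_t plus mediated C_t -> I_t
    h_next = 0.6 * h_t             # part of today's humidity persists to tomorrow
    i_next = d * h_next            # temporal chain H_t -> H_{t+1} -> I_{t+1}
    return c_t, i_t, i_next

h_fact, h_cf = 0.3, 0.7                          # factual vs. counterfactual humidity
_, i_fact, i1_fact = simulate(h_fact)
_, i_cf, i1_cf = simulate(h_cf)

total_effect = i_cf - i_fact          # includes both H_t -> I_t and H_t -> C_t -> I_t
direct_only = -0.5 * (h_cf - h_fact)  # coefficient b alone, i.e., with C_t held fixed
temporal_effect = i1_cf - i1_fact     # effect of today's H_t on tomorrow's I_{t+1}
print(total_effect, direct_only, temporal_effect)
```

In this toy model, an associational attribution computed within a single time step would report only the direct term and would miss the temporal term entirely.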
2 Related Work
Explainable RL (XRL) Based on how an XRL algo-
rithm generates its explanation, we can categorize existing
XRL methods into state-based, reward-based, and global
surrogate explanations (Puiutta and Veith 2020; Heuillet,
Couthouis, and Díaz-Rodríguez 2021; Wells and Bednarz
2021). State-based methods explain an action by highlight-
ing state features that are important for generating the action (Greydanus et al. 2018; Puri et al. 2019).
Reward-based methods generally apply reward decomposi-
tion and identify the sub-rewards that contribute the most
to decision making (Juozapaitis et al. 2019). Global surro-
gate methods generally approximate the original RL policy
with a simpler and transparent (also called intrinsically ex-
plainable) surrogate model, such as decision trees, and then
generate explanations with the surrogate model (Verma et al.
2018). In the context of state-based methods, there are gen-
erally two ways to quantify feature importance: (i) gradient-
based methods, such as simple gradient (Simonyan, Vedaldi,
and Zisserman 2013) and integrated gradients (Sundarara-
jan, Taly, and Yan 2017), and (ii) sensitivity-based meth-
ods, such as LIME (Ribeiro, Singh, and Guestrin 2016) and
SHAP (Lundberg and Lee 2017). Our work belongs to the
category of state-based methods. However, instead of us-
ing associations to calculate importance, as existing state-based methods generally do, our method adopts a
causal perspective. The benefits of such a causal approach
have been discussed in the Introduction section.
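For contrast with the causal importance used in this work, the sketch below computes a gradient-based (saliency-style) importance score for a small, randomly initialized policy network; the network shape and state values are hypothetical.

```python
# Minimal gradient-based (saliency-style) importance sketch, shown only to
# contrast associational measures with the causal ones used in this work.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))

state = torch.tensor([0.7, 0.3, 0.1], requires_grad=True)  # e.g., [H_t, C_t, P_t]
action = policy(state).squeeze()

# Saliency: |d(action)/d(feature)| reflects only the network's local sensitivity
# within one time step.
action.backward()
print(state.grad.abs())
```

Because the gradient treats all other features as fixed inputs, such scores reflect neither intra-state causal structure (e.g., $H_t \to C_t \to I_t$) nor temporal effects, which motivates the causal measures adopted here.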
Causal Explanation Causality has already been utilized
in XAI, mainly in supervised learning settings. Most existing studies quantify feature importance either by using Granger causality (Schwab and Karlen 2019) or average/individual causal effect metrics (Chattopadhyay et al. 2019), or by applying random-valued interventions (Datta, Sen, and Zick 2016). Two recent studies, (Madumal et al. 2020) and (Olson et al. 2021), both focus on causal explanations in an RL setting. Compared with (Madumal et al. 2020), the
main difference is that we provide a different type of expla-
nation. Our method involves finding an importance vector
that quantifies the impact of each state feature, while (Mad-
umal et al. 2020) provides a causal chain starting from the
action. We also demonstrate the ability of our approach to
provide temporal importance explanations that can capture
the impact of a state feature or action on the future state
or action. This aspect has been discussed in the crop irriga-
tion experiment in Section 6.1. Additionally, we construct
structural causal models(SCM) differently. While the action
is modeled as an edge in the SCM in the paper (Madumal
et al. 2020), our method formulates the action as a vertex
in the SCM model, allowing us to quantify the state feature
impact on action. As for (Olson et al. 2021), our approach is
unique in that it can calculate the temporal importance of a
state, which is not achievable by their method. Furthermore,
we provide a value-based definition of importance in terms of the Q-value, which differs from their method. Another significant difference between our approach and (Olson et al. 2021) is the
underlying assumption. Our method takes into account intra-
state relations, which are ignored in Olson’s work. Neglect-
ing intra-state causality is more likely to result in an invalid
state after the intervention, leading to inaccurate estimates
of importance. Therefore, our approach considers the causal relationships between state features to provide a more accurate and comprehensive explanation.
3 Preliminaries
We introduce the notations used throughout the paper. We use capital letters such as $X$ to denote a random variable and small letters such as $x$ for its value. Bold letters such as $\mathbf{X}$ denote a vector of random variables, and superscripts such as $X^{(i)}$ denote its $i$-th element. Calligraphic letters such as $\mathcal{X}$ denote sets. For a given natural number $n$, $[n]$ denotes the set $\{1, 2, \cdots, n\}$.
Causal Graph and Skeleton Causal graphs are proba-
bilistic graphical models that define data-generating pro-
cesses (Pearl 2009). Each vertex of the graph represents a
variable. Given a set of variables $\mathcal{V} = \{V_i, i \in [n]\}$, a directed edge from a variable $V_j$ to $V_i$ denotes that $V_i$ responds to changes in $V_j$ when all other variables are held constant. Variables connected to $V_i$ through directed edges are defined as the parents of $V_i$, or “direct causes of $V_i$,” and the set of all such variables is denoted by $\mathrm{Pa}_i$. The skeleton of a