legally entitled to an explanation under the GDPR regulation in the EU [31]. From the perspective of expert and non-expert users of the system, explainability is necessary to ensure trust. For experts who use AI systems as an aid in their everyday tasks, trust is a crucial component of successful collaboration. For example, a medical doctor using an AI system for diagnostics needs to understand it to trust its decisions and use them for this high-risk task [49].
Similarly, for non-expert users, trust is needed to encourage interaction with the system. If an AI system is used to make potentially life-altering decisions for the user, they need to understand how it operates to maintain their confidence and trust in it.
The field of explainable AI (XAI) explores methods for interpreting decisions of black-box systems in various fields such as machine learning, reinforcement learning, and explainable planning [2, 16, 26, 32, 50, 60, 74, 76, 78, 83, 90, 91, 97]. In
recent years, the focus of XAI has mostly been on explaining decisions of supervised learning models [10]. Specifically, the majority of XAI methods have focused on explaining the decisions of neural networks, due to the emergence of deep learning as the state-of-the-art approach to many supervised learning tasks [1, 13, 99]. In contrast, explainable RL (XRL)
is a fairly novel field that has not yet received an equal amount of attention. Most often, existing XRL methods focus on explaining DRL algorithms, which rely on neural networks to represent the agent’s policy, due to their prevalence and success [95]. However, as RL algorithms become more prominent and are considered for use in real-life tasks, there is a growing need to understand their decisions [24, 73]. For example, RL algorithms are being developed for different tasks in healthcare, such as dynamic treatment design [53, 59, 101]. Without rigorous verification and understanding of such systems, medical experts will be reluctant to collaborate with and rely on them [75]. Similarly, RL algorithms have been explored for enabling autonomous driving [4]. To understand and prevent mistakes such as the 2017 Uber accident [47], in which a self-driving car failed to stop for a pedestrian, the underlying decision-making systems have to be scrutable. Specific to the RL framework, explainability is also necessary to correct and prevent “reward hacking”: a phenomenon where an RL agent learns to exploit a potentially misspecified reward function, such as a vacuum cleaner ejecting collected dust to increase its cleaning time [3, 68].
In this work, we explore counterfactual explanations in supervised and reinforcement learning. Counterfactual explanations answer the question: “Given that the black-box model made decision y for input x, how can x be changed for the model to output alternative decision y’?” [93]. Counterfactual explanations offer actionable advice to users of black-box systems by generating counterfactual instances: instances as similar as possible to the original instance being explained but producing a desired outcome. If the user is not satisfied with the decision of a black-box system, a counterfactual explanation offers them a recipe for altering their input features to obtain a different output. For example, if a user is denied a loan by an AI system, they might be interested to know how they can change their application so that it gets accepted in the future. Counterfactual explanations are targeted at non-expert users, as they deal in high-level terms and offer actionable advice. They are also selective, aiming to change as few features as possible to achieve the desired output. As explanations that can suggest potentially life-altering actions to users, counterfactuals carry great responsibility. A useful counterfactual explanation can help the user achieve a desired outcome and increase their trust and confidence in the system. However, an ill-defined counterfactual that proposes unrealistic changes to the input features or does not deliver the desired outcome can waste the user’s time and effort and erode their trust in the system. For this reason, careful selection of counterfactual explanations is essential for maintaining user trust and encouraging their collaboration with the system.
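To make the loan example concrete, the sketch below shows one common way a counterfactual instance can be found for a differentiable model: gradient descent on a loss that trades off reaching the desired output against the distance to the original input, in the spirit of the formulation in [93]. The toy logistic model, its weights, and all hyperparameters are illustrative assumptions rather than part of any cited method.

```python
import numpy as np

# Toy differentiable "black box": a logistic scorer with fixed, made-up weights.
# In practice this would be the model whose decision is being explained.
W = np.array([1.5, -2.0, 0.5])
b = -0.2

def predict_proba(x):
    """Probability of the positive outcome (e.g. 'loan accepted')."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

def counterfactual(x, target=1.0, lam=10.0, lr=0.05, steps=500):
    """Gradient search for x' minimising
       lam * (f(x') - target)^2 + ||x' - x||_1,
    i.e. reach the desired output while staying close to the original input."""
    x_cf = x.copy()
    for _ in range(steps):
        p = predict_proba(x_cf)
        # Gradient of the squared prediction term through the sigmoid.
        grad_pred = 2.0 * lam * (p - target) * p * (1.0 - p) * W
        # Subgradient of the L1 distance term (encourages sparse feature changes).
        grad_dist = np.sign(x_cf - x)
        x_cf -= lr * (grad_pred + grad_dist)
    return x_cf

x = np.array([0.2, 0.8, -0.1])        # original applicant, scored below 0.5 ("denied")
x_cf = counterfactual(x, target=1.0)  # minimally changed applicant scored above 0.5
print(predict_proba(x), predict_proba(x_cf), x_cf - x)
```

The L1 distance term reflects the selectivity requirement discussed above: it steers the search towards counterfactuals that change as few features, and by as little, as possible.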
Although they have been explored in supervised learning [14, 19, 40, 58, 72, 96], counterfactual explanations are rarely applied to RL tasks [67]. In supervised learning, methods for generating counterfactual explanations often follow a similar pattern. Firstly, a loss function is defined, taking into account different properties of counterfactual instances,