legally entitled to an explanation under the GDPR regulation in the EU [31]. From the perspective of expert and non-expert users of the system, explainability is necessary to ensure trust. For experts who use AI systems as an aid in their everyday tasks, trust is a crucial component of successful collaboration. For example, a medical doctor using an AI system for diagnostics needs to understand it to trust its decisions and use them for this high-risk task [49].
Similarly, for non-expert users, trust is needed to encourage interaction with the system. If an AI system is used to make potentially life-altering decisions for the user, they need to understand how it operates to maintain their confidence and trust in it.
The field of explainable AI (XAI) explores methods for interpreting decisions of black-box systems in various fields such as machine learning, reinforcement learning, and explainable planning [2, 16, 26, 32, 50, 60, 74, 76, 78, 83, 90, 91, 97]. In
recent years, the focus of XAI has mostly been on explaining decisions of supervised learning models [10]. Specifically, the majority of XAI methods have focused on explaining the decisions of neural networks, due to the emergence of deep learning as the state-of-the-art approach to many supervised learning tasks [1, 13, 99]. In contrast, explainable RL (XRL)
is a fairly novel field that has not yet received an equal amount of attention. Most often, existing XRL methods focus on explaining DRL algorithms, which rely on neural networks to represent the agent’s policy, due to their prevalence and success [95]. However, as RL algorithms become more prominent and are considered for use in real-life tasks, there is a growing need to understand their decisions [24, 73]. For example, RL algorithms are being developed for different tasks in healthcare, such as dynamic treatment design [53, 59, 101]. Without rigorous verification and understanding of such systems, medical experts will be reluctant to collaborate with and rely on them [75]. Similarly, RL algorithms have been explored for enabling autonomous driving [4]. To understand and prevent mistakes such as the 2017 Uber accident [47], in which a self-driving car failed to stop for a pedestrian, the underlying decision-making systems have to be scrutable. Specific to the RL framework, explainability is also necessary to correct and prevent “reward hacking”: a phenomenon where an RL agent learns to exploit a potentially misspecified reward function, such as a vacuum cleaner ejecting collected dust to increase its cleaning time [3, 68].
In this work, we explore counterfactual explanations in supervised and reinforcement learning. Counterfactual explanations answer the question: “Given that the black-box model made decision y for input x, how can x be changed for the model to output alternative decision y’?” [93]. Counterfactual explanations offer actionable advice to users of black-box systems by generating counterfactual instances: instances as similar as possible to the original instance being explained but producing a desired outcome. If the user is not satisfied with the decision of a black-box system, a counterfactual explanation offers them a recipe for altering their input features to obtain a different output. For example, if a user is denied a loan by an AI system, they might be interested to know how they can change their application so that it gets accepted in the future. Counterfactual explanations are targeted at non-expert users, as they deal in high-level terms and offer actionable advice. They are also selective, aiming to change as few features as possible to achieve the desired output. As explanations that can suggest potentially life-altering actions to users, counterfactuals carry great responsibility. A useful counterfactual explanation can help the user achieve a desired outcome and increase their trust and confidence in the system. However, an ill-defined counterfactual that proposes unrealistic changes to the input features or does not deliver the desired outcome can waste the user’s time and effort and erode their trust in the system. For this reason, careful selection of counterfactual explanations is essential for maintaining user trust and encouraging their collaboration with the system.
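To make the loan example concrete, the sketch below shows one common way a counterfactual instance can be found for a differentiable model: gradient descent on a loss that trades off reaching the desired output against the distance to the original input, in the spirit of the formulation in [93]. The toy logistic model, its weights, and all hyperparameters are illustrative assumptions rather than part of any cited method.

```python
import numpy as np

# Toy differentiable "black box": a logistic scorer with fixed, made-up weights.
# In practice this would be the model whose decision is being explained.
W = np.array([1.5, -2.0, 0.5])
b = -0.2

def predict_proba(x):
    """Probability of the positive outcome (e.g. 'loan accepted')."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

def counterfactual(x, target=1.0, lam=10.0, lr=0.05, steps=500):
    """Gradient search for x' minimising
       lam * (f(x') - target)^2 + ||x' - x||_1,
    i.e. reach the desired output while staying close to the original input."""
    x_cf = x.copy()
    for _ in range(steps):
        p = predict_proba(x_cf)
        # Gradient of the squared prediction term through the sigmoid.
        grad_pred = 2.0 * lam * (p - target) * p * (1.0 - p) * W
        # Subgradient of the L1 distance term (encourages sparse feature changes).
        grad_dist = np.sign(x_cf - x)
        x_cf -= lr * (grad_pred + grad_dist)
    return x_cf

x = np.array([0.2, 0.8, -0.1])        # original applicant, scored below 0.5 ("denied")
x_cf = counterfactual(x, target=1.0)  # minimally changed applicant scored above 0.5
print(predict_proba(x), predict_proba(x_cf), x_cf - x)
```

The L1 distance term reflects the selectivity requirement discussed above: it steers the search towards counterfactuals that change as few features, and by as little, as possible.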
Although they have been explored in supervised learning [14, 19, 40, 58, 72, 96], counterfactual explanations are rarely applied to RL tasks [67]. In supervised learning, methods for generating counterfactual explanations often follow a similar pattern. Firstly, a loss function is defined, taking into account different properties of counterfactual instances,