Explaining Online Reinforcement Learning Decisions of Self-Adaptive Systems
Felix Feit, Andreas Metzger, Klaus Pohl
paluno (The Ruhr Institute for Software Technology)
University of Duisburg-Essen; Essen, Germany
f.m.feit@gmail.com, andreas.metzger@paluno.uni-due.de, klaus.pohl@paluno.uni-due.de
Abstract—Design time uncertainty poses an important chal-
lenge when developing a self-adaptive system. As an example,
defining how the system should adapt when facing a new environment state requires understanding the precise effect of an adaptation, which may not be known at design time. Online
reinforcement learning, i.e., employing reinforcement learning
(RL) at runtime, is an emerging approach to realizing self-
adaptive systems in the presence of design time uncertainty.
By using Online RL, the self-adaptive system can learn from
actual operational data and leverage feedback only available
at runtime. Recently, Deep RL has been gaining interest. Deep RL represents learned knowledge as a neural network, which allows it to generalize over unseen inputs and to handle continuous environment states and adaptation actions. A fundamental
problem of Deep RL is that learned knowledge is not explicitly
represented. For a human, it is practically impossible to relate
the parametrization of the neural network to concrete RL
decisions and thus Deep RL essentially appears as a black
box. Yet, understanding the decisions made by Deep RL is
key to (1) increasing trust, and (2) facilitating debugging.
Such debugging is especially relevant for self-adaptive systems, because the reward function, which quantifies the feedback to the RL algorithm, must be explicitly defined by developers, thus introducing a potential for human error. To explain Deep
RL for self-adaptive systems, we enhance and combine two
existing explainable RL techniques from the machine learning
literature. The combined technique, XRL-DINE, overcomes the
respective limitations of the individual techniques. We present
a proof-of-concept implementation of XRL-DINE, as well as
qualitative and quantitative results of applying XRL-DINE to
a self-adaptive system exemplar.
I. INTRODUCTION
A self-adaptive system can modify its own structure and
behavior at runtime based on its perception of the envi-
ronment, of itself and of its requirements [1]. One key
element of a self-adaptive system is its self-adaptation logic
that encodes when and how the system should adapt itself.
When developing the adaptation logic, developers face the
challenge of design time uncertainty [2], [3], [4]. To define
when the system should adapt, they have to anticipate all
potential environment states. However, this is infeasible in
most cases due to incomplete information at design time. As
an example, the concrete services that may be dynamically
bound during the execution of a service orchestration and
thus their quality characteristics are typically not known at
design time. To define how the system should adapt itself,
developers need to know the precise effect an adaptation
action has. However, the precise effect may not be known
at design time. As an example, while developers may know
in principle that enabling more features will negatively influ-
ence the performance, exactly determining the performance
impact is more challenging. A recent industrial survey iden-
tified optimal design and design complexity together with
design time uncertainty to be the most frequently observed
difficulties in designing self-adaptation in practice [4].
Online reinforcement learning (Online RL) is an emerging
approach to realize self-adaptive systems in the presence of
design time uncertainty. Online RL means that reinforcement
learning [5] is employed at runtime (see [3] for a discussion
of existing solutions). The self-adaptive system thereby can
learn from actual operational data and thus leverages infor-
mation only available at runtime. A recent survey indicates
that since 2019 the use of learning dominates over the use
of predetermined and static policies or rules [6].
Online RL aims at learning suitable adaptation actions
via the self-adaptive system’s interactions with its initially
unknown environment [7]. During system operation, the
RL algorithm receives a numerical reward based on actual
runtime monitoring data for executing an adaptation action.
The reward expresses how suitable that adaptation action
was in the short term. The goal of Online RL is to maximize
the cumulative reward.
Initially, research on self-adaptive systems leveraged RL
algorithms that represent learned knowledge as a so-called
value function [7]. The value function quantifies how much
cumulative reward can be expected if a particular adaptation
is chosen in a given environment state. Typically, this value
function was represented as a table. However, such tabular
approaches exhibit key limitations. First, they require a finite
set of environment states and a finite set of adaptations and
thus cannot be directly applied to continuous state spaces
or continuous action spaces. Second, they do not generalize
over neighboring states, which leads to slow learning in the
presence of continuous environment states [7].
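As a minimal, hypothetical illustration of such a tabular representation (the state and adaptation names below are invented for illustration), the value function is simply a lookup table over discrete (state, adaptation) pairs, which makes the limitation explicit: a continuous measurement has no row of its own and does not benefit from values learned for neighboring states.

```python
# Minimal, hypothetical sketch of a tabular action-value function; state and
# adaptation names are illustrative only.
q_table = {
    ("low_load", "remove_server"): 0.8,
    ("low_load", "add_server"): 0.2,
    ("high_load", "remove_server"): -0.5,
    ("high_load", "add_server"): 0.9,
}

def best_adaptation(state):
    # Works only if the state is one of the finitely many table keys; a
    # continuous measurement (e.g., a response time of 1.2345 s) has no row
    # of its own and shares nothing with neighboring states.
    candidates = {a: q for (s, a), q in q_table.items() if s == state}
    return max(candidates, key=candidates.get)

print(best_adaptation("high_load"))  # -> "add_server"
```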
Deep reinforcement learning (Deep RL) addresses these
disadvantages by representing the learned knowledge as a
neural network. Since neural network inputs are not limited
to elements of finite or discrete sets, and neural networks can
generalize well over inputs, Deep RL has shown remarkable
success in different application areas. Recently, Deep RL is
also being applied to self-adaptive systems [8], [9], [7].
A principal problem of Deep RL is that learned knowledge
is not explicitly represented. Instead, it is “hidden” in the
parametrization of the neural network. For a human, it
is practically impossible to relate this parametrization to
concrete RL decisions. Deep RL thus essentially appears
as a black box [10]. Yet, understanding the decisions made
by Deep RL systems is key to (1) increasing trust in these systems and (2) facilitating their debugging [11], [12].
Facilitating the debugging of Deep RL is especially rel-
evant for self-adaptive systems, because Online RL does
not completely eliminate manual development effort. Since
developers need to explicitly define the reward function, this
introduces a potential source of human error.
To explain Deep RL systems, various Explainable Re-
inforcement Learning (XRL) techniques were recently put
forward in machine learning research [10], [13]. Here, we
set out to answer the question of how existing XRL techniques can be applied to explain Online RL for self-adaptive systems. We follow the XRL literature and use "explainable" to also include "interpretable", even though one may consider "interpretable" only as a basis for "explainable".
Our contribution is to enhance and combine two existing
XRL techniques from the literature: Reward Decomposi-
tion [14] and Interestingness Elements [15]. Reward Decom-
position uses a suitable decomposition of the reward function
into sub-functions to explain the short-term goal orientation
of RL, thereby providing contrastive explanations.
Reward Decomposition is especially helpful for the typical
problem of adapting a system while taking into account
multiple quality goals. Each of these quality goals could
then be expressed as a reward sub-function. However, Reward Decomposition provides no indication of an explanation's relevance; instead, it requires manually selecting the relevant RL decisions to be explained. In particular, when RL decisions are taken
at runtime, which is the case for Online RL for self-
adaptive systems, monitoring all explanations to identify
relevant ones introduces cognitive overhead for developers.
In contrast, Interestingness Elements collect and evaluate
metrics at runtime to identify relevant moments of inter-
action between the system and its environment. However, for an identified relevant moment of interaction, this technique does not explain whether the system's decision making behaves as expected and for the right reasons.
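To make Reward Decomposition concrete, the following sketch shows how a reward function over multiple quality goals might be split into per-goal sub-functions (reward channels). The goals, thresholds, and weights are illustrative assumptions inspired by a web application setting, not the decomposition used in this paper.

```python
# Hypothetical decomposition of a reward function into per-goal sub-functions
# ("reward channels"). Goals, thresholds, and weights are illustrative only.

def reward_performance(avg_response_time, threshold=1.0):
    # Positive reward if the average response time stays below the threshold.
    return 1.0 if avg_response_time <= threshold else -1.0

def reward_cost(num_servers, max_servers=4):
    # Cheaper configurations (fewer servers) yield higher reward.
    return 1.0 - num_servers / max_servers

def reward_user_satisfaction(dimmer_value):
    # Serving more optional content (higher dimmer value) is rewarded.
    return dimmer_value

def decomposed_reward(avg_response_time, num_servers, dimmer_value):
    # The total reward is the sum of the channels; keeping the channels
    # separate is what later allows per-channel, contrastive explanations.
    channels = {
        "performance": reward_performance(avg_response_time),
        "cost": reward_cost(num_servers),
        "user_satisfaction": reward_user_satisfaction(dimmer_value),
    }
    return sum(channels.values()), channels

total, channels = decomposed_reward(avg_response_time=0.8, num_servers=2, dimmer_value=0.75)
print(total, channels)
```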
Our technique, XRL-DINE, combines the two aforemen-
tioned techniques to overcome their respective limitations.
XRL-DINE provides detailed explanations at relevant points
in time by computing and visualizing so-called Decomposed
INterestingness Elements (DINEs). We introduce three types
of DINEs: “Important Interaction”, “Reward Channel Ex-
tremum”, and “Reward Channel Dominance”.
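Section III defines these DINE types precisely; purely as intuition, the hypothetical sketch below shows how a dominance-style check could be derived from decomposed action values, i.e., one Q-estimate per reward channel for the selected adaptation action. The threshold and the exact criterion are illustrative and may differ from the actual XRL-DINE definitions.

```python
# Rough intuition only: a hypothetical dominance-style check over decomposed
# action values (one Q-estimate per reward channel for the selected action).
# The actual DINE definitions in XRL-DINE may differ from this sketch.

def dominant_channel(q_per_channel, dominance_ratio=0.6):
    # q_per_channel: dict mapping reward channel -> Q-value contribution
    # for the adaptation action that was actually selected.
    positive = {c: max(q, 0.0) for c, q in q_per_channel.items()}
    total = sum(positive.values())
    if total == 0.0:
        return None
    for channel, q in positive.items():
        if q / total >= dominance_ratio:
            return channel  # this channel dominated the decision
    return None

print(dominant_channel({"performance": 2.4, "cost": 0.3, "user_satisfaction": 0.5}))
# -> "performance"
```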
We prototypically implement and apply XRL-DINE to the
self-adaptive system exemplar SWIM – a self-adaptive web
application [16] – to serve as a proof of concept and to provide
qualitative and quantitative results.
Sect. II provides foundations as the basis for introducing
XRL-DINE in Sect. III. Sect. IV describes the proof-of-
concept implementation of XRL-DINE as well as its qual-
itative and quantitative evaluation, while Sect. V discusses
limitations. Sect. VI relates XRL-DINE to existing work.
II. FOUNDATIONS
A. Online RL for Self-adaptive Systems
Reinforcement Learning (RL) aims to learn an optimal action selection policy via the interactions of a system (called agent in RL) with its initially unknown environment [5]. As
shown in Fig. 1(a), the agent finds itself in environment state s at a given time step. The agent then selects an action a (from its set of potential adaptation actions) and executes it. As a result, the environment transitions to the next state s' and the agent receives a reward r for executing the action. The reward r, together with the information about the next state s', is used to update the action selection policy of
the agent. The goal of RL is to maximize the cumulative
reward. Online RL applies RL during system operation,
where actions have an effect on the live system, resulting
in reward signals based on actual monitoring data [7].
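To make this interaction loop concrete, the following self-contained sketch uses a toy environment, epsilon-greedy action selection, and a tabular Q-learning update; all names, rewards, and hyperparameters are illustrative assumptions rather than the setup used in this paper.

```python
# Minimal, illustrative sketch of the online RL interaction loop with a toy
# environment and a tabular Q-learning update (not this paper's concrete setup).
import random
from collections import defaultdict

STATES = ["low_load", "high_load"]
ACTIONS = ["add_server", "remove_server"]

class ToyEnvironment:
    """Stands in for the live system plus its monitoring infrastructure."""
    def __init__(self):
        self.state = "low_load"

    def observe(self):
        return self.state

    def execute(self, action):
        # The environment changes both due to the adaptation and on its own.
        self.state = random.choice(STATES)

    def reward(self, state, action):
        # Reward adding capacity under high load and saving cost under low load.
        if state == "high_load":
            return 1.0 if action == "add_server" else -1.0
        return 1.0 if action == "remove_server" else -0.2

q = defaultdict(float)          # q[(state, action)] -> estimated cumulative reward
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def select_action(state):
    # Epsilon-greedy: mostly exploit the learned values, sometimes explore.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

env = ToyEnvironment()
s = env.observe()
for _ in range(5000):
    a = select_action(s)                    # Analyze + Plan: choose adaptation action
    env.execute(a)                          # Execute: apply the adaptation action
    s_next = env.observe()                  # Monitor: next state from runtime data
    r = env.reward(s, a)                    # numerical reward from monitoring data
    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # policy update
    s = s_next

print({k: round(v, 2) for k, v in q.items()})
```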
[Figure 1 (adapted from [3]) shows: (a) the RL loop between agent and environment, with policy, policy update, action a, state s, reward r, and next state s'; (b) the MAPE-K-based self-adaptation logic (Monitor, Analyze, Plan, Execute, Knowledge); and (c) the self-adaptation logic realized via RL, in which Monitor supplies state s and reward r, action selection over the policy (knowledge) realizes Analyze + Plan, and Execute applies the adaptation action a.]
Figure 1. RL, MAPE-K, and their integration (adapted from [3])
This paper focuses on explaining value-based Deep RL
approaches for self-adaptive systems. The reason is that
the employed XRL technique of Reward Decomposition
requires using value-based RL. In value-based Deep RL, the
policy depends on a learned action-value function Q(S, A),
which gives the expected cumulative reward when executing a given adaptation action in a given environment state.
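As an illustration of such a value-based policy (the network architecture, state features, and action set below are illustrative assumptions, not the model used in this paper), a small neural network can estimate Q-values for all adaptation actions of a given state at once, and the greedy policy then selects the action with the highest estimate.

```python
# Hypothetical sketch of a value-based Deep RL policy: a small neural network
# approximates Q(s, a) for all adaptation actions of a state at once, and the
# greedy policy picks the action with the highest estimated value. Network
# size, state features, and action set are illustrative only.
import torch
import torch.nn as nn

NUM_STATE_FEATURES = 3      # e.g., response time, arrival rate, active servers
NUM_ACTIONS = 4             # e.g., add/remove server, raise/lower dimmer

q_network = nn.Sequential(
    nn.Linear(NUM_STATE_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, NUM_ACTIONS),          # one Q-estimate per adaptation action
)

def greedy_adaptation(state_features):
    with torch.no_grad():
        q_values = q_network(torch.tensor(state_features, dtype=torch.float32))
    return int(torch.argmax(q_values))   # index of the highest-valued action

print(greedy_adaptation([0.8, 120.0, 2.0]))
```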