Explaining Online Reinforcement Learning Decisions of Self-Adaptive Systems
Felix Feit, Andreas Metzger, Klaus Pohl
paluno (The Ruhr Institute for Software Technology)
University of Duisburg-Essen; Essen, Germany
f.m.feit@gmail.com, andreas.metzger@paluno.uni-due.de, klaus.pohl@paluno.uni-due.de
Abstract—Design time uncertainty poses an important chal-
lenge when developing a self-adaptive system. As an example,
defining how the system should adapt when facing a new environment state requires understanding the precise effect of an adaptation, which may not be known at design time. Online
reinforcement learning, i.e., employing reinforcement learning
(RL) at runtime, is an emerging approach to realizing self-
adaptive systems in the presence of design time uncertainty.
By using Online RL, the self-adaptive system can learn from
actual operational data and leverage feedback only available
at runtime. Recently, Deep RL has been gaining interest. Deep RL represents learned knowledge as a neural network, which allows it to generalize over unseen inputs and to handle continuous environment states and adaptation actions. A fundamental
problem of Deep RL is that learned knowledge is not explicitly
represented. For a human, it is practically impossible to relate
the parametrization of the neural network to concrete RL
decisions and thus Deep RL essentially appears as a black
box. Yet, understanding the decisions made by Deep RL is
key to (1) increasing trust, and (2) facilitating debugging.
Such debugging is especially relevant for self-adaptive systems, because the reward function, which quantifies the feedback to the RL algorithm, must be explicitly defined by developers, thus introducing a potential for human error. To explain Deep
RL for self-adaptive systems, we enhance and combine two
existing explainable RL techniques from the machine learning
literature. The combined technique, XRL-DINE, overcomes the
respective limitations of the individual techniques. We present
a proof-of-concept implementation of XRL-DINE, as well as
qualitative and quantitative results of applying XRL-DINE to
a self-adaptive system exemplar.
I. INTRODUCTION
A self-adaptive system can modify its own structure and
behavior at runtime based on its perception of the envi-
ronment, of itself and of its requirements [1]. One key
element of a self-adaptive system is its self-adaptation logic
that encodes when and how the system should adapt itself.
When developing the adaptation logic, developers face the
challenge of design time uncertainty [2], [3], [4]. To define
when the system should adapt, they have to anticipate all
potential environment states. However, this is infeasible in
most cases due to incomplete information at design time. As
an example, the concrete services that may be dynamically
bound during the execution of a service orchestration and
thus their quality characteristics are typically not known at
design time. To define how the system should adapt itself,
developers need to know the precise effect an adaptation
action has. However, the precise effect may not be known
at design time. As an example, while developers may know
in principle that enabling more features will negatively influ-
ence the performance, exactly determining the performance
impact is more challenging. A recent industrial survey iden-
tified optimal design and design complexity together with
design time uncertainty to be the most frequently observed
difficulties in designing self-adaptation in practice [4].
Online reinforcement learning (Online RL) is an emerging
approach to realize self-adaptive systems in the presence of
design time uncertainty. Online RL means that reinforcement
learning [5] is employed at runtime (see [3] for a discussion
of existing solutions). The self-adaptive system thereby can
learn from actual operational data and thus leverages infor-
mation only available at runtime. A recent survey indicates
that since 2019 the use of learning dominates over the use
of predetermined and static policies or rules [6].
Online RL aims at learning suitable adaptation actions
via the self-adaptive system’s interactions with its initially
unknown environment [7]. During system operation, the
RL algorithm receives a numerical reward based on actual
runtime monitoring data for executing an adaptation action.
The reward expresses how suitable that adaptation action
was in the short term. The goal of Online RL is to maximize
the cumulative reward.
Initially, research on self-adaptive systems leveraged RL
algorithms that represent learned knowledge as a so-called
value function [7]. The value function quantifies how much
cumulative reward can be expected if a particular adaptation
is chosen in a given environment state. Typically, this value
function was represented as a table. However, such tabular
approaches exhibit key limitations. First, they require a finite
set of environment states and a finite set of adaptations and
thus cannot be directly applied to continuous state spaces
or continuous action spaces. Second, they do not generalize
over neighboring states, which leads to slow learning in the
presence of continuous environment states [7].
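As a minimal, hypothetical illustration of such a tabular representation (the state and adaptation names below are invented for illustration), the value function is simply a lookup table over discrete (state, adaptation) pairs, which makes the limitation explicit: a continuous measurement has no row of its own and does not benefit from values learned for neighboring states.

```python
# Minimal, hypothetical sketch of a tabular action-value function; state and
# adaptation names are illustrative only.
q_table = {
    ("low_load", "remove_server"): 0.8,
    ("low_load", "add_server"): 0.2,
    ("high_load", "remove_server"): -0.5,
    ("high_load", "add_server"): 0.9,
}

def best_adaptation(state):
    # Works only if the state is one of the finitely many table keys; a
    # continuous measurement (e.g., a response time of 1.2345 s) has no row
    # of its own and shares nothing with neighboring states.
    candidates = {a: q for (s, a), q in q_table.items() if s == state}
    return max(candidates, key=candidates.get)

print(best_adaptation("high_load"))  # -> "add_server"
```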
Deep reinforcement learning (Deep RL) addresses these
disadvantages by representing the learned knowledge as a
neural network. Since neural network inputs are not limited
to elements of finite or discrete sets, and neural networks can
generalize well over inputs, Deep RL has shown remarkable
success in different application areas. Recently, Deep RL is
also being applied to self-adaptive systems [8], [9], [7].
A principal problem of Deep RL is that learned knowledge
is not explicitly represented. Instead, it is “hidden” in the
parametrization of the neural network. For a human, it
is practically impossible to relate this parametrization to
concrete RL decisions. Deep RL thus essentially appears
as a black box [10]. Yet, understanding the decisions made
by Deep RL systems is key to (1) increasing trust in these systems and (2) facilitating their debugging [11], [12].
Facilitating the debugging of Deep RL is especially rel-
evant for self-adaptive systems, because Online RL does
not completely eliminate manual development effort. Since
developers need to explicitly define the reward function, this
introduces a potential source of human error.
To explain Deep RL systems, various Explainable Re-
inforcement Learning (XRL) techniques were recently put
forward in machine learning research [10], [13]. Here, we
set out to answer the question of how existing XRL techniques can be applied to explain Online RL for self-adaptive systems. We follow the XRL literature and use "explainable" to also include "interpretable", even though one may consider "interpretable" only as a basis for "explainable".
Our contribution is to enhance and combine two existing
XRL techniques from the literature: Reward Decomposi-
tion [14] and Interestingness Elements [15]. Reward Decom-
position uses a suitable decomposition of the reward function
into sub-functions to explain the short-term goal orientation
of RL, thereby providing contrastive explanations.
Reward Decomposition is especially helpful for the typical
problem of adapting a system while taking into account
multiple quality goals. Each of these quality goals could
then be expressed as a reward sub-function. However, Reward Decomposition provides no indication of an explanation's relevance; instead, it requires manually selecting the relevant RL decisions to be explained. In particular, when RL decisions are taken
at runtime, which is the case for Online RL for self-
adaptive systems, monitoring all explanations to identify
relevant ones introduces cognitive overhead for developers.
In contrast, Interestingness Elements collect and evaluate
metrics at runtime to identify relevant moments of inter-
action between the system and its environment. However, for an identified relevant moment of interaction, this technique does not explain whether the system's decision making behaves as expected and for the right reasons.
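To make Reward Decomposition concrete, the following sketch shows how a reward function over multiple quality goals might be split into per-goal sub-functions (reward channels). The goals, thresholds, and weights are illustrative assumptions inspired by a web application setting, not the decomposition used in this paper.

```python
# Hypothetical decomposition of a reward function into per-goal sub-functions
# ("reward channels"). Goals, thresholds, and weights are illustrative only.

def reward_performance(avg_response_time, threshold=1.0):
    # Positive reward if the average response time stays below the threshold.
    return 1.0 if avg_response_time <= threshold else -1.0

def reward_cost(num_servers, max_servers=4):
    # Cheaper configurations (fewer servers) yield higher reward.
    return 1.0 - num_servers / max_servers

def reward_user_satisfaction(dimmer_value):
    # Serving more optional content (higher dimmer value) is rewarded.
    return dimmer_value

def decomposed_reward(avg_response_time, num_servers, dimmer_value):
    # The total reward is the sum of the channels; keeping the channels
    # separate is what later allows per-channel, contrastive explanations.
    channels = {
        "performance": reward_performance(avg_response_time),
        "cost": reward_cost(num_servers),
        "user_satisfaction": reward_user_satisfaction(dimmer_value),
    }
    return sum(channels.values()), channels

total, channels = decomposed_reward(avg_response_time=0.8, num_servers=2, dimmer_value=0.75)
print(total, channels)
```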
Our technique, XRL-DINE, combines the two aforemen-
tioned techniques to overcome their respective limitations.
XRL-DINE provides detailed explanations at relevant points
in time by computing and visualizing so-called Decomposed
INterestingness Elements (DINEs). We introduce three types
of DINEs: “Important Interaction”, “Reward Channel Ex-
tremum”, and “Reward Channel Dominance”.
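Section III defines these DINE types precisely; purely as intuition, the hypothetical sketch below shows how a dominance-style check could be derived from decomposed action values, i.e., one Q-estimate per reward channel for the selected adaptation action. The threshold and the exact criterion are illustrative and may differ from the actual XRL-DINE definitions.

```python
# Rough intuition only: a hypothetical dominance-style check over decomposed
# action values (one Q-estimate per reward channel for the selected action).
# The actual DINE definitions in XRL-DINE may differ from this sketch.

def dominant_channel(q_per_channel, dominance_ratio=0.6):
    # q_per_channel: dict mapping reward channel -> Q-value contribution
    # for the adaptation action that was actually selected.
    positive = {c: max(q, 0.0) for c, q in q_per_channel.items()}
    total = sum(positive.values())
    if total == 0.0:
        return None
    for channel, q in positive.items():
        if q / total >= dominance_ratio:
            return channel  # this channel dominated the decision
    return None

print(dominant_channel({"performance": 2.4, "cost": 0.3, "user_satisfaction": 0.5}))
# -> "performance"
```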
We prototypically implement and apply XRL-DINE to the
self-adaptive system exemplar SWIM – a self-adaptive web
application [16] – to serve as a proof of concept and to provide
qualitative and quantitative results.
Sect. II provides foundations as the basis for introducing
XRL-DINE in Sect. III. Sect. IV describes the proof-of-
concept implementation of XRL-DINE as well as its qual-
itative and quantitative evaluation, while Sect. V discusses
limitations. Sect. VI relates XRL-DINE to existing work.
II. FOUNDATIONS
A. Online RL for Self-adaptive Systems
Reinforcement Learning (RL) aims to learn an optimal action selection policy via the interactions of a system (called agent in RL) with its initially unknown environment [5]. As
shown in Fig. 1(a), the agent finds itself in environment state s at a given time step. The agent then selects an action a (from its set of potential adaptation actions) and executes it. As a result, the environment transitions to the next state s' and the agent receives a reward r for executing the action. The reward r, together with the information about the next state s', is used to update the action selection policy of
the agent. The goal of RL is to maximize the cumulative
reward. Online RL applies RL during system operation,
where actions have an effect on the live system, resulting
in reward signals based on actual monitoring data [7].
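To make this interaction loop concrete, the following self-contained sketch uses a toy environment, epsilon-greedy action selection, and a tabular Q-learning update; all names, rewards, and hyperparameters are illustrative assumptions rather than the setup used in this paper.

```python
# Minimal, illustrative sketch of the online RL interaction loop with a toy
# environment and a tabular Q-learning update (not this paper's concrete setup).
import random
from collections import defaultdict

STATES = ["low_load", "high_load"]
ACTIONS = ["add_server", "remove_server"]

class ToyEnvironment:
    """Stands in for the live system plus its monitoring infrastructure."""
    def __init__(self):
        self.state = "low_load"

    def observe(self):
        return self.state

    def execute(self, action):
        # The environment changes both due to the adaptation and on its own.
        self.state = random.choice(STATES)

    def reward(self, state, action):
        # Reward adding capacity under high load and saving cost under low load.
        if state == "high_load":
            return 1.0 if action == "add_server" else -1.0
        return 1.0 if action == "remove_server" else -0.2

q = defaultdict(float)          # q[(state, action)] -> estimated cumulative reward
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def select_action(state):
    # Epsilon-greedy: mostly exploit the learned values, sometimes explore.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

env = ToyEnvironment()
s = env.observe()
for _ in range(5000):
    a = select_action(s)                    # Analyze + Plan: choose adaptation action
    env.execute(a)                          # Execute: apply the adaptation action
    s_next = env.observe()                  # Monitor: next state from runtime data
    r = env.reward(s, a)                    # numerical reward from monitoring data
    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # policy update
    s = s_next

print({k: round(v, 2) for k, v in q.items()})
```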
[Figure 1 (adapted from [3]) shows: (a) the RL loop between agent and environment, with policy, policy update, action a, state s, reward r, and next state s'; (b) the MAPE-K-based self-adaptation logic (Monitor, Analyze, Plan, Execute, Knowledge); and (c) the self-adaptation logic realized via RL, in which Monitor supplies state s and reward r, action selection over the policy (knowledge) realizes Analyze + Plan, and Execute applies the adaptation action a.]
Figure 1. RL, MAPE-K, and their integration (adapted from [3])
This paper focuses on explaining value-based Deep RL
approaches for self-adaptive systems. The reason is that
the employed XRL technique of Reward Decomposition
requires using value-based RL. In value-based Deep RL, the
policy depends on a learned action-value function Q(S, A),
which gives the expected cumulative reward when executing a given adaptation action in a given environment state.
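As an illustration of such a value-based policy (the network architecture, state features, and action set below are illustrative assumptions, not the model used in this paper), a small neural network can estimate Q-values for all adaptation actions of a given state at once, and the greedy policy then selects the action with the highest estimate.

```python
# Hypothetical sketch of a value-based Deep RL policy: a small neural network
# approximates Q(s, a) for all adaptation actions of a state at once, and the
# greedy policy picks the action with the highest estimated value. Network
# size, state features, and action set are illustrative only.
import torch
import torch.nn as nn

NUM_STATE_FEATURES = 3      # e.g., response time, arrival rate, active servers
NUM_ACTIONS = 4             # e.g., add/remove server, raise/lower dimmer

q_network = nn.Sequential(
    nn.Linear(NUM_STATE_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, NUM_ACTIONS),          # one Q-estimate per adaptation action
)

def greedy_adaptation(state_features):
    with torch.no_grad():
        q_values = q_network(torch.tensor(state_features, dtype=torch.float32))
    return int(torch.argmax(q_values))   # index of the highest-valued action

print(greedy_adaptation([0.8, 120.0, 2.0]))
```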