Explaining Online Reinforcement Learning Decisions of Self-Adaptive Systems
Felix Feit, Andreas Metzger, Klaus Pohl
paluno (The Ruhr Institute for Software Technology)
University of Duisburg-Essen; Essen, Germany
f.m.feit@gmail.com, andreas.metzger@paluno.uni-due.de, klaus.pohl@paluno.uni-due.de
Abstract—Design time uncertainty poses an important chal-
lenge when developing a self-adaptive system. As an example,
defining how the system should adapt when facing a new environment state requires understanding the precise effect of an adaptation, which may not be known at design time. Online
reinforcement learning, i.e., employing reinforcement learning
(RL) at runtime, is an emerging approach to realizing self-
adaptive systems in the presence of design time uncertainty.
By using Online RL, the self-adaptive system can learn from
actual operational data and leverage feedback only available
at runtime. Recently, Deep RL has been gaining interest. Deep RL represents learned knowledge as a neural network, which allows it to generalize over unseen inputs as well as handle continuous
environment states and adaptation actions. A fundamental
problem of Deep RL is that learned knowledge is not explicitly
represented. For a human, it is practically impossible to relate
the parametrization of the neural network to concrete RL
decisions and thus Deep RL essentially appears as a black
box. Yet, understanding the decisions made by Deep RL is
key to (1) increasing trust, and (2) facilitating debugging.
Such debugging is especially relevant for self-adaptive systems, because the reward function, which quantifies the feedback to the RL algorithm, must be explicitly defined by developers, thus introducing a potential for human error. To explain Deep
RL for self-adaptive systems, we enhance and combine two
existing explainable RL techniques from the machine learning
literature. The combined technique, XRL-DINE, overcomes the
respective limitations of the individual techniques. We present
a proof-of-concept implementation of XRL-DINE, as well as
qualitative and quantitative results of applying XRL-DINE to
a self-adaptive system exemplar.
I. INTRODUCTION
A self-adaptive system can modify its own structure and
behavior at runtime based on its perception of the envi-
ronment, of itself and of its requirements [1]. One key
element of a self-adaptive system is its self-adaptation logic
that encodes when and how the system should adapt itself.
When developing the adaptation logic, developers face the
challenge of design time uncertainty [2], [3], [4]. To define
when the system should adapt, they have to anticipate all
potential environment states. However, this is infeasible in
most cases due to incomplete information at design time. As
an example, the concrete services that may be dynamically
bound during the execution of a service orchestration and
thus their quality characteristics are typically not known at
design time. To define how the system should adapt itself,
developers need to know the precise effect an adaptation
action has. However, the precise effect may not be known
at design time. As an example, while developers may know
in principle that enabling more features will negatively influ-
ence the performance, exactly determining the performance
impact is more challenging. A recent industrial survey identified optimal design and design complexity, together with design time uncertainty, as the most frequently observed difficulties in designing self-adaptation in practice [4].
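To make the notion of hand-crafted adaptation logic concrete, consider the following minimal sketch of a rule-based policy of the kind a developer would have to specify at design time. The monitored metric, the thresholds, and the action names are purely hypothetical; they stand in for exactly the values that are hard to fix under design time uncertainty.

# Hypothetical, hand-crafted adaptation rule (Python). The metric name,
# thresholds, and action names are illustrative assumptions only.
def adaptation_logic(monitored_data):
    # "When to adapt": the developer must anticipate the relevant
    # environment states, here reduced to a single latency threshold.
    if monitored_data["avg_response_time_ms"] > 500:
        # "How to adapt": the developer must know the effect of the
        # action, e.g., how much disabling optional features helps.
        return "disable_optional_features"
    if monitored_data["avg_response_time_ms"] < 100:
        return "enable_optional_features"
    return "no_adaptation"

Fixing the thresholds 500 and 100 at design time presupposes precisely the knowledge about environment states and adaptation effects that, as argued above, is typically incomplete.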
Online reinforcement learning (Online RL) is an emerging
approach to realizing self-adaptive systems in the presence of
design time uncertainty. Online RL means that reinforcement
learning [5] is employed at runtime (see [3] for a discussion
of existing solutions). The self-adaptive system can thereby learn from actual operational data and thus leverage information only available at runtime. A recent survey indicates
that since 2019 the use of learning dominates over the use
of predetermined and static policies or rules [6].
Online RL aims at learning suitable adaptation actions
via the self-adaptive system’s interactions with its initially
unknown environment [7]. During system operation, the RL algorithm receives, for each executed adaptation action, a numerical reward computed from actual runtime monitoring data.
The reward expresses how suitable that adaptation action
was in the short term. The goal of Online RL is to maximize
the cumulative reward.
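Stated more formally, Online RL maximizes the expected return, i.e., the cumulative (typically discounted) future reward; the following is the standard formulation from the RL literature [5] rather than a notation introduced here:

G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma \le 1,

where r_{t+k+1} denotes the reward received k+1 time steps after time t, and the discount factor \gamma determines how strongly future rewards contribute to the return.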
Initially, research on self-adaptive systems leveraged RL
algorithms that represent learned knowledge as a so-called
value function [7]. The value function quantifies how much
cumulative reward can be expected if a particular adaptation
is chosen in a given environment state. Typically, this value
function was represented as a table. However, such tabular
approaches exhibit key limitations. First, they require a finite
set of environment states and a finite set of adaptations and
thus cannot be directly applied to continuous state spaces
or continuous action spaces. Second, they do not generalize
over neighboring states, which leads to slow learning in the
presence of continuous environment states [7].
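As an illustration of such a tabular representation, the following sketch maintains the value function as a Q-table updated via Q-learning. The sketch assumes discrete, finite sets of states and adaptations; the state and adaptation names are hypothetical and only indicate that each (state, adaptation) pair occupies its own table entry.

# Illustrative tabular Q-learning sketch (Python); states and
# adaptation actions are assumed to be discrete and finite.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9        # learning rate and discount factor
ADAPTATIONS = ["add_instance", "remove_instance", "no_adaptation"]  # hypothetical
q_table = defaultdict(float)   # (state, adaptation) -> expected cumulative reward

def update(state, adaptation, reward, next_state):
    # Standard Q-learning update for one observed transition.
    best_next = max(q_table[(next_state, a)] for a in ADAPTATIONS)
    q_table[(state, adaptation)] += ALPHA * (
        reward + GAMMA * best_next - q_table[(state, adaptation)]
    )

# Example transition with hypothetical state labels.
update("high_load", "add_instance", reward=0.8, next_state="normal_load")

Because every (state, adaptation) pair has its own entry, nothing learned for one state transfers to a neighboring state, which is precisely the lack of generalization noted above.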
Deep reinforcement learning (Deep RL) addresses these
disadvantages by representing the learned knowledge as a
neural network. Since neural network inputs are not limited
to elements of finite or discrete sets, and neural networks can
generalize well over inputs, Deep RL has shown remarkable