Understanding the Evolution of Linear Regions in
Deep Reinforcement Learning
Setareh Cohan
Department of Computer Science
University of British Columbia
setarehc@cs.ubc.ca
Nam Hee Kim
Department of Computer Science
Aalto University
namhee.kim@aalto.fi
David Rolnick
School of Computer Science
McGill University
drolnick@cs.mcgill.ca
Michiel van de Panne
Department of Computer Science
University of British Columbia
van@cs.ubc.ca
Abstract
Policies produced by deep reinforcement learning are typically characterised by their learning curves, but they remain poorly understood in many other respects. ReLU-based policies result in a partitioning of the input space into piecewise linear regions. We seek to understand how observed region counts and their densities evolve during deep reinforcement learning using empirical results that span a range of continuous control tasks and policy network dimensions. Intuitively, we may expect that during training, the region density increases in the areas that are frequently visited by the policy, thereby affording fine-grained control. We use recent theoretical and empirical results for the linear regions induced by neural networks in supervised learning settings for grounding and comparison of our results. Empirically, we find that the region density increases only moderately throughout training, as measured along fixed trajectories coming from the final policy. However, the trajectories themselves also increase in length during training, and thus the region densities decrease as seen from the perspective of the current trajectory. Our findings suggest that the complexity of deep reinforcement learning policies does not principally emerge from a significant growth in the complexity of functions observed on-and-around trajectories of the policy.
1 Introduction
Deep reinforcement learning (RL) utilizes a neural network to represent the policy and trains this network to optimize an objective, typically the expected value of time-discounted future rewards. Deep RL algorithms have been successfully applied to diverse applications including robotics, challenging games, and an increasing number of real-world decision-and-control problems [François-Lavet et al., 2018]. For a given choice of task, RL algorithm, and policy network configuration, the performance is commonly characterised via the learning curves, which provide insight into the learning efficiency and the final performance. However, little has been done to understand the detailed structure of the state-to-action mappings induced by the control policies and how these evolve over time.
In this work, we aim to further understand deep feed-forward neural network policies that use rectified linear activation functions (rectified linear units, or ReLUs). ReLUs [Nair and Hinton, 2010] are among the most popular choices of activation functions due to their practical successes [Montúfar et al., 2014]. For RL, these activations induce a piecewise linear mapping from states to actions.
Figure 1: Schematic illustration of a trajectory traversing the piecewise linear regions in the policy state space. $S_0$ and $S_k$ indicate the initial and final states of the trajectory.
In this mapping, the input space, i.e., the state space, is divided into distinct linear regions, within each of which the actions are a linear function of the states. Note that the function on each region is formally affine, due to the constant-valued bias terms; for simplicity and convenience, these are more commonly described as linear regions. Figure 1 provides a schematic illustration of these regions, along with a policy trajectory.
The number of distinct regions into which the input space is divided is a natural measure of network
expressivity. Learned functions with many linear regions have the capacity to build complex and
flexible decision boundaries. Thus, the problem of counting the number of linear regions has been
extensively studied in recent literature [Montúfar et al., 2014, Raghu et al., 2017, Hanin and Rolnick,
2019a, Serra et al., 2018]. While the maximum number of regions is exponential with respect to
network depth [Montúfar et al., 2014], recent work has demonstrated that the number of regions is
instead typically proportional to the number of neurons [Hanin and Rolnick, 2019a].
For RL, we are interested in the local granularity (density) of linear regions along trajectories arising
from the policy. Fine-grained regions afford fine-grained control, and thus we may hypothesize that
region density increases in regions frequently visited by the policy, in order to afford better control.
Recent work in supervised learning of image classification is inconsistent with regard to the region density seen in the vicinity of data points, with some reporting a decrease, argued to provide better generalization and robustness to perturbation [Novak et al., 2018], and others finding no such effect [Hanin and Rolnick, 2019a]. For the RL setting, we note that counting regions visited along an episode trajectory arguably provides a meaningful and task-grounded measurement, in contrast to the line segments and ellipses passing through randomly sampled training points that have been used in the prior works mentioned above. We further note that piecewise-affine control strategies
are commonly designed into control systems, e.g., via gain scheduling. Understanding how these
regions are designed and distributed by deep RL thus helps establish bridges with these existing
methods.
To the best of our knowledge, our work is the first to investigate the structure and evolution of
linear regions of ReLU-based deep RL policies in detail. We seek to answer several basic empirical
questions:
Q1 Do findings for network expressivity, originally developed in supervised learning settings,
apply to RL policies? How are the region densities affected by the policy network configuration? Do deeper policy networks result in finer-grained regions and hence an increased
expressivity?
Q2 How do the linear regions of a policy evolve during training? Do we see a significantly
greater density of regions emerge along the areas of the state space frequently visited by
the episodic trajectories, thereby allowing for finer-grained control? Do random-action
trajectories see different densities?
The key results can be summarized as follows, for policies trained using proximal policy optimization (PPO) [Schulman et al., 2017], and evaluated on four different continuous control tasks. Q1:
There is a general alignment with recent theoretical and empirical results for supervised learning
settings. Region density is principally proportional to the number of neurons, with an additional
small observed increase in density for deeper networks. Q2: Only a moderate increase of density
is observed during training, as measured along fixed final-policy trajectories. Therefore, the complexity of a final learned policy does not come principally from increased density on-and-around
the optimal trajectories, which is a potentially surprising result. In contrast, as measured along the
evolving current-policy trajectories, a decrease in region density is observed during training. Across
all settings, we also observe that the region-transition count, as observed during fixed time-duration
episodes, grows during training before converging to a plateau. However, the trajectory length, as
measured in the input space, also grows towards a plateau, although not at the same rate, and this
leads to variations in the mean region densities as observed along current trajectories during training.
2 Related Work
Understanding the expressivity of a neural network is fundamental to better understanding its operation. Several works study the expressivity of deep neural networks with piecewise linear activations by counting their linear regions [Arora et al., 2016, Bianchini and Scarselli, 2014]. On the theoretical side, Pascanu et al. [2013] show that in the asymptotic limit of many hidden layers, deep ReLU networks are capable of separating their input space into exponentially more linear regions compared with shallow networks, despite using the same number of computational units. Following this, Montúfar et al. [2014] also explore the complexity of functions computable by deep feed-forward neural networks with ReLU and maxout activations, and provide tighter upper and lower bounds for the maximal number of linear regions. They show that the number of linear regions is polynomial in the network width and exponential with respect to the network depth. Furthermore, Raghu et al. [2017] improve the upper bound for ReLU networks by introducing a new set of expressivity measures and showing that expressivity grows exponentially with network depth for ReLU- and tanh-activated neural networks. Serra et al. [2018] generalize these results by providing even tighter upper and lower bounds on the number of regions for ReLU networks, and show that the maximal number of regions grows exponentially with depth when the input dimension is sufficiently large.
More recent works that touch on the expressivity of depth show that the effect of depth on the expressivity of neural networks is likely far below the theoretical maximum proposed by prior literature. Hanin and Rolnick [2019b] study the importance of depth on the expressivity of neural networks
in practice. They show that the average number of linear regions for ReLU networks at initialization
is bounded by the number of neurons raised to the input dimension, and is independent of network
depth. They also empirically show that this bound remains tight during training. Similarly, Hanin
and Rolnick [2019a] find that the average distance to the boundary of the linear regions depends only
on the number of neurons and not on the network depth – both at initialization and during training
for supervised-learning tasks on ReLU networks. This strongly suggests that deeper networks do
not necessarily learn more complex functions in comparison to shallow networks. Prior to this, a
number of works have shown that the strength of deep learning may arise in part from a good match
between deep architectures and current training procedures [Mhaskar and Poggio, 2016, Mhaskar
et al., 2016, Zhang et al., 2021]. Notably, Ba and Caruana [2014] show that, once deep networks are
trained to perform a task successfully, their behavior can often be replicated by shallow networks,
suggesting that the advantages of depth may be linked to easier learning.
Another line of work studies function complexity in terms of robustness to perturbations of the input. Sokolić et al. [2017] theoretically study the input-output Jacobian, which is a measure of robustness and also relates to generalization. Similarly, Zahavy et al. [2016b] propose a sensitivity measure in terms of adversarial robustness, and provide theoretical and experimental insights on how it relates to generalization. Novak et al. [2018] also study robustness and sensitivity, using the input-output Jacobian and the number of transitions along trajectories in the input space as measures of
robustness. They show that neural networks trained for image classification tasks are more robust to
input perturbations in the vicinity of the training data manifold, due to training points lying in regions
of lower density. Several other recent works have also focused on proposing tight generalization
bounds for neural networks [Bartlett et al., 2017, Dziugaite and Roy, 2017, Neyshabur et al., 2017].
There are a number of works that touch on understanding deep neural networks by finding general principles and patterns during training. Arpit et al. [2017] empirically show that deep networks prioritize learning simple patterns of the data during training. Xu et al. [2019] find a similar phenomenon in the case of two-layer networks with sigmoid activations. Rahaman et al. [2019] study deep ReLU-activated networks through the lens of Fourier analysis and show that while deep neural networks can approximate arbitrary functions, they favour low-frequency ones and thus exhibit a bias
towards smooth functions. Samek et al. [2017] present two approaches for explaining the predictions of deep learning models in a classification task: the first computes the sensitivity of the prediction with respect to input perturbations, and the second meaningfully decomposes the decision in terms of the input variables.
While deep RL methods are widely used and extensively studied, few works focus on understanding
the policy structure in detail. Zahavy et al. [2016a] propose a visualization method to interpret the
agent’s actions by describing the Markov Decision Process as a directed graph on a t-SNE map.
They then suggest ways to interpret, debug and optimize deep neural network policies using the
proposed visualization maps. Rupprecht et al. [2019] train a generative model over the state space
of Atari games to visualize states which minimize or maximize given action probabilities. Luo et al.
[2018] adapt three visualization techniques to the domain of image-based RL using convolutional
neural networks in order to understand the decision making process of the RL agent.
3 Piecewise Linear Regions
Throughout this work, we consider RL policies based on ReLU networks. A ReLU network is a ReLU-activated feed-forward neural network, or multi-layer perceptron (MLP), which can be formulated as a function $f:\mathbb{R}^d \to \mathbb{R}^o$ defined by a neural network with $L$ hidden layers of widths $d_1, \dots, d_L$ and an $o$-dimensional output. Assuming the output to be 1-dimensional, and following the notation of Rahaman et al. [2019], we have:
$$f(x) = \left(T^{(L+1)} \circ \sigma \circ T^{(L)} \circ \cdots \circ \sigma \circ T^{(1)}\right)(x), \qquad (1)$$
where $T^{(k)}:\mathbb{R}^{d_{k-1}} \to \mathbb{R}^{d_k}$ computes the weighted sum $T^{(k)}(x) = W^{(k)}x + b^{(k)}$ for some weight matrix $W^{(k)}$ and bias vector $b^{(k)}$. Here, $\sigma(u) = \max(0,u)$ denotes the ReLU activation function acting element-wise on the vector $u = (u_1, \dots, u_n)$.
Given the ReLU network $f$ from Equation 1, and again following Rahaman et al. [2019], the piecewise linearity can be written explicitly as
$$f(x) = \sum_{r \in R} \mathbf{1}_{P_r}(x)\,\big(W_r x + b_r\big), \qquad (2)$$
where $r$ indexes the linear region $P_r$ and $\mathbf{1}_{P_r}$ is the indicator function of $P_r$. The $1 \times d$ matrix $W_r$ is given by:
$$W_r = W^{(L+1)} W^{(L)}_r \cdots W^{(1)}_r, \qquad (3)$$
where $W^{(k)}_r$ is obtained from the original weight matrix $W^{(k)}$ by setting its $j$-th column to zero whenever neuron $j$ of layer $k-1$ is inactive, for $k > 1$.
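This construction can be checked numerically. The following is a minimal NumPy sketch with a small randomly initialized network of our own choosing (the widths and seed are illustrative, not taken from the paper): it builds the region-specific matrix $W_r$ and bias $b_r$ of Equations (2) and (3) for a given input $x$ by masking the contributions of inactive neurons, and verifies that $f(x) = W_r x + b_r$.

# Minimal sketch of Eq. (1)-(3) for a toy ReLU network; sizes and seed are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
widths = [2, 8, 8, 1]                      # input dim d=2, two hidden layers of width 8, scalar output
Ws = [rng.standard_normal((widths[k + 1], widths[k])) for k in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[k + 1]) for k in range(len(widths) - 1)]

def forward(x):
    """Plain forward pass f(x): ReLU after every hidden layer, linear output layer."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(0.0, W @ h + b)
    return Ws[-1] @ h + bs[-1]

def region_affine_map(x):
    """Return (W_r, b_r) of Eq. (2)-(3) for the linear region containing x."""
    A = np.eye(len(x))                      # affine map of the current region: h(x') = A x' + c
    c = np.zeros(len(x))
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre_A = W @ A                       # pre-activation as an affine function of the input
        pre_c = W @ c + b
        active = (pre_A @ x + pre_c) >= 0   # activation pattern of this layer at x
        A = pre_A * active[:, None]         # zeroing inactive rows here is equivalent to
        c = pre_c * active                  # zeroing the matching columns of the next W (Eq. 3)
    return Ws[-1] @ A, Ws[-1] @ c + bs[-1]  # W_r is 1 x d, b_r is a scalar

x = rng.standard_normal(widths[0])
W_r, b_r = region_affine_map(x)
assert np.allclose(forward(x), W_r @ x + b_r)   # f is affine on the region containing x

The same $(W_r, b_r)$ pair reproduces $f$ for every other input that shares the activation pattern of $x$, which is what makes the region counting of the next section well defined.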
4 Counting Linear Regions in RL Policies
To answer the key questions posed in the introduction, we need a method for counting the linear regions encountered along the episodic trajectories generated during RL. For each input $s$, we encode each neuron of the policy network with a binary code of 0 if its pre-activation is negative, and with a binary code of 1 otherwise. The linear region containing the input $s$ can thus be uniquely identified by the concatenation of the binary codes of all the neurons in the network, called an activation pattern. Figure 2 illustrates the activation pattern construction for a 2D input space.
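As a concrete illustration, a minimal sketch of this encoding is given below (our own code, not the authors' implementation). It assumes the hidden-layer weight matrices and bias vectors of the policy network are available as lists, e.g. the `Ws[:-1]` and `bs[:-1]` of the previous sketch, and returns the pattern as a hashable tuple that can serve as a region identifier.

# Sketch of the binary activation-pattern encoding described above (assumed implementation).
import numpy as np

def activation_pattern(s, weights, biases):
    """Binary code per hidden neuron: 1 if its pre-activation is non-negative, else 0."""
    bits, h = [], np.asarray(s, dtype=float)
    for W, b in zip(weights, biases):       # hidden layers of the policy network
        pre = W @ h + b
        bits.extend((pre >= 0).astype(int).tolist())
        h = np.maximum(0.0, pre)            # ReLU output feeds the next layer
    return tuple(bits)                      # hashable identifier of the linear region

# Two states lie in the same linear region exactly when their patterns match, e.g.
# activation_pattern(s1, Ws[:-1], bs[:-1]) == activation_pattern(s2, Ws[:-1], bs[:-1])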
Figure 1 provides a schematic illustration of the linear regions for a 2D state space, together with an example trajectory that consists of the sequential states encountered by the policy, connected by straight-line segments. As seen in the figure, regions can be revisited, which leads to a distinction between the number of transitions between regions and the number of unique regions visited along a trajectory.
Figure 2: An overview of a ReLU-activated policy network and the binary labeling scheme of the linear regions. State $s = [x, \dot{x}]$ is within the linear region uniquely identified by the activation pattern computed by concatenating the binary states of the activations of the policy network given input $s$.
We compute both of these metrics along episodic trajectories, which we enable by maintaining a list of the regions already visited while processing a given episode trajectory. Our region-counting method also includes regions that are crossed by the straight-line segments between successive trajectory states, even though the policy itself does not explicitly encounter these regions, due to the discrete-time nature of typical RL environments.
To represent the $k$-th line segment of an episodic trajectory in the input space, we define the parameterized line segment $s(u) = (1-u)\,s_{k-1} + u\,s_k$, where $u \in [0,1]$ and $s_{k-1}, s_k \in \mathbb{R}^d$ denote the endpoints of the segment. We then calculate the exact number of linear regions over this line segment by considering the hidden layers of the policy network one at a time, from the input towards the output, and observing how each region can be split by the neurons. Starting with the first layer, we consider the neurons one by one and identify the point in the domain of $u$, if any, that induces a change in the binary labeling for that neuron. Each such point subdivides the domain of $u$ into two new regions, in one of which the activation passed to the next layer is zero. By maintaining a list of these regions, the linear functions defined over them, and whether their pre-activation vanishes in the next layer, we proceed to the neurons of the next layer. This process repeats for each of the regions resulting from the previous layer. We record the activation patterns of all of the final regions in order to track region visits across all segments of the episode trajectories. In the end, for each trajectory $\tau$, we compute the total number of region transitions, $R_T(\tau)$, as well as the number of unique visited regions, $R_U(\tau)$, where $R_T(\tau) + 1 \geq R_U(\tau)$. We further compute the trajectory length in the input space, $L(\tau)$. This allows us to compute normalized region densities according to $\rho(\tau) = R_T(\tau)/(N \cdot L(\tau))$, where $N$ is the total number of neurons in the policy network, in accordance with Hanin and Rolnick [2019a].
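The procedure described above computes the region crossings exactly, layer by layer. The sketch below is only a coarse approximation of it for illustration: it densely samples $u$ along each segment $s(u) = (1-u)s_{k-1} + u s_k$ and counts changes of the activation pattern, reusing `activation_pattern` from the earlier sketch; the sampling resolution, and counting only hidden neurons towards $N$, are our own assumptions.

# Approximate R_T, R_U, L(tau), rho(tau) by sampling along each trajectory segment.
import numpy as np

def trajectory_region_stats(states, weights, biases, samples_per_segment=200):
    """Return (transitions R_T, unique regions R_U, length L, density rho) along a trajectory."""
    n_neurons = sum(b.shape[0] for b in biases)     # N: hidden neurons of the policy network
    visited, transitions, length = set(), 0, 0.0
    prev = None
    for s_a, s_b in zip(states[:-1], states[1:]):   # consecutive states of the episode
        length += float(np.linalg.norm(s_b - s_a))
        for u in np.linspace(0.0, 1.0, samples_per_segment):
            p = activation_pattern((1.0 - u) * s_a + u * s_b, weights, biases)
            visited.add(p)
            if prev is not None and p != prev:
                transitions += 1                    # crossed a region boundary
            prev = p
    rho = transitions / (n_neurons * length) if length > 0 else float("nan")
    return transitions, len(visited), length, rho

Crossings that fall between two consecutive samples are missed by this approximation, which is why the exact layer-by-layer subdivision described above is used for the reported results.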
We use the above metrics along two types of trajectories. First, we consider fixed trajectories, sampled from the final, fully-trained policy. Second, we consider current trajectories, sampled from the current policy during training. Both of these offer informative views of the evolution of the policy. The former offers a direct picture of the linear-region density along a meaningful, fixed region of the state space, i.e., that of the final optimized policy. The latter offers a view of what the policy trajectories actually encounter during training. In order to better understand the evolution of the current trajectories, we also track their length, $L(\tau)$, given that the density is determined by the region-transition count as well as the length of the trajectory. Figure 3 shows the evolution of the linear regions and the two types of trajectories for a simple 2D toy environment and a ReLU-activated policy network of depth 2 and width 8.
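A figure of this kind can be reproduced for any 2D input space by evaluating the activation pattern on a dense grid and colouring each cell by its region identifier. The sketch below does exactly that (our own illustration, not the authors' plotting code); it reuses `activation_pattern` and the toy network `Ws`, `bs` from the earlier sketches, whose two hidden layers of width 8 match the depth-2, width-8 example, and the grid bounds and resolution are arbitrary.

# Colour a 2D input grid by linear-region identifier (illustrative visualization sketch).
import numpy as np
import matplotlib.pyplot as plt

def region_id_grid(weights, biases, xlim=(-1, 1), ylim=(-1, 1), res=300):
    """Map each grid point to an integer region ID derived from its activation pattern."""
    ids, img = {}, np.zeros((res, res), dtype=int)
    for i, y in enumerate(np.linspace(*ylim, res)):
        for j, x in enumerate(np.linspace(*xlim, res)):
            p = activation_pattern(np.array([x, y]), weights, biases)
            img[i, j] = ids.setdefault(p, len(ids))   # new pattern -> new region ID
    return img

# Usage (assumes Ws, bs from the first sketch):
# plt.imshow(region_id_grid(Ws[:-1], bs[:-1]), origin="lower", cmap="tab20")
# plt.title("Linear regions of a depth-2, width-8 ReLU network on a 2D input")
# plt.show()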
5 Experimental Results
We conduct our experiments on four continuous control tasks: the HalfCheetah-v2, Walker2d-v2, Ant-v2, and Swimmer-v2 environments from the OpenAI gym benchmark suite [Brockman et al., 2016]. We use the Stable-Baselines3 implementation of the PPO algorithm [Schulman et al., 2017] in all of our experiments throughout this work¹. We run each experiment with 5 different random seeds.
¹Our code is available at https://github.com/setarehc/deep_rl_regions.
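For reference, a minimal training sketch with Stable-Baselines3 PPO is shown below. It assumes an older gym release with a working mujoco-py installation (required for the -v2 MuJoCo environments); the network architecture, seed, and timestep budget are placeholders rather than the exact hyperparameters used in the paper.

# Minimal Stable-Baselines3 PPO training sketch; hyperparameters are placeholders.
import gym
import torch as th
from stable_baselines3 import PPO

env = gym.make("HalfCheetah-v2")                       # requires a working MuJoCo setup
policy_kwargs = dict(activation_fn=th.nn.ReLU,         # ReLU policy network, as studied here
                     net_arch=[64, 64])                # width/depth chosen for illustration
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, seed=0, verbose=1)
model.learn(total_timesteps=1_000_000)                 # timestep budget is a placeholder
model.save("ppo_halfcheetah")
# model.policy holds the trained ReLU network that the region-counting sketches can probe.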