Understanding the Evolution of Linear Regions in
Deep Reinforcement Learning
Setareh Cohan
Department of Computer Science
University of British Columbia
setarehc@cs.ubc.ca
Nam Hee Kim
Department of Computer Science
Aalto University
namhee.kim@aalto.fi
David Rolnick
School of Computer Science
McGill University
drolnick@cs.mcgill.ca
Michiel van de Panne
Department of Computer Science
University of British Columbia
van@cs.ubc.ca
Abstract
Policies produced by deep reinforcement learning are typically characterised by their learning curves, but they remain poorly understood in many other respects. ReLU-based policies result in a partitioning of the input space into piecewise linear regions. We seek to understand how observed region counts and their densities evolve during deep reinforcement learning using empirical results that span a range of continuous control tasks and policy network dimensions. Intuitively, we may expect that during training, the region density increases in the areas that are frequently visited by the policy, thereby affording fine-grained control. We use recent theoretical and empirical results for the linear regions induced by neural networks in supervised learning settings for grounding and comparison of our results. Empirically, we find that the region density increases only moderately throughout training, as measured along fixed trajectories coming from the final policy. However, the trajectories themselves also increase in length during training, and thus the region densities decrease as seen from the perspective of the current trajectory. Our findings suggest that the complexity of deep reinforcement learning policies does not principally emerge from a significant growth in the complexity of functions observed on-and-around trajectories of the policy.
1 Introduction
Deep reinforcement learning (RL) utilizes a neural network to represent the policy and trains this network to optimize an objective, typically the expected value of time-discounted future rewards. Deep RL algorithms have been successfully applied to diverse applications including robotics, challenging games, and an increasing number of real-world decision-and-control problems [François-Lavet et al., 2018]. For a given choice of task, RL algorithm, and policy network configuration, the performance is commonly characterised via the learning curves, which provide insight into the learning efficiency and the final performance. However, little has been done to understand the detailed structure of the state-to-action mappings induced by the control policies and how these evolve over time.
In this work, we aim to further understand deep feed-forward neural network policies that use rectified linear activation functions (rectified linear units, or ReLUs). ReLUs [Nair and Hinton, 2010] are among the most popular choices of activation functions due to their practical successes [Montúfar et al., 2014]. For RL, these activations induce a piecewise linear mapping from states to actions.
Figure 1: Schematic illustration of a trajectory traversing the piecewise linear regions in the policy state space. $S_0$ and $S_k$ indicate the initial and final states of the trajectory.
In this mapping, the input space, i.e., the state space, is divided into distinct linear regions, within each of which the actions are a linear function of the states. Note that the function on each region is formally affine, due to the constant-valued bias terms; for simplicity and convenience, these are more commonly described as linear regions. Figure 1 provides a schematic illustration of these regions, along with a policy trajectory.
The number of distinct regions into which the input space is divided is a natural measure of network
expressivity. Learned functions with many linear regions have the capacity to build complex and
flexible decision boundaries. Thus, the problem of counting the number of linear regions has been
extensively studied in recent literature [Montúfar et al., 2014, Raghu et al., 2017, Hanin and Rolnick,
2019a, Serra et al., 2018]. While the maximum number of regions is exponential with respect to
network depth [Montúfar et al., 2014], recent work has demonstrated that the number of regions is
instead typically proportional to the number of neurons [Hanin and Rolnick, 2019a].
For RL, we are interested in the local granularity (density) of linear regions along trajectories arising
from the policy. Fine-grained regions afford fine-grained control, and thus we may hypothesize that
region density increases in regions frequently visited by the policy, in order to afford better control.
Recent work in supervised learning of image classification is inconsistent with regard to the region density seen in the vicinity of data points, with some reporting a decrease, argued to provide better generalization and robustness to perturbation [Novak et al., 2018], and others finding no such effect [Hanin and Rolnick, 2019a]. For the RL setting, we note that counting regions visited along an episode trajectory arguably provides a meaningful and task-grounded measurement, in contrast to the line segments and ellipses passing through randomly sampled training points that have been used in the prior works mentioned above. We further note that piecewise-affine control strategies
are commonly designed into control systems, e.g., via gain scheduling. Understanding how these
regions are designed and distributed by deep RL thus helps establish bridges with these existing
methods.
To the best of our knowledge, our work is the first to investigate the structure and evolution of
linear regions of ReLU-based deep RL policies in detail. We seek to answer several basic empirical
questions:
Q1 Do findings for network expressivity, originally developed in supervised learning settings,
apply to RL policies? How are the region densities affected by the policy network configuration? Do deeper policy networks result in finer-grained regions and hence an increased
expressivity?
Q2 How do the linear regions of a policy evolve during training? Do we see a significantly
greater density of regions emerge along the areas of the state space frequently visited by
the episodic trajectories, thereby allowing for finer-grained control? Do random-action
trajectories see different densities?
The key results can be summarized as follows, for policies trained using proximal policy optimization (PPO) [Schulman et al., 2017], and evaluated on four different continuous control tasks. Q1:
There is a general alignment with recent theoretical and empirical results for supervised learning
settings. Region density is principally proportional to the number of neurons, with an additional
small observed increase in density for deeper networks. Q2: Only a moderate increase of density
is observed during training, as measured along fixed final-policy trajectories. Therefore, the complexity of a final learned policy does not come principally from increased density on-and-around
the optimal trajectories, which is a potentially surprising result. In contrast, as measured along the
evolving current-policy trajectories, a decrease in region density is observed during training. Across
all settings, we also observe that the region-transition count, as observed during fixed time-duration
episodes, grows during training before converging to a plateau. However, the trajectory length, as
measured in the input space, also grows towards a plateau, although not at the same rate, and this
leads to variations in the mean region densities as observed along current trajectories during training.
2 Related Work
Understanding the expressivity of a neural network is fundamental to better understanding its operation. Several works study the expressivity of deep neural networks with piecewise linear activations by counting their linear regions [Arora et al., 2016, Bianchini and Scarselli, 2014]. On the theoretical side, Pascanu et al. [2013] show that in the asymptotic limit of many hidden layers, deep ReLU networks are capable of separating their input space into exponentially more linear regions compared with shallow networks, despite using the same number of computational units. Following this, Montúfar et al. [2014] also explore the complexity of functions computable by deep feed-forward neural networks with ReLU and maxout activations, and provide tighter upper and lower bounds for the maximal number of linear regions. They show that the number of linear regions is polynomial in the network width and exponential with respect to the network depth. Furthermore, Raghu et al. [2017] improve the upper bound for ReLU networks by introducing a new set of expressivity measures and showing that expressivity grows exponentially with network depth for ReLU- and tanh-activated neural networks. Serra et al. [2018] generalize these results by providing even tighter upper and lower bounds on the number of regions for ReLU networks, and show that the maximal number of regions grows exponentially with depth when the input dimension is sufficiently large.
More recent works that touch on the expressivity of depth show that the effect of depth on the expressivity of neural networks is likely far below the theoretical maximum proposed by prior literature. Hanin and Rolnick [2019b] study the importance of depth on the expressivity of neural networks
in practice. They show that the average number of linear regions for ReLU networks at initialization
is bounded by the number of neurons raised to the input dimension, and is independent of network
depth. They also empirically show that this bound remains tight during training. Similarly, Hanin
and Rolnick [2019a] find that the average distance to the boundary of the linear regions depends only
on the number of neurons and not on the network depth – both at initialization and during training
for supervised-learning tasks on ReLU networks. This strongly suggests that deeper networks do
not necessarily learn more complex functions in comparison to shallow networks. Prior to this, a
number of works have shown that the strength of deep learning may arise in part from a good match
between deep architectures and current training procedures [Mhaskar and Poggio, 2016, Mhaskar
et al., 2016, Zhang et al., 2021]. Notably, Ba and Caruana [2014] show that, once deep networks are
trained to perform a task successfully, their behavior can often be replicated by shallow networks,
suggesting that the advantages of depth may be linked to easier learning.
Another line of work studies function complexity in terms of robustness to perturbations of the input. Sokolić et al. [2017] theoretically study the input-output Jacobian, which is a measure of robustness and also relates to generalization. Similarly, Zahavy et al. [2016b] propose a sensitivity measure in terms of adversarial robustness, and provide theoretical and experimental insights on how it relates to generalization. Novak et al. [2018] also study robustness and sensitivity, using the input-output Jacobian and the number of transitions along trajectories in the input space as measures of
robustness. They show that neural networks trained for image classification tasks are more robust to
input perturbations in the vicinity of the training data manifold, due to training points lying in regions
of lower density. Several other recent works have also focused on proposing tight generalization
bounds for neural networks [Bartlett et al., 2017, Dziugaite and Roy, 2017, Neyshabur et al., 2017].
There are a number of works that touch on understanding deep neural networks by finding general principles and patterns during training. Arpit et al. [2017] empirically show that deep networks prioritize learning simple patterns of the data during training. Xu et al. [2019] find a similar phenomenon in the case of two-layer networks with sigmoid activations. Rahaman et al. [2019] study deep ReLU-activated networks through the lens of Fourier analysis and show that while deep neural networks can approximate arbitrary functions, they favour low-frequency ones and thus exhibit a bias
towards smooth functions. Samek et al. [2017] present two approaches for explaining the predictions of deep learning models in a classification task: the first computes the sensitivity of the prediction with respect to input perturbations, and the second meaningfully decomposes the decision in terms of the input variables.
While deep RL methods are widely used and extensively studied, few works focus on understanding
the policy structure in detail. Zahavy et al. [2016a] propose a visualization method to interpret the
agent’s actions by describing the Markov Decision Process as a directed graph on a t-SNE map.
They then suggest ways to interpret, debug and optimize deep neural network policies using the
proposed visualization maps. Rupprecht et al. [2019] train a generative model over the state space
of Atari games to visualize states which minimize or maximize given action probabilities. Luo et al.
[2018] adapt three visualization techniques to the domain of image-based RL using convolutional
neural networks in order to understand the decision making process of the RL agent.
3 Piecewise Linear Regions
Throughout this work, we consider RL policies based on ReLU networks. A ReLU network is a ReLU-activated feed-forward neural network, or multi-layer perceptron (MLP), which can be formulated as a function $f:\mathbb{R}^d \to \mathbb{R}^o$ defined by a neural network with $L$ hidden layers of widths $d_1, \dots, d_L$ and an $o$-dimensional output. Assuming the output to be 1-dimensional, and following the notation of Rahaman et al. [2019], we have:
$$f(x) = \left(T^{(L+1)} \circ \sigma \circ T^{(L)} \circ \cdots \circ \sigma \circ T^{(1)}\right)(x), \qquad (1)$$
where $T^{(k)}:\mathbb{R}^{d_{k-1}} \to \mathbb{R}^{d_k}$ computes the weighted sum $T^{(k)}(x) = W^{(k)}x + b^{(k)}$ for some weight matrix $W^{(k)}$ and bias vector $b^{(k)}$. Here, $\sigma(u) = \max(0,u)$ denotes the ReLU activation function acting element-wise on the vector $u = (u_1, \dots, u_n)$.
Given the ReLU network $f$ from Equation 1, and again following Rahaman et al. [2019], the piecewise linearity can be written explicitly as
$$f(x) = \sum_{r \in R} \mathbf{1}_{P_r}(x)\,\big(W_r x + b_r\big), \qquad (2)$$
where $r$ indexes the linear region $P_r$ and $\mathbf{1}_{P_r}$ is the indicator function of $P_r$. The $1 \times d$ matrix $W_r$ is given by:
$$W_r = W^{(L+1)} W^{(L)}_r \cdots W^{(1)}_r, \qquad (3)$$
where $W^{(k)}_r$ is obtained from the original weight matrix $W^{(k)}$ by setting its $j$-th column to zero whenever neuron $j$ of layer $k-1$ is inactive, for $k > 1$.
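This construction can be checked numerically. The following is a minimal NumPy sketch with a small randomly initialized network of our own choosing (the widths and seed are illustrative, not taken from the paper): it builds the region-specific matrix $W_r$ and bias $b_r$ of Equations (2) and (3) for a given input $x$ by masking the contributions of inactive neurons, and verifies that $f(x) = W_r x + b_r$.

# Minimal sketch of Eq. (1)-(3) for a toy ReLU network; sizes and seed are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
widths = [2, 8, 8, 1]                      # input dim d=2, two hidden layers of width 8, scalar output
Ws = [rng.standard_normal((widths[k + 1], widths[k])) for k in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[k + 1]) for k in range(len(widths) - 1)]

def forward(x):
    """Plain forward pass f(x): ReLU after every hidden layer, linear output layer."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(0.0, W @ h + b)
    return Ws[-1] @ h + bs[-1]

def region_affine_map(x):
    """Return (W_r, b_r) of Eq. (2)-(3) for the linear region containing x."""
    A = np.eye(len(x))                      # affine map of the current region: h(x') = A x' + c
    c = np.zeros(len(x))
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre_A = W @ A                       # pre-activation as an affine function of the input
        pre_c = W @ c + b
        active = (pre_A @ x + pre_c) >= 0   # activation pattern of this layer at x
        A = pre_A * active[:, None]         # zeroing inactive rows here is equivalent to
        c = pre_c * active                  # zeroing the matching columns of the next W (Eq. 3)
    return Ws[-1] @ A, Ws[-1] @ c + bs[-1]  # W_r is 1 x d, b_r is a scalar

x = rng.standard_normal(widths[0])
W_r, b_r = region_affine_map(x)
assert np.allclose(forward(x), W_r @ x + b_r)   # f is affine on the region containing x

The same $(W_r, b_r)$ pair reproduces $f$ for every other input that shares the activation pattern of $x$, which is what makes the region counting of the next section well defined.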
4 Counting Linear Regions in RL Policies
To answer the key questions posed in the introduction, we need a method for counting the linear regions encountered along the episodic trajectories generated during RL. For each input $s$, we encode each neuron of the policy network with a binary code of 0 if its pre-activation is negative, and with a binary code of 1 otherwise. The linear region containing the input $s$ can thus be uniquely identified by the concatenation of the binary codes of all the neurons in the network, called an activation pattern. Figure 2 illustrates the activation pattern construction for a 2D input space.
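As a concrete illustration, a minimal sketch of this encoding is given below (our own code, not the authors' implementation). It assumes the hidden-layer weight matrices and bias vectors of the policy network are available as lists, e.g. the `Ws[:-1]` and `bs[:-1]` of the previous sketch, and returns the pattern as a hashable tuple that can serve as a region identifier.

# Sketch of the binary activation-pattern encoding described above (assumed implementation).
import numpy as np

def activation_pattern(s, weights, biases):
    """Binary code per hidden neuron: 1 if its pre-activation is non-negative, else 0."""
    bits, h = [], np.asarray(s, dtype=float)
    for W, b in zip(weights, biases):       # hidden layers of the policy network
        pre = W @ h + b
        bits.extend((pre >= 0).astype(int).tolist())
        h = np.maximum(0.0, pre)            # ReLU output feeds the next layer
    return tuple(bits)                      # hashable identifier of the linear region

# Two states lie in the same linear region exactly when their patterns match, e.g.
# activation_pattern(s1, Ws[:-1], bs[:-1]) == activation_pattern(s2, Ws[:-1], bs[:-1])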
Figure 1 provides a schematic illustration of the linear regions for a 2D state space, together with an example trajectory that consists of the sequential states encountered by the policy, connected by straight-line segments. As seen in the figure, regions can be revisited, which leads to a distinction between the number of transitions between regions and the number of unique regions visited along a trajectory.
Figure 2: An overview of a ReLU-activated policy network and the binary labeling scheme of the linear regions. State $s = [x, \dot{x}]$ is within the linear region uniquely identified by the activation pattern computed by concatenating the binary states of the activations of the policy network given input $s$.
We compute both of these metrics along episodic trajectories, which we enable by maintaining a list of the regions already visited while processing a given episode trajectory. Our region-counting method also includes regions that are crossed by the straight-line segments between successive trajectory states, even though the policy itself does not explicitly encounter these regions, due to the discrete-time nature of typical RL environments.
To represent the $k$-th line segment of an episodic trajectory in the input space, we define the parameterized line segment $s(u) = (1-u)\,s_{k-1} + u\,s_k$, where $u \in [0,1]$ and $s_{k-1}, s_k \in \mathbb{R}^d$ denote the endpoints of the segment. We then calculate the exact number of linear regions over this line segment by considering the hidden layers of the policy network one at a time, from the input towards the output, and observing how each region can be split by the neurons. Starting with the first layer, we consider the neurons one by one and identify the point in the domain of $u$, if any, that induces a change in the binary labeling for that neuron. Each such point subdivides the domain of $u$ into two new regions, in one of which the activation passed to the next layer is zero. By maintaining a list of these regions, the linear functions defined over them, and whether their pre-activation vanishes in the next layer, we proceed to the neurons of the next layer. This process repeats for each of the regions resulting from the previous layer. We record the activation patterns of all of the final regions in order to track region visits across all segments of the episode trajectories. In the end, for each trajectory $\tau$, we compute the total number of region transitions, $R_T(\tau)$, as well as the number of unique visited regions, $R_U(\tau)$, where $R_T(\tau) + 1 \geq R_U(\tau)$. We further compute the trajectory length in the input space, $L(\tau)$. This allows us to compute normalized region densities according to $\rho(\tau) = R_T(\tau)/(N \cdot L(\tau))$, where $N$ is the total number of neurons in the policy network, in accordance with Hanin and Rolnick [2019a].
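The procedure described above computes the region crossings exactly, layer by layer. The sketch below is only a coarse approximation of it for illustration: it densely samples $u$ along each segment $s(u) = (1-u)s_{k-1} + u s_k$ and counts changes of the activation pattern, reusing `activation_pattern` from the earlier sketch; the sampling resolution, and counting only hidden neurons towards $N$, are our own assumptions.

# Approximate R_T, R_U, L(tau), rho(tau) by sampling along each trajectory segment.
import numpy as np

def trajectory_region_stats(states, weights, biases, samples_per_segment=200):
    """Return (transitions R_T, unique regions R_U, length L, density rho) along a trajectory."""
    n_neurons = sum(b.shape[0] for b in biases)     # N: hidden neurons of the policy network
    visited, transitions, length = set(), 0, 0.0
    prev = None
    for s_a, s_b in zip(states[:-1], states[1:]):   # consecutive states of the episode
        length += float(np.linalg.norm(s_b - s_a))
        for u in np.linspace(0.0, 1.0, samples_per_segment):
            p = activation_pattern((1.0 - u) * s_a + u * s_b, weights, biases)
            visited.add(p)
            if prev is not None and p != prev:
                transitions += 1                    # crossed a region boundary
            prev = p
    rho = transitions / (n_neurons * length) if length > 0 else float("nan")
    return transitions, len(visited), length, rho

Crossings that fall between two consecutive samples are missed by this approximation, which is why the exact layer-by-layer subdivision described above is used for the reported results.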
We use the above metrics along two types of trajectories. First, we consider fixed trajectories, sampled from the final, fully-trained policy. Second, we consider current trajectories, sampled from the current policy during training. Both of these offer informative views of the evolution of the policy. The former offers a direct picture of the linear-region density along a meaningful, fixed region of the state space, i.e., that of the final optimized policy. The latter offers a view of what the policy trajectories actually encounter during training. In order to better understand the evolution of the current trajectories, we also track their length, $L(\tau)$, given that the density is determined by the region-transition count as well as the length of the trajectory. Figure 3 shows the evolution of the linear regions and the two types of trajectories for a simple 2D toy environment and a ReLU-activated policy network of depth 2 and width 8.
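A figure of this kind can be reproduced for any 2D input space by evaluating the activation pattern on a dense grid and colouring each cell by its region identifier. The sketch below does exactly that (our own illustration, not the authors' plotting code); it reuses `activation_pattern` and the toy network `Ws`, `bs` from the earlier sketches, whose two hidden layers of width 8 match the depth-2, width-8 example, and the grid bounds and resolution are arbitrary.

# Colour a 2D input grid by linear-region identifier (illustrative visualization sketch).
import numpy as np
import matplotlib.pyplot as plt

def region_id_grid(weights, biases, xlim=(-1, 1), ylim=(-1, 1), res=300):
    """Map each grid point to an integer region ID derived from its activation pattern."""
    ids, img = {}, np.zeros((res, res), dtype=int)
    for i, y in enumerate(np.linspace(*ylim, res)):
        for j, x in enumerate(np.linspace(*xlim, res)):
            p = activation_pattern(np.array([x, y]), weights, biases)
            img[i, j] = ids.setdefault(p, len(ids))   # new pattern -> new region ID
    return img

# Usage (assumes Ws, bs from the first sketch):
# plt.imshow(region_id_grid(Ws[:-1], bs[:-1]), origin="lower", cmap="tab20")
# plt.title("Linear regions of a depth-2, width-8 ReLU network on a 2D input")
# plt.show()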
5 Experimental Results
We conduct our experiments on four continuous control tasks: the HalfCheetah-v2, Walker2d-v2, Ant-v2, and Swimmer-v2 environments from the OpenAI gym benchmark suite [Brockman et al., 2016]. We use the Stable-Baselines3 implementation of the PPO algorithm [Schulman et al., 2017] in all of our experiments throughout this work¹. We run each experiment with 5 different random seeds.
¹Our code is available at https://github.com/setarehc/deep_rl_regions.
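For reference, a minimal training sketch with Stable-Baselines3 PPO is shown below. It assumes an older gym release with a working mujoco-py installation (required for the -v2 MuJoCo environments); the network architecture, seed, and timestep budget are placeholders rather than the exact hyperparameters used in the paper.

# Minimal Stable-Baselines3 PPO training sketch; hyperparameters are placeholders.
import gym
import torch as th
from stable_baselines3 import PPO

env = gym.make("HalfCheetah-v2")                       # requires a working MuJoCo setup
policy_kwargs = dict(activation_fn=th.nn.ReLU,         # ReLU policy network, as studied here
                     net_arch=[64, 64])                # width/depth chosen for illustration
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, seed=0, verbose=1)
model.learn(total_timesteps=1_000_000)                 # timestep budget is a placeholder
model.save("ppo_halfcheetah")
# model.policy holds the trained ReLU network that the region-counting sketches can probe.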