On Scrambling Phenomena
for Randomly Initialized Recurrent Networks
Vaggos Chatziafratis
Department of Computer Science
University of California, Santa Cruz
vaggos@ucsc.edu
Ioannis Panageas
University of California, Irvine
ipanagea@ics.uci.edu
Clayton Sanford
Columbia University
clayton@cs.columbia.edu
Stelios Andrew Stavroulakis
University of California, Irvine
sstavrou@uci.edu
Abstract
Recurrent Neural Networks (RNNs) frequently exhibit complicated dynamics,
and their sensitivity to the initialization process often renders them notoriously
hard to train. Recent works have shed light on such phenomena by analyzing when
exploding or vanishing gradients may occur, either of which is detrimental to
training dynamics. In this paper, we point to a formal connection between RNNs
and chaotic dynamical systems and prove a qualitatively stronger phenomenon
about RNNs than what exploding gradients seem to suggest. Our main result proves
that under standard initialization (e.g., He, Xavier etc.), RNNs will exhibit Li-Yorke
chaos with constant probability independent of the network’s width. This explains
the experimentally observed phenomenon of scrambling, under which trajectories
of nearby points may appear to be arbitrarily close during some timesteps, yet will
be far away in future timesteps. In stark contrast to their feedforward counterparts,
we show that chaotic behavior in RNNs is preserved under small perturbations
and that their expressive power remains exponential in the number of feedback
iterations. Our technical arguments rely on viewing RNNs as random walks under
non-linear activations, and studying the existence of certain types of higher-order
fixed points called periodic points that lead to phase transitions from order to chaos.
1 Introduction
In standard feedforward neural networks (FNNs), computation is performed “from left to right”
propagating the input through the hidden units to the output. In contrast, recurrent neural networks
(RNNs) form a feedback loop, transferring information from their output back to their input (e.g.,
LSTMs (Hochreiter and Schmidhuber, 1997), GRUs (Cho et al., 2014)). This feedback loop allows
RNNs to share parameters across time because the weights and biases of each iteration are identical.
As a result, RNNs can capture long-range temporal correlations in the input. For these reasons, they
have been very successful in applications in sequence learning domains, such as speech recognition,
natural language processing, video understanding, and time-series prediction (Bahdanau et al., 2014;
Cho et al., 2014; Chung et al., 2014).
Unfortunately, their unique ability to share parameters across time comes at a cost: RNNs are sensitive
to their initialization processes, which makes them extremely difficult to train and causes complicated
evaluation and training dynamics (Le et al., 2015; Laurent and von Brecht, 2016; Miller and Hardt, 2018).
Roughly speaking, because the hidden units of an RNN are applied to the input over and over again,
the final output can quickly explode or vanish, depending on whether its Jacobian's spectral norm is
greater or smaller than one, respectively. Similar issues that hinder the learning process also arise
during backpropagation (Allen-Zhu et al., 2018).

* Author order determined by the output of a randomly initialized recurrent network (operating in the chaotic regime).
Besides the hurdles with their implementation, recurrent architectures pose significant theoretical
challenges. Several basic questions, such as how to properly initialize RNNs, what their expressive
power (also known as representation capability) is, and why they converge or diverge, require further
investigation. In this paper, we take a closer look at randomly initialized RNNs:
Can we get a better understanding of the behavior of RNNs at initialization using dynamical systems?
We draw on the extensive dynamical systems literature—which has long asked similar questions
about the topological behavior of iterated compositions of functions—to study the properties of RNNs
with standard random initializations. We prove that under common initialization strategies, e.g., He
or Xavier (He et al., 2015, 2016), RNNs can produce dynamics that are characterized by chaos, even
in innocuous settings and even in the absence of external input. Most importantly, chaos arises with
constant probability which is independent of the network’s width. Our theoretical findings explain
empirically observed behavior of RNNs from prior works, and are also validated in our experiments.¹
More broadly, our work builds on recent works that aim at understanding neural networks through the
lens of dynamical systems; for example, Chatziafratis et al. (2020b,a) use Sharkovsky’s theorem from
discrete dynamical systems to provide depth-width tradeoffs for the representation capabilities of
neural networks, and Sanford and Chatziafratis (2022) further give more fine-grained lower bounds
based on the notion of “chaotic itineraries”.
1.1 Two Motivating Behaviors of RNNs
Before stating our main result, we illustrate two concrete behaviors of RNNs that inspired our work.
The first example demonstrates that randomly initialized RNNs can lead to what is perhaps most
commonly perceived as “chaos”, while the second example demonstrates a qualitatively different
behavior of RNNs compared to FNNs. Our main result unifies the conclusions drawn from these two.
Scrambling Trajectories at Initialization
Prior works have empirically demonstrated that RNNs
can behave chaotically when their weights and biases are chosen according to a certain scheme.
For example, Laurent and von Brecht (2016) consider a simple 4-dimensional RNN with specific
parameters in the absence of input data. They plot the trajectories of two nearby points $x, y$ with
$\|x - y\| \le 10^{-7}$ as they are propagated through many iterations of the RNN. They observe that the
long-term behavior (e.g., after 200 iterations) of the trajectories is highly sensitive to the initial states,
because distances may become small, then large again and so on. We ask:
Are RNNs (provably) chaotic even under standard heuristics for random initialization?
We answer this question in the affirmative both experimentally and theoretically. Answering it is
useful for multiple reasons. First, it informs us about the behavior of most RNNs, since we start
with a random setting of the parameters. Second, proving that a system is chaotic is qualitatively
much stronger than simply knowing that its gradients explode; this will become
evident below, where we describe the phenomenon of scrambling from dynamical systems. Finally,
understanding why and how often an RNN is chaotic can lead to even better methods for initialization.
To begin, we empirically verify the above statement by examining randomly initialized RNNs.
Figure 1 demonstrates that trajectories of different points may be close together during some timesteps
and far apart during future timesteps, or vice versa. This phenomenon, which will be rigorously
established in later sections, is called scrambling (Li and Yorke, 1975) and emerges as a direct
consequence of the existence of higher-order fixed points (called periodic points) of the continuous
map defined by the random RNN.
¹ Our code is made publicly available here: https://github.com/steliostavroulakis/Chaos_RNNs/blob/main/Depth_2_RNNs_and_Chaos_Period_3_Probability_RNNs.ipynb
Figure 1: Points $x, y$ with initial distance $\|x - y\| \le 10^{-7}$ and their subsequent distances
$\|f^t(x) - f^t(y)\|$ across $t = 150$ iterations of a randomly initialized RNN. The idea behind
scrambling is that even though the trajectories get arbitrarily close, they also separate later, and vice versa.
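For intuition, here is a minimal Python sketch of a Figure-1-style experiment. The exact width, seed, and architecture behind the figure are not given in this excerpt, so the one-dimensional clipped ReLU map below (in the spirit of the model defined in Section 2) and all of its parameters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "a randomly initialized RNN": the 1-D map
#   f(x) = clip( sum_i a_i * ReLU(x - b_i) ),   x in [0, 1],
# with biases b_i ~ Unif([0, 1]) and weights a_i ~ N(0, 2/k) (He-style variance).
k = 64
b = rng.uniform(0.0, 1.0, size=k)
a = rng.normal(0.0, np.sqrt(2.0 / k), size=k)

def f(x):
    return float(np.clip(np.sum(a * np.maximum(x - b, 0.0)), 0.0, 1.0))

# Two nearby starting points, ||x - y|| <= 1e-7, as in Figure 1.
x, y = 0.37, 0.37 + 1e-7
gaps = []
for t in range(150):
    x, y = f(x), f(y)
    gaps.append(abs(x - y))

# Scrambling-like behavior: the gap repeatedly shrinks and grows again.
print("min gap over 150 iterations:", min(gaps))
print("max gap over 150 iterations:", max(gaps))
```

Depending on the random draw, the resulting map may or may not fall in the chaotic regime, which is exactly the point of the constant-probability statements that follow: rerunning with different seeds shows the gap repeatedly collapsing and reopening for a non-negligible fraction of draws.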
Persistence of Chaos in RNNs vs FNNs
Our second example is related to the expressive power of
neural networks. A standard measure used to capture their representation capabilities is the number
of linear regions formed by the non-linear activations (Montufar et al., 2014); the higher this number,
the more expressive the network is. Counting the maximum possible number of linear regions across
different architectures has also been leveraged in order to obtain depth vs. width tradeoffs in many
works (Telgarsky, 2015; Eldan and Shamir, 2016; Chatziafratis et al., 2020b,a).
Here, we are inspired by a remarkable observation of Hanin and Rolnick (2019a), who showed
that real-valued FNNs from $\mathbb{R}$ to $\mathbb{R}$ may lose their expressive power if their weights
are randomly perturbed or those weights are initialized according to standard methods. Roughly speaking, they
show that in such networks the number of linear regions grows only linearly in the total number of
neurons. These results contrast with the aforementioned analyses of the theoretical maximum number
of regions, which is exponential in depth. We ask the analogous question for RNNs instead of FNNs:
Will randomly initialized or perturbed RNNs lose their high expressivity like FNNs did?
We establish a contrast between RNNs and FNNs by showing that the number of linear regions in random
RNNs remains exponential in the number of their feedback iterations. Once again, the fundamental
difference lies in the fact that RNNs share parameters across time. Indeed, the analyses of many prior
works (Hanin and Rolnick, 2019a,b; Hanin et al., 2021) crucially rely on the “fresh” randomness
injected at every layer, e.g., that the weights/biases of each neuron are initialized independently of
each other; obviously, this is no longer true in RNNs where the same units are repeatedly used across
different iterations. This shared randomness raises new technical challenges for bounding the number
of linear regions, but we manage to indirectly bound them by studying fixed points of random RNNs.
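As a rough illustration of why iterating a shared-parameter map can blow up the count of linear regions, here is a small Python sketch (our own construction, not the paper's proof technique) that numerically estimates the number of linear pieces of $f^t$ for a random one-dimensional ReLU map $f$ by counting slope changes on a fine grid; the specific map, grid size, and tolerance are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random 1-D piecewise-linear map f(x) = clip(sum_i a_i * ReLU(x - b_i)) on [0, 1].
k = 16
b = rng.uniform(0.0, 1.0, size=k)
a = rng.normal(0.0, np.sqrt(2.0 / k), size=k)

def f(x):
    # Vectorized over a grid of inputs x of shape (n,).
    return np.clip(np.maximum(x[:, None] - b[None, :], 0.0) @ a, 0.0, 1.0)

def count_pieces(t, n_grid=200_000, tol=1e-6):
    """Estimate the number of linear pieces of f composed with itself t times."""
    xs = np.linspace(0.0, 1.0, n_grid)
    ys = xs.copy()
    for _ in range(t):
        ys = f(ys)
    slopes = np.diff(ys) / np.diff(xs)
    # A new linear piece starts wherever the finite-difference slope changes.
    # Crude estimate: it undercounts once pieces become finer than the grid spacing.
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > tol))

for t in range(1, 6):
    print(f"iterations t={t}: ~{count_pieces(t)} linear pieces")
```

For draws of $f$ that land in the chaotic regime the estimate tends to grow rapidly with $t$, in contrast with the roughly linear-in-neurons counts that Hanin and Rolnick (2019a) establish for randomly perturbed FNNs.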
2 Our Main Results
Our main contribution is to prove that randomly initialized RNNs can exhibit Li-Yorke chaos (Li and
Yorke, 1975) (see definitions below), and to quantify when and how often this type of chaos appears
as we vary the variance of the weights chosen by the random initialization. To do so, we use discrete
dynamical systems which naturally capture the behavior of RNNs: simply start with some shallow
NN implementing a continuous map $f$, and after $t$ feedback iterations of the RNN, its output will be
exactly $f^t$ ($f$ composed with itself $t$ times). We begin with some basic definitions:
Definition 2.1 (Scrambled Set). Let $(X, d)$ be a compact metric space and let $f : X \to X$ be a
continuous map. Two points $x, y \in X$ are called proximal if
$$\liminf_{n \to \infty} d(f^n(x), f^n(y)) = 0,$$
and are called asymptotic if
$$\limsup_{n \to \infty} d(f^n(x), f^n(y)) = 0.$$
A set $Y \subseteq X$ is called scrambled if for all $x, y \in Y$ with $x \neq y$, the points $x, y$ are proximal, but not asymptotic.

Definition 2.2 (Li-Yorke Chaos). The dynamical system $(X, f)$ is Li-Yorke chaotic if there is an
uncountable scrambled set $Y \subseteq X$.
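Definitions 2.1 and 2.2 involve limits over infinitely many iterates; as a purely illustrative, finite-horizon proxy (our own heuristic, not part of the paper's formal development), one can track the tail-minimum and tail-maximum of the distance sequence of a candidate pair: a tail-minimum near zero together with a tail-maximum bounded away from zero is numerical evidence of a scrambled pair.

```python
import numpy as np

def scrambled_pair_evidence(f, x, y, n_iters=2000, burn_in=500):
    """Finite-horizon proxy for 'proximal but not asymptotic' (Definition 2.1).

    Returns (tail_min, tail_max) of d(f^n(x), f^n(y)) over the last
    n_iters - burn_in iterations; a tiny tail_min together with a large
    tail_max is (heuristic) evidence that (x, y) behaves like a scrambled pair.
    """
    dists = []
    for n in range(n_iters):
        x, y = f(x), f(y)
        if n >= burn_in:
            dists.append(abs(x - y))
    return min(dists), max(dists)

# Example with a map known to be Li-Yorke chaotic on [0, 1]: the logistic map at r = 4.
logistic = lambda x: 4.0 * x * (1.0 - x)
print(scrambled_pair_evidence(logistic, 0.2, 0.2 + 1e-7))
```

For an asymptotic (non-scrambled) pair, both returned numbers would instead shrink toward zero.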
Figure 2: Left: Recreation of a figure from Hanin and Rolnick (2019a), where a small Gaussian
perturbation ($\mathcal{N}(0, 0.1)$) is added independently to every weight/bias of each layer of a simple FNN.
As a result, after adding noise, the high expressivity (i.e., the number of linear regions) breaks down;
having “fresh” noise in each layer was crucial. Right: Here we depict the same network as before but
viewed as an RNN, and we add noise in the same way. Perhaps surprisingly, RNNs exhibit different behavior
than FNNs: high expressivity is preserved even after the noise. As we show, due to the shared randomness
across iterations, small perturbations do not “break” expressivity.
Li-Yorke chaos leads both to scrambling phenomena (Figure 1) and to high expressivity (Right of
Figure 2). Interestingly, this is not true for FNNs (Left of Figure 2), where the fresh randomness
at each neuron leads to concentration of the Jacobian’s norm around 1, which is sufficient to avoid
chaos. In other words, these definitions capture exactly the fact that trajectories get arbitrarily close,
but also move apart infinitely often (as in Fig. 1). Intuitively, when scrambling occurs in RNNs, their
number of linear regions will be large and their input-output Jacobian will have spectral norm larger than one.
Definition 2.3 (Simple RNN Model). For $k, \sigma > 0$, let $\mathrm{RNN}(k, \sigma^2)$ be a family of recurrent neural
networks with the following properties:

• Input: The input to the network is 1-dimensional, i.e., a single number $x \in [0, 1]$.

• Hidden Layer: There is only one hidden layer of width at least $k$, with $\mathrm{ReLU}(x) = \max(x, 0)$ activation neurons, each of which has its own bias term $b_i \sim \mathrm{Unif}([0, 1])$.

• Output: The output is a real number in $[0, 1]$ which takes the functional form $f_k(x) = \mathrm{clip}\big(\sum_{i=1}^{k} a_i \, \mathrm{ReLU}(x - b_i)\big)$, where the weights are i.i.d. Gaussians $a_i \sim \mathcal{N}(0, \frac{\sigma^2}{k})$, and $\mathrm{clip}(\cdot)$ ensures that $f_k(x) \in [0, 1]$, similarly to the input (i.e., it “clips” the output so that it remains in $[0, 1]$, as it is an RNN).

• Feedback Loop: $f_k(x)$ becomes the new input number, so after $t$ iterations the output is $(f_k \circ f_k \circ \ldots \circ f_k)(x)$, i.e., $t$ compositions of $f_k$ with itself.
As we show, even this innocuous class of RNNs (see Sec. 3 for the general model) leads to scrambling.
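For concreteness, here is a short Python sketch of a draw from $\mathrm{RNN}(k, \sigma^2)$ as in Definition 2.3 and of its feedback iteration; the function names and the use of NumPy are our own, and $\mathrm{clip}(\cdot)$ is implemented as truncation to $[0, 1]$, which is one natural reading of the definition.

```python
import numpy as np

def sample_rnn(k, sigma2, rng):
    """Draw f_k from RNN(k, sigma^2): f_k(x) = clip(sum_i a_i * ReLU(x - b_i))."""
    b = rng.uniform(0.0, 1.0, size=k)                  # biases b_i ~ Unif([0, 1])
    a = rng.normal(0.0, np.sqrt(sigma2 / k), size=k)   # weights a_i ~ N(0, sigma^2 / k)
    def f(x):
        return float(np.clip(np.sum(a * np.maximum(x - b, 0.0)), 0.0, 1.0))
    return f

def iterate(f, x, t):
    """Apply the feedback loop: evaluate f composed with itself t times at x."""
    for _ in range(t):
        x = f(x)
    return x

rng = np.random.default_rng(0)
f_k = sample_rnn(k=128, sigma2=2.0, rng=rng)   # sigma^2 = 2 gives He-style variance 2/k
print([round(iterate(f_k, 0.5, t), 4) for t in range(1, 11)])
```

Keeping the same weights and biases across all $t$ iterations is exactly what distinguishes this family from a depth-$t$ feedforward network with fresh randomness in every layer.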
Theorem 2.4 (Li-Yorke Chaos at Initialization). Consider $f_k \in \mathrm{RNN}(k, \sigma^2)$ initialized according
to the He normal initialization (set $\sigma^2 = 2$, so the weight variance is $2/k$). Then, there exists some
constant $\delta > 0$ (independent of the width) and a width $k_{\mathrm{He}} > 1$, such that for sufficiently large
$k > k_{\mathrm{He}}$, $f_k$ is Li-Yorke chaotic with probability at least $\delta$.
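A minimal numerical probe of Theorem 2.4 (our own heuristic, not the paper's proof technique): by the classical Li-Yorke criterion, a continuous interval map is Li-Yorke chaotic whenever some point $a$ satisfies $f^3(a) \le a < f(a) < f^2(a)$ (or the mirrored inequalities), so one can estimate a lower bound on the chaos probability at He initialization by sampling random draws of $f_k$ and searching a grid for such a witness; the grid size, width, and number of trials below are arbitrary choices.

```python
import numpy as np

def sample_rnn(k, sigma2, rng):
    """Draw f_k from RNN(k, sigma^2); f_k acts on a whole grid of inputs at once."""
    b = rng.uniform(0.0, 1.0, size=k)
    a = rng.normal(0.0, np.sqrt(sigma2 / k), size=k)
    return lambda x: np.clip(np.maximum(x[:, None] - b, 0.0) @ a, 0.0, 1.0)

def has_li_yorke_witness(f, n_grid=4001):
    """Search a grid for a point a with f^3(a) <= a < f(a) < f^2(a) (or the mirror image).

    By Li and Yorke (1975), such a point certifies Li-Yorke chaos of the interval map;
    failing to find one on a finite grid proves nothing, so this only yields a lower bound.
    """
    a0 = np.linspace(0.0, 1.0, n_grid)
    a1 = f(a0)   # f(a)
    a2 = f(a1)   # f^2(a)
    a3 = f(a2)   # f^3(a)
    fwd = (a3 <= a0) & (a0 < a1) & (a1 < a2)
    bwd = (a3 >= a0) & (a0 > a1) & (a1 > a2)
    return bool(np.any(fwd | bwd))

rng = np.random.default_rng(0)
k, trials = 256, 200
hits = sum(has_li_yorke_witness(sample_rnn(k, 2.0, rng)) for _ in range(trials))
print(f"fraction of He-initialized draws with a Li-Yorke witness: {hits / trials:.2f}")
```

With $\sigma^2 = 2$ this matches the He normal setting of Theorem 2.4; since the check is a sufficient condition evaluated on a finite grid, the printed fraction should be read as a heuristic lower estimate of the constant $\delta$, not an exact probability.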
This answers our two questions posed earlier: RNNs may remain chaotic under initialization heuristics,
and maintain their high expressivity, because Li-Yorke chaos implies an exponential number of linear
regions (Theorem 1.5 in Chatziafratis et al. (2020b)). Next, we focus on threshold phenomena:
Theorem 2.5 (Order-to-Chaos Transition). For $f_k \in \mathrm{RNN}(k, \sigma^2)$, we get the following 3 regimes
as we vary the variance of the weights $a_i \sim \mathcal{N}(0, \frac{\sigma^2}{k})$:

• (Low variance, order) Let $a_i \sim \mathcal{N}(0, \frac{1}{4k \log k})$. Then, the probability that $f_k$ is Li-Yorke chaotic is at most $\frac{1}{k}$.