On Scrambling Phenomena
for Randomly Initialized Recurrent Networks
Vaggos Chatziafratis
Department of Computer Science
University of California, Santa Cruz
vaggos@ucsc.edu
Ioannis Panageas
University of California, Irvine
ipanagea@ics.uci.edu
Clayton Sanford
Columbia University
clayton@cs.columbia.edu
Stelios Andrew Stavroulakis
University of California, Irvine
sstavrou@uci.edu
Abstract
Recurrent Neural Networks (RNNs) frequently exhibit complicated dynamics,
and their sensitivity to the initialization process often renders them notoriously
hard to train. Recent works have shed light on such phenomena by analyzing when
exploding or vanishing gradients may occur, either of which is detrimental to
training dynamics. In this paper, we point to a formal connection between RNNs
and chaotic dynamical systems and prove a qualitatively stronger phenomenon
about RNNs than what exploding gradients seem to suggest. Our main result proves
that under standard initialization (e.g., He, Xavier etc.), RNNs will exhibit Li-Yorke
chaos with constant probability independent of the network’s width. This explains
the experimentally observed phenomenon of scrambling, under which trajectories
of nearby points may appear to be arbitrarily close during some timesteps, yet will
be far away in future timesteps. In stark contrast to their feedforward counterparts,
we show that chaotic behavior in RNNs is preserved under small perturbations
and that their expressive power remains exponential in the number of feedback
iterations. Our technical arguments rely on viewing RNNs as random walks under
non-linear activations, and studying the existence of certain types of higher-order
fixed points called periodic points that lead to phase transitions from order to chaos.
1 Introduction
In standard feedforward neural networks (FNNs), computation is performed “from left to right”
propagating the input through the hidden units to the output. In contrast, recurrent neural networks
(RNNs) form a feedback loop, transferring information from their output back to their input (e.g.,
LSTMs (Hochreiter and Schmidhuber, 1997), GRUs (Cho et al., 2014)). This feedback loop allows
RNNs to share parameters across time because the weights and biases of each iteration are identical.
As a result, RNNs can capture long-range temporal correlations in the input. For these reasons, they
have been very successful in applications in sequence learning domains, such as speech recognition,
natural language processing, video understanding, and time-series prediction (Bahdanau et al., 2014;
Cho et al., 2014; Chung et al., 2014).
Unfortunately, their unique ability to share parameters across time comes at a cost: RNNs are sensitive
to their initialization processes, which makes them extremely difficult to train and causes complicated
evaluation and training dynamics (Le et al., 2015; Laurent and von Brecht, 2016; Miller and Hardt, 2018).
Roughly speaking, because the hidden units of an RNN are applied to the input over and over again,
the final output can quickly explode or vanish, depending on whether its Jacobian's spectral norm is
greater or smaller than one, respectively. Similar issues that hinder the learning process also arise
during backpropagation (Allen-Zhu et al., 2018).

* Author order determined by the output of a randomly initialized recurrent network (operating in the chaotic regime).
Besides the hurdles with their implementation, recurrent architectures pose significant theoretical
challenges. Several basic questions, such as how to properly initialize RNNs, what their expressive
power (also known as representation capability) is, and why they converge or diverge, require further
investigation. In this paper, we take a closer look at randomly initialized RNNs:
Can we get a better understanding of the behavior of RNNs at initialization using dynamical systems?
We draw on the extensive dynamical systems literature—which has long asked similar questions
about the topological behavior of iterated compositions of functions—to study the properties of RNNs
with standard random initializations. We prove that under common initialization strategies, e.g., He
or Xavier (He et al., 2015, 2016), RNNs can produce dynamics that are characterized by chaos, even
in innocuous settings and even in the absence of external input. Most importantly, chaos arises with
constant probability which is independent of the network’s width. Our theoretical findings explain
empirically observed behavior of RNNs from prior works, and are also validated in our experiments.¹
More broadly, our work builds on recent works that aim at understanding neural networks through the
lens of dynamical systems; for example, Chatziafratis et al. (2020b,a) use Sharkovsky’s theorem from
discrete dynamical systems to provide depth-width tradeoffs for the representation capabilities of
neural networks, and Sanford and Chatziafratis (2022) further give more fine-grained lower bounds
based on the notion of “chaotic itineraries”.
1.1 Two Motivating Behaviors of RNNs
Before stating our main result, we illustrate two concrete behaviors of RNNs that inspired our work.
The first example demonstrates that randomly initialized RNNs can lead to what is perhaps most
commonly perceived as “chaos”, while the second example demonstrates a qualitatively different
behavior of RNNs compared to FNNs. Our main result unifies the conclusions drawn from these two.
Scrambling Trajectories at Initialization
Prior works have empirically demonstrated that RNNs
can behave chaotically when their weights and biases are chosen according to a certain scheme.
For example, Laurent and von Brecht (2016) consider a simple 4-dimensional RNN with specific
parameters in the absence of input data. They plot the trajectories of two nearby points $x, y$ with
$\|x - y\| \le 10^{-7}$ as they are propagated through many iterations of the RNN. They observe that the
long-term behavior (e.g., after 200 iterations) of the trajectories is highly sensitive to the initial states,
because distances may become small, then large again and so on. We ask:
Are RNNs (provably) chaotic even under standard heuristics for random initialization?
We answer this question in the affirmative both experimentally and theoretically. Answering it is
useful for multiple reasons. First, it informs us about the behavior of most RNNs, since we start
with a random setting of the parameters. Second, proving that a system is chaotic is qualitatively
much stronger than simply knowing that its gradients explode; this will become
evident below, where we describe the phenomenon of scrambling from dynamical systems. Finally,
understanding why and how often an RNN is chaotic can lead to even better methods for initialization.
To begin, we empirically verify the above statement by examining randomly initialized RNNs.
Figure 1 demonstrates that trajectories of different points may be close together during some timesteps
and far apart during future timesteps, or vice versa. This phenomenon, which will be rigorously
established in later sections, is called scrambling (Li and Yorke, 1975) and emerges as a direct
consequence of the existence of higher-order fixed points (called periodic points) of the continuous
map defined by the random RNN.
¹ Our code is made publicly available here: https://github.com/steliostavroulakis/Chaos_RNNs/blob/main/Depth_2_RNNs_and_Chaos_Period_3_Probability_RNNs.ipynb
Figure 1: Points $x, y$ with initial distance $\|x - y\| \le 10^{-7}$ and their subsequent distances
$\|f^t(x) - f^t(y)\|$ across $t = 150$ iterations of a randomly initialized RNN. The idea behind
scrambling is that even though the trajectories get arbitrarily close, they also separate later, and vice versa.
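For intuition, here is a minimal Python sketch of a Figure-1-style experiment. The exact width, seed, and architecture behind the figure are not given in this excerpt, so the one-dimensional clipped ReLU map below (in the spirit of the model defined in Section 2) and all of its parameters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "a randomly initialized RNN": the 1-D map
#   f(x) = clip( sum_i a_i * ReLU(x - b_i) ),   x in [0, 1],
# with biases b_i ~ Unif([0, 1]) and weights a_i ~ N(0, 2/k) (He-style variance).
k = 64
b = rng.uniform(0.0, 1.0, size=k)
a = rng.normal(0.0, np.sqrt(2.0 / k), size=k)

def f(x):
    return float(np.clip(np.sum(a * np.maximum(x - b, 0.0)), 0.0, 1.0))

# Two nearby starting points, ||x - y|| <= 1e-7, as in Figure 1.
x, y = 0.37, 0.37 + 1e-7
gaps = []
for t in range(150):
    x, y = f(x), f(y)
    gaps.append(abs(x - y))

# Scrambling-like behavior: the gap repeatedly shrinks and grows again.
print("min gap over 150 iterations:", min(gaps))
print("max gap over 150 iterations:", max(gaps))
```

Depending on the random draw, the resulting map may or may not fall in the chaotic regime, which is exactly the point of the constant-probability statements that follow: rerunning with different seeds shows the gap repeatedly collapsing and reopening for a non-negligible fraction of draws.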
Persistence of Chaos in RNNs vs FNNs
Our second example is related to the expressive power of
neural networks. A standard measure used to capture their representation capabilities is the number
of linear regions formed by the non-linear activations (Montufar et al., 2014); the higher this number,
the more expressive the network is. Counting the maximum possible number of linear regions across
different architectures has also been leveraged in order to obtain depth vs. width tradeoffs in many
works (Telgarsky, 2015; Eldan and Shamir, 2016; Chatziafratis et al., 2020b,a).
Here, we are inspired by a remarkable observation of Hanin and Rolnick (2019a), who showed
that real-valued FNNs from $\mathbb{R}$ to $\mathbb{R}$ may lose their expressive power if their weights
are randomly perturbed or those weights are initialized according to standard methods. Roughly speaking, they
show that in such networks the number of linear regions grows only linearly in the total number of
neurons. These results contrast with the aforementioned analyses of the theoretical maximum number
of regions, which is exponential in depth. We ask the analogous question for RNNs instead of FNNs:
Will randomly initialized or perturbed RNNs lose their high expressivity like FNNs did?
We establish a contrast between RNNs and FNNs by showing that the number of linear regions in random
RNNs remains exponential in the number of their feedback iterations. Once again, the fundamental
difference lies in the fact that RNNs share parameters across time. Indeed, the analyses of many prior
works (Hanin and Rolnick, 2019a,b; Hanin et al., 2021) crucially rely on the “fresh” randomness
injected at every layer, e.g., that the weights/biases of each neuron are initialized independently of
each other; obviously, this is no longer true in RNNs where the same units are repeatedly used across
different iterations. This shared randomness raises new technical challenges for bounding the number
of linear regions, but we manage to indirectly bound them by studying fixed points of random RNNs.
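As a rough illustration of why iterating a shared-parameter map can blow up the count of linear regions, here is a small Python sketch (our own construction, not the paper's proof technique) that numerically estimates the number of linear pieces of $f^t$ for a random one-dimensional ReLU map $f$ by counting slope changes on a fine grid; the specific map, grid size, and tolerance are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random 1-D piecewise-linear map f(x) = clip(sum_i a_i * ReLU(x - b_i)) on [0, 1].
k = 16
b = rng.uniform(0.0, 1.0, size=k)
a = rng.normal(0.0, np.sqrt(2.0 / k), size=k)

def f(x):
    # Vectorized over a grid of inputs x of shape (n,).
    return np.clip(np.maximum(x[:, None] - b[None, :], 0.0) @ a, 0.0, 1.0)

def count_pieces(t, n_grid=200_000, tol=1e-6):
    """Estimate the number of linear pieces of f composed with itself t times."""
    xs = np.linspace(0.0, 1.0, n_grid)
    ys = xs.copy()
    for _ in range(t):
        ys = f(ys)
    slopes = np.diff(ys) / np.diff(xs)
    # A new linear piece starts wherever the finite-difference slope changes.
    # Crude estimate: it undercounts once pieces become finer than the grid spacing.
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > tol))

for t in range(1, 6):
    print(f"iterations t={t}: ~{count_pieces(t)} linear pieces")
```

For draws of $f$ that land in the chaotic regime the estimate tends to grow rapidly with $t$, in contrast with the roughly linear-in-neurons counts that Hanin and Rolnick (2019a) establish for randomly perturbed FNNs.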
2 Our Main Results
Our main contribution is to prove that randomly initialized RNNs can exhibit Li-Yorke chaos (Li and
Yorke, 1975) (see definitions below), and to quantify when and how often this type of chaos appears
as we vary the variance of the weights chosen by the random initialization. To do so, we use discrete
dynamical systems which naturally capture the behavior of RNNs: simply start with some shallow
NN implementing a continuous map $f$, and after $t$ feedback iterations of the RNN, its output will be
exactly $f^t$ ($f$ composed with itself $t$ times). We begin with some basic definitions:
Definition 2.1 (Scrambled Set). Let $(X, d)$ be a compact metric space and let $f : X \to X$ be a
continuous map. Two points $x, y \in X$ are called proximal if
$$\liminf_{n \to \infty} d(f^n(x), f^n(y)) = 0,$$
and are called asymptotic if
$$\limsup_{n \to \infty} d(f^n(x), f^n(y)) = 0.$$
A set $Y \subseteq X$ is called scrambled if for all $x, y \in Y$ with $x \neq y$, the points $x, y$ are proximal, but not asymptotic.

Definition 2.2 (Li-Yorke Chaos). The dynamical system $(X, f)$ is Li-Yorke chaotic if there is an
uncountable scrambled set $Y \subseteq X$.
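Definitions 2.1 and 2.2 involve limits over infinitely many iterates; as a purely illustrative, finite-horizon proxy (our own heuristic, not part of the paper's formal development), one can track the tail-minimum and tail-maximum of the distance sequence of a candidate pair: a tail-minimum near zero together with a tail-maximum bounded away from zero is numerical evidence of a scrambled pair.

```python
import numpy as np

def scrambled_pair_evidence(f, x, y, n_iters=2000, burn_in=500):
    """Finite-horizon proxy for 'proximal but not asymptotic' (Definition 2.1).

    Returns (tail_min, tail_max) of d(f^n(x), f^n(y)) over the last
    n_iters - burn_in iterations; a tiny tail_min together with a large
    tail_max is (heuristic) evidence that (x, y) behaves like a scrambled pair.
    """
    dists = []
    for n in range(n_iters):
        x, y = f(x), f(y)
        if n >= burn_in:
            dists.append(abs(x - y))
    return min(dists), max(dists)

# Example with a map known to be Li-Yorke chaotic on [0, 1]: the logistic map at r = 4.
logistic = lambda x: 4.0 * x * (1.0 - x)
print(scrambled_pair_evidence(logistic, 0.2, 0.2 + 1e-7))
```

For an asymptotic (non-scrambled) pair, both returned numbers would instead shrink toward zero.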
Figure 2: Left: Recreation of a figure from Hanin and Rolnick (2019a), where a small Gaussian
perturbation ($\mathcal{N}(0, 0.1)$) is added independently to every weight/bias of each layer of a simple FNN.
As a result, after adding noise, the high expressivity (i.e., the number of linear regions) breaks down;
having “fresh” noise in each layer was crucial. Right: Here we depict the same network as before but
viewed as an RNN, and we add noise in the same way. Perhaps surprisingly, RNNs exhibit different behavior
than FNNs: high expressivity is preserved even after the noise. As we show, due to the shared randomness
across iterations, small perturbations do not “break” expressivity.
Li-Yorke chaos leads both to scrambling phenomena (Figure 1) and to high expressivity (Right of
Figure 2). Interestingly, this is not true for FNNs (Left of Figure 2), where the fresh randomness
at each neuron leads to concentration of the Jacobian’s norm around 1, which is sufficient to avoid
chaos. In other words, these definitions capture exactly the fact that trajectories get arbitrarily close,
but also move apart infinitely often (as in Fig. 1). Intuitively, when scrambling occurs in RNNs, their
number of linear regions will be large and their input-output Jacobian will have spectral norm larger than one.
Definition 2.3 (Simple RNN Model). For $k, \sigma > 0$, let $\mathrm{RNN}(k, \sigma^2)$ be a family of recurrent neural
networks with the following properties:

• Input: The input to the network is 1-dimensional, i.e., a single number $x \in [0, 1]$.

• Hidden Layer: There is only one hidden layer of width at least $k$, with $\mathrm{ReLU}(x) = \max(x, 0)$ activation neurons, each of which has its own bias term $b_i \sim \mathrm{Unif}([0, 1])$.

• Output: The output is a real number in $[0, 1]$ which takes the functional form $f_k(x) = \mathrm{clip}\big(\sum_{i=1}^{k} a_i \, \mathrm{ReLU}(x - b_i)\big)$, where the weights are i.i.d. Gaussians $a_i \sim \mathcal{N}(0, \frac{\sigma^2}{k})$, and $\mathrm{clip}(\cdot)$ ensures that $f_k(x) \in [0, 1]$, similarly to the input (i.e., it “clips” the output so that it remains in $[0, 1]$, as it is an RNN).

• Feedback Loop: $f_k(x)$ becomes the new input number, so after $t$ iterations the output is $(f_k \circ f_k \circ \ldots \circ f_k)(x)$, i.e., $t$ compositions of $f_k$ with itself.
As we show, even this innocuous class of RNNs (see Sec. 3 for the general model) leads to scrambling.
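For concreteness, here is a short Python sketch of a draw from $\mathrm{RNN}(k, \sigma^2)$ as in Definition 2.3 and of its feedback iteration; the function names and the use of NumPy are our own, and $\mathrm{clip}(\cdot)$ is implemented as truncation to $[0, 1]$, which is one natural reading of the definition.

```python
import numpy as np

def sample_rnn(k, sigma2, rng):
    """Draw f_k from RNN(k, sigma^2): f_k(x) = clip(sum_i a_i * ReLU(x - b_i))."""
    b = rng.uniform(0.0, 1.0, size=k)                  # biases b_i ~ Unif([0, 1])
    a = rng.normal(0.0, np.sqrt(sigma2 / k), size=k)   # weights a_i ~ N(0, sigma^2 / k)
    def f(x):
        return float(np.clip(np.sum(a * np.maximum(x - b, 0.0)), 0.0, 1.0))
    return f

def iterate(f, x, t):
    """Apply the feedback loop: evaluate f composed with itself t times at x."""
    for _ in range(t):
        x = f(x)
    return x

rng = np.random.default_rng(0)
f_k = sample_rnn(k=128, sigma2=2.0, rng=rng)   # sigma^2 = 2 gives He-style variance 2/k
print([round(iterate(f_k, 0.5, t), 4) for t in range(1, 11)])
```

Keeping the same weights and biases across all $t$ iterations is exactly what distinguishes this family from a depth-$t$ feedforward network with fresh randomness in every layer.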
Theorem 2.4 (Li-Yorke Chaos at Initialization). Consider $f_k \in \mathrm{RNN}(k, \sigma^2)$ initialized according
to the He normal initialization (set $\sigma^2 = 2$, so the weight variance is $2/k$). Then, there exists some
constant $\delta > 0$ (independent of the width) and a width $k_{\mathrm{He}} > 1$, such that for sufficiently large
$k > k_{\mathrm{He}}$, $f_k$ is Li-Yorke chaotic with probability at least $\delta$.
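A minimal numerical probe of Theorem 2.4 (our own heuristic, not the paper's proof technique): by the classical Li-Yorke criterion, a continuous interval map is Li-Yorke chaotic whenever some point $a$ satisfies $f^3(a) \le a < f(a) < f^2(a)$ (or the mirrored inequalities), so one can estimate a lower bound on the chaos probability at He initialization by sampling random draws of $f_k$ and searching a grid for such a witness; the grid size, width, and number of trials below are arbitrary choices.

```python
import numpy as np

def sample_rnn(k, sigma2, rng):
    """Draw f_k from RNN(k, sigma^2); f_k acts on a whole grid of inputs at once."""
    b = rng.uniform(0.0, 1.0, size=k)
    a = rng.normal(0.0, np.sqrt(sigma2 / k), size=k)
    return lambda x: np.clip(np.maximum(x[:, None] - b, 0.0) @ a, 0.0, 1.0)

def has_li_yorke_witness(f, n_grid=4001):
    """Search a grid for a point a with f^3(a) <= a < f(a) < f^2(a) (or the mirror image).

    By Li and Yorke (1975), such a point certifies Li-Yorke chaos of the interval map;
    failing to find one on a finite grid proves nothing, so this only yields a lower bound.
    """
    a0 = np.linspace(0.0, 1.0, n_grid)
    a1 = f(a0)   # f(a)
    a2 = f(a1)   # f^2(a)
    a3 = f(a2)   # f^3(a)
    fwd = (a3 <= a0) & (a0 < a1) & (a1 < a2)
    bwd = (a3 >= a0) & (a0 > a1) & (a1 > a2)
    return bool(np.any(fwd | bwd))

rng = np.random.default_rng(0)
k, trials = 256, 200
hits = sum(has_li_yorke_witness(sample_rnn(k, 2.0, rng)) for _ in range(trials))
print(f"fraction of He-initialized draws with a Li-Yorke witness: {hits / trials:.2f}")
```

With $\sigma^2 = 2$ this matches the He normal setting of Theorem 2.4; since the check is a sufficient condition evaluated on a finite grid, the printed fraction should be read as a heuristic lower estimate of the constant $\delta$, not an exact probability.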
This answers our two questions posed earlier: RNNs may remain chaotic under initialization heuristics,
and maintain their high expressivity, because Li-Yorke chaos implies an exponential number of linear
regions (Theorem 1.5 in Chatziafratis et al. (2020b)). Next, we focus on threshold phenomena:
Theorem 2.5 (Order-to-Chaos Transition). For $f_k \in \mathrm{RNN}(k, \sigma^2)$, we get the following 3 regimes
as we vary the variance of the weights $a_i \sim \mathcal{N}(0, \frac{\sigma^2}{k})$:

• (Low variance, order) Let $a_i \sim \mathcal{N}(0, \frac{1}{4k \log k})$. Then, the probability that $f_k$ is Li-Yorke chaotic is at most $\frac{1}{k}$.