Fast Saturating Gate for Learning Long Time Scales with
Recurrent Neural Networks
Kentaro Ohno, Sekitoshi Kanai, Yasutoshi Ida
NTT
{kentaro.ohno.tf, sekitoshi.kanai.fu, yasutoshi.ida.yc}@hco.ntt.co.jp
Abstract
Gate functions in recurrent models, such as an LSTM and
GRU, play a central role in learning various time scales in
modeling time series data by using a bounded activation func-
tion. However, it is difficult to train gates to capture extremely
long time scales due to gradient vanishing of the bounded
function for large inputs, which is known as the saturation
problem. We closely analyze the relation between saturation
of the gate function and the efficiency of training. We prove
that the gradient vanishing of the gate function can be miti-
gated by accelerating the convergence of the saturating func-
tion, i.e., making the output of the function converge to 0
or 1 faster. Based on the analysis results, we propose a gate
function called fast gate that has a doubly exponential con-
vergence rate with respect to inputs by simple function com-
position. We empirically show that our method outperforms
previous methods in accuracy and computational efficiency
on benchmark tasks involving extremely long time scales.
1 Introduction
Recurrent neural networks (RNNs) are models suited to pro-
cessing sequential data in various applications, e.g., speech
recognition (Ling et al. 2020) and video analysis (Zhu et al.
2020). The most widely used RNNs are a long short-term
memory (LSTM) (Hochreiter and Schmidhuber 1997) and
gated recurrent unit (GRU) (Cho et al. 2014), both of which have a
gating mechanism. The gating mechanism controls the in-
formation flow in the state of RNNs via multiplication with
a gate function bounded to a range [0,1]. For example, when
the forget gate takes a value close to 1 (or 0 for the up-
date gate in the GRU), the state preserves the previous in-
formation. On the other hand, when it gets close to the other
boundary, the RNN updates the state by the current input.
Thus, in order to represent long temporal dependencies of
data involving hundreds or thousands of time steps, it is
crucial for the forget gate to take values near the bound-
aries (Tallec and Ollivier 2018; Mahto et al. 2021).
However, it is difficult to train RNNs so that they have the
gate values near the boundaries. Previous studies hypothe-
sized that this is due to gradient vanishing for the gate func-
tion called saturation (Chandar et al. 2019; Gu et al. 2020b),
i.e., the gradient of the gate function near the boundary is
too small to effectively update the parameters. To avoid the
saturation problem, a previous study used unbounded acti-
vation functions (Chandar et al. 2019). However, this makes
training unstable due to the gradient explosion (Pascanu,
Mikolov, and Bengio 2013). Another study introduced a residual
connection for the gate function to push the output value
toward boundaries, hence mitigating the saturation prob-
lem (Gu et al. 2020b). However, it requires additional com-
putational cost due to increasing the number of parameters
for another gate function. For broader application of gated
RNNs, a more efficient solution is necessary.
To overcome the difficulty of training, we propose a novel
activation function for the forget gate based on the usual sig-
moid function, which we call the fast gate. Modification of
the usual sigmoid gate to the fast gate is simple and easy
to implement since it requires only one additional function
composition. To this end, we analyze the relation between
the saturation and gradient vanishing of the bounded activa-
tion function. Specifically, we focus on the convergence rate
of the activation function to the boundary, which we call
the order of saturation. For example, the sigmoid function
σ(z) = 1/(1 + e^{−z}) has the exponential order of saturation,
i.e., 1 − σ(z) = O(e^{−z}) (see Fig. 1), and the derivative
also decays to 0 exponentially as z goes to infinity. When a
bounded activation function has a higher order of saturation,
the derivative decays much faster as the input grows. Since
previous studies have assumed that the decaying derivative
on the saturating regime causes training to get stuck (Ioffe
and Szegedy 2015), it seems that a higher order of saturation
would lead to poor training. Contrary to this intuition,
we prove that a higher order of saturation alleviates the gra-
dient vanishing on the saturating regime through observa-
tion on a toy problem for learning long time scales. This
result indicates that functions saturating superexponentially
are more suitable for the forget gate to learn long time scales
than the sigmoid function. On the basis of this observation,
we explore a method of realizing such functions by compos-
ing functions which increase faster than the identity function
(e.g., α(z) = z + z^3) as σ(α(z)). We find that the hyperbolic
sinusoidal function is suitable for achieving a higher order
of saturation in a simple way, and we obtain the fast gate.
Since the fast gate has a doubly exponential order of saturation
O(e^{−e^z}), it improves the trainability of gated RNNs for long
time scales of sequential data. We evaluate the computational
efficiency and accuracy of a model with the fast gate on several
benchmark tasks, including synthetic tasks, pixel-by-pixel image
classification, and language modeling, which involve a wide range
of time scales. The model with the fast gate empirically
outperforms other models, including an LSTM with the sigmoid gate
and variants recently proposed for tackling the saturation problem
(Chandar et al. 2019; Gu et al. 2020b), in terms of accuracy and
convergence speed of training while maintaining stability of
training. Further visualization analysis of learning time scales
shows that our theory fits the learning dynamics of actual models
and that the fast gate can learn extremely long time scales of
thousands of time steps.

Figure 1: Function values and derivatives of various bounded
activation functions. The sigmoid function σ (orange) converges
exponentially to 1 as z → ∞, since 1 − σ(z) = 1/(1 + e^z) ≤ e^{−z};
its derivative also decays exponentially. The normalized version of
the softsign function softsign(z) = z/(1 + |z|) (blue) converges to
1 more slowly. The fast gate (red) is the proposed gate function.
Although its gradient decays faster than that of the sigmoid
function, it provably helps learning values near boundaries.
Our major contributions are as follows:
• We prove that gate functions which saturate faster actually
  accelerate learning of values near boundaries. The result
  indicates that fast saturation improves the learnability of
  gated RNNs on data with long time scales.
• We propose the fast gate, which saturates faster than the
  sigmoid function. In spite of its simplicity, the fast gate
  achieves a doubly exponential order of saturation and thus
  effectively improves learning of long time scales.
• We evaluate the effectiveness of the fast gate against recently
  proposed methods such as the NRU (Chandar et al. 2019) and the
  refine gate (Gu et al. 2020b). The results verify that the fast
  gate robustly improves the learnability for long-term
  dependencies in both synthetic and real data.
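To make the construction described above concrete, the following minimal NumPy snippet composes the sigmoid with a faster-growing function. The specific choice σ(sinh(z)) is our reading of the hyperbolic-sine construction mentioned in this section and should be treated as an assumption and a sketch, not the exact implementation used in the experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fast_gate(z):
    # Compose the sigmoid with a function growing faster than the identity.
    # With sinh, 1 - sigmoid(sinh(z)) decays roughly like exp(-exp(z)/2),
    # i.e., a doubly exponential order of saturation.
    return sigmoid(np.sinh(z))

z = np.linspace(0.0, 4.0, 5)
print(1.0 - sigmoid(z))    # decays like exp(-z)
print(1.0 - fast_gate(z))  # decays doubly exponentially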
2 Preliminaries
2.1 Time Scales in Gated RNNs
In this section, we review gated RNNs and their time scale
interpretation (Tallec and Ollivier 2018). We begin with an
LSTM (Hochreiter and Schmidhuber 1997), which is one
of the most popular RNNs. An LSTM has a memory cell
c_t ∈ R^n and hidden state h_t ∈ R^n inside, which are updated
depending on the sequential input data x_t at each time step
t = 1, 2, ... by

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t  (1)
h_t = o_t ⊙ tanh(c_t)  (2)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)  (3)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)  (4)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)  (5)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)  (6)

where W_∗, U_∗, and b_∗ are weight and bias parameters for
each ∗ ∈ {f, i, c, o}. The sigmoid function σ is defined as

σ(x) = 1 / (1 + e^{−x}).  (7)
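For reference, the update equations (1)–(6) can be written as a single-step function as follows. This is a minimal, batch-free NumPy sketch; the parameter containers W, U, b are hypothetical dictionaries keyed by the gate names and are not drawn from any particular library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by 'f', 'i', 'c', 'o' as in Eqs. (3)-(6).
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate, Eq. (3)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate,  Eq. (4)
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate,   Eq. (5)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate, Eq. (6)
    c_t = f_t * c_prev + i_t * c_tilde   # memory cell update, Eq. (1)
    h_t = o_t * np.tanh(c_t)             # hidden state,       Eq. (2)
    return h_t, c_t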
f_t, i_t, o_t ∈ [0, 1]^n are called forget, input, and output gates,
respectively. They were initially motivated as a binary mech-
anism, i.e., switching on and off, allowing information to
pass through (Gers, Schmidhuber, and Cummins 2000). The
forget gate has been reinterpreted as the representation for
time scales of memory cells (Tallec and Ollivier 2018). Fol-
lowing that study, we simplify Eq. (1) by assuming c̃_t = 0
for an interval t ∈ [t_0, t_1]. Then, we obtain

c_{t_1} = f_{t_1} ⊙ c_{t_1−1}  (8)
        = f̄^{t_1−t_0} ⊙ c_{t_0},  (9)

where f̄ = (∏_{s=t_0+1}^{t_1} f_s)^{1/(t_1−t_0)} is the (entry-wise)
geometric mean of the values of the forget gate. Through Eq. (8),
the memory cell c_t loses its information on data up to time t_0
exponentially, and each entry of f̄ represents its (averaged)
decay rate. This indicates that, in order to capture long-term
dependencies of the sequential data, the forget gate should take
values near 1 on average. We refer to the associated time
constant¹ T = −1/log f̄ as the time scale of the units, which
has been empirically shown to illustrate well the temporal
behavior of LSTMs (Mahto et al. 2021).
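The following small calculation (a sketch with hypothetical gate values) illustrates how close the averaged forget gate f̄ must be to 1 to realize long time scales T = −1/log f̄:

import numpy as np

# Time scale T = -1 / log(f_bar) for a constant (averaged) forget gate value.
for f_bar in [0.9, 0.99, 0.999, 0.9999]:
    T = -1.0 / np.log(f_bar)
    print(f"f_bar = {f_bar}: time scale T ~ {T:.0f} steps")
# Capturing dependencies over thousands of steps requires f_bar >= 0.999,
# i.e., forget gate values very close to 1.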
The above argument applies not only to an LSTM, but
also to general gated RNNs including a GRU (Cho et al.
2014) with a state update of the form

h_t = f_t ⊙ h_{t−1} + i_t ⊙ h̃_t,  (10)

where h_t, f_t, and i_t denote the state, forget gate, and input
gate, respectively, and h̃_t is the activation representing new
information at time t. Here again, the forget gate f_t controls
the time scale of each unit of the state.
2.2 Saturation in Gating Activation Functions
The sigmoid function σ(z) in the gating mechanism requires a
large z to take a value near 1 as the output. On the other hand,
the derivative σ′(z) takes exponentially small values for
z ≫ 0 (Fig. 1). Thus, when a gated model needs to learn
large gate values such as 0.99 with gradient methods, param-
eters in the gate cannot be effectively updated due to gradient
vanishing. This is called saturation of bounded activation
functions (Gulcehre et al. 2016). The behavior of gate func-
tions on the saturating regime is important for gated RNNs
¹An exponential function F(t) = e^{−αt} of time t decreases by
a factor of 1/e in time T = 1/α, which is called the time constant.
because forget gate values need to be large to represent long
time scales as explained in Section 2.1. That is, gated RNNs
must face saturation of the forget gate to learn long time
scales. Thus, it is hypothesized that saturation causes diffi-
culty in training gated RNNs for data with extremely long
time scales (Chandar et al. 2019; Gu et al. 2020b).
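To make the scale of this gradient vanishing concrete, the following sketch (with hypothetical target gate values) computes the pre-activation z = σ^{−1}(f) needed for a given forget gate value f and the sigmoid derivative there:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for f in [0.9, 0.99, 0.999]:
    z = np.log(f / (1.0 - f))                 # z = sigmoid^{-1}(f)
    grad = sigmoid(z) * (1.0 - sigmoid(z))    # sigmoid'(z) = f (1 - f)
    print(f"f = {f}: z = {z:.2f}, sigmoid'(z) = {grad:.6f}")
# A gate value of 0.999 (time scale ~ 1000 steps) leaves a derivative of
# about 1e-3, so parameter updates through the gate become very small.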
3 Related Work
Although there is abundant literature on learning long-term
dependencies with RNNs, we outline the most related stud-
ies in this section due to the space limitation and provide
additional discussion of other studies in Appendix A.
Several studies investigate the time scale representation
of the forget gate function to improve learning on data in-
volving long-term dependencies (Tallec and Ollivier 2018;
Mahto et al. 2021). For example, performance of LSTM lan-
guage models can be improved by fixing the bias parameter
of the forget gate in accordance with a power law of time
scale distribution, which underlies natural language (Mahto
et al. 2021). Such techniques require us to know the appro-
priate time scales of data a priori, which is often difficult.
Note that this approach can be combined with our method
since it is complementary to our work.
Several modifications of the gate function have been pro-
posed to tackle the saturation problem. The noisy gradient
for a piece-wise linear gate function was proposed to prevent
the gradient from taking zero values (Gulcehre et al. 2016). This
training protocol includes hyperparameters controlling noise
level, which requires manual tuning. Furthermore, such a
stochastic approach can result in unstable training due to
gradient estimation bias (Bengio, Léonard, and Courville
2013). The refine gate (Gu et al. 2020b) was proposed as an-
other modification introducing a residual connection to push
the gate value to the boundaries. It is rather heuristic and
does not provide theoretical justification. It also requires ad-
ditional parameters for the auxiliary gate, which increases
the computational cost for both inference and training. In
contrast, our method theoretically improves learnability and
does not introduce any additional parameters. Another study
suggests that omitting gates other than the forget gate makes
training of models for long time scales easier (Van Der West-
huizen and Lasenby 2018). However, such simplification
may lose the expressive power of the model and limit its ap-
plication fields. Chandar et al. (2019) proposed an RNN with
a non-saturating activation function to directly avoid the gra-
dient vanishing due to saturation. Since its state and mem-
ory vector evolve in unbounded regions, the behavior of the
gradient can be unstable depending on tasks. Our method
mitigates the gradient vanishing by controlling the order of
saturation, while maintaining the bounded state transition.
4 Analysis on Saturation and Learnability
We discuss the learning behavior of the forget gate for long
time scales. First, we formulate a problem of learning long
time scales in a simplified setting. Next, we relate the effi-
ciency of learning on the problem to the saturation of the
gate functions. We conclude that the faster saturation makes
learning more efficient. All proofs for mathematical results
below are given in Appendix C.
4.1 Problem Setting
Recall Eq. (8), which describes the time scales of the mem-
ory cell c_t of an LSTM via exponential decay. Let the memory
cell at time t_1 be c_{t_1} = λc_{t_0} with λ ∈ [0, 1]. Requiring
long time scales corresponds to getting λ close to 1. Therefore,
we can consider a long-time-scale learning problem as minimizing
a loss function L that measures the discrepancy between c_{t_1}
and λc_{t_0}, where λ ∈ [0, 1] is a desired value close to 1. We
take L as the absolute loss, for example. Then, we obtain

L = |c_{t_1} − λc_{t_0}|  (11)
  = |f̄^{t_1−t_0} ⊙ c_{t_0} − λc_{t_0}|  (12)
  = c_{t_0}|f̄^{t_1−t_0} − λ|,  (13)

using Eq. (8). Let z_t = W_f x_t + U_f h_{t−1} + b_f, so that
f_t = σ(z_t). Since we are interested in the averaged value of
f_t, we consider z_t to be time-independent, that is, z_t = z, in
the same way as Tallec and Ollivier (2018). The problem is then
reduced to obtaining z that minimizes

L(z) = c_{t_0}|σ(z)^{t_1−t_0} − λ|.  (14)

We consider this as the minimal problem for analyzing the
learnability of the forget gate for long time scales. Note that
since the product c_{t_1} = f̄^{t_1−t_0} ⊙ c_{t_0} is taken
element-wise, we can consider this as a one-dimensional problem.
Furthermore, the global solution can be written explicitly as
z = σ^{−1}(λ^{1/(t_1−t_0)}), where σ^{−1} is the inverse of σ.
Next, we consider the learning dynamics of the model
on the aforementioned problem Eq. (14). RNNs are usu-
ally trained with gradient methods. Learning dynamics with
gradient methods can be analyzed by considering the limit of the
learning rate → 0, known as gradient flow (Harold and George
2003). Therefore, we consider the following gradient flow

dz/dτ = −∂L/∂z,  (15)

using the loss function introduced above. Here, τ denotes a time
variable for the learning dynamics, which should not be confused
with t representing the state transition. Our aim is to
investigate the convergence rate of a solution of the
differential equation Eq. (15) when σ in the forget gate is
replaced with another function φ.
4.2 Order of Saturation
To investigate the effect of the choice of gate function on the
convergence rate, we first define the candidate set F of
bounded functions for the gate function.

Definition 4.1. Let F be a set of differentiable and strictly
increasing surjective functions φ: R → [0, 1] such that the
derivative φ′ is monotone on z > z_0 for some z_0 ≥ 0.
F is a natural class of gating activation functions including σ.
As we explained in Section 2.2, gated RNNs suffer from gradient
vanishing due to saturation when learning long time scales. To
clarify the issue, we first show that saturation is inevitable
regardless of the choice of φ ∈ F.

Proposition 4.2. lim_{z→∞} φ′(z) = 0 holds for any φ ∈ F.
Nevertheless, choices of φ significantly affect the efficiency of
the training. When the target λ takes an extreme value near the
boundaries, the efficiency of training should depend on the
asymptotic behavior of φ(z) for z ≫ 0, that is, the rate at which
φ(z) converges as z → ∞. We call the convergence rate of φ(z) as
z → ∞ the order of saturation. More precisely, we define the
notion as follows²:

Definition 4.3. Let g: R → R be a decreasing function. φ ∈ F has
the order of saturation of O(g(z)) if
lim_{z→∞} g(az)/(1 − φ(z)) = 0 for some a > 0. For φ, φ̃ ∈ F, φ
has a higher order of saturation than φ̃ if
lim_{z→∞} (1 − φ(z))/(1 − φ̃(az)) = 0 holds for any a > 0 and
φ̃^{−1}(φ(z)) is convex for z ≫ 0.
Intuitively, the order of saturation of O(g(z)) means that the
convergence rate of φ to 1 is bounded by the decay rate of g up
to a constant multiplication of z. For example, the sigmoid
function σ satisfies e^{−az}/(1 − σ(z)) → 0 as z → ∞ for any
a > 1, and thus has the exponential order of saturation O(e^{−z}).
The convexity condition for a higher order of saturation is
rather technical, but it is automatically satisfied for typical
functions; see Appendix C.2. If φ has a higher order of
saturation (or saturates faster) than another function φ̃, then
φ(z) converges faster than φ̃(z) as z → ∞, and φ′(z) becomes
smaller than φ̃′(z). In this sense, training with φ̃ seems more
efficient than with φ on the above problem. However, this is not
the case, as we discuss in the next section.
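As a numerical sanity check of these definitions (a sketch; the constant a = 1.5 and the sample points are arbitrary), the first ratio below tends to 0 because σ has the order of saturation O(e^{−z}), and the second tends to 0 because σ has a higher order of saturation than the normalized softsign σ_ns introduced in the next subsection:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softsign_norm(z):
    # Normalized softsign: (softsign(z/2) + 1) / 2 with softsign(z) = z / (1 + |z|).
    return (z / 2.0 / (1.0 + np.abs(z / 2.0)) + 1.0) / 2.0

for z in [5.0, 10.0, 20.0, 30.0]:
    r_order = np.exp(-1.5 * z) / (1.0 - sigmoid(z))           # -> 0: order O(e^{-z}), a = 1.5 > 1
    r_higher = (1.0 - sigmoid(z)) / (1.0 - softsign_norm(z))   # -> 0: sigmoid saturates faster
    print(f"z = {z:4.1f}: {r_order:.3e}  {r_higher:.3e}")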
4.3 Efficient Learning via Fast Saturation
To precisely analyze learning behavior, we trace the learning
dynamics of the output value f = φ(z), since our purpose is to
obtain the desired output value rather than the input z. We
transform the learning dynamics (Eq. (15)) into that of f by

df/dτ = (dz/dτ)(df/dz) = −φ′(z) ∂L/∂z = −φ′(z)^2 ∂L/∂f.  (16)

To treat Eq. (16) as a dynamics purely of f, we define a function
g_φ(f) of f by g_φ(f) := φ′(φ^{−1}(f)), so that Eq. (16) becomes

df/dτ = −g_φ(f)^2 ∂L/∂f.  (17)

Our interest is in the dynamics of f near the boundary, i.e., the
limit of f → 1. We have the following result:
Theorem 4.4. Let φ, φ̃ ∈ F. If φ has a higher order of saturation
than φ̃, then g_φ(f)/g_{φ̃}(f) → ∞ as f → 1.
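To illustrate Theorem 4.4 for the two gates compared below, g_φ can be written in closed form: g_σ(f) = f(1 − f) for the sigmoid (since σ′(z) = σ(z)(1 − σ(z))) and g_{σ_ns}(f) = (1 − f)^2 for the normalized softsign when f ≥ 1/2 (a direct computation), so g_σ(f)/g_{σ_ns}(f) = f/(1 − f) diverges as f → 1. The following sketch evaluates this ratio:

def g_sigmoid(f):
    # g_phi(f) = phi'(phi^{-1}(f)) for the sigmoid: equals f (1 - f).
    return f * (1.0 - f)

def g_softsign_norm(f):
    # For the normalized softsign and f >= 1/2: equals (1 - f)^2.
    return (1.0 - f) ** 2

for f in [0.9, 0.99, 0.999, 0.9999]:
    print(f"f = {f}: g_sigmoid / g_softsign = {g_sigmoid(f) / g_softsign_norm(f):.1f}")
# The ratio grows like f / (1 - f) and diverges as f -> 1: by Eq. (17),
# the sigmoid output moves much faster near the boundary.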
Theorem 4.4 indicates that a higher order of saturation
accelerates the move of the output f near boundaries in
accordance with Eq. (17), since g_φ(f) takes larger values. Thus,
contrary to the intuition in Section 4.2, a higher order of
saturation leads to more efficient training for target values
near boundaries. We demonstrate this effect using two activation
functions: the sigmoid function σ(z) and the normalized softsign
function σ_ns(z) = (softsign(z/2) + 1)/2, where
softsign(z) = z/(1 + |z|). σ_ns is the softsign function modified
so that 0 ≤ σ_ns(z) ≤ 1 and σ_ns′(0) = σ′(0). σ has a higher
order of saturation than σ_ns, since σ has the order of
saturation O(e^{−z}) and σ_ns has O(z^{−1}) (see Fig. 1). We plot
the learning dynamics of the gradient flow for the problem in
Fig. 2. Since σ has a higher order of saturation than σ_ns, the
gate value f of σ_ns converges to the boundary more slowly.
Fig. 2 also shows the dynamics of gradient descent with learning
rate 1. While gradient descent is a discrete approximation of the
gradient flow, it behaves similarly to the gradient flow.

²Our definition of the asymptotic order is slightly different
from the usual one, which adopts lim sup_{z→∞} g(z)/(1 − φ(z)) < ∞,
since it is more suitable for analyzing training efficiency.

Figure 2: Learning curves for the simplified long-time-scale
learning problem with gradient descent (markers) and gradient
flow (solid lines). Gradient descent is done with learning rate 1.
The time difference t_1 − t_0 is set to 10. Dashed lines are the
lower bounds given in Tab. 1, fitted to each learning curve with
a suitable translation. These lower bounds approximate well the
asymptotic convergence of the gradient flow. Results of the
refine and fast gates are explained in Section 5.3.
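The gradient-descent curves in Fig. 2 can be reproduced qualitatively with a short script. This is a sketch only: it assumes λ = 1, c_{t_0} = 1, and an initial pre-activation z = 0, uses the learning rate 1 and t_1 − t_0 = 10 stated in the caption, and computes gradients numerically for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softsign_norm(z):
    return (z / 2.0 / (1.0 + np.abs(z / 2.0)) + 1.0) / 2.0

def run(gate, z0=0.0, steps=2000, lr=1.0, T=10, lam=1.0, eps=1e-6):
    # Gradient descent on L(z) = |gate(z)^T - lam| (Eq. (14) with c_{t_0} = 1),
    # using a central-difference gradient for simplicity.
    z = z0
    for _ in range(steps):
        grad = (abs(gate(z + eps) ** T - lam) - abs(gate(z - eps) ** T - lam)) / (2 * eps)
        z -= lr * grad
    return gate(z)

for name, gate in [("sigmoid", sigmoid), ("normalized softsign", softsign_norm)]:
    f = run(gate)
    print(f"{name}: gate value after training = {f:.6f}")
# The sigmoid gate approaches 1 markedly faster than the normalized
# softsign gate, matching the ordering of the curves in Fig. 2.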
Explicit convergence rates. Beyond Theorem 4.4, we
can explicitly calculate effective bounds of the convergence
rate for the problem when the activation function is the sigmoid
function σ(z) or the normalized softsign function σ_ns(z).
Proposition 4.5. Consider the problem in Section 4.1 with the
absolute loss L = |f^{t_1−t_0} − λ| with λ = 1. For the sigmoid
function f = σ(z), the convergence rate for the problem is
bounded as 1 − f = O(τ^{−1}). Similarly, for the normalized
softsign function f = σ_ns(z), the convergence rate is bounded
as 1 − f = O(τ^{−1/3}).
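A heuristic derivation of these rates (a sketch only; the rigorous proofs are given in Appendix C) follows from Eq. (17). Writing T = t_1 − t_0, the loss in Proposition 4.5 gives ∂L/∂f = −T f^{T−1} ≈ −T near f = 1. For the sigmoid, g_σ(f) = f(1 − f), so Eq. (17) yields d(1 − f)/dτ ≈ −T(1 − f)^2, which integrates to 1 − f = O(τ^{−1}). For the normalized softsign, g_{σ_ns}(f) = (1 − f)^2 for f ≥ 1/2, so d(1 − f)/dτ ≈ −T(1 − f)^4, giving 1 − f = O(τ^{−1/3}).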
Proposition 4.5 shows the quantitative effect of the difference
in the order of saturation on the convergence rates. We fit
the bounds to the learning curves with the gradient flow in
Fig. 2. The convergence rates of the learning are well
approximated by the bounds. This asymptotic analysis highlights
that the choice of the function φ significantly affects the
efficiency of training for long time scales.
5 Proposed Method
On the basis of the analysis in Section 4, we construct the
fast gate, which is suitable for learning long time scales.
5.1 Desirable Properties for Gate Functions
We consider modifying the usual sigmoid function to another
function φ ∈ F for the forget gate in a gated RNN. The function
φ should satisfy the following conditions.