Fast Saturating Gate for Learning Long Time Scales with
Recurrent Neural Networks
Kentaro Ohno, Sekitoshi Kanai, Yasutoshi Ida
NTT
{kentaro.ohno.tf, sekitoshi.kanai.fu, yasutoshi.ida.yc}@hco.ntt.co.jp
Abstract
Gate functions in recurrent models, such as an LSTM and
GRU, play a central role in learning various time scales in
modeling time series data by using a bounded activation func-
tion. However, it is difficult to train gates to capture extremely
long time scales due to gradient vanishing of the bounded
function for large inputs, which is known as the saturation
problem. We closely analyze the relation between saturation
of the gate function and the efficiency of training. We prove
that the gradient vanishing of the gate function can be miti-
gated by accelerating the convergence of the saturating func-
tion, i.e., making the output of the function converge to 0
or 1 faster. Based on the analysis results, we propose a gate
function called fast gate that has a doubly exponential con-
vergence rate with respect to inputs by simple function com-
position. We empirically show that our method outperforms
previous methods in accuracy and computational efficiency
on benchmark tasks involving extremely long time scales.
1 Introduction
Recurrent neural networks (RNNs) are models suited to pro-
cessing sequential data in various applications, e.g., speech
recognition (Ling et al. 2020) and video analysis (Zhu et al.
2020). The most widely used RNNs are a long short-term
memory (LSTM) (Hochreiter and Schmidhuber 1997) and
gated recurrent unit (GRU) (Cho et al. 2014), both of which have a
gating mechanism. The gating mechanism controls the in-
formation flow in the state of RNNs via multiplication with
a gate function bounded to a range [0,1]. For example, when
the forget gate takes a value close to 1 (or 0 for the up-
date gate in the GRU), the state preserves the previous in-
formation. On the other hand, when it gets close to the other
boundary, the RNN updates the state by the current input.
Thus, in order to represent long temporal dependencies of
data involving hundreds or thousands of time steps, it is
crucial for the forget gate to take values near the bound-
aries (Tallec and Ollivier 2018; Mahto et al. 2021).
However, it is difficult to train RNNs so that they have the
gate values near the boundaries. Previous studies hypothe-
sized that this is due to gradient vanishing for the gate func-
tion called saturation (Chandar et al. 2019; Gu et al. 2020b),
i.e., the gradient of the gate function near the boundary is
too small to effectively update the parameters. To avoid the
saturation problem, a previous study used unbounded acti-
vation functions (Chandar et al. 2019). However, this makes
training unstable due to the gradient explosion (Pascanu,
Mikolov, and Bengio 2013). Another study introduced a residual
connection for the gate function to push the output value
toward boundaries, hence mitigating the saturation prob-
lem (Gu et al. 2020b). However, it requires additional com-
putational cost due to increasing the number of parameters
for another gate function. For broader application of gated
RNNs, a more efficient solution is necessary.
To overcome the difficulty of training, we propose a novel
activation function for the forget gate based on the usual sig-
moid function, which we call the fast gate. Modification of
the usual sigmoid gate to the fast gate is simple and easy
to implement since it requires only one additional function
composition. To this end, we analyze the relation between
the saturation and gradient vanishing of the bounded activa-
tion function. Specifically, we focus on the convergence rate
of the activation function to the boundary, which we call
the order of saturation. For example, the sigmoid function
σ(z) = 1/(1 + e^{−z}) has the exponential order of saturation,
i.e., 1 − σ(z) = O(e^{−z}) (see Fig. 1), and the derivative
also decays to 0 exponentially as z goes to infinity. When a
bounded activation function has a higher order of saturation,
the derivative decays much faster as the input grows. Since
previous studies have assumed that the decaying derivative
on the saturating regime causes training to get stuck (Ioffe
and Szegedy 2015), it seems that a higher order of saturation
would lead to poor training. Contrary to this intuition,
we prove that a higher order of saturation alleviates the gra-
dient vanishing on the saturating regime through observa-
tion on a toy problem for learning long time scales. This
result indicates that functions saturating superexponentially
are more suitable for the forget gate to learn long time scales
than the sigmoid function. On the basis of this observation,
we explore a method of realizing such functions by compos-
ing functions which increase faster than the identity function
(e.g., α(z) = z + z^3) as σ(α(z)). We find that the hyperbolic
sinusoidal function is suitable for achieving a higher order
of saturation in a simple way, and we obtain the fast gate.
Since the fast gate has a doubly exponential order of saturation
O(e^{−e^z}), it improves the trainability of gated RNNs for long
time scales of sequential data. We evaluate the computational
efficiency and accuracy of a model with the fast gate on several
benchmark tasks, including synthetic tasks, pixel-by-pixel image
classification, and language modeling, which involve a wide range
of time scales. The model with the fast gate empirically
outperforms other models, including an LSTM with the sigmoid gate
and variants recently proposed for tackling the saturation problem
(Chandar et al. 2019; Gu et al. 2020b), in terms of accuracy and
convergence speed of training while maintaining stability of
training. Further visualization analysis of learning time scales
shows that our theory fits the learning dynamics of actual models
and that the fast gate can learn extremely long time scales of
thousands of time steps.

Figure 1: Function values and derivatives of various bounded
activation functions. The sigmoid function σ (orange) converges
exponentially to 1 as z → ∞, since 1 − σ(z) = 1/(1 + e^z) ≤ e^{−z};
its derivative also decays exponentially. The normalized version of
the softsign function softsign(z) = z/(1 + |z|) (blue) converges to
1 more slowly. The fast gate (red) is the proposed gate function.
Although its gradient decays faster than that of the sigmoid
function, it provably helps learning values near boundaries.
Our major contributions are as follows:
• We prove that gate functions which saturate faster actually
  accelerate learning of values near boundaries. The result
  indicates that fast saturation improves the learnability of
  gated RNNs on data with long time scales.
• We propose the fast gate, which saturates faster than the
  sigmoid function. In spite of its simplicity, the fast gate
  achieves a doubly exponential order of saturation and thus
  effectively improves learning of long time scales.
• We evaluate the effectiveness of the fast gate against recently
  proposed methods such as the NRU (Chandar et al. 2019) and the
  refine gate (Gu et al. 2020b). The results verify that the fast
  gate robustly improves the learnability for long-term
  dependencies in both synthetic and real data.
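To make the construction described above concrete, the following minimal NumPy snippet composes the sigmoid with a faster-growing function. The specific choice σ(sinh(z)) is our reading of the hyperbolic-sine construction mentioned in this section and should be treated as an assumption and a sketch, not the exact implementation used in the experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fast_gate(z):
    # Compose the sigmoid with a function growing faster than the identity.
    # With sinh, 1 - sigmoid(sinh(z)) decays roughly like exp(-exp(z)/2),
    # i.e., a doubly exponential order of saturation.
    return sigmoid(np.sinh(z))

z = np.linspace(0.0, 4.0, 5)
print(1.0 - sigmoid(z))    # decays like exp(-z)
print(1.0 - fast_gate(z))  # decays doubly exponentially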
2 Preliminaries
2.1 Time Scales in Gated RNNs
In this section, we review gated RNNs and their time scale
interpretation (Tallec and Ollivier 2018). We begin with an
LSTM (Hochreiter and Schmidhuber 1997), which is one
of the most popular RNNs. An LSTM has a memory cell
c_t ∈ R^n and hidden state h_t ∈ R^n inside, which are updated
depending on the sequential input data x_t at each time step
t = 1, 2, ... by

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t  (1)
h_t = o_t ⊙ tanh(c_t)  (2)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)  (3)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)  (4)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)  (5)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)  (6)

where W_∗, U_∗, and b_∗ are weight and bias parameters for
each ∗ ∈ {f, i, c, o}. The sigmoid function σ is defined as

σ(x) = 1 / (1 + e^{−x}).  (7)
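For reference, the update equations (1)–(6) can be written as a single-step function as follows. This is a minimal, batch-free NumPy sketch; the parameter containers W, U, b are hypothetical dictionaries keyed by the gate names and are not drawn from any particular library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by 'f', 'i', 'c', 'o' as in Eqs. (3)-(6).
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate, Eq. (3)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate,  Eq. (4)
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate,   Eq. (5)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate, Eq. (6)
    c_t = f_t * c_prev + i_t * c_tilde   # memory cell update, Eq. (1)
    h_t = o_t * np.tanh(c_t)             # hidden state,       Eq. (2)
    return h_t, c_t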
f_t, i_t, o_t ∈ [0, 1]^n are called forget, input, and output gates,
respectively. They were initially motivated as a binary mech-
anism, i.e., switching on and off, allowing information to
pass through (Gers, Schmidhuber, and Cummins 2000). The
forget gate has been reinterpreted as the representation for
time scales of memory cells (Tallec and Ollivier 2018). Fol-
lowing that study, we simplify Eq. (1) by assuming c̃_t = 0
for an interval t ∈ [t_0, t_1]. Then, we obtain

c_{t_1} = f_{t_1} ⊙ c_{t_1−1}  (8)
        = f̄^{t_1−t_0} ⊙ c_{t_0},  (9)

where f̄ = (∏_{s=t_0+1}^{t_1} f_s)^{1/(t_1−t_0)} is the (entry-wise)
geometric mean of the values of the forget gate. Through Eq. (8),
the memory cell c_t loses its information on data up to time t_0
exponentially, and each entry of f̄ represents its (averaged)
decay rate. This indicates that, in order to capture long-term
dependencies of the sequential data, the forget gate should take
values near 1 on average. We refer to the associated time
constant¹ T = −1/log f̄ as the time scale of the units, which
has been empirically shown to illustrate well the temporal
behavior of LSTMs (Mahto et al. 2021).
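The following small calculation (a sketch with hypothetical gate values) illustrates how close the averaged forget gate f̄ must be to 1 to realize long time scales T = −1/log f̄:

import numpy as np

# Time scale T = -1 / log(f_bar) for a constant (averaged) forget gate value.
for f_bar in [0.9, 0.99, 0.999, 0.9999]:
    T = -1.0 / np.log(f_bar)
    print(f"f_bar = {f_bar}: time scale T ~ {T:.0f} steps")
# Capturing dependencies over thousands of steps requires f_bar >= 0.999,
# i.e., forget gate values very close to 1.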
The above argument applies not only to an LSTM, but
also to general gated RNNs including a GRU (Cho et al.
2014) with a state update of the form

h_t = f_t ⊙ h_{t−1} + i_t ⊙ h̃_t,  (10)

where h_t, f_t, and i_t denote the state, forget gate, and input
gate, respectively, and h̃_t is the activation representing new
information at time t. Here again, the forget gate f_t controls
the time scale of each unit of the state.
2.2 Saturation in Gating Activation Functions
The sigmoid function σ(z) in the gating mechanism requires a
large z to take a value near 1 as the output. On the other hand,
the derivative σ′(z) takes exponentially small values for
z ≫ 0 (Fig. 1). Thus, when a gated model needs to learn
large gate values such as 0.99 with gradient methods, param-
eters in the gate cannot be effectively updated due to gradient
vanishing. This is called saturation of bounded activation
functions (Gulcehre et al. 2016). The behavior of gate func-
tions on the saturating regime is important for gated RNNs
¹An exponential function F(t) = e^{−αt} of time t decreases by
a factor of 1/e in time T = 1/α, which is called the time constant.
because forget gate values need to be large to represent long
time scales as explained in Section 2.1. That is, gated RNNs
must face saturation of the forget gate to learn long time
scales. Thus, it is hypothesized that saturation causes diffi-
culty in training gated RNNs for data with extremely long
time scales (Chandar et al. 2019; Gu et al. 2020b).
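To make the scale of this gradient vanishing concrete, the following sketch (with hypothetical target gate values) computes the pre-activation z = σ^{−1}(f) needed for a given forget gate value f and the sigmoid derivative there:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for f in [0.9, 0.99, 0.999]:
    z = np.log(f / (1.0 - f))                 # z = sigmoid^{-1}(f)
    grad = sigmoid(z) * (1.0 - sigmoid(z))    # sigmoid'(z) = f (1 - f)
    print(f"f = {f}: z = {z:.2f}, sigmoid'(z) = {grad:.6f}")
# A gate value of 0.999 (time scale ~ 1000 steps) leaves a derivative of
# about 1e-3, so parameter updates through the gate become very small.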
3 Related Work
Although there is abundant literature on learning long-term
dependencies with RNNs, we outline the most related stud-
ies in this section due to the space limitation and provide
additional discussion of other studies in Appendix A.
Several studies investigate the time scale representation
of the forget gate function to improve learning on data in-
volving long-term dependencies (Tallec and Ollivier 2018;
Mahto et al. 2021). For example, performance of LSTM lan-
guage models can be improved by fixing the bias parameter
of the forget gate in accordance with a power law of time
scale distribution, which underlies natural language (Mahto
et al. 2021). Such techniques require us to know the appro-
priate time scales of data a priori, which is often difficult.
Note that this approach can be combined with our method
since it is complementary to our work.
Several modifications of the gate function have been pro-
posed to tackle the saturation problem. The noisy gradient
for a piece-wise linear gate function was proposed to prevent
the gradient from taking zero values (Gulcehre et al. 2016). This
training protocol includes hyperparameters controlling noise
level, which requires manual tuning. Furthermore, such a
stochastic approach can result in unstable training due to
gradient estimation bias (Bengio, Léonard, and Courville
2013). The refine gate (Gu et al. 2020b) was proposed as an-
other modification introducing a residual connection to push
the gate value to the boundaries. It is rather heuristic and
does not provide theoretical justification. It also requires ad-
ditional parameters for the auxiliary gate, which increases
the computational cost for both inference and training. In
contrast, our method theoretically improves learnability and
does not introduce any additional parameters. Another study
suggests that omitting gates other than the forget gate makes
training of models for long time scales easier (Van Der West-
huizen and Lasenby 2018). However, such simplification
may lose the expressive power of the model and limit its ap-
plication fields. Chandar et al. (2019) proposed an RNN with
a non-saturating activation function to directly avoid the gra-
dient vanishing due to saturation. Since its state and mem-
ory vector evolve in unbounded regions, the behavior of the
gradient can be unstable depending on tasks. Our method
mitigates the gradient vanishing by controlling the order of
saturation, while maintaining the bounded state transition.
4 Analysis on Saturation and Learnability
We discuss the learning behavior of the forget gate for long
time scales. First, we formulate a problem of learning long
time scales in a simplified setting. Next, we relate the effi-
ciency of learning on the problem to the saturation of the
gate functions. We conclude that the faster saturation makes
learning more efficient. All proofs for mathematical results
below are given in Appendix C.
4.1 Problem Setting
Recall Eq. (8), which describes the time scales of the mem-
ory cell c_t of an LSTM via exponential decay. Let the memory
cell at time t_1 be c_{t_1} = λc_{t_0} with λ ∈ [0, 1]. Requiring
long time scales corresponds to getting λ close to 1. Therefore,
we can consider a long-time-scale learning problem as minimizing
a loss function L that measures the discrepancy between c_{t_1}
and λc_{t_0}, where λ ∈ [0, 1] is a desired value close to 1. We
take L as the absolute loss, for example. Then, we obtain

L = |c_{t_1} − λc_{t_0}|  (11)
  = |f̄^{t_1−t_0} ⊙ c_{t_0} − λc_{t_0}|  (12)
  = c_{t_0}|f̄^{t_1−t_0} − λ|,  (13)

using Eq. (8). Let z_t = W_f x_t + U_f h_{t−1} + b_f, so that
f_t = σ(z_t). Since we are interested in the averaged value of
f_t, we consider z_t to be time-independent, that is, z_t = z, in
the same way as Tallec and Ollivier (2018). The problem is then
reduced to obtaining z that minimizes

L(z) = c_{t_0}|σ(z)^{t_1−t_0} − λ|.  (14)

We consider this as the minimal problem for analyzing the
learnability of the forget gate for long time scales. Note that
since the product c_{t_1} = f̄^{t_1−t_0} ⊙ c_{t_0} is taken
element-wise, we can consider this as a one-dimensional problem.
Furthermore, the global solution can be written explicitly as
z = σ^{−1}(λ^{1/(t_1−t_0)}), where σ^{−1} is the inverse of σ.
Next, we consider the learning dynamics of the model
on the aforementioned problem Eq. (14). RNNs are usu-
ally trained with gradient methods. Learning dynamics with
gradient methods can be analyzed by considering the limit of the
learning rate → 0, known as gradient flow (Harold and George
2003). Therefore, we consider the following gradient flow

dz/dτ = −∂L/∂z,  (15)

using the loss function introduced above. Here, τ denotes a time
variable for the learning dynamics, which should not be confused
with t representing the state transition. Our aim is to
investigate the convergence rate of a solution of the
differential equation Eq. (15) when σ in the forget gate is
replaced with another function φ.
4.2 Order of Saturation
To investigate the effect of the choice of gate function on the
convergence rate, we first define the candidate set F of
bounded functions for the gate function.

Definition 4.1. Let F be a set of differentiable and strictly
increasing surjective functions φ: R → [0, 1] such that the
derivative φ′ is monotone on z > z_0 for some z_0 ≥ 0.
F is a natural class of gating activation functions including σ.
As we explained in Section 2.2, gated RNNs suffer from gradient
vanishing due to saturation when learning long time scales. To
clarify the issue, we first show that saturation is inevitable
regardless of the choice of φ ∈ F.

Proposition 4.2. lim_{z→∞} φ′(z) = 0 holds for any φ ∈ F.
Nevertheless, choices of φ significantly affect the efficiency of
the training. When the target λ takes an extreme value near the
boundaries, the efficiency of training should depend on the
asymptotic behavior of φ(z) for z ≫ 0, that is, the rate at which
φ(z) converges as z → ∞. We call the convergence rate of φ(z) as
z → ∞ the order of saturation. More precisely, we define the
notion as follows²:

Definition 4.3. Let g: R → R be a decreasing function. φ ∈ F has
the order of saturation of O(g(z)) if
lim_{z→∞} g(az)/(1 − φ(z)) = 0 for some a > 0. For φ, φ̃ ∈ F, φ
has a higher order of saturation than φ̃ if
lim_{z→∞} (1 − φ(z))/(1 − φ̃(az)) = 0 holds for any a > 0 and
φ̃^{−1}(φ(z)) is convex for z ≫ 0.
Intuitively, the order of saturation of O(g(z)) means that the
convergence rate of φ to 1 is bounded by the decay rate of g up
to a constant multiplication of z. For example, the sigmoid
function σ satisfies e^{−az}/(1 − σ(z)) → 0 as z → ∞ for any
a > 1, and thus has the exponential order of saturation O(e^{−z}).
The convexity condition for a higher order of saturation is
rather technical, but it is automatically satisfied for typical
functions; see Appendix C.2. If φ has a higher order of
saturation (or saturates faster) than another function φ̃, then
φ(z) converges faster than φ̃(z) as z → ∞, and φ′(z) becomes
smaller than φ̃′(z). In this sense, training with φ̃ seems more
efficient than with φ on the above problem. However, this is not
the case, as we discuss in the next section.
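As a numerical sanity check of these definitions (a sketch; the constant a = 1.5 and the sample points are arbitrary), the first ratio below tends to 0 because σ has the order of saturation O(e^{−z}), and the second tends to 0 because σ has a higher order of saturation than the normalized softsign σ_ns introduced in the next subsection:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softsign_norm(z):
    # Normalized softsign: (softsign(z/2) + 1) / 2 with softsign(z) = z / (1 + |z|).
    return (z / 2.0 / (1.0 + np.abs(z / 2.0)) + 1.0) / 2.0

for z in [5.0, 10.0, 20.0, 30.0]:
    r_order = np.exp(-1.5 * z) / (1.0 - sigmoid(z))           # -> 0: order O(e^{-z}), a = 1.5 > 1
    r_higher = (1.0 - sigmoid(z)) / (1.0 - softsign_norm(z))   # -> 0: sigmoid saturates faster
    print(f"z = {z:4.1f}: {r_order:.3e}  {r_higher:.3e}")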
4.3 Efficient Learning via Fast Saturation
To precisely analyze learning behavior, we trace the learning
dynamics of the output value f = φ(z), since our purpose is to
obtain the desired output value rather than the input z. We
transform the learning dynamics (Eq. (15)) into that of f by

df/dτ = (dz/dτ)(df/dz) = −φ′(z) ∂L/∂z = −φ′(z)^2 ∂L/∂f.  (16)

To treat Eq. (16) as a dynamics purely of f, we define a function
g_φ(f) of f by g_φ(f) := φ′(φ^{−1}(f)), so that Eq. (16) becomes

df/dτ = −g_φ(f)^2 ∂L/∂f.  (17)

Our interest is in the dynamics of f near the boundary, i.e., the
limit of f → 1. We have the following result:
Theorem 4.4. Let φ, φ̃ ∈ F. If φ has a higher order of saturation
than φ̃, then g_φ(f)/g_{φ̃}(f) → ∞ as f → 1.
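To illustrate Theorem 4.4 for the two gates compared below, g_φ can be written in closed form: g_σ(f) = f(1 − f) for the sigmoid (since σ′(z) = σ(z)(1 − σ(z))) and g_{σ_ns}(f) = (1 − f)^2 for the normalized softsign when f ≥ 1/2 (a direct computation), so g_σ(f)/g_{σ_ns}(f) = f/(1 − f) diverges as f → 1. The following sketch evaluates this ratio:

def g_sigmoid(f):
    # g_phi(f) = phi'(phi^{-1}(f)) for the sigmoid: equals f (1 - f).
    return f * (1.0 - f)

def g_softsign_norm(f):
    # For the normalized softsign and f >= 1/2: equals (1 - f)^2.
    return (1.0 - f) ** 2

for f in [0.9, 0.99, 0.999, 0.9999]:
    print(f"f = {f}: g_sigmoid / g_softsign = {g_sigmoid(f) / g_softsign_norm(f):.1f}")
# The ratio grows like f / (1 - f) and diverges as f -> 1: by Eq. (17),
# the sigmoid output moves much faster near the boundary.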
Theorem 4.4 indicates that a higher order of saturation
accelerates the move of the output f near boundaries in
accordance with Eq. (17), since g_φ(f) takes larger values. Thus,
contrary to the intuition in Section 4.2, a higher order of
saturation leads to more efficient training for target values
near boundaries. We demonstrate this effect using two activation
functions: the sigmoid function σ(z) and the normalized softsign
function σ_ns(z) = (softsign(z/2) + 1)/2, where
softsign(z) = z/(1 + |z|). σ_ns is the softsign function modified
so that 0 ≤ σ_ns(z) ≤ 1 and σ_ns′(0) = σ′(0). σ has a higher
order of saturation than σ_ns, since σ has the order of
saturation O(e^{−z}) and σ_ns has O(z^{−1}) (see Fig. 1). We plot
the learning dynamics of the gradient flow for the problem in
Fig. 2. Since σ has a higher order of saturation than σ_ns, the
gate value f of σ_ns converges to the boundary more slowly.
Fig. 2 also shows the dynamics of gradient descent with learning
rate 1. While gradient descent is a discrete approximation of the
gradient flow, it behaves similarly to the gradient flow.

²Our definition of the asymptotic order is slightly different
from the usual one, which adopts lim sup_{z→∞} g(z)/(1 − φ(z)) < ∞,
since it is more suitable for analyzing training efficiency.

Figure 2: Learning curves for the simplified long-time-scale
learning problem with gradient descent (markers) and gradient
flow (solid lines). Gradient descent is done with learning rate 1.
The time difference t_1 − t_0 is set to 10. Dashed lines are the
lower bounds given in Tab. 1, fitted to each learning curve with
a suitable translation. These lower bounds approximate well the
asymptotic convergence of the gradient flow. Results of the
refine and fast gates are explained in Section 5.3.
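The gradient-descent curves in Fig. 2 can be reproduced qualitatively with a short script. This is a sketch only: it assumes λ = 1, c_{t_0} = 1, and an initial pre-activation z = 0, uses the learning rate 1 and t_1 − t_0 = 10 stated in the caption, and computes gradients numerically for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softsign_norm(z):
    return (z / 2.0 / (1.0 + np.abs(z / 2.0)) + 1.0) / 2.0

def run(gate, z0=0.0, steps=2000, lr=1.0, T=10, lam=1.0, eps=1e-6):
    # Gradient descent on L(z) = |gate(z)^T - lam| (Eq. (14) with c_{t_0} = 1),
    # using a central-difference gradient for simplicity.
    z = z0
    for _ in range(steps):
        grad = (abs(gate(z + eps) ** T - lam) - abs(gate(z - eps) ** T - lam)) / (2 * eps)
        z -= lr * grad
    return gate(z)

for name, gate in [("sigmoid", sigmoid), ("normalized softsign", softsign_norm)]:
    f = run(gate)
    print(f"{name}: gate value after training = {f:.6f}")
# The sigmoid gate approaches 1 markedly faster than the normalized
# softsign gate, matching the ordering of the curves in Fig. 2.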
Explicit convergence rates. Beyond Theorem 4.4, we
can explicitly calculate effective bounds of the convergence
rate for the problem when the activation function is the sigmoid
function σ(z) or the normalized softsign function σ_ns(z).
Proposition 4.5. Consider the problem in Section 4.1 with the
absolute loss L = |f^{t_1−t_0} − λ| with λ = 1. For the sigmoid
function f = σ(z), the convergence rate for the problem is
bounded as 1 − f = O(τ^{−1}). Similarly, for the normalized
softsign function f = σ_ns(z), the convergence rate is bounded
as 1 − f = O(τ^{−1/3}).
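A heuristic derivation of these rates (a sketch only; the rigorous proofs are given in Appendix C) follows from Eq. (17). Writing T = t_1 − t_0, the loss in Proposition 4.5 gives ∂L/∂f = −T f^{T−1} ≈ −T near f = 1. For the sigmoid, g_σ(f) = f(1 − f), so Eq. (17) yields d(1 − f)/dτ ≈ −T(1 − f)^2, which integrates to 1 − f = O(τ^{−1}). For the normalized softsign, g_{σ_ns}(f) = (1 − f)^2 for f ≥ 1/2, so d(1 − f)/dτ ≈ −T(1 − f)^4, giving 1 − f = O(τ^{−1/3}).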
Proposition 4.5 shows the quantitative effect of the difference
in the order of saturation on the convergence rates. We fit
the bounds to the learning curves with the gradient flow in
Fig. 2. The convergence rates of the learning are well
approximated by the bounds. This asymptotic analysis highlights
that the choice of the function φ significantly affects the
efficiency of training for long time scales.
5 Proposed Method
On the basis of the analysis in Section 4, we construct the
fast gate, which is suitable for learning long time scales.
5.1 Desirable Properties for Gate Functions
We consider modifying the usual sigmoid function to another
function φ ∈ F for the forget gate in a gated RNN. The function
φ should satisfy the following conditions.