
because forget gate values need to be large to represent long
time scales, as explained in Section 2.1. That is, gated RNNs
inevitably face saturation of the forget gate when learning long
time scales. Thus, it is hypothesized that saturation causes diffi-
culty in training gated RNNs on data with extremely long
time scales (Chandar et al. 2019; Gu et al. 2020b).
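As a rough illustration: under the exponential decay of Eq. (8), a constant forget value f̄ corresponds to a time scale T ≈ −1/log f̄, so representing T = 10^3 steps requires f̄ = e^{−10^{−3}} ≈ 0.999, i.e., a pre-activation z = σ^{−1}(f̄) ≈ 6.9, at which σ′(z) = f̄(1 − f̄) ≈ 10^{−3}; for T = 10^5, the derivative shrinks to roughly 10^{−5}. The gradient through the gate is thus suppressed precisely in the regime needed for long time scales.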
3 Related Work
Although there is abundant literature on learning long-term
dependencies with RNNs, we outline the most related stud-
ies in this section due to space limitations and provide
additional discussion of other studies in Appendix A.
Several studies investigate the time scale representation
of the forget gate function to improve learning on data in-
volving long-term dependencies (Tallec and Ollivier 2018;
Mahto et al. 2021). For example, the performance of LSTM language models can be improved by fixing the bias parameter of the forget gate in accordance with the power-law distribution of time scales that underlies natural language (Mahto et al. 2021). Such techniques require us to know the appropriate time scales of the data a priori, which is often difficult. Note that this approach is complementary to ours and can be combined with our method.
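A minimal sketch of this idea (the mapping b_f = σ^{−1}(e^{−1/T}) from a target time scale T to the forget-gate bias, and the heavy-tailed sampling distribution below, are illustrative assumptions rather than the exact recipe of Mahto et al. (2021)):

```python
import numpy as np

def forget_bias_from_timescales(timescales):
    """Map target time scales T to forget-gate biases b_f = logit(exp(-1/T)),
    so that each unit decays with its desired time scale at initialization."""
    f = np.exp(-1.0 / np.asarray(timescales, dtype=np.float64))  # target forget values
    return np.log(f / (1.0 - f))                                  # inverse sigmoid

# Illustrative: per-unit time scales drawn from a heavy-tailed (power-law-like) distribution.
rng = np.random.default_rng(0)
hidden_size = 256
timescales = rng.pareto(1.5, size=hidden_size) + 1.0   # T >= 1 with a heavy tail
b_f = forget_bias_from_timescales(timescales)
print(b_f.min(), b_f.max())   # long target time scales map to large (saturating) biases
```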
Several modifications of the gate function have been pro-
posed to tackle the saturation problem. The noisy gradient
for a piece-wise linear gate function was proposed to prevent
the gradient from taking zero values (Gulcehre et al. 2016). This
training protocol includes hyperparameters controlling the noise
level, which require manual tuning. Furthermore, such a
stochastic approach can result in unstable training due to
gradient estimation bias (Bengio, Léonard, and Courville
2013). The refine gate (Gu et al. 2020b) was proposed as an-
other modification introducing a residual connection to push
the gate value to the boundaries. It is rather heuristic and
does not provide theoretical justification. It also requires ad-
ditional parameters for the auxiliary gate, which increases
the computational cost for both inference and training. In
contrast, our method theoretically improves learnability and
does not introduce any additional parameters. Another study
suggests that omitting gates other than the forget gate makes
training of models for long time scales easier (Van Der West-
huizen and Lasenby 2018). However, such simplification
may reduce the expressive power of the model and limit its range of applications. Chandar et al. (2019) proposed an RNN with a non-saturating activation function to directly avoid the gradient vanishing caused by saturation. Since its state and memory vectors evolve in an unbounded region, the behavior of the gradient can be unstable depending on the task. Our method
mitigates the gradient vanishing by controlling the order of
saturation, while maintaining the bounded state transition.
4 Analysis on Saturation and Learnability
We discuss the learning behavior of the forget gate for long
time scales. First, we formulate a problem of learning long
time scales in a simplified setting. Next, we relate the effi-
ciency of learning on the problem to the saturation of the
gate functions. We conclude that faster saturation makes
learning more efficient. All proofs of the mathematical results
below are given in Appendix C.
4.1 Problem Setting
Recall Eq. (8), which describes the time scales of the memory cell c_t of an LSTM via exponential decay. Let the memory cell at time t_1 be c_{t_1} = λ c_{t_0} with λ ∈ [0, 1]. Requiring long time scales corresponds to getting λ close to 1. Therefore, we can consider a long-time-scale learning problem as minimizing a loss function L that measures the discrepancy between c_{t_1} and λ_* c_{t_0}, where λ_* ∈ [0, 1] is a desired value close to 1. We take L as the absolute loss for example. Then, we obtain

L = |c_{t_1} − λ_* c_{t_0}|                        (11)
  = |f̄^{t_1−t_0} c_{t_0} − λ_* c_{t_0}|           (12)
  = c_{t_0} |f̄^{t_1−t_0} − λ_*|,                  (13)

using Eq. (8). Let z_t = W_f x_t + U_f h_{t−1} + b_f, so that f_t = σ(z_t). Since we are interested in the averaged value of f_t, we consider z_t to be time-independent, that is, z_t = z, in the same way as Tallec and Ollivier (2018). The problem is then reduced to the problem of finding z that minimizes

L(z) = c_{t_0} |σ(z)^{t_1−t_0} − λ_*|.             (14)

We consider this as the minimal problem for analyzing the learnability of the forget gate for long time scales. Note that since the product c_{t_1} = f̄^{t_1−t_0} c_{t_0} is taken element-wise, we can consider this as a one-dimensional problem. Furthermore, the global solution can be explicitly written as z = σ^{−1}(λ_*^{1/(t_1−t_0)}), where σ^{−1} is the inverse of σ.
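For intuition, a small numerical check (the values of λ_*, t_1 − t_0, and c_{t_0} below are illustrative choices, not taken from our experiments) shows that this global solution lies deep in the saturated region of σ, where the local gradient factor σ′(z) is tiny:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inv_sigmoid(p):
    return np.log(p / (1.0 - p))

c_t0 = 1.0        # initial memory cell value (illustrative)
lam_star = 0.9    # desired decay over the horizon, close to 1
horizon = 1000    # t1 - t0

# Global solution of Eq. (14): z* = sigma^{-1}(lam_star^{1/(t1 - t0)})
f_star = lam_star ** (1.0 / horizon)                      # required per-step forget value
z_star = inv_sigmoid(f_star)
grad_scale = sigmoid(z_star) * (1.0 - sigmoid(z_star))    # sigma'(z*), local gradient factor

print(f_star)      # ~0.9999: forget gate must be nearly 1
print(z_star)      # ~9.16: deep in the saturated region
print(grad_scale)  # ~1e-4: gradient through the gate nearly vanishes
```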
Next, we consider the learning dynamics of the model on the aforementioned problem Eq. (14). RNNs are usually trained with gradient methods. Learning dynamics under gradient methods can be analyzed in the limit of learning rate → 0, known as gradient flow (Harold and George 2003). Therefore, we consider the following gradient flow

dz/dτ = −∂L/∂z,                                    (15)

using the loss function introduced above. Here, τ denotes a time variable for the learning dynamics, which should not be confused with t representing the state transition. Our aim is to investigate the convergence rate of a solution of the differential equation Eq. (15) when σ in the forget gate is replaced with another function φ.
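As a rough numerical illustration of this slowdown (an Euler discretization of Eq. (15) on the special case c_{t_0} = 1, t_1 − t_0 = 1 of Eq. (14); the step size, initialization, and targets are illustrative), the number of steps needed grows roughly tenfold for each additional 'nine' in the target λ_*:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def steps_to_reach(lam_star, eta=0.1, z0=0.0, max_steps=2_000_000):
    """Euler steps of the gradient flow dz/dtau = -dL/dz for
    L(z) = |sigma(z) - lam_star| (Eq. (14) with c_t0 = 1, t1 - t0 = 1)."""
    z = z0
    for step in range(max_steps):
        f = sigmoid(z)
        if f >= lam_star:          # target decay reached
            return step
        z += eta * f * (1.0 - f)   # -dL/dz = +sigma'(z) while sigma(z) < lam_star
    return max_steps

for lam_star in (0.99, 0.999, 0.9999):
    print(lam_star, steps_to_reach(lam_star))
# Each additional 'nine' in lam_star multiplies the required steps by roughly 10:
# the flow slows down exponentially as the gate saturates.
```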
4.2 Order of Saturation
To investigate the effect of the choice of gate function on the convergence rate, we first define the candidate set F of bounded functions for the gate function.
Definition 4.1. Let F be a set of differentiable and strictly increasing surjective functions φ : R → [0, 1] such that the derivative φ′ is monotone on z > z_0 for some z_0 ≥ 0.
F is a natural class of gating activation functions including σ. As we explained in Section 2.2, gated RNNs suffer from gradient vanishing due to saturation when learning long time scales. To clarify the issue, we first show that saturation is inevitable regardless of the choice of φ ∈ F.
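For example, the sigmoid σ satisfies the derivative condition with z_0 = 0, since σ′(z) = σ(z)(1 − σ(z)) is monotonically decreasing for z > 0.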