LONG SHORT-TERM MEMORY

Neural Computation 9(8):1735-1780, 1997

Sepp Hochreiter
Fakultät für Informatik
Technische Universität München
80290 München, Germany
hochreit@informatik.tu-muenchen.de
http://www7.informatik.tu-muenchen.de/~hochreit

Jürgen Schmidhuber
IDSIA
Corso Elvezia 36
6900 Lugano, Switzerland
juergen@idsia.ch
http://www.idsia.ch/~juergen

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error backflow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.
1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).

The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow though).

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).
2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).

Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems. (2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error flow. (3) Watrous and Kuhn's algorithm costs O(W^2) operations per time step, ours only O(W), where W is the number of weights. See also Miller and Giles (1993) for additional work on MUs.

Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.

Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

Conventional BPTT (e.g., Williams and Zipser 1992). Output unit k's target at time t is denoted by d_k(t). Using mean squared error, k's error signal is

\vartheta_k(t) = f_k'(net_k(t)) \, (d_k(t) - y^k(t)),

where y^i(t) = f_i(net_i(t)) is the activation of a non-input unit i with differentiable activation function f_i,

net_i(t) = \sum_j w_{ij} \, y^j(t-1)

is unit i's current net input, and w_{ij} is the weight on the connection from unit j to i. Some non-output unit j's backpropagated error signal is

\vartheta_j(t) = f_j'(net_j(t)) \sum_i w_{ij} \, \vartheta_i(t+1).

The corresponding contribution to w_{jl}'s total weight update is \alpha \, \vartheta_j(t) \, y^l(t-1), where \alpha is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:

\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f_v'(net_v(t-1)) \, w_{uv} & q = 1 \\
f_v'(net_v(t-q)) \sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)} \, w_{lv} & q > 1.
\end{cases}
\qquad (1)
With l_q = v and l_0 = u, we obtain:

\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n} \; \prod_{m=1}^{q} f_{l_m}'(net_{l_m}(t-m)) \, w_{l_m l_{m-1}}
\qquad (2)

(proof by induction). The sum of the n^{q-1} terms \prod_{m=1}^{q} f_{l_m}'(net_{l_m}(t-m)) \, w_{l_m l_{m-1}} determines the total error backflow (note that since the summation terms may have different signs, increasing the number of units n does not necessarily increase error flow).

Intuitive explanation of equation (2). If

|f_{l_m}'(net_{l_m}(t-m)) \, w_{l_m l_{m-1}}| > 1.0

for all m (as can happen, e.g., with linear f_{l_m}), then the largest product increases exponentially with q. That is, the error blows up, and conflicting error signals arriving at unit v can lead to oscillating weights and unstable learning (for error blow-ups or bifurcations see also Pineda 1988, Baldi and Pineda 1991, Doya 1992). On the other hand, if

|f_{l_m}'(net_{l_m}(t-m)) \, w_{l_m l_{m-1}}| < 1.0

for all m, then the largest product decreases exponentially with q. That is, the error vanishes, and nothing can be learned in acceptable time.

If f_{l_m} is the logistic sigmoid function, then the maximal value of f_{l_m}' is 0.25. If y^{l_{m-1}} is constant and not equal to zero, then |f_{l_m}'(net_{l_m}) \, w_{l_m l_{m-1}}| takes on maximal values where

w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\left(\frac{1}{2} net_{l_m}\right),

goes to zero for |w_{l_m l_{m-1}}| \to \infty, and is less than 1.0 for |w_{l_m l_{m-1}}| < 4.0 (e.g., if the absolute maximal weight value w_max is smaller than 4.0). Hence with conventional logistic sigmoid activation functions, the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase. In general the use of larger initial weights will not help though: as seen above, for |w_{l_m l_{m-1}}| \to \infty the relevant derivative goes to zero "faster" than the absolute weight can grow (also, some weights will have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either; it will not change the ratio of long-range error flow and short-range error flow. BPTT is too sensitive to recent distractions. (A very similar, more recent analysis was presented by Bengio et al. 1994.)

Global error flow. The local error flow analysis above immediately shows that global error flow vanishes, too. To see this, compute

\sum_{u:\, u \text{ output unit}} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)}.

Weak upper bound for scaling factor. The following, slightly extended vanishing error analysis also takes n, the number of units, into account. For q > 1, formula (2) can be rewritten as

(W_u{}^T)^T \, F'(t-1) \, \prod_{m=2}^{q-1} \left( W \, F'(t-m) \right) \, W_v \, f_v'(net_v(t-q)),

where the weight matrix W is defined by [W]_{ij} := w_{ij}, v's outgoing weight vector W_v is defined by [W_v]_i := [W]_{iv} = w_{iv}, u's incoming weight vector W_u{}^T is defined by [W_u{}^T]_i := [W]_{ui} = w_{ui}, and for m = 1, ..., q, F'(t-m) is the diagonal matrix of first order derivatives defined as [F'(t-m)]_{ij} := 0 if i \neq j, and [F'(t-m)]_{ij} := f_i'(net_i(t-m)) otherwise. Here T is the transposition operator, [A]_{ij} is the element in the i-th column and j-th row of matrix A, and [x]_i is the i-th component of vector x.

Using a matrix norm \| \cdot \|_A compatible with vector norm \| \cdot \|_x, we define

f'_{max} := \max_{m=1,\ldots,q} \{ \| F'(t-m) \|_A \}.

For \max_{i=1,\ldots,n} \{ |x_i| \} \le \| x \|_x we get |x^T y| \le n \, \| x \|_x \, \| y \|_x. Since

|f_v'(net_v(t-q))| \le \| F'(t-q) \|_A \le f'_{max},

we obtain the following inequality:

\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right|
\le n \, (f'_{max})^q \, \| W_v \|_x \, \| W_u{}^T \|_x \, \| W \|_A^{q-2}
\le n \, (f'_{max} \| W \|_A)^q.

This inequality results from

\| W_v \|_x = \| W e_v \|_x \le \| W \|_A \| e_v \|_x \le \| W \|_A
\quad \text{and} \quad
\| W_u{}^T \|_x = \| e_u W \|_x \le \| W \|_A \| e_u \|_x \le \| W \|_A,

where e_k is the unit vector whose components are 0 except for the k-th component, which is 1. Note that this is a weak, extreme case upper bound; it will be reached only if all \| F'(t-m) \|_A take on maximal values, and if the contributions of all paths across which error flows back from unit u to unit v have the same sign. Large \| W \|_A, however, typically result in small values of \| F'(t-m) \|_A, as confirmed by experiments (see, e.g., Hochreiter 1991).

For example, with norms

\| W \|_A := \max_r \sum_s |w_{rs}| \quad \text{and} \quad \| x \|_x := \max_r |x_r|,

we have f'_{max} = 0.25 for the logistic sigmoid. We observe that if |w_{ij}| \le w_{max} < \frac{4.0}{n} for all i, j, then \| W \|_A \le n \, w_{max} < 4.0 will result in exponential decay; by setting \tau := \frac{n \, w_{max}}{4.0} < 1.0, we obtain

\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \le n \, (\tau)^q.

We refer to Hochreiter's 1991 thesis for additional results.
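To make the decay concrete, the following minimal numerical sketch (not part of the original text; plain Python/NumPy with illustrative values) evaluates a single term of equation (2) along one path of length q, i.e., the product of factors f'(net) * w. With a logistic f (so f' <= 0.25) and |w| < 4.0 the product vanishes exponentially in q; with |f' w| > 1.0 it blows up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_factor(w, net, q):
    """One term of equation (2) along a single path of length q: the product of
    factors f'(net) * w, here with a logistic f and a constant net input
    (illustrative values, not from the paper)."""
    f_prime = sigmoid(net) * (1.0 - sigmoid(net))   # at most 0.25
    return (f_prime * w) ** q

for q in (1, 10, 50, 100):
    # |f' w| < 1 (w = 3.0): vanishes;  |f' w| > 1 (w = 6.0): blows up
    print(q, path_factor(w=3.0, net=0.0, q=q), path_factor(w=6.0, net=0.0, q=q))
```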
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error backflow is

\vartheta_j(t) = f_j'(net_j(t)) \, \vartheta_j(t+1) \, w_{jj}.

To enforce constant error flow through j, we require

f_j'(net_j(t)) \, w_{jj} = 1.0.

Note the similarity to Mozer's fixed time constant system (1992): a time constant of 1.0 is appropriate for potentially infinite time lags.[1]

[1] We do not use the expression "time constant" in the differential sense, as, e.g., Pearlmutter (1995).

The constant error carrousel. Integrating the differential equation above, we obtain

f_j(net_j(t)) = \frac{net_j(t)}{w_{jj}}

for arbitrary net_j(t). This means: f_j has to be linear, and unit j's activation has to remain constant:

y^j(t+1) = f_j(net_j(t+1)) = f_j(w_{jj} \, y^j(t)) = y^j(t).

In the experiments, this will be ensured by using the identity function f_j: f_j(x) = x, \forall x, and by setting w_{jj} = 1.0. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4).

Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight w_{ji}. Assume that the total error can be reduced by switching on unit j in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided i is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, w_{ji} will often receive conflicting weight update signals during this time (recall that j is linear): these signals will attempt to make w_{ji} participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight w_{kj}. The same w_{kj} has to be used for both retrieving j's content at certain times and preventing j from disturbing k at other times. As long as unit j is non-zero, w_{kj} will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make w_{kj} participate in (1) accessing the information stored in j and, at different times, (2) protecting unit k from being perturbed by j. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages j may suddenly start to cause avoidable errors in situations that already seemed under control, by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.

Of course, input and output weight conflicts are not specific for long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
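A minimal sketch (not from the original text) contrasting the naive constant error carrousel with an ordinary self-connected sigmoid unit: with the identity activation and w_jj = 1.0 the local backflow factor f_j'(net_j) w_jj is exactly 1.0 at every step, so the error neither vanishes nor explodes over long lags.

```python
def backflow(q, w_jj, f_prime):
    """Local error backflow through a single self-connected unit over q steps:
    each step multiplies the error by f'(net_j) * w_jj (Section 3.2)."""
    return (f_prime * w_jj) ** q

q = 1000
print("CEC (identity f, w_jj = 1.0):", backflow(q, w_jj=1.0, f_prime=1.0))    # stays 1.0
print("logistic unit (f' <= 0.25):  ", backflow(q, w_jj=1.0, f_prime=0.25))   # ~ 0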
4 LONG SHORT-TERM MEMORY

Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.

The resulting, more complex unit is called a memory cell (see Figure 1). The j-th memory cell is denoted c_j. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to net_{c_j}, c_j gets input from a multiplicative unit out_j (the "output gate"), and from another multiplicative unit in_j (the "input gate"). in_j's activation at time t is denoted by y^{in_j}(t), out_j's by y^{out_j}(t). We have

y^{out_j}(t) = f_{out_j}(net_{out_j}(t)); \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t));

where

net_{out_j}(t) = \sum_u w_{out_j u} \, y^u(t-1), \qquad net_{in_j}(t) = \sum_u w_{in_j u} \, y^u(t-1).

We also have

net_{c_j}(t) = \sum_u w_{c_j u} \, y^u(t-1).

The summation indices u may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see also paragraph on "network topology" below). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There even may be recurrent self-connections like w_{c_j c_j}. It is up to the user to define the network topology. See Figure 2 for an example.

At time t, c_j's output y^{c_j}(t) is computed as

y^{c_j}(t) = y^{out_j}(t) \, h(s_{c_j}(t)),

where the "internal state" s_{c_j}(t) is

s_{c_j}(0) = 0; \qquad s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) \, g(net_{c_j}(t)) \quad \text{for } t > 0.

The differentiable function g squashes net_{c_j}; the differentiable function h scales memory cell outputs computed from the internal state s_{c_j}.
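For illustration, here is a minimal NumPy sketch of one forward step of a single memory cell, written directly from the equations above (the 1997 formulation, without a forget gate). The weight vectors, the logistic gate activations, and the particular ranges chosen for g and h are assumptions for the example, not prescriptions from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(y_prev, s_prev, w_in, w_out, w_c, g, h):
    """One forward step of a single memory cell c_j (no forget gate, as in this
    paper). y_prev holds the activations y^u(t-1) of all source units u."""
    net_in  = w_in  @ y_prev                 # net_{in_j}(t)
    net_out = w_out @ y_prev                 # net_{out_j}(t)
    net_c   = w_c   @ y_prev                 # net_{c_j}(t)
    y_in, y_out = sigmoid(net_in), sigmoid(net_out)
    s = s_prev + y_in * g(net_c)             # s_{c_j}(t) = s_{c_j}(t-1) + y^in g(net_c)
    y_c = y_out * h(s)                       # y^{c_j}(t) = y^out h(s_{c_j}(t))
    return y_c, s

# toy usage with 5 source units; g and h chosen as centered sigmoids
# (the ranges [-2, 2] and [-1, 1] used later in Experiment 1)
rng = np.random.default_rng(0)
g = lambda x: 4.0 * sigmoid(x) - 2.0
h = lambda x: 2.0 * sigmoid(x) - 1.0
y_prev = rng.uniform(-1.0, 1.0, 5)
w_in, w_out, w_c = (rng.uniform(-0.2, 0.2, 5) for _ in range(3))
y_c, s = memory_cell_step(y_prev, 0.0, w_in, w_out, w_c, g, h)
print(y_c, s)
```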

[Figure 1 diagram: a memory cell c_j with input gate in_j, output gate out_j, squashing functions g and h, internal state s_{c_j}, and a self-recurrent connection of weight 1.0.]

Figure 1: Architecture of memory cell c_j (the box) and its gate units in_j, out_j. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.

Why gate units? To avoid input weight conflicts, in_j controls the error flow to memory cell c_j's input connections w_{c_j i}. To circumvent c_j's output weight conflicts, out_j controls the error flow from unit j's output connections. In other words, the net can use in_j to decide when to keep or override information in memory cell c_j, and out_j to decide when to access memory cell c_j and when to prevent other units from being perturbed by c_j (see Figure 1).

Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them.
The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding; preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)

Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.

Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell c_j, this includes net_{c_j}, net_{in_j}, net_{out_j}) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells[2], errors are propagated back through previous internal states s_{c_j}. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by output gate activation and h'. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g, it is scaled once more by input gate activation and g'. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).

[2] For intra-cellular backprop in a quite different context see also Doya and Yoshizawa (1989).

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives \partial s_{c_j} / \partial w_{il} need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time[3]: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.

[3] Following Schmidhuber (1989), we say that a recurrent net algorithm is local in space if the update complexity per time step and weight does not depend on network size. We say that a method is local in time if its storage requirements do not depend on input sequence length. For instance, RTRL is local in time but not in space. BPTT is local in space but not in time.
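The following fragment sketches the kind of bookkeeping this implies: one running partial \partial s_{c_j} / \partial w per weight feeding the cell, updated online, so storage does not grow with sequence length. The update shown is only meant to convey the flavor of appendix A.1 (which this excerpt does not reproduce); treat the exact formula as an assumption.

```python
def update_cell_partial(ds_dw, y_prev, y_in, g_prime_net_c):
    """Running partial d s_cj(t) / d w for one incoming weight of the cell body.
    One scalar per weight, updated online: storage is O(W) and independent of
    sequence length (local in time). Rough sketch only, in the spirit of
    appendix A.1; the exact truncated formulae live in the appendix."""
    return ds_dw + y_in * g_prime_net_c * y_prev

# toy check: the partial accumulates over time instead of being recomputed
ds = 0.0
for t in range(3):
    ds = update_cell_partial(ds, y_prev=0.5, y_in=0.8, g_prime_net_c=0.3)
print(ds)
```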
Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).

[Figure 2 diagram: a net with an input layer, a fully self-connected hidden layer containing two memory cell blocks of size 2 (with gate units in_1, out_1 and in_2, out_2), and an output layer.]

Figure 2: Example of a net with 8 input units, 4 output units, and 2 memory cell blocks of size 2. in_1 marks the input gate, out_1 marks the output gate, and cell_1/block_1 marks the first memory cell of block 1. cell_1/block_1's architecture is identical to the one in Figure 1, with gate units in_1 and out_1 (note that by rotating Figure 1 by 90 degrees anti-clockwise, it will match with the corresponding parts of Figure 2). The example assumes dense connectivity: each gate unit and each memory cell see all non-output units. For simplicity, however, outgoing weights of only one type of unit are shown for each layer. With the efficient, truncated update rule, error flows only through connections to output units, and through fixed self-connections within cell blocks (not shown here; see Figure 1). Error flow is truncated once it "wants" to leave memory cells or gate units. Therefore, no connection shown above serves to propagate error back to the unit from which the connection originates (except for connections to output units), although the connections themselves are modifiable. That's why the truncated LSTM algorithm is so efficient, despite its ability to bridge very long time lags. See text and appendix A.1 for details. Figure 2 actually shows the architecture used for Experiment 6a; only the bias of the non-input units is omitted.

Internal state drift and remedies. If memory cell c_j's inputs are mostly positive or mostly negative, then its internal state s_j will tend to drift away over time. This is potentially dangerous, for the h'(s_j) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate in_j towards zero. Although there is a tradeoff between the magnitudes of h'(s_j) on the one hand and of y^{in_j} and f_{in_j}' on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
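A small illustrative sketch of the two bias heuristics discussed above, i.e., solution (2) to the abuse problem and the drift remedy. The concrete values are hypothetical; the text only fixes -1, -2, -3 for the output gates of Experiment 1 in Section 5.

```python
import numpy as np

def initial_gate_biases(n_blocks):
    """Increasingly negative output gate biases, so memory cells start near zero
    and get "allocated" one after another (abuse problem, solution 2), and
    negative input gate biases that push initial input gate activations towards
    zero (drift remedy). Values here are assumed, not prescribed by the paper."""
    out_gate_bias = -np.arange(1, n_blocks + 1, dtype=float)   # -1, -2, -3, ...
    in_gate_bias = -np.ones(n_blocks)                          # assumed choice
    return in_gate_bias, out_gate_bias

print(initial_gate_biases(3))
```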
5 EXPERIMENTS

Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.

Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster[4] than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.

[4] It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).
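For concreteness, a minimal sketch of the weight guessing baseline referred to here: draw all weights at random until some task-specific evaluation reports success. The evaluation callback, weight range, and trial bound are placeholders, not anything specified in the paper.

```python
import numpy as np

def guess_weights(evaluate, n_weights, w_range=0.1, max_tries=1_000_000, seed=0):
    """Random weight guessing baseline: draw all net weights at random until the
    resulting net classifies all training sequences correctly. `evaluate` is a
    placeholder callback returning True on success."""
    rng = np.random.default_rng(seed)
    for trial in range(1, max_tries + 1):
        w = rng.uniform(-w_range, w_range, n_weights)
        if evaluate(w):
            return w, trial
    return None, max_tries
```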
What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags; there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.

We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps: (1) use current input to set the input units, (2) compute activations of hidden units (including input gates, output gates, memory cells), (3) compute output unit activations. Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.

For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.

Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).

Outline of experiments.

• Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors; we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms.
Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

• Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

• Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires to learn real-valued, conditional expectations of noisy targets, given the inputs.

• Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks never have been solved by other recurrent net algorithms.

• Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.

5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR

Task. Our first task is to learn the "embedded Reber grammar", e.g., Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors; we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.

[Figure 3 diagram: the Reber grammar as a directed graph with edges labeled B, T, S, X, P, V, E.]

Figure 3: Transition diagram for the Reber grammar.

[Figure 4 diagram: the embedded Reber grammar; after the initial B, an edge labeled T or P leads into a box (a copy of the Reber grammar of Figure 3), followed by the same symbol (T or P) and E.]

Figure 4: Transition diagram for the embedded Reber grammar. Each box represents a copy of the Reber grammar (see Figure 3).

Starting at the leftmost node of the directed graph in Figure 4, symbol strings are generated sequentially (beginning with the empty string) by following edges, and appending the associated symbols to the current string, until the rightmost node is reached. Edges are chosen randomly if there is a choice (probability: 0.5). The net's task is to read strings, one symbol at a time, and to permanently predict the next symbol (error signals occur at every time step). To correctly predict the symbol before last, the net has to remember the second symbol.
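A small generator sketch for such strings may help. The transition table below is written from the standard published Reber grammar diagram (Figure 3 did not survive extraction cleanly), so treat its details as an assumption; the embedded version wraps a Reber string in B T ... T E or B P ... P E, which is what forces the net to remember the second symbol.

```python
import random

# Assumed Reber grammar transitions (Figure 3): state -> [(symbol, next_state), ...]
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def reber_string():
    s, state = ['B'], 0
    while state != 5:
        sym, state = random.choice(REBER[state])   # each choice has probability 0.5
        s.append(sym)
    return s + ['E']

def embedded_reber_string():
    """Embedded Reber grammar (Figure 4): the second symbol (T or P) reappears
    as the symbol before last; the shortest strings have 9 symbols."""
    second = random.choice(['T', 'P'])
    return ['B', second] + reber_string() + [second, 'E']

print(''.join(embedded_reber_string()))
```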
Comparison. We compare LSTM to "Elman nets trained by Elman's training procedure" (ELM) (results taken from Cleeremans et al. 1989), Fahlman's "Recurrent Cascade-Correlation" (RCC) (results taken from Fahlman 1991), and RTRL (results taken from Smith and Zipser (1989), where only the few successful trials are listed). It should be mentioned that Smith and Zipser actually make the task easier by increasing the probability of short time lag exemplars. We didn't do this for LSTM.

Training/Testing. We use a local input/output representation (7 input units, 7 output units). Following Fahlman, we use 256 training strings and 256 separate test strings. The training set is generated randomly; training exemplars are picked randomly from the training set. Test sequences are generated randomly, too, but sequences already used in the training set are not used for testing. After string presentation, all activations are reinitialized with zeros. A trial is considered successful if all string symbols of all sequences in both test set and training set are predicted correctly; that is, if the output unit(s) corresponding to the possible next symbol(s) is(are) always the most active ones.

Architectures. Architectures for RTRL, ELM, and RCC are reported in the references listed above. For LSTM, we use 3 (4) memory cell blocks. Each block has 2 (1) memory cells. The output layer's only incoming connections originate at memory cells. Each memory cell and each gate unit receives incoming connections from all memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. The gate units are biased. These architecture parameters make it easy to store at least 3 input signals (architectures 3-2 and 4-1 are employed to obtain comparable numbers of weights for both architectures: 264 for 4-1 and 276 for 3-2). Other parameters may be appropriate as well, however. All sigmoid functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2]. All weights are initialized in [-0.2, 0.2], except for the output gate biases, which are initialized to -1, -2, and -3, respectively (see abuse problem, solution (2) of Section 4). We tried learning rates of 0.1, 0.2 and 0.5.

Results. We use 3 different, randomly generated pairs of training and test sets. With each such pair we run 10 trials with different initial weights. See Table 1 for results (mean of 30 trials). Unlike the other methods, LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

Importance of output gates. The experiment provides a nice example where the output gate is truly beneficial. Learning to store the first T or P should not perturb activations representing the more easily learnable transitions of the original Reber grammar. This is the job of the output gates. Without output gates, we did not achieve fast learning.

5.2 EXPERIMENT 2: NOISE-FREE AND NOISY SEQUENCES

Task 2a: noise-free sequences with long time lags. There are p + 1 possible input symbols denoted a_1, ..., a_{p-1}, a_p = x, a_{p+1} = y. a_i is "locally" represented by the (p+1)-dimensional vector whose i-th component is 1 (all other components are 0). A net with p + 1 input units and p + 1 output units sequentially observes input symbol sequences, one at a time, permanently trying to predict the next symbol; error signals occur at every single time step. To emphasize the "long time lag problem", we use a training set consisting of only two very similar sequences: (y, a_1, a_2, ..., a_{p-1}, y) and (x, a_1, a_2, ..., a_{p-1}, x). Each is selected with probability 0.5. To predict the final element, the net has to learn to store a representation of the first element for p time steps.

We compare "Real-Time Recurrent Learning" for fully recurrent nets (RTRL), "Back-Propagation Through Time" (BPTT), the sometimes very successful 2-net "Neural Sequence Chunker" (CH, Schmidhuber 1992b), and our new method (LSTM). In all cases, weights are initialized in [-0.2, 0.2]. Due to limited computation time, training is stopped after 5 million sequence presentations.
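For concreteness, a minimal sketch of the Task 2a training data described above: two possible sequences of length p + 1, each chosen with probability 0.5, in the local (p+1)-dimensional one-hot encoding (index conventions are illustrative).

```python
import numpy as np

def task2a_sequence(p, rng=np.random.default_rng()):
    """One Task 2a training sequence: (y, a_1, ..., a_{p-1}, y) or
    (x, a_1, ..., a_{p-1}, x), each with probability 0.5, with a_p = x and
    a_{p+1} = y in the local (p+1)-dimensional one-hot encoding."""
    first = int(rng.choice([p, p + 1]))              # symbol index of x or y
    symbols = [first] + list(range(1, p)) + [first]  # p + 1 symbols in total
    seq = np.zeros((len(symbols), p + 1))
    for t, i in enumerate(symbols):
        seq[t, i - 1] = 1.0                          # i-th component set to 1
    return seq

print(task2a_sequence(p=100).shape)                  # (101, 101)
```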
method   hidden units       # weights    learning rate   % of success       success after
RTRL     3                  ~ 170        0.05            "some fraction"    173,000
RTRL     12                 ~ 494        0.1             "some fraction"    25,000
ELM      15                 ~ 435                        0                  > 200,000
RCC      7-9                ~ 119-198                    50                 182,000
LSTM     4 blocks, size 1   264          0.1             100                39,740
LSTM     3 blocks, size 2   276          0.1             100                21,730
LSTM     3 blocks, size 2   276          0.2             97                 14,060
LSTM     4 blocks, size 1   264          0.5             97                 9,500
LSTM     3 blocks, size 2   276          0.5             100                8,440

Table 1: EXPERIMENT 1: Embedded Reber grammar: percentage of successful trials and number of sequence presentations until success for RTRL (results taken from Smith and Zipser 1989), "Elman net trained by Elman's procedure" (results taken from Cleeremans et al. 1989), "Recurrent Cascade-Correlation" (results taken from Fahlman 1991) and our new approach (LSTM). Weight numbers in the first 4 rows are estimates; the corresponding papers do not provide all the technical details. Only LSTM almost always learns to solve the task (only two failures out of 150 trials). Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster (the number of required training examples in the bottom row varies between 3,800 and 24,100).

A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is always below 0.25.

Architectures. RTRL: one self-recurrent hidden unit, p + 1 non-recurrent output units. Each layer has connections from all layers below. All units use the logistic activation function sigmoid in [0, 1]. BPTT: same architecture as the one trained by RTRL. CH: both net architectures like RTRL's, but one has an additional output for predicting the hidden unit of the other one (see Schmidhuber 1992b for details). LSTM: like with RTRL, but the hidden unit is replaced by a memory cell and an input gate (no output gate required). g is the logistic sigmoid, and h is the identity function h: h(x) = x, \forall x. Memory cell and input gate are added once the error has stopped decreasing (see abuse problem: solution (1) in Section 4).

Results. Using RTRL and a short 4 time step delay (p = 4), 7/9 of all trials were successful. No trial was successful with p = 10. With long time lags, only the neural sequence chunker and LSTM achieved successful trials, while BPTT and RTRL failed. With p = 100, the 2-net sequence chunker solved the task in only 1/3 of all trials. LSTM, however, always learned to solve the task. Comparing successful trials only, LSTM learned much faster. See Table 2 for details. It should be mentioned, however, that a hierarchical chunker can also always quickly solve this task (Schmidhuber 1992c, 1993).

Task 2b: no local regularities. With the task above, the chunker sometimes learns to correctly predict the final element, but only because of predictable local regularities in the input stream that allow for compressing the sequence. In an additional, more difficult task (involving many more different possible sequences), we remove compressibility by replacing the deterministic subsequence (a_1, a_2, ..., a_{p-1}) by a random subsequence (of length p - 1) over the alphabet a_1, a_2, ..., a_{p-1}. We obtain 2 classes (two sets of sequences) {(y, a_{i_1}, a_{i_2}, ..., a_{i_{p-1}}, y) | 1 <= i_1, i_2, ..., i_{p-1} <= p - 1} and {(x, a_{i_1}, a_{i_2}, ..., a_{i_{p-1}}, x) | 1 <= i_1, i_2, ..., i_{p-1} <= p - 1}. Again, every next sequence element has to be predicted. The only totally predictable targets, however, are x and y, which occur at sequence ends. Training exemplars are chosen randomly from the 2 classes. Architectures and parameters are the same as in Experiment 2a. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is below 0.25 at sequence end.
Method   Delay p   Learning rate   # weights   % Successful trials   Success after
RTRL     4         1.0             36          78                    1,043,000
RTRL     4         4.0             36          56                    892,000
RTRL     4         10.0            36          22                    254,000
RTRL     10        1.0-10.0        144         0                     > 5,000,000
RTRL     100       1.0-10.0        10404       0                     > 5,000,000
BPTT     100       1.0-10.0        10404       0                     > 5,000,000
CH       100       1.0             10506       33                    32,400
LSTM     100       1.0             10504       100                   5,040

Table 2: Task 2a: Percentage of successful trials and number of training sequences until success, for "Real-Time Recurrent Learning" (RTRL), "Back-Propagation Through Time" (BPTT), neural sequence chunking (CH), and the new method (LSTM). Table entries refer to means of 18 trials. With 100 time step delays, only CH and LSTM achieve successful trials. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

Results. As expected, the chunker failed to solve this task (so did BPTT and RTRL, of course). LSTM, however, was always successful. On average (mean of 18 trials), success for p = 100 was achieved after 5,680 sequence presentations. This demonstrates that LSTM does not require sequence regularities to work well.

Task 2c: very long time lags, no local regularities. This is the most difficult task in this subsection. To our knowledge no other recurrent net algorithm can solve it. Now there are p + 4 possible input symbols denoted a_1, ..., a_{p-1}, a_p, a_{p+1} = e, a_{p+2} = b, a_{p+3} = x, a_{p+4} = y. a_1, ..., a_p are also called "distractor symbols". Again, a_i is locally represented by the (p+4)-dimensional vector whose i-th component is 1 (all other components are 0). A net with p + 4 input units and 2 output units sequentially observes input symbol sequences, one at a time. Training sequences are randomly chosen from the union of two very similar subsets of sequences: {(b, y, a_{i_1}, a_{i_2}, ..., a_{i_{q+k}}, e, y) | 1 <= i_1, i_2, ..., i_{q+k} <= q} and {(b, x, a_{i_1}, a_{i_2}, ..., a_{i_{q+k}}, e, x) | 1 <= i_1, i_2, ..., i_{q+k} <= q}. To produce a training sequence, we (1) randomly generate a sequence prefix of length q + 2, (2) randomly generate a sequence suffix of additional elements (\neq b, e, x, y) with probability 9/10 or, alternatively, an e with probability 1/10. In the latter case, we (3) conclude the sequence with x or y, depending on the second element. For a given k, this leads to a uniform distribution on the possible sequences with length q + k + 4. The minimal sequence length is q + 4; the expected length is

4 + \sum_{k=0}^{\infty} \frac{1}{10} \left(\frac{9}{10}\right)^k (q + k) = q + 14.

The expected number of occurrences of element a_i, 1 <= i <= p, in a sequence is (q + 10)/p \approx q/p. The goal is to predict the last symbol, which always occurs after the "trigger symbol" e. Error signals are generated only at sequence ends. To predict the final element, the net has to learn to store a representation of the second element for at least q + 1 time steps (until it sees the trigger symbol e). Success is defined as "prediction error (for final sequence element) of both output units always below 0.2, for 10,000 successive, randomly chosen input sequences".
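A minimal sketch of this generation procedure (symbol indices only; the local (p+4)-dimensional one-hot encoding is omitted for brevity). Drawing distractors uniformly from a_1, ..., a_p is an assumption consistent with the expected-frequency statement above.

```python
import numpy as np

def task2c_sequence(p, q, rng=np.random.default_rng()):
    """One Task 2c training sequence, following the generation procedure above.
    Symbols are indices 1..p (distractors), p+1 = e, p+2 = b, p+3 = x, p+4 = y."""
    e, b, x, y = p + 1, p + 2, p + 3, p + 4
    second = int(rng.choice([x, y]))
    seq = [b, second] + list(rng.integers(1, p + 1, size=q))   # prefix of length q + 2
    while rng.random() < 0.9:                                  # suffix: more distractors...
        seq.append(int(rng.integers(1, p + 1)))
    seq += [e, second]                                         # ...until trigger e, then x or y
    return seq                                                 # minimal length q + 4

print(len(task2c_sequence(p=50, q=50)))
```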
Architecture/Learning. The net has p + 4 input units and 2 output units. Weights are initialized in [-0.2, 0.2]. To avoid too much learning time variance due to different weight initializations, the hidden layer gets two memory cells (two cell blocks of size 1, although one would be sufficient). There are no other hidden units. The output layer receives connections only from memory cells. Memory cells and gate units receive connections from input units, memory cells and gate units (i.e., the hidden layer is fully connected). No bias weights are used. h and g are logistic sigmoids with output ranges [-1, 1] and [-2, 2], respectively. The learning rate is 0.01.

q (time lag - 1)   p (# random inputs)   q/p   # weights   Success after
50                 50                    1     364         30,000
100                100                   1     664         31,000
200                200                   1     1264        33,000
500                500                   1     3064        38,000
1,000              1,000                 1     6064        49,000
1,000              500                   2     3064        49,000
1,000              200                   5     1264        75,000
1,000              100                   10    664         135,000
1,000              50                    20    364         203,000

Table 3: Task 2c: LSTM with very long minimal time lags q + 1 and a lot of noise. p is the number of available distractor symbols (p + 4 is the number of input units). q/p is the expected number of occurrences of a given distractor symbol in a sequence. The rightmost column lists the number of training sequences required by LSTM (BPTT, RTRL and the other competitors have no chance of solving this task). If we let the number of distractor symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. The lower block illustrates the expected slow-down due to increased frequency of distractor symbols. Note that the minimal time lag is q + 1; the net never sees short training sequences facilitating the classification of long test sequences.

Results. 20 trials were made for all tested pairs (p, q). Table 3 lists the mean of the number of training sequences required by LSTM to achieve success (BPTT and RTRL have no chance of solving non-trivial tasks with minimal time lags of 1000 steps).

Scaling. Table 3 shows that if we let the number of input symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. This is another remarkable property of LSTM not shared by any other method we are aware of. Indeed, RTRL and BPTT are far from scaling reasonably; instead, they appear to scale exponentially, and appear quite useless when the time lags exceed as few as 10 steps.

Distractor influence. In Table 3, the column headed by q/p gives the expected frequency of distractor symbols. Increasing this frequency decreases learning speed, an effect due to weight oscillations caused by frequently observed input symbols.
5.3 EXPERIMENT 3: NOISE AND SIGNAL ON SAME CHANNEL

This experiment serves to illustrate that LSTM does not encounter fundamental problems if noise and signal are mixed on the same input line. We initially focus on Bengio et al.'s simple 1994 "2-sequence problem"; in Experiment 3c we will then pose a more challenging 2-sequence problem.

Task 3a ("2-sequence problem"). The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2. Case N = 1: the first sequence element is 1.0 for class 1, and -1.0 for class 2. Case N = 3: the first three elements are 1.0 for class 1 and -1.0 for class 2. The target at the sequence end is 1.0 for class 1 and 0.0 for class 2. Correct classification is defined as "absolute output error at sequence end below 0.2". Given a constant T, the sequence length is randomly selected between T and T + T/10 (a difference to Bengio et al.'s problem is that they also permit shorter sequences of length T/2).

Guessing. Bengio et al. (1994) and Bengio and Frasconi (1994) tested 7 different methods on the 2-sequence problem. We discovered, however, that random weight guessing easily outperforms them all, because the problem is so simple. (It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance; Yoshua Bengio, personal communication, 1996.) See Schmidhuber and Hochreiter (1996) and Hochreiter and Schmidhuber (1996, 1997) for additional results in this vein.
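The following is a minimal sketch of a Task-3a data generator, written from the description above; the paper does not provide reference code, so the function name and encoding choices are our own assumptions.

```python
import random

def generate_2sequence_example(T, N, rng=random):
    """One Task-3a-style sequence and its target (illustrative sketch).

    The first N elements carry the class (+1.0 for class 1, -1.0 for class 2);
    all later elements are Gaussian noise with mean 0 and variance 0.2.
    """
    length = rng.randint(T, T + T // 10)          # sequence length in [T, T + T/10]
    cls = rng.choice([1, 2])
    signal = 1.0 if cls == 1 else -1.0
    noise_std = 0.2 ** 0.5                        # variance 0.2 -> standard deviation sqrt(0.2)
    seq = [signal] * N + [rng.gauss(0.0, noise_std) for _ in range(length - N)]
    target = 1.0 if cls == 1 else 0.0             # target presented at the sequence end
    return seq, target
```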
Table 4: Task 3a: Bengio et al.'s 2-sequence problem.

  T      N   stop: ST1   stop: ST2   # weights   ST2: fraction misclassified
  100    3   27,380      39,850      102         0.000195
  100    1   58,370      64,330      102         0.000117
  1000   3   446,850     452,460     102         0.000078

T is the minimal sequence length. N is the number of information-conveying elements at sequence begin. The column headed by ST1 (ST2) gives the number of sequence presentations required to achieve stopping criterion ST1 (ST2). The rightmost column lists the fraction of misclassified post-training sequences (with absolute error > 0.2) from a test set consisting of 2560 sequences (tested after ST2 was achieved). All values are means of 10 trials. We discovered, however, that this problem is so simple that random weight guessing solves it faster than LSTM and any other method for which there are published results.

LSTM architecture. We use a 3-layer net with 1 input unit, 1 output unit, and 3 cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells and gate units, and have bias weights. Gate units and the output unit are logistic sigmoids in [0, 1], h is in [-1, 1], and g is in [-2, 2].

Training/Testing. All weights (except the bias weights to gate units) are randomly initialized in the range [-0.1, 0.1]. The first input gate bias is initialized with -1.0, the second with -3.0, and the third with -5.0. The first output gate bias is initialized with -2.0, the second with -4.0 and the third with -6.0. The precise initialization values hardly matter though, as confirmed by additional experiments. The learning rate is 1.0. All activations are reset to zero at the beginning of a new sequence. We stop training (and judge the task as being solved) according to the following criteria: ST1: none of 256 sequences from a randomly chosen test set is misclassified. ST2: ST1 is satisfied, and mean absolute test set error is below 0.01. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 4. The results are means of 10 trials with different weight initializations in the range [-0.1, 0.1]. LSTM is able to solve this problem, though by far not as fast as random weight guessing (see paragraph "Guessing" above). Clearly, this trivial problem does not provide a very good testbed for comparing the performance of various non-trivial algorithms. Still, it demonstrates that LSTM does not encounter fundamental problems when faced with signal and noise on the same channel.

Task 3b. Architecture, parameters, etc. like in Task 3a, but now with Gaussian noise (mean and variance 0.2) added to the information-conveying elements (t <= N). We stop training (and judge the task as being solved) according to the following, slightly redefined criteria: ST1: less than 6 out of 256 sequences from a randomly chosen test set are misclassified. ST2: ST1 is satisfied, and mean absolute test set error is below 0.04. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 5. The results represent means of 10 trials with different weight initializations. LSTM easily solves the problem.

Task 3c. Architecture, parameters, etc. like in Task 3a, but with a few essential changes that make the task non-trivial: the targets are 0.2 and 0.8 for class 1 and class 2, respectively, and there is Gaussian noise on the targets (mean and variance 0.1; st. dev. 0.32). To minimize mean squared error, the system has to learn the conditional expectations of the targets given the inputs. Misclassification is defined as "absolute difference between output and noise-free target (0.2 for class 1 and 0.8 for class 2) > 0.1". The network output is considered acceptable if the mean absolute difference between noise-free target and output is below 0.015. Since this requires high weight precision, Task 3c (unlike 3a and 3b) cannot be solved quickly by random guessing.
Table 5: Task 3b: modified 2-sequence problem. Same as in Table 4, but now the information-conveying elements are also perturbed by noise.

  T      N   stop: ST1   stop: ST2   # weights   ST2: fraction misclassified
  100    3   41,740      43,250      102         0.00828
  100    1   74,950      78,430      102         0.01500
  1000   1   481,060     485,080     102         0.01207

Table 6: Task 3c: modified, more challenging 2-sequence problem. Same as in Table 4, but with noisy real-valued targets. The system has to learn the conditional expectations of the targets given the inputs. The rightmost column provides the average difference between network output and expected target. Unlike 3a and 3b, this task cannot be solved quickly by random weight guessing.

  T      N   stop      # weights   fraction misclassified   av. difference to mean
  100    3   269,650   102         0.00558                  0.014
  100    1   565,640   102         0.00441                  0.012

Training/Testing. The learning rate is 0.1. We stop training according to the following criterion: none of 256 sequences from a randomly chosen test set is misclassified, and the mean absolute difference between noise-free target and output is below 0.015. An additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 6. The results represent means of 10 trials with different weight initializations. Despite the noisy targets, LSTM still can solve the problem by learning the expected target values.

5.4 EXPERIMENT 4: ADDING PROBLEM

The difficult task in this section is of a type that has never been solved by other recurrent net algorithms. It shows that LSTM can solve long time lag problems involving distributed, continuous-valued representations.

Task. Each element of each input sequence is a pair of components. The first component is a real value randomly chosen from the interval [-1, 1]; the second is either 1.0, 0.0, or -1.0, and is used as a marker: at the end of each sequence, the task is to output the sum of the first components of those pairs that are marked by second components equal to 1.0. Sequences have random lengths between the minimal sequence length T and T + T/10. In a given sequence exactly two pairs are marked as follows: we first randomly select and mark one of the first ten pairs (whose first component we call X_1). Then we randomly select and mark one of the first T/2 - 1 still unmarked pairs (whose first component we call X_2). The second components of all remaining pairs are zero except for the first and final pair, whose second components are -1. (In the rare case where the first pair of the sequence gets marked, we set X_1 to zero.) An error signal is generated only at the sequence end: the target is 0.5 + (X_1 + X_2)/4.0 (the sum X_1 + X_2 scaled to the interval [0, 1]). A sequence is processed correctly if the absolute error at the sequence end is below 0.04.
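The following sketch generates one adding-problem sequence according to the task description above. It is our own illustration under the stated rules, not code from the paper; in particular, the handling of the rare "first pair marked" case is one reasonable reading of the text.

```python
import random

def generate_adding_example(T, rng=random):
    """One adding-problem sequence and its target (illustrative sketch)."""
    length = rng.randint(T, T + T // 10)
    values = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    markers = [0.0] * length
    markers[0], markers[-1] = -1.0, -1.0                      # first and final pair carry marker -1
    i1 = rng.randrange(0, 10)                                 # mark one of the first ten pairs
    i2 = rng.choice([i for i in range(T // 2 - 1) if i != i1])  # one of the first T/2 - 1 others
    markers[i1], markers[i2] = 1.0, 1.0
    x1, x2 = values[i1], values[i2]
    if i1 == 0:                                               # rare case: the first pair got marked,
        x1 = 0.0                                              # its contribution is defined as zero
    if i2 == 0:
        x2 = 0.0
    target = 0.5 + (x1 + x2) / 4.0                            # the sum X1 + X2 scaled to [0, 1]
    sequence = list(zip(values, markers))                     # pairs (value, marker)
    return sequence, target
```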
Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 cell blocks of size 2. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (i.e., the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. All non-input units have bias weights. These architecture parameters make it easy to store at least 2 input signals (a cell block size of 1 works well, too). All activation functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2].

State drift versus initial bias. Note that the task requires storing the precise values of real numbers for long durations; the system must learn to protect memory cell contents against even minor internal state drift (see Section 4). To study the significance of the drift problem, we make the task even more difficult by biasing all non-input units, thus artificially inducing internal state drift. All weights (including the bias weights) are randomly initialized in the range [-0.1, 0.1]. Following Section 4's remedy for state drifts, the first input gate bias is initialized with -3.0, the second with -6.0 (though the precise values hardly matter, as confirmed by additional experiments).

Training/Testing. The learning rate is 0.5. Training is stopped once the average training error is below 0.01, and the 2000 most recent sequences were processed correctly.

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than 3 incorrectly processed sequences. Table 7 shows details.

Table 7: Experiment 4: Results for the Adding Problem.

  T      minimal lag   # weights   # wrong predictions   Success after
  100    50            93          1 out of 2560         74,000
  500    250           93          0 out of 2560         209,000
  1000   500           93          1 out of 2560         853,000

T is the minimal sequence length, T/2 the minimal time lag. "# wrong predictions" is the number of incorrectly processed sequences (error > 0.04) from a test set containing 2560 sequences. The rightmost column gives the number of training sequences required to achieve the stopping criterion. All values are means of 10 trials. For T = 1000 the number of required training examples varies between 370,000 and 2,020,000, exceeding 700,000 in only 3 cases.

The experiment demonstrates: (1) LSTM is able to work well with distributed representations. (2) LSTM is able to learn to perform calculations involving continuous values. (3) Since the system manages to store continuous values without deterioration for minimal delays of T/2 time steps, there is no significant, harmful internal state drift.

5.5 EXPERIMENT 5: MULTIPLICATION PROBLEM

One may argue that LSTM is a bit biased towards tasks such as the Adding Problem from the previous subsection. Solutions to the Adding Problem may exploit the CEC's built-in integration capabilities. Although this CEC property may be viewed as a feature rather than a disadvantage (integration seems to be a natural subtask of many tasks occurring in the real world), the question arises whether LSTM can also solve tasks with inherently non-integrative solutions. To test this, we change the problem by requiring the final target to equal the product (instead of the sum) of earlier marked inputs.

Task. Like the task in Section 5.4, except that the first component of each pair is a real value randomly chosen from the interval [0, 1]. In the rare case where the first pair of the input sequence gets marked, we set X_1 to 1.0. The target at sequence end is the product X_1 * X_2.

Architecture. Like in Section 5.4. All weights (including the bias weights) are randomly initialized in the range [-0.1, 0.1].

Training/Testing. The learning rate is 0.1. We test performance twice: as soon as less than n_seq of the 2000 most recent training sequences lead to absolute errors exceeding 0.04, where n_seq = 140, and n_seq = 13. Why these values? n_seq = 140 is sufficient to learn storage of the relevant inputs. It is not enough though to fine-tune the precise final outputs. n_seq = 13, however, leads to quite satisfactory results.
Table 8: Experiment 5: Results for the Multiplication Problem.

  T     minimal lag   # weights   n_seq   # wrong predictions   MSE      Success after
  100   50            93          140     139 out of 2560       0.0223   482,000
  100   50            93          13      14 out of 2560        0.0139   1,273,000

T is the minimal sequence length, T/2 the minimal time lag. We test on a test set containing 2560 sequences as soon as less than n_seq of the 2000 most recent training sequences lead to error > 0.04. "# wrong predictions" is the number of test sequences with error > 0.04. MSE is the mean squared error on the test set. The rightmost column lists the numbers of training sequences required to achieve the stopping criterion. All values are means of 10 trials.

Results. For n_seq = 140 (n_seq = 13) with a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.026 (0.013), and there were never more than 170 (15) incorrectly processed sequences. Table 8 shows details. (A net with additional standard hidden units or with a hidden layer above the memory cells may learn the fine-tuning part more quickly.)

The experiment demonstrates: LSTM can solve tasks involving both continuous-valued representations and non-integrative information processing.
5.6 EXPERIMENT 6: TEMPORAL ORDER

In this subsection, LSTM solves other difficult (but artificial) tasks that have never been solved by previous recurrent net algorithms. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs.

Task 6a: two relevant, widely separated symbols. The goal is to classify sequences. Elements and targets are represented locally (input vectors with only one non-zero bit). The sequence starts with an E, ends with a B (the "trigger symbol") and otherwise consists of randomly chosen symbols from the set {a, b, c, d} except for two elements at positions t_1 and t_2 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t_1 is randomly chosen between 10 and 20, and t_2 is randomly chosen between 50 and 60. There are 4 sequence classes Q, R, S, U which depend on the temporal order of X and Y. The rules are: X, X -> Q; X, Y -> R; Y, X -> S; Y, Y -> U.

Task 6b: three relevant, widely separated symbols. Again, the goal is to classify sequences. Elements/targets are represented locally. The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d} except for three elements at positions t_1, t_2 and t_3 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t_1 is randomly chosen between 10 and 20, t_2 is randomly chosen between 33 and 43, and t_3 is randomly chosen between 66 and 76. There are 8 sequence classes Q, R, S, U, V, A, B, C which depend on the temporal order of the Xs and Ys. The rules are: X, X, X -> Q; X, X, Y -> R; X, Y, X -> S; X, Y, Y -> U; Y, X, X -> V; Y, X, Y -> A; Y, Y, X -> B; Y, Y, Y -> C.

There are as many output units as there are classes. Each class is locally represented by a binary target vector with one non-zero component. With both tasks, error signals occur only at the end of a sequence. The sequence is classified correctly if the final absolute error of all output units is below 0.3.

Architecture. We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2 and 4 (8) output units for Task 6a (6b). Again all non-input units have bias weights, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells and gate units (i.e., the hidden layer is fully connected; less connectivity may work as well). The architecture parameters for Task 6a (6b) make it easy to store at least 2 (3) input signals. All activation functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2].

Training/Testing. The learning rate is 0.5 (0.1) for Experiment 6a (6b). Training is stopped once the average training error falls below 0.1 and the 2000 most recent sequences were classified correctly. All weights are initialized in the range [-0.1, 0.1]. The first input gate bias is initialized with -2.0, the second with -4.0, and (for Experiment 6b) the third with -6.0 (again, we confirmed by additional experiments that the precise values hardly matter).

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than 3 incorrectly classified sequences. Table 9 shows details. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs. In Task 6a, for instance, the delays between first and second relevant input and between second relevant input and sequence end are at least 30 time steps.

Table 9: Experiment 6: Results for the Temporal Order Problem.

  task      # weights   # wrong predictions   Success after
  Task 6a   156         1 out of 2560         31,390
  Task 6b   308         2 out of 2560         571,100

"# wrong predictions" is the number of incorrectly classified sequences (error > 0.3 for at least one output unit) from a test set containing 2560 sequences. The rightmost column gives the number of training sequences required to achieve the stopping criterion. The results for Task 6a are means of 20 trials; those for Task 6b of 10 trials.

Typical solutions. In Experiment 6a, how does LSTM distinguish between the temporal orders (X, Y) and (Y, X)? One of many possible solutions is to store the first X or Y in cell block 1, and the second X/Y in cell block 2. Before the first X/Y occurs, block 1 can see that it is still empty by means of its recurrent connections. After the first X/Y, block 1 can close its input gate. Once block 1 is filled and closed, this fact will become visible to block 2 (recall that all gate units and all memory cells receive connections from all non-output units).

Typical solutions, however, require only one memory cell block. The block stores the first X or Y; once the second X/Y occurs, it changes its state depending on the first stored symbol. Solution type 1 exploits the connection between memory cell output and input gate unit; the following events cause different input gate activations: "X occurs in conjunction with a filled block"; "X occurs in conjunction with an empty block". Solution type 2 is based on a strong positive connection between memory cell output and memory cell input. The previous occurrence of X (Y) is represented by a positive (negative) internal state. Once the input gate opens for the second time, so does the output gate, and the memory cell output is fed back to its own input. This causes (X, Y) to be represented by a positive internal state, because X contributes to the new internal state twice (via current internal state and cell output feedback). Similarly, (Y, X) gets represented by a negative internal state.

5.7 SUMMARY OF EXPERIMENTAL CONDITIONS

The two tables in this subsection provide an overview of the most important LSTM parameters and architectural details for Experiments 1-6. The conditions of the simple experiments 2a and 2b differ slightly from those of the other, more systematic experiments, due to historical reasons.

Table 10: Summary of experimental conditions for LSTM, Part I.

  Task   p      lag    b   s   in     out   w       c   ogb           igb          bias   h    g    lr
  1-1    9      9      4   1   7      7     264     F   -1,-2,-3,-4   r            ga     h1   g2   0.1
  1-2    9      9      3   2   7      7     276     F   -1,-2,-3      r            ga     h1   g2   0.1
  1-3    9      9      3   2   7      7     276     F   -1,-2,-3      r            ga     h1   g2   0.2
  1-4    9      9      4   1   7      7     264     F   -1,-2,-3,-4   r            ga     h1   g2   0.5
  1-5    9      9      3   2   7      7     276     F   -1,-2,-3      r            ga     h1   g2   0.5
  2a     100    100    1   1   101    101   10504   B   no og         none         none   id   g1   1.0
  2b     100    100    1   1   101    101   10504   B   no og         none         none   id   g1   1.0
  2c-1   50     50     2   1   54     2     364     F   none          none         none   h1   g2   0.01
  2c-2   100    100    2   1   104    2     664     F   none          none         none   h1   g2   0.01
  2c-3   200    200    2   1   204    2     1264    F   none          none         none   h1   g2   0.01
  2c-4   500    500    2   1   504    2     3064    F   none          none         none   h1   g2   0.01
  2c-5   1000   1000   2   1   1004   2     6064    F   none          none         none   h1   g2   0.01
  2c-6   1000   1000   2   1   504    2     3064    F   none          none         none   h1   g2   0.01
  2c-7   1000   1000   2   1   204    2     1264    F   none          none         none   h1   g2   0.01
  2c-8   1000   1000   2   1   104    2     664     F   none          none         none   h1   g2   0.01
  2c-9   1000   1000   2   1   54     2     364     F   none          none         none   h1   g2   0.01
  3a     100    100    3   1   1      1     102     F   -2,-4,-6      -1,-3,-5     b1     h1   g2   1.0
  3b     100    100    3   1   1      1     102     F   -2,-4,-6      -1,-3,-5     b1     h1   g2   1.0
  3c     100    100    3   1   1      1     102     F   -2,-4,-6      -1,-3,-5     b1     h1   g2   0.1
  4-1    100    50     2   2   2      1     93      F   r             -3,-6        all    h1   g2   0.5
  4-2    500    250    2   2   2      1     93      F   r             -3,-6        all    h1   g2   0.5
  4-3    1000   500    2   2   2      1     93      F   r             -3,-6        all    h1   g2   0.5
  5      100    50     2   2   2      1     93      F   r             r            all    h1   g2   0.1
  6a     100    40     2   2   8      4     156     F   r             -2,-4        all    h1   g2   0.5
  6b     100    24     3   2   8      8     308     F   r             -2,-4,-6     all    h1   g2   0.1

1st column: task number. 2nd column: minimal sequence length p. 3rd column: minimal number of steps between most recent relevant input information and teacher signal. 4th column: number of cell blocks b. 5th column: block size s. 6th column: number of input units in. 7th column: number of output units out. 8th column: number of weights w. 9th column: c describes connectivity: "F" means "output layer receives connections from memory cells; memory cells and gate units receive connections from input units, memory cells and gate units"; "B" means "each layer receives connections from all layers below". 10th column: initial output gate bias ogb, where "r" stands for "randomly chosen from the interval [-0.1, 0.1]" and "no og" means "no output gate used". 11th column: initial input gate bias igb (see 10th column). 12th column: which units have bias weights? "b1" stands for "all hidden units", "ga" for "only gate units", and "all" for "all non-input units". 13th column: the function h, where "id" is the identity function and "h1" is a logistic sigmoid in [-2, 2]. 14th column: the logistic function g, where "g1" is a sigmoid in [0, 1], "g2" in [-1, 1]. 15th column: learning rate.
Table 11: Summary of experimental conditions for LSTM, Part II.

  Task   select   interval       test set size   stopping criterion                success
  1      t1       [-0.2, 0.2]    256             training & test correctly pred.   see text
  2a     t1       [-0.2, 0.2]    no test set     after 5 million exemplars         ABS(0.25)
  2b     t2       [-0.2, 0.2]    10000           after 5 million exemplars         ABS(0.25)
  2c     t2       [-0.2, 0.2]    10000           after 5 million exemplars         ABS(0.2)
  3a     t3       [-0.1, 0.1]    2560            ST1 and ST2 (see text)            ABS(0.2)
  3b     t3       [-0.1, 0.1]    2560            ST1 and ST2 (see text)            ABS(0.2)
  3c     t3       [-0.1, 0.1]    2560            ST1 and ST2 (see text)            see text
  4      t3       [-0.1, 0.1]    2560            ST3(0.01)                         ABS(0.04)
  5      t3       [-0.1, 0.1]    2560            see text                          ABS(0.04)
  6a     t3       [-0.1, 0.1]    2560            ST3(0.1)                          ABS(0.3)
  6b     t3       [-0.1, 0.1]    2560            ST3(0.1)                          ABS(0.3)

1st column: task number. 2nd column: training exemplar selection, where "t1" stands for "randomly chosen from training set", "t2" for "randomly chosen from 2 classes", and "t3" for "randomly generated on-line". 3rd column: weight initialization interval. 4th column: test set size. 5th column: stopping criterion for training, where "ST3(e)" stands for "average training error below e and the 2000 most recent sequences were processed correctly". 6th column: success (correct classification) criterion, where "ABS(e)" stands for "absolute error of all output units at sequence end is below e".

6 DISCUSSION

Limitations of LSTM.

• The particularly efficient truncated backprop version of the LSTM algorithm will not easily solve problems similar to "strongly delayed XOR problems", where the goal is to compute the XOR of two widely separated inputs that previously occurred somewhere in a noisy sequence. The reason is that storing only one of the inputs will not help to reduce the expected error; the task is non-decomposable in the sense that it is impossible to incrementally reduce the error by first solving an easier subgoal. (A small illustrative instance generator for such a task is sketched after this list.) In theory, this limitation can be circumvented by using the full gradient (perhaps with additional conventional hidden units receiving input from the memory cells). But we do not recommend computing the full gradient for the following reasons: (1) It increases computational complexity. (2) Constant error flow through CECs can be shown only for truncated LSTM. (3) We actually did conduct a few experiments with non-truncated LSTM. There was no significant difference to truncated LSTM, exactly because outside the CECs error flow tends to vanish quickly. For the same reason full BPTT does not outperform truncated BPTT.

• Each memory cell block needs two additional units (input and output gate). In comparison to standard recurrent nets, however, this does not increase the number of weights by more than a factor of 9: each conventional hidden unit is replaced by at most 3 units in the LSTM architecture, increasing the number of weights by a factor of 3^2 in the fully connected case. Note, however, that our experiments use quite comparable weight numbers for the architectures of LSTM and competing approaches.

• Generally speaking, due to its constant error flow through CECs within memory cells, LSTM runs into problems similar to those of feedforward nets seeing the entire input string at once. For instance, there are tasks that can be quickly solved by random weight guessing but not by the truncated LSTM algorithm with small weight initializations, such as the 500-step parity problem (see introduction to Section 5). Here, LSTM's problems are similar to the ones of a feedforward net with 500 inputs, trying to solve 500-bit parity. Indeed LSTM typically behaves much like a feedforward net trained by backprop that sees the entire input. But that's also precisely why it so clearly outperforms previous approaches on many non-trivial tasks with significant search spaces.

• LSTM does not have any problems with the notion of "recency" that go beyond those of other approaches. All gradient-based approaches, however, suffer from a practical inability to precisely count discrete time steps. If it makes a difference whether a certain signal occurred 99 or 100 steps ago, then an additional counting mechanism seems necessary. Easier tasks, however, such as one that only requires making a difference between, say, 3 and 11 steps, do not pose any problems to LSTM. For instance, by generating an appropriate negative connection between memory cell output and input, LSTM can give more weight to recent inputs and learn decays where necessary.
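To make the "strongly delayed XOR" limitation above concrete, the following is a purely illustrative sketch of such a task instance generator (our own, not from the paper): the target depends on two widely separated binary inputs, so storing only one of them cannot reduce the expected error.

```python
import random

def delayed_xor_example(length=500, rng=random):
    """Two relevant binary inputs are buried at random positions in a noise
    sequence; the target is their XOR (illustrative sketch)."""
    pos1, pos2 = sorted(rng.sample(range(length), 2))
    bit1, bit2 = rng.randint(0, 1), rng.randint(0, 1)
    seq = [rng.uniform(-1.0, 1.0) for _ in range(length)]   # noise channel
    seq[pos1], seq[pos2] = float(bit1), float(bit2)          # the two relevant inputs
    target = bit1 ^ bit2                                     # non-decomposable: both bits needed
    return seq, target
```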
Advantages of LSTM.

• The constant error backpropagation within memory cells results in LSTM's ability to bridge very long time lags in case of problems similar to those discussed above.

• For long time lag problems such as those discussed in this paper, LSTM can handle noise, distributed representations, and continuous values. In contrast to finite state automata or hidden Markov models, LSTM does not require an a priori choice of a finite number of states. In principle it can deal with unlimited state numbers.

• For problems discussed in this paper LSTM generalizes well, even if the positions of widely separated, relevant inputs in the input sequence do not matter. Unlike previous approaches, ours quickly learns to distinguish between two or more widely separated occurrences of a particular element in an input sequence, without depending on appropriate short time lag training exemplars.

• There appears to be no need for parameter fine tuning. LSTM works well over a broad range of parameters such as learning rate, input gate bias and output gate bias. For instance, to some readers the learning rates used in our experiments may seem large. However, a large learning rate pushes the output gates towards zero, thus automatically countermanding its own negative effects.

• The LSTM algorithm's update complexity per weight and time step is essentially that of BPTT, namely O(1). This is excellent in comparison to other approaches such as RTRL. Unlike full BPTT, however, LSTM is local in both space and time.

7 CONCLUSION

Each memory cell's internal architecture guarantees constant error flow within its constant error carrousel CEC, provided that truncated backprop cuts off error flow trying to leak out of memory cells. This represents the basis for bridging very long time lags. Two gate units learn to open and close access to error flow within each memory cell's CEC. The multiplicative input gate affords protection of the CEC from perturbation by irrelevant inputs. Likewise, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.

Future work. To find out about LSTM's practical limitations we intend to apply it to real world data. Application areas will include (1) time series prediction, (2) music composition, and (3) speech processing. It will also be interesting to augment sequence chunkers (Schmidhuber 1992b, 1993) by LSTM to combine the advantages of both.

8 ACKNOWLEDGMENTS

Thanks to Mike Mozer, Wilfried Brauer, Nic Schraudolph, and several anonymous referees for valuable comments and suggestions that helped to improve a previous version of this paper (Hochreiter and Schmidhuber 1995). This work was supported by DFG grant SCHM 942/3-1 from "Deutsche Forschungsgemeinschaft".

APPENDIX

A.1 ALGORITHM DETAILS

In what follows, the index k ranges over output units, i ranges over hidden units, $c_j$ stands for the j-th memory cell block, $c_j^v$ denotes the v-th unit of memory cell block $c_j$, u, l, m stand for arbitrary units, and t ranges over all time steps of a given input sequence.
The gate unit logistic sigmoid (with range [0, 1]) used in the experiments is
$$f(x) = \frac{1}{1 + e^{-x}} . \qquad (3)$$
The function h (with range [-1, 1]) used in the experiments is
$$h(x) = \frac{2}{1 + e^{-x}} - 1 . \qquad (4)$$
The function g (with range [-2, 2]) used in the experiments is
$$g(x) = \frac{4}{1 + e^{-x}} - 2 . \qquad (5)$$

Forward pass. The net input and the activation of hidden unit i are
$$net_i(t) = \sum_u w_{iu}\, y^u(t-1), \qquad y^i(t) = f_i(net_i(t)) . \qquad (6)$$
The net input and the activation of $in_j$ are
$$net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1), \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t)) . \qquad (7)$$
The net input and the activation of $out_j$ are
$$net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)) . \qquad (8)$$
The net input $net_{c_j^v}$, the internal state $s_{c_j^v}$, and the output activation $y^{c_j^v}$ of the v-th memory cell of memory cell block $c_j$ are:
$$net_{c_j^v}(t) = \sum_u w_{c_j^v u}\, y^u(t-1), \qquad s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g\!\left(net_{c_j^v}(t)\right), \qquad y^{c_j^v}(t) = y^{out_j}(t)\, h(s_{c_j^v}(t)) . \qquad (9)$$
The net input and the activation of output unit k are
$$net_k(t) = \sum_{u:\ u \text{ not a gate}} w_{ku}\, y^u(t-1), \qquad y^k(t) = f_k(net_k(t)) .$$
The backward pass to be described later is based on the following truncated backprop formulae.

Approximate derivatives for truncated backprop. The truncated version (see Section 4) only approximates the partial derivatives, which is reflected by the "$\approx_{tr}$" signs in the notation below. It truncates error flow once it leaves memory cells or gate units. Truncation ensures that there are no loops across which an error that left some memory cell through its input or input gate can reenter the cell through its output or output gate. This in turn ensures constant error flow through the memory cell's CEC.
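To make the forward pass concrete, the following numpy sketch implements equations (7)-(9) for a single memory cell block; the helper functions follow (3)-(5). The variable names and the single-block restriction are our own simplifications, not code from the paper.

```python
import numpy as np

def f(x): return 1.0 / (1.0 + np.exp(-x))          # gate activation, range (0, 1), eq. (3)
def g(x): return 4.0 / (1.0 + np.exp(-x)) - 2.0     # cell input squashing, range (-2, 2), eq. (5)
def h(x): return 2.0 / (1.0 + np.exp(-x)) - 1.0     # cell output squashing, range (-1, 1), eq. (4)

def lstm_block_step(y_prev, s_prev, W_in, W_out, W_c):
    """One time step of a single memory cell block (equations (7)-(9)).

    y_prev : activations y^u(t-1) of all units feeding the block (1-D array)
    s_prev : internal states s_{c_j^v}(t-1) of the block's cells (1-D array)
    W_in, W_out : weight vectors of the input and output gate
    W_c : weight matrix of the cell units, shape (S, len(y_prev))
    """
    y_in = f(W_in @ y_prev)                  # input gate activation  y^{in_j}(t)
    y_out = f(W_out @ y_prev)                # output gate activation y^{out_j}(t)
    net_c = W_c @ y_prev                     # cell net inputs        net_{c_j^v}(t)
    s = s_prev + y_in * g(net_c)             # CEC update: s(t) = s(t-1) + y^{in}(t) g(net_c(t))
    y_c = y_out * h(s)                       # cell outputs           y^{c_j^v}(t)
    return y_c, s
```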
In the truncated backprop version, the following derivatives are replaced by zero:
$$\frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \ \ \forall u, \qquad \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \ \ \forall u, \qquad \frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \ \ \forall u .$$
Therefore we get
$$\frac{\partial y^{in_j}(t)}{\partial y^u(t-1)} = f'_{in_j}(net_{in_j}(t)) \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \ \ \forall u,$$
$$\frac{\partial y^{out_j}(t)}{\partial y^u(t-1)} = f'_{out_j}(net_{out_j}(t)) \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \ \ \forall u,$$
and
$$\frac{\partial y^{c_j}(t)}{\partial y^u(t-1)} = \frac{\partial y^{c_j}(t)}{\partial net_{out_j}(t)} \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j}(t)}{\partial net_{in_j}(t)} \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j}(t)}{\partial net_{c_j}(t)} \frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \ \ \forall u .$$
This implies for all $w_{lm}$ not on connections to $c_j^v, in_j, out_j$ (that is, $l \notin \{c_j^v, in_j, out_j\}$):
$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \sum_u \frac{\partial y^{c_j^v}(t)}{\partial y^u(t-1)} \frac{\partial y^u(t-1)}{\partial w_{lm}} \approx_{tr} 0 .$$
The truncated derivatives of output unit k are:
$$\frac{\partial y^k(t)}{\partial w_{lm}} = f'_k(net_k(t)) \left( \sum_{u:\ u \text{ not a gate}} w_{ku} \frac{\partial y^u(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right) \approx_{tr} \qquad (10)$$
$$f'_k(net_k(t)) \left( \sum_j \sum_{v=1}^{S_j} \delta_{c_j^v l}\, w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} + \sum_j \left( \delta_{in_j l} + \delta_{out_j l} \right) \sum_{v=1}^{S_j} w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} + \sum_{i:\ i \text{ hidden unit}} w_{ki} \frac{\partial y^i(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right)$$
$$= f'_k(net_k(t)) \begin{cases} y^m(t-1) & l = k \\ w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = c_j^v \\ \sum_{v=1}^{S_j} w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = in_j \ \text{OR}\ l = out_j \\ \sum_{i:\ i \text{ hidden unit}} w_{ki} \frac{\partial y^i(t-1)}{\partial w_{lm}} & l \ \text{otherwise} \end{cases} ,$$
where $\delta$ is the Kronecker delta ($\delta_{ab} = 1$ if $a = b$ and $0$ otherwise), and $S_j$ is the size of memory cell block $c_j$.

The truncated derivatives of a hidden unit i that is not part of a memory cell are:
$$\frac{\partial y^i(t)}{\partial w_{lm}} = f'_i(net_i(t)) \frac{\partial net_i(t)}{\partial w_{lm}} \approx_{tr} \delta_{li}\, f'_i(net_i(t))\, y^m(t-1) . \qquad (11)$$
(Note: here it would be possible to use the full gradient without affecting constant error flow through internal states of memory cells.)
Cell block $c_j$'s truncated derivatives are:
$$\frac{\partial y^{in_j}(t)}{\partial w_{lm}} = f'_{in_j}(net_{in_j}(t)) \frac{\partial net_{in_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1) . \qquad (12)$$
$$\frac{\partial y^{out_j}(t)}{\partial w_{lm}} = f'_{out_j}(net_{out_j}(t)) \frac{\partial net_{out_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{out_j l}\, f'_{out_j}(net_{out_j}(t))\, y^m(t-1) . \qquad (13)$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right) + y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}} \approx_{tr} \qquad (14)$$
$$\left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{in_j l} \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right) + \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}}$$
$$= \left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, g\!\left(net_{c_j^v}(t)\right) y^m(t-1) + \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) y^m(t-1) .$$
$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h(s_{c_j^v}(t)) + h'(s_{c_j^v}(t)) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}\, y^{out_j}(t) \approx_{tr} \qquad (15)$$
$$\delta_{out_j l} \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h(s_{c_j^v}(t)) + \left( \delta_{in_j l} + \delta_{c_j^v l} \right) h'(s_{c_j^v}(t)) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}\, y^{out_j}(t) .$$
To efficiently update the system at time t, the only (truncated) derivatives that need to be stored at time t-1 are $\frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}$, where $l = c_j^v$ or $l = in_j$.

Backward pass. We will describe the backward pass only for the particularly efficient "truncated gradient version" of the LSTM algorithm. For simplicity we will use equal signs even where approximations are made according to the truncated backprop equations above. The squared error at time t is given by
$$E(t) = \sum_{k:\ k \text{ output unit}} \left( t^k(t) - y^k(t) \right)^2 , \qquad (16)$$
where $t^k(t)$ is output unit k's target at time t. Time t's contribution to $w_{lm}$'s gradient-based update with learning rate $\alpha$ is
$$\Delta w_{lm}(t) = -\alpha \frac{\partial E(t)}{\partial w_{lm}} . \qquad (17)$$
We define some unit l's error at time step t by
$$e_l(t) := -\frac{\partial E(t)}{\partial net_l(t)} . \qquad (18)$$
Using (almost) standard backprop, we first compute updates for weights to output units ($l = k$), weights to hidden units ($l = i$) and weights to output gates ($l = out_j$). We obtain (compare formulae (10), (11), (13)):
$$l = k \ \text{(output)}: \qquad e_k(t) = f'_k(net_k(t)) \left( t^k(t) - y^k(t) \right) , \qquad (19)$$
$$l = i \ \text{(hidden)}: \qquad e_i(t) = f'_i(net_i(t)) \sum_{k:\ k \text{ output unit}} w_{ki}\, e_k(t) , \qquad (20)$$
$$l = out_j \ \text{(output gates)}: \qquad e_{out_j}(t) = f'_{out_j}(net_{out_j}(t)) \left( \sum_{v=1}^{S_j} h(s_{c_j^v}(t)) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t) \right) . \qquad (21)$$
For all possible l, time t's contribution to $w_{lm}$'s update is
$$\Delta w_{lm}(t) = \alpha\, e_l(t)\, y^m(t-1) . \qquad (22)$$
The remaining updates for weights to input gates ($l = in_j$) and to cell units ($l = c_j^v$) are less conventional. We define some internal state $s_{c_j^v}$'s error:
$$e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} = f_{out_j}(net_{out_j}(t))\, h'(s_{c_j^v}(t)) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t) . \qquad (23)$$
We obtain for $l = in_j$ or $l = c_j^v$, $v = 1, \ldots, S_j$:
$$-\frac{\partial E(t)}{\partial w_{lm}} = \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} . \qquad (24)$$
The derivatives of the internal states with respect to weights and the corresponding weight updates are as follows (compare expression (14)):
$$l = in_j \ \text{(input gates)}: \qquad \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}} + g(net_{c_j^v}(t))\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1) ; \qquad (25)$$
therefore time t's contribution to $w_{in_j m}$'s update is (compare expression (10)):
$$\Delta w_{in_j m}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} . \qquad (26)$$
Similarly we get (compare expression (14)):
$$l = c_j^v \ \text{(memory cells)}: \qquad \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}} + g'(net_{c_j^v}(t))\, y^{in_j}(t)\, y^m(t-1) ; \qquad (27)$$
therefore time t's contribution to $w_{c_j^v m}$'s update is (compare expression (10)):
$$\Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} . \qquad (28)$$
All we need to implement for the backward pass are equations (19), (20), (21), (22), (23), (25), (26), (27), (28). Each weight's total update is the sum of the contributions of all time steps.

Computational complexity. LSTM's update complexity per time step is
$$O(KH + KCS + HI + CSI) = O(W) , \qquad (29)$$
where K is the number of output units, C is the number of memory cell blocks, S is the size of the memory cell blocks, H is the number of hidden units, I is the (maximal) number of units forward-connected to memory cells, gate units and hidden units, and
$$W = KH + KCS + CSI + 2CI + HI = O(KH + KCS + CSI + HI)$$
is the number of weights. Expression (29) is obtained by considering all computations of the backward pass: equation (19) needs K steps; (20) needs KH steps; (21) needs KSC steps; (22) needs K(H + C) steps for output units, HI steps for hidden units, CI steps for output gates; (23) needs KCS steps; (25) needs CSI steps; (26) needs CSI steps; (27) needs CSI steps; (28) needs CSI steps. The total is K + 2KH + KC + 2KSC + HI + CI + 4CSI steps, or O(KH + KSC + HI + CSI) steps. We conclude: the LSTM algorithm's update complexity per time step is just like BPTT's for a fully recurrent net.

At a given time step, only the 2CSI most recent $\frac{\partial s_{c_j^v}}{\partial w_{lm}}$ values from equations (25) and (27) need to be stored. Hence LSTM's storage complexity also is O(W); it does not depend on the input sequence length.

A.2 ERROR FLOW

We compute how much an error signal is scaled while flowing back through a memory cell for q time steps. As a by-product, this analysis reconfirms that the error flow within a memory cell's CEC is indeed constant, provided that truncated backprop cuts off error flow trying to leave memory cells (see also Section 3.2). The analysis also highlights a potential for undesirable long-term drifts of $s_{c_j}$ (see (2) below), as well as the beneficial, countermanding influence of negatively biased input gates (see (3) below). Using the truncated backprop learning rule, we obtain
$$\frac{\partial s_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} = 1 + \frac{\partial y^{in_j}(t-k)}{\partial s_{c_j}(t-k-1)}\, g\!\left(net_{c_j}(t-k)\right) + y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \frac{\partial net_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} \qquad (30)$$
$$= 1 + \sum_u \frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)}\, g\!\left(net_{c_j}(t-k)\right) + y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \sum_u \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} \approx_{tr} 1 .$$
The $\approx_{tr}$ sign indicates equality due to the fact that truncated backprop replaces by zero the following derivatives: $\frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)}\ \forall u$ and $\frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)}\ \forall u$.

In what follows, an error $\vartheta_j(t)$ starts flowing back at $c_j$'s output. We redefine
$$\vartheta_j(t) := \sum_i w_{i c_j}\, \vartheta_i(t+1) . \qquad (31)$$
Following the definitions/conventions of Section 3.1, we compute error flow for the truncated backprop learning rule. The error occurring at the output gate is
$$\vartheta_{out_j}(t) \approx_{tr} \frac{\partial y^{out_j}(t)}{\partial net_{out_j}(t)} \frac{\partial y^{c_j}(t)}{\partial y^{out_j}(t)}\, \vartheta_j(t) . \qquad (32)$$
The error occurring at the internal state is
$$\vartheta_{s_{c_j}}(t) = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)}\, \vartheta_{s_{c_j}}(t+1) + \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)}\, \vartheta_j(t) . \qquad (33)$$
Since we use truncated backprop we have $\vartheta_j(t) = \sum_{i,\ i \text{ no gate and no memory cell}} w_{i c_j}\, \vartheta_i(t+1)$; therefore we get
$$\frac{\partial \vartheta_j(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \sum_i w_{i c_j} \frac{\partial \vartheta_i(t+1)}{\partial \vartheta_{s_{c_j}}(t+1)} \approx_{tr} 0 . \qquad (34)$$
The previous equations (33) and (34) imply constant error flow through the internal states of memory cells:
$$\frac{\partial \vartheta_{s_{c_j}}(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)} \approx_{tr} 1 . \qquad (35)$$
The error occurring at the memory cell input is
$$\vartheta_{c_j}(t) = \frac{\partial g(net_{c_j}(t))}{\partial net_{c_j}(t)} \frac{\partial s_{c_j}(t)}{\partial g(net_{c_j}(t))}\, \vartheta_{s_{c_j}}(t) . \qquad (36)$$
The error occurring at the input gate is
$$\vartheta_{in_j}(t) \approx_{tr} \frac{\partial y^{in_j}(t)}{\partial net_{in_j}(t)} \frac{\partial s_{c_j}(t)}{\partial y^{in_j}(t)}\, \vartheta_{s_{c_j}}(t) . \qquad (37)$$

No external error flow. Errors are propagated back from units l to unit v along outgoing connections with weights $w_{lv}$. This "external error" (note that for conventional units there is nothing but external error) at time t is
$$\vartheta^e_v(t) = \frac{\partial y^v(t)}{\partial net_v(t)} \sum_l \frac{\partial net_l(t+1)}{\partial y^v(t)}\, \vartheta_l(t+1) . \qquad (38)$$
We obtain
$$\frac{\partial \vartheta^e_v(t-1)}{\partial \vartheta_j(t)} = \frac{\partial y^v(t-1)}{\partial net_v(t-1)} \left( \frac{\partial \vartheta_{out_j}(t)}{\partial \vartheta_j(t)} \frac{\partial net_{out_j}(t)}{\partial y^v(t-1)} + \frac{\partial \vartheta_{in_j}(t)}{\partial \vartheta_j(t)} \frac{\partial net_{in_j}(t)}{\partial y^v(t-1)} + \frac{\partial \vartheta_{c_j}(t)}{\partial \vartheta_j(t)} \frac{\partial net_{c_j}(t)}{\partial y^v(t-1)} \right) \approx_{tr} 0 . \qquad (39)$$
We observe: the error $\vartheta_j$ arriving at the memory cell output is not backpropagated to units v via external connections to $in_j, out_j, c_j$.

Error flow within memory cells. We now focus on the error backflow within a memory cell's CEC. This is actually the only type of error flow that can bridge several time steps. Suppose error $\vartheta_j(t)$ arrives at $c_j$'s output at time t and is propagated back for q steps until it reaches $in_j$ or the memory cell input $g(net_{c_j})$. It is scaled by a factor of $\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_j(t)}$, where $v = in_j, c_j$. We first compute
$$\frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)} \approx_{tr} \begin{cases} \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} & q = 0 \\[4pt] \frac{\partial s_{c_j}(t-q+1)}{\partial s_{c_j}(t-q)} \frac{\partial \vartheta_{s_{c_j}}(t-q+1)}{\partial \vartheta_j(t)} & q > 0 \end{cases} . \qquad (40)$$
Expanding equation (40), we obtain
$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_j(t)} \approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)} \frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)} \approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)} \left( \prod_{m=q}^{1} \frac{\partial s_{c_j}(t-m+1)}{\partial s_{c_j}(t-m)} \right) \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} \qquad (41)$$
$$\approx_{tr} y^{out_j}(t)\, h'(s_{c_j}(t)) \begin{cases} g'(net_{c_j}(t-q))\, y^{in_j}(t-q) & v = c_j \\ g(net_{c_j}(t-q))\, f'_{in_j}(net_{in_j}(t-q)) & v = in_j \end{cases} .$$
Consider the factors in the previous equation's last expression. Obviously, error flow is scaled only at times t (when it enters the cell) and t-q (when it leaves the cell), but not in between (constant error flow through the CEC). We observe:
(1) The output gate's effect is: $y^{out_j}(t)$ scales down those errors that can be reduced early during training without using the memory cell. Likewise, it scales down those errors resulting from using (activating/deactivating) the memory cell at later training stages; without the output gate, the memory cell might for instance suddenly start causing avoidable errors in situations that already seemed under control (because it was easy to reduce the corresponding errors without memory cells). See "output weight conflict" and "abuse problem" in Sections 3/4.

(2) If there are large positive or negative $s_{c_j}(t)$ values (because $s_{c_j}$ has drifted since time step t-q), then $h'(s_{c_j}(t))$ may be small (assuming that h is a logistic sigmoid). See Section 4. Drifts of the memory cell's internal state $s_{c_j}$ can be countermanded by negatively biasing the input gate $in_j$ (see Section 4 and next point). Recall from Section 4 that the precise bias value does not matter much.

(3) $y^{in_j}(t-q)$ and $f'_{in_j}(net_{in_j}(t-q))$ are small if the input gate is negatively biased (assume $f_{in_j}$ is a logistic sigmoid). However, the potential significance of this is negligible compared to the potential significance of drifts of the internal state $s_{c_j}$.

Some of the factors above may scale down LSTM's overall error flow, but not in a manner that depends on the length of the time lag. The flow will still be much more effective than an exponentially (of order q) decaying flow without memory cells.
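The truncated backward pass of equations (19)-(28) in A.1 reduces to a handful of local updates per time step. The following numpy sketch is our own illustration (with invented variable names, for a single memory cell block feeding K output units; the hidden-unit updates of (20) and the output-unit weight updates are omitted). It assumes the forward-pass quantities of the current step and the stored partials from the previous step are available; it is not the authors' implementation.

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def lstm_block_backward_step(t_k, y_k, net_k, net_out, net_in, net_c, s,
                             y_prev, y_in, W_k_c, ds_dw_in, ds_dw_c, alpha):
    """One-step truncated updates for a single memory cell block, following
    equations (19), (21)-(23) and (25)-(28) (illustrative sketch).

    t_k, y_k, net_k : targets, activations, net inputs of the K output units
    W_k_c           : output weights from the block's S cells, shape (K, S)
    ds_dw_in        : stored partials d s_{c^v} / d w_{in m},  shape (S, len(y_prev))
    ds_dw_c         : stored partials d s_{c^v} / d w_{c^v m}, shape (S, len(y_prev))
    """
    # (19) output unit errors e_k(t), f_k logistic
    e_k = sigmoid(net_k) * (1.0 - sigmoid(net_k)) * (t_k - y_k)
    # (21) output gate error and (22) its weight update
    h_s = 2.0 * sigmoid(s) - 1.0
    e_out = sigmoid(net_out) * (1.0 - sigmoid(net_out)) * np.sum(h_s * (W_k_c.T @ e_k))
    dW_out = alpha * e_out * y_prev
    # (23) internal state errors e_{s_{c^v}}(t)
    h_prime = 2.0 * sigmoid(s) * (1.0 - sigmoid(s))
    e_s = sigmoid(net_out) * h_prime * (W_k_c.T @ e_k)
    # (25)/(27) carry the stored partials forward, then (26)/(28) weight updates
    g_c = 4.0 * sigmoid(net_c) - 2.0
    g_prime = 4.0 * sigmoid(net_c) * (1.0 - sigmoid(net_c))
    f_in_prime = sigmoid(net_in) * (1.0 - sigmoid(net_in))
    ds_dw_in += np.outer(g_c * f_in_prime, y_prev)           # (25)
    ds_dw_c  += np.outer(g_prime * y_in, y_prev)              # (27)
    dW_in = alpha * np.sum(e_s[:, None] * ds_dw_in, axis=0)   # (26)
    dW_c  = alpha * e_s[:, None] * ds_dw_c                    # (28)
    return e_k, dW_out, dW_in, dW_c, ds_dw_in, ds_dw_c
```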
References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE 1st International Conference on Neural Networks, San Diego, volume 2, pages 609-618.

Baldi, P. and Pineda, F. (1991). Contrastive learning and neural oscillator. Neural Computation, 3:526-545.

Bengio, Y. and Frasconi, P. (1994). Credit assignment through time: Alternatives to backpropagation. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 75-82. San Mateo, CA: Morgan Kaufmann.

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166.

Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1:372-381.

de Vries, B. and Principe, J. C. (1991). A theory for neural networks with time delays. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 162-168. San Mateo, CA: Morgan Kaufmann.

Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Proceedings of 1992 IEEE International Symposium on Circuits and Systems, pages 2777-2780.

Doya, K. and Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks, 2:375-385.

Elman, J. L. (1988). Finding structure in time. Technical Report CRL 8801, Center for Research in Language, University of California, San Diego.

Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 190-196. San Mateo, CA: Morgan Kaufmann.

Hochreiter, J. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut fuer Informatik, Lehrstuhl Prof. Brauer, Technische Universitaet Muenchen. See www7.informatik.tu-muenchen.de/~hochreit.

Hochreiter, S. and Schmidhuber, J. (1995). Long short-term memory. Technical Report FKI-207-95, Fakultaet fuer Informatik, Technische Universitaet Muenchen.

Hochreiter, S. and Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "Long Short-Term Memory". In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal models in biological and artificial systems, pages 65-72. IOS Press, Amsterdam, Netherlands. Series: Frontiers in Artificial Intelligence and Applications, Volume 37.

Hochreiter, S. and Schmidhuber, J. (1997). LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9. MIT Press, Cambridge MA. Presented at NIPS 96.

Lang, K., Waibel, A., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43.

Miller, C. B. and Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):849-872.

Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3:349-381.

Mozer, M. C. (1992). Induction of multiscale temporal structure. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 4, pages 275-282. San Mateo, CA: Morgan Kaufmann.

Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263-269.

Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212-1228.

Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 19(59):2229-2232.

Pineda, F. J. (1988). Dynamics and architecture for neural computation. Journal of Complexity, 4:216-245.

Plate, T. A. (1993). Holographic recurrent networks. In S. J. Hanson, J. D. C. and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 34-41. San Mateo, CA: Morgan Kaufmann.

Pollack, J. B. (1991). Language induction by phase transition in dynamical recognizers. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 619-626. San Mateo, CA: Morgan Kaufmann.

Puskorius, G. V. and Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2):279-297.

Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. C. and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 115-122. Morgan Kaufmann.

Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.

Schmidhuber, J. (1989). The Neural Bucket Brigade: A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412.

Schmidhuber, J. (1992a). A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243-248.

Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242.

Schmidhuber, J. (1992c). Learning unambiguous reduced sequence descriptions. In Moody, J. E., Hanson, S. J., and Lippman, R. P., editors, Advances in Neural Information Processing Systems 4, pages 291-298. San Mateo, CA: Morgan Kaufmann.

Schmidhuber, J. (1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut fuer Informatik, Technische Universitaet Muenchen.

Schmidhuber, J. and Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms. Technical Report IDSIA-19-96, IDSIA.

Silva, G. X., Amaral, J. D., Langlois, T., and Almeida, L. B. (1996). Faster training of recurrent networks. In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal models in biological and artificial systems, pages 168-175. IOS Press, Amsterdam, Netherlands. Series: Frontiers in Artificial Intelligence and Applications, Volume 37.

Smith, A. W. and Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2):125-131.

Sun, G., Chen, H., and Lee, Y. (1993). Time warping invariant neural networks. In S. J. Hanson, J. D. C. and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 180-187. San Mateo, CA: Morgan Kaufmann.

Watrous, R. L. and Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4:406-414.

Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1.

Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science.

Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4:491-501.

Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.