LONG SHORT-TERM MEMORY

Neural Computation 9(8):1735-1780, 1997

Sepp Hochreiter
Fakultät für Informatik
Technische Universität München
80290 München, Germany
hochreit@informatik.tu-muenchen.de
http://www7.informatik.tu-muenchen.de/~hochreit

Jürgen Schmidhuber
IDSIA
Corso Elvezia 36
6900 Lugano, Switzerland
juergen@idsia.ch
http://www.idsia.ch/~juergen

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).
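As a concrete illustration of this exponential dependence, the following minimal sketch (our own illustration, not taken from the paper) tracks the error backpropagated through a single self-recurrent unit with a logistic activation: the error is rescaled at every step by roughly w * f'(net), so over 100 steps it vanishes, stays constant, or blows up depending on the self-connection weight w. The weight values below are arbitrary assumptions chosen to show the three regimes.

```python
# Sketch: per-step error scaling through one self-recurrent logistic unit.
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def error_scaling(w, net, steps):
    """Approximate |d error_t / d error_(t-steps)| for one self-recurrent unit."""
    factor = abs(w * logistic(net) * (1.0 - logistic(net)))  # per-step scaling w * f'(net)
    return factor ** steps                                   # compounded over the time lag

for w in (0.5, 4.0, 6.0):  # assumed self-connection weights: vanishing, constant, exploding
    print(f"w = {w}: error scaling over 100 steps = {error_scaling(w, 0.0, 100):.3e}")
```

With f'(0) = 0.25 for the logistic function, the per-step factors are 0.125, 1.0, and 1.5; only a factor of exactly 1.0 keeps the error flow constant, which is the intuition behind the constant error carrousel introduced below.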
The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).
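The sketch below shows a minimal forward pass of such a special unit: a memory cell whose internal state has an identity self-connection (the constant error carrousel) and whose write and read access is controlled by multiplicative input and output gates. The weight vectors, the use of tanh for the squashing functions, and the restriction of the gate inputs to the current external input are simplifying assumptions of ours; the paper's full algorithm, including the truncated gradient, is given in Appendix A.1.

```python
# Sketch of a single memory cell (input gate, output gate, constant error carrousel).
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(x, s_prev, w_in, w_out, w_c):
    """One forward step of a single memory cell driven by input vector x."""
    y_in = logistic(w_in @ x)    # input gate: opens/closes write access to the cell
    y_out = logistic(w_out @ x)  # output gate: opens/closes read access to the cell
    g = np.tanh(w_c @ x)         # squashed cell input (tanh assumed here)
    s = s_prev + y_in * g        # internal state with identity self-connection (CEC)
    y_c = y_out * np.tanh(s)     # gated cell output (tanh assumed here)
    return y_c, s

rng = np.random.default_rng(0)
x_dim = 4
w_in, w_out, w_c = (rng.normal(scale=0.1, size=x_dim) for _ in range(3))

s = 0.0
for t in range(5):
    y_c, s = memory_cell_step(rng.normal(size=x_dim), s, w_in, w_out, w_c)
    print(f"t = {t}: cell state = {s:+.4f}, cell output = {y_c:+.4f}")
```

Because the state's self-connection has a fixed weight of 1.0, error flowing back through s is not rescaled from step to step, which is exactly the constant error flow referred to in the abstract.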

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).

2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states.
See also beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).

Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems.
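To illustrate numerically why multiplicative gating protects stored information, the toy sketch below (our own illustration with assumed values, not the paper's setup) compares an unprotected additive state in the spirit of Sun et al. (1993), which absorbs every incoming net input and drifts away from the stored value, with a state whose write access is closed by a multiplicative input gate; the gate value of zero is an idealization of what a learned gate would approximate.

```python
# Toy contrast: unprotected additive state vs. multiplicatively gated state.
import numpy as np

rng = np.random.default_rng(1)
stored_value = 1.0            # information written at time 0
unprotected = stored_value
gated = stored_value

for t in range(1000):         # long time lag filled with irrelevant input
    net_input = 0.1 * rng.normal()   # scaled current net input (noise)
    unprotected += net_input         # additive update: state drifts over time
    input_gate = 0.0                 # closed input gate (idealized, normally learned)
    gated += input_gate * net_input  # gated write: stored value is preserved

print(f"unprotected state after 1000 steps: {unprotected:+.3f}")
print(f"gated state after 1000 steps:       {gated:+.3f}")
```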
