Dynamical-Systems Behavior in Recurrent and Non-Recurrent Connectionist Nets

An Honors Thesis Presented by Jason M. Eisner to The Department of Psychology in Partial Fulfillment of the Requirements for the Degree of Bachelor of Arts with Honors in the Subject of Psychology

Harvard-Radcliffe Colleges
Cambridge, Massachusetts

April 2, 1990
Abstract

A broad approach is developed for training dynamical behaviors in connectionist networks. General recurrent networks are powerful computational devices, necessary for difficult tasks like constraint satisfaction and temporal processing. These tasks are discussed here in some detail. From both theoretical and empirical considerations, it is concluded that such tasks are best addressed by recurrent networks that operate continuously in time, and further, that effective learning rules for these continuous-time networks must be able to prescribe their dynamical properties. A general class of such learning rules is derived and tested on simple problems. Where existing learning algorithms for recurrent and non-recurrent networks only attempt to train a network's position in activation space, the models presented here can also explicitly and successfully prescribe the nature of its movement through activation space.

I am indebted to Jay Rueckl, my advisor, both for his suggestions and for his support. Jay Rueckl and Greg Galperin provided computational facilities that proved indispensable. I would also like to thank my family and the many friends whose encouragement saw this project through its final stages.

Contents

1 Introduction
  1.1 The uses of recurrent networks
    1.1.1 Recurrent networks and constraint satisfaction
    1.1.2 Recurrent networks and temporal problems
    1.1.3 The computational power of recurrent networks
  1.2 Recurrent networks in practice
    1.2.1 Existing models of constraint satisfaction
    1.2.2 Existing temporal models
  1.3 Summary
2 Some Theoretical Observations
  2.1 Thoughts on temporal pattern processing
    2.1.1 The usefulness of gradual-response nets
    2.1.2 The dynamical systems approach
    2.1.3 Dynamics of small clusters
  2.2 Thoughts on constraint satisfaction
    2.2.1 The need for hidden units
    2.2.2 A first attempt at a satisfaction model
    2.2.3 A learning rule for the non-resonant case
    2.2.4 Extending our rule to the resonant case
    2.2.5 The problem with this approach
  2.3 Summary of Theoretical Observations
3 A General Model and its Learning Rule
  3.1 Conception of the general model
    3.1.1 Molding a dynamical system
    3.1.2 The role of input
    3.1.3 How to use the error measure
    3.1.4 How weight changes shift the trajectory
  3.2 Formal derivation of the general model
    3.2.1 Notation
    3.2.2 Calculating the gradient in weight space
    3.2.3 An algorithm
  3.3 Summary of the general model
4 Particular Models
  4.1 Some models of potential interest
  4.2 Some topologies of potential interest
  4.3 Detailed derivation of particular error measures
    4.3.1 Mapping model I: Nodes toward targets
    4.3.2 Mapping model II: System toward target
    4.3.3 General gradient-descent model
    4.3.4 Content-addressable memory model
5 Simulation Results
  5.1 Results for feedforward XOR
  5.2 Other tasks
6 Conclusions
1 Introduction

In the 1960's, Minsky and Papert pointed to hidden units as a potential remedy for some of connectionism's problems. Recurrent connections have lately been attracting the same kind of interest. Much as hidden units extend the computational power of perceptrons, recurrent connections extend the computational power of feedforward networks.

The work reported here is ultimately concerned with both recurrent and non-recurrent networks. However, it focuses on network properties that are most evident (and most useful) in the presence of recurrence. These are dynamical properties of networks: properties describing how networks' states change or remain stable over time.

The paper has three major aims, as follows. First, to highlight the features of recurrence that make it useful. Second, to demonstrate that certain network architectures exhibit especially rich kinds of behavior. Finally, to develop a training algorithm that can produce the desired behaviors in networks that use these architectures.

1.1 The uses of recurrent networks

Since it is useful to focus on actual problems, the early sections of this paper will pay special attention to two domains in which recurrent networks have proved especially useful. These are the constraint satisfaction domain and the temporal domain. In a constraint satisfaction problem, the network is supposed to discover any regularities that hold among various static inputs. The temporal domain includes all those tasks where a network's inputs and/or outputs are to change over time in a principled way.

1.1.1 Recurrent networks and constraint satisfaction

The general constraint satisfaction task is simple. Various patterns (vectors of numbers) are shown to a network. The network is supposed to discover regularities in the set of patterns it sees. When it is shown only part of a pattern, it should correctly fill in the missing elements.

The purest form of constraint satisfaction makes no distinction between input and output nodes of the network. There is simply a set of visible nodes, which hold the patterns. A partial pattern can be "clamped" onto some of the visible nodes (this means that the nodes' activations are held constant at the component values of the pattern) and the remaining, "free" visible nodes are supposed to assume a set of activations that could consistently complete the partial pattern.

It should be clear why recurrence is necessary for a network to perform this sort of operation. The net must be able to run both backwards and forwards, depending on which visible nodes are clamped: sometimes node i might have to influence node j, and sometimes vice-versa. If the network does contain mutually influential nodes like i and j, then its graph must contain cycles. In short, the network must be recurrent.

Note that constraint satisfaction techniques would often be helpful if applied to other problems. Any network whose input patterns are often incomplete, or partly erroneous, might do well to filter them through such a procedure at the outset. Ideally, this filtering would not even be a separate phase of the network's operation, but would develop naturally as the network's internal representations learned to feed back and influence its input.

1.1.2 Recurrent networks and temporal problems

Recurrent networks are also well-suited to temporal problems, because recurrent connections help a network preserve information from one moment to the next. They allow the net to affect its own subsequent behavior.

For some temporal problems, the network does not have to preserve particularly complex information. A simple record of past input may suffice. This was roughly the approach of Sejnowski and Rosenberg's NETtalk (1987). NETtalk had no recurrent connections, but its "moving window" provided continuity in the input stream. A certain amount of its past input could continue to affect it.

This approach is not very general. An input buffer of fixed size can only hold a limited number of past events, but some tasks require memory for input events that happened arbitrarily far in the past. For example, one may have to remember the subject of an arbitrarily long sentence. The "cumulative XOR" task illustrates the difficulty clearly. In this task, a network is presented with a continuous input stream of 0's and 1's, and is expected at every moment to output the cumulative XOR of all the input to date. The problem can be easily solved by a network that maintains just one bit of state information. To solve it by remembering actual input, however, would require a buffer of infinite length, because every bit is significant.
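The point is easy to verify in code. The following minimal sketch (mine, not the thesis's) compares the two strategies: a single carried bit of state computes the cumulative XOR exactly, while a fixed-size buffer must fail once the stream outgrows it, since every forgotten bit was significant.

```python
def cumulative_xor_with_state(stream):
    """Emit the XOR of all inputs so far, carrying one bit of state."""
    state = 0
    for bit in stream:
        state ^= bit                # the single state bit is updated in place
        yield state

def cumulative_xor_with_buffer(stream, buffer_size):
    """Attempt the same task while remembering only the last inputs."""
    buffer = []
    for bit in stream:
        buffer = (buffer + [bit])[-buffer_size:]   # older input is forgotten
        out = 0
        for b in buffer:
            out ^= b                # can only XOR what the buffer still holds
        yield out

stream = [1, 0, 1, 1, 0, 1, 1, 1]
print(list(cumulative_xor_with_state(stream)))      # [1, 1, 0, 1, 1, 0, 1, 0]
print(list(cumulative_xor_with_buffer(stream, 4)))  # diverges at the 5th input
```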
Even when a problem can be solved by preserving past input, it may be more useful to preserve some other kind of state information instead. For most problems, a network does not have to remember the exact pattern of raw input data that it has seen, but only certain relevant features of those data. Indeed, the features of past input that the network must remember are typically the same features it was required to extract when it originally saw that input. A buffer for raw data is clearly superfluous here.

Finally, and most important, input buffers fail to capture temporal invariance in a natural way. Suppose a network is processing a stream of data that contains, among other things, copies of two special input sequences designated START and STOP. Say the network needs to recognize the following condition: I, the net, have received a START sequence since the last STOP sequence. If the network relies on an appropriately long buffer to remember its past input, then it must learn how to detect START and STOP sequences at each position in the buffer. This requires a great deal of learning and a complete set of training examples. It would be much more natural for the input to simply affect some internal variable as it arrives.

In general, it seems most sensible to let a network use any internal representations that help it do its job. Certainly there is a strong case for letting networks have internal states that reflect their past input and processing. And short of augmenting a network's units with some sort of storage capacity, recurrent connections seem to be the only way that a network can achieve such states.

1.1.3 The computational power of recurrent networks

Recurrent networks are a proper superset of non-recurrent networks, and constitute a more powerful class of computational devices.[1] Hence a final reason to study them is simply to find out what they can do. Recurrent networks may be capable of many kinds of behavior other than constraint satisfaction and temporal tasks.

[1] Indeed, digital electronics was founded upon the flip-flop memory, a recurrent electrical circuit.

As is always true in connectionism, we are especially interested in the class of behaviors that our networks can learn. Unless there is a natural way to teach recurrent networks particular tasks, their power is not especially useful. Later sections of this paper will derive actual algorithms for supervised learning in recurrent nets.

1.2 Recurrent networks in practice

The algorithms of this paper fit into an existing body of research involving recurrent networks. The best-known models to date, which are reviewed below, can be easily divided into the two groups discussed earlier. Some perform constraint satisfaction; others perform temporal tasks.

1.2.1 Existing models of constraint satisfaction

A classic example of a constraint satisfaction device is the interactive activation (IA) model of letter perception (Rumelhart & McClelland, 1986). Part of the interest of this model stems from its ability to predict experimental results on letter perception in humans, but it is also an excellent example of how recurrence operates in the constraint satisfaction domain. The IA model has three levels of units: visual feature detectors, letters, and words. The feature detectors simply respond to different kinds of line segments; they excite the letters that are known to contain those segments. The words excite, and are equally excited by, the letters they contain. Finally, all words inhibit each other, being inconsistent hypotheses, and so do all letters.[2]

[2] Why should letters be inconsistent with each other? The actual IA model has four copies of all of its letter units: one set for the first position in a four-letter word, one set for the second position, and so on. (Each set has its own feature detectors.) Inhibitory connections only occur among letters in the same set.

Stimulating the feature detectors causes a set of possible letters to be activated, some more strongly than others. The object of the model is to decide which of these possible letters are really present in the word shown. It does this through the way the units interact. Words try to inhibit each other, and the word with the greatest activation will tend to be most successful in inhibiting the others. Letters compete for top activation in the same way. It follows that if a word or letter starts out with slightly more activation than its competitors, it will be likely to end up with substantially more activation.

However, the word and letter units interact so as to produce a solution that is consistent on both levels simultaneously. Initially likely letters, if they do not combine to make any word, can be suppressed by other letter units that are supported by word units. In short, the IA model finds a set of activations over the letter and word units that is internally consistent and also consistent with the active feature detectors. (Two units are consistent if they have an excitatory connection and similar activations, or an inhibitory connection and dissimilar activations.) This is similar to the "pure" constraint satisfaction idea described in 1.1.1.[3] One can clamp the feature detectors to get plausible activations over the letter and word units, or clamp a few of the letter units to get plausible activations over the word units and remaining letter units, and so on.

[3] Adding connections from the letter units back to the feature detectors would make it 100% pure.

The IA model is a model of perception, not of perceptual learning, and so has no explicit learning rule. This is a serious shortcoming. The other constraint satisfaction architecture discussed here does have a learning rule, albeit a slow one. This is the Boltzmann machine of Hinton and Sejnowski (1986).

The Boltzmann machine, like the earlier model by Hopfield (1982) from which it derives, has an unusual architecture. Each node can only output 0 or 1, and its output flits back and forth between these two values. The flitting is stochastic; its probabilities are arranged in such a way that the node's total output per unit time is, on average, a logistic function of its input. The logistic function is made steeper over time so that the network eventually settles into a fixed pattern of 0's and 1's. This is called lowering the system's temperature.

Any pattern the network settles into will tend to be highly consistent, in the above sense of the term: that is, any two units with a positive connection will tend to have the same value.[4] Furthermore, the network can be trained to settle into particular patterns. The weights should be adjusted gradually in such a way as to make the desired patterns slightly more consistent, and the network's actual patterns slightly less consistent. If the desired patterns match the actual patterns, the weight changes will cancel each other out.

[4] The network's likelihood of settling into a particular pattern is an exponential function of the pattern's consistency (appropriately measured), where the exponential function depends on temperature and is exactly as steep as the logistic function.
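The flitting behavior is easy to sketch. In the toy program below (mine, not the thesis's; the weights and annealing schedule are invented), each randomly chosen unit adopts output 1 with probability given by the logistic of its net input divided by the temperature; lowering the temperature steepens that function until the network freezes into a consistent pattern.

```python
import math, random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def settle(weights, state, temperatures):
    """Anneal a binary network: at each temperature, repeatedly let a random
    unit adopt value 1 with probability logistic(net / T)."""
    n = len(state)
    for T in temperatures:              # lowering T steepens the logistic
        for _ in range(100):
            i = random.randrange(n)
            net = sum(weights[i][j] * state[j] for j in range(n) if j != i)
            state[i] = 1 if random.random() < logistic(net / T) else 0
    return state

# Two mutually excitatory units and one unit inhibited by both; the symmetric
# weights are invented for illustration.
W = [[0, 2, -1],
     [2, 0, -1],
     [-1, -1, 0]]
print(settle(W, [random.randint(0, 1) for _ in range(3)],
             temperatures=[4.0, 2.0, 1.0, 0.5, 0.25]))   # usually [1, 1, 0]
```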
This is a very attractive model: each weight affects the consistency of a pattern in a local manner, and hence can be adjusted using only local information. However, it suffers from an extremely slow learning rule. The odd network architecture also presents some practical difficulties. Units must be assumed to produce only fixed binary outputs, since if the logistic function is left gentle enough for the units to output a mixture of 0's and 1's, the state of the system as a whole will be unstable.

Perhaps the most unfortunate aspect of the Boltzmann architecture is that it cannot be extended to the temporal domain. Section 1.1.1 noted that constraint satisfaction can be a very useful feature within other networks. However, the Boltzmann architecture simply is not suited to temporal problems. In order to correctly respond to new input, a Boltzmann machine has to raise its temperature and then gradually lower it again. Once its temperature is high, stochastic forces may cause its state to change completely. Successive states of the network are therefore not guaranteed to have any relation to each other.

The ideal constraint satisfaction model would do the job of a Boltzmann machine without relying on randomness. Section 2.2.5 demonstrates the difficulty of adapting Boltzmann methods to non-stochastic nets. The trouble is that once the network is made deterministic, it will only be able to settle into one pattern (for a given starting state), and will not necessarily choose the most consistent one. The network's choice, therefore, cannot be controlled simply by controlling pattern consistencies. Without randomness, the particular pattern a network settles into is the result of complex time-governed interactions among its units. It is much harder to control those interactions than to control pattern consistencies. Nonetheless, section 3 will describe a method for doing just that.

1.2.2 Existing temporal models

The other important models using recurrent networks are models of temporal processes. Earlier, section 1.1 defined temporal tasks as "tasks where a network's inputs and/or outputs are to change over time in a principled way." In other words, for a device that accomplishes the task, past inputs and outputs must have predictive power with respect to future inputs and outputs.

The best-known model of temporal output is the serial order model of Michael Jordan (1986). Jordan's model is based directly on the idea that past outputs should predict future behavior. His network is a back-propagation net with two kinds of input: a constant "plan" vector specifying what sequence of outputs the network is to generate, and a "state" vector that reminds the network what output it has just generated.[5] Under any given plan, the past outputs completely determine the future behavior of the system. If we identify state with output, as Jordan does, the function of the network itself is simply to compute the system's next state from its previous states.

Jordan's model is the inverse of a model like NETtalk, where limited past input was available to the system: here, limited past output is available. The technique successfully describes how a network can be trained to produce a simple sequence of "actions," on its own, without step-by-step guidance from the input. The resulting network also has an interesting tendency to make successive actions overlap in time when possible, by increasing their duration.

[5] To arrange that outputs before the most recent one can help determine the next output, Jordan uses a state vector whose value is an exponentially weighted average of all past outputs. The state vector is updated using recurrent weights.
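Footnote 5's state update is simple enough to state directly in code. The sketch below (mine, with the decay rate mu invented for illustration) shows the exponentially weighted average: each new output is added to a decayed copy of the running state, so older outputs weigh exponentially less.

```python
import numpy as np

def update_state(state, output, mu=0.5):
    """Jordan-style state update: the new state is a decayed copy of the old
    state plus the most recent output, i.e. an exponentially weighted average
    of all past outputs (mu is a decay rate chosen for illustration)."""
    return mu * state + output

state = np.zeros(3)
for output in [np.array([1., 0., 0.]),
               np.array([0., 1., 0.]),
               np.array([0., 0., 1.])]:
    state = update_state(state, output)
print(state)   # [0.25 0.5  1.  ]: older outputs weigh exponentially less
```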
Jeff Elman (1988) has devised a variation of Jordan's model that can deal with the temporal structure of input. His architecture, instead of feeding the network its past outputs, feeds it the past values of its hidden units. This is a simple but useful idea. Whereas Jordan's networks can only consider states prescribed by the environment (the output vectors), Elman's can create their own state variables.

Elman has demonstrated that a network with this architecture can discover non-trivial ordinal structure in its input. With relatively short training regimens, it can be made to predict at each time step the input value it is about to receive. And although Elman does not mention it explicitly, the architecture should be capable of producing Jordan-like output sequences as well.

Both Jordan and Elman rely on back propagation to train their networks. They use the standard form, which does not properly apply to recurrent architectures, because the recurrent connections in their models have fixed weights. For the purposes of the learning algorithm, the recurrent connections might as well not exist. The learning algorithm can treat the model as a feedforward network whose input just happens to reflect its previous state.[6]

[6] Or so Jordan and Elman imply. In truth, their learning algorithm is approximate. The fixedness of the recurrent weights does not really make them irrelevant to the learning algorithm. To see why, consider the hidden units in Elman's network. Standard back propagation tries to arrange for them to have activations that are helpful to the output units; but it misses the chance to make them helpful to the hidden units on the next cycle. In other words, error ought to be propagated backwards along the recurrent connections, even if the weights on those connections are not modifiable. This requires a recurrent learning rule of the Williams and Zipser sort. For Jordan's and Elman's models, however, ignoring this limited recurrence does not seem to have hurt the learning procedure much.

These two models use only restricted recurrence. The feedback connections are carefully chosen and not permitted to change. While the networks are successful at solving certain problems, they do not allow the full range of behavior possible in unrestricted recurrent nets.

Williams and Zipser (1988) overcome these restrictions by deriving a full-fledged extension of back propagation.[7] Their gradient-descent algorithm can be applied to networks that contain any recurrent connections whatsoever. However, it pays a price for this generality: it is nonlocal and computationally intensive.

[7] Rumelhart, Hinton, and Williams (1986) had already given an approximate extension.

In the Williams and Zipser paradigm, a continuously running network receives input at each simulated time step, and has a target output at each time step. The network's error at a single time step is given by the usual sum-of-squares expression; but the learning rule always adjusts weights so as to reduce the network's total error, summed over all time steps. In other words, the network learns to compute the correct outputs any way that it can. If the target outputs are determined by the current inputs alone, for example, the network will learn an ordinary mapping. If they are determined by the past inputs, it will develop some sort of state variables. If they are determined solely by the passage of time, it will learn to produce a Jordan-style output sequence.

One might describe this algorithm as powerful, but greedy. Its time requirement is O(n^4), where n is the number of nodes. The algorithm set forth in section 3 will turn out to work on similar principles. It is rather more powerful, and, unfortunately, equally greedy.
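The cost of this generality can be made concrete. The sketch below is a reconstruction from the description above, not Williams and Zipser's own code: it carries forward a table of sensitivities dy_k/dw_ij for every unit and every weight, and updating that table at each time step is what makes the method nonlocal and O(n^4) in time. The logistic units, learning rate, and toy task are invented for illustration.

```python
import numpy as np

def rtrl_step(W, y, x, P, target, lr=0.1):
    """One forward-gradient step for a fully recurrent net y <- f(W @ [y; x]).
    P[k, i, j] carries the sensitivity dy_k/dw_ij from step to step; updating
    it for every weight at every step is what costs O(n^4) time."""
    z = np.concatenate([y, x])               # prior activations plus input
    y_new = 1.0 / (1.0 + np.exp(-W @ z))     # logistic units
    fprime = y_new * (1.0 - y_new)
    n = len(y)
    P_new = np.zeros_like(P)
    for i in range(n):
        for j in range(len(z)):
            # recurrent contribution, plus the direct effect of w_ij itself
            P_new[:, i, j] = fprime * (W[:, :n] @ P[:, i, j])
            P_new[i, i, j] += fprime[i] * z[j]
    err = y_new - target                     # gradient of sum-of-squares error
    W -= lr * np.einsum('k,kij->ij', err, P_new)
    return y_new, P_new

n_units, n_in = 3, 1
W = 0.1 * np.random.randn(n_units, n_units + n_in)
y = np.zeros(n_units)
P = np.zeros((n_units, n_units, n_units + n_in))
for t in range(100):                         # toy task: teach unit 2 to stay on
    y, P = rtrl_step(W, y, np.array([1.0]), P, target=np.array([0.0, 0.0, 1.0]))
print(np.round(y, 2))
```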
1.3 Summary

We have now seen how several previous researchers have thought about recurrent networks.

In the constraint satisfaction world, the IA and Boltzmann machine models take nearly opposite approaches. Units in IA have continuous activations and influence each other steadily, so that when one unit's activation overtakes another's, it is a qualitatively significant event. By contrast, there is no such thing as a steady influence in Boltzmann machines. Each unit simply modulates the probabilistic behavior of other units. The network's behavior does not change at all over time, except insofar as the temperature does.

The temporal processing research takes a different view. In the models of Jordan and Elman, the point of recurrent connections is to permit feedback. A network that is to operate in time needs knowledge of its past history. Recurrent connections, then, transmit information. In the case of discrete-time networks, they transmit a new packet of information on each time step.

Finally, Williams and Zipser take no particular position on the proper role of recurrent connections, except to note that they increase a network's computational power. For them, the main point of any connection is to make it easier for a network to produce exactly the right outputs at the right times.

The next section will discuss the constraint satisfaction and temporal domains in more detail, and will begin to make the case for a new way to think about recurrent networks: in terms of their dynamics. The whole point of connections is to allow units to influence each other. If a network operates over time, its connections determine its dynamical properties: its patterns of movement through activation space. The next section concludes that these dynamical properties are important. A network's ability to solve a problem often depends on the way its activations change over time, or remain the same.

2 Some Theoretical Observations

A primary aim of this paper is to develop an actual computational model. We begin by considering what sorts of models might be successful. The following remarks, then, are exploratory. They offer some techniques with which one might try to design a useful recurrent net, and provide some of the motivation for the model described later.

As we have seen, some recurrent network architectures are good at responding over time. Others are good at discovering regularities in their input. An ideal approach would be able to address both problems; we discuss them in turn.

2.1 Thoughts on temporal pattern processing

2.1.1 The usefulness of gradual-response nets

In the temporal domain, the most useful recurrent architectures may be those that exhibit a gradual response. In discrete-time networks, a unit's activation at time t+1 simply replaces its activation at time t. A gradual-response network, by contrast, runs in continuous time. Units change their activations continuously; a unit's activation at t+1 follows from the accumulation of infinitesimal changes over the interval (t, t+1].[8]

[8] To put this a little more formally, the activation of a unit, act, is to be differentiable with respect to time, so that act'(t) exists and
$$\mathit{act}(t+1) = \mathit{act}(t) + \int_{\tau=t}^{t+1} \mathit{act}'(\tau)\, d\tau.$$

If we implement such a net on digital hardware, we are obviously forced to use discrete time steps. However, we can make these time steps as close to infinitesimals as we like. The system can be described without the assumption of discrete time; it performs a computation that is sensitive to the total time elapsed, not to the number of time steps.

Such a network is well-suited to temporal problems because its state changes continuously. As with any computational system, the environmental input may vary with time, perhaps discontinuously. But the net will respond gradually even to abrupt changes in input. This property is what allows it to preserve state.[9]

[9] By contrast, a feed-forward network ordinarily does not take state into account at all. The network responds independently to input patterns at times t, t+1, and so forth; its units' activations at time t+1 have nothing to do with their activations at t. We can certainly imagine discrete-time networks whose state at t+1 is a function of both their input at t+1 and their state at t. This is in fact the approach of Elman (1988). Such a model has certain disadvantages, however. It assumes that the system's input (and its desired output) can be described as changing only at regular clock ticks. If more frequent ticks become necessary to describe the environment, the net might have to be completely retrained, because its operation depends in a fundamental way on the step size, i.e., the temporal graininess with which the environment is sampled. Furthermore, since activations do not change continuously, state properties of a discrete-time network have no intrinsic tendency to persist over time, a property whose possible importance will be discussed in a moment.

Note that state preservation is not strictly necessary for all temporal processing. All that is really required is that the system's current state can somehow influence its later behavior; its current state need not persist over time. But lasting states are important for two reasons. First, when states remain relatively constant over short periods of time, the network is not affected by slight temporal distortions in its input, and as Jordan (1986) observes, will produce outputs that are "spread in time" (i.e., non-instantaneous). Second, many temporal problems do happen to require the preservation of state over longer intervals. In this context, the tendency to preserve state is sometimes referred to as memory. It allows a system to take advantage of the relative stability of the physical world, to keep track of the subject of a sentence, and so on.

A gradual-response net is guaranteed to preserve state over at least the very short term, simply because its activations change continuously with time. A unit's activation may also stay constant for longer. In the standard case where the activation of a unit increases in proportion to the total input it receives, a unit will be relatively stable if it receives little input from other units and has little tendency to change on its own (e.g., by decaying).

To demonstrate that recurrent nets are well-suited to maintain their states over long periods of time, we can consider the extreme case in which the net's only goal is to keep its activations constant. In practice, such a net would not be very interesting. We ordinarily want activations to change in response to input and/or the passage of time. However, certain parts of the network might learn to act as memories, and stay constant except in the presence of certain external inputs. We want to be sure that the behavior is natural for our network to achieve.

It is easy to ensure that, by default, units receive little input. We can simply initialize weights to small values before any learning takes place. Moreover, if the weights are initially large, a network that wants to be stabler can easily learn to make them smaller. If I is the input vector and A is the vector of activations in the network, the weights serve to map {I; A} directly onto dA/dt, with no hidden units. The network only has to learn the mapping that takes each {I; A} to the zero vector. Any rule that adjusts network weights in a manner consistent with the delta rule will learn this mapping easily; in particular, any gradient-descent rule will suffice.[10] Section 2.1.3 discusses other ways to encourage a recurrent network to preserve state.

[10] This discussion is not unique to a gradual-response architecture. In a model like Elman's, where the total input to a node replaces its activation instead of modifying it, the network could still preserve state. However, it would need to implement a different mapping to do so. Let C be the vector of activations on the context units (which, in Elman's model, record the value of A from the last time step). The weights in the network map {I; C} onto A. For the network to retain its state, the weights must map each {I; C} to C. In other words, all weights should be 0 except for any weight from a context unit to its associated hidden unit, which must be 1. We could initialize the weights in this manner (somewhat awkwardly). We could also learn them if necessary. However, the mapping is somewhat harder to learn than the direct mapping that our architecture requires. Its output is not a constant; and depending on the topology of the network, it may be necessary to train multiple layers of weights between C and A. Static-state behavior is slightly more "natural" than this for a gradual-response network.

2.1.2 The dynamical systems approach

Recurrent networks can do more than just sit in the same state, of course. Input may affect them; and even when input is held constant, the activations in a network may continue to change. Indeed, the behavior of the network may depend in complex ways on both the input and the current activations of the network.

This notion is captured nicely by a construct used in mathematical physics, the dynamical system. A dynamical system consists of two parts: a continuous state space, which represents the set of possible states of the system, and a continuous function v over the state space, which specifies an instantaneous velocity vector at each point in the state space. The idea is that the system moves continuously through state space; its speed and direction in state space are completely determined by its current position and specified by v. Any continuous path that the system may take through state space is called a trajectory.

For example, consider a pendulum. The state of the pendulum is specified along two dimensions. One dimension is the interval [-π, +π), which represents the possible angles that the pendulum may make with the vertical. The other dimension gives the pendulum's rate of rotation. If we know the pendulum's current angle and rate of rotation, we can compute how quickly each is changing. In other words, if we know its current state, we know how it is currently moving in state space.
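To make the construct concrete, here is a small sketch (not from the thesis) of the pendulum as a dynamical system. The constants are invented, and a small friction term is added so that the origin becomes a point attractor of the kind discussed below.

```python
import math

def v(state, g_over_l=1.0, friction=0.1):
    """Velocity vector of the pendulum's dynamical system: the state is
    (angle, rotation rate), and v says how fast each coordinate is changing."""
    angle, rate = state
    return (rate, -g_over_l * math.sin(angle) - friction * rate)

def trajectory(state, dt=0.01, steps=20000):
    """Trace a trajectory by accumulating small steps along v (Euler's method)."""
    for _ in range(steps):
        d_angle, d_rate = v(state)
        state = (state[0] + dt * d_angle, state[1] + dt * d_rate)
    return state

# Released near the bottom with a push, the damped pendulum spirals toward
# the point attractor at (0, 0).
print(trajectory((0.5, 1.0)))   # approaches (0.0, 0.0)
```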
Any gradual-response network with fixed topology and weights, differentiable output functions, and constant input, specifies a dynamical system in input space × activation space.[11] If we know the current constant input vector to the network and the current activations of all the units, then we know how the input is changing (not at all) and how the activations are changing (e.g., in proportion to their units' net inputs).

[11] Input space is the Euclidean space of possible input vectors. Activation space is the Euclidean space of possible activation vectors. In some models, like those described later, one provides input to the system by setting the activations of certain "input units." In this case there is no need for a separate input space.

The network may be initialized at any point in this state space and "released," i.e., allowed to run. Its own architecture determines what trajectory it follows after that. Changing the input means sliding the system to a different point in its state space and releasing it again.

This dynamical systems perspective is potentially useful, because it allows us to borrow some terminology, and equips us to think about certain phenomena that are commonly observed in dynamical systems. For instance, an equilibrium point of the system is any point in state space whose associated vector is 0, that is, any point from which the system will not move. Most such points happen to be point attractors. To say that p is an attractor means that if the system is released at a point q sufficiently near p, it will trace a trajectory that reaches p (and stays there). The complete set of such points q is called the attractor basin of p.

In general, an equilibrium set is any subset S of state space having the property that if the system is released anywhere in S, it remains in S. S need not be a point; sometimes, for example, one sees periodic trajectories, where the system repeatedly traces a closed loop in state space. Nearby trajectories may converge to this loop, in which case it is called a cyclic attractor or limit cycle, and has a basin like any other attractor. Note that the state space of a dynamical system is divided up among lower-dimensional attractors and their basins, other equilibrium points, and points on divergent trajectories.

2.1.3 Dynamics of small clusters

Within a recurrent network, it is possible to build small clusters of highly interconnected units that exhibit interesting and stable dynamic behaviors in their own activation spaces. These clusters need only be 2 to 5 units in size. Because they are capable of both time-dependent behavior and state preservation, they offer some exciting possibilities for future work in temporal pattern processing.

This section argues for the potential usefulness of cluster-based architectures. The argument is made not only for its own sake, but also to underscore the importance of being able to teach dynamical behavior to networks.

Actual cluster dynamics were explored through a computer simulation. The clusters studied were tiny, fully recurrent networks, made up of ordinary connectionist units with logistic output functions and decay. As it turns out, such miniature dynamical systems are capable of features such as these:

• Single point attractors with different basin dynamics. A cluster with a single point attractor p is useless as a memory, since it will always end up at the attractor. However, by choosing the recurrent weights carefully, we can achieve interesting dynamics in the basin of p. The basin dynamics determine how the cluster responds in the short term if an input stimulus moves it away from the attractor. For instance, one kind of cluster will converge very quickly when near p, but very slowly when far away. In other words, if it is displaced from p by a strong input, it will stay away from it for a certain period of time before returning. If it is at the attractor, this indicates that it has not received such an input for at least this much time. Another possible behavior, for a cluster that has been jiggled off p in the right direction, is to loop around and return to p. For example, a slight input stimulus might cause a cluster to snap out to high activation and right back again. In other words, the cluster generates a standardized pulse when prodded.

• Multiple point attractors. A cluster with two or more attractors is useful as a memory. Appropriate input can move the cluster from one attractor to another. A simple example is a pair of mutually inhibiting units, which has two stable states (on-off and off-on) and acts like an ordinary flip-flop. These states are more useful as stable memories than the persistent states described in 2.1.1: without giving up the ability to be affected by strong inputs, they resist being affected by weak ones. With multiple attractors, each basin can still adopt the same sorts of dynamics that we considered for single attractors earlier. It is even possible to build a cluster with two attractors, p and q, in such a way that the basin of p is non-convex and curves partway around the basin of q. This allows the same source of input to move the cluster's state from p to q, and if the stimulus is repeated, from q back into the basin of p. In other words, although the input source always pushes the cluster in the same direction in its activation space, input serves to toggle the cluster between p and q. The initial state of the cluster determines where it ends up after input.[12]

[12] This is a somewhat counterintuitive result. To see that such toggling behavior is possible, consider how we would produce it by augmenting an ordinary two-state memory cluster. Suppose we already have a cluster C with two attractive states, which we call -1 and +1. Ordinarily, a positive input puts C in state +1; a negative input puts it in state -1. We want to arrange things so that a positive input always toggles the state of C. The solution is simple, and similar to the solution for the classic XOR problem. We introduce an extra semilinear unit, D, that is weakly excited both by the input and by C, and that can inhibit C just enough to turn it off. If C is at -1, a positive input pulse toggles it to +1. D will only cross threshold if a positive input pulse arrives when C is already at +1. In that case, once the input pulse ends, D's high output is able to move C back to -1. This system of C and D, then, produces the behavior promised in the text. The attractor at (C = -1, D = 0) does indeed have a non-convex basin. Choosing appropriate numbers, the basin contains the point (C = +3, D = 1), which is the state the system assumes if C is at +1 and a positive input pulse arrives. The basin also contains the attractor itself. But it certainly excludes the point halfway in between, (C = +1, D = 0.5), because that point happens to be the other attractor!

• Quantum units. By carefully creating a cluster with several multiple attractors, and by paying attention to the output of only one node in the cluster, one can implement a potentially useful device called a quantum unit. A quantum unit behaves for the most part like an ordinary connectionist unit, but its activation space is discrete. It has only a small finite number of characteristic activations. Any induced activation will tend to snap to the nearest of these. The unit always outputs its activation (no logistic or threshold function is necessary).

Such units could be useful for several reasons. First, they respond only to strong inputs. Second, although they can produce a range of outputs, those outputs are quantized in the style of threshold units. Third, they retain their state over time (and can accumulate state changes over time). Finally, their relative insensitivity to changing input would make them quick to settle when used in a recurrent network.

• Limit cycles. Finally, it is possible to produce a limit cycle within a small cluster. Surprisingly, this behavior requires only three units. One unit remains constant (acts as a bias) while the other two trace a circle in the plane. This limit cycle is a stable attractor whose basin is the whole space: if the cluster's trajectory is jolted off the cycle, it will spiral inward or outwards and return to the trajectory. (By increasing or decreasing all the weights, we can make it return faster or slower.) Such a cluster can easily be made to generate regular pulses in another unit. This feature might be useful as an internal "clock" for a larger network.

All these behaviors may be easily verified with a short computer program.
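One such short program might look like this sketch (mine, with weights invented for illustration). It integrates a tiny fully recurrent cluster of logistic-output units with proportional decay, and exhibits the flip-flop behavior of the mutually inhibiting pair described above: whichever unit starts slightly ahead wins, and the cluster settles into one of its two point attractors.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_cluster(a, weights, external, dt=0.05, steps=4000, decay=1.0):
    """Integrate a tiny fully recurrent cluster: each activation grows with
    its net input (weighted logistic outputs plus external input) and decays
    in proportion to itself."""
    n = len(a)
    for _ in range(steps):
        out = [logistic(x) for x in a]
        for i in range(n):
            net = sum(weights[i][j] * out[j] for j in range(n)) + external[i]
            a[i] += dt * (net - decay * a[i])
    return [round(x, 2) for x in a]

# A pair of mutually inhibiting units: two point attractors (on-off / off-on),
# i.e. an ordinary flip-flop.  The weights are invented for illustration.
W = [[0, -8], [-8, 0]]
print(run_cluster([0.1, -0.1], W, [2, 2]))   # settles with unit 0 on, unit 1 off
print(run_cluster([-0.1, 0.1], W, [2, 2]))   # settles with unit 1 on, unit 0 off
```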
It is easy to imagine network topologies that would make good use of such clusters. For example, we might construct a continuous-time feedforward net using both ordinary units and quantum units. Such a network would retain state information from one pattern presentation to the next, changing state variables only under strong input.

More generally, any cluster implements some deterministic finite automaton (DFA) that has only a few states. The input lines to the cluster supply the DFA with transition elements; the passage of time may also serve as a transition element. In a hierarchical architecture where most clusters take their input from other clusters, the higher levels may be able to recognize complex temporal patterns in the environmental input. This would be a worthwhile area to explore.

Unfortunately, without a learning rule, these are simply observations about useful topologies. It is unclear how we would train a cluster to adopt a particular behavior, or how we would decide what behavior the cluster ought to adopt in order to be useful. The work in section 3, which develops a learning rule for dynamical systems, was performed with an eye toward solving either or both of these learning problems.

2.2 Thoughts on constraint satisfaction

2.2.1 The need for hidden units

The ambition of any young constraint satisfaction network is to discover the regularities in its input environment. If the activations of visible units V_1, V_2, ..., V_n represent environment variables, the object is to find weights that capture the relations among these activations. In particular, given values for a subset S of the V_i, the mature network should be able to predict values for visible units not in S.

Some constraint satisfaction networks have no hidden nodes, and simply try to find direct relationships among the visible units. The network can take advantage of these with appropriate direct weights between correlated or anticorrelated units. Typically, such a network is expected to compute a consistent solution for all visible nodes at once. This parallelism is often helpful.

An XOR problem mentioned by Hinton and Sejnowski (1986) provides a good illustration. In this problem's environment, there are four variables, A, B, C, and D. Four patterns over these variables are regularly found in the environment: 0000, 0110, 1010, and 1101. Inspection shows that C is the exclusive "or" of A and B, while D is their logical "and." If a network without hidden units is given the values of A and B, no set of direct weights will enable it to find the value of C. (Neither A nor B is at all correlated with C.) Luckily, the value of D can be found using direct connections from A and B. Since the network is looking for values for all the nodes at once, it will determine the value of D, and then A, B, and D together can determine C with only direct connections.
t
  • f
this example is that D , an en vironmen t v ariable, p erforms a necessary function in the computation
  • f
C . The net w
  • rk
  • nly
manages to determine C b ecause it is sim ultaneously in terested in determining D , and has already learned from the en vironmen t ho w D is related to the
  • ther
v ariables. This is the same tric k that Rumelhart and McClelland (1981) used in their in teractiv e activ ation mo del. Their mo del w as designed to recognize w
  • rds
from their visual features, but it did not attempt to correlate features in isolation with particular w
  • rds.
Rather, features help ed to predict the presence
  • f
letters, whic h in their turn w ere correlated with the w
  • rds.
The 19
slide-21
SLIDE 21 mo del w
  • uld
certainly ha v e failed if feature units had just b een connected directly to w
  • rd
units. Unfortunately , in termedi ate represen tations as helpful as letters ma y not alw a ys b e a v ailable from the en vironmen t. T
  • capture
the regularities
  • f
the en vironmen t, the net w
  • rk
ma y ha v e to compute prop erties that are not recorded b y an y visible unit. This means training hidden units. In the X OR example men tioned earlier, the net w
  • rk
migh t b e able to do without D if it could dev elop a hidden unit to assume the same function. In general, hidden units are necessary to do constrain t satisfaction when the constrain ts are complex, just as they are necessary to implem en t complex mappings in feedforw ard nets. Hidden units are the
  • nly
w a y to detect imp
  • rtan
t features
  • f
the en vironmen t that are not directly a v ailable in the en vironmen t. 2.2.2 A rst attempt at a satisfaction mo del The remainder
  • f
section 2.2 attempts, unsuccessfully , to dev elop a straigh t- forw ard mo del for constrain t satisfaction in a recurren t net. The approac h fails, but for reasons that are w
  • rth
understanding: in particular, b ecause it pa ys insucien t atten tion to net w
  • rk
dynamics. The mo del describ ed in section 3 tries to address its problems. The mo del tak es the same tac k as Hin ton and Sejno wski (1986), who wrote: W e w
  • uld
lik e to nd a set
  • f
w eigh ts so that when the net w
  • rk
is running freely , the patterns
  • f
activit y that
  • ccur
  • v
er the visible units are the same as they w
  • uld
b e if the en vironmen t w as clamping them. (p. 292) Rather than using a Boltzmann-st yle device with sto c hastic binary units, ho w ev er, w e will use a gradual-resp
  • nse
net w
  • rk
  • f
the sort describ ed in 2.1.1. A t an y giv en time, some
  • r
all
  • f
the visible units ma y b e designated as clamp e d, whic h means that their activ ations are not p ermitted to c hange. The dynamics
  • f
the net w
  • rk
are giv en b y the equations a i (t +
  • )
=
  • a
i (t) if i is clamp ed; a i (t) +
  • net
i (t)
  • de
c ay (a i (t)) if i is free (1) 20
slide-22
SLIDE 22 where net i (t) = X j f (a j )w ij (2) (Here a i is the activ ation
  • f
unit i; net i is its instan taneous input, dened b y equation (2); and w ij is the w eigh t to j from i. All three ma y range from
  • innit
y to +innit y . There are also net w
  • rk
parameters that gure in the ab
  • v
e equations. f is the con tin uous
  • utput
function for all units; de c ay is the con tin uous and in v ertible function that sp ecies the instan taneous deca y rate
  • f
a giv en activ ation; and
  • is
the size
  • f
the time step. Go
  • d
c hoices are to mak e f the logistic function and let de c ay (a) b e prop
  • rtional
to a.) T
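Equations (1) and (2) translate directly into code. The sketch below (mine, not the thesis's simulator) uses the suggested choices, logistic f and decay proportional to a; the three-unit network and clamp pattern are invented for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(a, w, clamped, delta=0.05, decay_rate=1.0):
    """One update of equations (1) and (2): free units move by
    delta * (net_i - decay(a_i)); clamped units hold their activations."""
    f_out = [logistic(x) for x in a]
    net = [sum(f_out[j] * w[i][j] for j in range(len(a))) for i in range(len(a))]
    return [a[i] if i in clamped
            else a[i] + delta * (net[i] - decay_rate * a[i])
            for i in range(len(a))]

# Invented example: three units, unit 0 clamped; the free units settle toward
# equilibrium activations satisfying net_i = decay(a_i)  (equation (3) below).
w = [[0.0, 0.5, -0.5],
     [1.0, 0.0, 0.3],
     [-1.0, 0.3, 0.0]]
a = [1.0, 0.0, 0.0]
for _ in range(2000):
    a = step(a, w, clamped={0})
print([round(x, 3) for x in a])
```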
  • train
the unit
  • n
a pattern, w e clamp the visible no des with that pattern and w ait for the net w
  • rk
to settle. W e w an t
  • ur
w eigh ts to ha v e the eect that
  • n
ev ery pattern, eac h visible unit will receiv e exactly enough input to main tain its clamp ed activ ation. In
  • ther
w
  • rds,
w e w an t to train the net w
  • rk
so that the clamp ed units can b e released and still hold the correct v alues at whic h they w ere clamp ed. This is similar to the learning pro cedure in a Boltzmann mac hine. 2.2.3 A learning rule for the non-resonan t case W e see from (1) that unit i is at equilibrium if a i (t +
  • )
  • a
i (t) =
  • net
i (t)
  • de
c ay (a i (t)) = 0;
  • r
in
  • ther
w
  • rds,
net i (t) = de c ay (a i (t)): (3) This means that to sta y at its curren t activ ation, a unit m ust get just enough input to balance its rate
  • f
deca y . Note that it is deca y that k eeps a i from b ecoming innite. 13 W e ma y call this a subsistenc e input. Since clamp ed no des are supp
  • sed
to get just enough input to main tain their activ ations,
  • ur
learning rule will try to mak e the ab
  • v
e equation hold for all clamp ed 13 Since de c ay is required to b e a con tin uous in v ertible function, the equilibrium v alue
  • f
a i under constan t input is uniquely giv en b y the con tin uous function de c ay 1 (net i ). Th us if net i do es not increase without b
  • und
while the net w
  • rk
is running, a i will not do so either. This condition
  • n
net i is easy to arrange. Cho
  • sing
the logistic function for f restricts the
  • utput
v alues
  • f
all units j to the in terv al (0; 1); then for xed w eigh ts w ij , the input to i is b
  • unded.
That is, the nodes' actual activations, to which they are clamped, ought to be their equilibrium activations as well. For a given clamp pattern, then, we define the local error at a clamped node i by

    e_i = net_i − decay(a_i).        (4)

We want a learning rule that minimizes the global error measure E = (1/2) Σ_i e_i² for all patterns. Following the approach of Rumelhart, Hinton, and Williams (1986), we will derive a rule that descends against the gradient of E in weight space.

We begin by considering a special case, where there are no connections among the hidden units. A hidden unit may connect only to visible units. This is an easy case because when the visible units are clamped, there is no resonant behavior in such a network. Changing a weight will affect the activation of one unit at most, and in a straightforward way. When we release the clamped units, of course, resonance will become possible. But if we have trained the network well, no activations will change. There will indeed be recurrent cycles operating in the network, but they are now exactly suited to maintain the correct activations. Where the visible units formerly produced the output that they were instructed to, now they produce the same output freely.

The gradient of E for this case is computed as follows:

    ∂E/∂w_ij = Σ_{k clamped} e_k ∂e_k/∂w_ij        (5)

There are two cases for determining ∂e_k/∂w_ij. If i is a clamped node,

    ∂e_k/∂w_ij = ∂net_k/∂w_ij = δ_ik f(a_j)        (6)

(The Kronecker delta, δ, represents the identity matrix. It has the value 1 if its subscripts are equal, 0 otherwise. Here it is used as a notational convenience.) On the other hand, if i is free, we can write

    ∂e_k/∂w_ij = (∂e_k/∂a_i)(∂a_i/∂w_ij)        (7)
               = (∂net_k/∂a_i)(∂a_i/∂w_ij)        (8)
               = (∂net_k/∂f(a_i)) (df(a_i)/da_i) (∂a_i/∂net_i) (∂net_i/∂w_ij)        (9)
               = w_ki f′(a_i) (decay^(−1))′(net_i) f(a_j)        (10)

(The final line of this derivation is the one that takes advantage of our assumptions. The substitution in the third term requires that the network has settled; it follows from the equilibrium condition (3). The substitutions in the first and fourth terms only work for the present non-resonant case.)

We can summarize these results as

    ∂E/∂w_ij = Σ_{k clamped} e_k ∂e_k/∂w_ij        (11)
             = e_i f(a_j)                                                    if i is clamped
             = (Σ_{k clamped} e_k w_ki) f(a_j) f′(a_i) (decay^(−1))′(net_i)  if i is free        (12)

In general, ∂E/∂w_ij is given by multiplying an error term at i by the output of j, and by an additional factor that describes how the output of i changes with its net input. This is not unlike the rule for backpropagation.
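A small sketch of equation (12) in code may make the two cases concrete. Purely for illustration it assumes the linear decay decay(a) = decay_rate · a, so that (decay^(−1))′ is the constant 1/decay_rate, and the logistic f; the names are mine, not the thesis's.

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def nonresonant_gradient(W, a, clamped, decay_rate=1.0):
        """Gradient of E = 1/2 sum_k e_k^2 over clamped units (eq. 12).

        W[i, j] is the weight to i from j; clamped is a boolean mask.
        """
        f = logistic(a)
        fp = f * (1.0 - f)                                 # f'(a) for the logistic
        net = W @ f
        e = np.where(clamped, net - decay_rate * a, 0.0)   # eq. (4) at clamped nodes

        grad = np.zeros_like(W)
        for i in range(len(a)):
            if clamped[i]:
                grad[i, :] = e[i] * f                      # e_i f(a_j), clamped case
            else:
                back = e @ W[:, i]                         # sum over clamped k of e_k w_ki
                grad[i, :] = back * fp[i] * (1.0 / decay_rate) * f   # free case
        return grad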
2.2.4 Extending our rule to the resonant case

It turns out to be very difficult to extend this learning rule to the more general case. If we permit resonance during training, it is simply too much work to calculate the weight changes. Briefly, the problem lies in the computation of ∂a_i/∂w_ij. Suppose that f(a_j) is positive and we increase w_ij. Then the input to i from j will certainly increase. The effect of that increase on a_i, however, depends on i's interaction with the rest of the network. If i excites a node that excites it in turn, for example, then the effect will be augmented through that other node. On the other hand, if excitation to i causes i to be inhibited more (or excited less) by other units, then the effect of the additional input will be damped.

This interaction of i with the rest of the network has nothing in particular to do with j or w_ij. We can isolate the interaction itself by introducing a further abstraction. Let stim_i be the input from an imaginary electrode stimulating node i. We redefine net_i to add this new source of input.
In practice, stim_i is always zero, but we can still differentiate with respect to it. So we displace the infinitesimal change to w_ij onto stim_i, allowing us to write

    ∂a_i/∂w_ij = (∂a_i/∂stim_i) f(a_j).

Let c_ij stand for ∂a_i/∂stim_j. We merely need to compute c_ii for each unit i in the network.

Unfortunately, this turns out to be an inherently serial computation. If we derive an equation relating the values of c in the network,

    c_ij = (decay^(−1))′(net_i) (Σ_k w_ik f′(a_k) c_kj + δ_ij),

we see that they cannot all be determined from each other by a single relaxation computation. The only interrelated values of c are those with the same second subscript. In order to find ∂a_i/∂stim_j, we must actually stimulate j and measure the resulting changes in the network's activation vector; so finding values for all the c_ij requires us to stimulate all the j, one at a time.

This is an ugly result. It is similar to saying that the only way to determine the error gradient in weight space is to tweak each weight individually and see what happens to the error. If we require that weights be symmetric, which in a Boltzmann machine is the characteristic that allows error gradients to be determined locally, we can get so far as proving that c_ij = c_ji for all i and j. Even so, we still have to compute c_ii separately for each i. A simple example suggests that no amount of cleverness will make it possible to compute the c_ii in parallel. Let i be a clamped unit and j be any other unit. Then c_ii = 0, and c_ij = c_ji = 0, but we have no idea from these numbers what c_jj is.

As a last hope, one might try generalizing the non-resonant rule from its form alone. There are a few generalizations that look promising. When tested empirically, however, none of them can reliably minimize error on more than one pattern at a time. One presumes this is because they do not perform gradient descent.[14]

[14] Even if weights move in the right direction on a given pattern, usually enabling the network to learn that pattern in isolation, they do not necessarily move at the right speed. This means that when multiple patterns are being trained, the sum of the weight changes for an epoch may not have the right sign (even if all the individual summands do).
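To make the seriality concrete, here is a sketch of estimating a single c_jj by actually stimulating node j and re-settling the net, one settling run per node. The helper run_to_equilibrium is hypothetical, standing in for whatever settling procedure one uses; nothing here is defined by the thesis.

    import numpy as np

    def c_jj_by_stimulation(run_to_equilibrium, a0, j, eps=1e-4):
        """Estimate c_jj = da_j/dstim_j by perturbing node j.

        run_to_equilibrium(a0, stim) -- hypothetical helper: settles the net
        from a0 under a constant extra input vector stim and returns the
        equilibrium activations.
        """
        stim = np.zeros_like(a0)
        base = run_to_equilibrium(a0, stim)    # settle once with no stimulation
        stim[j] = eps
        bumped = run_to_equilibrium(a0, stim)  # settle again with node j stimulated
        return (bumped[j] - base[j]) / eps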
2.2.5 The problem with this approach

The above discussion illustrates the difficulty of computing the correct weight changes for recurrent nets. It is far harder to minimize our error measure on a resonant net than on a non-resonant one. The constraint satisfaction problem is harder still, because the derivation above was oversimplified. The analysis was only valid with respect to the error measure given. To solve the constraint satisfaction problem, a network's error measure actually has to be more sophisticated than that.

To see why, suppose we have trained all the putative error out of the net, so that E = 0. For every clamped pattern, the network now succeeds in delivering subsistence inputs to the clamped visible nodes. The nodes will maintain their activations when they are freed; the system knows not to stray from the right answer. This means that each pattern has become an equilibrium point in activation space.

This is well and good, but the problem is that the network may have many other equilibrium points as well. Even if we assume that a pattern is an attractor, we cannot be sure that it has a very large basin. The system may not settle to the attractor unless it starts close by. So we can have a perfectly trained network that is nonetheless unable to generate the correct solutions without prompting.

For difficult tasks, this situation almost always happens in practice. If we train the reasonable non-resonant network of section 2.2.3 on the symmetric XOR problem, whose clamped patterns are 000, 011, 101, and 110, it is usually able to reduce E to 0 within a dozen or so epochs.[15] However, it has not really solved the symmetric XOR problem in any useful sense. If we set the three visible units to the activations (1, 0, 0.5) and clamp the first two, the third unit may assume an activation close to 1, but a ludicrous activation like −2.31 is just as likely.

[15] This result is for networks with one to four hidden units and no visible-visible connections.

Why exactly should this be so? We have trained the system so that, when a pattern p is clamped on the visible nodes, the hidden units assume a state S that supports p. However, if we do not clamp p exactly, the hidden units may not move toward state S at all, or toward S_1, S_2, or any state that supports a pattern in the training set. We have left some visible nodes free, and their values may be crucial in getting the hidden units to assume a particular state.
How can we fix this problem? Again, it is useful to regard the network as a dynamical system. The problem at hand is not strictly a problem about keeping the visible nodes in place. Rather, it is a problem about getting them to move to the right place and then keeping them there. In other words, the learning rule needs to extend the basin of an attractor that it already knows how to create: it needs to control the dynamics of the system.

2.3 Summary of Theoretical Observations

We have now defined a straightforward architecture for a gradual-response recurrent network, and discussed how such a network might learn to do temporal pattern processing and constraint satisfaction. In both cases, we have run up against the problem of training the network to be some particular dynamical system.

What would be really helpful is a learning rule that could train the network to assume any desired dynamics. It is just such a rule that we will develop in the next section. Adopting arbitrary dynamics is an intrinsically difficult problem, and it should be noted that it is more general and perhaps more difficult for the network to solve than the particular dynamical problems we have encountered so far. Simpler algorithms may exist for special cases. It may be, for example, that there are easy ways to train a cluster to be a quantum unit. Nonetheless, in the interest of generality, the remainder of this paper will focus on a learning rule that is sufficient to cover any dynamics we want to induce in a network, recurrent or otherwise. We will see that such a rule exists, derive its form, and attempt to apply it to particular learning situations.

3 A General Model and its Learning Rule

3.1 Conception of the general model

3.1.1 Molding a dynamical system

Let us return briefly to the unsuccessful constraint satisfaction model of section 2.2. That model learned how to maintain visible nodes already at their target values.
It was at a loss, however, when it had to force the visible nodes to change their activations. A better model would be able to make the visible nodes change their activations.

Consider what should happen if the activation of unit i is wrong. If the activation is too low, we are only interested in stimulating it enough to make it increase, at whatever speed; any large enough input will do. Similarly, if the activation is too high, all the model must do is to deliver any input low enough for the node to decrease. The input to a node must be precise only when the model needs to sustain the node exactly at its target.

We can design an error measure that recognizes these conditions. Such a measure must pay attention to whether activations are increasing or decreasing. In particular, it must rely on equation (1), which shows that a_i will increase whenever net_i > decay(a_i), and decrease whenever net_i < decay(a_i).

Indeed, we can develop error measures to prescribe the dynamics of the system in any detail whatsoever. Most traditional error measures have only considered a_i for each unit. When working with dynamic recurrent nets, however, we may also want to consider da_i/dt. It follows immediately from (1) that

    da_i/dt = net_i − decay(a_i).        (13)

This result holds for discrete-time simulations, where Δt > 0, as well as in the limit case where Δt approaches 0. It provides a local definition of da_i/dt that an error measure can easily take into account.

In general, we can choose error measures that consider the system's position in A, its direction of movement in A, and the current time. For example, a measure might require that the system's direction be related to its position in some time-dependent way. The formal algorithm that we are soon to consider is able to minimize such an error measure, unlike other learning rules.
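For instance, here is one plausible error measure of this kind, stated in code: it penalizes any target node that is not moving toward its target at some minimum speed, using d = da/dt from equation (13). The target and min_speed arguments are illustrative choices of mine, not the thesis's.

    import numpy as np

    def error_measure(t, a, d, target, min_speed=0.1):
        """A sample E(t, A, D): each target node should be moving toward
        its target at speed min_speed or better; t is unused here but
        shows that E may depend on the current time as well.
        """
        desired_sign = np.sign(target - a)         # which way each node should move
        shortfall = min_speed - desired_sign * d   # speed deficit toward the target
        e = np.maximum(0.0, shortfall)             # no penalty once fast enough
        return 0.5 * np.sum(e ** 2)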
3.1.2 The role of input

At the same time that the net is being required to perform certain stunts in activation space, it may also receive some sort of input. Typically, this input will be useful in helping the network achieve the desired behavior (much as knowing the problems on a test helps one get the right answers!).

The environment can supply the network with input in one of two ways. For tasks where the network needs only one input per trajectory, the starting activation A° can play this role. Different behaviors are required of the network depending on where it starts in A. (Note that the error measure may vary for different values of A°.)

For other tasks, the input may change over time. Here the preferred way of presenting input is to clamp some of the units to input values. Changing the clamping pattern changes the input. (Jordan's (1986) plan vector works this way.) This method conforms well to the dynamical system perspective. We always want the network's behavior to be some function of its input. In this case, that means its behavior is to be a function of its current position in A.
3.1.3 How to use the error measure

Once we have chosen an error measure E for a particular dynamical problem, we want to choose weights that will minimize it. But E is defined at every point of activation space. At which points do we need to minimize it?

Let A represent activation space. In order to induce the correct dynamics for the entire dynamical system, we might try to minimize the surface integral of E over all of A. This would create a very robust network. Regardless of the place in activation space where the net was released, it would behave according to the dynamics prescribed for it.[16]

[16] The current weights specify an error surface in activation space; the idea is to minimize the volume under this surface. For practical purposes, this means minimizing total E over a lattice of points distributed evenly through activation space (under the assumption that E is continuous).

Unfortunately, there is usually no set of weights that will achieve low error everywhere in the space. We have seen that a network often requires hidden units in order to achieve interesting behavior on its visible units. These hidden units have no say in determining the desired behavior of the visible units; but so long as they can influence the rest of the network, they certainly have a say in determining those units' actual behavior. Thus they control the discrepancy between desired and actual behavior. Even with the best weights possible, then, a network can never get low error everywhere in activation space. We can still increase or decrease its error simply by changing the activations of some hidden units.

For example, suppose our error measure requires unit i to always be moving toward an activation of +2. The desired direction of movement is thus determined by the current value of a_i. But the direction in which a_i actually moves is also dependent on net_i. Different points in activation space may have the same value of a_i, but still deliver very different amounts of input to i. These points cannot all have low error.

The right approach is instead to train the system dynamically. In other words, we want to run the network as we would in practice, watch the trajectories it actually follows through activation space, and minimize error along those. Thus we only try to control the network's dynamics in the important parts of activation space.[17]

[17] Similarly, when adjusting the second level of weights in a three-layer feedforward net, the object is not to produce the right outputs given any activations of the hidden units, but only for the activations that the hidden units presently assume in response to input.

Our strategy, then, is to minimize the result of integrating E over the trajectories that the system actually follows. This raises an interesting question that has not been asked before. Should we integrate by length or by time?

Some thought suggests that integrating by length makes more sense. First of all, fast trajectories should not be able to escape getting blamed for error. If error is integrated with respect to time, however, an erroneous part of the trajectory can speed up and automatically reduce its contribution to the total error.[18] One might attempt to make a similar case against the alternative. If the learning rule integrates by length, a network that is required to make a series of complex loops in A could reduce its error substantially by just cutting through, i.e., taking an erroneous but short detour through that region. However, a time-integrating network does no better at learning to trace such a Gordian knot; it profits from the same shortcut. So length-integrating networks seem to have the overall advantage.

[18] In some cases, we may actually want the learning rule to favor fast trajectories, but such a property should not be intrinsic to the network. The error measure is the appropriate place to specify such a preference.

There is another, more significant advantage to integrating by length. Suppose we are training a network to settle at a particular activation vector Ā. If the network settles at some other point A′ instead, its failure to continue moving toward Ā should contribute to the error for the trajectory. But how much should it contribute? If we integrate by time, the answer depends on how long we let the network remain at A′. For every additional second that elapses, the erroneousness of the system's behavior at A′ becomes more significant to the system (and soon overwhelms the error on the rest of the trajectory). We cannot get around this in a principled way by stopping the network as soon as it settles, because the network may stay forever in the neighborhood of A′ without ever quite converging to it; and if we respond with an approximate criterion to decide that the network has come "pretty close" to settling, then our error gradient is going to depend critically on the strictness of this criterion.

Integrating by length solves this problem nicely. It ensures that a trajectory generates less error per unit time as it slows down. When the network reaches A′ and is no longer moving, it can remain there for an instant or forever without making any difference to the computation of total error or the error gradient. Moreover, once the trajectory has gotten close to A′, it has generated virtually all of the error that it ever will; so we can legitimately stop the network when it is "pretty close" to settling.

So on each trajectory T, our goal is to minimize

    E_T = ∫ E dL        (14)
        = ∫ E sqrt(Σ_i (da_i)²)        (15)
        = ∫ E sqrt(Σ_i (da_i/dt)²) dt        (16)

(In the standard fashion, dL represents an infinitesimal step along the length of a one-parameter function, E(t) in this case.)

One final point. Usually we will want the network to minimize the total error for several patterns. Each pattern has a different starting point in activation space and hence a different trajectory. The network is supposed to minimize the sum of the errors on these trajectories. As things stand, if trajectory T_1 is longer than trajectory T_2, T_1's error will have extra clout. This is usually inappropriate. For example, suppose that T_2 begins at an equilibrium point of the system and never moves; its error will not be counted at all relative to T_1's! To remove this effect, we divide the error on each trajectory by the length of the trajectory:

    E_T = (∫_T E dL) / (∫_T dL).        (17)

One can think of this as ∫ E ds, where each s ∈ [0, 1] represents a point some fraction of the distance along the trajectory; e.g., s = 0.5 is the halfway point.
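A sketch of the length-weighted, length-normalized error of equation (17), computed over a sampled trajectory; the discrete sampling scheme is my own illustration, not the thesis's.

    import numpy as np

    def trajectory_error(A_samples, E_samples):
        """Approximate E_T = (integral of E dL) / (integral of dL), eq. (17).

        A_samples -- sequence of activation vectors along a simulated trajectory
        E_samples -- E evaluated at each of those points
        Each segment's error is weighted by its arc length in activation space,
        so slow (short) segments are neither over- nor under-counted.
        """
        total_err, total_len = 0.0, 0.0
        for k in range(len(A_samples) - 1):
            dL = np.linalg.norm(A_samples[k + 1] - A_samples[k])  # step length
            total_err += E_samples[k] * dL
            total_len += dL
        return total_err / total_len if total_len > 0 else 0.0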
3.1.4 How weight changes shift the trajectory

The previous section describes the desired learning rule as minimizing an error function along a trajectory in activation space. It is to accomplish this by changing the weights in the network. However, changing the weights will not only change the value of the error function at points along the trajectory. Changing the weights will also shift the actual course of the trajectory. Might such a shift interfere with the attempt to reduce error?

Not if the error function is continuous over A. Suppose the trajectory formerly passed through a point A ∈ A, but infinitesimally adjusting the weights has shifted it an infinitesimal distance, so that it now passes through A′. Because E is continuous, the error gradient is virtually the same at A and A′. The two gradients only differ by an infinitesimal. Hence reducing error at A′ requires weight changes in the same direction as reducing error at A. The weight changes that we made for the old trajectory are also correct for the new one.[19]

[19] In short, moving infinitesimally along a gradient doesn't really change the gradient. This happens to be part of the definition of gradient. If it weren't true, the function wouldn't be differentiable. In practice we will have to move a little faster than infinitesimally, of course, but (according to the standard excuse) not much faster.

One consequence of this argument is that weight changes can be calculated independently at different points in the trajectory. Reducing error at A_1 may mean that we will no longer pass through A_2, but it doesn't affect our computation of weight changes for A_2.

The trajectory shifts are important nonetheless. The error measure of (17) asks for each s ∈ [0, 1]: How could I change weights to reduce error s of the distance along the trajectory? All we have shown is that this question asked at s_1 is independent of the same question asked at s_2. It doesn't mean that we can answer the question at s_1 without considering the total effect of weight changes on the trajectory from s = 0 to s = s_1. Indeed, changes to the trajectory are often necessary to reduce error. One reason to change weights is to ensure that the hidden units will have more helpful activations at some s_1. This means that we must redirect the trajectory so that at s_1 it is passing through a more helpful part of activation space. So to figure out how infinitesimal weight changes will affect the network's activations at s_1, we have to consider the cumulative effect of those changes over the distance [0, s_1].

A little thought shows that this is actually possible. For each weight w, we can determine the effect on the trajectory of an infinitesimal change to w alone. The technique boils down to this: While simulating the net with the actual weights, we also simulate a hypothetical net where w is replaced by w + dw. This is a straightforward way to determine the hypothetical trajectory and thus the value of ∂E/∂w. If we carry out the procedure for every modifiable weight w in the system, we will have enough information to determine the gradient of E in weight space. This is the same technique used by Williams and Zipser (1988). The resulting algorithm will be computationally intensive. It essentially requires us to run n simultaneous copies of the network, where n is the number of modifiable weights.

Unfortunately, such a method seems necessary to solve the general problem. We could rewrite it somewhat to reduce the actual requirements on the network. For instance, the learning algorithm could repeatedly choose a weight at random, make some tentative modification to it, and later compute a permanent modification based on the observed change in error. Such a learning algorithm would not require any storage space of its own; but it would learn more slowly and be no more elegant. The fundamental difficulty of the problem would be unchanged.

3.2 Formal derivation of the general model

This section gives the technical details of the algorithm, and may be skipped by the general reader.
3.2.1 Notation

Some additional notation will be necessary. A denotes activation space, A* its differential space. W denotes weight space. T denotes the timeline [0, +infinity). We use conventional variables A ∈ A, D ∈ A*, W ∈ W, t ∈ T. Units in the network are enumerated, and the integer variables i, j, k, and l stand for particular units. The components of a vector A are represented by a_1, a_2, a_3, .... The same convention is used for D. W is a square matrix; the component w_ij represents the weight to unit i from unit j, and is considered to be 0 if no connection exists.

The following functions are assumed to be continuous and differentiable on all their arguments, except as noted.

A : W × A × T → A describes possible trajectories for a net of fixed topology. We write A_{W,A°}(t) to denote the position that the system will arrive at when given weight vector W, initialized at A°, and allowed to run for a time interval t. Generally we leave the subscripts off, so that A(t) denotes the activation vector of a particular network at time t, assuming some particular vector of starting activations.

D : W × A → A* describes the dynamics of the net. We write D_W(A) for lim_{t→0} [A_{W,A}(t) − A_{W,A}(0)] / t, i.e., for the time-derivative A′_{W,A}(0). D_W(A) need not be differentiable with respect to A, though it must be continuous.

E : T × A × A* → R+ describes the error of the system at a particular time, given its position and direction in activation space. We usually abbreviate E_t(A_{W,A°}(t), D_W(A_{W,A°}(t))) simply as E(t), since W and A° are usually clear from context. E may depend on t, even discontinuously.

The derivation below will sometimes use expressions like

    ∂f(x̂, x_3)/∂x    and    ∂f(x, x̂_3)/∂x_3.

These stand respectively for the first and second partials of f, evaluated at the point (x, x_3). The peaked hat on top indicates which argument is the variable. This notation is a little easier to follow than the standard

    ∂f(a, b)/∂a |_(x, x_3)    and    ∂f(a, b)/∂b |_(x, x_3).

In the same style, we may write ∂f(x̂, x_3)/∂t to stand for

    ∂f(a, b)/∂a |_(x, x_3) · dx/dt.

The idea is simply to describe how f changes over t when only its first argument is allowed to vary.
3.2.2 Calculating the gradient in weight space

We want an expression for ∂E_{W,A°}(t)/∂W. We can find this by expressing W in terms of its basis vectors, so we only need to find ∂E_{W,A°}(t)/∂w_ij for each weight w_ij.

    ∂E_{W,A°}(t)/∂w_ij = ∂E(A_{W,A°}(t), D_W(A_{W,A°}(t)))/∂w_ij        (18)
    = [∂E(Â(t), D(t))/∂A(t)] · [∂A(t)/∂w_ij] + [∂E(A(t), D̂(t))/∂D(t)] · [∂D(t)/∂w_ij]        (19)

Note that the multiplicands are vectors, and their dot product a scalar. The first term of each product is simply a partial derivative of E, and can be found directly from E's definition (whatever it is). The interesting terms are ∂A(t)/∂w_ij and ∂D(t)/∂w_ij. To derive them, we must specify A and D for the particular network architecture we're using. We define D_W(A) as follows:

    d_i = net_i − decay(a_i)    (where net_i(t) = Σ_j f(a_j) w_ij), or 0 if i is clamped.        (20)

Now we define A_{W,A°}(t_1) in terms of D:

    A_{W,A°}(t_1) = A° + ∫_{t=0}^{t_1} D_W(A_{W,A°}(t)) dt.        (21)

In the case where time is discrete, this is approximated by

    A_{W,A°}(t_1) = A° + Σ_{(t/Δt)=0}^{(t_1/Δt)−1} Δt · D(A_{W,A°}(t))        (22)
                  = A_{W,A°}(t_1 − Δt) + Δt · D(A_{W,A°}(t_1 − Δt)),  with A_{W,A°}(0) = A°        (23)

which is exactly how we simulate the net iteratively.

In order to continue, we will also have to find the partial derivatives of D_W(A). ∂D_Ŵ(A)/∂w_ij is given by

    ∂d_k(A)/∂w_ij = ∂net_k/∂w_ij        (24)
                  = ∂/∂w_ij Σ_l f(a_l) w_kl        (25)
                  = Σ_l f(a_l) ∂w_kl/∂w_ij        (26)
                  = f(a_j) δ_ik.        (27)

∂D_W(Â)/∂a_l is given by

    ∂d_k(Â)/∂a_l = ∂net_k/∂a_l − ∂decay(a_k)/∂a_l        (28)
                 = f′(a_l) w_kl − δ_kl decay′(a_l).        (29)

Now we have all the terms necessary to find the effect of individual weight changes on D and A. The two equations are written recursively in terms of each other.

    ∂D_W(A(t))/∂w_ij = ∂D_Ŵ(A(t))/∂w_ij + [∂D_W(Â(t))/∂A(t)] · [∂A(t)/∂w_ij]        (30)

i.e.,

    ∂d_k/∂w_ij = ∂d_k(A)/∂w_ij + Σ_l [∂d_k(Â)/∂a_l] [∂a_l(t)/∂w_ij]        (31)

    ∂A_{W,A°}(t_1)/∂w_ij = ∫_{t=0}^{t_1} [∂D_W(A_{W,A°}(t))/∂w_ij] dt                             (continuous case)
                         = ∂A_{W,A°}(t_1 − Δt)/∂w_ij + Δt · ∂D_W(A_{W,A°}(t_1 − Δt))/∂w_ij        (discrete case)

    with ∂A_{W,A°}(0)/∂w_ij = ∂A°/∂w_ij = 0.        (32)

3.2.3 An algorithm

The above equations, together with (14) and (17), lead directly to the following learning algorithm. For a network with a given set of weights W, this algorithm describes how to change the weights so as to reduce the error along the trajectory that starts at A° ∈ A:

1. Start the trajectory at A°: set A(0) = A° and ∂A(0)/∂w_ij = 0 (for each weight w_ij).

2. For each sampled time t, where t takes on the values t_0 = 0, t_1 = t_0 + Δt_0, t_2 = t_1 + Δt_1, ...:

   (a) Compute D(A(t)) and ∂D(A(t))/∂w_ij from A(t), ∂A(t)/∂w_ij, and W. This determines the network's direction from its current position, and its hypothetical direction (under a small weight change) from its hypothetical position.

   (b) Using the definition of E, compute the partials of E(A(t), D(A_{W,A°}(t))) with respect to A(t) and D(A_{W,A°}(t)). These numbers indicate how E would change if the system's position in A changed but not its dynamics, or vice versa.

   (c) From the quantities mentioned in the above two steps, calculate the partials of E with respect to each w_ij. This gives the gradient of E in W at time t.

   (d) Calculate the distance ΔL = Δt sqrt(Σ_i d_i²) that the network will move in A over the next time step, where Δt is the expected duration of that time step.

   (e) Accumulate weight changes against the gradient of E, in proportion to the learning rate and the instantaneous distance ΔL.

   (f) Find A(t + Δt) from A(t) and D(A(t)); and find ∂A(t + Δt)/∂w_ij from ∂A(t)/∂w_ij and ∂D(A(t))/∂w_ij. That is, determine where the network will be (and would be with different weights) on the next time step.

3. When the trajectory is complete, divide the accumulated weight changes by the total length of the trajectory and institute them. (Depending on the problem, we may call the trajectory complete when it has settled, used up more than its allotted time or distance, run out of input, finished producing its output, accumulated too much error, etc.)
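The following sketch puts the whole loop together for the discrete case, under the same illustrative assumptions as before (logistic f, linear decay, and, for brevity, no clamped units); dE_dA and dE_dD stand for the user-chosen partials of E, and all names are mine rather than the thesis's. It maintains one hypothetical derivative per weight, which is exactly the n-simultaneous-copies cost noted in section 3.1.4.

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dlogistic(a):
        s = logistic(a)
        return s * (1.0 - s)

    def train_trajectory(W, a0, dE_dA, dE_dD, lr=0.1, dt=0.05, steps=200,
                         decay_rate=1.0):
        """One pass of the section 3.2.3 algorithm (a sketch).

        W          -- weight matrix; W[k, l] is the weight to unit k from unit l
        a0         -- starting activation vector A°
        dE_dA, dE_dD -- callables (t, A, D) -> vector of partials of E
        Returns the accumulated weight update, normalized by trajectory length.
        """
        n = len(a0)
        a = a0.copy()
        dA = np.zeros((n, n, n))        # dA[k, i, j] ~ da_k / dw_ij, eq. (32); step 1
        dW_acc = np.zeros_like(W)
        total_len = 0.0

        for step in range(steps):
            f, fp = logistic(a), dlogistic(a)
            d = W @ f - decay_rate * a                    # D(A), eq. (20)

            dD = np.zeros((n, n, n))                      # dd_k / dw_ij
            for i in range(n):
                dD[i, i, :] += f                          # delta_ik f(a_j), eq. (27)
            J = W * fp[None, :] - decay_rate * np.eye(n)  # dd_k/da_l, eq. (29)
            dD += np.einsum('kl,lij->kij', J, dA)         # recursion, eq. (31)

            gA = dE_dA(step * dt, a, d)                   # step (b)
            gD = dE_dD(step * dt, a, d)
            grad = (np.einsum('k,kij->ij', gA, dA)        # chain rule, eq. (19); step (c)
                    + np.einsum('k,kij->ij', gD, dD))

            dL = dt * np.linalg.norm(d)                   # step (d)
            dW_acc -= lr * dL * grad                      # step (e)
            total_len += dL

            a = a + dt * d                                # step (f), eq. (23)
            dA = dA + dt * dD                             # eq. (32), discrete case

        return dW_acc / max(total_len, 1e-12)             # step 3

The O(n-weights) storage for dA is what makes the method expensive; the random-tweak variant mentioned in section 3.1.4 would trade that storage for slower learning.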
So long as all the Δt's are small,[20] there is no reason that they must be the same for every time step. For example, when processing t_i, we might choose Δt_i so that ΔL, the distance the network travels on the following time step, is a constant. Later we will also see discontinuous error measures that require irregularly spaced time steps.

[20] In fact, the learning algorithm will still work with large values of Δt, so long as they are consistent from one run of the trajectory to the next. However, using large values makes the network a different kind of computational device. For example, recurrent networks with large time steps are less likely to settle (mutually inhibitory units will oscillate, etc.). Section 2.1.1 outlines some other reasons to be interested in networks with small or infinitesimal time steps.

3.3 Summary of the general model

The above learning algorithm is more powerful than any in the literature to date. The strongest previous algorithm was only able to prescribe a target activation at every point in time (Williams & Zipser, 1988). The error measure of this algorithm, by contrast, can consider any or all of three quantities: the network's position A(t) in activation space, its instantaneous velocity vector D(t), and the current time t. This permits it to explicitly prescribe certain kinds of behaviors that have previously been neglected; in particular, time-dependent and position-dependent dynamics. In light of the discussion in section 2, these capabilities could be quite useful.

The learning algorithm performs gradient descent on an error surface by changing the network weights. For each weight w, the algorithm must determine the effect on the whole trajectory of a small change to w. It does so by gradually accumulating the small effects of such a change. As it figures out how a small change to w would change the trajectory, the algorithm also calculates the error for the changed trajectory. This tells it whether increasing w would increase or decrease error, and thus which way it should change w.

It is important that the formulation given above does not commit itself to any particular error measure. Any real-valued continuous function on T × A × A* will do. Together with the new ability to control dynamics, this gives one an enormous amount of flexibility in the kinds of behavior one can require of a network.
4 Particular Models

The general model described above permits us to define our network's error virtually any way we care to. Now we discuss some actual error measures that both demonstrate this flexibility and are useful for particular learning problems.

4.1 Some models of potential interest

Let us begin by outlining a number of interesting error measures. Only a few will be developed in full in this paper; but the discussion below should illustrate the generality of the overall approach. With appropriate error measures, it seems, this architecture can train a number of important behaviors, some of which have already been studied individually.

1. Williams and Zipser problems. The net's activation is prescribed at every time t. If Ā(t) is the desired trajectory through A, the error at t can be defined as the square of the Euclidean distance between A(t) and Ā(t).[21] When input to the system is provided by clamping nodes, this error measure yields the model derived in Williams and Zipser (1988).[22] (A sketch of this measure in code follows this list.)

[21] For simplicity, we will often refer to Ā(t) and other prescribed activations as points in A. In actuality they usually happen to be hyperplanes; i.e., they do not prescribe individual activations for every unit of the system. Thus we are really talking at the moment about the distance from a point to a hyperplane. Later we will talk about training the trajectory to pass through points, which really means training it to pass through hyperplanes.

[22] The two models are almost identical. Williams and Zipser's is slightly different in that it always has evenly spaced time steps and integrates error by time, not by length. Also, their model uses dedicated input lines (which our model's clamped nodes can easily simulate, in the style of the IA model discussed earlier).

2. Generalized Jordan problems. The net's activation is prescribed only at certain moments t_1, t_2, .... In between, it is free to take whatever path is easiest for it. E here is discontinuous. At the t_i it is computed as in number 1; it is 0 the rest of the time. The system's actual trajectory will be continuous, of course. Note that when simulating a network for this problem, we must be careful to include the important moments t_1, t_2, ... among our time samples. For the similar problems considered by Jordan (1986), the t_i were evenly spaced and constituted the only time samples for his (discrete-time) network.

3. Mapping problems. For a large class of problems, we may want the network's activation to settle in the long run to some target activation Ā. We only care that the network reaches this target; we are indifferent as to the path it takes. There are several ways to ensure that the network reaches the target. We may simply take the approach of number 2 and require that the network be at or near the target after some reasonable period of time. (We also need to require that the network is not moving from the target.) Alternatively, we may ask that each target node be constantly moving toward the target, or that the network as a whole moves closer in activation space to the target.

4. Repeated mapping problems. Another important class of tasks may be described as repeated mapping problems. If we ask a network to do many mapping problems in succession, it may be able to exploit regularities in the order of the problems it is given. Consider a network that is to transcribe disconnected speech. Each word is a separate mapping problem: from the phonemes, the network must derive a written representation. However, the previous words in the network's input can help it decode the current word. We can take this situation to extremes. The cumulative XOR problem, mentioned earlier (1.1.2), simply cannot be solved as a sequence of individual mappings. It requires that the network pay attention to the previous mappings.

5. Constraint satisfaction. A constraint satisfaction or pattern completion network just solves a special kind of mapping problem. The desired mappings have a special consistency. For example, if pattern A ∪ P maps to pattern Q, then A ∪ Q can legitimately map to P, insofar as the network is capable of implementing both mappings. We can reuse the error measures of number 3, which are suited for general mapping problems. We might also arrange error measures that take advantage of the redundancy in the mappings. For example, perhaps the network will learn faster if while we train it to bring free nodes to their correct activations, we simultaneously train it to deliver subsistence inputs to the clamped nodes. This will make it a little easier to train the clamped nodes when they are the free nodes of some other pattern. Essentially we are training on two patterns at once. We can make this technique even more useful by releasing the clamped nodes. After a short period of time, when the free nodes have had a chance to approach their targets, we can release the clamped nodes. If we continue to train the network, asking all the nodes to move to their targets, we are extending the basin around the target point in activation space.

6. Released mappings. In general, it is possible to do mapping problems without clamped nodes (using the same error measures). We can require that a system, when released at a particular point A of activation space, moves to its associated point Ā and stays there. Such a scheme is especially well-suited to perform transformations on input. Consider the case where A represents ring + PAST phonetically, and Ā represents rang + ∅. A released activation network can learn to automatically transform one to the other. Like the past-tense network of Rumelhart and McClelland (1986), such a network might be able to generalize from ring/rang to sing/sang and other similar pairs. Moreover, it has at least one major advantage over the direct mapping approach of that network. It manages to explain why the mapping sing/sang, which preserves most of the phonetic information, is easier to learn than one like sing/ka. This is a point on which Pinker and Prince (1988) severely criticize the Rumelhart and McClelland model.

7. Position-independent dynamics. The flip side of number 1 is to prescribe, not the network's activation at each time t, but its direction D̄(t). The error measure at time t is then the square of the distance between D(t) and D̄(t). This is not very different from number 1. Telling the network what path to follow is the same problem, after all, regardless of whether the path is specified in terms of its position over time or its direction over time. However, the present error measure successfully abstracts the idea of a trajectory's shape. If we want the network to generate a local pulse in A regardless of whether it is released from A°, A_1, or some other (possibly unanticipated) characteristic starting activation, this error measure is the natural one to use. Moreover, it forgives different kinds of error than the measure of number 1 does. If it is easiest for the network to move a bit to the right as it begins to generate its pulse, this error measure doesn't mind much, whereas the other considers all the points along the shifted pulse to be "wrong," and may try hard to correct it at the expense of other desirable trajectories elsewhere in the space.

8. Time-independent dynamics. We may wish to induce certain dynamical properties in the network, such as limit cycles. The error measure in this case depends on A and D only. It is given at time t by the Euclidean distance of D(A(t)) from its desired value, D̄(A(t)). Alternatively, we can work up error measures that put less exact constraints on D(t). We might just require that the D(t) fall in some particular hyperquadrant of A*. This means that all the activations are moving in particular directions, at whatever speeds.

9. Gradient descent. A special case of number 8 is gradient descent. The learning rule already performs gradient descent in activation space on E. But we can actually train the network to perform gradient descent in activation space on some other measure G. That is, let G be a differentiable function from A to R. At any point A ∈ A, we want to make the net's direction of movement D(A) proportional to that prescribed by the gradient of G at A. A system that has learned these dynamics correctly will have attractors at the local minima of G (and nowhere else). We will see later that this technique can be used to implement a content-addressable memory, which is a particular kind of constraint-satisfaction device.
  • logies
  • f
p
  • ten
tial in terest F
  • r
a giv en task, a learning rule's success ma y dep end
  • n
the top
  • logy
  • f
the net w
  • rk
that is trying to learn the task. Section 2.1.3 has already explored 41
slide-43
SLIDE 43 in detail the p
  • ten
tial
  • f
small recurren t clusters. Here w e men tion t w
  • ther
top
  • logical
prop erties. Symmetri c w eigh ts are useful for constrain t satisfaction problems. In fact, Boltzmann mac hines (Hin ton & Sejno wski, 1986) use them exclusiv ely . They are w ell suited to suc h problems, b ecause they capture the idea
  • f
a correlation
  • r
an ticorrelation b et w een t w
  • units.
Symmetri c w eigh ts are v ery easy to implem en t. T
  • impleme
n t the w eigh t w connecting i and j , w e can simply regard it as t w
  • separate
but equal w eigh ts, w ij and w j i . Eac h has its
  • wn
eect
  • n
the error, so that @ E =@ w = @ E =@ w ij + @ E =@ w j i : In
  • rder
to mak e a w eigh t c hange to w , w e c hange it along the error gradien t for w ij , and then along the error gradien t for w j i . Sigma-pi (m ultiplicativ e) connections are also useful for constrain t satis- faction. Sigma-pi connections mak e it p
  • ssible
to solv e the symmetric X OR problem with no hidden units at all. The solution is v ery simple: eac h unit, when activ e, c hanges the connection b et w een the
  • ther
t w
  • units
from m u- tually excitatory to m utually inhibitory . This exactly captures the meaning
  • f
symme tric X OR. In general, the use
  • f
b
  • th
symmetric and sigma-pi connections ma y b e v ery helpful. When a gated symmetri c connection links units i and j , the rest
  • f
the net w
  • rk
is determining the degree
  • f
correlation b et w een the t w
  • .
This is a sensible paradigm for constrain t satisfaction. The deriv ation
  • f
a learning rule for sigma-pi units is relativ ely straigh t- forw ard. The
  • nly
c hanges are to the expressions for net i , D(t), and the partials
  • f
D(t). The form ulas are mildly messy , ho w ev er, and not really w
  • rth
including here. 4.3 Detailed deriv ation
  • f
particular error measures 4.3.1 Mapping mo del I: No des to w ard targets Section 2.2.5
  • bserv
ed that tasks requiring a net w
  • rk
to settle to particular v alues are really requiring certain dynamics
  • f
the net w
  • rk.
The net w
  • rk
will settle to a p
  • in
t A 2 A if and
  • nly
if it has established A as an attractor whose basin includes the net w
  • rk's
starting p
  • in
t. There are sev eral t yp es
  • f
error measure that migh t encourage the net w
  • rk
to establish suc h attractors. Since w e are in terested in exploring the general mo del's abilit y to consider net w
  • rk
dynamics, ho w ev er, w e will c ho
  • se
  • ne
42
slide-44
SLIDE 44 that connes itself to prescribing dynamics. Sp ecically , w e will require that eac h no de mo v e to w ard its curren t target at all times. This is a strong condition, in that it requires go
  • d
b eha vior from ev ery target no de. Ho w ev er, it has the adv an tage
  • f
b eing lo cal, and therefore easy to compute. 23 It is also quite lenien t with resp ect to the no des' exact b eha vior. Eac h no de m ust b e mo ving to w ard its target with some minim um sp eed, but
  • therwise
is free to tra v el as quic kly
  • r
slo wly as is con v enien t. Let
  • >
b e the required minim um sp eed
  • f
a no de i to w ard its target. Then w e w an t the no de's lo cal error e i to b e when its actual sp eed d i
  • r
d i exceeds , and
  • therwise
the amoun t b y whic h it falls short
  • f
. (This could b e substan tial if it is mo ving in the wrong direction.) W e set the
  • v
erall error, E , to P i e 2 i . This is the ideal measure, but it is not con tin uous at 0,
  • r
dieren tiable when the no de is mo ving at exactly the minim um sp eed. W e can x it up as follo ws. Instead
  • f
using a constan t , let
  • =
  • ja
i
  • t
i j n , where n > 1 and
  • is
a constan t. This mo v e mak es E con tin uous and dieren tiable at 0. It means that error gets more serious when the no de is far from the target (and substan tially more serious for large n). No w in the case where a i is less than
  • r
equal to its target v alue t i , w e determine e i as e i = 8 > < > : d i +
  • if
d i < 0, if d i > ,
  • r
(d i
  • )
3 = 2 + 2(d i
  • )
2 = if d i 2 [0; ]. (33) The deriv ativ es are straigh tforw ard to nd. If a i > t i , the
  • nly
dierence is that instead
  • f
w an ting d i > , w e w an t d i > ; so w e simply substitute d i for d i . (It is unnecessary to also rev erse the sign
  • f
e i , since it will b e squared an yw a y .) When in tegrating this measure
  • v
er time,
  • ne
should realize that it ma y b e imp
  • ssible
for all units to mo v e to w ard their targets from the v ery b egin- ning. The hidden units ma y ha v e to \c harge up" b efore the system starts mo ving in the righ t direction. One w a y to tak e this in to accoun t is to m ulti- ply E at ev ery step b y a factor lik e (1
  • e
t= ), where
  • is
a constan t. This is called a soft start. 23 There is certainly no philosophical adv an tage to a lo cal error measure in this v ery nonlo cal algorithm. Ho w ev er, if a lo cal metho d for training dynamic b eha vior is disco v ered in future, suc h a measure migh t indeed b e desirable. 43
slide-45
SLIDE 45 4.3.2 Mapping mo del I I: System to w ard target W e can dene a similar but nonlo cal error measure that mak es few er demands
  • n
the individual units. W e simply ask the system to con tin ually reduce its Euclidean distance to w ard the target. In actual fact, it is easiest to ha v e it reduce the square
  • f
that distance. Let z i = a i
  • t
i for all units i with target v alues. Then the square
  • f
the distance to the target is Z = P i z 2 i (summing
  • v
er these units). That v alue is c hanging with resp ect to time at a rate
  • f
d Z dt = 2 X i z i dz i dt = 2 X i z i da i dt = 2 X i z i d i : (34) Let us require this v alue to b e negativ e, and less than some quan tit y 2. Then w e can dene
  • ur
error as E = max (0; x) (35) where x =
  • +
X i z i d i (36) T
  • mak
e this con tin uous, w e actually use E = xh(x), where x is a sharp logistic function. The partial deriv ativ es
  • f
this error measure are giv en b y dE = dx h(x) + xh (x)dx (37) = dx (h(x) + xh (x)) (38) @ E @ a i = @ x @ a i (h(x) + xh (x)) (39) = d i (h(x) + xh (x)) (40) @ E @ d i = @ x @ d i (h(x) + xh (x)) (41) = z i (h(x) + xh (x)) (42) This measure has the apparen t adv an tage that it will p ermit some units to mo v e sligh tly a w a y from their targets in
  • rder
for the system as a whole to get closer to its target in A. In
  • ther
w
  • rds,
the system's p
  • ssible
tra jectories are somewhat less constrained. There is a geometrical in terpretation
  • f
this 44
slide-46
SLIDE 46 fact. Instead
  • f
ha ving to approac h the target b y p enetrating the corners
  • f
successiv ely smaller cub es cen tered at the target, as with the previous error measure, the system
  • nly
has to mo v e inside successiv ely smaller spheres. 4.3.3 General gradien t-descen t mo del The general gradien t-descen t mo del is v ery simple. Let G b e an energy function
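Equations (34)-(42) translate into a few lines; the sharpness of the logistic h is an illustrative parameter of mine.

    import numpy as np

    def system_error(a, d, targets, mu=0.1, sharpness=50.0):
        """Mapping Model II error (eqs. 34-42), a nonlocal measure.

        z = a - targets; Z = sum z_i^2 must fall at rate at least 2*mu,
        i.e. x = mu + sum_i z_i d_i must be non-positive.
        Returns (E, dE_da, dE_dd).
        """
        z = a - targets
        x = mu + np.dot(z, d)                        # eq. (36)
        h = 1.0 / (1.0 + np.exp(-sharpness * x))     # sharp logistic h(x)
        hp = sharpness * h * (1.0 - h)               # h'(x)
        E = x * h                                    # smoothed max(0, x)
        common = h + x * hp                          # shared factor, eqs. (38)-(42)
        return E, d * common, z * common             # E, eq. (40), eq. (42)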
4.3.3 General gradient-descent model

The general gradient-descent model is very simple. Let G be an energy function on A. The system's dynamics are to ensure that the system will follow the gradient of G to a local minimum. In other words, we want D(A) = −α Grad_A G, for some α > 0. Component-wise, for a given A we require that d_i = d̄_i, where

    d̄_i = −α ∂G(A)/∂a_i.        (43)

For this or any case where the learning rule prescribes a particular dynamical system over activation space, we can define E and find its derivatives as follows.

    E(A, D) = (1/2) Σ_i (e_i(A, D))²        (44)

where

    e_i(A, D) = d_i(A) − d̄_i(A)        (45)
    ∂E(A, D) = Σ_i e_i(A, D) ∂e_i(A, D)        (46)
    ∂e_i(Â, D)/∂a_j = −δ_ij ∂d̄_i(A)/∂a_j        (47)
    ∂e_i(A, D̂)/∂d_j = δ_ij.        (48)

Hence

    ∂E(Â, D)/∂a_j = e_j(A, D) (−∂d̄_j(A)/∂a_j)        (49)
    ∂E(A, D̂)/∂d_j = e_j(A, D).        (50)

To teach a network to do gradient descent on G, then, we only need to specify expressions for d̄_i = −α ∂G(A)/∂a_i (equation 43) and ∂d̄_i(A)/∂a_i (equation 49).
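The recipe of equations (44)-(50) is generic, so it fits in one small function; dbar and dbar_diag are the two problem-specific ingredients just named, supplied by the caller.

    import numpy as np

    def prescribed_dynamics_error(a, d, dbar, dbar_diag):
        """Error and partials for a prescribed dynamical system (eqs. 44-50).

        dbar(a)      -- desired velocity vector, e.g. -alpha * grad G (eq. 43)
        dbar_diag(a) -- vector of diagonal partials d(dbar_i)/da_i (eq. 49)
        Returns (E, dE_da, dE_dd).
        """
        e = d - dbar(a)                  # eq. (45)
        E = 0.5 * np.sum(e ** 2)         # eq. (44)
        dE_da = -e * dbar_diag(a)        # eq. (49)
        dE_dd = e                        # eq. (50)
        return E, dE_da, dE_dd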
SLIDE 47 4.3.4 Con ten t-addressable memory mo del With an appropriate denition
  • f
G, training the net w
  • rk
to do gradien t descen t can get it to act as a con ten t-addressable memory . This is a form
  • f
constrain t satisfaction. The net w
  • rk
is to ha v e a n um b er
  • f
\memories," represen ted b y particular p
  • in
ts in A. If the net w
  • rk
is released an ywhere in activ ation space, it should end up settling at
  • ne
  • f
these memories. The net w
  • rk
should tend to con v erge to memorie s with activ ations similar to its starting p
  • in
t. Ho w ev er, a relativ ely \strong" memory should b e able to attract it from farther a w a y . F
  • r
eac h pattern p 2 A that serv es as a memory , w e dene g p (A) = 1 2 X i (a i
  • p
i ) 2 : (51) This denes a large b
  • wl
cen tered
  • n
p with g p (p) = as the
  • nly
minim um . No w let G(A) = Y p g p (A) s p ; (52) where s p > measures the strength
  • f
pattern p, and where P p s p = 1: The surface G has its minima at the patterns p. G(p) = for eac h p, and G(p) > in the imme diate neigh b
  • rho
  • d
  • f
p. s p shap es the surface depressions surrounding eac h pattern. If s p is v ery small, then the function g p s p is close to 1 ev erywhere, at except for a deep p
  • c
kmark immediatel y surrounding p, where its v alue decreases to 0. If s p is comparativ ely large,
  • n
the
  • ther
hand, g p s p describ es a wide b
  • wl
the w a y that g p do es. Multiplying all the terms together yields a con tin uous surface, with zero es at all the memories, sides sloping do wn to w ards the strong memories, and p
  • c
kmarks at the w eak
  • nes.
Gradien t descen t
  • n
G will settle at memories according to the conditions describ ed earlier in this section. 24 This measure G rolls all the patterns together in a complex w a y . W e w an t to train the net w
  • rk
to follo w the gradien t
  • f
G. The surprising result
  • f
this section is that there is a seemingly practical algorithm to do this. It turns 24 One minor exception. It ma y happ en that, as the net w
  • rk
is descending to w ard a strong memory , it passes through the attractor basin
  • f
some w eak memory . In this case, the net w
  • rk
will end up settling at a w eak memory that w asn't close to its starting p
  • in
t. This case will rarely
  • ccur
in practice, ho w ev er, since w eak memories ha v e small attractor basins. 46
slide-48
SLIDE 48
  • ut
that the net w
  • rk
can learn the correct b eha vior b y separate study
  • f
the individual patterns, t w
  • at
a time. Sp ecically , the net w
  • rk
needs to mo v e in w eigh t space so as to b etter follo w G at some p
  • in
t A|and its required direction
  • f
mo v em en t in W can b e giv en as the sum
  • f
man y small direction v ectors, eac h term determined b y an individual pair
  • f
patterns. F
For the network to learn to follow the gradient, the learning rule of section 4.3.3 needs to know what the gradient actually is. So we must compute \partial G(A)/\partial a_i. First of all, we note that

\frac{\partial g_p(A)}{\partial a_i} = \frac{\partial}{\partial a_i} \frac{1}{2}(a_i - p_i)^2 = a_i - p_i. \qquad (53)

If G(A) = 0, then there is some pattern q for which g_q(A) = 0. It follows that a_i = q_i, and that \partial g_q(A)/\partial a_i = 0 (from (53)). Moreover, \partial g_q(A)^{s_q}/\partial a_i = s_q\, g_q(A)^{s_q - 1}\, (\partial g_q(A)/\partial a_i) = 0. Then we see that

\frac{\partial G(A)}{\partial a_i} = \frac{\partial}{\partial a_i}\left( g_q(A)^{s_q} \prod_{p \neq q} g_p(A)^{s_p} \right) \qquad (54)
= \left( \frac{\partial}{\partial a_i}\, g_q(A)^{s_q} \right) \prod_{p \neq q} g_p(A)^{s_p} + g_q(A)^{s_q} \left( \frac{\partial}{\partial a_i} \prod_{p \neq q} g_p(A)^{s_p} \right) \qquad (55)
= 0 + 0 = 0. \qquad (56)

If G(A) \neq 0, on the other hand, we are permitted to factor out a G(A) term and write

\frac{\partial G(A)}{\partial a_i} = G(A) \sum_p \frac{\partial (g_p(A)^{s_p})/\partial a_i}{g_p(A)^{s_p}} \qquad (57)
= G(A) \sum_p \frac{s_p\, g_p(A)^{s_p - 1}\, (\partial g_p(A)/\partial a_i)}{g_p(A)^{s_p}} \qquad (58)
= G(A) \sum_p \frac{s_p\, (\partial g_p(A)/\partial a_i)}{g_p(A)} \qquad (59)
= G(A) \sum_p \frac{s_p\, (a_i - p_i)}{g_p(A)}. \qquad (60)

Now we can define the quantities required by section 4.3.3. We want \hat{d}_i to be proportional to -\partial G(A)/\partial a_i. We are actually going to set it proportional to (\partial G(A)/\partial a_i)/G(A). This amounts to the same thing, since G(A) is positive and constant at A. \hat{D} = \langle \hat{d}_1, \hat{d}_2, \ldots \rangle will still point in the same direction as the negative gradient, although it will be scaled by a factor of \lambda/G(A).

\hat{d}_i = -\lambda \sum_p \frac{s_p\, (a_i - p_i)}{g_p(A)}, \quad \text{or } 0 \text{ if some } g_p(A) = 0 \qquad (61)

\frac{\partial \hat{d}_i}{\partial a_i} = \lambda \sum_p \frac{s_p \left[ (a_i - p_i)^2 - g_p(A) \right]}{g_p(A)^2}, \quad \text{or } 0 \text{ if some } g_p(A) = 0 \qquad (62)
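The relationship (57)-(60) between \hat{d} and the gradient can be checked numerically. The sketch below, again illustrative Python rather than anything from the original experiments, compares the target velocity of (61) against a finite-difference estimate of -\lambda/G(A) times the gradient of G; the value of lam and the random patterns are arbitrary choices.

```python
import numpy as np

np.random.seed(0)
patterns = [np.random.rand(3) for _ in range(4)]
strengths = np.array([0.4, 0.3, 0.2, 0.1])   # the s_p, summing to 1
lam = 0.05                                    # the speed constant lambda

def g(A, p):
    return 0.5 * np.sum((A - p) ** 2)

def G(A):
    return np.prod([g(A, p) ** s for p, s in zip(patterns, strengths)])

def d_hat(A):
    """Target velocity of eq. (61): -lam * sum_p s_p (a_i - p_i) / g_p(A)."""
    if any(np.isclose(g(A, p), 0.0) for p in patterns):
        return np.zeros_like(A)
    return -lam * sum(s * (A - p) / g(A, p)
                      for p, s in zip(patterns, strengths))

# Finite-difference check: d_hat should equal -lam/G(A) times the gradient
# of G, i.e. it points exactly down the gradient, rescaled by 1/G(A).
A = np.random.rand(3)
eps = 1e-6
grad = np.array([(G(A + eps * e) - G(A - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(d_hat(A))                 # target velocity from eq. (61)
print(-lam * grad / G(A))       # same vector, per eqs. (57)-(60)
```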
Now we go ahead and derive the expression for the error gradient of the network. From equations (49-50), and taking advantage of the fact that \sum_p s_p = 1, we deduce that

\frac{\partial E}{\partial d_i} = e_i = d_i - \hat{d}_i \qquad (63)
= d_i + \lambda \sum_p \frac{s_p\, (a_i - p_i)}{g_p(A)} \qquad (64)
= \sum_p s_p \left( d_i + \lambda\, \frac{a_i - p_i}{g_p(A)} \right) \qquad (65)

\frac{\partial E}{\partial a_i} = e_i \left( -\frac{\partial \hat{d}_i}{\partial a_i} \right) \qquad (66)
= -\lambda \left( \sum_p s_p \left( d_i + \lambda\, \frac{a_i - p_i}{g_p(A)} \right) \right) \left( \sum_q s_q\, \frac{(a_i - q_i)^2 - g_q(A)}{g_q(A)^2} \right) \qquad (67)
= \sum_p \sum_q \left[ -\lambda\, s_p s_q \left( d_i + \lambda\, \frac{a_i - p_i}{g_p(A)} \right) \frac{(a_i - q_i)^2 - g_q(A)}{g_q(A)^2} \right] \qquad (68)

Our weight changes are prescribed in terms of (63-68) by the usual rule,

\Delta w_{ij} = -\frac{\partial E}{\partial w_{ij}} = -\sum_i \left( \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial w_{ij}} + \frac{\partial E}{\partial d_i} \frac{\partial d_i}{\partial w_{ij}} \right).

In practice, we need not compute the full summations of equations (63-68) above. The following approach suffices. When training the network on a given trial, we simply pick two patterns, p and q. We make our weight changes under the pretense that

\frac{\partial E}{\partial d_i} = d_i + \lambda\, \frac{a_i - p_i}{g_p(A)} \qquad (69)

\frac{\partial E}{\partial a_i} = -\lambda \left( d_i + \lambda\, \frac{a_i - p_i}{g_p(A)} \right) \frac{(a_i - q_i)^2 - g_q(A)}{g_q(A)^2} \qquad (70)

If the choices of p and q are independent, and each pattern p is chosen s_p of the time, then the average per-trial weight changes match those prescribed by (63) and (66). This is a nice result. It means that the network will be trained to follow the gradient of G(A) = \prod_p g_p(A)^{f_p}, where f_p represents the frequency with which the network sees pattern p. In other words, the frequency with which the pattern is presented will exactly equal the strength of the pattern.
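The expectation argument is easy to verify numerically. In the illustrative sketch below, f stands in for the bracketed per-pair term of (68); drawing p and q independently with probabilities s_p reproduces the full double sum on average.

```python
import numpy as np

# A minimal check of the sampling argument, assuming NumPy: each (p, q)
# pair is drawn with probability s_p * s_q, so the sampled mean converges
# to the full double sum of eq. (68). f is an arbitrary stand-in for the
# per-pair update term.

np.random.seed(1)
s = np.array([0.5, 0.3, 0.2])                # pattern strengths, summing to 1
f = np.random.randn(3, 3)                    # arbitrary per-pair update terms

full_sum = sum(s[p] * s[q] * f[p, q] for p in range(3) for q in range(3))

draws_p = np.random.choice(3, size=200_000, p=s)
draws_q = np.random.choice(3, size=200_000, p=s)
sampled_mean = f[draws_p, draws_q].mean()

print(full_sum, sampled_mean)                # agree to a few decimal places
```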
Note that \lambda controls the speed at which the trained network is expected to trace the gradient of G. If we are willing to use a small value of \lambda, the term \partial E/\partial a_i becomes insignificant. This is the only term that involves the pattern q. For small \lambda, we can safely ignore that term and still make approximately the correct weight changes. That is, suppose there are n training patterns. A sufficiently small \lambda can free us from having to show the network all n^2 possible pattern pairs (p, q) on each training epoch. In effect, by thus ignoring the \partial E/\partial a_i factor, we are asking the system simply to achieve a better gradient in the region of its trajectory, and not worry about whether its trajectory also shifts in such a way as to decrease error.

5 Simulation Results
The error measures developed above were subsequently tested on various forms of the XOR problem. XOR problems were deliberately chosen so as to make things difficult for the learning rules. The traditional sum-of-squares expression (i.e., sum of distances squared) is a very straightforward error measure. If we are to replace it with any of the less obvious measures developed above, we ought to make sure that the replacements are able to achieve good performance on difficult problems.

XOR is a good test for two reasons. First, it cannot be solved unless the network develops specialized hidden units. Second, gradient descent solutions to XOR usually spend a tremendous amount of time in a saddle point of the error surface before they solve the problem. They move quickly into a highly stable state where they output 0.5 in response to all inputs, and develop the necessary hidden units only with great difficulty. Using a different error surface is unlikely to remove this saddle point, since its existence seems to result from the lack of any zero-order or first-order structure in the XOR test patterns. However, a different error measure might have different curvature there, making it easier or harder for the system to escape.
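A small worked example makes the saddle point concrete. Assuming error is measured as squared distance on the single output node (an assumption about the exact convention, which is defined earlier in the document), the stuck all-0.5 state contributes identical error on every XOR pattern, so nothing distinguishes the patterns to first order; this is also the per-pattern value of 0.25 quoted for the saddle point in section 5.2 below.

```python
# XOR truth table; a net stuck outputting 0.5 for every input sits at the
# saddle point described above. With error taken as squared distance on
# the single output node, every pattern contributes the same error.

xor_patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

stuck_output = 0.5
errors = [(stuck_output - target) ** 2 for _, target in xor_patterns]
print(errors)                      # [0.25, 0.25, 0.25, 0.25]
print(sum(errors) / len(errors))   # 0.25 per pattern, the saddle value
```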
There is one important detail about the simulator that needs to be described. Different error measures have different optimal learning rates. In order that the measures could be compared without having to find optimal learning rates for each, and simply to increase performance, the weights were updated after each epoch according to the delta-bar-delta rule of Jacobs (1987). This is an improved version of the momentum heuristic (Rumelhart, Hinton & Williams, 1986). In momentum, the weight change vector is averaged with other, recent weight change vectors, so that oscillation on any dimension will cancel itself out. Like momentum, delta-bar-delta does not follow the gradient exactly. Each weight has its own, variable learning rate, empirically determined from the local curvature of the error surface in that dimension. The rule attempts to find optimal learning rates for all weights; it is applicable not only to backpropagation, but to gradient descent techniques in general.[25]

[25] The delta-bar-delta rule seemed to normalize learning in these experiments, although not as much as one would hope. In particular, scaling the error measure by a constant factor did result in changes in the learning rate. Why would this be? One possibility is that the network was at the bottom of a longitudinally curving ravine. In such a case, the learning rates will presumably be forced to change on most time steps, so that the parameters that control how much they change per step become significant.
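For readers unfamiliar with the rule, the sketch below follows Jacobs' published description of delta-bar-delta; the constants theta, kappa, and phi, the initial rate, and the toy task are illustrative choices, not the settings used in these experiments.

```python
import numpy as np

# Sketch of the delta-bar-delta rule (Jacobs, 1987). Constants and the toy
# task below are illustrative, not values from the thesis runs.

theta, kappa, phi = 0.7, 0.01, 0.1

def delta_bar_delta_step(w, grad, rates, avg_grad):
    """One update: each weight keeps its own learning rate.

    A rate grows additively while the current gradient component agrees in
    sign with an exponential average of past gradients (a long, straight
    ravine), and shrinks multiplicatively when the signs disagree
    (oscillation across a ravine wall)."""
    agree = avg_grad * grad
    rates = np.where(agree > 0, rates + kappa,
             np.where(agree < 0, rates * (1 - phi), rates))
    avg_grad = (1 - theta) * grad + theta * avg_grad
    w = w - rates * grad
    return w, rates, avg_grad

# Toy usage on a quadratic bowl E(w) = 1/2 * w . w, whose gradient is w.
w = np.array([1.0, -2.0])
rates = np.full_like(w, 0.05)
avg_grad = np.zeros_like(w)
for _ in range(50):
    w, rates, avg_grad = delta_bar_delta_step(w, w, rates, avg_grad)
print(w)   # driven toward the minimum at the origin, with growing rates
```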
Note also that the error measures above all require the network to achieve correct activations for their target nodes, not merely correct outputs. Outputs are given by logistic functions, and hence would always be very close to 0 or 1. One advantage of our gradual-response networks over a Boltzmann machine is that their inputs and outputs need not have this binary distinction. Continuous I/O variables can be implemented through setting and examining activations. This is a particular advantage for a net that is to make distinctions among real-world inputs, or act as a memory.
The network was deemed to have solved a problem if its actual activations were within 0.1 of their targets. For binary targets, this is a stricter criterion than having output within 0.1 of target. There is no logistic function to force outputs toward 0 or 1, or to restrict them to the range (0, 1). Indeed, a key property of continuous readout is that the value can err by being either too large or too small. If network output were restricted to (0, 1), however, the network could be fairly sloppy, because it would be impossible for the output to overshoot a target of 1, or fall below a target of 0.

5.1 Results for feedforward XOR
As a basic test of the error measures, the system was asked to solve the ordinary XOR problem (almost two dozen times, using different parameters). The network used a feedforward 2-2-1 topology. To check the model, it was first run using the standard backpropagation definition of error, measured only at the very end of the trajectory. (That is, E was integrated only over the last time step. A Williams and Zipser network would also be capable of doing this, with minimal modification.) The two runs required 1660 and 1876 epochs to converge. As a further check on these runs, the system was asked to continually compare its weight changes with the weight changes prescribed by backpropagation.[26] The two prescriptions were always identical to within a few percent.

[26] When decay(a_i) = a_i, as it did here, the asymptotic activations of the gradual-response net are identical to the activations of an ordinary feedforward net that uses the same weights. Since both nets compute the same mappings, and use the same error measure in this case, they should compute the same gradient.
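Footnote 26's equivalence is easy to check numerically. The sketch below assumes gradual-response dynamics of the form da_i/dt = -decay(a_i) + logistic(net_i) with decay(a) = a (an assumed form; the thesis's precise equations are given earlier in the document). Under that assumption each unit's fixed point is logistic(net_i), which is exactly the ordinary feedforward activation.

```python
import numpy as np

# Check of footnote 26's claim under the assumed dynamics
# da/dt = -a + logistic(net), i.e. decay(a) = a.

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
x = np.array([1.0, 0.0])

# Ordinary feedforward pass with the same weights.
h_ff = logistic(W1 @ x)
y_ff = logistic(W2 @ h_ff)

# Gradual-response net relaxed by Euler integration until it stops moving.
h = np.zeros(2); y = np.zeros(1); dt = 0.05
for _ in range(5000):
    h += dt * (-h + logistic(W1 @ x))
    y += dt * (-y + logistic(W2 @ h))

print(h_ff, h)   # asymptotic activations match the feedforward ones
print(y_ff, y)
```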
Next, the network was given the "nodes toward targets" (NTT) error measure of section 4.3.1, using \lambda = 1.0 and n = 1.005. The results were excellent. On one run, the network took 1687 epochs to converge on the solution. On the other, it managed to satisfy the 10% criterion in only 942 epochs, and when allowed to continue running, passed the 1% criterion at epoch 1175. This was easily the best performance that any network achieved on the problem. Hence, at least under these limited conditions, NTT was able to solve the problem in fewer epochs than backpropagation (or Williams and Zipser).
Of course, each epoch is computationally very demanding, but only because the algorithm can work equally well in recurrent networks. The point is that the NTT error surface has at least as clear a path to the zero-error minima as does sum-of-squares. Since the network's convergence is actually evaluated on sum-of-squares, the high performance of NTT is not a trivial result.

The Euclidean measure, at least for the tested values of \lambda, actually converged far more slowly. Its short-term error reductions were only in the fourth decimal place; at 2000 epochs it was usually still trying to edge out of the saddle point. When there is only one target node, the Euclidean measure is similar to the NTT measure. The difference is that where NTT requires nodes that are far from their targets to approach faster, the Euclidean measure (since it asks the square of the distance to decrease at a constant rate) actually makes a lesser demand on faraway nodes.
This problem was first addressed by replacing \lambda in the error measure with \lambda\sqrt{Z}. In other words, distance squared was required to decrease at a constant proportion of the distance; i.e., distance had to decrease at a constant rate. This revised Euclidean measure still took almost 3500 epochs to converge. Finally, \sqrt{Z} was simply replaced with Z, to get the same effect as NTT. In this case, the Euclidean measure was able to solve the problem in a (still slow) 2300 epochs.
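The three demands can be compared directly as one-dimensional dynamics. Writing Z for the squared distance to target, the original Euclidean measure asks dZ/dt = -\lambda, the revised measure asks dZ/dt = -\lambda\sqrt{Z}, and the final variant asks dZ/dt = -\lambda Z, the exponential approach that NTT demands of each node. The illustrative sketch below integrates all three from the same starting distance; lam and the time span are arbitrary.

```python
import numpy as np

# The three Euclidean-style demands compared, with Z the squared distance
# to target and lam an illustrative rate. Far from the target, the constant
# demand is the weakest; the proportional demand closes distance fastest.

lam, dt = 0.5, 0.01
Z = {"constant": 4.0, "sqrt": 4.0, "proportional": 4.0}   # start far away

for step in range(200):   # integrate to t = 2
    Z["constant"]     = max(0.0, Z["constant"] - dt * lam)
    Z["sqrt"]         = max(0.0, Z["sqrt"] - dt * lam * np.sqrt(Z["sqrt"]))
    Z["proportional"] = Z["proportional"] - dt * lam * Z["proportional"]

print(Z)   # proportional < sqrt < constant: the original measure makes
           # the lesser demand on faraway nodes, as described above
```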
It turns out that the NTT and Euclidean measures are sensitive to their parameters. If a network using one of these measures is required to move too quickly toward its targets, it will arrive very quickly at an unfortunate (but innovative) local minimum. Ignoring its input, it will simply dart far below 0 (or far above 1), incurring some error, and then begin to return at the required speeds. In this situation, it is moving toward its target regardless of whether that target happens to be 0 or 1!

Although the network's weights in this case do not solve the XOR problem, they demonstrate dramatically that the error measure is training dynamics: the model is learning how to move rather than how to be somewhere. Indeed, it has learned how to use its hidden units to generate a pulse. This is one of the behaviors noted for clusters in section 2.1.3. Unexpectedly, the network has achieved it without any recurrent connections, relying only on hidden units that pass threshold at different times.
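One way to realize such a pulse without recurrence is sketched below, with hand-picked illustrative weights rather than anything learned in these runs. For clarity the two units are chained, so the second crosses threshold only after the first has risen; the thesis's own nets found the effect within a single hidden layer, so this is an illustration of the behavior, not a reconstruction of it. The readout rises and then falls again.

```python
import numpy as np

# A purely feedforward gradual-response net emitting a pulse: unit h2 is
# driven by h1 and so crosses threshold later; their difference is a
# transient. Weights are hand-picked for illustration.

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

dt, x = 0.02, 1.0
h1 = h2 = 0.0
pulse = []
for step in range(500):                      # integrate to t = 10
    h1 += dt * (-h1 + logistic(6.0 * x - 3.0))
    h2 += dt * (-h2 + logistic(10.0 * h1 - 6.0))
    pulse.append(h1 - h2)                    # readout: difference of the two

print(round(pulse[0], 3), round(max(pulse), 3), round(pulse[-1], 3))
# near 0 -> large -> near 0 again: a transient pulse with no recurrence
```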
The network also adopted this dodge on the two occasions it was given a "soft start" (see 4.3.1). It took advantage of the error measure's early leniency to quickly get below 0, then returned as required. One possible fix for such aberrations is to use a hybrid measure, averaging together an ordinary sum-of-squares measure and one of the dynamic measures. This would require units to be both near the target and moving toward it. On two trials, the NTT measure was combined in this way with sum-of-squares. The hybrid measure did solve the XOR problem successfully in 1689 epochs. However, increasing \lambda only brought back the original difficulties.

5.2 Other tasks
The feedforward XOR tasks showed quite clearly that the model worked, and that the network was capable of learning particular dynamics at all. Furthermore, they showed that error measures based on network dynamics work even for problems whose solutions are static.

The other tests were less satisfying. Learning in recurrent networks appeared to work, but proved too slow to test fully. In addition, the content-addressable memory model turned out to have a serious flaw.

Recurrent-network learning was tested using a symmetric XOR problem. Since the general algorithm of section 3 is essentially a generalization of Williams and Zipser's (1988) algorithm, it must share that algorithm's ability to teach complex tasks to recurrent networks. The question is whether it can teach them effectively using something other than a sum-of-squares error measure. The network was asked to solve symmetric XOR six times, using each of three error measures on both a two-hidden-unit and a three-hidden-unit topology (both fully recurrent). On one occasion, NTT was permitted to run for a long time, and found a solution at epoch 8336. For the other tests, which were terminated after 4000 epochs, no solution was found. In each case, error was reduced steadily (indeed, more steadily than in the feedforward task), but at an excruciatingly slow rate. When applying NTT to either network, for example, 4000 epochs were only enough to bring error down from a per-pattern average of 0.25 (the saddle point) to slightly over 0.24.
  • f
section 4.3.4. 27 The learning rule tries to mak e the 27 The test in v
  • lv
ed a single visible unit that w as supp
  • sed
to learn three real-n um b er \memories,"
  • f
v arying strengths, with the help
  • f
three hidden units and fully recurren t 53
slide-55
SLIDE 55 net w
  • rk
descend along the gradien t
  • f
G in activ ation space. It do es indeed get the direction
  • f
the gradien t correct at ev ery p
  • in
t: it tries to ac hiev e a v elo cit y v ector at eac h p
  • in
t A that is exactly
  • =G(A)
times the gradien t at A. The dicult y is that G(A) ma y b e v ery small. A tra jectory that comes to
  • near
a training pattern will th us ha v e enormous w eigh t c hanges prescrib ed for it in that region. As the sim ulated net w
  • rk
approac hes a solution, it is driv en a w a y again,
  • ften
with enough force that it ends up
  • scillating
across w eigh t space. There seems to b e no w a y to a v
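The blow-up is easy to exhibit in miniature. Using eq. (61) with a single stored pattern (all values below are illustrative), the prescribed velocity grows like 2\lambda/distance as the state approaches the memory:

```python
import numpy as np

# The flaw in miniature: the target velocity of eq. (61) scales like
# 1/g_p(A), so it explodes as the state nears a stored pattern.

p, s_p, lam = np.array([0.5]), 1.0, 0.1

def g(A):
    return 0.5 * np.sum((A - p) ** 2)

for dist in [0.5, 0.1, 0.01, 0.001]:
    A = p + dist
    d_hat = -lam * s_p * (A - p) / g(A)
    print(dist, d_hat)   # magnitude grows like 2*lam/dist as dist -> 0
```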
There seems to be no way to avoid this behavior, except to have the network somehow compute G(A) or a function thereof. (Indeed, it follows from the derivation at 4.3.4 that the net must know to make no weight changes on any pattern when G(A) = 0.) In order to compute G(A), the model must take all the patterns into account at once. It cannot simply add up weight changes prescribed by the individual patterns.[28] The model could be salvaged, of course, by having it explicitly compute G(A) from all the patterns, at each A. But this fix makes it a far less attractive design for a content-addressable memory.

[27] The test involved a single visible unit that was supposed to learn three real-number "memories," of varying strengths, with the help of three hidden units and fully recurrent connections. In the terms of section 2.1.3, it was supposed to learn how to be a quantum unit.

[28] Logarithmic manipulations do not help.

6 Conclusions
This research has attempted to understand the relationship between connectionist networks, especially recurrent networks, and dynamical systems. It has been instructive on several counts.

First of all, the dynamical systems perspective has proved to be fruitful. A key property of gradual-response networks is that their states can change gradually over time. We have seen the practical results of this even in tiny recurrent clusters and gradual-response feedforward nets. Apparently it does not take a very complex network to produce non-trivial kinds of movement through activation space. It is important to stress this perspective because, until very recently, it has been ignored. Traditionally, the object of training a network has been to have the network produce certain static behaviors: constant output vectors. Networks are capable of much more than this, however. Their dynamics may have significant qualitative characteristics. The differences between a single attractor, a double attractor, and a limit cycle are far more pronounced than the difference between outputs of 0 and 1.

Second, it is possible to explicitly train networks to have particular dynamical behaviors. This was not known before. Even Jordan (1986), who described his model as a dynamical system, simply trained it to pass through particular points on particular time steps; Williams and Zipser (1988) did the same. In the experiments reported here, however, networks were not taught to be anywhere in activation space at any particular time, but only to move in a general direction toward their targets. The networks nonetheless succeeded in getting to their targets under these rather lenient conditions, establishing attractor basins around the targets. Moreover, they occasionally managed to fulfill the dynamical conditions in unexpected ways that they would not have found under the "equivalent" static conditions. There are really two new results here. The theoretical result is that an algorithm exists to train dynamics. The experimental result is that even simple networks are actually capable of performing the behaviors they are being asked to learn.

Third, training a network's dynamic characteristics may not be any harder than training its static characteristics. On a mildly difficult mapping task, a network was discovered to learn equally well regardless of whether its position or its direction was prescribed.

Fourth, while the training techniques involved are difficult, they are not prohibitively difficult. A single general approach is sufficient to train networks to perform any of a wide class of behaviors. The approach is no more computationally intensive than its predecessor, the Williams and Zipser gradient descent algorithm for arbitrarily recurrent networks. Yet it extends the power of that algorithm well into dynamical systems territory.

There is more work left to be done. It is not yet clear what kinds of dynamics are easy for a network to learn and what kinds are hard; nor has the appropriateness of different error measures been systematically studied. Just as important, no one knows how topology affects the learning of dynamics: for instance, whether the cluster architectures explored in section 2.1.3 are as promising as they seem to be.

Be that as it may, the work here may very well keep its promises. Throughout the research, the dynamic properties of connectionist nets have proved to be continually interesting, sometimes surprising, and often encouraging with respect to their overall significance for connectionism. With luck, these initial findings will have a chance at further development.

References
[1] Elman, J. L. (1988). Finding structure in time (CRL Technical Report 8801). La Jolla: University of California, San Diego, Center for Research in Language.

[2] Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1 (pp. 282-317). Cambridge, MA: MIT Press.

[3] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554-2558.

[4] Jacobs, R. A. (1987). Increased rates of convergence through learning rate adaptation (COINS Technical Report 87-117). Amherst, MA: University of Massachusetts, Department of Computer & Information Science.

[5] Jordan, M. I. (1986). Serial order: A parallel distributed processing approach (ICS Report 8604). La Jolla: University of California, San Diego, Institute for Cognitive Science.

[6] McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375-407.

[7] Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193.

[8] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1 (pp. 318-364). Cambridge, MA: MIT Press.

[9] Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 2 (pp. 216-271). Cambridge, MA: MIT Press.

[10] Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.

[11] Williams, R. J., & Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks (ICS Report 8805). La Jolla: University of California, San Diego, Institute for Cognitive Science.