SLIDE 1

Finding Structure in Time

Jeffrey L. Elman, Cognitive Science 14, 179–211 (1990). Presented by Dominic Seyler (dseyler2@illinois.edu)

SLIDE 2

Outline

  • Motivation
  • Method
  • Experiments
  • Exclusive-OR
  • Structure in Letter Sequences
  • Discovering the Notion “Word”
  • Discovering Lexical Classes
  • Conclusions
SLIDE 3

Motivation: The Problem with Time

  • Previous methods of representing time associate the serial order of a temporal pattern with the dimensionality of the pattern vector:
  • [ 0 1 0 0 1 ] <- first, second, third... event in temporal order
  • There are several downsides to representing time this way:
  • An input buffer is required to represent all events at once
  • All input vectors must be the same length and provide for the longest possible temporal pattern
  • Most importantly: it cannot distinguish relative from absolute temporal position, as the two vectors below (and the sketch that follows) illustrate

[ 0 1 1 1 0 0 0 0 0 ] [ 0 0 0 1 1 1 0 0 0 ]
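A minimal numpy sketch (an illustration, not from the slides) of why this is a problem: the two vectors encode the same relative pattern shifted by two steps, yet look almost entirely different as points in vector space.

```python
import numpy as np

# The same three-event pattern, shifted by two time steps.
a = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0])
b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

print(np.dot(a, b))    # 1 -> nearly orthogonal despite identical structure
print(np.sum(a != b))  # 4 -> Hamming distance of 4 out of 9 positions
```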

SLIDE 4

An Alternative Way of Treating Time

  • Don’t model time as an explicit part of the input
  • Allow time to be represented by the effect it has on processing
  • The network allows hidden units to see their own previous output
  • Recurrent connections are what give the network memory

SLIDE 5

Approach: Recurrent Neural Network

  • Augment the input with additional units (context units)
  • When input is processed sequentially, the context units contain the exact values of the hidden units from the previous time step
  • The hidden units map the external input and the previous internal state to the desired output (see the sketch after this list)
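A rough sketch of this forward pass in numpy (the layer sizes, weight scales, and names here are illustrative assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 20, 6

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(sequence):
    context = np.zeros(n_hidden)  # context units start at zero
    outputs = []
    for x in sequence:
        # The hidden state depends on the current input AND the previous
        # hidden state, which the context units hold as an exact copy.
        hidden = sigmoid(W_xh @ x + W_ch @ context)
        outputs.append(sigmoid(W_hy @ hidden))
        context = hidden          # copy the hidden state for the next step
    return outputs
```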

SLIDE 6

Exclusive-OR

  • The XOR function cannot be learned by a simple two-layer network
  • Temporal XOR: one input bit is presented at a time; the task is to predict the next bit, so the target sequence is the input shifted by one step
  • Input:  1 0 1 0 0 0
  • Output: 0 1 0 0 0 ?
  • Training: run 600 passes through a 3,000-bit XOR sequence (generated as sketched after this list)
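As slide 7 explains, the sequence consists of random bit pairs, each followed by their XOR. A small sketch of how such a training sequence could be built (the function name and seed are assumptions):

```python
import random

def xor_sequence(n_triples, seed=0):
    """Bit list made of (b1, b2, b1 XOR b2) triples."""
    random.seed(seed)
    bits = []
    for _ in range(n_triples):
        b1, b2 = random.randint(0, 1), random.randint(0, 1)
        bits += [b1, b2, b1 ^ b2]
    return bits

seq = xor_sequence(1000)             # 3,000 bits, as in the experiment
inputs, targets = seq[:-1], seq[1:]  # target = input shifted by one bit
```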
SLIDE 7

Exclusive-OR (cont.)

  • It is only sometimes possible to predict the next bit correctly
  • After one bit, there is a 50/50 chance
  • After two bits, the third bit will be the XOR of the first and second

SLIDE 8

Structure in Letter Sequences

  • Idea: extend prediction from one-bit inputs to more complex, multi-bit predictions
  • Method:
  • Map six letters (b, d, g, a, i, u) to binary vector representations
  • Use the three consonants to create a random 1,000-letter sequence
  • Replace each consonant by adding vowels: b -> ba; d -> dii; g -> guuu
  • Example input: dbgbddg -> diibaguuubadiidiiguuu
  • Prediction task: given the bit representations of the characters in sequence, predict the next character (see the sketch after this list)
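A short sketch of the sequence construction (the expansion rules are from the slide; the function and seed are assumptions):

```python
import random

EXPAND = {"b": "ba", "d": "dii", "g": "guuu"}

def letter_sequence(n_consonants, seed=0):
    """Random consonant string, each consonant expanded with its vowels."""
    random.seed(seed)
    consonants = random.choices("bdg", k=n_consonants)
    return "".join(EXPAND[c] for c in consonants)

print(letter_sequence(7))  # e.g. 'diibaguuubadiidiiguuu' for one ordering
```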

SLIDE 9

Structure in Letter Sequences (cont.)

  • Since the consonants were ordered randomly, error on consonants is high
  • The vowels are not random, so the network can make use of previous information; error on vowels is low
  • Takeaway: since the input is structured, the network can make partial predictions even when a complete prediction is not possible

SLIDE 10

Discovering the Notion “Word”

  • Learning a language involves learning words
  • Can the network automatically learn “words” when given a sequential list of concatenated characters?
  • Words are represented as concatenated bit vectors of their characters
  • These word vectors are concatenated to form sentences
  • Each character is then presented sequentially, and the network has to predict the following letter (paired up as in the sketch after this list)
  • Input:  manyyearsago
  • Output: anyyearsago?
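A minimal illustration of how the input and target characters line up (just the pairing, not the paper's code):

```python
text = "manyyearsago"  # a sentence with its word boundaries removed

# At each step the network sees one character and must predict the next,
# so the target stream is the input stream shifted by one character.
pairs = list(zip(text, text[1:]))
# [('m', 'a'), ('a', 'n'), ('n', 'y'), ('y', 'y'), ('y', 'e'), ...]
```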

SLIDE 11

Discovering the Notion “Word” (cont.)

  • At the onset of each word, error is high
  • As more of the word is received, error declines
  • Error provides a good clue as to what the recurring sequences in the input are, and it correlates highly with words
  • The network can learn the boundaries of linguistic units from the input signal

SLIDE 12

Discovering Lexical Classes from Word Order

  • Can the network learn the abstract structure that underlies sentences when only the surface forms (i.e., words) are presented to it?
  • Method:
  • Define a set of category-to-word mappings (e.g., NOUN-HUMAN -> man, woman; VERB-PERCEPTION -> smell, see)
  • Use templates to create sentences (e.g., NOUN-HUMAN VERB-EAT NOUN-FOOD)
  • Words in a sentence (e.g., “woman eat bread”) are mapped to one-hot vectors (e.g., 00010 00100 10000)
  • Task: given a word vector (“woman”), predict the next word (“eat”); a data-generation sketch follows this list
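A rough sketch of the data generation, built from the slide's toy examples (this tiny lexicon and its 5-bit one-hot codes are illustrative assumptions, not the paper's full vocabulary):

```python
import random

# Toy lexicon and template from the slide; the paper's setup is larger.
CATEGORIES = {
    "NOUN-HUMAN": ["man", "woman"],
    "VERB-EAT": ["eat"],
    "NOUN-FOOD": ["bread", "cookie"],
}
TEMPLATES = [("NOUN-HUMAN", "VERB-EAT", "NOUN-FOOD")]

VOCAB = sorted({w for ws in CATEGORIES.values() for w in ws})

def one_hot(word):
    vec = [0] * len(VOCAB)
    vec[VOCAB.index(word)] = 1
    return vec

def random_sentence():
    return [random.choice(CATEGORIES[cat]) for cat in random.choice(TEMPLATES)]

sentence = random_sentence()              # e.g. ['woman', 'eat', 'bread']
vectors = [one_hot(w) for w in sentence]  # the network sees only these
```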
SLIDE 13

Discovering Lexical Classes (cont.)

  • Since the prediction task is nondeterministic, RMS error is not a fitting measure
  • Instead: save the hidden-unit vectors produced for each word in all possible contexts and average over them
  • Perform hierarchical clustering on the averaged vectors (sketched after this list)
  • The similarity structure of the internal representations is shown in the resulting tree
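A hedged sketch of this analysis, assuming `hidden_states` maps each word to the list of hidden-unit vectors it evoked across contexts (the data structure and function name are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

def cluster_words(hidden_states):
    """hidden_states: dict of word -> list of hidden-unit vectors (one per context)."""
    words = sorted(hidden_states)
    # Average each word's hidden vectors over all of its contexts.
    means = np.array([np.mean(hidden_states[w], axis=0) for w in words])
    # Hierarchical clustering of the averaged internal representations;
    # the dendrogram shows their similarity structure as a tree.
    tree = linkage(means, method="average")
    dendrogram(tree, labels=words)
    return tree
```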

SLIDE 14

Discovering Lexical Classes (cont.)

  • The network has developed internal representations for the input vectors which reflect facts about the possible sequential ordering of the inputs
  • The hidden-unit patterns are not word representations in the conventional sense, since the patterns also reflect prior context
  • Error in predicting the actual next word in a given context is high, but the network is able to predict the approximate likelihood of occurrence of classes and words
  • A given node in the hidden layer participates in multiple concepts; only the activation pattern in its entirety is meaningful

SLIDE 15

Conclusions

  • Networks can learn temporal structure implicitly
  • Problems change their nature when expressed as temporal events (XOR could previously not be learned by a single-layer network)
  • The error signal is a good indicator of where structure exists (error was high at the beginnings of words in a sentence)
  • Increasing complexity does not necessarily result in worse performance (increasing the number of bits did not hurt performance)
  • Internal representations can be hierarchical in nature (similarity was high among words within one class)

SLIDE 16

Finding Structure in Time

Jeffrey L. Elman, Cognitive Science 14, 179–211 (1990). Presented by Dominic Seyler (dseyler2@illinois.edu)