

SLIDE 1

Associative memories

  • 9/25/2014
SLIDE 2

Memorized associations are ubiquitous

“Bill”

Stimulus → Response

SLIDE 3

Memorized associations are ubiquitous

“Bill”

Stimulus → Response

Key properties:

  • Noise tolerance (generalization)
  • Graceful saturation
  • High capacity
SLIDE 4

First attempts: Holography

van Heerden, 1963; Willshaw & Longuet-Higgins, 1960s

φ_{r,s}(x) = (r ∗ s)(x) = ∫ r(ζ) s(x − ζ) dζ        (storage: convolve stimulus and response)

ŝ(x) = ∫ r(τ) φ_{r,s}(τ + x) dτ        (retrieval: correlate the stimulus with the trace)

Mathematically, this is the convolution-correlation scheme from class 4.
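To make the scheme concrete, here is a minimal NumPy sketch of the convolution-correlation idea (an illustration of mine, not van Heerden's optical setup), using circular convolution for simplicity: store the pair by convolving stimulus and response, then retrieve by correlating the stimulus against the stored trace.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
r = rng.standard_normal(N)   # stimulus signal
s = rng.standard_normal(N)   # response signal

def cconv(a, b):             # circular convolution (storage)
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):             # circular correlation (retrieval)
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

trace = cconv(r, s)          # the "hologram" of the pair (r, s)
s_hat = ccorr(r, trace) / N  # probe the trace with the stimulus

# s_hat is a noisy reconstruction of s (correlation around 0.7 for white-noise
# signals), because the stimulus autocorrelation is only approximately a delta.
print(np.corrcoef(s, s_hat)[0, 1])
```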

SLIDE 5

Matrix memories

Steinbuch, 1962 Willshaw et al., 1969

Before long, it was realized that better results could be obtained with a simpler, more neurally plausible framework. Let's explore a simple Hebbian scheme: we have input and output lines, and we strengthen a synapse when its input and output lines are on together.

Input / Stimulus → Output / Response

SLIDE 10

Storage

What happens here depends on the specific choice of learning rule.

SLIDE 11

Storage

Additive Hebb rule:

M = Σ_{i=1}^{n} R_i S_i^T

(Diagram: stimulus patterns S1, S2, S3 on the input lines; response patterns R1, R2, R3 on the output lines.)

What happens here depends on the specific choice of learning rule.

SLIDE 12

Storage

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

(Diagram: stimulus patterns S1, S2, S3; response patterns R1, R2, R3; the weight matrix M accumulates the outer products.)

What happens here depends on the specific choice of learning rule.

SLIDE 13

Storage

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

(Diagram: stimulus S2 on the input lines.)

SLIDE 14

Storage

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

(Diagram: S2 on the input lines; R̂2 on the output lines.)

SLIDE 15

Retrieval

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

(Diagram: stimulus S1 presented to the stored memory M for retrieval.)

SLIDE 16

Retrieval

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

Retrieval: present a stored stimulus and read out R̂ = M S. For stimulus S_j,

R̂_j = Σ_{i=1}^{n} R_i S_i^T S_j

(Diagram: S1 presented; the estimate R̂1 read out.)

SLIDE 17

Retrieval

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

Retrieval for stimulus S_j:

R̂_j = M S_j = Σ_{i=1}^{n} R_i S_i^T S_j = Σ_{i≠j} R_i S_i^T S_j + R_j ||S_j||²

(Diagram: S1 presented; R̂1 read out.)

SLIDE 18

Retrieval

Additive Hebb rule: M = Σ_{i=1}^{n} R_i S_i^T

Retrieval for stimulus S_j:

R̂_j = M S_j = Σ_{i=1}^{n} R_i S_i^T S_j = Σ_{i≠j} R_i S_i^T S_j + R_j ||S_j||²

(Diagram: S1 presented; R̂1 read out.)

If the S_k are orthonormal, the cross-talk term vanishes and ||S_j||² = 1, so retrieval is exact: R̂_j = R_j.
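A minimal NumPy sketch of this storage/retrieval scheme (the pattern sizes and variable names are illustrative choices of mine): with orthonormal stimulus columns, retrieval is exact, as claimed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 64, 10                      # pattern size, number of stored pairs

# Orthonormal stimulus patterns (columns of S) via QR; random response patterns.
S, _ = np.linalg.qr(rng.standard_normal((N, P)))
R = rng.standard_normal((N, P))

# Storage: additive Hebb rule, M = sum_i R_i S_i^T (a sum of outer products).
M = R @ S.T

# Retrieval: R_hat_j = M S_j.  With orthonormal S_j, the cross-talk term vanishes.
R_hat = M @ S
print(np.allclose(R_hat, R))       # True
```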

SLIDE 19

Another perspective

Recall that the optimal memory matrix is M* = R S†. If the columns of S are linearly independent, then S† = (S^T S)^{-1} S^T, giving M* = R (S^T S)^{-1} S^T. So if the columns of S (the S_i) are orthonormal, M* = R S^T, which is exactly what we got for the simple Hebb rule.
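A quick NumPy comparison under the assumption that the stimuli are linearly independent but not orthogonal: the pseudoinverse memory M* = R S† recalls the stored pairs exactly, while the plain Hebb matrix suffers cross-talk.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 64, 10
S = rng.standard_normal((N, P))    # linearly independent, but not orthogonal
R = rng.standard_normal((N, P))

M_hebb = R @ S.T                   # simple Hebb rule
M_opt = R @ np.linalg.pinv(S)      # M* = R S^+ = R (S^T S)^{-1} S^T

print(np.allclose(M_opt @ S, R))   # True: exact recall of every stored pair
print(np.allclose(M_hebb @ S, R))  # False: non-orthogonal stimuli interfere
```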

SLIDE 20

Capacity

How much information can a matrix memory store? Model:

  • 1. |S| = |R| = N × P: P patterns of size N.
  • 2. Each input pattern (column of S) has m_S nonzeros, and each output pattern (column of R) has m_R nonzeros.
  • 3. Binary Hebb rule: M = min(R S^T, 1) (each entry is clipped at one).
  • 4. Threshold recall: R̂_jk = 1 if [M S_j]_k ≥ τ, else 0.

SLIDE 21

Capacity

(Same model as the previous slide.)

  • Parameters we can pick
SLIDE 22

Capacity

(Diagram: a stored stimulus is presented to M and the thresholded estimate R̂ is read out.)

To choose the threshold τ, note that each stimulus S_j has exactly m_S ones, so the dendritic sum [M S_j]_k can be at most m_S, and it equals m_S at every output unit k where R_jk = 1 (those synapses were all potentiated during storage). So in order to recover all the ones in R, we should set τ = m_S.
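A small simulation of this binary (Willshaw-style) model, with illustrative sizes chosen well below capacity: clipped Hebbian storage, threshold recall at τ = m_S, and a count of missed versus spurious ones.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, m_S, m_R = 1000, 2000, 10, 10     # illustrative sizes, well below capacity

def sparse_patterns(n_units, n_patterns, n_active):
    # Columns are binary patterns with exactly n_active ones.
    X = np.zeros((n_units, n_patterns))
    for j in range(n_patterns):
        X[rng.choice(n_units, size=n_active, replace=False), j] = 1.0
    return X

S = sparse_patterns(N, P, m_S)          # stimuli
R = sparse_patterns(N, P, m_R)          # responses

# Binary (clipped) Hebb rule: a synapse is 1 if its two lines were ever co-active.
M = (R @ S.T > 0).astype(float)

# Threshold recall at tau = m_S: an output unit fires only if every one of the
# m_S active input lines reaches it through a potentiated synapse.
R_hat = (M @ S >= m_S).astype(float)

print("missed ones:  ", int(((R == 1) & (R_hat == 0)).sum()))   # always 0
print("spurious ones:", int(((R == 0) & (R_hat == 1)).sum()))   # ~0 well below capacity
```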

SLIDE 23

Capacity

Sparsity parameters

The chance of a given weight in M remaining zero throughout the learning process is

(1 − m_S m_R / N²)^P ≈ e^{−P m_S m_R / N²} = 1 − q.

The probability of a spurious one in R̂ is the probability of an output unit exceeding τ purely by chance; this is about N q^{m_S} per recalled pattern. Setting N q^{m_S} = 1 gives the value of m_S at which the first errors appear:

m_S = − log N / log q.

SLIDE 24

Capacity

Sparsity parameters (from the previous slide):

e^{−P m_S m_R / N²} = 1 − q,   m_S = − log N / log q

SLIDE 25

Capacity

Sparsity parameters

At q = 1/2:  1 − q = e^{−P m_S m_R / N²}  ⇒  P m_S m_R = log(2) · N².

This is quite good: for large N the total information stored approaches N² log 2 bits, i.e., about 0.69 bits per synapse.
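Plugging illustrative numbers into these formulas (N = 1000, q = 1/2, m_R = m_S; the choice of N is mine) to see the scale they imply:

```python
from math import comb, log, log2

N = 1000                                   # units per pattern
q = 0.5                                    # target fraction of potentiated synapses
m_S = log(N) / log(1 / q)                  # ~10 active units per stimulus
m_R = m_S
P = log(1 / (1 - q)) * N**2 / (m_S * m_R)  # storable pattern pairs: ~7,000

# Rough information content: each response specifies which m_R of N units are on.
bits_per_pattern = log2(comb(N, round(m_R)))
print(f"m_S ~ {m_S:.1f}, P ~ {P:.0f}, "
      f"~ {P * bits_per_pattern / N**2:.2f} bits per synapse")
```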

SLIDE 26

Memory space

The decision to represent memory items as sparse, high-dimensional vectors has some interesting consequences. High-dimensional spaces are counterintuitive.

In 1000 dimensions, 0.001 of all patterns are within 451 bits of a given point, and all but 0.001 are within 549 bits.

  • Points tend to be orthogonal: most point-pairs are “noise-like.”

Kanerva, 1988
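The distance figures above are just binomial tail counts over 1000-bit vectors under Hamming distance; a quick check:

```python
from math import comb

n = 1000
total = 2 ** n

def within(d):
    # Number of binary vectors within Hamming distance d of a fixed point.
    return sum(comb(n, k) for k in range(d + 1))

print(within(451) / total)        # ~0.001 of the space lies within 451 bits
print(1 - within(549) / total)    # ~0.001 of the space lies beyond 549 bits
```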

SLIDE 27

Memory space

The decision to represent memory items as sparse, high-dimensional vectors has some interesting consequences. High-dimensional spaces are counterintuitive.

Almost all pairs of points are far apart, but there are multiple “linking” points that are close to both.

  • Kanerva, 1988

Linking concepts

SLIDE 29

Matrix memories in the brain: Marr’s model of the cerebellum

Short, live experiment. (Marr, 1967; Albus, 1971)

SLIDE 30

Matrix memories in the brain: Marr’s model of the cerebellum

The cerebellum produces smooth, coordinated motor movements (and may be involved in cognition as well).

(Image: 24-year-old Chinese woman without a cerebellum.)

SLIDE 31

(Diagram: mossy fibers carry the contextual input; Purkinje axons carry the motor output.)

Learn associations straight from context to actions, so you don’t have to “think” before doing.

SLIDE 34

(Diagram: mossy fibers carry the contextual input; Purkinje axons carry the motor output.)

But how does training work? How do the right patterns “appear” on the output (Purkinje) lines?

(Diagram: climbing fibers carry a motor “teaching” input from the rest of the brain.)

SLIDE 35

(Diagram: mossy fibers carry the contextual input; Purkinje axons carry the motor output; climbing fibers carry the motor teaching input from the rest of the brain.)

There's a remarkable 1-1 correspondence between climbing fibers and Purkinje axons. Moreover, each climbing fiber wraps around and around its Purkinje axon, making hundreds of synapses; a single AP can make a Purkinje spike.

SLIDE 36

(Diagram: mossy fibers, Purkinje axons, and climbing fibers, as on the previous slides.)

We said that sparsity was a key property. How is that manifested here?


SLIDE 37

(Diagram: granule cells are interposed between the mossy fibers and the Purkinje cells; they perform the sparsification step.)

SLIDE 40

(Diagram: as before, with granule cells performing sparsification.)

There are 50 billion granule cells, about 3/4 of the brain's neurons. They're tiny.

  • The idea here is that they “blow up” the mossy fiber input into a larger space in which the signal can be sparser (see the sketch below).
  • Granule cells code for sets of mossy fibers (codons), hypothesized to be primitive input features.
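A toy sketch of this expansion-plus-threshold idea (the sizes and the all-or-none codon threshold are illustrative assumptions, not Marr's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_mossy, n_granule, fan_in = 200, 5000, 4   # hypothetical sizes; real ratios are larger

# Each granule cell samples a small random set ("codon") of mossy fibers.
W = np.zeros((n_granule, n_mossy))
for g in range(n_granule):
    W[g, rng.choice(n_mossy, size=fan_in, replace=False)] = 1.0

# A fairly dense mossy-fiber input pattern ...
mossy = (rng.random(n_mossy) < 0.2).astype(float)

# ... drives a granule cell only when its whole codon is active.
granule = (W @ mossy >= fan_in).astype(float)

print("mossy activity:  ", mossy.mean())     # ~0.20
print("granule activity:", granule.mean())   # ~0.2**4 = 0.0016: sparser, higher-dimensional
```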

SLIDE 41

Storing structured information

We’ve discussed how to store S-R pairs, but human cognition goes way beyond this.

Relations

  • The kettle is on the table.
  • The kettle is to the right of the mug.
SLIDE 42

Storing structured information

As before, “concepts” are activation vectors (e.g., Kettle, Jar).

SLIDE 46

Storing structured information

As before, “concepts” are activation vectors (e.g., Kettle, Jar, Green, Gray).

How to represent this? Green(Jar) & Gray(Kettle)

Now what? Maybe we should just have all of these patterns fire at once?

  • But then how do we know we don't have Gray(Jar) & Green(Kettle)? Or, worse, Jar(Kettle) & Green & Gray?

We need a way to bind predicates to arguments.

SLIDE 47

Storing structured information

As before, “concepts” are activation vectors (Kettle, Jar, Green, Gray).

How to represent this? Green(Jar) & Gray(Kettle)

We need a way to bind predicates to arguments:

Green ⊗ Jar ⊕ Gray ⊗ Kettle

where ⊗ is the binding operator and ⊕ is the conjunction operator.

SLIDE 48

Storing structured information

How should we choose ⊗ and ⊕?

(Diagram: Green ⊗ Jar is the outer product of the Green and Jar activation vectors.)

Paul Smolensky, “Tensor Product Variable Binding”, 1990

SLIDE 50

Storing structured information

How should we choose ⊗ and ⊕?

(Diagram: Green ⊗ Jar ⊕ Gray ⊗ Kettle, computed as a sum of outer products.)

Paul Smolensky, “Tensor Product Variable Binding”, 1990

This is just a matrix memory
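A small NumPy sketch of tensor-product binding viewed as a matrix memory (the random vectors and dimension are illustrative): bind with the outer product, conjoin by addition, unbind by multiplying the scene matrix with the argument vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Random "concept" vectors; nearly orthogonal in high dimensions.
jar, kettle = rng.standard_normal(d), rng.standard_normal(d)
green, gray = rng.standard_normal(d), rng.standard_normal(d)

# Binding = outer product; conjunction = addition of the bound pairs.
scene = np.outer(green, jar) + np.outer(gray, kettle)

# Unbinding: multiply by a (normalized) argument vector to read out its predicate.
colour_of_jar = scene @ jar / (jar @ jar)

print(np.corrcoef(colour_of_jar, green)[0, 1])   # ~1: the jar is green, not gray
```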

SLIDE 51

Storing structured information

We can store some interesting “data structures,” like a stack:

Linearly independent “indexing” roles r_i, and some alphabet of “fillers” f_i:

S = Σ_i f_i ⊗ r_i

Or a tree with leaves A, B, C:

T = A ⊗ r0 + [B ⊗ r0 + C ⊗ r1] ⊗ r1
  = A ⊗ r0 + B ⊗ r0 ⊗ r1 + C ⊗ r1 ⊗ r1
  = A ⊗ r0 + B ⊗ r10 + C ⊗ r11

Note the dimensions in these sums: the terms are tensors of different orders, since A sits at depth 1 while B and C sit at depth 2.

SLIDE 52

Storing structured information

(Tree from the previous slide: T = A ⊗ r0 + B ⊗ r10 + C ⊗ r11.)

Even better, we can do symbolic operations with matrices:

Extraction (car, cdr):  A = W_ex0 T,  B = W_ex0 W_ex1 T = W_ex01 T  (one matrix)
Construction (cons):    T = [A, T′] = W_cons0 A + W_cons1 T′

SLIDE 53

Storing structured information


  • So you can build a whole LISP in linear algebra!
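A toy sketch of one level of this construction in NumPy, assuming orthonormal role vectors r0 and r1 (the W_cons / W_ex names follow the slide; deeper nesting changes the dimensions, as noted two slides back):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 2

# Atom (filler) vectors in R^d; orthonormal role vectors r0, r1 in R^k.
A, B = rng.standard_normal(d), rng.standard_normal(d)
r0, r1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# cons: bind each element to its role and superpose -> a vector in R^(d*k).
W_cons0 = np.kron(np.eye(d), r0.reshape(-1, 1))   # (d*k, d): maps A to A (x) r0
W_cons1 = np.kron(np.eye(d), r1.reshape(-1, 1))
T = W_cons0 @ A + W_cons1 @ B                     # cons(A, B)

# car / cdr: unbind with the transposed role matrices -> back to R^d.
W_ex0, W_ex1 = W_cons0.T, W_cons1.T
print(np.allclose(W_ex0 @ T, A), np.allclose(W_ex1 @ T, B))   # True True
```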

SLIDE 54

Storing structured information

One more example: Go from passive-voice sentences to predicate representation

(Parse diagram: "[Few movies] are admired by [skeptical critics]", with constituents labeled P, Aux, V, by, A; target structure labeled A, P, V.)

F(T) = W T, where

W = W_cons0 [W_ex1 W_ex0 W_ex1] + W_cons1 [W_cons0 (W_ex1 W_ex1 W_ex1) + W_cons1 W_ex0]

This is just one matrix multiplication.

Admire( [Few movies], [skeptical critics] )

SLIDE 55

Encoding grammars

Grammaticality can be represented as constraints on which role/filler pairs may co-occur.

  • For example, a singular subject requires a correctly conjugated verb.
  • Such constraints can be encoded as weights in the “binding matrix.” (Note the departure from matrix memories, in which the entries in the matrix were synapses.) These weights give us an energy function, and evolution dynamics.
  • Over time, the network will settle into low-energy (= high-grammaticality) states.

SLIDE 57

Other frameworks: vector-symbolic architectures

Idea: represent role/filler bindings as vectors, rather than tensors; everything is the same size.

Circular convolution (Plate)
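A sketch in the spirit of Plate's Holographic Reduced Representations (the vocabulary and dimension are illustrative): bindings and their superposition all stay d-dimensional, and unbinding uses the involution (an approximate inverse under circular convolution) followed by clean-up against the known vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

def cconv(a, b):        # binding: circular convolution
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def involution(a):      # approximate inverse under circular convolution
    return np.concatenate(([a[0]], a[1:][::-1]))

vocab = {name: rng.standard_normal(d) / np.sqrt(d)
         for name in ["green", "gray", "jar", "kettle"]}

# Green(Jar) & Gray(Kettle): the bindings and their sum all stay d-dimensional.
scene = cconv(vocab["green"], vocab["jar"]) + cconv(vocab["gray"], vocab["kettle"])

# Unbind with the approximate inverse of "jar", then clean up against the vocabulary.
probe = cconv(involution(vocab["jar"]), scene)
sims = {name: probe @ v / (np.linalg.norm(probe) * np.linalg.norm(v))
        for name, v in vocab.items()}
print(max(sims, key=sims.get))   # 'green'
```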

SLIDE 58

Other frameworks: Recursive Auto-Associative Memory (RAAM)

Learn the compression function. (Diagram: the tree over A, B, C, D is encoded bottom-up as W[W[W[B, C], D], A], with encoder weights W and decoder weights W^T.) Just like in an ordinary autoencoder, we learn the same weights for encoding and decoding. Pollack (1990)

SLIDE 59

Modern applications: Socher et al., 2013. Similar to RAAMs in that the model learns a parametrized compression function; the details are different.

(Figure labels: bag of words, bigrams.)

SLIDE 60

End