Associative memories
9/25/2014
Memorized associations are ubiquitous.
Stimulus → Response: e.g., a face → "Bill"
Key properties: noise tolerance (generalization); graceful saturation.
van Heerden, 1963; Willshaw, Longuet-Higgins, 1960s
Storage: φ_{r,s}(x) = (r ∗ s)(x) = ∫ r(ζ) s(x − ζ) dζ
Retrieval: ŝ(x) = ∫ r(τ) φ_{r,s}(τ + x) dτ
Mathematically, this is the convolution-correlation scheme from class 4.
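A minimal numerical sketch of this scheme (discrete and circular, with FFTs standing in for the optical convolution; the helper names cconv/ccorr and all parameters are my own): store each pair as the circular convolution of stimulus and response, superpose the traces, and retrieve by correlating a stimulus with the trace.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1024, 5                    # signal length, number of stored pairs

def cconv(a, b):
    # Circular convolution via FFT.
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def ccorr(a, b):
    # Circular correlation via FFT (approximate inverse of cconv for noise-like a).
    return np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real

# Noise-like stimulus and response signals, variance 1/N so autocorrelations are ~delta.
stim = rng.standard_normal((P, N)) / np.sqrt(N)
resp = rng.standard_normal((P, N)) / np.sqrt(N)

# Storage: superpose the convolution traces of all pairs.
trace = sum(cconv(stim[i], resp[i]) for i in range(P))

# Retrieval: correlate the probe stimulus with the trace, then "clean up" by
# comparing against the known responses; the best match is the stored partner.
probe = ccorr(stim[0], trace)
print("best matching response:", int(np.argmax(resp @ probe)))   # -> 0
```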
Steinbuch, 1962; Willshaw et al., 1969
Before long, it was realized that better results could be obtained with a simpler, more neurally plausible framework. Let's explore a simple Hebbian scheme: we have input and output lines, and we strengthen a synapse when its input and output lines are active together.
Input / Stimulus Output / Response
What happens here depends on the specific choice of learning rule.
Additive Hebb rule:
M = Σ_{i=1}^{n} R_i S_i^T
(diagram: stimulus lines S1, S2, S3; response lines R1, R2, R3)
Retrieval: present a stimulus S_j on the input lines and read out R̂ = M S_j.
R̂_j = M S_j = Σ_{i=1}^{n} R_i S_i^T S_j
    = Σ_{i≠j} R_i (S_i^T S_j) + R_j ‖S_j‖²
The first term is crosstalk from the other stored pairs; the second is the desired response. If the S_k are orthonormal, the crosstalk vanishes and retrieval is exact.
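A small numpy sketch of the rule and its retrieval behavior (sizes and variable names are my own): with orthonormal stimulus columns the crosstalk term vanishes and retrieval is exact, while random non-orthogonal stimuli leave visible crosstalk.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 64, 10                                      # pattern size, number of stored pairs

S, _ = np.linalg.qr(rng.standard_normal((N, n)))   # orthonormal stimulus columns: S.T @ S = I
R = rng.standard_normal((N, n))                    # arbitrary response columns

M = R @ S.T                                        # additive Hebb rule: sum_i R_i S_i^T

# Retrieval R_hat_j = M S_j is exact here because S_i^T S_j = delta_ij.
print("orthonormal S, max error:", np.abs(M @ S - R).max())     # ~1e-15

# With merely random (non-orthogonal) stimuli, the crosstalk term remains.
S2 = rng.standard_normal((N, n)) / np.sqrt(N)
M2 = R @ S2.T
print("random S, max error:     ", np.abs(M2 @ S2 - R).max())   # order 1
```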
Recall that the optimal memory matrix is M* = R S†. If the columns of S are linearly independent, then S† = (S^T S)^{-1} S^T, giving M* = R (S^T S)^{-1} S^T. So if the columns of S (the S_i) are orthonormal, M* = R S^T, which is exactly what we got from the simple Hebb rule.
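A quick numerical check of this (again with toy sizes of my own choosing): for linearly independent but non-orthonormal stimuli, the pseudoinverse memory M* = R S† (np.linalg.pinv) retrieves exactly, while the plain Hebb matrix does not.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 64, 10
S = rng.standard_normal((N, n))               # linearly independent, not orthonormal columns
R = rng.standard_normal((N, n))

M_hebb = R @ S.T                              # simple additive Hebb rule
M_star = R @ np.linalg.pinv(S)                # M* = R S^+ = R (S^T S)^{-1} S^T here

print("Hebb rule error:    ", np.abs(M_hebb @ S - R).max())   # large crosstalk
print("pseudoinverse error:", np.abs(M_star @ S - R).max())   # ~1e-13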
How much information can a matrix memory store?
Model:
P binary patterns of size N; each input pattern (column of S) has m_S nonzeros, and each output pattern (column of R) has m_R nonzeros.
Hebbian storage, with each entry of M clipped at one.
Retrieval: R̂_jk = 1 if [M S_j]_k ≥ τ, and 0 otherwise.
To choose the threshold τ, note that in the absence of noise, a stored stimulus S_j drives every output unit k with R_jk = 1 to exactly [M S_j]_k = m_S (each of the m_S active input lines contributes a clipped weight of one). So in order to recover all the ones in R, we had better set τ = m_S.
Sparsity parameters
The chance of a given weight in M remaining zero throughout the learning process is
(1 − m_S m_R / N²)^P ≈ e^{−P m_S m_R / N²} ≡ 1 − q.
The probability of a spurious one in R̂ is the probability of exceeding τ purely by chance; the expected number of spurious ones per retrieval is N q^{m_S}. So the largest m_S we can choose before making the first error satisfies
N q^{m_S} = 1 ⇒ m_S = −log N / log q.
Taking q = 1/2 (half the weights set), 1 − q = e^{−P m_S m_R / N²} = 1/2 gives
P m_S m_R = N² log 2.
This is quite good: P m_S m_R ≈ 0.69 N², on the order of the number of synapses.
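A rough simulation of the clipped binary memory just analyzed, with illustrative parameter choices of my own (P m_S m_R kept somewhat below N² log 2):

```python
import numpy as np

rng = np.random.default_rng(3)
N, mS, mR, P = 1000, 10, 10, 4000        # P*mS*mR/N^2 = 0.4 < log(2)

def sparse_pattern(m):
    # Binary vector of length N with exactly m ones.
    v = np.zeros(N)
    v[rng.choice(N, size=m, replace=False)] = 1.0
    return v

S = np.array([sparse_pattern(mS) for _ in range(P)])    # stimuli,   shape (P, N)
R = np.array([sparse_pattern(mR) for _ in range(P)])    # responses, shape (P, N)

# Hebbian storage with clipping: a weight is 1 if any stored pair ever co-activated it.
M = (R.T @ S > 0).astype(float)
q = M.mean()
print("fraction of weights set q =", round(q, 3),
      " predicted:", round(1 - np.exp(-P * mS * mR / N**2), 3))

# Retrieval of pattern j with threshold tau = mS.
j = 0
R_hat = (M @ S[j] >= mS).astype(int)
misses   = int(((R[j] == 1) & (R_hat == 0)).sum())   # always 0: stored ones reach exactly mS
spurious = int(((R[j] == 0) & (R_hat == 1)).sum())   # expected ~ N * q**mS, tiny here
print("missed ones:", misses, " spurious ones:", spurious)
```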
The decision to represent memory items as sparse, high-dimensional vectors has some interesting consequences. High-dimensional spaces are counterintuitive.
In 1000 dimensions, 0.001 of all patterns are within 451 bits of a given point, and all but 0.001 are within 549 bits.
So almost all point-pairs are "noise-like": the distance between two randomly chosen points is nearly always close to the chance level of 500 bits.
Kanerva, 1988
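These numbers are easy to check empirically (a sampling sketch; the exact values come from the Binomial(1000, 1/2) tail):

```python
import numpy as np

rng = np.random.default_rng(4)
D, n = 1000, 20000

x = rng.integers(0, 2, D)                    # one reference point in {0,1}^1000
Y = rng.integers(0, 2, (n, D))               # many random points
dists = (Y != x).sum(axis=1)                 # Hamming distances

print("mean distance:           ", dists.mean())            # ~500
print("fraction within 451 bits:", (dists <= 451).mean())   # ~0.001
print("fraction within 549 bits:", (dists <= 549).mean())   # ~0.999
```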
Linking concepts: almost all pairs of points are far apart, but there are multiple "linking" points that are close to both.
Short, live experiment.
Marr, 1969; Albus, 1971
The cerebellum produces smooth, coordinated motor movements (and may be involved in cognition as well). Striking case: a 24-year-old Chinese woman found to be living without a cerebellum.
Mossy fibers: contextual input. Purkinje axons: motor output.
Learn associations straight from context to actions, so you don’t have to “think” before doing.
But how does training work? How do the right patterns “appear” on the output (Purkinje) lines?
Climbing fibers: motor teaching input, from the rest of the brain.
There's a remarkable 1-to-1 correspondence between climbing fibers and Purkinje cells: each climbing fiber wraps around and around its Purkinje cell, making hundreds of synapses, and a single climbing-fiber action potential can make the Purkinje cell spike.
We said that sparsity was a key property. How is that manifested here?
Granule cells: sparsification.
There are 50 billion granule cells, about 3/4 of the brain's neurons. They're tiny.
They re-code the mossy fiber input into a larger space in which the signal can be sparser. Each granule cell responds to a small combination of mossy fiber inputs (codons), hypothesized to be primitive input features.
We’ve discussed how to store S-R pairs, but human cognition goes way beyond this.
Relations
As before, "concepts" are activation vectors: Kettle, Jar, Green, Gray.
How do we represent Green(Jar) & Gray(Kettle)?
Maybe we should just have all of these patterns fire at once? But then how do we know we don't have Gray(Jar) & Green(Kettle)? Or, worse, Jar(Kettle) & Green & Gray?
We need a way to bind predicates to arguments.
Binding operator ⊗ and conjunction operator ⊕: represent the scene as (Green ⊗ Jar) ⊕ (Gray ⊗ Kettle).
How should we choose ⊗ and ⊕?
Smolensky's answer: take ⊗ to be the tensor (outer) product and ⊕ to be addition, so the scene becomes Green Jar^T + Gray Kettle^T.
Paul Smolensky, "Tensor Product Variable Binding", 1990
This is just a matrix memory: the sum of outer products Green Jar^T + Gray Kettle^T has the same form as M = Σ_i R_i S_i^T, and probing with a role vector retrieves (approximately) its bound filler.
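A small sketch of tensor-product binding with random vectors (dimensions and names are illustrative): bind a filler to a role with an outer product, superpose with +, and unbind a role by matrix-vector multiplication, exactly as in a matrix memory.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 200

def unit(d):
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

green, gray = unit(d), unit(d)       # fillers (colors)
jar, kettle = unit(d), unit(d)       # roles (objects)

# Binding = outer product, conjunction = addition: Green(Jar) & Gray(Kettle).
T = np.outer(green, jar) + np.outer(gray, kettle)

# Unbinding: probe with the role vector -- this is matrix-memory retrieval.
color_of_jar = T @ jar               # = green*(jar.jar) + gray*(kettle.jar) ~ green
print("similarity to green:", round(float(green @ color_of_jar), 3))   # ~1
print("similarity to gray: ", round(float(gray  @ color_of_jar), 3))   # ~0
```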
We can store some interesting "data structures," like a stack:
S = Σ_i f_i ⊗ r_i
with linearly independent "indexing roles" r_i and fillers f_i drawn from some alphabet.

Or a tree with leaves A, B, C:
T = A ⊗ r0 + [B ⊗ r0 + C ⊗ r1] ⊗ r1
  = A ⊗ r0 + B ⊗ r0 ⊗ r1 + C ⊗ r1 ⊗ r1
  = A ⊗ r0 + B ⊗ r10 + C ⊗ r11
(Note the dimensions with respect to these sums.)

Even better, we can do symbolic operations with matrices:
Extraction (car, cdr): A = W_ex0 T, B = W_ex0 W_ex1 T = W_ex01 T (one matrix).
Construction (cons): T = [A, T'] = W_cons0 A + W_cons1 T'.
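A sketch of the "one matrix per symbolic operation" idea (my own construction via Kronecker products; I use a balanced tree [[A, B], [C, D]] so the dimensions line up, sidestepping the dimension bookkeeping noted above): with orthonormal roles, cons is a fixed matrix that embeds a child into a slot, car/cdr are fixed matrices that extract it, and composing two extractions is still one matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8                                                  # filler dimension (illustrative)
r0, r1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])    # orthonormal role vectors
A, B, C, D = (rng.standard_normal(d) for _ in range(4))

def W_cons(r, k):
    # Embed a k-dimensional child into the slot named by role r: shape (2k, k).
    return np.kron(r[:, None], np.eye(k))

def W_ex(r, k):
    # Extract the k-dimensional child bound to role r: shape (k, 2k).
    return np.kron(r[None, :], np.eye(k))

# Construction (cons): depth-1 pairs, then the depth-2 tree [[A, B], [C, D]].
AB = W_cons(r0, d) @ A + W_cons(r1, d) @ B              # lives in R^{2d}
CD = W_cons(r0, d) @ C + W_cons(r1, d) @ D
T  = W_cons(r0, 2 * d) @ AB + W_cons(r1, 2 * d) @ CD    # lives in R^{4d}

# Extraction: car(cdr(T)) = C, and the two steps compose into ONE matrix.
W_ex10 = W_ex(r0, d) @ W_ex(r1, 2 * d)                  # shape (d, 4d)
print("recovered C exactly:", np.allclose(W_ex10 @ T, C))   # True
```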
One more example: go from passive-voice sentences to a predicate representation.
Input parse (roles P, Aux, V, by, A):
[Few movies] are admired by [skeptical critics]
F(T) = W T, where
W = W_cons0 [W_ex1 W_ex0 W_ex1] + W_cons1 [W_cons0 (W_ex1 W_ex1 W_ex1) + W_cons1 W_ex0]
This is just one matrix multiplication.
Output: Admire([Few movies], [skeptical critics])
Grammaticality can be represented as constraints (weights) on which role/filler pairs can co-occur. (Note the departure from matrix memories, in which the entries in the matrix were synapses.) These weights give us an energy function and evolution dynamics, and the network settles into low-energy (= high-grammaticality) states.
Other frameworks: vector-symbolic architectures. Idea: represent role/filler bindings as vectors rather than tensors, so everything stays the same size.
Circular convolution (Plate)
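A minimal sketch of Plate's circular-convolution binding (holographic reduced representations); the parameters here are illustrative. Binding keeps everything the same size, and unbinding by circular correlation is approximate, so in practice a cleanup step compares the result against known fillers.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 512

def hrr():
    # HRR-style random vector: components ~ N(0, 1/d).
    return rng.standard_normal(d) / np.sqrt(d)

def bind(a, b):      # circular convolution
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def unbind(a, b):    # circular correlation, approximate inverse of bind
    return np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

green, gray, jar, kettle = hrr(), hrr(), hrr(), hrr()

# Green(Jar) & Gray(Kettle): the bindings and their sum all stay in R^d.
T = bind(jar, green) + bind(kettle, gray)

probe = unbind(jar, T)                                   # noisy copy of green
print("cosine to green:", round(cos(probe, green), 2))   # clearly positive
print("cosine to gray: ", round(cos(probe, gray), 2))    # near 0
```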
Other frameworks: Recursive Auto-Associative Memory (RAAM). Learn the compression function: a tree with leaves A, B, C, D is encoded as W[W[W[B, C], D], A]. Just like in an ordinary autoencoder, we learn the same weights for encoding (W) and decoding (W^T). Pollack (1990)
Modern applications: Socher et al., 2013. Similar to RAAMs in that the model learns a parametrized compression function; the details are different.
(Baselines for comparison: bag of words, bigrams.)