Dense Associative Memories and Deep Learning: Dmitry Krotov, IBM


Oct 11, 2023


SLIDE 1

Dense Associative Memories and Deep Learning

Dmitry Krotov
IBM Research, MIT-IBM Watson AI Lab, Institute for Advanced Study

SLIDE 2

Architectures
Learning Mechanisms

SLIDE 3

What is associative memory?

[Figure: energy landscape with memories $\xi^1$, $\xi^2$, $\xi^3$, $\xi^4$.]

SLIDE 4

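Standard Associative Memory

$$E = -\sum_{i,j=1}^{N} \sigma_i T_{ij} \sigma_j, \qquad T_{ij} = \sum_{\mu=1}^{K} \xi^\mu_i \xi^\mu_j$$

$\sigma_i$: dynamical variables
$\xi^\mu$: memorized patterns
$N$: number of neurons
$K$: number of memories

$$E = -\sum_{\mu=1}^{K} \Big( \sum_{i=1}^{N} \xi^\mu_i \sigma_i \Big)^{2}, \qquad K^{\max} \approx 0.14\, N$$

Dense Associative Memory

$$E = -\sum_{\mu=1}^{K} \Big( \sum_{i=1}^{N} \xi^\mu_i \sigma_i \Big)^{n}, \qquad K^{\max} \approx \alpha_n N^{n-1}, \quad n \ge 2$$

$n$: power of the interaction vertex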
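
The two energy functions on this slide differ only in the exponent applied to each pattern overlap. The following is a minimal sketch of that formula in plain Python; the function name `energy` and the three toy patterns are illustrative assumptions, not something taken from the talk.

```python
def energy(sigma, patterns, n):
    """E = -sum_mu (sum_i xi^mu_i * sigma_i)^n for a +/-1 state sigma."""
    total = 0
    for xi in patterns:
        overlap = sum(x * s for x, s in zip(xi, sigma))
        total += overlap ** n
    return -total

# Three mutually orthogonal toy patterns on N = 8 neurons (hypothetical data).
patterns = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]

stored = patterns[1]
flipped = stored[:]
flipped[0] = -flipped[0]  # corrupt one neuron of the stored pattern

for n in (2, 4):
    print(n, energy(stored, patterns, n), energy(flipped, patterns, n))
# prints:
# 2 -64 -44
# 4 -4096 -1328
```

In this toy case the stored pattern sits lower in energy than its one-bit corruption for both exponents, and the gap is much larger for n = 4, which is the qualitative effect of raising the power of the interaction vertex.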

SLIDE 5

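$$\sigma_i^{(t+1)} = \mathrm{Sign}\Big[ \sum_{\mu=1}^{K} \Big( F\Big( \xi^\mu_i + \sum_{j \neq i} \xi^\mu_j \sigma_j^{(t)} \Big) - F\Big( -\xi^\mu_i + \sum_{j \neq i} \xi^\mu_j \sigma_j^{(t)} \Big) \Big) \Big]$$

$$\langle \xi^\mu_i \rangle = 0, \qquad \langle \xi^\mu_i \xi^\nu_j \rangle = \delta^{\mu\nu} \delta_{ij}$$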
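
The update rule compares, for each neuron, the energy contribution with that neuron set to +1 against the contribution with it set to -1. Below is a hedged sketch of one synchronous pass of that rule. The rectified polynomial used for `F`, the tie-breaking choice Sign(0) = +1, and the toy patterns are all assumptions made for illustration.

```python
def F(x, n):
    """Assumed energy function: rectified polynomial, x**n for x > 0, else 0."""
    return x ** n if x > 0 else 0

def update(sigma, patterns, n):
    """One synchronous pass: every neuron is recomputed from the old state."""
    new_sigma = []
    for i in range(len(sigma)):
        drive = 0
        for xi in patterns:
            # Overlap of pattern xi with the state, excluding neuron i.
            rest = sum(xi[j] * sigma[j] for j in range(len(sigma)) if j != i)
            drive += F(xi[i] + rest, n) - F(-xi[i] + rest, n)
        new_sigma.append(1 if drive >= 0 else -1)  # Sign(0) taken as +1 (assumption)
    return new_sigma

# Hypothetical toy memories: three orthogonal patterns on 8 neurons.
patterns = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]

corrupted = [-1, 1, 1, 1, -1, -1, -1, -1]  # patterns[1] with neuron 0 flipped
print(update(corrupted, patterns, n=3))
# prints: [1, 1, 1, 1, -1, -1, -1, -1]
```

On this example a single pass restores the stored pattern, and applying the update to the stored pattern itself leaves it unchanged.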

SLIDE 6

Pattern recognition with DAM

$v_i$: 784 visible neurons (28 × 28 image)
$x_\alpha$ or $c_\alpha$: 10 classification neurons

SLIDE 7

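$$\sigma_i^{(t+1)} = \mathrm{Sign}\Big[ \sum_{\mu=1}^{K} \Big( F\Big( \xi^\mu_i + \sum_{j \neq i} \xi^\mu_j \sigma_j^{(t)} \Big) - F\Big( -\xi^\mu_i + \sum_{j \neq i} \xi^\mu_j \sigma_j^{(t)} \Big) \Big) \Big]$$

$$c_\alpha = g\Big[ \beta \sum_{\mu=1}^{K} \Big( F\Big( -\xi^\mu_\alpha x_\alpha + \sum_{\gamma \neq \alpha} \xi^\mu_\gamma x_\gamma + \sum_{i=1}^{N} \xi^\mu_i v_i \Big) - F\Big( \xi^\mu_\alpha x_\alpha + \sum_{\gamma \neq \alpha} \xi^\mu_\gamma x_\gamma + \sum_{i=1}^{N} \xi^\mu_i v_i \Big) \Big) \Big]$$

with output $c_\alpha$ and $g(x) = \tanh(x)$.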
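
The classification update can be sketched in a few lines. The version below is an illustration only: it assumes the classification neurons are initialized at $x_\alpha = -1$, uses a rectified polynomial for `F`, and runs on two hand-made memories with four visible neurons and two classes. The names `classify`, `mem_v`, and `mem_c` are hypothetical.

```python
import math

def F(x, n):
    """Assumed energy function: rectified polynomial."""
    return x ** n if x > 0 else 0.0

def classify(v, mem_v, mem_c, n=2, beta=0.01):
    """Return c_alpha for every class.

    mem_v[mu][i] holds xi^mu_i (visible part of memory mu),
    mem_c[mu][a] holds xi^mu_alpha (classification part of memory mu).
    """
    num_classes = len(mem_c[0])
    x = [-1.0] * num_classes  # assumed initial state of the classification neurons
    outputs = []
    for a in range(num_classes):
        total = 0.0
        for xi_v, xi_c in zip(mem_v, mem_c):
            rest = sum(xi_c[g] * x[g] for g in range(num_classes) if g != a)
            rest += sum(w * vi for w, vi in zip(xi_v, v))
            total += F(-xi_c[a] * x[a] + rest, n) - F(xi_c[a] * x[a] + rest, n)
        outputs.append(math.tanh(beta * total))
    return outputs

mem_v = [[1, 1, -1, -1], [-1, -1, 1, 1]]  # toy visible memories
mem_c = [[1, -1], [-1, 1]]                # memory 0 -> class 0, memory 1 -> class 1
c = classify([1, 1, -1, -1], mem_v, mem_c)
print(c)  # c[0] is positive, c[1] is negative
```

Presenting the visible part of the first memory yields a positive output for class 0 and a negative output for class 1.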

SLIDE 8

MNIST Dataset

Random memories: $\xi^\mu_i \in \mathcal{N}(0, 0.1)$

[Figure: random memories before training and constructed memory vectors after training.]

SLIDE 9

Main question: What kind of representation of the data has the neural network learned?

SLIDE 10

Features vs. prototypes in psychology and neuroscience

Feature-matching theory: Hubel, Wiesel, 1959
[Figure: recording electrode in the visual area of the brain; a stimulus produces an electrical signal from the brain.]

Prototype theory: Solso, McCarthy, 1981; Wallis et al., Journal of Vision, 2008
[Figure: training set.]

SLIDE 11

Feature to prototype transition

[Figure: learned memories for n = 2, n = 3, n = 20, n = 30, where n is the power of the interaction vertex; the panels run from feature detectors to prototype detectors.]

SLIDE 12

Feature to prototype transition

[Figure: learned memories for n = 2, n = 3, n = 20, n = 30, where n is the power of the interaction vertex.]

Error rates shown on the slide: 1.80%, 1.61%, 1.44%, 1.51%
Reference result: Simard, Steinkraus, Platt, 2003: 1.6%

SLIDE 13

Duality with feed-forward nets

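Duality rule: $f(x) = F'(x)$, relating the energy function $F$ to the activation function $f$.

Feed-forward network (layers $v_i$, $h_\mu$, $c_\alpha$):

$$h_\mu = f\Big( \sum_{i=1}^{N} \xi^\mu_i v_i \Big), \qquad c_\alpha = g\Big( \sum_{\mu=1}^{K} \xi^\mu_\alpha h_\mu \Big)$$

Dense Associative Memory (neurons $v_i$, $x_\alpha$, output $c_\alpha$):

$$E = -\sum_{\mu=1}^{K} F\Big( \sum_{i=1}^{N} \xi^\mu_i v_i + \sum_{\alpha=1}^{10} \xi^\mu_\alpha c_\alpha \Big)$$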
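
One way to see where the duality rule comes from is a first-order expansion of the classification update. The sketch below rests on assumptions not stated on the slide: the classification neurons are clamped at $x_\alpha = -1$ before the update, their contribution to the argument of $F$ is small compared with the visible-neuron sum, and constant factors such as $2\beta$ can be absorbed into the weights. The shorthand $A^\mu$ is introduced here for convenience.

```latex
% A^\mu is shorthand introduced for this sketch, not notation from the slides.
A^\mu = \sum_{\gamma \neq \alpha} \xi^\mu_\gamma x_\gamma + \sum_{i=1}^{N} \xi^\mu_i v_i
      \approx \sum_{i=1}^{N} \xi^\mu_i v_i

% With x_\alpha = -1, expand to first order in \xi^\mu_\alpha:
F\big( A^\mu + \xi^\mu_\alpha \big) - F\big( A^\mu - \xi^\mu_\alpha \big)
      \approx 2\, \xi^\mu_\alpha\, F'(A^\mu)

% Substitute into the update for c_\alpha and use f = F':
c_\alpha \approx g\Big( 2\beta \sum_{\mu=1}^{K} \xi^\mu_\alpha\,
      f\Big( \sum_{i=1}^{N} \xi^\mu_i v_i \Big) \Big)
```

Up to the constant $2\beta$, the last line has the same form as the feed-forward expressions for $h_\mu$ and $c_\alpha$.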

SLIDE 14

Commonly used activation functions

$f(x) = \mathrm{ReLU}$: $n = 2$, standard Hopfield net
$f(x) = \mathrm{ReP}_{n-1}$: DAM
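
To make the link between the energy function and these activations concrete, here is a small sketch that takes $F$ to be a rectified polynomial of degree $n$ (an assumption) and checks numerically that its derivative is proportional to a rectified polynomial of degree $n - 1$. The helper names `rep`, `F`, and `f` are illustrative.

```python
def rep(x, n):
    """Rectified polynomial: x**n for x > 0, else 0."""
    return x ** n if x > 0 else 0.0

def F(x, n):
    """Assumed energy function of degree n."""
    return rep(x, n)

def f(x, n):
    """Activation from the duality rule f = F': n * ReP_{n-1}(x)."""
    return n * rep(x, n - 1)

# For n = 2 the activation is ReLU up to a constant factor of 2.
print(f(3.0, 2), f(-1.0, 2))  # prints: 6.0 0.0

# Central finite difference of F at x = 2 for n = 3, compared with f.
h = 1e-6
numeric = (F(2.0 + h, 3) - F(2.0 - h, 3)) / (2 * h)
print(abs(numeric - f(2.0, 3)) < 1e-6)  # prints: True
```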

SLIDE 15

Question: Are there any tasks for which models with higher-order interactions perform better than models with quadratic interactions?

SLIDE 16

Adversarial Inputs (n = 2)

$$v_i \to v_i - \frac{\partial C}{\partial v_i}$$

[Figure: digit images labeled 2 and 3.]
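
The image update on this slide is gradient descent on an objective $C$ with respect to the pixels. The sketch below illustrates the idea on a toy model and makes several assumptions that are not in the talk: the classifier is written in its feed-forward form with two hand-made memories, $C$ is a squared error against a target label, the gradient is taken by finite differences, and a step size with backtracking is added so the objective never increases.

```python
import math

# Hypothetical toy memories: 4 visible neurons, 2 classes.
MEM_V = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, -1.0]]  # xi^mu_i
MEM_C = [[1.0, -1.0], [-1.0, 1.0]]                      # xi^mu_alpha

def outputs(v, n=3, beta=0.01):
    """Feed-forward form: h_mu = ReP_{n-1}(xi^mu . v), c_alpha = tanh(beta * sum)."""
    h = [max(sum(w * vi for w, vi in zip(xi, v)), 0.0) ** (n - 1) for xi in MEM_V]
    return [math.tanh(beta * sum(MEM_C[mu][a] * h[mu] for mu in range(len(h))))
            for a in range(len(MEM_C[0]))]

def objective(v, target):
    """Assumed objective C: squared error between outputs and target labels."""
    return sum((c - t) ** 2 for c, t in zip(outputs(v), target))

def gradient(v, target, h=1e-5):
    """Central finite-difference estimate of dC/dv_i."""
    grad = []
    for i in range(len(v)):
        up, down = v[:], v[:]
        up[i] += h
        down[i] -= h
        grad.append((objective(up, target) - objective(down, target)) / (2 * h))
    return grad

def attack(v, target, steps=20, step=1.0):
    """Repeat v_i -> v_i - step * dC/dv_i, halving the step if C would not drop."""
    v = v[:]
    for _ in range(steps):
        grad, current, s = gradient(v, target), objective(v, target), step
        while s > 1e-12:
            candidate = [vi - s * gi for vi, gi in zip(v, grad)]
            if objective(candidate, target) < current:
                v = candidate
                break
            s /= 2
    return v

v0 = [1.0, 1.0, 1.0, 1.0]   # starts as class 0
target = [-1.0, 1.0]        # push it toward class 1
v_adv = attack(v0, target)
print(outputs(v0), outputs(v_adv))
```

On this toy setup the repeated updates lower the objective and move the input across the decision boundary from class 0 to class 1.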

SLIDE 17

Adversarial Deformations in DAM

[Figure: $\log(C_\alpha)$ versus number of image updates for $C_{1st}$ and $C_{2nd}$, with the decision boundary marked; panels for n = 2, n = 3, n = 20, n = 30.]

SLIDE 18

Question: Can we use Dense Associative Memories for classification of high resolution images?

SLIDE 19

VGG16 coupled to DAM

SLIDE 20

Adversarial Inputs in the Image Domain

SLIDE 21

Input transfer

Made with n=2: classified by n=2, classified by n=8
Made with n=8: classified by n=2, classified by n=8

SLIDE 22

SLIDE 23

Error rate of misclassification (rows: generate; columns: classify)

            n=2     n=8
n=2        100%     32%
n=8         57%    100%

SLIDE 24

Rows: generate; columns: test

            n=2      n=3      n=20     n=30
n=2        98.9%    50.7%    9.07%    3.44%
n=3        33.9%    99%      8.71%    3.32%
n=20       45.3%    63.7%    98.9%    5.77%
n=30       37.6%    48.3%    56.9%    98.8%

SLIDE 25

Results on ImageNet

Accuracy: 69%

SLIDE 26

ImageNet errors

police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria
bell cote, bell cot

SLIDE 27

Dense Associative Memories

$$E = -\sum_{\mu=1}^{K} \Big( \sum_{i=1}^{N} \xi^\mu_i \sigma_i \Big)^{n}$$

Large Capacity (Physics)
Feature to Prototype Transition (Psychology, Neuroscience)
No Adversarial Problems (Computer Science)