Neural networks: Unsupervised learning


SLIDE 1

Neural networks:
 Unsupervised learning

SLIDE 2

Previously

The supervised learning paradigm:

  • given example inputs x and target outputs t, learn the mapping between them
  • the trained network is supposed to give the ‘correct response’ for any given input stimulus
  • training is equivalent to learning the appropriate weights to achieve this
  • an objective function (or error function) is defined, which is minimized during training

SLIDE 3

Previously

Optimization w.r.t. an objective function M(w) = βE_D(w) + αE_W(w), where E_D(w) is the error function and E_W(w) is the regularizer.


SLIDE 8

Previously

Interpret y(x,w) as a probability:

  • the likelihood of the input data can be expressed with the original error function
  • the regularizer has the form of a prior!
  • what we get in the objective function M(w) is the posterior distribution of w

The neuron’s behavior is faithfully translated into probabilistic terms!
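The correspondence can be written out explicitly; a reconstruction in MacKay-style notation, assuming E_D is the error function, E_W the regularizer, and M(w) = βE_D + αE_W the objective:

```latex
% Likelihood and prior implied by the error function and the regularizer
P(D \mid \mathbf{w}) \propto e^{-\beta E_D(\mathbf{w})}, \qquad
P(\mathbf{w}) \propto e^{-\alpha E_W(\mathbf{w})}
% Bayes' rule: the objective M(w) is the negative log-posterior
P(\mathbf{w} \mid D) \propto P(D \mid \mathbf{w})\, P(\mathbf{w})
  \propto e^{-M(\mathbf{w})}, \qquad
M(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w})
```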


SLIDE 12

Previously

When making predictions: the original estimate uses the single most probable parameter vector, while the Bayesian estimate averages predictions over the posterior distribution of w.

  • The probabilistic interpretation makes our assumptions explicit:

by the regularizer we imposed a soft constraint on the learned parameters, which expresses our prior expectations.

  • An additional plus:

beyond getting wMP we get a measure of the uncertainty in the learned parameters

SLIDE 13

What’s coming?

  • Networks & probabilistic framework:

from the Hopfield network to Boltzmann machine

  • What do we learn?

Density estimation, neural architecture and optimization 
 principles: principal component analysis (PCA)

  • How do we learn?

Hebb et al: Learning rules

  • Any biology?

Simple cells & ICA

SLIDE 14

Learning data...


SLIDE 16

Unsupervised learning: what is it about?

  • Capacity of a single neuron is limited: certain data can only be learned
  • So far we used a supervised learning paradigm: a teacher was necessary to teach an input-output relation
  • Hopfield networks try to cure both
  • Hebb rule, an enlightening example: assuming 2 neurons and a weight modification process, this simple rule realizes an associative memory!


SLIDE 18

Neural networks
 The Hopfield network

Architecture: a set of I neurons connected by symmetric synapses of weight wij; no self-connections: wii = 0

  • Output of neuron i: xi

Activity rule: each neuron computes its activation ai = Σj wij xj and sets xi = Θ(ai), updated synchronously or asynchronously. Learning rule (Hebb): wij = η Σn xi(n) xj(n); alternatively, a continuous network can be defined by xi = tanh(ai).
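A minimal sketch of the architecture, activity rule, and Hebbian learning rule in Python (the stored 8-bit pattern and the number of update sweeps are arbitrary choices):

```python
def train(patterns):
    """Hebb rule: w_ij = sum over patterns of x_i * x_j, no self-connections."""
    I = len(patterns[0])
    w = [[0.0] * I for _ in range(I)]
    for x in patterns:
        for i in range(I):
            for j in range(I):
                if i != j:
                    w[i][j] += x[i] * x[j]
    return w

def recall(w, x, sweeps=5):
    """Asynchronous activity rule: x_i = sign(sum_j w_ij x_j), one neuron at a time."""
    x = list(x)
    for _ in range(sweeps):
        for i in range(len(x)):
            a = sum(w[i][j] * x[j] for j in range(len(x)))  # activation a_i
            x[i] = 1 if a >= 0 else -1
    return x

# Store one pattern, then recover it from a corrupted cue.
p = [1, -1, 1, 1, -1, -1, 1, -1]
w = train([p])
cue = list(p)
cue[0] = -cue[0]               # flip one bit
print(recall(w, cue) == p)     # True: the stored memory acts as an attractor
```

This is the associative-memory behavior the slides describe: the dynamics pull a corrupted input back to the nearest stored pattern.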


SLIDE 21

Neural networks
 Stability of Hopfield network

Are the memories stable? Necessary conditions: symmetric weights and asynchronous updates. The activation and activity rule together define a Lyapunov function, the energy E(x) = −½ Σij wij xi xj, which never increases under asynchronous updates. The network is also robust against perturbation of a subset of weights.
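The Lyapunov property can be checked numerically; a sketch with arbitrary random symmetric weights (not an example from the slides):

```python
import random

def energy(w, x):
    """Candidate Lyapunov function: E(x) = -1/2 * sum_ij w_ij x_i x_j."""
    I = len(x)
    return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(I) for j in range(I))

rng = random.Random(1)
I = 6
# Symmetric weights with zero self-connections: the stability conditions.
w = [[0.0] * I for _ in range(I)]
for i in range(I):
    for j in range(i + 1, I):
        w[i][j] = w[j][i] = rng.uniform(-1, 1)

x = [rng.choice([-1, 1]) for _ in range(I)]
energies = [energy(w, x)]
for _ in range(50):
    i = rng.randrange(I)                      # asynchronous update of one neuron
    a = sum(w[i][j] * x[j] for j in range(I))
    x[i] = 1 if a >= 0 else -1
    energies.append(energy(w, x))

# Each asynchronous update can only lower (or keep) the energy.
print(all(b <= a + 1e-12 for a, b in zip(energies, energies[1:])))   # True
```

The one-line proof behind the check: flipping x_i changes the energy by −(x_i' − x_i)·a_i, which is never positive when x_i' = sign(a_i).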

SLIDE 22

Neural networks
 Capacity of Hopfield network

How many traces can be memorized by a network of I neurons?

SLIDE 23

Neural networks
 Capacity of Hopfield network

Failures of the Hopfield network:

  • Corrupted bits
  • Missing memory traces
  • Spurious states not directly related to training data

SLIDE 28

Neural networks
 Capacity of Hopfield network

Activation rule: when the ‘desired’ memory is presented, with N additional random memories stored in the weights, the activation splits into a desired-state term plus a random contribution (crosstalk) from the other memories.
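The split of the activation into a desired term plus random crosstalk can be checked numerically; a NumPy sketch with arbitrarily chosen I and N (with 1/I-scaled Hebb weights the crosstalk variance should come out near N/I):

```python
import numpy as np

rng = np.random.default_rng(0)
I, N = 1000, 100                          # I neurons, N random memories
X = rng.choice([-1, 1], size=(N, I))

# Hebb weights with 1/I scaling and zero diagonal.
W = (X.T @ X) / I
np.fill_diagonal(W, 0.0)

# Present memory 0: a_i = (desired state) + (random contribution).
a = W @ X[0]
desired = (I - 1) / I * X[0]              # contribution of memory 0 itself
crosstalk = a - desired                   # contribution of the other N-1 memories

print(round(float(crosstalk.mean()), 3), round(float(crosstalk.var()), 3))
```

The measured variance is close to N/I = 0.1, which is why the load ratio N/I governs the failure regimes listed on the next slide.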

SLIDE 29

Neural networks
 Capacity of Hopfield network

Failure in operation: avalanches

  • N/I > 0.138: ‘spin glass’ states
  • N/I ∈ (0, 0.138): states close to desired memories
  • N/I ∈ (0, 0.05): desired states have lower energy than spurious states
  • N/I ∈ (0.05, 0.138): spurious states dominate
  • N/I ∈ (0, 0.03): mixture states

The Hebb rule determines how well the network performs

  • Other learning rules might do a better job (reinforcement learning)

SLIDE 30

Hopfield network for optimization


SLIDE 35

The Boltzmann machine

The optimization performed by the Hopfield network: minimizing the energy E(x) = −½ Σij wij xi xj. Again, we can make a correspondence with a probabilistic model: P(x|W) = e^(−E(x))/Z(W). What do we gain by this:

  • more transparent functioning
  • superior performance compared to the Hebb rule

Activity rule: units update stochastically, P(xi = +1 | rest) = 1/(1 + e^(−2ai)), so the network samples from P(x|W). How is learning performed?
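The stochastic activity rule is exactly Gibbs sampling; a sketch for a hypothetical 3-unit machine (weights picked arbitrarily), comparing the sampled state frequencies with the exact Boltzmann distribution:

```python
import collections
import itertools
import math
import random

def gibbs(w, steps, rng):
    """Activity rule: pick a unit, set x_i = +1 with prob 1/(1 + exp(-2 a_i))."""
    I = len(w)
    x = [rng.choice([-1, 1]) for _ in range(I)]
    counts = collections.Counter()
    for _ in range(steps):
        i = rng.randrange(I)
        a = sum(w[i][j] * x[j] for j in range(I))
        x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * a)) else -1
        counts[tuple(x)] += 1
    return counts

w = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.2],
     [-0.3, 0.2, 0.0]]
steps = 200_000
counts = gibbs(w, steps, random.Random(0))

# Exact Boltzmann distribution P(x|W) = exp(-E(x)) / Z with E(x) = -1/2 x.W.x
states = list(itertools.product([-1, 1], repeat=3))
un = {s: math.exp(0.5 * sum(w[i][j] * s[i] * s[j]
                            for i in range(3) for j in range(3)))
      for s in states}
Z = sum(un.values())
worst = max(abs(counts[s] / steps - un[s] / Z) for s in states)
print(worst < 0.02)    # empirical frequencies match P(x|W)
```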


SLIDE 41

Boltzmann machine -- EM

Likelihood function: P(data|W) = Πn P(x(n)|W). Maximizing it with respect to w gives the gradient ∂ ln P/∂wij = ⟨xi xj⟩data − ⟨xi xj⟩model. Estimating the parameters thus alternates ‘waking’ (correlations measured with units clamped to the data) and ‘sleeping’ (correlations measured under the free-running model).
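The waking/sleeping gradient can be sketched exactly for a tiny fully visible machine (the toy dataset below is hypothetical); since the log-likelihood of a fully visible Boltzmann machine is concave in the weights, small gradient-ascent steps steadily improve it:

```python
import itertools
import math

STATES = list(itertools.product([-1, 1], repeat=3))

def model_probs(w):
    """Fully visible Boltzmann machine: P(x|W) ∝ exp(x·W·x / 2)."""
    un = [math.exp(0.5 * sum(w[i][j] * s[i] * s[j]
                             for i in range(3) for j in range(3)))
          for s in STATES]
    Z = sum(un)
    return [q / Z for q in un]

def log_likelihood(w, data):
    p = dict(zip(STATES, model_probs(w)))
    return sum(math.log(p[x]) for x in data)

data = [(1, 1, -1), (1, 1, -1), (-1, -1, 1), (1, 1, 1)]   # hypothetical toy set
w = [[0.0] * 3 for _ in range(3)]
before = log_likelihood(w, data)

eta = 0.05
for _ in range(100):
    p = model_probs(w)
    for i in range(3):
        for j in range(3):
            if i != j:
                # 'waking': correlations with units clamped to the data
                wake = sum(x[i] * x[j] for x in data) / len(data)
                # 'sleeping': correlations under the free-running model
                sleep = sum(pk * s[i] * s[j] for pk, s in zip(p, STATES))
                w[i][j] += eta * (wake - sleep)

after = log_likelihood(w, data)
print(after > before)   # gradient ascent raises the data likelihood
```

In a realistic machine the ‘sleeping’ correlations cannot be summed exactly and are estimated by sampling, e.g. with the Gibbs activity rule above.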

SLIDE 42

Learning data...


SLIDE 44

Summary

The Boltzmann machine translates the neural network mechanisms into a probabilistic framework. Its capabilities are limited. We learned that the probabilistic framework clarifies assumptions, and that within the world constrained by our assumptions the probabilistic approach gives clear answers.


SLIDE 48

Learning data...

Hopfield/Boltzmann ?


SLIDE 50

Principal Component Analysis

Let’s try to find linearly independent filters: set the basis along the eigenvectors of the data covariance matrix.
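A minimal PCA sketch on synthetic data (the data distribution is an assumed example): diagonalize the data covariance matrix and read off the leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along the direction (1, 1)/sqrt(2).
z = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s], [s, c]])
u = z @ R.T

# Set the basis along the eigenvectors of the data covariance matrix.
u = u - u.mean(axis=0)
C = (u.T @ u) / len(u)
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                      # leading principal component

print(np.round(np.abs(pc1), 2))           # close to [0.71, 0.71]
```

The recovered direction matches the axis the data were stretched along, up to sign.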


SLIDE 54

Principal Component Analysis

Olshausen & Field, Nature (1996)

SLIDE 55

Relation between PCA and learning rules

A single neuron driven by multiple inputs: v = w · u. Basic Hebb rule: τ dw/dt = v u. Averaged Hebb rule: τ dw/dt = ⟨v u⟩. Correlation-based rule: τ dw/dt = Q · w, where Q = ⟨u uᵀ⟩; note that for a linear neuron the averaged Hebb rule and the correlation-based rule coincide.
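A quick numerical check of the correlation-based rule (the matrix Q and the step size are arbitrary choices): iterating τ dw/dt = Q w aligns w with the principal eigenvector of Q but lets its norm grow without bound, which is what motivates the regularized rules of the following slides:

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # assumed input correlation matrix <u u^T>
w = np.array([1.0, 0.0])
for _ in range(200):
    w = w + 0.1 * (Q @ w)             # Euler steps of tau dw/dt = Q w

vals, vecs = np.linalg.eigh(Q)
e1 = vecs[:, -1]                      # principal eigenvector of Q
alignment = abs(float(w @ e1)) / float(np.linalg.norm(w))
print(round(alignment, 3), float(np.linalg.norm(w)) > 1e6)   # aligned, but unbounded
```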


SLIDE 58

Relation between PCA and learning rules

Making possible both LTP and LTD: subtract a threshold, either from the presynaptic activity, τ dw/dt = v (u − θu), or from the postsynaptic activity, τ dw/dt = (v − θv) u. Setting the threshold to the average postsynaptic activity gives a covariance rule, τ dw/dt = C · w, where C is the input covariance matrix. One threshold placement produces heterosynaptic depression, the other homosynaptic depression. BCM rule: τ dw/dt = v u (v − θv), with a sliding threshold.

SLIDE 59

Relation between PCA and learning rules
 Regularization again

BCM rule: τ dw/dt = v u (v − θv). Hebb rule with a regularizing weight-decay term (Oja rule): τ dw/dt = v u − α v² w, which keeps the weight vector bounded.
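A sketch of the Oja rule under assumed choices (correlation matrix Q, learning rate, number of steps): with the v²w decay term the weight vector now converges to the unit-norm principal eigenvector instead of blowing up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed input correlation matrix Q = <u u^T>.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
L = np.linalg.cholesky(Q)             # to draw zero-mean inputs with correlation Q

w = np.array([0.3, 0.9])
eta = 0.01
for _ in range(20000):
    u = L @ rng.normal(size=2)        # input sample
    v = float(w @ u)                  # linear neuron output
    w += eta * (v * u - v**2 * w)     # Oja rule: Hebb term + v^2 w decay

vals, vecs = np.linalg.eigh(Q)
e1 = vecs[:, -1]                      # principal eigenvector of Q
print(float(np.linalg.norm(w)), abs(float(w @ e1)))   # both near 1
```

This is the sense in which a Hebbian neuron with regularization performs PCA: it extracts the first principal component of its inputs.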

SLIDE 60

Relation between PCA and learning rules


SLIDE 63

The architecture of the network and the learning rule hand-in-hand determine the learned representation


SLIDE 67

Density estimation

Empirical (input) distribution p̃[u]; latent variables v; the generative model defines the generative distribution p[u|v; G] and the marginal p[u; G] = Σv p[v; G] p[u|v; G]; recognition is performed by the recognition distribution p[v|u; G]. The Kullback-Leibler divergence DKL(p̃[u] ‖ p[u; G]) = Σu p̃[u] ln(p̃[u]/p[u; G]) measures the match between our model distribution and the input distribution.
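A tiny numerical illustration of the objective (the three-state distributions are made up for the example): the divergence is zero when the generative model matches the input distribution and positive otherwise:

```python
import math

def kl(p_data, p_model):
    """D_KL(p_data ‖ p_model) = Σ_u p̃[u] ln(p̃[u] / p[u;G])."""
    return sum(pd * math.log(pd / pm)
               for pd, pm in zip(p_data, p_model) if pd > 0)

p_tilde = [0.5, 0.25, 0.25]        # empirical input distribution
p_good  = [0.5, 0.25, 0.25]        # generative model that matches it
p_bad   = [0.8, 0.1, 0.1]          # mismatched generative model

print(round(kl(p_tilde, p_good), 4), round(kl(p_tilde, p_bad), 4))   # 0.0 0.2231
```

Density estimation means adjusting the generative parameters G to push this divergence toward zero.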

SLIDE 68

How to solve density estimation?
 EM
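A minimal EM sketch for a toy generative model (a two-component 1-D Gaussian mixture with hypothetical data; the mixture is my illustrative choice, not the model from the slides). The E step computes the recognition distribution over the latent cause, the M step refits the generative parameters:

```python
import math
import random

rng = random.Random(0)
# Hypothetical 1-D data from two Gaussian causes at -2 and 3.
data = ([rng.gauss(-2.0, 0.5) for _ in range(100)] +
        [rng.gauss(3.0, 0.5) for _ in range(100)])

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(pi, mus, vs):
    return sum(math.log(pi[0] * gauss(x, mus[0], vs[0]) +
                        pi[1] * gauss(x, mus[1], vs[1])) for x in data)

pi, mus, vs = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
prev = log_likelihood(pi, mus, vs)
for _ in range(30):
    # E step: recognition distribution p[v|u] over the latent cause
    r = [[pi[k] * gauss(x, mus[k], vs[k]) for k in range(2)] for x in data]
    r = [[rk / (row[0] + row[1]) for rk in row] for row in r]
    # M step: refit the generative parameters to the responsibility-weighted data
    for k in range(2):
        nk = sum(row[k] for row in r)
        pi[k] = nk / len(data)
        mus[k] = sum(row[k] * x for row, x in zip(r, data)) / nk
        vs[k] = sum(row[k] * (x - mus[k]) ** 2 for row, x in zip(r, data)) / nk
    cur = log_likelihood(pi, mus, vs)
    assert cur >= prev - 1e-9          # EM never decreases the likelihood
    prev = cur

print(sorted(round(m) for m in mus))   # [-2, 3]: the two causes are recovered
```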


SLIDE 71

Sparse coding

PCA: trying to find linearly independent filters, setting the basis along the eigenvectors of the data. Sparse coding: make the reconstruction faithful, and keep the units/neurons quiet.

SLIDE 72

Sparse coding

Neural dynamics: gradient descent on the sparse-coding energy, which combines the reconstruction error with a sparseness penalty.
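A sketch of the inference dynamics under assumed choices (a random hypothetical dictionary Φ, a smooth log-sparseness penalty, hand-picked λ and step size): gradient descent on the energy drives the code toward a faithful yet quiet representation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical overcomplete dictionary: 8 unit-norm basis vectors in 4-D.
Phi = rng.normal(size=(4, 8))
Phi /= np.linalg.norm(Phi, axis=0)
u = 1.5 * Phi[:, 2] - 1.0 * Phi[:, 5]     # input generated by two causes

lam, eta = 0.1, 0.05

def energy(v):
    """Faithful + quiet: ‖u − Φv‖² + λ Σ log(1 + v_i²)."""
    return float(np.sum((u - Phi @ v) ** 2) + lam * np.sum(np.log1p(v ** 2)))

v = np.zeros(8)
e0 = energy(v)
for _ in range(2000):
    grad = -2 * Phi.T @ (u - Phi @ v) + lam * 2 * v / (1 + v ** 2)
    v -= eta * grad                       # neural dynamics = gradient descent on v

print(energy(v) < e0)                     # True: the energy has decreased
```

In the full Olshausen & Field scheme this inference loop alternates with a learning step that adapts the dictionary itself; here Φ is held fixed for clarity.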

SLIDE 73

Sparse coding