
SLIDE 1

Algorithms in Nature

Neural Networks (NN)

SLIDE 2

Mimicking the brain

  • In the early days of AI there was a lot of interest in developing models that can mimic human thinking.
  • While no one knew exactly how the brain works (and, even though there has been a lot of progress since, there is still a lot we do not know), some of the basic computational units were known.
  • A key component of these units is the neuron.
SLIDE 3

The Neuron

  • A cell in the brain
  • Highly connected to other neurons
  • Thought to perform computations by integrating signals from other neurons
  • Outputs of these computations may be transmitted to one or more neurons

SLIDE 4

Biological inspiration

  • The nervous system is built using relatively simple units, the neurons, so copying their behaviour and functionality could provide solutions to problems related to interpretation and optimization.

SLIDE 5

Biological inspiration

[Diagram of a neuron: dendrites, soma (cell body), axon]

SLIDE 6

Biological inspiration

[Diagram: axons, dendrites, synapses] Synapses are the edges in this network, responsible for transmitting information between the neurons.

SLIDE 7

Biological inspiration

  • The spikes travelling along the axon of the pre-synaptic neuron trigger the release of neurotransmitter substances at the synapse.
  • The neurotransmitters cause excitation or inhibition in the dendrite of the post-synaptic neuron.
  • The integration of the excitatory and inhibitory signals may produce spikes in the post-synaptic neuron.
  • The contribution of the signals depends on the strength of the synaptic connection.

SLIDE 8

What can we do with NN?

  • Classification
  • Regression

Input: real-valued variables. Output: one or more real values.

  • Examples:
    • Predict the price of Google’s stock from Microsoft’s stock price
    • Predict distance to obstacle from various sensors
SLIDE 9

Recall: Regression

  • In linear regression we assume that y and x are related by the following equation: y = wx + ε

[Plot: data points and fitted line in the (x, y) plane]

SLIDE 10

Multivariate regression: Least squares

  • We already presented a solution for determining the parameters of a general linear regression problem. Define:

Φ = ⎡ φ0(x1) φ1(x1) … φm(x1) ⎤
    ⎢ φ0(x2) φ1(x2) … φm(x2) ⎥
    ⎢   ⋮      ⋮         ⋮    ⎥
    ⎣ φ0(xn) φ1(xn) … φm(xn) ⎦

With the model y = wTφ(x) + ε, deriving w we get:

w = (ΦTΦ)−1ΦTy

SLIDE 11

Multivariate regression: Least squares

  • The solution turns out to be:

w = (ΦTΦ)−1ΦTy

  • We need to invert a k × k matrix (k − 1 is the number of features)
  • This takes O(k³)
  • Depending on k this can be rather slow
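As an illustrative sketch (not from the original slides), the closed-form solution w = (ΦᵀΦ)⁻¹Φᵀy can be computed with NumPy. The polynomial basis and the toy data here are assumptions for demonstration only:

```python
import numpy as np

def design_matrix(x, m):
    """Build Φ with columns φ_0(x)=1, φ_1(x)=x, ..., φ_m(x)=x^m
    (a polynomial basis is just one possible choice of φ)."""
    return np.vander(x, m + 1, increasing=True)

def least_squares(x, y, m=1):
    phi = design_matrix(x, m)
    # Solve (Φ^T Φ) w = Φ^T y directly; this linear solve is where
    # the O(k^3) cost mentioned on the next slide comes from.
    return np.linalg.solve(phi.T @ phi, phi.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0            # noiseless line, so w should recover [1, 2]
w = least_squares(x, y, m=1)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard, numerically safer way to evaluate this formula.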

SLIDE 12

Where we are

  • Linear regression – solved!
  • But
    • Solution may be slow
    • Does not address general regression problems of the form y = f(wTx)

SLIDE 13

Back to NN: Perceptron

  • The basic processing unit of a neural net

y = f(∑i wixi)

[Diagram: input layer x1 … xk plus a constant input 1, with weights w0, w1, …, wk feeding the output unit y]
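A minimal sketch of this processing unit (the function names and toy numbers are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron(x, w, f=lambda s: s):
    """Single perceptron unit: y = f(sum_i w_i x_i).
    A constant input 1 is prepended so that w[0] acts as the bias w_0.
    With f = identity this is exactly a linear unit."""
    x = np.concatenate(([1.0], x))
    return f(np.dot(w, x))

# weights w0=0.5, w1=1.0, w2=-1.0 applied to inputs (2, 3):
# 0.5*1 + 1.0*2 - 1.0*3 = -0.5
out = perceptron(np.array([2.0, 3.0]), np.array([0.5, 1.0, -1.0]))
```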

SLIDE 14

Linear regression

  • Let’s start by setting f(∑wixi) = ∑wixi
  • We are back to linear regression
  • Unlike our original linear regression solution, for perceptrons we will use a different strategy
  • Why?

y = ∑i wixi

[Diagram: inputs x1 … xk and constant input 1 with weights w0 … wk]

SLIDE 15

Gradient descent

z = (f(w) − y)²

[Plot: z as a function of w; slope = ∂z/∂w, with increments ∆z, ∆w]

  • Going in the opposite direction to the slope will lead to a smaller z
  • But not too much, otherwise we would go beyond the optimal w
SLIDE 16

Gradient descent

  • Going in the opposite direction to the slope will lead to a smaller z
  • But not too much, otherwise we would go beyond the optimal w
  • We thus update the weights by setting:

w ← w − λ ∂z/∂w

where λ is a small constant which is intended to prevent us from passing the optimal w

SLIDE 17

Example when choosing the ‘right’ λ

  • We get a monotonically decreasing error as we perform more updates

SLIDE 18

Gradient descent for linear regression

  • Taking the derivative w.r.t. each wi for a sample x:

∂/∂wi (y − ∑k wkxk)² = −2xi (y − ∑k wkxk)

  • And if we have n measurements then

∂/∂wi ∑j (yj − wTxj)² = −2 ∑j xj,i (yj − wTxj)

where xj,i is the i’th value of the j’th input vector

SLIDE 19

Gradient descent for linear regression

  • If we have n measurements then

∂/∂wi ∑j (yj − wTxj)² = −2 ∑j xj,i (yj − wTxj)

  • Set δj = (yj − wTxj)
  • Then our update rule can be written as

wi ← wi + 2λ ∑j δj xj,i

SLIDE 20

Gradient descent algorithm for linear regression

1. Choose λ
2. Start with a guess for w
3. Compute δj = yj − wTxj for all j
4. For all i set wi ← wi + 2λ ∑j δj xj,i
5. If no improvement in ∑j (yj − wTxj)², stop. Otherwise go to step 3
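The algorithm above can be sketched in a few lines of NumPy (a hedged sketch: λ, the stopping tolerance, and the toy data are illustrative assumptions):

```python
import numpy as np

def gd_linear_regression(X, y, lam=0.01, max_iter=10000, tol=1e-12):
    n, k = X.shape
    w = np.zeros(k)                       # step 2: initial guess
    prev_err = np.inf
    for _ in range(max_iter):
        delta = y - X @ w                 # step 3: δ_j = y_j - w^T x_j
        w = w + 2 * lam * X.T @ delta     # step 4: w_i += 2λ Σ_j δ_j x_{j,i}
        err = np.sum((y - X @ w) ** 2)    # step 5: stop when no improvement
        if prev_err - err < tol:
            break
        prev_err = err
    return w

# first column of 1s plays the role of the constant input for w_0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])             # generated by y = 1 + 2x
w = gd_linear_regression(X, y)
```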

SLIDE 21

Gradient descent vs. matrix inversion

  • Advantages of matrix inversion
  • No iterations
  • No need to specify parameters
  • Closed form solution in a predictable time
  • Advantages of gradient descent
  • Applicable regardless of the number of parameters
  • General, applies to other forms of regression
SLIDE 22

Perceptrons for classification

  • So far we discussed regression
  • However, perceptrons can also be used for classification
  • For example, output 1 if wTx > 0 and -1 otherwise
  • Problem?
SLIDE 23

Regression for classification

  • Assume we would like to use linear regression to learn

the parameters for a classification problem

  • Problems?

[Plot: data labeled 1 and −1 with the optimal regression model]

wTx ≥ 0 ⇒ classify as 1
wTx < 0 ⇒ classify as -1

SLIDE 24

The sigmoid function

  • To classify using regression models we replace the linear function with the sigmoid function:

g(h) = 1 / (1 + e^(−h))

  • Using the sigmoid we set (for binary classification problems) p(y | x;θ), always between 0 and 1:

p(y = 0 | x;θ) = 1 / (1 + e^(wTx))
p(y = 1 | x;θ) = 1 − p(y = 0 | x;θ) = e^(wTx) / (1 + e^(wTx))
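A short sketch of the sigmoid and the two class probabilities above (the example weight and input vectors are illustrative assumptions):

```python
import numpy as np

def g(h):
    """The sigmoid function g(h) = 1 / (1 + e^{-h})."""
    return 1.0 / (1.0 + np.exp(-h))

def p_y0(w, x):
    """p(y = 0 | x) = 1 / (1 + e^{w^T x})."""
    return 1.0 / (1.0 + np.exp(w @ x))

def p_y1(w, x):
    """p(y = 1 | x) = 1 - p(y = 0 | x) = e^{w^T x} / (1 + e^{w^T x})."""
    return 1.0 - p_y0(w, x)

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.25])
total = p_y0(w, x) + p_y1(w, x)   # the two probabilities sum to 1
```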

SLIDE 25

The sigmoid function

  • To classify using regression models we replace the linear function with the sigmoid function:

g(h) = 1 / (1 + e^(−h))

  • Using the sigmoid we set (for binary classification problems) p(y | x;θ), always between 0 and 1:

p(y = 0 | x;θ) = 1 / (1 + e^(wTx))
p(y = 1 | x;θ) = 1 − p(y = 0 | x;θ) = e^(wTx) / (1 + e^(wTx))

We can use the sigmoid function as part of the perceptron when using it for classification.

SLIDE 26

Logistic regression vs. Linear regression

p(y = 0 | x;θ) = 1 / (1 + e^(wTx))

p(y = 1 | x;θ) = e^(wTx) / (1 + e^(wTx))

SLIDE 27

Non linear regression with NN

  • So how do we find the parameters?
  • Least squares minimization when using a sigmoid function in a NN:

min_w ∑j (yj − g(wTxj))²,  where g(x) = 1 / (1 + e^(−x))

Taking the derivative w.r.t. wi we get:

∂/∂wi ∑j (yj − g(wTxj))² = −2 ∑j (yj − g(wTxj)) g(wTxj)(1 − g(wTxj)) xj,i

using g′(x) = g(x)(1 − g(x))

SLIDE 28

Deriving g’(x)

  • Recall that g(x) is the sigmoid function, so g(x) = 1 / (1 + e^(−x))
  • The derivation of g′(x) is below:

g′(x) = e^(−x) / (1 + e^(−x))² = g(x)(1 − g(x))

SLIDE 29

New target function for NN

  • So how do we find the parameters?
  • Least squares minimization when using a sigmoid function in a NN:

min_w ∑j (yj − g(wTxj))²,  where g(x) = 1 / (1 + e^(−x))

Taking the derivative w.r.t. wi we get:

∂/∂wi ∑j (yj − g(wTxj))² = −2 ∑j (yj − g(wTxj)) g(wTxj)(1 − g(wTxj)) xj,i = −2 ∑j δj gj(1 − gj) xj,i

where gj ≝ g(wTxj), δj ≝ yj − gj, and g′(x) = g(x)(1 − g(x))

SLIDE 30

Revised algorithm for sigmoid regression

1. Choose λ
2. Start with a guess for w
3. Compute δj = yj − g(wTxj) for all j
4. For all i set wi ← wi + 2λ ∑j δj gj(1 − gj) xj,i
5. If no improvement in ∑j (yj − g(wTxj))², stop. Otherwise go to step 3
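The sigmoid update rule above can be sketched as follows (a hedged sketch: λ, the iteration count, and the separable toy data are illustrative assumptions):

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def sigmoid_regression(X, y, lam=0.5, iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        gj = g(X @ w)                    # g_j = g(w^T x_j)
        delta = y - gj                   # δ_j = y_j - g_j
        # w_i <- w_i + 2λ Σ_j δ_j g_j (1 - g_j) x_{j,i}
        w = w + 2 * lam * X.T @ (delta * gj * (1 - gj))
    return w

# first column of 1s is the constant input; labels separable in x
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = sigmoid_regression(X, y)
preds = (g(X @ w) > 0.5).astype(float)
```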

SLIDE 31

Multilayer neural networks

  • So far we discussed networks with one layer.
  • But these networks can be extended to combine several

layers, increasing the set of functions that can be represented using a NN

[Diagram: input layer x1, x2 and constant 1, with weights w0,1, w1,1, w2,1, w0,2, w1,2, w2,2 feeding hidden units v1 = g(wTx) and v2 = g(wTx); hidden-layer weights w1, w2 feed the output y = g(wTv)]

Input layer → Hidden layer → Output layer

SLIDE 32

Learning the parameters for multilayer networks

  • Gradient descent works by connecting the output to the inputs.
  • But how do we use it for a multilayer network?
  • We need to account for both the output weights and the hidden layer weights.

[Diagram: inputs x1, x2, 1 feeding hidden units v1 = g(wTx) and v2 = g(wTx), which feed the output y = g(wTv)]

SLIDE 33

Learning the parameters for multilayer networks

  • If we know the values of the internal layer, it is easy to compute the update rule for the output weights w1 and w2:

wi ← wi + 2λ ∑j δj gj(1 − gj) vj,i

where δj = (yj − g(wTvj))

[Diagram: inputs x1, x2, 1 feeding hidden units v1 = g(wTx) and v2 = g(wTx), which feed the output y = g(wTv)]

SLIDE 34

Learning the parameters for multilayer networks

  • It’s easy to compute the update rule for the output weights w1 and w2:

wi ← wi + 2λ ∑j δj gj(1 − gj) vj,i

where δj = (yj − g(wTvj))

[Diagram: inputs x1, x2, 1 feeding hidden units v1 = g(wTx) and v2 = g(wTx), which feed the output y = g(wTv)]

But what is the error associated with each of the hidden layer states?

SLIDE 35

Backpropagation

  • A method for distributing the error among hidden layer states
  • Using the error for each of these states we can employ gradient descent to update them
  • Set

∆j,i = wiδj(1 − gj)gj

(the output error δj weighted by wi)

[Diagram: inputs x1, x2, 1 feeding hidden units v1 = g(wTx) and v2 = g(wTx), which feed the output y = g(wTv)]

SLIDE 36

Backpropagation

  • A method for distributing the error among hidden layer states
  • Using the error for each of these states we can employ gradient descent to update them
  • Set ∆j,i = wiδj(1 − gj)gj
  • Our update rule changes to:

wk,i ← wk,i + 2λ ∑j ∆j,i gj,i(1 − gj,i) xj,k

SLIDE 37

Backpropagation

wk,i ← wk,i + 2λ ∑j ∆j,i gj,i(1 − gj,i) xj,k

The correct error term for each hidden state can be determined by taking the partial derivative of each of the weight parameters of the hidden layer w.r.t. the global error function*:

Err = ∑j (yj − g(∑i wi g(wiTxj)))²

*See RN book for details (pages 746-747)

SLIDE 38

Revised algorithm for multilayered neural network

1. Choose λ
2. Start with a guess for w, wi
3. Compute values vj,i for all hidden layer states i and inputs j
4. Compute δj = yj − g(wTvj) for all j
5. Compute ∆j,i = wiδj(1 − gj)gj
6. For all i set wi ← wi + 2λ ∑j δj gj(1 − gj) vj,i
7. For all k and i set wk,i ← wk,i + 2λ ∑j ∆j,i gj,i(1 − gj,i) xj,k
8. If no improvement in ∑j δj² + ∑i ∑j ∆j,i², stop. Otherwise go to step 3
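The steps above can be sketched for a one-hidden-layer network (a hedged sketch: the XOR-style data, network size, λ, iteration count, and random initialization are all illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train(X, y, hidden=4, lam=0.1, iters=4000):
    n, d = X.shape
    W = rng.normal(scale=0.5, size=(hidden, d))  # step 2: hidden weights
    w = rng.normal(scale=0.5, size=hidden)       # step 2: output weights
    errs = []
    for _ in range(iters):
        V = g(X @ W.T)                 # step 3: hidden values v_{j,i}
        out = g(V @ w)                 # output g_j = g(w^T v_j)
        delta = y - out                # step 4: δ_j
        errs.append(np.sum(delta ** 2))
        # step 5: ∆_{j,i} = w_i δ_j (1 - g_j) g_j
        Delta = np.outer(delta * out * (1 - out), w)
        # step 6: output weights; step 7: hidden weights
        w = w + 2 * lam * V.T @ (delta * out * (1 - out))
        W = W + 2 * lam * (Delta * V * (1 - V)).T @ X
    return W, w, errs

# XOR is not linearly separable, so a hidden layer is required
X = np.array([[1.0, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # bias + inputs
y = np.array([0.0, 1.0, 1.0, 0.0])
W, w, errs = train(X, y)
```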

SLIDE 39

Examples

Figure 1: Feedforward ANN designed and tested for prediction of tactical air combat maneuvers.

SLIDE 40

Deep learning

SLIDE 41

Historical background:

First generation neural networks

  • Perceptrons (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features.
    – There was a neat learning algorithm for adjusting the weights.
    – But perceptrons are fundamentally limited in what they can learn to do.

[Sketch of a typical perceptron from the 1960’s: input units (e.g. pixels) → non-adaptive hand-coded features → output units (e.g. class labels, such as Bomb vs. Toy)]

SLIDE 42

Second generation neural networks (~1985)

[Diagram: input vector → hidden layers → outputs. Compare outputs with the correct answer to get an error signal; back-propagate the error signal to get derivatives for learning.]

SLIDE 43

What is wrong with back-propagation?

  • It requires labeled training data.
    – Almost all data is unlabeled.
  • The learning time does not scale well.
    – It is very slow in networks with multiple hidden layers.
  • It can get stuck in poor local optima.
SLIDE 44

Overcoming the limitations of back-propagation

  • Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
    – Iteratively learn the different layers.
    – Adjust the weights to maximize the probability that a generative model would have produced the sensory input.
    – Learn p(image), not p(label | image), for the lower layers.

SLIDE 45

Iterative learning of layers

[Diagram: input layer → hidden layer → reconstruction]

SLIDE 46

Iterative learning of layers

[Diagram: input layer → hidden layer → reconstruction → second hidden layer]

SLIDE 47

The final 50 x 256 weights

Each neuron grabs a different feature.

SLIDE 48

How well can we reconstruct the digit images from the binary feature activations?

[Figure: data and reconstructions from activated binary features, shown for new test images from the digit class that the model was trained on, and for images from an unfamiliar digit class (the network tries to see every image as a 2)]

SLIDE 49

Training a deep network

(the main reason RBMs are interesting)

  • First train a layer of features that receive input directly from the pixels.
  • Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
  • It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
    – The proof is slightly complicated.
    – But it is based on a neat equivalence between an RBM and a deep directed model (described later).

SLIDE 50

Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.

SLIDE 51

Features learned

SLIDE 52

Features learned

SLIDE 53

What you should know

  • Linear regression
  • Solving a linear regression problem
  • Gradient descent
  • Perceptrons
  • Sigmoid functions for classification
  • Multilayered neural networks
  • Backpropagation
SLIDE 54

Deriving g’(x)

  • Recall that g(x) is the sigmoid function, so g(x) = 1 / (1 + e^(−x))
  • The derivation of g′(x) is below:

g′(x) = e^(−x) / (1 + e^(−x))² = g(x)(1 − g(x))

SLIDE 55

The Energy of a joint configuration

E(v,h) = −∑i,j vi hj wij

where vi is the binary state of visible unit i, hj is the binary state of hidden unit j, wij is the weight between units i and j, and E(v,h) is the energy with configuration v on the visible units and h on the hidden units. Note that

−∂E(v,h)/∂wij = vi hj
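A tiny sketch of the energy E(v,h) = −∑ vi hj wij for binary state vectors (the example sizes and weight values are illustrative assumptions):

```python
import numpy as np

def energy(v, h, W):
    """E(v,h) = -sum_{i,j} v_i h_j w_ij; note -dE/dw_ij = v_i h_j."""
    return -float(v @ W @ h)

v = np.array([1, 0, 1])          # binary states of the visible units
h = np.array([1, 1])             # binary states of the hidden units
W = np.array([[0.5, -0.2],
              [0.3,  0.1],
              [-0.4, 0.6]])      # w_ij couples visible unit i to hidden unit j
# only terms with v_i = h_j = 1 contribute:
# -(0.5 - 0.2 - 0.4 + 0.6) = -0.5
E = energy(v, h, W)
```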

SLIDE 56

Using energies to define probabilities

  • The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
  • The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.

p(v,h) = e^(−E(v,h)) / ∑u,g e^(−E(u,g))

p(v) = ∑h e^(−E(v,h)) / ∑u,g e^(−E(u,g))

(the denominator, ∑u,g e^(−E(u,g)), is the partition function)

SLIDE 57

A picture of the maximum likelihood learning algorithm for an RBM

∂ log p(v)/∂wij = ⟨vi hj⟩⁰ − ⟨vi hj⟩^∞

[Diagram: alternating Gibbs sampling over units i and j at t = 0, t = 1, t = 2, …, t = ∞; ⟨vi hj⟩ is measured at t = 0 (the data) and at t = ∞ (a “fantasy”)]

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

SLIDE 58

A quick way to learn an RBM

> <

j ih

v

1

> <

j ih

v

i j i j t = 0 t = 1

) (

1

> < − > < = ∆

j i j i ij

h v h v w ε

Start with a training vector on the visible units. Update all the hidden units in parallel Update the all the visible units in parallel to get a “reconstruction”. Update the hidden units again.

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another

  • bjective function (Carreira-Perpinan & Hinton, 2005).

reconstruction data
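One such quick (CD-1 style) weight update can be sketched as follows (a hedged sketch: the RBM size, ε, and the sampling details are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def cd1_update(v0, W, eps=0.1):
    """One contrastive-divergence step: Δw_ij = ε(<v_i h_j>^0 - <v_i h_j>^1)."""
    # up: sample binary hidden units given the data vector
    h0 = (rng.random(W.shape[1]) < g(v0 @ W)).astype(float)
    # down: sample a "reconstruction" of the visible units
    v1 = (rng.random(W.shape[0]) < g(W @ h0)).astype(float)
    # up again: hidden probabilities driven by the reconstruction
    h1_prob = g(v1 @ W)
    return eps * (np.outer(v0, h0) - np.outer(v1, h1_prob))

W = rng.normal(scale=0.1, size=(4, 3))    # 4 visible, 3 hidden units
v0 = np.array([1.0, 0.0, 1.0, 0.0])       # one training vector
W = W + cd1_update(v0, W)
```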

SLIDE 59

What you should know

  • Linear regression
  • Solving a linear regression problem
  • Gradient descent
  • Perceptrons
  • Sigmoid functions for classification
  • Multilayered neural networks
  • Backpropagation