
CSE/NB 528 Lecture 13: From Unsupervised to Reinforcement Learning (Chapters 8-10)

R. Rao

Today’s Agenda: All about Learning

• Unsupervised Learning: Sparse Coding, Predictive Coding
• Supervised Learning: Perceptrons and Backpropagation
• Reinforcement Learning: TD and Actor-Critic Learning


Recall from Last Time: Linear Generative Model

Generative model: causes v → data u.

Suppose the input u was generated by a linear superposition of causes v1, v2, …, vk with basis vectors (or "features") g_i, plus noise n:

u = Σ_i g_i v_i + n = G v + n

(Assume the noise n is Gaussian white noise with mean zero.)
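A minimal NumPy sketch of sampling from this generative model (the patch size, number of causes, prior, and noise level are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 64, 16                       # assumed sizes: 64-pixel patch, 16 causes
G = rng.standard_normal((m, k))     # columns of G are the basis vectors g_i

v = rng.laplace(scale=1.0, size=k)  # sparse-ish causes (Laplacian draw)
n = 0.1 * rng.standard_normal(m)    # Gaussian white noise, mean zero
u = G @ v + n                       # u = G v + n
```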


Bayesian approach

• Find v and G that maximize the posterior:

p[v | u; G] = p[u | v; G] p[v; G] / p[u; G]

• Equivalently, find v and G that maximize the log posterior:

F(v, G) = log p[u | v; G] + log p[v; G] + k

where k is a term that does not depend on v or G.

• Since u = G v + n with Gaussian white noise n, the likelihood is Gaussian, and its log is

log N(u; G v, I) = -(1/2) (u - G v)^T (u - G v) + C

• If the v_a are independent, p[v; G] = Π_a p[v_a; G], so log p[v; G] = Σ_a log p[v_a; G].

Prior for individual causes (what should this be?)


What do we know about the causes v?

• Idea: the causes are independent, and only a few of them are active for any given input: v_a will be 0 most of the time but high for a few inputs.
• This suggests a sparse distribution for p[v_a; G]: peaked at 0 but with a heavy tail (also called a super-Gaussian distribution).


Examples of Prior Distributions for Causes

Write the prior for each cause as p[v_a; G] = c exp(g(v_a)), so that log p[v_a; G] = g(v_a) + log c.

Possible sparse log priors g(v):

g(v) = -|v|              (Laplacian, i.e., double-exponential prior)
g(v) = -log(1 + v²)      (Cauchy prior)

Both are peaked at 0 with heavy tails (sparse).
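As a quick sketch, both log priors are one-liners (the function names are mine):

```python
import numpy as np

def g_laplacian(v):
    # g(v) = -|v|: sharply peaked at 0, heavy tails
    return -np.abs(v)

def g_cauchy(v):
    # g(v) = -log(1 + v^2): even heavier tails
    return -np.log1p(v ** 2)
```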


Finding the optimal v and G

• Want to maximize:

F(v, G) = log p[u | v; G] + log p[v; G] = -(1/2) (u - G v)^T (u - G v) + Σ_a g(v_a) + K

• Alternate between two steps:
1. Maximize F with respect to v, keeping G fixed. How?
2. Maximize F with respect to G, given the v from step 1. How?
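A sketch of F as a function, dropping the constant K and using the hypothetical g_laplacian helper from above:

```python
def F(v, G, u, g=g_laplacian):
    # log-likelihood term + log-prior term
    err = u - G @ v
    return -0.5 * err @ err + g(v).sum()
```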


Estimating the causes v for a given input

Gradient ascent on F with respect to v gives firing-rate dynamics for a recurrent network:

dv/dt ∝ dF/dv = G^T (u - G v) + g'(v)

The first term, G^T (u - G v), is the error between the input u and its reconstruction (prediction) G v; the second term, the derivative of g, acts as a sparseness constraint.
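A sketch of this inference step as discretized gradient ascent on v, assuming the Laplacian prior so that g'(v) = -sign(v); the step size and iteration count are arbitrary choices:

```python
import numpy as np

def estimate_v(u, G, dt=0.01, steps=500):
    """Run the recurrent dynamics dv/dt = G^T (u - G v) + g'(v) to a fixed point."""
    v = np.zeros(G.shape[1])
    for _ in range(steps):
        error = u - G @ v                       # prediction error
        v += dt * (G.T @ error - np.sign(v))    # error term + sparseness term
    return v
```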


Sparse Coding Network for Estimating v

The network computes the prediction G v of the input, the prediction error (u - G v), and from it a corrected estimate of the causes:

dv/dt = G^T (u - G v) + g'(v)

[Suggests a role for feedback pathways in the cortex (Rao & Ballard, 1999)]


Learning the Synaptic Weights G

Gradient ascent on F with respect to G gives the learning rule:

dG/dt ∝ dF/dG = (u - G v) v^T

Hebbian! (similar to Oja's rule): the weight change is the outer product of the prediction error (u - G v) and the activity v.
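A sketch of the corresponding learning loop, alternating inference (the estimate_v sketch above) with the Hebbian update; the learning rate and the column renormalization (a common stabilization to keep basis vectors from growing or collapsing) are assumptions:

```python
def learn_G(patches, k=16, eta=0.01, epochs=10):
    """patches: array of shape (num_patches, m)."""
    rng = np.random.default_rng(0)
    G = 0.1 * rng.standard_normal((patches.shape[1], k))
    for _ in range(epochs):
        for u in patches:
            v = estimate_v(u, G)
            G += eta * np.outer(u - G @ v, v)              # dG/dt ∝ (u - G v) v^T
        G /= np.linalg.norm(G, axis=0, keepdims=True)      # renormalize columns (assumed)
    return G
```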


Result: Learning G for Natural Images

Each square is a column g_i of G (obtained by collapsing the rows of the square into a vector). Any image patch u can be expressed as:

u = G v = Σ_i g_i v_i

Almost all of the g_i represent local edge features.

(Olshausen & Field, 1996)


Sparse Coding Network is a special case of Predictive Coding Networks

(Rao, Vision Research, 1999)


Predictive Coding Model of Visual Cortex

(Rao & Ballard, Nature Neurosci., 1999)


Predictive coding model explains contextual effects

[Figure: responses in monkey primary visual cortex (Zipser et al., J. Neurosci., 1996) compared with the model]

Increased activity for non-homogeneous input is interpreted as prediction error (i.e., anomalous input): the center is not predicted by the surrounding context.


Natural Images as a Source of Contextual Effects

(Rao & Ballard, Nature Neurosci., 1999)

Center is predictable from surround.


What if your data comes with not just inputs but also outputs?

Enter…Supervised Learning


Supervised Learning

Two primary tasks:

1. Classification
   • Inputs u1, u2, … and discrete classes C1, C2, …, Ck
   • Training examples: (u1, C2), (u2, C7), etc.
   • Learn the mapping from an arbitrary input to its class
   • Example: inputs = images; output classes = face, not a face

2. Regression
   • Inputs u1, u2, … and continuous outputs v1, v2, …
   • Training examples: (input, desired output) pairs
   • Learn to map an arbitrary input to its corresponding output
   • Example: highway driving; input = road image, output = steering angle


The Classification Problem

Data: faces (output +1) and other objects (output -1).
Idea: find a separating hyperplane (a line in this 2D case).


Neurons as Classifiers: The “Perceptron”

• Artificial neuron: m binary inputs u_j (-1 or +1), synaptic weights w_ij, a threshold θ_i, and one binary output v_i (-1 or +1).
• The output is a thresholded weighted sum of the inputs:

v_i = Θ(Σ_j w_ij u_j - θ_i)

where Θ(x) = +1 if x ≥ 0 and -1 if x < 0.
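A sketch of this unit in NumPy (Θ is implemented with a ≥ 0 comparison):

```python
import numpy as np

def perceptron(u, w, theta):
    """v = Theta(sum_j w_j u_j - theta), with Theta(x) = +1 if x >= 0 else -1."""
    return 1 if w @ u - theta >= 0 else -1
```

With w = (1, 1) and θ = 1.5 this computes the AND function of two ±1 inputs, as in the slide below.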


What does a Perceptron compute?

• Consider a single-layer perceptron: the weighted sum Σ_j w_ij u_j = θ_i defines a linear hyperplane (a line in 2D, a plane in 3D, …).
• Everything on one side of the hyperplane yields output +1 (class 1); everything on the other side yields output -1 (class 2).
• Any function that is linearly separable can be computed by a perceptron.


Linear Separability

• Example: the AND function is linearly separable: a AND b = 1 if and only if a = 1 and b = 1.
• Perceptron for AND: inputs u1, u2 with weights 1, 1 and threshold θ = 1.5.
• The linear hyperplane u1 + u2 = 1.5 separates (1,1), which gives output +1, from the other three inputs, which give output -1.


What about the XOR function?

Truth table for XOR:

u1   u2   XOR
+1   +1   -1
+1   -1   +1
-1   +1   +1
-1   -1   -1

Can a straight line separate the +1 outputs from the -1 outputs? (Plotting the four points shows that it cannot: XOR is not linearly separable.)

Multilayer Perceptrons

• Removes the limitations of single-layer networks: can solve XOR.
• An example of a two-layer perceptron that computes XOR (inputs x and y can be +1 or -1): a hidden unit with weights -1, -1 and threshold θ = -1 feeds the output unit with weight 2; the inputs x and y also connect directly to the output unit with weights 1, 1, and the output unit has threshold θ = 1.5.
• Output is +1 if and only if x + y + 2 Θ(-x - y + 1) > 1.5.
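A sketch checking the two-layer network on all four inputs, reusing the perceptron helper from earlier (weights and thresholds as reconstructed on this slide):

```python
import numpy as np

def xor_net(x, y):
    h = perceptron(np.array([x, y]), np.array([-1.0, -1.0]), -1.0)          # hidden unit
    return perceptron(np.array([x, y, h]), np.array([1.0, 1.0, 2.0]), 1.5)  # output unit

for x in (+1, -1):
    for y in (+1, -1):
        print(x, y, xor_net(x, y))   # prints +1 exactly when x != y
```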


What if you want to approximate a continuous function (i.e., regression)?

Can a network learn to drive?


Example Network

The network maps the current image to a steering angle:

Input u = [u1 u2 … u960] = image pixels
Desired output: d = [d1 d2 … d30], encoding the steering angle


Sigmoid Networks

Output of a unit with input nodes u = (u1 u2 u3)^T and weights w:

v = g(w^T u) = g(Σ_i w_i u_i)

Sigmoid output function:

g(a) = 1 / (1 + exp(-β a))

The sigmoid is a non-linear "squashing" function: it squashes its input to lie between 0 and 1. The parameter β controls the slope.
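A sketch of a single sigmoid unit (β defaults to 1):

```python
import numpy as np

def sigmoid(a, beta=1.0):
    # g(a) = 1 / (1 + exp(-beta * a)): squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-beta * a))

def sigmoid_unit(u, w, beta=1.0):
    # v = g(w^T u)
    return sigmoid(w @ u, beta)
```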


Multilayer Sigmoid Networks

Input u = (u1 u2 … uK)^T; output v = (v1 v2 … vJ)^T; desired output = d.

With one hidden layer, output unit i computes:

v_i = g(Σ_j W_ji g(Σ_k w_kj u_k))

How do we learn these weights?


Backpropagation Learning: Uppermost layer

Minimize the output error:

E(W, w) = (1/2) Σ_i (d_i - v_i)²,  with v_i = g(Σ_j W_ji x_j)

where x_j is the output of hidden unit j and u_k are the inputs.

Learning rule for hidden-to-output weights W {gradient descent}:

W_ji → W_ji - ε dE/dW_ji = W_ji + ε (d_i - v_i) g'(Σ_j W_ji x_j) x_j    {delta rule}


Backpropagation: Inner layer (chain rule)

Minimize the same output error:

E(W, w) = (1/2) Σ_i (d_i - v_i)²,  with v_i = g(Σ_j W_ji x_j) and x_j = g(Σ_k w_kj u_k)

Learning rule for input-to-hidden weights w {gradient descent}:

w_kj → w_kj - ε dE/dw_kj

But:  dE/dw_kj = (dE/dx_j)(dx_j/dw_kj)    {chain rule}

dE/dx_j = -Σ_i (d_i - v_i) g'(Σ_m W_mi x_m) W_ji
dx_j/dw_kj = g'(Σ_m w_mj u_m) u_k

so:

w_kj → w_kj + ε Σ_i [ (d_i - v_i) g'(Σ_m W_mi x_m) W_ji ] g'(Σ_m w_mj u_m) u_k
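A sketch of one backpropagation step for this two-layer network, following the slide's conventions (x_j = g(Σ_k w_kj u_k), v_i = g(Σ_j W_ji x_j), with β = 1 so that g'(a) = g(a)(1 - g(a))); shapes and the learning rate are assumptions, and sigmoid is the helper sketched above:

```python
import numpy as np

def backprop_step(u, d, w, W, eps=0.1):
    """u: inputs (K,), d: desired outputs (I,), w: (K, J), W: (J, I)."""
    # Forward pass
    x = sigmoid(u @ w)          # hidden: x_j = g(sum_k w_kj u_k)
    v = sigmoid(x @ W)          # output: v_i = g(sum_j W_ji x_j)

    # Deltas (for this sigmoid, g'(a) = g(a) (1 - g(a)))
    delta_out = (d - v) * v * (1 - v)              # output-layer error
    delta_hidden = (W @ delta_out) * x * (1 - x)   # backpropagated via chain rule

    # Gradient-descent updates: the delta rule and its chain-rule extension
    W += eps * np.outer(x, delta_out)     # W_ji += eps (d_i - v_i) g'(.) x_j
    w += eps * np.outer(u, delta_hidden)  # w_kj += eps [sum_i ...] g'(.) u_k
    return 0.5 * np.sum((d - v) ** 2)     # current output error E
```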


Demos: Pole Balancing and Backing up a Truck (courtesy of Keith Grochow, CSE 599)

  • Neural network learns to balance a pole on a cart
  • System:
  • 4 state variables: xcart, vcart, θpole, vpole
  • 1 input: Force on cart
  • Backprop Network:
  • Input: State variables
  • Output: New force on cart
  • NN learns to back a truck into a loading dock
  • System (Nguyen and Widrow, 1989):
  • State variables: xcab, ycab, θcab
  • 1 input: new θsteering
  • Backprop Network:
  • Input: State variables
  • Output: Steering angle θsteering



Humans (and animals in general) don’t get exact supervisory signals (commands for muscles) for learning to talk, walk, ride a bicycle, play the piano, drive, etc.

We learn by trial and error (with hints from others) and might get "rewards and punishments" along the way.

Enter…Reinforcement Learning


The Reinforcement Learning “Agent”

The agent interacts with its environment: at each time step t, the agent observes state u_t, takes action a_t, and receives reward r_t.


Early Results: Pavlov and his Dog

• Classical (Pavlovian) conditioning experiments
• Training: Bell → Food
• After training: Bell → Salivate
• The conditioned stimulus (bell) predicts future reward (food)

(http://employees.csbsju.edu/tcreed/pb/pdoganim.html)


Predicting Delayed Rewards

• Reward is typically delivered at the end (when you know whether you succeeded or not)
• Time: 0 ≤ t ≤ T, with stimulus u(t) and reward r(t) at each time step t (note: r(t) can be zero at some time points)
• Key idea: make the output v(t) predict the total expected future reward starting from time t:

v(t) ≈ ⟨ Σ_{τ=0}^{T-t} r(t + τ) ⟩


Learning to Predict Delayed Rewards

• Use a set of modifiable weights w(τ) and predict based on all past stimuli u(t):

v(t) = Σ_{τ=0}^{t} w(τ) u(t - τ)

• Would like to find the weights (or filter) w(τ) that minimize:

⟨ [ Σ_{τ=0}^{T-t} r(t + τ) - v(t) ]² ⟩

• Can we minimize this using gradient descent and the delta rule? Yes, BUT… the future rewards are not yet available at time t.


Temporal Difference (TD) Learning

• Key idea: rewrite the squared error to get rid of the future terms:

⟨ [ Σ_{τ=0}^{T-t} r(t + τ) - v(t) ]² ⟩ = ⟨ [ r(t) + Σ_{τ=0}^{T-t-1} r(t + 1 + τ) - v(t) ]² ⟩ ≈ ⟨ [ r(t) + v(t+1) - v(t) ]² ⟩

where the (unavailable) expected future reward from time t+1 onward has been replaced by the network's own prediction v(t+1). Minimize this using gradient descent!

• Temporal Difference (TD) learning:

w(τ) → w(τ) + ε [ r(t) + v(t+1) - v(t) ] u(t - τ)

The term δ(t) = r(t) + v(t+1) - v(t) is the prediction error.
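A sketch of TD learning in this setting, with a stimulus at t = 100 and reward at t = 200 as in the next slide (array sizes, trial count, and learning rate are assumptions; for simplicity the updates are applied once per trial):

```python
import numpy as np

def td_trial(w, u, r, eps=0.1):
    T = len(u)
    # v(t) = sum_{tau=0..t} w(tau) u(t - tau)
    v = np.array([w[:t + 1] @ u[t::-1] for t in range(T)])
    v_next = np.append(v[1:], 0.0)        # treat v(T) as 0
    delta = r + v_next - v                # delta(t) = r(t) + v(t+1) - v(t)
    for t in range(T):
        w[:t + 1] += eps * delta[t] * u[t::-1]   # w(tau) += eps delta(t) u(t - tau)
    return delta

T = 250
u = np.zeros(T); u[100] = 1.0   # stimulus at t = 100
r = np.zeros(T); r[200] = 1.0   # reward at t = 200
w = np.zeros(T)
for _ in range(200):            # many trials
    delta = td_trial(w, u, r)
```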


Predicting Delayed Reward: TD Learning

Stimulus at t = 100 and reward at t = 200.

Prediction error δ for each time step (over many trials).

38

  • R. Rao, 528: Lecture 13

Possible Reward Prediction Error Signal in the Primate Brain

Dopaminergic cells in the Ventral Tegmental Area (VTA).

Before training: at the time of reward, δ(t) = [r(t) + v(t+1) - v(t)] > 0, a reward prediction error.

After training: v(t) = r(t) + v(t+1), so δ(t) = [r(t) + v(t+1) - v(t)] = 0 at the time of reward (no error); instead the response δ(t) = [v(t+1) - v(t)] appears at the time of the predictive stimulus.


More Evidence for Prediction Error Signals

Dopaminergic cells in VTA.

Negative error: when reward is predicted but not delivered, r(t) = 0 and v(t+1) = 0, so

δ(t) = [r(t) + v(t+1) - v(t)] = -v(t) < 0

and the cells' firing dips below baseline.


That’s great, but how does all that math help me get food in a maze?


Selecting Actions when Reward is Delayed

States: A, B, or C. Possible actions at any state: Left (L) or Right (R). If you randomly choose to go L or R (a random "policy"), what is the expected value v of each state?


Policy Evaluation

For the random policy:

v(B) = (1/2) · 0 + (1/2) · 5 = 2.5
v(C) = (1/2) · 0 + (1/2) · 2 = 1
v(A) = (1/2) v(B) + (1/2) v(C) = 1.75

Can learn the value of locations using TD learning. Let the value of location u be v(u) = weight w(u), and let (u, a) → u' denote that taking action a at location u leads to the new location u'. Then:

w(u) → w(u) + ε [ r_a(u) + v(u') - v(u) ]
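A sketch of TD policy evaluation on this maze under the random policy. The reward layout (0 or 5 from B, 2 or 0 from C) is taken from the values above; which action yields which reward is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# (location, action) -> (next location or None, reward); layout assumed
maze = {('A', 'L'): ('B', 0.0), ('A', 'R'): ('C', 0.0),
        ('B', 'L'): (None, 0.0), ('B', 'R'): (None, 5.0),
        ('C', 'L'): (None, 2.0), ('C', 'R'): (None, 0.0)}

w = {'A': 0.0, 'B': 0.0, 'C': 0.0}   # value of each location: v(u) = w(u)
eps = 0.5                            # learning rate from the next slide

for _ in range(1000):                # trials, each starting at A
    u = 'A'
    while u is not None:
        a = rng.choice(['L', 'R'])                 # random policy
        u_next, r = maze[(u, a)]
        v_next = w[u_next] if u_next is not None else 0.0
        w[u] += eps * (r + v_next - w[u])          # TD rule
        u = u_next

print(w)   # hovers around v(A) = 1.75, v(B) = 2.5, v(C) = 1 (smaller eps gives tighter estimates)
```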


Maze Value Learning for Random Policy

TD learning converges to v(A) = 1.75, v(B) = 2.5, v(C) = 1 (for all three, ε = 0.5). Once I know the values, I can pick the action that leads to the higher-valued state!


Selecting Actions based on Values

Values act as surrogate immediate rewards → a locally optimal choice leads to a globally optimal policy (for "Markov" environments). Related to dynamic programming in computer science (see the appendix in the text).


Actor-Critic Learning

Two separate components: the actor (maintains the policy) and the critic (maintains the value of each state).

1. Critic learning ("policy evaluation"), with value of state u given by v(u) = w(u) (same as the TD rule):

w(u) → w(u) + ε [ r_a(u) + v(u') - v(u) ]

2. Actor learning ("policy improvement"). After taking action a at state u, update the action values Q for all a':

Q_{a'}(u) → Q_{a'}(u) + ε [ r_a(u) + v(u') - v(u) ] (δ_{a a'} - P[a'; u])

and use the softmax policy to select an action a at state u:

P[a; u] = exp(β Q_a(u)) / Σ_b exp(β Q_b(u))

3. Interleave 1 and 2.
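A sketch of the full actor-critic loop on the same maze, reusing the hypothetical maze dict from the policy-evaluation sketch above; ε, β, and the trial count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
actions = ['L', 'R']

Q = {u: np.zeros(2) for u in 'ABC'}   # actor: action values Q_a(u)
w = {u: 0.0 for u in 'ABC'}           # critic: v(u) = w(u)
eps, beta = 0.1, 1.0

for _ in range(5000):
    u = 'A'
    while u is not None:
        # Softmax policy: P[a; u] = exp(beta Q_a(u)) / sum_b exp(beta Q_b(u))
        p = np.exp(beta * Q[u]); p /= p.sum()
        i = rng.choice(2, p=p)
        u_next, r = maze[(u, actions[i])]
        v_next = w[u_next] if u_next is not None else 0.0
        delta = r + v_next - w[u]           # TD error
        w[u] += eps * delta                 # critic: policy evaluation
        Q[u] += eps * delta * (np.eye(2)[i] - p)   # actor: policy improvement
        u = u_next
```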


Actor-Critic Learning in the Maze Task

Probability of going Left at each location (learned by the actor).


Demo of Reinforcement Learning in a Robot

(from http://sysplan.nams.kyushu-u.ac.jp/gen/papers/JavaDemoML97/robodemo.html )


Things to do:

• Finish homework 3
• Work on group project

Thanks, dopamine!