
CSE/NEURO 528 Lecture 13: Reinforcement Learning & Course Review

(Chapter 9)

Animation: Tom Creed, SJU

Early Results: Pavlov and his Dog

• Classical (Pavlovian) conditioning experiments
• Training: Bell → Food
• After: Bell → Salivate
• Conditioned stimulus (bell) predicts future reward (food)

Image: Wikimedia Commons; Animation: Tom Creed, SJU


Predicting Delayed Rewards

• How do we predict rewards delivered some time after a stimulus is presented?
• Given: many trials, each of length T time steps
• Time within a trial: 0 ≤ t ≤ T, with stimulus u(t) and reward r(t) at each time step t (note: r(t) can be zero for some t)
• We would like a neuron whose output v(t) predicts the expected total future reward starting from time t:

$v(t) = \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$

where $\langle \cdot \rangle$ denotes an average over trials.


Learning to Predict Future Rewards

• Use a set of synaptic weights w(τ) and predict based on all past stimuli u(t):

$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$  (a linear filter!)

• Learn weights w(τ) that minimize the error:

$\left\langle \left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 \right\rangle$

(Can we minimize this using gradient descent and the delta rule?) Yes, BUT future rewards are not yet available!
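In code, the linear-filter prediction above is just a causal convolution of the stimulus history with the weights. A minimal sketch (illustrative, not from the slides):

```python
import numpy as np

# v(t) = sum_tau w(tau) u(t - tau): a causal linear filter of the stimulus.
T = 250
u = np.zeros(T); u[100] = 1.0        # stimulus at t = 100
w = np.zeros(T)                      # filter weights w(tau), to be learned

v = np.convolve(u, w)[:T]            # truncate to the trial length
```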


Temporal Difference (TD) Learning

• Key idea: rewrite the error function to get rid of the future terms:

$\left\langle \left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 \right\rangle = \left\langle \left( r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t) \right)^2 \right\rangle \approx \left\langle \left( r(t) + v(t+1) - v(t) \right)^2 \right\rangle$

where the expected future reward from t+1 has been replaced by the prediction v(t+1). Minimize this using gradient descent!

• Temporal Difference (TD) learning:

$w(\tau) \rightarrow w(\tau) + \epsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$
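A minimal numpy sketch of this TD rule (the step size, trial count, and stimulus/reward timing are illustrative assumptions):

```python
import numpy as np

T, n_trials, eps = 250, 500, 0.1
w = np.zeros(T)                               # weights w(tau)
u = np.zeros(T); u[100] = 1.0                 # stimulus at t = 100
r = np.zeros(T); r[200] = 1.0                 # reward at t = 200

for trial in range(n_trials):
    # v(t) = sum_tau w(tau) u(t - tau)
    v = np.array([w[: t + 1] @ u[t::-1] for t in range(T)])
    for t in range(T - 1):
        delta = r[t] + v[t + 1] - v[t]        # TD error
        w[: t + 1] += eps * delta * u[t::-1]  # w(tau) += eps * delta * u(t - tau)

# Over trials, the TD error delta moves from the reward time to the stimulus time.
```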


Predicting Future Rewards: TD Learning

[Plot: stimulus at t = 100 and reward at t = 200; prediction error δ for each time step (over many trials)]

Image Source: Dayan & Abbott textbook


Possible Reward Prediction Error Signal in the Primate Brain

Dopaminergic cells in the Ventral Tegmental Area (VTA):

• Before training: the response at the time of reward is the reward prediction error $\delta = r(t) + v(t+1) - v(t)$.
• After training: no error at the time of reward, since $v(t) = r(t) + v(t+1)$; the error $[\,v(t+1) - v(t)\,]$ appears instead at the time of the conditioned stimulus.

Image Source: Dayan & Abbott textbook


More Evidence for Prediction Error Signals

Dopaminergic cells in VTA after training show a negative error when a reward is expected but not delivered: with $r(t) = 0$ and $v(t+1) = 0$,

$\delta = r(t) + v(t+1) - v(t) = -v(t) < 0$

(a dip in firing at the expected reward time).

Image Source: Dayan & Abbott textbook


Reinforcement Learning: Acting to Maximize Rewards

[Diagram: Agent–Environment loop: the agent receives state $u_t$ and reward $r_t$ from the environment and emits action $a_t$]

The Problem

[Diagram: Agent–Environment loop, as above]

Learn a state-to-action mapping or "policy" $\pi(u) = a$ which maximizes the expected total future reward:

$\left\langle \sum_{t=0}^{T} r(t) \right\rangle$


Example: Rat in a barn

• States = locations A, B, or C
• Actions = L (go left) or R (go right)
• If the rat chooses L or R at random (a random "policy"), what is the expected reward (or "value") v for each state?

Image Source: Dayan & Abbott textbook


Policy Evaluation

For the random policy:

$v(B) = \tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \cdot 5 = 2.5$
$v(C) = \tfrac{1}{2} \cdot 2 + \tfrac{1}{2} \cdot 0 = 1$
$v(A) = \tfrac{1}{2} v(B) + \tfrac{1}{2} v(C) = 1.75$

Can learn the values of states using TD learning:

$w(u) \rightarrow w(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]$

where the value of state u is v(u) = w(u), and (location, action) → new location, i.e. (u, a) → u′.
slide-7
SLIDE 7

13

TD Learning of Values for Random Policy

[Plots: learned values converge to v(A) = 1.75, v(B) = 2.5, v(C) = 1; for all three, ε = 0.5]

Once I know the values, I can pick the action that leads to the higher-valued state!

Image Source: Dayan & Abbott textbook


Selecting Actions based on Values

[Diagram: from A, going left leads to B (value 2.5), going right leads to C (value 1)]

• Values act as surrogate immediate rewards → locally optimal choice leads to a globally optimal policy for "Markov" environments (related to Dynamic Programming)


Putting it all together: Actor-Critic Learning

• Two separate components: Actor (selects action and maintains policy) and Critic (maintains value of each state)

1. Critic learning ("Policy Evaluation"), with value of state u given by v(u) = w(u) (same as the TD rule):

$w(u) \rightarrow w(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]$

2. Actor learning ("Policy Improvement"): probabilistically select an action a at state u according to the softmax

$P(a; u) = \frac{\exp(\beta Q_a(u))}{\sum_b \exp(\beta Q_b(u))}$

then, for all actions a′, update

$Q_{a'}(u) \rightarrow Q_{a'}(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]\,(\delta_{a a'} - P(a'; u))$

3. Repeat 1 and 2
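Below is an illustrative actor-critic sketch for the barn example (β, ε, and the trial count are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.2, 1.0
w = {s: 0.0 for s in 'ABC'}                   # critic: v(u) = w(u)
Q = {s: np.zeros(2) for s in 'ABC'}           # actor: Q_a(u), a in {L=0, R=1}

step = {('A', 0): ('B', 0), ('A', 1): ('C', 0),
        ('B', 0): (None, 0), ('B', 1): (None, 5),
        ('C', 0): (None, 2), ('C', 1): (None, 0)}

def policy(u):                                # softmax P(a; u)
    p = np.exp(beta * Q[u])
    return p / p.sum()

for trial in range(5000):
    u = 'A'
    while u is not None:
        P = policy(u)
        a = rng.choice(2, p=P)
        u_next, r = step[(u, a)]
        delta = r + (w[u_next] if u_next else 0.0) - w[u]
        w[u] += eps * delta                        # critic (TD rule)
        Q[u] += eps * delta * (np.eye(2)[a] - P)   # actor: (delta_{aa'} - P(a'; u))
        u = u_next

print(policy('A'))   # P(L at A) rises, since L leads to the higher-valued state B
```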


Actor-Critic Learning in our Barn Example

[Plots: probability of going left at each location over trials]

Image Source: Dayan & Abbott textbook


Possible Implementation of the Actor-Critic Model in the Basal Ganglia

[Diagram: basal ganglia circuit (Cortex → Striatum → GPi/SNr → Thalamus, with STN and GPe loops, and dopamine (DA) from SNc) mapped onto the actor-critic model: state estimate, hidden layer, TD error, action, value, Actor, Critic]

(See Supplementary Materials for references)


Reinforcement learning has been applied to many real-world problems!

Examples: Google's AlphaGo beat the human champion at Go; autonomous helicopter flight (learned from human demonstrations).

(Videos and papers at: http://heli.stanford.edu/)


Course Summary

  • Where have we been?
  • Course Highlights
  • Where do we go from here?
  • Challenges and Open Problems
  • Further Reading


What is the neural code?

• What is the nature of the code? Representing the spiking output: single cells vs. populations; rates vs. spike times vs. intervals
• What features of the stimulus does the neural system represent?


Encoding and decoding neural information

• Encoding: building functional models of neurons/neural systems and predicting the spiking output given the stimulus
• Decoding: what can we say about the stimulus given what we observe from the neuron or neural population?


Information maximization as a design principle of the nervous system


Biophysical Models of Neurons

• Voltage-dependent
• Transmitter-dependent (synaptic)
• Ca²⁺-dependent


• The neural equivalent circuit

Ohm's law and Kirchhoff's law: the capacitive current balances the ionic currents and the externally applied current,

$c_m \frac{dV}{dt} = -i_m + \frac{I_e}{A}$

(capacitive current; ionic membrane currents $i_m$; externally applied current $I_e$)


Simplified models: integrate-and-fire

$\tau_m \frac{dV}{dt} = (E_L - V) + R_m I_e$

If $V > V_{\text{threshold}}$ → Spike; then reset: $V = V_{\text{reset}}$
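A minimal Euler-integration sketch of this model (the parameter values are illustrative assumptions):

```python
import numpy as np

tau_m, E_L, R_m = 10.0, -65.0, 10.0       # ms, mV, MOhm
V_th, V_reset, I_e = -50.0, -65.0, 2.0    # mV, mV, nA
dt, T = 0.1, 100.0                        # ms

V, t, spikes = E_L, 0.0, []
while t < T:
    V += dt * (E_L - V + R_m * I_e) / tau_m   # tau_m dV/dt = (E_L - V) + R_m I_e
    if V > V_th:                              # threshold crossing -> spike
        spikes.append(t)
        V = V_reset                           # reset
    t += dt

print(len(spikes), "spikes in", T, "ms")
```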


Modeling Networks of Neurons

$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M\mathbf{v})$

(output: $\mathbf{v}$; decay: $-\mathbf{v}$; input: $W\mathbf{u}$; feedback: $M\mathbf{v}$)
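A sketch of simulating these firing-rate dynamics with Euler steps (the network sizes, random weights, and rectifying choice of F are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N = 5, 10
tau, dt = 10.0, 0.1                           # ms

W = 0.5 * rng.standard_normal((N, N_in))      # feedforward input weights
M = 0.1 * rng.standard_normal((N, N))         # recurrent feedback weights
F = lambda x: np.maximum(x, 0.0)              # rectifying activation function

u = rng.random(N_in)                          # static input
v = np.zeros(N)
for _ in range(5000):
    v += dt * (-v + F(W @ u + M @ v)) / tau   # tau dv/dt = -v + F(Wu + Mv)

print(v)                                      # approximate steady-state rates
```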


Unsupervised Learning

• For a linear neuron: $v = \mathbf{w}^T \mathbf{u} = \mathbf{u}^T \mathbf{w}$
• Basic Hebb rule: $\tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u}$
• Average effect over many inputs: $\tau_w \frac{d\mathbf{w}}{dt} = Q\,\mathbf{w}$
• Q is the input correlation matrix: $Q = \langle \mathbf{u}\,\mathbf{u}^T \rangle$

The Hebb rule performs principal component analysis (PCA).
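A sketch of the averaged rule driving w toward the principal eigenvector of Q (the input statistics are illustrative; the explicit normalization, Oja-style, is an added assumption to keep w bounded, since the plain Hebb rule grows without bound):

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.5],
              [1.5, 1.0]])                    # input correlations
U = rng.multivariate_normal([0, 0], C, size=5000)
Q = U.T @ U / len(U)                          # Q = <u u^T>

w = rng.standard_normal(2)
for _ in range(1000):
    w += 0.01 * (Q @ w)                       # averaged Hebb rule (dt/tau_w = 0.01)
    w /= np.linalg.norm(w)                    # normalization to keep w bounded

top_eigvec = np.linalg.eigh(Q)[1][:, -1]
print(w, top_eigvec)                          # aligned up to sign
```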


The Connection to Statistics

Causes v → Data u via the generative model (data likelihood) $p[\mathbf{u} \mid \mathbf{v}; G]$; Data u → Causes v via the recognition model (posterior) $p[\mathbf{v} \mid \mathbf{u}; G]$

• Unsupervised learning = learning the hidden causes of input data, with parameters G
• Examples: causes of clustered data; "causes" of natural images
• Use the EM algorithm for learning


Generative Models

[Diagram, tongue-in-cheek: hidden causes "Droning lecture" and "Mathematical derivations" generate the observed data "Lack of sleep"]


Supervised Learning

Backpropagation for Multilayered Networks

• Network: hidden unit activities $x_j^m = g\left(\sum_k w_{jk} u_k^m\right)$ and outputs $v_i^m = g\left(\sum_j W_{ij}\, g\left(\sum_k w_{jk} u_k^m\right)\right)$ for input pattern $\mathbf{u}^m$

• Goal: find W and w that minimize the errors relative to the desired outputs $d_i^m$:

$E(W, w) = \frac{1}{2} \sum_{m,i} \left( d_i^m - v_i^m \right)^2$

• Gradient descent learning rules:

$W_{ij} \rightarrow W_{ij} - \epsilon \frac{\partial E}{\partial W_{ij}}$  (delta rule)
$w_{jk} \rightarrow w_{jk} - \epsilon \frac{\partial E}{\partial w_{jk}} = w_{jk} - \epsilon \frac{\partial E}{\partial x_j^m} \frac{\partial x_j^m}{\partial w_{jk}}$  (chain rule)
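A minimal two-layer backprop sketch matching these equations, with g = tanh (the toy XOR dataset, layer sizes, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
U = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # inputs u^m
D = np.array([[0], [1], [1], [0]], float)               # desired outputs d^m

w = rng.standard_normal((2, 4))       # input -> hidden weights w_jk
W = rng.standard_normal((4, 1))       # hidden -> output weights W_ij
eps = 0.5

for epoch in range(5000):
    X = np.tanh(U @ w)                # x_j = g(sum_k w_jk u_k)
    V = np.tanh(X @ W)                # v_i = g(sum_j W_ij x_j)
    dV = (D - V) * (1 - V**2)         # output delta (g' = 1 - tanh^2)
    dX = (dV @ W.T) * (1 - X**2)      # chain rule back to the hidden layer
    W += eps * X.T @ dV / len(U)      # W_ij -= eps dE/dW_ij
    w += eps * U.T @ dX / len(U)      # w_jk -= eps dE/dw_jk

print(np.round(np.tanh(np.tanh(U @ w) @ W).ravel(), 2))  # approaches 0, 1, 1, 0
```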


Reinforcement Learning

• Learning to predict rewards (delta rule): $\mathbf{w} \rightarrow \mathbf{w} + \epsilon (r - v)\,\mathbf{u}$
• Learning to predict delayed rewards (TD learning): $w(\tau) \rightarrow w(\tau) + \epsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$
• Actor-Critic learning:
  • Critic learns the value of each state using TD learning
  • Actor learns the best actions based on the value of the next state (using the TD error)

(Pavlov animation: http://employees.csbsju.edu/tcreed/pb/pdoganim.html)
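For completeness, a sketch of the first rule above, the delta (Rescorla-Wagner) rule for predicting an immediate reward (the stimulus/reward statistics are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, w = 0.1, np.zeros(2)

for trial in range(500):
    u = rng.integers(0, 2, size=2).astype(float)   # binary stimuli
    r = 1.0 * u[0] + 0.5 * u[1]                    # reward depends on the stimuli
    v = w @ u                                      # prediction v = w . u
    w += eps * (r - v) * u                         # delta rule: w += eps (r - v) u

print(w)   # approaches [1.0, 0.5]
```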


The Future: Challenges and Open Problems

• How do neurons encode information?
  • Topics: synchrony, spike-timing-based learning, dynamic synapses
• Does a neuron's structure confer computational advantages?
  • Topics: role of channel dynamics, dendrites, plasticity in channels and their density
• How do networks implement computational principles such as efficient coding and Bayesian inference?
• How do networks learn "optimal" representations of their environment and engage in purposeful behavior?
  • Topics: unsupervised/reinforcement/imitation learning

Further Reading (for Spring and beyond)

• Spikes: Exploring the Neural Code, F. Rieke et al., MIT Press, 1997
• The Biophysics of Computation, C. Koch, Oxford University Press, 1999
• Large-Scale Neuronal Theories of the Brain, C. Koch and J. L. Davis, MIT Press, 1994
• Probabilistic Models of the Brain, R. Rao et al., MIT Press, 2002
• Bayesian Brain, K. Doya et al., MIT Press, 2007
• Reinforcement Learning: An Introduction, R. Sutton and A. Barto, MIT Press, 1998


Next two classes: Project presentations!

• Keep your presentation short: ~7-8 slides, 8 + 3 mins per group (with questions)
  • Introduction, Background, Methods, Results, Conclusion
• Slides:
  • Bring your slides on a USB stick to use the class laptop (Windows machine), OR
  • Bring your own laptop (especially if you have videos, etc.)
• Project reports (10-15 pages total) due March 12 (by email to Adrienne, Rich, and Raj before midnight)


Have a great weekend!