
CSE/NEURO 528 Lecture 13: Reinforcement Learning & Course Review

(Chapter 9)

Animation: Tom Creed, SJU

Early Results: Pavlov and his Dog

• Classical (Pavlovian) conditioning experiments
• Training: Bell → Food
• After: Bell → Salivate
• Conditioned stimulus (bell) predicts future reward (food)

Image: Wikimedia Commons; Animation: Tom Creed, SJU


Predicting Delayed Rewards

• How do we predict rewards delivered some time after a stimulus is presented?
• Given: many trials, each of length T time steps
• Time within a trial: 0 ≤ t ≤ T, with stimulus u(t) and reward r(t) at each time step t (note: r(t) can be zero for some t)
• We would like a neuron whose output v(t) predicts the expected total future reward starting from time t:

$v(t) = \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$

where $\langle \cdot \rangle$ denotes an average over trials.


Learning to Predict Future Rewards

• Use a set of synaptic weights w(τ) and predict based on all past stimuli u(t):

$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$  (a linear filter!)

• Learn weights w(τ) that minimize the error:

$\left\langle \left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 \right\rangle$

(Can we minimize this using gradient descent and the delta rule?) Yes, BUT future rewards are not yet available!
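In code, the linear-filter prediction above is just a causal convolution of the stimulus history with the weights. A minimal sketch (illustrative, not from the slides):

```python
import numpy as np

# v(t) = sum_tau w(tau) u(t - tau): a causal linear filter of the stimulus.
T = 250
u = np.zeros(T); u[100] = 1.0        # stimulus at t = 100
w = np.zeros(T)                      # filter weights w(tau), to be learned

v = np.convolve(u, w)[:T]            # truncate to the trial length
```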


Temporal Difference (TD) Learning

• Key idea: rewrite the error function to get rid of the future terms:

$\left\langle \left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 \right\rangle = \left\langle \left( r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t) \right)^2 \right\rangle \approx \left\langle \left( r(t) + v(t+1) - v(t) \right)^2 \right\rangle$

where the expected future reward from t+1 has been replaced by the prediction v(t+1). Minimize this using gradient descent!

• Temporal Difference (TD) learning:

$w(\tau) \rightarrow w(\tau) + \epsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$
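A minimal numpy sketch of this TD rule (the step size, trial count, and stimulus/reward timing are illustrative assumptions):

```python
import numpy as np

T, n_trials, eps = 250, 500, 0.1
w = np.zeros(T)                               # weights w(tau)
u = np.zeros(T); u[100] = 1.0                 # stimulus at t = 100
r = np.zeros(T); r[200] = 1.0                 # reward at t = 200

for trial in range(n_trials):
    # v(t) = sum_tau w(tau) u(t - tau)
    v = np.array([w[: t + 1] @ u[t::-1] for t in range(T)])
    for t in range(T - 1):
        delta = r[t] + v[t + 1] - v[t]        # TD error
        w[: t + 1] += eps * delta * u[t::-1]  # w(tau) += eps * delta * u(t - tau)

# Over trials, the TD error delta moves from the reward time to the stimulus time.
```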


Predicting Future Rewards: TD Learning

[Plot: stimulus at t = 100 and reward at t = 200; prediction error δ for each time step (over many trials)]

Image Source: Dayan & Abbott textbook


Possible Reward Prediction Error Signal in the Primate Brain

Dopaminergic cells in the Ventral Tegmental Area (VTA):

• Before training: the response at the time of reward is the reward prediction error $\delta = r(t) + v(t+1) - v(t)$.
• After training: no error at the time of reward, since $v(t) = r(t) + v(t+1)$; the error $[\,v(t+1) - v(t)\,]$ appears instead at the time of the conditioned stimulus.

Image Source: Dayan & Abbott textbook


More Evidence for Prediction Error Signals

Dopaminergic cells in VTA after training show a negative error when a reward is expected but not delivered: with $r(t) = 0$ and $v(t+1) = 0$,

$\delta = r(t) + v(t+1) - v(t) = -v(t) < 0$

(a dip in firing at the expected reward time).

Image Source: Dayan & Abbott textbook


Reinforcement Learning: Acting to Maximize Rewards

[Diagram: Agent–Environment loop: the agent receives state $u_t$ and reward $r_t$ from the environment and emits action $a_t$]

The Problem

[Diagram: Agent–Environment loop, as above]

Learn a state-to-action mapping or "policy" $\pi(u) = a$ which maximizes the expected total future reward:

$\left\langle \sum_{t=0}^{T} r(t) \right\rangle$


Example: Rat in a barn

• States = locations A, B, or C
• Actions = L (go left) or R (go right)
• If the rat chooses L or R at random (a random "policy"), what is the expected reward (or "value") v for each state?

Image Source: Dayan & Abbott textbook


Policy Evaluation

For the random policy:

$v(B) = \tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \cdot 5 = 2.5$
$v(C) = \tfrac{1}{2} \cdot 2 + \tfrac{1}{2} \cdot 0 = 1$
$v(A) = \tfrac{1}{2} v(B) + \tfrac{1}{2} v(C) = 1.75$

Can learn the values of states using TD learning:

$w(u) \rightarrow w(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]$

where the value of state u is v(u) = w(u), and (location, action) → new location, i.e. (u, a) → u′.
slide-7
SLIDE 7

13

TD Learning of Values for Random Policy

[Plots: learned values converge to v(A) = 1.75, v(B) = 2.5, v(C) = 1; for all three, ε = 0.5]

Once I know the values, I can pick the action that leads to the higher-valued state!

Image Source: Dayan & Abbott textbook


Selecting Actions based on Values

[Diagram: from A, going left leads to B (value 2.5), going right leads to C (value 1)]

• Values act as surrogate immediate rewards → locally optimal choice leads to a globally optimal policy for "Markov" environments (related to Dynamic Programming)


Putting it all together: Actor-Critic Learning

• Two separate components: Actor (selects action and maintains policy) and Critic (maintains value of each state)

1. Critic learning ("Policy Evaluation"), with value of state u given by v(u) = w(u) (same as the TD rule):

$w(u) \rightarrow w(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]$

2. Actor learning ("Policy Improvement"): probabilistically select an action a at state u according to the softmax

$P(a; u) = \frac{\exp(\beta Q_a(u))}{\sum_b \exp(\beta Q_b(u))}$

then, for all actions a′, update

$Q_{a'}(u) \rightarrow Q_{a'}(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]\,(\delta_{a a'} - P(a'; u))$

3. Repeat 1 and 2
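Below is an illustrative actor-critic sketch for the barn example (β, ε, and the trial count are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.2, 1.0
w = {s: 0.0 for s in 'ABC'}                   # critic: v(u) = w(u)
Q = {s: np.zeros(2) for s in 'ABC'}           # actor: Q_a(u), a in {L=0, R=1}

step = {('A', 0): ('B', 0), ('A', 1): ('C', 0),
        ('B', 0): (None, 0), ('B', 1): (None, 5),
        ('C', 0): (None, 2), ('C', 1): (None, 0)}

def policy(u):                                # softmax P(a; u)
    p = np.exp(beta * Q[u])
    return p / p.sum()

for trial in range(5000):
    u = 'A'
    while u is not None:
        P = policy(u)
        a = rng.choice(2, p=P)
        u_next, r = step[(u, a)]
        delta = r + (w[u_next] if u_next else 0.0) - w[u]
        w[u] += eps * delta                        # critic (TD rule)
        Q[u] += eps * delta * (np.eye(2)[a] - P)   # actor: (delta_{aa'} - P(a'; u))
        u = u_next

print(policy('A'))   # P(L at A) rises, since L leads to the higher-valued state B
```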


Actor-Critic Learning in our Barn Example

[Plots: probability of going left at each location over trials]

Image Source: Dayan & Abbott textbook


Possible Implementation of the Actor-Critic Model in the Basal Ganglia

[Diagram: basal ganglia circuit (Cortex → Striatum → GPi/SNr → Thalamus, with STN and GPe loops, and dopamine (DA) from SNc) mapped onto the actor-critic model: state estimate, hidden layer, TD error, action, value, Actor, Critic]

(See Supplementary Materials for references)


Reinforcement learning has been applied to many real-world problems!

Examples: Google's AlphaGo beat the human champion at Go; autonomous helicopter flight (learned from human demonstrations).

(Videos and papers at: http://heli.stanford.edu/)


Course Summary

  • Where have we been?
  • Course Highlights
  • Where do we go from here?
  • Challenges and Open Problems
  • Further Reading


What is the neural code?

• What is the nature of the code? Representing the spiking output: single cells vs. populations; rates vs. spike times vs. intervals
• What features of the stimulus does the neural system represent?


Encoding and decoding neural information

• Encoding: building functional models of neurons/neural systems and predicting the spiking output given the stimulus
• Decoding: what can we say about the stimulus given what we observe from the neuron or neural population?


Information maximization as a design principle of the nervous system


Biophysical Models of Neurons

• Voltage-dependent
• Transmitter-dependent (synaptic)
• Ca²⁺-dependent


• The neural equivalent circuit

Ohm's law and Kirchhoff's law: the capacitive current balances the ionic currents and the externally applied current,

$c_m \frac{dV}{dt} = -i_m + \frac{I_e}{A}$

(capacitive current; ionic membrane currents $i_m$; externally applied current $I_e$)


Simplified models: integrate-and-fire

$\tau_m \frac{dV}{dt} = (E_L - V) + R_m I_e$

If $V > V_{\text{threshold}}$ → Spike; then reset: $V = V_{\text{reset}}$
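A minimal Euler-integration sketch of this model (the parameter values are illustrative assumptions):

```python
import numpy as np

tau_m, E_L, R_m = 10.0, -65.0, 10.0       # ms, mV, MOhm
V_th, V_reset, I_e = -50.0, -65.0, 2.0    # mV, mV, nA
dt, T = 0.1, 100.0                        # ms

V, t, spikes = E_L, 0.0, []
while t < T:
    V += dt * (E_L - V + R_m * I_e) / tau_m   # tau_m dV/dt = (E_L - V) + R_m I_e
    if V > V_th:                              # threshold crossing -> spike
        spikes.append(t)
        V = V_reset                           # reset
    t += dt

print(len(spikes), "spikes in", T, "ms")
```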


Modeling Networks of Neurons

$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M\mathbf{v})$

(output: $\mathbf{v}$; decay: $-\mathbf{v}$; input: $W\mathbf{u}$; feedback: $M\mathbf{v}$)
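A sketch of simulating these firing-rate dynamics with Euler steps (the network sizes, random weights, and rectifying choice of F are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N = 5, 10
tau, dt = 10.0, 0.1                           # ms

W = 0.5 * rng.standard_normal((N, N_in))      # feedforward input weights
M = 0.1 * rng.standard_normal((N, N))         # recurrent feedback weights
F = lambda x: np.maximum(x, 0.0)              # rectifying activation function

u = rng.random(N_in)                          # static input
v = np.zeros(N)
for _ in range(5000):
    v += dt * (-v + F(W @ u + M @ v)) / tau   # tau dv/dt = -v + F(Wu + Mv)

print(v)                                      # approximate steady-state rates
```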


Unsupervised Learning

• For a linear neuron: $v = \mathbf{w}^T \mathbf{u} = \mathbf{u}^T \mathbf{w}$
• Basic Hebb rule: $\tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u}$
• Average effect over many inputs: $\tau_w \frac{d\mathbf{w}}{dt} = Q\,\mathbf{w}$
• Q is the input correlation matrix: $Q = \langle \mathbf{u}\,\mathbf{u}^T \rangle$

The Hebb rule performs principal component analysis (PCA).
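A sketch of the averaged rule driving w toward the principal eigenvector of Q (the input statistics are illustrative; the explicit normalization, Oja-style, is an added assumption to keep w bounded, since the plain Hebb rule grows without bound):

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.5],
              [1.5, 1.0]])                    # input correlations
U = rng.multivariate_normal([0, 0], C, size=5000)
Q = U.T @ U / len(U)                          # Q = <u u^T>

w = rng.standard_normal(2)
for _ in range(1000):
    w += 0.01 * (Q @ w)                       # averaged Hebb rule (dt/tau_w = 0.01)
    w /= np.linalg.norm(w)                    # normalization to keep w bounded

top_eigvec = np.linalg.eigh(Q)[1][:, -1]
print(w, top_eigvec)                          # aligned up to sign
```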


The Connection to Statistics

Causes v → Data u via the generative model (data likelihood) $p[\mathbf{u} \mid \mathbf{v}; G]$; Data u → Causes v via the recognition model (posterior) $p[\mathbf{v} \mid \mathbf{u}; G]$

• Unsupervised learning = learning the hidden causes of input data, with parameters G
• Examples: causes of clustered data; "causes" of natural images
• Use the EM algorithm for learning


Generative Models

[Diagram, tongue-in-cheek: hidden causes "Droning lecture" and "Mathematical derivations" generate the observed data "Lack of sleep"]


Supervised Learning

Backpropagation for Multilayered Networks

• Network: hidden unit activities $x_j^m = g\left(\sum_k w_{jk} u_k^m\right)$ and outputs $v_i^m = g\left(\sum_j W_{ij}\, g\left(\sum_k w_{jk} u_k^m\right)\right)$ for input pattern $\mathbf{u}^m$

• Goal: find W and w that minimize the errors relative to the desired outputs $d_i^m$:

$E(W, w) = \frac{1}{2} \sum_{m,i} \left( d_i^m - v_i^m \right)^2$

• Gradient descent learning rules:

$W_{ij} \rightarrow W_{ij} - \epsilon \frac{\partial E}{\partial W_{ij}}$  (delta rule)
$w_{jk} \rightarrow w_{jk} - \epsilon \frac{\partial E}{\partial w_{jk}} = w_{jk} - \epsilon \frac{\partial E}{\partial x_j^m} \frac{\partial x_j^m}{\partial w_{jk}}$  (chain rule)
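A minimal two-layer backprop sketch matching these equations, with g = tanh (the toy XOR dataset, layer sizes, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
U = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # inputs u^m
D = np.array([[0], [1], [1], [0]], float)               # desired outputs d^m

w = rng.standard_normal((2, 4))       # input -> hidden weights w_jk
W = rng.standard_normal((4, 1))       # hidden -> output weights W_ij
eps = 0.5

for epoch in range(5000):
    X = np.tanh(U @ w)                # x_j = g(sum_k w_jk u_k)
    V = np.tanh(X @ W)                # v_i = g(sum_j W_ij x_j)
    dV = (D - V) * (1 - V**2)         # output delta (g' = 1 - tanh^2)
    dX = (dV @ W.T) * (1 - X**2)      # chain rule back to the hidden layer
    W += eps * X.T @ dV / len(U)      # W_ij -= eps dE/dW_ij
    w += eps * U.T @ dX / len(U)      # w_jk -= eps dE/dw_jk

print(np.round(np.tanh(np.tanh(U @ w) @ W).ravel(), 2))  # approaches 0, 1, 1, 0
```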


Reinforcement Learning

• Learning to predict rewards (delta rule): $\mathbf{w} \rightarrow \mathbf{w} + \epsilon (r - v)\,\mathbf{u}$
• Learning to predict delayed rewards (TD learning): $w(\tau) \rightarrow w(\tau) + \epsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$
• Actor-Critic learning:
  • Critic learns the value of each state using TD learning
  • Actor learns the best actions based on the value of the next state (using the TD error)

(Pavlov animation: http://employees.csbsju.edu/tcreed/pb/pdoganim.html)
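For completeness, a sketch of the first rule above, the delta (Rescorla-Wagner) rule for predicting an immediate reward (the stimulus/reward statistics are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, w = 0.1, np.zeros(2)

for trial in range(500):
    u = rng.integers(0, 2, size=2).astype(float)   # binary stimuli
    r = 1.0 * u[0] + 0.5 * u[1]                    # reward depends on the stimuli
    v = w @ u                                      # prediction v = w . u
    w += eps * (r - v) * u                         # delta rule: w += eps (r - v) u

print(w)   # approaches [1.0, 0.5]
```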


The Future: Challenges and Open Problems

• How do neurons encode information?
  • Topics: synchrony, spike-timing-based learning, dynamic synapses
• Does a neuron's structure confer computational advantages?
  • Topics: role of channel dynamics, dendrites, plasticity in channels and their density
• How do networks implement computational principles such as efficient coding and Bayesian inference?
• How do networks learn "optimal" representations of their environment and engage in purposeful behavior?
  • Topics: unsupervised/reinforcement/imitation learning

Further Reading (for Spring and beyond)

• Spikes: Exploring the Neural Code, F. Rieke et al., MIT Press, 1997
• The Biophysics of Computation, C. Koch, Oxford University Press, 1999
• Large-Scale Neuronal Theories of the Brain, C. Koch and J. L. Davis, MIT Press, 1994
• Probabilistic Models of the Brain, R. Rao et al., MIT Press, 2002
• Bayesian Brain, K. Doya et al., MIT Press, 2007
• Reinforcement Learning: An Introduction, R. Sutton and A. Barto, MIT Press, 1998


Next two classes: Project presentations!

• Keep your presentation short: ~7-8 slides, 8 + 3 mins per group (with questions)
  • Introduction, Background, Methods, Results, Conclusion
• Slides:
  • Bring your slides on a USB stick to use the class laptop (Windows machine), OR
  • Bring your own laptop (especially if you have videos, etc.)
• Project reports (10-15 pages total) due March 12 (by email to Adrienne, Rich, and Raj before midnight)


Have a great weekend!