TD learning, Biology: dopamine, TD(λ) - PowerPoint PPT Presentation



SLIDE 1

TD learning. Biology: dopamine

SLIDE 2

TD(λ)

Using a longer trajectory rather than a single step. For a single step, the expected total return from S on is: R(S) = r + γV(S'). For two steps: R(S) = r + γr' + γ²V(S'').

SLIDE 3

TD(λ)

n-step return at time t, using a trajectory of length n: Rt(n) = rt+1 + γrt+2 + … + γ^(n-1)rt+n + γ^n Vt(St+n). This is an estimation of the total return based on n steps; the value V(S) can then be updated following n steps from S.
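As a concrete illustration, a minimal Python sketch of this n-step return (the function name and list-based inputs are illustrative, not from the slides):

    def n_step_return(rewards, v_end, gamma):
        """n-step return: rt+1 + g*rt+2 + ... + g^(n-1)*rt+n + g^n * V(St+n).

        rewards: the n rewards [rt+1, ..., rt+n]; v_end: the bootstrap value V(St+n).
        """
        ret = v_end                      # start from the bootstrap value V(St+n)
        for r in reversed(rewards):      # fold in rewards from the end backward
            ret = r + gamma * ret
        return ret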

SLIDE 4

Summary: generalizes the 1-step update

n-step return at time t: Rt(n), as above. The value V(S) can be updated following n steps from S by: ΔVt(St) = α[Rt(n) - Vt(St)]. This generalizes the 1-step learning rule: ΔVt(St) = α[rt+1 + γVt(St+1) - Vt(St)].
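A hedged sketch of this update, reusing the n_step_return helper from the previous snippet (V as a dict from states to values is an assumption):

    def n_step_update(V, s_t, rewards, s_end, alpha, gamma):
        """V(St) <- V(St) + alpha * (Rt(n) - V(St)), with Rt(n) the n-step return."""
        target = n_step_return(rewards, V[s_end], gamma)
        V[s_t] += alpha * (target - V[s_t])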

SLIDE 5

Averaging trajectories:

  • It is also possible to average trajectories; we can use the sub-trajectories of the full length-n trajectory to update V(S).
  • A particular averaging (particular weights) is the TD(λ) weighting: the weights are 1, λ, λ², …, all multiplied by (1-λ), since a weighted average needs the sum of the weights to be 1 (a small numerical check follows below).
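A quick check of this normalization (the truncation at 200 terms is arbitrary):

    lam = 0.9
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 201)]
    print(sum(weights))   # ~= 1.0: the TD(lambda) weights form a proper weighted average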

SLIDE 6

λ-Return

Using the single long trajectory we had the n-step returns Rt(n). The λ-return is their weighted average over all lengths: Rtλ = (1-λ) Σn λ^(n-1) Rt(n).
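A minimal sketch of the λ-return over a finite trajectory, reusing n_step_return from above; giving the final, full-length return the leftover weight λ^(T-1) is a standard episodic convention, not stated on the slide:

    def lambda_return(rewards, values, gamma, lam):
        """Weighted average of the n-step returns from time t.

        rewards: [rt+1, ..., rt+T]; values: [V(St+1), ..., V(St+T)].
        """
        T = len(rewards)
        total = 0.0
        for n in range(1, T + 1):
            R_n = n_step_return(rewards[:n], values[n - 1], gamma)
            # weight (1-lam)*lam^(n-1); the last return absorbs the remaining mass
            w = lam ** (T - 1) if n == T else (1 - lam) * lam ** (n - 1)
            total += w * R_n
        return total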

SLIDE 7

TD(λ)

And the learning rules. Single long trajectory: ΔVt(St) = α[Rt(n) - Vt(St)]. TD(λ) learning: ΔVt(St) = α[Rtλ - Vt(St)].

SLIDE 8

Eligibility traces

To compute this at time t we would need the next n steps, which we do not yet have. Instead, we want at time t to update backward the previously visited states. This can be done with an 'eligibility trace': each visited state becomes 'eligible' for update, and the updates take place later. This gives an equivalent, backward form of TD(λ) learning.

SLIDE 9

Implementing TD(λ) with Eligibility Traces

A memory called an 'eligibility trace', et(S), is added to each state. It is updated by: et(S) = γλ et-1(S), plus 1 if S is visited at step t. That is, the trace of S is incremented by 1 when S is visited, and decays by γλ at each step. Here γ is the discount factor and λ is the decay parameter.
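As a sketch, with traces stored in a dict e keyed by state (an assumption, not from the slide):

    def update_traces(e, s_t, gamma, lam):
        """et(S) = gamma*lam * et-1(S) for every state, +1 for the visited state St."""
        for s in e:
            e[s] *= gamma * lam      # all traces decay by gamma*lambda
        e[s_t] += 1.0                # the visited state becomes more eligible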
SLIDE 10

Learning with eligibility traces

Take a step and compute a single-step TD error: δt = rt+1 + γVt(St+1) - Vt(St). Then update every state in proportion to its trace: V(S) ← V(S) + α δt et(S). V(S) is updated at each step, even though the current state is different from S. If S was visited, followed by S1, S2, S3, then V(S) will be updated with the error of each of them.

SLIDE 11

The full TD(λ) Algorithm:

V(S) is updated at each step, although the current state is different from S. If S was visited, followed by S1, S2, S3, then V(S) will be updated with the error of each of them.
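Putting the pieces together, a minimal backward-view TD(λ) episode loop in Python; the env object with reset() and step() is a hypothetical interface, not from the slides:

    def td_lambda_episode(env, V, alpha, gamma, lam):
        """One episode of backward-view TD(lambda) over a dict V of state values."""
        e = {s: 0.0 for s in V}                   # eligibility traces, one per state
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(s)
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - V[s]     # single-step TD error
            e[s] += 1.0                           # current state becomes eligible
            for state in V:                       # every eligible state shares the error
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam           # traces decay
            s = s_next
        return V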

SLIDE 12

Eligibility traces

Updating state values V(S) with eligibility traces is mathematically identical to the 'forward' TD(λ) learning, but the update does not rely on future values and has plausible biological models.

SLIDE 13

SARSA(λ)
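The slide presents SARSA(λ) graphically. As a hedged sketch, the same eligibility-trace machinery applied to action values Q(s, a) looks roughly like this (the env and policy interfaces are assumptions, not from the slides):

    def sarsa_lambda_episode(env, Q, policy, alpha, gamma, lam):
        """One episode of SARSA(lambda) over a dict Q of (state, action) values."""
        e = {sa: 0.0 for sa in Q}                  # one trace per (state, action) pair
        s = env.reset()
        a = policy(Q, s)
        done = False
        while not done:
            s2, r, done = env.step(s, a)
            a2 = None if done else policy(Q, s2)
            q_next = 0.0 if done else Q[(s2, a2)]
            delta = r + gamma * q_next - Q[(s, a)]  # on-policy TD error
            e[(s, a)] += 1.0
            for sa in Q:
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            s, a = s2, a2
        return Q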

SLIDE 14

Eligibility traces – biology

SLIDE 15

STDP (spike-timing-dependent plasticity)

SLIDE 16

Eligibility

SLIDE 17

Synaptic Reinforcement

SLIDE 18

Dopamine story

SLIDE 19

Behavioral support for ‘prediction error’

Associating light cue with food

SLIDE 20

‘Blocking’

No response to the bell, even though the bell and food were consistently associated. There was no prediction error: prediction error, not association, drives learning.

SLIDE 21

Rescorla-Wagner

Associative learning occurs not because two events co-occur but because that co-occurrence is unanticipated on the basis of current associative strength: ΔVi = αiβ(λ - Vtot). α, β are rate parameters, Vtot is the total association from all cues on this trial, and λ is the currently expected value. Learning occurs only if the current value Vtot is different from the expectation. Still missing: action selection, a policy for behavior, long sequences.
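A toy Python run of the Rescorla-Wagner rule that reproduces the blocking result from the previous slide (the parameter values are arbitrary illustrations):

    def rw_trial(V, cues, alpha, beta, lam):
        """Rescorla-Wagner: each present cue gains alpha*beta*(lam - Vtot)."""
        v_tot = sum(V[c] for c in cues)    # total association of the present cues
        error = lam - v_tot                # prediction error on this trial
        for c in cues:
            V[c] += alpha * beta * error

    V = {"light": 0.0, "bell": 0.0}
    for _ in range(50):
        rw_trial(V, ["light"], alpha=0.3, beta=1.0, lam=1.0)          # light -> food
    for _ in range(50):
        rw_trial(V, ["light", "bell"], alpha=0.3, beta=1.0, lam=1.0)  # compound -> food
    print(V)   # light ~ 1.0, bell ~ 0.0: no error left, so the bell is blocked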

SLIDE 22

Iterative solution for V(S)

Vπ(S) = < r + γVπ(S') >
V(S) ← V(S) + α[(r + γV(S')) - V(S)]
The bracketed difference (r + γV(S')) - V(S) is the error: the prediction error, the TD error.
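A one-step sketch of this iterative update in Python (the dict-based V is an assumption):

    def td0_update(V, s, r, s_next, alpha, gamma):
        """V(S) <- V(S) + alpha * [(r + gamma*V(S')) - V(S)]; returns the TD error."""
        delta = (r + gamma * V[s_next]) - V[s]    # prediction error, the TD error
        V[s] += alpha * delta
        return delta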

SLIDE 23
  • Learning is driven by the prediction error:
  • δ(t) = r + γV(S') - V(S)
  • Computed by the dopamine system
  • (Here too, if there is no error, no learning will take place)

SLIDE 24

Dopaminergic neurons

  • Dopamine is a neuromodulator
  • Dopaminergic neurons are found in the VTA (ventral tegmental area) and the substantia nigra
  • These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example the striatum, nucleus accumbens, and frontal cortex.

SLIDE 25

Major players in RL

SLIDE 26

Effects of dopamine: why it is associated with reward and reward-related learning

  • First, drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons.
  • Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation.
  • Third, animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet.

SLIDE 27

Self-stimulation

SLIDE 28
  • You can put a stimulating electrode in various places. In the dopamine system (e.g. the VTA), the animal will keep stimulating.
  • In the orbital cortex, for example, you can put the electrode in a taste-related sub-region activated by food. The animal will activate the electrode when it is hungry, but will stop when it is not.

SLIDE 29

Dopamine and prediction error

The animal (rat or monkey) gets a cue (visual or auditory), and a reward follows after a delay (1 sec in the original figure).

SLIDE 30

Dopamine and prediction error

SLIDE 31

TD prediction error: conclusion of the biological study

SLIDE 32

Computational TD learning is similar:

Take a step and compute a single-step TD error: δ = r + γV(S') - V(S); then update V(S) ← V(S) + α δ e(S). V(S) is updated at each step, although the current state is different from S. If S was visited, followed by S1, S2, S3, then V(S) will be updated with the error of each of them.