

SLIDE 1

Reinforcement Learning

Course 9.54 Final review

SLIDE 2

Agent learning to act in an unknown environment

SLIDE 3

Reinforcement Learning Setup

(Diagram: the agent-environment interaction loop, with the agent in state St.)

SLIDE 4

Background and setup

  • The environment is initially unknown or partially known
  • It is also stochastic: the agent cannot fully predict what will happen next

  • What is a ‘good’ action to select under these conditions?
  • Animal learning seeks to maximize reward
SLIDE 5

Formal Setup

  • The agent is in one of a set of states, {S1, S2, …, Sn}
  • At each state, it can take an action from a set of available actions {A1, A2, …, Ak}
  • From state Si, taking action Aj → a new state Sj and a possible reward (a sketch in code follows below)
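
A minimal sketch of this setup in Python; the state names, actions, probabilities, and rewards below are illustrative, not taken from the course:

```python
# Hypothetical MDP. The 'contingencies' are stored as
# {(state, action): [(next_state, reward, probability), ...]}.
mdp = {
    ("S1", "A1"): [("S2", 2.0, 0.8), ("S3", -1.0, 0.2)],
    ("S1", "A3"): [("S3", -1.0, 1.0)],
    ("S2", "A2"): [("S1", 0.0, 1.0)],
    ("S3", "A2"): [("S1", 0.0, 1.0)],
}
states = ["S1", "S2", "S3"]
actions = ["A1", "A2", "A3"]
```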

SLIDE 6

Stochastic transitions

(Diagram: from state S1, actions A1, A2, A3 lead stochastically to states S1, S2, S3; two of the transitions carry rewards R=2 and R=-1.)

SLIDE 7

The consequence of an action:

  • (S,A) → (S', R)
  • Governed by:
  • P(S' | S, A)
  • P(R | S, A, S')
  • These probabilities are properties of the world. (‘Contingencies’)
  • We assume the transitions are Markovian: the next state depends only on the current state and action (see the sketch below)
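
A sketch of one interaction step under these contingencies, reusing the illustrative mdp table above (step is a hypothetical helper name). Because the table is keyed only by the current (state, action) pair, the Markov assumption is built in:

```python
import random

def step(mdp, state, action):
    """Sample (next_state, reward) from P(S', R | S, A)."""
    outcomes = mdp[(state, action)]           # [(S', R, prob), ...]
    probs = [p for _, _, p in outcomes]
    s2, r, _ = random.choices(outcomes, weights=probs, k=1)[0]
    return s2, r
```
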
SLIDE 8

Policy

  • The goal is to learn a policy π: S → A
  • The policy determines the future of the agent:

(Diagram: the policy π maps each of the states S1, S2, S3 to one of the actions A1, A2, A3.)
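
In this tabular setting a deterministic policy is just a lookup table from states to actions; a sketch using the illustrative names above (pi and rollout are hypothetical helpers):

```python
# A deterministic policy pi: S -> A.
pi = {"S1": "A1", "S2": "A2", "S3": "A2"}

def rollout(mdp, pi, start, horizon=50):
    """Follow pi from `start`; return the sequence of rewards received."""
    state, rewards = start, []
    for _ in range(horizon):
        state, reward = step(mdp, state, pi[state])
        rewards.append(reward)
    return rewards
```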

SLIDE 9

Model-based vs. Model-free RL

  • Model-based methods assume that the probabilities:

– P(S' | S, A)
– P(R | S, A, S')

are known and can be used in planning

  • In model-free methods:

– The ‘contingencies’ are not known
– They need to be learned by exploration as part of policy learning

SLIDE 10

Step 1: defining Vπ(S)

(Diagram: a trajectory S → S(1) → S(2) under actions a1, a2, collecting rewards r1, r2.)

  • Start from S and just follow the policy π
  • We find ourselves in state S(1) with reward r1, then S(2) with reward r2, and so on
  • Vπ(S) = < r1 + γ r2 + γ² r3 + … >
  • The expected (discounted) sum of rewards from S onward (estimated in the sketch below).
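
This definition suggests a direct Monte Carlo estimate: average the discounted return over many sampled trajectories. A sketch building on the helpers above (γ and the episode count are arbitrary choices):

```python
def mc_value(mdp, pi, state, gamma=0.9, episodes=1000, horizon=50):
    """Estimate V_pi(state) = < r1 + gamma*r2 + gamma^2*r3 + ... >."""
    total = 0.0
    for _ in range(episodes):
        rewards = rollout(mdp, pi, state, horizon)
        total += sum(gamma**t * r for t, r in enumerate(rewards))
    return total / episodes
```
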
SLIDE 11

Step 2: equations for V(S)

  • Vπ(S) = < r1 + γ r2 + γ² r3 + … >
  • = < r1 + γ (r2 + γ r3 + … ) >
  • = < r1 + γ Vπ(S') >
  • These are equations relating V(S) for different states.
  • Next, write them explicitly in terms of the known parameters (contingencies):

SLIDE 12

Equations for V(S)

(Diagram: from state S, action A leads stochastically to S1, S2, or S3, with rewards r1, r2, r3.)

  • Vπ(S) = < r1 + γ Vπ(S') > = Σ_S' P(S' | S, a) [ r(S, a, S') + γ Vπ(S') ]
  • E.g.:

Vπ(S) = 0.2 (r1 + γ Vπ(S1)) + 0.5 (r2 + γ Vπ(S2)) + 0.3 (r3 + γ Vπ(S3))

  • Linear equations, the unknowns are Vπ(Si) (solved in the sketch below)
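
Because the unknowns Vπ(Si) enter linearly, policy evaluation can also be done in closed form; a sketch with NumPy, assuming the contingencies have been collected into dense arrays (an assumption about representation, not from the slides):

```python
import numpy as np

def evaluate_policy_exact(P, R, gamma=0.9):
    """Solve V = R + gamma * P @ V, i.e. (I - gamma * P) V = R.

    P[i, j] = P(S_j | S_i, pi(S_i)); R[i] = expected one-step reward from S_i.
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```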

SLIDE 13

Improving the Policy

  • Given the policy π, we can find the values Vπ(S) by solving the linear equations iteratively
  • Convergence is guaranteed (the system of equations is strongly diagonally dominant)
  • Given V(S), we can improve the policy by acting greedily: π'(S) = argmax_a Σ_S' P(S' | S, a) [ r(S, a, S') + γ V(S') ]
  • We can combine these steps to find the optimal policy (sketched below)
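
A sketch of the two combined steps (policy iteration) over the illustrative mdp table, with iterative evaluation in place of the exact solve; the helper names are hypothetical:

```python
def evaluate_policy(mdp, pi, states, gamma=0.9, sweeps=200):
    """Iteratively apply V(S) <- sum_S' P(S'|S,a) [r + gamma V(S')], a = pi(S)."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = sum(p * (r + gamma * V[s2]) for s2, r, p in mdp[(s, pi[s])])
    return V

def improve_policy(mdp, V, states, actions, gamma=0.9):
    """Greedy step: pi'(S) = argmax_a sum_S' P(S'|S,a) [r + gamma V(S')]."""
    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for s2, r, p in mdp[(s, a)])
    return {s: max((a for a in actions if (s, a) in mdp), key=lambda a: q(s, a))
            for s in states}
```
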
SLIDE 14

Improving the policy

(Diagram: from state S, comparing the policy's action A = π(S) against an alternative action A'.)

SLIDE 15

Value Iteration

Learning V and π when the ‘contingencies’ are known.

SLIDE 16

Value Iteration Algorithm

* Value iteration is used in the problem set (a sketch is given below)
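
The algorithm itself appeared as a figure on the slide; here is a minimal tabular sketch over the illustrative mdp table (the problem set's exact interface will differ):

```python
def value_iteration(mdp, states, actions, gamma=0.9, tol=1e-6):
    """Iterate V(S) <- max_a sum_S' P(S'|S,a) [r + gamma V(S')] to convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            vals = [sum(p * (r + gamma * V[s2]) for s2, r, p in mdp[(s, a)])
                    for a in actions if (s, a) in mdp]
            new_v = max(vals)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```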

SLIDE 17

Q-learning

  • The main algorithm used for model-free RL; its update rule is sketched below
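
Q-learning's tabular update is Q(S,a) ← Q(S,a) + α [ r + γ max_a' Q(S',a') − Q(S,a) ]; a minimal sketch (α and γ are arbitrary choices, Q is a plain dict):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One model-free update from a single sampled transition (s, a, r, s')."""
    best_next = max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```
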
SLIDE 18

Q-values (state-action)

(Diagram: the stochastic-transition graph of slide 6, with Q-values attached to state-action pairs, e.g. Q(S1, A1) and Q(S1, A3); rewards R=2, R=-1.)

Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π

SLIDE 19

Q-value (state-action)

  • The same update is done on Q-values rather than on V
  • Used in most practical algorithms and some brain models
  • Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π:

Qπ(S, a) = < r1 + γ r2 + γ² r3 + … >, with the first action fixed to a

SLIDE 20

Q-values (state-action)

(Diagram repeated from slide 18: Q-values Q(S1, A1), Q(S1, A3) on the transition graph; rewards R=2, R=-1.)

SLIDE 21

SARSA

  • It is called SARSA because it uses:

s(t), a(t), r(t+1), s(t+1), a(t+1)

  • A step like this uses the current π, so that each S has its action according to the policy π: a = π(S)

SLIDE 22

SARSA RL Algorithm

* ε-greedy: with probability ε, do not select the greedy action; instead select with equal probability among all actions. (Both pieces are sketched below.)
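
A sketch of ε-greedy selection and the on-policy SARSA update over the quintuple (s, a, r, s', a'); one common variant of ε-greedy is shown, in which the random draw may also land on the greedy action:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore uniformly; otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """Unlike Q-learning's max, SARSA uses the action a' actually selected."""
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```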

SLIDE 23

TD learning and biology: dopamine

SLIDE 24

Behavioral support for ‘prediction error’

Associating a light cue with food

SLIDE 25

‘Blocking’

  • No response to the bell
  • The bell and food were consistently associated
  • There was no prediction error.
  • Conclusion: prediction error, not association, drives learning!
SLIDE 26

Learning and Dopamine

  • Learning is driven by the prediction error:

δ(t) = r + γ V(S') – V(S)

  • Computed by the dopamine system

(Here too, if there is no error, no learning takes place; see the sketch below)
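
A sketch of the TD(0) update this error drives; when δ(t) = 0 the value table is left unchanged, matching the "no error, no learning" point above (α and γ are arbitrary choices):

```python
def td0_update(V, s, r, s2, alpha=0.1, gamma=0.9):
    """delta = r + gamma*V(S') - V(S); then V(S) <- V(S) + alpha * delta."""
    delta = r + gamma * V.get(s2, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```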

SLIDE 27

Dopaminergic neurons

  • Dopamine is a neuro-modulator
  • In the:

– VTA (ventral tegmental area)
– Substantia nigra

  • These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example the striatum, nucleus accumbens, and frontal cortex.

SLIDE 28

Major players in RL

SLIDE 29

Effects of dopamine: why it is associated with reward and reward-related learning

  • Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
  • Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation:

– Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet

SLIDE 30

Dopamine and prediction error

  • The animal (rat or monkey) gets a cue (visual or auditory)
  • A reward follows after a delay (1 sec in the figure below)
SLIDE 31

Dopamine and prediction error