Model-Free Methods


SLIDE 1

Model-Free Methods


SLIDE 2

[Diagram: from S1, actions A1, A2, A3 lead to successor states S2, S3, S1, with rewards R=2 and R=-1]

Model-based: use all branches

In model-based methods we update Vπ(S) using all the possible successor states S'. In model-free methods we take a single step and update based on that sample.
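
As a contrast with the sample-based update on the following slides, here is a minimal sketch of a model-based (full-backup) update in Python. The names V, pi, P, R and the state/action sets are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of a model-based (full-backup) update of V(s):
# it averages over all actions and all possible successor states S',
# using a known model P and R. All names here are illustrative.
def full_backup(V, s, actions, states, pi, P, R, gamma=0.9):
    # pi(a, s): probability of taking action a in state s under the policy
    # P(s2, s, a): transition probability to successor s2
    # R(s, a, s2): expected reward for that transition
    V[s] = sum(
        pi(a, s) * sum(P(s2, s, a) * (R(s, a, s2) + gamma * V[s2])
                       for s2 in states)
        for a in actions
    )
```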

SLIDE 3

[Diagram: from S1, actions A1, A2, A3 lead to successor states S2, S3, S1, with rewards R=2 and R=-1]

Model-based: use all branches

In model-free we take a step, and update based on this sample:

V(S1) ← V(S1) + α [r + γ V(S3) − V(S1)]

This has the form of a running-average update: ⟨V⟩ ← ⟨V⟩ + α (V − ⟨V⟩)
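
A minimal sketch of this single-sample update in Python; the dictionary V, the step size alpha, and the discount gamma are assumed names, and the example transition mirrors the sampled S1 to S3 step with reward 2 from the diagram.

```python
# Sketch of the sample-based (model-free) TD(0) update shown above.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update V[s] from a single observed transition (s, r, s_next)."""
    td_error = r + gamma * V[s_next] - V[s]   # observed target minus current estimate
    V[s] += alpha * td_error
    return td_error

# Example: the sampled step S1 -> S3 with reward r = 2
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0}
td0_update(V, "S1", 2.0, "S3")
```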

SLIDE 4

[Diagram: from state St, taking action A leads to S1 (alternative successors S2, S3), with reward r1]

On-line: starting from St, take an action A, ending at S1.

Update with the running-average form: ⟨V⟩ ← ⟨V⟩ + α (V − ⟨V⟩)

SLIDE 5

TD Prediction Algorithm

Terminology: Prediction -- computing Vπ(S) for a given π
Prediction error: [r + γ V(S') – V(S)]
Expected: V(S); observed: r + γ V(S')
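
A sketch of the tabular TD(0) prediction loop this terminology refers to. The environment interface (env.reset(), env.step()) and all constants are assumptions made for illustration, not part of the slides.

```python
# Sketch of tabular TD(0) prediction: estimate V_pi for a fixed policy pi.
from collections import defaultdict

def td0_prediction(env, pi, num_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                     # value estimate per state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                          # follow the given policy
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])    # prediction error = target - V[s]
            s = s_next
    return V
```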

SLIDE 6

[Diagram: from state St, taking action A leads to S1 (alternative successors S2, S3), with reward r1]

Learning a Policy: Exploration problem:

take an action A, ending at S1

Starting from St: update St, then update S1. We may never explore the alternative actions to A.

SLIDE 7

From Value to Action

  • Based on V(S), an action can be selected
  • ‘Greedy’ selection is not good enough

(Select the action A with the current max expected future reward)

  • Need for ‘exploration’
  • For example: ‘ε-greedy’ (see the sketch after this list)
  • Take the max-return action with p = 1-ε, and with p = ε one of the other actions
  • Can be a more complex decision
  • Done here in episodes
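
A minimal sketch of ε-greedy selection in Python; here exploration picks uniformly among all actions (the variant described on the SARSA slide), and q_values is an assumed name for the current action-value estimates.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability 1 - epsilon act greedily, otherwise pick a random action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit (greedy)
```
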
SLIDE 8

TD Policy Learning

ε-greedy performs exploration. It can be more complex, e.g. changing ε with time or with conditions.
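
One simple way to change ε with time, as the slide suggests, is a decaying schedule; the constants below are illustrative assumptions.

```python
# Illustrative epsilon schedule: start exploratory, decay toward a floor.
def epsilon_schedule(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    return max(eps_end, eps_start * (decay ** episode))
```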

SLIDE 9

TD ‘Actor-Critic’

Terminology: Prediction is the same as policy evaluation -- computing Vπ(S); this is the role of the ‘critic’, while the component that selects actions is the ‘actor’. Motivated by brain modeling.

SLIDE 10

‘Actor-critic’ scheme -- standard drawing

Motivated by brain modeling (E.g. Ventral striatum is the critic, dorsal striatum is the actor)
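
A sketch of a tabular TD actor-critic update matching the standard scheme: the critic maintains V and produces the TD error, and the actor maintains action preferences used through a softmax policy. All names and constants are illustrative assumptions, not taken from the slides.

```python
import math, random
from collections import defaultdict

V = defaultdict(float)                      # critic: state-value estimates
H = defaultdict(float)                      # actor: preference per (state, action)

def softmax_policy(s, actions):
    """Actor: sample an action with probability proportional to exp(H)."""
    weights = [math.exp(H[(s, a)]) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]

def actor_critic_update(s, a, r, s_next, alpha_v=0.1, alpha_h=0.1, gamma=0.9):
    delta = r + gamma * V[s_next] - V[s]    # critic's TD (prediction) error
    V[s] += alpha_v * delta                 # critic update
    H[(s, a)] += alpha_h * delta            # actor update: reinforce a by delta
    return delta
```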

SLIDE 11

Q-learning

  • The main algorithm used for model-free RL
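
The update rule itself is not shown in the extracted text; the standard tabular Q-learning update is sketched below (the table Q, the action set, and the constants are illustrative names).

```python
# Sketch of the tabular Q-learning update for one transition (s, a, r, s').
# Q maps (state, action) pairs to estimated returns, e.g. a defaultdict(float).
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # off-policy: uses max
```
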
SLIDE 12

Q-values (state-action)

[Diagram: from S1, actions A1, A2, A3 lead to successor states S2, S3, S1, with rewards R=2 and R=-1; the edges from S1 are annotated Q(S1, A1) and Q(S1, A3)]

Qπ(S,a) is the expected return starting from S, taking the action a, and thereafter following policy π.

SLIDE 13

Q-value (state-action)

  • The same update is done on Q-values rather than on V
  • Used in most practical algorithms and some brain models
  • Qπ(S,a) is the expected return starting from S, taking the action a, and thereafter following policy π:
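
The formula the last bullet points to is not in the extracted text; the standard definition of the state-action value under π (a textbook fact, not taken from the slide) is:

```latex
Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right]
```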

SLIDE 14

Q-values (state-action)

[Diagram: from S1, actions A1, A2, A3 lead to successor states S2, S3, S1, with rewards R=2 and R=-1; the edges from S1 are annotated Q(S1, A1) and Q(S1, A3)]

SLIDE 15

SARSA

It is called SARSA because it uses s(t), a(t), r(t+1), s(t+1), a(t+1). A step like this uses the current π, so that each S has its a = π(S).

SLIDE 16

SARSA RL Algorithm

Epsilon-greedy: with probability ε do not select the greedy action, but instead choose with equal probability among all actions.
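
A sketch of the full tabular SARSA control loop, combining ε-greedy selection with the SARSA update; the environment interface (env.reset(), env.step()) and all constants are assumed names, not from the slides.

```python
from collections import defaultdict
import random

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                              # Q[(state, action)]

    def select(s):                                      # epsilon-greedy on Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # uses (s, a, r, s', a')
            s, a = s_next, a_next
    return Q
```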

SLIDE 17

On Convergence

  • Using episodes:
  • Some of the states are ‘terminals’
  • When the computation reaches a terminal s, it stops
  • It re-starts at a new state s according to some probability
  • At the starting state, each action has a non-zero probability (exploration)
  • As the number of episodes goes to infinity, Q(S,A) will converge to Q*(S,A)