Model-Free Methods



  1. Model-Free Methods

  2. Model-based: use all branches. [Diagram: from state S1, actions A1, A2, A3 branch to successor states S2 and S3, with rewards R = 2 and R = -1 on some branches.] In model-based we update Vπ(S) using all the possible successors S'; in model-free we take a step and update based on this sample.

  3. Model-based: use all branches. [Same diagram as slide 2.] In model-free we take a step and update based on this sample: <V> ← <V> + α(V − <V>), i.e. V(S1) ← V(S1) + α[r + γV(S3) − V(S1)].
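As a worked example of this single-sample update (a sketch with hypothetical numbers α = 0.1, γ = 0.9, V(S1) = V(S3) = 0, and an observed reward r = 2, none of which are given on the slide):

```python
# One sample-based TD update, with illustrative values only.
alpha, gamma = 0.1, 0.9
V = {"S1": 0.0, "S3": 0.0}           # value estimates before the update

r = 2.0                              # reward observed on the sampled step
td_target = r + gamma * V["S3"]      # r + gamma * V(S3) = 2.0
V["S1"] += alpha * (td_target - V["S1"])
print(V["S1"])                       # 0.2
```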

  4. On-line: take an action A, ending at S1. [Diagram: from the current state St, action A yields reward r1 and leads to S1; S2 and S3 are other possible successors.] <V> ← <V> + α(V − <V>)

  5. TD Prediction Algorithm. Terminology: prediction -- computing Vπ(S) for a given π. Prediction error: [r + γV(S') − V(S)]. Expected: V(S); observed: r + γV(S').
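Putting the prediction rule into code, here is a minimal TD(0) prediction sketch; the environment interface (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) and all names are assumptions, not part of the slides:

```python
def td0_prediction(env, pi, alpha=0.1, gamma=0.9, episodes=1000):
    """Estimate V_pi by TD(0); `pi` maps a state to an action."""
    V = {}                                        # V(s), default 0 for unseen states
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                             # follow the given policy
            s_next, r, done = env.step(a)
            # prediction error: observed (r + gamma * V(s')) minus expected V(s)
            td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
            s = s_next
    return V
```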

  6. Learning a Policy: exploration problem. Take an action A, ending at S1. [Diagram: from St, action A with reward r1 leads to S1; S2 and S3 are reachable by other actions.] Update St, then update S1. We may never explore the alternative actions to A.

  7. From Value to Action
     • Based on V(S), an action can be selected.
     • ‘Greedy’ selection (select the action A with the current maximum expected future reward) is not good enough.
     • There is a need for ‘exploration’.
     • For example, ‘ε-greedy’: take the max-return action with probability 1 − ε, and with probability ε one of the other actions (see the sketch below).
     • The selection can be a more complex decision.
     • Done here in episodes.
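As an illustration of ε-greedy selection (a minimal sketch; Q maps (state, action) pairs to estimated returns, and the uniform choice over all actions follows the wording of slide 16):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore."""
    if random.random() < epsilon:
        return random.choice(actions)     # explore: uniform over the actions
    # exploit: the action with the current maximum estimated return
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```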

  8. TD Policy Learning (ε-greedy). ε-greedy performs exploration. It can be more complex, e.g. changing ε with time or with conditions.

  9. TD ‘Actor-Critic’. Terminology: prediction is the same as policy evaluation, computing Vπ(S); this is the role of the ‘critic’, while the ‘actor’ selects actions. Motivated by brain modeling.

  10. ‘Actor-critic’ scheme -- standard drawing. Motivated by brain modeling (e.g. the ventral striatum is the critic, the dorsal striatum is the actor).
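To make the actor/critic split concrete, here is one reading of the scheme as code (a sketch under assumed names: the critic keeps V(S) and computes the TD error, the actor keeps action preferences H(S, A) updated by that error and turned into a softmax policy; this is illustrative, not the slides' own implementation):

```python
import math
import random

def softmax_action(H, s, actions):
    """Actor: sample an action with probability proportional to exp(preference)."""
    prefs = [H.get((s, a), 0.0) for a in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]
    pick = random.random() * sum(weights)
    acc = 0.0
    for a, w in zip(actions, weights):
        acc += w
        if acc >= pick:
            return a
    return actions[-1]

def actor_critic_step(V, H, s, a, r, s_next, done,
                      alpha_critic=0.1, alpha_actor=0.1, gamma=0.9):
    """One on-line update after observing (s, a, r, s_next)."""
    target = r if done else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)                        # critic's prediction error
    V[s] = V.get(s, 0.0) + alpha_critic * delta           # critic: update V(s)
    H[(s, a)] = H.get((s, a), 0.0) + alpha_actor * delta  # actor: reinforce/punish a
```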

  11. Q-learning • The main algorithm used for model-free RL

  12. Q-values (state-action). [Diagram: state S1 with actions A1, A2, A3; the branch through A1 carries reward R = 2 and value Q(S1, A1), the branch through A3 carries reward R = -1 and value Q(S1, A3).] Qπ(S,a) is the expected return starting from S, taking the action a, and thereafter following policy π.

  13. Q-value (state-action)
     • The same update is done on Q-values rather than on V (see the sketch below).
     • Used in most practical algorithms and some brain models.
     • Qπ(S,a) is the expected return starting from S, taking the action a, and thereafter following policy π.
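As a sketch of the Q-learning form of this update (the max over next actions is standard Q-learning; the dictionary representation and argument names are assumptions):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    """One step: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = 0.0 if done else max(Q.get((s_next, b), 0.0) for b in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```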

  14. Q-values (state-action). [Diagram repeated from slide 12: state S1, actions A1 and A3 with values Q(S1, A1) and Q(S1, A3), rewards R = 2 and R = -1.]

  15. SARSA. It is called SARSA because it uses s(t), a(t), r(t+1), s(t+1), a(t+1). A step like this uses the current π, so that each S has its a = π(S).

  16. SARSA RL Algorithm. Epsilon-greedy: with probability ε, do not select the greedy action but choose with equal probability among all actions.
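A minimal episodic SARSA sketch with ε-greedy exploration; the environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and all names are assumptions:

```python
import random

def sarsa(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    Q = {}                                            # Q(s, a), default 0

    def select(s):                                    # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    for _ in range(episodes):
        s = env.reset()
        a = select(s)                                 # a(t)
        done = False
        while not done:
            s_next, r, done = env.step(a)             # r(t+1), s(t+1)
            a_next = select(s_next)                   # a(t+1), from the current policy
            target = r if done else r + gamma * Q.get((s_next, a_next), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s, a = s_next, a_next
    return Q
```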

  17. On Convergence
     • Using episodes: some of the states are ‘terminal’ states.
     • When the computation reaches a terminal state s, it stops.
     • It re-starts at a new state s according to some probability distribution.
     • At the starting state, each action has a non-zero probability (exploration).
     • As the number of episodes goes to infinity, Q(S,A) will converge to Q*(S,A).
