SLIDE 1

Monte Carlo Approaches to Reinforcement Learning

Robert Platt (w/ Marcus Gualtieri’s edits)
Northeastern University

SLIDE 2

Model Free Reinforcement Learning

Agent ↔ World loop:

  • Agent → World: joystick command
  • World → Agent: observed screen pixels; reward = game score

Goal: learn a value function through trial-and-error experience.

SLIDE 3

Model Free Reinforcement Learning

Agent ↔ World loop:

  • Agent → World: joystick command
  • World → Agent: observed screen pixels; reward = game score

Goal: learn a value function through trial-and-error experience.

Recall: $V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$

– the value of state $s$ when acting according to policy $\pi$.

SLIDE 4

Model Free Reinforcement Learning

Agent ↔ World loop:

  • Agent → World: joystick command
  • World → Agent: observed screen pixels; reward = game score

Goal: learn a value function through trial-and-error experience.

Recall: $V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$

– the value of state $s$ when acting according to policy $\pi$.

How?

SLIDE 5

Model Free Reinforcement Learning

Agent ↔ World loop:

  • Agent → World: joystick command
  • World → Agent: observed screen pixels; reward = game score

Goal: learn a value function through trial-and-error experience.

Recall: $V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$

– the value of state $s$ when acting according to policy $\pi$.

How? Simplest solution: average the returns observed after all previous visits to a given state – this is called a Monte Carlo method.
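To make the idea concrete, here is a minimal every-visit sketch in Python (the episode representation – a list of (state, reward) pairs, where each reward is the one received on leaving that state – is an assumption chosen for illustration):

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns observed after each
    visit to s (every-visit Monte Carlo)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:            # episode = [(state, reward), ...]
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G      # return-to-go from this state
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

For example, `mc_value_estimate([[("s1", 0), ("s2", -1)]])` returns `{"s1": -1.0, "s2": -1.0}`: with γ = 1 both states' returns equal the eventual -1.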

SLIDE 6

Running Example: Blackjack

State: agent’s card sum, dealer’s showing card, and whether the agent holds a usable ace
Actions: hit, stick
Objective: have the agent’s card sum be greater than the dealer’s without exceeding 21
Reward: +1 for winning, 0 for a draw, -1 for losing
Discounting: none (γ = 1; the task is episodic)
Dealer policy: draw until the sum is at least 17

SLIDE 7

Running Example: Blackjack

Blackjack “Basic Strategy” is a set of playing rules that maximizes expected return – well known in the gambling community. How might an RL agent learn the Basic Strategy?

SLIDE 8

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward

SLIDE 9

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(19, 10, no)

SLIDE 10

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(19, 10, no)

State = (agent’s sum, dealer’s card, usable ace?)

SLIDE 11

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(19, 10, no)   HIT

State = (agent’s sum, dealer’s card, usable ace?)

SLIDE 12

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(19, 10, no)   HIT      (22, 10, no)   -1

State = (agent’s sum, dealer’s card, usable ace?)

SLIDE 13

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(19, 10, no)   HIT      (22, 10, no)   -1

State = (agent’s sum, dealer’s card, usable ace?)

Bust! (reward = -1)

SLIDE 14

Monte Carlo Policy Evaluation: Example

Upon episode termination, make the following value function updates:

State          Action   Next State     Reward
(19, 10, no)   HIT      (22, 10, no)   -1
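Concretely (a worked version of the update, assuming the first-visit rule and γ = 1): the return following the episode’s only visited state is $G = -1$, so $-1$ is appended to $\mathit{Returns}(19, 10, \text{no})$ and the estimate is re-averaged:

$$V(19, 10, \text{no}) \leftarrow \operatorname{average}\bigl(\mathit{Returns}(19, 10, \text{no})\bigr).$$

If this is the first recorded visit to that state, the new estimate is $V(19, 10, \text{no}) = -1$.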
SLIDE 15

Monte Carlo Policy Evaluation: Example

Next episode...

SLIDE 16

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(13, 10, no)

SLIDE 17

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(13, 10, no)   HIT      (16, 10, no)

SLIDE 18

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(13, 10, no)   HIT      (16, 10, no)
(16, 10, no)

SLIDE 19

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(13, 10, no)   HIT      (16, 10, no)
(16, 10, no)   HIT      (19, 10, no)

SLIDE 20

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(13, 10, no)   HIT      (16, 10, no)
(16, 10, no)   HIT      (19, 10, no)
(19, 10, no)

SLIDE 21

Monte Carlo Policy Evaluation: Example

[Figure: dealer’s up card and the agent’s hand]

State          Action   Next State     Reward
(13, 10, no)   HIT      (16, 10, no)
(16, 10, no)   HIT      (19, 10, no)
(19, 10, no)   HIT      (21, 10, no)   +1

SLIDE 22

Monte Carlo Policy Evaluation: Example

Upon episode termination, make the following value function updates:

State          Action   Next State     Reward
(13, 10, no)   HIT      (16, 10, no)
(16, 10, no)   HIT      (19, 10, no)
(19, 10, no)   HIT      (21, 10, no)   +1
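Worked out: since γ = 1 and the only non-zero reward is the terminal +1, every state visited in this episode receives the same return $G = +1$, so $+1$ is appended to the return lists of $(13, 10, \text{no})$, $(16, 10, \text{no})$, and $(19, 10, \text{no})$. For example, $(19, 10, \text{no})$ already has the return $-1$ from the first episode, so its estimate becomes $(-1 + 1)/2 = 0$.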

SLIDE 23

Monte Carlo Policy Evaluation: Example

[Figure] Value function learned for the “hit everything except for 20 and 21” policy.

SLIDE 24

Monte Carlo Policy Evaluation

Given a policy $\pi$, estimate the value function $V^\pi(s)$ for all states $s \in \mathcal{S}$.

SLIDE 25

Monte Carlo Policy Evaluation

Given a policy $\pi$, estimate the value function $V^\pi(s)$ for all states $s \in \mathcal{S}$.

Monte Carlo Policy Evaluation (first visit):
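A minimal Python sketch of the first-visit algorithm (assuming a gymnasium-style `reset()`/`step()` environment interface and a policy given as a function `state -> action` – both interface choices are assumptions, not the slide’s exact pseudocode):

```python
from collections import defaultdict

def first_visit_mc_evaluation(env, policy, num_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation (sketch)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # 1. Generate an episode by following the policy.
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, reward))
            done = terminated or truncated
            state = next_state

        # 2. Update V only at the first visit to each state.
        first = {}
        for i, (s, _) in enumerate(episode):
            first.setdefault(s, i)          # keeps the earliest index
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```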

SLIDE 26

Monte Carlo Policy Evaluation

To get an accurate estimate of the value function, every state has to be visited many times.

[Figure: rollouts from all states]

SLIDE 27

Think-pair-share: frozenlake env

      0 1 2 3
  0   S F F F
  1   F H F H
  2   F F F H
  3   H F F G

(S = start, F = frozen, H = hole, G = goal)

States: grid world coordinates
Actions: L, R, U, D
Reward: 0 except at G, where r = 1

SLIDE 28

Think-pair-share: frozenlake env

      0 1 2 3
  0   S F F F
  1   F H F H
  2   F F F H
  3   H F F G

(S = start, F = frozen, H = hole, G = goal)

States: grid world coordinates
Actions: L, R, U, D
Reward: 0 except at G, where r = 1
Given: three episodes (as shown on the slide)
Calculate: values of the states on the top row, as computed by MC
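The three episodes themselves live in the slide figure, so the exercise is meant to be done by hand; as an empirical cross-check, the sketch below (assuming the `gymnasium` package, whose 4x4 `FrozenLake-v1` map matches this grid) estimates the same values by every-visit MC under a uniform-random policy:

```python
import gymnasium as gym
from collections import defaultdict

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

returns_sum = defaultdict(float)
returns_count = defaultdict(int)

for _ in range(10_000):
    state, _ = env.reset()
    episode, done = [], False
    while not done:
        action = env.action_space.sample()  # uniform-random policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, reward))
        done = terminated or truncated
        state = next_state

    G = 0.0
    for state, reward in reversed(episode):  # gamma = 1
        G = reward + G
        returns_sum[state] += G
        returns_count[state] += 1

# States are numbered row-major, so the top row is states 0..3.
for s in range(4):
    if returns_count[s]:
        print(f"V({s}) = {returns_sum[s] / returns_count[s]:.3f}")
```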

SLIDE 29

Monte Carlo Control

So far, we’ve only talked about policy evaluation… but RL requires us to find a policy, not just evaluate one. How? Key idea: evaluate and improve the policy iteratively – estimate the value function via rollouts, then improve the policy against the estimate.

SLIDE 30

Monte Carlo Control

Monte Carlo, Exploring Starts

SLIDE 31

Monte Carlo Control

Monte Carlo, Exploring Starts

Exploring starts: each episode starts with a random action taken from a random state.
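A compact sketch of the Monte Carlo ES loop (here `simulate(state, action, policy)` is a hypothetical helper that takes `action` in `state`, follows `policy` until termination, and returns the episode’s (state, action, reward) triples):

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, simulate, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (sketch)."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        s0 = random.choice(states)
        a0 = random.choice(actions)
        episode = simulate(s0, a0, policy)

        first = {}
        for i, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), i)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if first[(s, a)] == t:                      # first-visit update
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                # Greedy policy improvement at this state.
                policy[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return policy, Q
```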

SLIDE 32

Monte Carlo Control

Monte Carlo, Exploring Starts

SLIDE 33

Monte Carlo Control

Monte Carlo, Exploring Starts

Notice there is only one step of policy evaluation per iteration – that’s okay: each evaluation step moves the value function toward the current policy’s value, which is good enough to improve the policy.

SLIDE 34

Monte Carlo Control

SLIDE 35

Monte Carlo Control

[Figure: what the MC agent learned vs. the official “basic strategy”]

SLIDE 36

Monte Carlo Control: Convergence

SLIDE 37

Monte Carlo Control: Convergence

If $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$ for all $s$, then $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$ – i.e., $\pi'$ is at least as good as $\pi$.

SLIDE 38

Policy Improvement Theorem: Proof (Sketch)
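A reconstruction along the lines of the standard argument, repeatedly applying the assumption $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$ from the previous slide:

$$
\begin{aligned}
V^\pi(s) &\le Q^\pi(s, \pi'(s)) \\
&= \mathbb{E}\bigl[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s,\; A_t = \pi'(s)\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma Q^\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 V^\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr] = V^{\pi'}(s).
\end{aligned}
$$

Each application of the assumption pushes the switch to $\pi'$ one step further into the future; in the limit, all actions are chosen by $\pi'$, giving $V^{\pi'}(s)$.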

SLIDE 39

E-Greedy Exploration

Monte Carlo, Exploring Starts:

Without exploring starts, we are not guaranteed to explore the state/action space – why is this a problem? – what happens if we never experience certain transitions?

SLIDE 40

E-Greedy Exploration

Monte Carlo, Exploring Starts:

Without exploring starts, we are not guaranteed to explore the state/action space – why is this a problem? – what happens if we never experience certain transitions? Can we accomplish this without exploring starts?

SLIDE 41

E-Greedy Exploration

Monte Carlo, Exploring Starts:

Without exploring starts, we are not guaranteed to explore the state/action space – why is this a problem? – what happens if we never experience certain transitions? Can we accomplish this without exploring starts? Yes: create a stochastic (ε-greedy) policy

SLIDE 42

E-Greedy Exploration

Greedy policy: $\pi(s) = \arg\max_a Q(s, a)$

ε-greedy policy: with probability $1 - \varepsilon$, take $\arg\max_a Q(s, a)$; with probability $\varepsilon$, take an action drawn uniformly from $A(s)$

SLIDE 43

E-Greedy Exploration

Greedy policy: $\pi(s) = \arg\max_a Q(s, a)$

ε-greedy policy: with probability $1 - \varepsilon$, take $\arg\max_a Q(s, a)$; with probability $\varepsilon$, take an action drawn uniformly from $A(s)$

SLIDE 44

E-Greedy Exploration

Greedy policy: $\pi(s) = \arg\max_a Q(s, a)$

ε-greedy policy: with probability $1 - \varepsilon$, take $\arg\max_a Q(s, a)$; with probability $\varepsilon$, take an action drawn uniformly from $A(s)$

Guarantees every state/action pair will be visited infinitely often (in the limit of infinitely many episodes).

– Notice that this is a stochastic policy (not deterministic).
– This is an example of a soft policy.
– Soft policy: all actions in all states have non-zero probability.
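As a one-function sketch (function and argument names are chosen here for illustration):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon, pick an action uniformly at random;
    otherwise pick the greedy action under Q."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit
```

Every action then has probability at least ε/|A(s)| of being chosen, which is exactly what makes the policy soft.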

SLIDE 45

E-Greedy Exploration

Monte Carlo control with ε-greedy exploration:
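A sketch of the full on-policy loop, combining episode generation under the ε-greedy policy with first-visit updates of Q (gymnasium-style `reset()`/`step()` interface assumed):

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env, actions, num_episodes,
                              gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy policy."""
    Q = defaultdict(float)
    counts = defaultdict(int)

    for _ in range(num_episodes):
        # Generate an episode with the current epsilon-greedy policy.
        state, _ = env.reset()
        episode, done = [], False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)                     # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])  # exploit
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state

        # First-visit MC update of Q along the episode.
        first = {}
        for i, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), i)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if first[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q
```

Policy improvement is implicit here: each new episode is generated ε-greedily with respect to the freshly updated Q.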

SLIDE 46

Off-Policy Methods

  • On-policy methods evaluate or improve the policy that is used to make decisions.
  • Off-policy methods evaluate or improve a policy different from that used to generate the data.
  • The target policy is the policy (π) we wish to evaluate/improve.
  • The behavior policy is the policy (b) used to generate experiences.
  • Coverage: the behavior policy must take every action the target policy might take, i.e. $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$.
SLIDE 47

MC Summary

  • MC methods estimate the value function by doing rollouts.
  • Can estimate either the state value function, $V^\pi(s)$, or the action value function, $Q^\pi(s, a)$.
  • MC Control alternates between policy evaluation and policy improvement.
  • ε-greedy exploration explores all possible actions while preferring greedy actions.
  • Off-policy methods update a policy other than the one used to generate experience.