Chapter 5: Monte Carlo Methods

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

SLIDE 1

Chapter 5: Monte Carlo Methods

❐ Monte Carlo methods are learning methods
   ! Experience → values, policy
❐ Monte Carlo methods can be used in two ways:
   ! model-free: no model necessary, and still attains optimality
   ! simulated: needs only a simulation, not a full model
❐ Monte Carlo methods learn from complete sample returns
   ! only defined for episodic tasks (in this book)
❐ Like an associative version of a bandit method

SLIDE 2

Monte Carlo Policy Evaluation

❐ Goal: learn vπ(s)
❐ Given: some number of episodes under π which contain s
❐ Idea: average returns observed after visits to s
❐ Every-visit MC: average returns for every time s is visited in an episode
❐ First-visit MC: average returns only for the first time s is visited in an episode
❐ Both converge asymptotically

SLIDE 3

First-visit Monte Carlo policy evaluation

Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← an empty list, for all s ∈ S

Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
        G ← return following the first occurrence of s
        Append G to Returns(s)
        V(s) ← average(Returns(s))
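A minimal Python sketch of the boxed algorithm, not from the slides: `generate_episode` is an assumed helper that plays one episode under π and returns (state, reward) pairs, where the reward is the one received on leaving that state.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation (sketch)."""
    returns = defaultdict(list)   # Returns(s)
    V = defaultdict(float)        # V(s)
    for _ in range(num_episodes):
        episode = generate_episode()          # [(state, reward), ...]
        # Return following each time step, computed backwards.
        Gs, G = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            Gs[t] = G
        # Average returns only for the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(Gs[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```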

SLIDE 4

Blackjack example

❐ Object: have your card sum be greater than the dealer's without exceeding 21
❐ States (200 of them):
   ! current sum (12–21)
   ! dealer's showing card (ace–10)
   ! do I have a usable ace?
❐ Reward: +1 for winning, 0 for a draw, −1 for losing
❐ Actions: stick (stop receiving cards), hit (receive another card)
❐ Policy: stick if my sum is 20 or 21, else hit
❐ No discounting (γ = 1)
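These dynamics are simple enough to simulate directly. A minimal sketch, assuming an infinite deck (cards drawn with replacement) and the standard dealer rule of hitting below 17; naturals get no special treatment in this sketch.

```python
import random

# Infinite deck: ace counts as 1 here; 10, J, Q, K all count as 10.
CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def draw():
    return random.choice(CARDS)   # draw with replacement

def total(cards):
    """Best hand value, counting one ace as 11 when that does not bust."""
    t = sum(cards)
    return t + 10 if (1 in cards and t + 10 <= 21) else t

def play_hand():
    """One hand under the policy 'stick on 20 or 21, else hit'.
    Returns the reward: +1 win, 0 draw, -1 loss."""
    player = [draw(), draw()]
    dealer = [draw(), draw()]
    while True:                    # player's turn
        if total(player) > 21:
            return -1              # player busts
        if total(player) >= 20:
            break                  # stick on 20 or 21
        player.append(draw())
    while total(dealer) < 17:      # dealer hits below 17
        dealer.append(draw())
    if total(dealer) > 21 or total(player) > total(dealer):
        return +1
    return 0 if total(player) == total(dealer) else -1
```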

SLIDE 5

Learned blackjack state-value functions

SLIDE 6

Backup diagram for Monte Carlo

❐ Entire rest of episode included
❐ Only one choice considered at each state (unlike DP)
   ! thus, there will be an explore/exploit dilemma
❐ Does not bootstrap from successor states' values (unlike DP)
❐ Time required to estimate one state does not depend on the total number of states

SLIDE 7

The Power of Monte Carlo

e.g., Elastic Membrane (Dirichlet Problem)

How do we compute the shape of the membrane or bubble?

SLIDE 8

Two Approaches

❐ Relaxation
❐ Kakutani's algorithm, 1945

SLIDE 9

Monte Carlo Estimation of Action Values (Q)

❐ Monte Carlo is most useful when a model is not available
   ! we want to learn q*
❐ qπ(s, a): average return starting from state s and action a, following π
❐ Converges asymptotically if every state–action pair is visited
❐ Exploring starts: every state–action pair has a non-zero probability of being the starting pair

SLIDE 10

Monte Carlo Control

❐ MC policy iteration: policy evaluation using MC methods, followed by policy improvement
❐ Policy improvement step: greedify with respect to the value (or action-value) function

[Diagram: the GPI cycle; evaluation drives Q toward qπ, improvement drives π toward greedy(Q).]

SLIDE 11

Convergence of MC Control

❐ The greedified policy meets the conditions for policy improvement:

   qπk(s, πk+1(s)) = qπk(s, argmaxa qπk(s, a)) = maxa qπk(s, a) ≥ qπk(s, πk(s)) = vπk(s)

❐ And thus πk+1 ≥ πk by the policy improvement theorem
❐ This assumes exploring starts and an infinite number of episodes for MC policy evaluation
❐ To solve the latter:
   ! update only to a given level of performance
   ! alternate between evaluation and improvement per episode

SLIDE 12

Monte Carlo Exploring Starts

❐ Fixed point is the optimal policy π*
❐ Now proven (almost)

Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    π(s) ← arbitrary
    Returns(s, a) ← empty list

Repeat forever:
    Choose S0 ∈ S and A0 ∈ A(S0) such that all pairs have probability > 0
    Generate an episode starting from S0, A0, following π
    For each pair s, a appearing in the episode:
        G ← return following the first occurrence of s, a
        Append G to Returns(s, a)
        Q(s, a) ← average(Returns(s, a))
    For each s in the episode:
        π(s) ← argmaxa Q(s, a)
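A Python sketch of the box under stated assumptions: `states`, `actions(s)`, and `simulate(s0, a0, policy)` are hypothetical helpers supplied by the environment; `simulate` plays one episode that begins with the given state–action pair and then follows `policy`.

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, simulate, num_episodes, gamma=1.0):
    """Monte Carlo ES, first-visit version (sketch)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions(s)) for s in states}
    for _ in range(num_episodes):
        # Exploring starts: every (s, a) pair can begin an episode.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = simulate(s0, a0, policy)   # [(state, action, reward), ...]
        Gs, G = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            Gs[t] = G
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:           # first visit to the pair
                seen.add((s, a))
                returns[(s, a)].append(Gs[t])
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions(s), key=lambda b: Q[(s, b)])
    return policy, Q
```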

SLIDE 13

Blackjack example continued

❐ Exploring starts
❐ Initial policy as described before

[Figure: the learned optimal policy π* (STICK/HIT regions over player sum 11–21 versus dealer showing A–10, shown separately for usable and no usable ace) and the optimal state-value function v* over the same axes, ranging from −1 to +1.]

SLIDE 14

On-policy Monte Carlo Control

❐ On-policy: learn about the policy currently being executed
❐ How do we get rid of exploring starts?
   ! The policy must be eternally soft:
      – π(a|s) > 0 for all s and a
   ! e.g., an ε-soft policy, where the probability of an action is either
      – 1 − ε + ε/|A(s)| (for the max, i.e., greedy, action), or
      – ε/|A(s)| (for each non-max action)
❐ Similar to GPI: move the policy towards a greedy policy (e.g., ε-greedy)
❐ Converges to the best ε-soft policy

SLIDE 15

On-policy MC Control

Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    Returns(s, a) ← empty list
    π(a|s) ← an arbitrary ε-soft policy

Repeat forever:
    (a) Generate an episode using π
    (b) For each pair s, a appearing in the episode:
            G ← return following the first occurrence of s, a
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
    (c) For each s in the episode:
            A* ← argmaxa Q(s, a)
            For all a ∈ A(s):
                π(a|s) ← 1 − ε + ε/|A(s)|   if a = A*
                         ε/|A(s)|           if a ≠ A*
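Step (c) is the only part that differs from Monte Carlo ES. A small sketch of that update (illustrative names, not from the deck):

```python
def epsilon_greedy_update(policy, Q, s, actions, epsilon):
    """Step (c): make policy(.|s) epsilon-greedy with respect to Q.

    policy[s][a] holds the probability of taking a in state s.
    """
    best = max(actions, key=lambda a: Q[(s, a)])    # A* (ties: max's first winner)
    for a in actions:
        policy[s][a] = epsilon / len(actions)       # eps/|A(s)| for every action
    policy[s][best] += 1.0 - epsilon                # 1 - eps + eps/|A(s)| for A*
    return policy
```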

SLIDE 16

What we’ve learned about Monte Carlo so far

❐ MC has several advantages over DP:
   ! can learn directly from interaction with the environment
   ! no need for full models
   ! no need to learn about ALL states (no bootstrapping)
   ! less harmed by violating the Markov property (later in the book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
   ! exploring starts, soft policies

SLIDE 17

Off-policy methods

❐ Learn the value of the target policy π from experience generated by a behavior policy µ
❐ For example, π is the greedy policy (and ultimately the optimal policy) while µ is exploratory (e.g., ε-soft)
❐ In general, we only require coverage: µ must generate behavior that covers, or includes, π
   ! that is, µ(a|s) > 0 for every s and a at which π(a|s) > 0
❐ Idea: importance sampling. Weight each return by the ratio of the probabilities of the trajectory under the two policies

SLIDE 18

Importance Sampling Ratio

❐ Probability of the rest of the trajectory, after St, under π:

   Pr{At, St+1, At+1, ..., ST | St, At:T−1 ~ π}
      = π(At|St) p(St+1|St, At) π(At+1|St+1) ··· p(ST|ST−1, AT−1)
      = ∏_{k=t}^{T−1} π(Ak|Sk) p(Sk+1|Sk, Ak)

❐ In importance sampling, each return is weighted by the relative probability of the trajectory under the two policies; the unknown transition probabilities p cancel:

   ρ_t^T = ∏_{k=t}^{T−1} π(Ak|Sk) p(Sk+1|Sk, Ak) / ∏_{k=t}^{T−1} µ(Ak|Sk) p(Sk+1|Sk, Ak)
         = ∏_{k=t}^{T−1} π(Ak|Sk) / µ(Ak|Sk)

❐ This is called the importance sampling ratio
❐ All importance sampling ratios have expected value 1:

   E_{Ak~µ}[ π(Ak|Sk) / µ(Ak|Sk) ] = ∑_a µ(a|Sk) π(a|Sk) / µ(a|Sk) = ∑_a π(a|Sk) = 1

SLIDE 19

Importance Sampling

❐ New notation: time steps increase across episode boundaries (▨ marks a terminal state):

      . . . s . . . . ▨ . . . . . . ▨ . . . s . . . . ▨ . . .
   t: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

   T(s) = the set of time steps at which s was visited, e.g., T(s) = {4, 20}
   T(t) = the first termination time following t, e.g., T(4) = 9, T(20) = 25

❐ Ordinary importance sampling forms the estimate

   V(s) ≐ ∑_{t∈T(s)} ρ_t^{T(t)} Gt / |T(s)|

❐ Whereas weighted importance sampling forms the estimate

   V(s) ≐ ∑_{t∈T(s)} ρ_t^{T(t)} Gt / ∑_{t∈T(s)} ρ_t^{T(t)}
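The two estimators differ only in the denominator. A small sketch, assuming that for each visit to s we have already collected the trajectory ratio ρ and the return G:

```python
def is_estimates(samples):
    """Ordinary vs. weighted importance-sampling estimates of V(s).

    `samples` holds one (rho, G) pair per visit to s: the trajectory
    ratio from that visit to termination, and the observed return.
    """
    weighted_sum = sum(rho * G for rho, G in samples)
    total_weight = sum(rho for rho, _ in samples)
    ois = weighted_sum / len(samples)   # unbiased, but the variance can be huge
    wis = weighted_sum / total_weight if total_weight else 0.0  # biased, lower variance
    return ois, wis

def trajectory_ratio(steps, pi, mu):
    """rho = product over the (s, a) steps of pi(s, a) / mu(s, a);
    pi and mu are assumed callables returning action probabilities."""
    rho = 1.0
    for s, a in steps:
        rho *= pi(s, a) / mu(s, a)
    return rho
```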

SLIDE 20

Example of infinite variance under ordinary importance sampling

❐ One nonterminal state s, two actions, γ = 1:
   ! left: with probability 0.9, return to s with R = 0; with probability 0.1, terminate with R = +1
   ! right: terminate immediately with R = 0
❐ Target policy: π(left|s) = 1, so vπ(s) = 1
❐ Behavior policy: µ(left|s) = µ(right|s) = 1/2
❐ Importance-sampling ratios: π(left|s)/µ(left|s) = 2, π(right|s)/µ(right|s) = 0
   ! e.g., the trajectory s, left, 0, s, left, 0, ..., s, left, +1 with eight left actions has G0 = 1 and ρ_0^{T(0)} = 2^8
❐ The ordinary importance-sampling estimate

   V(s) ≐ ∑_{t∈T(s)} ρ_t^{T(t)} Gt / |T(s)|

   has infinite variance here, so it fails to converge

[Figure: Monte Carlo estimate of vπ(s) with ordinary importance sampling, ten runs; even after 100,000,000 episodes (log scale) the estimates have not settled at the true value of 1.]
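The example is easy to reproduce. A sketch of the stated dynamics (illustrative, not from the deck); although E[ρG] = 1 exactly, sample averages converge very poorly because the second moment diverges:

```python
import random

def episode_under_mu():
    """One episode under mu (left/right equiprobable). Returns (rho, G):
    the full-trajectory IS ratio for target pi(left|s) = 1, and the return."""
    rho = 1.0
    while True:
        if random.random() < 0.5:     # mu chooses 'right'
            return 0.0, 0.0           # pi never chooses right, so rho = 0
        rho *= 1.0 / 0.5              # pi(left|s) / mu(left|s) = 2
        if random.random() < 0.1:     # 'left' terminates with R = +1
            return rho, 1.0
        # otherwise 'left' loops back to s with R = 0

def ois_estimate(n):
    """Ordinary IS estimate of v_pi(s); the true value is 1.0."""
    return sum(rho * G for rho, G in (episode_under_mu() for _ in range(n))) / n

# e.g. [round(ois_estimate(100_000), 3) for _ in range(10)] scatters widely
```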

SLIDE 21

Example: Off-policy Estimation of the Value of a Single Blackjack State

❐ State: player sum 13, dealer showing 2, usable ace
❐ Target policy: stick only on 20 or 21
❐ Behavior policy: equiprobable random
❐ True value ≈ −0.27726

[Figure: mean squared error (average over 100 runs) versus episodes (log scale, 10 to 10,000) for ordinary and weighted importance sampling.]

SLIDE 22

Incremental off-policy every-visit MC policy evaluation (returns Q ≈ qπ)

Input: an arbitrary target policy π
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    C(s, a) ← 0

Repeat forever:
    µ ← any policy with coverage of π
    Generate an episode using µ: S0, A0, R1, ..., ST−1, AT−1, RT, ST
    G ← 0
    W ← 1
    For t = T−1, T−2, ... downto 0:
        G ← γG + Rt+1
        C(St, At) ← C(St, At) + W
        Q(St, At) ← Q(St, At) + [W / C(St, At)] [G − Q(St, At)]
        W ← W π(At|St) / µ(At|St)
        If W = 0 then exit the For loop
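A Python sketch of the box, assuming `generate_episode` samples under µ and that `pi(s, a)` and `mu(s, a)` are callables returning action probabilities (names are illustrative):

```python
from collections import defaultdict

def off_policy_mc_q(generate_episode, pi, mu, num_episodes, gamma=1.0):
    """Incremental every-visit off-policy MC evaluation, weighted IS (sketch)."""
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative sum of the weights seen at (s, a)
    for _ in range(num_episodes):
        episode = generate_episode()        # [(S0, A0, R1), (S1, A1, R2), ...]
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):   # t = T-1 down to 0
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi(s, a) / mu(s, a)
            if W == 0.0:                    # all earlier weights would be 0 too
                break
    return Q
```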

SLIDE 23

Off-policy every-visit MC control (returns π ≈ π*)

Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    C(s, a) ← 0
    π(s) ← argmaxa Q(s, a)   (with ties broken consistently)

Repeat forever:
    µ ← any soft policy
    Generate an episode using µ: S0, A0, R1, ..., ST−1, AT−1, RT, ST
    G ← 0
    W ← 1
    For t = T−1, T−2, ... downto 0:
        G ← γG + Rt+1
        C(St, At) ← C(St, At) + W
        Q(St, At) ← Q(St, At) + [W / C(St, At)] [G − Q(St, At)]
        π(St) ← argmaxa Q(St, a)   (with ties broken consistently)
        If At ≠ π(St) then exit the For loop
        W ← W · 1/µ(At|St)

❐ Target policy is greedy and deterministic; behavior policy is soft, typically ε-greedy
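The same sketch adapted to control: because the target policy is greedy in Q, π(At|St) is 1 for the greedy action and 0 otherwise, which yields the W ← W/µ(At|St) update and the early exit. Helper names are illustrative, as before.

```python
from collections import defaultdict

def off_policy_mc_control(generate_episode, mu, actions, num_episodes, gamma=1.0):
    """Off-policy every-visit MC control with weighted IS (sketch)."""
    Q = defaultdict(float)
    C = defaultdict(float)
    greedy = {}
    for _ in range(num_episodes):
        episode = generate_episode()            # [(S0, A0, R1), ...] under mu
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):       # t = T-1 down to 0
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(actions(s), key=lambda b: Q[(s, b)])
            if a != greedy[s]:                  # pi(At|St) = 0: ratio is zero
                break
            W *= 1.0 / mu(s, a)                 # pi(At|St) = 1 for the greedy At
    return greedy, Q
```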

SLIDE 24

Discounting-aware Importance Sampling (motivation)

❐ So far we have weighted returns without taking into account that they are a discounted sum
❐ This can't be the best one can do!
❐ For example, suppose γ = 0
   ! Then G0 will be weighted by the full ratio

      ρ_0^T = [π(A0|S0)/µ(A0|S0)] [π(A1|S1)/µ(A1|S1)] ··· [π(AT−1|ST−1)/µ(AT−1|ST−1)]

   ! But it really need only be weighted by

      ρ_0^1 = π(A0|S0)/µ(A0|S0)

   ! which would have much smaller variance

SLIDE 25

Discounting-aware Importance Sampling

❐ Define the flat partial return:

   Ḡ_t^h ≐ Rt+1 + Rt+2 + ··· + Rh,   0 ≤ t < h ≤ T

❐ Then the ordinary return decomposes into flat partial returns:

   Gt ≐ Rt+1 + γ Rt+2 + γ² Rt+3 + ··· + γ^{T−t−1} RT
      = (1 − γ) Rt+1
      + (1 − γ) γ (Rt+1 + Rt+2)
      + (1 − γ) γ² (Rt+1 + Rt+2 + Rt+3)
      + ···
      + (1 − γ) γ^{T−t−2} (Rt+1 + Rt+2 + ··· + RT−1)
      + γ^{T−t−1} (Rt+1 + Rt+2 + ··· + RT)
      = (1 − γ) ∑_{h=t+1}^{T−1} γ^{h−t−1} Ḡ_t^h + γ^{T−t−1} Ḡ_t^T
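The decomposition is easy to verify numerically; a small sketch for t = 0 with randomly drawn rewards (helper names are illustrative):

```python
import random

def check_flat_decomposition(rewards, gamma):
    """Verify G_0 = (1-g) * sum_{h=1}^{T-1} g^{h-1} Gbar_0^h + g^{T-1} Gbar_0^T,
    where Gbar_0^h = R_1 + ... + R_h is the flat partial return."""
    T = len(rewards)                              # rewards = [R_1, ..., R_T]
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    flat = lambda h: sum(rewards[:h])             # Gbar_0^h
    rhs = ((1 - gamma) * sum(gamma**(h - 1) * flat(h) for h in range(1, T))
           + gamma**(T - 1) * flat(T))
    return abs(G - rhs) < 1e-12

assert check_flat_decomposition([random.gauss(0, 1) for _ in range(10)], 0.9)
```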

slide-26
SLIDE 26

Discounting-aware Importance Sampling

❐ With the flat partial return Ḡ_t^h and the decomposition of Gt from the previous slide, each Ḡ_t^h needs only the importance-sampling ratio up to its own horizon h
❐ Ordinary discounting-aware IS:

   V(s) ≐ ∑_{t∈T(s)} [ (1 − γ) ∑_{h=t+1}^{T(t)−1} γ^{h−t−1} ρ_t^h Ḡ_t^h + γ^{T(t)−t−1} ρ_t^{T(t)} Ḡ_t^{T(t)} ] / |T(s)|

❐ Weighted discounting-aware IS:

   V(s) ≐ ∑_{t∈T(s)} [ (1 − γ) ∑_{h=t+1}^{T(t)−1} γ^{h−t−1} ρ_t^h Ḡ_t^h + γ^{T(t)−t−1} ρ_t^{T(t)} Ḡ_t^{T(t)} ]
        / ∑_{t∈T(s)} [ (1 − γ) ∑_{h=t+1}^{T(t)−1} γ^{h−t−1} ρ_t^h + γ^{T(t)−t−1} ρ_t^{T(t)} ]

SLIDE 27

Per-reward Importance Sampling

❐ Another way of reducing variance, even if γ = 1
❐ Uses the fact that the return is a sum of rewards, each of which can be weighted on its own:

   ρ_t^T Gt = ρ_t^T Rt+1 + γ ρ_t^T Rt+2 + ··· + γ^{k−1} ρ_t^T Rt+k + ··· + γ^{T−t−1} ρ_t^T RT

❐ In each term, the ratio factors that come after the reward's own time have expectation 1 and drop out:

   ρ_t^T Rt+k = [π(At|St)/µ(At|St)] [π(At+1|St+1)/µ(At+1|St+1)] ··· [π(At+k|St+k)/µ(At+k|St+k)] ··· [π(AT−1|ST−1)/µ(AT−1|ST−1)] Rt+k

   E[ρ_t^T Rt+k] = E[ρ_t^{t+k} Rt+k]

❐ so that

   E[ρ_t^T Gt] = E[ ρ_t^{t+1} Rt+1 + γ ρ_t^{t+2} Rt+2 + γ² ρ_t^{t+3} Rt+3 + ··· + γ^{T−t−1} ρ_t^T RT ] ≕ E[G̃t]

❐ Per-reward ordinary IS:

   V(s) ≐ ∑_{t∈T(s)} G̃t / |T(s)|
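A small sketch of computing G̃t from one episode tail, assuming we have the rewards and the per-step ratios π(Ak|Sk)/µ(Ak|Sk):

```python
def per_reward_is_return(rewards, step_ratios, gamma=1.0):
    """Per-reward importance-sampling return G~_t for one episode tail.

    rewards     = [R_{t+1}, ..., R_T]
    step_ratios = [pi(A_k|S_k) / mu(A_k|S_k) for k = t, ..., T-1]
    R_{t+k} is weighted only by the ratios up to its own time, rho_t^{t+k}.
    """
    g, rho, discount = 0.0, 1.0, 1.0
    for r, ratio in zip(rewards, step_ratios):
        rho *= ratio               # rho_t^{t+k}
        g += discount * rho * r
        discount *= gamma
    return g
```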

SLIDE 28

Summary

❐ MC has several advantages over DP:
   ! can learn directly from interaction with the environment
   ! no need for full models
   ! less harmed by violating the Markov property (later in the book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
   ! exploring starts, soft policies
❐ Introduced the distinction between on-policy and off-policy methods
❐ Introduced importance sampling for off-policy learning
❐ Introduced the distinction between ordinary and weighted IS
❐ Introduced two return-specific ideas for reducing IS variance
   ! discounting-aware and per-reward IS

SLIDE 29

Paths to a policy

[Diagram: four nodes (Experience, Model, Value function, Policy) linked by arrows labeled environmental interaction, direct RL methods, direct planning, and greedification.]

SLIDE 30

Paths to a policy

[Diagram: the same four nodes, now with a simulation arrow from Model to Experience; this path is labeled "Simulation-based RL".]

SLIDE 31

Paths to a policy

[Diagram: adds a model-learning arrow from Experience to Model; together with simulation, this path is labeled "Conventional Model-based RL".]

SLIDE 32

Paths to a policy

[Diagram: all arrows active at once (environmental interaction, direct RL, model learning, simulation, direct planning, greedification); this combined path is labeled "Dyna Model-based RL".]