

SLIDE 1

CMU-Q 15-381

Lecture 18: Reinforcement Learning I

Teacher: Gianni A. Di Caro

SLIDE 2

HOW REALISTIC ARE MDPS?

§ Assumption 1: the state is known exactly after performing an action
§ Do we always have an infinitely powerful “GPS” that tells us where we are in the world? Think of a robot moving in a building: how does it know where it is?
§ Relax the assumption: Partially Observable MDP (POMDP)
§ Assumption 2: known model of the dynamics and rewards of the world, T and R
§ Do we always know what the effect of our actions will be when chance is playing against us? Where do those numbers come from? Imagine having to fill in the T matrix for the actions of a wheeled robot on an icy surface…
§ Relax the assumption: Reinforcement Learning problems

SLIDE 3

REINFORCEMENT LEARNING

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

Goal: Maximize expected sum of future rewards

Memoryless stochastic reward process (MRP)

SLIDE 4

MDP PLANNING VS. REINFORCEMENT LEARNING

We don’t have a simulator! We have to actually learn what happens if we take an action in a state

Drawings by Ketrina Yim

SLIDE 5

REINFORCEMENT LEARNING PROBLEM

✓ The agent can “sense” the environment (it knows the state) and has goals
✓ Learning the effect of actions from interaction with the environment
§ Trial-and-error search
§ (Delayed) rewards (advisory signals ≠ error signals)
§ What actions to take? → Exploration–exploitation dilemma
§ The agent has to generate the training set by interaction

SLIDE 6

REINFORCEMENT LEARNING

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

Goal: Maximize expected sum of future rewards

Memoryless stochastic reward process (MRP)

SLIDE 7

PASSIVE REINFORCEMENT LEARNING

§ Before figuring out how to act, let’s first just try to figure out how good a (given) particular policy π is
§ Passive learning: the agent’s policy is fixed (i.e., in state s it always executes action π(s)) and the task is to estimate the policy’s value → learn state values V(s), or state–action values Q(s, a) → Policy evaluation

Policy evaluation in MDPs ∼ Passive RL

[Diagram: policy evaluation in MDPs uses the known (T, R) model and the Bellman equations; in passive RL the (T, R) model must be learned from experience]

SLIDE 8

PASSIVE REINFORCEMENT LEARNING

Two approaches

  • 1. Build a model

→ Solve with Value Iteration

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

T(s,a,s’)=0.8, R(s,a,s’)=4,…

SLIDE 9

PASSIVE REINFORCEMENT LEARNING

Two approaches:

  • 1. Build a model
  • 2. Model-free:

directly estimate Vπ
[Agent–environment loop diagram: the transition and reward models are unknown]

Vπ(s1)=1.8, Vπ(s2)=2.5,…

SLIDE 10

PASSIVE RL: BUILD A MODEL

  • 1. Build a model

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

T(s,a,s’)=0.8, R(s,a,s’)=4,…

SLIDE 11

Start at (1,1)


GRID WORLD EXAMPLE

SLIDE 12

Start at (1,1)
s=(1,1), action = tup (“try up”)

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 13

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 14

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 15

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 16

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 17

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

The gathered experience can be used to estimate the MDP’s T and R models

GRID WORLD EXAMPLE


SLIDE 18

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

Estimate of T(<1,2>, tup, <1,3>) = 1/2 (tup was taken twice in (1,2): once the agent stayed in (1,2), once it moved to (1,3))

The gathered experience can be used to estimate the MDP’s T and R models

GRID WORLD EXAMPLE


SLIDE 19

MODEL-BASED PASSIVE REINFORCEMENT LEARNING

  • 1. Follow policy π, observe transitions and rewards
  • 2. Estimate MDP model parameters T and R given the observed transitions and rewards

§ If finite set of states and actions, can just make a table, count, and average counts

  • 3. Use the estimated MDP to do policy evaluation of π

(using Value Iteration; a minimal sketch of the whole procedure follows below)
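Below is a minimal sketch of steps 1–3 (illustrative, not the lecture's code). It assumes the gathered experience is available as a list of episodes, each a list of (s, a, r, s′) tuples, and that the fixed policy π is a dictionary mapping states to actions; all function names and the iteration count are assumptions.

```python
# Minimal sketch of model-based passive RL (illustrative, not the lecture's code).
# Assumed data layout: episodes = list of episodes, each a list of (s, a, r, s_next)
# tuples gathered while following the fixed policy pi (a dict: state -> action).
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') by relative frequency and R(s,a,s') by average reward."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    reward_sum = defaultdict(float)                  # (s, a, s_next) -> summed reward
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a, s_next)] += r
    T, R = {}, {}
    for (s, a), successors in counts.items():
        total = sum(successors.values())
        for s_next, c in successors.items():
            T[(s, a, s_next)] = c / total                        # observed frequency
            R[(s, a, s_next)] = reward_sum[(s, a, s_next)] / c   # average observed reward
    return T, R

def evaluate_policy(T, R, pi, states, gamma=1.0, n_iters=500):
    """Policy evaluation on the estimated MDP: iterate the Bellman equation for pi."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: sum(T.get((s, pi.get(s), s2), 0.0)
                    * (R.get((s, pi.get(s), s2), 0.0) + gamma * V[s2])
                    for s2 in states)
             for s in states}
    return V
```

On the grid-world episodes shown earlier, these estimates are only defined for the state–action pairs that were actually visited, which is exactly the issue raised by the question below.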

Does this give us all the parameters for an MDP?

SLIDE 20

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

Estimate of T(<1,2>,tright,<1,3>)? No idea! Never tried this action…

SOME PARAMETERS ARE MISSING


GRID WORLD EXAMPLE

SLIDE 21

PASSIVE MODEL-BASED RL

§ Does this give us all the parameters of the underlying MDP? No.
§ But does that matter for computing the policy value? No: we don’t need to reconstruct the whole MDP to perform policy evaluation!
§ We have all the parameters we need: we have π(s), and we can assign non-zero probabilities to all observed transitions and zero to the unobserved ones
§ We do need to visit every state s ∈ S at least once in order to solve the Bellman equations for all states

Vπ(s) = E_π[ R(s_{t+1}) + γ·Vπ(s_{t+1}) | s_t = s ] = Σ_{s′∈S} p(s′ | s, π(s)) · [ R(s, π(s), s′) + γ·Vπ(s′) ]    ∀ s ∈ S

SLIDE 22

Episode 1 (start at (1,1)):
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1
Episode 2 (start at (1,1)):
s=(1,1), action = tup, s’=(2,1), r = −0.01
s=(2,1), action = tright, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(4,1), r = −0.01
s=(4,1), action = tleft, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(4,2), r = −1

Adaptation of a drawing by Ketrina Yim

2 episodes of experience in the MDP. Use them to estimate the MDP parameters & evaluate π. Is the computed policy value likely to be correct? (1) Yes (2) No (3) Not sure


PASSIVE MODEL-BASED RL

SLIDE 23

PASSIVE REINFORCEMENT LEARNING

Two Approaches:

  • 1. Build a model
  • 2. Model-free:

directly estimate Vπ
[Agent–environment loop diagram: the transition and reward models are unknown]

Vπ(s1)=1.8, Vπ(s2)=2.5,…

SLIDE 24

Episode 1 (start at (1,1)):
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1
Episode 2 (start at (1,1)):
s=(1,1), action = tup, s’=(2,1), r = −0.01
s=(2,1), action = tright, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(4,1), r = −0.01
s=(4,1), action = tleft, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(4,2), r = −1

Adaptation of a drawing by Ketrina Yim

2 episodes of (MDP) experiences


LET’S CONSIDER AN EPISODIC SCENARIO

Estimate of Vπ(1,1)? Averaging the returns of the two episodes, e1 and e2:

V̂π(1,1) = ½ · [ (1 + 7·(−0.01)) + (−1 + 5·(−0.01)) ] = ½ · (0.93 − 1.05) = −0.06

SLIDE 25

AVERAGING OBSERVED RETURNS

§ Averaging the returns from k episodes, G_1, G_2, ⋯, G_k

§ Arithmetic average:
  V_{k+1}(s) = (1/k) · Σ_{i=1}^{k} G_i

§ Incremental arithmetic average:
  V_{k+1}(s) = V_k(s) + (1/k) · (G_k − V_k(s))

§ Incremental weighted arithmetic average:
  § Weight of an episode: w_k
  § Sum of the weights over k episodes: W_k = Σ_{i=1}^{k} w_i
  V_{k+1}(s) = V_k(s) + (w_k / W_k) · (G_k − V_k(s))
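A small numerical check (illustrative, not from the slides) that the incremental update above reproduces the batch arithmetic average; the return values are made up:

```python
# Incremental arithmetic average: V <- V + (1/k) * (G_k - V)
returns = [0.93, -1.05, 0.40]          # hypothetical episode returns G_1, G_2, G_3
V, k = 0.0, 0
for G in returns:
    k += 1
    V += (G - V) / k                   # incremental update
print(V, sum(returns) / len(returns))  # same batch average (up to floating-point rounding)
```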

SLIDE 26

AVERAGING OBSERVED RETURNS

§ Exponentially-weighted average (moving average), with constant step size α:
  V_{k+1}(s) = V_k(s) + α · (G_k − V_k(s)) = (1 − α) · V_k(s) + α · G_k

§ The weights of past returns decrease exponentially:
  V_{k+1}(s) = (1 − α)^k · V_0(s) + Σ_{i=1}^{k} α · (1 − α)^{k−i} · G_i

  V_1(s) = (1 − α) · V_0(s) + α · G_1
  V_2(s) = (1 − α) · V_1(s) + α · G_2 = (1 − α) · [(1 − α) · V_0(s) + α · G_1] + α · G_2
         = (1 − α)² · V_0(s) + α · (1 − α) · G_1 + α · G_2

(Note: constant α vs. 1/k)
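And a matching check (again illustrative, not from the slides) that the constant-α recursion equals the explicit exponentially weighted sum written above:

```python
# Exponentially-weighted (moving) average with constant step size alpha.
alpha, V0 = 0.1, 0.0
returns = [0.93, -1.05, 0.40]          # hypothetical returns G_1, G_2, G_3

V = V0
for G in returns:
    V += alpha * (G - V)               # same as (1 - alpha) * V + alpha * G

k = len(returns)
V_explicit = (1 - alpha) ** k * V0 + sum(
    alpha * (1 - alpha) ** (k - i) * G for i, G in enumerate(returns, start=1))
print(V, V_explicit)                   # identical up to floating-point rounding
```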

SLIDE 27

DIRECT UTILITY ESTIMATION:

MONTE CARLO POLICY EVALUATION

1. Sample an episode following π
2. Observe the total return G_t (the reward collected along the sequence starting from s_t)
3. Use G_t as the learning target and update the sample estimate of the value V(s_t) of the starting state:

V(s_t) ← V(s_t) + α · (G_t − V(s_t))

~ Supervised learning error-correction

(works for both stationary and non-stationary environments)
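A minimal every-visit Monte Carlo policy-evaluation sketch along these lines (illustrative, not the lecture's code); sample_episode is an assumed helper that follows the fixed policy π and returns the visited (state, reward) pairs of one episode:

```python
# Every-visit Monte Carlo policy evaluation (illustrative sketch).
from collections import defaultdict

def mc_policy_evaluation(sample_episode, pi, n_episodes=1000, gamma=1.0, alpha=0.05):
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(pi)        # [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)]
        G = 0.0
        for s, r in reversed(episode):      # accumulate the return backwards
            G = r + gamma * G               # return observed from state s onwards
            V[s] += alpha * (G - V[s])      # move V(s) toward the observed return
    return V
```

The slide's rule updates the starting state s_t; updating every visited state with the return observed from it onwards is the usual every-visit variant of the same idea.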

SLIDE 28

DIRECT UTILITY ESTIMATION: MONTE CARLO POLICY EVALUATION

E.g., episode e_1 with rewards r_1, r_2, ⋯, r_T → sum of discounted rewards for e_1: r_1 + γ·r_2 + ⋯ + γ^(T−1)·r_T

[Diagram: a sample state trajectory generated by π, with the reward collected at each transition and the resulting episode return]
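For concreteness, a tiny helper (an assumption, not shown on the slide) that computes this discounted sum for one episode's reward sequence:

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + ... + gamma**(T-1) * r_T."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Earlier grid-world episode: seven -0.01 steps and a final +1, with gamma = 1
print(discounted_return([-0.01] * 7 + [1.0], gamma=1.0))   # approximately 0.93
```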

SLIDE 29

DIRECT UTILITY ESTIMATION: MONTE CARLO POLICY EVALUATION

Episode 1 (start at (1,1)):
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1
Episode 2 (start at (1,1)):
s=(1,1), action = tup, s’=(2,1), r = −0.01
s=(2,1), action = tright, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(4,1), r = −0.01
s=(4,1), action = tleft, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(4,2), r = −1

SLIDE 30

MONTE CARLO POLICY EVALUATION

Blackjack example from Sutton and Barto, pp. 93–94

SLIDE 31

MONTE CARLO POLICY EVALUATION

What information is missing when performing Monte Carlo policy evaluation for our MDP? (it’s an MDP, we just do not know its parameters …)

The recursive Bellman relations among states!

Vπ(s) = Σ_{s′∈S} p(s′ | s, π(s)) · [ R(s, π(s), s′) + γ·Vπ(s′) ]    ∀ s ∈ S

We should exploit this structure in the MDP…

SLIDE 32

VALUE ITERATION VS. MC

V(s_t) ← V(s_t) + α · (G_t − V(s_t))

For a known MDP we could generate a sample by using the T and R models

SLIDE 33

TEMPORAL DIFFERENCES (TD) POLICY EVALUATION

TD bootstraps on available state information

Update at each step! Sample + Bellman: the local learning target is r + γ·V(s′), computed from the single observed transition s → s′ and used to update V(s)

SLIDE 34

TEMPORAL DIFFERENCE LEARNING

§ No explicit model of T or R!
§ Estimate V through samples
§ Update after every experience: update V(s) after each state transition (s, a, r, s′)
§ Likely outcomes s′ will contribute updates more often
§ Don’t need episodes / terminal states, can keep updating!
§ Temporal-difference learning of values
§ Policy still fixed, still doing evaluation!
§ Move values toward a sample of V: moving average

Bellman sample of V(s): sample = R(s, π(s), s′) + γ·V(s′)
Update to V(s): V(s) ← (1 − α)·V(s) + α·sample
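A minimal tabular TD(0) sketch of exactly this sample-and-update step (illustrative, not the lecture's code); env_step is an assumed function that samples one transition of the unknown MDP:

```python
# Tabular TD(0) policy evaluation (illustrative sketch).
# Assumes env_step(s, a) -> (s_next, r, done) samples one transition, and
# pi is a dict mapping each state to the fixed policy's action.
from collections import defaultdict

def td0_policy_evaluation(env_step, pi, start_state, n_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            s_next, r, done = env_step(s, pi[s])
            sample = r + gamma * V[s_next]               # Bellman sample of V(s)
            V[s] = (1 - alpha) * V[s] + alpha * sample   # move V(s) toward the sample
            s = s_next
    return V
```

With α = 0.1 and γ = 1 this inner update reproduces the numbers in the worked example a few slides below (the first update of Vπ((1,1)) gives −0.001).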

SLIDE 35

TD EXAMPLE: GOING HOME

Value of a state: expected time to go = (your) predicted time to go (γ = 1). How does the value of the initial state change during the episode?

MC

TD: each error is proportional to the change over time of the prediction

Sutton and Barto book 2018 draft, Example 6.1, page 122

SLIDE 36

TD Learning Example Initialize all Vπ(s) values: Vπ(s) = 0

State   Vπ(s)
(1,1)    0
(1,2)    0
(1,3)    0
(4,1)    0
(2,1)    0
(2,3)    0
(3,1)    0
(3,2)    0
(3,3)    0


TD LEARNING EXAMPLE

SLIDE 37

TD Learning Example
s=(1,1), action = tup, s’=(1,2), r = −0.01 → update Vπ((1,1))

State   Vπ(s)
(1,1)    0
(1,2)    0
(1,3)    0
(4,1)    0
(2,1)    0
(2,3)    0
(3,1)    0
(3,2)    0
(3,3)    0

sample = −0.01 + γ·Vπ((1,2)) = −0.01
Vπ((1,1)) = (1 − α)·Vπ((1,1)) + α·sample = 0.9·0 + 0.1·(−0.01) = −0.001

α=0.1, γ=1


TD LEARNING EXAMPLE

SLIDE 38

TD Learning Example
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01

State   Vπ(s)
(1,1)   −0.001
(1,2)    0
(1,3)    0
(4,1)    0
(2,1)    0
(2,3)    0
(3,1)    0
(3,2)    0
(3,3)    0

α=0.1, γ=1


TD LEARNING EXAMPLE

SLIDE 39

UPDATING AND BOOTSTRAPPING

TD online — MC batch — TD batch

When can V(s_t) be updated?
MC: needs at least r_{t+1} and must wait until the terminal state s_T to complete the sample (the full return)
TD: uses r_{t+1} + γ·(V estimate of the next state) — this is the Bellman sample, available after a single step
TD batch: waits until s_T and then updates V(s_{T−1}), V(s_{T−2}), ⋯, V(s_t)

Sample backup

SLIDE 40

MC VS. TD

Value function estimation → Prediction problem

Monte Carlo Policy Evaluation vs. TD Policy Evaluation
§ Both use sample backups: based on single sampled successor states, not on the complete distribution of all successors
§ MC — no bootstrapping: the estimates for each state are independent; batch: update at the end of the episode
§ TD — bootstrapping: the estimate for one state builds upon the estimate of another state; online: incremental updating

SLIDE 41

MC CONVERGENCE TO Vπ?

§ As the number of visits to state s goes to infinity (by increasing the number of episodes), each sampled return for s is an independent, identically distributed estimate of Vπ(s) (i.e., each return is a utility value, sampled according to the policy π)
§ By the law of large numbers, the average of the episode returns converges to the expected value Vπ(s) as the number of episodes tends to infinity
§ Each return is itself an unbiased estimate of Vπ(s)
§ After n episodes, the standard deviation of the sample average decreases as 1/√n

SLIDE 42

LAW OF LARGE NUMBERS (FROM PROBABILITY THEORY)

§ We want to estimate the expected value of a random variable X
§ Consider a set of n independent realizations of the variable X: X_1, X_2, X_3, ⋯, X_n (i.e., a trial process)
§ The random variables X_i are independent and identically distributed (being all different realizations of the same random variable X): E[X_i] = μ, the (finite) expected value, for all i = 1, ⋯, n
§ Let X̄_n = (Σ_{i=1}^{n} X_i) / n be the sample average of the n variables
§ Then, for every ε > 0:  P(|X̄_n − μ| > ε) → 0, as n → ∞

While nothing is more uncertain than the duration of a single life, nothing is more certain than the average duration of a thousand lives. (Elizur Wright)

§ Describes what happens when performing the same experiment many times: after many trials, the average of the results should be close to the expected value and will be more accurate with more trials. § For Monte Carlo this means that we can learn properties of a random variable (mean, variance, etc.) simply by observing it or simulating it over many trials.
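A quick simulation (an illustration, not part of the slides): the running sample average of i.i.d. rolls of a fair die drifts toward its expected value 3.5 as the number of trials grows:

```python
import random

random.seed(0)
avg = 0.0
for n in range(1, 100_001):
    x = random.randint(1, 6)        # one realization X_n of a fair die roll
    avg += (x - avg) / n            # incremental sample average of the first n rolls
    if n in (10, 1_000, 100_000):
        print(n, round(avg, 3))     # approaches the expected value 3.5
```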

SLIDE 43

MC PROPERTIES (VS. VALUE ITERATION / DP)

§ How do we generate the episodes?

§ Direct online interaction (chance comes from nature)
§ Simulation (for many problems, e.g. Blackjack, we don’t need to know precisely all the probabilities and rewards of the MDP!)

§ The computational cost of estimating the value of a single state is independent of the number of states (no bootstrapping)
§ Focus: if only a restricted number of states are of interest, generate sample episodes starting from those states

SLIDE 44

CONVERGENCE OF TD?

§ With constant α, the weight of the k-th sample decreases exponentially with k (needed in non-stationary environments); the estimates V converge in mean to the true Vπ if 0 ≤ α ≤ 1 is constant and sufficiently small
§ Convergence depends on the learning rate α, which quantifies how much a new sample backup r + γ·V(s′) changes the current estimate V(s)
§ If 0 ≤ α_k ≤ 1 is, for each state (action), a sequence of decreasing step-size values satisfying the Robbins–Monro conditions for stochastic approximation (e.g., α_k(s) = 1/k ∀ s ∈ S):
  § Σ_{k≥1} α_k(s) = ∞ — learning steps are large enough to overcome random fluctuations and initial conditions
  § Σ_{k≥1} α_k²(s) < ∞ — learning steps get sufficiently small to guarantee convergence without fluctuating forever
  then the estimates V converge with probability 1 to the true Vπ

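As a quick check (not on the slide), the schedule α_k = 1/k mentioned above indeed satisfies both Robbins–Monro conditions:

```latex
\sum_{k=1}^{\infty} \alpha_k \;=\; \sum_{k=1}^{\infty} \frac{1}{k} \;=\; \infty
\quad \text{(the harmonic series diverges)},
\qquad
\sum_{k=1}^{\infty} \alpha_k^{2} \;=\; \sum_{k=1}^{\infty} \frac{1}{k^{2}} \;=\; \frac{\pi^{2}}{6} \;<\; \infty .
```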
SLIDE 45

LIMITATIONS OF PASSIVE LEARNING

§ The agent ultimately wants to learn how to act so as to gather high reward in the environment
§ So far we have addressed prediction problems, not control problems
§ Following a given deterministic policy gives no experience for the other actions (those not prescribed by the policy)

Active reinforcement learning: the agent decides what action to take with the goal of learning an optimal policy