CMU-Q 15-381
Lecture 18: Reinforcement Learning I
Teacher: Gianni A. Di Caro
How realistic are MDPs?
2
§ Assumption 1: the state is known exactly after performing an action
§ Do we always have an infinitely powerful “GPS” that tells us where we are in the world? Think of a robot moving in a building: how does it know where it is?
§ Relax the assumption: Partially Observable MDP (POMDP)
§ Assumption 2: known model of the dynamics and rewards of the world, T and R
§ Do we always know what the effect of our actions will be when chance is playing against us? Where do those numbers come from? Imagine filling in the T matrix for the actions of a wheeled robot on an icy surface …
§ Relax the assumption: Reinforcement Learning problems
3
Goal: Maximize expected sum of future rewards
Memoryless stochastic reward process (MRP)
4
Don’t have a simulator! Have to actually learn what happens if we take an action in a state
Drawings by Ketrina Yim
5
✓ The agent can “sense” the environment (it knows the state) and has goals
✓ Learning the effect of actions from interaction with the environment
§ Trial-and-error search
§ (Delayed) rewards (advisory signals ≠ error signals)
§ What actions to take? → Exploration-exploitation dilemma
§ The agent has to generate the training set by interaction
6
Goal: Maximize expected sum of future rewards
Memoryless stochastic reward process (MRP)
7
§ Before figuring out how to act, let’s first just try to figure out how good a (given) particular policy π is
§ Passive learning: the agent’s policy is fixed (i.e., in state s it always executes action π(s)) and the task is to estimate the policy’s value
→ Learn state values, V(s), or state-action values, Q(s, a) → Policy evaluation
(T, R) model → Bellman eqs.  /  (T, R) model learning
8
→ Solve (Value Iteration)
T(s,a,s’)=0.8, R(s,a,s’)=4,…
9
Vπ(s1)=1.8, Vπ(s2)=2.5,…
10
T(s,a,s’)=0.8, R(s,a,s’)=4,…
Start at (1,1)
11
Start at (1,1)
s=(1,1) action=tup (try up)
Adaption of drawing by Ketrina Yim
12
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
Adaption of drawing by Ketrina Yim
13
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup
Adaption of drawing by Ketrina Yim
14
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
Adaption of drawing by Ketrina Yim
15
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1
Adaption of drawing by Ketrina Yim
16
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1
Adaption of drawing by Ketrina Yim
The gathered experience can be used to estimate the MDP’s T and R models
17
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1
Adaption of drawing by Ketrina Yim
Estimate of T(<1,2>,tup,<1,3>) = 1/2
The gathered experience can be used to estimate the MDP’s T and R models
18
19
§ If the sets of states and actions are finite, we can just make a table, count transitions, and average the counts (see the sketch below)
→ Then do policy evaluation on the estimated model (using Value Iteration)
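A minimal counting sketch of this table-based estimation (in Python; the function and variable names are hypothetical, not from the lecture):

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from a list of observed (s, a, s_next, r) tuples."""
    counts = defaultdict(int)         # counts[(s, a, s_next)]
    totals = defaultdict(int)         # totals[(s, a)]
    reward_sums = defaultdict(float)  # reward_sums[(s, a, s_next)]

    for s, a, s_next, r in transitions:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, s_next)] += r

    T = {k: n / totals[(k[0], k[1])] for k, n in counts.items()}   # relative frequencies
    R = {k: reward_sums[k] / n for k, n in counts.items()}         # average observed rewards
    return T, R

# The two transitions observed from (1,2) under 'tup' in the first episode
transitions = [((1, 2), 'tup', (1, 2), -0.01), ((1, 2), 'tup', (1, 3), -0.01)]
T, R = estimate_model(transitions)
print(T[((1, 2), 'tup', (1, 3))])   # 0.5, matching the slide's estimate of 1/2
```

State-action pairs that were never tried simply get no entry at all, which is exactly the problem raised on the next slide.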
Does this give us all the parameters for an MDP?
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1
Adaption of drawing by Ketrina Yim
Estimate of T(<1,2>,tright,<1,3>)? No idea! Never tried this action…
20
21
§ Does this give us all the parameters of the underlying MDP?
§ No.
§ But does that matter for computing the policy value?
§ No, we don’t need to reconstruct the whole MDP to perform policy evaluation!
§ We have all the parameters we need!
§ We have π(s); we can assign non-zero probabilities to all …
§ We need to visit all states s ∈ S at least once in order to solve the Bellman equations for all states
Vπ(s) = Eπ[ R(s_{t+1}) + γ Vπ(s_{t+1}) | s_t = s ] = Σ_{s'∈S} p(s' | s, π(s)) [ R(s, π(s), s') + γ Vπ(s') ]   ∀ s ∈ S
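Once estimates of T and R are available (for example the dictionaries from the counting sketch above), these Bellman equations can be solved by simple iteration. A minimal sketch (hypothetical Python; it assumes terminal states have no outgoing transitions, so their value stays 0):

```python
def evaluate_policy(states, policy, T, R, gamma=1.0, tol=1e-8):
    """Iterate V(s) <- sum_s' T(s, pi(s), s') * (R(s, pi(s), s') + gamma * V(s')) to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy.get(s)
            v_new = sum(p * (R.get((s, a, s2), 0.0) + gamma * V.get(s2, 0.0))
                        for (s0, a0, s2), p in T.items() if s0 == s and a0 == a)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

With γ = 1 this converges as long as the policy eventually reaches a terminal state, which is the setting of the gridworld example.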
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1

s=(1,1) action=tup, s'=(2,1), r = -.01
s=(2,1) action=tright, s'=(3,1), r = -.01
s=(3,1) action=tup, s'=(4,1), r = -.01
s=(4,1) action=tleft, s'=(3,1), r = -.01
s=(3,1) action=tup, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(4,2), r = -1
Adaption of drawing by Ketrina Yim
2 episodes of experience in the MDP. Use them to estimate the MDP parameters & evaluate π. Is the computed policy value likely to be correct? (1) Yes (2) No (3) Not sure
22
23
Vπ(s1)=1.8, Vπ(s2)=2.5,…
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1

s=(1,1) action=tup, s'=(2,1), r = -.01
s=(2,1) action=tright, s'=(3,1), r = -.01
s=(3,1) action=tup, s'=(4,1), r = -.01
s=(4,1) action=tleft, s'=(3,1), r = -.01
s=(3,1) action=tup, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(4,2), r = -1
Adaption of drawing by Ketrina Yim
2 episodes of (MDP) experiences
24
Estimate of V(1,1)?
V̂(1,1) = ½ [ (1 + 7·(−0.01)) + (−1 + 5·(−0.01)) ] = ½ (0.93 − 1.05) = −0.06
Averaging the episode returns G_1 and G_2 (with γ = 1)
25
§ Averaging the returns from k episodes, G_1, G_2, ⋯, G_k
§ Arithmetic average:
  V_{k+1}(s) = (1/k) Σ_{i=1}^{k} G_i
§ Incremental arithmetic average:
  V_{k+1}(s) = V_k(s) + (1/k) (G_k − V_k(s))
§ Weighted incremental average — weight of an episode: w_i, sum over k episodes: W_k = Σ_{i=1}^{k} w_i:
  V_{k+1}(s) = V_k(s) + (w_k / W_k) (G_k − V_k(s))
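A tiny check (hypothetical Python) that the incremental update reproduces the batch arithmetic average, using the two episode returns from the gridworld example above:

```python
returns = [0.93, -1.05]           # episode returns G_1, G_2 from the gridworld example

V = 0.0
for k, G in enumerate(returns, start=1):
    V = V + (1.0 / k) * (G - V)   # incremental arithmetic average

print(V, sum(returns) / len(returns))   # both are ~ -0.06
```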
26
§ Exponentially-weighted average (moving average):
  V_{k+1}(s) = V_k(s) + α (G_k − V_k(s)) = (1 − α) V_k(s) + α G_k
§ Weights decrease exponentially:
  V_1(s) = (1 − α) V_0(s) + α G_1
  V_2(s) = (1 − α) V_1(s) + α G_2 = (1 − α) [ (1 − α) V_0(s) + α G_1 ] + α G_2 = (1 − α)² V_0(s) + α (1 − α) G_1 + α G_2
  In general: V_{k+1}(s) = (1 − α)^k V_0(s) + Σ_{i=1}^{k} α (1 − α)^{k−i} G_i
(Note: constant α vs. 1/k)
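A small sketch (hypothetical Python, with made-up returns) showing that the constant-α update and the exponentially-weighted closed form above give the same result, with recent returns weighted more:

```python
alpha = 0.1
V0 = 0.0
returns = [1.0, 2.0, 3.0, 4.0]    # made-up returns G_1..G_4

# Incremental constant-alpha update
V = V0
for G in returns:
    V = (1 - alpha) * V + alpha * G

# Closed form: (1-alpha)^k * V0 + sum_i alpha * (1-alpha)^(k-i) * G_i
k = len(returns)
V_closed = (1 - alpha) ** k * V0 + sum(
    alpha * (1 - alpha) ** (k - i) * G for i, G in enumerate(returns, start=1))

print(V, V_closed)   # both ~ 0.9049
```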
27
1. Sample an episode based on π
2. Observe the total return G_t (the reward collected along the sequence)
3. Use G_t as the learning target and update the sample estimate of the value V(s_t) of the starting state
V(s_t) ← V(s_t) + α ( G_t − V(s_t) )
~ Supervised learning error-correction
(works for both stationary and non-stationary environments)
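A minimal first-visit Monte Carlo policy-evaluation sketch (hypothetical Python; `sample_episode` is an assumed simulator that runs one episode under the fixed policy and is not part of the lecture material):

```python
def mc_policy_evaluation(sample_episode, num_episodes, alpha=0.1, gamma=1.0):
    """First-visit Monte Carlo evaluation of a fixed policy.

    sample_episode() -> list of (state, reward) pairs, where reward is the
    reward obtained after taking the policy's action in that state.
    """
    V = {}
    for _ in range(num_episodes):
        episode = sample_episode()
        # Suffix (discounted) returns: G_t = r_{t+1} + gamma * G_{t+1}
        returns = []
        G = 0.0
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # First-visit update: each state is moved toward its return once per episode
        seen = set()
        for (s, _), G in zip(episode, returns):
            if s not in seen:
                seen.add(s)
                V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))
    return V
```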
28
E.g., episode E_k with rewards r_1, r_2, ⋯, r_T → sum of discounted rewards for E_k: G = r_1 + γ r_2 + ⋯ + γ^{T−1} r_T
s_0 →(r_1) s_1 →(r_2) s_2 → ⋯ →(r_T) s_T
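A one-function sketch (hypothetical Python) of computing such a discounted return from a reward sequence:

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + ... + gamma^(T-1)*r_T."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# First episode of the gridworld example: seven -0.01 steps, then +1 (gamma = 1)
print(discounted_return([-0.01] * 7 + [1.0], gamma=1.0))   # 0.93
```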
29
Start at (1,1)
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,3), r = -.01
s=(1,3) action=tright, s'=(2,3), r = -.01
s=(2,3) action=tright, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(3,3), r = -.01
s=(3,3) action=tright, s'=(4,3), r = 1

s=(1,1) action=tup, s'=(2,1), r = -.01
s=(2,1) action=tright, s'=(3,1), r = -.01
s=(3,1) action=tup, s'=(4,1), r = -.01
s=(4,1) action=tleft, s'=(3,1), r = -.01
s=(3,1) action=tup, s'=(3,2), r = -.01
s=(3,2) action=tup, s'=(4,2), r = -1
30
Black Jack example from Sutton and Barto, p. 93-94
31
What information is missing when performing Monte Carlo policy evaluation for our MDP? (it’s an MDP, we just do not know its parameters …)
Vπ(s) = Σ_{s'∈S} p(s' | s, π(s)) [ R(s, π(s), s') + γ Vπ(s') ]   ∀ s ∈ S
We should exploit this structure in the MDP…
32
V(s_t) ← V(s_t) + α ( G_t − V(s_t) )
For a known MDP we could generate a sample by using the T and R models
33
TD bootstraps on available state information
Update at each step! Sample + Bellman: the local learning target is r + γ V(s'), used to update V(s)
34
§ No explicit model of T or R!
§ Estimate V through samples
§ Update after every experience
§ Update V(s) after each state transition (s, a, r, s')
§ Likely outcomes s' will contribute updates more often
§ Don’t need episodes / terminal states, can keep updating!
§ Temporal-difference learning of values
§ Policy still fixed, still doing evaluation!
§ Move values toward a sample of V: moving average
Bellman sample of V(s): sample = R(s, π(s), s') + γ V(s')
Update to V(s): V(s) ← (1 − α) V(s) + α · sample
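A minimal tabular TD(0) policy-evaluation sketch (hypothetical Python; `env_step` is an assumed one-step interface that executes the fixed policy's action, not something defined in the lecture):

```python
def td0_policy_evaluation(env_step, start_state, num_steps, alpha=0.1, gamma=1.0):
    """Tabular TD(0) evaluation of a fixed policy.

    env_step(s) -> (r, s_next, done): executes the policy's action in s and
    returns the observed reward, the next state, and an end-of-episode flag.
    """
    V = {}
    s = start_state
    for _ in range(num_steps):
        r, s_next, done = env_step(s)
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))  # Bellman sample
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))     # moving-average update
        s = start_state if done else s_next
    return V
```

With α = 0.1 and γ = 1 this is exactly the update worked out in the example on the next slides.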
35
Value of a state: expected time to go = (your) predicted time to go (γ = 1)
How does the value of the initial state change during the episode?
(Figure: MC vs. TD updates on the driving-home example — each TD error is proportional to the change over time in the prediction.)
Sutton and Barto book 2018 draft, Example 6.1, page 122
TD Learning Example
Initialize all Vπ(s) values: Vπ(s) = 0
Table: Vπ(s) = 0 for every state (1,1), (1,2), (1,3), (2,1), (2,3), (3,1), (3,2), (3,3), (4,1)
36
TD Learning Example
s=(1,1) action=tup, s'=(1,2), r = -.01 → Update Vπ((1,1))
sample = -0.01 + γ·Vπ((1,2)) = -0.01
Vπ((1,1)) = (1-α)·Vπ((1,1)) + α·sample = 0.9*0 + 0.1*(-0.01) = -0.001
Table: Vπ((1,1)) = -0.001; all other states still 0
α=0.1, γ=1
37
TD Learning Example
s=(1,1) action=tup, s'=(1,2), r = -.01
s=(1,2) action=tup, s'=(1,2), r = -.01 → Update Vπ((1,2))
sample = -0.01 + γ·Vπ((1,2)) = -0.01
Vπ((1,2)) = (1-α)·Vπ((1,2)) + α·sample = 0.9*0 + 0.1*(-0.01) = -0.001
Table: Vπ((1,1)) = -0.001, Vπ((1,2)) = -0.001; all other states still 0
α=0.1, γ=1
38
39
TD
MC batch vs. TD batch: updating V(s_t)?
MC: at least r_{t+1}, and we need to wait until s_T to complete the sample
TD: r_{t+1} + γ·(V estimate of the next state), this is the Bellman sample
TD batch: need to wait until s_T and then update V(s_{T−1}), V(s_{T−2}), ⋯, V(s_t)
Sample backup
40
Value function estimation → Prediction problem
Both: sample backups, based on single sample successor states, not on the complete distribution of all successors
Monte Carlo Policy Evaluation: no bootstrapping (the estimates for each state are independent); batch (update at the end of the episode)
TD Policy Evaluation: bootstrapping (the estimate for one state builds upon the estimate of another state); online (incremental updating)
41
§ As the number of visits to state s goes to infinity (by increasing the number of episodes), each Return(s) is an independent, identically distributed estimate of Vπ(s) (i.e., each return is a utility value, sampled according to the policy π)
§ By the law of large numbers, the average of the episode returns converges to the expected value Vπ(s) as the number of episodes tends to infinity
§ Each return is itself an unbiased estimate of Vπ(s)
§ After n episodes, the standard deviation of the sample average is σ/√n
42
§ We want to estimate the expected value of a random variable X
§ Let’s consider a set of n independent realizations of the variable X: X_1, X_2, X_3, ⋯, X_n (i.e., a trial process)
§ The random variables X_i are independent and identically distributed (being all different realizations of the same random variable X): E[X_i] = μ, the (finite) expected value, for all i = 1, ⋯, n
§ Let X̄_n = (Σ_{i=1}^{n} X_i) / n be the sample average of the n variables
§ Then, for every ε > 0: P( |X̄_n − μ| > ε ) → 0, as n → ∞
While nothing is more uncertain than the duration of a single life, nothing is more certain than the average duration of a thousand lives. (Elizur Wright)
§ Describes what happens when performing the same experiment many times: after many trials, the average of the results should be close to the expected value and will be more accurate with more trials. § For Monte Carlo this means that we can learn properties of a random variable (mean, variance, etc.) simply by observing it or simulating it over many trials.
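A tiny simulation (hypothetical Python, with a made-up random return) illustrating the point: the sample average of many observed returns approaches the true expected value:

```python
import random

random.seed(0)

def sample_return():
    # Toy random return: +1 with probability 0.6, -1 otherwise (true expected value = 0.2)
    return 1.0 if random.random() < 0.6 else -1.0

for n in (10, 100, 10_000):
    avg = sum(sample_return() for _ in range(n)) / n
    print(n, avg)   # the sample average tends to get closer to 0.2 as n grows
```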
43
§ How do we generate the episodes?
Direct online interaction (chance comes from nature)
Simulation (for many problems, e.g. Black Jack, we don’t need to know precisely all the probabilities and rewards of the MDP!)
§ The computational cost for estimating the value of a single state is independent of the number of states (no bootstrapping)
§ Focus: if only a restricted number of states are of interest, generate sample episodes starting from these states
44
§ With constant α, the weight of the k-th sample decreases exponentially with k (needed in non-stationary environments)
§ If 0 ≤ α ≤ 1 and, for each state (action), the step sizes form a decreasing sequence satisfying the Robbins-Monro conditions for stochastic approximation (e.g., α_k(s) = 1/k ∀ s ∈ S):
  Σ_{k≥1} α_k(s) = ∞ → learning steps are large enough to overcome random fluctuations and initial conditions
  Σ_{k≥1} α_k²(s) < ∞ → learning steps get sufficiently small to guarantee convergence without keeping fluctuating
  then the estimates V converge with probability 1 to the true Vπ
§ If 0 ≤ α ≤ 1 is constant and sufficiently small, the estimates V converge in mean to the true Vπ
§ Convergence depends on the learning rate α, which quantifies how much a new sample backup r + γV(s') changes the current estimate V(s)
45
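A small sketch (hypothetical Python, with a made-up noisy target) contrasting the two step-size schedules from the slide above: the Robbins-Monro schedule α_k = 1/k keeps converging on a stationary target, while a constant α keeps fluctuating around it (but can track a non-stationary one):

```python
import random

random.seed(1)
true_value = 0.93                  # made-up stationary target estimated from noisy samples

V_const, V_rm = 0.0, 0.0
alpha_const = 0.1
for k in range(1, 10_001):
    sample = true_value + random.gauss(0.0, 0.5)   # noisy "return"
    V_const += alpha_const * (sample - V_const)    # constant step size
    V_rm += (1.0 / k) * (sample - V_rm)            # Robbins-Monro: alpha_k = 1/k

print(V_rm)     # very close to 0.93
print(V_const)  # still fluctuating around 0.93
```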
§ The agent ultimately wants to learn to act so as to gather high reward in the environment
§ So far we have considered prediction problems, not control ones
§ Using a given deterministic policy gives no experience for other actions (those not included in the policy)
Active reinforcement learning: the agent decides what action to take with the goal of learning an optimal policy