SLIDE 1 DS595/CS525 Reinforcement Learning
Welcome to
Time: 6:00pm – 8:50pm R, Zoom Lecture, Fall 2020
This lecture will be recorded!!!
SLIDE 2 Last lecture
v Reinforcement Learning Components
§ Model, Value function, Policy
v Model-based Control
§ Policy Evaluation, Policy Iteration, Value Iteration
v Project 1 description.
SLIDE 3 Quiz 1 Week 4 (9/24 R)
v Model-based Control
§ Policy Evaluation, Policy Iteration, Value Iteration
§ 20 min at the beginning
- You can start as early as 5:55PM and finish as late as 6:20PM. The quiz duration is 20 minutes.
§ Log in to the class Zoom so you can ask questions about the quiz in the Zoom chat box.
Project 1 due Week 4 (9/24 R)
SLIDE 4 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review: Model-based control
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 5
Example: Taxi passenger-seeking task as a decision-making process
States: locations of the taxi (s1, ..., s6) on the road
Actions: Left or Right
Rewards: +1 in state s1, +3 in state s5, 0 in all other states
[Figure: road segment with states s1–s6 in a row]
SLIDE 6
RL components
v Often include one or more of
§ Model: representation of how the world changes in response to the agent's actions
§ Policy: function mapping the agent's states to actions
§ Value function: future rewards from being in a state and/or taking an action when following a particular policy
SLIDE 7
RL components: (1) Model
v Agent's representation of how the world changes in response to the agent's actions, with two parts:
Transition model predicts next agent state
p(s_{t+1} = s' | s_t = s, a_t = a)
Reward model predicts immediate reward r(s,a)
SLIDE 8 RL components: (2)Policy
v Policy π determines how the agent chooses
actions
§ π : S → A, mapping from states to actions
v Deterministic policy:
§ π(s) = a
§ In other words,
- π(a|s) = 1,
- π(a'|s) = π(a''|s) = 0
v Stochastic policy:
§ π(a|s) = Pr(a_t = a | s_t = s)
[Figure: actions a, a', a'' available from state s, for deterministic vs. stochastic policies]
SLIDE 9 RL components: (3)Value Function
v Value function V^π: expected discounted sum of future rewards under a particular policy π
v Discount factor γ weighs immediate vs. future rewards
v Can be used to quantify goodness/badness of states and actions
v And decide how to act by comparing policies
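Written out, a standard formulation of the return and value function (the slide's own equation is not in the transcript, so treat the exact notation as an assumption):

```latex
% Return from time t, with discount factor 0 <= gamma <= 1:
G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
% State-value function of policy pi:
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right]
```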
SLIDE 10
RL agents and algorithms
v Model-based: explicit model
v Model-free: no model
SLIDE 11 Find a good policy: Problem settings
Model-based control
v (Agent's internal computation)
§ Given model of how the world works
§ Dynamics and reward model
§ Algorithm computes how to act in order to maximize expected reward
Model-free control
v Computing while interacting with environment
§ Agent doesn't know how world works
§ Interacts with world to implicitly/explicitly learn how world works
§ Agent improves policy (may involve planning)
SLIDE 12 Find a good policy: Problem settings
Model-based control
v (Agent's internal computation)
§ Frozen Lake (Project 1)
§ Know all rules of the game / perfect model
§ Dynamic programming, tree search
Model-free control
v Computing while interacting with environment
§ Taxi passenger-seeking problem
§ Demand/traffic dynamics are uncertain
§ Huge state space
[Figure: candidate taxi routes Path 1, Path 2, Path 3]
SLIDE 13 Find a good policy: Problem settings
Model-based control
v Given: MDP
§ S, A, P, R, γ
v Output:
§ π
Model-free control
v Given: MDP without P and R
§ S, A, γ
v Unknown:
§ P, R
v Output:
§ π
SLIDE 14 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review: Model-based control
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 15 MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
SLIDE 16 Random Walks on Graphs
Examples: random walk sampling, random walk routing, influence diffusion, a molecule in liquid
SLIDE 17
Undirected Graphs
[Figure: undirected graph with nodes 1–6]
Undirected !!
SLIDE 18 Random Walk
v Adjacency matrix
v Transition probability matrix
v |E|: number of links
v Stationary distribution
[Figure: undirected graph on nodes 1–4 with degrees d = (3, 2, 3, 2)]
D = diag(3, 2, 3, 2)
A (symmetric, undirected) =
  [0 1 1 1]
  [1 0 1 0]
  [1 1 0 1]
  [1 0 1 0]
P = A · D^{-1} =
  [ 0   1/2  1/3  1/2]
  [1/3   0   1/3   0 ]
  [1/3  1/2   0   1/2]
  [1/3   0   1/3   0 ]
P_ij = A_ij / k_j  (1/k_j if (i, j) ∈ E, 0 otherwise)
Stationary distribution: π_i = d_i / (2|E|)
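As a quick check of these quantities, here is a minimal NumPy sketch. The edge set is inferred from the degree sequence d = (3, 2, 3, 2) shown in D, so treat the exact graph as an assumption about the slide's figure.

```python
import numpy as np

# Hypothetical 4-node undirected graph consistent with degrees d = (3, 2, 3, 2)
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)                  # node degrees
P = A @ np.diag(1.0 / d)           # column-stochastic transition matrix, P = A D^{-1}

# Stationary distribution of a random walk on an undirected graph: pi_i = d_i / (2|E|)
pi = d / A.sum()                   # A.sum() equals 2|E|
print(np.allclose(P @ pi, pi))     # True: pi is a fixed point of P
```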
SLIDE 19
SLIDE 20 A random walker: Markov Chain / Markov Process
[Figure: chain of states s1–s6; self-loop probability 0.7 at s1 and s6, 0.4 at s2–s5; probability 0.3 to each neighboring state]
SLIDE 21 A random walker: Markov Chain / Markov Process
[Figure: state chain s1–s6 with transition probabilities (as above)]
Distribution update after one step: s0 · P = s1 (here s0 and s1 denote the state distributions at times 0 and 1)
SLIDE 22 Taxi passenger-seeking task: Markov Process --- Episodes
[Figure: state chain s1–s6 with transition probabilities (as above)]
Example: sample episodes starting from s3:
v s3, s2, s2, s2, s1, s1, ...
v s3, s3, s4, s5, s6, s6, ...
v s3, s4, s5, s4, ...
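A minimal sketch of how such episodes can be sampled. The transition matrix below is reconstructed to match the probabilities in the figure (self-loops 0.7 at the ends, 0.4 in the middle, 0.3 to each neighbor); treat the exact values as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
states = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Row-stochastic transition matrix assumed from the figure
P = np.array([[0.7, 0.3, 0.0, 0.0, 0.0, 0.0],
              [0.3, 0.4, 0.3, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.3, 0.0, 0.0],
              [0.0, 0.0, 0.3, 0.4, 0.3, 0.0],
              [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
              [0.0, 0.0, 0.0, 0.0, 0.3, 0.7]])

def sample_episode(start=2, length=6):
    """Sample a trajectory from the Markov chain, starting at state index `start` (s3)."""
    s = start
    episode = [states[s]]
    for _ in range(length - 1):
        s = rng.choice(len(states), p=P[s])   # draw the next state from row s of P
        episode.append(states[s])
    return episode

print(sample_episode())   # e.g. ['s3', 's4', 's5', 's4', ...]
```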
SLIDE 23 MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
SLIDE 24
SLIDE 25 A random walker + rewards: Markov Reward Process (MRP)
[Figure: state chain s1–s6 with transition probabilities (as above)]
v Reward: +1 in s1, +3 in s5, 0 in all other states.
SLIDE 26
SLIDE 27
SLIDE 28 A random walker + rewards: Markov Reward Process
[Figure: state chain s1–s6 with transition probabilities (as above)]
v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
v Episode s3 (t=1), s4 (t=2), s5 (t=3), s5 (t=4): G1 = ? G3 = ?
SLIDE 29 A random walker + rewards: Markov Reward Process
[Figure: state chain s1–s6 with transition probabilities (as above)]
v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
v s3, s4, s5, s6: G1 = ?
v s3, s3, s4, s3: G1 = ?
v s3, s2, s1, s1: G1 = ?
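A short sketch of the return computation for these episodes, assuming the reward is collected at the state visited at each step (the exact indexing convention should follow the lecture's definition of G):

```python
def discounted_return(rewards, gamma=0.5):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# s3, s4, s5, s6 -> rewards [0, 0, 3, 0]
print(discounted_return([0, 0, 3, 0]))   # 0.75
# s3, s3, s4, s3 -> rewards [0, 0, 0, 0]
print(discounted_return([0, 0, 0, 0]))   # 0.0
# s3, s2, s1, s1 -> rewards [0, 0, 1, 1]
print(discounted_return([0, 0, 1, 1]))   # 0.375
```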
SLIDE 30
Samples:
v s3, s4, s5, s6, ...
v s3, s3, s4, s3, ...
v s3, s2, s1, s1, ...
v ...
[Figure: sampled paths Path 1, Path 2, Path 3 through the state chain]
SLIDE 31 Return vs. Value function
Samples:
v s3, s4, s5, s6, ...
v s3, s3, s4, s3, ...
v ...
[Figure: sampled paths Path 1, Path 2, Path 3 through the state chain; each sample path yields one return]
SLIDE 32
SLIDE 33
SLIDE 34
SLIDE 35
SLIDE 36 MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
SLIDE 37
SLIDE 38 Taxi passenger-seeking task: Markov Decision Process (MDP)
[Figure: state chain s1–s6; from each state the agent can take action a1 or a2]
Deterministic transition model
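A minimal sketch of this MDP in code. The mapping of a1/a2 to Left/Right, the boundary behavior, and the reward timing are assumptions for illustration, not the lecture's exact specification.

```python
# Hypothetical encoding of the taxi MDP with a deterministic transition model.
STATES = ["s1", "s2", "s3", "s4", "s5", "s6"]
ACTIONS = ["Left", "Right"]            # assumed to correspond to a1 / a2
REWARD = {"s1": 1, "s5": 3}            # +1 in s1, +3 in s5, 0 elsewhere

def step(state, action):
    """Deterministic transition: move one state left or right, staying put at the ends."""
    i = STATES.index(state)
    i = max(i - 1, 0) if action == "Left" else min(i + 1, len(STATES) - 1)
    next_state = STATES[i]
    return next_state, REWARD.get(next_state, 0)

print(step("s4", "Right"))             # ('s5', 3)
```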
SLIDE 39 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review:
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 40
SLIDE 41
SLIDE 42
For deterministic policy:
SLIDE 43 For deterministic and stochastic policy:
From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
SLIDE 44
SLIDE 45
SLIDE 46
SLIDE 47 (All-in-one algorithm)
From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
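The pseudocode on this slide is taken from Sutton and Barto and is not in the transcript. The "all-in-one" description suggests value iteration (policy evaluation and improvement folded into one backup); here is a hedged NumPy sketch under that assumption. The array layout (P with shape (A, S, S), R with shape (S, A)) is chosen only for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration for a finite MDP (a sketch, not the textbook's exact pseudocode).

    P : np.ndarray, shape (A, S, S), P[a, s, s2] = probability of s2 given state s, action a
    R : np.ndarray, shape (S, A),    R[s, a]     = expected immediate reward
    Returns the value estimate V and a greedy (deterministic) policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) * V(s2)
        Q = R + gamma * (P @ V).T          # (P @ V) has shape (A, S); transpose to (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)
        V = V_new
```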
SLIDE 48
Deterministic policy
SLIDE 49 From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
SLIDE 50 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review:
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 51
Review of Dynamic Programming for policy evaluation (model-based)
equivalently,
[Backup diagram: state, action, next states]
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
SLIDE 52
Review of Dynamic Programming for policy evaluation (model-based)
v Bootstrapping: update for V uses an estimate
v Known model P(s'|s,a) and r(s,a)
[Backup diagram: state, action, next states — bootstrapping]
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
SLIDE 53
Review of Dynamic Programming for policy evaluation (model-based)
v Requires model of MDP P(s’|s,a) and r(s,a)
v Bootstraps future return using value estimate
v Requires Markov assumption: bootstrapping regardless of history
[Backup diagram: state, action, next states — bootstrapping]
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
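A minimal NumPy sketch of this update (iterative policy evaluation with a known model). The array layout and the deterministic-policy representation are assumptions for illustration.

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: V_k(s) = E_pi[ r + gamma * V_{k-1}(s') ].

    P : np.ndarray, shape (A, S, S); R : np.ndarray, shape (S, A)
    pi: np.ndarray, shape (S,), a deterministic policy giving the action index for each state
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # One synchronous sweep: back up every state using the model and the current estimate V
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```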
SLIDE 54
Model-free Policy Evaluation
v What if we don't know the dynamics model P and/or the reward model R?
v Today: policy evaluation without a model
v Given data and/or the ability to interact in the environment, efficiently compute a good estimate of the value of a policy π
SLIDE 55
Model-free Policy Evaluation
v Monte Carlo (MC) policy evaluation
§ First-visit based
§ Every-visit based
v Temporal Difference (TD)
§ TD(0)
v Metrics to evaluate and compare algorithms
SLIDE 56 Monte Carlo (MC) policy evaluation
v Return of a trajectory under policy π
v Value function:
§ Expectation over trajectories T generated by following π
v Simple idea: Value = mean return
§ Sample a set of trajectories and average the returns
[Figure: from state s, sampled returns G1(s), G2(s), G3(s)]
SLIDE 57
SLIDE 58
SLIDE 59
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 60 Bias, Variance, MSE
v Biased vs unbiased estimator
§ Whether the bias is zero
v Consistent vs inconsistent estimator
§ Whether the estimator converges to the ground truth as n goes to infinity
SLIDE 61
SLIDE 62
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 63
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 64
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 65
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
?
How about α = 1?
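The equations on these slides are not in the transcript; the question about α refers to the standard incremental form of the Monte Carlo update, which can be written as follows (a standard formulation, stated here as an assumption about the missing slide content):

```latex
% After observing return G_{i,t} from state s in episode i:
N(s) \leftarrow N(s) + 1, \qquad
V(s) \leftarrow V(s) + \alpha \,\big( G_{i,t} - V(s) \big)
% \alpha = 1/N(s): exact running average of all observed returns
% constant \alpha < 1: exponentially weighted average (recent returns count more)
% \alpha = 1: V(s) is simply replaced by the most recent return
```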
SLIDE 66 MC on policy evaluation
[Figure: state chain s1–s6 with actions a1 and a2]
Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0] for any action
π(s) = a1, ∀s; γ = 1; any action from s1 or s6 terminates the episode
Given episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state?
Q2: Every-visit MC estimate of V(s2)?
SLIDE 67 Example: MC on policy evaluation
[Figure: state chain s1–s6 with actions a1 and a2]
Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0] for any action
π(s) = a1, ∀s; γ = 1; any action from s1 or s6 terminates the episode
Given episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state? V = [1, 1, 1, 0, 0, 0]
Q2: Every-visit MC estimate of V(s2)? V(s2) = 1
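A small sketch that reproduces these numbers (γ = 1; the rewards are read off the (state, action, reward) triples of the given episode):

```python
# First-visit and every-visit MC estimates for the episode on this slide.
episode = [("s3", 0), ("s3", 0), ("s2", 0), ("s1", 1)]     # (state, reward) pairs

# Return G_t from each time step (gamma = 1, so just the sum of remaining rewards)
returns_from = [sum(r for _, r in episode[t:]) for t in range(len(episode))]

first_visit, every_visit = {}, {}
for t, (s, _) in enumerate(episode):
    every_visit.setdefault(s, []).append(returns_from[t])   # record every visit
    if s not in first_visit:
        first_visit[s] = [returns_from[t]]                   # record only the first visit

print({s: sum(g) / len(g) for s, g in first_visit.items()})  # {'s3': 1.0, 's2': 1.0, 's1': 1.0}
print({s: sum(g) / len(g) for s, g in every_visit.items()})  # s3 averages the returns of both visits
```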
SLIDE 68
MC policy evaluation
v MC updates the value estimate using a sample of the return to approximate an expectation
[Backup diagram: full trajectory from the current state to the terminal state T]
SLIDE 69
MC policy evaluation limitations
v Generally high variance
§ Reducing variance can require a lot of data
v Requires episodic settings
§ Episode must end before data from that episode can be used to update the value function
[Backup diagram: full trajectory from the current state to the terminal state T]
SLIDE 70
Model-free Policy Evaluation
v Monte Carlo (MC) policy evaluation
§ First-visit based
§ Every-visit based
v Temporal Difference (TD)
§ TD(0)
§ Combination of MC and Dynamic Programming
“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” – Sutton and Barto 2017
SLIDE 71
MC + DP = TD
v Dynamic Programming (DP) policy evaluation
v Monte Carlo (MC) policy evaluation
v Temporal Difference (TD)
Rewritten as
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
SLIDE 73
MC + DP = TD
v Can be rewritten as
SLIDE 74 Example: TD policy evaluation
[Figure: state chain s1–s6 with actions a1 and a2]
Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0] for any action
π(s) = a1, ∀s; γ = 1; any action from s1 or s6 terminates the episode
Given episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state? V = [1, 1, 1, 0, 0, 0]
Q2: Every-visit MC estimate of V(s2)? V(s2) = 1
Q3: TD estimate of all states (initialized at 0) with α = 1?
SLIDE 75
TD(0) policy evaluation
v TD updates the value estimate using a sample of s_{t+1} to approximate the expectation
v TD updates the value estimate by bootstrapping, using the estimate of V(s_{t+1})
[Backup diagram: one-step transition; T, terminal state]
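A minimal sketch of the TD(0) backup, replayed on the episode from the earlier example (α = 1, γ = 1, values initialized at 0). The terminal-state handling (V(T) = 0) is an assumption.

```python
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    td_target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V

# Episode (s3, 0, s3), (s3, 0, s2), (s2, 0, s1), (s1, 1, T), with alpha = 1, gamma = 1
V = {s: 0.0 for s in ["s1", "s2", "s3", "s4", "s5", "s6", "T"]}
for (s, r, s_next) in [("s3", 0, "s3"), ("s3", 0, "s2"), ("s2", 0, "s1"), ("s1", 1, "T")]:
    td0_update(V, s, r, s_next, alpha=1.0)
print(V)   # only s1 is updated to 1; the other states keep their bootstrapped value of 0
```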
SLIDE 76
Policy evaluation: comparison
                            DP    MC    TD
Model-free method           No    Yes   Yes
Handles non-episodic case   Yes   No    Yes
No Markovian assumption     No    Yes   No
Consistent estimator        Yes   Yes   Yes
Unbiased estimator          --    Yes   No
(First-visit MC is unbiased; every-visit MC is biased but consistent.)
SLIDE 77 Next Lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review
§ Policy Iteration and Value Iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
v Model-Free Control
§ Monte Carlo control
§ Temporal-difference (TD) control
§ SARSA
§ Q-learning control
SLIDE 78
Any Comments & Critiques?