

SLIDE 1

Welcome to DS595/CS525 Reinforcement Learning

  • Prof. Yanhua Li
  • Time: 6:00pm – 8:50pm, R, Zoom Lecture, Fall 2020
  • This lecture will be recorded!!!

SLIDE 2

Last lecture

v Reinforcement Learning Components

§ Model, Value function, Policy

v Model-based Control

§ Policy Evaluation, Policy Iteration, Value Iteration

v Project 1 description.

SLIDE 3

Quiz 1 Week 4 (9/24 R)

v Model-based Control

§ Policy Evaluation, Policy Iteration, Value Iteration
§ 20 min at the beginning

  • You can start as early as 5:55PM and finish as late as 6:20PM. The quiz duration is 20 minutes.

§ Log in to the class Zoom so you can ask questions regarding the quiz in the Zoom chat box.

Project 1 due Week 4 (9/24 R)

SLIDE 4

This lecture

v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process

§ MP, MRP, MDP, POMDP

v Review: Model based control

§ Policy Iteration, and Value iteration

v Model-Free Policy Evaluation

§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation

SLIDE 5

Example: Taxi passenger-seeking task as a decision-making process

States: locations of the taxi (s1, ..., s6) on the road
Actions: Left or Right
Rewards: +1 in state s1, +3 in state s5, 0 in all other states

[Figure: road segment with states s1–s6]

SLIDE 6

RL components

v Often include one or more of

§ Model: Representation of how the world changes in response to agent’s action
§ Policy: function mapping agent’s states to actions
§ Value function: Future rewards from being in a state and/or action when following a particular policy

SLIDE 7

RL components: (1) Model

v Agent’s representation of how the world changes in response to agent’s action, with two parts:

Transition model predicts next agent state

p(s_{t+1} = s' | s_t = s, a_t = a)

Reward model predicts immediate reward r(s,a)

SLIDE 8

RL components: (2) Policy

v Policy π determines how the agent chooses actions

§ π : S → A, mapping from states to actions

v Deterministic policy:

§ π(s) = a
§ In other words:
  • π(a|s) = 1,
  • π(a’|s) = π(a’’|s) = 0, for all other actions a’, a’’

v Stochastic policy:

§ π(a|s) = Pr(a_t = a | s_t = s)
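As a small illustration (a sketch only; the state and action names below are made up, not taken from the taxi example), a deterministic policy can be stored as a plain state-to-action mapping, while a stochastic policy stores a distribution over actions for each state:

```python
import random

# Deterministic policy: each state maps to exactly one action, so π(a|s) = 1 for that action.
deterministic_pi = {"s1": "Right", "s2": "Right", "s3": "Left"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_pi = {
    "s1": {"Left": 0.2, "Right": 0.8},
    "s2": {"Left": 0.5, "Right": 0.5},
    "s3": {"Left": 0.9, "Right": 0.1},
}

def act(policy, state):
    """Sample an action from either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):  # stochastic: sample according to π(a|s)
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]
    return choice                 # deterministic: always the same action

print(act(deterministic_pi, "s1"))  # always "Right"
print(act(stochastic_pi, "s1"))     # "Right" with probability 0.8
```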

SLIDE 9

RL components: (3) Value Function

v Value function V^π: expected discounted sum of future rewards under a particular policy π

v Discount factor γ weighs immediate vs. future rewards
v Can be used to quantify goodness/badness of states and actions
v And decide how to act by comparing policies
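Written out (a standard form of this definition, using the discount factor γ above):

$$V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots \;\middle|\; s_t = s \right] \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \;\middle|\; s_t = s \right]$$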

SLIDE 10

RL agents and algorithms

v Model-based: explicit model
v Model-free: no model

SLIDE 11

Find a good policy: Problem settings

Model-based control
v (Agent’s internal computation)
§ Given model of how the world works
§ Dynamics and reward model
§ Algorithm computes how to act in order to maximize expected reward

Model-free control
v Computing while interacting with environment
§ Agent doesn’t know how the world works
§ Interacts with world to implicitly/explicitly learn how the world works
§ Agent improves policy (may involve planning)

SLIDE 12

Find a good policy: Problem settings

Model-based control
v (Agent’s internal computation)
§ Frozen Lake (Project 1)
§ Know all rules of game / perfect model
§ Dynamic programming, tree search

Model-free control
v Computing while interacting with environment
§ Taxi passenger-seeking problem
§ Demand/traffic dynamics are uncertain
§ Huge state space

SLIDE 13

Find a good policy: Problem settings

Model-based control
v Given: MDP
§ S, A, P, R, γ
v Output:
§ π

Model-free control
v Given: MDP without P and R
§ S, A, γ
v Unknown:
§ P, R
v Output:
§ π

SLIDE 14

This lecture

v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process

§ MP, MRP, MDP, POMDP

v Review: Model based control

§ Policy Iteration, and Value iteration

v Model-Free Policy Evaluation

§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation

SLIDE 15

MP, MRP, and MDP

v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process

SLIDE 16

Random Walks on Graphs

v Random walk sampling
v Random walk routing
v Influence diffusion
v Molecule in liquid

SLIDE 17

Undirected Graphs

[Figure: an undirected graph with nodes 1–6]

Undirected !!

SLIDE 18

Random Walk

v Adjacency matrix A (symmetric, since the graph is undirected)
v Degree matrix D = diag(d_1, ..., d_n); for the 4-node example, D = diag(3, 2, 3, 2)
v Transition probability matrix P = A · D^{-1}, with entries
  P_ij = 1/d_j if (i, j) is an edge, and 0 otherwise (so P_ii = 0)
v |E|: number of links
v Stationary distribution: π_i = d_i / (2|E|)
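A small NumPy sketch of these quantities. The exact edge set of the slide's 4-node example is not recoverable, so the graph below is one assumed graph with the same degree sequence (3, 2, 3, 2):

```python
import numpy as np

# An assumed 4-node undirected graph with degrees (3, 2, 3, 2): edges 1-2, 1-3, 1-4, 2-3, 3-4.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = A.sum(axis=0)            # node degrees
P = A @ np.diag(1.0 / d)     # P = A·D^{-1}: column j of A scaled by 1/d_j

num_edges = A.sum() / 2      # |E|
pi = d / (2 * num_edges)     # stationary distribution pi_i = d_i / (2|E|)

print(pi)                    # [0.3, 0.2, 0.3, 0.2]
print(P @ pi)                # equals pi: the stationary distribution is preserved
```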

SLIDE 19

SLIDE 20

A random walker: Markov Chain / Markov Process

[Figure: Markov chain over states s1–s6 with transition probabilities]

SLIDE 21

A random walker: Markov Chain / Markov Process

[Figure: Markov chain over states s1–s6 with transition probabilities]

s0 · P = s1: the state distribution after one step is the initial distribution s0 multiplied by the transition matrix P

SLIDE 22

Taxi passenger-seeking task: Markov Process --- Episodes

[Figure: Markov chain over states s1–s6 with transition probabilities]

Example: Sample episodes starting from s3:
  s3, s2, s2, s2, s1, s1, ...
  s3, s3, s4, s5, s6, s6, ...
  s3, s4, s5, s4, ...

SLIDE 23

MP, MRP, and MDP

v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process

SLIDE 24

SLIDE 25

A random walker + rewards: Markov Reward Process (MRP)

[Figure: Markov chain over states s1–s6 with transition probabilities]

v Reward: +1 in s1, +3 in s5, 0 in all other states.

SLIDE 26

SLIDE 27

SLIDE 28

A random walker + rewards: Markov Reward Process

[Figure: Markov chain over states s1–s6 with transition probabilities]

v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
v s3 (t=1), s4 (t=2), s5 (t=3), s5 (t=4):
  • G1 = ?
  • G3 = ?

SLIDE 29

A random walker + rewards: Markov Reward Process

[Figure: Markov chain over states s1–s6 with transition probabilities]

v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
v s3, s4, s5, s6: G1 = ?
v s3, s3, s4, s3: G1 = ?
v s3, s2, s1, s1: G1 = ?
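A minimal sketch of computing these sample returns in code (assuming, as the reward definition above suggests, that the reward is collected in each visited state):

```python
# Discounted return G_1 = r_1 + γ r_2 + γ^2 r_3 + ... for the sample episodes above.
gamma = 0.5
reward = {"s1": 1, "s2": 0, "s3": 0, "s4": 0, "s5": 3, "s6": 0}

def discounted_return(episode, gamma):
    return sum(reward[s] * gamma**t for t, s in enumerate(episode))

print(discounted_return(["s3", "s4", "s5", "s6"], gamma))  # 0 + 0 + 3/4 + 0   = 0.75
print(discounted_return(["s3", "s3", "s4", "s3"], gamma))  # all zero rewards   = 0.0
print(discounted_return(["s3", "s2", "s1", "s1"], gamma))  # 0 + 0 + 1/4 + 1/8 = 0.375
```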

SLIDE 30

Samples:
v s3, s4, s5, s6, ...
v s3, s3, s4, s3, ...
v s3, s2, s1, s1, ...
v ...

[Figure: sampled paths 1–3]

SLIDE 31

Return vs Value function

Samples:
v s3, s4, s5, s6, ...
v s3, s3, s4, s3, ...
v ...

[Figure: sampled paths 1–3]

(The return is computed along one sampled path; the value function is the expected return, averaged over such paths.)
SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

MP, MRP, and MDP

v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process

SLIDE 37

SLIDE 38

Taxi passenger-seeking task: Markov Decision Process (MDP)

[Figure: MDP over states s1–s6 with actions a1 and a2]

Deterministic transition model

SLIDE 39

This lecture

v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process

§ MP, MRP, MDP, POMDP

v Review:

§ Policy Iteration, and Value iteration

v Model-Free Policy Evaluation

§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation

SLIDE 40

SLIDE 41

SLIDE 42

For deterministic policy:

SLIDE 43

For deterministic and stochastic policy:

From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

(All-in-one algorithm)

From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition

SLIDE 48

Deterministic policy

SLIDE 49

From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition

SLIDE 50

This lecture

v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process

§ MP, MRP, MDP, POMDP

v Review:

§ Policy Iteration, and Value iteration

v Model-Free Policy Evaluation

§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation

SLIDE 51


Review of Dynamic Programming for policy evaluation (model-based)

equivalently,

[Backup diagram: state–action tree]

V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]

SLIDE 52


Review of Dynamic Programming for policy evaluation (model-based)

v Bootstrapping: the update for V uses the current estimate of V
v Known model P(s'|s, a) and r(s, a)

[Backup diagram: state–action tree, bootstrapping from the next state]

V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]

SLIDE 53


Review of Dynamic Programming for policy evaluation (model-based)

v Requires model of MDP P(s’|s,a) and r(s,a)

v Bootstraps future return using value estimate
v Requires Markov assumption: bootstrapping regardless of history

[Backup diagram: state–action tree, bootstrapping from the next state]

V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]

SLIDE 54


Model-free Policy Evaluation

v What if we don’t know the dynamics model P and/or the reward model R?
v Today: Policy evaluation without a model
v Given data and/or the ability to interact in the environment, efficiently compute a good estimate of the value of a policy π

SLIDE 55


Model-free Policy Evaluation

v Monte Carlo (MC) policy evaluation

§ First-visit based
§ Every-visit based

v Temporal Difference (TD)

§ TD(0)

v Metrics to evaluate and compare algorithms

SLIDE 56

Monte Carlo (MC) policy evaluation

v Return of a trajectory under policy π
v Value function:
§ Expectation over trajectories T generated by following π
v Simple idea: Value = mean return
§ Sample a set of trajectories and average the returns (see the sketch below)

[Figure: state s with sampled returns G1_t(s), G2_t(s), G3_t(s)]
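A minimal first-visit Monte Carlo sketch of this idea (the episode format of (state, reward) pairs is an assumption made for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    episodes: list of episodes gathered by following the policy,
              each a list of (state, reward) pairs ending at termination.
    Returns a dict mapping state -> estimated V(state).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute returns G_t backwards from the end of the episode.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # Only the first visit to each state in this episode contributes.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# The single episode used in the example slides below: (s3,0), (s3,0), (s2,0), (s1,1), T.
print(first_visit_mc([[("s3", 0), ("s3", 0), ("s2", 0), ("s1", 1)]]))
# {'s3': 1.0, 's2': 1.0, 's1': 1.0}
```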

SLIDE 57

SLIDE 58

SLIDE 59

For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …

SLIDE 60

Bias, Variance, MSE

v Biased vs unbiased estimator

§ Whether the bias is zero (unbiased) or not (biased)

v Consistent vs inconsistent estimator

§ Consistent if, as n goes to infinity, the estimator converges to the ground truth
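For reference, the standard definitions behind these bullets, for an estimator $\hat\theta$ of a quantity $\theta$:

$$\mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta, \qquad \mathrm{Var}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big], \qquad \mathrm{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big] = \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2$$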

SLIDE 61

SLIDE 62

For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …

SLIDE 63

For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …

SLIDE 64

For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …

SLIDE 65

For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …


How about α = 1?
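The update being iterated on these slides is presumably the incremental Monte Carlo update (the slide's equations are images, so this is the standard form written out):

$$V(s_t) \;\leftarrow\; V(s_t) + \alpha\,\big(G_t - V(s_t)\big)$$

With α = 1/N(s_t) this is exactly the running mean of the observed returns; with α = 1 the update replaces V(s_t) entirely with the most recent return G_t, forgetting all earlier episodes (which can be useful in non-stationary problems).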

SLIDE 66

MC on policy evaluation

[Figure: MDP over states s1–s6 with actions a1 and a2]

Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0]
For any action, π(s) = a1, ∀s; γ = 1. Any action from s1 and s6 terminates the episode.
Given the episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state?
Q2: Every-visit MC estimate of V(s2)?

SLIDE 67

Example: MC on policy evaluation

[Figure: MDP over states s1–s6 with actions a1 and a2]

Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0]
For any action, π(s) = a1, ∀s; γ = 1. Any action from s1 and s6 terminates the episode.
Given the episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state? V = [1, 1, 1, 0, 0, 0]
Q2: Every-visit MC estimate of V(s2)? V(s2) = 1

SLIDE 68


MC policy evaluation

v MC updates the value estimate using a sample of the return to approximate an expectation

[Backup diagram: state–action tree down to terminal state T]

SLIDE 69


MC policy evaluation limitations

v Generally high variance

§ Reducing variance can require a lot of data

v Requires episodic settings

§ The episode must end before data from that episode can be used to update the value function

[Backup diagram: state–action tree down to terminal state T]

SLIDE 70


Model-free Policy Evaluation

v Monte Carlo (MC) policy evaluation

§ First-visit based
§ Every-visit based

v Temporal Difference (TD)

§ TD(0)
§ Combination of MC and Dynamic Programming

“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” – Sutton and Barto 2017

SLIDE 71


MC + DP = TD

v Dynamic Programming (DP) policy evaluation
v Monte Carlo (MC) policy evaluation
v Temporal Difference (TD)

Rewritten as:

V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]

SLIDE 72

SLIDE 73

MC + DP = TD

v Can be rewritten as
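The standard TD(0) rewrite this refers to (reconstructed here, since the slide's equation is an image) replaces the expectation with a single sampled transition and bootstraps from the current value estimate:

$$V^{\pi}(s_t) \;\leftarrow\; V^{\pi}(s_t) + \alpha\,\big(\underbrace{r_t + \gamma V^{\pi}(s_{t+1})}_{\text{TD target}} - V^{\pi}(s_t)\big)$$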

SLIDE 74

Example: TD policy evaluation

[Figure: MDP over states s1–s6 with actions a1 and a2]

Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0]
For any action, π(s) = a1, ∀s; γ = 1. Any action from s1 and s6 terminates the episode.
Given the episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state? V = [1, 1, 1, 0, 0, 0]
Q2: Every-visit MC estimate of V(s2)? V(s2) = 1
Q3: TD estimate of all states (initialized at 0) with α = 1?
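A minimal TD(0) sketch for checking Q3 under the stated assumptions (α = 1, γ = 1, values initialized at 0; the transition-tuple format is assumed for illustration):

```python
def td0(episode, alpha=1.0, gamma=1.0, n_states=6):
    """Tabular TD(0) policy evaluation over a single episode.

    episode: list of (state_index, reward, next_state_index) transitions,
             with next_state_index = None for the terminal transition.
    """
    V = [0.0] * n_states
    for s, r, s_next in episode:
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])   # move V(s) toward the bootstrapped TD target
    return V

# The slide's episode with 0-indexed states: s3->s3, s3->s2, s2->s1, s1->terminal.
episode = [(2, 0, 2), (2, 0, 1), (1, 0, 0), (0, 1, None)]
print(td0(episode))   # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] -- only s1 picks up the reward
```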

SLIDE 75


TD(0) policy evaluation

v TD updates the value estimate using a sample of s_{t+1} to approximate the expectation
v TD updates the value estimate by bootstrapping, using the estimate of V(s_{t+1})

[Backup diagram: state–action tree down to terminal state T]

SLIDE 76


Policy evaluation

Property                      DP     MC                   TD
Model-free method             No     Yes                  Yes
Handles non-episodic case     Yes    No                   Yes
No Markovian assumption       No     Yes                  No
Consistent estimator          Yes    Yes                  Yes
Unbiased estimator            —      Yes (first-visit)    No

SLIDE 77

Next Lecture

v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process

§ MP, MRP, MDP, POMDP

v Review

§ Policy Iteration and Value Iteration

v Model-Free Policy Evaluation

§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation

v Model-Free Control

§ Monte Carlo control
§ Temporal-difference (TD) control
§ SARSA
§ Q-learning control

SLIDE 78

Any Comments & Critiques?