

SLIDE 1

CS885 Reinforcement Learning Lecture 2a: May 4, 2018

Intro to Markov decision processes. Readings: [SutBar] Chap. 3; [Sze] Chap. 2; [RusNor] Sec. 17.1-17.2, 17.4; [Put] Chap. 2, 4, 5

Pascal Poupart, University of Waterloo

SLIDE 2

Markov Decision Process

  • Markov process augmented with…
    – Actions, e.g., $a_t$
    – Rewards, e.g., $r_t$

[Figure: trajectory $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} s_4$, collecting rewards $r_0, r_1, r_2, r_3$ along the way]


SLIDE 3

Current Assumptions

  • Uncertainty: stochastic process
  • Time: sequential process
  • Observability: fully observable states
  • No learning: complete model
  • Variable type: discrete (e.g., discrete states and actions)


SLIDE 4

Rewards

  • Rewards: $r_t \in \mathbb{R}$
  • Reward function: $R(s_t, a_t) = r_t$
    – Mapping from state-action pairs to rewards
  • Common assumption: stationary reward function
    – $R(s_t, a_t)$ is the same $\forall t$
  • Exception: terminal reward function often different
    – E.g., in a game: 0 reward at each turn and +1/-1 at the end for winning/losing
  • Goal: maximize sum of rewards $\sum_t R(s_t, a_t)$


SLIDE 5

Discounted/Average Rewards

  • If process infinite, isn’t $\sum_t R(s_t, a_t)$ infinite?
  • Solution 1: discounted rewards
    – Discount factor: $0 \le \gamma < 1$
    – Finite utility: $\sum_t \gamma^t R(s_t, a_t)$ is a geometric sum (see the bound below)
    – $\gamma$ induces an inflation rate of $1/\gamma - 1$
    – Intuition: prefer utility sooner than later
  • Solution 2: average rewards
    – More complicated computationally
    – Beyond the scope of this course
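To see why the discounted sum stays finite (a standard geometric-series argument, not spelled out on the slide): if rewards are bounded by some $R_{\max}$, then

\[
\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty.
\]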


SLIDE 6

Markov Decision Process

  • Definition (captured as a data structure in the sketch below)
    – Set of states: $S$
    – Set of actions: $A$
    – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
    – Reward model: $R(s_t, a_t)$
    – Discount factor: $0 \le \gamma \le 1$
      • Discounted: $\gamma < 1$; undiscounted: $\gamma = 1$
    – Horizon (i.e., # of time steps): $h$
      • Finite horizon: $h \in \mathbb{N}$; infinite horizon: $h = \infty$
  • Goal: find optimal policy
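As a concrete illustration, the tuple above might be represented as plain data. This is a minimal sketch whose field names and dict layout are my own assumptions (reused by later sketches), not the lecture's notation:

```python
# Minimal sketch of an MDP (S, A, transition model, reward model, gamma, h).
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]                                # S
    actions: List[Action]                              # A
    P: Dict[Tuple[State, Action], Dict[State, float]]  # P[(s, a)][s2] = Pr(s2 | s, a)
    R: Dict[Tuple[State, Action], float]               # R[(s, a)] = reward
    gamma: float                                       # discount, 0 <= gamma <= 1
    horizon: float                                     # h: an int, or float('inf')
```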


SLIDE 7

Inventory Management

  • Markov Decision Process
    – States: inventory levels
    – Actions: {doNothing, orderWidgets}
    – Transition model: stochastic demand (sketched in code below)
    – Reward model: Sales − Costs − Storage
    – Discount factor: 0.999
    – Horizon: $\infty$
  • Tradeoff: increasing supplies decreases odds of missed sales, but increases storage costs
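To make the stochastic-demand transition model concrete, here is a hypothetical sketch; the capacity of 3 widgets, order size of 1, and 50/50 demand of 0 or 1 widget per step are all made-up parameters, not from the slide:

```python
# Hypothetical inventory transition model: the next stock level depends on
# the ordering action and on random demand (0 or 1 widget, each with prob. 1/2).
def inventory_transitions(max_stock=3, order_size=1, p_demand=0.5):
    P = {}
    for level in range(max_stock + 1):
        for action in ("doNothing", "orderWidgets"):
            # Stock after the order arrives, capped by warehouse capacity.
            stock = min(level + (order_size if action == "orderWidgets" else 0),
                        max_stock)
            dist = {}
            dist[stock] = dist.get(stock, 0.0) + (1 - p_demand)  # no demand
            sold = max(stock - 1, 0)                             # one widget sold
            dist[sold] = dist.get(sold, 0.0) + p_demand
            P[(level, action)] = dist
    return P
```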


SLIDE 8

Policy

  • Choice of action at each time step
  • Formally (a tiny example follows below):
    – Mapping from states to actions
    – i.e., $\pi(s_t) = a_t$
    – Assumption: fully observable states
      • Allows $a_t$ to be chosen based only on the current state $s_t$
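A deterministic stationary policy is then just a lookup table; the state and action names here are hypothetical:

```python
# A policy pi as a plain mapping from states to actions: pi(s_t) = a_t.
pi = {"lowStock": "orderWidgets", "highStock": "doNothing"}
a_t = pi["lowStock"]  # full observability: the action depends on s_t alone
```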


SLIDE 9

Policy Optimization

  • Policy evaluation (see the sketch after this list):
    – Compute expected utility
      $V^{\pi}(s_0) = \sum_{t=0}^{h} \gamma^t \sum_{s_t} \Pr(s_t \mid s_0, \pi) \, R(s_t, \pi(s_t))$
  • Optimal policy:
    – Policy with highest expected utility: $V^{\pi^*}(s_0) \ge V^{\pi}(s_0) \;\; \forall \pi$
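A minimal sketch of policy evaluation that unrolls the formula above for a finite horizon $h$: propagate the state distribution $\Pr(s_t \mid s_0, \pi)$ forward through the transition model and accumulate discounted expected rewards. The dict layout for P and R is the one assumed in the earlier MDP sketch; pi is a state-to-action dict.

```python
# Sketch of policy evaluation for a finite horizon h:
# V^pi(s0) = sum_{t=0}^{h} gamma^t * sum_s Pr(s_t = s | s0, pi) * R(s, pi(s)).
def evaluate_policy(P, R, pi, s0, gamma, h):
    dist = {s0: 1.0}                      # Pr(s_0 = s0) = 1
    value = 0.0
    for t in range(h + 1):
        # gamma^t times the expected reward under the current state distribution
        value += gamma**t * sum(p * R[(s, pi[s])] for s, p in dist.items())
        # Push the distribution one step forward: Pr(s_{t+1} | s0, pi)
        nxt = {}
        for s, p in dist.items():
            for s2, q in P[(s, pi[s])].items():
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return value
```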


SLIDE 10

Policy Optimization

  • Several classes of algorithms:
    – Value iteration
    – Policy iteration
    – Linear programming
    – Search techniques
  • Computation may be done
    – Offline: before the process starts
    – Online: as the process evolves


SLIDE 11

Value Iteration

  • Performs dynamic programming
  • Optimizes decisions in reverse order

[Figure: the trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_4$ again; value iteration works backwards from the end of the trajectory]


SLIDE 12

Value Iteration

  • Value when no time left:
    $V(s_h) = \max_{a_h} R(s_h, a_h)$
  • Value with one time step left:
    $V(s_{h-1}) = \max_{a_{h-1}} R(s_{h-1}, a_{h-1}) + \gamma \sum_{s_h} \Pr(s_h \mid s_{h-1}, a_{h-1}) \, V(s_h)$
  • Value with two time steps left:
    $V(s_{h-2}) = \max_{a_{h-2}} R(s_{h-2}, a_{h-2}) + \gamma \sum_{s_{h-1}} \Pr(s_{h-1} \mid s_{h-2}, a_{h-2}) \, V(s_{h-1})$
  • Bellman’s equation (implemented in the sketch below):
    $V(s_t) = \max_{a_t} R(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1})$
  • $a_t^* = \operatorname{argmax}_{a_t} \, R(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1})$


SLIDE 13

A Markov Decision Process

1 Poor & Unknown +0 Poor & Famous +0 Rich & Famous +10 Rich & Unknown +10 S S S S A A A A 1 1 ½ ½ ½ ½ ½ ½ ½ ½ ½ ½

γ = 0.9

You own a company. In every state you must choose between Saving money (S) or Advertising (A).


SLIDE 14

[Figure: the same MDP as the previous slide, with states abbreviated PU, PF, RU, RF]

γ = 0.9

! "($%) '($%) "($() '($() "()%) '()%) "()() '()() ℎ A,S A,S 10 A,S 10 A,S ℎ − 1 A,S 4.5 S 14.5 S 19 S ℎ − 2 2.03 A 8.55 S 16.53 S 25.08 S ℎ − 3 4.76 A 12.20 S 18.35 S 28.72 S ℎ − 4 7.63 A 15.07 S 20.40 S 31.18 S ℎ − 5 10.21 A 17.46 S 22.61 S 33.21 S


SLIDE 15

Finite Horizon

  • When $h$ is finite, the optimal policy is non-stationary
  • The best action may differ at each time step
  • Intuition: the best action varies with the amount of time left


SLIDE 16

Infinite Horizon

  • When $h$ is infinite, the optimal policy is stationary
  • The same best action applies at each time step
  • Intuition: the same (infinite) amount of time is left at each time step, hence the same best action
  • Problem: value iteration would require an infinite number of iterations…


SLIDE 17

Infinite Horizon

  • Assuming a discount factor $\gamma$, after $k$ time steps rewards are scaled down by $\gamma^k$
  • For large enough $k$, rewards become insignificant since $\gamma^k \to 0$
  • Solution (see the bound below for how large $k$ must be):
    – Pick a large enough $k$
    – Run value iteration for $k$ steps
    – Execute the policy found at the $k$-th iteration
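One common way to make "large enough" precise (a standard bound, not stated on the slide): if rewards are bounded by some $R_{\max}$, the utility ignored by stopping after $k$ steps is at most

\[
\sum_{t=k}^{\infty} \gamma^t R_{\max} \;=\; \frac{\gamma^k R_{\max}}{1-\gamma} \;\le\; \epsilon
\quad\Longleftrightarrow\quad
k \;\ge\; \frac{\log\!\big(\epsilon(1-\gamma)/R_{\max}\big)}{\log \gamma},
\]

so running value iteration for that many steps keeps the truncated utility within $\epsilon$ of the infinite-horizon one.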
