

SLIDE 1

CS885 Reinforcement Learning Lecture 2a: May 4, 2018

Intro to Markov decision processes. Readings: [SutBar] Chap. 3; [Sze] Chap. 2; [RusNor] Sec. 17.1-17.2, 17.4; [Put] Chap. 2, 4, 5

Pascal Poupart, University of Waterloo

SLIDE 2

Markov Decision Process

  • Markov process augmented with…
    – Actions, e.g., $a_t$
    – Rewards, e.g., $r_t$

[Figure: trajectory $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} s_4$, collecting rewards $r_0, r_1, r_2, r_3$ along the way]


SLIDE 3

Current Assumptions

  • Uncertainty: stochastic process
  • Time: sequential process
  • Observability: fully observable states
  • No learning: complete model
  • Variable type: discrete (e.g., discrete states and actions)


SLIDE 4

Rewards

  • Rewards: $r_t \in \mathbb{R}$
  • Reward function: $R(s_t, a_t) = r_t$
    – Mapping from state-action pairs to rewards
  • Common assumption: stationary reward function
    – $R(s_t, a_t)$ is the same $\forall t$
  • Exception: terminal reward function often different
    – E.g., in a game: 0 reward at each turn and +1/-1 at the end for winning/losing
  • Goal: maximize sum of rewards $\sum_t R(s_t, a_t)$


SLIDE 5

Discounted/Average Rewards

  • If process infinite, isn’t $\sum_t R(s_t, a_t)$ infinite?
  • Solution 1: discounted rewards
    – Discount factor: $0 \le \gamma < 1$
    – Finite utility: $\sum_t \gamma^t R(s_t, a_t)$ is a geometric sum (see the bound below)
    – $\gamma$ induces an inflation rate of $1/\gamma - 1$
    – Intuition: prefer utility sooner than later
  • Solution 2: average rewards
    – More complicated computationally
    – Beyond the scope of this course
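To see why the discounted sum stays finite (a standard geometric-series argument, not spelled out on the slide): if rewards are bounded by some $R_{\max}$, then

\[
\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty.
\]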


SLIDE 6

Markov Decision Process

  • Definition (captured as a data structure in the sketch below)
    – Set of states: $S$
    – Set of actions: $A$
    – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
    – Reward model: $R(s_t, a_t)$
    – Discount factor: $0 \le \gamma \le 1$
      • Discounted: $\gamma < 1$; undiscounted: $\gamma = 1$
    – Horizon (i.e., # of time steps): $h$
      • Finite horizon: $h \in \mathbb{N}$; infinite horizon: $h = \infty$
  • Goal: find optimal policy
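As a concrete illustration, the tuple above might be represented as plain data. This is a minimal sketch whose field names and dict layout are my own assumptions (reused by later sketches), not the lecture's notation:

```python
# Minimal sketch of an MDP (S, A, transition model, reward model, gamma, h).
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]                                # S
    actions: List[Action]                              # A
    P: Dict[Tuple[State, Action], Dict[State, float]]  # P[(s, a)][s2] = Pr(s2 | s, a)
    R: Dict[Tuple[State, Action], float]               # R[(s, a)] = reward
    gamma: float                                       # discount, 0 <= gamma <= 1
    horizon: float                                     # h: an int, or float('inf')
```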


SLIDE 7

Inventory Management

  • Markov Decision Process
    – States: inventory levels
    – Actions: {doNothing, orderWidgets}
    – Transition model: stochastic demand (sketched in code below)
    – Reward model: Sales − Costs − Storage
    – Discount factor: 0.999
    – Horizon: $\infty$
  • Tradeoff: increasing supplies decreases odds of missed sales, but increases storage costs
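To make the stochastic-demand transition model concrete, here is a hypothetical sketch; the capacity of 3 widgets, order size of 1, and 50/50 demand of 0 or 1 widget per step are all made-up parameters, not from the slide:

```python
# Hypothetical inventory transition model: the next stock level depends on
# the ordering action and on random demand (0 or 1 widget, each with prob. 1/2).
def inventory_transitions(max_stock=3, order_size=1, p_demand=0.5):
    P = {}
    for level in range(max_stock + 1):
        for action in ("doNothing", "orderWidgets"):
            # Stock after the order arrives, capped by warehouse capacity.
            stock = min(level + (order_size if action == "orderWidgets" else 0),
                        max_stock)
            dist = {}
            dist[stock] = dist.get(stock, 0.0) + (1 - p_demand)  # no demand
            sold = max(stock - 1, 0)                             # one widget sold
            dist[sold] = dist.get(sold, 0.0) + p_demand
            P[(level, action)] = dist
    return P
```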


SLIDE 8

Policy

  • Choice of action at each time step
  • Formally (a tiny example follows below):
    – Mapping from states to actions
    – i.e., $\pi(s_t) = a_t$
    – Assumption: fully observable states
      • Allows $a_t$ to be chosen based only on the current state $s_t$
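A deterministic stationary policy is then just a lookup table; the state and action names here are hypothetical:

```python
# A policy pi as a plain mapping from states to actions: pi(s_t) = a_t.
pi = {"lowStock": "orderWidgets", "highStock": "doNothing"}
a_t = pi["lowStock"]  # full observability: the action depends on s_t alone
```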


SLIDE 9

Policy Optimization

  • Policy evaluation (see the sketch after this list):
    – Compute expected utility
      $V^{\pi}(s_0) = \sum_{t=0}^{h} \gamma^t \sum_{s_t} \Pr(s_t \mid s_0, \pi) \, R(s_t, \pi(s_t))$
  • Optimal policy:
    – Policy with highest expected utility: $V^{\pi^*}(s_0) \ge V^{\pi}(s_0) \;\; \forall \pi$
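A minimal sketch of policy evaluation that unrolls the formula above for a finite horizon $h$: propagate the state distribution $\Pr(s_t \mid s_0, \pi)$ forward through the transition model and accumulate discounted expected rewards. The dict layout for P and R is the one assumed in the earlier MDP sketch; pi is a state-to-action dict.

```python
# Sketch of policy evaluation for a finite horizon h:
# V^pi(s0) = sum_{t=0}^{h} gamma^t * sum_s Pr(s_t = s | s0, pi) * R(s, pi(s)).
def evaluate_policy(P, R, pi, s0, gamma, h):
    dist = {s0: 1.0}                      # Pr(s_0 = s0) = 1
    value = 0.0
    for t in range(h + 1):
        # gamma^t times the expected reward under the current state distribution
        value += gamma**t * sum(p * R[(s, pi[s])] for s, p in dist.items())
        # Push the distribution one step forward: Pr(s_{t+1} | s0, pi)
        nxt = {}
        for s, p in dist.items():
            for s2, q in P[(s, pi[s])].items():
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return value
```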


SLIDE 10

Policy Optimization

  • Several classes of algorithms:
    – Value iteration
    – Policy iteration
    – Linear programming
    – Search techniques
  • Computation may be done
    – Offline: before the process starts
    – Online: as the process evolves


SLIDE 11

Value Iteration

  • Performs dynamic programming
  • Optimizes decisions in reverse order

[Figure: the trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_4$ again; value iteration works backwards from the end of the trajectory]


SLIDE 12

Value Iteration

  • Value when no time left:
    $V(s_h) = \max_{a_h} R(s_h, a_h)$
  • Value with one time step left:
    $V(s_{h-1}) = \max_{a_{h-1}} R(s_{h-1}, a_{h-1}) + \gamma \sum_{s_h} \Pr(s_h \mid s_{h-1}, a_{h-1}) \, V(s_h)$
  • Value with two time steps left:
    $V(s_{h-2}) = \max_{a_{h-2}} R(s_{h-2}, a_{h-2}) + \gamma \sum_{s_{h-1}} \Pr(s_{h-1} \mid s_{h-2}, a_{h-2}) \, V(s_{h-1})$
  • Bellman’s equation (implemented in the sketch below):
    $V(s_t) = \max_{a_t} R(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1})$
  • $a_t^* = \operatorname{argmax}_{a_t} \, R(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1})$


SLIDE 13

A Markov Decision Process

1 Poor & Unknown +0 Poor & Famous +0 Rich & Famous +10 Rich & Unknown +10 S S S S A A A A 1 1 ½ ½ ½ ½ ½ ½ ½ ½ ½ ½

γ = 0.9

You own a company. In every state you must choose between Saving money (S) or Advertising (A).


SLIDE 14

[Figure: the same MDP as the previous slide, with states abbreviated PU, PF, RU, RF]

γ = 0.9

! "($%) '($%) "($() '($() "()%) '()%) "()() '()() ℎ A,S A,S 10 A,S 10 A,S ℎ − 1 A,S 4.5 S 14.5 S 19 S ℎ − 2 2.03 A 8.55 S 16.53 S 25.08 S ℎ − 3 4.76 A 12.20 S 18.35 S 28.72 S ℎ − 4 7.63 A 15.07 S 20.40 S 31.18 S ℎ − 5 10.21 A 17.46 S 22.61 S 33.21 S


SLIDE 15

Finite Horizon

  • When $h$ is finite, the optimal policy is non-stationary
  • The best action may differ at each time step
  • Intuition: the best action varies with the amount of time left


SLIDE 16

Infinite Horizon

  • When $h$ is infinite, the optimal policy is stationary
  • The same best action applies at each time step
  • Intuition: the same (infinite) amount of time is left at each time step, hence the same best action
  • Problem: value iteration would require an infinite number of iterations…


SLIDE 17

Infinite Horizon

  • Assuming a discount factor $\gamma$, after $k$ time steps rewards are scaled down by $\gamma^k$
  • For large enough $k$, rewards become insignificant since $\gamma^k \to 0$
  • Solution (see the bound below for how large $k$ must be):
    – Pick a large enough $k$
    – Run value iteration for $k$ steps
    – Execute the policy found at the $k$-th iteration
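One common way to make "large enough" precise (a standard bound, not stated on the slide): if rewards are bounded by some $R_{\max}$, the utility ignored by stopping after $k$ steps is at most

\[
\sum_{t=k}^{\infty} \gamma^t R_{\max} \;=\; \frac{\gamma^k R_{\max}}{1-\gamma} \;\le\; \epsilon
\quad\Longleftrightarrow\quad
k \;\ge\; \frac{\log\!\big(\epsilon(1-\gamma)/R_{\max}\big)}{\log \gamma},
\]

so running value iteration for that many steps keeps the truncated utility within $\epsilon$ of the infinite-horizon one.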
