CS885 Reinforcement Learning Lecture 15c: June 20, 2018 Semi-Markov - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS885 Reinforcement Learning Lecture 15c: June 20, 2018

Semi-Markov Decision Processes [Put] Sec. 11.1-11.3

CS885 Spring 2018 Pascal Poupart 1 University of Waterloo

slide-2
SLIDE 2

Hierarchical RL

  • Hierarchy of goals and actions in autonomous driving
  • Theory: Semi-Markov Decision Processes

[Figure: hierarchy of goals and actions, listed from top to bottom: Reach Destination; Reach A, Reach B, Reach C; Turn, Overtake, Stop, Park; Brake, Gas, Steering]

slide-3
SLIDE 3

Semi-Markov Process

  • Definition

    – Set of states: $S$
    – Transition dynamics: $\Pr(s', \tau \mid s) = \Pr(s' \mid s)\,\Pr(\tau \mid s)$, where $\tau$ indicates the time to transition

  • Semi-Markovian:

    – Next state depends only on current state
    – Time spent in each state varies

[Figure: chain of states $s \to s' \to s'' \to s'''$ with transition times $\tau$, $\tau'$, $\tau''$]
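The factorization in the definition can be illustrated with a small sampler. This is only a sketch: the function name `sample_semi_markov`, the two-state chain `P`, and the exponential transition times are assumptions made for illustration, not part of the lecture.

```python
import random

def sample_semi_markov(next_state_probs, sample_sojourn, start, steps, rng):
    """Sample a trajectory [(state, time_to_transition), ...].

    next_state_probs: dict state -> dict next_state -> probability, i.e. Pr(s' | s)
    sample_sojourn: function (state, rng) -> tau, a draw from Pr(tau | s)
    """
    trajectory = []
    s = start
    for _ in range(steps):
        successors = list(next_state_probs[s])
        weights = [next_state_probs[s][t] for t in successors]
        # Next state and transition time are drawn independently given s,
        # matching Pr(s', tau | s) = Pr(s' | s) Pr(tau | s).
        s_next = rng.choices(successors, weights=weights, k=1)[0]
        tau = sample_sojourn(s, rng)
        trajectory.append((s, tau))
        s = s_next
    return trajectory

# Hypothetical two-state chain; exponential transition times are an assumption.
P = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}
rates = {"a": 1.0, "b": 4.0}
traj = sample_semi_markov(
    P, lambda s, rng: rng.expovariate(rates[s]), "a", 5, random.Random(0)
)
```

Each entry of `traj` pairs a visited state with the time spent there before the next transition, so the times vary from step to step while the next state depends only on the current one.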

slide-4
SLIDE 4

Semi-Markov Decision Process

  • Definition

    – Set of states: $S$
    – Set of actions: $A$
    – Transition model: $\Pr(s', \tau \mid s, a)$
    – Reward model: $R(s, a) = E[r \mid s, a]$
    – Discount factor: $0 \le \gamma \le 1$
        • discounted: $\gamma < 1$; undiscounted: $\gamma = 1$
    – Horizon (i.e., # of time steps): $h$
        • finite horizon: $h \in \mathbb{N}$; infinite horizon: $h = \infty$

  • Goal: find optimal policy

slide-5
SLIDE 5

Example from Queuing Theory

  • Consider a retail store with two queues:

    – Customer service queue
    – Cashier queue

  • Semi-Markov decision process

    – State: $s = (n_1, n_2)$ where $n_i$ = # of customers in queue $i$
    – Action: $a \in \{1, 2\}$ (i.e., serve customer in queue 1 or 2)
    – Transition model: distribution over arrival and service times for customers in each queue
    – Reward model: expected revenue of each serviced customer minus expected cost associated with waiting times
    – Discount factor: $0 \le \gamma < 1$
    – Horizon (i.e., # of time steps): $h = \infty$
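One possible way to turn the two-queue example into a simulator step is sketched below. The slide only says that the transition model is a distribution over arrival and service times. The Poisson arrivals, exponential service times, all numeric values, and the names `serve` and `count_arrivals` are assumptions. The waiting cost is also simplified: it ignores customers who arrive during the service.

```python
import random

def count_arrivals(rate, duration, rng):
    """Number of Poisson arrivals within `duration`, by summing exponential gaps."""
    n, t = 0, rng.expovariate(rate)
    while t <= duration:
        n += 1
        t += rng.expovariate(rate)
    return n

def serve(state, action, rng, arrival_rates=(0.5, 1.0), service_rates=(1.0, 2.0),
          revenue=(5.0, 3.0), wait_cost=0.1):
    """One SMDP step: serve one customer from queue `action` (1 or 2).

    Returns (next_state, tau, reward). Rates, revenues and costs are
    made-up illustration values, not taken from the lecture.
    """
    q = action - 1
    if state[q] == 0:
        raise ValueError("cannot serve an empty queue")
    tau = rng.expovariate(service_rates[q])   # service time = time to transition
    n = list(state)
    n[q] -= 1                                 # the served customer leaves
    waiting = sum(n)                          # customers waiting during the service
    for i in range(2):
        n[i] += count_arrivals(arrival_rates[i], tau, rng)
    reward = revenue[q] - wait_cost * waiting * tau
    return tuple(n), tau, reward

next_state, tau, reward = serve((2, 3), 1, random.Random(1))
```

The returned `tau` is what makes this a semi-Markov step: the next decision is taken only once the chosen service finishes, however long that takes.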

slide-6
SLIDE 6

Value Function and Policy

  • Objective: $V^\pi(s) = \sum_n \gamma^{T_n} R(s_n, \pi(s_n))$

    – Where $T_n = \tau_1 + \tau_2 + \cdots + \tau_n$
    – Optimal policy: $\pi^*$ such that $V^{\pi^*}(s) \ge V^\pi(s)\ \forall s, \pi$

  • Bellman’s equation:

$V^*(s) = \max_a \left[ R(s, a) + \sum_{s', \tau} \Pr(s', \tau \mid s, a)\, \gamma^\tau V^*(s') \right]$

  • Q-learning update:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma^\tau \max_{a'} Q(s', a') - Q(s, a) \right]$
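A minimal tabular sketch of the Q-learning update on this slide follows. The only change from ordinary Q-learning is that the bootstrap term is discounted by gamma raised to the observed transition time. The function name `smdp_q_update`, the dictionary-based table, and the example values are assumptions for illustration.

```python
from collections import defaultdict

def smdp_q_update(Q, s, a, r, tau, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma**tau * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, b)] for b in actions)
    target = r + gamma ** tau * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical transition: reward 2 after a transition that took tau = 2 time units.
Q = defaultdict(float)
Q[("s1", 1)] = 10.0
smdp_q_update(Q, "s0", 1, r=2.0, tau=2.0, s_next="s1",
              actions=(1, 2), alpha=0.5, gamma=0.5)
```

With these numbers the target is 2 + 0.5^2 * 10 = 4.5, so the entry for `("s0", 1)` moves halfway from 0 to 4.5.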

slide-7
SLIDE 7

Option Framework

  • Semi-Markov decision process where actions are options (temporally extended sub-policies)
  • Let $o$ be an option with sub-policy $\pi$ and terminal states $S_{\text{end}}$

$\forall s_{t+k} \in S_{\text{end}}:\ \Pr(s_{t+k}, k \mid s_t, o) = \sum_{s_{t+1:t+k-1} \notin S_{\text{end}}} \prod_{i=1}^{k} \Pr(s_{t+i} \mid s_{t+i-1}, \pi(s_{t+i-1}))$

$R(s_t, o, s_{t+k}, k) = R(s_t, \pi(s_t)) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, \pi(s_t))\, R(s_{t+1}, \pi(s_{t+1})) + \cdots \gamma \sum_{s_{t+k}} \Pr(s_{t+k} \mid s_{t+k-1}, \pi(s_{t+k-1}))\, R(s_{t+k}, \pi(s_{t+k})) \dots$
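The option's transition model can be computed without enumerating paths, by tracking the probability mass of paths that have not yet reached a terminal state. The sketch below does this for a finite MDP. The function name, the dictionary layout of `P`, and the three-state corridor are assumptions for illustration, and the start state is assumed to be non-terminal.

```python
def option_termination_probs(P, policy, terminal, start, max_k):
    """Return {(s_end, k): Pr(s_end, k | start, option)} for k = 1..max_k.

    P: dict (state, action) -> dict next_state -> probability
    policy: dict state -> action (the option's sub-policy pi)
    terminal: set of terminal states of the option
    """
    alive = {start: 1.0}   # mass of paths that have not yet hit a terminal state
    result = {}
    for k in range(1, max_k + 1):
        next_alive = {}
        for s, p in alive.items():
            for s2, q in P[(s, policy[s])].items():
                if s2 in terminal:
                    # Path terminates at s2 after exactly k primitive steps.
                    result[(s2, k)] = result.get((s2, k), 0.0) + p * q
                else:
                    next_alive[s2] = next_alive.get(s2, 0.0) + p * q
        alive = next_alive
    return result

# Hypothetical corridor 0 -> 1 -> 2: "right" succeeds with probability 0.8.
P = {
    (0, "right"): {1: 0.8, 0: 0.2},
    (1, "right"): {2: 0.8, 1: 0.2},
}
probs = option_termination_probs(P, {0: "right", 1: "right"}, {2}, start=0, max_k=3)
```

Here the option cannot terminate after one step, terminates after two steps with probability 0.8 * 0.8 = 0.64, and after exactly three steps with probability 2 * 0.2 * 0.8 * 0.8 = 0.256.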

slide-8
SLIDE 8

Option Framework

  • Bellman’s equation:

$V^*(s) = \max_o \sum_{s', k} \Pr(s', k \mid s, o)\, \left[ R(s, o, s', k) + \gamma^k V^*(s') \right]$

  • Q-learning update:

$Q(s, o) \leftarrow Q(s, o) + \alpha \left[ r_k + \gamma^k \max_{o'} Q(s', o') - Q(s, o) \right]$

where $r_k = \sum_i \gamma^i r_i$ is the discounted sum of the rewards $r_i$ received while the option runs
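The option-level Q-learning update can be sketched in tabular form as follows. The caller passes the per-step rewards collected while the option ran together with the number of steps `k`. Which steps are included in the reward sum is left to the caller, since that indexing is an assumption here. The function name and example values are hypothetical.

```python
from collections import defaultdict

def option_q_update(Q, s, o, rewards, k, s_next, options, alpha=0.1, gamma=0.9):
    """Update Q(s, o) after option o ran for k steps from s and ended in s_next.

    rewards: per-step rewards collected while the option ran, in order;
    they are accumulated as sum_i gamma**i * rewards[i].
    """
    r_k = sum(gamma ** i * r for i, r in enumerate(rewards))
    best_next = max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (r_k + gamma ** k * best_next - Q[(s, o)])
    return Q[(s, o)]

# Hypothetical run: option "go" took 2 steps from A to B with reward 1 per step.
Q = defaultdict(float)
Q[("B", "go")] = 4.0
option_q_update(Q, "A", "go", rewards=[1.0, 1.0], k=2, s_next="B",
                options=("go", "stay"), alpha=1.0, gamma=0.5)
```

With these numbers the accumulated reward is 1 + 0.5 = 1.5 and the bootstrap term is 0.5^2 * 4 = 1, so the entry for `("A", "go")` becomes 2.5.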