

  1. CS885 Reinforcement Learning, Lecture 15c: June 20, 2018. Semi-Markov Decision Processes ([Put] Sec. 11.1-11.3). University of Waterloo, CS885 Spring 2018, Pascal Poupart

  2. Hierarchical RL
• Hierarchy of goals and actions in autonomous driving (figure: a goal hierarchy from "Reach Destination", down through "Reach A / Reach B / Reach C" and "Turn / Overtake / Stop / Park", to the controls "Brake / Gas / Steering")
• Theory: Semi-Markov Decision Processes

  3. Semi-Markov Process
• Definition
  - Set of states: $S$
  - Transition dynamics: $\Pr(s', t \mid s) = \Pr(s' \mid s)\Pr(t \mid s)$, where $t$ indicates the time to transition
• Semi-Markovian:
  - Next state depends only on current state
  - Time spent in each state varies
(Figure: a timeline of states $s, s', s'', s'''$ with sojourn times $t, t', t''$.)
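To make the transition dynamics concrete, here is a minimal Python sketch of sampling a trajectory from such a process; the states, next-state distribution, and exponential sojourn times are hypothetical stand-ins, not from the slides:

```python
import random

# Hypothetical semi-Markov process: next state and sojourn time each
# depend only on the current state: Pr(s', t | s) = Pr(s' | s) Pr(t | s).
next_state_dist = {                               # Pr(s' | s)
    "s0": [("s1", 0.7), ("s2", 0.3)],
    "s1": [("s0", 0.5), ("s2", 0.5)],
    "s2": [("s0", 1.0)],
}
mean_sojourn = {"s0": 1.0, "s1": 2.5, "s2": 0.5}  # parameterizes Pr(t | s)

def sample_transition(s):
    """Sample (s', t) ~ Pr(s', t | s)."""
    succs, probs = zip(*next_state_dist[s])
    s_next = random.choices(succs, weights=probs)[0]
    t = random.expovariate(1.0 / mean_sojourn[s])  # assumed exponential Pr(t | s)
    return s_next, t

s, clock = "s0", 0.0
for _ in range(5):
    s_next, t = sample_transition(s)
    clock += t
    print(f"t = {clock:.2f}: {s} -> {s_next}")
    s = s_next
```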

  4. Semi-Markov Decision Process
• Definition
  - Set of states: $S$
  - Set of actions: $A$
  - Transition model: $\Pr(s', t \mid s, a)$
  - Reward model: $R(s, a) = E[r \mid s, a]$
  - Discount factor: $0 \le \gamma \le 1$ (discounted: $\gamma < 1$; undiscounted: $\gamma = 1$)
  - Horizon (i.e., # of time steps): $h$ (finite horizon: $h \in \mathbb{N}$; infinite horizon: $h = \infty$)
• Goal: find optimal policy
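As a sketch, the components of this tuple might be bundled as follows; the field names and callable signatures are illustrative assumptions, not from the slides:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence, Tuple

State = Hashable
Action = Hashable

@dataclass
class SMDP:
    states: Sequence[State]        # S
    actions: Sequence[Action]      # A
    # Samples (s', t) ~ Pr(s', t | s, a): next state and transition time.
    transition: Callable[[State, Action], Tuple[State, float]]
    reward: Callable[[State, Action], float]   # R(s,a) = E[r | s,a]
    gamma: float = 0.95            # discount factor, 0 <= gamma <= 1
    horizon: float = float("inf")  # h: finite integer or infinity
```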

  5. Example from Queuing Theory
• Consider a retail store with two queues:
  - Customer service queue
  - Cashier queue
• Semi-Markov decision process
  - State: $s = (n_1, n_2)$, where $n_i$ = # of customers in queue $i$
  - Action: $a \in \{1, 2\}$ (i.e., serve a customer in queue 1 or 2)
  - Transition model: distribution over arrival and service times for customers in each queue
  - Reward model: expected revenue of each serviced customer minus expected cost associated with waiting times
  - Discount factor: $0 \le \gamma < 1$
  - Horizon (i.e., # of time steps): $h = \infty$
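A minimal simulation sketch of this example, assuming (purely for illustration) Poisson arrivals, exponential service times, and made-up rates, revenues, and waiting costs:

```python
import math
import random

ARRIVAL_RATE = {1: 0.4, 2: 0.6}  # assumed arrivals per unit time, per queue
SERVICE_RATE = {1: 1.0, 2: 1.5}  # assumed services per unit time
REVENUE      = {1: 5.0, 2: 3.0}  # assumed revenue per serviced customer
WAIT_COST    = 0.1               # assumed cost per waiting customer per unit time

def poisson(lam):
    """Sample a Poisson(lam) count (Knuth's method)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def step(n1, n2, a):
    """Serve queue a; return next state (n1', n2'), elapsed time t, reward r."""
    n = {1: n1, 2: n2}
    busy = n[a] > 0
    t = random.expovariate(SERVICE_RATE[a]) if busy else 1.0  # idle tick if empty
    r = REVENUE[a] if busy else 0.0
    if busy:
        n[a] -= 1
    r -= WAIT_COST * (n[1] + n[2]) * t      # cost of customers left waiting
    for q in (1, 2):                        # new arrivals during the elapsed time
        n[q] += poisson(ARRIVAL_RATE[q] * t)
    return (n[1], n[2]), t, r
```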

  6. Value Function and Policy
• Objective: $V^\pi(s_0) = E\left[\sum_n \gamma^{\tau_n} R(s_n, \pi(s_n))\right]$
  - where $\tau_n = t_1 + t_2 + \dots + t_n$
  - Optimal policy: $\pi^*$ such that $V^{\pi^*}(s) \ge V^\pi(s)\ \forall s, \pi$
• Bellman's equation: $V^*(s) = \max_a \left[ R(s, a) + \sum_{s', t} \Pr(s', t \mid s, a)\, \gamma^t\, V^*(s') \right]$
• Q-learning update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma^t \max_{a'} Q(s', a') - Q(s, a) \right]$
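A sketch of this tabular SMDP Q-learning update in Python, assuming the `step` function from the queuing sketch above is in scope; the learning rate, episode length, and uniform-random exploration are illustrative choices:

```python
from collections import defaultdict
import random

Q = defaultdict(float)      # Q[(s, a)], zero-initialized
alpha, gamma = 0.1, 0.95
actions = (1, 2)

s = (0, 0)
for _ in range(10_000):
    a = random.choice(actions)           # epsilon-greedy in practice
    s_next, t, r = step(s[0], s[1], a)   # next state, elapsed time t, reward r
    target = r + (gamma ** t) * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # update discounts by gamma^t
    s = s_next
```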

  7. Option Framework
• Semi-Markov decision process where actions are options (temporally extended sub-policies)
• Let $o$ be an option with sub-policy $\pi$ and terminal states $S_{end}$. Then $\forall s_{t+k} \in S_{end}$:

$\Pr(s_{t+k}, k \mid s_t, o) = \sum_{s_{t+1} : s_{t+k-1} \notin S_{end}} \prod_{j=1}^{k} \Pr(s_{t+j} \mid s_{t+j-1}, \pi(s_{t+j-1}))$

$R(s_t, o, s_{t+k}, k) = R(s_t, \pi(s_t)) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, \pi(s_t))\, R(s_{t+1}, \pi(s_{t+1})) + \dots + \gamma^k \sum_{s_{t+k}} \Pr(s_{t+k} \mid s_{t+k-1}, \pi(s_{t+k-1}))\, R(s_{t+k}, \pi(s_{t+k}))$
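The nested sum-over-paths above can be evaluated by forward dynamic programming: push probability mass through the sub-policy one step at a time, peeling off the mass that lands in a terminal state at each step $k$. A sketch, assuming a finite state set and a one-step model where `P[s][s2]` gives $\Pr(s_2 \mid s, \pi(s))$ (both names are assumptions):

```python
def option_model(P, terminal, s0, max_k):
    """Compute Pr(s_end, k | s0, o) for k = 1..max_k by forward DP.

    P[s][s2] = Pr(s2 | s, pi(s)); terminal is the set S_end.
    Returns a dict {(s_end, k): probability}.
    """
    alive = {s0: 1.0}   # mass on paths that have not yet entered S_end
    out = {}
    for k in range(1, max_k + 1):
        pushed = {}
        for s, p in alive.items():
            for s2, q in P[s].items():
                pushed[s2] = pushed.get(s2, 0.0) + p * q
        alive = {}
        for s2, p in pushed.items():
            if s2 in terminal:
                out[(s2, k)] = out.get((s2, k), 0.0) + p  # option ends at step k
            else:
                alive[s2] = p   # path continues outside S_end
    return out
```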

  8. Option Framework
• Bellman's equation: $V^*(s) = \max_o \sum_{s', k} \Pr(s', k \mid s, o) \left[ R(s, o, s', k) + \gamma^k V^*(s') \right]$
• Q-learning update: $Q(s, o) \leftarrow Q(s, o) + \alpha \left[ r_k + \gamma^k \max_{o'} Q(s', o') - Q(s, o) \right]$
  where $r_k = \sum_{j=1}^{k} \gamma^{j-1} r_j$
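A sketch of SMDP Q-learning over options: run the option's sub-policy to termination while accumulating the discounted return $r_k = \sum_{j=1}^{k} \gamma^{j-1} r_j$, then apply the update above. The option interface (`.pi`, `.terminal`) and the `env_step` function are assumed names, and `Q` is assumed to be a dict-like table such as a `defaultdict(float)`:

```python
def run_option(s, option, env_step, gamma):
    """Execute option until a state in option.terminal; return (s', k, r_k)."""
    r_k, k = 0.0, 0
    while s not in option.terminal:
        a = option.pi(s)              # primitive action from the sub-policy
        s, r = env_step(s, a)         # one primitive environment step
        r_k += (gamma ** k) * r       # gamma^(j-1) r_j with j = k + 1
        k += 1
    return s, k, r_k

def q_update(Q, s, o, s_next, k, r_k, options, alpha, gamma):
    """Q(s,o) <- Q(s,o) + alpha [r_k + gamma^k max_o' Q(s',o') - Q(s,o)]."""
    target = r_k + (gamma ** k) * max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (target - Q[(s, o)])
```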
