Markov Decision Processes [RN2] Sec 17.1, 17.2, 17.4, 17.5 [RN3] Sec 17.1, 17.2, 17.4

CS 486/686 University of Waterloo Lecture 13: February 14, 2012

CS486/686 Lecture Slides (c) 2012 P. Poupart


Outline

  • Markov Decision Processes
  • Dynamic Decision Networks

Sequential Decision Making

• Static inference: Bayesian Networks
• Sequential inference: Hidden Markov Models, Dynamic Bayesian Networks
• Static decision making: Decision Networks
• Sequential decision making: Markov Decision Processes, Dynamic Decision Networks


Sequential Decision Making

  • Wide range of applications

– Robotics (e.g., control)
– Investments (e.g., portfolio management)
– Computational linguistics (e.g., dialogue management)
– Operations research (e.g., inventory management, resource allocation, call admission control)
– Assistive technologies (e.g., patient monitoring and support)


Markov Decision Process

  • Intuition: a Markov process with…

– Decision nodes
– Utility nodes

[Diagram: influence-diagram chain with states s0…s4, actions a0…a3 and rewards r1…r4]


Stationary Preferences

  • Hmm… but why so many utility nodes?
  • U(s0,s1,s2,…)

– Infinite process → infinite utility function

  • Solution:

– Assume stationary and additive preferences
– U(s0,s1,s2,…) = Σt R(st)


Discounted/Average Rewards

  • If process infinite, isn’t Σt R(st) infinite?
  • Solution 1: discounted rewards

– Discount factor γ: 0 ≤ γ ≤ 1
– Finite utility: Σt γᵗ R(st) is a geometric sum (checked numerically below)
– γ is like an inflation rate of 1/γ - 1
– Intuition: prefer utility sooner rather than later

  • Solution 2: average rewards

– More complicated computationally
– Beyond the scope of this course
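To see why discounting keeps the utility finite, here is a quick numerical check (a minimal sketch; the values of γ and Rmax are illustrative, not from the slides):

```python
# With 0 <= gamma < 1 and rewards bounded by Rmax, the discounted sum
# sum_t gamma^t R(s_t) is a geometric series bounded by Rmax / (1 - gamma).
gamma, Rmax = 0.9, 10.0   # illustrative values

bound = Rmax / (1 - gamma)                           # 100.0
partial = sum(gamma**t * Rmax for t in range(1000))  # truncated sum
print(f"bound = {bound:.2f}, 1000-step sum = {partial:.2f}")
# bound = 100.00, 1000-step sum = 100.00  -> the series converges
```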


Markov Decision Process

  • Definition

– Set of states: S
– Set of actions (i.e., decisions): A
– Transition model: Pr(st|at-1,st-1)
– Reward model (i.e., utility): R(st)
– Discount factor: 0 ≤ γ ≤ 1
– Horizon (i.e., # of time steps): h

  • Goal: find optimal policy
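A minimal sketch of this definition as a data structure (Python; the field names and the (probability, next_state) encoding are my own choices, the slides do not prescribe any):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # trans[s][a] = [(probability, next_state), ...] encodes Pr(st | at-1, st-1)
    trans: Dict[State, Dict[Action, List[Tuple[float, State]]]]
    reward: Dict[State, float]   # R(st)
    gamma: float                 # discount factor, 0 <= gamma <= 1
    horizon: int                 # h, the number of time steps
```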

Inventory Management

  • Markov Decision Process

– States: inventory levels
– Actions: {doNothing, orderWidgets}
– Transition model: stochastic demand
– Reward model: Sales - Costs - Storage
– Discount factor: 0.999
– Horizon: ∞

  • Tradeoff: increasing supplies decreases the odds of missed sales but increases storage costs
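A sketch of what the transition and reward models might look like here. All numbers (capacity, order size, prices, the demand distribution) are hypothetical; the slide only specifies the structure:

```python
CAP, ORDER = 10, 5                             # hypothetical capacity / order size
DEMAND = {0: 0.2, 1: 0.3, 2: 0.3, 3: 0.2}      # hypothetical Pr(demand = d)

def transitions(stock: int, action: str) -> dict:
    """Pr(next_stock | stock, action) under stochastic demand."""
    after = min(CAP, stock + (ORDER if action == "orderWidgets" else 0))
    out = {}
    for d, p in DEMAND.items():
        nxt = max(0, after - d)                # demand eats into stock
        out[nxt] = out.get(nxt, 0.0) + p
    return out

def reward(stock: int, action: str) -> float:
    """Sales - Costs - Storage, with hypothetical unit prices."""
    after = min(CAP, stock + (ORDER if action == "orderWidgets" else 0))
    expected_sales = sum(p * min(d, after) for d, p in DEMAND.items())
    order_cost = 1.0 * ORDER if action == "orderWidgets" else 0.0
    return 2.0 * expected_sales - order_cost - 0.1 * after
```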


Policy

  • Choice of action at each time step
  • Formally:

– Mapping from states to actions, i.e., δ(st) = at
– Assumption: fully observable states

  • Allows at to be chosen based only on the current state st. Why?


Policy Optimization

  • Policy evaluation:

– Compute expected utility
– EU(δ) = Σt=0..h γᵗ Pr(st|δ) R(st) (computed in the sketch below)

  • Optimal policy:

– Policy with highest expected utility
– EU(δ) ≤ EU(δ*) for all δ

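A minimal sketch of policy evaluation that implements this sum directly by propagating the state distribution Pr(st|δ) forward (the model encoding follows the MDP sketch above; the function name is mine):

```python
def evaluate_policy(policy, trans, reward, init_dist, gamma, horizon):
    """EU(delta) = sum_{t=0..h} gamma^t * sum_s Pr(s_t = s | delta) R(s)."""
    dist, eu = dict(init_dist), 0.0          # dist[s] = Pr(s_t = s | delta)
    for t in range(horizon + 1):
        eu += gamma**t * sum(p * reward[s] for s, p in dist.items())
        nxt = {}                             # push the distribution one step
        for s, p in dist.items():
            for q, s2 in trans[s][policy[s]]:
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return eu
```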


Policy Optimization

  • Three algorithms to optimize policy:

– Value iteration
– Policy iteration
– Linear programming

  • Value iteration:

– Equivalent to variable elimination


Value Iteration

[Diagram: influence-diagram chain s0…s4, a0…a3, r1…r4, as before]

  • Nothing more than variable elimination
  • Performs dynamic programming
  • Optimize decisions in reverse order


Value Iteration

[Diagram: influence-diagram chain s0…s4, a0…a3, r1…r4, as before]

  • At each t, starting from t=h down to 0:

– Optimize at: EU(at|st)?
– Factors: Pr(si+1|ai,si), R(si), for 0 ≤ i ≤ h
– Restrict st
– Eliminate st+1,…,sh, at+1,…,ah


Value Iteration

  • Value when no time left:

– V(sh) = R(sh)

  • Value with one time step left:

– V(sh-1) = maxah-1 R(sh-1) + γ Σsh Pr(sh|sh-1,ah-1) V(sh)

  • Value with two time steps left:

– V(sh-2) = maxah-2 R(sh-2) + γ Σsh-1 Pr(sh-1|sh-2,ah-2) V(sh-1)

  • Bellman’s equation:

– V(st) = maxat R(st) + γ Σst+1 Pr(st+1|st,at) V(st+1)
– at* = argmaxat R(st) + γ Σst+1 Pr(st+1|st,at) V(st+1)
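A minimal sketch of finite-horizon value iteration built on this backup (Python; the model encoding follows the earlier MDP sketch and the names are mine):

```python
def bellman_backup(V, trans, reward, gamma):
    """One step of V(s) = max_a [ R(s) + gamma * sum_s' Pr(s'|s,a) V(s') ]."""
    newV, policy = {}, {}
    for s in trans:
        q = {a: reward[s] + gamma * sum(p * V[s2] for p, s2 in outs)
             for a, outs in trans[s].items()}
        policy[s] = max(q, key=q.get)        # a* = argmax_a
        newV[s] = q[policy[s]]
    return newV, policy

def value_iteration(trans, reward, gamma, horizon):
    V = {s: reward[s] for s in trans}        # V(s_h) = R(s_h)
    policy = {}
    for _ in range(horizon):                 # optimize decisions in reverse order
        V, policy = bellman_backup(V, trans, reward, gamma)
    return V, policy
```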


A Markov Decision Process

[State diagram: four states, Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10), with Save (S) and Advertise (A) transitions of probability 1 or ½]

 = 0.9

You own a company. In every state you must choose between saving money (S) or advertising (A).


[Same state diagram, abbreviated: PU +0, PF +0, RU +10, RF +10, with S and A transitions]

 = 0.9

t      V(PU)   V(PF)   V(RU)   V(RF)
h       0       0      10      10
h-1     0       4.5    14.5    19
h-2     2.03    8.55   16.53   25.08
h-3     4.76   12.20   18.35   28.72
h-4     7.63   15.07   20.40   31.18
h-5    10.21   17.46   22.61   33.21
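As a check, the following sketch encodes the transition model as I read it off the diagram (my reconstruction, not spelled out in the text) and reruns the Bellman backup; it reproduces the table row by row:

```python
# Transition model read off the state diagram: trans[s][a] = [(prob, s'), ...]
H = 0.5
trans = {
    "PU": {"S": [(1.0, "PU")],           "A": [(H, "PU"), (H, "PF")]},
    "PF": {"S": [(H, "PU"), (H, "RF")],  "A": [(1.0, "PF")]},
    "RU": {"S": [(H, "RU"), (H, "PU")],  "A": [(H, "PU"), (H, "PF")]},
    "RF": {"S": [(H, "RF"), (H, "RU")],  "A": [(1.0, "PF")]},
}
reward = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}
gamma = 0.9

V = dict(reward)                                  # row t = h: V(s_h) = R(s_h)
print("h  ", {s: round(v, 2) for s, v in V.items()})
for i in range(1, 6):                             # rows h-1 .. h-5
    V = {s: max(reward[s] + gamma * sum(p * V[s2] for p, s2 in outs)
                for outs in trans[s].values())
         for s in trans}
    print(f"h-{i}", {s: round(v, 2) for s, v in V.items()})
# Matches the table above (up to rounding in the last printed digit).
```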


Finite Horizon

  • When h is finite, the optimal policy is non-stationary
  • The best action differs at each time step
  • Intuition: the best action varies with the amount of time left

Infinite Horizon

  • When h is infinite, the optimal policy is stationary
  • Same best action at each time step
  • Intuition: the same (infinite) amount of time is left at each time step, hence the same best action

  • Problem: value iteration would perform an infinite number of iterations…


Infinite Horizon

  • Assuming a discount factor γ, after k time steps rewards are scaled down by γᵏ
  • For large enough k, rewards become insignificant since γᵏ → 0

  • Solution:

– Pick a large enough k
– Run value iteration for k steps
– Execute the policy found at the kth iteration
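How large must k be? After k steps the neglected tail of the discounted sum is at most γᵏ·Rmax/(1-γ), so we can solve for k given a tolerance ε (a sketch; the numbers are illustrative):

```python
import math

gamma, Rmax, eps = 0.9, 10.0, 0.01     # illustrative values
# Require gamma^k * Rmax / (1 - gamma) < eps and solve for k:
k = math.ceil(math.log(eps * (1 - gamma) / Rmax) / math.log(gamma))
print(k)                               # 88 iterations suffice here
```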


Computational Complexity

  • Space and time: O(k|A||S|²)

– Here k is the number of iterations

  • But what if |A| and |S| are defined by several random variables and are consequently exponential?

  • Solution: exploit conditional independence

– Dynamic decision network


Dynamic Decision Network

[Diagram: dynamic decision network over time slices t-2 … t+1, with state variables T, L, C, N, M, an action node Act and a reward node R in each slice]


Dynamic Decision Network

  • Similarly to dynamic Bayes nets:

– Compact representation ✓
– Exponential time for decision making ✗


Partial Observability

  • What if states are not fully observable?
  • Solution: Partially Observable Markov Decision Process

[Diagram: the influence-diagram chain s0…s4, a0…a3, r1…r4, extended with observation nodes o1, o2, o3, …]

Partially Observable Markov Decision Process (POMDP)

  • Definition

– Set of states: S
– Set of actions (i.e., decisions): A
– Set of observations: O
– Transition model: Pr(st|at-1,st-1)
– Observation model: Pr(ot|st)
– Reward model (i.e., utility): R(st)
– Discount factor: 0 ≤ γ ≤ 1
– Horizon (i.e., # of time steps): h

  • Policy: mapping from past observations to actions


POMDP

  • Problem: action choice generally depends on all previous observations…
  • Two solutions:

– Consider only policies that depend on a finite history of observations
– Find stationary sufficient statistics encoding the relevant past observations (sketched below)
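The standard such sufficient statistic is the belief state b(s) = Pr(st | past actions and observations), updated by Bayes' rule after taking action a and observing o. A minimal sketch (model encoding as in the earlier MDP sketch, plus obs[s][o] = Pr(o|s); the names are mine):

```python
def belief_update(b, a, o, trans, obs):
    """b'(s') is proportional to Pr(o|s') * sum_s Pr(s'|a,s) b(s)."""
    b2 = {}
    for s, p in b.items():               # predict: push belief through Pr(s'|a,s)
        for q, s2 in trans[s][a]:
            b2[s2] = b2.get(s2, 0.0) + p * q
    for s2 in b2:                        # correct: weight by Pr(o|s')
        b2[s2] *= obs[s2].get(o, 0.0)
    z = sum(b2.values())                 # z = Pr(o | b, a); assumes o is possible
    return {s2: p / z for s2, p in b2.items()}
```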


Partially Observable DDN

[Diagram: the same dynamic decision network; each action node Act is connected to only a subset of the state variables]

  • Actions do not depend on all state variables


Policy Optimization

  • Policy optimization:

– Value iteration (variable elimination)
– Policy iteration

  • POMDP and PODDN complexity:

– Exponential in |O| and k when the action choice depends on all previous observations
– In practice, good policies based on a subset of past observations can still be found

COACH project

  • Automated prompting system to help elderly persons wash their hands

  • IATSL: Alex Mihailidis, Pascal Poupart, Jennifer Boger, Jesse Hoey, Geoff Fernie and Craig Boutilier


Aging Population

  • Dementia

– Deterioration of intellectual faculties
– Confusion
– Memory loss (e.g., Alzheimer's disease)

  • Consequences:

– Loss of autonomy
– Continual and expensive care required


Intelligent Assistive Technology

  • Let’s facilitate aging in place
  • Intelligent assistive technology

– Non-obtrusive, yet pervasive
– Adaptable

  • Benefits:

– Greater autonomy
– Feeling of independence


System Overview

[Diagram: system overview: sensors, hand-washing tracking, planning, verbal cues]


Prompting Strategy

  • Sequential decision problem

– Sequence of prompts

  • Noisy sensors & imprecise actuators

– Noisy image processing, uncertain prompt effects

  • Partially unknown environment

– Unknown user habits, preferences and abilities

  • Tradeoff between complex concurrent goals

– Rapid task completion vs greater autonomy

  • Approach: Partially Observable Markov Decision Processes (POMDPs)


POMDP components

  • State set S = dom(HL) × dom(WF) × dom(D) × … (enumerated in the sketch after this list)

– Hand Location HL ∈ {tap, water, soap, towel, sink, away, …}
– Water Flow WF ∈ {on, off}
– Dementia D ∈ {high, low}, etc.

  • Observation set O = dom(C) × dom(FS)

– Camera C ∈ {handsAtTap, handsAtTowel, …}
– Faucet sensor FS ∈ {waterOn, waterOff}

  • Action set A

– DoNothing
– CallCaregiver
– Prompt ∈ {turnOnWater, rinseHands, useSoap, …}
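A small sketch of why |S| blows up: the joint state set is the cross product of the variable domains (domains abbreviated to those listed on the slide):

```python
from itertools import product

HL = ["tap", "water", "soap", "towel", "sink", "away"]   # Hand Location
WF = ["on", "off"]                                       # Water Flow
D  = ["high", "low"]                                     # Dementia

S = list(product(HL, WF, D))   # S = dom(HL) x dom(WF) x dom(D)
print(len(S))                  # 24 joint states from just three variables
```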


POMDP components

  • Transition function Pr(s'|s,a)

[Diagram: fragment of the transition graph over states such as (sink,off), (tap,on), (soap,off), with transition probabilities 0.3, 0.6, 0.95, 0.01, …]

  • Observation function Pr(o|s)
  • Reward function R(s,a)

– Task completed → +100
– Call caregiver → -30
– Each prompt → -1, -2 or -3


Next Class

  • Machine Learning
  • Decision Trees