Markov Decision Processes and Dynamic Decision Networks

Outline
• Markov Decision Processes
• Dynamic Decision Networks
Lecture 13, CS 486/686, June 14, 2005. Russell and Norvig: Sect. 17.1, 17.2 (up to p. 620), 17.4, 17.5. Slides (c) 2005 P. Poupart.

Sequential Decision Making
• Wide range of applications
– Robotics (e.g., control)
– Investments (e.g., portfolio management)
– Computational linguistics (e.g., dialogue management)
– Operations research (e.g., inventory management, resource allocation, call admission control)
– Assistive technologies (e.g., patient monitoring and support)

Sequential Decision Making (continued)
• Where these models fit:
– Static inference: Bayesian networks
– Sequential inference: hidden Markov models, dynamic Bayesian networks
– Static decision making: decision networks
– Sequential decision making: Markov decision processes, dynamic decision networks

Markov Decision Process
• Intuition: a Markov process with...
– Decision nodes
– Utility nodes
[Diagram: a chain of states s_0, s_1, s_2, s_3, s_4, with an action a_t influencing each transition and a reward r_t attached to each state after s_0.]

Stationary Preferences
• Hmm... but why many utility nodes?
• U(s_0, s_1, s_2, ...)
– Infinite process, hence an infinite utility function
• Solution:
– Assume stationary and additive preferences
– U(s_0, s_1, s_2, ...) = Σ_t R(s_t)
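To see why additivity alone is not yet enough, here is a small worked illustration (mine, not from the slides): with any constant positive per-step reward the plain sum diverges, while the discounted sum introduced on the next slide stays finite.

\text{Undiscounted: } \sum_{t=0}^{\infty} R(s_t) = \infty \quad \text{if } R(s_t) = r > 0 \text{ for all } t,
\qquad
\text{Discounted: } \sum_{t=0}^{\infty} \gamma^t r = \frac{r}{1-\gamma} \quad \text{for } 0 \le \gamma < 1.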

Discounted/Average Rewards
• If the process is infinite, isn't Σ_t R(s_t) infinite?
• Solution 1: discounted rewards
– Discount factor: 0 ≤ γ ≤ 1
– Finite utility: Σ_t γ^t R(s_t) is a geometric sum
– γ is like an inflation rate of 1/γ - 1
– Intuition: prefer utility sooner than later
• Solution 2: average rewards
– More complicated computationally
– Beyond the scope of this course

Markov Decision Process
• Definition
– Set of states: S
– Set of actions (i.e., decisions): A
– Transition model: Pr(s_t | a_{t-1}, s_{t-1})
– Reward model (i.e., utility): R(s_t)
– Discount factor: 0 ≤ γ ≤ 1
– Horizon (i.e., number of time steps): h
• Goal: find the optimal policy

Inventory Management
• Markov Decision Process
– States: inventory levels
– Actions: {doNothing, orderWidgets}
– Transition model: stochastic demand
– Reward model: Sales - Costs - Storage
– Discount factor: 0.999
– Horizon: ∞
• Tradeoff: increasing supplies decreases the odds of missed sales but increases storage costs

Policy
• Choice of action at each time step
• Formally:
– Mapping from states to actions
– i.e., δ(s_t) = a_t
– Assumption: fully observable states
• Allows a_t to be chosen based only on the current state s_t. Why?

Policy Optimization
• Policy evaluation:
– Compute the expected utility
– EU(δ) = Σ_{t=0}^{h} γ^t Pr(s_t | δ) R(s_t)
• Optimal policy δ*:
– Policy with the highest expected utility
– EU(δ) ≤ EU(δ*) for all δ

Policy Optimization (continued)
• Three algorithms to optimize the policy:
– Value iteration
– Policy iteration
– Linear programming
• Value iteration:
– Equivalent to variable elimination
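A minimal sketch of the policy-evaluation formula above, EU(δ) = Σ_{t=0}^{h} γ^t Pr(s_t | δ) R(s_t), written in Python by pushing the state distribution forward one step at a time. The function and argument names are my own illustration, not course code.

# Finite-horizon policy evaluation: EU(delta) = sum_t gamma^t Pr(s_t | delta) R(s_t).
# P[s][a][s2] = Pr(s2 | a, s), R[s] = reward, policy[t][s] = action taken at time t.
def evaluate_policy(states, P, R, policy, init_dist, gamma, horizon):
    dist = dict(init_dist)                      # Pr(s_0)
    eu = 0.0
    for t in range(horizon + 1):
        eu += (gamma ** t) * sum(dist[s] * R[s] for s in states)
        if t == horizon:
            break
        nxt = {s: 0.0 for s in states}          # becomes Pr(s_{t+1} | delta)
        for s in states:
            a = policy[t][s]
            for s2, p in P[s][a].items():
                nxt[s2] += dist[s] * p
        dist = nxt
    return eu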

Value Iteration
• Nothing more than variable elimination
• Performs dynamic programming
• Optimize decisions in reverse order
[Diagram: the MDP influence diagram with actions a_0...a_3, states s_0...s_4 and rewards r_1...r_4, repeated to show the last decisions being optimized first.]

Value Iteration (continued)
• At each t, starting from t = h down to 0:
– Optimize a_t: EU(a_t | s_t)?
– Factors: Pr(s_{t+1} | a_t, s_t) and R(s_t) for each time step
– Restrict s_t
– Eliminate s_{t+1}, ..., s_h, a_{t+1}, ..., a_h

Value Iteration (continued)
• Value when no time is left:
– V(s_h) = R(s_h)
• Value with one time step left:
– V(s_{h-1}) = max_{a_{h-1}} [ R(s_{h-1}) + γ Σ_{s_h} Pr(s_h | s_{h-1}, a_{h-1}) V(s_h) ]
• Value with two time steps left:
– V(s_{h-2}) = max_{a_{h-2}} [ R(s_{h-2}) + γ Σ_{s_{h-1}} Pr(s_{h-1} | s_{h-2}, a_{h-2}) V(s_{h-1}) ]
• ...
• Bellman's equation:
– V(s_t) = max_{a_t} [ R(s_t) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, a_t) V(s_{t+1}) ]
– a_t* = argmax_{a_t} [ R(s_t) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, a_t) V(s_{t+1}) ]

A Markov Decision Process
• γ = 0.9
• You own a company. In every state you must choose between Saving money (S) or Advertising (A).
[Diagram: four states, Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10), with transition probabilities of 1 or ½ depending on the action taken.]

Finite Horizon
• When h is finite, the optimal policy is non-stationary
• Best action is different at each time step
• Intuition: the best action varies with the amount of time left

  t     V(PU)   V(PF)   V(RU)   V(RF)
  h     0       0       10      10
  h-1   0       4.5     14.5    19
  h-2   2.03    8.55    16.53   25.08
  h-3   4.76    12.20   18.35   28.72
  h-4   7.63    15.07   20.40   31.18
  h-5   10.21   17.46   22.61   33.21
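The finite-horizon values in the table above can be regenerated with a few lines of value iteration using Bellman's equation. The transition probabilities below are my reading of the slide's diagram (they do reproduce the tabulated numbers), so treat them as a reconstruction rather than a definitive transcription.

# Finite-horizon value iteration on the "you own a company" MDP (S = save, A = advertise).
# Transition model reconstructed from the slide diagram; it reproduces the table above.
gamma = 0.9
R = {'PU': 0, 'PF': 0, 'RU': 10, 'RF': 10}
P = {
    'PU': {'S': {'PU': 1.0},             'A': {'PU': 0.5, 'PF': 0.5}},
    'PF': {'S': {'PU': 0.5, 'RF': 0.5},  'A': {'PF': 1.0}},
    'RU': {'S': {'RU': 0.5, 'PU': 0.5},  'A': {'PU': 0.5, 'PF': 0.5}},
    'RF': {'S': {'RF': 0.5, 'RU': 0.5},  'A': {'PF': 1.0}},
}

V = dict(R)                                  # V(s_h) = R(s_h): value with no time left
for steps_left in range(1, 6):               # h-1 down to h-5, as in the table
    V = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in P[s][a].items())
                               for a in P[s])
         for s in R}
    print(steps_left, {s: round(V[s], 2) for s in ('PU', 'PF', 'RU', 'RF')})
# Output matches the table up to rounding: 0, 4.5, 14.5, 19 after one backup;
# about 2.03, 8.55, 16.53, 25.08 after two; and so on.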

Infinite Horizon
• When h is infinite, the optimal policy is stationary
• Same best action at each time step
• Intuition: the same (infinite) amount of time is left at each time step, hence the same best action
• Problem: value iteration does an infinite number of iterations...

Infinite Horizon (continued)
• Assuming a discount factor γ, after k time steps rewards are scaled down by γ^k
• For large enough k, rewards become insignificant since γ^k → 0
• Solution:
– Pick a large enough k
– Run value iteration for k steps
– Execute the policy found at the k-th iteration
(A sketch of a standard stopping rule for choosing when to halt appears after this slide group.)

Computational Complexity
• Space and time: O(k|A||S|²) ☺
– Here k is the number of iterations
• But what if |A| and |S| are defined by several random variables and are consequently exponential?
• Solution: exploit conditional independence
– Dynamic decision network

Dynamic Decision Network
[Diagram: action nodes Act_{t-2}, Act_{t-1}, Act_t, state variables M, T, L, C, N at each time slice from t-2 to t+1, and reward nodes R_{t-2} to R_{t+1}.]

Dynamic Decision Network (continued)
• Similarly to dynamic Bayes nets:
– Compact representation ☺
– Exponential time for decision making ☹

Partial Observability
• What if states are not fully observable?
• Solution: Partially Observable Markov Decision Process
[Diagram: the MDP chain of states s_0...s_4, actions a_0...a_3 and rewards r_1...r_4, extended with observation nodes o_1, o_2, o_3, ... attached to the states.]
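Relating back to the infinite-horizon slides above: a common way to pick the number of iterations k in practice (a standard textbook stopping rule, not something the slides spell out) is to run backups until the values stop changing by more than a small tolerance. A minimal sketch, with illustrative names:

# Infinite-horizon value iteration with a convergence test instead of a fixed k.
# Stop when the largest one-step change (the Bellman residual) drops below eps;
# the returned values are then within eps * gamma / (1 - gamma) of the true V.
# Assumes every action in `actions` is available in every state.
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        newV = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in P[s][a].items())
                                      for a in actions)
                for s in states}
        if max(abs(newV[s] - V[s]) for s in states) < eps:
            return newV
        V = newV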

Partially Observable Markov Decision Process (POMDP)
• Definition
– Set of states: S
– Set of actions (i.e., decisions): A
– Set of observations: O
– Transition model: Pr(s_t | a_{t-1}, s_{t-1})
– Observation model: Pr(o_t | s_t)
– Reward model (i.e., utility): R(s_t)
– Discount factor: 0 ≤ γ ≤ 1
– Horizon (i.e., number of time steps): h
• Policy: mapping from past observations to actions

POMDP
• Problem: the action choice generally depends on all previous observations...
• Two solutions:
– Consider only policies that depend on a finite history of observations
– Find stationary sufficient statistics encoding the relevant past observations (see the belief-update sketch after this slide group)

Partially Observable DDN
• Actions do not depend on all state variables
[Diagram: the same dynamic decision network structure as before (Act, M, T, L, C, N, R nodes over time slices t-2 to t+1).]

Policy Optimization
• Policy optimization:
– Value iteration (variable elimination)
– Policy iteration
• POMDP and PODDN complexity:
– Exponential in |O| and k when the action choice depends on all previous observations ☹
– In practice, good policies based on a subset of past observations can still be found

COACH project
• Automated prompting system to help elderly persons wash their hands
• IATSL: Alex Mihailidis, Pascal Poupart, Jennifer Boger, Jesse Hoey, Geoff Fernie and Craig Boutilier

COACH project: Aging Population
• Dementia
– Deterioration of intellectual faculties
– Confusion
– Memory losses (e.g., Alzheimer's disease)
• Consequences:
– Loss of autonomy
– Continual and expensive care required
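As noted above, the "stationary sufficient statistic" is, in the standard POMDP formulation, the belief state: a probability distribution over states that is updated after every action and observation and then plays the role of the fully observable state. A minimal sketch (the names are illustrative and the snippet is mine, not course code):

# POMDP belief update: b'(s2) is proportional to Pr(o | s2) * sum_s Pr(s2 | a, s) * b(s).
# The belief b summarizes the whole action/observation history, so a policy can map
# beliefs to actions instead of mapping full observation histories to actions.
def update_belief(b, a, o, P, Z):
    # b: dict state -> probability, P[s][a][s2] = Pr(s2 | a, s), Z[s2][o] = Pr(o | s2)
    new_b = {s2: Z[s2][o] * sum(P[s][a].get(s2, 0.0) * p for s, p in b.items())
             for s2 in Z}
    total = sum(new_b.values())              # normalizer, equal to Pr(o | b, a)
    return {s2: v / total for s2, v in new_b.items()}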
