Markov Decision Processes
Dynamic Decision Networks

Lecture 13
Russell and Norvig: Sect 17.1, 17.2 (up to p. 620), 17.4, 17.5
June 14, 2005
CS 486/686

CS486/686 Lecture Slides (c) 2005 P. Poupart

Outline

  • Markov Decision Processes
  • Dynamic Decision Networks
  • Russell and Norvig: Sect 17.1, 17.2 (up to p. 620), 17.4, 17.5


Sequential Decision Making

  Static Inference           → Bayesian Networks
  Sequential Inference       → Hidden Markov Models, Dynamic Bayesian Networks
  Static Decision Making     → Decision Networks
  Sequential Decision Making → Markov Decision Processes, Dynamic Decision Networks


Sequential Decision Making

  • Wide range of applications
    – Robotics (e.g., control)
    – Investments (e.g., portfolio management)
    – Computational linguistics (e.g., dialogue management)
    – Operations research (e.g., inventory management, resource allocation, call admission control)
    – Assistive technologies (e.g., patient monitoring and support)


Markov Decision Process

  • Intuition: Markov Process with…
    – Decision nodes
    – Utility nodes

  [Influence diagram: states s0 … s4, actions a0 … a3, rewards r1 … r4]


Stationary Preferences

  • Hmm… but why many utility nodes?
  • U(s0, s1, s2, …)
    – Infinite process → infinite utility function
  • Solution:
    – Assume stationary and additive preferences
    – U(s0, s1, s2, …) = Σt R(st)


Discounted/Average Rewards

  • If the process is infinite, isn't Σt R(st) infinite?
  • Solution 1: discounted rewards
    – Discount factor: 0 ≤ γ ≤ 1
    – Finite utility: Σt γ^t R(st) is a geometric sum
    – γ is like an inflation rate of 1/γ - 1
    – Intuition: prefer utility sooner rather than later
  • Solution 2: average rewards
    – More complicated computationally
    – Beyond the scope of this course
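The geometric-sum argument for discounted rewards can be checked numerically. A minimal sketch (γ and R_max are illustrative numbers, not from the slides): with rewards bounded by R_max, the discounted sum is bounded by the closed-form geometric limit R_max / (1 - γ).

```python
# Why discounting makes an infinite reward stream finite:
# sum_t gamma^t * R(s_t) <= sum_t gamma^t * R_max = R_max / (1 - gamma).
gamma, R_max = 0.9, 10.0  # illustrative values

# Truncated discounted sum of the maximal reward stream
truncated = sum(gamma**t * R_max for t in range(1000))

bound = R_max / (1 - gamma)  # closed-form geometric series limit
print(truncated, bound)      # both are approximately 100.0
```

Even though the reward stream never ends, its discounted value stays finite, which is exactly what makes the utility U(s0, s1, …) well defined.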


Markov Decision Process

  • Definition
    – Set of states: S
    – Set of actions (i.e., decisions): A
    – Transition model: Pr(st | at-1, st-1)
    – Reward model (i.e., utility): R(st)
    – Discount factor: 0 ≤ γ ≤ 1
    – Horizon (i.e., # of time steps): h
  • Goal: find optimal policy
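The five components of the definition map directly onto a data structure. A minimal sketch (field names and the toy instance are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list      # S
    actions: list     # A
    transition: dict  # transition[s][a][s2] = Pr(s2 | s, a)
    reward: dict      # reward[s] = R(s)
    gamma: float      # discount factor, 0 <= gamma <= 1
    horizon: float    # h; float("inf") for an infinite horizon

# A toy 2-state instance:
toy = MDP(
    states=["good", "bad"],
    actions=["stay", "switch"],
    transition={
        "good": {"stay": {"good": 1.0}, "switch": {"bad": 1.0}},
        "bad":  {"stay": {"bad": 1.0},  "switch": {"good": 1.0}},
    },
    reward={"good": 1.0, "bad": 0.0},
    gamma=0.9,
    horizon=float("inf"),
)
```

Each row of the transition model is a probability distribution over next states, so each inner dict must sum to 1.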


Inventory Management

  • Markov Decision Process
    – States: inventory levels
    – Actions: {doNothing, orderWidgets}
    – Transition model: stochastic demand
    – Reward model: Sales - Costs - Storage
    – Discount factor: 0.999
    – Horizon: ∞
  • Tradeoff: increasing supplies decreases odds of missed sales but increases storage costs


Policy

  • Choice of action at each time step
  • Formally:
    – Mapping from states to actions
    – i.e., δ(st) = at
    – Assumption: fully observable states
  • Allows at to be chosen based only on the current state st. Why?


Policy Optimization

  • Policy evaluation:
    – Compute expected utility
    – EU(δ) = Σt=0..h γ^t Pr(st | δ) R(st)
  • Optimal policy:
    – Policy with highest expected utility
    – EU(δ) ≤ EU(δ*) for all δ
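For a fixed policy δ, the expected utility can be computed by iterating the evaluation recurrence V(s) ← R(s) + γ Σ_s' Pr(s'|s, δ(s)) V(s'). A minimal sketch on a hypothetical 2-state chain (state names, policy, and numbers are illustrative):

```python
# Policy evaluation: no max over actions, the policy delta fixes the action.
gamma = 0.9
R = {"good": 1.0, "bad": 0.0}
P = {  # P[s][a][s2] = Pr(s2 | s, a); only the actions delta uses are listed
    "good": {"stay": {"good": 1.0}},
    "bad":  {"switch": {"good": 1.0}},
}
delta = {"good": "stay", "bad": "switch"}  # the policy being evaluated

V = {s: 0.0 for s in R}
for _ in range(200):  # iterate until numerically converged
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s][delta[s]].items())
         for s in R}
print(V)  # V(good) -> 1/(1-0.9) = 10, V(bad) -> 0 + 0.9*V(good) = 9
```

The optimal policy δ* is then the one whose evaluated utility dominates every other policy's, as the slide states.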


Policy Optimization

  • Three algorithms to optimize the policy:
    – Value iteration
    – Policy iteration
    – Linear programming
  • Value iteration:
    – Equivalent to variable elimination


Value Iteration

  [Influence diagram: states s0 … s4, actions a0 … a3, rewards r1 … r4]

  • Nothing more than variable elimination
  • Performs dynamic programming
  • Optimizes decisions in reverse order


Value Iteration

  [Influence diagram: states s0 … s4, actions a0 … a3, rewards r1 … r4]

  • At each t, starting from t = h down to 0:
    – Optimize at: EU(at | st)?
    – Factors: Pr(st+1 | at, st), R(st), for …
    – Restrict st
    – Eliminate st+1, …, sh, at+1, …, ah


Value Iteration

  • Value when no time left:
    – V(sh) = R(sh)
  • Value with one time step left:
    – V(sh-1) = max_ah-1 [ R(sh-1) + γ Σ_sh Pr(sh | sh-1, ah-1) V(sh) ]
  • Value with two time steps left:
    – V(sh-2) = max_ah-2 [ R(sh-2) + γ Σ_sh-1 Pr(sh-1 | sh-2, ah-2) V(sh-1) ]
  • Bellman's equation:
    – V(st) = max_at [ R(st) + γ Σ_st+1 Pr(st+1 | st, at) V(st+1) ]
    – at* = argmax_at [ R(st) + γ Σ_st+1 Pr(st+1 | st, at) V(st+1) ]
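A single Bellman backup is a one-liner. A minimal sketch (the toy 2-state model and its numbers are illustrative, not from the slides):

```python
def bellman_backup(s, R, P, V, gamma):
    """Return (max_a [R(s) + gamma * sum_s' Pr(s'|s,a) V(s')], argmax action)."""
    return max(
        (R[s] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()), a)
        for a in P[s]
    )

gamma = 0.9
R = {"x": 0.0, "y": 1.0}
P = {"x": {"go": {"y": 1.0}, "stay": {"x": 1.0}},
     "y": {"go": {"x": 1.0}, "stay": {"y": 1.0}}}

# Value iteration = repeated Bellman backups over all states
V = {"x": 0.0, "y": 0.0}
for _ in range(2):
    V = {s: bellman_backup(s, R, P, V, gamma)[0] for s in R}
print(V)  # -> {'x': 0.9, 'y': 1.9}
```

The second element of the returned tuple is the argmax action at*, matching the second line of Bellman's equation on the slide.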


A Markov Decision Process

  [State diagram: Poor & Unknown (+0), Poor & Famous (+0), Rich & Famous (+10), Rich & Unknown (+10); each state offers actions S (Save) and A (Advertise); transition probabilities are 1 or ½]

  γ = 0.9

  You own a company. In every state you must choose between Saving money or Advertising.


  [Same state diagram: PU (+0), PF (+0), RF (+10), RU (+10), actions S and A]

  γ = 0.9

  Value iteration results:

    t      V(RF)   V(RU)   V(PF)   V(PU)
    h      10      10       0       0
    h-1    19      14.5     4.5     0
    h-2    25.08   16.53    8.55    2.03
    h-3    28.72   18.35   12.20    4.76
    h-4    31.18   20.40   15.07    7.63
    h-5    33.21   22.61   17.46   10.21
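The table above can be reproduced by running the Bellman backup for five steps. The extracted slide does not fully preserve the transition arrows, so the model below follows the standard version of this classic example, which matches every entry in the table; treat the exact arrow structure as an assumption.

```python
# Finite-horizon value iteration on the PU/PF/RU/RF company example.
gamma = 0.9
R = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}
P = {  # P[s][a] = {s': Pr(s'|s,a)}, a in {S(ave), A(dvertise)} -- assumed model
    "PU": {"S": {"PU": 1.0},            "A": {"PU": 0.5, "PF": 0.5}},
    "PF": {"S": {"PU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
    "RU": {"S": {"PU": 0.5, "RU": 0.5}, "A": {"PU": 0.5, "PF": 0.5}},
    "RF": {"S": {"RU": 0.5, "RF": 0.5}, "A": {"PU": 0.5, "PF": 0.5}},
}

V = dict(R)  # t = h: no time left, so V(s) = R(s)
for _ in range(5):  # back up from t = h to t = h-5
    V = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in P[s][a].items())
                               for a in P[s])
         for s in R}

print({s: round(v, 2) for s, v in V.items()})
# -> {'PU': 10.21, 'PF': 17.46, 'RU': 22.61, 'RF': 33.21}, matching row h-5
```

Intermediate iterations reproduce the other rows of the table as well.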


Finite Horizon

  • When h is finite, non-stationary optimal policy
  • Best action different at each time step
  • Intuition: best action varies with the amount of time left

Infinite Horizon

  • When h is infinite, stationary optimal policy
  • Same best action at each time step
  • Intuition: same (infinite) amount of time left at each time step, hence same best action
  • Problem: value iteration does an infinite number of iterations…


Infinite Horizon

  • Assuming a discount factor γ, after k time steps, rewards are scaled down by γ^k
  • For large enough k, rewards become insignificant since γ^k → 0
  • Solution:
    – Pick a large enough k
    – Run value iteration for k steps
    – Execute the policy found at the kth iteration
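"Large enough k" can be made precise: the discounted reward after step k is at most γ^k · R_max / (1 - γ), so it suffices to pick the smallest k that pushes this tail below a tolerance ε. A minimal sketch (γ, R_max, and ε are illustrative numbers):

```python
import math

gamma, R_max, eps = 0.9, 10.0, 1e-3  # illustrative values

# Smallest k with gamma^k * R_max / (1 - gamma) < eps:
k = math.ceil(math.log(eps * (1 - gamma) / R_max) / math.log(gamma))
print(k)  # -> 110 for these numbers

assert gamma**k * R_max / (1 - gamma) < eps          # k iterations suffice
assert gamma**(k - 1) * R_max / (1 - gamma) >= eps   # k-1 would not
```

Note how k grows as γ approaches 1: weak discounting (like the 0.999 in the inventory example) requires many more iterations.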


Computational Complexity

  • Space and time: O(k|A||S|²) ☺
    – Here k is the number of iterations
  • But what if |A| and |S| are defined by several random variables and consequently exponential?
  • Solution: exploit conditional independence
    – Dynamic decision network


Dynamic Decision Network

  [Network diagram: state variables T, L, C, N, M at times t-2 … t+1, action nodes Act(t-2) … Act(t), and reward nodes R(t-2) … R(t+1)]


Dynamic Decision Network

  • Similarly to dynamic Bayes nets:
    – Compact representation ☺
    – Exponential time for decision making


Partial Observability

  • What if states are not fully observable?
  • Solution: Partially Observable Markov Decision Process

  [Influence diagram: states s0 … s4, actions a0 … a3, rewards r1 … r4]

Partially Observable Markov Decision Process (POMDP)

  • Definition
    – Set of states: S
    – Set of actions (i.e., decisions): A
    – Set of observations: O
    – Transition model: Pr(st | at-1, st-1)
    – Observation model: Pr(ot | st)
    – Reward model (i.e., utility): R(st)
    – Discount factor: 0 ≤ γ ≤ 1
    – Horizon (i.e., # of time steps): h
  • Policy: mapping from past observations to actions


POMDP

  • Problem: action choice generally depends on all previous observations…
  • Two solutions:
    – Consider only policies that depend on a finite history of observations
    – Find stationary sufficient statistics encoding the relevant past observations
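The best-known stationary sufficient statistic is the belief state: a distribution b(s) over states that summarizes all past observations and is updated after each action a and observation o by b'(s') ∝ Pr(o|s') Σ_s Pr(s'|s,a) b(s). A minimal sketch (the weather-flavored model and its numbers are illustrative, not from the slides):

```python
def belief_update(b, a, o, P, Z):
    """Bayes-filter update: predict through Pr(s'|s,a), weight by Pr(o|s')."""
    unnorm = {s2: Z[s2][o] * sum(P[s][a].get(s2, 0.0) * b[s] for s in b)
              for s2 in b}
    total = sum(unnorm.values())
    return {s2: v / total for s2, v in unnorm.items()}

P = {"wet": {"wait": {"wet": 0.9, "dry": 0.1}},   # Pr(s' | s, a)
     "dry": {"wait": {"wet": 0.2, "dry": 0.8}}}
Z = {"wet": {"shine": 0.1, "drip": 0.9},          # Pr(o | s')
     "dry": {"shine": 0.7, "drip": 0.3}}

b = {"wet": 0.5, "dry": 0.5}
b = belief_update(b, "wait", "drip", P, Z)
print(b)  # posterior puts most of its mass on "wet" after observing "drip"
```

A POMDP policy can then be a mapping from belief states to actions, which is stationary even though the raw observation history grows without bound.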


Partially Observable DDN

  [Network diagram: state variables T, L, C, N, M at times t-2 … t+1, action nodes Act(t-2) … Act(t), and reward nodes R(t-2) … R(t+1)]

  • Actions do not depend on all state variables


Policy Optimization

  • Policy optimization:
    – Value iteration (variable elimination)
    – Policy iteration
  • POMDP and PODDN complexity:
    – Exponential in |O| and k when the action choice depends on all previous observations
    – In practice, good policies based on a subset of past observations can still be found


COACH project

  • Automated prompting system to help elderly persons wash their hands
  • IATSL: Alex Mihailidis, Pascal Poupart, Jennifer Boger, Jesse Hoey, Geoff Fernie and Craig Boutilier


Aging Population

  • Dementia
    – Deterioration of intellectual faculties
    – Confusion
    – Memory losses (e.g., Alzheimer's disease)
  • Consequences:
    – Loss of autonomy
    – Continual and expensive care required


Intelligent Assistive Technology

  • Let's facilitate aging in place
  • Intelligent assistive technology
    – Non-obtrusive, yet pervasive
    – Adaptable
  • Benefits:
    – Greater autonomy
    – Feeling of independence


System Overview

  [Diagram: sensors, hand washing, verbal cues, planning]


Prompting Strategy

  • Sequential decision problem
    – Sequence of prompts
  • Noisy sensors & imprecise actuators
    – Noisy image processing, uncertain prompt effects
  • Partially unknown environment
    – Unknown user habits, preferences and abilities
  • Tradeoff between complex concurrent goals
    – Rapid task completion vs greater autonomy
  • Approach: Partially Observable Markov Decision Processes (POMDPs)


POMDP components

  • State set S = dom(HL) × dom(WF) × dom(D) × …
    – Hand Location ∈ {tap, water, soap, towel, sink, away, …}
    – Water Flow ∈ {on, off}
    – Dementia ∈ {high, low}, etc.
  • Observation set O = dom(C) × dom(FS)
    – Camera ∈ {handsAtTap, handsAtTowel, …}
    – Faucet sensor ∈ {waterOn, waterOff}
  • Action set A
    – DoNothing, CallCaregiver, Prompt ∈ {turnOnWater, rinseHands, useSoap, …}
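The factored state set is a cross-product of variable domains, which is why |S| blows up exponentially in the number of variables, the problem the DDN representation addresses. A minimal sketch (domains truncated to the values the slide lists):

```python
from itertools import product

# Factored state space: one state per joint assignment of the variables.
dom = {
    "HL": ["tap", "water", "soap", "towel", "sink", "away"],  # Hand Location
    "WF": ["on", "off"],                                      # Water Flow
    "D":  ["high", "low"],                                    # Dementia
}
S = list(product(*dom.values()))
print(len(S))  # 6 * 2 * 2 = 24 states, even for this truncated model
```

Each additional state variable multiplies |S| by its domain size, so a flat O(k|A||S|²) value iteration quickly becomes infeasible.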


POMDP components

  • Transition function Pr(s'|s,a)

  [Diagram: example transition probabilities between states such as (sink,off), (tap,on), (soap,off)]

  • Reward function R(s,a)
    – Task completed: +100
    – Call caregiver: -30
    – Each prompt: -1, -2 or -3
  • Observation function Pr(o|s)


Next Class

  • Multi-agent systems
  • Game theory
  • Russell and Norvig: Chapter 17