CS440/ECE448 Lecture 21: Markov Decision Processes
Slides by Svetlana Lazebnik, 11/2016; modified by Mark Hasegawa-Johnson, 3/2019.
Markov decision processes: in HMMs, we see a sequence of observations and try to reason about the underlying state sequence; there are no actions involved. But what if we have to take an action at each step that, in turn, will affect the state of the world?
In the MDP problem statement, you either know these, or you learn them from data: the set of states s, the set of actions a, the transition model P(s' | s, a), and the reward function R(s).
Example: a game show. You face a series of questions with increasing difficulty and increasing payoff: a $100 question (Q1), a $1,000 question (Q2), a $10,000 question (Q3), and a $50,000 question (Q4). An incorrect answer pays $0, and before each question you may quit and keep your earnings so far: quitting before Q2, Q3, or Q4 pays $100, $1,100, or $11,100 respectively, while answering Q4 correctly pays $61,100. The probabilities of answering correctly are 9/10 for Q1, 3/4 for Q2, 1/2 for Q3, and 1/10 for Q4.

Should you attempt the $50,000 question? The expected payoff of answering is 1/10 × $61,100 + 9/10 × $0 = $6,110, which is less than the $11,100 you keep by quitting, so you should quit.

Working backward gives the utility of facing each question:
U(Q4) = max($11,100, $6,110) = $11,100
U(Q3) = max($1,100, 1/2 × $11,100) = $5,550
U(Q2) = max($100, 3/4 × $5,550) = $4,162
U(Q1) = 9/10 × $4,162 = $3,746
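The same backward induction is easy to check in code. A minimal Python sketch; the variable names and data layout are my own, not from the slides:

```python
# Backward induction for the game show example.
p_correct = [9/10, 3/4, 1/2, 1/10]   # chance of answering Q1..Q4 correctly
quit_value = [0, 100, 1100, 11100]   # winnings if you quit instead of answering Qi
jackpot = 61100                      # winnings after answering all four correctly

U = [0.0] * 4
future = jackpot                     # utility of the state after the last question
for i in reversed(range(4)):
    attempt = p_correct[i] * future  # a wrong answer pays $0
    U[i] = max(quit_value[i], attempt)
    future = U[i]

print(U)  # [3746.25, 4162.5, 5550.0, 11100.0]
```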
Grid world example: R(s) = -0.04 for every non-terminal state. Transition model: the agent moves in the intended direction with probability 0.8, and slips to either perpendicular direction with probability 0.1 each.
Source: P. Abbeel and D. Klein
Optimal policy when R(s) = -0.04 for every non-terminal state
Defining the optimal policy: given a policy π, we can define its expected utility over all possible state sequences produced by following that policy:

U^\pi(s_0) = \sum_{\text{state sequences starting from } s_0} P(\text{sequence} \mid s_0, \pi)\, U(\text{sequence})
Utilities of state sequences: normally, we would define the utility of a state sequence as the sum of the rewards of the individual states. The problem: for infinite state sequences this sum can diverge. The solution: discount each state reward by a factor γ between 0 and 1:

U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(s_t) \le \frac{R_{\max}}{1 - \gamma}

(the bound follows from the geometric series \sum_t \gamma^t = 1/(1-\gamma), where R_max is the largest reward magnitude).
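As a quick numeric check of the bound, here is a small Python sketch; the value of gamma and the reward sequence are made up for illustration:

```python
gamma = 0.9
rewards = [-0.04] * 50                  # a 50-step run of "living rewards"

# U([s0, s1, ...]) = sum_t gamma^t R(s_t)
U = sum(gamma**t * r for t, r in enumerate(rewards))

# The geometric-series bound: |U| <= R_max / (1 - gamma)
r_max = max(abs(r) for r in rewards)
assert abs(U) <= r_max / (1 - gamma)
print(U)                                # about -0.398, within the bound of 0.4
```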
Defining the utilities of states: the utility U(s) of a state is the expected discounted sum of rewards over the state sequences starting from s, assuming the agent acts optimally. Picture an expectimax-style tree: each max node is a state s where the agent picks an action a; each chance node is a state-action pair, from which the environment selects a successor s' with probability P(s' | s, a); each successor has utility U(s').

So what is the expected utility of taking action a in state s? It is \sum_{s'} P(s' \mid s, a)\, U(s'). And how do we choose the optimal action? Take the action that maximizes this expected utility:

\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
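A minimal Python sketch of these two definitions, for a generic MDP stored as nested dictionaries (the layout P[s][a] = [(s_next, prob), ...] is my own choice, not from the slides):

```python
def expected_utility(a, s, U, P):
    """Expected utility of taking action a in state s: sum_s' P(s'|s,a) * U(s')."""
    return sum(p * U[s2] for s2, p in P[s][a])

def optimal_action(s, U, P):
    """pi*(s): the action that maximizes the expected utility of the outcome."""
    return max(P[s], key=lambda a: expected_utility(a, s, U, P))
```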
The Bellman equation. In state s you receive reward R(s), choose the optimal action a, end up in s' with probability P(s' | s, a), and get utility U(s') (discounted by γ). Putting these pieces together yields a recursive equation that the utility of every state must satisfy:

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
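Continuing the sketch above (it reuses expected_utility), one Bellman backup for a single state is just:

```python
def bellman_backup(s, U, P, R, gamma):
    """Right-hand side of the Bellman equation:
    R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    return R[s] + gamma * max(expected_utility(a, s, U, P) for a in P[s])
```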
Solving the Bellman equation calls for an iterative solution method (is there a globally optimum solution?). We might instead think of evaluating every possible state sequence directly, but that would run into trouble with infinite sequences.
Method 1: value iteration. Start out with U_0(s) = 0 for every state, then on each iteration update all the utilities simultaneously according to this rule:

U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

This converges to the unique solution of the Bellman equation, although strictly speaking that takes an infinite number of iterations…
Value iteration demo
[Figure: the input grid world (non-terminal R = -0.04), the utilities computed with discount factor 1, and the final policy.]
The extracted policy becomes optimal after a finite number of steps, as long as the state space and action set are both finite.
Method 2: policy iteration. Start with some initial policy π_0 and alternate between two steps:

Policy evaluation: calculate the utility U^{π_i}(s) of every state s under the current policy π_i.

Policy improvement: calculate a new policy π_{i+1} that is greedy with respect to those utilities:

\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s')
Policy evaluation is easier than solving the Bellman equation itself. Knowing the utilities of all states tells you the optimum policy, but the Bellman equation is N nonlinear equations in N unknowns (the utilities), and therefore it can't be solved in closed form. Once the policy is fixed, the action taken in each given state is known, the max disappears, and we are left with N linear equations in N unknowns:

U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')

which can be solved exactly in a finite number of steps, in time O(N^3).
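A Python sketch of policy iteration, reusing the grid-world helpers (states, transitions, TERMINALS, ACTIONS, LIVING_REWARD) from the value-iteration sketch above; exact policy evaluation is done with a linear solve, and using γ = 0.99 instead of 1 is my choice to keep the linear system well conditioned:

```python
import numpy as np

def policy_iteration(gamma=0.99):
    S = [s for s in states() if s not in TERMINALS]
    idx = {s: i for i, s in enumerate(S)}
    pi = {s: 'N' for s in S}                 # arbitrary initial policy
    while True:
        # Policy evaluation: U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')
        # is N *linear* equations in N unknowns, solved here exactly.
        A = np.eye(len(S))
        b = np.full(len(S), LIVING_REWARD)
        for s in S:
            for s2, p in transitions(s, pi[s]):
                if s2 in TERMINALS:
                    b[idx[s]] += gamma * p * TERMINALS[s2]
                else:
                    A[idx[s], idx[s2]] -= gamma * p
        U = dict(zip(S, np.linalg.solve(A, b)))
        U.update(TERMINALS)
        # Policy improvement: act greedily with respect to the new utilities.
        new_pi = {s: max(ACTIONS, key=lambda a: sum(p * U[s2]
                                                    for s2, p in transitions(s, a)))
                  for s in S}
        if new_pi == pi:                     # policy stable: it is optimal
            return pi, U
        pi = new_pi

pi, U = policy_iteration()
print(pi)
```

Because there are only finitely many policies and each improvement step is strict until convergence, the loop terminates after a finite number of iterations.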