CS 343H: Honors AI
Lecture 10: MDPs I
2/18/2014
Kristen Grauman, UT Austin
Slides courtesy of Dan Klein, UC Berkeley, unless otherwise noted.
Some context
First weeks: search (BFS, A*, minimax, alpha-beta): find an optimal plan (or solution) and execute it.
Now: action outcomes are uncertain, so rather than committing to a fixed plan up front, the agent must decide what to do at every step.
Slide credit: Peter Stone
Example: Grid World
Noisy movement: actions do not always go as planned:
  80% of the time, the action North takes the agent North (if there is no wall there)
  10% of the time, North takes the agent West; 10% East
  If there is a wall in the direction the agent would have been taken, the agent stays put
[Figure: Deterministic Grid World vs. Stochastic Grid World — the same action (N, S, E, W) leads to a single successor in the deterministic world and to a distribution over successors in the stochastic world]
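To make the noisy movement model concrete, here is a minimal sketch (not from the slides; the grid representation and names are illustrative). The slides specify the slip behavior only for North; the sketch assumes the same left/right slip pattern for every action.

```python
# Illustrative noisy transition model for the stochastic grid world.
# States are (x, y) cells; `walls` is a set of blocked cells.

DELTAS   = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF  = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

def transitions(state, action, walls):
    """Return a list of (next_state, probability) pairs:
    80% intended direction, 10% slip to each side; bumping
    into a wall leaves the agent where it is."""
    def move(s, a):
        nxt = (s[0] + DELTAS[a][0], s[1] + DELTAS[a][1])
        return s if nxt in walls else nxt
    return [(move(state, action),           0.8),
            (move(state, LEFT_OF[action]),  0.1),
            (move(state, RIGHT_OF[action]), 0.1)]
```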
We could handle this stochasticity with expectimax search, but we'll have a new tool soon.
"Markov" means that given the present state, the future and the past are independent: action outcomes depend only on the current state (Andrey Markov, 1856-1922).
Policies
For MDPs, instead of an optimal plan we want an optimal policy π*: S → A. A policy π gives an action for each state; an optimal policy is one that maximizes expected utility if followed.
Optimal policy when R(s, a, s') = -0.03 for all non-terminals s.
[Figure: four optimal policies, for living rewards R(s) = -2.0, -0.4, -0.03, and -0.01]
Example: Stuart Russell
Example: Racing
A robot car wants to travel far, quickly. Three states: Cool, Warm, Overheated; two actions: Slow, Fast (going faster earns more reward but risks overheating).

  State   Action   Successors                 Reward
  Cool    Slow     Cool (1.0)                 +1
  Cool    Fast     Cool (0.5), Warm (0.5)     +2
  Warm    Slow     Cool (0.5), Warm (0.5)     +1
  Warm    Fast     Overheated (1.0)           -10
  Overheated: terminal
a s s’ s, a (s,a,s’) called a transition T(s,a,s’) = P(s’|s,a) R(s,a,s’) s,a,s’ s is a state (s, a) is a q-state
Discounting
It's reasonable to maximize the sum of rewards; it's also reasonable to prefer rewards now to rewards later. Solution: values of rewards decay exponentially — a reward is worth 1 now, γ one step from now, γ² two steps from now.
How to discount? Each time we descend a level in the search tree, we multiply in the discount once.
Why discount? Sooner rewards probably do have higher utility than later rewards. Discounting also helps our algorithms converge.
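For example (assuming γ = 0.5, a value chosen here for illustration): U([1,2,3]) = 1 + 0.5·2 + 0.5²·3 = 2.75, while U([3,2,1]) = 3 + 0.5·2 + 0.5²·1 = 4.25 — the same rewards are worth more when the large ones come first.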
Infinite utilities?
Problem: what if the game lasts forever — do we get infinite rewards? Solutions: a finite horizon (stop after a fixed number of steps); discounting with 0 < γ < 1, which keeps the sum of rewards bounded by max|R| / (1 - γ); or an absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing).
Optimal quantities
The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally.
The value (utility) of a q-state (s, a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally.
The optimal policy: π*(s) = optimal action from state s.
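Written recursively (the standard one-step-lookahead relations, stated here for completeness and consistent with the search tree above):
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]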
[Demo figures: gridworld utilities (values), the induced policy, and Q-values]
[Figure: computing time-limited values — V_0 at the leaves, with V_1, V_2, ... computed bottom-up, one expectimax layer per step]
Value iteration
Define V_k(s) to be the optimal value of s if the game ends in k more time steps.
Start with V_0(s) = 0 for all s, which we know is right (why?): with no time steps left, the expected reward sum is zero.
Given V_k, calculate the values for all states for depth k+1:
  V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
Repeat until convergence. Each update is one ply of expectimax: V_{k+1}(s) is computed from the V_k(s') values one layer down.
Example: value iteration on the racing MDP (assume no discount, γ = 1):

           Cool   Warm   Overheated
  V_0:     0      0      0
  V_1:     2      1      0
  V_2:     3.5    2.5    0

For V_2(Cool): Slow gives 1 + V_1(Cool) = 1 + 2 = 3; Fast gives 2 + 0.5·V_1(Cool) + 0.5·V_1(Warm) = 2 + 0.5·2 + 0.5·1 = 3.5. Take the max: V_2(Cool) = 3.5.
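Running the earlier sketch on the hypothetical racing encoding reproduces these numbers:

```python
V2 = value_iteration(['cool', 'warm', 'overheated'],
                     ['slow', 'fast'], T, R, gamma=1.0, iters=2)
# {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}
```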
Convergence
How do we know the V_k vectors will converge? If the tree has a maximum depth M, then V_M simply holds the exact values. If the discount is less than 1, sketch: for any state, V_k and V_{k+1} can be computed as depth k+1 expectimax, resulting in nearly identical search trees. The only difference is the bottom layer, where V_{k+1} has actual rewards while V_k has zeros. That last layer is discounted by γ^k, so V_k(s) and V_{k+1}(s) differ by at most γ^k max|R|, which goes to zero as k increases.