CS 188: Artificial Intelligence
Markov Decision Processes II
Instructor: Anca Dragan University of California, Berkeley
[These slides adapted from Dan Klein and Pieter Abbeel]
Recap: Defining MDPs
Markov decision processes:
- Set of states S (and a start state)
- Set of actions A
- Transition function T(s, a, s') (the probability that a from s leads to s')
- Reward function R(s, a, s'), and a discount γ
[Tree diagram over s, a, (s, a, s'), s': s is a state, (s, a) is a q-state, and (s, a, s') is a transition from s via action a to successor s'.]
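As a concrete reference, here is a minimal sketch of one way such an MDP could be represented in code. This is an illustration only: the class name, the dict-of-triples layout, and the method names are assumptions, not the course's actual data structures.

```python
# Illustrative MDP container: T and R are keyed by (s, a, s') triples.
from typing import Dict, List, Tuple

State = str
Action = str

class SimpleMDP:
    def __init__(self,
                 transitions: Dict[Tuple[State, Action, State], float],
                 rewards: Dict[Tuple[State, Action, State], float],
                 discount: float = 0.9):
        self.transitions = transitions   # T(s, a, s') = P(s' | s, a)
        self.rewards = rewards           # R(s, a, s')
        self.discount = discount
        self.states = sorted({s for (s, _, _) in transitions} |
                             {s2 for (_, _, s2) in transitions})

    def actions(self, s: State) -> List[Action]:
        """The actions available in s, i.e. the q-states (s, a) reachable from s."""
        return sorted({a for (s0, a, _) in self.transitions if s0 == s})

    def successors(self, s: State, a: Action) -> List[Tuple[State, float, float]]:
        """The transitions (s, a, s'), returned as (s', probability, reward) triples."""
        return [(s2, p, self.rewards.get((s0, a0, s2), 0.0))
                for (s0, a0, s2), p in self.transitions.items()
                if s0 == s and a0 == a]
```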
[Demo – gridworld values (L8D4)]
Noise = 0.2, Discount = 0.9, Living reward = 0
Time-limited values: define Vk(s) as the optimal value of s if the game ends in k more time steps; equivalently, it is what a depth-k expectimax would compute starting from s. [Tree diagram over s, a, (s, a, s'), s'.]
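Read literally as a depth-k expectimax, Vk(s) can be computed with a direct recursion. The sketch below is an assumption about interfaces, not lecture code: `actions(s)` returns the available actions and `successors(s, a)` returns (next state, probability, reward) triples, as in the illustrative class above.

```python
# Top-down, recursive computation of the time-limited value V_k(s): the best
# expected sum of (discounted) rewards if the game ends in k more time steps.
def v_k(s, k, actions, successors, discount=1.0):
    if k == 0 or not actions(s):          # no steps left, or a terminal state
        return 0.0
    return max(
        sum(p * (r + discount * v_k(s2, k - 1, actions, successors, discount))
            for (s2, p, r) in successors(s, a))
        for a in actions(s)
    )
```

Each call re-expands the whole depth-k tree, so shared subtrees are recomputed; value iteration below computes the same quantities bottom-up, reusing Vk to build Vk+1.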
[Demo – time-limited values (L8D6)]
[Gridworld snapshots of the time-limited values for successive k. Noise = 0.2, Discount = 0.9, Living reward = 0]
Value iteration: start with V0(s) = 0 (no time steps left means an expected reward sum of zero), then compute each Vk+1 from Vk with one ply of expectimax (a Bellman backup) over s, a, (s, a, s'), s':
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma\, V_k(s')\,]$
Example: value iteration on the racing car MDP (S = Slow, F = Fast; assume no discount!):
- V1(cool): S: 1; F: .5*2 + .5*2 = 2, so V1(cool) = 2
- V1(warm): S: .5*1 + .5*1 = 1; F: -10, so V1(warm) = 1
- V2(cool): S: 1 + 2 = 3; F: .5*(2+2) + .5*(2+1) = 3.5, so V2(cool) = 3.5
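To make the backup concrete, here is a small value-iteration sketch run on a racing car MDP reconstructed to match the numbers above. The state and action names and the exact transition table are illustrative assumptions; only the resulting V1 and V2 values come from the slide.

```python
# V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
# Assumed toy MDP: a race car that can drive slow or fast; outcomes are
# (probability, next_state, reward) triples, and 'overheated' is terminal.
MDP = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},                              # terminal: no actions
}

def value_iteration_step(V, mdp, gamma):
    """One Bellman backup per state; returns the new value table V_{k+1}."""
    new_V = {}
    for s, actions in mdp.items():
        if not actions:                            # terminal states keep value 0
            new_V[s] = 0.0
            continue
        new_V[s] = max(
            sum(p * (r + gamma * V[s2]) for (p, s2, r) in outcomes)
            for outcomes in actions.values()
        )
    return new_V

V = {s: 0.0 for s in MDP}                          # V_0 = 0 everywhere
for k in range(1, 3):
    V = value_iteration_step(V, MDP, gamma=1.0)    # "assume no discount"
    print(k, V)
# k=1: cool = 2.0, warm = 1.0   (matches the hand computation above)
# k=2: cool = 3.5, warm = 2.5
```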
How do we know the Vk vectors will converge?
- Case 1: if the tree has maximum depth M, then VM holds the actual untruncated values.
- Case 2: if the discount is less than 1. Sketch: for any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees; the only difference is that on the bottom layer Vk+1 has actual rewards while Vk has zeros. That last layer is discounted by γ^k, so Vk and Vk+1 differ by at most γ^k max|R|, which goes to 0 as k grows, and the values converge.
[Demo: value iteration (L9D2)]
Noise = 0.2, Discount = 0.9, Living reward = 0
Policy iteration, an alternative way to reach optimal values:
- Step 1, policy evaluation: calculate utilities for some fixed policy (not the optimal utilities!) until convergence.
- Step 2, policy improvement: update the policy using one-step look-ahead, with the resulting converged (but not optimal!) utilities as future values.
- Repeat the two steps until the policy converges.
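A compact sketch of that loop, using the same illustrative dict layout as the value-iteration example (state -> action -> list of (probability, next state, reward) outcomes); none of these names come from the course code.

```python
# Policy iteration: alternate (1) policy evaluation to convergence with
# (2) one-step look-ahead policy improvement, until the policy stops changing.

def evaluate_policy(policy, mdp, gamma, tol=1e-6):
    """Iteratively compute V^pi for the fixed policy (note: no max over actions)."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:                        # terminal state
                continue
            outcomes = actions[policy[s]]
            v_new = sum(p * (r + gamma * V[s2]) for (p, s2, r) in outcomes)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def improve_policy(V, mdp, gamma):
    """One-step look-ahead: in each state pick the action with the best Q-value under V."""
    return {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for (p, s2, r) in actions[a]))
        for s, actions in mdp.items() if actions
    }

def policy_iteration(mdp, gamma):
    policy = {s: next(iter(a)) for s, a in mdp.items() if a}   # arbitrary initial policy
    while True:
        V = evaluate_policy(policy, mdp, gamma)
        new_policy = improve_policy(V, mdp, gamma)
        if new_policy == policy:                   # unchanged policy: we are done
            return policy, V
        policy = new_policy
```

With a discount below 1, the inner evaluation loop is guaranteed to converge, since each sweep is a contraction.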
Fixed policies: an expectimax tree maxes over all actions to compute the optimal values; if we fix a policy π(s), the tree becomes simpler, with only one action per state. [Two tree diagrams: on the left, branch over all actions a from s ('Do the optimal action'); on the right, follow π(s) from s ('Do what π says to do').]
Utilities for a fixed policy: another basic operation is computing the utility of a state s under a fixed (generally non-optimal) policy π. Define Vπ(s) = expected total discounted rewards starting in s and following π. Recursive relation (one-step look-ahead / Bellman equation):
$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s')\,[\,R(s, \pi(s), s') + \gamma\, V^{\pi}(s')\,]$
Policy evaluation, idea 1: turn the recursive Bellman equation into updates (like value iteration): start with V0π(s) = 0 and repeat
$V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')\,[\,R(s, \pi(s), s') + \gamma\, V^{\pi}_k(s')\,]$
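A minimal sketch of those updates on the same assumed racing car MDP, with an arbitrary fixed policy; the only change from the value-iteration step is that the max over actions is replaced by the single action π prescribes.

```python
# Fixed-policy evaluation, idea 1: Bellman updates with no max.
# V^pi_{k+1}(s) = sum_{s'} T(s, pi(s), s') [ R(s, pi(s), s') + gamma * V^pi_k(s') ]
RACING = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},                              # terminal
}

def policy_evaluation_step(V, policy, mdp, gamma):
    """One synchronous update of V^pi: follow only the action pi(s) picks."""
    return {
        s: (sum(p * (r + gamma * V[s2]) for (p, s2, r) in actions[policy[s]])
            if actions else 0.0)
        for s, actions in mdp.items()
    }

pi = {"cool": "slow", "warm": "slow"}              # a fixed (and here non-optimal) policy
V = {s: 0.0 for s in RACING}
for _ in range(200):                               # plenty of sweeps to converge here
    V = policy_evaluation_step(V, pi, RACING, gamma=0.9)
print(V)   # "always slow" earns reward 1 forever, so V^pi approaches 1 / (1 - 0.9) = 10
```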
Example: policy evaluation for two fixed policies, 'Always Go Right' and 'Always Go Forward'.
(Note: in these evaluation updates there is no max; we consider only one action per state, the one the policy picks, not all of them.)