The Infinite (Indefinite) Case
} If the time horizon T is infinite, then the sum of rewards:
R_t = r_{t+1} + r_{t+2} + … + r_T
can be infinitely large (or infinitely negative), too!
} For example, suppose a robot is exploring Mars
} Whenever it collects a valuable sample, it gets a reward of +100
} Less valuable samples only give it +1 (everything else is just 0)
} Now, if the problem is indefinite-horizon, it doesn’t matter what the robot does: all policies give it the same value (+∞), even if it ignores every valuable sample (a quick numeric sketch follows below)
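To make this concrete, here is a minimal sketch; the two stylized policies and the step counts are assumptions for illustration (the +100/+1 rewards come from the example above). Without discounting, the partial sums for both a sample-collecting policy and a mostly-idle one grow without bound:

# Undiscounted partial returns for two stylized rover policies: one that
# finds a valuable sample (+100) every step, one that only ever finds a
# minor sample (+1) every step.

def partial_return(reward_per_step, num_steps):
    # Sum of a constant per-step reward over num_steps steps, no discounting.
    return reward_per_step * num_steps

for steps in (10, 1_000, 1_000_000):
    good = partial_return(100, steps)
    lazy = partial_return(1, steps)
    print(f"{steps:>9} steps: good policy = {good:>11}, lazy policy = {lazy:>9}")

# Both sums grow without bound as steps increases, so over an infinite
# horizon both policies have the same "value" (+infinity): the plain sum
# cannot distinguish a good policy from a bad one.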
Discounted Reward
} To solve the problem of future reward in MDPs, we therefore introduce a discount rate, γ (gamma), which is some number between 0 and 1
} The reward we get is then weighted by the discount rate (see the formula below)
} If our time horizon is finite, we can set gamma to 1; if it is infinite, we always make sure that gamma is less than 1
} What happens if gamma = 0?
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ··· = Σ_{k=0}^∞ γ^k r_{t+k+1}
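A quick numeric sketch of the effect of γ (the constant reward stream and the particular γ values are assumptions for illustration): with a constant reward of 1 per step, the discounted return converges to the geometric-series limit 1/(1 − γ) whenever γ < 1, while γ = 0 keeps only the immediate reward.

# Discounted return R_t = sum_k gamma^k * r_{t+k+1}, truncated at a
# finite (but long) number of steps.

def discounted_return(rewards, gamma):
    # rewards[0] + gamma*rewards[1] + gamma^2*rewards[2] + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 10_000   # long constant reward stream, stand-in for an infinite one

for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma = {gamma}: return ~= {discounted_return(rewards, gamma):.2f}"
          f"  (geometric limit 1/(1 - gamma) = {1.0 / (1.0 - gamma):.2f})")

# gamma = 0 keeps only the immediate reward r_{t+1}; any gamma < 1 makes
# the infinite sum converge; gamma close to 1 approaches the undiscounted sum.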
Policy Values in MDPs
} Suppose we have a policy π for an MDP
} A policy is a function from states to actions (first equation below)
} We can calculate the expected value we get by starting in some state s, at time t, and then following policy π for T steps (second equation below)
} E_π{…} is the expectation for the value of {…} if we follow policy π
} If the domain is not probabilistic, then we know this exactly; otherwise we can calculate it using probability/decision theory
π : S → A

U^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{T−1} γ^k r_{t+k+1} | s_t = s }
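One way to compute this expectation is to estimate it by simulation. Here is a minimal sketch, assuming a tiny made-up two-state MDP, a fixed policy, and γ = 0.9 (all of these numbers are invented for illustration): it averages the discounted return over many sampled rollouts that follow π.

import random

# A tiny made-up MDP: transitions[state][action] is a list of
# (probability, next_state, reward) triples.
transitions = {
    "s0": {"go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"go": [(1.0, "s0", 0.0)]},
}
policy = {"s0": "go", "s1": "go"}      # pi : S -> A
gamma = 0.9

def sample_step(state, action):
    # Sample (next_state, reward) from the transition distribution.
    r = random.random()
    cumulative = 0.0
    for prob, nxt, reward in transitions[state][action]:
        cumulative += prob
        if r <= cumulative:
            return nxt, reward
    return nxt, reward  # guard against floating-point rounding

def rollout_return(start, horizon):
    # Discounted return of one rollout that follows the policy for `horizon` steps.
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        state, reward = sample_step(state, policy[state])
        total += discount * reward
        discount *= gamma
    return total

# Monte Carlo estimate of U^pi(s0): average the return over many rollouts.
estimate = sum(rollout_return("s0", 200) for _ in range(5_000)) / 5_000
print(f"U^pi(s0) ~= {estimate:.3f}")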
The Bellman Equation
} Using basic algebra, the expected value of starting in some state, s, can be calculated, via dynamic programming, based on the next possible state(s) we can reach if we take the action dictated by our policy, π(s) = a:
} Which means that we can define the policy-value for state s recursively, based on the policy-value of any next state s′ that we can get to when we follow that policy:
U^π(s) = E_π{ Σ_{k=0}^{T−1} γ^k r_{t+k+1} }
       = E_π{ r_{t+1} + γ Σ_{k=0}^{T−2} γ^k r_{t+k+2} }
       = Σ_{s′} P(s, π(s), s′) [ R(s, π(s), s′) + γ E_π{ Σ_{k=0}^{T−2} γ^k r_{t+k+2} } ]
       = Σ_{s′} P(s, π(s), s′) [ R(s, π(s), s′) + γ U^π(s′) ]
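The recursive form suggests computing U^π directly by dynamic programming: repeatedly apply the right-hand side as an update until the values stop changing (iterative policy evaluation). A minimal sketch, reusing the same made-up two-state MDP and γ = 0.9 assumed in the previous sketch:

# Iterative policy evaluation: apply the Bellman equation as an update
# rule over all states until the values converge.

transitions = {
    "s0": {"go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"go": [(1.0, "s0", 0.0)]},
}
policy = {"s0": "go", "s1": "go"}
gamma = 0.9

U = {s: 0.0 for s in transitions}       # initial guess: U^pi(s) = 0

for sweep in range(1000):
    delta = 0.0
    for s in transitions:
        # U(s) <- sum_{s'} P(s, pi(s), s') * [R(s, pi(s), s') + gamma * U(s')]
        new_value = sum(prob * (reward + gamma * U[nxt])
                        for prob, nxt, reward in transitions[s][policy[s]])
        delta = max(delta, abs(new_value - U[s]))
        U[s] = new_value
    if delta < 1e-9:                    # values have stopped changing
        break

print({s: round(v, 3) for s, v in U.items()})

With these assumed numbers, the fixed point it reaches should match the Monte Carlo estimate from the previous sketch, up to sampling noise.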