our toy problem: the lookup table

  state    N    S    E    W
    1      0    0    0    0
    2      0    0    0    0
    3      0    0    0    0
    4      0    0    0    0
    5      0    0    0    0
    6      0    0    0    0
    7      0    0    0    0   ← home
    8      0    0    0    0
    9      0    0    0    0
our toy problem: the same lookup table drawn onto the 3×3 grid (cells 1 2 3 / 4 5 6 / 7 8 9, home at 7), with the four action values written around each cell; every entry starts at 0.
reward structure?

  move to 7 / home:                    +10
  move out of bounds:                   −5
  move to 5:                           −10
  move to any cell except 5 and 7:      −1
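As a side note, here is a minimal sketch of that reward structure in Python. The exact geometry and the rule that an out-of-bounds move leaves the agent in place are assumptions for illustration, as are the names GRID, MOVES and step.

# Toy grid world from the slides, as a sketch.
# Assumed layout (cells 1-9, cell 7 marked "home", cell 5 the -10 cell):
#   1 2 3
#   4 5 6
#   7 8 9
# Assumed rule: an out-of-bounds move leaves the agent where it is.

GRID = {s: divmod(s - 1, 3) for s in range(1, 10)}   # cell number -> (row, col)
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
HOME, BAD_CELL = 7, 5

def step(state, action):
    """One move: returns (next_state, reward, episode_done)."""
    row, col = GRID[state]
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < 3 and 0 <= new_col < 3):
        return state, -5, False            # out of bounds: -5, stay in place
    nxt = new_row * 3 + new_col + 1
    if nxt == HOME:
        return nxt, 10, True               # reaching 7 / home: +10, episode ends
    if nxt == BAD_CELL:
        return nxt, -10, False             # stepping onto cell 5: -10
    return nxt, -1, False                  # any other cell: -1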
let’s fix α = 0.1 (the learning rate) and γ = 0.5 (the discount); the lookup table still starts at all zeros.
the update rule:

  Q(s,a) \leftarrow Q(s,a) + \alpha \bigl( r_s^a + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr)

with α = 0.1, γ = 0.5 and, say, an ε-greedy behaviour policy… episode 1 begins...
the first move earns reward −1; the “?” entry for that state–action pair is updated from 0 to −0.1 (see the worked update just below).
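Assuming the updated entry and all of the next state's entries are still 0, that first update works out as:

  Q(s,a) \leftarrow 0 + 0.1 \bigl( -1 + 0.5 \cdot \max_{a'} Q(s',a') - 0 \bigr) = 0.1 \cdot (-1) = -0.1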
the next move goes out of bounds, reward −5; that entry becomes 0 + 0.1·(−5 + 0.5·0 − 0) = −0.5.
the next move lands on an ordinary cell, reward −1; that entry becomes 0 + 0.1·(−1) = −0.1.
the next move steps onto cell 5, reward −10; that entry becomes 0 + 0.1·(−10) = −1.
another ordinary move, reward −1; that entry becomes 0 + 0.1·(−1) = −0.1.
the final move reaches 7 / home, reward +10; that entry becomes 0 + 0.1·(10) = 1. episode 1 ends.
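What the slides just stepped through by hand looks roughly like the sketch below. It is an illustration under assumptions: it reuses the step() environment sketched after the reward slide, and the names Q, epsilon_greedy and run_episode are made up here, not taken from the slides.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.1     # learning rate, discount, exploration
ACTIONS = ["N", "S", "E", "W"]
Q = defaultdict(float)                    # (state, action) -> value, all start at 0

def epsilon_greedy(state):
    """Behaviour policy: mostly greedy w.r.t. Q, sometimes random."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run_episode(start_state):
    """One episode of the tabular update above; step() is the toy environment."""
    s, done = start_state, False
    while not done:
        a = epsilon_greedy(s)
        s_next, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

Calling run_episode repeatedly from different start states fills in the table in the same way the slides show.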
let’s work out the next episode, starting at state 4: go WEST and then SOUTH. how does the table change?
going WEST from 4 runs out of bounds (reward −5), so Q(4, W) = 0 + 0.1·(−5) = −0.5; going SOUTH from 4 reaches home (reward +10), so Q(4, S) = 0 + 0.1·(10) = 1. episode 2 ends.
and the next episode, starting at state 3: go WEST -> SOUTH -> WEST -> SOUTH
after those four moves the table picks up entries of −0.1, −1 and −0.05 along the path, and Q(4, S) climbs from 1 to 1.9. over time, the values will converge to the optimal ones!
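The last two moves of that episode show the bootstrapping at work. Assuming the grid layout above (WEST from 5 lands on 4, SOUTH from 4 reaches home), they work out as:

  Q(5,W) \leftarrow 0 + 0.1 \bigl( -1 + 0.5 \cdot \max_{a'} Q(4,a') - 0 \bigr) = 0.1 \bigl( -1 + 0.5 \cdot 1 \bigr) = -0.05

  Q(4,S) \leftarrow 1 + 0.1 \bigl( 10 + 0.5 \cdot 0 - 1 \bigr) = 1 + 0.9 = 1.9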
what we just saw was some episodes of Q-learning. values update towards the value of the optimal policy: the target comes from the value of the assumed best next action. this is off-policy learning.
and SARSA-learning? values update towards the value of the current policy: the target comes from the value of the actual next action. this is on-policy learning.
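To make the contrast concrete, here is a sketch of the two targets as Python functions; the function names and signatures are illustrative, not from any library.

def q_learning_target(Q, s_next, reward, actions, gamma):
    """Off-policy target: bootstrap from the assumed best next action."""
    return reward + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, s_next, a_next, reward, gamma):
    """On-policy target: bootstrap from the action the policy actually takes next."""
    return reward + gamma * Q[(s_next, a_next)]

# either way, the tabular update is:
#   Q[(s, a)] += alpha * (target - Q[(s, a)])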
Q vs SARSA

[figure: the behaviour learned by Q-learning and by SARSA on the same task, with ε = 0.1, γ = 1.0; SARSA learns from data generated by the target policy, Q-learning learns from data not generated by the target policy]

image by Andreas Tille (own work) [GFDL (www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
example credit: Travis DeWolf, https://studywolf.wordpress.com/ and https://git.io/vFBvv
Problem Decomposition
nested sub-problems: the solution to a sub-problem informs the solution to the whole problem
Bellman Expectation Backup
a system of linear equations; its solution is the value of the given policy

[backup diagrams for v(s) and q(s,a): the value of a node = Σ over paths of P(path) · Value(path)]

  v_\pi(s) = \sum_a \pi(a \mid s) \Bigl( r_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_\pi(s') \Bigr)

  q_\pi(s,a) = r_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a' \mid s') \, q_\pi(s',a')

the Bellman expectation equations, under a given policy
Bellman Optimality Backup
a system of non-linear equations; its solution is the value of the optimal policy

[backup diagrams for v*(s) and q*(s,a): the value of a node = Σ over paths of P(path) · Value(path), maximising over actions]

  v_*(s) = \max_a \Bigl( r_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_*(s') \Bigr)

  q_*(s,a) = r_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s',a')

the Bellman optimality equations, under the optimal policy
Value Based
Dynamic Programming
…using the Bellman equations as iterative updates
[figure: a small 2×2 grid (cells 1–4, one marked home) with actions N/S/E/W and move rewards of −5, −1, −10 and +10; what’s best to do?]
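A minimal sketch of that idea (value iteration) on the 3×3 toy grid from earlier, assuming the deterministic step(), MOVES and HOME sketched after the reward slide; the fixed number of sweeps and the variable names are arbitrary choices for illustration.

# Value iteration: apply the Bellman optimality backup as an iterative update,
#   v(s) <- max_a [ r + gamma * v(s') ],
# with deterministic moves, so P(s'|s,a) = 1 for the single resulting state.

GAMMA = 0.5
V = {s: 0.0 for s in range(1, 10)}

for _ in range(100):                     # or sweep until the largest change is tiny
    V_new = {}
    for s in V:
        if s == HOME:                    # terminal state: no future return
            V_new[s] = 0.0
            continue
        backups = []
        for a in MOVES:
            s_next, r, done = step(s, a)
            backups.append(r + GAMMA * (0.0 if done else V[s_next]))
        V_new[s] = max(backups)
    V = V_new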