Reinforcement Learning
George Konidaris gdk@cs.brown.edu
Fall 2019
Machine Learning
Subfield of AI concerned with learning from data. Broadly, using experience to improve performance on some task (Tom Mitchell, 1997).
Developing effective learning methods has proved difficult. Why bother?
Autonomous discovery
Hard to program
Adaptive behavior: coping with changing circumstances.
The type of learning depends on the feedback available:
Labeled data: supervised learning.
No feedback, just data: unsupervised learning.
Sequential data, weak labels: reinforcement learning.
Input: training data consisting of inputs $X = \{x_1, \ldots, x_n\}$ and labels $Y = \{y_1, \ldots, y_n\}$.
Learn to predict new labels. Given $x$: $y$?
Input: inputs $X = \{x_1, \ldots, x_n\}$ only.
Try to understand the structure of the data. E.g., how many types of cars? How can they vary?
Learning counterpart of planning: find a policy $\pi : S \to A$ that maximizes the return
$$\max_\pi R = \sum_{t=0}^{\infty} \gamma^t r_t$$
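Agent interacts with an environment. At each time $t$, the agent:
observes state $s_t$;
executes action $a_t$;
receives reward $r_t$ as the environment transitions to state $s_{t+1}$.
Goal: find a policy $\pi$ that maximizes expected return (sum of discounted future rewards):
$$\max_\pi E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$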
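As a small illustration of the quantity being maximized, the sketch below computes the discounted sum for a finite reward sequence. The function name `discounted_return` and the sample rewards are invented for this example, and a finite list stands in for the infinite sum.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward list."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_0, r_1, r_2
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With a discount factor below 1, later rewards count for less, which is what keeps the infinite sum finite when rewards are bounded.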
The problem is described by a tuple $\langle S, A, \gamma, R, T \rangle$:
$S$: set of states
$A$: set of actions
$\gamma$: discount factor
$R$: reward function. $R(s, a, s')$ is the reward received taking action $a$ from state $s$ and transitioning to state $s'$.
$T$: transition function. $T(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
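One possible way to hold this tuple in code is sketched below. The class name `MDP`, the helper `is_valid`, and the two-state toy problem are all hypothetical; the only thing taken from the slides is the five components and the requirement that $T(\cdot \mid s, a)$ is a probability distribution.

```python
from typing import Dict, List, NamedTuple, Tuple


class MDP(NamedTuple):
    states: List[str]
    actions: List[str]
    gamma: float
    # R[(s, a, s_next)] -> reward; missing entries are treated as 0 by users.
    R: Dict[Tuple[str, str, str], float]
    # T[(s, a)] -> {s_next: probability}
    T: Dict[Tuple[str, str], Dict[str, float]]


def is_valid(mdp: MDP) -> bool:
    """Check that every T(. | s, a) is a distribution over known states."""
    for s in mdp.states:
        for a in mdp.actions:
            dist = mdp.T[(s, a)]
            if any(p < 0 or s2 not in mdp.states for s2, p in dist.items()):
                return False
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return True


# Hypothetical two-state problem used only for illustration.
toy = MDP(
    states=["left", "right"],
    actions=["stay", "move"],
    gamma=0.9,
    R={("left", "move", "right"): 1.0},
    T={
        ("left", "stay"): {"left": 1.0},
        ("left", "move"): {"right": 0.8, "left": 0.2},
        ("right", "stay"): {"right": 1.0},
        ("right", "move"): {"left": 0.8, "right": 0.2},
    },
)
print(is_valid(toy))
```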
In planning: $T$ and $R$ are given.
In reinforcement learning: $T$ and $R$ are unknown; the agent must learn from interaction.
This formulation is general enough to encompass a wide variety of learned control problems.
As before, our target is a policy $\pi : S \to A$. A policy maps states to actions. The optimal policy maximizes:
$$\max_\pi \; \forall s, \; E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
This means that we wish to find a policy that maximizes the return from every state.
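In planning, we used policy iteration to find an optimal policy.
1. Start with a policy $\pi$.
2. Estimate $V^\pi$.
3. Improve $\pi$:
   a. $\pi(s) = \arg\max_a E\left[r + \gamma V^\pi(s')\right], \ \forall s$
Repeat.
More precisely, we use a value function:
$$V^\pi(s) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s\right]$$
… then we would update $\pi$ by computing:
$$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ r(s, a, s') + \gamma V^\pi(s') \right]$$
In reinforcement learning we can't do this anymore, because $T$ and $r$ are not known to the agent.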
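To make the planning-time update concrete, here is a minimal sketch of the greedy improvement step when the model is available. The function name `improve_policy`, the dictionary layout of `T`, and the two-state example are assumptions made for illustration, not part of the slides.

```python
def improve_policy(states, actions, T, r, V, gamma):
    """Greedy lookahead: pi(s) = argmax_a sum_s' T(s,a,s') [r(s,a,s') + gamma V(s')].

    T[(s, a)] is a dict {s_next: probability}; r is a function r(s, a, s_next);
    V is a dict of current state-value estimates.
    """
    policy = {}
    for s in states:
        def backup(a, s=s):
            # Expected one-step reward plus discounted value of the successor.
            return sum(p * (r(s, a, s2) + gamma * V[s2])
                       for s2, p in T[(s, a)].items())
        policy[s] = max(actions, key=backup)
    return policy


# Hypothetical deterministic two-state model: entering "B" pays 1.
states, actions = ["A", "B"], ["stay", "go"]
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
reward = lambda s, a, s2: 1.0 if s2 == "B" else 0.0
print(improve_policy(states, actions, T, reward, {"A": 0.0, "B": 0.0}, 0.9))
```

The sum over `T[(s, a)]` is exactly the part a learning agent cannot evaluate, since it has no access to the transition probabilities.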
For learning, we use a state-action value function as follows:
$$Q^\pi(s, a) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s, a_0 = a\right]$$
This is the value of executing $a$ in state $s$, then following $\pi$. Note that $V^\pi(s) = Q^\pi(s, \pi(s))$.
This leads to a general policy improvement framework:
1. Start with a policy $\pi$.
2. Learn $Q^\pi$.
3. Improve $\pi$:
   a. $\pi(s) = \arg\max_a Q(s, a), \ \forall s$
Repeat.
Steps 2 and 3 can be interleaved as rapidly as you like. Usually, perform 3a every time step.
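Step 3a needs no model at all once $Q$ is stored, which the sketch below tries to show. The table layout (a dict keyed by `(state, action)`), the function names, and the numbers are assumptions for illustration only.

```python
def greedy_policy(Q, states, actions):
    """pi(s) = argmax_a Q(s, a) for every state, read directly from the table."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


def state_value(Q, policy, s):
    """V(s) = Q(s, pi(s))."""
    return Q[(s, policy[s])]


# Hypothetical Q-table for two states and two actions.
Q = {("A", "stay"): 0.1, ("A", "go"): 0.7,
     ("B", "stay"): 0.9, ("B", "go"): 0.2}
pi = greedy_policy(Q, ["A", "B"], ["stay", "go"])
print(pi, state_value(Q, pi, "A"))
```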
Learning $Q$ proceeds by gathering samples of $Q(s, a)$. Methods differ in how the samples are obtained and used.
Simplest thing you can do: sample the return $R(s)$. Do this repeatedly, average values:
$$Q(s, a) = \frac{R_1(s) + R_2(s) + \cdots + R_n(s)}{n}$$
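The averaging above can be sketched as a small table of running sums. The class name `MonteCarloQ` and its methods are hypothetical, and the sketch assumes each sampled return has already been computed from a complete episode that started by taking $a$ in $s$.

```python
from collections import defaultdict


class MonteCarloQ:
    """Estimate Q(s, a) as the plain average of sampled returns."""

    def __init__(self):
        self.total = defaultdict(float)  # sum of returns seen for (s, a)
        self.count = defaultdict(int)    # number of returns seen for (s, a)

    def add_sample(self, s, a, sampled_return):
        self.total[(s, a)] += sampled_return
        self.count[(s, a)] += 1

    def value(self, s, a):
        n = self.count[(s, a)]
        # With no samples yet, fall back to 0 (an arbitrary default).
        return self.total[(s, a)] / n if n else 0.0


mc = MonteCarloQ()
for ret in [1.0, 3.0]:  # hypothetical sampled returns
    mc.add_sample("A", "go", ret)
print(mc.value("A", "go"))  # (1 + 3) / 2 = 2.0
```

Each sample requires waiting until the end of an episode, which is what motivates looking for more immediate samples.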
Where can we get more (immediate) samples? Idea: use the Bellman equation.
$$Q^\pi(s, a) = E_{s'}\left[ r(s, a, s') + \gamma Q^\pi(s', \pi(s')) \right]$$
That is: value of this state = reward + discounted value of the next state.
Ideally and in expectation:
$$r_i + \gamma Q(s_{i+1}, a_{i+1}) - Q(s_i, a_i) = 0$$
$Q$ is correct if this holds in expectation for all states. When it does not, the left-hand side is the temporal difference error.
Each observed step $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ therefore gives a sample:
$$Q(s_t, a_t) \leftarrow r_t + \gamma Q(s_{t+1}, a_{t+1})$$
Sarsa: very simple algorithm. For each observed transition $(s, a, r, s', a')$:
$$\delta = r + \gamma Q(s', a') - Q(s, a)$$
$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta$$
$Q(s', a')$ is zero by definition if $s'$ is absorbing.
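A minimal sketch of the Sarsa update follows. The function name `sarsa_update`, the `terminal` flag used to signal an absorbing next state, and the sample transition are assumptions for illustration; the table is a `defaultdict` so unseen pairs start at 0.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply one Sarsa update in place and return the TD error delta."""
    # The value of an absorbing (terminal) next state is zero by definition.
    next_value = 0.0 if terminal else Q[(s_next, a_next)]
    delta = r + gamma * next_value - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


Q = defaultdict(float)
# Hypothetical transition: in s0 take a0, get reward 1, land in s1 and pick a1.
d1 = sarsa_update(Q, "s0", "a0", 1.0, "s1", "a1", alpha=0.5, gamma=0.9)
d2 = sarsa_update(Q, "s0", "a0", 1.0, "s1", "a1", alpha=0.5, gamma=0.9)
print(d1, d2, Q[("s0", "a0")])
```

Repeating the same transition shrinks the TD error, since $Q(s, a)$ moves a fraction $\alpha$ of the way toward the sampled target each time.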
Always $\max_a Q(s, a)$?
What if your current knowledge is wrong? How are you going to find out?
Exploration is mandatory if you want to find the optimal solution, but every exploratory action may sacrifice reward.
Exploration vs. Exploitation - when to try new things? Consistent theme of RL.
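How to balance? Simplest, most popular approach: $\epsilon$-greedy. Instead of always being greedy:
Act greedily, $\arg\max_a Q(s, a)$, with probability $(1 - \epsilon)$.
Explore (choose a random action) with probability $\epsilon$.
Typically $\epsilon \approx 0.1$.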
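The action-selection rule can be sketched in a few lines using Python's standard `random` module. The function name `epsilon_greedy`, the `rng` parameter, and the example table are assumptions for illustration; note that the random branch may also happen to pick the greedy action.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit


# Hypothetical Q-values for a single state.
Q = {("s", "left"): 0.2, ("s", "right"): 0.8}
rng = random.Random(0)  # seeded generator so runs are repeatable
picks = [epsilon_greedy(Q, "s", ["left", "right"], 0.1, rng) for _ in range(1000)]
print(picks.count("right") / len(picks))  # mostly the greedy action
```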
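TD and MC are two extremes of obtaining samples of $Q$:
TD: at each step $t = 1, 2, \ldots, L$, the sample is $r + \gamma V$ (one reward plus the estimated value of the next state).
MC: the sample is the full discounted return $\sum_i \gamma^i r_i$ over steps $t = 1, \ldots, L$.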
We can generalize this to the idea of an n-step rollout:
$$R^{(1)}_{s_t} = r_t + \gamma Q(s_{t+1}, a_{t+1})$$
$$R^{(2)}_{s_t} = r_t + \gamma r_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})$$
$$\vdots$$
$$R^{(n)}_{s_t} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n Q(s_{t+n}, a_{t+n})$$
Each tells us something about the value function.
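The n-step rollout target is short enough to compute directly, as in the sketch below. The function name `n_step_return` and its argument layout are assumptions for illustration: `rewards[i]` plays the role of $r_{t+i}$ and `bootstrap_value` plays the role of $Q(s_{t+n}, a_{t+n})$.

```python
def n_step_return(rewards, bootstrap_value, gamma, n):
    """R^(n) = sum_{i<n} gamma^i r_{t+i} + gamma^n Q(s_{t+n}, a_{t+n})."""
    if len(rewards) < n:
        raise ValueError("need at least n rewards")
    # First n discounted rewards actually observed along the trajectory.
    observed = sum(gamma ** i * rewards[i] for i in range(n))
    # The rest of the return is filled in by the current estimate of Q.
    return observed + gamma ** n * bootstrap_value


# Hypothetical numbers: two rewards of 1, gamma = 0.5, Q estimate of 4.
print(n_step_return([1.0, 1.0], 4.0, 0.5, n=1))  # 1 + 0.5*4 = 3.0
print(n_step_return([1.0, 1.0], 4.0, 0.5, n=2))  # 1 + 0.5 + 0.25*4 = 2.5
```

With `n=1` this is the TD target; as `n` grows toward the episode length it approaches the Monte Carlo return.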
Weighted sum of the n-step returns:
$$R^{(1)} = r_0 + \gamma Q(s_1, a_1)$$
$$R^{(2)} = r_0 + \gamma r_1 + \gamma^2 Q(s_2, a_2)$$
$$\vdots$$
$$R^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_i + \gamma^n Q(s_n, a_n)$$
with geometrically decaying weights $1, \lambda, \lambda^2, \ldots$
Estimator: the weighted sum of these returns. This is called the $\lambda$-return.
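The exact estimator formula did not survive in this transcript, so the sketch below uses the commonly stated normalized form for an episode that ends after $N$ steps, $R^\lambda = (1 - \lambda) \sum_{n=1}^{N-1} \lambda^{n-1} R^{(n)} + \lambda^{N-1} R^{(N)}$; treat that choice, the function name `lambda_return`, and the argument layout as assumptions.

```python
def lambda_return(rewards, q_values, gamma, lam):
    """Lambda-return for a finite episode (assumed normalized form).

    rewards[i]  : r_i for i = 0 .. N-1 (the episode ends after N steps)
    q_values[i] : Q(s_{i+1}, a_{i+1}) for i = 0 .. N-2, used to bootstrap
    """
    N = len(rewards)

    def n_step(n):
        ret = sum(gamma ** i * rewards[i] for i in range(n))
        if n < N:  # past the end of the episode there is nothing to bootstrap
            ret += gamma ** n * q_values[n - 1]
        return ret

    weighted = (1 - lam) * sum(lam ** (n - 1) * n_step(n) for n in range(1, N))
    # Remaining weight goes to the full (Monte Carlo) return.
    return weighted + lam ** (N - 1) * n_step(N)


# Hypothetical episode: two rewards of 1, Q(s_1, a_1) = 4, gamma = 0.5.
print(lambda_return([1.0, 1.0], [4.0], 0.5, lam=0.5))  # 0.5*3 + 0.5*1.5 = 2.25
```

Setting `lam=0` recovers the one-step TD target and `lam=1` recovers the Monte Carlo return, matching the two extremes described earlier.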
Each state-action pair has an eligibility trace $e(s, a)$.
At time $t$:
$e(s_t, a_t) = 1$
$e(s, a) = \gamma \lambda e(s, a)$, for all other $(s, a)$ pairs.
At end of episode:
$e(s, a) = 0$, for all $(s, a)$ pairs.
When updating:
$$\delta = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
where $Q(s_{t+1}, a_{t+1})$ is zero by definition if $s_{t+1}$ is absorbing, and
$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta e(s, a), \quad \forall s, a$$
(don't forget!)
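The trace bookkeeping can be sketched as a single per-step function. The name `sarsa_lambda_step`, the `terminal` flag, and the two-step example episode are assumptions for illustration; traces are kept in a plain dict that only holds pairs visited in the current episode.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha, gamma, lam, terminal=False):
    """One Sarsa(lambda) step with traces reset to 1 on visit; returns delta."""
    # Decay every existing trace, then set the current pair's trace to 1.
    for key in list(e):
        e[key] *= gamma * lam
    e[(s, a)] = 1.0

    next_value = 0.0 if terminal else Q[(s_next, a_next)]
    delta = r + gamma * next_value - Q[(s, a)]

    # Every pair with a nonzero trace shares in the update.
    for key, trace in e.items():
        Q[key] += alpha * delta * trace

    if terminal:
        e.clear()  # traces are zeroed at the end of the episode
    return delta


Q, e = defaultdict(float), {}
# Hypothetical two-step episode: no reward first, then reward 1 at the end.
sarsa_lambda_step(Q, e, "s0", "a0", 0.0, "s1", "a1", 0.5, 1.0, 0.5)
sarsa_lambda_step(Q, e, "s1", "a1", 1.0, None, None, 0.5, 1.0, 0.5, terminal=True)
print(Q[("s0", "a0")], Q[("s1", "a1")])  # 0.25 0.5
```

In this example the final reward reaches the first pair in the same step, scaled by its decayed trace, whereas one-step Sarsa would need a second episode to propagate it.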