Reinforcement Learning II
George Konidaris gdk@cs.brown.edu
Fall 2019
$$\max_\pi R = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad \pi : S \to A$$
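Agent interacts with an environment. At each time $t$, the agent receives sensor signal $s_t$, takes action $a_t$, moves to state $s_{t+1}$, and receives reward $r_t$.

Goal: find policy $\pi$ that maximizes expected return (sum of discounted rewards):

$$\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$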
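A minimal sketch of the return for a finite reward sequence, assuming the episode is truncated after the last listed reward (the function name `discounted_return` is illustrative, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite sequence."""
    total = 0.0
    # Work backwards so each step is a single multiply-add:
    # G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With `gamma` below 1, later rewards count for less, which is what keeps the infinite sum finite when rewards are bounded.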
MDP: $\langle S, A, \gamma, R, T \rangle$

$S$: set of states
$A$: set of actions
$\gamma$: discount factor
$R$: reward function. $R(s, a, s')$ is the reward received taking action $a$ from state $s$ and transitioning to state $s'$.
$T$: transition function. $T(s' | s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.

RL: one or both of $T$, $R$ unknown.
What if the states are real-valued?

Example:
States: $(\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2)$ (real-valued vector)
Actions: +1, -1, 0 units of torque added to elbow
Transition function: physics!
Reward function: -1 for every step
Represent Q function:

$$Q(s, a, w) : \mathbb{R}^n \to \mathbb{R}$$

where $w$ is a parameter vector.

Samples of form:

$$(s_i, a_i, r_i, s_{i+1}, a_{i+1})$$
Minimize summed squared TD error:

$$\min_w \sum_{i=0}^{n} \left( r_i + \gamma Q(s_{i+1}, a_{i+1}, w) - Q(s_i, a_i, w) \right)^2$$

Given a function approximator, compute the gradient and descend it.

Which function approximator to use?

Simplest thing you can do:
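Take a set of basis functions $\phi_1, \ldots, \phi_n$ and make $\hat{Q}$ linear in them:

$$\hat{Q}(s, a) = w \cdot \Phi(s, a) = \sum_{j=1}^{n} w_j \phi_j(s, a)$$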
One choice of basis functions:

$$[1, x, y]$$

What can be represented this way? (Figure: Q plotted over x and y.)

More powerful:

$$[1, x, y, xy]$$

$$[1, x, y, xy, x^2, y^2, x^2 y, y^2 x, x^2 y^2]$$

What can be represented?
How to get the terms of the Taylor series? Each term has an exponent vector $c$:

$$\phi_c(x, y, z) = x^{c_1} y^{c_2} z^{c_3}$$

$c = [1, 0, 0]$: $\phi_c(x, y, z) = x^1 y^0 z^0 = x$
$c = [1, 2, 0]$: $\phi_c(x, y, z) = x^1 y^2 z^0 = x y^2$
$c = [2, 0, 4]$: $\phi_c(x, y, z) = x^2 y^0 z^4 = x^2 z^4$
$c = [0, 3, 1]$: $\phi_c(x, y, z) = x^0 y^3 z^1 = y^3 z^1$

All combinations of $c_i \in [0, \ldots, d]$ generate the basis.
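The exponent-vector construction can be sketched as follows. This is an illustrative implementation, not code from the lecture: the names `polynomial_basis`, `order`, and `num_vars` are assumptions, and every exponent is allowed to range over 0..`order`.

```python
from itertools import product


def polynomial_basis(order, num_vars):
    """Build polynomial features from all exponent vectors c with c_i in 0..order."""
    exponents = list(product(range(order + 1), repeat=num_vars))

    def features(state):
        feats = []
        for c in exponents:
            term = 1.0
            # phi_c(state) = product over variables of x_i ** c_i
            for x, ci in zip(state, c):
                term *= x ** ci
            feats.append(term)
        return feats

    return exponents, features


# Two variables, order 2: the nine terms 1, y, y^2, x, xy, ..., x^2 y^2
exponents, features = polynomial_basis(order=2, num_vars=2)
phi = features([2.0, 3.0])
print(len(phi))
```

A linear value estimate is then the dot product of a weight vector with `phi`.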
Another:

$$\cos(\pi c \cdot [x, y, z])$$

where $c$ is a coefficient vector.
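A small sketch of the cosine features above, often called a Fourier basis. It assumes (beyond what the slide states) that each state variable has been scaled to [0, 1] and that the coefficient vectors are all integer combinations with entries in 0..`order`; the function name is hypothetical.

```python
import math
from itertools import product


def fourier_features(state, order):
    """Return cos(pi * c . state) for every coefficient vector c in {0..order}^d."""
    feats = []
    for c in product(range(order + 1), repeat=len(state)):
        dot = sum(ci * xi for ci, xi in zip(c, state))
        feats.append(math.cos(math.pi * dot))
    return feats


# Three state variables, order 2: 3**3 = 27 features
print(len(fourier_features([0.5, 0.5, 0.5], 2)))
```

The all-zero coefficient vector gives the constant feature 1, which plays the same role as the leading 1 in the polynomial basis.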
First, let's do stochastic gradient descent: update the weights as each data point (transition) comes in.

$$\hat{Q}(s, a) = w \cdot \Phi(s, a)$$

(Below, $\phi(s, a)$ denotes the same feature vector as $\Phi(s, a)$, and $\phi_j$ its $j$-th component.)

$$\min_w \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$

For each weight $w_j$ (treating the target $r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1})$ as a constant):

$$\frac{\partial}{\partial w_j} \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2 = -2 \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right) \phi_j(s_i, a_i)$$

so for time $i$ the contribution for weight $w_j$ is:

$$\left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right) \phi_j(s_i, a_i)$$

Make a step:

$$w_{j,i+1} = w_{j,i} + \alpha \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right) \phi_j(s_i, a_i)$$

In vector form, with $\delta$ the TD error:

$$w_{i+1} = w_i + \alpha \delta \phi(s_i, a_i)$$

The same logic applies when using eligibility traces.

$$w_{i+1} = w_i + \alpha \delta \phi(s_i, a_i)$$

becomes

$$w_{i+1} = w_i + \alpha \delta e$$

where

$$e_t = \gamma \lambda e_{t-1} + \phi(s_t, a_t), \qquad e_0 = \bar{0}$$

($w$, $\phi$, and $e$ are vectors.)

[Sutton and Barto, 1998]
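The update rules above can be sketched as a single online step. This is an illustrative sketch rather than the lecture's code: the function name `sarsa_lambda_update` and its argument names are assumptions, and feature vectors are plain Python lists. Setting `lam` to 0 recovers the trace-free step $w + \alpha \delta \phi(s_i, a_i)$.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def sarsa_lambda_update(w, e, phi, phi_next, reward, alpha, gamma, lam):
    """One stochastic-gradient step for a linear Q-hat with an eligibility trace.

    phi      : features of (s_i, a_i)
    phi_next : features of (s_{i+1}, a_{i+1}); use all zeros at a terminal state
    """
    # TD error: delta = r + gamma * w.phi(s', a') - w.phi(s, a)
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    # Trace: e_t = gamma * lambda * e_{t-1} + phi(s_t, a_t)
    e = [gamma * lam * ei + p for ei, p in zip(e, phi)]
    # Weights: w <- w + alpha * delta * e
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    return w, e, delta


w, e = [0.0, 0.0], [0.0, 0.0]
w, e, delta = sarsa_lambda_update(
    w, e, phi=[1.0, 0.0], phi_next=[0.0, 1.0],
    reward=1.0, alpha=0.5, gamma=0.9, lam=0.0,
)
print(w, delta)
```

The trace should be reset to the zero vector at the start of each episode, matching $e_0 = \bar{0}$.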
Minimize:

$$\min_w \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$

Error function has a bowl shape, so unique minimum. Just go right there!

Derivative set to zero:

$$\sum_{i=1}^{n} \left( w \cdot \phi(s_i, a_i) - r_i - \gamma w \cdot \phi(s_{i+1}, a_{i+1}) \right) \phi(s_i, a_i)^T = 0$$

$$w^T \sum_{i=1}^{n} \left( \phi(s_i, a_i) - \gamma \phi(s_{i+1}, a_{i+1}) \right) \phi^T(s_i, a_i) = \sum_{i=1}^{n} r_i \phi^T(s_i, a_i)$$

Define

$$A = \sum_{i=1}^{n} \left( \phi(s_i, a_i) - \gamma \phi(s_{i+1}, a_{i+1}) \right) \phi^T(s_i, a_i), \qquad b = \sum_{i=1}^{n} r_i \phi^T(s_i, a_i)$$

so that $w^T A = b$.

[Bradtke and Barto, 1996]
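A minimal least-squares sketch of the derivation above, using only the standard library. Two things here are assumptions rather than slide content: the code works with the transposed system (accumulating $\sum_i \phi_i (\phi_i - \gamma \phi_{i+1})^T$ and $\sum_i r_i \phi_i$, then solving for a column vector $w$), and the optional `ridge` term added to the diagonal is a hypothetical safeguard against a singular matrix. The helper names `lstd` and `solve` are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lstd(samples, gamma, ridge=0.0):
    """samples: list of (phi, reward, phi_next); phi_next is all zeros at termination."""
    k = len(samples[0][0])
    A = [[ridge if r == c else 0.0 for c in range(k)] for r in range(k)]
    b = [0.0] * k
    for phi, reward, phi_next in samples:
        diff = [p - gamma * pn for p, pn in zip(phi, phi_next)]
        for r in range(k):
            b[r] += reward * phi[r]              # b += r_i * phi_i
            for c in range(k):
                A[r][c] += phi[r] * diff[c]      # A += phi_i (phi_i - gamma phi_{i+1})^T
    return solve(A, b)


# Two-state chain with one-hot features: state 0 -> state 1 (reward 1) -> end (reward 2)
samples = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, [0.0, 0.0])]
print(lstd(samples, gamma=0.5))
```

In the toy chain, the value of the second state is 2 and the value of the first is 1 + 0.5 * 2 = 2, which the solve should reproduce.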
Can derive LSTD(λ), the least-squares version of TD(λ), in this way. Try it at home!
[Boyan, 1999]
One inversion solves for w! But:
[Lagoudakis and Parr, 2003]
Why not?
How many basis functions in a complete nth order Taylor series of d variables?
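One way to work out the count, assuming the exponent-vector scheme described earlier (each of the d variables gets an exponent in 0..n, and every combination is included): there are (n + 1)^d terms. The helper name below is hypothetical.

```python
def num_basis_functions(order, num_vars):
    """Number of exponent vectors with num_vars entries, each in 0..order."""
    return (order + 1) ** num_vars


# Order 2 in two variables matches the nine-term basis listed earlier.
print(num_basis_functions(2, 2))
# Order 5 in ten variables is already over sixty million terms.
print(num_basis_functions(5, 10))
```

The count grows exponentially in the number of state variables, which is why a complete basis quickly becomes impractical in high dimensions.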
TD-Gammon: Tesauro (circa 1992-1995)
Changed the way the best human players played.
[Bellemare 2013]
[Mnih et al., 2015]
video: Two Minute Papers
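Represent policy directly:

$$\pi(s, a, \theta) : \mathbb{R}^n, \mathbb{R}^m \to [0, 1]$$

where $\theta$ is a parameter vector.

Objective function:

$$\max_\theta \mathbb{E}\left[ R = \sum_{i=0}^{\infty} \gamma^i r_i \right]$$

Why?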
So far: improve policy via value function.

Sometimes policies are simpler than value functions.

Sometimes we wish to search in space of restricted policies.

In such cases it makes sense to search directly in the space of policies $\pi(s, a | \theta)$ rather than trying to learn a value function.
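As one concrete illustration of a parametrized policy $\pi(s, a | \theta)$, here is a softmax over linear action preferences. The slides do not specify a policy form, so the softmax choice, the function name, and the per-action feature lists are all assumptions made for the example.

```python
import math


def softmax_policy(theta, features_per_action):
    """Return pi(s, a | theta) for each action, given phi(s, a) for each action."""
    prefs = [sum(t * f for t, f in zip(theta, phi)) for phi in features_per_action]
    m = max(prefs)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [x / z for x in exps]


# Two actions, two features; theta favors the first feature.
probs = softmax_policy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(probs)
```

The output is a probability for each action, so it maps into [0, 1] as the policy signature above requires, and it is differentiable in `theta`.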
What if you can't differentiate $\pi$? Sample-based optimization:

(Figure: sampled parameter vectors $\theta$; from Kohl and Stone, ICRA 2004.)
More recently, two closely related algorithms:
(Theodorou and Schaal 2010, Kober and Peters 2011)
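To make sample-based optimization concrete, here is a generic random-perturbation hill climber. It is not the algorithm of any of the papers cited above, only a simple stand-in under stated assumptions: `evaluate` is a hypothetical function returning the (estimated) return of a parameter vector, and `toy_return` is a made-up objective used in place of running a real policy.

```python
import random


def hill_climb(evaluate, theta, step=0.1, iterations=200, seed=0):
    """Keep a random perturbation of theta only when it scores higher."""
    rng = random.Random(seed)
    best_score = evaluate(theta)
    for _ in range(iterations):
        candidate = [t + rng.gauss(0.0, step) for t in theta]
        score = evaluate(candidate)
        if score > best_score:
            theta, best_score = candidate, score
    return theta, best_score


def toy_return(theta):
    """Stand-in for a policy's return; maximized at theta = [1, 1]."""
    return -sum((t - 1.0) ** 2 for t in theta)


theta, score = hill_climb(toy_return, [0.0, 0.0])
print(theta, score)
```

No gradient of the policy is needed: the only requirement is being able to evaluate a candidate `theta`, which is the situation the question above describes.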
What if we can differentiate $\pi$ with respect to $\theta$? Policy gradient methods.

The policy gradient theorem expresses the gradient $\partial R / \partial \theta$ in terms of Q. Therefore, one way is to learn Q and then ascend the gradient. Q need only be defined using basis functions computed from $\theta$.
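For reference, the theorem is commonly stated in the following form (as in Sutton et al., 2000). The symbol $d^\pi(s)$, a discounted weighting of how often state $s$ is visited under $\pi$, does not appear in the slides and is introduced here only for the statement; the second line gives one standard choice of basis functions for Q, the so-called compatible features, again as an illustration rather than the lecture's own notation.

```latex
\frac{\partial R}{\partial \theta}
  = \sum_{s} d^{\pi}(s) \sum_{a}
    \frac{\partial \pi(s, a, \theta)}{\partial \theta} \, Q^{\pi}(s, a)

\phi(s, a) = \frac{\partial \log \pi(s, a, \theta)}{\partial \theta}
```

The gradient involves no derivative of the state distribution, so it can be estimated from sampled transitions together with a learned Q.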
[Levine et al., 2016]
Very active area of current research, applications in:
AI