Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Reinforcement Learning II

George Konidaris gdk@cs.brown.edu

Fall 2019

slide-2
SLIDE 2

Reinforcement Learning

$$\max_{\pi} R = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad \pi : S \to A$$

slide-3
SLIDE 3

MDPs

Agent interacts with an environment. At each time t:

  • Receives sensor signal $s_t$
  • Executes action $a_t$
  • Transition:
    • new sensor signal $s_{t+1}$
    • reward $r_t$

Goal: find policy $\pi$ that maximizes expected return (sum of discounted future rewards):

$$\max_{\pi} \mathbb{E}\left[ R = \sum_{t=0}^{\infty} \gamma^t r_t \right]$$
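As a small illustration of the return defined above, the sketch below computes the discounted sum for a finite list of rewards. The function name `discounted_return` is hypothetical, and truncating the infinite sum to a finite episode is an assumption made for the example.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Working backwards avoids computing explicit powers of the discount factor.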

slide-4
SLIDE 4

Markov Decision Processes

$$\langle S, A, \gamma, R, T \rangle$$

  • $S$: set of states
  • $A$: set of actions
  • $\gamma$: discount factor
  • $R$: reward function. $R(s, a, s')$ is the reward received taking action $a$ from state $s$ and transitioning to state $s'$.
  • $T$: transition function. $T(s' | s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.

RL: one or both of T, R unknown.

slide-5
SLIDE 5

The World

slide-6
SLIDE 6

Real-Valued States

What if the states are real-valued?

  • Cannot use table to represent Q.
  • States may never repeat: must generalize.

slide-7
SLIDE 7

RL

Example:

  • States: $(\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2)$ (real-valued vector)
  • Actions: +1, -1, 0 units of torque added to elbow
  • Transition function: physics!
  • Reward function: -1 for every step

slide-8
SLIDE 8

Value Function Approximation

Represent Q function:

$$Q(s, a, w) : \mathbb{R}^n \to \mathbb{R}$$

where $w$ is the parameter vector.

Samples of form:

$$(s_i, a_i, r_i, s_{i+1}, a_{i+1})$$

Minimize summed squared TD error:

$$\min_{w} \sum_{i=0}^{n} \left( r_i + \gamma Q(s_{i+1}, a_{i+1}, w) - Q(s_i, a_i, w) \right)^2$$
slide-9
SLIDE 9

Value Function Approximation

Given a function approximator, compute the gradient and descend it.

Which function approximator to use? Simplest thing you can do:

  • Linear value function approximation.
  • Use set of basis functions $\phi_1, ..., \phi_n$
  • Q is a linear function of them:

$$\hat{Q}(s, a) = w \cdot \Phi(s, a) = \sum_{j=1}^{n} w_j \phi_j(s, a)$$
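A minimal sketch of the linear form above, assuming a two-variable state (x, y) and a small discrete action set. The feature layout (one copy of the basis [1, x, y] per action) and the names `phi` and `q_hat` are illustrative assumptions, not something the slides specify.

```python
def phi(state, action, n_actions):
    """Hypothetical features: the basis [1, x, y] placed in the slot for `action`."""
    x, y = state
    base = [1.0, x, y]
    vec = [0.0] * (len(base) * n_actions)
    start = action * len(base)
    vec[start:start + len(base)] = base
    return vec


def q_hat(w, state, action, n_actions):
    """Linear value estimate: Q(s, a) = w . phi(s, a) = sum_j w_j * phi_j(s, a)."""
    return sum(wj * fj for wj, fj in zip(w, phi(state, action, n_actions)))


if __name__ == "__main__":
    w = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]  # weights for action 0, then action 1
    print(q_hat(w, (0.5, 2.0), 1, 2))
```

Giving each action its own block of weights is one common way to make a state-only basis depend on the action.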

slide-10
SLIDE 10

Function Approximation

One choice of basis functions:

  • Just use state variables directly: $[1, x, y]$

What can be represented this way?

[Figure: Q plotted over x and y]

slide-11
SLIDE 11

Polynomial Basis

More powerful:

  • Polynomials in state variables.
  • 1st order: $[1, x, y, xy]$
  • 2nd order: $[1, x, y, xy, x^2, y^2, x^2 y, y^2 x, x^2 y^2]$
  • This is like a Taylor expansion.

What can be represented?

slide-12
SLIDE 12

Function Approximation

How to get the terms of the Taylor series? Each term has an exponent vector $c$:

$$\phi_c(x, y, z) = x^{c_1} y^{c_2} z^{c_3}$$

  • $c = [1, 0, 0]$: $\phi_c(x, y, z) = x = x^1 y^0 z^0$
  • $c = [1, 2, 0]$: $\phi_c(x, y, z) = x y^2 = x^1 y^2 z^0$
  • $c = [2, 0, 4]$: $\phi_c(x, y, z) = x^2 z^4 = x^2 y^0 z^4$
  • $c = [0, 3, 1]$: $\phi_c(x, y, z) = y^3 z^1 = x^0 y^3 z^1$

All combinations of $c_i \in [0, ..., d]$ generate the basis.
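The exponent-vector construction can be sketched in a few lines. This assumes every exponent ranges over 0, ..., degree independently, as stated above; the function names are hypothetical.

```python
from itertools import product


def polynomial_exponents(num_vars, degree):
    """All exponent vectors c with each c_i in {0, ..., degree}."""
    return list(product(range(degree + 1), repeat=num_vars))


def polynomial_features(state, degree):
    """Evaluate phi_c(state) = prod_i state[i] ** c[i] for every exponent vector c."""
    feats = []
    for c in polynomial_exponents(len(state), degree):
        value = 1.0
        for x_i, c_i in zip(state, c):
            value *= x_i ** c_i
        feats.append(value)
    return feats


if __name__ == "__main__":
    # First-order basis in two variables: the terms 1, y, x, xy
    print(polynomial_features((2.0, 3.0), 1))
```

With two variables and degree 1 this yields the four terms 1, x, y, xy of the first-order basis; with degree 2 it yields the nine second-order terms.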

slide-13
SLIDE 13

Function Approximation

Another:

  • Fourier terms on state variables.
  • $[1, \cos(\pi x), \cos(\pi y), \cos(\pi [x + y])]$

General term, with coefficient vector $c$:

$$\cos(\pi c \cdot [x, y, z])$$
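A sketch of the Fourier terms, built the same way as the polynomial basis: one feature per integer coefficient vector c. It assumes each state variable has already been scaled to [0, 1], which the slide does not state; `fourier_features` and `order` are hypothetical names.

```python
import math
from itertools import product


def fourier_features(state, order):
    """cos(pi * c . state) for every coefficient vector c in {0, ..., order}^d.

    Assumes each state variable has been scaled to the range [0, 1].
    """
    feats = []
    for c in product(range(order + 1), repeat=len(state)):
        dot = sum(c_i * x_i for c_i, x_i in zip(c, state))
        feats.append(math.cos(math.pi * dot))
    return feats


if __name__ == "__main__":
    # Order 1 in two variables: 1, cos(pi*y), cos(pi*x), cos(pi*(x + y))
    print(fourier_features((0.25, 0.5), 1))
```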

slide-14
SLIDE 14

Objective Function Minimization

First, let's do stochastic gradient descent. As each data point (transition) comes in:

  • compute gradient of objective w.r.t. data point
  • descend gradient a little bit

$$\hat{Q}(s, a) = w \cdot \Phi(s, a)$$

$$\min_{w} \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$
slide-15
SLIDE 15

Gradient

For each weight $w_j$:

$$\frac{\partial}{\partial w_j} \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$

$$= -2 \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right) \phi_j(s_i, a_i)$$

(The bootstrapped target $r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1})$ is held fixed when differentiating.)

so for time i the contribution for weight $w_j$ is:

$$\left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right) \phi_j(s_i, a_i)$$

make a step (the constant factor 2 is absorbed into the step size $\alpha$):

$$w_{j,i+1} = w_{j,i} + \alpha \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right) \phi_j(s_i, a_i)$$

In vector form, with $\delta$ the TD error:

$$w_{i+1} = w_i + \alpha \delta \phi(s_i, a_i)$$
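The per-sample update above can be sketched as a single function. Feature vectors are passed in as plain lists; the names `sarsa_update` and `dot` are hypothetical, and how the features and the next action are produced is left to the caller.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sarsa_update(w, phi, reward, phi_next, alpha, gamma):
    """One stochastic gradient step on the weights; returns (new_w, td_error)."""
    # TD error: delta = r + gamma * w.phi(s', a') - w.phi(s, a)
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    # w_j <- w_j + alpha * delta * phi_j(s, a) for every weight j
    new_w = [wj + alpha * delta * fj for wj, fj in zip(w, phi)]
    return new_w, delta


if __name__ == "__main__":
    w, delta = sarsa_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], 0.5, 0.9)
    print(w, delta)
```

Only the weights whose features are active in (s, a) move, in proportion to the TD error.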
slide-16
SLIDE 16

λ-Gradient

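The same logic applies when using eligibility traces.

$$w_{i+1} = w_i + \alpha \delta \phi(s_i, a_i)$$

becomes

$$w_{i+1} = w_i + \alpha \delta e$$

where $e$ is a trace vector:

$$e_t = \gamma \lambda e_{t-1} + \phi(s_t, a_t), \qquad e_0 = \bar{0}$$

[Sutton and Barto, 1998]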
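A sketch of one step of the trace-based update, assuming accumulating traces and list-valued feature vectors. The function name `sarsa_lambda_step` is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sarsa_lambda_step(w, e, phi, reward, phi_next, alpha, gamma, lam):
    """One update with an eligibility trace; returns (new_w, new_e)."""
    # Decay the old trace and add the current features:
    # e_t = gamma * lambda * e_{t-1} + phi(s_t, a_t)
    e = [gamma * lam * ej + fj for ej, fj in zip(e, phi)]
    # Same TD error as in the trace-free update.
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    # Step along the trace instead of the current feature vector.
    w = [wj + alpha * delta * ej for wj, ej in zip(w, e)]
    return w, e


if __name__ == "__main__":
    w, e = [0.0, 0.0], [0.0, 0.0]  # e_0 is the zero vector
    w, e = sarsa_lambda_step(w, e, [1.0, 0.0], 1.0, [0.0, 1.0], 0.5, 0.9, 0.5)
    print(w, e)
```

With lam set to 0 the trace equals the current feature vector, so the step reduces to the trace-free update.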

slide-17
SLIDE 17

Acrobot

slide-18
SLIDE 18

Acrobot

slide-19
SLIDE 19

Least-Squares TD

Minimize:

$$\min_{w} \sum_{i=0}^{n} \left( r_i + \gamma w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$

Error function has a bowl shape, so unique minimum. Just go right there!
slide-20
SLIDE 20

Least-Squares TD

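Derivative set to zero:

$$\sum_{i=1}^{n} \left( w \cdot \phi(s_i, a_i) - r_i - \gamma w \cdot \phi(s_{i+1}, a_{i+1}) \right) \phi(s_i, a_i) = 0$$

Collect the terms in $w$:

$$\left[ \sum_{i=1}^{n} \phi(s_i, a_i) \left( \phi(s_i, a_i) - \gamma \phi(s_{i+1}, a_{i+1}) \right)^T \right] w = \sum_{i=1}^{n} r_i \phi(s_i, a_i)$$

So:

$$w = A^{-1} b$$

$$A = \sum_{i=1}^{n} \phi(s_i, a_i) \left( \phi(s_i, a_i) - \gamma \phi(s_{i+1}, a_{i+1}) \right)^T, \qquad b = \sum_{i=1}^{n} r_i \phi(s_i, a_i)$$

[Bradtke and Barto, 1996]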
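A minimal sketch of the least-squares solve in pure Python. It assumes the column-vector arrangement A = Σ φ (φ − γφ')ᵀ and b = Σ r φ, so that A w = b. The small Gaussian-elimination helper `solve` is included only to keep the example self-contained; `lstd` and the sample format `(phi, reward, phi_next)` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular (not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lstd(samples, gamma):
    """samples: list of (phi, reward, phi_next). Returns w solving A w = b."""
    k = len(samples[0][0])
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for phi, r, phi_next in samples:
        diff = [f - gamma * fn for f, fn in zip(phi, phi_next)]
        for i in range(k):
            b[i] += r * phi[i]
            for j in range(k):
                A[i][j] += phi[i] * diff[j]  # A += phi (phi - gamma * phi')^T
    return solve(A, b)


if __name__ == "__main__":
    # Two-state chain with one-hot features; the second transition is terminal.
    data = [([1.0, 0.0], 0.0, [0.0, 1.0]), ([0.0, 1.0], 1.0, [0.0, 0.0])]
    print(lstd(data, 0.5))
```

The `ValueError` branch corresponds to the case where A cannot be inverted, for example when some feature never appears in the data.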

slide-21
SLIDE 21

LSTD(λ)

Can derive LSTD(λ), the least-squares version of TD(λ), in the same way. Try it at home!

  • Write down the objective function …
  • Sample ri replaced by complex reward estimate.
  • You will get a trace vector if you do some clever algebra.
  • Trace vector is the same size as w.

[Boyan, 1999]

slide-22
SLIDE 22

LSTD(λ)

One inversion solves for w! But:

  • Computationally expensive.
  • A may not be invertible.
  • Least-squares behavior is sometimes unstable outside of the data.
  • LSPI: Least Squares Policy Iteration
    • Requires recomputing A over historical data.
    • $a_{i+1}$ changes with the policy

[Lagoudakis and Parr, 2003]

slide-23
SLIDE 23

Linear Methods Don’t Scale

Why not?

  • They’re complete.
  • They have nice properties (bowl-shaped error).
  • They are easy to use!

How many basis functions in a complete nth order Taylor series of d variables?

$$(n + 1)^d$$
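A quick way to see the growth is to evaluate the count for a few sizes. The helper name `num_basis_functions` and the dimensions chosen below are illustrative.

```python
def num_basis_functions(order, num_vars):
    """Size of a complete basis with exponents 0..order per variable: (order + 1) ** num_vars."""
    return (order + 1) ** num_vars


if __name__ == "__main__":
    for d in (2, 4, 10, 20):
        print(d, num_basis_functions(3, d))
```

The count is exponential in the number of state variables, so even a low-order basis becomes impractical for high-dimensional states.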

slide-24
SLIDE 24

Function Approximation

TD-Gammon: Tesauro (circa 1992-1995)

  • At or near best human level
  • Learn to play Backgammon through self-play
  • 1.5 million games
  • Neural network function approximator
  • TD(λ)

Changed the way the best human players played.

slide-25
SLIDE 25

Arcade Learning Environment

[Bellemare et al., 2013]

slide-26
SLIDE 26

Deep Q-Networks

[Mnih et al., 2015]

slide-27
SLIDE 27

Atari

[Mnih et al., 2015]

video: Two Minute Papers

slide-28
SLIDE 28

Atari

[Mnih et al., 2015]

slide-29
SLIDE 29

POLICY SEARCH

slide-30
SLIDE 30

Policy Search

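Represent policy directly:

$$\pi(s, a, \theta) : \mathbb{R}^n, \mathbb{R}^m \to [0, 1]$$

where $\theta$ is the parameter vector.

Objective function:

$$\max_{\theta} \mathbb{E}\left[ R = \sum_{i=0}^{\infty} \gamma^i r_i \right]$$

Why?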

slide-31
SLIDE 31

Policy Search

So far: improve policy via value function.

Sometimes policies are simpler than value functions:

  • Parametrized program $\pi(s, a | \theta)$

Sometimes we wish to search in space of restricted policies.

In such cases it makes sense to search directly in policy-space rather than trying to learn a value function.

slide-32
SLIDE 32

Hill Climbing

What if you can't differentiate $\pi$? Sample-based optimization:

  • Sample some $\theta$ values near your current best $\theta$.
  • Adjust your current best to the highest value $\theta$.
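The two steps above can be sketched as a loop. The Gaussian perturbation, the fixed step size, and the names `hill_climb` and `evaluate` are assumptions for illustration; `evaluate` stands for whatever procedure scores a parameter vector, such as the average return over a few rollouts.

```python
import random


def hill_climb(evaluate, theta, n_iters=200, n_samples=10, step=0.1, seed=0):
    """Sample-based search: perturb the current best, keep the best candidate."""
    rng = random.Random(seed)
    best_theta, best_value = list(theta), evaluate(theta)
    for _ in range(n_iters):
        # Sample some parameter vectors near the current best.
        candidates = [[t + rng.gauss(0.0, step) for t in best_theta]
                      for _ in range(n_samples)]
        values = [evaluate(c) for c in candidates]
        top = max(range(n_samples), key=lambda i: values[i])
        # Move to the highest-valued candidate if it improves on the current best.
        if values[top] > best_value:
            best_theta, best_value = candidates[top], values[top]
    return best_theta, best_value


if __name__ == "__main__":
    # Toy objective with its maximum at theta = (1, -2).
    score = lambda th: -((th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2)
    print(hill_climb(score, [0.0, 0.0]))
```

No gradient of the policy is needed; the only requirement is the ability to evaluate candidate parameters.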

slide-33
SLIDE 33

Aibo Gait Optimization

from Kohl and Stone, ICRA 2004.

slide-34
SLIDE 34

PoWER and PI²

More recently, two closely related algorithms:

  • Generate some sample $\theta$ values.
  • Next $\theta$ is sum of prior samples weighted by reward.

(Theodorou and Schaal 2010, Kober and Peters 2011)
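The sketch below illustrates only the general idea of reward-weighted averaging. It is not an implementation of PoWER or PI² as published: the exponential weighting, the `temperature` and `noise` parameters, and the function name `reward_weighted_update` are all assumptions chosen for the example.

```python
import math
import random


def reward_weighted_update(evaluate, theta, n_samples=20, noise=0.2,
                           temperature=0.1, rng=None):
    """Return a new theta as a reward-weighted average of perturbed samples."""
    if rng is None:
        rng = random.Random(0)
    # Generate sample parameter vectors around the current theta.
    samples = [[t + rng.gauss(0.0, noise) for t in theta] for _ in range(n_samples)]
    rewards = [evaluate(s) for s in samples]
    # Assumed weighting: exponentiated reward, shifted by the max for stability.
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    # Next theta: weighted average of the samples.
    return [sum(w * s[j] for w, s in zip(weights, samples)) / total
            for j in range(len(theta))]


if __name__ == "__main__":
    score = lambda th: -((th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2)
    rng = random.Random(0)
    theta = [0.0, 0.0]
    for _ in range(50):
        theta = reward_weighted_update(score, theta, rng=rng)
    print(theta, score(theta))
```

Unlike hill climbing, every sample contributes to the next parameter vector, with better samples contributing more.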

slide-35
SLIDE 35

Policy Search

What if we can differentiate $\pi$ with respect to $\theta$? Policy gradient methods.

  • Compute and ascend $\partial R / \partial \theta$
  • This is the gradient of return w.r.t. policy parameters

Policy gradient theorem: [equation omitted]

Therefore, one way is to learn Q and then ascend gradient. Q need only be defined using basis functions computed from $\theta$.
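To make "compute and ascend the gradient" concrete, here is a sketch for the simplest possible case: a one-step problem with a softmax policy over a few actions, where the expected reward and its gradient can be written exactly. The softmax parameterization, the one-step setting, and all function names are assumptions for illustration; the general multi-step case needs estimates of Q, as the slide notes.

```python
import math


def softmax(theta):
    """Action probabilities pi(a | theta) for a softmax policy."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi(a | theta) * r(a)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def policy_gradient(theta, rewards):
    """Exact gradient of J for the softmax policy.

    dJ/dtheta_k = sum_a pi(a) * dlog pi(a)/dtheta_k * r(a) = pi(k) * (r(k) - J)
    """
    pi = softmax(theta)
    J = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - J) for p, r in zip(pi, rewards)]


def gradient_ascent(theta, rewards, alpha=0.5, n_steps=500):
    """Repeatedly step theta in the direction of the gradient."""
    for _ in range(n_steps):
        grad = policy_gradient(theta, rewards)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    theta = gradient_ascent([0.0, 0.0, 0.0], [1.0, 0.0, 2.0])
    print(softmax(theta))
```

Ascending the gradient shifts probability toward the action with the highest reward, which is the behaviour a policy gradient method aims for in the sequential setting as well.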

slide-36
SLIDE 36

Postural Recovery

slide-37
SLIDE 37

Deep Policy Search

[Levine et al., 2016]

slide-38
SLIDE 38

Deep Policy Search

[Levine et al., 2016]

slide-39
SLIDE 39

Robotics

[Levine et al., 2016]

slide-40
SLIDE 40

Reinforcement Learning

Very active area of current research, applications in:

  • Robotics
  • Operations Research
  • Computer Games
  • Theoretical Neuroscience

AI

  • The primary function of the brain is control.