  1. Reinforcement Learning II
     George Konidaris, gdk@cs.brown.edu
     Fall 2019

  2. Reinforcement Learning
     $R = \max_{\pi : S \to A} \sum_{t=0}^{\infty} \gamma^t r_t$

  3. MDPs
     Agent interacts with an environment. At each time t:
     • Receives sensor signal $s_t$
     • Executes action $a_t$
     • Transition:
       • new sensor signal $s_{t+1}$
       • reward $r_t$
     Goal: find policy $\pi$ that maximizes the expected return (expected sum of discounted future rewards):
     $\max_\pi \; E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$
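A minimal sketch of the discounted return above, assuming a finite list of per-step rewards (the function name and values are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of discounted rewards for one episode: R = sum_t gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Example: -1 reward per step for 5 steps.
print(discounted_return([-1, -1, -1, -1, -1], gamma=0.9))  # -4.0951
```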

  4. Markov Decision Processes $\langle S, A, \gamma, R, T \rangle$
     • $S$: set of states
     • $A$: set of actions
     • $\gamma$: discount factor
     • $R$: reward function. $R(s, a, s')$ is the reward received for taking action $a$ in state $s$ and transitioning to state $s'$.
     • $T$: transition function. $T(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
     RL: one or both of $T$, $R$ unknown.
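A possible plain-data sketch of the tuple $\langle S, A, \gamma, R, T \rangle$ for a small discrete MDP, assuming numpy; the sizes and reward values are made up, and in the RL setting the agent would not be given $T$ and $R$:

```python
import numpy as np

n_states, n_actions = 3, 2
S = list(range(n_states))          # set of states
A = list(range(n_actions))         # set of actions
gamma = 0.95                       # discount factor

# T[s, a, s'] = probability of transitioning to s' after taking a in s.
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a, s'] = reward for taking a in s and landing in s'.
R = np.zeros((n_states, n_actions, n_states))
R[:, :, -1] = 1.0                  # e.g. a reward for reaching the last state

mdp = (S, A, gamma, R, T)
```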

  5. The World

  6. Real-Valued States
     What if the states are real-valued?
     • Cannot use a table to represent Q.
     • States may never repeat: must generalize.
     [figure: a discrete table of values vs. a continuous value surface over the state space]

  7. RL Example
     States: $(\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2)$ (real-valued vector)
     Actions: +1, -1, 0 units of torque added to the elbow
     Transition function: physics!
     Reward function: -1 for every step

  8. Value Function Approximation
     Represent the Q function: $Q(s, a, w) : \mathbb{R}^n \to \mathbb{R}$, with parameter vector $w$.
     Samples of the form $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$.
     Minimize the summed squared TD error:
     $\min_w \sum_{i=0}^{n} \left( r_i + \gamma Q(s_{i+1}, a_{i+1}, w) - Q(s_i, a_i, w) \right)^2$
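A minimal sketch of that objective, assuming a generic callable Q(s, a, w) and a list of sampled transitions (all names here are illustrative):

```python
def summed_squared_td_error(Q, w, transitions, gamma=0.99):
    """Sum over samples (s, a, r, s', a') of (r + gamma*Q(s',a',w) - Q(s,a,w))^2."""
    total = 0.0
    for (s, a, r, s_next, a_next) in transitions:
        td_error = r + gamma * Q(s_next, a_next, w) - Q(s, a, w)
        total += td_error ** 2
    return total
```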

  9. Value Function Approximation
     Given a function approximator, compute the gradient and descend it.
     Which function approximator to use? Simplest thing you can do:
     • Linear value function approximation.
     • Use a set of basis functions $\phi_1, ..., \phi_n$.
     • Q is a linear function of them:
     $\hat{Q}(s, a) = w \cdot \Phi(s, a) = \sum_{j=1}^{n} w_j \phi_j(s, a)$
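A one-line sketch of the linear approximator, assuming a feature function phi(s, a) that returns a numpy vector of basis values:

```python
import numpy as np

def q_hat(s, a, w, phi):
    """Linear value function approximation: Q_hat(s, a) = w . Phi(s, a)."""
    return np.dot(w, phi(s, a))
```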

  10. Function Approximation
     One choice of basis functions:
     • Just use the state variables directly: $[1, x, y]$
     What can be represented this way?
     [figure: Q plotted over the (x, y) state space]

  11. Polynomial Basis
     More powerful:
     • Polynomials in the state variables.
     • 1st order: $[1, x, y, xy]$
     • 2nd order: $[1, x, y, xy, x^2, y^2, x^2 y, y^2 x, x^2 y^2]$
     • This is like a Taylor expansion.
     What can be represented?
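The 2nd-order basis above, written out as an explicit (illustrative) feature function for a two-variable state:

```python
import numpy as np

def poly2_features(x, y):
    """2nd-order polynomial basis for a 2D state, as listed on the slide."""
    return np.array([1.0, x, y, x * y,
                     x**2, y**2, x**2 * y, y**2 * x, x**2 * y**2])
```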

  12. Function Approximation
     How to get the terms of the Taylor series? Each term has an exponent vector $c$, with each $c_i \in \{0, ..., d\}$:
     $\phi_c(x, y, z) = x^{c_1} y^{c_2} z^{c_3}$
     Taking all combinations of exponents generates the basis:
     • $c = [1, 0, 0]$: $\phi_c(x, y, z) = x^1 y^0 z^0 = x$
     • $c = [1, 2, 0]$: $\phi_c(x, y, z) = x^1 y^2 z^0 = x y^2$
     • $c = [2, 0, 4]$: $\phi_c(x, y, z) = x^2 y^0 z^4 = x^2 z^4$
     • $c = [0, 3, 1]$: $\phi_c(x, y, z) = x^0 y^3 z^1 = y^3 z$
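A sketch of generating the full basis by enumerating every exponent vector with entries in {0, ..., d}, assuming numpy and itertools (the function name is illustrative):

```python
import itertools
import numpy as np

def polynomial_basis(state, d):
    """All products x1^c1 * x2^c2 * ... with each exponent c_i in {0, ..., d}."""
    features = []
    for c in itertools.product(range(d + 1), repeat=len(state)):
        features.append(np.prod([x ** e for x, e in zip(state, c)]))
    return np.array(features)

# Example: a 3-variable state with d = 2 gives 3^3 = 27 basis functions.
print(len(polynomial_basis([0.5, -1.0, 2.0], d=2)))  # 27
```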

  13. Function Approximation
     Another choice:
     • Fourier terms on the state variables.
     • $[1, \cos(\pi x), \cos(\pi y), \cos(\pi [x + y])]$
     • In general, $\cos(\pi \, c \cdot [x, y, z])$ for a coefficient vector $c$.
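A corresponding sketch for the Fourier basis, enumerating integer coefficient vectors up to a given order (again, names are illustrative):

```python
import itertools
import numpy as np

def fourier_basis(state, order):
    """One feature cos(pi * c . state) per integer coefficient vector c."""
    state = np.asarray(state, dtype=float)
    features = [np.cos(np.pi * np.dot(c, state))
                for c in itertools.product(range(order + 1), repeat=len(state))]
    return np.array(features)

# For a 2D state and order 1 this includes 1, cos(pi*y), cos(pi*x),
# and cos(pi*(x + y)), matching the list on the slide.
print(fourier_basis([0.2, 0.7], order=1))
```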

  14. Objective Function Minimization
     First, let's do stochastic gradient descent. As each data point (transition) comes in:
     • compute the gradient of the objective w.r.t. that data point
     • descend the gradient a little bit
     With $\hat{Q}(s, a) = w \cdot \Phi(s, a)$, the objective is:
     $\min_w \sum_{i=0}^{n} \left( r_i + \gamma \, w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$
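A minimal sketch of one stochastic gradient step on the single-transition squared error for the linear case, assuming a feature function phi(s, a); note this is the literal gradient of the squared error, whereas the common TD-style ("semi-gradient") variant treats the target as fixed and drops the gamma * phi(s', a') term:

```python
import numpy as np

def sgd_step(w, transition, phi, gamma=0.99, alpha=0.01):
    """One gradient step on (r + gamma*w.phi(s',a') - w.phi(s,a))^2."""
    s, a, r, s_next, a_next = transition
    f, f_next = phi(s, a), phi(s_next, a_next)
    delta = r + gamma * np.dot(w, f_next) - np.dot(w, f)   # TD error
    grad = 2.0 * delta * (gamma * f_next - f)              # gradient w.r.t. w
    return w - alpha * grad                                # descend a little bit
```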
