Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2019

Reinforcement Learning
$\pi : S \to A$
$$\max_{\pi} \; R = \sum_{t=0}^{\infty} \gamma^t r_t$$

MDPs
Agent interacts with an environment. At each time t:
• Receives sensor signal $s_t$
• Executes action $a_t$
• Transition:
  • new sensor signal $s_{t+1}$
  • reward $r_t$
Goal: find policy $\pi$ that maximizes expected return (sum of discounted future rewards):
$$\max_{\pi} \; E\left[ R = \sum_{t=0}^{\infty} \gamma^t r_t \right]$$
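As a small illustration of the return being maximized, the sketch below sums discounted rewards for a finite reward sequence. The function name `discounted_return` and the truncation to a finite list are assumptions made for the example; the slide's sum runs to infinity.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With $\gamma < 1$, later rewards count for less, which is what keeps the infinite sum finite when rewards are bounded.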

Markov Decision Processes
$\langle S, A, \gamma, R, T \rangle$
• $S$: set of states
• $A$: set of actions
• $\gamma$: discount factor
• $R$: reward function. $R(s, a, s')$ is the reward received taking action $a$ from state $s$ and transitioning to state $s'$.
• $T$: transition function. $T(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
RL: one or both of T, R unknown.
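One way to hold the tuple in code is sketched below. The `MDP` class, its field names, and the two-state toy problem are all hypothetical and only illustrate the roles of S, A, γ, R and T. In the RL setting the agent could call `step` to gather samples but would not get to read `transition` or `reward` directly.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    gamma: float
    reward: callable      # R(s, a, s_next) -> float
    transition: dict      # T: (s, a) -> {s_next: probability}

    def step(self, s, a, rng=random):
        """Sample s_next from T(. | s, a) and return (s_next, reward)."""
        dist = self.transition[(s, a)]
        next_states = list(dist)
        weights = [dist[x] for x in next_states]
        s_next = rng.choices(next_states, weights=weights)[0]
        return s_next, self.reward(s, a, s_next)


# Hypothetical two-state example: moving usually switches sides.
toy = MDP(
    states=["left", "right"],
    actions=["stay", "move"],
    gamma=0.9,
    reward=lambda s, a, s_next: 1.0 if s_next == "right" else 0.0,
    transition={
        ("left", "stay"): {"left": 1.0},
        ("left", "move"): {"left": 0.1, "right": 0.9},
        ("right", "stay"): {"right": 1.0},
        ("right", "move"): {"left": 0.9, "right": 0.1},
    },
)
```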

The World

Real-Valued States
What if the states are real-valued?
• Cannot use table to represent Q.
• States may never repeat: must generalize.
[Figure omitted]

RL Example
States: $(\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2)$ (real-valued vector)
Actions: +1, -1, 0 units of torque added to elbow
Transition function: physics!
Reward function: -1 for every step

Value Function Approximation
Represent Q function:
$$Q(s, a, w) : \mathbb{R}^n \to \mathbb{R}$$
where $w$ is the parameter vector.
Samples of form: $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$
Minimize summed squared TD error:
$$\min_{w} \sum_{i=0}^{n} \left( r_i + \gamma Q(s_{i+1}, a_{i+1}, w) - Q(s_i, a_i, w) \right)^2$$
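The summed squared TD error can be evaluated directly from a batch of samples. The sketch below assumes `q` is any callable `q(s, a, w)` standing in for the approximator; the function name and argument order are choices made for this example.

```python
def summed_squared_td_error(samples, q, w, gamma):
    """Sum over samples of (r + gamma * Q(s', a', w) - Q(s, a, w))^2.

    Each sample is a tuple (s, a, r, s_next, a_next).
    """
    total = 0.0
    for s, a, r, s_next, a_next in samples:
        td_error = r + gamma * q(s_next, a_next, w) - q(s, a, w)
        total += td_error ** 2
    return total


if __name__ == "__main__":
    # Hypothetical approximator: linear in a scalar state and the action.
    q = lambda s, a, w: w[0] * s + w[1] * a
    samples = [(1.0, 0, -1.0, 2.0, 1)]
    print(summed_squared_td_error(samples, q, [0.5, 0.25], gamma=1.0))
```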

Value Function Approximation
Given a function approximator, compute the gradient and descend it.
Which function approximator to use?
Simplest thing you can do:
• Linear value function approximation.
• Use set of basis functions $\phi_1, \ldots, \phi_n$
• Q is a linear function of them:
$$\hat{Q}(s, a) = w \cdot \Phi(s, a) = \sum_{j=1}^{n} w_j \phi_j(s, a)$$
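In code, the linear form is a weighted sum of basis function outputs. The basis below (a constant, two state variables, and the action value) is a made-up example, not one taken from the slides.

```python
def linear_q(w, basis, s, a):
    """Q_hat(s, a) = sum_j w_j * phi_j(s, a)."""
    return sum(w_j * phi_j(s, a) for w_j, phi_j in zip(w, basis))


# Hypothetical basis functions over a 2-D state s and a scalar action a.
basis = [
    lambda s, a: 1.0,        # constant term
    lambda s, a: s[0],       # first state variable
    lambda s, a: s[1],       # second state variable
    lambda s, a: float(a),   # action value
]

if __name__ == "__main__":
    print(linear_q([1.0, 2.0, 3.0, 0.5], basis, (0.5, 1.0), -1))
```

Because $\hat{Q}$ is linear in $w$, its gradient with respect to $w_j$ is just $\phi_j(s, a)$, which is what makes the gradient step cheap.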

Function Approximation
One choice of basis functions:
• Just use state variables directly: $[1, x, y]$
What can be represented this way?
[Figure: Q plotted over x and y]

Polynomial Basis
More powerful:
• Polynomials in state variables.
• 1st order: $[1, x, y, xy]$
• 2nd order: $[1, x, y, xy, x^2, y^2, x^2 y, y^2 x, x^2 y^2]$
• This is like a Taylor expansion.
What can be represented?

Function Approximation
How to get the terms of the Taylor series?
Each term has an exponent vector $c$ with $c_i \in [0, \ldots, d]$:
$$\phi_c(x, y, z) = x^{c_1} y^{c_2} z^{c_3}$$
All combinations of exponents generate the basis. Examples:
• $c = [1, 0, 0]$: $\phi_c(x, y, z) = x^1 y^0 z^0 = x$
• $c = [1, 2, 0]$: $\phi_c(x, y, z) = x^1 y^2 z^0 = x y^2$
• $c = [2, 0, 4]$: $\phi_c(x, y, z) = x^2 y^0 z^4 = x^2 z^4$
• $c = [0, 3, 1]$: $\phi_c(x, y, z) = x^0 y^3 z^1 = y^3 z$
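Enumerating every exponent vector is a Cartesian product over `0..d`, one factor per state variable. The sketch below uses `itertools.product` for that; the function name and return shape are choices made for this example.

```python
from itertools import product


def polynomial_basis(state, d):
    """Return (exponents, features) for all c with each c_i in 0..d.

    features[k] is the product of state[i] ** exponents[k][i] over i.
    """
    exponents = list(product(range(d + 1), repeat=len(state)))
    features = []
    for c in exponents:
        term = 1.0
        for x_i, c_i in zip(state, c):
            term *= x_i ** c_i
        features.append(term)
    return exponents, features


if __name__ == "__main__":
    exps, feats = polynomial_basis((2.0, 3.0), d=1)
    for c, f in zip(exps, feats):
        print(c, f)
```

The number of terms is $(d + 1)^k$ for $k$ state variables, so the basis grows quickly with dimension: two variables at $d = 2$ give the 9 terms listed for the 2nd order basis, while three variables at $d = 4$ already give 125.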

Function Approximation
Another:
• Fourier terms on state variables.
• $[1, \cos(\pi x), \cos(\pi y), \cos(\pi [x + y])]$
• General term: $\cos(\pi \, c \cdot [x, y, z])$, where $c$ is the coefficient vector.
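The Fourier terms can be generated the same way as the polynomial ones, by enumerating coefficient vectors. This sketch assumes integer coefficients in `0..order` and state variables already rescaled to [0, 1]; neither assumption is stated on the slide, which only shows the two-variable, order-1 case and the general term.

```python
import math
from itertools import product


def fourier_basis(state, order):
    """Return cos(pi * c . state) for every integer vector c in 0..order."""
    features = []
    for c in product(range(order + 1), repeat=len(state)):
        dot = sum(c_i * x_i for c_i, x_i in zip(c, state))
        features.append(math.cos(math.pi * dot))
    return features


if __name__ == "__main__":
    # Order 1 in two variables gives 1, cos(pi*y), cos(pi*x), cos(pi*(x+y)).
    print(fourier_basis((0.25, 0.5), order=1))
```

The all-zero coefficient vector yields the constant feature 1, so no separate bias term is needed.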

Objective Function Minimization
First, let's do stochastic gradient descent.
As each data point (transition) comes in:
• compute gradient of objective w.r.t. data point
• descend gradient a little bit
$$\hat{Q}(s, a) = w \cdot \Phi(s, a)$$
$$\min_{w} \sum_{i=0}^{n} \left( r_i + \gamma \, w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$
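A single stochastic step for the linear case might look like the sketch below. It makes one assumption that is not stated here: the bootstrapped target $r_i + \gamma \, w \cdot \phi(s_{i+1}, a_{i+1})$ is treated as a constant when differentiating, which gives the common semi-gradient (SARSA-style) update. Differentiating through the target as well would give a different, residual-gradient update. The step size `alpha` and the function names are also choices made for this example, and the constant factor 2 from the square is folded into `alpha`.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sgd_td_step(w, phi, phi_next, r, gamma, alpha):
    """One semi-gradient step on a single transition.

    phi      : feature vector for (s_i, a_i)
    phi_next : feature vector for (s_{i+1}, a_{i+1})
    Returns the updated weights and the TD error.
    """
    td_error = r + gamma * dot(w, phi_next) - dot(w, phi)
    # Target held fixed, so the gradient of the squared error w.r.t. w_j
    # is proportional to -td_error * phi_j; step in the opposite direction.
    new_w = [w_j + alpha * td_error * phi_j for w_j, phi_j in zip(w, phi)]
    return new_w, td_error


if __name__ == "__main__":
    w, delta = sgd_td_step([0.0, 0.0], [1.0, 2.0], [1.0, 0.0],
                           r=-1.0, gamma=1.0, alpha=0.5)
    print(w, delta)
```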
