Reinforcement Learning for Continuous State and Action Spaces

SLIDE 1

MACHINE LEARNING TECHNIQUES AND APPLICATIONS – 2012

Reinforcement Learning for Continuous State and Action Spaces
Gradient Methods

SLIDE 2

Reinforcement Learning (RL)

Supervised learning, unsupervised learning: reinforcement learning lies in between. The agent is not told the correct action (as in supervised learning), but it does receive evaluative feedback in the form of a reward.

SLIDE 3

Drawbacks of classical RL

  • Curse of dimensionality: computational costs increase dramatically with the number of states.
  • Markov world: cannot handle continuous action and state spaces.
  • Model-based vs. model-free: needs a model of the world (which can be estimated through exploration).
→ Gradient methods to handle continuous state and action spaces.

SLIDE 4

RL in continuous state and action spaces

States and actions $s_t, a_t$, $t = 1 \dots T$, are continuous: $s_t \in \mathbb{R}^N$, $a_t \in \mathbb{R}^P$. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function $V(s)$,
2) use function approximation to estimate the state-action value function $Q(s, a)$, or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 5

Policy Gradients

Parametrize the value function: $V(s; \theta)$, where $\theta$ are the open parameters.

Assume an initial estimate $V(s; \theta)$. Run an episode using a greedy policy $\pi(a \mid s; \theta)$. Compute the error on the estimate: $e = V(s_t; \theta) - E\left[\sum_k r_{t+k}\right]$. Find the optimal parameters through gradient descent on the error in the state value function: $\Delta\theta \sim \nabla_\theta V(s; \theta)\, e$.
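The slides leave the gradient step abstract; the following numpy sketch fits $V(s;\theta) = \theta^T \phi(s)$ to Monte Carlo returns by gradient descent on the squared error. The drifting toy state (standing in for episodes under a fixed policy), the reward, the RBF features, and the learning rate are all invented for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s, centers, width=0.5):
    """RBF features phi(s) for a scalar state s."""
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

def run_episode(T=50):
    """Toy 1-D episode: the state drifts randomly (a stand-in for a fixed
    policy); the reward peaks at the origin. Returns the visited states and
    the Monte Carlo return observed from each step."""
    s, states, rewards = 0.0, [], []
    for _ in range(T):
        states.append(s)
        rewards.append(np.exp(-s ** 2))
        s += rng.normal(0.0, 0.3)
    returns = np.cumsum(rewards[::-1])[::-1]   # undiscounted return from each step
    return np.array(states), returns

centers = np.linspace(-2, 2, 10)   # K = 10 RBF centers
theta = np.zeros_like(centers)     # open parameters of V(s; theta)
lr = 0.05

for episode in range(200):
    states, returns = run_episode()
    for s, G in zip(states, returns):
        phi = features(s, centers)
        e = theta @ phi - G            # error: estimate minus observed return
        theta -= lr * e * phi          # gradient descent on the squared error
```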

SLIDE 6

TD learning for continuous state action space

Doya, NIPS 1996

Parametrize the value function such that: $V(s;\theta) = \sum_{j=1}^{K} \theta_j \phi_j(s) = \theta^T \phi(s)$.

$\phi_j(s)$, $j = 1 \dots K$, is a set of basis functions (e.g. RBF functions). These are set by the user and are also called the features. $\theta_j$ are the weights associated to each feature; these are the unknown parameters.

SLIDE 7

TD learning for continuous state action space

Doya, NIPS 1996

Pick a set of parameters $\theta$ and compute $V(s;\theta) = \sum_{j=1}^{K} \theta_j \phi_j(s)$. Do some roll-outs of fixed episode length $T$ and gather $r(s_t)$, $t = 1 \dots T$. Estimate a new $V(s;\theta)$ through TD learning: the TD target is $\hat{V}(s_t;\theta) = r(s_t) + \gamma\, V(s_{t+1};\theta)$.

Gradient descent on the squared TD error gives:

$\theta_j \leftarrow \theta_j + \epsilon\, \delta_t\, \phi_j(s_t), \quad j = 1 \dots K,$

with the TD error $\delta_t = r(s_t) + \gamma\, V(s_{t+1};\theta) - V(s_t;\theta)$.

Other techniques for estimating non-linear regression functions have been proposed elsewhere to get a better estimate of the parameters than simple gradient descent.
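A minimal sketch of this TD update with RBF features; the toy dynamics and reward are again stand-ins, and the slide's step size ε appears as `lr`:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(s, centers, width=0.5):
    """RBF features, set by the user ("the features")."""
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(-2, 2, 10)   # K basis functions
theta = np.zeros_like(centers)     # unknown weights
gamma, lr, T = 0.95, 0.05, 50

for rollout in range(200):         # roll-outs of fixed episode length
    s = rng.uniform(-2, 2)
    for t in range(T):
        s_next = s + rng.normal(0.0, 0.3)          # toy transition
        r = np.exp(-s ** 2)                        # toy reward r(s_t)
        # TD error: r + gamma * V(s_{t+1}) - V(s_t)
        delta = r + gamma * theta @ phi(s_next, centers) - theta @ phi(s, centers)
        theta += lr * delta * phi(s, centers)      # gradient step on squared TD error
        s = s_next
```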

SLIDE 8

Robotics Applications of continuous RL

Teaching a two-joint, three-link robot leg to stand up

Robot configuration. θ0: pitch angle, θ1: hip joint angle, θ2: knee joint angle, θm: the angle of the line from the center of mass to the center of the foot.

Morimoto and Doya, Robotics and Autonomous Systems, 2001

SLIDE 9

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

  • The final goal is the upright stand-up posture.
  • Reaching the final goal is a necessary but not sufficient condition for a successful stand-up, because the robot may fall down after passing through the final goal → need to define subgoals.
  • The stand-up task is accomplished when the robot stands up and stays upright for more than 2(T + 1) seconds.

SLIDE 10

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Hierarchical reinforcement learning:

Upper layer: discretized state-action space; discover which subgoal should be reached, using Q-learning.
  State: pitch and joint angles.
  Actions: joint displacements (not the torques).

Lower layer: continuous state-action space; the robot learns to apply the appropriate torques to achieve each subgoal.
  State: pitch and joint angles, and their velocities.
  Actions: torques at the two joints.

SLIDE 11

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Reward: decompose the task into upper-level and lower-level sets of goals. Y is the height of the head of the robot at a sub-goal posture, and L is the total length of the robot.

When the robot achieves a sub-goal, the learner gets a reward < 0.5. The full reward is obtained only when the subgoals and the main goal are all achieved.

SLIDE 12

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

State and action are continuous in time: $s(t), u(t)$. Model the value function: $V(s;\theta) = \theta^T \phi(s)$. Model the policy: $u(s) = f\!\left(\omega^T \psi(s) + \nu\right)$, with $f$ a sigmoid function and $\nu$ noise. Learn through gradient descent on the squared TD error: backpropagate the TD error $\delta(t)$ to estimate the parameters, $\theta_j \leftarrow \theta_j + \eta\, \delta(t)\, e_j(t)$, $j = 1 \dots K$, where $e_j(t)$ is the eligibility trace of parameter $\theta_j$.
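A rough, Euler-discretized sketch of such an actor-critic scheme: the critic backpropagates the TD error through eligibility traces, and the actor correlates its exploration noise with the TD error. Doya's formulation is in continuous time; the dynamics, reward, and all constants here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(s, centers, width=0.5):
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(-2, 2, 10)
theta = np.zeros_like(centers)    # critic weights (value function)
omega = np.zeros_like(centers)    # actor weights (policy)
trace = np.zeros_like(centers)    # eligibility traces for the critic
dt, kappa, eta, gamma = 0.02, 0.5, 0.1, 0.98

s = 0.0
for step in range(20000):
    nu = rng.normal(0.0, 0.5)                     # exploration noise
    u = np.tanh(omega @ phi(s, centers) + nu)     # sigmoid policy with noise
    s_next = s + dt * u - dt * 0.5 * s            # toy stable dynamics
    r = np.exp(-s ** 2)                           # toy reward, peaks at s = 0
    delta = r + gamma * theta @ phi(s_next, centers) - theta @ phi(s, centers)
    trace += dt * (-trace / kappa + phi(s, centers))  # traces decay and accumulate
    theta += eta * delta * trace                  # critic: TD error via the traces
    omega += eta * delta * nu * phi(s, centers)   # actor: correlate noise with TD error
    s = s_next
```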

SLIDE 13

Robotics Applications of continuous RL

  • 750 trials in simulation + 170 on real robot
  • Goal: to stand up

Morimoto and Doya, Robotics and Autonomous Systems, 2001

SLIDE 14

Learning how to swing up a pendulum

Separate the problem into two parts:
a) learning to swing the pole up,
b) learning to balance the pole upright.

Part a is learned through TD-learning. Part b is learned using optimal control. Additionally, a model of the dynamics of the inverted pendulum is estimated through locally weighted regression (estimating the parameters of a known model).

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

SLIDE 15

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Collect a sequence $\{s_t^*, u_t^*\}_{t=1}^{T}$ from a single human demonstration. State of the system: $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$ (human hand trajectory and velocity + angular trajectory and velocity of the pendulum). Actions: $u_t$ (can be converted into a torque command to the robot). Reward: minimizes angular and hand displacements as well as velocities.

Robotics Applications of continuous RL

SLIDE 16

Learn first a model of the task through RL, using continuous TD learning to estimate $V(s;\theta)$: take a model $V(s;\theta) = \sum_{j=1}^{K} \theta_j \phi_j(s)$. Parameters are estimated incrementally through non-linear function approximation (locally weighted regression).

Learn then the reward: $r(x_t, u_t, t) = (x_t - x_t^*)^T Q\, (x_t - x_t^*) + u_t^T R\, u_t$, where $Q$ and $R$ are estimated so as to minimize discrepancies between human and robot trajectories. Use the reward to generate optimal trajectories through optimal control.

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

SLIDE 17

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

SLIDE 18

RL in continuous state and action spaces

1....

States and actions , , are continuous: , One can no longer swipe through all states and actions to determine the optimal policy. Instead, one can either

t t N P t t

t T

s a s a

 

 

: 1) use function approximation to estimate the value function V(s) 2) use function approximation to estimate the state-action value function Q(s, a) 3) optimize a parameterized policy | (policy sear a s  ch)

SLIDE 19

Least-Squares Policy Iteration

SLIDE 20

Least-Squares Policy Iteration

Approximate the state-action value function: $Q(s,a;\alpha) = \sum_{j=1}^{K} \alpha_j \phi_j(s,a) = \alpha^T \phi(s,a)$.

$\phi_j(s,a)$, $j = 1 \dots K$, with $\phi_j : \mathbb{R}^N \times \mathbb{R}^P \to \mathbb{R}$, is a set of basis functions. These are set by the user and are also called the features. $\alpha_j$ are the weights associated to each feature; these are the unknown parameters.
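When the actions are discrete (as in the bicycle example below), a common way to build such features is to stack one block of state features per action; a sketch, with the particular state terms chosen arbitrarily:

```python
import numpy as np

def q_features(s, a, n_actions):
    """phi(s, a) as one block of state features per discrete action: only the
    block of the chosen action is non-zero. The state terms here (constant,
    linear, quadratic) echo the bike example, but are illustrative."""
    phi_s = np.concatenate(([1.0], s, s ** 2))
    k = phi_s.size
    phi = np.zeros(n_actions * k)
    phi[a * k:(a + 1) * k] = phi_s
    return phi

def q_value(alpha, s, a, n_actions):
    """Q(s, a; alpha) = alpha^T phi(s, a)."""
    return alpha @ q_features(s, a, n_actions)

# e.g. q_value(alpha, np.array([0.1, -0.3]), a=2, n_actions=5)
```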

SLIDE 21

Least-Squares Policy Iteration

Perform a roll-out for $T$ time steps using policy $\pi(a \mid s;\alpha)$ (greedy search, $\operatorname{argmax}_a Q(s,a;\alpha)$, using the current estimate of $Q(s,a;\alpha)$). Collect samples $s_{1:T+1}, a_{1:T}, r_{2:T+1}$.

Determine how good an initial choice of parameters $\alpha$ is by comparing the predicted reward and the actual reward through the Bellman residual error:

$J(\alpha) = \sum_{t=1,\dots,T} \left( r_{t+1} + \gamma\, Q(s_{t+1}, a'_{t+1};\alpha) - Q(s_t, a_t;\alpha) \right)^2,$

with $a'_{t+1} = \pi(s_{t+1};\alpha)$ and $\gamma \in [0,1]$ the discount factor.

See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, for various numerical solutions to determine the optimal alpha parameters.
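A compact sketch of the least-squares solve at the heart of LSPI (the LSTDQ step of Lagoudakis & Parr), alternated with greedy policy improvement. It can be used with a featurizer like the `q_features` above; the ridge term and iteration count are illustrative choices:

```python
import numpy as np

def lstdq(samples, greedy_action, featurize, k, gamma=0.95):
    """One policy-evaluation step of LSPI (LSTDQ).
    samples: list of (s, a, r, s_next) transitions.
    greedy_action(s): action the current policy takes in s.
    Solves A alpha = b so the Bellman equation holds in least squares."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        phi = featurize(s, a)
        phi_next = featurize(s_next, greedy_action(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)  # small ridge for stability

def lspi(samples, featurize, actions, k, n_iters=10, gamma=0.95):
    """Policy iteration: alternate LSTDQ with greedy policy improvement."""
    alpha = np.zeros(k)
    for _ in range(n_iters):
        greedy = lambda s: max(actions, key=lambda a: alpha @ featurize(s, a))
        alpha = lstdq(samples, greedy, featurize, k, gamma)
    return alpha
```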

SLIDE 22

LSPI: Example learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Goal: learn to balance and ride a bicycle to a target position located 1 km away from the starting location.

Continuous state: a six-dimensional real-valued vector, including
  • the angle of the handlebar,
  • the vertical angle of the bicycle,
  • the angle of the bicycle to the goal.

5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and displacement of the rider (discretized to {−0.02, 0, +0.02}).

Model of the world: uniform noise on each action; a model of the bike's dynamics.

The state-action value function Q(s, a), for a fixed action a, is approximated by a linear combination of 20 basis functions (linear and quadratic combinations of the state and its 1st- and 2nd-order derivatives).

SLIDE 23

LSPI: Example learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Collect training samples using a random policy; each episode lasts 20 steps. The learned policy is evaluated 100 times to estimate the probability of success (reaching the goal). A shaping reward was used: 1% of the change (in meters) in the distance to the goal, plus the change in the square of the vertical angle.

The policy after the first iteration balances the bicycle but fails to ride to the goal. The policy after the second iteration heads towards the goal but fails to balance. All policies thereafter balance and ride the bicycle to the goal. Crashes still happen because of noise in the model.

SLIDE 24

LSPI: Example learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Successful policies are found after a few thousand training episodes. With 5000 training episodes (60000 samples), the probability of success is about 95% and the expected number of balancing steps is about 70000.

SLIDE 25

RL in continuous state and action spaces

1....

States and actions , , are continuous: , One can no longer swipe through all states and actions to determine the optimal policy. Instead, one can either

t t N P t t

t T

s a s a

 

 

: 1) use function approximation to estimate the value function V(s) 2) use function approximation to estimate the state-action value function Q(s, a) 3) optimize a parameterized policy | (policy sear a s  ch)

SLIDE 26

Policy Gradients

An alternative is to parametrize the stochastic policy: $\pi(a \mid s)$ is approximated by drawing from the distribution $p(a \mid s; \theta)$, where $\theta$ are the parameters of the policy.

The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy $\pi(a \mid s_t)$, drawing from the distribution $p(a \mid s_t; \theta)$.

SLIDE 27

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Starting from a deterministic estimate of the policy, $a = \pi(s;\theta) = \theta^T \phi(s) = \sum_{j=1}^{K} \theta_j \phi_j(s)$, where the $\phi_j(s)$ are again a set of known basis functions and the $\theta_j$ the weights for each basis function.

Adding some noise $\epsilon_t$ and making the policy explicitly time-dependent, $a(s,t) = (\theta + \epsilon_t)^T \phi(s,t)$ with $\epsilon_t \sim N(0, \hat\Sigma)$, leads to the stochastic estimate (structured state-dependent exploration):

$\pi(a \mid s, t; \theta) \sim N\!\left(a \mid \theta^T \phi(s,t),\; \phi(s,t)^T \hat\Sigma\, \phi(s,t)\right)$
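A sketch of this structured exploration; here the basis functions depend only on (normalized) time, which is a simplifying assumption, and Σ̂ is taken diagonal:

```python
import numpy as np

rng = np.random.default_rng(3)

def basis(t, centers, width=0.15):
    """Gaussian basis functions over normalized time (a stand-in for phi(s, t))."""
    return np.exp(-((t - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(0, 1, 10)          # K basis functions
theta = np.zeros_like(centers)           # policy parameters
sigma2 = 0.1 * np.ones_like(centers)     # diagonal of Sigma-hat

def rollout_actions(theta, T=50):
    """Structured exploration: perturb the *parameters*, a = (theta + eps)^T phi,
    rather than adding independent noise to each action. The induced action
    noise has variance phi^T Sigma-hat phi, i.e. it is state/time dependent."""
    actions, eps_used = [], []
    for t in np.linspace(0, 1, T):
        eps_t = rng.normal(0.0, np.sqrt(sigma2))   # eps_t ~ N(0, Sigma-hat)
        actions.append((theta + eps_t) @ basis(t, centers))
        eps_used.append(eps_t)
    return np.array(actions), np.array(eps_used)
```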

SLIDE 28

Policy learning by Weighted Exploration with the Returns (PoWER). Finite horizon T; episodic restarts.

Kober and Peters, Machine Learning, 2011

At each roll-out (episode), the agent gathers a tuple of rewards and associated states and actions: $\tau = (s_{1:T+1}, a_{1:T}, r_{1:T+1})$. Each roll-out is generated by using a policy with parameters $\theta$: $\tau \sim p_\theta(\tau)$, with $p_\theta(\tau) = p(s_1)\, \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t, t;\theta)$.

The cumulative reward for roll-out $\tau$ is: $R(\tau) = \frac{1}{T} \sum_{t=1}^{T} r(s_t, s_{t+1}, a_t, t)$.

SLIDE 29

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

At each new trial, the parameters are computed as the weighted average of all previous trials, plus some Gaussian noise:

$\theta_{i+1} = \frac{\sum_i w_i\, \theta_i}{\sum_i w_i} + \epsilon_i, \quad \epsilon_i \sim N(0, \hat\Sigma), \quad i = 1 \dots \text{number of trials},$

where the weights $w_i$ are given by the returns of the corresponding trials.
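A toy episodic sketch of this return-weighted averaging; the return function, the noise scale, and the rule of reusing only the ten best past trials (a stand-in for the importance sampling used in the paper) are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

K, sigma = 8, 0.3
theta = np.zeros(K)                    # current mean policy parameters
target = np.linspace(-1, 1, K)         # hidden optimum, unknown to the learner

def episode_return(params):
    """Toy stand-in for running a roll-out and measuring its return."""
    return np.exp(-np.sum((params - target) ** 2))

history = []                           # (trial parameters, return)
for trial in range(300):
    eps = rng.normal(0.0, sigma, K)        # Gaussian exploration noise
    params = theta + eps                   # new trial = mean + noise
    history.append((params, episode_return(params)))
    best = sorted(history, key=lambda x: -x[1])[:10]   # reuse the best past trials
    w = np.array([R for _, R in best])                 # weights = returns
    P = np.array([p for p, _ in best])
    theta = (w @ P) / w.sum()              # return-weighted average of trials

print("final return:", episode_return(theta))
```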

SLIDE 30

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Teaching a robot the ball-in-a-cup task.
State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.

$\pi(a \mid x, \dot{x}, t) \sim N\!\left(\theta^T \phi(x, \dot{x}, t),\; \hat\Sigma\right)$, with 31 basis functions per DOF.

SLIDE 31

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

SLIDE 32

Policy learning by Weighted Exploration with the Returns (PoWER)

SLIDE 33

Inverse Reinforcement Learning

A major difficulty in RL is to determine the reward. The reward encapsulates key information about the task, and a poor choice may lead to suboptimal solutions. IRL was proposed as a framework to estimate what the reward could be: the reward is chosen such that it best explains the observed behavior of an "expert agent", i.e. an agent that knows how to perform the task (in an optimal manner). Estimation of the policy and the reward is then done simultaneously.

SLIDE 34

Inverse Reinforcement Learning

Consider first a discrete state-action space. Assume a parametrization of the reward function: $r(s) = \sum_{j=1}^{K} \theta_j \phi_j(s) = \theta^T \phi(s)$, where the $\phi_j : S \to [0,1]$ are bounded basis functions and $\theta_j \in [0,1]$, so that the reward is also bounded: $r(s) \in [0,1]$.

Abbeel & Ng, Int. Conf. on Machine Learning, 2004

SLIDE 35

Inverse Reinforcement Learning

Given $r(s) = \theta^T \phi(s)$, the value of a given policy $\pi$ over a $T$ time-step roll-out is:

$V^\pi = E\!\left[\sum_{t=1}^{T} \gamma^t\, r(s_t) \,\middle|\, \pi\right] = E\!\left[\sum_{t=1}^{T} \gamma^t\, \theta^T \phi(s_t) \,\middle|\, \pi\right] = \theta^T\, E\!\left[\sum_{t=1}^{T} \gamma^t\, \phi(s_t) \,\middle|\, \pi\right] = \theta^T \mu(\pi)$

$\mu(\pi)$: the expected features given policy $\pi$. A particular policy is thus evaluated as a linear combination of features.
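Estimating μ(π) from sampled roll-outs is a simple Monte Carlo average; a sketch, where `phi` is any user-supplied feature map:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi],
    from M sampled trajectories (each a list of states) under policy pi."""
    mu = None
    for states in trajectories:
        disc = np.array([gamma ** t for t in range(len(states))])
        feats = np.array([phi(s) for s in states])   # T x K feature matrix
        contrib = disc @ feats                       # sum_t gamma^t phi(s_t)
        mu = contrib if mu is None else mu + contrib
    return mu / len(trajectories)

# The value of the policy under reward r(s) = theta^T phi(s) is then simply:
# V_pi = theta @ feature_expectations(trajectories, phi)
```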

SLIDE 36

Inverse Reinforcement Learning

Step 1:
i) Record a set of M demonstrated paths (a series of roll-outs), yielding a set of $M(T+1)$ state transitions $\left\{ s_t^i \right\}_{t=1:T+1}^{i=1:M}$.
ii) Build an estimate of the optimal policy $\pi^*(s,a)$ of the demonstrator by computing the expected features for that policy: $\mu(\pi^*) = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \gamma^t\, \phi(s_t^i)$.

SLIDE 37

Inverse Reinforcement Learning

Step 2: Initialize the agent's policy $\pi^0(s,a)$, e.g. by picking a random policy, and then do, for i = 0...n iterations:
i) Approximate (e.g. via Monte Carlo) the expected features $\mu(\pi^i)$ of the policy $\pi^i$.
ii) Find the optimal weights $\theta^{i+1}$, solution of $\max_{\theta} \min_{j \in \{0,\dots,i\}} \theta^T \left( \mu(\pi^*) - \mu(\pi^j) \right)$, and let $c^{i+1}$ be the attained margin.
iii) Set the current estimate of the reward to be $r(s) = (\theta^{i+1})^T \phi(s)$ and use RL to find the optimal associated policy $\pi^{i+1}$. Terminate if $c^{i+1}$ is inferior to a threshold.

SLIDE 38

Inverse Reinforcement Learning

The algorithm is trying to find a reward function $r(s) = \theta^T \phi(s)$ such that $V^{\pi^*} \geq V^{\pi^i} + c$, i.e. a reward on which the expert does better, by a margin of $c$, than any of the $i-1$ policies found previously.

The optimization problem is equivalent to: $\max_{c,\,\theta}\; c$ with constraints $\theta^T \mu(\pi^*) \geq \theta^T \mu(\pi^j) + c$, $j = 1 \dots i-1$, and $\|\theta\|_2 \leq 1$. This is akin to maximum-margin optimization in SVMs.
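The max-min problem can be solved as a QP, as in SVMs; Abbeel & Ng also describe a simpler projection variant, one step of which is sketched here (the clipping of λ is a defensive detail added for illustration):

```python
import numpy as np

def irl_projection_step(mu_expert, mu_list, mu_bar_prev=None):
    """One step of the projection variant of Abbeel & Ng's algorithm, which
    avoids solving the max-margin QP directly. mu_expert: the expert's feature
    expectations; mu_list: those of the policies found so far."""
    mu_i = mu_list[-1]
    if mu_bar_prev is None:
        mu_bar = mu_i
    else:
        # Project mu_expert onto the segment between mu_bar_prev and mu_i
        d = mu_i - mu_bar_prev
        lam = np.clip(d @ (mu_expert - mu_bar_prev) / (d @ d), 0.0, 1.0)
        mu_bar = mu_bar_prev + lam * d
    theta = mu_expert - mu_bar        # new reward weights (margin direction)
    c = np.linalg.norm(theta)         # margin estimate; stop when below threshold
    return theta, c, mu_bar
```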

SLIDE 39

Learning an optimal policy to drive a car, or learning driving styles (nice, nasty, right-lane nice, right-lane nasty, middle lane).

Abbeel & Ng, Int. Conf. on Machine Learning, 2004

IRL Example: Driving a car

The agent's car is faster than any other car and needs to learn when and how to pass other cars.
  • 5 actions (steer onto any of the 3 lanes, or outside the road)
  • 5 features indicating where the car is
  • 10 features for discretized distances to the closest car

SLIDE 40

IRL Example: Driving a car

30 iteration steps for learning each policy through IRL (from top to bottom, left to right: nice, nasty, right-lane nice, right-lane nasty, middle lane). Learning is based on 2 minutes of expert demonstration of each driving style.

SLIDE 41

IRL Example: Driving a car

SLIDE 42

Learning without reward: Donut

Tasks: flipping a block to stand on its end; launching a ball into a basket.

The robot is provided solely with failed examples. It has no information about the task: no reward, no indication of what was incorrect.

Grollman and Billard, ICRA 2012, Best Paper Award Cognitive Robotics

SLIDE 43

Build a distribution (DONUT) that moves away from the bad demonstrations:
→ Explore around the demonstrations and use the covariance to guide the exploration.
→ Move away from things that have been visited a lot during unsuccessful demonstrations, but remain within the vicinity of the demonstrations.

2D DONUT distribution: high likelihood of drawing datapoints in the vicinity of, but away from, the demonstrated datapoints.

Original distribution of points from failed demonstrations.

Learning without reward: Donut

SLIDE 44

Learning without reward: Donut

Exploration parameter: determines how far away the peaks are from the base distribution. Base distribution: learned from the training examples (failed attempts) using a GMM. Donut distribution → exploratory policy.

Continuous state: joint angles $s$. Continuous action: joint velocities $a$, drawn from the DONUT distribution, a difference of two normal distributions built on the base model:

$a \sim D(a \mid s) = \frac{1}{\epsilon}\, N\!\left(a \mid \hat\mu(s), \hat\Sigma(s)\right) - \left(\frac{1}{\epsilon} - 1\right) N\!\left(a \mid \hat\mu(s), \hat\Sigma(s)/f\right)$
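A sketch of how one might draw from such a difference-of-Gaussians density in 1-D via rejection sampling; the exact parametrization in Grollman & Billard differs, and the constants `inv_eps` and `f` are made up:

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss(x, mu, var):
    """1-D normal density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def donut_density(x, mu, var, inv_eps=3.0, f=4.0):
    """Difference of two normals sharing a mean (wide minus narrow), clamped
    at zero, so probability mass sits around mu but not on it."""
    return np.maximum(0.0, inv_eps * gauss(x, mu, var)
                      - (inv_eps - 1.0) * gauss(x, mu, var / f))

def sample_donut(mu, var, n=1000, inv_eps=3.0, f=4.0):
    """Rejection sampling with the wide normal (scaled by inv_eps) as the
    envelope; it dominates because the subtracted term is non-negative."""
    out = []
    while len(out) < n:
        x = rng.normal(mu, np.sqrt(var))
        if rng.uniform() < donut_density(x, mu, var, inv_eps, f) / (inv_eps * gauss(x, mu, var)):
            out.append(x)
    return np.array(out)

draws = sample_donut(0.0, 1.0)
print("mean |draw|:", np.abs(draws).mean())   # mass lies away from the demonstrated mean
```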

SLIDE 45

High coherence across trials → high confidence. Little coherence across trials → low confidence.

[Figure: velocity vs. angular position (radians) of the robot's wrist]

  • Search around the demonstrations.
  • Reproduce only the parts where all demonstrators agreed.
  • Avoid regions with high uncertainty.

Learning without reward: Donut

SLIDE 46


Learning without reward: Donut

SLIDE 47

  • Finds a solution in a few trials.
  • Is comparable in efficiency to classical reinforcement learning approaches.
  • But does not need a reward!

Learning without reward: Donut

SLIDE 48

Learning without reward: Donut

Multidimensional DONUT applied to learn how to play Golf

Region in parameter space that leads to a successful hit.

State: position of the ball with respect to the hole. Action: speed and orientation of the end-effector when hitting the ball.

SLIDE 49

Learning without reward: Donut

Multidimensional DONUT: create holes in the distribution located on each unsuccessful attempt.

Lightness corresponds to the likelihood of generating arbitrary 2D parameters. Red dots are human attempts; blue dots are system-generated trials.

SLIDE 50

Learning without reward: Donut

Multidimensional DONUT applied to learn how to play Golf

Grollman and Billard, IJSR, 2012

SLIDE 51

Summary

  • RL is a good alternative to supervised learning when you do not know what the optimal solution would be.
  • Drawbacks of classical discrete state-action RL can be overcome by considering continuous value functions.
  • The reward function can be estimated through IRL.
  • Learning without rewards, from purely unsuccessful examples, is as yet a little-explored problem.