
Reinforcement Learning for Continuous State and Action Spaces – Presentation Transcript



  1. MACHINE LEARNING – 2012. MACHINE LEARNING TECHNIQUES AND APPLICATIONS. Reinforcement Learning for Continuous State and Action Spaces: Gradient Methods.

  2. MACHINE LEARNING – 2012 Reinforcement Learning (RL). Supervised learning vs. unsupervised learning: reinforcement learning sits in between the two.

  3. MACHINE LEARNING – 2012 Drawbacks of classical RL. • Curse of dimensionality: computational costs increase dramatically with the number of states. • Markov world: cannot handle continuous action and state spaces. • Model-based vs. model-free: needs a model of the world (which can be estimated through exploration). → Gradient methods to handle continuous state and action spaces.

  4. MACHINE LEARNING – 2012 RL in continuous state and action spaces. States and actions are continuous: s_t ∈ R^N, a_t ∈ R^P, t = 1...T. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either: 1) use function approximation to estimate the value function V(s); 2) use function approximation to estimate the state-action value function Q(s, a); 3) optimize a parameterized policy π(a | s) (policy search).

  5. MACHINE LEARNING – 2012 Policy Gradients. Parametrize the value function V(s; θ) with open parameters θ. Assume an initial estimate V(s; θ). Run an episode using a greedy policy a ~ π(a | s; θ). Compute the error on the estimate: e = V(s_t; θ) − E[Σ_k r_{t+k}]. Find the optimal parameters through gradient descent on the state value function: Δθ ∝ e ∇_θ V(s; θ).
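A minimal sketch of this idea (not from the slides): a linear value model V(s; θ) = θᵀφ(s) is fitted by gradient descent on the error between its prediction and the observed discounted return of an episode. The 1-D random-walk environment, the polynomial features and all constants are illustrative assumptions.

```python
import numpy as np

# Sketch of slide 5: fit a parametrized value function V(s; theta) by gradient
# descent on the error between V(s_t; theta) and the observed episode return.
# The 1-D random-walk "environment" and the features are assumptions.

rng = np.random.default_rng(0)

def features(s):
    # Simple polynomial features phi(s); any differentiable model would do.
    return np.array([1.0, s, s**2])

def run_episode(T=50):
    # Hypothetical dynamics: a noisy 1-D walk with reward = -|s|.
    s, states, rewards = 0.0, [], []
    for _ in range(T):
        states.append(s)
        rewards.append(-abs(s))
        s = s + rng.normal(scale=0.1)
    return states, rewards

theta = np.zeros(3)            # open parameters of V(s; theta) = theta^T phi(s)
alpha, gamma = 0.01, 0.95      # assumed learning rate and discount factor

for episode in range(200):
    states, rewards = run_episode()
    # Discounted return G_t = sum_k gamma^k r_{t+k}, computed backwards.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, G in zip(states, returns):
        phi = features(s)
        e = phi @ theta - G           # error on the estimate
        theta -= alpha * e * phi      # delta(theta) ~ e * grad_theta V(s; theta)

print("fitted parameters:", theta)
```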

  6. MACHINE LEARNING – 2012 TD learning for continuous state-action spaces. Parametrize the value function such that V(s; θ) = Σ_{j=1..K} θ_j φ_j(s) = θ^T φ(s). φ(s) is a set of K basis functions (e.g. RBF functions); these are set by the user and are also called the features. The θ_j are the weights associated with each feature; these are the unknown parameters. Doya, NIPS 1996
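A short sketch of such a feature-based value model, assuming Gaussian RBF basis functions on a 2-D state space; the centers, bandwidth and state dimension are illustrative choices, not taken from the slides.

```python
import numpy as np

# Linear value model of slide 6: V(s; theta) = sum_j theta_j * phi_j(s),
# with Gaussian RBF basis functions as the features.

def make_rbf_features(centers, width):
    centers = np.asarray(centers, dtype=float)
    def phi(s):
        s = np.asarray(s, dtype=float)
        d2 = np.sum((centers - s) ** 2, axis=1)     # squared distance to each center
        return np.exp(-d2 / (2.0 * width ** 2))     # one activation per basis function
    return phi

# K = 9 RBF centers on a grid over an assumed 2-D state space.
grid = np.linspace(-1.0, 1.0, 3)
centers = [(x, y) for x in grid for y in grid]
phi = make_rbf_features(centers, width=0.5)

theta = np.zeros(len(centers))    # unknown weights, one per feature

def V(s):
    return phi(s) @ theta         # V(s; theta) = theta^T phi(s)

print(V([0.2, -0.3]))             # 0.0 until theta is learned (e.g. by TD, next slide)
```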

  7. MACHINE LEARNING – 2012 TD learning for continuous state-action spaces. Pick a set of parameters θ and compute V(s; θ) = Σ_{j=1..K} θ_j φ_j(s) = θ^T φ(s). Do some roll-outs of fixed episode length and gather r(s_t), t = 1...T. Estimate a new V(s; θ) through TD learning. Gradient descent on the squared TD error gives: Δθ_j ∝ [ r(s_t) + γ V(s_{t+1}; θ) − V(s_t; θ) ] φ_j(s_t), j = 1...K. Other techniques for estimating non-linear regression functions have been proposed elsewhere to get a better estimate of the parameters than simple gradient descent. Doya, NIPS 1996
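A sketch of this TD update for the linear model above. The roll-out generator is a hypothetical 1-D stand-in; with real transitions (s_t, r_t, s_{t+1}) the update line is the same.

```python
import numpy as np

# TD update of slide 7 for V(s; theta) = theta^T phi(s):
#   delta(theta_j) ~ [ r(s_t) + gamma * V(s_{t+1}; theta) - V(s_t; theta) ] * phi_j(s_t)

rng = np.random.default_rng(1)

def phi(s, centers=np.linspace(-1, 1, 10), width=0.3):
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))   # RBF features

def rollout(T=100):
    # Assumed 1-D dynamics with reward encouraging s near 0.
    s = rng.uniform(-1, 1)
    for _ in range(T):
        s_next = np.clip(s + rng.normal(scale=0.1), -1, 1)
        yield s, -abs(s), s_next
        s = s_next

theta = np.zeros(10)
alpha, gamma = 0.05, 0.95          # assumed step size and discount factor

for episode in range(100):         # "do some roll-outs of fixed episode length"
    for s, r, s_next in rollout():
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * td_error * phi(s)   # gradient step on the squared TD error

print("learned weights:", np.round(theta, 2))
```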

  8. MACHINE LEARNING – 2012 Robotics applications of continuous RL: teaching a two-joint, three-link robot leg to stand up. Robot configuration: θ_0: pitch angle, θ_1: hip joint angle, θ_2: knee joint angle, θ_m: the angle of the line from the center of mass to the center of the foot. Morimoto and Doya, Robotics and Autonomous Systems, 2001

  9. MACHINE LEARNING – 2012 Robotics applications of continuous RL. • The final goal is the upright stand-up posture. • Reaching the final goal is a necessary but not sufficient condition for a successful stand-up, because the robot may fall down after passing through the final goal → need to define subgoals. • The stand-up task is accomplished when the robot stands up and stays upright for more than 2(T + 1) seconds. Morimoto and Doya, Robotics and Autonomous Systems, 2001

  10. MACHINE LEARNING – 2012 Robotics applications of continuous RL. Hierarchical reinforcement learning: • Upper layer: discretize the state-action space and discover which subgoal should be reached using Q-learning. State: pitch and joint angles. Actions: joint displacements (not the torques). • Lower layer: continuous state-action space; the robot learns to apply the appropriate torques to achieve each subgoal. State: pitch and joint angles, and their velocities. Actions: torques at the two joints. Morimoto and Doya, Robotics and Autonomous Systems, 2001
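A purely structural sketch of this two-layer idea, under assumed bin counts, subgoal actions and rewards: the upper layer is a tabular Q-learner over discretized angles that scores subgoal choices; a continuous-control lower layer (in the spirit of the actor-critic sketch further below) is omitted.

```python
import numpy as np

# Upper layer of the hierarchy on slide 10: tabular Q-learning over a coarse
# discretization of (pitch, hip, knee) angles, choosing among discrete subgoal
# postures. All sizes and values here are illustrative assumptions.

N_BINS = 5                               # bins per angle (pitch, hip, knee)
SUBGOAL_ACTIONS = 9                      # assumed number of joint-displacement choices

q_table = np.zeros((N_BINS, N_BINS, N_BINS, SUBGOAL_ACTIONS))

def discretize(angles, low=-np.pi, high=np.pi):
    bins = np.linspace(low, high, N_BINS + 1)
    return tuple(int(np.clip(np.digitize(a, bins) - 1, 0, N_BINS - 1)) for a in angles)

def upper_update(state, action, reward, next_state, alpha=0.1, gamma=0.95):
    s, s2 = discretize(state), discretize(next_state)
    target = reward + gamma * np.max(q_table[s2])
    q_table[s][action] += alpha * (target - q_table[s][action])

# One upper-layer learning step with made-up values:
upper_update(state=(0.1, -0.5, 0.8), action=3, reward=0.2, next_state=(0.0, -0.3, 0.6))
print(q_table[discretize((0.1, -0.5, 0.8))])
```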

  11. MACHINE LEARNING – 2012 Robotics applications of continuous RL. Reward: decompose the task into upper-level and lower-level sets of goals. Y is the height of the head of the robot at a sub-goal posture and L is the total length of the robot. When the robot achieves a sub-goal, the learner gets a reward < 0.5. The full reward is obtained when all subgoals and the main goal are achieved. Morimoto and Doya, Robotics and Autonomous Systems, 2001

  12. MACHINE LEARNING – 2012 Robotics applications of continuous RL. State and action are continuous in time: s(t), u(t). Model the value function: V(s; θ) = θ^T φ(s). Model the policy π(u | s; ω): u(s) = f(ω^T φ(s) + ν), where f is a sigmoid function and ν is noise. Learn θ through gradient descent on the squared TD error. Backpropagate the TD error e(t) to estimate ω: Δω_j ∝ e(t) ν(t) φ_j(s(t)), j = 1...K. Morimoto and Doya, Robotics and Autonomous Systems, 2001
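A sketch of this actor-critic scheme under toy dynamics: a linear critic V(s; θ) and a bounded sigmoid policy u(s) = f(ωᵀφ(s) + ν), both updated from the TD error e(t). The pendulum-like transition model, the reward and all constants are assumptions, not the setup of Morimoto and Doya.

```python
import numpy as np

# Actor-critic scheme of slide 12: linear critic V(s; theta) = theta^T phi(s),
# sigmoid policy u(s) = f(omega^T phi(s) + nu), both driven by the TD error e(t).

rng = np.random.default_rng(2)

def phi(s, centers=np.linspace(-np.pi, np.pi, 15), width=0.5):
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

def sigmoid(x, u_max=2.0):
    return u_max * np.tanh(x)            # bounded, sigmoid-shaped output

theta = np.zeros(15)                     # critic weights
omega = np.zeros(15)                     # actor weights
alpha_v, alpha_u, gamma, sigma = 0.05, 0.02, 0.97, 0.3   # assumed constants

s = np.pi                                # start "hanging down"
for step in range(20000):
    feats = phi(s)
    nu = sigma * rng.normal()            # exploration noise
    u = sigmoid(omega @ feats + nu)      # policy output (torque-like command)
    # Toy kinematic transition and reward (assumed): +1 reward when s is near 0.
    s_next = (s + 0.05 * u + 0.02 * rng.normal() + np.pi) % (2 * np.pi) - np.pi
    r = np.cos(s_next)
    e = r + gamma * phi(s_next) @ theta - feats @ theta   # TD error e(t)
    theta += alpha_v * e * feats                          # critic update
    omega += alpha_u * e * nu * feats                     # actor update ~ e(t) nu(t) phi_j
    s = s_next

print("value near upright state:", phi(0.0) @ theta)
```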

  13. MACHINE LEARNING – 2012 Robotics applications of continuous RL. • 750 trials in simulation + 170 on the real robot. • Goal: to stand up. Morimoto and Doya, Robotics and Autonomous Systems, 2001

  14. MACHINE LEARNING – 2012 Robotics applications of continuous RL: learning how to swing up a pendulum. Separate the problem into two parts: a) learning to swing the pole up; b) learning to balance the pole upright. Part a is learned through TD-learning; part b is learned using optimal control. Additionally, a model of the dynamics of the inverted pendulum is estimated through locally weighted regression (estimating the parameters of a known model). Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
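A sketch of locally weighted regression as a model-estimation tool, as referenced in this slide: each prediction is a weighted least-squares fit around the query point. The Gaussian kernel width, ridge term and toy data are assumptions.

```python
import numpy as np

# Locally weighted regression (LWR): answer each query with a weighted
# least-squares fit, weighting data points by their distance to the query.

def lwr_predict(X, y, x_query, width=0.3, ridge=1e-6):
    """Predict y at x_query from data (X, y) using a Gaussian distance kernel."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * width ** 2))
    Xb = np.hstack([X, np.ones((len(X), 1))])          # add bias column
    W = np.diag(w)
    A = Xb.T @ W @ Xb + ridge * np.eye(Xb.shape[1])    # regularized normal equations
    beta = np.linalg.solve(A, Xb.T @ W @ y)
    return np.append(x_query, 1.0) @ beta

# Toy 1-D "dynamics" data: next state as a nonlinear function of current state.
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)

print(lwr_predict(X, y, np.array([0.5])))   # approximately sin(0.5) ~ 0.48
```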

  15. MACHINE LEARNING – 2012 Robotics applications of continuous RL: learning how to swing up a pendulum. Collect a sequence {s*_t, u*_t}, t = 1...T, from a single human demonstration. State of the system: s_t = (x_t, ẋ_t, θ_t, θ̇_t) (human hand trajectory and velocity, angular trajectory and velocity of the pendulum). Actions: u_t = ẍ_t (can be converted into a torque command for the robot). Reward: minimizes angular and hand displacements as well as velocities. Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

  16. MACHINE LEARNING – 2012 Robotics applications of continuous RL. First learn a model of the task through RL, using continuous TD learning to estimate V(s; θ): take a model V(s; θ) = Σ_{j=1..K} θ_j φ_j(s). The parameters are estimated incrementally through non-linear function approximation (locally weighted regression). Then learn the reward: r(x_t, u_t) = (x_t − x*_t)^T Q (x_t − x*_t) + u_t^T R u_t. Q and R are estimated so as to minimize discrepancies between human and robot trajectories. Use the reward to generate optimal trajectories through optimal control. Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
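A small sketch of this quadratic reward model; the Q and R matrices and the state/action values below are placeholders (in the paper they are tuned so that the resulting optimal-control trajectories match the human demonstration).

```python
import numpy as np

# Quadratic reward model of slide 16:
#   r(x_t, u_t) = (x_t - x*_t)^T Q (x_t - x*_t) + u_t^T R u_t,
# where Q weights the state-tracking error and R weights the control effort.

def quadratic_reward(x, x_star, u, Q, R):
    dx = x - x_star
    return float(dx @ Q @ dx + u @ R @ u)

# Example with an assumed 4-D state (x, x_dot, theta, theta_dot) and scalar action.
Q = np.diag([1.0, 0.1, 5.0, 0.5])     # assumed weights on state error
R = np.array([[0.01]])                # assumed weight on control effort
x, x_star = np.array([0.1, 0.0, 0.2, 0.0]), np.zeros(4)
u = np.array([0.3])

print(quadratic_reward(x, x_star, u, Q, R))
```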

  17. MACHINE LEARNING – 2012 Robotics applications of continuous RL: learning how to swing up a pendulum. Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

  18. MACHINE LEARNING – 2012 RL in continuous state and action spaces. States and actions are continuous: s_t ∈ R^N, a_t ∈ R^P, t = 1...T. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either: 1) use function approximation to estimate the value function V(s); 2) use function approximation to estimate the state-action value function Q(s, a); 3) optimize a parameterized policy π(a | s) (policy search).

  19. MACHINE LEARNING – 2012 Least-Squares Policy Iteration (LSPI).

  20. MACHINE LEARNING – 2012 Least-Squares Policy Iteration. Approximate the state-action value function: Q(s, a; θ) = Σ_{j=1..K} θ_j φ_j(s, a) = θ^T φ(s, a). φ(s, a) is a set of K basis functions φ_j: R^N × R^P → R; these are set by the user and are also called the features. The θ_j are the weights associated with each feature; these are the unknown parameters.
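The slides do not show the weight estimate itself; a common batch formulation for this linear Q model is the LSTDQ solve used inside LSPI (Lagoudakis & Parr): θ = A⁻¹ b, with A = Σ φ(s,a)(φ(s,a) − γ φ(s′, π(s′)))^T and b = Σ φ(s,a) r over a batch of transitions. The sketch below uses an assumed 1-D toy problem and feature map.

```python
import numpy as np

# LSTDQ-style least-squares estimate of the weights of Q(s, a; theta) = theta^T phi(s, a)
# for a fixed policy pi, from a batch of transitions (s, a, r, s').

def lstdq(transitions, phi, policy, K, gamma=0.95, ridge=1e-6):
    A = ridge * np.eye(K)
    b = np.zeros(K)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

# Toy setup: 1-D state, two discrete actions, per-action polynomial features.
def phi(s, a):
    onehot = np.zeros(2)
    onehot[a] = 1.0
    return np.concatenate([onehot, onehot * s, onehot * s * s])

rng = np.random.default_rng(4)

def step(s, a):                        # assumed dynamics: action 1 pushes s toward 0
    return s + (0.1 if a == 0 else -0.1 * np.sign(s)) + 0.01 * rng.normal()

policy = lambda s: 1                   # current (fixed) policy being evaluated
transitions, s = [], 1.0
for _ in range(500):
    a = int(rng.integers(2))           # exploratory data collection
    s_next = step(s, a)
    transitions.append((s, a, -abs(s_next), s_next))
    s = s_next

theta = lstdq(transitions, phi, policy, K=6)
print("Q-weights:", np.round(theta, 2))
```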
