SLIDE 1

MIN Faculty Department of Informatics

Soft Actor-Critic: Deep Reinforcement Learning for Robotics

Finn Rietz

University of Hamburg
Faculty of Mathematics, Informatics and Natural Sciences
Department of Informatics
Technical Aspects of Multimodal Systems

  • 13 January 2020

SLIDE 2

Creative policy example

Taken from [1]

SLIDE 3

Outline

  • 1. Motivation and reinforcement learning (RL) basics
  • 2. Challenges in deep reinforcement learning (DRL) with robotics
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 4

Motivation

Potential of RL:
◮ Automatic learning of robotic tasks, directly from sensory input
Promising results:
◮ Superhuman performance on Atari games [2]
◮ AlphaGo Zero becoming the strongest Go player [3]
◮ AlphaStar becoming better than 99.8% of all StarCraft II players [4]
◮ Real-world, simple robotic manipulation tasks (with numerous limitations) [5]

SLIDE 5

Basics

Markov Decision Process. Figure taken from [6]

RL in a nutshell:
◮ Learning to map situations to actions
◮ Trial-and-error search
◮ Maximize a numerical reward signal

SLIDE 6

Reinforcement Learning fundamentals

◮ Reward r_t: a scalar
◮ State s_t ∈ S: a vector of observations
◮ Action a_t ∈ A: a vector of actions
◮ Policy π: a mapping from states to actions
◮ Action-value function Q^π(s_t, a_t): expected return for a state-action pair
Putting the "deep" in RL:
◮ How to deal with continuous spaces?
◮ Approximate the functions of states and actions (e.g. Q) with neural networks (see the sketch below)
◮ The approximator has a fixed, limited number of parameters
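
A minimal sketch (my illustration, not code from the talk) of such a function approximator in Python, using PyTorch; the layer sizes are assumptions:

```python
# Sketch: approximating Q^pi(s_t, a_t) with a small neural network,
# as is typical in deep RL for continuous state and action spaces.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # Maps a (state, action) pair to a single scalar Q-value.
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# Usage: q = QNetwork(state_dim=17, action_dim=6); q(s, a) has shape (batch, 1)
```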

SLIDE 7

On-policy versus off-policy learning

On-policy learning:
◮ Only one policy
◮ Exploitation versus exploration dilemma
◮ Optimizes the same policy that collects the data
◮ Very data-hungry
Off-policy learning:
◮ Employs multiple policies
◮ One collects the data, another becomes the final policy
◮ We can save and reuse past experiences (sketched below)
◮ More suitable for robotics
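
Reusing past experiences is typically implemented with a replay buffer. A minimal sketch (an illustration, not from the talk):

```python
# Sketch: a replay buffer, the mechanism that lets off-policy methods
# learn from data collected by earlier versions of the policy.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        # Store one transition, regardless of which policy generated it.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniformly re-sample past experience; on-policy methods cannot do this.
        return random.sample(self.buffer, batch_size)
```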

SLIDE 8

Model-based versus model-free methods

Model-based methods:
◮ Learn a model of the environment
◮ Choose actions by planning with the learned model
◮ "Think, then act"
◮ Statistically efficient, but the model is often too complex to learn
Model-free methods:
◮ Directly learn the Q-function by sampling from the environment (sketched below)
◮ No planning possible
◮ Can produce the same optimal policy as model-based methods
◮ More suitable for robotics
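
To illustrate the model-free idea, here is a minimal sketch of a tabular one-step Q-learning update; the Q-table layout and NUM_ACTIONS are hypothetical:

```python
# Sketch: model-free TD update. The Q-function is learned directly from
# sampled transitions; no model of the environment is ever built.
NUM_ACTIONS = 4  # hypothetical discrete action count

def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Q is a dict mapping (state, action) -> estimated value."""
    best_next = 0.0 if done else max(
        Q.get((next_state, a), 0.0) for a in range(NUM_ACTIONS))
    td_target = reward + gamma * best_next             # bootstrapped target
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
```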

SLIDE 9

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 10

Data inefficiency

RL algorithms are notoriously data-hungry:
◮ Not a big problem in simulated settings
◮ Impractical amounts of training time in the real world
◮ Wear and tear on the robot must be minimized
◮ Need for sample-efficient methods
Off-policy methods are better suited, due to their higher sample efficiency.

SLIDE 11

Safe exploration

RL is trial-and-error search:
◮ Again, no problem in simulation
◮ Randomly applying force to the motors of an expensive robot is problematic
◮ Could lead to the destruction of the robot
◮ Need for safety measures during exploration
Possible solutions: limit the maximum allowed velocity per joint and enforce position limits for joints [7] (see the sketch below)
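
Such limits can be enforced by filtering actions before they reach the robot. A minimal sketch in the spirit of [7]; the function and parameter names are assumptions:

```python
# Sketch: clamp commanded joint velocities and block motion past position limits.
import numpy as np

def safe_action(action, joint_pos, max_vel, pos_low, pos_high):
    """Clip a joint-velocity command before sending it to the motors."""
    action = np.clip(action, -max_vel, max_vel)        # velocity limit per joint
    # Zero out velocities that would push a joint past its position limit.
    at_upper = (joint_pos >= pos_high) & (action > 0)
    at_lower = (joint_pos <= pos_low) & (action < 0)
    action[at_upper | at_lower] = 0.0
    return action
```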

SLIDE 12

Sparse rewards

The classic reward is a binary measure:
◮ The robot might never complete a complex task, and thus never observes a reward
◮ No variance in the reward function, so no learning is possible
◮ Need for a manually designed reward function (reward engineering)
◮ Need for a designated state representation, against the principle of RL
◮ Not a trivial problem: manually designed reward functions are often exploited in unforeseen ways

SLIDE 13

Reality Gap

Why not train in simulation?
◮ Simulations are still imperfect
◮ Many (small) dynamics of the environment remain uncaptured
◮ The policy will likely not generalize to the real world
◮ Active research field, e.g. automatic domain randomization (sketched below)
Training in simulation is more attractive, but the resulting policy is often not directly applicable in the real world.
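
Domain randomization can be sketched as re-sampling simulator parameters every episode, so the policy cannot overfit to one imperfect simulation. The simulator attributes below are hypothetical placeholders:

```python
# Sketch: randomize physics parameters at the start of each training episode.
import random

def randomize_sim(sim):
    sim.friction = random.uniform(0.5, 1.5)          # around a nominal 1.0
    sim.link_mass = random.uniform(0.8, 1.2)         # +-20% of nominal mass
    sim.motor_strength = random.uniform(0.9, 1.1)
    sim.sensor_noise_std = random.uniform(0.0, 0.05)
    return sim
```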

SLIDE 14

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 15

Soft actor-critic algorithm

Soft actor-critic by Haarnoja et al.:
◮ Original version, early 2018: temperature hyperparameter [8]
◮ Refined version, late 2018: workaround for the critical hyperparameter [9]
◮ Developed in cooperation between UC Berkeley and Google Brain
◮ Off-policy, model-free, actor-critic method
◮ Key idea: exploit the entropy of the policy
◮ "Succeed at the task while acting as randomly as possible" [9]

SLIDE 16

Soft actor-critic algorithm

Classical reinforcement learning objective:
◮ Σ_t E_(s_t, a_t)∼ρ_π [ r(s_t, a_t) ]
◮ Find the π(a_t|s_t) that maximizes the sum of rewards
SAC objective:
◮ π* = argmax_π Σ_t E_(s_t, a_t)∼ρ_π [ r(s_t, a_t) + α H(π(·|s_t)) ]
◮ Augments the classical objective with an entropy regularization term H
◮ Problematic temperature hyperparameter α
◮ Instead, treat entropy as a constraint and update α automatically during learning (see the sketch below)
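
The automatic temperature update can be sketched as one gradient step on α per training batch, following the constrained formulation in [9]; the learning rate and target entropy value here are assumptions:

```python
# Sketch: learn the temperature alpha instead of hand-tuning it.
import torch

log_alpha = torch.zeros(1, requires_grad=True)  # optimize log(alpha), keeping alpha > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -6.0                           # common heuristic: -|A|, here 6 action dims

def update_alpha(log_prob: torch.Tensor) -> torch.Tensor:
    """One step on alpha, given log pi(a_t|s_t) of actions sampled from the policy."""
    # Alpha grows when policy entropy (-log_prob) drops below the target,
    # strengthening the entropy bonus, and shrinks otherwise.
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().detach()             # current alpha for the SAC losses
```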

SLIDE 17

Advantages of using entropy

Some advantages of the maximum entropy objective:
◮ The policy explores more widely
◮ Learns multiple modes of near-optimal behavior, making it more robust
◮ Significantly speeds up learning

SLIDE 18

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 19

Dexterous hand manipulation

[9]

◮ 3-finger hand with 9 degrees of freedom
◮ Goal: rotate a valve into a target position
◮ Learns directly from RGB images via CNN features
◮ Challenging due to the complex hand and end-to-end perception
◮ 20 hours of real-world training

SLIDE 20

Dexterous hand manipulation

[9]

Alternative mode:
◮ Use the valve position directly (instead of raw images)
◮ 3 hours of real-world training
◮ Substantially faster than the competition on the same task (PPO: 7.4 hours [10])

SLIDE 21

Dexterous hand manipulation

[11]

SLIDE 22

Simulated Benchmark

Comparison of SAC against other state-of-the-art algorithms:
◮ DDPG (2015): off-policy, model-free, sample-efficient [12]
◮ TD3 (2018): extension of DDPG [13]
◮ PPO (2017): on-policy (relatively efficient), model-free [14]
Simple and complex environments:
◮ Hopper-v2 (2D), Walker2d-v2 (2D), HalfCheetah-v2 (2D), Ant-v2 (3D)
◮ Humanoid-v2 (3D), Humanoid (rllab, 3D)

SLIDE 23

Simulated Benchmark

[Figure: learning curves (average return over millions of environment steps) on Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, Humanoid-v2, and Humanoid (rllab), comparing SAC (learned temperature), SAC (fixed temperature), DDPG, TD3, and PPO. Taken from [9]]

◮ Comparable to the baselines on simple tasks
◮ Exceeds the baselines on challenging tasks

SLIDE 24

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 25

Wrap-up & Conclusion

Soft actor-critic in a nutshell:
◮ Off-policy (higher sample efficiency)
◮ Model-free (almost a necessity for real-world robotics)
◮ Training in simulation would be preferable, but is still problematic
◮ Exploits the entropy framework
Take-away:
◮ Can learn directly in the real world
◮ Can learn from raw sensory input (end-to-end)
◮ Entropy significantly speeds up learning
◮ Comparable to the state of the art on simple tasks
◮ Exceeds the state of the art on complex tasks

SLIDE 26

Question time

Thanks for your attention :)

SLIDE 28

References

[1] Xue Bin Peng et al. "DeepMimic". In: ACM Transactions on Graphics 37.4 (July 2018), pp. 1–14. ISSN: 0730-0301. DOI: 10.1145/3197517.3201311. URL: http://dx.doi.org/10.1145/3197517.3201311.

[2] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. ISSN: 0028-0836. URL: http://dx.doi.org/10.1038/nature14236.

[3] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550.7676 (2017), pp. 354–359.

[4] Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575.7782 (2019), pp. 350–354.

SLIDE 29

References (cont.)

[5] Shixiang Gu et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates". In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.

[6] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018. URL: http://incompleteideas.net/book/the-book-2nd.html.

[7] S. Gu et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates". In: 2017 IEEE International Conference on Robotics and Automation (ICRA). May 2017, pp. 3389–3396. DOI: 10.1109/ICRA.2017.7989385.

SLIDE 30

References (cont.)

[8] Tuomas Haarnoja et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor". In: arXiv preprint arXiv:1801.01290 (2018).

[9] Tuomas Haarnoja et al. "Soft actor-critic algorithms and applications". In: arXiv preprint arXiv:1812.05905 (2018).

[10] Henry Zhu et al. "Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost". In: 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3651–3657.

[11] Soft Actor-Critic Project Website. https://sites.google.com/view/sac-and-applications. Accessed: 2020-01-05.

SLIDE 31

References (cont.)

[12] Timothy P. Lillicrap et al. "Continuous control with deep reinforcement learning". In: arXiv preprint arXiv:1509.02971 (2015).

[13] Scott Fujimoto, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods". In: arXiv preprint arXiv:1802.09477 (2018).

[14] John Schulman et al. "Proximal policy optimization algorithms". In: arXiv preprint arXiv:1707.06347 (2017).

SLIDE 32

Value-based versus policy-based methods

So far, value-based methods:
◮ Learn a value function (Q)
◮ Select actions based on the learned value function
◮ Policies depend strongly on the value function
Alternatively, policy-based methods:
◮ Learn a parameterized policy
◮ No value function required; use the total reward obtained after each action
◮ Can deal with continuous state and action spaces
◮ However, requires complete episodes (Monte Carlo; see the sketch below)
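
A minimal sketch (my illustration) of such a policy-based update is REINFORCE, which needs only the log-probabilities of the chosen actions and the returns of a complete episode:

```python
# Sketch: Monte-Carlo policy gradient (REINFORCE); no value function involved.
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t|s_t) tensors; rewards: list of r_t for one episode."""
    returns, G = [], 0.0
    for r in reversed(rewards):                  # discounted return-to-go
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Gradient ascent on sum_t G_t * log pi(a_t|s_t) (so minimize the negative).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```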

SLIDE 33

Actor-critic methods

Why not use both?
◮ Learn a policy (the actor)
◮ Learn a value function (the critic), approximating the true value function
◮ Basis for most recent RL algorithms
At each time step (TD approach):
◮ Adjust the critic to fit the value function
◮ Update the actor toward the new critic (see the sketch below)
◮ This is the classical generalized policy iteration (GPI) scheme
◮ Not possible for purely policy-based methods
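
A minimal sketch (my illustration, in the style of deterministic actor-critic methods such as DDPG [12], not SAC itself) of one such TD-style step:

```python
# Sketch: fit the critic to a bootstrapped TD target, then update the actor
# to maximize the current critic, one transition batch at a time.
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    # 1) Critic: regress Q(s, a) toward the one-step TD target.
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * critic(next_state, actor(next_state))
    critic_loss = F.mse_loss(critic(state, action), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 2) Actor: maximize the critic's value of the actor's own action.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```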

SLIDE 34

Quadrupedal locomotion

Learning quadrupedal walking gaits:
◮ Learning directly in the real world
◮ Some reward engineering required
◮ Walking learned within 2 hours of training
◮ First example of DRL for quadrupedal locomotion without any pretraining
◮ SAC policies are robust and generalize well to unseen environments

SLIDE 35

Quadrupedal locomotion

[11]

SLIDE 36

Quadrupedal locomotion

[11]

SLIDE 37

Dexterous hand manipulation

[11]
