SLIDE 1

MIN Faculty Department of Informatics

Soft Actor-Critic: Deep Reinforcement Learning for Robotics

Finn Rietz

University of Hamburg
Faculty of Mathematics, Informatics and Natural Sciences
Department of Informatics
Technical Aspects of Multimodal Systems

  • 13 January 2020

SLIDE 2

Creative policy example

Taken from [1]

SLIDE 3

Outline

  • 1. Motivation and reinforcement learning (RL) basics
  • 2. Challenges in deep reinforcement learning (DRL) with robotics
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 4

Motivation

Potential of RL:
◮ Automatic learning of robotic tasks, directly from sensory input
Promising results:
◮ Superhuman performance on Atari games [2]
◮ AlphaGo Zero becoming the strongest Go player [3]
◮ AlphaStar becoming better than 99.8% of all StarCraft II players [4]
◮ Real-world, simple robotic manipulation tasks (with numerous limitations) [5]

SLIDE 5

Basics

Markov Decision Process. Figure taken from [6]

RL in a nutshell:
◮ Learning to map situations to actions
◮ Trial-and-error search
◮ Maximize a numerical reward signal

SLIDE 6

Reinforcement Learning fundamentals

◮ Reward r_t: a scalar
◮ State s_t ∈ S: a vector of observations
◮ Action a_t ∈ A: a vector of actions
◮ Policy π: a mapping from states to actions
◮ Action-value function Q^π(s_t, a_t): expected return for a state-action pair
Putting the "deep" in RL:
◮ How to deal with continuous spaces?
◮ Approximate the functions of states and actions (e.g. Q) with neural networks (see the sketch below)
◮ The approximator has a fixed, limited number of parameters
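
A minimal sketch (my illustration, not code from the talk) of such a function approximator in Python, using PyTorch; the layer sizes are assumptions:

```python
# Sketch: approximating Q^pi(s_t, a_t) with a small neural network,
# as is typical in deep RL for continuous state and action spaces.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # Maps a (state, action) pair to a single scalar Q-value.
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# Usage: q = QNetwork(state_dim=17, action_dim=6); q(s, a) has shape (batch, 1)
```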

SLIDE 7

On-policy versus off-policy learning

On-policy learning:
◮ Only one policy
◮ Exploitation versus exploration dilemma
◮ Optimizes the same policy that collects the data
◮ Very data-hungry
Off-policy learning:
◮ Employs multiple policies
◮ One collects the data, another becomes the final policy
◮ We can save and reuse past experiences (sketched below)
◮ More suitable for robotics
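
Reusing past experiences is typically implemented with a replay buffer. A minimal sketch (an illustration, not from the talk):

```python
# Sketch: a replay buffer, the mechanism that lets off-policy methods
# learn from data collected by earlier versions of the policy.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        # Store one transition, regardless of which policy generated it.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniformly re-sample past experience; on-policy methods cannot do this.
        return random.sample(self.buffer, batch_size)
```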

SLIDE 8

Model-based versus model-free methods

Model-based methods:
◮ Learn a model of the environment
◮ Choose actions by planning with the learned model
◮ "Think, then act"
◮ Statistically efficient, but the model is often too complex to learn
Model-free methods:
◮ Directly learn the Q-function by sampling from the environment (sketched below)
◮ No planning possible
◮ Can produce the same optimal policy as model-based methods
◮ More suitable for robotics
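
To illustrate the model-free idea, here is a minimal sketch of a tabular one-step Q-learning update; the Q-table layout and NUM_ACTIONS are hypothetical:

```python
# Sketch: model-free TD update. The Q-function is learned directly from
# sampled transitions; no model of the environment is ever built.
NUM_ACTIONS = 4  # hypothetical discrete action count

def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Q is a dict mapping (state, action) -> estimated value."""
    best_next = 0.0 if done else max(
        Q.get((next_state, a), 0.0) for a in range(NUM_ACTIONS))
    td_target = reward + gamma * best_next             # bootstrapped target
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
```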

SLIDE 9

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 10

Data inefficiency

RL algorithms are notoriously data-hungry:
◮ Not a big problem in simulated settings
◮ Impractical amounts of training time in the real world
◮ Wear and tear on the robot must be minimized
◮ Need for sample-efficient methods
Off-policy methods are better suited, due to their higher sample efficiency.

SLIDE 11

Safe exploration

RL is trial-and-error search:
◮ Again, no problem in simulation
◮ Randomly applying force to the motors of an expensive robot is problematic
◮ Could lead to the destruction of the robot
◮ Need for safety measures during exploration
Possible solutions: limit the maximum allowed velocity per joint and enforce position limits for joints [7] (see the sketch below)
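
Such limits can be enforced by filtering actions before they reach the robot. A minimal sketch in the spirit of [7]; the function and parameter names are assumptions:

```python
# Sketch: clamp commanded joint velocities and block motion past position limits.
import numpy as np

def safe_action(action, joint_pos, max_vel, pos_low, pos_high):
    """Clip a joint-velocity command before sending it to the motors."""
    action = np.clip(action, -max_vel, max_vel)        # velocity limit per joint
    # Zero out velocities that would push a joint past its position limit.
    at_upper = (joint_pos >= pos_high) & (action > 0)
    at_lower = (joint_pos <= pos_low) & (action < 0)
    action[at_upper | at_lower] = 0.0
    return action
```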

SLIDE 12

Sparse rewards

The classic reward is a binary measure:
◮ The robot might never complete a complex task, and thus never observes a reward
◮ No variance in the reward function, so no learning is possible
◮ Need for a manually designed reward function (reward engineering)
◮ Need for a designated state representation, against the principle of RL
◮ Not a trivial problem: manually designed reward functions are often exploited in unforeseen ways

SLIDE 13

Reality Gap

Why not train in simulation?
◮ Simulations are still imperfect
◮ Many (small) dynamics of the environment remain uncaptured
◮ The policy will likely not generalize to the real world
◮ Active research field, e.g. automatic domain randomization (sketched below)
Training in simulation is more attractive, but the resulting policy is often not directly applicable in the real world.
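
Domain randomization can be sketched as re-sampling simulator parameters every episode, so the policy cannot overfit to one imperfect simulation. The simulator attributes below are hypothetical placeholders:

```python
# Sketch: randomize physics parameters at the start of each training episode.
import random

def randomize_sim(sim):
    sim.friction = random.uniform(0.5, 1.5)          # around a nominal 1.0
    sim.link_mass = random.uniform(0.8, 1.2)         # +-20% of nominal mass
    sim.motor_strength = random.uniform(0.9, 1.1)
    sim.sensor_noise_std = random.uniform(0.0, 0.05)
    return sim
```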

SLIDE 14

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 15

Soft actor-critic algorithm

Soft actor-critic by Haarnoja et al.:
◮ Original version, early 2018: temperature hyperparameter [8]
◮ Refined version, late 2018: workaround for the critical hyperparameter [9]
◮ Developed in cooperation between UC Berkeley and Google Brain
◮ Off-policy, model-free, actor-critic method
◮ Key idea: exploit the entropy of the policy
◮ "Succeed at the task while acting as randomly as possible" [9]

SLIDE 16

Soft actor-critic algorithm

Classical reinforcement learning objective:
◮ Σ_t E_(s_t, a_t)∼ρ_π [ r(s_t, a_t) ]
◮ Find the π(a_t|s_t) that maximizes the sum of rewards
SAC objective:
◮ π* = argmax_π Σ_t E_(s_t, a_t)∼ρ_π [ r(s_t, a_t) + α H(π(·|s_t)) ]
◮ Augments the classical objective with an entropy regularization term H
◮ Problematic temperature hyperparameter α
◮ Instead, treat entropy as a constraint and update α automatically during learning (see the sketch below)
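
The automatic temperature update can be sketched as one gradient step on α per training batch, following the constrained formulation in [9]; the learning rate and target entropy value here are assumptions:

```python
# Sketch: learn the temperature alpha instead of hand-tuning it.
import torch

log_alpha = torch.zeros(1, requires_grad=True)  # optimize log(alpha), keeping alpha > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -6.0                           # common heuristic: -|A|, here 6 action dims

def update_alpha(log_prob: torch.Tensor) -> torch.Tensor:
    """One step on alpha, given log pi(a_t|s_t) of actions sampled from the policy."""
    # Alpha grows when policy entropy (-log_prob) drops below the target,
    # strengthening the entropy bonus, and shrinks otherwise.
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().detach()             # current alpha for the SAC losses
```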

SLIDE 17

Advantages of using entropy

Some advantages of the maximum entropy objective:
◮ The policy explores more widely
◮ Learns multiple modes of near-optimal behavior, making it more robust
◮ Significantly speeds up learning

SLIDE 18

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 19

Dexterous hand manipulation

[9]

◮ 3-finger hand with 9 degrees of freedom
◮ Goal: rotate a valve into a target position
◮ Learns directly from RGB images via CNN features
◮ Challenging due to the complex hand and end-to-end perception
◮ 20 hours of real-world training

SLIDE 20

Dexterous hand manipulation

[9]

Alternative mode:
◮ Use the valve position directly (instead of raw images)
◮ 3 hours of real-world training
◮ Substantially faster than the competition on the same task (PPO: 7.4 hours [10])

SLIDE 21

Dexterous hand manipulation

[11]

SLIDE 22

Simulated Benchmark

Comparison of SAC against other state-of-the-art algorithms:
◮ DDPG (2015): off-policy, model-free, sample-efficient [12]
◮ TD3 (2018): extension of DDPG [13]
◮ PPO (2017): on-policy (relatively efficient), model-free [14]
Simple and complex environments:
◮ Hopper-v2 (2D), Walker2d-v2 (2D), HalfCheetah-v2 (2D), Ant-v2 (3D)
◮ Humanoid-v2 (3D), Humanoid (rllab, 3D)

SLIDE 23

Simulated Benchmark

[Figure: learning curves (average return over millions of environment steps) on Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, Humanoid-v2, and Humanoid (rllab), comparing SAC (learned temperature), SAC (fixed temperature), DDPG, TD3, and PPO. Taken from [9]]

◮ Comparable to the baselines on simple tasks
◮ Exceeds the baselines on challenging tasks

SLIDE 24

Progress

  • 1. Motivation and basics
  • 2. Challenges in DRL
  • 3. Soft actor-critic algorithm
  • 4. Results and Discussion
  • 5. Conclusion

SLIDE 25

Wrap-up & Conclusion

Soft actor-critic in a nutshell:
◮ Off-policy (higher sample efficiency)
◮ Model-free (almost a necessity for real-world robotics)
◮ Training in simulation would be preferable, but is still problematic
◮ Exploits the entropy framework
Take-away:
◮ Can learn directly in the real world
◮ Can learn from raw sensory input (end-to-end)
◮ Entropy significantly speeds up learning
◮ Comparable to the state of the art on simple tasks
◮ Exceeds the state of the art on complex tasks

SLIDE 26

Question time

Thanks for your attention :)

SLIDE 28

References

[1] Xue Bin Peng et al. "DeepMimic". In: ACM Transactions on Graphics 37.4 (July 2018), pp. 1–14. ISSN: 0730-0301. DOI: 10.1145/3197517.3201311. URL: http://dx.doi.org/10.1145/3197517.3201311.

[2] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. ISSN: 0028-0836. URL: http://dx.doi.org/10.1038/nature14236.

[3] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550.7676 (2017), pp. 354–359.

[4] Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575.7782 (2019), pp. 350–354.

SLIDE 29

References (cont.)

[5] Shixiang Gu et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates". In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.

[6] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018. URL: http://incompleteideas.net/book/the-book-2nd.html.

[7] S. Gu et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates". In: 2017 IEEE International Conference on Robotics and Automation (ICRA). May 2017, pp. 3389–3396. DOI: 10.1109/ICRA.2017.7989385.

SLIDE 30

References (cont.)

[8] Tuomas Haarnoja et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor". In: arXiv preprint arXiv:1801.01290 (2018).

[9] Tuomas Haarnoja et al. "Soft actor-critic algorithms and applications". In: arXiv preprint arXiv:1812.05905 (2018).

[10] Henry Zhu et al. "Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost". In: 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3651–3657.

[11] Soft Actor-Critic Project Website. https://sites.google.com/view/sac-and-applications. Accessed: 2020-01-05.

SLIDE 31

References (cont.)

[12] Timothy P. Lillicrap et al. "Continuous control with deep reinforcement learning". In: arXiv preprint arXiv:1509.02971 (2015).

[13] Scott Fujimoto, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods". In: arXiv preprint arXiv:1802.09477 (2018).

[14] John Schulman et al. "Proximal policy optimization algorithms". In: arXiv preprint arXiv:1707.06347 (2017).

SLIDE 32

Value-based versus policy-based methods

So far, value-based methods:
◮ Learn a value function (Q)
◮ Select actions based on the learned value function
◮ Policies depend strongly on the value function
Alternatively, policy-based methods:
◮ Learn a parameterized policy
◮ No value function required; use the total reward obtained after each action
◮ Can deal with continuous state and action spaces
◮ However, requires complete episodes (Monte Carlo; see the sketch below)
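
A minimal sketch (my illustration) of such a policy-based update is REINFORCE, which needs only the log-probabilities of the chosen actions and the returns of a complete episode:

```python
# Sketch: Monte-Carlo policy gradient (REINFORCE); no value function involved.
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t|s_t) tensors; rewards: list of r_t for one episode."""
    returns, G = [], 0.0
    for r in reversed(rewards):                  # discounted return-to-go
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Gradient ascent on sum_t G_t * log pi(a_t|s_t) (so minimize the negative).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```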

SLIDE 33

Actor-critic methods

Why not use both?
◮ Learn a policy (the actor)
◮ Learn a value function (the critic), approximating the true value function
◮ Basis for most recent RL algorithms
At each time step (TD approach):
◮ Adjust the critic to fit the value function
◮ Update the actor toward the new critic (see the sketch below)
◮ This is the classical generalized policy iteration (GPI) scheme
◮ Not possible for purely policy-based methods
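
A minimal sketch (my illustration, in the style of deterministic actor-critic methods such as DDPG [12], not SAC itself) of one such TD-style step:

```python
# Sketch: fit the critic to a bootstrapped TD target, then update the actor
# to maximize the current critic, one transition batch at a time.
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    # 1) Critic: regress Q(s, a) toward the one-step TD target.
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * critic(next_state, actor(next_state))
    critic_loss = F.mse_loss(critic(state, action), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 2) Actor: maximize the critic's value of the actor's own action.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```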

SLIDE 34

Quadrupedal locomotion

Learning quadrupedal walking gaits:
◮ Learning directly in the real world
◮ Some reward engineering required
◮ Walking learned within 2 hours of training
◮ First example of DRL for quadrupedal locomotion without any pretraining
◮ SAC policies are robust and generalize well to unseen environments

SLIDE 35

Quadrupedal locomotion

[11]

SLIDE 36

Quadrupedal locomotion

[11]

SLIDE 37

Dexterous hand manipulation

[11]
