

SLIDE 1

Model-Based Policy Learning

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2

Last time: model-based RL with MPC

Diagram: replan with MPC at every time step; collect data and refit the dynamics model every N steps.

SLIDE 3

The stochastic open-loop case

why is this suboptimal?

SLIDE 4

The stochastic closed-loop case

SLIDE 5

Backpropagate directly into the policy?

Diagram: backpropagate through the learned dynamics at every time step. Easy for deterministic policies, but also possible for a stochastic policy (e.g., via the reparameterization trick).

SLIDE 6

What’s the problem with backprop into policy?

Diagram: the gradient is backpropagated through every step of the trajectory, so some time steps get big gradients and others get small gradients.

SLIDE 7

What’s the problem with backprop into policy?


SLIDE 8

What’s the problem with backprop into policy?

  • Similar parameter sensitivity problems as shooting methods
  • But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps, so no dynamic programming
  • Similar problems to training long RNNs with BPTT: vanishing and exploding gradients (see the numerical sketch below)
  • Unlike an LSTM, we can’t just “choose” a simple dynamics; the dynamics are chosen by nature
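A minimal numerical sketch of the BPTT analogy (my own illustration, not code from the lecture): the pathwise gradient multiplies one dynamics Jacobian per time step, so its magnitude shrinks or grows exponentially with the horizon depending on how contractive the dynamics are.

```python
# Illustration only: random per-step Jacobians standing in for ds_{t+1}/ds_t.
import numpy as np

rng = np.random.default_rng(0)
dim, horizon = 10, 50

for scale in (0.8, 1.2):  # roughly contracting vs. expanding dynamics
    grad = np.eye(dim)
    for _ in range(horizon):
        J = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)  # one step's Jacobian
        grad = J @ grad                                              # chain one more Jacobian
    print(f"scale={scale}: gradient norm after {horizon} steps = {np.linalg.norm(grad):.2e}")
```

With contractive dynamics the product collapses toward zero (vanishing gradients); with expanding dynamics it blows up (exploding gradients). Unlike an LSTM, we cannot redesign these Jacobians: they are set by the environment.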

SLIDE 9

What’s the solution?

  • Use derivative-free (“model-free”) RL algorithms, with the model used to generate synthetic samples
    • Seems weirdly backwards
    • Actually works very well
    • Essentially “model-based acceleration” for model-free RL
  • Use simpler policies than neural nets
    • LQR with learned models (LQR-FLM, fitted local models)
    • Train local policies to solve simple tasks
    • Combine them into global policies via supervised learning
SLIDE 10

Model-Free Learning With a Model

SLIDE 11

What’s the solution?

  • Use derivative-free (“model-free”) RL algorithms, with the model used to generate synthetic samples
    • Seems weirdly backwards
    • Actually works very well
    • Essentially “model-based acceleration” for model-free RL
  • Use simpler policies than neural nets
    • LQR with learned models (LQR-FLM, fitted local models)
    • Train local policies to solve simple tasks
    • Combine them into global policies via supervised learning
SLIDE 12

Model-free optimization with a model

  • Policy gradient might be more stable (if enough samples are used) because it does not require multiplying many Jacobians
  • For a recent analysis, see Parmas et al. ’18, “PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos”

Policy gradient vs. backprop (pathwise) gradient:
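The two expressions compared on the slide are the likelihood-ratio policy gradient and the pathwise gradient obtained by backpropagating through the learned dynamics; the equations below are a reconstruction in standard course notation rather than a verbatim copy of the slide.

```latex
% Likelihood-ratio (REINFORCE-style) policy gradient: no dynamics Jacobians appear.
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
    \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \, \hat{Q}_{i,t}

% Backprop (pathwise) gradient, written schematically to emphasize the chain of
% per-step Jacobians that makes it ill-conditioned over long horizons.
\nabla_\theta J(\theta) \;\sim\; \sum_{t=1}^{T} \frac{d r_t}{d \mathbf{s}_t}
    \prod_{t'=2}^{t} \frac{d \mathbf{s}_{t'}}{d \mathbf{a}_{t'-1}} \,
    \frac{d \mathbf{a}_{t'-1}}{d \mathbf{s}_{t'-1}}
```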

SLIDE 13

Model-free optimization with a model

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.

Dyna

  • Online Q-learning algorithm that performs model-free RL with a model
SLIDE 14

General “Dyna-style” model-based RL recipe

+ only requires short (as few as one step) rollouts from the model
+ still sees diverse states (rollouts branch from real states in the buffer)
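A minimal sketch of the Dyna-style recipe above. All names (dynamics_model, model_free_update, replay_buffer, the env interface) are illustrative placeholders, not any particular library's API.

```python
# Hedged sketch of a general "Dyna-style" model-based RL loop.
def dyna_style_mbrl(env, policy, dynamics_model, model_free_update, replay_buffer,
                    num_iters=1000, model_rollout_len=1, model_batch=64):
    for _ in range(num_iters):
        # 1. Collect real experience with the current policy and store it.
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)          # hypothetical env interface
            replay_buffer.add(s, a, r, s_next)
            s = s_next

        # 2. Fit / update the dynamics model on real data.
        dynamics_model.fit(replay_buffer)

        # 3. Generate short synthetic rollouts branching from real buffer states,
        #    so the model only needs to be accurate for a few steps, while the
        #    learner still sees a diverse set of starting states.
        for s in replay_buffer.sample_states(model_batch):
            for _ in range(model_rollout_len):
                a = policy(s)
                s_next, r = dynamics_model.predict(s, a)
                replay_buffer.add(s, a, r, s_next)  # synthetic transition
                s = s_next

        # 4. Run a model-free update (e.g., Q-learning) on the mixed real + synthetic data.
        model_free_update(replay_buffer)
```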

SLIDE 15

Model-Based Acceleration (MBA), Model-Based Value Expansion (MVE), Model-Based Policy Optimization (MBPO)

Gu et al. Continuous deep Q-learning with model-based acceleration. ’16
Feinberg et al. Model-based value expansion. ’18
Janner et al. When to trust your model: model-based policy optimization. ’19

+ why is this a good idea?
− why is this a bad idea?
SLIDE 16

Local Models

SLIDE 17

What’s the solution?

  • Use derivative-free (“model-free”) RL algorithms, with the model used to generate synthetic samples
    • Seems weirdly backwards
    • Actually works very well
    • Essentially “model-based acceleration” for model-free RL
  • Use simpler policies than neural nets
    • LQR with learned models (LQR-FLM, fitted local models)
    • Train local policies to solve simple tasks
    • Combine them into global policies via supervised learning
SLIDE 18

What’s the solution?

  • Use derivative-free (“model-free”) RL algorithms, with the model used to generate synthetic samples
    • Seems weirdly backwards
    • Actually works very well
    • Essentially “model-based acceleration” for model-free RL
  • Use simpler policies than neural nets
    • LQR with learned models (LQR-FLM, fitted local models)
    • Train local policies to solve simple tasks
    • Combine them into global policies via supervised learning
SLIDE 19

Local models

SLIDE 20

Local models

SLIDE 21

Local models

SLIDE 22

What controller to execute?

SLIDE 23

Local models

SLIDE 24

How to fit the dynamics?

SLIDE 25

What if we go too far?

SLIDE 26

How to stay close to old controller?

For details, see: “Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics”
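One way to formalize “staying close”, as in the paper cited above, is to constrain the new controller’s trajectory distribution to lie within a KL-divergence bound of the old one; the notation below is reconstructed rather than copied from the slide.

```latex
% Trajectory optimization with a trust region around the previous controller \bar{p}:
\min_{p(\tau)} \; \mathbb{E}_{p(\tau)}\!\left[ \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t) \right]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left( p(\tau) \,\|\, \bar{p}(\tau) \right) \le \epsilon
```

Keeping the new trajectory distribution close to the old one ensures the fitted local linear dynamics remain approximately valid for the states the new controller visits.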

SLIDE 27
SLIDE 28

Global Policies from Local Models

SLIDE 29

What’s the solution?

  • Use derivative-free (“model-free”) RL algorithms, with the model used to generate synthetic samples
    • Seems weirdly backwards
    • Actually works very well
    • Essentially “model-based acceleration” for model-free RL
  • Use simpler policies than neural nets
    • LQR with learned models (LQR-FLM, fitted local models)
    • Train local policies to solve simple tasks
    • Combine them into global policies via supervised learning
SLIDE 30

What’s the solution?

  • Use derivative-free (“model-free”) RL algorithms, with the model used to generate synthetic samples
    • Seems weirdly backwards
    • Actually works very well
    • Essentially “model-based acceleration” for model-free RL
  • Use simpler policies than neural nets
    • LQR with learned models (LQR-FLM, fitted local models)
    • Train local policies to solve simple tasks
    • Combine them into global policies via supervised learning
SLIDE 31

Guided policy search: high-level idea

Diagram: alternate between trajectory-centric RL (producing local controllers) and supervised learning (training the global policy to imitate them).

SLIDE 32

Guided policy search: algorithm sketch

Diagram: the same alternation between trajectory-centric RL and supervised learning, written out as an algorithm (a rough sketch follows below).

For details, see: “End-to-End Training of Deep Visuomotor Policies”
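A rough sketch of the guided policy search loop, with placeholder method names (trajectory optimization details, surrogate costs, and the exact Lagrangian updates are in the cited paper).

```python
# Hedged sketch of the guided policy search alternation; not the authors' code.
def guided_policy_search(local_controllers, global_policy, costs, num_iters=50):
    lagrange_multipliers = [0.0 for _ in local_controllers]
    for _ in range(num_iters):
        # 1. Trajectory-centric RL: improve each local controller p_i(u_t | x_t)
        #    on a modified cost that also penalizes deviation from the global policy.
        trajectories = []
        for i, p_i in enumerate(local_controllers):
            p_i.optimize(costs[i], global_policy, lagrange_multipliers[i])
            trajectories += p_i.sample_trajectories()

        # 2. Supervised learning: train the global policy pi_theta to imitate the
        #    local controllers on their own samples.
        global_policy.fit(trajectories)

        # 3. Dual update: increase the penalty wherever the global policy and the
        #    local controllers still disagree, so they agree at convergence.
        for i, p_i in enumerate(local_controllers):
            lagrange_multipliers[i] += p_i.divergence_from(global_policy)
```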

SLIDE 33

Underlying principle: distillation

Slide adapted from G. Hinton; see also Hinton et al., “Distilling the Knowledge in a Neural Network”

  • Ensemble models: single models are often not the most robust; instead, train many models and average their predictions
  • This is how most ML competitions (e.g., Kaggle) are won, but it is very expensive at test time
  • Can we make a single model that is as good as an ensemble?
  • Distillation: train on the ensemble’s predictions as “soft” targets, produced by dividing the logits by a temperature before the softmax
  • Intuition: more knowledge in soft targets than in hard labels!
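A minimal numpy sketch of the distillation objective (an illustration, not code from the lecture): soften the ensemble’s logits with a temperature T and train the student on the resulting soft targets instead of one-hot labels.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T                                   # temperature-softened logits
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, ensemble_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student outputs."""
    soft_targets = softmax(ensemble_logits, T)
    student_log_probs = np.log(softmax(student_logits, T) + 1e-12)
    return -(soft_targets * student_log_probs).sum(axis=-1).mean()

# Example: soft targets carry more information than the hard label alone.
teacher = np.array([[5.0, 2.0, -1.0]])   # teacher prefers class 0 but "hints" at class 1
student = np.array([[1.0, 1.0, 1.0]])    # untrained student
print(softmax(teacher, T=2.0))           # soft targets, roughly [0.79, 0.18, 0.04]
print(distillation_loss(student, teacher))
```

Raising the temperature spreads probability mass across the non-argmax classes, exposing the relative similarities the ensemble has learned; the student gets a richer training signal than hard labels would provide.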

SLIDE 34

Distillation for Multi-Task Transfer

Parisotto et al. “Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning”

some other details (e.g., feature regression objective) – see paper

Training is just supervised learning (distillation); analogous to guided policy search, but for multi-task learning.

SLIDE 35

Combining weak policies into a strong policy

Diagram: trajectory-centric RL trains local neural net policies, which are then combined into a single global policy via supervised learning.

For details, see: “Divide and Conquer Reinforcement Learning”

SLIDE 36

Readings: guided policy search & distillation

  • Levine*, Finn*, et al. End-to-End Training of Deep Visuomotor Policies. 2015.
  • Rusu et al. Policy Distillation. 2015.
  • Parisotto et al. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. 2015.
  • Ghosh et al. Divide-and-Conquer Reinforcement Learning. 2017.
  • Teh et al. Distral: Robust Multitask Reinforcement Learning. 2017.