

SLIDE 1

Optimal Control and Planning

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2

Today’s Lecture

  • 1. Introduction to model-based reinforcement learning
  • 2. What if we know the dynamics? How can we make decisions?
  • 3. Stochastic optimization methods
  • 4. Monte Carlo tree search (MCTS)
  • 5. Trajectory optimization
  • Goals:
  • Understand how we can perform planning with known dynamics models in discrete and continuous spaces

SLIDE 3

Recap: the reinforcement learning objective
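The slide shows the standard RL objective; as a reminder, in the course's notation:

```latex
\theta^{\star} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) \right],
\qquad
p_{\theta}(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t) \, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)
```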

SLIDE 4

Recap: model-free reinforcement learning

assume the transition dynamics are unknown; don’t even attempt to learn them

SLIDE 5

What if we knew the transition dynamics?

  • Often we do know the dynamics

1. Games (e.g., Atari games, chess, Go)
2. Easily modeled systems (e.g., navigating a car)
3. Simulated environments (e.g., simulated robots, video games)

  • Often we can learn the dynamics

1. System identification – fit unknown parameters of a known model
2. Learning – fit a general-purpose model to observed transition data

Does knowing the dynamics make things easier? Often, yes!

SLIDE 6
Model-based reinforcement learning

  • 1. Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions
  • 2. Today: how can we make decisions if we know the dynamics?
      a. How can we choose actions under perfect knowledge of the system dynamics?
      b. Optimal control, trajectory optimization, planning
  • 3. Next week: how can we learn unknown dynamics?
  • 4. How can we then also learn policies? (e.g. by imitating optimal control)

[Figure: diagram of the loop between the policy and the system dynamics]

SLIDE 7

The objective

[Figure: a tiger; which action should you choose?]

  • 1. run away
  • 2. ignore
  • 3. pet
SLIDE 8

The deterministic case
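With known deterministic dynamics, planning reduces to an ordinary optimization problem over the action sequence; a reconstruction of the slide's formula:

```latex
\mathbf{a}_1, \ldots, \mathbf{a}_T = \arg\max_{\mathbf{a}_1, \ldots, \mathbf{a}_T} \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)
\quad \text{s.t.} \quad \mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{a}_t)
```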

SLIDE 9

The stochastic open-loop case

why is this suboptimal? Because the whole action sequence is committed before any outcomes are observed, the plan cannot react to how the stochastic dynamics actually unfold.
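A reconstruction of the open-loop formulation in the lecture's notation: the state sequence is distributed according to the known dynamics, and we optimize the expected return over a fixed action sequence.

```latex
p(\mathbf{s}_1, \ldots, \mathbf{s}_T \mid \mathbf{a}_1, \ldots, \mathbf{a}_T) = p(\mathbf{s}_1) \prod_{t=1}^{T-1} p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)
\qquad
\mathbf{a}_1, \ldots, \mathbf{a}_T = \arg\max_{\mathbf{a}_1, \ldots, \mathbf{a}_T} \mathbb{E}\!\left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) \,\middle|\, \mathbf{a}_1, \ldots, \mathbf{a}_T \right]
```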

SLIDE 10

Aside: terminology

what is this “loop”?

  • closed-loop: the agent receives state feedback at every time step, so it can keep reacting
  • open-loop: the action sequence is only sent at t = 1, then it’s one-way!

SLIDE 11

The stochastic closed-loop case

(more on this later)
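In the closed-loop setting we optimize over a policy rather than a fixed action sequence; a reconstruction of the objective:

```latex
\pi = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim p(\tau)} \left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) \right]
```

The form of the policy is the open question here: it could be global (e.g., a neural network) or local (e.g., the time-varying linear controllers that LQR produces later in the lecture).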

SLIDE 12

Open-Loop Planning

SLIDE 13

But for now, open-loop planning

SLIDE 14

Stochastic optimization

simplest method: guess & check “random shooting method”
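A minimal sketch of the random shooting method, assuming a known deterministic model `dynamics(s, a) -> s_next` and a reward function `reward(s, a)`; both are hypothetical stand-ins, not lecture code:

```python
import numpy as np

def random_shooting(dynamics, reward, s0, horizon, action_dim,
                    action_low, action_high, n_samples=1000):
    """Guess & check: sample random action sequences, keep the best one."""
    best_return, best_plan = -np.inf, None
    for _ in range(n_samples):
        plan = np.random.uniform(action_low, action_high,
                                 size=(horizon, action_dim))
        s, total = s0, 0.0
        for a in plan:                    # evaluate the sequence under the model
            total += reward(s, a)
            s = dynamics(s, a)
        if total > best_return:
            best_return, best_plan = total, plan
    return best_plan
```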

SLIDE 15

Cross-entropy method (CEM)

can we do better?

typically uses a Gaussian distribution; see also CMA-ES (sort of like CEM with momentum)
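A minimal CEM sketch under the same assumptions as the random shooting example (hypothetical `dynamics`/`reward` callables): sample action sequences from a Gaussian, keep the elites, and refit the Gaussian to them.

```python
import numpy as np

def cem_plan(dynamics, reward, s0, horizon, action_dim,
             n_samples=500, n_elite=50, n_iters=5):
    """Cross-entropy method over open-loop action sequences."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    def rollout_return(plan):
        s, total = s0, 0.0
        for a in plan:
            total += reward(s, a)
            s = dynamics(s, a)
        return total

    for _ in range(n_iters):
        # 1. sample action sequences from the current Gaussian
        plans = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # 2. score them under the model and keep the elite fraction
        scores = np.array([rollout_return(p) for p in plans])
        elites = plans[np.argsort(scores)[-n_elite:]]
        # 3. refit the Gaussian to the elites
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```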

SLIDE 16

What’s the problem?

  • 1. Very harsh dimensionality limit
  • 2. Only open-loop planning

What’s the upside?

  • 1. Very fast if parallelized
  • 2. Extremely simple
SLIDE 17

Discrete case: Monte Carlo tree search (MCTS)

SLIDE 18

Discrete case: Monte Carlo tree search (MCTS)

evaluate leaves with a DefaultPolicy, e.g., a random policy

SLIDE 19

Discrete case: Monte Carlo tree search (MCTS)

[Figure: two rollouts from the root yield returns +10 and +15]

SLIDE 20

Discrete case: Monte Carlo tree search (MCTS)

[Figure: search trees after three rollouts; each node is annotated with its accumulated value Q and visit count N, e.g., a root with Q = 38, N = 3 whose children have Q = 22, N = 2 and Q = 16, N = 1]
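The scheme on these slides (a TreePolicy selects a leaf, a DefaultPolicy evaluates it, then Q and N are backed up along the path) can be sketched as a minimal UCT-style planner. This is an illustration of the idea, not the lecture's code; `model_step(s, a) -> (s_next, r)` is an assumed known dynamics model.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> Node
        self.Q, self.N = 0.0, 0       # accumulated return, visit count

def uct(node, c=1.0):
    # Score(s_t) = average value plus an exploration bonus (UCT)
    return node.Q / node.N + c * math.sqrt(2 * math.log(node.parent.N) / node.N)

def mcts_plan(root_state, actions, model_step, n_iters=1000, depth=25):
    root = Node(root_state)
    for _ in range(n_iters):
        # 1. TreePolicy: descend through fully expanded nodes via UCT scores
        node = root
        while node.N > 0 and len(node.children) == len(actions):
            node = max(node.children.values(), key=uct)
        ret = 0.0
        # 2. expand one untried action at the selected node
        if node.N > 0:
            a = random.choice([a for a in actions if a not in node.children])
            s_next, r = model_step(node.state, a)
            ret += r
            node.children[a] = Node(s_next, parent=node)
            node = node.children[a]
        # 3. DefaultPolicy: random rollout to estimate the leaf's value
        s = node.state
        for _ in range(depth):
            s, r = model_step(s, random.choice(actions))
            ret += r
        # 4. back up Q and N from the leaf to the root
        while node is not None:
            node.N += 1
            node.Q += ret
            node = node.parent
    # take the best action from the root (highest average value)
    return max(root.children, key=lambda a: root.children[a].Q / root.children[a].N)
```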

SLIDE 21

Additional reading

  • 1. Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
  • Survey of MCTS methods and basic summary.
SLIDE 22

Trajectory Optimization with Derivatives

SLIDE 23

Can we use derivatives?
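Here the lecture switches to control notation: states x, actions (controls) u, and a cost c to minimize. Roughly, the trajectory optimization problem is:

```latex
\min_{\mathbf{u}_1, \ldots, \mathbf{u}_T} \; \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)
\quad \text{s.t.} \quad \mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1})
```

In principle one can compute the derivatives of f and c and apply first-order methods (backpropagation through the dynamics), but plain gradient descent tends to work poorly here, which motivates the second-order LQR-style methods below.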

SLIDE 24

Shooting methods vs collocation

shooting method: optimize over actions only

SLIDE 25

Shooting methods vs collocation

collocation method: optimize over actions and states, with constraints
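A reconstruction of the contrast on these two slides: shooting substitutes the dynamics into the objective, so only the actions remain as variables; collocation keeps the states as variables and enforces the dynamics as constraints.

```latex
\text{shooting:} \quad \min_{\mathbf{u}_1, \ldots, \mathbf{u}_T} \; c(\mathbf{x}_1, \mathbf{u}_1) + c(f(\mathbf{x}_1, \mathbf{u}_1), \mathbf{u}_2) + \cdots + c(f(f(\cdots)\cdots), \mathbf{u}_T)
```

```latex
\text{collocation:} \quad \min_{\mathbf{u}_1, \ldots, \mathbf{u}_T, \, \mathbf{x}_1, \ldots, \mathbf{x}_T} \; \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)
\quad \text{s.t.} \quad \mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1})
```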

SLIDE 26

Linear case: LQR

linear dynamics, quadratic cost
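A hedged reconstruction of the slide's setup in the lecture's notation:

```latex
f(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t
\qquad
c(\mathbf{x}_t, \mathbf{u}_t) = \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{C}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{c}_t
```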

SLIDE 27

Linear case: LQR

SLIDE 28

Linear case: LQR

SLIDE 29

Linear case: LQR

the optimal action u_t is linear in x_t; the value function V(x_t) is quadratic

SLIDE 30

Linear case: LQR

the optimal action u_t is linear in x_t; the value function V(x_t) is quadratic

SLIDE 31

Linear case: LQR

SLIDE 32

Linear case: LQR
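The backward recursion spelled out across these slides, as a minimal NumPy sketch (my indexing and variable names, with V_{T+1} = 0; a hedged reconstruction, not the course's reference code):

```python
import numpy as np

def lqr_backward(F, f, C, c, T, n_x):
    """Backward pass for time-varying LQR.

    Dynamics:  x_{t+1} = F[t] @ [x_t; u_t] + f[t]
    Cost:      0.5 * [x;u]^T C[t] [x;u] + [x;u]^T c[t]
    Returns gains K_t and offsets k_t such that u_t = K[t] @ x_t + k[t].
    """
    V, v = np.zeros((n_x, n_x)), np.zeros(n_x)     # value function beyond T is zero
    K, k = [None] * T, [None] * T
    for t in reversed(range(T)):
        # Q-function: cost-to-go as a quadratic in the joint vector [x_t; u_t]
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Qxx, Qxu = Q[:n_x, :n_x], Q[:n_x, n_x:]
        Qux, Quu = Q[n_x:, :n_x], Q[n_x:, n_x:]
        qx, qu = q[:n_x], q[n_x:]
        # Minimize over u_t: the optimal action is linear in the state
        Quu_inv = np.linalg.inv(Quu)
        K[t] = -Quu_inv @ Qux
        k[t] = -Quu_inv @ qu
        # Plug the optimal action back in: V(x_t) stays quadratic
        V = Qxx + Qxu @ K[t] + K[t].T @ Qux + K[t].T @ Quu @ K[t]
        v = qx + Qxu @ k[t] + K[t].T @ qu + K[t].T @ Quu @ k[t]
    return K, k
```

The forward pass then sets u_t = K[t] @ x_t + k[t] and steps x_{t+1} = F[t] @ np.concatenate([x_t, u_t]) + f[t].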

SLIDE 33

LQR for Stochastic and Nonlinear Systems

SLIDE 34

Stochastic dynamics
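The point of this slide, a standard LQG fact: if the dynamics are Gaussian with a state-independent covariance,

```latex
p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t) = \mathcal{N}\!\left( \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t, \; \Sigma_t \right)
```

then the algorithm is unchanged: the same controller u_t = K_t x_t + k_t from the deterministic backward pass is optimal, and Σ_t can be ignored due to the symmetry of Gaussians.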

SLIDE 35

The stochastic closed-loop case

SLIDE 36

Nonlinear case: DDP/iterative LQR
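For nonlinear dynamics, iLQR/DDP repeatedly approximates the problem as an LQR problem around the current nominal trajectory (x̂_t, û_t). A reconstruction of the expansions in the lecture's notation, with δx_t = x_t − x̂_t and δu_t = u_t − û_t:

```latex
f(\mathbf{x}_t, \mathbf{u}_t) \approx f(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)
  + \nabla_{\mathbf{x}_t, \mathbf{u}_t} f(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)
    \begin{bmatrix} \delta \mathbf{x}_t \\ \delta \mathbf{u}_t \end{bmatrix}
\qquad
c(\mathbf{x}_t, \mathbf{u}_t) \approx c(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)
  + \nabla c^T \begin{bmatrix} \delta \mathbf{x}_t \\ \delta \mathbf{u}_t \end{bmatrix}
  + \frac{1}{2} \begin{bmatrix} \delta \mathbf{x}_t \\ \delta \mathbf{u}_t \end{bmatrix}^T
    \nabla^2 c \begin{bmatrix} \delta \mathbf{x}_t \\ \delta \mathbf{u}_t \end{bmatrix}
```

Running LQR on the deviations (δx_t, δu_t) then yields an improved trajectory, and the process repeats.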

SLIDE 37

Nonlinear case: DDP/iterative LQR

SLIDE 38

Nonlinear case: DDP/iterative LQR

SLIDE 39

Nonlinear case: DDP/iterative LQR

SLIDE 40

Nonlinear case: DDP/iterative LQR

SLIDE 41

Nonlinear case: DDP/iterative LQR
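Putting the pieces together, a sketch of the iLQR loop from these slides: linearize and quadratize around the nominal trajectory, run the LQR backward pass on the deviations, then do a forward pass with a line search over the open-loop term. `step`, `cost`, `linearize`, and `quadratize` are hypothetical user-supplied callables (e.g., via autodiff or finite differences), and `lqr_backward` is the sketch from the LQR section above.

```python
import numpy as np

def rollout(step, x0, U):
    """Roll the nominal actions through the model; returns x_0 ... x_{T-1}."""
    X = [x0]
    for u in U[:-1]:
        X.append(step(X[-1], u))
    return np.array(X)

def traj_cost(cost, X, U):
    return sum(cost(x, u) for x, u in zip(X, U))

def ilqr(step, cost, linearize, quadratize, x0, U, n_iters=20):
    """linearize(x, u) -> F_t (Jacobian of f); quadratize(x, u) -> (C_t, c_t)
    (Hessian and gradient of the cost in the joint vector [x; u])."""
    T, n_x = len(U), len(x0)
    X = rollout(step, x0, U)
    for _ in range(n_iters):
        # 1. expand dynamics and cost around the nominal trajectory (X, U);
        #    in delta coordinates the dynamics offset f_t is zero
        F = [linearize(X[t], U[t]) for t in range(T)]
        f = [np.zeros(n_x) for _ in range(T)]
        CC = [quadratize(X[t], U[t]) for t in range(T)]
        C, c = [q[0] for q in CC], [q[1] for q in CC]
        # 2. backward pass on the deviations (delta x, delta u)
        K, k = lqr_backward(F, f, C, c, T, n_x)
        # 3. forward pass with a line search over alpha on the open-loop term
        for alpha in (1.0, 0.5, 0.25, 0.1):
            X_new, U_new = [x0], []
            for t in range(T):
                du = alpha * k[t] + K[t] @ (X_new[t] - X[t])
                U_new.append(U[t] + du)
                X_new.append(step(X_new[t], U_new[t]))
            X_new = np.array(X_new[:-1])     # keep x_0 ... x_{T-1}
            if traj_cost(cost, X_new, U_new) < traj_cost(cost, X, U):
                X, U = X_new, np.array(U_new)
                break
    return U, K, k
```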

SLIDE 42

Case Study and Additional Readings

SLIDE 43

Case study: nonlinear model-predictive control
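The model-predictive control recipe behind this case study, plan over a short horizon, execute only the first action, then replan from the newly observed state, fits in a few lines. A sketch reusing the hypothetical `cem_plan` from the CEM section (any open-loop planner works); `env_step` stands in for the real system:

```python
def mpc(env_step, dynamics, reward, s, horizon=15, action_dim=2, n_steps=200):
    """Replan at every step: closed-loop behavior from an open-loop planner."""
    for _ in range(n_steps):
        plan = cem_plan(dynamics, reward, s, horizon, action_dim)
        s = env_step(s, plan[0])   # execute only the first planned action
    return s
```

Replanning at every step is what makes MPC robust to model errors and disturbances, even though each individual plan is open-loop.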

SLIDE 44
SLIDE 45

Additional reading

  • 1. Mayne, Jacobson. (1970). Differential dynamic programming.
  • Original differential dynamic programming algorithm.
  • 2. Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
  • Practical guide for implementing non-linear iterative LQR.
  • 3. Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
  • Probabilistic formulation and trust region alternative to deterministic line search.
SLIDE 46

What’s wrong with known dynamics?

Next time: learning the dynamics model