SLIDE 1

Trajectory Optimization, Imitation Learning

Lecture 14

SLIDE 2

What will you take home today?

• Recap LQR
• Trajectory Optimization Paper
• Imitation Learning
• Supervised Learning
• DAgger

SLIDE 3

How to solve Optimal Control Problems?

SLIDE 4

Sequential Quadratic Programming

SLIDE 5

Example – Newton-Raphson Method
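The Newton-Raphson iteration on this slide can be sketched as follows; the target function f(x) = x² − 2 is an illustrative choice, not taken from the slides:

```python
# Hedged sketch of the Newton-Raphson method for solving f(x) = 0 by
# iterating x <- x - f(x) / f'(x). The example target f(x) = x^2 - 2
# is an illustrative choice, not from the lecture.

def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """Take Newton steps until |f(x)| < tol or max_iter is reached."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / df(x)
    return x

# Example: the positive root of x^2 - 2, i.e. sqrt(2), from x0 = 1.
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Convergence is quadratic near the root; in SQP the same Newton idea is applied to the optimality conditions of the control problem.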

SLIDE 6

Sequential Linear Quadratic Programming

SLIDE 7

SLQ Algorithm

SLIDE 8

Linear Dynamical Systems, Quadratic cost – Linear Quadratic Regulator (LQR)

SLIDE 9

Linear Dynamical Systems, Quadratic cost – Linear Quadratic Regulator (LQR)
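The finite-horizon LQR solution via the backward Riccati recursion can be sketched as follows; the double-integrator system and cost weights are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Hedged sketch: finite-horizon discrete-time LQR via the backward
# Riccati recursion. Minimizes sum_t x_t' Q x_t + u_t' R u_t subject
# to x_{t+1} = A x_t + B u_t. The A, B, Q, R below are illustrative.

def lqr_gains(A, B, Q, R, T):
    """Return feedback gains K_0..K_{T-1}, computed backward in time."""
    P = Q.copy()
    gains = []
    for _ in range(T):
        # K = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A' P (A - B K)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reverse into forward time order

# Double-integrator example: state = (position, velocity), input = force.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
Ks = lqr_gains(A, B, Q, R, T=100)

# Closed-loop rollout with u_t = -K_t x_t drives the state toward zero.
x = np.array([[1.0], [0.0]])
for K in Ks:
    x = A @ x + B @ (-K @ x)
```

The backward pass is the discrete-time analogue of integrating the Riccati equation; the forward rollout then applies the time-varying feedback law.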

SLIDE 10

Trajectory Optimization

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

What will you take home today?

• Recap LQR
• Trajectory Optimization Paper
• Imitation Learning
• Supervised Learning
• DAgger

SLIDE 15

Assumptions in Optimal Control

1. Known and/or simple system dynamics
2. Known cost function

SLIDE 16

What are approaches for unknown dynamics and/or cost?

1. Learning approaches
  • a. Reinforcement learning
    i. Model-based
    ii. Model-free
  • b. Imitation learning
    i. Imitate an expert policy

SLIDE 17

Learning to make single predictions versus a sequence of predictions

SLIDE 18

Running Example: Super Tux Cart from

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AIStats. 2011. https://www.youtube.com/watch?feature=oembed&v=V00npNnWzSU

SLIDE 19

Imitation Learning

1. Useful when dynamics and/or cost are unknown or complex
  • a. We don’t know what the next state will look like; hard to model.
  • b. We don’t know the cost-to-go for an action.

Definitions: C(s, a) is the expected immediate cost of taking action a in state s; Cπ(s) is the expected immediate cost of executing policy π in state s; the cost-to-go J(π) is the total cost of executing π over T steps.
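These quantities can be written compactly in the notation of Ross, Gordon, and Bagnell (2011), where d_π^t denotes the state distribution at time t when executing π:

```latex
% Notation following Ross, Gordon, Bagnell (2011): C(s,a) is the
% expected immediate cost, d_\pi^t the state distribution at time t
% under policy \pi, and T the task horizon.
C_\pi(s) = \mathbb{E}_{a \sim \pi(s)}\!\left[ C(s,a) \right],
\qquad
J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d_\pi^{t}}\!\left[ C_\pi(s) \right]
```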

SLIDE 20

Imitation learning – core idea

1. Idea: imitate expert trajectories!
  • a. Bound J for any cost function C based on how well π mimics the expert’s policy.

SLIDE 21

Imitation Learning by Classification

Algorithm from - A Course in Machine Learning by Hal Daumé III. Ch. 18
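Imitation learning by classification (behavioral cloning) can be sketched as follows; the 1-D expert, dynamics, and nearest-neighbor "classifier" are toy assumptions for illustration, not the algorithm as given in Daumé's text:

```python
import numpy as np

# Hedged sketch of imitation learning by classification (behavioral
# cloning): fit a classifier to expert state-action pairs, then execute
# it as the policy. The expert, dynamics, and classifier are toy choices.

np.random.seed(0)

def expert_action(s):
    """Toy expert: steer the state toward the origin."""
    return 0 if s > 0 else 1  # action 0 = step left, 1 = step right

def step(s, a):
    """Toy deterministic dynamics."""
    return s - 0.5 if a == 0 else s + 0.5

# 1. Collect a dataset of expert state-action pairs (supervised data).
states = np.random.uniform(-5, 5, size=200)
actions = np.array([expert_action(s) for s in states])

# 2. "Train" a classifier: here, 1-nearest-neighbor on the expert data.
def policy(s):
    return actions[np.argmin(np.abs(states - s))]

# 3. Roll out the learned policy from a state it was not trained on.
s = 4.0
for _ in range(20):
    s = step(s, policy(s))
```

Note that training treats the pairs as i.i.d. samples, which is exactly the assumption the next slides question.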

SLIDE 22

How well does Imitation Learning by Classification work?

1. Depends on:
  • a. How good the expert is.
  • b. How much error the classifier makes.
SLIDE 23

Running Example: Super Tux Cart

Figure from ‘Interactive Learning for Sequential Decisions and Predictions’ by Stephane Ross.

SLIDE 24

Learned behavior influences states and observations

When are data samples i.i.d.?

  • Challenge: since the system dynamics are assumed both unknown and complex, we cannot compute dπ and can only sample it by executing π in the system.
  • This makes it a non-i.i.d. supervised learning problem, because the input distribution depends on the policy π itself. That dependence makes the optimization difficult (non-convex).
  • Typical assumption in statistics and machine learning: observations in a sample are independent and identically distributed (i.i.d.). This simplifies many methods, although it does not hold in many practical settings. Examples: coin flips, roulette spins.

SLIDE 25

Running Example: Super Tux Cart from

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AIStats. 2011. https://www.youtube.com/watch?feature=oembed&v=V00npNnWzSU

SLIDE 26

Another example – Super Mario from

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AIStats. 2011.

https://www.youtube.com/watch?v=anOI0xZ3kGM

SLIDE 27

How do we train a policy that can deal with any possible situation?

1. This is impossible, since the state/observation space may be prohibitively large and we cannot train on all possible configurations. If we could, we might as well just memorize.
2. Goal: train f to do well on the configurations that it encounters itself.
3. Chicken-and-egg problem:
  • a. We want a policy that does well in a bunch of world configurations.
  • b. Which configurations? The ones it encounters.
4. Solve by iteration: roll out f, collect data, retrain.

SLIDE 28

Dataset Aggregation Algorithm (DAgger)

Figure and Algorithm from - A Course in Machine Learning by Hal Daumé III. Ch. 18
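The DAgger loop can be sketched as follows, under toy assumptions (the 1-D expert, dynamics, and nearest-neighbor classifier are invented for illustration): roll out the current policy, let the expert label the states the policy actually visits, aggregate everything into one dataset, and retrain.

```python
import numpy as np

# Hedged sketch of DAgger (Dataset Aggregation). The key difference
# from behavioral cloning: training states come from rollouts of the
# *learner's* policy, while labels still come from the expert.

np.random.seed(0)

def expert_action(s):
    return 0 if s > 0 else 1  # 0 = step left, 1 = step right

def step(s, a):
    return s - 0.5 if a == 0 else s + 0.5

def fit(states, actions):
    """1-nearest-neighbor 'classifier' over (state, action) pairs."""
    S, A = np.array(states), np.array(actions)
    return lambda s: A[np.argmin(np.abs(S - s))]

# Initialize from a single expert-labeled state.
data_s, data_a = [5.0], [expert_action(5.0)]
policy = fit(data_s, data_a)

for _ in range(5):                       # DAgger iterations
    s = np.random.uniform(-5, 5)         # fresh rollout start
    for _ in range(20):                  # roll out the *current* policy
        data_s.append(s)                 # state visited by the policy...
        data_a.append(expert_action(s))  # ...labeled by the expert
        s = step(s, policy(s))           # the policy, not the expert, acts
    policy = fit(data_s, data_a)         # retrain on the aggregate
```

Because the training states are drawn from the learner's own state distribution, the mismatch that breaks plain behavioral cloning shrinks over iterations; this is what the no-regret analysis of Ross et al. exploits.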

SLIDE 29

How well does DAgger work?

Theorem from - A Course in Machine Learning by Hal Daumé III. Ch. 18

SLIDE 30

Running Example: Super Tux Cart from

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AIStats. 2011.

SLIDE 31

Requirements on the Expert

1. Human demonstrations
2. An expensive but exact algorithm that is too slow to run in real time