SLIDE 1

Supervised Learning of Behaviors

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2
Terminology & notation

Example: given an observation o_t (an image of a tiger), the policy π_θ(a_t | o_t) chooses among actions:

1. run away
2. ignore
3. pet

SLIDE 3
Terminology & notation

1. run away
2. ignore
3. pet

SLIDE 4

Aside: notation

Richard Bellman (reinforcement learning notation: state s_t, action a_t) vs. Lev Pontryagin (control theory notation: state x_t, control u_t; "u" from управление, Russian for "control")

SLIDE 5

Imitation Learning

behavioral cloning: training data → supervised learning

Images: Bojarski et al. ‘16, NVIDIA
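In code, behavioral cloning is just supervised learning on expert (observation, action) pairs. A minimal sketch in PyTorch; the dataset tensors are placeholder stand-ins for real demonstrations:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for a real demonstration dataset of
# (observation, action) pairs collected from the expert.
obs = torch.randn(10_000, 32)    # observations o_t
acts = torch.randn(10_000, 4)    # expert actions a_t

# pi_theta(a | o) as a deterministic regressor over actions.
policy = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for i in range(0, len(obs), 256):
        o, a = obs[i:i + 256], acts[i:i + 256]
        # MSE = maximum likelihood under a fixed-variance Gaussian policy.
        loss = nn.functional.mse_loss(policy(o), a)
        opt.zero_grad()
        loss.backward()
        opt.step()
```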

SLIDE 6

The original deep imitation learning system

ALVINN: Autonomous Land Vehicle In a Neural Network, 1989

SLIDE 7

Does it work? No!

SLIDE 8

Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA

SLIDE 9

Why did that work?

Bojarski et al. ‘16, NVIDIA: the system also recorded from left- and right-facing cameras labeled with corrective steering, so the training data itself teaches the policy how to recover from small drifts.

SLIDE 10

Can we make it work more often?

[Figure: cost over time; stability (more on this later)]

SLIDE 11

Can we make it work more often?

SLIDE 12

Can we make it work more often?

DAgger: Dataset Aggregation

Ross et al. ‘11
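Schematically, DAgger fixes distributional drift by training on states the policy itself visits, relabeled by the expert. A hedged sketch of the loop from Ross et al. ‘11, where `env`, `expert_label`, and `train` are assumed stand-ins for the environment, the expert labeler, and the supervised learner:

```python
# DAgger (Ross et al. '11), schematically:
#   1. train pi_theta on demonstration data D
#   2. run pi_theta to collect observations
#   3. ask the expert to label those observations with actions
#   4. aggregate: D <- D u D_pi, and repeat
def dagger(env, expert_label, train, policy, n_iters=10, horizon=1000):
    D = []                                        # aggregated (obs, expert action) pairs
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(horizon):
            D.append((obs, expert_label(obs)))    # expert labels the states the POLICY visits
            obs = env.step(policy(obs))           # but the policy's own action drives the rollout
        policy = train(policy, D)                 # supervised learning on the aggregate
    return policy
```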

SLIDE 13

DAgger Example

Ross et al. ‘11

SLIDE 14

What’s the problem?

Ross et al. ‘11

SLIDE 15

Deep imitation learning in practice

SLIDE 16

Can we make it work without more data?

  • DAgger addresses the problem of distributional “drift”
  • What if our model is so good that it doesn’t drift?
  • Need to mimic expert behavior very accurately
  • But don’t overfit!
SLIDE 17

Why might we fail to fit the expert?

  • 1. Non-Markovian behavior: with a Markovian policy, behavior depends only on the current observation, so if we see the same thing twice, we do the same thing twice, regardless of what happened before. That is often very unnatural for human demonstrators, whose behavior depends on all past observations.
  • 2. Multimodal behavior

SLIDE 18

How can we use the whole history?

Naively concatenating all past frames: variable number of frames, too many weights

SLIDE 19

How can we use the whole history?

[Figure: each observation is encoded and fed into a recurrent network; the RNN state is carried across time steps, with shared weights at every step.]

Typically, LSTM cells work better here
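A sketch of such a history-dependent policy: a shared per-frame encoder feeds an LSTM whose state summarizes everything seen so far. Shapes and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """pi_theta(a_t | o_1, ..., o_t): encode each frame with shared weights,
    then let an LSTM state summarize the history."""
    def __init__(self, obs_dim=32, hidden=128, act_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)         # shared per-frame encoder
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):                           # (batch, T, obs_dim)
        z = torch.relu(self.encoder(obs_seq))
        h, _ = self.lstm(z)                               # RNN state carried across steps
        return self.head(h)                               # one action prediction per step

policy = RecurrentPolicy()
actions = policy(torch.randn(8, 50, 32))                  # 8 sequences of 50 observations
```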

SLIDE 20

Aside: why might this work poorly?

“causal confusion”

see: de Haan et al., “Causal Confusion in Imitation Learning”

Question 1: Does including history mitigate causal confusion?
Question 2: Can DAgger mitigate causal confusion?

SLIDE 21

Why might we fail to fit the expert?

  • 1. Non-Markovian behavior
  • 2. Multimodal behavior. Possible fixes:
    • 1. Output mixture of Gaussians
    • 2. Latent variable models
    • 3. Autoregressive discretization

SLIDE 22

Why might we fail to fit the expert?

  • 1. Output mixture of Gaussians (see the sketch below)
  • 2. Latent variable models
  • 3. Autoregressive discretization
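A sketch of option 1, a mixture-of-Gaussians (mixture density network) action head: the network outputs mixture weights, means, and log standard deviations, and training minimizes the negative log-likelihood of the expert action under the mixture. Dimensions and component count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MoGPolicy(nn.Module):
    """Mixture-of-Gaussians action head: lets the policy put probability mass
    on several distinct expert modes (e.g. swerve left AND swerve right)
    instead of averaging them into one bad mean action."""
    def __init__(self, obs_dim=32, act_dim=4, k=5, hidden=256):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, k * (1 + 2 * act_dim)))

    def forward(self, obs):
        out = self.net(obs)
        logits = out[:, :self.k]                            # mixture weights w_i
        mu = out[:, self.k:self.k * (1 + self.act_dim)]     # component means mu_i
        log_std = out[:, self.k * (1 + self.act_dim):]      # component log sigma_i
        return (logits,
                mu.view(-1, self.k, self.act_dim),
                log_std.view(-1, self.k, self.act_dim))

def mog_nll(logits, mu, log_std, act):
    """Negative log-likelihood of the expert action under the mixture."""
    comp = torch.distributions.Normal(mu, log_std.exp())
    log_p = comp.log_prob(act.unsqueeze(1)).sum(-1)         # (batch, k)
    log_w = torch.log_softmax(logits, dim=-1)
    return -torch.logsumexp(log_w + log_p, dim=-1).mean()
```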

SLIDE 23

Why might we fail to fit the expert?

  • 1. Output mixture of Gaussians
  • 2. Latent variable models
  • 3. Autoregressive discretization

Look up some of these:

  • Conditional variational autoencoder
  • Normalizing flow/realNVP
  • Stein variational gradient descent
SLIDE 24

Why might we fail to fit the expert?

  • 1. Output mixture of Gaussians
  • 2. Latent variable models
  • 3. Autoregressive discretization (see the sketch below)

[Figure: a (discretized) distribution over dimension 1 only; a discrete sample of the dim 1 value feeds into the prediction for the dim 2 value, and so on.]
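A sketch of option 3, autoregressive discretization: each action dimension is discretized into bins and predicted one dimension at a time, conditioning on the sampled values of earlier dimensions, so a full joint distribution is represented without bins^d outputs. Bin counts and network sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AutoregressiveDiscretePolicy(nn.Module):
    """Discretize each action dimension into `bins` values and predict them
    one at a time, conditioning on previously sampled dimensions."""
    def __init__(self, obs_dim=32, act_dim=4, bins=21, hidden=256):
        super().__init__()
        self.act_dim, self.bins = act_dim, bins
        # One small head per dimension; head d sees the observation plus
        # the (sampled) values of dimensions 0..d-1.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + d, hidden), nn.ReLU(),
                          nn.Linear(hidden, bins))
            for d in range(act_dim)
        )

    def sample(self, obs):                       # obs: (batch, obs_dim)
        prev, values = obs, []
        for head in self.heads:
            logits = head(prev)                  # distribution over this dim's bins
            idx = torch.distributions.Categorical(logits=logits).sample()
            val = idx.float() / (self.bins - 1) * 2 - 1   # bin index -> [-1, 1]
            values.append(val)
            prev = torch.cat([prev, val.unsqueeze(-1)], dim=-1)  # condition on it
        return torch.stack(values, dim=-1)       # (batch, act_dim)

policy = AutoregressiveDiscretePolicy()
a = policy.sample(torch.randn(8, 32))
```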

SLIDE 25

Imitation learning: recap

  • Often (but not always) insufficient by itself
    • Distribution mismatch problem
  • Sometimes works well
    • Hacks (e.g. left/right images)
    • Samples from a stable trajectory distribution
    • Add more on-policy data, e.g. using DAgger
    • Better models that fit more accurately

SLIDE 26

A case study: trail following from human demonstration data

SLIDE 27

Case study 1: trail following as classification

SLIDE 28
SLIDE 29

Cost functions, reward functions, and a bit of theory

SLIDE 30

Imitation learning: what’s the problem?

  • Humans need to provide data, which is typically finite
    • Deep learning works best when data is plentiful
  • Humans are not good at providing some kinds of actions
  • Humans can learn autonomously; can our machines do the same?
    • Unlimited data from own experience
    • Continuous self-improvement
SLIDE 31
Terminology & notation

1. run away
2. ignore
3. pet

SLIDE 32

Aside: notation

Richard Bellman and Lev Pontryagin

SLIDE 33

Cost functions, reward functions, and a bit of theory

SLIDE 34

A cost function for imitation?


Ross et al. ‘11
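In equation form, the cost analyzed here is the 0-1 "mistake" cost from Ross et al. ‘11: pay 1 whenever the policy deviates from the expert, evaluated under the policy's own state distribution:

```latex
% 0-1 "mistake" cost: pay 1 whenever the policy deviates from the expert.
c(\mathbf{s}_t, \mathbf{a}_t) =
\begin{cases}
  0 & \text{if } \mathbf{a}_t = \pi^\star(\mathbf{s}_t) \\
  1 & \text{otherwise}
\end{cases}
\qquad
\text{goal: minimize } \mathbb{E}_{\mathbf{s}_t \sim p_{\pi_\theta}}\!\left[\sum_t c(\mathbf{s}_t, \mathbf{a}_t)\right]
```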

SLIDE 35

Some analysis
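A reconstruction of the tightrope-walker argument: suppose supervised learning achieves mistake probability at most ε on states from the training distribution. Once the policy errs, it can enter states where nothing bounds its behavior, so mistakes compound quadratically in the horizon T:

```latex
% Assume supervised learning achieves, on training states,
%   \pi_\theta(\mathbf{a} \ne \pi^\star(\mathbf{s}) \mid \mathbf{s}) \le \varepsilon .
% Worst case, a single mistake derails the rest of the trajectory
% (the tightrope walker falls off and never recovers):
\mathbb{E}\left[\sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t)\right]
\le \varepsilon T + (1-\varepsilon)\Big(\varepsilon (T-1) + (1-\varepsilon)\big(\cdots\big)\Big)
\in O(\varepsilon T^2)
```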

SLIDE 36

More general analysis

For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”
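A reconstruction of the more general version of the argument, which bounds how far the policy's state distribution drifts from the training distribution rather than assuming training states are revisited exactly. Note that DAgger makes p_train(s_t) = p_θ(s_t), which collapses the bound to O(εT):

```latex
% If the policy errs with probability at most \varepsilon per step, its state
% distribution is a mixture of "no mistake yet" and "after a mistake":
p_\theta(\mathbf{s}_t) = (1-\varepsilon)^t \, p_{\mathrm{train}}(\mathbf{s}_t)
  + \left(1-(1-\varepsilon)^t\right) p_{\mathrm{mistake}}(\mathbf{s}_t)
% Total variation distance, using (1-\varepsilon)^t \ge 1-\varepsilon t:
\sum_{\mathbf{s}_t} \left| p_\theta(\mathbf{s}_t) - p_{\mathrm{train}}(\mathbf{s}_t) \right|
  \le 2\left(1-(1-\varepsilon)^t\right) \le 2\varepsilon t
% Hence the expected total cost is
\sum_{t=1}^{T} \mathbb{E}_{p_\theta(\mathbf{s}_t)}[c_t]
  \le \sum_{t=1}^{T} \left(\varepsilon + 2\varepsilon t\right) \in O(\varepsilon T^2)
```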

SLIDE 37

More general analysis

For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”

SLIDE 38

Another way to imitate

SLIDE 39

Another imitation idea

SLIDE 40

Goal-conditioned behavioral cloning

SLIDE 41
SLIDE 42
  • 1. Collect data
  • 2. Train goal-conditioned policy
SLIDE 43
  • 3. Reach goals
SLIDE 44

Going beyond just imitation?

➢ Start with a random policy
➢ Collect data with random goals
➢ Treat this data as “demonstrations” for the goals that were reached
➢ Use this to improve the policy
➢ Repeat (a sketch of this loop follows below)
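A hedged sketch of the loop above (goal relabeling): whatever final state a rollout actually reaches is, by construction, a valid "demonstration" of reaching that state, so even a random policy generates useful supervised data. `env`, `train`, and the policy interface are assumed stand-ins:

```python
def goal_conditioned_bc(env, policy, train, n_rounds=20, episodes=50, horizon=100):
    """Self-improvement via goal relabeling: whatever state a rollout ends in
    is, by definition, a state that trajectory successfully reached, so every
    rollout becomes a valid demonstration for SOME goal."""
    dataset = []                                 # (observation, goal, action) triples
    for _ in range(n_rounds):
        for _ in range(episodes):
            goal = env.sample_goal()             # command a random goal
            obs, traj = env.reset(), []
            for _ in range(horizon):
                act = policy(obs, goal)          # goal-conditioned action
                traj.append((obs, act))
                obs = env.step(act)
            reached = obs                        # the goal this rollout actually reached
            dataset += [(o, reached, a) for o, a in traj]  # relabel with achieved goal
        policy = train(policy, dataset)          # plain supervised learning, as in BC
    return policy
```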