CS 285 Instructor: Sergey Levine UC Berkeley


Sep 08, 2023


SLIDE 1

Supervised Learning of Behaviors

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2

1. run away
2. ignore
3. pet

Terminology & notation

SLIDE 3

1. run away
2. ignore
3. pet

Terminology & notation

SLIDE 4

Aside: notation

Richard Bellman, Lev Pontryagin

управление (Russian for “control”)

SLIDE 5

Images: Bojarski et al. ‘16, NVIDIA

training data supervised learning

Imitation Learning

behavioral cloning
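Behavioral cloning treats imitation as ordinary supervised learning on (observation, action) pairs. A minimal sketch, assuming a hypothetical one-dimensional driving task in which the expert steers in proportion to the lateral offset; `expert_action`, `fit_linear_policy`, and the linear policy class are illustrative assumptions, not part of the lecture:

```python
import random

def expert_action(offset):
    """Hypothetical expert: steer back toward the lane center."""
    return -0.5 * offset

def fit_linear_policy(observations, actions):
    """Least-squares fit of a one-parameter policy a = w * o."""
    numerator = sum(o * a for o, a in zip(observations, actions))
    denominator = sum(o * o for o in observations)
    return numerator / denominator

rng = random.Random(0)

# Training data: observations paired with the expert's actions.
observations = [rng.uniform(-1.0, 1.0) for _ in range(100)]
actions = [expert_action(o) for o in observations]

# Supervised learning: regress actions on observations.
w = fit_linear_policy(observations, actions)

def policy(offset):
    """Learned policy: imitates the expert on the training distribution."""
    return w * offset
```

A deep imitation learner would replace the one-parameter fit with a neural network trained on images, but the data format and the training objective have the same shape.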

SLIDE 6

The original deep imitation learning system

ALVINN: Autonomous Land Vehicle In a Neural Network 1989

SLIDE 7

Does it work? No!

SLIDE 8

Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA

SLIDE 9

Why did that work?

Bojarski et al. ‘16, NVIDIA

SLIDE 10

Can we make it work more often?

cost

stability (more on this later)

SLIDE 11

Can we make it work more often?

SLIDE 12

Can we make it work more often?

DAgger: Dataset Aggregation

Ross et al. ‘11
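The DAgger loop from Ross et al. can be sketched as four repeated steps: train on the dataset, run the learned policy, have the expert label the visited states, and aggregate. The toy 1-D system, the linear policy, and the helper names below are assumptions made for illustration only:

```python
import random

def expert_action(state):
    """Hypothetical expert labeler: push the state back toward zero."""
    return -0.5 * state

def fit_policy(dataset):
    """Least-squares fit of a = w * s on aggregated (state, action) pairs."""
    numerator = sum(s * a for s, a in dataset)
    denominator = sum(s * s for s, _ in dataset)
    return numerator / denominator if denominator > 0 else 0.0

def rollout(act, horizon, rng):
    """Run a policy in a toy noisy 1-D system and return the visited states."""
    state = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(horizon):
        visited.append(state)
        state = state + act(state) + rng.gauss(0.0, 0.1)
    return visited

def dagger(num_iters=5, horizon=20, seed=0):
    rng = random.Random(seed)
    # Initial dataset: states visited by the expert, labeled with expert actions.
    dataset = [(s, expert_action(s)) for s in rollout(expert_action, horizon, rng)]
    for _ in range(num_iters):
        w = fit_policy(dataset)                             # 1. train on D
        visited = rollout(lambda s: w * s, horizon, rng)    # 2. run the learned policy
        labeled = [(s, expert_action(s)) for s in visited]  # 3. expert labels those states
        dataset.extend(labeled)                             # 4. aggregate into D
    return fit_policy(dataset), len(dataset)

final_w, dataset_size = dagger()
```

Because the toy expert is itself linear, the fit recovers it immediately; the point of the sketch is the data flow, in which the states come from the learned policy while the labels come from the expert.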

SLIDE 13

DAgger Example

Ross et al. ‘11

SLIDE 14

What’s the problem?

Ross et al. ‘11

SLIDE 15

Deep imitation learning in practice

SLIDE 16

Can we make it work without more data?

DAgger addresses the problem of distributional “drift”

What if our model is so good that it doesn’t drift?

Need to mimic expert behavior very accurately

But don’t overfit!

SLIDE 17

Why might we fail to fit the expert?

1. Non-Markovian behavior
2. Multimodal behavior

behavior depends only on current observation

If we see the same thing twice, we do the same thing twice, regardless of what happened before

Often very unnatural for human demonstrators

behavior depends on all past observations

SLIDE 18

How can we use the whole history?

variable number of frames, too many weights

SLIDE 19

How can we use the whole history?

RNN state

shared weights

Typically, LSTM cells work better here
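To illustrate how shared weights let a policy consume a history of any length, here is a minimal sketch of a scalar recurrent policy. It uses a plain tanh recurrence rather than an LSTM cell, and the class name and weight values are hypothetical:

```python
import math

class TinyRecurrentPolicy:
    """Scalar tanh RNN: the same weights are reused at every time step."""

    def __init__(self, w_obs=1.0, w_state=0.5, w_action=2.0):
        self.w_obs = w_obs
        self.w_state = w_state
        self.w_action = w_action

    def act(self, observation_history):
        state = 0.0  # initial RNN state
        for observation in observation_history:
            # The RNN state summarizes all observations seen so far.
            state = math.tanh(self.w_state * state + self.w_obs * observation)
        return self.w_action * state

policy = TinyRecurrentPolicy()
action_short = policy.act([1.0])    # history of length 1
action_a = policy.act([0.0, 1.0])   # same final observation, different past
action_b = policy.act([1.0, 1.0])
```

Here `action_a` and `action_b` differ even though the current observation is identical, which is exactly the non-Markovian dependence the slide describes; the number of weights does not grow with the number of frames.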

SLIDE 20

Aside: why might this work poorly?

“causal confusion”

see: de Haan et al., “Causal Confusion in Imitation Learning”

Question 1: Does including history mitigate causal confusion?

Question 2: Can DAgger mitigate causal confusion?

SLIDE 21

Why might we fail to fit the expert?

1. Non-Markovian behavior
2. Multimodal behavior

1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization
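A small numeric sketch of why a mixture of Gaussians helps with multimodal behavior. It assumes a hypothetical expert that swerves either left (-1) or right (+1) around an obstacle; the mixture parameters are set by hand here, whereas a network would output them conditioned on the observation:

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def single_gaussian_fit(actions):
    """Maximum-likelihood single Gaussian; returns (average NLL, mean)."""
    mean = sum(actions) / len(actions)
    var = sum((a - mean) ** 2 for a in actions) / len(actions)
    std = math.sqrt(var)
    nll = -sum(math.log(gaussian_pdf(a, mean, std)) for a in actions) / len(actions)
    return nll, mean

def mixture_nll(actions, weights, means, stds):
    """Average negative log-likelihood under a mixture of Gaussians."""
    total = 0.0
    for a in actions:
        density = sum(w * gaussian_pdf(a, m, s) for w, m, s in zip(weights, means, stds))
        total -= math.log(density)
    return total / len(actions)

# Hypothetical expert actions: half swerve left, half swerve right.
expert_actions = [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0]

unimodal_nll, unimodal_mean = single_gaussian_fit(expert_actions)
bimodal_nll = mixture_nll(expert_actions, [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
```

The single Gaussian centers on the average action (straight ahead, which no expert ever chose), while the two-component mixture places its mass on the two actual modes and attains a lower negative log-likelihood.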

SLIDE 22

Why might we fail to fit the expert?

1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization

SLIDE 23

Why might we fail to fit the expert?

1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization

Look up some of these:

Conditional variational autoencoder
Normalizing flow/realNVP
Stein variational gradient descent

SLIDE 24

Why might we fail to fit the expert?

1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization

(discretized) distribution over dimension 1 only

discrete sampling of dim 1 value, then discrete sampling of dim 2 value
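Autoregressive discretization samples one action dimension at a time, with each distribution conditioned on the values already sampled. A minimal sketch with hand-written probability tables standing in for network outputs; the bins, tables, and function names are hypothetical:

```python
import random

BINS = [-1.0, 0.0, 1.0]  # discretized values for each action dimension

def dim1_distribution(observation):
    """Hypothetical p(a1 | o): two modes, swerve left or swerve right."""
    return [0.5, 0.0, 0.5]

def dim2_distribution(observation, dim1_value):
    """Hypothetical p(a2 | o, a1): depends on the sampled dim 1 value."""
    if dim1_value < 0:
        return [0.0, 0.0, 1.0]
    if dim1_value > 0:
        return [1.0, 0.0, 0.0]
    return [0.0, 1.0, 0.0]

def sample_action(observation, rng):
    # Sample dimension 1 from its (discretized) distribution.
    a1 = rng.choices(BINS, weights=dim1_distribution(observation))[0]
    # Feed the sampled value back in to get the distribution over dimension 2.
    a2 = rng.choices(BINS, weights=dim2_distribution(observation, a1))[0]
    return (a1, a2)

rng = random.Random(0)
samples = [sample_action(observation=None, rng=rng) for _ in range(200)]
```

Every sample is one of the two consistent modes, (-1, 1) or (1, -1). Discretizing each dimension independently would also produce the mixed combinations, and discretizing the joint space would need a number of bins exponential in the action dimension.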

SLIDE 25

Imitation learning: recap

Often (but not always) insufficient by itself
Distribution mismatch problem
Sometimes works well
Hacks (e.g. left/right images)
Samples from a stable trajectory distribution
Add more on-policy data, e.g. using DAgger
Better models that fit more accurately

training data supervised learning

SLIDE 26

A case study: trail following from human demonstration data

SLIDE 27

Case study 1: trail following as classification

SLIDE 28

SLIDE 29

Cost functions, reward functions, and a bit of theory

SLIDE 30

Imitation learning: what’s the problem?

Humans need to provide data, which is typically finite
Deep learning works best when data is plentiful
Humans are not good at providing some kinds of actions
Humans can learn autonomously; can our machines do the same?
Unlimited data from own experience
Continuous self-improvement

SLIDE 31

1. run away
2. ignore
3. pet

Terminology & notation

SLIDE 32

Aside: notation

Richard Bellman, Lev Pontryagin

SLIDE 33

Cost functions, reward functions, and a bit of theory

SLIDE 34

A cost function for imitation?

training data supervised learning

Ross et al. ‘11

SLIDE 35

Some analysis
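A sketch of the standard argument, reconstructed under stated assumptions rather than copied from the slide: take a 0-1 cost that penalizes any deviation from the expert policy $\pi^\star$, assume the learned policy errs with probability at most $\epsilon$ on training states, and assume the worst case in which a single mistake leaves the training distribution so that every later step also incurs cost.

```latex
c(s, a) =
\begin{cases}
0 & \text{if } a = \pi^\star(s) \\
1 & \text{otherwise}
\end{cases}
\qquad
\pi_\theta\big(a \neq \pi^\star(s) \mid s\big) \le \epsilon
\quad \text{for all } s \in \mathcal{D}_{\text{train}}

\mathbb{E}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big]
\le \epsilon T + (1 - \epsilon)\Big(\epsilon (T - 1) + (1 - \epsilon)\big(\cdots\big)\Big)
```

The expansion has $T$ terms, each of order $\epsilon T$, so the expected total cost is $O(\epsilon T^2)$: errors compound quadratically in the horizon rather than linearly.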

SLIDE 36

More general analysis

For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”
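A sketch of how the more general argument typically goes, assuming the error bound $\pi_\theta(a \neq \pi^\star(s) \mid s) \le \epsilon$ holds for states drawn from $p_{\text{train}}$ (not only for states in the dataset) and a per-step cost bounded by $c_{\max} = 1$; the symbol $p_{\text{mistake}}$ denotes whatever state distribution results after at least one mistake:

```latex
p_\theta(s_t) = (1-\epsilon)^t \, p_{\text{train}}(s_t)
  + \big(1 - (1-\epsilon)^t\big)\, p_{\text{mistake}}(s_t)

\sum_{s_t} \big| p_\theta(s_t) - p_{\text{train}}(s_t) \big|
= \big(1 - (1-\epsilon)^t\big) \sum_{s_t} \big| p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t) \big|
\le 2\big(1 - (1-\epsilon)^t\big)
\le 2 \epsilon t

\sum_{t=1}^{T} \mathbb{E}_{p_\theta(s_t)}[c_t]
\le \sum_{t=1}^{T} \Big( \mathbb{E}_{p_{\text{train}}(s_t)}[c_t]
  + c_{\max} \sum_{s_t} \big| p_\theta(s_t) - p_{\text{train}}(s_t) \big| \Big)
\le \sum_{t=1}^{T} (\epsilon + 2 \epsilon t)
= O(\epsilon T^2)
```

The step $1 - (1-\epsilon)^t \le \epsilon t$ uses the inequality $(1-\epsilon)^t \ge 1 - \epsilon t$ for $\epsilon \in [0, 1]$.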

SLIDE 37

More general analysis

For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”

SLIDE 38

Another way to imitate

SLIDE 39

Another imitation idea

SLIDE 40

Goal-conditioned behavioral cloning

SLIDE 41

SLIDE 42

1. Collect data
2. Train goal conditioned policy

SLIDE 43

3. Reach goals
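The three steps can be sketched on a toy problem. This assumes states on an integer line, actions of -1 or +1, and a tabular policy that picks the most frequent action per (state, goal) pair; all names and the environment are hypothetical stand-ins for a learned network policy:

```python
from collections import Counter, defaultdict

def relabel(trajectory):
    """Turn one trajectory into (state, goal, action) tuples,
    using the final state it actually reached as the goal."""
    states, actions = trajectory
    goal = states[-1]
    return [(s, goal, a) for s, a in zip(states[:-1], actions)]

def train_goal_conditioned_policy(trajectories):
    """Tabular policy: most frequent action for each (state, goal) pair."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for state, goal, action in relabel(trajectory):
            counts[(state, goal)][action] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def reach(policy, state, goal, max_steps=10):
    """Roll the policy out toward a commanded goal."""
    for _ in range(max_steps):
        if state == goal:
            break
        state += policy[(state, goal)]
    return state

# 1. Collect data: each trajectory is (states, actions).
trajectories = [
    ([0, 1, 2, 3], [+1, +1, +1]),
    ([3, 2, 1, 0], [-1, -1, -1]),
]
# 2. Train goal-conditioned policy.
policy = train_goal_conditioned_policy(trajectories)
# 3. Reach goals.
final_state = reach(policy, state=0, goal=3)
```

The table only covers (state, goal) pairs that appear in the data, so `reach` raises a `KeyError` for unseen pairs; a function approximator would instead generalize across them, for better or worse.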

SLIDE 44

Going beyond just imitation?

➢ Start with a random policy
➢ Collect data with random goals
➢ Treat this data as “demonstrations” for the goals that were reached
➢ Use this to improve the policy
➢ Repeat
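The loop above can be sketched on a toy chain of states. The environment, the tabular majority-vote policy update, and every name below are assumptions made for illustration; the sketch shows the data flow only and makes no claim about convergence:

```python
import random
from collections import Counter, defaultdict

def run_episode(policy, goal, start, horizon, size, rng):
    """Roll out the policy toward `goal`; unseen (state, goal) pairs act randomly."""
    state, states, actions = start, [start], []
    for _ in range(horizon):
        key = (state, goal)
        action = policy[key] if key in policy else rng.choice([-1, 1])
        state = min(max(state + action, 0), size - 1)  # stay on the chain
        states.append(state)
        actions.append(action)
    return states, actions

def improve(num_rounds=3, episodes=50, horizon=5, size=6, seed=0):
    rng = random.Random(seed)
    policy = {}                     # empty table: start with a random policy
    counts = defaultdict(Counter)
    num_tuples = 0
    for _ in range(num_rounds):     # repeat
        for _ in range(episodes):
            goal = rng.randrange(size)   # collect data with random goals
            start = rng.randrange(size)
            states, actions = run_episode(policy, goal, start, horizon, size, rng)
            reached = states[-1]    # relabel: treat the reached state as the goal
            for s, a in zip(states[:-1], actions):
                counts[(s, reached)][a] += 1
                num_tuples += 1
        # Improve the policy by imitating the relabeled "demonstrations".
        policy = {k: c.most_common(1)[0][0] for k, c in counts.items()}
    return policy, num_tuples

policy, num_tuples = improve()
```

The commanded goal is used only to drive data collection; the supervision always comes from the state that was actually reached, which is what lets the method start from a random policy with no expert data.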