

SLIDE 1

Reduction of Imitation Learning to No-Regret Online Learning

Stephane Ross, joint work with Drew Bagnell & Geoff Gordon

SLIDE 2

Imitation Learning

[Diagram: Expert → Demonstrations → Machine Learning Algorithm → Policy]
SLIDE 3

Imitation Learning

  • Many successes:

– Legged locomotion [Ratliff 06]
– Outdoor navigation [Silver 08]
– Helicopter flight [Abbeel 07]
– Car driving [Pomerleau 89]
– etc.

SLIDE 4

Example Scenario

Learning to drive from demonstrations.

Input: camera image. Output (the policy's action): steering in [-1, 1], from hard left turn to hard right turn.
SLIDE 5

Supervised Training Procedure

Expert trajectories → dataset → learned policy:

$$\hat{\pi}_{\sup} \;=\; \arg\min_{\pi}\; \mathbb{E}_{s \sim D(\pi^*)}\big[\ell(s, \pi, \pi^*(s))\big]$$

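As a concrete sketch of this procedure, behavior cloning is plain supervised regression on the expert's state-action pairs. The learner, loss, and API below are illustrative assumptions, not from the talk:

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in learner; any supervised method works

def train_supervised(expert_states, expert_actions):
    """Fit a policy to expert demonstrations: the empirical version of
    argmin_pi E_{s ~ D(pi*)}[loss(s, pi, pi*(s))], here with a squared loss."""
    model = Ridge().fit(np.array(expert_states), np.array(expert_actions))
    # The learned policy maps a state (feature vector) to a steering action.
    return lambda s: model.predict(np.asarray(s).reshape(1, -1))[0]
```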
SLIDE 6

Poor Performance in Practice

SLIDE 7

# Mistakes Grows Quadratically in T!

$$J(\hat{\pi}_{\sup}) \;\leq\; \epsilon\, T^2$$

  • J: expected # of mistakes over T steps
  • ε: avg. loss on D(π*)
  • T: # of time steps

Reason: Doesn’t learn how to recover from errors!

[Ross 2010]
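A back-of-the-envelope version of the argument (a sketch, not the exact proof in [Ross 2010]): a mistake at step t happens with probability at most ε on the expert's state distribution, but it can drive the learner into states never seen in training, where it may keep erring for all remaining steps:

```latex
J(\hat{\pi}_{\sup})
  \;\le\; \sum_{t=1}^{T} \epsilon \,(T - t + 1)
  \;=\; \epsilon \,\frac{T(T+1)}{2}
  \;=\; O(\epsilon\, T^{2})
```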

SLIDE 8

Reduction-Based Approach & Analysis


[Diagram] Hard learning problem → easier related problem(s): solve the easier problem(s) with error ε, and the hard problem is solved with performance f(ε). Example: cost-sensitive multiclass classification reduced to binary classification [Beygelzimer 2005].
SLIDE 9

Previous Work: Forward Training

  • Sequentially learn one policy/step
  • # mistakes grows linearly:

– J(1:T)  Tε

  • Impractical if T large

* 1 2 n-1 n

[Ross 2010]

SLIDE 10

Previous Work: SMILe

  • Learn stochastic policy, changing policy slowly

– n = n-1 + αn(’n - *) – ’n trained to mimic * under D(n-1) – Similar to SEARN [Daume 2009]

  • Near-linear bound:

– J()  O(Tlog(T)ε + 1)

  • Stochasticity undesirable

[Ross 2010]
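At execution time a SMILe policy is a stochastic mixture: sample one component policy, then act with it. A minimal sketch (the function names and weight bookkeeping are illustrative; in SMILe the expert retains (1-α)^n of the mass after n iterations):

```python
import random

def smile_action(state, expert, learned_policies, weights):
    """Act with SMILe's stochastic mixture policy: weights[0] is the
    probability mass left on the expert, weights[1:] the mass on each
    learned policy (weights must sum to 1)."""
    policies = [expert] + learned_policies
    pi = random.choices(policies, weights=weights, k=1)[0]  # sample a component
    return pi(state)
```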

SLIDE 11

DAgger: Dataset Aggregation

  • Collect trajectories with expert π*

[Figure: car driven along the track; steering labels come from the expert]
SLIDE 12

DAgger: Dataset Aggregation

  • Collect trajectories with expert π*
  • Dataset D0 = {(s, π*(s))}
SLIDE 13

DAgger: Dataset Aggregation

*

  • Collect trajectories with expert *
  • Dataset D0 = {(s, *(s))}
  • Train 1 on D0

13

Steering from expert

slide-14
SLIDE 14

DAgger: Dataset Aggregation

  • Collect new trajectories with π1
SLIDE 15

DAgger: Dataset Aggregation

  • Collect new trajectories with π1
  • New dataset D1' = {(s, π*(s))}
SLIDE 16

DAgger: Dataset Aggregation

  • Collect new trajectories with π1
  • New dataset D1' = {(s, π*(s))}
  • Aggregate datasets: D1 = D0 ∪ D1'
SLIDE 17

DAgger: Dataset Aggregation

  • Collect new trajectories with π1
  • New dataset D1' = {(s, π*(s))}
  • Aggregate datasets: D1 = D0 ∪ D1'
  • Train π2 on D1
SLIDE 18

DAgger: Dataset Aggregation

2

  • Collect new trajectories with 2
  • New Dataset D2’ = {(s, *(s))}
  • Aggregate Datasets:

D2 = D1 U D2’

  • Train 3 on D2

18

Steering from expert

slide-19
SLIDE 19

DAgger: Dataset Aggregation

  • Collect new trajectories with πn
  • New dataset Dn' = {(s, π*(s))}
  • Aggregate datasets: Dn = Dn-1 ∪ Dn'
  • Train πn+1 on Dn
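The full loop, assembled from the steps above, fits in a few lines. A minimal sketch, assuming a Gym-style environment and a scikit-learn regressor (the expert, environment API, and hyperparameters are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in learner

def dagger(expert, env, n_iters=10, trajs_per_iter=5, horizon=100):
    """DAgger sketch: roll out the current policy, label every visited
    state with the expert's action, aggregate, retrain."""
    states, actions = [], []   # aggregate dataset D
    policy = expert            # iteration 0 collects data with the expert
    learned = []
    for _ in range(n_iters):
        for _ in range(trajs_per_iter):
            s = env.reset()
            for _ in range(horizon):
                states.append(s)
                actions.append(expert(s))      # labels always come from the expert
                s, done = env.step(policy(s))  # ...but states come from our own policy
                if done:
                    break
        model = Ridge().fit(np.array(states), np.array(actions))
        policy = lambda s, m=model: m.predict(np.asarray(s).reshape(1, -1))[0]
        learned.append(policy)
    return learned  # in practice, pick the best of these on validation data
```

The only difference from the supervised procedure on slide 5 is where the states come from: the learner's own trajectories rather than the expert's.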
SLIDE 20

Online Learning

Repeated game between a Learner and an Adversary: at each round i, the Learner picks a hypothesis hi, then the Adversary reveals a loss function Li and the Learner suffers loss Li(hi).

SLIDE 25

Online Learning

  • Avg. regret of h1:n:

$$\gamma_n \;=\; \frac{1}{n}\sum_{i=1}^{n} L_i(h_i) \;-\; \min_{h \in H}\, \frac{1}{n}\sum_{i=1}^{n} L_i(h)$$

  • An algorithm is no-regret if γn → 0 as n → ∞.
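A toy numerical illustration of this definition (entirely hypothetical: quadratic losses L_i(h) = (h − z_i)², with Follow-The-Leader as the learner, which is no-regret for such strongly convex losses):

```python
import numpy as np

rng = np.random.default_rng(0)
zs = rng.uniform(-1, 1, size=500)   # the adversary's choices, one per round

h, losses = 0.0, []
for i, z in enumerate(zs, start=1):
    losses.append((h - z) ** 2)     # suffer L_i(h_i) = (h_i - z_i)^2
    h = zs[:i].mean()               # FTL: argmin_h of the sum of past losses

best_fixed = ((zs.mean() - zs) ** 2).mean()  # min_h average loss in hindsight
gamma_n = np.mean(losses) - best_fixed       # average regret; shrinks as n grows
print(gamma_n)
```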
SLIDE 26

DAgger as Online Learning

Adversary: the loss at round n is the imitation loss under the state distribution of the current policy πn:

$$L_n(\pi) \;=\; \mathbb{E}_{s \sim D(\pi_n)}\big[\ell(s, \pi, \pi^*(s))\big]$$
SLIDE 27

DAgger as Online Learning

Learner: picks the next policy by minimizing the total loss so far:

$$\pi_{n+1} \;=\; \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)$$

with $L_n(\pi) = \mathbb{E}_{s \sim D(\pi_n)}[\ell(s, \pi, \pi^*(s))]$ as before.
SLIDE 28

DAgger as Online Learning

Training on the aggregate dataset is exactly Follow-The-Leader (FTL):

$$\pi_{n+1} \;=\; \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)$$

i.e., the policy that minimizes the sum of the losses of all previous rounds.
SLIDE 29

Theoretical Guarantees of DAgger

  • The best policy π̂ in the sequence π1:N guarantees:

$$J(\hat{\pi}) \;\leq\; T\,(\epsilon_N + \gamma_N) \;+\; O(T/N)$$

  • εN: avg. loss on the aggregate dataset
  • γN: avg. regret of π1:N
  • N: # of iterations of DAgger
SLIDE 30

Theoretical Guarantees of DAgger

  • The best policy π̂ in the sequence π1:N guarantees:

$$J(\hat{\pi}) \;\leq\; T\,(\epsilon_N + \gamma_N) \;+\; O(T/N)$$

  • For strongly convex loss and N = O(T log T) iterations:

$$J(\hat{\pi}) \;\leq\; T\,\epsilon_N \;+\; O(1)$$
SLIDE 31

Theoretical Guarantees of DAgger

  • The best policy π̂ in the sequence π1:N guarantees:

$$J(\hat{\pi}) \;\leq\; T\,(\epsilon_N + \gamma_N) \;+\; O(T/N)$$

  • For strongly convex loss and N = O(T log T) iterations:

$$J(\hat{\pi}) \;\leq\; T\,\epsilon_N \;+\; O(1)$$

  • Any no-regret algorithm enjoys the same guarantees.
SLIDE 32

Theoretical Guarantees of DAgger

  • If we sample m trajectories at each iteration, then with probability at least 1−δ:

$$J(\hat{\pi}) \;\leq\; T\,(\hat{\epsilon}_N + \gamma_N) \;+\; O\!\big(T\sqrt{\log(1/\delta)/Nm}\big)$$

  • ε̂N: empirical avg. loss on the aggregate dataset
  • γN: avg. regret of π1:N
SLIDE 33

Theoretical Guarantees of DAgger

  • If we sample m trajectories at each iteration, then with probability at least 1−δ:

$$J(\hat{\pi}) \;\leq\; T\,(\hat{\epsilon}_N + \gamma_N) \;+\; O\!\big(T\sqrt{\log(1/\delta)/Nm}\big)$$

  • For strongly convex loss, N = O(T² log(1/δ)) iterations and m = 1, with probability at least 1−δ:

$$J(\hat{\pi}) \;\leq\; T\,\hat{\epsilon}_N \;+\; O(1)$$
SLIDE 34

Experiments: 3D Racing Game

Learning to steer from camera images. Input: camera image, resized to 25x19 pixels (1425 features). Output: steering in [-1, 1].
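For concreteness, 25 x 19 x 3 = 1425, so the feature vector is presumably the raw RGB values of the downsized frame. A hypothetical sketch of that featurization (the Pillow-based pipeline is an assumption, not from the talk):

```python
import numpy as np
from PIL import Image

def image_features(frame_path):
    """Downsize a camera frame to 25x19 and flatten its RGB values
    into a 1425-dimensional feature vector in [0, 1]."""
    img = Image.open(frame_path).convert("RGB").resize((25, 19))
    return np.asarray(img, dtype=np.float32).ravel() / 255.0  # shape: (1425,)
```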

SLIDE 35

DAgger Test-Time Execution

SLIDE 36

Average Falls/Lap

[Chart: average falls per lap for each method; lower is better]
SLIDE 37

Experiments: Super Mario Bros

From the Mario AI competition 2009.

Input: 27K+ binary features extracted from the last 4 observations (14 binary features per grid cell). Output: four binary actions, each in {0,1}: Jump, Right, Left, Speed.
SLIDE 38

Test-Time Execution

SLIDE 39

Average Distance/Stage

[Chart: average distance per stage for each method; higher is better]
SLIDE 40

Conclusion

  • Take-Home Message

– Simple iterative procedures can yield much better performance.

  • Can also be applied for Structured Prediction:

– NLP (e.g., handwriting recognition)
– Computer vision [Ross et al., CVPR 2011]

  • Future Work:

– Combining with other imitation learning techniques [Ratliff 06]
– Potential extensions to reinforcement learning?

SLIDE 41

Questions

SLIDE 42

Structured Prediction

  • Example: scene labeling

[Figure: an image converted to a graph structure over labels]
SLIDE 43

Structured Prediction

  • Sequentially label each node using neighboring predictions
    – e.g., in Breadth-First-Search order (forward & backward passes)

[Figure: graph unrolled into a sequence of classifications: A B C D A B C D C B ...]
SLIDE 44

Structured Prediction

  • Input to classifier:
    – Local image features in the neighborhood of the pixel
    – Current neighboring pixels' labels
  • Neighboring labels depend on the classifier itself
  • DAgger finds a classifier that predicts pixel labels well given the neighbors' labels it itself generates during the labeling process (see the sketch below).
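A minimal sketch of that labeling process (the names, neighbor-label encoding, and two-pass schedule are illustrative assumptions; any classifier trained with DAgger could be plugged in):

```python
import numpy as np

def label_graph(classifier, n_nodes, neighbors, local_feats, n_labels, passes=2):
    """Sequentially relabel nodes, feeding each one the current predicted
    labels of its neighbors (as label counts) alongside its local features."""
    labels = np.zeros(n_nodes, dtype=int)  # start from an arbitrary labeling
    order = list(range(n_nodes))
    for _ in range(passes):
        for v in order + order[::-1]:      # forward pass, then backward pass
            counts = np.bincount(labels[neighbors[v]], minlength=n_labels)
            x = np.concatenate([local_feats[v], counts])
            labels[v] = classifier.predict(x.reshape(1, -1))[0]
    return labels
```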

SLIDE 45

Experiments: Handwriting Recognition

[Taskar 2003]

Input: image of the current letter and the previously predicted letter. Output: current letter in {a, b, ..., z}.
SLIDE 46

Character Accuracy on Test Folds

[Chart: character accuracy per test fold for each method; higher is better]