  1. Reduction of Imitation Learning to No-Regret Online Learning. Stephane Ross, joint work with Drew Bagnell & Geoff Gordon

  2. Imitation Learning (diagram: the Expert provides Demonstrations to a Learning Algorithm, which outputs a Policy executed by the Machine)

  3. Imitation Learning
     • Many successes:
       – Legged locomotion [Ratliff 06]
       – Outdoor navigation [Silver 08]
       – Helicopter flight [Abbeel 07]
       – Car driving [Pomerleau 89]
       – etc.

  4. Example Scenario: Learning to drive from demonstrations
     • Input: camera image
     • Output: policy mapping images to steering in [-1, 1] (hard left turn to hard right turn)

  5. Supervised Training Procedure
     • Dataset of expert trajectories: states drawn from the expert's distribution D(π*), labeled with the expert's actions
     • Learned policy: π̂_sup = argmin_{π∈Π} E_{s∼D(π*)}[ℓ(π, s, π*(s))]
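A minimal sketch of this supervised (behavioral-cloning) baseline, assuming the expert trajectories have already been flattened into (state-feature, expert-action) pairs; the file names are placeholders, and a generic scikit-learn classifier stands in for the policy class (the driving example in the talk actually regresses steering in [-1, 1]).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: feature vectors for states visited by the expert, i.e. s ~ D(pi*),
# and the expert's action pi*(s) in each of those states.
expert_states = np.load("expert_states.npy")    # shape (num_samples, num_features)
expert_actions = np.load("expert_actions.npy")  # shape (num_samples,)

# pi_sup = argmin over the policy class of the surrogate loss,
# measured only on states drawn from the expert's own distribution D(pi*).
policy_sup = LogisticRegression(max_iter=1000)
policy_sup.fit(expert_states, expert_actions)

def act(state_features):
    # At test time the learned policy drives itself, so it visits states that may
    # look nothing like D(pi*) -- the source of the compounding-error problem.
    return policy_sup.predict(state_features.reshape(1, -1))[0]
```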

  6. Poor Performance in Practice

  7. # Mistakes Grows Quadratically in T! [Ross 2010]
     • J(π̂_sup) ≤ T²ε
       (J: expected # of mistakes over T steps; T: # of time steps; ε: avg. loss on D(π*))
     • Reason: the policy doesn't learn how to recover from its own errors!
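A rough back-of-the-envelope intuition for why the bound is quadratic (a sketch, not the exact argument from [Ross 2010]): the learner errs with probability at most ε only on states that look like D(π*); after its first error it can drift to states never seen in training and may keep erring for the rest of the horizon.

```latex
% Each step t contributes probability at most \epsilon of a first error (while still
% on-distribution); once off-distribution, up to the remaining T - t + 1 steps can all be mistakes.
\mathbb{E}[\text{mistakes}] \;\le\; \sum_{t=1}^{T} \epsilon \,(T - t + 1) \;\le\; \epsilon\, T^2 .
```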

  8. Reduction-Based Approach & Analysis
     • Reduce a hard learning problem to easier related problem(s): achieving performance ε on the easier problems guarantees performance f(ε) on the hard problem
     • Example: reducing cost-sensitive multiclass classification to binary classification [Beygelzimer 2005]

  9. Previous Work: Forward Training [Ross 2010]
     • Sequentially learn one policy π_n per time step
     • # mistakes grows linearly: J(π_1:T) ≤ Tε
     • Impractical if T is large

  10. Previous Work: SMILe [Ross 2010]
      • Learn a stochastic policy, changing the policy slowly:
        – π_n = π_n-1 + α_n (π'_n − π*)
        – π'_n trained to mimic π* under D(π_n-1)
        – Similar to SEARN [Daume 2009]
      • Near-linear bound: J(π) ≤ O(T log(T) ε + 1)
      • Stochasticity is undesirable
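An illustrative sketch of the mixture view of the SMILe update, under assumptions not stated on the slide: π_0 = π* and the α_n sum to at most 1, so the update simply shifts α_n of the expert's probability mass onto the newly trained policy. `train_to_mimic_expert` is a hypothetical helper that fits π'_n on states visited by the current mixture.

```python
import random

def sample_action(components, weights, state):
    # Execute the stochastic policy: pick a component policy at random, then act.
    policy = random.choices(components, weights=weights, k=1)[0]
    return policy(state)

def smile_mixture(expert, train_to_mimic_expert, alphas):
    """Maintain pi_n as an explicit mixture of the expert and the learned
    policies pi'_1..pi'_n (assumes pi_0 = expert and sum(alphas) <= 1)."""
    components, weights = [expert], [1.0]
    for alpha_n in alphas:
        # pi'_n is trained to mimic the expert on states visited by pi_{n-1}.
        pi_prime_n = train_to_mimic_expert(
            sample_policy=lambda s: sample_action(components, weights, s))
        # pi_n = pi_{n-1} + alpha_n * (pi'_n - pi*): move alpha_n of the expert's
        # weight onto the new policy.
        weights[0] -= alpha_n
        components.append(pi_prime_n)
        weights.append(alpha_n)
    return components, weights
```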

  11. DAgger: Dataset Aggregation
      • Collect trajectories with the expert π*

  12. DAgger: Dataset Aggregation
      • Collect trajectories with the expert π*
      • Dataset D_0 = {(s, π*(s))}

  13. DAgger: Dataset Aggregation
      • Collect trajectories with the expert π*
      • Dataset D_0 = {(s, π*(s))}
      • Train π_1 on D_0

  14. DAgger: Dataset Aggregation
      • Collect new trajectories with π_1

  15. DAgger: Dataset Aggregation
      • Collect new trajectories with π_1
      • New dataset D_1' = {(s, π*(s))}

  16. DAgger: Dataset Aggregation
      • Collect new trajectories with π_1
      • New dataset D_1' = {(s, π*(s))}
      • Aggregate datasets: D_1 = D_0 ∪ D_1'

  17. DAgger: Dataset Aggregation
      • Collect new trajectories with π_1
      • New dataset D_1' = {(s, π*(s))}
      • Aggregate datasets: D_1 = D_0 ∪ D_1'
      • Train π_2 on D_1

  18. DAgger: Dataset Aggregation
      • Collect new trajectories with π_2
      • New dataset D_2' = {(s, π*(s))}
      • Aggregate datasets: D_2 = D_1 ∪ D_2'
      • Train π_3 on D_2

  19. DAgger: Dataset Aggregation
      • Collect new trajectories with π_n
      • New dataset D_n' = {(s, π*(s))}
      • Aggregate datasets: D_n = D_n-1 ∪ D_n'
      • Train π_n+1 on D_n
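A compact sketch of the aggregation loop on slides 11-19, assuming a simulator that can roll out a policy and an expert that can be queried for the correct action in any visited state; `rollout`, `expert`, and the scikit-learn classifier are placeholders for the actual simulator, demonstrator, and supervised learner (and assume discrete actions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(rollout, expert, num_iterations, trajectories_per_iter=1):
    """DAgger: repeatedly collect states under the current policy, label them
    with the expert, aggregate, and retrain."""
    states, actions = [], []      # aggregate dataset D_n
    policy = expert               # iteration 0 collects trajectories with the expert
    learned_policies = []
    for n in range(num_iterations):
        for _ in range(trajectories_per_iter):
            visited = rollout(policy)                    # states the current policy visits
            states.extend(visited)                       # D_n' = {(s, pi*(s))}
            actions.extend(expert(s) for s in visited)
        # Train pi_{n+1} on the aggregate dataset D_n = D_{n-1} U D_n'.
        clf = LogisticRegression(max_iter=1000).fit(np.array(states), np.array(actions))
        policy = lambda s, clf=clf: clf.predict(np.array(s).reshape(1, -1))[0]
        learned_policies.append(policy)
    return learned_policies
```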

  20. Online Learning (slides 20-24: animated diagram in which, at each round, the Learner chooses a hypothesis, the Adversary reveals new labeled (+/-) examples, and the Learner incurs the corresponding loss)

  25. Online Learning
      • Avg. Regret: ε_n = (1/n) [ Σ_{i=1}^n L_i(h_i) − min_{h∈H} Σ_{i=1}^n L_i(h) ]
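A small numerical illustration of the average-regret definition, under the simplifying assumption (purely for illustration) that the hypothesis class H is finite and the per-round losses L_i(h_j) have been tabulated in a matrix.

```python
import numpy as np

def average_regret(loss_table, played):
    """loss_table[i, j] = L_i(h_j): loss of hypothesis j under round i's loss.
    played[i] = index of the hypothesis the learner chose at round i."""
    n = loss_table.shape[0]
    learner_loss = loss_table[np.arange(n), played].mean()   # (1/n) sum_i L_i(h_i)
    best_in_hindsight = loss_table.mean(axis=0).min()         # min_h (1/n) sum_i L_i(h)
    return learner_loss - best_in_hindsight                   # epsilon_n

# A no-regret learner drives this quantity to 0 as n grows, for any loss sequence.
```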

  26. DAgger as Online Learning
      • The Adversary's loss at round n: L_n(π) = E_{s∼D(π_n)}[ℓ(π, s, π*(s))]

  27. DAgger as Online Learning
      • The Adversary's loss at round n: L_n(π) = E_{s∼D(π_n)}[ℓ(π, s, π*(s))]
      • The Learner picks π_n+1 = argmin_{π∈Π} Σ_{i=1}^n L_i(π)

  28. DAgger as Online Learning
      • The Adversary's loss at round n: L_n(π) = E_{s∼D(π_n)}[ℓ(π, s, π*(s))]
      • The Learner picks π_n+1 = argmin_{π∈Π} Σ_{i=1}^n L_i(π)
      • This is Follow-The-Leader (FTL)
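A sketch of Follow-The-Leader over a finite hypothesis class, to make the correspondence explicit: picking argmin_π Σ_{i≤n} L_i(π) is exactly DAgger's "retrain on the aggregate dataset" step, since the aggregate training loss is the sum of the per-round losses. The finite class and explicit loss functions are illustrative simplifications.

```python
import numpy as np

def follow_the_leader(loss_fns, hypotheses):
    """After observing L_1..L_n, play the hypothesis minimizing the cumulative
    loss sum_{i<=n} L_i(h). In DAgger, L_i is the expected surrogate loss under
    D(pi_i), so this minimization is 'train on the aggregate dataset'."""
    cumulative = np.zeros(len(hypotheses))
    played = []
    for L_n in loss_fns:                                       # adversary reveals L_n
        cumulative += np.array([L_n(h) for h in hypotheses])
        played.append(hypotheses[int(np.argmin(cumulative))])  # pi_{n+1} = argmin sum L_i
    return played
```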

  29. Theoretical Guarantees of DAgger
      • The best policy π in the sequence π_1:N guarantees:
        J(π) ≤ T(ε_N + γ_N) + O(T/N)
        (ε_N: avg. loss on the aggregate dataset; γ_N: avg. regret of π_1:N; N: iterations of DAgger)

  30. Theoretical Guarantees of DAgger
      • Best policy π in the sequence π_1:N: J(π) ≤ T(ε_N + γ_N) + O(T/N)
      • For strongly convex losses and N = O(T log T) iterations: J(π) ≤ Tε_N + O(1)

  31. Theoretical Guarantees of DAgger
      • Best policy π in the sequence π_1:N: J(π) ≤ T(ε_N + γ_N) + O(T/N)
      • For strongly convex losses and N = O(T log T) iterations: J(π) ≤ Tε_N + O(1)
      • Any no-regret algorithm has the same guarantees

  32. Theoretical Guarantees of DAgger
      • If we sample m trajectories at each iteration, then w.p. at least 1−δ:
        J(π) ≤ T(ε̂_N + γ_N) + O(T√(log(1/δ)/(Nm)))
        (ε̂_N: empirical avg. loss on the aggregate dataset; γ_N: avg. regret of π_1:N)

  33. Theoretical Guarantees of DAgger
      • If we sample m trajectories at each iteration, then w.p. at least 1−δ:
        J(π) ≤ T(ε̂_N + γ_N) + O(T√(log(1/δ)/(Nm)))
      • For strongly convex losses, N = O(T² log(1/δ)) and m = 1, w.p. at least 1−δ: J(π) ≤ Tε̂_N + O(1)

  34. Experiments: 3D Racing Game
      • Input: game image resized to 25x19 pixels (1425 features)
      • Output: steering in [-1, 1]

  35. DAgger Test-Time Execution

  36. Results: Average Falls/Lap (plot; lower is better)

  37. Experiments: Super Mario Bros (from the Mario AI competition 2009)
      • Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell)
      • Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}

  38. Test-Time Execution

  39. Results: Average Distance/Stage (plot; higher is better)

  40. Conclusion
      • Take-home message: simple iterative procedures can yield much better performance.
      • Can also be applied to structured prediction:
        – NLP (e.g. handwriting recognition)
        – Computer vision [Ross et al., CVPR 2011]
      • Future work:
        – Combining with other imitation learning techniques [Ratliff 06]
        – Potential extensions to reinforcement learning?

  41. Questions

  42. Structured Prediction
      • Example: scene labeling (diagram: an image and the graph structure over its labels)

  43. Structured Prediction
      • Sequentially label each node using neighboring predictions
        – e.g. in Breadth-First-Search order (forward & backward passes)
      • (diagram: a graph over nodes A, B, C, D and the corresponding sequence of classifications)

  44. Structured Prediction
      • Input to the classifier:
        – Local image features in the neighborhood of the pixel
        – Current neighboring pixels' labels
      • Neighboring labels depend on the classifier itself
      • DAgger finds a classifier that predicts pixel labels well given the neighbors' labels it itself generates during the labeling process.
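A sketch of one forward labeling pass as described on slides 43-44, assuming a fixed-size neighborhood so each node's feature vector (local image features concatenated with current neighbor labels, unlabeled neighbors encoded by a placeholder) has constant length; `local_features`, `neighbors`, and `classifier` are illustrative stand-ins. DAgger's role here is to train `classifier` on the neighbor labels it generates itself, rather than on ground-truth neighbor labels.

```python
import numpy as np

def label_pass(nodes, neighbors, local_features, classifier, unlabeled=-1):
    """Sequentially label nodes (assumed pre-ordered, e.g. breadth-first),
    feeding each node's current neighbor labels back in as features."""
    labels = {v: unlabeled for v in nodes}
    for v in nodes:
        neighbor_labels = [labels[u] for u in neighbors[v]]   # fixed-size neighborhood
        x = np.concatenate([local_features[v], neighbor_labels])
        labels[v] = int(classifier.predict(x.reshape(1, -1))[0])
    return labels

# A backward pass repeats the loop over reversed(nodes), now with every
# neighbor label already filled in from the forward pass.
```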

  45. Experiments: Handwriting Recognition [Taskar 2003]
      • Input: image of the current letter and the previously predicted letter
      • Output: current letter in {a, b, ..., z}

  46. Results: Character Accuracy across Test Folds (plot; higher is better)
