SLIDE 1
Reduction of Imitation Learning to No-Regret Online Learning
Stephane Ross, joint work with Drew Bagnell & Geoff Gordon
SLIDE 2
Imitation Learning
[Diagram: Expert Demonstrations → Machine Learning Algorithm → Policy]
SLIDE 3
Imitation Learning
- Many successes:
– Legged locomotion [Ratliff 06]
– Outdoor navigation [Silver 08]
– Helicopter flight [Abbeel 07]
– Car driving [Pomerleau 89]
– etc.
SLIDE 4
Example Scenario
Learning to drive from demonstrations.
Input: camera image.
Output: steering in [-1, 1] (hard left turn to hard right turn).
[Diagram: camera image → policy → steering command]
SLIDE 5
Supervised Training Procedure
[Diagram: expert trajectories → dataset → learned policy]
Learned policy: $\hat{\pi}_{\sup} = \arg\min_{\pi} \mathbb{E}_{s \sim d_{\pi^*}}\big[\ell(s, \pi, \pi^*(s))\big]$
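To make the objective concrete, here is a minimal runnable sketch of this supervised step (behavior cloning). The toy dynamics, the hand-coded expert, and the Ridge regressor are illustrative assumptions, not the talk's actual driving setup:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T = 20  # horizon: time steps per trajectory

def expert_action(s):
    # Stand-in expert pi*(s): steer against the first state feature.
    return -0.5 * s[0]

def rollout(policy, n_traj=50):
    # Roll out `policy` and record (visited state, expert label) pairs.
    X, y = [], []
    for _ in range(n_traj):
        s = rng.normal(size=3)
        for _ in range(T):
            X.append(s.copy())
            y.append(expert_action(s))  # the expert labels the visited state
            a = policy(s)               # ...but `policy` chooses the action
            s = 0.9 * s + 0.1 * a * rng.normal(size=3)  # toy dynamics
    return np.array(X), np.array(y)

# Supervised training: the expert itself drives, so the states come from
# the expert's own distribution d_{pi*}.
X, y = rollout(expert_action)
pi_sup = Ridge(alpha=1.0).fit(X, y)

The catch, as the next slides show, is that $\hat{\pi}_{\sup}$ is trained on the states the expert visits but executed on the states it induces itself.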
SLIDE 6
Poor Performance in Practice
SLIDE 7
# Mistakes Grows Quadratically in T!
$J(\hat{\pi}_{\sup}) \le \epsilon T^{2}$
- $J(\hat{\pi}_{\sup})$: expected # of mistakes over $T$ steps
- $\epsilon$: avg. loss on $D(\pi^*)$
[Plot: expected # of mistakes vs. # of time steps, growing quadratically]
Reason: Doesn’t learn how to recover from errors!
[Ross 2010]
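The quadratic growth follows from a short argument (standard, paraphrasing [Ross 2010]): the learner errs with probability at most ε per step while still on the expert's distribution, but after its first error it leaves that distribution and may err on every remaining step:

\[
\mathbb{E}[\#\text{mistakes}]
  \;\le\; \sum_{t=1}^{T} \epsilon\,(T - t + 1)
  \;\le\; \epsilon \sum_{t=1}^{T} T
  \;=\; \epsilon T^{2}.
\]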
SLIDE 8
Reduction-Based Approach & Analysis
[Diagram: Hard Learning Problem (performance f(ε)) reduced to Easier Related Problem(s) (performance ε)]
Example: cost-sensitive multiclass classification reduced to binary classification [Beygelzimer 2005]
SLIDE 9
Previous Work: Forward Training
- Sequentially learn one policy per time step
- # mistakes grows linearly:
– $J(\pi_{1:T}) \le T\epsilon$
- Impractical if T large
[Diagram: one policy per time step, π1, π2, ..., πn-1, πn, trained with expert π*]
[Ross 2010]
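A sketch of the forward-training loop, continuing the toy setup from the earlier snippet (one policy per time step; all names remain illustrative):

# Forward training: learn a separate policy for each time step t, training
# pi_t on the states reached by executing the already-learned pi_1..pi_{t-1}.
def make_policy(model):
    return lambda s: float(model.predict(s.reshape(1, -1))[0])

policies = []
for t in range(T):
    X_t, y_t = [], []
    for _ in range(50):
        s = rng.normal(size=3)
        for tau in range(t):  # roll in with the policies learned so far
            s = 0.9 * s + 0.1 * policies[tau](s) * rng.normal(size=3)
        X_t.append(s.copy())
        y_t.append(expert_action(s))  # expert labels the reached state
    policies.append(make_policy(Ridge(alpha=1.0).fit(np.array(X_t), np.array(y_t))))
# Test time uses policies[t] at step t: T separate policies in total,
# hence impractical when T is large.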
SLIDE 10
Previous Work: SMILe
- Learn stochastic policy, changing policy slowly
– $\pi_n = \pi_{n-1} + \alpha_n(\hat{\pi}'_n - \pi^*)$
– $\hat{\pi}'_n$ trained to mimic $\pi^*$ under $d_{\pi_{n-1}}$
– Similar to SEARN [Daume 2009]
- Near-linear bound:
– $J(\pi) \le O(T\log(T)\,\epsilon + 1)$
- Stochasticity undesirable
[Ross 2010]
SLIDE 11
DAgger: Dataset Aggregation
- Collect trajectories with expert π*
[Figure: trajectory driven by the expert π*; steering labels from the expert]
SLIDE 12
DAgger: Dataset Aggregation
- Collect trajectories with expert π*
- Dataset D0 = {(s, π*(s))}
SLIDE 13
DAgger: Dataset Aggregation
- Collect trajectories with expert π*
- Dataset D0 = {(s, π*(s))}
- Train π1 on D0
SLIDE 14
DAgger: Dataset Aggregation
- Collect new trajectories with π1
[Figure: trajectory driven by π1; steering labels still from the expert π*]
SLIDE 15
DAgger: Dataset Aggregation
- Collect new trajectories with π1
- New dataset D1' = {(s, π*(s))}
SLIDE 16
DAgger: Dataset Aggregation
- Collect new trajectories with π1
- New dataset D1' = {(s, π*(s))}
- Aggregate datasets: D1 = D0 ∪ D1'
SLIDE 17
DAgger: Dataset Aggregation
- Collect new trajectories with π1
- New dataset D1' = {(s, π*(s))}
- Aggregate datasets: D1 = D0 ∪ D1'
- Train π2 on D1
SLIDE 18
DAgger: Dataset Aggregation
- Collect new trajectories with π2
- New dataset D2' = {(s, π*(s))}
- Aggregate datasets: D2 = D1 ∪ D2'
- Train π3 on D2
SLIDE 19
DAgger: Dataset Aggregation
- Collect new trajectories with πn
- New dataset Dn' = {(s, π*(s))}
- Aggregate datasets: Dn = Dn-1 ∪ Dn'
- Train πn+1 on Dn
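Putting the loop together, a minimal runnable sketch of DAgger, again continuing the illustrative toy setup from the earlier snippets (not the talk's racing domain):

# DAgger: repeatedly (1) roll out the current policy, (2) have the expert
# label every visited state, (3) aggregate with all past data, (4) retrain.
X_agg, y_agg = [], []
policy = expert_action   # iteration 0: the expert drives, seeding D0
for n in range(10):      # N iterations
    for _ in range(20):  # m trajectories per iteration
        s = rng.normal(size=3)
        for _ in range(T):
            X_agg.append(s.copy())
            y_agg.append(expert_action(s))  # labels always come from pi*
            a = policy(s)                    # ...but the current policy drives
            s = 0.9 * s + 0.1 * a * rng.normal(size=3)
    # Train pi_{n+1} on the aggregate dataset Dn = D0 u D1' u ... u Dn'
    model = Ridge(alpha=1.0).fit(np.array(X_agg), np.array(y_agg))
    policy = lambda s, m=model: float(m.predict(s.reshape(1, -1))[0])

The guarantees below apply to the best policy among π1:N; in practice one typically keeps the best (or simply the last) policy.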
SLIDE 20
Online Learning
At each round, the Learner picks a hypothesis, then the Adversary reveals a loss on which the learner is evaluated.
[Diagram: Learner vs. Adversary]
SLIDE 21
Online Learning
[Animation: more rounds of the Learner/Adversary game]
SLIDE 22
Online Learning
[Animation continues]
SLIDE 23
Online Learning
[Animation continues]
SLIDE 24
Online Learning
[Animation continues]
SLIDE 25
Online Learning
[Diagram: Learner vs. Adversary over n rounds]
- Avg. regret: $\frac{1}{n}\sum_{i=1}^{n} L_i(h_i) - \min_{h \in H} \frac{1}{n}\sum_{i=1}^{n} L_i(h)$
(An algorithm is no-regret if its average regret goes to 0 as n grows.)
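For concreteness, a tiny sketch of computing the average regret of a learner that picks among a finite hypothesis class (the matrix layout and names are illustrative):

import numpy as np

def average_regret(losses, chosen):
    # losses[i, h] = L_i(h); chosen[i] = index of the hypothesis played at round i.
    n = losses.shape[0]
    incurred = losses[np.arange(n), chosen].mean()  # (1/n) sum_i L_i(h_i)
    best_fixed = losses.mean(axis=0).min()          # min_h (1/n) sum_i L_i(h)
    return incurred - best_fixed

# Example: 3 rounds, 2 hypotheses.
L = np.array([[0.2, 0.9],
              [0.8, 0.1],
              [0.3, 0.4]])
print(average_regret(L, chosen=[0, 0, 1]))  # ~0.033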
SLIDE 26
DAgger as Online Learning
[Diagram: Learner picks πn; Adversary responds with loss Ln]
- Adversary's loss at round n: $L_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}}\big[\ell(s, \pi, \pi^*(s))\big]$
SLIDE 27
DAgger as Online Learning
[Diagram: Learner picks πn; Adversary responds with loss Ln]
- Adversary: $L_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}}\big[\ell(s, \pi, \pi^*(s))\big]$
- Learner: $\pi_{n+1} = \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)$
SLIDE 28
DAgger as Online Learning
[Diagram: Learner picks πn; Adversary responds with loss Ln]
- Adversary: $L_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}}\big[\ell(s, \pi, \pi^*(s))\big]$
- Learner: $\pi_{n+1} = \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)$
- This update is exactly Follow-The-Leader (FTL)
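A step the slides leave implicit: the empirical version of this FTL update is exactly "train on the aggregate dataset", which is what makes DAgger an online learner in disguise:

\[
\pi_{n+1} = \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)
          = \arg\min_{\pi} \sum_{i=1}^{n} \mathbb{E}_{s \sim d_{\pi_i}}\big[\ell(s, \pi, \pi^*(s))\big]
          \;\approx\; \arg\min_{\pi} \sum_{(s,\,\pi^*(s)) \in D_n} \ell(s, \pi, \pi^*(s)),
\]
\[
D_n = D_0 \cup D_1' \cup \dots \cup D_n'.
\]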
SLIDE 29
Theoretical Guarantees of DAgger
- Best policy in the sequence π1:N guarantees:
$J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N)$
$\epsilon_N$ = avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
[Plot: performance vs. iterations of DAgger]
SLIDE 30
Theoretical Guarantees of DAgger
- Best policy in the sequence π1:N guarantees:
$J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N)$
- For strongly convex loss, N = O(T log T) iterations give:
$J(\pi) \le T\epsilon_N + O(1)$
$\epsilon_N$ = avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
[Plot: performance vs. iterations of DAgger]
SLIDE 31
Theoretical Guarantees of DAgger
- Best policy in the sequence π1:N guarantees:
$J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N)$
- For strongly convex loss, N = O(T log T) iterations give:
$J(\pi) \le T\epsilon_N + O(1)$
- Any no-regret algorithm has the same guarantees
$\epsilon_N$ = avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
[Plot: performance vs. iterations of DAgger]
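Why strong convexity removes the regret term (a sketch, using the standard FTL regret bound for strongly convex losses):

\[
\gamma_N = O\!\left(\frac{\log N}{N}\right)
\quad\Longrightarrow\quad
T\gamma_N + O\!\left(\frac{T}{N}\right) = O(1)
\quad\text{for } N = O(T\log T),
\]
\[
\text{so } J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N) \le T\epsilon_N + O(1).
\]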
SLIDE 32
Theoretical Guarantees of DAgger
- If we sample m trajectories at each iteration, then w.p. at least 1−δ:
$J(\pi) \le T(\hat{\epsilon}_N + \gamma_N) + O\big(T\sqrt{\log(1/\delta)/(Nm)}\big)$
$\hat{\epsilon}_N$ = empirical avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
SLIDE 33
Theoretical Guarantees of DAgger
- If we sample m trajectories at each iteration, then w.p. at least 1−δ:
$J(\pi) \le T(\hat{\epsilon}_N + \gamma_N) + O\big(T\sqrt{\log(1/\delta)/(Nm)}\big)$
- For strongly convex loss, N = O(T² log(1/δ)) and m = 1 give, w.p. at least 1−δ:
$J(\pi) \le T\hat{\epsilon}_N + O(1)$
$\hat{\epsilon}_N$ = empirical avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
SLIDE 34
Experiments: 3D Racing Game
Input: camera image, resized to 25x19 pixels (1425 features).
Output: steering in [-1, 1].
SLIDE 35
DAgger Test-Time Execution
SLIDE 36
Average Falls/Lap
[Plot: average falls per lap vs. amount of training data; lower is better]
SLIDE 37
Experiments: Super Mario Bros
Input: 27K+ binary features extracted from the last 4 observations (14 binary features per cell). From the Mario AI competition 2009.
Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}.
SLIDE 38
Test-Time Execution
SLIDE 39
Average Distance/Stage
[Plot: average distance per stage vs. amount of training data; higher is better]
SLIDE 40
Conclusion
- Take-Home Message
– Simple iterative procedures can yield much better performance.
- Can also be applied for Structured Prediction:
– NLP (e.g., handwriting recognition)
– Computer vision [Ross et al., CVPR 2011]
- Future Work:
– Combining with other imitation learning techniques [Ratliff 06]
– Potential extensions to reinforcement learning?
SLIDE 41
Questions
SLIDE 42
Structured Prediction
- Example: scene labeling
[Diagram: image → graph structure over labels]
SLIDE 43
Structured Prediction
- Sequentially label each node using neighboring predictions
– e.g., in breadth-first-search order (forward & backward passes)
[Diagram: graph → sequence of classifications A, B, C, D, A, B, C, D, C, B, ...]
SLIDE 44
Structured Prediction
- Input to Classifier:
– Local image features in the neighborhood of the pixel
– Current neighboring pixels' labels
- Neighboring labels depend on classifier itself
- DAgger finds a classifier that predicts pixel labels well given the neighbors' labels it itself generates during the labeling process (see the sketch below).
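A rough sketch of this labeling process. The grid layout, feature shapes, and classifier are illustrative assumptions (clf stands for any fitted scikit-learn-style classifier, not a specific model from the talk):

import numpy as np

def label_image(clf, local_feats, order):
    # local_feats: (H, W, F) array of per-pixel features.
    # order: sequence of (i, j) node positions, e.g. a BFS ordering.
    H, W = local_feats.shape[:2]
    labels = np.zeros((H, W), dtype=int)
    for (i, j) in order:
        # Input = local features + current labels of the 4 neighbors
        # (0 where a neighbor is off-grid or not yet labeled).
        nbrs = [labels[i + di, j + dj]
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= i + di < H and 0 <= j + dj < W]
        nbrs += [0] * (4 - len(nbrs))
        x = np.concatenate([local_feats[i, j], nbrs])
        labels[i, j] = clf.predict(x.reshape(1, -1))[0]
    return labels

# A DAgger iteration then runs label_image with the current clf, records each
# (features + self-generated neighbor labels, ground-truth label) pair,
# aggregates with past data, and refits clf.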
SLIDE 45
Experiments: Handwriting Recognition
[Taskar 2003]
Input: image of the current letter, plus the previously predicted letter.
Output: current letter in {a, b, ..., z}.
SLIDE 46
Test Folds Character Accuracy
[Plot: character accuracy on test folds; higher is better]