SLIDE 1
Reduction of Imitation Learning to No-Regret Online Learning
Stephane Ross, joint work with Drew Bagnell & Geoff Gordon
SLIDE 2
Imitation Learning
[Diagram: Expert Demonstrations → Machine Learning Algorithm → Policy]
SLIDE 3
Imitation Learning
- Many successes:
– Legged locomotion [Ratliff 06]
– Outdoor navigation [Silver 08]
– Helicopter flight [Abbeel 07]
– Car driving [Pomerleau 89]
– etc.
SLIDE 4
Example Scenario
Learning to drive from demonstrations.
Input: camera image.
Output: steering in [-1, 1] (hard left turn to hard right turn).
[Diagram: camera image → policy → steering command]
SLIDE 5
Supervised Training Procedure
[Diagram: expert trajectories → dataset → learned policy]
Learned policy: $\hat{\pi}_{\sup} = \arg\min_{\pi} \mathbb{E}_{s \sim d_{\pi^*}}\big[\ell(s, \pi, \pi^*(s))\big]$
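To make the objective concrete, here is a minimal runnable sketch of this supervised step (behavior cloning). The toy dynamics, the hand-coded expert, and the Ridge regressor are illustrative assumptions, not the talk's actual driving setup:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T = 20  # horizon: time steps per trajectory

def expert_action(s):
    # Stand-in expert pi*(s): steer against the first state feature.
    return -0.5 * s[0]

def rollout(policy, n_traj=50):
    # Roll out `policy` and record (visited state, expert label) pairs.
    X, y = [], []
    for _ in range(n_traj):
        s = rng.normal(size=3)
        for _ in range(T):
            X.append(s.copy())
            y.append(expert_action(s))  # the expert labels the visited state
            a = policy(s)               # ...but `policy` chooses the action
            s = 0.9 * s + 0.1 * a * rng.normal(size=3)  # toy dynamics
    return np.array(X), np.array(y)

# Supervised training: the expert itself drives, so the states come from
# the expert's own distribution d_{pi*}.
X, y = rollout(expert_action)
pi_sup = Ridge(alpha=1.0).fit(X, y)

The catch, as the next slides show, is that $\hat{\pi}_{\sup}$ is trained on the states the expert visits but executed on the states it induces itself.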
SLIDE 6
Poor Performance in Practice
SLIDE 7
# Mistakes Grows Quadratically in T!
$J(\hat{\pi}_{\sup}) \le \epsilon T^{2}$
- $J(\hat{\pi}_{\sup})$: expected # of mistakes over $T$ steps
- $\epsilon$: avg. loss on $D(\pi^*)$
[Plot: expected # of mistakes vs. # of time steps, growing quadratically]
Reason: Doesn’t learn how to recover from errors!
[Ross 2010]
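The quadratic growth follows from a short argument (standard, paraphrasing [Ross 2010]): the learner errs with probability at most ε per step while still on the expert's distribution, but after its first error it leaves that distribution and may err on every remaining step:

\[
\mathbb{E}[\#\text{mistakes}]
  \;\le\; \sum_{t=1}^{T} \epsilon\,(T - t + 1)
  \;\le\; \epsilon \sum_{t=1}^{T} T
  \;=\; \epsilon T^{2}.
\]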
SLIDE 8
Reduction-Based Approach & Analysis
[Diagram: Hard Learning Problem (performance f(ε)) reduced to Easier Related Problem(s) (performance ε)]
Example: cost-sensitive multiclass classification reduced to binary classification [Beygelzimer 2005]
SLIDE 9
Previous Work: Forward Training
- Sequentially learn one policy per time step
- # mistakes grows linearly:
– $J(\pi_{1:T}) \le T\epsilon$
- Impractical if T large
[Diagram: one policy per time step, π1, π2, ..., πn-1, πn, trained with expert π*]
[Ross 2010]
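A sketch of the forward-training loop, continuing the toy setup from the earlier snippet (one policy per time step; all names remain illustrative):

# Forward training: learn a separate policy for each time step t, training
# pi_t on the states reached by executing the already-learned pi_1..pi_{t-1}.
def make_policy(model):
    return lambda s: float(model.predict(s.reshape(1, -1))[0])

policies = []
for t in range(T):
    X_t, y_t = [], []
    for _ in range(50):
        s = rng.normal(size=3)
        for tau in range(t):  # roll in with the policies learned so far
            s = 0.9 * s + 0.1 * policies[tau](s) * rng.normal(size=3)
        X_t.append(s.copy())
        y_t.append(expert_action(s))  # expert labels the reached state
    policies.append(make_policy(Ridge(alpha=1.0).fit(np.array(X_t), np.array(y_t))))
# Test time uses policies[t] at step t: T separate policies in total,
# hence impractical when T is large.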
SLIDE 10
Previous Work: SMILe
- Learn stochastic policy, changing policy slowly
– $\pi_n = \pi_{n-1} + \alpha_n(\hat{\pi}'_n - \pi^*)$
– $\hat{\pi}'_n$ trained to mimic $\pi^*$ under $d_{\pi_{n-1}}$
– Similar to SEARN [Daume 2009]
- Near-linear bound:
– $J(\pi) \le O(T\log(T)\,\epsilon + 1)$
- Stochasticity undesirable
[Ross 2010]
SLIDE 11
DAgger: Dataset Aggregation
- Collect trajectories with expert π*
[Figure: trajectory driven by the expert π*; steering labels from the expert]
SLIDE 12
DAgger: Dataset Aggregation
- Collect trajectories with expert π*
- Dataset D0 = {(s, π*(s))}
SLIDE 13
DAgger: Dataset Aggregation
- Collect trajectories with expert π*
- Dataset D0 = {(s, π*(s))}
- Train π1 on D0
SLIDE 14
DAgger: Dataset Aggregation
- Collect new trajectories with π1
[Figure: trajectory driven by π1; steering labels still from the expert π*]
SLIDE 15
DAgger: Dataset Aggregation
- Collect new trajectories with π1
- New dataset D1' = {(s, π*(s))}
SLIDE 16
DAgger: Dataset Aggregation
- Collect new trajectories with π1
- New dataset D1' = {(s, π*(s))}
- Aggregate datasets: D1 = D0 ∪ D1'
SLIDE 17
DAgger: Dataset Aggregation
- Collect new trajectories with π1
- New dataset D1' = {(s, π*(s))}
- Aggregate datasets: D1 = D0 ∪ D1'
- Train π2 on D1
SLIDE 18
DAgger: Dataset Aggregation
- Collect new trajectories with π2
- New dataset D2' = {(s, π*(s))}
- Aggregate datasets: D2 = D1 ∪ D2'
- Train π3 on D2
SLIDE 19
DAgger: Dataset Aggregation
- Collect new trajectories with πn
- New dataset Dn' = {(s, π*(s))}
- Aggregate datasets: Dn = Dn-1 ∪ Dn'
- Train πn+1 on Dn
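Putting the loop together, a minimal runnable sketch of DAgger, again continuing the illustrative toy setup from the earlier snippets (not the talk's racing domain):

# DAgger: repeatedly (1) roll out the current policy, (2) have the expert
# label every visited state, (3) aggregate with all past data, (4) retrain.
X_agg, y_agg = [], []
policy = expert_action   # iteration 0: the expert drives, seeding D0
for n in range(10):      # N iterations
    for _ in range(20):  # m trajectories per iteration
        s = rng.normal(size=3)
        for _ in range(T):
            X_agg.append(s.copy())
            y_agg.append(expert_action(s))  # labels always come from pi*
            a = policy(s)                    # ...but the current policy drives
            s = 0.9 * s + 0.1 * a * rng.normal(size=3)
    # Train pi_{n+1} on the aggregate dataset Dn = D0 u D1' u ... u Dn'
    model = Ridge(alpha=1.0).fit(np.array(X_agg), np.array(y_agg))
    policy = lambda s, m=model: float(m.predict(s.reshape(1, -1))[0])

The guarantees below apply to the best policy among π1:N; in practice one typically keeps the best (or simply the last) policy.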
SLIDE 20
Online Learning
At each round, the Learner picks a hypothesis, then the Adversary reveals a loss on which the learner is evaluated.
[Diagram: Learner vs. Adversary]
SLIDE 21
Online Learning
[Animation: more rounds of the Learner/Adversary game]
SLIDE 22
Online Learning
[Animation continues]
SLIDE 23
Online Learning
[Animation continues]
SLIDE 24
Online Learning
[Animation continues]
SLIDE 25
Online Learning
[Diagram: Learner vs. Adversary over n rounds]
- Avg. regret: $\frac{1}{n}\sum_{i=1}^{n} L_i(h_i) - \min_{h \in H} \frac{1}{n}\sum_{i=1}^{n} L_i(h)$
(An algorithm is no-regret if its average regret goes to 0 as n grows.)
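For concreteness, a tiny sketch of computing the average regret of a learner that picks among a finite hypothesis class (the matrix layout and names are illustrative):

import numpy as np

def average_regret(losses, chosen):
    # losses[i, h] = L_i(h); chosen[i] = index of the hypothesis played at round i.
    n = losses.shape[0]
    incurred = losses[np.arange(n), chosen].mean()  # (1/n) sum_i L_i(h_i)
    best_fixed = losses.mean(axis=0).min()          # min_h (1/n) sum_i L_i(h)
    return incurred - best_fixed

# Example: 3 rounds, 2 hypotheses.
L = np.array([[0.2, 0.9],
              [0.8, 0.1],
              [0.3, 0.4]])
print(average_regret(L, chosen=[0, 0, 1]))  # ~0.033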
SLIDE 26
DAgger as Online Learning
[Diagram: Learner picks πn; Adversary responds with loss Ln]
- Adversary's loss at round n: $L_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}}\big[\ell(s, \pi, \pi^*(s))\big]$
SLIDE 27
DAgger as Online Learning
[Diagram: Learner picks πn; Adversary responds with loss Ln]
- Adversary: $L_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}}\big[\ell(s, \pi, \pi^*(s))\big]$
- Learner: $\pi_{n+1} = \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)$
SLIDE 28
DAgger as Online Learning
[Diagram: Learner picks πn; Adversary responds with loss Ln]
- Adversary: $L_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}}\big[\ell(s, \pi, \pi^*(s))\big]$
- Learner: $\pi_{n+1} = \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)$
- This update is exactly Follow-The-Leader (FTL)
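A step the slides leave implicit: the empirical version of this FTL update is exactly "train on the aggregate dataset", which is what makes DAgger an online learner in disguise:

\[
\pi_{n+1} = \arg\min_{\pi} \sum_{i=1}^{n} L_i(\pi)
          = \arg\min_{\pi} \sum_{i=1}^{n} \mathbb{E}_{s \sim d_{\pi_i}}\big[\ell(s, \pi, \pi^*(s))\big]
          \;\approx\; \arg\min_{\pi} \sum_{(s,\,\pi^*(s)) \in D_n} \ell(s, \pi, \pi^*(s)),
\]
\[
D_n = D_0 \cup D_1' \cup \dots \cup D_n'.
\]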
SLIDE 29
Theoretical Guarantees of DAgger
- Best policy in the sequence π1:N guarantees:
$J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N)$
$\epsilon_N$ = avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
[Plot: performance vs. iterations of DAgger]
SLIDE 30
Theoretical Guarantees of DAgger
- Best policy in the sequence π1:N guarantees:
$J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N)$
- For strongly convex loss, N = O(T log T) iterations give:
$J(\pi) \le T\epsilon_N + O(1)$
$\epsilon_N$ = avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
[Plot: performance vs. iterations of DAgger]
SLIDE 31
Theoretical Guarantees of DAgger
- Best policy in the sequence π1:N guarantees:
$J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N)$
- For strongly convex loss, N = O(T log T) iterations give:
$J(\pi) \le T\epsilon_N + O(1)$
- Any no-regret algorithm has the same guarantees
$\epsilon_N$ = avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
[Plot: performance vs. iterations of DAgger]
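Why strong convexity removes the regret term (a sketch, using the standard FTL regret bound for strongly convex losses):

\[
\gamma_N = O\!\left(\frac{\log N}{N}\right)
\quad\Longrightarrow\quad
T\gamma_N + O\!\left(\frac{T}{N}\right) = O(1)
\quad\text{for } N = O(T\log T),
\]
\[
\text{so } J(\pi) \le T(\epsilon_N + \gamma_N) + O(T/N) \le T\epsilon_N + O(1).
\]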
SLIDE 32
Theoretical Guarantees of DAgger
- If we sample m trajectories at each iteration, then w.p. at least 1−δ:
$J(\pi) \le T(\hat{\epsilon}_N + \gamma_N) + O\big(T\sqrt{\log(1/\delta)/(Nm)}\big)$
$\hat{\epsilon}_N$ = empirical avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
SLIDE 33
Theoretical Guarantees of DAgger
- If we sample m trajectories at each iteration, then w.p. at least 1−δ:
$J(\pi) \le T(\hat{\epsilon}_N + \gamma_N) + O\big(T\sqrt{\log(1/\delta)/(Nm)}\big)$
- For strongly convex loss, N = O(T² log(1/δ)) and m = 1 give, w.p. at least 1−δ:
$J(\pi) \le T\hat{\epsilon}_N + O(1)$
$\hat{\epsilon}_N$ = empirical avg. loss on the aggregate dataset; $\gamma_N$ = avg. regret of π1:N
SLIDE 34
Experiments: 3D Racing Game
Input: camera image, resized to 25x19 pixels (1425 features).
Output: steering in [-1, 1].
SLIDE 35
DAgger Test-Time Execution
SLIDE 36
Average Falls/Lap
[Plot: average falls per lap vs. amount of training data; lower is better]
SLIDE 37
Experiments: Super Mario Bros
Input: 27K+ binary features extracted from the last 4 observations (14 binary features per cell). From the Mario AI competition 2009.
Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}.
SLIDE 38
Test-Time Execution
SLIDE 39
Average Distance/Stage
[Plot: average distance per stage vs. amount of training data; higher is better]
SLIDE 40
Conclusion
- Take-Home Message
– Simple iterative procedures can yield much better performance.
- Can also be applied for Structured Prediction:
– NLP (e.g., handwriting recognition)
– Computer vision [Ross et al., CVPR 2011]
- Future Work:
– Combining with other imitation learning techniques [Ratliff 06]
– Potential extensions to reinforcement learning?
SLIDE 41
Questions
SLIDE 42
Structured Prediction
- Example: scene labeling
[Diagram: image → graph structure over labels]
SLIDE 43
Structured Prediction
- Sequentially label each node using neighboring predictions
– e.g., in breadth-first-search order (forward & backward passes)
[Diagram: graph → sequence of classifications A, B, C, D, A, B, C, D, C, B, ...]
SLIDE 44
Structured Prediction
- Input to Classifier:
– Local image features in the neighborhood of the pixel
– Current neighboring pixels' labels
- Neighboring labels depend on classifier itself
- DAgger finds a classifier that predicts pixel labels well given the neighbors' labels it itself generates during the labeling process (see the sketch below).
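A rough sketch of this labeling process. The grid layout, feature shapes, and classifier are illustrative assumptions (clf stands for any fitted scikit-learn-style classifier, not a specific model from the talk):

import numpy as np

def label_image(clf, local_feats, order):
    # local_feats: (H, W, F) array of per-pixel features.
    # order: sequence of (i, j) node positions, e.g. a BFS ordering.
    H, W = local_feats.shape[:2]
    labels = np.zeros((H, W), dtype=int)
    for (i, j) in order:
        # Input = local features + current labels of the 4 neighbors
        # (0 where a neighbor is off-grid or not yet labeled).
        nbrs = [labels[i + di, j + dj]
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= i + di < H and 0 <= j + dj < W]
        nbrs += [0] * (4 - len(nbrs))
        x = np.concatenate([local_feats[i, j], nbrs])
        labels[i, j] = clf.predict(x.reshape(1, -1))[0]
    return labels

# A DAgger iteration then runs label_image with the current clf, records each
# (features + self-generated neighbor labels, ground-truth label) pair,
# aggregates with past data, and refits clf.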
SLIDE 45
Experiments: Handwriting Recognition
[Taskar 2003]
Input: image of the current letter, plus the previously predicted letter.
Output: current letter in {a, b, ..., z}.
SLIDE 46
Test Folds Character Accuracy
[Plot: character accuracy on test folds; higher is better]