  1. Imitation Learning from Imperfect Demonstration
Yueh-Hua Wu 1,2, Nontawat Charoenphakdee 3,2, Han Bao 3,2, Voot Tangkaratt 2, Masashi Sugiyama 2,3
1 National Taiwan University, 2 RIKEN Center for Advanced Intelligence Project, 3 The University of Tokyo
Poster #47. Yueh-Hua Wu et al., Imitation Learning from Imperfect Demonstration, 1 / 12.

  2. Introduction
Imitation learning: learning from demonstration instead of from a reward function.
Demonstration: a set of decision makings (state-action pairs x).
The collected demonstration may be imperfect, e.g., driving (a traffic violation) or playing basketball (a technical foul).

  3. Motivation
Confidence: how optimal a state-action pair x is (a value between 0 and 1).
A semi-supervised setting: the demonstration is only partially equipped with confidence.
How can confidence be obtained? Crowdsourcing: r = N^(1) / (N^(1) + N^(0)); or a digitized score: 0.0, 0.1, 0.2, ..., 1.0.
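The crowdsourcing rule above can be sketched in a few lines: the confidence of one state-action pair is the fraction of annotators who vote it optimal. The function name and the vote encoding (1 = optimal, 0 = non-optimal) are illustrative assumptions, not from the slides.

```python
import numpy as np

def crowdsourced_confidence(votes):
    """Estimate the confidence r for one state-action pair from binary
    crowdsourced labels: N^(1) "optimal" votes out of N^(1) + N^(0) total."""
    votes = np.asarray(votes)
    n1 = int(np.sum(votes == 1))  # votes saying "optimal"
    n0 = int(np.sum(votes == 0))  # votes saying "non-optimal"
    return n1 / (n1 + n0)

# e.g. 7 of 10 annotators mark the pair as optimal
r = crowdsourced_confidence([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # r = 0.7
```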

  4. Generative Adversarial Imitation Learning [1]
One-to-one correspondence between the policy π and the distribution of demonstrations [2].
Utilize generative adversarial training:
min_θ max_w E_{x∼p_θ}[log D_w(x)] + E_{x∼p_opt}[log(1 − D_w(x))]
where D_w is the discriminator, p_opt is the demonstration distribution of π_opt, and p_θ is the trajectory distribution of the agent π_θ.
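The saddle-point objective above can be sketched empirically: given samples from the agent and from the demonstrations, the discriminator ascends the value below over w while the policy descends it over θ. The linear-logistic form of D_w is an illustrative assumption for the sketch, not the architecture from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gail_objective(w, x_agent, x_opt):
    """Empirical GAIL value:
    mean log D_w(x) over agent samples + mean log(1 - D_w(x)) over demonstrations.
    D_w(x) = sigmoid(x @ w) is a linear-logistic discriminator (an assumption)."""
    d_agent = sigmoid(x_agent @ w)  # D_w on agent samples x ~ p_theta
    d_opt = sigmoid(x_opt @ w)      # D_w on demonstration samples x ~ p_opt
    return (np.mean(np.log(d_agent + 1e-12))
            + np.mean(np.log(1.0 - d_opt + 1e-12)))
```

With w = 0 the discriminator outputs 0.5 everywhere and the value is 2·log 0.5, the uninformative baseline of the game.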

  5. Problem Setting
Humans switch to non-optimal policies when they make mistakes or are distracted:
p(x) = α p(x | y = +1) + (1 − α) p(x | y = −1), where p_opt(x) ≜ p(x | y = +1) and p_non(x) ≜ p(x | y = −1).
Confidence: r(x) ≜ Pr(y = +1 | x).
Unlabeled demonstration: {x_i}_{i=1}^{n_u} ∼ p.
Demonstration with confidence: {(x_j, r_j)}_{j=1}^{n_c} ∼ q.
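The relation between the mixture and the confidence can be made concrete with a toy instance: under the two-component model above, r(x) = Pr(y = +1 | x) follows from Bayes' rule. The Gaussian components and their parameters are illustrative assumptions, not part of the problem setting itself.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def confidence(x, alpha, mu_opt=1.0, mu_non=-1.0, sigma=1.0):
    """r(x) = Pr(y = +1 | x) under p(x) = alpha * p_opt(x) + (1 - alpha) * p_non(x),
    computed by Bayes' rule. The 1-D Gaussian components are an assumption
    made only to have closed-form densities for the sketch."""
    p_opt = gauss_pdf(x, mu_opt, sigma)
    p_non = gauss_pdf(x, mu_non, sigma)
    return alpha * p_opt / (alpha * p_opt + (1 - alpha) * p_non)
```

For alpha = 0.5 and symmetric components, a point halfway between them gets r = 0.5, and r grows toward 1 as x moves toward the optimal component.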

  6. Proposed Method 1: Two-Step Importance Weighting Imitation Learning
Step 1: estimate confidence by learning a confidence scoring function g.
Unbiased risk estimator (come to Poster #47 for details):
R_{SC,ℓ}(g) = E_{(x,r)∼q}[r · ℓ(g(x))] + E_{(x,r)∼q}[(1 − r) · ℓ(−g(x))]
where the first term is the risk for the optimal class and the second is the risk for the non-optimal class.
Theorem. For δ ∈ (0, 1), with probability at least 1 − δ over repeated sampling of data for training ĝ,
R_{SC,ℓ}(ĝ) − R_{SC,ℓ}(g*) = O_p(n_c^{−1/2} + n_u^{−1/2}),
where n_c is the number of confidence-labeled data and n_u is the number of unlabeled data.
Step 2: employ importance weighting to reweight the GAIL objective:
min_θ max_w E_{x∼p_θ}[log D_w(x)] + E_{x∼p}[(r̂(x)/α) · log(1 − D_w(x))].
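Both steps can be sketched as empirical estimates. Step 1 averages the per-sample risk with each pair counted r times as optimal and (1 − r) times as non-optimal; step 2 reweights the unlabeled term of the GAIL value by r̂(x)/α. The logistic loss ℓ and the linear-logistic discriminator are illustrative assumptions for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_risk(g_scores, r, loss=lambda m: np.log(1.0 + np.exp(-m))):
    """Empirical R_{SC,l}(g): each confidence-labeled pair contributes
    r * l(g(x)) as an optimal sample and (1 - r) * l(-g(x)) as a
    non-optimal one. The logistic loss is an illustrative choice of l."""
    return np.mean(r * loss(g_scores) + (1.0 - r) * loss(-g_scores))

def weighted_gail_objective(w, x_agent, x_unlabeled, r_hat, alpha):
    """Step 2: importance-weighted empirical GAIL value
    mean log D_w(x) over agent samples
      + mean (r_hat(x)/alpha) * log(1 - D_w(x)) over unlabeled samples,
    with a linear-logistic D_w (an assumption for the sketch)."""
    d_agent = sigmoid(x_agent @ w)
    d_unl = sigmoid(x_unlabeled @ w)
    return (np.mean(np.log(d_agent + 1e-12))
            + np.mean((r_hat / alpha) * np.log(1.0 - d_unl + 1e-12)))
```

The weights r̂(x)/α turn the unlabeled average over p into an estimate of the average over p_opt, which is what recovers the original GAIL objective in expectation.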

  7. Proposed Method 2: GAIL with Imperfect Demonstration and Confidence
Mix the agent distribution with the non-optimal one: p′ = α p_θ + (1 − α) p_non.
Matching p′ with p enables p_θ = p_opt while benefiting from the large amount of unlabeled data.
Objective:
V(θ, D_w) = E_{x∼p}[log(1 − D_w(x))] + α E_{x∼p_θ}[log D_w(x)] + E_{(x,r)∼q}[(1 − r) log D_w(x)]
where the first term is the risk for the positive class and the remaining terms are the risk for the negative class.
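The objective V(θ, D_w) above can likewise be sketched from samples: the unlabeled data supply the first expectation, the agent rollouts the second, and the confidence-labeled pairs, weighted by (1 − r), the third. As before, the linear-logistic discriminator is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ic_gail_objective(w, x_unlabeled, x_agent, x_conf, r_conf, alpha):
    """Empirical V(theta, D_w) from the slide:
    mean log(1 - D_w(x)) over unlabeled samples x ~ p
      + alpha * mean log D_w(x) over agent samples x ~ p_theta
      + mean (1 - r) * log D_w(x) over confidence-labeled pairs (x, r) ~ q,
    with D_w(x) = sigmoid(x @ w) (an assumption for the sketch)."""
    d_unl = sigmoid(x_unlabeled @ w)
    d_agent = sigmoid(x_agent @ w)
    d_conf = sigmoid(x_conf @ w)
    return (np.mean(np.log(1.0 - d_unl + 1e-12))
            + alpha * np.mean(np.log(d_agent + 1e-12))
            + np.mean((1.0 - r_conf) * np.log(d_conf + 1e-12)))
```

The (1 − r) weights let the confidence-labeled pairs stand in for samples from p_non, so the second and third terms together cover the mixture p′ = α p_θ + (1 − α) p_non.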

  8. Setup
Confidence is given by a classifier trained on the demonstration mixture, with optimal demonstrations labeled y = +1 and non-optimal ones labeled y = −1.

  9. Results: Higher Average Return of the Proposed Methods
Environment: MuJoCo. Proportion of labeled data: 20%.

  10. Results: Unlabeled Data Helps
More unlabeled data results in lower variance and better performance; the proposed methods are robust to noise.
(a) Number of unlabeled data: the number in the legend indicates the proportion of the original unlabeled data. (b) Noise influence: the number in the legend indicates the standard deviation of the Gaussian noise.

  11. Conclusion
Two approaches that utilize both unlabeled and confidence data are proposed.
Our methods are robust to noisy labelers.
The proposed approaches can be generalized to other IL and IRL methods.

  12. References
[1] Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.
[2] Syed, Umar, Michael Bowling, and Robert E. Schapire. "Apprenticeship learning using linear programming." Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
