
Generative Adversarial Imitation - PowerPoint PPT Presentation



  1. Generative Adversarial Imitation Learning. Stefano Ermon. Joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li, Hongyu Ren, and Jiaming Song

  2. Reinforcement Learning • Goal: learn policies • Policies map high-dimensional, raw observations to actions

  3. Reinforcement Learning • MDP: model for (stochastic) sequential decision making problems [figure: example MDP with immediate costs +5, +1, 0] • States S • Actions A • Cost function (immediate): C: S×A → R • Transition probabilities: P(s'|s,a) • Policy: mapping from states to actions, e.g., (S0→a1, S1→a0, S2→a0) • Reinforcement learning: minimize the total (expected, discounted) cost Σ_{t=0}^{T−1} c(s_t)
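Written out in full, the RL objective on this slide takes the following standard form (a reconstruction: the slide says "expected, discounted" in words, so the expectation and discount factor γ are made explicit here):

    \[ \min_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, c(s_t)\right] \]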

  4. Reinforcement Learning [diagram: cost function c(s,a) plus environment (MDP) → Reinforcement Learning (RL) → optimal policy π] • States S • Actions A • Transitions: P(s'|s,a) • Cost: C: S×A → R • Policy: mapping from states to actions, e.g., (S0→a1, S1→a0, S2→a0) • RL needs a cost signal

  5. Imitation • Input: expert behavior generated by π_E • Goal: learn cost function (reward) or policy • (Ng and Russell, 2000), (Abbeel and Ng, 2004; Syed and Schapire, 2007), (Ratliff et al., 2006), (Ziebart et al., 2008), (Kolter et al., 2008), (Finn et al., 2016), etc.

  6. Behavioral Cloning • (State, Action) pairs → Policy, via supervised learning (regression); a minimal sketch follows below • Small errors compound over time (cascading errors) • Decisions are purposeful (require planning)
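As a concrete illustration of behavioral cloning as supervised regression (not taken from the slides; the dataset, dimensions, and network are placeholder assumptions):

    import torch
    import torch.nn as nn

    # Hypothetical expert dataset: states and the actions the expert took in them.
    # Shapes are assumptions for illustration: 1000 demos, 4-dim states, 2-dim actions.
    states = torch.randn(1000, 4)
    actions = torch.randn(1000, 2)

    # Policy network: regress actions directly from states (behavioral cloning).
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    for epoch in range(100):
        pred = policy(states)                  # predicted actions
        loss = ((pred - actions) ** 2).mean()  # mean squared error to expert actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Small prediction errors still compound at test time: the learned policy drifts
    # into states the expert never labeled, which is the "cascading errors" issue above.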

  7. Inverse RL • An approach to imitation • Learns a cost c under which the expert policy performs better than every other policy (see the condition written out below)
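One standard way to write this condition, consistent with the later slides ("expert has small cost, everything else has high cost"); this LaTeX restatement is a reconstruction, not text from the slide:

    \[ \mathbb{E}_{\pi_E}\!\left[\sum_{t} \gamma^{t} c(s_t, a_t)\right] \;\le\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} c(s_t, a_t)\right] \quad \text{for all policies } \pi \]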

  8. Problem setup [diagram] • Forward direction: cost function c(s) plus environment (MDP) → Reinforcement Learning (RL) → optimal policy π • Inverse direction: expert's trajectories s0, s1, s2, … → Inverse Reinforcement Learning (IRL) → cost function c(s), where the expert has small cost and everything else has high cost (Ziebart et al., 2010; Rust 1987)

  9. Problem setup [same diagram] • IRL now uses a convex cost regularizer ψ on the cost function c(s) • Question: is the RL-optimal policy for the recovered cost ≈ the expert policy (similar w.r.t. ψ)?
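The ψ-regularized IRL problem referenced here can be written as in Ho and Ermon (2016); this is added for reference rather than copied from the slide, with H(π) denoting the (causal) entropy of the policy:

    \[ \mathrm{IRL}_{\psi}(\pi_E) \;=\; \arg\max_{c}\; -\psi(c) \;+\; \Big(\min_{\pi}\, -H(\pi) + \mathbb{E}_{\pi}[c(s,a)]\Big) \;-\; \mathbb{E}_{\pi_E}[c(s,a)] \]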

  10. Combining RL ∘ IRL • ρ_π = occupancy measure = distribution of state-action pairs encountered when navigating the environment with the policy • ρ_{π_E} = expert's occupancy measure • Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* (the convex conjugate of ψ)
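In symbols, the theorem from Ho and Ermon (2016) reads as follows (the slide states it only in words):

    \[ \mathrm{RL} \circ \mathrm{IRL}_{\psi}(\pi_E) \;=\; \arg\min_{\pi}\; -H(\pi) \;+\; \psi^{*}\!\big(\rho_{\pi} - \rho_{\pi_E}\big) \]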

  11. Takeaway • Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* • Typical IRL definition: finding a cost function c such that the expert policy is uniquely optimal w.r.t. c • Alternative view: IRL as a procedure that tries to induce a policy that matches the expert's occupancy measure (a generative model)

  12. Special cases • If ψ(c) = constant, then RL ∘ IRL exactly matches the expert's occupancy measure (ρ_π = ρ_{π_E}) – Not a useful algorithm: in practice, we only have sampled trajectories – Overfitting: too much flexibility in choosing the cost function (and the policy) [figure: the set of all cost functions, with ψ(c) = constant everywhere]

  13. Towards Apprenticeship learning • Solution: use features f_{s,a} • Cost: c(s,a) = θ · f_{s,a} • Only these "simple" cost functions are allowed [figure: within the set of all cost functions, ψ(c) = 0 on the costs linear in the features and ψ(c) = ∞ elsewhere]
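Equivalently, the regularizer sketched in the figure is an indicator of the linear-in-features cost class (a reconstruction in the GAIL paper's notation):

    \[ \psi(c) \;=\; \begin{cases} 0 & \text{if } c(s,a) = \theta \cdot f_{s,a} \text{ for some } \theta \\ +\infty & \text{otherwise} \end{cases} \]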

  14. Apprenticeship learning • For that choice of ψ, the RL ∘ IRL_ψ framework gives apprenticeship learning • Apprenticeship learning: find π performing better than π_E over costs linear in the features – Abbeel and Ng (2004) – Syed and Schapire (2007)

  15. Apprenticeship learning • Given: demonstrations from π_E and a class of cost functions C • Goal: find π performing better than π_E over every cost in C, where the expert's expected cost is approximated using the demonstrations (objective written out below)
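The standard apprenticeship learning objective this slide refers to (a reconstruction; the expert expectation on the right is the term estimated from sampled demonstrations):

    \[ \min_{\pi}\; \max_{c \in \mathcal{C}}\; \mathbb{E}_{\pi}\big[c(s,a)\big] \;-\; \mathbb{E}_{\pi_E}\big[c(s,a)\big] \]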

  16. Issues with Apprenticeship learning • Need to craft features very carefully – unless the true expert cost function (assuming it exists) lies in C, there is no guarantee that AL will recover the expert policy • RL ∘ IRL_ψ(π_E) is "encoding" the expert behavior as a cost function in C – it might not be possible to decode it back if C is too simple [figure: the set of all cost functions, with IRL mapping π_E to a cost and RL mapping that cost back to a policy]

  17. Generative Adversarial Imitation Learning • Solution: use a more expressive class of cost functions [figure: the set of all cost functions vs. the subset linear in the features]

  18. Generative Adversarial Imitation Learning • ψ* = optimal negative log-loss of the binary classification problem of distinguishing between state-action pairs of π and π_E [figure: a discriminator D receiving state-action pairs from policy π and from expert policy π_E] (objective written out below)
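With that choice of ψ, the GAIL objective of Ho and Ermon (2016) becomes a GAN-style saddle point (restated here for reference; λ weights an entropy term):

    \[ \min_{\pi}\; \max_{D}\; \mathbb{E}_{\pi}\big[\log D(s,a)\big] \;+\; \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big] \;-\; \lambda H(\pi) \]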

  19. Generative Adversarial Networks [figure from Goodfellow et al., 2014]

  20. GAIL [diagram] • The discriminator D (a differentiable function) tries to output 1 on samples from the model and 0 on samples from the expert • Samples from the model come from the generator G: the policy π (a differentiable function) rolled out in a black-box simulator (the environment) • Ho and Ermon, Generative Adversarial Imitation Learning

  21. How to optimize the objective • Previous apprenticeship learning work required: – a full dynamics model – a small environment – repeated RL • We propose: gradient descent over policy parameters (and the discriminator); a sketch of the alternating updates follows below • J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. ICML 2016.
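A minimal sketch of those alternating updates (illustration only; the fake rollout, dimensions, network sizes, and the use of a plain score-function policy gradient instead of TRPO are assumptions on my part):

    import torch
    import torch.nn as nn

    state_dim, action_dim = 4, 2                       # placeholder dimensions
    expert_s = torch.randn(512, state_dim)             # placeholder expert state-action pairs
    expert_a = torch.randn(512, action_dim)

    # Generator: a Gaussian policy with fixed unit variance (a simplification).
    policy_mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
    # Discriminator: classifies (s, a) pairs as "policy" (label 1) vs "expert" (label 0).
    disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    pi_opt = torch.optim.Adam(policy_mean.parameters(), lr=1e-3)
    d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    def rollout(n=512):
        """Stand-in for rolling out the policy in a black-box simulator."""
        s = torch.randn(n, state_dim)                  # fake states for the sketch
        dist = torch.distributions.Normal(policy_mean(s), 1.0)
        a = dist.sample()
        logp = dist.log_prob(a).sum(-1)
        return s, a, logp

    for it in range(1000):
        s, a, logp = rollout()

        # 1) Discriminator step: push policy samples toward 1, expert samples toward 0.
        d_policy = disc(torch.cat([s, a], dim=-1)).squeeze(-1)
        d_expert = disc(torch.cat([expert_s, expert_a], dim=-1)).squeeze(-1)
        d_loss = bce(d_policy, torch.ones_like(d_policy)) + bce(d_expert, torch.zeros_like(d_expert))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # 2) Policy step: treat log D(s, a) as a surrogate cost and take a
        #    REINFORCE-style gradient step on it (GAIL itself uses TRPO here).
        with torch.no_grad():
            cost = torch.nn.functional.logsigmoid(disc(torch.cat([s, a], dim=-1))).squeeze(-1)
        pi_loss = (logp * cost).mean()
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()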

  22. Properties • Inherits pros of policy gradient – Convergence to local minima – Can be model free • Inherits cons of policy gradient – High variance – Small steps required

  23. Properties • Inherits pros of policy gradient – Convergence to local minima – Can be model free • Inherits cons of policy gradient – High variance – Small steps required • Solution: trust region policy optimization
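For reference, the trust region policy optimization update mentioned here constrains each policy step as follows (standard TRPO form, added here rather than taken from the slide):

    \[ \max_{\theta}\; \mathbb{E}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)\big] \le \delta \]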

  24. Results

  25. Results • Input: driving demonstrations (TORCS) • Output policy: from raw visual inputs • Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

  26. Experimental results

  27. Latent structure in demonstrations [diagram: human model with latent variables Z → policy → observed behavior, in the environment] • Is there semantically meaningful latent structure?

  28. InfoGAIL [diagram: infer latent structure from observed data (Hou et al.); latent variables Z → policy → observed behavior, in the environment] • Maximize mutual information

  29. InfoGAIL • Latent code c (Z) → policy → observed behavior (s,a), in the environment • Maximize the mutual information between the latent code and the observed behavior (objective below)
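The resulting InfoGAIL objective (Li et al., 2017) augments GAIL with a variational lower bound L_I on that mutual information; restated here for reference, with Q an auxiliary posterior over the latent code and λ_1, λ_2 weighting terms:

    \[ \min_{\pi, Q}\; \max_{D}\; \mathbb{E}_{\pi}\big[\log D(s,a)\big] + \mathbb{E}_{\pi_E}\big[\log(1 - D(s,a))\big] - \lambda_{1} L_{I}(\pi, Q) - \lambda_{2} H(\pi) \]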

  30. Synthetic Experiment [figure: demonstrations vs. GAIL vs. InfoGAIL]

  31. InfoGAIL [diagram: latent variables Z → model/policy → trajectories, in the environment] • Pass right (z=1) vs. pass left (z=0) • Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

  32. InfoGAIL [diagram: latent variables Z → model/policy → trajectories, in the environment] • Turn outside (z=1) vs. turn inside (z=0) • Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

  33. Multi-agent environments • What are the goals of these 4 agents?

  34. Problem setup • Cost functions c_1(s,a_1), …, c_N(s,a_N) → Multi-Agent Reinforcement Learning (MARL) → optimal policies π_1, …, π_K • Environment: Markov game • Example 2x2 game (entries are the two agents' values for row/column actions):
             R        L
        R    0,0      10,10
        L    10,10    0,0

  35. Problem setup • Cost functions c_1(s,a_1), …, c_N(s,a_N) → Multi-Agent Reinforcement Learning (MARL) → optimal policies π ≈ ? (similar w.r.t. ψ) • Environment: Markov game • Expert's trajectories (s_0, a_0^1, …, a_0^N), (s_1, a_1^1, …, a_1^N), … → Multi-Agent Inverse Reinforcement Learning (MAIRL) → cost functions c_1(s,a_1), …, c_N(s,a_N)

  36. MAGAIL [diagram] • One discriminator per agent: D_1, …, D_N are differentiable functions, each trying to output 1 on samples from the model and 0 on samples from the expert • Samples are joint state-action tuples (s, a_1, a_2, …, a_N), drawn either from the generator G (the policies of Agent 1 … Agent N rolled out in a black-box simulator) or from the expert • Song, Ren, Sadigh, Ermon, Multi-Agent Generative Adversarial Imitation Learning

  37. Environments • Demonstrations vs. MAGAIL

  38. Environments • Demonstrations vs. MAGAIL

  39. Suboptimal demos • MAGAIL vs. expert • Lighter plank + bumps on ground

  40. Conclusions • IRL is a dual of an occupancy measure matching problem (generative modeling) • Might need flexible cost functions – GAN-style approach • Policy gradient approach – Scales to high-dimensional settings • Towards unsupervised learning of latent structure from demonstrations
