Generative Adversarial Imitation Learning - PowerPoint PPT Presentation



SLIDE 1

Generative Adversarial Imitation Learning

Stefano Ermon

Joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li, Hongyu Ren, and Jiaming Song

SLIDE 2

Reinforcement Learning

  • Goal: Learn policies
  • High-dimensional, raw observations → action

SLIDE 3

[Figure: example MDP with immediate costs +5 and +1]

Reinforcement Learning

  • MDP: Model for (stochastic) sequential decision making problems
  • States S
  • Actions A
  • Cost function (immediate): C: S×A → R
  • Transition probabilities: P(s'|s,a)
  • Policy: mapping from states to actions
    – E.g., (S0->a1, S1->a0, S2->a0)
  • Reinforcement learning: minimize total (expected, discounted) cost (a numeric sketch follows)

$\min_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} c(s_t)\Big]$
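Below is a minimal numeric sketch of this objective (not from the slides; the MDP is randomly generated, purely for illustration): evaluating a fixed policy's total expected discounted cost exactly in a tiny tabular MDP.

    # Minimal sketch: evaluate a fixed policy's expected discounted cost
    # in a tiny, randomly generated tabular MDP (illustrative only).
    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)

    C = rng.uniform(0, 5, size=(n_states, n_actions))                 # c(s,a)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s,a)
    pi = np.array([1, 0, 0])   # deterministic policy (S0->a1, S1->a0, S2->a0)

    # J(pi) = E[sum_t gamma^t c(s_t, pi(s_t))], solved exactly via the
    # linear system v = c_pi + gamma * P_pi v.
    c_pi = C[np.arange(n_states), pi]
    P_pi = P[np.arange(n_states), pi]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    print("Expected discounted cost from each start state:", v)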

SLIDE 4

Reinforcement Learning

[Diagram]
Cost Function c(s,a) → Reinforcement Learning (RL) → Optimal policy π

Environment (MDP):
  • States S
  • Actions A
  • Transitions: P(s'|s,a)
  • Cost C: S×A → R

Policy: mapping from states to actions, e.g., (S0->a1, S1->a0, S2->a0)

RL needs a cost signal.

SLIDE 5

Imitation

Input: expert behavior generated by πE
Goal: learn cost function (reward) or policy

(Ng and Russell, 2000), (Abbeel and Ng, 2004; Syed and Schapire, 2007), (Ratliff et al., 2006), (Ziebart et al., 2008), (Kolter et al., 2008), (Finn et al., 2016), etc.

SLIDE 6

Behavioral Cloning

  • Small errors compound over time (cascading errors)
  • Decisions are purposeful (require planning)

(State, Action), (State, Action), …, (State, Action) → Policy, via supervised learning (regression); see the sketch below
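A minimal sketch of behavioral cloning as the supervised regression named above; the expert data is synthetic and the linear "expert rule" is an assumption for illustration:

    # Behavioral cloning: fit a policy to (state, action) pairs by plain
    # supervised learning. Synthetic stand-in for expert demonstrations.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    states = rng.normal(size=(1000, 4))                        # expert states
    actions = states @ np.array([1.0, -0.5, 0.2, 0.0]) > 0     # expert actions

    policy = LogisticRegression().fit(states, actions)
    print("training accuracy:", policy.score(states, actions))

High training accuracy does not prevent cascading errors: at test time, small mistakes drive the learner into states absent from the demonstrations, where its predictions are unreliable.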

SLIDE 7

Inverse RL

  • An approach to imitation
  • Learns a cost c such that the expert policy is optimal w.r.t. c
SLIDE 8

Problem setup

[Diagram]
RL: Cost Function c(s) → Reinforcement Learning (RL) + Environment (MDP) → Optimal policy π
IRL: Expert's Trajectories s0, s1, s2, … → Inverse Reinforcement Learning (IRL) → Cost Function c(s)

Expert has small cost; everything else has high cost (Ziebart et al., 2010; Rust, 1987)

SLIDE 9

Problem setup

[Diagram]
RL: Cost Function c(s) → Reinforcement Learning (RL) + Environment (MDP) → Optimal policy π
IRL: Expert's Trajectories s0, s1, s2, … → Inverse Reinforcement Learning (IRL) → Cost Function c(s)?

ψ: convex cost regularizer. The recovered policy is ≈ the expert (similar w.r.t. ψ); the definition is spelled out below.
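Written out, the ψ-regularized IRL problem on this slide (notation as in Ho and Ermon, 2016, with costs over state-action pairs) is

    \mathrm{IRL}_\psi(\pi_E) = \arg\max_{c}\; -\psi(c)
      + \Big( \min_{\pi}\, -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big)
      - \mathbb{E}_{\pi_E}[c(s,a)]

where H(π) is the policy's causal entropy: the expert should incur small cost, everything else high cost, and ψ penalizes complex cost functions.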

SLIDE 10

Combining RL ∘ IRL

[Diagram]
Expert's Trajectories s0, s1, s2, … → ψ-regularized Inverse Reinforcement Learning (IRL) → Reinforcement Learning (RL) → Optimal policy π, ≈ the expert (similar w.r.t. ψ)

ρπ = occupancy measure = distribution of state-action pairs encountered when navigating the environment with the policy

ρπE = expert's occupancy measure

Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* (the convex conjugate of ψ); spelled out below.
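In symbols (following Ho and Ermon, 2016), the occupancy measure and the theorem read

    \rho_\pi(s,a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t \, P(s_t = s \mid \pi)

    \mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E)
      = \arg\min_{\pi}\; -H(\pi) + \psi^{*}\!\left(\rho_\pi - \rho_{\pi_E}\right)

so the composed procedure minimizes a ψ*-measured divergence between the two occupancy measures, regularized by the causal entropy H(π).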

SLIDE 11

Takeaway

Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ*.

  • Typical IRL definition: finding a cost function c such that the expert policy is uniquely optimal w.r.t. c
  • Alternative view: IRL as a procedure that tries to induce a policy that matches the expert's occupancy measure (a generative model)

SLIDE 12

Special cases

  • If ψ(c) = constant, then the recovered policy's occupancy measure matches the expert's exactly
    – Not a useful algorithm: in practice, we only have sampled trajectories
  • Overfitting: too much flexibility in choosing the cost function (and the policy)

[Figure: the set of all cost functions, with ψ(c) = constant over all of it]

SLIDE 13

Towards Apprenticeship learning

  • Solution: use features f(s,a)
  • Cost c(s,a) = θ · f(s,a)

[Figure: within the set of all cost functions, only these "simple" cost functions, linear in the features, are allowed: ψ(c) = 0 on that set, ψ(c) = ∞ elsewhere]

SLIDE 14

Apprenticeship learning

  • For that choice of ψ, the RL ∘ IRLψ framework gives apprenticeship learning
  • Apprenticeship learning: find π performing better than πE over costs linear in the features
    – Abbeel and Ng (2004)
    – Syed and Schapire (2007)

SLIDE 15

Apprenticeship learning

  • Given: expert demonstrations from πE
  • Goal: find π performing better than πE over a class of costs C (objective spelled out below)

(The expert's expectations are approximated using the demonstrations.)
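Spelled out, the apprenticeship-learning objective referenced here (as in Abbeel and Ng, 2004; Syed and Schapire, 2007) is

    \min_{\pi} \, \max_{c \,\in\, \mathcal{C}}\;
      \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)]

where the expert term is approximated by an empirical average over the demonstrated trajectories.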

SLIDE 16

Issues with Apprenticeship learning

  • Need to craft features very carefully
    – Unless the true expert cost function (assuming it exists) lies in C, there is no guarantee that AL will recover the expert policy
  • RL ∘ IRLψ(πE) is "encoding" the expert behavior as a cost function in C
    – It might not be possible to decode it back if C is too simple

[Figure: within the set of all cost functions, IRL maps πE to a cost, and RL maps that cost back to a policy πR that may differ from πE]

SLIDE 17

Generative Adversarial Imitation Learning

  • Solution: use a more expressive class of cost functions

[Figure: the set of all cost functions, strictly larger than those linear in features]

SLIDE 18

Generative Adversarial Imitation Learning

  • ψ* = optimal negative log-loss of the binary classification problem of distinguishing between state-action pairs of π and πE (saddle-point objective below)

[Diagram: discriminator D between Policy π and Expert Policy πE]
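With that choice, ψ* is the optimal discriminator log-loss, and imitation becomes the saddle-point problem from Ho and Ermon (2016):

    \min_{\pi} \, \max_{D}\;
      \mathbb{E}_{\pi}[\log D(s,a)]
      + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))]
      - \lambda H(\pi)

where D(s,a) ∈ (0,1) is the binary classifier and H(π) the causal entropy regularizer.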

SLIDE 19

Generative Adversarial Networks

Figure from Goodfellow et al., 2014

SLIDE 20

GAIL

[Diagram: GAIL]
  • Sample from expert → differentiable function D; D tries to output 0
  • Sample from model → differentiable function D; D tries to output 1
  • Generator G = differentiable function P (the policy) + black-box simulator (environment); a runnable sketch of one update follows

Ho and Ermon, Generative Adversarial Imitation Learning
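A sketch of one alternating update behind this diagram (synthetic tensors stand in for expert data and rollouts; this is not the authors' code, and the policy update itself is left as a placeholder comment):

    # One GAIL-style round: train D to separate expert from model samples,
    # then use log D(s,a) as the policy's cost signal.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    dim_sa = 6  # concatenated (state, action) size, illustrative
    disc = nn.Sequential(nn.Linear(dim_sa, 64), nn.Tanh(), nn.Linear(64, 1))
    disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

    expert_sa = torch.randn(128, dim_sa)   # stand-in for expert (s,a) pairs
    model_sa = torch.randn(128, dim_sa)    # stand-in for policy rollouts

    # Discriminator step: push expert pairs toward 0, model pairs toward 1.
    disc_loss = (
        F.binary_cross_entropy_with_logits(disc(expert_sa), torch.zeros(128, 1))
        + F.binary_cross_entropy_with_logits(disc(model_sa), torch.ones(128, 1))
    )
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # Policy step (placeholder): use c(s,a) = log D(s,a) as the cost in a
    # policy-gradient / TRPO update of the generator.
    cost = torch.log(torch.sigmoid(disc(model_sa))).detach()
    print("mean surrogate cost:", cost.mean().item())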

SLIDE 21

How to optimize the objective

  • Previous apprenticeship learning work:
    – Full dynamics model
    – Small environment
    – Repeated RL
  • We propose: gradient descent over policy parameters (and discriminator)

J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. ICML 2016.

SLIDE 22

Properties

  • Inherits pros of policy gradient
    – Convergence to local minima
    – Can be model-free
  • Inherits cons of policy gradient
    – High variance
    – Small steps required

SLIDE 23

Properties

  • Inherits pros of policy gradient
    – Convergence to local minima
    – Can be model-free
  • Inherits cons of policy gradient
    – High variance
    – Small steps required
  • Solution: trust region policy optimization (constraint below)
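The trust-region step referenced here constrains each policy update (Schulman et al., 2015): maximize a surrogate objective subject to a KL budget,

    \max_{\theta}\;
      \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A(s,a)\right]
    \quad \text{s.t.} \quad
      \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) \right] \le \delta

where A(s,a) is an advantage estimate; bounding the step tames the high variance of plain policy gradients.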

SLIDE 24

Results

SLIDE 25

Results

Input: driving demonstrations (TORCS)
Output policy: from raw visual inputs

Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

SLIDE 26

Experimental results

SLIDE 27

Latent structure in demonstrations

[Diagram: human model]
Latent variables Z → Policy; Environment + Policy → Observed Behavior

Semantically meaningful latent structure?

SLIDE 28

InfoGAIL

[Diagram]
Latent variables Z → Policy; Environment + Policy → Observed Behavior
Observed data → infer latent structure

Maximize mutual information (Hou et al.)

SLIDE 29

InfoGAIL

[Diagram]
Latent code Z → Policy; Environment + Policy → Observed Behavior

Maximize mutual information between the latent code and (s,a); a short sketch follows
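A minimal sketch of the mutual-information bonus (shapes and tensors are synthetic assumptions, not the InfoGAIL code): a learned posterior Q(z|s,a) gives the variational lower bound E[log Q(z|s,a)] + H(z), maximized alongside the GAIL objective.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_codes, dim_sa = 2, 6   # e.g. two driving modes, z = 0 or z = 1
    Q = nn.Sequential(nn.Linear(dim_sa, 32), nn.Tanh(), nn.Linear(32, n_codes))

    z = torch.randint(n_codes, (128,))   # latent code sampled per rollout
    sa = torch.randn(128, dim_sa)        # (s,a) pairs generated by pi(.|s,z)

    # Variational lower bound on I(z; (s,a)), up to the constant H(z):
    # maximizing it makes behavior predictive of its latent code.
    mi_bonus = -F.cross_entropy(Q(sa), z)
    print("MI lower bound (up to H(z)):", mi_bonus.item())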

SLIDE 30

Synthetic Experiment

[Figure panels: Demonstrations | GAIL | InfoGAIL]

SLIDE 31

InfoGAIL

[Diagram]
Latent variables Z → Policy model; Environment + Policy → Trajectories
Pass left (z=0), pass right (z=1)

Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

SLIDE 32

InfoGAIL

[Diagram]
Latent variables Z → Policy model; Environment + Policy → Trajectories
Turn inside (z=0), turn outside (z=1)

Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

SLIDE 33

Multi-agent environments

What are the goals of these 4 agents?

SLIDE 34

Problem setup

[Diagram]
Cost Functions c1(s,a1), …, cN(s,aN) → MA Reinforcement Learning (MARL) + Environment (Markov Game) → Optimal policies π1, …, πN

Example two-player game (payoffs):
        R       L
R      0,0    10,10
L     10,10    0,0

SLIDE 35

Problem setup

[Diagram]
MARL: Cost Functions c1(s,a1), …, cN(s,aN) → MA Reinforcement Learning (MARL) + Environment (Markov Game) → Optimal policies π1, …, πN
MAIRL: Expert's Trajectories (s0, a0^1, …, a0^N), (s1, a1^1, …, a1^N), … → Multi-Agent Inverse Reinforcement Learning (MAIRL) → Cost Functions c1(s,a1), …, cN(s,aN)

≈ (similar w.r.t. ψ)

SLIDE 36

MAGAIL

[Diagram: MAGAIL]
  • Sample from expert (s, a1, a2, …, aN) → differentiable functions D1, …, DN; each Di tries to output 0
  • Sample from model (s, a1, a2, …, aN) → each Di tries to output 1
  • Generator G = policy agents 1, …, N + black-box simulator; a short sketch follows

Song, Ren, Sadigh, Ermon, Multi-Agent Generative Adversarial Imitation Learning
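This is the same adversarial game as single-agent GAIL, but with one discriminator per agent; a minimal sketch with synthetic joint state-action samples (sizes are illustrative assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_agents, dim = 4, 8                        # joint (s, a1, ..., aN) size
    discs = [nn.Linear(dim, 1) for _ in range(n_agents)]   # D1 ... DN

    expert = torch.randn(64, dim)   # stand-in for expert joint samples
    model = torch.randn(64, dim)    # stand-in for the policies' rollouts

    # Each Di: expert -> 0, model -> 1, exactly as in the diagram above.
    losses = [
        F.binary_cross_entropy_with_logits(D(expert), torch.zeros(64, 1))
        + F.binary_cross_entropy_with_logits(D(model), torch.ones(64, 1))
        for D in discs
    ]
    print("per-agent discriminator losses:", [float(l) for l in losses])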
SLIDE 37

Environments

[Figure panels: Demonstrations | MAGAIL]

SLIDE 38

Environments

[Figure panels: Demonstrations | MAGAIL]

SLIDE 39

Suboptimal demos

[Figure panels: Expert | MAGAIL, with a lighter plank and bumps on the ground]

SLIDE 40

Conclusions

  • IRL is a dual of an occupancy measure matching problem (generative modeling)
  • Might need flexible cost functions
    – GAN-style approach
  • Policy gradient approach
    – Scales to high-dimensional settings
  • Towards unsupervised learning of latent structure from demonstrations