SLIDE 1

Maximum Entropy Inverse RL, Adversarial imitation learning

Deep Reinforcement Learning and Control Katerina Fragkiadaki

Carnegie Mellon School of Computer Science

SLIDE 2

Reinforcement Learning

Dynamics model T: a probability distribution over next states given the current state and action.
Reward function R: describes the desirability of being in a state.
Reinforcement Learning / Optimal Control: given T and R, compute the controller/policy π.
Policy π: prescribes the action to take in each state.

Diagram: Pieter Abbeel

SLIDE 3

Inverse Reinforcement Learning

IRL reverses the diagram: given a finite set of demonstration trajectories (from an expert policy π∗), let's recover the reward R and the policy π.

Dynamics model T: a probability distribution over next states given the current state and action.
Reward function R: describes the desirability of being in a state.
Reinforcement Learning / Optimal Control: given T and R, compute the controller/policy π.
Policy π: prescribes the action to take in each state.

Diagram: Pieter Abbeel

SLIDE 4

Inverse Reinforcement Learning

IRL reverses the diagram: given a finite set of demonstration trajectories (from an expert policy π∗), let's recover the reward R and the policy π. In contrast to the DAGGER setup, we cannot interactively query the expert for additional labels.

Dynamics model T: a probability distribution over next states given the current state and action.
Reward function R: describes the desirability of being in a state.
Reinforcement Learning / Optimal Control: given T and R, compute the controller/policy π.
Policy π: prescribes the action to take in each state.

Diagram: Pieter Abbeel

SLIDE 5

Inverse Reinforcement Learning

Mathematically, imitation boils down to a distribution matching problem: the learner needs to come up with a reward/policy whose resulting state-action trajectory distribution matches the expert trajectory distribution.

Dynamics model T: a probability distribution over next states given the current state and action.
Reward function R: describes the desirability of being in a state.
Reinforcement Learning / Optimal Control: given T and R, compute the controller/policy π.
Policy π: prescribes the action to take in each state.
SLIDE 6

A simple example

  • Roads have unknown costs, linear in features
  • Paths (trajectories) have unknown costs: the sum of their road (state) costs
  • Experts (taxi drivers) demonstrate Pittsburgh traveling behavior
  • How can we learn to navigate Pittsburgh like a taxi (or Uber) driver?
  • Assumption: the cost is independent of the goal state, so it only depends on road features, e.g., traffic, width, tolls, etc.

SLIDE 7

State features

Features f can be, e.g.:
  • # Bridges crossed
  • # Miles of interstate
  • # Stoplights

SLIDE 8

A good guess: Match expected features

Features f can be, e.g.: # bridges crossed, # miles of interstate, # stoplights.

Feature matching:

$$\sum_{\text{Path } \tau_i} P(\tau_i) \, f_{\tau_i} = \tilde{f}$$

"If a driver uses 136.3 miles of interstate and crosses 12 bridges in a month's worth of trips, the model should also use 136.3 miles of interstate and 12 bridges in expectation for those same start-destination pairs."
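To make the constraint concrete, here is a minimal numpy sketch that checks feature matching on a handful of made-up paths (the paths, features, and probabilities are purely illustrative):

```python
import numpy as np

# Hypothetical toy example: 4 candidate paths,
# features = [bridges crossed, interstate miles, stoplights]
path_features = np.array([
    [2, 10.0, 5],
    [0,  0.0, 12],
    [1,  4.5, 8],
    [3, 15.0, 2],
])

# Empirical (demonstrated) feature counts, averaged over the demo trips
demo_feature_counts = np.array([1.5, 7.0, 7.0])

# A candidate distribution over the paths induced by some policy
P = np.array([0.4, 0.1, 0.3, 0.2])

expected_features = P @ path_features   # sum_i P(tau_i) f_{tau_i}
print("model expectation:", expected_features)
print("constraint satisfied:",
      np.allclose(expected_features, demo_feature_counts, atol=1e-6))
```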

SLIDE 9

A good guess: Match expected features

Features f can be, e.g.: # bridges crossed, # miles of interstate, # stoplights.

Feature matching:

$$\sum_{\text{Path } \tau_i} P(\tau_i) \, f_{\tau_i} = \tilde{f}$$

where $\tilde{f}$ are the demonstrated feature counts.

SLIDE 10

A good guess: Match expected features

Feature matching:

$$\sum_{\text{Path } \tau_i} P(\tau_i) \, f_{\tau_i} = \tilde{f}$$

where $\tilde{f}$ are the demonstrated feature counts, and

$$p(\tau) = p(s_1) \prod_t p(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t)$$

i.e., a policy induces a distribution over trajectories.
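A small sketch of how a tabular policy and known dynamics induce a trajectory distribution, by sampling from the factorization above (the MDP below is a made-up toy, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP (all numbers made up for illustration)
n_states, n_actions, horizon = 3, 2, 5
p_s1 = np.array([1.0, 0.0, 0.0])                                   # p(s1)
policy = rng.dirichlet(np.ones(n_actions), size=n_states)          # p(a|s)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)

def sample_trajectory():
    """Sample tau ~ p(s1) * prod_t p(a_t|s_t) P(s_{t+1}|s_t, a_t)."""
    s = rng.choice(n_states, p=p_s1)
    traj = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=T[s, a])
        traj.append((s, a))
        s = s_next
    return traj

print(sample_trajectory())
```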

SLIDE 11

Ambiguity

Feature matching:

$$\sum_{\text{Path } \tau_i} P(\tau_i) \, f_{\tau_i} = \tilde{f} \quad \text{(demonstrated feature counts)}$$

$$p(\tau) = p(s_1) \prod_t p(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t)$$

A policy induces a distribution over trajectories. However, many distributions over paths can match the feature counts, and some will be very different from the observed behavior. The model could produce a policy that avoids the interstate and bridges for all routes except one, which drives in circles on the interstate for 136 miles and crosses 12 bridges.

SLIDE 12

Principle of Maximum Entropy

The probability distribution which best represents the current state of knowledge is the one with the largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information). Another way of stating this: take precisely stated prior data or testable information about a probability distribution function, and consider the set of all trial probability distributions that would encode the prior data. The distribution with maximal information entropy is the best choice.

  • Maximizing entropy minimizes the amount of prior information built into the distribution
  • Many physical systems tend to move towards maximal-entropy configurations over time

SLIDE 13

Resolve Ambiguity by Maximum Entropy

Feature matching constraint:

$$\sum_{\text{Path } \tau_i} P(\tau_i) \, f_{\tau_i} = \tilde{f} \quad \text{(demonstrated feature counts)}$$

$$p(\tau) = p(s_1) \prod_t p(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t)$$

A policy induces a distribution over trajectories. Let's pick the policy that satisfies the feature count constraints without over-committing:

$$\max_P \; -\sum_{\tau} P(\tau) \log P(\tau)$$

SLIDE 14

Maximum Entropy Inverse Optimal Control

Maximize the entropy over paths (be as uniform as possible):

$$\max_P \; -\sum_{\tau} P(\tau) \log P(\tau)$$

while matching feature counts (and being a probability distribution):

$$\sum_{\tau} P(\tau) \, f_\tau = f_{\text{dem}}, \qquad \sum_{\tau} P(\tau) = 1$$

SLIDE 15

From features to costs

Cost of a trajectory (linear in state features):

$$c_\theta(\tau) = \theta^\top f_\tau = \sum_{s \in \tau} \theta^\top f_s$$

Constraint: match the cost of the expert trajectories in expectation:

$$\int p(\tau) \, c_\theta(\tau) \, d\tau = \frac{1}{|D|} \sum_{\tau^* \in D} c_\theta(\tau^*)$$

Maximum entropy problem:

$$\min_p \; -H(p(\tau)) \quad \text{s.t.} \quad \int p(\tau) \, c_\theta(\tau) \, d\tau = \tilde{c}, \quad \int p(\tau) \, d\tau = 1$$

SLIDE 16

From maximum entropy to the exponential family

$$\min_p \; -H(p(\tau)) \quad \text{s.t.} \quad \int p(\tau) \, c_\theta(\tau) \, d\tau = \tilde{c}, \quad \int p(\tau) \, d\tau = 1$$

Lagrangian:

$$L(p, \lambda) = \int p(\tau) \log p(\tau) \, d\tau + \lambda_1 \left( \int p(\tau) \, c_\theta(\tau) \, d\tau - \tilde{c} \right) + \lambda_0 \left( \int p(\tau) \, d\tau - 1 \right)$$

$$\frac{\partial L}{\partial p} = \log p(\tau) + 1 + \lambda_1 c_\theta(\tau) + \lambda_0 = 0 \;\;\Longleftrightarrow\;\; \log p(\tau) = -1 - \lambda_1 c_\theta(\tau) - \lambda_0$$

$$p(\tau) = e^{-1 - \lambda_0 - \lambda_1 c_\theta(\tau)} \;\;\Rightarrow\;\; p(\tau) \propto e^{-c_\theta(\tau)}$$

SLIDE 17

From maximum entropy to exponential family

  • Strong preference for low-cost paths
  • Equal-cost paths are equally probable
  • Maximizing the entropy of the distribution over paths subject to the feature constraints from observed data implies that we maximize the likelihood of the observed data under the maximum entropy (exponential family) distribution (Jaynes 1957)

$$P(\tau_i \mid \theta) = \frac{1}{Z(\theta)} e^{\theta^\top f_{\tau_i}} = \frac{1}{Z(\theta)} e^{\sum_{s_j \in \tau_i} \theta^\top f_{s_j}}$$

$$Z(\theta, s) = \sum_{\tau_s} e^{\theta^\top f_{\tau_s}} \quad \text{(partition function over paths starting from state } s\text{)}$$
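A tiny numeric illustration of these two properties, using the cost convention $p(\tau) \propto \exp(-c_\theta(\tau))$ adopted in the following slides (the path costs are made up):

```python
import numpy as np

# Hypothetical costs for five paths (made up for illustration)
costs = np.array([1.0, 1.0, 2.0, 3.0, 5.0])

# Maximum-entropy / exponential-family distribution over paths: p(tau) ∝ exp(-c(tau))
p = np.exp(-costs)
p /= p.sum()

print(p)
# The two equal-cost paths get equal probability; lower-cost paths are strongly
# preferred, but no path gets probability exactly zero.
```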

SLIDE 18

Maximum Likelihood

$$\max_\theta \; \log \prod_{\tau^* \in D} p(\tau^*) \;\;\Longleftrightarrow\;\; \max_\theta \; \sum_{\tau^* \in D} \log p(\tau^*)$$

With $p(\tau) = e^{-c_\theta(\tau)} / Z$:

$$\max_\theta \; \sum_{\tau^* \in D} \log \frac{e^{-c_\theta(\tau^*)}}{Z} \;=\; \max_\theta \; \sum_{\tau^* \in D} -c_\theta(\tau^*) \;-\; |D| \log \sum_{\tau} e^{-c_\theta(\tau)}$$

Equivalently, minimize:

$$J(\theta) = \sum_{\tau^* \in D} c_\theta(\tau^*) + |D| \log \sum_{\tau} e^{-c_\theta(\tau)}$$

$$\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \frac{d c_\theta(\tau^*)}{d\theta} + |D| \, \frac{1}{\sum_{\tau} e^{-c_\theta(\tau)}} \sum_{\tau} e^{-c_\theta(\tau)} \left( -\frac{d c_\theta(\tau)}{d\theta} \right) = \sum_{\tau^* \in D} \frac{d c_\theta(\tau^*)}{d\theta} - |D| \sum_{\tau} p(\tau \mid \theta) \, \frac{d c_\theta(\tau)}{d\theta}$$
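A minimal numpy sketch of this objective and gradient for linear costs $c_\theta(\tau) = \theta^\top f_\tau$ on a small enumerable set of paths (the features and demonstrations are made up for illustration):

```python
import numpy as np

# Toy setup: 5 enumerable paths with feature vectors f_tau (illustrative values)
F = np.array([[2., 1.], [0., 3.], [1., 1.], [3., 0.], [1., 2.]])
demo_idx = np.array([0, 0, 2, 4])            # indices of demonstrated paths, |D| = 4

theta = np.zeros(2)
for _ in range(2000):
    c = F @ theta                            # linear path costs c_theta(tau)
    p = np.exp(-c); p /= p.sum()             # p(tau | theta) = exp(-c) / Z
    # grad J = sum_demo dc/dtheta - |D| * E_p[dc/dtheta] = sum_demo f - |D| * E_p[f]
    grad = F[demo_idx].sum(0) - len(demo_idx) * (p @ F)
    theta -= 0.05 * grad                     # gradient descent on J(theta)

# At the optimum the model's expected features match the demo average
p = np.exp(-F @ theta); p /= p.sum()
print("demo feature mean :", F[demo_idx].mean(0))
print("model expectation :", p @ F)
```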

SLIDE 19

From trajectories to states

Successful imitation boils down to learning a policy that matches the state visitation distribution (or state-action visitation distribution).

Since $p(\tau) = p(s_1) \prod_t p(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)$ and $p(\tau) \propto e^{-c_\theta(\tau)} = e^{-\sum_{s \in \tau} c_\theta(s)}$ when the trajectory cost decomposes over states,

$$c_\theta(\tau) = \sum_{s \in \tau} c_\theta(s),$$

the model-expectation term of the gradient can be rewritten over state (or state-action) visitations, $\sum_{s,a} p(s, a \mid \theta, T) \frac{d c_\theta(s,a)}{d\theta}$, giving

$$\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \sum_{s \in \tau^*} \frac{d c_\theta(s)}{d\theta} - |D| \sum_{s} p(s \mid \theta, T) \, \frac{d c_\theta(s)}{d\theta}$$

SLIDE 20

State densities

In the tabular case, and for known dynamics, we can compute the state densities with dynamic programming, assuming we have already obtained the policy:

$$\mu_1(s) = p(s_1 = s)$$

$$\mu_{t+1}(s) = \sum_a \sum_{s'} \mu_t(s') \, p(a \mid s') \, p(s \mid s', a), \qquad t = 1, \dots, T$$

where $\mu_t$ are the time-indexed state densities, and

$$p(s \mid \theta, T) = \sum_t \mu_t(s)$$

$$\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \sum_{s_t \in \tau^*} \frac{d c_\theta(s)}{d\theta} - |D| \sum_{s} p(s \mid \theta, T) \, \frac{d c_\theta(s)}{d\theta}$$

For linear costs $c_\theta(s) = \theta^\top f_s$:

$$\nabla_\theta J(\theta) = \sum_{\tau^* \in D} \sum_{s \in \tau^*} f_s - |D| \sum_{s} p(s \mid \theta, T) \, f_s$$
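A minimal numpy sketch of this dynamic program for a made-up tabular MDP (all sizes and distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP (numbers made up for illustration)
n_states, n_actions, horizon = 4, 2, 10
p_s1 = np.array([1.0, 0.0, 0.0, 0.0])                              # initial state distribution
policy = rng.dirichlet(np.ones(n_actions), size=n_states)          # p(a|s)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s,a)

# Dynamic programming for the time-indexed state densities mu_t(s)
mu = np.zeros((horizon, n_states))
mu[0] = p_s1
for t in range(horizon - 1):
    for s_next in range(n_states):
        # mu_{t+1}(s) = sum_{s',a} mu_t(s') p(a|s') p(s|s',a)
        mu[t + 1, s_next] = np.sum(mu[t][:, None] * policy * T[:, :, s_next])

state_visitation = mu.sum(0)    # p(s | theta, T) = sum_t mu_t(s)
print(state_visitation)
```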

SLIDE 21

Maximum entropy Inverse RL

Known dynamics, linear costs

SLIDE 22

Maximum entropy Inverse RL

Demonstrated behavior:
  • Bridges crossed: 3
  • Miles of interstate: 20.7
  • Stoplights: 10

Model behavior (expectation):
  • Bridges crossed: ? (cost weight: 5.0)
  • Miles of interstate: ? (cost weight: 3.0)
  • Stoplights: ?

SLIDE 23

Maximum entropy Inverse RL

Demonstrated behavior:
  • Bridges crossed: 3
  • Miles of interstate: 20.7
  • Stoplights: 10

Model behavior (expectation):
  • Bridges crossed: 4.7 (+1.7) (cost weight: 5.0)
  • Miles of interstate: 16.2 (-4.5) (cost weight: 3.0)
  • Stoplights: 7.4 (-2.6)

SLIDE 24

Maximum entropy Inverse RL

Demonstrated behavior:
  • Bridges crossed: 3
  • Miles of interstate: 20.7
  • Stoplights: 10

Model behavior (expectation), with updated cost weights:
  • Bridges crossed: 4.7 (cost weight: 5.0 -> 7.2)
  • Miles of interstate: 16.2 (cost weight: 1.1)
  • Stoplights: 7.4

SLIDE 25

Limitations of MaxEntIRL

  • The cost was assumed linear over features f
  • The dynamics T were assumed known

Next:

  • General function approximations for the cost: Finn et al. 2016
  • Unknown dynamics -> sample-based approximations for the partition function Z: Boularias et al. 2011, Kalakrishnan et al. 2013, Finn et al. 2016

SLIDE 26

MaxEnt IRL general cost function

$$\max_\theta \; \sum_{\tau \in D} \log p_{c_\theta}(\tau)$$

$$p(\tau) = \frac{1}{Z} \exp(-C_\theta(\tau)), \qquad Z = \int \exp(-C_\theta(\tau)) \, d\tau$$

$$C_\theta(\tau) = \sum_t c_\theta(x_t, u_t)$$

The cost of a trajectory decomposes over the costs of individual states.
SLIDE 27

MaxEnt IRL general cost function

$$\max_\theta \; \sum_{\tau \in D} \log p_{c_\theta}(\tau)$$

$$p(\tau) = \frac{1}{Z} \exp(-C_\theta(\tau)), \qquad Z = \int \exp(-C_\theta(\tau)) \, d\tau, \qquad C_\theta(\tau) = \sum_t c_\theta(x_t, u_t)$$

The cost of a trajectory decomposes over the costs of individual states.

Before: $c_\theta(x_t, u_t) = \theta^\top f(x_t, u_t)$

SLIDE 28

MaxEnt IRL general cost function

$$\max_\theta \; \sum_{\tau \in D} \log p_{c_\theta}(\tau), \qquad p(\tau) = \frac{1}{Z} \exp(-C_\theta(\tau)), \qquad Z = \int \exp(-C_\theta(\tau)) \, d\tau$$

$$C_\theta(\tau) = \sum_t c_\theta(x_t, u_t)$$

The cost of a trajectory decomposes over the costs of individual states.

In the form of a loss function. Before: $c_\theta(x_t, u_t) = \theta^\top f(x_t, u_t)$; now, $c_\theta$ can be a general function approximator such as a neural network (see the sketch below).
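A minimal sketch of what a general (nonlinear) cost can look like: a small two-layer network over state-action features. The architecture and sizes are illustrative assumptions, not the network used by Finn et al.:

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, hidden = 6, 16
W1 = rng.normal(scale=0.1, size=(hidden, feat_dim))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)

def cost(x_u):
    """Nonlinear per-step cost c_theta(x, u) on a feature vector of (x, u)."""
    h = np.tanh(W1 @ x_u + b1)
    return w2 @ h

def trajectory_cost(traj_feats):
    """C_theta(tau) = sum_t c_theta(x_t, u_t)."""
    return sum(cost(f) for f in traj_feats)

traj = [rng.normal(size=feat_dim) for _ in range(5)]   # made-up trajectory features
print(trajectory_cost(traj))
```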

SLIDE 29

Approximating Z with Importance Sampling

$$Z = \int \exp(-C_\theta(\tau)) \, d\tau$$
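A minimal sketch of the importance-sampling estimator $Z \approx \frac{1}{N}\sum_j \exp(-C_\theta(\tau_j)) / q(\tau_j)$ with $\tau_j \sim q$, shown on a toy one-dimensional "trajectory" space where the true Z is known (both the cost and the proposal are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def C(tau):                      # toy trajectory cost (illustrative)
    return 0.5 * tau**2

# Proposal distribution q: a Gaussian we can both sample from and evaluate
def sample_q(n):  return rng.normal(0.0, 2.0, size=n)
def q_pdf(tau):   return np.exp(-tau**2 / (2 * 2.0**2)) / np.sqrt(2 * np.pi * 2.0**2)

taus = sample_q(100_000)
Z_hat = np.mean(np.exp(-C(taus)) / q_pdf(taus))   # importance-sampling estimate of Z

print(Z_hat, np.sqrt(2 * np.pi))  # true Z = integral of exp(-tau^2 / 2) = sqrt(2*pi)
```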

SLIDE 30

MaxEntIOC with Importance Sampling

SLIDE 31

MaxEntIOC with Importance Sampling

SLIDE 32

MaxEntIOC with Importance Sampling

SLIDE 33

Adapting the sampling distribution q

What should the sampling distribution q be?

  • Uniform: Boularias et al. 2011
  • In the vicinity of the demonstrations: Kalakrishnan et al. 2013
  • Refine it over time, Finn et al. 2016: interleave IRL with policy optimization, then sample trajectories according to the policy -> better trajectories (with much higher likelihood) guided by your current estimate of the cost (see the sketch below)
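A heavily simplified sketch of this interleaving, in a one-step setting where "trajectories" are single discrete states with known features. Everything below is illustrative and is not Finn et al.'s full guided cost learning procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step toy problem: "trajectories" are single discrete states with known features
n_states, n_feat = 6, 3
F = rng.normal(size=(n_states, n_feat))                 # feature vector per state
theta_true = np.array([1.5, -2.0, 0.5])                 # hidden "true" cost weights
p_expert = np.exp(-F @ theta_true); p_expert /= p_expert.sum()
demos = rng.choice(n_states, size=500, p=p_expert)      # expert demonstrations

theta = np.zeros(n_feat)                                # cost parameters to learn
q = np.full(n_states, 1.0 / n_states)                   # sampling distribution, starts uniform

for it in range(300):
    # 1) Generate samples from the current proposal q
    samples = rng.choice(n_states, size=500, p=q)
    # 2) Update the cost using demos and importance-weighted samples:
    #    grad = E_demo[f] - E_{p_theta}[f], with the model expectation estimated by
    #    self-normalized importance sampling, weights w = exp(-c) / q
    w = np.exp(-F[samples] @ theta) / q[samples]
    grad = F[demos].mean(0) - (w[:, None] * F[samples]).sum(0) / w.sum()
    theta -= 0.1 * grad
    # 3) Refine q toward the Boltzmann distribution of the current learned cost
    q = np.exp(-F @ theta); q /= q.sum()

p_model = np.exp(-F @ theta); p_model /= p_model.sum()
print("demo feature mean :", F[demos].mean(0))
print("model expectation :", p_model @ F)               # should roughly match
```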

SLIDE 34

MaxEntIRL with Adaptive Importance Sampling

The forward step can be any method that, given rewards/costs, computes a policy (the forward RL problem). Given expert demonstrations and policy-sampled trajectories, improve the rewards/costs (the inverse RL problem).

Diagram from Chelsea Finn

SLIDE 35

MaxEntIRL with Adaptive Importance Sampling

Update the cost using samples and demos; generate policy samples from q; update q w.r.t. the cost.

(Diagram from Chelsea Finn: the policy q plays the role of the generator and the cost c_θ(x) plays the role of the discriminator.)

SLIDE 36

Generative Adversarial Networks

Generator: maps noise z ~ uniform([0, 1]) to samples. Discriminator: given real data x or generated samples, D(x) is the probability that x came from the data rather than the generator.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
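A minimal sketch of evaluating this minimax objective on one batch, assuming we already have discriminator outputs on real and generated samples (the numbers are made up, not from a trained model):

```python
import numpy as np

# Hypothetical discriminator outputs D(x) on a batch of real data
d_real = np.array([0.9, 0.8, 0.95, 0.7])
# Hypothetical discriminator outputs D(G(z)) on a batch of generated samples
d_fake = np.array([0.2, 0.1, 0.3, 0.05])

# Value of the minimax objective V(D, G) for this batch
V = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# D takes a gradient step to increase V; G takes a step to decrease it
# (or, in practice, to increase E[log D(G(z))], see the zero-sum game slide).
print(V)
```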

SLIDE 37

Discriminative Deep Learning

Recipe for success

SLIDE 38

Generative Modeling

  • Have training examples x ∼ p_data(x)
  • Want a model that can draw samples x ∼ p_model(x)
  • where p_model ≈ p_data

SLIDE 39

Why generative models?

  • Conditional generative models
  • Speech synthesis: Text -> Speech
  • Machine Translation: French -> English
  • French: Si mon tonton tond ton tonton, ton tonton sera tondu.
  • English: If my uncle shaves your uncle, your uncle will be shaved.
  • Image -> Image segmentation
  • Environment simulator
  • Reinforcement learning
  • Planning
  • Leverage unlabeled data
SLIDE 40

Maximum Likelihood: the dominant approach

$$\theta^* = \arg\max_\theta \; \frac{1}{m} \sum_{i=1}^{m} \log p\left(x^{(i)}; \theta\right)$$

SLIDE 41

Undirected Graphical Models

  • State-of-the-art general-purpose undirected graphical model: Deep Boltzmann machines
  • Several "hidden layers" h (e.g., $h^{(1)}, h^{(2)}, h^{(3)}$ over observations x)

$$p(h, x) = \frac{1}{Z} \tilde{p}(h, x), \qquad \tilde{p}(h, x) = \exp(-E(h, x)), \qquad Z = \sum_{h, x} \tilde{p}(h, x)$$
SLIDE 42

Undirected Graphical Models: Disadvantage

$$\frac{d}{d\theta_i} \log p(x) = \frac{d}{d\theta_i} \left[ \log \sum_h \tilde{p}(h, x) - \log Z(\theta) \right], \qquad \frac{d}{d\theta_i} \log Z(\theta) = \frac{\frac{d}{d\theta_i} Z(\theta)}{Z(\theta)}$$

  • Maximum likelihood learning requires that we draw samples from the model to estimate this partition-function term.
SLIDE 43

Directed graphical models

$$p(x, h) = p(x \mid h^{(1)}) \, p(h^{(1)} \mid h^{(2)}) \cdots p(h^{(L-1)} \mid h^{(L)}) \, p(h^{(L)})$$

$$\frac{d}{d\theta_i} \log p(x) = \frac{1}{p(x)} \frac{d}{d\theta_i} p(x), \qquad p(x) = \sum_h p(x \mid h) \, p(h)$$

  • Two problems:
  • 1. Summation over exponentially many states in h
  • 2. Posterior inference, i.e. calculating p(h|x), is intractable

SLIDE 44

Directed graphical models: new approaches

The Variational Autoencoder model:

  • Kingma and Welling, Auto-Encoding Variational Bayes, International Conference on Learning Representations (ICLR) 2014
  • Rezende, Mohamed and Wierstra, Stochastic back-propagation and variational inference in deep latent Gaussian models, arXiv

Both use a reparametrization that allows them to train very efficiently with gradient backpropagation.

SLIDE 45

Generative Adversarial Networks

  • A game between two players:
  • 1. Discriminator D
  • 2. Generator G
  • D tries to discriminate between:
  • a sample from the data distribution, and
  • a sample from the generator G
  • G tries to "trick" D by generating samples that are hard for D to distinguish from data
  • General strategy: do not write down a formula for p(x), just learn to sample directly. No intractable summations.
SLIDE 46

Generative Adversarial Networks

Input noise z -> differentiable function G -> x sampled from the model -> differentiable function D: D tries to output 0.
x sampled from the data -> differentiable function D: D tries to output 1.

SLIDE 47

Zero-sum game

  • Minimax objective function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

  • In practice, to estimate G we use:

$$\max_G \; \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]$$

Why? Stronger gradient for G when D is very good.

Adapted from Ian Goodfellow
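A small numeric illustration of why the alternative generator objective gives a stronger gradient when D confidently rejects generated samples, i.e., when D(G(z)) is close to 0 (the values are made up):

```python
import numpy as np

d_fake = np.array([1e-3, 1e-2, 0.1, 0.5])   # D(G(z)) for a few generated samples

# Gradient magnitude of each generator loss w.r.t. D(G(z)):
# minimax loss  log(1 - D(G(z)))  ->  d/dD = -1 / (1 - D)
grad_minimax = -1.0 / (1.0 - d_fake)
# alternative   -log D(G(z))      ->  d/dD = -1 / D
grad_nonsaturating = -1.0 / d_fake

print(grad_minimax)        # ~ -1 when D(G(z)) is tiny: the signal saturates
print(grad_nonsaturating)  # ~ -1000 when D(G(z)) is tiny: much stronger learning signal
```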

SLIDE 48

Learning process

(Figure from Ian Goodfellow: the data distribution, the model distribution, and the discriminator output p_D(data), shown for a poorly fit model, after updating D, after updating G, and at the mixed strategy equilibrium.)

SLIDE 49

Learning process

(Same figure from Ian Goodfellow, next step of the animation.)

SLIDE 50

Learning process

(Same figure from Ian Goodfellow, next step of the animation.)

SLIDE 51

Learning process

(Same figure from Ian Goodfellow, final step of the animation.)

SLIDE 52

GANs are both fun and useful

male -> female

Adversarial Inverse Graphics Networks, Tung et al. 2017

SLIDE 53

GANs are both fun and useful

anybody -> Tom Cruise

Adversarial Inverse Graphics Networks, Tung et al. 2017

SLIDE 54

Generative Adversarial Imitation learning

Find a policy π_θ that makes it impossible for a discriminator network to distinguish between trajectory chunks visited by the expert and trajectory chunks visited by the learner's application of π_θ.

GAN objective, for comparison:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

GAIL objective:

$$\min_{\pi_\theta} \max_D \; \mathbb{E}_{\pi^*}[\log D(s)] + \mathbb{E}_{\pi_\theta}[\log(1 - D(s))]$$

D outputs 1 if the state comes from the demonstration policy. The reward for the policy optimization is how well I matched the demo trajectory distribution, or equivalently, how well I confused the discriminator: log D(s).
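A minimal sketch of the discriminator-as-reward idea, with a logistic discriminator over state features. The features, batches, and updates are illustrative placeholders, not Ho & Ermon's full GAIL algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim = 4
w = np.zeros(feat_dim)                      # logistic discriminator parameters

def D(s):
    """Probability that state features s come from the expert (label 1)."""
    return 1.0 / (1.0 + np.exp(-(s @ w)))

# Hypothetical batches of state features (made up for illustration)
expert_states = rng.normal(loc=+1.0, size=(64, feat_dim))
policy_states = rng.normal(loc=-1.0, size=(64, feat_dim))

# Discriminator update: ascend E_expert[log D(s)] + E_policy[log(1 - D(s))]
for _ in range(200):
    grad = ((1 - D(expert_states))[:, None] * expert_states).mean(0) \
         - (D(policy_states)[:, None] * policy_states).mean(0)
    w += 0.1 * grad

# Reward signal handed to the forward-RL / policy-gradient step:
rewards = np.log(D(policy_states))   # encourage states the discriminator thinks are expert-like
print(rewards[:5])
```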

SLIDE 55

Ho and Ermon, Generative Adversarial Imitation Learning, NIPS 2016

SLIDE 56


Generative Adversarial Imitation learning