Maximum Entropy Inverse RL, Adversarial Imitation Learning
Deep Reinforcement Learning and Control, Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Reinforcement learning / optimal control: given a dynamics model T (a probability distribution over next states given the current state and action) and a reward function R (which describes the desirability of states), compute a controller/policy π that prescribes the action to take in each state.
Diagram: Pieter Abbeel
IRL reverses the diagram: given a finite set of demonstration trajectories from an expert policy π*, recover the reward R and the policy π. In contrast to the DAGGER setup, we cannot interactively query the expert for additional labels.
Mathematically, imitation boils down to a distribution matching problem: the learner needs to come up with a reward/policy whose induced state-action trajectory distribution matches the expert's trajectory distribution.
Features f can be: # bridges crossed, # miles of interstate, # stoplights, traffic, width, tolls, etc.
Feature matching constraint:
$$\sum_{\text{Path } \tau_i} P(\tau_i)\, f_{\tau_i} = \tilde{f}$$
where $\tilde{f}$ are the demonstrated feature counts.
"If a driver uses 136.3 miles of interstate and crosses 12 bridges in a month's worth of trips, the model should also use 136.3 miles of interstate and 12 bridges in expectation for those same start-destination pairs."
A policy induces a distribution over trajectories:
$$p(\tau) = p(s_1)\prod_t p(a_t \mid s_t)\, P(s_{t+1}\mid s_t, a_t)$$
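As a concrete toy illustration of these feature counts, the sketch below (Python/NumPy) computes the demonstrated feature counts $\tilde f$ from a few made-up expert routes and the expected feature counts under a candidate trajectory distribution; all state names, features, and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical per-state features: [bridges crossed, miles of interstate, stoplights]
features = {
    "A": np.array([0.0, 1.2, 1.0]),
    "B": np.array([1.0, 0.0, 2.0]),
    "C": np.array([0.0, 3.5, 0.0]),
}

def traj_features(traj):
    """Feature counts of a trajectory: f_tau = sum of per-state features."""
    return sum(features[s] for s in traj)

# Demonstrated trajectories D (expert routes), as state sequences.
demos = [["A", "C"], ["A", "B", "C"]]
f_demo = np.mean([traj_features(t) for t in demos], axis=0)   # demonstrated counts

# A candidate distribution P(tau) over an enumerable set of paths.
paths = [["A", "C"], ["A", "B", "C"], ["B", "B"]]
P = np.array([0.5, 0.3, 0.2])
f_model = sum(p * traj_features(t) for p, t in zip(P, paths))  # expected counts

print("demonstrated feature counts:", f_demo)
print("expected model feature counts:", f_model)
# Feature matching asks that these two vectors agree (in expectation).
```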
However, many distributions over paths can match the feature counts, and some will be very different from the observed behavior. For example, the model could produce a policy that avoids the interstate and bridges on all routes except a single one that drives 136 miles of interstate and crosses 12 bridges.
The principle of maximum entropy: the probability distribution which best represents the current state of knowledge is the one with largest entropy, among all distributions consistent with the precisely stated prior data (testable information). Another way of stating this: take precisely stated prior data or testable information about a probability distribution; consider the set of all trial probability distributions that would encode the prior data; the distribution with maximal information entropy is the best choice. Here the distribution is over paths, i.e., configurations over time.
Let's pick the policy (trajectory distribution) that satisfies the feature count constraints without over-committing!
Maximize the entropy over paths, i.e., stay as uniform as possible, while matching the feature counts and remaining a valid probability distribution:
$$\max_P\; -\sum_\tau P(\tau)\log P(\tau)$$
$$\text{s.t.}\quad \sum_\tau P(\tau)\, f_\tau = \tilde{f},\qquad \sum_\tau P(\tau) = 1$$
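A minimal numeric sketch of this constrained optimization (Python/NumPy with SciPy's SLSQP solver; the four paths, their scalar feature counts, and the demonstrated count are made-up): maximize the entropy of $P$ subject to feature matching and normalization, and check that the solution has the exponential form derived below.

```python
import numpy as np
from scipy.optimize import minimize

f = np.array([1.0, 2.0, 3.0, 4.0])    # feature count of each path (toy values)
f_demo = 2.2                           # demonstrated feature count

def neg_entropy(P):
    P = np.clip(P, 1e-12, 1.0)
    return np.sum(P * np.log(P))       # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda P: np.sum(P) - 1.0},        # valid distribution
    {"type": "eq", "fun": lambda P: np.dot(P, f) - f_demo},  # feature matching
]
res = minimize(neg_entropy, np.full(4, 0.25), bounds=[(0, 1)] * 4,
               constraints=constraints)
P = res.x

# The maximum entropy solution is exponential in the feature counts:
# P(tau) proportional to exp(-lambda * f_tau) for some multiplier lambda.
lam = -(np.log(P[1]) - np.log(P[0])) / (f[1] - f[0])
print("P:", np.round(P, 4))
print("exponential-form check:", np.round(np.exp(-lam * f) / np.sum(np.exp(-lam * f)), 4))
```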
Cost of a trajectory (linear in features):
$$c_\theta(\tau) = \theta^\top f_\tau = \sum_{s\in\tau}\theta^\top f_s$$
Constraint: match the cost of the expert trajectories in expectation:
$$\int p(\tau)\, c_\theta(\tau)\, d\tau = \frac{1}{|D|}\sum_{\tau^*\in D} c_\theta(\tau^*) = \tilde{c}$$
Maximum entropy formulation:
$$\min_p\; -H(p(\tau))\quad \text{s.t.}\quad \int p(\tau)\, c_\theta(\tau)\, d\tau = \tilde{c},\quad \int p(\tau)\, d\tau = 1$$
Lagrangian:
$$\mathcal{L}(p,\lambda) = \int p(\tau)\log p(\tau)\, d\tau + \lambda_1\Big(\int p(\tau)\, c_\theta(\tau)\, d\tau - \tilde{c}\Big) + \lambda_0\Big(\int p(\tau)\, d\tau - 1\Big)$$
Setting the derivative with respect to p to zero:
$$\frac{\partial \mathcal{L}}{\partial p} = \log p(\tau) + 1 + \lambda_1 c_\theta(\tau) + \lambda_0 = 0 \;\iff\; \log p(\tau) = -1 - \lambda_1 c_\theta(\tau) - \lambda_0$$
$$\Rightarrow\; p(\tau) = e^{-1-\lambda_0-\lambda_1 c_\theta(\tau)}$$
The maximum entropy distribution over paths is therefore exponential in the (negated) cost:
$$p(\tau) \propto e^{-c_\theta(\tau)}$$
Maximizing the entropy of the distribution over paths, subject to the cost/feature constraints from observed data, implies that we maximize the likelihood of the observed data under this maximum entropy (exponential family) distribution (Jaynes 1957).
$$P(\tau_i \mid \theta) = \frac{1}{Z(\theta)}\, e^{-c_\theta(\tau_i)} = \frac{1}{Z(\theta)}\, e^{-\sum_{s_j\in\tau_i}\theta^\top f_{s_j}}$$
Maximum likelihood estimation of $\theta$:
$$\max_\theta\; \log\prod_{\tau^*\in D} p(\tau^*) \;\iff\; \max_\theta\; \sum_{\tau^*\in D}\log p(\tau^*)$$
$$\max_\theta\; \sum_{\tau^*\in D}\log\frac{e^{-c_\theta(\tau^*)}}{Z}$$
$$\max_\theta\; \sum_{\tau^*\in D} -c_\theta(\tau^*) \;-\; \sum_{\tau^*\in D}\log\Big(\sum_\tau e^{-c_\theta(\tau)}\Big)$$
$$\max_\theta\; \sum_{\tau^*\in D} -c_\theta(\tau^*) \;-\; |D|\log\Big(\sum_\tau e^{-c_\theta(\tau)}\Big)$$
$$\min_\theta\; \sum_{\tau^*\in D} c_\theta(\tau^*) \;+\; |D|\log\Big(\sum_\tau e^{-c_\theta(\tau)}\Big) \;\equiv\; J(\theta)$$
Gradient:
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\frac{dc_\theta(\tau^*)}{d\theta} + |D|\,\frac{1}{\sum_\tau e^{-c_\theta(\tau)}}\sum_\tau e^{-c_\theta(\tau)}\Big(-\frac{dc_\theta(\tau)}{d\theta}\Big)$$
$$= \sum_{\tau^*\in D}\frac{dc_\theta(\tau^*)}{d\theta} \;-\; |D|\sum_\tau p(\tau\mid\theta)\,\frac{dc_\theta(\tau)}{d\theta}$$
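A minimal sketch of this objective and its gradient for a small enumerable set of trajectories with linear costs $c_\theta(\tau) = \theta^\top f_\tau$ (Python/NumPy; the feature counts and the demonstration set are made-up): the gradient is the demonstrated term minus $|D|$ times the model expectation, exactly as above, and plain gradient descent on $J$ drives the model's expected feature counts toward the demonstrated ones.

```python
import numpy as np

# Made-up feature counts f_tau for three enumerable trajectories (2 features each).
F = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
demo_idx = [0, 2, 2]                       # demonstrated trajectories tau* in D

def J_and_grad(theta):
    c = F @ theta                          # c_theta(tau) = theta^T f_tau
    logZ = np.log(np.sum(np.exp(-c)))
    J = np.sum(c[demo_idx]) + len(demo_idx) * logZ
    p = np.exp(-c - logZ)                  # p(tau | theta), exponential in -cost
    grad = F[demo_idx].sum(axis=0) - len(demo_idx) * (p @ F)
    return J, grad

theta = np.zeros(2)
for _ in range(500):                       # gradient descent on J(theta)
    _, g = J_and_grad(theta)
    theta -= 0.1 * g

p = np.exp(-(F @ theta)); p /= p.sum()
print("learned cost weights:", np.round(theta, 3))
print("model E[f]:", np.round(p @ F, 3),
      " demo mean f:", np.round(F[demo_idx].mean(axis=0), 3))
```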
Successful imitation boils down to learning a policy that matches the expert's state visitation distribution (or state-action visitation distribution).
Since the cost decomposes over the states of a trajectory,
$$c_\theta(\tau) = \sum_{s\in\tau} c_\theta(s), \qquad p(\tau) \propto e^{-c_\theta(\tau)} = e^{-\sum_{s\in\tau} c_\theta(s)},$$
the gradient can be written in terms of state visitation distributions (or state-action visitations, with $\sum_{s,a} p(s,a\mid\theta)\,\frac{dc_\theta(s,a)}{d\theta}$ in the second term):
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\sum_{s\in\tau^*}\frac{dc_\theta(s)}{d\theta} \;-\; |D|\sum_s p(s\mid\theta)\,\frac{dc_\theta(s)}{d\theta}$$
In the tabular case and for known dynamics we can compute the state visitation distributions with dynamic programming, assuming we have obtained the policy $p(a\mid s)$:
$$\mu_1(s) = p(s_1 = s)$$
$$\text{for } t = 1,\dots,T:\qquad \mu_{t+1}(s) = \sum_a\sum_{s'}\mu_t(s')\, p(a\mid s')\, p(s\mid s', a)$$
$$p(s\mid\theta, T) = \sum_t \mu_t(s)$$
For linear costs, $c_\theta(s) = \theta^\top f_s$:
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\sum_{s\in\tau^*} f_s \;-\; |D|\sum_s p(s\mid\theta, T)\, f_s$$
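A minimal sketch of this computation in a toy tabular MDP (Python/NumPy; the dynamics, policy, initial distribution, per-state features, and demonstration sequences are all made-up, and the policy for the current cost is taken as given, as in the slide): run the forward recursion for the time-indexed densities $\mu_t$, sum them into $p(s\mid\theta, T)$, and form the linear-cost gradient.

```python
import numpy as np

S, A, T = 3, 2, 5                      # states, actions, horizon
rng = np.random.default_rng(0)

# Made-up inputs: dynamics P[s, a, s'] = p(s'|s, a), policy pi[s, a] = p(a|s),
# initial state distribution, and per-state features f_s (2 features per state).
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
mu1 = np.array([1.0, 0.0, 0.0])
F = rng.random((S, 2))

# Forward DP: mu_{t+1}(s) = sum_a sum_{s'} mu_t(s') p(a|s') p(s|s', a)
mu = np.zeros((T, S))
mu[0] = mu1
for t in range(T - 1):
    mu[t + 1] = np.einsum("s,sa,sax->x", mu[t], pi, P)
p_s = mu.sum(axis=0)                   # p(s | theta, T) = sum_t mu_t(s)

# Linear-cost gradient: demonstrated feature counts minus |D| * expected counts.
demos = [[0, 1, 1, 2, 2], [0, 0, 1, 2, 2]]       # made-up expert state sequences
f_demo = sum(F[s] for traj in demos for s in traj)
grad = f_demo - len(demos) * (p_s @ F)
print("expected state visitations p(s|theta,T):", np.round(p_s, 3))
print("gradient of J w.r.t. theta:", np.round(grad, 3))
```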
Using the time-indexed state densities (known dynamics, linear costs):
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\sum_{s_t\in\tau^*}\frac{dc_\theta(s_t)}{d\theta} \;-\; |D|\sum_s p(s\mid\theta, T)\,\frac{dc_\theta(s)}{d\theta}$$
Demonstrated behavior: bridges crossed: 3; miles of interstate: 20.7; stoplights: 10.
Model behavior (expectation): bridges crossed: ? (cost weight 5.0); miles of interstate: ? (cost weight 3.0); stoplights: ?
Demonstrated behavior: bridges crossed: 3; miles of interstate: 20.7; stoplights: 10.
Model behavior (expectation): bridges crossed: 4.7 (+1.7, cost weight 5.0); miles of interstate: 16.2 (-4.5, cost weight 3.0); stoplights: 7.4 (-2.6).
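Connecting these numbers to the linear-cost gradient above (a worked step; the step size is unspecified, so the exact updated weights do not follow from it): the per-demonstration gradient of $J$ with respect to each feature weight is the demonstrated count minus the model's expected count, i.e., the negatives of the differences shown above,
$$\frac{1}{|D|}\nabla_\theta J = \tilde f - \mathbb{E}_{p(\tau\mid\theta)}[f] = \begin{pmatrix} 3 - 4.7 \\ 20.7 - 16.2 \\ 10 - 7.4 \end{pmatrix} = \begin{pmatrix} -1.7 \\ +4.5 \\ +2.6 \end{pmatrix},$$
so a descent step $\theta \leftarrow \theta - \alpha\,\nabla_\theta J$ raises the cost weight on bridges (the model crosses too many) and lowers it on interstate miles and stoplights (the model uses too few), pushing the expected counts toward the demonstrated ones.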
Demonstrated behavior: bridges crossed: 3; miles of interstate: 20.7; stoplights: 10.
Model behavior (expectation): bridges crossed: 4.7 (cost weight 5.0 -> 7.2); miles of interstate: 16.2 (cost weight 3.0 -> 1.1); stoplights: 7.4.
Next: approximating the partition function Z with samples: Boularias et al. 2011, Kalakrishnan et al. 2013, Finn et al. 2016.
Maximum likelihood over the demonstrations:
$$\max_\theta\; \sum_{\tau\in D}\log p(\tau), \qquad p(\tau) = \frac{1}{Z}\exp(-C_\theta(\tau)), \qquad Z = \int \exp(-C_\theta(\tau))\, d\tau$$
The cost of a trajectory decomposes over the costs of individual state-action pairs:
$$C_\theta(\tau) = \sum_t c_\theta(x_t, u_t)$$
Before, the cost was linear in hand-designed features: $c_\theta(x_t, u_t) = \theta^\top f(x_t, u_t)$.
Written in the form of a loss function, the intractable term is the partition function:
$$Z = \int \exp(-C_\theta(\tau))\, d\tau$$
What should the sampling distribution q be? One that puts its mass on low-cost trajectories (which have much higher likelihood under the model), guided by the current estimate of the cost.
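A minimal sketch of the resulting sample-based objective (Python/NumPy; the per-step cost, the demonstration trajectories, and the Gaussian sampling distribution q are all made-up): the partition function is estimated by importance sampling with trajectories drawn from q, weighted by $\exp(-C_\theta(\tau))/q(\tau)$. The specific estimators in the papers cited above differ in their details.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2])                    # current cost parameters

def traj_cost(traj, theta):
    """C_theta(tau) = sum_t c_theta(x_t) with a made-up per-step cost."""
    feats = np.stack([traj, traj ** 2], axis=-1)  # toy per-step features of x_t
    return float(np.sum(feats @ theta))

# Demonstrations and samples from the current sampling distribution q (std normal).
demos = [rng.normal(0.0, 0.5, size=10) for _ in range(5)]
samples = [rng.normal(0.0, 1.0, size=10) for _ in range(50)]
log_q = np.array([np.sum(-0.5 * s ** 2 - 0.5 * np.log(2 * np.pi)) for s in samples])

# Importance-sampled partition function: Z ~= (1/N) sum_j exp(-C(tau_j)) / q(tau_j)
costs = np.array([traj_cost(s, theta) for s in samples])
log_w = -costs - log_q
log_Z = np.logaddexp.reduce(log_w) - np.log(len(samples))

# Negative log-likelihood of the demonstrations under p(tau) = exp(-C(tau)) / Z
demo_costs = np.array([traj_cost(d, theta) for d in demos])
nll = np.mean(demo_costs) + log_Z
print("estimated log Z:", round(log_Z, 3), " demo NLL:", round(nll, 3))
```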
Guided cost learning alternates three steps: generate policy samples from q (this can be any method that, given rewards, computes a policy, i.e., the forward RL problem); update the cost using the samples and the demonstrations (given expert demonstrations and policy-sampled trajectories, improve the rewards/costs, i.e., inverse RL); update q with respect to the current cost.
Diagram from Chelsea Finn
[Figure: a neural network cost $c_\theta(x)$ with inputs $x_1 \dots x_n$ and hidden layers $h^{(1)}, h^{(2)}, h^{(3)}$; the policy q and the cost c are optimized in alternation. Diagram from Chelsea Finn]
Generator vs. discriminator (generative adversarial networks): noise $z \sim \text{uniform}([0,1])$ is mapped by the generator to a sample; the discriminator sees real data $x$ and generated samples, and $D(x)$ is the probability that $x$ came from the data rather than the generator.
$$\min_G\max_D\; \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$$
Recipe for success: samples from the model should be indistinguishable from samples from the data, i.e., $x\sim p_{\text{data}}(x)$ vs. $x\sim p_{\text{model}}(x)$ with $p_{\text{model}} \approx p_{\text{data}}$.
Contrast with likelihood-based deep generative models. Deep Boltzmann machines maximize
$$\max_\theta\; \frac{1}{m}\sum_{i=1}^m \log p(x^{(i)}), \qquad \frac{d}{d\theta_i}\log p(x) = \frac{d}{d\theta_i}\Big[\log\sum_h \tilde{p}(h, x) - \log Z(\theta)\Big], \qquad \frac{d}{d\theta_i}\log Z(\theta) = \frac{\frac{d}{d\theta_i} Z(\theta)}{Z(\theta)},$$
which involves the intractable partition function $Z(\theta)$. Deep directed latent variable models factorize as
$$p(x, h) = p(x\mid h^{(1)})\, p(h^{(1)}\mid h^{(2)})\cdots p(h^{(L-1)}\mid h^{(L)})\, p(h^{(L)}),$$
with an intractable posterior $p(h\mid x)$.
The Variational Autoencoder model: Kingma and Welling, Auto-Encoding Variational Bayes, International Conference on Learning Representations (ICLR) 2014; Rezende et al., Stochastic backpropagation and approximate variational inference in deep latent Gaussian models, arXiv. They use a reparametrization that allows them to train very efficiently with gradient backpropagation.
[Figure: GAN setup. Input noise z passes through a differentiable function G to give x sampled from the model; x sampled from the data and x sampled from the model are fed to a differentiable function D, which tries to distinguish model samples from data.]
$$\min_G\max_D\; V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$$
In practice the generator instead maximizes $\mathbb{E}_{z\sim p_z(z)}[\log D(G(z))]$ (the non-saturating objective).
Adapted from Ian Goodfellow
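A minimal sketch of this two-player objective as a training loop (PyTorch; the 1-D Gaussian data, network sizes, and learning rates are arbitrary choices for illustration): the discriminator ascends $V(D,G)$ while the generator uses the non-saturating variant noted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 1e-8

for step in range(2000):
    x = torch.randn(64, 1) + 3.0           # real data x ~ p_data, here N(3, 1)
    z = torch.rand(64, 1)                   # noise z ~ uniform([0, 1])

    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))]
    d_loss = -(torch.log(D(x) + eps) + torch.log(1 - D(G(z).detach()) + eps)).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step (non-saturating): ascend E[log D(G(z))]
    g_loss = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

print("mean of generated samples:", G(torch.rand(1000, 1)).mean().item())
```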
[Figure: GAN training dynamics, showing the data distribution, the model distribution, and the discriminator output for a poorly fit model, after updating D, after updating G, and at the mixed strategy equilibrium. Diagram from Ian Goodfellow]
Examples: male -> female; anybody -> Tom Cruise. Adversarial Inverse Graphics Networks, Tung et al. 2017.
Generative adversarial imitation learning: find a policy $\pi_\theta$ that makes it impossible for a discriminator network to distinguish between trajectory chunks visited by the expert and those visited by the learner's policy $\pi_\theta$. Recall the GAN objective:
$$\min_G\max_D\; V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$$
The imitation analogue is
$$\min_{\pi_\theta}\max_D\; \mathbb{E}_{\pi^*}[\log D(s)] + \mathbb{E}_{\pi_\theta}[\log(1 - D(s))],$$
where D outputs 1 if the state comes from the demonstration policy. The reward for the policy optimization is how well the learner matched the demo trajectory distribution, in other words how well it confused the discriminator: $\log D(s)$.
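A minimal sketch of the discriminator side of this objective (PyTorch; the expert and policy state batches are made-up stand-ins, and the policy update that would consume the rewards, e.g., a policy-gradient step, is omitted): D is trained toward 1 on expert states and 0 on policy states, and the learner's per-state reward is $\log D(s)$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim = 4
D = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 1e-8

# Made-up batches standing in for expert-visited and policy-visited states.
expert_states = torch.randn(128, state_dim) + 1.0
policy_states = torch.randn(128, state_dim)

# Discriminator update: ascend E_pi*[log D(s)] + E_pi_theta[log(1 - D(s))]
d_loss = -(torch.log(D(expert_states) + eps).mean()
           + torch.log(1 - D(policy_states) + eps).mean())
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Reward handed to the policy optimizer: how well each visited state
# confuses the discriminator, r(s) = log D(s).
with torch.no_grad():
    rewards = torch.log(D(policy_states) + eps).squeeze(-1)
print("mean imitation reward:", rewards.mean().item())
```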
Generative Adversarial Imitation Learning, Ho and Ermon, NIPS 2016.