SLIDE 1

Maximum Entropy Inverse RL, Adversarial imitation learning

Deep Reinforcement Learning and Control Katerina Fragkiadaki

Carnegie Mellon School of Computer Science

SLIDE 2

Reinforcement Learning

Given a dynamics model T and a reward function R, Reinforcement Learning / Optimal Control produces a controller/policy π:

  • Dynamics model T: probability distribution over next states given the current state and action
  • Reward function R: describes the desirability of being in a state
  • Policy π: prescribes the action to take in each state

Diagram: Pieter Abbeel

SLIDE 3

Inverse Reinforcement Learning

IRL reverses the diagram: given a finite set of demonstration trajectories, let's recover the reward R and the policy π*.

  • Dynamics model T: probability distribution over next states given the current state and action
  • Reward function R: describes the desirability of being in a state
  • Policy π*: prescribes the action to take in each state

Diagram: Pieter Abbeel

SLIDE 4

Inverse Reinforcement Learning

IRL reverses the diagram: given a finite set of demonstration trajectories, let's recover the reward R and the policy π*. In contrast to the DAGGER setup, we cannot interactively query the expert for additional labels.

Diagram: Pieter Abbeel

SLIDE 5

Inverse Reinforcement Learning

Q: Why is inferring the reward useful, as opposed to learning a policy directly?
A: Because the reward can generalize better: e.g., if the dynamics of the environment change, you can use the recovered reward to learn a new policy that handles the new dynamics.

Diagram: Pieter Abbeel

SLIDE 6

A simple example

  • Roads have unknown costs, linear in features
  • Paths (trajectories) have unknown costs: the sum of the road (state) costs
  • Experts (taxi drivers) demonstrate Pittsburgh traveling behavior
  • How can we learn to navigate Pittsburgh like a taxi (or Uber) driver?
  • Assumption: the cost is independent of the goal state, so it depends only on road features, e.g., traffic, width, tolls, etc.

SLIDE 7

State features

Features f can be, e.g.: # bridges crossed, # miles of interstate, # stoplights.

SLIDE 8

A good guess: Match expected features

Features f can be, e.g.: # bridges crossed, # miles of interstate, # stoplights.

Feature matching: the expected feature counts under the model should equal the demonstrated feature counts,

$$\sum_{\tau_i} p(\tau_i)\, f_{\tau_i} = \tilde{f}$$

"If a driver uses 136.3 miles of interstate and crosses 12 bridges in a month's worth of trips, the model should also use 136.3 miles of interstate and 12 bridges in expectation for those same start-destination pairs."
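To make the feature-matching constraint concrete, here is a minimal numpy sketch (the feature table and demo trajectories are hypothetical, not from the slides) that computes the demonstrated feature counts f̃:

```python
import numpy as np

# Hypothetical setup: 4 road segments (states), 3 features per state:
# (# bridges, miles of interstate, # stoplights).
features = np.array([
    [1, 0.0, 2],
    [0, 5.5, 0],
    [0, 2.1, 1],
    [1, 0.0, 3],
])

# Each demonstration is a trajectory: a list of visited state indices.
demos = [[0, 1, 2], [0, 1, 3], [1, 2, 3]]

def feature_counts(trajectory):
    """Trajectory features f_tau: sum of state features along the path."""
    return features[trajectory].sum(axis=0)

# Demonstrated feature counts f~: empirical average over the demo set.
f_tilde = np.mean([feature_counts(tau) for tau in demos], axis=0)
print(f_tilde)  # avg bridges, interstate miles, stoplights per trip
```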


SLIDE 10

A good guess: Match expected features

A policy induces a distribution over trajectories:

$$p(\tau) = p(s_1) \prod_t p(a_t|s_t)\, P(s_{t+1}|s_t, a_t)$$

Feature matching: $\sum_{\tau_i} p(\tau_i) f_{\tau_i} = \tilde{f}$, where $\tilde{f}$ are the demonstrated feature counts.

SLIDE 11

Ambiguity

However, many distributions over paths can match the feature counts, and some will be very different from the observed behavior. For example, the model could produce a policy that avoids the interstate and bridges for all routes except one, which drives in circles on the interstate for 136 miles and crosses 12 bridges.

SLIDE 12

Principle of Maximum Entropy

The Principle of Maximum Entropy is based on the premise that, when estimating a probability distribution, you should select the distribution that leaves you the largest remaining uncertainty (i.e., the maximum entropy) consistent with your constraints. That way you have not introduced any additional assumptions or biases into your calculations.

$$H(x) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
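As a quick sanity check (a minimal numpy sketch, not from the slides): among all distributions over n outcomes, the uniform one maximizes H.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # log 4 ~ 1.386: maximal for n=4
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~ 0.168: nearly deterministic
```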

SLIDE 13

Resolve Ambiguity by Maximum Entropy

Let's pick the trajectory distribution that satisfies the feature count constraints without over-committing:

$$\max_p \; -\sum_\tau p(\tau) \log p(\tau) \quad \text{s.t.} \quad \sum_{\tau_i} p(\tau_i) f_{\tau_i} = \tilde{f}$$

SLIDE 14

From features to costs

Constraint: match the cost of the expert trajectories in expectation:

$$\int p(\tau)\, c_\theta(\tau)\, d\tau = \frac{1}{|D_{demo}|} \sum_{\tau_i \in D_{demo}} c_\theta(\tau_i) = \tilde{c}$$

SLIDE 15

Maximum Entropy Inverse Optimal Control

Optimization problem:

$$\min_p \; -H(p(\tau)) = \sum_\tau p(\tau) \log p(\tau) \quad \text{s.t.} \quad \int_\tau p(\tau)\, c_\theta(\tau) = \tilde{c}, \;\; \int_\tau p(\tau) = 1$$

SLIDE 16

From maximum entropy to exponential family

Form the Lagrangian of the optimization problem and set its derivative with respect to p to zero:

$$\mathcal{L}(p, \lambda) = \int p(\tau) \log p(\tau)\, d\tau + \lambda_1 \left( \int p(\tau)\, c_\theta(\tau)\, d\tau - \tilde{c} \right) + \lambda_0 \left( \int p(\tau)\, d\tau - 1 \right)$$

$$\frac{\partial \mathcal{L}}{\partial p} = \log p(\tau) + 1 + \lambda_1 c_\theta(\tau) + \lambda_0 = 0 \;\Longleftrightarrow\; p(\tau) = e^{-1 - \lambda_0 - \lambda_1 c_\theta(\tau)} \;\Rightarrow\; p(\tau) \propto e^{-c_\theta(\tau)}$$

(absorbing $\lambda_1$ into the scale of the cost).

SLIDE 17

From maximum entropy to exponential family

  • Strong preference for low-cost trajectories
  • Equal-cost trajectories are equally probable

Maximizing the entropy of the distribution over paths, subject to the cost constraints from the observed data, implies that we maximize the likelihood of the observed data under the maximum entropy (exponential family) distribution (Jaynes 1957):

$$p(\tau|\theta) = \frac{e^{-c_\theta(\tau)}}{\sum_{\tau'} e^{-c_\theta(\tau')}}$$
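In code, this distribution is just a softmax over negative trajectory costs; a minimal numpy sketch:

```python
import numpy as np

def maxent_trajectory_distribution(costs):
    """p(tau) proportional to exp(-c(tau)): softmax over negative costs.

    Shift by the max for numerical stability; the shift cancels in the ratio.
    """
    neg_c = -np.asarray(costs, dtype=float)
    neg_c -= neg_c.max()
    w = np.exp(neg_c)
    return w / w.sum()

# Equal-cost trajectories get equal probability;
# lower-cost trajectories are exponentially preferred.
print(maxent_trajectory_distribution([1.0, 1.0, 3.0]))
```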

SLIDE 18

Maximum Likelihood

$$\begin{aligned}
\max_\theta \; \sum_{\tau_i \in D_{demo}} \log p(\tau_i)
&\;\Longleftrightarrow\; \max_\theta \sum_{\tau_i \in D_{demo}} \log \frac{e^{-c_\theta(\tau_i)}}{Z} \\
&\;\Longleftrightarrow\; \max_\theta \sum_{\tau_i \in D_{demo}} -c_\theta(\tau_i) \;-\; |D_{demo}| \log Z,
\qquad Z = \sum_\tau e^{-c_\theta(\tau)} \\
&\;\Longleftrightarrow\; \min_\theta \sum_{\tau_i \in D_{demo}} c_\theta(\tau_i) \;+\; |D_{demo}| \log \Big( \sum_\tau e^{-c_\theta(\tau)} \Big) \;=:\; \mathcal{L}(\theta)
\end{aligned}$$

The partition sum $\sum_\tau e^{-c_\theta(\tau)}$ ranges over all trajectories: a huge sum, intractable to compute in large state spaces.


SLIDE 24

Maximum Likelihood

Gradient of the loss:

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{\tau_i \in D_{demo}} \frac{dc_\theta(\tau_i)}{d\theta} + |D_{demo}| \frac{1}{\sum_\tau e^{-c_\theta(\tau)}} \sum_\tau e^{-c_\theta(\tau)} \left( -\frac{dc_\theta(\tau)}{d\theta} \right) = \sum_{\tau_i \in D_{demo}} \frac{dc_\theta(\tau_i)}{d\theta} - |D_{demo}| \sum_\tau p(\tau|\theta) \frac{dc_\theta(\tau)}{d\theta}$$

Trajectory cost is additive over states:

$$c_\theta(\tau) = \sum_{s \in \tau} c_\theta(s) \;\Rightarrow\; p(\tau) \propto e^{-\sum_{s \in \tau} c_\theta(s)}$$
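The second term is an expectation under the model: the derivation uses the identity $\nabla_\theta \log Z = -\mathbb{E}_{p(\tau|\theta)}[\nabla_\theta c_\theta(\tau)]$. A minimal numpy check on toy trajectories with linear costs $c_\theta(\tau) = \theta^\top f_\tau$ (the feature matrix is hypothetical):

```python
import numpy as np

# Hypothetical trajectory features f_tau (4 trajectories, 2 features).
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
theta = np.array([0.3, -0.2])

costs = F @ theta                           # c_theta(tau) = theta^T f_tau
p = np.exp(-costs) / np.exp(-costs).sum()   # p(tau | theta)

# Analytic: grad log Z = -E_p[grad c] = -E_p[f]
grad_logZ = -(p[:, None] * F).sum(axis=0)

# Finite differences on log Z for comparison.
def logZ(th):
    return np.log(np.exp(-(F @ th)).sum())

eps = 1e-6
fd = np.array([(logZ(theta + eps * e) - logZ(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(grad_logZ, fd)  # the two should agree to ~1e-9
```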


SLIDE 29

Maximum Likelihood

Since the trajectory cost is additive over states, the gradient can be rewritten as a sum over states (summing over every state visited in every demo trajectory):

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{\tau_i \in D_{demo}} \frac{dc_\theta(\tau_i)}{d\theta} - |D_{demo}| \sum_\tau p(\tau|\theta) \frac{dc_\theta(\tau)}{d\theta} = \sum_{s \in \tau_i \in D_{demo}} \frac{dc_\theta(s)}{d\theta} - |D_{demo}| \sum_s p(s|\theta) \frac{dc_\theta(s)}{d\theta}$$

This is still an intractable sum, impossible to compute exactly in large state spaces.

SLIDE 30

Trajectory cost is additive over states

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{s \in \tau_i \in D_{demo}} \frac{dc_\theta(s)}{d\theta} - |D_{demo}| \sum_s p(s|\theta) \frac{dc_\theta(s)}{d\theta}$$

State densities $p(s|\theta)$: how much time the policy spends in each state. For linear costs $c_\theta(s) = \theta^\top f_s$:

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{s \in D_{demo}} f_s - |D_{demo}| \sum_s p(s|\theta)\, f_s$$

SLIDE 31

State densities can be computed analytically in small MDPs with known dynamics

SLIDE 32

State densities can be computed analytically in small MDPs with known dynamics.

$\mu_t(s)$: time-indexed state density.

  • initialize $\mu_1(s) \;\forall s$
  • for $t = 1, \ldots, T$: $\;\mu_{t+1}(s) = \sum_a \sum_{s'} \mu_t(s')\, \pi(a|s')\, p(s|s', a)$
  • $p(s|\theta) = \sum_t \mu_t(s)$

Known dynamics $p(s|s',a)$; unknown policy $\pi$ (it is the policy induced by the current cost).
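A minimal numpy sketch of this forward pass for a tabular MDP (the array layout for the dynamics `P` and the policy `pi` is an assumption of the sketch):

```python
import numpy as np

def state_densities(P, pi, mu1, T):
    """Forward pass for time-indexed state densities.

    P:   dynamics, P[a, s_prev, s_next] = p(s_next | s_prev, a)
    pi:  policy,   pi[s, a] = pi(a | s)
    mu1: initial state distribution mu_1(s)
    Returns p(s) = sum_t mu_t(s): (unnormalized) expected visitation counts.
    """
    mu = mu1.copy()
    total = mu.copy()
    for _ in range(T - 1):
        # mu_{t+1}(s) = sum_a sum_{s'} mu_t(s') pi(a|s') p(s|s',a)
        mu = np.einsum("i,ia,ais->s", mu, pi, P)
        total += mu
    return total
```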


SLIDE 38

Maximum entropy Inverse RL

Known dynamics, small state space, linear costs:

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{s \in D_{demo}} f_s - |D_{demo}| \sum_s p(s|\theta)\, f_s$$
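Putting the pieces together, a compact sketch of the whole loop under these assumptions (tabular MDP, linear costs). The `soft_policy` backward pass is one standard way to obtain the MaxEnt policy for the current cost; the slides do not specify this subroutine, and `state_densities` is the sketch from above:

```python
import numpy as np
from scipy.special import logsumexp

def soft_policy(costs, P, T):
    """Soft (MaxEnt) policy for the current state costs via a backward pass.

    costs: c(s) per state; P[a, s, s'] = p(s'|s, a).
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(T):
        Q = -costs[None, :] + P @ V          # Q[a, s]
        V = logsumexp(Q, axis=0)             # soft max over actions
    return np.exp(Q - V[None, :]).T          # pi[s, a]

def maxent_irl(features, P, demos, mu1, T, lr=0.1, iters=100):
    """MaxEnt IRL gradient loop for linear costs c_theta(s) = theta^T f_s."""
    theta = np.zeros(features.shape[1])
    # Demonstrated feature counts, summed over all states in all demos.
    f_demo = sum(features[s] for tau in demos for s in tau)
    for _ in range(iters):
        costs = features @ theta
        pi = soft_policy(costs, P, T)
        p_s = state_densities(P, pi, mu1, T)   # expected visitation counts
        grad = f_demo - len(demos) * p_s @ features
        theta -= lr * grad                     # descend to minimize L(theta)
    return theta
```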

SLIDE 39

Maximum entropy Inverse RL

Demonstrated Behavior            Model Behavior (Expectation)
Bridges crossed: 3               Bridges crossed: ?       Cost Weight: 5.0
Miles of interstate: 20.7        Miles of interstate: ?   Cost Weight: 3.0
Stoplights: 10                   Stoplights: ?

SLIDE 40

Maximum entropy Inverse RL

Demonstrated Behavior            Model Behavior (Expectation)
Bridges crossed: 3               Bridges crossed: 4.7 (+1.7)        Cost Weight: 5.0
Miles of interstate: 20.7        Miles of interstate: 16.2 (-4.5)   Cost Weight: 3.0
Stoplights: 10                   Stoplights: 7.4 (-2.6)

SLIDE 41

Maximum entropy Inverse RL

Demonstrated Behavior            Model Behavior (Expectation)
Bridges crossed: 3               Bridges crossed: 4.7       Cost Weight: 5.0 → 7.2
Miles of interstate: 20.7        Miles of interstate: 16.2  Cost Weight: 3.0 → 1.1
Stoplights: 10                   Stoplights: 7.4

The gradient raises the cost of over-used features (bridges: model 4.7 vs. demonstrated 3) and lowers the cost of under-used ones (interstate: model 16.2 vs. demonstrated 20.7).

SLIDE 42

Limitations of the formulation so far

  • The cost was assumed linear over features f
  • The dynamics were assumed known
  • The state space was small

Next:

  • General function approximators for the cost: Finn et al. 2016
  • Unknown dynamics → sample-based approximations for the partition function Z: Boularias et al. 2011, Kalakrishnan et al. 2013, Finn et al. 2016

SLIDE 43

Recall our maximum likelihood formulation

$$\max_\theta \sum_{\tau_i \in D_{demo}} \log p(\tau_i) \;\Longleftrightarrow\; \max_\theta \sum_{\tau_i \in D_{demo}} \log \frac{e^{-c_\theta(\tau_i)}}{Z} \;\Longleftrightarrow\; \max_\theta \sum_{\tau_i \in D_{demo}} -c_\theta(\tau_i) - \sum_{\tau_i \in D_{demo}} \log Z$$

Dividing by $|D_{demo}|$, we need to minimize the following loss function:

$$\mathcal{L}(\theta) = \frac{1}{|D_{demo}|} \sum_{\tau_i \in D_{demo}} c_\theta(\tau_i) + \log Z$$

SLIDE 44

Sample approximation for Z

$$Z = \int e^{-c_\theta(\tau)}\, d\tau$$

This is a huge integral, intractable to compute.

SLIDE 45

Sample approximation for Z

This is a huge integral, intractable to compute. Importance sampling with a proposal distribution q gives:

$$Z = \int e^{-c_\theta(\tau)}\, d\tau = \int q(\tau) \frac{e^{-c_\theta(\tau)}}{q(\tau)}\, d\tau \approx \frac{1}{|D_{samp}|} \sum_{\tau_j \in D_{samp}} \frac{e^{-c_\theta(\tau_j)}}{q(\tau_j)}$$


SLIDE 47

Sample approximation for Z

Plugging the estimate into the loss:

$$\mathcal{L}(\theta) = \frac{1}{|D_{demo}|} \sum_{\tau_i \in D_{demo}} c_\theta(\tau_i) + \log \frac{1}{|D_{samp}|} \sum_{\tau_j \in D_{samp}} \frac{e^{-c_\theta(\tau_j)}}{q(\tau_j)}$$
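A minimal numpy sketch of this importance-sampled loss (`cost_fn` and the sampler log-densities `log_q` are hypothetical inputs; in Finn et al. 2016 the cost $c_\theta$ is a neural network):

```python
import numpy as np
from scipy.special import logsumexp

def is_loss(cost_fn, demo_trajs, samp_trajs, log_q):
    """Importance-sampled MaxEnt IRL loss.

    cost_fn: maps a trajectory to its scalar cost c_theta(tau)
    log_q:   array of log q(tau_j) for the sampled trajectories (same order)
    """
    demo_term = np.mean([cost_fn(t) for t in demo_trajs])
    samp_costs = np.array([cost_fn(t) for t in samp_trajs])
    # log( (1/N) sum_j exp(-c_j) / q_j ), computed stably in log space
    log_Z = logsumexp(-samp_costs - log_q) - np.log(len(samp_trajs))
    return demo_term + log_Z
```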

SLIDE 48

What q shall we choose?

With self-normalized importance weights, the gradient becomes

$$\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{|D_{demo}|} \sum_{\tau_i \in D_{demo}} \frac{dc_\theta}{d\theta}(\tau_i) - \sum_{\tau_j \in D_{samp}} w_j \frac{dc_\theta}{d\theta}(\tau_j), \qquad w_j = \frac{e^{-c_\theta(\tau_j)}/q(\tau_j)}{\sum_{\tau_k \in D_{samp}} e^{-c_\theta(\tau_k)}/q(\tau_k)}$$

  • When is this approximation good? When q samples highly probable trajectories, i.e., when q is the expert policy!
  • Finding a good q is a chicken-and-egg problem: if I knew the expert reward function, I'd compute the expert policy with RL, and I'd sample highly likely trajectories with that policy!
  • Solution: iterate. Refine the sampling distribution q (the policy) over time. (Finn et al. 2016)

SLIDE 49

MaxEnt IRL with Adaptive Importance Sampling

1. Initialize q_0 either from a random policy or using behavior cloning on expert demonstrations.
2. for iteration k = 1...I:
3.   Generate samples D_traj from q_k(τ)
4.   Append samples: D_samp ← D_samp ∪ D_traj
5.   Use D_samp to update the cost c_θ using gradient descent (with the importance-sampled gradient above)
6.   Update q_k(τ) using any RL method
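A schematic Python sketch of this loop; all subroutines are injected as callables because the slides leave the cost model and the RL method unspecified:

```python
def adaptive_is_maxent_irl(expert_demos, sample_traj, update_cost,
                           improve_policy, init_policy, init_cost,
                           n_iters=50, n_rollouts=20):
    """MaxEnt IRL with adaptive importance sampling (Finn et al. 2016 style).

    sample_traj(policy)               -> one trajectory sampled from q_k
    update_cost(cost, demos, samples) -> cost after IS gradient step(s)
    improve_policy(policy, cost)      -> policy after an RL update
    """
    policy, cost = init_policy, init_cost    # step 1 (e.g., behavior cloning)
    d_samp = []
    for _ in range(n_iters):                 # step 2
        d_samp += [sample_traj(policy) for _ in range(n_rollouts)]  # steps 3-4
        cost = update_cost(cost, expert_demos, d_samp)              # step 5
        policy = improve_policy(policy, cost)                       # step 6
    return cost, policy
```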


SLIDE 51

MaxEnt IRL with Adaptive Importance Sampling

Update the cost using samples & demos; generate policy samples from q; update q w.r.t. the cost.

[Diagram from Chelsea Finn: the policy q plays the role of the Generator, and the cost network c_θ(x) plays the role of the Discriminator]

The discriminator adjusts the cost so that the expert trajectories are better distinguished from the generated ones.

SLIDE 52

Generative models: density estimation

  • So far we have been seeking to learn a generative model of trajectories by computing trajectory densities:

$$p(\tau|\theta) = \frac{e^{-c_\theta(\tau)}}{\sum_{\tau'} e^{-c_\theta(\tau')}}$$

  • We were trying to estimate a model that, given a trajectory, outputs the probability of that trajectory: expert trajectories should be highly probable, and non-expert ones less probable.
  • This is in general what we do when we maximize the likelihood of the data:

$$\theta^* = \arg\max_\theta \frac{1}{m} \sum_{i=1}^{m} \log p\left(x^{(i)}; \theta\right)$$

  • The problem is that the probabilities need to sum to one.

SLIDE 53

Generative models: sample generation

  • Recently, new classes of generative models have been proposed that, instead of computing densities, directly learn a sampler, without necessarily having an explicit density. (Goodfellow 2016)

[Figure: sample generation, training examples vs. model samples]

  • Have we done this for trajectories? Well, we used behavior cloning, but that assumed access to a teacher.

SLIDE 54

Generative models: sample generation

  • Have training examples x ∼ p_data(x)
  • Want a model that can draw samples x ∼ p_model(x), where p_model ≈ p_data

SLIDE 55

The sampling can be both conditional and unconditional

male -> female

Adversarial Inverse Graphics Networks, Tung et al. 2017

SLIDE 56

anybody -> Tom Cruise

Adversarial Inverse Graphics Networks, Tung et al. 2017

The sampling can be both conditional and unconditional

SLIDE 57

Generative Adversarial Networks

  • A game between two players: 1. Discriminator D, 2. Generator G
  • D tries to discriminate between a sample from the data distribution and a sample from the generator G
  • G tries to "trick" D by generating samples that are hard for D to distinguish from data
  • General strategy: do not write a formula for p(x), just learn to sample directly. No intractable summations!
SLIDE 58

Adversarial Nets Framework (Goodfellow 2016)

  • x sampled from data → differentiable function D → D(x) tries to be near 1
  • input noise z → differentiable function G → x sampled from model → D → D tries to make D(G(z)) near 0, while G tries to make D(G(z)) near 1

Generative Adversarial Networks

SLIDE 59

Generative Adversarial Networks

  • x sampled from data → differentiable function D → D tries to output 1
  • input noise z → differentiable function G → x sampled from model → D → D tries to output 0

G is our sampler! The rest is only used at training time.

SLIDE 60

Generative Adversarial Networks

D(x): the probability that x came from the real data rather than the generator. With $z \sim \mathcal{N}(0, I)$ and $x \sim p_{data}$:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

SLIDE 61

Generative Adversarial Networks: in practice

D(x): the probability that x came from the real data rather than the generator. In practice, D is trained as before, but G maximizes log D(G(z)) instead of minimizing log(1 − D(G(z))):

$$\max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

$$\max_G \; \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]$$
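A minimal PyTorch sketch of one training step with these two objectives on toy 1-D data (network sizes and learning rates are arbitrary choices, not from the slides):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def gan_step(x_real):
    # Discriminator: max log D(x) + log(1 - D(G(z)))
    z = torch.randn(x_real.size(0), 8)
    x_fake = G(z).detach()                 # don't backprop into G here
    d_loss = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator (non-saturating heuristic): max log D(G(z))
    z = torch.randn(x_real.size(0), 8)
    g_loss = -torch.log(D(G(z))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# e.g., real data drawn from N(3, 1):
for _ in range(1000):
    gan_step(3 + torch.randn(64, 1))
```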

SLIDE 62

Comparison of generator losses (Goodfellow 2014, 2016)

[Plot: generator loss J(G) as a function of D(G(z)), comparing the minimax, non-saturating heuristic, and maximum likelihood costs]

SLIDE 63

Generative Adversarial Networks (Goodfellow et al. 2016)

[Two density plots comparing a fit q*(x) to p(x):]

  • Maximum likelihood: $q^* = \arg\min_q D_{KL}(p \,\|\, q)$ (q spreads to cover all of p's mass)
  • Reverse KL: $q^* = \arg\min_q D_{KL}(q \,\|\, p)$ (q concentrates on a single mode)

SLIDE 64

Generative Adversarial Imitation Learning

Find a policy π_θ that makes it impossible for a discriminator network to distinguish between trajectory chunks visited by the expert and by the learner's application of π_θ. Compare the GAN objective

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

with the GAIL objective, where D outputs 1 if a state comes from the demo policy:

$$\min_{\pi_\theta} \max_D \; \mathbb{E}_{\pi^*}[\log D(s)] + \mathbb{E}_{\pi_\theta}[\log(1 - D(s))]$$

The reward for the policy optimization is how well the learner matches the demo trajectory distribution, i.e., how well it confuses the discriminator: log D(s).
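A schematic PyTorch sketch of one GAIL iteration under this objective; the policy update is injected as a callable, since any RL method can be used:

```python
import torch

def gail_step(D, opt_d, expert_states, policy_states, policy_update):
    """One GAIL iteration: discriminator update, then a policy update
    using log D(s) as the surrogate reward.

    D:              state discriminator with outputs in (0, 1),
                    D(s) = probability that s came from the expert
    expert_states:  tensor of states visited by the expert
    policy_states:  tensor of states visited by the current learner pi_theta
    policy_update:  callable(rewards) applying any RL step (hypothetical)
    """
    # Discriminator: max E_expert[log D(s)] + E_policy[log(1 - D(s))]
    d_loss = -(torch.log(D(expert_states)).mean()
               + torch.log(1 - D(policy_states)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Policy: treat log D(s) as the reward and hand it to the RL method.
    with torch.no_grad():
        rewards = torch.log(D(policy_states))
    policy_update(rewards)
```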

SLIDE 65

Generative Adversarial Imitation Learning, NIPS 2016 (Ho & Ermon)

SLIDE 66

Generative Adversarial Imitation learning