
SLIDE 1

LAB MEETING: A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning and Energy-Based models

Suwon Suh

POSTECH MLG

Feb, 13, 2017

SLIDE 2

Goal

◮ Understanding Basic Models

1) Generative Adversarial Networks (GAN)
2) Energy-Based Models (EBM)
3) Inverse Reinforcement Learning (IRL)

◮ Relationships among the three models

1) Equivalence between guided cost learning and GAN

◮ A new algorithm for EBM training with GAN

1) A new type of discriminator built from the model distribution (EBM) and the sampling distribution
2) We get an efficient sampler as a result!

SLIDE 3

GAN

A generative model in an adversarial setting

◮ Generative model with a discriminator:

min_G max_D V(G, D) = Ex∼P[log D(x)] + Ez∼Unif[log(1 − D(G(z)))] ,

rewriting it as:

min_G max_D V(G, D) = Ex∼P[log D(x)] + Ex∼Q[log(1 − D(x))] ,

where P is the data distribution and Q is the distribution of the generator.

◮ Optimal discriminator D∗ for a fixed G:

D∗(x) = P(x) / (P(x) + Q(x))    (1)
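As a sanity check, Eq. (1) can be verified numerically on a discrete example. The 3-point distributions P and Q below are hypothetical, chosen only for illustration: any perturbation of D∗ should decrease the (discrete) value function.

```python
import numpy as np

# Discrete sanity check of Eq. (1): for a fixed G, D*(x) = P(x) / (P(x) + Q(x))
# maximizes V(G, D) = E_{x~P}[log D(x)] + E_{x~Q}[log(1 - D(x))].
# P and Q are hypothetical 3-point distributions, for illustration only.
P = np.array([0.5, 0.3, 0.2])   # data distribution
Q = np.array([0.2, 0.3, 0.5])   # generator distribution

def V(D):
    """Discrete GAN value function for a fixed generator."""
    return np.sum(P * np.log(D)) + np.sum(Q * np.log(1.0 - D))

D_star = P / (P + Q)            # Eq. (1)

# Random perturbations of D* never increase V.
rng = np.random.default_rng(0)
for _ in range(100):
    D = np.clip(D_star + rng.normal(scale=0.05, size=3), 1e-6, 1.0 - 1e-6)
    assert V(D) <= V(D_star)
```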

SLIDE 4

A Variant of GAN minimizing KL[Q||P]

◮ The loss function for the discriminator:

Loss(D) = Ex∼P[− log D(x)] + Ex∼Q[− log(1 − D(x))]

◮ The original loss function for the generator:

Loss_org(G) = Ex∼G[log(1 − D(x))]

When training starts, D easily rejects generated samples, so D(x) ≈ 0 and log(1 − D(x)) ≈ log(1); since the gradient d log(x)/dx at x = 1 is not steep, the generator learns slowly. This motivates an alternative loss:

Loss_alter(G) = −Ex∼G[log D(x)]

◮ We can use both:

Lgen(G) = Loss_org(G) + Loss_alter(G) = Ex∼G[log((1 − D(x)) / D(x))]

SLIDE 5

A Variant of GAN minimizing KL[Q||P]

◮ Huszár notes that it minimizes KL[Q||P] when D is near D∗:

Ex∼G[log((1 − D(x)) / D(x))] ≈ Ex∼G[log((1 − D∗(x)) / D∗(x))] = Ex∼Q[log(Q(x) / P(x))] = KL[Q||P] ,

by invoking Eq. 1.
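Huszár's observation can be checked exactly in the discrete case: with D = D∗ from Eq. 1, the ratio (1 − D∗)/D∗ equals Q/P, so the combined generator loss reduces to KL[Q||P]. A toy check (the 3-point P and Q are made up):

```python
import numpy as np

# With D = D* = P/(P+Q), (1 - D*)/D* = Q/P, so
# E_{x~Q}[log((1 - D*(x)) / D*(x))] = KL[Q||P] exactly.
P = np.array([0.5, 0.3, 0.2])   # hypothetical data distribution
Q = np.array([0.2, 0.3, 0.5])   # hypothetical generator distribution

D_star = P / (P + Q)
gen_loss = np.sum(Q * np.log((1.0 - D_star) / D_star))
kl_qp = np.sum(Q * np.log(Q / P))
assert abs(gen_loss - kl_qp) < 1e-12
```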

SLIDE 6

Energy Based Models (EBMs)

◮ Every configuration x ∈ R^D has a corresponding energy Eθ(x).

◮ By normalizing, we can define a probability density function (pdf):

pθ(x) = exp(−Eθ(x)) / Z(θ) , where Z(θ) = ∫ exp(−Eθ(x′)) dx′ .

◮ How do we learn the parameters θ?

log pθ(x) = −Eθ(x) − log Z(θ)

◮ There are too many configurations, so we estimate Z(θ) with samples from Markov chain Monte Carlo (MCMC):

1) Contrastive Divergence (CD), with only one K-step sample from an MCMC chain.
2) Persistent CD, which maintains multiple chains to sample from the model during learning with Stochastic Gradient Descent (SGD).
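A minimal discrete sketch of these definitions, with a toy linear energy (not from the slides): when the state space is small, Z(θ) is an exact sum, and the log-likelihood gradient splits into a data term minus a model expectation, which is the quantity CD approximates with a K-step MCMC sample.

```python
import numpy as np

# Toy discrete EBM: states x in {0,...,4} with a made-up linear energy
# E_theta(x) = theta * x. With so few states Z(theta) is an exact sum;
# CD / Persistent CD are needed only when the state space cannot be enumerated.
xs = np.arange(5)
theta = 0.7
energy = theta * xs
Z = np.sum(np.exp(-energy))          # exact partition function
p = np.exp(-energy) / Z              # p_theta(x) = exp(-E_theta(x)) / Z(theta)

# d/d(theta) log p_theta(x) = -x + E_{x'~p_theta}[x']:
# a data term minus a model expectation; MCMC estimates the latter.
x_obs = 1
grad = -x_obs + np.sum(p * xs)
```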

SLIDE 7

Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL)

Given states X, actions U, dynamics P(xt+1|xt, ut), and discount factor γ in an MDP (X, U, P, cθ, γ), together with expert demonstrations, we need to find the cost (negative reward) cθ.

◮ Maximum entropy inverse reinforcement learning (MaxEnt IRL) models demonstrations with a Boltzmann distribution:

pθ(τ) = exp(−cθ(τ)) / Z , where τ = {x1, u1, · · · , xT, uT} is a trajectory and cθ(τ) = Σt cθ(xt, ut).

◮ Guided cost learning (GCL), where the partition function Z is approximated by importance sampling:

Lcost(θ) = Eτ∼p[− log pθ(τ)] = Eτ∼p[cθ(τ)] + log Z = Eτ∼p[cθ(τ)] + log(Eτ∼q[exp(−cθ(τ)) / q(τ)])
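The importance-sampling identity behind this approximation, Z = Eτ∼q[exp(−cθ(τ)) / q(τ)], can be checked on a toy discrete trajectory space; the costs and sampling distribution below are hypothetical.

```python
import numpy as np

# Importance-sampling identity used above (discrete trajectory space):
# Z = sum_tau exp(-c_theta(tau)) = E_{tau~q}[exp(-c_theta(tau)) / q(tau)].
# Costs and sampling distribution are hypothetical.
c = np.array([0.5, 1.0, 2.0, 3.0])   # trajectory costs c_theta(tau)
q = np.array([0.4, 0.3, 0.2, 0.1])   # sampling distribution q(tau)

Z_exact = np.sum(np.exp(-c))
Z_is = np.sum(q * (np.exp(-c) / q))  # exact expectation under q
assert abs(Z_exact - Z_is) < 1e-12

# Finite-sample Monte Carlo version of the same estimator:
rng = np.random.default_rng(0)
idx = rng.choice(len(c), size=200_000, p=q)
Z_mc = np.mean(np.exp(-c[idx]) / q[idx])
```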

SLIDE 8

Inverse Reinforcement Learning

GCL needs to match the sampling distribution q(τ) to the model distribution pθ(τ)

Lsampler(q) = KL[q(τ) || pθ(τ)] , where, keeping only the terms that depend on q:

Lsampler(q) = Eτ∼q[cθ(τ)] + Eτ∼q[log q(τ)] ,

Modifying the sampling distribution with a mixture

To reduce the variance of the estimator of Z based on q alone, µ = (1/2) p + (1/2) q is used as the sampling distribution:

Lcost(θ) = Eτ∼p[cθ(τ)] + log(Eτ∼µ[exp(−cθ(τ)) / ((1/2) p̃(τ) + (1/2) q(τ))]) ,

where p̃ is a rough estimate of the density of the demonstrations using the current model pθ.
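A small sketch of why the mixture helps, using made-up costs where q nearly misses the low-cost trajectory: the variance of the importance weights under µ is far below the variance under q alone, while the expectation (the partition function) is unchanged.

```python
import numpy as np

# Why mix: with mu = 0.5 * p_tilde + 0.5 * q, importance weights stay bounded
# even where q nearly misses a high-density (low-cost) trajectory.
# Costs and q are made up to create exactly that failure mode.
c = np.array([0.1, 3.0, 3.0, 3.0])          # one low-cost trajectory
q = np.array([0.01, 0.33, 0.33, 0.33])      # q almost misses it
p_tilde = np.exp(-c) / np.sum(np.exp(-c))   # rough model-based density estimate
mu = 0.5 * p_tilde + 0.5 * q

def weight_var(proposal):
    """Variance of the importance weight exp(-c)/proposal under the proposal."""
    w = np.exp(-c) / proposal
    return np.sum(proposal * w**2) - np.sum(proposal * w)**2

# Both proposals are unbiased for Z, but the mixture has far lower variance.
Z = np.sum(np.exp(-c))
assert abs(np.sum(q * (np.exp(-c) / q)) - Z) < 1e-12
assert abs(np.sum(mu * (np.exp(-c) / mu)) - Z) < 1e-12
assert weight_var(mu) < weight_var(q)
```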

SLIDE 9

Model

(Idea) Explicitly model the discriminator D in the form of the optimal discriminator D∗.

We assume p is the data distribution, p̃θ is a model distribution parameterized by θ, and q is a sampling distribution.

◮ Before: D∗ = p(τ) / (p(τ) + q(τ))
◮ After: Dθ = p̃θ(τ) / (p̃θ(τ) + q(τ))
◮ Why an EBM as the model distribution?

A Product of Experts (PoE) can capture modes and put less density between modes than a Mixture of Experts (MoE) of similar capacity.

Dθ = (1/Z) exp(−cθ(τ)) / ((1/Z) exp(−cθ(τ)) + q(τ))

◮ We need to evaluate the sampling density q(τ) efficiently in order to learn: autoregressive models, normalizing flows, and MoE.
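A discrete sketch of this constrained discriminator, with hypothetical costs, sampler density, and a rough partition-function estimate; note that when the sampler exactly matches the model density, Dθ = 1/2 everywhere, which is the usual GAN convergence condition.

```python
import numpy as np

# Discrete sketch of the constrained discriminator D_theta, with hypothetical
# costs c_theta, sampler density q, and a rough partition-function estimate Z.
c_theta = np.array([0.5, 1.0, 2.0])
q = np.array([0.5, 0.3, 0.2])
Z = 1.2                                    # current estimate of Z

p_model = np.exp(-c_theta) / Z             # (1/Z) exp(-c_theta(tau))
D_theta = p_model / (p_model + q)          # discriminator in optimal form
assert np.all((D_theta > 0.0) & (D_theta < 1.0))

# When the sampler matches the model density exactly, D_theta = 1/2 everywhere:
# the usual GAN convergence condition.
p_exact = np.exp(-c_theta) / np.sum(np.exp(-c_theta))
assert np.allclose(p_exact / (p_exact + p_exact), 0.5)
```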

SLIDE 10

Equivalence between GAN and GCL

◮ Loss from the variant of GAN:

Ldisc(θ) = Eτ∼p[− log Dθ(τ)] + Eτ∼q[− log(1 − Dθ(τ))]
= Eτ∼p[− log((1/Z) exp(−cθ(τ)) / ((1/Z) exp(−cθ(τ)) + q(τ)))] + Eτ∼q[− log(q(τ) / ((1/Z) exp(−cθ(τ)) + q(τ)))]

◮ Loss from GCL:

Lcost(θ) = Eτ∼p[cθ(τ)] + log(Eτ∼µ[exp(−cθ(τ)) / ((1/2) p̃(τ) + (1/2) q(τ))])

◮ Equivalence:

1) The value of Z which minimizes Ldisc is the importance-sampling estimator of the partition function.
2) For this value of Z, the derivative of Ldisc(θ) with respect to θ is equal to the derivative of Lcost(θ).
3) The derivative of Lgen(q) with respect to q is equal to the derivative of Lsampler(q).
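Claim 1 can be checked numerically on a discrete example (all numbers made up): minimize Ldisc over Z by grid search, then compare against the importance-sampling estimator evaluated at that Z, with samples weighted as a 50/50 mixture of demonstrations and sampler and the estimated densities (1/2)p̃ + (1/2)q in the denominator.

```python
import numpy as np

# Equivalence claim 1 on a discrete example (all numbers made up):
# the Z minimizing L_disc agrees with the importance-sampling estimate of Z
# computed from a 50/50 mixture of demonstrations (p) and sampler (q), with
# estimated densities 0.5 * p_tilde + 0.5 * q, p_tilde = exp(-c)/Z.
c = np.array([0.2, 1.0, 2.5])
f = np.exp(-c)
p = np.array([0.6, 0.3, 0.1])    # demonstration distribution
q = np.array([0.2, 0.4, 0.4])    # sampler distribution

# L_disc as a function of Z only (c fixed); D_theta = (f/Z) / (f/Z + q).
Z_grid = np.linspace(0.5, 3.0, 25001)
D = f / (f + Z_grid[:, None] * q)            # shape (len(Z_grid), 3)
L_disc = -(p * np.log(D)).sum(axis=1) - (q * np.log(1.0 - D)).sum(axis=1)
Z_min = Z_grid[np.argmin(L_disc)]

# Importance-sampling estimator evaluated at Z_min (exact expectations):
denom = 0.5 * f / Z_min + 0.5 * q            # estimated mixture density
Z_is = np.sum(0.5 * (p + q) * f / denom)
assert abs(Z_min - Z_is) < 1e-3
```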

SLIDE 11

Training EBMs with GAN

Why?

As PoEs, EBMs model complicated manifolds well. However, sampling from them requires MCMC, so samples are not independent. This method directly learns an effective sampling distribution.

◮ Update the partition function with importance sampling:

Z ⇐ Ex∼µ[exp(−Eθ(x)) / ((1/2) p̃(x) + (1/2) q(x))]

◮ Update the model parameters with SGD:

Lenergy(θ) = Ex∼p[Eθ(x)] + log(Ex∼µ[exp(−Eθ(x)) / ((1/2) p̃(x) + (1/2) q(x))])

◮ Update the sampler parameters with SGD:

Lsampler(q) = Ex∼q[Eθ(x)] + Ex∼q[log q(x)]
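The sampler objective can be rewritten as Lsampler(q) = KL[q || pθ] − log Z, so it is minimized exactly at q = pθ, with minimum value −log Z. A discrete check with a toy energy (values are made up):

```python
import numpy as np

# L_sampler(q) = E_q[E_theta(x)] + E_q[log q(x)] = KL[q || p_theta] - log Z,
# so its unique minimizer is q = p_theta, with minimum value -log Z.
# Toy discrete energy, values made up.
E_theta = np.array([0.3, 1.0, 2.0, 0.5])
Z = np.sum(np.exp(-E_theta))
p_theta = np.exp(-E_theta) / Z

def L_sampler(q):
    """Sampler loss: expected energy plus negative entropy."""
    return np.sum(q * E_theta) + np.sum(q * np.log(q))

# Minimum value is -log Z, attained at q = p_theta ...
assert abs(L_sampler(p_theta) + np.log(Z)) < 1e-12

# ... and no perturbed distribution does better.
rng = np.random.default_rng(1)
for _ in range(100):
    q = np.clip(p_theta + rng.normal(scale=0.02, size=4), 1e-6, None)
    q = q / q.sum()
    assert L_sampler(q) >= L_sampler(p_theta)
```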

SLIDE 12

Discussion

◮ Return of EBMs

Recently, EBMs have been overshadowed by VAEs and GANs because sampling from them is hard and approximate log-likelihoods are difficult to obtain. With this model, we can evade these problems.

◮ Combination of EBMs with other generative models, such as autoregressive models and VAEs, as samplers.

◮ Adversarial Variational Bayes: minimizing a KL divergence with a GAN.