SLIDE 1
LAB MEETING: A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning and Energy-Based Models
Suwon Suh, POSTECH MLG
Feb 13, 2017

Goal: Understanding Basic Models
1) Generative Adversarial Networks (GAN)
2) Energy-Based Models (EBM)
3) Inverse Reinforcement Learning (IRL)
SLIDE 2
SLIDE 3
GAN []
A generative model in an adversarial setting
◮ Generative model with a discriminator:

  \min_G \max_D V(G, D) = \mathbb{E}_{x \sim P}[\log D(x)] + \mathbb{E}_{z \sim \mathrm{Unif}}[\log(1 - D(G(z)))],

rewriting it as

  \min_G \max_D V(G, D) = \mathbb{E}_{x \sim P}[\log D(x)] + \mathbb{E}_{x \sim Q}[\log(1 - D(x))],

where P is the data distribution and Q is the distribution of the generator.
◮ Optimal discriminator D^* for a fixed G:

  D^*(x) = \frac{P(x)}{P(x) + Q(x)}    (1)
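As a reminder of where Eq. (1) comes from (a standard derivation, not spelled out on the slide), maximize V pointwise in D(x) for a fixed G:

  V(G, D) = \int \bigl[ P(x) \log D(x) + Q(x) \log(1 - D(x)) \bigr] \, dx

  \frac{\partial}{\partial d} \bigl[ P(x) \log d + Q(x) \log(1 - d) \bigr] = \frac{P(x)}{d} - \frac{Q(x)}{1 - d} = 0 \;\Rightarrow\; d = \frac{P(x)}{P(x) + Q(x)} = D^*(x)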
SLIDE 4
A Variant of GAN minimizing KL[Q||P]
◮ The loss function for the discriminator:

  \mathrm{Loss}(D) = \mathbb{E}_{x \sim P}[-\log D(x)] + \mathbb{E}_{x \sim Q}[-\log(1 - D(x))]

◮ The original loss function for the generator []:

  \mathrm{Loss}_{\mathrm{org}}(G) = \mathbb{E}_{x \sim Q}[\log(1 - D(x))]

Early in training the discriminator confidently rejects generated samples, so D(x) \approx 0 and \log(1 - D(x)) \approx \log 1 = 0; since \frac{d}{du} \log u = 1/u is not steep at u = 1, the generator learns slowly at first (see the numeric illustration after this list), which motivates the alternative loss

  \mathrm{Loss}_{\mathrm{alter}}(G) = -\mathbb{E}_{x \sim Q}[\log D(x)]

◮ We can use both []:

  L_{\mathrm{gen}}(G) = \mathrm{Loss}_{\mathrm{org}}(G) + \mathrm{Loss}_{\mathrm{alter}}(G) = \mathbb{E}_{x \sim Q}\left[\log \frac{1 - D(x)}{D(x)}\right]
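A quick numeric illustration of the saturation argument above (my own example, not from the slides): when D(x) is near 0 on generated samples, the alternative loss has a far steeper gradient in D than the original one.

import numpy as np

# Early in training the discriminator confidently rejects samples, so D(x) ~ 0.
d = np.array([0.01, 0.1, 0.5])

# |d/dD log(1 - D)| = 1 / (1 - D): nearly flat (about 1) when D is small.
grad_org = 1.0 / (1.0 - d)

# |d/dD (-log D)| = 1 / D: very steep when D is small.
grad_alter = 1.0 / d

for di, go, ga in zip(d, grad_org, grad_alter):
    print(f"D(x)={di:.2f}  |grad org|={go:6.2f}  |grad alter|={ga:6.2f}")
# At D(x)=0.01 the alternative loss has roughly a 100x larger gradient magnitude.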
SLIDE 5
A Variant of GAN minimizing KL[Q||P]
◮ Huszár observes that "it minimizes KL[Q||P] when D is near D^*" []:

  \mathbb{E}_{x \sim Q}\left[\log \frac{1 - D(x)}{D(x)}\right] \approx \mathbb{E}_{x \sim Q}\left[\log \frac{1 - D^*(x)}{D^*(x)}\right] = \mathbb{E}_{x \sim Q}\left[\log \frac{Q(x)}{P(x)}\right] = KL[Q \| P],

by invoking Eq. (1).
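Spelling out the middle equality via Eq. (1):

  \frac{1 - D^*(x)}{D^*(x)} = \frac{Q(x) / (P(x) + Q(x))}{P(x) / (P(x) + Q(x))} = \frac{Q(x)}{P(x)}, \quad\text{so}\quad \mathbb{E}_{x \sim Q}\left[\log \frac{1 - D^*(x)}{D^*(x)}\right] = \mathbb{E}_{x \sim Q}\left[\log \frac{Q(x)}{P(x)}\right] = KL[Q \| P]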
SLIDE 6
Energy Based Models (EBMs)
◮ Every configuration x \in \mathbb{R}^D has a corresponding energy E_\theta(x).
◮ By normalizing, we can define a probability density function (pdf):

  p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \quad \text{where } Z(\theta) = \int \exp(-E_\theta(x')) \, dx'

◮ How to learn the parameters \theta?

  \log p_\theta(x) = -E_\theta(x) - \log Z(\theta)

◮ There are too many configurations to integrate over, so Z(\theta) must be estimated with samples from Markov chain Monte Carlo (MCMC):
1) Contrastive Divergence (CD) uses only one K-step sample from an MCMC chain.
2) Persistent CD maintains multiple chains to sample from the model during learning with Stochastic Gradient Descent (SGD).
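Since \log p_\theta(x) = -E_\theta(x) - \log Z(\theta), the gradient of the negative log-likelihood is \nabla_\theta E_\theta(x) - \mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')], and CD approximates the model expectation with a K-step MCMC sample started at the data point. A minimal runnable CD sketch on a toy one-parameter energy (my own construction, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

def energy(x, theta):
    # Toy energy E_theta(x) = 0.5 * (x - theta)^2, i.e. p_theta = N(theta, 1)
    return 0.5 * (x - theta) ** 2

def grad_theta_energy(x, theta):
    # dE_theta/dtheta = -(x - theta)
    return -(x - theta)

def cd_k_sample(x0, theta, k=10, step=0.5):
    # K steps of random-walk Metropolis targeting p_theta, started at the data point
    x = x0
    for _ in range(k):
        prop = x + step * rng.standard_normal()
        if rng.random() < np.exp(energy(x, theta) - energy(prop, theta)):
            x = prop
    return x

data = rng.normal(2.0, 1.0, size=2000)    # true parameter: theta = 2
theta, lr = 0.0, 0.05
for x in data:                            # SGD on the negative log-likelihood
    x_model = cd_k_sample(x, theta)
    # grad(-log p) ~ dE/dtheta at the data point minus dE/dtheta at the model sample
    grad = grad_theta_energy(x, theta) - grad_theta_energy(x_model, theta)
    theta -= lr * grad
print(f"learned theta ~ {theta:.2f} (true 2.0)")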
SLIDE 7
Inverse Reinforcement Learning
Inverse Reinforcement Learning (IRL)
Given states X, actions U, dynamics P(x_{t+1} | x_t, u_t), and a discount factor \gamma in an MDP (X, U, P, c_\theta, \gamma), together with demonstrations from experts, we need to find the cost (negative reward) c_\theta.
◮ Maximum entropy inverse reinforcement learning (MaxEnt IRL) models demonstrations with a Boltzmann distribution:

  p_\theta(\tau) = \frac{\exp(-c_\theta(\tau))}{Z}, \quad \tau = \{x_1, u_1, \ldots, x_T, u_T\} \text{ a trajectory}, \quad c_\theta(\tau) = \sum_t c_\theta(x_t, u_t)

◮ Guided cost learning (GCL), where the partition function Z is approximated by importance sampling:

  L_{\mathrm{cost}}(\theta) = \mathbb{E}_{\tau \sim p}[-\log p_\theta(\tau)] = \mathbb{E}_{\tau \sim p}[c_\theta(\tau)] + \log Z = \mathbb{E}_{\tau \sim p}[c_\theta(\tau)] + \log\left(\mathbb{E}_{\tau \sim q}\left[\frac{\exp(-c_\theta(\tau))}{q(\tau)}\right]\right)
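A tiny numeric sanity check of the importance-sampling identity Z = \mathbb{E}_{\tau \sim q}[\exp(-c_\theta(\tau)) / q(\tau)] (a 1-D stand-in for trajectory space of my own construction; the closed-form Z makes the estimate checkable):

import numpy as np

rng = np.random.default_rng(0)

def cost(tau):
    # hypothetical 1-D "trajectory cost" c(tau) = tau^2 / 2,
    # so Z = integral of exp(-c) = sqrt(2*pi)
    return 0.5 * tau ** 2

def q_pdf(tau):
    # sampler q = N(0, 2^2); GCL needs this density to be evaluable
    return np.exp(-(tau ** 2) / 8.0) / np.sqrt(8.0 * np.pi)

taus = rng.normal(0.0, 2.0, size=100_000)        # tau ~ q
weights = np.exp(-cost(taus)) / q_pdf(taus)      # exp(-c(tau)) / q(tau)
print(f"log Z estimate: {np.log(weights.mean()):.4f} "
      f"(true: {0.5 * np.log(2 * np.pi):.4f})")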
SLIDE 8
Inverse Reinforcement Learning
GCL needs to match the sampling distribution q(\tau) to the model distribution p_\theta(\tau):

  L_{\mathrm{sampler}}(q) = KL[q(\tau) \| p_\theta(\tau)]

Keeping only the terms that depend on q:

  L_{\mathrm{sampler}}(q) = \mathbb{E}_{\tau \sim q}[c_\theta(\tau)] + \mathbb{E}_{\tau \sim q}[\log q(\tau)]

Modifying the sampling distribution with a mixture
To reduce the variance of the estimator of Z based on q alone, \mu = \frac{1}{2} p + \frac{1}{2} q is used as the sampling distribution:

  L_{\mathrm{cost}}(\theta) = \mathbb{E}_{\tau \sim p}[c_\theta(\tau)] + \log\left(\mathbb{E}_{\tau \sim \mu}\left[\frac{\exp(-c_\theta(\tau))}{\frac{1}{2} \tilde p(\tau) + \frac{1}{2} q(\tau)}\right]\right),

where \tilde p is a rough estimate of the density of the demonstrations using the current model p_\theta.
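A toy check of the variance-reduction claim (my construction, not from the slides): with demonstrations from N(0,1), cost c(x) = x^2/2 (so the true Z is sqrt(2*pi)), and a poorly matched sampler q = N(3,1), the mixture estimator is far less noisy than q-only importance sampling, because the mixture density bounds the weights.

import numpy as np

rng = np.random.default_rng(0)

def cost(x):
    # model cost c(x) = x^2 / 2, so exp(-c) integrates to Z = sqrt(2*pi)
    return 0.5 * x ** 2

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def estimate_Z(n, use_mixture):
    if use_mixture:
        # half demonstrations (p~ = N(0,1), assumed known), half sampler draws
        x = np.concatenate([rng.normal(0, 1, n // 2), rng.normal(3, 1, n // 2)])
        dens = 0.5 * normal_pdf(x, 0, 1) + 0.5 * normal_pdf(x, 3, 1)
    else:
        x = rng.normal(3, 1, n)   # importance sampling with the mismatched q only
        dens = normal_pdf(x, 3, 1)
    return np.mean(np.exp(-cost(x)) / dens)

for use_mixture in (False, True):
    zs = [estimate_Z(1000, use_mixture) for _ in range(200)]
    print(f"mixture={use_mixture}: mean={np.mean(zs):.3f}, std={np.std(zs):.3f} "
          f"(true Z={np.sqrt(2 * np.pi):.3f})")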
SLIDE 9
Model
(Idea) Explicitly model the discriminator D in the form of the optimal discriminator D^*.
We assume p is the data distribution, \tilde p_\theta is the model distribution parameterized by \theta, and q is the sampling distribution.
◮ Before: D^* = \frac{p(\tau)}{p(\tau) + q(\tau)}
◮ After: D_\theta = \frac{\tilde p_\theta(\tau)}{\tilde p_\theta(\tau) + q(\tau)} = \frac{\frac{1}{Z} \exp(-c_\theta(\tau))}{\frac{1}{Z} \exp(-c_\theta(\tau)) + q(\tau)}
◮ Why an EBM as the model distribution? A Product of Experts (PoE) can capture modes and put less density between the modes than a Mixture of Experts (MoE) of similar capacity.
◮ We need to evaluate the sampling density q(\tau) efficiently in order to learn: autoregressive models, normalizing flows, or MoE (a log-space evaluation sketch follows below).
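Evaluating D_\theta naively can underflow or overflow because of \exp(-c_\theta(\tau)) / Z, so in practice it is computed in log space. A small helper sketch (function and variable names are my own, not from the slides), using np.logaddexp(a, b) = log(exp(a) + exp(b)):

import numpy as np

def discriminator_log_probs(log_p_tilde, log_q):
    # log D_theta and log(1 - D_theta), given log p~_theta(tau) = -c_theta(tau) - log Z
    # and log q(tau); logaddexp computes log(p~ + q) without leaving log space
    log_denom = np.logaddexp(log_p_tilde, log_q)
    return log_p_tilde - log_denom, log_q - log_denom

log_D, log_1mD = discriminator_log_probs(np.array([-3.0]), np.array([-1.0]))
print(np.exp(log_D), np.exp(log_1mD))   # D ~ 0.12, 1 - D ~ 0.88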
SLIDE 10
Equivalence between GAN and GCL
◮ Loss from the variant of GAN:

  L_{\mathrm{disc}}(\theta) = \mathbb{E}_{\tau \sim p}[-\log D_\theta(\tau)] + \mathbb{E}_{\tau \sim q}[-\log(1 - D_\theta(\tau))]
  = \mathbb{E}_{\tau \sim p}\left[-\log \frac{\frac{1}{Z} \exp(-c_\theta(\tau))}{\frac{1}{Z} \exp(-c_\theta(\tau)) + q(\tau)}\right] + \mathbb{E}_{\tau \sim q}\left[-\log \frac{q(\tau)}{\frac{1}{Z} \exp(-c_\theta(\tau)) + q(\tau)}\right]

◮ Loss from GCL:

  L_{\mathrm{cost}}(\theta) = \mathbb{E}_{\tau \sim p}[c_\theta(\tau)] + \log\left(\mathbb{E}_{\tau \sim \mu}\left[\frac{\exp(-c_\theta(\tau))}{\frac{1}{2} \tilde p(\tau) + \frac{1}{2} q(\tau)}\right]\right)

◮ Equivalence:
1) The value of Z that minimizes L_{\mathrm{disc}} is the importance-sampling estimator of the partition function (a sketch follows below).
2) For this value of Z, the derivative of L_{\mathrm{disc}}(\theta) with respect to \theta equals the derivative of L_{\mathrm{cost}}(\theta).
3) The derivative of L_{\mathrm{gen}}(q) with respect to q equals the derivative of L_{\mathrm{sampler}}(q).
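A sketch of why 1) holds (my derivation, not on the slide; write \tilde p_\theta(\tau) = \exp(-c_\theta(\tau)) / Z and let \mu = \frac{1}{2} p + \frac{1}{2} q be the data-sampler mixture). Since \partial \tilde p_\theta / \partial Z = -\tilde p_\theta / Z,

  \frac{\partial L_{\mathrm{disc}}}{\partial Z} = \frac{1}{Z}\left(1 - \mathbb{E}_{\tau \sim p}\left[\frac{\tilde p_\theta}{\tilde p_\theta + q}\right] - \mathbb{E}_{\tau \sim q}\left[\frac{\tilde p_\theta}{\tilde p_\theta + q}\right]\right) = \frac{1}{Z}\left(1 - 2\,\mathbb{E}_{\tau \sim \mu}\left[\frac{\tilde p_\theta}{\tilde p_\theta + q}\right]\right)

Setting this to zero and substituting \tilde p_\theta = \exp(-c_\theta) / Z gives

  Z = \mathbb{E}_{\tau \sim \mu}\left[\frac{\exp(-c_\theta(\tau))}{\frac{1}{2} \tilde p_\theta(\tau) + \frac{1}{2} q(\tau)}\right],

which is exactly the importance-sampling estimator appearing in L_{\mathrm{cost}}, with \tilde p_\theta in the role of \tilde p.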
SLIDE 11
Training EBMs with GANs
Why?
As PoEs, EBMs are good at modeling complicated manifolds. However, MCMC yields correlated rather than independent samples. This method instead directly learns an effective sampling distribution (a toy run of all three updates follows below).
◮ Update the partition function with importance sampling:

  Z \leftarrow \mathbb{E}_{x \sim \mu}\left[\frac{\exp(-E_\theta(x))}{\frac{1}{2} \tilde p(x) + \frac{1}{2} q(x)}\right]

◮ Update the model parameters with SGD:

  L_{\mathrm{energy}}(\theta) = \mathbb{E}_{x \sim p}[E_\theta(x)] + \log\left(\mathbb{E}_{x \sim \mu}\left[\frac{\exp(-E_\theta(x))}{\frac{1}{2} \tilde p(x) + \frac{1}{2} q(x)}\right]\right)

◮ Update the sampler parameters with SGD:

  L_{\mathrm{sampler}}(q) = \mathbb{E}_{x \sim q}[E_\theta(x)] + \mathbb{E}_{x \sim q}[\log q(x)]
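Putting the three updates together on a toy problem (entirely my own construction, not the paper's setup: a one-parameter Gaussian energy, data from N(2,1), and a Gaussian sampler with a learnable mean; the sampler's entropy term \mathbb{E}_{x \sim q}[\log q(x)] is constant here because the sampler variance is fixed):

import numpy as np

rng = np.random.default_rng(0)

def energy(x, theta):                      # E_theta(x) = 0.5 * (x - theta)^2
    return 0.5 * (x - theta) ** 2

def dE_dtheta(x, theta):                   # dE_theta/dtheta = -(x - theta)
    return -(x - theta)

def normal_pdf(x, mu):                     # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

data = rng.normal(2.0, 1.0, size=100_000)  # demonstrations p = N(2, 1)
theta, m, lr, n = 0.0, -2.0, 0.05, 512     # model parameter, sampler mean

for step in range(300):
    x_data = rng.choice(data, n)
    x_mix = np.concatenate([rng.choice(data, n // 2),     # half demonstrations,
                            rng.normal(m, 1.0, n // 2)])  # half sampler draws
    # p~ is taken to be the current model density, q the sampler density
    mix_dens = 0.5 * normal_pdf(x_mix, theta) + 0.5 * normal_pdf(x_mix, m)

    # 1) partition function via importance sampling under mu
    w = np.exp(-energy(x_mix, theta)) / mix_dens
    Z = w.mean()

    # 2) model update: E_p[dE/dtheta] minus a self-normalised IS estimate
    #    of the model expectation of dE/dtheta
    grad_theta = dE_dtheta(x_data, theta).mean() - np.sum(
        (w / w.sum()) * dE_dtheta(x_mix, theta))
    theta -= lr * grad_theta

    # 3) sampler update via reparameterisation x = m + eps:
    #    grad_m E_q[E_theta(x)] = E[x - theta]; the entropy term is constant in m
    x_q = rng.normal(m, 1.0, size=n)
    m -= lr * np.mean(x_q - theta)

print(f"theta ~ {theta:.2f} (data mean 2.0), sampler mean ~ {m:.2f}, "
      f"Z ~ {Z:.2f} (true {np.sqrt(2 * np.pi):.2f})")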
SLIDE 12