SLIDE 1

Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference

Louis C. Tiao¹, Edwin V. Bonilla², Fabio Ramos¹ (July 22, 2018)

¹University of Sydney, ²University of New South Wales

SLIDE 2

Motivation: Unpaired Image-to-Image Translation

[Figure: examples of unpaired image-to-image translation tasks (zebras ↔ horses, summer ↔ winter, photographs ↔ paintings in the style of Van Gogh, Cezanne, Monet, or Ukiyo-e, Monet paintings ↔ photos). Paired training data consists of corresponding examples (x_i, y_i) from domains X and Y; unpaired data consists of two independent collections.]

Figure 1: From Zhu et al. (2017)

SLIDE 3

Cycle-Consistent Adversarial Learning (CycleGAN)

  • Introduced by Kim et al. (2017); Zhu et al. (2017)
  • Forward and reverse mappings mφ : x → z and µθ : z → x
  • Discriminators Dα and Dβ

Distribution matching (GAN objectives): yield realistic outputs in the other domain.

    ℓ_gan^reverse(α; φ) = E_{p∗(z)}[log Dα(z)] + E_{q∗(x)}[log(1 − Dα(mφ(x)))],
    ℓ_gan^forward(β; θ) = E_{q∗(x)}[log Dβ(x)] + E_{p∗(z)}[log(1 − Dβ(µθ(z)))].

Cycle-consistency losses: encourage tighter correspondences; one must be able to reconstruct the output from the input and vice versa. May alleviate mode collapse.

    ℓ_const^reverse(θ, φ) = E_{q∗(x)}[∥x − µθ(mφ(x))∥_ρ^ρ],
    ℓ_const^forward(θ, φ) = E_{p∗(z)}[∥z − mφ(µθ(z))∥_ρ^ρ].
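To make the objectives concrete, the sketch below spells them out in code. It is my own illustration on toy vector data (not the authors' implementation); the names m_phi, mu_theta, D_alpha, D_beta are assumed stand-ins for the mappings and discriminators above, and ρ = 1 is assumed for the cycle losses.

```python
# A minimal PyTorch sketch (illustration only) of the four CycleGAN terms above.
# The discriminators output logits; log D = logsigmoid(logit), log(1 - D) = logsigmoid(-logit).
import torch
import torch.nn as nn

dim_x, dim_z = 2, 2

m_phi    = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, dim_z))  # x -> z
mu_theta = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))  # z -> x
D_alpha  = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, 1))      # discriminates in z-space
D_beta   = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, 1))      # discriminates in x-space

x = torch.randn(128, dim_x)  # minibatch from q*(x), one domain
z = torch.randn(128, dim_z)  # minibatch from p*(z), the other domain

logsig = nn.LogSigmoid()

# Distribution-matching (GAN) objectives
loss_gan_reverse = logsig(D_alpha(z)).mean() + logsig(-D_alpha(m_phi(x))).mean()   # l_gan^reverse(alpha; phi)
loss_gan_forward = logsig(D_beta(x)).mean() + logsig(-D_beta(mu_theta(z))).mean()  # l_gan^forward(beta; theta)

# Cycle-consistency losses (rho = 1, i.e. L1 reconstruction)
loss_const_reverse = (x - mu_theta(m_phi(x))).abs().sum(-1).mean()  # l_const^reverse(theta, phi)
loss_const_forward = (z - m_phi(mu_theta(z))).abs().sum(-1).mean()  # l_const^forward(theta, phi)
```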

SLIDE 4

Contributions

We cast the problem of learning inter-domain correspondences without paired data as approximate Bayesian inference in a latent variable model (lvm).

  1. We introduce implicit latent variable models (ilvms), in which the prior over latent variables is specified flexibly as an implicit distribution.
  2. We develop a new variational inference (vi) algorithm based on minimizing the symmetric Kullback-Leibler (kl) divergence between a variational and exact joint distribution.
  3. We demonstrate that CycleGAN (Kim et al., 2017; Zhu et al., 2017) can be instantiated as a special case of our framework.

SLIDE 5

Implicit Latent Variable Models

Joint distribution: pθ(x, z) = pθ(x | z) p∗(z), with pθ(x | z) the likelihood and p∗(z) the prior.

[Graphical model: plate over n = 1, …, N containing the latent z_n and observed x_n, with parameters θ.]

Prescribed likelihood: the likelihood pθ(x_n | z_n) is prescribed (as usual).

Implicit prior: the prior p∗(z) over the latent variables is specified as an implicit distribution.

  • Given only by a finite collection Z∗ = {z∗_m}_{m=1}^M of its samples, z∗_m ∼ p∗(z).
  • Offers the utmost degree of flexibility in the treatment of prior information.
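As a concrete picture, here is a minimal Python sketch (an illustration under toy assumptions, not the authors' setup): the implicit prior exists only as a fixed array of samples Z∗ that we can subsample, while the prescribed likelihood has an explicit density we can evaluate.

```python
# Minimal sketch of an implicit latent variable model on toy vector data
# (names and dimensions are illustrative assumptions, not from the paper).
import torch
import torch.nn as nn

dim_x, dim_z = 2, 2
mu_theta = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))  # mean of p_theta(x | z)
tau = 0.1                                                                         # likelihood scale

# Implicit prior: all we have is a finite collection Z* = {z*_m} of samples.
Z_star = torch.randn(10_000, dim_z)  # stand-in for, e.g., images from one domain

def sample_prior(batch_size):
    """Draw a minibatch from p*(z): subsampling Z* is the only supported operation."""
    idx = torch.randint(len(Z_star), (batch_size,))
    return Z_star[idx]

# Prescribed likelihood: p_theta(x | z) = N(x | mu_theta(z), tau^2 I).
def log_likelihood(x, z):
    """Explicit density of the prescribed Gaussian likelihood."""
    return torch.distributions.Normal(mu_theta(z), tau).log_prob(x).sum(-1)
```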

SLIDE 6

Implicit Latent Variable Models: Example

Unpaired Image-to-Image Translation

  • Prior distribution p∗(z) specified by images Z∗ = {z∗_m}_{m=1}^M from one domain.
  • Empirical data distribution q∗(x) specified by images X∗ = {x_n}_{n=1}^N from another domain.

[Figure: (a) samples from p∗(z); (b) a sample from q∗(x).]

SLIDE 7

Inference in Implicit Latent Variable Models

Having specified the generative model, our aims are

  • Optimize θ by maximizing marginal likelihood pθ(x)
  • Infer hidden representations z by computing posterior pθ(z | x)

Both require intractable pθ(x)

  • must resort to approximate inference

Classical Variational Inference

  • Approximate the exact posterior pθ(z | x) with a variational posterior qφ(z | x)

  • Reduces inference problem to optimization problem

    min_φ kl[qφ(z | x) ∥ pθ(z | x)]
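For context, the standard identity behind this reduction (a well-known step, added here for completeness rather than taken from the slides) is:

```latex
% Decomposition of the log marginal likelihood:
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\varphi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\varphi(z \mid x)}\right]}_{\text{ELBO}}
  + \mathrm{KL}\!\left[q_\varphi(z \mid x) \,\|\, p_\theta(z \mid x)\right].
% Since \log p_\theta(x) does not depend on \varphi, minimizing the KL term over
% \varphi is equivalent to maximizing the ELBO, which never requires evaluating
% the intractable p_\theta(x).
```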

SLIDE 8

Symmetric Joint-Matching Variational Inference

SLIDE 9

Joint-Matching Variational Inference

Variational Joint

  • Consider instead directly approximating the exact joint with the variational joint qφ(x, z) = qφ(z | x) q∗(x)
  • The variational posterior qφ(z | x) is also prescribed

[Graphical model: plate over n = 1, …, N with observed x_n and latent z_n; parameters φ and θ.]

SLIDE 10

Symmetric Joint-Matching Variational Inference

Minimize the symmetric kl divergence between the joints, kl_symm[pθ(x, z) ∥ qφ(x, z)], where

    kl_symm[p ∥ q] = kl[p ∥ q] + kl[q ∥ p],

i.e. the sum of the forward kl, kl[p ∥ q], and the reverse kl, kl[q ∥ p].

Why?

  1. Because we can:
     • kl_symm[pθ(x, z) ∥ qφ(x, z)] is tractable
     • kl_symm[pθ(z | x) ∥ qφ(z | x)] is intractable
  2. It helps avoid under- or over-dispersed approximations (see the paper for details)

SLIDE 11

Reverse kl Variational Objective

  • Minimizing the reverse kl divergence between the joints is equivalent to maximizing the usual evidence lower bound (elbo):

    kl[qφ(x, z) ∥ pθ(x, z)] = E_{qφ(x,z)}[log qφ(x, z) − log pθ(x, z)]
                            = E_{qφ(x,z)}[log qφ(z | x) − log pθ(x, z)]   (= L_nelbo(θ, φ))
                              − H[q∗(x)]   (constant)

  • Recall the (negative) elbo:

    L_nelbo(θ, φ) = E_{q∗(x) qφ(z | x)}[− log pθ(x | z)]   (= L_nell(θ, φ))
                    + E_{q∗(x)} kl[qφ(z | x) ∥ p∗(z)]   (intractable)

  • The kl term is intractable because the prior p∗(z) is unavailable; we can only sample from it!

SLIDE 12

Forward kl Variational Objective

  • Minimizing the forward kl divergence between the joints:

    kl[pθ(x, z) ∥ qφ(x, z)] = E_{pθ(x,z)}[log pθ(x, z) − log qφ(x, z)]
                            = E_{pθ(x,z)}[log pθ(x | z) − log qφ(x, z)]   (= L_naplbo(θ, φ))
                              − H[p∗(z)]   (constant)

  • This gives a new variational objective, the aggregate posterior lower bound (aplbo):

    L_naplbo(θ, φ) = E_{p∗(z) pθ(x | z)}[− log qφ(z | x)]   (= L_nelp(θ, φ))
                     + E_{p∗(z)} kl[pθ(x | z) ∥ q∗(x)]   (intractable)

  • The kl term is intractable because the empirical data distribution q∗(x) is unavailable; we can only sample from it!

SLIDE 13

Density Ratio Estimation and f-divergence Approximation

General f-divergence lower bound (Nguyen et al., 2010): for a convex, lower-semicontinuous function f : R+ → R,

    E_{q∗(x)} D_f[p∗(z) ∥ qφ(z | x)]   (intractable)   ≥   max_α L_f^latent(α; φ)   (tractable),

where

    L_f^latent(α; φ) = E_{q∗(x) qφ(z | x)}[f′(rα(z; x))] − E_{q∗(x) p∗(z)}[f⋆(f′(rα(z; x)))].

  • Turns divergence estimation into an optimization problem.
  • Estimates the divergence using a lower bound that requires only samples!
  • rα is a neural net with parameters α, with equality at r∗_α(z; x) = qφ(z | x) / p∗(z).
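The step from this general bound to the kl instance on the next slide is mechanical; here is a short worked derivation (my addition) for f(u) = u log u:

```latex
% For f(u) = u log u:  f'(u) = \log u + 1, and the convex conjugate is
% f^\star(t) = \sup_u \{ u t - u \log u \} = e^{t - 1}, so f^\star(f'(r)) = r.
% Substituting into L_f^latent:
\mathcal{L}^{\text{latent}}_{\text{kl}}(\alpha; \varphi)
  = \mathbb{E}_{q^*(x)\, q_\varphi(z \mid x)}\!\left[\log r_\alpha(z; x) + 1\right]
    - \mathbb{E}_{q^*(x)\, p^*(z)}\!\left[r_\alpha(z; x)\right]
  = \mathbb{E}_{q^*(x)\, q_\varphi(z \mid x)}\!\left[\log r_\alpha(z; x)\right]
    - \mathbb{E}_{q^*(x)\, p^*(z)}\!\left[r_\alpha(z; x) - 1\right].
```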

SLIDE 14

kl divergence lower bound

Example: kl divergence lower bound. For f(u) = u log u, we instantiate the kl lower bound

    E_{q∗(x)} kl[qφ(z | x) ∥ p∗(z)]   (intractable)   ≥   max_α L_kl^latent(α; φ)   (tractable),

where

    L_kl^latent(α; φ) = E_{q∗(x) qφ(z | x)}[log rα(z; x)] − E_{q∗(x) p∗(z)}[rα(z; x) − 1].

This yields an estimate of the elbo in which all terms are tractable:

    L_nelbo(θ, φ) = L_nell(θ, φ)   (tractable)   + E_{q∗(x)} kl[qφ(z | x) ∥ p∗(z)]   (intractable)
                  ≥ max_α { L_nell(θ, φ)   (tractable)   + L_kl^latent(α; φ)   (tractable) }
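A minimal Python sketch (my illustration, not the authors' code) of how this tractable bound can be estimated from samples alone; the parameterization rα(z; x) = exp(Tα(z, x)) with a small neural net Tα, and a deterministic encoder m_phi standing in for qφ(z | x), are both assumptions of the sketch.

```python
# Sketch of the density-ratio-based KL lower bound L_kl^latent(alpha; phi),
# on toy vector data (names and shapes are illustrative assumptions).
import torch
import torch.nn as nn

dim_x, dim_z = 2, 2
m_phi   = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, dim_z))      # q_phi(z | x), here deterministic
T_alpha = nn.Sequential(nn.Linear(dim_z + dim_x, 64), nn.ReLU(), nn.Linear(64, 1))  # log r_alpha(z; x)

x      = torch.randn(128, dim_x)  # minibatch from q*(x)
z_star = torch.randn(128, dim_z)  # minibatch from the implicit prior p*(z)

def log_ratio(z, x):
    """Neural estimate of log r_alpha(z; x) = log q_phi(z | x) - log p*(z)."""
    return T_alpha(torch.cat([z, x], dim=-1)).squeeze(-1)

z_q = m_phi(x)  # samples from q_phi(z | x)

# L_kl^latent(alpha; phi) = E_{q*(x) q_phi(z|x)}[log r_alpha] - E_{q*(x) p*(z)}[r_alpha - 1]
L_latent_kl = log_ratio(z_q, x).mean() - (log_ratio(z_star, x).exp() - 1.0).mean()

# Maximizing L_latent_kl over alpha tightens the bound; adding the result to
# L_nell(theta, phi) gives a fully tractable surrogate for L_nelbo(theta, phi).
```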

SLIDE 15

CycleGAN as a Special Case

SLIDE 16

Cycle-consistency as Conditional Probability Maximization

For a Gaussian likelihood and variational posterior,

    pθ(x | z) = N(x | µθ(z), τ²I),   qφ(z | x) = N(z | mφ(x), t²I),

we can instantiate ℓ_const^reverse(θ, φ) from L_nell(θ, φ) as the posterior qφ(z | x) degenerates (t → 0), and ℓ_const^forward(θ, φ) from L_nelp(θ, φ) as the likelihood pθ(x | z) degenerates (τ → 0).

Cycle-consistency corresponds to maximizing conditional probabilities:

  • ell: forces qφ(z | x) to place mass on hidden representations that recover the data
  • elp: forces pθ(x | z) to generate observations that recover the prior
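To make the first correspondence explicit, here is a short worked limit (my addition, under the Gaussian assumptions above): as qφ(z | x) collapses onto its mean mφ(x), the expected negative log-likelihood L_nell reduces to a scaled ℓ2 reconstruction error, i.e. ℓ_const^reverse with ρ = 2.

```latex
% With p_\theta(x \mid z) = N(x \mid \mu_\theta(z), \tau^2 I) and
% q_\varphi(z \mid x) \to \delta(z - m_\varphi(x)) as t \to 0:
\mathcal{L}_{\text{nell}}(\theta, \varphi)
  = \mathbb{E}_{q^*(x)\, q_\varphi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]
  \;\longrightarrow\;
  \frac{1}{2\tau^2}\, \mathbb{E}_{q^*(x)}\!\left[\lVert x - \mu_\theta(m_\varphi(x)) \rVert_2^2\right] + \text{const}.
% This is l_const^reverse(theta, phi) with rho = 2, up to scale and an additive
% constant; the forward case is symmetric (tau -> 0 recovers l_const^forward from L_nelp).
```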

SLIDE 17

Distribution Matching as Regularization

For an appropriate setting of f, and simplifying the mappings and discriminators:

  • Can instantiate ℓ_gan^reverse(α; φ) from L_f^latent(α; φ)
  • Can instantiate ℓ_gan^forward(β; θ) from L_f^observed(β; θ)

This approximately minimizes the intractable divergences:

  • D_f[p∗(z) ∥ qφ(z | x)], which forces qφ(z | x) to match the prior p∗(z)
  • D_f[q∗(x) ∥ pθ(x | z)], which forces pθ(x | z) to match the data q∗(x)

Summary:

    L_nelbo(θ, φ)  ≥ max_α  L_nell(θ, φ)   (corresponds to ℓ_const^reverse(θ, φ))   + L_kl^latent(α; φ)   (corresponds to ℓ_gan^reverse(α; φ))
    L_naplbo(θ, φ) ≥ max_β  L_nelp(θ, φ)   (corresponds to ℓ_const^forward(θ, φ))   + L_kl^observed(β; θ)   (corresponds to ℓ_gan^forward(β; θ))

SLIDE 18

Conclusion

  • Formulated implicit latent variable models, which introduce an implicit prior over latent variables
    • Offers the utmost degree of flexibility in incorporating prior knowledge
  • Developed a new paradigm for variational inference that
    • directly approximates the exact joint distribution
    • minimizes the symmetric kl divergence
  • Provided a theoretical treatment of the links between CycleGAN methods and Variational Bayes

Poster Session: to find out more, come visit us at our poster! Poster #14, Session 4 (17:10-18:00, Saturday, 14 July)

SLIDE 19

Questions?

SLIDE 20

References

Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. (2017). Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 1857–1865.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization. IEEE Transactions on Information Theory, 56(11):5847–5861.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV).

SLIDE 21

Symmetric Joint-Matching kl Minimization i

  • The kl divergence is asymmetric: kl[p ∥ q] ≠ kl[q ∥ p]
  • kl[qφ(z | x) ∥ pθ(z | x)] (reverse) underestimates support
  • kl[pθ(z | x) ∥ qφ(z | x)] (forward) overestimates support
  • Consider the symmetric kl: kl_symm[p ∥ q] = kl[p ∥ q] + kl[q ∥ p]
  • The forward kl involves an expectation under the intractable posterior pθ(z | x), which is exactly what we are trying to approximate in the first place:

    kl[pθ(z | x) ∥ qφ(z | x)] = E_{pθ(z | x)}[log (pθ(z | x) / qφ(z | x))]

SLIDE 22

Symmetric Joint-Matching kl Minimization ii

  • Can show:

    arg min_φ kl[qφ(z | x) ∥ pθ(z | x)] = arg min_φ kl[qφ(x, z) ∥ pθ(x, z)],
    arg min_φ kl[pθ(z | x) ∥ qφ(z | x)] = arg min_φ kl[pθ(x, z) ∥ qφ(x, z)].

  • Already showed:

    arg max_φ L_elbo(θ, φ) = arg min_φ kl[qφ(x, z) ∥ pθ(x, z)].

  • Can we find something similar for kl[pθ(x, z) ∥ qφ(x, z)]?
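One way to see the first identity (a standard decomposition, spelled out here as my addition) uses qφ(x, z) = qφ(z | x) q∗(x) and pθ(x, z) = pθ(z | x) pθ(x):

```latex
\mathrm{kl}\!\left[q_\varphi(x, z) \,\|\, p_\theta(x, z)\right]
  = \mathbb{E}_{q^*(x)}\, \mathrm{kl}\!\left[q_\varphi(z \mid x) \,\|\, p_\theta(z \mid x)\right]
  + \underbrace{\mathrm{kl}\!\left[q^*(x) \,\|\, p_\theta(x)\right]}_{\text{constant in } \varphi}.
% Only the first term depends on phi, so the arg min over phi coincides.
% Analogously, kl[p_theta(x,z) || q_phi(x,z)]
%   = E_{p_theta(x)} kl[p_theta(z|x) || q_phi(z|x)] + kl[p_theta(x) || q*(x)],
% where again only the first term depends on phi.
```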