Bregman and Wasserstein, with Applications to Generative Adversarial Networks (GANs)

Outline
- Bregman Divergence Function
- Generative Adversarial Networks (GANs)
- Wasserstein Divergence and GANs
- Relaxed Wasserstein (moment estimate, concentration inequality, and duality; continuity and differentiability; gradient descent scheme)
- Empirical Results (experiment setup; MNIST and Fashion-MNIST; CIFAR-10 and ImageNet)
- Conclusions


What is a Bregman Divergence Function (BDF)?
Let φ: R^d → R be a strictly convex, differentiable function. The BDF D_φ: R^d × R^d → R is defined as
  D_φ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩.
For any x, y ∈ R^d, D_φ(x, y) ≥ 0, with equality iff x = y.

Some examples of BDFs
- φ(x) = x², then D_φ(x, y) = (x − y)².
- Let p = (p_1, …, p_d) be a probability distribution, Σ_{j=1}^d p_j = 1. Then φ(p) = Σ_{j=1}^d p_j log p_j (the negative Shannon entropy) is strictly convex on the d-simplex. For another probability distribution q = (q_1, …, q_d),
    D_φ(p, q) = Σ_{j=1}^d p_j log p_j − Σ_{j=1}^d q_j log q_j − ⟨p − q, ∇φ(q)⟩ = Σ_{j=1}^d p_j log(p_j / q_j),
  which is the KL divergence between p and q.
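
A quick numerical check of both examples, as a minimal NumPy sketch (my own illustration, not from the talk; the helper names are made up):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Generic Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# Example 1: phi(x) = ||x||^2 gives the squared Euclidean distance.
sq = lambda x: np.dot(x, x)
grad_sq = lambda x: 2.0 * x
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(bregman(sq, grad_sq, x, y), np.sum((x - y) ** 2))       # identical values

# Example 2: phi(p) = sum_j p_j log p_j (negative entropy) gives the KL divergence.
negent = lambda p: np.sum(p * np.log(p))
grad_negent = lambda p: np.log(p) + 1.0
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p / q)))  # identical values
```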

Proof of sufficiency
Claim: among all G-measurable random variables Y, the conditional expectation Y* := E[X | G] minimizes E[D_φ(X, Y)].
Let Y be any G-measurable random variable and Y* := E[X | G]. Then
  E[D_φ(X, Y)] − E[D_φ(X, Y*)] = E[ φ(Y*) − φ(Y) − ⟨X − Y, ∇φ(Y)⟩ + ⟨X − Y*, ∇φ(Y*)⟩ ].
Notice that, by conditioning on G,
  E[⟨X − Y, ∇φ(Y)⟩] = E[ E[⟨X − Y, ∇φ(Y)⟩ | G] ] = E[⟨Y* − Y, ∇φ(Y)⟩],
and, applying the same identity with Y = Y*,
  E[⟨X − Y*, ∇φ(Y*)⟩] = 0.
Therefore
  E[D_φ(X, Y)] − E[D_φ(X, Y*)] = E[ φ(Y*) − φ(Y) − ⟨Y* − Y, ∇φ(Y)⟩ ] = E[D_φ(Y*, Y)] ≥ 0.
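
A small Monte Carlo illustration of the property being proved, with φ(x) = x² and the σ-field G generated by a binary label (my own sketch, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
# G is generated by a binary label S; X depends on S.
S = rng.integers(0, 2, size=100_000)
X = np.where(S == 0, rng.normal(0.0, 1.0, S.size), rng.normal(3.0, 1.0, S.size))

phi = lambda x: x ** 2
grad_phi = lambda x: 2.0 * x
D = lambda x, y: phi(x) - phi(y) - (x - y) * grad_phi(y)       # equals (x - y)**2 here

Y_star = np.where(S == 0, X[S == 0].mean(), X[S == 1].mean())  # ~ E[X | S]
Y_other = np.where(S == 0, -1.0, 2.0)                          # some other S-measurable guess

print(D(X, Y_star).mean(), D(X, Y_other).mean())  # the first expectation is smaller
```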

More facts about BDFs
- Introduced and studied in the context of projection (Csiszar (1975)).
- The Pythagoras theorem holds for BDFs (Censor and Lent (1981)).
- There is a bijection between exponential families of distributions and BDFs, via Legendre duality (Merugu, Banerjee, Dhillon, Ghosh (2003)).
- Widely applied in data analysis and machine learning, e.g., K-means clustering.
- Well adopted in convex optimization.

Generator Network [Goodfellow et al., 2014]
- Generates samples according to P_θ.
- The real samples X are inaccessible to it.
- Goal: generate more and more compelling copies of X.

How to make the generator network better?
A knowledgeable mentor: the discriminator.

Discriminator Network [Goodfellow et al., 2014]
- Determines whether samples are generated (fake) or real.
- Has access to the real samples X.
- Helps improve the generator network by identifying fake samples.

Graphical Model (figure omitted in this transcript).

Generative modeling
The procedure of generative modeling is to construct a class of suitable parametric probability distributions P_θ:
- Generate a latent variable Z ∈ Z with a fixed probability distribution P_Z.
  P_Z is known and simple, e.g., a uniform distribution.
- Choose a family of parametric functions g_θ: Z → X.
  g_θ is complicated but structured; g_θ is the reason why generative modeling is powerful.
- Construct P_θ as the probability distribution of g_θ(Z). More specifically,
    P_θ(dx) = ∫_Z 1{g_θ(z) ∈ dx} P_Z(dz) = E_Z[ 1{g_θ(Z) ∈ dx} ].
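
A toy illustration of this pushforward construction (my own sketch; here g_θ is a simple closed-form map rather than the deep network used in GANs):

```python
import numpy as np

# Toy generator: Z ~ Uniform(0, 1), g_theta(z) = theta_1 + theta_2 * z**2.
# P_theta is simply the law of g_theta(Z); we never need its density,
# only the ability to sample from it by pushing Z through g_theta.
def g_theta(z, theta):
    return theta[0] + theta[1] * z ** 2

rng = np.random.default_rng(0)
z = rng.uniform(size=10_000)             # latent samples from the known, simple P_Z
samples = g_theta(z, theta=(1.0, 3.0))   # samples distributed according to P_theta
print(samples.mean(), samples.std())     # Monte Carlo summaries of P_theta
```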

GANs: different divergence functions
GANs and variants:
- LSGANs [Mao et al., 2016]: least-squares loss.
- DRAGANs [Kodali et al., 2017]: regret minimization.
- CGANs [Mirza and Osindero, 2014]: conditional extension.
- InfoGANs [Chen et al., 2016]: information-theoretic extension.
- ACGANs [Odena et al., 2017]: structured latent space.
- EBGANs [Zhao et al., 2016]: a new energy-based perspective.
- BEGANs [Berthelot et al., 2017]: auto-encoder extension.
GANs training: [Arjovsky and Bottou, 2017].
Wasserstein GANs:
- WGANs [Arjovsky et al., 2017]: Wasserstein-L¹ divergence.
- Improved WGANs [Gulrajani et al., 2017]: gradient penalty.

Several choices of divergence
Divergences that measure the difference between P and Q include:
- Kullback-Leibler (KL) divergence:
    KL(P, Q) = ∫_X log( P(dx) / Q(dx) ) P(dx).
- Jensen-Shannon (JS) divergence:
    JS(P, Q) = (1/2) [ KL(P, (P + Q)/2) + KL(Q, (P + Q)/2) ].
- Wasserstein divergence/distance of order p ≥ 1:
    W_p(P, Q) = ( inf_{π ∈ Π(P, Q)} ∫_{X×X} m(x, y)^p π(dx, dy) )^{1/p},
  with m a metric such as m(x, y) = ‖x − y‖_q for q ≥ 1, and Π(P, Q) the set of couplings of P and Q.
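
For discrete distributions these divergences are straightforward to compute; a short sketch (my own illustration), using SciPy's one-dimensional Wasserstein routine for W_1:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two distributions on the common support {0, 1, 2}.
support = np.array([0.0, 1.0, 2.0])
p = np.array([0.1, 0.6, 0.3])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), js(p, q))
# Wasserstein-1 on the real line (ground metric |x - y|).
print(wasserstein_distance(support, support, u_weights=p, v_weights=q))
```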

Discussion of these divergences
Example: given θ ∈ [0, 1], let P be the distribution of (x, y) with x = 0, y ~ Uniform(0, 1), and let Q be the distribution of (x, y) with x = θ, y ~ Uniform(0, 1) (two parallel segments).
For θ ≠ 0:
  KL(P, Q) = KL(Q, P) = +∞,  JS(P, Q) = log 2,  W_1(P, Q) = |θ|.
For θ = 0:
  KL(P, Q) = KL(Q, P) = JS(P, Q) = W_1(P, Q) = 0.
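
The W_1 value in this example is easy to recover from samples. A small sketch, assuming the POT (Python Optimal Transport) package, which is an external dependency not mentioned in the talk:

```python
import numpy as np
import ot  # POT: Python Optimal Transport, assumed installed

theta, n = 0.3, 500
y = np.random.default_rng(1).uniform(size=n)
P = np.column_stack([np.zeros(n), y])          # samples from P: x = 0
Q = np.column_stack([np.full(n, theta), y])    # samples from Q: x = theta
M = ot.dist(P, Q, metric='euclidean')          # pairwise ground costs ||x - y||_2
a = np.full(n, 1.0 / n)                        # uniform empirical weights
print(ot.emd2(a, a, M))                        # exact W_1 of the empirical measures, ~ |theta| = 0.3
```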

Remarks
- KL is infinite when the two distributions have disjoint supports.
- JS has a sudden jump: it is discontinuous at θ = 0.
- W_1 is continuous and relatively smooth in θ.
- The Wasserstein-L¹ divergence outperforms the KL and JS divergences here, but it lacks flexibility.

Remedy: Relaxed Wasserstein
Definition (G., Hong, Lin, and Yang 2018). The Relaxed Wasserstein (RW) divergence between probability distributions P and Q is defined as
  W_{D_φ}(P, Q) = inf_{π ∈ Π(P, Q)} ∫_{X×X} D_φ(x, y) π(dx, dy),
where D_φ is the Bregman divergence associated with a strictly convex and differentiable function φ: R^d → R, i.e., D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩.
1. W_{D_φ}(P, Q) ≥ 0, with equality iff P = Q almost everywhere.
2. W_{D_φ}(P, Q) is not a metric, since it is asymmetric.
3. W_{D_φ}(P, Q) includes W_KL, obtained with φ(x) = x^⊤ log(x) (negative entropy).
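
For discretely supported P and Q the infimum over couplings is a finite linear program, so the RW divergence can be computed exactly by feeding the Bregman cost matrix to a standard optimal transport solver. A minimal sketch (my own, again assuming the POT package), using the KL ground cost:

```python
import numpy as np
import ot  # POT, assumed installed

def kl_cost(X, Y):
    """Bregman cost matrix C[i, j] = D_phi(x_i, y_j) with phi = negative entropy (KL)."""
    # KL(x || y) = sum_k x_k log(x_k / y_k); rows of X and Y are points on the simplex.
    return np.array([[np.sum(x * np.log(x / y)) for y in Y] for x in X])

# Atoms of P and Q living on the probability simplex in R^3.
X = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
Y = np.array([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]])
a = np.array([0.6, 0.4])   # weights of P on its atoms
b = np.array([0.5, 0.5])   # weights of Q on its atoms

C = kl_cost(X, Y)
print(ot.emd2(a, b, C))    # RW divergence W_{D_phi}(P, Q) for this discrete pair
```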

Relaxed Wasserstein as a divergence
Question: is W_{D_φ} a good divergence?
Point 1: W_{D_φ}(P, Q) should be small when P and Q are close.
Requirement: W_{D_φ}(P, Q) should be dominated by a standard divergence, the total variation
  TV(P, Q) := sup_{A ∈ B} |P(A) − Q(A)|.
Point 2: W_{D_φ}(P_n, P_r) → 0 as n → ∞, where P_r is the true distribution and P_n is the empirical distribution of X = (X_1, X_2, …, X_n) i.i.d. ~ P_r.
Requirement: W_{D_φ}(P_n, P_r) should admit a moment estimate and a concentration inequality, i.e., there exist α, β > 0 such that
  E[ W_{D_φ}(P_n, P_r) ] = O(n^{−α})            (moment estimate),
  Prob( W_{D_φ}(P_n, P_r) ≥ ε ) = O(n^{−β})     (concentration inequality).

Dominated by TV and the standard Wasserstein distance
Theorem (G., Hong, Lin, and Yang 2018). Let P and Q be two probability distributions supported on a compact set X ⊂ R^d, and assume that φ: X → R is strictly convex and smooth with an L-Lipschitz continuous gradient. Then
  W_{D_φ}(P, Q) ≤ (L/2) · W_{L²}(P, Q)² ≤ L · [diam(X)]² · TV(P, Q).


Moment estimate for RW
Theorem (G., Hong, Lin, and Yang 2018). Assume that
  M_q(P_r) = ∫_X ‖x‖_2^q P_r(dx) < +∞
for some q > 2. Then there exists a constant C(q, d) > 0 such that, for n ≥ 1,
  E[ W_{D_φ}(P_n, P_r) ] ≤ C(q, d) · L · M_q(P_r)^{2/q} ·
    { n^{−1/2} + n^{−(q−2)/q},               1 ≤ d ≤ 3, q ≠ 4,
      n^{−1/2} log(1 + n) + n^{−(q−2)/q},    d = 4, q ≠ 4,
      n^{−2/d} + n^{−(q−2)/q},               d ≥ 5, q ≠ d/(d−2). }

Concentration inequality for RW
Theorem (G., Hong, Lin, and Yang 2018). Define
  E_{α,γ}(P_r) = ∫_X exp(γ ‖x‖_2^α) P_r(dx),
and assume that one of the three following conditions holds:
  ∃ α > 2, ∃ γ > 0 with E_{α,γ}(P_r) < ∞; or
  ∃ α ∈ (0, 2), ∃ γ > 0 with E_{α,γ}(P_r) < ∞; or
  ∃ q > 4 with M_q(P_r) < ∞.
Then for n ≥ 1 and ε > 0, there exist scalars a(n, ε) and b(n, ε) such that
  Prob( W_{D_φ}(P_n, P_r) ≥ ε ) ≤ a(n, ε) · 1{ε ≤ L/2} + b(n, ε).

Duality representation for RW
Theorem (G., Hong, Lin, and Yang 2018). Assume that the probability distributions P and Q satisfy
  ∫_X ‖x‖_2² (P + Q)(dx) < +∞.
Then there exists a Lipschitz continuous function f: X → R such that the RW divergence has the duality representation
  W_{D_φ}(P, Q) = ∫_X φ(x) (P − Q)(dx) + ∫_X ⟨∇φ(x), x⟩ Q(dx) + [ ∫_X f*(∇φ(x)) Q(dx) − ∫_X f(x) P(dx) ],
where f* is the convex conjugate of f, i.e., f*(y) = sup_{x ∈ R^d} ⟨x, y⟩ − f(x).

Key elements in the proof of duality
- The classical duality representation for the standard Wasserstein distance.
- The RW divergence can be decomposed as a distorted squared Wasserstein-L² distance of order 2, plus residual terms that are independent of the choice of the coupling π.


Relaxed Wasserstein for GANs
Question: is W_{D_φ} tractable for GANs?
Requirement 1: W_{D_φ}(P_r, P_θ) should be continuous and differentiable w.r.t. θ.
Requirement 2: the gradient of W_{D_φ}(P_r, P_θ) should be easy to compute or approximate, i.e.,
  ∇_θ W_{D_φ}(P_r, P_θ) = F(g_θ, φ, Z, …),
for some tractable mapping F.

Continuity and differentiability
Theorem (G., Hong, Lin, and Yang 2018).
1. W_{D_φ}(P_r, P_θ) is continuous in θ if g_θ is continuous in θ.
2. W_{D_φ}(P_r, P_θ) is differentiable almost everywhere if g_θ is locally Lipschitz with a constant L̄(θ, z) such that E[L̄(θ, Z)²] < ∞; i.e., for each given (θ_0, z_0) there exists a neighborhood N such that
  ‖g_θ(z) − g_{θ_0}(z_0)‖_2 ≤ L̄(θ_0, z_0) (‖θ − θ_0‖_2 + ‖z − z_0‖_2)  for any (θ, z) ∈ N.


Gradient descent scheme
Corollary (G., Hong, Lin, and Yang 2018). Assume that g_θ is locally Lipschitz with a constant L̄(θ, z) such that E[L̄(θ, Z)²] < ∞, and that ∫_X ‖x‖_2² (P_r + P_θ)(dx) < +∞. Then there exists a Lipschitz continuous solution f: X → R such that the gradient of the RW divergence has the explicit form
  ∇_θ W_{D_φ}(P_r, P_θ) = E_Z[ (∇_θ g_θ(Z))^⊤ ∇²φ(g_θ(Z)) g_θ(Z) ] + E_Z[ ∇_θ f(∇φ(g_θ(Z))) ].
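
To show roughly how such a gradient scheme is used in practice, here is a minimal WGAN-style training skeleton in PyTorch with RMSProp, the optimizer named in the experiment setup below. This is my own sketch, not the authors' RWGAN algorithm: per the corollary above, the RW-specific generator update would instead feed ∇φ(g_θ(z)) to the critic and add the Hessian term.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))  # generator g_theta
f = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))           # critic f

opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)

def real_batch(n):
    # Toy "real" data standing in for P_r: a shifted Gaussian.
    return torch.randn(n, data_dim) + 2.0

for step in range(1000):
    for _ in range(5):                           # several critic steps per generator step
        x, z = real_batch(64), torch.randn(64, latent_dim)
        loss_f = f(G(z).detach()).mean() - f(x).mean()   # critic maximizes E[f(real)] - E[f(fake)]
        opt_f.zero_grad()
        loss_f.backward()
        opt_f.step()
        for p in f.parameters():                 # weight clipping keeps the critic Lipschitz
            p.data.clamp_(-0.01, 0.01)
    z = torch.randn(64, latent_dim)
    loss_g = -f(G(z)).mean()                     # generator ascends the critic score
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```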


Experiment setup
- RW instance: the KL case, with φ(x) = x^⊤ log(x).
- Optimizer: RMSProp [Tieleman and Hinton, 2012].
- Experiment I:
  Baselines: WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs, and ACGANs.
  Datasets: MNIST (60000 train / 10000 test) and Fashion-MNIST (60000 train / 10000 test).
- Experiment II:
  Baselines: WGANs and WGANs-GP.
  Datasets: CIFAR-10 (color, 50000 train / 10000 test) and ImageNet (color, 14197122 images).
