
Recent Advances in Hilbert Space Representation of Probability Distributions. Krikamol Muandet, Max Planck Institute for Intelligent Systems, Tübingen, Germany. RegML 2020, Genova, Italy, July 1, 2020. 1/53. Reference: Kernel Mean Embedding of Distributions: A Review and Beyond (Muandet et al., 2017).


  1. RKHS as Feature Space
     Reproducing kernels are kernels: let $\mathcal{H}$ be a Hilbert space of functions on $\mathcal{X}$ with reproducing kernel $k$. Then $\mathcal{H}$ is an RKHS and is also a feature space of $k$, where the feature map $\varphi : \mathcal{X} \to \mathcal{H}$ is given by $\varphi(x) = k(\cdot, x)$. We call $\varphi$ the canonical feature map.
     Proof: fix $x' \in \mathcal{X}$ and write $f := k(\cdot, x')$. Then, for $x \in \mathcal{X}$, the reproducing property implies $\langle \varphi(x'), \varphi(x) \rangle = \langle k(\cdot, x'), k(\cdot, x) \rangle = \langle f, k(\cdot, x) \rangle = f(x) = k(x, x')$. 12/53
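
To make the feature-map view concrete, here is a minimal NumPy check (not from the slides) that a kernel evaluation equals an inner product of feature vectors. It uses an explicit finite-dimensional feature map for the degree-2 polynomial kernel rather than the canonical map $\varphi(x) = k(\cdot, x)$; the function names are mine.

```python
import numpy as np

def k_poly2(x, xp):
    # Degree-2 polynomial kernel on R^2: k(x, x') = <x, x'>^2.
    return np.dot(x, xp) ** 2

def phi(x):
    # An explicit feature map for k_poly2: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# <phi(x), phi(x')> = k(x, x')
assert np.isclose(np.dot(phi(x), phi(xp)), k_poly2(x, xp))
```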

  3. RKHS as Feature Space
     Universal kernels (Steinwart, 2002): a continuous kernel $k$ on a compact metric space $\mathcal{X}$ is called universal if the RKHS $\mathcal{H}$ of $k$ is dense in $C(\mathcal{X})$, i.e., for every function $g \in C(\mathcal{X})$ and all $\varepsilon > 0$ there exists an $f \in \mathcal{H}$ such that $\|f - g\|_\infty \leq \varepsilon$.
     Universal approximation theorem (Cybenko, 1989): given any $\varepsilon > 0$ and $f \in C(\mathcal{X})$, there exists $h(x) = \sum_{i=1}^{n} \alpha_i \phi(w_i^\top x + b_i)$ such that $|f(x) - h(x)| < \varepsilon$ for all $x \in \mathcal{X}$. 13/53

  7. Quick Summary
     ◮ A positive definite kernel $k(x, x')$ defines an implicit feature map: $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$.
     ◮ There exists a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of functions on $\mathcal{X}$ for which $k$ is a reproducing kernel: $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ and $k(x, x') = \langle k(\cdot, x), k(\cdot, x') \rangle_{\mathcal{H}}$.
     ◮ Implicit representation of data points: support vector machine (SVM), Gaussian process (GP), neural tangent kernel (NTK).
     ◮ Good references on kernel methods: Support Vector Machines (2008), Christmann and Steinwart; Gaussian Processes for Machine Learning (2005), Rasmussen and Williams; Learning with Kernels (1998), Schölkopf and Smola. 14/53

  8. Outline: Kernel Methods; From Points to Probability Measures; Embedding of Marginal Distributions; Embedding of Conditional Distributions; Recent Development. 15/53

  9. Probability Measures
     [Figure: four motivating settings, mapping the input space to a distribution space]
     ◮ Learning on distributions / point clouds
     ◮ Group anomaly / OOD detection
     ◮ Generalization across environments: training data $(X_k, Y_k) \sim \mathbb{P}^1_{XY}, \ldots, \mathbb{P}^N_{XY}$, unseen test data $X_k \sim \mathbb{P}_X$
     ◮ Statistical and causal inference 16/53

  11. Embedding of Dirac Measures
      [Figure: points $x, y$ in the input space $\mathcal{X}$ map to $k(x, \cdot), k(y, \cdot)$ in the feature space $\mathcal{H}$]
      $x \mapsto k(\cdot, x), \qquad \delta_x \mapsto \int k(\cdot, z)\, \mathrm{d}\delta_x(z) = k(\cdot, x)$ 17/53

  12. Outline: Kernel Methods; From Points to Probability Measures; Embedding of Marginal Distributions; Embedding of Conditional Distributions; Recent Development. 18/53

  15. Embedding of Marginal Distributions
      [Figure: densities of $\mathbb{P}$ and $\mathbb{Q}$ mapped to their embeddings $\mu_{\mathbb{P}}, \mu_{\mathbb{Q}}$ in the RKHS $\mathcal{H}$]
      Probability measure: let $\mathbb{P}$ be a probability measure defined on a measurable space $(\mathcal{X}, \Sigma)$ with a $\sigma$-algebra $\Sigma$.
      Kernel mean embedding: let $\mathcal{P}$ be the space of all probability measures $\mathbb{P}$. A kernel mean embedding is defined by $\mu : \mathcal{P} \to \mathcal{H}, \quad \mathbb{P} \mapsto \int k(\cdot, x)\, \mathrm{d}\mathbb{P}(x)$.
      Remark: the kernel $k$ is Bochner integrable if it is bounded. 19/53

  17. Embedding of Marginal Distributions
      ◮ If $\mathbb{E}_{X \sim \mathbb{P}}[\sqrt{k(X, X)}] < \infty$, then $\mu_{\mathbb{P}} \in \mathcal{H}$ and, for $f \in \mathcal{H}$,
        $\langle f, \mu_{\mathbb{P}} \rangle = \langle f, \mathbb{E}_{X \sim \mathbb{P}}[k(\cdot, X)] \rangle = \mathbb{E}_{X \sim \mathbb{P}}[\langle f, k(\cdot, X) \rangle] = \mathbb{E}_{X \sim \mathbb{P}}[f(X)]$.
      ◮ The kernel $k$ is said to be characteristic if the map $\mathbb{P} \mapsto \mu_{\mathbb{P}}$ is injective, i.e., $\|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}} = 0$ if and only if $\mathbb{P} = \mathbb{Q}$. 20/53
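
A small sketch of the identity $\langle f, \mu_{\mathbb{P}} \rangle = \mathbb{E}_{X \sim \mathbb{P}}[f(X)]$ at the empirical level, assuming a Gaussian RBF kernel and an RKHS element of the form $f = \sum_j \alpha_j k(\cdot, z_j)$; the helper names are mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_gauss(a, b, sigma=1.0):
    # Gaussian RBF kernel matrix between rows of a (n, d) and b (m, d).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# f = sum_j alpha_j k(., z_j) is an element of the RKHS H.
z = rng.normal(size=(5, 2))
alpha = rng.normal(size=5)

# Sample from P and form the empirical embedding mu_hat = (1/n) sum_i k(., x_i).
x = rng.normal(size=(200, 2))

# <f, mu_hat>_H = (1/n) sum_{i,j} alpha_j k(z_j, x_i)  (reproducing property)
lhs = alpha @ k_gauss(z, x).mean(axis=1)
# E_{X ~ P_hat}[f(X)] = (1/n) sum_i f(x_i)
rhs = (k_gauss(x, z) @ alpha).mean()
assert np.isclose(lhs, rhs)
```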

  20. Interpretation of Kernel Mean Representation
      What properties are captured by $\mu_{\mathbb{P}}$?
      ◮ $k(x, x') = \langle x, x' \rangle$: the first moment of $\mathbb{P}$.
      ◮ $k(x, x') = (\langle x, x' \rangle + 1)^p$: moments of $\mathbb{P}$ up to order $p \in \mathbb{N}$.
      ◮ $k(x, x')$ universal/characteristic: all information about $\mathbb{P}$.
      Moment-generating function: consider $k(x, x') = \exp(\langle x, x' \rangle)$. Then $\mu_{\mathbb{P}} = \mathbb{E}_{X \sim \mathbb{P}}[e^{\langle X, \cdot \rangle}]$.
      Characteristic function: if $k(x, y) = \psi(x - y)$ where $\psi$ is a positive definite function, then $\mu_{\mathbb{P}}(y) = \int \psi(x - y)\, \mathrm{d}\mathbb{P}(x)$, which is determined by the characteristic function $\varphi_{\mathbb{P}}$ of $\mathbb{P}$ weighted by a positive finite measure $\Lambda_k$ (the Fourier transform of $\psi$). 21/53
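
One way to see the first bullet numerically: with the linear kernel, the empirical embedding evaluated at any point depends only on the sample mean, so $\mu_{\mathbb{P}}$ carries exactly the first moment. A toy sketch, not from the slides, with my own variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=(1000, 3))   # samples from P
t = rng.normal(size=3)                                # arbitrary evaluation point

# Linear kernel k(x, x') = <x, x'>: the empirical embedding evaluated at t is
# (1/n) sum_i <x_i, t> = <mean(x), t>, i.e. mu_P only depends on the mean of P.
mu_at_t = (x @ t).mean()
assert np.isclose(mu_at_t, x.mean(axis=0) @ t)
```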

  24. Characteristic Kernels
      ◮ All universal kernels are characteristic, but characteristic kernels may not be universal.
      ◮ Important characterizations: the discrete kernel on a discrete space; shift-invariant kernels on $\mathbb{R}^d$ whose Fourier transform has full support; integrally strictly positive definite (ISPD) kernels; characteristic kernels on groups.
      ◮ Examples of characteristic kernels: the Gaussian RBF kernel $k(x, x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$ and the Laplacian kernel $k(x, x') = \exp\!\left(-\frac{\|x - x'\|_1}{\sigma}\right)$.
      ◮ Kernel choice vs. parametric assumption: a parametric assumption is susceptible to model misspecification, but the choice of kernel matters in practice; we can optimize the kernel to maximize the performance of the downstream task. 22/53

  29. Kernel Mean Estimation
      ◮ Given an i.i.d. sample $x_1, x_2, \ldots, x_n$ from $\mathbb{P}$, we can estimate $\mu_{\mathbb{P}}$ by
        $\hat{\mu}_{\mathbb{P}} := \frac{1}{n} \sum_{i=1}^{n} k(x_i, \cdot) \in \mathcal{H}, \qquad \hat{\mathbb{P}} = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$.
      ◮ For each $f \in \mathcal{H}$, we have $\mathbb{E}_{X \sim \hat{\mathbb{P}}}[f(X)] = \langle f, \hat{\mu}_{\mathbb{P}} \rangle$.
      ◮ Consistency: with probability at least $1 - \delta$,
        $\|\hat{\mu}_{\mathbb{P}} - \mu_{\mathbb{P}}\|_{\mathcal{H}} \leq 2\sqrt{\frac{\mathbb{E}_{X \sim \mathbb{P}}[k(X, X)]}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$.
      ◮ The rate $O_p(n^{-1/2})$ was shown to be minimax optimal [Tolstikhin et al., Minimax Estimation of Kernel Mean Embeddings, JMLR 2017].
      ◮ Similar to James-Stein estimators, the estimate can be improved by shrinkage estimators [Muandet et al., Kernel Mean Shrinkage Estimators, JMLR 2016]: $\hat{\mu}_\alpha := \alpha f^* + (1 - \alpha)\hat{\mu}_{\mathbb{P}}, \quad f^* \in \mathcal{H}$. 23/53
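
A minimal sketch of the empirical estimator $\hat{\mu}_{\mathbb{P}}$ and of the shrinkage idea from the last bullet, assuming a Gaussian RBF kernel. The shrinkage target $f^* = 0$ and the fixed $\alpha$ are simplifications for illustration (the cited shrinkage estimators choose them in a data-dependent way), and all function names are mine.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    # Gaussian RBF kernel matrix between rows of a (n, d) and b (m, d).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def embed(x_samples, sigma=1.0):
    """Empirical kernel mean embedding: returns t -> mu_hat_P(t) = (1/n) sum_i k(x_i, t)."""
    return lambda t: gauss_kernel(np.atleast_2d(t), x_samples, sigma).mean(axis=1)

def shrunk_embed(x_samples, alpha=0.1, sigma=1.0):
    """Toy shrinkage estimator mu_alpha = alpha * f_star + (1 - alpha) * mu_hat_P,
    with f_star = 0 and alpha fixed by hand (purely illustrative)."""
    mu_hat = embed(x_samples, sigma)
    return lambda t: (1.0 - alpha) * mu_hat(t)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))                  # i.i.d. sample from P
t = np.zeros(2)
print(embed(x)(t), shrunk_embed(x)(t))         # embedding values at the point t
```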

  34. Recovering Samples/Distributions
      ◮ An approximate pre-image problem: $\theta^* = \arg\min_\theta \|\hat{\mu} - \mu_{\mathbb{P}_\theta}\|^2_{\mathcal{H}}$.
      ◮ The distribution $\mathbb{P}_\theta$ is assumed to be in a certain class, e.g. a Gaussian mixture $\mathbb{P}_\theta(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x; \mu_k, \Sigma_k), \quad \sum_{k=1}^{K} \pi_k = 1$.
      ◮ Kernel herding generates deterministic pseudo-samples by greedily minimizing the squared error $\mathcal{E}^2_T = \left\| \mu_{\mathbb{P}} - \frac{1}{T} \sum_{t=1}^{T} k(\cdot, x_t) \right\|^2_{\mathcal{H}}$.
      ◮ Negative autocorrelation: $O(1/T)$ rate of convergence.
      ◮ Deep generative models (see the following slides). 24/53
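
A sketch of greedy kernel herding over a finite candidate pool, assuming a Gaussian RBF kernel and the empirical $\hat{\mu}_{\mathbb{P}}$ as the target; restricting the argmax to a candidate grid is my simplification of the continuous optimization, and the names are illustrative.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def herd(x_data, candidates, T, sigma=1.0):
    """Greedy kernel herding over a finite candidate pool.

    At step t, pick the candidate c maximizing
        mu_hat_P(c) - (1/(t+1)) * sum_{s<=t} k(c, x_s),
    which greedily reduces the squared error E_T^2 from the slide.
    """
    mu = gauss_kernel(candidates, x_data, sigma).mean(axis=1)  # mu_hat_P at candidates
    running = np.zeros(len(candidates))                        # sum_s k(c, x_s) so far
    chosen = []
    for t in range(T):
        scores = mu - running / (t + 1)
        idx = int(np.argmax(scores))
        chosen.append(candidates[idx])
        running += gauss_kernel(candidates, candidates[idx:idx + 1], sigma)[:, 0]
    return np.array(chosen)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))              # samples defining mu_hat_P
pool = rng.uniform(-3, 3, size=(2000, 2))     # candidate pseudo-sample locations
pseudo = herd(data, pool, T=20)
```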

  40. Quick Summary
      ◮ A kernel mean embedding of a distribution $\mathbb{P}$: $\mu_{\mathbb{P}} := \int k(\cdot, x)\, \mathrm{d}\mathbb{P}(x), \qquad \hat{\mu}_{\mathbb{P}} := \frac{1}{n} \sum_{i=1}^{n} k(x_i, \cdot)$.
      ◮ If $k$ is characteristic, $\mu_{\mathbb{P}}$ captures all information about $\mathbb{P}$.
      ◮ All universal kernels are characteristic, but not vice versa.
      ◮ The empirical $\hat{\mu}_{\mathbb{P}}$ requires no parametric assumption about $\mathbb{P}$.
      ◮ It can be estimated consistently: with probability at least $1 - \delta$, $\|\hat{\mu}_{\mathbb{P}} - \mu_{\mathbb{P}}\|_{\mathcal{H}} \leq 2\sqrt{\frac{\mathbb{E}_{X \sim \mathbb{P}}[k(X, X)]}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$.
      ◮ Given the embedding $\hat{\mu}$, it is possible to reconstruct the distribution or generate samples from it. 25/53

  44. Application: High-Level Generalization
      [Figure: four example applications with illustrative panels]
      ◮ Learning from distributions: KM, Fukumizu, Dinuzzo, Schölkopf, NIPS 2012.
      ◮ Group anomaly detection: KM and Schölkopf, UAI 2013.
      ◮ Domain generalization (training data $(X_k, Y_k) \sim \mathbb{P}^1_{XY}, \ldots, \mathbb{P}^N_{XY}$; unseen test data $X_k \sim \mathbb{P}_X$): KM et al., ICML 2013; Zhang, KM, et al., ICML 2013.
      ◮ Cause-effect inference: Lopez-Paz, KM, et al., JMLR 2015, ICML 2015. 26/53

  46. Support Measure Machine (SMM)
      KM, K. Fukumizu, F. Dinuzzo, and B. Schölkopf (NeurIPS 2012)
      Embeddings at increasing levels: $x \mapsto k(\cdot, x), \qquad \delta_x \mapsto \int k(\cdot, z)\, \mathrm{d}\delta_x(z), \qquad \mathbb{P} \mapsto \int k(\cdot, z)\, \mathrm{d}\mathbb{P}(z)$.
      Training data: $(\mathbb{P}_1, y_1), (\mathbb{P}_2, y_2), \ldots, (\mathbb{P}_n, y_n) \sim \mathcal{P} \times \mathcal{Y}$.
      Theorem (distributional representer theorem): under technical assumptions on $\Omega : [0, +\infty) \to \mathbb{R}$ and a loss function $\ell : (\mathcal{P} \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{+\infty\}$, any $f \in \mathcal{H}$ minimizing $\ell(\mathbb{P}_1, y_1, \mathbb{E}_{\mathbb{P}_1}[f], \ldots, \mathbb{P}_m, y_m, \mathbb{E}_{\mathbb{P}_m}[f]) + \Omega(\|f\|_{\mathcal{H}})$ admits a representation of the form $f = \sum_{i=1}^{m} \alpha_i \mathbb{E}_{x \sim \mathbb{P}_i}[k(x, \cdot)] = \sum_{i=1}^{m} \alpha_i \mu_{\mathbb{P}_i}$. 27/53
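
Since the representer theorem says the solution only involves the mean embeddings $\mu_{\mathbb{P}_i}$, training reduces to a Gram matrix between bags of samples, $K_{ij} = \langle \hat{\mu}_{\mathbb{P}_i}, \hat{\mu}_{\mathbb{P}_j} \rangle_{\mathcal{H}}$. Below is a sketch that uses kernel ridge regression in place of the SVM from the paper, assuming a Gaussian RBF kernel on inputs; the data, labels, and names are illustrative only.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mean_map_kernel(bags, sigma=1.0):
    """Gram matrix between bags: K[i, j] = <mu_hat_Pi, mu_hat_Pj>_H
    = (1 / (n_i * n_j)) * sum_{a,b} k(x_a^i, x_b^j)."""
    n = len(bags)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = gauss_kernel(bags[i], bags[j], sigma).mean()
    return K

# Kernel ridge regression on bags: by the distributional representer theorem
# the solution has the form f = sum_i alpha_i mu_{P_i}, so only K is needed.
rng = np.random.default_rng(0)
bags = [rng.normal(loc=m, size=(30, 2)) for m in rng.uniform(-1, 1, size=(40, 2))]
y = np.array([b.mean() > 0 for b in bags], dtype=float)   # a toy label per bag
K = mean_map_kernel(bags)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(bags)), y)
```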

  49. Supervised Learning on Point Clouds
      Training set $(S_1, y_1), \ldots, (S_n, y_n)$ with $S_i = \{x_j^{(i)}\} \sim \mathbb{P}_i(X)$.
      ◮ Causal prediction: decide between $X \to Y$ and $X \leftarrow Y$ from an observed sample of $(X, Y)$. Lopez-Paz, KM, B. Schölkopf, I. Tolstikhin, JMLR 2015, ICML 2015.
      ◮ Topological data analysis: G. Kusano, K. Fukumizu, and Y. Hiraoka, JMLR 2018. 28/53

  50. Domain Generalization
      Blanchard et al., NeurIPS 2012; KM, D. Balduzzi, B. Schölkopf, ICML 2013
      [Figure: training data $(X_k, Y_k) \sim \mathbb{P}^1_{XY}, \ldots, \mathbb{P}^N_{XY}$; unseen test data $X_k \sim \mathbb{P}_X$]
      $K((\mathbb{P}_i, x), (\mathbb{P}_j, \tilde{x})) = k_1(\mathbb{P}_i, \mathbb{P}_j)\, k_2(x, \tilde{x}) = k_1(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j})\, k_2(x, \tilde{x})$ 29/53
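
A sketch of the product kernel above, assuming $k_1$ is a Gaussian-type kernel on the RKHS distance between mean embeddings (one possible choice; a plain inner product between embeddings is another) and $k_2$ is a Gaussian RBF on inputs. Function names and hyperparameters are mine, not from the cited papers.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def embed_dist2(xi, xj, sigma=1.0):
    # ||mu_Pi - mu_Pj||_H^2 via the kernel trick (biased V-statistic version).
    return (gauss_kernel(xi, xi, sigma).mean()
            + gauss_kernel(xj, xj, sigma).mean()
            - 2 * gauss_kernel(xi, xj, sigma).mean())

def product_kernel(bag_i, x, bag_j, x_tilde, sigma=1.0, tau=1.0):
    """K((P_i, x), (P_j, x~)) = k1(mu_Pi, mu_Pj) * k2(x, x~)."""
    k1 = np.exp(-embed_dist2(bag_i, bag_j, sigma) / (2 * tau ** 2))   # kernel on embeddings
    k2 = gauss_kernel(x[None, :], x_tilde[None, :], sigma)[0, 0]      # kernel on inputs
    return k1 * k2
```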

  54. Comparing Distributions
      ◮ Maximum mean discrepancy (MMD) corresponds to the RKHS distance between mean embeddings:
        $\mathrm{MMD}^2(\mathbb{P}, \mathbb{Q}, \mathcal{H}) = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|^2_{\mathcal{H}} = \|\mu_{\mathbb{P}}\|^2_{\mathcal{H}} - 2\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}} + \|\mu_{\mathbb{Q}}\|^2_{\mathcal{H}}$.
      ◮ MMD is an integral probability metric (IPM):
        $\mathrm{MMD}(\mathbb{P}, \mathbb{Q}, \mathcal{H}) := \sup_{h \in \mathcal{H}, \|h\| \leq 1} \left| \int h(x)\, \mathrm{d}\mathbb{P}(x) - \int h(x)\, \mathrm{d}\mathbb{Q}(x) \right|$.
      ◮ If $k$ is characteristic, then $\|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}} = 0$ if and only if $\mathbb{P} = \mathbb{Q}$.
      ◮ Given $\{x_i\}_{i=1}^{n} \sim \mathbb{P}$ and $\{y_j\}_{j=1}^{m} \sim \mathbb{Q}$, the unbiased empirical MMD is
        $\widehat{\mathrm{MMD}}^2_u(\mathbb{P}, \mathbb{Q}, \mathcal{H}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j)$. 30/53
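
A direct implementation sketch of the unbiased estimator $\widehat{\mathrm{MMD}}^2_u$ above, assuming a Gaussian RBF kernel; names are mine.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimator of MMD^2(P, Q): diagonal (i = j) terms are removed."""
    n, m = len(x), len(y)
    kxx = gauss_kernel(x, x, sigma)
    kyy = gauss_kernel(y, y, sigma)
    kxy = gauss_kernel(x, y, sigma)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()
```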

  56. Kernel Two-Sample Testing
      Gretton et al., JMLR 2012
      Question: given $\{x_i\}_{i=1}^{n} \sim \mathbb{P}$ and $\{y_j\}_{j=1}^{n} \sim \mathbb{Q}$, check whether $\mathbb{P} = \mathbb{Q}$:
      $H_0 : \mathbb{P} = \mathbb{Q}, \qquad H_1 : \mathbb{P} \neq \mathbb{Q}$.
      ◮ MMD test statistic:
        $\hat{t}^2 = \widehat{\mathrm{MMD}}^2_u(\mathbb{P}, \mathbb{Q}, \mathcal{H}) = \frac{1}{n(n-1)} \sum_{1 \leq i \neq j \leq n} h((x_i, y_i), (x_j, y_j))$,
        where $h((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)$. 31/53
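
Gretton et al. calibrate this statistic via its (asymptotic) null distribution; a permutation (relabeling) test is a common practical alternative for obtaining a p-value under $H_0: \mathbb{P} = \mathbb{Q}$. A sketch with my own helper; `mmd2_unbiased` refers to the estimator sketched after the previous slide.

```python
import numpy as np

def permutation_test(x, y, stat, n_perm=500, seed=0):
    """Two-sample test: p-value of stat(x, y) under random relabelings (H0: P = Q)."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        null.append(stat(pooled[perm[:n]], pooled[perm[n:]]))
    return (np.sum(np.array(null) >= observed) + 1) / (n_perm + 1)

# Usage with the unbiased MMD^2 statistic from the earlier sketch:
# p_value = permutation_test(x_samples, y_samples, mmd2_unbiased)
```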

  57. Generative Adversarial Networks
      Learn a deep generative model $G$ via a minimax optimization
      $\min_G \max_D \; \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$,
      where $D$ is a discriminator and $z \sim \mathcal{N}(0, \sigma^2 I)$.
      [Figure: the generator $G_\theta$ maps random noise $z$ to synthetic data $G_\theta(z)$; the discriminator $D_\phi$ decides whether a sample is real or synthetic; an MMD test checks whether $\|\hat{\mu}_X - \hat{\mu}_{G_\theta(Z)}\|_{\mathcal{H}}$ is zero] 32/53

  60. Generative Moment Matching Network
      ◮ The GAN aims to match two distributions $\mathbb{P}(X)$ and $G_\theta$.
      ◮ The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers
        $\min_\theta \|\mu_X - \mu_{G_\theta(Z)}\|^2_{\mathcal{H}} = \min_\theta \left\| \int \phi(X)\, \mathrm{d}\mathbb{P}(X) - \int \phi(\tilde{X})\, \mathrm{d}G_\theta(\tilde{X}) \right\|^2_{\mathcal{H}}$,
        i.e., it minimizes the MMD $\sup_{h \in \mathcal{H}, \|h\| \leq 1} \left| \int h\, \mathrm{d}\mathbb{P} - \int h\, \mathrm{d}G_\theta \right|$ between the data distribution and the generator.
      ◮ Many tricks have been proposed to improve the GMMN: optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a); gradient regularization (Binkowski et al., 2018; Arbel et al., 2018); repulsive loss (Wang et al., 2019); optimized witness points (Mehrjou et al., 2019); etc. 33/53

  61. Outline: Kernel Methods; From Points to Probability Measures; Embedding of Marginal Distributions; Embedding of Conditional Distributions; Recent Development. 34/53

  63. Conditional Distribution $\mathbb{P}(Y \mid X)$
      A collection of distributions $\mathcal{P}_Y := \{\mathbb{P}(Y \mid X = x) : x \in \mathcal{X}\}$.
      ◮ For each $x \in \mathcal{X}$, we can define an embedding of $\mathbb{P}(Y \mid X = x)$ as
        $\mu_{Y \mid x} := \int_{\mathcal{Y}} \varphi(y)\, \mathrm{d}\mathbb{P}(y \mid X = x) = \mathbb{E}_{Y \mid x}[\varphi(Y)]$,
        where $\varphi : \mathcal{Y} \to \mathcal{G}$ is a feature map on $\mathcal{Y}$. 35/53

  64. Embedding of Conditional Distributions
      [Figure: the conditional density $p(y \mid x)$ is mapped to $\mu_{Y \mid X = x} = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1} k(x, \cdot)$ in the RKHS $\mathcal{G}$]
      The conditional mean embedding of $\mathbb{P}(Y \mid X)$ can be defined as the operator
      $\mathcal{U}_{Y \mid X} : \mathcal{H} \to \mathcal{G}, \qquad \mathcal{U}_{Y \mid X} := \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1}$. 36/53

  69. Conditional Mean Embedding
      ◮ To fully represent $\mathbb{P}(Y \mid X)$, we need to perform conditioning and conditional expectation.
      ◮ To represent $\mathbb{P}(Y \mid X = x)$ for $x \in \mathcal{X}$, it follows that
        $\mathbb{E}_{Y \mid x}[\varphi(Y) \mid X = x] = \mathcal{U}_{Y \mid X}\, k(x, \cdot) = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1} k(x, \cdot) =: \mu_{Y \mid x}$.
      ◮ It follows from the reproducing property of $\mathcal{G}$ that
        $\mathbb{E}_{Y \mid x}[g(Y) \mid X = x] = \langle \mu_{Y \mid x}, g \rangle_{\mathcal{G}}, \quad \forall g \in \mathcal{G}$.
      ◮ In an infinite-dimensional RKHS, $\mathcal{C}_{XX}^{-1}$ does not exist. Hence, we often use the regularized operator $\mathcal{U}_{Y \mid X} := \mathcal{C}_{YX}(\mathcal{C}_{XX} + \varepsilon I)^{-1}$.
      ◮ Conditional mean estimator:
        $\hat{\mu}_{Y \mid x} = \sum_{i=1}^{n} \beta_i(x)\, \varphi(y_i), \qquad \beta(x) := (K + n\varepsilon I)^{-1} k_x$. 37/53
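
A sketch of the conditional mean estimator $\beta(x) = (K + n\varepsilon I)^{-1} k_x$ and its use for conditional expectations $\mathbb{E}[g(Y) \mid X = x] \approx \sum_i \beta_i(x)\, g(y_i)$, assuming Gaussian RBF kernels and toy data; the function names, the regularization value, and the toy regression problem are mine.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def cme_weights(x_train, x_query, eps=1e-3, sigma=1.0):
    """beta(x) = (K + n*eps*I)^{-1} k_x for each query point (one column per query)."""
    n = len(x_train)
    K = gauss_kernel(x_train, x_train, sigma)
    k_x = gauss_kernel(x_train, x_query, sigma)
    return np.linalg.solve(K + n * eps * np.eye(n), k_x)

# E[g(Y) | X = x] is then approximated by sum_i beta_i(x) g(y_i).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=300)
beta = cme_weights(x, np.array([[1.0]]))
print(beta[:, 0] @ y)   # approximates E[Y | X = 1] = sin(1) when g(y) = y
```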

  73. Counterfactual Mean Embedding
      KM, Kanagawa, Saengkyongam, Marukatat, JMLR 2020 (accepted)
      In economics, social science, and public policy, we need to evaluate the distributional treatment effect (DTE) $\mathbb{P}_{Y^*_0}(\cdot) - \mathbb{P}_{Y^*_1}(\cdot)$, where $Y^*_0$ and $Y^*_1$ are potential outcomes of a treatment policy $T$.
      ◮ We can only observe either $\mathbb{P}_{Y^*_0}$ or $\mathbb{P}_{Y^*_1}$.
      ◮ Counterfactual distribution: $\mathbb{P}_{Y\langle 0 \mid 1 \rangle}(y) = \int \mathbb{P}_{Y_0 \mid X_0}(y \mid x)\, \mathrm{d}\mathbb{P}_{X_1}(x)$.
      ◮ The counterfactual distribution $\mathbb{P}_{Y\langle 0 \mid 1 \rangle}(y)$ can be estimated using the kernel mean embedding. 38/53
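
A sketch of how the embedding of the counterfactual distribution $\mathbb{P}_{Y\langle 0 \mid 1 \rangle}$ could be estimated: apply the conditional weights learned from control data $(X_0, Y_0)$ to the treated covariates $X_1$ and average, mirroring the integral on the slide. This is my reading of the construction, with Gaussian kernels and illustrative names; the DTE could then be assessed via an MMD between this weighted embedding of $Y_0$ and the empirical embedding of $Y_1$.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def counterfactual_weights(x0, x1, eps=1e-3, sigma=1.0):
    """Weights w such that mu_hat_{Y<0|1>} = sum_i w_i * l(., y0_i).

    The conditional weights beta(x) = (K_{X0} + n*eps*I)^{-1} k_{X0}(x) are
    averaged over the treated covariates x1, approximating
    int P_{Y0|X0}(. | x) dP_{X1}(x).
    """
    n = len(x0)
    K = gauss_kernel(x0, x0, sigma)
    B = np.linalg.solve(K + n * eps * np.eye(n), gauss_kernel(x0, x1, sigma))
    return B.mean(axis=1)   # shape (n0,): one weight per control outcome y0_i
```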

  74. Quantum Mean Embedding 39/53

  76. Quick Summary
      ◮ Many applications require information in $\mathbb{P}(Y \mid X)$.
      ◮ The Hilbert space embedding of $\mathbb{P}(Y \mid X)$ is not a single element but an operator $\mathcal{U}_{Y \mid X}$ mapping from $\mathcal{H}$ to $\mathcal{G}$:
        $\mu_{Y \mid x} = \mathcal{U}_{Y \mid X}\, k(x, \cdot) = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1} k(x, \cdot), \qquad \langle \mu_{Y \mid x}, g \rangle_{\mathcal{G}} = \mathbb{E}_{Y \mid x}[g(Y) \mid X = x]$. 40/53
