An Introduction to Hilbert Space Embedding of Probability Measures


  1. An Introduction to Hilbert Space Embedding of Probability Measures. Krikamol Muandet, Max Planck Institute for Intelligent Systems, Tübingen, Germany. Jeju, South Korea, February 22, 2019.

  2. Reference: Kernel Mean Embedding of Distributions: A Review and Beyond. Muandet, Fukumizu, Sriperumbudur, and Schölkopf. Foundations and Trends in Machine Learning, 2017.

  3. Outline:
     - From Points to Measures
     - Embedding of Marginal Distributions
     - Embedding of Conditional Distributions
     - Future Directions

  4. Outline (section: From Points to Measures).

  5. Classification Problem. [Figure: labeled training data (+1, −1) in the input space, plotted against coordinates x₁ and x₂.]

  6.–9. Feature Map (built up in stages). Define

      φ : (x₁, x₂) ↦ (x₁², x₂², √2·x₁x₂).

  [Figure: the labeled data in the input space (axes x₁, x₂) alongside its image in the three-dimensional feature space (axes φ₁, φ₂, φ₃).]

  The inner product in the feature space reduces to a function of the input points:

      ⟨φ(x), φ(x′)⟩_{R³} = (x · x′)².
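The identity on the last slide is easy to verify numerically. A minimal sketch in NumPy (the function and variable names are ours, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x = rng.normal(size=2)
x_prime = rng.normal(size=2)

lhs = phi(x) @ phi(x_prime)   # inner product computed in the feature space R^3
rhs = (x @ x_prime) ** 2      # kernel evaluated directly in the input space
assert np.isclose(lhs, rhs)
```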

  10.–11. Our recipe:
     1. Construct a non-linear feature map φ : X → H.
     2. Evaluate D_φ = {φ(x₁), φ(x₂), ..., φ(xₙ)}.
     3. Solve the learning problem in H using D_φ.
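In practice step 2 is rarely explicit: as the kernel slides below make precise, any algorithm that touches the data only through inner products can replace ⟨φ(x), φ(x′)⟩ by k(x, x′) and never compute φ. As one concrete instance (our illustration, not from the slides), here is a minimal kernel ridge regression sketch on hypothetical toy data with a Gaussian kernel:

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Hypothetical 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# Kernel ridge regression: f = sum_i alpha_i k(., x_i) lives in H;
# alpha solves (K + lam * I) alpha = y.
lam = 1e-2
K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 100)[:, None]
y_pred = gaussian_gram(X_test, X) @ alpha   # predictions via kernel evaluations only
```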

  12.–13. Kernels.

  Definition. A function k : X × X → R is called a kernel on X if there exists a Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X we have

      k(x, x′) = ⟨φ(x), φ(x′)⟩_H.

  We call φ a feature map and H a feature space of k.

  Examples:
     1. k(x, x′) = (x · x′)² for x, x′ ∈ R²: here φ(x) = (x₁², x₂², √2·x₁x₂) and H = R³.
     2. k(x, x′) = (x · x′ + c)^m for c > 0 and x, x′ ∈ R^d: here dim(H) = C(d+m, m), the binomial coefficient.
     3. k(x, x′) = exp(−γ‖x − x′‖₂²): here H is infinite-dimensional ("R^∞").
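The dimension count in example 2 can be sanity-checked by enumeration: dim(H) equals the number of monomials of degree at most m in d variables. A small stdlib-only sketch (the homogenization bookkeeping in the comment is our illustration, not from the slides):

```python
import math
from itertools import combinations_with_replacement

# Monomials of degree <= m in d variables: homogenizing with one auxiliary
# variable turns these into degree-m monomials in d + 1 variables, i.e.
# multisets of size m drawn from d + 1 symbols.
d, m = 3, 2
n_monomials = sum(1 for _ in combinations_with_replacement(range(d + 1), m))
assert n_monomials == math.comb(d + m, m)   # C(d + m, m) = 10 for d = 3, m = 2
```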

  14.–15. Positive Definite Kernels.

  Definition (Positive definiteness). A function k : X × X → R is called positive definite if, for all n ∈ N, all α₁, ..., αₙ ∈ R, and all x₁, ..., xₙ ∈ X, we have

      ∑_{i=1}^n ∑_{j=1}^n αᵢ αⱼ k(xⱼ, xᵢ) ≥ 0.

  Equivalently, every Gram matrix K, with Kᵢⱼ = k(xᵢ, xⱼ), is positive definite.

  Example (Any kernel is positive definite). Let k be a kernel with feature map φ : X → H; then

      ∑_{i=1}^n ∑_{j=1}^n αᵢ αⱼ k(xⱼ, xᵢ) = ⟨∑_{i=1}^n αᵢ φ(xᵢ), ∑_{j=1}^n αⱼ φ(xⱼ)⟩_H = ‖∑_{i=1}^n αᵢ φ(xᵢ)‖²_H ≥ 0.

  Positive definiteness is a necessary (and sufficient) condition for a function to be a kernel.
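The Gram-matrix formulation is easy to probe numerically: a symmetric matrix satisfies αᵀKα ≥ 0 for all α exactly when it has no negative eigenvalues. A sketch, assuming the quadratic kernel from slide 13 and random points of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Gram matrix of the quadratic kernel k(x, x') = (x . x')^2.
K = (X @ X.T) ** 2

# Positive definiteness of k means the symmetric Gram matrix has no
# negative eigenvalues (tolerance accounts for floating-point round-off).
assert np.linalg.eigvalsh(K).min() >= -1e-8
```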

  16.–19. Reproducing Kernel Hilbert Spaces (built up in stages). Let H be a Hilbert space of functions mapping from X into R.

  1. A function k : X × X → R is called a reproducing kernel of H if we have k(·, x) ∈ H for all x ∈ X and the reproducing property

         f(x) = ⟨f, k(·, x)⟩

     holds for all f ∈ H and all x ∈ X.

  2. The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δ_x : H → R defined by

         δ_x(f) := f(x),   f ∈ H,

     is continuous.

  Remark: continuity of the Dirac functionals means that norm convergence implies pointwise convergence: if ‖f_n − f‖_H → 0 as n → ∞, then for all x ∈ X,

         lim_{n→∞} f_n(x) = f(x).

  20.–21. Reproducing Kernels.

  Lemma (Reproducing kernels are kernels). Let H be a Hilbert space over X with a reproducing kernel k. Then H is an RKHS and is also a feature space of k, where the feature map φ : X → H is given by

      φ(x) = k(·, x),   x ∈ X.

  We call φ the canonical feature map.

  Proof. Fix x′ ∈ X and write f := k(·, x′). Then, for x ∈ X, the reproducing property yields

      ⟨φ(x′), φ(x)⟩ = ⟨k(·, x′), k(·, x)⟩ = ⟨f, k(·, x)⟩ = f(x) = k(x, x′).
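Because the feature space of the quadratic kernel from slide 13 is just R³, the reproducing property can be checked in explicit coordinates. A sketch under that assumption (the expansion f = ∑ᵢ αᵢ k(·, xᵢ) and all names are our illustration):

```python
import numpy as np

def k(x, xp):
    return (x @ xp) ** 2

def phi(x):
    """Canonical feature k(., x) in coordinates; H is isomorphic to R^3 here."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 2))
alphas = rng.normal(size=5)

# Build f = sum_i alpha_i k(., x_i), represented by w = sum_i alpha_i phi(x_i).
w = sum(a * phi(xi) for a, xi in zip(alphas, xs))

x = rng.normal(size=2)
f_at_x = sum(a * k(xi, x) for a, xi in zip(alphas, xs))  # pointwise evaluation
inner = w @ phi(x)                                       # <f, k(., x)>_H in coordinates

assert np.isclose(f_at_x, inner)  # reproducing property: f(x) = <f, k(., x)>
```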

  22.–23. Kernels and RKHSs.

  Theorem (Every RKHS has a unique reproducing kernel). Let H be an RKHS over X. Then k : X × X → R defined by

      k(x, x′) = ⟨δ_x, δ_{x′}⟩_H,   x, x′ ∈ X,

  is the only reproducing kernel of H. Furthermore, if (e_i)_{i∈I} is an orthonormal basis of H, then for all x, x′ ∈ X we have

      k(x, x′) = ∑_{i∈I} e_i(x) e_i(x′).

  Universal kernels. A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and all ε > 0 there exists an f ∈ H such that ‖f − g‖_∞ ≤ ε.

  24.–25. From Points to Measures. [Figure: points x and y in the input space X are mapped to the functions k(x, ·) and k(y, ·) in the feature space H, which also contains a generic element f.]

  The canonical feature map can be rewritten as integration against a Dirac measure:

      x ↦ k(·, x)   corresponds to   δ_x ↦ ∫ k(·, z) dδ_x(z).

  26. Outline (section: Embedding of Marginal Distributions).

  27.–29. Embedding of Marginal Distributions. [Figure: densities p(x) of two distributions P and Q on the input space, mapped to elements µ_P and µ_Q of the RKHS H, which also contains a generic element f.]

  Definition. Let P be the space of all probability measures on a measurable space (X, Σ) and H an RKHS endowed with a reproducing kernel k : X × X → R. The kernel mean embedding is defined by

      µ : P → H,   P ↦ ∫ k(·, x) dP(x).

  Remark: For a Dirac measure δ_x, we recover the canonical feature map: δ_x ↦ µ[δ_x] = k(·, x).
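In practice µ_P is unknown and is replaced by the empirical embedding µ̂_P = (1/n) ∑ᵢ k(·, xᵢ) computed from a sample. A minimal sketch, assuming a Gaussian kernel and toy data of our choosing; by the reproducing property, evaluating µ̂_P at a point t amounts to averaging kernel values:

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Empirical kernel mean from a sample x_1, ..., x_n ~ P:
# mu_hat = (1/n) sum_i k(., x_i).
rng = np.random.default_rng(0)
X_sample = rng.normal(size=(200, 1))

# mu_hat(t) = <mu_hat, k(., t)> = (1/n) sum_i k(t, x_i) at query points t.
t = np.linspace(-4, 4, 9)[:, None]
mu_hat_at_t = gaussian_gram(t, X_sample).mean(axis=1)
print(mu_hat_at_t)
```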

  30.–31. Embedding of Marginal Distributions (continued).

  - If E_{X∼P}[√k(X, X)] < ∞, then µ_P ∈ H and

        E_{X∼P}[f(X)] = ⟨f, µ_P⟩,   f ∈ H.

  - The kernel k is said to be characteristic if the map P ↦ µ_P is injective. That is, ‖µ_P − µ_Q‖_H = 0 if and only if P = Q.
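The quantity ‖µ_P − µ_Q‖_H is known as the maximum mean discrepancy (MMD), and it admits a plug-in estimate that uses only kernel evaluations: ‖µ̂_P − µ̂_Q‖² expands into three Gram-matrix averages. A sketch with the (characteristic) Gaussian kernel on toy samples of our choosing:

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(X, Y, gamma=1.0):
    """Biased plug-in estimate of ||mu_P - mu_Q||_H^2 from samples X ~ P, Y ~ Q."""
    return (gaussian_gram(X, X, gamma).mean()
            - 2.0 * gaussian_gram(X, Y, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
P1 = rng.normal(0.0, 1.0, size=(300, 1))
P2 = rng.normal(0.0, 1.0, size=(300, 1))   # second sample from the same P
Q = rng.normal(0.5, 1.0, size=(300, 1))    # a shifted distribution

print(mmd2(P1, P2))  # close to 0: same underlying distribution
print(mmd2(P1, Q))   # clearly positive: a characteristic kernel separates P and Q
```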
