An Introduction to Hilbert Space Embedding of Probability Measures

Krikamol Muandet, Max Planck Institute for Intelligent Systems, Tübingen, Germany. Jeju, South Korea, February 22, 2019.


Reference

Muandet, Fukumizu, Sriperumbudur, and Schölkopf. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends in Machine Learning, 2017.

Outline

1. From Points to Measures
2. Embedding of Marginal Distributions
3. Embedding of Conditional Distributions
4. Future Directions


Classification Problem

[Figure: two-class data, labeled +1 and −1, plotted in the input space (x1, x2).]
Feature Map

φ : (x1, x2) ↦ (x1², x2², √2·x1x2)

[Figure: the two-class data (+1, −1) shown in the input space (x1, x2) and, after applying φ, in the feature space (φ1, φ2, φ3).]

The feature map realizes the squared dot product as an inner product in R³:

⟨φ(x), φ(x′)⟩_R³ = (x · x′)²
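To make the identity concrete, here is a minimal numerical check (a sketch using NumPy; the sample points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
xp = np.array([0.7, 0.5])

lhs = phi(x) @ phi(xp)        # inner product in the feature space R^3
rhs = (x @ xp) ** 2           # kernel evaluated directly in the input space
assert np.isclose(lhs, rhs)   # <phi(x), phi(x')> = (x . x')^2
```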


Our recipe:

1. Construct a non-linear feature map φ : X → H.
2. Evaluate Dφ = {φ(x1), φ(x2), . . . , φ(xn)}.
3. Solve the learning problem in H using Dφ.


Kernels

Definition

A function k : X × X → R is called a kernel on X if there exists a Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X we have k(x, x′) = ⟨φ(x), φ(x′)⟩_H. We call φ a feature map and H a feature space of k.

Example

1. k(x, x′) = (x · x′)² for x, x′ ∈ R²
   ◮ φ(x) = (x1², x2², √2·x1x2)
   ◮ H = R³
2. k(x, x′) = (x · x′ + c)^m for c > 0, x, x′ ∈ R^d
   ◮ dim(H) = (d + m choose m)
3. k(x, x′) = exp(−γ‖x − x′‖₂²)
   ◮ H = R^∞
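For illustration, these kernels are one-liners in NumPy (a sketch; the default values of c, m, and γ are arbitrary choices):

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, m=2):
    """Polynomial kernel (x . x' + c)^m."""
    return (x @ xp + c) ** m

def gaussian_kernel(x, xp, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))
```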


Positive Definite Kernels

Definition (Positive definiteness)

A function k : X × X → R is called positive definite if, for all n ∈ N, α1, . . . , αn ∈ R and all x1, . . . , xn ∈ X, we have

Σ_{i=1}^n Σ_{j=1}^n αi αj k(xj, xi) ≥ 0.

Equivalently, every Gram matrix K with K_ij = k(xi, xj) is positive semi-definite.

Example (Any kernel is positive definite)

Let k be a kernel with feature map φ : X → H. Then we have

Σ_{i=1}^n Σ_{j=1}^n αi αj k(xj, xi) = ⟨Σ_{i=1}^n αi φ(xi), Σ_{j=1}^n αj φ(xj)⟩_H = ‖Σ_{i=1}^n αi φ(xi)‖²_H ≥ 0.

Positive definiteness is a necessary (and sufficient) condition for a function to be a kernel.
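A quick numerical sanity check of this fact (a sketch in NumPy; the points and kernel parameters are arbitrary): build a Gram matrix and confirm that its eigenvalues are non-negative up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 arbitrary points in R^2

# Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

eigvals = np.linalg.eigvalsh(K)       # K is symmetric
assert eigvals.min() > -1e-10         # positive semi-definite
```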


Reproducing Kernel Hilbert Spaces

Let H be a Hilbert space of functions mapping from X into R.

1. A function k : X × X → R is called a reproducing kernel of H if we have k(·, x) ∈ H for all x ∈ X and the reproducing property f(x) = ⟨f, k(·, x)⟩_H holds for all f ∈ H and all x ∈ X.

2. The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δx : H → R defined by δx(f) := f(x), f ∈ H, is continuous.

Remark: If ‖fn − f‖_H → 0 as n → ∞, then for all x ∈ X we have lim_{n→∞} fn(x) = f(x)


Reproducing Kernels

Lemma (Reproducing kernels are kernels)

Let H be a Hilbert space over X with a reproducing kernel k. Then H is an RKHS and is also a feature space of k, where the feature map φ : X → H is given by φ(x) = k(·, x), x ∈ X. We call φ the canonical feature map.

Proof

We fix an x′ ∈ X and write f := k(·, x′). Then, for x ∈ X, the reproducing property yields ⟨φ(x′), φ(x)⟩ = ⟨k(·, x′), k(·, x)⟩ = ⟨f, k(·, x)⟩ = f(x) = k(x, x′).
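Concretely, for a function f = Σᵢ αᵢ k(·, cᵢ) in the span of canonical features, the reproducing property gives f(x) = ⟨f, k(·, x)⟩_H = Σᵢ αᵢ k(cᵢ, x). A sketch in NumPy (the centers, coefficients, and Gaussian kernel are arbitrary choices):

```python
import numpy as np

def k(x, xp, gamma=0.5):
    """Gaussian kernel on R^d."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

centers = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 0.5]])
alpha = np.array([0.5, -1.0, 2.0])

def f(x):
    """Evaluate f = sum_i alpha_i k(., c_i) via the reproducing property."""
    return sum(a * k(c, x) for a, c in zip(alpha, centers))

print(f(np.array([0.2, 0.3])))
```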


Kernels and RKHSs

Theorem (Every RKHS has a unique reproducing kernel)

Let H be an RKHS over X. Then k : X × X → R defined by k(x, x′) = ⟨δx, δx′⟩_H, x, x′ ∈ X, is the only reproducing kernel of H. Furthermore, if (ei)_{i∈I} is an orthonormal basis of H, then for all x, x′ ∈ X we have

k(x, x′) = Σ_{i∈I} ei(x) ei(x′).

Universal kernels

A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and all ε > 0 there exists an f ∈ H such that ‖f − g‖_∞ ≤ ε.


From Points to Measures

[Figure: points x, y in the input space X are mapped to functions k(x, ·), k(y, ·) in the feature space H, where a function f can act on them.]

x ↦ k(·, x), or equivalently, δx ↦ ∫ k(·, z) dδx(z)


Embedding of Marginal Distributions

[Figure: densities p(x) of two distributions P and Q on X are mapped to mean embeddings µP and µQ in the RKHS H.]

Definition

Let P be the space of all probability measures on a measurable space (X, Σ) and H an RKHS endowed with a reproducing kernel k : X × X → R. The kernel mean embedding is defined by

µ : P → H,  P ↦ ∫ k(·, x) dP(x).

Remark: For a Dirac measure δx, we recover the feature map of points: µ[δx] = k(·, x).


◮ If E_{X∼P}[√k(X, X)] < ∞, then µP ∈ H and

E_{X∼P}[f(X)] = ⟨f, µP⟩_H for all f ∈ H.

◮ The kernel k is said to be characteristic if the map P ↦ µP is injective. That is, ‖µP − µQ‖_H = 0 if and only if P = Q.


Kernel Mean Estimation

◮ Given an i.i.d. sample x1, x2, . . . , xn from P, we can estimate µP by

µ̂P := (1/n) Σ_{i=1}^n k(xi, ·).

◮ For each f ∈ H, we have E_{X∼P̂}[f(X)] = ⟨f, µ̂P⟩_H, where P̂ is the empirical measure.

◮ Assume that ‖f‖_∞ ≤ 1 for all f ∈ H with ‖f‖_H ≤ 1. With probability at least 1 − δ,

‖µ̂P − µP‖_H ≤ 2√(E_{x∼P}[k(x, x)] / n) + √(2 log(1/δ) / n).

◮ The convergence happens at a rate O_p(n^{−1/2}), which has been shown to be minimax optimal.¹

◮ In the high-dimensional setting, the estimate can be improved by shrinkage estimators:²

µ̂α := α f* + (1 − α) µ̂P,  f* ∈ H.

¹ Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
² Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
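A minimal sketch of the empirical embedding in NumPy (the Gaussian kernel, its bandwidth, and the sample are arbitrary choices): µ̂P is represented by the sample itself, and evaluating µ̂P at a point is just an average of kernel values.

```python
import numpy as np

def gaussian_gram(X, Y, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma * ||x_i - y_j||^2)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # i.i.d. sample x_1, ..., x_n from P

def mu_hat(z):
    """Evaluate mu_hat_P(z) = (1/n) sum_i k(x_i, z)."""
    return gaussian_gram(X, np.atleast_2d(z))[:, 0].mean()

# By the reproducing property, <mu_hat_P, k(., z)> = mu_hat_P(z):
print(mu_hat(np.array([0.0, 0.0])))
```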


Explicit Representation

What properties are captured by µP?

◮ k(x, x′) = ⟨x, x′⟩: the first moment of P
◮ k(x, x′) = (⟨x, x′⟩ + 1)^p: moments of P up to order p ∈ N
◮ k(x, x′) universal/characteristic: all information about P

Moment-generating function

Consider k(x, x′) = exp(⟨x, x′⟩). Then µP = E_{X∼P}[e^{⟨X, ·⟩}].

Characteristic function

Consider k(x, y) = ψ(x − y) for x, y ∈ R^d, where ψ is a positive definite function. Then

µP(y) = ∫ ψ(x − y) dP(x) = Λ · P̂ for a positive finite measure Λ,

where P̂ denotes the characteristic function of P.


Application: High-Level Generalization

◮ Learning from Distributions. KM., Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.

◮ Group Anomaly Detection. KM. and Schölkopf. UAI 2013. [Figure: groups of points in the input space map to single points in the distribution space.]

◮ Domain Adaptation/Generalization. KM. et al. ICML 2013; Zhang, KM. et al. ICML 2013. [Figure: training samples (Xk, Yk), k = 1, . . . , ni, drawn from distributions P1, . . . , PN, and an unseen test sample Xk from PX.]

◮ Cause-Effect Inference (X → Y or Y → X). Lopez-Paz, KM. et al. JMLR 2015, ICML 2015.


Support Measure Machine (SMM)

x ↦ k(·, x),  δx ↦ ∫ k(·, z) dδx(z),  P ↦ ∫ k(·, z) dP(z)

Theorem

Under technical assumptions on Ω : [0, +∞) → R and a loss function ℓ : (P × R²)^m → R ∪ {+∞}, any f ∈ H minimizing

ℓ(P1, y1, E_{P1}[f], . . . , Pm, ym, E_{Pm}[f]) + Ω(‖f‖_H)

admits a representation of the form

f = Σ_{i=1}^m αi E_{x∼Pi}[k(x, ·)] = Σ_{i=1}^m αi µPi.
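In practice the theorem means an SMM reduces to a standard kernel machine whose Gram matrix is computed between mean embeddings, since ⟨µ̂Pi, µ̂Pj⟩_H is simply the average of k over all pairs of samples. A sketch (NumPy; `samples` and the downstream learner are placeholders, and `gaussian_gram` refers to the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def embedding_gram(samples, kernel):
    """G_ij = <mu_hat_Pi, mu_hat_Pj>_H between empirical embeddings."""
    m = len(samples)
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            # average of k(x, y) over all pairs from sample i and sample j
            G[i, j] = kernel(samples[i], samples[j]).mean()
    return G

# samples: a list of arrays, one (n_i, d) sample per distribution P_i,
# e.g. G = embedding_gram(samples, gaussian_gram).
# G can then be handed to any kernel method (SVM, ridge regression, ...).
```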


Kernel Discrepancy Measure for Distributions

◮ Maximum mean discrepancy (MMD):

MMD(P, Q; H) := sup_{h∈H, ‖h‖_H≤1} [ ∫ h(x) dP(x) − ∫ h(x) dQ(x) ]

◮ MMD is an integral probability metric (IPM) and corresponds to the RKHS distance between mean embeddings:

MMD²(P, Q; H) = ‖µP − µQ‖²_H.

◮ If k is universal, then ‖µP − µQ‖_H = 0 if and only if P = Q.

◮ Given samples {xi}_{i=1}^n ∼ P and {yj}_{j=1}^m ∼ Q, an unbiased empirical estimate is

MMD²_u(P, Q; H) = 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} k(xi, xj) + 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(yi, yj) − 2/(nm) Σ_{i=1}^n Σ_{j=1}^m k(xi, yj).
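A direct NumPy translation of the unbiased estimator (a sketch; the samples are arbitrary, and `gaussian_gram` is the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def mmd2_unbiased(X, Y, kernel):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))  # i != j terms only
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))    # shifted mean, so MMD^2 > 0
print(mmd2_unbiased(X, Y, gaussian_gram))
```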


Generative Adversarial Networks

Learn a deep generative model G via the minimax optimization

min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))],

where D is a discriminator and z ∼ N(0, σ²I).

[Figure: a generator Gθ maps random noise z to synthetic samples Gθ(z); a discriminator Dφ decides whether its input, x or Gθ(z), is real or synthetic. An MMD test of whether ‖µ̂X − µ̂Gθ(Z)‖_H is zero compares the real data {xi} with the synthetic data {Gθ(zi)}.]


Generative Moment Matching Network

◮ The GAN aims to match two distributions P(X) and Gθ.

◮ The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers

min_θ ‖µX − µGθ(Z)‖²_H = min_θ ‖ ∫ φ(X) dP(X) − ∫ φ(X̃) dGθ(X̃) ‖²_H
                       = min_θ ( sup_{h∈H, ‖h‖≤1} ∫ h dP − ∫ h dGθ )².

(A minimal training sketch is given after the list below.)

◮ Many tricks have been proposed to improve the GMMN:
  ◮ Optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  ◮ Gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  ◮ Repulsive loss (Wang et al., 2019)
  ◮ Optimized witness points (Mehrjou et al., 2019)
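A GMMN-style training step might look as follows (a sketch; PyTorch, the biased MMD loss, the generator interface, and all hyperparameters are assumptions, not part of the original slides):

```python
import torch

def gaussian_gram_t(X, Y, gamma=0.5):
    """Differentiable Gaussian Gram matrix in PyTorch."""
    return torch.exp(-gamma * torch.cdist(X, Y) ** 2)

def mmd2_biased(X, Y):
    """Biased (V-statistic) MMD^2; simpler to use as a training loss."""
    return (gaussian_gram_t(X, X).mean()
            + gaussian_gram_t(Y, Y).mean()
            - 2.0 * gaussian_gram_t(X, Y).mean())

def gmmn_step(generator, optimizer, real_batch, noise_dim=10):
    """One gradient step of min_theta MMD^2(P_data, G_theta(Z))."""
    z = torch.randn(len(real_batch), noise_dim)
    loss = mmd2_biased(real_batch, generator(z))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```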



Conditional Distribution P(Y | X)?

A conditional distribution is a collection of distributions P_Y := {P(Y | X = x) : x ∈ X}.

◮ For each x ∈ X, we can define an embedding of P(Y | X = x) as

µ_{Y|x} := ∫_Y ϕ(y) dP(y | X = x) = E_{Y|x}[ϕ(Y)],

where ϕ : Y → G is a feature map on Y.


Covariance Operators

◮ Let H, G be RKHSs on X, Y with feature maps φ(x) = k(x, ·) and ϕ(y) = ℓ(y, ·).

◮ Let C_XX : H → H and C_YX : H → G be the covariance operator on X and the cross-covariance operator from X to Y, i.e.,

C_XX = ∫ φ(X) ⊗ φ(X) dP(X),  C_YX = ∫ ϕ(Y) ⊗ φ(X) dP(Y, X).

◮ Alternatively, C_YX is the unique bounded operator satisfying

⟨g, C_YX f⟩_G = Cov[g(Y), f(X)].

◮ If E_{Y|X}[g(Y) | X = ·] ∈ H for g ∈ G, then

C_XX E_{Y|X}[g(Y) | X = ·] = C_XY g.


Embedding of Conditional Distributions

[Figure: the operator C_YX C_XX^{-1} maps the feature k(x, ·) ∈ H of a point x to the embedding µ_{Y|X=x} ∈ G of the conditional density p(y|x).]

The conditional mean embedding of P(Y | X) can be defined as

U_{Y|X} : H → G,  U_{Y|X} := C_YX C_XX^{-1}.


Conditional Mean Embedding

◮ To fully represent P(Y | X), we need to perform both conditioning and conditional expectation.

◮ To represent P(Y | X = x) for x ∈ X, it follows that

E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX^{-1} k(x, ·) =: µ_{Y|x}.

◮ It follows from the reproducing property of G that

E_{Y|x}[g(Y) | X = x] = ⟨µ_{Y|x}, g⟩_G for all g ∈ G.

◮ In an infinite-dimensional RKHS, C_XX^{-1} does not exist. Hence, we often use the regularized version

U_{Y|X} := C_YX (C_XX + εI)^{-1}.


Conditional Mean Estimation

◮ Given a joint sample (x1, y1), . . . , (xn, yn) from P(X, Y), we have

Ĉ_XX = (1/n) Σ_{i=1}^n φ(xi) ⊗ φ(xi),  Ĉ_YX = (1/n) Σ_{i=1}^n ϕ(yi) ⊗ φ(xi).

◮ Then µ_{Y|x} for a given x ∈ X can be estimated as

µ̂_{Y|x} = Ĉ_YX (Ĉ_XX + εI)^{-1} k(x, ·) = Φ(K + nεI_n)^{-1} k_x = Σ_{i=1}^n βi ϕ(yi),

where ε > 0 is a regularization parameter and Φ = [ϕ(y1), . . . , ϕ(yn)], K_ij = k(xi, xj), k_x = [k(x1, x), . . . , k(xn, x)]^⊤.

◮ Under some technical assumptions, µ̂_{Y|x} → µ_{Y|x} as n → ∞.
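In code the estimate reduces to one regularized linear solve for the weights β (a sketch in NumPy; the kernel and the value of ε are arbitrary choices, and `gaussian_gram` refers to the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def conditional_mean_weights(X, x, kernel, eps=1e-3):
    """Weights beta such that mu_hat_{Y|x} = sum_i beta_i phi(y_i)."""
    n = len(X)
    K = kernel(X, X)                          # K_ij = k(x_i, x_j)
    k_x = kernel(X, np.atleast_2d(x))[:, 0]   # k_x = [k(x_i, x)]_i
    return np.linalg.solve(K + n * eps * np.eye(n), k_x)

# E[g(Y) | X = x] is then approximated by sum_i beta_i g(y_i);
# e.g. beta = conditional_mean_weights(X, x, gaussian_gram), and the
# conditional mean of Y itself is beta @ Y.
```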


Kernel Sum Rule: P(X) = Σ_Y P(X, Y)

◮ By the law of total expectation,

µX = E_X[φ(X)] = E_Y[E_{X|Y}[φ(X) | Y]] = E_Y[U_{X|Y} ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y)] = U_{X|Y} µY.

◮ Let µ̂Y = Σ_{i=1}^m αi ϕ(ỹi) and Û_{X|Y} = Ĉ_XY (Ĉ_YY + λI)^{-1}. Then

µ̂X = Û_{X|Y} µ̂Y = Υ(L + nλI)^{-1} L̃ α,

where Υ = [φ(x1), . . . , φ(xn)], α = (α1, . . . , αm)^⊤, L_ij = ℓ(yi, yj), and L̃_ij = ℓ(yi, ỹj).

◮ That is, we have

µ̂X = Σ_{j=1}^n βj φ(xj) with β = (L + nλI)^{-1} L̃ α.
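The sum rule is again plain matrix algebra (a sketch in NumPy; `Y` holds the training outputs yi, `Y_tilde` the points ỹj representing µ̂Y, λ is an arbitrary regularizer, and `gaussian_gram` is the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def kernel_sum_rule_weights(Y, Y_tilde, alpha, kernel, lam=1e-3):
    """beta = (L + n*lam*I)^{-1} L_tilde @ alpha, so mu_hat_X = sum_j beta_j phi(x_j)."""
    n = len(Y)
    L = kernel(Y, Y)              # L_ij = l(y_i, y_j)
    L_tilde = kernel(Y, Y_tilde)  # L~_ij = l(y_i, y~_j)
    return np.linalg.solve(L + n * lam * np.eye(n), L_tilde @ alpha)
```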


Kernel Product Rule: P(X, Y) = P(Y | X) P(X)

◮ We can factorize µXY = E_{XY}[φ(X) ⊗ ϕ(Y)] in two ways:

E_Y[E_{X|Y}[φ(X) | Y] ⊗ ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y) ⊗ ϕ(Y)],
E_X[E_{Y|X}[ϕ(Y) | X] ⊗ φ(X)] = U_{Y|X} E_X[φ(X) ⊗ φ(X)].

◮ Let µ⊗_X = E_X[φ(X) ⊗ φ(X)] and µ⊗_Y = E_Y[ϕ(Y) ⊗ ϕ(Y)].

◮ Then the product rule becomes

µXY = U_{X|Y} µ⊗_Y = U_{Y|X} µ⊗_X.

◮ Alternatively, we may write the above formulation as

C_XY = U_{X|Y} C_YY and C_YX = U_{Y|X} C_XX.

◮ The kernel sum and product rules can be combined to obtain the kernel Bayes' rule.³

³ Fukumizu et al. Kernel Bayes' Rule. JMLR, 2013.


Future Directions

◮ Representation learning and embedding of distributions
◮ Kernel methods in deep learning:
  ◮ MMD-GAN
  ◮ Wasserstein autoencoder (WAE)
  ◮ Invariant learning in deep neural networks
◮ Kernel mean estimation in the high-dimensional setting
◮ Recovering (conditional) distributions from mean embeddings


Q & A