An Introduction to Hilbert Space Embedding of Probability Measures

Krikamol Muandet, Max Planck Institute for Intelligent Systems, Tübingen, Germany. Jeju, South Korea, February 22, 2019.


Reference

Muandet, Fukumizu, Sriperumbudur, and Schölkopf. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends in Machine Learning, 2017.

Outline

1. From Points to Measures
2. Embedding of Marginal Distributions
3. Embedding of Conditional Distributions
4. Future Directions


Classification Problem

[Figure: two-class data, labeled +1 and −1, plotted in the input space (x1, x2).]
Feature Map

φ : (x1, x2) ↦ (x1², x2², √2·x1x2)

[Figure: the two-class data (+1, −1) shown in the input space (x1, x2) and, after applying φ, in the feature space (φ1, φ2, φ3).]

The feature map realizes the squared dot product as an inner product in R³:

⟨φ(x), φ(x′)⟩_R³ = (x · x′)²
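To make the identity concrete, here is a minimal numerical check (a sketch using NumPy; the sample points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
xp = np.array([0.7, 0.5])

lhs = phi(x) @ phi(xp)        # inner product in the feature space R^3
rhs = (x @ xp) ** 2           # kernel evaluated directly in the input space
assert np.isclose(lhs, rhs)   # <phi(x), phi(x')> = (x . x')^2
```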


Our recipe:

1. Construct a non-linear feature map φ : X → H.
2. Evaluate Dφ = {φ(x1), φ(x2), . . . , φ(xn)}.
3. Solve the learning problem in H using Dφ.


Kernels

Definition

A function k : X × X → R is called a kernel on X if there exists a Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X we have k(x, x′) = ⟨φ(x), φ(x′)⟩_H. We call φ a feature map and H a feature space of k.

Example

1. k(x, x′) = (x · x′)² for x, x′ ∈ R²
   ◮ φ(x) = (x1², x2², √2·x1x2)
   ◮ H = R³
2. k(x, x′) = (x · x′ + c)^m for c > 0, x, x′ ∈ R^d
   ◮ dim(H) = (d + m choose m)
3. k(x, x′) = exp(−γ‖x − x′‖₂²)
   ◮ H = R^∞
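For illustration, these kernels are one-liners in NumPy (a sketch; the default values of c, m, and γ are arbitrary choices):

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, m=2):
    """Polynomial kernel (x . x' + c)^m."""
    return (x @ xp + c) ** m

def gaussian_kernel(x, xp, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))
```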


Positive Definite Kernels

Definition (Positive definiteness)

A function k : X × X → R is called positive definite if, for all n ∈ N, α1, . . . , αn ∈ R and all x1, . . . , xn ∈ X, we have

Σ_{i=1}^n Σ_{j=1}^n αi αj k(xj, xi) ≥ 0.

Equivalently, every Gram matrix K with K_ij = k(xi, xj) is positive semi-definite.

Example (Any kernel is positive definite)

Let k be a kernel with feature map φ : X → H. Then we have

Σ_{i=1}^n Σ_{j=1}^n αi αj k(xj, xi) = ⟨Σ_{i=1}^n αi φ(xi), Σ_{j=1}^n αj φ(xj)⟩_H = ‖Σ_{i=1}^n αi φ(xi)‖²_H ≥ 0.

Positive definiteness is a necessary (and sufficient) condition for a function to be a kernel.
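A quick numerical sanity check of this fact (a sketch in NumPy; the points and kernel parameters are arbitrary): build a Gram matrix and confirm that its eigenvalues are non-negative up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 arbitrary points in R^2

# Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

eigvals = np.linalg.eigvalsh(K)       # K is symmetric
assert eigvals.min() > -1e-10         # positive semi-definite
```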


Reproducing Kernel Hilbert Spaces

Let H be a Hilbert space of functions mapping from X into R.

1. A function k : X × X → R is called a reproducing kernel of H if we have k(·, x) ∈ H for all x ∈ X and the reproducing property f(x) = ⟨f, k(·, x)⟩_H holds for all f ∈ H and all x ∈ X.

2. The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δx : H → R defined by δx(f) := f(x), f ∈ H, is continuous.

Remark: If ‖fn − f‖_H → 0 as n → ∞, then for all x ∈ X we have lim_{n→∞} fn(x) = f(x)


Reproducing Kernels

Lemma (Reproducing kernels are kernels)

Let H be a Hilbert space over X with a reproducing kernel k. Then H is an RKHS and is also a feature space of k, where the feature map φ : X → H is given by φ(x) = k(·, x), x ∈ X. We call φ the canonical feature map.

Proof

We fix an x′ ∈ X and write f := k(·, x′). Then, for x ∈ X, the reproducing property yields ⟨φ(x′), φ(x)⟩ = ⟨k(·, x′), k(·, x)⟩ = ⟨f, k(·, x)⟩ = f(x) = k(x, x′).
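Concretely, for a function f = Σᵢ αᵢ k(·, cᵢ) in the span of canonical features, the reproducing property gives f(x) = ⟨f, k(·, x)⟩_H = Σᵢ αᵢ k(cᵢ, x). A sketch in NumPy (the centers, coefficients, and Gaussian kernel are arbitrary choices):

```python
import numpy as np

def k(x, xp, gamma=0.5):
    """Gaussian kernel on R^d."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

centers = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 0.5]])
alpha = np.array([0.5, -1.0, 2.0])

def f(x):
    """Evaluate f = sum_i alpha_i k(., c_i) via the reproducing property."""
    return sum(a * k(c, x) for a, c in zip(alpha, centers))

print(f(np.array([0.2, 0.3])))
```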


Kernels and RKHSs

Theorem (Every RKHS has a unique reproducing kernel)

Let H be an RKHS over X. Then k : X × X → R defined by k(x, x′) = ⟨δx, δx′⟩_H, x, x′ ∈ X, is the only reproducing kernel of H. Furthermore, if (ei)_{i∈I} is an orthonormal basis of H, then for all x, x′ ∈ X we have

k(x, x′) = Σ_{i∈I} ei(x) ei(x′).

Universal kernels

A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and all ε > 0 there exists an f ∈ H such that ‖f − g‖_∞ ≤ ε.


From Points to Measures

[Figure: points x, y in the input space X are mapped to functions k(x, ·), k(y, ·) in the feature space H, where a function f can act on them.]

x ↦ k(·, x), or equivalently, δx ↦ ∫ k(·, z) dδx(z)


Embedding of Marginal Distributions

[Figure: densities p(x) of two distributions P and Q on X are mapped to mean embeddings µP and µQ in the RKHS H.]

Definition

Let P be the space of all probability measures on a measurable space (X, Σ) and H an RKHS endowed with a reproducing kernel k : X × X → R. The kernel mean embedding is defined by

µ : P → H,  P ↦ ∫ k(·, x) dP(x).

Remark: For a Dirac measure δx, we recover the feature map of points: µ[δx] = k(·, x).


◮ If E_{X∼P}[√k(X, X)] < ∞, then µP ∈ H and

E_{X∼P}[f(X)] = ⟨f, µP⟩_H for all f ∈ H.

◮ The kernel k is said to be characteristic if the map P ↦ µP is injective. That is, ‖µP − µQ‖_H = 0 if and only if P = Q.


Kernel Mean Estimation

◮ Given an i.i.d. sample x1, x2, . . . , xn from P, we can estimate µP by

µ̂P := (1/n) Σ_{i=1}^n k(xi, ·).

◮ For each f ∈ H, we have E_{X∼P̂}[f(X)] = ⟨f, µ̂P⟩_H, where P̂ is the empirical measure.

◮ Assume that ‖f‖_∞ ≤ 1 for all f ∈ H with ‖f‖_H ≤ 1. With probability at least 1 − δ,

‖µ̂P − µP‖_H ≤ 2√(E_{x∼P}[k(x, x)] / n) + √(2 log(1/δ) / n).

◮ The convergence happens at a rate O_p(n^{−1/2}), which has been shown to be minimax optimal.¹

◮ In the high-dimensional setting, the estimate can be improved by shrinkage estimators:²

µ̂α := α f* + (1 − α) µ̂P,  f* ∈ H.

¹ Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
² Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
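A minimal sketch of the empirical embedding in NumPy (the Gaussian kernel, its bandwidth, and the sample are arbitrary choices): µ̂P is represented by the sample itself, and evaluating µ̂P at a point is just an average of kernel values.

```python
import numpy as np

def gaussian_gram(X, Y, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma * ||x_i - y_j||^2)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # i.i.d. sample x_1, ..., x_n from P

def mu_hat(z):
    """Evaluate mu_hat_P(z) = (1/n) sum_i k(x_i, z)."""
    return gaussian_gram(X, np.atleast_2d(z))[:, 0].mean()

# By the reproducing property, <mu_hat_P, k(., z)> = mu_hat_P(z):
print(mu_hat(np.array([0.0, 0.0])))
```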


Explicit Representation

What properties are captured by µP?

◮ k(x, x′) = ⟨x, x′⟩: the first moment of P
◮ k(x, x′) = (⟨x, x′⟩ + 1)^p: moments of P up to order p ∈ N
◮ k(x, x′) universal/characteristic: all information about P

Moment-generating function

Consider k(x, x′) = exp(⟨x, x′⟩). Then µP = E_{X∼P}[e^{⟨X, ·⟩}].

Characteristic function

Consider k(x, y) = ψ(x − y) for x, y ∈ R^d, where ψ is a positive definite function. Then

µP(y) = ∫ ψ(x − y) dP(x) = Λ · P̂ for a positive finite measure Λ,

where P̂ denotes the characteristic function of P.


Application: High-Level Generalization

◮ Learning from Distributions. KM., Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.

◮ Group Anomaly Detection. KM. and Schölkopf. UAI 2013. [Figure: groups of points in the input space map to single points in the distribution space.]

◮ Domain Adaptation/Generalization. KM. et al. ICML 2013; Zhang, KM. et al. ICML 2013. [Figure: training samples (Xk, Yk), k = 1, . . . , ni, drawn from distributions P1, . . . , PN, and an unseen test sample Xk from PX.]

◮ Cause-Effect Inference (X → Y or Y → X). Lopez-Paz, KM. et al. JMLR 2015, ICML 2015.


Support Measure Machine (SMM)

x ↦ k(·, x),  δx ↦ ∫ k(·, z) dδx(z),  P ↦ ∫ k(·, z) dP(z)

Theorem

Under technical assumptions on Ω : [0, +∞) → R and a loss function ℓ : (P × R²)^m → R ∪ {+∞}, any f ∈ H minimizing

ℓ(P1, y1, E_{P1}[f], . . . , Pm, ym, E_{Pm}[f]) + Ω(‖f‖_H)

admits a representation of the form

f = Σ_{i=1}^m αi E_{x∼Pi}[k(x, ·)] = Σ_{i=1}^m αi µPi.
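In practice the theorem means an SMM reduces to a standard kernel machine whose Gram matrix is computed between mean embeddings, since ⟨µ̂Pi, µ̂Pj⟩_H is simply the average of k over all pairs of samples. A sketch (NumPy; `samples` and the downstream learner are placeholders, and `gaussian_gram` refers to the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def embedding_gram(samples, kernel):
    """G_ij = <mu_hat_Pi, mu_hat_Pj>_H between empirical embeddings."""
    m = len(samples)
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            # average of k(x, y) over all pairs from sample i and sample j
            G[i, j] = kernel(samples[i], samples[j]).mean()
    return G

# samples: a list of arrays, one (n_i, d) sample per distribution P_i,
# e.g. G = embedding_gram(samples, gaussian_gram).
# G can then be handed to any kernel method (SVM, ridge regression, ...).
```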


Kernel Discrepancy Measure for Distributions

◮ Maximum mean discrepancy (MMD):

MMD(P, Q; H) := sup_{h∈H, ‖h‖_H≤1} [ ∫ h(x) dP(x) − ∫ h(x) dQ(x) ]

◮ MMD is an integral probability metric (IPM) and corresponds to the RKHS distance between mean embeddings:

MMD²(P, Q; H) = ‖µP − µQ‖²_H.

◮ If k is universal, then ‖µP − µQ‖_H = 0 if and only if P = Q.

◮ Given samples {xi}_{i=1}^n ∼ P and {yj}_{j=1}^m ∼ Q, an unbiased empirical estimate is

MMD²_u(P, Q; H) = 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} k(xi, xj) + 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(yi, yj) − 2/(nm) Σ_{i=1}^n Σ_{j=1}^m k(xi, yj).
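A direct NumPy translation of the unbiased estimator (a sketch; the samples are arbitrary, and `gaussian_gram` is the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def mmd2_unbiased(X, Y, kernel):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))  # i != j terms only
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))    # shifted mean, so MMD^2 > 0
print(mmd2_unbiased(X, Y, gaussian_gram))
```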


Generative Adversarial Networks

Learn a deep generative model G via the minimax optimization

min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))],

where D is a discriminator and z ∼ N(0, σ²I).

[Figure: a generator Gθ maps random noise z to synthetic samples Gθ(z); a discriminator Dφ decides whether its input, x or Gθ(z), is real or synthetic. An MMD test of whether ‖µ̂X − µ̂Gθ(Z)‖_H is zero compares the real data {xi} with the synthetic data {Gθ(zi)}.]


Generative Moment Matching Network

◮ The GAN aims to match two distributions P(X) and Gθ.

◮ The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers

min_θ ‖µX − µGθ(Z)‖²_H = min_θ ‖ ∫ φ(X) dP(X) − ∫ φ(X̃) dGθ(X̃) ‖²_H
                       = min_θ ( sup_{h∈H, ‖h‖≤1} ∫ h dP − ∫ h dGθ )².

(A minimal training sketch is given after the list below.)

◮ Many tricks have been proposed to improve the GMMN:
  ◮ Optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  ◮ Gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  ◮ Repulsive loss (Wang et al., 2019)
  ◮ Optimized witness points (Mehrjou et al., 2019)
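A GMMN-style training step might look as follows (a sketch; PyTorch, the biased MMD loss, the generator interface, and all hyperparameters are assumptions, not part of the original slides):

```python
import torch

def gaussian_gram_t(X, Y, gamma=0.5):
    """Differentiable Gaussian Gram matrix in PyTorch."""
    return torch.exp(-gamma * torch.cdist(X, Y) ** 2)

def mmd2_biased(X, Y):
    """Biased (V-statistic) MMD^2; simpler to use as a training loss."""
    return (gaussian_gram_t(X, X).mean()
            + gaussian_gram_t(Y, Y).mean()
            - 2.0 * gaussian_gram_t(X, Y).mean())

def gmmn_step(generator, optimizer, real_batch, noise_dim=10):
    """One gradient step of min_theta MMD^2(P_data, G_theta(Z))."""
    z = torch.randn(len(real_batch), noise_dim)
    loss = mmd2_biased(real_batch, generator(z))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```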



Conditional Distribution P(Y | X)?

A conditional distribution is a collection of distributions P_Y := {P(Y | X = x) : x ∈ X}.

◮ For each x ∈ X, we can define an embedding of P(Y | X = x) as

µ_{Y|x} := ∫_Y ϕ(y) dP(y | X = x) = E_{Y|x}[ϕ(Y)],

where ϕ : Y → G is a feature map on Y.


Covariance Operators

◮ Let H, G be RKHSs on X, Y with feature maps φ(x) = k(x, ·) and ϕ(y) = ℓ(y, ·).

◮ Let C_XX : H → H and C_YX : H → G be the covariance operator on X and the cross-covariance operator from X to Y, i.e.,

C_XX = ∫ φ(X) ⊗ φ(X) dP(X),  C_YX = ∫ ϕ(Y) ⊗ φ(X) dP(Y, X).

◮ Alternatively, C_YX is the unique bounded operator satisfying

⟨g, C_YX f⟩_G = Cov[g(Y), f(X)].

◮ If E_{Y|X}[g(Y) | X = ·] ∈ H for g ∈ G, then

C_XX E_{Y|X}[g(Y) | X = ·] = C_XY g.


Embedding of Conditional Distributions

[Figure: the operator C_YX C_XX^{-1} maps the feature k(x, ·) ∈ H of a point x to the embedding µ_{Y|X=x} ∈ G of the conditional density p(y|x).]

The conditional mean embedding of P(Y | X) can be defined as

U_{Y|X} : H → G,  U_{Y|X} := C_YX C_XX^{-1}.


Conditional Mean Embedding

◮ To fully represent P(Y | X), we need to perform both conditioning and conditional expectation.

◮ To represent P(Y | X = x) for x ∈ X, it follows that

E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX^{-1} k(x, ·) =: µ_{Y|x}.

◮ It follows from the reproducing property of G that

E_{Y|x}[g(Y) | X = x] = ⟨µ_{Y|x}, g⟩_G for all g ∈ G.

◮ In an infinite-dimensional RKHS, C_XX^{-1} does not exist. Hence, we often use the regularized version

U_{Y|X} := C_YX (C_XX + εI)^{-1}.


Conditional Mean Estimation

◮ Given a joint sample (x1, y1), . . . , (xn, yn) from P(X, Y), we have

Ĉ_XX = (1/n) Σ_{i=1}^n φ(xi) ⊗ φ(xi),  Ĉ_YX = (1/n) Σ_{i=1}^n ϕ(yi) ⊗ φ(xi).

◮ Then µ_{Y|x} for a given x ∈ X can be estimated as

µ̂_{Y|x} = Ĉ_YX (Ĉ_XX + εI)^{-1} k(x, ·) = Φ(K + nεI_n)^{-1} k_x = Σ_{i=1}^n βi ϕ(yi),

where ε > 0 is a regularization parameter and Φ = [ϕ(y1), . . . , ϕ(yn)], K_ij = k(xi, xj), k_x = [k(x1, x), . . . , k(xn, x)]^⊤.

◮ Under some technical assumptions, µ̂_{Y|x} → µ_{Y|x} as n → ∞.
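In code the estimate reduces to one regularized linear solve for the weights β (a sketch in NumPy; the kernel and the value of ε are arbitrary choices, and `gaussian_gram` refers to the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def conditional_mean_weights(X, x, kernel, eps=1e-3):
    """Weights beta such that mu_hat_{Y|x} = sum_i beta_i phi(y_i)."""
    n = len(X)
    K = kernel(X, X)                          # K_ij = k(x_i, x_j)
    k_x = kernel(X, np.atleast_2d(x))[:, 0]   # k_x = [k(x_i, x)]_i
    return np.linalg.solve(K + n * eps * np.eye(n), k_x)

# E[g(Y) | X = x] is then approximated by sum_i beta_i g(y_i);
# e.g. beta = conditional_mean_weights(X, x, gaussian_gram), and the
# conditional mean of Y itself is beta @ Y.
```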


Kernel Sum Rule: P(X) = Σ_Y P(X, Y)

◮ By the law of total expectation,

µX = E_X[φ(X)] = E_Y[E_{X|Y}[φ(X) | Y]] = E_Y[U_{X|Y} ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y)] = U_{X|Y} µY.

◮ Let µ̂Y = Σ_{i=1}^m αi ϕ(ỹi) and Û_{X|Y} = Ĉ_XY (Ĉ_YY + λI)^{-1}. Then

µ̂X = Û_{X|Y} µ̂Y = Υ(L + nλI)^{-1} L̃ α,

where Υ = [φ(x1), . . . , φ(xn)], α = (α1, . . . , αm)^⊤, L_ij = ℓ(yi, yj), and L̃_ij = ℓ(yi, ỹj).

◮ That is, we have

µ̂X = Σ_{j=1}^n βj φ(xj) with β = (L + nλI)^{-1} L̃ α.
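The sum rule is again plain matrix algebra (a sketch in NumPy; `Y` holds the training outputs yi, `Y_tilde` the points ỹj representing µ̂Y, λ is an arbitrary regularizer, and `gaussian_gram` is the helper from the kernel mean estimation sketch above):

```python
import numpy as np

def kernel_sum_rule_weights(Y, Y_tilde, alpha, kernel, lam=1e-3):
    """beta = (L + n*lam*I)^{-1} L_tilde @ alpha, so mu_hat_X = sum_j beta_j phi(x_j)."""
    n = len(Y)
    L = kernel(Y, Y)              # L_ij = l(y_i, y_j)
    L_tilde = kernel(Y, Y_tilde)  # L~_ij = l(y_i, y~_j)
    return np.linalg.solve(L + n * lam * np.eye(n), L_tilde @ alpha)
```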


Kernel Product Rule: P(X, Y) = P(Y | X) P(X)

◮ We can factorize µXY = E_{XY}[φ(X) ⊗ ϕ(Y)] in two ways:

E_Y[E_{X|Y}[φ(X) | Y] ⊗ ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y) ⊗ ϕ(Y)],
E_X[E_{Y|X}[ϕ(Y) | X] ⊗ φ(X)] = U_{Y|X} E_X[φ(X) ⊗ φ(X)].

◮ Let µ⊗_X = E_X[φ(X) ⊗ φ(X)] and µ⊗_Y = E_Y[ϕ(Y) ⊗ ϕ(Y)].

◮ Then the product rule becomes

µXY = U_{X|Y} µ⊗_Y = U_{Y|X} µ⊗_X.

◮ Alternatively, we may write the above formulation as

C_XY = U_{X|Y} C_YY and C_YX = U_{Y|X} C_XX.

◮ The kernel sum and product rules can be combined to obtain the kernel Bayes' rule.³

³ Fukumizu et al. Kernel Bayes' Rule. JMLR, 2013.


Future Directions

◮ Representation learning and embedding of distributions
◮ Kernel methods in deep learning:
  ◮ MMD-GAN
  ◮ Wasserstein autoencoder (WAE)
  ◮ Invariant learning in deep neural networks
◮ Kernel mean estimation in the high-dimensional setting
◮ Recovering (conditional) distributions from mean embeddings


Q & A