
Recent Advances in Hilbert Space Representation of Probability Distributions

Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany
RegML 2020, Genova, Italy
July 1, 2020

Reference

Kernel Mean Embedding of Distributions: A Review and Beyond. KM, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. FnT ML, 2017.
Outline

◮ Kernel Methods
◮ From Points to Probability Measures
◮ Embedding of Marginal Distributions
◮ Embedding of Conditional Distributions
◮ Recent Development

Classification Problem

[Figure: two-class data, labeled +1 and −1, shown in the input space (x1, x2).]
Feature Map

φ : (x1, x2) → (x1², x2², √2·x1x2)

[Figure: the two-class data (+1 / −1) shown in the input space (x1, x2) and in the feature space (ϕ1, ϕ2, ϕ3) after applying φ.]

⟨φ(x), φ(z)⟩_R³ = x1²z1² + x2²z2² + 2(x1x2)(z1z2) = (x1z1 + x2z2)² = (x · z)²

Question: How to generalize the idea of implicit feature map?
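The kernel trick can be checked numerically. Below is a minimal numpy sketch (my own illustration, not from the slides) verifying that the explicit map φ above reproduces the polynomial kernel (x · z)²:

```python
import numpy as np

# Explicit feature map from the slide: phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2).
def phi(v):
    x1, x2 = v
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)        # inner product in the feature space R^3
rhs = (x @ z) ** 2           # polynomial kernel k(x, z) = (x . z)^2
print(np.isclose(lhs, rhs))  # True: the kernel evaluates the inner product implicitly
```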

https://xkcd.com/655/

Recipe for ML Problems

1. Collect a data set D = {x1, x2, . . . , xn}.
2. Specify or learn a feature map φ : X → H.
3. Apply the feature map: Dφ = {φ(x1), φ(x2), . . . , φ(xn)}.
4. Solve the (easier) problem in the feature space H using Dφ.
Representation Learning

◮ Explicit representation: the perceptron¹ f(x) = w⊤x + b and its stacked (multi-layer) versions, e.g. f(x) = w2⊤ σ(w1⊤x + b1) + b2.
◮ Implicit representation: x → k(x, ·) and f(x) = w⊤φ(x) + b.

¹Rosenblatt 1958; Minsky and Papert 1969

Kernels

A function k : X × X → R is called a kernel on X if there exists a Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X we have

  k(x, x′) = ⟨φ(x), φ(x′)⟩_H.

We call φ a feature map and H a feature space associated with k.

[Diagram: a pair x, x′ is mapped by φ to φ(x), φ(x′); the inner product ⟨·, ·⟩ then gives k(x, x′).]

Example

1. k(x, x′) = (x · x′)² for x, x′ ∈ R²: φ(x) = (x1², x2², √2·x1x2) and H = R³.
2. k(x, x′) = (x · x′ + c)^m for x, x′ ∈ R^d: dim(H) = (d+m choose m).
3. k(x, x′) = exp(−γ‖x − x′‖₂²): H is infinite-dimensional (R^∞).

Positive Definite Kernels

A function k : X × X → R is called positive definite if, for all n ∈ N, all α1, . . . , αn ∈ R, and all x1, . . . , xn ∈ X, we have

  α⊤Kα = Σ_{i=1}^n Σ_{j=1}^n αi αj k(xj, xi) ≥ 0,

where K is the Gram matrix with entries K_ij = k(xi, xj). Equivalently, the Gram matrix K is positive definite.

Any explicit kernel is positive definite

For any kernel k(x, x′) := ⟨φ(x), φ(x′)⟩_H,

  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xj, xi) = ⟨ Σ_{i=1}^n αi φ(xi), Σ_{j=1}^n αj φ(xj) ⟩_H ≥ 0.

Positive definiteness is a necessary (and sufficient) condition.
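As a quick numerical illustration (my own sketch, not part of the slides), positive definiteness can be observed on the Gram matrix of a Gaussian RBF kernel: its eigenvalues are nonnegative up to floating-point error, so α⊤Kα ≥ 0 for every α.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2), evaluated pairwise.
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, X)

# alpha^T K alpha >= 0 for every alpha iff all eigenvalues of K are nonnegative.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True (up to numerical error)
```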

Reproducing Kernel Hilbert Spaces

Let H be a Hilbert space of real-valued functions on X.

1. The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δx : H → R defined by δx(f) := f(x), f ∈ H, is continuous.

2. A function k : X × X → R is called a reproducing kernel of H if k(·, x) ∈ H for all x ∈ X and the reproducing property f(x) = ⟨f, k(·, x)⟩_H holds for all f ∈ H and all x ∈ X.

Aronszajn (1950)²: "There is a one-to-one correspondence between the reproducing kernel k and the RKHS H."

²N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

RKHS as Feature Space

Reproducing kernels are kernels

Let H be a Hilbert space on X with a reproducing kernel k. Then H is an RKHS and is also a feature space of k, where the feature map φ : X → H is given by φ(x) = k(·, x). We call φ the canonical feature map.

Proof

Fix an x′ ∈ X and write f := k(·, x′). Then, for x ∈ X, the reproducing property implies

  ⟨φ(x′), φ(x)⟩ = ⟨k(·, x′), k(·, x)⟩ = ⟨f, k(·, x)⟩ = f(x) = k(x, x′).

RKHS as Feature Space

Universal kernels (Steinwart 2002)

A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and all ε > 0 there exists an f ∈ H such that ‖f − g‖_∞ ≤ ε.

Universal approximation theorem (Cybenko 1989)

Given any ε > 0 and f ∈ C(X), there exists h(x) = Σ_{i=1}^n αi ϕ(wi⊤x + bi) such that |f(x) − h(x)| < ε for all x ∈ X.

Quick Summary

◮ A positive definite kernel k(x, x′) defines an implicit feature map: k(x, x′) = ⟨φ(x), φ(x′)⟩_H.
◮ There exists a unique reproducing kernel Hilbert space (RKHS) H of functions on X for which k is a reproducing kernel:
  f(x) = ⟨f, k(·, x)⟩_H,  k(x, x′) = ⟨k(·, x), k(·, x′)⟩_H.
◮ Implicit representation of data points:
  ◮ Support vector machine (SVM)
  ◮ Gaussian process (GP)
  ◮ Neural tangent kernel (NTK)
◮ Good references on kernel methods:
  ◮ Support Vector Machines (2008), Christmann and Steinwart.
  ◮ Gaussian Processes for ML (2005), Rasmussen and Williams.
  ◮ Learning with Kernels (1998), Schölkopf and Smola.
Outline: From Points to Probability Measures

Probability Measures

[Figure: applications that operate on probability measures rather than points: learning on distributions/point clouds, group anomaly/OOD detection, generalization across environments (training distributions P1, . . . , PN over (X, Y) and unseen test data), and statistical and causal inference.]

Embedding of Dirac Measures

[Figure: points x, y in the input space X are mapped to k(x, ·), k(y, ·) in the feature space H.]

  x → k(·, x)
  δx → ∫ k(·, z) dδx(z) = k(·, x)
Outline: Embedding of Marginal Distributions

Embedding of Marginal Distributions

[Figure: distributions P, Q on the input space are mapped to their mean embeddings µP, µQ in the RKHS H.]

Probability measure

Let P be a probability measure defined on a measurable space (X, Σ) with a σ-algebra Σ.

Kernel mean embedding

Let 𝒫 be the space of all probability measures P. The kernel mean embedding is defined by

  µ : 𝒫 → H,  P ↦ ∫ k(·, x) dP(x).

Remark: the embedding is well-defined (k(·, x) is Bochner integrable) whenever k is bounded.

Embedding of Marginal Distributions

◮ If E_{X∼P}[√k(X, X)] < ∞, then µP ∈ H and, for all f ∈ H,

  ⟨f, µP⟩ = ⟨f, E_{X∼P}[k(·, X)]⟩ = E_{X∼P}[⟨f, k(·, X)⟩] = E_{X∼P}[f(X)].

◮ The kernel k is said to be characteristic if the map P ↦ µP is injective, i.e., ‖µP − µQ‖_H = 0 if and only if P = Q.

Interpretation of Kernel Mean Representation

What properties are captured by µP?

◮ k(x, x′) = ⟨x, x′⟩: the first moment of P
◮ k(x, x′) = (⟨x, x′⟩ + 1)^p: moments of P up to order p ∈ N
◮ k(x, x′) universal/characteristic: all information of P

Moment-generating function

Consider k(x, x′) = exp(⟨x, x′⟩). Then µP = E_{X∼P}[e^{⟨X, ·⟩}].

Characteristic function

If k(x, y) = ψ(x − y) where ψ is a positive definite function, then

  µP(y) = ∫ ψ(x − y) dP(x) = Λk · ϕP

for a positive finite measure Λk, where ϕP is the characteristic function of P.

Characteristic Kernels

◮ All universal kernels are characteristic, but characteristic kernels may not be universal.
◮ Important characterizations:
  ◮ Discrete kernels on a discrete space
  ◮ Shift-invariant kernels on R^d whose Fourier transform has full support
  ◮ Integrally strictly positive definite (ISPD) kernels
  ◮ Characteristic kernels on groups
◮ Examples of characteristic kernels: the Gaussian RBF kernel k(x, x′) = exp(−‖x − x′‖₂² / 2σ²) and the Laplacian kernel k(x, x′) = exp(−‖x − x′‖₁ / σ).
◮ Kernel choice vs parametric assumption:
  ◮ A parametric assumption is susceptible to model misspecification.
  ◮ But the choice of kernel matters in practice.
  ◮ We can optimize the kernel to maximize the performance of the downstream tasks.

Kernel Mean Estimation

◮ Given an i.i.d. sample x1, x2, . . . , xn from P, we can estimate µP by

  µ̂P := (1/n) Σ_{i=1}^n k(xi, ·) ∈ H,  where P̂ = (1/n) Σ_{i=1}^n δ_{xi}.

◮ For each f ∈ H, we have E_{X∼P̂}[f(X)] = ⟨f, µ̂P⟩.
◮ Consistency: with probability at least 1 − δ,

  ‖µ̂P − µP‖_H ≤ 2√(E_{X∼P}[k(X, X)] / n) + √(2 log(1/δ) / n).

◮ The rate Op(n^{−1/2}) was shown to be minimax optimal.³
◮ Similar to James-Stein estimators, we can improve the estimator by shrinkage:⁴

  µ̂α := α f* + (1 − α) µ̂P,  f* ∈ H.

³Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
⁴Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
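A small numpy sketch of the empirical embedding (my own illustration, not from the slides): µ̂P is a uniform mixture of canonical features k(xi, ·), and for any f in the span of kernel functions, ⟨f, µ̂P⟩ is just the sample mean of f.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Pairwise Gaussian RBF kernel matrix between rows of a and rows of b.
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # i.i.d. sample from P; mu_hat_P = (1/n) sum_i k(x_i, .)
Z = rng.normal(size=(5, 2))     # expansion points of a test function f = sum_j alpha_j k(., z_j)
alpha = rng.normal(size=5)

# <f, mu_hat_P>_H via the reproducing property: the sample mean of f(x_i).
lhs = (rbf(X, Z) @ alpha).mean()

# The same inner product written out in kernel evaluations: (1/n) sum_i sum_j alpha_j k(z_j, x_i).
rhs = alpha @ rbf(Z, X).mean(axis=1)
print(np.isclose(lhs, rhs))     # True
```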

Recovering Samples/Distributions

◮ An approximate pre-image problem:

  θ* = arg min_θ ‖µ̂ − µ_{Pθ}‖²_H.

◮ The distribution Pθ is assumed to be in a certain class, e.g., a Gaussian mixture

  Pθ(x) = Σ_{k=1}^K πk N(x; µk, Σk),  Σ_{k=1}^K πk = 1.

◮ Kernel herding generates deterministic pseudo-samples by greedily minimizing the squared error

  E²_T = ‖ µP − (1/T) Σ_{t=1}^T k(·, xt) ‖²_H.

◮ Negative autocorrelation: O(1/T) rate of convergence.
◮ Deep generative models (see the following slides).
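A rough numpy sketch of the greedy herding update (my own illustration; the candidate pool, kernel, and bandwidth are arbitrary choices, not from the slides). Each step picks the candidate that best reduces the gap between µ̂P and the running average of pseudo-sample features:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # sample from P, defining mu_hat_P
cand = rng.normal(size=(2000, 2))      # finite candidate pool for the pseudo-samples

mu_at_cand = rbf(cand, X).mean(axis=1) # mu_hat_P evaluated at every candidate point
running = np.zeros(len(cand))          # sum_t k(c, x_t) over pseudo-samples chosen so far

pseudo_samples = []
for T in range(20):
    # Greedy step: x_{T+1} = argmax_c [ mu_hat_P(c) - (1/(T+1)) * sum_{t<=T} k(c, x_t) ]
    scores = mu_at_cand - running / (T + 1)
    j = int(np.argmax(scores))
    pseudo_samples.append(cand[j])
    running += rbf(cand, cand[j][None, :])[:, 0]

print(np.array(pseudo_samples)[:5])
```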

Quick Summary

◮ A kernel mean embedding of a distribution P:

  µP := ∫ k(·, x) dP(x),  µ̂P := (1/n) Σ_{i=1}^n k(xi, ·).

◮ If k is characteristic, µP captures all information about P.
◮ All universal kernels are characteristic, but not vice versa.
◮ The empirical µ̂P requires no parametric assumption about P.
◮ It can be estimated consistently, i.e., with probability at least 1 − δ,

  ‖µ̂P − µP‖_H ≤ 2√(E_{X∼P}[k(X, X)] / n) + √(2 log(1/δ) / n).

◮ Given the embedding µ̂, it is possible to reconstruct the distribution or generate samples from it.
Application: High-Level Generalization

◮ Learning from Distributions. KM, Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.
◮ Group Anomaly Detection. KM and Schölkopf. UAI 2013.
◮ Domain Generalization. KM et al. ICML 2013; Zhang, KM, et al. ICML 2013.
◮ Cause-Effect Inference. Lopez-Paz, KM, et al. JMLR 2015, ICML 2015.

Support Measure Machine (SMM)

KM, K. Fukumizu, F. Dinuzzo, and B. Schölkopf (NeurIPS 2012)

  x → k(·, x)
  δx → ∫ k(·, z) dδx(z)
  P → ∫ k(·, z) dP(z)

Training data: (P1, y1), (P2, y2), . . . , (Pn, yn) ∼ 𝒫 × 𝒴

Theorem (Distributional representer theorem)

Under technical assumptions on Ω : [0, +∞) → R and a loss function ℓ : (𝒫 × R²)^m → R ∪ {+∞}, any f ∈ H minimizing

  ℓ(P1, y1, E_{P1}[f], . . . , Pm, ym, E_{Pm}[f]) + Ω(‖f‖_H)

admits a representation of the form

  f = Σ_{i=1}^m αi E_{x∼Pi}[k(x, ·)] = Σ_{i=1}^m αi µ_{Pi}.
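Because the representer theorem expresses the solution through the mean embeddings µ_{Pi}, training an SMM in practice reduces to a Gram matrix between sample sets ("bags"). A minimal numpy sketch of that step (toy bags, RBF kernel; helper names are mine, not code from the paper):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

def mean_embedding_gram(bags, gamma=1.0):
    # K[i, j] = <mu_hat_Pi, mu_hat_Pj>_H = average of k(x, x') over all cross pairs.
    n = len(bags)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf(bags[i], bags[j], gamma).mean()
    return K

rng = np.random.default_rng(0)
# Each training example is a bag S_i of points drawn from its own distribution P_i.
bags = [rng.normal(loc=m, size=(30, 2)) for m in (0.0, 0.0, 2.0, 2.0)]
labels = np.array([-1, -1, +1, +1])   # per-bag labels y_i

K = mean_embedding_gram(bags)
print(K.round(2))
# K can be handed to any kernel machine that accepts a precomputed Gram matrix,
# e.g. sklearn.svm.SVC(kernel="precomputed"), to classify the bags.
```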

Supervised Learning on Point Clouds

Training set (S1, y1), . . . , (Sn, yn) with Si = {x_j^(i)} ∼ Pi(X).

◮ Causal Prediction: X → Y or X ← Y? Lopez-Paz, KM, B. Schölkopf, I. Tolstikhin. JMLR 2015, ICML 2015.
◮ Topological Data Analysis: G. Kusano, K. Fukumizu, and Y. Hiraoka. JMLR 2018.
Domain Generalization

Blanchard et al., NeurIPS 2012; KM, D. Balduzzi, B. Schölkopf, ICML 2013

[Figure: training data (Xk, Yk) drawn from distributions P1, . . . , PN over X × Y; at test time only Xk ∼ PX from an unseen distribution is observed.]

  K((Pi, x), (Pj, x̃)) = k1(Pi, Pj) k2(x, x̃) = k1(µ_{Pi}, µ_{Pj}) k2(x, x̃)

Comparing Distributions

◮ Maximum mean discrepancy (MMD) corresponds to the RKHS distance between mean embeddings:

  MMD²(P, Q, H) = ‖µP − µQ‖²_H = ‖µP‖²_H − 2⟨µP, µQ⟩_H + ‖µQ‖²_H.

◮ MMD is an integral probability metric (IPM):

  MMD(P, Q, H) := sup_{h∈H, ‖h‖≤1} | ∫ h(x) dP(x) − ∫ h(x) dQ(x) |.

◮ If k is characteristic, then ‖µP − µQ‖_H = 0 if and only if P = Q.
◮ Given {xi}_{i=1}^n ∼ P and {yj}_{j=1}^m ∼ Q, the unbiased empirical MMD is

  MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} k(xi, xj) + 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(yi, yj) − 2/(nm) Σ_{i=1}^n Σ_{j=1}^m k(xi, yj).
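A numpy sketch of the unbiased MMD² estimator above (my own illustration; the RBF kernel and toy samples are arbitrary choices):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

def mmd2_unbiased(X, Y, gamma=1.0):
    # Unbiased MMD^2: off-diagonal means of k within each sample, minus twice the cross mean.
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
P_sample = rng.normal(0.0, 1.0, size=(300, 1))
same = rng.normal(0.0, 1.0, size=(300, 1))
diff = rng.normal(0.5, 1.0, size=(300, 1))
print(mmd2_unbiased(P_sample, same))   # close to zero: the two samples share a distribution
print(mmd2_unbiased(P_sample, diff))   # clearly larger: the distributions differ
```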

Kernel Two-Sample Testing

Gretton et al., JMLR 2012

Question: Given {xi}_{i=1}^n ∼ P and {yj}_{j=1}^n ∼ Q, check if P = Q.

  H0 : P = Q,  H1 : P ≠ Q

◮ MMD test statistic:

  t² = MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{1≤i≠j≤n} h((xi, yi), (xj, yj)),

  where h((xi, yi), (xj, yj)) = k(xi, xj) + k(yi, yj) − k(xi, yj) − k(xj, yi).

Generative Adversarial Networks

Learn a deep generative model G via a minimax optimization

  min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]

where D is a discriminator and z ∼ N(0, σ²I).

[Figure: a generator Gθ maps random noise z to synthetic samples Gθ(z); a discriminator Dφ decides whether a sample is real (x) or synthetic (Gθ(z)). Alternatively, an MMD test asks whether ‖µ̂X − µ̂_{Gθ(Z)}‖_H is zero.]

Generative Moment Matching Network

◮ The GAN aims to match two distributions P(X) and Gθ.
◮ The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers

  min_θ ‖µX − µ_{Gθ(Z)}‖²_H = min_θ ‖ ∫ φ(X) dP(X) − ∫ φ(X̃) dGθ(X̃) ‖²_H
                            = min_θ ( sup_{h∈H, ‖h‖≤1} ∫ h dP − ∫ h dGθ )².

◮ Many tricks have been proposed to improve the GMMN:
  ◮ Optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  ◮ Gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  ◮ Repulsive loss (Wang et al., 2019)
  ◮ Optimized witness points (Mehrjou et al., 2019)
  ◮ Etc.

Outline: Embedding of Conditional Distributions

Conditional Distribution P(Y | X)

A collection of distributions P_Y := {P(Y | X = x) : x ∈ X}.

◮ For each x ∈ X, we can define an embedding of P(Y | X = x) as

  µ_{Y|x} := ∫_Y ϕ(y) dP(y | X = x) = E_{Y|x}[ϕ(Y)],

where ϕ : Y → G is a feature map of Y.

Embedding of Conditional Distributions

[Figure: k(x, ·) in the RKHS H is mapped by C_YX C_XX^{−1} to µ_{Y|X=x} in the RKHS G, which represents p(y | x).]

The conditional mean embedding of P(Y | X) can be defined as the operator

  U_{Y|X} : H → G,  U_{Y|X} := C_YX C_XX^{−1}.

Conditional Mean Embedding

◮ To fully represent P(Y | X), we need to perform conditioning and conditional expectation.
◮ To represent P(Y | X = x) for x ∈ X, it follows that

  E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX^{−1} k(x, ·) =: µ_{Y|x}.

◮ It follows from the reproducing property of G that

  E_{Y|x}[g(Y) | X = x] = ⟨µ_{Y|x}, g⟩_G,  ∀g ∈ G.

◮ In an infinite-dimensional RKHS, C_XX^{−1} does not exist. Hence, we often use the regularized operator

  U_{Y|X} := C_YX (C_XX + εI)^{−1}.

◮ Conditional mean estimator:

  µ̂_{Y|x} = Σ_{i=1}^n βi(x) ϕ(yi),  β(x) := (K + nεI)^{−1} k_x.
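A short numpy sketch of the estimator above (my own toy example, not from the slides): the weights β(x) = (K + nεI)^{−1} k_x turn conditional expectations into weighted sums over the training outputs, E[g(Y) | X = x] ≈ Σi βi(x) g(yi), for test functions g that live in (or are well approximated in) the RKHS G.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
n, eps = 300, 1e-2                                   # sample size and regularization parameter
X = rng.uniform(-2, 2, size=(n, 1))
Y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)   # toy conditional P(Y | X)

K = rbf(X, X)

def cond_expectation(g, x_query):
    # beta(x) = (K + n*eps*I)^{-1} k_x, then E[g(Y) | X = x] ~= sum_i beta_i(x) g(y_i).
    kx = rbf(X, np.array([[x_query]]))[:, 0]
    beta = np.linalg.solve(K + n * eps * np.eye(n), kx)
    return beta @ g(Y)

print(cond_expectation(lambda y: y, 1.0))        # roughly sin(2) ~ 0.91
print(cond_expectation(lambda y: y ** 2, 1.0))   # roughly sin(2)^2 + noise variance
```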

Counterfactual Mean Embedding

KM, Kanagawa, Saengkyongam, Marukatat. JMLR 2020 (accepted)

In economics, social science, and public policy, we need to evaluate the distributional treatment effect (DTE)

  P_{Y*0}(·) − P_{Y*1}(·)

where Y*0 and Y*1 are potential outcomes of a treatment policy T.

◮ We can only observe either P_{Y*0} or P_{Y*1}.
◮ Counterfactual distribution:

  P_{Y⟨0|1⟩}(y) = ∫ P_{Y0|X0}(y | x) dP_{X1}(x).

◮ The counterfactual distribution P_{Y⟨0|1⟩}(y) can be estimated using the kernel mean embedding.


Quantum Mean Embedding

Quick Summary

◮ Many applications require information in P(Y | X).
◮ The Hilbert space embedding of P(Y | X) is not a single element but an operator U_{Y|X} mapping from H to G:

  µ_{Y|x} = U_{Y|X} k(x, ·) = C_YX C_XX^{−1} k(x, ·),
  ⟨µ_{Y|x}, g⟩_G = E_{Y|x}[g(Y) | X = x].

◮ The conditional mean operator and its empirical estimate:

  U_{Y|X} := C_YX (C_XX + εI)^{−1},  Û_{Y|X} = Ĉ_YX (Ĉ_XX + εI)^{−1}.

◮ Probabilistic inference, such as the sum, product, and Bayes rules, can be performed via the embeddings.

Outline: Recent Development

Machine Learning in Economics

[Figure: application areas such as recommendation, autonomous cars, healthcare, finance, law enforcement, and public policy.]

Instrumental Variable Regression

[Diagram: education E affects income Y; season of birth I serves as an instrument for E; unobserved socioeconomic status S affects both E and Y.]

◮ We aim to estimate a function f from the structural equation model

  Y = f(E) + ε,  E[ε | E] ≠ 0.

◮ We have an instrumental variable I with the property E[ε | I] = 0, i.e.,

  E[Y − f(E) | I] = 0.

◮ Conditional moment restriction (CMR): E[ψ(Z; θ) | X] = 0 with

  Z = (E, Y),  X = I,  θ = f,  ψ(Z; θ) = Y − f(E).

Conditional Moment Restriction (CMR)

Newey (1993), Ai and Chen (2003)

There exists a true parameter θ0 ∈ Θ that satisfies

  E[ψ(Z; θ0) | X] = 0,  PX-almost surely,

where ψ : Z × Θ → R^q is a generalized residual function.

◮ The function ψ is known and problem-dependent, e.g., ψ(Z; θ) = Y − f(E) with Z = (Y, E), X = I, θ = f.
◮ The CMR implies the unconditional moment restriction (UMR): E[ψ(Z; θ0)⊤ f(X)] = 0 for any measurable vector-valued function f : X → R^q. The function f(X) is often called an instrument.
◮ Given the instruments f1, . . . , fm, one can use the generalized method of moments (GMM) to learn the parameter θ.

Maximum Moment Restriction (MMR)

KM, W. Jitkrittum, J. Kübler, UAI 2020

Let F be a space of instruments f(x). Then

  E[ψ(Z; θ0) | X] = 0  (CMR)   ⇔   sup_{f∈F} E[ψ(Z; θ0)⊤ f(X)] = 0  (MMR(F, θ0))

◮ The equivalence above holds if F is a universal vector-valued RKHS.
◮ Let µθ := E[K_X ψ(Z; θ)]. Then

  MMR(F, θ) := sup_{f∈F, ‖f‖≤1} E[ψ(Z; θ)⊤ f(X)] = ‖E[K_X ψ(Z; θ)]‖_F = ‖µθ‖_F.

◮ MMR²(F, θ) = E[ψ(Z; θ)⊤ K(X, X′) ψ(Z′; θ)], where (X′, Z′) is an independent copy of (X, Z).

Maximum Moment Restriction (MMR)

KM, W. Jitkrittum, J. Kübler, UAI 2020

Parameter Estimation

Given observations (xi, zi)_{i=1}^n from P(X, Z), we aim to estimate θ0 by

  θ̂ = arg min_{θ∈Θ} MMR̂²(F, θ) = arg min_{θ∈Θ} 1/(n(n−1)) Σ_{1≤i≠j≤n} ψ(zi; θ)⊤ K(xi, xj) ψ(zj; θ).

Hypothesis Testing

Given observations (xi, zi)_{i=1}^n from P(X, Z) and the parameter estimate θ̂, we aim to test

  H0 : MMR²(F, θ̂) = 0,  H1 : MMR²(F, θ̂) ≠ 0.
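A numpy sketch of the empirical MMR objective for a toy linear IV problem (my own illustration; the data-generating process, kernel, and grid search are arbitrary choices, not from the paper). With residual ψ(z; θ) = y − θe, the U-statistic above should be minimized near the true structural coefficient.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

def mmr2_hat(theta, E, Y, I_inst, gamma=1.0):
    # (1/(n(n-1))) * sum_{i != j} psi_i K(x_i, x_j) psi_j with psi(z; theta) = y - theta * e.
    n = len(Y)
    psi = Y - theta * E
    K = rbf(I_inst, I_inst, gamma)
    total = psi @ K @ psi - np.sum(np.diag(K) * psi ** 2)   # drop the i == j terms
    return total / (n * (n - 1))

rng = np.random.default_rng(0)
n = 500
I_inst = rng.normal(size=(n, 1))                 # instrument, independent of the confounder
U = rng.normal(size=n)                           # unobserved confounder
E = I_inst[:, 0] + U + 0.1 * rng.normal(size=n)  # endogenous regressor
Y = 2.0 * E + U + 0.1 * rng.normal(size=n)       # true structural coefficient is 2.0

grid = np.linspace(0.0, 4.0, 81)
objective = [mmr2_hat(t, E, Y, I_inst) for t in grid]
print(grid[int(np.argmin(objective))])           # expected to be close to 2.0
```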

Conditional Moment Embedding

[Figure: the conditional moments E[ψ(Z; θ) | X] for different parameters θ (here θ0, θ1, θ2) are uniquely (PX-almost surely) embedded as µθ0, µθ1, µθ2 in the RKHS F.]

Kernel Conditional Moment Test via Maximum Moment Restriction (UAI 2020)
Paper: https://arxiv.org/abs/2002.09225
Code: https://github.com/krikamol/kcm-test

Future Direction

Contact: krikamol@tuebingen.mpg.de
Website: http://krikamol.org

Covariance Operators

◮ Let H, G be RKHSs on X, Y with feature maps φ(x) = k(x, ·) and ϕ(y) = ℓ(y, ·).
◮ Let C_XX and C_YX be the covariance operator on X and the cross-covariance operator from X to Y, i.e.,

  C_XX = ∫ φ(X) ⊗ φ(X) dP(X),  C_YX = ∫ ϕ(Y) ⊗ φ(X) dP(Y, X).

◮ Alternatively, C_YX is the unique bounded operator satisfying ⟨g, C_YX f⟩_G = Cov[g(Y), f(X)].
◮ If E[g(Y) | X = ·] ∈ H for g ∈ G, then C_XX E[g(Y) | X = ·] = C_XY g.

Conditional Mean Estimation

◮ Given a joint sample (x1, y1), . . . , (xn, yn) from P(X, Y), we have

  Ĉ_XX = (1/n) Σ_{i=1}^n φ(xi) ⊗ φ(xi),  Ĉ_YX = (1/n) Σ_{i=1}^n ϕ(yi) ⊗ φ(xi).

◮ Then µ_{Y|x} for some x ∈ X can be estimated as

  µ̂_{Y|x} = Ĉ_YX (Ĉ_XX + εI)^{−1} k(x, ·) = Φ (K + nεI_n)^{−1} k_x = Σ_{i=1}^n βi ϕ(yi),

  where ε > 0 is a regularization parameter and Φ = [ϕ(y1), . . . , ϕ(yn)], K_ij = k(xi, xj), k_x = [k(x1, x), . . . , k(xn, x)].

◮ Under some technical assumptions, µ̂_{Y|x} → µ_{Y|x} as n → ∞.

Kernel Sum Rule: P(X) = Σ_Y P(X, Y)

◮ By the law of total expectation,

  µX = E_X[φ(X)] = E_Y[E_{X|Y}[φ(X) | Y]] = E_Y[U_{X|Y} ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y)] = U_{X|Y} µY.

◮ Let µ̂Y = Σ_{i=1}^m αi ϕ(ỹi) and U_{X|Y} = C_XY C_YY^{−1}. Then

  µ̂X = U_{X|Y} µ̂Y = C_XY C_YY^{−1} µ̂Y = Υ (L + nλI)^{−1} L̃ α,

  where Υ = [φ(x1), . . . , φ(xn)], α = (α1, . . . , αm)⊤, L_ij = l(yi, yj), and L̃_ij = l(yi, ỹj).

◮ That is, we have µ̂X = Σ_{j=1}^n βj φ(xj) with β = (L + nλI)^{−1} L̃ α.
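A numpy sketch of the empirical kernel sum rule (my own toy example; kernels, bandwidths, and the prior weights are arbitrary choices): given a prior embedding µ̂Y expressed as weights α on points ỹ, the rule returns weights β on the training inputs x that represent µ̂X.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
n, m, lam = 200, 50, 1e-2
Y = rng.normal(size=(n, 1))                    # joint sample (x_i, y_i) from P(X, Y)
X = Y + 0.3 * rng.normal(size=(n, 1))
Y_tilde = rng.normal(loc=0.5, size=(m, 1))     # points carrying the prior embedding mu_hat_Y
alpha = np.full(m, 1.0 / m)                    # weights of mu_hat_Y = sum_i alpha_i l(., y_tilde_i)

L = rbf(Y, Y)                                  # L_ij = l(y_i, y_j)
L_tilde = rbf(Y, Y_tilde)                      # L~_ij = l(y_i, y_tilde_j)
beta = np.linalg.solve(L + n * lam * np.eye(n), L_tilde @ alpha)

# mu_hat_X = sum_j beta_j k(., x_j); for example, evaluate the embedding at x0 = 0.
x0 = np.array([[0.0]])
print(beta @ rbf(X, x0)[:, 0])
```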

Kernel Product Rule: P(X, Y) = P(Y | X) P(X)

◮ We can factorize µXY = E_{XY}[φ(X) ⊗ ϕ(Y)] as

  E_Y[E_{X|Y}[φ(X) | Y] ⊗ ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y) ⊗ ϕ(Y)],
  E_X[E_{Y|X}[ϕ(Y) | X] ⊗ φ(X)] = U_{Y|X} E_X[φ(X) ⊗ φ(X)].

◮ Let µ⊗_X = E_X[φ(X) ⊗ φ(X)] and µ⊗_Y = E_Y[ϕ(Y) ⊗ ϕ(Y)].
◮ Then the product rule becomes µXY = U_{X|Y} µ⊗_Y = U_{Y|X} µ⊗_X.
◮ Alternatively, we may write the above formulation as C_XY = U_{X|Y} C_YY and C_YX = U_{Y|X} C_XX.
◮ The kernel sum and product rules can be combined to obtain the kernel Bayes' rule.⁵

⁵Fukumizu et al. Kernel Bayes' Rule. JMLR, 2013.

Calibration of Computer Simulation

Kennedy and O'Hagan (2002); Kisamori et al. (AISTATS 2020)

[Figure taken from Kisamori et al. (2020).]

The computer simulator: r(x, θ), θ ∈ Θ.
The posterior embedding: µ_{Θ|r*} := ∫ kΘ(·, θ) dPπ(θ | r*).