

SLIDE 1

Estimation of the Kernel Mean Embedding (with uncertainty)

Paul Rubenstein

University of Cambridge Max-Planck Institute for Intelligent Systems, Tübingen

20th January 2016

SLIDE 2

RKHS theory

A function k : 𝒳 × 𝒳 → ℝ is a kernel if, given x_1, …, x_n ∈ 𝒳, the matrix K with K_ij = k(x_i, x_j) is symmetric and positive semi-definite (i.e. is a valid covariance matrix). Associated to k are:

◮ A Hilbert space H of functions 𝒳 → ℝ

◮ A ‘feature map’ φ : 𝒳 → H such that k(x, x′) = ⟨φ(x), φ(x′)⟩_H
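A minimal numerical sketch of the two defining properties (my own illustration, not from the slides; it assumes a Gaussian RBF kernel and a small random sample):

import numpy as np

def rbf(x, y, ell=1.0):
    # Gaussian RBF kernel k(x, y) = exp(-(x - y)^2 / (2 ell^2))
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=6)                          # x_1, ..., x_n
K = rbf(x[:, None], x[None, :])                 # K_ij = k(x_i, x_j)

assert np.allclose(K, K.T)                      # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10     # positive semi-definite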


SLIDE 3

RKHS theory

Suppose we are given:

◮ A random variable X ∼ P taking values in 𝒳

◮ A function f : 𝒳 → ℝ

and that we want to evaluate ∫ f(x) dP(x) = E_X f(X). If f ∈ H, then

E_X f(X) = E_X ⟨f, φ(X)⟩_H = ⟨f, E_X φ(X)⟩_H

So if we know the mean embedding of X, µ_X := E_X φ(X), then we can calculate the expectation of any function in H by taking an inner product.
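To make the identity concrete, here is a small check (my own sketch, with an RBF kernel assumed): take f to be a finite kernel expansion f = Σ_j a_j k(z_j, ·), which lies in H; then ⟨f, µ̂⟩_H = Σ_j a_j µ̂(z_j) reproduces the empirical mean of f exactly.

import numpy as np

def k(x, y, ell=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=500)                 # sample from P
z = np.array([-1.0, 0.5, 2.0])           # expansion points of f
a = np.array([0.3, -1.2, 0.7])           # coefficients: f = sum_j a_j k(z_j, .)

mu_hat = lambda t: k(X, t).mean()        # empirical mean embedding at t

inner = sum(a_j * mu_hat(z_j) for a_j, z_j in zip(a, z))   # <f, mu_hat>_H
direct = np.mean([np.dot(a, k(z, x_i)) for x_i in X])      # (1/n) sum_i f(X_i)
assert np.isclose(inner, direct)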


SLIDE 4

RKHS theory

For certain k, the mapping P ↦ µ_P is injective, i.e. P = Q ⇐⇒ µ_P = µ_Q. We can exploit this to construct statistical tests of properties of distributions.

Two-sample test: given {X_i} ∼ P and {Y_i} ∼ Q, does P = Q? Idea: estimate µ_P and µ_Q and see how different they are.

Independence test: given {(X_i, Y_i)} ∼ P_XY, does P_XY = P_X P_Y? Idea: estimate the embeddings of P_XY and P_X P_Y and see how different they are.
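As an illustration of the two-sample idea (a hedged sketch, with my own choice of RBF kernel and the standard biased estimator): the squared distance ‖µ̂_P − µ̂_Q‖²_H expands into three Gram-matrix means.

import numpy as np

def gram(A, B, ell=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=300)       # {X_i} ~ P
Y = rng.normal(0.5, 1.0, size=300)       # {Y_i} ~ Q

# ||mu_hat_P - mu_hat_Q||_H^2 = mean(K_XX) - 2 mean(K_XY) + mean(K_YY)
mmd2 = gram(X, X).mean() - 2 * gram(X, Y).mean() + gram(Y, Y).mean()
print(mmd2)   # near 0 when P = Q; clearly positive here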


SLIDE 5

Estimating µX

How do we estimate µ_X?

µ_X = E φ(X) = ∫ k(x, ·) dP(x)

If {X_i}_{i=1}^n ∼ P, then

µ̂ := (1/n) Σ_{i=1}^n k(X_i, ·) → µ_X

Writing 1_n = (1/n, …, 1/n)⊺ and Φ = (k(X_1, ·), …, k(X_n, ·))⊺, this is µ̂ = Φ⊺1_n.

[Figure: the empirical mean embedding µ̂(x), plotted as a function of x.]
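An empirical embedding like the one in the figure takes only a few lines to compute (a sketch under my own assumptions about the kernel and the sample):

import numpy as np

def k(x, y, ell=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(3)
X = rng.normal(loc=3.0, size=50)              # {X_i} ~ P
grid = np.linspace(1.0, 5.0, 200)             # points x at which to evaluate

# mu_hat(x) = (1/n) sum_i k(X_i, x), i.e. Phi^T 1_n evaluated at x
mu_hat = k(X[:, None], grid[None, :]).mean(axis=0)

# import matplotlib.pyplot as plt
# plt.plot(grid, mu_hat); plt.xlabel("x"); plt.ylabel("mu(x)")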


SLIDE 6

Estimating µX

In Muandet et al. 2015(?) (Kernel Mean Shrinkage Estimators), the risk of an estimator µ̂ is defined as

∆ = E ‖µ̂ − µ‖²_H

and estimators that minimise ∆ are sought. Two proposals (see the sketch after this list):

◮ For a particular α that can be estimated from the observations, µ̂_α = (1 − α)µ̂ = (1 − α)Φ⊺1_n

◮ For λ estimated (by cross-validation) from the observations, µ̂_λ = Φ⊺(K + λI)⁻¹K1_n (this looks like GP regression)
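Both proposals only reweight the point evaluations k(X_i, ·). A sketch of the two weight vectors (hedged: α and λ are fixed by hand here for illustration, rather than estimated as in the paper):

import numpy as np

def gram(A, B, ell=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=50)
n = len(X)
K = gram(X, X)
ones = np.full(n, 1.0 / n)                    # the vector 1_n

alpha, lam = 0.1, 0.5                         # illustrative values only
w_alpha = (1 - alpha) * ones                  # mu_hat_alpha = sum_i w_i k(X_i, .)
w_lambda = np.linalg.solve(K + lam * np.eye(n), K @ ones)   # mu_hat_lambda weights

# e.g. evaluate mu_hat_lambda at a new point t:
t = 0.3
print(gram(X, np.array([t]))[:, 0] @ w_lambda)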


SLIDE 7

Bayesian estimation of µX

µ̂_α = (1 − α)Φ⊺1_n
µ̂_λ = Φ⊺(K + λI)⁻¹K1_n

Kernel ridge regression ⇐⇒ MAP inference in GP regression. Can we show that these estimators are the MAP solutions to Bayesian inference problems?

◮ µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, γk)

◮ µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, λ·𝟙[x = x′])

Define ‘pseudo-targets’ µ̂(x) = K1_n (as in the sketch below) and then perform Bayesian inference.
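The pseudo-targets are simply the empirical embedding evaluated at the sample points themselves (a minimal sketch, reusing the Gram matrix K from slide 5):

import numpy as np

def gram(A, B, ell=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

X = np.random.default_rng(5).normal(size=50)
K = gram(X, X)
ones = np.full(len(X), 1.0 / len(X))     # 1_n

# mu_hat(X_j) = (1/n) sum_i k(X_i, X_j), i.e. the vector K 1_n
targets = K @ ones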


SLIDE 8

Deriving µ̂_α = (1 − α)Φ⊺1_n

Consider µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, γk). Choose a previously unobserved point z and consider the joint distribution of µ(z) and µ̂(x):

$$ \begin{pmatrix} \mu(z) \\ \hat{\mu}(\mathbf{x}) \end{pmatrix} \sim \mathcal{N}\!\left( 0,\; \begin{pmatrix} k_{zz} & k_z^\top \\ k_z & (1+\gamma)K \end{pmatrix} \right) $$

Conditioning and substituting the pseudo-targets µ̂(x) = K1_n gives

µ(z) | µ̂(x) ∼ N( (1/(1+γ)) k_z⊺1_n , k_zz − (1/(1+γ)) k_z⊺K⁻¹k_z )

So if 1/(1+γ) = (1 − α), i.e. γ = α/(1−α), then the MAP solution is

µ̂_α = (1 − α)Φ⊺1_n
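A quick numerical check of this conditional (my own verification sketch, RBF kernel assumed): with targets K1_n and observation covariance (1+γ)K, the generic Gaussian-conditioning formula collapses to (1/(1+γ)) k_z⊺1_n.

import numpy as np

def gram(A, B, ell=0.5):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(6)
X = rng.normal(size=8)
n, gamma, z = len(X), 0.25, 0.4

K = gram(X, X)
k_z = gram(X, np.array([z]))[:, 0]            # the vector k_z
ones = np.full(n, 1.0 / n)                    # 1_n
targets = K @ ones                            # pseudo-targets K 1_n

# Generic conditional mean: k_z^T ((1+gamma) K)^{-1} (K 1_n)
post_mean = k_z @ np.linalg.solve((1 + gamma) * K, targets)
assert np.isclose(post_mean, k_z @ ones / (1 + gamma))   # = (1/(1+gamma)) k_z^T 1_n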


SLIDE 9

Deriving µ̂_λ = Φ⊺(K + λI)⁻¹K1_n

Consider next µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, λ·𝟙[x = x′]):

$$ \begin{pmatrix} \mu(z) \\ \hat{\mu}(\mathbf{x}) \end{pmatrix} \sim \mathcal{N}\!\left( 0,\; \begin{pmatrix} k_{zz} & k_z^\top \\ k_z & K + \lambda I \end{pmatrix} \right) $$

Conditioning on the pseudo-targets µ̂(x) = K1_n gives

µ(z) | µ̂(x) ∼ N( k_z⊺(K + λI)⁻¹K1_n , k_zz − k_z⊺(K + λI)⁻¹k_z )

Thus the MAP solution is

µ̂_λ = Φ⊺(K + λI)⁻¹K1_n
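The same check for the noisy-observation model (again a hedged verification sketch with an RBF kernel): conditioning with covariance K + λI and targets K1_n gives exactly the shrinkage estimator evaluated at z.

import numpy as np

def gram(A, B, ell=0.5):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(7)
X = rng.normal(size=8)
n, lam, z = len(X), 0.5, 0.4

K = gram(X, X)
k_z = gram(X, np.array([z]))[:, 0]
ones = np.full(n, 1.0 / n)

# Posterior mean  k_z^T (K + lam I)^{-1} K 1_n  and variance
post_mean = k_z @ np.linalg.solve(K + lam * np.eye(n), K @ ones)
post_var = 1.0 - k_z @ np.linalg.solve(K + lam * np.eye(n), k_z)   # k_zz = k(z, z) = 1
print(post_mean, post_var)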


SLIDE 10

Some problems

Although we arrive at the same solutions, most of the approach taken above doesn't really make sense:

◮ The prior over µ is not sensible.

◮ The likelihood of µ̂ is wrong: in fact, for large n, µ̂ ≈ GP(µ, (1/n)[C_XX − µ_X ⊗ µ_X]) (see the sketch below).

◮ The uncertainty does not decay far away from the observations as n grows.
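Pointwise, the claimed large-n behaviour is just the CLT: Var[µ̂(z)] = (1/n) Var[k(X, z)], which is the diagonal of (1/n)[C_XX − µ_X ⊗ µ_X]. A Monte Carlo sketch (my illustration, Gaussian P and RBF kernel assumed):

import numpy as np

def k(x, y, ell=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(8)
n, reps, z = 100, 5000, 0.7

# Variance of mu_hat(z) across fresh size-n samples from P = N(0, 1)
vals = np.array([k(rng.normal(size=n), z).mean() for _ in range(reps)])

# CLT prediction: (1/n) Var[k(X, z)]
pred = np.var(k(rng.normal(size=200_000), z)) / n
print(vals.var(), pred)   # agree to a few percent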


SLIDE 11

Thanks!

Discussion?
