SLIDE 1

Consistent Kernel Mean Estimation for Functions of Random Variables

Ilya Tolstikhin, jointly with C.-J. Simon-Gabriel, A. Ścibior, and B. Schölkopf

(NIPS 2016), Dagstuhl, December 2016

SLIDE 2

Motivation

Given:

◮ independent random variables X ∈ 𝒳 and Y ∈ 𝒴;
◮ i.i.d. samples {X_i}_{i=1}^N and {Y_j}_{j=1}^N;
◮ any function f : 𝒳 × 𝒴 → 𝒵.

Goal: construct a flexible representation of the distribution of Z = f(X, Y). We represent distributions using their kernel mean embeddings. The simplest estimator is

\hat{\mu}_Z^{(1)} := \frac{1}{N} \sum_{i=1}^{N} k_Z(f(X_i, Y_i), \cdot),

which is \sqrt{N}-consistent. Experiments show that the U-statistic estimator performs better:

\hat{\mu}_Z^{(2)} := \frac{1}{N^2} \sum_{i,j=1}^{N} k_Z(f(X_i, Y_j), \cdot),

which is also \sqrt{N}-consistent.
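The two estimators on this slide can be sketched numerically. The following is a minimal sketch, assuming a Gaussian kernel for k_Z and the illustrative choice f(x, y) = x + y (neither is specified on the slide); evaluating an embedding estimate at a test point z just means averaging kernel values.

```python
import numpy as np

def k_gauss(a, b, gamma=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-gamma * (a - b)^2), elementwise."""
    return np.exp(-gamma * (np.asarray(a) - np.asarray(b)) ** 2)

def mu1_at(z, X, Y, f):
    """Simple estimator mu_Z^(1) at z: (1/N) sum_i k_Z(f(X_i, Y_i), z)."""
    return float(np.mean(k_gauss(f(X, Y), z)))

def mu2_at(z, X, Y, f):
    """U-statistic estimator mu_Z^(2) at z: (1/N^2) sum_{i,j} k_Z(f(X_i, Y_j), z)."""
    Z = f(X[:, None], Y[None, :])      # all N^2 pairwise values f(X_i, Y_j)
    return float(np.mean(k_gauss(Z, z)))

rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=N)                 # illustrative distributions for X and Y
Y = rng.normal(size=N)
f = lambda x, y: x + y                 # illustrative choice of f

print(mu1_at(0.0, X, Y, f), mu2_at(0.0, X, Y, f))
```

Both evaluations approximate the same quantity; the second averages over all N² pairs, which motivates the computational concern on the next slide.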


SLIDE 4

Motivation

Experiments show that the U-statistic estimator performs better:

\hat{\mu}_Z^{(2)} := \frac{1}{N^2} \sum_{i,j=1}^{N} k_Z(f(X_i, Y_j), \cdot),

which is \sqrt{N}-consistent. Unfortunately, the N^2 kernel evaluations may be computationally prohibitive. Schölkopf et al. (2015): take n ≪ N and use reduced set methods to

1. approximate \frac{1}{N} \sum_{i=1}^{N} k(X_i, \cdot) \approx \sum_{i=1}^{n} w_i k(X'_i, \cdot);
2. approximate \frac{1}{N} \sum_{j=1}^{N} k(Y_j, \cdot) \approx \sum_{j=1}^{n} v_j k(Y'_j, \cdot);
3. use the following estimator:

\hat{\mu}_Z := \sum_{i,j=1}^{n} w_i v_j k_Z(f(X'_i, Y'_j), \cdot).

Question: is \hat{\mu}_Z consistent?
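The three steps above can be sketched as follows. As a stand-in for a full reduced set method (Schölkopf et al. use dedicated techniques), this sketch simply picks n random expansion points and obtains the weights by ridge-regularized least squares matching of the empirical embedding; the Gaussian kernel and f(x, y) = x + y are illustrative assumptions.

```python
import numpy as np

def gram(A, B, gamma=1.0):
    """Gaussian-kernel Gram matrix K[i, j] = exp(-gamma * (A_i - B_j)^2)."""
    return np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

def reduced_weights(S, Sp, lam=1e-3):
    """Weights w minimizing ||(1/N) sum_i k(S_i, .) - sum_j w_j k(Sp_j, .)||_H^2
    plus a ridge term -- a least-squares stand-in for a true reduced set method."""
    Kpp = gram(Sp, Sp)                      # n x n Gram of expansion points
    kps = gram(Sp, S).mean(axis=1)          # inner products with the empirical embedding
    return np.linalg.solve(Kpp + lam * np.eye(len(Sp)), kps)

rng = np.random.default_rng(1)
N, n = 2000, 30                             # n << N
X, Y = rng.normal(size=N), rng.normal(size=N)
Xp = rng.choice(X, size=n, replace=False)   # expansion points X'_i (steps 1-2)
Yp = rng.choice(Y, size=n, replace=False)   # expansion points Y'_j
w, v = reduced_weights(X, Xp), reduced_weights(Y, Yp)
f = lambda x, y: x + y                      # illustrative choice of f

def mu_hat_at(z, gamma=1.0):
    """Step 3: evaluate mu_Z at z using only n^2 kernel terms."""
    Zp = f(Xp[:, None], Yp[None, :])        # n x n values f(X'_i, Y'_j)
    return float(w @ np.exp(-gamma * (Zp - z) ** 2) @ v)

print(mu_hat_at(0.0))
```

Here n² = 900 kernel evaluations replace the N² = 4,000,000 needed by the U-statistic estimator.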


SLIDE 6

New results

Answer: yes, \hat{\mu}_Z is indeed consistent. Proof based on [SS16]. Assume:

◮ 𝒳 and 𝒵 are compact;
◮ f : 𝒳 → 𝒵 is continuous;
◮ k_𝒳, k_𝒵 are continuous positive-definite kernels on 𝒳 and 𝒵;
◮ k_𝒳 is c_0-universal;
◮ there exists C such that \sum_i |w_i| \le C independently of n.

Then:

\sum_{i=1}^{N} w_i k_𝒳(X_i, \cdot) \to \mu_X in H_{k_𝒳} \quad\Rightarrow\quad \sum_{i=1}^{N} w_i k_𝒵(f(X_i), \cdot) \to \mu_Z in H_{k_𝒵}.

◮ Importantly, w_1, \dots, w_N and X_1, \dots, X_N can be interdependent.
◮ Finite-sample guarantees for 𝒳 = R^d, 𝒵 = R^{d'} and Matérn kernels.
◮ Applications: probabilistic programming, privacy-preserving ML, ...
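The implication in the theorem can be checked numerically on a toy case. This sketch assumes 𝒳 = 𝒵 = R, Gaussian kernels, the illustrative continuous map f(x) = x², and uniform weights w_i = 1/N (the simplest weighting with \sum_i |w_i| ≤ C); a large reference sample stands in for the true embeddings. As the weighted embedding of X converges, so does the induced embedding of Z = f(X).

```python
import numpy as np

def gram(A, B, gamma=1.0):
    """Gaussian-kernel Gram matrix."""
    return np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

def emb_dist(x, wx, y, wy):
    """RKHS distance between sum_i wx_i k(x_i, .) and sum_j wy_j k(y_j, .)."""
    d2 = wx @ gram(x, x) @ wx - 2 * wx @ gram(x, y) @ wy + wy @ gram(y, y) @ wy
    return float(np.sqrt(max(d2, 0.0)))

rng = np.random.default_rng(2)
f = lambda x: x ** 2                    # illustrative continuous f
ref = rng.normal(size=4000)             # large-sample proxy for the true distribution
wr = np.full(ref.size, 1.0 / ref.size)

results = {}
for N in (50, 200, 1000):
    X = rng.normal(size=N)
    w = np.full(N, 1.0 / N)             # uniform weights: sum_i |w_i| = 1 for every N
    dx = emb_dist(X, w, ref, wr)        # embedding error for X
    dz = emb_dist(f(X), w, f(ref), wr)  # induced embedding error for Z = f(X)
    results[N] = (dx, dz)
    print(N, dx, dz)
```

Both distances shrink together as N grows, matching the slide's implication for this special case.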


SLIDE 8

Related results

◮ Minimax Estimation of Kernel Mean Embeddings (T., Sriperumbudur, Muandet, 2016, arXiv)
  Task: estimate \int_𝒳 k(x, \cdot) \, dP(x) based on the i.i.d. sample {X_i}_{i=1}^N.
  Result: for translation-invariant kernels you cannot do it faster than N^{-1/2}.

◮ Minimax Estimation of MMD with Radial Kernels (T., Sriperumbudur, Schölkopf, 2016, NIPS)
  Task: estimate \|\mu_P - \mu_Q\|_{H_k} based on i.i.d. samples {X_i}_{i=1}^N and {Y_i}_{i=1}^M.
  Result: for radial kernels you cannot do it faster than N^{-1/2} + M^{-1/2}.
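The MMD quantity in the second result can be estimated from samples. A minimal sketch with a Gaussian (radial) kernel, using the standard biased V-statistic form — an illustrative estimator choice; the minimax results above concern the achievable rate, not any particular estimator:

```python
import numpy as np

def mmd_biased(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of MMD ||mu_P - mu_Q||_{H_k}
    for the Gaussian kernel k(a, b) = exp(-gamma * (a - b)^2)."""
    Kxx = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    Kyy = np.exp(-gamma * (Y[:, None] - Y[None, :]) ** 2)
    Kxy = np.exp(-gamma * (X[:, None] - Y[None, :]) ** 2)
    # ||mu_P - mu_Q||^2 = <mu_P, mu_P> - 2 <mu_P, mu_Q> + <mu_Q, mu_Q>
    return float(np.sqrt(Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()))

rng = np.random.default_rng(3)
same = mmd_biased(rng.normal(size=1000), rng.normal(size=1000))
diff = mmd_biased(rng.normal(size=1000), rng.normal(loc=1.0, size=1000))
print(same, diff)   # same-distribution MMD is near 0; the shifted pair is larger
```

The estimation error of such sample-based estimates is exactly what the N^{-1/2} + M^{-1/2} lower bound constrains.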