Estimation with Infinite Dimensional Kernel Exponential Families

SLIDE 1

1

Kenji Fukumizu

The Institute of Statistical Mathematics

Joint work with Bharath Sriperumbudur (Penn State U), Arthur Gretton (UCL), Aapo Hyvarinen (U Helsinki), Revant Kumar (Georgia Tech)

IGAIA IV. June 12-17, 2016. Liblice, Czech Republic

Estimation with Infinite Dimensional Kernel Exponential Families

SLIDE 2

Introduction

2

SLIDE 3

Infinite dimensional exponential family

■ (Finite dim.) exponential family

  $p_\theta(x) = \exp\Bigl( \sum_{j=1}^{m} \theta_j T_j(x) - A(\theta) \Bigr)\, q_0(x)$

■ Infinite dimensional extension?

  $p_f(x) = \exp\bigl( f(x) - A(f) \bigr)\, q_0(x)$, where $A(f) := \log \int e^{f(x)} q_0(x)\, dx$.

  $f$ is a natural parameter in an infinite dimensional function class.

  – Maximal exponential model (Pistone & Sempi, AoS 1995):
    • Orlicz space (Banach sp.) is used.
    • Estimation is not at all obvious: the "empirical" mean parameter cannot be defined.

SLIDE 4

■ Kernel exponential manifold (Fukumizu 2009; Canu & Smola 2005)

  A reproducing kernel Hilbert space is used.

  • $p_f(x) = \exp\bigl( \langle f, k(\cdot,x) \rangle - A(f) \bigr)\, q_0(x)$

  • Empirical estimation is possible
    – Mean parameter: $m_f = E_{p_f}[\, k(\cdot, X)\,]$
    – Maximum likelihood estimator: $\hat m_f = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, X_i)$

  • Manifold structure can be defined (Fukumizu 2009)

4

($f$: parameter; $k(\cdot,x)$: infinite dimensional sufficient statistic)

SLIDE 5

Problems in estimation

■ Normalization constant / partition function

  – Even in finite dim. cases, $A(\theta) := \log \int \exp\bigl( \sum_{j=1}^{m} \theta_j T_j(x) \bigr)\, q_0(x)\, dx$ is not easy to compute.
  – MLE: "mean parameter → natural parameter" requires solving $\dfrac{\partial A(\theta)}{\partial \theta} = \dfrac{1}{n} \sum_{i=1}^{n} T(X_i)$.
  – Even more difficult for an infinite dimensional exponential family.

■ This talk → score matching (Hyvärinen, JMLR 2005)

  – An estimation method without normalization constants.
  – Introducing a new method for (unnormalized) density estimation.

5

SLIDE 6

Score Matching

6

SLIDE 7

Score matching for exponential family (Hyvärinen, JMLR 2005)

■ Fisher divergence

  $p, q$: two p.d.f.'s on $\Omega = \prod_{a=1}^{d} (s_a, t_a) \subset (\mathbb{R} \cup \{\pm\infty\})^d$.

  $J(p \,\|\, q) := \dfrac{1}{2} \int \sum_{a=1}^{d} \left( \dfrac{\partial \log p(x)}{\partial x_a} - \dfrac{\partial \log q(x)}{\partial x_a} \right)^2 p(x)\, dx$

  – $J(p \,\|\, q) \ge 0$. Equality holds iff $p = q$ (under mild conditions).
  – Derivative w.r.t. $x$, not the parameter.
    • For a location parameter, $p(x) = f(x - \theta)$, so $\dfrac{\partial \log p(x)}{\partial x_a} = -\dfrac{\partial \log f_\theta(x)}{\partial \theta_a}$, and $J(p \,\|\, q)$ is the squared $L^2$-distance of Fisher scores.

7
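As a quick numerical sketch (not from the slides), the Fisher divergence between two 1-D Gaussians can be approximated by quadrature; for $p = N(0,1)$ and $q = N(1,1)$ the score difference is the constant $-1$, so $J(p\|q) = 1/2$. The helper names here (`fisher_divergence_1d`, `gauss_score`) are illustrative, not from the talk.

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal quadrature on a (possibly nonuniform) grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def fisher_divergence_1d(score_p, score_q, p_pdf, xs):
    """J(p||q) = 1/2 * integral of (d/dx log p - d/dx log q)^2 p(x) dx."""
    return 0.5 * trapz((score_p(xs) - score_q(xs)) ** 2 * p_pdf(xs), xs)

# For N(mu, s^2): d/dx log p(x) = -(x - mu) / s^2
def gauss_pdf(mu, s):
    return lambda x: np.exp(-(x - mu) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def gauss_score(mu, s):
    return lambda x: -(x - mu) / s**2

xs = np.linspace(-10.0, 10.0, 4001)
J_same = fisher_divergence_1d(gauss_score(0, 1), gauss_score(0, 1), gauss_pdf(0, 1), xs)
J_diff = fisher_divergence_1d(gauss_score(0, 1), gauss_score(1, 1), gauss_pdf(0, 1), xs)
```

`J_same` vanishes ($p = q$) and `J_diff` is close to the closed-form value $1/2$.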

SLIDE 8

Set $p = p_0$ (true), and $q = p_\theta$ to be estimated. $J(\theta) := J(p_0 \,\|\, p_\theta)$

  $= \dfrac{1}{2} \int \sum_{a=1}^{d} \left( \dfrac{\partial \log p_\theta(x)}{\partial x_a} - \dfrac{\partial \log p_0(x)}{\partial x_a} \right)^2 p_0(x)\, dx$

  $= \dfrac{1}{2} \int \sum_{a=1}^{d} \left( \dfrac{\partial \log p_\theta(x)}{\partial x_a} \right)^2 p_0(x)\, dx + \int \sum_{a=1}^{d} \dfrac{\partial^2 \log p_\theta(x)}{\partial x_a^2}\, p_0(x)\, dx + \text{const.} \ \equiv\ \tilde J(\theta)$

  • Assume $\lim_{x_a \to s_a \text{ or } t_a} p_0(x)\, \dfrac{\partial \log p_\theta(x)}{\partial x_a} = 0$, and use integration by parts:

  $\int \dfrac{\partial \log p_\theta(x)}{\partial x_a} \dfrac{\partial \log p_0(x)}{\partial x_a}\, p_0(x)\, dx = \int \dfrac{\partial \log p_\theta(x)}{\partial x_a} \dfrac{\partial p_0(x)}{\partial x_a}\, dx = \left[ p_0(x)\, \dfrac{\partial \log p_\theta(x)}{\partial x_a} \right]_{s_a}^{t_a} - \int \dfrac{\partial^2 \log p_\theta(x)}{\partial x_a^2}\, p_0(x)\, dx$

8

SLIDE 9

■ Empirical estimation

  $\tilde J(\theta) = \dfrac{1}{2} \int \sum_{a=1}^{d} \left( \dfrac{\partial \log p_\theta(x)}{\partial x_a} \right)^2 p_0(x)\, dx + \int \sum_{a=1}^{d} \dfrac{\partial^2 \log p_\theta(x)}{\partial x_a^2}\, p_0(x)\, dx$

  $X_1, \dots, X_n$: i.i.d. sample $\sim p_0$.

  $\tilde J_n(\theta) = \dfrac{1}{n} \sum_{i=1}^{n} \sum_{a=1}^{d} \left[ \dfrac{1}{2} \left( \dfrac{\partial \log p_\theta(X_i)}{\partial x_a} \right)^2 + \dfrac{\partial^2 \log p_\theta(X_i)}{\partial x_a^2} \right]$

  $\hat\theta = \arg\min_\theta \tilde J_n(\theta)$: score matching estimator

9

SLIDE 10

Score matching for exponential family

  – For the exponential family $p_\theta(x) = \exp\bigl( \sum_j \theta_j T_j(x) - A(\theta) \bigr)\, q_0(x)$,

  $\tilde J_n(\theta) = \sum_{i=1}^{n} \sum_{a=1}^{d} \left[ \dfrac{1}{2} \left( \sum_{j=1}^{m} \theta_j \dfrac{\partial T_j(X_i)}{\partial x_a} + \dfrac{\partial \log q_0(X_i)}{\partial x_a} \right)^2 + \sum_{j=1}^{m} \theta_j \dfrac{\partial^2 T_j(X_i)}{\partial x_a^2} + \dfrac{\partial^2 \log q_0(X_i)}{\partial x_a^2} \right]$

  • No need of $A(\theta)$! (derivative w.r.t. $x$)
  • Quadratic form w.r.t. $\theta$ → solvable!
  • In the Gaussian case, $\hat\theta$ is the same as the MLE.

10
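A minimal sketch of the last bullet (not from the slides): for the 1-D Gaussian in its natural parametrization $p_\theta(x) \propto \exp(\theta_1 x + \theta_2 x^2)$ with flat base $q_0 \equiv 1$, the score matching objective is quadratic in $\theta$, and solving the resulting $2 \times 2$ linear system recovers exactly the MLE (sample mean and ML variance).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=2000)

# T1 = x, T2 = x^2, so dT/dx = (1, 2x) and d^2T/dx^2 = (0, 2); the objective
#   (1/n) sum_i [ 1/2 (theta1 + 2 theta2 X_i)^2 + 2 theta2 ]
# equals (1/2) theta^T G theta + c^T theta, minimized by G theta = -c.
G = np.array([[1.0, 2 * X.mean()],
              [2 * X.mean(), 4 * (X**2).mean()]])
c = np.array([0.0, 2.0])
theta = np.linalg.solve(G, -c)

mu_hat = -theta[0] / (2 * theta[1])   # recovered mean
var_hat = -1.0 / (2 * theta[1])       # recovered variance
```

No partition function appears anywhere; yet `mu_hat` equals the sample mean and `var_hat` the ML variance estimate.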

SLIDE 11

Kernel Exponential Family

11

SLIDE 12

Reproducing kernel Hilbert space

  – Def. $\Omega$: a set. $H$: a Hilbert space consisting of functions on $\Omega$. $H$ is a reproducing kernel Hilbert space (RKHS) if for any $x \in \Omega$ there is $k_x \in H$ s.t.

    $\langle f, k_x \rangle = f(x)$ for all $f \in H$  [reproducing property]

  – $k(x, y) := k_x(y)$. $k$ is a positive definite kernel, i.e., $k(x,y) = k(y,x)$ and the Gram matrix $\bigl( k(x_i, x_j) \bigr)_{ij}$ is positive semidefinite for any $x_1, \dots, x_n$.
  – Moore–Aronszajn theorem: for any positive definite kernel on $\Omega$, there uniquely exists an RKHS whose reproducing kernel is $k(\cdot, x)$. (One-to-one correspondence between p.d. kernels and RKHSs.)
  – Example of a pos. def. kernel on $\mathbb{R}^d$: $k(x,y) = \exp\left( -\dfrac{\|x - y\|^2}{2\sigma^2} \right)$.

12
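The two defining properties are easy to probe numerically (an illustrative sketch, not from the slides): the Gaussian Gram matrix is symmetric positive semidefinite, and a function $f = \sum_i c_i k(\cdot, x_i)$ is evaluated through the kernel itself.

```python
import numpy as np

def gauss_kernel(X, Y, sigma=1.0):
    """Gaussian kernel k(x,y) = exp(-||x-y||^2 / (2 sigma^2)) for row-stacked points."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
K = gauss_kernel(X, X)

# Positive semidefiniteness: symmetric Gram matrix, nonnegative eigenvalues.
eigvals = np.linalg.eigvalsh(K)

# For f = sum_i c_i k(., x_i), evaluation at y is a kernel sum: f(y) = sum_i c_i k(x_i, y).
c = rng.normal(size=50)
y = rng.normal(size=(1, 3))
f_y = c @ gauss_kernel(X, y)[:, 0]
```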

SLIDE 13

Kernel exponential family

  • Def. $k$: pos. def. kernel on $\Omega = \prod_{a=1}^{d} (s_a, t_a) \subset (\mathbb{R} \cup \{\pm\infty\})^d$. $H_k$: its RKHS. $q_0$: p.d.f. on $\Omega$ with $\mathrm{supp}\, q_0 = \Omega$.

    $F_k := \bigl\{ f \in H_k \ \big|\ \int e^{f(x)} q_0(x)\, dx < \infty \bigr\}$: (functional) parameter space

    $P_k := \bigl\{ p_f : \Omega \to [0, \infty) \ \big|\ p_f(x) = e^{f(x) - A(f)} q_0(x),\ f \in F_k \bigr\}$

    where $A(f) := \log \int e^{f(x)} q_0(x)\, dx$.

    $P_k$: kernel exponential family (KEF)

  – With a finite dimensional $H_k$, the KEF reduces to a finite dim. exponential family, e.g. $k(x,y) = (1 + x^T y)^2$ → Gaussian distributions.

13

SLIDE 14

Score matching for KEF

Assume $k$ is of class $C^2$ ($\partial^{\alpha+\beta} k(x,y) / \partial x^\alpha \partial y^\beta$ exists and is continuous for $\alpha + \beta \le 2$) and $\lim_{x_a \to s_a \text{ or } t_a} \left. \dfrac{\partial^2 k(x,y)}{\partial x_a \partial y_a} \right|_{y=x} p_0(x) = 0$ (for integration by parts).

  – Score matching objective function:

  $\tilde J_n(f) := \sum_{i=1}^{n} \sum_{a=1}^{d} \left[ \dfrac{1}{2} \left( \dfrac{\partial f(X_i)}{\partial x_a} + \dfrac{\partial \log q_0(X_i)}{\partial x_a} \right)^2 + \dfrac{\partial^2 f(X_i)}{\partial x_a^2} + \dfrac{\partial^2 \log q_0(X_i)}{\partial x_a^2} \right]$

  Note $f(X_i) = \langle f, k(\cdot, X_i) \rangle$, $\dfrac{\partial f(X_i)}{\partial x_a} = \left\langle f, \dfrac{\partial k(\cdot, X_i)}{\partial x_a} \right\rangle$, $\dfrac{\partial^2 f(X_i)}{\partial x_a^2} = \left\langle f, \dfrac{\partial^2 k(\cdot, X_i)}{\partial x_a^2} \right\rangle$.

  $\tilde J_n(f)$ is a quadratic form w.r.t. $f \in H_k$.

14

SLIDE 15

  – Estimation: solve $\hat C_n f = \hat\xi_n$, where

    $\hat C_n := \dfrac{1}{n} \sum_{i=1}^{n} \sum_{a=1}^{d} \dfrac{\partial k(\cdot, X_i)}{\partial x_a} \otimes \dfrac{\partial k(\cdot, X_i)}{\partial x_a} : H_k \to H_k$

    $\hat\xi_n := \dfrac{1}{n} \sum_{i=1}^{n} \sum_{a=1}^{d} \left[ \dfrac{\partial k(\cdot, X_i)}{\partial x_a}\, \dfrac{\partial \log q_0(X_i)}{\partial x_a} + \dfrac{\partial^2 k(\cdot, X_i)}{\partial x_a^2} \right] \in H_k$

  – Regularized estimator: $\hat f_n = \bigl( \hat C_n + \lambda_n I \bigr)^{-1} \hat\xi_n$, i.e., $\hat f_n = \mathrm{argmin}_f\ \tilde J_n(f) + \lambda_n \|f\|_{H_k}^2$

15

SLIDE 16

  – Estimator: $\hat f_n = \beta\, \hat\xi_n + \sum_{j=1}^{n} \sum_{a=1}^{d} \gamma_{ja}\, \dfrac{\partial k(\cdot, X_j)}{\partial x_a}$, where, writing $\dfrac{\partial \ell(X_i)}{\partial x_a} := \dfrac{\partial \log q_0(X_i)}{\partial x_a}$,

    $h_k^b := \dfrac{1}{n} \sum_{i,a} \left[ \dfrac{\partial^3 k(X_i, X_k)}{\partial x_a^2\, \partial y_b} + \dfrac{\partial^2 k(X_i, X_k)}{\partial x_a\, \partial y_b}\, \dfrac{\partial \ell(X_i)}{\partial x_a} \right], \qquad H_{jk}^{ab} := \dfrac{\partial^2 k(X_j, X_k)}{\partial x_a\, \partial y_b},$

    $\|\hat\xi_n\|^2 = \dfrac{1}{n^2} \sum_{j,k} \sum_{a,b} \left[ \dfrac{\partial^4 k(X_j, X_k)}{\partial x_a^2\, \partial y_b^2} + 2\, \dfrac{\partial^3 k(X_j, X_k)}{\partial x_a^2\, \partial y_b}\, \dfrac{\partial \ell(X_k)}{\partial y_b} + \dfrac{\partial^2 k(X_j, X_k)}{\partial x_a\, \partial y_b}\, \dfrac{\partial \ell(X_j)}{\partial x_a}\, \dfrac{\partial \ell(X_k)}{\partial y_b} \right],$

    and the coefficients solve

    $\begin{pmatrix} \beta \\ \gamma_{jb} \end{pmatrix} = - \begin{pmatrix} \frac{1}{n} \sum_{a,i} (h_i^a)^2 + \lambda \|\hat\xi_n\|^2 & \frac{1}{n} \sum_{a,i} h_i^a H_{ik}^{ab} + \lambda h_k^b \\ \frac{1}{n} \sum_{a,i} h_i^a H_{ij}^{ab} + \lambda h_j^b & \frac{1}{n} \sum_{c,m} H_{jm}^{ac} H_{mk}^{cb} + \lambda H_{jk}^{ab} \end{pmatrix}^{-1} \begin{pmatrix} \|\hat\xi_n\|^2 \\ h_k^b \end{pmatrix}$

  • $\hat f_n$ can be taken in $\mathrm{Span}\bigl\{ \partial k(\cdot, X_j)/\partial x_a,\ \hat\xi_n \bigr\}$.
  • The estimator is obtained simply by solving a $(1 + nd)$-dimensional linear equation.

16

Explicit solution (from the representer theorem)
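A simplified 1-D sketch of kernel score matching (not the slide's exact $(1+nd)$-dimensional system: the span here is restricted to $\{k(\cdot, X_j)\}$ instead of the kernel derivatives plus $\hat\xi_n$, and the base measure $q_0$ is taken flat so $\partial \log q_0 = 0$; the bandwidth and $\lambda$ values are hand-picked assumptions). The regularized empirical score objective is still quadratic in the coefficients, so fitting reduces to one linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)        # sample from p0 = N(0,1)
sigma, lam = 0.8, 1e-2          # kernel bandwidth / regularization (chosen by hand)

def k(u, v):    # Gaussian kernel, and its derivatives in the second argument
    return np.exp(-(u - v) ** 2 / (2 * sigma**2))

def dk(u, v):   # dk(u,v)/dv
    return (u - v) / sigma**2 * k(u, v)

def d2k(u, v):  # d^2 k(u,v)/dv^2
    return ((u - v) ** 2 / sigma**4 - 1.0 / sigma**2) * k(u, v)

# f(x) = sum_j alpha_j k(X_j, x); objective (1/n) sum_i [ 1/2 f'(X_i)^2 + f''(X_i) ]
# + lam * alpha^T K alpha is quadratic in alpha -> solve the normal equations.
U, V = np.meshgrid(X, X, indexing="ij")   # U: kernel centers X_j, V: evaluation X_i
K = k(U, V)
D = dk(U, V).T     # D[i, j] = derivative of basis j at data point X_i
L = d2k(U, V).T    # second derivatives at data points
n = len(X)
A = D.T @ D / n + 2 * lam * K
b = L.sum(axis=0) / n
alpha = np.linalg.lstsq(A, -b, rcond=None)[0]

xs = np.linspace(-5.0, 5.0, 1001)
f_xs = alpha @ k(X[:, None], xs[None, :])            # fitted f on a grid
p_unnorm = np.exp(f_xs)                              # unnormalized density e^f
dx = xs[1] - xs[0]
p_hat = p_unnorm / (np.sum((p_unnorm[1:] + p_unnorm[:-1]) * dx) / 2)
```

The fit only ever produces the unnormalized $e^{f}$; the normalization at the end is done numerically on the grid, which is exactly the point made on the next slide.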

SLIDE 17

Unnormalized p.d.f.

  – Score matching for the KEF gives only $f(x)$, or $e^{f(x)}$: an unnormalized p.d.f.
    • Estimation of $A(f) := \log \int e^{f(x)} q_0(x)\, dx$ is yet nontrivial.
  – There are interesting applications.

  1) Nonparametric structure learning for graphical models from data (Sun, Kolar, Xu, NIPS 2015):

     $p(X) \propto \prod_{(j,k) \in E} p_{jk}(X_j, X_k), \qquad G = (V, E)$

     $p_{jk}$ is estimated nonparametrically with the KEF (with sparse edges).

17

[Figure: example graph with nodes a, b, c, d, e]

SLIDE 18

2) Hamiltonian Monte Carlo with intractable gradient (Strathmann et al., NIPS 2015)

   Estimate $\dfrac{\partial \log \pi(x)}{\partial x}$ with the KEF, assuming it does not admit a closed form expression (intractable cases).

  • Hamiltonian Monte Carlo (Neal 2012)

    Goal: sample from $\pi$. $U(x) = -\log \pi(x)$; $K(p)$: energy of an auxiliary momentum $p$, e.g. $\|p\|^2 / \sigma^2$.

    Hamiltonian: $H(p, x) := U(x) + K(p)$. Hamiltonian flow:

    $\dfrac{dx}{dt} = \dfrac{\partial H}{\partial p} = \dfrac{\partial K}{\partial p}, \qquad \dfrac{dp}{dt} = -\dfrac{\partial H}{\partial x} = \dfrac{\partial \log \pi(x)}{\partial x}$

    This flow is used in the proposal of the MCMC.

18
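The flow is simulated in practice with a leapfrog integrator; a minimal 1-D sketch (not from the slides) follows, using $K(p) = p^2/2$. Here the gradient of $\log \pi$ is known in closed form; in the kernel HMC setting it would be replaced by the derivative of the fitted KEF estimate.

```python
import numpy as np

def leapfrog(x, p, grad_logpi, step=0.1, n_steps=25):
    """Approximate the Hamiltonian flow dx/dt = p, dp/dt = d log pi(x)/dx."""
    p = p + 0.5 * step * grad_logpi(x)     # initial half step for momentum
    for _ in range(n_steps - 1):
        x = x + step * p
        p = p + step * grad_logpi(x)
    x = x + step * p
    p = p + 0.5 * step * grad_logpi(x)     # final half step for momentum
    return x, p

grad_logpi = lambda x: -x                          # standard Gaussian target
H = lambda x, p: 0.5 * x**2 + 0.5 * p**2           # Hamiltonian U(x) + K(p)

x0, p0 = 1.0, 0.5
x1, p1 = leapfrog(x0, p0, grad_logpi)
dH = abs(H(x1, p1) - H(x0, p0))                    # near-conserved along the flow
```

Leapfrog nearly conserves $H$ (hence high MCMC acceptance) and is exactly time-reversible up to floating point: integrating back from $(x_1, -p_1)$ returns the starting point.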

SLIDE 19

Convergence

■ Misspecification

  The true parameter $f_*$ is taken from a wider space than $H_k$.

  Extended parameter space:

  $W_0^2(p_0) := \left\{ f \in C^1(\Omega) \ \middle|\ \dfrac{\partial f(x)}{\partial x_a} \in L^2(\Omega; p_0),\ a = 1, \dots, d \right\} \big/ \sim$

  where $f \sim g \iff \sum_{a=1}^{d} \bigl\| \partial f / \partial x_a - \partial g / \partial x_a \bigr\|_{L^2(p_0)}^2 = 0$, with inner product

  $\langle [f], [g] \rangle_{W_0^2(p_0)} := \sum_{a=1}^{d} \int \dfrac{\partial f(x)}{\partial x_a}\, \dfrac{\partial g(x)}{\partial x_a}\, p_0(x)\, dx$.

  $W^2(p_0)$: completion of the pre-Hilbert space $W_0^2(p_0)$.

  • With $k$ of class $C^2$ (and other technical conditions), the canonical map $I_k : H_k \to W^2(p_0)$, $f \mapsto [f]$, defines an embedding (up to constants) of $H_k$.

19

SLIDE 20

Theorem (convergence rate). Under some assumptions:

(i) If $f_* := \log(p_0 / q_0) \in R(I_k I_k^*)$, then with $\lambda_n \to 0$, $n \lambda_n \to \infty$:

    $J(p_0 \,\|\, p_{\hat f_n}) \to 0 \quad (n \to \infty)$.

(ii) If $f_* \in R\bigl( (I_k I_k^*)^\beta \bigr)$ $(0 < \beta \le 1)$, then with $\lambda_n = n^{-\max\{1/3,\ 1/(2\beta+1)\}}$:

    $J(p_0 \,\|\, p_{\hat f_n}) = O_p\bigl( n^{-\min\{2/3,\ 2\beta/(2\beta+1)\}} \bigr)$.

$I_k I_k^*$: operator on $W^2(p_0)$, given by

    $I_k I_k^*\, f = \int \sum_{a=1}^{d} \dfrac{\partial k(\cdot, x)}{\partial x_a}\, \dfrac{\partial f(x)}{\partial x_a}\, p_0(x)\, dx$

20

SLIDE 21

Hyperparameter selection

  – Hyperparameters
    • kernel / kernel parameter (e.g. $k(x,y) = \exp\left( -\dfrac{1}{2\sigma^2} \|x - y\|^2 \right)$)
    • regularization coefficient
  – Cross-validation is possible with the objective function

  $\tilde J_n(f) := \sum_{i=1}^{n} \sum_{a=1}^{d} \left[ \dfrac{1}{2} \left( \dfrac{\partial f(X_i)}{\partial x_a} + \dfrac{\partial \log q_0(X_i)}{\partial x_a} \right)^2 + \dfrac{\partial^2 f(X_i)}{\partial x_a^2} + \dfrac{\partial^2 \log q_0(X_i)}{\partial x_a^2} \right]$.

21
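A sketch of this cross-validation idea on the simplest possible case (not from the slides; the fold count, candidate grid, and ridge penalty $\lambda \|\theta\|^2$ are illustrative choices): fit the 1-D Gaussian natural parameters by regularized score matching on the training fold, and score each candidate $\lambda$ by the held-out score objective, which needs no normalization constant.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=1.0, scale=2.0, size=400)

def score_objective(theta, X):
    """Empirical score objective for p_theta(x) ~ exp(theta1 x + theta2 x^2):
    (1/n) sum_i [ 1/2 (theta1 + 2 theta2 X_i)^2 + 2 theta2 ]."""
    return np.mean(0.5 * (theta[0] + 2 * theta[1] * X) ** 2 + 2 * theta[1])

def fit(X, lam):
    """Minimize the score objective + lam * ||theta||^2 (quadratic -> linear solve)."""
    G = np.array([[1.0, 2 * X.mean()], [2 * X.mean(), 4 * (X**2).mean()]])
    c = np.array([0.0, 2.0])
    return np.linalg.solve(G + 2 * lam * np.eye(2), -c)

# 2-fold cross-validation over a small grid of regularization strengths.
half = len(X) // 2
folds = [(X[:half], X[half:]), (X[half:], X[:half])]
lams = [10.0, 1.0, 1e-2, 1e-4]
cv = [np.mean([score_objective(fit(tr, lam), te) for tr, te in folds]) for lam in lams]
best_lam = lams[int(np.argmin(cv))]
```

With a well-specified model and ample data, the held-out score objective prefers weak regularization, as expected.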

SLIDE 22

Experiments

22

SLIDE 23

Kernel Density Estimation

  – KDE: the standard nonparametric method for estimating a p.d.f. Given an i.i.d. sample $X_1, \dots, X_n \sim P$:

    $\hat p_n(x) = \dfrac{1}{n h_n^d} \sum_{i=1}^{n} K\left( \dfrac{x - X_i}{h_n} \right)$,  $K(x)$: a p.d.f., e.g. $K(x) = \dfrac{1}{(2\pi)^{d/2}} \exp\left( -\dfrac{\|x\|^2}{2} \right)$

  • KDE works well in one-dimensional cases, but is weak in high (say, 10) dimensional cases.
  • Sensitive to the choice of $h_n$ (though CV and other methods are applicable).

23
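The formula above can be sketched in a few lines for $d = 1$ (illustrative values of $n$ and $h_n$, not from the slides):

```python
import numpy as np

def kde(x_grid, X, h):
    """Gaussian-kernel KDE: p_hat(x) = (1/(n h)) sum_i K((x - X_i)/h), d = 1."""
    z = (x_grid[:, None] - X[None, :]) / h
    return np.exp(-z**2 / 2).sum(axis=1) / (len(X) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
X = rng.normal(size=500)
xs = np.linspace(-5.0, 5.0, 1001)
p_hat = kde(xs, X, h=0.3)
p_true = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

dx = xs[1] - xs[0]
mass = np.sum((p_hat[1:] + p_hat[:-1]) * dx) / 2                 # should be ~ 1
err = np.sum((np.abs(p_hat - p_true)[1:] +
              np.abs(p_hat - p_true)[:-1]) * dx) / 2             # L1 error to truth
```

The estimate integrates to one by construction; the $L^1$ error shrinks as $n$ grows but degrades quickly with dimension, which is the weakness the comparison on the next slides targets.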

SLIDE 24

Comparison: KEF vs KDE

24

■ Evaluated by the score objective function $J$

  kernel: $k(x,y) = \exp\bigl( -\|x - y\|^2 / 2\sigma^2 \bigr) + 0.1\, (x^T y + 0.5)^2$

  [Figures: score objective function vs. dimension (5–20), score matching vs. KDE; Gaussian distribution ($n = 500$) and Gaussian mixture ($n = 300$)]

  ・Gaussian: $p_0 = \phi_d(x; 0, I_d)$
  ・Gaussian mixture: $p_0 = 0.5\, \phi_d(x; 4 \cdot \mathbf{1}_d, I_d) + 0.5\, \phi_d(x; -4 \cdot \mathbf{1}_d, I_d)$

SLIDE 25

25

■ Evaluated by correlation

  $\mathrm{Corr}(p, p_0) := \dfrac{E[\, p(Z)\, p_0(Z)\,]}{\sqrt{E[\, p(Z)^2\,]\, E[\, p_0(Z)^2\,]}}, \qquad Z \sim \dfrac{1}{10^4} \sum_{i=1}^{10^4} \delta_{X_i}, \quad X_i \sim p_0\, dx$ i.i.d.

  [Figures: correlation vs. dimension (5–20), score matching vs. KDE; Gaussian distribution ($n = 500$) and Gaussian mixture ($n = 300$)]

  – Gaussian: $p_0 = \phi_d(x; 0, I_d)$
  – Gaussian mixture: $p_0 = 0.5\, \phi_d(x; 4 \cdot \mathbf{1}_d, I_d) + 0.5\, \phi_d(x; -4 \cdot \mathbf{1}_d, I_d)$
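The correlation criterion is straightforward to compute once density values are available; as a self-contained sketch (not the slides' experiment: a KDE stands in for the estimated density, $d = 1$, and the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=500)        # training sample for the density estimate
Z = rng.normal(size=10_000)     # 10^4 evaluation points Z_i ~ p0, as on the slide

def kde(x, X, h=0.3):           # simple Gaussian-kernel density estimate, d = 1
    z = (x[:, None] - X[None, :]) / h
    return np.exp(-z**2 / 2).sum(axis=1) / (len(X) * h * np.sqrt(2 * np.pi))

p_hat = kde(Z, X)
p0 = np.exp(-Z**2 / 2) / np.sqrt(2 * np.pi)   # true density values at Z

# Corr(p, p0) = E[p(Z) p0(Z)] / sqrt(E[p(Z)^2] E[p0(Z)^2]), empirical E over Z
corr = (p_hat * p0).mean() / np.sqrt((p_hat**2).mean() * (p0**2).mean())
```

By Cauchy–Schwarz the criterion never exceeds 1, and it equals 1 exactly when the two densities are proportional on the evaluation points.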

SLIDE 26

Conclusions

26

■ Infinite dimensional exponential family with a positive definite kernel

  – A natural extension of the finite dimensional exponential family.
  – Sufficient statistic and parameter are given by the feature vector $k(\cdot, x)$ and a function $f$, respectively.

■ The score matching method gives a tractable estimator for the kernel exponential family.

  – No need to compute normalization constants.
  – The estimator is given as the solution to a linear equation.
  – The unnormalized density function is estimated nonparametrically.

SLIDE 27

Thank you.

Reference

  • B. Sriperumbudur, K. Fukumizu, R. Kumar, A. Gretton, and A. Hyvärinen. Density Estimation in Infinite Dimensional Exponential Families. arXiv:1312.3516.

27