Gram Matrix estimation in high dimension, Ilaria Giulini (INRIA)



SLIDE 1

Gram Matrix estimation in high dimension

Ilaria Giulini

INRIA (project CLASSIC), Département de Mathématiques et Applications, ENS, 45 rue d'Ulm, 75005 Paris. Joint work with Olivier Catoni. Journée DIM RDM-IdF 2013, 12 September 2013.

SLIDE 2

General Setting

Let $P \in \mathcal{M}^1_+(\mathbb{R}^d)$ be a probability measure. The Gram matrix is
$$G = \int x x^\top \, dP(x).$$
Estimating $G$ is equivalent to estimating
$$N(\theta) = \int \langle x, \theta \rangle^2 \, dP(x),$$
since $N(\theta) = \theta^\top G \theta$. The distribution $P$ is unknown; we observe an i.i.d. sample $X_1, \dots, X_n \in \mathbb{R}^d$ drawn from $P$.

Goal: estimate $N(\theta)$ for every $\theta \in \mathbb{R}^d$ from the sample.
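Since the slides carry no code, here is a small numerical sketch (my own illustration, using NumPy) of the identity $N(\theta) = \theta^\top G \theta$ for a discrete distribution $P$ with finite support:

```python
import numpy as np

# Hypothetical discrete distribution P on R^3: support points x_k with weights p_k.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))        # support points
p = np.full(5, 1 / 5)               # uniform weights

# Gram matrix G = ∫ x x^T dP(x) = Σ_k p_k x_k x_k^T
G = sum(pk * np.outer(xk, xk) for pk, xk in zip(p, xs))

theta = np.array([1.0, -2.0, 0.5])

# N(θ) = ∫ <x, θ>^2 dP(x) = Σ_k p_k <x_k, θ>^2
N_theta = sum(pk * (xk @ theta) ** 2 for pk, xk in zip(p, xs))

# The slide's identity: N(θ) = θ^T G θ
assert np.isclose(N_theta, theta @ G @ theta)
```

The equality holds exactly (up to rounding) because both sides are the same quadratic form written two ways.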


SLIDE 6

Assumption: $\mathrm{Tr}(G) = \int \|x\|^2 \, dP(x) < +\infty$.

Our goal: estimate $N(\theta) = \int \langle \theta, x \rangle^2 \, dP(x)$, that is, build an estimator $\hat{N}$ (depending on $X_1, \dots, X_n$) such that, with probability $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
$$|N(\theta) - \hat{N}(\theta)| \le \eta(n, \theta, \epsilon),$$
where $\eta(n, \theta, \epsilon) \to 0$ as $n \to \infty$.

Techniques: PAC-Bayesian.
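As a sanity check of the goal (my own sketch; this is the naive plug-in estimate, not the talk's robust PAC-Bayesian estimator), the empirical average $\hat{N}(\theta) = \frac{1}{n}\sum_i \langle X_i, \theta \rangle^2$ approaches $N(\theta)$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 4
theta = np.ones(d)

# For P = N(0, I_d), the Gram matrix is G = I, so N(θ) = ||θ||^2 = 4.
N_true = theta @ theta

def plug_in(n):
    """Naive plug-in estimate of N(θ) from an i.i.d. sample of size n."""
    X = rng.normal(size=(n, d))
    return np.mean((X @ theta) ** 2)

err_small = abs(plug_in(100) - N_true)
err_large = abs(plug_in(100_000) - N_true)
# η(n, θ, ε) shrinks as n → ∞: err_large is typically far smaller than err_small.
```

The Gaussian choice of $P$ and the sample sizes are illustrative assumptions only.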

SLIDE 7

Dimension Dependent Bound

Let
$$\kappa = \sup_{\theta \neq 0} \frac{\int \langle \theta, x \rangle^4 \, dP(x)}{\left( \int \langle \theta, x \rangle^2 \, dP(x) \right)^2} < +\infty.$$
For any $\epsilon > 0$ and $n$ such that
$$n > \left( \sqrt{27 \sqrt{\kappa d} + 5\kappa - 4} + \sqrt{2(\kappa - 1)\left(\log(\epsilon^{-1}) + 1.11\,d\right)} \right)^2,$$
with probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
$$\left| \hat{N}(\theta) - N(\theta) \right| \le N(\theta) \, \frac{\mu}{1 - 3\mu}, \qquad (1)$$
where
$$\mu = \sqrt{\frac{2(\kappa - 1)}{n}\left(\log(\epsilon^{-1}) + 1.11\,d\right)} + \frac{\sqrt{2\kappa} \times 8.9\,d}{n}.$$

Remark: $\mathrm{Var}\bigl(\langle \theta, X \rangle^2\bigr) \sim (\kappa - 1)\, N(\theta)^2$.
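To illustrate the Remark (my own check): for Gaussian data, $\langle \theta, X \rangle$ is itself Gaussian, so the kurtosis ratio is $\kappa = 3$ and $\mathrm{Var}(\langle \theta, X \rangle^2) = (\kappa - 1)\, N(\theta)^2$ holds exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 1_000_000
theta = np.array([1.0, 2.0, -1.0])

X = rng.normal(size=(n, d))           # P = N(0, I_d), so N(θ) = ||θ||^2
proj2 = (X @ theta) ** 2              # samples of <θ, X>^2

N_theta = theta @ theta               # exact N(θ) = 6
kappa = 3.0                           # fourth-moment ratio of any 1D Gaussian

var_mc = proj2.var()                  # Monte Carlo variance of <θ, X>^2
var_exact = (kappa - 1) * N_theta**2  # Remark: (κ − 1) N(θ)^2 = 72
```

The Monte Carlo estimate agrees with the closed form to within a few percent at this sample size.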


SLIDE 9

Dimension-free Bound

With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$, the same estimator $\hat{N}$ satisfies
$$\mathbb{1}\{4\mu < 1\} \left| \frac{\hat{N}(\theta)}{N(\theta)} - 1 \right| \le \frac{\mu}{1 - 4\mu},$$
where, for $n < 10^{20}$,
$$\mu = \sqrt{\frac{2.07(\kappa - 1)}{n}\left(\log(\epsilon^{-1}) + 4.3 + 1.6 \times \frac{\|\theta\|^2 \,\mathrm{Tr}(G)}{N(\theta)}\right)} + \frac{9.2\,\|\theta\|^2 \,\mathrm{Tr}(G)}{n\, N(\theta)}.$$

SLIDE 10

Remark

Let $\theta_i$, $i = 1, \dots, d$, be an orthonormal basis. Then
$$\mathrm{Tr}(G) = \int \|x\|^2 \, dP(x) = \sum_{i=1}^d N(\theta_i).$$
If the energy is equally distributed, that is, $N(\theta_i) = N(\theta)$ for any $i = 1, \dots, d$, then
$$\frac{\mathrm{Tr}(G)}{N(\theta)} = \frac{\sum_{i=1}^d N(\theta_i)}{N(\theta)} = \frac{d\, N(\theta)}{N(\theta)} = d.$$
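A quick numerical check (my illustration) that $\mathrm{Tr}(G) = \sum_i N(\theta_i)$ holds for any orthonormal basis, not just the coordinate one:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Hypothetical Gram matrix G: any symmetric PSD matrix works.
A = rng.normal(size=(d, d))
G = A @ A.T

# Orthonormal basis θ_1, ..., θ_d from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# N(θ_i) = θ_i^T G θ_i; their sum is Tr(Q^T G Q) = Tr(G).
N_vals = [Q[:, i] @ G @ Q[:, i] for i in range(d)]
assert np.isclose(sum(N_vals), np.trace(G))
```

The identity is the invariance of the trace under orthogonal change of basis.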


SLIDE 12

PAC-Bayesian approach

Let $X_1, \dots, X_n \sim P$ be an i.i.d. sample.

  • D. McAllester; O. Catoni (2012)

Let $\nu \in \mathcal{M}^1_+(\Theta)$ be a prior probability measure. For all $f$ and all posteriors $\rho \in \mathcal{M}^1_+(\Theta)$ such that $\mathcal{K}(\rho, \nu) < +\infty$,
$$P\left[ \int \frac{1}{n} \sum_{i=1}^n \log\bigl(1 + f(X_i, \theta', \lambda)\bigr) \, d\rho(\theta') \le \iint f(x, \theta', \lambda) \, dP(x) \, d\rho(\theta') + \frac{\mathcal{K}(\rho, \nu) + \log(\epsilon^{-1})}{n} \right] \ge 1 - \epsilon,$$
where the Kullback divergence of $\rho$ with respect to $\nu$ is
$$\mathcal{K}(\rho, \nu) = \begin{cases} \displaystyle\int \log\left(\frac{d\rho}{d\nu}\right) d\rho & \text{if } \rho \ll \nu, \\ +\infty & \text{otherwise.} \end{cases}$$
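The Kullback term $\mathcal{K}(\rho, \nu)$ is explicit for simple posterior/prior pairs; for one-dimensional Gaussians it has the standard closed form below (my own illustration, not from the slides):

```python
import math

def kl_gauss(m1, s1, m0, s0):
    """K(ρ, ν) for ρ = N(m1, s1^2), ν = N(m0, s0^2), in nats (closed form)."""
    return math.log(s0 / s1) + (s1**2 + (m1 - m0) ** 2) / (2 * s0**2) - 0.5

# K(ρ, ν) = 0 iff ρ = ν, and grows as the posterior moves away from the prior.
assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0   # ρ = ν
assert kl_gauss(1.0, 1.0, 0.0, 1.0) == 0.5   # mean shifted by 1
```

In PAC-Bayes bounds this term penalizes posteriors that concentrate far from the prior.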
SLIDE 13

1. With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$, $\hat{B}_-(\theta) \le N(\theta) \le \hat{B}_+(\theta)$.

2. Definition of $\hat{N}$:
$$\hat{N}(\theta) = \frac{\hat{B}_+(\theta) + \hat{B}_-(\theta)}{2}.$$

3. Result: with probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
$$\left| N(\theta) - \hat{N}(\theta) \right| \le \frac{\hat{B}_+(\theta) - \hat{B}_-(\theta)}{2}.$$
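Points 1–3 are the interval-midpoint argument: if $N(\theta)$ lies in $[\hat{B}_-, \hat{B}_+]$, then the midpoint misses it by at most half the interval width. A toy check with hypothetical bound values:

```python
# Hypothetical confidence bounds for some fixed θ (illustrative numbers only).
B_minus, B_plus = 3.8, 4.4
N_hat = (B_plus + B_minus) / 2            # midpoint estimator

# Whenever B_minus <= N <= B_plus, the error is at most the half-width.
half_width = (B_plus - B_minus) / 2
for N in (3.8, 4.0, 4.4):
    assert abs(N - N_hat) <= half_width + 1e-12
```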


SLIDE 16

Work in progress

  • Dimension-free bounds for the quadratic form associated with the empirical Gram matrix
$$\hat{G} = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top.$$

  • Stability of algorithms for spectral clustering (PCA).
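The empirical Gram matrix above is easy to form directly; a short sketch (my own, in NumPy) checking that its quadratic form is exactly the empirical average of $\langle X_i, \theta \rangle^2$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 3
X = rng.normal(size=(n, d))          # rows are the i.i.d. sample X_1, ..., X_n

# Empirical Gram matrix: Ĝ = (1/n) Σ_i X_i X_i^T
G_hat = (X.T @ X) / n

theta = np.array([0.5, -1.0, 2.0])

# θ^T Ĝ θ is the empirical counterpart of N(θ).
emp_N = np.mean((X @ theta) ** 2)
assert np.isclose(theta @ G_hat @ theta, emp_N)
assert np.allclose(G_hat, G_hat.T)   # Ĝ is symmetric (and PSD by construction)
```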

SLIDE 17

Bibliography

  • O. Catoni, Estimating the Gram matrix through PAC-Bayes bounds, preprint.
  • O. Catoni, Challenging the empirical mean and empirical variance: a deviation study, Ann. Inst. H. Poincaré Probab. Statist., Vol. 48, No. 4 (2012).
  • G. Biau, A. Mas, PCA-Kernel Estimation, Stat. Risk Model., Vol. 29, No. 1 (2012).
  • J. Langford, J. Shawe-Taylor, PAC-Bayes & Margins, Advances in Neural Information Processing Systems (2002).
  • D. McAllester, Simplified PAC-Bayesian margin bounds, in COLT (2003).