High-dimensional statistics: Some progress and challenges ahead


slide-1
SLIDE 1

High-dimensional statistics: Some progress and challenges ahead

Martin Wainwright

UC Berkeley, Departments of Statistics and EECS

University College London, Master Class: Lecture 3

Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, and Bin Yu.

slide-2
SLIDE 2

Non-parametric regression

Goal: How to predict an output from covariates?

  • given covariates (x1, x2, . . . , xp)
  • output variable y
  • want to predict y based on (x1, . . . , xp)

slide-3
SLIDE 3

Non-parametric regression

Goal: How to predict an output from covariates?

  • given covariates (x1, x2, . . . , xp)
  • output variable y
  • want to predict y based on (x1, . . . , xp)

Different models, ordered in terms of complexity/richness:

  • linear
  • non-linear but still parametric
  • semi-parametric
  • non-parametric

slide-4
SLIDE 4

Non-parametric regression

Goal: How to predict an output from covariates?

  • given covariates (x1, x2, . . . , xp)
  • output variable y
  • want to predict y based on (x1, . . . , xp)

Different models, ordered in terms of complexity/richness:

  • linear
  • non-linear but still parametric
  • semi-parametric
  • non-parametric

Challenge: How to control statistical and computational complexity for a large number of predictors p?

slide-5
SLIDE 5

High dimensions and sample complexity

Possible models:

  • ordinary linear regression: y = Σ_{j=1}^p θj xj + w = ⟨θ, x⟩ + w
  • general non-parametric model: y = f(x1, x2, . . . , xp) + w

slide-6
SLIDE 6

High dimensions and sample complexity

Possible models:

  • ordinary linear regression: y = Σ_{j=1}^p θj xj + w = ⟨θ, x⟩ + w
  • general non-parametric model: y = f(x1, x2, . . . , xp) + w

Sample complexity: How many samples n for reliable prediction?

  • linear models
    ◮ without any structure: sample size n ≍ p/ǫ² (linear in p), necessary and sufficient

slide-7
SLIDE 7

High dimensions and sample complexity

Possible models:

  • ordinary linear regression: y = Σ_{j=1}^p θj xj + w = ⟨θ, x⟩ + w
  • general non-parametric model: y = f(x1, x2, . . . , xp) + w

Sample complexity: How many samples n for reliable prediction?

  • linear models
    ◮ without any structure: sample size n ≍ p/ǫ² (linear in p), necessary and sufficient
    ◮ with sparsity s ≪ p: sample size n ≍ (s log p)/ǫ² (logarithmic in p), necessary and sufficient

slide-8
SLIDE 8

High dimensions and sample complexity

Possible models:

  • ordinary linear regression: y = Σ_{j=1}^p θj xj + w = ⟨θ, x⟩ + w
  • general non-parametric model: y = f(x1, x2, . . . , xp) + w

Sample complexity: How many samples n for reliable prediction?

  • linear models
    ◮ without any structure: sample size n ≍ p/ǫ² (linear in p), necessary and sufficient
    ◮ with sparsity s ≪ p: sample size n ≍ (s log p)/ǫ² (logarithmic in p), necessary and sufficient
  • non-parametric models: p-dimensional, smoothness α
    ◮ curse of dimensionality: n ≍ (1/ǫ)^{2+p/α} (exponential in p)
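To make these scalings concrete, here is a minimal back-of-the-envelope sketch in Python; constants are suppressed and the example numbers are my own illustrative choices, not from the lecture:

```python
import math

def required_samples(p, eps, s=None, alpha=None):
    """Order-of-magnitude sample sizes for prediction error eps (scalings only)."""
    n_linear = p / eps**2                                        # unstructured linear model: n ~ p/eps^2
    n_sparse = s * math.log(p) / eps**2 if s else None           # s-sparse linear model: n ~ (s log p)/eps^2
    n_nonpar = (1 / eps) ** (2 + p / alpha) if alpha else None   # generic non-parametric model: n ~ (1/eps)^(2 + p/alpha)
    return n_linear, n_sparse, n_nonpar

# p = 1000 covariates, sparsity s = 10, smoothness alpha = 1, accuracy eps = 0.1:
print(required_samples(p=1000, eps=0.1, s=10, alpha=1.0))
# the non-parametric requirement dwarfs both linear scalings -- the curse of dimensionality
```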
slide-9
SLIDE 9

Structure in non-parametric regression

Upshot: It is essential to impose structural constraints for high-dimensional non-parametric models.

Reduced-dimension models:

  • dimension-reducing function ϕ : Rp → Rk, where k ≪ p
  • lower-dimensional function g : Rk → R
  • composite function f : Rp → R, with f(x1, x2, . . . , xp) = g(ϕ(x1, x2, . . . , xp))

slide-10
SLIDE 10

Structure in non-parametric regression

Reduced-dimension models:

  • dimension-reducing function ϕ : Rp → Rk, where k ≪ p
  • lower-dimensional function g : Rk → R
  • composite function f : Rp → R, with f(x1, x2, . . . , xp) = g(ϕ(x1, x2, . . . , xp))

Example: Regression on a k-dimensional manifold

  • form of model: f(x1, x2, . . . , xp) = g(ϕ(x1, x2, . . . , xp)), where ϕ is the co-ordinate mapping

[Figure: data lying on a k-dimensional manifold]

slide-11
SLIDE 11

Structure in non-parametric regression

Reduced-dimension models:

  • dimension-reducing function ϕ : Rp → Rk, where k ≪ p
  • lower-dimensional function g : Rk → R
  • composite function f : Rp → R, with f(x1, x2, . . . , xp) = g(ϕ(x1, x2, . . . , xp))

Example: Ridge functions

  • form of model: f(x1, x2, . . . , xp) = Σ_{j=1}^k gj(⟨aj, x⟩)
  • dimension-reducing mapping: ϕ(x1, . . . , xp) = Ax for some A ∈ Rk×p
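As an illustration of the ridge-function structure, here is a minimal sketch; the particular matrix A and univariate functions gj below are assumptions for demonstration, not choices made in the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 100, 3                        # ambient dimension p, reduced dimension k << p

A = rng.standard_normal((k, p))      # rows a_1, ..., a_k of the dimension-reducing map phi(x) = A x
g = [np.tanh, np.sin, np.square]     # illustrative univariate functions g_1, ..., g_k

def ridge_function(x):
    """f(x) = sum_j g_j(<a_j, x>): a p-variate function with k-dimensional structure."""
    z = A @ x                        # phi(x) = A x lives in R^k
    return sum(gj(zj) for gj, zj in zip(g, z))

x = rng.standard_normal(p)
print(ridge_function(x))
```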

slide-12
SLIDE 12

Remainder of lecture

1 Sparse additive models
    ◮ formulation, applications
    ◮ families of estimators
    ◮ efficient implementation as SOCP

2 Statistical rates
    ◮ kernel complexity
    ◮ subset selection plus univariate function estimation

3 Minimax lower bounds
    ◮ statistics as channel coding
    ◮ metric entropy and lower bounds

slide-13
SLIDE 13

Sparse additive models

additive models: f(x1, x2, . . . , xp) = Σ_{j=1}^p fj(xj)

(Stone, 1985; Hastie & Tibshirani, 1990)

slide-14
SLIDE 14

Sparse additive models

additive models: f(x1, x2, . . . , xp) = Σ_{j=1}^p fj(xj)

(Stone, 1985; Hastie & Tibshirani, 1990)

additivity with sparsity: f(x1, x2, . . . , xp) = Σ_{j∈S} fj(xj) for an unknown subset S of cardinality |S| = s
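A minimal simulation of this observation model; the active coordinates, component functions, and noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 500, 4                            # samples, ambient dimension, sparsity

X = rng.uniform(-1.0, 1.0, size=(n, p))
S = rng.choice(p, size=s, replace=False)         # unknown active set with |S| = s
components = [np.sin, np.cos, np.tanh, np.abs]   # the functions f_j for j in S (illustrative)

f_star = sum(fj(X[:, j]) for fj, j in zip(components, S))   # f*(x) = sum_{j in S} f_j(x_j)
y = f_star + 0.1 * rng.standard_normal(n)                   # y_i = f*(x_i) + w_i
```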

slide-15
SLIDE 15

Sparse additive models

additive models: f(x1, x2, . . . , xp) = Σ_{j=1}^p fj(xj)

(Stone, 1985; Hastie & Tibshirani, 1990)

additivity with sparsity: f(x1, x2, . . . , xp) = Σ_{j∈S} fj(xj) for an unknown subset S of cardinality |S| = s

studied by previous authors:

    ◮ Lin & Zhang, 2006: COSSO relaxation
    ◮ Ravikumar et al., 2007: SpAM back-fitting procedure, consistency
    ◮ Bach et al., 2008: multiple kernel learning (MKL), consistency in the classical setting
    ◮ Meier et al., 2007: L2(Pn) regularization
    ◮ Koltchinski & Yuan, 2008, 2010
    ◮ Raskutti, W. & Yu, 2009: minimax lower bounds

slide-16
SLIDE 16

Application: Copula methods and graphical models

  • transform Xj → Zj = fj(Xj)
  • model (Z1, . . . , Zp) as a jointly Gaussian Markov random field:

      P(z1, z2, . . . , zp) ∝ exp( Σ_{(s,t)∈E} θst zs zt ).

[Figure: node Xs and its neighbors Xt1, . . . , Xt5 in the graph]

  • exploit Markov properties: neighborhood-based selection for learning graphs

(Besag, 1974; Meinshausen & Buhlmann, 2006)

  • combined with the copula method: a semi-parametric approach to graphical model learning

(Liu, Lafferty & Wasserman, 2009)
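A rough sketch of this pipeline: transform each coordinate to Gaussian scores (a rank-based stand-in for the fj), then run neighborhood-based selection with an ℓ1 penalty. The use of scikit-learn, the transform, and the tuning value are my assumptions; this is not the exact procedure of the cited papers.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def gaussian_scores(X):
    """Map each column X_j to Gaussian scores, a rank-based surrogate for Z_j = f_j(X_j)."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return norm.ppf(ranks / (n + 1))

def neighborhood_selection(Z, lam=0.1):
    """Regress each Z_j on the remaining coordinates with an l1 penalty; the nonzero
    coefficients give the estimated neighborhood of node j (lam is an illustrative choice)."""
    p = Z.shape[1]
    neighbors = {}
    for j in range(p):
        rest = np.delete(np.arange(p), j)
        coef = Lasso(alpha=lam).fit(Z[:, rest], Z[:, j]).coef_
        neighbors[j] = set(rest[np.abs(coef) > 1e-8])
    return neighbors
```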

slide-17
SLIDE 17

Sparse and smooth

Noisy samples: yi = f∗(xi1, xi2, . . . , xip) + wi for i = 1, 2, . . . , n

for an unknown function f∗ with:

  • sparse representation: f∗ = Σ_{j∈S} f∗j
  • univariate functions are smooth: f∗j ∈ Hj

slide-18
SLIDE 18

Sparse and smooth

Noisy samples: yi = f∗(xi1, xi2, . . . , xip) + wi for i = 1, 2, . . . , n

for an unknown function f∗ with:

  • sparse representation: f∗ = Σ_{j∈S} f∗j
  • univariate functions are smooth: f∗j ∈ Hj

Disregarding computational cost:

  min_{|S| ≤ s}   min_{f = Σ_{j∈S} fj, fj ∈ Hj}   (1/n) Σ_{i=1}^n ( yi − f(xi) )²      ( = ‖y − f‖²n )

slide-19
SLIDE 19

Sparse and smooth

Disregarding computational cost:

  min_{|S| ≤ s}   min_{f = Σ_{j∈S} fj, fj ∈ Hj}   (1/n) Σ_{i=1}^n ( yi − f(xi) )²      ( = ‖y − f‖²n )

1-Hilbert-norm as convex surrogate:   ‖f‖1,H := Σ_{j=1}^p ‖fj‖Hj

slide-20
SLIDE 20

Sparse and smooth

Disregarding computational cost:

  min_{|S| ≤ s}   min_{f = Σ_{j∈S} fj, fj ∈ Hj}   (1/n) Σ_{i=1}^n ( yi − f(xi) )²      ( = ‖y − f‖²n )

1-Hilbert-norm as convex surrogate:   ‖f‖1,H := Σ_{j=1}^p ‖fj‖Hj

1-L2(Pn)-norm as convex surrogate:   ‖f‖1,n := Σ_{j=1}^p ‖fj‖L2(Pn),   where ‖fj‖²L2(Pn) := (1/n) Σ_{i=1}^n fj²(xij).

slide-21
SLIDE 21

A family of estimators

Noisy samples: yi = f∗(xi1, xi2, . . . , xip) + wi for i = 1, 2, . . . , n

for an unknown function f∗ = Σ_{j∈S} f∗j.

slide-22
SLIDE 22

A family of estimators

Noisy samples: yi = f∗(xi1, xi2, . . . , xip) + wi for i = 1, 2, . . . , n

for an unknown function f∗ = Σ_{j∈S} f∗j.

Estimator:

  f̂ ∈ arg min_{f = Σ_{j=1}^p fj}  { (1/n) Σ_{i=1}^n ( yi − Σ_{j=1}^p fj(xij) )²  +  ρn ‖f‖1,H  +  µn ‖f‖1,n }.
slide-23
SLIDE 23

A family of estimators

Noisy samples: yi = f∗(xi1, xi2, . . . , xip) + wi for i = 1, 2, . . . , n

for an unknown function f∗ = Σ_{j∈S} f∗j.

Estimator:

  f̂ ∈ arg min_{f = Σ_{j=1}^p fj}  { (1/n) Σ_{i=1}^n ( yi − Σ_{j=1}^p fj(xij) )²  +  ρn ‖f‖1,H  +  µn ‖f‖1,n }.

Two kinds of regularization:

  ‖f‖1,n = Σ_{j=1}^p ‖fj‖L2(Pn) = Σ_{j=1}^p √( (1/n) Σ_{i=1}^n fj²(xij) ),   and   ‖f‖1,H = Σ_{j=1}^p ‖fj‖Hj.

slide-24
SLIDE 24

Example: Polynomial kernels

Polynomial kernel: K(z, x) = (1 + ⟨z, x⟩)^d

Functions in the span of the data: f(z) = Σ_{i=1}^n αi (1 + ⟨z, xi⟩)^d

[Figure: functions from a second-order polynomial kernel; function value versus design value x]

slide-25
SLIDE 25

Example: First-order Sobolev kernel

First-order Sobolev kernel: K(z, x) = 1 + min{z, x}

Functions in the span of the data are Lipschitz: f(z) = Σ_{i=1}^n αi (1 + min{z, xi})

[Figure: functions from a first-order Sobolev kernel; function value versus design value x]
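A small sketch of the two example kernels as Gram-matrix builders; the design points, weights, and degree d = 2 are illustrative assumptions:

```python
import numpy as np

def poly_kernel(z, x, d=2):
    """Polynomial kernel K(z, x) = (1 + z*x)^d for univariate inputs."""
    return (1.0 + z * x) ** d

def sobolev1_kernel(z, x):
    """First-order Sobolev kernel K(z, x) = 1 + min{z, x}."""
    return 1.0 + np.minimum(z, x)

x = np.linspace(0.0, 1.0, 6)                   # illustrative design points
K_poly = poly_kernel(x[:, None], x[None, :])   # empirical kernel matrix [K]_{il} = K(x_i, x_l)
K_sob = sobolev1_kernel(x[:, None], x[None, :])

# any function in the span of the data has the form f(z) = sum_i alpha_i K(z, x_i)
alpha = np.ones(len(x)) / len(x)
print(poly_kernel(0.3, x).dot(alpha), sobolev1_kernel(0.3, x).dot(alpha))
```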

slide-26
SLIDE 26

Efficient implementation by kernelization

Representer theorem: reduces to a convex program involving

  • a matrix of coefficients A = (α1, α2, . . . , αp) ∈ Rn×p
  • empirical kernel matrices [Kj]iℓ = Kj(xij, xℓj)

(Kimeldorf & Wahba, 1971)

Original estimator:

  f̂ ∈ arg min_{f = Σ_{j=1}^p fj}  { (1/n) Σ_{i=1}^n ( yi − Σ_{j=1}^p fj(xij) )²  +  ρn Σ_{j=1}^p ‖fj‖Hj  +  µn Σ_{j=1}^p ‖fj‖L2(Pn) }

slide-27
SLIDE 27

Efficient implementation by kernelization

Representer theorem: reduces to a convex program involving

  • a matrix of coefficients A = (α1, α2, . . . , αp) ∈ Rn×p
  • empirical kernel matrices [Kj]iℓ = Kj(xij, xℓj)

(Kimeldorf & Wahba, 1971)

Original estimator:

  f̂ ∈ arg min_{f = Σ_{j=1}^p fj}  { (1/n) Σ_{i=1}^n ( yi − Σ_{j=1}^p fj(xij) )²  +  ρn Σ_{j=1}^p ‖fj‖Hj  +  µn Σ_{j=1}^p ‖fj‖L2(Pn) }

Kernelized form:

  Â ∈ arg min_{A ∈ Rn×p}  { (1/n) ‖y − Σ_{j=1}^p Kj αj‖²₂  +  ρn Σ_{j=1}^p √( αjᵀ Kj αj )  +  µn Σ_{j=1}^p √( (1/n) αjᵀ Kj² αj ) }.
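One way to see this program as an SOCP: √(αjᵀ Kj αj) = ‖Kj^{1/2} αj‖₂ and √((1/n) αjᵀ Kj² αj) = ‖Kj αj‖₂/√n, so both penalties are second-order cone representable. Below is a compact sketch using cvxpy; the choice of modeling library, solver, and regularization levels are my assumptions, not part of the lecture.

```python
import numpy as np
import cvxpy as cp

def fit_sparse_additive(y, K_list, rho_n, mu_n):
    """Kernelized doubly-regularized estimator (sketch): one coefficient vector
    alpha_j in R^n per coordinate j, with empirical kernel matrices K_j."""
    n, p = len(y), len(K_list)
    alphas = [cp.Variable(n) for _ in range(p)]
    fitted = sum(K @ a for K, a in zip(K_list, alphas))

    # symmetric square roots K_j^{1/2}, so that sqrt(a' K a) = ||K^{1/2} a||_2
    K_sqrts = []
    for K in K_list:
        w, V = np.linalg.eigh(K)
        K_sqrts.append(V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T)

    hilbert_pen = sum(cp.norm(Ks @ a, 2) for Ks, a in zip(K_sqrts, alphas))              # sum_j ||f_j||_H
    empirical_pen = sum(cp.norm(K @ a, 2) / np.sqrt(n) for K, a in zip(K_list, alphas))  # sum_j ||f_j||_n

    objective = cp.sum_squares(y - fitted) / n + rho_n * hilbert_pen + mu_n * empirical_pen
    cp.Problem(cp.Minimize(objective)).solve()    # handled as an SOCP by conic solvers
    return [a.value for a in alphas]
```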

slide-28
SLIDE 28

Empirical results: Unrescaled

[Figure: MSE versus raw sample size, for p = 64, 128, 256]

slide-29
SLIDE 29

Empirical results: Appropriately rescaled

[Figure: MSE versus rescaled sample size, for p = 64, 128, 256]

slide-30
SLIDE 30

Decay rate of kernel eigenvalues

Mercer’s theorem: there exist an orthonormal basis {φj} and non-negative eigenvalues {λj} such that

  K(z, x) = Σ_{j=1}^∞ λj φj(z) φj(x).

Key intuition: the decay rate of λj → 0 controls the complexity of the kernel class.

slide-31
SLIDE 31

Decay rate of kernel eigenvalues

Mercer’s theorem: there exist an orthonormal basis {φj} and non-negative eigenvalues {λj} such that

  K(z, x) = Σ_{j=1}^∞ λj φj(z) φj(x).

Key intuition: the decay rate of λj → 0 controls the complexity of the kernel class.

Local Rademacher complexity (Mendelson, 2002):

  RK(δ) := (1/√n) ( Σ_{j=1}^∞ min{λj, δ²} )^{1/2}.

slide-32
SLIDE 32

Decay rate of kernel eigenvalues

Mercer’s theorem: there exist an orthonormal basis {φj} and non-negative eigenvalues {λj} such that

  K(z, x) = Σ_{j=1}^∞ λj φj(z) φj(x).

Key intuition: the decay rate of λj → 0 controls the complexity of the kernel class.

Local Rademacher complexity (Mendelson, 2002):

  RK(δ) := (1/√n) ( Σ_{j=1}^∞ min{λj, δ²} )^{1/2}.

Example: Gaussian kernel, with λj ≍ exp(−j²).

slide-33
SLIDE 33

Decay rate of kernel eigenvalues

Mercer’s theorem: there exist an orthonormal basis {φj} and non-negative eigenvalues {λj} such that

  K(z, x) = Σ_{j=1}^∞ λj φj(z) φj(x).

Key intuition: the decay rate of λj → 0 controls the complexity of the kernel class.

Local Rademacher complexity (Mendelson, 2002):

  RK(δ) := (1/√n) ( Σ_{j=1}^∞ min{λj, δ²} )^{1/2}.

Example: Sobolev smoothness kernels:

  • first-order (Lipschitz): λj ≍ (1/j)²
  • second-order (twice differentiable): λj ≍ (1/j)⁴

slide-34
SLIDE 34

Achievable results

Model:

  • f∗ has s ≪ p non-zero components
  • each univariate component f∗j lies in the same univariate Hilbert space H with eigenvalues {λj}
  • critical univariate rate δn determined by solving

      δn² ≍ RK(δn) = (1/√n) ( Σ_{j=1}^∞ min{λj, δn²} )^{1/2}
slide-35
SLIDE 35

Achievable results

Model:

  • f∗ has s ≪ p non-zero components
  • each univariate component f∗j lies in the same univariate Hilbert space H with eigenvalues {λj}
  • critical univariate rate δn determined by solving

      δn² ≍ RK(δn) = (1/√n) ( Σ_{j=1}^∞ min{λj, δn²} )^{1/2}

Theorem (Raskutti, W. & Yu, 2010, 2012)
For appropriate choices of the regularization parameters ρn, µn, we have

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s δn²   [cost of s-variate estimation]

with high probability.

slide-36
SLIDE 36

Consequence: Kernels with exponential decay

  • univariate kernel with α-exponential eigendecay: λj ≍ exp(−j^α)
  • example: Gaussian kernel (α = 2)

slide-37
SLIDE 37

Consequence: Kernels with exponential decay

  • univariate kernel with α-exponential eigendecay: λj ≍ exp(−j^α)
  • example: Gaussian kernel (α = 2)

Corollary
For a kernel with α-exponential eigendecay,

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s (log n)^{2/α} / n   [cost of s-variate estimation]

with high probability.

slide-38
SLIDE 38

Consequence: Kernels with exponential decay

  • univariate kernel with α-exponential eigendecay: λj ≍ exp(−j^α)
  • example: Gaussian kernel (α = 2)

Corollary
For a kernel with α-exponential eigendecay,

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s (log n)^{2/α} / n   [cost of s-variate estimation]

with high probability.

Note: Either term can dominate, depending on the relative scalings of the sample size n, the ambient dimension p, and the decay rate α.

slide-39
SLIDE 39

Consequence: Kernels with polynomial decay

  • univariate kernel with α-polynomial eigendecay: λj ≍ (1/j)^α
  • examples: Sobolev smoothness classes
    ◮ α = 2: Lipschitz functions
    ◮ α = 4: twice-differentiable functions

slide-40
SLIDE 40

Consequence: Kernels with polynomial decay

  • univariate kernel with α-polynomial eigendecay: λj ≍ (1/j)^α
  • examples: Sobolev smoothness classes
    ◮ α = 2: Lipschitz functions
    ◮ α = 4: twice-differentiable functions

Corollary
For a kernel with α-polynomial eigendecay,

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s (1/n)^{2α/(2α+1)}   [cost of s-variate estimation]

with high probability.

slide-41
SLIDE 41

Consequence: Kernels with polynomial decay

  • univariate kernel with α-polynomial eigendecay: λj ≍ (1/j)^α
  • examples: Sobolev smoothness classes
    ◮ α = 2: Lipschitz functions
    ◮ α = 4: twice-differentiable functions

Corollary
For a kernel with α-polynomial eigendecay,

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s (1/n)^{2α/(2α+1)}   [cost of s-variate estimation]

with high probability.

Note: Either term can dominate, depending on the relative scalings of the sample size n, the ambient dimension p, and the smoothness α.
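A throwaway calculator for the two corollaries above, useful for seeing which term dominates; constants are set to one (an assumption of this sketch), so only the scalings are meaningful:

```python
import numpy as np

def rate_exponential(n, p, s, alpha):
    """alpha-exponential eigendecay: (s log p / n, s (log n)^(2/alpha) / n)."""
    return s * np.log(p) / n, s * np.log(n) ** (2 / alpha) / n

def rate_polynomial(n, p, s, alpha):
    """alpha-polynomial eigendecay: (s log p / n, s * n^(-2*alpha/(2*alpha+1)))."""
    return s * np.log(p) / n, s * n ** (-2 * alpha / (2 * alpha + 1))

for n in (100, 10_000, 1_000_000):
    sel, est = rate_polynomial(n, p=10_000, s=5, alpha=2)   # alpha as in the corollary above
    print(f"n={n}: subset-selection term {sel:.1e}, estimation term {est:.1e}")
```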

slide-42
SLIDE 42

Consequence: Finite-rank kernels

  • a (block) univariate kernel K has rank m if λj = 0 for all j > m
  • many examples:
    ◮ linear function classes in Rm
    ◮ polynomials of degree d = m − 1 in R

slide-43
SLIDE 43

Consequence: Finite-rank kernels

  • a (block) univariate kernel K has rank m if λj = 0 for all j > m
  • many examples:
    ◮ linear function classes in Rm
    ◮ polynomials of degree d = m − 1 in R

Corollary
For any kernel with rank m, we have

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s m / n   [cost of s-variate estimation]

with high probability.

slide-44
SLIDE 44

Consequence: Finite-rank kernels

  • a (block) univariate kernel K has rank m if λj = 0 for all j > m
  • many examples:
    ◮ linear function classes in Rm
    ◮ polynomials of degree d = m − 1 in R

Corollary
For any kernel with rank m, we have

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  s log p / n   [cost of subset selection]   +   s m / n   [cost of s-variate estimation]

with high probability.

Note: Either term can dominate, depending on the relative scalings of the ambient dimension p and the kernel rank m.

slide-45
SLIDE 45

High-dimensional results from past work

Ravikumar et al., 2008:
    ◮ “back-fitting” method for sparse additive models
    ◮ establish consistency but do not track the sparsity s

slide-46
SLIDE 46

High-dimensional results from past work

Ravikumar et al., 2008:
    ◮ “back-fitting” method for sparse additive models
    ◮ establish consistency but do not track the sparsity s

Meier et al., 2008:
    ◮ regularize with ‖f‖n,1
    ◮ establish a rate of the order s (log p / n)^{2α/(2α+1)} for α-smooth Sobolev spaces

slide-47
SLIDE 47

High-dimensional results from past work

Ravikumar et al., 2008:
    ◮ “back-fitting” method for sparse additive models
    ◮ establish consistency but do not track the sparsity s

Meier et al., 2008:
    ◮ regularize with ‖f‖n,1
    ◮ establish a rate of the order s (log p / n)^{2α/(2α+1)} for α-smooth Sobolev spaces

Koltchinski & Yuan, 2008:
    ◮ regularize with ‖f‖H,1
    ◮ establish rates involving terms at least as large as s³ log p / n

slide-48
SLIDE 48

High-dimensional results from past work

Ravikumar et al., 2008:
    ◮ “back-fitting” method for sparse additive models
    ◮ establish consistency but do not track the sparsity s

Meier et al., 2008:
    ◮ regularize with ‖f‖n,1
    ◮ establish a rate of the order s (log p / n)^{2α/(2α+1)} for α-smooth Sobolev spaces

Koltchinski & Yuan, 2008:
    ◮ regularize with ‖f‖H,1
    ◮ establish rates involving terms at least as large as s³ log p / n

Concurrent work (Koltchinski & Yuan, 2010):
    ◮ analyze the same estimator but under a global boundedness condition
    ◮ rates are not minimax-optimal

slide-49
SLIDE 49

Rates with global boundedness

Koltchinski & Yuan, 2010: analyzed the same estimator, but under the global boundedness condition

  ‖f∗‖∞ = ‖ Σ_{j∈S} f∗j ‖∞ ≤ B.

Similar rates were claimed to be optimal.

slide-50
SLIDE 50

Rates with global boundedness

Koltchinski & Yuan, 2010: analyzed the same estimator, but under the global boundedness condition

  ‖f∗‖∞ = ‖ Σ_{j∈S} f∗j ‖∞ ≤ B.

Similar rates were claimed to be optimal.

Proposition (Raskutti, W. & Yu, 2010)
Faster rates are possible under global boundedness conditions. For any Sobolev kernel with smoothness α,

  ‖f̂ − f∗‖²_{L2(Pn)}  ≲  φ(s, n) s (1/n)^{2α/(2α+1)}  +  s log(p/s) / n

for a function φ such that φ(s, n) = o(1) when s ≫ √n.

slide-51
SLIDE 51

Information-theoretic lower bounds

Thus far:

  • a polynomial-time algorithm based on solving an SOCP
  • upper bounds on the error that hold w.h.p.

Question: But are these “good” results?

Statistical minimax: For a function class F, define the minimax error

  Mn(Fs,p,α) := inf_{f̂} sup_{f∗ ∈ Fs,p,α} ‖f̂ − f∗‖²₂.

This lower-bounds the behavior of any algorithm over the class F.

slide-52
SLIDE 52

Function estimation as channel coding

[Diagram: f∗ ∈ F passes through a noisy channel P(y | f∗); the decoder returns f̂]

1 Nature chooses a function f∗ from a class F.
2 User makes n observations of f∗ through a noisy channel.
3 Function estimation as decoding: return an estimate f̂ based on the samples {(yi, xi)}, i = 1, . . . , n.

(Hasminskii, 1978; Birge, 1981; Yang & Barron, 1999)

slide-53
SLIDE 53

Complexity of function classes

[Figure: an ǫ-packing of a function class]

complexity measured by packing number (Kolmogorov & Tikhomirov, 1960)

ǫ-packing set: functions {f1, f2, . . . , fM} such that ‖fj − fk‖ > ǫ for all j ≠ k

slide-54
SLIDE 54

Complexity of function classes

[Figure: an ǫ-packing of a function class]

complexity measured by packing number (Kolmogorov & Tikhomirov, 1960)

ǫ-packing set: functions {f1, f2, . . . , fM} such that ‖fj − fk‖ > ǫ for all j ≠ k

for Lipschitz functions in 1 dimension: M(ǫ) ≍ 2^{L/ǫ}.

slide-55
SLIDE 55

Complexity of function classes

[Figure: an ǫ-packing of a function class]

complexity measured by packing number (Kolmogorov & Tikhomirov, 1960)

ǫ-packing set: functions {f1, f2, . . . , fM} such that ‖fj − fk‖ > ǫ for all j ≠ k

for Lipschitz functions in p dimensions: M(ǫ) ≍ 2^{(L/ǫ)^p}.

slide-56
SLIDE 56

Metric entropy classes

Covering number N(δ; F) = smallest number of δ-balls needed to cover F

slide-57
SLIDE 57

Metric entropy classes

Covering number N(δ; F) = smallest number of δ-balls needed to cover F

1 Logarithmic metric entropy: log N(δ; F) ≍ m log(1/δ)

Examples:
    ◮ parametric classes
    ◮ finite-rank kernels
    ◮ any function class with finite VC dimension

slide-58
SLIDE 58

Metric entropy classes

Covering number N(δ; F) = smallest number of δ-balls needed to cover F

2 Polynomial metric entropy: log N(δ; F) ≍ (1/δ)^{1/α}

Examples:
    ◮ various smoothness classes
    ◮ Sobolev classes

slide-59
SLIDE 59

Lower bounds on minimax risk

Theorem (Raskutti, W. & Yu, 2010)
Under the same conditions, there is a constant c0 > 0 such that:

1 For a function class F with m-logarithmic metric entropy:

  P[ Mn(Fs,p,α) ≥ c0 ( s log(p/s) / n   [subset sel.]   +   s m / n   [s-var. est.] ) ] ≥ 1/2.

slide-60
SLIDE 60

Lower bounds on minimax risk

Theorem (Raskutti, W. & Yu, 2010)
Under the same conditions, there is a constant c0 > 0 such that:

1 For a function class F with m-logarithmic metric entropy:

  P[ Mn(Fs,p,α) ≥ c0 ( s log(p/s) / n   [subset sel.]   +   s m / n   [s-var. est.] ) ] ≥ 1/2.

2 For a function class F with α-polynomial metric entropy:

  P[ Mn(Fs,p,α) ≥ c0 ( s log(p/s) / n   [subset sel.]   +   s (1/n)^{2α/(2α+1)}   [s-var. est.] ) ] ≥ 1/2.

slide-61
SLIDE 61

Summary

  • structure is essential for high-dimensional non-parametric models
  • sparse and smooth additive models:
    ◮ convex relaxation based on a composite regularizer
    ◮ attains minimax-optimal rates for kernel classes:
      ⋆ cost of subset selection: s log(p/s) / n
      ⋆ cost of s-variate function estimation: s δn²

slide-62
SLIDE 62

Summary

  • structure is essential for high-dimensional non-parametric models
  • sparse and smooth additive models:
    ◮ convex relaxation based on a composite regularizer
    ◮ attains minimax-optimal rates for kernel classes:
      ⋆ cost of subset selection: s log(p/s) / n
      ⋆ cost of s-variate function estimation: s δn²
  • many open questions:
    ◮ functional ANOVA decompositions: allowing groupings of variables (doublets, triplets, etc.)
    ◮ other types of high-dimensional non-parametric models:
      ⋆ kernel PCA/CCA with structural constraints
      ⋆ structured density estimation
    ◮ trade-offs between computational and statistical efficiency

slide-63
SLIDE 63

Summary

  • structure is essential for high-dimensional non-parametric models
  • sparse and smooth additive models:
    ◮ convex relaxation based on a composite regularizer
    ◮ attains minimax-optimal rates for kernel classes:
      ⋆ cost of subset selection: s log(p/s) / n
      ⋆ cost of s-variate function estimation: s δn²
  • many open questions:
    ◮ functional ANOVA decompositions: allowing groupings of variables (doublets, triplets, etc.)
    ◮ other types of high-dimensional non-parametric models:
      ⋆ kernel PCA/CCA with structural constraints
      ⋆ structured density estimation
    ◮ trade-offs between computational and statistical efficiency

Related paper: Raskutti, W. & Yu (2012). Minimax-optimal rates for sparse additive models over kernel classes. Journal of Machine Learning Research, March 2012.