SLIDE 1

Nonparametric Sparsity

John Lafferty Larry Wasserman

John Lafferty: Computer Science Dept. and Machine Learning Dept., Carnegie Mellon University
Larry Wasserman: Department of Statistics and Machine Learning Dept., Carnegie Mellon University

SLIDE 2

Motivation

  • “Modern” data are very high dimensional
  • In order to be “learnable,” there must be lower-dimensional structure
  • Developing practical algorithms with theoretical guarantees for beating the curse of (apparent) dimensionality is a main scientific challenge for our field

SLIDE 3

Motivation

  • Sparsity is emerging as a key concept in statistics and machine learning
  • Dramatic progress in recent years on understanding sparsity in parametric settings
  • Nonparametric sparsity: Wide open

SLIDE 4

Outline

  • High dimensional learning: Parametric and nonparametric
  • Rodeo: Greedy, sparse nonparametric regression
  • Extensions of the Rodeo


SLIDE 5

Parametric Case: Variable Selection in Linear Models

$$Y = \sum_{j=1}^{d} \beta_j X_j + \epsilon = X^T \beta + \epsilon$$

where $d$ might be larger than $n$. Predictive risk:
$$R = \mathbb{E}\big(Y_{\mathrm{new}} - X_{\mathrm{new}}^T \beta\big)^2.$$

Want to choose a subset $(X_j : j \in S)$, $S \subset \{1, \ldots, d\}$, to make $R$ small.

Bias-variance tradeoff:
  • small $S \Rightarrow$ Bias ↑, Variance ↓
  • large $S \Rightarrow$ Bias ↓, Variance ↑
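A minimal numpy sketch (not from the slides; the truth, sample sizes, and subset sizes are all illustrative) makes the tradeoff concrete: the true $\beta$ is sparse, and both under- and over-sized subsets $S$ inflate the estimated predictive risk.

```python
import numpy as np

# Illustrative only: sparse truth, OLS fits on nested subsets of
# increasing size, predictive risk estimated on a held-out sample.
rng = np.random.default_rng(0)
n, d, n_new = 100, 50, 10_000
beta = np.zeros(d)
beta[:5] = 2.0                        # only the first 5 variables matter

X = rng.normal(size=(n, d))
y = X @ beta + rng.normal(size=n)
X_new = rng.normal(size=(n_new, d))
y_new = X_new @ beta + rng.normal(size=n_new)

for k in [2, 5, 20, 50]:              # candidate subset sizes |S|
    S = np.arange(k)
    beta_hat, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    risk = np.mean((y_new - X_new[:, S] @ beta_hat) ** 2)
    print(f"|S| = {k:2d}: estimated predictive risk = {risk:.3f}")
# |S| = 2 underfits (bias up); |S| = 50 overfits (variance up).
```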

SLIDE 6

Lasso/Basis Pursuit

(Chen & Donoho, 1994; Tibshirani, 1996)

$$\sum_{j=1}^{d} |\beta_j| \le t$$

[Figure: level sets of the squared error meeting the $\ell_1$ ball.]

For orthogonal designs, the solution is given by soft thresholding the least squares coefficients:
$$\hat{\beta}_j = \mathrm{sign}\big(\hat{\beta}_j^{\,\mathrm{ls}}\big)\big(|\hat{\beta}_j^{\,\mathrm{ls}}| - \lambda\big)_+$$
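The soft thresholding rule is one line of code; a small sketch with hypothetical coefficient values:

```python
import numpy as np

def soft_threshold(b: np.ndarray, lam: float) -> np.ndarray:
    """Coordinatewise soft thresholding: sign(b) * (|b| - lam)_+."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Applied to (hypothetical) least squares coefficients with lambda = 1:
print(soft_threshold(np.array([3.0, -0.4, 1.2, -2.5]), lam=1.0))
# [ 2.  -0.   0.2 -1.5] -- small coefficients are zeroed, large ones shrunk
```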

SLIDE 7

Convex Relaxations for Sparse Signal Recovery

Desired problem: $\min \|\beta\|_0$ such that $\|X\beta - y\|_2 \le \epsilon$. Requires intractable combinatorial optimization.

Convex optimization surrogate: $\min \|\beta\|_1$ such that $\|X\beta - y\|_2 \le \epsilon$.

Substantial progress recently on theoretical justification
(Candès and Tao, Donoho, Tropp, Meinshausen and Bühlmann, Wainwright, Zhao and Yu, Fan and Peng, ...)
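As a sketch, the constrained $\ell_1$ surrogate can be written directly in cvxpy (an assumed dependency, not mentioned on the slides; the data here are synthetic):

```python
import numpy as np
import cvxpy as cp

# Synthetic sparse recovery instance: n = 50 measurements of a
# 200-dimensional vector with 5 nonzero entries.
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:5] = 1.0
y = X @ beta_true + 0.01 * rng.normal(size=n)

# min ||beta||_1  subject to  ||X beta - y||_2 <= eps
beta = cp.Variable(d)
problem = cp.Problem(cp.Minimize(cp.norm1(beta)),
                     [cp.norm2(X @ beta - y) <= 0.5])
problem.solve()
print("nonzero coefficients recovered:", int(np.sum(np.abs(beta.value) > 1e-3)))
```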

SLIDE 8

Nonparametric Regression

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \mathbb{R}$, $X_i = (X_{1i}, \ldots, X_{di})^T \in \mathbb{R}^d$,
$$Y_i = m(X_{1i}, \ldots, X_{di}) + \epsilon_i, \quad \mathbb{E}(\epsilon_i) = 0$$

Risk: $R(m, \hat{m}) = \int \mathbb{E}\big(\hat{m}(x) - m(x)\big)^2 \, dx$

Minimax theorem:
$$\inf_{\hat{m}} \sup_{m \in \mathcal{F}} R(m, \hat{m}) \asymp \left(\frac{1}{n}\right)^{4/(4+d)}$$
where $\mathcal{F}$ is a class of functions with 2 smooth derivatives. Note the curse of dimensionality.

SLIDE 9

The Curse of Dimensionality

(Sobolev space of order 2)

[Figures: (left) risk versus sample size; (right) sample size required to achieve risk 0.01 as the dimension grows from 10 to 20, reaching roughly $10^{12}$ at $d = 20$.]
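Inverting the minimax rate gives the sample size required for a target risk, $n \approx \mathrm{risk}^{-(4+d)/4}$; a short check reproduces the scale of the figure:

```python
# Sample size needed so that n^{-4/(4+d)} = 0.01, i.e. n = 0.01^{-(4+d)/4}.
for d in [1, 5, 10, 20]:
    n = 0.01 ** (-(4 + d) / 4)
    print(f"d = {d:2d}: n ~ {n:.1e}")
# d = 20 requires on the order of 1e12 samples, matching the figure.
```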

SLIDE 10

Nonparametric Sparsity

  • In many applications, it is reasonable to expect the true function depends on only a small number of variables
  • Assume $m(x) = m(x_R)$, where $x_R = (x_j)_{j \in R}$ are the relevant variables, with $|R| = r \ll d$
  • Can hope to achieve the better minimax rate $n^{-4/(4+r)}$
  • Challenge: Variable selection in nonparametric regression
SLIDE 11

Rodeo: Regularization of derivative expectation operator

  • A general strategy for nonparametric estimation: regularize derivatives of the estimator with respect to smoothing parameters
  • A simple new algorithm for simultaneous bandwidth and variable selection in nonparametric regression
  • Theoretical analysis: the algorithm correctly determines the relevant variables, with high probability, and achieves the (near) optimal minimax rate of convergence
  • Examples showing performance consistent with theory

SLIDE 12

Key Idea in Rodeo: Change of Representation

$$F(h) = F(0) + \int_0^h F'(x) \, dx$$

SLIDE 13

Rodeo: The Main Idea

  • Use a nonparametric estimator based on a kernel
  • Start with large bandwidths in each dimension, for an estimate having small variance but high bias
  • Choosing a large bandwidth is like ignoring a variable
  • Compute the derivatives of the estimate with respect to bandwidth
  • Threshold the derivatives to get a sparse estimate
  • Intuition: If a variable is irrelevant, then changing the bandwidth in that dimension should result in only a small change in the estimator

SLIDE 14

Rodeo: The Main Idea

[Figure: bandwidth paths in the $(h_1, h_2)$ plane, from the starting bandwidths to the optimal bandwidth; the Rodeo path tracks the ideal path.]

SLIDE 15

Using Local Linear Smoothing

The estimator can be written as
$$\hat{m}_h(x) = \sum_{i=1}^{n} G(X_i, x, h)\, Y_i$$

Our method is based on the statistic
$$Z_j = \frac{\partial \hat{m}_h(x)}{\partial h_j} = \sum_{i=1}^{n} G_j(X_i, x, h)\, Y_i$$

The estimated variance is
$$s_j^2 = \mathrm{Var}(Z_j \mid X_1, \ldots, X_n) = \sigma^2 \sum_{i=1}^{n} G_j^2(X_i, x, h)$$

SLIDE 16

Rodeo: Hard Thresholding Version

1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0$.

2. Initialize the bandwidths and activate all covariates:
   (a) $h_j = h_0$, $j = 1, 2, \ldots, d$
   (b) $A = \{1, 2, \ldots, d\}$

3. While $A$ is nonempty, do for each $j \in A$:
   (a) Compute the estimated derivative expectation: $Z_j$ and $s_j$
   (b) Compute the threshold $\lambda_j = s_j \sqrt{2 \log n}$
   (c) If $|Z_j| > \lambda_j$, set $h_j \leftarrow \beta h_j$; otherwise remove $j$ from $A$

4. Output bandwidths $h = (h_1, \ldots, h_d)$ and estimator $\tilde{m}(x) = \hat{m}_h(x)$
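A minimal Python sketch of these steps, with deliberate simplifications relative to the slides: a Nadaraya-Watson estimator stands in for local linear smoothing, the derivative weights $G_j$ are approximated by finite differences rather than computed in closed form, and the noise level $\sigma$ is assumed known.

```python
import numpy as np

def nw_weights(x, X, h):
    """Effective kernel weights G(X_i, x, h) of a Nadaraya-Watson
    estimator with a Gaussian product kernel and bandwidths h."""
    w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    return w / np.sum(w)

def rodeo(x, X, Y, beta=0.9, h0=1.0, sigma=1.0, delta=1e-4, max_passes=100):
    """Hard-thresholding rodeo at a single point x (sketch)."""
    n, d = X.shape
    h = np.full(d, h0)                  # step 2(a): h_j = h0
    active = list(range(d))             # step 2(b): A = {1, ..., d}
    for _ in range(max_passes):         # step 3 (with a safety cap)
        if not active:
            break
        for j in list(active):
            G = nw_weights(x, X, h)
            h_pert = h.copy()
            h_pert[j] += delta
            Gj = (nw_weights(x, X, h_pert) - G) / delta  # ~ dG/dh_j
            Zj = Gj @ Y                 # step 3(a): derivative statistic Z_j
            sj = sigma * np.sqrt(np.sum(Gj ** 2))        # conditional sd of Z_j
            if abs(Zj) > sj * np.sqrt(2 * np.log(n)):    # steps 3(b)-(c)
                h[j] *= beta            # shrink: variable looks relevant
            else:
                active.remove(j)        # freeze: variable looks irrelevant
    return h, nw_weights(x, X, h) @ Y   # step 4: bandwidths and estimate

# Example in the spirit of the next slide: two relevant variables out of five.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
Y = 2 * (X[:, 0] + 1) ** 3 + 2 * np.sin(10 * X[:, 1]) + rng.normal(size=500)
h, m_hat = rodeo(np.full(5, 0.5), X, Y)
print(h)  # bandwidths for the first two coordinates should come out smaller
```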

SLIDE 17

Example: $m(x) = 2(x_1 + 1)^3 + 2\sin(10 x_2)$, $d = 20$

[Figures: bandwidth versus Rodeo step for variables 1-20, shown as an average over 50 runs and as a typical run.]

SLIDE 18

Loss with $r = 2$, Increasing Dimension

[Figure: loss as the dimension increases from 5 to 30, comparing leave-one-out cross-validation with the Rodeo.]

SLIDE 19

Main Result: Near Optimal Rates

Theorem. Suppose that $d = O(\log n / \log\log n)$, $h_0 = 1/\log\log n$, and $|m_{jj}(x)| > 0$. Then the rodeo outputs bandwidths $h$ that satisfy
$$P\big(h_j = h_0 \text{ for all } j > r\big) \to 1$$
and for every $\epsilon > 0$,
$$P\big(n^{-1/(4+r)-\epsilon} \le h_j \le n^{-1/(4+r)+\epsilon} \text{ for all } j \le r\big) \to 1.$$

Let $T_n$ be the stopping time of the algorithm. Then $P(t_L \le T_n \le t_U) \to 1$ where
$$t_L = \frac{1}{(r+4)\log(1/\beta)} \log\!\left(\frac{n A_{\min}}{\log n \,(\log\log n)^d}\right), \qquad
t_U = \frac{1}{(r+4)\log(1/\beta)} \log\!\left(\frac{n A_{\max}}{\log n \,(\log\log n)^d}\right)$$

SLIDE 20

Greedy Rodeo and LARS

  • Rodeo can be viewed as a nonparametric version of least angle regression (LARS) (Efron et al., 2004)
  • In forward stagewise, variable selection is incremental. LARS adds the variable most correlated with the residuals of the current fit.
  • For the Rodeo, the derivative is essentially the correlation between the output and the derivative of the effective kernel
  • Reducing the bandwidth is like adding more of that variable

SLIDE 21

LARS Regularization Paths

[Figure: standardized coefficients versus $|\beta|/\max|\beta|$ along the LARS path.]

SLIDE 22

Greedy Rodeo Bandwidth Paths

[Figure: bandwidth versus greedy Rodeo step; paths shown for variables 1, 2, 3, 4, 7, 8, 9.]

Rodeo order: 3 (body mass index), 9 (serum), 7 (serum), 4 (blood pressure), 1 (age), 2 (sex), 8 (serum), 5 (serum), 10 (serum), 6 (serum). LARS order: 3, 9, 4, 7, 2, 10, 5, 8, 6, 1.

SLIDE 23

Extensions

  • Sparse density estimation
  • Local polynomial estimation
  • Classification using Rodeo with generalized linear models
  • Other nonparametric estimators
  • Data-adaptive basis pursuit


SLIDE 24

Combining Rodeo and Lasso: Data-Adaptive Basis Pursuit

(with Han Liu)

[Figures: the true regression line and the fitted curve from a data-adaptive basis with J = 36, both plotted over $x \in [0, 1]$.]

SLIDE 25

Data-Adaptive Basis Pursuit

  • Recall the idea of the Rodeo:
$$\tilde{m}(x) = m_1(x) - \int_0^1 \big\langle Z(x, h(s)),\, \dot{h}(s) \big\rangle \, ds$$
  • Let $\Phi(X_i) = \mathrm{vec}\big(Z(X_i, h(s_k)) \cdot dh(s_k)\big)$ over a grid of bandwidths
  • Run the Lasso:
$$\min_\beta \; \|Y - \Phi(X)\beta\|^2 \text{ such that } \|\beta\|_1 \le t$$
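A hypothetical sketch of this pipeline in Python: increments of a one-dimensional kernel smoother along a geometric bandwidth grid stand in for the $Z(X_i, h(s_k)) \cdot dh(s_k)$ terms, and the lasso (in penalized rather than constrained form) selects among the resulting basis columns. The `kernel_smooth` helper and all settings are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

def kernel_smooth(x_eval, X, Y, h):
    """One-dimensional Nadaraya-Watson smoother with bandwidth h."""
    w = np.exp(-0.5 * ((X[None, :] - x_eval[:, None]) / h) ** 2)
    return (w @ Y) / w.sum(axis=1)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(size=n)
Y = 0.3 * np.sin(8 * X) + 0.05 * rng.normal(size=n)

# Basis: increments of the smoother across a geometric bandwidth grid,
# giving J = 36 columns (as on the earlier slide).
hs = 0.5 * 0.9 ** np.arange(37)
fits = np.stack([kernel_smooth(X, X, Y, h) for h in hs], axis=1)
Phi = np.diff(fits, axis=1)

# Lasso step: min ||Y - Phi beta||^2 plus an l1 penalty.
model = Lasso(alpha=1e-4, max_iter=50_000).fit(Phi, Y)
print("bases selected:", np.flatnonzero(model.coef_ != 0))
```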

SLIDE 26

Data-Adaptive Basis Pursuit

[Figure: six data-adaptive basis functions (bases 6, 9, 15, 18, 27, 30) plotted against $x$.]
SLIDE 27

Summary

  • Sparsity is playing an increasingly important role in statistics and machine learning
  • In order to be “learnable,” there must be lower-dimensional structure
  • Nonparametric sparsity: many open problems
  • Rodeo: conceptually simple and practical, with theoretically nice properties