The Search For Structure, or: The Relationship Between Structure and Prediction


SLIDE 1

The Search For Structure


or: The Relationship Between Structure and Prediction

June 2012

Larry Wasserman, Department of Statistics and Machine Learning Department, Carnegie Mellon University

SLIDE 2

The Search For Structure

Searching for structure
  ⇓
choose tuning parameters for structure finding
  ⇓
convert structure finding into prediction
  ⇓
conformal inference (distribution-free prediction)

SLIDE 3

The Three Lectures

  • 1. The Search For Structure. (Today).
  • 2. Manifolds and Filaments.
  • 3. Undirected Graphs.

SLIDE 4

Collaborators

  • Xi Chen
  • Chris Genovese
  • Haijie Gu
  • Anupam Gupta
  • John Lafferty
  • Jing Lei
  • Han Liu
  • Pradeep Ravikumar
  • Marco Perone-Pacifico
  • Isabella Verdinelli
  • Min Xu
  • Aarti Singh
  • Martin Azizyan
  • Sivaraman Balakrishnan
  • Don Sheehy
  • Mladen Kolar
  • Alessandro Rinaldo
  • And ...

SLIDE 5

Outline

  • 1. Prediction is easy, finding structure is hard.
  • 2. Examples.
  • 3. Using prediction to find structure: (minimax) conformal prediction.
  • (4. Using structure to help with prediction: (minimax) semisupervised inference.)

SLIDE 6

The Three Eras of Statistics and Machine Learning

  • 1. PALEOZOIC: parameter estimation
      (a) MLE
      (b) confidence intervals, etc.
  • 2. MESOZOIC: prediction
      (a) classification
      (b) regression
      (c) SVM, etc.
  • 3. CENOZOIC: the search for structure
      (a) graphical models
      (b) manifolds
      (c) matrix factorization

SLIDE 7

Prediction is “Easy.” Example 1: Nonparametric Regression

Let (X1, Y1), . . . , (X2n, Y2n) ∼ P. Split the data into a training half and a test half. Let {m̂_h : h ∈ H} be estimates of m(x) = E(Y | X = x) from the training data. Choose ĥ to minimize

    (1/n) Σ_{i ∈ test} (Yi − m̂_h(Xi))².

Then∗

    Risk(m̂_ĥ) ≤ c1 Risk(m̂∗) + c2 log|H| / n.

∗See Györfi et al., for example.
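
A minimal sketch (not from the slides) of bandwidth selection by data splitting, using a Nadaraya-Watson estimator. The Gaussian kernel, the grid H, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def nw_estimate(x, X_train, Y_train, h):
    """Nadaraya-Watson estimate of m(x) = E(Y | X = x) with bandwidth h."""
    w = np.exp(-0.5 * ((x - X_train) / h) ** 2)
    return np.sum(w * Y_train) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)

X_tr, Y_tr = X[:100], Y[:100]        # training half
X_te, Y_te = X[100:], Y[100:]        # test half

H = [0.02, 0.05, 0.1, 0.2, 0.5]      # finite bandwidth grid
test_err = [np.mean([(y - nw_estimate(x, X_tr, Y_tr, h)) ** 2
                     for x, y in zip(X_te, Y_te)]) for h in H]
h_hat = H[int(np.argmin(test_err))]  # minimize test error over H
print("selected bandwidth:", h_hat)
```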

SLIDE 8

Prediction is “Easy.” Example 2: The Lasso

Let (X1, Y1), . . . , (Xn, Yn) ∼ P where Xi ∈ R^d. Let β̂ minimize

    Σ_{i=1}^n (Yi − Xiᵀβ)²   subject to ||β||1 ≤ L   (the lasso).

Then, w.h.p.,∗

    R(β̂) ≤ R(β∗) + O( √(L⁴ log d / n) ),

where β∗ minimizes Risk(β) subject to ||β||1 ≤ L. Choose L by cross-validation.

∗See Greenshtein and Ritov 2004.
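
A minimal sketch of the cross-validation step, assuming scikit-learn is available. LassoCV tunes the Lagrangian penalty alpha rather than the constraint radius L, but the two parameterizations trace out the same solution path. The sparse synthetic data are an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:5] = 2.0                               # sparse truth
y = X @ beta + 0.5 * rng.standard_normal(n)

model = LassoCV(cv=5).fit(X, y)              # choose penalty by 5-fold CV
print("selected alpha:", model.alpha_)
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```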

SLIDE 9

Prediction is “Easy.” Example 3: SpAM (Sparse Additive Models∗)

    Y = m(X) + ǫ,   m(X) = Σ_{j=1}^d s_j(X_j).

Choose s1, . . . , sd to minimize

    Σ_{i=1}^n (Yi − Σ_j s_j(X_ij))²

subject to each s_j being smooth and Σ_j ||s_j|| ≤ L.

∗Ravikumar, Lafferty, Liu and Wasserman 2009

SLIDE 10

Prediction is “Easy.” Example 3: SpAM (continued)

Choose L by minimizing generalized cross-validation:

    GCV(L) = (RSS/n) / (1 − df(L)/n)².

If d ≤ e^{n^ξ} for ξ < 1, then

    Risk(m̂) − Risk(m∗) = O_P( (1/n)^{(1−ξ)/2} ).

SLIDE 11

Prediction

Prediction is easy because:

  • 1. The goal is clear.
  • 2. Tuning parameters can be selected by cross-validation, data-splitting, etc.

Important: The results on data-splitting give distribution-free guarantees. This is a goal we want to emulate.

SLIDE 12

Structure Finding

Examples:

  • clustering
  • curve clustering
  • manifolds
  • graphs
  • graph-valued regression

(Details about graphs and manifolds in lectures 2 and 3.) In this talk, we will show how prediction helps find structure.

SLIDE 13

Clustering

Despite many, many years of research and many, many papers, there does not seem to be a consensus on how to choose tuning parameters.

  • k-means: choose k.
  • Density-based clustering: choose bandwidth h.
  • Hierarchical clustering: choice of merging rule.
  • Spectral clustering: many parameters.

SLIDE 14

Clustering

Various suggestions include:

  • stability
  • hypothesis testing
  • information-theoretic
  • others

I’ll (tentatively) propose an alternative.

SLIDE 15

Example of Our Results: Distribution-Free Curve Clustering

[Figure: example of distribution-free curve clustering.]

SLIDE 16

Relating Structure to Prediction

Our approach (Lei, Rinaldo, Robins, Wasserman) is to convert a structure-finding problem into a prediction problem.

Example: Density estimation ⇒ conformal prediction. Conformal prediction is due to Vovk et al.

Rest of talk:

  • explain conformal prediction
  • minimax theory for conformal prediction (briefly)
  • using conformal prediction to guide structure finding

SLIDE 17

Conformal Inference

A theory of distribution-free prediction. See: Vovk, Gammerman and Shafer (2005) + many papers by Vovk and co-workers. (See also Phil Dawid’s work on prequential inference.) Our contribution: marrying conformal inference with traditional statistical theory (minimax theory) and extending some of the techniques:

Lei, Robins and Wasserman (arXiv:1111.1418)
Lei and Wasserman (arXiv:1203.5422)
Lei, Rinaldo and Wasserman (submitted to NIPS)
Lei, Robins and Wasserman (in progress)

SLIDE 18

(Batch) Conformal Prediction

Observe Y1, . . . , Yn ∼ P. Construct Cn ≡ Cn(Y1, . . . , Yn) such that

    P(Yn+1 ∈ Cn) ≥ 1 − α   for all P and all n.

Here, P ≡ P^{n+1}. See Vovk et al. for the general (sequential) theory. We are only concerned with the batch version. We will also be concerned with minimax optimality (efficiency).

SLIDE 19

(Batch) Conformal Prediction

  • 1. Observe Y1, . . . , Yn ∼ P where Yi ∈ R^d.
  • 2. Choose any fixed y ∈ R^d.
  • 3. Let aug(y) = (Y1, . . . , Yn, y).
  • 4. Compute conformity scores σ1(y), . . . , σn+1(y).
  • 5. Under H0 : Yn+1 = y, the ranks are uniform. The p-value is

        π(y) = (1/(n+1)) Σ_{i=1}^{n+1} I(σi(y) ≤ σn+1(y)).

  • 6. Invert the test (see the sketch below):

        Cn = {y : π(y) ≥ α}.
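
A minimal sketch of steps 1-6 for univariate data, using the simple conformity score σi = −|Yi − Ȳ(y)| defined on the next slide. The data and the grid of candidate y values are illustrative assumptions.

```python
import numpy as np

def conformal_pvalue(Y, y):
    """The p-value pi(y) for the candidate point y."""
    aug = np.append(Y, y)                  # aug(y) = (Y_1, ..., Y_n, y)
    sigma = -np.abs(aug - aug.mean())      # conformity scores
    # pi(y) = fraction of scores <= the candidate's score sigma_{n+1}
    return np.mean(sigma <= sigma[-1])

rng = np.random.default_rng(0)
Y = rng.standard_normal(100)
alpha = 0.1

grid = np.linspace(-4, 4, 801)
C_n = [y for y in grid if conformal_pvalue(Y, y) >= alpha]
print("C_n is approximately [%.2f, %.2f]" % (min(C_n), max(C_n)))
```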

SLIDE 20

Conformity Scores

Use aug(y) = (Y1, . . . , Yn, y) to construct a function g. Compute

    σi(y) = g(Yi) for i = 1, . . . , n,   σn+1(y) = g(y).

Example: σi = −|Yi − Ȳ(y)| where

    Ȳ(y) = (y + Σ_{i=1}^n Yi) / (n + 1).

∗In certain cases, we need to use σi = gi(Yi) where gi is built from aug(y) − {Yi}. More on this later.

SLIDE 21

(Batch) Conformal Prediction

When H0 : Yn+1 = y is true, the ranks of the σi’s are uniform. It follows that, for any P and any n,

    P(Yn+1 ∈ Cn) ≡ P^{n+1}(Yn+1 ∈ Cn) ≥ 1 − α.

This is true, finite-sample, distribution-free prediction. But what is the best conformity score?

SLIDE 22

Oracle

The best (smallest) prediction set, or oracle, is

    C∗ = {y : p(y) > λ},

where λ is such that P(C∗) = 1 − α. The form of C∗ suggests using an estimate p̂ of p to define a conformity score. And this leads to a method for level set density clustering.

SLIDE 23

Loss Function

Loss function: L(C) = µ(C∆C∗), where A∆B = (A ∩ B^c) ∪ (A^c ∩ B) and µ is Lebesgue measure.

Minimax risk:

    inf_{C ∈ Γn} sup_{P ∈ P} E_P[µ(C∆C∗)],

where Γn denotes all 1 − α prediction regions.

SLIDE 24

Kernel Conformity

Define the augmented kernel density estimator

    p̂ʸ_h(u) = (1/(n+1)) Σ_{i=1}^n (1/h^d) K(||u − Yi|| / h) + (1/(n+1)) (1/h^d) K(||u − y|| / h).

Let

    σi(y) = p̂ʸ_h(Yi),   σn+1(y) = p̂ʸ_h(y),

    π(y) = (1/(n+1)) Σ_{i=1}^{n+1} I(σi(y) ≤ σn+1(y)),   Cn = {y : π(y) ≥ α}.

Then P(Yn+1 ∈ Cn) ≥ 1 − α for all P and n.
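
A minimal one-dimensional sketch of kernel conformity (d = 1, Gaussian kernel). The bandwidth and the bimodal data are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def kernel_conformal_pvalue(Y, y, h):
    aug = np.append(Y, y)                        # aug(y)
    dists = np.abs(aug[:, None] - aug[None, :])  # (n+1) x (n+1)
    # sigma_i = augmented KDE p-hat^y_h evaluated at each augmented point
    sigma = gauss_kernel(dists / h).mean(axis=1) / h
    return np.mean(sigma <= sigma[-1])

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-2, 0.5, 60), rng.normal(2, 0.5, 60)])
alpha, h = 0.1, 0.4

grid = np.linspace(-5, 5, 501)
inside = np.array([kernel_conformal_pvalue(Y, y, h) >= alpha for y in grid])
# The connected pieces of C_n track the two density clusters.
print("fraction of the grid inside C_n:", inside.mean())
```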

SLIDE 25

Helpful Approximation

Cn is not a density level set. Also, it is expensive to compute. However, Cn ⊂ C⁺_n where

    C⁺_n = {y : p̂_h(y) > c_n},   c_n = p̂_h(Y_(nα)) − K(0)/(n h^d),

and Y_(1), Y_(2), · · · are ordered so that p̂_h(Y_(1)) ≥ p̂_h(Y_(2)) ≥ · · · . The set C⁺_n involves no augmentation but still satisfies P(Yn+1 ∈ C⁺_n) ≥ 1 − α. Its connected components are the density clusters.
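
A minimal sketch of the plug-in set C⁺_n for d = 1, with a Gaussian kernel; the bandwidth, level α, and data are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def kde(u, Y, h):
    """Ordinary (unaugmented) kernel density estimate p-hat_h(u)."""
    return gauss_kernel((u - Y[:, None]) / h).mean(axis=0) / h

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-2, 0.5, 60), rng.normal(2, 0.5, 60)])
n, h, alpha = len(Y), 0.4, 0.1

p_at_data = np.sort(kde(Y, Y, h))[::-1]        # p(Y_(1)) >= p(Y_(2)) >= ...
c_n = p_at_data[int(n * alpha) - 1] - gauss_kernel(0.0) / (n * h)

grid = np.linspace(-5, 5, 501)
in_set = kde(grid, Y, h) > c_n                 # C+_n on the grid
# Connected runs of True are the density clusters.
print("threshold c_n = %.4f" % c_n)
```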

SLIDE 26

Optimality

Assuming Hölder-β smoothness, if h_n ≍ (log n / n)^{1/(2β+d)} then (with high probability)

    µ(Cn ∆ C∗) ≲ (log n / n)^{β/(2β+d)}.

The same holds for C⁺_n. This rate is minimax optimal: w.h.p.

    inf_C sup_{P ∈ P} L(C) ≳ (log n / n)^{β/(2β+d)},

where the infimum is over all level 1 − α prediction sets.

Note: the minimax result requires smoothness assumptions; the finite-sample distribution-free guarantee does not.

Note: the rate for the alternative loss L(C) = µ(C) − µ(C∗) is faster.

SLIDE 27

Data-Driven Bandwidth

Each bandwidth h yields a conformal prediction region Cn,h. Choose h to minimize µ(Cn,h). (With some adjustments, this still has finite-sample validity.)

[Figure: Lebesgue measure of the conformal set versus log2(h/hn), with curves for µ(Ĉ), µ(C̃−), and µ(C̃+).]
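
A minimal sketch of the bandwidth search, reusing the kernel conformal p-value from Slide 24: approximate µ(Cn,h) on a grid of y values for each h and keep the minimizer. The bandwidth grid and data are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def kernel_conformal_pvalue(Y, y, h):
    aug = np.append(Y, y)
    sigma = gauss_kernel(np.abs(aug[:, None] - aug[None, :]) / h).mean(axis=1) / h
    return np.mean(sigma <= sigma[-1])

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-2, 0.5, 60), rng.normal(2, 0.5, 60)])
alpha = 0.1

grid = np.linspace(-5, 5, 401)
dy = grid[1] - grid[0]
measure = {}
for h in [0.1, 0.2, 0.4, 0.8, 1.6]:
    inside = [kernel_conformal_pvalue(Y, y, h) >= alpha for y in grid]
    measure[h] = np.sum(inside) * dy    # Lebesgue measure of C_{n,h}

h_star = min(measure, key=measure.get)
print("bandwidth minimizing mu(C_{n,h}):", h_star)
```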

SLIDE 28

[Figure: conformal density-cluster sets at several bandwidths, with a panel of Lebesgue measure versus bandwidth.]

SLIDE 29

Level Set Clustering

To summarize so far:

  • choose tuning parameters by minimizing the size of the conformal prediction region
  • this leads to optimized density clusters
  • and the resulting set has a finite-sample prediction property

SLIDE 30

2d Example

[Figure: two panels in the (y(1), y(2)) plane. Left: optimal set, outer bound, inner bound, conformal set, and data points. Right: data points inside and outside the region, with the convex hull of the data points in the region.]

Left: conformal. Right: Data-depth method (Li and Liu 2008). The conformal method is 1,000 times faster.

SLIDE 31

Singular Measures

[Figure: conformal prediction sets across bandwidths for data from a singular measure.]

But cross-validation chooses h = 0.

SLIDE 32

Conformal k-means?

Can we use conformal prediction to choose k? Yes, but ...

Let c1, . . . , ck minimize

    Rn = (1/n) Σ_{i=1}^n min_j ||Yi − cj||².

Similarly, let c1(y), . . . , ck(y) denote the augmented centers and let

    σi(y) = min_j ||Yi − cj(y)||² ≡ g(Yi).

THEOREM: µ(Cn) = ∞.

SLIDE 33

This can be fixed using g built from aug(y) − {Yi}. However, this is computationally expensive. Instead: the Split-Conformal Method.

  • 1. Split the data into D1 and D2.
  • 2. Compute g from D1.
  • 3. Let c be the α quantile of the values g(Yi), Yi ∈ D2.
  • 4. Let Cn = {y : g(y) > c}.

Then P(Yn+1 ∈ Cn) ≥ 1 − α. When applied to k-means, µ(Cn) < ∞. We can choose k to minimize µ(Cn); a minimal sketch follows.
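
A minimal sketch of split-conformal k-means in one dimension, assuming SciPy's kmeans2 for the centers. The score g(y) = −min_j ||y − cj||², the candidate values of k, and the grid used to approximate Lebesgue measure are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-3, 0.5, 100), rng.normal(3, 0.5, 100)])
rng.shuffle(Y)
D1, D2 = Y[:100], Y[100:]              # step 1: split
alpha = 0.1

grid = np.linspace(-6, 6, 601)
dy = grid[1] - grid[0]

for k in [1, 2, 3, 4]:
    centers, _ = kmeans2(D1.reshape(-1, 1), k, minit="++", seed=0)  # step 2
    g = lambda y: -np.min((np.asarray(y).reshape(-1, 1)
                           - centers.ravel()) ** 2, axis=1)
    c = np.quantile(g(D2), alpha)      # step 3: alpha quantile on D2
    mu = np.sum(g(grid) > c) * dy      # step 4: C_n = {y : g(y) > c}
    print("k = %d, mu(C_n) = %.2f" % (k, mu))
```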

SLIDE 34

[Figure: histogram of the data x; conformal sets for several values of k; and Lebesgue measure versus k.]

SLIDE 35

Function Mining

Given functions Y1(·), . . . , Yn(·):

  • project into a lower-dimensional space (data-dependent basis)
  • apply the (modified) conformal prediction to the estimated density of the coefficients
  • go back to function space (lots of detail omitted here, including quadratic programming)

This leads to conformal prediction bands:

SLIDE 36

[Figure, four panels: (a) Neuron, principal components (PC1 vs PC2); (b) Neuron, 90% prediction bands for the projection (response vs time); (c) Phoneme, principal components (PC1 vs PC2); (d) Phoneme, 90% prediction bands for the projection (response vs time).]

SLIDE 37

Summary So Far

By optimizing conformal inference we can choose tuning parameters for clustering. The theory is still in progress. We can get curve clusters in the form of prediction bands with finite-sample, distribution-free prediction guarantees.

Let's consider yet another structure-finding problem: undirected graphs.

SLIDE 38

Tuning Parameters for Undirected Graphs

(Very preliminary.) X1, . . . , Xn ∼ N(µ, Σ). Glasso (graphical lasso): minimize

    −loglik(Σ) + λ Σ_{j≠k} |Σ⁻¹_jk|.

The non-zeros of Σ⁻¹ give the edges of the graph. Conformal score: φ(Yi; µ̂, Σ̂), the fitted Gaussian density. (Valid even if the data are not really Gaussian.) Choose λ to minimize the volume of Cn.
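
A minimal two-dimensional sketch, assuming scikit-learn's GraphicalLasso and a split-conformal shortcut (the full conformal construction would re-fit for every candidate point): score each point by the fitted Gaussian density, threshold at the α quantile on held-out data, and approximate the volume of Cn on a grid. The data and the λ grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200)
D1, D2 = X[:100], X[100:]
alpha = 0.1

xs = np.linspace(-4, 4, 121)                 # grid over R^2 for the volume
xx, yy = np.meshgrid(xs, xs)
grid = np.column_stack([xx.ravel(), yy.ravel()])
cell = (xs[1] - xs[0]) ** 2

for lam in [0.01, 0.05, 0.1, 0.5]:
    fit = GraphicalLasso(alpha=lam).fit(D1)          # glasso at penalty lam
    phi = multivariate_normal(D1.mean(axis=0), fit.covariance_)
    c = np.quantile(phi.pdf(D2), alpha)              # threshold on D2
    volume = np.sum(phi.pdf(grid) > c) * cell        # volume of C_n
    print("lambda = %.2f, volume = %.2f" % (lam, volume))
```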

SLIDE 39

[Figure: a grid of panels with axes from −1.0 to 1.0.]

SLIDE 40

[Figure: V, the volume of the conformal set, versus log(Lambda).]

SLIDE 41

Structure ⇒ Prediction

We have seen that predictive thinking helps find structure. But the reverse is also true: finding structure can help with prediction.

Familiar example: Semisupervised Learning. But does it provably help? Specifically, in the minimax sense. (With Martin Azizyan and Aarti Singh.)

SLIDE 42

Structure Helping Prediction: Semisupervised Learning

Labeled data: Ln = {(X1, Y1), . . . , (Xn, Yn)}. Unlabeled data: UN = {Xn+1, . . . , XN} with N > n.

The usual intuition: UN helps us estimate p ≡ pX well. Assume that m(x) = E(Y | X = x) is related to p(x). For example, the cluster assumption: m is very smooth where p(x) has clusters. (The clusters might be manifolds.) Then Ln and p help us estimate m(x):

    UN ⇒ p ⇒ m ⇐ Ln

The structure of p helps predict Y.

SLIDE 43

Semisupervised

But does it provably improve inference? Is it true that

    En = [ inf_{m̂ ∈ SS_N} sup_{P ∈ P_n} R(m̂) ] / [ inf_{m̂ ∈ S_n} sup_{P ∈ P_n} R(m̂) ] → 0

as n → ∞? Here R(m) = E(Y − m(X))², SS_N denotes all semisupervised estimators, and S_n denotes all supervised estimators.

SLIDE 44

Semisupervised

Azizyan, Singh and Wasserman (2012): under fairly complicated conditions, En → 0 as n → ∞. The conditions require that:

  • n → ∞, N → ∞, n/N → 0
  • m is smooth “relative to p”
  • p is highly concentrated (around lower-dimensional sets)

Without these conditions, it appears that structure does not help, at least in the minimax sense. (See also Singh, Nowak and Zhu (2008) and Niyogi (2008).) In fact, we can adapt to the structure, as follows.

SLIDE 45

Adaptive Semisupervised

Let

    Dα(x, y) = inf_γ ∫ ds / p^α(γ(s)),

where the infimum is over all paths γ connecting x and y. Assume that |m(x) − m(y)| ≤ L Dα(x, y). The improvement over supervised learning depends on α, and we can use the labeled data to learn α (adapt to α).

SLIDE 46

Adaptive Semisupervised

Let

    m̂_{α,h}(x) = Σ_{i=1}^n Yi K(D̂α(x, Xi)/h) / Σ_{i=1}^n K(D̂α(x, Xi)/h),

with D̂α estimated from the unlabelled data. If we choose α and h by data-splitting (cross-validation) then, w.h.p.,

    Risk(m̂_{α̂,ĥ}) ≤ inf_{α,h} Risk(m̂_{α,h}) + O( √(log n / n) ).

We have minimaxity and adaptivity: a successful use of structure. (A minimal sketch follows.)
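
A rough sketch of the density-sensitive estimator: D̂α is approximated by shortest paths on a k-NN graph whose edges are weighted by length / p̂^α at the edge midpoint, with p̂ a crude kernel density estimate from all the X's. The graph construction, neighborhood size k, kernel, and data are all illustrative assumptions, not the paper's exact recipe; α and h would be chosen by data-splitting as on the slide.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X_lab = rng.uniform(-3, 3, (30, 1))          # labeled X's
Y_lab = np.sin(X_lab[:, 0])                  # labels
X_unl = rng.uniform(-3, 3, (300, 1))         # unlabeled X's
X_all = np.vstack([X_lab, X_unl])

def kde(pts, X, h=0.5):
    """Crude Gaussian KDE used to weight edges."""
    return np.exp(-0.5 * (cdist(pts, X) / h) ** 2).mean(axis=1) / h

def D_alpha_matrix(X_all, alpha, k=10):
    D = cdist(X_all, X_all)
    W = np.full_like(D, np.inf)               # inf = no edge
    for i in range(len(X_all)):
        for j in np.argsort(D[i])[1:k + 1]:   # k nearest neighbors
            mid = (X_all[i] + X_all[j]) / 2.0
            w = D[i, j] / kde(mid[None, :], X_all)[0] ** alpha
            W[i, j] = W[j, i] = w
    return shortest_path(W, method="D")       # graph geodesics ~ D_alpha

def m_hat(x_index, alpha, h):
    G = D_alpha_matrix(X_all, alpha)
    w = np.exp(-0.5 * (G[x_index, :len(X_lab)] / h) ** 2)
    return np.sum(w * Y_lab) / np.sum(w)

# Evaluate at the first unlabeled point for one (alpha, h) pair.
print(m_hat(len(X_lab), alpha=1.0, h=1.0))
```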

SLIDE 47

Conclusion

  • Finding structure in data is challenging.
  • Predictive ideas might be helpful for structure finding, and vice versa.
  • In the remaining lectures I'll discuss two structure-finding problems in detail: manifold estimation and high-dimensional undirected graphs.

SLIDE 48

THE END
