The Search For Structure
or The Relationship Between Structure and Prediction
June 2012
Larry Wasserman Dept of Statistics and Machine Learning Department Carnegie Mellon University
1
Searching For Structure
⇓
choose tuning parameters for structure finding
⇓
converting structure finding into prediction
⇓
conformal inference (distribution free prediction)
2
The Three Lectures
3
Collaborators
4
Outline
3. Converting structure finding into prediction. (4. Using structure to help with prediction: (minimax) semisupervised inference.)
5
The Three Eras of Statistics and Machine Learning
1. (a) mle (b) confidence intervals, etc.
2. (a) classification (b) regression (c) SVM, etc.
3. (a) graphical models (b) manifolds (c) matrix factorization
6
Prediction is “Easy.” Example 1: Nonparametric Regression
Let (X1, Y1), . . . , (X2n, Y2n) ∼ P. Split the data into training and test halves.
Let {m̂h : h ∈ H} be estimates of m(x) = E(Y | X = x) from the training data.
Choose ĥ to minimize (1/n) Σ_i (Yi − m̂h(Xi))² over the test half. Then∗
Risk(m̂ĥ) ≤ c1 Risk(m̂*) + c2 log|H| / n.
∗See Gyorfi et al, for example.
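A minimal numpy sketch of this data-splitting step (the Nadaraya-Watson smoother, the simulated data, and the bandwidth grid below are illustrative, not from the talk):

import numpy as np

def nw_estimate(x, X_train, Y_train, h):
    # Nadaraya-Watson estimate of m(x) with a Gaussian kernel and bandwidth h
    w = np.exp(-0.5 * ((x - X_train[:, None]) / h) ** 2)
    return (w * Y_train[:, None]).sum(axis=0) / w.sum(axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)

# split into training and test halves
X_tr, Y_tr, X_te, Y_te = X[:100], Y[:100], X[100:], Y[100:]

H = [0.02, 0.05, 0.1, 0.2, 0.5]      # candidate bandwidths
test_err = [np.mean((Y_te - nw_estimate(X_te, X_tr, Y_tr, h)) ** 2) for h in H]
h_hat = H[int(np.argmin(test_err))]  # bandwidth chosen by minimizing the test error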
7
Prediction is “Easy.” Example 2: The Lasso
Let (X1, Y1), . . . , (Xn, Yn) ∼ P where Xi ∈ R^d. Let β̂ minimize
Σ_{i=1}^n (Yi − Xiᵀβ)² s.t. ||β||₁ ≤ L (the lasso).
Then, w.h.p.∗
R(β̂) ≤ R(β*) + O(√(log d / n)),
where β* minimizes Risk(β) subject to ||β||₁ ≤ L. Choose L by cross-validation.
∗See Greenshtein and Ritov 2004
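A short sketch using scikit-learn (an assumption; the talk does not prescribe software). LassoCV solves the penalized (Lagrangian) form of the lasso, so cross-validating the penalty plays the same role as choosing the constraint level L:

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:5] = 2.0                      # sparse true coefficient vector (illustrative)
y = X @ beta + rng.standard_normal(n)

fit = LassoCV(cv=5).fit(X, y)       # penalty chosen by cross-validation
print(fit.alpha_, np.sum(fit.coef_ != 0))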
8
Prediction is “Easy.” Example 3: SpAM. Sparse Additive Models∗
Y = m(X) + ε where m(x) = Σ_{j=1}^d sj(xj).
Choose s1, . . . , sd to minimize
Σ_{i=1}^n (Yi − Σ_j sj(Xij))²
subject to each sj being smooth and Σ_j ||sj|| ≤ L.
∗Ravikumar, Lafferty, Liu and Wasserman 2009
9
Prediction is “Easy.” Example 3: SpAM.
Choose L by minimizing generalized cross-validation:
GCV(L) = (RSS/n) / (1 − df(L)/n)².
If d ≤ e^{n^ξ} for ξ < 1 then
Risk(m̂) − Risk(m*) = OP( (1/n)^{(1−ξ)/2} ).
10
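A small helper for this GCV criterion; fit_spam is a hypothetical routine assumed to return the residual sum of squares and effective degrees of freedom of the SpAM fit at budget L:

import numpy as np

def gcv_score(rss, df, n):
    # generalized cross-validation: (RSS/n) / (1 - df/n)^2
    return (rss / n) / (1.0 - df / n) ** 2

def choose_L(L_grid, fit_spam, n):
    # fit_spam(L) -> (rss, df) is assumed, not part of the talk
    scores = [gcv_score(*fit_spam(L), n) for L in L_grid]
    return L_grid[int(np.argmin(scores))]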
Prediction
Prediction is easy because the risk can be estimated directly from the data, by cross-validation, data splitting, etc.
Important: The results on data-splitting give distribution-free guarantees.
11
Structure Finding
Examples: clustering, undirected graphs, manifolds.
(Details about graphs and manifolds in lectures 2 and 3.)
In this talk, we will show how prediction helps find structure.
12
Clustering Despite many, many years of research and many, many papers, there does not seem to be a consensus on how to choose tuning parameters.
13
Clustering Various suggestions include:
I’ll (tentatively) propose an alternative.
14
Example of Our Results: Distribution Free Curve Clustering
15
Relating Structure to Prediction
Our approach (Lei, Rinaldo, Robins, Wasserman) is to convert a structure-finding problem into a prediction problem.
Example: density estimation ⟹ conformal prediction.
Conformal prediction is due to Vovk et al. Rest of talk:
16
Conformal Inference A theory of distribution free prediction. See: Vovk, Gammerman and Shafer (2005) + many papers by Vovk and co-workers. (See also Phil Dawid’s work on prequential inference.) Our contribution: marrying conformal inference with traditional statistical theory (minimax theory) and extending some of the techniques:
Lei, Robins and Wasserman (arXiv:1111.1418)
Lei, Wasserman (arXiv:1203.5422)
Lei, Rinaldo, Wasserman (submitted to NIPS)
Lei, Robins and Wasserman (in progress)
17
(Batch) Conformal Prediction
Observe Y1, . . . , Yn ∼ P. Construct Cn ≡ Cn(Y1, . . . , Yn) such that
P(Yn+1 ∈ Cn) ≥ 1 − α for all P and all n.
Here, P ≡ P^(n+1). See Vovk et al for the general (sequential) theory. We are only concerned with the batch version. We will also be concerned with minimax optimality (efficiency).
18
(Batch) Conformal Prediction
π(y) = (1/(n+1)) Σ_{i=1}^{n+1} I(σi(y) ≤ σn+1(y)),
Cn = {y : π(y) ≥ α}.
19
Conformity Scores
Use aug(y) = (Y1, . . . , Yn, y) to construct a function g. Compute
σi(y) = g(Yi) for i = 1, . . . , n, and σn+1(y) = g(y).
Example: σi = −|Yi − Ȳ(y)| where Ȳ(y) = (y + Σ_{i=1}^n Yi) / (n + 1).
* In certain cases, we need to use σi = gi(Yi) where gi is built from aug(y) − {Yi}. More on this later.
20
(Batch) Conformal Prediction
When H0 : Yn+1 = y is true, the ranks of the σi's are uniform. It follows that, for any P and any n,
P(Yn+1 ∈ Cn) ≡ P^(n+1)(Yn+1 ∈ Cn) ≥ 1 − α.
This is true, finite sample, distribution-free prediction. But what is the best conformity score?
21
Oracle
Best (smallest) prediction set, or Oracle: C* = {y : p(y) > λ} where λ is such that P(C*) = 1 − α.
The form of C* suggests using an estimate p̂ of p to define a conformity score. And this leads to a method for density level set clustering.
22
Loss Function
Loss function: L(C) = µ(C ∆ C*) where A ∆ B = (A ∩ B^c) ∪ (A^c ∩ B) and µ is Lebesgue measure.
Minimax risk:
inf_{C∈Γn} sup_{P∈P} EP[µ(C ∆ C*)]
where Γn denotes all 1 − α prediction regions.
23
Kernel Conformity
Define the augmented kernel density estimator
p̂_h^y(u) = (1/((n+1) h^d)) Σ_{i=1}^n K((u − Yi)/h) + (1/((n+1) h^d)) K((u − y)/h).
Let σi(y) = p̂_h^y(Yi), σn+1(y) = p̂_h^y(y), and
π(y) = (1/(n+1)) Σ_{i=1}^{n+1} I(σi(y) ≤ σn+1(y)),
Cn = {y : π(y) ≥ α}.
Then P(Yn+1 ∈ Cn) ≥ 1 − α for all P and n.
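A minimal one-dimensional sketch of the kernel conformal set, scanning a grid of candidate points and recomputing the augmented estimator for each (data and settings are illustrative):

import numpy as np

def aug_kde(u, data, h):
    # augmented kernel density estimate at the points u (Gaussian kernel, d = 1)
    K = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
    return K((u[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

def kernel_conformal_set(Y, grid, h, alpha):
    keep = []
    for y in grid:
        aug = np.append(Y, y)
        p_hat = aug_kde(aug, aug, h)                        # sigma_i(y) for i = 1, ..., n+1
        keep.append(np.mean(p_hat <= p_hat[-1]) >= alpha)   # pi(y) >= alpha
    return grid[np.array(keep)]

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
grid = np.linspace(-8, 8, 400)
C_n = kernel_conformal_set(Y, grid, h=0.5, alpha=0.1)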
24
Helpful Approximation
Cn is not a density level set. Also, it is expensive to compute. However, Cn ⊂ C_n^+ where
C_n^+ = {y : p̂_h(y) ≥ cn},
where cn = p̂_h(Y_(nα)) − K(0)/(n h^d) and Y_(1), Y_(2), · · · are ordered so that p̂_h(Y_(1)) ≥ p̂_h(Y_(2)) ≥ · · ·.
The set C_n^+ involves no augmentation step but still satisfies P(Yn+1 ∈ C_n^+) ≥ 1 − α. Its connected components are the density clusters.
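A sketch of the plug-in set C_n^+ in one dimension (Gaussian kernel; the index convention for Y_(nα) below is an illustrative choice):

import numpy as np

def plugin_conformal_set(Y, grid, h, alpha):
    # C_n^+ computed from the ordinary (non-augmented) KDE; Gaussian kernel, d = 1
    K = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
    kde = lambda u: K((u[:, None] - Y[None, :]) / h).sum(axis=1) / (len(Y) * h)
    n = len(Y)
    p_sorted = np.sort(kde(Y))[::-1]              # p_hat(Y_(1)) >= p_hat(Y_(2)) >= ...
    idx = max(int(np.ceil(n * alpha)) - 1, 0)     # index of Y_(n*alpha), illustrative convention
    c_n = p_sorted[idx] - K(0) / (n * h)          # c_n = p_hat(Y_(n*alpha)) - K(0)/(n h^d)
    return grid[kde(grid) >= c_n]                 # C_n^+ = {y : p_hat(y) >= c_n}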
25
Optimality
Assuming Hölder-β smoothness, if hn ≍ (log n / n)^{1/(2β+d)} then (with high probability)
µ(Cn ∆ C*) = O( (log n / n)^{β/(2β+d)} ).
The same holds for C_n^+. This rate is minimax optimal: w.h.p.
inf_C sup_{P∈P} L(C) ≥ c (log n / n)^{β/(2β+d)}
where the infimum is over all level 1 − α prediction sets.
Note: the minimax result requires smoothness assumptions; the finite sample distribution free guarantee does not.
Note: the rate for the alternative loss L(C) = µ(C) − µ(C*) is faster.
26
Data-Driven Bandwidth
Each bandwidth h yields a conformal prediction region C_{n,h}. Choose ĥ to minimize µ(C_{n,h}). (With some adjustments, this still has finite sample validity.)
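A sketch of this bandwidth selection, approximating µ(C_{n,h}) by grid-cell counting and reusing the kernel_conformal_set helper from the earlier sketch (grid and candidate bandwidths are illustrative):

import numpy as np

def choose_bandwidth(Y, grid, H, alpha):
    # approximate mu(C_{n,h}) by counting grid cells inside the conformal set;
    # kernel_conformal_set is the helper defined in the earlier sketch
    cell = grid[1] - grid[0]
    measures = [len(kernel_conformal_set(Y, grid, h, alpha)) * cell for h in H]
    return H[int(np.argmin(measures))]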
[Figure: µ(Ĉ), µ(C̃−), and µ(C̃+) versus log2(h/hn).]
27
[Figure: Lebesgue measure versus bandwidth.]
28
Level Set Clustering
To summarize so far: conformal inference with a kernel density conformity score gives a finite sample, distribution-free prediction region; its connected components are the density clusters, and the bandwidth can be chosen by minimizing the measure of the region.
29
2d Example
[Figure, left: optimal set, outer bound, inner bound, conformal set, and data points. Right: data points in and not in the region, and the convex hull of the data points in the region.]
Left: conformal. Right: Data-depth method (Li and Liu 2008). The conformal method is 1,000 times faster.
30
Singular Measures
But cross-validation chooses h = 0.
31
Conformal k-means?
Can we use conformal prediction to choose k? Yes, but ...
Let c1, . . . , ck minimize
Rn = (1/n) Σ_{i=1}^n min_j ||Yi − cj||².
Similarly, let c1(y), . . . , ck(y) denote the augmented centers and let
σi(y) = min_j ||Yi − cj(y)||² ≡ g(Yi).
THEOREM: µ(Cn) = ∞.
32
This can be fixed using g built from aug(y) − {Yi}. However, this is computationally expensive. Instead: Split-Conformal Method.
Then P(Yn+1 ∈ Cn) ≥ 1 − α. When applied to k-means, µ(Cn) < ∞. Can choose k to minimize µ(Cn).
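A minimal sketch of one way to read the split-conformal idea for k-means: fit the centers on one half of the data, calibrate a distance threshold on the held-out half, and measure the resulting region on a grid (the data, grid, and exact calibration quantile are illustrative; scikit-learn's KMeans is assumed):

import numpy as np
from sklearn.cluster import KMeans

def split_conformal_kmeans_measure(Y, k, alpha, grid):
    # split: fit centers on the first half, calibrate the threshold on the second half
    n = len(Y)
    Y1, Y2 = Y[: n // 2], Y[n // 2 :]
    centers = KMeans(n_clusters=k, n_init=10).fit(Y1.reshape(-1, 1)).cluster_centers_.ravel()
    score = lambda y: np.min((y[:, None] - centers[None, :]) ** 2, axis=1)
    t = np.quantile(score(Y2), 1 - alpha)            # calibration quantile (illustrative)
    in_region = score(grid) <= t                     # region = {y : min_j ||y - c_j||^2 <= t}
    return np.sum(in_region) * (grid[1] - grid[0])   # approximate Lebesgue measure

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-4, 1, 60), rng.normal(4, 1, 60)])
grid = np.linspace(-10, 10, 2000)
ks = [1, 2, 3, 4, 5]
measures = [split_conformal_kmeans_measure(Y, k, 0.1, grid) for k in ks]
k_hat = ks[int(np.argmin(measures))]                 # choose k minimizing the measure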
[Figure: histogram of the data x; Lebesgue measure of the conformal region versus k, for k = 2, . . . , 5.]
33
Function Mining
Given functions Y1(·), . . . , Yn(·): project onto principal components; conformal prediction for the projections (computed via quadratic programming) leads to conformal prediction bands:
34
[Figure: (a) Neuron, principal components (PC1 vs PC2); (b) Neuron, 90% prediction bands for the projection (response vs time); (c) Phoneme, principal components (PC1 vs PC2); (d) Phoneme, 90% prediction bands for the projection (response vs time).]
35
Summary So Far
By optimizing conformal inference we can choose tuning parameters for clustering. Theory is still in progress.
We can get curve clusters in the form of prediction bands with finite sample, distribution free prediction guarantees.
Let's consider yet another structure finding problem: undirected graphs.
36
Tuning Parameters for Undirected Graphs
(Very preliminary.) X1, . . . , Xn ∼ N(µ, Σ).
Glasso (graphical lasso): minimize
−loglik(Σ) + λ Σ_{j,k} |Σ⁻¹_{jk}|.
The non-zeros of Σ⁻¹ give the edges of the graph.
Conformal score: φ(Yi; µ̂, Σ̂), the Gaussian density. (Valid even if the data are not really Gaussian.)
Choose λ to minimize the volume of Cn.
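A rough sketch of this recipe using scikit-learn's GraphicalLasso and a split-style calibration of the Gaussian density score; the region {x : φ(x; µ̂, Σ̂) ≥ t} is an ellipsoid, so its volume has a closed form. The splitting, the calibration quantile, and the volume formula are illustrative assumptions, not the talk's exact construction:

import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_normal
from sklearn.covariance import GraphicalLasso

def conformal_volume(X, lam, alpha=0.1):
    # fit (mu, Sigma) by the graphical lasso on one half, calibrate the density threshold on the other
    n, d = X.shape
    X1, X2 = X[: n // 2], X[n // 2 :]
    gl = GraphicalLasso(alpha=lam).fit(X1)
    mu, Sigma = X1.mean(axis=0), gl.covariance_
    dens = multivariate_normal(mu, Sigma).pdf(X2)
    t = np.quantile(dens, alpha)                  # region = {x : phi(x; mu, Sigma) >= t}
    # the region is the ellipsoid (x - mu)' Sigma^{-1} (x - mu) <= r^2; compute its volume
    _, logdet = np.linalg.slogdet(Sigma)
    r2 = -2.0 * (np.log(t) + 0.5 * d * np.log(2 * np.pi) + 0.5 * logdet)
    log_unit_ball = 0.5 * d * np.log(np.pi) - gammaln(d / 2 + 1)
    return np.exp(log_unit_ball + 0.5 * d * np.log(r2) + 0.5 * logdet)

# choose lambda by minimizing the volume over an illustrative grid:
# lambdas = np.exp(np.linspace(-4, 0, 9))
# lam_hat = lambdas[np.argmin([conformal_volume(X, lam) for lam in lambdas])]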
37
[Figure: results over a range of λ values; V versus log(Lambda).]
39
Structure ⟹ Prediction
We have seen that predictive thinking helps find structure. But the reverse is also true: finding structure can help with prediction.
Familiar example: Semisupervised Learning.
But does it provably help? Specifically, in the minimax sense.
(With Martin Azizyan and Aarti Singh)
40
Structure Helping Prediction: Semisupervised Learning
Labeled data: Ln = {(X1, Y1), . . . , (Xn, Yn)}. Unlabeled data: UN = {Xn+1, . . . , XN} with N > n.
The usual intuition: UN helps us estimate p ≡ pX well. Assume that m(x) = E(Y |X = x) is related to p(x). For example, the cluster assumption: m is very smooth where p(x) has clusters. (The clusters might be manifolds.)
Then Ln and p̂ help us estimate m(x):
UN ⟹ p̂ ⟹ m̂ ⟸ Ln
The structure of p helps predict Y.
41
Semisupervised
But does it provably improve inference? Is it true that
En = [ inf_{m̂∈SSN} sup_{P∈Pn} R(m̂) ] / [ inf_{m̂∈Sn} sup_{P∈Pn} R(m̂) ] → 0
as n → ∞? Here, R(m̂) = E(Y − m̂(X))², SSN denotes all semisupervised estimators, and Sn denotes all supervised estimators.
42
Semisupervised
Azizyan, Singh and Wasserman (2012): under fairly complicated conditions, En → 0 as n → ∞. The conditions require that n → ∞, N → ∞, n/N → 0, m is smooth “relative to p”, and p is highly concentrated (around lower dimensional sets).
Without these conditions, it appears that structure does not help, at least in the minimax sense. (See also, Singh, Nowak and Zhu (2008) and Niyogi (2008).)
In fact, we can adapt to the structure as follows.
43
Adaptive Semisupervised
Let Dα(x, y) = inf_γ ∫ p^α(γ(s)) ds, where the infimum is over all paths γ connecting x and y.
Assume that |m(x) − m(y)| ≤ L Dα(x, y).
The improvement over supervised learning depends on α and we can use the labeled data to learn α (adapt to α).
44
Adaptive Semisupervised
Let
m̂_{α,h}(x) = Σ_{i=1}^n Yi K(Dα(x, Xi)/h) / Σ_{i=1}^n K(Dα(x, Xi)/h).
If we choose α and h by data-splitting (cross-validation) then, w.h.p.,
Risk(m̂_{α̂,ĥ}) ≤ inf_{α,h} Risk(m̂_{α,h}) + O(√(log n / n)).
We have minimaxity and adaptivity: successful use of structure.
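A sketch of the estimator and the data-splitting selection of (α, h); density_distance, which returns the matrix of Dα distances between two point sets, is hypothetical and would have to be built from a density estimate on the unlabeled data:

import numpy as np

def kernel_regress(D_to_train, Y_train, h):
    # Nadaraya-Watson with a user-supplied distance matrix (rows: query points, columns: training points)
    W = np.exp(-0.5 * (D_to_train / h) ** 2)
    return (W @ Y_train) / W.sum(axis=1)

def choose_alpha_h(density_distance, X_tr, Y_tr, X_va, Y_va, alphas, hs):
    # density_distance(A, B, alpha) -> matrix of D_alpha(a, b) values is assumed, not specified in the talk
    best, best_err = None, np.inf
    for a in alphas:
        D = density_distance(X_va, X_tr, a)
        for h in hs:
            err = np.mean((Y_va - kernel_regress(D, Y_tr, h)) ** 2)
            if err < best_err:
                best, best_err = (a, h), err
    return best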
45
Conclusion
Prediction helps with structure finding, and vice versa.
The next two lectures look at two structure finding problems in detail: manifold estimation and high dimensional undirected graphs.
46
47