The Search For Structure
or The Relationship Between Structure and Prediction
June 2012
Larry Wasserman Dept of Statistics and Machine Learning Department Carnegie Mellon University
1
Searching For Structure
⇓
choose tuning parameters for structure finding
⇓
converting structure finding into prediction
⇓
conformal inference (distribution free prediction)
2
The Three Lectures
3
Collaborators
4
Outline
3. Converting structure finding into prediction. (4. Using structure to help with prediction: (minimax) semisupervised inference.)
5
The Three Eras of Statistics and Machine Learning
1. (a) mle (b) confidence intervals, etc.
2. (a) classification (b) regression (c) SVM, etc.
3. (a) graphical models (b) manifolds (c) matrix factorization
6
Prediction is “Easy.” Example 1: Nonparametric Regression
Let (X1, Y1), . . . , (X2n, Y2n) ∼ P. Split the data into training and test halves.
Let {m̂h : h ∈ H} be estimates of m(x) = E(Y | X = x) from the training data.
Choose ĥ to minimize (1/n) Σ_i (Yi − m̂h(Xi))² over the test half. Then∗
Risk(m̂ĥ) ≤ c1 Risk(m̂*) + c2 log|H| / n.
∗See Gyorfi et al, for example.
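A minimal numpy sketch of this data-splitting step (the Nadaraya-Watson smoother, the simulated data, and the bandwidth grid below are illustrative, not from the talk):

import numpy as np

def nw_estimate(x, X_train, Y_train, h):
    # Nadaraya-Watson estimate of m(x) with a Gaussian kernel and bandwidth h
    w = np.exp(-0.5 * ((x - X_train[:, None]) / h) ** 2)
    return (w * Y_train[:, None]).sum(axis=0) / w.sum(axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)

# split into training and test halves
X_tr, Y_tr, X_te, Y_te = X[:100], Y[:100], X[100:], Y[100:]

H = [0.02, 0.05, 0.1, 0.2, 0.5]      # candidate bandwidths
test_err = [np.mean((Y_te - nw_estimate(X_te, X_tr, Y_tr, h)) ** 2) for h in H]
h_hat = H[int(np.argmin(test_err))]  # bandwidth chosen by minimizing the test error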
7
Prediction is “Easy.” Example 2: The Lasso
Let (X1, Y1), . . . , (Xn, Yn) ∼ P where Xi ∈ R^d. Let β̂ minimize
Σ_{i=1}^n (Yi − Xiᵀβ)² s.t. ||β||₁ ≤ L (the lasso).
Then, w.h.p.∗
R(β̂) ≤ R(β*) + O(√(log d / n)),
where β* minimizes Risk(β) subject to ||β||₁ ≤ L. Choose L by cross-validation.
∗See Greenshtein and Ritov 2004
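A short sketch using scikit-learn (an assumption; the talk does not prescribe software). LassoCV solves the penalized (Lagrangian) form of the lasso, so cross-validating the penalty plays the same role as choosing the constraint level L:

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:5] = 2.0                      # sparse true coefficient vector (illustrative)
y = X @ beta + rng.standard_normal(n)

fit = LassoCV(cv=5).fit(X, y)       # penalty chosen by cross-validation
print(fit.alpha_, np.sum(fit.coef_ != 0))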
8
Prediction is “Easy.” Example 3: SpAM. Sparse Additive Models∗
Y = m(X) + ε where m(x) = Σ_{j=1}^d sj(xj).
Choose s1, . . . , sd to minimize
Σ_{i=1}^n (Yi − Σ_j sj(Xij))²
subject to each sj being smooth and Σ_j ||sj|| ≤ L.
∗Ravikumar, Lafferty, Liu and Wasserman 2009
9
Prediction is “Easy.” Example 3: SpAM.
Choose L by minimizing generalized cross-validation:
GCV(L) = (RSS/n) / (1 − df(L)/n)².
If d ≤ e^{n^ξ} for ξ < 1 then
Risk(m̂) − Risk(m*) = OP( (1/n)^{(1−ξ)/2} ).
10
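A small helper for this GCV criterion; fit_spam is a hypothetical routine assumed to return the residual sum of squares and effective degrees of freedom of the SpAM fit at budget L:

import numpy as np

def gcv_score(rss, df, n):
    # generalized cross-validation: (RSS/n) / (1 - df/n)^2
    return (rss / n) / (1.0 - df / n) ** 2

def choose_L(L_grid, fit_spam, n):
    # fit_spam(L) -> (rss, df) is assumed, not part of the talk
    scores = [gcv_score(*fit_spam(L), n) for L in L_grid]
    return L_grid[int(np.argmin(scores))]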
Prediction
Prediction is easy because the risk can be estimated directly from the data, by cross-validation, data splitting, etc.
Important: The results on data-splitting give distribution-free guarantees.
11
Structure Finding
Examples: clustering, undirected graphs, manifolds.
(Details about graphs and manifolds in lectures 2 and 3.)
In this talk, we will show how prediction helps find structure.
12
Clustering Despite many, many years of research and many, many papers, there does not seem to be a consensus on how to choose tuning parameters.
13
Clustering Various suggestions include:
I’ll (tentatively) propose an alternative.
14
Example of Our Results: Distribution Free Curve Clustering
15
Relating Structure to Prediction
Our approach (Lei, Rinaldo, Robins, Wasserman) is to convert a structure-finding problem into a prediction problem.
Example: density estimation ⟹ conformal prediction.
Conformal prediction is due to Vovk et al. Rest of talk:
16
Conformal Inference A theory of distribution free prediction. See: Vovk, Gammerman and Shafer (2005) + many papers by Vovk and co-workers. (See also Phil Dawid’s work on prequential inference.) Our contribution: marrying conformal inference with traditional statistical theory (minimax theory) and extending some of the techniques:
Lei, Robins and Wasserman (arXiv:1111.1418)
Lei, Wasserman (arXiv:1203.5422)
Lei, Rinaldo, Wasserman (submitted to NIPS)
Lei, Robins and Wasserman (in progress)
17
(Batch) Conformal Prediction
Observe Y1, . . . , Yn ∼ P. Construct Cn ≡ Cn(Y1, . . . , Yn) such that
P(Yn+1 ∈ Cn) ≥ 1 − α for all P and all n.
Here, P ≡ P^(n+1). See Vovk et al for the general (sequential) theory. We are only concerned with the batch version. We will also be concerned with minimax optimality (efficiency).
18
(Batch) Conformal Prediction
π(y) = (1/(n+1)) Σ_{i=1}^{n+1} I(σi(y) ≤ σn+1(y)),
Cn = {y : π(y) ≥ α}.
19
Conformity Scores
Use aug(y) = (Y1, . . . , Yn, y) to construct a function g. Compute
σi(y) = g(Yi) for i = 1, . . . , n, and σn+1(y) = g(y).
Example: σi = −|Yi − Ȳ(y)| where Ȳ(y) = (y + Σ_{i=1}^n Yi) / (n + 1).
* In certain cases, we need to use σi = gi(Yi) where gi is built from aug(y) − {Yi}. More on this later.
20
(Batch) Conformal Prediction
When H0 : Yn+1 = y is true, the ranks of the σi's are uniform. It follows that, for any P and any n,
P(Yn+1 ∈ Cn) ≡ P^(n+1)(Yn+1 ∈ Cn) ≥ 1 − α.
This is true, finite sample, distribution-free prediction. But what is the best conformity score?
21
Oracle
Best (smallest) prediction set, or Oracle: C* = {y : p(y) > λ} where λ is such that P(C*) = 1 − α.
The form of C* suggests using an estimate p̂ of p to define a conformity score. And this leads to a method for density level set clustering.
22
Loss Function
Loss function: L(C) = µ(C ∆ C*) where A ∆ B = (A ∩ B^c) ∪ (A^c ∩ B) and µ is Lebesgue measure.
Minimax risk:
inf_{C∈Γn} sup_{P∈P} EP[µ(C ∆ C*)]
where Γn denotes all 1 − α prediction regions.
23
Kernel Conformity
Define the augmented kernel density estimator
p̂_h^y(u) = (1/((n+1) h^d)) Σ_{i=1}^n K((u − Yi)/h) + (1/((n+1) h^d)) K((u − y)/h).
Let σi(y) = p̂_h^y(Yi), σn+1(y) = p̂_h^y(y), and
π(y) = (1/(n+1)) Σ_{i=1}^{n+1} I(σi(y) ≤ σn+1(y)),
Cn = {y : π(y) ≥ α}.
Then P(Yn+1 ∈ Cn) ≥ 1 − α for all P and n.
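A minimal one-dimensional sketch of the kernel conformal set, scanning a grid of candidate points and recomputing the augmented estimator for each (data and settings are illustrative):

import numpy as np

def aug_kde(u, data, h):
    # augmented kernel density estimate at the points u (Gaussian kernel, d = 1)
    K = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
    return K((u[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

def kernel_conformal_set(Y, grid, h, alpha):
    keep = []
    for y in grid:
        aug = np.append(Y, y)
        p_hat = aug_kde(aug, aug, h)                        # sigma_i(y) for i = 1, ..., n+1
        keep.append(np.mean(p_hat <= p_hat[-1]) >= alpha)   # pi(y) >= alpha
    return grid[np.array(keep)]

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
grid = np.linspace(-8, 8, 400)
C_n = kernel_conformal_set(Y, grid, h=0.5, alpha=0.1)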
24
Helpful Approximation
Cn is not a density level set. Also, it is expensive to compute. However, Cn ⊂ C_n^+ where
C_n^+ = {y : p̂_h(y) ≥ cn},
where cn = p̂_h(Y_(nα)) − K(0)/(n h^d) and Y_(1), Y_(2), · · · are ordered so that p̂_h(Y_(1)) ≥ p̂_h(Y_(2)) ≥ · · ·.
The set C_n^+ involves no augmentation step but still satisfies P(Yn+1 ∈ C_n^+) ≥ 1 − α. Its connected components are the density clusters.
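A sketch of the plug-in set C_n^+ in one dimension (Gaussian kernel; the index convention for Y_(nα) below is an illustrative choice):

import numpy as np

def plugin_conformal_set(Y, grid, h, alpha):
    # C_n^+ computed from the ordinary (non-augmented) KDE; Gaussian kernel, d = 1
    K = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
    kde = lambda u: K((u[:, None] - Y[None, :]) / h).sum(axis=1) / (len(Y) * h)
    n = len(Y)
    p_sorted = np.sort(kde(Y))[::-1]              # p_hat(Y_(1)) >= p_hat(Y_(2)) >= ...
    idx = max(int(np.ceil(n * alpha)) - 1, 0)     # index of Y_(n*alpha), illustrative convention
    c_n = p_sorted[idx] - K(0) / (n * h)          # c_n = p_hat(Y_(n*alpha)) - K(0)/(n h^d)
    return grid[kde(grid) >= c_n]                 # C_n^+ = {y : p_hat(y) >= c_n}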
25
Optimality
Assuming Hölder-β smoothness, if hn ≍ (log n / n)^{1/(2β+d)} then (with high probability)
µ(Cn ∆ C*) = O( (log n / n)^{β/(2β+d)} ).
The same holds for C_n^+. This rate is minimax optimal: w.h.p.
inf_C sup_{P∈P} L(C) ≥ c (log n / n)^{β/(2β+d)}
where the infimum is over all level 1 − α prediction sets.
Note: the minimax result requires smoothness assumptions; the finite sample distribution free guarantee does not.
Note: the rate for the alternative loss L(C) = µ(C) − µ(C*) is faster.
26
Data-Driven Bandwidth
Each bandwidth h yields a conformal prediction region C_{n,h}. Choose ĥ to minimize µ(C_{n,h}). (With some adjustments, this still has finite sample validity.)
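A sketch of this bandwidth selection, approximating µ(C_{n,h}) by grid-cell counting and reusing the kernel_conformal_set helper from the earlier sketch (grid and candidate bandwidths are illustrative):

import numpy as np

def choose_bandwidth(Y, grid, H, alpha):
    # approximate mu(C_{n,h}) by counting grid cells inside the conformal set;
    # kernel_conformal_set is the helper defined in the earlier sketch
    cell = grid[1] - grid[0]
    measures = [len(kernel_conformal_set(Y, grid, h, alpha)) * cell for h in H]
    return H[int(np.argmin(measures))]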
[Figure: µ(Ĉ), µ(C̃−), and µ(C̃+) versus log2(h/hn).]
27
[Figure: Lebesgue measure versus bandwidth.]
28
Level Set Clustering
To summarize so far: conformal inference with a kernel density conformity score gives a finite sample, distribution-free prediction region; its connected components are the density clusters, and the bandwidth can be chosen by minimizing the measure of the region.
29
2d Example
[Figure, left: optimal set, outer bound, inner bound, conformal set, and data points. Right: data points in and not in the region, and the convex hull of the data points in the region.]
Left: conformal. Right: Data-depth method (Li and Liu 2008). The conformal method is 1,000 times faster.
30
Singular Measures
But cross-validation chooses h = 0.
31
Conformal k-means?
Can we use conformal prediction to choose k? Yes, but ...
Let c1, . . . , ck minimize
Rn = (1/n) Σ_{i=1}^n min_j ||Yi − cj||².
Similarly, let c1(y), . . . , ck(y) denote the augmented centers and let
σi(y) = min_j ||Yi − cj(y)||² ≡ g(Yi).
THEOREM: µ(Cn) = ∞.
32
This can be fixed using g built from aug(y) − {Yi}. However, this is computationally expensive. Instead: Split-Conformal Method.
Then P(Yn+1 ∈ Cn) ≥ 1 − α. When applied to k-means, µ(Cn) < ∞. Can choose k to minimize µ(Cn).
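A minimal sketch of one way to read the split-conformal idea for k-means: fit the centers on one half of the data, calibrate a distance threshold on the held-out half, and measure the resulting region on a grid (the data, grid, and exact calibration quantile are illustrative; scikit-learn's KMeans is assumed):

import numpy as np
from sklearn.cluster import KMeans

def split_conformal_kmeans_measure(Y, k, alpha, grid):
    # split: fit centers on the first half, calibrate the threshold on the second half
    n = len(Y)
    Y1, Y2 = Y[: n // 2], Y[n // 2 :]
    centers = KMeans(n_clusters=k, n_init=10).fit(Y1.reshape(-1, 1)).cluster_centers_.ravel()
    score = lambda y: np.min((y[:, None] - centers[None, :]) ** 2, axis=1)
    t = np.quantile(score(Y2), 1 - alpha)            # calibration quantile (illustrative)
    in_region = score(grid) <= t                     # region = {y : min_j ||y - c_j||^2 <= t}
    return np.sum(in_region) * (grid[1] - grid[0])   # approximate Lebesgue measure

rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-4, 1, 60), rng.normal(4, 1, 60)])
grid = np.linspace(-10, 10, 2000)
ks = [1, 2, 3, 4, 5]
measures = [split_conformal_kmeans_measure(Y, k, 0.1, grid) for k in ks]
k_hat = ks[int(np.argmin(measures))]                 # choose k minimizing the measure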
[Figure: histogram of the data x; Lebesgue measure of the conformal region versus k, for k = 2, . . . , 5.]
33
Function Mining
Given functions Y1(·), . . . , Yn(·): project onto principal components; conformal prediction for the projections (computed via quadratic programming) leads to conformal prediction bands:
34
[Figure: (a) Neuron, principal components (PC1 vs PC2); (b) Neuron, 90% prediction bands for the projection (response vs time); (c) Phoneme, principal components (PC1 vs PC2); (d) Phoneme, 90% prediction bands for the projection (response vs time).]
35
Summary So Far
By optimizing conformal inference we can choose tuning parameters for clustering. Theory is still in progress.
We can get curve clusters in the form of prediction bands with finite sample, distribution free prediction guarantees.
Let's consider yet another structure finding problem: undirected graphs.
36
Tuning Parameters for Undirected Graphs
(Very preliminary.) X1, . . . , Xn ∼ N(µ, Σ).
Glasso (graphical lasso): minimize
−loglik(Σ) + λ Σ_{j,k} |Σ⁻¹_{jk}|.
The non-zeros of Σ⁻¹ give the edges of the graph.
Conformal score: φ(Yi; µ̂, Σ̂), the Gaussian density. (Valid even if the data are not really Gaussian.)
Choose λ to minimize the volume of Cn.
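A rough sketch of this recipe using scikit-learn's GraphicalLasso and a split-style calibration of the Gaussian density score; the region {x : φ(x; µ̂, Σ̂) ≥ t} is an ellipsoid, so its volume has a closed form. The splitting, the calibration quantile, and the volume formula are illustrative assumptions, not the talk's exact construction:

import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_normal
from sklearn.covariance import GraphicalLasso

def conformal_volume(X, lam, alpha=0.1):
    # fit (mu, Sigma) by the graphical lasso on one half, calibrate the density threshold on the other
    n, d = X.shape
    X1, X2 = X[: n // 2], X[n // 2 :]
    gl = GraphicalLasso(alpha=lam).fit(X1)
    mu, Sigma = X1.mean(axis=0), gl.covariance_
    dens = multivariate_normal(mu, Sigma).pdf(X2)
    t = np.quantile(dens, alpha)                  # region = {x : phi(x; mu, Sigma) >= t}
    # the region is the ellipsoid (x - mu)' Sigma^{-1} (x - mu) <= r^2; compute its volume
    _, logdet = np.linalg.slogdet(Sigma)
    r2 = -2.0 * (np.log(t) + 0.5 * d * np.log(2 * np.pi) + 0.5 * logdet)
    log_unit_ball = 0.5 * d * np.log(np.pi) - gammaln(d / 2 + 1)
    return np.exp(log_unit_ball + 0.5 * d * np.log(r2) + 0.5 * logdet)

# choose lambda by minimizing the volume over an illustrative grid:
# lambdas = np.exp(np.linspace(-4, 0, 9))
# lam_hat = lambdas[np.argmin([conformal_volume(X, lam) for lam in lambdas])]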
37
[Figure: results over a range of λ values; V versus log(Lambda).]
39
Structure ⟹ Prediction
We have seen that predictive thinking helps find structure. But the reverse is also true: finding structure can help with prediction.
Familiar example: Semisupervised Learning.
But does it provably help? Specifically, in the minimax sense.
(With Martin Azizyan and Aarti Singh)
40
Structure Helping Prediction: Semisupervised Learning
Labeled data: Ln = {(X1, Y1), . . . , (Xn, Yn)}. Unlabeled data: UN = {Xn+1, . . . , XN} with N > n.
The usual intuition: UN helps us estimate p ≡ pX well. Assume that m(x) = E(Y |X = x) is related to p(x). For example, the cluster assumption: m is very smooth where p(x) has clusters. (The clusters might be manifolds.)
Then Ln and p̂ help us estimate m(x):
UN ⟹ p̂ ⟹ m̂ ⟸ Ln
The structure of p helps predict Y.
41
Semisupervised
But does it provably improve inference? Is it true that
En = [ inf_{m̂∈SSN} sup_{P∈Pn} R(m̂) ] / [ inf_{m̂∈Sn} sup_{P∈Pn} R(m̂) ] → 0
as n → ∞? Here, R(m̂) = E(Y − m̂(X))², SSN denotes all semisupervised estimators, and Sn denotes all supervised estimators.
42
Semisupervised
Azizyan, Singh and Wasserman (2012): under fairly complicated conditions, En → 0 as n → ∞. The conditions require that n → ∞, N → ∞, n/N → 0, m is smooth “relative to p”, and p is highly concentrated (around lower dimensional sets).
Without these conditions, it appears that structure does not help, at least in the minimax sense. (See also, Singh, Nowak and Zhu (2008) and Niyogi (2008).)
In fact, we can adapt to the structure as follows.
43
Adaptive Semisupervised
Let Dα(x, y) = inf_γ ∫ p^α(γ(s)) ds, where the infimum is over all paths γ connecting x and y.
Assume that |m(x) − m(y)| ≤ L Dα(x, y).
The improvement over supervised learning depends on α and we can use the labeled data to learn α (adapt to α).
44
Adaptive Semisupervised
Let
m̂_{α,h}(x) = Σ_{i=1}^n Yi K(Dα(x, Xi)/h) / Σ_{i=1}^n K(Dα(x, Xi)/h).
If we choose α and h by data-splitting (cross-validation) then, w.h.p.,
Risk(m̂_{α̂,ĥ}) ≤ inf_{α,h} Risk(m̂_{α,h}) + O(√(log n / n)).
We have minimaxity and adaptivity: successful use of structure.
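A sketch of the estimator and the data-splitting selection of (α, h); density_distance, which returns the matrix of Dα distances between two point sets, is hypothetical and would have to be built from a density estimate on the unlabeled data:

import numpy as np

def kernel_regress(D_to_train, Y_train, h):
    # Nadaraya-Watson with a user-supplied distance matrix (rows: query points, columns: training points)
    W = np.exp(-0.5 * (D_to_train / h) ** 2)
    return (W @ Y_train) / W.sum(axis=1)

def choose_alpha_h(density_distance, X_tr, Y_tr, X_va, Y_va, alphas, hs):
    # density_distance(A, B, alpha) -> matrix of D_alpha(a, b) values is assumed, not specified in the talk
    best, best_err = None, np.inf
    for a in alphas:
        D = density_distance(X_va, X_tr, a)
        for h in hs:
            err = np.mean((Y_va - kernel_regress(D, Y_tr, h)) ** 2)
            if err < best_err:
                best, best_err = (a, h), err
    return best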
45
Conclusion
Prediction helps with structure finding, and vice versa.
The next two lectures look at two structure finding problems in detail: manifold estimation and high dimensional undirected graphs.
46
47