
SLIDE 1

High Dimensional Classification in the Presence of Correlation: A Factor Model Approach

A. PEDRO DUARTE SILVA (*)

Faculdade de Economia e Gestão / Centro de Estudos em Gestão e Economia
Universidade Católica Portuguesa
Centro Regional do Porto

Compstat’ 2010, PARIS, 23-28 August 2010

(*) Supported by: FEDER / POCI 2010

SLIDE 2

Overview

  • 1. A factor-model linear classification rule for high-dimensional correlated data
  • 2. Asymptotic properties with p → ∞
  • 3. Variable selection for problems with “rare” and “mostly weak” group differences
  • 4. Performance in microarray problems
  • 5. Conclusions and perspectives

Compstat’ 2010: High Dimensional Correlation Adjusted Classification

SLIDE 3

Problem Statement: (Y, X)

Y ∈ {0, 1},  X ∈ ℝ^p,  X | Y ~ N_p(μ^(Y), Σ)

We want to find a rule that predicts Y given X. Assuming these distributions, the Bayes rule is

Ŷ = argmax_g π_g f_g(X)

which reduces to the linear rule

Ŷ = 1{ Δ^T Σ^{-1} ( X − (μ^(1) + μ^(0)) / 2 ) ≥ log(π_0 / π_1) },   Δ = μ^(1) − μ^(0)

How to estimate Σ^{-1} when p > n and the X correlations are important?
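The Bayes linear rule above can be sketched in a few lines of NumPy, assuming the parameters are known (the helper name `lda_rule` is illustrative, not part of the presentation):

```python
import numpy as np

def lda_rule(x, mu0, mu1, Sigma_inv, pi0=0.5, pi1=0.5):
    """Bayes rule for X | Y = g ~ N(mu_g, Sigma):
    predict 1 iff Delta' Sigma^{-1} (x - (mu1 + mu0)/2) >= log(pi0/pi1)."""
    delta = mu1 - mu0                              # Delta = mu^(1) - mu^(0)
    score = delta @ Sigma_inv @ (x - 0.5 * (mu1 + mu0))
    return int(score >= np.log(pi0 / pi1))
```

In practice the parameters are replaced by estimates; estimating Σ^{-1} when p > n is exactly the difficulty addressed next.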

SLIDE 4

A Factor-Model Approach

X_i = μ^(Y_i) + B f_i + ε_i,   f_i ∈ ℝ^q,  ε_i ∈ ℝ^p,  q << p

f_i ~ N_q(0, I_q),   ε_i ~ N_p(0, D_ε)

Σ = B B^T + D_ε

Σ^{-1} = D_ε^{-1} − D_ε^{-1} B [ I_q + B^T D_ε^{-1} B ]^{-1} B^T D_ε^{-1}

Estimation, scaling by the sample variances V̂ and fitting in the Frobenius norm:

( B̂, D̂_ε ) = argmin_{B, D} || V̂^{-1/2} S V̂^{-1/2} − V̂^{-1/2} ( B B^T + D ) V̂^{-1/2} ||_F^2

Σ̂_RFctq = B̂ B̂^T + D̂_ε,   with D̂_ε(j) ≥ k_0 > 0 for all j
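Under the factor decomposition, Σ^{-1} never requires a p × p inversion: the identity on the slide (the Woodbury identity) reduces it to a q × q solve. A minimal NumPy sketch, assuming B and the diagonal of D_ε are given:

```python
import numpy as np

def factor_precision(B, d):
    """Sigma^{-1} for Sigma = B B' + diag(d) via the Woodbury identity:
    D^{-1} - D^{-1} B (I_q + B' D^{-1} B)^{-1} B' D^{-1}  (O(p q^2) cost)."""
    q = B.shape[1]
    Dinv_B = B / d[:, None]                 # D^{-1} B, assuming d > 0
    M = np.eye(q) + B.T @ Dinv_B            # only a q x q system to solve
    return np.diag(1.0 / d) - Dinv_B @ np.linalg.solve(M, Dinv_B.T)
```

With q << p this is what makes the factor-based rule computable when p is in the thousands.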

SLIDE 5

Asymptotic Properties

We will compare empirical linear rules of the form

δ_L(X) = 1{ Δ̂^T Σ̂_L^{-1} ( X − ( X̄_1 + X̄_0 ) / 2 ) ≥ log( n_0 / n_1 ) }

based on the worst-case misclassification criterion

W(δ_L) = max_{θ ∈ Γ_θ} P_θ( δ_L(X) ≠ Y ),   with conditional error 1 − Φ( Δ^T Σ̂_L^{-1} Δ / ( 2 ( Δ^T Σ̂_L^{-1} Σ Σ̂_L^{-1} Δ )^{1/2} ) )

for some parameter space Γ_θ and an estimator Δ̂ satisfying

( C1 )   max_{θ ∈ Γ_θ} E || Δ̂ − Δ ||^2 → 0   when p → ∞ with n(p) ∝ p^d

SLIDE 6

Asymptotic Properties: Main Result

Parameter space: θ = (μ^(0), μ^(1), Σ) belongs to Γ_θ(k_1, k_2, k_F, q, B, c) when

  • Δ^T Σ^{-1} Δ ≥ c
  • k_1 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ k_2
  • |B(j, a)| ≤ β(j, a) for all j, a
  • min_j D_ε(j) and the residual correlations max_{j' ≠ l'} |R_ε(j', l')| are suitably bounded

Main Result: when ( C1 ) is satisfied,

W(δ_Fq) → 1 − Φ( ( c K_Γ0Fq )^{1/2} / ( 1 + K_Γ0Fq ) )

with

K_Γ0Fq = max_{Γ_θ} λ_max(Σ_0Fq) / λ_min(Σ_0Fq),   Σ_0Fq = Σ_RFctq^{-1/2} Σ Σ_RFctq^{-1/2}

Σ_RFctq = B̃ B̃^T + D̃,   ( B̃, D̃ ) = argmin_{B, D} || R − V^{-1/2} ( B B^T + D ) V^{-1/2} ||_F^2

It follows that the result holds when p → ∞ with n(p) ∝ log p and π_0 = π_1 = 1/2.

SLIDE 7

Selecting Predictors: Higher Criticism

1 - Rank variables according to two-sample t-scores.

2 - Choose a selection cut-off for the score values (Donoho and Jin 2004). Given the p ordered p-values π_(1), ..., π_(p):

HC(j; π_(j)) = √p ( j/p − π_(j) ) / ( (j/p)(1 − j/p) )^{1/2}

HC* = max_{j ≤ α_0 p} HC(j; π_(j))
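The HC* computation above can be sketched directly in NumPy (the name `hc_threshold` is illustrative; it returns the maximizing index j*, i.e. how many of the smallest p-values to keep):

```python
import numpy as np

def hc_threshold(pvals, alpha0=0.5):
    """HC(j) = sqrt(p) (j/p - pi_(j)) / sqrt((j/p)(1 - j/p)),
    maximised over j <= alpha0 * p (j = p excluded: zero denominator)."""
    p = len(pvals)
    pi = np.sort(np.asarray(pvals))
    jmax = min(int(alpha0 * p), p - 1)
    j = np.arange(1, jmax + 1)
    frac = j / p
    hc = np.sqrt(p) * (frac - pi[:jmax]) / np.sqrt(frac * (1.0 - frac))
    jstar = int(np.argmax(hc)) + 1      # number of variables to keep
    return hc[jstar - 1], jstar
```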

SLIDE 8

Selecting Predictors: Higher Criticism

(Donoho and Jin 2009) In a two-group homoskedastic model with:

  • independent variables
  • rare “effects” (mean group differences)
  • weak effects
  • p-values derived from two-group t-scores

HC* is asymptotically equivalent, as p → ∞, to the optimal selection threshold for diagonal classification rules.

SLIDE 9

Selecting Predictors: Control of false discovery rates

Given a sequence of p independent tests with ordered p-values π_(1), ..., π_(p), reject the null hypotheses H_0(j) for j ≤ k, with (Benjamini and Hochberg 1995):

k = max{ j : π_(j) ≤ ( j / p ) α }

Given a sequence of p dependent tests with ordered p-values π_(1), ..., π_(p), reject the null hypotheses H_0(j) for j ≤ k, with (Benjamini and Yekutieli 2001):

k = max{ j : π_(j) ≤ ( j / ( p Σ_{i=1}^{p} 1/i ) ) α }
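Both cut-offs are short computations over the sorted p-values. A NumPy sketch (illustrative names; each function returns k, the number of rejected hypotheses):

```python
import numpy as np

def bh_cutoff(pvals, alpha):
    """Benjamini-Hochberg: k = max{ j : pi_(j) <= (j/p) alpha } (0 if empty)."""
    pi = np.sort(np.asarray(pvals))
    p = len(pi)
    ok = np.nonzero(pi <= np.arange(1, p + 1) / p * alpha)[0]
    return int(ok[-1]) + 1 if ok.size else 0

def by_cutoff(pvals, alpha):
    """Benjamini-Yekutieli: BH with alpha divided by sum_{i=1}^p 1/i,
    which remains valid under arbitrary dependence between the tests."""
    p = len(pvals)
    return bh_cutoff(pvals, alpha / np.sum(1.0 / np.arange(1, p + 1)))
```

The BY correction shrinks the threshold by the harmonic sum, so it always rejects at most as many hypotheses as BH.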

SLIDE 10

Selecting Predictors: Expanded Higher Criticism

A selection scheme for problems where effects are rare and most (but not necessarily all) effects are weak:

1 - Include all variables that satisfy Benjamini and Yekutieli’s criterion.
2 - Estimate an “empirical null distribution”.
3 - Compute p-values for the effects of the non-selected variables, based on the null estimated in step 2.
4 - Find the HC* threshold from the p-values computed in step 3.
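The four steps above can be chained into one routine. The sketch below makes several simplifying assumptions not stated on the slide: normal-approximation p-values for the t-scores, a moment-based (mean/sd) empirical null, and the illustrative name `expanded_hc_select`:

```python
import numpy as np
from scipy import stats

def expanded_hc_select(tscores, alpha=0.05, alpha0=0.5):
    """Expanded HC sketch: BY selection, empirical null on the rest, HC* top-up."""
    tscores = np.asarray(tscores)
    p = len(tscores)
    pv = 2.0 * stats.norm.sf(np.abs(tscores))        # normal-approx p-values
    order = np.argsort(pv)
    # Step 1: Benjamini-Yekutieli criterion
    thr = np.arange(1, p + 1) / p * alpha / np.sum(1.0 / np.arange(1, p + 1))
    ok = np.nonzero(pv[order] <= thr)[0]
    k_by = int(ok[-1]) + 1 if ok.size else 0
    selected, rest = order[:k_by], order[k_by:]
    # Step 2: empirical null N(mu0, s0) fitted to the non-selected scores
    mu0, s0 = tscores[rest].mean(), tscores[rest].std(ddof=1)
    # Step 3: p-values of the non-selected effects under the empirical null
    pv_null = 2.0 * stats.norm.sf(np.abs((tscores[rest] - mu0) / s0))
    # Step 4: HC* threshold over those p-values
    m = len(pv_null)
    pi = np.sort(pv_null)
    jmax = min(int(alpha0 * m), m - 1)
    j = np.arange(1, jmax + 1)
    frac = j / m
    hc = np.sqrt(m) * (frac - pi[:jmax]) / np.sqrt(frac * (1.0 - frac))
    jstar = int(np.argmax(hc)) + 1
    extra = rest[np.argsort(pv_null)][:jstar]
    return np.sort(np.concatenate([selected, extra]))
```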

SLIDE 11

Singh’s Prostate Cancer Data – p = 6033; n = 50 + 52

Rule                        | Error Estimate (std error) | # Variables kept (min – median – max)
Factor-based LDA* (q=1)     | 0.0641 (0.0052)            | 58 – 134.5 – 421
Fisher’s LDA*               | 0.2146 (0.0101)            | 58 – 134.5 – 421
Naive Bayes*                | 0.0670 (0.0052)            | 58 – 134.5 – 421
Support Vector Machines*    | 0.0642 (0.0052)            | 58 – 134.5 – 421
Nearest Shrunken Centroids  | 0.0838 (0.0063)            | 108 – 356 – 1771
Regularized DA              | 0.0741 (0.0053)            | 82 – 390 – 1201
Shrunken DA*                | 0.0650 (0.0051)            | 58 – 134.5 – 421
NLDA*                       | 0.0720 (0.0052)            | 58 – 134.5 – 421

* After variable selection by the maximum of the FDR (false discovery rate) and HC (Higher Criticism) thresholds, both derived from independence-based t-scores. The p-values used in the HC computations are derived from empirical null distributions.

SLIDE 12

Golub’s Leukemia Data – p = 7129; n = 47 + 25

Rule                        | Error Estimate (std error) | # Variables kept (min – median – max)
Factor-based LDA* (q=1)     | 0.0174 (0.0034)            | 326 – 478 – 712
Fisher’s LDA*               | 0.2558 (0.0109)            | 326 – 478 – 712
Naive Bayes*                | 0.480 (0.0085)             | 326 – 478 – 712
Support Vector Machines*    | 0.0405 (0.0049)            | 326 – 478 – 712
Nearest Shrunken Centroids  | 0.0201 (0.0039)            | 703 – 3166 – 7129
Regularized DA              | 0.0491 (0.0062)            | 12 – 1934 – 7124
Shrunken DA*                | 0.0276 (0.0044)            | 326 – 478 – 712
NLDA*                       | 0.1510 (0.0085)            | 326 – 478 – 712

* Variable selection as described in Slide 11.

SLIDE 13

Alon’s Colon Data – p = 2000; n = 40 + 22

Rule                        | Error Estimate (std error) | # Variables kept (min – median – max)
Factor-based LDA* (q=1)     | 0.1746 (0.0098)            | 3 – 71.5 – 200
Fisher’s LDA*               | 0.3285 (0.0143)            | 3 – 71.5 – 200
Naive Bayes*                | 0.2275 (0.0133)            | 3 – 71.5 – 200
Support Vector Machines*    | 0.1576 (0.0095)            | 3 – 71.5 – 200
Nearest Shrunken Centroids  | 0.1563 (0.0098)            | 7 – 39 – 527
Regularized DA              | 0.2174 (0.0126)            | 14 – 425 – 2000
Shrunken DA*                | 0.1865 (0.0100)            | 3 – 71.5 – 200
NLDA*                       | 0.2614 (0.0114)            | 3 – 71.5 – 200

* Variable selection as described in Slide 11.

SLIDE 14

Conclusions

  • A factor-model classification rule, designed for high-dimensional correlated data, was proposed.
  • Asymptotic analysis shows that, as p → ∞, the new rule can approach a low expected error rate, often much lower than that of independence-based rules or unrestricted covariance rules.
  • Empirical comparisons suggest that, when combined with sensible variable selection schemes, the new rule is highly competitive in microarray applications.

SLIDE 15

Open Questions

  • Should correlations also be incorporated in the selection scheme? When and how?
  • How do factor-based rules perform in problems with more than two groups?
  • Do differences in misclassification costs affect the relative standing of different classification rules?

SLIDE 16

References

  • Ahdesmäki, M. and Strimmer, K. (2009). Feature selection in “omics” prediction problems using cat scores and false non-discovery rate control. arXiv: stat.AP 0903.2003v1.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57, 289-300.
  • Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165-1188.
  • Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32, 962-994.
  • Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105, 14790-14795.
  • Donoho, D. and Jin, J. (2009). Feature selection by higher criticism thresholding: Optimal phase diagram. Philosophical Transactions of the Royal Society A 367, 4449-4470.
  • Duarte Silva, A.P. (2009). Linear discriminant analysis with more variables than observations: A not so naïve approach. In: Classification as a Tool for Research. Proceedings of the 11th IFCS Biennial Conference and 33rd Annual Conference of the Gesellschaft für Klassifikation, Dresden, Germany, 227-234.
  • Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statistical Science 23, 1-22.
  • Guo, Y., Hastie, T. and Tibshirani, R. (2007). Regularized discriminant analysis and its application in microarrays. Biostatistics 8, 86-100.
  • Thomaz, C.E. and Gillies, D.F. (2005). A maximum uncertainty LDA-based approach for limited sample size problems with application to face recognition. In: 18th Brazilian Symposium on Computer Graphics and Image Processing, SIBGRAPI 2005, 89-96.
  • Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003). Class prediction by nearest shrunken centroids with applications to DNA microarrays. Statistical Science 18, 104-117.