
SLIDE 1

High Dimensional Classification in the Presence of Correlation: A Factor Model Approach

A. PEDRO DUARTE SILVA (*)

Faculdade de Economia e Gestão / Centro de Estudos em Gestão e Economia
Universidade Católica Portuguesa
Centro Regional do Porto

Compstat’ 2010, PARIS, 23-28 August 2010

(*) Supported by: FEDER / POCI 2010

SLIDE 2

Overview

  • 1. A factor-model linear classification rule for high-dimensional correlated data
  • 2. Asymptotic properties with p → ∞
  • 3. Variable selection for problems with “rare” and “mostly weak” group differences
  • 4. Performance in microarray problems
  • 5. Conclusions and perspectives

Compstat’ 2010: High Dimensional Correlation Adjusted Classification

SLIDE 3

Problem Statement: (Y, X)

Y ∈ {0, 1},  X ∈ ℝ^p,  X | Y ~ N_p(μ^(Y), Σ)

We want to find a rule that predicts Y given X. Assuming these distributions, the Bayes rule is

Ŷ = argmax_g π_g f_g(X)

which reduces to the linear rule

Ŷ = 1{ Δ^T Σ^{-1} ( X − (μ^(1) + μ^(0)) / 2 ) ≥ log(π_0 / π_1) },   Δ = μ^(1) − μ^(0)

How to estimate Σ^{-1} when p > n and the X correlations are important?
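The Bayes linear rule above can be sketched in a few lines of NumPy, assuming the parameters are known (the helper name `lda_rule` is illustrative, not part of the presentation):

```python
import numpy as np

def lda_rule(x, mu0, mu1, Sigma_inv, pi0=0.5, pi1=0.5):
    """Bayes rule for X | Y = g ~ N(mu_g, Sigma):
    predict 1 iff Delta' Sigma^{-1} (x - (mu1 + mu0)/2) >= log(pi0/pi1)."""
    delta = mu1 - mu0                              # Delta = mu^(1) - mu^(0)
    score = delta @ Sigma_inv @ (x - 0.5 * (mu1 + mu0))
    return int(score >= np.log(pi0 / pi1))
```

In practice the parameters are replaced by estimates; estimating Σ^{-1} when p > n is exactly the difficulty addressed next.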

SLIDE 4

A Factor-Model Approach

X_i = μ^(Y_i) + B f_i + ε_i,   f_i ∈ ℝ^q,  ε_i ∈ ℝ^p,  q << p

f_i ~ N_q(0, I_q),   ε_i ~ N_p(0, D_ε)

Σ = B B^T + D_ε

Σ^{-1} = D_ε^{-1} − D_ε^{-1} B [ I_q + B^T D_ε^{-1} B ]^{-1} B^T D_ε^{-1}

Estimation, scaling by the sample variances V̂ and fitting in the Frobenius norm:

( B̂, D̂_ε ) = argmin_{B, D} || V̂^{-1/2} S V̂^{-1/2} − V̂^{-1/2} ( B B^T + D ) V̂^{-1/2} ||_F^2

Σ̂_RFctq = B̂ B̂^T + D̂_ε,   with D̂_ε(j) ≥ k_0 > 0 for all j
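Under the factor decomposition, Σ^{-1} never requires a p × p inversion: the identity on the slide (the Woodbury identity) reduces it to a q × q solve. A minimal NumPy sketch, assuming B and the diagonal of D_ε are given:

```python
import numpy as np

def factor_precision(B, d):
    """Sigma^{-1} for Sigma = B B' + diag(d) via the Woodbury identity:
    D^{-1} - D^{-1} B (I_q + B' D^{-1} B)^{-1} B' D^{-1}  (O(p q^2) cost)."""
    q = B.shape[1]
    Dinv_B = B / d[:, None]                 # D^{-1} B, assuming d > 0
    M = np.eye(q) + B.T @ Dinv_B            # only a q x q system to solve
    return np.diag(1.0 / d) - Dinv_B @ np.linalg.solve(M, Dinv_B.T)
```

With q << p this is what makes the factor-based rule computable when p is in the thousands.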

SLIDE 5

Asymptotic Properties

We will compare empirical linear rules of the form

δ_L(X) = 1{ Δ̂^T Σ̂_L^{-1} ( X − ( X̄_1 + X̄_0 ) / 2 ) ≥ log( n_0 / n_1 ) }

based on the worst-case misclassification criterion

W(δ_L) = max_{θ ∈ Γ_θ} P_θ( δ_L(X) ≠ Y ),   with conditional error 1 − Φ( Δ^T Σ̂_L^{-1} Δ / ( 2 ( Δ^T Σ̂_L^{-1} Σ Σ̂_L^{-1} Δ )^{1/2} ) )

for some parameter space Γ_θ and an estimator Δ̂ satisfying

( C1 )   max_{θ ∈ Γ_θ} E || Δ̂ − Δ ||^2 → 0   when p → ∞ with n(p) ∝ p^d

SLIDE 6

Asymptotic Properties: Main Result

Parameter space: θ = (μ^(0), μ^(1), Σ) belongs to Γ_θ(k_1, k_2, k_F, q, B, c) when

  • Δ^T Σ^{-1} Δ ≥ c
  • k_1 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ k_2
  • |B(j, a)| ≤ β(j, a) for all j, a
  • min_j D_ε(j) and the residual correlations max_{j' ≠ l'} |R_ε(j', l')| are suitably bounded

Main Result: when ( C1 ) is satisfied,

W(δ_Fq) → 1 − Φ( ( c K_Γ0Fq )^{1/2} / ( 1 + K_Γ0Fq ) )

with

K_Γ0Fq = max_{Γ_θ} λ_max(Σ_0Fq) / λ_min(Σ_0Fq),   Σ_0Fq = Σ_RFctq^{-1/2} Σ Σ_RFctq^{-1/2}

Σ_RFctq = B̃ B̃^T + D̃,   ( B̃, D̃ ) = argmin_{B, D} || R − V^{-1/2} ( B B^T + D ) V^{-1/2} ||_F^2

It follows that the result holds when p → ∞ with n(p) ∝ log p and π_0 = π_1 = 1/2.

SLIDE 7

Selecting Predictors: Higher Criticism

1 - Rank variables according to two-sample t-scores.

2 - Choose a selection cut-off for the score values (Donoho and Jin 2004). Given the p ordered p-values π_(1), ..., π_(p):

HC(j; π_(j)) = √p ( j/p − π_(j) ) / ( (j/p)(1 − j/p) )^{1/2}

HC* = max_{j ≤ α_0 p} HC(j; π_(j))
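The HC* computation above can be sketched directly in NumPy (the name `hc_threshold` is illustrative; it returns the maximizing index j*, i.e. how many of the smallest p-values to keep):

```python
import numpy as np

def hc_threshold(pvals, alpha0=0.5):
    """HC(j) = sqrt(p) (j/p - pi_(j)) / sqrt((j/p)(1 - j/p)),
    maximised over j <= alpha0 * p (j = p excluded: zero denominator)."""
    p = len(pvals)
    pi = np.sort(np.asarray(pvals))
    jmax = min(int(alpha0 * p), p - 1)
    j = np.arange(1, jmax + 1)
    frac = j / p
    hc = np.sqrt(p) * (frac - pi[:jmax]) / np.sqrt(frac * (1.0 - frac))
    jstar = int(np.argmax(hc)) + 1      # number of variables to keep
    return hc[jstar - 1], jstar
```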

SLIDE 8

Selecting Predictors: Higher Criticism

(Donoho and Jin 2009) In a two-group homoskedastic model with:

  • independent variables
  • rare “effects” (mean group differences)
  • weak effects
  • p-values derived from two-group t-scores

HC* is asymptotically equivalent, as p → ∞, to the optimal selection threshold for diagonal classification rules.

SLIDE 9

Selecting Predictors: Control of false discovery rates

Given a sequence of p independent tests with ordered p-values π_(1), ..., π_(p), reject the null hypotheses H_0(j) for j ≤ k, with (Benjamini and Hochberg 1995):

k = max{ j : π_(j) ≤ ( j / p ) α }

Given a sequence of p dependent tests with ordered p-values π_(1), ..., π_(p), reject the null hypotheses H_0(j) for j ≤ k, with (Benjamini and Yekutieli 2001):

k = max{ j : π_(j) ≤ ( j / ( p Σ_{i=1}^{p} 1/i ) ) α }
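Both cut-offs are short computations over the sorted p-values. A NumPy sketch (illustrative names; each function returns k, the number of rejected hypotheses):

```python
import numpy as np

def bh_cutoff(pvals, alpha):
    """Benjamini-Hochberg: k = max{ j : pi_(j) <= (j/p) alpha } (0 if empty)."""
    pi = np.sort(np.asarray(pvals))
    p = len(pi)
    ok = np.nonzero(pi <= np.arange(1, p + 1) / p * alpha)[0]
    return int(ok[-1]) + 1 if ok.size else 0

def by_cutoff(pvals, alpha):
    """Benjamini-Yekutieli: BH with alpha divided by sum_{i=1}^p 1/i,
    which remains valid under arbitrary dependence between the tests."""
    p = len(pvals)
    return bh_cutoff(pvals, alpha / np.sum(1.0 / np.arange(1, p + 1)))
```

The BY correction shrinks the threshold by the harmonic sum, so it always rejects at most as many hypotheses as BH.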

SLIDE 10

Selecting Predictors: Expanded Higher Criticism

A selection scheme for problems where effects are rare and most (but not necessarily all) effects are weak:

1 - Include all variables that satisfy Benjamini and Yekutieli’s criterion.
2 - Estimate an “empirical null distribution”.
3 - Compute p-values for the effects of the non-selected variables, based on the null estimated in step 2.
4 - Find the HC* threshold from the p-values computed in step 3.
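The four steps above can be chained into one routine. The sketch below makes several simplifying assumptions not stated on the slide: normal-approximation p-values for the t-scores, a moment-based (mean/sd) empirical null, and the illustrative name `expanded_hc_select`:

```python
import numpy as np
from scipy import stats

def expanded_hc_select(tscores, alpha=0.05, alpha0=0.5):
    """Expanded HC sketch: BY selection, empirical null on the rest, HC* top-up."""
    tscores = np.asarray(tscores)
    p = len(tscores)
    pv = 2.0 * stats.norm.sf(np.abs(tscores))        # normal-approx p-values
    order = np.argsort(pv)
    # Step 1: Benjamini-Yekutieli criterion
    thr = np.arange(1, p + 1) / p * alpha / np.sum(1.0 / np.arange(1, p + 1))
    ok = np.nonzero(pv[order] <= thr)[0]
    k_by = int(ok[-1]) + 1 if ok.size else 0
    selected, rest = order[:k_by], order[k_by:]
    # Step 2: empirical null N(mu0, s0) fitted to the non-selected scores
    mu0, s0 = tscores[rest].mean(), tscores[rest].std(ddof=1)
    # Step 3: p-values of the non-selected effects under the empirical null
    pv_null = 2.0 * stats.norm.sf(np.abs((tscores[rest] - mu0) / s0))
    # Step 4: HC* threshold over those p-values
    m = len(pv_null)
    pi = np.sort(pv_null)
    jmax = min(int(alpha0 * m), m - 1)
    j = np.arange(1, jmax + 1)
    frac = j / m
    hc = np.sqrt(m) * (frac - pi[:jmax]) / np.sqrt(frac * (1.0 - frac))
    jstar = int(np.argmax(hc)) + 1
    extra = rest[np.argsort(pv_null)][:jstar]
    return np.sort(np.concatenate([selected, extra]))
```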

SLIDE 11

Singh’s Prostate Cancer Data – p = 6033; n = 50 + 52

Rule                        | Error Estimate (std error) | # Variables kept (min – median – max)
Factor-based LDA* (q=1)     | 0.0641 (0.0052)            | 58 – 134.5 – 421
Fisher’s LDA*               | 0.2146 (0.0101)            | 58 – 134.5 – 421
Naive Bayes*                | 0.0670 (0.0052)            | 58 – 134.5 – 421
Support Vector Machines*    | 0.0642 (0.0052)            | 58 – 134.5 – 421
Nearest Shrunken Centroids  | 0.0838 (0.0063)            | 108 – 356 – 1771
Regularized DA              | 0.0741 (0.0053)            | 82 – 390 – 1201
Shrunken DA*                | 0.0650 (0.0051)            | 58 – 134.5 – 421
NLDA*                       | 0.0720 (0.0052)            | 58 – 134.5 – 421

* After variable selection by the maximum of the FDR (false discovery rate) and HC (Higher Criticism) thresholds, both derived from independence-based t-scores. The p-values used in the HC computations are derived from empirical null distributions.

SLIDE 12

Golub’s Leukemia Data – p = 7129; n = 47 + 25

Rule                        | Error Estimate (std error) | # Variables kept (min – median – max)
Factor-based LDA* (q=1)     | 0.0174 (0.0034)            | 326 – 478 – 712
Fisher’s LDA*               | 0.2558 (0.0109)            | 326 – 478 – 712
Naive Bayes*                | 0.480 (0.0085)             | 326 – 478 – 712
Support Vector Machines*    | 0.0405 (0.0049)            | 326 – 478 – 712
Nearest Shrunken Centroids  | 0.0201 (0.0039)            | 703 – 3166 – 7129
Regularized DA              | 0.0491 (0.0062)            | 12 – 1934 – 7124
Shrunken DA*                | 0.0276 (0.0044)            | 326 – 478 – 712
NLDA*                       | 0.1510 (0.0085)            | 326 – 478 – 712

* Variable selection as described in Slide 11.

SLIDE 13

Alon’s Colon Data – p = 2000; n = 40 + 22

Rule                        | Error Estimate (std error) | # Variables kept (min – median – max)
Factor-based LDA* (q=1)     | 0.1746 (0.0098)            | 3 – 71.5 – 200
Fisher’s LDA*               | 0.3285 (0.0143)            | 3 – 71.5 – 200
Naive Bayes*                | 0.2275 (0.0133)            | 3 – 71.5 – 200
Support Vector Machines*    | 0.1576 (0.0095)            | 3 – 71.5 – 200
Nearest Shrunken Centroids  | 0.1563 (0.0098)            | 7 – 39 – 527
Regularized DA              | 0.2174 (0.0126)            | 14 – 425 – 2000
Shrunken DA*                | 0.1865 (0.0100)            | 3 – 71.5 – 200
NLDA*                       | 0.2614 (0.0114)            | 3 – 71.5 – 200

* Variable selection as described in Slide 11.

SLIDE 14

Conclusions

  • A factor-model classification rule, designed for high-dimensional correlated data, was proposed.
  • Asymptotic analysis shows that, as p → ∞, the new rule can approach a low expected error rate, often much lower than that of independence-based rules or unrestricted covariance rules.
  • Empirical comparisons suggest that, when combined with sensible variable selection schemes, the new rule is highly competitive in microarray applications.

SLIDE 15

Open Questions

  • Should correlations also be incorporated in the selection scheme? When and how?
  • How do factor-based rules perform in problems with more than two groups?
  • Do differences in misclassification costs affect the relative standing of different classification rules?

SLIDE 16

References

  • Ahdesmäki, M. and Strimmer, K. (2009). Feature selection in “omics” prediction problems using cat scores and false non-discovery rate control. arXiv: stat.AP 0903.2003v1.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57, 289-300.
  • Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165-1188.
  • Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32, 962-994.
  • Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105, 14790-14795.
  • Donoho, D. and Jin, J. (2009). Feature selection by higher criticism thresholding: Optimal phase diagram. Philosophical Transactions of the Royal Society A 367, 4449-4470.
  • Duarte Silva, A.P. (2009). Linear discriminant analysis with more variables than observations: A not so naïve approach. In: Classification as a Tool for Research. Proceedings of the 11th IFCS Biennial Conference and 33rd Annual Conference of the Gesellschaft für Klassifikation, Dresden, Germany, 227-234.
  • Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statistical Science 23, 1-22.
  • Guo, Y., Hastie, T. and Tibshirani, R. (2007). Regularized discriminant analysis and its application in microarrays. Biostatistics 8, 86-100.
  • Thomaz, C.E. and Gillies, D.F. (2005). A maximum uncertainty LDA-based approach for limited sample size problems with application to face recognition. In: 18th Brazilian Symposium on Computer Graphics and Image Processing, SIBGRAPI 2005, 89-96.
  • Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003). Class prediction by nearest shrunken centroids with applications to DNA microarrays. Statistical Science 18, 104-117.