classifica4on
play

Classifica4on Binary classifica,on Given a set of examples ( x i , - PDF document

4/17/09 CSCI1950Z Computa4onal Methods for Biology Lecture 18 Ben Raphael April 8, 2009 hIp://cs.brown.edu/courses/csci1950z/ Classifica4on Binary classifica,on Given a set of examples ( x i , y i ) , where y i = + 1, from unknown


  1. 4/17/09 CSCI1950‐Z Computa4onal Methods for Biology Lecture 18 Ben Raphael April 8, 2009 hIp://cs.brown.edu/courses/csci1950‐z/ Classifica4on Binary classifica,on Given a set of examples ( x i , y i ) , where y i = +‐ 1, from unknown distribu4on D. Design func4on f: R n  {‐1,+1} that op+mally assigns addi4onal samples x i to one of two classes. Supervised learning ( x i , y i ) training data x i (j) : feature. R n : feature space. 1

  2. 4/17/09 Dimensionality Reduc4on Genomic data (e.g. gene expression) o[en high‐ dimensional ( n > 5000), but rela4vely few samples available. Reduce dimensionality of data (lower dimensional subspace) to improve performance of the classifier by: • Removing features that do not contribute to the classifica4on and may introduce noise. • Reducing opportuni4es for overfi]ng. • Improving 4me/memory efficiency in algorithms for learning and classifica4on. Feature Construc4on n features l features Linear/nonlinear transforma4on Common method: Principal components analysis. [Whiteboard] 2

  3. 4/17/09 PCA and Clustering Yeast gene expression data (477 genes) clustered into 7 clusters. First two principal components contain ≈89% of varia4on in data. Yeung and Ruzzo (Bioinforma4cs 2001) PCA and Clustering Exon and junc4on microarrays detect widespread mouse strain‐ and sex‐bias expression differences. (Su et al. BMC Genomics 2008) 3

  4. 4/17/09 Feature Selec4on • Selec4ng l << n features that are informa(ve for classifica4on. • Gene expression: subset of genes. Feature Selec4on Informa4ve features: Use a measure of associa4on between x i and y i . � m k =1 ( x i k − x i )( y i k − y i ) r x i y i = • Correla4on: ( m − 1) s x i s y i • Chi‐square (con4ngency table) F ( x i ) = | ( x i ) + − ( x i ) − | 2 s 2 ( x i ) + + s 2 • Fischer criterion: ( x i ) − (x i ) + are elements in + class. • t ‐test sta4s4c • Mutual informa4on • TNoM score (previous lecture) [Whiteboard] 4

  5. 4/17/09 Feature Selec4on Results Colon Leukemia Feature Selec4on Results Top scoring genes (TNoM < 14) in colon dataset. 5

  6. 4/17/09 Assessing Performance Feature Selec4on Build Test (e.g. TNoM) Classifier WRONG Cross‐valida4on Assessing Performance Feature Selec4on Build Test (e.g. TNoM) Classifier Cross‐valida4on Must assess performance of both steps together! 6

  7. 4/17/09 Gene Selec4on Results Predictors of Breast Cancer Prognosis • 70 gene signature to predict breast cancer pa4ents with metastasis within 5 years (van de Vijver et al. NEJM 2002, van’t Veer et al. Nature 2002) • Now an FDA approved test: Mammaprint 7

  8. 4/17/09 Predictors of Breast Cancer Prognosis Step 1: Clustering n = 25000 genes 98 tumors. >2 fold change (and p<0.01) in >4 tumors n = 5000 differen+ally expressed genes Hierarchical clustering: genes and samples Predictors of Breast Cancer Prognosis Step 2: Classifica4on n = 5000 genes differen4ally expressed genes in 78 (sporadic lymph‐node nega4ve) tumors. Compute correla4on coefficient ρ( x i , y i ) Between each gene and prognosis. Choose 231 genes with |ρ( x i , y i )| > 0.3. Rank genes by ρ( x i , y i ). 8

  9. 4/17/09 Predictors of Breast Cancer Prognosis Step 2: Classifica4on n = 5000 genes differen4ally expressed genes in 78 (sporadic lymph‐node nega4ve) tumors. Compute correla4on coefficient ρ( x i , y i ) Between each gene and prognosis. Choose 231 genes with |ρ( x i , y i )| > 0.3. Rank genes by ρ( x i , y i ). Predictors of Breast Cancer Prognosis Step 3: Build a classifier Leave‐out one sample x . Let R = top 5 genes in list of 231. Compute correla4on coefficients ρ(μ( x R+ ) , x R ) and ρ(μ( x R‐ ) , x R ), where μ( x R+ ) is mean vector of genes in + class in R . Assign to best class. Add 5 genes to R un4l performance does not improve. 9

  10. 4/17/09 Predictors of Breast Cancer Prognosis • 70 gene classifier • 65/78 (83%) of pa4ents predicted correctly. – 5 poor and 8 good incorrectly assigned. • Changing threshold gave 3 poor and 12 good incorrectly assigned. Discussion • Cross‐valida4on done a[er feature selec4on! – Also fixed this problem. • Resul4ng 70 gene signature is not unique (Ein‐Dor et. al 2005: see notes) • Drawing biological conclusions from the output of a “black box” predic4on algorithm is not wise. – Correla4on vs. causality. 10

  11. 4/17/09 Results: Class Discovery with TNoM (ben‐Dor, Friedman, Yakhini, 2001) • Find op4mal labeling L. – Solu4on: use heuris4c search • Find mul4ple (subop4mal) labelings – Solu4on: Peeling: remove previously used genes from set. Results: Class Discovery with TNoM (ben‐Dor, Friedman, Yakhini, 2001) Leukemia (Golub et al. 1999): 72 expression profiles. 25 AML, 47 ALL. 7129 genes Lymphoma (Alizadeh et al.): 96 expression profiles, 46 Diffuse large B‐cell lymphoma (DLBCL) 50 from 8 different 4ssues. Lymphoma‐DLBCL: subset of 46 of above. 11

  12. 4/17/09 TNoM Results (ben‐Dor, Friedman, Yakhini, 2001) % survival years 40 pa4ents 24 pa4ents with low clinical risk. 12

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend