Neural Network Classifiers and Gene Selection Methods for Microarray Data on Human Lung Adenocarcinoma - PowerPoint PPT Presentation


SLIDE 1

Neural Network Classifiers and Gene Selection Methods for Microarray Data on Human Lung Adenocarcinoma

Gaolin Zheng

School of Computer Science Florida International University, Miami, FL

E.O. George

Department of Mathematical Sciences, University of Memphis

G. Narasimhan

School of Computer Science, Florida International University

SLIDE 2

Zheng et al. 2

Our Work

  • Building classifiers to predict tumor stage from gene expression data.
  • Comparative study of neural network classifiers.
  • Comparative study of gene selection methods.
  • Exploring data integration.

SLIDE 3

Data Set 1

(Michigan)

              Stage 1 Tumor    Stage 3 Tumor    Normal Lung
              Female   Male    Female   Male
Non-smoking      9       1        2       7
Smoking         10      33       24      10

86 patients with adenocarcinoma, divided into:
  • Stage 1 and Stage 3 tumors
  • Male and female
  • Smoking and non-smoking
Data is severely unbalanced.
10 non-neoplastic (normal) lung samples and their gene expression, but with no additional information (e.g. gender, smoking).
7129 probe sets.

SLIDE 4

Data Set 2

(Boston)

              Stage 1 Tumor    Stage 2 Tumor    Stage 3 Tumor    Normal Lung
              Female   Male    Female   Male    Female   Male
Non-smoking      5       2        3       2
Smoking         13      39       30      12        9       4          7

113 patients with adenocarcinoma, divided into:
  • Stage 1, Stage 2, and Stage 3 tumors
  • Male and female
  • Smoking and non-smoking
13 normal lung samples without any additional information.
Over 12600 probe sets.
Looked at 490 overlapping probe sets.

SLIDE 5

Gene Selection

Goal: Find genes that discriminate on the basis of tumor stage information. Methods:

  • ANOVA
  • SAM (http://www-stat.stanford.edu/~tibs/SAM/)
  • GS
  • GS-Robust
  • PCA (select principal components contributing to >80% of the variation)
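The PCA criterion above can be sketched in plain numpy. This is illustrative only; the slide does not specify the implementation, and the matrix shapes and toy data are assumptions:

```python
import numpy as np

def pca_scores_80(X, threshold=0.80):
    """Project samples onto the leading principal components that
    together explain more than `threshold` of the variation.

    X: samples x genes expression matrix (shapes are assumptions).
    """
    Xc = X - X.mean(axis=0)                          # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = s ** 2 / np.sum(s ** 2)              # variance explained per PC
    n = int(np.searchsorted(np.cumsum(var_ratio), threshold)) + 1
    return Xc @ Vt[:n].T, n                          # PC scores, number of PCs kept

# Toy expression matrix: 30 samples x 100 genes of random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
scores, n = pca_scores_80(X)
```

The retained PC scores, rather than raw probe-set values, would then feed the classifier.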
SLIDE 6

ANOVA Based Gene Selection

  • For individual data sets:
    • Single-factor (stage)
    • Multifactor (stage, gender, smoking)
  • For the integrated data set:
    • Single-factor and multifactor models
    • Mixed-effect model (stage as fixed factor, lab as random factor) for the 490 overlapping probe sets
  • Use the P-value for stage to rank the significance of genes.
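The ranking step can be sketched with plain numpy. Because the design is the same for every gene, ranking by the single-factor F statistic is equivalent to ranking by the stage P-value (larger F, smaller P); the toy labels and gene count below are invented for illustration:

```python
import numpy as np

def anova_f(groups):
    """Single-factor ANOVA F statistic for one gene."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(X, stage):
    """Rank genes most-significant first by the stage factor."""
    fs = np.array([anova_f([X[stage == s, j] for s in np.unique(stage)])
                   for j in range(X.shape[1])])
    return np.argsort(-fs), fs

# Toy data: gene 0 carries a stage effect, gene 1 is pure noise.
rng = np.random.default_rng(1)
stage = np.repeat([1, 3], 20)
X = rng.normal(size=(40, 2))
X[stage == 3, 0] += 2.0                  # stage effect on gene 0 only
order, fs = rank_genes(X, stage)
```

A full multifactor or mixed-effect fit would use a modeling package (the authors worked in R); this sketch covers only the single-factor ranking.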

SLIDE 7

Gene Selection Method: GS

$$\mathrm{GS}_i = \frac{\displaystyle\sum_{j=1}^{k}\left(\bar{g}_{ij\cdot}-\bar{g}_{i\cdot\cdot}\right)^2 \Big/ (k-1)}{\displaystyle\sum_{j=1}^{k}\sum_{l=1}^{n_{ij}}\left(g_{ijl}-\bar{g}_{ij\cdot}\right)^2 \Big/ \sum_{j=1}^{k}\left(n_{ij}-1\right)}$$

where

$$\bar{g}_{ij\cdot} = \mathrm{mean}(g_{ij}), \qquad \bar{g}_{i\cdot\cdot} = \mathrm{mean}\left\{\bar{g}_{ij\cdot},\ j=1,\dots,k\right\}$$

A measure of the ratio of inter-group to intra-group variation.
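The statistic compares between-class to within-class variation gene by gene; a minimal numpy sketch of the formula as reconstructed above, with invented data:

```python
import numpy as np

def gs_score(groups):
    """GS for one gene: between-class over within-class variation.

    groups: list of 1-D arrays, expression of this gene in each of k classes.
    """
    k = len(groups)
    class_means = np.array([g.mean() for g in groups])        # g_ij.
    grand_mean = class_means.mean()                            # g_i..
    between = np.sum((class_means - grand_mean) ** 2) / (k - 1)
    within = (sum(np.sum((g - g.mean()) ** 2) for g in groups)
              / sum(len(g) - 1 for g in groups))
    return between / within

# Toy genes: one with a clear class effect, one without.
rng = np.random.default_rng(2)
separated = [rng.normal(0, 1, 25), rng.normal(3, 1, 25)]
flat = [rng.normal(0, 1, 25), rng.normal(0, 1, 25)]
```

Genes are then ranked by GS, largest first.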

SLIDE 8

GS-Robust

$$\mathrm{GSRobust}_i = \frac{\mathrm{MAD}\left[\mathrm{median}(g_{i1}),\dots,\mathrm{median}(g_{ik})\right]}{\displaystyle\sum_{j=1}^{k}\mathrm{MAD}(g_{ij})}$$

GSRobust_i: the GSRobust value for the ith gene. MAD: median absolute deviation. g_ij: the vector of gene expression values corresponding to the ith gene and jth class.

A robust measure of the ratio of inter-group to intra-group variation.
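A numpy sketch of the robust variant as reconstructed above (toy data invented; medians and MADs replace means and sums of squares so single outliers have little effect):

```python
import numpy as np

def mad(x):
    """Median absolute deviation."""
    return np.median(np.abs(x - np.median(x)))

def gs_robust(groups):
    """GS-Robust for one gene: MAD of the class medians over the
    summed within-class MADs."""
    medians = np.array([np.median(g) for g in groups])
    return mad(medians) / sum(mad(g) for g in groups)

# Toy genes, as before: one separated by class, one flat.
rng = np.random.default_rng(2)
separated = [rng.normal(0, 1, 25), rng.normal(3, 1, 25)]
flat = [rng.normal(0, 1, 25), rng.normal(0, 1, 25)]
```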

SLIDE 9

Classifiers

  • Feed-forward neural network (nnet() from R). Yet another machine learning classifier. Yawn!
  • FNN with Bayesian learning of network weights.
  • Neural Network Ensembles:
    • Bagging (Breiman, 1994)
    • Boosting (Freund and Schapire, 1996)
SLIDE 10

Bayesian Neural Networks: Bayesian Learning of the Weights

Choose initial values of the hyperparameters α and β.

Prior over the weights: W ~ N(0, 1/α)

Total error minimized by the classifier:

$$S(W) = \beta E_D + \alpha E_W$$

where E_D is the error term and E_W is the regularization term.

Hyperparameter updates:

$$\gamma \equiv \sum_{i}\frac{\lambda_i}{\lambda_i + \alpha}, \qquad \alpha_{\mathrm{new}} = \frac{\gamma}{2E_W}, \qquad \beta_{\mathrm{new}} = \frac{N-\gamma}{2E_D}$$

where the λ_i are the eigenvalues of the Hessian matrix.
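One re-estimation step of these updates can be exercised directly; in this small numpy sketch the eigenvalues, error values, and N are invented for illustration:

```python
import numpy as np

def update_hyperparameters(eigvals, alpha, E_W, E_D, N):
    """One evidence-approximation update of alpha and beta.

    eigvals: eigenvalues lambda_i of the Hessian matrix,
    E_W: regularization term, E_D: data error term,
    N: number of training targets.
    """
    gamma = np.sum(eigvals / (eigvals + alpha))  # effective number of parameters
    alpha_new = gamma / (2.0 * E_W)
    beta_new = (N - gamma) / (2.0 * E_D)
    return alpha_new, beta_new, gamma

# Invented values, purely to show the arithmetic.
alpha_new, beta_new, gamma = update_hyperparameters(
    eigvals=np.array([10.0, 5.0, 0.1]), alpha=1.0, E_W=2.0, E_D=4.0, N=50)
```

In training, this update alternates with re-fitting the weights until α and β stabilize.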

SLIDE 11

Bagging Classifiers

[Diagram: multiple individual classifiers combined into an ensembled classifier.]
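The bagging pipeline can be sketched with a simple stand-in base learner. The slides use neural networks; the nearest-mean learner and toy data here are assumptions made so the example stays self-contained:

```python
import numpy as np

class NearestMean:
    """Stand-in base learner (the slides use neural networks)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def bagging_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Bagging (Breiman, 1994): train each model on a bootstrap
    resample of the training set, then combine by majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
        model = NearestMean().fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    votes = np.array(votes)
    return np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote

# Toy two-class data with well-separated Gaussian clouds.
rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.repeat([0, 1], 20)
X_test = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_test = np.repeat([0, 1], 10)
pred = bagging_predict(X_train, y_train, X_test)
```

Averaging over bootstrap resamples mainly reduces the variance of an unstable base learner.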

SLIDE 12

Boosting Classifiers
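Boosting can be sketched as AdaBoost.M1 by resampling, again with a stand-in base learner (the slides use neural networks; the learner and toy data below are assumptions for illustration):

```python
import numpy as np

class NearestMean:
    """Stand-in base learner (the slides use neural networks)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def boost_predict(X_train, y_train, X_test, n_rounds=10, seed=0):
    """Boosting (Freund and Schapire, 1996) by resampling: each round
    draws a training sample according to the current example weights,
    up-weights the examples the new model misclassifies, and predicts
    by an alpha-weighted vote. Labels must be coded 0/1 here."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    w = np.full(n, 1.0 / n)                      # example weights
    vote = np.zeros(len(X_test))
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, p=w)         # weighted resample
        model = NearestMean().fit(X_train[idx], y_train[idx])
        train_pred = model.predict(X_train)
        err = float(np.sum(w[train_pred != y_train]))
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard degenerate rounds
        alpha = 0.5 * np.log((1.0 - err) / err)  # model weight
        w = w * np.exp(alpha * (train_pred != y_train))  # up-weight mistakes
        w = w / w.sum()
        vote += alpha * (2 * model.predict(X_test) - 1)  # vote in {-1, +1}
    return (vote > 0).astype(int)

# Toy two-class data with well-separated Gaussian clouds.
rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.repeat([0, 1], 20)
X_test = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_test = np.repeat([0, 1], 10)
pred = boost_predict(X_train, y_train, X_test)
```

Unlike bagging's uniform resamples, boosting concentrates later rounds on the hard examples, which is also what can make it erratic on noisy data.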

SLIDE 13

Benchmarking the classifiers

5-fold Cross-Validation Error

[Boxplots: 5-fold cross-validation error (N = 10 runs per classifier) for NNET, NBAG, NBOOST, BNN, BBAG, and BBOOST on the Iris (errors roughly .03–.10) and BreastCancer (errors roughly 0.00–.18) benchmark data sets.]

SLIDE 14

How different are these gene selection methods?

Pairwise overlap of the 200 genes selected by each method:

Michigan:
             GS-Robust   ANOVA    GS    SAM
GS-Robust       200        20     28     23
ANOVA                     200    164    179
GS                               200    167
SAM                                     200

Boston:
             GS-Robust   ANOVA    GS    SAM
GS-Robust       200         6     13      8
ANOVA                     200     35     68
GS                               200     43
SAM                                     200

SLIDE 15

Common Significant Genes

Gene Name   Comment
GAPD        glyceraldehyde-3-phosphate dehydrogenase
MGP         matrix Gla protein
RTVP1       GLI pathogenesis-related 1 (glioma)
DDXBP1      Not found
FGR         Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
FGFR2       fibroblast growth factor receptor 2 (bacteria-expressed kinase, keratinocyte growth factor receptor, craniofacial dysostosis 1, Crouzon syndrome, Pfeiffer syndrome, Jackson-Weiss syndrome)
TNNC1       troponin C, slow
KIAA0140    KIAA0140 gene product

SLIDE 16

Neural Network Topology

            Input Layer   Hidden Layer   Output Layer
Michigan         20            4              3
Boston           20            4              4

SLIDE 17

Practical Issues

  • Underrepresented classes
  • Contradictions in mapping
  • Unbalanced testing data

SLIDE 18

K-fold Cross-Validation

[Diagram: 3-fold cross-validation. Each of the three folds serves once as the testing set, yielding Error1, Error2, and Error3, which are averaged into the 3-fold cross-validation error.]
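The procedure can be sketched in Python; the nearest-mean classifier and toy data are assumptions standing in for the neural networks on the slides:

```python
import numpy as np

def kfold_error(X, y, fit_predict, k=3, seed=0):
    """k-fold cross-validation: shuffle, split into k folds, hold each
    fold out once as the testing set, average the k test errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train], y[train], X[test])
        errors.append(np.mean(pred != y[test]))   # Error_i for fold i
    return float(np.mean(errors))

def nearest_mean(X_tr, y_tr, X_te):
    """Toy stand-in classifier: nearest training-class mean."""
    classes = np.unique(y_tr)
    means = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_te[:, None, :] - means[None]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Toy two-class data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
y = np.repeat([0, 1], 30)
err = kfold_error(X, y, nearest_mean, k=3)
```

Every sample is tested exactly once, so the averaged error uses all the data without testing on anything seen during training.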

SLIDE 19

Validation across Data Sets

[Diagram: Data Set 1 used for training with Data Set 2 for testing, and Data Set 2 used for training with Data Set 1 for testing.]

SLIDE 20

Results – Data Set 1

Classification error by gene selection method and neural network type:

NN Type       GS-PCA          GS-Robust       GS              GS-SAM          GS-ANOVA
bayes.boost   0.277 ± 0.013   0.246 ± 0.015   0.257 ± 0.019   0.280 ± 0.015   0.282 ± 0.012
bayes.bag     0.280 ± 0.009   0.236 ± 0.017   0.264 ± 0.021   0.273 ± 0.014   0.282 ± 0.008
Bayesian      0.299 ± 0.034   0.269 ± 0.030   0.315 ± 0.036   0.311 ± 0.046   0.335 ± 0.048
nnet.boost    0.282 ± 0.013   0.272 ± 0.012   0.262 ± 0.016   0.290 ± 0.017   0.292 ± 0.012
nnet.bag      0.278 ± 0.000   0.273 ± 0.006   0.267 ± 0.018   0.277 ± 0.008   0.279 ± 0.004
nnet          0.288 ± 0.021   0.277 ± 0.024   0.296 ± 0.031   0.290 ± 0.022   0.289 ± 0.025

SLIDE 21

Results – Data Set 2

Classification error by gene selection method and neural network type:

NN Type       GS-PCA          GS-Robust       GS              GS-SAM          GS-ANOVA
bayes.boost   0.148 ± 0.000   0.149 ± 0.002   0.142 ± 0.006   0.149 ± 0.005   0.147 ± 0.003
bayes.bag     0.148 ± 0.000   0.148 ± 0.000   0.147 ± 0.002   0.149 ± 0.003   0.148 ± 0.000
Bayesian      0.148 ± 0.000   0.154 ± 0.014   0.145 ± 0.006   0.152 ± 0.005   0.157 ± 0.016
nnet.boost    0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000   0.149 ± 0.002   0.148 ± 0.000
nnet.bag      0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000
nnet          0.150 ± 0.007   0.149 ± 0.002   0.148 ± 0.006   0.150 ± 0.005   0.153 ± 0.010

SLIDE 22

Validation Across Different Data Sets

Michigan/Boston & Boston/Michigan

Training/Testing: Boston/Michigan

NN Type       GS-PCA          GS-Robust       GS              GS-SAM          GS-ANOVA
bayes.boost   0.221 ± 0.000   0.337 ± 0.206   0.399 ± 0.286   0.271 ± 0.101   0.241 ± 0.042
bayes.bag     0.221 ± 0.000   0.280 ± 0.167   0.307 ± 0.149   0.222 ± 0.004   0.226 ± 0.015
Bayesian      0.276 ± 0.131   0.510 ± 0.336   0.380 ± 0.201   0.343 ± 0.249   0.434 ± 0.245
nnet.boost    0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000   0.222 ± 0.004   0.219 ± 0.004
nnet.bag      0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000
nnet          0.221 ± 0.000   0.293 ± 0.178   0.299 ± 0.154   0.250 ± 0.077   0.391 ± 0.226

Training/Testing: Michigan/Boston

NN Type       GS-PCA          GS-Robust       GS              GS-SAM          GS-ANOVA
bayes.boost   0.061 ± 0.086   0.033 ± 0.000   0.138 ± 0.188   0.060 ± 0.038   0.037 ± 0.007
bayes.bag     0.105 ± 0.155   0.033 ± 0.003   0.057 ± 0.059   0.035 ± 0.003   0.034 ± 0.003
Bayesian      0.171 ± 0.294   0.099 ± 0.126   0.405 ± 0.466   0.269 ± 0.358   0.172 ± 0.309
nnet.boost    0.054 ± 0.050   0.033 ± 0.000   0.049 ± 0.037   0.055 ± 0.068   0.036 ± 0.008
nnet.bag      0.035 ± 0.005   0.033 ± 0.000   0.034 ± 0.003   0.033 ± 0.000   0.033 ± 0.000
nnet          0.142 ± 0.272   0.033 ± 0.000   0.122 ± 0.257   0.055 ± 0.054   0.090 ± 0.122

SLIDE 23

Questions from Anomalous Results

  • Could it be due to different compositions of the data sets?
  • Could the assignment of tumor stage by the TNM system be non-uniform? Does "Stage 1" mean the same for both data sets?
  • Could there be differences in preprocessing (normalization)?
  • Tumor heterogeneity?
  • Differences in treatment?
  • How can these questions be approached?
SLIDE 24

Conclusions

  • Bagging exhibited consistently better performance.
  • Boosting improved classification, but was erratic.
  • Univariate Bayesian learning did not usually improve performance.
  • Bagging is a faster and simpler ensemble technique than boosting.
  • GS-Robust selected many unique genes and had excellent ability to select features for our classifiers.

SLIDE 25

Acknowledgements

Members of the Bioinformatics Research Group (BioRG), School of Computer Science, FIU:

  • Patricia Buendia
  • Daniel Cazalis
  • Tom Milledge
  • Xintao Wei
  • Chengyong Yang
  • Erliang Zeng

http://www.cs.fiu.edu/~giri/BNN/
