Supervised classification and outliers detection in gene expression - PowerPoint PPT Presentation


SLIDE 1

Supervised classification and outliers detection in gene expression data

Laurent Bréhélin and François Major. LIRMM, Montpellier, France; LBIT, Montréal, Québec

  • 1. Gene expression data and classification
  • 2. Outliers detection
  • 3. Results
SLIDE 2

Gene expression data

[Figure: the expression matrix, with entries x11, x12, x13, x21, ..., rows 1..n and columns 1..p]

  • Huge number of genes;
  • low number of samples;
  • high level of noise;
  • missing values;
  • few discriminant genes.


SLIDE 3

Applications

  • Cancer diagnosis:
    – annotation (tumor vs. normal);
    – detection (early detection for better treatment);
    – distinction (cancers with the same clinical symptoms);
    – prediction (prognosis).

  • Biological interest:
    – what are the genes?
    – what are the classification rules?
    – etc.

SLIDE 4

Gene selection

Why?

  • eliminate noise;
  • reduce computing time;
  • understand the data better.


Selection scheme:

  • 1. Gene scoring, e.g. the g-score: s_g = |m_g0 − m_g1| / (s_g0 + s_g1).
  • 2. Selection of the best k genes.
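The two-step scheme above can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are mine, not from the slides):

```python
from statistics import mean, stdev

def g_score(vals0, vals1):
    """g-score of one gene: s_g = |m_g0 - m_g1| / (s_g0 + s_g1)."""
    return abs(mean(vals0) - mean(vals1)) / (stdev(vals0) + stdev(vals1))

def select_genes(expr0, expr1, k):
    """Step 1: score every gene; step 2: keep the k best.
    expr0[g] / expr1[g] hold the measurements of gene g in class 0 / class 1."""
    scores = [g_score(v0, v1) for v0, v1 in zip(expr0, expr1)]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]
```

Here s_g0 and s_g1 are sample standard deviations; a gene that is constant in both classes would need a guard against division by zero.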
SLIDE 5

Learning a classifier - 1

  • G a set of selected genes.
  • x = (x1, . . . , xn) a new sample.

Probabilistic approach:

c_MAP = argmax_{c ∈ {0,1}} P(c | x_G)
      = argmax_{c ∈ {0,1}} P(x_G | c) P(c) / P(x_G)
      = argmax_{c ∈ {0,1}} P(x_G | c) P(c)

  • Estimating the P(c):
  • P(0) = (# examples of class 0) / (# examples);
  • P(1) = (# examples of class 1) / (# examples).

  • The problem is more difficult for P(x_G | c).
SLIDE 6

Learning a classifier - 2

The naive Bayes approach: gene expression levels are conditionally independent given the class:

P(x_G | c) = ∏_{g ∈ G} P(x_g | c).

Normal assumption: P(x_g | c) ∼ N(x_g; µ_gc, σ²_gc), and we have the estimates:

µ_gc = m_gc        σ²_gc = s²_gc

[Figure: expression levels of one gene, with the normal density fitted to one class]
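A minimal sketch of this classifier (my own code, not the authors'; it uses log-probabilities for numerical stability and plugs in the estimates µ_gc = m_gc, σ_gc = s_gc as on the slide):

```python
import math
from statistics import mean, stdev

def log_normal_pdf(x, mu, sigma):
    """log N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit_naive_bayes(samples, labels):
    """Estimate P(c) and the per-gene (mu_gc, sigma_gc) for both classes.
    samples: list of expression vectors (selected genes only); labels: 0/1."""
    model = {}
    for c in (0, 1):
        xs = [s for s, lab in zip(samples, labels) if lab == c]
        prior = len(xs) / len(samples)            # P(c) = # ex. class c / # ex.
        params = [(mean(g), stdev(g)) for g in zip(*xs)]  # per-gene estimates
        model[c] = (prior, params)
    return model

def classify(model, x):
    """MAP rule: argmax_c  log P(c) + sum_g log N(x_g; mu_gc, sigma_gc^2)."""
    def log_post(c):
        prior, params = model[c]
        return math.log(prior) + sum(
            log_normal_pdf(xg, mu, s) for xg, (mu, s) in zip(x, params))
    return max((0, 1), key=log_post)
```

The sketch assumes every selected gene has nonzero variance within each class; real data with constant genes would need a variance floor.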

SLIDE 7

Learning a classifier - 2

Same model as on the previous slide; the build now shows the densities fitted to both classes.

[Figure: expression levels of one gene, with the fitted class-conditional normal densities for both classes]

SLIDE 8

Evaluating the classifier

Low number of samples → cross-validation. Leave-one-out procedure:

Data: X, the complete set of samples.
foreach x ∈ X do
    Learning using X − x;
    Classify x;
Return the fault coverage;

SLIDE 9

Evaluating the classifier

Low number of samples → cross-validation. Leave-one-out procedure:

Data: X, the complete set of samples.
Gene selection (once, on all of X);
foreach x ∈ X do
    Learning using X − x;
    Classify x;
Return the fault coverage;

SLIDE 10

Non-biased Leave-one-out

Data: X, the complete set of samples.
foreach x ∈ X do
    Gene selection using X − x;
    Learning using X − x;
    Classify x;
Return the fault coverage;
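The non-biased procedure can be written generically, with the selection, learning and classification routines passed in as functions (hypothetical helpers, not names from the slides):

```python
def leave_one_out(samples, labels, select_genes, learn, classify):
    """Non-biased leave-one-out: gene selection AND parameter learning
    are both redone on X - x at every iteration."""
    errors = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]          # X - x
        rest_y = labels[:i] + labels[i + 1:]
        genes = select_genes(rest_x, rest_y)            # selection without x
        model = learn([[s[g] for g in genes] for s in rest_x], rest_y)
        errors += classify(model, [x[g] for g in genes]) != y
    return errors / len(samples)                        # the "fault coverage"
```

Moving the `select_genes` call outside the loop reproduces the biased variant of the previous slides, which is exactly what the figures below compare.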

SLIDE 11

Non-biased Leave-one-out

Data: X, the complete set of samples.
foreach x ∈ X do
    Gene selection using X − x;
    Learning using X − x;
    Classify x;
Return the fault coverage;

[Figure: LOO error rate vs. number of selected genes, biased procedure]

Breast cancer SAGE data

SLIDE 12

Non-biased Leave-one-out

Data: X, the complete set of samples.
foreach x ∈ X do
    Gene selection using X − x;
    Learning using X − x;
    Classify x;
Return the fault coverage;

[Figure: LOO error rate vs. number of selected genes, biased vs. non-biased procedures]

Breast cancer SAGE data

SLIDE 13

Outliers

Outlier : A gene expression measurement which differs surprisingly from the other measurements obtained for the same gene on other samples of the same class.


Outliers bias:

  • the estimates of the model parameters (µ_gc and σ²_gc);
  • the gene score, e.g. the g-score: s_g = |m_g0 − m_g1| / (s_g0 + s_g1).

SLIDE 14

Origins

  • Intrinsic factors: the surprising measurement is actually the true measure; it results from rare but not impossible biological phenomena.

  • Extrinsic factors: measurement error:
    – material reasons;
    – human reasons;
    – inherent limits of the measurement method;
    – . . .

SLIDE 15

Outlier detection - 1

Principle:

  • assume that the data, with the possible exception of any outlier, form a sample of a given distribution (here the normal distribution);
  • use a reasonable test statistic to decide whether or not the suspect measurement is an outlier.


The Thompson statistic:

T_gc = |x*_gc − m_gc| / s_gc

The greater T_gc, the more unlikely x*_gc is.

SLIDE 16

Outlier detection - 2


The rule: if T_gc ≥ τ_αc then x*_gc is an outlier.

How can we set τ_αc?

  • Compare with what is expected under the null hypothesis H0 that there is no spurious observation (i.e. all points belong to the same normal distribution).
  • Find τ_αc so that P(T_gc > τ_αc | H0) = α (e.g. α = 10^−5).
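The slides leave the computation of τ_αc open; exact critical values can be derived from Student's t distribution (Thompson's test), but the defining property P(T > τ_α | H0) = α can also be approximated directly by simulation. A Monte Carlo sketch (my own construction, taking T to be the statistic of the most deviant of the n measurements of a class):

```python
import random
from statistics import mean, stdev

def tau_alpha(n, alpha, trials=20000, rng=None):
    """Approximate tau such that P(T > tau | H0) = alpha, where H0 says
    all n points of a class come from one normal distribution and
    T = max_i |x_i - m| / s is the Thompson statistic of the most
    suspect point."""
    rng = rng or random.Random(0)
    stats = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m, s = mean(xs), stdev(xs)
        stats.append(max(abs(x - m) for x in xs) / s)
    stats.sort()
    return stats[min(int((1 - alpha) * trials), trials - 1)]
```

T is scale-free, so simulating N(0, 1) suffices; for the very small α used on the last slides (down to 10^−20) the number of trials would have to grow accordingly, which is where the closed-form t-based threshold becomes preferable.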

SLIDE 17

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 18

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 19

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    foreach gene g do
        Compute T_g0 and T_g1 using X − x;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 20

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    foreach gene g do
        Compute T_g0 and T_g1 using X − x;
        if T_g0 > τ_α0 then remove x*_g0;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 21

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    foreach gene g do
        Compute T_g0 and T_g1 using X − x;
        if T_g0 > τ_α0 then remove x*_g0;
        if T_g1 > τ_α1 then remove x*_g1;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 22

Gene selection & outliers

Use the outlier detection in the gene selection procedure.

SLIDE 23

Gene selection & outliers

Use the outlier detection in the gene selection procedure.


SLIDE 24

Gene selection & outliers

Use the outlier detection in the gene selection procedure.


Data: X, α′ and x
Compute τ_α′0 and τ_α′1;
foreach gene g do
    Compute T′_g0 and T′_g1;
    if T′_g0 > τ_α′0 and T′_g1 > τ_α′1 then
        Reject gene g;
    else
        Compute the score of g;
Return the best genes;

SLIDE 25

Experiments

  • α′ = 10^−2;
  • α = 10^−2, 10^−5, 10^−10, 10^−15, 10^−20.

[Figure: LOO error rate vs. number of selected genes for NB, NB+OD and KNN]

Breast cancer :

  • 78 samples : 44 vs. 34.
  • ∼ 24000 genes.

[Figure: LOO error rate vs. number of selected genes for NB, NB+OD and KNN]

Lymphoma:

  • 58 samples : 32 vs. 26.
  • ∼ 7000 genes.
SLIDE 26

Conclusions

  • Outlier detection can improve the performance of the naive Bayes classifier.
  • Naive Bayes classifier + outlier detection:
    – simple approach;
    – low computing time;
    – can achieve better results than more sophisticated methods.

Several questions:

  • interest of outlier detection combined with other approaches: KNNs, SVMs, weighted voting, . . . ;

  • comparison with more robust estimates (e.g. median vs. mean);
  • outlier origins: intrinsic or extrinsic factors?