Supervised classification and outliers detection in gene expression - PowerPoint PPT Presentation


SLIDE 1

Supervised classification and outliers detection in gene expression data

Laurent Bréhélin and François Major. LIRMM, Montpellier, France; LBIT, Montréal, Québec

  • 1. Gene expression data and classification
  • 2. Outliers detection
  • 3. Results
SLIDE 2

Gene expression data

[Figure: the expression matrix, with entries x11, x12, x13, x21, ..., rows 1..n and columns 1..p]

  • Huge number of genes;
  • low number of samples;
  • high level of noise;
  • missing values;
  • few discriminant genes.


SLIDE 3

Applications

  • Cancer diagnosis:
    – annotation (tumor vs. normal);
    – detection (early detection for better treatment);
    – distinction (cancers with the same clinical symptoms);
    – prediction (prognosis).

  • Biological interest:
    – what are the genes?
    – what are the classification rules?
    – etc.

SLIDE 4

Gene selection

Why?

  • eliminate noise;
  • reduce computing time;
  • understand the data better.


Selection scheme:

  • 1. Gene scoring, e.g. the g-score: s_g = |m_g0 − m_g1| / (s_g0 + s_g1).
  • 2. Selection of the best k genes.
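The two-step scheme above can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are mine, not from the slides):

```python
from statistics import mean, stdev

def g_score(vals0, vals1):
    """g-score of one gene: s_g = |m_g0 - m_g1| / (s_g0 + s_g1)."""
    return abs(mean(vals0) - mean(vals1)) / (stdev(vals0) + stdev(vals1))

def select_genes(expr0, expr1, k):
    """Step 1: score every gene; step 2: keep the k best.
    expr0[g] / expr1[g] hold the measurements of gene g in class 0 / class 1."""
    scores = [g_score(v0, v1) for v0, v1 in zip(expr0, expr1)]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]
```

Here s_g0 and s_g1 are sample standard deviations; a gene that is constant in both classes would need a guard against division by zero.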
SLIDE 5

Learning a classifier - 1

  • G a set of selected genes.
  • x = (x1, . . . , xn) a new sample.

Probabilistic approach:

c_MAP = argmax_{c ∈ {0,1}} P(c | x_G)
      = argmax_{c ∈ {0,1}} P(x_G | c) P(c) / P(x_G)
      = argmax_{c ∈ {0,1}} P(x_G | c) P(c)

  • Estimating the P(c):
  • P(0) = (# examples of class 0) / (# examples);
  • P(1) = (# examples of class 1) / (# examples).

  • The problem is more difficult for P(x_G | c).
SLIDE 6

Learning a classifier - 2

The naive Bayes approach: gene expression levels are conditionally independent given the class:

P(x_G | c) = ∏_{g ∈ G} P(x_g | c).

Normal assumption: P(x_g | c) ∼ N(x_g; µ_gc, σ²_gc), and we have the estimates:

µ_gc = m_gc        σ²_gc = s²_gc

[Figure: expression levels of one gene, with the normal density fitted to one class]
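A minimal sketch of this classifier (my own code, not the authors'; it uses log-probabilities for numerical stability and plugs in the estimates µ_gc = m_gc, σ_gc = s_gc as on the slide):

```python
import math
from statistics import mean, stdev

def log_normal_pdf(x, mu, sigma):
    """log N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit_naive_bayes(samples, labels):
    """Estimate P(c) and the per-gene (mu_gc, sigma_gc) for both classes.
    samples: list of expression vectors (selected genes only); labels: 0/1."""
    model = {}
    for c in (0, 1):
        xs = [s for s, lab in zip(samples, labels) if lab == c]
        prior = len(xs) / len(samples)            # P(c) = # ex. class c / # ex.
        params = [(mean(g), stdev(g)) for g in zip(*xs)]  # per-gene estimates
        model[c] = (prior, params)
    return model

def classify(model, x):
    """MAP rule: argmax_c  log P(c) + sum_g log N(x_g; mu_gc, sigma_gc^2)."""
    def log_post(c):
        prior, params = model[c]
        return math.log(prior) + sum(
            log_normal_pdf(xg, mu, s) for xg, (mu, s) in zip(x, params))
    return max((0, 1), key=log_post)
```

The sketch assumes every selected gene has nonzero variance within each class; real data with constant genes would need a variance floor.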

SLIDE 7

Learning a classifier - 2

Same model as on the previous slide; the build now shows the densities fitted to both classes.

[Figure: expression levels of one gene, with the fitted class-conditional normal densities for both classes]

SLIDE 8

Evaluating the classifier

Low number of samples → cross-validation. Leave-one-out procedure:

Data: X, the complete set of samples.
foreach x ∈ X do
    Learning using X − x;
    Classify x;
Return the fault coverage;

SLIDE 9

Evaluating the classifier

Low number of samples → cross-validation. Leave-one-out procedure:

Data: X, the complete set of samples.
Gene selection (once, on all of X);
foreach x ∈ X do
    Learning using X − x;
    Classify x;
Return the fault coverage;

SLIDE 10

Non-biased Leave-one-out

Data: X, the complete set of samples.
foreach x ∈ X do
    Gene selection using X − x;
    Learning using X − x;
    Classify x;
Return the fault coverage;
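The non-biased procedure can be written generically, with the selection, learning and classification routines passed in as functions (hypothetical helpers, not names from the slides):

```python
def leave_one_out(samples, labels, select_genes, learn, classify):
    """Non-biased leave-one-out: gene selection AND parameter learning
    are both redone on X - x at every iteration."""
    errors = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]          # X - x
        rest_y = labels[:i] + labels[i + 1:]
        genes = select_genes(rest_x, rest_y)            # selection without x
        model = learn([[s[g] for g in genes] for s in rest_x], rest_y)
        errors += classify(model, [x[g] for g in genes]) != y
    return errors / len(samples)                        # the "fault coverage"
```

Moving the `select_genes` call outside the loop reproduces the biased variant of the previous slides, which is exactly what the figures below compare.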

SLIDE 11

Non-biased Leave-one-out

Data: X, the complete set of samples.
foreach x ∈ X do
    Gene selection using X − x;
    Learning using X − x;
    Classify x;
Return the fault coverage;

[Figure: LOO error rate vs. number of selected genes, biased procedure]

Breast cancer SAGE data

SLIDE 12

Non-biased Leave-one-out

Data: X, the complete set of samples.
foreach x ∈ X do
    Gene selection using X − x;
    Learning using X − x;
    Classify x;
Return the fault coverage;

[Figure: LOO error rate vs. number of selected genes, biased vs. non-biased procedures]

Breast cancer SAGE data

SLIDE 13

Outliers

Outlier : A gene expression measurement which differs surprisingly from the other measurements obtained for the same gene on other samples of the same class.


Outliers bias:

  • the estimates of the model parameters (µ_gc and σ²_gc);
  • the gene score, e.g. the g-score: s_g = |m_g0 − m_g1| / (s_g0 + s_g1).

SLIDE 14

Origins

  • Intrinsic factors: the surprising measurement is actually the true measure; it results from rare but not impossible biological phenomena.

  • Extrinsic factors: measurement error:
    – material reasons;
    – human reasons;
    – inherent limits of the measurement method;
    – . . .

SLIDE 15

Outlier detection - 1

Principle:

  • assume that the data, with the possible exception of any outlier, form a sample of a given distribution (here the normal distribution);
  • use a reasonable test statistic to decide whether or not the suspect measurement is an outlier.


The Thompson statistic:

T_gc = |x*_gc − m_gc| / s_gc

The greater T_gc, the more unlikely x*_gc is.

SLIDE 16

Outlier detection - 2


The rule: if T_gc ≥ τ_αc then x*_gc is an outlier.

How can we set τ_αc?

  • Compare with what is expected under the null hypothesis H0 that there is no spurious observation (i.e. all points belong to the same normal distribution).
  • Find τ_αc so that P(T_gc > τ_αc | H0) = α (e.g. α = 10^−5).
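The slides leave the computation of τ_αc open; exact critical values can be derived from Student's t distribution (Thompson's test), but the defining property P(T > τ_α | H0) = α can also be approximated directly by simulation. A Monte Carlo sketch (my own construction, taking T to be the statistic of the most deviant of the n measurements of a class):

```python
import random
from statistics import mean, stdev

def tau_alpha(n, alpha, trials=20000, rng=None):
    """Approximate tau such that P(T > tau | H0) = alpha, where H0 says
    all n points of a class come from one normal distribution and
    T = max_i |x_i - m| / s is the Thompson statistic of the most
    suspect point."""
    rng = rng or random.Random(0)
    stats = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m, s = mean(xs), stdev(xs)
        stats.append(max(abs(x - m) for x in xs) / s)
    stats.sort()
    return stats[min(int((1 - alpha) * trials), trials - 1)]
```

T is scale-free, so simulating N(0, 1) suffices; for the very small α used on the last slides (down to 10^−20) the number of trials would have to grow accordingly, which is where the closed-form t-based threshold becomes preferable.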

SLIDE 17

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 18

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 19

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    foreach gene g do
        Compute T_g0 and T_g1 using X − x;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 20

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    foreach gene g do
        Compute T_g0 and T_g1 using X − x;
        if T_g0 > τ_α0 then remove x*_g0;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 21

Leave-one-out & outliers


Data: X and α
foreach x ∈ X do
    Compute τ_α0 and τ_α1 from α;
    foreach gene g do
        Compute T_g0 and T_g1 using X − x;
        if T_g0 > τ_α0 then remove x*_g0;
        if T_g1 > τ_α1 then remove x*_g1;
    Select a set of genes using X − x;
    Estimate parameters using X − x;
    Classify x;
Return the fault coverage;

SLIDE 22

Gene selection & outliers

Use the outlier detection in the gene selection procedure.

SLIDE 23

Gene selection & outliers

Use the outlier detection in the gene selection procedure.


SLIDE 24

Gene selection & outliers

Use the outlier detection in the gene selection procedure.


Data: X, α′ and x
Compute τ_α′0 and τ_α′1;
foreach gene g do
    Compute T′_g0 and T′_g1;
    if T′_g0 > τ_α′0 and T′_g1 > τ_α′1 then
        Reject gene g;
    else
        Compute the score of g;
Return the best genes;

SLIDE 25

Experiments

  • α′ = 10^−2;
  • α = 10^−2, 10^−5, 10^−10, 10^−15, 10^−20.

[Figure: LOO error rate vs. number of selected genes for NB, NB+OD and KNN]

Breast cancer :

  • 78 samples : 44 vs. 34.
  • ∼ 24000 genes.

[Figure: LOO error rate vs. number of selected genes for NB, NB+OD and KNN]

Lymphoma:

  • 58 samples : 32 vs. 26.
  • ∼ 7000 genes.
SLIDE 26

Conclusions

  • Outlier detection can improve the performance of the naive Bayes classifier.
  • Naive Bayes classifier + outlier detection:
    – simple approach;
    – low computing time;
    – can achieve better results than more sophisticated methods.

Several questions:

  • interest of outlier detection combined with other approaches: KNNs, SVMs, weighted voting, . . . ;

  • comparison with more robust estimates (e.g. median vs. mean);
  • outlier origins: intrinsic or extrinsic factors?