SLIDE 1

Machine learning

DUBii - Module - Statistics with R

Jacques van Helden ORCID 0000-0002-8799-8584 Institut Français de Bioinformatique (IFB) French node of the European ELIXIR bioinformatics infrastructure Aix-Marseille Université (AMU)

  • Lab. Theory and Approaches of Genomic Complexity (TAGC)
SLIDE 2

Brain-learning exercise: assign individuals to groups based on their features

SLIDE 3

Conceptual illustration with two predictor variables

  • In the next slides, we provide a higher-resolution version of the plots, which represent a study case.
  • Exercise: intuitively assign each individual (black dot) to one of the two groups (A, B).
  • At each step, ask yourself the following questions:
    • Which criterion did you use to assign an individual to a group?
    • How confident do you feel about each of your predictions?
    • What is the effect of the respective means?
    • What is the effect of the respective standard deviations?
    • What is the effect of the correlations between the two variables?

SLIDE 4

Conceptual illustration with two variables – Study case 1

  • Inspect the distribution of points for the two groups of individuals (pink, blue) in the two-dimensional feature space.

Figure axes: X1 (Feature 1), X2 (Feature 2).

SLIDE 5

Conceptual illustration with two variables – Study case 2

  • Effect of the group centre location.

SLIDE 6

Conceptual illustration with two variables – Study case 3

  • Effect of the group variance.

SLIDE 7

Conceptual illustration with two variables – Study case 4

  • Effect of the group variance.

SLIDE 8

Conceptual illustration with two variables – Study case 5

  • Impact of the group-specific variances (heteroscedasticity of the data).

SLIDE 9

Conceptual illustration with two variables – Study case 6

  • Impact of the group-specific variances (heteroscedasticity of the data).

SLIDE 10

Conceptual illustration with two variables – Study case 7

  • Effect of the covariance between features.

SLIDE 11

Conceptual illustration with two variables – Study case 8

  • Effect of the covariance between features.

SLIDE 12

Conceptual illustration with two variables – Study case 9

  • Group-specific covariances between features.
    • The two groups have different covariance matrices: the clouds of points are elongated in different directions.
    • How does this difference affect group assignments?

SLIDE 13

Multivariate analysis – Introduction

Statistics Applied to Bioinformatics

Jacques van Helden ORCID 0000-0002-8799-8584 Institut Français de Bioinformatique (IFB) French node of the European ELIXIR bioinformatics infrastructure Aix-Marseille Université (AMU)

  • Lab. Theory and Approaches of Genomic Complexity (TAGC)
SLIDE 14

Multivariate data

  • Each row represents one object (also called unit).
  • Each column represents one variable.

               variable 1   variable 2   ...   variable p
individual 1   x11          x21          ...   xp1
individual 2   x12          x22          ...   xp2
individual 3   x13          x23          ...   xp3
...            ...          ...          ...   ...
individual n   x1n          x2n          ...   xpn

SLIDE 15

Multivariate data with an outcome variable

  • The outcome variable (also called criterion variable) can be
    • qualitative (nominal): classes (e.g. cancer type)
    • quantitative (e.g. survival expectation for a cancer patient)

               Predictor variables                         Outcome variable
               variable 1   variable 2   ...   variable p   variable p+1
individual 1   x11          x21          ...   xp1          y1
individual 2   x12          x22          ...   xp2          y2
individual 3   x13          x23          ...   xp3          y3
...            ...          ...          ...   ...          ...
individual n   x1n          x2n          ...   xpn          yn
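In R (the language used throughout this module), such a table is typically stored as a data frame whose predictor columns are numeric and whose outcome column is a factor (qualitative) or a numeric vector (quantitative). A minimal sketch with simulated values; all names and dimensions are illustrative:

```r
set.seed(1)
n <- 6  # individuals
p <- 3  # predictor variables

# Predictor variables: one row per individual, one column per variable
dataset <- data.frame(matrix(rnorm(n * p), nrow = n,
                             dimnames = list(paste0("individual", 1:n),
                                             paste0("variable", 1:p))))

# Qualitative outcome variable (e.g. a cancer type): a factor column
dataset$outcome <- factor(sample(c("ALL", "AML"), n, replace = TRUE))

str(dataset)  # n observations of p + 1 variables
```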

SLIDE 16

Predictive approaches – Training set

  • The training set is used to build a predictive function.
  • This function is used to predict the value of the outcome variable for new objects.

Training set (outcome variable known):
                    variable 1   ...   variable p   variable p+1 (outcome)
individual 1        x11          ...   xp1          y1
individual 2        x12          ...   xp2          y2
...                 ...          ...   ...          ...
individual N_train  x1n          ...   xpn          yn

Set to predict (outcome variable unknown):
                    variable 1   ...   variable p   variable p+1 (outcome)
individual 1        x11          ...   xp1          ?
individual 2        x12          ...   xp2          ?
...                 ...          ...   ...          ...
individual N_pred   x1n          ...   xpn          ?

SLIDE 17

Evaluation of prediction with a testing set

Training set (known outcome):
                    variable 1   ...   variable p   variable p+1
individual 1        x11          ...   x1p          y1
...                 ...          ...   ...          ...
individual ntrain   xn1          ...   xnp          yn

Testing set (known outcome, compared with the predictions):
                    variable 1   ...   variable p   variable p+1 (known)   variable p+1 (predicted)
individual 1        x11          ...   x1p          y1                     y'1
...                 ...          ...   ...          ...                    ...
individual ntest    xn1          ...   xnp          yntest                 y'ntest

Set to predict (unknown outcome):
                    variable 1   ...   variable p   variable p+1
individual 1        x11          ...   x1p          ?
...                 ...          ...   ...          ...

SLIDE 18

Flowchart of the approaches in multivariate analysis

  • Multivariate table X → reduction of dimensions:
    • variable selection
    • principal component analysis
  • Distance matrix → multidimensional scaling.
  • Outcome variable Y?
    • none → exploratory analysis: cluster analysis (discovered classes + individual assignment), visualisation (graphical representations)
    • quantitative → regression analysis: predicted value of a quantitative variable, yest = f(x)
    • nominal → supervised classification: assignment of individuals to predefined classes, g = f(x)

SLIDE 19

Quiz

Check your understanding of the concepts presented in the previous slides by applying them to your own data.

1. Describe in one sentence a typical case of multidimensional data handled in your domain.
2. Explain how you would organise this dataset into a multivariate structure:
  • What would correspond to the individuals?
  • What would correspond to the variables?
  • How many individuals (n) would you have?
  • How many variables (p) would you have?
  • Do you have one or several outcome variable(s)?
  • If so, are they quantitative, qualitative, or both?
3. Based on the conceptual framework defined above, which kinds of approaches would you envisage to extract which kinds of relevant information from these data? Note that several approaches can be combined to address different questions.

SLIDE 20

Historical (vintage) examples

SLIDE 21

Historical example of a clustering heat map

  • Spellman et al. (1998): systematic detection of genes regulated in a periodic way during the cell cycle.
  • Several experiments were regrouped, with various ways of synchronization (elutriation, cdc mutants, …).
  • ~800 genes showing a periodic pattern of expression were selected (by Fourier analysis).

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97.

Figure: time profiles of yeast cells followed during the cell cycle.
SLIDE 22

Stress response in yeast

  • Gasch et al. (2000) tested the transcriptional response of the yeast genome to
    • various stress conditions (heat shock, osmotic shock, …)
    • drugs
    • alternative carbon sources
    • …
  • The heatmap shows clusters of genes having similar profiles of response to the different types of stress.

Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.

SLIDE 23

Cancer types (Golub, 1999)

  • Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
  • Selected the 50 genes most correlated with the cancer type.
  • Goal: use these genes as molecular signatures for the diagnosis of new patients.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

SLIDE 24

Den Boer et al., 2009: procedure

  • Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 Acute Lymphoblastic Leukemia samples of different types.
  • They use these profiles to select "transcriptome signatures" that will serve for diagnostic purposes: assigning new samples to one of the cancer types.
  • They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.

Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

SLIDE 25

Den Boer 2009 – The transcriptomic signature

  • The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
  • The heatmaps show that the selected genes are differentially expressed
    • between subtypes of the training set (left);
    • between subtypes of the testing set (right).
  • The heatmap is bi-clustered, in order to identify simultaneously groups of patients (rows) and groups of genes (columns), based on the similarity between their expression profiles.

Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

SLIDE 26

Supervised classification

Statistics for bioinformatics

Jacques van Helden Aix-Marseille Université (AMU)

  • Lab. Theory and Approaches of Genomic Complexity (TAGC)

Institut Français de Bioinformatique (IFB) French node of the European ELIXIR bioinformatics infrastructure https://orcid.org/0000-0002-8799-8584

SLIDE 27

Supervised classification – Introduction

  • In the previous chapter, we presented the problem of clustering, which consists in grouping objects without any a priori definition of the groups. The group definitions emerge from the clustering itself (class discovery). Clustering is thus unsupervised.
  • In some cases, one would like to focus on some pre-defined classes:
    • classifying tissues as cancer or non-cancer
    • classifying tissues between different cancer types
    • classifying genes according to pre-defined functional classes (e.g. metabolic pathway, different phases of the cell cycle, ...)
  • The classifier can be built with a training set, and used later for classifying new objects. This is called supervised classification.

SLIDE 28

Supervised classification methods

  • There are many alternative methods for supervised classification:
    • Discriminant analysis (linear: LDA, or quadratic: QDA)
    • Bayesian classifier
    • K-nearest neighbours (KNN)
    • Support Vector Machine (SVM)
    • Decision tree
    • Random Forest (RF)
    • Neural network (NN)
    • ...
  • Questions
    • Which method should we choose?
    • How should we tune its parameters?
    • How to evaluate the respective performances of the methods and parametric choices?
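Most of these methods are available in R packages; MASS (LDA/QDA) and class (KNN) ship with the standard R distribution. A minimal sketch comparing two of them on the built-in iris data; the 50/50 split and k = 5 are arbitrary illustrative choices, not recommendations:

```r
library(MASS)   # lda()
library(class)  # knn()

# Random split of iris into training and testing halves
set.seed(1)
train.idx <- sample(nrow(iris), nrow(iris) / 2)
train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

# Linear discriminant analysis
lda.fit  <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, test)$class

# K-nearest neighbours (Euclidean distance, k = 5)
knn.pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Misclassification error rate of each method on the testing set
mean(lda.pred != test$Species)
mean(knn.pred != test$Species)
```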

SLIDE 29

Supervised classification methods

Choosing the best method is not trivial:
  • Some methods rely on strong assumptions.
    • LDA and QDA: multivariate normality.
    • LDA: all the classes have the same covariance matrix.
  • Some methods implicitly rely on Euclidean distance (e.g. KNN).
  • Some methods require a large training set, to avoid over-fitting.
  • Global vs local classifiers.
    • Global classifiers (e.g. LDA, QDA): same classification rule in the whole data space. The rule is built on the whole training set.
    • Local classifiers (e.g. KNN): rules are made in different sub-spaces, on the basis of the neighbouring training points.
  • The choice of the method thus depends on the structure and on the size of the data sets.

Choosing the best parameters is not trivial either:
  • KNN: number of neighbours
  • LDA, QDA: prior/posterior probabilities
  • SVM: kernel
  • Decision trees
  • RF: number of iterations
  • ...

SLIDE 30

Study case 1: ALL versus AML (data from Golub et al., 1999)

SLIDE 31

Cancer types (Golub, 1999)

  • A founding paper: Golub et al. (1999).
  • Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
  • Selected the 50 genes most correlated with the cancer type.
  • Goal: use these genes as molecular signatures for the diagnosis of new patients.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

SLIDE 32

Motivation

  • The article by Golub et al. (1999) was motivated by the need to develop efficient diagnostics to predict the cancer type from blood samples of patients.
  • They proposed a "molecular signature" of cancer type, allowing to discriminate AML from ALL.
  • This first "historical" study relied on somewhat arbitrary criteria to select the genes composing this signature, and on the way to apply them to classify new patients.
  • We present here the classical statistical methods used to classify "objects" (patients, genes) into pre-defined classes.

SLIDE 33

Golub et al. (1999)

  • Data source: Golub et al. (1999), the first historical publication searching for molecular signatures of cancer type.
  • Training set
    • 38 samples from 2 types of leukemia:
      • 27 acute lymphoblastic leukemia (note: 2 subtypes, ALL-T and ALL-B)
      • 11 acute myeloid leukemia
    • The original data set contains ~7000 genes.
    • Filtering out poorly expressed genes retains 3051 genes.
  • We re-analyze the data using different methods.
  • Selection of differentially expressed genes (DEG)
    • A Welch t-test with robust estimators (median, IQR) retains 367 differentially expressed genes with E-value <= 1.
    • Top plot: circle radius indicates t-test significance (difference between the means vs. standard error on the difference).
    • Bottom plot (volcano plot): sig = -log10(E-value) >= 0, with standardization using median and IQR (golub.t.result.robust$means.diff vs. golub.t.result.robust$sig).

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
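The gene selection step can be sketched in base R. The slide's actual procedure uses robust estimators (median, IQR); the simplified sketch below applies a plain Welch t-test per gene to simulated data and converts p-values to E-values by multiplying by the number of tests:

```r
set.seed(123)
n.genes <- 1000

# Simulated expression matrix: genes in rows, 38 samples (27 ALL, 11 AML)
expr  <- matrix(rnorm(n.genes * 38), nrow = n.genes)
group <- rep(c("ALL", "AML"), times = c(27, 11))

# Give the first 50 genes a true difference between the two groups
expr[1:50, group == "AML"] <- expr[1:50, group == "AML"] + 2

# Welch t-test for each gene (unequal variances is the t.test default)
p.values <- apply(expr, 1, function(g)
  t.test(g[group == "ALL"], g[group == "AML"])$p.value)

# E-value = p-value * number of tests; keep genes with E-value <= 1
e.values <- p.values * n.genes
selected <- which(e.values <= 1)
length(selected)  # mostly among the 50 truly differential genes
```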

SLIDE 34

Golub 1999 – Profiles of selected genes

  • The 367 genes selected by the t-test have apparently different profiles.
    • Some genes seem greener for the ALL patients (27 leftmost samples).
    • Some genes seem greener for the AML patients (11 rightmost samples).

SLIDE 35

Golub – hierarchical clustering of DEG genes / profiles

  • Hierarchical clustering perfectly separates the two cancer types (AML versus ALL).
  • This perfect separation is observed for various metrics (Euclidean, correlation, dot product) and agglomeration rules (complete, average, Ward).
  • Sample clustering further reveals subgroups of ALL.
  • Gene clustering reveals 4 groups of profiles:
    • AML red, ALL green
    • AML green, ALL red
    • overall green, stronger in AML
    • overall red, stronger in ALL

Figure: bi-clustered heatmap of the Golub data (Euclidean distance, complete linkage).
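In base R, hierarchical clustering of the samples amounts to dist() plus hclust(), and heatmap() produces the bi-clustered display. A sketch on simulated data (the real analysis uses the 367 selected Golub genes):

```r
set.seed(7)

# Simulated matrix: 100 genes x 38 samples, group effect on 40 genes
expr  <- matrix(rnorm(100 * 38), nrow = 100)
group <- rep(c("ALL", "AML"), times = c(27, 11))
expr[1:40, group == "AML"] <- expr[1:40, group == "AML"] + 3

# Cluster the samples: Euclidean distance, complete linkage
sample.tree <- hclust(dist(t(expr)), method = "complete")
clusters    <- cutree(sample.tree, k = 2)  # cut the tree into 2 groups

# Compare the discovered clusters with the known cancer types
table(clusters, group)

# Bi-clustered heatmap (genes in rows, samples in columns)
heatmap(expr, distfun = dist, hclustfun = hclust)
```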

SLIDE 36

Principal component analysis (PCA)

  • Principal component analysis (PCA) relies on a transformation of a multivariate table into a multi-dimensional table of "components".
  • With the Golub dataset,
    • most variance is captured by the first component;
    • the first component clearly separates ALL from AML patients;
    • the second component splits the ALL set into two well-separated groups, which correspond almost perfectly to T-cells and B-cells, respectively.

Figures: plot of the samples on the first two components (golub.pca), and barplot of the component variances.
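In R, PCA is available through prcomp(). A sketch on simulated data with a structure loosely analogous to the Golub set; dimensions and effect sizes are made up:

```r
set.seed(42)

# 38 samples (rows) x 50 genes (columns), group effect on half of the genes
group <- rep(c("ALL", "AML"), times = c(27, 11))
expr  <- matrix(rnorm(38 * 50), nrow = 38)
expr[group == "AML", 1:25] <- expr[group == "AML", 1:25] + 2

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of variance captured by the first components
summary(pca)$importance["Proportion of Variance", 1:3]

# The first component separates the two groups
plot(pca$x[, 1], pca$x[, 2], col = factor(group),
     xlab = "PC1", ylab = "PC2")
```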

SLIDE 37

Study case 2: ALL subtypes (data from Den Boer et al., 2009)

SLIDE 38

Den Boer et al., 2009: procedure

  • Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 Acute Lymphoblastic Leukemia samples of different types.
  • They use these profiles to select "transcriptome signatures" that will serve for diagnostic purposes: assigning new samples to one of the cancer types.
  • They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.

Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

Subtype distribution of the 190 samples:
  • hyperdiploid: 44
  • pre-B ALL: 44
  • TEL-AML1: 43
  • T-ALL: 36
  • E2A-rearranged (EP): 8
  • BCR-ABL: 4
  • E2A-rearranged (E-sub): 4
  • MLL: 4
  • BCR-ABL + hyperdiploidy: 1
  • E2A-rearranged (E): 1
  • TEL-AML1 + hyperdiploidy: 1

SLIDE 39

Den Boer 2009 – The transcriptomic signature

  • The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
  • The heatmaps show that the selected genes are differentially expressed
    • between subtypes of the training set (left);
    • between subtypes of the testing set (right).

Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

SLIDE 40

Den Boer 2009 – Accuracy of the classifier

  • The signature has an excellent diagnostic value: for the well-represented cancer types, the sensitivity and specificity are >90%.
  • Note: accuracy can be misleading: some subtypes have 98% accuracy with 0% sensitivity.

Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
PPV = TP / (TP + FP)
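These statistics follow directly from the four confusion counts. The sketch below uses made-up counts for a rare subtype to show why a high accuracy can coexist with a sensitivity of zero:

```r
# Made-up counts: 2 positives among 100 samples, and a classifier
# that never predicts the positive class
TP <- 0; FN <- 2; FP <- 0; TN <- 98

Sn  <- TP / (TP + FN)                    # sensitivity
Sp  <- TN / (TN + FP)                    # specificity
acc <- (TP + TN) / (TP + TN + FP + FN)   # accuracy
# PPV = TP / (TP + FP) is undefined here: no positive prediction was made

c(Sn = Sn, Sp = Sp, accuracy = acc)
# accuracy reaches 98% even though the sensitivity is 0%
```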

SLIDE 41

Den Boer 2009 – Exploring some profiles

  • Left: expression of 2 genes selected at random (CCDC28A|209479_at and TCF3|215260_s_at). Each symbol represents one sample, coloured by cancer type. All cancer types are intermingled.
  • Right: expression of the 2 genes with the highest sample-wise variance (CD9|201005_at and IL23A|211796_s_at). The first gene (CD9) separates cell types T and Bt (low expression) from Bh, Bep, Br (high expression). Bo is dispersed over the whole range.
  • Question: how can we identify a combination of genes that discriminates the different subtypes as well as possible?

SLIDE 42

Supervised classification: methodological principles

Statistical Analysis of Microarray Data

SLIDE 43

Multivariate data with a nominal criterion variable

  • One has a set of objects (the sample) which have been previously assigned to predefined classes.
  • Each object is characterized by a series of quantitative variables (the predictors), and its class is indicated in a separate column (the criterion variable).

               Predictor variables                         Criterion variable
               variable 1   variable 2   ...   variable p   class
object 1       x1,1         x2,1         ...   xp,1         A
object 2       x1,2         x2,2         ...   xp,2         A
object 3       x1,3         x2,3         ...   xp,3         A
...            ...          ...          ...   ...          ...
object i       x1,i         x2,i         ...   xp,i         B
object i+1     x1,i+1       x2,i+1       ...   xp,i+1       B
object i+2     x1,i+2       x2,i+2       ...   xp,i+2       B
...            ...          ...          ...   ...          ...
object n-1     x1,n-1       x2,n-1       ...   xp,n-1       K
object n       x1,n         x2,n         ...   xp,n         K

SLIDE 44

Supervised classification – training and prediction

  • Training phase (training + evaluation)
    • The sample is used to build a discriminant function.
    • The quality of the discriminant function is evaluated.
  • Prediction phase
    • The discriminant function is used to predict the value of the criterion variable for new objects.

Training set (known class):
                variable 1   variable 2   ...   variable p   class
object 1        x11          x21          ...   xp1          A
object 2        x12          x22          ...   xp2          A
object 3        x13          x23          ...   xp3          B
...             ...          ...          ...   ...          ...
object ntrain   x1n          x2n          ...   xpn          K

Set to predict (unknown class):
                variable 1   variable 2   ...   variable p   class
object 1        x11          x21          ...   xp1          ?
object 2        x12          x22          ...   xp2          ?
object 3        x13          x23          ...   xp3          ?
...             ...          ...          ...   ...          ...
object npred    x1n          x2n          ...   xpn          ?

SLIDE 45

Supervised classification: steps of the general procedure

  • Training (individuals of known class): training set → training → trained classifier.
  • Testing (individuals of known class): testing set → prediction → predicted class → comparison with the known class → confusion table (for evaluation) → validated classifier.
  • Prediction (individuals of unknown class): prediction with the validated classifier → predicted class.

SLIDE 46

Discriminant analysis

Statistical Analysis of Microarray Data

SLIDE 47

Linear or quadratic discriminant analysis (LDA vs QDA)

  • Equal covariance matrices between groups? Linear Discriminant Analysis (LDA) is appropriate (green lines on the graph). The discrimination rule amounts to drawing a straight line between the gravity centers of the training groups.
  • Different covariance matrices? Quadratic Discriminant Analysis (QDA) is recommended (red boundaries on the graphs).
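Both methods are provided by the MASS package shipped with R. A sketch on the built-in iris data; for brevity the error rates below are training error rates, whose limitations are discussed in the evaluation slides:

```r
library(MASS)

# LDA: assumes one common covariance matrix for all classes
lda.fit <- lda(Species ~ ., data = iris)

# QDA: estimates one covariance matrix per class
qda.fit <- qda(Species ~ ., data = iris)

# Training error rate of each classifier
mean(predict(lda.fit)$class != iris$Species)
mean(predict(qda.fit)$class != iris$Species)
```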

SLIDE 48

Classification rules

  • New units can be classified on the basis of rules derived from the calibration sample.
  • Several alternative rules can be used:
    • Maximum likelihood rule, based on the density function: assign unit u to group g if f(X|g) > f(X|g') for g' ≠ g.
    • Inverse probability rule, based on the probability: assign unit u to group g if P(X|g) > P(X|g') for g' ≠ g.
    • Posterior probability rule: assign unit u to group g if P(g|X) > P(g'|X) for g' ≠ g.

Where
  • X is the unit vector
  • g, g' are two groups
  • f(X|g) is the density function of the value X for group g
  • P(X|g) is the probability to emit the value X given the group g
  • P(g|X) is the probability to belong to group g, given the value X

SLIDE 49

Posterior probability rule

  • The posterior probability can be obtained by application of Bayes' theorem:

P(g|X) = P(X|g) P(g) / P(X)

P(g|X) = P(X|g) πg / Σg'=1..k P(X|g') πg'

Where
  • X is the unit vector
  • g is a group
  • k is the number of groups
  • πg is the prior probability of group g
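The formula can be applied directly once the class-conditional densities and the priors are specified. A sketch with two univariate normal groups; the means, standard deviations, priors and observed value are all made up:

```r
prior <- c(A = 0.5, B = 0.5)  # prior probabilities pi_g
x <- 0.5                      # observed value for a new unit

# Class-conditional densities f(x | g), assumed normal
f <- c(A = dnorm(x, mean = 0, sd = 1),
       B = dnorm(x, mean = 2, sd = 1))

# Bayes' theorem: P(g | x) = f(x | g) * pi_g / sum_g' f(x | g') * pi_g'
posterior <- f * prior / sum(f * prior)

posterior                     # sums to 1
names(which.max(posterior))   # posterior probability rule: assigned group
```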

SLIDE 50

Choice of the prior probabilities

  • The classes may have different proportions between the sample and the population.
  • For example, we could decide, based on our knowledge of a problem, that 1% of the individuals are likely to belong to the first group, whereas the training set contains 11% of them.

Class    Sample   Population   Priors from sample   Arbitrary priors
PHO      13       659          11%                  1%
MET      19       964          17%                  1%
CTL      82       4160         72%                  98%
TOTAL    114      5783

(Under the arbitrary priors, the implied population sizes would be 58, 58 and 5667, respectively.)

SLIDE 51

Evaluating the performances of a classifier

SLIDE 52

Concepts

  • Evaluation settings
    • Internal evaluation ("training error") versus external test set ("testing error")
    • Independent testing set
    • Splitting the training set into training and testing subsets:
      • iterative subsampling
      • k-fold cross-validation
      • leave-one-out (LOO)
  • Evaluation statistics
    • Confusion table
    • Misclassification error rate (MER)
    • Additional metrics for two-group classification:
      • FP, FN, TP, TN
      • many metrics derived from there: Sn, PPV, FPR, FDR, …

SLIDE 53

Training – testing settings

SLIDE 54

Evaluation of the classifier – predicted and known classes

  • The evaluation of a classifier relies on a data set for which we know the class of each individual: the testing set.
  • The trained classifier is used to predict the class of each individual of the testing set.
  • The predicted and known classes are then compared.

               Predictor variables                         Criterion variable
               variable 1   variable 2   ...   variable p   predicted   known
individual 1   x1,1         x2,1         ...   xp,1         A           A
individual 2   x1,2         x2,2         ...   xp,2         B           A
individual 3   x1,3         x2,3         ...   xp,3         A           A
...            ...          ...          ...   ...          ...         ...
individual i   x1,i         x2,i         ...   xp,i         K           B
individual i+1 x1,i+1       x2,i+1       ...   xp,i+1       B           B
individual i+2 x1,i+2       x2,i+2       ...   xp,i+2       B           B
...            ...          ...          ...   ...          ...         ...
individual n-1 x1,n-1       x2,n-1       ...   xp,n-1       K           K
individual n   x1,n         x2,n         ...   xp,n         K           K

SLIDE 55

Training, testing and prediction

  • Ideally: have an independent testing set.
  • Alternatives:
    • internal validation (NOT RECOMMENDED)
    • splitting the training set:
      • iterative subsampling
      • k-fold cross-validation (CV)
      • leave-one-out (LOO)

Input variables (features): quantitative numbers, as a matrix of individuals x variables. Outcome variable: qualitative (class labels).

  • Training: X1 (n1 individuals x m variables) and Y1 → training → trained classifier.
  • Testing: X2 (n2 x m) → prediction → Y2'; comparison of Y2' with the known Y2 → confusion table → accuracy or error rate.
  • Prediction: X3 (n3 x m) → prediction → Y3'.

slide-56
SLIDE 56

Using an independent testing set

• Using the sample itself for evaluation is problematic, because the evaluation is biased (too optimistic).
• To obtain an independent evaluation, one needs two separate sets: one for training, and one for testing.
• However, two independent sets are not always available.
• An alternative setting is to randomly split the samples of known class into two subsets (holdout approach):
  • the training set is used to build a discriminant function;
  • the testing set is used for evaluation.

56

Training set (predictor variables + criterion variable):

| | variable 1 | variable 2 | ... | variable p | class |
|---|---|---|---|---|---|
| object 1 | x11 | x21 | ... | xp1 | A |
| object 2 | x12 | x22 | ... | xp2 | A |
| object 3 | x13 | x23 | ... | xp3 | B |
| ... | ... | ... | ... | ... | ... |
| object ntrain | x1n | x2n | ... | xpn | K |

Testing set (predictor variables + known and predicted classes):

| | variable 1 | variable 2 | ... | variable p | known | predicted |
|---|---|---|---|---|---|---|
| object 1 | x11 | x21 | ... | xp1 | A | A |
| object 2 | x12 | x22 | ... | xp2 | B | A |
| object 3 | x13 | x23 | ... | xp3 | B | B |
| ... | ... | ... | ... | ... | ... | ... |
| object ntest | x1n | x2n | ... | xpn | K | K |
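The random holdout split described above can be sketched in a few lines of Python (the function name, the fixed seed and the 30% test fraction are illustrative choices, not part of the slides):

```python
import random

def holdout_split(n, test_fraction=0.3, seed=42):
    """Randomly split the indices 0..n-1 into a training and a testing set."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

train_idx, test_idx = holdout_split(100, test_fraction=0.3)
```

The training indices would then be used to build the discriminant function, and the testing indices to evaluate it.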

slide-57
SLIDE 57

Training error rate

• One way to evaluate the performance of a classifier is to run it on the training set itself.
• This approach is called internal analysis.
• The known and predicted classes are then compared for each individual of the training set itself.
• The result is denoted as the training error rate (the error rate measured on the training set itself).
• Warning:
  • This approach is obviously biased: since the training set was used to train the classifier, the classifier is optimised for this very specific dataset.
  • The training error rate is thus too optimistic: the performance may be much weaker on an independent set.
  • This approach is not recommended for general purposes.
  • Its main interest is to be compared with the error rate measured on an independent testing set (testing error rate), in order to measure the overfitting of the classifier to the particular training set.

57

slide-58
SLIDE 58

K-fold cross validation

• Split the training set into k parts (e.g. 10-fold cross-validation).
• Iterate for each subset i:
  1. Train a classifier with all subsets except subset i.
  2. Run the classifier to predict the class of each element of the testing subset (subset i).
• Compare the predicted and known classes for each individual.
• Each individual is thus used:
  • 1 time for testing
  • k-1 times for training

58
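The index bookkeeping of k-fold cross-validation can be sketched in Python (the classifier itself is left abstract, and the function names are illustrative):

```python
def kfold_indices(n, k):
    """Partition the indices 0..n-1 into k (almost) equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def kfold_splits(n, k):
    """For each fold i, return (training indices, testing indices):
    fold i is held out for testing, the k-1 other folds are used for training."""
    folds = kfold_indices(n, k)
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]

splits = kfold_splits(10, 5)
```

Each index appears exactly once in a testing subset and k-1 times in a training subset, as stated above.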

slide-59
SLIDE 59

Leave-one-out (LOO) validation

• When the sample is too small, it is problematic to lose half of it for testing.
• In such a case, the leave-one-out (LOO) approach is recommended:
  1. Discard a single object from the sample.
  2. With the remaining objects, build a discriminant function.
  3. Use this discriminant function to predict the class of the discarded object.
  4. Compare the known and predicted classes for the discarded object.
  5. Iterate the above steps with each object of the sample.
• Note: LOO is equivalent to performing an N-fold cross-validation (where N is the training set size).

59

slide-60
SLIDE 60

Evaluation measures for supervised classification

slide-61
SLIDE 61

Evaluation of a classifier – confusion table

• The results of the evaluation are summarized in a confusion table, which contains the count of each predicted/known combination.
• The confusion table can be used to calculate the accuracy of the predictions.
• When there are more than 2 groups, or when the groups are not associated with + and -, the performance is estimated by computing the misclassification error rate (MER).

61

3-group confusion table (rows: predicted group; columns: known group):

| | A | B | C | SUM |
|---|---|---|---|---|
| A | 8 | | | 8 |
| B | | 1 | 1 | 2 |
| C | 5 | 18 | 81 | 104 |
| SUM | 13 | 19 | 82 | 114 |

Example with groups PHO, MET and CTL:

| | PHO | MET | CTL | SUM |
|---|---|---|---|---|
| PHO | 8 | | | 8 |
| MET | | 1 | 1 | 2 |
| CTL | 5 | 18 | 81 | 104 |
| SUM | 13 | 19 | 82 | 114 |

| Statistic | Definition | Computation | Value |
|---|---|---|---|
| Hits | diagonal | 8 + 1 + 81 | 90 |
| Errors | non-diagonal | 114 - 90 | 24 |
| Hit rate (also named "accuracy") | Hits / total | 90 / 114 | 78.95% |
| MER (misclassification error rate) | Errors / total | 24 / 114 | 21.05% |
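The hit rate and MER of the slide's 3-group example can be computed directly; a Python sketch (the nested-dictionary encoding of the table is an illustrative choice):

```python
# Confusion table from the slide; the MER is the same whichever
# of the two dimensions (rows/columns) is "known" vs "predicted".
confusion = {
    "PHO": {"PHO": 8, "MET": 0,  "CTL": 0},
    "MET": {"PHO": 0, "MET": 1,  "CTL": 1},
    "CTL": {"PHO": 5, "MET": 18, "CTL": 81},
}

hits = sum(confusion[g][g] for g in confusion)   # diagonal: 8 + 1 + 81
total = sum(v for row in confusion.values() for v in row.values())
hit_rate = hits / total                          # "accuracy"
mer = (total - hits) / total                     # misclassification error rate
```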

slide-62
SLIDE 62

Evaluation of a classifier – confusion table for 2-group classification

• The results of the evaluation are summarized in a confusion table, which contains the count of each predicted/known combination.
• The confusion table can be used to calculate the performance of the classifier.
• For 2-group classification, specific metrics can be applied if one group is considered negative and the other one positive.

62

2-group confusion table (rows: predicted class; columns: known class):

| | Case | Control | SUM |
|---|---|---|---|
| Case | TP | FP | P |
| Control | FN | TN | N |
| SUM | TP+FN | FP+TN | T |

Example:

| | Case | Control | SUM |
|---|---|---|---|
| Case | 99 | 20 | 119 |
| Control | 1 | 180 | 181 |
| SUM | 100 | 200 | 300 |

| Statistic | Formula | Computation | Value |
|---|---|---|---|
| Errors | FN+FP | 21 | 7.00% |
| Correct | TP+TN | 279 | 93.00% |
| FPR | FP/(FP+TN) | 20/200 | 10.00% |
| Sn | TP/(TP+FN) | 99/100 | 99.00% |
| FDR | FP/P | 20/119 | 16.81% |
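The example's numbers can be checked directly; a Python sketch of the 2-group metrics defined above:

```python
# Counts from the example table: 100 known cases, 200 known controls.
TP, FP, FN, TN = 99, 20, 1, 180

Sn = TP / (TP + FN)                            # sensitivity: 99/100
FPR = FP / (FP + TN)                           # false positive rate: 20/200
FDR = FP / (FP + TP)                           # false discovery rate: 20/119
error_rate = (FN + FP) / (TP + FP + FN + TN)   # 21/300
```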

slide-63
SLIDE 63

Receiver Operating Characteristic (ROC)

• The Receiver Operating Characteristic (ROC) curve represents the performance of a classifier as a function of a continuous score (e.g. discriminant function, posterior probability).
• The result is a curve with:
  • Abscissa: FPR
  • Ordinate: Sensitivity
• A random classifier will be aligned onto the diagonal.
• A perfect classifier achieves FPR = 0 and Sn = 1 (upper left corner).
• The closer the curve comes to this perfect performance, the better the classifier.
• The Area Under the Curve (AUC) is often used to compare performance:
  • between classifiers;
  • between different parameter settings for the same classifier.

63
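A self-contained Python sketch of how a ROC curve and its AUC can be computed from a continuous score (this simple version ignores ties between scores):

```python
def roc_points(scores, labels):
    """ROC curve: (FPR, Sn) points obtained by sliding a threshold over the score.
    labels: 1 = positive, 0 = negative; a higher score means 'more positive'."""
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A perfect ranking: both positives score higher than both negatives.
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

For this toy example the curve jumps straight to the upper left corner, so the AUC is 1.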

slide-64
SLIDE 64

Evaluation statistics for 2-groups classifiers

Various statistics can be derived from the 4 elements of a contingency table (TP, FP, TN, FN).

| Abbrev | Name | Formula |
|---|---|---|
| TP | True positive | TP |
| FP | False positive | FP |
| FN | False negative | FN |
| TN | True negative | TN |
| KP | Known Positive | TP+FN |
| KN | Known Negative | TN+FP |
| PP | Predicted Positive | TP+FP |
| PN | Predicted Negative | FN+TN |
| N | Total | TP+FP+FN+TN |
| Prev | Prevalence | (TP+FN)/N |
| ODP | Overall Diagnostic Power | (FP+TN)/N |
| CCR | Correct Classification Rate | (TP+TN)/N |
| Sn | True Positive Rate (Sensitivity) | TP/(TP+FN) |
| TNR | True Negative Rate (Specificity) | TN/(FP+TN) |
| FPR | False Positive Rate | FP/(FP+TN) |
| FNR | False Negative Rate | FN/(TP+FN) = 1-Sn |
| PPV | Positive Predictive Value | TP/(TP+FP) |
| FDR | False Discovery Rate | FP/(FP+TP) |
| NPV | Negative Predictive Value | TN/(FN+TN) |
| Mis | Misclassification Rate | (FP+FN)/N |
| Odds | Odds-ratio | (TP+TN)/(FN+FP) |
| Kappa | Kappa | ((TP+TN) - (((TP+FN)\*(TP+FP) + (FP+TN)\*(FN+TN))/N)) / (N - (((TP+FN)\*(TP+FP) + (FP+TN)\*(FN+TN))/N)) |
| NMI | NMI n(s) | (1 - -TP\*log(TP)-FP\*log(FP)-FN\*log(FN)-TN\*log(TN)+(TP+FP)\*log(TP+FP)+(FN+TN)\*log(FN+TN))/(N\*log(N) - ((TP+FN)\*log(TP+FN) + (FP+TN)\*log(FP+TN))) |
| ACP | Average Conditional Probability | 0.25\*(Sn+PPV+Sp+NPV) |
| MCC | Matthews correlation coefficient | (TP\*TN - FP\*FN) / sqrt((TP+FP)\*(TP+FN)\*(TN+FP)\*(TN+FN)) |
| Acc.a | Arithmetic accuracy | (Sn+PPV)/2 |
| Acc.a2 | Accuracy (alternative) | (Sn+Sp)/2 |
| Acc.g | Geometric accuracy | sqrt(Sn\*PPV) |
| Hit.noTN | A sort of hit rate without TN (to avoid the effect of their large number) | TP/(TP+FP+FN) |

[Figure: a series of 2×2 confusion tables (cells TP, FP, FN, TN; rows: predicted true/false; columns: known true/false), each highlighting the cells involved in one statistic: PPV=TP/(TP+FP), Sn=TP/(TP+FN), FPR=FP/(FP+TN), Sp=TN/(FP+TN), FNR=FN/(TP+FN), NPV=TN/(FN+TN), FDR=FP/(FP+TP), FN/(FN+TN).]

slide-65
SLIDE 65

The arithmetic accuracy may be misleading

• Acc.a = (Sn + PPV)/2
• An easy way to fool the arithmetic accuracy: predict all features as positive.
  • Sn is then guaranteed to be 100%.
  • Acc.a is thus guaranteed to be > 50%.
  • Of course, the PPV is poor, but the accuracy > 0.5 will be misleading.
• The geometric accuracy circumvents this problem:
  • Acc.g = sqrt(Sn*PPV)
  • It requires both Sn and PPV to be high.

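A numeric illustration of the trick described above, as a Python sketch (the 10 positives / 990 negatives split is an arbitrary illustrative choice):

```python
import math

# "Predict everything as positive" on an imbalanced set:
# 10 real positives, 990 real negatives, so FN = 0 and FP = 990.
TP, FN, FP = 10, 0, 990

Sn = TP / (TP + FN)              # 1.0 by construction
PPV = TP / (TP + FP)             # 0.01: almost every prediction is wrong
acc_arith = (Sn + PPV) / 2       # > 0.5 despite the terrible PPV: misleading
acc_geom = math.sqrt(Sn * PPV)   # 0.1: low, because it needs BOTH Sn and PPV high
```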

slide-66
SLIDE 66

TN-based statistics may be misleading

• For some types of analyses, TN can represent > 99.9% of the cases.
• Example: predicting transcription factor binding sites in a whole genome.
• In such cases, all the statistics involving TN are misleading.
• For example, a classifier will have a very high specificity (Sp) and a very low false positive rate (FPR) even though its predictions are mostly wrong.

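A numeric illustration with invented but realistic orders of magnitude (all counts below are hypothetical, chosen only to mimic a genome-wide scan):

```python
# Hypothetical genome-wide scan: a few true binding sites among millions
# of negative positions (the counts are invented for illustration).
TP, FN, FP, TN = 40, 60, 1000, 3_000_000

Sp = TN / (FP + TN)    # ~0.9997: looks excellent
FPR = FP / (FP + TN)   # ~0.0003: looks excellent too
FDR = FP / (FP + TP)   # ~0.96: yet ~96% of the predictions are false positives
```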

slide-67
SLIDE 67

Machine-learning – classification approaches

slide-68
SLIDE 68

K Nearest Neighbours (KNN): principle, pros and cons

Principle

• Memorize the positions of the individuals of the training set.
• Predict the class of an individual based on the class labels of its closest neighbours in the training set.

Pros

• A variety of distance criteria to choose from.
• Pretty intuitive and simple.
• No assumptions about the data distribution.
• No training step.
• Easy to implement for multi-class problems.

Cons

• How to choose K? There is no general criterion to choose the optimal number of neighbours.
• Sensitive to the curse of dimensionality (over-fitting).
• Imbalanced data causes problems.
• Sensitive to outliers.
• Slow algorithm.
• Missing values require special treatment.

68

Acharya, A. (2017). Comparative Study of Machine Learning Algorithms for Heart Disease Prediction, (April).
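The principle above can be sketched in a few lines of Python (Euclidean distance and the tie-breaking rule are illustrative choices):

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, x, k=3):
    """Assign x to the class holding the majority among its k nearest
    training points (Euclidean distance)."""
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], x),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated groups A and B.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]
```

Note that "training" is nothing more than storing the points, which is why KNN has no training step.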

slide-69
SLIDE 69

Decision trees (DT): principles, pros and cons

Principle

• A Decision Tree (DT) builds logical rules (e.g. if variable i > a threshold, assign to class c) that progressively lead to the assignment of each sample to a single class.

Pros

• Expressive: one can understand a posteriori which criteria are important for class assignment.

Cons

• Very sensitive to over-fitting.
• Lack of generalisation on unseen data.

69

Acharya, A. (2017). Comparative Study of Machine Learning Algorithms for Heart Disease Prediction, (April).

slide-70
SLIDE 70

Random Forest (RF): principle, pros and cons

Principle

• A random forest (RF) is a classifier consisting of a collection of decision trees.
• Bagging (bootstrapping): each tree is constructed from a subset of the training set.
• Majority vote: a sample is assigned to the class receiving the majority of assignations by the individual trees.

Pros

• Reduces the over-fitting problem of decision trees.

Cons

• Not easy to interpret visually.

70

Adapted from: Anwar Isied and Hashem Tamimi. Using Random Forest (RF) as a transfer learning classifier for detecting Error-Related Potential (ErrP) within the context of P300-Speller. DOI: 10.12751/nncn.bc2015.0143

[Figure: the classes predicted at iterations 1, 2 and 3 are combined by majority vote into the final class assignation.]
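The majority-vote step can be written directly (a trivial Python sketch; ties are broken here by whichever label first reaches the top count):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Final class = the label assigned by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```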

slide-71
SLIDE 71

Support Vector Machines (SVM): principle, pros and cons

Principle

• Separate the classes by a hyperplane in the feature hyperspace.
• The SVM is fitted on the training data, and the resulting hyperplane is applied to the test data.
• The SVM model tries to find the region of the data space where the different classes can be most widely separated, and draws a hyperplane there.

Pros

• Performs similarly to logistic regression when the separation is linear.
• Performs well with non-linear boundaries, depending on the kernel used.
• Handles high-dimensional data well.

Cons

• Sensitive to overfitting.
• Training issues depending on the kernel.

71

Acharya, A. (2017). Comparative Study of Machine Learning Algorithms for Heart Disease Prediction, (April).

slide-72
SLIDE 72

Over-fitting and feature selection

72

slide-73
SLIDE 73
[Figure: "Number of variables and over-fitting (random tests)". Error rate (0.0 to 1.0) as a function of the number of variables p (100 to 500), with curves for LOO, internal, expected (balanced classes) and expected (majority class).]
Over-fitting

• A typical application of supervised classification is to classify experiments (e.g. patient types) on the basis of their expression profiles.
• In this case, the objects are the experiments, and the variables are the genes.
• This raises a problem of over-fitting: the number of variables is much larger than the number of objects in the training set.
• In such situations, the classifier will tend to build a classification rule which perfectly fits the training set, but fails to generalize to other observations.

73

slide-74
SLIDE 74

Feature selection (variable selection)

• One approach to circumvent this problem is to select only a subset of the variables.
• This subset can be selected according to different rules:
  • Variable ordering: variables are ordered according to some criterion, and the topmost variables are retained.
    • Unsupervised criterion: e.g. sort features by decreasing variance (the relevance is questionable).
    • P-value of the t-test (the P-value is not always linear with the t statistic, since the number of observations can vary from row to row if there are missing values).
  • Variable combinations:
    • Selection of a subset of variables, and estimation of the capability of each subset to classify correctly.
    • The number of possible combinations of variables increases exponentially with the number of variables.
    • All combinations of features: generally not tractable (2^m possibilities).
  • Stepwise selection:
    • Stepwise selection is a heuristic to select a subset of variables in quadratic time, but it does not guarantee optimality.
    • Forward selection
    • Backward selection
    • Forward-backward selection

74
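The simplest of these rules (unsupervised ordering by decreasing variance) can be sketched as follows; the function names and the toy matrix are illustrative:

```python
def sample_variance(values):
    """Sample variance (denominator n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def top_features_by_variance(data, n_top):
    """Rank the features (columns of `data`) by decreasing variance and
    return the indices of the n_top most variable ones."""
    columns = list(zip(*data))  # transpose: one tuple of values per feature
    order = sorted(range(len(columns)),
                   key=lambda j: sample_variance(columns[j]),
                   reverse=True)
    return order[:n_top]

# Toy matrix: 3 objects x 3 features; feature 1 varies most, feature 2 not at all.
data = [[1, 100, 5],
        [2, 200, 5],
        [3, 300, 5]]
```

As noted above, this criterion is cheap but its relevance is questionable: high variance does not imply any relation with the class labels.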

slide-75
SLIDE 75

Conclusions

75

slide-76
SLIDE 76

Summary – supervised classification

• Setting:
  • a set of quantitative predictor variables (input variables);
  • a single nominal criterion variable (output variable).
• A sample is used to train the classifier (the training set), which is then evaluated on an independent testing set (testing) before being used to assign additional units to classes (prediction).
• The discriminant function can be either linear or quadratic. Linear discriminant analysis relies on the assumption that the different classes have similar covariance matrices.
• The accuracy of the discriminant function can be evaluated in different ways:
  • on the whole sample (internal approach);
  • by splitting the sample into a training and a testing set (holdout approach):
    • Iterative subsampling
    • K-fold cross-validation
    • Leave-one-out
• The efficiency decreases as the p/N ratio increases. When this ratio is too high, there is a problem of over-fitting.
• Stepwise approaches consist in selecting the subset of variables which yields the highest efficiency.

76

slide-77
SLIDE 77

KNN classifiers

Statistical Analysis of Microarray Data

slide-78
SLIDE 78

K nearest neighbours

• Discriminant analysis is a global approach to classification: the discriminant rule is established in the same way for the whole data space, on the basis of the group centres and covariance matrices. Discriminant analysis is thus a global classifier.
• K nearest neighbour (KNN) classifiers take a very different approach: at each position of the feature space,
  • the K closest neighbour points from the training set are identified;
  • a vote is established as a function of the relative proportions of the respective training groups in this set of neighbours.
• KNN is thus a local classifier.
• The choice of K drastically affects group assignments.

78

slide-79
SLIDE 79

Supplementary material

79

slide-80
SLIDE 80

Conceptual illustration with a single predictor variable

Exercise

• Given two predefined classes (A and B), try to intuitively assign a class to each new object (X positions denoted by vertical black bars).
• How confident do you feel for each of your predictions?
• What is the effect of the respective means?
• What is the effect of the respective standard deviations?
• What is the effect of the population sizes?

80

slide-81
SLIDE 81

Conceptual illustration with a single variable

• In this conceptual example, the two populations have equal means and variances.
• To which group (A or B) would you assign the points at coordinates x, y, z, t, respectively?

81

slide-82
SLIDE 82

Conceptual illustration with a single variable

• Same exercise.
• This example shows that the assignation is affected by the position of the group centres.

82

slide-83
SLIDE 83

Conceptual illustration with a single variable

• Same exercise.
• When the centres become too close, some uncertainty is attached to some points (y, but also partly z).
• There is thus an effect of group distance.

83

slide-84
SLIDE 84

Conceptual illustration with a single variable

• Same exercise.
• The centres are in the same position as in the first example, but the variance is larger.
• This affects the level of separation of the groups, and raises some uncertainty about the group membership of z.
• The group variance thus affects the assignation.

84

slide-85
SLIDE 85

Conceptual illustration with a single variable

• Same exercise.
• This illustrates the effect of the sample size: if a sample is much larger than another one, this increases the likelihood that some observations were issued from this group.

85

slide-86
SLIDE 86

Conceptual illustration with a single variable

• Same exercise.
• This is the symmetric situation of the preceding figure.
• Although the group centres and variances are identical, the change of sample sizes completely modifies the group assignations.
• This is an effect of prior probability.

86

slide-87
SLIDE 87

Conceptual illustration with a single variable

• Same exercise.
• If the two groups have different dispersions, this affects their likelihood of being the originators of some observations.
• The relative dispersion of the groups affects the assignation.

87

slide-88
SLIDE 88

Conceptual illustration with a single variable

• Same exercise.
• Symmetrical situation to the preceding one: same centres, same sample sizes, but the relative variances vary in the opposite way.
• The relative dispersion of the groups affects the assignation.

88

slide-89
SLIDE 89

Conceptual illustration with a single variable

• Same exercise.
• When the dispersion of one group becomes too high, a single boundary is no longer sufficient to separate the two groups.
• In this example, we would classify the leftmost (x) and rightmost (t, and maybe z) objects as B, and the central ones (y) as A.
• We thus need two boundaries to separate these groups.
• The relative dispersion of the groups affects the assignation.

89

slide-90
SLIDE 90

Conceptual illustration with a single variable

• Same exercise.
• Symmetrical situation to the preceding figure.
• The relative dispersion of the groups affects the assignation.

90

slide-91
SLIDE 91

Maximum likelihood rule - multivariate normal case

• If the predictor variable is univariate normal:

$$f(X \mid g) = \frac{1}{\sqrt{2\pi\sigma_g^2}} \; e^{-\frac{1}{2}\left(\frac{X-\mu_g}{\sigma_g}\right)^2}$$

• If the predictor variables are multivariate normal:

$$f(X \mid g) = \frac{1}{\sqrt{(2\pi)^p\,|\Sigma_g|}} \; e^{-\frac{1}{2}(X-\mu_g)'\,\Sigma_g^{-1}\,(X-\mu_g)}$$

Where:
• X is the data vector of the considered unit;
• p is the number of variables;
• µg is the mean vector for group g;
• Σg is the covariance matrix for group g.

91

slide-92
SLIDE 92

Bayesian classification in case of normality

• Each object is assigned to the group g which maximizes the function

$$f_g(X) = P(g)\,\frac{1}{\sqrt{(2\pi)^p\,|\Sigma_g|}}\; e^{-\frac{1}{2}(X-\mu_g)'\,\Sigma_g^{-1}\,(X-\mu_g)}$$

i.e. the prior probability of the group multiplied by the density of X under that group's normal model.

92

slide-93
SLIDE 93

Linear versus quadratic classification rule

• There is one covariance matrix per group g.
  • This matrix indicates the covariance between each pair of columns (variables) of the data set, for the considered group.
  • The diagonal of this matrix contains the variances (the covariance of a variable with itself).
• When all covariance matrices are assumed to be identical:
  • The classification rule can be simplified to obtain a linear function. This is referred to as Linear Discriminant Analysis (LDA).
  • In this case, the boundary between groups is a line (2 variables) or a hyperplane (more than 2 variables).
• If the variances and covariances are expected to differ between groups:
  • A specific covariance matrix has to be used for each group.
  • The boundary between two groups is a curve (with two variables) or a hypersurface (more than 2 variables).
  • This is referred to as Quadratic Discriminant Analysis (QDA).

93
93