SLIDE 1

Machine learning

DUBii - Module - Statistics with R

Jacques van Helden ORCID 0000-0002-8799-8584 Institut Français de Bioinformatique (IFB) French node of the European ELIXIR bioinformatics infrastructure Aix-Marseille Université (AMU)

  • Lab. Theory and Approaches of Genomic Complexity (TAGC)
SLIDE 2

Brain-learning exercise: assign individuals to groups based on their features

SLIDE 3

Conceptual illustration with two predictor variables

  • In the next slides, we provide a higher-resolution version of the plots, which represent a study case.
  • Exercise: intuitively assign each individual (black dot) to one of the two groups (A, B).
  • At each step, ask yourself the following questions:
    • Which criterion did you use to assign an individual to a group?
    • How confident do you feel about each of your predictions?
    • What is the effect of the respective means?
    • What is the effect of the respective standard deviations?
    • What is the effect of the correlations between the two variables?

SLIDE 4

Conceptual illustration with two variables – Study case 1

  • Inspect the distribution of points for the two groups of individuals (pink, blue) in the two-dimensional feature space.

Figure axes: X1 (Feature 1), X2 (Feature 2).

SLIDE 5

Conceptual illustration with two variables – Study case 2

  • Effect of the group centre location.

SLIDE 6

Conceptual illustration with two variables – Study case 3

  • Effect of the group variance.

SLIDE 7

Conceptual illustration with two variables – Study case 4

  • Effect of the group variance.

SLIDE 8

Conceptual illustration with two variables – Study case 5

  • Impact of the group-specific variances (heteroscedasticity of the data).

SLIDE 9

Conceptual illustration with two variables – Study case 6

  • Impact of the group-specific variances (heteroscedasticity of the data).

SLIDE 10

Conceptual illustration with two variables – Study case 7

  • Effect of the covariance between features.

SLIDE 11

Conceptual illustration with two variables – Study case 8

  • Effect of the covariance between features.

SLIDE 12

Conceptual illustration with two variables – Study case 9

  • Group-specific covariances between features.
    • The two groups have different covariance matrices: the clouds of points are elongated in different directions.
    • How does this difference affect group assignments?

SLIDE 13

Multivariate analysis – Introduction

Statistics Applied to Bioinformatics

Jacques van Helden ORCID 0000-0002-8799-8584 Institut Français de Bioinformatique (IFB) French node of the European ELIXIR bioinformatics infrastructure Aix-Marseille Université (AMU)

  • Lab. Theory and Approaches of Genomic Complexity (TAGC)
SLIDE 14

Multivariate data

  • Each row represents one object (also called unit).
  • Each column represents one variable.

               variable 1   variable 2   ...   variable p
individual 1   x11          x21          ...   xp1
individual 2   x12          x22          ...   xp2
individual 3   x13          x23          ...   xp3
...            ...          ...          ...   ...
individual n   x1n          x2n          ...   xpn

SLIDE 15

Multivariate data with an outcome variable

  • The outcome variable (also called criterion variable) can be
    • qualitative (nominal): classes (e.g. cancer type)
    • quantitative (e.g. survival expectation for a cancer patient)

               Predictor variables                         Outcome variable
               variable 1   variable 2   ...   variable p   variable p+1
individual 1   x11          x21          ...   xp1          y1
individual 2   x12          x22          ...   xp2          y2
individual 3   x13          x23          ...   xp3          y3
...            ...          ...          ...   ...          ...
individual n   x1n          x2n          ...   xpn          yn
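In R (the language used throughout this module), such a table is typically stored as a data frame whose predictor columns are numeric and whose outcome column is a factor (qualitative) or a numeric vector (quantitative). A minimal sketch with simulated values; all names and dimensions are illustrative:

```r
set.seed(1)
n <- 6  # individuals
p <- 3  # predictor variables

# Predictor variables: one row per individual, one column per variable
dataset <- data.frame(matrix(rnorm(n * p), nrow = n,
                             dimnames = list(paste0("individual", 1:n),
                                             paste0("variable", 1:p))))

# Qualitative outcome variable (e.g. a cancer type): a factor column
dataset$outcome <- factor(sample(c("ALL", "AML"), n, replace = TRUE))

str(dataset)  # n observations of p + 1 variables
```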

SLIDE 16

Predictive approaches – Training set

  • The training set is used to build a predictive function.
  • This function is used to predict the value of the outcome variable for new objects.

Training set (outcome variable known):
                    variable 1   ...   variable p   variable p+1 (outcome)
individual 1        x11          ...   xp1          y1
individual 2        x12          ...   xp2          y2
...                 ...          ...   ...          ...
individual N_train  x1n          ...   xpn          yn

Set to predict (outcome variable unknown):
                    variable 1   ...   variable p   variable p+1 (outcome)
individual 1        x11          ...   xp1          ?
individual 2        x12          ...   xp2          ?
...                 ...          ...   ...          ...
individual N_pred   x1n          ...   xpn          ?

SLIDE 17

Evaluation of prediction with a testing set

Training set (known outcome):
                    variable 1   ...   variable p   variable p+1
individual 1        x11          ...   x1p          y1
...                 ...          ...   ...          ...
individual ntrain   xn1          ...   xnp          yn

Testing set (known outcome, compared with the predictions):
                    variable 1   ...   variable p   variable p+1 (known)   variable p+1 (predicted)
individual 1        x11          ...   x1p          y1                     y'1
...                 ...          ...   ...          ...                    ...
individual ntest    xn1          ...   xnp          yntest                 y'ntest

Set to predict (unknown outcome):
                    variable 1   ...   variable p   variable p+1
individual 1        x11          ...   x1p          ?
...                 ...          ...   ...          ...

SLIDE 18

Flowchart of the approaches in multivariate analysis

  • Multivariate table X → reduction of dimensions:
    • variable selection
    • principal component analysis
  • Distance matrix → multidimensional scaling.
  • Outcome variable Y?
    • none → exploratory analysis: cluster analysis (discovered classes + individual assignment), visualisation (graphical representations)
    • quantitative → regression analysis: predicted value of a quantitative variable, yest = f(x)
    • nominal → supervised classification: assignment of individuals to predefined classes, g = f(x)

SLIDE 19

Quiz

Check your understanding of the concepts presented in the previous slides by applying them to your own data.

1. Describe in one sentence a typical case of multidimensional data handled in your domain.
2. Explain how you would organise this dataset into a multivariate structure:
  • What would correspond to the individuals?
  • What would correspond to the variables?
  • How many individuals (n) would you have?
  • How many variables (p) would you have?
  • Do you have one or several outcome variable(s)?
  • If so, are they quantitative, qualitative, or both?
3. Based on the conceptual framework defined above, which kinds of approaches would you envisage to extract which kinds of relevant information from these data? Note that several approaches can be combined to address different questions.

SLIDE 20

Historical (vintage) examples

SLIDE 21

Historical example of a clustering heat map

  • Spellman et al. (1998): systematic detection of genes regulated in a periodic way during the cell cycle.
  • Several experiments were regrouped, with various ways of synchronization (elutriation, cdc mutants, …).
  • ~800 genes showing a periodic pattern of expression were selected (by Fourier analysis).

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97.

Figure: time profiles of yeast cells followed during the cell cycle.
SLIDE 22

Stress response in yeast

  • Gasch et al. (2000) tested the transcriptional response of the yeast genome to
    • various stress conditions (heat shock, osmotic shock, …)
    • drugs
    • alternative carbon sources
    • …
  • The heatmap shows clusters of genes having similar profiles of response to the different types of stress.

Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.

SLIDE 23

Cancer types (Golub, 1999)

  • Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
  • Selected the 50 genes most correlated with the cancer type.
  • Goal: use these genes as molecular signatures for the diagnosis of new patients.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

SLIDE 24

Den Boer et al., 2009: procedure

  • Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 Acute Lymphoblastic Leukemia samples of different types.
  • They use these profiles to select "transcriptome signatures" that will serve for diagnostic purposes: assigning new samples to one of the cancer types.
  • They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.

Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

SLIDE 25

Den Boer 2009 – The transcriptomic signature

  • The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
  • The heatmaps show that the selected genes are differentially expressed
    • between subtypes of the training set (left);
    • between subtypes of the testing set (right).
  • The heatmap is bi-clustered, in order to identify simultaneously groups of patients (rows) and groups of genes (columns), based on the similarity between their expression profiles.

Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

SLIDE 26

Supervised classification

Statistics for bioinformatics

Jacques van Helden Aix-Marseille Université (AMU)

  • Lab. Theory and Approaches of Genomic Complexity (TAGC)

Institut Français de Bioinformatique (IFB) French node of the European ELIXIR bioinformatics infrastructure https://orcid.org/0000-0002-8799-8584

SLIDE 27

Supervised classification – Introduction

  • In the previous chapter, we presented the problem of clustering, which consists in grouping objects without any a priori definition of the groups. The group definitions emerge from the clustering itself (class discovery). Clustering is thus unsupervised.
  • In some cases, one would like to focus on some pre-defined classes:
    • classifying tissues as cancer or non-cancer
    • classifying tissues between different cancer types
    • classifying genes according to pre-defined functional classes (e.g. metabolic pathway, different phases of the cell cycle, ...)
  • The classifier can be built with a training set, and used later for classifying new objects. This is called supervised classification.

SLIDE 28

Supervised classification methods

  • There are many alternative methods for supervised classification:
    • Discriminant analysis (linear: LDA, or quadratic: QDA)
    • Bayesian classifier
    • K-nearest neighbours (KNN)
    • Support Vector Machine (SVM)
    • Decision tree
    • Random Forest (RF)
    • Neural network (NN)
    • ...
  • Questions
    • Which method should we choose?
    • How should we tune its parameters?
    • How to evaluate the respective performances of the methods and parametric choices?
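Most of these methods are available in R packages; MASS (LDA/QDA) and class (KNN) ship with the standard R distribution. A minimal sketch comparing two of them on the built-in iris data; the 50/50 split and k = 5 are arbitrary illustrative choices, not recommendations:

```r
library(MASS)   # lda()
library(class)  # knn()

# Random split of iris into training and testing halves
set.seed(1)
train.idx <- sample(nrow(iris), nrow(iris) / 2)
train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

# Linear discriminant analysis
lda.fit  <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, test)$class

# K-nearest neighbours (Euclidean distance, k = 5)
knn.pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Misclassification error rate of each method on the testing set
mean(lda.pred != test$Species)
mean(knn.pred != test$Species)
```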

SLIDE 29

Supervised classification methods

Choosing the best method is not trivial:
  • Some methods rely on strong assumptions.
    • LDA and QDA: multivariate normality.
    • LDA: all the classes have the same covariance matrix.
  • Some methods implicitly rely on Euclidean distance (e.g. KNN).
  • Some methods require a large training set, to avoid over-fitting.
  • Global vs local classifiers.
    • Global classifiers (e.g. LDA, QDA): same classification rule in the whole data space. The rule is built on the whole training set.
    • Local classifiers (e.g. KNN): rules are made in different sub-spaces, on the basis of the neighbouring training points.
  • The choice of the method thus depends on the structure and on the size of the data sets.

Choosing the best parameters is not trivial either:
  • KNN: number of neighbours
  • LDA, QDA: prior/posterior probabilities
  • SVM: kernel
  • Decision trees
  • RF: number of iterations
  • ...

SLIDE 30

Study case 1: ALL versus AML (data from Golub et al., 1999)

SLIDE 31

Cancer types (Golub, 1999)

  • A founding paper: Golub et al. (1999).
  • Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
  • Selected the 50 genes most correlated with the cancer type.
  • Goal: use these genes as molecular signatures for the diagnosis of new patients.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

SLIDE 32

Motivation

  • The article by Golub et al. (1999) was motivated by the need to develop efficient diagnostics to predict the cancer type from blood samples of patients.
  • They proposed a "molecular signature" of cancer type, allowing to discriminate AML from ALL.
  • This first "historical" study relied on somewhat arbitrary criteria to select the genes composing this signature, and on the way to apply them to classify new patients.
  • We present here the classical statistical methods used to classify "objects" (patients, genes) into pre-defined classes.

SLIDE 33

Golub et al. (1999)

  • Data source: Golub et al. (1999), the first historical publication searching for molecular signatures of cancer type.
  • Training set
    • 38 samples from 2 types of leukemia:
      • 27 acute lymphoblastic leukemia (note: 2 subtypes, ALL-T and ALL-B)
      • 11 acute myeloid leukemia
    • The original data set contains ~7000 genes.
    • Filtering out poorly expressed genes retains 3051 genes.
  • We re-analyze the data using different methods.
  • Selection of differentially expressed genes (DEG)
    • A Welch t-test with robust estimators (median, IQR) retains 367 differentially expressed genes with E-value <= 1.
    • Top plot: circle radius indicates t-test significance (difference between the means vs. standard error on the difference).
    • Bottom plot (volcano plot): sig = -log10(E-value) >= 0, with standardization using median and IQR (golub.t.result.robust$means.diff vs. golub.t.result.robust$sig).

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
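The gene selection step can be sketched in base R. The slide's actual procedure uses robust estimators (median, IQR); the simplified sketch below applies a plain Welch t-test per gene to simulated data and converts p-values to E-values by multiplying by the number of tests:

```r
set.seed(123)
n.genes <- 1000

# Simulated expression matrix: genes in rows, 38 samples (27 ALL, 11 AML)
expr  <- matrix(rnorm(n.genes * 38), nrow = n.genes)
group <- rep(c("ALL", "AML"), times = c(27, 11))

# Give the first 50 genes a true difference between the two groups
expr[1:50, group == "AML"] <- expr[1:50, group == "AML"] + 2

# Welch t-test for each gene (unequal variances is the t.test default)
p.values <- apply(expr, 1, function(g)
  t.test(g[group == "ALL"], g[group == "AML"])$p.value)

# E-value = p-value * number of tests; keep genes with E-value <= 1
e.values <- p.values * n.genes
selected <- which(e.values <= 1)
length(selected)  # mostly among the 50 truly differential genes
```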

SLIDE 34

Golub 1999 – Profiles of selected genes

  • The 367 genes selected by the t-test have apparently different profiles.
    • Some genes seem greener for the ALL patients (27 leftmost samples).
    • Some genes seem greener for the AML patients (11 rightmost samples).

SLIDE 35

Golub – hierarchical clustering of DEG genes / profiles

  • Hierarchical clustering perfectly separates the two cancer types (AML versus ALL).
  • This perfect separation is observed for various metrics (Euclidean, correlation, dot product) and agglomeration rules (complete, average, Ward).
  • Sample clustering further reveals subgroups of ALL.
  • Gene clustering reveals 4 groups of profiles:
    • AML red, ALL green
    • AML green, ALL red
    • overall green, stronger in AML
    • overall red, stronger in ALL

Figure: bi-clustered heatmap of the Golub data (Euclidean distance, complete linkage).
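In base R, hierarchical clustering of the samples amounts to dist() plus hclust(), and heatmap() produces the bi-clustered display. A sketch on simulated data (the real analysis uses the 367 selected Golub genes):

```r
set.seed(7)

# Simulated matrix: 100 genes x 38 samples, group effect on 40 genes
expr  <- matrix(rnorm(100 * 38), nrow = 100)
group <- rep(c("ALL", "AML"), times = c(27, 11))
expr[1:40, group == "AML"] <- expr[1:40, group == "AML"] + 3

# Cluster the samples: Euclidean distance, complete linkage
sample.tree <- hclust(dist(t(expr)), method = "complete")
clusters    <- cutree(sample.tree, k = 2)  # cut the tree into 2 groups

# Compare the discovered clusters with the known cancer types
table(clusters, group)

# Bi-clustered heatmap (genes in rows, samples in columns)
heatmap(expr, distfun = dist, hclustfun = hclust)
```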

SLIDE 36

Principal component analysis (PCA)

  • Principal component analysis (PCA) relies on a transformation of a multivariate table into a multi-dimensional table of "components".
  • With the Golub dataset,
    • most variance is captured by the first component;
    • the first component clearly separates ALL from AML patients;
    • the second component splits the ALL set into two well-separated groups, which correspond almost perfectly to T-cells and B-cells, respectively.

Figures: plot of the samples on the first two components (golub.pca), and barplot of the component variances.
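In R, PCA is available through prcomp(). A sketch on simulated data with a structure loosely analogous to the Golub set; dimensions and effect sizes are made up:

```r
set.seed(42)

# 38 samples (rows) x 50 genes (columns), group effect on half of the genes
group <- rep(c("ALL", "AML"), times = c(27, 11))
expr  <- matrix(rnorm(38 * 50), nrow = 38)
expr[group == "AML", 1:25] <- expr[group == "AML", 1:25] + 2

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of variance captured by the first components
summary(pca)$importance["Proportion of Variance", 1:3]

# The first component separates the two groups
plot(pca$x[, 1], pca$x[, 2], col = factor(group),
     xlab = "PC1", ylab = "PC2")
```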

SLIDE 37

Study case 2: ALL subtypes (data from Den Boer et al., 2009)

SLIDE 38

Den Boer et al., 2009: procedure

  • Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 Acute Lymphoblastic Leukemia samples of different types.
  • They use these profiles to select "transcriptome signatures" that will serve for diagnostic purposes: assigning new samples to one of the cancer types.
  • They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.

Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

Subtype distribution of the 190 samples:
  • hyperdiploid: 44
  • pre-B ALL: 44
  • TEL-AML1: 43
  • T-ALL: 36
  • E2A-rearranged (EP): 8
  • BCR-ABL: 4
  • E2A-rearranged (E-sub): 4
  • MLL: 4
  • BCR-ABL + hyperdiploidy: 1
  • E2A-rearranged (E): 1
  • TEL-AML1 + hyperdiploidy: 1

SLIDE 39

Den Boer 2009 – The transcriptomic signature

  • The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
  • The heatmaps show that the selected genes are differentially expressed
    • between subtypes of the training set (left);
    • between subtypes of the testing set (right).

Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

SLIDE 40

Den Boer 2009 – Accuracy of the classifier

  • The signature has an excellent diagnostic value: for the well-represented cancer types, the sensitivity and specificity are >90%.
  • Note: accuracy can be misleading: some subtypes have 98% accuracy with 0% sensitivity.

Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
PPV = TP / (TP + FP)
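These statistics follow directly from the four confusion counts. The sketch below uses made-up counts for a rare subtype to show why a high accuracy can coexist with a sensitivity of zero:

```r
# Made-up counts: 2 positives among 100 samples, and a classifier
# that never predicts the positive class
TP <- 0; FN <- 2; FP <- 0; TN <- 98

Sn  <- TP / (TP + FN)                    # sensitivity
Sp  <- TN / (TN + FP)                    # specificity
acc <- (TP + TN) / (TP + TN + FP + FN)   # accuracy
# PPV = TP / (TP + FP) is undefined here: no positive prediction was made

c(Sn = Sn, Sp = Sp, accuracy = acc)
# accuracy reaches 98% even though the sensitivity is 0%
```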

SLIDE 41

Den Boer 2009 – Exploring some profiles

  • Left: expression of 2 genes selected at random (CCDC28A|209479_at and TCF3|215260_s_at). Each symbol represents one sample, coloured by cancer type. All cancer types are intermingled.
  • Right: expression of the 2 genes with the highest sample-wise variance (CD9|201005_at and IL23A|211796_s_at). The first gene (CD9) separates cell types T and Bt (low expression) from Bh, Bep, Br (high expression). Bo is dispersed over the whole range.
  • Question: how can we identify a combination of genes that discriminates the different subtypes as well as possible?

SLIDE 42

Supervised classification: methodological principles

Statistical Analysis of Microarray Data

SLIDE 43

Multivariate data with a nominal criterion variable

  • One has a set of objects (the sample) which have been previously assigned to predefined classes.
  • Each object is characterized by a series of quantitative variables (the predictors), and its class is indicated in a separate column (the criterion variable).

               Predictor variables                         Criterion variable
               variable 1   variable 2   ...   variable p   class
object 1       x1,1         x2,1         ...   xp,1         A
object 2       x1,2         x2,2         ...   xp,2         A
object 3       x1,3         x2,3         ...   xp,3         A
...            ...          ...          ...   ...          ...
object i       x1,i         x2,i         ...   xp,i         B
object i+1     x1,i+1       x2,i+1       ...   xp,i+1       B
object i+2     x1,i+2       x2,i+2       ...   xp,i+2       B
...            ...          ...          ...   ...          ...
object n-1     x1,n-1       x2,n-1       ...   xp,n-1       K
object n       x1,n         x2,n         ...   xp,n         K

SLIDE 44

Supervised classification – training and prediction

  • Training phase (training + evaluation)
    • The sample is used to build a discriminant function.
    • The quality of the discriminant function is evaluated.
  • Prediction phase
    • The discriminant function is used to predict the value of the criterion variable for new objects.

Training set (known class):
                variable 1   variable 2   ...   variable p   class
object 1        x11          x21          ...   xp1          A
object 2        x12          x22          ...   xp2          A
object 3        x13          x23          ...   xp3          B
...             ...          ...          ...   ...          ...
object ntrain   x1n          x2n          ...   xpn          K

Set to predict (unknown class):
                variable 1   variable 2   ...   variable p   class
object 1        x11          x21          ...   xp1          ?
object 2        x12          x22          ...   xp2          ?
object 3        x13          x23          ...   xp3          ?
...             ...          ...          ...   ...          ...
object npred    x1n          x2n          ...   xpn          ?

SLIDE 45

Supervised classification: steps of the general procedure

  • Training (individuals of known class): training set → training → trained classifier.
  • Testing (individuals of known class): testing set → prediction → predicted class → comparison with the known class → confusion table (for evaluation) → validated classifier.
  • Prediction (individuals of unknown class): prediction with the validated classifier → predicted class.

SLIDE 46

Discriminant analysis

Statistical Analysis of Microarray Data

SLIDE 47

Linear or quadratic discriminant analysis (LDA vs QDA)

  • Equal covariance matrices between groups? Linear Discriminant Analysis (LDA) is appropriate (green lines on the graph). The discrimination rule amounts to drawing a straight line between the gravity centers of the training groups.
  • Different covariance matrices? Quadratic Discriminant Analysis (QDA) is recommended (red boundaries on the graphs).
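Both methods are provided by the MASS package shipped with R. A sketch on the built-in iris data; for brevity the error rates below are training error rates, whose limitations are discussed in the evaluation slides:

```r
library(MASS)

# LDA: assumes one common covariance matrix for all classes
lda.fit <- lda(Species ~ ., data = iris)

# QDA: estimates one covariance matrix per class
qda.fit <- qda(Species ~ ., data = iris)

# Training error rate of each classifier
mean(predict(lda.fit)$class != iris$Species)
mean(predict(qda.fit)$class != iris$Species)
```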

SLIDE 48

Classification rules

  • New units can be classified on the basis of rules derived from the calibration sample.
  • Several alternative rules can be used:
    • Maximum likelihood rule, based on the density function: assign unit u to group g if f(X|g) > f(X|g') for g' ≠ g.
    • Inverse probability rule, based on the probability: assign unit u to group g if P(X|g) > P(X|g') for g' ≠ g.
    • Posterior probability rule: assign unit u to group g if P(g|X) > P(g'|X) for g' ≠ g.

Where
  • X is the unit vector
  • g, g' are two groups
  • f(X|g) is the density function of the value X for group g
  • P(X|g) is the probability to emit the value X given the group g
  • P(g|X) is the probability to belong to group g, given the value X

SLIDE 49

Posterior probability rule

  • The posterior probability can be obtained by application of Bayes' theorem:

P(g|X) = P(X|g) P(g) / P(X)

P(g|X) = P(X|g) πg / Σg'=1..k P(X|g') πg'

Where
  • X is the unit vector
  • g is a group
  • k is the number of groups
  • πg is the prior probability of group g
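The formula can be applied directly once the class-conditional densities and the priors are specified. A sketch with two univariate normal groups; the means, standard deviations, priors and observed value are all made up:

```r
prior <- c(A = 0.5, B = 0.5)  # prior probabilities pi_g
x <- 0.5                      # observed value for a new unit

# Class-conditional densities f(x | g), assumed normal
f <- c(A = dnorm(x, mean = 0, sd = 1),
       B = dnorm(x, mean = 2, sd = 1))

# Bayes' theorem: P(g | x) = f(x | g) * pi_g / sum_g' f(x | g') * pi_g'
posterior <- f * prior / sum(f * prior)

posterior                     # sums to 1
names(which.max(posterior))   # posterior probability rule: assigned group
```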

SLIDE 50

Choice of the prior probabilities

  • The classes may have different proportions between the sample and the population.
  • For example, we could decide, based on our knowledge of a problem, that 1% of the individuals are likely to belong to the first group, whereas the training set contains 11% of them.

Class    Sample   Population   Priors from sample   Arbitrary priors
PHO      13       659          11%                  1%
MET      19       964          17%                  1%
CTL      82       4160         72%                  98%
TOTAL    114      5783

(Under the arbitrary priors, the implied population sizes would be 58, 58 and 5667, respectively.)

SLIDE 51

Evaluating the performances of a classifier

SLIDE 52

Concepts

  • Evaluation settings
    • Internal evaluation ("training error") versus external test set ("testing error")
    • Independent testing set
    • Splitting the training set into training and testing subsets:
      • iterative subsampling
      • k-fold cross-validation
      • leave-one-out (LOO)
  • Evaluation statistics
    • Confusion table
    • Misclassification error rate (MER)
    • Additional metrics for two-group classification:
      • FP, FN, TP, TN
      • many metrics derived from there: Sn, PPV, FPR, FDR, …

SLIDE 53

Training – testing settings

SLIDE 54

Evaluation of the classifier – predicted and known classes

  • The evaluation of a classifier relies on a data set for which we know the class of each individual: the testing set.
  • The trained classifier is used to predict the class of each individual of the testing set.
  • The predicted and known classes are then compared.

               Predictor variables                         Criterion variable
               variable 1   variable 2   ...   variable p   predicted   known
individual 1   x1,1         x2,1         ...   xp,1         A           A
individual 2   x1,2         x2,2         ...   xp,2         B           A
individual 3   x1,3         x2,3         ...   xp,3         A           A
...            ...          ...          ...   ...          ...         ...
individual i   x1,i         x2,i         ...   xp,i         K           B
individual i+1 x1,i+1       x2,i+1       ...   xp,i+1       B           B
individual i+2 x1,i+2       x2,i+2       ...   xp,i+2       B           B
...            ...          ...          ...   ...          ...         ...
individual n-1 x1,n-1       x2,n-1       ...   xp,n-1       K           K
individual n   x1,n         x2,n         ...   xp,n         K           K

SLIDE 55

Training, testing and prediction

  • Ideally: have an independent testing set.
  • Alternatives:
    • internal validation (NOT RECOMMENDED)
    • splitting the training set:
      • iterative subsampling
      • k-fold cross-validation (CV)
      • leave-one-out (LOO)

Input variables (features): quantitative numbers, as a matrix of individuals x variables. Outcome variable: qualitative (class labels).

  • Training: X1 (n1 individuals x m variables) and Y1 → training → trained classifier.
  • Testing: X2 (n2 x m) → prediction → Y2'; comparison of Y2' with the known Y2 → confusion table → accuracy or error rate.
  • Prediction: X3 (n3 x m) → prediction → Y3'.

slide-56
SLIDE 56

Using an independent testing set

• Using the sample itself for evaluation is problematic, because the evaluation is biased (too optimistic).
• To obtain an independent evaluation, one needs two separate sets: one for training, and one for testing.
• However, two independent sets are not always available.
• An alternative setting is to randomly split the samples of known class into two subsets (holdout approach):
  • the training set is used to build a discriminant function;
  • the testing set is used for evaluation.

56

Training set (predictor variables + criterion variable):

| | variable 1 | variable 2 | ... | variable p | class |
|---|---|---|---|---|---|
| object 1 | x11 | x21 | ... | xp1 | A |
| object 2 | x12 | x22 | ... | xp2 | A |
| object 3 | x13 | x23 | ... | xp3 | B |
| ... | ... | ... | ... | ... | ... |
| object ntrain | x1n | x2n | ... | xpn | K |

Testing set (predictor variables + known and predicted classes):

| | variable 1 | variable 2 | ... | variable p | known | predicted |
|---|---|---|---|---|---|---|
| object 1 | x11 | x21 | ... | xp1 | A | A |
| object 2 | x12 | x22 | ... | xp2 | B | A |
| object 3 | x13 | x23 | ... | xp3 | B | B |
| ... | ... | ... | ... | ... | ... | ... |
| object ntest | x1n | x2n | ... | xpn | K | K |
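The random holdout split described above can be sketched in a few lines of Python (the function name, the fixed seed and the 30% test fraction are illustrative choices, not part of the slides):

```python
import random

def holdout_split(n, test_fraction=0.3, seed=42):
    """Randomly split the indices 0..n-1 into a training and a testing set."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

train_idx, test_idx = holdout_split(100, test_fraction=0.3)
```

The training indices would then be used to build the discriminant function, and the testing indices to evaluate it.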

slide-57
SLIDE 57

Training error rate

• One way to evaluate the performance of a classifier is to run it on the training set itself.
• This approach is called internal analysis.
• The known and predicted classes are then compared for each individual of the training set itself.
• The result is denoted as the training error rate (the error rate measured on the training set itself).
• Warning:
  • This approach is obviously biased: since the training set was used to train the classifier, the classifier is optimised for this very specific dataset.
  • The training error rate is thus too optimistic: the performance may be much weaker on an independent set.
  • This approach is not recommended for general purposes.
  • Its main interest is to be compared with the error rate measured on an independent testing set (testing error rate), in order to measure the overfitting of the classifier to the particular training set.

57

slide-58
SLIDE 58

K-fold cross validation

• Split the training set into k parts (e.g. 10-fold cross-validation).
• Iterate for each subset i:
  1. Train a classifier with all subsets except subset i.
  2. Run the classifier to predict the class of each element of the testing subset (subset i).
• Compare the predicted and known classes for each individual.
• Each individual is thus used:
  • 1 time for testing
  • k-1 times for training

58
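The index bookkeeping of k-fold cross-validation can be sketched in Python (the classifier itself is left abstract, and the function names are illustrative):

```python
def kfold_indices(n, k):
    """Partition the indices 0..n-1 into k (almost) equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def kfold_splits(n, k):
    """For each fold i, return (training indices, testing indices):
    fold i is held out for testing, the k-1 other folds are used for training."""
    folds = kfold_indices(n, k)
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]

splits = kfold_splits(10, 5)
```

Each index appears exactly once in a testing subset and k-1 times in a training subset, as stated above.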

slide-59
SLIDE 59

Leave-one-out (LOO) validation

• When the sample is too small, it is problematic to lose half of it for testing.
• In such a case, the leave-one-out (LOO) approach is recommended:
  1. Discard a single object from the sample.
  2. With the remaining objects, build a discriminant function.
  3. Use this discriminant function to predict the class of the discarded object.
  4. Compare the known and predicted classes for the discarded object.
  5. Iterate the above steps with each object of the sample.
• Note: LOO is equivalent to performing an N-fold cross-validation (where N is the training set size).

59

slide-60
SLIDE 60

Evaluation measures for supervised classification

slide-61
SLIDE 61

Evaluation of a classifier – confusion table

• The results of the evaluation are summarized in a confusion table, which contains the count of each predicted/known combination.
• The confusion table can be used to calculate the accuracy of the predictions.
• When there are more than 2 groups, or when the groups are not associated with + and -, the performance is estimated by computing the misclassification error rate (MER).

61

3-group confusion table (rows: predicted group; columns: known group):

| | A | B | C | SUM |
|---|---|---|---|---|
| A | 8 | | | 8 |
| B | | 1 | 1 | 2 |
| C | 5 | 18 | 81 | 104 |
| SUM | 13 | 19 | 82 | 114 |

Example with groups PHO, MET and CTL:

| | PHO | MET | CTL | SUM |
|---|---|---|---|---|
| PHO | 8 | | | 8 |
| MET | | 1 | 1 | 2 |
| CTL | 5 | 18 | 81 | 104 |
| SUM | 13 | 19 | 82 | 114 |

| Statistic | Definition | Computation | Value |
|---|---|---|---|
| Hits | diagonal | 8 + 1 + 81 | 90 |
| Errors | non-diagonal | 114 - 90 | 24 |
| Hit rate (also named "accuracy") | Hits / total | 90 / 114 | 78.95% |
| MER (misclassification error rate) | Errors / total | 24 / 114 | 21.05% |
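The hit rate and MER of the slide's 3-group example can be computed directly; a Python sketch (the nested-dictionary encoding of the table is an illustrative choice):

```python
# Confusion table from the slide; the MER is the same whichever
# of the two dimensions (rows/columns) is "known" vs "predicted".
confusion = {
    "PHO": {"PHO": 8, "MET": 0,  "CTL": 0},
    "MET": {"PHO": 0, "MET": 1,  "CTL": 1},
    "CTL": {"PHO": 5, "MET": 18, "CTL": 81},
}

hits = sum(confusion[g][g] for g in confusion)   # diagonal: 8 + 1 + 81
total = sum(v for row in confusion.values() for v in row.values())
hit_rate = hits / total                          # "accuracy"
mer = (total - hits) / total                     # misclassification error rate
```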

slide-62
SLIDE 62

Evaluation of a classifier – confusion table for 2-group classification

• The results of the evaluation are summarized in a confusion table, which contains the count of each predicted/known combination.
• The confusion table can be used to calculate the performance of the classifier.
• For 2-group classification, specific metrics can be applied if one group is considered negative and the other one positive.

62

2-group confusion table (rows: predicted class; columns: known class):

| | Case | Control | SUM |
|---|---|---|---|
| Case | TP | FP | P |
| Control | FN | TN | N |
| SUM | TP+FN | FP+TN | T |

Example:

| | Case | Control | SUM |
|---|---|---|---|
| Case | 99 | 20 | 119 |
| Control | 1 | 180 | 181 |
| SUM | 100 | 200 | 300 |

| Statistic | Formula | Computation | Value |
|---|---|---|---|
| Errors | FN+FP | 21 | 7.00% |
| Correct | TP+TN | 279 | 93.00% |
| FPR | FP/(FP+TN) | 20/200 | 10.00% |
| Sn | TP/(TP+FN) | 99/100 | 99.00% |
| FDR | FP/P | 20/119 | 16.81% |
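The example's numbers can be checked directly; a Python sketch of the 2-group metrics defined above:

```python
# Counts from the example table: 100 known cases, 200 known controls.
TP, FP, FN, TN = 99, 20, 1, 180

Sn = TP / (TP + FN)                            # sensitivity: 99/100
FPR = FP / (FP + TN)                           # false positive rate: 20/200
FDR = FP / (FP + TP)                           # false discovery rate: 20/119
error_rate = (FN + FP) / (TP + FP + FN + TN)   # 21/300
```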

slide-63
SLIDE 63

Receiver Operating Characteristic (ROC)

• The Receiver Operating Characteristic (ROC) curve represents the performance of a classifier as a function of a continuous score (e.g. discriminant function, posterior probability).
• The result is a curve with:
  • Abscissa: FPR
  • Ordinate: Sensitivity
• A random classifier will be aligned onto the diagonal.
• A perfect classifier achieves FPR = 0 and Sn = 1 (upper left corner).
• The closer the curve comes to this perfect performance, the better the classifier.
• The Area Under the Curve (AUC) is often used to compare performance:
  • between classifiers;
  • between different parameter settings for the same classifier.

63
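A self-contained Python sketch of how a ROC curve and its AUC can be computed from a continuous score (this simple version ignores ties between scores):

```python
def roc_points(scores, labels):
    """ROC curve: (FPR, Sn) points obtained by sliding a threshold over the score.
    labels: 1 = positive, 0 = negative; a higher score means 'more positive'."""
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A perfect ranking: both positives score higher than both negatives.
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

For this toy example the curve jumps straight to the upper left corner, so the AUC is 1.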

slide-64
SLIDE 64

Evaluation statistics for 2-groups classifiers

Various statistics can be derived from the 4 elements of a contingency table (TP, FP, TN, FN).

| Abbrev | Name | Formula |
|---|---|---|
| TP | True positive | TP |
| FP | False positive | FP |
| FN | False negative | FN |
| TN | True negative | TN |
| KP | Known Positive | TP+FN |
| KN | Known Negative | TN+FP |
| PP | Predicted Positive | TP+FP |
| PN | Predicted Negative | FN+TN |
| N | Total | TP+FP+FN+TN |
| Prev | Prevalence | (TP+FN)/N |
| ODP | Overall Diagnostic Power | (FP+TN)/N |
| CCR | Correct Classification Rate | (TP+TN)/N |
| Sn | True Positive Rate (Sensitivity) | TP/(TP+FN) |
| TNR | True Negative Rate (Specificity) | TN/(FP+TN) |
| FPR | False Positive Rate | FP/(FP+TN) |
| FNR | False Negative Rate | FN/(TP+FN) = 1-Sn |
| PPV | Positive Predictive Value | TP/(TP+FP) |
| FDR | False Discovery Rate | FP/(FP+TP) |
| NPV | Negative Predictive Value | TN/(FN+TN) |
| Mis | Misclassification Rate | (FP+FN)/N |
| Odds | Odds-ratio | (TP+TN)/(FN+FP) |
| Kappa | Kappa | ((TP+TN) - (((TP+FN)\*(TP+FP) + (FP+TN)\*(FN+TN))/N)) / (N - (((TP+FN)\*(TP+FP) + (FP+TN)\*(FN+TN))/N)) |
| NMI | NMI n(s) | (1 - -TP\*log(TP)-FP\*log(FP)-FN\*log(FN)-TN\*log(TN)+(TP+FP)\*log(TP+FP)+(FN+TN)\*log(FN+TN))/(N\*log(N) - ((TP+FN)\*log(TP+FN) + (FP+TN)\*log(FP+TN))) |
| ACP | Average Conditional Probability | 0.25\*(Sn+PPV+Sp+NPV) |
| MCC | Matthews correlation coefficient | (TP\*TN - FP\*FN) / sqrt((TP+FP)\*(TP+FN)\*(TN+FP)\*(TN+FN)) |
| Acc.a | Arithmetic accuracy | (Sn+PPV)/2 |
| Acc.a2 | Accuracy (alternative) | (Sn+Sp)/2 |
| Acc.g | Geometric accuracy | sqrt(Sn\*PPV) |
| Hit.noTN | A sort of hit rate without TN (to avoid the effect of their large number) | TP/(TP+FP+FN) |

[Figure: a series of 2×2 confusion tables (cells TP, FP, FN, TN; rows: predicted true/false; columns: known true/false), each highlighting the cells involved in one statistic: PPV=TP/(TP+FP), Sn=TP/(TP+FN), FPR=FP/(FP+TN), Sp=TN/(FP+TN), FNR=FN/(TP+FN), NPV=TN/(FN+TN), FDR=FP/(FP+TP), FN/(FN+TN).]

slide-65
SLIDE 65

The arithmetic accuracy may be misleading

• Acc.a = (Sn + PPV)/2
• An easy way to fool the arithmetic accuracy: predict all features as positive.
  • Sn is then guaranteed to be 100%.
  • Acc.a is thus guaranteed to be > 50%.
  • Of course, the PPV is poor, but the accuracy > 0.5 will be misleading.
• The geometric accuracy circumvents this problem:
  • Acc.g = sqrt(Sn*PPV)
  • It requires both Sn and PPV to be high.

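A numeric illustration of the trick described above, as a Python sketch (the 10 positives / 990 negatives split is an arbitrary illustrative choice):

```python
import math

# "Predict everything as positive" on an imbalanced set:
# 10 real positives, 990 real negatives, so FN = 0 and FP = 990.
TP, FN, FP = 10, 0, 990

Sn = TP / (TP + FN)              # 1.0 by construction
PPV = TP / (TP + FP)             # 0.01: almost every prediction is wrong
acc_arith = (Sn + PPV) / 2       # > 0.5 despite the terrible PPV: misleading
acc_geom = math.sqrt(Sn * PPV)   # 0.1: low, because it needs BOTH Sn and PPV high
```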

slide-66
SLIDE 66

TN-based statistics may be misleading

• For some types of analyses, TN can represent > 99.9% of the cases.
• Example: predicting transcription factor binding sites in a whole genome.
• In such cases, all the statistics involving TN are misleading.
• For example, a classifier will have a very high specificity (Sp) and a very low false positive rate (FPR) even though its predictions are mostly wrong.

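A numeric illustration with invented but realistic orders of magnitude (all counts below are hypothetical, chosen only to mimic a genome-wide scan):

```python
# Hypothetical genome-wide scan: a few true binding sites among millions
# of negative positions (the counts are invented for illustration).
TP, FN, FP, TN = 40, 60, 1000, 3_000_000

Sp = TN / (FP + TN)    # ~0.9997: looks excellent
FPR = FP / (FP + TN)   # ~0.0003: looks excellent too
FDR = FP / (FP + TP)   # ~0.96: yet ~96% of the predictions are false positives
```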

slide-67
SLIDE 67

Machine-learning – classification approaches

slide-68
SLIDE 68

K Nearest Neighbours (KNN): principle, pros and cons

Principle

• Memorize the positions of the individuals of the training set.
• Predict the class of an individual based on the class labels of its closest neighbours in the training set.

Pros

• A variety of distance criteria to choose from.
• Pretty intuitive and simple.
• No assumptions about the data distribution.
• No training step.
• Easy to implement for multi-class problems.

Cons

• How to choose K? There is no general criterion to choose the optimal number of neighbours.
• Sensitive to the curse of dimensionality (over-fitting).
• Imbalanced data causes problems.
• Sensitive to outliers.
• Slow algorithm.
• Missing values require special treatment.

68

Acharya, A. (2017). Comparative Study of Machine Learning Algorithms for Heart Disease Prediction, (April).
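The principle above can be sketched in a few lines of Python (Euclidean distance and the tie-breaking rule are illustrative choices):

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, x, k=3):
    """Assign x to the class holding the majority among its k nearest
    training points (Euclidean distance)."""
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], x),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated groups A and B.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]
```

Note that "training" is nothing more than storing the points, which is why KNN has no training step.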

slide-69
SLIDE 69

Decision trees (DT): principles, pros and cons

Principle

• A Decision Tree (DT) builds logical rules (e.g. if variable i > a threshold, assign to class c) that progressively lead to the assignment of each sample to a single class.

Pros

• Expressive: one can understand a posteriori which criteria are important for class assignment.

Cons

• Very sensitive to over-fitting.
• Lack of generalisation on unseen data.

69

Acharya, A. (2017). Comparative Study of Machine Learning Algorithms for Heart Disease Prediction, (April).

slide-70
SLIDE 70

Random Forest (RF): principle, pros and cons

Principle

• A random forest (RF) is a classifier consisting of a collection of decision trees.
• Bagging (bootstrapping): each tree is constructed from a subset of the training set.
• Majority vote: a sample is assigned to the class receiving the majority of assignations by the individual trees.

Pros

• Reduces the over-fitting problem of decision trees.

Cons

• Not easy to interpret visually.

70

Adapted from: Anwar Isied and Hashem Tamimi. Using Random Forest (RF) as a transfer learning classifier for detecting Error-Related Potential (ErrP) within the context of P300-Speller. DOI: 10.12751/nncn.bc2015.0143

[Figure: the classes predicted at iterations 1, 2 and 3 are combined by majority vote into the final class assignation.]
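The majority-vote step can be written directly (a trivial Python sketch; ties are broken here by whichever label first reaches the top count):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Final class = the label assigned by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```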

slide-71
SLIDE 71

Support Vector Machines (SVM): principle, pros and cons

Principle

• Separate the classes by a hyperplane in the feature hyperspace.
• The SVM is fitted on the training data, and the resulting hyperplane is applied to the test data.
• The SVM model tries to find the region of the data space where the different classes can be most widely separated, and draws a hyperplane there.

Pros

• Performs similarly to logistic regression when the separation is linear.
• Performs well with non-linear boundaries, depending on the kernel used.
• Handles high-dimensional data well.

Cons

• Sensitive to overfitting.
• Training issues depending on the kernel.

71

Acharya, A. (2017). Comparative Study of Machine Learning Algorithms for Heart Disease Prediction, (April).

slide-72
SLIDE 72

Over-fitting and feature selection

72

slide-73
SLIDE 73
[Figure: "Number of variables and over-fitting (random tests)". Error rate (0.0 to 1.0) as a function of the number of variables p (100 to 500), with curves for LOO, internal, expected (balanced classes) and expected (majority class).]
Over-fitting

• A typical application of supervised classification is to classify experiments (e.g. patient types) on the basis of their expression profiles.
• In this case, the objects are the experiments, and the variables are the genes.
• This raises a problem of over-fitting: the number of variables is much larger than the number of objects in the training set.
• In such situations, the classifier will tend to build a classification rule which perfectly fits the training set, but fails to generalize to other observations.

73

slide-74
SLIDE 74

Feature selection (variable selection)

• One approach to circumvent this problem is to select only a subset of the variables.
• This subset can be selected according to different rules:
  • Variable ordering: variables are ordered according to some criterion, and the topmost variables are retained.
    • Unsupervised criterion: e.g. sort features by decreasing variance (the relevance is questionable).
    • P-value of the t-test (the P-value is not always linear with the t statistic, since the number of observations can vary from row to row if there are missing values).
  • Variable combinations:
    • Selection of a subset of variables, and estimation of the capability of each subset to classify correctly.
    • The number of possible combinations of variables increases exponentially with the number of variables.
    • All combinations of features: generally not tractable (2^m possibilities).
  • Stepwise selection:
    • Stepwise selection is a heuristic to select a subset of variables in quadratic time, but it does not guarantee optimality.
    • Forward selection
    • Backward selection
    • Forward-backward selection

74
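The simplest of these rules (unsupervised ordering by decreasing variance) can be sketched as follows; the function names and the toy matrix are illustrative:

```python
def sample_variance(values):
    """Sample variance (denominator n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def top_features_by_variance(data, n_top):
    """Rank the features (columns of `data`) by decreasing variance and
    return the indices of the n_top most variable ones."""
    columns = list(zip(*data))  # transpose: one tuple of values per feature
    order = sorted(range(len(columns)),
                   key=lambda j: sample_variance(columns[j]),
                   reverse=True)
    return order[:n_top]

# Toy matrix: 3 objects x 3 features; feature 1 varies most, feature 2 not at all.
data = [[1, 100, 5],
        [2, 200, 5],
        [3, 300, 5]]
```

As noted above, this criterion is cheap but its relevance is questionable: high variance does not imply any relation with the class labels.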

slide-75
SLIDE 75

Conclusions

75

slide-76
SLIDE 76

Summary – supervised classification

• Setting:
  • a set of quantitative predictor variables (input variables);
  • a single nominal criterion variable (output variable).
• A sample is used to train the classifier (the training set), which is then evaluated on an independent testing set (testing) before being used to assign additional units to classes (prediction).
• The discriminant function can be either linear or quadratic. Linear discriminant analysis relies on the assumption that the different classes have similar covariance matrices.
• The accuracy of the discriminant function can be evaluated in different ways:
  • on the whole sample (internal approach);
  • by splitting the sample into a training and a testing set (holdout approach):
    • Iterative subsampling
    • K-fold cross-validation
    • Leave-one-out
• The efficiency decreases as the p/N ratio increases. When this ratio is too high, there is a problem of over-fitting.
• Stepwise approaches consist in selecting the subset of variables which yields the highest efficiency.

76

slide-77
SLIDE 77

KNN classifiers

Statistical Analysis of Microarray Data

slide-78
SLIDE 78

K nearest neighbours

• Discriminant analysis is a global approach to classification: the discriminant rule is established in the same way for the whole data space, on the basis of the group centres and covariance matrices. Discriminant analysis is thus a global classifier.
• K nearest neighbour (KNN) classifiers take a very different approach: at each position of the feature space,
  • the K closest neighbour points from the training set are identified;
  • a vote is established as a function of the relative proportions of the respective training groups in this set of neighbours.
• KNN is thus a local classifier.
• The choice of K drastically affects group assignments.

78

slide-79
SLIDE 79

Supplementary material

79

slide-80
SLIDE 80

Conceptual illustration with a single predictor variable

Exercise

• Given two predefined classes (A and B), try to intuitively assign a class to each new object (X positions denoted by vertical black bars).
• How confident do you feel for each of your predictions?
• What is the effect of the respective means?
• What is the effect of the respective standard deviations?
• What is the effect of the population sizes?

80

slide-81
SLIDE 81

Conceptual illustration with a single variable

• In this conceptual example, the two populations have equal means and variances.
• To which group (A or B) would you assign the points at coordinates x, y, z, t, respectively?

81

slide-82
SLIDE 82

Conceptual illustration with a single variable

• Same exercise.
• This example shows that the assignation is affected by the position of the group centres.

82

slide-83
SLIDE 83

Conceptual illustration with a single variable

• Same exercise.
• When the centres become too close, some uncertainty is attached to some points (y, but also partly z).
• There is thus an effect of group distance.

83

slide-84
SLIDE 84

Conceptual illustration with a single variable

• Same exercise.
• The centres are in the same position as in the first example, but the variance is larger.
• This affects the level of separation of the groups, and raises some uncertainty about the group membership of z.
• The group variance thus affects the assignation.

84

slide-85
SLIDE 85

Conceptual illustration with a single variable

• Same exercise.
• This illustrates the effect of the sample size: if a sample is much larger than another one, this increases the likelihood that some observations were issued from this group.

85

slide-86
SLIDE 86

Conceptual illustration with a single variable

• Same exercise.
• This is the symmetric situation of the preceding figure.
• Although the group centres and variances are identical, the change of sample sizes completely modifies the group assignations.
• This is an effect of prior probability.

86

slide-87
SLIDE 87

Conceptual illustration with a single variable

• Same exercise.
• If the two groups have different dispersions, this affects their likelihood of being the originators of some observations.
• The relative dispersion of the groups affects the assignation.

87

slide-88
SLIDE 88

Conceptual illustration with a single variable

• Same exercise.
• Symmetrical situation to the preceding one: same centres, same sample sizes, but the relative variances vary in the opposite way.
• The relative dispersion of the groups affects the assignation.

88

slide-89
SLIDE 89

Conceptual illustration with a single variable

• Same exercise.
• When the dispersion of one group becomes too high, a single boundary is no longer sufficient to separate the two groups.
• In this example, we would classify the leftmost (x) and rightmost (t, and maybe z) objects as B, and the central ones (y) as A.
• We thus need two boundaries to separate these groups.
• The relative dispersion of the groups affects the assignation.

89

slide-90
SLIDE 90

Conceptual illustration with a single variable

• Same exercise.
• Symmetrical situation to the preceding figure.
• The relative dispersion of the groups affects the assignation.

90

slide-91
SLIDE 91

Maximum likelihood rule - multivariate normal case

• If the predictor variable is univariate normal:

$$f(X \mid g) = \frac{1}{\sqrt{2\pi\sigma_g^2}} \; e^{-\frac{1}{2}\left(\frac{X-\mu_g}{\sigma_g}\right)^2}$$

• If the predictor variables are multivariate normal:

$$f(X \mid g) = \frac{1}{\sqrt{(2\pi)^p\,|\Sigma_g|}} \; e^{-\frac{1}{2}(X-\mu_g)'\,\Sigma_g^{-1}\,(X-\mu_g)}$$

Where:
• X is the data vector of the considered unit;
• p is the number of variables;
• µg is the mean vector for group g;
• Σg is the covariance matrix for group g.

91

slide-92
SLIDE 92

Bayesian classification in case of normality

• Each object is assigned to the group g which maximizes the function

$$f_g(X) = P(g)\,\frac{1}{\sqrt{(2\pi)^p\,|\Sigma_g|}}\; e^{-\frac{1}{2}(X-\mu_g)'\,\Sigma_g^{-1}\,(X-\mu_g)}$$

i.e. the prior probability of the group multiplied by the density of X under that group's normal model.

92

slide-93
SLIDE 93

Linear versus quadratic classification rule

• There is one covariance matrix per group g.
  • This matrix indicates the covariance between each pair of columns (variables) of the data set, for the considered group.
  • The diagonal of this matrix contains the variances (the covariance of a variable with itself).
• When all covariance matrices are assumed to be identical:
  • The classification rule can be simplified to obtain a linear function. This is referred to as Linear Discriminant Analysis (LDA).
  • In this case, the boundary between groups is a line (2 variables) or a hyperplane (more than 2 variables).
• If the variances and covariances are expected to differ between groups:
  • A specific covariance matrix has to be used for each group.
  • The boundary between two groups is a curve (with two variables) or a hypersurface (more than 2 variables).
  • This is referred to as Quadratic Discriminant Analysis (QDA).

93
93