CSI5180. Machine Learning for Bioinformatics Applications
Fundamentals of Machine Learning — Training
by
Marcel Turcotte
Version November 6, 2019
Preamble 2/47
Preamble 3/47
Fundamentals of Machine Learning — Training. In this lecture, we focus on training learning algorithms. This includes the need for two, three, or k data sets, tuning the values of the hyperparameters, as well as concepts such as underfitting and overfitting the data. General objective:
Describe the fundamental concepts of machine learning
Preamble 4/47
Describe the role of the training, validation, and test sets
Clarify the concepts of underfitting and overfitting the data
Explain the process of tuning hyperparameter values
Reading:
Chicco, D. Ten quick tips for machine learning in computational biology. BioData Mining 10:35 (2017).
Boulesteix, A.-L. Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11:e1004191 (2015).
Domingos, P. A few useful things to know about machine learning. Commun ACM 55(10):78-87 (2012).
Preamble 5/47
Preamble 6/47
https://youtu.be/nKW8Ndu7Mjw
Problem 7/47
Problem 8/47
The data set is a collection of labelled examples.
$\{(x_i, y_i)\}_{i=1}^{N}$
Each $x_i$ is a feature vector with $D$ dimensions. $x_i^{(j)}$ is the value of the feature $j$ of the example $i$, for $j \in 1 \ldots D$ and $i \in 1 \ldots N$. The label $y_i$ is a real number.
Problem: given the data set as input, create a “model” that can be used to predict the value of y for an unseen x.
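To make the notation concrete, here is a minimal sketch (not from the slides) of such a data set as NumPy arrays; all values are invented for illustration.

    import numpy as np

    # N = 4 examples, D = 3 features; row i of X is the feature vector x_i
    X = np.array([[0.2, 1.5, 3.1],
                  [0.7, 0.3, 2.8],
                  [1.1, 2.2, 0.4],
                  [0.9, 1.8, 1.6]])

    # One real-valued label y_i per example
    y = np.array([4.2, 3.9, 1.1, 2.5])

    assert X.shape == (4, 3) and y.shape == (4,)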
Problem 9/47
QSAR stands for Quantitative Structure-Activity Relationship. As a machine learning problem:
Each $x_i$ is a chemical compound; $y_i$ is the biological activity of the compound $x_i$.
Examples of biological activity include toxicity and biodegradability.
Problem 10/47
Viira, B., García-Sosa, A. T. & Maran, U. Chemical structure and correlation analysis of HIV-1 NNRT and NRT inhibitors and database-curated, published inhibition constants with chemical structure in diverse datasets. J Mol Graph Model 76:205-223 (2017).
“Human immunodeficiency virus type 1 reverse transcriptase (HIV-1 RT) has been one of the most important targets for anti-HIV drug development due to it being an essential step in retroviral replication.”
“Many small molecule compounds (. . . ) have been studied over the years.”
“Due to mutations and other influencing factors, the search for new inhibitor molecules for HIV-1 is ongoing.”
“Our recent design, modelling, and synthesis effort in the search for new compounds has resulted in two new, small, low toxicity (. . . ) inhibitors.”
Problem 11/47
https://aidsinfo.nih.gov/understanding-hiv-aids
Problem 12/47
Each compound (example) in ChemDB is described by numerical features (molecular descriptors).
A possible solution, a model, would look something like this:
$\hat{y} = 44.418 - 35.133\, x^{(1)} - 13.518\, x^{(2)} + 0.766\, x^{(3)}$
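As a sketch of how such a model is used, the code below evaluates it for an unseen compound; the coefficients come from the equation above, while the predict helper and the input descriptor values are hypothetical.

    import numpy as np

    # Parameters of the model above: theta_0 (bias) followed by the three weights
    theta = np.array([44.418, -35.133, -13.518, 0.766])

    def predict(x):
        # y_hat = theta_0 + theta_1 * x^(1) + theta_2 * x^(2) + theta_3 * x^(3)
        return theta[0] + theta[1:].dot(x)

    x_new = np.array([0.615, 1.140, 0.941])  # hypothetical descriptor values
    print(predict(x_new))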
Testing 13/47
Testing 14/47
Training set versus test set.
Rule of thumb: keep 80% of your data for training and use the remaining 20% for testing.
Training set: the data set used for training your model.
Test set: an independent set that is used at the very end to evaluate the performance of your model.
Generalization error: the error rate on new cases.
In most cases, the training error will be low, because most learning algorithms are designed to find values for their parameters (weights) such that the training error is low. However, the generalization error can still be high; in that case, we say that the model is overfitting the training data. If the training error is high, we say that the model is underfitting the training data.
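In scikit-learn, this 80/20 split is typically done with train_test_split; a minimal sketch, assuming X and y are the arrays holding the features and labels:

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the examples for the final evaluation;
    # random_state fixes the shuffle for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)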
Under- and over- fitting 15/47
Under- and over- fitting 16/47
Underfitting and overfitting are two important concepts for machine learning projects. We will use a regression task to illustrate these two concepts.
Under- and over- fitting 17/47
A linear model assumes that the value of the label, $\hat{y}_i$, can be expressed as a linear combination of the feature values, $x_i^{(j)}$:
$\hat{y}_i = h(x_i) = \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}$
Here, $\theta_j$ is the $j$th parameter of the (linear) model, with $\theta_0$ being the bias term/parameter and $\theta_1, \ldots, \theta_D$ being the feature weights.
Problem: find values for all the model parameters so that the model “best fits” the training data.
The Root Mean Square Error (RMSE) is a common performance measure for regression problems:
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ h(x_i) - y_i \right]^2}$
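As a sketch, the RMSE is easily computed with NumPy; the helper below is assumed, with y_true holding the labels and y_pred the predictions $h(x_i)$.

    import numpy as np

    def rmse(y_true, y_pred):
        # Root Mean Square Error: square root of the mean squared residual
        return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))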
Under- and over- fitting 18/47
[Figure: training data (y versus x) with the fitted regression line]
    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(X, y)
Under- and over- fitting 19/47
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic training data: a quadratic function of x plus Gaussian noise
    X = 6 * np.random.rand(100, 1) - 4
    y = X ** 2 - 4 * X + 5 + np.random.randn(100, 1)

    lin_reg = LinearRegression()
    lin_reg.fit(X, y)

    # Predictions at the two ends of the interval, to draw the fitted line
    X_new = np.array([[-4], [2]])
    y_pred = lin_reg.predict(X_new)
Under- and over- fitting 20/47
    import matplotlib.pyplot as plt

    plt.plot(X, y, "b.")           # training examples
    plt.plot(X_new, y_pred, "r-")  # fitted regression line
    plt.xlabel("$x$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.axis([-4, 2, -1, 35])
    save_fig("regression_linear-01")
    plt.show()
Under- and over- fitting 21/47
    import os
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    def save_fig(fig_id, tight_layout=True, fig_extension="pdf", resolution=300):
        # Save the current figure as <fig_id>.<fig_extension>
        path = os.path.join(fig_id + "." + fig_extension)
        print("Saving figure", fig_id)
        if tight_layout:
            plt.tight_layout()
        plt.savefig(path, format=fig_extension, dpi=resolution)
Under- and over- fitting 22/47
Wait a minute! How do we know that a linear model is suitable for this application?
We don’t!
Solution: we might want to “test” alternative hypotheses.
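One way to test an alternative hypothesis, sketched below, is to expand the features with PolynomialFeatures and fit a linear model on the expanded representation, giving a polynomial (here quadratic) model; X and y are assumed to be the arrays generated earlier, and poly_reg is an illustrative name.

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # A quadratic hypothesis: add x^2 as a feature, then fit linearly
    poly_reg = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LinearRegression())
    poly_reg.fit(X, y)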
Under- and over- fitting 23/47
Hyperparameter: a parameter whose value is not learnt by the algorithm, but set by the user.
Examples include the learning rate, soft and hard margins for SVMs, the regularization weight for regression, the number of layers and the optimization algorithm for ANNs, as well as the degree of a polynomial in the case of polynomial regression, and many more.
Validation set: a third data set used to determine the “optimal” values of the hyperparameters.
Rule of thumb: keep 70% of your data for training, use 15% for validation, and 15% for testing. For data sets comprising millions of examples, a test set of 1 or 2% might be enough.
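A minimal sketch of this 70/15/15 split, using two successive calls to train_test_split (the intermediate names are illustrative):

    from sklearn.model_selection import train_test_split

    # First set aside 30% of the data, then split that portion
    # half-and-half into validation and test sets (70/15/15 overall).
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=42)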
Under- and over- fitting 24/47
One way to assess our models is to visualize the learning curves:
A learning curve shows the performance of our model, here measured using the RMSE, on both the training set and the validation set. Multiple measurements are obtained by repeatedly training the model on larger and larger subsets of the data.
Under- and over- fitting 25/47
Source: Géron 2019
Under- and over- fitting 26/47
[Figure: training data (y versus x) with the fitted linear model]
Under- and over- fitting 27/47
With only one (1) or two (2) examples, the model perfectly “fits” the training set. As the size of the data set grows, it becomes impossible to fit the training set perfectly, since the examples used to produce this figure were generated using a quadratic function. Accordingly, the training RMSE rises until it reaches a plateau and stays there.
With few examples, the model performs badly on the validation set; the model generalizes poorly. As the size of the data set grows, the performance on the validation set improves, eventually reaching a plateau as well.
“These learning curves are typical of a model that’s underfitting. Both curves have reached a plateau; they are close and fairly high.” [2]
Under- and over- fitting 28/47
Source: Géron 2019
Under- and over- fitting 29/47
Source: Géron 2019
Under- and over- fitting 30/47
Here, the error on the training data is much lower. There is a gap between the two curves: the model performs significantly better on the training data than on the validation data.
Under- and over- fitting 31/47
Source: Chollet 2018
Under- and over- fitting 32/47
Source: Chollet 2018
Under- and over- fitting 33/47
Underfitting:
Your model is too simple (e.g., a linear model). Uninformative features. Poor performance on both the training and validation data.
Overfitting:
Your model is too complex (e.g., a tall decision tree, or a deep and wide neural network). Too many features given the number of examples available. Excellent performance on the training set, but poor performance on the validation set.
Under- and over- fitting 34/47
What if I have a small number of examples?
The data is divided into k sets.
Each set is called a fold; we talk about 3-fold, 5-fold, or 10-fold cross-validation; the special case where k = N is called leave-one-out.
The training/validation procedure is run k times. Each time, one of the k sets is used for validation, whereas the rest of the data is used for training.
At the end, one calculates the mean and standard deviation of the metrics of interest: cost/loss function, precision/recall, etc. This opens the door to hypothesis testing.
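A minimal sketch of k-fold cross-validation with scikit-learn’s KFold (the k = N special case is provided by LeaveOneOut); the linear model is used purely for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rmse_scores = []
    for train_idx, val_idx in kf.split(X):
        # Train on k-1 folds, validate on the held-out fold
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[val_idx])
        rmse_scores.append(np.sqrt(mean_squared_error(y[val_idx], y_pred)))
    print("Mean:", np.mean(rmse_scores), "Std:", np.std(rmse_scores))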
Under- and over- fitting 35/47
Insert a discussion about hypothesis testing
Under- and over- fitting 36/47
    import numpy as np
    from sklearn.model_selection import cross_val_score

    # With scoring="neg_mean_squared_error", cross_val_score returns negative
    # MSE values; negate and take the square root to obtain RMSE scores.
    lin_scores = cross_val_score(lin_reg, X, y,
                                 scoring="neg_mean_squared_error", cv=10)
    lin_rmse_scores = np.sqrt(-lin_scores)
    print("Scores:", lin_rmse_scores)
    print("Mean:", lin_rmse_scores.mean())
    print("Standard deviation:", lin_rmse_scores.std())

    Scores: [66782.73843989 66960.118071 70347.95244419 74739.57052552 68031.13388938 71193.84183426 64969.63056405 68281.61137997 71552.91566558 67665.10082067]
    Mean: 69052.46136345083
    Standard deviation: 2731.674001798348
Under- and over- fitting 37/47
    import numpy as np
    from sklearn.model_selection import cross_val_score

    # tree_reg is assumed to be a decision-tree regressor defined earlier,
    # e.g. sklearn.tree.DecisionTreeRegressor(); as above, negative MSE
    # scores are negated and square-rooted to obtain RMSE values.
    tree_scores = cross_val_score(tree_reg, X, y,
                                  scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-tree_scores)
    print("Scores:", tree_rmse_scores)
    print("Mean:", tree_rmse_scores.mean())
    print("Standard deviation:", tree_rmse_scores.std())

    Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782 71115.88230639 75585.14172901 70262.86139133 70273.6325285 75366.87952553 71231.65726027]
    Mean: 71407.68766037929
    Standard deviation: 2439.4345041191004
Under- and over- fitting 38/47
Most learning algorithms have many hyperparameters that need tuning. In fact, this is one of the major disadvantages of machine learning algorithms.
Often, people manually explore various combinations. A grid search is a better approach:
Systematically enumerate all the possible combinations of hyperparameter values. For each combination, train a model on the training set and evaluate its performance on the validation set.
Initially, try powers of two (2) or ten (10). If time allows, conduct another grid search with values close to the optimal values found in the previous iteration (a refined second pass is sketched after the code below).
Under- and over- fitting 39/47
See: Géron 2019 §2
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}
    ]
    forest_reg = RandomForestRegressor()
    # Evaluate every combination of hyperparameter values with 5-fold CV
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
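Continuing from the block above, a hypothetical second pass can explore a finer grid centred on the best values just found; the specific values below are illustrative.

    # Hypothetical refinement around {'max_features': 8, 'n_estimators': 30}
    param_grid_refined = [
        {'n_estimators': [25, 30, 35, 40], 'max_features': [6, 7, 8]}
    ]
    grid_search_refined = GridSearchCV(forest_reg, param_grid_refined, cv=5)
    grid_search_refined.fit(X_train, y_train)
    grid_search_refined.best_params_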
Under- and over- fitting 40/47
What if the number of combinations is large (many hyperparameters, many values for each)? Scikit-Learn provides RandomizedSearchCV.
The user can either supply a list of values for each hyperparameter or a probability distribution (a method for sampling values). The user also specifies the number of iterations, that is, the number of combinations to try. This makes the execution time more predictable.
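A sketch of its use, reusing the random forest and the X_train, y_train arrays from the grid-search example; the distributions and the n_iter value below are illustrative.

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    # Sample hyperparameter values from distributions instead of a fixed grid
    param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }
    forest_reg = RandomForestRegressor()
    rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                    n_iter=10, cv=5, random_state=42)
    rnd_search.fit(X_train, y_train)
    rnd_search.best_params_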
7-Step workflow 41/47
7-Step workflow 42/47
1. Defining the problem and assembling the data
2. Choosing a measure of success
3. Choosing an evaluation protocol
4. Preparing the data
5. Developing an initial model
6. Developing a model that overfits
7. Regularization and hyperparameter tuning
Source: [4]
Prologue 43/47
Prologue 44/47
We discussed the roles of the training, validation, and test sets. We also talked about cross-validation: k-fold and leave-one-out. Underfitting and overfitting are important concepts. We looked at grid search and randomized search.
Prologue 45/47
Training - gradient descent
Prologue 46/47
[1] Nathalie Japkowicz and Mohak Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, Cambridge, 2011.
[2] Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition, 2019.
[3] Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.
[4] François Chollet. Deep Learning with Python. Manning Publications, 2017.
Prologue 47/47
Marcel.Turcotte@uOttawa.ca School of Electrical Engineering and Computer Science (EECS) University of Ottawa