CSI5180. Machine Learning for Bioinformatics Applications — Deep learning



SLIDE 1

CSI5180. Machine Learning for Bioinformatics Applications

Deep learning — encoding and transfer learning

by Marcel Turcotte

Version November 12, 2019

SLIDE 2

Preamble 2/47

Preamble

SLIDE 3

Preamble

Preamble 3/47

Deep learning — encoding and transfer learning

In this lecture, we further investigate deep learning. We review diverse methods to encode the data for these artificial neural networks. We present the concept of embeddings, and specifically embeddings for biological sequences. Finally, we discuss the concept of transfer learning.

General objective:

  • Explain the various ways to encode data for deep networks

SLIDE 4

Learning objectives

Preamble 4/47

  • Explain the concept of embeddings
  • Describe how to implement transfer learning
  • Justify the application of transfer learning

Reading:

  • Ehsaneddin Asgari and Mohammad R K Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10(11):e0141287, 2015.
  • Wang, S., Li, Z., Yu, Y., Xu, J. Folding Membrane Proteins by Deep Transfer Learning. Cell Systems, 5(3):202, 2017.

SLIDE 5

Plan

Preamble 5/47

  • 1. Preamble
  • 2. Summary
  • 3. Keras
  • 4. Preprocessing
  • 5. Transfer learning
  • 6. Prologue
SLIDE 6

Summary 6/47

Summary

SLIDE 7

Summary - threshold logic unit

Summary 7/47

Source: [3] Figure 10.4

Model: h_w(x) = ϕ(x^T w)


SLIDE 9

Summary - Perceptron

Summary 8/47

Source: [3] Figure 10.5

A Perceptron consists of a single layer of threshold logic units. It computes the following function: h_{W,b}(X) = ϕ(WX + b)
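The single-layer computation above can be sketched in a few lines of NumPy. This is not from the slides: the AND weights are made up for illustration, and because rows of X are instances here, each unit computes ϕ(X W + b), the transpose of the slide's convention.

```python
import numpy as np

def heaviside(z):
    """Step activation: 1.0 where z >= 0, else 0.0."""
    return (z >= 0).astype(float)

def perceptron_forward(X, W, b):
    """Single layer of threshold logic units.

    X: (n_instances, n_features), W: (n_features, n_units), b: (n_units,)
    """
    return heaviside(X @ W + b)

# Toy example: 2 inputs, 1 unit; these hand-picked weights implement logical AND.
W = np.array([[1.0], [1.0]])
b = np.array([-1.5])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(perceptron_forward(X, W, b).ravel())  # [0. 0. 0. 1.]
```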

SLIDE 16

Summary - Definitions

Summary 9/47

  • Input neuron: a special type of neuron that simply returns the value of its input.
  • Bias neuron: a neuron that always returns 1.
  • Fully connected layer or dense layer: all the neurons are connected to all the neurons of the previous layer.
  • X: input matrix (rows are instances, columns are features).
  • W: weight matrix (# rows corresponds to the number of inputs, # columns corresponds to the number of neurons in the output layer).
  • b: bias vector (same size as the number of neurons in the output layer).
  • Activation function: maps its input domain to a restricted set of values (Heaviside and sign are commonly used with threshold logic units).

SLIDE 17

Summary - Multilayer Perceptron

Summary 10/47

A two-layer perceptron computes: y = f_2(f_1(X)), where f_l(Z) = ϕ(W_l Z + b_l).

ϕ is an activation function, typically one of: hyperbolic tangent, Rectified Linear Unit (ReLU), sigmoid, etc. W is a weight matrix, X is an input matrix, and b is a bias vector. In the context of artificial neural networks, matrices are called tensors.

Source: [3] Figure 10.7

SLIDE 18

Summary - Multilayer Perceptron

Summary 11/47

A k-layer perceptron computes the following function: y = f_k(. . . f_2(f_1(X)) . . .), where f_l(Z) = ϕ(W_l Z + b_l).
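As an illustration (not from the slides), the k-layer forward pass can be sketched as a loop in NumPy. The layer sizes are arbitrary, and with rows as instances each layer computes ϕ(Z W_l + b_l), the transpose of the slide's convention.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(z, 0), applied element-wise."""
    return np.maximum(z, 0.0)

def mlp_forward(X, weights, biases, phi=relu):
    """k-layer perceptron: y = f_k(... f_2(f_1(X)) ...), f_l(Z) = phi(Z W_l + b_l)."""
    Z = X
    for W, b in zip(weights, biases):
        Z = phi(Z @ W + b)
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))       # 5 instances, 100 features (made-up sizes)
weights = [rng.normal(size=(100, 64)), rng.normal(size=(64, 10))]
biases = [np.zeros(64), np.zeros(10)]
y = mlp_forward(X, weights, biases)
print(y.shape)  # (5, 10)
```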

SLIDE 19

Keras 12/47

Keras

SLIDE 23

Using Keras

Keras 13/47

  • https://keras.io (François Chollet/Google/2015 1st release)
  • Personally, I find it easier to install and maintain Keras using a package manager, such as Conda (specifically, I use Anaconda).
  • Easy to use, yet powerful and efficient (makes use of GPUs if available)
  • Two main APIs: Sequential and Functional

SLIDE 28

Sequential API

Keras 14/47

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=32)

loss_and_metrics = model.evaluate(x_test, y_test)
SLIDE 29

Functional API

Keras 15/47

from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor
inputs = Input(shape=(784,))

# A layer instance is callable on a tensor, and returns a tensor
output_1 = Dense(64, activation='relu')(inputs)
output_2 = Dense(64, activation='relu')(output_1)
predictions = Dense(10, activation='softmax')(output_2)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)  # starts training

SLIDE 30

Preprocessing 16/47

Preprocessing

SLIDE 31

Scaling

Preprocessing 17/47

As discussed at the beginning of the term, it is almost always a good idea to scale the input data. Options include:

  • Custom code
  • sklearn.preprocessing.StandardScaler
  • keras.layers.Lambda
  • a custom Standardization layer
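Whichever option is used, the underlying computation is the same. Here is a minimal NumPy sketch of feature-wise standardization; the data matrix is made up for illustration.

```python
import numpy as np

# Hypothetical training matrix: 4 instances, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Column-wise standardization: zero mean, unit variance per feature,
# which is what sklearn.preprocessing.StandardScaler computes by default.
means = X_train.mean(axis=0, keepdims=True)
stds = X_train.std(axis=0, keepdims=True)
X_scaled = (X_train - means) / stds

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```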

SLIDE 32

keras.layers.Lambda

Preprocessing 18/47

means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    [...]  # other layers
])

Source: [3] §11

SLIDE 36

Standardization layer

Preprocessing 19/47

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

std_layer = Standardization()
std_layer.adapt(data_sample)

model = keras.Sequential()
model.add(std_layer)
...  # create the rest of the model
model.compile([...])
model.fit([...])

Source: [3] §11

SLIDE 37

Categorical data

Preprocessing 20/47

from numpy import array
import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

data = ['T', 'T', 'C', 'T', 'G', 'G', 'C', 'A', 'C', 'T', 'T', 'G']
values = array(data)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
data_encoded = to_categorical(integer_encoded)

SLIDE 38

Categorical data

Preprocessing 21/47

print(data_encoded)

[[0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]
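For comparison, here is a dependency-free sketch of the same one-hot encoding; sorting the symbols alphabetically mirrors the category order that LabelEncoder produces on this data.

```python
data = ['T', 'T', 'C', 'T', 'G', 'G', 'C', 'A', 'C', 'T', 'T', 'G']
categories = sorted(set(data))                 # ['A', 'C', 'G', 'T']
index = {c: i for i, c in enumerate(categories)}

def one_hot(symbol):
    """One dimension per category: all zeros except at the symbol's index."""
    vec = [0.0] * len(categories)
    vec[index[symbol]] = 1.0
    return vec

data_encoded = [one_hot(s) for s in data]
print(data_encoded[0])  # [0.0, 0.0, 0.0, 1.0]  ('T')
```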

SLIDE 47

Embeddings

Preprocessing 22/47

“An embedding is a trainable dense vector that represents a category.” [3] §13

  • With one-hot encoding, we used a sparse encoding with one dimension per category, e.g. A = [1,0,0,0], to avoid creating false associations between categories.
  • With embeddings, the philosophy is the other way around: we want categories that are similar to have similar vector representations.
  • The representation is learnt from the data!
  • Initially, each category is assigned a random vector.
  • During learning, gradient descent will make the vector representations of similar categories more similar to one another.

Why?

  • A better representation can accelerate learning and yield more accurate predictions.
  • Embeddings can be reused! [A form of transfer learning]
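A sketch of what an embedding layer stores (not from the slides; the sizes and values are arbitrary): encoding a sequence is simply a row lookup in a trainable matrix.

```python
import numpy as np

# One dense vector per category, initialized at random; gradient descent
# would later adjust the rows of this matrix like any other weights.
rng = np.random.default_rng(42)
n_categories, embedding_dim = 4, 3          # e.g. A, C, G, T -> 3-d vectors
embedding_matrix = rng.normal(size=(n_categories, embedding_dim))

# "Encoding" an integer-encoded sequence is just row indexing.
sequence = np.array([3, 3, 1, 3])           # T, T, C, T
embedded = embedding_matrix[sequence]
print(embedded.shape)  # (4, 3)
```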

SLIDE 48

Word embeddings

Preprocessing 23/47

Source: [3] Figure 13.5

“Man is to King as Woman is to Queen”

SLIDE 49

2013

Preprocessing 24/47

Distributed Representations of Words and Phrases and their Compositionality. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. https://arxiv.org/abs/1310.4546

“Somewhat surprisingly, many of these patterns can be represented as linear translations.” “For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector.”
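The analogy can be illustrated with made-up two-dimensional vectors; real word2vec embeddings have hundreds of dimensions, so this toy layout is only for intuition.

```python
import numpy as np

# Hand-crafted 2-d "embeddings": one axis loosely encodes gender, the
# other royalty. These values are invented purely for illustration.
vec = {
    "man":   np.array([1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
    "queen": np.array([0.0, 1.0]),
}

# vec("king") - vec("man") + vec("woman") lands closest to vec("queen").
target = vec["king"] - vec["man"] + vec["woman"]
nearest = min(vec, key=lambda w: np.linalg.norm(vec[w] - target))
print(nearest)  # queen
```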

SLIDE 53

Hypothetical example

Preprocessing 25/47

  • Imagine that a (coding) DNA sequence is divided into 3-letter words.
  • There would be 64 such words (64 categories).
  • Initially, each category is assigned a random vector.
  • During learning, 3-letter words corresponding to codons encoding the same amino acid would see their vector representations become more and more similar.
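The tokenization step of this hypothetical example can be sketched as follows; the DNA sequence is made up.

```python
from itertools import product

# Split a coding sequence into non-overlapping 3-letter words (codons).
dna = "ATGGCTGCAGCGTAA"
words = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
print(words)  # ['ATG', 'GCT', 'GCA', 'GCG', 'TAA']

# The full vocabulary has 4^3 = 64 categories.
vocabulary = ["".join(p) for p in product("ACGT", repeat=3)]
print(len(vocabulary))  # 64
```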

SLIDE 54

Embeddings in bioinformatics

Preprocessing 26/47

  • Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. arXiv.org cs.LG, (2019). †
  • Woloszynek, S., Zhao, Z., Chen, J. & Rosen, G. L. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses. PLoS Comput Biol 15, (2019). †

SLIDE 55

Embeddings in bioinformatics

Preprocessing 27/47

  • Asgari, E. & Mofrad, M. R. K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10, (2015).
  • Menegaux, R. & Vert, J.-P. Continuous Embeddings of DNA Sequencing Reads and Application to Metagenomics. J Comput Biol 26, (2019).
  • Min, X., Zeng, W., Chen, N., Chen, T. & Jiang, R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics 33, i92–i101 (2017).
  • Hamid, M.-N. & Friedberg, I. Identifying Antimicrobial Peptides using Word Embedding with Deep Recurrent Neural Networks. Bioinformatics 25, 3389 (2018).
  • Shen, Z., Bao, W. & Huang, D.-S. Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep 8, 15270 (2018).

SLIDE 56

Transfer learning 28/47

Transfer learning

SLIDE 60

Transfer learning

Transfer learning 29/47

Transfer learning is taking a sizable portion of a deep network trained for one application and slightly modifying it before using it in another application.

Why?

  • An obvious reason would be to speed up the learning process.
  • A much more interesting reason (IMHO) is to apply deep learning to applications where the number of examples is low.

SLIDE 61

Transfer learning

Transfer learning 30/47

Source: [3] Figure 11.4

SLIDE 64

Transfer learning in bioinformatics

Transfer learning 31/47

Computational elucidation of membrane protein (MP) structures is challenging partially due to lack of sufficient solved structures for homology modeling. Here, we describe a high-throughput deep transfer learning method that first predicts MP contacts by learning from non-MPs and then predicts 3D structure models using the predicted contacts as distance restraints.

Wang, S., Li, Z., Yu, Y., Xu, J. Folding Membrane Proteins by Deep Transfer Learning. Cell Systems 5(3), 202, 2017.

SLIDE 65

Kipoi

Transfer learning 32/47

Žiga Avsec, Roman Kreuzhuber, Johnny Israeli, Nancy Xu, Jun Cheng, Avanti Shrikumar, Abhimanyu Banerjee, Daniel S Kim, Thorsten Beier, Lara Urban, Anshul Kundaje, Oliver Stegle, and Julien Gagneur. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat Biotechnol, 37(6):592–600, Jun 2019.

SLIDE 71

Transfer learning

Transfer learning 33/47

[4] §8.7:

  • 1. You build a deep model on the original big dataset ([non-membrane proteins]).
  • 2. You compile a much smaller labelled dataset for your second model ([membrane proteins]).
  • 3. You remove the last one or several layers from the first model. Usually, these are layers responsible for the classification or regression; they usually follow the embedding layer.
  • 4. You replace the removed layers with new layers adapted for your new problem.
  • 5. You “freeze” the parameters of the layers remaining from the first model.
  • 6. You use your smaller labelled dataset and gradient descent to train the parameters of only the new layers.

SLIDE 73

Transfer learning with Keras

Transfer learning 34/47

[3] §11:

model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Alternatively, clone model A first, so that training model_B_on_A does not also modify model_A (the layers are shared otherwise):

model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

SLIDE 74

Transfer learning with Keras

Transfer learning 35/47

[3] §11:

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer="sgd",
                     metrics=["accuracy"])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

SLIDE 75

Transfer learning with Keras

Transfer learning 36/47

[3] §11:

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = keras.optimizers.SGD(lr=1e-4)  # the default lr is 1e-2

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=optimizer,
                     metrics=["accuracy"])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))

SLIDE 77

Remarks

Transfer learning 37/47

  • Transfer learning is possibly unique to deep learning methods.
  • When the number of training examples available is too small to justify using deep learning, there might be a sufficiently similar problem for which a lot of data is available.

SLIDE 78

Prologue 38/47

Prologue

SLIDE 80

Summary

Prologue 39/47

  • Embeddings are representations that are learnt from data.
  • Transfer learning allows for the application of deep learning to problems for which the amount of training data is low.

SLIDE 81

Next module

Prologue 40/47

Deep learning - architectures

SLIDE 82

References

Prologue 41/47

  • Ehsaneddin Asgari and Mohammad R K Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10(11):e0141287, 2015.
  • François Chollet. Deep learning with Python. Manning Publications, 2017.
  • Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition, 2019.
  • Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.
  • Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–44, May 2015.

SLIDE 83

References

Prologue 42/47

  • Prabina Kumar Meher, Tanmaya Kumar Sahu, Shachi Gahoi, Subhrajit Satpathy, and Atmakuri Ramakrishna Rao. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene, 705:113–126, Jul 2019.
  • Long Zhang, Guoxian Yu, Dawen Xia, and Jun Wang. Protein-protein interactions prediction based on ensemble deep neural networks. Neurocomputing, 324:10–19, 2019.
  • Ruiqing Zheng, Min Li, Xiang Chen, Fang-Xiang Wu, Yi Pan, and Jianxin Wang. BiXGBoost: a scalable, flexible boosting-based method for reconstructing gene regulatory networks. Bioinformatics, 35(11):1893–1900, Jun 2019.
  • Maria Colomé-Tatché and Fabian J Theis. Statistical single cell multi-omics integration. Current Opinion in Systems Biology, 7:54–59, 2018.

SLIDE 84

References

Prologue 43/47

  • Yuming Ma, Yihui Liu, and Jinyong Cheng. Protein secondary structure prediction based on data partition and semi-random subspace method. Sci Rep, 8(1):9856, Jun 2018.
  • Xuan Zhang, Jun Wang, Jing Li, Wen Chen, and Changning Liu. CRlncRC: a machine learning-based method for cancer-related long noncoding RNA identification using integrated features. BMC Med Genomics, 11(Suppl 6):120, Dec 2018.
  • Xiaoying Wang, Bin Yu, Anjun Ma, Cheng Chen, Bingqiang Liu, and Qin Ma. Protein–protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics, 35(14):2395–2402, 12 2018.
  • Zhen Cao, Xiaoyong Pan, Yang Yang, Yan Huang, and Hong-Bin Shen. The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics, 34(13):2185–2194, 07 2018.

SLIDE 85

References

Prologue 44/47

  • Xing Chen, Chi-Chi Zhu, and Jun Yin. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol, 15(7):e1007209, Jul 2019.
  • Jialin Yu, Shaoping Shi, Fang Zhang, Guodong Chen, and Man Cao. PredGly: predicting lysine glycation sites for homo sapiens based on XGboost feature optimization. Bioinformatics, 35(16):2749–2756, Aug 2019.
  • Hui Peng, Yi Zheng, Zhixun Zhao, Tao Liu, and Jinyan Li. Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions. Bioinformatics, 34(17):i757–i765, 09 2018.
  • Weijia Su, Xun Gu, and Thomas Peterson. TIR-Learner, a new ensemble method for TIR transposable element annotation, provides evidence for abundant new transposable elements in the maize genome. Mol Plant, 12(3):447–460, 03 2019.

SLIDE 86

References

Prologue 45/47

  • Xiangxiang Zeng, Yue Zhong, Wei Lin, and Quan Zou. Predicting disease-associated circular RNAs using deep forests combined with positive-unlabeled learning methods. Brief Bioinform, Oct 2019.
  • Jaswinder Singh, Jack Hanson, Rhys Heffernan, Kuldip Paliwal, Yuedong Yang, and Yaoqi Zhou. Detecting proline and non-proline cis isomers in protein structures from sequences using deep residual ensemble learning. J Chem Inf Model, 58(9):2033–2042, 09 2018.
  • Anand Pratap Singh, Sarthak Mishra, and Suraiya Jabin. Sequence based prediction of enhancer regions from DNA random walk. Sci Rep, 8(1):15912, 10 2018.
  • Stephen Woloszynek, Zhengqiao Zhao, Jian Chen, and Gail L Rosen. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses. PLoS Comput Biol, 15(2):e1006721, 02 2019.

SLIDE 87

References

Prologue 46/47

  • John M Giorgi and Gary D Bader. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics, 34(23):4087–4094, Dec 2018.
  • Tongxin Wang, Travis S Johnson, Wei Shao, Zixiao Lu, Bryan R Helm, Jie Zhang, and Kun Huang. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol, 20(1):165, 08 2019.
  • Sheng Wang, Zhen Li, Yizhou Yu, and Jinbo Xu. Folding membrane proteins by deep transfer learning. Cell Syst, 5(3):202–211.e3, 09 2017.
  • Žiga Avsec, Roman Kreuzhuber, Johnny Israeli, Nancy Xu, Jun Cheng, Avanti Shrikumar, Abhimanyu Banerjee, Daniel S Kim, Thorsten Beier, Lara Urban, Anshul Kundaje, Oliver Stegle, and Julien Gagneur. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat Biotechnol, 37(6):592–600, Jun 2019.

SLIDE 88

Prologue 47/47

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca School of Electrical Engineering and Computer Science (EECS) University of Ottawa