
slide-1
SLIDE 1
CSI5180. Machine Learning for Bioinformatics Applications

Deep learning — practical issues

by

Marcel Turcotte

Version November 19, 2019

slide-3
SLIDE 3

Preamble 2/31


Preamble

slide-4
SLIDE 4

Preamble

Preamble 3/31

Deep learning — practical issues

In this last lecture on deep learning, we consider practical issues that arise when using existing tools and libraries.

General objective:

Discuss the pitfalls, limitations, and practical considerations when using deep learning algorithms.

slide-5
SLIDE 5

Learning objectives

Preamble 4/31

  • Discuss the pitfalls, limitations, and practical considerations when using deep learning algorithms.
  • Explain what a dropout layer is.
  • Discuss further mechanisms to regularize deep networks.

Reading:

Christof Angermueller, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. Deep learning for computational biology. Mol Syst Biol, 12(7):878, 2016.

slide-6
SLIDE 6

Plan

Preamble 5/31

  • 1. Preamble
  • 2. As mentioned previously
  • 3. Regularization
  • 4. Hyperparameters
  • 5. Keras
  • 6. Further considerations
  • 7. Prologue
slide-7
SLIDE 7

As mentioned previously 6/31

As mentioned previously

slide-8
SLIDE 8

Overview

As mentioned previously 7/31

Source: [1] Box 1

slide-12
SLIDE 12

Summary

As mentioned previously 8/31

In a dense layer, all the neurons are connected to all the neurons of the previous layer.

The number of parameters grows very quickly with each additional dense layer, making it impractical to create deep networks this way.

Local connectivity. In a convolutional layer, each neuron is connected to a small number of neurons from the previous layer. This small rectangular region is called the receptive field.

Parameter sharing. All the neurons in a given feature map of a convolutional layer share the same kernel (filter).
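To make the parameter-count argument concrete, here is a minimal sketch (not from the slides) comparing a single Dense layer with a single Conv1D layer in Keras, assuming TensorFlow 2 and a one-hot encoded sequence of length 1,000 with 4 channels; the layer sizes are illustrative only.

from tensorflow import keras
from tensorflow.keras.layers import Dense, Conv1D, Flatten

# Fully connected: every unit is connected to every input position and channel.
dense_model = keras.models.Sequential([
    keras.layers.Input(shape=(1000, 4)),
    Flatten(),
    Dense(64, activation="relu"),       # 4000 * 64 + 64 = 256,064 parameters
])
dense_model.summary()

# Convolutional: each unit sees only a window of width 26, and the weights are shared.
conv_model = keras.models.Sequential([
    keras.layers.Input(shape=(1000, 4)),
    Conv1D(64, 26, activation="relu"),  # 64 * (26 * 4 + 1) = 6,720 parameters
])
conv_model.summary()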

slide-13
SLIDE 13

Convolutional layer (Conv1D)

As mentioned previously 9/31

Source: [1] Figure 2B

slide-17
SLIDE 17

Convolutional layer

As mentioned previously 10/31

Contrary to Dense layers, Conv1D layers preserve the identity of the monomers (nucleotides or amino acids), which are seen as channels.

Convolutional neural networks are able to detect patterns irrespective of their location in the input.

Pooling makes the network less sensitive to small translations.

In bioinformatics, CNNs are ideally suited to detect local (sequence) motifs, independent of their position within the input (sequence). They are also the most prevalent architecture.
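As an illustration of these points, the following is a minimal Conv1D sketch (not from the slides), assuming TensorFlow 2 and one-hot encoded DNA sequences of length 200; the filter sizes and the binary output are hypothetical.

from tensorflow import keras
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

# The 4 nucleotides are the input channels; each Conv1D filter acts as a motif
# detector that can fire at any position, and global max pooling keeps only the
# strongest match per filter, making the prediction position independent.
model = keras.models.Sequential([
    keras.layers.Input(shape=(200, 4)),
    Conv1D(32, 12, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),     # e.g. motif present / absent
])
model.compile(loss="binary_crossentropy", optimizer="adam")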

slide-19
SLIDE 19

Summary

As mentioned previously 11/31

Recurrent networks (RNN) and Long Short-Term Memory (LSTM) can process input sequences of varying length.

The literature suggests that RNNs are more difficult to train than other architectures.
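A minimal sketch (not from the slides) of how variable-length sequences are commonly handled in Keras, assuming TensorFlow 2: sequences are zero-padded to a common length and a Masking layer tells the LSTM to skip the padded time steps. The layer sizes are illustrative.

from tensorflow import keras
from tensorflow.keras.layers import Masking, LSTM, Dense

model = keras.models.Sequential([
    keras.layers.Input(shape=(None, 4)),  # variable-length one-hot sequences
    Masking(mask_value=0.0),              # ignore zero-padded time steps
    LSTM(32),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")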

slide-20
SLIDE 20

Regularization 12/31

Regularization

slide-27
SLIDE 27

Dropout

Regularization 13/31

Hinton and colleagues describe dropout layers as “preventing co-adaptation” of units. During training, each input unit in a dropout layer has probability p of being ignored (set to 0).

According to [3] §11:

20–30% is a typical value of p for convolutional networks, whereas 40–50% is a typical value of p for recurrent networks.

Dropout layers can make the network converge more slowly; however, the resulting network is expected to make fewer generalization errors. https://keras.io/layers/core/

model = keras.models.Sequential([..., Dropout(0.5), ...])
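A slightly fuller sketch of the same idea (assumed, not from the slides): a small fully connected classifier with a Dropout layer after each hidden layer, using TensorFlow 2's Keras. Dropout is active during training only; it is disabled automatically at inference time.

from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout

model = keras.models.Sequential([
    keras.layers.Input(shape=(100,)),   # hypothetical input size
    Dense(128, activation="relu"),
    Dropout(0.5),                       # each input unit is dropped with probability 0.5
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])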

slide-28
SLIDE 28

Dropout

Regularization 14/31

Source: [1] Figure 5F

slide-29
SLIDE 29

Regularizers

Regularization 15/31

Regularizers apply penalties on layer parameters during optimization. https://keras.io/regularizers/

# other import directives are here
from keras import regularizers

model = Sequential()
model.add(Dense(32, input_shape=(16,)))
model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01)))

Available penalties

keras.regularizers.l1(0.)
keras.regularizers.l2(0.)
keras.regularizers.l1_l2(l1=0.01, l2=0.01)

slide-30
SLIDE 30

Early stopping

Regularization 16/31

Source: [1] Figure 5E
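Early stopping is typically implemented in Keras with a callback. A minimal sketch (not from the slides), assuming TensorFlow 2, an already compiled model, and hypothetical arrays X_train, y_train, X_valid, y_valid:

from tensorflow import keras

# Stop when the validation loss has not improved for 10 epochs
# and restore the best weights seen so far.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_valid, y_valid),
                    epochs=1000,                # large; early stopping decides when to stop
                    callbacks=[early_stopping])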

slide-31
SLIDE 31

Hyperparameters 17/31

Hyperparameters

slide-37
SLIDE 37

Optimizers

Hyperparameters 18/31

An optimizer should be fast and should ideally guide the solution towards a “good” local optimum (or, better, a global optimum).

Momentum

Momentum methods keep track of previous gradients and use this information to update the weights:

m ← βm − η∇θJ(θ)
θ ← θ + m

Momentum methods can escape plateaus more effectively. Variants include Nesterov Accelerated Gradient, AdaGrad, RMSProp, Adam, and Nadam. Adam is a good default choice.
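A minimal sketch (not from the slides) of how these optimizers are selected in Keras, assuming TensorFlow 2 and an already defined model; the learning rates are illustrative defaults.

from tensorflow import keras

# SGD with Nesterov momentum; the momentum argument corresponds to β in the update rule above.
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# Adam is usually a safe default choice.
# optimizer = keras.optimizers.Adam(learning_rate=0.001)

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])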

slide-38
SLIDE 38

Loss function

Hyperparameters 19/31

Regression

mean_squared_error (MSE) or mean_absolute_error (MAE)

Classification

Binary classification: binary_crossentropy
Multiclass classification: categorical_crossentropy

https://keras.io/losses/

from keras import losses

model.compile(loss=losses.mean_squared_error, optimizer='sgd')
slide-39
SLIDE 39

Output layer activation

Hyperparameters 20/31

Regression [3] Table 10.1:

ReLU/softplus (if positive outputs)
logistic/tanh (if bounded outputs)

Classification

Binary classification: logistic
Multiclass classification: softmax

https://keras.io/activations/

model = keras.models.Sequential([..., Dense(64, activation="relu"), ...])
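To tie the output activation to the loss function of the previous slide, here is a minimal sketch (not from the slides), assuming TensorFlow 2 and a hypothetical 100-dimensional input:

from tensorflow import keras
from tensorflow.keras.layers import Dense

# Binary classification: a single logistic (sigmoid) output unit + binary cross-entropy.
binary_model = keras.models.Sequential([
    keras.layers.Input(shape=(100,)),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])
binary_model.compile(loss="binary_crossentropy", optimizer="adam")

# Multiclass classification: one softmax unit per class + categorical cross-entropy.
multiclass_model = keras.models.Sequential([
    keras.layers.Input(shape=(100,)),
    Dense(64, activation="relu"),
    Dense(10, activation="softmax"),
])
multiclass_model.compile(loss="categorical_crossentropy", optimizer="adam")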

slide-40
SLIDE 40

Hyperparameters

Hyperparameters 21/31

Source: [1] Table 2

slide-41
SLIDE 41

Keras 22/31

Keras

slide-42
SLIDE 42

Keras

Keras 23/31

model = keras.models.Sequential([
    Conv2D(64, 7, ..., input_shape=[28, 28, 1]),
    MaxPooling2D(2),
    Conv2D(128, 3, activation="relu", padding="same"),
    Conv2D(128, 3, activation="relu", padding="same"),
    MaxPooling2D(2),
    Conv2D(256, 3, activation="relu", padding="same"),
    Conv2D(256, 3, activation="relu", padding="same"),
    MaxPooling2D(2),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax")
])

Source: [3] §14
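Compiling and training this model might look as follows; this is a sketch, not from the slides, and assumes X_train and y_train are hypothetical arrays of 28×28×1 grayscale images with integer class labels (e.g. Fashion-MNIST, as used in [3] §14).

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.1)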

slide-43
SLIDE 43

Further considerations 24/31

Further considerations

slide-50
SLIDE 50

Further considerations

Further considerations 25/31

We obviously barely scratched the surface of deep learning. Here are some important concepts that we did not consider:

  • The vanishing and exploding gradient problems; see BatchNormalization (a sketch follows this list).
  • Weight initialization.
  • Data augmentation.
  • Understanding what the network has learnt:
    Shrikumar, A., Greenside, P., and Kundaje, A. Learning Important Features Through Propagating Activation Differences. arXiv.org cs.CV, 2017. [DeepLIFT]
  • Attention layers.
  • Multi-task learning (not multi-class, not multi-label).
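The sketch referred to in the first item above (assumed, not from the slides): inserting BatchNormalization layers in a Keras model, which standardizes each layer's inputs over the mini-batch and helps mitigate vanishing/exploding gradients. Layer sizes are illustrative.

from tensorflow import keras
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = keras.models.Sequential([
    keras.layers.Input(shape=(100,)),
    Dense(128, use_bias=False),     # bias is redundant when followed by BatchNormalization
    BatchNormalization(),
    Activation("relu"),
    Dense(64, use_bias=False),
    BatchNormalization(),
    Activation("relu"),
    Dense(1, activation="sigmoid"),
])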

slide-51
SLIDE 51

Of deep neural networks

Further considerations 26/31

They see the world as a hierarchy of concepts, effectively bypassing the need to create features by hand (feature engineering).

“Deep neural networks can help circumventing the manual extraction of features by learning them from data.” [1]

Transfer learning is possibly unique to deep learning; there are hundreds of papers in bioinformatics alone.
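A minimal transfer-learning sketch (not from the slides, and an image example rather than a bioinformatics one), assuming TensorFlow 2: reuse a network pre-trained on a large dataset, freeze its weights, and train only a small task-specific head.

from tensorflow import keras
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False              # freeze the pre-trained layers

model = keras.models.Sequential([
    base,
    GlobalAveragePooling2D(),
    Dense(1, activation="sigmoid"), # new binary classification head
])
model.compile(loss="binary_crossentropy", optimizer="adam")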

slide-52
SLIDE 52

Prologue 27/31

Prologue

slide-57
SLIDE 57

Summary

Prologue 28/31

  • Deep networks consisting only of dense layers become computationally intractable, as the number of parameters grows very quickly with each additional layer.
  • Convolutional layers considerably reduce the number of parameters, since each unit is connected to a limited number of neurons from the previous layer, its receptive field.
  • CNNs are able to detect patterns in a position-independent manner.
  • RNNs and LSTMs handle sequence information, where the input sequences can be of different lengths. They can detect patterns along the sequence.
  • Dropout layers are an effective regularization mechanism.

slide-58
SLIDE 58

Next module

Prologue 29/31

Concept- and rule-based

slide-59
SLIDE 59

References

Prologue 30/31

[1] Christof Angermueller, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. Deep learning for computational biology. Mol Syst Biol, 12(7):878, 2016.
[2] François Chollet. Deep Learning with Python. Manning Publications, 2017.
[3] Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, 2nd edition, 2019.
[4] Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.

slide-60
SLIDE 60

Prologue 31/31

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca School of Electrical Engineering and Computer Science (EECS) University of Ottawa