

slide-1
SLIDE 1
CSI5180. Machine Learning for Bioinformatics Applications

Ensemble Learning

by

Marcel Turcotte

Version December 5, 2019

slide-2
SLIDE 2

Preamble 2/50

Preamble

slide-3
SLIDE 3

Preamble

Preamble 3/50

Ensemble Learning

In this lecture, we consider several meta-learning algorithms, all based on the principle that the combined opinion of a large group of individuals is often more accurate than that of a single expert, often referred to as the wisdom of the crowd. Today, we distinguish between the following meta-algorithms: bagging, pasting, random patches, random subspaces, boosting, and stacking.

General objective:

Compare the specific features of various ensemble learning meta-algorithms

slide-4
SLIDE 4

Learning objectives

Preamble 4/50

  • Discuss the intuition behind bagging and pasting methods
  • Explain the difference between random patches and random subspaces
  • Describe boosting methods
  • Contrast the stacking meta-algorithm with bagging

Reading:

Jaswinder Singh, Jack Hanson, Kuldip Paliwal, and Yaoqi Zhou. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nature Communications 10(1):5407, 2019.

slide-5
SLIDE 5

www.mims.ai

Preamble 5/50

bioinformatics.ca/job-postings

slide-6
SLIDE 6

Plan

Preamble 6/50

  • 1. Preamble
  • 2. Introduction
  • 3. Justification
  • 4. Meta-algorithms
  • 5. Prologue
slide-7
SLIDE 7

Introduction 7/50

Introduction

slide-12
SLIDE 12

Ensemble Learning - What is it?

Introduction 8/50

“Ensemble learning is a learning paradigm that, instead of trying to learn one super-accurate model, focuses on training a large number of low-accuracy models and then combining the predictions given by those weak models to obtain a high-accuracy meta-model.” [Burkov, 2019] §7.5

Weak learners (low-accuracy models) are simple and fast, both for training and prediction.

The general idea is that each learner has a vote, and these votes are combined to establish the final decision.

Decision trees are the most commonly used weak learners.

Ensemble learning is in fact an umbrella term for a large family of meta-algorithms, including bagging, pasting, random patches, random subspaces, boosting, and stacking.

slide-13
SLIDE 13

Justification 9/50

Justification

slide-17
SLIDE 17

Weak learners/high accuracy

Justification 10/50

10 experiments

Each experiment consists of tossing a loaded coin 10,000 times

51% heads, 49% tails

As the number of tosses increases, the proportion of heads approaches 51% (the law of large numbers).

See: [Géron, 2019] §7

slide-18
SLIDE 18

Source code

Justification 11/50

import numpy as np
import matplotlib.pyplot as plt

# 10 series (experiments) of 10,000 tosses of a coin biased 51% towards heads
tosses = (np.random.rand(10000, 10) < 0.51).astype(np.int8)
# running proportion of heads after each toss, for each of the 10 series
cumsum = np.cumsum(tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)

with plt.xkcd():
    plt.figure(figsize=(8, 3.5))
    plt.plot(cumsum)
    plt.plot([0, 10000], [0.51, 0.51], "k--", linewidth=2, label="51%")
    plt.plot([0, 10000], [0.5, 0.5], "k-", label="50%")
    plt.xlabel("Number of coin tosses")
    plt.ylabel("Heads ratio")
    plt.legend(loc="lower right")
    plt.axis([0, 10000, 0.42, 0.58])
    plt.tight_layout()
    plt.savefig("weak_learner.pdf", format="pdf", dpi=264)

See: [Géron, 2019] §7

slide-19
SLIDE 19

Weak learners/high accuracy

Justification 12/50

Adapted from [Géron, 2019] §7
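Not on the slide: the same point can be checked analytically. A minimal sketch, assuming 1,000 independent tosses of the 51% coin and that SciPy is available; the majority-vote framing mirrors the intuition that many weak 51%-accurate voters form a strong ensemble.

from scipy.stats import binom

# probability that heads wins a majority of 1,000 independent 51%-biased tosses
p_majority = 1 - binom.cdf(500, 1000, 0.51)
print(p_majority)   # roughly 0.73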

slide-24
SLIDE 24

Independent learners

Justification 13/50

Clearly, if the learners are trained on the same input, they are not independent. Ensemble learning works best when the learners are as independent from one another as possible. Diversity can be achieved in several ways:

  • Different algorithms
  • Different sets of features
  • Different data sets

slide-25
SLIDE 25

Data set - moons

Justification 14/50

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.15)

with plt.xkcd():
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], "bs")
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "g^")
    plt.axis([-1.5, 2.5, -1, 1.5])
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
    plt.tight_layout()
    plt.savefig("make_moons.pdf", format="pdf", dpi=264)

Adapted from: [Géron, 2019] §5
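The voting and bagging snippets that follow assume the moons data has already been split into training and test sets; the split itself is not shown in the deck. A minimal sketch (the test_size and random_state values are assumptions):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)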

slide-26
SLIDE 26

Data set - moons

Justification 15/50

Adapted from [Géron, 2019] §5

slide-27
SLIDE 27

Source code - VotingClassifier - hard

Justification 16/50

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)]

voting_clf = VotingClassifier(estimators=estimators, voting='hard')
voting_clf.fit(X_train, y_train)

Source: [Géron, 2019] §7

slide-30
SLIDE 30

Source code - accuracy

Justification 17/50

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.888
VotingClassifier 0.904

[Géron, 2019] §7

slide-31
SLIDE 31

Source code - VotingClassifier - soft

Justification 18/50

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)]

voting_clf = VotingClassifier(estimators=estimators, voting='soft')
voting_clf.fit(X_train, y_train)

Source: [Géron, 2019] §7

slide-34
SLIDE 34

Source code - accuracy

Justification 19/50

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92

Soft voting uses the average predicted class probability, rather than a majority of hard votes.

[Géron, 2019] §7
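For intuition, soft voting amounts to averaging the classifiers' predicted probabilities and taking the argmax. A minimal sketch, not from the deck; it reuses the fitted classifiers above, and SVC needs probability=True as on the previous slide.

import numpy as np

probas = np.mean([clf.predict_proba(X_test)
                  for clf in (log_clf, rnd_clf, svm_clf)], axis=0)
y_pred_soft = np.argmax(probas, axis=1)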

slide-35
SLIDE 35

Meta-algorithms 20/50

Meta-algorithms

slide-41
SLIDE 41

Bagging and pasting

Meta-algorithms 21/50

Ensemble learning works best when the learners are independent. One way to achieve this is to train the learners on (slightly) different data sets.

  • Bagging: sampling with replacement (bootstrap aggregating);
  • Pasting: sampling without replacement.

As an added bonus, the learners can be trained in parallel! The literature suggests that bagging generally outperforms pasting [Géron, 2019].

slide-42
SLIDE 42

sklearn.ensemble.BaggingClassifier

Meta-algorithms 22/50

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=8)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

Soft voting is used by default when the base classifier can estimate class probabilities; bootstrap=False implies pasting.

Adapted from: [Géron, 2019] §7

slide-43
SLIDE 43

Not just for classification

Meta-algorithms 23/50

Bagging and pasting apply to regression tasks as well.

BaggingRegressor in scikit-learn; voting is replaced by averaging the predictions.
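A minimal sketch of the regression counterpart (not from the deck; X_train, y_train, X_test here stand for a regression dataset, an assumption):

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bag_reg = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=500,
    bootstrap=True, n_jobs=-1)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)   # predictions are averaged over the 500 trees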

slide-46
SLIDE 46

Claim

Meta-algorithms 24/50

Claim:

On average 37 % of the training examples are not used when bagging!

By default, bagging samples N examples with replacement, where N is the size of the training set.
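The 37% figure follows from a short calculation not shown on the slide: the probability that a given example is never drawn in N draws with replacement is

\left(1 - \frac{1}{N}\right)^{N} \xrightarrow[N \to \infty]{} e^{-1} \approx 0.368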

slide-47
SLIDE 47

Empirical evidence

Meta-algorithms 25/50

from random import random

def do_sample_with_replacement():
    # mark 100 examples as "unused" (1), then draw 100 samples with replacement
    xs = [1 for i in range(100)]
    for sample in range(100):
        index = int(100 * random())
        xs[index] = 0          # this example was drawn at least once
    print(sum(xs))             # number of examples never drawn

for run in range(10):
    do_sample_with_replacement()

slide-48
SLIDE 48

Empirical evidence

Meta-algorithms 26/50

38 33 34 37 37 37 44 37 35 37

slide-55
SLIDE 55

Out-of-bag evaluation (oob)

Meta-algorithms 27/50

By default, bagging samples N examples with replacement, where N is the size of the training set. This means that, on average, 37% of the examples are not used by each learner. These unseen, out-of-bag (OOB), examples can be used for validation! OOB evaluation (possibly) eliminates the need for a separate validation set.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)

0.90133333333333332

slide-59
SLIDE 59

Random patches and subspaces

Meta-algorithms 28/50

BaggingClassifier also supports sampling features.

This is controlled by the parameters bootstrap_features and max_features.

Random patches: sampling both instances and features.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, max_samples=1.0,
    bootstrap_features=True, max_features=0.4,
    n_jobs=-1, oob_score=True)

Random subspaces: only sampling features.
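For contrast, a minimal random-subspaces configuration (a sketch; the hyperparameter values are assumptions): keep all training instances and sample only the features.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf_rs = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,           # all instances, no resampling
    bootstrap_features=True, max_features=0.4,  # sample the features
    n_jobs=-1)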

slide-60
SLIDE 60

Random Forest

Meta-algorithms 29/50

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True)

This bagging ensemble of randomized trees is roughly equivalent to the RandomForestClassifier shown on the next slide [Géron, 2019].

slide-61
SLIDE 61

sklearn.ensemble.RandomForestClassifier

Meta-algorithms 30/50

“The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node (. . . ), it searches for the best feature among a random subset of features.” [Géron, 2019]

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16)
rfc.fit(X_train, y_train)
y_pred_rf = rfc.predict(X_test)

See also ExtraTreesClassifier and ExtraTreesRegressor.
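The “see also” in code form, a minimal sketch (the hyperparameters simply mirror the forest above and are assumptions):

from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
etc.fit(X_train, y_train)
y_pred_et = etc.predict(X_test)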

slide-62
SLIDE 62

Boosting

Meta-algorithms 31/50

Boosting meta-algorithms train learners sequentially, in such a way that each classifier tries to correct the mistakes of the previous classifier in the chain.

slide-66
SLIDE 66

AdaBoost

Meta-algorithms 32/50

AdaBoost stands for Adaptive Boosting. Each learner focuses on the examples that were incorrectly classified by the previous classifier.

Specifically, the weight of incorrectly classified examples is increased with each iteration. Initially, the weight of each example is w_i = \frac{1}{N}, where N is the number of examples.

slide-68
SLIDE 68

AdaBoost - error rate

Meta-algorithms 33/50

Let’s define an indicator function:

I(\hat{y}_i^{(j)}, y_i) = \begin{cases} 0 & \text{if } \hat{y}_i^{(j)} = y_i \\ 1 & \text{if } \hat{y}_i^{(j)} \neq y_i \end{cases}

where \hat{y}_i^{(j)} is the prediction of the j-th learner on example i and y_i is the label of example i.

The error rate of the j-th learner is defined as:

r_j = \frac{\sum_{i=1}^{N} w_i \, I(\hat{y}_i^{(j)}, y_i)}{\sum_{i=1}^{N} w_i}

slide-69
SLIDE 69

AdaBoost - learner’s weight

Meta-algorithms 34/50

When making a final decision (vote), each learner has a weight.

The weight of learner j is:

\alpha_j = \eta \log \frac{1 - r_j}{r_j}

where \eta is the learning rate (default value 1).

A low error rate implies a high learner's weight. Random guessing (error rate = 0.5) implies a weight of 0. An error rate > 0.5 implies a negative weight.
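A quick numeric check of these three cases, with \eta = 1 (not from the deck):

import numpy as np

for r in (0.25, 0.5, 0.75):
    print(r, np.log((1 - r) / r))   # approx +1.10, 0.0, -1.10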

slide-71
SLIDE 71

AdaBoost - update

Meta-algorithms 35/50

After training learner j, the weight of each example is updated as follows:

w_i \leftarrow \begin{cases} w_i & \text{if } \hat{y}_i^{(j)} = y_i \\ w_i \, e^{\alpha_j} & \text{if } \hat{y}_i^{(j)} \neq y_i \end{cases}

The weights are then normalized, dividing each by \sum_{i=1}^{N} w_i.

slide-72
SLIDE 72

AdaBoost - prediction

Meta-algorithms 36/50

The outcome is the class with the largest weighted vote:

\hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j = 1 \\ \hat{y}^{(j)}(x) = k}}^{m} \alpha_j

where m is the number of learners.
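The update rules above fit in a few lines of NumPy. This is an illustrative from-scratch sketch, not the deck's code and not scikit-learn's implementation (the decision stumps, the helper names, and the assumption that 0 < r_j < 1 are mine); the library version appears on the next slide.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, m=50, eta=1.0):
    N = len(X)
    w = np.full(N, 1.0 / N)                       # initial weights w_i = 1/N
    learners, alphas = [], []
    for j in range(m):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        incorrect = stump.predict(X) != y         # indicator I(y_hat, y)
        r_j = np.sum(w * incorrect) / np.sum(w)   # weighted error rate, assumed in (0, 1)
        alpha_j = eta * np.log((1 - r_j) / r_j)   # learner's weight
        w = np.where(incorrect, w * np.exp(alpha_j), w)  # boost misclassified examples
        w = w / w.sum()                           # normalize
        learners.append(stump)
        alphas.append(alpha_j)
    return learners, alphas

def adaboost_predict(X, learners, alphas, classes):
    # weighted vote: class with the largest sum of alpha_j over the learners voting for it
    votes = np.zeros((len(X), len(classes)))
    for stump, alpha in zip(learners, alphas):
        pred = stump.predict(X)
        for k, c in enumerate(classes):
            votes[:, k] += alpha * (pred == c)
    return np.asarray(classes)[np.argmax(votes, axis=1)]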

slide-73
SLIDE 73

sklearn.ensemble.AdaBoostClassifier

Meta-algorithms 37/50

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

[Géron, 2019] §7

slide-74
SLIDE 74

AdaBoost

Meta-algorithms 38/50

A literature search using Scopus for “AdaBoost” and “bioinformatics” returns 78 references, including the following two papers:

  • Y. Qu, B.-L. Adam, Y. Yasui, M.D. Ward, L.H. Cazares, P.F. Schellhammer, Z. Feng, O.J. Semmes, and G.L. Wright Jr., Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients, Clinical Chemistry 48 (2002), no. 10, 1835–1843, cited by 382.
  • P.M. Long and V.B. Vega, Boosting and microarray data, Machine Learning 52 (2003), no. 1-2, 31–44, cited by 40.

slide-75
SLIDE 75

AdaBoost

Meta-algorithms 39/50

https://youtu.be/GM3CDQfQ4sw

slide-76
SLIDE 76

Stacking

Meta-algorithms 40/50

Source [Géron, 2019] Figure 7.12

slide-79
SLIDE 79

Stacking

Meta-algorithms 41/50

Like bagging, stacking combines the predictions of several learners. Unlike bagging, stacking does not use a predetermined function, such as majority vote, to combine the predictions; instead, it trains a classifier/regressor (the blender) to do so. A holdout set is used to train the blender.
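For reference, recent versions of scikit-learn (0.22 and later; version availability is an assumption worth checking) ship a built-in StackingClassifier. A minimal sketch; note that it trains the blender on cross-validated (out-of-fold) predictions rather than a single holdout set.

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack_clf = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svc', SVC())],
    final_estimator=LogisticRegression(),   # the blender
    cv=5)                                   # out-of-fold predictions train the blender
stack_clf.fit(X_train, y_train)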

slide-80
SLIDE 80

Prologue 42/50

Prologue

slide-85
SLIDE 85

Summary

Prologue 43/50

Ensemble learning is the idea of combining the predictions of several weak learners.

Ensemble learning works best when the learners are as independent from one another as possible.

This diversity of learners can be achieved in various ways: different algorithms, different sets of features, or (slightly) different data sets.

Boosting combines the learners in a sequential, rather than parallel, manner. Each learner fixes the mistakes of its predecessor.

With stacking, a learning algorithm is used to combine the results of the weak classifiers.

slide-86
SLIDE 86

Next module

Prologue 44/50

Null

slide-87
SLIDE 87

References

Prologue 45/50

Burkov, A. (2019). The Hundred-Page Machine Learning Book. Andriy Burkov.

Cao, Z., Pan, X., Yang, Y., Huang, Y., and Shen, H.-B. (2018). The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics, 34(13):2185–2194.

Chen, X., Zhu, C.-C., and Yin, J. (2019). Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol, 15(7):e1007209.

Colomé-Tatché, M. and Theis, F. J. (2018). Statistical single cell multi-omics integration. Current Opinion in Systems Biology, 7:54–59.

Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition.

slide-88
SLIDE 88

References

Prologue 46/50

Ma, Y., Liu, Y., and Cheng, J. (2018). Protein secondary structure prediction based on data partition and semi-random subspace method. Sci Rep, 8(1):9856.

Meher, P. K., Sahu, T. K., Gahoi, S., Satpathy, S., and Rao, A. R. (2019). Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene, 705:113–126.

Peng, H., Zheng, Y., Zhao, Z., Liu, T., and Li, J. (2018). Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions. Bioinformatics, 34(17):i757–i765.

Singh, A. P., Mishra, S., and Jabin, S. (2018a). Sequence based prediction of enhancer regions from DNA random walk. Sci Rep, 8(1):15912.

slide-89
SLIDE 89

References

Prologue 47/50

Singh, J., Hanson, J., Heffernan, R., Paliwal, K., Yang, Y., and Zhou, Y. (2018b). Detecting proline and non-proline cis isomers in protein structures from sequences using deep residual ensemble learning. J Chem Inf Model, 58(9):2033–2042.

Singh, J., Hanson, J., Paliwal, K., and Zhou, Y. (2019). RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nature Communications, 10(1):5407.

Su, W., Gu, X., and Peterson, T. (2019). TIR-Learner, a new ensemble method for TIR transposable element annotation, provides evidence for abundant new transposable elements in the maize genome. Mol Plant, 12(3):447–460.

Wang, X., Yu, B., Ma, A., Chen, C., Liu, B., and Ma, Q. (2018). Protein–protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics, 35(14):2395–2402.

slide-90
SLIDE 90

References

Prologue 48/50

Yu, J., Shi, S., Zhang, F., Chen, G., and Cao, M. (2019). PredGly: predicting lysine glycation sites for Homo sapiens based on XGBoost feature optimization. Bioinformatics, 35(16):2749–2756.

Zeng, X., Zhong, Y., Lin, W., and Zou, Q. (2019). Predicting disease-associated circular RNAs using deep forests combined with positive-unlabeled learning methods. Brief Bioinform.

Zhang, L., Yu, G., Xia, D., and Wang, J. (2019). Protein-protein interactions prediction based on ensemble deep neural networks. Neurocomputing, 324:10–19.

Zhang, X., Wang, J., Li, J., Chen, W., and Liu, C. (2018). CRlncRC: a machine learning-based method for cancer-related long noncoding RNA identification using integrated features. BMC Med Genomics, 11(Suppl 6):120.

slide-91
SLIDE 91

References

Prologue 49/50

Zheng, R., Li, M., Chen, X., Wu, F.-X., Pan, Y., and Wang, J. (2019). BiXGBoost: a scalable, flexible boosting-based method for reconstructing gene regulatory networks. Bioinformatics, 35(11):1893–1900.

slide-92
SLIDE 92

Prologue 50/50

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca School of Electrical Engineering and Computer Science (EECS) University of Ottawa