
An Introduction - PowerPoint PPT Presentation

An Introduction With Special Emphasis On Deep Learning. Dr. Ulrich Bodenhofer, Associate Professor, Institute of Bioinformatics, Johannes Kepler University Linz


  1. MISCELLANEOUS TOPICS. Reinforcement learning: learning by feedback from the environment in an online process. Feature extraction: computation of features from data prior to machine learning (e.g. signal and image processing). Feature selection: selection of those features that are relevant/sufficient to solve a given learning task. Feature construction: construction of new features as part of the learning process.

  2. TERMINOLOGY. Model: the specific relationship/representation we are aiming at. Model class: the class of models in which we search for the model. Parameters: representations of concrete models inside the given model class. Model selection/training: the process of finding the model from the model class that fits/explains the observed data best. Hyperparameters: parameters controlling the model complexity or the training procedure.

  3. BASIC DATA ANALYSIS WORKFLOW. Workflow diagram: Question/Task + Data → Preprocessing → Choose Features → Choose Model Class → Train Model → Evaluate Model → Final Model + Answer, with Prior Knowledge feeding into the process.

  4. BASIC DATA ANALYSIS WORKFLOW. The same workflow diagram, repeated as an animation build, now with a FAIL outcome marked after Evaluate Model.

  5. BASIC INGREDIENTS OF MODEL SELECTION/TRAINING. For both supervised and unsupervised machine learning, we need the following basic ingredients: Model class: the class of models in which we search for the model. Objective: criterion/measure that determines what is a good model. Optimization algorithm: method that tries to find model parameters such that the objective is optimized. The right choices of the above components depend on the characteristics of the given task.

  6. SOME WORDS OF ENTHUSIASM. Machine learning methods are able to solve some tasks for which explicit models will never exist. Machine learning methods have become standard tools in a variety of disciplines (e.g. signal and image processing, bioinformatics).

  7. BUT ... SOME WORDS OF CAUTION. Machine learning is not a universal remedy. The quality of machine learning models depends on the quality and quantity of data. What cannot be measured/observed can never be identified by machine learning. Machine learning complements explicit/deductive models instead of replacing them. Machine learning is often applied in a naive way.

  8. OVERVIEW OF SUPERVISED ML

  9. SUPERVISED MACHINE LEARNING. Goal of supervised machine learning: to identify the relationship between inputs and targets/labels.

  10. SUPERVISED MACHINE LEARNING. Goal of supervised machine learning: to identify the relationship between inputs and targets/labels. The slide shows a table of two-dimensional example inputs with class labels ±1, e.g. (0.843475, 0.709216, −1), (0.408987, 0.47037, +1), (0.734759, 0.645298, −1), ...

  11. SUPERVISED MACHINE LEARNING. The same table of labeled inputs, now shown next to a scatter plot of the two input features on [0, 1] × [0, 1] with the two classes marked.

  12. EXAMPLE: PREDICTING TUMOR TYPES FROM GENE EXPRESSION. Table of gene expression values (Gene 1 to Gene 6, ...) per sample, labeled with tumor type A or B: A: 8.83, 15.25, 12.59, 12.91, 13.21, 16.59; A: 9.41, 13.37, 11.95, 15.09, 13.39, 9.94; A: 8.75, 14.41, 12.11, 15.63, 13.69, 7.83; ...; A: 8.92, 13.85, 12.23, 11.61, 13.03, 10.77; B: 8.65, 12.93, 11.58, 9.47, 9.81, 14.79; B: 8.43, 16.13, 10.88, 10.97, 9.72, 12.51; B: 9.62, 15.31, 12.03, 10.83, 10.47, 14.33; ...; B: 8.64, 10.54, 12.59, 9.42, 10.29, 14.65.

  13. EXAMPLE: PREDICTING TUMOR TYPES FROM GENE EXPRESSION. The same gene expression table, now with the question: Can we infer tumor types from gene expression values?

  14. EXAMPLE: PREDICTING TUMOR TYPES FROM GENE EXPRESSION. The same table, with a second question added: Which genes are most indicative?

  15. EXAMPLE: PREDICTING TUMOR TYPES FROM GENE EXPRESSION. Same content as slide 14 (animation build step).

  16. HOW TO ASSESS GENERALIZATION PERFORMANCE? The quality of a model can only be judged on the basis of its performance on future data. So assume that future data are generated according to some joint distribution of inputs and targets, the joint density of which we denote as p(x, y). The generalization error (or risk) is the expected error on future data for a given model.
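
The slide does not spell the formula out; a common way to write the risk, assuming some loss function L (e.g. the 0/1 loss for classification), is:

```latex
% Generalization error (risk) of a model g, assuming a loss L(g(x), y)
% and a joint density p(x, y) of inputs and targets:
R(g) = \mathbb{E}_{(\mathbf{x}, y) \sim p}\bigl[L(g(\mathbf{x}), y)\bigr]
     = \int L(g(\mathbf{x}), y)\, p(\mathbf{x}, y)\, \mathrm{d}\mathbf{x}\, \mathrm{d}y
```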

  17. ESTIMATING THE GENERALIZATION ERROR. Since we typically do not know the distribution p(x, y), we have to estimate the generalization performance by making use of already existing data. Two methods are common: Test set/holdout method: the data set is split randomly into a training set and a test set; a predictor is trained on the former and evaluated on the latter. Cross validation: the data set is split randomly into a certain number k of equally sized folds; k predictors are trained, each leaving out one fold as test set; the average performance on the k test folds is computed.
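
As an illustration of the two estimation methods, here is a minimal Python sketch, assuming NumPy and scikit-learn are available; the k-nearest neighbor classifier is used only as a stand-in model, and the data set is randomly generated for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data set: 200 two-dimensional inputs with labels +1 / -1 (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

model = KNeighborsClassifier(n_neighbors=5)

# Test set / holdout method: random split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# k-fold cross validation (here k = 5): average accuracy over the held-out folds
scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```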

  18. FIVE-FOLD CROSS VALIDATION. Diagram: the data set is split into five folds; in round 1 the first fold is used for evaluation and the remaining folds for training, in round 2 the second fold, and so on up to round 5.

  19. CONFUSION MATRIX FOR BINARY CLASSIFICATION. For a given sample (x, y) and a classifier g(.), (x, y) is a true positive (TP) if y = +1 and g(x) = +1, a true negative (TN) if y = −1 and g(x) = −1, a false positive (FP) if y = −1 and g(x) = +1, and a false negative (FN) if y = +1 and g(x) = −1.

  20. CONFUSION MATRIX FOR BINARY CLASSIFICATION (cont'd). Given a data set, the confusion matrix is defined as follows: rows correspond to the actual value y (+1 or −1), columns to the predicted value g(x) (+1 or −1); the cell for actual +1 / predicted +1 contains #TP, actual +1 / predicted −1 contains #FN, actual −1 / predicted +1 contains #FP, and actual −1 / predicted −1 contains #TN. The entries #TP, #FP, #FN, and #TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively, for the given test data set.
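
To make the definitions concrete, here is a minimal sketch (not from the slides) that counts the four entries for label vectors coded as +1/−1:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (#TP, #FN, #FP, #TN) for labels coded as +1 / -1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))    # true positives
    fn = np.sum((y_true == 1) & (y_pred == -1))   # false negatives
    fp = np.sum((y_true == -1) & (y_pred == 1))   # false positives
    tn = np.sum((y_true == -1) & (y_pred == -1))  # true negatives
    return tp, fn, fp, tn

# Small example with 10 samples
y_true = [+1, +1, +1, -1, -1, -1, +1, -1, +1, -1]
y_pred = [+1, -1, +1, -1, +1, -1, +1, -1, -1, -1]
print(confusion_counts(y_true, y_pred))  # (3, 2, 1, 4)
```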

  21. EVALUATION MEASURES DERIVED FROM CONFUSION MATRICES. Accuracy: proportion of correctly classified examples, i.e. ACC = (#TP + #TN) / (#TP + #FN + #FP + #TN). Precision: proportion of predicted positive items that were correct, i.e. PREC = #TP / (#TP + #FP). True Positive Rate (aka recall/sensitivity): proportion of correctly identified positives, i.e. TPR = #TP / (#TP + #FN). True Negative Rate (aka specificity): proportion of correctly identified negatives, i.e. TNR = #TN / (#FP + #TN). False Positive Rate: proportion of negative examples that were incorrectly classified as positives, i.e. FPR = #FP / (#FP + #TN). False Negative Rate: proportion of positive examples that were incorrectly classified as negatives, i.e. FNR = #FN / (#TP + #FN).
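
A sketch of the six measures, building on the illustrative `confusion_counts` helper introduced above (division by zero is not handled):

```python
def evaluation_measures(tp, fn, fp, tn):
    """Compute the six measures from the slide; assumes all denominators are non-zero."""
    acc  = (tp + tn) / (tp + fn + fp + tn)   # accuracy
    prec = tp / (tp + fp)                    # precision
    tpr  = tp / (tp + fn)                    # true positive rate (recall/sensitivity)
    tnr  = tn / (fp + tn)                    # true negative rate (specificity)
    fpr  = fp / (fp + tn)                    # false positive rate
    fnr  = fn / (tp + fn)                    # false negative rate
    return {"ACC": acc, "PREC": prec, "TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr}

print(evaluation_measures(3, 2, 1, 4))
```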

  22. EVALUATION MEASURES FOR UNBALANCED DATA. Balanced Accuracy: mean of true positive and true negative rate, i.e. BACC = (TPR + TNR) / 2. Matthews Correlation Coefficient: measure of non-randomness of classification; defined as the normalized determinant of the confusion matrix, i.e. MCC = (#TP·#TN − #FP·#FN) / sqrt((#TP + #FP)(#TP + #FN)(#TN + #FP)(#TN + #FN)).
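
The two measures for unbalanced data, as a small sketch in the same style:

```python
import math

def bacc_mcc(tp, fn, fp, tn):
    """Balanced accuracy and Matthews correlation coefficient (assumes non-zero denominators)."""
    tpr = tp / (tp + fn)                      # sensitivity
    tnr = tn / (fp + tn)                      # specificity
    bacc = (tpr + tnr) / 2                    # balanced accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(    # normalized determinant of the confusion matrix
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return bacc, mcc

print(bacc_mcc(3, 2, 1, 4))
```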

  23. UNDERFITTING AND OVERFITTING. Underfitting: our model is too coarse to fit the data (neither training nor test data); this is usually the result of too restrictive model assumptions (i.e. too low model complexity). Overfitting: our model works very well on training data, but generalizes poorly to future/test data; this is usually the result of too high model complexity. The best generalization performance is obtained for the optimal choice of the complexity level. An estimate of the optimal choice can be determined by (cross) validation.

  24. UNDERFITTING AND OVERFITTING (cont'd). Plot of training error and test error as functions of model complexity.

  25. UNDERFITTING AND OVERFITTING (cont'd). The same plot with the underfitting region (low complexity) marked.

  26. UNDERFITTING AND OVERFITTING (cont'd). The same plot with both the underfitting region (low complexity) and the overfitting region (high complexity) marked.

  27. A BASIC CLASSIFIER: k-NEAREST NEIGHBOR. Suppose we have a labeled data set Z and a distance measure on the input space. Then the k-nearest neighbor classifier is defined as g_k-NN(x; Z) = the class that occurs most often among the k samples that are closest to x. For k = 1, we simply call this the nearest neighbor classifier: g_NN(x; Z) = the class of the sample that is closest to x.
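
A minimal NumPy sketch of the k-nearest neighbor rule defined above; the slide leaves the distance measure open, so the Euclidean distance is an assumption here:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=1):
    """k-nearest neighbor classifier: majority class among the k closest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training samples
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[nearest])             # count class labels among the neighbors
    return votes.most_common(1)[0][0]             # most frequent class

# Tiny example with labels +1 / -1
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y_train = np.array([+1, +1, -1, -1])
print(knn_classify(np.array([0.15, 0.15]), X_train, y_train, k=3))  # -> +1
```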

  28. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #1. Scatter plot of a two-class data set on [0, 1] × [0, 1].

  29. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #1. Decision boundary for k = 1 plotted over the data.

  30. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #1. Decision boundaries for k = 1 and k = 5.

  31. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #1. Decision boundaries for k = 1, k = 5, and k = 13.

  32. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2. Scatter plot of a second two-class data set on [0, 1] × [0, 1].

  33. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2. Decision boundary for k = 1.

  34. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2. Decision boundaries for k = 1 and k = 5.

  35. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2. Decision boundaries for k = 1, k = 5, and k = 13.

  36. k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2. Decision boundaries for k = 1, k = 5, k = 13, and k = 25.

  37. A BASIC NUMERICAL PREDICTOR: 1D LINEAR REGRESSION. Consider a data set Z = {(x_i, y_i) | i = 1, ..., l} ⊆ R² and a linear model y = w_0 + w_1·x = g(x; (w_0, w_1)), with w = (w_0, w_1). Suppose we want to find (w_0, w_1) such that the average quadratic loss Q(w_0, w_1) = (1/l)·Σ_{i=1}^{l} (w_0 + w_1·x_i − y_i)² = (1/l)·Σ_{i=1}^{l} (g(x_i; w) − y_i)² is minimized. Then the unique global solution is given by w_1 = Cov(x, y) / Var(x) and w_0 = ȳ − w_1·x̄.
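
The closed-form solution above translates directly into a few lines of NumPy; the data here are made up for illustration, not the slide's example:

```python
import numpy as np

# Made-up 1D data, roughly following y = 1 + 1.2*x plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.3, 4.8, 5.9, 7.2, 8.1])

# Closed-form solution of the average quadratic loss:
#   w1 = Cov(x, y) / Var(x),  w0 = mean(y) - w1 * mean(x)
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0 = y.mean() - w1 * x.mean()
print(f"fitted model: y = {w0:.3f} + {w1:.3f} * x")
```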

  38. LINEAR REGRESSION EXAMPLE #1. Scatter plot of a one-dimensional data set.

  39. LINEAR REGRESSION EXAMPLE #1. The same data with the fitted regression line.

  40. LINEAR REGRESSION FOR MULTIPLE VARIABLES. Consider a data set Z = {(x_i, y_i) | i = 1, ..., l} and a linear model y = w_0 + w_1·x_1 + ... + w_d·x_d = (1 | x)·w = g(x; (w_0, w_1, ..., w_d)). Suppose we want to find w = (w_0, w_1, ..., w_d)^T such that the average quadratic loss is minimized. Then the unique global solution is given as w = (X̃^T·X̃)^{-1}·X̃^T·y = X̃⁺·y, where X̃ = (1 | X) and X̃⁺ denotes its pseudo-inverse.
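
A NumPy sketch of this normal-equation solution using the pseudo-inverse of the augmented matrix X̃ = (1 | X); the data are illustrative only:

```python
import numpy as np

# Illustrative data: l = 6 samples, d = 2 input variables
X = np.array([[0.1, 1.0], [0.4, 0.8], [0.5, 0.3],
              [0.7, 0.9], [0.8, 0.2], [0.9, 0.6]])
y = np.array([1.2, 1.5, 1.0, 2.0, 1.3, 1.8])

X_tilde = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones: (1 | X)
w = np.linalg.pinv(X_tilde) @ y                   # w = (X~^T X~)^{-1} X~^T y via pseudo-inverse
print("w0, w1, w2 =", w)
```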

  41. LINEAR REGRESSION EXAMPLE #2. Three-dimensional plot of a data set with two input variables.

  42. LINEAR REGRESSION EXAMPLE #2. The same data with the fitted linear model shown.

  43. POLYNOMIAL REGRESSION. Consider a data set Z = {(x_i, y_i) | i = 1, ..., l} and a polynomial model of degree n: y = w_0 + w_1·x + w_2·x² + ... + w_n·x^n = g(x; (w_0, w_1, ..., w_n)). Suppose we want to find w = (w_0, w_1, ..., w_n)^T such that the average quadratic loss is minimized. Then the unique global solution is given as w = (X̃^T·X̃)^{-1}·X̃^T·y = X̃⁺·y, with X̃ = (1 | x | x² | ... | x^n).
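
Polynomial regression is just linear regression on powers of x; a sketch that builds X̃ = (1 | x | x² | ... | xⁿ) explicitly (toy data, degree chosen arbitrarily):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 6.0])
n = 3                                                    # polynomial degree (a hyperparameter)

X_tilde = np.column_stack([x**j for j in range(n + 1)])  # columns 1, x, x^2, ..., x^n
w = np.linalg.pinv(X_tilde) @ y                          # least-squares coefficients
print("coefficients w0..wn:", w)
print("prediction at x = 2.5:", sum(w[j] * 2.5**j for j in range(n + 1)))
```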

  44. POLYNOMIAL REGRESSION EXAMPLE. Scatter plot of a one-dimensional data set.

  45. POLYNOMIAL REGRESSION EXAMPLE. Fitted polynomial of degree n = 1.

  46. POLYNOMIAL REGRESSION EXAMPLE. Fitted polynomials of degree n = 1 and n = 2.

  47. POLYNOMIAL REGRESSION EXAMPLE. Fitted polynomials of degree n = 1, 2, and 3.

  48. POLYNOMIAL REGRESSION EXAMPLE. Fitted polynomials of degree n = 1, 2, 3, and 5.

  49. POLYNOMIAL REGRESSION EXAMPLE. Fitted polynomials of degree n = 1, 2, 3, 5, and 25.

  50. POLYNOMIAL REGRESSION EXAMPLE. Fitted polynomials of degree n = 1, 2, 3, 5, 25, and 75.

  51. SUPPORT VECTOR MACHINES IN A NUTSHELL. Putting it simply, Support Vector Machines (SVMs) are based on the idea of finding a classification border that maximizes the margin between positive and negative samples. According to a theoretical result, maximizing the margin corresponds to minimizing an upper bound of the generalization error.

  52. MARGIN MAXIMIZATION. Illustration of two classes separated by a hyperplane with maximal margin; the margin is marked on both sides of the separating border.

  53. MARGIN MAXIMIZATION (cont'd). The two classes are linearly separable if and only if their convex hulls are disjoint. If the two classes are linearly separable, margin maximization can be achieved by making an orthogonal 50:50 split of the shortest distance connecting the convex hulls of the two classes. The question remains how to solve margin maximization computationally: by quadratic optimization.

  54. SVM DISCRIMINANT FUNCTION. For a given training set {(x_i, y_i) | 1 ≤ i ≤ l}, a common support vector machine classifier is represented by the discriminant function g(x) = b + Σ_{i=1}^{l} α_i·y_i·k(x, x_i), where b is a real value, the α_i are non-negative factors, and k(., .) is the so-called kernel, a similarity measure for the inputs. The discriminant function only depends on those samples whose Lagrange multiplier α_i is not 0; those are called support vectors.
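
A sketch of evaluating this discriminant function for given coefficients; the α_i, b, and data below are hypothetical values for illustration only (in practice they come from the quadratic optimization, which is not shown here):

```python
import numpy as np

def rbf_kernel(x, z, sigma2=0.5):
    """Gaussian/RBF kernel k(x, z) = exp(-||x - z||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma2))

def discriminant(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    """g(x) = b + sum_i alpha_i * y_i * k(x, x_i); the sign of g(x) gives the predicted class."""
    return b + sum(a * yi * kernel(x, xi)
                   for a, yi, xi in zip(alpha, y_train, X_train)
                   if a > 0)                       # only support vectors (alpha_i > 0) contribute

# Hypothetical support-vector coefficients for a tiny data set
X_train = np.array([[0.2, 0.3], [0.8, 0.7], [0.4, 0.9], [0.6, 0.1]])
y_train = np.array([+1, -1, -1, +1])
alpha   = np.array([0.7, 0.7, 0.0, 0.0])           # alpha_i = 0: x_i is not a support vector
b       = 0.1
x_new   = np.array([0.3, 0.4])
print("g(x) =", discriminant(x_new, X_train, y_train, alpha, b))
```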

  55. STANDARD KERNELS. The following kernels are often used in practice: Linear: k(x, y) = x·y. Polynomial: k(x, y) = (x·y + β)^α. Gaussian/RBF: k(x, y) = exp(−‖x − y‖² / (2σ²)). Sigmoid: k(x, y) = tanh(α·x·y + β). (RBF = Radial Basis Function.)
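
The four standard kernels written out as NumPy one-liners; the parameter values are arbitrary defaults chosen for illustration:

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                                    # k(x, y) = x . y

def polynomial_kernel(x, y, alpha=2, beta=1.0):
    return (np.dot(x, y) + beta) ** alpha                  # k(x, y) = (x . y + beta)^alpha

def rbf_kernel(x, y, sigma2=0.5):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma2))    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))

def sigmoid_kernel(x, y, alpha=1.0, beta=0.0):
    return np.tanh(alpha * np.dot(x, y) + beta)            # k(x, y) = tanh(alpha x . y + beta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, "=", k(x, y))
```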

  56. EXAMPLE: LINEAR KERNEL / C = 1 (decision boundary plot).

  57. EXAMPLE: LINEAR KERNEL / C = 1000 (decision boundary plot).

  58. EXAMPLE: RBF KERNEL / C = 1 / σ² = 0.5 (decision boundary plot).

  59. EXAMPLE: RBF KERNEL / C = 10 / σ² = 0.05 (decision boundary plot).

  60. EXAMPLE: RBF KERNEL / C = 1000 / σ² = 0.005 (decision boundary plot).

  61. EXAMPLE: RBF KERNEL / C = 10 / σ² = 0.05 (decision boundary plot).

  62. SVMs FOR MULTI-CLASS PROBLEMS. Support vector machines are intrinsically based on the idea of separating two classes by maximizing the margin between them. So there is no obvious way to extend them to multi-class problems. All approaches introduced so far are based on breaking down the multi-class problem into several binary classification problems.

  63. SVMs FOR MULTI-CLASS PROBLEMS (cont'd). Suppose we have a classification problem with M classes. One against the rest: M support vector machines are trained, where the i-th SVM is trained to distinguish between the i-th class and all other classes; a new sample is assigned to the class whose SVM has the highest discriminant function value. Pairwise classification: M(M − 1)/2 SVMs are trained, one for each pair of classes; a new sample is assigned to the class that received the most votes from the M(M − 1)/2 SVMs. This is the better and more common approach.
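
A hedged sketch of the two schemes using scikit-learn's binary `SVC` as the base classifier on the iris data; note that `SVC` already handles multi-class input internally, so the explicit loops below are only meant to make the two strategies visible:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
x_new = X[0:1]                                     # one sample to classify (illustration)

# One against the rest: M binary SVMs, pick the class with the highest discriminant value
scores = []
for c in classes:
    svm = SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
    scores.append(svm.decision_function(x_new)[0])
print("one-vs-rest prediction:", classes[int(np.argmax(scores))])

# Pairwise classification: M(M-1)/2 binary SVMs, pick the class with the most votes
votes = np.zeros(len(classes))
for i, j in combinations(range(len(classes)), 2):
    mask = (y == classes[i]) | (y == classes[j])
    svm = SVC(kernel="linear").fit(X[mask], y[mask])
    winner = svm.predict(x_new)[0]
    votes[np.where(classes == winner)[0][0]] += 1
print("pairwise-voting prediction:", classes[int(np.argmax(votes))])
```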

  64. SEQUENCE CLASSIFICATION USING SVMs. All considerations so far have been based on vectorial data. Biological sequences cannot be cast to vectorial data easily, in particular if they do not have fixed lengths. Support vector machines, by means of the kernels they employ, can handle any kind of data as long as a meaningful kernel (i.e. similarity measure) is available. In the following, we consider kernels that can be used for biological sequences.

  65. SEQUENCE KERNELS. We consider kernels of the following kind: k(x, y) = Σ_{m ∈ M} N(m, x)·N(m, y), where M is a set of patterns and N(m, x) denotes the number of occurrences/matches of pattern m in string x. Spectrum kernel: consider all possible K-length strings (exact matches).
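
A minimal sketch of the spectrum kernel written from the formula above, counting exact K-mer matches; K = 3 and the two example strings are arbitrary choices:

```python
from collections import Counter

def kmer_counts(s, k):
    """Count all K-length substrings (K-mers) occurring in string s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(x, y, k=3):
    """k(x, y) = sum over all K-mers m of N(m, x) * N(m, y)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[m] * cy[m] for m in cx if m in cy)   # only shared K-mers contribute

print(spectrum_kernel("GATTACAGATTACA", "CAGATTAG", k=3))
```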

  66. DECISION TREES: INTRODUCTION. A decision tree is a classifier that classifies samples “by asking questions successively”; each non-leaf node corresponds to a question, each leaf corresponds to a final prediction. Decision tree learning is concerned with partitioning the training data hierarchically such that the leaf nodes are hopefully homogeneous in terms of the target class. Decision trees have mainly been designed for categorical data, but they can also be applied to numerical features. Decision trees are traditionally used for classification (binary and multi-class), but regression is possible, too.

  67. DECISION TREE LEARNING. All decision tree learning algorithms are recursive, depth-first search algorithms that perform hierarchical splits. There are three main design issues: 1. Splitting criterion: which splits to choose? 2. Stopping criterion: when to stop further growing of the tree? 3. Pruning: whether/how to collapse unnecessarily deep subtrees? The two latter are especially relevant for adjusting the complexity of decision trees (underfitting vs. overfitting).
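
As a sketch of how these three design issues surface in a standard toolkit (an assumed setup, not the lecture's own code): in scikit-learn's `DecisionTreeClassifier`, `criterion` selects the splitting criterion, `max_depth` and `min_samples_leaf` act as stopping criteria, and `ccp_alpha` controls cost-complexity pruning.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(
    criterion="gini",        # splitting criterion: which splits to choose
    max_depth=3,             # stopping criterion: maximum tree depth
    min_samples_leaf=5,      # stopping criterion: minimum number of samples per leaf
    ccp_alpha=0.0,           # pruning strength (cost-complexity pruning)
    random_state=0,
).fit(X, y)

# Print the learned tree as text (questions at the inner nodes, classes at the leaves)
print(export_text(tree, feature_names=list(iris.feature_names)))
```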

  68. EXAMPLE: IRIS DATA SET. Decision tree for the iris data set: the root split is Petal.Length < 2.45 (isolating setosa), followed by Petal.Width < 1.75 (separating versicolor from virginica); shown next to a scatter plot of Petal.Width vs. Petal.Length with the split thresholds 2.45 and 1.75 marked.

  69. EXAMPLE: IRIS DATA SET. Same content as slide 68 (animation build step).

  70. EXAMPLE: IRIS DATA SET. Same content as slide 68 (animation build step).
