SUPERVISED MACHINE LEARNING
An Introduction — With Special Emphasis On Deep Learning
Dr. Ulrich Bodenhofer
Associate Professor, Institute of Bioinformatics, Johannes Kepler University, Altenberger Str. 69, A-4040 Linz
Fax +43 732 2468 4539, E-Mail bodenhofer@bioinf.jku.at, URL http://www.bioinf.jku.at/
Basics of machine learning: supervised vs. unsupervised, classification vs. regression
Overview of supervised machine learning: basic principles, k-nearest neighbor, linear regression, support vector machines, random forests
Overview of neural networks: basic idea and algorithms, deep learning, success stories
Slides, R code examples, and data sets can be found at the following URL:
http://www.bioinf.jku.at/people/bodenhofer/UMA_ML/
Finding solutions of a system of equations
Prediction of the trajectory of a space shuttle
Diagnosis whether a patient has a certain disease
Prediction of the outcome of an election
Recognition of handwritten characters
Identification of customer target groups
Prediction of the function of a protein from its amino acid sequence
Traditional disciplines like physics, chemistry, and biology usually aim at exact explicit models, i.e. at knowing how (and why) things work in a particular way; then a solution to a new problem can be found deductively using explicit knowledge.
That goal, however, is sometimes too difficult to achieve; reasons may be computational complexity, insufficient knowledge, insufficient information, etc.
Machine learning tries to elicit models/knowledge from previously observed data.
Putting it simply, machine learning is about learning from data (often called inductive learning).
[Table: toy training set with two numeric input features and a binary class label (+1/−1)]
[Figures: scatter plots of the two-dimensional toy data set]
[Table: toy training set with six numeric input features and a binary class label (+1/−1)]
Example borrowed from: R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Second edition. John Wiley & Sons, 2001. ISBN 0-471-05669-3.
Automated system to sort fish in a fish-packing company: salmon must be distinguished from sea bass optically.
Given: a set of pictures with known fish, the training set.
Goal: automatically distinguish between salmon and sea bass for future pictures.
[Figure: example pictures of salmon and sea bass]
Processing pipeline: camera image → preprocessing → feature extraction → classification → salmon / sea bass
Preprocessing: contrast and brightness correction, segmentation, alignment
Features: e.g. length and brightness of the fish (see the histograms below)
[Figures: histograms of the length and brightness features for salmon vs. sea bass]
[Figures: scatter plots of the fish samples in the two-dimensional feature space]
Does learning help in the future, i.e. does experience from previously observed examples help us to solve a future task?
What is a good model? How do we assess the quality of a model? Which methods are available?
In any case, machine learning is not (only) about describing previously observed data, but about generalizing to future data.
Projection methods: down-projection of the data to a lower-dimensional space in order to concentrate on the essence of the data
Clustering: grouping of similar data objects
Biclustering: simultaneous grouping of samples and features
Generative model: building a model that produces data that are distributed as the observed data
. . .
Classification: the target value is a class label
Regression: the target value is numerical
Supervised ML is sometimes called predictive modeling, because the goal is most often to predict the target value for future input values.
Reinforcement learning: learning by feedback from the environment in an online process
Feature extraction: computation of features from data prior to machine learning (e.g. signal and image processing)
Feature selection: selection of those features that are relevant/sufficient to solve a given learning task
Feature construction: construction of new features as part of the learning process
Model: the specific relationship/representation we are aiming at
Model class: the class of models in which we search for the model
Parameters: representations of concrete models inside the given model class
Model selection/training: the process of finding that model from the model class that fits/explains the observed data in the best way
Hyperparameters: parameters controlling the model complexity or the training procedure
Typical workflow: question/task + data → preprocessing → choose features → choose model class → train model → evaluate model → final model + answer; prior knowledge enters at every step. If the evaluation fails, the process returns to an earlier step (e.g. the choice of features or model class) and iterates.
For both supervised and unsupervised machine learning, we need the following basic ingredients:
Model class: the class of models in which we search for the model
Objective: a criterion/measure that determines what is a good model
Optimization algorithm: a method that tries to find model parameters such that the objective is optimized
The right choices of the above components depend on the characteristics of the given task.
Machine learning methods are able to solve some tasks for which explicit models will never exist.
Machine learning methods have become standard tools in a variety of disciplines (e.g. signal and image processing, bioinformatics).
Machine learning is not a universal remedy.
The quality of machine learning models depends on the quality and quantity of data.
What cannot be measured/observed can never be identified by machine learning.
Machine learning complements explicit/deductive models instead of replacing them.
Machine learning is often applied in a naive way.
Goal of supervised machine learning: to identify the relationship between inputs and targets/labels
Tumor type  Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  ...
A            8.83   15.25   12.59   12.91   13.21   16.59   ...
A            9.41   13.37   11.95   15.09   13.39    9.94   ...
A            8.75   14.41   12.11   15.63   13.69    7.83   ...
...           ...     ...     ...     ...     ...     ...   ...
A            8.92   13.85   12.23   11.61   13.03   10.77   ...
B            8.65   12.93   11.58    9.47    9.81   14.79   ...
B            8.43   16.13   10.88   10.97    9.72   12.51   ...
B            9.62   15.31   12.03   10.83   10.47   14.33   ...
...           ...     ...     ...     ...     ...     ...   ...
B            8.64   10.54   12.59    9.42   10.29   14.65   ...
Can we infer tumor types from gene expression values?
Which genes are most indicative?
The quality of a model can only be judged on the basis of its performance on future data.
So assume that future data are generated according to some joint distribution of inputs and targets, the joint density of which we denote as p(x, y).
The generalization error (or risk) is the expected error on future data for a given model.
Since we typically do not know the distribution p(x, y), we have to estimate the generalization performance by making use of already existing data. Two methods are common:
Test set/holdout method: the data set is split randomly into a training set and a test set; a predictor is trained on the former and evaluated on the latter.
Cross validation: the data set is split randomly into a certain number k of equally sized folds; k predictors are trained, each leaving out one fold as test set; the average performance on the k test folds is computed.
[Figure: cross validation; in each round (1., 2., ..., 5.), one fold is used for evaluation and the remaining folds are used for training]
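A minimal R sketch of k-fold cross validation, assuming a generic training/prediction pair; the names train_fun and predict_fun are hypothetical placeholders for any supervised method, not part of a specific package.

## k-fold cross validation sketch ('train_fun' and 'predict_fun' are hypothetical placeholders)
cross_validate <- function(X, y, k = 5, train_fun, predict_fun) {
  n <- nrow(X)
  folds <- sample(rep(seq_len(k), length.out = n))   # random assignment of samples to k folds
  errors <- numeric(k)
  for (i in seq_len(k)) {
    test_idx  <- which(folds == i)
    model     <- train_fun(X[-test_idx, , drop = FALSE], y[-test_idx])
    predicted <- predict_fun(model, X[test_idx, , drop = FALSE])
    errors[i] <- mean(predicted != y[test_idx])       # misclassification rate on held-out fold
  }
  mean(errors)                                        # average error over the k test folds
}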
For a given sample (x, y) and a classifier g(·), (x, y) is a
true positive (TP) if y = +1 and g(x) = +1,
true negative (TN) if y = −1 and g(x) = −1,
false positive (FP) if y = −1 and g(x) = +1,
false negative (FN) if y = +1 and g(x) = −1.
Given a data set, the confusion matrix is defined as follows:

                       predicted g(x) = +1    predicted g(x) = −1
  actual y = +1               #TP                     #FN
  actual y = −1               #FP                     #TN

In this table, the entries #TP, #FP, #FN, and #TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively, for the given test data set.
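In R, such a confusion matrix can be obtained with table(); the two label vectors below are made up purely for illustration.

## Toy example: confusion matrix via table() (made-up label vectors)
y_true <- factor(c(1, 1, -1, 1, -1, -1, 1, -1), levels = c(1, -1))
y_pred <- factor(c(1, -1, -1, 1, 1, -1, 1, -1), levels = c(1, -1))
table(actual = y_true, predicted = y_pred)
##        predicted
## actual   1  -1
##     1    3   1     <- #TP = 3, #FN = 1
##     -1   1   3     <- #FP = 1, #TN = 3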
Accuracy: proportion of correctly classified items, i.e. ACC = (#TP + #TN) / (#TP + #FN + #FP + #TN).
True Positive Rate (aka recall/sensitivity): proportion of correctly identified positives, i.e. TPR = #TP / (#TP + #FN).
False Positive Rate: proportion of negative examples that were incorrectly classified as positives, i.e. FPR = #FP / (#FP + #TN).
Precision: proportion of predicted positive examples that were correct, i.e. PREC = #TP / (#TP + #FP).
True Negative Rate (aka specificity): proportion of correctly identified negatives, i.e. TNR = #TN / (#FP + #TN).
False Negative Rate: proportion of positive examples that were incorrectly classified as negatives, i.e. FNR = #FN / (#TP + #FN).
Balanced Accuracy: mean of true positive and true negative rate, i.e. BACC = (TPR + TNR) / 2.
Matthews Correlation Coefficient: measure of non-randomness of the confusion matrix, i.e. MCC = (#TP · #TN − #FP · #FN) / sqrt((#TP + #FP) · (#TP + #FN) · (#TN + #FP) · (#TN + #FN)).
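A short sketch computing the measures above from the four counts of a confusion matrix; the count values are hypothetical.

## Hypothetical counts from a confusion matrix
TP <- 40; FN <- 10; FP <- 5; TN <- 45

ACC  <- (TP + TN) / (TP + FN + FP + TN)   # accuracy
TPR  <- TP / (TP + FN)                    # true positive rate (recall/sensitivity)
FPR  <- FP / (FP + TN)                    # false positive rate
PREC <- TP / (TP + FP)                    # precision
TNR  <- TN / (FP + TN)                    # true negative rate (specificity)
FNR  <- FN / (TP + FN)                    # false negative rate
BACC <- (TPR + TNR) / 2                   # balanced accuracy
MCC  <- (TP * TN - FP * FN) /
        sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # Matthews correlation coefficient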
Underfitting: our model is too coarse to fit the data (neither training nor test data); this is usually the result of too restrictive model assumptions (i.e. too low model complexity).
Overfitting: our model works very well on training data, but generalizes poorly to future/test data; this is usually the result of too high model complexity.
The best generalization performance is obtained for the optimal choice of the complexity level. An estimate of the optimal choice can be determined by (cross) validation.
[Figure: training error and test error as functions of model complexity; the training error decreases with increasing complexity, while the test error first decreases (underfitting region) and then rises again (overfitting region)]
Suppose we have a labeled data set Z and a distance measure on the input space. Then the k-nearest neighbor classifier is defined as follows:
g_k-NN(x; Z) = class that occurs most often among the k samples that are closest to x
For k = 1, we simply call this the nearest neighbor classifier:
g_NN(x; Z) = class of the sample that is closest to x
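A k-nearest neighbor classifier is available in the R package class; the sketch below applies it to the iris data set with an arbitrary random train/test split and k = 5.

## k-NN on the iris data set using the 'class' package
library(class)
set.seed(42)
train_idx <- sample(nrow(iris), 100)            # arbitrary random training set
pred <- knn(train = iris[train_idx, 1:4],
            test  = iris[-train_idx, 1:4],
            cl    = iris$Species[train_idx],
            k     = 5)
mean(pred != iris$Species[-train_idx])          # misclassification rate on the test set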
[Figures: the (k-)nearest neighbor classifier applied to the two-dimensional toy data set for different values of k]
Consider a data set Z = {(xi, yi) | i = 1, ..., l} ⊆ R² and a linear model y = w0 + w1 · x = g(x; w0, w1).
Suppose we want to find (w0, w1) such that the average quadratic loss
Q(w0, w1) = (1/l) · Σ_{i=1}^{l} (yi − (w0 + w1 · xi))²
is minimized. Then the unique global solution is given as follows:
w1 = Cov(x, y) / Var(x),   w0 = ȳ − w1 · x̄
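The closed-form solution can be checked directly in R against lm(); the data below are made up for illustration.

## Simple linear regression: closed form vs. lm() on made-up data
set.seed(1)
x <- runif(50, 0, 6)
y <- 1.5 + 0.8 * x + rnorm(50, sd = 0.5)

w1 <- cov(x, y) / var(x)        # slope:     w1 = Cov(x, y) / Var(x)
w0 <- mean(y) - w1 * mean(x)    # intercept: w0 = ybar - w1 * xbar
c(w0, w1)
coef(lm(y ~ x))                 # same values up to numerical precision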
[Figures: one-dimensional example data set and the fitted regression line]
Consider a data set Z = {(xi, yi) | i = 1, ..., l} and a linear model y = w0 + w1 · x1 + · · · + wd · xd = (1 | x) · w = g(x; w).
Suppose we want to find w = (w0, w1, ..., wd)ᵀ such that the average quadratic loss is minimized. Then the unique global solution is given as
w = (X̃ᵀ · X̃)⁻¹ · X̃ᵀ · y = X̃⁺ · y,   where X̃ = (1 | X).
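A sketch of the matrix solution on random made-up data; X̃ is formed by prepending a column of ones to the input matrix.

## Multiple linear regression via the normal equations (made-up data)
set.seed(2)
X <- matrix(runif(200), ncol = 4)                 # 50 samples, 4 features
y <- drop(2 + X %*% c(1, -1, 0.5, 3)) + rnorm(50, sd = 0.1)
Xt <- cbind(1, X)                                 # X~ = (1 | X)
w  <- solve(t(Xt) %*% Xt, t(Xt) %*% y)            # w = (X~' X~)^(-1) X~' y
drop(w)
coef(lm(y ~ X))                                   # lm() yields the same coefficients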
[Figures: example data set with two input features and the fitted linear model]
Consider a data set Z = {(xi, yi) | i = 1, ..., l} and a polynomial model of degree n: y = w0 + w1 · x + w2 · x² + · · · + wn · xⁿ = g(x; w).
Suppose we want to find w = (w0, w1, ..., wn)ᵀ such that the average quadratic loss is minimized. Then the unique global solution is given as follows:
w = (X̃ᵀ · X̃)⁻¹ · X̃ᵀ · y = X̃⁺ · y,   with X̃ = (1 | x | x² | · · · | xⁿ)
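Polynomial regression can be carried out in R with lm() and poly(); varying the degree on made-up data illustrates the underfitting/overfitting trade-off discussed earlier (training error alone keeps shrinking as the degree grows).

## Polynomial regression of varying degree (made-up data)
set.seed(3)
x <- runif(40, 0, 6)
y <- sin(x) + rnorm(40, sd = 0.3)

fit_deg <- function(n) lm(y ~ poly(x, degree = n, raw = TRUE))
for (n in c(1, 3, 9)) {
  fit <- fit_deg(n)
  cat("degree", n, " training MSE:", mean(residuals(fit)^2), "\n")
}
## low degree: underfitting; very high degree: low training error but poor generalization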
[Figures: polynomial fits of increasing degree to a one-dimensional example data set]
Putting it simply, Support Vector Machines (SVMs) are based on the idea of maximizing the margin between positive and negative samples.
According to a theoretical result, maximizing the margin corresponds to minimizing an upper bound of the generalization error.
[Figures: linearly separable two-class data set with a separating hyperplane and the margin on either side]
The two classes are linearly separable if and only if their convex hulls are disjoint.
If the two classes are linearly separable, margin maximization can be achieved by making an orthogonal 50:50 split of the shortest distance connecting the convex hulls of the two classes.
The question remains how to solve margin maximization computationally: by quadratic optimization.
For a given training set {(xi, yi) | 1 ≤ i ≤ l}, a common support vector machine classifier is represented as the discriminant function
g(x) = b + Σ_{i=1}^{l} αi · yi · k(x, xi),
where b is a real value, the αi are non-negative factors, and k(·, ·) is the so-called kernel, a similarity measure for the inputs. The discriminant function only depends on those samples whose Lagrange multiplier αi is not 0. Those are called support vectors.
The following kernels are often used in practice:
Linear: k(x, y) = x · y
Polynomial: k(x, y) = (x · y + β)^α
Gaussian/RBF (Radial Basis Function): k(x, y) = exp(−‖x − y‖² / (2σ²))
Sigmoid: k(x, y) = tanh(α · x · y + β)
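Support vector machines with these kernels are available in the R package e1071 (an interface to LIBSVM); the sketch below uses an RBF kernel on the iris data set with arbitrarily chosen hyperparameters.

## SVM with RBF kernel using the 'e1071' package (hyperparameters chosen arbitrarily)
library(e1071)
set.seed(4)
train_idx <- sample(nrow(iris), 100)
model <- svm(Species ~ ., data = iris[train_idx, ],
             kernel = "radial", gamma = 0.5, cost = 1)   # LIBSVM's gamma corresponds to 1/(2*sigma^2)
pred <- predict(model, iris[-train_idx, ])
table(actual = iris$Species[-train_idx], predicted = pred)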
Support vector machines are intrinsically based on the idea of separating two classes, i.e. they are designed for binary classification problems.
All approaches introduced so far are based on breaking down the multi-class problem into several binary classification problems.
Suppose we have a classification problem with M classes.
One against the rest: M support vector machines are trained, where the i-th SVM is trained to distinguish between the i-th class and all other classes; a new sample is assigned to the class whose SVM has the highest discriminant function value.
Pairwise classification: M(M−1)/2 SVMs are trained, one for each pair of classes; a new sample is assigned to the class that received the most votes from the M(M−1)/2 SVMs. The latter is generally considered the better and more common approach.
All considerations so far have been based on vectorial data. Biological sequences cannot be cast to vectorial data easily, in particular, if they do not have fixed lengths. Support vector machines, by means of the kernels they employ, can handle any kind of data as long as a meaningful kernel (i.e. similarity measure) is available. In the following, we will consider kernels that can be used for biological sequences.
We consider kernels of the following kind:
k(x, y) = Σ_{m ∈ M} N(m, x) · N(m, y),
where M is a set of patterns and N(m, x) denotes the number of occurrences of pattern m in sequence x.
Spectrum Kernel: consider all possible K-length strings (exact matches).
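A minimal sketch of a spectrum kernel for K = 3 in base R: count all K-mers in each sequence and sum the products of matching counts. The two example sequences are made up.

## Spectrum kernel (K = 3): k(x, y) = sum over all K-mers m of N(m, x) * N(m, y)
kmer_counts <- function(s, K = 3) {
  starts <- seq_len(nchar(s) - K + 1)
  table(substring(s, starts, starts + K - 1))   # counts of all K-mers occurring in s
}
spectrum_kernel <- function(x, y, K = 3) {
  cx <- kmer_counts(x, K); cy <- kmer_counts(y, K)
  common <- intersect(names(cx), names(cy))     # only K-mers occurring in both sequences contribute
  sum(as.numeric(cx[common]) * as.numeric(cy[common]))
}
spectrum_kernel("GATTACAGATTACA", "CAGATTAGA")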
A decision tree is a classifier that classifies samples “by asking questions successively”; each non-leaf node corresponds to a question, each leaf corresponds to a final prediction.
Decision tree learning is concerned with partitioning the training data hierarchically such that the leaf nodes are hopefully homogeneous in terms of the target class.
Decision trees have mainly been designed for categorical data, but they can also be applied to numerical features.
Decision trees are traditionally used for classification (binary and multi-class), but regression is possible, too.
All decision tree learning algorithms are recursive, depth-first search algorithms that perform hierarchical splits. There are three main design issues:
How do we choose the best split at a given node?
When do we stop splitting?
Do we prune the grown trees?
The two latter are especially relevant for adjusting the complexity of decision trees (underfitting vs. overfitting).
[Figure: decision tree for the iris data set; first split on Petal.Length < 2.45 (setosa vs. the rest), then on Petal.Width < 1.75 (versicolor vs. virginica), together with the corresponding partition of the Petal.Length/Petal.Width plane]
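A tree of this kind can be trained in R with the rpart package; restricting the features to Petal.Length and Petal.Width typically reproduces the splits shown above.

## Decision tree on the iris data set using 'rpart'
library(rpart)
tree <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)
print(tree)                                     # splits: typically Petal.Length < 2.45, then Petal.Width < 1.75
pred <- predict(tree, iris, type = "class")
table(actual = iris$Species, predicted = pred)  # confusion matrix on the training data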
Use CART (Classification and Regression Trees) for training the single trees, i.e. binary splits with Gini impurity gain (for classification) / variance reduction (for regression) as splitting criterion.
For each tree, samples are chosen randomly from the training set (typically with replacement).
For each split, only a sub-sample of randomly chosen features is considered.
Trees are grown to full size and not pruned.
[Figure: decision regions of a random forest in the Petal.Length/Petal.Width plane of the iris data set]
Random forests allow for assessing the generalization performance on the basis of training data only.
For each sample, the error can be computed by considering only those trees whose randomly chosen training sub-sample did not include this sample.
Then the overall out-of-bag error can be computed by averaging the out-of-bag errors of all samples.
Mean Gini impurity decrease: for all features, average the Gini impurity gains of all splits in all trees that involve this feature.
Mean accuracy decrease: for each feature, permute its values randomly and compute the out-of-bag errors for the data set with the permuted feature; the importance is obtained from the differences before and after permuting the feature (upon normalization by the standard deviation of the differences).
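Both the out-of-bag error and the two importance measures are directly available in the randomForest package; the sketch below uses the iris data set.

## Random forest on iris with out-of-bag error and variable importance
library(randomForest)
set.seed(5)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf               # prints the out-of-bag error estimate and the confusion matrix
importance(rf)   # mean decrease in accuracy and mean decrease in Gini impurity per feature
varImpPlot(rf)   # importance plot, as in the figure below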
[Figure: variable importance for the iris data set; by both mean decrease in accuracy and mean decrease in Gini impurity, Petal.Length and Petal.Width are the most important features, followed by Sepal.Length and Sepal.Width]
The most powerful and most versatile “learning machine” is still the human brain.
Starting in the 1940s, ideas for creating “intelligent” systems by mimicking the function of nerve/brain cells have been developed.
An artificial neural network is a parallel processing system with small computing units (neurons) that work similarly to nerve/brain cells.
The inside of every neuron (nerve or brain cell) carries a certain electric charge.
The electric charge of connected neurons may raise or lower this charge (by means of transmission of ions through the synaptic interface).
As soon as the charge reaches a certain threshold, an electric impulse is transmitted through the cell's axon to the neighboring cells.
In the synaptic interfaces, chemicals called neurotransmitters control the strength to which an impulse is transmitted from one cell to the next.
[public domain; from Wikimedia Commons]
A perceptron is a simple linear threshold unit:
g(x; w, θ) = 1 if Σ_{j=1}^{d} wj · xj > θ, and 0 otherwise.
In analogy to the biological model, the inputs xj correspond to the charges received from connected cells through the dendrites, the weights wj correspond to the properties of the synaptic interface, and the output corresponds to the impulse that is sent through the axon as soon as the charge exceeds the threshold θ.
Though it seems to be a (simplistic) model of a neuron, a perceptron is nothing else but a simple linear classifier.
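A perceptron is simple enough to write down directly; the sketch below implements the threshold unit defined above, with hypothetical weights and threshold.

## Perceptron: linear threshold unit g(x; w, theta)
perceptron <- function(x, w, theta) {
  as.numeric(sum(w * x) > theta)    # 1 if the weighted sum exceeds the threshold, else 0
}
w <- c(0.4, -0.2, 0.7)                    # hypothetical weights
perceptron(c(1, 0, 1), w, theta = 0.5)    # -> 1
perceptron(c(0, 1, 0), w, theta = 0.5)    # -> 0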
[Figure: multi-layer perceptron with an input layer, a hidden layer, and an output layer]
Minsky and Papert conjectured in the late 1960s that training multi-layer perceptrons is infeasible. Because of this, the study of multi-layer perceptrons was almost halted until the mid-1980s.
In 1986, Rumelhart and McClelland first published the backpropagation algorithm and, thereby, proved Minsky and Papert wrong. It turned out later that the backpropagation algorithm had already been described by Werbos in 1974 in his dissertation. In a different context, the algorithm first appeared in the work of Bryson et al. in the 1960s.
There was a neural networks hype in the 1980s before they were superseded by support vector machines. In recent years, however, new techniques for training deep networks have brought a renaissance of neural networks.
Deep learning is a class/framework of strategies for training deep networks that aim to learn multiple levels of representations of the data and to allow for accurate predictions from these representations. Deep learning can be supervised or unsupervised.
First approaches to deep learning employed a two-step procedure:
Pre-training: levels of representations are learned layer by layer
Fine-tuning: a supervised learning algorithm is applied that makes predictions from the last layer of the pre-trained network
Unsupervised deep learning only consists of unsupervised pre-training and omits fine-tuning.
[Figures: layer-wise pre-training; hidden layers no. 1 through 6 are trained one after the other, each on top of the previously trained layers; in the final fine-tuning step, the mapping from the pre-trained representation to the output/target is trained in a supervised fashion]
Restricted Boltzmann machine (RBM): a simple stochastic neural network with an input layer and one hidden layer that are connected in both directions with symmetric weights; RBMs aim to learn a probability distribution over the inputs. The learning algorithm uses sampling of inputs and hidden activations along with gradient descent.
Autoencoders: a (denoising) autoencoder with one hidden layer is trained in each pre-training step. After training, the output layer of the autoencoder is discarded and only the hidden layer remains. In the subsequent step, another autoencoder is trained with the inputs being the activations of the hidden neurons of the previously trained autoencoder.
Supervised pre-training: a network with one hidden layer is trained in each pre-training step. After training, the output layer is discarded and only the hidden layer remains. In the subsequent step, another network is trained with the inputs being the activations of the hidden neurons of the previous network.
The success of a deep network is determined by how meaningful the representations in the hidden layers are. What is a meaningful representation?
Each hidden unit corresponds to a specific (hidden) pattern in the data.
Different hidden units correspond to different patterns, i.e. the patterns are disentangled.
Disentangling of representations can also be achieved by ensuring sparse activation, i.e. only a fraction of hidden neurons are activated for a given input.
Dropout: during training, activations are randomly set to 0 (e.g. with a probability of 0.5).
Rectified linear units (ReLU): instead of a sigmoid activation function, a function is used that gives 0 below a certain threshold. The most common choice is ϕ(x) = max(0, x).
These approaches even allow for training a deep network directly without pre-training.
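Both ideas are easy to state numerically; the sketch below applies ReLU and a dropout mask to a hypothetical vector of hidden pre-activations (the rescaling normally applied at test time is omitted here).

## ReLU and dropout applied to a hypothetical vector of hidden pre-activations
relu <- function(x) pmax(0, x)             # phi(x) = max(0, x)

set.seed(6)
pre_act <- rnorm(10)                       # made-up pre-activations
act <- relu(pre_act)                       # sparse activation: negative values become 0

drop_mask <- rbinom(length(act), 1, 0.5)   # dropout with probability 0.5
act_dropped <- act * drop_mask             # randomly set activations to 0 (during training only)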
[Figure: plot of the ReLU activation function ϕ(x) = max(0, x)]
In principle, classical feed-forward neural networks (fully connected networks) could be used for image analysis by simply connecting the pixels to input units.
However, if at all, this only makes sense for small aligned images (e.g. in character recognition).
For the analysis of larger and more complex images, this standard architecture is not useful. Instead, it is common to have stacked layers of units that operate on small overlapping patches/windows. Such networks are called convolutional neural networks (CNNs).
The first convolutional layer usually consists of multiple units that operate on small image patches (3×3, 5×5, or 7×7). Each unit corresponds to one simple feature of a patch. Such units are often called filters.
The activations of all units are computed for all patches, thereby creating a feature map of the image.
Convolutional layers can be stacked. It can be useful to down-sample feature maps by local max pooling (e.g. with non-overlapping 2×2 windows).
Such networks can either have fully connected layers on top (e.g. for image classification) or can also be fully convolutional (output is an image; e.g. for segmentation of detected objects).
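A base-R sketch of a single 3×3 convolution followed by 2×2 max pooling, matching the dimensions in the figures below (10×10 input, 8×8 feature map, 4×4 pooled map); the input image and the filter values are made up.

## 3x3 convolution ("valid", no padding) and 2x2 max pooling on a 10x10 input
set.seed(7)
img  <- matrix(runif(100), nrow = 10)          # made-up 10x10 "image"
filt <- matrix(c(-1, -1, -1,
                  0,  0,  0,
                  1,  1,  1), nrow = 3, byrow = TRUE)   # hypothetical 3x3 edge-like filter

conv2d_valid <- function(x, f) {
  out <- matrix(0, nrow(x) - nrow(f) + 1, ncol(x) - ncol(f) + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(x[i:(i + nrow(f) - 1), j:(j + ncol(f) - 1)] * f)
  out
}

max_pool2 <- function(x) {                     # non-overlapping 2x2 max pooling
  idx <- seq(1, nrow(x), by = 2)               # assumes a square map with even side length
  out <- matrix(0, length(idx), length(idx))
  for (i in seq_along(idx))
    for (j in seq_along(idx))
      out[i, j] <- max(x[idx[i]:(idx[i] + 1), idx[j]:(idx[j] + 1)])
  out
}

feat   <- conv2d_valid(img, filt)              # 8x8 feature map
pooled <- max_pool2(feat)                      # 4x4 down-sampled feature map
dim(feat); dim(pooled)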
[Figures: a 3×3 convolution/filter applied to a 10×10 input image yields an 8×8 feature map; several filters yield several feature maps; a second 3×3 convolutional layer on top produces 6×6 feature maps; 2×2 max pooling down-samples an 8×8 feature map to 4×4]
Fully connected layers are trained as usual.
In convolutional layers, each feature map/filter has only one set of weights that is shared across all patches; this is called weight sharing.
In max pooling layers, the error signal is only propagated to the input from which the maximal activation came.
[Figure: two layers of a convolutional network; hypothetical inputs maximizing activation and real images that lead to a high activation of the considered neuron]
Computational challenge set up by the US agencies NIH, EPA, and FDA
Unprecedented multi-million-dollar effort
12,000 compounds tested experimentally for twelve different toxic effects
Goal: predict toxicity computationally
Input features:
40,000 very sparse features: Extended Connectivity Fingerprint (ECFP4) presence counts of chemical sub-structures
5,057 additional features: 2,500 toxicophore features, 200 common chemical scaffolds, various chemical descriptors
Deep learning-based solution by JKU's Institute of Bioinformatics won the grand challenge, both panels (nuclear receptor panel and stress response panel), and six single prediction tasks.
The hierarchical representation of deep networks allowed for the identification of novel toxicophores.
Although the foundations of deep learning were laid 15–20 years ago, a major hype emerged only recently in the machine learning community.
Deep networks have won numerous competitions in music, speech and image recognition, drug discovery, and other fields.
Deep learning has been called “. . . the biggest data science breakthrough of the decade” (J. Howard).
The New York Times covered the subject twice with front-page articles in 2012.
Major companies, such as Google, Microsoft, Apple, and facebook, use deep networks in their products and services.
Google has acquired companies specialized in deep learning: DNNresearch (founded by G. Hinton, U. Toronto; March 2013; price not revealed) and Deepmind (London-based company founded by D. Hassabis; January 2014; price approx. $400–650m).
Feedforward neural networks require vectorial inputs. Therefore, they cannot be applied to time series or sequences directly.
One option is to apply them to (sliding) windows. The obvious disadvantage of this simple approach is that windows are treated independently and no learning across windows can take place.
[Figure: feedforward network applied to sliding windows of an input sequence, producing an output sequence]
Recurrent neural networks (RNNs) provide an alternative, where “recurrent” means that the network has connection cycles. There are several different RNN architectures.
After each evaluation (for one window, in time step t), the activations are kept and potentially used as inputs in time step t+1; so the generalization of the forward pass is straightforward for RNNs.
The backpropagation algorithm can also be generalized to RNNs; this is typically called backpropagation through time.
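A minimal sketch of the recurrent forward pass for a plain (Elman-type) RNN: the hidden activations of step t−1 are fed back as additional inputs at step t. All weights and the input sequence are made up, and this is not the LSTM architecture discussed later.

## Plain RNN forward pass: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
set.seed(8)
n_in <- 3; n_hidden <- 4; T_len <- 6
W_x <- matrix(rnorm(n_hidden * n_in), n_hidden, n_in)           # input-to-hidden weights
W_h <- matrix(rnorm(n_hidden * n_hidden), n_hidden, n_hidden)   # recurrent hidden-to-hidden weights
b   <- rnorm(n_hidden)
X   <- matrix(rnorm(T_len * n_in), T_len, n_in)                 # made-up input sequence (one row per time step)

h <- rep(0, n_hidden)                                           # initial hidden state
H <- matrix(0, T_len, n_hidden)
for (t in seq_len(T_len)) {
  h <- tanh(W_x %*% X[t, ] + W_h %*% h + b)                     # activations carried over to step t+1
  H[t, ] <- h
}
H   # hidden activations for each time step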
[Figure: example of an RNN mapping an input sequence to an output sequence]
[Figure: example of an RNN with a single output/target, where the output is emitted only in the last step]
Standard RNNs with sigmoid activations are particularly prone to the vanishing gradient problem (actually, this problem has been formulated/discussed for RNNs first): errors/deltas decline (or explode) quickly when back-propagating through time.
The consequence is that only short time lags between inputs and output signals can be learned correctly (up to about 10 time steps).
In order to overcome the vanishing gradient problem in RNNs, Hochreiter and Schmidhuber (1997) have introduced Long Short-Term Memory (LSTM) networks.
Apart from a standard input unit, an LSTM memory cell has three main components:
a self-recurrent internal state (the “constant error carousel”, which facilitates constant error flow and thereby avoids vanishing gradients),
an input gate that protects the memory content from irrelevant inputs,
an output gate that protects other units (i.e. connected units) from currently irrelevant memory contents.
[Figure: LSTM memory cell with input gate, output gate, and internal state s(t) = s(t−1) + ... (constant error carousel)]
Some benchmark records of 2014 achieved by LSTM:
Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
English to French translation (Sutskever et al., Google, NIPS 2014)
Audio onset detection (Marchi et al., ICASSP 2014)
Social signal classification (Brueckner & Schulter, ICASSP 2014)
Arabic handwriting recognition (Bluche et al., DAS 2014)
Image caption generation (Vinyals et al., Google, 2014)
Video to textual description (Donahue et al., 2014)
LSTM @ Google: Neural Machine Translation System (NMT); Google Voice Transcription (Android speech recognizer)
LSTM @ Microsoft: photo-real talking head with deep bidirectional LSTM; spoken language understanding using LSTM; text-to-speech synthesis with bidirectional LSTM-based RNN
LSTM @ facebook: text analysis
LSTM @ Apple: Siri
CAFFE: by Berkeley Vision and Learning Center; interfaces for C++, command line, Python, and MATLAB
MXNet: by the Distributed (Deep) Machine Learning Community; interfaces for C++, Python, Julia, Matlab, JavaScript, Go, R, and Scala
TensorFlow: by Google Brain; Python interface
Theano: by Université de Montréal; Python interface
(Py)Torch: by R. Collobert, K. Kavukcuoglu, and C. Farabet; based on the Lua programming language; interfaces for Lua, C, and Python
All of these frameworks support running code on GPUs (via CUDA); besides fully connected networks, all feature CNNs and RNNs. Some of them are quite low-level, while additional light-weight interfaces are available (e.g. Keras, LASAGNE).
Without any doubt, deep networks are the most powerful tools for audio and image recognition and other fields, also outperforming support vector machines.
Despite the practical successes, the theoretical foundations of why and under which conditions deep networks work are lagging far behind.
The spectrum of variants is hard to survey, and the choice of good parameters is both crucial and tricky.
Learning good representations of complex data, such as high-res images, requires excessive amounts of training data and excessive computational power (supercomputers, GPUs).