SUPERVISED MACHINE LEARNING
An Introduction — With Special Emphasis On Deep Learning
Dr. Ulrich Bodenhofer
Associate Professor, Institute of Bioinformatics, Johannes Kepler University, Altenberger Str. 69, A-4040 Linz
Fax +43 732 2468 4539, E-Mail bodenhofer@bioinf.jku.at, URL http://www.bioinf.jku.at/
Basics of machine learning: supervised vs. unsupervised, classification vs. regression
Overview of supervised machine learning: basic principles, k-nearest neighbor, linear regression, support vector machines, random forests
Overview of neural networks: basic idea and algorithms, deep learning, success stories
Slides, R code examples, and data sets can be found at the following URL:
http://www.bioinf.jku.at/people/bodenhofer/UMA_ML/
Finding solutions of a system of equations
Prediction of the trajectory of a space shuttle
Diagnosis whether a patient has a certain disease
Prediction of the outcome of an election
Recognition of handwritten characters
Identification of customer target groups
Prediction of the function of a protein from its amino acid sequence
Traditional disciplines like physics, chemistry, and biology usually aim at exact explicit models, i.e. at knowing how (and why) things work in a particular way; then a solution to a new problem can be found deductively using explicit knowledge.
That goal, however, is sometimes too difficult to achieve; reasons may be computational complexity, insufficient knowledge, insufficient information, etc.
Machine learning tries to elicit models/knowledge from previously observed data.
Putting it simply, machine learning is about learning from data (often called inductive learning).
[Table: toy training set with two numeric input features and a binary class label (+1/−1)]
[Figures: scatter plots of the two-dimensional toy data set]
[Table: toy training set with six numeric input features and a binary class label (+1/−1)]
Example borrowed from: R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Second edition. John Wiley & Sons, 2001. ISBN 0-471-05669-3.
Automated system to sort fish in a fish-packing company: salmon must be distinguished from sea bass optically.
Given: a set of pictures with known fish, the training set.
Goal: automatically distinguish between salmon and sea bass for future pictures.
[Figure: example pictures of salmon and sea bass]
Processing pipeline: camera image → preprocessing → feature extraction → classification → salmon / sea bass
Preprocessing: contrast and brightness correction, segmentation, alignment
Features: e.g. length and brightness of the fish (see the histograms below)
[Figures: histograms of the length and brightness features for salmon vs. sea bass]
[Figures: scatter plots of the fish samples in the two-dimensional feature space]
Does learning help in the future, i.e. does experience from previously observed examples help us to solve a future task?
What is a good model? How do we assess the quality of a model? Which methods are available?
In any case, machine learning is not (only) about describing previously observed data, but about generalizing to future data.
Projection methods: down-projection of the data to a lower-dimensional space in order to concentrate on the essence of the data
Clustering: grouping of similar data objects
Biclustering: simultaneous grouping of samples and features
Generative model: building a model that produces data that are distributed as the observed data
. . .
Classification: the target value is a class label
Regression: the target value is numerical
Supervised ML is sometimes called predictive modeling, because the goal is most often to predict the target value for future input values.
Reinforcement learning: learning by feedback from the environment in an online process
Feature extraction: computation of features from data prior to machine learning (e.g. signal and image processing)
Feature selection: selection of those features that are relevant/sufficient to solve a given learning task
Feature construction: construction of new features as part of the learning process
Model: the specific relationship/representation we are aiming at
Model class: the class of models in which we search for the model
Parameters: representations of concrete models inside the given model class
Model selection/training: the process of finding that model from the model class that fits/explains the observed data in the best way
Hyperparameters: parameters controlling the model complexity or the training procedure
Typical workflow: question/task + data → preprocessing → choose features → choose model class → train model → evaluate model → final model + answer; prior knowledge enters at every step. If the evaluation fails, the process returns to an earlier step (e.g. the choice of features or model class) and iterates.
For both supervised and unsupervised machine learning, we need the following basic ingredients:
Model class: the class of models in which we search for the model
Objective: a criterion/measure that determines what is a good model
Optimization algorithm: a method that tries to find model parameters such that the objective is optimized
The right choices of the above components depend on the characteristics of the given task.
Machine learning methods are able to solve some tasks for which explicit models will never exist.
Machine learning methods have become standard tools in a variety of disciplines (e.g. signal and image processing, bioinformatics).
Machine learning is not a universal remedy.
The quality of machine learning models depends on the quality and quantity of data.
What cannot be measured/observed can never be identified by machine learning.
Machine learning complements explicit/deductive models instead of replacing them.
Machine learning is often applied in a naive way.
Goal of supervised machine learning: to identify the relationship between inputs and targets/labels
Tumor type  Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  ...
A            8.83   15.25   12.59   12.91   13.21   16.59   ...
A            9.41   13.37   11.95   15.09   13.39    9.94   ...
A            8.75   14.41   12.11   15.63   13.69    7.83   ...
...           ...     ...     ...     ...     ...     ...   ...
A            8.92   13.85   12.23   11.61   13.03   10.77   ...
B            8.65   12.93   11.58    9.47    9.81   14.79   ...
B            8.43   16.13   10.88   10.97    9.72   12.51   ...
B            9.62   15.31   12.03   10.83   10.47   14.33   ...
...           ...     ...     ...     ...     ...     ...   ...
B            8.64   10.54   12.59    9.42   10.29   14.65   ...
Can we infer tumor types from gene expression values?
Which genes are most indicative?
The quality of a model can only be judged on the basis of its performance on future data.
So assume that future data are generated according to some joint distribution of inputs and targets, the joint density of which we denote as p(x, y).
The generalization error (or risk) is the expected error on future data for a given model.
Since we typically do not know the distribution p(x, y), we have to estimate the generalization performance by making use of already existing data. Two methods are common:
Test set/holdout method: the data set is split randomly into a training set and a test set; a predictor is trained on the former and evaluated on the latter.
Cross validation: the data set is split randomly into a certain number k of equally sized folds; k predictors are trained, each leaving out one fold as test set; the average performance on the k test folds is computed.
[Figure: cross validation; in each round (1., 2., ..., 5.), one fold is used for evaluation and the remaining folds are used for training]
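A minimal R sketch of k-fold cross validation, assuming a generic training/prediction pair; the names train_fun and predict_fun are hypothetical placeholders for any supervised method, not part of a specific package.

## k-fold cross validation sketch ('train_fun' and 'predict_fun' are hypothetical placeholders)
cross_validate <- function(X, y, k = 5, train_fun, predict_fun) {
  n <- nrow(X)
  folds <- sample(rep(seq_len(k), length.out = n))   # random assignment of samples to k folds
  errors <- numeric(k)
  for (i in seq_len(k)) {
    test_idx  <- which(folds == i)
    model     <- train_fun(X[-test_idx, , drop = FALSE], y[-test_idx])
    predicted <- predict_fun(model, X[test_idx, , drop = FALSE])
    errors[i] <- mean(predicted != y[test_idx])       # misclassification rate on held-out fold
  }
  mean(errors)                                        # average error over the k test folds
}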
For a given sample (x, y) and a classifier g(·), (x, y) is a
true positive (TP) if y = +1 and g(x) = +1,
true negative (TN) if y = −1 and g(x) = −1,
false positive (FP) if y = −1 and g(x) = +1,
false negative (FN) if y = +1 and g(x) = −1.
Given a data set, the confusion matrix is defined as follows:

                       predicted g(x) = +1    predicted g(x) = −1
  actual y = +1               #TP                     #FN
  actual y = −1               #FP                     #TN

In this table, the entries #TP, #FP, #FN, and #TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively, for the given test data set.
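In R, such a confusion matrix can be obtained with table(); the two label vectors below are made up purely for illustration.

## Toy example: confusion matrix via table() (made-up label vectors)
y_true <- factor(c(1, 1, -1, 1, -1, -1, 1, -1), levels = c(1, -1))
y_pred <- factor(c(1, -1, -1, 1, 1, -1, 1, -1), levels = c(1, -1))
table(actual = y_true, predicted = y_pred)
##        predicted
## actual   1  -1
##     1    3   1     <- #TP = 3, #FN = 1
##     -1   1   3     <- #FP = 1, #TN = 3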
Accuracy: proportion of correctly classified items, i.e. ACC = (#TP + #TN) / (#TP + #FN + #FP + #TN).
True Positive Rate (aka recall/sensitivity): proportion of correctly identified positives, i.e. TPR = #TP / (#TP + #FN).
False Positive Rate: proportion of negative examples that were incorrectly classified as positives, i.e. FPR = #FP / (#FP + #TN).
Precision: proportion of predicted positive examples that were correct, i.e. PREC = #TP / (#TP + #FP).
True Negative Rate (aka specificity): proportion of correctly identified negatives, i.e. TNR = #TN / (#FP + #TN).
False Negative Rate: proportion of positive examples that were incorrectly classified as negatives, i.e. FNR = #FN / (#TP + #FN).
Balanced Accuracy: mean of true positive and true negative rate, i.e. BACC = (TPR + TNR) / 2.
Matthews Correlation Coefficient: measure of non-randomness of the confusion matrix, i.e. MCC = (#TP · #TN − #FP · #FN) / sqrt((#TP + #FP) · (#TP + #FN) · (#TN + #FP) · (#TN + #FN)).
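A short sketch computing the measures above from the four counts of a confusion matrix; the count values are hypothetical.

## Hypothetical counts from a confusion matrix
TP <- 40; FN <- 10; FP <- 5; TN <- 45

ACC  <- (TP + TN) / (TP + FN + FP + TN)   # accuracy
TPR  <- TP / (TP + FN)                    # true positive rate (recall/sensitivity)
FPR  <- FP / (FP + TN)                    # false positive rate
PREC <- TP / (TP + FP)                    # precision
TNR  <- TN / (FP + TN)                    # true negative rate (specificity)
FNR  <- FN / (TP + FN)                    # false negative rate
BACC <- (TPR + TNR) / 2                   # balanced accuracy
MCC  <- (TP * TN - FP * FN) /
        sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # Matthews correlation coefficient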
Underfitting: our model is too coarse to fit the data (neither training nor test data); this is usually the result of too restrictive model assumptions (i.e. too low model complexity).
Overfitting: our model works very well on training data, but generalizes poorly to future/test data; this is usually the result of too high model complexity.
The best generalization performance is obtained for the optimal choice of the complexity level. An estimate of the optimal choice can be determined by (cross) validation.
[Figure: training error and test error as functions of model complexity; the training error decreases with increasing complexity, while the test error first decreases (underfitting region) and then rises again (overfitting region)]
Suppose we have a labeled data set Z and a distance measure on the input space. Then the k-nearest neighbor classifier is defined as follows:
g_k-NN(x; Z) = class that occurs most often among the k samples that are closest to x
For k = 1, we simply call this the nearest neighbor classifier:
g_NN(x; Z) = class of the sample that is closest to x
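A k-nearest neighbor classifier is available in the R package class; the sketch below applies it to the iris data set with an arbitrary random train/test split and k = 5.

## k-NN on the iris data set using the 'class' package
library(class)
set.seed(42)
train_idx <- sample(nrow(iris), 100)            # arbitrary random training set
pred <- knn(train = iris[train_idx, 1:4],
            test  = iris[-train_idx, 1:4],
            cl    = iris$Species[train_idx],
            k     = 5)
mean(pred != iris$Species[-train_idx])          # misclassification rate on the test set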
[Figures: the (k-)nearest neighbor classifier applied to the two-dimensional toy data set for different values of k]
Consider a data set Z = {(xi, yi) | i = 1, ..., l} ⊆ R² and a linear model y = w0 + w1 · x = g(x; w0, w1).
Suppose we want to find (w0, w1) such that the average quadratic loss
Q(w0, w1) = (1/l) · Σ_{i=1}^{l} (yi − (w0 + w1 · xi))²
is minimized. Then the unique global solution is given as follows:
w1 = Cov(x, y) / Var(x),   w0 = ȳ − w1 · x̄
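The closed-form solution can be checked directly in R against lm(); the data below are made up for illustration.

## Simple linear regression: closed form vs. lm() on made-up data
set.seed(1)
x <- runif(50, 0, 6)
y <- 1.5 + 0.8 * x + rnorm(50, sd = 0.5)

w1 <- cov(x, y) / var(x)        # slope:     w1 = Cov(x, y) / Var(x)
w0 <- mean(y) - w1 * mean(x)    # intercept: w0 = ybar - w1 * xbar
c(w0, w1)
coef(lm(y ~ x))                 # same values up to numerical precision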
[Figures: one-dimensional example data set and the fitted regression line]
Consider a data set Z = {(xi, yi) | i = 1, ..., l} and a linear model y = w0 + w1 · x1 + · · · + wd · xd = (1 | x) · w = g(x; w).
Suppose we want to find w = (w0, w1, ..., wd)ᵀ such that the average quadratic loss is minimized. Then the unique global solution is given as
w = (X̃ᵀ · X̃)⁻¹ · X̃ᵀ · y = X̃⁺ · y,   where X̃ = (1 | X).
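A sketch of the matrix solution on random made-up data; X̃ is formed by prepending a column of ones to the input matrix.

## Multiple linear regression via the normal equations (made-up data)
set.seed(2)
X <- matrix(runif(200), ncol = 4)                 # 50 samples, 4 features
y <- drop(2 + X %*% c(1, -1, 0.5, 3)) + rnorm(50, sd = 0.1)
Xt <- cbind(1, X)                                 # X~ = (1 | X)
w  <- solve(t(Xt) %*% Xt, t(Xt) %*% y)            # w = (X~' X~)^(-1) X~' y
drop(w)
coef(lm(y ~ X))                                   # lm() yields the same coefficients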
[Figures: example data set with two input features and the fitted linear model]
Consider a data set Z = {(xi, yi) | i = 1, ..., l} and a polynomial model of degree n: y = w0 + w1 · x + w2 · x² + · · · + wn · xⁿ = g(x; w).
Suppose we want to find w = (w0, w1, ..., wn)ᵀ such that the average quadratic loss is minimized. Then the unique global solution is given as follows:
w = (X̃ᵀ · X̃)⁻¹ · X̃ᵀ · y = X̃⁺ · y,   with X̃ = (1 | x | x² | · · · | xⁿ)
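Polynomial regression can be carried out in R with lm() and poly(); varying the degree on made-up data illustrates the underfitting/overfitting trade-off discussed earlier (training error alone keeps shrinking as the degree grows).

## Polynomial regression of varying degree (made-up data)
set.seed(3)
x <- runif(40, 0, 6)
y <- sin(x) + rnorm(40, sd = 0.3)

fit_deg <- function(n) lm(y ~ poly(x, degree = n, raw = TRUE))
for (n in c(1, 3, 9)) {
  fit <- fit_deg(n)
  cat("degree", n, " training MSE:", mean(residuals(fit)^2), "\n")
}
## low degree: underfitting; very high degree: low training error but poor generalization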
[Figures: polynomial fits of increasing degree to a one-dimensional example data set]
Putting it simply, Support Vector Machines (SVMs) are based on the idea of maximizing the margin between positive and negative samples.
According to a theoretical result, maximizing the margin corresponds to minimizing an upper bound of the generalization error.
[Figures: linearly separable two-class data set with a separating hyperplane and the margin on either side]
The two classes are linearly separable if and only if their convex hulls are disjoint.
If the two classes are linearly separable, margin maximization can be achieved by making an orthogonal 50:50 split of the shortest distance connecting the convex hulls of the two classes.
The question remains how to solve margin maximization computationally: by quadratic optimization.
For a given training set {(xi, yi) | 1 ≤ i ≤ l}, a common support vector machine classifier is represented as the discriminant function
g(x) = b + Σ_{i=1}^{l} αi · yi · k(x, xi),
where b is a real value, the αi are non-negative factors, and k(·, ·) is the so-called kernel, a similarity measure for the inputs. The discriminant function only depends on those samples whose Lagrange multiplier αi is not 0. Those are called support vectors.
The following kernels are often used in practice:
Linear: k(x, y) = x · y
Polynomial: k(x, y) = (x · y + β)^α
Gaussian/RBF (Radial Basis Function): k(x, y) = exp(−‖x − y‖² / (2σ²))
Sigmoid: k(x, y) = tanh(α · x · y + β)
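Support vector machines with these kernels are available in the R package e1071 (an interface to LIBSVM); the sketch below uses an RBF kernel on the iris data set with arbitrarily chosen hyperparameters.

## SVM with RBF kernel using the 'e1071' package (hyperparameters chosen arbitrarily)
library(e1071)
set.seed(4)
train_idx <- sample(nrow(iris), 100)
model <- svm(Species ~ ., data = iris[train_idx, ],
             kernel = "radial", gamma = 0.5, cost = 1)   # LIBSVM's gamma corresponds to 1/(2*sigma^2)
pred <- predict(model, iris[-train_idx, ])
table(actual = iris$Species[-train_idx], predicted = pred)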
Support vector machines are intrinsically based on the idea of separating two classes, i.e. they are designed for binary classification problems.
All approaches introduced so far are based on breaking down the multi-class problem into several binary classification problems.
Suppose we have a classification problem with M classes.
One against the rest: M support vector machines are trained, where the i-th SVM is trained to distinguish between the i-th class and all other classes; a new sample is assigned to the class whose SVM has the highest discriminant function value.
Pairwise classification: M(M−1)/2 SVMs are trained, one for each pair of classes; a new sample is assigned to the class that received the most votes from the M(M−1)/2 SVMs. The latter is generally considered the better and more common approach.
All considerations so far have been based on vectorial data. Biological sequences cannot be cast to vectorial data easily, in particular, if they do not have fixed lengths. Support vector machines, by means of the kernels they employ, can handle any kind of data as long as a meaningful kernel (i.e. similarity measure) is available. In the following, we will consider kernels that can be used for biological sequences.
We consider kernels of the following kind:
k(x, y) = Σ_{m ∈ M} N(m, x) · N(m, y),
where M is a set of patterns and N(m, x) denotes the number of occurrences of pattern m in sequence x.
Spectrum Kernel: consider all possible K-length strings (exact matches).
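A minimal sketch of a spectrum kernel for K = 3 in base R: count all K-mers in each sequence and sum the products of matching counts. The two example sequences are made up.

## Spectrum kernel (K = 3): k(x, y) = sum over all K-mers m of N(m, x) * N(m, y)
kmer_counts <- function(s, K = 3) {
  starts <- seq_len(nchar(s) - K + 1)
  table(substring(s, starts, starts + K - 1))   # counts of all K-mers occurring in s
}
spectrum_kernel <- function(x, y, K = 3) {
  cx <- kmer_counts(x, K); cy <- kmer_counts(y, K)
  common <- intersect(names(cx), names(cy))     # only K-mers occurring in both sequences contribute
  sum(as.numeric(cx[common]) * as.numeric(cy[common]))
}
spectrum_kernel("GATTACAGATTACA", "CAGATTAGA")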
A decision tree is a classifier that classifies samples “by asking questions successively”; each non-leaf node corresponds to a question, each leaf corresponds to a final prediction.
Decision tree learning is concerned with partitioning the training data hierarchically such that the leaf nodes are hopefully homogeneous in terms of the target class.
Decision trees have mainly been designed for categorical data, but they can also be applied to numerical features.
Decision trees are traditionally used for classification (binary and multi-class), but regression is possible, too.
All decision tree learning algorithms are recursive, depth-first search algorithms that perform hierarchical splits. There are three main design issues:
How do we choose the best split at a given node?
When do we stop splitting?
Do we prune the grown trees?
The two latter are especially relevant for adjusting the complexity of decision trees (underfitting vs. overfitting).
[Figure: decision tree for the iris data set; first split on Petal.Length < 2.45 (setosa vs. the rest), then on Petal.Width < 1.75 (versicolor vs. virginica), together with the corresponding partition of the Petal.Length/Petal.Width plane]
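A tree of this kind can be trained in R with the rpart package; restricting the features to Petal.Length and Petal.Width typically reproduces the splits shown above.

## Decision tree on the iris data set using 'rpart'
library(rpart)
tree <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)
print(tree)                                     # splits: typically Petal.Length < 2.45, then Petal.Width < 1.75
pred <- predict(tree, iris, type = "class")
table(actual = iris$Species, predicted = pred)  # confusion matrix on the training data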
Use CART (Classification and Regression Trees) for training the single trees, i.e. binary splits with Gini impurity gain (for classification) / variance reduction (for regression) as splitting criterion.
For each tree, samples are chosen randomly from the training set (typically with replacement).
For each split, only a sub-sample of randomly chosen features is considered.
Trees are grown to full size and not pruned.
[Figure: decision regions of a random forest in the Petal.Length/Petal.Width plane of the iris data set]
Random forests allow for assessing the generalization performance on the basis of training data only.
For each sample, the error can be computed by considering only those trees whose randomly chosen training sub-sample did not include this sample.
Then the overall out-of-bag error can be computed by averaging the out-of-bag errors of all samples.
Mean Gini impurity decrease: for all features, average the Gini impurity gains of all splits in all trees that involve this feature.
Mean accuracy decrease: for each feature, permute its values randomly and compute the out-of-bag errors for the data set with the permuted feature; the importance is obtained from the differences before and after permuting the feature (upon normalization by the standard deviation of the differences).
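Both the out-of-bag error and the two importance measures are directly available in the randomForest package; the sketch below uses the iris data set.

## Random forest on iris with out-of-bag error and variable importance
library(randomForest)
set.seed(5)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf               # prints the out-of-bag error estimate and the confusion matrix
importance(rf)   # mean decrease in accuracy and mean decrease in Gini impurity per feature
varImpPlot(rf)   # importance plot, as in the figure below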
[Figure: variable importance for the iris data set; by both mean decrease in accuracy and mean decrease in Gini impurity, Petal.Length and Petal.Width are the most important features, followed by Sepal.Length and Sepal.Width]
The most powerful and most versatile “learning machine” is still the human brain.
Starting in the 1940s, ideas for creating “intelligent” systems by mimicking the function of nerve/brain cells have been developed.
An artificial neural network is a parallel processing system with small computing units (neurons) that work similarly to nerve/brain cells.
The inside of every neuron (nerve or brain cell) carries a certain electric charge.
The electric charge of connected neurons may raise or lower this charge (by means of transmission of ions through the synaptic interface).
As soon as the charge reaches a certain threshold, an electric impulse is transmitted through the cell's axon to the neighboring cells.
In the synaptic interfaces, chemicals called neurotransmitters control the strength to which an impulse is transmitted from one cell to the next.
[public domain; from Wikimedia Commons]
A perceptron is a simple linear threshold unit:
g(x; w, θ) = 1 if Σ_{j=1}^{d} wj · xj > θ, and 0 otherwise.
In analogy to the biological model, the inputs xj correspond to the charges received from connected cells through the dendrites, the weights wj correspond to the properties of the synaptic interface, and the output corresponds to the impulse that is sent through the axon as soon as the charge exceeds the threshold θ.
Though it seems to be a (simplistic) model of a neuron, a perceptron is nothing else but a simple linear classifier.
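A perceptron is simple enough to write down directly; the sketch below implements the threshold unit defined above, with hypothetical weights and threshold.

## Perceptron: linear threshold unit g(x; w, theta)
perceptron <- function(x, w, theta) {
  as.numeric(sum(w * x) > theta)    # 1 if the weighted sum exceeds the threshold, else 0
}
w <- c(0.4, -0.2, 0.7)                    # hypothetical weights
perceptron(c(1, 0, 1), w, theta = 0.5)    # -> 1
perceptron(c(0, 1, 0), w, theta = 0.5)    # -> 0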
[Figure: multi-layer perceptron with an input layer, a hidden layer, and an output layer]
Minsky and Papert conjectured in the late 1960s that training multi-layer perceptrons is infeasible. Because of this, the study of multi-layer perceptrons was almost halted until the mid-1980s.
In 1986, Rumelhart and McClelland first published the backpropagation algorithm and, thereby, proved Minsky and Papert wrong. It turned out later that the backpropagation algorithm had already been described by Werbos in 1974 in his dissertation. In a different context, the algorithm first appeared in the work of Bryson et al. in the 1960s.
There was a neural networks hype in the 1980s before they were superseded by support vector machines. In recent years, however, new techniques for training deep networks have brought a renaissance of neural networks.
Deep learning is a class/framework of strategies for training deep networks that aim to learn multiple levels of representations of the data and to allow for accurate predictions from these representations. Deep learning can be supervised or unsupervised.
First approaches to deep learning employed a two-step procedure:
Pre-training: levels of representations are learned layer by layer
Fine-tuning: a supervised learning algorithm is applied that makes predictions from the last layer of the pre-trained network
Unsupervised deep learning only consists of unsupervised pre-training and omits fine-tuning.
[Figures: layer-wise pre-training; hidden layers no. 1 through 6 are trained one after the other, each on top of the previously trained layers; in the final fine-tuning step, the mapping from the pre-trained representation to the output/target is trained in a supervised fashion]
Restricted Boltzmann machine (RBM): a simple stochastic neural network with an input layer and one hidden layer that are connected in both directions with symmetric weights; RBMs aim to learn a probability distribution over the inputs. The learning algorithm uses sampling of inputs and hidden activations along with gradient descent.
Autoencoders: a (denoising) autoencoder with one hidden layer is trained in each pre-training step. After training, the output layer of the autoencoder is discarded and only the hidden layer remains. In the subsequent step, another autoencoder is trained with the inputs being the activations of the hidden neurons of the previously trained autoencoder.
Supervised pre-training: a network with one hidden layer is trained in each pre-training step. After training, the output layer is discarded and only the hidden layer remains. In the subsequent step, another network is trained with the inputs being the activations of the hidden neurons of the previous network.
The success of a deep network is determined by how meaningful the representations in the hidden layers are. What is a meaningful representation?
Each hidden unit corresponds to a specific (hidden) pattern in the data.
Different hidden units correspond to different patterns, i.e. the patterns are disentangled.
Disentangling of representations can also be achieved by ensuring sparse activation, i.e. only a fraction of hidden neurons are activated for a given input.
Dropout: during training, activations are randomly set to 0 (e.g. with a probability of 0.5).
Rectified linear units (ReLU): instead of a sigmoid activation function, a function is used that gives 0 below a certain threshold. The most common choice is ϕ(x) = max(0, x).
These approaches even allow for training a deep network directly without pre-training.
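Both ideas are easy to state numerically; the sketch below applies ReLU and a dropout mask to a hypothetical vector of hidden pre-activations (the rescaling normally applied at test time is omitted here).

## ReLU and dropout applied to a hypothetical vector of hidden pre-activations
relu <- function(x) pmax(0, x)             # phi(x) = max(0, x)

set.seed(6)
pre_act <- rnorm(10)                       # made-up pre-activations
act <- relu(pre_act)                       # sparse activation: negative values become 0

drop_mask <- rbinom(length(act), 1, 0.5)   # dropout with probability 0.5
act_dropped <- act * drop_mask             # randomly set activations to 0 (during training only)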
[Figure: plot of the ReLU activation function ϕ(x) = max(0, x)]
In principle, classical feed-forward neural networks (fully connected networks) could be used for image analysis by simply connecting the pixels to input units.
However, if at all, this only makes sense for small aligned images (e.g. in character recognition).
For the analysis of larger and more complex images, this standard architecture is not useful. Instead, it is common to have stacked layers of units that operate on small overlapping patches/windows. Such networks are called convolutional neural networks (CNNs).
The first convolutional layer usually consists of multiple units that operate on small image patches (3×3, 5×5, or 7×7). Each unit corresponds to one simple feature of a patch. Such units are often called filters.
The activations of all units are computed for all patches, thereby creating a feature map of the image.
Convolutional layers can be stacked. It can be useful to down-sample feature maps by local max pooling (e.g. with non-overlapping 2×2 windows).
Such networks can either have fully connected layers on top (e.g. for image classification) or can also be fully convolutional (output is an image; e.g. for segmentation of detected objects).
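A base-R sketch of a single 3×3 convolution followed by 2×2 max pooling, matching the dimensions in the figures below (10×10 input, 8×8 feature map, 4×4 pooled map); the input image and the filter values are made up.

## 3x3 convolution ("valid", no padding) and 2x2 max pooling on a 10x10 input
set.seed(7)
img  <- matrix(runif(100), nrow = 10)          # made-up 10x10 "image"
filt <- matrix(c(-1, -1, -1,
                  0,  0,  0,
                  1,  1,  1), nrow = 3, byrow = TRUE)   # hypothetical 3x3 edge-like filter

conv2d_valid <- function(x, f) {
  out <- matrix(0, nrow(x) - nrow(f) + 1, ncol(x) - ncol(f) + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(x[i:(i + nrow(f) - 1), j:(j + ncol(f) - 1)] * f)
  out
}

max_pool2 <- function(x) {                     # non-overlapping 2x2 max pooling
  idx <- seq(1, nrow(x), by = 2)               # assumes a square map with even side length
  out <- matrix(0, length(idx), length(idx))
  for (i in seq_along(idx))
    for (j in seq_along(idx))
      out[i, j] <- max(x[idx[i]:(idx[i] + 1), idx[j]:(idx[j] + 1)])
  out
}

feat   <- conv2d_valid(img, filt)              # 8x8 feature map
pooled <- max_pool2(feat)                      # 4x4 down-sampled feature map
dim(feat); dim(pooled)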
[Figures: a 3×3 convolution/filter applied to a 10×10 input image yields an 8×8 feature map; several filters yield several feature maps; a second 3×3 convolutional layer on top produces 6×6 feature maps; 2×2 max pooling down-samples an 8×8 feature map to 4×4]
Fully connected layers are trained as usual.
In convolutional layers, each feature map/filter has only one set of weights that is shared across all patches; this is called weight sharing.
In max pooling layers, the error signal is only propagated to the input from which the maximal activation came.
[Figure: two layers of a convolutional network; hypothetical inputs maximizing activation and real images that lead to a high activation of the considered neuron]
Computational challenge set up by the US agencies NIH, EPA, and FDA
Unprecedented multi-million-dollar effort
12,000 compounds tested experimentally for twelve different toxic effects
Goal: predict toxicity computationally
Input features:
40,000 very sparse features: Extended Connectivity Fingerprint (ECFP4) presence counts of chemical sub-structures
5,057 additional features: 2,500 toxicophore features, 200 common chemical scaffolds, various chemical descriptors
Deep learning-based solution by JKU's Institute of Bioinformatics won the grand challenge, both panels (nuclear receptor panel and stress response panel), and six single prediction tasks.
The hierarchical representation of deep networks allowed for the identification of novel toxicophores.
Although the foundations of deep learning were laid 15–20 years ago, a major hype emerged only recently in the machine learning community.
Deep networks have won numerous competitions in music, speech and image recognition, drug discovery, and other fields.
Deep learning has been called “. . . the biggest data science breakthrough of the decade” (J. Howard).
The New York Times covered the subject twice with front-page articles in 2012.
Major companies, such as Google, Microsoft, Apple, and facebook, use deep networks in their products and services.
Google has acquired companies specialized in deep learning: DNNresearch (founded by G. Hinton, U. Toronto; March 2013; price not revealed) and Deepmind (London-based company founded by D. Hassabis; January 2014; price approx. $400–650m).
Feedforward neural networks require vectorial inputs. Therefore, they cannot be applied to time series or sequences directly.
One option is to apply them to (sliding) windows. The obvious disadvantage of this simple approach is that windows are treated independently and no learning across windows can take place.
[Figure: feedforward network applied to sliding windows of an input sequence, producing an output sequence]
Recurrent neural networks (RNNs) provide an alternative, where “recurrent” means that the network has connection cycles. There are several different RNN architectures.
After each evaluation (for one window, in time step t), the activations are kept and potentially used as inputs in time step t+1; so the generalization of the forward pass is straightforward for RNNs.
The backpropagation algorithm can also be generalized to RNNs; this is typically called backpropagation through time.
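A minimal sketch of the recurrent forward pass for a plain (Elman-type) RNN: the hidden activations of step t−1 are fed back as additional inputs at step t. All weights and the input sequence are made up, and this is not the LSTM architecture discussed later.

## Plain RNN forward pass: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
set.seed(8)
n_in <- 3; n_hidden <- 4; T_len <- 6
W_x <- matrix(rnorm(n_hidden * n_in), n_hidden, n_in)           # input-to-hidden weights
W_h <- matrix(rnorm(n_hidden * n_hidden), n_hidden, n_hidden)   # recurrent hidden-to-hidden weights
b   <- rnorm(n_hidden)
X   <- matrix(rnorm(T_len * n_in), T_len, n_in)                 # made-up input sequence (one row per time step)

h <- rep(0, n_hidden)                                           # initial hidden state
H <- matrix(0, T_len, n_hidden)
for (t in seq_len(T_len)) {
  h <- tanh(W_x %*% X[t, ] + W_h %*% h + b)                     # activations carried over to step t+1
  H[t, ] <- h
}
H   # hidden activations for each time step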
[Figure: example of an RNN mapping an input sequence to an output sequence]
[Figure: example of an RNN with a single output/target, where the output is emitted only in the last step]
Standard RNNs with sigmoid activations are particularly prone to the vanishing gradient problem (actually, this problem has been formulated/discussed for RNNs first): errors/deltas decline (or explode) quickly when back-propagating through time.
The consequence is that only short time lags between inputs and output signals can be learned correctly (up to about 10 time steps).
In order to overcome the vanishing gradient problem in RNNs, Hochreiter and Schmidhuber (1997) have introduced Long Short-Term Memory (LSTM) networks.
Apart from a standard input unit, an LSTM memory cell has three main components:
a self-recurrent internal state (the “constant error carousel”, which facilitates constant error flow and thereby avoids vanishing gradients),
an input gate that protects the memory content from irrelevant inputs,
an output gate that protects other units (i.e. connected units) from currently irrelevant memory contents.
[Figure: LSTM memory cell with input gate, output gate, and internal state s(t) = s(t−1) + ... (constant error carousel)]
Some benchmark records of 2014 achieved by LSTM:
Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
English to French translation (Sutskever et al., Google, NIPS 2014)
Audio onset detection (Marchi et al., ICASSP 2014)
Social signal classification (Brueckner & Schulter, ICASSP 2014)
Arabic handwriting recognition (Bluche et al., DAS 2014)
Image caption generation (Vinyals et al., Google, 2014)
Video to textual description (Donahue et al., 2014)
LSTM @ Google: Neural Machine Translation System (NMT); Google Voice Transcription (Android speech recognizer)
LSTM @ Microsoft: photo-real talking head with deep bidirectional LSTM; spoken language understanding using LSTM; text-to-speech synthesis with bidirectional LSTM-based RNN
LSTM @ facebook: text analysis
LSTM @ Apple: Siri
CAFFE: by Berkeley Vision and Learning Center; interfaces for C++, command line, Python, and MATLAB
MXNet: by the Distributed (Deep) Machine Learning Community; interfaces for C++, Python, Julia, Matlab, JavaScript, Go, R, and Scala
TensorFlow: by Google Brain; Python interface
Theano: by Université de Montréal; Python interface
(Py)Torch: by R. Collobert, K. Kavukcuoglu, and C. Farabet; based on the Lua programming language; interfaces for Lua, C, and Python
All of these frameworks support running code on GPUs (via CUDA); besides fully connected networks, all feature CNNs and RNNs. Some of them are quite low-level, while additional light-weight interfaces are available (e.g. Keras, LASAGNE).
Without any doubt, deep networks are the most powerful tools for audio and image recognition and other fields, also outperforming support vector machines.
Despite the practical successes, the theoretical foundations of why and under which conditions deep networks work are lagging far behind.
The spectrum of variants is hard to survey, and the choice of good parameters is both crucial and tricky.
Learning good representations of complex data, such as high-res images, requires excessive amounts of training data and excessive computational power (supercomputers, GPUs).