SLIDE 1

SUPERVISED MACHINE LEARNING

An Introduction — With Special Emphasis On Deep Learning

SLIDE 2

2 Machine Learning / UMA, May 2018

CONTACT

  • Dr. Ulrich Bodenhofer
    Associate Professor, Institute of Bioinformatics, Johannes Kepler University, Altenberger Str. 69, A-4040 Linz
  • Tel. +43 732 2468 4526, Fax +43 732 2468 4539
    E-Mail: bodenhofer@bioinf.jku.at, URL: http://www.bioinf.jku.at/

SLIDE 3

OUTLINE

  • Basics of machine learning: supervised vs. unsupervised, classification vs. regression
  • Overview of supervised machine learning: basic principles, k-nearest neighbor, linear regression, support vector machines, random forests
  • Overview of neural networks: basic idea and algorithms, deep learning, success stories

SLIDE 4

COURSE MATERIAL

Slides and R code examples + data sets can be found at the following URL:

http://www.bioinf.jku.at/people/bodenhofer/UMA_ML/

SLIDE 5

BASICS OF MACHINE LEARNING

SLIDE 6

HOW TO SOLVE THESE TASKS?

  • Finding solutions of a system of equations
  • Prediction of the trajectory of a space shuttle
  • Diagnosis whether a patient has a certain disease
  • Prediction of the outcome of an election
  • Recognition of handwritten characters
  • Identification of customer target groups
  • Prediction of the function of a protein from its amino acid sequence

SLIDE 7

EXPLICIT MODELS

Traditional disciplines like physics, chemistry, and biology usually aim at exact explicit models, i.e. at knowing how (and why) things work in a particular way; a solution to a new problem can then be found deductively using explicit knowledge. That goal, however, is sometimes too difficult to achieve; reasons may be computational complexity, insufficient knowledge, insufficient information, etc.

SLIDE 8

MACHINE LEARNING = INDUCTIVE LEARNING

Machine learning tries to elicit models/knowledge from previously observed data, with the following two main goals:

  • 1. Getting insight
  • 2. Being able to predict future outcomes

Put simply, machine learning is about learning from data (often called inductive learning).

SLIDE 9

WHAT DO WE SEE HERE?

0.843475  0.709216   −1
0.408987  0.47037    +1
0.734759  0.645298   −1
0.972187  0.0802574  +1
0.90267   0.327633   −1
0.807075  0.872155   −1
0.240068  0.801159   −1
0.206602  0.562109   +1
0.581611  0.335561   +1
0.700995  0.517267   −1
0.209818  0.342484   +1
0.94141   0.928017   −1
0.148546  0.198177   +1
0.872544  0.50608    −1
0.371062  0.272064   +1
...       ...        ...

SLIDE 10

AND HERE?

[Scatter plot of the same data points in the unit square]

SLIDE 11

AND HERE?

[Scatter plot of the data]

SLIDE 12

AND HERE?

[Scatter plot of the data]

SLIDE 13

AND HERE?

[Scatter plot of the data]

SLIDE 14

AND HERE?

[Scatter plot of the data]

SLIDE 15

AND HERE?

0.99516    0.890813   0.933726   0.793397   0.826405   0.236946   −1
0.853206   0.611647   0.317486   0.633609   0.411492   0.985231   +1
0.387494   0.459847   0.815049   0.394526   0.678227   0.031886   −1
0.733515   0.640438   1.19068    0.639685   0.0793674  0.160503   +1
0.274817   0.261054   1.20056    0.689895   0.401913   0.277955   −1
0.329943   0.241299   0.848705   0.721673   0.973852   0.795238   −1
0.334784   0.350487   0.315131   0.928277   0.816343   0.558292   −1
0.481578   0.738839   0.0925513  0.294667   0.612725   0.573062   −1
0.0940846  0.278992   0.451819   0.900141   0.220497   0.541176   +1
0.421025   0.785714   0.449038   0.920612   0.420418   0.749187   −1
0.939446   0.0468747  0.15846    0.625944   0.198894   0.176125   +1
0.845362   0.767883   0.824993   0.725803   0.808218   0.63495    −1
0.484793   0.129329   0.0783719  0.465347   0.291457   0.254278   +1
0.399041   0.751829   0.763511   0.894785   0.47902    0.15156    −1
0.643232   0.615629   0.430261   0.0458972  0.446513   0.844081   +1
...        ...        ...        ...        ...        ...        ...

SLIDE 16

INTRODUCTORY EXAMPLE: FISH RECOGNITION

Example borrowed from

  • R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Second edition. John Wiley & Sons, 2001. ISBN 0-471-05669-3.

Automated system to sort fish in a fish-packing company: salmon must be distinguished from sea bass optically.
Given: a set of pictures with known fish, the training set.
Goal: automatically distinguish between salmon and sea bass in future pictures.


SLIDE 18

TWO SAMPLE IMAGES

Salmon: [image]   Sea bass: [image]

How can we distinguish these two kinds of fish visually?


SLIDE 20

BASIC WORKFLOW

Camera image → Preprocessing → Feature Extraction → Classification → Salmon / Sea Bass

Preprocessing: contrast and brightness correction, segmentation, alignment
Features:

  • 1. Length
  • 2. Brightness

SLIDE 22

USING ONE FEATURE

[Histograms of fish counts by length (left) and by brightness (right), for salmon vs. sea bass]

Questions:

  • 1. Which is the better feature?
  • 2. Where should we put the threshold?
SLIDE 23

USING TWO FEATURES: LINEAR SEPARATION

[Scatter plot of the two features with a linear decision boundary]

SLIDE 24

USING TWO FEATURES: HIGHLY NONLINEAR SEPARATION

[Scatter plot of the two features with a highly nonlinear decision boundary]

SLIDE 25

USING TWO FEATURES: MILDLY NONLINEAR SEPARATION

[Scatter plot of the two features with a mildly nonlinear decision boundary]

SLIDE 26

QUESTIONS

Does learning help in the future, i.e. does experience from previously observed examples help us to solve a future task?
What is a good model? How do we assess the quality of a model?
Which methods are available?
In any case, machine learning is not (only) about describing previously observed data, but about creating models that are applicable to future data.

SLIDE 28

SUPERVISED VS. UNSUPERVISED MACHINE LEARNING

Unsupervised ML: the goal is to identify structure in the data; there is no explicit target/label given.

[Unlabeled scatter plot]

Supervised ML: an explicit target/label value is given for each (input) data item; the goal is to identify the relationship between inputs and targets.

[Labeled scatter plot]
SLIDE 29

UNSUPERVISED MACHINE LEARNING

Projection methods: down-projection of data to a lower-dimensional space in order to concentrate on the essence of the data
Clustering: grouping of similar data objects
Biclustering: simultaneous grouping of samples and features
Generative model: building a model that produces data that are distributed like the observed data
. . .
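To make the clustering idea concrete, here is a minimal k-means sketch. The course materials use R; this Python version, its naive initialization, and all names in it are purely illustrative:

```python
def kmeans(points, k, iterations=20):
    """Minimal k-means sketch: alternate nearest-center assignment and
    mean update.  Centers are initialized naively with the first k points."""
    centers = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the closest current center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        for j, c in enumerate(clusters):
            if c:  # move the center to the mean of its assigned points
                centers[j] = tuple(sum(v) / len(c) for v in zip(*c))
    return centers

# two well-separated groups of 2-D points
pts = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.0),
       (10.1, 9.9), (9.8, 10.2), (10.0, 10.0)]
centers = sorted(kmeans(pts, 2))
```

With well-separated groups, the two centers converge to the two group means.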

SLIDE 30

SUPERVISED MACHINE LEARNING

Classification: the target value is a class label
Regression: the target value is numerical
Supervised ML is sometimes called predictive modeling, since the goal is most often to predict the target value for future input values.

SLIDE 31

MISCELLANEOUS TOPICS

Reinforcement learning: learning by feedback from the environment in an online process
Feature extraction: computation of features from data prior to machine learning (e.g. signal and image processing)
Feature selection: selection of those features that are relevant/sufficient to solve a given learning task
Feature construction: construction of new features as part of the learning process

SLIDE 32

TERMINOLOGY

Model: the specific relationship/representation we are aiming at
Model class: the class of models in which we search for the model
Parameters: representations of concrete models inside the given model class
Model selection/training: the process of finding the model from the model class that fits/explains the observed data best
Hyperparameters: parameters controlling the model complexity or the training procedure

SLIDE 33

BASIC DATA ANALYSIS WORKFLOW

Question/Task + Data → Preprocessing → Choose Features → Choose Model Class → Train Model → Evaluate Model → Final Model + Answer (with Prior Knowledge informing every step)

SLIDE 34

BASIC DATA ANALYSIS WORKFLOW

Question/Task + Data → Preprocessing → Choose Features → Choose Model Class → Train Model → Evaluate Model, with two possible outcomes: Final Model + Answer, or FAIL (back to an earlier step); Prior Knowledge informs every step

SLIDE 35

BASIC INGREDIENTS OF MODEL SELECTION/TRAINING

For both supervised and unsupervised machine learning, we need the following basic ingredients:
Model class: the class of models in which we search for the model
Objective: a criterion/measure that determines what a good model is
Optimization algorithm: a method that tries to find model parameters such that the objective is optimized
The right choices of the above components depend on the characteristics of the given task.

SLIDE 36

SOME WORDS OF ENTHUSIASM

Machine learning methods are able to solve some tasks for which explicit models will never exist. Machine learning methods have become standard tools in a variety of disciplines (e.g. signal and image processing, bioinformatics).

SLIDE 37

BUT . . . SOME WORDS OF CAUTION

Machine learning is not a universal remedy.
The quality of machine learning models depends on the quality and quantity of the data.
What cannot be measured/observed can never be identified by machine learning.
Machine learning complements explicit/deductive models instead of replacing them.
Machine learning is often applied in a naive way.

SLIDE 38

OVERVIEW OF SUPERVISED ML

SLIDE 39

SUPERVISED MACHINE LEARNING

Goal of supervised machine learning: to identify the relationship between inputs and targets/labels



SLIDE 42

EXAMPLE: PREDICTING TUMOR TYPES FROM GENE EXPRESSION

Tumor type   Gene 1   Gene 2   Gene 3   Gene 4   Gene 5   Gene 6   . . .
A             8.83    15.25    12.59    12.91    13.21    16.59    . . .
A             9.41    13.37    11.95    15.09    13.39     9.94    . . .
A             8.75    14.41    12.11    15.63    13.69     7.83    . . .
. . .
A             8.92    13.85    12.23    11.61    13.03    10.77    . . .
B             8.65    12.93    11.58     9.47     9.81    14.79    . . .
B             8.43    16.13    10.88    10.97     9.72    12.51    . . .
B             9.62    15.31    12.03    10.83    10.47    14.33    . . .
. . .
B             8.64    10.54    12.59     9.42    10.29    14.65    . . .


SLIDE 45

EXAMPLE: PREDICTING TUMOR TYPES FROM GENE EXPRESSION

Can we infer tumor types from gene expression values?
Which genes are most indicative?

SLIDE 46

HOW TO ASSESS GENERALIZATION PERFORMANCE?

The quality of a model can only be judged on the basis of its performance on future data. We therefore assume that future data are generated according to some joint distribution of inputs and targets, whose joint density we denote by p(x, y). The generalization error (or risk) is the expected error on future data for a given model.

SLIDE 47

ESTIMATING THE GENERALIZATION ERROR

Since we typically do not know the distribution p(x, y), we have to estimate the generalization performance by making use of already existing data. Two methods are common:
Test set/holdout method: the data set is split randomly into a training set and a test set; a predictor is trained on the former and evaluated on the latter.
Cross validation: the data set is split randomly into a certain number k of equally sized folds; k predictors are trained, each leaving out one fold as a test set; the average performance on the k test folds is computed.
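Both estimation schemes reduce to generating index splits. The following Python sketch (the course's examples are in R; these function names are illustrative, not from any library) shows the idea:

```python
def holdout_split(n, test_fraction=0.2):
    """Split sample indices 0..n-1 into a training and a test part.
    (A real split would shuffle the indices first; omitted for clarity.)"""
    n_test = int(n * test_fraction)
    return list(range(n_test, n)), list(range(n_test))

def kfold_indices(n, k):
    """Yield (training, test) index lists for k-fold cross validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

In k-fold cross validation, each fold serves exactly once as the test set, so every sample is used for evaluation exactly once.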

SLIDE 48

FIVE-FOLD CROSS VALIDATION

[Diagram: in round i (1. through 5.), fold i is held out for evaluation and the remaining four folds are used for training]

SLIDE 49

CONFUSION MATRIX FOR BINARY CLASSIFICATION

For a given sample (x, y) and a classifier g(·), (x, y) is a
true positive (TP) if y = +1 and g(x) = +1,
true negative (TN) if y = −1 and g(x) = −1,
false positive (FP) if y = −1 and g(x) = +1,
false negative (FN) if y = +1 and g(x) = −1.

SLIDE 50

CONFUSION MATRIX FOR BINARY CLASSIFICATION (cont'd)

Given a data set, the confusion matrix is defined as follows:

                        predicted value g(x)
                          +1       −1
actual value y    +1     #TP      #FN
                  −1     #FP      #TN

In this table, the entries #TP, #FP, #FN, and #TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively, for the given test data set.

SLIDE 51

EVALUATION MEASURES DERIVED FROM CONFUSION MATRICES

Accuracy: proportion of correctly classified items, i.e. ACC = (#TP + #TN) / (#TP + #FN + #FP + #TN).
True Positive Rate (aka recall/sensitivity): proportion of correctly identified positives, i.e. TPR = #TP / (#TP + #FN).
False Positive Rate: proportion of negative examples that were incorrectly classified as positives, i.e. FPR = #FP / (#FP + #TN).
Precision: proportion of predicted positive examples that were correct, i.e. PREC = #TP / (#TP + #FP).
True Negative Rate (aka specificity): proportion of correctly identified negatives, i.e. TNR = #TN / (#FP + #TN).
False Negative Rate: proportion of positive examples that were incorrectly classified as negatives, i.e. FNR = #FN / (#TP + #FN).
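The measures above translate directly into code. A small Python sketch (illustrative only; the course's examples are in R):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Evaluation measures derived from a binary confusion matrix,
    following the definitions on the slide."""
    return {
        "ACC":  (tp + tn) / (tp + fn + fp + tn),
        "TPR":  tp / (tp + fn),   # recall / sensitivity
        "FPR":  fp / (fp + tn),
        "PREC": tp / (tp + fp),
        "TNR":  tn / (fp + tn),   # specificity
        "FNR":  fn / (tp + fn),
    }

m = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
```

Note that TPR + FNR = 1 and TNR + FPR = 1 by construction.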

SLIDE 52

EVALUATION MEASURES FOR UNBALANCED DATA

Balanced Accuracy: mean of true positive and true negative rate, i.e.
BACC = (TPR + TNR) / 2
Matthews Correlation Coefficient: a measure of the non-randomness of a classification; defined as the normalized determinant of the confusion matrix, i.e.
MCC = (#TP · #TN − #FP · #FN) / √((#TP + #FP)(#TP + #FN)(#TN + #FP)(#TN + #FN))
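The two unbalanced-data measures in code, as an illustrative Python sketch (the course's examples are in R):

```python
import math

def bacc_mcc(tp, fp, fn, tn):
    """Balanced accuracy and Matthews correlation coefficient from the
    confusion counts, following the formulas on the slide."""
    tpr = tp / (tp + fn)
    tnr = tn / (fp + tn)
    bacc = (tpr + tnr) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return bacc, mcc

# a perfect classifier has BACC = 1 and MCC = 1
bacc, mcc = bacc_mcc(tp=50, fp=0, fn=0, tn=50)
```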
SLIDE 53

UNDERFITTING AND OVERFITTING

Underfitting: our model is too coarse to fit the data (neither training nor test data); this is usually the result of too restrictive model assumptions (i.e. too low model complexity).
Overfitting: our model works very well on training data, but generalizes poorly to future/test data; this is usually the result of too high model complexity.
The best generalization performance is obtained for the optimal choice of the complexity level. An estimate of the optimal choice can be determined by (cross) validation.

SLIDE 54

UNDERFITTING AND OVERFITTING (cont'd)

[Plot of error vs. model complexity: the training error decreases with complexity, while the test error first decreases and then rises again]

SLIDE 56

UNDERFITTING AND OVERFITTING (cont'd)

[Same plot: the low-complexity region left of the test-error minimum is the underfitting regime, the high-complexity region right of it is the overfitting regime]
SLIDE 57

A BASIC CLASSIFIER: k-NEAREST NEIGHBOR

Suppose we have a labeled data set Z and a distance measure on the input space. Then the k-nearest neighbor classifier is defined as follows:
g_k-NN(x; Z) = the class that occurs most often among the k samples that are closest to x
For k = 1, we simply call this the nearest neighbor classifier:
g_NN(x; Z) = the class of the sample that is closest to x
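The k-NN rule fits in a few lines of code. A Python sketch using Euclidean distance (the course's examples are in R; all names here are illustrative):

```python
from collections import Counter

def knn_classify(x, data, k=1):
    """k-nearest neighbor classifier: return the class that occurs most
    often among the k training samples closest to x (Euclidean distance)."""
    neighbors = sorted(
        data, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# tiny labeled data set: points near the origin are -1, points near (1, 1) are +1
Z = [((0.1, 0.1), -1), ((0.0, 0.2), -1), ((0.2, 0.0), -1),
     ((0.9, 1.0), +1), ((1.0, 0.8), +1), ((0.8, 0.9), +1)]
```

Note that k acts as the complexity hyperparameter: small k gives jagged decision boundaries, large k smooths them out.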

SLIDE 61

k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #1

[Scatter plot of the data with decision regions for k = 1, k = 5, and k = 13]

SLIDE 66

k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2

[Scatter plot of the data with decision regions for k = 1, k = 5, k = 13, and k = 25]

SLIDE 67

A BASIC NUMERICAL PREDICTOR: 1D LINEAR REGRESSION

Consider a data set Z = {(xᵢ, yᵢ) | i = 1, . . . , l} ⊆ R² and a linear model

y = w₀ + w₁ · x = g(x; (w₀, w₁)).

Suppose we want to find (w₀, w₁) such that the average quadratic loss,

Q(w₀, w₁) = (1/l) · Σᵢ₌₁ˡ (w₀ + w₁ · xᵢ − yᵢ)² = (1/l) · Σᵢ₌₁ˡ (g(xᵢ; w) − yᵢ)²,

is minimized. Then the unique global solution is given as follows:

w₁ = Cov(x, y) / Var(x),   w₀ = ȳ − w₁ · x̄
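The closed-form solution above can be implemented directly. An illustrative Python sketch (the course's examples are in R):

```python
def linreg_1d(xs, ys):
    """Closed-form 1D least squares fit:
    w1 = Cov(x, y) / Var(x),  w0 = y_bar - w1 * x_bar."""
    l = len(xs)
    x_bar = sum(xs) / l
    y_bar = sum(ys) / l
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / l
    var = sum((x - x_bar) ** 2 for x in xs) / l
    w1 = cov / var
    w0 = y_bar - w1 * x_bar
    return w0, w1

# data generated exactly by y = 1 + 2x, so the fit must recover (1, 2)
w0, w1 = linreg_1d([1, 2, 3, 4], [3, 5, 7, 9])
```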


SLIDE 69

LINEAR REGRESSION EXAMPLE #1

[Scatter plot of the data together with the fitted regression line]

SLIDE 70

LINEAR REGRESSION FOR MULTIPLE VARIABLES

Consider a data set Z = {(xᵢ, yᵢ) | i = 1, . . . , l} and a linear model

y = w₀ + w₁ · x₁ + · · · + w_d · x_d = (1 | x) · w = g(x; w),   with w = (w₀, w₁, . . . , w_d)ᵀ.

Suppose we want to find w such that the average quadratic loss is minimized. Then the unique global solution is given as

w = (X̃ᵀ · X̃)⁻¹ · X̃ᵀ · y = X̃⁺ · y,   where X̃ = (1 | X).
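The normal equations can be solved with any linear algebra library; to keep the sketch self-contained, the following illustrative Python version includes a tiny Gauss-Jordan solver (the course's examples are in R; none of these helper names come from a library):

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def linreg(X, y):
    """Least squares via the normal equations w = (X'X)^-1 X'y, X = (1|X)."""
    Xt = [[1.0] + list(row) for row in X]   # prepend the constant column
    XtT = transpose(Xt)
    A = matmul(XtT, Xt)
    b = [sum(r * yi for r, yi in zip(row, y)) for row in XtT]
    return solve(A, b)

# data generated exactly by y = 1 + 2*x1 - x2
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [1, 3, 0, 2, 2]
w = linreg(X, y)
```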


SLIDE 72

LINEAR REGRESSION EXAMPLE #2

[Plot of the data in two input variables together with the fitted regression plane]

SLIDE 73

POLYNOMIAL REGRESSION

Consider a data set Z = {(xᵢ, yᵢ) | i = 1, . . . , l} and a polynomial model of degree n

y = w₀ + w₁ · x + w₂ · x² + · · · + wₙ · xⁿ = g(x; w),   with w = (w₀, w₁, . . . , wₙ)ᵀ.

Suppose we want to find w such that the average quadratic loss is minimized. Then the unique global solution is given as follows:

w = (X̃ᵀ · X̃)⁻¹ · X̃ᵀ · y = X̃⁺ · y,   with X̃ = (1 | x | x² | · · · | xⁿ).
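Compared to the multi-variable case, the only thing that changes is the design matrix. A small illustrative helper (Python, not from any library) might build it as follows:

```python
def poly_design_matrix(xs, n):
    """Build the design matrix (1 | x | x^2 | ... | x^n) for polynomial
    regression of degree n; solving the normal equations with this matrix
    then works exactly as in the multi-variable linear case."""
    return [[x ** j for j in range(n + 1)] for x in xs]

rows = poly_design_matrix([1.0, 2.0], 3)
```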

SLIDE 80

POLYNOMIAL REGRESSION EXAMPLE

[Plots of the same data set with fitted polynomials of degree n = 1, 2, 3, 5, 25, and 75]

SLIDE 81

SUPPORT VECTOR MACHINES IN A NUTSHELL

Put simply, Support Vector Machines (SVMs) are based on the idea of finding a classification border that maximizes the margin between positive and negative samples.
According to a theoretical result, maximizing the margin corresponds to minimizing an upper bound on the generalization error.

SLIDE 82

MARGIN MAXIMIZATION

[Illustration: two linearly separable classes with the maximum-margin separating line and the margin marked on both sides]

SLIDE 83

MARGIN MAXIMIZATION (cont'd)

The two classes are linearly separable if and only if their convex hulls are disjoint. If the two classes are linearly separable, margin maximization can be achieved by making an orthogonal 50:50 split of the shortest line segment connecting the convex hulls of the two classes. The question remains how to solve margin maximization computationally: by quadratic optimization.

SLIDE 84

SVM DISCRIMINANT FUNCTION

For a given training set {(xᵢ, yᵢ) | 1 ≤ i ≤ l}, a common support vector machine classifier is represented as the discriminant function

g(x) = b + Σᵢ₌₁ˡ αᵢ · yᵢ · k(x, xᵢ),

where b is a real value, the αᵢ are non-negative factors, and k(·, ·) is the so-called kernel, a similarity measure for the inputs. The discriminant function only depends on those samples whose Lagrange multiplier αᵢ is not 0. Those are called support vectors.
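Once training has produced b and the multipliers, evaluating the discriminant is a single weighted sum over the support vectors. An illustrative Python sketch with made-up model values (training itself, i.e. the quadratic optimization, is not shown):

```python
def svm_discriminant(x, b, support, kernel):
    """Evaluate g(x) = b + sum of alpha_i * y_i * k(x, x_i) over the
    support vectors; `support` is a list of (alpha_i, y_i, x_i) triples."""
    return b + sum(a * y * kernel(x, xi) for a, y, xi in support)

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

# toy model with two support vectors (all numbers are made up for illustration)
support = [(1.0, +1, (1.0, 1.0)), (1.0, -1, (0.0, 0.0))]
g = svm_discriminant((2.0, 2.0), b=-1.0, support=support, kernel=linear_kernel)
label = +1 if g >= 0 else -1
```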

SLIDE 85

STANDARD KERNELS

The following kernels are often used in practice:

Linear: k(x, y) = x · y
Polynomial: k(x, y) = (x · y + β)^α
Gaussian/RBF (RBF = Radial Basis Function): k(x, y) = exp(−‖x − y‖² / (2σ²))
Sigmoid: k(x, y) = tanh(α · x · y + β)
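The four standard kernels in code, as an illustrative Python sketch (parameter defaults are arbitrary choices for the example, not canonical values):

```python
import math

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, alpha=2, beta=1.0):
    return (linear(x, y) + beta) ** alpha

def rbf(x, y, sigma2=0.5):
    # squared Euclidean distance, then Gaussian decay
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma2))

def sigmoid(x, y, alpha=1.0, beta=0.0):
    return math.tanh(alpha * linear(x, y) + beta)
```

Note the minus sign in the RBF kernel: identical inputs give the maximum value 1, and the value decays toward 0 as the distance grows.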

SLIDE 86

EXAMPLE: LINEAR KERNEL / C = 1

SLIDE 87

EXAMPLE: LINEAR KERNEL / C = 1000

SLIDE 88

EXAMPLE: RBF KERNEL / C = 1 / σ² = 0.5

SLIDE 89

EXAMPLE: RBF KERNEL / C = 10 / σ² = 0.05

SLIDE 90

EXAMPLE: RBF KERNEL / C = 1000 / σ² = 0.005

SLIDE 91

EXAMPLE: RBF KERNEL / C = 10 / σ² = 0.05

SLIDE 92

SVMs FOR MULTI-CLASS PROBLEMS

Support vector machines are intrinsically based on the idea of separating two classes by maximizing the margin between them. So there is no obvious way to extend them to multi-class problems.
All approaches introduced so far are based on breaking the multi-class problem down into several binary classification problems.

SLIDE 93

SVMs FOR MULTI-CLASS PROBLEMS (cont'd)

Suppose we have a classification problem with M classes.
One against the rest: M support vector machines are trained, where the i-th SVM is trained to distinguish between the i-th class and all other classes; a new sample is assigned to the class whose SVM has the highest discriminant function value.
Pairwise classification: M(M−1)/2 SVMs are trained, one for each pair of classes; a new sample is assigned to the class that received the most votes from the M(M−1)/2 SVMs. This is the better and more common approach.

SLIDE 94

SEQUENCE CLASSIFICATION USING SVMs

All considerations so far have been based on vectorial data. Biological sequences cannot easily be cast to vectorial data, in particular if they do not have fixed lengths. Support vector machines, by means of the kernels they employ, can handle any kind of data as long as a meaningful kernel (i.e. a similarity measure) is available. In the following, we will consider kernels that can be used for biological sequences.

SLIDE 95

SEQUENCE KERNELS

We consider kernels of the following kind:

k(x, y) = Σ_{m ∈ M} N(m, x) · N(m, y),

where M is a set of patterns and N(m, x) denotes the number of occurrences/matches of pattern m in string x.

Spectrum kernel: consider all possible K-length strings (exact matches).
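For the spectrum kernel, N(m, x) is simply a k-mer count, so the kernel value is the dot product of two k-mer count vectors. An illustrative Python sketch (the course's examples are in R):

```python
from collections import Counter

def kmer_counts(s, k):
    """Count all length-k substrings (k-mers) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(x, y, k):
    """k(x, y) = sum over all k-mers m of N(m, x) * N(m, y)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[m] for m, n in cx.items())

value = spectrum_kernel("GATTACA", "ATTAC", 2)
```

Summing only over the k-mers that actually occur in x is enough, because all other terms are zero.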

SLIDE 96

DECISION TREES: INTRODUCTION

A decision tree is a classifier that classifies samples “by asking questions successively”; each non-leaf node corresponds to a question, each leaf corresponds to a final prediction. Decision tree learning is concerned with partitioning the training data hierarchically such that the leaf nodes are hopefully homogeneous in terms of the target class. Decision trees have mainly been designed for categorical data, but they can also be applied to numerical features. Decision trees are traditionally used for classification (binary and multi-class), but regression is possible, too.

SLIDE 97

DECISION TREE LEARNING

All decision tree learning algorithms are recursive, depth-first search algorithms that perform hierarchical splits. There are three main design issues:

  • 1. Splitting criterion: which splits to choose?
  • 2. Stopping criterion: when to stop growing the tree further?
  • 3. Pruning: whether/how to collapse unnecessarily deep subtrees?

The latter two are especially relevant for adjusting the complexity of decision trees (underfitting vs. overfitting).

SLIDE 98

EXAMPLE: IRIS DATA SET

[Decision tree: if Petal.Length < 2.45, predict setosa; otherwise, if Petal.Width < 1.75, predict versicolor, else virginica. The corresponding partition of the Petal.Length/Petal.Width plane is shown with thresholds at 2.45 and 1.75]


SLIDE 102

RANDOM FORESTS: MOST COMMON VARIANT

  • Use CART (Classification and Regression Trees) for training the single trees, i.e. binary splits with Gini impurity gain (for classification) / variance reduction (for regression) as the splitting criterion.
  • For each tree, samples are chosen randomly from the training set (typically with replacement).
  • For each split, only a sub-sample of randomly chosen features is considered.
  • Trees are grown to full size and not pruned.
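The Gini impurity gain used as the CART splitting criterion is easy to compute. An illustrative Python sketch (the course's examples are in R):

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of p_c squared."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, left, right):
    """Impurity decrease of a binary split: parent impurity minus the
    size-weighted impurities of the two child nodes."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# a pure split of a perfectly mixed node removes all impurity
gain = gini_gain(["A", "A", "B", "B"], ["A", "A"], ["B", "B"])
```

Decision tree learning greedily picks, at each node, the split with the largest such gain.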

slide-103
SLIDE 103

73 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ■❘■❙ ❉❆❚❆ ❙❊❚ ✭✶✵✵✵ ❚❘❊❊❙✮

[Figure: decision regions of a random forest (1000 trees) for the iris data set in the Petal.Length/Petal.Width plane]

slide-104
SLIDE 104

74 Machine Learning / UMA, May 2018

❖❯❚✲❖❋✲❇❆● ❊❙❚■▼❆❚❊❙

Random forests allow for assessing the generalization performance on the basis of training data only. For each sample, the error can be computed by considering only those trees that have not used this sample in their training sub-sample. Then the overall out-of-bag error can be computed by averaging the out-of-bag errors of all samples.
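The bookkeeping behind out-of-bag estimates can be sketched as follows: given which samples each tree's bootstrap bag contained, a sample is evaluated only on the trees that never saw it (the bags below are made-up toy data):

```python
def oob_trees(bags, n_samples):
    """For each sample, list the trees whose bootstrap bag did not contain it."""
    return [
        [t for t, bag in enumerate(bags) if i not in bag]
        for i in range(n_samples)
    ]

# Three trees with bootstrap bags over samples 0..2:
bags = [{0, 1}, {1, 2}, {0, 2}]
print(oob_trees(bags, 3))  # [[1], [2], [0]]: sample 0 is out-of-bag for tree 1, etc.
```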

slide-105
SLIDE 105

75 Machine Learning / UMA, May 2018

▼❊❆❙❯❘■◆● ❱❆❘■❆❇▲❊ ■▼P❖❘❚❆◆❈❊

Mean Gini impurity decrease: for all features, average the Gini impurity gains of all splits in all trees that involve this feature.

Mean accuracy decrease:

  • 1. Compute the out-of-bag error for each sample.
  • 2. For each feature separately, consider random permutations and compute the out-of-bag errors for the data set with the permuted feature.
  • 3. The importance score is then computed by averaging the differences before and after permuting the feature (upon normalization by the standard deviation of the differences).
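The accuracy-decrease idea can be sketched on a toy example; the model below is a hypothetical classifier that only uses feature 0, so permuting feature 1 must yield zero importance (all names are illustrative, and a fixed permutation replaces the random one for reproducibility):

```python
def accuracy(model, X, y):
    """Fraction of samples the model predicts correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, perm):
    """Accuracy before minus accuracy after permuting one feature column."""
    X_perm = [list(x) for x in X]
    col = [X[j][feature] for j in perm]      # permuted column values
    for j, v in enumerate(col):
        X_perm[j][feature] = v
    return accuracy(model, X, y) - accuracy(model, X_perm, y)

model = lambda x: int(x[0] > 0.5)            # ignores feature 1 entirely
X = [[0.1, 9], [0.9, 8], [0.2, 7], [0.8, 6]]
y = [0, 1, 0, 1]
perm = [3, 2, 1, 0]                          # fixed "shuffle"
print(permutation_importance(model, X, y, 0, perm))  # 1.0: feature 0 matters
print(permutation_importance(model, X, y, 1, perm))  # 0.0: feature 1 is ignored
```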

slide-106
SLIDE 106

76 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ❱❆❘■❆❇▲❊ ■▼P❖❘❚❆◆❈❊❙ ❋❖❘ ■❘■❙ ❉❆❚❆ ❙❊❚ ✭✶✵✵✵ ❚❘❊❊❙✮

[Figure: variable importances for the iris data set (1000 trees): two bar charts, mean decrease in accuracy and mean decrease in Gini impurity; Petal.Length and Petal.Width rank highest in both, Sepal.Width lowest]

slide-107
SLIDE 107

❖❱❊❘❱■❊❲ ❖❋ ◆❊❯❘❆▲ ◆❊❚❲❖❘❑❙

slide-108
SLIDE 108

78 Machine Learning / UMA, May 2018

■◆❚❘❖❉❯❈❚■❖◆

The most powerful and most versatile “learning machine” is still the human brain. Starting in the 1940s, ideas for creating “intelligent” systems by mimicking the function of nerve/brain cells have been developed. An artificial neural network is a parallel processing system with small computing units (neurons) that work similarly to nerve/brain cells.

slide-109
SLIDE 109

79 Machine Learning / UMA, May 2018

◆❊❯❘❖P❍❨❙■❖▲❖●■❈❆▲ ❇❆❈❑●❘❖❯◆❉

The inside of every neuron (nerve or brain cell) carries a certain electric charge. The electric charge of connected neurons may raise or lower this charge (by means of transmission of ions through the synaptic interface). As soon as the charge reaches a certain threshold, an electric impulse is transmitted through the cell’s axon to the neighboring cells. In the synaptic interfaces, chemicals called neurotransmitters control the strength to which an impulse is transmitted from one cell to another.
slide-110
SLIDE 110

80 Machine Learning / UMA, May 2018

◆❊❯❘❖P❍❨❙■❖▲❖●■❈❆▲ ❇❆❈❑●❘❖❯◆❉✭❝♦♥t✬❞✮

[public domain; from Wikimedia Commons]

slide-111
SLIDE 111

81 Machine Learning / UMA, May 2018

P❊❘❈❊P❚❘❖◆❙

A perceptron is a simple linear threshold unit:

g(x; w, θ) = 1 if ∑_{j=1}^{d} w_j · x_j > θ, and 0 otherwise.   (1)

In analogy to the biological model, the inputs x_j correspond to the charges received from connected cells through the dendrites, the weights w_j correspond to the properties of the synaptic interface, and the output corresponds to the impulse that is sent through the axon as soon as the charge exceeds the threshold θ. Though it seems to be a (simplistic) model of a neuron, a perceptron is nothing else but a simple linear classifier.
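Equation (1) in a few lines of Python; with the illustrative choice w = (1, 1) and θ = 1.5 the unit computes a logical AND, underlining that it is just a linear classifier:

```python
def perceptron(x, w, theta):
    """Linear threshold unit: 1 if sum_j w_j * x_j > theta, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > theta else 0

w, theta = (1.0, 1.0), 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w, theta))   # only (1, 1) exceeds the threshold
```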

slide-112
SLIDE 112

82 Machine Learning / UMA, May 2018

▼❯▲❚■✲▲❆❨❊❘ P❊❘❈❊P❚❘❖◆❙

INPUT LAYER HIDDEN LAYER OUTPUT LAYER

slide-113
SLIDE 113

83 Machine Learning / UMA, May 2018

❙❖▼❊ ❍■❙❚❖❘■❈❆▲ ❘❊▼❆❘❑❙

Minsky and Papert conjectured in the late 1960s that training multi-layer perceptrons is infeasible. Because of this, the study of multi-layer perceptrons was almost halted until the mid-1980s. In 1986, Rumelhart and McClelland first published the backpropagation algorithm and, thereby, proved Minsky and Papert wrong. It turned out later that the backpropagation algorithm had already been described by Werbos in 1974 in his dissertation. In a different context, the algorithm first appeared in the work of Bryson et al. in the 1960s. There was a neural networks hype in the 1980s before they were superseded by support vector machines. In recent years, however, new techniques for training deep networks have brought a renaissance of neural networks.

slide-114
SLIDE 114

84 Machine Learning / UMA, May 2018

❉❊❊P ▲❊❆❘◆■◆●

Deep learning is a class/framework of strategies for training deep networks that aim to learn multiple levels of representations of the data and to allow for accurate predictions from these representations. Deep learning can be supervised or unsupervised. First approaches to deep learning employed a two-step procedure: Pre-training: levels of representations are learned layer by layer. Fine-tuning: a supervised learning algorithm is applied that makes predictions from the last layer of the pre-trained network. Unsupervised deep learning only consists of unsupervised pre-training and omits fine-tuning.

slide-121
SLIDE 121

85 Machine Learning / UMA, May 2018

P❘❊✲❚❘❆■◆■◆● ✫ ❋■◆❊✲❚❯◆■◆●

[Figure: layer-wise pre-training and fine-tuning: hidden layer no. 1 is trained on the input; hidden layers no. 2 through 6 are then each trained on the activations of the previously trained layer; finally, fine-tuning trains the network on input/output pairs]

slide-122
SLIDE 122

86 Machine Learning / UMA, May 2018

▼❊❚❍❖❉❙ ❋❖❘ P❘❊✲❚❘❆■◆■◆●

Restricted Boltzmann machine (RBM): A simple stochastic neural network with an input layer and one hidden layer that are connected in both directions with symmetric weights; RBMs aim to learn a probability distribution over the inputs. The learning algorithm uses sampling of inputs and hidden activations along with gradient descent.

Autoencoders: A (denoising) autoencoder with one hidden layer is trained in each pre-training step. After training, the output layer of the autoencoder is discarded and only the hidden layer remains. In the subsequent step, another autoencoder is trained with the inputs being the activations of the hidden neurons of the previously trained autoencoder.

Supervised pre-training: A network with one hidden layer is trained in each pre-training step. After training, the output layer is discarded and only the hidden layer remains. In the subsequent step, another network is trained with the inputs being the activations of the hidden neurons of the previous network.

slide-123
SLIDE 123

87 Machine Learning / UMA, May 2018

❍❖❲ ❚❖ ▲❊❆❘◆ ●❖❖❉ ❘❊P❘❊❙❊◆❚❆❚■❖◆❙❄

The success of a deep network is determined by how meaningful the representations in the hidden layers are. What is a meaningful representation?

Each hidden unit corresponds to a specific pattern (hidden) in the data. Different hidden units correspond to different patterns, i.e. the patterns are disentangled.

slide-124
SLIDE 124

88 Machine Learning / UMA, May 2018

❙P❆❘❙❊ ❘❊P❘❊❙❊◆❚❆❚■❖◆❙

Disentangling of representations can also be achieved by ensuring sparse activation, i.e. only a fraction of hidden neurons are activated for a given input. Dropout: during training, activations are randomly set to 0 (e.g. with a probability of 0.5). Rectified linear units (ReLU): instead of a sigmoid activation function, a function is used that gives 0 below a certain threshold; the most common choice is ϕ(x) = max(0, x). These approaches even allow for training a deep network directly without pre-training.

[Figure: plot of the ReLU activation function ϕ(x) = max(0, x)]
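The two sparsity mechanisms in a few lines of Python (the dropout mask is fixed here instead of random, purely for illustration):

```python
def relu(x):
    """Rectified linear unit: 0 below the threshold 0, identity above."""
    return max(0.0, x)

def dropout(activations, mask):
    """Set dropped activations to 0; `mask` marks the units that are kept.
    (During real training the mask would be drawn at random, e.g. p = 0.5.)"""
    return [a if keep else 0.0 for a, keep in zip(activations, mask)]

acts = [relu(x) for x in [-2.0, -0.5, 0.3, 1.7]]
print(acts)                          # [0.0, 0.0, 0.3, 1.7] -- already sparse
print(dropout(acts, [1, 0, 1, 0]))   # [0.0, 0.0, 0.3, 0.0]
```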

slide-125
SLIDE 125

89 Machine Learning / UMA, May 2018

❈❖◆❱❖▲❯❚■❖◆❆▲ ◆❊❯❘❆▲ ◆❊❚❲❖❘❑❙ ✭❈◆◆s✮✿ ▼❖❚■❱❆❚■❖◆

In principle, classical feed-forward neural networks (fully connected networks) could be used for image analysis by simply connecting the pixels to input units. However, if at all, this only makes sense for small aligned images (e.g. in character recognition). For the analysis of larger and more complex images, this standard architecture is not useful. Instead, it is common to have stacked layers of units that operate on small overlapping patches/windows. Such networks are called convolutional neural networks (CNNs).

slide-126
SLIDE 126

90 Machine Learning / UMA, May 2018

❈◆◆s✿ ❆❘❈❍■❚❊❈❚❯❘❊

The first convolutional layer usually consists of multiple units that operate on small image patches (3×3, 5×5, or 7×7). Each unit corresponds to one simple feature of a patch. Such units are often called filters. The activations of all units are computed for all patches, thereby creating a feature map of the image. Convolutional layers can be stacked. It can be useful to down-sample feature maps by local max pooling (e.g. with non-overlapping 2×2 windows). Such networks can either have fully connected layers on top (e.g. for image classification) or can also be fully convolutional (output is an image; e.g. for segmentation of detected objects).
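The 10×10 → 8×8 arithmetic on the following slides follows from "valid" convolution (output size = input size − filter size + 1); a naive sketch of computing one feature map plus 2×2 max pooling:

```python
def conv2d(img, filt):
    """Naive 'valid' convolution of one square filter over all patches."""
    n, k = len(img), len(filt)
    out = n - k + 1                       # e.g. 10 - 3 + 1 = 8
    return [[sum(filt[a][b] * img[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(out)] for i in range(out)]

def max_pool(fmap, p=2):
    """Non-overlapping p x p max pooling."""
    n = len(fmap) // p
    return [[max(fmap[i * p + a][j * p + b]
                 for a in range(p) for b in range(p))
             for j in range(n)] for i in range(n)]

img = [[1.0] * 10 for _ in range(10)]     # constant 10x10 "image"
filt = [[1.0] * 3 for _ in range(3)]      # one 3x3 filter
fmap = conv2d(img, filt)
print(len(fmap), len(fmap[0]))            # 8 8 -> the 8x8 feature map
pooled = max_pool(fmap)
print(len(pooled), len(pooled[0]))        # 4 4 -> down-sampled feature map
```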

slide-127
SLIDE 127

91 Machine Learning / UMA, May 2018

❈◆◆s✿ ❆❘❈❍■❚❊❈❚❯❘❊ ✭❝♦♥t✬❞✮

[Figure: a 10×10 input image convolved with a 3×3 filter yields an 8×8 feature map]

slide-130
SLIDE 130

91 Machine Learning / UMA, May 2018

❈◆◆s✿ ❆❘❈❍■❚❊❈❚❯❘❊ ✭❝♦♥t✬❞✮

[Figure: build-up of a CNN: a 10×10 input image convolved with a 3×3 filter yields an 8×8 feature map; multiple filters yield multiple 8×8 feature maps; a second 3×3 convolutional layer on top yields 6×6 feature maps; 2×2 max pooling down-samples 8×8 feature maps to 4×4]

slide-131
SLIDE 131

92 Machine Learning / UMA, May 2018

❈◆◆s✿ ❚❘❆■◆■◆●

Fully connected layers are trained as usual. In convolutional layers, each feature map/filter has only one set of weights and all windows contribute weight updates. This is called weight sharing. In max pooling layers, the error signal is only propagated to the input from which the maximal activation came.

slide-132
SLIDE 132

93 Machine Learning / UMA, May 2018

❈❖◆❱❖▲❯❚■❖◆❆▲ ◆❊❚❲❖❘❑✿ ❊❳❆▼P▲❊ ❬▼✳❉✳ ❩❡✐❧❡r ✫ ❘✳ ❋❡r❣✉s❀ ❛r❳✐✈✱ ✷✵✶✸❪

Two layers of a convolutional network: hypothetical inputs maximizing activation and real images that lead to a high activation of the considered neuron

slide-133
SLIDE 133

94 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ❙❊▼❆◆❚■❈ ❙❊●▼❊◆❚❆❚■❖◆

slide-134
SLIDE 134

95 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ❚❖❳✷✶ ❈❍❆▲▲❊◆●❊ ✭✶✴✸✮

Computational challenge set up by the US agencies NIH, EPA, and FDA. Unprecedented multi-million-dollar effort. 12,000 compounds tested experimentally for twelve different toxic effects. Goal: predict toxicity computationally.

slide-135
SLIDE 135

96 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ❚❖❳✷✶ ❈❍❆▲▲❊◆●❊ ✭✷✴✸✮

Input features:
40,000 very sparse features: Extended Connectivity FingerPrint (ECFP4) presence counts of chemical sub-structures
5,057 additional features: 2,500 toxicophore features, 200 common chemical scaffolds, various chemical descriptors

slide-136
SLIDE 136

97 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ❚❖❳✷✶ ❈❍❆▲▲❊◆●❊ ✭✸✴✸✮

Deep learning-based solution by JKU’s Institute of Bioinformatics won the grand challenge, both panels (nuclear receptor panel and stress response panel), and six single prediction tasks. The hierarchical representation of deep networks allowed for the identification of novel toxicophores.

slide-137
SLIDE 137

98 Machine Learning / UMA, May 2018

❉❊❊P ▲❊❆❘◆■◆●✿ ❚❍❊ ❍❨P❊

Although the foundations of deep learning were laid 15–20 years ago, a major hype emerged only recently in the machine learning community. Deep networks have won numerous competitions in music, speech and image recognition, drug discovery, and other fields. Deep learning has been called “. . . the biggest data science breakthrough of the decade” (J. Howard). The New York Times covered the subject twice with front-page articles in 2012.

slide-138
SLIDE 138

99 Machine Learning / UMA, May 2018

❉❊❊P ▲❊❆❘◆■◆●✿ ❚❍❊ ❍❨P❊ ✭❝♦♥t✬❞✮

Major companies, such as Google, Microsoft, Apple, facebook, etc., have recently invested in deep learning and are using deep networks in their products and services. Google has acquired companies specialized in deep learning: DNNresearch (founded by G. Hinton, U. Toronto; March 2013; price not revealed) and Deepmind (London-based company founded by D. Hassabis; January 2014; price approx. $400–650m).

slide-139
SLIDE 139

100 Machine Learning / UMA, May 2018

❚■▼❊ ❙❊❘■❊❙✴❙❊◗❯❊◆❈❊ ❆◆❆▲❨❙■❙ ❲■❚❍ ◆❊❯❘❆▲ ◆❊❚❙❄

Feedforward neural networks require vectorial inputs. Therefore, they cannot be applied to time series or sequences directly. One option is to apply them to (sliding) windows. The obvious disadvantage of this simple approach is that windows are treated independently and no learning across windows can take place.


slide-141
SLIDE 141

101 Machine Learning / UMA, May 2018

❘❊❈❯❘❘❊◆❚ ◆❊❯❘❆▲ ◆❊❚❙ ✭❘◆◆s✮

Recurrent neural networks (RNNs) provide an alternative, where “recurrent” means that the network has connection cycles. There are several different RNN architectures. After each evaluation (for one window, in time step t), the activations are kept and potentially used as inputs in time step t+1; so the generalization of the forward pass is straightforward for RNNs. The backpropagation algorithm can also be generalized to RNNs; this is typically called backpropagation through time.
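The recurrent forward pass can be sketched with a single hidden unit: the hidden activation of step t is fed back as an extra input at step t+1 (the weights here are arbitrary illustrative numbers):

```python
import math

def rnn_forward(xs, w_in=1.0, w_rec=0.5):
    """Run a single-unit RNN over a sequence; each step reuses the
    previous hidden activation (h starts at 0)."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)   # recurrent connection
        hs.append(h)
    return hs

print(rnn_forward([0.0, 0.0, 0.0]))   # all-zero input keeps h at 0.0
print(rnn_forward([1.0, 1.0, 1.0]))   # h now depends on the whole history
```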

slide-142
SLIDE 142

102 Machine Learning / UMA, May 2018

❘❊❈❯❘❘❊◆❚ ◆❊❯❘❆▲ ◆❊❚❙ ✭❝♦♥t✬❞✮

Example of RNN with output sequence:

Input Sequence Output Sequence


Example of RNN with single output/target (output emitted only in last step):

Input Sequence Output / target

slide-144
SLIDE 144

103 Machine Learning / UMA, May 2018

❘◆◆s ❆◆❉ ❱❆◆■❙❍■◆● ●❘❆❉■❊◆❚❙

Standard RNNs with sigmoid activations are particularly prone to the vanishing gradient problem (actually, this problem was first formulated/discussed for RNNs): errors/deltas decline (or explode) quickly when back-propagating through time. The consequence is that only short time lags between inputs and output signals can be learned correctly (up to about 10 time steps).
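A back-of-the-envelope illustration of the decline: each backpropagation step through time multiplies the error signal by at most the sigmoid's maximal derivative, 1/4 (taking the recurrent weight as 1 for simplicity), so over T steps the signal shrinks at least like (1/4)^T:

```python
# The sigmoid derivative s(x) * (1 - s(x)) is maximal at x = 0, where it is 0.25.
max_sigmoid_deriv = 0.25

signal = 1.0
for t in range(10):               # back-propagate through 10 time steps
    signal *= max_sigmoid_deriv   # best case per step
print(signal)                     # ~9.5e-07: the error signal has vanished
```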

slide-145
SLIDE 145

104 Machine Learning / UMA, May 2018

▲❖◆● ❙❍❖❘❚✲❚❊❘▼ ▼❊▼❖❘❨ ✭▲❙❚▼✮

In order to overcome the vanishing gradient problem in RNNs, Hochreiter and Schmidhuber (1997) introduced Long Short-Term Memory (LSTM) networks. Apart from a standard input unit, an LSTM memory cell has three main components:

  • 1. A linear self-connected memory unit (the linear activation facilitates constant error flow and thereby avoids vanishing gradients)
  • 2. A multiplicative input gate to protect the memory cell from irrelevant inputs
  • 3. A multiplicative output gate to protect the outputs (or other connected units) from currently irrelevant memory contents
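A minimal sketch of how the three components interact, with a scalar cell and hand-picked gate values (in a real LSTM the gates would be sigmoid units computed from the inputs):

```python
def lstm_cell_step(s_prev, a_in, g_in, g_out):
    """One step of a simplified LSTM memory cell:
    s(t) = s(t-1) + g_in * a_in  (linear self-connection: constant error flow)
    output = g_out * s(t)        (output gate shields connected units)."""
    s = s_prev + g_in * a_in     # input gate shields the memory from a_in
    return s, g_out * s

# Closed input gate: the stored content survives an irrelevant input,
# and the open output gate exposes it:
s, out = lstm_cell_step(s_prev=3.0, a_in=99.0, g_in=0.0, g_out=1.0)
print(s, out)   # 3.0 3.0

# Open input gate, closed output gate: new content is stored but hidden:
s, out = lstm_cell_step(s_prev=3.0, a_in=2.0, g_in=1.0, g_out=0.0)
print(s, out)   # 5.0 0.0
```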

slide-146
SLIDE 146

105 Machine Learning / UMA, May 2018

❆❘❈❍■❚❊❈❚❯❘❊ ❖❋ ❆◆ ▲❙❚▼ ▼❊▼❖❘❨ ❈❊▲▲

[Figure: architecture of an LSTM memory cell: the cell input net_c (weight w_c) feeds a self-connected linear memory unit with state s(t) = s(t−1) + a_in(t) · a_c(t); a multiplicative input gate (net_in, a_in) and a multiplicative output gate (net_out, a_out) control what enters and leaves the cell]

slide-147
SLIDE 147

106 Machine Learning / UMA, May 2018

❊❳❆▼P▲❊✿ ■▼❆●❊ ❈❆P❚■❖◆❙ ✭❈◆◆s + ▲❙❚▼✮

slide-148
SLIDE 148

107 Machine Learning / UMA, May 2018

▲❙❚▼✿ ❚❍❊ ❈❯❘❘❊◆❚ ❍❨P❊

Some benchmark records of 2014 achieved by LSTM:

  • Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
  • Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
  • Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
  • Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
  • Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
  • English to French translation (Sutskever et al., Google, NIPS 2014)
  • Audio onset detection (Marchi et al., ICASSP 2014)
  • Social signal classification (Brueckner & Schulter, ICASSP 2014)
  • Arabic handwriting recognition (Bluche et al., DAS 2014)
  • Image caption generation (Vinyals et al., Google, 2014)
  • Video to textual description (Donahue et al., 2014)

slide-149
SLIDE 149

108 Machine Learning / UMA, May 2018

▲❙❚▼✿ ❚❍❊ ❈❯❘❘❊◆❚ ❍❨P❊ ✭❝♦♥t✬❞✮

LSTM @ Google: Neural Machine Translation System (NMT); Google Voice Transcription (Android speech recognizer)
LSTM @ Microsoft: photo-real talking head with deep bidirectional LSTM; spoken language understanding using LSTM; text-to-speech synthesis with bidirectional LSTM-based RNN
LSTM @ facebook: text analysis
LSTM @ Apple: Siri

slide-150
SLIDE 150

109 Machine Learning / UMA, May 2018

❖P❊◆✲❙❖❯❘❈❊ ❙❖❋❚❲❆❘❊ ❋❘❆▼❊❲❖❘❑❙ ✭❙❊▲❊❈❚■❖◆✮

CAFFE: by the Berkeley Vision and Learning Center; interfaces for C++, command line, Python, and MATLAB. MXNet: by the Distributed (Deep) Machine Learning Community; interfaces for C++, Python, Julia, MATLAB, JavaScript, Go, R, and Scala. TensorFlow: by Google Brain; Python interface. Theano: by Université de Montréal; Python interface. (Py)Torch: by R. Collobert, K. Kavukcuoglu, and C. Farabet; based on the Lua programming language; interfaces for Lua, C, and Python. All of these frameworks support running code on GPUs (via CUDA); besides fully connected networks, all feature CNNs and RNNs. Some of them are quite low-level, while additional light-weight interfaces are available (e.g. Keras, LASAGNE).

slide-151
SLIDE 151

110 Machine Learning / UMA, May 2018

❉❊❊P ▲❊❆❘◆■◆●✿ ❈❖▼▼❊◆❚❙

Without any doubt, deep networks are the most powerful tools for audio and image recognition and other fields, also outperforming support vector machines. Despite the practical successes, the theoretical foundations of why and under which conditions deep networks work are lagging far behind. The spectrum of variants is hard to survey, and the choice of good parameters is both crucial and tricky. Learning good representations of complex data, such as high-res images, requires excessive amounts of training data and excessive computational power (supercomputers, GPUs).