SLIDE 1

EVERYTHING YOU NEVER WANTED TO KNOW ABOUT MACHINE LEARNING, BUT WERE FORCED TO FIND OUT

Ivan Štajduhar

istajduh@riteh.hr SSIP 2019 27TH SUMMER SCHOOL ON IMAGE PROCESSING, TIMISOARA, ROMANIA July 10th 2019

SLIDE 2
SLIDE 3
SLIDE 4

INTRODUCTION AND MOTIVATION

EVERYTHING YOU NEVER WANTED TO KNOW ABOUT MACHINE LEARNING, BUT WERE FORCED TO FIND OUT

SLIDE 5
SLIDE 6

Challenges

SLIDE 7

Solution

  • Model-based techniques

– Manually tailored
– Variation and complexity in clinical data
– Limited by current insights into clinical conditions, diagnostic modelling and therapy
– Hard to establish analytical solutions

SLIDE 8

Hržić, Franko, et al. "Local-Entropy Based Approach for X-Ray Image Segmentation and Fracture Detection." Entropy 21.4 (2019): 338. (uniri-tehnic-18-15)

SLIDE 9

Machine learning

  • Model-based techniques

– Manually tailored
– Variation and complexity in clinical data
– Limited by current insights into clinical conditions, diagnostic modelling and therapy
– Hard to establish analytical solutions

  • An alternative: learning from data

– Minimising an objective function

SLIDE 10

PACS

[Diagram: PET, US, MRI and CT modalities feeding into a PACS]

SLIDE 11

Summary

  • Introduction and motivation
  • Representation, optimisation & stuff
  • Evaluation metrics & experimental setup
  • Improving model performance
SLIDE 12

REPRESENTATION, OPTIMISATION & STUFF

EVERYTHING YOU NEVER WANTED TO KNOW ABOUT MACHINE LEARNING, BUT WERE FORCED TO FIND OUT

SLIDE 13

Machine learning

  • Machine learning techniques mainly deal with representation, performance assessment and optimisation:

– The learning process is always preceded by the choice of a formal representation of the model. The set of possible models is called the hypothesis space.
– The learning algorithm uses a cost function to determine (evaluate) how successful a model is.
– Optimisation is the process of choosing the most successful models.

SLIDE 14

Hypothesis

  • Learning type: supervised (labelled data) vs unsupervised (unlabelled data)
  • Hypothesis type: regression (continuous outcome) vs classification (categorical outcome)

SLIDE 15

Hypothesis

[Diagram: Data → Learning algorithm → hypothesis h; h maps an observation (known variables, easily obtainable) to an outcome (prediction)]
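
To make the Data → Learning algorithm → h pipeline concrete, here is a minimal runnable sketch (my own illustration with scikit-learn, not part of the original deck; the dataset and estimator are arbitrary stand-ins):

```python
# Minimal sketch of the Data -> Learning algorithm -> h pipeline
# (illustrative; any estimator with fit/predict would do).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)              # observations and outcomes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

h = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)   # learning algorithm -> hypothesis h
print(h.predict(X_te[:5]))                               # h: observation -> predicted outcome
```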

SLIDE 16


SLIDE 17

[Diagram: feature extraction → predictor]

SLIDE 18

Hypothesis and parameter estimation

SLIDE 19

Hypothesis and parameter estimation
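
The equations on these two slides are images; as a stand-in, here is a small NumPy sketch of parameter estimation for a linear hypothesis by ordinary least squares (my example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # observations
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # outcomes with noise

# Closed-form least-squares estimate: w_hat = argmin_w ||Xw - y||^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                                  # close to w_true
```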

SLIDE 20

Regularisation

  • A way of reducing overfitting by ignoring non-informative features

  • What is overfitting?
  • Many types of regularisation

– Quadratic regulariser
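
As one illustration of a quadratic (L2) regulariser, ridge regression adds a penalty λ‖w‖² to the squared-error cost. The sketch below uses scikit-learn; the data and the penalty strength are my own assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=5.0, random_state=0)

# alpha is the strength of the quadratic penalty lambda * ||w||^2;
# larger alpha shrinks the weights of non-informative features toward zero.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)
```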

SLIDE 21

Multilayer perceptron (MLP)

  • An extension of the logistic-regression idea
SLIDE 22

Multilayer perceptron (MLP)

  • Parameters estimated through backpropagation algorithm

[Figure: forward pass propagates evidence, backward pass propagates error]

CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University http://cs231n.stanford.edu/2016/
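
For completeness, a tiny runnable MLP example (my sketch with scikit-learn; the deck itself uses equations and CS231n figures). Fitting internally runs backpropagation:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One hidden layer; weights are estimated by backpropagation of the error.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
print(mlp.score(X_te, y_te))
```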

SLIDE 23

Multilayer perceptron (MLP)

CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University http://cs231n.stanford.edu/2016/

[Figure: layer stack – convolutional layer → normalisation layer → activation function → fully-connected layer → normalisation layer → activation function]

SLIDE 24

Support vector machine (SVM)

  • Maximum margin classifier
SLIDE 25

Support vector machine (SVM)

  • Often used with kernels for dealing with linearly non-separable problems
  • Quadratic programming solver optimisation

SLIDE 26

Support vector machine (SVM)

SLIDE 27

Kernels

  • A possible way of dealing with linearly non-separable problems
  • A measure of similarity between data points
  • The kernel trick can implicitly lift the data into a higher-dimensional space
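
A hedged scikit-learn sketch of a kernel SVM on a linearly non-separable problem (the two-moons data and the RBF kernel are my illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# RBF kernel: k(x, x') = exp(-gamma * ||x - x'||^2). The kernel trick lets the
# maximum-margin QP operate in an implicit high-dimensional feature space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y), "support vectors per class:", clf.n_support_)
```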
SLIDE 28

Tree models

  • An alternative form of mapping
  • Partition the input space into cuboid regions

– Can be easily interpretable

[Figure: decision tree splitting on Sky (cloudy/sunny/rainy) and Temperature (cold/moderate/hot), with yes/no leaves]

SLIDE 29

Tree models

  • An alternative form of mapping
  • Partition the input space into cuboid regions

– Can be easily interpretable

SLIDE 30

CART

  • CART hypothesis
  • Local optimisation (greedy) – recursive partitioning
  • Cost function
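
A small scikit-learn sketch of CART-style greedy recursive partitioning (my example; the criterion and depth are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Greedy local optimisation: at each node, pick the split that most
# reduces the cost function (Gini impurity here), then recurse.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # the cuboid partitioning is directly readable
```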
SLIDE 31

CART


SLIDE 32

Short recap

  • Representation, optimisation & stuff


SLIDE 33
SLIDE 34

EVALUATION METRICS & EXPERIMENTAL SETUP

EVERYTHING YOU NEVER WANTED TO KNOW ABOUT MACHINE LEARNING, BUT WERE FORCED TO FIND OUT

SLIDE 35

Evaluation metrics

  • The choice of an adequate model-evaluation metric depends on the modelling goal
  • Common metrics for common problems:

– Mean squared error (for regression)
– Classification accuracy
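
A quick hedged sketch computing both with scikit-learn (the numbers are invented for illustration):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: mean squared error
y_true_r = [3.0, -0.5, 2.0, 7.0]
y_pred_r = [2.5,  0.0, 2.0, 8.0]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))

# Classification: accuracy
y_true_c = [1, 0, 1, 1, 0]
y_pred_c = [1, 0, 0, 1, 0]
print("accuracy:", accuracy_score(y_true_c, y_pred_c))
```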

SLIDE 36

Evaluation metrics

  • Confusion matrix

– binary case
– multiple classes (K>2)

[Figure: confusion matrix (normalised), predicted outcome vs observed outcome]

                     Predicted positive      Predicted negative
Observed positive    True positive (TP)      False negative (FN)
Observed negative    False positive (FP)     True negative (TN)

  • outcome = class = label
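
A hedged scikit-learn sketch producing the matrix (the labels are invented):

```python
from sklearn.metrics import confusion_matrix

y_observed  = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows: observed outcome; columns: predicted outcome.
# With labels=[1, 0] the layout matches the table above: [[TP, FN], [FP, TN]].
print(confusion_matrix(y_observed, y_predicted, labels=[1, 0]))
```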
SLIDE 37

Evaluation metrics

  • Common metrics for class-imbalanced problems, or when misclassifications are not equally bad:

– Sensitivity (recall, true positive rate)
– Specificity
– F1 score

                     Predicted positive      Predicted negative
Observed positive    True positive (TP)      False negative (FN)
Observed negative    False positive (FP)     True negative (TN)
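
Stated as formulas (standard definitions, not verbatim from the slides): sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), F1 = 2·TP/(2·TP+FP+FN). A quick Python check with illustrative counts:

```python
# Illustrative counts, not from the slides.
TP, FN, FP, TN = 40, 10, 5, 45

sensitivity = TP / (TP + FN)        # recall, true positive rate
specificity = TN / (TN + FP)
f1 = 2 * TP / (2 * TP + FP + FN)    # harmonic mean of precision and recall
print(sensitivity, specificity, f1)
```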

SLIDE 38

Evaluation metrics

  • For class-imbalanced problems, or when misclassifications are not equally bad (probabilistic classification):

– Receiver operating characteristic (ROC) curve

[Figure: ROC curve – sensitivity vs 1 − specificity]

                     Predicted positive      Predicted negative
Observed positive    True positive (TP)      False negative (FN)
Observed negative    False positive (FP)     True negative (TN)

Fawcett, Tom. "An introduction to ROC analysis." Pattern Recognition Letters 27.8 (2006): 861-874.

SLIDE 39

Evaluation metrics

  • For highly-skewed class distributions (probabilistic classification):

– Precision-recall (PR) curve
– Area under the curve (AUC)

Davis, Jesse, and Mark Goadrich. "The relationship between Precision-Recall and ROC curves." Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.

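A hedged sketch of ROC analysis with scikit-learn (the imbalanced data and the classifier are my stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probabilistic classification

fpr, tpr, _ = roc_curve(y_te, scores)       # fpr = 1 - specificity
print("AUROC:", roc_auc_score(y_te, scores))
```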

SLIDE 40

Evaluation metrics

  • Uncertain labellings, e.g., censored survival data

  • Kaplan-Meier estimate of a survival function
  • Alternative evaluation metrics:

– Log-rank test
– Concordance index
– Explained residual variation or integrated Brier score

[Figure: Kaplan-Meier survival estimates for low- and high-risk groups]
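
A hedged sketch of the Kaplan-Meier estimate (assuming the lifelines library; the follow-up times and event flags are invented):

```python
from lifelines import KaplanMeierFitter

durations = [5, 6, 6, 2, 4, 4, 8, 9, 3, 7]   # follow-up times
observed  = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]   # 1 = event, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)                 # Kaplan-Meier estimate
```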

SLIDE 41

Experimental setup

  • Setting up an unbiased experiment

– How well will the model perform on new, yet unseen, data?

  • Dataset split:

– n-fold cross-validation or leave-one-out test
– Fold stratification of the class-wise distribution
– Multiple iterations of fold splits

[Figure: dataset split into training and test folds]
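
A hedged stratified k-fold sketch with scikit-learn (the estimator and data are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 10-fold CV with fold stratification of the class-wise distribution.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)
print(scores.mean(), scores.std())
```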

SLIDE 42


Experimental setup

  • Estimating a model wisely

– validation data used for tuning hyperparameters


SLIDE 43

Experimental setup

  • Fair estimate of a classifier / regression model

– data preprocessing can bias your conclusions
– watch out for naturally correlated observations, e.g.

  • diagnostic scan of a patient now and before
  • different imaging modalities of the same subject

– artificially generated data can cause a mess
– use testing data only after you are done with the training

  • Fair estimate of a segmenter

– splitting pixels of the same image into separate sets is usually not the best idea
– also, use a separate testing set of images

How well will the model perform on new, yet unseen, data?


SLIDE 44


Method performance comparison

  • You devised a new method

– What makes your method better?
– Is it significantly better?

  • Is the sample representative of a population?
  • Hypothesis: It is better (maybe)
SLIDE 45

Method performance comparison

  • Hypothesis: A equals B!

– Non-parametric statistical tests
– A level of significance α is used to determine at which level the hypothesis may be rejected
– The smaller the p-value, the stronger the evidence against the null hypothesis

Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (2006): 1-30.
Derrac, Joaquín, et al. "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms." Swarm and Evolutionary Computation 1.1 (2011): 3-18.

SLIDE 46

Two classifiers

  • Comparing performance against baseline (over multiple datasets)

  • Wilcoxon signed ranks test

Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (2006): 1-30.
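
A hedged SciPy sketch of the Wilcoxon signed-ranks test (the per-dataset accuracies are invented):

```python
from scipy.stats import wilcoxon

# Accuracy of a new method vs a baseline on the same ten datasets (illustrative).
new_method = [0.81, 0.77, 0.90, 0.65, 0.85, 0.72, 0.88, 0.79, 0.93, 0.70]
baseline   = [0.78, 0.75, 0.86, 0.66, 0.80, 0.70, 0.85, 0.76, 0.90, 0.68]

# Paired, non-parametric test over per-dataset scores.
stat, p = wilcoxon(new_method, baseline)
print("p-value:", p)   # a small p-value is evidence against "A equals B"
```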

SLIDE 47

Multiple classifiers

  • Comparing performance of multiple classifiers (over multiple datasets)

  • Friedman test
  • Post-hoc tests

Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (2006): 1-30.
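
A hedged SciPy sketch of the Friedman test (scores again invented; the post-hoc Nemenyi and Bonferroni-Dunn tests are available in third-party packages such as scikit-posthocs):

```python
from scipy.stats import friedmanchisquare

# Accuracy of three classifiers over the same six datasets (illustrative).
clf_a = [0.81, 0.77, 0.90, 0.65, 0.85, 0.72]
clf_b = [0.78, 0.75, 0.86, 0.66, 0.80, 0.70]
clf_c = [0.84, 0.80, 0.88, 0.70, 0.87, 0.75]

stat, p = friedmanchisquare(clf_a, clf_b, clf_c)
print("p-value:", p)   # if rejected, follow up with post-hoc tests
```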

SLIDE 48

Multiple classifiers

  • Friedman test

Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (2006): 1-30.

SLIDE 49

Multiple classifiers

  • Post-hoc tests
  • All-vs-all

– Nemenyi test

  • One-vs-all

– Bonferroni-Dunn test

Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (2006): 1-30.


SLIDE 50
SLIDE 51

IMPROVING MODEL PERFORMANCE

EVERYTHING YOU NEVER WANTED TO KNOW ABOUT MACHINE LEARNING, BUT WERE FORCED TO FIND OUT

SLIDE 52

Tuning hyperparameters

  • How do you know if you have chosen the right hyperparameters?
  • Empirical (training) cost

– How well can we fit the existing data

  • Expected (validation) cost

– What we can expect in reality

  • Substitute cost with an arbitrary evaluation metric


SLIDE 53

Overfitting and underfitting

SLIDE 54

Bias-variance tradeoff

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1, No. 10. New York: Springer Series in Statistics, 2001.


SLIDE 55

Tuning hyperparameters

  • Trial and error
  • Gridsearch CV

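A hedged grid-search sketch with scikit-learn (the grid values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Exhaustive search over a small hyperparameter grid, scored by
# cross-validation; testing data should stay untouched meanwhile.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```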

SLIDE 56

Tuning hyperparameters

  • Where applicable, track goal convergence

[Figures: cost vs hypothesis complexity; cost vs number of iterations]

SLIDE 57

Lowering bias and variance

  • Bias or variance still not constrained?

– Add new features
– Reduce problem complexity
– Generate more data
– Use transfer learning
– Use ensembles

SLIDE 58

Reducing problem complexity

  • Make the learning problem easier through data preprocessing

– Spatial data preprocessing
– Feature extraction

SLIDE 59

Reduce problem complexity

[Figure: example images with opposite orientation]

SLIDE 60

Generating more data

  • Will generating more data help?

– Check using a learning curve plot

  • If possible, acquire more data and increase model complexity

– Alternatively, generate artificial (synthetic) data
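
A hedged learning-curve sketch with scikit-learn (the model and data are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Training/validation score as a function of training-set size; if both
# curves have already converged, more data alone is unlikely to help.
sizes, train_scores, val_scores = learning_curve(
    SVC(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
print(sizes)
print(train_scores.mean(axis=1), val_scores.mean(axis=1))
```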

SLIDE 61

Using transfer learning

Tajbakhsh, Nima, et al. "Convolutional neural networks for medical image analysis: full training or fine tuning?" IEEE Transactions on Medical Imaging 35.5 (2016): 1299-1312.
SLIDE 62

Biological justification – visual cortex

Hubel, David H., and Torsten N. Wiesel. "Receptive fields of single neurones in the cat's striate cortex." The Journal of physiology 148.3 (1959): 574-591.

SLIDE 63
SLIDE 64

Using ensembles

SLIDE 65

Using ensembles

  • Ensemble:

– Multiple submodels combined to make one strong model
– Sample the training data differently for each submodel
– Define a voting/averaging policy over submodels

  • Bagging vs boosting

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1, No. 10. New York: Springer Series in Statistics, 2001.
SLIDE 66

Bagging

  • Bagging (or bootstrap aggregation)

– A technique for reducing the variance of an estimated model
– Averaging reduces variance and leaves bias unchanged
– Works especially well for high-variance, low-bias procedures (e.g. trees)

  • 50 members vote in 10 categories (4 nominations)
  • 15 members are (somewhat) informed (random for category)
  • Majority wins
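
A hedged bagging sketch with scikit-learn (the base model and ensemble size are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bootstrap-sample the training data for each tree, then average the votes;
# this mainly reduces the variance of the high-variance base trees.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```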
SLIDE 67

Random forests

  • If a smaller subset of features is far more informative than the rest – loss of diversity in the ensemble
  • At branching, randomly pick a subset of features and bootstrap the dataset

– recursively partition
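
A hedged random-forest sketch (the hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample; each split considers only a random
# subset of features (max_features), preserving diversity in the ensemble.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```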

SLIDE 68

Boosting

  • Boosting

– Similar to bagging, but involving vote weighting
– Each weak-model accuracy only slightly better than random guessing

  • Adaboost.M1

[Figure: decision stump splitting on Temperature = "hot"]
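
A hedged AdaBoost sketch with decision stumps as the weak models (scikit-learn's AdaBoostClassifier implements a SAMME-style variant of AdaBoost.M1):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Weak models: depth-1 stumps, each only slightly better than guessing;
# boosting reweights misclassified samples and weights the stumps' votes.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```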

SLIDE 69
SLIDE 70

Gradient Boosting

  • Learn each new model explicitly from the error of the previous additive model, i.e. use additive training

SLIDE 71

Gradient Boosting

  • Optimal additive model obtained by calculating the gradient on the residual of the previous additive model
  • XGBoost

Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
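
A hedged gradient-boosting sketch (using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Additive training: each new shallow tree is fit to the (negative) gradient
# of the loss with respect to the current ensemble's predictions.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
print(cross_val_score(gb, X, y, cv=5).mean())
```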

SLIDE 72

Applicability of learned models

  • Good performance
  • Handling missing values
  • Noise handling
  • Transparency of diagnostic knowledge
  • Explanatory capabilities
  • Reducing the number of tests

Kononenko, Igor. "Machine learning for medical diagnosis: history, state of the art and perspective." Artificial Intelligence in medicine 23.1 (2001): 89-109.

SLIDE 73

Noise

  • Errors in data

– Regularisation

  • Errors in labels (supervised learning)

– Ground truth availability (garbage in – garbage out)

  • Do you expect this (type of) noise in the future?

– E.g. a flaw in the data acquisition method

  • Outlier removal in a (one-time) preprocessing step

– E.g. inspect the univariate or multivariate normal distribution

SLIDE 74
  • Minimum (end-systolic) and maximum (end-diastolic) volumes and ejection fraction estimation
  • Diabetic retinopathy detection
  • Predicting length of hospital stay
  • Brachial plexus nerve segmentation in ultrasound images
  • Identifying the risk population for cervical cancer
SLIDE 75
  • Understand the data
  • Define what you want to accomplish
  • Choose an appropriate hypothesis and optimisation technique

– Do not overdo it
– SVMs and RFs are the best choice in most cases

  • Stay unbiased (experimentally)
  • Track the learning process
  • Reduce bias and variance
  • Perform a statistical comparison


SLIDE 76

Hmmm... experimental results suggest the model is good. Finally, I might be getting my big break...

SLIDE 77

Other sources (of figures, mostly)

BOOKS:

  • Christopher Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
  • Duda, Richard O., Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.
  • Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1, No. 10. New York: Springer Series in Statistics, 2001.

JOURNALS AND PROCEEDINGS:

  • D. Gering, W. Lu, K. Ruchala, G. Olivera. Utilizing Shape Models Composed of Geometric Primitives for Organ Segmentation. American Association of Physicists in Medicine (AAPM), Anaheim, CA, July 2009.
  • Enquobahrie, Andinet, et al. "The image-guided surgery toolkit IGSTK: an open source C++ software toolkit." Journal of Digital Imaging 20.1 (2007): 21-33.
  • Müller, Henning, Patrick Ruch, and Antoine Geissbuhler. "Enriching content-based image retrieval with multi-lingual search terms." Swiss Medical Informatics 54 (2005): 6-11.
  • Yang, Liu, et al. "A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.1 (2010): 30-44.

OTHER:

  • http://cecs.wright.edu/~agoshtas/OMI.html
  • http://www.diagnijmegen.nl/index.php/Computer-Aided_Diagnosis_of_Retinal_Images_(CADR)
  • https://radiopaedia.org/
  • http://scott.fortmann-roe.com/docs/BiasVariance.html
  • Andrea Vedaldi, (Somewhat) Advanced Convolutional Neural Networks, MISS 2016
  • Andrew Ng, Machine Learning, Coursera 2016
  • https://www.kaggle.com/c/dogs-vs-cats/discussion/6984
  • http://hmcbee.blogspot.hr/2015/06/toddlers-in-transistors-teaching.html
  • https://xkcd.com/
  • https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html
  • Wikipedia *
  • Deviantart *
  • Meme sites *
SLIDE 78

EVERYTHING YOU NEVER WANTED TO KNOW ABOUT MACHINE LEARNING, BUT WERE FORCED TO FIND OUT

Ivan Štajduhar

istajduh@riteh.hr SSIP 2019 27TH SUMMER SCHOOL ON IMAGE PROCESSING, TIMISOARA, ROMANIA July 10th 2019