1/122
Introduction to Machine Learning
Andrea De Lorenzo
A.Y. 2020
2/122
Section 1 General information
3/122
Lecturers
◮ Andrea De Lorenzo
◮ Dipartimento di Ingegneria e Architettura (DIA)
◮ http://delorenzo.inginf.units.it/
4/122
Course materials
◮ Lecturer’s slides
◮ http://delorenzo.inginf.units.it/project/introduction-to-machine-learning-2020
◮ Suggested textbooks (for further reading)
◮ Gareth James et al. An Introduction to Statistical Learning. Vol. 6. Springer, 2013
◮ Other material:
◮ I might point you to some scientific papers to discuss application examples or specific details (just a "chat")
Everything you are required to know is in the lecturer’s slides
5/122
Section 2 Introduction
6/122
What is Machine Learning?
Definition
Machine Learning is the science of getting computers to learn without being explicitly programmed.
Definition
Data Mining/Analytics is the science of discovering patterns in data.
7/122
In practice
A set of mathematical and statistical tools for:
◮ building a model which allows one to predict an output, given an input (supervised learning)
◮ example (input, output) pairs are available
◮ learning relationships and structures in data (unsupervised learning)
8/122
Machine Learning: a computer science perspective
9/122
Machine Learning everyday
Example problem: spam
Discriminate between spam and non-spam emails.
Figure: Spam filtering in Gmail.
10/122
Machine Learning everyday
Example problem: flight trajectories
Do flights over the same (origin, destination) pair follow the "same" trajectory? Why?
Figure: Clustering of flight trajectories.
11/122
Machine Learning everyday
Example problem: image understanding
Recognize objects in images.
Figure: Object recognition in Google Photos.
12/122
Machine Learning everyday
Q: what type of learning (supervised/unsupervised) is in the examples?
◮ spam
◮ image understanding
◮ flight trajectories
13/122
Why ML/DM “today”?
◮ we collect more and more data (big data)
◮ we have more and more computational power
Figure: From http://www.mkomo.com/cost-per-gigabyte-update.
14/122
ML/DM is popular!
Figure: Popular areas of interest, from the Skill Up 2016: Developer Skills Report2
2 https://techcus.com/p/r1zSmbXut/top-5-highest-paying-programming-languages-of-2016/
15/122
Aims of the course
Be able to:
1. design
2. implement
3. assess experimentally
an end-to-end Machine Learning or Data Mining system.
◮ Which is the problem to be solved? Which are the input and output? Which are the most suitable techniques? How should data be prepared? Does computation time matter?
◮ Write some code!
◮ How to measure solution quality? How to compare solutions? Is my solution general?
◮ Assessment itself involves design and implementation
16/122
Aims of the course: communication
Be able to:
- 1. design
- 2. implement
- 3. assess experimentally
an end-to-end Machine Learning or Data Mining system.
And be able to convince the "client" that it is:
◮ technically sound
◮ economically viable
◮ in its larger context
17/122
Subsection 1 Motivating example
18/122
The amateur botanist friend
He likes to collect Iris plants. He "realized" that there are 3 species, in particular, that he likes: Iris setosa, Iris virginica, and Iris versicolor. He'd like to have a tool to automatically classify collected samples into one of the 3 species.
Figure: Iris versicolor.
How to help him?
19/122
Let's help him
◮ Which is the problem to be solved?
◮ Assign exactly one species to a sample.
◮ Which are the input and output?
◮ Output: one species among I. setosa, I. virginica, I. versicolor.
◮ Input: the plant sample. . .
◮ a description in natural language?
◮ a digital photo?
◮ DNA sequences?
◮ some measurements of the sample!
20/122
Iris: input and output
Figure: Sepal and petal.
Input: sepal length and width, petal length and width (in cm)
Output: the class
Example: (5.1, 3.5, 1.4, 0.2) → I. setosa
21/122
Other information
The botanist friend asked a senior botanist to inspect several samples and label them with the corresponding species.

Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          I. setosa
4.9           3.0          1.4           0.2          I. setosa
7.0           3.2          4.7           1.4          I. versicolor
6.0           2.2          5.0           1.5          I. virginica
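The labeled data above can be held, e.g., as parallel Python lists (a sketch; the variable names are arbitrary):

```python
# Labeled Iris observations from the table above: each observation is
# (sepal length, sepal width, petal length, petal width), in cm.
X = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.0, 2.2, 5.0, 1.5),
]
y = ["I. setosa", "I. setosa", "I. versicolor", "I. virginica"]
n, p = len(X), len(X[0])  # n observations, p input variables
```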
22/122
Notation and terminology
◮ Sepal length, sepal width, petal length, and petal width are input variables (or independent variables, or features, or attributes).
◮ Species is the output variable (or dependent variable, or response).
23/122
Notation and terminology

X = [ x_{1,1}  x_{1,2}  ...  x_{1,p} ]
    [ x_{2,1}  x_{2,2}  ...  x_{2,p} ]
    [   ...      ...    ...    ...   ]
    [ x_{n,1}  x_{n,2}  ...  x_{n,p} ]

y = (y_1, y_2, ..., y_n)^T

◮ x_1^T = (x_{1,1}, x_{1,2}, ..., x_{1,p}) is an observation (or instance, or data point), composed of p variable values; y_1 is the corresponding output variable value
◮ x_2 = (x_{1,2}, x_{2,2}, ..., x_{n,2})^T is the vector of all the n values for the 2nd variable (X_2)
24/122
Notation and terminology
Different communities (e.g., statistical learning vs. machine learning vs. artificial intelligence) use different terms and notation:
◮ x_j^{(i)} instead of x_{i,j} (hence x^{(i)} instead of x_i)
◮ m instead of n and n instead of p
◮ . . .
Focus on the meaning!
25/122
Iris: visual interpretation
Simplification: forget the petal variables and I. virginica → 2 variables, 2 species (binary classification problem).
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
1. learn a model (classifier)
2. "use" the model on new observations
Figure: I. setosa and I. versicolor samples plotted as sepal width vs. sepal length.
26/122
“A” model?
There could be many possible models:
◮ how to choose?
◮ how to compare?
Q: a model of what?
27/122
Choosing the model
The choice of the model/tool/technique to be used is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or "using")
◮ explicability of the model
◮ . . .
We will see some options.
28/122
Comparing many models
Experimentally: does the model work well on (new) data?
Define "works well":
◮ a single performance index?
◮ how to measure?
◮ repeatability/reproducibility. . .
◮ Q: what's the difference?
We will see/discuss some options.
29/122
It does not work well. . .
Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues.
30/122
ML is not magic
Problem: find the birth town from height/weight.
Figure: height [cm] vs. weight [kg] for people born in Trieste and in Udine.
Q: which is the data issue here?
31/122
Implementation
When "solving" a problem, we usually need to:
◮ explore/visualize data
◮ apply one or more ML techniques
◮ assess learned models
"By hand?" No, with software!
32/122
ML/DM software
Many options: ◮ libraries for general purpose languages:
◮ Java: e.g., http://haifengl.github.io/smile/
◮ Python: e.g., http://scikit-learn.org/stable/
◮ . . .
◮ specialized software environments:
◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch
33/122
ML/DM software: which one?
◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ . . .
◮ previous knowledge/skills
34/122
ML/DM software: why?
In all cases, software allows one to be more productive and concise. E.g., learn and use a model for classification, in Java+Smile:

double[][] instances = ...;
int[] labels = ...;
RandomForest classifier = (new RandomForest.Trainer()).train(instances, labels);
double[] newInstance = ...;
int newLabel = classifier.predict(newInstance);

In R:

d = ...
classifier = randomForest(label~., d)
newD = ...
newLabels = predict(classifier, newD)
35/122
Section 3 Plotting data: an overview
36/122
Advanced plotting
◮ many packages (e.g., ggplot2)
◮ many options
Which is the most appropriate chart to support a claim?
37/122
Aim of a plot: examples
38/122
Aim of a plot: examples
39/122
Aim of a plot: examples
40/122
Aim of a plot: examples
41/122
Section 4 Tree-based methods
42/122
The carousel robot attendant
Problem: replace the carousel attendant with a robot that automatically decides who can ride the carousel.
43/122
Carousel: data
Observed human attendant's decisions.
Figure: age a [year] vs. height h [cm] of people, with decisions "cannot ride" and "can ride".
How can the robot take the decision?
◮ if younger than 10 → can't!
◮ otherwise:
◮ if shorter than 120 → can't!
◮ otherwise → can!
Decision tree!
a < 10? if true → can't; otherwise: h < 120? if true → can't; otherwise → can.
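The decision rule above can be written directly as code (a sketch; the function name is arbitrary):

```python
# The carousel attendant's decision rule, as a two-level decision tree.
def can_ride(age, height_cm):
    if age < 10:
        return False      # younger than 10 → can't
    if height_cm < 120:
        return False      # shorter than 120 cm → can't
    return True           # otherwise → can

can_ride(12, 150)  # → True
can_ride(8, 150)   # → False
```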
44/122
How to build a decision tree
Divide et impera (recursively):
◮ find a cut variable and a cut value
◮ for the left branch, divide et impera
◮ for the right branch, divide et impera
45/122
How to build a decision tree: detail
Recursive binary splitting:

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

◮ Recursive binary splitting
◮ Top down (start from the "big" problem)
46/122
Best branch
function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Classification error on a subset:
E(y) = |{y ∈ y : y ≠ ŷ}| / |y|, where ŷ = the most common class in y

◮ Greedy (chooses the split that minimizes the error now, not in later steps)
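A minimal, self-contained sketch of the two pseudocode functions above, assuming numeric variables and a fixed k_min stopping rule (all names here are hypothetical):

```python
# Recursive binary splitting with the greedy classification-error criterion.
from collections import Counter

K_MIN = 2  # stop splitting when fewer than K_MIN observations remain

def subset_error(labels):
    # classification error E(y): fraction not in the most common class
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)

def best_branch(X, y):
    # greedy: choose (i, t) minimizing E(y | x_i < t) + E(y | x_i >= t)
    best = None
    for i in range(len(X[0])):
        for t in sorted({x[i] for x in X}):
            left = [v for x, v in zip(X, y) if x[i] < t]
            right = [v for x, v in zip(X, y) if x[i] >= t]
            if not left or not right:
                continue
            e = subset_error(left) + subset_error(right)
            if best is None or e < best[0]:
                best = (e, i, t)
    return None if best is None else (best[1], best[2])

def build_tree(X, y):
    branch = None if len(set(y)) == 1 or len(y) < K_MIN else best_branch(X, y)
    if branch is None:  # ShouldStop: pure node, too small, or no valid cut
        return {"label": Counter(y).most_common(1)[0][0]}
    i, t = branch
    lo = [(x, v) for x, v in zip(X, y) if x[i] < t]
    hi = [(x, v) for x, v in zip(X, y) if x[i] >= t]
    return {"var": i, "cut": t,
            "left": build_tree(*map(list, zip(*lo))),
            "right": build_tree(*map(list, zip(*hi)))}

def predict(node, x):
    while "label" not in node:
        node = node["left"] if x[node["var"]] < node["cut"] else node["right"]
    return node["label"]
```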
47/122
Best branch
(i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
The formula says what is to be done, not how it is done!
Q: "how" can different methods differ?
48/122
Stopping criterion
function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Other possible criteria:
◮ tree depth larger than d_max
49/122
Best branch criteria
Classification error E() works, but has been shown to be "not sufficiently sensitive for tree-growing".
E(y) = |{y ∈ y : y ≠ ŷ}| / |y| = 1 − max_c |{y ∈ y : y = c}| / |y| = 1 − max_c p_{y,c}
Two other options:
◮ Gini index: G(y) = Σ_c p_{y,c}(1 − p_{y,c})
◮ Cross-entropy: D(y) = −Σ_c p_{y,c} log p_{y,c}
For all indexes, the lower the better (node impurity).
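The three impurity indexes above, sketched in code (p_{y,c} is the empirical frequency of class c in y; natural logarithm assumed for the cross-entropy):

```python
from collections import Counter
from math import log

def class_freqs(y):
    # empirical class frequencies p_{y,c}
    n = len(y)
    return [count / n for count in Counter(y).values()]

def classification_error(y):
    return 1 - max(class_freqs(y))

def gini(y):
    return sum(p * (1 - p) for p in class_freqs(y))

def cross_entropy(y):
    return -sum(p * log(p) for p in class_freqs(y))
```

All three are 0 for a pure node and maximal for a uniform class mix.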
50/122
Best branch criteria: binary classification
Figure: classification error E, Gini index G, and cross-entropy D as functions of p_{y,c} (binary classification); the cross-entropy is rescaled.
Q: what happens with multiclass problems?
51/122
Categorical independent variables
◮ Trees can work with categorical variables
◮ The branch node test is x_i = c or x_i ∈ C′ ⊂ C (c is a category)
◮ Trees can mix categorical and numeric variables
52/122
Stopping criterion: role of kmin
Suppose k_min = 1 (never stop because of the size of y).
Figure: the resulting tree splits the carousel data with many branches (h < 120, a < 9.0, a < 9.1, a < 9.4, a < 9.6, a < 10).
Q: what's wrong? (recall: "a model of what?")
53/122
Tree complexity
When the tree is "too complex":
◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data
Q: what's noise in the carousel data?
Tree complexity is related not only to k_min, but also to the data.
54/122
Tree complexity: other interpretation
◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)
55/122
High variance
"The model varies if the learning data varies": what? why does the data vary?
◮ learning data is about the system/phenomenon/nature S
◮ a collection of observations of S
◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!
56/122
Spotting overfitting
Figure: learning error and test error vs. model complexity.
Test error: the error on unseen data.
57/122
k-fold cross-validation
Where can I find "unseen data"? Pretend to have it!
1. split the learning data (X and y) in k equal slices (each of n/k observations/elements)
2. for each split (i.e., each i ∈ {1, . . . , k}):
  2.1 learn on all but the i-th slice
  2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize beyond the available data?
◮ how will the learned artifact behave on unseen data?
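The three steps above can be sketched as follows, assuming hypothetical `learn(X, y)` (returns a model) and `error(model, X, y)` (returns an error) callables:

```python
def k_fold_cv(X, y, k, learn, error):
    n = len(y)
    fold_errors = []
    for i in range(k):
        # indices of the i-th slice, the "unseen" data for this folding
        test_idx = set(range(i * n // k, (i + 1) * n // k))
        train_X = [x for j, x in enumerate(X) if j not in test_idx]
        train_y = [v for j, v in enumerate(y) if j not in test_idx]
        test_X = [x for j, x in enumerate(X) if j in test_idx]
        test_y = [v for j, v in enumerate(y) if j in test_idx]
        model = learn(train_X, train_y)               # learn on all but the i-th slice
        fold_errors.append(error(model, test_X, test_y))  # error on the unseen slice
    return sum(fold_errors) / k                       # average the k errors
```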
58/122
k-fold cross-validation
folding 1 → error_1, folding 2 → error_2, folding 3 → error_3, folding 4 → error_4, folding 5 → error_5
error = (1/k) Σ_{i=1}^{k} error_i
Or with any other meaningful (effectiveness) measure.
Q: how should data be split?
59/122
Fighting overfitting with trees
◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don't split if the overall impurity decrease is low
◮ after building, prune
60/122
Pruning: high level idea
1. learn a full tree t_0
2. build from t_0 a sequence T = {t_0, t_1, . . . , t_n} of trees such that:
  ◮ t_i is a root-subtree of t_{i-1} (t_i ⊂ t_{i-1})
  ◮ t_i is always less complex than t_{i-1}
3. choose the t ∈ T with the minimum classification error, estimated with k-fold cross-validation
61/122
k-fold cross-validation: data splitting
Q: how should data be split?
Example: Android malware detection
◮ Gerardo Canfora et al. "Effectiveness of opcode ngrams for detection of multi family android malware". In: Availability, Reliability and Security (ARES), 2015 10th International Conference on. IEEE. 2015, pp. 333–340
◮ Gerardo Canfora et al. "Detecting android malware using sequences of system calls". In: Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. ACM. 2015, pp. 13–20
62/122
Using cross-validation (CV) for assessment (I)
How will the learned artifact behave on unseen data?
More precisely: how will an artifact learned with this learning technique behave on unseen data?
63/122
Using CV for assessment (II)
"This learning technique" = BuildDecisionTree() with k_min = 10
1. repeat k times:
  1.1 BuildDecisionTree() with k_min = 10 on all but one slice
    ◮ (k−1)/k · n observations in each X passed to BuildDecisionTree()
  1.2 compute the classification error on the left-out slice
2. average the computed classification errors
k invocations of BuildDecisionTree()
64/122
Using CV for assessment (III)
"This learning technique" = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV
For assessing this technique, we do two nested CVs:
1. repeat k times:
  1.1 choose k_min among m values with 10-fold CV (repeating BuildDecisionTree() 10m times) on all but one slice
    ◮ (k−1)/k · 9/10 · n observations in each X passed to BuildDecisionTree()!
  1.2 compute the classification error on the left-out slice
    ◮ usually, a new tree is built on (k−1)/k · n observations
2. average the computed classification errors
(10m + 1)k invocations of BuildDecisionTree()
65/122
Using CV for assessment: “cheating”
"This learning technique" = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV
Using just one CV is cheating (cherry picking)!
◮ k_min is chosen exactly to minimize the error on the full dataset
◮ conceptually, this way of "fitting" k_min is similar to the way we build the tree
66/122
Subsection 1 Regression trees
67/122
Regression with trees
Trees can be used for regression, instead of classification: decision tree vs. regression tree.
68/122
Tree building: decision → regression
function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

Q: what should we change?
68/122
Tree building: decision → regression
function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← ȳ   ⊲ the mean of y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

Q: what should we change?
69/122
Best branch
function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Q: what should we change?
69/122
Best branch
function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} Σ_{y_j ∈ y|x_i≥t} (y_j − ȳ)² + Σ_{y_j ∈ y|x_i<t} (y_j − ȳ)²
  return (i*, t*)
end function

Q: what should we change?
Minimize the sum of the residual sums of squares (RSS) (the two ȳ are different: each is the mean of its own subset)
70/122
Stopping criterion
function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Q: what should we change?
70/122
Stopping criterion
function ShouldStop(y)
  if the RSS of y is 0 then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Q: what should we change?
71/122
Interpretation
Figure: a regression tree's prediction over example data.
72/122
Regression and overfitting
Image from F. Daolio
73/122
Trees in summary
Pros:
◮ easily interpretable/explicable
◮ learning and regression/classification easily understandable
◮ can handle both numeric and categorical values
Cons:
◮ not so accurate (Q: always?)
74/122
Tree accuracy?
Image from An Introduction to Statistical Learning
75/122
Subsection 2 Trees aggregation
76/122
Weakness of the tree
Figure: example regression data with a curved part on the left and a noisy part on the right.
Small tree:
◮ low complexity
◮ will hardly fit the "curve" part
◮ high bias, low variance
Big tree:
◮ high complexity
◮ may overfit the noise on the right part
◮ low bias, high variance
77/122
The trees view
Small tree:
◮ "a car is something that moves"
Big tree:
◮ "a car is a made-in-Germany blue object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine"
78/122
Big tree view
A big tree:
◮ has a detailed view of the learning data (high complexity)
◮ "trusts too much" the learning data (high variance)
What if we "combine" different big-tree views and ignore the details on which they disagree?
79/122
Wisdom of the crowds
What if we "combine" different big-tree views and ignore the details on which they disagree?
◮ many views
◮ independent views
◮ aggregation of views
≈ the wisdom of the crowds: a collective opinion may be better than a single expert's opinion
80/122
Wisdom of the trees
◮ many views
  ◮ just use many trees
◮ independent views
  ◮ ??? learning is deterministic: same data ⇒ same tree ⇒ same view
◮ aggregation of views
  ◮ just average the predictions (regression) or take the most common prediction (classification)
81/122
Independent views
Independent views ≡ different points of view ≡ different learning data.
But we have only one set of learning data!
82/122
Independent views: idea! (Bootstrap)
Like in cross-fold validation, consider only a part of the data, but:
◮ instead of a subset
◮ a sample with repetitions

X = (x_1^T, x_2^T, x_3^T, x_4^T, x_5^T)   original learning data
X_1 = (x_1^T, x_5^T, x_3^T, x_2^T, x_5^T)   sample 1
X_2 = (x_4^T, x_2^T, x_3^T, x_1^T, x_1^T)   sample 2
X_i = . . .   sample i

◮ (y omitted for brevity)
◮ learning data size is not a limitation (differently than with a subset)
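A bootstrap sample as above, sketched in code: draw n observations with replacement from the original n (the function name is arbitrary):

```python
import random

def bootstrap_sample(X, y, rng):
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]  # repetitions allowed
    return [X[i] for i in idx], [y[i] for i in idx]
```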
83/122
Tree bagging
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 learn a tree (unpruned)
When predicting:
1. get a prediction from each of the B learned trees
2. predict the average (or the most common) prediction
For classification, other aggregations can be done: majority voting (most common) is the simplest.
Using independent, possibly different classifiers together: ensemble of classifiers.
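The majority-voting aggregation above, sketched in code; `trees` is any list of predictors with a `predict(x)` method (hypothetical names):

```python
from collections import Counter

def ensemble_predict(trees, x):
    # collect one vote per tree and return the most common prediction
    votes = Counter(tree.predict(x) for tree in trees)
    return votes.most_common(1)[0][0]
```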
84/122
How many trees?
B is a parameter:
◮ when there is a parameter, there is the problem of finding a good value
◮ remember k_min, depth (Q: impact on?)
◮ it has been shown (experimentally) that:
  ◮ for "large" B, bagging is better than a single tree
  ◮ increasing B does not cause overfitting
  ◮ (for us: the default B is OK! "large" ≈ hundreds)
Q: how much better? at which cost?
85/122
Bagging: impact of B
Figure: test error vs. the number B of trees.
86/122
Independent view: improvement
Despite being learned on different samples, bagged trees may be correlated, hence their views are not very independent:
◮ e.g., one variable is much more important than the others for predicting (a strong predictor)
Idea: force point-of-view differentiation by "hiding" variables.
87/122
Random forest
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 consider only m of the p independent variables
  1.3 learn a tree (unpruned)
When predicting:
1. get a prediction from each of the B learned trees
2. predict the average (or the most common) prediction
◮ (observations and) variables are randomly chosen. . .
◮ . . . to learn a forest of trees
Q: are the missing variables a problem?
88/122
Random forest: parameter m
How to choose the value for m? ◮ m = p → bagging ◮ it has been shown (experimentally) that
◮ m does not relate with overfitting ◮ m = √p is good for classification ◮ m = p
3 is good for regression
◮ (for us, default m is ok!)
89/122
Random forest
Experimentally shown to be one of the "best" multi-purpose supervised classification methods:
◮ Manuel Fernández-Delgado et al. "Do we need hundreds of classifiers to solve real world classification problems?". In: J. Mach. Learn. Res. 15.1 (2014), pp. 3133–3181
but. . .
90/122
No free lunch!
"Any two optimization algorithms are equivalent when their performance is averaged across all possible problems"
◮ David H. Wolpert. "The lack of a priori distinctions between learning algorithms". In: Neural Computation 8.7 (1996), pp. 1341–1390
Why "free lunch"?
◮ many restaurants, many items on the menus, many possible prices for each item: where to go to eat?
◮ no general answer
◮ but, if you are a vegan, or like pizza, then a best choice could exist
Q: problem? algorithm?
91/122
Observation sampling
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 consider only m of the p independent variables (only for RF)
  1.3 learn a tree (unpruned)
Each learned tree uses only a portion of the observations in the learning data:
◮ for each observation, ≈ B/3 trees did not consider it when learning
◮ those observations were unseen for those trees, like in cross-validation (OOB = out-of-bag)
92/122
Bonus 1: OOB error
◮ for each observation there are ≈ B/3 predictions from trees that did not see it
◮ we can "average" these predictions across trees and observations and obtain an estimate of the test error (OOB error)
  ◮ like with cross-fold validation
  ◮ for free!
93/122
OOB error
Image from An Introduction to Statistical Learning
94/122
Why estimating the test error?
Because the test data, in the real world, is not available!
◮ will my ML solution work?
95/122
Bagging/RF and explicability
◮ Trees are easily understandable → explicability
◮ Hundreds of trees are not!
Image from F. Daolio
96/122
Bagging/RF and explicability: idea!
While learning:
1. for each tree, at each split:
  1.1 keep note of the split variable
  1.2 keep note of the RSS/Gini reduction
2. for each variable, sum the reductions
The larger the reduction, the more important the variable!
97/122
Bonus 2: variable importance
Instead of explicability based on the tree shape:
◮ importance of variables, based on RSS/Gini reduction
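The importance computation described above, sketched in code: sum, per variable, the impurity reductions recorded at every split of every tree. `splits` is a hypothetical flat list of (variable index, impurity reduction) records:

```python
from collections import defaultdict

def variable_importance(splits):
    # accumulate RSS/Gini reductions per split variable
    importance = defaultdict(float)
    for var, reduction in splits:
        importance[var] += reduction
    return dict(importance)
```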
98/122
Nature of the prediction
Consider classification:
◮ tree → the class
  ◮ "virginica" is just "virginica"
◮ forest → the class, as resulting from a voting
  ◮ "241 virginica, 170 versicolor, 89 setosa" is different from "478 virginica, 10 versicolor, 2 setosa"
Different confidence in the prediction.
99/122
Bonus 3: confidence/tunability
Voting outcome:
◮ in classification, a measure of confidence in the decision
◮ in binary classification, the voting threshold can be tuned to adjust the bias towards one class (sensitivity)
Q: in regression?
100/122
Subsection 3 Binary classification
101/122
Binary classification
Binary classification:
◮ one of the most common classes of problems
◮ (comparative) evaluation is important!
102/122
Binary classification: evaluation
Consider the problem of classifying a person('s data) as suffering or not suffering from a disease X.
Suppose we have "an accuracy of 99.99%". Q: is it good?
103/122
Binary classification: positives/negatives
Consider the problem of classifying a person('s data) as suffering or not suffering from a disease X.
◮ positive: an observation of the "suffering" class
◮ negative: an observation of the "not suffering" class
In other problems, positive may mean a different thing: define it!
104/122
Effectiveness indexes: FPR, FNR
Given some labeled data and a classifier for the disease X problem, we can measure:
◮ the number of negative observations wrongly classified as positive: False Positives (FP)
◮ the number of positive observations wrongly classified as negative: False Negatives (FN)
To decouple FP, FN from the data size:
FPR = FP / N = FP / (FP + TN)
FNR = FN / P = FN / (FN + TP)
105/122
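The two formulas can be sketched directly (a minimal illustration with made-up labels, where 1 = positive = “suffering”):

```python
# Sketch: FPR = FP/(FP + TN) and FNR = FN/(FN + TP) from labels.
def fpr_fnr(y_true, y_pred):
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fp / (fp + tn), fn / (fn + tp)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
fpr, fnr = fpr_fnr(y_true, y_pred)   # FP=2 of 6 negatives, FN=1 of 4 positives
```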
Accuracy and error rate
Relation of FPR, FNR with accuracy and error rate:
Accuracy = 1 − Error Rate
Error Rate = (FN + FP)/(P + N)
Q: Error Rate =? (FPR + FNR)/2
106/122
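A quick numeric check of the Q (with made-up counts): the equality holds only when P = N, because the error rate weights each class by its size while (FPR + FNR)/2 weights the classes equally.

```python
# Sketch: Error Rate vs. (FPR + FNR)/2 on an imbalanced dataset.
P, N = 100, 900          # far more negatives than positives
FN, FP = 50, 90          # so FNR = 50/100 = 0.5 and FPR = 90/900 = 0.1

error_rate = (FN + FP) / (P + N)      # (50 + 90)/1000 = 0.14
mean_rates = (FP / N + FN / P) / 2    # (0.1 + 0.5)/2   = 0.30
# The two differ because P != N; with P = N they would coincide.
```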
FPR, FNR and sensitivity
◮ Suppose FPR = 0.06, FNR = 0.04 with the threshold set to 0.5 (default for RF)
◮ One could be interested in “limiting” the FNR → change the threshold
Figure: experimental curves of FPR and FNR vs. threshold t.
107/122
Comparing classifiers with FPR, FNR
◮ Classifier A: FPR = 0.06, FNR = 0.04
◮ Classifier B: FPR = 0.10, FNR = 0.01
Which one is better? We’d like to have one single index → EER, AUC
108/122
Equal Error Rate (EER)
Figure: FPR and FNR vs. threshold t, with the EER marked.
EER: the FPR at the value of t for which FPR = FNR
109/122
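The EER can be estimated by scanning thresholds and taking the one where FPR and FNR (nearly) cross. A sketch with hypothetical scores (the data and the 201-point grid are assumptions for illustration):

```python
# Sketch: estimating the EER by scanning thresholds on classifier scores.
import numpy as np

def eer(y_true, scores):
    best = None
    for t in np.linspace(0.0, 1.0, 201):
        pred = (scores >= t).astype(int)
        fpr = np.sum((y_true == 0) & (pred == 1)) / np.sum(y_true == 0)
        fnr = np.sum((y_true == 1) & (pred == 0)) / np.sum(y_true == 1)
        if best is None or abs(fpr - fnr) < best[0]:
            best = (abs(fpr - fnr), fpr)
    return best[1]   # the FPR at the threshold where FPR ≈ FNR

rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
s = np.concatenate([rng.normal(0.3, 0.15, 100), rng.normal(0.7, 0.15, 100)])
print(eer(y, np.clip(s, 0, 1)))   # low EER: the two classes overlap little
```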
AUC: Area Under the Curve
Figure: TPR vs. FPR, with the EER point marked.
AUC: the area under the TPR vs. FPR curve, plotted for different values of the threshold t
◮ the curve is called the Receiver Operating Characteristic (ROC)
110/122
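A sketch with scikit-learn (toy labels and scores are made up): `roc_curve` returns the TPR vs. FPR points as the threshold varies, and `roc_auc_score` the area under them.

```python
# Sketch: ROC curve and AUC with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per threshold
auc = roc_auc_score(y_true, scores)               # area under TPR vs. FPR
print(auc)   # 15 of the 16 (negative, positive) pairs are ranked correctly
```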
ROC and comparison
Figure: ROC curves (TPR vs. FPR) of classifiers C1 and C2, and of a random classifier.
Q: what does the bisector represent?
111/122
Other issues: robustness w.r.t. the threshold
Figure: FPR and FNR vs. threshold t; the curves look the “same” with other parameter settings.
112/122
Other issues: robustness w.r.t. random components
Consider A vs. B, AUC measured with cross-validation:
◮ A: 0.85, 0.73, 0.91, · · · → µ = 0.83, σ = 0.15
◮ B: 0.81, 0.78, 0.79, · · · → µ = 0.81, σ = 0.03
Can we say that A is better than B? (concerning effectiveness only)
In general, other sources of performance variability:
◮ random seed
◮ subclass of the problem class (e.g., image recognition of dogs, cats, . . . )
113/122
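A sketch of the comparison (the per-fold values beyond those on the slide are hypothetical, chosen so the means match the slide's µ): the mean alone hides the spread.

```python
# Sketch: mean vs. spread of per-fold AUC for two classifiers.
import statistics

auc_a = [0.85, 0.73, 0.91, 0.99, 0.67]  # A: mean 0.83, large spread
auc_b = [0.81, 0.78, 0.79, 0.84, 0.83]  # B: mean 0.81, small spread

for name, aucs in [("A", auc_a), ("B", auc_b)]:
    print(name, statistics.mean(aucs), statistics.stdev(aucs))
```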
Comparing techniques
Techniques A, B; different index (e.g., AUC) values:
◮ A → (x_a^1, x_a^2, . . . ) → random variable X_a
◮ B → (x_b^1, x_b^2, . . . ) → random variable X_b
Do X_a, X_b follow different distributions?
◮ yes: A and B are different (concerning the AUC)
◮ no: the difference in µ_a, µ_b might be due to randomness → A, B are not significantly different
114/122
Statistical significance in a nutshell
Just the way of thinking:
- 1. State a set of assumptions (the null hypothesis H0), e.g.:
◮ X_a, X_b are normally distributed and independent
◮ x̄_a = x̄_b (or x̄_a ≥ x̄_b)
◮ any other assumption in the statistical model
- 2. Perform a statistical test; the appropriate choice depends on many factors, e.g.:
◮ Wilcoxon test (many versions)
◮ Friedman test (many versions)
◮ . . .
- 3. . . . which outputs a p-value ∈ [0, 1]
◮ 0 is “good”, 1 is “bad”
115/122
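The testing step can be sketched with a simple permutation test on the difference of means, as a stand-in for the Wilcoxon/Friedman tests named above (the AUC samples are hypothetical): under H0 the group labels are exchangeable, so the p-value is the fraction of relabelings whose difference is at least as extreme as the observed one.

```python
# Sketch: exact permutation test on two small samples of per-fold AUCs.
import itertools
import statistics

auc_a = [0.85, 0.73, 0.91, 0.99, 0.67]
auc_b = [0.81, 0.78, 0.79, 0.84, 0.83]
pooled = auc_a + auc_b
observed = abs(statistics.mean(auc_a) - statistics.mean(auc_b))

count = total = 0
for idx in itertools.combinations(range(10), 5):   # every 5-vs-5 relabeling
    ga = [pooled[i] for i in idx]
    gb = [pooled[i] for i in range(10) if i not in idx]
    if abs(statistics.mean(ga) - statistics.mean(gb)) >= observed - 1e-12:
        count += 1
    total += 1
p_value = count / total
print(p_value)   # large p-value here: the difference may well be randomness
```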
p-value: meaning
0 is “good”, 1 is “bad”. The p-value is the degree to which the data conform to the pattern predicted by the null hypothesis:
◮ p-value = P(x_a^1, x_a^2, . . . , x_b^1, x_b^2, . . . | H0)
If the p-value is low:
◮ we’ve been very (un)lucky in having observed x_a^1, x_a^2, . . . , x_b^1, x_b^2, . . .
◮ “maybe” because H0 is not true
◮ Warning! Any part of H0, not necessarily the x̄_a = x̄_b part!
116/122
Statistical significance
Things are much more complex than this. . . Some interesting papers:
◮ Joaquín Derrac et al. “A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms”. In: Swarm and Evolutionary Computation 1.1 (2011), pp. 3–18
◮ Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. “How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments”. In: arXiv preprint arXiv:1806.08295 (2018)
◮ Sander Greenland et al. “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”. In: European Journal of Epidemiology 31.4 (2016), pp. 337–350
117/122
Subsection 4 Boosting
118/122
Many views and aggregation
In bagging/RF (regression):
◮ many views are different samples
◮ aggregation is the average
Alternative:
◮ many views are subsequent residuals
◮ aggregation is the sum
119/122
Boosting
When learning:
- 1. Current data is the learning data
- 2. Repeat B times:
2.1 learn a tree on the current data
2.2 the current data becomes the residuals of the learned tree (y − ŷ)
When predicting:
- 1. Repeat B times:
1.1 get a prediction from the ith learned tree
- 2. sum the predictions
Q: implementation differences w.r.t. RF?
120/122
Boosting (regression)
function BoostTrees(X, y)
  t(X) ← 0
  for i ∈ {1, 2, . . . , B} do
    ti ← BuildRegressionTree(X, y, d)
    t(X) ← t(X) + λti(X)
    y ← y − λti(X)
  end for
  return t
end function
◮ Each learned tree should be simple (maximum number of splits d)
◮ λ slows down learning
Trickier with classification.
121/122
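A runnable sketch of this pseudocode (assumed synthetic data), using scikit-learn's `DecisionTreeRegressor` in the role of BuildRegressionTree, with `max_depth` playing the role of the maximum-splits parameter d and `lam` the shrinkage λ:

```python
# Sketch: boosted regression trees, following the BoostTrees pseudocode.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, B=100, d=2, lam=0.1):
    trees, residuals = [], y.astype(float).copy()
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=d).fit(X, residuals)
        trees.append(tree)
        residuals -= lam * tree.predict(X)   # y <- y - λ t_i(X)
    return trees

def boost_predict(trees, X, lam=0.1):
    return lam * sum(tree.predict(X) for tree in trees)  # sum of λ t_i(X)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
trees = boost_trees(X, y)
mse = np.mean((boost_predict(trees, X) - y) ** 2)
print(mse)   # small training MSE: each tree fits the remaining residuals
```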
Boosting parameters
◮ λ is usually set to 0.01 or 0.001
◮ λ and B interact: for small λ, B should be large
◮ a large B can lead to overfitting (unlike in bagging/RF; Q: why?)
Find a good value for B with cross-validation.
(Both boosting and bagging are general techniques.)
122/122
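A sketch of choosing B on held-out data (synthetic data and all parameter values are assumptions): scikit-learn's `GradientBoostingRegressor` exposes `staged_predict`, which yields the model's predictions after each of the B trees, so the validation error can be tracked as B grows.

```python
# Sketch: picking the number of trees B on a validation set.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 400)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.01, max_depth=2, random_state=0
).fit(X_tr, y_tr)

# Validation MSE after 1, 2, ..., 500 trees; pick the B that minimizes it.
val_mse = [np.mean((pred - y_va) ** 2) for pred in gbr.staged_predict(X_va)]
best_B = int(np.argmin(val_mse)) + 1
print(best_B)
```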
Bagging/RF/boosting in summary
Comparison of Tree, Bagging, RF, and Boosting along these dimensions:
◮ interpretability
◮ numeric/categorical variables
◮ accuracy
◮ test error estimate
◮ variable importance
◮ confidence/tunability
◮ fast to learn (∗)
◮ (almost) non-parametric
∗: Q: how much faster? when? does it matter?