Introduction to Machine Learning
Andrea De Lorenzo, A.Y. 2020

Section 1: General information

Lecturer: Andrea De Lorenzo, Dipartimento di Ingegneria e Architettura (DIA), http://delorenzo.inginf.units.it/


  1. Notation and terminology
Different communities (e.g., statistical learning vs. machine learning vs. artificial intelligence) use different terms and notation:
◮ x_j^(i) instead of x_{i,j} (hence x^(i) instead of x_i)
◮ m instead of n and n instead of p
◮ ...
Focus on the meaning!

  2. Iris: visual interpretation
Simplification: forget the petal variables and I. virginica → 2 variables (sepal length, sepal width), 2 species (binary classification problem).
[Scatter plot: sepal width vs. sepal length, for I. setosa and I. versicolor]
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations

  7. “A” model?
There could be many possible models:
◮ how to choose?
◮ how to compare?
Q: a model of what?

  8. Choosing the model
The choice of the model/tool/technique to be used is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explainability of the model
◮ ...
We will see some options.

  9. Comparing many models
Experimentally: does the model work well on (new) data?
Define “works well”:
◮ a single performance index?
◮ how to measure?
◮ repeatability/reproducibility... Q: what’s the difference?
We will see/discuss some options.

  11. It does not work well... Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues.

  12. ML is not magic
Problem: find the birth town (Trieste or Udine) from height/weight.
[Scatter plot: height [cm] vs. weight [kg], labeled by birth town]
Q: which is the data issue here?

  13. Implementation
When “solving” a problem, we usually need to:
◮ explore/visualize data
◮ apply one or more ML techniques
◮ assess learned models
“By hand?” No, with software!

  14. ML/DM software
Many options:
◮ libraries for general-purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ ...
◮ specialized sw environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch

  15. ML/DM software: which one?
◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ ...
◮ previous knowledge/skills

  16. ML/DM software: why?
In all cases, software allows us to be more productive and concise. E.g., learn and use a model for classification, in Java + Smile:

```java
double[][] instances = ...;
int[] labels = ...;
RandomForest classifier = (new RandomForest.Trainer()).train(instances, labels);
double[] newInstance = ...;
int newLabel = classifier.predict(newInstance);
```

In R:

```r
d = ...
classifier = randomForest(label~., d)
newD = ...
newLabels = predict(classifier, newD)
```

  17. Section 3: Plotting data: an overview

  18. Advanced plotting
◮ many packages (e.g., ggplot2)
◮ many options
Which is the most appropriate chart to support a thesis?

  19. Aim of a plot: examples
[Four slides of example charts]

  23. Section 4: Tree-based methods

  24. The carousel robot attendant
Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel.

  25. Carousel: data
Observed the human attendant’s decisions.
[Scatter plot: height h [cm] vs. age a [year], labeled “Can ride”/“Cannot ride”]
How can the robot take the decision?
◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!
Decision tree!

  a < 10
  T: can’t
  F: h < 120
     T: can’t
     F: can!
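The decision rules above can be written as a tiny function; a minimal Python sketch (the function name is ours, not from the slides):

```python
def can_ride(age, height):
    """Carousel decision tree: younger than 10 -> cannot ride;
    otherwise, shorter than 120 cm -> cannot ride; otherwise -> can."""
    if age < 10:
        return False
    return height >= 120
```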

  30. How to build a decision tree
Divide and conquer (dividi et impera), recursively:
◮ find a cut variable and a cut value
◮ for the left branch, divide and conquer
◮ for the right branch, divide and conquer

  31. How to build a decision tree: detail
Recursive binary splitting:

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

◮ Recursive binary splitting
◮ Top down (start from the “big” problem)
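The pseudocode can be sketched in runnable Python; a minimal illustration (function names are ours), including a simple classification-error BestBranch with the unweighted sum E(left) + E(right):

```python
from collections import Counter

def most_common(ys):
    """Majority class in a list of labels."""
    return Counter(ys).most_common(1)[0][0]

def error(ys):
    """Classification error: fraction of labels not in the majority class."""
    return 1 - Counter(ys).most_common(1)[0][1] / len(ys)

def best_branch(X, ys):
    """Greedy search for (variable index, threshold) minimizing
    error(left) + error(right); returns None if no valid split exists."""
    best = None
    for i in range(len(X[0])):
        for t in sorted({x[i] for x in X}):
            left = [y for x, y in zip(X, ys) if x[i] < t]
            right = [y for x, y in zip(X, ys) if x[i] >= t]
            if not left or not right:
                continue  # skip splits leaving one side empty
            e = error(left) + error(right)
            if best is None or e < best[0]:
                best = (e, i, t)
    return (best[1], best[2]) if best else None

def build_tree(X, ys, k_min=2):
    """Recursive binary splitting: a leaf is a label,
    a branch node is a tuple (i, t, left_subtree, right_subtree)."""
    branch = best_branch(X, ys)
    if len(set(ys)) == 1 or len(ys) < k_min or branch is None:
        return most_common(ys)
    i, t = branch
    left = [(x, y) for x, y in zip(X, ys) if x[i] < t]
    right = [(x, y) for x, y in zip(X, ys) if x[i] >= t]
    return (i, t,
            build_tree([x for x, _ in left], [y for _, y in left], k_min),
            build_tree([x for x, _ in right], [y for _, y in right], k_min))

def predict(tree, x):
    """Follow branch nodes until a leaf label is reached."""
    while isinstance(tree, tuple):
        i, t, left, right = tree
        tree = left if x[i] < t else right
    return tree
```

On carousel-like data (columns: age, height) the sketch recovers a sensible split and classifies new observations.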

  32. Best branch

function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Classification error on a subset:
E(y) = |{y ∈ y : y ≠ ŷ}| / |y|, with ŷ = the most common class in y

◮ Greedy (choose the split that minimizes the error now, not in later steps)

  33. Best branch
(i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
The formula says what is done, not how it is done!
Q: “how” can different methods differ?

  34. Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Other possible criterion:
◮ tree depth larger than d_max
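As a minimal Python sketch (the function name is ours):

```python
def should_stop(ys, k_min=10):
    """Stop splitting when the node is pure (a single class)
    or has fewer than k_min observations."""
    return len(set(ys)) == 1 or len(ys) < k_min
```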

  35. Best branch criteria
The classification error E() works, but has been shown to be “not sufficiently sensitive for tree-growing”.

E(y) = |{y ∈ y : y ≠ ŷ}| / |y| = 1 − max_c (|{y ∈ y : y = c}| / |y|) = 1 − max_c p_{y,c}

Two other options:
◮ Gini index: G(y) = Σ_c p_{y,c} (1 − p_{y,c})
◮ Cross-entropy: D(y) = −Σ_c p_{y,c} log p_{y,c}

For all indexes, the lower the better (node impurity).
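All three indexes are functions of the class proportions p_{y,c}; a minimal Python sketch (function names are ours):

```python
from collections import Counter
from math import log

def proportions(ys):
    """Class proportions p_{y,c} for each class c present in ys."""
    return [n / len(ys) for n in Counter(ys).values()]

def class_error(ys):
    """Classification error: 1 - max_c p_{y,c}."""
    return 1 - max(proportions(ys))

def gini(ys):
    """Gini index: sum_c p_{y,c} (1 - p_{y,c})."""
    return sum(p * (1 - p) for p in proportions(ys))

def cross_entropy(ys):
    """Cross-entropy: -sum_c p_{y,c} log p_{y,c}."""
    return -sum(p * log(p) for p in proportions(ys))
```

All three are 0 on a pure node and maximal on a perfectly mixed one.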

  36. Best branch criteria: binary classification
[Plot: classification error E, Gini index G, and cross-entropy D as functions of p_{y,c} ∈ [0, 1]; cross-entropy is rescaled]
Q: what happens with multiclass problems?

  37. Categorical independent variables
◮ Trees can work with categorical variables
◮ The branch node test is x_i = c or x_i ∈ C′ ⊂ C (c is a class)
◮ Can mix categorical and numeric variables

  38. Stopping criterion: role of k_min
Suppose k_min = 1 (never stop because of the size of y).
[Scatter plot of the carousel data, overlaid with many splits: h < 120, a < 9.0, a < 10, a < 9.6, a < 9.1, a < 9.4]
Q: what’s wrong? (recall: “a model of what?”)

  39. Tree complexity
When the tree is “too complex”:
◮ less readable/understandable/explainable
◮ maybe there was noise in the data
Q: what’s noise in the carousel data?
Tree complexity is not related (only) to k_min, but also to the data.

  40. Tree complexity: other interpretation
◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)

  41. High variance
“The model varies if the learning data varies”: what? why does the data vary?
◮ the learning data is about the system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!

  44. Spotting overfitting
[Plot: learning error and test error vs. model complexity]
Test error: error on unseen data

  46. k-fold cross-validation
Where can I find “unseen data”? Pretend to have it!
1. split the learning data (X and y) into k equal slices (each of n/k observations/elements)
2. for each fold (i.e., each i ∈ {1, ..., k}):
  2.1 learn on all but the i-th slice
  2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize beyond the available data?
◮ how will the learned artifact behave on unseen data?
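The procedure above can be sketched generically in Python (a minimal illustration; `learn` and `err` stand for any learner and error measure, and all names are ours):

```python
def k_fold_cv(X, ys, k, learn, err):
    """Split the data into k slices; for each fold, learn on the
    other k-1 slices, measure the error on the held-out slice,
    then average the k errors."""
    n = len(ys)
    # index ranges of the k (roughly) equal slices
    slices = [range(f * n // k, (f + 1) * n // k) for f in range(k)]
    errors = []
    for held_out in slices:
        train_idx = [i for i in range(n) if i not in held_out]
        model = learn([X[i] for i in train_idx],
                      [ys[i] for i in train_idx])
        errors.append(err(model,
                          [X[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(errors) / k
```

For example, with a trivial majority-class learner, `k_fold_cv(X, ys, 3, learn, err)` returns the average held-out classification error over the 3 folds.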

  47. k-fold cross-validation
[Diagram: foldings 1 to 5, each producing error_1, ..., error_5]

error = (1/k) Σ_{i=1}^{k} error_i

Or with any other meaningful (effectiveness) measure.
Q: how should the data be split?

  48. Fighting overfitting with trees
◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune

  49. Pruning: high-level idea
1. learn a full tree t_0
2. build from t_0 a sequence T = {t_0, t_1, ..., t_n} of trees such that:
  ◮ t_i is a root-subtree of t_{i−1} (t_i ⊂ t_{i−1})
  ◮ t_i is always less complex than t_{i−1}
3. choose the t ∈ T with minimum classification error with k-fold cross-validation

  50. k-fold cross-validation: data splitting
Q: how should the data be split? Example: Android malware detection
◮ Gerardo Canfora et al. “Effectiveness of opcode ngrams for detection of multi family android malware”. In: Availability, Reliability and Security (ARES), 2015 10th International Conference on. IEEE. 2015, pp. 333–340
◮ Gerardo Canfora et al. “Detecting android malware using sequences of system calls”. In: Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. ACM. 2015, pp. 13–20

  51. Using cross-validation (CV) for assessment (I)
How will the learned artifact behave on unseen data?
More precisely: how will an artifact learned with this learning technique behave on unseen data?

  52. Using CV for assessment (II)
“This learning technique” = BuildDecisionTree() with k_min = 10
1. repeat k times:
  1.1 BuildDecisionTree() with k_min = 10 on all but one slice
    ◮ (k−1)/k · n observations in each X passed to BuildDecisionTree()
  1.2 compute the classification error on the left-out slice
2. average the computed classification errors
k invocations of BuildDecisionTree()

  53. Using CV for assessment (III)
“This learning technique” = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV
For assessing this technique, we do two nested CVs:
1. repeat k times:
  1.1 choose k_min among m values with a 10-fold CV (repeat BuildDecisionTree() 10m times) on all but one slice
    ◮ (9/10) · ((k−1)/k) · n observations in each X passed to BuildDecisionTree()!
  1.2 compute the classification error on the left-out slice
    ◮ usually, a new tree is built on (k−1)/k · n observations
2. average the computed classification errors
(10m + 1)k invocations of BuildDecisionTree()
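The invocation count can be checked with a one-liner (a trivial sketch; the function name is ours):

```python
def nested_cv_invocations(k, m, inner_k=10):
    """Invocations of BuildDecisionTree() in the nested CV above:
    for each of the k outer folds, an inner inner_k-fold CV over
    m candidate k_min values, plus one final build per fold."""
    return (inner_k * m + 1) * k
```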

  54. Using CV for assessment: “cheating”
“This learning technique” = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV.
Using just one CV is cheating (cherry picking)!
◮ k_min is chosen exactly to minimize the error on the full dataset
◮ conceptually, this way of “fitting” k_min is similar to the way we build the tree

  55. Subsection 1: Regression trees

  56. Regression with trees
Trees can be used for regression instead of classification: decision tree vs. regression tree.

  57. Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

Q: what should we change?

  58. Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← ȳ   ⊲ the mean of y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

  59. Best branch

function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Q: what should we change?

  60. Best branch

function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} Σ_{y_j ∈ y|x_i≥t} (y_j − ȳ)² + Σ_{y_j ∈ y|x_i<t} (y_j − ȳ)²
  return (i*, t*)
end function

Minimize the sum of the residual sums of squares (RSS) (the two ȳ are different).
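The regression variant of the branch search can be sketched in Python, mirroring the classification version (a minimal illustration; function names are ours):

```python
def rss(ys):
    """Residual sum of squares of ys around its own mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_branch_regression(X, ys):
    """Pick (i, t) minimizing RSS(left) + RSS(right), where each
    RSS is computed around the mean of its own subset."""
    best = None
    for i in range(len(X[0])):
        for t in sorted({x[i] for x in X}):
            left = [y for x, y in zip(X, ys) if x[i] < t]
            right = [y for x, y in zip(X, ys) if x[i] >= t]
            if not left or not right:
                continue  # skip splits leaving one side empty
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, i, t)
    return (best[1], best[2]) if best else None
```

On a toy one-variable dataset with two clearly separated output levels, the sketch finds the separating threshold.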

  61. Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Q: what should we change?

  62. Stopping criterion

function ShouldStop(y)
  if the RSS of y is 0 then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

  63. Interpretation
[Plot: predictions of a regression tree over a one-dimensional input]
