
Introduction to Machine Learning
Eric Medvet, 16/3/2017

Outline:
◮ Machine Learning: what and why?
◮ Motivating example
◮ Tree-based methods
◮ Regression trees
◮ Trees aggregation

Teachers: Eric Medvet, Dipartimento di Ingegneria e …


1. Iris: visual interpretation
Simplification: forget the petal variables and I. virginica → 2 variables, 2 species (binary classification problem).
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations
[Figure: scatter plot of sepal width vs. sepal length, I. setosa vs. I. versicolor]

2. “A” model?
There could be many possible models:
◮ how to choose?
◮ how to compare?

3. Choosing the model
The choice of the model/tool/algorithm to be used is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explicability of the model
◮ ...
We will see many options.

4–5. Comparing many models
Experimentally: does the model work well on (new) data?
Define “works well”:
◮ a single performance index?
◮ how to measure?
◮ repeatability/reproducibility...
We will see/discuss many options.

6. It does not work well... Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues.

7. ML is not magic
Problem: find the birth town (Trieste or Udine) from height/weight.
[Figure: scatter plot of height [cm] vs. weight [kg] for people born in Trieste and Udine]
Q: which is the data issue here?

8. Implementation
When “solving” a problem, we usually need to:
◮ explore/visualize data
◮ apply one or more learning algorithms
◮ assess learned models
“By hand?” No, with software!

9. ML/DM software
Many options:
◮ libraries for general-purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ ...
◮ specialized sw environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch
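As a minimal sketch of the earlier Iris example with scikit-learn (the Python option above): learn a classifier, then “use” it on a new observation. The choice of a decision tree here anticipates the next section; any classifier would do.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
keep = iris.target != 2          # drop I. virginica -> binary problem
X = iris.data[keep][:, :2]       # keep only sepal length and sepal width
y = iris.target[keep]

model = DecisionTreeClassifier().fit(X, y)   # 1. learn a model (classifier)
print(model.predict([[5.0, 3.4]]))           # 2. "use" it on a new observation
```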

10. ML/DM software: which one?
◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ ...
◮ previous knowledge/skills

11. Section 2: Tree-based methods

12. The carousel robot attendant
Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel.

13–17. Carousel: data
Observed human attendant’s decisions. How can the robot take the decision?
◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!
A decision tree! a < 10? (T → can’t; F → h < 120? (T → can’t; F → can!))
[Figure: scatter plot of height h [cm] vs. age a [year], “Cannot ride” vs. “Can ride”]
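The hand-built decision tree above, written as plain code — a minimal sketch; the function and argument names are ours, not the slides’.

```python
def can_ride(a: float, h: float) -> bool:
    """a = age [year], h = height [cm]."""
    if a < 10:       # younger than 10 -> can't
        return False
    if h < 120:      # otherwise, shorter than 120 -> can't
        return False
    return True      # otherwise -> can

assert can_ride(12, 150) and not can_ride(8, 150) and not can_ride(12, 110)
```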

18. How to build a decision tree
Divide et impera (divide and conquer), recursively:
◮ find a cut variable and a cut value
◮ for the left branch, divide et impera
◮ for the right branch, divide et impera

19. How to build a decision tree: detail
Recursive binary splitting:

    function BuildDecisionTree(X, y)
      if ShouldStop(y) then
        ŷ ← most common class in y
        return new terminal node with ŷ
      else
        (i, t) ← BestBranch(X, y)
        n ← new branch node with (i, t)
        append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
        append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
        return n
      end if
    end function

◮ Recursive binary splitting
◮ Top-down (start from the “big” problem)

20. Best branch

    function BestBranch(X, y)
      (i⋆, t⋆) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
      return (i⋆, t⋆)
    end function

Classification error on a subset:
E(y) = |{y ∈ y : y ≠ ŷ}| / |y|, where ŷ is the most common class in y.
◮ Greedy (chooses the split that minimizes the error now, not in later steps)

21. Best branch
(i⋆, t⋆) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
The formula says what is done, not how it is done!
Q: can different “hows” differ? How?

22. Stopping criterion

    function ShouldStop(y)
      if y contains only one class then
        return true
      else if |y| < k_min then
        return true
      else
        return false
      end if
    end function

Another possible criterion:
◮ tree depth larger than d_max
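A runnable sketch assembling the three pseudocode functions above (BuildDecisionTree, BestBranch, ShouldStop). The names, the tuple-based tree representation, and the tiny dataset are our assumptions, not the slides’ notation.

```python
from collections import Counter
import numpy as np

K_MIN = 3  # k_min: stop splitting when a subset has fewer elements than this

def error(y):
    """Classification error E(y): fraction of y differing from the majority class."""
    if len(y) == 0:
        return 0.0
    y_hat = Counter(y).most_common(1)[0][0]
    return sum(1 for v in y if v != y_hat) / len(y)

def should_stop(y):
    return len(set(y)) == 1 or len(y) < K_MIN

def best_branch(X, y):
    """Greedily pick (i, t) minimizing E(y | x_i >= t) + E(y | x_i < t)."""
    best = None
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i])[1:]:  # skip the min so both sides are non-empty
            e = error(y[X[:, i] < t]) + error(y[X[:, i] >= t])
            if best is None or e < best[0]:
                best = (e, i, t)
    return None if best is None else best[1:]

def build_tree(X, y):
    branch = None if should_stop(y) else best_branch(X, y)
    if branch is None:  # terminal node: predict the most common class
        return Counter(y).most_common(1)[0][0]
    i, t = branch
    left = X[:, i] < t
    return (i, t, build_tree(X[left], y[left]), build_tree(X[~left], y[~left]))

# Tiny usage example on carousel-like (age, height) data; values are made up:
X = np.array([[4, 100], [8, 130], [9, 140], [11, 115], [12, 150], [14, 125]], dtype=float)
y = np.array(["no", "no", "no", "no", "yes", "yes"])
print(build_tree(X, y))  # nested (i, t, left_subtree, right_subtree) tuples
```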

23. Categorical independent variables
◮ Trees can work with categorical variables
◮ The branch node test is x_i = c or x_i ∈ C′ ⊂ C (c is a category)
◮ Can mix categorical and numeric variables

24. Stopping criterion: role of k_min
Suppose k_min = 1 (never stop because of the size of y).
[Figure: the carousel data split by a deep tree with many branches (h < 120, a < 10, a < 9.0, a < 9.1, a < 9.4, a < 9.6)]
Q: what’s wrong?

25. Tree complexity
When the tree is “too complex”:
◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data
Q: what’s the noise in the carousel data?
The tree complexity issue is not related (only) to k_min.

26. Tree complexity: other interpretation
◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)

27–29. High variance
“The model varies if the learning data varies”: what? why does the data vary?
◮ learning data is about the system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!

30. Fighting overfitting
◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune
(bias and variance will be detailed later)
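A sketch of how the four countermeasures above map onto scikit-learn’s tree hyperparameters; the values are arbitrary examples, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    min_samples_split=20,        # roughly the "large k_min" stopping criterion
    max_depth=5,                 # when building, limit depth
    min_impurity_decrease=0.01,  # when building, don't split on a low impurity decrease
    ccp_alpha=0.005,             # after building, prune (cost-complexity pruning)
)
```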

31. Evaluation: k-fold cross-validation
How to estimate the predictor performance on new (unavailable) data?
1. split the learning data (X and y) into k equal slices (each of n/k observations/elements)
2. for each fold (i.e., each i ∈ {1, ..., k}):
   2.1 learn on all but the i-th slice
   2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize on the available data?
◮ how will the learned artifact behave on unseen data?

32. Evaluation: k-fold cross-validation
folding 1 → accuracy_1
folding 2 → accuracy_2
folding 3 → accuracy_3
folding 4 → accuracy_4
folding 5 → accuracy_5

accuracy = (1/k) Σ_{i=1}^{k} accuracy_i

Or with the classification error rate, or any other meaningful (effectiveness) measure.
Q: how should the data be split?
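A minimal k-fold cross-validation sketch following the steps above; scikit-learn’s KFold does the slicing (the library choice is our assumption, not the slides’).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # learn on all but one slice
    accuracies.append(model.score(X[test_idx], y[test_idx]))          # measure on the unseen slice
print(np.mean(accuracies))  # accuracy = (1/k) * sum_i accuracy_i
```

Shuffling before slicing is one answer to the “how should the data be split?” question: it avoids slices that merely follow an ordering already present in the data.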

33. Subsection 1: Regression trees

34. Regression with trees
Trees can be used for regression instead of classification:
decision tree vs. regression tree.

35–36. Tree building: decision → regression
Q: what should we change in BuildDecisionTree?

    function BuildDecisionTree(X, y)
      if ShouldStop(y) then
        ŷ ← ȳ                    ⊲ the mean of y, instead of the most common class
        return new terminal node with ŷ
      else
        (i, t) ← BestBranch(X, y)
        n ← new branch node with (i, t)
        append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
        append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
        return n
      end if
    end function
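In code, two small changes turn the earlier classification sketch into a regression tree: the leaf predicts the mean of y, and the split criterion scores the spread of y. The slides show only the leaf change; using the residual sum of squares as E is a standard choice we assume here.

```python
import numpy as np

def leaf_value(y):
    return float(np.mean(y))  # terminal node: y_hat = mean of y

def error(y):
    # RSS of the subset around its mean, replacing the classification error
    return 0.0 if len(y) == 0 else float(np.sum((np.asarray(y) - np.mean(y)) ** 2))
```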

37. Interpretation
[Figure: a regression tree’s piecewise-constant prediction plotted over one-dimensional data]

38. Regression and overfitting
[Image from F. Daolio]

39. Trees in summary
Pros:
✓ easily interpretable/explicable
✓ learning and regression/classification easily understandable
✓ can handle both numeric and categorical values
Cons:
✗ not so accurate (Q: always?)

40. Tree accuracy?
[Image from An Introduction to Statistical Learning]

41. Subsection 2: Trees aggregation

42. Weakness of the tree
[Figure: noisy one-dimensional data with a “curve” on the right part]
Small tree:
◮ low complexity
◮ will hardly fit the “curve” on the right part
◮ high bias, low variance
Big tree:
◮ high complexity
◮ may overfit the noise
◮ low bias, high variance

43. The trees view
Small tree:
◮ “a car is something that moves”
Big tree:
◮ “a car is a made-in-Germany blue object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing the engine”

44. Big tree view
A big tree:
◮ has a detailed view of the learning data (high complexity)
◮ “trusts too much” the learning data (high variance)
What if we “combine” different big-tree views and ignore the details on which they disagree?

45. Wisdom of the crowds
What if we “combine” different big-tree views and ignore the details on which they disagree?
◮ many views
◮ independent views
◮ aggregation of views
≈ the wisdom of the crowds: a collective opinion may be better than a single expert’s opinion

46–49. Wisdom of the trees
◮ many views
  ◮ just use many trees
◮ independent views
  ◮ ??? learning is deterministic: same data ⇒ same tree ⇒ same view
◮ aggregation of views
  ◮ just average the predictions (regression) or take the most common prediction (classification)

50. Independent views
Independent views ≡ different points of view ≡ different learning data.
But we have only one learning data set!

51–53. Independent views: idea!
Like in cross-validation, consider only a part of the data, but:
◮ instead of a subset,
◮ a sample with repetitions:

    X   = (x_1^T, x_2^T, x_3^T, x_4^T, x_5^T)   original learning data
    X_1 = (x_1^T, x_5^T, x_3^T, x_2^T, x_5^T)   sample 1
    X_2 = (x_4^T, x_2^T, x_3^T, x_1^T, x_1^T)   sample 2
    X_i = ...                                   sample i

◮ (y omitted for brevity)
◮ the learning data size is not a limitation (differently than with a subset)
This is bagging of trees (bootstrap, more in general).
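A minimal sketch of drawing one bootstrap sample as above: n rows sampled with repetitions from the original learning data (numpy assumed; the toy X stands in for real data).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)                      # stands in for the learning data
idx = rng.choice(len(X), size=len(X), replace=True)  # row indices, with repetitions
X_i = X[idx]                                         # sample i (apply idx to y as well)
print(idx, X_i, sep="\n")
```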

54. Tree bagging
When learning:
1. repeat B times:
   1.1 take a sample of the learning data
   1.2 learn a tree (unpruned)
When predicting:
1. for i = 1, ..., B:
   1.1 get a prediction from the i-th learned tree
2. predict the average (or the most common) prediction
For classification, other aggregations can be used: majority voting (most common) is the simplest.
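A sketch of the procedure just described: B bootstrap samples, B unpruned trees (scikit-learn’s defaults grow trees fully), majority voting when predicting. The library choice and names are our assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

B = 200
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(B):                                        # when learning, repeat B times:
    idx = rng.choice(len(X), size=len(X), replace=True)   # 1.1 sample the learning data
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # 1.2 learn a tree

def predict(x_new):
    votes = [t.predict([x_new])[0] for t in trees]        # one prediction per tree
    return Counter(votes).most_common(1)[0][0]            # majority voting

print(predict(X[0]))
```

scikit-learn also packages the same idea as sklearn.ensemble.BaggingClassifier.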

55–56. How many trees?
B is a parameter:
◮ when there is a parameter, there is the problem of finding a good value
◮ remember k_min, depth (Q: impact on?)
◮ it has been shown (experimentally) that:
  ◮ for “large” B, bagging is better than a single tree
  ◮ increasing B does not cause overfitting
  ◮ (for us: the default B is ok! “large” ≈ hundreds)
Q: how much better? At which cost?

57. Bagging
[Figure: test error (×10⁻²) vs. the number B of trees, from 0 to 500]

58. Independent views: improvement
Despite being learned on different samples, bagged trees may be correlated, hence the views are not very independent:
◮ e.g., when one variable is much more important than the others for predicting (a strong predictor)
Idea: force the points of view to differentiate by “hiding” variables.

59. Random forest
When learning:
1. repeat B times:
   1.1 take a sample of the learning data
   1.2 consider only m of the p independent variables
   1.3 learn a tree (unpruned)
When predicting:
1. for i = 1, ..., B:
   1.1 get a prediction from the i-th learned tree
2. predict the average (or the most common) prediction
◮ (observations and) variables are randomly chosen...
◮ ...to learn a forest of trees
Q: are missing variables a problem?
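In scikit-learn (our assumed library), B is n_estimators and m is max_features; note that scikit-learn re-draws the m variables at each split rather than once per tree. A sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=500,     # B: "large" ~ hundreds
    max_features="sqrt",  # m = sqrt(p), the usual suggestion for classification
).fit(X, y)
print(forest.predict(X[:1]))
```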

60. Random forest: parameter m
How to choose the value for m?
◮ m = p → bagging
◮ it has been shown (experimentally) that:
  ◮ m does not relate to overfitting
  ◮ m = √p is good for classification
  ◮ m = p/3 is good for regression
◮ (for us, the default m is ok!)

61. Random forest
Experimentally shown to be one of the “best” multi-purpose supervised classification methods:
◮ Manuel Fernández-Delgado et al. “Do we need hundreds of classifiers to solve real world classification problems?”. In: J. Mach. Learn. Res. 15.1 (2014), pp. 3133–3181.
but...

62. No free lunch!
“Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.”
◮ David H. Wolpert. “The lack of a priori distinctions between learning algorithms”. In: Neural Computation 8.7 (1996), pp. 1341–1390.
Why “free lunch”?
◮ many restaurants, many items on the menus, many possible prices for each item: where to go to eat?
◮ no general answer
◮ but, if you are a vegan, or like pizza, then a best choice could exist
Q: problem? algorithm?

63–64. Nature of the prediction
Consider classification:
◮ tree → the class
  ◮ “virginica” is just “virginica”
◮ forest → the class, as resulting from a voting
  ◮ “241 virginica, 170 versicolor, 89 setosa” is different from “478 virginica, 10 versicolor, 2 setosa”
Is this information useful/exploitable?
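The vote breakdown above is exposed in scikit-learn as predict_proba — a sketch (strictly, scikit-learn’s forest averages per-tree class probabilities rather than counting raw votes).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print(forest.predict(X[:1]))        # just the class, like a single tree
print(forest.predict_proba(X[:1]))  # the per-class shares behind that class
```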
