

  1. Probability and Statistics for Computer Science
  "…many problems are naturally classification problems" ---Prof. Forsyth
  Credit: wikipedia
  Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 11.5.2020

  2. Last time
  • Demo of Principal Component Analysis
  • Introduction to classification

  3. Objectives
  • Decision tree (II)
  • Random forest
  • Support Vector Machine (I)

  4. Classifiers
  • Why do we need classifiers?
  • What do we use to quantify the performance of a classifier? (the confusion matrix)
  • What is the baseline accuracy of a 5-class classifier using the 0-1 loss function?
  • What's validation and cross-validation in classification?

  5. Performance of a multiclass classifier
  • Assuming there are c classes:
  • The class confusion matrix is c × c (axes: true label vs. predicted label)
  • Under the 0-1 loss function, accuracy = (sum of diagonal terms) / (sum of all terms); i.e., in the example confusion matrix, accuracy = 32/38 ≈ 84% (Source: scikit-learn)
  • The baseline accuracy is 1/c.
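As a quick illustration, here is a minimal Python sketch of this computation. The 3×3 matrix below is hypothetical, chosen only so that its diagonal sums to 32 out of 38 items, matching the slide's example.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix (rows: true label, columns: predicted label),
# chosen so the diagonal sums to 32 of 38 items, as in the slide's example.
conf_mat = np.array([
    [13, 1, 1],
    [2, 10, 1],
    [1, 0, 9],
])

accuracy = np.trace(conf_mat) / conf_mat.sum()   # sum of diagonal / sum of all terms
print(f"accuracy = {accuracy:.2%}")              # ~84%

baseline = 1 / conf_mat.shape[0]                 # baseline accuracy of a c-class classifier
print(f"baseline = {baseline:.2%}")
```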

  6. Cross-validation
  • Split the data randomly, in multiple ways, into training and validation (test) sets
  • e.g., k-fold cross-validation or leave-one-out
  • Purpose?
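To make the splitting step concrete, here is a minimal sketch of random k-fold index splitting; the classifier training and evaluation on each fold is omitted, and the function name and fold count are illustrative.

```python
import numpy as np

def k_fold_indices(n_items, k=5, seed=0):
    """Randomly split item indices into k folds for cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_items)
    return np.array_split(indices, k)

# Each fold takes a turn as the validation set; the remaining folds form the training set.
folds = k_fold_indices(n_items=20, k=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: {len(train_idx)} training items, {len(val_idx)} validation items")
```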

  7. Q1. Cross-validation
  Cross-validation is a method used to prevent overfitting in classification.
  A. TRUE
  B. FALSE

  8. Decision tree: object classification
  • The object classification decision tree can classify objects into multiple classes using a sequence of simple tests. It will naturally grow into a tree.
  [Figure: an example tree with tests such as moving or not, has parts or holes, human or non-human, big or small, leading to leaves: chair leg, toddler, cat, dog, sofa, box]

  9. Training a decision tree: example
  • The "Iris" data set
  [Figure: scatter plot of the three Iris classes (Setosa, Versicolor, Virginica); where should the first split be placed?]

  10. Training a decision tree
  • Choose a dimension/feature and a split
  • Split the training data into left- and right-child subsets D_l and D_r
  • Repeat the two steps above recursively on each child
  • Stop the recursion based on some conditions
  • Label the leaves with class labels
  (A minimal sketch of this recursion follows below.)
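This sketch assumes a choose_split(data, labels) helper (one possible version is sketched after slide 24) that returns the chosen feature, a threshold, and the left/right index partition; the stopping thresholds and dictionary layout are illustrative choices, not from the slides.

```python
from collections import Counter

MAX_DEPTH = 5        # illustrative stopping thresholds, not from the slides
MIN_SUBSET_SIZE = 5

def train_tree(data, labels, depth=0):
    """Recursively grow a decision tree; returns a nested dict."""
    # Stop if all items share one class, the subset is small, or the tree is deep.
    if len(set(labels)) == 1 or len(labels) < MIN_SUBSET_SIZE or depth >= MAX_DEPTH:
        return {"leaf": True, "counts": Counter(labels)}

    # Choose a dimension/feature and a split (see the sketch after slide 24).
    feature, threshold, (left_idx, right_idx) = choose_split(data, labels)

    return {
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "left": train_tree([data[i] for i in left_idx], [labels[i] for i in left_idx], depth + 1),
        "right": train_tree([data[i] for i in right_idx], [labels[i] for i in right_idx], depth + 1),
    }
```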

  11. Classifying with a decision tree: example
  • The "Iris" data set
  [Figure: the trained splits separate the Setosa, Versicolor, and Virginica regions of the scatter plot]

  12. Choosing a split
  • An informative split makes the subsets more concentrated and reduces uncertainty about class labels

  13. Choosing a split
  • An informative split makes the subsets more concentrated and reduces uncertainty about class labels

  14. Choosing a split
  • An informative split makes the subsets more concentrated and reduces uncertainty about class labels
  [Figure: the informative candidate split is marked ✔, the uninformative one ✖]

  15. Which is more informative?

  16. Quantifying uncertainty using entropy
  • We can measure uncertainty as the number of bits of information needed to distinguish between classes in a dataset (first introduced by Claude Shannon)
  • We need log2(2) = 1 bit to distinguish 2 equal classes
  • We need log2(4) = 2 bits to distinguish 4 equal classes
  (Claude Shannon, 1916-2001)

  17. Quantifying uncertainty using entropy
  • Entropy (Shannon entropy) is the measure of uncertainty for a general distribution
  • If class i contains a fraction P(i) of the data, we need log2(1/P(i)) bits for that class
  • The entropy H(D) of a dataset is defined as the weighted mean of entropy for every class:
    H(D) = Σ_{i=1}^{c} P(i) log2(1/P(i)) = −Σ_{i=1}^{c} P(i) log2 P(i)
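A small sketch of this formula in Python, with P(i) estimated by counting labels; the function name is illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(D) = -sum_i P(i) * log2 P(i), in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Five items, 3 in one class and 2 in another, as in the next slide:
print(f"{entropy(['a', 'a', 'a', 'b', 'b']):.3f}")   # 0.971 bits
```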

  18. Entropy: before the split
  H(D) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits
  H(D_l) = ?
  H(D_r) = ?

  19. Entropy: examples
  H(D) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits
  H(D_l) = −1 · log2(1) = 0 bits

  20. Entropy: examples
  H(D) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits
  H(D_l) = −1 · log2(1) = 0 bits
  H(D_r) = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.918 bits

  21. Information gain of a split
  • The information gain of a split is the amount of entropy that was reduced on average after the split:
    I = H(D) − ( (N_Dl / N_D) H(D_l) + (N_Dr / N_D) H(D_r) )
  • where
  • N_D is the number of items in the dataset D
  • N_Dl is the number of items in the left-child dataset D_l
  • N_Dr is the number of items in the right-child dataset D_r

  22. Information gain: examples
  I = H(D) − ( (N_Dl / N_D) H(D_l) + (N_Dr / N_D) H(D_r) )
    = 0.971 − (24/60 × 0 + 36/60 × 0.918)
    = 0.420 bits
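A short check of this example in code. The slide gives only the subset sizes and entropies, so the label lists below are hypothetical, chosen to reproduce them: a parent of 60 items (36 of one class, 24 of the other), a pure left child of 24 items, and a right child of 36 items split 1/3 vs. 2/3.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """I = H(D) - (N_Dl/N_D) H(D_l) - (N_Dr/N_D) H(D_r)."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

# Hypothetical labels matching the slide's numbers.
parent = ["A"] * 36 + ["B"] * 24
left   = ["A"] * 24
right  = ["A"] * 12 + ["B"] * 24
print(f"{information_gain(parent, left, right):.3f} bits")   # 0.420 bits
```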

  23. Q. Is the splitting method a global optimum?
  A. Yes
  B. No (the feature and split are decided locally; the method is greedy)

  24. How to choose a dimension and split
  • If there are d dimensions, choose approximately √d of them as candidates at random
  • For each candidate, find the split that maximizes the information gain
  • Choose the best overall dimension and split
  • Note that splitting can be generalized to categorical features for which there is no natural ordering of the data
  (One possible implementation is sketched below.)
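One way this procedure might look in code, reusing the entropy and information_gain helpers from the earlier sketches. The candidate thresholds (midpoints between consecutive distinct feature values) are an illustrative choice, and this plays the role of the choose_split helper assumed in the training sketch after slide 10; categorical features would need a different way of enumerating candidate splits.

```python
import math
import random

def choose_split(data, labels, rng=random):
    """Pick ~sqrt(d) candidate features at random, then the split with the largest information gain."""
    d = len(data[0])
    candidates = rng.sample(range(d), max(1, round(math.sqrt(d))))

    best = None   # (gain, feature, threshold, (left_idx, right_idx))
    for feature in candidates:
        values = sorted(set(x[feature] for x in data))
        # Try midpoints between consecutive distinct values as thresholds.
        # (Assumes at least one candidate feature takes more than one value.)
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left_idx  = [i for i, x in enumerate(data) if x[feature] <= threshold]
            right_idx = [i for i, x in enumerate(data) if x[feature] > threshold]
            gain = information_gain(labels,
                                    [labels[i] for i in left_idx],
                                    [labels[i] for i in right_idx])
            if best is None or gain > best[0]:
                best = (gain, feature, threshold, (left_idx, right_idx))

    _, feature, threshold, split = best
    return feature, threshold, split
```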

  25. When to stop growing the decision tree?
  • Growing the tree too deep can lead to overfitting to the training data
  • Stop recursion on a data subset if any of the following occurs:
  • All items in the data subset are in the same class
  • The data subset becomes smaller than a predetermined size
  • A predetermined maximum tree depth has been reached.

  26. How to label the leaves of a decision tree
  • A leaf will usually have a data subset containing many class labels
  • Choose the class that has the most items in the subset (a "hard" label)
  • Alternatively, label the leaf with the number of items it contains in each class for a probabilistic "soft" classification.
  [Figure: a node splitting into leaves, each leaf holding a mix of class labels]
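A small sketch of both labeling options, assuming each leaf stores a Counter of class labels as in the earlier training sketch; the class names and counts below are made up for illustration.

```python
from collections import Counter

counts = Counter({"cat": 6, "dog": 3, "sofa": 1})   # illustrative class counts at one leaf

# Hard label: the class with the most items in the leaf's subset.
hard_label = counts.most_common(1)[0][0]            # "cat"

# Soft label: normalize the counts into class probabilities.
total = sum(counts.values())
soft_label = {cls: n / total for cls, n in counts.items()}   # {"cat": 0.6, "dog": 0.3, "sofa": 0.1}
```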

  27. Pros and Cons of a decision tree
  • Pros: intuitive; easy to implement; fast, with low computing cost; handles both continuous and discrete features with a simple decision boundary
  • Cons: not very accurate on its own; prone to overfitting

  28. Training, evaluation and classification
  • Build the random forest by training each decision tree on a random subset drawn with replacement from the training data; a subset of features is also randomly selected --- "Bagging"
  • Evaluate the random forest by testing on its out-of-bag items
  • Classify by merging the classifications of individual decision trees
  • By simple vote
  • Or by adding soft classifications together and then taking a vote
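A minimal bagging-and-voting sketch built on the train_tree and choose_split sketches above (the per-split random feature subset already lives inside choose_split); the number of trees is illustrative, and out-of-bag evaluation is not shown.

```python
import random
from collections import Counter

def classify_one(tree, x):
    """Walk a trained tree (from train_tree above) and return the hard label at its leaf."""
    while not tree["leaf"]:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["counts"].most_common(1)[0][0]

def train_forest(data, labels, n_trees=10, rng=random):
    """Bagging: train each tree on a bootstrap sample drawn with replacement."""
    forest = []
    n = len(data)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # sample indices with replacement
        forest.append(train_tree([data[i] for i in idx], [labels[i] for i in idx]))
    return forest

def classify_forest(forest, x):
    """Merge the individual classifications by a simple vote."""
    votes = Counter(classify_one(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```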

  29. An example of bagging
  Drawing random samples from our training set with replacement. E.g., if our training set consists of 7 training samples, our bootstrap samples (here: n = 7) can look as follows, where C_1, C_2, …, C_M symbolize the decision tree classifiers.
  Sample indices | Bagging Round 1 | Bagging Round 2 | … | Bagging Round M
  1 | 2 | 7 | … |
  2 | 2 | 3 | … |
  3 | 1 | 2 | … |
  4 | 3 | 1 | … |
  5 | 4 | 1 | … |
  6 | 7 | 7 | … |
  7 | 2 | 1 | … |
  (trained classifiers: C_1, C_2, …, C_M)

  30. Pros and Cons of Random forest
  • Pros: usually more accurate; less likely to overfit
  • Cons: costs more in computing; takes relatively longer

  31. Q2. Do you think a random forest will always outperform a simple decision tree?
  A. Yes
  B. No

  32. Considerations in choosing a classifier
  • When solving a classification problem, it is good to try several techniques.
  • Criteria to consider in choosing the classifier include:
  • Accuracy
  • Speed (training the model; classifying given new data)
  • Flexibility (variety of data; small vs. big)
  • Interpretation

  33. Support Vector Machine (SVM) overview
  • The decision boundary and function of a Support Vector Machine
  • Loss function (cost function in the book)
  • Training
  • Validation
  • Extension to multiclass classification

  34. SVM problem formulation
  • At first we assume a binary classification problem
  • The training set consists of N items
  • Feature vectors x_i of dimension d
  • Corresponding class labels y_i ∈ {±1}
  • We can picture the training data as a d-dimensional scatter plot with colored labels
  [Figure: labeled points in the (x^(1), x^(2)) plane]

  35. Decision boundary of SVM
  • SVM uses a hyperplane as its decision boundary
  • The decision boundary is: a_1 x^(1) + a_2 x^(2) + ... + a_d x^(d) + b = 0
  • In vector notation, the hyperplane can be written as: a^T x + b = 0
  [Figure: the hyperplane a^T x + b = 0 in the (x^(1), x^(2)) plane, with a^T x + b > 0 on one side and a^T x + b < 0 on the other]

  36. Q3. How many solutions can we have for the decision boundary?
  A. One
  B. Several
  C. Infinite
  [Figure: a candidate decision boundary a^T x + b = 0 in the (x^(1), x^(2)) plane]

  37. Classification function of SVM
  • SVM assigns a class label to a feature vector according to the following rule:
    +1 if a^T x_i + b ≥ 0
    -1 if a^T x_i + b < 0
  • In other words, the classification function is: sign(a^T x_i + b)
  • Note that if |a^T x_i + b| is small, then x_i is close to the decision boundary
  • If |a^T x_i + b| is large, then x_i is far from the decision boundary
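A direct sketch of this rule; the weight vector a and offset b below are arbitrary illustrative numbers, since training has not been covered yet.

```python
import numpy as np

a = np.array([2.0, -1.0])   # illustrative weights for a 2-feature problem (not trained values)
b = -0.5

def svm_classify(x, a, b):
    """Return +1 if a^T x + b >= 0, else -1."""
    return 1 if a @ x + b >= 0 else -1

x = np.array([1.0, 0.3])
score = a @ x + b              # |score| small: x is close to the boundary; large: far from it
print(svm_classify(x, a, b), score)
```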

  38. What if there is no clean-cut boundary?
  • Some boundaries are better than others for the training data
  • Some boundaries are likely more robust for run-time data
  • We need a quantitative measure to decide about the boundary
  • The loss function can help decide if one boundary is better than others
  [Figure: overlapping labeled points with a candidate boundary a^T x + b = 0]
