 
              CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 6: LEARNING PRINCIPLES Spring 2019 Marion Neumann
RECAP: MACHINE LEARNING • Workflow 2
NOISE • noisy samples from true function 3
WHY IS NOISE A PROBLEM? • small random sample from the noisy data 4
WHY IS NOISE A PROBLEM? • best model for this (training) data 5
WHY IS NOISE A PROBLEM? → fitting the noise instead of the true function 6
REGRESSION AND MODEL COMPLEXITY Error on training set: linear model >> quadratic >> 6th-order polynomial ← error is zero! Is the model with zero (training) error the best? PDSH p393 7 Linear Regression
EVALUATION FOR REGRESSION • Training Error vs. Test Error • $\hat{y} = f(X_{\text{test}})$: predictions for test data • Error measures: • RMSE: root mean squared error $\mathrm{RMSE}(\hat{y}, y_{\text{test}}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$ • MAE: mean absolute error $\mathrm{MAE}(\hat{y}, y_{\text{test}}) = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$ 8
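The two error measures on this slide, RMSE and MAE, can be sketched in plain Python as follows. This is a minimal illustration, not the course's reference implementation; the function names and the example numbers are made up for this sketch.

```python
import math

def rmse(y_pred, y_true):
    """Root mean squared error between predictions and true targets."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)

def mae(y_pred, y_true):
    """Mean absolute error between predictions and true targets."""
    n = len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n

# Made-up test targets and predictions (illustration only).
y_test = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 8.0]  # three perfect predictions, one error of 4

print(rmse(y_hat, y_test))  # 2.0
print(mae(y_hat, y_test))   # 1.0
```

With a single large error, RMSE (2.0) is twice the MAE (1.0): squaring the residuals makes RMSE more sensitive to outliers than MAE.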
OVERFITTING • [Handwritten sketch; legible labels: underfitting, linear, high-order poly] 9
EVALUATION FOR CLASSIFICATION • Quality Measures: we have again training and test error (accuracy) • error rate (or misclassification rate) $= \frac{\#\,\text{misclassified test points}}{\#\,\text{test points}}$ • average accuracy ($= 1 - \text{error rate}$) • Noise in Classification • where do labels come from? → noisy labels 10
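The error rate and accuracy from this slide can be sketched in a few lines of plain Python. The function names and the labels below are invented for illustration.

```python
def error_rate(y_pred, y_true):
    """Fraction of points whose predicted label differs from the true label."""
    wrong = sum(1 for p, t in zip(y_pred, y_true) if p != t)
    return wrong / len(y_true)

def accuracy(y_pred, y_true):
    """Accuracy is one minus the error rate."""
    return 1.0 - error_rate(y_pred, y_true)

# Made-up test labels and predictions (illustration only).
y_test = [+1, +1, -1, -1]
y_hat = [+1, -1, -1, -1]  # one of four points is misclassified

print(error_rate(y_hat, y_test))  # 0.25
print(accuracy(y_hat, y_test))    # 0.75
```

The same functions apply to training data; only the labels and predictions passed in change.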
EVALUATION FOR CLASSIFICATION • Confusion matrix
                    prediction +1                 prediction -1
  true label +1     ✓ true positive prediction    ✘ false negative prediction
  true label -1     ✘ false positive prediction   ✓ true negative prediction
[Handwritten annotations (TPR, FPR, TNR, ...) are not legible in this extraction.]
Can you define accuracy using these measures? 11
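One way to approach the question on this slide is to count the four confusion-matrix cells and combine them. The sketch below assumes labels in {+1, -1} as on the slide. It uses the standard textbook definitions of the true positive rate and false positive rate, which may not match the handwritten annotations exactly. All names and data are made up.

```python
def confusion_counts(y_pred, y_true, positive=+1):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = fn = fp = tn = 0
    for p, t in zip(y_pred, y_true):
        if t == positive:
            if p == positive:
                tp += 1  # true label +1, predicted +1
            else:
                fn += 1  # true label +1, predicted -1
        else:
            if p == positive:
                fp += 1  # true label -1, predicted +1
            else:
                tn += 1  # true label -1, predicted -1
    return tp, fn, fp, tn

# Made-up labels and predictions (illustration only).
y_true = [+1, +1, +1, -1, -1, -1, -1, -1]
y_pred = [+1, +1, -1, +1, -1, -1, -1, -1]

tp, fn, fp, tn = confusion_counts(y_pred, y_true)
acc = (tp + tn) / (tp + fn + fp + tn)  # correct predictions over all predictions
tpr = tp / (tp + fn)                   # standard definition: TP over all true positives
fpr = fp / (fp + tn)                   # standard definition: FP over all true negatives

print(tp, fn, fp, tn)  # 2 1 1 4
print(acc)             # 0.75
```

Accuracy only uses the diagonal of the matrix; the rates show which kind of mistake the classifier makes, which a single error number hides.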
CLASSIFICATION AND MODEL COMPLEXITY 12
CLASSIFICATION AND MODEL COMPLEXITY • compare training and test errors for all three models 13
OVERFITTING Draw this yourself 14
COMBATING OVERFITTING Several Strategies: 1) prefer simpler models over more complicated ones 2) use validation set for model selection [Figure; legible labels: ground truth, prediction, validation, validation performance, evaluation] 3) add a regularization term to your optimization problem during training → penalize large weights 15
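Strategy 2 can be sketched as follows: fit several candidate models on the training split and keep the one with the lowest validation error. To stay dependency-free, this sketch swaps in 1-D k-nearest-neighbor regression for the polynomial models used earlier in the lecture (a small k gives a more flexible model). The data, the true function sin(x), the noise level, and the candidate values of k are all assumptions for illustration.

```python
import math
import random

def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k

def rmse(x_train, y_train, xs, ys, k):
    """RMSE of the k-NN model (fit on the training split) evaluated on (xs, ys)."""
    errs = [(knn_predict(x_train, y_train, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errs) / len(errs))

random.seed(0)
# Synthetic noisy samples from a hypothetical true function sin(x).
xs = [i * 0.1 for i in range(60)]
data = [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]
random.shuffle(data)
train, val = data[:40], data[40:]  # 40 training points, 20 validation points
x_tr, y_tr = zip(*train)
x_va, y_va = zip(*val)

candidates = [1, 3, 5, 9, 15]
train_err = {k: rmse(x_tr, y_tr, x_tr, y_tr, k) for k in candidates}
val_err = {k: rmse(x_tr, y_tr, x_va, y_va, k) for k in candidates}

# Model selection: pick the candidate with the lowest validation error.
best_k = min(candidates, key=lambda k: val_err[k])

for k in candidates:
    print(f"k={k:2d}  train RMSE={train_err[k]:.3f}  val RMSE={val_err[k]:.3f}")
print("selected k:", best_k)
```

With k = 1 the training error is exactly zero because every training point is its own nearest neighbor, which is the same "zero training error" trap as the high-order polynomial. The validation error is what exposes it.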
HOW MUCH DATA DO WE NEED? • Learning curve 16
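A learning curve can be sketched by training the same model on growing subsets of the training data and recording training and test error for each size. The sketch below uses a closed-form least-squares line fit on synthetic data; the true function 2x + 1, the noise level, and the training set sizes are assumptions chosen for illustration.

```python
import math
import random

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def rmse(model, xs, ys):
    """RMSE of the line model (a, b) on the points (xs, ys)."""
    a, b = model
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

random.seed(1)

def sample(n):
    """Draw n noisy points from the hypothetical true function 2x + 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + random.gauss(0, 1.0) for x in xs]
    return xs, ys

x_train, y_train = sample(200)
x_test, y_test = sample(200)

sizes = [2, 5, 10, 50, 200]
curve = []  # entries: (training set size, training RMSE, test RMSE)
for n in sizes:
    model = fit_line(x_train[:n], y_train[:n])
    curve.append((n, rmse(model, x_train[:n], y_train[:n]), rmse(model, x_test, y_test)))

for n, tr, te in curve:
    print(f"n={n:3d}  train RMSE={tr:.3f}  test RMSE={te:.3f}")
```

With only two training points the line fits them exactly, so the training error is essentially zero while the test error depends heavily on which two points were drawn. As n grows, the two errors typically approach each other near the noise level.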
DATA ≠ DATA • Two kinds of data: population vs. sample • A population is the entire set of objects or events under study. Population can be hypothetical “all students” or all students in this class. • A sample is a (representative) subset of the objects or events under study. → needed because it’s impossible or intractable to obtain or use population data. • What are problems with sample data? 17
SAMPLING BIAS • What if our sample is biased? • Think about real world ML applications where this might have a (negative) impact! 18
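A small simulation can make the effect of a biased sample concrete. The scenario is hypothetical: the population is weekly study hours for 10,000 students, and the biased sample only reaches students who study more than 12 hours. All numbers are made up for illustration.

```python
import random

random.seed(2)

# Hypothetical population: weekly study hours of 10,000 students (made-up distribution).
population = [random.gauss(10, 3) for _ in range(10000)]
pop_mean = sum(population) / len(population)

# Representative sample: every member of the population is equally likely to be drawn.
random_sample = random.sample(population, 200)
random_mean = sum(random_sample) / len(random_sample)

# Biased sample: only students studying more than 12 hours can be observed.
biased_pool = [v for v in population if v > 12]
biased_sample = random.sample(biased_pool, 200)
biased_mean = sum(biased_sample) / len(biased_sample)

print(f"population mean:    {pop_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample estimates the population mean reasonably well, while the biased sample overestimates it no matter how many points are drawn. A model trained on such a sample learns the sampled group, not the population.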
SUMMARY & READING • Avoid overfitting! • Model selection using a validation set can prevent overfitting. • Learning curve → training data size matters and influences model selection • Model evaluation for classification is more than just looking at the error. • DSFS • Ch11 (p142-147) • PDSH • Ch5 (p357,370-373) • Ch5 (p393-398) 19