CSE217 Introduction to Data Science, Lecture 6: Learning Principles


  1. CSE217 INTRODUCTION TO DATA SCIENCE, LECTURE 6: LEARNING PRINCIPLES • Spring 2019 • Marion Neumann

  2. RECAP: MACHINE LEARNING • Workflow
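A minimal sketch of the supervised-learning workflow this slide recaps: split the data, fit a model on the training portion, predict, and evaluate on held-out data. The toy dataset and the choice of LinearRegression are assumptions made for illustration, not the slide's own example.

```python
# Generic supervised-learning workflow: split -> fit -> predict -> evaluate.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # features
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)    # noisy targets

# hold out part of the data so evaluation is done on unseen points
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # learn parameters on training data
y_pred = model.predict(X_test)                     # predict on held-out data
print("test R^2:", model.score(X_test, y_test))
```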

  3. NOISE • noisy samples from the true function

  4. WHY IS NOISE A PROBLEM? • small random sample from the noisy data

  5. WHY IS NOISE A PROBLEM? • best model for this (training) data

  6. WHY IS NOISE A PROBLEM? → fitting the noise instead of the true function
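A small sketch of the problem slides 3-6 illustrate: draw a small, noisy sample from a true function and fit a very flexible model to it, so the model ends up fitting the noise. The sine curve, noise level, and sample size are illustrative assumptions.

```python
# Fitting the noise: a flexible model matches the noisy sample exactly,
# yet is far from the underlying true function.
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)                 # the (unknown) true function

rng = np.random.default_rng(1)
x_sample = rng.uniform(0, 1, 8)                  # small random sample
y_sample = true_f(x_sample) + rng.normal(0, 0.3, 8)  # noisy observations

# a degree-7 polynomial can pass through all 8 points exactly...
coeffs = np.polyfit(x_sample, y_sample, deg=7)
fit = np.poly1d(coeffs)

# ...so its error on the sample is ~0, but it is far from the true function
x_grid = np.linspace(0, 1, 200)
print("max |fit - truth| on a fine grid:", np.max(np.abs(fit(x_grid) - true_f(x_grid))))
```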

  7. REGRESSION AND MODEL COMPLEXITY • Error on the training set: linear model >> quadratic >> 6th-order polynomial ← error is zero! • Is the model with zero (training) error the best? • [figure: linear regression fits of increasing degree] • PDSH p393
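The comparison on this slide, reproduced as a hedged sketch: training error of a linear, a quadratic, and a 6th-order polynomial model on the same small dataset. The dataset is an assumption; with 7 points, the 6th-order fit drives the training error to (numerically) zero, which is exactly the slide's point.

```python
# Training error shrinks as model complexity grows; a 6th-order polynomial
# interpolates 7 points and reaches ~zero training error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 3, 7)).reshape(-1, 1)
y = 1.5 * X.ravel() + rng.normal(0, 0.5, 7)        # noisy, roughly linear data

for degree in (1, 2, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    print(f"degree {degree}: training MSE = {train_mse:.6f}")
```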

  8. EVALUATION FOR REGRESSION • Training Error vs. Test Error • predictions for test data: $\hat{y}_i = f(x_i)$ • Error measures: • RMSE: root mean squared error $\mathrm{RMSE}(y, \hat{y}) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ • MAE: mean absolute error $\mathrm{MAE}(y, \hat{y}) = \tfrac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
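The two error measures written directly from the definitions above; the example arrays are assumptions for illustration.

```python
# RMSE = sqrt( (1/n) * sum (y_i - yhat_i)^2 ),  MAE = (1/n) * sum |y_i - yhat_i|
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print("RMSE:", rmse(y_true, y_pred))   # ~0.612
print("MAE :", mae(y_true, y_pred))    # 0.5
```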

  9. OVERFITTING • [hand-drawn sketch: a linear model underfits, a high-order polynomial overfits]

  10. EVALUATION FOR CLASSIFICATION • Quality Measures: we again have a training and a test error • error rate (or misclassification rate) = (# misclassified test points) / (# test points) • average accuracy = 1 − error rate • Noise in Classification • where do labels come from? → noisy labels
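Error rate and accuracy exactly as defined on this slide; the label arrays are illustrative assumptions.

```python
# error rate = (# misclassified test points) / (# test points),  accuracy = 1 - error rate
import numpy as np

y_test = np.array([+1, +1, -1, -1, +1, -1, +1, -1])
y_pred = np.array([+1, -1, -1, -1, +1, +1, +1, -1])

error_rate = np.mean(y_pred != y_test)   # fraction of misclassified test points
accuracy = 1.0 - error_rate
print(f"error rate = {error_rate:.3f}, accuracy = {accuracy:.3f}")  # 0.250, 0.750
```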

  11. EVALUATION FOR CLASSIFICATION • Confusion matrix (rows: true label, columns: prediction): true label +1 → true positive (TP) if predicted +1, false negative (FN) if predicted -1; true label -1 → false positive (FP) if predicted +1, true negative (TN) if predicted -1 • rates: TPR = TP / (TP + FN), FPR = FP / (FP + TN), TNR = TN / (FP + TN) • Can you define accuracy using these measures?
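A small sketch of the confusion-matrix entries and rates from this slide, plus one answer to its question: accuracy can be written as (TP + TN) / (TP + TN + FP + FN). The example labels are assumptions.

```python
# Confusion-matrix counts for a binary (+1 / -1) problem and derived rates.
import numpy as np

y_true = np.array([+1, +1, +1, -1, -1, -1, -1, +1])
y_pred = np.array([+1, -1, +1, -1, +1, -1, -1, +1])

TP = np.sum((y_true == +1) & (y_pred == +1))   # true positives
FN = np.sum((y_true == +1) & (y_pred == -1))   # false negatives
FP = np.sum((y_true == -1) & (y_pred == +1))   # false positives
TN = np.sum((y_true == -1) & (y_pred == -1))   # true negatives

TPR = TP / (TP + FN)          # true positive rate
FPR = FP / (FP + TN)          # false positive rate
TNR = TN / (FP + TN)          # true negative rate
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"TP={TP} FN={FN} FP={FP} TN={TN}")
print(f"TPR={TPR:.2f} FPR={FPR:.2f} TNR={TNR:.2f} accuracy={accuracy:.2f}")
```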

  12. CLASSIFICATION AND MODEL COMPLEXITY

  13. CLASSIFICATION AND MODEL COMPLEXITY • compare training and test errors for all three models
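A hedged stand-in for the exercise on this slide: compare training and test errors for classifiers of increasing complexity. The slide does not say which three models it uses; here decision trees of growing depth play that role, and the synthetic dataset is an assumption.

```python
# Training error keeps dropping with model complexity; test error does not.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (1, 3, None):          # None = grow the tree until training error is ~0
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - clf.score(X_train, y_train)
    test_err = 1 - clf.score(X_test, y_test)
    print(f"max_depth={depth}: training error={train_err:.3f}, test error={test_err:.3f}")
```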

  14. OVERFITTING • Draw this yourself [hand-drawn sketch]

  15. COMBATING OVERFITTING • Several strategies: 1) prefer simpler models over more complicated ones 2) use a validation set for model selection [diagram: candidate models A, B, C make predictions that are compared to the ground truth on the validation set; validation performance drives model selection before the final evaluation] 3) add a regularization term to your optimization problem during training → penalize large weights
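A sketch of strategy 2) above: hold out a validation set, train candidate models on the training split, and pick the one with the best validation performance before touching the test set. The candidates (polynomial degrees) and the data are illustrative assumptions. Strategy 3) would amount to swapping LinearRegression for a regularized model such as Ridge, which penalizes large weights.

```python
# Model selection on a validation set; the test set is used only once, at the end.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (60, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 60)

# train / validation / test split
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_degree, best_val_mse = None, np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_degree, best_val_mse = degree, val_mse

print("selected degree:", best_degree)          # chosen on validation data
final = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))  # final evaluation
```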

  16. HOW MUCH DATA DO WE NEED? • Learning curve
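A sketch of a learning curve as on this slide: model performance as a function of how much training data it sees. The dataset and classifier are assumptions; the typical shape (test accuracy improving with more data and then flattening out) is what the slide is after.

```python
# Learning curve: train on growing subsets, evaluate on a fixed test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for n in (20, 50, 100, 300, 600, 1200):
    clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"n_train={n:5d}  train acc={clf.score(X_train[:n], y_train[:n]):.3f}"
          f"  test acc={clf.score(X_test, y_test):.3f}")
```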

  17. DATA ≠ DATA • Two kinds of data: population vs. sample • A population is the entire set of objects or events under study; it can be hypothetical (“all students”) or concrete (all students in this class). • A sample is a (representative) subset of the objects or events under study → needed because it’s impossible or intractable to obtain or use population data. • What are problems with sample data?

  18. SAMPLING BIAS • What if our sample is biased? • Think about real-world ML applications where this might have a (negative) impact!
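A toy illustration of the last two slides: a random sample estimates a population quantity well, while a biased sample (here, one that over-represents large values) does not. The "population" and the bias mechanism are assumptions made up for this sketch.

```python
# Random vs. biased sampling from a synthetic population.
import numpy as np

rng = np.random.default_rng(4)
population = rng.normal(loc=50, scale=10, size=100_000)   # the full population

random_sample = rng.choice(population, size=500, replace=False)

# biased sample: individuals with larger values are more likely to be included
weights = np.exp(population / 20)
weights /= weights.sum()
biased_sample = rng.choice(population, size=500, replace=False, p=weights)

print("population mean   :", population.mean())       # ~50
print("random sample mean:", random_sample.mean())    # close to 50
print("biased sample mean:", biased_sample.mean())    # noticeably larger
```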

  19. SUMMARY & READING • Avoid overfitting! • Model selection using a validation set can prevent overfitting. • Learning curve → training data size matters and influences model selection. • Model evaluation for classification is more than just looking at the error. • Reading: DSFS Ch11 (p142-147); PDSH Ch5 (p357, 370-373); PDSH Ch5 (p393-398)
