
Lecture 8: Loose Ends - Prof. Julia Hockenmaier



  1. CS446 Introduction to Machine Learning (Fall 2013), University of Illinois at Urbana-Champaign, http://courses.engr.illinois.edu/cs446 – Lecture 8: Loose Ends – Prof. Julia Hockenmaier, juliahmr@illinois.edu

  2. Admin

  3. Admin: Homework – HW1 is due tonight (11:59pm). Last-minute TA office hour (Ryan): SC 3403, 5pm-6pm today. – SGD: Apologies for the misunderstanding/miscommunication (more on this in a bit). Fill out the (optional) Compass survey. – Future HWs: Start early! Let us know ASAP when something is unclear.

  4. Reminder: Homework Late Policy – Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will subtract 20% per late day. We don't accept assignments more than three days after their due date. Let us know if there are any special circumstances (family, health, etc.).

  5. Admin: Midterm – Midterm exam: Thursday, Oct 10, in class. Let us know ASAP if you know you have a conflict (job interview?) or need accommodations. We will post past midterms on the website. Caveat: different instructor!

  6. Admin: Projects (4th credit hour) – Do you have an idea? Great ML class projects (and write-ups) can be found at http://cs229.stanford.edu/projects2012.html . For datasets and problems, see also http://www.kaggle.com/competitions or the UCI machine learning repository, http://archive.ics.uci.edu/ml/ . Do you have a partner? => Compass survey, due by next Friday (to make sure everybody is on track).

  7. Review: Stochastic Gradient Descent

  8. SGD questions that came up in HW1: What's the difference between batch and online learning? When do we update the weight vector? How do we check for convergence? When do we check for convergence?

  9. Terminology: Batch learning – The hypothesis (e.g. weight vector) changes based on a batch (set) of training examples: see all examples in the batch, then update your weight vector. Typically, one batch = all training examples; a 'mini-batch' = a small number of training examples. Examples of batch algorithms we've seen so far: standard gradient descent, decision trees.

  10. Terminology: Online learning – The hypothesis (e.g. weight vector) changes based on an individual training example: every time you see a new training example, you may have to update your weight vector. Examples: stochastic gradient descent, Winnow. – SGD with LMS loss: w changes with every example. – SGD with perceptron loss: w changes only on misclassified examples (the gradient is 0 for correctly classified examples).
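The batch/online distinction can be sketched in code. This is an illustrative sketch, not course-provided code: `predict`, `batch_update`, and `online_update` are made-up names, and the update rule shown is SGD with LMS (squared-error) loss for a linear model.

```python
# Illustrative sketch: batch vs. online updates for a linear model
# with LMS (squared-error) loss. Names and learning rate are assumptions.

def predict(w, x):
    """Linear prediction f(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_update(w, examples, lr=0.1):
    """Batch gradient descent: one update after seeing the whole batch."""
    grad = [0.0] * len(w)
    for x, y in examples:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    # Average the gradient over the batch, then take one step.
    return [wi - lr * g / len(examples) for wi, g in zip(w, grad)]

def online_update(w, x, y, lr=0.1):
    """SGD with LMS loss: w changes after every single example."""
    err = predict(w, x) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]
```

With perceptron loss instead of LMS, `online_update` would leave w unchanged whenever the example is classified correctly, since the gradient is 0 there.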

  11. Reminder: Loss functions – [Plot: 0-1 loss, square loss, and perceptron loss as functions of the margin y*f(x), for y*f(x) in [-2, 2]]
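The three losses in the plot can be written as functions of the margin m = y*f(x). This is a sketch with made-up function names; the square-loss form assumes labels y in {-1, +1}, so (y - f(x))^2 = (1 - y*f(x))^2.

```python
# Sketch: the three losses from the plot, as functions of the margin
# m = y * f(x). Function names are illustrative, not from the slides.

def zero_one_loss(m):
    """1 if misclassified (margin <= 0), else 0."""
    return 0.0 if m > 0 else 1.0

def square_loss(m):
    """(y - f(x))^2 rewritten via the margin, assuming y in {-1, +1}."""
    return (1.0 - m) ** 2

def perceptron_loss(m):
    """0 for correctly classified examples, -m otherwise."""
    return max(0.0, -m)
```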

  12. Convergence checks – What does it mean for w to have converged? – Define a convergence threshold τ (e.g. 10^-3). – Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new. – w has converged when ‖Δw‖ < τ.

  13. Convergence checks – How often do I check for convergence? Batch learning: w_old = w before seeing the current batch; w_new = w after seeing the current batch. Assuming your batch is large enough, this works well.

  14. Convergence checks – How often do I check for convergence? Online learning: – Problem: a single example may lead to only very small changes in w. – Solution: only check for convergence after every k examples (or updates; it doesn't matter which). w_old = w after n·k examples/updates; w_new = w after (n+1)·k examples/updates.
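The check-every-k-examples idea can be sketched as follows. This is illustrative code, not from the course: `train_online` and `sgd_step` are made-up names, and `sgd_step` stands in for any per-example update rule.

```python
# Sketch: online learning with a convergence check every k examples.
import math

def norm(v):
    """Euclidean norm of a weight-difference vector."""
    return math.sqrt(sum(vi * vi for vi in v))

def train_online(w, stream, sgd_step, k=100, tau=1e-3, max_steps=100_000):
    """Per-example updates; declare convergence when w has moved
    less than tau (in norm) over the last k examples."""
    w_old = list(w)
    for t, (x, y) in enumerate(stream, start=1):
        w = sgd_step(w, x, y)
        if t % k == 0:                       # only check every k examples
            delta = [a - b for a, b in zip(w_old, w)]
            if norm(delta) < tau:
                return w, True               # converged
            w_old = list(w)
        if t >= max_steps:
            break
    return w, False                          # ran out of data/steps
```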

  15. Another loose end: Hypothesis testing for evaluation

  16. Why hypothesis testing? We evaluate the accuracy of our classifiers on unseen (held-out) test data. Hypothesis testing can tell us whether the difference in accuracy between two classifiers is statistically significant or not.

  17. Hypothesis testing – You want to show that hypothesis H is true, based on your data (e.g. H = "classifiers A and B are different"). – Define a null hypothesis H0 (H0 is the contrary of what you want to show). – H0 defines a distribution P(m | H0) over some statistic (number) that you can compute, e.g. a distribution over the difference in accuracy between A and B. – Can you refute (reject) H0?

  18. Rejecting H0 – H0 defines a distribution P(M | H0) over some statistic M (e.g. M = the difference in accuracy between A and B). – Select a significance value S (e.g. 0.05, 0.01, etc.). – Compute the test statistic m from your data, e.g. the average difference in accuracy over your N folds. – Compute P(m | H0). – You can only reject H0, with p-value p ≤ S, if P(m | H0) ≤ S. – Note: the p-value is P(m | H0), not P(H0 | m) (a common misunderstanding).

  19. Paired t-test – Compare the accuracy of two (binary) classifiers on k different test sets:

         test set 1   test set 2   test set 3   test set 4   test set 5
     A      80%          82%          85%          78%          85%
     B      81%          81%          86%          80%          88%

  Alternatives, e.g. McNemar's test: compare the accuracy of two (binary) classifiers on a single test set (do they make mistakes on the same items?).

  20. N-fold cross validation – Instead of a single train-test split: – Split the data into N equal-sized parts. – Train and test N different instances of the same classifier, each time holding out a different part as the test set. – This gives N different accuracies.
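A minimal sketch of the N-fold split described above (illustrative; `n_fold_splits` is a made-up helper, and it assumes len(data) is divisible by N):

```python
# Sketch: generate the N train/test splits for N-fold cross-validation.

def n_fold_splits(data, n):
    """Yield (train_set, test_set) pairs; each fold is the test set once."""
    fold_size = len(data) // n
    for i in range(n):
        test = data[i * fold_size:(i + 1) * fold_size]
        # Training data = everything outside the current fold.
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test
```

Training and evaluating the same classifier on each of the N pairs yields the N accuracies used in the paired t-test below.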

  21. Paired t-test – Compare the accuracy of classifiers A and B on k different test sets. t-test: – Assumption: accuracies are drawn from a normal distribution (with unknown variance). – Null hypothesis: the accuracies of A and B are drawn from the same distribution. – Alternative hypothesis: the accuracies are drawn from two distributions with different means.

  22. Paired t-test – Compare the accuracy of classifiers A and B on k different test sets. Paired t-test: the accuracy of A on test set i is paired with the accuracy of B on test set i. – Null hypothesis: if A's and B's accuracies are from the same distribution, their difference (on the same test set) comes from a normal distribution with mean 0. – Alternative hypothesis: the difference between A's and B's accuracies does not come from a distribution with mean 0.

  23. Paired t-test for cross-validation – Two different classifiers, A and B, are trained and tested using N-fold cross-validation. For the n-th fold: accuracy(A, n), accuracy(B, n); diff_n = accuracy(A, n) − accuracy(B, n). Null hypothesis: diff comes from a distribution with mean (expected value) 0.

  24. Paired t-test – Null hypothesis (H0; to be refuted): there is no difference between A and B, i.e. the expected accuracies of A and B are the same. That is, the expected difference (over all possible data sets) between their accuracies is 0: H0: E[diff_D] = 0. We don't know the true E[diff_D]; K-fold cross-validation gives us K samples of diff_D.

  25. t-distribution – Take a sample of n observations from a normal distribution with fixed (but unknown) mean and variance. – Compute the sample mean and sample variance of these observations. – The t-distribution with n−1 degrees of freedom can be used to estimate how likely it is that the true mean lies in a given range. – Accept the null hypothesis at significance level a if the t-statistic lies in (−t_{a/2, n−1}, +t_{a/2, n−1}). – There are tables where you can look this up.

  26. Paired t-test – Null hypothesis H0: E[diff_D] = µ_diff = 0. – m: our estimate of µ based on N samples of diff_D: m = (1/N) Σ_n diff_n. – The sample variance S²: S² = Σ_n (diff_n − m)² / (N − 1). – Accept the null hypothesis at significance level a if the following statistic lies in (−t_{a/2, N−1}, +t_{a/2, N−1}): √N · m / S ~ t_{N−1}.
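The statistic on this slide can be computed directly (a sketch; `paired_t_statistic` is a made-up name):

```python
# Sketch: the paired t-test statistic sqrt(N) * m / S, where m is the
# mean of the paired differences and S^2 their sample variance.
import math

def paired_t_statistic(acc_a, acc_b):
    """Compute sqrt(N) * mean(diff) / sample_std(diff),
    with diff_n = accuracy(A, n) - accuracy(B, n)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    m = sum(diffs) / n                                  # sample mean
    s2 = sum((d - m) ** 2 for d in diffs) / (n - 1)     # sample variance
    return math.sqrt(n) * m / math.sqrt(s2)
```

Applied to the five test-set accuracies from the paired t-test table earlier (A: 80, 82, 85, 78, 85; B: 81, 81, 86, 80, 88), this gives t ≈ −1.81, which lies inside (−t_{0.025, 4}, +t_{0.025, 4}) ≈ (−2.776, +2.776), so H0 would not be rejected at the 5% significance level.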

  27. One-sided vs. two-sided tests – One-tailed: test whether the accuracy of A is higher than that of B. Two-tailed: test whether the accuracies of A and B are different.
