CS446 Introduction to Machine Learning (Fall 2013) University of Illinois at Urbana-Champaign http://courses.engr.illinois.edu/cs446 LECTURE 8: LOOSE ENDS Prof. Julia Hockenmaier juliahmr@illinois.edu
Admin CS446 Machine Learning 2
Admin: Homework – HW1 is due tonight (11:59pm). Last-minute TA office hour (Ryan): SC 3403, 5pm-6pm today – SGD: Apologies for the misunderstandings/miscommunication (more on this in a bit); fill out the (optional) Compass survey – Future HWs: Start early! Let us know ASAP when something is unclear
Reminder: Homework Late Policy Everybody is allowed a total of two late days for the semester. If you have exhausted your allotment of late days, we will subtract 20% per late day. We don't accept assignments more than three days after their due date. Let us know if there are any special circumstances (family, health, etc.)
Admin: Midterm Midterm exam: Thursday, Oct 10, in class. Let us know ASAP if you know you have a conflict (job interview?) or need accommodations. We will post past midterms on the website. Caveat: they were written by a different instructor!
Admin: Projects (4th credit hour) Do you have an idea? Great ML class projects (and write-ups) can be found at http://cs229.stanford.edu/projects2012.html For datasets and problems, see also http://www.kaggle.com/competitions or the UCI machine learning repository: http://archive.ics.uci.edu/ml/ Do you have a partner? => Compass survey: due by next Friday (to make sure everybody is on track)
Review: Stochastic Gradient Descent
SGD questions that came up in HW1… … What's the difference between batch and online learning? … When do we update the weight vector? … How do we check for convergence? … When do we check for convergence?
Terminology: Batch learning The hypothesis (e.g. weight vector) changes based on a batch (set) of training examples. See all examples in the batch, then update your weight vector. Typically, one batch = all training examples; 'mini-batch' = a small number of training examples. Examples of batch algorithms we've seen so far: standard gradient descent, decision trees
Terminology: Online learning The hypothesis (e.g. weight vector) changes based on an individual training example. Examples: stochastic gradient descent, Winnow. Every time you see a new training example, you may have to update your weight vector. – SGD with LMS loss: w changes with every example – SGD with perceptron loss: w changes only on misclassified examples (since the gradient is 0 for correctly classified examples)
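The LMS/perceptron distinction above can be sketched in code. Below is a minimal, illustrative online SGD loop with the perceptron loss; the function name, learning rate, and toy data format are my own, not from the course materials.

```python
# Minimal sketch of online SGD with the perceptron loss.
# The weight vector only changes on misclassified examples,
# since the gradient is 0 whenever y * (w . x) > 0.

def sgd_perceptron(examples, lr=1.0, epochs=10):
    """examples: list of (x, y) with x a tuple of floats, y in {-1, +1}."""
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w
```

On linearly separable data such as the toy set below, this loop stops changing w once every example is classified correctly; with LMS loss, by contrast, the update would fire on every example.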
Reminder: Loss functions [Figure: loss as a function of y·f(x) over the range [-2, 2], comparing 0-1 loss, square loss, and perceptron loss]
Convergence checks What does it mean for w to have converged? – Define a convergence threshold τ (e.g. 10^-3) – Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new – w has converged when ‖Δw‖ < τ
Convergence checks How often do I check for convergence? Batch learning: w_old = w before seeing the current batch, w_new = w after seeing the current batch. Assuming your batch is large enough, this works well.
Convergence checks How often do I check for convergence? Online learning: – Problem: A single example may only lead to very small changes in w – Solution: Only check for convergence after every k examples (or every k updates; it doesn't matter which). w_old = w after n·k examples/updates, w_new = w after (n+1)·k examples/updates
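The convergence recipe above might look as follows in code. This is an illustrative sketch only: `has_converged`, `train_online`, and the `update` callback are made-up names, not course-provided code.

```python
import math

def has_converged(w_old, w_new, tau=1e-3):
    """w has converged when the L2 norm of (w_old - w_new) is below tau."""
    delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_old, w_new)))
    return delta < tau

def train_online(examples, update, w, k=100, tau=1e-3, max_passes=50):
    """Run online updates, checking for convergence only every k examples."""
    seen = 0
    w_old = list(w)
    for _ in range(max_passes):
        for x, y in examples:
            w = update(w, x, y)      # one online update
            seen += 1
            if seen % k == 0:        # check only every k-th example
                if has_converged(w_old, w, tau):
                    return w
                w_old = list(w)      # snapshot for the next check
    return w
```

Checking only every k examples avoids declaring convergence prematurely just because one example happened to cause a tiny update.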
Another loose end: Hypothesis testing for evaluation
Why hypothesis testing? We evaluate the accuracy of our classifiers on unseen test data. Hypothesis testing can tell us whether the difference in accuracy between two classifiers is statistically significant or not.
Hypothesis testing You want to show that hypothesis H is true, based on your data (e.g. H = "classifiers A and B are different") – Define a null hypothesis H0 (H0 is the negation of what you want to show) – H0 defines a distribution P(m | H0) over some statistic (number) that you can compute, e.g. a distribution over the difference in accuracy between A and B – Can you refute (reject) H0?
Rejecting H0 H0 defines a distribution P(M | H0) over some statistic M (e.g. M = the difference in accuracy between A and B). Select a significance level S (e.g. 0.05, 0.01, etc.). Compute the test statistic m from your data (e.g. the average difference in accuracy over your N folds) and compute P(m | H0). You can reject H0, with p-value p ≤ S, only if P(m | H0) ≤ S. Note: the p-value is P(m | H0), not P(H0 | m) (a common misunderstanding)
Paired t-test Compare the accuracy of two (binary) classifiers on k different test sets:

         set 1   set 2   set 3   set 4   set 5
    A     80%     82%     85%     78%     85%
    B     81%     81%     86%     80%     88%

Alternatives, e.g. McNemar's test: compare the accuracy of two (binary) classifiers on a single test set (do they make mistakes on the same items?)
N-fold cross validation Instead of a single training/test split: – Split the data into N equal-sized parts – Train and test N different instances of the same classifier – This gives N different accuracies
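The splitting step above can be sketched as follows. This is an illustrative helper (the function name is my own); real code would typically shuffle the data before splitting.

```python
def n_fold_splits(data, n_folds=5):
    """Yield (train, test) pairs for N-fold cross-validation.

    Fold i uses the i-th roughly-equal chunk as the test set and
    the remaining chunks as the training set.
    """
    fold_size = len(data) // n_folds
    for i in range(n_folds):
        start = i * fold_size
        # the last fold absorbs any leftover examples
        end = start + fold_size if i < n_folds - 1 else len(data)
        test = data[start:end]
        train = data[:start] + data[end:]
        yield train, test
```

Each example appears in exactly one test set, so training and evaluating one classifier instance per fold yields N accuracies.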
Paired t-test Compare the accuracy of classifiers A and B on k different test sets. t-test: – Assumption: Accuracies are drawn from a normal distribution (with unknown variance) – Null hypothesis: The accuracies of A and B are drawn from the same distribution – Alternative hypothesis: The accuracies are drawn from two distributions with different means
Paired t-test Compare the accuracy of classifiers A and B on k different test sets. Paired t-test: The accuracy of A on test set i is paired with the accuracy of B on test set i – Null hypothesis: If A and B's accuracies are from the same distribution, their difference (on the same test set) comes from a normal distribution with mean 0 – Alternative hypothesis: The difference between A and B's accuracies comes from a distribution with mean ≠ 0
Paired t-test for cross-validation Two different classifiers, A and B, are trained and tested using N-fold cross-validation. For the n-th fold: accuracy(A, n), accuracy(B, n), and diff_n = accuracy(A, n) − accuracy(B, n). Null hypothesis: diff comes from a distribution with mean (expected value) 0.
Paired t-test Null hypothesis (H0; to be refuted): There is no difference between A and B, i.e. the expected accuracies of A and B are the same. That is, the expected difference (over all possible data sets) between their accuracies is 0: H0: E[diff_D] = 0. We don't know the true E[diff_D]; N-fold cross-validation gives us N samples of diff_D.
t-distribution – Take a sample of n observations from a normal distribution with fixed (but unknown) mean and variance – Compute the sample mean and sample variance of these observations – The t-distribution with n−1 degrees of freedom can be used to estimate how likely it is that the true mean lies in a given range – Accept the null hypothesis at significance level α if the t-statistic lies in (−t_{α/2, n−1}, +t_{α/2, n−1}) – There are tables where you can look these critical values up
Paired t-test Null hypothesis H0: E[diff_D] = µ_diff = 0. Let m be our estimate of µ_diff based on N samples of diff_D:

    m = (1/N) Σ_{n=1}^{N} diff_n

The sample variance S²:

    S² = Σ_{n=1}^{N} (diff_n − m)² / (N − 1)

Accept the null hypothesis at significance level α if the following statistic lies in (−t_{α/2, N−1}, +t_{α/2, N−1}):

    √N · m / S  ~  t_{N−1}
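The statistic √N·m/S is straightforward to compute directly; here is an illustrative sketch (the function name is my own) applied to the five test-set accuracies for A and B from the table a few slides back. The resulting value would then be compared against a critical value t_{α/2, N−1} from a table.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """Paired t-statistic sqrt(N) * m / S for two lists of accuracies.

    acc_a[i] and acc_b[i] are the accuracies of classifiers A and B on
    the same fold/test set i; under H0 the statistic follows a
    t-distribution with N-1 degrees of freedom.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    m = sum(diffs) / n                               # sample mean of diffs
    s2 = sum((d - m) ** 2 for d in diffs) / (n - 1)  # sample variance
    return math.sqrt(n) * m / math.sqrt(s2)
```

For the accuracies in the table (A: 80, 82, 85, 78, 85; B: 81, 81, 86, 80, 88), the statistic is about −1.81, which lies inside (−t_{0.025, 4}, +t_{0.025, 4}) = (−2.776, +2.776), so at the 0.05 level we cannot reject H0.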
One-sided vs. two-sided tests One-tailed: Test whether the accuracy of A is higher than B's. Two-tailed: Test whether the accuracies of A and B are different.