LECTURE 8: LOOSE ENDS
Prof. Julia Hockenmaier


SLIDE 1

CS446 Introduction to Machine Learning (Fall 2013)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Prof. Julia Hockenmaier
juliahmr@illinois.edu

LECTURE 8: LOOSE ENDS

SLIDE 2

Admin

SLIDE 3

Admin: Homework

– HW1 is due tonight (11:59pm)
  Last-minute TA office hour (Ryan): SC 3403, 5pm-6pm today
– SGD: Apologies for the misunderstanding/miscommunication (more on this in a bit)
  Fill out the (optional) Compass survey.
– Future HWs: Start early!
  Let us know ASAP when something is unclear.

SLIDE 4

Reminder: Homework Late Policy

– Everybody is allowed a total of two late days for the semester.
– If you have exhausted your contingent of late days, we will subtract 20% per late day.
– We don’t accept assignments more than three days after their due date.
– Let us know if there are any special circumstances (family, health, etc.).

SLIDE 5


Admin: Midterm

Midterm exam: Thursday, Oct 10 in class

Let us know ASAP if you know you have a conflict (job interview?) or need accommodations

We will post past midterms on the website

Caveat: different instructor!


SLIDE 6


Admin: Projects (4th credit hour)

Do you have an idea?

Great ML class projects (and write-ups) can be found at http://cs229.stanford.edu/projects2012.html

For datasets and problems, see also http://www.kaggle.com/competitions or the UCI machine learning repository: http://archive.ics.uci.edu/ml/

Do you have a partner? => Compass survey: due by next Friday

(to make sure everybody is on track)


SLIDE 7


Review: Stochastic Gradient Descent


SLIDE 8


SGD questions that came up in HW1…

… What’s the difference between batch and online learning?
… When do we update the weight vector?
… How do we check for convergence?
… When do we check for convergence?


SLIDE 9


Terminology: Batch learning

The hypothesis (e.g. weight vector) changes based on a batch (set) of training examples. See all examples in the batch, then update your weight vector.

Typically, one batch = all training examples.
‘Mini-batch’ = a small number of training examples.

Examples of batch algorithms we’ve seen so far: Standard gradient descent, Decision trees
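
To make the batch update concrete, here is a minimal sketch of one batch gradient-descent step under squared (LMS) loss. Not from the slides: the function name, learning rate, and NumPy usage are illustrative assumptions.

```python
import numpy as np

def batch_gd_step(w, X, y, lr=0.01):
    # One BATCH update: see all examples in (X, y), then update w once.
    # Illustrative loss: LMS, L(w) = 1/2 * sum_i (w . x_i - y_i)^2
    preds = X @ w               # predictions for the whole batch
    grad = X.T @ (preds - y)    # gradient accumulated over all examples
    return w - lr * grad        # single update after seeing the batch
```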


SLIDE 10


Terminology: Online learning

The hypothesis (e.g. weight vector) changes based on an individual training example.

Examples: stochastic gradient descent, Winnow

Every time you see a new training example, you may have to update your weight vector.
– SGD with LMS loss: w changes with every example
– SGD with perceptron loss: w changes only with misclassified examples

(since the gradient = 0 for correctly classified examples)
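
A sketch of the corresponding per-example (online) updates, assuming labels y ∈ {−1, +1} for the perceptron case; function names and learning rate are illustrative:

```python
import numpy as np

def sgd_lms_step(w, x, y, lr=0.01):
    # LMS loss: the gradient (w.x - y) * x is rarely exactly zero,
    # so w changes with (essentially) every example.
    return w - lr * (w @ x - y) * x

def sgd_perceptron_step(w, x, y, lr=0.01):
    # Perceptron loss: the gradient is 0 for correctly classified
    # examples (y * w.x > 0), so w changes only on mistakes.
    if y * (w @ x) <= 0:
        return w + lr * y * x
    return w
```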


SLIDE 11


Reminder: Loss functions


[Figure: loss as a function of y·f(x), plotting 0-1 loss, square loss, and perceptron loss.]
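
Written out, the three plotted curves correspond to the following standard definitions in terms of the margin m = y·f(x), with y ∈ {−1, +1} (these definitions are assumed here, since the slide gives only the plot):

$$\ell_{0/1}(m) = \mathbf{1}[m \le 0], \qquad \ell_{\text{square}}(m) = (1 - m)^2, \qquad \ell_{\text{perceptron}}(m) = \max(0, -m)$$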

SLIDE 12


Convergence checks

What does it mean for w to have converged?
– Define a convergence threshold τ (e.g. τ = 10⁻³)
– Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new
– w has converged when ‖Δw‖ < τ
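
This check translates directly into code; a minimal sketch (function name and NumPy usage are assumptions):

```python
import numpy as np

def has_converged(w_old, w_new, tau=1e-3):
    # w has converged when the norm of its change falls below tau
    return np.linalg.norm(w_old - w_new) < tau
```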


SLIDE 13


Convergence checks

How often do I check for convergence?
Batch learning:
– w_old = w before seeing the current batch
– w_new = w after seeing the current batch
Assuming your batch is large enough, this works well.


SLIDE 14


Convergence checks

How often do I check for convergence?
Online learning:
– Problem: A single example may lead to only very small changes in w
– Solution: Only check for convergence after every k examples (or updates; it doesn’t matter), as in the sketch below:
  w_old = w after n·k examples/updates
  w_new = w after (n+1)·k examples/updates
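
A sketch of an online-learning loop with this periodic check; step() stands in for any per-example update (e.g. the LMS step above), and all names and defaults are illustrative:

```python
import numpy as np

def sgd_until_converged(w, examples, step, k=100, tau=1e-3, max_passes=50):
    # Check for convergence only every k examples, since a single
    # example may change w very little.
    w_old = w.copy()
    seen = 0
    for _ in range(max_passes):
        for x, y in examples:
            w = step(w, x, y)
            seen += 1
            if seen % k == 0:                        # every k updates...
                if np.linalg.norm(w_old - w) < tau:  # ...compare snapshots
                    return w
                w_old = w.copy()
    return w
```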


SLIDE 15


Another loose end: Hypothesis testing for evaluation


SLIDE 16


Why hypothesis testing?

We evaluate the accuracy of our classifiers on unseen data.

Hypothesis testing can tell us whether the difference in accuracy between two classifiers is statistically significant or not.


SLIDE 17


Hypothesis testing

You want to show that hypothesis H is true, based on your data
(e.g. H = “classifiers A and B are different”)

– Define a null hypothesis H0
  (H0 is the contrary of what you want to show)

– H0 defines a distribution P(m | H0) over some statistic (number) that you can compute,
  e.g. a distribution over the difference in accuracy between A and B

– Can you refute (reject) H0?


SLIDE 18


Rejecting H0

H0 defines a distribution P(M | H0) over some statistic M
(e.g. M = the difference in accuracy between A and B)

Select a significance value S (e.g. 0.05, 0.01, etc.)
You can only reject H0 if P(M = m | H0) ≤ S

Compute the test statistic m from your data,
e.g. the average difference in accuracy over your N folds

Compute P(m | H0)
Refute H0 with p-value p ≤ S if P(m | H0) ≤ S

Note: p-value = P(m | H0), not P(H0 | m) (a common misunderstanding)


SLIDE 19


Paired t-test

Compare the accuracy of two (binary) classifiers on k different test sets.

Alternatives, e.g. McNemar’s test: Compare the accuracy of two (binary) classifiers on a single test set (do they make mistakes on the same items?)


     test set 1   test set 2   test set 3   test set 4   test set 5
A    80%          82%          85%          78%          85%
B    81%          81%          86%          80%          88%
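
As a sketch, this comparison can be run with SciPy’s paired t-test on the accuracies from the table above (scipy.stats.ttest_rel pairs A’s and B’s scores per test set):

```python
from scipy import stats

acc_A = [0.80, 0.82, 0.85, 0.78, 0.85]   # row A of the table above
acc_B = [0.81, 0.81, 0.86, 0.80, 0.88]   # row B of the table above

t_stat, p_value = stats.ttest_rel(acc_A, acc_B)
# Reject H0 (no difference) if p_value <= your significance level S
print(t_stat, p_value)
```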

SLIDE 20


N-fold cross validation

Instead of a single test-training split:
– Split the data into N equal-sized parts
– Train and test N different instances of the same classifier
– This gives N different accuracies


[Figure: the data is split into N parts; each fold holds out one part as the test set and trains on the rest.]
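
A minimal N-fold cross-validation sketch; train_fn and test_fn are hypothetical stand-ins for your classifier’s training and accuracy-evaluation routines:

```python
import numpy as np

def cross_val_accuracies(X, y, train_fn, test_fn, n_folds=5):
    # Split the data into n_folds parts; each fold trains on the other
    # parts and tests on the held-out part, giving one accuracy per fold.
    indices = np.arange(len(y))
    accuracies = []
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        model = train_fn(X[train_idx], y[train_idx])
        accuracies.append(test_fn(model, X[test_idx], y[test_idx]))
    return accuracies
```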

SLIDE 21


Paired t-test

Compare the accuracy of classifiers A and B on k different test sets.

t-test:
– Assumption: Accuracies are drawn from a normal distribution (with unknown variance)
– Null hypothesis: The accuracies of A and B are drawn from the same distribution
– Alternative hypothesis: The accuracies are drawn from two distributions with different means


SLIDE 22


Paired t-test

Compare the accuracy of classifiers A and B on k different test sets.

Paired t-test: The accuracy of A on test set i is paired with the accuracy of B on test set i
– Null hypothesis: If A’s and B’s accuracies are from the same distribution, their difference (on the same test set) comes from a normal distribution with mean = 0
– Alternative hypothesis: The difference between A’s and B’s accuracies doesn’t come from a distribution with mean = 0


SLIDE 23


Paired t-test for cross-validation

Two different classifiers, A and B, are trained and tested using N-fold cross-validation.
For the n-th fold: accuracy(A, n), accuracy(B, n)
diff_n = accuracy(A, n) − accuracy(B, n)
Null hypothesis: diff comes from a distribution with mean (expected value) = 0.


SLIDE 24


Paired t-test

Null hypothesis (H0; to be refuted): There is no difference between A and B, i.e. the expected accuracies of A and B are the same.
That is, the expected difference (over all possible data sets) between their accuracies is 0:
H0: E[diff_D] = 0
We don’t know the true E[diff_D].
K-fold cross-validation gives us K samples of diff_D.


SLIDE 25


t-distribution

– Take a sample of n observations from a normal distribution with fixed (but unknown) mean and variance
– Compute the sample mean and sample variance of these observations
– The t-distribution with n−1 degrees of freedom can be used to estimate how likely it is that the true mean lies in a given range
– Accept the null hypothesis at significance level α if the t-statistic lies in (−t_{α/2, n−1}, +t_{α/2, n−1})
– There are tables where you can look this up (or compute it, as sketched below)
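
Instead of a printed table, the critical value t_{α/2, n−1} can be looked up numerically; a sketch with α = 0.05 and n = 5 as example values:

```python
from scipy import stats

alpha, n = 0.05, 5
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-sided critical value
# Accept H0 at level alpha if the t-statistic lies in (-t_crit, +t_crit)
print(t_crit)
```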


SLIDE 26


Paired t-test

Null hypothesis H0: E[diff_D] = µ_diff = 0

m: our estimate of µ based on N samples of diff_D:

$$m = \frac{1}{N}\sum_{n=1}^{N} \mathrm{diff}_n$$

The sample variance S²:

$$S^2 = \frac{\sum_{n=1}^{N} (\mathrm{diff}_n - m)^2}{N-1}$$

Accept the null hypothesis at significance level α if the following statistic lies in (−t_{α/2, N−1}, +t_{α/2, N−1}):

$$\frac{\sqrt{N}\, m}{S} \sim t_{N-1}$$
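
Computing this statistic directly from the per-fold differences (a sketch; the numbers reuse the accuracies from the table on slide 19):

```python
import numpy as np

acc_A = np.array([0.80, 0.82, 0.85, 0.78, 0.85])
acc_B = np.array([0.81, 0.81, 0.86, 0.80, 0.88])
diffs = acc_A - acc_B          # diff_n = accuracy(A, n) - accuracy(B, n)

N = len(diffs)
m = diffs.mean()               # sample mean of the differences
S = diffs.std(ddof=1)          # sample standard deviation (N-1 denominator)
t_stat = np.sqrt(N) * m / S    # ~ t with N-1 degrees of freedom under H0
print(t_stat)
```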

SLIDE 27


One-sided vs. two-sided tests

One-tailed: Test whether the accuracy of A is higher than that of B.
Two-tailed: Test whether the accuracies of A and B are different.
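
With SciPy’s paired t-test, this choice is just the `alternative` argument (available in SciPy 1.6 and later); the values again reuse the slide 19 accuracies:

```python
from scipy import stats

acc_A = [0.80, 0.82, 0.85, 0.78, 0.85]
acc_B = [0.81, 0.81, 0.86, 0.80, 0.88]

# Two-tailed: are A and B different?
print(stats.ttest_rel(acc_A, acc_B, alternative='two-sided'))
# One-tailed: is A's accuracy higher than B's?
print(stats.ttest_rel(acc_A, acc_B, alternative='greater'))
```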
