LECTURE 8: LOOSE ENDS
Prof. Julia Hockenmaier


SLIDE 1

CS446 Introduction to Machine Learning (Fall 2013)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Prof. Julia Hockenmaier
juliahmr@illinois.edu

LECTURE 8: LOOSE ENDS

SLIDE 2

Admin

SLIDE 3

Admin: Homework

– HW1 is due tonight (11:59pm)
  Last-minute TA office hour (Ryan): SC 3403, 5pm-6pm today
– SGD: Apologies for the misunderstanding/miscommunication (more on this in a bit)
  Fill out the (optional) Compass survey.
– Future HWs: Start early!
  Let us know ASAP when something is unclear.

SLIDE 4

Reminder: Homework Late Policy

– Everybody is allowed a total of two late days for the semester.
– If you have exhausted your contingent of late days, we will subtract 20% per late day.
– We don’t accept assignments more than three days after their due date.
– Let us know if there are any special circumstances (family, health, etc.).

SLIDE 5


Admin: Midterm

Midterm exam: Thursday, Oct 10 in class

Let us know ASAP if you know you have a conflict (job interview?) or need accommodations

We will post past midterms on the website

Caveat: different instructor!


SLIDE 6


Admin: Projects (4th credit hour)

Do you have an idea?

Great ML class projects (and write-ups) can be found at http://cs229.stanford.edu/projects2012.html

For datasets and problems, see also http://www.kaggle.com/competitions or the UCI machine learning repository: http://archive.ics.uci.edu/ml/

Do you have a partner? => Compass survey: due by next Friday

(to make sure everybody is on track)


SLIDE 7


Review: Stochastic Gradient Descent


SLIDE 8


SGD questions that came up in HW1…

… What’s the difference between batch and online learning?
… When do we update the weight vector?
… How do we check for convergence?
… When do we check for convergence?


SLIDE 9


Terminology: Batch learning

The hypothesis (e.g. weight vector) changes based on a batch (set) of training examples. See all examples in the batch, then update your weight vector.

Typically, one batch = all training examples.
‘Mini-batch’ = a small number of training examples.

Examples of batch algorithms we’ve seen so far: Standard gradient descent, Decision trees
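
To make the batch update concrete, here is a minimal sketch of one batch gradient-descent step under squared (LMS) loss. Not from the slides: the function name, learning rate, and NumPy usage are illustrative assumptions.

```python
import numpy as np

def batch_gd_step(w, X, y, lr=0.01):
    # One BATCH update: see all examples in (X, y), then update w once.
    # Illustrative loss: LMS, L(w) = 1/2 * sum_i (w . x_i - y_i)^2
    preds = X @ w               # predictions for the whole batch
    grad = X.T @ (preds - y)    # gradient accumulated over all examples
    return w - lr * grad        # single update after seeing the batch
```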


SLIDE 10


Terminology: Online learning

The hypothesis (e.g. weight vector) changes based on an individual training example.

Examples: stochastic gradient descent, Winnow

Every time you see a new training example, you may have to update your weight vector.
– SGD with LMS loss: w changes with every example
– SGD with perceptron loss: w changes only with misclassified examples

(since the gradient = 0 for correctly classified examples)
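
A sketch of the corresponding per-example (online) updates, assuming labels y ∈ {−1, +1} for the perceptron case; function names and learning rate are illustrative:

```python
import numpy as np

def sgd_lms_step(w, x, y, lr=0.01):
    # LMS loss: the gradient (w.x - y) * x is rarely exactly zero,
    # so w changes with (essentially) every example.
    return w - lr * (w @ x - y) * x

def sgd_perceptron_step(w, x, y, lr=0.01):
    # Perceptron loss: the gradient is 0 for correctly classified
    # examples (y * w.x > 0), so w changes only on mistakes.
    if y * (w @ x) <= 0:
        return w + lr * y * x
    return w
```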


SLIDE 11


Reminder: Loss functions


[Figure: loss as a function of y·f(x), plotting 0-1 loss, square loss, and perceptron loss.]
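
Written out, the three plotted curves correspond to the following standard definitions in terms of the margin m = y·f(x), with y ∈ {−1, +1} (these definitions are assumed here, since the slide gives only the plot):

$$\ell_{0/1}(m) = \mathbf{1}[m \le 0], \qquad \ell_{\text{square}}(m) = (1 - m)^2, \qquad \ell_{\text{perceptron}}(m) = \max(0, -m)$$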

SLIDE 12


Convergence checks

What does it mean for w to have converged?
– Define a convergence threshold τ (e.g. τ = 10⁻³)
– Compute Δw, the difference between w_old and w_new: Δw = w_old − w_new
– w has converged when ‖Δw‖ < τ
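
This check translates directly into code; a minimal sketch (function name and NumPy usage are assumptions):

```python
import numpy as np

def has_converged(w_old, w_new, tau=1e-3):
    # w has converged when the norm of its change falls below tau
    return np.linalg.norm(w_old - w_new) < tau
```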


SLIDE 13


Convergence checks

How often do I check for convergence?
Batch learning:
– w_old = w before seeing the current batch
– w_new = w after seeing the current batch
Assuming your batch is large enough, this works well.


SLIDE 14


Convergence checks

How often do I check for convergence?
Online learning:
– Problem: A single example may lead to only very small changes in w
– Solution: Only check for convergence after every k examples (or updates; it doesn’t matter), as in the sketch below:
  w_old = w after n·k examples/updates
  w_new = w after (n+1)·k examples/updates
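
A sketch of an online-learning loop with this periodic check; step() stands in for any per-example update (e.g. the LMS step above), and all names and defaults are illustrative:

```python
import numpy as np

def sgd_until_converged(w, examples, step, k=100, tau=1e-3, max_passes=50):
    # Check for convergence only every k examples, since a single
    # example may change w very little.
    w_old = w.copy()
    seen = 0
    for _ in range(max_passes):
        for x, y in examples:
            w = step(w, x, y)
            seen += 1
            if seen % k == 0:                        # every k updates...
                if np.linalg.norm(w_old - w) < tau:  # ...compare snapshots
                    return w
                w_old = w.copy()
    return w
```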


SLIDE 15


Another loose end: Hypothesis testing for evaluation


SLIDE 16


Why hypothesis testing?

We evaluate the accuracy of our classifiers on unseen data.

Hypothesis testing can tell us whether the difference in accuracy between two classifiers is statistically significant or not.


SLIDE 17


Hypothesis testing

You want to show that hypothesis H is true, based on your data
(e.g. H = “classifiers A and B are different”)

– Define a null hypothesis H0
  (H0 is the contrary of what you want to show)

– H0 defines a distribution P(m | H0) over some statistic (number) that you can compute,
  e.g. a distribution over the difference in accuracy between A and B

– Can you refute (reject) H0?


SLIDE 18


Rejecting H0

H0 defines a distribution P(M | H0) over some statistic M
(e.g. M = the difference in accuracy between A and B)

Select a significance value S (e.g. 0.05, 0.01, etc.)
You can only reject H0 if P(M = m | H0) ≤ S

Compute the test statistic m from your data,
e.g. the average difference in accuracy over your N folds

Compute P(m | H0)
Refute H0 with p-value p ≤ S if P(m | H0) ≤ S

Note: p-value = P(m | H0), not P(H0 | m) (a common misunderstanding)


SLIDE 19


Paired t-test

Compare the accuracy of two (binary) classifiers on k different test sets.

Alternatives, e.g. McNemar’s test: Compare the accuracy of two (binary) classifiers on a single test set (do they make mistakes on the same items?)


     test set 1   test set 2   test set 3   test set 4   test set 5
A    80%          82%          85%          78%          85%
B    81%          81%          86%          80%          88%
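
As a sketch, this comparison can be run with SciPy’s paired t-test on the accuracies from the table above (scipy.stats.ttest_rel pairs A’s and B’s scores per test set):

```python
from scipy import stats

acc_A = [0.80, 0.82, 0.85, 0.78, 0.85]   # row A of the table above
acc_B = [0.81, 0.81, 0.86, 0.80, 0.88]   # row B of the table above

t_stat, p_value = stats.ttest_rel(acc_A, acc_B)
# Reject H0 (no difference) if p_value <= your significance level S
print(t_stat, p_value)
```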

SLIDE 20


N-fold cross validation

Instead of a single test-training split:
– Split the data into N equal-sized parts
– Train and test N different instances of the same classifier
– This gives N different accuracies


[Figure: the data is split into N parts; each fold holds out one part as the test set and trains on the rest.]
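
A minimal N-fold cross-validation sketch; train_fn and test_fn are hypothetical stand-ins for your classifier’s training and accuracy-evaluation routines:

```python
import numpy as np

def cross_val_accuracies(X, y, train_fn, test_fn, n_folds=5):
    # Split the data into n_folds parts; each fold trains on the other
    # parts and tests on the held-out part, giving one accuracy per fold.
    indices = np.arange(len(y))
    accuracies = []
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        model = train_fn(X[train_idx], y[train_idx])
        accuracies.append(test_fn(model, X[test_idx], y[test_idx]))
    return accuracies
```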

SLIDE 21


Paired t-test

Compare the accuracy of classifiers A and B on k different test sets.

t-test:
– Assumption: Accuracies are drawn from a normal distribution (with unknown variance)
– Null hypothesis: The accuracies of A and B are drawn from the same distribution
– Alternative hypothesis: The accuracies are drawn from two distributions with different means


SLIDE 22


Paired t-test

Compare the accuracy of classifiers A and B on k different test sets.

Paired t-test: The accuracy of A on test set i is paired with the accuracy of B on test set i
– Null hypothesis: If A’s and B’s accuracies are from the same distribution, their difference (on the same test set) comes from a normal distribution with mean = 0
– Alternative hypothesis: The difference between A’s and B’s accuracies doesn’t come from a distribution with mean = 0


SLIDE 23


Paired t-test for cross-validation

Two different classifiers, A and B, are trained and tested using N-fold cross-validation.
For the n-th fold: accuracy(A, n), accuracy(B, n)
diff_n = accuracy(A, n) − accuracy(B, n)
Null hypothesis: diff comes from a distribution with mean (expected value) = 0.


SLIDE 24


Paired t-test

Null hypothesis (H0; to be refuted): There is no difference between A and B, i.e. the expected accuracies of A and B are the same.
That is, the expected difference (over all possible data sets) between their accuracies is 0:
H0: E[diff_D] = 0
We don’t know the true E[diff_D].
K-fold cross-validation gives us K samples of diff_D.


SLIDE 25


t-distribution

– Take a sample of n observations from a normal distribution with fixed (but unknown) mean and variance
– Compute the sample mean and sample variance of these observations
– The t-distribution with n−1 degrees of freedom can be used to estimate how likely it is that the true mean lies in a given range
– Accept the null hypothesis at significance level α if the t-statistic lies in (−t_{α/2, n−1}, +t_{α/2, n−1})
– There are tables where you can look this up (or compute it, as sketched below)
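
Instead of a printed table, the critical value t_{α/2, n−1} can be looked up numerically; a sketch with α = 0.05 and n = 5 as example values:

```python
from scipy import stats

alpha, n = 0.05, 5
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-sided critical value
# Accept H0 at level alpha if the t-statistic lies in (-t_crit, +t_crit)
print(t_crit)
```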


SLIDE 26


Paired t-test

Null hypothesis H0: E[diff_D] = µ_diff = 0

m: our estimate of µ based on N samples of diff_D:

$$m = \frac{1}{N}\sum_{n=1}^{N} \mathrm{diff}_n$$

The sample variance S²:

$$S^2 = \frac{\sum_{n=1}^{N} (\mathrm{diff}_n - m)^2}{N-1}$$

Accept the null hypothesis at significance level α if the following statistic lies in (−t_{α/2, N−1}, +t_{α/2, N−1}):

$$\frac{\sqrt{N}\, m}{S} \sim t_{N-1}$$
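
Computing this statistic directly from the per-fold differences (a sketch; the numbers reuse the accuracies from the table on slide 19):

```python
import numpy as np

acc_A = np.array([0.80, 0.82, 0.85, 0.78, 0.85])
acc_B = np.array([0.81, 0.81, 0.86, 0.80, 0.88])
diffs = acc_A - acc_B          # diff_n = accuracy(A, n) - accuracy(B, n)

N = len(diffs)
m = diffs.mean()               # sample mean of the differences
S = diffs.std(ddof=1)          # sample standard deviation (N-1 denominator)
t_stat = np.sqrt(N) * m / S    # ~ t with N-1 degrees of freedom under H0
print(t_stat)
```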

SLIDE 27


One-sided vs. two-sided tests

One-tailed: Test whether the accuracy of A is higher than that of B.
Two-tailed: Test whether the accuracies of A and B are different.
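
With SciPy’s paired t-test, this choice is just the `alternative` argument (available in SciPy 1.6 and later); the values again reuse the slide 19 accuracies:

```python
from scipy import stats

acc_A = [0.80, 0.82, 0.85, 0.78, 0.85]
acc_B = [0.81, 0.81, 0.86, 0.80, 0.88]

# Two-tailed: are A and B different?
print(stats.ttest_rel(acc_A, acc_B, alternative='two-sided'))
# One-tailed: is A's accuracy higher than B's?
print(stats.ttest_rel(acc_A, acc_B, alternative='greater'))
```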
