slide-1
SLIDE 1

Notes and Announcements

  • Midterm exam: Oct 20, Wednesday, In Class
  • Late Homeworks

– Turn in hardcopies to Michelle.
– DO NOT ask Michelle for extensions.
– Note down the date and time of submission.
– If submitting softcopy, email to the 10-701 instructors list.
– Software needs to be submitted via Blackboard.

  • HW2 out today – watch email

1

slide-2
SLIDE 2

Projects

Hands-on experience with Machine Learning algorithms – understand when they work and fail, develop new ones!
Project ideas are online; discuss with the TAs – every project must have a TA mentor.

  • Proposal (10%): Oct 11
  • Mid-term report (25%): Nov 8
  • Poster presentation (20%): Dec 2, 3-6 pm, NSH Atrium
  • Final Project report (45%): Dec 6

2

slide-3
SLIDE 3

Project Proposal

  • Proposal (10%): Oct 11

– 1 page maximum
– Describe the data set
– Project idea (approx. two paragraphs)
– Software you will need to write
– 1-3 relevant papers. Read at least one before submitting your proposal.
– Teammate: maximum team size is 2. Describe the division of work.
– Project milestone for the mid-term report? Include experimental results.

3

slide-4
SLIDE 4

Recitation Tomorrow!

4

  • Linear & Non-linear Regression, Nonparametric methods

  • Strongly recommended!!
  • Place: NSH 1507 (Note)
  • Time: 5-6 pm

TK

slide-5
SLIDE 5

Non-parametric methods

Kernel density estimate, kNN classifier, kernel regression

Aarti Singh

Machine Learning 10-701/15-781 Sept 29, 2010

slide-6
SLIDE 6

Parametric methods

  • Assume some functional form (Gaussian, Bernoulli, Multinomial, logistic, Linear) for

– P(Xi|Y) and P(Y) as in Naïve Bayes
– P(Y|X) as in Logistic regression

  • Estimate parameters (μ, σ², θ, w, β) using MLE/MAP and plug in
  • Pro – need few data points to learn parameters
  • Con – strong distributional assumptions, not satisfied in practice

6
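A minimal sketch of the parametric plug-in idea, using Gaussian class-conditionals (a Gaussian Naïve Bayes style classifier); the function names and toy setup are illustrative, not from the slides.

```python
import numpy as np

# Sketch: parametric plug-in classifier with Gaussian class-conditionals.
# Fit parameters by MLE, then plug them into Bayes rule.

def fit_gaussian_nb(X, y):
    """MLE of class priors and per-class, per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict(params, x):
    """Plug the estimated parameters into Bayes rule; pick the most probable class."""
    def log_post(p):
        return (np.log(p["prior"])
                - 0.5 * np.sum(np.log(2 * np.pi * p["var"])
                               + (x - p["mean"]) ** 2 / p["var"]))
    return max(params, key=lambda c: log_post(params[c]))
```

If the Gaussian assumption does not hold for the data, this plug-in estimate can be badly biased – which is the "Con" above and the motivation for the nonparametric methods that follow.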

slide-7
SLIDE 7

Example

7

Hand-written digit images (labeled 1–9 in the figure) projected as points onto a two-dimensional (nonlinear) feature space.

slide-8
SLIDE 8

Non-Parametric methods

  • Typically don’t make any distributional assumptions
  • As we have more data, we should be able to learn more complex models
  • Let the number of parameters scale with the number of training data points
  • Today, we will see some nonparametric methods for

– Density estimation
– Classification
– Regression

8

slide-9
SLIDE 9

Histogram density estimate

9

Partition the feature space into distinct bins with widths Δi and count the number of observations, ni, in each bin.

  • Often, the same width is used for all bins, Δi = Δ.
  • Δ acts as a smoothing parameter.

Image src: Bishop book
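A minimal sketch of the (1-D) histogram density estimate, assuming equal-width bins and the standard estimate p̂(x) = ni / (n·Δ); the variable names and toy data are illustrative.

```python
import numpy as np

# Sketch of a 1-D histogram density estimate p_hat(x) = n_i / (n * delta),
# with equal-width bins over [lo, hi].

def histogram_density(data, lo, hi, num_bins):
    delta = (hi - lo) / num_bins
    counts, _ = np.histogram(data, bins=num_bins, range=(lo, hi))
    def p_hat(x):
        i = min(int((x - lo) / delta), num_bins - 1)   # bin index containing x
        return counts[i] / (len(data) * delta)
    return p_hat

# Example: 1000 samples from a standard normal
rng = np.random.default_rng(0)
p_hat = histogram_density(rng.normal(size=1000), lo=-4, hi=4, num_bins=20)
print(p_hat(0.0))   # should be close to 1/sqrt(2*pi) ~ 0.40
```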

slide-10
SLIDE 10

Effect of histogram bin width

10

# bins = 1/Δ

Assuming the density is roughly constant in each bin (holds true if Δ is small):

Bias of histogram density estimate: E[p̂(x)] − p(x), where E[p̂(x)] = (1/Δ) ∫_bin p(t) dt ≈ p(x) when Δ is small.

slide-11
SLIDE 11

Bias – Variance tradeoff

  • Choice of #bins
  • Bias – how close is the mean of estimate to the truth
  • Variance – how much does the estimate vary around mean

– Small Δ, large #bins: “small bias, large variance” (p(x) approx constant per bin)
– Large Δ, small #bins: “large bias, small variance” (more data per bin, stable estimate)
Bias-variance tradeoff

# bins = 1/Δ

11

slide-12
SLIDE 12

Choice of #bins

12

Image src: Bishop book

# bins = 1/Δ

Image src: Larry book

For fixed n: as Δ decreases, ni decreases. MSE = Bias² + Variance.

slide-13
SLIDE 13

Histogram as MLE

  • Class of density estimates – piecewise constant on each bin

Parameters pj = density in bin j. Note that since the density must integrate to 1, Σj pj Δj = 1.

  • Maximize the likelihood of the data under the probability model with parameters pj
  • Show that the histogram density estimate is the MLE under this model – HW/Recitation

13
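A sketch of the likelihood setup under this piecewise-constant model (the maximization itself is left to the HW/recitation); the notation nj for the count of points in bin j is an assumption.

```latex
% Piecewise-constant model: p(x) = p_j for x in bin j, with \sum_j p_j \Delta_j = 1.
% With n_j points falling in bin j, the log-likelihood of the data is
\log L(p_1,\dots,p_m)
  \;=\; \sum_{i=1}^{n} \log p(x_i)
  \;=\; \sum_{j=1}^{m} n_j \log p_j ,
\qquad \text{subject to } \sum_{j=1}^{m} p_j \, \Delta_j = 1 .
% Maximizing this constrained likelihood (e.g. with a Lagrange multiplier)
% recovers the histogram density estimate -- details in the HW/recitation.
```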

slide-14
SLIDE 14
Kernel density estimate

  • Histogram – blocky estimate
  • Kernel density estimate, aka “Parzen/moving window method”

14


slide-15
SLIDE 15
Kernel density estimate

  • More generally, place a kernel at each training point:

p̂(x) = (1/n) Σi (1/h) K((x − Xi)/h)

15

slide-16
SLIDE 16

Kernel density estimation

16

Gaussian bumps (red) around six data points and their sum (blue)

  • Place small "bumps" at each data point, determined by the

kernel function.

  • The estimator consists of a (normalized) "sum of bumps”.
  • Note that where the points are denser the density estimate

will have higher values.

Img src: Wikipedia
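A minimal sketch of a Gaussian kernel density estimate p̂(x) = (1/n) Σi (1/h) K((x − Xi)/h); the bandwidth and the six data points are illustrative.

```python
import numpy as np

# Sketch of a 1-D kernel density estimate with a Gaussian kernel:
# p_hat(x) = (1/n) * sum_i (1/h) * K((x - X_i) / h)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(data, h):
    data = np.asarray(data, float)
    def p_hat(x):
        # one Gaussian "bump" per data point, summed and normalized by n*h
        return gaussian_kernel((x - data) / h).sum() / (len(data) * h)
    return p_hat

# Example: six data points, as in the figure above (values are made up)
p_hat = kde([1.0, 1.3, 1.9, 2.4, 3.1, 3.2], h=0.4)
print(p_hat(2.0))
```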

slide-17
SLIDE 17

Kernels

17

Any kernel function that satisfies K(x) ≥ 0 and ∫ K(x) dx = 1 can be used (symmetric kernels, with ∫ x K(x) dx = 0, are typical).

slide-18
SLIDE 18

Kernels

18

Finite support
– only need local points to compute the estimate

Infinite support
– need all points to compute the estimate
– but quite popular since smoother (10-702)

slide-19
SLIDE 19

Choice of kernel bandwidth

19

Image source: Larry’s book – All of Nonparametric Statistics

Bart-Simpson density estimated with bandwidth too small, too large, and just right.

slide-20
SLIDE 20

Histograms vs. Kernel density estimation

20

Bin width Δ (histograms) and bandwidth h (kernel density estimate) act as smoothing parameters.

slide-21
SLIDE 21

Bias-variance tradeoff

  • Simulations

21
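A sketch of what such a simulation might look like: repeatedly sample from a known density, evaluate the kernel density estimate at a fixed point, and compare the average error (bias) and spread (variance) for small, moderate, and large bandwidths. The sample sizes and bandwidth values are illustrative.

```python
import numpy as np

# Bias-variance simulation sketch for the KDE bandwidth.
rng = np.random.default_rng(0)
x0, true_p = 0.0, 1.0 / np.sqrt(2 * np.pi)      # N(0,1) density at x0 = 0

def kde_at(data, x, h):
    u = (x - data) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum() / (len(data) * h)

for h in (0.02, 0.3, 3.0):                       # too small, moderate, too large
    estimates = [kde_at(rng.normal(size=200), x0, h) for _ in range(500)]
    bias = np.mean(estimates) - true_p
    var = np.var(estimates)
    print(f"h={h}: bias={bias:+.3f}, variance={var:.4f}")
```

Small h should show little bias but large variance across repetitions; large h the opposite.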

slide-22
SLIDE 22

k-NN (Nearest Neighbor) density estimation

  • Histogram, kernel density estimate: fix Δ (or h), estimate the number of points within Δ of x (ni or nx) from the data.
  • k-NN density estimate: fix nx = k, estimate Δ from the data (volume of the ball around x that contains the k nearest training points).

22

slide-23
SLIDE 23

k-NN density estimation

23

k acts as a smoother.

Not very popular for density estimation – expensive to compute and gives poor estimates. But a related version for classification is quite popular …
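A minimal sketch, assuming the standard k-NN density estimate p̂(x) = k / (n·V), where V is the volume of the smallest ball around x containing k training points (in 1-D, an interval of length twice the distance to the k-th nearest neighbor). Names and values are illustrative.

```python
import numpy as np

# Sketch of a 1-D k-NN density estimate: p_hat(x) = k / (n * V).

def knn_density(data, k):
    data = np.asarray(data, float)
    def p_hat(x):
        d_k = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th nearest neighbor
        volume = 2.0 * d_k                        # 1-D "ball" = interval of length 2*d_k
        return k / (len(data) * volume)
    return p_hat

rng = np.random.default_rng(0)
p_hat = knn_density(rng.normal(size=500), k=20)
print(p_hat(0.0))   # roughly 1/sqrt(2*pi) ~ 0.40
```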

slide-24
SLIDE 24

From Density estimation to Classification

24

slide-25
SLIDE 25

k-NN classifier

25

Sports Science Arts

slide-26
SLIDE 26

k-NN classifier

26

Sports Science Arts

Test document

slide-27
SLIDE 27

k-NN classifier (k=4)

27

Sports Science Arts

Test document. What should we predict? … Average? Majority? Why?

Dk,x

slide-28
SLIDE 28

k-NN classifier

28

  • Optimal classifier: predict the class y maximizing P(Y = y) p(X = x | Y = y).
  • k-NN classifier: approximate p(x | Y = y) ≈ ky/(ny·V) and P(Y = y) ≈ ny/n, where ny = # total training pts of class y and ky = # training pts of class y that lie within the Dk ball of volume V. The maximizing class is the one with the largest ky – a majority vote among the k neighbors.
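A minimal sketch of the majority-vote k-NN classifier with Euclidean distance; the toy training points and labels are illustrative.

```python
import numpy as np
from collections import Counter

# Sketch of a majority-vote k-NN classifier with Euclidean distance.

def knn_classify(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every training point
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)    # count labels among the neighbors
    return votes.most_common(1)[0][0]               # majority vote

# Toy example with three classes (sports / science / arts style)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9], [0.0, 1.0]])
y_train = np.array(["sports", "sports", "science", "science", "arts"])
print(knn_classify(X_train, y_train, np.array([0.9, 1.1]), k=3))   # -> "science"
```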

slide-29
SLIDE 29

1-Nearest Neighbor (kNN) classifier

Sports Science Arts

29

slide-30
SLIDE 30

2-Nearest Neighbor (kNN) classifier

Sports Science Arts

30

An even K is not used in practice (to avoid ties).

slide-31
SLIDE 31

3-Nearest Neighbor (kNN) classifier

Sports Science Arts

31

slide-32
SLIDE 32

5-Nearest Neighbor (kNN) classifier

Sports Science Arts

32

slide-33
SLIDE 33

What is the best K?

33

Bias-variance tradeoff:
– Larger K => predicted label is more stable
– Smaller K => predicted label is more accurate
Similar to density estimation. Choice of K – in next class …

slide-34
SLIDE 34

1-NN classifier – decision boundary

34

K = 1

Voronoi Diagram

slide-35
SLIDE 35

k-NN classifier – decision boundary

35

  • K acts as a smoother (Bias-variance tradeoff)
  • Guarantee: as the number of training points n → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal (Bayes) error.

slide-36
SLIDE 36

Case Study: kNN for Web Classification

  • Dataset

– 20 News Groups (20 classes)
– Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
– 61,118 words, 18,774 documents
– Class labels descriptions

37

slide-37
SLIDE 37

Experimental Setup

  • Training/Test Sets:

– 50%-50% randomly split
– 10 runs – report average results

  • Evaluation criteria: classification accuracy (fraction of test documents correctly classified)

38

slide-38
SLIDE 38

Results: Binary Classes

Accuracy vs. k for three binary tasks: alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles.

39

slide-39
SLIDE 39

From Classification to Regression

40

slide-40
SLIDE 40

Temperature sensing

  • What is the temperature in the room? Average of all sensor readings.
  • What is the temperature at location x? A “local” average of nearby sensors.

41

slide-41
SLIDE 41

Kernel Regression

  • Aka Local Regression
  • Nadaraya-Watson kernel estimator:

f̂(x) = Σi wi(x) Yi,   where   wi(x) = K((x − Xi)/h) / Σj K((x − Xj)/h)

  • Weight each training point based on its distance to the test point
  • Boxcar kernel yields the local average

42
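A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (a boxcar kernel would give the plain local average); the bandwidth and toy data are illustrative.

```python
import numpy as np

# Sketch of Nadaraya-Watson kernel regression:
# f_hat(x) = sum_i w_i(x) * Y_i,  w_i(x) = K((x - X_i)/h) / sum_j K((x - X_j)/h)

def kernel_regression(X, Y, h):
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    def f_hat(x):
        weights = np.exp(-0.5 * ((x - X) / h) ** 2)    # unnormalized Gaussian weights
        return np.dot(weights, Y) / weights.sum()       # weighted average of the Y's
    return f_hat

# Toy example: noisy sine curve
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 50)
Y = np.sin(X) + 0.1 * rng.normal(size=50)
f_hat = kernel_regression(X, Y, h=0.3)
print(f_hat(np.pi / 2))   # should be close to sin(pi/2) = 1
```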

slide-42
SLIDE 42

Kernels

43

slide-43
SLIDE 43

Choice of kernel bandwidth h

44

Image source: Larry’s book – All of Nonparametric Statistics

Panels show bandwidths h = 1, 10, 50, 200, ranging from too small to too large. The choice of kernel is not that important.

slide-44
SLIDE 44

Kernel Regression as Weighted Least Squares

45

Weighted least squares: minimize over b the objective Σi K((x − Xi)/h) (Yi − b)²

Kernel regression corresponds to a locally constant estimator obtained from (locally) weighted least squares, i.e. set f(Xi) = b (a constant).

slide-45
SLIDE 45

Kernel Regression as Weighted Least Squares

46

Setting f(Xi) = b (a constant), notice that minimizing the weighted least squares objective over b recovers exactly the Nadaraya-Watson kernel estimator (see the derivation sketch below).
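A short derivation sketch of the "notice that" step: setting the derivative of the locally weighted least-squares objective to zero yields the Nadaraya-Watson weights.

```latex
% Locally constant fit: minimize over b the weighted squared error at a query point x
J(b) = \sum_{i=1}^{n} K\!\left(\tfrac{x - X_i}{h}\right) (Y_i - b)^2
% Setting dJ/db = 0:
\frac{dJ}{db} = -2 \sum_{i} K\!\left(\tfrac{x - X_i}{h}\right)(Y_i - b) = 0
\;\Longrightarrow\;
\hat{b} = \frac{\sum_i K\!\left(\tfrac{x - X_i}{h}\right) Y_i}{\sum_j K\!\left(\tfrac{x - X_j}{h}\right)} ,
% which is exactly the Nadaraya-Watson estimator from the earlier slide.
```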

slide-46
SLIDE 46

Local Linear/Polynomial Regression

47

Weighted least squares: minimize over b0, …, bp the objective Σi K((x − Xi)/h) (Yi − b0 − b1(Xi − x) − … − bp(Xi − x)^p)²

Local polynomial regression corresponds to a locally polynomial estimator obtained from (locally) weighted least squares, i.e. set f(Xi) = b0 + b1(Xi − x) + … + bp(Xi − x)^p (a local polynomial of degree p around x).

More in HW, 10-702 (statistical machine learning)
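A minimal sketch of the degree-1 case (local linear regression), solving the weighted least squares problem at each query point; the variable names, bandwidth, and toy data are illustrative.

```python
import numpy as np

# Sketch of local linear regression (degree p = 1): at a query point x, fit Y on
# (1, X_i - x) by weighted least squares with kernel weights; the fitted intercept
# b0 is the estimate f_hat(x).

def local_linear(X, Y, h):
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    def f_hat(x):
        w = np.exp(-0.5 * ((x - X) / h) ** 2)            # Gaussian kernel weights
        A = np.column_stack([np.ones_like(X), X - x])    # local design matrix [1, X_i - x]
        W = np.diag(w)
        b = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)    # solve (A^T W A) b = A^T W Y
        return b[0]                                       # intercept = estimate at x
    return f_hat

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 50)
Y = np.sin(X) + 0.1 * rng.normal(size=50)
print(local_linear(X, Y, h=0.5)(np.pi / 2))   # close to 1
```

Compared with the locally constant (Nadaraya-Watson) fit, the local linear fit typically has less bias near the boundaries of the data.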

slide-47
SLIDE 47

Summary

  • Instance based/non-parametric approaches

48

Four things make a memory-based learner:
1. A distance metric, dist(x, Xi) – Euclidean (and many more)
2. How many nearby neighbors / what radius to look at? k, Δ/h
3. A weighting function (optional) – W based on kernel K
4. How to fit with the local points? Average, majority vote, weighted average, polynomial fit

slide-48
SLIDE 48

Summary

  • Parametric vs Nonparametric approaches

49

  • Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data. Parametric models rely on very strong (simplistic) distributional assumptions.
  • Nonparametric models (not histograms) require storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.

slide-49
SLIDE 49

What you should know…

  • Histograms, kernel density estimation

– Effect of bin width / kernel bandwidth
– Bias-variance tradeoff

  • k-NN classifier

– Nonlinear decision boundaries

  • Kernel (local) regression

– Interpretation as weighted least squares
– Local constant/linear/polynomial regression

50