slide-1
SLIDE 1

Notes and Announcements

  • Midterm exam: Oct 20, Wednesday, In Class
  • Late Homeworks

– Turn in hardcopies to Michelle.
– DO NOT ask Michelle for extensions.
– Note down the date and time of submission.
– If submitting softcopy, email to the 10-701 instructors list.
– Software needs to be submitted via Blackboard.

  • HW2 out today – watch email

1

slide-2
SLIDE 2

Projects

Hands-on experience with Machine Learning algorithms – understand when they work and fail, develop new ones!
Project ideas are online; discuss with the TAs – every project must have a TA mentor.

  • Proposal (10%): Oct 11
  • Mid-term report (25%): Nov 8
  • Poster presentation (20%): Dec 2, 3-6 pm, NSH Atrium
  • Final Project report (45%): Dec 6

2

slide-3
SLIDE 3

Project Proposal

  • Proposal (10%): Oct 11

– 1 page maximum
– Describe the data set
– Project idea (approx. two paragraphs)
– Software you will need to write
– 1-3 relevant papers. Read at least one before submitting your proposal.
– Teammate: maximum team size is 2. Describe the division of work.
– Project milestone for the mid-term report? Include experimental results.

3

slide-4
SLIDE 4

Recitation Tomorrow!

4

  • Linear & Non-linear Regression, Nonparametric methods

  • Strongly recommended!!
  • Place: NSH 1507 (Note)
  • Time: 5-6 pm

TK

slide-5
SLIDE 5

Non-parametric methods

Kernel density estimate, kNN classifier, kernel regression

Aarti Singh

Machine Learning 10-701/15-781 Sept 29, 2010

slide-6
SLIDE 6

Parametric methods

  • Assume some functional form (Gaussian, Bernoulli, Multinomial, logistic, Linear) for

– P(Xi|Y) and P(Y) as in Naïve Bayes
– P(Y|X) as in Logistic regression

  • Estimate parameters (μ, σ², θ, w, β) using MLE/MAP and plug in
  • Pro – need few data points to learn parameters
  • Con – strong distributional assumptions, not satisfied in practice

6
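A minimal sketch of the parametric plug-in idea, using Gaussian class-conditionals (a Gaussian Naïve Bayes style classifier); the function names and toy setup are illustrative, not from the slides.

```python
import numpy as np

# Sketch: parametric plug-in classifier with Gaussian class-conditionals.
# Fit parameters by MLE, then plug them into Bayes rule.

def fit_gaussian_nb(X, y):
    """MLE of class priors and per-class, per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict(params, x):
    """Plug the estimated parameters into Bayes rule; pick the most probable class."""
    def log_post(p):
        return (np.log(p["prior"])
                - 0.5 * np.sum(np.log(2 * np.pi * p["var"])
                               + (x - p["mean"]) ** 2 / p["var"]))
    return max(params, key=lambda c: log_post(params[c]))
```

If the Gaussian assumption does not hold for the data, this plug-in estimate can be badly biased – which is the "Con" above and the motivation for the nonparametric methods that follow.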

slide-7
SLIDE 7

Example

7

Hand-written digit images (labeled 1–9 in the figure) projected as points onto a two-dimensional (nonlinear) feature space.

slide-8
SLIDE 8

Non-Parametric methods

  • Typically don’t make any distributional assumptions
  • As we have more data, we should be able to learn more complex models
  • Let the number of parameters scale with the number of training data points
  • Today, we will see some nonparametric methods for

– Density estimation
– Classification
– Regression

8

slide-9
SLIDE 9

Histogram density estimate

9

Partition the feature space into distinct bins with widths Δi and count the number of observations, ni, in each bin.

  • Often, the same width is used for all bins, Δi = Δ.
  • Δ acts as a smoothing parameter.

Image src: Bishop book
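A minimal sketch of the (1-D) histogram density estimate, assuming equal-width bins and the standard estimate p̂(x) = ni / (n·Δ); the variable names and toy data are illustrative.

```python
import numpy as np

# Sketch of a 1-D histogram density estimate p_hat(x) = n_i / (n * delta),
# with equal-width bins over [lo, hi].

def histogram_density(data, lo, hi, num_bins):
    delta = (hi - lo) / num_bins
    counts, _ = np.histogram(data, bins=num_bins, range=(lo, hi))
    def p_hat(x):
        i = min(int((x - lo) / delta), num_bins - 1)   # bin index containing x
        return counts[i] / (len(data) * delta)
    return p_hat

# Example: 1000 samples from a standard normal
rng = np.random.default_rng(0)
p_hat = histogram_density(rng.normal(size=1000), lo=-4, hi=4, num_bins=20)
print(p_hat(0.0))   # should be close to 1/sqrt(2*pi) ~ 0.40
```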

slide-10
SLIDE 10

Effect of histogram bin width

10

# bins = 1/Δ

Assuming the density is roughly constant in each bin (holds true if Δ is small):

Bias of histogram density estimate: E[p̂(x)] − p(x), where E[p̂(x)] = (1/Δ) ∫_bin p(t) dt ≈ p(x) when Δ is small.

slide-11
SLIDE 11

Bias – Variance tradeoff

  • Choice of #bins
  • Bias – how close is the mean of estimate to the truth
  • Variance – how much does the estimate vary around mean

– Small Δ, large #bins: “small bias, large variance” (p(x) approx constant per bin)
– Large Δ, small #bins: “large bias, small variance” (more data per bin, stable estimate)
Bias-variance tradeoff

# bins = 1/Δ

11

slide-12
SLIDE 12

Choice of #bins

12

Image src: Bishop book

# bins = 1/Δ

Image src: Larry book

For fixed n: as Δ decreases, ni decreases. MSE = Bias² + Variance.

slide-13
SLIDE 13

Histogram as MLE

  • Class of density estimates – piecewise constant on each bin

Parameters pj = density in bin j. Note that since the density must integrate to 1, Σj pj Δj = 1.

  • Maximize the likelihood of the data under the probability model with parameters pj
  • Show that the histogram density estimate is the MLE under this model – HW/Recitation

13
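A sketch of the likelihood setup under this piecewise-constant model (the maximization itself is left to the HW/recitation); the notation nj for the count of points in bin j is an assumption.

```latex
% Piecewise-constant model: p(x) = p_j for x in bin j, with \sum_j p_j \Delta_j = 1.
% With n_j points falling in bin j, the log-likelihood of the data is
\log L(p_1,\dots,p_m)
  \;=\; \sum_{i=1}^{n} \log p(x_i)
  \;=\; \sum_{j=1}^{m} n_j \log p_j ,
\qquad \text{subject to } \sum_{j=1}^{m} p_j \, \Delta_j = 1 .
% Maximizing this constrained likelihood (e.g. with a Lagrange multiplier)
% recovers the histogram density estimate -- details in the HW/recitation.
```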

slide-14
SLIDE 14
Kernel density estimate

  • Histogram – blocky estimate
  • Kernel density estimate, aka “Parzen/moving window method”

14


slide-15
SLIDE 15
Kernel density estimate

  • More generally, place a kernel at each training point:

p̂(x) = (1/n) Σi (1/h) K((x − Xi)/h)

15

slide-16
SLIDE 16

Kernel density estimation

16

Gaussian bumps (red) around six data points and their sum (blue)

  • Place small "bumps" at each data point, determined by the

kernel function.

  • The estimator consists of a (normalized) "sum of bumps”.
  • Note that where the points are denser the density estimate

will have higher values.

Img src: Wikipedia
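A minimal sketch of a Gaussian kernel density estimate p̂(x) = (1/n) Σi (1/h) K((x − Xi)/h); the bandwidth and the six data points are illustrative.

```python
import numpy as np

# Sketch of a 1-D kernel density estimate with a Gaussian kernel:
# p_hat(x) = (1/n) * sum_i (1/h) * K((x - X_i) / h)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(data, h):
    data = np.asarray(data, float)
    def p_hat(x):
        # one Gaussian "bump" per data point, summed and normalized by n*h
        return gaussian_kernel((x - data) / h).sum() / (len(data) * h)
    return p_hat

# Example: six data points, as in the figure above (values are made up)
p_hat = kde([1.0, 1.3, 1.9, 2.4, 3.1, 3.2], h=0.4)
print(p_hat(2.0))
```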

slide-17
SLIDE 17

Kernels

17

Any kernel function that satisfies K(x) ≥ 0 and ∫ K(x) dx = 1 can be used (symmetric kernels, with ∫ x K(x) dx = 0, are typical).

slide-18
SLIDE 18

Kernels

18

Finite support
– only need local points to compute the estimate

Infinite support
– need all points to compute the estimate
– but quite popular since smoother (10-702)

slide-19
SLIDE 19

Choice of kernel bandwidth

19

Image source: Larry’s book – All of Nonparametric Statistics

Bart-Simpson density estimated with bandwidth too small, too large, and just right.

slide-20
SLIDE 20

Histograms vs. Kernel density estimation

20

Bin width Δ (histograms) and bandwidth h (kernel density estimate) act as smoothing parameters.

slide-21
SLIDE 21

Bias-variance tradeoff

  • Simulations

21
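A sketch of what such a simulation might look like: repeatedly sample from a known density, evaluate the kernel density estimate at a fixed point, and compare the average error (bias) and spread (variance) for small, moderate, and large bandwidths. The sample sizes and bandwidth values are illustrative.

```python
import numpy as np

# Bias-variance simulation sketch for the KDE bandwidth.
rng = np.random.default_rng(0)
x0, true_p = 0.0, 1.0 / np.sqrt(2 * np.pi)      # N(0,1) density at x0 = 0

def kde_at(data, x, h):
    u = (x - data) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum() / (len(data) * h)

for h in (0.02, 0.3, 3.0):                       # too small, moderate, too large
    estimates = [kde_at(rng.normal(size=200), x0, h) for _ in range(500)]
    bias = np.mean(estimates) - true_p
    var = np.var(estimates)
    print(f"h={h}: bias={bias:+.3f}, variance={var:.4f}")
```

Small h should show little bias but large variance across repetitions; large h the opposite.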

slide-22
SLIDE 22

k-NN (Nearest Neighbor) density estimation

  • Histogram, kernel density estimate: fix Δ (or h), estimate the number of points within Δ of x (ni or nx) from the data.
  • k-NN density estimate: fix nx = k, estimate Δ from the data (volume of the ball around x that contains the k nearest training points).

22

slide-23
SLIDE 23

k-NN density estimation

23

k acts as a smoother.

Not very popular for density estimation – expensive to compute and gives poor estimates. But a related version for classification is quite popular …
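A minimal sketch, assuming the standard k-NN density estimate p̂(x) = k / (n·V), where V is the volume of the smallest ball around x containing k training points (in 1-D, an interval of length twice the distance to the k-th nearest neighbor). Names and values are illustrative.

```python
import numpy as np

# Sketch of a 1-D k-NN density estimate: p_hat(x) = k / (n * V).

def knn_density(data, k):
    data = np.asarray(data, float)
    def p_hat(x):
        d_k = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th nearest neighbor
        volume = 2.0 * d_k                        # 1-D "ball" = interval of length 2*d_k
        return k / (len(data) * volume)
    return p_hat

rng = np.random.default_rng(0)
p_hat = knn_density(rng.normal(size=500), k=20)
print(p_hat(0.0))   # roughly 1/sqrt(2*pi) ~ 0.40
```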

slide-24
SLIDE 24

From Density estimation to Classification

24

slide-25
SLIDE 25

k-NN classifier

25

Sports Science Arts

slide-26
SLIDE 26

k-NN classifier

26

Sports Science Arts

Test document

slide-27
SLIDE 27

k-NN classifier (k=4)

27

Sports Science Arts

Test document. What should we predict? … Average? Majority? Why?

Dk,x

slide-28
SLIDE 28

k-NN classifier

28

  • Optimal classifier: predict the class y maximizing P(Y = y) p(X = x | Y = y).
  • k-NN classifier: approximate p(x | Y = y) ≈ ky/(ny·V) and P(Y = y) ≈ ny/n, where ny = # total training pts of class y and ky = # training pts of class y that lie within the Dk ball of volume V. The maximizing class is the one with the largest ky – a majority vote among the k neighbors.
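A minimal sketch of the majority-vote k-NN classifier with Euclidean distance; the toy training points and labels are illustrative.

```python
import numpy as np
from collections import Counter

# Sketch of a majority-vote k-NN classifier with Euclidean distance.

def knn_classify(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every training point
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)    # count labels among the neighbors
    return votes.most_common(1)[0][0]               # majority vote

# Toy example with three classes (sports / science / arts style)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9], [0.0, 1.0]])
y_train = np.array(["sports", "sports", "science", "science", "arts"])
print(knn_classify(X_train, y_train, np.array([0.9, 1.1]), k=3))   # -> "science"
```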

slide-29
SLIDE 29

1-Nearest Neighbor (kNN) classifier

Sports Science Arts

29

slide-30
SLIDE 30

2-Nearest Neighbor (kNN) classifier

Sports Science Arts

30

An even K is not used in practice (to avoid ties).

slide-31
SLIDE 31

3-Nearest Neighbor (kNN) classifier

Sports Science Arts

31

slide-32
SLIDE 32

5-Nearest Neighbor (kNN) classifier

Sports Science Arts

32

slide-33
SLIDE 33

What is the best K?

33

Bias-variance tradeoff:
– Larger K => predicted label is more stable
– Smaller K => predicted label is more accurate
Similar to density estimation. Choice of K – in next class …

slide-34
SLIDE 34

1-NN classifier – decision boundary

34

K = 1

Voronoi Diagram

slide-35
SLIDE 35

k-NN classifier – decision boundary

35

  • K acts as a smoother (Bias-variance tradeoff)
  • Guarantee: as the number of training points n → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal (Bayes) error.

slide-36
SLIDE 36

Case Study: kNN for Web Classification

  • Dataset

– 20 News Groups (20 classes)
– Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
– 61,118 words, 18,774 documents
– Class labels descriptions

37

slide-37
SLIDE 37

Experimental Setup

  • Training/Test Sets:

– 50%-50% randomly split
– 10 runs – report average results

  • Evaluation criteria: classification accuracy (fraction of test documents correctly classified)

38

slide-38
SLIDE 38

Results: Binary Classes

Accuracy vs. k for three binary tasks: alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles.

39

slide-39
SLIDE 39

From Classification to Regression

40

slide-40
SLIDE 40

Temperature sensing

  • What is the temperature in the room? Average of all sensor readings.
  • What is the temperature at location x? A “local” average of nearby sensors.

41

slide-41
SLIDE 41

Kernel Regression

  • Aka Local Regression
  • Nadaraya-Watson kernel estimator:

f̂(x) = Σi wi(x) Yi,   where   wi(x) = K((x − Xi)/h) / Σj K((x − Xj)/h)

  • Weight each training point based on its distance to the test point
  • Boxcar kernel yields the local average

42
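A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (a boxcar kernel would give the plain local average); the bandwidth and toy data are illustrative.

```python
import numpy as np

# Sketch of Nadaraya-Watson kernel regression:
# f_hat(x) = sum_i w_i(x) * Y_i,  w_i(x) = K((x - X_i)/h) / sum_j K((x - X_j)/h)

def kernel_regression(X, Y, h):
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    def f_hat(x):
        weights = np.exp(-0.5 * ((x - X) / h) ** 2)    # unnormalized Gaussian weights
        return np.dot(weights, Y) / weights.sum()       # weighted average of the Y's
    return f_hat

# Toy example: noisy sine curve
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 50)
Y = np.sin(X) + 0.1 * rng.normal(size=50)
f_hat = kernel_regression(X, Y, h=0.3)
print(f_hat(np.pi / 2))   # should be close to sin(pi/2) = 1
```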

slide-42
SLIDE 42

Kernels

43

slide-43
SLIDE 43

Choice of kernel bandwidth h

44

Image source: Larry’s book – All of Nonparametric Statistics

Panels show bandwidths h = 1, 10, 50, 200, ranging from too small to too large. The choice of kernel is not that important.

slide-44
SLIDE 44

Kernel Regression as Weighted Least Squares

45

Weighted least squares: minimize over b the objective Σi K((x − Xi)/h) (Yi − b)²

Kernel regression corresponds to a locally constant estimator obtained from (locally) weighted least squares, i.e. set f(Xi) = b (a constant).

slide-45
SLIDE 45

Kernel Regression as Weighted Least Squares

46

Setting f(Xi) = b (a constant), notice that minimizing the weighted least squares objective over b recovers exactly the Nadaraya-Watson kernel estimator (see the derivation sketch below).
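A short derivation sketch of the "notice that" step: setting the derivative of the locally weighted least-squares objective to zero yields the Nadaraya-Watson weights.

```latex
% Locally constant fit: minimize over b the weighted squared error at a query point x
J(b) = \sum_{i=1}^{n} K\!\left(\tfrac{x - X_i}{h}\right) (Y_i - b)^2
% Setting dJ/db = 0:
\frac{dJ}{db} = -2 \sum_{i} K\!\left(\tfrac{x - X_i}{h}\right)(Y_i - b) = 0
\;\Longrightarrow\;
\hat{b} = \frac{\sum_i K\!\left(\tfrac{x - X_i}{h}\right) Y_i}{\sum_j K\!\left(\tfrac{x - X_j}{h}\right)} ,
% which is exactly the Nadaraya-Watson estimator from the earlier slide.
```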

slide-46
SLIDE 46

Local Linear/Polynomial Regression

47

Weighted least squares: minimize over b0, …, bp the objective Σi K((x − Xi)/h) (Yi − b0 − b1(Xi − x) − … − bp(Xi − x)^p)²

Local polynomial regression corresponds to a locally polynomial estimator obtained from (locally) weighted least squares, i.e. set f(Xi) = b0 + b1(Xi − x) + … + bp(Xi − x)^p (a local polynomial of degree p around x).

More in HW, 10-702 (statistical machine learning)
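A minimal sketch of the degree-1 case (local linear regression), solving the weighted least squares problem at each query point; the variable names, bandwidth, and toy data are illustrative.

```python
import numpy as np

# Sketch of local linear regression (degree p = 1): at a query point x, fit Y on
# (1, X_i - x) by weighted least squares with kernel weights; the fitted intercept
# b0 is the estimate f_hat(x).

def local_linear(X, Y, h):
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    def f_hat(x):
        w = np.exp(-0.5 * ((x - X) / h) ** 2)            # Gaussian kernel weights
        A = np.column_stack([np.ones_like(X), X - x])    # local design matrix [1, X_i - x]
        W = np.diag(w)
        b = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)    # solve (A^T W A) b = A^T W Y
        return b[0]                                       # intercept = estimate at x
    return f_hat

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 50)
Y = np.sin(X) + 0.1 * rng.normal(size=50)
print(local_linear(X, Y, h=0.5)(np.pi / 2))   # close to 1
```

Compared with the locally constant (Nadaraya-Watson) fit, the local linear fit typically has less bias near the boundaries of the data.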

slide-47
SLIDE 47

Summary

  • Instance based/non-parametric approaches

48

Four things make a memory-based learner:
1. A distance metric, dist(x, Xi) – Euclidean (and many more)
2. How many nearby neighbors / what radius to look at? k, Δ/h
3. A weighting function (optional) – W based on kernel K
4. How to fit with the local points? Average, majority vote, weighted average, polynomial fit

slide-48
SLIDE 48

Summary

  • Parametric vs Nonparametric approaches

49

  • Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data. Parametric models rely on very strong (simplistic) distributional assumptions.
  • Nonparametric models (not histograms) require storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.

slide-49
SLIDE 49

What you should know…

  • Histograms, kernel density estimation

– Effect of bin width / kernel bandwidth
– Bias-variance tradeoff

  • k-NN classifier

– Nonlinear decision boundaries

  • Kernel (local) regression

– Interpretation as weighted least squares
– Local constant/linear/polynomial regression

50