
Lecture 3: Kernel Regression, Curse of Dimensionality (Aykut Erdem)



  1. Lecture 3: − Kernel Regression − Curse of Dimensionality Aykut Erdem February 2016 Hacettepe University

  2. Administrative
     • Assignment 1 will be out on Thursday
     • It is due March 4 (i.e. in two weeks)
     • It includes:
       − Pencil-and-paper derivations
       − Implementing a kNN classifier
       − Implementing linear regression
       − numpy/Python code
     • Note: Lecture slides are not enough; you should also read the related book chapters!

  3. Recall from last time… Nearest Neighbors
     • Very simple method
     • Retain all training data
       − It can be slow at test time
       − Finding nearest neighbors in high dimensions is slow
     • Metrics are very important
     • Good baseline
     adapted from Fei-Fei Li & Andrej Karpathy & Justin Johnson

  4. Classification
     • Input: X
       − Real-valued, vectors over the reals
       − Discrete values (0, 1, 2, …)
       − Other structures (e.g., strings, graphs, etc.)
     • Output: Y
       − Discrete (0, 1, 2, …)
     • Examples: X = Document, Y = Topic (Sports, Science, News); X = Cell Image, Y = Diagnosis (Anemic cell, Healthy cell)
     slide by Aarti Singh and Barnabas Poczos

  5. Regression
     • Input: X
       − Real-valued, vectors over the reals
       − Discrete values (0, 1, 2, …)
       − Other structures (e.g., strings, graphs, etc.)
     • Output: Y
       − Real-valued, vectors over the reals
     • Example: stock market prediction (X = Feb 01, Y = ?)
     slide by Aarti Singh and Barnabas Poczos

  6. What should I watch tonight? slide by Sanja Fidler 6

  7. What should I watch tonight? slide by Sanja Fidler 7

  8. What should I watch tonight? slide by Sanja Fidler 8

  9. Today
     • Kernel regression
       − nonparametric
     • Distances
     • Next: Linear regression
       − parametric
       − simple model

  10. Simple 1-D Regression
      • Circles are data points (i.e., training examples) that are given to us
      • The data points are uniform in x, but may be displaced in y:
        t(x) = f(x) + ε,  with ε some noise
      • In green is the “true” curve that we don’t know
      slide by Sanja Fidler
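
A minimal numpy sketch of this data-generating process; the particular curve f and the noise level are assumptions for illustration, not taken from the slide:

    import numpy as np

    # Hypothetical "true" curve f(x); the actual green curve on the slide is not specified.
    def f(x):
        return np.sin(2 * np.pi * x)

    rng = np.random.default_rng(0)
    N = 10
    x = np.linspace(0.0, 1.0, N)            # data points uniform in x
    eps = rng.normal(scale=0.2, size=N)     # noise term epsilon
    t = f(x) + eps                          # observed targets: t(x) = f(x) + eps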

  11. Kernel Regression 11

  12. 1-NN for Regression
      [Figure: a 1-D dataset plotted over axes x and y; for each query point, the closest training datapoint determines the prediction]
      slide by Dhruv Batra, Figure Credit: Carlos Guestrin

  13. 1-NN for Regression • Often bumpy (overfits). slide by Dhruv Batra, Figure Credit: Andrew Moore

  14. 9-NN for Regression • Smoother than 1-NN (averaging over 9 neighbors reduces the overfitting), though it can over-smooth. slide by Dhruv Batra, Figure Credit: Andrew Moore
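
To make the bumpy-versus-smooth contrast concrete, here is a small unweighted k-NN regression sketch in numpy (the toy data and function names are illustrative, not from the slides): k = 1 reproduces the jagged 1-NN fit, while k = 9 averages away much of the noise.

    import numpy as np

    def knn_regress(x_train, y_train, x_query, k):
        """Predict y at x_query as the mean target of the k nearest training points (1-D inputs)."""
        nearest = np.argsort(np.abs(x_train - x_query))[:k]
        return y_train[nearest].mean()

    # Toy data: noisy samples of an assumed sine curve.
    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 1.0, 30)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=30)

    x_grid = np.linspace(0.0, 1.0, 200)
    fit_1nn = [knn_regress(x_train, y_train, xq, k=1) for xq in x_grid]  # bumpy, overfits
    fit_9nn = [knn_regress(x_train, y_train, xq, k=9) for xq in x_grid]  # smoother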

  15. Weighted K-NN for Regression
      • Given: training data ((x_1, y_1), …, (x_n, y_n))
        − Attribute vectors: x_i ∈ X
        − Target attribute: y_i ∈ ℝ
      • Parameters:
        − Similarity function: K : X × X → ℝ
        − Number of nearest neighbors to consider: k
      • Prediction rule:
        − New example x′
        − K-nearest neighbors: the k training examples with largest K(x_i, x′)
      slide by Thorsten Joachims
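
A sketch of this prediction rule in numpy, combined with the weighted-average fit spelled out on slide 22; the Gaussian choice for the similarity function K and the σ value are assumptions (the slide leaves K abstract):

    import numpy as np

    def gaussian_similarity(x_i, x_query, sigma=0.1):
        """One possible similarity function K(x_i, x'); higher for closer points."""
        return np.exp(-np.sum((x_i - x_query) ** 2) / sigma ** 2)

    def weighted_knn_predict(X_train, y_train, x_query, k, K=gaussian_similarity):
        """Weighted k-NN regression: similarity-weighted average over the k most similar training points."""
        sims = np.array([K(x_i, x_query) for x_i in X_train])
        top_k = np.argsort(sims)[-k:]                   # the k training examples with largest K(x_i, x')
        w = sims[top_k]
        return np.sum(w * y_train[top_k]) / np.sum(w)   # weighted average of their targets

    X = np.array([[0.0], [0.5], [1.0]])
    y = np.array([0.0, 1.0, 0.0])
    print(weighted_knn_predict(X, y, np.array([0.4]), k=2))  # ~1.0, dominated by the neighbor at x = 0.5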

  16. Multivariate distance metrics
      • Suppose the input vectors x_1, x_2, …, x_N are two-dimensional:
        x_1 = (x_11, x_12), x_2 = (x_21, x_22), …, x_N = (x_N1, x_N2).
      • One can draw the nearest-neighbor regions in input space:
        Dist(x_i, x_j) = (x_i1 − x_j1)² + (x_i2 − x_j2)²
        Dist(x_i, x_j) = (x_i1 − x_j1)² + (3x_i2 − 3x_j2)²
      • The relative scalings in the distance metric affect region shapes.
      slide by Dhruv Batra, Slide Credit: Carlos Guestrin

  17. Example: Choosing a restaurant
      • In everyday life we need to make decisions by taking into account lots of factors.
      • The question is what weight we put on each of these factors (how important are they with respect to the others); this depends on individuals’ preferences.

        Reviews (out of 5 stars) | $  | Distance | Cuisine (out of 10)
        4                        | 30 | 21       | 7
        2                        | 15 | 12       | 8
        5                        | 27 | 53       | 9
        3                        | 20 | 5        | 6
        ?

      slide by Richard Zemel

  18. Euclidean distance metric
      D(x, x′) = sqrt( Σ_i a_i (x_i − x′_i)² )
      Or equivalently,
      D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) )
      where A is a diagonal matrix whose entries are the per-dimension weights a_i.
      slide by Dhruv Batra, Slide Credit: Carlos Guestrin
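
A small numpy sketch checking the equivalence of the two forms above (the example weights are made up): weighting dimension i by a_i in the sum is the same as using a diagonal A, and scaling a coordinate by 3, as in the previous slide, corresponds to a weight of 9.

    import numpy as np

    def scaled_euclidean(x, x_prime, a):
        """Scaled Euclidean distance: sqrt(sum_i a_i * (x_i - x'_i)^2)."""
        d = x - x_prime
        return np.sqrt(np.sum(a * d ** 2))

    def quadratic_form_distance(x, x_prime, A):
        """Equivalent matrix form: sqrt((x - x')^T A (x - x')); a diagonal A recovers scaled_euclidean."""
        d = x - x_prime
        return np.sqrt(d @ A @ d)

    x, xp = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    a = np.array([1.0, 9.0])   # weight the 2nd dimension by 9, i.e. scale that coordinate by 3
    assert np.isclose(scaled_euclidean(x, xp, a), quadratic_form_distance(x, xp, np.diag(a)))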

  19. Notable distance metrics (and their level sets)
      [Figure: level sets of the Scaled Euclidean (L2) distance and of the Mahalanobis distance (non-diagonal A)]
      slide by Dhruv Batra, Slide Credit: Carlos Guestrin

  20. Minkowski distance
      D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)
      slide by Dhruv Batra
      Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
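
A quick numpy illustration of the Minkowski distance for a few values of p (the points are chosen for illustration): p = 1 gives the L1 norm, p = 2 the Euclidean distance, and large p approaches the max norm shown on the next slide.

    import numpy as np

    def minkowski(x, y, p):
        """Minkowski distance: D = (sum_i |x_i - y_i|^p) ** (1/p)."""
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
    print(minkowski(x, y, 1))     # 7.0  -> L1 (absolute) norm
    print(minkowski(x, y, 2))     # 5.0  -> L2 (Euclidean) norm
    print(minkowski(x, y, 100))   # ~4.0 -> approaches the L_inf (max) norm as p grows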

  21. Notable distance metrics (and their level sets)
      [Figure: level sets of the Scaled Euclidean (L2) distance, the L1 (absolute) norm, and the L_inf (max) norm]
      slide by Dhruv Batra, Slide Credit: Carlos Guestrin

  22. Kernel Regression/Classification
      Four things make a memory-based learner:
      • A distance metric
        − Euclidean (and others)
      • How many nearby neighbors to look at?
        − All of them
      • A weighting function (optional)
        − w_i = exp(−d(x_i, query)² / σ²)
        − Nearby points to the query are weighted strongly, far points weakly.
        − The σ parameter is the kernel width. Very important.
      • How to fit with the local points?
        − Predict the weighted average of the outputs: predict = Σ_i w_i y_i / Σ_i w_i
      slide by Dhruv Batra, Slide Credit: Carlos Guestrin
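
A minimal numpy sketch of this recipe (Gaussian weighting over all training points, weighted-average fit); the function name and array shapes are illustrative assumptions:

    import numpy as np

    def kernel_regress(X_train, y_train, x_query, sigma):
        """Kernel regression as described above: weight every training point by a Gaussian
        of its distance to the query, then predict the weighted average of the outputs."""
        d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared Euclidean distance d(x_i, query)^2
        w = np.exp(-d2 / sigma ** 2)                    # w_i = exp(-d(x_i, query)^2 / sigma^2)
        return np.sum(w * y_train) / np.sum(w)          # predict = sum_i w_i y_i / sum_i w_i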

  23. Weighting/Kernel functions: w_i = exp(−d(x_i, query)² / σ²)  (our examples use Gaussian)
      slide by Dhruv Batra, Slide Credit: Carlos Guestrin

  24. Effect of Kernel Width
      • What happens as σ → ∞?
      • What happens as σ → 0?
      slide by Dhruv Batra, Image Credit: Ben Taskar
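
A tiny numeric check of both limits using the Gaussian weighting from the previous slides (the toy numbers are made up): with a very small σ only the nearest point gets non-negligible weight, so the fit approaches 1-NN; with a very large σ all weights become nearly equal, so the fit approaches the global average.

    import numpy as np

    x_train = np.array([0.0, 0.5, 1.0])
    y_train = np.array([0.0, 10.0, 20.0])
    x_query = 0.1

    def kernel_predict(sigma):
        w = np.exp(-(x_train - x_query) ** 2 / sigma ** 2)   # w_i = exp(-d^2 / sigma^2)
        return np.sum(w * y_train) / np.sum(w)

    print(kernel_predict(0.05))    # ~0.0  : tiny sigma -> only the nearest point (x = 0.0) matters
    print(kernel_predict(100.0))   # ~10.0 : huge sigma -> all weights ~equal, prediction ~ mean(y_train)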

  25. Problems with Instance-Based Learning
      • Expensive
        − No learning: most of the real work is done during testing
        − For every test sample, we must search through the entire dataset – very slow!
        − Must use tricks like approximate nearest neighbour search
      • Doesn’t work well when there are a large number of irrelevant features
        − Distances get overwhelmed by noisy features
      • Curse of Dimensionality
        − Distances become meaningless in high dimensions
      slide by Dhruv Batra

  26. Curse of Dimensionality
      • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube.
      • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points.
      • The expected edge length of this cube will be e_D(f) = f^(1/D).
      • If D = 10 and we want to base our estimate on 10% of the data, we have e_10(0.1) = 0.8, so we need to extend the cube 80% along each dimension around x.
      • Even if we only use 1% of the data, we find e_10(0.01) = 0.63 — no longer very local.
      [Figure: edge length of cube vs. fraction of data in neighborhood, for d = 1, 3, 5, 7, 10]
      slide by Kevin Murphy
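
The edge-length formula is easy to check numerically; a short Python sketch reproducing the numbers quoted above:

    def edge_length(f, D):
        """Expected edge length of the hyper-cube containing a fraction f of the data: e_D(f) = f**(1/D)."""
        return f ** (1.0 / D)

    print(round(edge_length(0.10, 10), 2))  # 0.79 -> ~0.8: 80% of each dimension just to see 10% of the data
    print(round(edge_length(0.01, 10), 2))  # 0.63 -> even 1% of the data is no longer local
    print(round(edge_length(0.10, 1), 2))   # 0.1  -> in 1-D, a 10% neighborhood really is local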

  27. Next Lecture: Linear Regression 27
