
BBM406 Fundamentals of Machine Learning Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality - PowerPoint PPT Presentation



  1. photo:@rewardyfahmi // Unsplash BBM406 Fundamentals of Machine Learning Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality Aykut Erdem // Hacettepe University // Fall 2019

  2. Administrative • Assignment 1 will be out Friday! • It is due November 1 (i.e. in two weeks). • It includes − Pencil-and-paper derivations − Implementing kernel regression − numpy/Python code 2

  3. Movie Recommendation System • MovieLens dataset (100K ratings of 9K movies by 600 users) • You may want to split the training set into train and validation (more on this next week) • The data consists of four tables: − Ratings: userId, movieId, rating, timestamp − Movies: movieId, title, genre − Links: movieId, imdbId, and tmdbId − Tags: userId, movieId, tag, timestamp 3
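For concreteness, here is a minimal pandas sketch of loading these tables and holding out a validation split. The directory and file names (ml-latest-small/ratings.csv, movies.csv, links.csv, tags.csv) and the 80/20 split are illustrative assumptions, not part of the assignment; adjust them to however the data is shipped.

    import pandas as pd

    # Load the MovieLens tables (paths assume the ml-latest-small CSV layout).
    ratings = pd.read_csv("ml-latest-small/ratings.csv")  # userId, movieId, rating, timestamp
    movies  = pd.read_csv("ml-latest-small/movies.csv")   # movieId, title, genre(s)
    links   = pd.read_csv("ml-latest-small/links.csv")    # movieId, imdbId, tmdbId
    tags    = pd.read_csv("ml-latest-small/tags.csv")     # userId, movieId, tag, timestamp

    # Hold out ~20% of the ratings as a validation set (more on splits next week).
    val   = ratings.sample(frac=0.2, random_state=0)
    train = ratings.drop(val.index)
    print(len(train), "train ratings /", len(val), "validation ratings")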

  4. Recall from last time… Nearest Neighbors • Very simple method • Retain all training data − It can be slow in testing − Finding NN in high dimensions is slow • Metrics are very important • Good baseline adopted from Fei-Fei Li & Andrej Karpathy & Justin Johnson 4

  5. Classification • Input: X - Real valued, vectors over real. - Discrete values (0,1,2,…) - Other structures (e.g., strings, graphs, etc.) • Output: Y - Discrete (0,1,2,...) Examples: X = Document, Y = Topic (Sports, Science, News); X = Cell Image, Y = Diagnosis (Anemic cell, Healthy cell) slide by Aarti Singh and Barnabas Poczos 5

  6. Regression • Input: X - Real valued, vectors over real. - Discrete values (0,1,2,…) - Other structures (e.g., strings, graphs, etc.) • Output: Y - Real valued, vectors over real. Example: Stock market prediction, X = Feb01, Y = ? slide by Aarti Singh and Barnabas Poczos 6

  7. What should I watch tonight? slide by Sanja Fidler 7

  8. What should I watch tonight? slide by Sanja Fidler 8

  9. What should I watch tonight? slide by Sanja Fidler 9

  10. Today • Kernel regression − nonparametric • Distance metrics • Linear regression (more on Friday) − parametric − simple model 10

  11. Simple 1-D Regression • Circles are data points (i.e., training examples) that are given to us • The data points are uniform in x, but may be displaced in y: t(x) = f(x) + ε, with ε some noise • In green is the “true” curve that we don’t know slide by Sanja Fidler 11
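A tiny numpy sketch of this data-generating story, t(x) = f(x) + ε. The particular “true” curve (a sine) and the noise level are illustrative assumptions; in the regression setting, f is exactly what we never get to see.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # An assumed "true" curve, unknown to the learner in practice.
        return np.sin(2 * np.pi * x)

    x = np.linspace(0.0, 1.0, 10)                    # inputs uniform in x
    t = f(x) + rng.normal(scale=0.2, size=x.shape)   # targets displaced in y by noise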

  12. Kernel Regression 12

  13. K-NN for Regression
  • Given: Training data {(x_1, y_1), …, (x_n, y_n)}
    − Attribute vectors: x_i ∈ X
    − Target attribute: y_i ∈ ℝ
  • Parameter:
    − Similarity function: K : X × X → ℝ
    − Number of nearest neighbors to consider: k
  • Prediction rule
    − New example x′
    − K-nearest neighbors: the k training examples with largest K(x_i, x′)
    − Prediction: h(x′) = (1/k) Σ_{i ∈ knn(x′)} y_i
  slide by Thorsten Joachims 13
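A minimal numpy sketch of this prediction rule: pick the k training points most similar to x′ and average their targets. The slide leaves the similarity K abstract; negative Euclidean distance is used here purely as one reasonable choice, and the toy data are made up.

    import numpy as np

    def knn_regress(X_train, y_train, x_query, k=3):
        # Similarity K(x_i, x') taken as negative Euclidean distance (larger = more similar).
        sims = -np.linalg.norm(X_train - x_query, axis=1)
        nn = np.argsort(sims)[-k:]        # indices of the k most similar training points
        return y_train[nn].mean()         # h(x') = (1/k) * sum of their y_i

    # Toy usage
    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.1, 0.9, 2.1, 2.9])
    print(knn_regress(X_train, y_train, np.array([1.4]), k=2))  # averages y at x=1 and x=2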

  14. 1-NN for Regression [Figure: 1-NN regression fit on (x, y) data; each prediction comes from the closest datapoint] slide by Dhruv Batra 14 Figure Credit: Carlos Guestrin

  15. 1-NN for Regression • Often bumpy (overfits) slide by Dhruv Batra 15 Figure Credit: Andrew Moore

  16. 9-NN for Regression • Often bumpy (overfits) slide by Dhruv Batra 16 Figure Credit: Andrew Moore

  17. Multivariate distance metrics • Suppose the input vectors x_1, x_2, …, x_N are two dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), …, x_N = (x_N1, x_N2). • One can draw the nearest-neighbor regions in input space, e.g. under Dist(x_i, x_j) = (x_i1 − x_j1)² + (x_i2 − x_j2)² versus Dist(x_i, x_j) = (x_i1 − x_j1)² + (3x_i2 − 3x_j2)². • The relative scalings in the distance metric affect region shapes. slide by Dhruv Batra 17 Slide Credit: Carlos Guestrin
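A short sketch of the two distance functions above, showing how rescaling the second coordinate by 3 can change which point is nearest to a query. The specific points are made up for illustration.

    import numpy as np

    def dist_plain(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def dist_scaled(a, b):
        return (a[0] - b[0]) ** 2 + (3 * a[1] - 3 * b[1]) ** 2

    q  = np.array([0.0, 0.0])
    p1 = np.array([2.0, 0.0])
    p2 = np.array([0.0, 1.5])
    print(dist_plain(q, p1), dist_plain(q, p2))    # 4.0 vs 2.25 -> p2 is the nearer point
    print(dist_scaled(q, p1), dist_scaled(q, p2))  # 4.0 vs 20.25 -> p1 is the nearer point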

  18. Example: Choosing a restaurant
  • In everyday life we need to make decisions by taking into account lots of factors.
  • The question is what weight we put on each of these factors (how important are they with respect to the others); this depends on individuals’ preferences.

    Reviews (out of 5 stars)   $    Distance   Cuisine (out of 10)
    4                          30   21         7
    2                          15   12         8
    5                          27   53         9
    3                          20   5          6

  • ?
  slide by Richard Zemel 18

  19. Euclidean distance metric D(x, x′) = sqrt( Σ_i σ_i² (x_i − x′_i)² ). Or equivalently, D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) ), where A is the diagonal matrix with entries A_ii = σ_i². slide by Dhruv Batra 19 Slide Credit: Carlos Guestrin
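A sketch of the matrix form D(x, x′) = sqrt((x − x′)ᵀ A (x − x′)) with a diagonal A whose entries weight each dimension; the particular weights are illustrative. Plugging in a non-diagonal positive-definite A gives the Mahalanobis distance of the next slide.

    import numpy as np

    def scaled_euclidean(x, x_prime, A):
        d = x - x_prime
        return float(np.sqrt(d @ A @ d))

    A = np.diag([1.0, 9.0])   # sigma_1^2 = 1, sigma_2^2 = 9: the 2nd dimension counts more
    print(scaled_euclidean(np.array([0.0, 0.0]), np.array([2.0, 1.5]), A))  # sqrt(4 + 20.25)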

  20. Notable distance metrics (and their level sets): Scaled Euclidean (L2), Mahalanobis (non-diagonal A). slide by Dhruv Batra 20 Slide Credit: Carlos Guestrin

  21. Minkowski distance: D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p} slide by Dhruv Batra Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons 21
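A sketch of the Minkowski distance for a general p: p = 1 recovers the L1 (absolute) distance, p = 2 the Euclidean distance, and p → ∞ approaches the L-inf (max) norm on the next slide.

    import numpy as np

    def minkowski(x, y, p):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
    print(minkowski(x, y, 1))       # 7.0  (L1)
    print(minkowski(x, y, 2))       # 5.0  (L2 / Euclidean)
    print(np.max(np.abs(x - y)))    # 4.0  (the L-inf limit)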

  22. Notable distance metrics (and their level sets): Scaled Euclidean (L2), L1 norm (absolute), L∞ (max) norm. slide by Dhruv Batra 22 Slide Credit: Carlos Guestrin

  23. Weighted K-NN for Regression
  • Given: Training data {(x_1, y_1), …, (x_n, y_n)}
    − Attribute vectors: x_i ∈ X
    − Target attribute: y_i ∈ ℝ
  • Parameter:
    − Similarity function: K : X × X → ℝ
    − Number of nearest neighbors to consider: k
  • Prediction rule
    − New example x′
    − K-nearest neighbors: the k training examples with largest K(x_i, x′)
  slide by Thorsten Joachims 23

  24. Kernel Regression/Classification Four things make a memory-based learner: • A distance metric − Euclidean (and others) • How many nearby neighbors to look at? − All of them • A weighting function (optional) − w_i = exp(−d(x_i, query)² / σ²) − Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the kernel width. Very important. • How to fit with the local points? − Predict the weighted average of the outputs: predict = Σ w_i y_i / Σ w_i slide by Dhruv Batra 24 Slide Credit: Carlos Guestrin
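A minimal numpy sketch of exactly this recipe with the Gaussian weighting function: w_i = exp(−d(x_i, query)²/σ²) over all training points, followed by the weighted average Σ w_i y_i / Σ w_i. The toy data are made up; σ is the kernel width you would tune.

    import numpy as np

    def kernel_regress(X_train, y_train, x_query, sigma=1.0):
        d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared distances d(x_i, query)^2
        w = np.exp(-d2 / sigma ** 2)                    # Gaussian weights over all training points
        return np.sum(w * y_train) / np.sum(w)          # weighted average of the outputs

    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.1, 0.9, 2.1, 2.9])
    print(kernel_regress(X_train, y_train, np.array([1.4]), sigma=0.5))

Note how σ controls the behaviour asked about on slide 26: as σ grows very large all weights approach 1 and the prediction tends toward the global mean of the targets, while as σ shrinks toward 0 only the nearest training point gets appreciable weight and the fit behaves like 1-NN.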

  25. Weighting/Kernel functions w_i = exp(−d(x_i, query)² / σ²) slide by Dhruv Batra (Our examples use Gaussian) 25 Slide Credit: Carlos Guestrin

  26. Effect of Kernel Width • What happens as σ → ∞? • What happens as σ → 0? slide by Dhruv Batra Image Credit: Ben Taskar 26

  27. Problems with Instance-Based Learning • Expensive − No Learning: most real work done during testing − For every test sample, must search through all dataset – very slow! − Must use tricks like approximate nearest neighbour search • Doesn’t work well when large number of irrelevant features • Distances overwhelmed by noisy features slide by Dhruv Batra • Curse of Dimensionality • Distances become meaningless in high dimensions 27

  28. Problems with Instance-Based Learning • Expensive − No Learning: most real work done during testing − For every test sample, must search through all dataset – very slow! − Must use tricks like approximate nearest neighbour search • Doesn’t work well when large number of irrelevant features − Distances overwhelmed by noisy features slide by Dhruv Batra • Curse of Dimensionality − Distances become meaningless in high dimensions 28

  29. Problems with Instance-Based Learning • Expensive − No Learning: most real work done during testing − For every test sample, must search through all dataset – very slow! − Must use tricks like approximate nearest neighbour search • Doesn’t work well when large number of irrelevant features − Distances overwhelmed by noisy features slide by Dhruv Batra • Curse of Dimensionality • Distances become meaningless in high dimensions 29

  30. Curse of Dimensionality
  • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube.
  • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points.
  • The expected edge length of this cube will be e_D(f) = f^(1/D).
  • If D = 10, and we want to base our estimate on 10% of the data, we have e_10(0.1) ≈ 0.8, so we need to extend the cube 80% along each dimension around x.
  • Even if we only use 1% of the data, we find e_10(0.01) ≈ 0.63: no longer very local.
  [Figure: edge length of cube vs. fraction of data in neighborhood, for d = 1, 3, 5, 7, 10]
  slide by Kevin Murphy 30
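A two-line check of the edge-length formula e_D(f) = f^(1/D), reproducing the numbers quoted above.

    print(0.1 ** (1 / 10))    # ~0.794: covering 10% of the data in D=10 needs ~80% of each edge
    print(0.01 ** (1 / 10))   # ~0.631: even 1% of the data needs ~63% of each edge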

  31. Parametric vs Non-parametric Models • Does the capacity (size of hypothesis class) grow with size of training data? –Yes = Non-parametric Models –No = Parametric Models 31

  32. Next Lecture: Linear Regression, Least Squares Optimization, Model complexity, Regularization 32
