
BBM406 Fundamentals of Machine Learning Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality - PowerPoint PPT Presentation



  1. photo:@rewardyfahmi // Unsplash BBM406 Fundamentals of Machine Learning Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality Aykut Erdem // Hacettepe University // Fall 2019

  2. Administrative • Assignment 1 will be out Friday! • It is due November 1 (i.e. in two weeks). • It includes − Pencil-and-paper derivations − Implementing kernel regression − numpy/Python code 2

  3. Movie Recommendation System • MovieLens dataset (100K ratings of 9K movies by 600 users) • You may want to split the training set into train and validation (more on this next week) • The data consists of four tables: − Ratings: userId, movieId, rating, timestamp − Movies: movieId, title, genre − Links: movieId, imdbId, and tmdbId − Tags: userId, movieId, tag, timestamp 3
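For concreteness, here is a minimal pandas sketch of loading these tables and holding out a validation split. The directory and file names (ml-latest-small/ratings.csv, movies.csv, links.csv, tags.csv) and the 80/20 split are illustrative assumptions, not part of the assignment; adjust them to however the data is shipped.

    import pandas as pd

    # Load the MovieLens tables (paths assume the ml-latest-small CSV layout).
    ratings = pd.read_csv("ml-latest-small/ratings.csv")  # userId, movieId, rating, timestamp
    movies  = pd.read_csv("ml-latest-small/movies.csv")   # movieId, title, genre(s)
    links   = pd.read_csv("ml-latest-small/links.csv")    # movieId, imdbId, tmdbId
    tags    = pd.read_csv("ml-latest-small/tags.csv")     # userId, movieId, tag, timestamp

    # Hold out ~20% of the ratings as a validation set (more on splits next week).
    val   = ratings.sample(frac=0.2, random_state=0)
    train = ratings.drop(val.index)
    print(len(train), "train ratings /", len(val), "validation ratings")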

  4. Recall from last time… Nearest Neighbors • Very simple method • Retain all training data − It can be slow in testing − Finding NN in high dimensions is slow • Metrics are very important • Good baseline adopted from Fei-Fei Li & Andrej Karpathy & Justin Johnson 4

  5. Classification • Input: X - Real valued, vectors over real. - Discrete values (0,1,2,…) - Other structures (e.g., strings, graphs, etc.) • Output: Y - Discrete (0,1,2,...) Examples: X = Document, Y = Topic (Sports, Science, News); X = Cell Image, Y = Diagnosis (Anemic cell, Healthy cell) slide by Aarti Singh and Barnabas Poczos 5

  6. Regression • Input: X - Real valued, vectors over real. - Discrete values (0,1,2,…) - Other structures (e.g., strings, graphs, etc.) • Output: Y - Real valued, vectors over real. Example: Stock market prediction, X = Feb01, Y = ? slide by Aarti Singh and Barnabas Poczos 6

  7. What should I watch tonight? slide by Sanja Fidler 7

  8. What should I watch tonight? slide by Sanja Fidler 8

  9. What should I watch tonight? slide by Sanja Fidler 9

  10. Today • Kernel regression − nonparametric • Distance metrics • Linear regression (more on Friday) − parametric − simple model 10

  11. Simple 1-D Regression • Circles are data points (i.e., training examples) that are given to us • The data points are uniform in x, but may be displaced in y: t(x) = f(x) + ε, with ε some noise • In green is the “true” curve that we don’t know slide by Sanja Fidler 11
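A tiny numpy sketch of this data-generating story, t(x) = f(x) + ε. The particular “true” curve (a sine) and the noise level are illustrative assumptions; in the regression setting, f is exactly what we never get to see.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # An assumed "true" curve, unknown to the learner in practice.
        return np.sin(2 * np.pi * x)

    x = np.linspace(0.0, 1.0, 10)                    # inputs uniform in x
    t = f(x) + rng.normal(scale=0.2, size=x.shape)   # targets displaced in y by noise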

  12. Kernel Regression 12

  13. K-NN for Regression
  • Given: Training data {(x_1, y_1), …, (x_n, y_n)}
    − Attribute vectors: x_i ∈ X
    − Target attribute: y_i ∈ ℝ
  • Parameter:
    − Similarity function: K : X × X → ℝ
    − Number of nearest neighbors to consider: k
  • Prediction rule
    − New example x′
    − K-nearest neighbors: the k training examples with largest K(x_i, x′)
    − Prediction: h(x′) = (1/k) Σ_{i ∈ knn(x′)} y_i
  slide by Thorsten Joachims 13
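A minimal numpy sketch of this prediction rule: pick the k training points most similar to x′ and average their targets. The slide leaves the similarity K abstract; negative Euclidean distance is used here purely as one reasonable choice, and the toy data are made up.

    import numpy as np

    def knn_regress(X_train, y_train, x_query, k=3):
        # Similarity K(x_i, x') taken as negative Euclidean distance (larger = more similar).
        sims = -np.linalg.norm(X_train - x_query, axis=1)
        nn = np.argsort(sims)[-k:]        # indices of the k most similar training points
        return y_train[nn].mean()         # h(x') = (1/k) * sum of their y_i

    # Toy usage
    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.1, 0.9, 2.1, 2.9])
    print(knn_regress(X_train, y_train, np.array([1.4]), k=2))  # averages y at x=1 and x=2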

  14. 1-NN for Regression [Figure: 1-NN regression fit on (x, y) data; each prediction comes from the closest datapoint] slide by Dhruv Batra 14 Figure Credit: Carlos Guestrin

  15. 1-NN for Regression • Often bumpy (overfits) slide by Dhruv Batra 15 Figure Credit: Andrew Moore

  16. 9-NN for Regression • Often bumpy (overfits) slide by Dhruv Batra 16 Figure Credit: Andrew Moore

  17. Multivariate distance metrics • Suppose the input vectors x_1, x_2, …, x_N are two dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), …, x_N = (x_N1, x_N2). • One can draw the nearest-neighbor regions in input space, e.g. under Dist(x_i, x_j) = (x_i1 − x_j1)² + (x_i2 − x_j2)² versus Dist(x_i, x_j) = (x_i1 − x_j1)² + (3x_i2 − 3x_j2)². • The relative scalings in the distance metric affect region shapes. slide by Dhruv Batra 17 Slide Credit: Carlos Guestrin
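A short sketch of the two distance functions above, showing how rescaling the second coordinate by 3 can change which point is nearest to a query. The specific points are made up for illustration.

    import numpy as np

    def dist_plain(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def dist_scaled(a, b):
        return (a[0] - b[0]) ** 2 + (3 * a[1] - 3 * b[1]) ** 2

    q  = np.array([0.0, 0.0])
    p1 = np.array([2.0, 0.0])
    p2 = np.array([0.0, 1.5])
    print(dist_plain(q, p1), dist_plain(q, p2))    # 4.0 vs 2.25 -> p2 is the nearer point
    print(dist_scaled(q, p1), dist_scaled(q, p2))  # 4.0 vs 20.25 -> p1 is the nearer point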

  18. Example: Choosing a restaurant
  • In everyday life we need to make decisions by taking into account lots of factors.
  • The question is what weight we put on each of these factors (how important are they with respect to the others); this depends on individuals’ preferences.

    Reviews (out of 5 stars)   $    Distance   Cuisine (out of 10)
    4                          30   21         7
    2                          15   12         8
    5                          27   53         9
    3                          20   5          6

  • ?
  slide by Richard Zemel 18

  19. Euclidean distance metric D(x, x′) = sqrt( Σ_i σ_i² (x_i − x′_i)² ). Or equivalently, D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) ), where A is the diagonal matrix with entries A_ii = σ_i². slide by Dhruv Batra 19 Slide Credit: Carlos Guestrin
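A sketch of the matrix form D(x, x′) = sqrt((x − x′)ᵀ A (x − x′)) with a diagonal A whose entries weight each dimension; the particular weights are illustrative. Plugging in a non-diagonal positive-definite A gives the Mahalanobis distance of the next slide.

    import numpy as np

    def scaled_euclidean(x, x_prime, A):
        d = x - x_prime
        return float(np.sqrt(d @ A @ d))

    A = np.diag([1.0, 9.0])   # sigma_1^2 = 1, sigma_2^2 = 9: the 2nd dimension counts more
    print(scaled_euclidean(np.array([0.0, 0.0]), np.array([2.0, 1.5]), A))  # sqrt(4 + 20.25)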

  20. Notable distance metrics (and their level sets): Scaled Euclidean (L2), Mahalanobis (non-diagonal A). slide by Dhruv Batra 20 Slide Credit: Carlos Guestrin

  21. Minkowski distance: D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p} slide by Dhruv Batra Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons 21
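A sketch of the Minkowski distance for a general p: p = 1 recovers the L1 (absolute) distance, p = 2 the Euclidean distance, and p → ∞ approaches the L-inf (max) norm on the next slide.

    import numpy as np

    def minkowski(x, y, p):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
    print(minkowski(x, y, 1))       # 7.0  (L1)
    print(minkowski(x, y, 2))       # 5.0  (L2 / Euclidean)
    print(np.max(np.abs(x - y)))    # 4.0  (the L-inf limit)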

  22. Notable distance metrics (and their level sets): Scaled Euclidean (L2), L1 norm (absolute), L∞ (max) norm. slide by Dhruv Batra 22 Slide Credit: Carlos Guestrin

  23. Weighted K-NN for Regression
  • Given: Training data {(x_1, y_1), …, (x_n, y_n)}
    − Attribute vectors: x_i ∈ X
    − Target attribute: y_i ∈ ℝ
  • Parameter:
    − Similarity function: K : X × X → ℝ
    − Number of nearest neighbors to consider: k
  • Prediction rule
    − New example x′
    − K-nearest neighbors: the k training examples with largest K(x_i, x′)
  slide by Thorsten Joachims 23

  24. Kernel Regression/Classification Four things make a memory-based learner: • A distance metric − Euclidean (and others) • How many nearby neighbors to look at? − All of them • A weighting function (optional) − w_i = exp(−d(x_i, query)² / σ²) − Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the kernel width. Very important. • How to fit with the local points? − Predict the weighted average of the outputs: predict = Σ w_i y_i / Σ w_i slide by Dhruv Batra 24 Slide Credit: Carlos Guestrin
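A minimal numpy sketch of exactly this recipe with the Gaussian weighting function: w_i = exp(−d(x_i, query)²/σ²) over all training points, followed by the weighted average Σ w_i y_i / Σ w_i. The toy data are made up; σ is the kernel width you would tune.

    import numpy as np

    def kernel_regress(X_train, y_train, x_query, sigma=1.0):
        d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared distances d(x_i, query)^2
        w = np.exp(-d2 / sigma ** 2)                    # Gaussian weights over all training points
        return np.sum(w * y_train) / np.sum(w)          # weighted average of the outputs

    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.1, 0.9, 2.1, 2.9])
    print(kernel_regress(X_train, y_train, np.array([1.4]), sigma=0.5))

Note how σ controls the behaviour asked about on slide 26: as σ grows very large all weights approach 1 and the prediction tends toward the global mean of the targets, while as σ shrinks toward 0 only the nearest training point gets appreciable weight and the fit behaves like 1-NN.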

  25. Weighting/Kernel functions w_i = exp(−d(x_i, query)² / σ²) slide by Dhruv Batra (Our examples use Gaussian) 25 Slide Credit: Carlos Guestrin

  26. Effect of Kernel Width • What happens as σ → ∞? • What happens as σ → 0? slide by Dhruv Batra Image Credit: Ben Taskar 26

  27. Problems with Instance-Based Learning • Expensive − No Learning: most real work done during testing − For every test sample, must search through all dataset – very slow! − Must use tricks like approximate nearest neighbour search • Doesn’t work well when large number of irrelevant features • Distances overwhelmed by noisy features slide by Dhruv Batra • Curse of Dimensionality • Distances become meaningless in high dimensions 27

  28. Problems with Instance-Based Learning • Expensive − No Learning: most real work done during testing − For every test sample, must search through all dataset – very slow! − Must use tricks like approximate nearest neighbour search • Doesn’t work well when large number of irrelevant features − Distances overwhelmed by noisy features slide by Dhruv Batra • Curse of Dimensionality − Distances become meaningless in high dimensions 28

  29. Problems with Instance-Based Learning • Expensive − No Learning: most real work done during testing − For every test sample, must search through all dataset – very slow! − Must use tricks like approximate nearest neighbour search • Doesn’t work well when large number of irrelevant features − Distances overwhelmed by noisy features slide by Dhruv Batra • Curse of Dimensionality • Distances become meaningless in high dimensions 29

  30. Curse of Dimensionality
  • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube.
  • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points.
  • The expected edge length of this cube will be e_D(f) = f^(1/D).
  • If D = 10, and we want to base our estimate on 10% of the data, we have e_10(0.1) ≈ 0.8, so we need to extend the cube 80% along each dimension around x.
  • Even if we only use 1% of the data, we find e_10(0.01) ≈ 0.63: no longer very local.
  [Figure: edge length of cube vs. fraction of data in neighborhood, for d = 1, 3, 5, 7, 10]
  slide by Kevin Murphy 30
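A two-line check of the edge-length formula e_D(f) = f^(1/D), reproducing the numbers quoted above.

    print(0.1 ** (1 / 10))    # ~0.794: covering 10% of the data in D=10 needs ~80% of each edge
    print(0.01 ** (1 / 10))   # ~0.631: even 1% of the data needs ~63% of each edge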

  31. Parametric vs Non-parametric Models • Does the capacity (size of hypothesis class) grow with size of training data? –Yes = Non-parametric Models –No = Parametric Models 31

  32. Next Lecture: Linear Regression, Least Squares Optimization, Model complexity, Regularization 32
