slide-1
SLIDE 1

K-Nearest Neighbors

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

slide-2
SLIDE 2

Administrative

  • Check out review materials
  • Probability
  • Linear algebra
  • Python and NumPy
  • Start your HW 0
  • On your local machine: install Anaconda and Jupyter notebook
  • On the cloud: https://colab.research.google.com
  • Sign up for the Piazza discussion forum
slide-3
SLIDE 3

Enrollment

  • Maximum allowable capacity reached.


slide-4
SLIDE 4

Machine learning reading & study group

  • Reading Group

Tuesday 11:00 AM - 12:00 PM, Location: Whittemore Hall 457B

  • Research paper reading: machine learning, computer vision
  • Study Group

Thursday 11:00 AM - 12:00 PM, Location: Whittemore Hall 457B

  • Video lecture: machine learning

All are welcome.

More info: https://github.com/vt-vl-lab/reading_group

slide-5
SLIDE 5

Recap: Machine learning algorithms

Supervised Learning: discrete output β†’ Classification; continuous output β†’ Regression
Unsupervised Learning: discrete output β†’ Clustering; continuous output β†’ Dimensionality reduction

slide-6
SLIDE 6

Today’s plan

  • Supervised learning
  • Setup
  • Basic concepts
  • K-Nearest Neighbor (kNN)
  • Distance metric
  • Pros/Cons of nearest neighbor
  • Validation, cross-validation, hyperparameter tuning
slide-7
SLIDE 7

Supervised learning

  • Input: x (images, texts, emails)
  • Output: y (e.g., spam or non-spam)
  • Data: (x(1), y(1)), (x(2), y(2)), β‹―, (x(N), y(N))

(Labeled dataset)

  • (Unknown) Target function: f: X β†’ Y (β€œTrue” mapping)
  • Model/hypothesis: h: X β†’ Y (Learned model)
  • Learning = search in hypothesis space

Slide credit: Dhruv Batra

slide-8
SLIDE 8

Training set β†’ Learning Algorithm β†’ Hypothesis h

x β†’ h β†’ y

slide-9
SLIDE 9

Training set β†’ Learning Algorithm β†’ Hypothesis h

Size of house (x) β†’ h β†’ Estimated price (y)

Regression

slide-10
SLIDE 10

Training set β†’ Learning Algorithm β†’ Hypothesis h

Unseen image (x) β†’ h β†’ Predicted object class (y): β€œMug”

Image credit: CS231n @ Stanford

Classification

slide-11
SLIDE 11

Procedural view of supervised learning

  • Training Stage:
  • Raw data β†’ x (Feature Extraction)
  • Training data {(x, y)} β†’ h (Learning)
  • Testing Stage:
  • Raw data β†’ x (Feature Extraction)
  • Test data: x β†’ h(x) (Apply function, evaluate error)

Slide credit: Dhruv Batra

slide-12
SLIDE 12

Basic steps of supervised learning

  • Set up a supervised learning problem
  • Data collection: Collect training data with the β€œright” answer.
  • Representation: Choose how to represent the data.
  • Modeling: Choose a hypothesis class: H = {h: X β†’ Y}
  • Learning/estimation: Find the best hypothesis in the model class.
  • Model selection: Try different models. Pick the best one.

(More on this later)

  • If happy, stop; else refine one or more of the steps above

Slide credit: Dhruv Batra

slide-13
SLIDE 13

Nearest neighbor classifier

  • Training data

(x(1), y(1)), (x(2), y(2)), β‹―, (x(N), y(N))

  • Learning

Do nothing.

  • Testing

h(x) = y(i*), where i* = argmin_i D(x, x(i))
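A minimal NumPy sketch of this rule (no training beyond storing the data; at test time, return the label of the closest stored point). The array names and the Euclidean distance are illustrative choices, not prescribed by the slide:

```python
import numpy as np

def nn_predict(X_train, y_train, x_query):
    """1-NN: label of the training point closest to x_query (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every stored point
    return y_train[np.argmin(dists)]

# Toy example: two classes in 2-D
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 1, 1])
print(nn_predict(X_train, y_train, np.array([0.8, 1.0])))  # -> 1
```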

slide-14
SLIDE 14

Face recognition

Image credit: MegaFace

slide-15
SLIDE 15

Face recognition – surveillance application

slide-16
SLIDE 16

Music identification

https://www.youtube.com/watch?v=TKNNOMddkNc

slide-17
SLIDE 17

Album recognition (Instance recognition)

http://record-player.glitch.me/auth

slide-18
SLIDE 18

Scene Completion

(C) Dhruv Batra

[Hays & Efros, SIGGRAPH 2007]

slide-19
SLIDE 19

Hays and Efros, SIGGRAPH 2007

slide-20
SLIDE 20

… 200 total

[Hays & Efros, SIGGRAPH 2007]

slide-21
SLIDE 21

Context Matching

[Hays & Efros, SIGGRAPH 2007]

slide-22
SLIDE 22

Graph cut + Poisson blending

[Hays & Efros, SIGGRAPH 2007]

slide-23
SLIDE 23

[Hays & Efros, SIGGRAPH 2007]

slide-24
SLIDE 24

[Hays & Efros, SIGGRAPH 2007]

slide-25
SLIDE 25

[Hays & Efros, SIGGRAPH 2007]

slide-26
SLIDE 26

[Hays & Efros, SIGGRAPH 2007]

slide-27
SLIDE 27

[Hays & Efros, SIGGRAPH 2007]

slide-28
SLIDE 28

[Hays & Efros, SIGGRAPH 2007]

slide-29
SLIDE 29

Synonyms

  • Nearest Neighbors
  • k-Nearest Neighbors
  • A member of the following families:
  • Instance-based Learning
  • Memory-based Learning
  • Exemplar methods
  • Non-parametric methods

Slide credit: Dhruv Batra

slide-30
SLIDE 30

Instance/Memory-based Learning

  • 1. A distance metric
  • 2. How many nearby neighbors to look at?
  • 3. A weighting function (optional)
  • 4. How to fit with the local points?

Slide credit: Carlos Guestrin

slide-31
SLIDE 31

Instance/Memory-based Learning

  • 1. A distance metric
  • 2. How many nearby neighbors to look at?
  • 3. A weighting function (optional)
  • 4. How to fit with the local points?

Slide credit: Carlos Guestrin

slide-32
SLIDE 32

Recall: 1-Nearest neighbor classifier

  • Training data

(x(1), y(1)), (x(2), y(2)), β‹―, (x(N), y(N))

  • Learning

Do nothing.

  • Testing

h(x) = y(i*), where i* = argmin_i D(x, x(i))

slide-33
SLIDE 33

Distance metrics (x: continuous variables)

  • L2-norm: Euclidean distance D(x, x') = sqrt( Ξ£_i (x_i βˆ’ x'_i)Β² )
  • L1-norm: Sum of absolute differences D(x, x') = Ξ£_i |x_i βˆ’ x'_i|
  • L∞-norm: D(x, x') = max_i |x_i βˆ’ x'_i|
  • Scaled Euclidean distance D(x, x') = sqrt( Ξ£_i a_iΒ² (x_i βˆ’ x'_i)Β² )
  • Mahalanobis distance D(x, x') = sqrt( (x βˆ’ x')^⊀ A (x βˆ’ x') )
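A small NumPy sketch of these metrics; the function names, treating a as a per-dimension scaling vector, and treating A as a given positive semi-definite matrix (e.g., an inverse covariance) are my own illustrative choices:

```python
import numpy as np

def l2(x, xp):              # Euclidean distance
    return np.sqrt(np.sum((x - xp) ** 2))

def l1(x, xp):              # sum of absolute differences
    return np.sum(np.abs(x - xp))

def linf(x, xp):            # largest coordinate-wise difference
    return np.max(np.abs(x - xp))

def scaled_l2(x, xp, a):    # a: per-dimension scaling weights a_i
    return np.sqrt(np.sum((a * (x - xp)) ** 2))

def mahalanobis(x, xp, A):  # A: positive semi-definite matrix (e.g., inverse covariance)
    d = x - xp
    return np.sqrt(d @ A @ d)
```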

slide-34
SLIDE 34

Distance metrics (x: discrete variables)

  • Example application: document classification
  • Hamming distance
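For discrete vectors (e.g., word-presence indicators in document classification), the Hamming distance counts the positions where two vectors differ; a minimal sketch with made-up example vectors:

```python
import numpy as np

def hamming(x, xp):
    """Number of positions at which the discrete vectors differ."""
    return int(np.sum(x != xp))

# Word-presence indicators for two documents (illustrative)
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # -> 2
```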
slide-35
SLIDE 35

Distance metrics (x: histogram / PDF)

  • Histogram intersection: histint(x, x') = 1 βˆ’ Ξ£_i min(x_i, x'_i)
  • Chi-squared histogram matching distance: χ²(x, x') = (1/2) Ξ£_i (x_i βˆ’ x'_i)Β² / (x_i + x'_i)
  • Earth mover's distance (cross-bin similarity measure)
  • Minimal cost paid to transform one distribution into the other

[Rubner et al. IJCV 2000]
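A sketch of the two bin-to-bin measures above, assuming x and x' are normalized histograms stored as NumPy arrays; the small epsilon that guards against empty bins is my addition:

```python
import numpy as np

def hist_intersection_dist(x, xp):
    """1 - sum of bin-wise minima: 0 for identical normalized histograms."""
    return 1.0 - np.sum(np.minimum(x, xp))

def chi2_dist(x, xp, eps=1e-12):
    """0.5 * sum (x_i - x'_i)^2 / (x_i + x'_i); eps avoids division by zero on empty bins."""
    return 0.5 * np.sum((x - xp) ** 2 / (x + xp + eps))
```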

slide-36
SLIDE 36

Distance metrics (x: gene expression microarray data)

  • When β€œshape” matters more than values
  • Want D(x(1), x(2)) < D(x(1), x(3))
  • How?
  • Correlation coefficients
  • Pearson, Spearman, Kendall, etc.

[Figure: expression profiles x(1), x(2), x(3) plotted across genes]
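One way to act on this: turn a correlation coefficient into a distance, e.g., 1 βˆ’ Pearson correlation, so profiles with the same shape are close even if their absolute values differ. A minimal sketch with made-up profiles:

```python
import numpy as np

def pearson_dist(x, xp):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    return 1.0 - np.corrcoef(x, xp)[0, 1]

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = x1 * 10 + 5                      # same shape, very different values
x3 = np.array([4.0, 1.0, 3.0, 2.0])   # different shape
print(pearson_dist(x1, x2) < pearson_dist(x1, x3))  # -> True: D(x1, x2) < D(x1, x3)
```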

slide-37
SLIDE 37

Distance metrics (x: learnable features)

Large margin nearest neighbor (LMNN)

slide-38
SLIDE 38

Instance/Memory-based Learning

  • 1. A distance metric
  • 2. How many nearby neighbors to look at?
  • 3. A weighting function (optional)
  • 4. How to fit with the local points?

Slide credit: Carlos Guestrin

slide-39
SLIDE 39

kNN Classification

[Figure: kNN classification of a query point with k = 3 and k = 5 neighborhoods]

Image credit: Wikipedia
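A minimal kNN classifier with majority voting, as in the figure (the decision can flip between k = 3 and k = 5); the array names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k training points closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```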

slide-40
SLIDE 40

Classification decision boundaries

Image credit: CS231 @ Stanford

slide-41
SLIDE 41

Instance/Memory-based Learning

  • 1. A distance metric
  • 2. How many nearby neighbors to look at?
  • 3. A weighting function (optional)
  • 4. How to fit with the local points?

Slide credit: Carlos Guestrin

slide-42
SLIDE 42

Issue: Skewed class distribution

  • Problem with majority voting in kNN
  • Intuition: nearby points should be weighted strongly, far points weakly
  • Apply weights: w(i) = exp(βˆ’D(x(i), x_query)Β² / σ²)
  • σ²: kernel width
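A sketch of this weighted vote; sigma2 stands in for the kernel width σ², and the remaining names are placeholders:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, sigma2=1.0):
    """Vote with weights exp(-D^2 / sigma^2), so near neighbors count more than far ones."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] ** 2 / sigma2)
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)   # class with the largest total weight
```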

slide-43
SLIDE 43

Instance/Memory-based Learning

  • 1. A distance metric
  • 2. How many nearby neighbors to look at?
  • 3. A weighting function (optional)
  • 4. How to fit with the local points?

Slide credit: Carlos Guestrin

slide-44
SLIDE 44

1-NN for Regression

  • Just predict the same output as the nearest neighbour.

[Figure: 1-D training points (x, y); the prediction at a query copies the y of the closest datapoint]

Figure credit: Carlos Guestrin

slide-45
SLIDE 45

1-NN for Regression

  • Often bumpy (overfits)

Figure credit: Andrew Moore

slide-46
SLIDE 46

9-NN for Regression

  • Predict the average of the k nearest neighbors’ values

Figure credit: Andrew Moore
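A sketch of k-NN regression, predicting the mean target of the k nearest points (k = 1 gives the bumpy 1-NN fit, larger k the smoother curve above); names are illustrative:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=9):
    """Predict the average target value of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y_train[nearest]))
```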

slide-47
SLIDE 47

Weighting/Kernel functions

Weight π‘₯(𝑗) = exp(βˆ’ 𝑒 𝑦 𝑗 , π‘Ÿπ‘£π‘“π‘ π‘§

2

𝜏2 ) Prediction (use all the data) 𝑧 = ෍

𝑗

π‘₯ 𝑗 𝑧 𝑗 / ෍

𝑗

π‘₯(𝑗)

(Our examples use Gaussian)

Slide credit: Carlos Guestrin
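Kernel regression in this form uses every training point, weighted by the Gaussian kernel above; a minimal sketch, with sigma2 standing in for the kernel width σ²:

```python
import numpy as np

def kernel_regress(X_train, y_train, x_query, sigma2=1.0):
    """Weighted average of all targets, with weights exp(-D^2 / sigma^2)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    w = np.exp(-dists ** 2 / sigma2)
    return float(np.sum(w * y_train) / np.sum(w))
```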

slide-48
SLIDE 48

Effect of Kernel Width

  • What happens as Οƒ β†’ ∞?
  • What happens as Οƒ β†’ 0?

Slide credit: Ben Taskar

Kernel regression

slide-49
SLIDE 49

Problems with Instance-Based Learning

  • Expensive
  • No Learning: most real work done during testing
  • For every test sample, we must search through the entire dataset – very slow!
  • Must use tricks like approximate nearest neighbour search
  • Doesn’t work well with a large number of irrelevant features
  • Distances overwhelmed by noisy features
  • Curse of Dimensionality
  • Distances become meaningless in high dimensions

Slide credit: Dhruv Batra

slide-50
SLIDE 50

Curse of dimensionality

  • Consider a hypersphere with radius r and dimension d
  • Consider a hypercube with edges of length 2r
  • The distance between the center and the corners is r√d
  • The hypercube consists almost entirely of β€œcorners”

[Figure: d = 2 case: a circle of radius r inscribed in a square with side 2r]
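A quick numerical illustration of this claim: estimate the fraction of the hypercube [βˆ’r, r]^d (here r = 1) that lies inside the inscribed hypersphere; it collapses toward zero as d grows, so almost all of the volume sits in the corners. The Monte-Carlo estimate below is my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 5, 10, 20]:
    pts = rng.uniform(-1.0, 1.0, size=(100_000, d))        # uniform samples in the hypercube
    inside = np.mean(np.linalg.norm(pts, axis=1) <= 1.0)   # fraction inside the unit sphere
    print(f"d = {d:2d}: fraction inside the inscribed sphere ~ {inside:.4f}")
# Meanwhile the center-to-corner distance grows as r * sqrt(d), while the sphere radius stays r.
```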

slide-51
SLIDE 51

Hyperparameter selection

  • How to choose K?
  • Which distance metric should I use? L2, L1?
  • How large should the kernel width σ² be?
  • …
slide-52
SLIDE 52

Tune hyperparameters on the test dataset?

  • It will give us stronger performance on the test set!
  • Why is this not okay? Let’s discuss.

Evaluate on the test set only a single time, at the very end.

slide-53
SLIDE 53

Validation set

  • Splitting the training set: hold out a β€œfake test set” to tune hyperparameters

Slide credit: CS231 @ Stanford

slide-54
SLIDE 54

Cross-validation

  • 5-fold cross-validation β†’ split the training data into 5 equal folds
  • Use 4 of them for training and 1 for validation, rotating which fold is held out

Slide credit: CS231 @ Stanford

slide-55
SLIDE 55

Hyperparameter selection

  • Split the training dataset into train/validation sets (or use cross-validation)
  • Try out different values of the hyperparameters and evaluate these models on the validation set
  • Pick the best-performing model on the validation set
  • Run the selected model on the test set. Report the results.
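A self-contained sketch of this procedure for picking k with 5-fold cross-validation; the fold logic, the candidate k values, and the inlined kNN predictor are illustrative (libraries such as scikit-learn provide the same machinery):

```python
import numpy as np
from collections import Counter

def knn_predict(X_tr, y_tr, x, k):
    """Majority vote over the k nearest training points."""
    nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return Counter(y_tr[nn]).most_common(1)[0][0]

def cv_score_for_k(X, y, k, n_folds=5, seed=0):
    """Mean validation accuracy of k-NN over n_folds splits of the training data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accs = []
    for i in range(n_folds):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        preds = np.array([knn_predict(X[tr], y[tr], x, k) for x in X[val]])
        accs.append(np.mean(preds == y[val]))
    return float(np.mean(accs))

# Pick the best k on validation, then evaluate once on the held-out test set:
# best_k = max([1, 3, 5, 7, 9], key=lambda k: cv_score_for_k(X_train, y_train, k))
```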
slide-56
SLIDE 56

Things to remember

  • Supervised Learning
  • Training/testing data; classification/regression; Hypothesis
  • k-NN
  • Simplest learning algorithm
  • With sufficient data, this simple β€œstrawman” approach is very hard to beat
  • Kernel regression/classification
  • Set k to n (the number of data points) and choose a kernel width
  • Smoother than k-NN
  • Problems with k-NN
  • Curse of dimensionality
  • Not robust to irrelevant features
  • Slow NN search: must remember (very large) dataset for prediction
slide-57
SLIDE 57

Next class: Linear Regression

[Figure: scatter plot of house price ($ in 1000s) vs. size in feetΒ²]