CS 4803 / 7643: Deep Learning
Zsolt Kira Georgia Tech
Topics:
– Image Classification
– Supervised Learning view
– K-NN
– Linear Classifier
Last Time:
– High-level intro to what deep learning is
– Quick overview of logistics
– Requirements: ML, math (linear algebra, calculus), programming (Python)
– Grades: 80% PS/HW, 20% Project, Piazza Bonus
– Project: topic of your choosing (related to DL), groups of 3-4 with separate undergrad/grad groupings
– 7 free late days
– 1-week re-grading period
– No cheating
– Graded pass/fail
– Intended to be done on your own
– Don’t worry if rusty! It’s OK to need a refresher on various subjects to do it.
– Some of it (e.g. the last question) is more suitable for graduate students.
– If not registered, email the staff for a Gradescope account
Rahul Duggal, 2nd year CS PhD student: http://www.rahulduggal.com/
Patrick Grady, 2nd year Robotics PhD student: https://www.linkedin.com/in/patrick-grady
Sameer Dharur, MS-CS student: https://www.linkedin.com/in/sameerdharur/
Jiachen Yang, 2nd year ML PhD: https://www.cc.gatech.edu/~jyang462/
Yinquan Lu, 2nd year MSCSE student: https://www.cc.gatech.edu/~jyang462/
Anishi Mehta, MSCS student: https://www.linkedin.com/in/anishimehta
– Still a large waitlist for grad; still adding some capacity
– Anybody not have access?
– 110+ people signed up on Piazza. Please use it for questions.
Website: http://www.cc.gatech.edu/classes/AY2020/cs7643_spring/
Piazza: https://piazza.com/gatech/spring2020/cs4803dl7643a/
Staff mailing list (personal questions): cs4803-7643-staff@lists.gatech.edu
Gradescope: https://www.gradescope.com/courses/78537
Canvas: https://gatech.instructure.com/courses/94450/
Course Access Code (Piazza): MWXKY8
http://cs231n.github.io/python-numpy-tutorial/
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– What changed to enable DL
Learning a hierarchical (compositional) function in an end-to-end manner:
– Cascade of non-linear transformations
– Multiple layers of representations
– Learning (goal-driven) representations
– Learning to do feature extraction
– No single neuron “encodes” everything
– Groups of neurons work together
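As a toy illustration of such a cascade, here is a minimal two-layer forward pass in numpy; the shapes and values are purely illustrative, not from the lecture:

```python
import numpy as np

# A cascade of non-linear transformations: h(x) = f2(f1(x)).
rng = np.random.default_rng(0)
x = rng.random(8)                       # input features (illustrative)
W1 = rng.normal(size=(16, 8))           # layer-1 weights
W2 = rng.normal(size=(4, 16))           # layer-2 weights

h1 = np.maximum(0, W1 @ x)              # layer 1: linear map + ReLU
out = np.maximum(0, W2 @ h1)            # layer 2: another non-linear transformation
print(out.shape)                        # (4,)
```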
Large-scale data and processing to enable depth and feature learning
– Combined with specialized hardware (GPUs) and open-source distribution (arXiv, GitHub)
– If your model is poor, so will your results be
– Getting things to work still requires some finesse
– Still have to guard against overfitting (very complex functions!)
– Still have to tune hyper-parameters
– Still have to design neural network architectures
– Lots of research to automate this too, e.g. via reinforcement learning!
– Depth >= 3: most losses are non-convex in the parameters
– Theoretically, all bets are off
– Leads to stochasticity
– “Yes, but all interesting learning problems are non-convex”
– For example, human learning
– “Yes, but it often works!”
– Hard to track down what’s failing
– Pipeline systems have “oracle” performances at each step
– In end-to-end systems, it’s hard to know why things are not working
End-to-End Pipeline [Fang et al. CVPR15] [Vinyals et al. CVPR15]
– Tricks of the trade: visualize features, add losses at different layers, pre-train to avoid degenerate initializations…
– “We’re working on it”
– “Yes, but it often works!”
– Direct consequence of stochasticity & non-convexity
– It’s getting much better
– Standard toolkits/libraries/frameworks now available
– Caffe, Theano, (Py)Torch
– “Yes, but it often works!”
Image Classification: A core task in Computer Vision
(assume a given set of discrete labels) {dog, cat, truck, plane, ...}
[Figure: an input image of a cat is mapped to the label “cat”]
This image by Nikita is licensed under CC-BY 2.0
The Problem: Semantic Gap
What the computer sees: an image is just a big grid of numbers in [0, 255], e.g. 800 x 600 x 3 (3 RGB channels)
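For instance, a sketch in numpy (random pixel values stand in for a real image):

```python
import numpy as np

# An 800 x 600 RGB image is just a grid of integers in [0, 255]
# (random values here; a real image would come from an image loader).
img = np.random.randint(0, 256, size=(600, 800, 3), dtype=np.uint8)
print(img.shape)   # (600, 800, 3)
print(img[0, 0])   # RGB values of the top-left pixel
```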
Challenges: Viewpoint variation
All pixels change when the camera moves!
Challenges: Illumination
These images are CC0 1.0 public domain
Challenges: Deformation
Images by Umberto Salvagnin, Tom Thai, and sare bear, licensed under CC-BY 2.0
Challenges: Occlusion
One image by jonsson, licensed under CC-BY 2.0; the others are CC0 1.0 public domain
Challenges: Background Clutter
These images are CC0 1.0 public domain
Unlike e.g. sorting a list of numbers, there is no obvious way to hard-code the algorithm for recognizing a cat, or other classes.
John Canny, “A Computational Approach to Edge Detection”, IEEE TPAMI 1986
Find edges → Find corners
Example training set
Supervised Learning:
– Input: x (images, text, emails…)
– Output: y (spam or non-spam…)
– (Unknown) Target Function f: X → Y (the “true” mapping / reality)
– Data: (x1, y1), (x2, y2), …, (xN, yN)
– Model / Hypothesis Class h: X → Y, e.g. y = h(x) = sign(wᵀx)
– Learning: find the best h in the model class.
– Training Data: { (x, y) } → f (Learning)
– Test Data: x → f(x) (Apply function, evaluate error)
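A minimal sketch of this procedural view on synthetic data, assuming the sign(wᵀx) hypothesis class from the previous slide; the least-squares fit is just one illustrative way to pick w:

```python
import numpy as np

# Procedural view on synthetic data: learn f from training pairs,
# then apply it to test inputs and evaluate the error.
rng = np.random.default_rng(0)
w_true = rng.normal(size=3)                      # the "true" mapping (unknown in practice)
X_train, X_test = rng.normal(size=(100, 3)), rng.normal(size=(20, 3))
y_train, y_test = np.sign(X_train @ w_true), np.sign(X_test @ w_true)

# Learning: fit w so that h(x) = sign(w^T x) matches the training data
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Apply function, evaluate error on the test data
y_pred = np.sign(X_test @ w)
print("test error:", np.mean(y_pred != y_test))
```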
– X and Y are random variables
– D = {(x1, y1), (x2, y2), …, (xN, yN)} ~ P(X, Y)
– Both training & testing data sampled IID from P(X, Y)
– Learn on the training set
– Have some hope of generalizing to the test set
Reality
[Figure: AlexNet: Input → 11x11 conv, 96 → Pool → 5x5 conv, 256 → Pool → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256 → Pool → FC 4096 → FC 4096 → FC 1000 → Softmax]
Reality
[Figure: Multi-class Logistic Regression: Input (H x W x 3) → FC → Softmax]
Reality
[Figure: VGG19: Input → stacks of 3x3 convs (64, 128, 256, 512) with pooling between stacks → FC 4096 → FC 4096 → FC 1000 → Softmax]
– Modeling error: you approximated reality with a model
– Estimation error: you tried to learn the model with finite data
– Optimization error: you were lazy and couldn’t/didn’t optimize to completion
– And sometimes, reality just sucks
If we have:
– Enough training data D,
– and H is not too complex,
– then we can probably generalize to unseen test data
Measures of model class complexity:
– Vapnik–Chervonenkis (VC) dimension
– Rademacher complexity
Nearest Neighbor classifier:
– Train: memorize all data and labels
– Predict: predict the label of the most similar training image
Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
10 classes, 50,000 training images, 10,000 testing images
Four things make a memory-based learner:
Slide Credit: Carlos Guestrin
1-Nearest Neighbour:
1. A distance metric: Euclidean (and others)
2. How many nearby neighbors to look at: 1
3. A weighting function (optional): unused
4. How to fit with the local points: just predict the same output as the nearest neighbour.
k-Nearest Neighbour:
1. A distance metric: Euclidean (and others)
2. How many nearby neighbors to look at: k
3. A weighting function (optional): unused
4. How to fit with the local points: just predict the average output among the k nearest neighbours.
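A minimal numpy sketch of such a k-NN learner (Euclidean distance, k neighbors, unweighted majority vote); the data and function name are illustrative:

```python
import numpy as np

# A minimal k-NN sketch (Euclidean distance, majority vote).
def knn_predict(X_train, y_train, X_test, k=3):
    # Pairwise L2 distances between every test and train point
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]      # k closest train indices
    votes = y_train[nearest]                        # their labels
    return np.array([np.bincount(v).argmax() for v in votes])

X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.2, 0.5], [5.1, 5.4]])))  # [0 1]
```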
[Figure: 1-nearest neighbour on a 1-D dataset (x vs y); here, the prediction is the closest datapoint. Figure Credit: Carlos Guestrin]
L1 distance: d1(I1, I2) = Σ_p |I1^p - I2^p| (add up the absolute pixel-wise differences)
Nearest Neighbor classifier
– Train: memorize training data
– Predict: for each test image, find the closest train image and predict the label of the nearest image
Q: With N examples, how fast are training and prediction?
A: Train O(1), predict O(N)
This is bad: we want classifiers that are fast at prediction; slow training is OK
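The numpy code on the original slide did not survive extraction; a sketch along the lines of the CS 231n version, using the L1 distance:

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # Train is O(1): just memorize all data and labels
        # (X is N x D with each row an example; y is a vector of N labels)
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # Predict is O(N) per test example: compare against every train row
        y_pred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)  # L1 distance
            y_pred[i] = self.ytr[np.argmin(distances)]              # nearest label
        return y_pred
```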
– http://vision.stanford.edu/teaching/cs231n-demos/knn/
– http://www.cs.technion.ac.il/~rani/LocBoost/
Does the model grow with the size of the training data?
– Yes = Non-Parametric Models
– No = Parametric Models
L1 (Manhattan) distance: d1(I1, I2) = Σ_p |I1^p - I2^p|
L2 (Euclidean) distance: d2(I1, I2) = sqrt(Σ_p (I1^p - I2^p)²)
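A quick numpy check of the two formulas on a pair of tiny example grids (values illustrative):

```python
import numpy as np

# L1 vs L2 distance between two tiny "images".
I1 = np.array([[56., 32.], [10., 18.]])
I2 = np.array([[10., 25.], [24., 10.]])

d1 = np.sum(np.abs(I1 - I2))           # L1: sum of absolute differences
d2 = np.sqrt(np.sum((I1 - I2) ** 2))   # L2: root of summed squared differences
print(d1, d2)
```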
[Figure: K = 1 nearest-neighbour decision boundaries under the L1 (Manhattan) vs the L2 (Euclidean) distance]
Other distance metrics: L1 (absolute) norm, L-infinity (max) norm, scaled Euclidean (L2), Mahalanobis
Slide by Andrew W. Moore
Other Metrics… (Stanfill+Waltz, Maes’ Ringo system…)
Choosing hyperparameters is very problem-dependent. Must try them all out and see what works best.
Idea #1: Choose hyperparameters that work best on the full dataset.
BAD: K = 1 always works perfectly on training data.
Idea #2: Split data into train and test; choose hyperparameters that work best on test data.
BAD: No idea how the algorithm will perform on new data.
Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test.
Better!
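A minimal sketch of Idea #3 on synthetic data; the 80/10/10 fractions are an illustrative choice, not from the slides:

```python
import numpy as np

# Idea #3 as code: an 80/10/10 train/val/test split.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 3072)), rng.integers(0, 10, size=1000)

idx = rng.permutation(len(X))
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
train, val, test = np.split(idx, [n_train, n_train + n_val])

# Tune hyperparameters on the val split; touch the test split only once, at the end
X_train, y_train = X[train], y[train]
X_val, y_val = X[val], y[val]
```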
Idea #4: Cross-Validation: split data into folds, try each fold as validation, and average the results.
[Figure: the dataset split into fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | test, with the validation fold rotating]
Useful for small datasets, but not used too frequently in deep learning.
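A sketch of Idea #4, reusing the knn_predict function sketched earlier; names and structure are illustrative:

```python
import numpy as np

# 5-fold cross-validation for choosing k.
def cross_validate(X, y, k_choices, num_folds=5):
    X_folds, y_folds = np.array_split(X, num_folds), np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation fold; the rest form the training set
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            preds = knn_predict(X_tr, y_tr, X_folds[i], k=k)
            accs.append(np.mean(preds == y_folds[i]))
        results[k] = np.mean(accs)     # average accuracy across folds
    return results
```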
Example of 5-fold cross-validation for the value of k. Each point is a single outcome; the line goes through the mean, and bars indicate the standard deviation. (It seems that k ~= 7 works best for this data.)
Hays and Efros, SIGGRAPH 2007
… 200 total
– No learning: most real work done during testing
– For every test sample, must search through the whole dataset: very slow!
– Must use tricks like approximate nearest neighbour search
– Sensitive to noisy features: distances overwhelmed by noisy features
– Curse of dimensionality: distances become meaningless in high dimensions
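A quick numpy illustration of the last point: for random points in the unit cube, the ratio of nearest to farthest distance approaches 1 as the dimension grows:

```python
import numpy as np

# Curse of dimensionality: as dimension grows, the nearest and farthest
# random points end up almost equally far away.
rng = np.random.default_rng(0)
for d in (2, 100, 1000):
    pts = rng.random((1000, d))
    dists = np.linalg.norm(pts[1:] - pts[0], axis=1)
    print(d, dists.min() / dists.max())   # ratio creeps toward 1 as d grows
```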
k-Nearest Neighbor on images: never used.
[Figure: Original | Boxed | Shifted | Tinted; all 3 modified images have the same L2 distance to the one on the left]
Original image is CC0 public domain
k-Nearest Neighbor on images: never used. Curse of dimensionality:
[Figure: Dimensions = 1, Points = 4; Dimensions = 2, Points = 4²; Dimensions = 3, Points = 4³]
Figure Credit: Kevin Murphy
Summary:
– In image classification we start with a training set of images and labels, and must predict labels on the test set
– The k-Nearest Neighbors classifier predicts labels based on the nearest training examples
– Distance metric and K are hyperparameters
– Choose hyperparameters using the validation set; only run on the test set once, at the very end!
Linear classifiers: the building blocks of neural networks
[Figure: a typical ConvNet: Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → 4096-dim feature]
[Figure: Visual Question Answering: the Image goes through an Embedding (VGGNet) and the Question (“How many horses are in this image?”) through an Embedding (LSTM); both feed a Neural Network with a Softmax output]
Recall CIFAR10: 50,000 training images (each 32x32x3), 10,000 test images.
Parametric Approach: Linear Classifier
Image x: array of 32x32x3 numbers (3072 numbers total)
f(x, W) = Wx + b → 10 numbers giving class scores
W: parameters or weights (10x3072); x: 3072x1; b: 10x1; scores: 10x1
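A minimal sketch of this score function at CIFAR10 scale, with random values standing in for learned weights:

```python
import numpy as np

# The linear score function f(x, W) = Wx + b at CIFAR10 scale.
x = np.random.rand(3072)                  # 32x32x3 image stretched into a column
W = 0.01 * np.random.randn(10, 3072)      # weights: one row per class
b = np.zeros(10)                          # one bias per class

scores = W @ x + b                        # 10 numbers giving class scores
print(scores.shape, scores.argmax())      # (10,) and the top-scoring class
```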
Reality
[Figure: AlexNet: Input → 11x11 conv, 96 → Pool → 5x5 conv, 256 → Pool → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256 → Pool → FC 4096 → FC 4096 → FC 1000 → Softmax]
Reality
[Figure: Multi-class Logistic Regression: Input (H x W x 3) → FC → Softmax]
Example with an image with 4 pixels, and 3 classes (cat/dog/ship)
Algebraic Viewpoint: f(x, W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]ᵀ
W = [[0.2, -0.5, 0.1, 2.0], [1.5, 1.3, 2.1, 0.0], [0.0, 0.25, 0.2, -0.3]], b = [1.1, 3.2, -1.2]ᵀ
Scores = Wx + b: cat score -96.8, dog score 437.9, ship score 61.95
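The same example in numpy, as a sanity check. The negative entries of W and b and the cat score did not survive extraction and are reconstructed here from the standard CS 231n figure, so treat them as assumptions:

```python
import numpy as np

# The 4-pixel example. Negative entries of W, b, and the cat score are
# reconstructed (see lead-in above), so treat them as assumptions.
W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])
x = np.array([56., 231., 24., 2.])        # pixels stretched into a column
b = np.array([1.1, 3.2, -1.2])

print(W @ x + b)                          # cat/dog/ship scores; dog = 437.9
```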
[Figure: a linear classifier on an array of 32x32x3 numbers (3072 numbers total). Cat image by Nikita is licensed under CC-BY 2.0; plot created using Wolfram Cloud]
Hard cases for a linear classifier:
– Class 1: first and third quadrants; Class 2: second and fourth quadrants
– Class 1: 1 <= L2 norm <= 2; Class 2: everything else
– Class 1: three modes; Class 2: everything else
f(x, W) = Wx, three viewpoints:
– Algebraic Viewpoint: f(x, W) = Wx
– Visual Viewpoint: one template per class
– Geometric Viewpoint: hyperplanes cutting up space
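A small sketch of the visual viewpoint with a random W: reshaping each row of W to image size gives one template per class, and the score measures how well the template matches the image:

```python
import numpy as np

# Visual viewpoint: each row of W, reshaped to image size, is a class
# template; the score is the template-image match (random W here).
W = 0.01 * np.random.randn(10, 3072)
x = np.random.rand(3072)

templates = W.reshape(10, 32, 32, 3)   # one 32x32x3 "template" per class
scores = W @ x                         # score_i = dot(template_i, image)
print(scores.argmax())                 # class whose template matches best
```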
Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain
f(x, W) = Wx + b
Example class scores for 3 images for some W: how can we tell whether this W is good or bad?
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).