 
An Introduction to Optimization Methods in Deep Learning
Yuan YAO, HKUST
Acknowledgement
- Fei-Fei Li, Stanford cs231n
- Ruder, Sebastian (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
- http://ruder.io/deep-learning-optimization-2017/
Image Classification
Example Dataset: Fashion MNIST
- 28x28 grayscale images
- 60,000 training and 10,000 test examples
- 10 classes
Example Dataset: CIFAR10
- 10 classes
- 50,000 training images
- 10,000 testing images
Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
Jason WU, Peng XU, and Nayeon LEE
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 2 - 18 April 6, 2017
The Challenge of Human-Instructing-Computers
The Problem: Semantic Gap
What the computer sees: an image is just a big grid of numbers in [0, 255], e.g. 800 x 600 x 3 (3 channels RGB).
[Figure: image by Nikita, licensed under CC-BY 2.0]
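To make the "grid of numbers" concrete, here is a minimal NumPy sketch. The pixel values are random placeholders rather than a real photo, and the height x width x channel layout is an assumed (though common) convention, not something the slide specifies.

```python
import numpy as np

# Hypothetical stand-in for an 800 x 600 RGB photo:
# 600 rows (height), 800 columns (width), 3 channels (R, G, B).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)

print(image.shape)               # layout: (height, width, channels)
print(image.size)                # total count of numbers the computer "sees"
print(image.min(), image.max())  # every entry lies in [0, 255]
```

From the classifier's point of view, the whole image is nothing more than these 600 * 800 * 3 integers.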
Complex Invariance
Challenges: Viewpoint variation (Euclidean transform)
All pixels change when the camera moves!
[Figure: image by Nikita, licensed under CC-BY 2.0]
Challenges: Deformation (large scale deformation)
[Figures: images by Tom Thai, sare bear, and Umberto Salvagnin (two images), licensed under CC-BY 2.0]
Complex Invariance
Challenges: Background Clutter
Challenges: Illumination
Challenges: Occlusion
Challenges: Intraclass variation
[Figures: images in the CC0 1.0 public domain; one image by jonsson, licensed under CC-BY 2.0]
Data Driven Learning of the invariants: linear discriminant/classification
Recall from last time: Linear Classifier
    f(x, W) = Wx + b
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 3 - 7 April 11, 2017
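A minimal sketch of the linear classifier f(x, W) = Wx + b in NumPy. The sizes (a flattened 28 x 28 image, 10 classes), the random initialization, and the helper name linear_scores are illustrative assumptions, not part of the slides.

```python
import numpy as np

def linear_scores(x, W, b):
    """Return the class scores f(x, W) = W x + b for one flattened image x."""
    return W @ x + b

# Hypothetical sizes: a 28 x 28 grayscale image (784 pixels) and 10 classes.
rng = np.random.default_rng(0)
x = rng.random(784)                        # flattened image, values in [0, 1)
W = 0.01 * rng.standard_normal((10, 784))  # one row of weights per class
b = np.zeros(10)                           # one bias per class

scores = linear_scores(x, W, b)
predicted_class = int(np.argmax(scores))   # class with the highest score
print(scores.shape, predicted_class)
```

Each row of W acts as a template for one class; the prediction is the class whose template scores highest.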
(Empirical) Loss or Risk Function
Suppose: 3 training examples, 3 classes. With some W the scores are:

               cat image   car image   frog image
  cat score       3.2         1.3         2.2
  car score       5.1         4.9         2.5
  frog score     -1.7         2.0        -3.1

A loss function tells how good our current classifier is.
Given a dataset of examples
    \{(x_i, y_i)\}_{i=1}^{N}
where x_i is the image and y_i is the (integer) label.
Loss over the dataset is a sum of the losses over examples, normalized by N:
    L = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)
Multiclass SVM loss ("Hinge loss")
Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:
    L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)
Suppose: 3 training examples, 3 classes. With some W the scores are:

               cat image   car image   frog image
  cat score       3.2         1.3         2.2
  car score       5.1         4.9         2.5
  frog score     -1.7         2.0        -3.1
  Losses:         2.9         0          12.9

Loss over full dataset is average:
    L = (2.9 + 0 + 12.9)/3 = 5.27
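The worked example can be checked with a short NumPy sketch. The function name multiclass_svm_loss and the row-per-example layout are assumptions made for illustration; the scores are the ones from the slide, transposed so that each row holds one image's class scores.

```python
import numpy as np

def multiclass_svm_loss(scores, labels, margin=1.0):
    """Per-example hinge loss: sum over j != y_i of max(0, s_j - s_{y_i} + margin).

    scores: (N, C) array, one row of class scores per example.
    labels: (N,) integer array of correct class indices.
    """
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]   # s_{y_i} as a column
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(n), labels] = 0.0               # skip the j == y_i term
    return margins.sum(axis=1)

# Rows: cat, car, frog images. Columns: cat, car, frog scores.
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
labels = np.array([0, 1, 2])

losses = multiclass_svm_loss(scores, labels)
print(np.round(losses, 2))       # per-example losses, about 2.9, 0 and 12.9
print(round(losses.mean(), 2))   # dataset loss, about 5.27
```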
Cross Entropy (Negative Log-likelihood) Loss
Softmax Classifier (Multinomial Logistic Regression)
    P(Y = k | X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}},   L_i = -\log P(Y = y_i | X = x_i)

          unnormalized log       unnormalized
          probabilities (s)      probabilities (exp)    probabilities (normalize)
  cat          3.2                   24.5                   0.13
  car          5.1                  164.0                   0.87
  frog        -1.7                    0.18                  0.00

For the cat image (correct class: cat): L_i = -log(0.13) = 0.89. This value uses the base-10 logarithm; with the natural logarithm, -ln(0.13) = 2.04.
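A minimal NumPy sketch of the exp-then-normalize pipeline for the cat example. Subtracting the maximum score before exponentiating is a standard numerical-stability trick added here as an assumption; it does not change the resulting probabilities. Both logarithm bases are computed so the slide's 0.89 can be compared with the natural-log value.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # stability shift; result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([3.2, 5.1, -1.7])    # cat, car, frog scores for the cat image
probs = softmax(scores)

loss_nat = -np.log(probs[0])      # cross-entropy with natural log, about 2.04
loss_log10 = -np.log10(probs[0])  # base-10 version, about 0.89

print(np.round(probs, 2))
print(round(loss_nat, 2), round(loss_log10, 2))
```

Deep learning libraries conventionally use the natural logarithm for this loss, so the value to expect in practice is the larger one.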
Loss + Regularization
    L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)
Data loss: Model predictions should match training data.
Regularization: Model should be “simple”, so it works on test data.
Occam’s Razor: “Among competing hypotheses, the simplest is the best” (William of Ockham, 1285 - 1347)
Regularizations
λ = regularization strength (hyperparameter) in the regularized loss
In common use:
- Explicit regularization
  - L2-regularization
  - L1-regularization (Lasso)
  - Elastic-net (L1+L2)
  - Max-norm regularization (might see later)
- Implicit regularization
  - Dropout (will see later)
  - Batch-normalization
  - Early stopping
- Fancier: Batch normalization, stochastic depth
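The explicit penalties can be sketched in a few lines of NumPy. The function names, the weighting beta in the elastic-net term, and the example values of W, the data loss, and the regularization strength are all illustrative assumptions; the slides name the penalties but do not give these specifics.

```python
import numpy as np

def l2_penalty(W):
    """L2 regularization: sum of squared weights."""
    return np.sum(W ** 2)

def l1_penalty(W):
    """L1 regularization (Lasso): sum of absolute weights."""
    return np.sum(np.abs(W))

def elastic_net_penalty(W, beta=0.5):
    """Elastic net: a weighted mix of the L2 and L1 penalties (beta is assumed)."""
    return beta * l2_penalty(W) + l1_penalty(W)

def regularized_loss(data_loss, W, lam, penalty=l2_penalty):
    """Total objective: data loss plus lam times the chosen penalty R(W)."""
    return data_loss + lam * penalty(W)

# Hypothetical weights and data loss, chosen only to exercise the functions.
W = np.array([[1.0, -2.0],
              [0.5,  0.0]])
total = regularized_loss(data_loss=5.27, W=W, lam=0.1)
print(l2_penalty(W), l1_penalty(W), elastic_net_penalty(W), total)
```

A larger regularization strength pushes the optimizer toward smaller weights at the expense of fitting the training data less closely.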