CS6501: Deep Learning for Visual Recognition
Softmax Classifier + SGD
Today’s Class
- Intro to Machine Learning
  - What is Machine Learning?
  - Supervised Learning: Classification with k-nearest neighbors
  - Unsupervised Learning: Clustering with k-means clustering
- Softmax Classifier
- Stochastic Gradient Descent
- Regularization
Teaching Assistants
- Paola Cascante-Bonilla (pc9za@virginia.edu), Office Hours: Fridays 2 to 4pm (Rice 442)
- Ziyan Yang (tw8cb@virginia.edu), Office Hours: Thursdays 3 to 5pm (Rice 442)
Also…
- Assignment 2 will be released between today and tomorrow.
- Subscribe to and check Piazza regularly; important information about assignments will go there. Please use Piazza.
Machine Learning
- Machine learning is the subfield of computer science that gives "computers the ability to learn without being explicitly programmed" (term coined by Arthur Samuel in 1959 while at IBM).
- The study of algorithms that can learn from data.
- In contrast to previous Artificial Intelligence systems based on logic, e.g. "Expert Systems".
Supervised Learning vs Unsupervised Learning
[figure: images of cats, dogs, and bears; with labels in the supervised case, without labels in the unsupervised case]
Supervised learning learns a mapping $x \rightarrow y$ from labeled examples; unsupervised learning only sees the inputs $x$.
Supervised: Classification. Unsupervised: Clustering.
Supervised Learning Examples
- Classification (e.g. cat = $f(\text{image})$)
- Face Detection
- Language Parsing
- Structured Prediction
In each case we learn a function $f$ that maps an input to the desired output.
Supervised Learning – k-Nearest Neighbors
[figure: with k = 3, a query image whose nearest neighbors are cat, cat, dog is classified as cat]
[figure: with k = 3, a query image whose nearest neighbors are bear, dog, dog is classified as dog]
Supervised Learning – k-Nearest Neighbors
- How do we choose the right k?
- How do we choose the right features?
- How do we choose the right distance metric?
Answer: just choose the combination that works best! BUT not on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set").
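To make this recipe concrete, here is a minimal k-NN sketch using scikit-learn (assumed available); the toy arrays and the candidate values of k are placeholders, not course data:

```python
# A minimal k-NN sketch: tune k on a validation split, never on test data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(300, 4)            # toy features (assumption)
y = np.random.randint(0, 3, 300)      # toy labels: 0=cat, 1=dog, 2=bear

# Hold out a validation set from the training data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# Try several k on the validation set and keep the best.
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    print(k, knn.score(X_val, y_val))
```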
Training, Validation (Dev), Test Sets
Training Set | Validation Set | Testing Set
- The training and validation sets are used during development.
- The testing set is only to be used for evaluating the model at the very end of development. Any change to the model made after running it on the test set could be influenced by what you saw happen on the test set, which would invalidate any future evaluation.
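One possible way to carve out the three sets, sketched with scikit-learn's train_test_split (the 60/20/20 proportions are just an example, not a course requirement):

```python
# Split once for test, then again for validation: 60% train, 20% val, 20% test.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 4), np.random.randint(0, 3, 1000)  # toy data

X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)
```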
Unsupervised Learning – k-means clustering (k = 3)
1. Initially assign all images to a random cluster.
2. Compute the mean image (in feature space) for each cluster.
3. Reassign images to clusters based on similarity to the cluster means.
4. Keep repeating steps 2 and 3 until convergence.
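A minimal numpy sketch of these four steps, assuming each row of X is one image's feature vector (a robust implementation would also handle clusters that become empty):

```python
import numpy as np

def kmeans(X, k=3, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initially assign all points to a random cluster.
    assign = rng.integers(0, k, size=len(X))
    for _ in range(n_iters):
        # 2. Compute the mean (in feature space) of each cluster.
        means = np.stack([X[assign == c].mean(axis=0) for c in range(k)])
        # 3. Reassign points to the nearest cluster mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 4. Repeat until convergence (assignments stop changing).
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, means
```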
Unsupervised Learning – k-means clustering
- How do we choose the right k?
- How do we choose the right features?
- How do we choose the right distance metric?
- How sensitive is this method to the random initial assignment of clusters?
Answer: just choose the combination that works best! BUT not on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set").
Supervised Learning - Classification
[figure: labeled training images (cat, cat, dog, bear, ...) and held-out test images]
Each training example is an input $x_i$ (a feature vector extracted from the image) paired with a target $y_i$ (its label).
Supervised Learning - Classification
Training Data
inputs: $x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}]$, $x_2 = [x_{21}\ x_{22}\ x_{23}\ x_{24}]$, $x_3 = [x_{31}\ x_{32}\ x_{33}\ x_{34}]$, $x_4 = [x_{41}\ x_{42}\ x_{43}\ x_{44}]$, ...
targets / labels / ground truth: $y_1 = 1$, $y_2 = 1$, $y_3 = 2$, $y_4 = 3$
predictions: $\hat{y}_1 = 1$, $\hat{y}_2 = 2$, $\hat{y}_3 = 2$, $\hat{y}_4 = 1$
Model: $\hat{y}_i = f(x_i; \theta)$
We need a function that maps any $x$ to its $y$. How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the following quantity small:
$$\sum_{i=1}^{n} \text{Cost}(\hat{y}_i, y_i)$$
Supervised Learning – Linear Softmax
Training Data
inputs: $x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}]$, $x_2 = [x_{21}\ x_{22}\ x_{23}\ x_{24}]$, $x_3 = [x_{31}\ x_{32}\ x_{33}\ x_{34}]$, $x_4 = [x_{41}\ x_{42}\ x_{43}\ x_{44}]$, ...
targets / labels / ground truth: $y_1 = [1\ 0\ 0]$, $y_2 = [1\ 0\ 0]$, $y_3 = [0\ 1\ 0]$, $y_4 = [0\ 0\ 1]$
predictions: $\hat{y}_1 = [0.85\ 0.10\ 0.05]$, $\hat{y}_2 = [0.40\ 0.45\ 0.15]$, $\hat{y}_3 = [0.20\ 0.70\ 0.10]$, $\hat{y}_4 = [0.40\ 0.25\ 0.35]$
The class labels are now encoded as one-hot vectors, and the model outputs a probability for each class.
Supervised Learning – Linear Softmax
$y_1 = [1\ 0\ 0] \qquad x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}] \qquad \hat{y}_1 = [\hat{y}_c\ \hat{y}_d\ \hat{y}_b]$
$g_c = w_{c1}x_{11} + w_{c2}x_{12} + w_{c3}x_{13} + w_{c4}x_{14} + b_c$
$g_d = w_{d1}x_{11} + w_{d2}x_{12} + w_{d3}x_{13} + w_{d4}x_{14} + b_d$
$g_b = w_{b1}x_{11} + w_{b2}x_{12} + w_{b3}x_{13} + w_{b4}x_{14} + b_b$
$\hat{y}_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$\hat{y}_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$\hat{y}_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
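A numpy sketch of this forward pass; the max-subtraction inside the exponential is the usual numerical-stability trick, not shown on the slide:

```python
import numpy as np

def softmax_forward(x, W, b):
    # x: 4 features; W: 3x4, one row per class (cat, dog, bear); b: 3 biases.
    g = W @ x + b                # the linear scores g_c, g_d, g_b
    e = np.exp(g - g.max())      # subtract max for numerical stability
    return e / e.sum()           # probabilities summing to 1

x = np.array([0.2, 0.5, 0.1, 0.9])   # toy input (assumption)
W = 0.01 * np.random.randn(3, 4)
b = np.zeros(3)
print(softmax_forward(x, W, b))      # roughly uniform at initialization
```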
How do we find a good w and b?
$y_1 = [1\ 0\ 0] \qquad x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}] \qquad \hat{y}_1 = [\hat{y}_c(w,b)\ \hat{y}_d(w,b)\ \hat{y}_b(w,b)]$
We need to find $w$ and $b$ that minimize the following:
$$L(w, b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j}\log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,label}) = \sum_{i=1}^{n} -\log \hat{y}_{i,label}(w, b)$$
Why are these equal? Because each $y_i$ is one-hot, only the term for the true label survives in the inner sum.
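The same loss in numpy, exploiting the one-hot structure; the probabilities and toy labels below are assumptions shaped like the earlier example:

```python
import numpy as np

def cross_entropy(Y_hat, labels):
    # Y_hat: (n, 3) predicted probabilities; labels: (n,) integer class ids.
    # Only the true-label probability contributes, since y_i is one-hot.
    n = len(labels)
    return -np.log(Y_hat[np.arange(n), labels]).sum()

Y_hat = np.array([[0.85, 0.10, 0.05],
                  [0.40, 0.45, 0.15]])
print(cross_entropy(Y_hat, np.array([0, 1])))   # -log(0.85) - log(0.45)
```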
Gradient Descent (GD)
$$L(w, b) = \sum_{i=1}^{n} -\log \hat{y}_{i,label}(w, b)$$
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: ∂L(w,b)/∂w and ∂L(w,b)/∂b
    Update w: w = w − λ ∂L(w,b)/∂w
    Update b: b = b − λ ∂L(w,b)/∂b
    Print: L(w,b) // useful to see if this is becoming smaller or not
end
Gradient Descent (GD) (idea)
[figure: the loss curve $L(w)$ plotted against $w$]
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the current point (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda * (dL/dw).
Repeating steps 2 and 3 moves w downhill along the curve: w = 12, then w = 10, then w = 8, and so on.
Our function L(w)
$$L(w) = 3 + (1 - w)^2$$
Compare with our real loss:
$$L(w, b) = \sum_{i=1}^{n} -\log \hat{y}_{i,label}(w, b)$$
With 12 weights, the loss is a function of all of them:
$$L(w_1, w_2, \dots, w_{12}) = -\mathrm{logsoftmax}_{label_1}\,g(w_1, \dots, w_{12}, x_1) - \mathrm{logsoftmax}_{label_2}\,g(w_1, \dots, w_{12}, x_2) - \dots - \mathrm{logsoftmax}_{label_n}\,g(w_1, \dots, w_{12}, x_n)$$
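A few lines of Python running gradient descent on the toy loss $L(w) = 3 + (1-w)^2$, whose derivative is $dL/dw = -2(1-w)$ and whose minimum is at $w = 1$:

```python
lam = 0.01                       # step size (lambda on the slides)
w = 12.0                         # 1. start with a value of w
for step in range(1000):
    grad = -2.0 * (1.0 - w)      # 2. compute dL/dw at the current w
    w = w - lam * grad           # 3. recompute w
print(w)                         # very close to 1.0
```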
Gradient Descent (GD)
$$L(w, b) = \sum_{i=1}^{n} -\log \hat{y}_{i,label}(w, b)$$
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: ∂L(w,b)/∂w and ∂L(w,b)/∂b   // expensive: uses the entire training set
    Update w: w = w − λ ∂L(w,b)/∂w
    Update b: b = b − λ ∂L(w,b)/∂b
    Print: L(w,b) // useful to see if this is becoming smaller or not
end
(mini-batch) Stochastic Gradient Descent (SGD)
$$L(w, b) = \sum_{i \in B} -\log \hat{y}_{i,label}(w, b)$$
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w,b)/∂w and ∂L(w,b)/∂b   // on the mini-batch B only
        Update w: w = w − λ ∂L(w,b)/∂w
        Update b: b = b − λ ∂L(w,b)/∂b
        Print: L(w,b) // useful to see if this is becoming smaller or not
    end
end
Source: Andrew Ng
(mini-batch) Stochastic Gradient Descent (SGD)
The same procedure with |B| = 1, i.e. a mini-batch containing a single example, is classic stochastic gradient descent.
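Putting the pieces together, a minimal numpy sketch of this mini-batch SGD loop for the linear softmax model; the toy data and shapes are assumptions, and the gradient formula is the one derived on the slides that follow:

```python
import numpy as np

def softmax(G):
    E = np.exp(G - G.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.random((600, 4))                      # toy features (assumption)
labels = rng.integers(0, 3, 600)              # toy labels
W, b = 0.01 * rng.standard_normal((3, 4)), np.zeros(3)
lam, batch_size = 0.01, 32

for e in range(100):                          # for e = 0, num_epochs
    for s in range(0, len(X), batch_size):    # for b = 0, num_batches
        xb, yb = X[s:s+batch_size], labels[s:s+batch_size]
        Y_hat = softmax(xb @ W.T + b)
        Y = np.eye(3)[yb]                     # one-hot targets
        dG = (Y_hat - Y) / len(xb)            # see the gradient derivation below
        W -= lam * dG.T @ xb                  # w = w - lambda dL/dw
        b -= lam * dG.sum(axis=0)             # b = b - lambda dL/db
```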
Computing Analytic Gradients
This is what we have:
$$\ell = -\log \hat{y}_{label}, \qquad \hat{y} = \mathrm{softmax}(g), \qquad g_i = (w_{i,1}x_1 + w_{i,2}x_2 + w_{i,3}x_3 + w_{i,4}x_4) + b_i$$
This is what we need: $\partial\ell/\partial w_{i,j}$ for each $w_{i,j}$, and $\partial\ell/\partial b_i$ for each $b_i$.
Step 1: Chain Rule of Calculus
$$\frac{\partial\ell}{\partial w_{i,j}} = \frac{\partial\ell}{\partial g_i}\,\frac{\partial g_i}{\partial w_{i,j}}, \qquad \frac{\partial\ell}{\partial b_i} = \frac{\partial\ell}{\partial g_i}\,\frac{\partial g_i}{\partial b_i}$$
Let's do the derivatives of $g_i$ first:
$$\frac{\partial g_i}{\partial w_{i,3}} = \frac{\partial}{\partial w_{i,3}}\big[(w_{i,1}x_1 + w_{i,2}x_2 + w_{i,3}x_3 + w_{i,4}x_4) + b_i\big] = x_3, \qquad \text{and in general} \quad \frac{\partial g_i}{\partial w_{i,j}} = x_j$$
$$\frac{\partial g_i}{\partial b_i} = \frac{\partial}{\partial b_i}\big[(w_{i,1}x_1 + w_{i,2}x_2 + w_{i,3}x_3 + w_{i,4}x_4) + b_i\big] = 1$$
So far: $\partial g_i/\partial w_{i,j} = x_j$ and $\partial g_i/\partial b_i = 1$. Now let's do the remaining factor $\partial\ell/\partial g_i$ (the same for both updates!).
Computing Analytic Gradients
In our cat, dog, bear classification example, i = {0, 1, 2}. Let's say label = 1. We need:
$$\frac{\partial\ell}{\partial g_0}, \qquad \frac{\partial\ell}{\partial g_1}, \qquad \frac{\partial\ell}{\partial g_2}$$
Remember this slide? $\ell = -\log \hat{y}_{label}$ with
$$\hat{y}_i = e^{g_i} \big/ (e^{g_0} + e^{g_1} + e^{g_2})$$
Differentiating through the softmax gives, for the classes other than the label,
$$\frac{\partial\ell}{\partial g_0} = \hat{y}_0, \qquad \frac{\partial\ell}{\partial g_2} = \hat{y}_2,$$
and for the label class (here label = 1),
$$\frac{\partial\ell}{\partial g_1} = \hat{y}_1 - 1.$$
Stacking the three:
$$\frac{\partial\ell}{\partial g} = \begin{bmatrix}\frac{\partial\ell}{\partial g_0} & \frac{\partial\ell}{\partial g_1} & \frac{\partial\ell}{\partial g_2}\end{bmatrix} = \begin{bmatrix}\hat{y}_0 & \hat{y}_1 - 1 & \hat{y}_2\end{bmatrix} = \hat{y} - y, \qquad \text{i.e.} \quad \frac{\partial\ell}{\partial g_i} = \hat{y}_i - y_i$$
Combining this with $\partial g_i/\partial w_{i,j} = x_j$ and $\partial g_i/\partial b_i = 1$ from before:
$$\frac{\partial\ell}{\partial w_{i,j}} = (\hat{y}_i - y_i)\,x_j, \qquad \frac{\partial\ell}{\partial b_i} = \hat{y}_i - y_i$$
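A quick numerical check of this result, a standard sanity-check practice not shown on the slides: compare the analytic gradient $(\hat{y} - y)x^\top$ against a centered finite difference on one entry of $W$.

```python
import numpy as np

def loss(W, b, x, label):
    g = W @ x + b
    y_hat = np.exp(g - g.max()); y_hat /= y_hat.sum()
    return -np.log(y_hat[label]), y_hat

rng = np.random.default_rng(1)
x, label = rng.random(4), 1                   # toy input (assumption)
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)

_, y_hat = loss(W, b, x, label)
y = np.eye(3)[label]
dW_analytic = np.outer(y_hat - y, x)          # (y_hat_i - y_i) * x_j

eps, i, j = 1e-6, 2, 3                        # check a single entry
Wp = W.copy(); Wp[i, j] += eps
Wm = W.copy(); Wm[i, j] -= eps
dW_numeric = (loss(Wp, b, x, label)[0] - loss(Wm, b, x, label)[0]) / (2 * eps)
print(dW_analytic[i, j], dW_numeric)          # should match closely
```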
Supervised Learning – Softmax Classifier
Extract features:
$$x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}]$$
Run features through the classifier:
$$g_c = w_{c1}x_{11} + w_{c2}x_{12} + w_{c3}x_{13} + w_{c4}x_{14} + b_c$$
$$g_d = w_{d1}x_{11} + w_{d2}x_{12} + w_{d3}x_{13} + w_{d4}x_{14} + b_d$$
$$g_b = w_{b1}x_{11} + w_{b2}x_{12} + w_{b3}x_{13} + w_{b4}x_{14} + b_b$$
$$\hat{y}_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b}), \quad \hat{y}_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b}), \quad \hat{y}_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$$
Get predictions:
$$f(x_1) = [\hat{y}_c\ \hat{y}_d\ \hat{y}_b]$$
More …
- Regularization
- Momentum updates
- Hinge Loss, Least Squares Loss, Logistic Regression Loss
Assignment 2 – Linear Margin-Classifier
Training Data
inputs: $x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}]$, $x_2 = [x_{21}\ x_{22}\ x_{23}\ x_{24}]$, $x_3 = [x_{31}\ x_{32}\ x_{33}\ x_{34}]$, $x_4 = [x_{41}\ x_{42}\ x_{43}\ x_{44}]$, ...
targets / labels / ground truth: $y_1 = [1\ 0\ 0]$, $y_2 = [1\ 0\ 0]$, $y_3 = [0\ 1\ 0]$, $y_4 = [0\ 0\ 1]$
predictions: $\hat{y}_1 = [4.3\ {-1.3}\ 1.1]$, $\hat{y}_2 = [3.3\ 3.5\ 1.1]$, $\hat{y}_3 = [0.5\ 5.6\ {-4.2}]$, $\hat{y}_4 = [1.1\ {-5.3}\ {-9.4}]$
Unlike the softmax classifier, the predictions here are raw scores rather than probabilities.
Supervised Learning – Linear Margin-Classifier
$y_1 = [1\ 0\ 0] \qquad x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}] \qquad \hat{y}_1 = [\hat{y}_c\ \hat{y}_d\ \hat{y}_b]$
$\hat{y}_c = w_{c1}x_{11} + w_{c2}x_{12} + w_{c3}x_{13} + w_{c4}x_{14} + b_c$
$\hat{y}_d = w_{d1}x_{11} + w_{d2}x_{12} + w_{d3}x_{13} + w_{d4}x_{14} + b_d$
$\hat{y}_b = w_{b1}x_{11} + w_{b2}x_{12} + w_{b3}x_{13} + w_{b4}x_{14} + b_b$
Note that the linear scores are used directly as predictions; there is no softmax.
How do we find a good w and b?
$y_1 = [1\ 0\ 0] \qquad x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}] \qquad \hat{y}_1 = [\hat{y}_c(w,b)\ \hat{y}_d(w,b)\ \hat{y}_b(w,b)]$
We need to find $w$ and $b$ that minimize the following:
$$L(w, b) = \sum_{i=1}^{n} \sum_{j \neq label} \max(0,\ \hat{y}_{i,j} - \hat{y}_{i,label} + \Delta)$$
Why? Each term penalizes a wrong class whose score comes within a margin $\Delta$ of the true class's score; the loss is zero once every wrong class is beaten by at least $\Delta$.
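The inner sum in numpy, using the Assignment-2-style score vectors from the slide above ($\Delta = 1.0$ is a common choice, an assumption here):

```python
import numpy as np

def hinge_loss(scores, label, delta=1.0):
    # scores: raw class scores f(x_i); label: index of the true class.
    margins = np.maximum(0, scores - scores[label] + delta)
    margins[label] = 0                 # the sum skips j == label
    return margins.sum()

print(hinge_loss(np.array([4.3, -1.3, 1.1]), label=0))  # 0.0: margins satisfied
print(hinge_loss(np.array([3.3, 3.5, 1.1]), label=0))   # 1.2: dog score too close
```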
Regression vs Classification
Regression
- Labels are continuous variables, e.g. distance.
- Losses: distance-based losses, e.g. sum of distances to true values.
- Evaluation: mean distances, correlation coefficients, etc.
Classification
- Labels are discrete variables (1 out of K categories).
- Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
- Evaluation: classification accuracy, etc.
Linear Regression – 1 output, 1 input
[figure: data points $(x_1, y_1), (x_2, y_2), \dots, (x_8, y_8)$ in the x-y plane, fit with a line]
Model: $\hat{y} = wx + b$
Loss: $L(w, b) = \sum_{i=1}^{8} (\hat{y}_i - y_i)^2$
Quadratic Regression
[figure: the same data points fit with a quadratic curve]
Model: $\hat{y} = w_1 x^2 + w_2 x + b$
Loss: $L(w, b) = \sum_{i=1}^{8} (\hat{y}_i - y_i)^2$
n-polynomial Regression
[figure: the same data points fit with a degree-n polynomial]
Model: $\hat{y} = w_n x^n + \dots + w_1 x + b$
Loss: $L(w, b) = \sum_{i=1}^{8} (\hat{y}_i - y_i)^2$
Overfitting
- $f$ is linear: Loss(w) is high → Underfitting (High Bias)
- $f$ is cubic: Loss(w) is low → a good fit
- $f$ is a polynomial of degree 9: Loss(w) is zero! → Overfitting (High Variance)
Christopher M. Bishop – Pattern Recognition and Machine Learning
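A short numpy sketch reproducing this progression with np.polyfit; the toy data (a noisy sinusoid, echoing Bishop's example) is an assumption, and numpy may warn that the degree-9 fit is poorly conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)               # least-squares fit
    train_loss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_loss)  # degree 1: high; degree 3: low; degree 9: ~zero
```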
Regularization
- Large weights lead to large variance, i.e. the model fits the training data too strongly.
- Solution: minimize the loss but also try to keep the weight values small by doing the following:
$$\text{minimize} \quad L(w, b) + \lambda \sum_{j} |w_j|^2$$
The added term is the regularizer, e.g. the L2 regularizer shown here.
SGD with Regularization (L-2)
$$L'(w, b) = L(w, b) + \lambda \sum_j |w_j|^2$$
η = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w,b)/∂w and ∂L(w,b)/∂b
        Update w: w = w − η ∂L(w,b)/∂w − ηλw
        Update b: b = b − η ∂L(w,b)/∂b
        Print: L(w,b) // useful to see if this is becoming smaller or not
    end
end
Revisiting Another Problem with SGD
In the SGD loop above, the mini-batch gradients are only approximations to the true gradient of $L(w, b)$. This could lead to "un-learning" what has been learned in some previous steps of training.
Solution: Momentum Updates
Keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient:
γ = 0.9
global v
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w,b)/∂w
        Compute: v = γv + ∂L(w,b)/∂w + λw
        Update w: w = w − η v
        Print: L(w,b) // useful to see if this is becoming smaller or not
    end
end
More on Momentum
https://distill.pub/2017/momentum/
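The momentum update as a small Python helper; the toy quadratic and the argument names are assumptions, but the update itself is the one on the slide (v = γv + ∂L/∂w + λw, then w = w − ηv):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, gamma=0.9, reg=0.0):
    # Accumulate a weighted average of past gradients (plus the L2 term
    # reg*w from the slide), then step along the accumulator.
    v = gamma * v + grad + reg * w
    w = w - lr * v
    return w, v

w, v = np.array([12.0]), np.zeros(1)
for _ in range(300):
    grad = -2.0 * (1.0 - w)            # gradient of 3 + (1 - w)^2
    w, v = sgd_momentum_step(w, v, grad)
print(w)                               # close to 1.0
```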
Image Features: HoG
Paper by Navneet Dalal & Bill Triggs, presented at CVPR 2005 for detecting people. A scikit-image implementation is available. Images by Satya Mallick: https://www.learnopencv.com/histogram-of-oriented-gradients/
- Compute gradients: $g_x$, $g_y$, and the gradient magnitude $\sqrt{g_x^2 + g_y^2}$.
- Aggregate gradient magnitudes and directions in 8x8 pixel regions.
- Compute a histogram with 9 bins for angles from 0 to 180.
- Normalize histograms with respect to histograms of adjacent neighbors.
- The image (or image region) is represented by a vector containing all the histograms. In this case how long is that vector?
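A sketch of this pipeline using the scikit-image implementation mentioned above; the astronaut test image is just an assumed example input:

```python
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())
features, hog_image = hog(
    image,
    orientations=9,            # 9 angle bins from 0 to 180 degrees
    pixels_per_cell=(8, 8),    # aggregate gradients in 8x8 regions
    cells_per_block=(2, 2),    # normalize w.r.t. adjacent neighbors
    visualize=True,            # also return a visualization image
)
print(features.shape)          # one long vector of all the histograms
```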
Image Features: HoG
[figure: HoG pipeline with block normalization, from Zhuolin Jiang, Zhe Lin, Larry S. Davis, ICCV 2009, for human action recognition; HoG paper by Navneet Dalal & Bill Triggs, CVPR 2005, for detecting people]
[figure: extract SIFT feature descriptors, then compute histograms of features; slide by Fei-Fei Li]
Summary: Image Features
- Many other features proposed:
  - LBP (Local Binary Patterns): useful for recognizing faces.
  - Dense SIFT: SIFT features computed on a grid, similar to the HoG features.
  - etc.
- Largely replaced by neural networks.
- Still useful to study for inspiration in designing neural networks that compute features.
Questions?