CSC 411: Lecture 07: Multiclass Classification


  1. CSC 411: Lecture 07: Multiclass Classification. Class based on Raquel Urtasun & Rich Zemel's lectures. Sanja Fidler, University of Toronto. Feb 1, 2016.

  2. Today: multi-class classification with least-squares regression, logistic regression, K-NN, and decision trees.

  3. Discriminant Functions for K > 2 classes. First idea: use K − 1 classifiers, each solving a two-class problem of separating points in a class $C_k$ from points not in that class. Known as the 1-vs-all (or 1-vs-the-rest) classifier. PROBLEM: there is more than one good answer for the green region in the figure, i.e., the classification there is ambiguous!
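
A minimal numpy sketch of the 1-vs-rest idea (not from the lecture): for simplicity it fits one binary least-squares classifier per class rather than K − 1, and the toy data, helper name, and test point are illustrative assumptions. It shows how a point can be claimed by zero or several of the independent binary classifiers, which is exactly the ambiguity problem.

```python
# 1-vs-rest with independent binary least-squares classifiers (illustrative sketch).
import numpy as np

def fit_binary_ls(X, t):
    """Closed-form least-squares fit of [w0, w] for targets t in {-1, +1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias feature
    return np.linalg.lstsq(Xb, t, rcond=None)[0]

# Toy 2-D data: three Gaussian blobs, 30 points per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [3, 0], [0, 3]]), 30, axis=0)
y = np.repeat([0, 1, 2], 30)

# One binary "class k vs the rest" classifier per class.
W = np.stack([fit_binary_ls(X, np.where(y == k, 1.0, -1.0)) for k in range(3)])

x_new = np.array([1.5, 1.5])                        # a point near the class boundaries
claims = W @ np.concatenate([[1.0], x_new]) > 0
print(int(claims.sum()), "binary classifiers claim this point")  # 0 or >1 means ambiguity
```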

  4. Discriminant Functions for K > 2 classes. Another simple idea: introduce K(K − 1)/2 two-way classifiers, one for each possible pair of classes. Each point is classified according to a majority vote amongst the discriminant functions. Known as the 1-vs-1 classifier. PROBLEM: two-way preferences need not be transitive!
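
A similar illustrative sketch of the 1-vs-1 scheme, with K(K − 1)/2 pairwise least-squares classifiers and a majority vote; the toy data and helper are again assumptions:

```python
# 1-vs-1: K(K - 1)/2 pairwise least-squares classifiers plus a majority vote (illustrative).
import itertools
import numpy as np

def fit_binary_ls(X, t):
    """Closed-form least-squares fit of [w0, w] for targets t in {-1, +1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.lstsq(Xb, t, rcond=None)[0]

# Toy 2-D data: three Gaussian blobs, 30 points per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [3, 0], [0, 3]]), 30, axis=0)
y = np.repeat([0, 1, 2], 30)
K = 3

# One classifier per unordered pair (j, k), trained on only those two classes.
pair_models = {}
for j, k in itertools.combinations(range(K), 2):
    mask = (y == j) | (y == k)
    pair_models[(j, k)] = fit_binary_ls(X[mask], np.where(y[mask] == k, 1.0, -1.0))

def predict(x):
    votes = np.zeros(K, dtype=int)
    xb = np.concatenate([[1.0], x])
    for (j, k), w in pair_models.items():
        votes[k if w @ xb > 0 else j] += 1
    return int(np.argmax(votes))    # ties (non-transitive preferences) broken arbitrarily

print(predict(np.array([1.5, 1.5])))
```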

  5. K-Class Discriminant. We can avoid these problems by considering a single K-class discriminant comprising K functions of the form $y_k(x) = w_k^T x + w_{k,0}$ and then assigning a point $x$ to class $C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$. Note that $w_k$ is now a vector, not the k-th coordinate. The decision boundary between class $C_j$ and class $C_k$ is given by $y_j(x) = y_k(x)$, and thus it is a $(D - 1)$-dimensional hyperplane defined by $(w_k - w_j)^T x + (w_{k,0} - w_{j,0}) = 0$. What about the binary case? Is this different? What is the shape of the overall decision boundary?
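
A small sketch of the resulting decision rule, assuming the $\tilde{W}$ layout introduced on slide 10 (the k-th column is $[w_{k,0}, w_k^T]^T$); the weight values here are made up:

```python
import numpy as np

def predict_class(W_tilde, x):
    """Assign x to the class with the largest discriminant y_k(x) = w_k^T x + w_{k,0}.

    W_tilde: (D + 1, K) matrix whose k-th column is [w_{k,0}, w_k].
    """
    x_tilde = np.concatenate([[1.0], x])   # prepend 1 for the bias term
    y = W_tilde.T @ x_tilde                # y_k(x) for every class k
    return int(np.argmax(y))

# Made-up weights for K = 3 classes in D = 2 dimensions.
W_tilde = np.array([[0.0, 0.0, 0.0],
                    [1.0, -1.0, 0.0],
                    [0.0, 1.0, -1.0]])
print(predict_class(W_tilde, np.array([0.5, 2.0])))
```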

  6. K-Class Discriminant. The decision regions of such a discriminant are always singly connected and convex. In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object. Which object is convex?

  7. K-Class Discriminant. The decision regions of such a discriminant are always singly connected and convex. Consider two points $x_A$ and $x_B$ that lie inside decision region $R_k$. Any convex combination $\hat{x}$ of those points will also be in $R_k$: $\hat{x} = \lambda x_A + (1 - \lambda) x_B$.

  8. Proof. A convex combination point, i.e., $\lambda \in [0, 1]$: $\hat{x} = \lambda x_A + (1 - \lambda) x_B$. From the linearity of the classifier $y(x)$: $y_k(\hat{x}) = \lambda y_k(x_A) + (1 - \lambda) y_k(x_B)$. Since $x_A$ and $x_B$ are in $R_k$, it follows that $y_k(x_A) > y_j(x_A)$ and $y_k(x_B) > y_j(x_B)$ for all $j \neq k$. Since $\lambda$ and $1 - \lambda$ are non-negative, $y_k(\hat{x}) = \lambda y_k(x_A) + (1 - \lambda) y_k(x_B) > \lambda y_j(x_A) + (1 - \lambda) y_j(x_B) = y_j(\hat{x})$ for all $j \neq k$, so $\hat{x}$ is inside $R_k$. Thus $R_k$ is singly connected and convex.

  9. Example (figure).

  10. Multi-class Classification with Linear Regression. From before we have $y_k(x) = w_k^T x + w_{k,0}$, which can be rewritten as $y(x) = \tilde{W}^T \tilde{x}$, where the k-th column of $\tilde{W}$ is $[w_{k,0}, w_k^T]^T$ and $\tilde{x}$ is $[1, x^T]^T$. Training: how can I find the weights $\tilde{W}$ with the standard sum-of-squares regression loss? 1-of-K encoding: for multi-class problems (with K classes), instead of using $t = k$ (target has label k) we often use a 1-of-K encoding, i.e., a vector of K target values containing a single 1 for the correct class and zeros elsewhere. Example: for a 4-class problem, we would write a target with class label 2 as $t = [0, 1, 0, 0]^T$.
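
A tiny illustrative helper for the 1-of-K encoding (the function name and the 0-indexed labels are assumptions):

```python
import numpy as np

def one_hot(labels, K):
    """1-of-K encode integer labels (0-indexed) into an (N, K) target matrix T."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Class label 2 on the slide (1-indexed) is index 1 here, in a 4-class problem.
print(one_hot(np.array([1]), K=4))   # [[0. 1. 0. 0.]]
```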

  11. Multi-class Classification with Linear Regression. Sum-of-squares loss: $\ell(\tilde{W}) = \sum_{n=1}^{N} \| \tilde{W}^T \tilde{x}^{(n)} - t^{(n)} \|^2 = \| \tilde{X} \tilde{W} - T \|_F^2$, where the n-th row of $\tilde{X}$ is $[\tilde{x}^{(n)}]^T$ and the n-th row of $T$ is $[t^{(n)}]^T$. Setting the derivative with respect to $\tilde{W}$ to 0, we get $\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T$.
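
A sketch of this closed-form fit, using np.linalg.lstsq as a numerically safer stand-in for the explicit $(\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T$; the toy data is made up:

```python
import numpy as np

def fit_ls_multiclass(X, labels, K):
    """Least-squares weights W~ solving min ||X~ W~ - T||_F^2 with 1-of-K targets."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])          # prepend the bias feature
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1.0                      # 1-of-K targets
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)
    return W_tilde                                     # shape (D + 1, K)

# Toy 2-D data: three Gaussian blobs, 30 points per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [4, 0], [0, 4]]), 30, axis=0)
y = np.repeat([0, 1, 2], 30)
W_tilde = fit_ls_multiclass(X, y, K=3)
pred = np.argmax(np.hstack([np.ones((90, 1)), X]) @ W_tilde, axis=1)
print("training accuracy:", np.mean(pred == y))
```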

  12. Multi-class Logistic Regression. Associate a set of weights with each class, then use a normalized exponential output $p(C_k \mid x) = y_k(x) = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$, where the activations are given by $z_k = w_k^T x$. The function $\frac{\exp(z_k)}{\sum_j \exp(z_j)}$ is called the softmax function.
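
A minimal softmax sketch; subtracting the maximum activation is a standard numerical-stability trick, not something the slide specifies:

```python
import numpy as np

def softmax(z):
    """Normalized exponential: y_k = exp(z_k) / sum_j exp(z_j)."""
    z = z - np.max(z)            # stability shift; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())                # a probability vector over the K classes, summing to 1
```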

  13. Multi-class Logistic Regression. The likelihood: $p(T \mid w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid x^{(n)})^{t_k^{(n)}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_k(x^{(n)})^{t_k^{(n)}}$, with $p(C_k \mid x) = y_k(x) = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$, where the n-th row of $T$ is the 1-of-K encoding of example n and $z_k = w_k^T x + w_{k,0}$. What assumptions have I used to derive the likelihood? Derive the loss by computing the negative log-likelihood: $E(w_1, \ldots, w_K) = -\log p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_k^{(n)} \log y_k(x^{(n)})$. This is known as the cross-entropy error for multiclass classification. How do we obtain the weights?
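
A short sketch of the cross-entropy error as a function of a weight matrix $\tilde{W}$; the small epsilon and the zero-weight example are illustrative assumptions:

```python
import numpy as np

def cross_entropy(W_tilde, X_tilde, T, eps=1e-12):
    """E(w_1, ..., w_K) = - sum_n sum_k t_k^(n) log y_k(x^(n))."""
    Z = X_tilde @ W_tilde                                  # activations z_k for every example
    Z = Z - Z.max(axis=1, keepdims=True)                   # stability shift
    Y = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # softmax outputs y_k
    return -np.sum(T * np.log(Y + eps))

X_tilde = np.array([[1.0, 0.5, 2.0], [1.0, -1.0, 0.3]])    # rows are [1, x^T]
T = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])           # 1-of-K targets
print(cross_entropy(W_tilde=np.zeros((3, 3)), X_tilde=X_tilde, T=T))  # 2*log(3) at zero weights
```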

  14. Training Multi-class Logistic Regression. How do we obtain the weights? $E(w_1, \ldots, w_K) = -\log p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_k^{(n)} \log y_k(x^{(n)})$. Do gradient descent, where the derivatives are $\frac{\partial y_j^{(n)}}{\partial z_k^{(n)}} = \delta(k, j)\, y_j^{(n)} - y_j^{(n)} y_k^{(n)}$, $\frac{\partial E}{\partial z_k^{(n)}} = \sum_{j=1}^{K} \frac{\partial E}{\partial y_j^{(n)}} \cdot \frac{\partial y_j^{(n)}}{\partial z_k^{(n)}} = y_k^{(n)} - t_k^{(n)}$, and $\frac{\partial E}{\partial w_k} = \sum_{n=1}^{N} \sum_{j=1}^{K} \frac{\partial E}{\partial y_j^{(n)}} \cdot \frac{\partial y_j^{(n)}}{\partial z_k^{(n)}} \cdot \frac{\partial z_k^{(n)}}{\partial w_k} = \sum_{n=1}^{N} (y_k^{(n)} - t_k^{(n)}) \, x^{(n)}$. The derivative is the error times the input.
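
A hedged batch-gradient-descent sketch built on the gradient above, written in matrix form as $\tilde{X}^T (Y - T)$; the learning rate, iteration count, and toy data are assumptions:

```python
import numpy as np

def train_softmax_regression(X, labels, K, lr=0.1, iters=500):
    """Batch gradient descent on the cross-entropy; the gradient is X~^T (Y - T)."""
    N, D = X.shape
    X_tilde = np.hstack([np.ones((N, 1)), X])          # prepend the bias feature
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1.0                      # 1-of-K targets
    W = np.zeros((D + 1, K))
    for _ in range(iters):
        Z = X_tilde @ W
        Z -= Z.max(axis=1, keepdims=True)
        Y = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
        grad = X_tilde.T @ (Y - T)                     # "error times the input", summed over n
        W -= (lr / N) * grad
    return W

# Toy 2-D data: three Gaussian blobs, 30 points per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [4, 0], [0, 4]]), 30, axis=0)
y = np.repeat([0, 1, 2], 30)
W = train_softmax_regression(X, y, K=3)
pred = np.argmax(np.hstack([np.ones((90, 1)), X]) @ W, axis=1)
print("training accuracy:", np.mean(pred == y))
```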

  15. Softmax for 2 Classes. Let's write the probability of one of the classes: $p(C_1 \mid x) = y_1(x) = \frac{\exp(z_1)}{\sum_j \exp(z_j)} = \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)}$. I can equivalently write this as $p(C_1 \mid x) = y_1(x) = \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)} = \frac{1}{1 + \exp(-(z_1 - z_2))}$. So the logistic is just a special case that avoids using redundant parameters: rather than having two separate sets of weights for the two classes, combine them into one, $z' = z_1 - z_2 = w_1^T x - w_2^T x = w^T x$. The over-parameterization of the softmax is because the probabilities must add to 1.
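
A two-line numeric check of this identity (the activation values are arbitrary):

```python
import numpy as np

z1, z2 = 1.3, -0.4                                  # arbitrary activations for the two classes
softmax_p1 = np.exp(z1) / (np.exp(z1) + np.exp(z2))
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(z1 - z2)))
print(softmax_p1, sigmoid_p1)                       # identical up to floating-point error
```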

  16. Multi-class K-NN. K-NN can directly handle multi-class problems.
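
A minimal multi-class k-NN sketch with Euclidean distance and a majority vote over the k nearest labels; the toy data and the choice k = 5 are assumptions:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by a majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: three Gaussian blobs, 30 points per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [4, 0], [0, 4]]), 30, axis=0)
y = np.repeat([0, 1, 2], 30)
print(knn_predict(X, y, np.array([3.8, 0.2])))   # the vote ranges over all K classes at once
```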

  17. Multi-class Decision Trees. Decision trees can directly handle multi-class problems. How is this decision tree constructed?
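
One possible sketch using scikit-learn's DecisionTreeClassifier, which accepts multi-class labels directly; the library, the entropy criterion, and the toy data are assumptions rather than anything the lecture prescribes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 2-D data: three Gaussian blobs, 30 points per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [4, 0], [0, 4]]), 30, axis=0)
y = np.repeat([0, 1, 2], 30)

# The tree is grown greedily: each node picks the split that most reduces an impurity
# measure (here entropy, i.e., information gain) computed over all K classes at once.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(tree.predict([[3.8, 0.2], [0.1, 4.1]]))
```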
