Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang
Example: image classification (binary): indoor vs. outdoor
Example: image classification (multiclass): ImageNet. Figure borrowed from vision.stanford.edu
Multiclass classification
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \dots, K\}$
• Find $f(x): \mathbb{R}^d \to \{1, 2, \dots, K\}$ that outputs correct labels
• What kind of $f$?
Approaches for multiclass classification
Approach 1: reduce to regression
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
• Bad idea even for binary classification: it reduces the problem to linear regression and ignores the fact that $y \in \{1, 2, \dots, K\}$
Approach 1: reduce to regression. Bad idea even for binary classification. Figure from Pattern Recognition and Machine Learning, Bishop
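As a quick illustration of Approach 1, here is a minimal sketch on made-up synthetic data (not from the lecture): ordinary least squares is fit to the integer class labels and the output is rounded back to a label. Everything here (data, rounding rule) is an assumption for illustration only.

```python
import numpy as np

# Minimal sketch of Approach 1: treat the class index as a real-valued target
# and fit ordinary least squares.  Synthetic data; for illustration only.
rng = np.random.default_rng(0)
n, d, K = 300, 2, 3
X = rng.normal(size=(n, d))
y = rng.integers(1, K + 1, size=n)                        # labels in {1, ..., K}

Xb = np.hstack([X, np.ones((n, 1))])                      # append a bias feature
w = np.linalg.lstsq(Xb, y.astype(float), rcond=None)[0]   # minimize sum_i (w^T x_i - y_i)^2

# "Classify" by rounding the regression output back into the label range.
y_hat = np.clip(np.rint(Xb @ w), 1, K).astype(int)
# The squared loss penalizes being numerically "far" from the label value,
# even though the labels 1, ..., K have no numeric meaning -- the core reason
# this reduction is a bad idea.
```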
Approach 2: one-versus-the-rest
• Find $K-1$ classifiers $f_1, f_2, \dots, f_{K-1}$
• $f_1$ classifies 1 vs. $\{2, 3, \dots, K\}$
• $f_2$ classifies 2 vs. $\{1, 3, \dots, K\}$
• …
• $f_{K-1}$ classifies $K-1$ vs. $\{1, 2, \dots, K-2\}$
• Points not classified to classes $\{1, 2, \dots, K-1\}$ are put into class $K$
• Problem of ambiguous region: some points may be classified to more than one class
Approach 2: one-versus-the-rest. Figure from Pattern Recognition and Machine Learning, Bishop
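A minimal one-versus-the-rest sketch, assuming synthetic data and using scikit-learn's `LogisticRegression` as the base binary classifier (the lecture does not fix a particular base classifier). Points accepted by several of the $K-1$ classifiers expose the ambiguous-region problem mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ovr(X, y, K):
    """Train K-1 binary classifiers: class k vs. everything else."""
    clfs = {}
    for k in range(1, K):                        # classes 1, ..., K-1
        clfs[k] = LogisticRegression().fit(X, (y == k).astype(int))
    return clfs

def predict_ovr(clfs, X, K):
    preds = np.full(X.shape[0], K)               # default: not claimed -> class K
    for k, clf in clfs.items():
        hits = clf.predict(X) == 1
        preds[hits] = k                          # a point claimed by several classifiers
                                                 # is simply overwritten: the ambiguous region
    return preds
```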
Approach 3: one-versus-one
• Find $K(K-1)/2$ classifiers $f_{(1,2)}, f_{(1,3)}, \dots, f_{(K-1,K)}$
• $f_{(1,2)}$ classifies 1 vs. 2
• $f_{(1,3)}$ classifies 1 vs. 3
• …
• $f_{(K-1,K)}$ classifies $K-1$ vs. $K$
• Computationally expensive: think of $K = 1000$
• Problem of ambiguous region
Approach 3: one-versus-one. Figure from Pattern Recognition and Machine Learning, Bishop
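A similar sketch for one-versus-one, again with an assumed base classifier and synthetic data: one classifier per pair of classes, $K(K-1)/2$ in total, with prediction by majority vote. Vote ties correspond to the ambiguous region.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ovo(X, y, K):
    """Train one binary classifier for every pair of classes (j, k)."""
    clfs = {}
    for j, k in combinations(range(1, K + 1), 2):
        mask = (y == j) | (y == k)
        clfs[(j, k)] = LogisticRegression().fit(X[mask], (y[mask] == j).astype(int))
    return clfs

def predict_ovo(clfs, X, K):
    votes = np.zeros((X.shape[0], K + 1), dtype=int)   # column 0 unused
    for (j, k), clf in clfs.items():
        p = clf.predict(X)
        votes[p == 1, j] += 1
        votes[p == 0, k] += 1
    return votes[:, 1:].argmax(axis=1) + 1              # ties form the ambiguous region
```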
Approach 4: discriminant functions
• Find $K$ scoring functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \arg\max_k s_k(x)$
• Computationally cheap
• No ambiguous regions
Linear discriminant functions
• Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \arg\max_k s_k(x)$
• Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
Linear discriminant functions
• Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
• Leads to a convex region for each class, since $y = \arg\max_k w_k^T x$
Figure from Pattern Recognition and Machine Learning, Bishop
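A minimal sketch of prediction with linear discriminants, assuming the per-class weight vectors are already given as the columns of a matrix `W` (the name and layout are illustrative, not from the lecture):

```python
import numpy as np

def predict_linear_discriminant(X, W):
    """X has shape (n, d); W has shape (d, K), column k is w_k."""
    scores = X @ W                       # scores[i, k] = w_k^T x_i
    return scores.argmax(axis=1) + 1     # labels 1, ..., K; no ambiguous region
```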
Conditional distribution as discriminant
• Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \arg\max_k s_k(x)$
• Conditional distributions: $s_k(x) = P(y = k \mid x)$
• Parametrize by $w_k$: $s_k(x) = P_{w_k}(y = k \mid x)$
Multiclass logistic regression
Review: binary logistic regression
• Sigmoid: $\sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$
• Interpret as conditional probability:
  $P_w(y = 1 \mid x) = \sigma(w^T x + b)$
  $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$
• How to extend to multiclass?
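A small sketch of the sigmoid and the two conditional probabilities above, assuming $w$ and $b$ are given and $x$ is a single feature vector:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_logistic_probs(x, w, b):
    p1 = sigmoid(w @ x + b)      # P_w(y = 1 | x)
    return p1, 1.0 - p1          # P_w(y = 0 | x) = 1 - P_w(y = 1 | x)
```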
Review: binary logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
• Conditional probability by Bayes' rule:
  $p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$
  where we define $a \triangleq \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$
Review: binary logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
• $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds
  $a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$
• Why linear log odds?
Review: binary logistic regression
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• Then the log odds is linear in $x$:
  $a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b$
  where $w = \mu_1 - \mu_2$ and $b = -\frac{1}{2} \mu_1^T \mu_1 + \frac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$
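A quick numerical check of this claim, with made-up means, priors, and a made-up test point (none of them from the lecture): the Bayes log odds under unit-covariance Gaussians matches the linear form $w^T x + b$ with the $w$ and $b$ defined above. SciPy's multivariate normal is used to evaluate the densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
mu1, mu2 = np.array([1.0, 0.0, 2.0]), np.array([-1.0, 0.5, 0.0])   # assumed means
p1, p2 = 0.6, 0.4                                                   # assumed priors
x = np.array([0.3, -1.2, 0.7])                                      # assumed test point

# Log odds computed directly from Bayes' rule with unit-covariance Gaussians.
log_odds = (multivariate_normal.logpdf(x, mu1, np.eye(d)) + np.log(p1)
            - multivariate_normal.logpdf(x, mu2, np.eye(d)) - np.log(p2))

# Log odds computed from the closed-form linear expression on the slide.
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
assert np.isclose(log_odds, w @ x + b)
```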
Multiclass logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
• Conditional probability by Bayes' rule:
  $p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
  where we define $a_k \triangleq \ln\left[\, p(x \mid y = k)\, p(y = k) \,\right]$
Multiclass logistic regression
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• Then $a_k \triangleq \ln\left[\, p(x \mid y = k)\, p(y = k) \,\right] = -\frac{1}{2} x^T x + w_k^T x + b_k$
  where $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$
Multiclass logistic regression
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• The term $-\frac{1}{2} x^T x$ is shared by all classes and cancels out, so
  $p(y = k \mid x) = \frac{\exp(\tilde{a}_k)}{\sum_j \exp(\tilde{a}_j)}$, where $\tilde{a}_k \triangleq w_k^T x + b_k$
  with $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$
Multiclass logistic regression: conclusion
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• Then
  $p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$
  which is the hypothesis class for multiclass logistic regression
• It is a softmax applied to a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)
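A minimal sketch of this hypothesis class, assuming $W$ (a $d \times K$ matrix of the $w_k$ as columns) and $b$ are given. Subtracting the maximum score before exponentiating is purely for numerical stability and does not change the result (it cancels in the ratio).

```python
import numpy as np

def multiclass_logistic_probs(x, W, b):
    """Return P(y = k | x) for k = 1, ..., K under softmax of a linear map."""
    scores = W.T @ x + b          # scores[k] = w_k^T x + b_k
    scores -= scores.max()        # numerical stability only
    e = np.exp(scores)
    return e / e.sum()
```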
Softmax
• A way to squash $a = (a_1, a_2, \dots, a_K)$ into a probability vector $s$:
  $\mathrm{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)}, \frac{\exp(a_2)}{\sum_j \exp(a_j)}, \dots, \frac{\exp(a_K)}{\sum_j \exp(a_j)} \right)$
• Behaves like max: when $a_k \gg a_j$ for all $j \neq k$, then $s_k \approx 1$ and $s_j \approx 0$
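A stable softmax implementation and a small example of the "behaves like max" property; the input vector is made up and the printed values are approximate.

```python
import numpy as np

def softmax(a):
    # Shifting by max(a) leaves the result unchanged but avoids overflow.
    e = np.exp(a - np.max(a))
    return e / e.sum()

# One dominant score -> output close to a one-hot "max" indicator.
print(softmax(np.array([10.0, 0.0, -2.0])))   # roughly [1, 5e-5, 6e-6]
```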
Cross entropy for conditional distribution
• Let $\hat{p}_{\text{data}}(y \mid x)$ denote the empirical distribution of the data
• The negative log-likelihood
  $-\frac{1}{n} \sum_{i=1}^{n} \log p(y = y_i \mid x_i) = -\mathbb{E}_{\hat{p}_{\text{data}}(y \mid x)} \log p(y \mid x)$
  is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$
• Information theory viewpoint: the KL divergence
  $D(\hat{p}_{\text{data}} \,\|\, p) = \mathbb{E}_{\hat{p}_{\text{data}}}\left[\log \frac{\hat{p}_{\text{data}}}{p}\right] = \mathbb{E}_{\hat{p}_{\text{data}}}[\log \hat{p}_{\text{data}}] - \mathbb{E}_{\hat{p}_{\text{data}}}[\log p]$
  where the first term is the (negative) entropy of the data, a constant, and the second is the (negative) cross entropy; so minimizing the cross entropy minimizes the KL divergence to the data distribution
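A sketch of the negative log-likelihood / cross entropy for the conditional case, assuming `probs[i, k]` holds the model's probability of class $k+1$ for example $i$ and `y` holds labels in $\{1, \dots, K\}$ (names and layout are illustrative).

```python
import numpy as np

def cross_entropy(probs, y):
    """Average negative log-probability the model assigns to the true labels."""
    n = len(y)
    return -np.mean(np.log(probs[np.arange(n), y - 1]))
```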
Cross entropy for full distribution
• Let $\hat{p}_{\text{data}}(x, y)$ denote the empirical distribution of the data
• The negative log-likelihood
  $-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i, y_i) = -\mathbb{E}_{\hat{p}_{\text{data}}(x, y)} \log p(x, y)$
  is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$
Multiclass logistic regression: summary
• Last hidden layer $h$
• Linear: $w_k^T h + b_k$
• Convert to probability: softmax
• Loss: cross entropy against the label $y_i$
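Putting the summary diagram together, a minimal end-to-end sketch on made-up synthetic data, where the raw features stand in for the last hidden layer $h$: linear scores, softmax, cross entropy loss, and one hand-derived gradient step. This is an illustration under those assumptions, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 5, 4
X = rng.normal(size=(n, d))            # stands in for the last hidden layer h
y = rng.integers(0, K, size=n)         # 0-indexed labels for convenience

W = np.zeros((d, K))
b = np.zeros(K)

scores = X @ W + b                              # linear: w_k^T h + b_k
scores -= scores.max(axis=1, keepdims=True)     # stable softmax
P = np.exp(scores)
P /= P.sum(axis=1, keepdims=True)               # convert to probability

loss = -np.mean(np.log(P[np.arange(n), y]))     # cross entropy

# Gradient of the cross entropy w.r.t. the scores is (P - one_hot(y)) / n.
G = P.copy()
G[np.arange(n), y] -= 1.0
G /= n
W -= 0.1 * (X.T @ G)                            # one plain gradient step
b -= 0.1 * G.sum(axis=0)
```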