Machine Learning Basics, Lecture 7: Multiclass Classification




  1. Machine Learning Basics, Lecture 7: Multiclass Classification. Princeton University COS 495. Instructor: Yingyu Liang

  2. Example: image classification (binary): indoor vs. outdoor

  3. Example: image classification (multiclass). ImageNet figure borrowed from vision.stanford.edu

  4. Multiclass classification
     • Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from a distribution $D$
     • $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \dots, K\}$
     • Find $f(x): \mathbb{R}^d \to \{1, 2, \dots, K\}$ that outputs correct labels
     • What kind of $f$?

  5. Approaches for multiclass classification

  6. Approach 1: reduce to regression
     • Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from a distribution $D$
     • Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
     • Bad idea even for binary classification: it reduces the problem to linear regression and ignores the fact that $y \in \{1, 2, \dots, K\}$

  7. Approach 1: reduce to regression (bad idea even for binary classification). Figure from Pattern Recognition and Machine Learning, Bishop
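To make Approach 1 concrete, here is a minimal NumPy sketch of the reduction on made-up two-class data (the blobs, labels, and rounding rule are assumptions for illustration, not from the slides): it fits least squares to the raw integer labels and rounds the output back to a label, treating $y \in \{1, 2\}$ as ordinary real-valued targets.

```python
import numpy as np

# Hypothetical toy data: two Gaussian blobs labeled 1 and 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), 2 * np.ones(50)])

# Append a constant feature so the linear model has a bias term.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Least squares: treat the class labels as real-valued regression targets.
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Predict by rounding the regression output back to the nearest label.
pred = np.clip(np.rint(Xb @ w), 1, 2)
print("training accuracy:", (pred == y).mean())
```

The Bishop figure on this slide shows why this is fragile: points far from the boundary can pull the least-squares fit, and hence the decision boundary, in the wrong direction.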

  8. Approach 2: one-versus-the-rest
     • Find $K-1$ classifiers $f_1, f_2, \dots, f_{K-1}$
     • $f_1$ classifies 1 vs. $\{2, 3, \dots, K\}$
     • $f_2$ classifies 2 vs. $\{1, 3, \dots, K\}$
     • …
     • $f_{K-1}$ classifies $K-1$ vs. the remaining classes
     • Points not assigned to any of the classes $\{1, 2, \dots, K-1\}$ are put in class $K$
     • Problem of ambiguous regions: some points may be classified into more than one class

  9. Approach 2: one-versus-the-rest. Figure from Pattern Recognition and Machine Learning, Bishop
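Below is a short Python sketch of the one-versus-the-rest idea. It differs from the slide in two small ways that should be read as assumptions: it trains one classifier per class ($K$ rather than $K-1$, with no default class), and it resolves the ambiguous regions by picking the most confident classifier; scikit-learn's LogisticRegression is used only as a stand-in binary classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, classes):
    """Train one binary classifier per class: class k vs. everything else."""
    return {k: LogisticRegression().fit(X, (y == k).astype(int)) for k in classes}

def one_vs_rest_predict(models, X):
    """Assign each point to the class whose 'k vs. rest' classifier is most confident."""
    keys = list(models.keys())
    scores = np.column_stack([models[k].predict_proba(X)[:, 1] for k in keys])
    return np.array(keys)[scores.argmax(axis=1)]
```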

  10. Approach 3: one-versus-one
     • Find $K(K-1)/2$ classifiers $f_{(1,2)}, f_{(1,3)}, \dots, f_{(K-1,K)}$
     • $f_{(1,2)}$ classifies 1 vs. 2
     • $f_{(1,3)}$ classifies 1 vs. 3
     • …
     • $f_{(K-1,K)}$ classifies $K-1$ vs. $K$
     • Computationally expensive: think of $K = 1000$
     • Problem of ambiguous regions

  11. Approach 3: one-versus-one. Figure from Pattern Recognition and Machine Learning, Bishop
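A minimal sketch of one-versus-one, again with LogisticRegression as an arbitrary stand-in binary classifier and majority voting as one common (assumed) way to resolve ambiguity. For $K = 1000$ this already means $1000 \cdot 999 / 2 = 499{,}500$ pairwise classifiers, which is the computational cost the slide points at.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def one_vs_one_fit(X, y, classes):
    """Train one binary classifier per unordered pair of classes: K(K-1)/2 models."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = np.isin(y, [a, b])
        models[(a, b)] = LogisticRegression().fit(X[mask], y[mask])
    return models

def one_vs_one_predict(models, X, classes):
    """Each pairwise classifier votes for one of its two classes; most votes wins."""
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    for model in models.values():
        for row, label in enumerate(model.predict(X)):
            votes[row, index[label]] += 1
    return np.array(classes)[votes.argmax(axis=1)]
```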

  12. Approach 4: discriminant functions
     • Find $K$ scoring functions $s_1, s_2, \dots, s_K$
     • Classify $x$ to class $y = \arg\max_k s_k(x)$
     • Computationally cheap
     • No ambiguous regions

  13. Linear discriminant functions
     • Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
     • Classify $x$ to class $y = \arg\max_k s_k(x)$
     • Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$

  14. Linear discriminant functions
     • Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
     • Leads to a convex region for each class, since classification is by $y = \arg\max_k w_k^T x$
     • Figure from Pattern Recognition and Machine Learning, Bishop
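In code, the whole decision rule for linear discriminants is a matrix product followed by an argmax. The sketch below uses made-up weights for $K = 3$ classes in $d = 2$ dimensions, purely for illustration.

```python
import numpy as np

def linear_discriminant_predict(W, X):
    """Score each class with s_k(x) = w_k^T x and return the argmax class index.

    W: (K, d) matrix whose rows are the class weight vectors w_k.
    X: (n, d) matrix of inputs.
    """
    scores = X @ W.T              # (n, K): one linear score per class per example
    return scores.argmax(axis=1)  # class indices in {0, ..., K-1}

# Made-up weights for K = 3 classes, d = 2 features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
X = np.array([[2.0, 0.1], [0.1, 2.0], [-1.0, -1.5]])
print(linear_discriminant_predict(W, X))   # [0 1 2]
```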

  15. Conditional distribution as discriminant
     • Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
     • Classify $x$ to class $y = \arg\max_k s_k(x)$
     • Conditional distributions: $s_k(x) = p(y = k \mid x)$
     • Parametrize by $w_k$: $s_k(x) = p_{w_k}(y = k \mid x)$

  16. Multiclass logistic regression

  17. Review: binary logistic regression
     • Sigmoid: $\sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$
     • Interpret as a conditional probability: $p_w(y = 1 \mid x) = \sigma(w^T x + b)$ and $p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$
     • How to extend to multiclass?
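A small sketch of the binary model as stated on this slide; the weight vector, bias, and input below are made-up values used only to show the computation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_logistic_prob(w, b, x):
    """p(y = 1 | x) for binary logistic regression with weights w and bias b."""
    return sigmoid(w @ x + b)

# Hypothetical parameters and input.
w, b, x = np.array([0.5, -1.0]), 0.2, np.array([2.0, 1.0])
p1 = binary_logistic_prob(w, b, x)
print(p1, 1.0 - p1)   # p(y = 1 | x) and p(y = 0 | x)
```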

  18. Review: binary logistic regression
     • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
     • Conditional probability by Bayes' rule: $p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$, where we define $a := \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$

  19. Review: binary logistic regression
     • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
     • Setting $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds to be linear: $a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$
     • Why linear log odds?

  20. Review: binary logistic regression
     • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance: $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\{-\frac{1}{2} \|x - \mu_k\|^2\}$
     • Then the log odds are $a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b$, where $w = \mu_1 - \mu_2$ and $b = -\frac{1}{2} \mu_1^T \mu_1 + \frac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$
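This identity is easy to check numerically. The sketch below draws arbitrary means, priors, and an input (all made-up values), computes the log odds directly from the Gaussian densities, and compares them with the closed form $w^T x + b$ from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)   # hypothetical class means
p1, p2 = 0.3, 0.7                                   # hypothetical class priors
x = rng.normal(size=d)

def log_gauss(x, mu):
    """log N(x | mu, I) for an identity-covariance Gaussian."""
    return -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

# Log odds computed directly from Bayes' rule ...
a_direct = (log_gauss(x, mu1) + np.log(p1)) - (log_gauss(x, mu2) + np.log(p2))

# ... and from the closed form on the slide: a = w^T x + b.
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
a_linear = w @ x + b

print(np.isclose(a_direct, a_linear))   # True: the log odds are linear in x
```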

  21. Multiclass logistic regression
     • Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
     • Conditional probability by Bayes' rule: $p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$, where we define $a_k := \ln [p(x \mid y = k)\, p(y = k)]$

  22. Multiclass logistic regression
     • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance: $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\{-\frac{1}{2} \|x - \mu_k\|^2\}$
     • Then $a_k := \ln [p(x \mid y = k)\, p(y = k)] = -\frac{1}{2} x^T x + w_k^T x + b_k$, where $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$

  23. Multiclass logistic regression
     • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance: $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\{-\frac{1}{2} \|x - \mu_k\|^2\}$
     • The term $-\frac{1}{2} x^T x$ is shared by all classes and cancels out, so $p(y = k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ with $a_k := w_k^T x + b_k$, where $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$

  24. Multiclass logistic regression: conclusion
     • Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance: $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\{-\frac{1}{2} \|x - \mu_k\|^2\}$
     • Then $p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$, which is the hypothesis class for multiclass logistic regression
     • It is a softmax applied to a linear transformation of $x$; it can be used to derive the negative log-likelihood loss (cross entropy)
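As with the binary case, the cancellation can be checked numerically. The sketch below uses arbitrary made-up means and priors, computes the posterior directly from the Gaussian class-conditionals, and compares it with the softmax of the linear scores $w_k^T x + b_k$; the shared constant $\ln \frac{1}{(2\pi)^{d/2}}$ is dropped because it cancels in the softmax.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 3, 4
mus = rng.normal(size=(K, d))               # hypothetical class means mu_k
priors = np.array([0.1, 0.2, 0.3, 0.4])     # hypothetical class priors p(y = k)
x = rng.normal(size=d)

# Posterior directly from the Gaussian class-conditionals and Bayes' rule.
log_joint = np.array([
    -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum((x - mus[k]) ** 2) + np.log(priors[k])
    for k in range(K)
])
post_direct = np.exp(log_joint) / np.exp(log_joint).sum()

# Posterior from the softmax-of-linear form: w_k = mu_k, b_k = -0.5 mu_k^T mu_k + ln p(y = k).
scores = mus @ x - 0.5 * np.sum(mus ** 2, axis=1) + np.log(priors)
post_softmax = np.exp(scores) / np.exp(scores).sum()

print(np.allclose(post_direct, post_softmax))   # True
```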

  25. Softmax
     • A way to squash $a = (a_1, a_2, \dots, a_K)$ into a probability vector $p$: $\mathrm{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)}, \frac{\exp(a_2)}{\sum_j \exp(a_j)}, \dots, \frac{\exp(a_K)}{\sum_j \exp(a_j)} \right)$
     • Behaves like max: when $a_k \gg a_j$ for all $j \ne k$, $p_k \approx 1$ and $p_j \approx 0$
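A minimal softmax implementation; subtracting the maximum score before exponentiating is a standard numerical-stability trick (not mentioned on the slide) that does not change the result, since the shift cancels in the ratio. The second call illustrates the max-like behavior.

```python
import numpy as np

def softmax(a):
    """Softmax of a score vector a, computed in a numerically stable way."""
    shifted = a - np.max(a)       # shifting by a constant leaves the softmax unchanged
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))    # roughly [0.09, 0.24, 0.67]
print(softmax(np.array([1.0, 2.0, 30.0])))   # roughly [0, 0, 1]: behaves like max
```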

  26. Cross entropy for conditional distribution
     • Let $p_{\mathrm{data}}(y \mid x)$ denote the empirical distribution of the data
     • The negative log-likelihood $\frac{1}{n} \sum_{i=1}^{n} -\log p(y = y_i \mid x_i) = -\mathbb{E}_{p_{\mathrm{data}}(y \mid x)} \log p(y \mid x)$ is the cross entropy between $p_{\mathrm{data}}$ and the model output $p$
     • Information theory viewpoint: the KL divergence is $D(p_{\mathrm{data}} \,\|\, p) = \mathbb{E}_{p_{\mathrm{data}}}\left[\log \frac{p_{\mathrm{data}}}{p}\right] = \mathbb{E}_{p_{\mathrm{data}}}[\log p_{\mathrm{data}}] - \mathbb{E}_{p_{\mathrm{data}}}[\log p]$; the first term is the negative entropy of the data (a constant), and $-\mathbb{E}_{p_{\mathrm{data}}}[\log p]$ is the cross entropy, so minimizing the cross entropy minimizes the KL divergence
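A short sketch of the negative log-likelihood computation on hypothetical model outputs: for each example it picks out the probability the model assigned to the true label and averages the negative logs.

```python
import numpy as np

def nll_from_probs(probs, labels):
    """Average negative log-likelihood (cross entropy) of the true labels.

    probs:  (n, K) array, row i is the model's distribution p(y | x_i).
    labels: (n,) integer labels in {0, ..., K-1}.
    """
    n = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

# Hypothetical model outputs for n = 3 examples and K = 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
print(nll_from_probs(probs, labels))   # -(log 0.7 + log 0.8 + log 0.4) / 3
```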

  27. Cross entropy for full distribution
     • Let $p_{\mathrm{data}}(x, y)$ denote the empirical distribution of the data
     • The negative log-likelihood $\frac{1}{n} \sum_{i=1}^{n} -\log p(x_i, y_i) = -\mathbb{E}_{p_{\mathrm{data}}(x, y)} \log p(x, y)$ is the cross entropy between $p_{\mathrm{data}}$ and the model output $p$

  28. Multiclass logistic regression: summary
     • Pipeline: last hidden layer $h$ → linear scores $(w_k)^T h + b_k$ → softmax converts the scores into probabilities $p_k$ → cross-entropy loss against the label $y_i$
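The whole summary pipeline fits in a few lines of NumPy. The sketch below assumes made-up shapes (a batch of $n = 4$ examples, $m = 5$ hidden units, $K = 3$ classes) and random values, and computes the linear scores, the row-wise softmax, and the average cross-entropy loss.

```python
import numpy as np

def cross_entropy_from_hidden(H, W, b, labels):
    """Linear scores -> softmax -> average cross-entropy loss, as in the summary slide.

    H: (n, m) last-hidden-layer features, W: (K, m) weights, b: (K,) biases,
    labels: (n,) integer labels in {0, ..., K-1}.
    """
    scores = H @ W.T + b                              # (n, K) linear scores w_k^T h + b_k
    scores -= scores.max(axis=1, keepdims=True)       # stabilize before exponentiating
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)         # row-wise softmax
    n = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

# Made-up batch.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
W = rng.normal(size=(3, 5))
b = np.zeros(3)
labels = np.array([0, 2, 1, 0])
print(cross_entropy_from_hidden(H, W, b, labels))
```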
