Machine Learning Basics, Lecture 7: Multiclass Classification (Princeton University COS 495)


  1. Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang

  2. Example: image classification (binary): indoor vs. outdoor

  3. Example: image classification (multiclass): ImageNet. Figure borrowed from vision.stanford.edu

  4. Multiclass classification
      • Given training data (x_i, y_i): 1 ≤ i ≤ n, i.i.d. from distribution D
      • x_i ∈ R^d, y_i ∈ {1, 2, …, K}
      • Find f(x): R^d → {1, 2, …, K} that outputs correct labels
      • What kind of f?

  5. Approaches for multiclass classification

  6. Approach 1: reduce to regression
      • Given training data (x_i, y_i): 1 ≤ i ≤ n, i.i.d. from distribution D
      • Find f(x) = w^T x that minimizes the empirical loss L̂(f) = (1/n) Σ_{i=1}^n (w^T x_i - y_i)^2
      • Bad idea even for binary classification: it reduces the problem to linear regression and ignores the fact that y ∈ {1, 2, …, K} (see the sketch below)
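A minimal sketch of this reduction, assuming X is an n x d data matrix and y holds integer labels in {1, …, K}; the names X, y, K and the rounding rule are illustrative, not from the slides.

    import numpy as np

    def fit_regression_classifier(X, y):
        # Treat the labels as real numbers and solve min_w ||X w - y||^2.
        w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
        return w

    def predict_regression_classifier(w, X, K):
        # Round the regression output back to the nearest label in {1, ..., K}.
        # This step ignores the discrete nature of y, which is why the approach
        # is fragile (e.g., far-away points drag the fit, as in Bishop's figure).
        scores = X @ w
        return np.clip(np.rint(scores), 1, K).astype(int)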

  7. Approach 1: reduce to regression. Bad idea even for binary classification. Figure from Pattern Recognition and Machine Learning, Bishop

  8. Approach 2: one-versus-the-rest
      • Find K - 1 classifiers f_1, f_2, …, f_{K-1}
      • f_1 classifies 1 vs. {2, 3, …, K}
      • f_2 classifies 2 vs. {1, 3, …, K}
      • …
      • f_{K-1} classifies K - 1 vs. {1, 2, …, K - 2}
      • Points not classified to any of the classes {1, 2, …, K - 1} are put into class K
      • Problem of ambiguous regions: some points may be claimed by more than one class (see the sketch below)
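A minimal sketch of one-versus-the-rest, assuming a hypothetical helper train_binary(X, t) that fits some binary classifier on labels t in {0, 1} and returns an object with a predict(X) method; the helper and the ambiguity handling are illustrative, not from the slides.

    import numpy as np

    def fit_one_vs_rest(X, y, K, train_binary):
        # K - 1 classifiers: classifier i separates class i from all other classes.
        return [train_binary(X, (y == i).astype(int)) for i in range(1, K)]

    def predict_one_vs_rest(classifiers, X, K):
        # votes[j, i] = 1 if the classifier for class i + 1 claims example j.
        votes = np.stack([clf.predict(X) for clf in classifiers], axis=1)
        claimed = votes.sum(axis=1)
        # Unclaimed points go to class K, as on the slide.
        pred = np.where(claimed == 0, K, votes.argmax(axis=1) + 1)
        # Ambiguous region: more than one classifier claims the point.
        ambiguous = claimed > 1
        return pred, ambiguous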

  9. Approach 2: one-versus-the-rest. Figure from Pattern Recognition and Machine Learning, Bishop

  10. Approach 3: one-versus-one
      • Find K(K - 1)/2 classifiers f_(1,2), f_(1,3), …, f_(K-1,K)
      • f_(1,2) classifies 1 vs. 2
      • f_(1,3) classifies 1 vs. 3
      • …
      • f_(K-1,K) classifies K - 1 vs. K
      • Computationally expensive: think of K = 1000
      • Problem of ambiguous regions (see the voting sketch below)
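A minimal sketch of one-versus-one with majority voting, again assuming the hypothetical train_binary(X, t) helper from above. With K classes it trains K(K - 1)/2 classifiers, which is what makes K = 1000 expensive.

    import numpy as np
    from itertools import combinations

    def fit_one_vs_one(X, y, K, train_binary):
        models = {}
        for i, j in combinations(range(1, K + 1), 2):
            mask = (y == i) | (y == j)
            # Label 1 means "class i", label 0 means "class j" for this pair.
            models[(i, j)] = train_binary(X[mask], (y[mask] == i).astype(int))
        return models

    def predict_one_vs_one(models, X, K):
        votes = np.zeros((X.shape[0], K + 1), dtype=int)
        for (i, j), clf in models.items():
            out = clf.predict(X)
            votes[:, i] += (out == 1)
            votes[:, j] += (out == 0)
        # Ties correspond to the ambiguous regions on the slide.
        return votes[:, 1:].argmax(axis=1) + 1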

  11. Approach 3: one-versus-one. Figure from Pattern Recognition and Machine Learning, Bishop

  12. Approach 4: discriminant functions
      • Find K scoring functions s_1, s_2, …, s_K
      • Classify x to class y = argmax_i s_i(x)
      • Computationally cheap
      • No ambiguous regions

  13. Linear discriminant functions
      • Find K discriminant functions s_1, s_2, …, s_K
      • Classify x to class y = argmax_i s_i(x)
      • Linear discriminant: s_i(x) = w_i^T x, with w_i ∈ R^d (see the sketch below)
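A minimal sketch of the linear discriminant rule: stack the weight vectors w_1, …, w_K as rows of a K x d matrix W and classify by the largest score. W and x are illustrative placeholders.

    import numpy as np

    def predict_linear_discriminant(W, x):
        scores = W @ x                      # s_i(x) = w_i^T x for every class i
        return int(np.argmax(scores)) + 1   # class labels are 1, ..., K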

  14. Linear discriminant functions
      • Linear discriminant: s_i(x) = w_i^T x, with w_i ∈ R^d
      • Leads to a convex region for each class, given by y = argmax_i w_i^T x
      Figure from Pattern Recognition and Machine Learning, Bishop

  15. Conditional distribution as discriminant
      • Find K discriminant functions s_1, s_2, …, s_K
      • Classify x to class y = argmax_i s_i(x)
      • Conditional distributions: s_i(x) = p(y = i | x)
      • Parametrize by w_i: s_i(x) = p_{w_i}(y = i | x)

  16. Multiclass logistic regression

  17. Review: binary logistic regression
      • Sigmoid: σ(w^T x + b) = 1 / (1 + exp(-(w^T x + b)))
      • Interpret as conditional probability:
        p_w(y = 1 | x) = σ(w^T x + b)
        p_w(y = 0 | x) = 1 - p_w(y = 1 | x) = 1 - σ(w^T x + b)
      • How to extend to multiclass? (see the sketch below)
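A minimal sketch of the binary logistic model on this slide: the sigmoid of a linear score gives p(y = 1 | x); w, b, x are illustrative placeholders.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_logistic_prob(w, b, x):
        p1 = sigmoid(w @ x + b)    # p(y = 1 | x)
        return p1, 1.0 - p1        # p(y = 0 | x) = 1 - p(y = 1 | x)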

  18. Review: binary logistic regression
      • Suppose we model the class-conditional densities p(x | y = i) and class probabilities p(y = i)
      • Conditional probability by Bayes' rule:
        p(y = 1 | x) = p(x | y = 1) p(y = 1) / [p(x | y = 1) p(y = 1) + p(x | y = 2) p(y = 2)]
                     = 1 / (1 + exp(-a)) = σ(a)
        where we define a := ln [p(x | y = 1) p(y = 1) / (p(x | y = 2) p(y = 2))] = ln [p(y = 1 | x) / p(y = 2 | x)]

  19. Review: binary logistic regression
      • Suppose we model the class-conditional densities p(x | y = i) and class probabilities p(y = i)
      • p(y = 1 | x) = σ(a) = σ(w^T x + b) is equivalent to setting the log odds
        a = ln [p(y = 1 | x) / p(y = 2 | x)] = w^T x + b
      • Why linear log odds?

  20. Review: binary logistic regression
      • Suppose the class-conditional densities p(x | y = i) are normal:
        p(x | y = i) = N(x | μ_i, I) = (1 / (2π)^{d/2}) exp{-(1/2) ||x - μ_i||^2}
      • The log odds are then linear:
        a = ln [p(x | y = 1) p(y = 1) / (p(x | y = 2) p(y = 2))] = w^T x + b
        where w = μ_1 - μ_2 and b = -(1/2) μ_1^T μ_1 + (1/2) μ_2^T μ_2 + ln [p(y = 1) / p(y = 2)]
        (a small numeric check follows below)
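A small numeric check of this slide (a sketch with made-up means and priors): the log odds computed directly from the Gaussian densities agree with w^T x + b for w and b as defined above.

    import numpy as np

    d = 3
    rng = np.random.default_rng(0)
    mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
    p1, p2 = 0.3, 0.7                      # class priors p(y = 1), p(y = 2)
    x = rng.normal(size=d)

    def log_gauss(x, mu):
        # log N(x | mu, I) with identity covariance, matching the slide's density
        return -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

    log_odds_direct = log_gauss(x, mu1) + np.log(p1) - log_gauss(x, mu2) - np.log(p2)
    w = mu1 - mu2
    b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
    assert np.isclose(log_odds_direct, w @ x + b)   # the two expressions agree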

  21. Multiclass logistic regression
      • Suppose we model the class-conditional densities p(x | y = i) and class probabilities p(y = i)
      • Conditional probability by Bayes' rule:
        p(y = i | x) = p(x | y = i) p(y = i) / Σ_k p(x | y = k) p(y = k) = exp(a_i) / Σ_k exp(a_k)
        where we define a_i := ln [p(x | y = i) p(y = i)]

  22. Multiclass logistic regression
      • Suppose the class-conditional densities p(x | y = i) are normal:
        p(x | y = i) = N(x | μ_i, I) = (1 / (2π)^{d/2}) exp{-(1/2) ||x - μ_i||^2}
      • Then
        a_i := ln [p(x | y = i) p(y = i)] = -(1/2) x^T x + w_i^T x + b_i
        where w_i = μ_i and b_i = -(1/2) μ_i^T μ_i + ln p(y = i) + ln [1 / (2π)^{d/2}]

  23. Multiclass logistic regression
      • Suppose the class-conditional densities p(x | y = i) are normal:
        p(x | y = i) = N(x | μ_i, I) = (1 / (2π)^{d/2}) exp{-(1/2) ||x - μ_i||^2}
      • The term -(1/2) x^T x appears in every a_i and cancels out, so we have
        p(y = i | x) = exp(a_i) / Σ_k exp(a_k), where a_i := w_i^T x + b_i
        with w_i = μ_i and b_i = -(1/2) μ_i^T μ_i + ln p(y = i) + ln [1 / (2π)^{d/2}]

  24. Multiclass logistic regression: conclusion
      • Suppose the class-conditional densities p(x | y = i) are normal:
        p(x | y = i) = N(x | μ_i, I) = (1 / (2π)^{d/2}) exp{-(1/2) ||x - μ_i||^2}
      • Then
        p(y = i | x) = exp(w_i^T x + b_i) / Σ_k exp(w_k^T x + b_k)
        which is the hypothesis class for multiclass logistic regression
      • It is a softmax on a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)

  25. Softmax
      • A way to squash a = (a_1, a_2, …, a_i, …) into a probability vector p:
        softmax(a) = (exp(a_1) / Σ_k exp(a_k), exp(a_2) / Σ_k exp(a_k), …, exp(a_i) / Σ_k exp(a_k), …)
      • Behaves like max: when a_i ≫ a_k for all k ≠ i, then p_i ≈ 1 and p_k ≈ 0 (see the sketch below)
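A minimal sketch of softmax, with the usual max-subtraction added for numerical stability (the common factor cancels in the ratio, so the result is unchanged). The example scores are made up.

    import numpy as np

    def softmax(a):
        shifted = a - np.max(a)    # guard against overflow in exp
        e = np.exp(shifted)
        return e / e.sum()

    p = softmax(np.array([10.0, 0.0, -2.0]))
    # The largest score dominates: p[0] is close to 1 and the others are near 0,
    # which is the "behaves like max" property from the slide.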

  26. Cross entropy for conditional distribution
      • Let p̂_data(y | x) denote the empirical distribution of the data
      • The negative log-likelihood
        (1/n) Σ_{i=1}^n -log p(y = y_i | x_i) = -E_{p̂_data(y|x)} [log p(y | x)]
        is the cross entropy between p̂_data and the model output p (see the sketch below)
      • Information theory viewpoint: the KL divergence
        D(p̂_data || p) = E_{p̂_data} [log (p̂_data / p)] = E_{p̂_data} [log p̂_data] - E_{p̂_data} [log p]
        where the first term is the entropy (a constant independent of the model) and the second is the cross entropy
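A minimal sketch of this negative log-likelihood / cross-entropy computation, assuming probs is an n x K array of model outputs p(y | x_i) (rows sum to 1) and labels holds y_i in {1, …, K}; the names are illustrative.

    import numpy as np

    def cross_entropy(probs, labels):
        n = probs.shape[0]
        picked = probs[np.arange(n), labels - 1]   # p(y = y_i | x_i) for each example
        return -np.mean(np.log(picked))            # (1/n) sum_i -log p(y_i | x_i)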

  27. Cross entropy for full distribution
      • Let p̂_data(x, y) denote the empirical distribution of the data
      • The negative log-likelihood
        (1/n) Σ_{i=1}^n -log p(x_i, y_i) = -E_{p̂_data(x,y)} [log p(x, y)]
        is the cross entropy between p̂_data and the model output p

  28. Multiclass logistic regression: summary
      • Last hidden layer h → Linear: a_k = (w_k)^T h + b_k
      • Softmax: convert the scores to probabilities p_k
      • Cross entropy: loss between the probabilities and the label y_i
      (an end-to-end sketch follows below)
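An end-to-end sketch of this summary slide: linear scores from the last hidden layer h, softmax to probabilities, cross-entropy loss, and one gradient step. W, b, H, labels and the learning rate are illustrative placeholders, not code from the lecture.

    import numpy as np

    def softmax_rows(A):
        A = A - A.max(axis=1, keepdims=True)   # stabilize exp
        E = np.exp(A)
        return E / E.sum(axis=1, keepdims=True)

    def train_step(W, b, H, labels, lr=0.1):
        # H: n x d hidden activations, W: K x d weights, b: length-K biases,
        # labels: n integer labels in {1, ..., K}.
        n, K = H.shape[0], W.shape[0]
        scores = H @ W.T + b                       # linear layer: a_k = w_k^T h + b_k
        probs = softmax_rows(scores)               # convert scores to probabilities
        onehot = np.eye(K)[labels - 1]
        loss = -np.mean(np.sum(onehot * np.log(probs), axis=1))   # cross entropy
        grad_scores = (probs - onehot) / n         # d loss / d scores
        W -= lr * grad_scores.T @ H                # gradient step on the linear layer
        b -= lr * grad_scores.sum(axis=0)
        return loss, W, b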
