Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang
Example: image classification (binary): indoor vs. outdoor
Example: image classification (multiclass): ImageNet. Figure borrowed from vision.stanford.edu
Multiclass classification
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \dots, K\}$
• Find $f(x): \mathbb{R}^d \to \{1, 2, \dots, K\}$ that outputs correct labels
• What kind of $f$?
Approaches for multiclass classification
Approach 1: reduce to regression
• Given training data $\{(x_i, y_i): 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
• Bad idea even for binary classification: it reduces the problem to linear regression and ignores the fact that $y \in \{1, 2, \dots, K\}$
Approach 1: reduce to regression. Bad idea even for binary classification. Figure from Pattern Recognition and Machine Learning, Bishop
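As a quick illustration of Approach 1, here is a minimal sketch on made-up synthetic data (not from the lecture): ordinary least squares is fit to the integer class labels and the output is rounded back to a label. Everything here (data, rounding rule) is an assumption for illustration only.

```python
import numpy as np

# Minimal sketch of Approach 1: treat the class index as a real-valued target
# and fit ordinary least squares.  Synthetic data; for illustration only.
rng = np.random.default_rng(0)
n, d, K = 300, 2, 3
X = rng.normal(size=(n, d))
y = rng.integers(1, K + 1, size=n)                        # labels in {1, ..., K}

Xb = np.hstack([X, np.ones((n, 1))])                      # append a bias feature
w = np.linalg.lstsq(Xb, y.astype(float), rcond=None)[0]   # minimize sum_i (w^T x_i - y_i)^2

# "Classify" by rounding the regression output back into the label range.
y_hat = np.clip(np.rint(Xb @ w), 1, K).astype(int)
# The squared loss penalizes being numerically "far" from the label value,
# even though the labels 1, ..., K have no numeric meaning -- the core reason
# this reduction is a bad idea.
```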
Approach 2: one-versus-the-rest
• Find $K-1$ classifiers $f_1, f_2, \dots, f_{K-1}$
• $f_1$ classifies 1 vs. $\{2, 3, \dots, K\}$
• $f_2$ classifies 2 vs. $\{1, 3, \dots, K\}$
• …
• $f_{K-1}$ classifies $K-1$ vs. $\{1, 2, \dots, K-2\}$
• Points not classified to classes $\{1, 2, \dots, K-1\}$ are put into class $K$
• Problem of ambiguous region: some points may be classified to more than one class
Approach 2: one-versus-the-rest. Figure from Pattern Recognition and Machine Learning, Bishop
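A minimal one-versus-the-rest sketch, assuming synthetic data and using scikit-learn's `LogisticRegression` as the base binary classifier (the lecture does not fix a particular base classifier). Points accepted by several of the $K-1$ classifiers expose the ambiguous-region problem mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ovr(X, y, K):
    """Train K-1 binary classifiers: class k vs. everything else."""
    clfs = {}
    for k in range(1, K):                        # classes 1, ..., K-1
        clfs[k] = LogisticRegression().fit(X, (y == k).astype(int))
    return clfs

def predict_ovr(clfs, X, K):
    preds = np.full(X.shape[0], K)               # default: not claimed -> class K
    for k, clf in clfs.items():
        hits = clf.predict(X) == 1
        preds[hits] = k                          # a point claimed by several classifiers
                                                 # is simply overwritten: the ambiguous region
    return preds
```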
Approach 3: one-versus-one
• Find $K(K-1)/2$ classifiers $f_{(1,2)}, f_{(1,3)}, \dots, f_{(K-1,K)}$
• $f_{(1,2)}$ classifies 1 vs. 2
• $f_{(1,3)}$ classifies 1 vs. 3
• …
• $f_{(K-1,K)}$ classifies $K-1$ vs. $K$
• Computationally expensive: think of $K = 1000$
• Problem of ambiguous region
Approach 3: one-versus-one. Figure from Pattern Recognition and Machine Learning, Bishop
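A similar sketch for one-versus-one, again with an assumed base classifier and synthetic data: one classifier per pair of classes, $K(K-1)/2$ in total, with prediction by majority vote. Vote ties correspond to the ambiguous region.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ovo(X, y, K):
    """Train one binary classifier for every pair of classes (j, k)."""
    clfs = {}
    for j, k in combinations(range(1, K + 1), 2):
        mask = (y == j) | (y == k)
        clfs[(j, k)] = LogisticRegression().fit(X[mask], (y[mask] == j).astype(int))
    return clfs

def predict_ovo(clfs, X, K):
    votes = np.zeros((X.shape[0], K + 1), dtype=int)   # column 0 unused
    for (j, k), clf in clfs.items():
        p = clf.predict(X)
        votes[p == 1, j] += 1
        votes[p == 0, k] += 1
    return votes[:, 1:].argmax(axis=1) + 1              # ties form the ambiguous region
```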
Approach 4: discriminant functions
• Find $K$ scoring functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \arg\max_k s_k(x)$
• Computationally cheap
• No ambiguous regions
Linear discriminant functions
• Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \arg\max_k s_k(x)$
• Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
Linear discriminant functions
• Linear discriminant: $s_k(x) = w_k^T x$, with $w_k \in \mathbb{R}^d$
• Leads to a convex region for each class, since $y = \arg\max_k w_k^T x$
Figure from Pattern Recognition and Machine Learning, Bishop
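A minimal sketch of prediction with linear discriminants, assuming the per-class weight vectors are already given as the columns of a matrix `W` (the name and layout are illustrative, not from the lecture):

```python
import numpy as np

def predict_linear_discriminant(X, W):
    """X has shape (n, d); W has shape (d, K), column k is w_k."""
    scores = X @ W                       # scores[i, k] = w_k^T x_i
    return scores.argmax(axis=1) + 1     # labels 1, ..., K; no ambiguous region
```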
Conditional distribution as discriminant
• Find $K$ discriminant functions $s_1, s_2, \dots, s_K$
• Classify $x$ to class $y = \arg\max_k s_k(x)$
• Conditional distributions: $s_k(x) = P(y = k \mid x)$
• Parametrize by $w_k$: $s_k(x) = P_{w_k}(y = k \mid x)$
Multiclass logistic regression
Review: binary logistic regression
• Sigmoid: $\sigma(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$
• Interpret as conditional probability:
  $P_w(y = 1 \mid x) = \sigma(w^T x + b)$
  $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x + b)$
• How to extend to multiclass?
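A small sketch of the sigmoid and the two conditional probabilities above, assuming $w$ and $b$ are given and $x$ is a single feature vector:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_logistic_probs(x, w, b):
    p1 = sigmoid(w @ x + b)      # P_w(y = 1 | x)
    return p1, 1.0 - p1          # P_w(y = 0 | x) = 1 - P_w(y = 1 | x)
```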
Review: binary logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
• Conditional probability by Bayes' rule:
  $p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$
  where we define $a \triangleq \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$
Review: binary logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
• $p(y = 1 \mid x) = \sigma(a) = \sigma(w^T x + b)$ is equivalent to setting the log odds
  $a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^T x + b$
• Why linear log odds?
Review: binary logistic regression
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• Then the log odds is linear in $x$:
  $a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^T x + b$
  where $w = \mu_1 - \mu_2$ and $b = -\frac{1}{2} \mu_1^T \mu_1 + \frac{1}{2} \mu_2^T \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$
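A quick numerical check of this claim, with made-up means, priors, and a made-up test point (none of them from the lecture): the Bayes log odds under unit-covariance Gaussians matches the linear form $w^T x + b$ with the $w$ and $b$ defined above. SciPy's multivariate normal is used to evaluate the densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
mu1, mu2 = np.array([1.0, 0.0, 2.0]), np.array([-1.0, 0.5, 0.0])   # assumed means
p1, p2 = 0.6, 0.4                                                   # assumed priors
x = np.array([0.3, -1.2, 0.7])                                      # assumed test point

# Log odds computed directly from Bayes' rule with unit-covariance Gaussians.
log_odds = (multivariate_normal.logpdf(x, mu1, np.eye(d)) + np.log(p1)
            - multivariate_normal.logpdf(x, mu2, np.eye(d)) - np.log(p2))

# Log odds computed from the closed-form linear expression on the slide.
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
assert np.isclose(log_odds, w @ x + b)
```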
Multiclass logistic regression
• Suppose we model the class-conditional densities $p(x \mid y = k)$ and the class probabilities $p(y = k)$
• Conditional probability by Bayes' rule:
  $p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_j p(x \mid y = j)\, p(y = j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
  where we define $a_k \triangleq \ln\left[\, p(x \mid y = k)\, p(y = k) \,\right]$
Multiclass logistic regression
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• Then $a_k \triangleq \ln\left[\, p(x \mid y = k)\, p(y = k) \,\right] = -\frac{1}{2} x^T x + w_k^T x + b_k$
  where $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$
Multiclass logistic regression
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• The term $-\frac{1}{2} x^T x$ is shared by all classes and cancels out, so
  $p(y = k \mid x) = \frac{\exp(\tilde{a}_k)}{\sum_j \exp(\tilde{a}_j)}$, where $\tilde{a}_k \triangleq w_k^T x + b_k$
  with $w_k = \mu_k$ and $b_k = -\frac{1}{2} \mu_k^T \mu_k + \ln p(y = k) + \ln \frac{1}{(2\pi)^{d/2}}$
Multiclass logistic regression: conclusion
• Suppose the class-conditional densities $p(x \mid y = k)$ are normal with identity covariance:
  $p(x \mid y = k) = N(x \mid \mu_k, I) = \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} \|x - \mu_k\|^2\right\}$
• Then
  $p(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_j \exp(w_j^T x + b_j)}$
  which is the hypothesis class for multiclass logistic regression
• It is a softmax applied to a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)
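A minimal sketch of this hypothesis class, assuming $W$ (a $d \times K$ matrix of the $w_k$ as columns) and $b$ are given. Subtracting the maximum score before exponentiating is purely for numerical stability and does not change the result (it cancels in the ratio).

```python
import numpy as np

def multiclass_logistic_probs(x, W, b):
    """Return P(y = k | x) for k = 1, ..., K under softmax of a linear map."""
    scores = W.T @ x + b          # scores[k] = w_k^T x + b_k
    scores -= scores.max()        # numerical stability only
    e = np.exp(scores)
    return e / e.sum()
```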
Softmax
• A way to squash $a = (a_1, a_2, \dots, a_K)$ into a probability vector $s$:
  $\mathrm{softmax}(a) = \left( \frac{\exp(a_1)}{\sum_j \exp(a_j)}, \frac{\exp(a_2)}{\sum_j \exp(a_j)}, \dots, \frac{\exp(a_K)}{\sum_j \exp(a_j)} \right)$
• Behaves like max: when $a_k \gg a_j$ for all $j \neq k$, then $s_k \approx 1$ and $s_j \approx 0$
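A stable softmax implementation and a small example of the "behaves like max" property; the input vector is made up and the printed values are approximate.

```python
import numpy as np

def softmax(a):
    # Shifting by max(a) leaves the result unchanged but avoids overflow.
    e = np.exp(a - np.max(a))
    return e / e.sum()

# One dominant score -> output close to a one-hot "max" indicator.
print(softmax(np.array([10.0, 0.0, -2.0])))   # roughly [1, 5e-5, 6e-6]
```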
Cross entropy for conditional distribution
• Let $\hat{p}_{\text{data}}(y \mid x)$ denote the empirical distribution of the data
• The negative log-likelihood
  $-\frac{1}{n} \sum_{i=1}^{n} \log p(y = y_i \mid x_i) = -\mathbb{E}_{\hat{p}_{\text{data}}(y \mid x)} \log p(y \mid x)$
  is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$
• Information theory viewpoint: the KL divergence
  $D(\hat{p}_{\text{data}} \,\|\, p) = \mathbb{E}_{\hat{p}_{\text{data}}}\left[\log \frac{\hat{p}_{\text{data}}}{p}\right] = \mathbb{E}_{\hat{p}_{\text{data}}}[\log \hat{p}_{\text{data}}] - \mathbb{E}_{\hat{p}_{\text{data}}}[\log p]$
  where the first term is the (negative) entropy of the data, a constant, and the second is the (negative) cross entropy; so minimizing the cross entropy minimizes the KL divergence to the data distribution
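A sketch of the negative log-likelihood / cross entropy for the conditional case, assuming `probs[i, k]` holds the model's probability of class $k+1$ for example $i$ and `y` holds labels in $\{1, \dots, K\}$ (names and layout are illustrative).

```python
import numpy as np

def cross_entropy(probs, y):
    """Average negative log-probability the model assigns to the true labels."""
    n = len(y)
    return -np.mean(np.log(probs[np.arange(n), y - 1]))
```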
Cross entropy for full distribution
• Let $\hat{p}_{\text{data}}(x, y)$ denote the empirical distribution of the data
• The negative log-likelihood
  $-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i, y_i) = -\mathbb{E}_{\hat{p}_{\text{data}}(x, y)} \log p(x, y)$
  is the cross entropy between $\hat{p}_{\text{data}}$ and the model output $p$
Multiclass logistic regression: summary
• Last hidden layer $h$
• Linear: $w_k^T h + b_k$
• Convert to probability: softmax
• Loss: cross entropy against the label $y_i$
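Putting the summary diagram together, a minimal end-to-end sketch on made-up synthetic data, where the raw features stand in for the last hidden layer $h$: linear scores, softmax, cross entropy loss, and one hand-derived gradient step. This is an illustration under those assumptions, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 5, 4
X = rng.normal(size=(n, d))            # stands in for the last hidden layer h
y = rng.integers(0, K, size=n)         # 0-indexed labels for convenience

W = np.zeros((d, K))
b = np.zeros(K)

scores = X @ W + b                              # linear: w_k^T h + b_k
scores -= scores.max(axis=1, keepdims=True)     # stable softmax
P = np.exp(scores)
P /= P.sum(axis=1, keepdims=True)               # convert to probability

loss = -np.mean(np.log(P[np.arange(n), y]))     # cross entropy

# Gradient of the cross entropy w.r.t. the scores is (P - one_hot(y)) / n.
G = P.copy()
G[np.arange(n), y] -= 1.0
G /= n
W -= 0.1 * (X.T @ G)                            # one plain gradient step
b -= 0.1 * G.sum(axis=0)
```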