

SLIDE 1

CS 6956: Deep Learning for NLP

Multiclass Classification

SLIDE 2

So far: Binary Classification

  • We have seen linear models for binary classification
  • We can write down a loss for binary classification

– Common losses: Hinge loss and log loss

SLIDE 3

This lecture

  • Multiclass classification
  • Modeling multiple classes
  • Loss functions for multiclass classification

– Once we have a loss, we can minimize it to train

SLIDE 4

Where are we?

  • Multiclass classification
  • Modeling multiple classes
  • Loss functions for multiclass classification

– Once we have a loss, we can minimize it to train

SLIDE 5

What is multiclass classification?

  • An input can belong to one of K classes
  • Training data: Input associated with a class label (a number from 1 to K)

  • Prediction: Given a new input, predict the class label

Each input belongs to exactly one class. Not more, not less.

  • Otherwise, the problem is not multiclass classification
  • If an input can be assigned multiple labels (think tags for emails rather than folders), it is called multi-label classification

SLIDE 6

Example applications: Images

– Input: hand-written character; Output: which character?
– Input: a photograph of an object; Output: which of a set of categories of objects is it?

  • E.g.: the Caltech 256 dataset


[Figures: several handwritten characters that all map to the letter A; example photos of objects such as a car tire, a duck, and a laptop]

SLIDE 7

Example applications: Language

  • Input: a news article
  • Output: Which section of the newspaper should it be in?
  • Input: an email
  • Output: which folder should an email be placed into
  • Input: an audio command given to a car
  • Output: which of a set of actions should be executed

SLIDE 8

Where are we?

  • Multiclass classification
  • Modeling multiple classes
  • Loss functions for multiclass classification

– Once we have a loss, we can minimize it to train

SLIDE 9

Multiclass prediction

  • Suppose we have K classes: Given an input x, we need to predict one of these classes.

– Let us number the labels as 1, 2, …, K

  • The intuition for modeling K classes:

– For a label i, we can define a scoring function score(x, i)
– The score is a real number. A higher score means that the label is preferred

  • Prediction: find the label with the highest score

$$\arg\max_{i} \; \text{score}(\mathbf{x}, i)$$

We haven't committed to the actual functional form of the score function. For now, we will assume that there is some function that is parameterized. Our eventual goal would be to learn the parameters.
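As a small illustration (not from the slides), here is a minimal sketch of this prediction rule, assuming we already have some scoring function; the names score_fn, x, and K are placeholders:

```python
import numpy as np

def predict(score_fn, x, K):
    """Return the label (numbered 0..K-1 here, 1..K in the slides) with the highest score."""
    scores = np.array([score_fn(x, i) for i in range(K)])
    return int(np.argmax(scores))
```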

SLIDE 13

Scores to probabilities

Suppose you wanted a model that predicts the probability that the label is i for an example x. The most common probabilistic model involves the softmax operator and is defined as:

$$P(i \mid \mathbf{x}) = \frac{\exp(\text{score}(\mathbf{x}, i))}{\sum_{j=1}^{K} \exp(\text{score}(\mathbf{x}, j))}$$

SLIDE 14

The softmax function

A general method to normalize scores into probabilities to produce a categorical probability distribution.

  • Converts a vector of scores into a vector of probabilities

If we have a collection of K scores $s_1, s_2, \ldots, s_K$ that could be any real numbers, then their softmax gives K probabilities, each of which is defined as:

$$\frac{e^{s_1}}{e^{s_1} + e^{s_2} + \cdots + e^{s_K}}, \quad \frac{e^{s_2}}{e^{s_1} + e^{s_2} + \cdots + e^{s_K}}, \quad \ldots, \quad \frac{e^{s_K}}{e^{s_1} + e^{s_2} + \cdots + e^{s_K}}$$

The numerator is the un-normalized probability for each outcome. The denominator adds up the un-normalized probabilities for all competing outcomes.
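A minimal NumPy sketch of this operation (an illustration, not code from the course); subtracting the maximum score before exponentiating is a standard stability trick and does not change the result:

```python
import numpy as np

def softmax(scores):
    """Map a vector of K real-valued scores to K probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    exps = np.exp(scores - scores.max())  # shift by the max for numerical stability
    return exps / exps.sum()

# Example: softmax([1.0, 2.0, 3.0]) ≈ [0.090, 0.245, 0.665]
```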

SLIDE 15

What we didn’t see: How are the scores constructed?

They could be linear functions of the input features

$$\text{score}(\mathbf{x}, i) = \mathbf{w}_i^{\top} \mathbf{x}$$

– This gives us multiclass SVM (if we use hinge loss) or multinomial logistic regression (if we use cross-entropy loss)

They could be a neural network

– Most commonly used with the softmax function

Important lesson: If you want multiple decisions to compete with each other, then place a softmax on top of them.
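For concreteness, a hedged sketch of the linear case, where each label has its own weight vector (stacked as the rows of a matrix W); the feed-forward variant in the comment is only illustrative:

```python
import numpy as np

def linear_scores(W, x):
    """W has shape (K, d): one weight vector per label. Returns a length-K score vector."""
    return W @ x

# With a neural network, the score vector would instead be the output of the last layer,
# e.g. scores = W2 @ np.tanh(W1 @ x), with a softmax placed on top so that the K
# decisions compete with each other.
```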

SLIDE 17

Is this the only way to predict multiple classes?

  • Not really
  • Historically, there have been several approaches

– Reducing multiclass classification to several binary classification problems
– One-vs-all: K binary classifiers. For the i-th label, the binary classification problem is “label i vs. not label i” (a prediction sketch follows this list).
– All-vs-all: O(K²) classifiers. One classifier for each pair of labels.
– Error correcting output codes: Encode each label as a binary string and train one classifier for each position of the string

  • Exercise: How would you construct the output in each case?
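As one illustration (a hypothetical sketch, and only one of several reasonable choices), one-vs-all prediction could pick the most confident of the K binary classifiers; binary_scorers is an assumed list of functions, each returning a real-valued confidence for “label i vs. not label i”:

```python
import numpy as np

def one_vs_all_predict(binary_scorers, x):
    """Each of the K scorers answers "is it label i?"; predict the most confident yes."""
    confidences = np.array([scorer(x) for scorer in binary_scorers])
    return int(np.argmax(confidences))
```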

SLIDE 23

Exercises

  • 1. What is the connection between the softmax function and the sigmoid function used in logistic regression?

– To explore this, consider what happens when we have two classes and use softmax

  • 2. Come up with at least two different prediction schemes for the all-vs-all setting

SLIDE 24

Where are we?

  • Multiclass classification
  • Modeling multiple classes
  • Loss functions for multiclass classification

– Once we have a loss, we can minimize it to train

SLIDE 25

The big picture

  • We want to solve a multiclass classification problem with K classes

  • We have defined the functional form of a scoring function

– That is, a function that assigns a score to each label
– We will call this score(x, i) for input x and label i
– We could convert this to a probability via softmax too

  • Our goal: Learn this scoring function

– Actually the parameters that define it

  • Or equivalently: Our goal is to define a loss function using that scoring function

SLIDE 26

The ingredients for defining a loss function

  • We have a function that can assign scores (or probabilities) to a label

– score(x, i), or P(i ∣ x) defined via softmax
– The score is parameterized by some weights, which are not shown

  • We have an example x that has the ground truth label y

– y is an integer between 1 and K

  • Our goal: Penalize scoring functions that do not assign the highest score (or probability) to the label y

SLIDE 27

Two kinds of losses

  • Multiclass hinge loss

– Or max-margin loss
– The multiclass version of the SVM

  • Multiclass log loss

– Or cross-entropy loss
– The multinomial (i.e. multiclass) version of logistic regression

SLIDE 28

Multiclass hinge loss

The intuition:

– We want the true label to get a score that is at least one more than the score for any other label
– That is, there is a margin of one between the score for the true label and the score for any other label.

$$L(\mathbf{x}, y) = \max_{i} \big( \text{score}(\mathbf{x}, i) - \text{score}(\mathbf{x}, y) + \Delta(y, i) \big)$$

Here score(x, i) is the score for a label i, score(x, y) is the score for the true label y, and the “loss” term Δ(y, i) is defined as:

$$\Delta(y, i) = \begin{cases} 0 & \text{if } y = i \\ 1 & \text{if } y \neq i \end{cases}$$

The loss is defined by the label whose score, when augmented by the Δ, exceeds the score of the true label by the greatest amount.
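A minimal sketch of this loss for one example (illustrative, not the course's reference implementation), assuming the K scores are given as a vector and labels are 0-indexed (the slides number them 1 to K):

```python
import numpy as np

def multiclass_hinge_loss(scores, y, delta=1.0):
    """Hinge loss for one example: scores is a length-K vector, y the true label index."""
    scores = np.asarray(scores, dtype=float)
    margins = scores - scores[y] + delta  # score(x, i) - score(x, y) + delta for every i
    margins[y] = 0.0                      # Delta(y, y) = 0, so the true label contributes 0
    return float(margins.max())           # 0 only if the true label beats every other label by delta
```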

SLIDE 33

The cross-entropy loss

The intuition:

– We want the true label to get the highest probability
– The loss is the negative log of the probability of the true label

$$L(\mathbf{x}, y) = -\log P(y \mid \mathbf{x})$$

Or sometimes, this is written using the indicator function:

$$L(\mathbf{x}, y) = -\sum_{i=1}^{K} I[y = i] \, \log P(i \mid \mathbf{x})$$

I[y = i] is zero for all values of i except when i is equal to the true label y, when it takes the value 1.
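A minimal sketch of this loss computed directly from the scores (an illustration only; computing the log-softmax with the max-shift trick avoids numerical overflow):

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """Negative log-probability of the true label y under the softmax of the scores."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log softmax
    return float(-log_probs[y])
```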

SLIDE 35

Exercises

  • Show that the multiclass hinge loss is the same as the binary hinge loss when we have two labels.

  • Show that the cross-entropy loss is the same as the logistic loss when we have two labels.

SLIDE 36

Multiclass classification: Wrapup

  • Label belongs to a set that has more than two elements
  • We saw how we can convert a label scoring function into:

1. A probability for a label
2. A prediction rule

  • We saw two loss functions for multiclass classification

– Hinge loss
– Cross-entropy loss
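To connect the pieces, here is a hypothetical end-to-end sketch (not from the slides) of training a linear scoring function with the cross-entropy loss by plain stochastic gradient descent; the data layout, learning rate, and variable names are all illustrative assumptions:

```python
import numpy as np

def train_linear_softmax(X, y, K, lr=0.1, epochs=100):
    """X: (n, d) feature matrix; y: length-n array of labels in 0..K-1. Returns W of shape (K, d)."""
    n, d = X.shape
    W = np.zeros((K, d))
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            scores = W @ xi
            exps = np.exp(scores - scores.max())
            probs = exps / exps.sum()        # softmax probabilities P(i | x)
            grad = np.outer(probs, xi)       # gradient of -log P(yi | xi) with respect to W
            grad[yi] -= xi
            W -= lr * grad                   # one stochastic gradient step
    return W

# Prediction afterwards: label = np.argmax(W @ x_new)
```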


Questions?