SLIDE 1

From Binary to Extreme Classification

10-418 / 10-618 Machine Learning for Structured Data

Matt Gormley
Lecture 2
Aug. 28, 2019

Machine Learning Department
School of Computer Science
Carnegie Mellon University

SLIDE 2

Q&A

Q: How do I get into the online section?
A: Sorry! I erroneously claimed we would automatically add you to the online section. Here’s the correct answer: to join the online section, email Dorothy Holland-Minkley at dfh@andrew.cmu.edu stating that you would like to join the online section. Why the extra step? We want to make sure you’ve seen the non-professional video recording and are okay with the quality.

SLIDE 3

Q&A

Q: Will I get off the waitlist?
A: Don’t be on the waitlist. Just email Dorothy to join the online section instead!

SLIDE 4

Q&A

Q: Can I move between 10-418 and 10-618?
A: Yes. Just email Dorothy Holland-Minkley at dfh@andrew.cmu.edu to do so.

Q: When is the last possible moment I can move between 10-418 and 10-618?
A: I’m not sure. We’ll announce on Piazza once I have an answer.

SLIDE 5

Q&A

Populating Wikipedia Infoboxes

Q: Why do interactions appear between variables in this example?
A: Consider the test-time setting:
  – The author writes a new article (vector x).
  – The infobox is empty.
  – The ML system must populate all fields (vector y) at once.
  – Interactions that were seen (i.e. in vector y) at training time are unobserved at test time, so we wish to model them.

SLIDE 6

ROADMAP

SLIDE 7

How do we get from Classification to Structured Prediction?

  1. We start with the simplest decompositions (i.e. classification).
  2. Then we formulate structured prediction as a search problem (decomposition of the prediction into a sequence of decisions).
  3. Finally, we formulate structured prediction in the framework of graphical models (decomposition into parts).

SLIDE 8

Sampling from a Joint Distribution

[Figure: a chain-structured factor graph over tag variables X0 = <START> and X1, …, X5, connected by factors ψ0, ψ1, …, ψ9, drawn over the sentence “time flies like an arrow”.]

Sample 1: n v p d n
Sample 2: n v p d n
Sample 3: n n v d n
Sample 4: n v p d n
Sample 5: v n p d n
Sample 6: v n v d n

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.
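The claim in the caption can be made concrete with a minimal sketch (not from the slides): enumerate every assignment, score it with factor tables, normalize by the total, and draw samples in proportion to p(x). The factor values below are hypothetical stand-ins for the ψ’s in the figure.

```python
import itertools
import random

TAGS = ["n", "v", "p", "d"]

# Hypothetical stand-ins for the transition factors psi(X_i, X_i+1) on the slide.
TRANSITION = {("n", "v"): 8.0, ("v", "p"): 3.0, ("p", "d"): 3.0, ("d", "n"): 8.0}

def score(tags):
    """Unnormalized score of one assignment: product of its factor values."""
    s = 1.0
    for prev, curr in zip(tags, tags[1:]):
        s *= TRANSITION.get((prev, curr), 0.1)  # default weight for other pairs
    return s

assignments = list(itertools.product(TAGS, repeat=5))
Z = sum(score(t) for t in assignments)
probs = [score(t) / Z for t in assignments]    # p(x) for every assignment x

# Sampling in proportion to p(x): high-scoring assignments (like n v p d n) recur.
for i, x in enumerate(random.choices(assignments, weights=probs, k=6), start=1):
    print(f"Sample {i}: {' '.join(x)}")
```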

SLIDE 9

Sampling from a Joint Distribution

[Figure: a factor graph over variables X1, …, X7 with factors ψ1, …, ψ12, together with four samples drawn from its joint distribution.]

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

SLIDE 10

Sampling from a Joint Distribution

[Figure: a chain-structured factor graph over tag variables X0 = <START>, X1, …, X5 and word variables W1, …, W5, connected by factors ψ0, …, ψ9. Here both the tags and the words are sampled.]

Sample 1: n v p d n (“time flies like an arrow”)
Sample 2: n n v d n (“time like flies an arrow”)
Sample 3: n v p n n (“flies with fly their wings”)
Sample 4: p n n v v (“with you time will see”)

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

SLIDE 11

Factors have local opinions (≥ 0)

[Figure: the same chain-structured factor graph over tags X0, …, X5 and words W1, …, W5.]

Each black box looks at some of the tags Xi and words Wi.

Transition factor ψ(Xi, Xi+1), scoring adjacent tag pairs (rows: Xi, columns: Xi+1):

         v     n     p     d
    v    1     6     3     4
    n    8     4     2     0.1
    p    1     3     1     3
    d    0.1   8

Emission factor ψ(Xi, Wi), scoring tag–word pairs:

         time  flies  like  …
    v    3     5      3
    n    4     5      2
    p    0.1   0.1    3
    d    0.1   0.2    0.1

Note: We chose to reuse the same factors at different positions in the sentence.

SLIDE 12

Factors have local opinions (≥ 0)

[Figure: the chain with the assignment n v p d n over the words “time flies like an arrow”; the transition and emission factor tables are the same as on the previous slide.]

Each black box looks at some of the tags Xi and words Wi.

p(n, v, p, d, n, time, flies, like, an, arrow) = ?

SLIDE 13

Global probability = product of local opinions

[Figure: the chain with the assignment n v p d n over “time flies like an arrow”; factor tables as before.]

Each black box looks at some of the tags Xi and words Wi.

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Uh-oh! The unnormalized scores of the various assignments sum up to Z > 1. So divide them all by Z.
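A minimal sketch of this slide’s arithmetic (the factor values below are hypothetical stand-ins, not the tables above): multiply local opinions into a global score, sum the scores over all assignments to get Z, and divide so the results form a distribution.

```python
import itertools

TAGS = ["n", "v", "d"]

# Hypothetical transition factors; entries are nonnegative "opinions", not probabilities.
TRANSITION = {
    "n": {"n": 4.0, "v": 8.0, "d": 0.1},
    "v": {"n": 6.0, "v": 1.0, "d": 4.0},
    "d": {"n": 8.0, "v": 0.1, "d": 0.1},
}

def global_score(tags):
    """Global score = product of the local opinions along the chain."""
    s = 1.0
    for prev, curr in zip(tags, tags[1:]):
        s *= TRANSITION[prev][curr]
    return s

assignments = list(itertools.product(TAGS, repeat=3))
Z = sum(global_score(t) for t in assignments)       # scores sum to Z, not to 1
p = {t: global_score(t) / Z for t in assignments}   # ... so divide them all by Z

assert abs(sum(p.values()) - 1.0) < 1e-9            # now a proper distribution
print(p[("n", "v", "d")])
```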

SLIDE 14

Markov Random Field (MRF)

[Figure: the same chain with the assignment n v p d n over “time flies like an arrow”; factor tables as before.]

p(n, v, p, d, n, time, flies, like, an, arrow) = (1/Z)(4 * 8 * 5 * 3 * …)

Joint distribution over tags Xi and words Wi. The individual factors aren’t necessarily probabilities.

SLIDE 15

Hidden Markov Model

[Figure: the same chain with tags n v p d n over “time flies like an arrow”.]

But sometimes we choose to make the factors probabilities: constrain each row of a factor to sum to one. Now Z = 1.

Transition probabilities (rows: Xi, columns: Xi+1; each row sums to one):

         v    n    p    d
    v    .1   .4   .2   .3
    n    .8   .1   .1   0
    p    .2   .3   .2   .3
    d    .2   .8   0    0

Emission probabilities:

         time  flies  like  …
    v    .2    .5     .2
    n    .3    .4     .2
    p    .1    .1     .3
    d    .1    .2     .1

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
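With row-normalized factors the joint probability is a plain product along the chain, and no division by Z is needed. A minimal sketch using the table entries above; the <START> row is not shown on the slide, so the start distribution below is hypothetical.

```python
# Transition p(X_i+1 | X_i) and emission p(W_i | X_i) tables from the slide.
TRANSITION = {
    "v": {"v": 0.1, "n": 0.4, "p": 0.2, "d": 0.3},
    "n": {"v": 0.8, "n": 0.1, "p": 0.1, "d": 0.0},
    "p": {"v": 0.2, "n": 0.3, "p": 0.2, "d": 0.3},
    "d": {"v": 0.2, "n": 0.8, "p": 0.0, "d": 0.0},
}
EMISSION = {
    "v": {"time": 0.2, "flies": 0.5, "like": 0.2},
    "n": {"time": 0.3, "flies": 0.4, "like": 0.2},
    "p": {"time": 0.1, "flies": 0.1, "like": 0.3},
    "d": {"time": 0.1, "flies": 0.2, "like": 0.1},
}
START = {"v": 0.1, "n": 0.3, "p": 0.3, "d": 0.3}  # hypothetical p(X_1 | <START>)

def hmm_joint(tags, words):
    """p(tags, words) = p(x1|START) * prod p(x_i|x_i-1) * prod p(w_i|x_i)."""
    p = START[tags[0]]
    for prev, curr in zip(tags, tags[1:]):
        p *= TRANSITION[prev][curr]
    for tag, word in zip(tags, words):
        p *= EMISSION[tag][word]
    return p

# First three positions of the slide's example: n v p over "time flies like".
print(hmm_joint(["n", "v", "p"], ["time", "flies", "like"]))
```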

SLIDE 16

Markov Random Field (MRF)

[Figure: the same chain with the assignment n v p d n over “time flies like an arrow”; the unnormalized factor tables as before.]

p(n, v, p, d, n, time, flies, like, an, arrow) = (1/Z)(4 * 8 * 5 * 3 * …)

Joint distribution over tags Xi and words Wi.

SLIDE 17

Conditional Random Field (CRF)

[Figure: the same chain, but the words are now observed. Each emission factor reduces to a single column of scores for its observed word: for “time”, v 3, n 4, p 0.1, d 0.1; for “flies”, v 5, n 5, p 0.1, d 0.2. Transition factors as before.]

Conditional distribution over tags Xi given words wi. The factors and Z are now specific to the sentence w.

p(n, v, p, d, n | time, flies, like, an, arrow) = (1/Z(w))(4 * 8 * 5 * 3 * …)
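A minimal sketch of why Z becomes sentence-specific (the two emission columns are the ones shown above; the transition values come from the earlier slides, with a hypothetical default for the cells not visible there): with w observed, the sum defining the normalizer runs only over tag sequences, so Z(w) changes with w.

```python
import itertools

TAGS = ["v", "n", "p", "d"]

# Emission columns for the observed words, as shown on the slide.
UNARY = {
    "time":  {"v": 3.0, "n": 4.0, "p": 0.1, "d": 0.1},
    "flies": {"v": 5.0, "n": 5.0, "p": 0.1, "d": 0.2},
}

# Transition factors from the earlier slides; 0.1 is a hypothetical default
# for the cells that are not visible there.
TRANSITION = {
    ("v", "v"): 1.0, ("v", "n"): 6.0, ("v", "p"): 3.0, ("v", "d"): 4.0,
    ("n", "v"): 8.0, ("n", "n"): 4.0, ("n", "p"): 2.0, ("n", "d"): 0.1,
    ("p", "v"): 1.0, ("p", "n"): 3.0, ("p", "p"): 1.0, ("p", "d"): 3.0,
    ("d", "v"): 0.1, ("d", "n"): 8.0,
}

def score(tags, words):
    """Unnormalized score of a tag sequence for a fixed, observed sentence."""
    s = 1.0
    for prev, curr in zip(tags, tags[1:]):
        s *= TRANSITION.get((prev, curr), 0.1)
    for tag, word in zip(tags, words):
        s *= UNARY[word][tag]
    return s

words = ["time", "flies"]
Z_w = sum(score(t, words) for t in itertools.product(TAGS, repeat=len(words)))
print(Z_w, score(("n", "v"), words) / Z_w)   # Z(w) and p(n, v | time, flies)
```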

SLIDE 18

BACKGROUND: BINARY CLASSIFICATION

SLIDE 19

Linear Models for Classification

Key idea: try to learn this hyperplane directly.

Directly modeling the hyperplane would use a decision function h(x) = sign(θT x), for y ∈ {−1, +1}.

  • There are lots of commonly used linear classifiers.
  • These include:
    – Perceptron
    – (Binary) Logistic Regression
    – Naïve Bayes (under certain conditions)
    – (Binary) Support Vector Machines

SLIDE 20

(Online) Perceptron Algorithm

Data: Inputs are continuous vectors of length M. Outputs are discrete.

Prediction: Output determined by hyperplane:
ŷ = hθ(x) = sign(θT x), where sign(a) = +1 if a ≥ 0, and −1 otherwise.

Learning: Iterative procedure (a sketch follows below):
  • initialize parameters to the vector of all zeroes
  • while not converged:
    – receive next example (x(i), y(i))
    – predict y’ = h(x(i))
    – if positive mistake: add x(i) to parameters
    – if negative mistake: subtract x(i) from parameters
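A minimal NumPy sketch of this procedure (a sketch, not the course’s reference implementation; convergence is capped at a fixed number of epochs, and `data` is assumed to be (x, y) pairs with y ∈ {−1, +1}):

```python
import numpy as np

def sign(a):
    """+1 if a >= 0, -1 otherwise (the slide's convention at a = 0)."""
    return 1 if a >= 0 else -1

def perceptron(data, num_epochs=10):
    """Online perceptron over (x, y) pairs with y in {-1, +1}."""
    theta = np.zeros(len(data[0][0]))   # initialize parameters to all zeroes
    for _ in range(num_epochs):         # stands in for "while not converged"
        for x, y in data:
            if sign(theta @ x) != y:    # mistake on this example?
                theta += y * x          # +x on positive mistakes, -x on negative
    return theta

# Usage: learn a hyperplane separating two points.
data = [(np.array([1.0, 2.0]), +1), (np.array([-1.0, -1.5]), -1)]
theta = perceptron(data)
print(sign(theta @ np.array([1.0, 2.0])))  # -> 1
```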
SLIDE 21

(Binary) Logistic Regression

Data: Inputs are continuous vectors of length M. Outputs are discrete.

Model: Logistic function applied to the dot product of parameters with the input vector:
pθ(y = 1 | x) = 1 / (1 + exp(−θT x))

Learning: Find the parameters that minimize some objective function:
θ* = argminθ J(θ)

Prediction: Output is the most probable class:
ŷ = argmax y∈{0,1} pθ(y | x)
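As a minimal sketch (assuming the same setup as the slide, with labels in {0, 1}), prediction is a dot product, a sigmoid, and a threshold:

```python
import numpy as np

def predict_proba(theta, x):
    """p_theta(y = 1 | x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def predict(theta, x):
    """Most probable class over {0, 1}; ties at 0.5 go to class 1 here."""
    return 1 if predict_proba(theta, x) >= 0.5 else 0

theta = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(predict_proba(theta, x))  # 0.5, since theta^T x = 0
print(predict(theta, x))        # 1
```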

SLIDE 22

Support Vector Machines (SVMs)

  • Hard-margin SVM (Primal)
  • Soft-margin SVM (Primal)
  • Soft-margin SVM (Lagrangian Dual)
  • Hard-margin SVM (Lagrangian Dual)

SLIDE 23

Decision Trees

Figure from Tom Mitchell.

SLIDE 24

Binary and Multiclass Classification

[Figure: side-by-side setups for supervised learning, binary classification, and multiclass classification.]

SLIDE 25

Outline

Reductions (Multiclass → Binary):
  1. one-vs-all (OVA)
  2. all-vs-all (AVA)
  3. classification tree
  4. error correcting output codes (ECOC)

Settings:
  A. Multiclass Classification
  B. Hierarchical Classification
  C. Extreme Classification

Why?
  – multiclass is the simplest structured prediction setting
  – key insights in the simple reductions are analogous to later (less simple) concepts

SLIDE 26

REDUCTIONS OF MULTICLASS TO BINARY CLASSIFICATION

SLIDE 27

Reductions to Binary Classification

Whiteboard:
  – Setting for multiclass-to-binary reductions
  – Reduction 1: One-vs-All (OVA) (see the sketch after this list)
  – Reduction 2: All-vs-All (AVA)
  – Reduction 3: Classification Tree
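Since the reduction itself was developed on the whiteboard, here is a minimal sketch of One-vs-All under the standard formulation (assumed, not taken from the lecture): train K binary classifiers, one per class against the rest, and predict with the highest-scoring one. A tiny perceptron stands in for the binary learner.

```python
import numpy as np

def train_binary_perceptron(examples, num_epochs=10):
    """Tiny binary learner (perceptron), used as the base classifier."""
    theta = np.zeros(len(examples[0][0]))
    for _ in range(num_epochs):
        for x, y in examples:                 # y in {-1, +1}
            if (1 if theta @ x >= 0 else -1) != y:
                theta += y * x
    return theta

def train_ova(examples, K):
    """One-vs-All: for each class k, train on a relabeled copy of the data
    where y = +1 iff the original label equals k."""
    return [train_binary_perceptron([(x, +1 if y == k else -1) for x, y in examples])
            for k in range(K)]

def predict_ova(classifiers, x):
    """Predict the class whose binary classifier scores x most highly."""
    return int(np.argmax([theta @ x for theta in classifiers]))

# Usage: three classes, one prototype point each.
data = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1), (np.array([-1.0, -1.0]), 2)]
classifiers = train_ova(data, K=3)
print(predict_ova(classifiers, np.array([0.9, 0.1])))  # -> 0
```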

SLIDE 28

HIERARCHICAL CLASSIFICATION

SLIDE 29

Hierarchical Classification

Setting:
  • Given a hierarchy over the output labels
  • Otherwise, the same as multiclass classification
  • Each leaf node is a label

SLIDE 30

Hierarchical Classification

Setting:
  • Given a hierarchy over the output labels
  • Otherwise, the same as multiclass classification
  • Each leaf node is a label

Training Data: pairs of occupation descriptions and their SOC code
  • 9560,Rigging up man
  • 5900,Mimeographer
  • 3040,Doctor of optometry
  • 8310,Wool presser
  • 8720,Compress machine operator
  • 9640,Pretzel packer
  • 9260,Hot box spotter
SLIDE 31

Hierarchical Classification

Setting:
  • Given a hierarchy over the output labels
  • Otherwise, the same as multiclass classification
  • Each leaf node is a label

[Figure: the label hierarchy drawn as a binary tree whose nodes carry bit-string codes (root; 00, 01, 000, 001, 010, 011, 0010, 0011; 1, 10, 11, 100, 101, 1010, 1011); each leaf is a label.]

SLIDE 32

Reductions to Binary Classification

Whiteboard:
  – Hierarchical classification: how to build an appropriate classifier?
  – Features of the input vector and label
  – Reduction 4: Error Correcting Output Codes (ECOC) (see the sketch after this list)
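Likewise, a minimal sketch of the standard ECOC scheme (assumed, not the whiteboard’s exact construction): give each class a binary codeword, train one binary classifier per bit, and predict the class whose codeword is nearest in Hamming distance to the predicted bits. The code matrix is made up, and `train_binary` can be any binary learner, e.g. the perceptron from the One-vs-All sketch.

```python
import numpy as np

# Hypothetical code matrix: row k is the 4-bit codeword for class k.
CODES = np.array([
    [0, 0, 1, 1],   # class 0
    [0, 1, 0, 1],   # class 1
    [1, 0, 0, 1],   # class 2
    [1, 1, 1, 0],   # class 3
])

def train_ecoc(data, codes, train_binary):
    """One binary problem per bit: relabel each example by its class's bit b."""
    return [train_binary([(x, +1 if codes[y, b] == 1 else -1) for x, y in data])
            for b in range(codes.shape[1])]

def predict_ecoc(bit_classifiers, codes, x):
    """Predict each bit, then pick the class with the closest codeword."""
    bits = np.array([1 if theta @ x >= 0 else 0 for theta in bit_classifiers])
    hamming = np.abs(codes - bits).sum(axis=1)
    return int(np.argmin(hamming))

# Usage (with the binary perceptron from the One-vs-All sketch):
# classifiers = train_ecoc(data, CODES, train_binary_perceptron)
# y_hat = predict_ecoc(classifiers, CODES, x)
```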

SLIDE 33

EXTREME CLASSIFICATION

SLIDE 34

Extreme Classification

Example adapted from Paul Mineiro’s ICML 2017 talk.
slide-35
SLIDE 35

Extreme Classification

41

Setting:

  • Output label set is extremely large

(e.g. millions of labels)

  • Otherwise, the same as multiclass

classification

Example Tasks:

  • Large-scale facial recognition (billions?)
  • Predicting Amazon product categories (3 million)
  • Recommending Amazon items (100 million products)
  • Predicting Wikipedia tags (2 million)
  • Predicting Flick image tags
  • Language modeling (millions of words)
SLIDE 36

Logarithmic-time One-Against-Some

An example Recall Tree: [Figure from Daumé III et al. (2017)]

Key idea behind this algorithm:
  – Build a Recall Tree where:
    • each leaf node contains a set S of labels with |S| ≤ log2(K)
    • the depth of the tree is d ≤ log2(K)
  – Learn one binary classifier per internal node to route an instance (vector x) to a leaf node.
  – Learn one multiclass classifier per leaf over the set of labels S, which restricts the label set for instances x routed there.
  – Given a new instance, predict one of the |S| labels at the leaf to which the instance was routed (see the sketch below).
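A minimal prediction-time sketch of that routing scheme (hypothetical tree structure and linear scorers; a sketch of the idea, not the authors’ implementation, which also learns the tree online):

```python
import numpy as np

class Node:
    """Internal nodes route with a binary classifier; leaves hold a small
    label set S (|S| <= log2 K) plus one scorer per label in S."""
    def __init__(self, router=None, left=None, right=None,
                 labels=None, leaf_scorers=None):
        self.router, self.left, self.right = router, left, right
        self.labels, self.leaf_scorers = labels, leaf_scorers

def predict(node, x):
    # Route x to a leaf: one binary decision per internal node, so the
    # number of decisions is at most the depth d <= log2(K).
    while node.labels is None:
        node = node.right if node.router @ x >= 0 else node.left
    # At the leaf, run the multiclass classifier over the restricted set S.
    scores = [theta @ x for theta in node.leaf_scorers]
    return node.labels[int(np.argmax(scores))]

# Usage: a depth-1 Recall Tree over K = 4 labels, two labels per leaf.
leaf_a = Node(labels=[0, 1],
              leaf_scorers=[np.array([1.0, 0.0]), np.array([0.0, 1.0])])
leaf_b = Node(labels=[2, 3],
              leaf_scorers=[np.array([-1.0, 0.0]), np.array([0.0, -1.0])])
root = Node(router=np.array([1.0, 1.0]), left=leaf_b, right=leaf_a)
print(predict(root, np.array([2.0, 0.5])))  # routes right, predicts label 0
```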

SLIDE 37

Logarithmic-time One-Against-Some

An example Recall Tree: [Figure from Daumé III et al. (2017)]

Properties:
  1. Competes with one-against-all (i.e. a standard multiclass classifier) on benchmark datasets
  2. Speed: O(log K) training and prediction
  3. Space: O(K), same as one-against-all
  4. Online learning!

SLIDE 38

Logarithmic-time One-Against-Some

Experiments: [Figure from Daumé III et al. (2017)]

SLIDE 39

Learning Objectives

From Binary to Multiclass Classification

You should be able to…
  1. Reduce the multiclass classification problem to a collection of binary classification problems
  2. Identify the advantages and deficiencies of different multiclass-to-binary reductions
  3. Implement one-vs-all, all-vs-all, classification trees, and error correcting output codes
  4. Differentiate the multiclass, hierarchical, and extreme classification settings