From Binary to Extreme Classification



  1. 10-418 / 10-618 Machine Learning for Structured Data, Machine Learning Department, School of Computer Science, Carnegie Mellon University. From Binary to Extreme Classification. Matt Gormley, Lecture 2, Aug. 28, 2019.

  2. Q&A Q: How do I get into the online section? A: Sorry! I erroneously claimed we would automatically add you to the online section. Here’s the correct answer: To join the online section, email Dorothy Holland- Minkley at dfh@andrew.cmu.edu stating that you would like to join the online section. Why the extra step? We want to make sure you’ve seen the non-professional video recording and are okay with the quality. 2

  3. Q&A Q: Will I get off the waitlist? A: Don’t be on the waitlist. Just email Dorothy to join the online section instead! 3

  4. Q&A Q: Can I move between 10-418 and 10-618? A: Yes. Just email Dorothy Holland-Minkley at dfh@andrew.cmu.edu to do so. Q: When is the last possible moment I can move between 10-418 and 10-618? A: I’m not sure. We’ll announce on Piazza once I have an answer. 4

  5. Q&A: Populating Wikipedia Infoboxes. Q: Why do interactions appear between variables in this example? A: Consider the test-time setting: – the author writes a new article (vector x) – the infobox is empty – the ML system must populate all fields (vector y) at once – interactions within vector y that were seen at training time are unobserved at test time, so we wish to model them.

  6. ROADMAP

  7. How do we get from Classification to Structured Prediction? 1. We start with the simplest decompositions (i.e. classification). 2. Then we formulate structured prediction as a search problem (decomposition into a sequence of decisions). 3. Finally, we formulate structured prediction in the framework of graphical models (decomposition into parts).

  8. Sampling from a Joint Distribution. A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.
  [Figure: a linear-chain factor graph over variables X0 (<START>) through X5 with factors ψ0 through ψ9, over the words time, flies, like, an, arrow; six sampled tag assignments are shown, e.g. n v p d n, n n v d n, v n p d n.]

  9. Sampling from a Joint Distribution. A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.
  [Figure: a grid-structured factor graph over variables X1 through X7 with factors ψ1 through ψ12, shown with four sampled assignments.]

  10. Sampling from a Joint Distribution. A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.
  [Figure: a linear-chain factor graph over tag variables X0 (<START>) through X5 and word variables W1 through W5, with factors ψ0 through ψ9. Sample 1: “time flies like an arrow” tagged n v p d n; Sample 2: “flies like time an arrow” tagged n n v d n; Sample 3: “with flies fly their wings” tagged n v p n n; Sample 4: “you with time will see” tagged p n n v v.]
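To make the sampling idea concrete, here is a minimal Python sketch of my own (not code from the lecture): it enumerates every assignment of a toy two-variable model, scores each as the product of its factor values, and draws samples in proportion to those scores. The factor tables here are made up purely for illustration.

    import itertools
    import random

    # Toy model with two tag variables, one pairwise factor, and two unary factors.
    # All factor values below are hypothetical, chosen only to show the mechanics.
    TAGS = ["n", "v"]
    psi_pair = {("n", "n"): 8, ("n", "v"): 4, ("v", "n"): 1, ("v", "v"): 6}
    psi_unary = [{"n": 4, "v": 3}, {"n": 5, "v": 5}]

    def score(assignment):
        """Unnormalized score: the product of all factor values for this assignment."""
        s = psi_pair[assignment]
        for tag, unary in zip(assignment, psi_unary):
            s *= unary[tag]
        return s

    # Enumerate every assignment, compute its score, and sample in proportion to it.
    assignments = list(itertools.product(TAGS, repeat=2))
    weights = [score(a) for a in assignments]
    print(random.choices(assignments, weights=weights, k=6))

As the slide says, the proportion of samples equal to a given assignment x approaches its probability p(x), i.e. its score divided by the sum of all scores.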

  11. Factors have local opinions (≥ 0). Each black box looks at some of the tags X_i and words W_i. Note: we chose to reuse the same factors at different positions in the sentence.
  [Figure: the linear-chain factor graph with tag variables X0 (<START>) through X5, factors ψ0 through ψ9, and word variables W1 through W5.]
  Tag–tag factor (value for each pair of adjacent tags):
         v     n     p     d
    v    1     6     3     4
    n    8     4     2     0.1
    p    1     3     1     3
    d    0.1   8     0     0
  Tag–word factor (value for each tag and word; only the first few words shown):
         time  flies  like  …
    v    3     5      3
    n    4     5      2
    p    0.1   0.1    3
    d    0.1   0.2    0.1

  12. Factors have local opinions (≥ 0). Each black box looks at some of the tags X_i and words W_i. p(n, v, p, d, n, time, flies, like, an, arrow) = ?
  [Figure: the factor graph with the tag assignment n v p d n over the words “time flies like an arrow”, using the same tag–tag and tag–word factor tables as above.]

  13. Global probability = product of local opinions. Each black box looks at some of the tags X_i and words W_i. p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …). Uh-oh! The probabilities of the various assignments sum up to Z > 1. So divide them all by Z.
  [Figure: same factor graph and factor tables as above.]
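As a concrete check on the “product of local opinions, divided by Z” computation, here is a brute-force sketch of my own (not the lecture’s code). It uses the tag–tag and tag–word values from the tables above, restricted to the three-word prefix “time flies like”, and ignores the <START> factor for simplicity.

    import itertools

    TAGS = ["v", "n", "p", "d"]
    WORDS = ["time", "flies", "like"]

    # trans[(prev_tag, next_tag)] and emit[(tag, word)] copied from the tables above.
    trans_rows = {"v": [1, 6, 3, 4], "n": [8, 4, 2, 0.1], "p": [1, 3, 1, 3], "d": [0.1, 8, 0, 0]}
    trans = {(a, b): trans_rows[a][j] for a in TAGS for j, b in enumerate(TAGS)}
    emit_rows = {"v": [3, 5, 3], "n": [4, 5, 2], "p": [0.1, 0.1, 3], "d": [0.1, 0.2, 0.1]}
    emit = {(t, w): emit_rows[t][j] for t in TAGS for j, w in enumerate(WORDS)}

    def unnormalized(tags):
        """Product of the local factor scores along the chain."""
        s = 1.0
        for i, t in enumerate(tags):
            s *= emit[(t, WORDS[i])]
            if i > 0:
                s *= trans[(tags[i - 1], t)]
        return s

    # Z sums the unnormalized score of every possible tag sequence.
    Z = sum(unnormalized(t) for t in itertools.product(TAGS, repeat=len(WORDS)))
    print(unnormalized(("n", "v", "p")) / Z)  # global probability of tags n v p

For the tag sequence n v p the product contains the same first few factor values (4, 8, 5, 3) that appear on the slide, and dividing by Z turns the scores into probabilities that sum to one.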

  14. Markov Random Field (MRF). Joint distribution over tags X_i and words W_i. The individual factors aren’t necessarily probabilities. p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …).
  [Figure: same factor graph and factor tables as above.]

  15. Hidden Markov Model. But sometimes we choose to make the factors probabilities: constrain each row of a factor to sum to one. Now Z = 1. p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …).
  Transition probabilities (each row sums to one):
         v     n     p     d
    v    .1    .4    .2    .3
    n    .8    .1    .1    0
    p    .2    .3    .2    .3
    d    .2    .8    0     0
  Emission probabilities (only the first few words shown):
         time  flies  like  …
    v    .2    .5     .2
    n    .3    .4     .2
    p    .1    .1     .3
    d    .1    .2     .1
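A small sketch of my own showing the HMM version of the same computation: the joint probability is a product of transition and emission probabilities, and no division by Z is needed. The start distribution over the first tag is an assumption (the slide does not show it); the other numbers are taken from the HMM tables above.

    # Only the entries needed for the prefix "time flies like" with tags n v p are listed.
    start = {"v": 0.3, "n": 0.3, "p": 0.2, "d": 0.2}   # hypothetical start distribution
    trans = {("n", "v"): 0.8, ("v", "p"): 0.2}          # from the transition table above
    emit = {("n", "time"): 0.3, ("v", "flies"): 0.5, ("p", "like"): 0.3}  # from the emission table

    def hmm_joint(tags, words):
        """p(tags, words) = p(t1) * prod_i p(t_i | t_{i-1}) * prod_i p(w_i | t_i)."""
        p = start[tags[0]]
        for i, (t, w) in enumerate(zip(tags, words)):
            if i > 0:
                p *= trans[(tags[i - 1], t)]
            p *= emit[(t, w)]
        return p

    print(hmm_joint(("n", "v", "p"), ("time", "flies", "like")))

Because every row of every table is a proper conditional distribution, the probabilities of all complete assignments sum to one, i.e. Z = 1 with no explicit normalization.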

  16. Markov Random Field (MRF). Joint distribution over tags X_i and words W_i. p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …).
  [Figure: same factor graph and factor tables as on the earlier MRF slide.]

  17. Conditional Random Field (CRF). Conditional distribution over tags X_i given words w_i. The factors and Z are now specific to the sentence w. p(n, v, p, d, n | time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …).
  [Figure: same factor graph, now conditioned on the observed words “time flies like an arrow”, with sentence-specific tag–tag and tag–word factor tables.]
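In equation form, a standard way to write what the slide describes (my notation, not transcribed from the slide):

    p(\mathbf{x} \mid \mathbf{w}) \;=\; \frac{1}{Z(\mathbf{w})} \prod_{\alpha} \psi_\alpha(\mathbf{x}_\alpha, \mathbf{w}),
    \qquad
    Z(\mathbf{w}) \;=\; \sum_{\mathbf{x}'} \prod_{\alpha} \psi_\alpha(\mathbf{x}'_\alpha, \mathbf{w})

So unlike the MRF, the normalizer Z(w) must be recomputed for every input sentence w.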

  18. BACKGROUND: BINARY CLASSIFICATION

  19. Linear Models for Classification. Key idea: try to learn this hyperplane directly.
  • There are lots of commonly used linear classifiers. These include:
    – Perceptron
    – (Binary) Logistic Regression
    – Naïve Bayes (under certain conditions)
    – (Binary) Support Vector Machines
  Directly modeling the hyperplane would use a decision function h(x) = sign(θ^T x), for y ∈ {−1, +1}.

  20. (Online) Perceptron Algorithm. Data: inputs are continuous vectors of length M; outputs are discrete.
  Prediction: output determined by hyperplane: ŷ = h_θ(x) = sign(θ^T x), where sign(a) = +1 if a ≥ 0, and −1 otherwise.
  Learning: iterative procedure:
  • initialize parameters to the vector of all zeroes
  • while not converged:
    • receive the next example (x^(i), y^(i))
    • predict y’ = h(x^(i))
    • if positive mistake: add x^(i) to the parameters
    • if negative mistake: subtract x^(i) from the parameters
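A minimal NumPy sketch of this update rule, written as an illustration rather than taken from the course (the bias term is omitted; it could be added as a constant feature):

    import numpy as np

    def perceptron(X, y, epochs=10):
        """Online perceptron: X is (N, M), y is in {-1, +1}; returns the weight vector theta."""
        theta = np.zeros(X.shape[1])          # initialize parameters to all zeroes
        for _ in range(epochs):               # "while not converged", capped at a fixed epoch budget
            for x_i, y_i in zip(X, y):        # receive the next example
                y_hat = 1 if theta @ x_i >= 0 else -1   # predict with sign(theta^T x)
                if y_hat != y_i:              # on a mistake, move theta toward the correct side
                    theta += y_i * x_i        # +x on a positive mistake, -x on a negative mistake
        return theta

    # Tiny usage example on linearly separable toy data.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    print(perceptron(X, y))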

  21. (Binary) Logistic Regression. Data: inputs are continuous vectors of length M; outputs are discrete.
  Model: logistic function applied to the dot product of parameters with the input vector:
    p_θ(y = 1 | x) = 1 / (1 + exp(−θ^T x))
  Learning: find the parameters that minimize some objective function:
    θ* = argmin_θ J(θ)
  Prediction: output is the most probable class:
    ŷ = argmax_{y ∈ {0, 1}} p_θ(y | x)
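A short sketch of my own, taking J(θ) to be the average negative log-likelihood and minimizing it with plain batch gradient descent (one possible choice; the slide leaves the objective unspecified):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fit_logistic(X, y, lr=0.1, steps=1000):
        """Gradient descent on the average negative log-likelihood; y is in {0, 1}."""
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            p = sigmoid(X @ theta)            # p_theta(y = 1 | x) for every row of X
            grad = X.T @ (p - y) / len(y)     # gradient of the average negative log-likelihood
            theta -= lr * grad
        return theta

    def predict(theta, X):
        """Most probable class: 1 when p_theta(y = 1 | x) >= 0.5, else 0."""
        return (sigmoid(X @ theta) >= 0.5).astype(int)

    X = np.array([[0.5, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
    y = np.array([1, 1, 0, 0])
    theta = fit_logistic(X, y)
    print(predict(theta, X))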

  22. Support Vector Machines (SVMs): hard-margin SVM (primal); hard-margin SVM (Lagrangian dual); soft-margin SVM (primal); soft-margin SVM (Lagrangian dual). [The slide shows the four optimization problems, which were not captured in this transcript.]
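Since the optimization problems themselves were not captured, here are the standard textbook statements of the two primal problems (my notation, which may differ from the slide’s):

    \min_{\mathbf{w}, b} \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
    \quad \text{s.t.} \quad y^{(i)}\big(\mathbf{w}^\top \mathbf{x}^{(i)} + b\big) \ge 1 \;\; \forall i
    \qquad \text{(hard margin)}

    \min_{\mathbf{w}, b, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C \sum_i \xi_i
    \quad \text{s.t.} \quad y^{(i)}\big(\mathbf{w}^\top \mathbf{x}^{(i)} + b\big) \ge 1 - \xi_i,\;\; \xi_i \ge 0 \;\; \forall i
    \qquad \text{(soft margin)}

The Lagrangian duals listed on the slide are obtained from these by introducing multipliers for the margin constraints.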

  23. Decision Trees. Figure from Tom Mitchell.

  24. Binary and Multiclass Classification
  Supervised Learning:
  Binary Classification:
  Multiclass Classification:
  [The slide’s formal definitions of these settings were not captured in this transcript; see the equations below.]
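For reference, a standard formal statement of the three settings (my wording, not transcribed from the slide):

    \text{Supervised learning: given } \mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}
    \text{ with } \mathbf{x}^{(i)} \in \mathcal{X},\; y^{(i)} \in \mathcal{Y}, \text{ learn } h : \mathcal{X} \to \mathcal{Y}.

    \text{Binary classification: } \mathcal{Y} = \{-1, +1\}.
    \qquad
    \text{Multiclass classification: } \mathcal{Y} = \{1, \ldots, K\},\; K > 2.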

  25. Outline
  Settings:
    A. Multiclass Classification
    B. Hierarchical Classification
    C. Extreme Classification
  Reductions (Binary → Multiclass):
    1. one-vs-all (OVA) (sketched in the code after this list)
    2. all-vs-all (AVA)
    3. classification tree
    4. error correcting output codes (ECOC)
  Why?
  – multiclass is the simplest structured prediction setting
  – key insights in the simple reductions are analogous to later (less simple) concepts
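As a concrete illustration of the first reduction, here is a hypothetical one-vs-all sketch of mine: it reduces a K-class problem to K binary perceptrons (the same perceptron as in the earlier sketch, repeated so this block is self-contained) and predicts the class whose scorer is most confident.

    import numpy as np

    def perceptron(X, y, epochs=10):
        """Binary perceptron with labels in {-1, +1}; returns a weight vector."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (theta @ x_i) <= 0:   # mistake (a score of exactly 0 counts as one)
                    theta += y_i * x_i
        return theta

    def ova_train(X, y, num_classes):
        """One binary problem per class k: relabel +1 if y == k, else -1."""
        return [perceptron(X, np.where(y == k, 1, -1)) for k in range(num_classes)]

    def ova_predict(thetas, X):
        """Pick the class whose binary scorer gives the highest raw score theta_k^T x."""
        scores = np.stack([X @ theta for theta in thetas], axis=1)   # shape (N, K)
        return scores.argmax(axis=1)

    # Toy 3-class usage example.
    X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5], [-2.0, -2.0], [-1.5, -1.0]])
    y = np.array([0, 0, 1, 1, 2, 2])
    thetas = ova_train(X, y, num_classes=3)
    print(ova_predict(thetas, X))

The later reductions in the list (all-vs-all, classification trees, ECOC) differ only in how the binary subproblems are constructed and how their outputs are combined.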
