Beyond binary classification


  1. Beyond binary classification
Subhransu Maji
CMPSCI 689: Machine Learning
19 February 2015

  2. Administrivia
Mini-project 1 posted
‣ One of three
‣ Decision trees and perceptrons
‣ Theory and programming
‣ Due Wednesday, March 04, 11:55pm 4:00pm
➡ Turn in a hard copy in the CS office
‣ Must be done individually, but feel free to discuss with others
‣ Start early …

  3. Today’s lecture
Learning with imbalanced data
Beyond binary classification
‣ Multi-class classification
‣ Ranking
‣ Collective classification

  4. Learning with imbalanced data
One class might be rare (e.g., face detection).
Mistakes on the rare class cost more:
‣ cost of misclassifying y = +1 is α (> 1)
‣ cost of misclassifying y = −1 is 1
Why? We want a better F-score (or average precision).

binary classification: $\mathbb{E}_{(\mathbf{x},y) \sim D}\left[ [f(\mathbf{x}) \neq y] \right]$
α-weighted binary classification: $\mathbb{E}_{(\mathbf{x},y) \sim D}\left[ \alpha^{[y=+1]} \, [f(\mathbf{x}) \neq y] \right]$

Suppose we have an algorithm to train a binary classifier. Can we use it to train the α-weighted version?
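
As a small illustration (not from the slides), here is one way the α-weighted empirical error could be computed; the function name and NumPy usage are my own:

```python
import numpy as np

def alpha_weighted_error(y_true, y_pred, alpha):
    """Empirical alpha-weighted error: mistakes on y = +1 cost alpha, mistakes on y = -1 cost 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    weights = np.where(y_true == 1, alpha, 1.0)   # alpha^[y = +1]
    return np.mean(weights * (y_pred != y_true))

# Example: with alpha = 10, one missed positive costs as much as ten missed negatives.
print(alpha_weighted_error([1, 1, -1, -1], [1, -1, -1, 1], alpha=10.0))  # (10 + 1) / 4 = 2.75
```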

  5. Training by sub-sampling
Sub-sampling algorithm
Input: D, α    Output: a sample (x, y) ∼ D^α
‣ While true:
➡ Sample (x, y) ∼ D
➡ Sample t ∼ uniform(0, 1)
➡ If y > 0 or t < 1/α, return (x, y)
We have sub-sampled the negatives by 1/α.

Claim: binary classification on D^α with error ε corresponds to α-weighted binary classification on D with error αε.
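
A minimal sketch of the sub-sampling step in Python; `draw_from_D` is a hypothetical function supplied by the caller that returns one (x, y) pair drawn from the original distribution D:

```python
import random

def subsample(draw_from_D, alpha):
    """Draw one labeled example from the induced distribution D^alpha:
    keep every positive, keep each negative with probability 1/alpha."""
    while True:
        x, y = draw_from_D()              # (x, y) ~ D
        t = random.uniform(0.0, 1.0)
        if y > 0 or t < 1.0 / alpha:
            return x, y

# Usage sketch: build a binary training set from the weighted problem, then
# train any off-the-shelf binary learner on it.
# train_set = [subsample(draw_from_D, alpha=10.0) for _ in range(10000)]
```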

  6. Proof of the claim
Error on D with the α-weighted loss:

$$
\begin{aligned}
\mathbb{E}_{(\mathbf{x},y) \sim D}\left[ \ell^{\alpha}(\hat{y}, y) \right]
&= \sum_{\mathbf{x}} \left( D(\mathbf{x},+1)\, \alpha\, [\hat{y} \neq 1] + D(\mathbf{x},-1)\, [\hat{y} \neq -1] \right) \\
&= \alpha \sum_{\mathbf{x}} \left( D(\mathbf{x},+1)\, [\hat{y} \neq 1] + \tfrac{1}{\alpha} D(\mathbf{x},-1)\, [\hat{y} \neq -1] \right) \\
&= \alpha \sum_{\mathbf{x}} \left( D^{\alpha}(\mathbf{x},+1)\, [\hat{y} \neq 1] + D^{\alpha}(\mathbf{x},-1)\, [\hat{y} \neq -1] \right) \\
&= \alpha \epsilon
\end{aligned}
$$

That is, an error of ε for binary classification on D^α corresponds to an α-weighted error of αε on D.

  7. Modifying training
To train, simply:
‣ Subsample negatives and train a binary classifier.
‣ Alternatively, supersample positives and train a binary classifier.
‣ Which one is better?
For some learners we don’t need to keep copies of the positives:
‣ Decision tree ➡ Modify accuracy to the weighted version
‣ kNN classifier ➡ Take weighted votes during prediction
‣ Perceptron?
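
For the kNN bullet, a sketch of weighted voting at prediction time; the function name and the specific weighting (each positive neighbor's vote counts α) are my own illustration:

```python
import numpy as np

def knn_predict_weighted(X_train, y_train, x, k, alpha):
    """kNN prediction where each positive neighbor's vote counts alpha times."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    votes = np.where(y_train[nn] == 1, alpha, 1.0)  # weight of each neighbor's vote
    score = np.sum(votes * y_train[nn])             # signed, weighted vote total
    return 1 if score > 0 else -1
```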

  8. Overview
Learning with imbalanced data
Beyond binary classification
‣ Multi-class classification
‣ Ranking
‣ Collective classification

  9. Multi-class classification
Labels come from a set of K classes.
Some classifiers are inherently multi-class:
‣ kNN classifiers: vote among the K labels, pick the one with the highest vote (break ties arbitrarily)
‣ Decision trees: use multi-class histograms to determine the best feature to split on. At the leaves predict the most frequent label.
Question: can we take a binary classifier and turn it into a multi-class one?

  10. One-vs-all (OVA) classifier
Train K classifiers, each to distinguish one class from the rest.
Prediction: pick the class with the highest score:

$i \leftarrow \arg\max_i f_i(\mathbf{x})$  (score function $f_i$)

Example
‣ Perceptron: $i \leftarrow \arg\max_i \mathbf{w}_i^T \mathbf{x}$
➡ May have to calibrate the weights (e.g., fix the norm to 1) since we are comparing the scores of classifiers
➡ In practice, doing this right is tricky when there are a large number of classes
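
A minimal OVA sketch with perceptrons as the base learner; the class and training-loop details (no bias term, fixed number of passes, no score calibration) are my own simplifications:

```python
import numpy as np

class OVAPerceptron:
    """One-vs-all: K perceptrons, one per class; predict the class with the highest score."""

    def __init__(self, num_classes, num_features):
        self.W = np.zeros((num_classes, num_features))  # one weight vector per class

    def fit(self, X, y, passes=10):
        for _ in range(passes):
            for x, label in zip(X, y):
                for k in range(self.W.shape[0]):
                    target = 1 if label == k else -1   # class k vs the rest
                    if target * (self.W[k] @ x) <= 0:  # perceptron mistake
                        self.W[k] += target * x
        return self

    def predict(self, X):
        return np.argmax(X @ self.W.T, axis=1)  # i <- argmax_i w_i^T x
```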

  11. One-vs-one (OVO) classifier
Train K(K−1)/2 classifiers, each to distinguish one class from another.
Each classifier votes for the winning class in a pair.
The class with the most votes wins:

$i \leftarrow \arg\max_i \left( \sum_j f_{ij}(\mathbf{x}) \right)$, with $f_{ji} = -f_{ij}$

Example
‣ Perceptron: $i \leftarrow \arg\max_i \left( \sum_j \mathrm{sign}(\mathbf{w}_{ij}^T \mathbf{x}) \right)$, with $\mathbf{w}_{ji} = -\mathbf{w}_{ij}$
➡ Calibration is not an issue since we are taking the sign of the score
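
A sketch of OVO voting at prediction time; `pairwise_w` is a hypothetical dict mapping a pair (i, j) with i < j to a weight vector trained on classes i vs j (w_ji is implicitly −w_ij, so only one direction is stored):

```python
import numpy as np

def ovo_predict(x, pairwise_w, num_classes):
    """One-vs-one prediction: for each stored pair (i, j), a positive score
    w_ij . x votes for class i, otherwise the vote goes to class j."""
    votes = np.zeros(num_classes)
    for (i, j), w_ij in pairwise_w.items():
        if w_ij @ x > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    return int(np.argmax(votes))
```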

  12. Directed acyclic graph (DAG) classifier
DAG SVM [Platt et al., NIPS 2000]
‣ Faster testing: O(K) instead of O(K(K−1)/2)
‣ Has some theoretical guarantees
[Figure from Platt et al.]
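
One common way to read the DAG evaluation scheme is as a sequential elimination over the pairwise classifiers; this is a sketch under my own naming, reusing the `pairwise_w` dict (NumPy weight vectors) from the OVO sketch above:

```python
def dag_predict(x, pairwise_w, num_classes):
    """DAG-style evaluation: keep a sorted list of candidate classes; at each step
    run the classifier for (first, last) and eliminate the loser.
    This needs only K - 1 binary evaluations instead of K(K-1)/2."""
    candidates = list(range(num_classes))
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]   # list stays sorted, so i < j
        if pairwise_w[(i, j)] @ x > 0:         # classifier i-vs-j prefers i
            candidates.pop()                    # eliminate j
        else:
            candidates.pop(0)                   # eliminate i
    return candidates[0]
```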

  13. Overview
Learning with imbalanced data
Beyond binary classification
‣ Multi-class classification
‣ Ranking
‣ Collective classification

  14. Ranking

  15. Ranking
Input: a query (e.g., “cats”)
Output: a sorted list of items

How should we measure performance?
The loss function is trickier than in the binary classification case:
‣ Example 1: All items on the first page should be relevant
‣ Example 2: All relevant items should be ahead of irrelevant items

  16. Learning to rank
For simplicity, let’s assume we are learning to rank for a given query.
Learning to rank:
‣ Input: a list of items
‣ Output: a function that takes a set of items and returns a sorted list

Approaches
‣ Pointwise approach:
➡ Assumes that each document has a numerical score.
➡ Learn a model to predict the score (e.g., linear regression).
‣ Pairwise approach:
➡ Ranking is approximated by a classification problem.
➡ Learn a binary classifier that can tell which item is better given a pair.

  17. Naive rank train
Create a dataset with binary labels ($\mathbf{x}_{ij}$: features for comparing items i and j)
‣ Initialize: $D \leftarrow \emptyset$
‣ For every i and j such that i ≠ j:
➡ If item i is more relevant than item j, add a positive point: $D \leftarrow D \cup \{(\mathbf{x}_{ij}, +1)\}$
➡ If item i is less relevant than item j, add a negative point: $D \leftarrow D \cup \{(\mathbf{x}_{ij}, -1)\}$
Learn a binary classifier f on D.

Ranking
‣ Initialize: $\mathrm{score} \leftarrow [0, 0, \ldots, 0]$
‣ For every i and j such that i ≠ j:
➡ Calculate the prediction: $\hat{y} \leftarrow f(\mathbf{x}_{ij})$
➡ Update the scores: $\mathrm{score}_i \leftarrow \mathrm{score}_i + \hat{y}$, $\mathrm{score}_j \leftarrow \mathrm{score}_j - \hat{y}$
‣ $\mathrm{ranking} \leftarrow \mathrm{argsort}(\mathrm{score})$
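
A Python sketch of the two procedures above; `phi` (the pairwise comparison features) and `f` (the trained binary classifier) are assumed to be supplied by the caller, and equally relevant pairs are simply skipped, which the slide leaves implicit:

```python
import numpy as np

def make_pairwise_dataset(items, relevance, phi):
    """Build the binary dataset: phi(a, b) are comparison features, label +1 if
    item i is more relevant than item j, -1 if it is less relevant."""
    D = []
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j or relevance[i] == relevance[j]:
                continue
            label = +1 if relevance[i] > relevance[j] else -1
            D.append((phi(items[i], items[j]), label))
    return D

def rank(items, f, phi):
    """Score every item by summing pairwise predictions, then sort."""
    score = np.zeros(len(items))
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j:
                continue
            y_hat = f(phi(items[i], items[j]))   # +1 means "i beats j"
            score[i] += y_hat
            score[j] -= y_hat
    return np.argsort(-score)                     # highest score first
```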

  18. Problems with naive ranking
Naive rank train works well for bipartite ranking problems:
‣ where the goal is to predict whether an item is relevant or not, and there is no notion of one item being more relevant than another.
A better strategy is to account for the positions of the items in the list.

Denote a ranking by σ:
‣ if item u appears before item v, we have $\sigma_u < \sigma_v$.
Let the space of all permutations of M objects be $\Sigma_M$.
A ranking function maps M items to a permutation: $f : \mathcal{X} \rightarrow \Sigma_M$
A cost function ω:
‣ $\omega(i, j)$: the cost of placing an item that belongs at position i at position j

Ranking loss: $\ell(\sigma, \hat{\sigma}) = \sum_{u \neq v} [\sigma_u < \sigma_v]\,[\hat{\sigma}_v < \hat{\sigma}_u]\,\omega(u, v)$

ω-ranking: $\min_f \, \mathbb{E}_{(X, \sigma) \sim D}\left[\ell(\sigma, \hat{\sigma})\right]$, where $\hat{\sigma} = f(X)$

  19. ω-rank loss functions
To be a valid loss function, ω must be:
‣ Symmetric: $\omega(i, j) = \omega(j, i)$
‣ Monotonic: $\omega(i, j) \leq \omega(i, k)$ if $i < j < k$ or $k < j < i$
‣ Satisfy the triangle inequality: $\omega(i, j) + \omega(j, k) \geq \omega(i, k)$

Examples:
‣ Kemeny loss: $\omega(i, j) = 1$ for $i \neq j$
‣ Top-K loss: $\omega(i, j) = \begin{cases} 1 & \text{if } \min(i, j) \leq K,\ i \neq j \\ 0 & \text{otherwise} \end{cases}$
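
The two example losses and the ranking loss from the previous slide written out as a sketch; I follow the slide's formula literally (ω applied to the pair indices u, v), and whether positions are 0- or 1-indexed in the top-K case is an assumption left to the caller:

```python
def kemeny_omega(i, j):
    """Kemeny loss: every swapped pair costs 1."""
    return 1.0 if i != j else 0.0

def topk_omega(i, j, K):
    """Top-K loss: a swap only costs if it involves one of the top K positions."""
    return 1.0 if (min(i, j) <= K and i != j) else 0.0

def ranking_loss(sigma, sigma_hat, omega):
    """omega-ranking loss: sum omega(u, v) over pairs that sigma orders as
    u-before-v but sigma_hat orders as v-before-u.
    sigma[u] is the position of item u (smaller = earlier)."""
    M = len(sigma)
    return sum(
        omega(u, v)
        for u in range(M)
        for v in range(M)
        if u != v and sigma[u] < sigma[v] and sigma_hat[v] < sigma_hat[u]
    )
```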

  20. ω-rank train
Create a dataset with binary labels ($\mathbf{x}_{ij}$: features for comparing items i and j)
‣ Initialize: $D \leftarrow \emptyset$
‣ For every i and j such that i ≠ j:
➡ If σᵢ < σⱼ (item i is more relevant), add a positive point: $D \leftarrow D \cup \{(\mathbf{x}_{ij}, +1, \omega(i, j))\}$
➡ If σᵢ > σⱼ (item j is more relevant), add a negative point: $D \leftarrow D \cup \{(\mathbf{x}_{ij}, -1, \omega(i, j))\}$
Learn a binary classifier f on D (each instance has a weight).

Ranking
‣ Initialize: $\mathrm{score} \leftarrow [0, 0, \ldots, 0]$
‣ For every i and j such that i ≠ j:
➡ Calculate the prediction: $\hat{y} \leftarrow f(\mathbf{x}_{ij})$
➡ Update the scores: $\mathrm{score}_i \leftarrow \mathrm{score}_i + \hat{y}$, $\mathrm{score}_j \leftarrow \mathrm{score}_j - \hat{y}$
‣ $\mathrm{ranking} \leftarrow \mathrm{argsort}(\mathrm{score})$
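
A sketch of the weighted data-construction step; `phi` and `omega` are assumed to be given, and `sigma[i]` holds item i's true position:

```python
def make_weighted_pairwise_dataset(sigma, phi, omega):
    """Build the weighted binary dataset: for every ordered pair (i, j) of items,
    the label says which one the true ranking sigma places first, and the
    instance weight is omega(i, j)."""
    M = len(sigma)
    D = []
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            label = +1 if sigma[i] < sigma[j] else -1
            D.append((phi(i, j), label, omega(i, j)))
    return D

# A weight-aware binary learner (e.g. the weighted tricks from earlier, generalized
# to per-example weights) is then trained on D; ranking at test time is the same
# score-and-argsort procedure as in the naive version.
```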

  21. Overview
Learning with imbalanced data
Beyond binary classification
‣ Multi-class classification
‣ Ranking
‣ Collective classification

  22. Collective classification
Predicting multiple correlated variables.
Input/output: each vertex carries features and a label, $(\mathbf{x}, k) \in \mathcal{X} \times [K]$.
Let $\mathcal{G}(\mathcal{X} \times [K])$ be the set of all graphs whose vertices are annotated with features and labels.
Objective: learn $f : \mathcal{G}(\mathcal{X}) \rightarrow \mathcal{G}([K])$ minimizing $\mathbb{E}_{(V, E) \sim D}\left[ \sum_{v \in V} [\hat{y}_v \neq y_v] \right]$.

  23. Collective classification
Predicting multiple correlated variables.
‣ $\hat{y}_v \leftarrow f(\mathbf{x}_v)$: independent per-vertex predictions can be noisy.
‣ $\mathbf{x}_v \leftarrow [\mathbf{x}_v, \phi([K], \mathrm{nbhd}(v))]$: use the labels of nearby vertices as extra features, e.g., a histogram of labels in a 5×5 neighborhood.
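
On an image grid, the 5×5 label histogram mentioned on the slide could be computed like this; the grid layout and function name are my own illustration:

```python
import numpy as np

def label_histogram_features(labels, r, c, K, radius=2):
    """Histogram of the (predicted) labels in the (2*radius+1) x (2*radius+1)
    neighborhood around pixel (r, c); labels is a 2D array of values in 0..K-1."""
    patch = labels[max(0, r - radius):r + radius + 1,
                   max(0, c - radius):c + radius + 1]
    hist = np.bincount(patch.ravel(), minlength=K).astype(float)
    return hist / hist.sum()   # normalized label histogram, length K

# The augmented feature for vertex (r, c) is then the concatenation
# [x_{rc}, label_histogram_features(...)], as on the slide.
```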

  24. Stacking classifiers
Train two classifiers:
‣ The first is trained to predict the output from the input:
$\hat{y}_v^{(1)} \leftarrow f_1(\mathbf{x}_v)$
‣ The second is trained on the input and the output of the first classifier:
$\hat{y}_v^{(2)} \leftarrow f_2\left( \mathbf{x}_v, \phi\left( \hat{y}^{(1)}, \mathrm{nbhd}(v) \right) \right)$
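
A minimal sketch of the two-stage prediction; `f1`, `f2`, and `neighborhood_phi` (which summarizes the first-stage labels around vertex v, e.g. the histogram above) are assumed to be provided, and how f2 is trained (e.g. on held-out first-stage predictions) is not shown:

```python
import numpy as np

def stacked_predict(X, f1, f2, neighborhood_phi):
    """Stacked prediction: f1 gives an initial label per vertex, then f2 re-predicts
    each vertex from its own features plus a summary of f1's predictions nearby."""
    y1 = np.array([f1(x_v) for x_v in X])                    # first-stage labels
    return np.array([
        f2(np.concatenate([x_v, neighborhood_phi(y1, v)]))   # second-stage input
        for v, x_v in enumerate(X)
    ])
```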
