ML in Practice: Dealing with imbalanced data
CMSC 422 MARINE CARPUAT
marine@cs.umd.edu
Topics

A few practical issues
– CIML Chapter 4
Dealing with imbalanced data distributions
– Evaluation metrics (CIML 4.5)
– Learning with imbalanced data (CIML 5.1)
– Learning algorithms can’t compensate for useless training examples
– Feature design often has a bigger impact on performance than tweaking the learning algorithm
Which classifier is the best?
– this result table alone cannot give us the answer
– solution: statistical hypothesis testing

Classifier   Accuracy on test set
Team A       80.00
Team B       79.90
Team C       79.00
Team D       78.00
Is the difference in accuracy between A and B statistically significant? What is the probability that the observed difference in performance was due to chance?
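One standard tool for answering this question is an approximate randomization (paired permutation) test: randomly swap the two systems' outputs per example and see how often a gap at least as large as the observed one arises by chance. A minimal sketch, assuming we have each system's per-example correctness on the same test set (function name and toy data are illustrative):

```python
import random

def approx_randomization_pvalue(correct_a, correct_b, trials=10_000, seed=0):
    """Two-sided approximate randomization test for paired system outputs.
    correct_a[i] / correct_b[i] are 1 if system A / B got test example i right."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:      # randomly swap the two systems' outputs
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)   # smoothed p-value estimate

# Toy check: two systems that differ on a single example out of 100
a = [1] * 80 + [0] * 20
b = [1] * 79 + [0] * 21
p = approx_randomization_pvalue(a, b)   # a one-example gap is not significant here
```

A small p-value (e.g. below 0.05) suggests the observed accuracy difference is unlikely to be due to chance alone.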
– Is the data too noisy?
– Is the learning problem too hard?
– Is your implementation buggy?
Evaluation metrics (CIML 4.5)
– Given a medical record, predict whether the patient has cancer or not
– Given a document collection and a query, find the documents in the collection that are relevant to the query
Imagine we are addressing a document retrieval task for a given query, where +1 means that the document is relevant and -1 means that the document is not relevant. We can categorize predictions as:

                  Gold label = +1        Gold label = -1
Prediction = +1   true positive (tp)     false positive (fp)
Prediction = -1   false negative (fn)    true negative (tn)
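Given gold labels and predictions in {+1, -1}, the four cells of this table can be tallied directly; a minimal sketch (function name and toy labels are illustrative):

```python
def confusion_counts(gold, pred):
    """Tally tp/fp/fn/tn for binary labels in {+1, -1}."""
    tp = sum(1 for g, p in zip(gold, pred) if g == +1 and p == +1)
    fp = sum(1 for g, p in zip(gold, pred) if g == -1 and p == +1)
    fn = sum(1 for g, p in zip(gold, pred) if g == +1 and p == -1)
    tn = sum(1 for g, p in zip(gold, pred) if g == -1 and p == -1)
    return tp, fp, fn, tn

gold = [+1, +1, -1, -1, +1]
pred = [+1, -1, -1, +1, +1]
counts = confusion_counts(gold, pred)   # tp=2, fp=1, fn=1, tn=1
```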
Precision P = tp / (tp + fp) and recall R = tp / (tp + fn) can be combined into the weighted F-measure:

F_β = (1 + β²) · P · R / (β² · P + R)

equivalently, F = 1 / (α · (1/P) + (1 − α) · (1/R)) with α = 1 / (1 + β²)

– i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
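The precision, recall, and weighted F-measure computations can be sketched as follows (function name and counts are illustrative):

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """P = tp/(tp+fp), R = tp/(tp+fn), F_beta = (1+beta^2)*P*R / (beta^2*P + R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f

p, r, f1 = precision_recall_fbeta(tp=8, fp=2, fn=4)
# with beta = 1 this reduces to the harmonic mean F = 2PR / (P + R)
```

Choosing β > 1 weights recall more heavily; β < 1 weights precision more heavily.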
Learning with imbalanced data (CIML 5.1)
– “needle in a haystack” problems
– e.g., find fraudulent transactions in credit card histories

Solutions:
– Subsampling
– Weighting
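The subsampling strategy keeps every rare positive example but only a fraction of the negatives. A minimal sketch, assuming a cost ratio alpha ≥ 1 and keeping each negative with probability 1/alpha (function name, keep rate, and toy data are illustrative):

```python
import random

def subsample_negatives(examples, alpha, seed=0):
    """Keep every positive example, and each negative with probability 1/alpha.
    A sketch of the subsampling strategy for imbalanced data."""
    rng = random.Random(seed)
    return [(x, y) for (x, y) in examples
            if y == +1 or rng.random() < 1.0 / alpha]

# Toy data: 1 positive "needle", 1000 negative "haystack" examples
data = [("needle", +1)] + [(f"hay{i}", -1) for i in range(1000)]
kept = subsample_negatives(data, alpha=2)   # roughly half the negatives survive
```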
Problem setting

Input
– a loss function ℓ(y, f(x)), where y is the truth and f(x) is the system’s prediction
  – e.g. ℓ(y, f(x)) = 0 if y = f(x), 1 otherwise
  – captures our notion of what is important to learn
– training examples {(x_1, y_1), …, (x_N, y_N)} of an unknown target function f, drawn from an unknown data distribution D

Output
– a function f that does well
  – as measured by loss ℓ
  – on future examples that are also drawn from D
– i.e., ε, the expected loss of f over D with respect to ℓ, should be small:

  ε ≜ E_(x,y)~D [ℓ(y, f(x))] = Σ_(x,y) D(x, y) · ℓ(y, f(x))
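Since D is unknown, the expected loss is typically estimated by averaging the loss over a held-out sample. A minimal sketch with 0/1 loss (function names and toy data are illustrative):

```python
def zero_one_loss(y, y_hat):
    """l(y, f(x)) = 0 if y == f(x), 1 otherwise."""
    return 0 if y == y_hat else 1

def empirical_loss(sample, f, loss=zero_one_loss):
    """Average loss of predictor f over a sample of (x, y) pairs --
    an estimate of the expected loss over the distribution D."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

def sign_predictor(x):
    """Toy predictor: +1 for non-negative inputs (illustrative)."""
    return +1 if x >= 0 else -1

sample = [(3, +1), (-2, -1), (0, -1), (5, +1)]
err = empirical_loss(sample, sign_predictor)   # errs only on (0, -1) -> 0.25
```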
We define the cost of misprediction as:
– α > 1 for y = +1
– 1 for y = -1
Given a good algorithm for solving the binary classification problem, how can I solve the α-weighted binary classification problem?
Two reductions:
– Subsample the negative class: more computationally efficient
– Duplicate each positive example α times: the base binary classifier might do better with more training examples
– Efficient implementations incorporate the weight in the algorithm, instead of explicitly duplicating data!
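The duplication-based reduction can be sketched as follows, assuming an integer α (function name and toy data are illustrative):

```python
def expand_weighted_dataset(examples, alpha):
    """Reduce alpha-weighted binary classification to ordinary binary
    classification: replicate each positive example alpha times, so a
    standard classifier minimizing unweighted 0/1 loss implicitly
    minimizes the alpha-weighted loss. Assumes alpha is a positive integer."""
    out = []
    for x, y in examples:
        copies = alpha if y == +1 else 1
        out.extend([(x, y)] * copies)
    return out

data = [("a", +1), ("b", -1), ("c", +1)]
expanded = expand_weighted_dataset(data, alpha=3)
# positives appear 3 times each, the negative once -> 7 examples total
```

An efficient implementation would instead pass per-example weights into the learning algorithm, avoiding the memory cost of explicit duplication.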