SLIDE 1

ML in Practice: Dealing with imbalanced data

CMSC 422 MARINE CARPUAT

marine@cs.umd.edu

SLIDE 2

Topics

  • A few practical issues
    – CIML Chapter 4
  • Dealing with imbalanced data distributions
    – Evaluation metrics (CIML 4.5)
    – Learning with imbalanced data (CIML 5.1)

SLIDE 3

Practical Issues

  • “garbage in, garbage out”

– Learning algorithms can’t compensate for useless training examples

  • e.g., if we only have irrelevant features

– Feature design often has a bigger impact on performance than tweaking the learning algorithm

SLIDE 4

Practical Issues

Which classifier is the best?
  – This result table alone cannot give us the answer
  – Solution: statistical hypothesis testing

Classifier    Accuracy on test set
Team A        80.00
Team B        79.90
Team C        79.00
Team D        78.00

SLIDE 5

Practical Issues

Is the difference in accuracy between A and B statistically significant? What is the probability that the observed difference in performance was due to chance?

Classifier    Accuracy on test set
Team A        80.00
Team B        79.90
Team C        79.00
Team D        78.00

SLIDE 6

A confidence of 95%

  • does NOT mean

“There is a 95% chance that classifier A is better than classifier B”

  • It means

“If I run this experiment 100 times, I expect A to perform better than B 95 times.”
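
The slides call for statistical hypothesis testing without prescribing a particular test. One common option for comparing two classifiers scored on the same test set is a paired bootstrap; the sketch below is a minimal illustration (the function and argument names are assumptions, not from the slides), given 0/1 arrays recording whether each classifier got each test example right.

```python
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_resamples=10000, seed=0):
    """Fraction of bootstrap resamples of the test set on which
    classifier A's accuracy exceeds classifier B's."""
    rng = np.random.default_rng(seed)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # resample test examples with replacement
        if correct_a[idx].mean() > correct_b[idx].mean():
            wins += 1
    return wins / n_resamples
```

A value near 0.5 suggests the observed gap could easily be due to which test examples happened to be drawn; a value near 1.0 suggests A's advantage holds up across resamples.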

SLIDE 7

Practical Issues: Debugging

  • You’ve implemented a learning algorithm; you try it on some train/dev/test data, but it doesn’t seem to learn.

  • What’s going on?

  – Is the data too noisy?
  – Is the learning problem too hard?
  – Is your implementation buggy?

SLIDE 8

Practical Issues: Debugging

  • You probably have a bug

    – if the learning algorithm cannot overfit the training data
    – if the predictions are incorrect on a toy 2D dataset hand-crafted to be learnable (see the sketch below)
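
A minimal sketch of the second check, assuming only NumPy: a hand-crafted, linearly separable 2D dataset, with a basic perceptron standing in for the learner being debugged. A correct implementation should drive training error to zero here; if it cannot, suspect a bug.

```python
import numpy as np

# Hand-crafted, linearly separable toy data: the label is the sign of x1 + x2.
X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 0.5],
              [-2.0, -1.0], [-1.5, -2.0], [-0.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Basic perceptron (stand-in for the algorithm being debugged).
w, b = np.zeros(2), 0.0
for epoch in range(100):
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:     # misclassified: apply perceptron update
            w, b = w + yi * xi, b + yi

train_acc = np.mean(np.sign(X @ w + b) == y)
print("training accuracy:", train_acc)        # expect 1.0 on this separable toy set
```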

SLIDE 9

Topics

  • A few practical issues
    – CIML Chapter 4
  • Dealing with imbalanced learning problems
    – Evaluation metrics (CIML 4.5)
    – Learning with imbalanced data (CIML 5.1)

SLIDE 10

Evaluation metrics: beyond accuracy/error

  • Example 1

    – Given a medical record
    – Predict whether the patient has cancer or not

  • Example 2

    – Given a document collection and a query
    – Find documents in the collection that are relevant to the query

  • Accuracy is not a good metric when some errors matter more than others!

SLIDE 11

The 2-by-2 contingency table

                  Gold label = +1    Gold label = -1
Prediction = +1   tp                 fp
Prediction = -1   fn                 tn

Imagine we are addressing a document retrieval task for a given query, where +1 means that the document is relevant and -1 means that the document is not relevant. We can categorize predictions as:

  • true/false positives
  • true/false negatives

SLIDE 12

Precision and recall

  • Precision: % of positive predictions that are correct
  • Recall: % of positive gold labels that are found

                  Gold label = +1    Gold label = -1
Prediction = +1   tp                 fp
Prediction = -1   fn                 tn
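
A minimal sketch of computing these quantities from predicted and gold labels in {+1, -1}; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def precision_recall(pred, gold):
    """Precision and recall for the +1 (relevant) class, from the 2x2 contingency counts."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    tp = np.sum((pred == 1) & (gold == 1))    # relevant documents we retrieved
    fp = np.sum((pred == 1) & (gold == -1))   # irrelevant documents we retrieved
    fn = np.sum((pred == -1) & (gold == 1))   # relevant documents we missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```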

SLIDE 13

A combined measure: F

  • A combined measure that assesses the P/R tradeoff is the F measure
  • People usually use the balanced F-1 measure
    – i.e., with β = 1 (that is, α = ½): F1 = 2PR / (P + R)

F = 1 / (α · (1/P) + (1 − α) · (1/R)) = ((β² + 1) · P · R) / (β² · P + R)
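
Continuing the sketch above (an illustration, not from the slides), P and R plug directly into the general formula; β = 1 recovers the balanced F1.

```python
def f_measure(precision, recall, beta=1.0):
    """General F measure; beta = 1 gives the balanced F1 = 2PR / (P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# e.g. f_measure(0.5, 0.25) == 2 * 0.5 * 0.25 / 0.75 ≈ 0.33
```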

SLIDE 14

Topics

  • A few practical issues
    – CIML Chapter 4
  • Dealing with imbalanced learning problems
    – Evaluation metrics (CIML 4.5)
    – Learning with imbalanced data (CIML 5.1)

SLIDE 15

Imbalanced data distributions

  • Sometimes training examples are drawn from an imbalanced distribution

  • This results in an imbalanced training set
    – “needle in a haystack” problems
    – e.g., find fraudulent transactions in credit card histories

  • Why is this a big problem for the ML algorithms we know?

SLIDE 16

Learning with imbalanced data

  • We need to let the learning algorithm know that we care about some examples more than others!

  • 2 heuristics to balance the training data
    – Subsampling
    – Weighting

SLIDE 17

Recall: Machine Learning as Function Approximation

Problem setting

  • Set of possible instances 𝑋
  • Unknown target function 𝑓: 𝑋 → 𝑌
  • Set of function hypotheses 𝐻 = {ℎ | ℎ: 𝑋 → 𝑌}

Input

  • Training examples {(𝑥₁, 𝑦₁), …, (𝑥_N, 𝑦_N)} of unknown target function 𝑓

Output

  • Hypothesis ℎ ∈ 𝐻 that best approximates target function 𝑓

SLIDE 18

Recall: Loss Function

ℓ(𝑦, 𝑓(𝑥)), where 𝑦 is the truth and 𝑓(𝑥) is the system’s prediction

e.g. ℓ(𝑦, 𝑓(𝑥)) = 0 if 𝑦 = 𝑓(𝑥), 1 otherwise

Captures our notion of what is important to learn

SLIDE 19

Recall: Expected loss

  • 𝑓 should make good predictions
    – as measured by loss ℓ
    – on future examples that are also drawn from 𝐷

  • Formally
    – 𝜀, the expected loss of 𝑓 over 𝐷 with respect to ℓ, should be small

𝜀 ≜ 𝔼_(𝑥,𝑦)~𝐷 [ℓ(𝑦, 𝑓(𝑥))] = Σ_(𝑥,𝑦) 𝐷(𝑥, 𝑦) ℓ(𝑦, 𝑓(𝑥))
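
Since 𝐷 is unknown in practice, the expected loss is typically estimated by averaging ℓ over held-out examples assumed to be drawn from 𝐷. A minimal sketch with 0/1 loss (names are illustrative, not from the slides):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0/1 loss: 0 if the prediction matches the truth, 1 otherwise."""
    return float(y_true != y_pred)

def estimated_expected_loss(f, examples):
    """Estimate the expected loss of f by averaging the loss
    over (x, y) pairs assumed to be sampled from D."""
    return np.mean([zero_one_loss(y, f(x)) for x, y in examples])
```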

SLIDE 20

SLIDE 21

We define the cost of misprediction as:

  • α > 1 for y = +1
  • 1 for y = -1

Given a good algorithm for solving the binary classification problem, how can I solve the α-weighted binary classification problem?

SLIDE 22

Solution: Train a binary classifier on an induced distribution
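
A minimal sketch of the subsampling route (assuming the rare, expensive class is labeled +1): keep every positive example, keep each negative example with probability 1/α, and then train any ordinary binary classifier on the induced sample. The function and variable names are illustrative, not from the slides.

```python
import numpy as np

def subsample_negatives(X, y, alpha, seed=0):
    """Induce a training sample for the alpha-weighted problem:
    keep all positives, keep each negative with probability 1/alpha."""
    rng = np.random.default_rng(seed)
    keep = (y == 1) | (rng.random(len(y)) < 1.0 / alpha)
    return X[keep], y[keep]

# X_bal, y_bal = subsample_negatives(X_train, y_train, alpha=10.0)
# Any ordinary binary classifier can then be trained on (X_bal, y_bal).
```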

SLIDE 23

Subsampling optimality

  • Theorem: If the binary classifier achieves a binary error rate of ε, then the error rate of the α-weighted classifier is αε

  • Proof (CIML 5.1)

SLIDE 24

Strategies for inducing a new binary distribution

  • Undersample the negative class
  • Oversample the positive class

SLIDE 25

Strategies for inducing a new binary distribution

  • Undersample the negative class

– More computationally efficient

  • Oversample the positive class

    – Base binary classifier might do better with more training examples
    – Efficient implementations incorporate the weight in the algorithm, instead of explicitly duplicating data!
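
A minimal sketch contrasting the two implementations (names are illustrative, not from the slides): explicit duplication of the positive class versus an equivalent per-example weight vector for a learner that accepts weights.

```python
import numpy as np

def oversample_positives(X, y, alpha):
    """Explicitly duplicate each positive example alpha times (alpha: positive integer)."""
    pos, neg = (y == 1), (y != 1)
    X_new = np.concatenate([np.repeat(X[pos], alpha, axis=0), X[neg]])
    y_new = np.concatenate([np.repeat(y[pos], alpha), y[neg]])
    return X_new, y_new

def example_weights(y, alpha):
    """Equivalent, more memory-efficient alternative: weight each positive
    example by alpha and pass the weights to a weight-aware learner."""
    return np.where(y == 1, float(alpha), 1.0)
```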

SLIDE 26

SLIDE 27

What you should know

  • Be aware of practical issues when applying ML techniques to new problems

  • How to select an appropriate evaluation metric for imbalanced learning problems

  • How to learn from imbalanced data using α-weighted binary classification, and what the error guarantees are