On Symmetric Losses for Learning from Corrupted Labels
Nontawat Charoenphakdee 1,2, Jongyeong Lee 1,2 and Masashi Sugiyama 2,1
1 The University of Tokyo
2 RIKEN Center for Advanced Intelligence Project (AIP)
Supervised learning
Pipeline: data collection → machine learning → prediction function.
- Data collection: gather features (input) and labels (output).
- Machine learning: learn from input-output pairs a prediction function f such that f(x) ≈ y.
- Goal: predict the output of unseen input accurately.
- Note: this standard setting assumes clean labels; it has no noise robustness.
Learning from corrupted labels
Pipeline: data collection (feature collection + labeling process) → noise-robust ML → prediction function.
The labeling process may introduce errors. Examples:
- Expert labelers (human error)
- Crowdsourcing (non-expert error)
Our goal: noise-robust machine learning that still learns a good prediction function from corrupted labels.
Contents
- Background and related work
- The importance of symmetric losses
- Theoretical properties of symmetric losses
- Barrier hinge loss
- Experiments
Warm-up: Binary classification
Notation: x: feature vector; y ∈ {−1, +1}: label; f: prediction function; ℓ: margin loss function.
- Given: input-output pairs {(x_i, y_i)}_{i=1}^n.
- Goal: minimize the expected error R(f) = E_{(x,y)}[ℓ_{0-1}(y f(x))].
- No access to the distribution: minimize the empirical error (1/n) Σ_i ℓ(y_i f(x_i)) instead (Vapnik, 1998); a minimal sketch follows.
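To make empirical risk minimization concrete, here is a minimal sketch (my own illustration, assuming a linear model f(x) = w·x and NumPy; nothing here is from the slides) of gradient descent on the empirical error with a differentiable margin loss:

```python
import numpy as np

def erm_linear(X, y, loss_grad, lr=0.1, epochs=200):
    """Minimize (1/n) * sum_i loss(y_i * f(x_i)) for the linear model f(x) = X @ w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        z = y * (X @ w)                                # margins z_i = y_i * f(x_i)
        w -= lr * (X.T @ (loss_grad(z) * y)) / len(y)  # chain rule: loss'(z_i) * y_i * x_i
    return w

# e.g., logistic loss: loss(z) = log(1 + exp(-z)), so loss'(z) = -1 / (1 + exp(z))
logistic_grad = lambda z: -1.0 / (1.0 + np.exp(z))
```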
The 0-1 loss ℓ_{0-1} depends only on whether y and f(x) have the same sign (no penalty) or different signs (penalty 1).
Minimizing the 0-1 loss directly is difficult:
- It is discontinuous and not differentiable (Ben-David+, 2003, Feldman+, 2012).
In practice, we minimize a surrogate loss (Zhang, 2004, Bartlett+, 2006).
Surrogate losses
Notation: x: feature vector; y: label; f: prediction function; z = y f(x): margin.
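For illustration, here are the 0-1 loss and a few common surrogates as functions of the margin z (a minimal NumPy sketch; the loss definitions are standard ones, not taken from the slides):

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)       # 1 if y and f(x) disagree in sign

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)     # convex surrogate

def logistic_loss(z):
    return np.log(1.0 + np.exp(-z))     # convex surrogate

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))      # non-convex, symmetric (see later slides)

z = np.linspace(-3.0, 3.0, 7)
print(zero_one_loss(z), hinge_loss(z), sigmoid_loss(z))
```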
Learning from corrupted labels
- Clean data: positive samples from p_P(x), negative samples from p_N(x).
- Given: two sets of corrupted data:
  Corrupted positive: samples from π_P p_P(x) + (1 − π_P) p_N(x)
  Corrupted negative: samples from π_N p_P(x) + (1 − π_N) p_N(x)
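As a simulation sketch (the Gaussian class-conditionals and prior values are my own assumptions for illustration, not from the paper), corrupted sets can be drawn from such mixtures as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_clean(n, positive):
    # assumed class-conditionals: p_P = N(+2, 1), p_N = N(-2, 1)
    return rng.normal(2.0 if positive else -2.0, 1.0, size=n)

def sample_corrupted(n, pi):
    # with probability pi draw from p_P, otherwise from p_N
    from_pos = rng.random(n) < pi
    return np.where(from_pos, sample_clean(n, True), sample_clean(n, False))

x_cp = sample_corrupted(1000, pi=0.8)  # corrupted positive set (pi_P = 0.8 assumed)
x_cn = sample_corrupted(1000, pi=0.3)  # corrupted negative set (pi_N = 0.3 assumed)
```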
Issue on class priors
Given: two sets of corrupted data, with class priors π_P and π_N of the corrupted positive and negative sets.
Assumption: π_P > π_N.
This setting covers many weakly-supervised settings (Lu+, 2019). For example, in positive-unlabeled learning the positive set is clean (π_P = 1) and the "negative" set is unlabeled data (du Plessis+, 2014).
Problem: π_P and π_N are unidentifiable from samples (Scott+, 2013, Menon+, 2015, Lu+, 2019).
How can we learn without estimating π_P and π_N?
Related work: BER and AUC optimization
- Classification error: R(f) = π E_P[ℓ_{0-1}(f(x))] + (1 − π) E_N[ℓ_{0-1}(−f(x))] → class priors are needed! (Lu+, 2019)
- Balanced error rate (BER): (1/2) E_P[ℓ_{0-1}(f(x))] + (1/2) E_N[ℓ_{0-1}(−f(x))] → class priors are not needed! (Menon+, 2015)
- AUC risk: E_P E_N[ℓ_{0-1}(f(x_P) − f(x_N))] → class priors are not needed! (Menon+, 2015)
Menon+, 2015: for BER and AUC optimization, we can treat corrupted data as if they were clean; the squared loss was used in their experiments.
van Rooyen+, 2015: symmetric losses are also useful for BER minimization (no experiments); the proof relies on a property of the 0-1 loss.
Ours: using a symmetric loss is preferable for both BER and AUC, theoretically and experimentally!
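For concreteness, the empirical versions of BER and the AUC risk with the 0-1 loss look like this (a minimal NumPy sketch of the standard definitions; note that neither involves the class priors):

```python
import numpy as np

def empirical_ber(scores_pos, scores_neg):
    fnr = np.mean(scores_pos <= 0)      # false-negative rate
    fpr = np.mean(scores_neg > 0)       # false-positive rate
    return 0.5 * (fnr + fpr)

def empirical_auc_risk(scores_pos, scores_neg):
    # fraction of (positive, negative) pairs that are mis-ranked; AUC = 1 - risk
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(diff <= 0)
```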
Symmetric losses
A margin loss ℓ is symmetric if ℓ(z) + ℓ(−z) is a constant.
Known applications of symmetric losses:
- Robustness under symmetric noise (label flip with a fixed probability) (Ghosh+, 2015, van Rooyen+, 2015)
- Risk estimator simplification in weakly-supervised learning (du Plessis+, 2014, Kiryo+, 2017, Lu+, 2018)
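A quick numeric check of the symmetric condition for two well-known symmetric losses (a sketch; sigmoid and ramp are the standard examples used later in this deck):

```python
import numpy as np

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def ramp_loss(z):
    return 0.5 * np.clip(1.0 - z, 0.0, 2.0)

z = np.linspace(-5.0, 5.0, 11)
print(sigmoid_loss(z) + sigmoid_loss(-z))   # all 1.0: constant, hence symmetric
print(ramp_loss(z) + ramp_loss(-z))         # all 1.0: constant, hence symmetric
```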
Symmetric losses: AUC maximization
AUC maximization from corrupted labels: the corrupted AUC risk decomposes into the clean AUC risk plus excessive terms that pair samples from the same set.
With a symmetric loss, each such pair contributes ℓ(z) + ℓ(−z) = constant, so the excessive terms become constant!
Corrupted risk = (π_P − π_N) × clean risk + constant: the excessive terms can be safely ignored with symmetric losses.
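The key mechanism can be checked numerically: when two samples come from the same distribution, the difference of their scores is symmetrically distributed, so a symmetric loss of that difference has a constant expectation whatever f is (a sketch under assumed Gaussian data; not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

f = lambda x: 3.0 * np.tanh(x)          # any scoring function
x  = rng.normal(0.5, 2.0, size=2000)    # two i.i.d. draws from the SAME set
xp = rng.normal(0.5, 2.0, size=2000)

d = f(x)[:, None] - f(xp)[None, :]
print(np.mean(sigmoid_loss(d)))  # ~0.5 = K/2 regardless of f: within-class
                                 # pairs only add a constant to the AUC risk
```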
Symmetric losses: BER minimization
BER minimization from corrupted labels: likewise, the corrupted BER risk decomposes into the clean BER risk plus an excessive term.
With a symmetric loss, the excessive term becomes constant!
Corrupted risk = (π_P − π_N) × clean risk + constant: the excessive term can be safely ignored with symmetric losses.
This coincides with van Rooyen+, 2015.
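A numeric illustration of the decomposition (a sketch under the same assumed Gaussian classes and priors pi_P = 0.8, pi_N = 0.3 as above; the constant below follows from the symmetric condition with K = 1):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

pi_P, pi_N, n = 0.8, 0.3, 10**6
xP, xN = rng.normal(2, 1, n), rng.normal(-2, 1, n)   # clean classes
x_cp = np.where(rng.random(n) < pi_P, xP, xN)        # corrupted positive
x_cn = np.where(rng.random(n) < pi_N, xP, xN)        # corrupted negative

def ber_risk(f, pos, neg):
    return 0.5 * (np.mean(sigmoid_loss(f(pos))) + np.mean(sigmoid_loss(-f(neg))))

for f in (lambda x: x, lambda x: 0.5 * x - 1.0):     # two different models
    gap = ber_risk(f, x_cp, x_cn) - (pi_P - pi_N) * ber_risk(f, xP, xN)
    print(gap)  # ~0.25 = K * (1 - pi_P + pi_N) / 2 for BOTH models: a constant
```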
Theoretical properties of symmetric losses
Nonnegative symmetric losses are non-convex (du Plessis+, 2014, Ghosh+, 2015).
- The theory of convex losses cannot be applied.
We provide a better understanding of symmetric losses:
- A necessary and sufficient condition for classification-calibration
- An excess risk bound in binary classification
- The inability to estimate the class posterior probability
- A sufficient condition for AUC-consistency
➢ Covers many symmetric losses, e.g., sigmoid, ramp.
Well-known symmetric losses, e.g., sigmoid and ramp, are classification-calibrated and AUC-consistent!
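The inability to estimate the posterior has a one-line explanation: with η(x) = p(y = +1 | x) and a symmetric loss, the pointwise conditional risk is η ℓ(z) + (1 − η) ℓ(−z) = (2η − 1) ℓ(z) + (1 − η) K, so its minimizer depends on η only through sign(2η − 1) and the magnitude of η is lost. A quick numeric check with the sigmoid loss (a sketch over a bounded grid of outputs):

```python
import numpy as np

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def conditional_risk(eta, z):
    return eta * sigmoid_loss(z) + (1.0 - eta) * sigmoid_loss(-z)

z = np.linspace(-10.0, 10.0, 10001)
for eta in (0.6, 0.9):
    print(eta, z[np.argmin(conditional_risk(eta, z))])
# both etas give the same minimizer (the right end of the grid):
# eta cannot be recovered from the optimal output.
```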
Convex symmetric losses?
By sacrificing nonnegativity:
- Only the unhinged loss ℓ(z) = 1 − z is convex and symmetric (van Rooyen+, 2015).
This loss had been considered before, although its robustness was not discussed (Devroye+, 1996, Schoelkopf+, 2002, Shawe-Taylor+, 2004, Sriperumbudur+, 2009, Reid+, 2011).
Barrier hinge loss
Parameters: b: slope of the non-symmetric region; r: width of the symmetric region.
The loss gives a high penalty when a sample is misclassified or the output falls outside the symmetric region; a sketch follows.
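A sketch of the barrier hinge loss in NumPy, using the form ℓ(z) = max(−b(r + z) + r, max(b(z − r), r − z)) with b > 1 and r > 0 (my reading of the paper's definition; check the original for the exact form). The defaults match the values used in the experiments:

```python
import numpy as np

def barrier_hinge_loss(z, b=200.0, r=50.0):
    # b: slope outside the symmetric region; r: half-width of the region.
    # Inside [-r, r] the loss is r - z, so loss(z) + loss(-z) = 2r (constant).
    # Outside, the slope b imposes a high penalty (the "barrier").
    return np.maximum(-b * (r + z) + r, np.maximum(b * (z - r), r - z))

z = np.linspace(-60.0, 60.0, 5)
print(barrier_hinge_loss(z))
print(barrier_hinge_loss(z) + barrier_hinge_loss(-z))  # = 2r = 100 inside [-r, r]
```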
Symmetricity of barrier hinge loss
The barrier hinge loss satisfies the symmetric property in an interval: ℓ(z) + ℓ(−z) is constant for z in the symmetric region.
If the output range is restricted to the symmetric region, the unhinged, hinge, and barrier losses are equivalent.
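A quick check of this equivalence (a sketch; I take the symmetric region to be [−1, 1] and barrier parameters b = 2, r = 1, which are my own illustrative choices):

```python
import numpy as np

def unhinged_loss(z):
    return 1.0 - z

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

def barrier_hinge_loss(z, b=2.0, r=1.0):
    return np.maximum(-b * (r + z) + r, np.maximum(b * (z - r), r - z))

z = np.linspace(-1.0, 1.0, 9)   # outputs restricted to the symmetric region
print(np.allclose(unhinged_loss(z), hinge_loss(z)))          # True
print(np.allclose(unhinged_loss(z), barrier_hinge_loss(z)))  # True: all three
                                                             # coincide on [-1, 1]
```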
Experiments: BER/AUC optimization from corrupted labels
To empirically answer the following questions:
1. Does the symmetric condition significantly help?
2. Do we need a loss to be symmetric everywhere?
3. Does negative unboundedness degrade the practical performance?
We conducted the following experiments (fix the models, vary the loss functions):
- Losses: Barrier [b=200, r=50], Unhinged, Sigmoid, Logistic, Hinge, Squared, Savage
- Experiment 1: MLPs on UCI/LIBSVM datasets.
- Experiment 2: CNNs on more difficult datasets (MNIST, CIFAR-10).
Experimental setup:
For UCI datasets:
- Multilayer perceptrons (MLPs) with one hidden layer: [d-500-1]
- Activation function: Rectified Linear Unit (ReLU) (Nair+, 2010)
For MNIST and CIFAR-10:
- Convolutional neural networks (CNNs): [d-Conv[18,5,1,0]-Max[2,2]-Conv[48,5,1,0]-Max[2,2]-800-400-1]
  (Conv[18,5,1,0]: 18 channels, 5 × 5 convolutions, stride 1, padding 0; Max[2,2]: max pooling with kernel size 2 and stride 2)
- ReLU after each fully connected layer, followed by a dropout layer (Srivastava+, 2014)
- MNIST: odd numbers vs. even numbers; CIFAR-10: one class vs. airplane (following Ishida+, 2017)
A sketch of these architectures follows.
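A sketch of these architectures in PyTorch (my own reconstruction from the layer strings above; the paper does not state the framework here, and details such as the dropout rate are assumptions):

```python
import torch.nn as nn

def make_mlp(d):
    # MLP for UCI/LIBSVM: [d-500-1] with ReLU
    return nn.Sequential(nn.Linear(d, 500), nn.ReLU(), nn.Linear(500, 1))

def make_cnn(in_channels=1, flat_dim=48 * 4 * 4):
    # CNN: [d-Conv[18,5,1,0]-Max[2,2]-Conv[48,5,1,0]-Max[2,2]-800-400-1]
    # flat_dim = 48*4*4 for 28x28 MNIST inputs (48*5*5 for 32x32 CIFAR-10)
    return nn.Sequential(
        nn.Conv2d(in_channels, 18, kernel_size=5, stride=1, padding=0),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(18, 48, kernel_size=5, stride=1, padding=0),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(flat_dim, 800), nn.ReLU(), nn.Dropout(0.5),  # rate assumed
        nn.Linear(800, 400), nn.ReLU(), nn.Dropout(0.5),       # rate assumed
        nn.Linear(400, 1),
    )
```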
Experiment 1: MLPs on UCI/LIBSVM datasets
[Results tables omitted; reported scores are such that the higher the better.]
Dataset information and more experiments can be found in our paper.
Symmetric losses and the barrier hinge loss are preferable!
Experiment 2: CNNs on MNIST/CIFAR-10
[Results tables omitted; reported scores are such that the higher the better.]
Conclusion
We showed that a symmetric loss is preferable under corrupted labels for:
- Area under the receiver operating characteristic curve (AUC) maximization
- Balanced error rate (BER) minimization
We provided general theoretical properties for symmetric losses:
- Classification-calibration, excess risk bound, AUC-consistency
- The inability to estimate the class posterior probability
We proposed the barrier hinge loss:
- As a proof of concept of the importance of the symmetric condition
- Symmetric only in an interval, yet it benefits greatly from the symmetric condition
- It significantly outperformed the other losses in BER/AUC optimization using CNNs