On Symmetric Losses for Learning from Corrupted Labels
Nontawat Charoenphakdee 1,2, Jongyeong Lee 1,2 and Masashi Sugiyama 2,1
1 The University of Tokyo
2 RIKEN Center for Advanced Intelligence Project (AIP)
Supervised learning
Pipeline: data collection → machine learning → prediction function.
- Data collection: gather features (input) and labels (output).
- Machine learning: learn from input-output pairs a prediction function f such that f(x) ≈ y.
- Goal: predict the output of unseen input accurately.
- Note: this standard setting assumes clean labels; it has no noise robustness.
Learning from corrupted labels
Pipeline: data collection (feature collection + labeling process) → noise-robust ML → prediction function.
The labeling process may introduce errors. Examples:
- Expert labelers (human error)
- Crowdsourcing (non-expert error)
Our goal: noise-robust machine learning that still learns a good prediction function from corrupted labels.
Contents
- Background and related work
- The importance of symmetric losses
- Theoretical properties of symmetric losses
- Barrier hinge loss
- Experiments
Warm-up: Binary classification
Notation: x: feature vector; y ∈ {−1, +1}: label; f: prediction function; ℓ: margin loss function.
- Given: input-output pairs {(x_i, y_i)}_{i=1}^n.
- Goal: minimize the expected error R(f) = E_{(x,y)}[ℓ_{0-1}(y f(x))].
- No access to the distribution: minimize the empirical error (1/n) Σ_i ℓ(y_i f(x_i)) instead (Vapnik, 1998); a minimal sketch follows.
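To make empirical risk minimization concrete, here is a minimal sketch (my own illustration, assuming a linear model f(x) = w·x and NumPy; nothing here is from the slides) of gradient descent on the empirical error with a differentiable margin loss:

```python
import numpy as np

def erm_linear(X, y, loss_grad, lr=0.1, epochs=200):
    """Minimize (1/n) * sum_i loss(y_i * f(x_i)) for the linear model f(x) = X @ w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        z = y * (X @ w)                                # margins z_i = y_i * f(x_i)
        w -= lr * (X.T @ (loss_grad(z) * y)) / len(y)  # chain rule: loss'(z_i) * y_i * x_i
    return w

# e.g., logistic loss: loss(z) = log(1 + exp(-z)), so loss'(z) = -1 / (1 + exp(z))
logistic_grad = lambda z: -1.0 / (1.0 + np.exp(z))
```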
The 0-1 loss ℓ_{0-1} depends only on whether y and f(x) have the same sign (no penalty) or different signs (penalty 1).
Minimizing the 0-1 loss directly is difficult:
- It is discontinuous and not differentiable (Ben-David+, 2003, Feldman+, 2012).
In practice, we minimize a surrogate loss (Zhang, 2004, Bartlett+, 2006).
Surrogate losses
Notation: x: feature vector; y: label; f: prediction function; z = y f(x): margin.
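For illustration, here are the 0-1 loss and a few common surrogates as functions of the margin z (a minimal NumPy sketch; the loss definitions are standard ones, not taken from the slides):

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)       # 1 if y and f(x) disagree in sign

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)     # convex surrogate

def logistic_loss(z):
    return np.log(1.0 + np.exp(-z))     # convex surrogate

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))      # non-convex, symmetric (see later slides)

z = np.linspace(-3.0, 3.0, 7)
print(zero_one_loss(z), hinge_loss(z), sigmoid_loss(z))
```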
Learning from corrupted labels
- Clean data: positive samples from p_P(x), negative samples from p_N(x).
- Given: two sets of corrupted data:
  Corrupted positive: samples from π_P p_P(x) + (1 − π_P) p_N(x)
  Corrupted negative: samples from π_N p_P(x) + (1 − π_N) p_N(x)
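As a simulation sketch (the Gaussian class-conditionals and prior values are my own assumptions for illustration, not from the paper), corrupted sets can be drawn from such mixtures as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_clean(n, positive):
    # assumed class-conditionals: p_P = N(+2, 1), p_N = N(-2, 1)
    return rng.normal(2.0 if positive else -2.0, 1.0, size=n)

def sample_corrupted(n, pi):
    # with probability pi draw from p_P, otherwise from p_N
    from_pos = rng.random(n) < pi
    return np.where(from_pos, sample_clean(n, True), sample_clean(n, False))

x_cp = sample_corrupted(1000, pi=0.8)  # corrupted positive set (pi_P = 0.8 assumed)
x_cn = sample_corrupted(1000, pi=0.3)  # corrupted negative set (pi_N = 0.3 assumed)
```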
Issue on class priors
Given: two sets of corrupted data, with class priors π_P and π_N of the corrupted positive and negative sets.
Assumption: π_P > π_N.
This setting covers many weakly-supervised settings (Lu+, 2019). For example, in positive-unlabeled learning the positive set is clean (π_P = 1) and the "negative" set is unlabeled data (du Plessis+, 2014).
Problem: π_P and π_N are unidentifiable from samples (Scott+, 2013, Menon+, 2015, Lu+, 2019).
How can we learn without estimating π_P and π_N?
Related work: BER and AUC optimization
- Classification error: R(f) = π E_P[ℓ_{0-1}(f(x))] + (1 − π) E_N[ℓ_{0-1}(−f(x))] → class priors are needed! (Lu+, 2019)
- Balanced error rate (BER): (1/2) E_P[ℓ_{0-1}(f(x))] + (1/2) E_N[ℓ_{0-1}(−f(x))] → class priors are not needed! (Menon+, 2015)
- AUC risk: E_P E_N[ℓ_{0-1}(f(x_P) − f(x_N))] → class priors are not needed! (Menon+, 2015)
Menon+, 2015: for BER and AUC optimization, we can treat corrupted data as if they were clean; the squared loss was used in their experiments.
van Rooyen+, 2015: symmetric losses are also useful for BER minimization (no experiments); the proof relies on a property of the 0-1 loss.
Ours: using a symmetric loss is preferable for both BER and AUC, theoretically and experimentally!
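For concreteness, the empirical versions of BER and the AUC risk with the 0-1 loss look like this (a minimal NumPy sketch of the standard definitions; note that neither involves the class priors):

```python
import numpy as np

def empirical_ber(scores_pos, scores_neg):
    fnr = np.mean(scores_pos <= 0)      # false-negative rate
    fpr = np.mean(scores_neg > 0)       # false-positive rate
    return 0.5 * (fnr + fpr)

def empirical_auc_risk(scores_pos, scores_neg):
    # fraction of (positive, negative) pairs that are mis-ranked; AUC = 1 - risk
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(diff <= 0)
```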
Symmetric losses
A margin loss ℓ is symmetric if ℓ(z) + ℓ(−z) is a constant.
Known applications of symmetric losses:
- Robustness under symmetric noise (label flip with a fixed probability) (Ghosh+, 2015, van Rooyen+, 2015)
- Risk estimator simplification in weakly-supervised learning (du Plessis+, 2014, Kiryo+, 2017, Lu+, 2018)
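A quick numeric check of the symmetric condition for two well-known symmetric losses (a sketch; sigmoid and ramp are the standard examples used later in this deck):

```python
import numpy as np

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def ramp_loss(z):
    return 0.5 * np.clip(1.0 - z, 0.0, 2.0)

z = np.linspace(-5.0, 5.0, 11)
print(sigmoid_loss(z) + sigmoid_loss(-z))   # all 1.0: constant, hence symmetric
print(ramp_loss(z) + ramp_loss(-z))         # all 1.0: constant, hence symmetric
```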
Symmetric losses: AUC maximization
AUC maximization from corrupted labels: the corrupted AUC risk decomposes into the clean AUC risk plus excessive terms that pair samples from the same set.
With a symmetric loss, each such pair contributes ℓ(z) + ℓ(−z) = constant, so the excessive terms become constant!
Corrupted risk = (π_P − π_N) × clean risk + constant: the excessive terms can be safely ignored with symmetric losses.
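The key mechanism can be checked numerically: when two samples come from the same distribution, the difference of their scores is symmetrically distributed, so a symmetric loss of that difference has a constant expectation whatever f is (a sketch under assumed Gaussian data; not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

f = lambda x: 3.0 * np.tanh(x)          # any scoring function
x  = rng.normal(0.5, 2.0, size=2000)    # two i.i.d. draws from the SAME set
xp = rng.normal(0.5, 2.0, size=2000)

d = f(x)[:, None] - f(xp)[None, :]
print(np.mean(sigmoid_loss(d)))  # ~0.5 = K/2 regardless of f: within-class
                                 # pairs only add a constant to the AUC risk
```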
Symmetric losses: BER minimization
BER minimization from corrupted labels: likewise, the corrupted BER risk decomposes into the clean BER risk plus an excessive term.
With a symmetric loss, the excessive term becomes constant!
Corrupted risk = (π_P − π_N) × clean risk + constant: the excessive term can be safely ignored with symmetric losses.
This coincides with van Rooyen+, 2015.
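A numeric illustration of the decomposition (a sketch under the same assumed Gaussian classes and priors pi_P = 0.8, pi_N = 0.3 as above; the constant below follows from the symmetric condition with K = 1):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

pi_P, pi_N, n = 0.8, 0.3, 10**6
xP, xN = rng.normal(2, 1, n), rng.normal(-2, 1, n)   # clean classes
x_cp = np.where(rng.random(n) < pi_P, xP, xN)        # corrupted positive
x_cn = np.where(rng.random(n) < pi_N, xP, xN)        # corrupted negative

def ber_risk(f, pos, neg):
    return 0.5 * (np.mean(sigmoid_loss(f(pos))) + np.mean(sigmoid_loss(-f(neg))))

for f in (lambda x: x, lambda x: 0.5 * x - 1.0):     # two different models
    gap = ber_risk(f, x_cp, x_cn) - (pi_P - pi_N) * ber_risk(f, xP, xN)
    print(gap)  # ~0.25 = K * (1 - pi_P + pi_N) / 2 for BOTH models: a constant
```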
Theoretical properties of symmetric losses
Nonnegative symmetric losses are non-convex (du Plessis+, 2014, Ghosh+, 2015).
- The theory of convex losses cannot be applied.
We provide a better understanding of symmetric losses:
- A necessary and sufficient condition for classification-calibration
- An excess risk bound in binary classification
- The inability to estimate the class posterior probability
- A sufficient condition for AUC-consistency
➢ Covers many symmetric losses, e.g., sigmoid, ramp.
Well-known symmetric losses, e.g., sigmoid and ramp, are classification-calibrated and AUC-consistent!
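The inability to estimate the posterior has a one-line explanation: with η(x) = p(y = +1 | x) and a symmetric loss, the pointwise conditional risk is η ℓ(z) + (1 − η) ℓ(−z) = (2η − 1) ℓ(z) + (1 − η) K, so its minimizer depends on η only through sign(2η − 1) and the magnitude of η is lost. A quick numeric check with the sigmoid loss (a sketch over a bounded grid of outputs):

```python
import numpy as np

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def conditional_risk(eta, z):
    return eta * sigmoid_loss(z) + (1.0 - eta) * sigmoid_loss(-z)

z = np.linspace(-10.0, 10.0, 10001)
for eta in (0.6, 0.9):
    print(eta, z[np.argmin(conditional_risk(eta, z))])
# both etas give the same minimizer (the right end of the grid):
# eta cannot be recovered from the optimal output.
```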
Convex symmetric losses?
By sacrificing nonnegativity:
- Only the unhinged loss ℓ(z) = 1 − z is convex and symmetric (van Rooyen+, 2015).
This loss had been considered before, although its robustness was not discussed (Devroye+, 1996, Schoelkopf+, 2002, Shawe-Taylor+, 2004, Sriperumbudur+, 2009, Reid+, 2011).
Barrier hinge loss
Parameters: b: slope of the non-symmetric region; r: width of the symmetric region.
The loss gives a high penalty when a sample is misclassified or the output falls outside the symmetric region; a sketch follows.
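A sketch of the barrier hinge loss in NumPy, using the form ℓ(z) = max(−b(r + z) + r, max(b(z − r), r − z)) with b > 1 and r > 0 (my reading of the paper's definition; check the original for the exact form). The defaults match the values used in the experiments:

```python
import numpy as np

def barrier_hinge_loss(z, b=200.0, r=50.0):
    # b: slope outside the symmetric region; r: half-width of the region.
    # Inside [-r, r] the loss is r - z, so loss(z) + loss(-z) = 2r (constant).
    # Outside, the slope b imposes a high penalty (the "barrier").
    return np.maximum(-b * (r + z) + r, np.maximum(b * (z - r), r - z))

z = np.linspace(-60.0, 60.0, 5)
print(barrier_hinge_loss(z))
print(barrier_hinge_loss(z) + barrier_hinge_loss(-z))  # = 2r = 100 inside [-r, r]
```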
Symmetricity of barrier hinge loss
The barrier hinge loss satisfies the symmetric property in an interval: ℓ(z) + ℓ(−z) is constant for z in the symmetric region.
If the output range is restricted to the symmetric region, the unhinged, hinge, and barrier losses are equivalent.
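A quick check of this equivalence (a sketch; I take the symmetric region to be [−1, 1] and barrier parameters b = 2, r = 1, which are my own illustrative choices):

```python
import numpy as np

def unhinged_loss(z):
    return 1.0 - z

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

def barrier_hinge_loss(z, b=2.0, r=1.0):
    return np.maximum(-b * (r + z) + r, np.maximum(b * (z - r), r - z))

z = np.linspace(-1.0, 1.0, 9)   # outputs restricted to the symmetric region
print(np.allclose(unhinged_loss(z), hinge_loss(z)))          # True
print(np.allclose(unhinged_loss(z), barrier_hinge_loss(z)))  # True: all three
                                                             # coincide on [-1, 1]
```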
Experiments: BER/AUC optimization from corrupted labels
To empirically answer the following questions:
1. Does the symmetric condition significantly help?
2. Do we need a loss to be symmetric everywhere?
3. Does negative unboundedness degrade the practical performance?
We conducted the following experiments (fix the models, vary the loss functions):
- Losses: Barrier [b=200, r=50], Unhinged, Sigmoid, Logistic, Hinge, Squared, Savage
- Experiment 1: MLPs on UCI/LIBSVM datasets.
- Experiment 2: CNNs on more difficult datasets (MNIST, CIFAR-10).
Experimental setup:
For UCI datasets:
- Multilayer perceptrons (MLPs) with one hidden layer: [d-500-1]
- Activation function: Rectified Linear Unit (ReLU) (Nair+, 2010)
For MNIST and CIFAR-10:
- Convolutional neural networks (CNNs): [d-Conv[18,5,1,0]-Max[2,2]-Conv[48,5,1,0]-Max[2,2]-800-400-1]
  (Conv[18,5,1,0]: 18 channels, 5 × 5 convolutions, stride 1, padding 0; Max[2,2]: max pooling with kernel size 2 and stride 2)
- ReLU after each fully connected layer, followed by a dropout layer (Srivastava+, 2014)
- MNIST: odd numbers vs. even numbers; CIFAR-10: one class vs. airplane (following Ishida+, 2017)
A sketch of these architectures follows.
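A sketch of these architectures in PyTorch (my own reconstruction from the layer strings above; the paper does not state the framework here, and details such as the dropout rate are assumptions):

```python
import torch.nn as nn

def make_mlp(d):
    # MLP for UCI/LIBSVM: [d-500-1] with ReLU
    return nn.Sequential(nn.Linear(d, 500), nn.ReLU(), nn.Linear(500, 1))

def make_cnn(in_channels=1, flat_dim=48 * 4 * 4):
    # CNN: [d-Conv[18,5,1,0]-Max[2,2]-Conv[48,5,1,0]-Max[2,2]-800-400-1]
    # flat_dim = 48*4*4 for 28x28 MNIST inputs (48*5*5 for 32x32 CIFAR-10)
    return nn.Sequential(
        nn.Conv2d(in_channels, 18, kernel_size=5, stride=1, padding=0),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(18, 48, kernel_size=5, stride=1, padding=0),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(flat_dim, 800), nn.ReLU(), nn.Dropout(0.5),  # rate assumed
        nn.Linear(800, 400), nn.ReLU(), nn.Dropout(0.5),       # rate assumed
        nn.Linear(400, 1),
    )
```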
Experiment 1: MLPs on UCI/LIBSVM datasets
[Results tables omitted; reported scores are such that the higher the better.]
Dataset information and more experiments can be found in our paper.
Symmetric losses and the barrier hinge loss are preferable!
Experiment 2: CNNs on MNIST/CIFAR-10
[Results tables omitted; reported scores are such that the higher the better.]
Conclusion
We showed that a symmetric loss is preferable under corrupted labels for:
- Area under the receiver operating characteristic curve (AUC) maximization
- Balanced error rate (BER) minimization
We provided general theoretical properties for symmetric losses:
- Classification-calibration, excess risk bound, AUC-consistency
- The inability to estimate the class posterior probability
We proposed the barrier hinge loss:
- As a proof of concept of the importance of the symmetric condition
- Symmetric only in an interval, yet it benefits greatly from the symmetric condition
- It significantly outperformed the other losses in BER/AUC optimization using CNNs