SLIDE 1

Error-Bounded Correction of Noisy Labels

Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris Metaxas, Chao Chen

The State University of New York at Stony Brook; Rutgers University; The City University of New York, Queens College


SLIDE 2

Label Noise is Ubiquitous and Troublesome

Label Noise can be Introduced by:

  • Mistakes made by human or automatic annotators (Yan et al. 2014; Veit et al. 2017)

[Figure: training with noisy labels yields a noisy model; at inference, the noisy model may label a dog image as "Cat".]

SLIDE 3

Settings

  • ỹ is the noisy label (observed); y is the clean label (unknown)

  • Challenge:

Train with noisy data (x, ỹ), but the model is required to give the correct prediction y.

[Figure: at inference, can a robust model trained with (x, ỹ) still predict "Cat" correctly?]

SLIDE 4

Settings

(Setup restated from Slide 3: ỹ is the observed noisy label and y the unknown clean label; we train with (x, ỹ) but must predict y correctly.)

  • Noise transition matrix T. Each entry τ_jk = P(ỹ = k | y = j). Rows are the true label, columns the noisy label:

Uniform noise:
            cat    dog    human
  cat       0.4    0.3    0.3
  dog       0.3    0.4    0.3
  human     0.3    0.3    0.4

Pairwise noise:
            cat    dog    human
  cat       0.6    0.4    0
  dog       0      0.6    0.4
  human     0.4    0      0.6
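To make the noise model concrete, here is a minimal Python sketch (not the authors' code) of how clean labels could be synthetically corrupted with such a transition matrix; the helper name corrupt_labels and the toy labels are illustrative only:

    import numpy as np

    def corrupt_labels(y, T, rng=None):
        # Sample noisy labels with P(y_tilde = k | y = j) = T[j, k].
        rng = np.random.default_rng() if rng is None else rng
        return np.array([rng.choice(len(T), p=T[label]) for label in y])

    # Uniform noise: every wrong class is equally likely.
    T_uniform = np.array([[0.4, 0.3, 0.3],
                          [0.3, 0.4, 0.3],
                          [0.3, 0.3, 0.4]])

    # Pairwise noise: each class flips to a single "neighbor" class.
    T_pairwise = np.array([[0.6, 0.4, 0.0],
                           [0.0, 0.6, 0.4],
                           [0.4, 0.0, 0.6]])

    y_clean = np.array([0, 1, 2, 0, 1, 2])   # 0 = cat, 1 = dog, 2 = human
    y_noisy = corrupt_labels(y_clean, T_uniform)
    print(y_noisy)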

SLIDE 5

Existing Solutions – Model Re-calibration

  • Introduce a new loss term to obtain a robust model:

1) Estimate the noise transition matrix T to correct the loss term (Goldberger & Ben-Reuven, 2017; Patrini et al., 2017)
2) Robust deep learning layer (Van Rooyen et al., 2015)
3) Reconstruction loss term (Reed et al., 2014)

  • Pros:

Global regularization; theoretical guarantees

  • Cons:

Not flexible enough; omits local (point-wise) information

SLIDE 6

Existing Solutions – Data Re-calibration


  • Re-weight or select data points using the noisy classifier
  • The noisy classifier's confidence determines the weight
  • Points with clean labels receive higher weight
  • Re-weighting and training happen jointly
  • Pros:

Better performance than model re-calibration methods; flexible enough to fully use point-wise information

  • Cons:

No theoretical support

[Figure: loop of input → re-weighting → training → robust model.]

SLIDE 7


Contribution

  • The first theoretical explanation for data re-calibration methods
  • Explains why a noisy classifier can be used to decide whether a label is trustworthy or not

  • A theory-inspired data re-calibration algorithm
  • Easy to tune
  • Scalable
  • Label Correction

Image source: https://media.istockphoto.com/vectors/hand-drawn-vector-cartoon-illustration-of-a-broken-robot-trying-to-vector-id1131797122?k=6&m=1131797122&s=612x612&w=0&h=H2fviprWr24dxlO2QPae1R8X3nrHB-J40NCunv2aE84=

SLIDE 8

(Noisy) Classifier and (Noisy) Posterior

The classification scoring function f(x) approximates the posterior probability of the label:

  • Trained on clean data (x, y): f(x) approximates the clean posterior η(x) = P(y = 1 | x)
  • Trained on noisy data (x, ỹ): f(x) approximates the noisy posterior η̃(x) = P(ỹ = 1 | x)

SLIDE 9

(Noisy) Classifier and (Noisy) Posterior

(Restated from Slide 8: trained on clean data, f(x) approximates η(x) = P(y = 1 | x); trained on noisy data, f(x) approximates η̃(x) = P(ỹ = 1 | x).)

  • There is a linear relationship between the two posteriors:

η̃(x) = (1 − τ₁₀ − τ₀₁) · η(x) + τ₀₁,  where τ₁₀ = P(ỹ = 0 | y = 1) and τ₀₁ = P(ỹ = 1 | y = 0).

SLIDE 10

Low Confidence of η̃(x) Implies Noise

Theorem 1. Let ϵ ≔ ‖f − η̃‖_∞ and Δ = (1 − τ₁₀ − τ₀₁) / 2. There exist constants C, λ > 0 such that:

  • For ỹ = 1:  Prob[ f(x) ≤ Δ, ỹ is clean ] ≤ C · O(ϵ^λ)
  • For ỹ = 0:  Prob[ 1 − f(x) ≤ Δ, ỹ is clean ] ≤ C · O(ϵ^λ)

SLIDE 11

Low Confidence of η̃(x) Implies Noise

(Theorem 1 restated, now only requiring some Δ ∈ (0, 1).)

[Figure: clean posterior η(x) and noisy posterior η̃(x) with clean and noisy labeled points; Δ marks the low-confidence band.]

SLIDE 12

Low Confidence of η̃(x) Implies Noise

(Theorem 1 restated.)

[Figure build: the classifier score f(x) is added alongside η(x) and η̃(x).]

SLIDE 13

Low Confidence of η̃(x) Implies Noise

(Theorem 1 restated.)

[Figure build: same plot, recalling that η̃(x) = (1 − τ₁₀ − τ₀₁) · η(x) + τ₀₁.]

SLIDE 14

Low Confidence of η̃(x) Implies Noise

(Theorem 1 restated; same figure as the previous slide.)

SLIDE 15

Low Confidence of η̃(x) Implies Noise

(Theorem 1 restated; this build highlights the ỹ = 0 case.)

[Figure: the complements 1 − η(x), 1 − η̃(x), and 1 − f(x), with clean and noisy labeled points.]

SLIDE 16

Tsybakov Condition

[Figure: clean and noisy labeled points around the decision boundary η(x) = 1/2, with the band 1/2 − t ≤ η(x) ≤ 1/2 + t highlighted.]

SLIDE 17

Tsybakov Condition

  • Tsybakov Condition [2004]. There exist constants C, λ > 0 and t₀ ∈ (0, 1/2] such that, for all t ≤ t₀:

Prob[ |η(x) − 1/2| ≤ t ] ≤ C · t^λ

[Figure: same illustration as the previous slide.]

SLIDE 18

Tsybakov Condition

  • Tsybakov Condition [2004]. There exist constants C, λ > 0 and t₀ ∈ (0, 1/2] such that, for all t ≤ t₀, Prob[ |η(x) − 1/2| ≤ t ] ≤ C · t^λ.
  • Empirical verification (CIFAR-10): fitted Ĉ = 0.32 and λ̂ = 1.04; the fit is statistically significant.

[Figure: same illustration as Slide 16.]

SLIDE 19

Low Confidence of η̃(x) Implies Noise

Theorem 1, instantiated using the CIFAR-10 estimates. Let ϵ ≔ ‖f − η̃‖_∞; for the corresponding Δ ∈ (0, 1):

  • For ỹ = 1:  Prob[ f(x) ≤ Δ, ỹ is clean ] ≤ 0.23 · O(ϵ^1.04)
  • For ỹ = 0:  Prob[ 1 − f(x) ≤ Δ, ỹ is clean ] ≤ 0.23 · O(ϵ^1.04)

SLIDE 20

Theory-Inspired Algorithm


SLIDE 21

Theory-Inspired Algorithm

Corollary 1. Let ϵ ≔ max_x |f(x) − η̃(x)|. If ỹ_new denotes the output of LRT-Correction with input (x, ỹ), f, and the threshold ε, then there exist constants C, λ > 0 such that:

Prob[ ỹ_new is clean ] > 1 − C · O(ϵ^λ)

Remark: the extension to the multi-class setting is natural.
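For concreteness, a minimal sketch of a likelihood-ratio label-correction step in this spirit (assumptions of this sketch, which may not match the paper's exact rule: the test statistic is taken as f_x(ỹ) / max_k f_x(k), labels failing the test are replaced by the classifier's argmax, and lrt_correction is an illustrative name):

    import numpy as np

    def lrt_correction(probs, noisy_labels, eps):
        # probs:        (n, K) softmax outputs f(x) of the current classifier
        # noisy_labels: (n,) observed labels y_tilde
        # eps:          threshold in (0, 1]; smaller eps keeps more labels
        rows = np.arange(len(probs))
        y_hat = probs.argmax(axis=1)                     # classifier's best guess
        ratio = probs[rows, noisy_labels] / probs[rows, y_hat]
        keep = ratio >= eps                              # likelihood-ratio test
        return np.where(keep, noisy_labels, y_hat)       # corrected labels y_new

    # Toy usage: the first point fails the test and is corrected to class 0.
    probs = np.array([[0.85, 0.10, 0.05],
                      [0.30, 0.35, 0.35]])
    y_tilde = np.array([2, 1])
    print(lrt_correction(probs, y_tilde, eps=1 / 1.2))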

SLIDE 22

AdaCorr: Using LRT-Correction During Training

Step 1: Train f(x) on (x, ỹ)
Step 2: Apply LRT-Correction using (x, ỹ), f(x), and ε
Step 3: Set ỹ ← ỹ_new
Step 4: Repeat Steps 1–3

Remark: in Step 1, to get a good approximation of η̃(x), we first train f(x) on (x, ỹ) for several warm-up epochs.
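A schematic training loop along these lines (a sketch under assumptions, not the authors' implementation: the tiny MLP, synthetic data, warm-up length, and the floor on the 1/ε schedule are placeholders; the correction rule is the one sketched on the previous slide):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    torch.manual_seed(0)
    X = torch.randn(600, 2)                     # toy features
    y_tilde = torch.randint(0, 3, (600,))       # observed (noisy) labels

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 3))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    warmup_epochs, total_epochs, inv_eps = 5, 20, 1.2

    for epoch in range(total_epochs):
        # Step 1: train on the current (possibly corrected) labels.
        loader = DataLoader(TensorDataset(X, y_tilde), batch_size=64, shuffle=True)
        for xb, yb in loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(xb), yb).backward()
            opt.step()

        # Steps 2-3: after warm-up, run the LRT-style test and overwrite labels.
        if epoch >= warmup_epochs:
            with torch.no_grad():
                probs = model(X).softmax(dim=1)
            rows = torch.arange(len(X))
            ratio = probs[rows, y_tilde] / probs.max(dim=1).values
            y_tilde = torch.where(ratio >= 1.0 / inv_eps, y_tilde, probs.argmax(dim=1))
            inv_eps = max(1.0, inv_eps - 0.02)  # anneal 1/eps as on Slide 23 (the floor is this sketch's assumption)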

SLIDE 23

Experiment - Setting

Data Sets:

  • MNIST (LeCun & Cortes, 2010);
  • CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009);
  • ModelNet40 (Z. Wu et al., 2015)
  • Clothing1M (Xiao et al., 2015)

Baselines:

  • Forward Correction (Patrini et al., 2017)
  • Decoupling (Malach & Shalev-Shwartz, 2017)
  • Forgetting (Arpit et al., 2017)
  • Co-teaching (Han et al., 2018)
  • MentorNet (Jiang et al., 2018)
  • Abstention (Thulasidasan et al., 2019)

Backbone for every baseline:

  • Preactivation ResNet-34 (He et al., 2016) for MNIST and CIFAR-10/100
  • PointNet (Qi et al.) for ModelNet40 point clouds
  • ResNet-50 for Clothing1M

Training protocol (all methods): 180 epochs; RAdam optimizer (Liu et al., 2019); learning rate 0.001, decayed by 0.5 every 60 epochs.

Hyper-parameters for AdaCorr:

  • 30 epochs as burn-in (warm-up) period
  • Initial 1/ε set to 1.2, decreased by 0.02 every epoch


SLIDE 24

Experiment - Performance


SLIDE 25

Experiment - Performance


SLIDE 26

Experiment - Performance


SLIDE 27

Conclusion

  • We addressed the problem of training with label noise
  • We provided the first theoretical justification for data re-calibration methods
  • We proved that the noisy classifier can be used to decide whether a label is clean
  • We proposed a new theory-inspired algorithm
  • Scalable; easy to tune; good performance

Code will be available on GitHub: https://github.com/pingqingsheng/LRT

Thanks for watching