Error-Bounded Correction of Noisy Labels

Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris Metaxas, Chao Chen
The State University of New York at Stony Brook · Rutgers University · The City University of New York, Queens College


1. Error-Bounded Correction of Noisy Labels. Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris Metaxas, Chao Chen. The State University of New York at Stony Brook; Rutgers University; The City University of New York, Queens College.

2. Label Noise is Ubiquitous and Troublesome. [Figure: a model trained with noisy labels (a dog image labeled "Cat") becomes a noisy model, which then gives wrong predictions at inference time.] Label noise can be introduced by human or automatic annotators who make labeling mistakes (Yan et al. 2014; Veit et al. 2017).

3. Settings. $\tilde{y}$ is the noisy label (observed); $y$ is the clean label (unknown). Challenge: train with noisy data $(x, \tilde{y})$, but the model is required to predict the correct clean label $y$. [Figure: a robust model trained with $(x, \tilde{y})$ is asked "Cat?" at inference time.]

4. Settings (continued). In addition to the setup on the previous slide, define the noise transition matrix $T$ with entries $\tau_{jk} = P(\tilde{y} = k \mid y = j)$. Two common noise models on a 3-class cat/dog/human example (rows: true label; columns: noisy label cat, dog, human):
• Uniform noise:
  cat:   0.4  0.3  0.3
  dog:   0.3  0.4  0.3
  human: 0.3  0.3  0.4
• Pairwise noise:
  cat:   0.6  0.4  0.0
  dog:   0.4  0.6  0.0
  human: 0.0  0.4  0.6
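The sketch below (not the authors' code) shows how a transition matrix like the two above can be used to inject synthetic label noise into a toy 3-class cat/dog/human dataset; NumPy and the helper name `corrupt_labels` are assumptions for illustration.

```python
import numpy as np

# Inject synthetic label noise with a transition matrix T,
# where T[j, k] = P(noisy label = k | true label = j).
rng = np.random.default_rng(0)

# Uniform noise: every wrong class is equally likely.
T_uniform = np.array([[0.4, 0.3, 0.3],
                      [0.3, 0.4, 0.3],
                      [0.3, 0.3, 0.4]])

# Pairwise noise: each class is confused with one specific other class.
T_pairwise = np.array([[0.6, 0.4, 0.0],
                       [0.4, 0.6, 0.0],
                       [0.0, 0.4, 0.6]])

def corrupt_labels(y_clean, T):
    """Sample a noisy label for each clean label by drawing from row T[y]."""
    return np.array([rng.choice(len(T), p=T[y]) for y in y_clean])

y_clean = rng.integers(0, 3, size=10)      # toy clean labels: 0=cat, 1=dog, 2=human
y_noisy = corrupt_labels(y_clean, T_uniform)
print(y_clean, y_noisy)
```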

5. Existing Solutions – Model Re-calibration. Introduce a new loss term to obtain a robust model: 1) estimate the transition matrix $T$ and use it to correct the loss (Goldberger & Ben-Reuven, 2017; Patrini et al., 2017); 2) add a robust deep-learning layer (Van Rooyen et al., 2015); 3) add a reconstruction loss term (Reed et al., 2014). Pros: global regularization; theoretical guarantees. Cons: not flexible enough; ignores local information.

6. Existing Solutions – Data Re-calibration. [Diagram: input → re-weighting → training → robust model.] Re-weight or select data points using the noisy classifier: the noisy classifier's confidence determines each point's weight, so points with clean labels receive higher weight; re-weighting and training happen jointly. Pros: better performance than model re-calibration; flexible enough to fully exploit point-wise information. Cons: no theoretical support.

7. Contributions. • The first theoretical explanation of data re-calibration methods: why a noisy classifier can be used to decide whether a label is trustworthy. • A theory-inspired data re-calibration algorithm that is easy to tune, scalable, and performs label correction. (Decorative image source: istockphoto.com.)

8. (Noisy) Classifier and (Noisy) Posterior. The classification scoring function $f(x)$ approximates the posterior probability of the label: • trained on clean data $(x, y)$, $f(x)$ approximates the clean posterior $\eta(x) = P(y = 1 \mid x)$; • trained on noisy data $(x, \tilde{y})$, $f(x)$ approximates the noisy posterior $\tilde{\eta}(x) = P(\tilde{y} = 1 \mid x)$.

9. (Noisy) Classifier and (Noisy) Posterior. As on the previous slide, $f(x)$ approximates the clean posterior $\eta(x) = P(y = 1 \mid x)$ when trained on clean data and the noisy posterior $\tilde{\eta}(x) = P(\tilde{y} = 1 \mid x)$ when trained on noisy data. The two posteriors are linearly related: $\tilde{\eta}(x) = (1 - \tau_{10} - \tau_{01})\,\eta(x) + \tau_{01}$, where $\tau_{10} = P(\tilde{y} = 0 \mid y = 1)$ and $\tau_{01} = P(\tilde{y} = 1 \mid y = 0)$.
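The linear relationship follows from the law of total probability under the standard class-conditional noise assumption (the noise depends only on the true label, not on $x$); a short derivation, not shown on the slide:

```latex
\begin{align*}
\tilde{\eta}(x) &= P(\tilde{y}=1 \mid x) \\
  &= P(\tilde{y}=1 \mid y=1)\,P(y=1 \mid x) + P(\tilde{y}=1 \mid y=0)\,P(y=0 \mid x) \\
  &= (1-\tau_{10})\,\eta(x) + \tau_{01}\bigl(1-\eta(x)\bigr) \\
  &= (1-\tau_{10}-\tau_{01})\,\eta(x) + \tau_{01}.
\end{align*}
```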

10. Low Confidence of $\tilde{\eta}(x)$ Implies Noise. Theorem 1. Let $\varepsilon := \| f - \tilde{\eta} \|_\infty$ and $\Delta = \frac{1 - \tau_{10} - \tau_{01}}{2}$. Then there exist constants $C, \lambda > 0$ such that:
• $y = 1$: $\mathrm{Prob}\bigl[ f(x) \le \Delta,\ \tilde{y} \text{ is clean} \bigr] \le C \cdot O(\varepsilon^{\lambda})$
• $y = 0$: $\mathrm{Prob}\bigl[ 1 - f(x) \le \Delta,\ \tilde{y} \text{ is clean} \bigr] \le C \cdot O(\varepsilon^{\lambda})$
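As a rough illustration of what Theorem 1 licenses, the sketch below (assumed names, not the paper's algorithm) flags a label as likely noisy whenever the classifier's confidence in the observed label falls below $\Delta = (1 - \tau_{10} - \tau_{01})/2$. In practice the transition rates are unknown; the algorithm on slides 21-22 instead uses a likelihood-ratio test with a tunable threshold.

```python
import numpy as np

# Flag a binary label as suspicious when the classifier's score for the observed
# noisy label falls below Delta = (1 - tau_10 - tau_01) / 2 (Theorem 1).
def suspicious_mask(f_scores, y_noisy, tau_10, tau_01):
    """f_scores: array of f(x) = predicted P(y=1 | x); y_noisy: observed 0/1 labels."""
    delta = (1.0 - tau_10 - tau_01) / 2.0
    # Confidence assigned to the observed label.
    conf_in_label = np.where(y_noisy == 1, f_scores, 1.0 - f_scores)
    return conf_in_label <= delta   # True = label is flagged as likely noisy

# Toy example: Delta = 0.3; only the second point's label has confidence below it.
print(suspicious_mask(np.array([0.9, 0.2, 0.35]), np.array([1, 1, 0]), 0.2, 0.2))
# -> [False  True False]
```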

11. Low Confidence of $\tilde{\eta}(x)$ Implies Noise. Theorem 1 (restated as on the previous slide). [Figure: the clean posterior $\eta(x)$ and the noisy posterior $\tilde{\eta}(x)$ plotted over the input space, with clean-label and noisy-label regions and the threshold $\Delta$ marked.]

12. Low Confidence of $\tilde{\eta}(x)$ Implies Noise. Theorem 1 (restated). [Figure: as before, now with the learned scoring function $f(x)$ overlaid on $\eta(x)$ and $\tilde{\eta}(x)$.]

13. Low Confidence of $\tilde{\eta}(x)$ Implies Noise. Theorem 1 (restated), together with the linear relation $\tilde{\eta}(x) = (1 - \tau_{10} - \tau_{01})\,\eta(x) + \tau_{01}$. [Figure: $\eta(x)$, $\tilde{\eta}(x)$, and $f(x)$ with clean-label and noisy-label regions marked.]

14. Low Confidence of $\tilde{\eta}(x)$ Implies Noise. Theorem 1 (restated). [Figure: $\eta(x)$, $\tilde{\eta}(x)$, and $f(x)$ with clean-label and noisy-label regions marked.]

15. Low Confidence of $\tilde{\eta}(x)$ Implies Noise (the $y = 0$ case). Theorem 1 (restated). [Figure: the complementary curves $1 - \eta(x)$, $1 - \tilde{\eta}(x)$, and $1 - f(x)$, with clean-label and noisy-label regions marked.]

16. Tsybakov Condition. [Figure: clean and noisy labels around the decision boundary $\eta(x) = 1/2$; the shaded band is the low-margin region $\tfrac{1}{2} - t \le \eta(x) \le \tfrac{1}{2} + t$.]

17. Tsybakov Condition. Tsybakov condition (2004): there exist constants $C, \lambda > 0$ and $t_0 \in \bigl(0, \tfrac{1}{2}\bigr]$ such that for all $t \le t_0$, $P\bigl[\, |\eta(x) - \tfrac{1}{2}| \le t \,\bigr] \le C\, t^{\lambda}$. [Figure: the low-margin band $\tfrac{1}{2} - t \le \eta(x) \le \tfrac{1}{2} + t$ around the decision boundary $\eta(x) = 1/2$.]

18. Tsybakov Condition. Same condition as on the previous slide, with an empirical verification on CIFAR-10: the fitted constants are $\hat{C} = 0.32$ and $\hat{\lambda} = 1.04$, and the fit is statistically significant. [Figure: low-margin band around $\eta(x) = 1/2$, as before.]
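One plausible way to produce such a fit (the slides do not show the exact protocol, so the procedure and names below are assumptions): estimate the margin $|\eta(x) - 1/2|$ from a trained model's predicted probabilities and fit $P[\,|\eta(x) - 1/2| \le t\,] \approx C\,t^{\lambda}$ on a log-log scale for small $t$.

```python
import numpy as np

# Fit P[margin <= t] ~ C * t^lambda by linear regression in log-log space.
def fit_tsybakov(margins, t_grid):
    probs = np.array([(margins <= t).mean() for t in t_grid])
    keep = probs > 0                                   # avoid log(0)
    slope, intercept = np.polyfit(np.log(t_grid[keep]), np.log(probs[keep]), 1)
    return np.exp(intercept), slope                    # (C_hat, lambda_hat)

rng = np.random.default_rng(0)
eta_hat = rng.beta(0.5, 0.5, size=50_000)              # toy stand-in for estimated eta(x)
margins = np.abs(eta_hat - 0.5)
C_hat, lam_hat = fit_tsybakov(margins, np.linspace(0.01, 0.5, 50))
print(f"C_hat={C_hat:.2f}, lambda_hat={lam_hat:.2f}")  # the slide reports 0.32 and 1.04 on CIFAR-10
```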

19. Low Confidence of $\tilde{\eta}(x)$ Implies Noise. Theorem 1 instantiated with the empirically derived constants: let $\varepsilon := \| f - \tilde{\eta} \|_\infty$; there exists $\Delta \in (0, 1)$ such that
• $y = 1$: $\mathrm{Prob}\bigl[ f(x) \le \Delta,\ \tilde{y} \text{ is clean} \bigr] \le 0.23 \cdot O(\varepsilon^{1.04})$
• $y = 0$: $\mathrm{Prob}\bigl[ 1 - f(x) \le \Delta,\ \tilde{y} \text{ is clean} \bigr] \le 0.23 \cdot O(\varepsilon^{1.04})$

20. Theory-Inspired Algorithm

21. Theory-Inspired Algorithm. Corollary 1. Let $\varepsilon := \max_x | f(x) - \tilde{\eta}(x) |$. If $\tilde{y}^{\mathrm{new}}$ denotes the output of LRT-Correction with input $(x, \tilde{y})$, $f$, and threshold $\delta$, then there exist $C, \lambda > 0$ such that $\mathrm{Prob}\bigl[ \tilde{y}^{\mathrm{new}} \text{ is clean} \bigr] > 1 - C \cdot O(\varepsilon^{\lambda})$. Remark: the extension to the multi-class setting is natural.
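A minimal multi-class sketch in the spirit of LRT-Correction (illustrative; the exact rule and threshold semantics are in the paper, so treat the function name and flip rule as assumptions): compare the classifier's probability for the observed label with its probability for the top-scoring class, and replace the label when the ratio drops below the threshold.

```python
import numpy as np

# Likelihood-ratio-test style correction: keep the observed label unless it is
# much less likely than the classifier's top prediction.
def lrt_correction(probs, y_noisy, delta=0.5):
    """probs: (n, K) softmax outputs f(x); y_noisy: (n,) observed labels."""
    y_pred = probs.argmax(axis=1)
    rows = np.arange(len(y_noisy))
    likelihood_ratio = probs[rows, y_noisy] / probs[rows, y_pred]
    flip = likelihood_ratio < delta          # observed label is far less likely than the prediction
    return np.where(flip, y_pred, y_noisy)

probs = np.array([[0.7, 0.2, 0.1],     # observed label 0 kept (ratio = 1)
                  [0.8, 0.1, 0.1],     # observed label 1 flipped to 0 (ratio = 0.125)
                  [0.4, 0.35, 0.25]])  # observed label 1 kept (ratio = 0.875)
print(lrt_correction(probs, np.array([0, 1, 1])))   # -> [0 0 1]
```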

22. AdaCorr: Using LRT-Correction During Training. Step 1: train $f(x)$ on the current labels $(x, \tilde{y})$. Step 2: apply LRT-Correction to $(x, \tilde{y})$ using $f(x)$ and the threshold $\delta$. Step 3: set $\tilde{y} \leftarrow \tilde{y}^{\mathrm{new}}$. Step 4: repeat Steps 1–3. Remark: in Step 1, to get a good approximation of $\tilde{\eta}(x)$, we first train $f(x)$ on $(x, \tilde{y})$ for several warm-up epochs.
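A self-contained toy sketch of the AdaCorr loop above (random data, assumed hyper-parameters, and a linear model standing in for the deep network; PyTorch is an assumption, not prescribed by the slides):

```python
import torch
import torch.nn.functional as F

# Warm-up training on the noisy labels, then alternate between training f(x)
# and applying the LRT-style correction to the current labels.
torch.manual_seed(0)
n, d, k = 512, 20, 3
x = torch.randn(n, d)
y_noisy = torch.randint(0, k, (n,))          # stands in for the observed noisy labels

model = torch.nn.Linear(d, k)                # stands in for the deep network f(x)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

def train_one_epoch(labels):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), labels)
    loss.backward()
    opt.step()

def lrt_correction(labels, delta=0.5):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    y_pred = probs.argmax(dim=1)
    ratio = probs[torch.arange(n), labels] / probs[torch.arange(n), y_pred]
    return torch.where(ratio < delta, y_pred, labels)

labels = y_noisy.clone()
for epoch in range(30):
    train_one_epoch(labels)                  # Step 1: train f(x) on the current labels
    if epoch >= 10:                          # warm-up epochs: no correction at first
        labels = lrt_correction(labels)      # Steps 2-3: correct labels, then reuse them
```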
