 
Error-Bounded Correction of Noisy Labels
Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris Metaxas, Chao Chen
The State University of New York at Stony Brook; Rutgers University; The City University of New York, Queen’s College

Label Noise is Ubiquitous and Troublesome
[Figure: training with noisy labels produces a noisy model; inference with the noisy model. Example labels shown: Dog, Cat]
Label noise can be introduced by:
• Mistakes made by human or automatic annotators (Yan et al. 2014; Veit et al. 2017)

Settings
• $\tilde{y}$ is the noisy label (observed); $y$ is the clean label (unknown)
• Challenge: train with noisy data $(\mathbf{x}, \tilde{\mathbf{y}})$, but still predict the correct label $\mathbf{y}$
[Figure: a robust model trained with $(\mathbf{x}, \tilde{\mathbf{y}})$ is queried at inference time ("Cat?")]
• Noise transition matrix $T$, with entries $\tau_{jk} = P(\tilde{y} = k \mid y = j)$ (rows: true label; columns: noisy label)

Uniform noise
          cat   dog   human
cat       0.4   0.3   0.3
dog       0.3   0.4   0.3
human     0.3   0.3   0.4

Pairwise noise
          cat   dog   human
cat       0.6   0.4   0
dog       0.4   0.6   0
human     0     0.4   0.6

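To make the role of the transition matrix concrete, here is a minimal sketch that corrupts clean labels by sampling from the rows of the uniform-noise matrix on this slide. The function name `corrupt_labels`, the class order (cat, dog, human) and the synthetic clean labels are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

# Rows index the true class, columns the observed (noisy) class,
# so T[j, k] = P(noisy label = k | true label = j).
# Assumed class order: cat, dog, human.
T_UNIFORM = np.array([[0.4, 0.3, 0.3],
                      [0.3, 0.4, 0.3],
                      [0.3, 0.3, 0.4]])

def corrupt_labels(y_clean, T, rng):
    """Draw one noisy label per clean label from the row T[y] (hypothetical helper)."""
    y_clean = np.asarray(y_clean)
    num_classes = T.shape[0]
    return np.array([rng.choice(num_classes, p=T[y]) for y in y_clean])

rng = np.random.default_rng(0)
y_clean = rng.integers(0, 3, size=20_000)   # synthetic clean labels
y_noisy = corrupt_labels(y_clean, T_UNIFORM, rng)

# Count (true, noisy) pairs and normalise each row to get empirical
# transition frequencies, which should be close to T_UNIFORM.
T_hat = np.zeros((3, 3))
np.add.at(T_hat, (y_clean, y_noisy), 1)
T_hat /= T_hat.sum(axis=1, keepdims=True)
```

Comparing `T_hat` with `T_UNIFORM` is a quick check that the sampled noise follows the intended matrix.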
Existing Solutions – Model Re-calibration
• Introduce a new loss term to obtain a robust model:
1) Estimate the matrix $T$ and use it to correct the loss term (Goldberger & Ben-Reuven, 2017; Patrini et al., 2017)
2) Robust deep learning layer (Van Rooyen et al., 2015)
3) Reconstruction loss term (Reed et al., 2014)
• Pros: global regularization; theoretical guarantees
• Cons: not flexible enough; omits local information

Existing Solutions – Data Re-calibration
[Figure: Input, Reweighting, Training, Robust Model]
• Re-weight or select data points using the noisy classifier
• The noisy classifier’s confidence determines the weight
• Clean labels gain higher weight
• Re-weighting and training happen jointly
• Pros: better performance than model re-calibration; flexible enough to fully use point-wise information
• Cons: no theoretical support

Contribution
• The first theoretical explanation for data re-calibration methods
• Explains why a noisy classifier can be used to decide whether a label is trustworthy
• A theory-inspired data re-calibration algorithm
• Easy to tune
• Scalable
• Label correction
Image Source: https://media.istockphoto.com/vectors/hand-drawn-vector-cartoon-illustration-of-a-broken-robot-trying-to-vector-id1131797122?k=6&m=1131797122&s=612x612&w=0&h=H2fviprWr24dxlO2QPae1R8X3nrHB-J40NCunv2aE84=

(Noisy) Classifier and (Noisy) Posterior
The classification scoring function $f(x)$ approximates the posterior probability of the label:
• Clean $(x, y)$: $f(x)$ approximates the clean posterior $\eta(x) = P(y = 1 \mid x)$
• Noisy $(x, \tilde{y})$: $f(x)$ approximates the noisy posterior $\tilde{\eta}(x) = P(\tilde{y} = 1 \mid x)$
• The two posteriors are linearly related: $\tilde{\eta}(x) = (1 - \tau_{10} - \tau_{01})\,\eta(x) + \tau_{01}$
Recall that $\tau_{10} = P(\tilde{y} = 0 \mid y = 1)$ and $\tau_{01} = P(\tilde{y} = 1 \mid y = 0)$.

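The linear relationship follows from the law of total probability. The short derivation below assumes that the noisy label depends on $x$ only through the clean label (class-conditional noise); the slides imply this through the transition matrix but do not state it.

```latex
\begin{align*}
\tilde{\eta}(x) &= P(\tilde{y} = 1 \mid x) \\
  &= P(\tilde{y} = 1 \mid y = 1)\,P(y = 1 \mid x)
   + P(\tilde{y} = 1 \mid y = 0)\,P(y = 0 \mid x) \\
  &= (1 - \tau_{10})\,\eta(x) + \tau_{01}\,\bigl(1 - \eta(x)\bigr) \\
  &= (1 - \tau_{10} - \tau_{01})\,\eta(x) + \tau_{01}.
\end{align*}
```

For example, with $\tau_{10} = \tau_{01} = 0.2$ a clean posterior of $\eta(x) = 1$ maps to $\tilde{\eta}(x) = 0.8$, and $\eta(x) = 1/2$ stays at $\tilde{\eta}(x) = 1/2$.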
Low Confidence of $\tilde{\eta}(x)$ Implies Noise
Theorem 1. Let $\epsilon := \|f - \tilde{\eta}\|_\infty$. There exist constants $C, \lambda > 0$ and a threshold $\Delta \in (0, 1)$, which depends on the noise rates $\tau_{10}$ and $\tau_{01}$, such that:
• $\tilde{y} = 1$: $\mathrm{Prob}\bigl[f(x) \le \Delta,\ \tilde{y} \text{ is clean}\bigr] \le C\,[O(\epsilon)]^{\lambda}$
• $\tilde{y} = 0$: $\mathrm{Prob}\bigl[1 - f(x) \le \Delta,\ \tilde{y} \text{ is clean}\bigr] \le C\,[O(\epsilon)]^{\lambda}$
[Figure: curves of $\eta(x)$, $\tilde{\eta}(x)$ and $f(x)$ for the first case, and of $1 - \eta(x)$, $1 - \tilde{\eta}(x)$ and $1 - f(x)$ for the second, against the threshold $\Delta$, with regions marked Clean Label and Noisy Label; the relation $\tilde{\eta}(x) = (1 - \tau_{10} - \tau_{01})\,\eta(x) + \tau_{01}$ links the clean and noisy curves]

Tsybakov Condition
• Tsybakov condition [2004]. There exist constants $C, \lambda > 0$ and $t_0 \in (0, \tfrac{1}{2}]$ such that, for all $t \le t_0$: $P\bigl[\,\lvert \eta(x) - \tfrac{1}{2} \rvert \le t\,\bigr] \le C\,t^{\lambda}$
• Empirical verification (CIFAR-10): $\hat{C} = 0.32$ and $\hat{\lambda} = 1.04$; statistically significant
[Figure: Clean Label and Noisy Label regions around the decision boundary $\eta(x) = 1/2$, with the band $\tfrac{1}{2} - t \le \eta(x) \le \tfrac{1}{2} + t$ highlighted]

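One way to estimate constants of this kind is a log-log least-squares fit of the empirical mass near the decision boundary against $t$. This is only a sketch under assumptions: the slides do not say how $\hat{C}$ and $\hat{\lambda}$ were obtained, the function name `fit_tsybakov` is hypothetical, and the posteriors here are synthetic rather than CIFAR-10. A regression gives an approximate fit, not a guaranteed upper bound.

```python
import numpy as np

def fit_tsybakov(eta, t_grid):
    """Least-squares fit of log P(|eta - 1/2| <= t) = log C + lambda * log t.

    Hypothetical helper: returns (C_hat, lambda_hat).
    """
    margins = np.abs(np.asarray(eta) - 0.5)
    # Fraction of samples whose posterior lies within t of 1/2, for each t.
    mass = np.array([(margins <= t).mean() for t in t_grid])
    keep = mass > 0  # the log is undefined where no sample falls in the band
    lam, log_c = np.polyfit(np.log(t_grid[keep]), np.log(mass[keep]), deg=1)
    return np.exp(log_c), lam

rng = np.random.default_rng(0)
# Synthetic posteriors, uniform on [0, 1]; these are NOT CIFAR-10 values.
eta = rng.uniform(0.0, 1.0, size=200_000)
t_grid = np.linspace(0.01, 0.25, 25)
C_hat, lam_hat = fit_tsybakov(eta, t_grid)
```

For uniformly distributed posteriors, $P[\lvert \eta(x) - \tfrac{1}{2} \rvert \le t] = 2t$ exactly, so the fit should land near $C = 2$ and $\lambda = 1$, which makes this a convenient sanity check before applying the same routine to estimated posteriors.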
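Low Confidence of $\tilde{\eta}(x)$ Implies Noise
Theorem 1. Let $\epsilon := \|f - \tilde{\eta}\|_\infty$. There exist constants $C, \lambda > 0$ and $\Delta \in (0, 1)$ such that:
• $\tilde{y} = 1$: $\mathrm{Prob}\bigl[f(x) \le \Delta,\ \tilde{y} \text{ is clean}\bigr] \le 0.23\,[O(\epsilon)]^{1.04}$
• $\tilde{y} = 0$: $\mathrm{Prob}\bigl[1 - f(x) \le \Delta,\ \tilde{y} \text{ is clean}\bigr] \le 0.23\,[O(\epsilon)]^{1.04}$
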
Theory-Inspired Algorithm
Corollary 1. Let $\epsilon := \max_x \lvert f(x) - \tilde{\eta}(x) \rvert$. If $\tilde{y}_{new}$ denotes the output of LRT-Correction with inputs $(x, \tilde{y})$, $f$ and $\delta$, then $\exists\, C, \lambda > 0$ such that: $\mathrm{Prob}\bigl[\tilde{y}_{new} \text{ is clean}\bigr] > 1 - C\,[O(\epsilon)]^{\lambda}$
Remark: the extension to the multi-class setting is natural.

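The extracted slides name LRT-Correction but do not include the algorithm itself. The sketch below is therefore an assumption: it reads "LRT" as a likelihood ratio test that compares the classifier's score for the given label with its score for the top-scoring class, and flips the label when that ratio falls below $\delta$. The function name `lrt_correct` and the toy scores are hypothetical.

```python
import numpy as np

def lrt_correct(probs, y_noisy, delta):
    """Flip a label to the classifier's top class when its likelihood ratio is below delta.

    probs:   (n, k) array of classifier outputs f(x), rows summing to 1
    y_noisy: (n,) integer array of current (possibly noisy) labels
    delta:   threshold in (0, 1]; smaller values flip fewer labels
    Assumed rule, not confirmed by the slides.
    """
    probs = np.asarray(probs, dtype=float)
    y_noisy = np.asarray(y_noisy)
    rows = np.arange(len(y_noisy))
    top_class = probs.argmax(axis=1)
    # Ratio of the score for the given label to the score for the top class.
    ratio = probs[rows, y_noisy] / probs[rows, top_class]
    return np.where(ratio < delta, top_class, y_noisy)

probs = np.array([[0.90, 0.10],   # confident in class 0
                  [0.45, 0.55],   # nearly undecided
                  [0.20, 0.80]])  # confident in class 1
y_noisy = np.array([1, 0, 1])
y_new = lrt_correct(probs, y_noisy, delta=0.5)
```

In this toy case only the first label is flipped: its ratio is 0.1 / 0.9, well below 0.5, while the undecided second example keeps its label because the ratio 0.45 / 0.55 stays above the threshold.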
AdaCorr: Using LRT-Correction During Training
Step 1: Train $f(x)$ using $(x, \tilde{y})$
Step 2: Apply LRT-Correction using $(x, \tilde{y})$, $f(x)$ and $\delta$
Step 3: Let $\tilde{y} = \tilde{y}_{new}$
Step 4: Repeat Steps 1–3
Remark: in Step 1, to get a good approximation of $\tilde{\eta}(x)$, we first train $f(x)$ with $(x, \tilde{y})$ for several warm-up epochs.

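The four steps can be sketched as a generic training loop. This is an illustration under assumptions, not the authors' implementation: `adacorr_loop` and its three callables (`fit_epoch`, `predict_proba`, `correct`) are hypothetical placeholders that the caller supplies, and the exact warm-up schedule is a guess based on the remark.

```python
import numpy as np

def adacorr_loop(x, y_noisy, fit_epoch, predict_proba, correct, delta,
                 warmup_epochs, total_epochs):
    """Alternate training and label correction, following Steps 1-4.

    fit_epoch(x, y)           -> None, trains the model for one epoch
    predict_proba(x)          -> (n, k) array of scores f(x)
    correct(probs, y, delta)  -> (n,) array of corrected labels
    All three callables are placeholders supplied by the caller.
    """
    y_current = np.array(y_noisy, copy=True)
    for epoch in range(total_epochs):
        fit_epoch(x, y_current)                        # Step 1: train on current labels
        if epoch + 1 < warmup_epochs:
            continue                                   # warm-up: train only, no correction
        probs = predict_proba(x)
        y_current = correct(probs, y_current, delta)   # Steps 2-3: correct and replace labels
    return y_current                                   # Step 4 is the loop itself
```

Passing the correction rule in as a callable keeps the loop independent of any particular model or correction test, so the same skeleton works with a neural network or a simple baseline.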