Robust Deep Learning Based on Meta-learning
Deyu Meng
Xi'an Jiaotong University
dymeng@mail.xjtu.edu.cn
http://gr.xjtu.edu.cn/web/dymeng
- Deep Learning
- Robust
- Meta-learning
The Success of Deep Learning Relies on well-annotated & big data sets (e.g., LFW)
What we think we have vs. what we really have:
Commonly Encountered Data Bias (low quality data)
- Label noise
- Data noise
- Class imbalance
- Deep Learning
- Robust
- Meta-learning
Robust Machine Learning for Data Bias
Design a specific optimization objective (especially a robust loss) that is robust to a certain type of data bias:
- Label noise
- Data noise
- Class imbalance
Lin et al., TPAMI, 2018; Yong et al., TPAMI, 2018; Meng et al., Information Sciences, 2017
Two Critical Issues
- Generalized Cross Entropy
- Symmetric Cross Entropy
- Bi-Tempered Logistic Loss
- Polynomial Soft Weighting Loss
- Focal Loss
- CT Loss
Lin et al., TPAMI, 2018; Xie et al., TMI, 2018; Zhao et al., AAAI, 2015; Amid et al., NeurIPS, 2019; Wang et al., ICCV, 2019; Zhang et al., NeurIPS, 2018
1) Hyperparameter tuning
2) Non-convexity
- Deep Learning
- Robust
- Meta-learning
Training Data VS Validation Data
Hyper-parameter tuning: by validation data
Training loss: training the model under hyper-parameters Θ yields $\boldsymbol{x}^*(\Theta)$.
Validation loss (searched over a finite candidate set):

$$\Theta^* \approx \mathop{\arg\min}_{\Theta \in \{\Theta_1, \Theta_2, \cdots, \Theta_t\}} \frac{1}{N} \sum_{j=1}^{N} \ell_j\big(\boldsymbol{x}^*(\Theta)\big)$$
Drawbacks of this search scheme:
✓ Low efficiency
✓ Low accuracy
✓ Search instead of optimization
✓ Heuristic instead of intelligent
- Validation data serves a higher-level function than training data
➢ Hyper-parameter tuning vs. classifier parameter learning
➢ It makes the model adaptable to the data at hand (from general to specific)
- Validation data is different from training data!
➢ Teacher vs. student
➢ Ideal vs. real
➢ High quality vs. low quality
➢ Small scale vs. large scale
➢ Fixed vs. dynamic (relatively)
- What should we do?
➢ Lower the threshold for collecting training data; raise the threshold for selecting validation data
Intrinsic Functions of Validation Data
✓ Optimization instead of search
✓ Intelligent instead of heuristic (partially)
From Validation Loss Searching to Meta Loss Training
Hyper-parameter tuning: by meta data
Training loss: training the model under Θ yields $\boldsymbol{x}^*(\Theta)$.
Meta loss (optimized over the full continuous hyper-parameter space):

$$\Theta^* = \mathop{\arg\min}_{\Theta} \frac{1}{N} \sum_{j=1}^{N} \ell_j^{\text{meta}}\big(\boldsymbol{x}^*(\Theta)\big)$$
Many Recent Attempts
◆ Loss function.
Wu L, Tian F, Xia Y, et al. Learning to teach with dynamic loss functions. In NeurIPS, 2018: 6466-6477.
Huang C, Zhai S, Talbott W, et al. Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment. In ICML, 2019: 2891-2900.
Xu H, Zhang H, Hu Z, et al. AutoLoss: Learning Discrete Schedule for Alternate Optimization. In ICLR, 2019.
Li C, Yuan X, Lin C, et al. AM-LFS: AutoML for Loss Function Search. In ICCV, 2019: 8410-8419.
Grabocka J, Scholz R, Schmidt-Thieme L. Learning Surrogate Losses. arXiv preprint arXiv:1905.10108, 2019.
◆ Regularization.
Feng J, Simon N. Gradient-based regularization parameter selection for problems with nonsmooth penalty functions. Journal of Computational and Graphical Statistics, 2018, 27(2): 426-435.
Frecon J, Salzo S, Pontil M. Bilevel learning of the group lasso structure. In NeurIPS, 2018: 8301-8311.
Streeter M. Learning Optimal Linear Regularizers. In ICML, 2019: 5996-6004.
◆ Learner (NAS).
Zoph B, Le Q V. Neural architecture search with reinforcement learning. In ICLR, 2017.
Baker B, Gupta O, Naik N, et al. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
Pham H, Guan M, Zoph B, et al. Efficient Neural Architecture Search via Parameter Sharing. In ICML, 2018: 4092-4101.
Zoph B, Vasudevan V, Shlens J, et al. Learning transferable architectures for scalable image recognition. In CVPR, 2018: 8697-8710.
Liu H, Simonyan K, Yang Y. DARTS: Differentiable architecture search. In ICLR, 2019.
Xie S, Zheng H, Liu C, et al. SNAS: Stochastic neural architecture search. In ICLR, 2019.
Liu C, Zoph B, Neumann M, et al. Progressive neural architecture search. In ECCV, 2018: 19-34.
Many Recent Attempts
◆ Hyper-parameters learning.
Maclaurin D, Duvenaud D, Adams R. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015: 2113-2122.
Pedregosa F. Hyperparameter optimization with approximate gradient. In ICML, 2016: 737-746.
Luketina J, Berglund M, Greff K, et al. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, 2016: 2952-2960.
Franceschi L, Donini M, Frasconi P, et al. Forward and reverse gradient-based hyperparameter optimization. In ICML, 2017: 1165-1173.
Franceschi L, Frasconi P, Salzo S, et al. Bilevel Programming for Hyperparameter Optimization and Meta-Learning. In ICML, 2018: 1563-1572.
◆ Gradients and learning rate.
Andrychowicz M, Denil M, Gomez S, et al. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.
Baydin A G, Cornish R, Rubio D M, et al. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.
Jacobsen A, Schlegel M, Linke C, et al. Meta-descent for Online, Continual Prediction. In AAAI, 2019.
Metz L, et al. Understanding and correcting pathologies in the training of learned optimizers. In ICML, 2019: 4556-4565.
Xu Z, Dai A M, Kemp J, et al. Learning an Adaptive Learning Rate Schedule. arXiv preprint arXiv:1909.09712, 2019.
◆ Sample reweighting.
Jiang L, Zhou Z, Leung T, et al. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML, 2018: 2309-2318.
Ren M, Zeng W, Yang B, et al. Learning to Reweight Examples for Robust Deep Learning. In ICML, 2018: 4331-4340.
Shu J, Xie Q, Yi L, et al. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, 2019.
Zhao S, Fard M M, Narasimhan H, et al. Metric-Optimized Example Weights. In ICML, 2019: 7533-7542.
- Deep Learning
- Robust
- Meta-learning
Adaptively Learning the Robust Loss
Generalized Cross Entropy, Symmetric Cross Entropy, Bi-Tempered Logistic Loss, Polynomial Soft Weighting Loss
Zhao et al., AAAI, 2015; Amid et al., NeurIPS, 2019; Wang et al., ICCV, 2019; Zhang et al., NeurIPS, 2018
Hyperparameter Learning by Meta Learning: the robust-loss hyper-parameter is learned on the meta loss while the network itself is trained on the (robust) training loss.
Shu et al., submitted, 2019
Experimental Results
Shu, et al., submitted, 2019
Experimental Results
✓ The hyper-parameter adaptively learned by meta-learning is actually not the optimal one for the original loss with its hyper-parameter held fixed throughout the iterations.
✓ Meta-learning adaptively finds a proper hyper-parameter and simultaneously explores a good initialization of the network parameters under the current hyper-parameter, in a dynamic way.
✓ Such an adaptive learning manner should be more suitable for simultaneously obtaining optimal values of both, rather than updating one while the other is held fixed.
Shu, et al., submitted, 2019
What if the Model Contains a Large Number of Hyper-parameters?
➢ Overfitting easily occurs (as in conventional machine learning)
➢ How to alleviate this issue?
➢ Build a parametric prior representation (neither too large nor too small) for the hyper-parameters (as in conventional machine learning)
➢ Learner vs. meta-learner
➢ We need to deeply understand the data as well as the learning problem!
✓ Multi-view learning, multi-task learning (parameters assumed similar)
✓ Subspace learning (matrix assumed low-rank)
- Deep Learning
- Robust
- Meta-learning
Deep Learning with Training Data Bias
Problem: big data often comes with noisy labels or class imbalance.
Deep networks tend to overfit to the training data!
Zhang et al. (2017) found that deep neural networks easily fit (memorize) random labels.
Zhang C, Bengio S, Hardt M, et al. Understanding deep learning requires rethinking generalization. In ICLR, 2017 (best paper).
How can we robustly train deep networks on biased training data so as to improve generalization performance?
Related work: Learning with Training Data Bias
◆Sample weighting methods
✓ Dataset resampling (Chawla et al., 2002)
✓ Instance re-weighting (Zadrozny, 2004)
✓ AdaBoost (Freund & Schapire, 1997)
✓ Hard example mining (Malisiewicz et al., 2011)
✓ Focal loss (Lin et al., 2018)
✓ Self-paced learning (Kumar et al., 2010)
✓ Iterative reweighting strategies (Fernando & Michael, 2003; Zhang & Sabuncu, 2018)
✓ Prediction variance (Chang et al., 2017)
◆Meta learning methods
✓ FWL (Dehghani et al., 2018)
✓ Learning to teach (Fan et al., 2018; Wu et al., 2018)
✓ MentorNet (Jiang et al., 2018)
✓ L2RW (Ren et al., 2018)
◆Other methods
✓ GLC (Hendrycks et al., 2018)
✓ Reed (Reed et al., 2015)
✓ Co-teaching (Han et al., 2018)
✓ D2L (Ma et al., 2018)
✓ S-Model (Goldberger & Ben-Reuven, 2017)
Sample weighting methods
Existing studies define the curriculum as a hand-designed weighting function for specific tasks, with extra hyper-parameters to set.
| Strategy | Regularizer $H(v;\lambda)$ | Weight $v^*$ |
|---|---|---|
| Self-paced [Kumar et al., NIPS 2010] | $-\lambda\lVert v\rVert_1$ | $v_i^*=\mathbb{1}(l_i \le \lambda)$ |
| Linear weighting [Jiang et al., AAAI 2015] | $\tfrac{1}{2}\lambda\sum_{i=1}^{n}(v_i^2-2v_i)$ | $v_i^*=\max(0,\,1-l_i/\lambda)$ |
| Focal loss [Lin et al., ICCV 2017] | — | $v_i^*=(1-\exp(-l_i))^{\gamma}$ |
| Hard example mining [Malisiewicz et al., ICCV 2011] | — | $v_i^*=\mathbb{1}(l_i > \lambda(1-y_i))$ |
| Prediction variance [Chang et al., NIPS 2017] | — | $v_i^*=\sqrt{\mathrm{Var}(l_i)+\mathrm{Var}(l_i)^2/\lvert l_i\rvert}$ |

⚫ Need to pre-specify the form of the weighting function
⚫ Need to manually set the hyper-parameters
Meta Data and Meta Loss
L2RW [Ren et al., ICML 2018]: directly learns sample weights from the training data and meta data.
MentorNet [Jiang et al., ICML 2018]: the meta-learner maps the training loss and input structure to sample weights, and is trained with a meta loss on meta data.
Drawback: the meta-learner is complex and hard to reproduce (very complex input, very complex Θ).
Our work
Meta-Weight-Net
Input: the per-sample loss; Θ: an MLP.
Our work
Inner loop (student, on training data): $\boldsymbol{x}^*(\Theta) = \mathop{\arg\min}_{\boldsymbol{x}} \frac{1}{n}\sum_{i=1}^{n} \mathcal{V}\big(\ell_i(\boldsymbol{x});\Theta\big)\,\ell_i(\boldsymbol{x})$
Outer loop (teacher, on meta data): $\Theta^* = \mathop{\arg\min}_{\Theta} \frac{1}{N}\sum_{j=1}^{N} \ell_j^{\text{meta}}\big(\boldsymbol{x}^*(\Theta)\big)$
Notation:
◆ Θ: parameters of the teacher (the weighting net $\mathcal{V}(\cdot;\Theta)$)
◆ $\boldsymbol{x}$: parameters of the student (the classifier)
Meta-Weight-Net
Shu, et al., NeurIPS, 2019
Our work
Step 5 (differentiable pseudo-update of the student along the weighted training loss):
$$\hat{\boldsymbol{x}}^{(t)}(\Theta) = \boldsymbol{x}^{(t)} - \alpha\,\nabla_{\boldsymbol{x}} \frac{1}{n}\sum_{i=1}^{n} \mathcal{V}\big(\ell_i(\boldsymbol{x}^{(t)});\Theta\big)\,\ell_i(\boldsymbol{x}^{(t)})$$
Step 6 (teacher update on the meta loss):
$$\Theta^{(t+1)} = \Theta^{(t)} - \beta\,\nabla_{\Theta} \frac{1}{N}\sum_{j=1}^{N} \ell_j^{\text{meta}}\big(\hat{\boldsymbol{x}}^{(t)}(\Theta)\big)$$
Step 7 (actual student update with the refreshed weights):
$$\boldsymbol{x}^{(t+1)} = \boldsymbol{x}^{(t)} - \alpha\,\nabla_{\boldsymbol{x}} \frac{1}{n}\sum_{i=1}^{n} \mathcal{V}\big(\ell_i(\boldsymbol{x}^{(t)});\Theta^{(t+1)}\big)\,\ell_i(\boldsymbol{x}^{(t)})$$
Shu et al., NeurIPS, 2019
Our work
Shu, et al., NeurIPS, 2019
Experiments
Experimental Setup: Class Imbalance
Datasets: CIFAR-10 & CIFAR-100
Shu, et al., NeurIPS, 2019
Experimental Setup: Noisy Label
Datasets: CIFAR-10 & CIFAR-100
Shu, et al., NeurIPS, 2019
Stability Analysis of Meta-Weight-Net
Shu, et al., NeurIPS, 2019
Real Data Experiment
Shu, et al., NeurIPS, 2019
Insight: Adaptively Learn the Weight Function
Shu, et al., NeurIPS, 2019
Future research
◆ Extension to other semi-/weakly-supervised learning problems
◆ Further improvements to Meta-Weight-Net
◆ Multi-view learning, ensemble learning, domain adaptation
◆ General hyper-parameter learning (meta-learner design)