Robust Deep Learning Based on Meta-learning


SLIDE 1

Robust Deep Learning Based on Meta-learning

Deyu Meng
Xi’an Jiaotong University
dymeng@mail.xjtu.edu.cn
http://gr.xjtu.edu.cn/web/dymeng

SLIDE 2
  • Deep Learning
  • Robust
  • Meta-learning
SLIDE 3

The Success of Deep Learning Relies on Well-annotated & Big Datasets (e.g., LFW)

SLIDE 4

What we think we have vs. what we really have.

SLIDE 5

Commonly Encountered Data Bias (Low-quality Data)

  • Label noise
  • Data noise
  • Class imbalance

SLIDE 6
  • Deep Learning
  • Robust
  • Meta-learning
SLIDE 7

Robust Machine Learning for Data Bias

Design a specific optimization objective (especially a robust loss) that is robust to a certain type of data bias:

  • Label noise
  • Data noise
  • Class imbalance

Lin et al., TPAMI, 2018; Yong et al., TPAMI, 2018; Meng et al., Information Sciences, 2017

SLIDE 8

Two Critical Issues

Robust losses in the literature:
  • Generalized Cross Entropy
  • Symmetric Cross Entropy
  • Bi-Tempered Logistic Loss
  • Polynomial Soft Weighting Loss
  • Focal Loss
  • CT Loss

Lin et al., TPAMI, 2018; Xie et al., TMI, 2018; Zhao et al., AAAI, 2015; Amid et al., NeurIPS, 2019; Wang et al., ICCV, 2019; Zhang et al., NeurIPS, 2018

All of them share two critical issues: hyperparameter tuning and non-convexity.

SLIDE 9
  • Deep Learning
  • Robust
  • Meta-learning
SLIDE 10

Training Data vs. Validation Data

Hyper-parameter tuning by validation data (grid search):

Training loss (inner problem):
$$w^*(\Theta) = \operatorname*{argmin}_{w} \frac{1}{N}\sum_{i=1}^{N} L_i^{\mathrm{train}}(w;\Theta)$$

Validation loss (outer problem):
$$\Theta^* \approx \operatorname*{argmin}_{\Theta \in \{\Theta_1,\Theta_2,\cdots,\Theta_t\}} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{val}}\big(w^*(\Theta)\big)$$
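A minimal sketch of this grid-search recipe on a toy ridge-regression problem (the data, `train`, and `val_loss` here are illustrative stand-ins, not the talk's deep models): the inner problem is solved in closed form, and the outer problem is a scan over a hand-picked grid of candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear model with noise, split into training and validation sets.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def train(theta):
    """Inner problem: w*(theta) = argmin_w ||X_tr w - y_tr||^2 + theta ||w||^2."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + theta * np.eye(d), X_tr.T @ y_tr)

def val_loss(w):
    """Outer objective: (1/M) sum_i L_i^val(w*(theta)) on held-out data."""
    return np.mean((X_val @ w - y_val) ** 2)

# Grid search: evaluate every candidate and keep the best one.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_theta = min(grid, key=lambda t: val_loss(train(t)))
print(best_theta, val_loss(train(best_theta)))
```

Note that Θ only ever takes values from the hand-picked grid, which is exactly the "search instead of optimization" pattern criticized on the next slide.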

SLIDE 11

Training Data vs. Validation Data

Hyper-parameter tuning: by validation data

Training loss / validation loss: the same bilevel grid search as before,

$$\Theta^* \approx \operatorname*{argmin}_{\Theta \in \{\Theta_1,\Theta_2,\cdots,\Theta_t\}} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{val}}\big(w^*(\Theta)\big),$$

which suffers from:
✓ Low efficiency
✓ Low accuracy
✓ Search instead of optimization
✓ Heuristic instead of intelligent

SLIDE 12
Intrinsic Functions of Validation Data

  • The function of validation data is higher-level than that of training data
    ➢ Hyper-parameter tuning vs. classifier parameter learning
    ➢ It adapts the model to the data at hand (from general to specific)
  • Validation data is different from training data!
    ➢ Teacher vs. student
    ➢ Ideal vs. real
    ➢ High quality vs. low quality
    ➢ Small scale vs. large scale
    ➢ Fixed vs. dynamic (relatively)
  • What should we do?
    ➢ Lower the threshold for collecting training data; raise the threshold for selecting validation data

SLIDE 13

From Validation Loss Searching to Meta Loss Training

✓ Optimization instead of search
✓ Intelligent instead of heuristic (partially)

Hyper-parameter tuning by meta data:

$$\Theta^* = \operatorname*{argmin}_{\Theta \in \mathcal{H}} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{meta}}\big(w^*(\Theta)\big), \qquad w^*(\Theta) = \operatorname*{argmin}_{w} \frac{1}{N}\sum_{i=1}^{N} L_i^{\mathrm{train}}(w;\Theta)$$

Here $\mathcal{H}$ is a continuous hyperparameter space, so $\Theta$ is optimized by gradients on the meta loss rather than searched over a grid.
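Once Θ lives in a continuous space, the meta loss can be minimized by gradient descent, e.g. by differentiating through a one-step look-ahead of the training update (one common approximation of $w^*(\Theta)$ in this literature). A minimal PyTorch sketch on the same kind of toy problem; all names and step sizes are illustrative, not the talk's method:

```python
import torch

torch.manual_seed(0)

# Toy regression data: a training split and a small meta (validation) split.
X_tr, y_tr = torch.randn(150, 10), torch.randn(150)
X_me, y_me = torch.randn(50, 10), torch.randn(50)

w = torch.zeros(10, requires_grad=True)          # model parameters
log_theta = torch.zeros((), requires_grad=True)  # hyperparameter: log L2 strength
alpha = 0.05                                     # inner (training) step size
opt_theta = torch.optim.SGD([log_theta], lr=0.05)

def train_loss(w, theta):
    return ((X_tr @ w - y_tr) ** 2).mean() + theta * (w ** 2).sum()

for step in range(200):
    # One-step look-ahead w_hat(theta); create_graph keeps the dependence
    # of w_hat on theta alive so the meta loss can backpropagate into it.
    g = torch.autograd.grad(train_loss(w, log_theta.exp()), w, create_graph=True)[0]
    w_hat = w - alpha * g

    # Outer step: gradient of the meta loss with respect to the hyperparameter.
    meta_loss = ((X_me @ w_hat - y_me) ** 2).mean()
    opt_theta.zero_grad()
    meta_loss.backward()
    opt_theta.step()

    # Actual training step under the freshly updated hyperparameter.
    g = torch.autograd.grad(train_loss(w, log_theta.exp().detach()), w)[0]
    with torch.no_grad():
        w -= alpha * g
```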

SLIDE 14

Many Recent Attempts

◆ Loss function.

Wu L, Tian F, Xia Y, et al. Learning to teach with dynamic loss functions. In NeurIPS, 2018: 6466-6477.
Huang C, Zhai S, Talbott W, et al. Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment. In ICML, 2019: 2891-2900.
Xu H, Zhang H, Hu Z, et al. AutoLoss: Learning Discrete Schedule for Alternate Optimization. In ICLR, 2019.
Li C, Yuan X, Lin C, et al. AM-LFS: AutoML for Loss Function Search. In ICCV, 2019: 8410-8419.
Grabocka J, Scholz R, Schmidt-Thieme L. Learning Surrogate Losses. arXiv:1905.10108, 2019.

◆ Regularization.

Feng J, Simon N. Gradient-based regularization parameter selection for problems with nonsmooth penalty functions. Journal of Computational and Graphical Statistics, 2018, 27(2): 426-435.
Frecon J, Salzo S, Pontil M. Bilevel learning of the group lasso structure. In NeurIPS, 2018: 8301-8311.
Streeter M. Learning Optimal Linear Regularizers. In ICML, 2019: 5996-6004.

◆ Learner (NAS).

Zoph B, Le Q V. Neural architecture search with reinforcement learning. In ICLR, 2017.
Baker B, Gupta O, Naik N, et al. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
Pham H, Guan M, Zoph B, et al. Efficient Neural Architecture Search via Parameter Sharing. In ICML, 2018: 4092-4101.
Zoph B, Vasudevan V, Shlens J, et al. Learning transferable architectures for scalable image recognition. In CVPR, 2018: 8697-8710.
Liu H, Simonyan K, Yang Y. DARTS: Differentiable architecture search. In ICLR, 2019.
Xie S, Zheng H, Liu C, et al. SNAS: Stochastic neural architecture search. In ICLR, 2019.
Liu C, Zoph B, Neumann M, et al. Progressive neural architecture search. In ECCV, 2018: 19-34.

SLIDE 15

Many Recent Attempts

◆ Hyper-parameters learning.

Maclaurin D, Duvenaud D, Adams R. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015: 2113-2122.
Pedregosa F. Hyperparameter optimization with approximate gradient. In ICML, 2016: 737-746.
Luketina J, Berglund M, Greff K, et al. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, 2016: 2952-2960.
Franceschi L, Donini M, Frasconi P, et al. Forward and reverse gradient-based hyperparameter optimization. In ICML, 2017: 1165-1173.
Franceschi L, Frasconi P, Salzo S, et al. Bilevel Programming for Hyperparameter Optimization and Meta-Learning. In ICML, 2018: 1563-1572.

◆ Gradients and learning rate.

Andrychowicz M, Denil M, Gomez S, et al. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.
Baydin A G, Cornish R, Rubio D M, et al. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.
Jacobsen A, Schlegel M, Linke C, et al. Meta-descent for Online, Continual Prediction. In AAAI, 2019.
Metz L, et al. Understanding and correcting pathologies in the training of learned optimizers. In ICML, 2019: 4556-4565.
Xu Z, Dai A M, Kemp J, et al. Learning an Adaptive Learning Rate Schedule. arXiv:1909.09712, 2019.

◆ Sample reweighting.

Jiang L, Zhou Z, Leung T, et al. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML, 2018: 2309-2318.
Ren M, Zeng W, Yang B, et al. Learning to Reweight Examples for Robust Deep Learning. In ICML, 2018: 4331-4340.
Shu J, Xie Q, Yi L, et al. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, 2019.
Zhao S, Fard M M, Narasimhan H, et al. Metric-Optimized Example Weights. In ICML, 2019: 7533-7542.

SLIDE 16
  • Deep Learning
  • Robust
  • Meta-learning
SLIDE 17

Adaptively Learning the Robust Loss

  • Generalized Cross Entropy
  • Symmetric Cross Entropy
  • Bi-Tempered Logistic Loss
  • Polynomial Soft Weighting Loss

Zhao et al., AAAI, 2015; Amid et al., NeurIPS, 2019; Wang et al., ICCV, 2019; Zhang et al., NeurIPS, 2018

Each of these losses carries hyperparameters, which can themselves be learned adaptively.
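As a concrete instance of such a hyperparameterized robust loss, here is a sketch of the Generalized Cross Entropy loss of Zhang & Sabuncu (NeurIPS 2018), whose single hyperparameter q interpolates between cross entropy (q → 0) and MAE (q = 1); q is precisely the kind of quantity the talk proposes to learn from meta data rather than tune by hand:

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """GCE loss (Zhang & Sabuncu, 2018): mean of (1 - p_y^q) / q.

    q -> 0 recovers cross entropy; q = 1 gives the noise-robust MAE.
    """
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-12) ** q) / q).mean()

# Usage: logits from any classifier, integer class targets.
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = generalized_cross_entropy(logits, targets, q=0.7)
```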

SLIDE 18

Hyperparameter Learning by Meta Learning

Training loss and meta loss: the same bilevel formulation as above, with the robust loss's hyperparameters playing the role of Θ.

Shu, et al., submitted, 2019

SLIDE 19

Experimental Results

Shu, et al., submitted, 2019

SLIDE 20

Experimental Results

✓ The hyper-parameter adaptively learned by meta-learning is actually not the optimal one for the original loss with a hyper-parameter fixed throughout the iterations.
✓ Meta-learning adaptively finds a proper hyper-parameter and simultaneously explores a good initialization of the network parameters under the current hyper-parameter, in a dynamic way.
✓ Such an adaptive learning manner should be more suitable for obtaining optimal values of both simultaneously, rather than updating one while keeping the other fixed.

Shu, et al., submitted, 2019

SLIDE 21

What if the Model Contains a Large Number of Hyperparameters?

➢ Overfitting easily occurs (as in conventional machine learning)
➢ How can we alleviate this issue?
➢ Build a parametric prior representation for the hyperparameters, neither too large nor too small (as in conventional machine learning)
➢ Learner vs. meta-learner
➢ This requires deeply understanding the data as well as the learning problem!

✓ Multi-view learning, multi-task learning (parameters: similar)
✓ Subspace learning (matrix: low rank)

SLIDE 22

What if the Model Contains a Large Number of Hyperparameters?

SLIDE 23
  • Deep Learning
  • Robust
  • Meta-learning
SLIDE 24

Deep Learning with Training Data Bias

Problem: big data often come with noisy labels or class imbalance.

SLIDE 25

Deep Networks Tend to Overfit the Training Data!

Zhang et al. (2017) found that deep neural networks easily fit (memorize) random labels.

Zhang C, Bengio S, Hardt M, et al. Understanding deep learning requires rethinking generalization. ICLR, 2017 (best paper).
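The experiment behind this finding is easy to reproduce in spirit: corrupt the labels and watch training accuracy still approach 100% while test accuracy collapses. A small sketch of the label-corruption step (an illustrative helper, not the paper's code):

```python
import torch

def corrupt_labels(labels, noise_rate, num_classes, seed=0):
    """Replace a fraction `noise_rate` of labels with uniformly random
    classes, mimicking the random-label setting of Zhang et al. (2017)."""
    g = torch.Generator().manual_seed(seed)
    labels = labels.clone()
    mask = torch.rand(len(labels), generator=g) < noise_rate
    labels[mask] = torch.randint(0, num_classes, (int(mask.sum()),), generator=g)
    return labels
```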

SLIDE 26

How can we robustly train deep networks on biased training data so as to improve generalization performance?

SLIDE 27

Related work: Learning with Training Data Bias

◆Sample weighting methods

✓ Dataset resampling (Chawla et al., 2002)
✓ Instance re-weighting (Zadrozny, 2004)
✓ AdaBoost (Freund & Schapire, 1997)
✓ Hard example mining (Malisiewicz et al., 2011)
✓ Focal loss (Lin et al., 2018)
✓ Self-paced learning (Kumar et al., 2010)
✓ Iterative reweighting (De la Torre & Black, 2003; Zhang & Sabuncu, 2018)
✓ Prediction variance (Chang et al., 2017)

◆Meta learning methods

✓ FWL (Dehghani et al., 2018)
✓ Learning to teach (Fan et al., 2018; Wu et al., 2018)
✓ MentorNet (Jiang et al., 2018)
✓ L2RW (Ren et al., 2018)

◆Other methods

✓ GLC (Hendrycks et al., 2018)
✓ Reed (Reed et al., 2015)
✓ Co-teaching (Han et al., 2018)
✓ D2L (Ma et al., 2018)
✓ S-Model (Goldberger & Ben-Reuven, 2017)

SLIDE 28

Sample weighting methods

Existing studies hand-design the curriculum as a fixed weighting function for specific tasks, with extra hyper-parameters to set.

| Strategy | Regularizer $G(v;\lambda)$ | Weight $v^*$ |
| --- | --- | --- |
| Self-paced [Kumar et al., NIPS 2010] | $-\lambda\lVert v\rVert_1$ | $v^* = \mathbb{I}(\ell_i \le \lambda)$ |
| Linear weighting [Jiang et al., AAAI 2015] | $\frac{\lambda}{2}\sum_{i=1}^{n}(v_i^2 - 2v_i)$ | $v^* = \max(0,\, 1 - \ell_i/\lambda)$ |
| Focal loss [Lin et al., ICCV 2017] | $-$ | $v^* = (1 - \exp(-\ell_i))^{\gamma}$ |
| Hard example mining [Malisiewicz et al., ICCV 2011] | $-$ | $v^* = \mathbb{I}(\ell_i > \lambda)$ |
| Prediction variance [Chang et al., NIPS 2017] | $-$ | $v^* \propto \widehat{\mathrm{Var}}(\ell_i)$ |

(Here $\ell_i$ is the training loss of sample $i$, and $\lambda$, $\gamma$ are hyper-parameters set by hand.)
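Written as code, these hand-crafted rules are just fixed mappings from a sample's training loss to its weight; a sketch of the loss-based forms in the table (for the focal form, p = exp(-loss) when the loss is cross entropy):

```python
import torch

def self_paced_weight(loss, lam):
    """Self-paced (Kumar et al., 2010): keep samples with loss <= lam."""
    return (loss <= lam).float()

def linear_weight(loss, lam):
    """Linear soft weighting (Jiang et al., 2015): max(0, 1 - loss / lam)."""
    return (1.0 - loss / lam).clamp(min=0.0)

def focal_weight(loss, gamma=2.0):
    """Focal-style weight (Lin et al., 2017): (1 - p)^gamma = (1 - e^{-loss})^gamma."""
    return (1.0 - torch.exp(-loss)) ** gamma

def hard_example_weight(loss, lam):
    """Hard example mining (Malisiewicz et al., 2011): keep high-loss samples."""
    return (loss > lam).float()

# Every rule fixes the *form* of the mapping and its hyperparameter by hand.
losses = torch.tensor([0.1, 0.5, 1.0, 3.0])
print(self_paced_weight(losses, lam=1.0), focal_weight(losses))
```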

SLIDE 29

Sample weighting methods

(Same weighting-scheme table as Slide 28.)

⚫ Need to pre-specify the form of the weighting function
⚫ Need to manually set its hyper-parameters

SLIDE 30

Meta Data and Meta Loss

(Figure: meta data vs. training data.)

SLIDE 31

L2RW [Ren et al., ICML 2018]

Directly learning weights from training and meta data

SLIDE 32

Meta Data and Meta Loss

(Figure: training data feed the training loss and meta data feed the meta loss, connected through the input structure.)

SLIDE 33

MentorNet [Jiang et al., ICML 2018]

The meta-learner is complex and hard to reproduce: a very complex input and a very complex Θ.

SLIDE 34

Our work

Meta-Weight-Net

Input: the per-sample training loss; Θ: a small MLP.
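A sketch of this meta-learner in PyTorch, following the architecture described in the NeurIPS 2019 paper (a single hidden layer of 100 ReLU units and a sigmoid output mapping each per-sample loss to a weight in (0, 1)); treat the sizes as the paper's defaults rather than something definitive:

```python
import torch
import torch.nn as nn

class MetaWeightNet(nn.Module):
    """Meta-Weight-Net (Shu et al., NeurIPS 2019), sketched: an MLP that
    maps each sample's training loss to a weight in (0, 1)."""

    def __init__(self, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, per_sample_loss):
        # per_sample_loss: shape (batch,) -> weights: shape (batch,)
        return self.net(per_sample_loss.unsqueeze(1)).squeeze(1)
```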

SLIDE 35

Our work

Notation:
◆ Θ: parameters of the teacher (Meta-Weight-Net $\mathcal{V}(\cdot;\Theta)$, mapping a sample's loss to its weight)
◆ w: parameters of the student (the classifier)

Inner loop (student):
$$w^*(\Theta) = \operatorname*{argmin}_{w} \frac{1}{N}\sum_{i=1}^{N} \mathcal{V}\big(L_i^{\mathrm{train}}(w);\Theta\big)\, L_i^{\mathrm{train}}(w)$$

Outer loop (teacher):
$$\Theta^* = \operatorname*{argmin}_{\Theta} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{meta}}\big(w^*(\Theta)\big)$$

Meta-Weight-Net

Shu, et al., NeurIPS, 2019

SLIDE 36

Our work

Step 5 (pseudo-update of the student):
$$\hat{w}^{(t)}(\Theta) = w^{(t)} - \alpha\,\frac{1}{n}\sum_{i=1}^{n} \mathcal{V}\big(L_i^{\mathrm{train}}(w^{(t)});\Theta\big)\,\nabla_{w} L_i^{\mathrm{train}}(w)\Big|_{w^{(t)}}$$

Step 6 (update the teacher on the meta loss):
$$\Theta^{(t+1)} = \Theta^{(t)} - \beta\,\frac{1}{m}\sum_{i=1}^{m} \nabla_{\Theta} L_i^{\mathrm{meta}}\big(\hat{w}^{(t)}(\Theta)\big)\Big|_{\Theta^{(t)}}$$

Step 7 (actual update of the student):
$$w^{(t+1)} = w^{(t)} - \alpha\,\frac{1}{n}\sum_{i=1}^{n} \mathcal{V}\big(L_i^{\mathrm{train}}(w^{(t)});\Theta^{(t+1)}\big)\,\nabla_{w} L_i^{\mathrm{train}}(w)\Big|_{w^{(t)}}$$

Shu, et al., NeurIPS, 2019
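Steps 5-7 can be sketched in PyTorch on a simple linear classifier (this assumes the `MetaWeightNet` sketch from Slide 34; `per_sample_loss`, `mwnet_step`, and the step sizes are illustrative, not the official implementation):

```python
import torch
import torch.nn.functional as F

def per_sample_loss(w, X, y):
    # Linear classifier: logits = X @ w; per-sample cross entropy.
    return F.cross_entropy(X @ w, y, reduction="none")

def mwnet_step(w, vnet, opt_vnet, X_tr, y_tr, X_me, y_me, alpha=0.1):
    # Step 5: pseudo-update w_hat(Theta). The weights depend on Theta, and
    # create_graph keeps that dependence alive inside w_hat.
    losses = per_sample_loss(w, X_tr, y_tr)
    weighted = (vnet(losses.detach()) * losses).mean()
    g = torch.autograd.grad(weighted, w, create_graph=True)[0]
    w_hat = w - alpha * g

    # Step 6: update Theta by the meta loss of the pseudo-updated model.
    meta_loss = per_sample_loss(w_hat, X_me, y_me).mean()
    opt_vnet.zero_grad()
    meta_loss.backward()
    opt_vnet.step()

    # Step 7: actual update of w with weights from the new Theta.
    losses = per_sample_loss(w, X_tr, y_tr)
    with torch.no_grad():
        weights = vnet(losses)
    g = torch.autograd.grad((weights * losses).mean(), w)[0]
    with torch.no_grad():
        w -= alpha * g

# Usage sketch:
#   w = torch.zeros(10, 3, requires_grad=True)   # 10 features, 3 classes
#   vnet = MetaWeightNet(); opt_vnet = torch.optim.Adam(vnet.parameters(), 1e-3)
#   mwnet_step(w, vnet, opt_vnet, X_tr, y_tr, X_me, y_me)
```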

SLIDE 37

Our work

Shu, et al., NeurIPS, 2019

SLIDE 38

Experiments

SLIDE 39

Experimental Setup: Class Imbalance

Datasets: CIFAR-10 & CIFAR-100

Shu, et al., NeurIPS, 2019
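The long-tailed versions of these datasets used in this line of work are typically built by exponentially decaying the per-class sample counts down to an "imbalance factor" (the ratio between the most and least frequent classes); a sketch of that protocol as I understand it, not the paper's exact script:

```python
def long_tailed_counts(n_per_class, num_classes, imbalance_factor):
    """Per-class sample counts for a long-tailed split: class c keeps
    n_per_class * mu**c samples, with mu chosen so the rarest class has
    n_per_class / imbalance_factor samples."""
    mu = (1.0 / imbalance_factor) ** (1.0 / (num_classes - 1))
    return [round(n_per_class * mu ** c) for c in range(num_classes)]

# e.g. CIFAR-10 (5000 images per class) at imbalance factor 100:
print(long_tailed_counts(5000, 10, 100))  # [5000, 2997, ..., 50]
```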

SLIDE 40

Experimental Setup: Noisy Label

Datasets: CIFAR-10 & CIFAR-100

Shu, et al., NeurIPS, 2019

SLIDE 41

Stability Analysis of Meta-Weight-Net

Shu, et al., NeurIPS, 2019

SLIDE 42

Real Data Experiment

Shu, et al., NeurIPS, 2019

SLIDE 43

Insight: Adaptively Learn the Weight Function

Shu, et al., NeurIPS, 2019

SLIDE 44

Future research

◆ Extension to other semi-/weakly-supervised learning problems
◆ Further improvements to Meta-Weight-Net
◆ Multi-view learning, ensemble learning, domain adaptation
◆ General hyper-parameter learning (meta-learner design)

SLIDE 45

Jun Shu, Qian Zhao, Keyu Chen, Zongben Xu, Deyu Meng. Learning Adaptive Loss for Robust Learning with Noisy Labels. arXiv:2002.06482, 2020.
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. NeurIPS, 2019.

SLIDE 46