

SLIDE 1

Not All Samples Are Created Equal Deep Learning with Importance Sampling

Angelos Katharopoulos & François Fleuret, ICML, July 11, 2018



SLIDE 5

Evolution of gradient norms during training

[Figure: left, training loss vs. iterations for a small CNN on MNIST; right, gradient norm CDF. 85% of the samples have negligible gradient.]

A. Katharopoulos, Not All Samples Are Created Equal, 2/13

SLIDE 6

Related work

◮ Sample points proportionally to the gradient norm (Needell et al., 2014; Zhao and Zhang, 2015; Alain et al., 2015)
◮ SVRG-type methods (Johnson and Zhang, 2013; Defazio et al., 2014; Lei et al., 2017)
◮ Sample using the loss
  ◮ Hard/semi-hard sample mining (Schroff et al., 2015; Simo-Serra et al., 2015)
  ◮ Online batch selection (Loshchilov and Hutter, 2015)
  ◮ Prioritized experience replay (Schaul et al., 2015)




SLIDE 9

Contributions

◮ Derive a fast-to-compute importance distribution
◮ Variance cannot always be reduced, so start importance sampling only when it is useful
◮ Bonus: package everything in an embarrassingly simple-to-use library


SLIDE 10

Deriving the sampling distribution (1)

Similar to Zhao and Zhang (2015), we want to minimize the variance of the gradients:

$$P^* = \arg\min_P \operatorname{Tr}\big(\mathbb{V}_P[w_i G_i]\big) = \arg\min_P \mathbb{E}_P\big[w_i^2 \, \|G_i\|_2^2\big]$$

To simplify, we minimize an upper bound:

$$\|G_i\|_2 \le \hat{G}_i \;\Longrightarrow\; \min_P \mathbb{E}_P\big[w_i^2 \, \|G_i\|_2^2\big] \le \min_P \mathbb{E}_P\big[w_i^2 \, \hat{G}_i^2\big]$$
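As a quick numerical sanity check (our own illustration, not part of the talk): with the compensating weights w_i = 1/(B p_i) that keep the gradient estimate unbiased, sampling proportionally to the per-sample gradient norms never increases the second moment of the weighted estimator, and hence never increases its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 1024                        # pre-batch size
G = rng.lognormal(0.0, 1.5, B)  # synthetic per-sample gradient norms ||G_i||_2

def second_moment(p):
    """E_P[w_i^2 ||G_i||_2^2] with compensating weights w_i = 1/(B p_i).
    The variance differs from this only by the constant ||E[G]||^2."""
    w = 1.0 / (B * p)
    return float(np.sum(p * (w * G) ** 2))

uniform = np.full(B, 1.0 / B)
importance = G / G.sum()        # p_i proportional to ||G_i||_2, the optimal choice

# Gradient-norm sampling can only lower the second moment (Cauchy-Schwarz).
assert second_moment(importance) <= second_moment(uniform)
```

Algebraically, the uniform case gives the mean of the squared norms, while the optimal distribution gives the squared mean of the norms; the gap between them is exactly the variance reduction.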



SLIDE 14

Deriving the sampling distribution (2)

We show that the gradient norm of the parameters can be upper-bounded using the norm of the gradient with respect to the pre-activation outputs of the last layer. We conjecture that batch normalization and proper weight initialization make this bound tight.
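For the common case of a softmax output trained with cross-entropy, the gradient with respect to the last layer's pre-activations has the well-known closed form softmax(z_i) − onehot(y_i), so this importance score costs only a forward pass. A minimal NumPy sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def importance_scores(logits, labels):
    """Norm of the loss gradient w.r.t. the last layer's pre-activations,
    which for softmax + cross-entropy is softmax(z) - onehot(y)."""
    probs = softmax(logits)
    onehot = np.eye(logits.shape[1])[labels]
    return np.linalg.norm(probs - onehot, axis=1)  # one score per sample

# A confidently correct sample gets a near-zero score; a confidently wrong
# one gets a large score, so it would be sampled far more often.
logits = np.array([[10.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 1])
scores = importance_scores(logits, labels)
```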



SLIDE 16

Variance reduction achieved with our upper-bound

[Figure: empirical variance reduction vs. iterations on CIFAR-100, comparing uniform, loss, gradient-norm, and upper-bound (ours) sampling.]



SLIDE 18

Variance reduction achieved with our upper-bound

[Figure: empirical variance reduction vs. iterations on downsampled ImageNet, comparing uniform, loss, gradient-norm, and upper-bound (ours) sampling.]


SLIDE 19

Is the upper-bound enough to speed up training?

Not really, because

◮ a forward pass on the whole dataset is still prohibitive
◮ the importance distribution can be arbitrarily close to uniform

Two key ideas

◮ Sample a large batch (B) uniformly at random and resample a small batch (b) with importance
◮ Start importance sampling only when the variance will be reduced
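The first idea can be sketched in a few lines of NumPy (a schematic with made-up scores and names, not the library's implementation); the weights w_i = 1/(B p_i) keep the resulting gradient estimate unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_with_importance(scores, b):
    """Given importance scores for a uniformly drawn pre-batch of size B,
    pick b positions proportionally to the scores and return the
    bias-correcting weights w_i = 1/(B p_i)."""
    B = len(scores)
    p = scores / scores.sum()                     # importance distribution
    idx = rng.choice(B, size=b, replace=False, p=p)
    return idx, 1.0 / (B * p[idx])

# Toy pre-batch of B=4 scores where one sample dominates: it is (almost)
# always kept in the small batch, while its weight is shrunk accordingly.
scores = np.array([1e-6, 1e-6, 1e-6, 1.0])
idx, w = resample_with_importance(scores, b=2)
```

In the actual method the scores come from the upper bound of the previous slide, so the large batch costs one forward pass and only the small batch pays for the backward pass.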



SLIDE 21

When do we start importance sampling?

We start importance sampling when the variance reduction is large enough:

$$\operatorname{Tr}\big(\mathbb{V}_u[G_i]\big) - \operatorname{Tr}\big(\mathbb{V}_P[w_i G_i]\big) = \frac{1}{B}\Big(\sum_{i=1}^{B} \|G_i\|_2\Big)^{2} \sum_{i=1}^{B} (p_i - u)^2 \;\propto\; \sum_{i=1}^{B} (p_i - u)^2$$

where the sum $\sum_i (p_i - u)^2$ measures the distance of the importance distribution to the uniform one.

We show that the equivalent batch increment satisfies

$$\tau \ge \Big(1 - \frac{\sum_i (p_i - u)^2}{\sum_i p_i^2}\Big)^{-1},$$

which allows us to perform importance sampling when

$$\underbrace{B\, t_{\mathrm{forward}} + b\,(t_{\mathrm{forward}} + t_{\mathrm{backward}})}_{\text{time for an importance sampling iteration}} \;\le\; \underbrace{\tau\, b\,(t_{\mathrm{forward}} + t_{\mathrm{backward}})}_{\text{time for the equivalent uniform sampling iterations}}$$
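The switching rule reads directly off these formulas; here is a schematic NumPy sketch (function names and timing numbers are ours, for illustration only):

```python
import numpy as np

def importance_sampling_pays_off(p, B, b, t_forward, t_backward):
    """Compare the cost of one importance-sampling step (forward pass on B,
    then forward+backward on b) against tau equivalent uniform steps on b,
    where tau is the equivalent-batch-increment bound from the slide."""
    u = 1.0 / len(p)
    tau = 1.0 / (1.0 - np.sum((p - u) ** 2) / np.sum(p ** 2))
    t_importance = B * t_forward + b * (t_forward + t_backward)
    t_uniform = tau * b * (t_forward + t_backward)
    return t_importance <= t_uniform, tau

# Near-uniform scores: tau ~ 1, so the extra forward pass on B never pays off.
p_flat = np.full(1024, 1.0 / 1024)
ok_flat, tau_flat = importance_sampling_pays_off(p_flat, B=1024, b=128,
                                                 t_forward=1.0, t_backward=2.0)

# Highly peaked scores: tau is large, so importance sampling is worth it.
p_peaked = np.full(1024, 1e-6)
p_peaked[0] = 1.0 - p_peaked[1:].sum()
ok_peaked, tau_peaked = importance_sampling_pays_off(p_peaked, B=1024, b=128,
                                                     t_forward=1.0, t_backward=2.0)
```

Monitoring tau during training is cheap, since it depends only on the current importance distribution over the pre-batch.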


SLIDE 22

Experimental setup

◮ We fix a time budget for all methods and compare the achieved training loss and test error
◮ We evaluate on three tasks:
  1. WideResNets on CIFAR-10/100 (image classification)
  2. Pretrained ResNet50 on MIT67 (fine-tuning)
  3. LSTM on permuted MNIST (sequence classification)


SLIDE 26

Importance sampling for image classification

◮ SVRG methods do not work for deep learning
◮ Our loss-based sampling outperforms existing loss-based methods
◮ Training loss improves by 3× to 10× compared to uniform sampling

[Figure: training loss and test error relative to uniform on CIFAR-10 and CIFAR-100, comparing SVRG, Katyusha, SCSG, uniform, Loshchilov 2015, Schaul 2015, loss-based (ours), and our full method.]


SLIDE 27

Importance sampling for finetuning

◮ Earlier variance reduction leads to faster convergence

[Figure: training loss and test error vs. wall-clock seconds, comparing upper-bound (ours), uniform, and loss-based sampling.]


SLIDE 28

Thank you for your time!

Check out the code at http://github.com/idiap/importance-sampling.

    from importance_sampling import ImportanceTraining

    x, y = load_data()
    model = load_model()
    ImportanceTraining(model).fit(x, y, batch_size=128, epochs=10)


SLIDE 29

References I

Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.

Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1–9, 2015.

Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.


SLIDE 30

References II

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 118–126. IEEE, 2015.

SLIDE 31

References III

Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.