SLIDE 1
SLIDE 1


Deep multi-task learning with evolving weights

ESANN 2016

Soufiane Belharbi Romain Hérault Clément Chatelain Sébastien Adam

soufiane.belharbi@insa-rouen.fr

LITIS lab., DocApp team - INSA de Rouen, France
27 April, 2016

SLIDE 2

Context

Training deep neural networks

Deep neural networks are interesting models (complex/hierarchical features, complex mappings) ⇒ improved performance.
Training deep neural networks is difficult:
⇒ Vanishing gradient
⇒ More parameters ⇒ more data needed
Some solutions:
⇒ The pre-training technique [Y. Bengio et al. 06, G.E. Hinton et al. 06]
⇒ Use unlabeled data



SLIDE 4

Context

Semi-supervised learning

General case: Data = { labeled data (expensive in money and time, scarce), unlabeled data (cheap, abundant) }.
E.g., medical images.
⇒ Semi-supervised learning: exploit the unlabeled data to improve generalization.



SLIDE 6

Context

Pre-training and semi-supervised learning

The pre-training technique can exploit the unlabeled data.
It is sequential transfer learning, performed in 2 steps:
1. Unsupervised task (x: labeled and unlabeled data)
2. Supervised task ((x, y): labeled data)


SLIDE 7

Pre-training technique and semi-supervised learning

Layer-wise pre-training: auto-encoders

[Figure: the DNN to train, inputs x1…x5, outputs ŷ1, ŷ2]


SLIDES 8–14

Pre-training technique and semi-supervised learning

Layer-wise pre-training: auto-encoders

1) Step 1: Unsupervised layer-wise training

Train layer by layer sequentially, using only x (labeled or unlabeled): an auto-encoder first reconstructs the input (x → x̂) to learn the first hidden layer h1; a second auto-encoder is then trained on h1 (h1 → ĥ1) to learn h2, and so on up the stack.

[Figures: animation of the greedy layer-wise construction, from the inputs x1…x5 through the hidden layers h1, h2, h3]

At each layer:
⇒ When to stop training?
⇒ Which hyper-parameters to use?
⇒ How to make sure that the training improves the supervised task?

SLIDE 15

Pre-training technique and semi-supervised learning

Layer-wise pre-training: auto-encoders

[Figure: the full network, inputs x1…x5, outputs ŷ1, ŷ2]


2) Step 2: Supervised training

Train the whole network using (x, y)

Back-propagation
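To make the two-step procedure concrete, the sketch below shows greedy layer-wise auto-encoder pre-training (step 1) followed by supervised fine-tuning by back-propagation (step 2). It is a minimal PyTorch illustration written for this summary, not the authors' code (the paper predates PyTorch); `hidden_sizes`, `x_all`, `x_lab` and `y_lab` are hypothetical placeholders.

```python
import torch
import torch.nn as nn

def pretrain_then_finetune(x_all, x_lab, y_lab, hidden_sizes, n_classes,
                           epochs=50, lr=0.01):
    """Step 1: greedy layer-wise auto-encoder pre-training on all x.
    Step 2: supervised fine-tuning of the whole stack on (x, y)."""
    layers, h = [], x_all
    in_dim = x_all.shape[1]
    for out_dim in hidden_sizes:
        enc = nn.Linear(in_dim, out_dim)
        dec = nn.Linear(out_dim, in_dim)      # reconstruction head, discarded later
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):               # train this layer as an auto-encoder
            opt.zero_grad()
            loss = nn.functional.mse_loss(dec(torch.sigmoid(enc(h))), h)
            loss.backward()
            opt.step()
        h = torch.sigmoid(enc(h)).detach()    # fixed input for the next layer
        layers += [enc, nn.Sigmoid()]
        in_dim = out_dim
    net = nn.Sequential(*layers, nn.Linear(in_dim, n_classes))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(epochs):                   # step 2: back-propagation on labels
        opt.zero_grad()
        nn.functional.cross_entropy(net(x_lab), y_lab).backward()
        opt.step()
    return net
```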

SLIDE 16

Pre-training technique and semi-supervised learning

Pre-training technique: Pros and cons

Pros:
• Improves generalization
• Can exploit unlabeled data
• Provides a better initialization than random
• Allows training deep networks ⇒ circumvents the vanishing-gradient problem

Cons:
• Adds more hyper-parameters
• No good stopping criterion during the pre-training phase: a good criterion for the unsupervised task may not be good for the supervised task



SLIDE 18

Pre-training technique and semi-supervised learning

Proposed solution

Why is it difficult in practice? ⇒ It is sequential transfer learning.
Possible solution: ⇒ parallel transfer learning.
Why in parallel?
• Interaction between the tasks
• Fewer hyper-parameters to tune
• A single stopping criterion



SLIDE 21

Proposed approach

Parallel transfer learning: Tasks combination

Train cost = supervised task + unsupervised (reconstruction) task.

l labeled samples, u unlabeled samples, wsh: shared parameters.

Reconstruction (auto-encoder) task:
Jr(D; w′ = {wsh, wr}) = Σ_{i=1}^{l+u} Cr(R(xi; w′), xi) .

Supervised task:
Js(D; w = {wsh, ws}) = Σ_{i=1}^{l} Cs(M(xi; w), yi) .

Weighted task combination:
J(D; {wsh, ws, wr}) = λs · Js(D; {wsh, ws}) + λr · Jr(D; {wsh, wr}) .

λs, λr ∈ [0, 1]: importance weights, λs + λr = 1.
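As a reading aid, here is a minimal sketch of how J could be computed with a shared encoder (wsh), a supervised head M (ws) and a reconstruction head R (wr). The module sizes and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 300), nn.Sigmoid())  # w_sh: shared parameters
sup_head = nn.Linear(300, 10)                               # w_s:  M(x; w)
rec_head = nn.Linear(300, 784)                              # w_r:  R(x; w')

def combined_cost(x_lab, y_lab, x_all, lam_s, lam_r):
    """J = lam_s * Js + lam_r * Jr, with lam_s + lam_r = 1."""
    # Js: supervised cost Cs on the l labeled samples only
    j_s = nn.functional.cross_entropy(sup_head(encoder(x_lab)), y_lab)
    # Jr: reconstruction cost Cr on all l + u samples
    j_r = nn.functional.mse_loss(rec_head(encoder(x_all)), x_all)
    return lam_s * j_s + lam_r * j_r
```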



SLIDE 25

Proposed approach

Tasks combination with evolving weights

Weighted task combination:
J(D; {wsh, ws, wr}) = λs · Js(D; {wsh, ws}) + λr · Jr(D; {wsh, wr}) .

λs, λr ∈ [0, 1]: importance weights, λs + λr = 1.

Problems:
• How to fix λs, λr?
• At the end of training, only Js should matter.

Task combination with evolving weights (our contribution):
J(D; {wsh, ws, wr}) = λs(t) · Js(D; {wsh, ws}) + λr(t) · Jr(D; {wsh, wr}) .

t: learning epochs; λs(t), λr(t) ∈ [0, 1]: importance weights, λs(t) + λr(t) = 1.



SLIDE 28

Proposed approach

Tasks combination with evolving weights

J (D; {wsh, ws, wr}) = λs(t)·Js(D; {wsh, ws})+λr(t)·Jr(D; {wsh, wr}) .

Exponential schedule:
λr(t) = exp(−t / σ), σ: the slope; λs(t) = 1 − λr(t).

[Figure: the importance weights over the train epochs t: λr(t) decays from 1 toward 0 while λs(t) grows from 0 toward 1]
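In code, the exponential schedule is a one-liner; this small sketch (assuming σ = 40, the `exp40` setting used later in the experiments) is illustrative only.

```python
import math

def lambda_r(t, sigma=40.0):
    """Reconstruction weight: decays exponentially with the epoch t."""
    return math.exp(-t / sigma)

def lambda_s(t, sigma=40.0):
    """Supervised weight: lambda_s + lambda_r = 1 at every epoch."""
    return 1.0 - lambda_r(t, sigma)

# e.g. at t = 0: (lambda_r, lambda_s) = (1.0, 0.0); at t = 200: (~0.007, ~0.993)
```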

SLIDE 29

Proposed approach

Tasks combination with evolving weights: Optimization

Algorithm 1: Training our model for one epoch

1: D is the shuffled training set, B a mini-batch.
2: for B in D do
3:   Make a gradient step toward Jr using B (update w′)
4:   Bs ⇐ labeled examples of B
5:   Make a gradient step toward Js using Bs (update w)
6: end for

[R.Caruana 97, J.Weston 08, R.Collobert 08, Z.Zhang 15]
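A hedged sketch of Algorithm 1, reusing the `encoder`/`sup_head`/`rec_head` modules and the `lambda_r`/`lambda_s` schedule sketched above; the `(x, y, is_labeled)` batch format and the loss scaling by the evolving weights are assumptions of this write-up. Each mini-batch gets one gradient step toward Jr on all of its examples, then one toward Js on its labeled subset.

```python
import torch

opt_r = torch.optim.SGD(list(encoder.parameters()) + list(rec_head.parameters()), lr=0.01)
opt_s = torch.optim.SGD(list(encoder.parameters()) + list(sup_head.parameters()), lr=0.01)

def train_one_epoch(loader, t):
    """One epoch of Algorithm 1; `loader` yields shuffled (x, y, is_labeled) mini-batches."""
    for x, y, is_labeled in loader:
        # 3: gradient step toward Jr using the whole batch B (updates w' = {w_sh, w_r})
        opt_r.zero_grad()
        (lambda_r(t) * torch.nn.functional.mse_loss(rec_head(encoder(x)), x)).backward()
        opt_r.step()
        # 4: Bs <= labeled examples of B
        xs, ys = x[is_labeled], y[is_labeled]
        if len(xs) == 0:
            continue
        # 5: gradient step toward Js using Bs (updates w = {w_sh, w_s})
        opt_s.zero_grad()
        (lambda_s(t) * torch.nn.functional.cross_entropy(sup_head(encoder(xs)), ys)).backward()
        opt_s.step()
```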


SLIDE 30

Results

Experimental protocol

Objective: compare training DNNs using different approaches:
• No pre-training (baseline)
• With pre-training (stairs schedule)
• Parallel transfer learning (proposed approach)

Studied evolving-weight schedules:
1. Stairs (pre-training)
2. Linear
3. Linear until t1
4. Exponential

[Figure: the importance weights λr and λs vs. train epochs t for each schedule]


SLIDE 31

Results

Experimental protocol

Task: classification (MNIST)
Number of hidden layers K: 1, 2, 3, 4.
Optimization:
• Epochs: 5000
• Batch size: 600
• Options: no regularization, no adaptive learning rate
Hyper-parameters of the evolving schedules:
• t1: 100
• σ: 40
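Below is a sketch of the four studied schedules for λr(t) under these settings (t1 = 100, σ = 40, 5000 epochs). Mapping the legend labels `lin` and `lin100` to the two linear variants is my reading of the slides, and the function names are invented for illustration; in every case λs(t) = 1 − λr(t).

```python
import math

T1, SIGMA, EPOCHS = 100, 40.0, 5000

def stairs(t):
    # classic pre-training: reconstruction only, then supervision only
    return 1.0 if t < T1 else 0.0

def lin(t):
    # linear decay of the reconstruction weight over the whole training
    return max(0.0, 1.0 - t / EPOCHS)

def lin100(t):
    # linear decay reaching 0 at t1 = 100 (hypothetical reading of 'lin100')
    return max(0.0, 1.0 - t / T1)

def exp40(t):
    # exponential decay with slope sigma = 40
    return math.exp(-t / SIGMA)
```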


SLIDE 32

Results

Shallow networks: (K = 1, l = 1E2)

[Figure: evaluation of the evolving-weight schedules (baseline, stairs100, lin100, lin, exp40) for l = 100, K = 1: MNIST test classification error (%), from 28.0 to 32.5, vs. size of unlabeled data u, from 1E3 to 49900]


SLIDE 33

Results

Shallow networks: (K = 1, l = 1E3)

[Figure: evaluation of the evolving-weight schedules (baseline, stairs100, lin100, lin, exp40) for l = 1000, K = 1: MNIST test classification error (%), from 11.0 to 14.5, vs. size of unlabeled data u, from 1E3 to 49900]


SLIDE 34

Results

Deep networks: exponential schedule (l = 1E3)

[Figure: evaluation of the exp40 evolving-weight schedule for l = 1000 and K = 2, 3, 4: MNIST test classification error (%), from 10.0 to 13.0, vs. size of unlabeled data u, from 1E3 to 49900]


SLIDE 35

Conclusion and perspectives

Conclusion

• An alternative method to pre-training: parallel transfer learning with evolving weights.
• Improves generalization easily.
• Reduces the number of hyper-parameters (t1, σ).


SLIDE 36

Conclusion and perspectives

Perspectives

• Evolve the importance weights according to the train/validation error.
• Explore other evolving schedules (toward an automatic schedule).
• Adjust the learning rate: AdaDelta, AdaGrad, RMSProp.


SLIDE 37

Conclusion and perspectives

Questions

Thank you for your attention. Questions?
