

SLIDE 1

Robust model training and generalisation with Studentising flows

Simon Alexanderson Gustav Eje Henter

{simonal,ghe}@kth.se

Division of Speech, Music and Hearing (TMH), School of Electrical Engineering and Computer Science (EECS), KTH Royal Institute of Technology, Stockholm, Sweden

2020-07-11

Alexanderson & Henter (KTH) Robustness through Studentising flows 2020-07-11 1 / 11

SLIDE 2

One-slide summary

  • We propose replacing Gaussian base distributions Z in normalising flows with multivariate Student's t-distributions: "Studentising flows"
  • Our proposal is motivated through statistical robustness
  • Experiments show that the proposal stabilises training and leads to better generalisation
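As a minimal sketch of the core idea (a hypothetical one-layer affine flow in NumPy/SciPy, not the paper's Glow models), only the base log-density term of the flow NLL changes when the Gaussian base is swapped for a Student's t:

```python
import numpy as np
from scipy import stats

def flow_nll(x, mu, log_sigma, nu=None):
    """NLL of a toy one-layer affine flow: z = (x - mu) * exp(-log_sigma).

    With nu=None the base distribution is standard normal; otherwise it is
    Student's t with nu degrees of freedom ("Studentising" the flow).
    """
    z = (x - mu) * np.exp(-log_sigma)
    base = stats.norm if nu is None else stats.t(df=nu)
    # log p(x) = log p_base(z) + log|dz/dx|; the NLL is its negative
    return -(base.logpdf(z) - log_sigma)

x_out = 10.0  # an outlier under a standard fit (mu = 0, sigma = 1)
nll_gauss = flow_nll(x_out, 0.0, 0.0)
nll_t = flow_nll(x_out, 0.0, 0.0, nu=2.0)
print(nll_gauss, nll_t)  # the t base penalises the outlier far less
```

As ν → ∞ the t base recovers the Gaussian, so ν acts as a robustness knob.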


SLIDE 3

Outline

  • What is robustness?
  • Robustness sits in the tails
  • Tails of flow-based models
  • Experimental findings


SLIDE 4

Why do we need robustness?

Generate some 1D standard normal data and fit a Gaussian:

[Figure: 1D data points x with the fitted Gaussian density p(x)]


SLIDE 6

Why do we need robustness?

The fit changes if we add an outlying datapoint (red blob).

[Figure: data with an outlier and the fitted Gaussian density p(x)]


SLIDE 8

Why do we need robustness?

A fitted Student's t-distribution (red plot) is more concentrated.

[Figure: Gauss. and t (ν = 1.5) fits to the data with an outlier]

SLIDE 9

Why do we need robustness?

As the outlier is moved away, the Gaussian fit changes a lot.

[Figure: Gauss. and t (ν = 1.5) fits as the outlier is moved away]


SLIDE 15

Why do we need robustness?

In contrast, the Student's t-distribution is statistically robust.

[Figure: Gauss. and t (ν = 1.5) fits with the outlier far away]

SLIDE 16

Robust statistics

Robust (resistant) estimator: Adversarially corrupting a fraction η of the data (η < 1/2) only has a bounded effect on the estimated model parameters θ
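A classic illustration of this definition (using the sample mean vs. the median, not the paper's flow models): corrupting one datapoint out of 100 drags the Gaussian maximum-likelihood location estimate arbitrarily far, while a robust estimator barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(99)  # 99 clean standard-normal points

for outlier in (10.0, 100.0, 1000.0):
    corrupted = np.append(data, outlier)  # corrupt eta = 1% of the sample
    # The sample mean (Gaussian MLE of the location) tracks the outlier
    # without bound; the median (a robust estimator) is essentially unmoved.
    print(f"outlier={outlier:7.1f}  mean={corrupted.mean():8.3f}  "
          f"median={np.median(corrupted):6.3f}")
```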


SLIDE 17

Why is Student’s t robust?

The probability density functions of Gaussians and Student’s t-distributions look similar.

[Figure: probability density functions of Gauss., t (ν = 4), and t (ν = 15)]

SLIDE 18

Why is Student’s t robust?

The associated loss functions (the negative log-likelihood, or NLL) exhibit differences in the tails.

[Figure: NLL loss functions of Gauss., t (ν = 4), and t (ν = 15)]
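Up to additive constants, the per-datapoint losses behind these curves are (standard forms, written for a scalar standardised variable):

```latex
% Gaussian NLL: quadratic in x, so an outlier's loss grows fast
\mathrm{NLL}_{\mathcal{N}}(x) = \tfrac{1}{2} x^2 + \mathrm{const}
% Student's t NLL: only logarithmic in |x| for large |x|
\mathrm{NLL}_{t_\nu}(x) = \tfrac{\nu + 1}{2} \log\!\left(1 + \tfrac{x^2}{\nu}\right) + \mathrm{const}
```

The quadratic vs. logarithmic growth is exactly the tail difference visible in the plot.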

SLIDE 19

Why is Student’s t robust?

The influence function is the gradient of the NLL. It quantifies the effect of outliers. For the t-distribution, the influence function is bounded.

[Figure: influence functions of Gauss., t (ν = 4), and t (ν = 15)]
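For a standard location model the two influence functions have closed forms (a sketch under that assumption, with ψ(x) = d/dx NLL(x)):

```python
import numpy as np

def psi_gauss(x):
    # Gaussian NLL is x**2 / 2 + const, so its gradient is x: unbounded
    return x

def psi_t(x, nu):
    # t NLL is ((nu + 1) / 2) * log(1 + x**2 / nu) + const, so its
    # gradient is (nu + 1) * x / (nu + x**2): bounded in magnitude by
    # (nu + 1) / (2 * sqrt(nu)), attained at x = sqrt(nu)
    return (nu + 1) * x / (nu + x**2)

xs = np.array([1.0, 10.0, 100.0, 1000.0])
print(psi_gauss(xs))      # grows linearly with the outlier distance
print(psi_t(xs, nu=4.0))  # decays back towards 0 for distant outliers
```

Distant outliers thus have a vanishing, not just bounded, pull on the t-based gradient.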

SLIDE 20

Why is Student’s t robust?

Gradient clipping can also limit the influence of outliers, but need not converge on the maximum-likelihood model.

[Figure: influence functions of Gauss., t (ν = 4), t (ν = 15), and a clipped Gaussian]

SLIDE 21

Related work

Our findings complement those in concurrent work by Jaini et al. (2020)¹

  • They show:
    • Lipschitz-continuous triangular flows fθ(Z) with Gaussian base distributions Z cannot represent fat-tailed data
    • For example: Glow with sigmoid-transformed scale factors
    • Using multivariate tν-distributions allows modelling data with fat tails
  • We add to this:
    • The advantages of tν-distributions can be understood through statistical robustness
    • Experimentally, these benefits extend to bounded data (no fat tails)

¹ Jaini, P., Kobyzev, I., Yu, Y., and Brubaker, M. Tails of Lipschitz triangular flows. In Proc. ICML, 2020.


SLIDE 22

Stable training

Training loss of Glow models of 64×64 CelebA data trained using Adam. The red configuration is unstable.

[Figure: training loss vs. steps; legend: t (ν = 50), lr=1e-3; Gauss. no grad-clip, lr=1e-4; Gauss. w. grad-clip, lr=1e-3; Gauss. no grad-clip, lr=5e-4]

SLIDE 23

Stable training

Reducing the learning rate (yellow), clipping gradients (green), or changing the base to a multivariate tν-distribution (blue) stabilises training.

[Figure: training loss vs. steps; legend: t (ν = 50), lr=1e-3; Gauss. no grad-clip, lr=1e-4; Gauss. w. grad-clip, lr=1e-3; Gauss. no grad-clip, lr=5e-4]

SLIDE 24

Better generalisation on image data

Test set negative log-likelihood on MNIST with and without outliers from greyscale CIFAR-10. ν = ∞ is the Gaussian baseline.

  Train              |       Test: Clean          |     Test: 1% outliers
               ν =   |   ∞     20     50   1000   |   ∞     20     50   1000
  Clean        NLL   | 1.16   1.13   1.13   1.17  | 1.63   1.27   1.26   1.31
               ∆     |       −0.03  −0.03   0.01  |       −0.36  −0.37  −0.32
  1% outliers  NLL   | 1.17   1.13   1.14   1.18  | 1.21   1.18   1.19   1.22
               ∆     |       −0.04  −0.03   0.01  |       −0.03  −0.02   0.01


SLIDE 27

Better generalisation on more complex data

In probabilistic motion modelling, flow-based models are the current state of the art in terms of output quality. However, they are quite overfitted.

[Figure: loss vs. training steps for Gauss. and t (ν = 50); left panel: locomotion synthesis, right panel: gesture generation]

SLIDE 28

Better generalisation on more complex data

Studentising flows (yellow) perform equally well on training data but greatly reduce overfitting for locomotion and gesture-modelling tasks.

[Figure: loss vs. training steps for Gauss. and t (ν = 50); left panel: locomotion synthesis, right panel: gesture generation]

SLIDE 29

Please see our paper for more!

  • Additional experiments and results
  • Connections between:
  • Consistency and asymptotic efficiency
  • Statistical robustness
  • Machine-learning best practices
  • Code
