

SLIDE 1

Robust model training and generalisation with Studentising flows

Simon Alexanderson Gustav Eje Henter

{simonal,ghe}@kth.se

Division of Speech, Music and Hearing (TMH), School of Electrical Engineering and Computer Science (EECS), KTH Royal Institute of Technology, Stockholm, Sweden

2020-07-11

Alexanderson & Henter (KTH) Robustness through Studentising flows 2020-07-11 1 / 11

SLIDE 2

One-slide summary

  • We propose replacing Gaussian base distributions Z in normalising flows with multivariate Student's t-distributions: "Studentising flows"
  • Our proposal is motivated through statistical robustness
  • Experiments show that the proposal stabilises training and leads to better generalisation
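As a minimal sketch of the core idea (a hypothetical one-layer affine flow in NumPy/SciPy, not the paper's Glow models), only the base log-density term of the flow NLL changes when the Gaussian base is swapped for a Student's t:

```python
import numpy as np
from scipy import stats

def flow_nll(x, mu, log_sigma, nu=None):
    """NLL of a toy one-layer affine flow: z = (x - mu) * exp(-log_sigma).

    With nu=None the base distribution is standard normal; otherwise it is
    Student's t with nu degrees of freedom ("Studentising" the flow).
    """
    z = (x - mu) * np.exp(-log_sigma)
    base = stats.norm if nu is None else stats.t(df=nu)
    # log p(x) = log p_base(z) + log|dz/dx|; the NLL is its negative
    return -(base.logpdf(z) - log_sigma)

x_out = 10.0  # an outlier under a standard fit (mu = 0, sigma = 1)
nll_gauss = flow_nll(x_out, 0.0, 0.0)
nll_t = flow_nll(x_out, 0.0, 0.0, nu=2.0)
print(nll_gauss, nll_t)  # the t base penalises the outlier far less
```

As ν → ∞ the t base recovers the Gaussian, so ν acts as a robustness knob.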


SLIDE 3

Outline

  • What is robustness?
  • Robustness sits in the tails
  • Tails of flow-based models
  • Experimental findings


SLIDE 4

Why do we need robustness?

Generate some 1D standard normal data and fit a Gaussian:

[Figure: 1D data points x with the fitted Gaussian density p(x)]


SLIDE 6

Why do we need robustness?

The fit changes if we add an outlying datapoint (red blob).

[Figure: data with an outlier and the fitted Gaussian density p(x)]


SLIDE 8

Why do we need robustness?

A fitted Student's t-distribution (red plot) is more concentrated.

[Figure: Gauss. and t (ν = 1.5) fits to the data with an outlier]

SLIDE 9

Why do we need robustness?

As the outlier is moved away, the Gaussian fit changes a lot.

[Figure: Gauss. and t (ν = 1.5) fits as the outlier is moved away]


SLIDE 15

Why do we need robustness?

In contrast, the Student's t-distribution is statistically robust.

[Figure: Gauss. and t (ν = 1.5) fits with the outlier far away]

SLIDE 16

Robust statistics

Robust (resistant) estimator: Adversarially corrupting a fraction η of the data (η < 1/2) only has a bounded effect on the estimated model parameters θ
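A classic illustration of this definition (using the sample mean vs. the median, not the paper's flow models): corrupting one datapoint out of 100 drags the Gaussian maximum-likelihood location estimate arbitrarily far, while a robust estimator barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(99)  # 99 clean standard-normal points

for outlier in (10.0, 100.0, 1000.0):
    corrupted = np.append(data, outlier)  # corrupt eta = 1% of the sample
    # The sample mean (Gaussian MLE of the location) tracks the outlier
    # without bound; the median (a robust estimator) is essentially unmoved.
    print(f"outlier={outlier:7.1f}  mean={corrupted.mean():8.3f}  "
          f"median={np.median(corrupted):6.3f}")
```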


SLIDE 17

Why is Student’s t robust?

The probability density functions of Gaussians and Student’s t-distributions look similar.

[Figure: probability density functions of Gauss., t (ν = 4), and t (ν = 15)]

SLIDE 18

Why is Student’s t robust?

The associated loss functions (the negative log-likelihood, or NLL) exhibit differences in the tails.

[Figure: NLL loss functions of Gauss., t (ν = 4), and t (ν = 15)]
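Up to additive constants, the per-datapoint losses behind these curves are (standard forms, written for a scalar standardised variable):

```latex
% Gaussian NLL: quadratic in x, so an outlier's loss grows fast
\mathrm{NLL}_{\mathcal{N}}(x) = \tfrac{1}{2} x^2 + \mathrm{const}
% Student's t NLL: only logarithmic in |x| for large |x|
\mathrm{NLL}_{t_\nu}(x) = \tfrac{\nu + 1}{2} \log\!\left(1 + \tfrac{x^2}{\nu}\right) + \mathrm{const}
```

The quadratic vs. logarithmic growth is exactly the tail difference visible in the plot.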

SLIDE 19

Why is Student’s t robust?

The influence function is the gradient of the NLL. It quantifies the effect of outliers. For the t-distribution, the influence function is bounded.

[Figure: influence functions of Gauss., t (ν = 4), and t (ν = 15)]
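For a standard location model the two influence functions have closed forms (a sketch under that assumption, with ψ(x) = d/dx NLL(x)):

```python
import numpy as np

def psi_gauss(x):
    # Gaussian NLL is x**2 / 2 + const, so its gradient is x: unbounded
    return x

def psi_t(x, nu):
    # t NLL is ((nu + 1) / 2) * log(1 + x**2 / nu) + const, so its
    # gradient is (nu + 1) * x / (nu + x**2): bounded in magnitude by
    # (nu + 1) / (2 * sqrt(nu)), attained at x = sqrt(nu)
    return (nu + 1) * x / (nu + x**2)

xs = np.array([1.0, 10.0, 100.0, 1000.0])
print(psi_gauss(xs))      # grows linearly with the outlier distance
print(psi_t(xs, nu=4.0))  # decays back towards 0 for distant outliers
```

Distant outliers thus have a vanishing, not just bounded, pull on the t-based gradient.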

SLIDE 20

Why is Student’s t robust?

Gradient clipping can also limit the influence of outliers, but need not converge on the maximum-likelihood model.

[Figure: influence functions of Gauss., t (ν = 4), t (ν = 15), and a clipped Gaussian]

SLIDE 21

Related work

Our findings complement those in concurrent work by Jaini et al. (2020)¹

  • They show:
    • Lipschitz-continuous triangular flows fθ(Z) with Gaussian base distributions Z cannot represent fat-tailed data
    • For example: Glow with sigmoid-transformed scale factors
    • Using multivariate tν-distributions allows modelling data with fat tails
  • We add to this:
    • The advantages of tν-distributions can be understood through statistical robustness
    • Experimentally, these benefits extend to bounded data (no fat tails)

¹ Jaini, P., Kobyzev, I., Yu, Y., and Brubaker, M. Tails of Lipschitz triangular flows. In Proc. ICML, 2020.


SLIDE 22

Stable training

Training loss of Glow models of 64×64 CelebA data trained using Adam. The red configuration is unstable.

[Figure: training loss vs. steps; legend: t (ν = 50), lr=1e-3; Gauss. no grad-clip, lr=1e-4; Gauss. w. grad-clip, lr=1e-3; Gauss. no grad-clip, lr=5e-4]

SLIDE 23

Stable training

Reducing the learning rate (yellow), clipping gradients (green), or changing the base to a multivariate tν-distribution (blue) stabilises training.

[Figure: training loss vs. steps; legend: t (ν = 50), lr=1e-3; Gauss. no grad-clip, lr=1e-4; Gauss. w. grad-clip, lr=1e-3; Gauss. no grad-clip, lr=5e-4]

SLIDE 24

Better generalisation on image data

Test set negative log-likelihood on MNIST with and without outliers from greyscale CIFAR-10. ν = ∞ is the Gaussian baseline.

  Train              |       Test: Clean          |     Test: 1% outliers
               ν =   |   ∞     20     50   1000   |   ∞     20     50   1000
  Clean        NLL   | 1.16   1.13   1.13   1.17  | 1.63   1.27   1.26   1.31
               ∆     |       −0.03  −0.03   0.01  |       −0.36  −0.37  −0.32
  1% outliers  NLL   | 1.17   1.13   1.14   1.18  | 1.21   1.18   1.19   1.22
               ∆     |       −0.04  −0.03   0.01  |       −0.03  −0.02   0.01


SLIDE 27

Better generalisation on more complex data

In probabilistic motion modelling, flow-based models are the current state of the art in terms of output quality. However, they are quite overfitted.

[Figure: loss vs. training steps for Gauss. and t (ν = 50); left panel: locomotion synthesis, right panel: gesture generation]

SLIDE 28

Better generalisation on more complex data

Studentising flows (yellow) perform equally well on training data but greatly reduce overfitting for locomotion and gesture-modelling tasks.

[Figure: loss vs. training steps for Gauss. and t (ν = 50); left panel: locomotion synthesis, right panel: gesture generation]

SLIDE 29

Please see our paper for more!

  • Additional experiments and results
  • Connections between:
  • Consistency and asymptotic efficiency
  • Statistical robustness
  • Machine-learning best practices
  • Code
