SLIDE 1

UCL Centre for AI

Slide 1 / 40

Tighter risk certificates for (probabilistic) neural networks

Omar Rivasplata

  • o.rivasplata@cs.ucl.ac.uk

01 July 2020

SLIDE 2

The crew

  • María Pérez-Ortiz (UCL)
  • Yours truly (UCL / DeepMind)
  • Csaba Szepesvári (DeepMind)
  • John Shawe-Taylor (UCL)
SLIDE 3

Overview of this talk

⊲ Motivation
⊲ Classic NNs: weights
⊲ Probabilistic NNs: random weights
⊲ Highlights of experiments
⊲ Conclusions

SLIDE 4

What motivated this project

SLIDE 5

Blundell et al. (2015)

  • Variational Bayes: min_θ KL(qθ(w) ‖ p(w|D))
  • Objective (ELBO): f(θ) = E_{qθ(w)}[log(1/p(D|w))] + KL(qθ(w) ‖ p(w))
  • Algorithm: ‘Bayes by Backprop’
SLIDE 6

Thiemann et al. (2017)

  • PAC-Bayes-lambda: for λ ∈ (0, 2),

      E_{qθ(w)}[L(w)] ≤ E_{qθ(w)}[L̂n(w, D)] / (1 − λ/2) + (KL(qθ(w) ‖ p(w)) + Cn) / (nλ(1 − λ/2))

  • Algorithm: f(θ, λ) = RHS, alternated optimization over θ and λ
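As a minimal numeric sketch of the λ-step in the alternated optimization: for a fixed posterior (so the empirical term and KL are fixed numbers), the bound is a one-dimensional function of λ that can be minimized directly. The function names and the grid search below are illustrative, not the paper's actual algorithm (Thiemann et al. use closed-form alternation steps); all numeric inputs are made up.

```python
import math

def pb_lambda_bound(emp_risk, kl, n, lam, c_n=0.0):
    """Right-hand side of the PAC-Bayes-lambda bound for a fixed lam in (0, 2)."""
    assert 0.0 < lam < 2.0
    return emp_risk / (1.0 - lam / 2.0) + (kl + c_n) / (n * lam * (1.0 - lam / 2.0))

def best_lambda(emp_risk, kl, n, c_n=0.0, grid=1000):
    """Naive lambda-step: grid-search the lam in (0, 2) minimizing the bound
    while the posterior (hence emp_risk and kl) is held fixed."""
    lams = [(i + 1) / (grid + 1) * 2.0 for i in range(grid)]
    return min(lams, key=lambda l: pb_lambda_bound(emp_risk, kl, n, l, c_n))

# illustrative numbers: 5% empirical risk, KL = 100 nats, n = 10000 examples
lam_star = best_lambda(emp_risk=0.05, kl=100.0, n=10000)
bound = pb_lambda_bound(0.05, 100.0, 10000, lam_star)
```

Since both terms of the bound are convex in λ on (0, 2), the grid minimum is a good proxy for the exact λ-step.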
SLIDE 7

Dziugaite & Roy (2017)

  • Optimized a classic PAC-Bayes bound
  • Experiments on ‘binary MNIST’ ([0-4] vs. [5-9])
  • Demonstrated non-vacuous risk bound values
SLIDE 8

Classic Neural Nets

SLIDE 9

What to achieve from data?

Use the available data to:
  (1) learn a weight vector ŵ
  (2) certify ŵ's performance

  • split the data, part for (1) and part for (2)?
  • use the whole of the data for (1) and (2) simultaneously?

⊲ self-certified learning!

SLIDE 10

Learning framework

  ALG : Z^n → W        W → H

  • Z = X × Y, where X = set of inputs and Y = set of labels
  • W ⊂ R^p is the weight space; ŵ = ALG(data)
  • H is the function class of predictors h_ŵ : X → Y

Data set: D = (Z1, . . . , Zn) ∈ Z^n (e.g. the training set), a finite sequence of input-label examples Zi = (Xi, Yi).

SLIDE 11

A measure of performance

Empirical risk (in-sample error):

  L̂n(w) = L̂n(w, D) = (1/n) Σ_{i=1}^n ℓ(w, Zi)

Tied to the choice of a loss function ℓ(w, z):

  • the square loss (regression)
  • the 0-1 loss (classification)
  • the cross-entropy loss (NN classification)
    ⊲ a surrogate loss with nice properties
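The definitions above can be sketched in a few lines. This is a toy illustration only: the predictor, data, and helper names are invented for the example.

```python
import math

def zero_one_loss(y_pred, y):
    """0-1 loss: 1 on a misclassification, 0 otherwise."""
    return 0.0 if y_pred == y else 1.0

def cross_entropy_loss(probs, y):
    """Cross-entropy surrogate: negative log-probability of the true label."""
    return -math.log(probs[y])

def empirical_risk(loss, predict, data):
    """L̂_n(w): average loss of the predictor over the n examples."""
    return sum(loss(predict(x), y) for x, y in data) / len(data)

# toy 1-d data with labels in {0, 1}; predictor thresholds at zero
data = [(1.5, 1), (-0.5, 0), (2.0, 1), (-1.0, 1)]
risk = empirical_risk(zero_one_loss, lambda x: int(x > 0), data)  # -> 0.25
```

The last example misclassifies one of four points, giving an empirical 0-1 risk of 0.25.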

SLIDE 12

Empirical Risk Minimization

Training set error:

  L̂trn(w) = (1/n_trn) Σ_{Zi ∈ Dtrn} ℓ(w, Zi)

ERM:

  ŵ ∈ arg min_w L̂trn(w)

Penalized ERM:

  ŵ ∈ arg min_w L̂trn(w) + Reg(w)
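A concrete instance of penalized ERM, kept deliberately tiny: one-dimensional least squares with Reg(w) = μ·w², which has a closed-form minimizer. The function name and data are illustrative, not from the talk.

```python
def ridge_erm_1d(xs, ys, mu):
    """Penalized ERM for 1-d least squares with Reg(w) = mu * w^2:
    minimizes (1/n) * sum_i (w*x_i - y_i)^2 + mu * w^2.
    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + n*mu)."""
    n = len(xs)
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + n * mu
    return num / den

# data lying exactly on y = 2x: plain ERM (mu=0) recovers w = 2,
# while mu > 0 shrinks the solution toward zero
w_hat = ridge_erm_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 0.1)
```

The shrinkage effect of the penalty is exactly the 'regularization' role Reg(w) plays on the slide.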

SLIDE 13

Generalization

If the learned weight ŵ does well on the training set examples... ...will it still do well on unseen examples?

SLIDE 14

PAC Learning

Data set: D = (Z1, . . . , Zn) ∈ Z^n, a finite sequence of input-label examples Zi = (Xi, Yi).

Assumptions:

  • A data-generating distribution P ∈ M1(Z).
  • P is unknown; only the training set is given.
  • The input-label examples are i.i.d. ∼ P.

Population risk (out-of-sample):

  L(w) = E ℓ(w, Z) = ∫_Z ℓ(w, z) dP(z)

SLIDE 15

Certifying performance: test set error

Test set error:

  L̂tst(ŵ) = (1/n_tst) Σ_{Zi ∈ Dtst} ℓ(ŵ, Zi)

⊲ ŵ obtained from the training set
⊲ test set not used for training
⊲ L̂tst(ŵ) serves as an estimate of L(ŵ)
⊲ Note: L(ŵ) remains unknown!

SLIDE 16

Certifying performance: confidence bound

Risk upper bound: for any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w:

  L(w) ≤ L̂n(w) + ε(n, δ)

For ŵ = ALG(train set) this gives: L(ŵ) ≤ L̂tst(ŵ) + ε(n_tst, δ)

Recommended practice:
⊲ report the confidence bound together with your test set error estimate
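The slide leaves ε(n, δ) abstract. One standard instantiation for losses bounded in [0, 1] is the one-sided Hoeffding term ε(n, δ) = √(log(1/δ) / (2n)); the sketch below uses that choice as an assumption, with illustrative numbers.

```python
import math

def hoeffding_eps(n, delta):
    """One standard choice of eps(n, delta) for losses in [0, 1]:
    the one-sided Hoeffding confidence term sqrt(log(1/delta) / (2n))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def risk_certificate(test_error, n_test, delta=0.05):
    """Upper-bounds L(w_hat) by test error + eps(n_test, delta),
    valid with probability >= 1 - delta over the random test set."""
    return test_error + hoeffding_eps(n_test, delta)

# e.g. 2.1% test error on 10000 held-out examples, 95% confidence
cert = risk_certificate(test_error=0.021, n_test=10000, delta=0.05)
```

Note how the certificate tightens as the test set grows: the ε term shrinks like 1/√n.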

SLIDE 17

Self-certified learning?

Risk upper bound: for any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w:

  L(w) ≤ L̂n(w) + ε(n, δ)

Alternative practice: find ŵ by minimizing the risk bound.
⊲ a form of regularized ERM
⊲ the learned ŵ comes with its own risk certificate
⊲ best if the risk bound is non-vacuous, ideally tight!
⊲ may avoid the need for data-splitting
⊲ may lead to self-certified learning!

SLIDE 18

Probabilistic Neural Nets

SLIDE 19

Randomized weights

  • Based on data D, learn a distribution over weights:
      QD ∈ M1(W), QD = ALG(train set).
  • Predictions:
    • draw w ∼ QD and predict with the chosen w;
    • each prediction uses a fresh random draw.

The risk measures L(w) and L̂n(w) are extended to Q by averaging:

  Q[L] ≡ ∫_W L(w) dQ(w) = E_{w∼Q}[L(w)]

  Q[L̂n] ≡ ∫_W L̂n(w) dQ(w) = E_{w∼Q}[L̂n(w)]
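In practice Q[L̂n] is estimated by Monte Carlo: draw several weight vectors from Q and average their empirical risks. A toy sketch for a Gaussian Q over the weights of a linear classifier; the model, data, and function names are invented for illustration.

```python
import random

def linear_predict(w, x):
    """Predict a label in {0, 1} from the sign of <w, x>."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) > 0)

def mc_Q_risk(mu, sigma, data, n_samples=200, seed=0):
    """Monte Carlo estimate of Q[L̂_n] for Q = Gauss(mu, sigma^2 I):
    draw w ~ Q, compute the empirical 0-1 risk for that w, average over draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
        errs = sum(linear_predict(w, x) != y for x, y in data)
        total += errs / len(data)
    return total / n_samples

# linearly separable toy data; a confident mean with small sigma gives low Q-risk
data = [([1.0, 0.2], 1), ([-1.0, 0.1], 0), ([0.8, -0.3], 1), ([-0.9, -0.2], 0)]
q_risk = mc_Q_risk(mu=[2.0, 0.0], sigma=0.1, data=data)
```

Widening σ spreads the weight draws and typically raises the averaged risk, which is exactly the tension the KL term in a PAC-Bayes bound trades off against.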

SLIDE 20

Two usual PAC-Bayes bounds

Fix a ‘prior’ distribution Q0. For any sample size n and any confidence parameter δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n), simultaneously for all ‘posterior’ distributions Q:

  Q[L] ≤ Q[L̂n] + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) )    (PB-classic)

  kl(Q[L̂n] ‖ Q[L]) ≤ (KL(Q‖Q0) + log(2√n/δ)) / n    (PB-kl)
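The two bounds above can be evaluated numerically; PB-kl needs an extra inversion step (find the largest p with kl(Q[L̂n] ‖ p) below the right-hand side), done here by bisection. Function names and the input numbers are illustrative.

```python
import math

def kl_bin(q, p):
    """Binary KL divergence kl(q || p) for q, p in (0, 1), clipped for stability."""
    q = min(max(q, 1e-12), 1 - 1e-12)
    p = min(max(p, 1e-12), 1 - 1e-12)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def complexity(kl_term, n, delta):
    """(KL(Q||Q0) + log(2 sqrt(n)/delta)) / n, shared by both bounds."""
    return (kl_term + math.log(2.0 * math.sqrt(n) / delta)) / n

def pb_classic(emp, kl_term, n, delta):
    """PB-classic: Q[L] <= Q[L_hat] + sqrt(complexity / 2)."""
    return emp + math.sqrt(complexity(kl_term, n, delta) / 2.0)

def pb_kl(emp, kl_term, n, delta, iters=100):
    """PB-kl: invert kl(emp || p) <= complexity by bisection over p in [emp, 1]."""
    rhs = complexity(kl_term, n, delta)
    lo, hi = emp, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bin(emp, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

# illustrative numbers: 2% averaged empirical risk, KL = 50 nats, n = 10000
b1 = pb_classic(0.02, 50.0, 10000, 0.05)
b2 = pb_kl(0.02, 50.0, 10000, 0.05)
```

For small empirical risk the inverted PB-kl bound is noticeably tighter than PB-classic, which is why it is the preferred form for reporting certificates.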

SLIDE 21

Two more PAC-Bayes bounds

Fix a distribution Q0. For any size n and any confidence δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n):

PB-quad: simultaneously for all distributions Q,

  Q[L] ≤ [ √( Q[L̂n] + (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) ]²

PB-lambda: simultaneously for all distributions Q and λ ∈ (0, 2),

  Q[L] ≤ Q[L̂n] / (1 − λ/2) + (KL(Q‖Q0) + log(2√n/δ)) / (nλ(1 − λ/2))
slide-22
SLIDE 22

Cornerstone: change of measure inequality

Motivation Classic weights Random weights Experiments Conclusions

  • O. Rivasplata

Slide 22 / 40

Donsker & Varadhan (1975), Csisz´ ar (1975) KL(QQ0) = sup

f:W→R

  • Q[f] − log Q0[ef]
  • Let f : Zn × W → R.

For a given Q0 : Q[f(D, w)] ≤ KL(QQ0) + log Q0[ef(D,w)].

  • Apply Markov’s inequality to Q0[ef(D,w)].
  • w.p. ≥ 1 − δ over the random draw of D ∼ Pn,

simultaneously for all distributions Q : Q[f(D, w)] ≤ KL(QQ0) + log Pn[Q0[ef(D,w)]] + log(1/δ).

  • Use with suitable f,

upper-bound the exponential moment Pn[Q0[ef(D,w)]].
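As a worked instance of the recipe above, here is a sketch of how PB-classic follows from it, assuming losses in [0, 1] and the standard choice of f (this reconstruction follows the usual Langford–Seeger/Maurer route and is not spelled out on the slide):

```latex
% Choice of f: the scaled squared gap between population and empirical risk.
f(D, w) \;=\; 2n\,\bigl(L(w) - \hat L_n(w, D)\bigr)^2

% Jensen's inequality (the square is convex), taken under Q:
Q[f(D, w)] \;\ge\; 2n\,\bigl(Q[L] - Q[\hat L_n]\bigr)^2

% Exponential moment: by Pinsker, f \le n\,\mathrm{kl}(\hat L_n \| L),
% and Maurer's lemma bounds E\,e^{n\,\mathrm{kl}(\hat L_n \| L)} \le 2\sqrt{n}, so
P^n\bigl[Q_0[e^{f(D,w)}]\bigr] \;\le\; 2\sqrt{n}

% Plugging into the change-of-measure + Markov step, w.p. \ge 1 - \delta:
2n\,\bigl(Q[L] - Q[\hat L_n]\bigr)^2 \;\le\; \mathrm{KL}(Q\|Q_0) + \log\tfrac{2\sqrt{n}}{\delta}

% Rearranging yields PB-classic:
Q[L] \;\le\; Q[\hat L_n] + \sqrt{\frac{\mathrm{KL}(Q\|Q_0) + \log\frac{2\sqrt{n}}{\delta}}{2n}}
```

The same template with f = n·kl(L̂n ‖ L) (no Pinsker step) yields PB-kl directly.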

SLIDE 23

Using a PAC-Bayes bound

  • Use your favourite ALG to find QD = ALG(train set), and plug QD into the PAC-Bayes bound to certify its risk:

      QD[L] ≤ QD[L̂n] + √( (KL(QD‖Q0) + log(2√n/δ)) / (2n) )

  • Use the PAC-Bayes bound itself as a training objective:

      QD ∈ arg min_Q { Q[L̂n] + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) }

Note: both uses are illustrated here with PB-classic, but the same can be done with PB-quad or PB-lambda (or any other).

SLIDE 24

Training objectives

With Q[L̂ce_n] the empirical risk under the cross-entropy loss:

  fclassic(Q) = Q[L̂ce_n] + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) )

  fquad(Q) = [ √( Q[L̂ce_n] + (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) ]²

  flambda(Q, λ) = Q[L̂ce_n] / (1 − λ/2) + (KL(Q‖Q0) + log(2√n/δ)) / (nλ(1 − λ/2))
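To make the three objectives concrete, here is a sketch that evaluates each for a fixed posterior (i.e. fixed empirical cross-entropy term and KL). The helper names and the input numbers are illustrative, not the talk's actual training code.

```python
import math

def _complexity(kl, n, delta):
    """(KL(Q||Q0) + log(2 sqrt(n)/delta)) / n, shared by all three objectives."""
    return (kl + math.log(2.0 * math.sqrt(n) / delta)) / n

def f_classic(emp_ce, kl, n, delta):
    return emp_ce + math.sqrt(_complexity(kl, n, delta) / 2.0)

def f_quad(emp_ce, kl, n, delta):
    c = _complexity(kl, n, delta) / 2.0
    return (math.sqrt(emp_ce + c) + math.sqrt(c)) ** 2

def f_lambda(emp_ce, kl, n, delta, lam):
    return emp_ce / (1 - lam / 2) + _complexity(kl, n, delta) / (lam * (1 - lam / 2))

# illustrative: 0.05 empirical cross-entropy, KL = 100 nats, n = 60000, delta = 0.025
vals = {
    "classic": f_classic(0.05, 100.0, 60000, 0.025),
    "quad": f_quad(0.05, 100.0, 60000, 0.025),
    "lambda": min(f_lambda(0.05, 100.0, 60000, 0.025, l / 100) for l in range(1, 200)),
}
```

When the empirical term is small, fquad evaluates below fclassic, which is consistent with the tighter certificates reported for fquad in the experiments.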

SLIDE 25

Experiments

SLIDE 26

PAC-Bayes with Backprop

SLIDE 27

Prior mean at the random initialization

  • (PAC-Bayes) prior Q0 = Gauss(w0, Σ0)
      Σ0 = λ0 I (λ0 is a hyperparameter)
      w0 = randomly initialized weights
  • (PAC-Bayes) posterior QD = Gauss(w, Σ)
      w, Σ learned by PAC-Bayes with Backprop

Experiments (ours) on MNIST (RUB = risk upper bound value):

  fquad:                        Test acc. = 86.36   Test error = 0.1364   RUB value = 0.24107
  fclassic (cf. D & R (2017)):  Test acc. = 84.22   Test error = 0.1578   RUB value = 0.24375

SLIDE 28

Prior mean learned from data

  • (PAC-Bayes) prior Q0 = Gauss(w0, Σ0)
      Σ0 = λ0 I (λ0 is a hyperparameter)
      w0 = ERM on a split of the data
  • (PAC-Bayes) posterior QD = Gauss(w, Σ)
      w, Σ learned by PAC-Bayes with Backprop

Experiments (ours) on MNIST:

  fquad:                        Test acc. = 97.89   Test error = 0.0211   RUB value = 0.04588
  fclassic (cf. D & R (2018)):  Test acc. = 97.21   Test error = 0.0279   RUB value = 0.06029

SLIDE 29

Closing remarks

SLIDE 30

Bayesian Learning

Posterior QD with density qD(w); prior Q0 with density q0(w):

  qD(w) = L(D|w) q0(w) / C

  • Bayes rule: update the prior to form the posterior
    ⊲ likelihood factor L(D|w)
  • principled approach, e.g. MAP learning
  • derive learning algorithms
    ⊲ balance ‘fit to data’ and ‘fit to prior’

SLIDE 31

Generalized Bayes

A bit more general: “temperature” λ > 0

  qD(w) = L(D|w)^λ q0(w) / C

Even more general: data-dependent factor F

  qD(w) = F(D, w) q0(w)

  • P.G. Bissiri, C.C. Holmes, S.G. Walker (2016): A general framework for updating belief distributions

SLIDE 32

PAC-Bayes

  qD(w) = ( no update factor ) q0(w)

  • more general than generalized Bayes
  • increased flexibility in the choice of distributions
  • balance qD[L̂n] and KL(qD‖q0)
    ⊲ ‘fit to data’ versus ‘fit to prior’

SLIDE 33

Future

⊲ choice of distributions
⊲ understand properties
⊲ scaling to larger problems?
⊲ architecture vs. PAC-Bayes bounds?
⊲ problem-specific PAC-Bayes bounds?

SLIDE 34

Thank you!

SLIDE 35

Wait...

SLIDE 36

some PAC-Bayes history

  • J. Shawe-Taylor & R.C. Williamson (1997): A PAC analysis of a Bayesian estimator
  • D.A. McAllester (1998): Some PAC-Bayesian Theorems
  • D.A. McAllester (1999): PAC-Bayesian Model Averaging
  • J. Langford & M. Seeger (2001): Bounds for Averaging Classifiers
  • J. Langford & R. Caruana (2002): (Not) Bounding the True Error
  • M. Seeger (2002): PAC-Bayesian generalization bounds for Gaussian processes

SLIDE 37

some more PAC-Bayes history

  • J. Langford & J. Shawe-Taylor (2002): PAC-Bayes & Margins
  • D.A. McAllester (2003): Simplified PAC-Bayesian Margin Bounds
  • A. Maurer (2004): A note on the PAC Bayesian theorem
  • J.-Y. Audibert (2004): A better variance control for PAC-Bayesian classification
  • O. Catoni (2007): PAC-Bayesian supervised classification: The thermodynamics of statistical learning
  • P. Germain, A. Lacasse, F. Laviolette, M. Marchand (2009): PAC-Bayesian learning of linear classifiers

SLIDE 38

some recent PAC-Bayes

  • J. Keshet, D.A. McAllester, T. Hazan (2011): PAC-Bayesian approach for minimization of phoneme error rate
  • A. Noy & K. Crammer (2014): Robust forward algorithms via PAC-Bayes and Laplace distributions
  • P. Germain, F. Bach, A. Lacoste, S. Lacoste-Julien (2016): PAC-Bayesian theory meets Bayesian inference
  • N. Thiemann, C. Igel, O. Wintenberger, Y. Seldin (2017): A Strongly Quasiconvex PAC-Bayesian Bound
  • G.K. Dziugaite & D. Roy (2017): Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data
  • G.K. Dziugaite & D. Roy (2018): Data-dependent PAC-Bayes priors via differential privacy

SLIDE 39

more recent PAC-Bayes

  • O. Rivasplata, E. Parrado-Hernández, J. Shawe-Taylor, S. Sun, Cs. Szepesvári (2018): PAC-Bayes bounds for stable algorithms with instance-dependent priors
  • P. Alquier & B. Guedj (2018): Simpler PAC-Bayesian bounds for hostile data
  • S.S. Lorenzen, C. Igel, Y. Seldin (2019): On PAC-Bayesian Bounds for Random Forests
  • G. Letarte, P. Germain, B. Guedj, F. Laviolette (2019): Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks
  • O. Rivasplata, V.M. Tankasali, Cs. Szepesvári (2019): PAC-Bayes with Backprop (in arXiv)

SLIDE 40

Thank you again!