Tighter risk certificates for (probabilistic) neural networks
Omar Rivasplata, o.rivasplata@cs.ucl.ac.uk, 01 July 2020, UCL Centre for AI


  1. Tighter risk certificates for (probabilistic) neural networks Omar Rivasplata o.rivasplata@cs.ucl.ac.uk 01 July 2020 UCL Centre for AI Slide 1 / 40

  2. The crew • María Pérez-Ortiz (UCL) • Yours truly (UCL / DeepMind) • Csaba Szepesvári (DeepMind) • John Shawe-Taylor (UCL)

  3. Overview of this talk ⊲ Motivation ⊲ Classic NNs: weights ⊲ Probabilistic NNs: random weights ⊲ Highlights of experiments ⊲ Conclusions

  4. What motivated this project

  5. Blundell et al. (2015) • Variational Bayes: min_θ KL(q_θ(w) ‖ p(w|D)) • Objective (ELBO): f(θ) = E_{q_θ(w)}[log(1/p(D|w))] + KL(q_θ(w) ‖ p(w)) • Algorithm: 'Bayes by Backprop'
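The objective above can be sketched numerically. Below is a minimal Monte Carlo version for a hypothetical one-parameter Gaussian regression model; the model, the function names (`gaussian_kl`, `elbo_objective`), and the default prior are illustrative assumptions of this sketch, not details from the talk.

```python
import math
import random

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    # Closed-form KL(N(mu_q, sig_q^2) || N(mu_p, sig_p^2))
    return math.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

def neg_log_lik(w, data, noise=1.0):
    # log(1 / p(D|w)) for y_i ~ N(w * x_i, noise^2), dropping additive constants
    return sum((y - w * x) ** 2 for x, y in data) / (2 * noise**2)

def elbo_objective(mu, sigma, data, prior=(0.0, 1.0), n_samples=1000, seed=0):
    # f(theta) = E_{q_theta(w)}[log(1/p(D|w))] + KL(q_theta(w) || p(w)),
    # with the expectation estimated by Monte Carlo over w ~ N(mu, sigma^2)
    rng = random.Random(seed)
    mc = sum(neg_log_lik(mu + sigma * rng.gauss(0, 1), data)
             for _ in range(n_samples)) / n_samples
    return mc + gaussian_kl(mu, sigma, *prior)
```

Bayes by Backprop minimizes f(θ) with the reparameterization w = μ + σ·ε, ε ∼ N(0, 1), so that gradients flow through μ and σ; the grid-free sketch above only evaluates the objective.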

  6. Thiemann et al. (2017) • PAC-Bayes-lambda: for λ ∈ (0, 2), E_{q_θ(w)}[L(w)] ≤ E_{q_θ(w)}[L̂_n(w, D)] / (1 − λ/2) + (KL(q_θ(w) ‖ p(w)) + C_n) / (n λ (1 − λ/2)) • Algorithm: f(θ, λ) = RHS, alternated optimization over θ and λ
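The λ-half of the alternated optimization can be sketched as follows. Two assumptions in this sketch: a grid search over λ stands in for the paper's closed-form λ-update, and the log(2√n/δ) complexity term used on later slides stands in for C_n.

```python
import math

def pb_lambda_bound(emp_risk, kl, n, delta, lam):
    # RHS of the PAC-Bayes-lambda bound for a fixed lambda in (0, 2)
    complexity = kl + math.log(2 * math.sqrt(n) / delta)
    return emp_risk / (1 - lam / 2) + complexity / (n * lam * (1 - lam / 2))

def best_lambda(emp_risk, kl, n, delta, grid=1000):
    # One step of the alternated scheme: minimize the bound over lambda
    # while the posterior (hence emp_risk and kl) is held fixed.
    lams = [2 * (i + 1) / (grid + 2) for i in range(grid)]
    return min(lams, key=lambda l: pb_lambda_bound(emp_risk, kl, n, delta, l))
```

The other half of the alternation updates the posterior parameters θ with the bound at the current λ as loss.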

  7. Dziugaite & Roy (2017) • Optimized a classic PAC-Bayes bound • Experiments on 'binary MNIST' ([0-4] vs. [5-9]) • Demonstrated non-vacuous risk bound values

  8. Classic Neural Nets

  9. What to achieve from data? Use the available data to: (1) learn a weight vector ŵ, and (2) certify ŵ's performance. • Split the data, part for (1) and part for (2)? • Or use the whole of the data for (1) and (2) simultaneously? ⊲ self-certified learning!

  10. Learning framework ALG: Z^n → W and W → H, w ↦ h_w. • W ⊂ R^p is the weight space • Z = X × Y, with X = set of inputs and Y = set of labels • H is the class of predictors h_w: X → Y. Learned weights: ŵ = ALG(data), with predictor h_ŵ: X → Y. Data set: D = (Z_1, ..., Z_n) ∈ Z^n (e.g. the training set), a finite sequence of input-label examples Z_i = (X_i, Y_i).

  11. A measure of performance Empirical risk (in-sample error): L̂_n(w, D) = (1/n) Σ_{i=1}^n ℓ(w, Z_i). Tied to the choice of a loss function ℓ(w, z): • the square loss (regression) • the 0-1 loss (classification) • the cross-entropy loss (NN classification) ⊲ a surrogate loss with nice properties.
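In code, the empirical risk is a plain average of per-example losses. A minimal sketch with the 0-1 loss (function names here are illustrative):

```python
def zero_one_loss(y_pred, y_true):
    # 0-1 loss: 1 for a mistake, 0 for a correct prediction
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(predict, data, loss):
    # hat L_n(w, D) = (1/n) * sum_i loss(h_w(X_i), Y_i)
    return sum(loss(predict(x), y) for x, y in data) / len(data)
```

For example, a threshold classifier `lambda x: 1 if x > 0 else 0` on labeled 1-D inputs has empirical risk equal to its fraction of misclassified examples.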

  12. Empirical Risk Minimization Training set error: L̂_trn(w) = (1/n_trn) Σ_{Z_i ∈ D_trn} ℓ(w, Z_i). ERM: ŵ ∈ argmin_w L̂_trn(w). Penalized ERM: ŵ ∈ argmin_w L̂_trn(w) + Reg(w).
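For one concrete instance of penalized ERM, consider 1-D least squares with a squared-weight penalty Reg(w) = reg · w²; the minimizer then has a closed form. This toy model is an assumption of the sketch, not from the slides.

```python
def penalized_erm_1d(data, reg):
    # argmin_w (1/n) * sum_i (y_i - w * x_i)^2 + reg * w^2.
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + n * reg).
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + len(data) * reg)
```

With reg = 0 this recovers plain ERM; larger reg shrinks the learned weight toward 0.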

  13. Generalization If the learned weight ŵ does well on the train set examples... ...will it still do well on unseen examples?

  14. PAC Learning Data set: D = (Z_1, ..., Z_n) ∈ Z^n, a finite sequence of input-label examples Z_i = (X_i, Y_i). Assumptions: • A data-generating distribution P ∈ M_1(Z). • P is unknown; only the training set is given. • The input-label examples are i.i.d. ∼ P. Population risk (out-of-sample): L(w) = E[ℓ(w, Z)] = ∫_Z ℓ(w, z) dP(z).

  15. Certifying performance: test set error Test set error: L̂_tst(ŵ) = (1/n_tst) Σ_{Z_i ∈ D_tst} ℓ(ŵ, Z_i). ⊲ ŵ is obtained from the training set ⊲ the test set is not used for training ⊲ L̂_tst(ŵ) serves as an estimate of L(ŵ) ⊲ Note: L(ŵ) remains unknown!

  16. Certifying performance: confidence bound Risk upper bound: for any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w: L(w) ≤ L̂_n(w) + ε(n, δ). For ŵ = ALG(train set) this gives: L(ŵ) ≤ L̂_tst(ŵ) + ε(n_tst, δ). Recommended practice: ⊲ report the confidence bound together with your test set error estimate.
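For a bounded loss, one standard choice of ε(n, δ) comes from Hoeffding's inequality; the slide does not commit to a specific ε, so the [0, 1]-bounded-loss assumption below is mine.

```python
import math

def hoeffding_epsilon(n, delta):
    # For a loss in [0, 1], Hoeffding's inequality gives
    # L(w) <= hat L(w) + sqrt(log(1/delta) / (2n)) w.p. >= 1 - delta.
    return math.sqrt(math.log(1 / delta) / (2 * n))

def risk_certificate(test_error, n_test, delta=0.05):
    # Report the test set error together with its confidence margin
    return test_error + hoeffding_epsilon(n_test, delta)
```

For instance, with n_test = 10,000 and δ = 0.05 the margin is about 0.012, so a 3% test error certifies a population risk of roughly 4.2% with 95% confidence.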

  17. Self-certified learning? Risk upper bound: for any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w: L(w) ≤ L̂_n(w) + ε(n, δ). Alternative practice: find ŵ by minimizing the risk bound. ⊲ A form of regularized ERM ⊲ the learned ŵ comes with its own risk certificate ⊲ best if the risk bound is non-vacuous, ideally tight! ⊲ may avoid the need for data-splitting ⊲ may lead to self-certified learning!

  18. Probabilistic Neural Nets

  19. Randomized weights Based on data D, learn a distribution over weights: Q_D ∈ M_1(W), Q_D = ALG(train set). Predictions: • draw w ∼ Q_D and predict with the chosen w • each prediction uses a fresh random draw. The risk measures L(w) and L̂_n(w) are extended to Q by averaging: Q[L] ≡ ∫_W L(w) dQ(w) = E_{w∼Q}[L(w)] and Q[L̂_n] ≡ ∫_W L̂_n(w) dQ(w) = E_{w∼Q}[L̂_n(w)].
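A minimal sketch of prediction with randomized weights, assuming (purely for illustration) that Q_D is a one-dimensional Gaussian over the weight of a sign classifier:

```python
import random

def make_probabilistic_predictor(mu, sigma, seed=0):
    # Q_D = N(mu, sigma^2) over a single weight; every call to predict
    # uses a fresh draw w ~ Q_D, as on the slide.
    rng = random.Random(seed)
    def predict(x):
        w = rng.gauss(mu, sigma)
        return 1 if w * x > 0 else 0
    return predict

def averaged_empirical_risk(mu, sigma, data, n_draws=2000, seed=1):
    # Monte Carlo estimate of Q[hat L_n] = E_{w ~ Q}[hat L_n(w)]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        w = rng.gauss(mu, sigma)
        total += sum(1 for x, y in data if (1 if w * x > 0 else 0) != y) / len(data)
    return total / n_draws
```

In a real probabilistic NN, w would be the full weight vector of the network and Q_D a diagonal Gaussian over it, but the averaging logic is the same.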

  20. Two usual PAC-Bayes bounds Fix a 'prior' distribution Q_0. For any sample size n, for any confidence parameter δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n), simultaneously for all 'posterior' distributions Q:
(PB-classic) Q[L] ≤ Q[L̂_n] + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) )
(PB-kl) kl( Q[L̂_n] ‖ Q[L] ) ≤ (KL(Q ‖ Q_0) + log(2√n/δ)) / n
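Both bounds are easy to evaluate numerically. PB-kl requires inverting the binary kl in its second argument, done here by bisection (a standard approach; the function names are mine):

```python
import math

def pb_classic(emp, kl, n, delta):
    # Q[L] <= Q[hat L_n] + sqrt((KL(Q||Q_0) + log(2 sqrt(n)/delta)) / (2n))
    return emp + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

def binary_kl(q, p):
    # kl(q || p) between Bernoulli parameters q and p
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pb_kl(emp, kl, n, delta):
    # PB-kl bound: the largest p with binary_kl(emp, p) <= rhs, by bisection
    rhs = (kl + math.log(2 * math.sqrt(n) / delta)) / n
    lo, hi = emp, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binary_kl(emp, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo
```

PB-kl is never looser than PB-classic: by Pinsker's inequality kl(q ‖ p) ≥ 2(p − q)², so inverting the kl yields a value no larger than the square-root bound.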

  21. Two more PAC-Bayes bounds Fix a distribution Q_0. For any size n, for any confidence δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n):
PB-quad, simultaneously for all distributions Q:
Q[L] ≤ ( √( Q[L̂_n] + (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) )²
PB-lambda, simultaneously for all distributions Q and λ ∈ (0, 2):
Q[L] ≤ Q[L̂_n] / (1 − λ/2) + (KL(Q ‖ Q_0) + log(2√n/δ)) / (n λ (1 − λ/2))
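A numerical sketch of these two bounds, with illustrative inputs of the kind one would plug in after training (function names are mine):

```python
import math

def pb_quad(emp, kl, n, delta):
    # PB-quad: Q[L] <= ( sqrt(Q[hat L_n] + B/(2n)) + sqrt(B/(2n)) )^2,
    # where B = KL(Q||Q_0) + log(2 sqrt(n)/delta)
    b = (kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n)
    return (math.sqrt(emp + b) + math.sqrt(b)) ** 2

def pb_lambda(emp, kl, n, delta, lam):
    # PB-lambda for a fixed lam in (0, 2)
    b = kl + math.log(2 * math.sqrt(n) / delta)
    return emp / (1 - lam / 2) + b / (n * lam * (1 - lam / 2))
```

With emp = 0.05, KL = 10, n = 10,000, δ = 0.05, PB-quad gives roughly 0.065, while PB-lambda at λ = 1 gives roughly 0.104; tuning λ closes much of that gap.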

  22. Cornerstone: change of measure inequality Donsker & Varadhan (1975), Csiszár (1975):
KL(Q ‖ Q_0) = sup_{f: W → R} { Q[f] − log Q_0[e^f] }
Let f: Z^n × W → R. For a given Q_0: ⊲ Q[f(D, w)] ≤ KL(Q ‖ Q_0) + log Q_0[e^{f(D, w)}]. ⊲ Apply Markov's inequality to Q_0[e^{f(D, w)}]: w.p. ≥ 1 − δ over the random draw of D ∼ P^n, simultaneously for all distributions Q, Q[f(D, w)] ≤ KL(Q ‖ Q_0) + log P^n[Q_0[e^{f(D, w)}]] + log(1/δ). ⊲ Use with a suitable f, and upper-bound the exponential moment P^n[Q_0[e^{f(D, w)}]].
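The variational formula can be checked numerically on a finite weight space, where the supremum is attained at f = log(dQ/dQ_0). A small sketch (distributions and names chosen for illustration):

```python
import math

def kl_div(q, q0):
    # KL(Q || Q_0) on a finite space
    return sum(qi * math.log(qi / q0i) for qi, q0i in zip(q, q0) if qi > 0)

def dv_gap(q, q0, f):
    # Q[f] - log Q_0[e^f]; by Donsker-Varadhan this never exceeds KL(Q||Q_0)
    q_f = sum(qi * fi for qi, fi in zip(q, f))
    log_mgf = math.log(sum(q0i * math.exp(fi) for q0i, fi in zip(q0, f)))
    return q_f - log_mgf

q = [0.7, 0.2, 0.1]
q0 = [1 / 3, 1 / 3, 1 / 3]
f_opt = [math.log(qi / q0i) for qi, q0i in zip(q, q0)]  # optimal f = log(dQ/dQ_0)
```

At f_opt the gap equals KL(Q ‖ Q_0) exactly, since log Q_0[e^{f_opt}] = log Σ q = 0; any other f gives a strictly smaller value.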

  23. Using a PAC-Bayes bound Use your favourite ALG to find Q_D = ALG(train set), and plug Q_D into the PAC-Bayes bound to certify its risk:
Q_D[L] ≤ Q_D[L̂_n] + √( (KL(Q_D ‖ Q_0) + log(2√n/δ)) / (2n) )
Or use the PAC-Bayes bound itself as a training objective:
Q_D ∈ argmin_Q { Q[L̂_n] + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) }
Note: both uses are illustrated here with PB-classic, but the same can be done with PB-quad or PB-lambda (or any other).

  24. Training objectives (with L̂_n^ce the empirical cross-entropy loss):
f_classic(Q) = Q[L̂_n^ce] + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) )
f_quad(Q) = ( √( Q[L̂_n^ce] + (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) )²
f_lambda(Q, λ) = Q[L̂_n^ce] / (1 − λ/2) + (KL(Q ‖ Q_0) + log(2√n/δ)) / (n λ (1 − λ/2))

  25. Experiments

  26. PAC-Bayes with Backprop
