SLIDE 1

UCL Centre for AI

Slide 1 / 40

Tighter risk certificates for (probabilistic) neural networks

Omar Rivasplata

  • o.rivasplata@cs.ucl.ac.uk

01 July 2020

SLIDE 2

The crew

  • María Pérez-Ortiz (UCL)
  • Yours truly (UCL / DeepMind)
  • Csaba Szepesvári (DeepMind)
  • John Shawe-Taylor (UCL)
SLIDE 3

Overview of this talk

⊲ Motivation
⊲ Classic NNs: weights
⊲ Probabilistic NNs: random weights
⊲ Highlights of experiments
⊲ Conclusions

SLIDE 4

What motivated this project

SLIDE 5

Blundell et al. (2015)

  • Variational Bayes: min_θ KL(qθ(w) ‖ p(w|D))
  • Objective (ELBO): f(θ) = E_{qθ(w)}[log(1/p(D|w))] + KL(qθ(w) ‖ p(w))
  • Algorithm: ‘Bayes by Backprop’
SLIDE 6

Thiemann et al. (2017)

  • PAC-Bayes-lambda: for λ ∈ (0, 2),

      E_{qθ(w)}[L(w)] ≤ E_{qθ(w)}[L̂n(w, D)] / (1 − λ/2) + (KL(qθ(w) ‖ p(w)) + Cn) / (nλ(1 − λ/2))

  • Algorithm: f(θ, λ) = RHS, alternated optimization over θ and λ
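As a minimal numeric sketch of the λ-step in the alternated optimization: for a fixed posterior (so the empirical term and KL are fixed numbers), the bound is a one-dimensional function of λ that can be minimized directly. The function names and the grid search below are illustrative, not the paper's actual algorithm (Thiemann et al. use closed-form alternation steps); all numeric inputs are made up.

```python
import math

def pb_lambda_bound(emp_risk, kl, n, lam, c_n=0.0):
    """Right-hand side of the PAC-Bayes-lambda bound for a fixed lam in (0, 2)."""
    assert 0.0 < lam < 2.0
    return emp_risk / (1.0 - lam / 2.0) + (kl + c_n) / (n * lam * (1.0 - lam / 2.0))

def best_lambda(emp_risk, kl, n, c_n=0.0, grid=1000):
    """Naive lambda-step: grid-search the lam in (0, 2) minimizing the bound
    while the posterior (hence emp_risk and kl) is held fixed."""
    lams = [(i + 1) / (grid + 1) * 2.0 for i in range(grid)]
    return min(lams, key=lambda l: pb_lambda_bound(emp_risk, kl, n, l, c_n))

# illustrative numbers: 5% empirical risk, KL = 100 nats, n = 10000 examples
lam_star = best_lambda(emp_risk=0.05, kl=100.0, n=10000)
bound = pb_lambda_bound(0.05, 100.0, 10000, lam_star)
```

Since both terms of the bound are convex in λ on (0, 2), the grid minimum is a good proxy for the exact λ-step.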
SLIDE 7

Dziugaite & Roy (2017)

  • Optimized a classic PAC-Bayes bound
  • Experiments on ‘binary MNIST’ ([0-4] vs. [5-9])
  • Demonstrated non-vacuous risk bound values
SLIDE 8

Classic Neural Nets

SLIDE 9

What to achieve from data?

Use the available data to:
  (1) learn a weight vector ŵ
  (2) certify ŵ's performance

  • split the data, part for (1) and part for (2)?
  • use the whole of the data for (1) and (2) simultaneously?

⊲ self-certified learning!

SLIDE 10

Learning framework

  ALG : Z^n → W        W → H

  • Z = X × Y, where X = set of inputs and Y = set of labels
  • W ⊂ R^p is the weight space; ŵ = ALG(data)
  • H is the function class of predictors h_ŵ : X → Y

Data set: D = (Z1, . . . , Zn) ∈ Z^n (e.g. the training set), a finite sequence of input-label examples Zi = (Xi, Yi).

SLIDE 11

A measure of performance

Empirical risk (in-sample error):

  L̂n(w) = L̂n(w, D) = (1/n) Σ_{i=1}^n ℓ(w, Zi)

Tied to the choice of a loss function ℓ(w, z):

  • the square loss (regression)
  • the 0-1 loss (classification)
  • the cross-entropy loss (NN classification)
    ⊲ a surrogate loss with nice properties
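The definitions above can be sketched in a few lines. This is a toy illustration only: the predictor, data, and helper names are invented for the example.

```python
import math

def zero_one_loss(y_pred, y):
    """0-1 loss: 1 on a misclassification, 0 otherwise."""
    return 0.0 if y_pred == y else 1.0

def cross_entropy_loss(probs, y):
    """Cross-entropy surrogate: negative log-probability of the true label."""
    return -math.log(probs[y])

def empirical_risk(loss, predict, data):
    """L̂_n(w): average loss of the predictor over the n examples."""
    return sum(loss(predict(x), y) for x, y in data) / len(data)

# toy 1-d data with labels in {0, 1}; predictor thresholds at zero
data = [(1.5, 1), (-0.5, 0), (2.0, 1), (-1.0, 1)]
risk = empirical_risk(zero_one_loss, lambda x: int(x > 0), data)  # -> 0.25
```

The last example misclassifies one of four points, giving an empirical 0-1 risk of 0.25.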

SLIDE 12

Empirical Risk Minimization

Training set error:

  L̂trn(w) = (1/n_trn) Σ_{Zi ∈ Dtrn} ℓ(w, Zi)

ERM:

  ŵ ∈ arg min_w L̂trn(w)

Penalized ERM:

  ŵ ∈ arg min_w L̂trn(w) + Reg(w)
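A concrete instance of penalized ERM, kept deliberately tiny: one-dimensional least squares with Reg(w) = μ·w², which has a closed-form minimizer. The function name and data are illustrative, not from the talk.

```python
def ridge_erm_1d(xs, ys, mu):
    """Penalized ERM for 1-d least squares with Reg(w) = mu * w^2:
    minimizes (1/n) * sum_i (w*x_i - y_i)^2 + mu * w^2.
    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + n*mu)."""
    n = len(xs)
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + n * mu
    return num / den

# data lying exactly on y = 2x: plain ERM (mu=0) recovers w = 2,
# while mu > 0 shrinks the solution toward zero
w_hat = ridge_erm_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 0.1)
```

The shrinkage effect of the penalty is exactly the 'regularization' role Reg(w) plays on the slide.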

SLIDE 13

Generalization

If the learned weight ŵ does well on the training set examples... ...will it still do well on unseen examples?

SLIDE 14

PAC Learning

Data set: D = (Z1, . . . , Zn) ∈ Z^n, a finite sequence of input-label examples Zi = (Xi, Yi).

Assumptions:

  • A data-generating distribution P ∈ M1(Z).
  • P is unknown; only the training set is given.
  • The input-label examples are i.i.d. ∼ P.

Population risk (out-of-sample):

  L(w) = E ℓ(w, Z) = ∫_Z ℓ(w, z) dP(z)

SLIDE 15

Certifying performance: test set error

Test set error:

  L̂tst(ŵ) = (1/n_tst) Σ_{Zi ∈ Dtst} ℓ(ŵ, Zi)

⊲ ŵ obtained from the training set
⊲ test set not used for training
⊲ L̂tst(ŵ) serves as an estimate of L(ŵ)
⊲ Note: L(ŵ) remains unknown!

SLIDE 16

Certifying performance: confidence bound

Risk upper bound: for any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w:

  L(w) ≤ L̂n(w) + ε(n, δ)

For ŵ = ALG(train set) this gives: L(ŵ) ≤ L̂tst(ŵ) + ε(n_tst, δ)

Recommended practice:
⊲ report the confidence bound together with your test set error estimate
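The slide leaves ε(n, δ) abstract. One standard instantiation for losses bounded in [0, 1] is the one-sided Hoeffding term ε(n, δ) = √(log(1/δ) / (2n)); the sketch below uses that choice as an assumption, with illustrative numbers.

```python
import math

def hoeffding_eps(n, delta):
    """One standard choice of eps(n, delta) for losses in [0, 1]:
    the one-sided Hoeffding confidence term sqrt(log(1/delta) / (2n))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def risk_certificate(test_error, n_test, delta=0.05):
    """Upper-bounds L(w_hat) by test error + eps(n_test, delta),
    valid with probability >= 1 - delta over the random test set."""
    return test_error + hoeffding_eps(n_test, delta)

# e.g. 2.1% test error on 10000 held-out examples, 95% confidence
cert = risk_certificate(test_error=0.021, n_test=10000, delta=0.05)
```

Note how the certificate tightens as the test set grows: the ε term shrinks like 1/√n.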

SLIDE 17

Self-certified learning?

Risk upper bound: for any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w:

  L(w) ≤ L̂n(w) + ε(n, δ)

Alternative practice: find ŵ by minimizing the risk bound.
⊲ a form of regularized ERM
⊲ the learned ŵ comes with its own risk certificate
⊲ best if the risk bound is non-vacuous, ideally tight!
⊲ may avoid the need for data-splitting
⊲ may lead to self-certified learning!

SLIDE 18

Probabilistic Neural Nets

SLIDE 19

Randomized weights

  • Based on data D, learn a distribution over weights:
      QD ∈ M1(W), QD = ALG(train set).
  • Predictions:
    • draw w ∼ QD and predict with the chosen w;
    • each prediction uses a fresh random draw.

The risk measures L(w) and L̂n(w) are extended to Q by averaging:

  Q[L] ≡ ∫_W L(w) dQ(w) = E_{w∼Q}[L(w)]

  Q[L̂n] ≡ ∫_W L̂n(w) dQ(w) = E_{w∼Q}[L̂n(w)]
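In practice Q[L̂n] is estimated by Monte Carlo: draw several weight vectors from Q and average their empirical risks. A toy sketch for a Gaussian Q over the weights of a linear classifier; the model, data, and function names are invented for illustration.

```python
import random

def linear_predict(w, x):
    """Predict a label in {0, 1} from the sign of <w, x>."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) > 0)

def mc_Q_risk(mu, sigma, data, n_samples=200, seed=0):
    """Monte Carlo estimate of Q[L̂_n] for Q = Gauss(mu, sigma^2 I):
    draw w ~ Q, compute the empirical 0-1 risk for that w, average over draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
        errs = sum(linear_predict(w, x) != y for x, y in data)
        total += errs / len(data)
    return total / n_samples

# linearly separable toy data; a confident mean with small sigma gives low Q-risk
data = [([1.0, 0.2], 1), ([-1.0, 0.1], 0), ([0.8, -0.3], 1), ([-0.9, -0.2], 0)]
q_risk = mc_Q_risk(mu=[2.0, 0.0], sigma=0.1, data=data)
```

Widening σ spreads the weight draws and typically raises the averaged risk, which is exactly the tension the KL term in a PAC-Bayes bound trades off against.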

SLIDE 20

Two usual PAC-Bayes bounds

Fix a ‘prior’ distribution Q0. For any sample size n and any confidence parameter δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n), simultaneously for all ‘posterior’ distributions Q:

  Q[L] ≤ Q[L̂n] + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) )    (PB-classic)

  kl(Q[L̂n] ‖ Q[L]) ≤ (KL(Q‖Q0) + log(2√n/δ)) / n    (PB-kl)
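The two bounds above can be evaluated numerically; PB-kl needs an extra inversion step (find the largest p with kl(Q[L̂n] ‖ p) below the right-hand side), done here by bisection. Function names and the input numbers are illustrative.

```python
import math

def kl_bin(q, p):
    """Binary KL divergence kl(q || p) for q, p in (0, 1), clipped for stability."""
    q = min(max(q, 1e-12), 1 - 1e-12)
    p = min(max(p, 1e-12), 1 - 1e-12)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def complexity(kl_term, n, delta):
    """(KL(Q||Q0) + log(2 sqrt(n)/delta)) / n, shared by both bounds."""
    return (kl_term + math.log(2.0 * math.sqrt(n) / delta)) / n

def pb_classic(emp, kl_term, n, delta):
    """PB-classic: Q[L] <= Q[L_hat] + sqrt(complexity / 2)."""
    return emp + math.sqrt(complexity(kl_term, n, delta) / 2.0)

def pb_kl(emp, kl_term, n, delta, iters=100):
    """PB-kl: invert kl(emp || p) <= complexity by bisection over p in [emp, 1]."""
    rhs = complexity(kl_term, n, delta)
    lo, hi = emp, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bin(emp, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

# illustrative numbers: 2% averaged empirical risk, KL = 50 nats, n = 10000
b1 = pb_classic(0.02, 50.0, 10000, 0.05)
b2 = pb_kl(0.02, 50.0, 10000, 0.05)
```

For small empirical risk the inverted PB-kl bound is noticeably tighter than PB-classic, which is why it is the preferred form for reporting certificates.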

SLIDE 21

Two more PAC-Bayes bounds

Fix a distribution Q0. For any size n and any confidence δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n):

PB-quad: simultaneously for all distributions Q,

  Q[L] ≤ [ √( Q[L̂n] + (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) ]²

PB-lambda: simultaneously for all distributions Q and λ ∈ (0, 2),

  Q[L] ≤ Q[L̂n] / (1 − λ/2) + (KL(Q‖Q0) + log(2√n/δ)) / (nλ(1 − λ/2))
slide-22
SLIDE 22

Cornerstone: change of measure inequality

Motivation Classic weights Random weights Experiments Conclusions

  • O. Rivasplata

Slide 22 / 40

Donsker & Varadhan (1975), Csisz´ ar (1975) KL(QQ0) = sup

f:W→R

  • Q[f] − log Q0[ef]
  • Let f : Zn × W → R.

For a given Q0 : Q[f(D, w)] ≤ KL(QQ0) + log Q0[ef(D,w)].

  • Apply Markov’s inequality to Q0[ef(D,w)].
  • w.p. ≥ 1 − δ over the random draw of D ∼ Pn,

simultaneously for all distributions Q : Q[f(D, w)] ≤ KL(QQ0) + log Pn[Q0[ef(D,w)]] + log(1/δ).

  • Use with suitable f,

upper-bound the exponential moment Pn[Q0[ef(D,w)]].
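As a worked instance of the recipe above, here is a sketch of how PB-classic follows from it, assuming losses in [0, 1] and the standard choice of f (this reconstruction follows the usual Langford–Seeger/Maurer route and is not spelled out on the slide):

```latex
% Choice of f: the scaled squared gap between population and empirical risk.
f(D, w) \;=\; 2n\,\bigl(L(w) - \hat L_n(w, D)\bigr)^2

% Jensen's inequality (the square is convex), taken under Q:
Q[f(D, w)] \;\ge\; 2n\,\bigl(Q[L] - Q[\hat L_n]\bigr)^2

% Exponential moment: by Pinsker, f \le n\,\mathrm{kl}(\hat L_n \| L),
% and Maurer's lemma bounds E\,e^{n\,\mathrm{kl}(\hat L_n \| L)} \le 2\sqrt{n}, so
P^n\bigl[Q_0[e^{f(D,w)}]\bigr] \;\le\; 2\sqrt{n}

% Plugging into the change-of-measure + Markov step, w.p. \ge 1 - \delta:
2n\,\bigl(Q[L] - Q[\hat L_n]\bigr)^2 \;\le\; \mathrm{KL}(Q\|Q_0) + \log\tfrac{2\sqrt{n}}{\delta}

% Rearranging yields PB-classic:
Q[L] \;\le\; Q[\hat L_n] + \sqrt{\frac{\mathrm{KL}(Q\|Q_0) + \log\frac{2\sqrt{n}}{\delta}}{2n}}
```

The same template with f = n·kl(L̂n ‖ L) (no Pinsker step) yields PB-kl directly.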

SLIDE 23

Using a PAC-Bayes bound

  • Use your favourite ALG to find QD = ALG(train set), and plug QD into the PAC-Bayes bound to certify its risk:

      QD[L] ≤ QD[L̂n] + √( (KL(QD‖Q0) + log(2√n/δ)) / (2n) )

  • Use the PAC-Bayes bound itself as a training objective:

      QD ∈ arg min_Q { Q[L̂n] + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) }

Note: both uses are illustrated here with PB-classic, but the same can be done with PB-quad or PB-lambda (or any other).

SLIDE 24

Training objectives

With Q[L̂ce_n] the empirical risk under the cross-entropy loss:

  fclassic(Q) = Q[L̂ce_n] + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) )

  fquad(Q) = [ √( Q[L̂ce_n] + (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) + √( (KL(Q‖Q0) + log(2√n/δ)) / (2n) ) ]²

  flambda(Q, λ) = Q[L̂ce_n] / (1 − λ/2) + (KL(Q‖Q0) + log(2√n/δ)) / (nλ(1 − λ/2))
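To make the three objectives concrete, here is a sketch that evaluates each for a fixed posterior (i.e. fixed empirical cross-entropy term and KL). The helper names and the input numbers are illustrative, not the talk's actual training code.

```python
import math

def _complexity(kl, n, delta):
    """(KL(Q||Q0) + log(2 sqrt(n)/delta)) / n, shared by all three objectives."""
    return (kl + math.log(2.0 * math.sqrt(n) / delta)) / n

def f_classic(emp_ce, kl, n, delta):
    return emp_ce + math.sqrt(_complexity(kl, n, delta) / 2.0)

def f_quad(emp_ce, kl, n, delta):
    c = _complexity(kl, n, delta) / 2.0
    return (math.sqrt(emp_ce + c) + math.sqrt(c)) ** 2

def f_lambda(emp_ce, kl, n, delta, lam):
    return emp_ce / (1 - lam / 2) + _complexity(kl, n, delta) / (lam * (1 - lam / 2))

# illustrative: 0.05 empirical cross-entropy, KL = 100 nats, n = 60000, delta = 0.025
vals = {
    "classic": f_classic(0.05, 100.0, 60000, 0.025),
    "quad": f_quad(0.05, 100.0, 60000, 0.025),
    "lambda": min(f_lambda(0.05, 100.0, 60000, 0.025, l / 100) for l in range(1, 200)),
}
```

When the empirical term is small, fquad evaluates below fclassic, which is consistent with the tighter certificates reported for fquad in the experiments.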

SLIDE 25

Experiments

SLIDE 26

PAC-Bayes with Backprop

SLIDE 27

Prior mean at the random initialization

  • (PAC-Bayes) prior Q0 = Gauss(w0, Σ0)
      Σ0 = λ0 I (λ0 is a hyperparameter)
      w0 = randomly initialized weights
  • (PAC-Bayes) posterior QD = Gauss(w, Σ)
      w, Σ learned by PAC-Bayes with Backprop

Experiments (ours) on MNIST (RUB = risk upper bound value):

  fquad:                        Test acc. = 86.36   Test error = 0.1364   RUB value = 0.24107
  fclassic (cf. D & R (2017)):  Test acc. = 84.22   Test error = 0.1578   RUB value = 0.24375

SLIDE 28

Prior mean learned from data

  • (PAC-Bayes) prior Q0 = Gauss(w0, Σ0)
      Σ0 = λ0 I (λ0 is a hyperparameter)
      w0 = ERM on a split of the data
  • (PAC-Bayes) posterior QD = Gauss(w, Σ)
      w, Σ learned by PAC-Bayes with Backprop

Experiments (ours) on MNIST:

  fquad:                        Test acc. = 97.89   Test error = 0.0211   RUB value = 0.04588
  fclassic (cf. D & R (2018)):  Test acc. = 97.21   Test error = 0.0279   RUB value = 0.06029

SLIDE 29

Closing remarks

SLIDE 30

Bayesian Learning

Posterior QD with density qD(w); prior Q0 with density q0(w):

  qD(w) = L(D|w) q0(w) / C

  • Bayes rule: update the prior to form the posterior
    ⊲ likelihood factor L(D|w)
  • principled approach, e.g. MAP learning
  • derive learning algorithms
    ⊲ balance ‘fit to data’ and ‘fit to prior’

SLIDE 31

Generalized Bayes

A bit more general: “temperature” λ > 0

  qD(w) = L(D|w)^λ q0(w) / C

Even more general: data-dependent factor F

  qD(w) = F(D, w) q0(w)

  • P.G. Bissiri, C.C. Holmes, S.G. Walker (2016): A general framework for updating belief distributions

SLIDE 32

PAC-Bayes

  qD(w) = ( no update factor ) q0(w)

  • more general than generalized Bayes
  • increased flexibility in the choice of distributions
  • balance qD[L̂n] and KL(qD‖q0)
    ⊲ ‘fit to data’ versus ‘fit to prior’

SLIDE 33

Future

⊲ choice of distributions
⊲ understand properties
⊲ scaling to larger problems?
⊲ architecture vs. PAC-Bayes bounds?
⊲ problem-specific PAC-Bayes bounds?

SLIDE 34

Thank you!

SLIDE 35

Wait...

SLIDE 36

some PAC-Bayes history

  • J. Shawe-Taylor & R.C. Williamson (1997): A PAC analysis of a Bayesian estimator
  • D.A. McAllester (1998): Some PAC-Bayesian Theorems
  • D.A. McAllester (1999): PAC-Bayesian Model Averaging
  • J. Langford & M. Seeger (2001): Bounds for Averaging Classifiers
  • J. Langford & R. Caruana (2002): (Not) Bounding the True Error
  • M. Seeger (2002): PAC-Bayesian generalization bounds for Gaussian processes

SLIDE 37

some more PAC-Bayes history

  • J. Langford & J. Shawe-Taylor (2002): PAC-Bayes & Margins
  • D.A. McAllester (2003): Simplified PAC-Bayesian Margin Bounds
  • A. Maurer (2004): A note on the PAC Bayesian theorem
  • J.-Y. Audibert (2004): A better variance control for PAC-Bayesian classification
  • O. Catoni (2007): PAC-Bayesian supervised classification: The thermodynamics of statistical learning
  • P. Germain, A. Lacasse, F. Laviolette, M. Marchand (2009): PAC-Bayesian learning of linear classifiers

SLIDE 38

some recent PAC-Bayes

  • J. Keshet, D.A. McAllester, T. Hazan (2011): PAC-Bayesian approach for minimization of phoneme error rate
  • A. Noy & K. Crammer (2014): Robust forward algorithms via PAC-Bayes and Laplace distributions
  • P. Germain, F. Bach, A. Lacoste, S. Lacoste-Julien (2016): PAC-Bayesian theory meets Bayesian inference
  • N. Thiemann, C. Igel, O. Wintenberger, Y. Seldin (2017): A Strongly Quasiconvex PAC-Bayesian Bound
  • G.K. Dziugaite & D. Roy (2017): Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data
  • G.K. Dziugaite & D. Roy (2018): Data-dependent PAC-Bayes priors via differential privacy

SLIDE 39

more recent PAC-Bayes

  • O. Rivasplata, E. Parrado-Hernández, J. Shawe-Taylor, S. Sun, Cs. Szepesvári (2018): PAC-Bayes bounds for stable algorithms with instance-dependent priors
  • P. Alquier & B. Guedj (2018): Simpler PAC-Bayesian bounds for hostile data
  • S.S. Lorenzen, C. Igel, Y. Seldin (2019): On PAC-Bayesian Bounds for Random Forests
  • G. Letarte, P. Germain, B. Guedj, F. Laviolette (2019): Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks
  • O. Rivasplata, V.M. Tankasali, Cs. Szepesvári (2019): PAC-Bayes with Backprop (in arXiv)

SLIDE 40

Thank you again!