

SLIDE 1

A Primer on PAC-Bayesian Learning

Benjamin Guedj and John Shawe-Taylor. ICML 2019, June 10, 2019.

SLIDE 2

What to expect

We will...
  • Provide an overview of what PAC-Bayes is
  • Illustrate its flexibility and relevance to tackle modern machine learning tasks, and rethink generalization
  • Cover the main existing results and key ideas, and briefly sketch some proofs

We won't...
  • Cover all of Statistical Learning Theory: see the NeurIPS 2018 tutorial "Statistical Learning Theory: A Hitchhiker's Guide" (Shawe-Taylor and Rivasplata)
  • Provide an encyclopaedic coverage of the PAC-Bayes literature (apologies!)

SLIDE 3

In a nutshell

PAC-Bayes is a generic framework to efficiently rethink generalization for numerous machine learning algorithms. It leverages the flexibility of Bayesian learning and makes it possible to derive new learning algorithms.

SLIDE 4

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 5

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 6

Error distribution

SLIDE 7

Learning is to be able to generalize

[Figure from Wikipedia]

From examples, what can a system learn about the underlying phenomenon?

Memorizing the already-seen data is usually bad → overfitting

Generalization is the ability to 'perform' well on unseen data.

SLIDE 8

Statistical Learning Theory is about high confidence

For a fixed algorithm, function class and sample size, generating random samples → distribution of test errors

Focusing on the mean of the error distribution?
⊲ can be misleading: the learner only has one sample

Statistical Learning Theory: the tail of the distribution
⊲ finding bounds which hold with high probability over random samples of size m

Compare to a statistical test at the 99% confidence level:
⊲ chances of the conclusion not being true are less than 1%

PAC: probably approximately correct (Valiant, 1984).
Use a 'confidence parameter' δ: $P^m[\text{large error}] \le \delta$;
δ is the probability of being misled by the training set.

Hence high confidence: $P^m[\text{approximately correct}] \ge 1 - \delta$

SLIDE 9

Mathematical formalization

Learning algorithm $A : Z^m \to H$
  • $Z = X \times Y$, where X = set of inputs and Y = set of outputs (e.g. labels)
  • H = hypothesis class = set of predictors (e.g. classifiers), functions $X \to Y$

Training set (aka sample): $S_m = ((X_1, Y_1), \ldots, (X_m, Y_m))$, a finite sequence of input-output examples.

Classical assumptions:
  • A data-generating distribution P over Z.
  • The learner doesn't know P, and only sees the training set.
  • The training set examples are i.i.d. from P: $S_m \sim P^m$.
⊲ these can be relaxed (mostly beyond the scope of this tutorial)

SLIDE 10

What to achieve from the sample?

Use the available sample to:
1  learn a predictor
2  certify the predictor's performance

Learning a predictor:
  • algorithm driven by some learning principle
  • informed by prior knowledge, resulting in inductive bias

Certifying performance:
  • what happens beyond the training set
  • generalization bounds

Actually these two goals interact with each other!

SLIDE 11

Risk (aka error) measures

A loss function $\ell(h(X), Y)$ is used to measure the discrepancy between a predicted output $h(X)$ and the true output $Y$.

Empirical risk (in-sample): $R_{\mathrm{in}}(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h(X_i), Y_i)$

Theoretical risk (out-of-sample): $R_{\mathrm{out}}(h) = \mathbb{E}\,[\ell(h(X), Y)]$

Examples:
  • $\ell(h(X), Y) = \mathbb{1}[h(X) \neq Y]$ : 0-1 loss (classification)
  • $\ell(h(X), Y) = (Y - h(X))^2$ : square loss (regression)
  • $\ell(h(X), Y) = (1 - Y\,h(X))_+$ : hinge loss
  • $\ell(h(X), 1) = -\log(h(X))$ : log loss (density estimation)
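To make the definitions concrete, here is a minimal Python sketch of these losses and the empirical risk (the function names are ours, not from the tutorial):

```python
import numpy as np

def zero_one_loss(h_x, y):
    # 0-1 loss: 1 if the prediction disagrees with the label
    return float(h_x != y)

def square_loss(h_x, y):
    # square loss for regression
    return (y - h_x) ** 2

def hinge_loss(h_x, y):
    # hinge loss, with labels y in {-1, +1}
    return max(0.0, 1.0 - y * h_x)

def empirical_risk(loss, h, X, Y):
    # R_in(h): average loss over the m training examples
    return float(np.mean([loss(h(x), y) for x, y in zip(X, Y)]))
```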

SLIDE 12

Generalization

If predictor h does well on the in-sample (X, Y) pairs... will it still do well on out-of-sample pairs?

Generalization gap: $\Delta(h) = R_{\mathrm{out}}(h) - R_{\mathrm{in}}(h)$

Upper bounds: w.h.p. $\Delta(h) \le \epsilon(m, \delta)$, i.e. $R_{\mathrm{out}}(h) \le R_{\mathrm{in}}(h) + \epsilon(m, \delta)$

Lower bounds: w.h.p. $\Delta(h) \ge \tilde\epsilon(m, \delta)$

Flavours: distribution-free or distribution-dependent; algorithm-free or algorithm-dependent.

SLIDE 13

Why you should care about generalization bounds

Generalization bounds are a safety check: they give a theoretical guarantee on the performance of a learning algorithm on any unseen data.

Generalization bounds:
  • may be computed with the training sample only, and do not depend on any test sample;
  • provide a computable control on the error on any unseen data, with prespecified confidence;
  • explain why specific learning algorithms actually work;
  • and even lead to designing new algorithms which scale to more complex settings.

SLIDE 14

Building block: one single hypothesis

For one fixed (non data-dependent) h:

$\mathbb{E}[R_{\mathrm{in}}(h)] = \mathbb{E}\left[\frac{1}{m}\sum_{i=1}^m \ell(h(X_i), Y_i)\right] = R_{\mathrm{out}}(h)$

◮ $P^m[\Delta(h) > \epsilon] = P^m\big[\mathbb{E}[R_{\mathrm{in}}(h)] - R_{\mathrm{in}}(h) > \epsilon\big]$: a deviation inequality
◮ the $\ell(h(X_i), Y_i)$ are independent r.v.'s
◮ if $0 \le \ell(h(X), Y) \le 1$, using Hoeffding's inequality: $P^m[\Delta(h) > \epsilon] \le \exp(-2m\epsilon^2) = \delta$
◮ given $\delta \in (0, 1)$, equate the RHS to δ and solve the equation for ε: $P^m\left[\Delta(h) > \sqrt{\tfrac{1}{2m}\log\tfrac{1}{\delta}}\right] \le \delta$
◮ with probability $1 - \delta$,

$R_{\mathrm{out}}(h) \le R_{\mathrm{in}}(h) + \sqrt{\tfrac{1}{2m}\log\tfrac{1}{\delta}}$
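A one-line Python transcription of the resulting deviation term (the helper name is ours):

```python
import numpy as np

def hoeffding_epsilon(m, delta):
    # deviation eps such that P^m[Delta(h) > eps] <= delta,
    # for a single fixed h and a [0,1]-bounded loss
    return np.sqrt(np.log(1.0 / delta) / (2.0 * m))

print(hoeffding_epsilon(10_000, 0.01))  # ~0.0152
```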
SLIDE 15

Finite function class

Algorithm A : Zm → H Function class H with |H| < ∞ Aim for a uniform bound:

Pm ∀f ∈ H, ∆(f) ǫ

  • 1 − δ

Basic tool:

Pm(E1 or E2 or · · · ) Pm(E1) + Pm(E2) + · · ·

known as the union bound (aka countable sub-additivity)

Pm ∃f ∈ H, ∆(f) > ǫ

  • f∈H Pm

∆(f) > ǫ

  • |H| exp
  • −2mǫ2

= δ

w.p. 1 − δ,

∀h ∈ H,

Rout(h) Rin(h) +

  • 1

2m log

  • |H|

δ

  • This is a worst-case approach, as it considers uniformly all hypotheses.
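The same sketch as before extends to the finite-class case; the union bound simply adds a log |H| term (again, a hypothetical helper):

```python
import numpy as np

def finite_class_epsilon(m, delta, class_size):
    # uniform deviation over a finite class via the union bound:
    # the only change from the single-hypothesis case is log|H|
    return np.sqrt(np.log(class_size / delta) / (2.0 * m))

print(finite_class_epsilon(10_000, 0.01, class_size=1000))  # ~0.024
```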

SLIDE 16

Towards non-uniform learnability

A route to improve this is to consider data-dependent hypotheses $h_i$, associated with a prior distribution $P = (p_i)_i$ (structural risk minimization): w.p. $1 - \delta$,

$\forall h_i \in H:\quad R_{\mathrm{out}}(h_i) \le R_{\mathrm{in}}(h_i) + \sqrt{\tfrac{1}{2m}\log\tfrac{1}{p_i\delta}}$

Note that we can also write: w.p. $1 - \delta$,

$\forall h_i \in H:\quad R_{\mathrm{out}}(h_i) \le R_{\mathrm{in}}(h_i) + \sqrt{\tfrac{1}{2m}\left(\mathrm{KL}(\mathrm{Dirac}(h_i)\|P) + \log\tfrac{1}{\delta}\right)}$

This is a first attempt to introduce hypothesis-dependence (i.e. the complexity depends on the chosen function). It leads to a bound-minimizing algorithm:

return $\arg\min_{h_i \in H}\left\{ R_{\mathrm{in}}(h_i) + \sqrt{\tfrac{1}{2m}\log\tfrac{1}{p_i\delta}} \right\}$
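A minimal sketch of this bound-minimizing selection rule over a countable class (our naming; empirical risks and prior weights are assumed precomputed):

```python
import numpy as np

def srm_select(emp_risks, priors, m, delta):
    # argmin_i  R_in(h_i) + sqrt(log(1/(p_i * delta)) / (2m))
    emp_risks, priors = np.asarray(emp_risks), np.asarray(priors)
    scores = emp_risks + np.sqrt(np.log(1.0 / (priors * delta)) / (2.0 * m))
    return int(np.argmin(scores))

# hypothetical example: a slightly worse empirical risk can win
# if its prior weight is much larger
print(srm_select([0.10, 0.11], [1e-6, 0.5], m=1000, delta=0.05))  # -> 1
```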
SLIDE 17

Uncountably infinite function class?

Algorithm $A : Z^m \to H$, function class H with $|H| \ge |\mathbb{N}|$.

Vapnik–Chervonenkis dimension: for H with $d = \mathrm{VC}(H)$ finite, for any m, for any $\delta \in (0, 1)$, w.p. $1 - \delta$,

$\forall h \in H:\quad \Delta(h) \le \sqrt{\frac{8d}{m}\log\frac{2em}{d} + \frac{8}{m}\log\frac{4}{\delta}}$

The bound holds for all functions in the class (uniform over H) and for all distributions (uniform over P).

Rademacher complexity: measures how well a function can align with randomly perturbed labels; it can be used to take advantage of margin assumptions.

These approaches are suited to analyse the performance of individual functions, and take some account of correlations.

→ Extension: PAC-Bayes makes it possible to consider distributions over hypotheses.

SLIDE 18

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 19

The PAC-Bayes framework

Before data, fix a distribution $P \in M_1(H)$ ⊲ 'prior'
Based on data, learn a distribution $Q \in M_1(H)$ ⊲ 'posterior'

Predictions:
  • draw $h \sim Q$ and predict with the chosen h;
  • each prediction uses a fresh random draw.

The risk measures $R_{\mathrm{in}}(h)$ and $R_{\mathrm{out}}(h)$ are extended by averaging:

$R_{\mathrm{in}}(Q) \equiv \int_H R_{\mathrm{in}}(h)\, dQ(h), \qquad R_{\mathrm{out}}(Q) \equiv \int_H R_{\mathrm{out}}(h)\, dQ(h)$

$\mathrm{KL}(Q\|P) = \mathbb{E}_{h\sim Q} \ln\frac{Q(h)}{P(h)}$ is the Kullback–Leibler divergence.

Recall the bound for data-dependent hypotheses $h_i$ associated with prior weights $p_i$: w.p. $1 - \delta$,

$\forall h_i \in H:\quad R_{\mathrm{out}}(h_i) \le R_{\mathrm{in}}(h_i) + \sqrt{\tfrac{1}{2m}\left(\mathrm{KL}(\mathrm{Dirac}(h_i)\|P) + \log\tfrac{1}{\delta}\right)}$
SLIDE 20

PAC-Bayes aka Generalized Bayes

"Prior": exploration mechanism of H. "Posterior": the twisted prior after confronting it with data.

SLIDE 21

PAC-Bayes bounds vs. Bayesian learning

Prior
  • PAC-Bayes bounds: the bounds hold even if the prior is incorrect
  • Bayesian: inference must assume the prior is correct

Posterior
  • PAC-Bayes bounds: the bound holds for all posteriors
  • Bayesian: the posterior is computed by Bayesian inference, and depends on the statistical modeling

Data distribution
  • PAC-Bayes bounds: can be used to define the prior, hence the data distribution need not be known explicitly
  • Bayesian: the input is effectively excluded from the analysis; randomness lies in the noise model generating the output

SLIDE 22

A history of PAC-Bayes

Pre-history: PAC analysis of Bayesian estimators

Shawe-Taylor and Williamson (1997); Shawe-Taylor et al. (1998)

Birth: PAC-Bayesian bound

McAllester (1998, 1999)

McAllester Bound

For any prior P and any $\delta \in (0, 1]$, we have

$P^m\left[\forall Q \text{ on } H:\ R_{\mathrm{out}}(Q) \le R_{\mathrm{in}}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt m}{\delta}}{2m}}\right] \ge 1 - \delta.$
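A direct Python transcription of the right-hand side (hypothetical helper; $R_{\mathrm{in}}(Q)$ and $\mathrm{KL}(Q\|P)$ are assumed computed elsewhere):

```python
import numpy as np

def mcallester_bound(r_in, kl, m, delta):
    # right-hand side of the McAllester bound on R_out(Q)
    return r_in + np.sqrt((kl + np.log(2.0 * np.sqrt(m) / delta)) / (2.0 * m))

print(mcallester_bound(r_in=0.05, kl=5.0, m=10_000, delta=0.05))  # ~0.0758
```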

SLIDE 23

A history of PAC-Bayes

Introduction of the kl form

Langford and Seeger (2001); Seeger (2002, 2003); Langford (2005)

Langford and Seeger Bound

For any prior P and any $\delta \in (0, 1]$, we have

$P^m\left[\forall Q \text{ on } H:\ \mathrm{kl}\big(R_{\mathrm{in}}(Q)\|R_{\mathrm{out}}(Q)\big) \le \frac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt m}{\delta}}{m}\right] \ge 1 - \delta,$

where $\mathrm{kl}(q\|p) \overset{\mathrm{def}}{=} q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p} \ge 2(q-p)^2.$
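The kl form is tighter than the McAllester form but only implicit in $R_{\mathrm{out}}(Q)$; in practice one inverts it numerically. A minimal sketch using bisection (a standard trick, not code from the tutorial):

```python
import numpy as np

def binary_kl(q, p):
    # kl(q||p) for Bernoulli parameters, clamped away from 0 and 1
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))

def kl_inverse_upper(q, bound, tol=1e-9):
    # largest p >= q with kl(q||p) <= bound, found by bisection; this
    # turns the Langford-Seeger bound into an explicit bound on R_out(Q)
    lo, hi = q, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

rhs = (5.0 + np.log(2 * np.sqrt(10_000) / 0.05)) / 10_000
print(kl_inverse_upper(0.05, rhs))  # ~0.062, tighter than the sqrt form (~0.0758)
```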

SLIDE 24

A General PAC-Bayesian Theorem

∆-function: a "distance" between $R_{\mathrm{in}}(Q)$ and $R_{\mathrm{out}}(Q)$; a convex function $\Delta : [0,1] \times [0,1] \to \mathbb{R}$.

General theorem (Bégin et al., 2014, 2016; Germain, 2015). For any prior P on H, for any $\delta \in (0,1]$, and for any ∆-function, we have, with probability at least $1-\delta$ over the choice of $S_m \sim P^m$,

$\forall Q \text{ on } H:\quad \Delta\big(R_{\mathrm{in}}(Q), R_{\mathrm{out}}(Q)\big) \le \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{I_\Delta(m)}{\delta}\right],$

where

$I_\Delta(m) = \sup_{r\in[0,1]} \sum_{k=0}^{m} \underbrace{\binom{m}{k} r^k (1-r)^{m-k}}_{\mathrm{Bin}(k;\, m,\, r)}\, e^{m\Delta(\frac{k}{m}, r)}.$

SLIDE 25

General theorem

$P^m\left[\forall Q \text{ on } H:\ \Delta\big(R_{\mathrm{in}}(Q), R_{\mathrm{out}}(Q)\big) \le \frac{1}{m}\left(\mathrm{KL}(Q\|P) + \ln\frac{I_\Delta(m)}{\delta}\right)\right] \ge 1-\delta.$

Proof ideas.

Change of measure inequality (Csiszár, 1975; Donsker and Varadhan, 1975): for any P and Q on H, and for any measurable function $\phi : H \to \mathbb{R}$, we have

$\mathbb{E}_{h\sim Q}\, \phi(h) \le \mathrm{KL}(Q\|P) + \ln \mathbb{E}_{h\sim P}\, e^{\phi(h)}.$

Markov's inequality: $P(X \ge a) \le \frac{\mathbb{E} X}{a} \iff P\left(X \le \frac{\mathbb{E} X}{\delta}\right) \ge 1-\delta.$

Probability of observing k misclassifications among m examples: given a voter h, consider a binomial variable of m trials with success probability $R_{\mathrm{out}}(h)$:

$P^m\left[R_{\mathrm{in}}(h) = \frac{k}{m}\right] = \binom{m}{k}\, R_{\mathrm{out}}(h)^k \big(1 - R_{\mathrm{out}}(h)\big)^{m-k} = \mathrm{Bin}\big(k;\, m,\, R_{\mathrm{out}}(h)\big).$
SLIDE 26

$P^m\left[\forall Q \text{ on } H:\ \Delta\big(R_{\mathrm{in}}(Q), R_{\mathrm{out}}(Q)\big) \le \frac{1}{m}\left(\mathrm{KL}(Q\|P) + \ln\frac{I_\Delta(m)}{\delta}\right)\right] \ge 1-\delta.$

Proof.

$m \cdot \Delta\big(\mathbb{E}_{h\sim Q} R_{\mathrm{in}}(h),\ \mathbb{E}_{h\sim Q} R_{\mathrm{out}}(h)\big)$
  $\le \mathbb{E}_{h\sim Q}\, m \cdot \Delta\big(R_{\mathrm{in}}(h), R_{\mathrm{out}}(h)\big)$   [Jensen's inequality]
  $\le \mathrm{KL}(Q\|P) + \ln \mathbb{E}_{h\sim P}\, e^{m\Delta(R_{\mathrm{in}}(h),\, R_{\mathrm{out}}(h))}$   [change of measure]
  $\le \mathrm{KL}(Q\|P) + \ln \frac{1}{\delta}\, \mathbb{E}_{S'_m\sim P^m} \mathbb{E}_{h\sim P}\, e^{m\cdot\Delta(R_{\mathrm{in}}(h),\, R_{\mathrm{out}}(h))}$   [Markov's inequality, w.p. $1-\delta$]
  $= \mathrm{KL}(Q\|P) + \ln \frac{1}{\delta}\, \mathbb{E}_{h\sim P} \mathbb{E}_{S'_m\sim P^m}\, e^{m\cdot\Delta(R_{\mathrm{in}}(h),\, R_{\mathrm{out}}(h))}$   [expectation swap]
  $= \mathrm{KL}(Q\|P) + \ln \frac{1}{\delta}\, \mathbb{E}_{h\sim P} \sum_{k=0}^m \mathrm{Bin}\big(k;\, m,\, R_{\mathrm{out}}(h)\big)\, e^{m\cdot\Delta(\frac{k}{m},\, R_{\mathrm{out}}(h))}$   [binomial law]
  $\le \mathrm{KL}(Q\|P) + \ln \frac{1}{\delta} \sup_{r\in[0,1]} \sum_{k=0}^m \mathrm{Bin}(k;\, m,\, r)\, e^{m\Delta(\frac{k}{m},\, r)}$   [supremum over the risk]
  $= \mathrm{KL}(Q\|P) + \ln \frac{1}{\delta}\, I_\Delta(m).$

SLIDE 27

General theorem

$P^m\left[\forall Q \text{ on } H:\ \Delta\big(R_{\mathrm{in}}(Q), R_{\mathrm{out}}(Q)\big) \le \frac{1}{m}\left(\mathrm{KL}(Q\|P) + \ln\frac{I_\Delta(m)}{\delta}\right)\right] \ge 1-\delta.$

Corollary. With probability at least $1-\delta$ over the choice of $S_m \sim P^m$, for all Q on H:

(a) $\mathrm{kl}\big(R_{\mathrm{in}}(Q)\|R_{\mathrm{out}}(Q)\big) \le \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt m}{\delta}\right]$   (Langford and Seeger, 2001)

(b) $R_{\mathrm{out}}(Q) \le R_{\mathrm{in}}(Q) + \sqrt{\frac{1}{2m}\left[\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt m}{\delta}\right]}$   (McAllester, 1999, 2003a)

(c) $R_{\mathrm{out}}(Q) \le \frac{1}{1-e^{-c}}\left[c \cdot R_{\mathrm{in}}(Q) + \frac{1}{m}\left(\mathrm{KL}(Q\|P) + \ln\frac{1}{\delta}\right)\right]$   (Catoni, 2007)

(d) $R_{\mathrm{out}}(Q) \le R_{\mathrm{in}}(Q) + \frac{1}{\lambda}\left[\mathrm{KL}(Q\|P) + \ln\frac{1}{\delta} + f(\lambda, m)\right]$   (Alquier et al., 2016)

with the corresponding ∆-functions:

$\mathrm{kl}(q, p) \overset{\mathrm{def}}{=} q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p} \ge 2(q-p)^2,$
$\Delta_c(q, p) \overset{\mathrm{def}}{=} -\ln\big[1 - (1-e^{-c})\cdot p\big] - c \cdot q,$
$\Delta_\lambda(q, p) \overset{\mathrm{def}}{=} \frac{\lambda}{m}(p - q).$

SLIDE 28

Recap

What we've seen so far:
  • Statistical learning theory is about high-confidence control of generalization
  • PAC-Bayes is a generic, powerful tool to derive generalization bounds

What is coming next:
  • PAC-Bayes applied to large classes of algorithms
  • PAC-Bayesian-inspired algorithms
  • Case studies

SLIDE 29

A flexible framework

Since 1997, PAC-Bayes has been successfully used in many machine learning settings.

Statistical learning theory: Shawe-Taylor and Williamson (1997); McAllester (1998, 1999, 2003a,b); Seeger (2002, 2003); Maurer (2004); Catoni (2004, 2007); Audibert and Bousquet (2007); Thiemann et al. (2017)

SVMs & linear classifiers: Langford and Shawe-Taylor (2002); McAllester (2003a); Germain et al. (2009a)

Supervised learning algorithms reinterpreted as bound minimizers: Ambroladze et al. (2007); Shawe-Taylor and Hardoon (2009); Germain et al. (2009b)

High-dimensional regression: Alquier and Lounici (2011); Alquier and Biau (2013); Guedj and Alquier (2013); Li et al. (2013); Guedj and Robbiano (2018)

Classification: Langford and Shawe-Taylor (2002); Catoni (2004, 2007); Lacasse et al. (2007); Parrado-Hernández et al. (2012)

SLIDE 30

A flexible framework

Transductive learning, domain adaptation: Derbeko et al. (2004); Bégin et al. (2014); Germain et al. (2016)

Non-iid or heavy-tailed data: Lever et al. (2010); Seldin et al. (2011, 2012); Alquier and Guedj (2018)

Density estimation: Seldin and Tishby (2010); Higgs and Shawe-Taylor (2010)

Reinforcement learning: Fard and Pineau (2010); Fard et al. (2011); Seldin et al. (2011, 2012); Ghavamzadeh et al. (2015)

Sequential learning: Gerchinovitz (2011); Li et al. (2018)

Algorithmic stability, differential privacy: London et al. (2014); London (2017); Dziugaite and Roy (2018a,b); Rivasplata et al. (2018)

Deep neural networks: Dziugaite and Roy (2017); Neyshabur et al. (2017)

SLIDE 31

PAC-Bayes-inspired learning algorithms

In all the previous bounds, with an arbitrarily high probability and for any posterior distribution Q:

Error on unseen data ≤ Error on sample + complexity term

$R_{\mathrm{out}}(Q) \le R_{\mathrm{in}}(Q) + F(Q, \cdot)$

This defines a principled strategy to obtain new learning algorithms:

$h \sim Q^\star, \qquad Q^\star \in \arg\inf_{Q \ll P}\big\{R_{\mathrm{in}}(Q) + F(Q, \cdot)\big\}$

(an optimization problem which can be solved or approximated by [stochastic] gradient descent-flavoured methods, Markov chain Monte Carlo, variational Bayes...)

SLIDE 32

PAC-Bayes interpretation of celebrated algorithms

SVM with a sigmoid loss and KL-regularized AdaBoost have been reinterpreted as minimizers of PAC-Bayesian bounds.

Ambroladze et al. (2007), Shawe-Taylor and Hardoon (2009), Germain et al. (2009b)

For any $\lambda > 0$, the minimizer of $R_{\mathrm{in}}(Q) + \frac{\mathrm{KL}(Q, P)}{\lambda}$ is the celebrated Gibbs posterior

$Q_\lambda(h) \propto \exp\big(-\lambda R_{\mathrm{in}}(h)\big)\, P(h), \qquad \forall h \in H.$

Extreme cases: $\lambda \to 0$ (flat posterior, i.e. the prior) and $\lambda \to \infty$ (Dirac mass on empirical risk minimizers). Note: this is a continuous version of the exponentially weighted aggregate (EWA).
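On a finite hypothesis class the Gibbs posterior can be computed exactly. A minimal sketch (our naming, with a max-shift for numerical stability):

```python
import numpy as np

def gibbs_posterior(prior, emp_risks, lam):
    # Q_lambda(h) proportional to exp(-lambda * R_in(h)) * P(h),
    # normalized over a finite class
    log_w = np.log(prior) - lam * np.asarray(emp_risks)
    log_w -= log_w.max()          # shift before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

prior = np.full(4, 0.25)          # uniform prior over 4 hypotheses
print(gibbs_posterior(prior, [0.4, 0.1, 0.3, 0.2], lam=10.0))
```

Increasing `lam` concentrates the posterior on the empirical risk minimizer, matching the extreme cases above.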

SLIDE 33

Variational definition of KL-divergence (Csiszár, 1975; Donsker and Varadhan, 1975; Catoni, 2004). Let $(A, \mathcal{A})$ be a measurable space.

(i) For any probability P on $(A, \mathcal{A})$ and any measurable function $\phi : A \to \mathbb{R}$ such that $\int (\exp \circ\, \phi)\, dP < \infty$,

$\log \int (\exp \circ\, \phi)\, dP = \sup_{Q \ll P}\left\{\int \phi\, dQ - \mathrm{KL}(Q, P)\right\}.$

(ii) If φ is upper-bounded on the support of P, the supremum is reached for the Gibbs distribution G given by

$\frac{dG}{dP}(a) = \frac{\exp \circ\, \phi(a)}{\int (\exp \circ\, \phi)\, dP}, \qquad a \in A.$
SLIDE 34

$\log \int (\exp \circ\, \phi)\, dP = \sup_{Q \ll P}\left\{\int \phi\, dQ - \mathrm{KL}(Q, P)\right\}, \qquad \frac{dG}{dP} = \frac{\exp \circ\, \phi}{\int (\exp \circ\, \phi)\, dP}.$

Proof: let $Q \ll P$.

$-\mathrm{KL}(Q, G) = -\int \log\left(\frac{dQ}{dP}\frac{dP}{dG}\right) dQ$
$= -\int \log\frac{dQ}{dP}\, dQ + \int \log\frac{dG}{dP}\, dQ$
$= -\mathrm{KL}(Q, P) + \int \phi\, dQ - \log\int(\exp \circ\, \phi)\, dP.$

Since $\mathrm{KL}(\cdot, \cdot)$ is non-negative, $Q \mapsto -\mathrm{KL}(Q, G)$ reaches its maximum at $Q = G$:

$0 = \sup_{Q \ll P}\left\{\int \phi\, dQ - \mathrm{KL}(Q, P)\right\} - \log\int(\exp \circ\, \phi)\, dP.$

Take $\phi = -\lambda R_{\mathrm{in}}$:

$Q_\lambda \propto \exp(-\lambda R_{\mathrm{in}})\, P = \arg\inf_{Q \ll P}\left\{R_{\mathrm{in}}(Q) + \frac{\mathrm{KL}(Q, P)}{\lambda}\right\}.$

SLIDE 35

PAC-Bayes for non-iid or heavy-tailed data

We drop the iid and bounded-loss assumptions. For any integer p,

$M_p := \int \mathbb{E}\big(|R_{\mathrm{in}}(h) - R_{\mathrm{out}}(h)|^p\big)\, dP(h).$

Csiszár f-divergence: let f be a convex function with $f(1) = 0$,

$D_f(Q, P) = \int f\left(\frac{dQ}{dP}\right) dP$

when $Q \ll P$, and $D_f(Q, P) = +\infty$ otherwise. The KL is given by the special case $\mathrm{KL}(Q\|P) = D_{x \log x}(Q, P)$.

SLIDE 36

PAC-Bayes with f-divergences (Alquier and Guedj, 2018)

Let $\phi_p : x \mapsto x^p$. Fix $p > 1$, $q = \frac{p}{p-1}$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, we have, for any distribution Q,

$|R_{\mathrm{out}}(Q) - R_{\mathrm{in}}(Q)| \le \left(\frac{M_q}{\delta}\right)^{\frac1q} \big(D_{\phi_p - 1}(Q, P) + 1\big)^{\frac1p}.$

The bound decouples the moment $M_q$ (which depends on the distribution of the data) and the divergence $D_{\phi_p - 1}(Q, P)$ (a measure of complexity).

Corollary: with probability at least $1 - \delta$, for any Q,

$R_{\mathrm{out}}(Q) \le R_{\mathrm{in}}(Q) + \left(\frac{M_q}{\delta}\right)^{\frac1q} \big(D_{\phi_p - 1}(Q, P) + 1\big)^{\frac1p}.$

Again, a strong incentive to define the posterior as the minimizer of the right-hand side!
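A direct transcription of the corollary (hypothetical helper; the moment $M_q$ and the divergence term must be known or estimated separately, which is the hard part in practice):

```python
def f_div_bound(r_in, m_q, d_phi, p, delta):
    # R_out(Q) <= R_in(Q) + (M_q/delta)^(1/q) * (D + 1)^(1/p), q = p/(p-1);
    # for p = q = 2, D is the chi-square divergence between Q and P
    q = p / (p - 1.0)
    return r_in + (m_q / delta) ** (1.0 / q) * (d_phi + 1.0) ** (1.0 / p)

print(f_div_bound(r_in=0.1, m_q=1e-4, d_phi=4.0, p=2.0, delta=0.05))  # 0.2
```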

SLIDE 37

Proof

Let $\Delta(h) := |R_{\mathrm{in}}(h) - R_{\mathrm{out}}(h)|$.

$\left|\int R_{\mathrm{out}}\, dQ - \int R_{\mathrm{in}}\, dQ\right|$
  $\le \int \Delta\, dQ$   [Jensen]
  $= \int \Delta\, \frac{dQ}{dP}\, dP$   [change of measure]
  $\le \left(\int \Delta^q\, dP\right)^{\frac1q} \left(\int \left(\frac{dQ}{dP}\right)^p dP\right)^{\frac1p}$   [Hölder]
  $\le \left(\frac{\mathbb{E}\int \Delta^q\, dP}{\delta}\right)^{\frac1q} \left(\int \left(\frac{dQ}{dP}\right)^p dP\right)^{\frac1p}$   [Markov, w.p. $1-\delta$]
  $= \left(\frac{M_q}{\delta}\right)^{\frac1q} \big(D_{\phi_p - 1}(Q, P) + 1\big)^{\frac1p}.$

SLIDE 38

Oracle bounds

Catoni (2004, 2007) further derived PAC-Bayesian bounds for the Gibbs posterior $Q_\lambda \propto \exp(-\lambda R_{\mathrm{in}})\, P$. Assume the loss is upper-bounded by B; for any $\lambda > 0$, with probability greater than $1 - \delta$,

$R_{\mathrm{out}}(Q_\lambda) \le \inf_{Q \ll P}\left\{R_{\mathrm{out}}(Q) + \frac{\lambda B}{m} + \frac{2}{\lambda}\left(\mathrm{KL}(Q, P) + \log\frac{2}{\delta}\right)\right\}$

(which can be optimized with respect to λ).

Pros: $Q_\lambda$ now enjoys stronger guarantees, as its performance is comparable to that of the (forever unknown) oracle. Cons: the right-hand side is no longer computable.

SLIDE 39

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 40

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 41

Data- or distribution-dependent priors

PAC-Bayesian bounds express a tradeoff between empirical accuracy and a measure of complexity:

$R_{\mathrm{out}}(Q) \le R_{\mathrm{in}}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta}}{2m}}.$

How can this complexity be controlled? An important component in the PAC-Bayes analysis is the choice of the prior distribution. The results hold whatever the choice of prior, provided it is chosen before seeing the data sample. Are there ways we can choose a 'better' prior? We will explore:
  • using part of the data to learn the prior for SVMs,
  • but also, more interestingly and more generally, defining the prior in terms of the data-generating distribution (aka localised PAC-Bayes).

SLIDE 42

SVM Application

Prior and posterior distributions are spherical Gaussians:
  • prior centered at the origin;
  • posterior centered at a scaling µ of the unit SVM weight vector.

This implies the KL term is $\mu^2/2$. We can compute the stochastic error of the posterior distribution exactly, and it behaves like a soft margin: the scaling µ trades between margin loss and KL. The bound holds for all µ, so we choose µ to optimise the bound. The generalization error of the deterministic classifier can be bounded by twice the stochastic error. (A sketch of the µ-optimization follows.)
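A minimal sketch of the µ-optimization, assuming unit-norm data and SVM direction so that the stochastic error takes the Gaussian-tail form described above (our naming, and the simple McAllester form rather than the tighter kl form used in the papers):

```python
import numpy as np
from scipy.stats import norm

def stochastic_error(margins, mu):
    # R_in(Q_mu) for posterior N(mu * w, I): an example with normalized
    # margin gamma is misclassified by the randomized classifier
    # with probability Phi(-mu * gamma)
    return float(np.mean(norm.cdf(-mu * np.asarray(margins))))

def svm_pac_bayes_bound(margins, m, delta):
    # the PAC-Bayes bound holds for all posteriors simultaneously,
    # so scanning mu needs no extra union bound
    best = np.inf
    for mu in np.linspace(0.1, 50.0, 500):
        kl = 0.5 * mu ** 2  # KL between the two spherical Gaussians
        eps = np.sqrt((kl + np.log(2.0 * np.sqrt(m) / delta)) / (2.0 * m))
        best = min(best, stochastic_error(margins, mu) + eps)
    return best
```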

SLIDE 43

Learning the prior for SVMs

The bound depends on the distance between prior and posterior; a better prior (closer to the posterior) would lead to a tighter bound. So:
  • learn the prior P with part of the data;
  • introduce the learnt prior in the bound;
  • compute the stochastic error with the remaining data: PrPAC.

We can go a step further:
  • consider scaling the prior in the chosen direction: τ-PrPAC;
  • adapt the SVM algorithm to optimise the new bound: η-Prior SVM.

We present some results to show the bounds and their use in model selection (regularisation and bandwidth of the kernel).

SLIDE 44

Results

                            SVM                            ηPrior SVM
Problem             2FCV    10FCV   PAC     PrPAC      PrPAC   τ-PrPAC
digits     Bound    –       –       0.175   0.107      0.050   0.047
           TE       0.007   0.007   0.007   0.014      0.010   0.009
waveform   Bound    –       –       0.203   0.185      0.178   0.176
           TE       0.090   0.086   0.084   0.088      0.087   0.086
pima       Bound    –       –       0.424   0.420      0.428   0.416
           TE       0.244   0.245   0.229   0.229      0.233   0.233
ringnorm   Bound    –       –       0.203   0.110      0.053   0.050
           TE       0.016   0.016   0.018   0.018      0.016   0.016
spam       Bound    –       –       0.254   0.198      0.186   0.178
           TE       0.066   0.063   0.067   0.077      0.070   0.072

SLIDE 45

Results

Bounds are remarkably tight: for the final column, the average factor between bound and TE is under 3.

Model selection from the bounds is as good as 10FCV: in fact all but one of the PAC-Bayes model selections give better averages for TE.

The better bounds do not appear to give better model selection; the best model selection comes from the simplest bound.

Ambroladze et al. (2007), Germain et al. (2009a)

SLIDE 46

Distribution-defined priors

Consider P and Q as Gibbs–Boltzmann distributions:

$P_\gamma(h) := \frac{1}{Z'}\, e^{-\gamma R_{\mathrm{out}}(h)}, \qquad Q_\gamma(h) := \frac{1}{Z}\, e^{-\gamma R_{\mathrm{in}}(h)}$

These distributions are hard to work with, since we cannot apply the bound to a single weight vector, but the bounds can be very tight:

$\mathrm{kl}\big(R_{\mathrm{in}}(Q_\gamma)\|R_{\mathrm{out}}(Q_\gamma)\big) \le \frac{1}{m}\left[\frac{\gamma}{\sqrt m}\sqrt{\ln\frac{8\sqrt m}{\delta}} + \frac{\gamma^2}{4m} + \ln\frac{4\sqrt m}{\delta}\right]$

with the only uncertainty the dependence on γ.

Catoni (2003), Catoni (2007), Lever et al. (2010)

SLIDE 47

Observations

We cannot compute the prior distribution P, or even sample from it:
  • note that this would not be possible in normal Bayesian inference;
  • the trick here is that the error measures only depend on the posterior Q, while the bound depends on the KL between posterior and prior: an estimate of this KL is made without knowing the prior explicitly.

The Gibbs distributions are hard to sample from, so it is not easy to work with this bound.

SLIDE 48

Other distribution defined priors

An alternative distribution-defined prior for an SVM is to place a symmetrical Gaussian at the expected weight vector $w_P = \mathbb{E}_{(x,y)\sim D}[y\, \phi(x)]$, which gives distributions that are easier to work with, but the results are not impressive...

What if we were to take the expected weight vector returned from a random training set of size m? Then the KL between posterior and prior is related to the concentration of weight vectors arising from different training sets.

This is connected to stability...

SLIDE 49

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 50

Stability

Uniform hypothesis sensitivity β at sample size m:

$\|A(z_{1:m}) - A(z'_{1:m})\| \le \beta \sum_{i=1}^m \mathbb{1}[z_i \neq z'_i]$

for any two samples $(z_1, \ldots, z_m)$ and $(z'_1, \ldots, z'_m)$, where $A(z_{1:m}) \in H$, a normed space, and $w_m = A(z_{1:m})$ is the 'weight vector' (Lipschitz smoothness).

Uniform loss sensitivity β at sample size m:

$|\ell(A(z_{1:m}), z) - \ell(A(z'_{1:m}), z)| \le \beta \sum_{i=1}^m \mathbb{1}[z_i \neq z'_i]$

These notions are worst-case: data-insensitive and distribution-insensitive. Open question: data-dependent versions?

SLIDE 51

Generalization from Stability

If A has sensitivity β at sample size m, then for any $\delta \in (0, 1)$, w.p. $1 - \delta$,

$R_{\mathrm{out}}(h) \le R_{\mathrm{in}}(h) + \epsilon(\beta, m, \delta)$

Bousquet and Elisseeff (2002)

  • The intuition is that if individual examples do not affect the loss of an algorithm, then the loss will be concentrated.
  • This can be applied to kernel methods, where β is related to the regularisation constant, but the bounds are quite weak.
  • Question: if the algorithm output is highly concentrated, can we get stronger results?

SLIDE 52

Stability + PAC-Bayes

If A has uniform hypothesis stability β at sample size m, then for any $\delta \in (0, 1)$, w.p. $1 - 2\delta$,

$\mathrm{kl}\big(R_{\mathrm{in}}(Q)\|R_{\mathrm{out}}(Q)\big) \le \frac{1}{m}\left[\frac{m\beta^2}{2\sigma^2}\left(1 + \sqrt{\tfrac12 \log\tfrac1\delta}\right)^2 + \log\frac{m+1}{\delta}\right]$

Gaussian randomization:
  • $P = N(\mathbb{E}[W_m], \sigma^2 I)$
  • $Q = N(W_m, \sigma^2 I)$
  • $\mathrm{KL}(Q\|P) = \frac{1}{2\sigma^2}\|W_m - \mathbb{E}[W_m]\|^2$

Main proof components: w.p. $1 - \delta$,

$\mathrm{kl}\big(R_{\mathrm{in}}(Q)\|R_{\mathrm{out}}(Q)\big) \le \frac{\mathrm{KL}(Q\|Q_0) + \log\frac{m+1}{\delta}}{m}$   (with prior $Q_0$)

and w.p. $1 - \delta$,

$\|W_m - \mathbb{E}[W_m]\| \le \sqrt{m}\,\beta\left(1 + \sqrt{\tfrac12 \log\tfrac1\delta}\right)$

Dziugaite and Roy (2018a), Rivasplata et al. (2018)
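A sketch of the resulting computable certificate (hypothetical helper; combine it with a kl inverse, as sketched earlier, to obtain an explicit bound on $R_{\mathrm{out}}(Q)$):

```python
import numpy as np

def stability_kl_rhs(beta, sigma, m, delta):
    # right-hand side of the stability + PAC-Bayes kl bound
    # (the full statement holds w.p. 1 - 2*delta)
    kl_term = (m * beta**2 / (2.0 * sigma**2)) \
        * (1.0 + np.sqrt(0.5 * np.log(1.0 / delta)))**2
    return (kl_term + np.log((m + 1.0) / delta)) / m

print(stability_kl_rhs(beta=1e-3, sigma=0.1, m=10_000, delta=0.05))
```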

SLIDE 53

The plan

1  Elements of Statistical Learning

2  The PAC-Bayesian Theory

3  State-of-the-art PAC-Bayes results: a case study
   • Localized PAC-Bayes: data- or distribution-dependent priors
   • Stability and PAC-Bayes
   • PAC-Bayes analysis of deep neural networks

SLIDE 54

Is deep learning breaking the statistical paradigm we know?

Neural network architectures trained on massive datasets achieve zero training error, which does not bode well for their performance... yet they also achieve remarkably low errors on test sets! PAC-Bayes is a solid candidate to better understand how deep nets generalize.

SLIDE 55

The celebrated bias-variance tradeoff

[Figure: the classical U-shaped test risk curve against the complexity of H, with training risk decreasing; a sweet spot sits between under-fitting and over-fitting.]

Belkin et al. (2018)

SLIDE 56

Towards a better understanding of deep nets

[Figure: the "double descent" curve of test risk against the complexity of H: an under-parameterized "classical" regime, an interpolation threshold, then an over-parameterized "modern" interpolating regime.]

Belkin et al. (2018)

SLIDE 57

Performance of deep nets

Deep learning has thrown down a challenge to Statistical Learning Theory: outstanding performance with overly complex hypothesis classes (most bounds turn vacuous).

For SVMs we can think of the margin as capturing the accuracy with which we need to estimate the weights.

If we have a deep network solution with a wide basin of good performance, we can take a similar approach using PAC-Bayes, with a broad posterior around the solution.

SLIDE 58

Performance of deep nets

Dziugaite and Roy (2017) and Neyshabur et al. (2017) have derived some of the tightest deep learning bounds in this way, by training to expand the basin of attraction, hence not measuring good generalisation of normal training.

Dziugaite and Roy (2017) have also tried to apply the Lever et al. (2013) bound, but observed that it cannot measure generalisation correctly for deep networks, as it has no way of distinguishing between successful fitting of true and random labels.

There have also been suggestions that stability of SGD is important in obtaining good generalization (see Dziugaite and Roy (2018b)). We presented a stability approach combined with PAC-Bayes: this results in a new learning principle, linked to recent analysis of the information stored in weights.

SLIDE 59

Information contained in training set

Achille and Soatto (2018) studied the amount of information stored in the weights of deep networks.

Overfitting is related to information being stored in the weights that encodes the particular training set, as opposed to the data-generating distribution. This corresponds to reducing the concentration of the distribution of weight vectors output by the algorithm.

They argue that the Information Bottleneck criterion introduced by Tishby et al. (1999) can control this information: hence it could potentially lead to a tighter PAC-Bayes bound, and to algorithms that optimize the bound.

SLIDE 60

Conclusion

PAC-Bayes arises from two fields:
  • Statistical learning theory
  • Bayesian learning

As such, it generalizes both in interesting and promising directions. We believe PAC-Bayes can be an inspiration towards new theoretical analyses, but can also drive the design of novel algorithms, especially in settings where theory has proven difficult.

SLIDE 61

Acknowledgments

We warmly thank our many co-authors on PAC-Bayes, with a special mention to Omar Rivasplata, François Laviolette and Pascal Germain, who helped shape this tutorial. We both acknowledge the generous support of the UK Defence Science and Technology Laboratory (Dstl) and the Engineering and Physical Sciences Research Council (EPSRC). Benjamin also acknowledges support from the French funding agency (ANR) and Inria.

Thank you!

Slides available on https://bguedj.github.io/icml2019/index.html

SLIDE 62

References I

A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018. URL http://jmlr.org/papers/v19/17-646.html.
P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Research, 14:243–280, 2013.
P. Alquier and B. Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.
P. Alquier and K. Lounici. PAC-Bayesian theorems for sparse regression estimation with exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.
P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems, NIPS, pages 9–16, 2007.
J.-Y. Audibert and O. Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 2007.
L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy. PAC-Bayesian theory for transductive learning. In AISTATS, 2014.
L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy. PAC-Bayesian bounds based on the Rényi divergence. In AISTATS, 2016.
M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
O. Catoni. A PAC-Bayesian approach to adaptive classification, 2003.
O. Catoni. Statistical Learning Theory and Stochastic Optimization. École d'Été de Probabilités de Saint-Flour 2001. Springer, 2004.
O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Lecture Notes – Monograph Series. Institute of Mathematical Statistics, 2007.
I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146–158, 1975.
P. Derbeko, R. El-Yaniv, and R. Meir. Explicit learning curves for transduction and application to clustering and compression algorithms. J. Artif. Intell. Res. (JAIR), 22, 2004.
SLIDE 63

References II

M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 28, 1975.
G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2017.
G. K. Dziugaite and D. M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In NeurIPS, 2018a.
G. K. Dziugaite and D. M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In International Conference on Machine Learning, pages 1376–1385, 2018b.
M. M. Fard and J. Pineau. PAC-Bayesian model selection for reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2010.
M. M. Fard, J. Pineau, and C. Szepesvári. PAC-Bayesian policy evaluation for reinforcement learning. In UAI, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 195–202, 2011.
S. Gerchinovitz. Prédiction de suites individuelles et cadre statistique classique : étude de quelques liens autour de la régression parcimonieuse et des techniques d'agrégation. PhD thesis, Université Paris-Sud, 2011.
P. Germain. Généralisations de la théorie PAC-bayésienne pour l'apprentissage inductif, l'apprentissage transductif et l'adaptation de domaine. PhD thesis, Université Laval, 2015.
P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML, 2009a.
P. Germain, A. Lacasse, M. Marchand, S. Shanian, and F. Laviolette. From PAC-Bayes bounds to KL regularization. In Advances in Neural Information Processing Systems, pages 603–610, 2009b.
P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In Proceedings of the International Conference on Machine Learning, volume 48, 2016.
M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
B. Guedj and P. Alquier. PAC-Bayesian estimation and prediction in sparse additive models. Electron. J. Statist., 7:264–291, 2013.

SLIDE 64

References III

B. Guedj and S. Robbiano. PAC-Bayesian high dimensional bipartite ranking. Journal of Statistical Planning and Inference, 196:70–86, 2018. ISSN 0378-3758.
M. Higgs and J. Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010.
A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pages 769–776, 2007.
J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 2005.
J. Langford and M. Seeger. Bounds for averaging classifiers. Technical report, Carnegie Mellon, Department of Computer Science, 2001.
J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS), 2002.
G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-dependent PAC-Bayes priors. In International Conference on Algorithmic Learning Theory, pages 119–133. Springer, 2010.
G. Lever, F. Laviolette, and J. Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473:4–28, 2013.
C. Li, W. Jiang, and M. Tanner. General oracle inequalities for Gibbs posterior with application to ranking. In Conference on Learning Theory, pages 512–521, 2013.
L. Li, B. Guedj, and S. Loustau. A quasi-Bayesian perspective to online clustering. Electron. J. Statist., 12(2):3071–3113, 2018.
B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2931–2940, 2017.
B. London, B. Huang, B. Taskar, and L. Getoor. PAC-Bayesian collective stability. In Artificial Intelligence and Statistics, pages 585–594, 2014.
A. Maurer. A note on the PAC-Bayesian theorem. arXiv preprint cs/0411099, 2004.
D. McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1998.
D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37, 1999.
SLIDE 65

References IV

D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003a.
D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, 2003b.
B. Neyshabur, S. Bhojanapalli, D. A. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13:3507–3531, 2012.
O. Rivasplata, E. Parrado-Hernández, J. Shawe-Taylor, S. Sun, and C. Szepesvári. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems, pages 9214–9224, 2018.
M. Seeger. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002.
M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.
Y. Seldin and N. Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11:3595–3646, 2010.
Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, and R. Ortner. PAC-Bayesian analysis of contextual bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.
Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.
J. Shawe-Taylor and D. Hardoon. PAC-Bayes analysis of maximum entropy classification. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 2–9. ACM, 1997. doi: 10.1145/267460.267466.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin. A strongly quasiconvex PAC-Bayesian bound. In International Conference on Algorithmic Learning Theory, ALT, pages 466–492, 2017.
N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Allerton Conference on Communication, Control and Computation, 1999.
L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
