SLIDE 1

Probabilistic Inference and Learning with Stein’s Method

Lester Mackey

Microsoft Research New England, September 3, 2020

Collaborators: Jackson Gorham, Andrew Duncan, Sebastian Vollmer, Jonathan Huggins, Wilson Chen, Alessandro Barp, Francois-Xavier Briol, Mark Girolami, Chris Oates, Murat Erdogdu, Ohad Shamir, Marina Riabiz, Jon Cockayne, Pawel Swietach, Steven Niederer, and Anant Raj

SLIDE 2

Motivation: Large-scale Posterior Inference

Example: Bayesian logistic regression

1. Fixed feature vectors: vl ∈ R^d for each datapoint l = 1, . . . , L

2. Binary class labels: Yl ∈ {0, 1}, P(Yl = 1 | vl, β) = 1 / (1 + exp(−⟨β, vl⟩))

3. Unknown parameter vector: β ∼ N(0, I)

Generative model is simple to express; posterior distribution over the unknown parameters is complex

Normalization constant unknown, exact integration intractable

Standard inferential approach: Use Markov chain Monte Carlo (MCMC) to (eventually) draw samples from the posterior distribution

Benefit: Approximates intractable posterior expectations EP[h(Z)] = ∫_X p(x)h(x)dx with asymptotically exact sample estimates EQ[h(X)] = (1/n) Σ_{i=1}^n h(xi)

Problem: Each new MCMC sample point xi requires iterating over the entire observed dataset: prohibitive when the dataset is large!
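To make the ingredients concrete, here is a minimal NumPy sketch (not from the slides) of the unnormalized log posterior and its score ∇β log p(β | y, V) for this model; the later Stein discrepancies only ever need the score, never the normalizing constant. The function name and synthetic data are illustrative assumptions.

```python
import numpy as np

def log_posterior_and_score(beta, V, y):
    """Unnormalized log posterior and its score for Bayesian logistic regression
    with features V (L x d), labels y in {0, 1}, and a N(0, I) prior on beta."""
    logits = V @ beta                                    # <beta, v_l> for each datapoint
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * beta @ beta                       # N(0, I) prior, up to a constant
    score = V.T @ (y - 1.0 / (1.0 + np.exp(-logits))) - beta   # grad_beta log p(beta | y, V)
    return log_lik + log_prior, score

# Synthetic illustration
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 5))
y = (rng.uniform(size=100) < 0.5).astype(float)
logp, grad = log_posterior_and_score(np.zeros(5), V, y)
```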

SLIDE 3

Motivation: Large-scale Posterior Inference

Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?

MCMC Benefit: Approximates intractable posterior expectations EP[h(Z)] = ∫_X p(x)h(x)dx with asymptotically exact sample estimates EQ[h(X)] = (1/n) Σ_{i=1}^n h(xi)

Problem: Each point xi requires iterating over the entire dataset!

Template solution: Approximate MCMC with subset posteriors

[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Approximate the standard MCMC procedure in a manner that makes use of only a small subset of datapoints per sample
Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
Introduces asymptotic bias: target distribution is not stationary
Hope that for a fixed amount of sampling time, variance reduction will outweigh the bias introduced

SLIDE 4

Motivation: Large-scale Posterior Inference

Template solution: Approximate MCMC with subset posteriors

[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Hope that for a fixed amount of sampling time, variance reduction will outweigh the bias introduced

Introduces new challenges:
How do we compare and evaluate samples from approximate MCMC procedures?
How do we select samplers and their tuning parameters?
How do we quantify the bias-variance trade-off explicitly?

Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias

This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples

SLIDE 5

Quality Measures for Samples

Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution

Given:
Continuous target distribution P with support X = R^d and density p
p known up to normalization, integration under P is intractable
Sample points x1, . . . , xn ∈ X
Discrete distribution Qn with, for any function h, EQn[h(X)] = (1/n) Σ_{i=1}^n h(xi), used to approximate EP[h(Z)]
We make no assumption about the provenance of the xi

Goal: Quantify how well EQn approximates EP in a manner that
  I. Detects when a sample sequence is converging to the target
  II. Detects when a sample sequence is not converging to the target
  III. Is computationally feasible

SLIDE 6

Integral Probability Metrics

Goal: Quantify how well EQn approximates EP

Idea: Consider an integral probability metric (IPM) [Müller, 1997]

dH(Qn, P) = sup_{h ∈ H} |EQn[h(X)] − EP[h(Z)]|

Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H

When H is sufficiently large, convergence of dH(Qn, P) to zero implies (Qn)n≥1 converges weakly to P (Requirement II)

Problem: Integration under P is intractable! ⇒ Most IPMs cannot be computed in practice

Idea: Only consider functions with EP[h(Z)] known a priori to be 0; then the IPM computation only depends on Qn!

How do we select this class of test functions? Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)? How do we solve the resulting optimization problem in practice?

SLIDE 7

Stein’s Method

Stein’s method [1972] provides a recipe for controlling convergence:

1. Identify an operator T and a set G of functions g : X → R^d with EP[(T g)(Z)] = 0 for all g ∈ G. T and G together define the Stein discrepancy [Gorham and Mackey, 2015]
S(Qn, T, G) := sup_{g ∈ G} |EQn[(T g)(X)]| = dTG(Qn, P),
an IPM-type measure with no explicit integration under P

2. Lower bound S(Qn, T, G) by a reference IPM dH(Qn, P) ⇒ (Qn)n≥1 converges to P whenever S(Qn, T, G) → 0 (Req. II)
Performed once, in advance, for large classes of distributions

3. Upper bound S(Qn, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I)

Standard use: As an analytical tool to prove convergence
Our goal: Develop the Stein discrepancy into a practical quality measure

SLIDE 8

Identifying a Stein Operator T

Goal: Identify an operator T for which EP[(T g)(Z)] = 0 for all g ∈ G

Approach: Generator method of Barbour [1988, 1990], Götze [1991]

Identify a Markov process (Zt)t≥0 with stationary distribution P
Under mild conditions, its infinitesimal generator (Au)(x) = lim_{t→0} (E[u(Zt) | Z0 = x] − u(x))/t satisfies EP[(Au)(Z)] = 0

Overdamped Langevin diffusion: dZt = (1/2) ∇ log p(Zt) dt + dWt
Generator: (APu)(x) = (1/2) ⟨∇u(x), ∇ log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩
Stein operator: (TPg)(x) := ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩
[Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]

Depends on P only through ∇ log p; computable even if p cannot be normalized!

EP[(TPg)(Z)] = 0 for all g : X → R^d in the classical Stein set
G‖·‖ := { g : sup_{x≠y} max( ‖g(x)‖∗, ‖∇g(x)‖∗, ‖∇g(x) − ∇g(y)‖∗ / ‖x − y‖ ) ≤ 1 }
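A small sketch (not from the slides) of the Langevin Stein operator in code, checking EP[(TP g)(Z)] = 0 by Monte Carlo for a standard Gaussian target and the simple test function g(x) = x; the helper name, target, and test function are assumptions chosen for illustration.

```python
import numpy as np

def langevin_stein(g, div_g, score, x):
    """(T_P g)(x) = <g(x), grad log p(x)> + <div, g(x)> for the Langevin Stein operator."""
    return g(x) @ score(x) + div_g(x)

# Sanity check for P = N(0, I_d) and g(x) = x: score(x) = -x and div g(x) = d,
# so (T_P g)(x) = d - ||x||^2, which has mean zero under P.
d = 3
rng = np.random.default_rng(0)
Z = rng.normal(size=(100_000, d))
vals = np.array([langevin_stein(lambda x: x, lambda x: float(d), lambda x: -x, z) for z in Z])
print(vals.mean())   # approximately 0
```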

SLIDE 9

Detecting Convergence and Non-convergence

Goal: Show classical Stein discrepancy S(Qn, TP, G‖·‖) → 0 if and only if (Qn)n≥1 converges to P

In the univariate case (d = 1), it is known that for many targets P, S(Qn, TP, G‖·‖) → 0 only if the Wasserstein distance dW‖·‖(Qn, P) → 0
[Stein, Diaconis, Holmes, and Reinert, 2004, Chatterjee and Shao, 2011, Chen, Goldstein, and Shao, 2011]

Few multivariate targets have been analyzed (see [Reinert and Röllin, 2009, Chatterjee and Meckes, 2008, Meckes, 2009] for the multivariate Gaussian)

New contribution [Gorham, Duncan, Vollmer, and Mackey, 2019]
Theorem (Stein Discrepancy-Wasserstein Equivalence): If the Langevin diffusion couples at an integrable rate and ∇ log p is Lipschitz, then S(Qn, TP, G‖·‖) → 0 ⇔ dW‖·‖(Qn, P) → 0.

Examples: strongly log concave P, Bayesian logistic regression or robust t regression with Gaussian priors, Gaussian mixtures

Conditions not necessary: template for bounding S(Qn, TP, G‖·‖)

SLIDE 10

A Simple Example

[Figure: Stein discrepancy vs. number of sample points n (100 to 30000) for Gaussian and scaled Student's t samples; panels at n = 300, 3000, 30000 show the functions g and h = T g on x ∈ [−6, 6].]

For target P = N(0, 1), compare an i.i.d. N(0, 1) sample sequence Q1:n to a scaled Student's t sequence Q′1:n with matching variance

Expect S(Q1:n, TP, G‖·‖) → 0 and S(Q′1:n, TP, G‖·‖) ↛ 0

SLIDE 11

A Simple Example

[Figure: Left: Stein discrepancy vs. n for Gaussian and scaled Student's t samples. Middle and right: the recovered functions g and h = TP g at n = 300, 3000, 30000.]

Middle: Recovered optimal functions g. Right: Associated test functions h(x) = (TPg)(x) which best discriminate the sample Qn from the target P

SLIDE 12

Selecting Sampler Hyperparameters

[Figure: Left: log median diagnostic (ESS and spanner Stein discrepancy) vs. step size ε. Right: posterior sample scatter plots in (x1, x2) at step sizes ε = 5e−05, 5e−03, 5e−02.]

Target posterior density: p(x) ∝ π(x) Π_{l=1}^L π(yl | x)

Stochastic Gradient Langevin Dynamics [Welling and Teh, 2011]:
xk+1 ∼ N( xk + (ε/2)( ∇ log π(xk) + (L/|Bk|) Σ_{l ∈ Bk} ∇ log π(yl | xk) ), εI )

Random batch Bk of datapoints used to draw each sample point

Step size ε too small ⇒ slow mixing
Step size ε too large ⇒ sampling from a very different distribution
Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias

ESS maximized at ε = 5 × 10^−2, Stein discrepancy minimized at ε = 5 × 10^−3
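For reference, a hedged sketch of one SGLD update as written above; grad_log_prior and grad_log_lik are placeholder callables, not part of any particular library.

```python
import numpy as np

def sgld_step(x, eps, grad_log_prior, grad_log_lik, data, batch_size, rng):
    """One Stochastic Gradient Langevin Dynamics update following the rule above.
    grad_log_prior(x) returns grad log pi(x); grad_log_lik(y, x) returns
    grad_x log pi(y | x) for a single datapoint y; data is an array of datapoints."""
    L = len(data)
    batch = data[rng.choice(L, size=batch_size, replace=False)]
    # Unbiased stochastic estimate of grad log p(x) built from the minibatch
    grad = grad_log_prior(x) + (L / batch_size) * sum(grad_log_lik(y, x) for y in batch)
    # Drift half-step plus injected Gaussian noise of variance eps
    return x + 0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=x.shape)
```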

SLIDE 13

Alternative Stein Sets G

Goal: Identify a more “user-friendly” Stein set G than the classical one

Approach: Reproducing kernels k : X × X → R [Oates, Girolami, and Chopin, 2016, Chwialkowski, Strathmann, and Gretton, 2016, Liu, Lee, and Jordan, 2016]

A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (Σ_{i,l} ci cl k(zi, zl) ≥ 0, ∀ zi ∈ X, ci ∈ R)

Gaussian: k(x, y) = exp(−(1/2)‖x − y‖²₂), IMQ: k(x, y) = (1 + ‖x − y‖²₂)^(−1/2)

Generates a reproducing kernel Hilbert space (RKHS) Kk

Define the kernel Stein set [Gorham and Mackey, 2017]: Gk := {g = (g1, . . . , gd) | ‖v‖∗ ≤ 1 for vj := ‖gj‖_Kk}

Yields a closed-form kernel Stein discrepancy (KSD): S(Qn, TP, Gk) = ‖w‖ for wj := ( Σ_{i,i′=1}^n k_0^j(xi, xi′) / n² )^{1/2}

Reduces to parallelizable pairwise evaluations of the Stein kernels
k_0^j(x, y) := (1/(p(x)p(y))) ∇_{xj} ∇_{yj} ( p(x) k(x, y) p(y) )
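The closed form above reduces to pairwise Stein kernel evaluations. Below is a self-contained NumPy sketch of the Langevin KSD with the IMQ kernel k(x, y) = (c² + ‖x − y‖²₂)^β; the function imq_ksd and its defaults (c = 1, β = −1/2) are illustrative assumptions, not a library API.

```python
import numpy as np

def imq_ksd(X, score, c=1.0, beta=-0.5):
    """Langevin kernel Stein discrepancy of the sample in the rows of X against
    the target with score(x) = grad log p(x), using the IMQ kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta. A sketch; not a library routine."""
    n, d = X.shape
    S = np.stack([score(x) for x in X])                 # score at each sample point
    diff = X[:, None, :] - X[None, :, :]                # pairwise differences x_i - x_j
    r2 = np.sum(diff ** 2, axis=-1)                     # squared distances
    base = c ** 2 + r2
    K = base ** beta                                    # kernel matrix k(x_i, x_j)
    gradx_K = 2 * beta * base[..., None] ** (beta - 1) * diff   # grad_x k(x, y)
    grady_K = -gradx_K                                          # grad_y k(x, y)
    # sum_j d^2 k / dx_j dy_j for the IMQ kernel
    div_xy_K = (-2 * beta * d * base ** (beta - 1)
                - 4 * beta * (beta - 1) * r2 * base ** (beta - 2))
    # Stein kernel k_0(x, y) summed over coordinates j
    K0 = (div_xy_K
          + np.einsum('ijk,jk->ij', gradx_K, S)         # <grad_x k(x_i, x_j), score(x_j)>
          + np.einsum('ijk,ik->ij', grady_K, S)         # <grad_y k(x_i, x_j), score(x_i)>
          + K * (S @ S.T))                              # k(x_i, x_j) <score(x_i), score(x_j)>
    return np.sqrt(K0.mean())                           # sqrt((1/n^2) sum_{i,i'} k_0)

# Example: 200 i.i.d. points from the standard Gaussian target, score(x) = -x
rng = np.random.default_rng(1)
print(imq_ksd(rng.normal(size=(200, 2)), lambda x: -x))
```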

SLIDE 14

Detecting Non-convergence

Goal: Show (Qn)n≥1 converges to P whenever S(Qn, TP, Gk) → 0

Theorem (Univariate KSD detects non-convergence [Gorham and Mackey, 2017]): Suppose P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C2 with a non-vanishing generalized Fourier transform. If d = 1, then (Qn)n≥1 converges weakly to P whenever S(Qn, TP, Gk) → 0.

P is the set of targets P with Lipschitz ∇ log p and distant strong log concavity (⟨∇ log(p(x)/p(y)), y − x⟩ / ‖x − y‖²₂ ≥ k for ‖x − y‖₂ ≥ r)

Includes Bayesian logistic and Student's t regression with Gaussian priors, Gaussian mixtures with common covariance, ...

Justifies use of the KSD with popular Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case

SLIDE 15

Detecting Non-convergence

Goal: Show (Qn)n≥1 converges to P whenever S(Qn, TP, Gk) → 0

In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P

Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017]): Suppose d ≥ 3, P = N(0, Id), and α := (1/2 − 1/d)^(−1). If k(x, y) and its derivatives decay at an o(‖x − y‖₂^(−α)) rate as ‖x − y‖₂ → ∞, then S(Qn, TP, Gk) → 0 for some (Qn)n≥1 not converging to P.

Gaussian (k(x, y) = exp(−(1/2)‖x − y‖²₂)) and Matérn kernels fail for d ≥ 3
Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖²₂)^β) with β < −1 fail for d > 2β/(1 + β)

The violating sample sequences (Qn)n≥1 are simple to construct

Problem: Kernels with light tails ignore excess mass in the tails

SLIDE 16

Detecting Non-convergence

Goal: Show (Qn)n≥1 converges to P whenever S(Qn, TP, Gk) → 0

Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c² + ‖x − y‖²₂)^β for some β < 0, c ∈ R.

IMQ KSD fails to detect non-convergence when β < −1
However, IMQ KSD detects non-convergence when β ∈ (−1, 0)

Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017]): Suppose P ∈ P and k(x, y) = (c² + ‖x − y‖²₂)^β for β ∈ (−1, 0). If S(Qn, TP, Gk) → 0, then (Qn)n≥1 converges weakly to P.

SLIDE 17

Detecting Convergence

Goal: Show S(Qn, TP, Gk) → 0 whenever (Qn)n≥1 converges to P

Proposition (KSD detects convergence [Gorham and Mackey, 2017]): If k ∈ C_b^(2,2) and ∇ log p is Lipschitz and square integrable under P, then S(Qn, TP, Gk) → 0 whenever the Wasserstein distance dW‖·‖₂(Qn, P) → 0.

Covers Gaussian, Matérn, IMQ, and other common bounded kernels k

SLIDE 18

Selecting Samplers

Stochastic Gradient Fisher Scoring (SGFS)

[Ahn, Korattikara, and Welling, 2012]

Approximate MCMC procedure designed for scalability

Approximates the Metropolis-adjusted Langevin algorithm but does not use the Metropolis-Hastings correction
Target P is not the stationary distribution

Goal: Choose between two variants

SGFS-f inverts a d × d matrix for each new sample point
SGFS-d inverts a diagonal matrix to reduce sampling time

MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]

10000 images, 51 features, binary label indicating whether the image depicts a 7 or a 9

Bayesian logistic regression posterior P

SLIDE 19

Selecting Samplers

[Figure: Left: IMQ kernel Stein discrepancy vs. number of sample points n for the SGFS-d and SGFS-f samplers. Right: best- and worst-aligned bivariate marginals (e.g. (x7, x51), (x8, x42), (x32, x34), (x2, x25)) for SGFS-d and SGFS-f.]

Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points (n = 5 × 10^4) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with the surrogate ground truth sample (red)
Both suggest the small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the loss in inferential accuracy

SLIDE 20

Beyond Sample Quality Comparison

Goodness-of-fit testing

Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Qn, TP, Gk) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])

Test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased

We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)

For n = 500, generate a sample (xi)_{i=1}^n with xi = zi + ui e1, where zi iid ∼ N(0, Id) and ui iid ∼ Unif[0, 1]. Target P = N(0, Id). Compare with the standard normality test of Baringhaus and Henze [1988]

Table: Mean power of multivariate normality tests across 400 simulations

          d=2    d=5    d=10   d=15   d=20   d=25
B&H       1.0    1.0    1.0    0.91   0.57   0.26
Gaussian  1.0    1.0    0.88   0.29   0.12   0.02
IMQ       1.0    1.0    1.0    1.0    1.0    1.0
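For concreteness, the perturbed sample from this experiment can be generated and scored with the hypothetical imq_ksd sketch above; this only computes the discrepancy, whereas the calibrated test of Chwialkowski et al. additionally requires a bootstrap threshold, omitted here.

```python
import numpy as np

# Generate the perturbed sample x_i = z_i + u_i * e_1 and score it with the
# imq_ksd sketch defined earlier (a hypothetical helper, not a library call).
rng = np.random.default_rng(0)
n, d = 500, 10
Z = rng.normal(size=(n, d))
u = rng.uniform(size=n)
X = Z.copy()
X[:, 0] += u
print(imq_ksd(X, lambda x: -x))   # target P = N(0, I_d), score(x) = -x
print(imq_ksd(Z, lambda x: -x))   # unperturbed baseline for comparison
```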

SLIDE 21

Beyond Sample Quality Comparison

Improving sample quality

Given sample points (xi)_{i=1}^n, can minimize the KSD S(Q̃n, TP, Gk) over all weighted samples Q̃n = Σ_{i=1}^n qn(xi) δ_{xi} for qn a probability mass function

Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = exp(−(1/h)‖x − y‖²₂)

Bandwidth h set to the median of the squared Euclidean distance between pairs of sample points

We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)‖x − y‖²₂)^(−1/2)
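A small sketch of the weight optimization under stated assumptions: given the n × n matrix of Stein kernel values k_0(xi, xj) (the matrix K0 assembled inside the imq_ksd sketch above), the squared KSD of a weighted sample is qᵀK0q, minimized over the probability simplex. SLSQP is used purely for illustration and is not the solver of Liu and Lee [2016].

```python
import numpy as np
from scipy.optimize import minimize

def ksd_optimal_weights(K0):
    """Weights q minimizing the squared KSD q^T K0 q over the probability simplex."""
    n = K0.shape[0]
    objective = lambda q: q @ K0 @ q
    gradient = lambda q: 2 * K0 @ q
    constraints = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, np.full(n, 1.0 / n), jac=gradient,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x
```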

SLIDE 22

Improving Sample Quality

[Figure: Average MSE ‖EP[Z] − EQ̃n[X]‖²₂ / d vs. dimension d for the initial sample Qn, Gaussian KSD weights, and IMQ KSD weights.]

MSE averaged over 500 simulations (±2 standard errors)
Target P = N(0, Id)
Starting sample Qn = (1/n) Σ_{i=1}^n δ_{xi} for xi iid ∼ P, n = 100.

SLIDE 23

Generating High-quality Samples

Stein Variational Gradient Descent (SVGD) [Liu and Wang, 2016]
Uses the KSD to repeatedly update the locations of n sample points:
xi ← xi + (ε/n) Σ_{l=1}^n ( k(xl, xi) ∇ log p(xl) + ∇_{xl} k(xl, xi) )
Approximates a gradient step in KL divergence (but convergence is unclear)
Simple to implement (but each update costs n² time; see the sketch below)

Stein Points [Chen, Mackey, Gorham, Briol, and Oates, 2018]
Greedily minimizes the KSD by constructing Qn = (1/n) Σ_{i=1}^n δ_{xi} with
xn ∈ argmin_x S( ((n−1)/n) Qn−1 + (1/n) δ_x, TP, Gk ) = argmin_x Σ_{j=1}^d [ k_0^j(x, x)/2 + Σ_{i=1}^{n−1} k_0^j(xi, x) ]
Can generate a sample sequence from scratch or, under budget constraints, optimize existing locations by coordinate descent
Sends the KSD to zero at an O(√(log(n)/n)) rate

Stein Point MCMC [Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019]
Suffices to optimize over iterates of a Markov chain
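A minimal sketch of the SVGD update above with a Gaussian kernel; using a fixed bandwidth h rather than the usual median heuristic is a simplifying assumption.

```python
import numpy as np

def svgd_update(X, score, eps, h=1.0):
    """One SVGD update of the n x d particle array X toward the target with
    score(x) = grad log p(x), using a Gaussian kernel with bandwidth h."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]              # x_l - x_i, shape (n, n, d)
    r2 = np.sum(diff ** 2, axis=-1)
    K = np.exp(-r2 / (2.0 * h))                       # k(x_l, x_i)
    S = np.stack([score(x) for x in X])               # grad log p(x_l)
    grad_K = -diff / h * K[..., None]                 # grad_{x_l} k(x_l, x_i)
    phi = (K.T @ S + grad_K.sum(axis=0)) / n          # averaged update direction
    return X + eps * phi
```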

SLIDE 24

Generating High-quality Samples

[Figure: SP-MCMC vs. MCMC sample points.]

Stein Point MCMC [Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019]
Greedily minimizes the KSD by constructing Qn = (1/n) Σ_{i=1}^n δ_{xi} with
xn ∈ argmin_x S( ((n−1)/n) Qn−1 + (1/n) δ_x, TP, Gk ) = argmin_x Σ_{j=1}^d [ k_0^j(x, x)/2 + Σ_{i=1}^{n−1} k_0^j(xi, x) ]
Suffices to optimize over iterates of a Markov chain

SLIDE 25

Generating High-quality Samples

Goodwin oscillator: kinetic model of oscillatory enzymatic control

Measure mRNA and protein product (yi)_{i=1}^{40} at times (ti)_{i=1}^{40}:
yi = g(u(ti)) + εi,  εi i.i.d. ∼ N(0, σ²I),  du/dt = fθ(t, u),  u(0) = u0 ∈ R^8

P is the posterior of log(θ) ∈ R^10 given y, t and log Γ(2, 1) priors

Evaluating the likelihood requires solving the ODE numerically

[Figure: log KSD vs. log number of likelihood evaluations (neval) for MALA, RWM, SVGD, MED, SP, SP-MALA (LAST, INFL), and SP-RWM (LAST, INFL).]
SLIDE 26

Generating High-quality Samples

[Figure: log EP vs. log number of likelihood evaluations (neval) for RWM, SVGD, and SP-RWM INFL.]

IGARCH model of financial time series with time-varying volatility

Daily percentage returns y = (yt)_{t=1}^{2000} from the S&P 500 modeled as
yt = σt εt,  εt i.i.d. ∼ N(0, 1),  σ²t = θ1 + θ2 y²_{t−1} + (1 − θ2) σ²_{t−1}

P is the posterior of θ1 > 0, θ2 ∈ (0, 1) given y and uniform priors
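A sketch of the IGARCH log likelihood implied by this model; the initial variance sigma2_init is an assumption the slide does not specify.

```python
import numpy as np

def igarch_log_likelihood(theta1, theta2, y, sigma2_init=1.0):
    """Log likelihood of the IGARCH(1,1) model above: y_t = sigma_t * eps_t,
    sigma_t^2 = theta1 + theta2 * y_{t-1}^2 + (1 - theta2) * sigma_{t-1}^2."""
    sigma2 = sigma2_init
    loglik = 0.0
    for t in range(len(y)):
        loglik += -0.5 * (np.log(2 * np.pi * sigma2) + y[t] ** 2 / sigma2)
        sigma2 = theta1 + theta2 * y[t] ** 2 + (1 - theta2) * sigma2
    return loglik
```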

SLIDE 27

Future Directions

Many opportunities for future development

1. Improving scalability while maintaining convergence control
Subsampling of likelihood terms in ∇ log p
Stochastic Stein discrepancies [Gorham, Raj, and Mackey, 2020]: control convergence with probability 1
Inexpensive approximations of the kernel matrix
Finite set Stein discrepancies [Jitkrittum, Xu, Szabó, Fukumizu, and Gretton, 2017]: low-rank kernel, linear runtime (but convergence control unclear)
Random feature Stein discrepancies [Huggins and Mackey, 2018]: stochastic low-rank kernel, near-linear runtime + high-probability convergence control when the (Qn)n≥1 moments are uniformly bounded

2. Exploring the impact of Stein operator choice
An infinite number of operators T characterize P. How is the discrepancy impacted? How do we select the best T?
Thm: If ∇ log p is bounded and k ∈ C^(1,1), then S(Qn, TP, Gk) → 0 for some (Qn)n≥1 not converging to P
Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) a(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2019] may be appropriate for heavy tails

SLIDE 28

Future Directions

Many opportunities for future development

3. Addressing other inferential tasks

Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]

SLIDE 29

Future Directions

Many opportunities for future development

3. Addressing other inferential tasks
Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]
Non-convex optimization [Erdogdu, Mackey, and Shamir, 2018]:
min_x f(x) = 5 log(1 + (1/2)‖x‖²₂),  a(x) = (1 + (1/2)‖x‖²₂) I,  a(x)∇f(x) = 5x

SLIDE 30

Future Directions

Many opportunities for future development

1. Improving scalability while maintaining convergence control
Subsampling of likelihood terms in ∇ log p
Stochastic Stein discrepancies [Gorham, Raj, and Mackey, 2020]
Inexpensive approximations of the kernel matrix
Finite set Stein discrepancies [Jitkrittum, Xu, Szabó, Fukumizu, and Gretton, 2017]
Random feature Stein discrepancies [Huggins and Mackey, 2018]

2. Exploring the impact of Stein operator choice
An infinite number of operators T characterize P. How is the discrepancy impacted? How do we select the best T?
Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) a(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2019] may be appropriate for heavy tails

3. Addressing other inferential tasks
Generative modeling [Wang and Liu, 2016, Pu, Gan, Henao, Li, Han, and Carin, 2017]
Non-convex optimization [Erdogdu, Mackey, and Shamir, 2018]
Parameter estimation [Barp, Briol, Duncan, Girolami, and Mackey, 2019]
MCMC thinning [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2020]

SLIDE 31

References I

  • S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML,

ICML’12, 2012.

  • A. D. Barbour. Stein’s method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN

0021-9002. A celebration of applied probability.

  • A. D. Barbour. Stein’s method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990. ISSN

0178-8051. doi: 10.1007/BF01197887.

  • L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function.

Metrika, 35(1):339–348, 1988.

  • A. Barp, F.-X. Briol, A. Duncan, M. Girolami, and L. Mackey. Minimum Stein discrepancy estimators. In Advances in Neural

Information Processing Systems, pages 12964–12976, 2019.

  • S. Chatterjee and E. Meckes. Multivariate normal approximation using exchangeable pairs. ALEA Lat. Am. J. Probab. Math.

Stat., 4:257–283, 2008. ISSN 1980-0436.

  • S. Chatterjee and Q. Shao. Nonnormal approximation by Stein’s method of exchangeable pairs with application to the

Curie-Weiss model. Ann. Appl. Probab., 21(2):464–483, 2011. ISSN 1050-5164. doi: 10.1214/10-AAP712.

  • L. Chen, L. Goldstein, and Q. Shao. Normal approximation by Stein’s method. Probability and its Applications. Springer,

Heidelberg, 2011. ISBN 978-3-642-15006-7. doi: 10.1007/978-3-642-15007-4.

  • W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In J. Dy and A. Krause, editors, Proceedings of the

35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 844–853, Stockholmsmassan, Stockholm Sweden, 10–15 Jul 2018. PMLR.

  • W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. Oates. Stein point Markov chain Monte Carlo. In
  • K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning,

volume 97 of Proceedings of Machine Learning Research, pages 1011–1021, Long Beach, California, USA, 09–15 Jun 2019.

  • PMLR. URL http://proceedings.mlr.press/v97/chen19b.html.
  • K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.
  • M. A. Erdogdu, L. Mackey, and O. Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural

Information Processing Systems, pages 9694–9703, 2018.

  • J. Gorham and L. Mackey. Measuring sample quality with Stein’s method. In C. Cortes, N. D. Lawrence, D. D. Lee,
  • M. Sugiyama, and R. Garnett, editors, Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.

SLIDE 32

References II

  • J. Gorham and L. Mackey. Measuring sample quality with kernels. In ICML, volume 70 of Proceedings of Machine Learning

Research, pages 1292–1301. PMLR, 2017.

  • J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. Ann. Appl. Probab., 29(5):

2884–2928, 10 2019. doi: 10.1214/19-AAP1467. URL https://doi.org/10.1214/19-AAP1467.

  • J. Gorham, A. Raj, and L. Mackey. Stochastic Stein discrepancies. arXiv preprint arXiv:2007.02857, 2020.
  • F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.
  • J. Huggins and L. Mackey. Random feature Stein discrepancies. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1903–1913. Curran Associates, Inc., 2018.

  • W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A Linear-Time Kernel Goodness-of-Fit Test. In Advances in Neural Information Processing Systems, 2017.

  • A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. of 31st

ICML, ICML’14, 2014.

  • Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.
  • Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. arXiv:1608.04471,
  • Aug. 2016.
  • Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. of 33rd ICML, volume 48 of

ICML, pages 276–284, 2016.

  • L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. Electron. Commun.

Probab., 21:14 pp., 2016. doi: 10.1214/16-ECP15.

  • E. Meckes. On Stein’s method for multivariate normal approximation. In High dimensional probability V: the Luminy volume,

volume 5 of Inst. Math. Stat. Collect., pages 153–178. Inst. Math. Statist., Beachwood, OH, 2009. doi: 10.1214/09-IMSCOLL511.

  • A. Müller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

SLIDE 33

References III

  • C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical

Society: Series B (Statistical Methodology), 2016. ISSN 1467-9868. doi: 10.1111/rssb.12185.

  • Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural

Information Processing Systems, pages 4237–4246, 2017.

  • G. Reinert and A. Röllin. Multivariate normal approximation with Stein’s method of exchangeable pairs under a general linearity condition. Ann. Probab., 37(6):2150–2173, 2009. ISSN 0091-1798. doi: 10.1214/09-AOP467.
  • M. Riabiz, W. Chen, J. Cockayne, P. Swietach, S. A. Niederer, L. Mackey, and C. Oates. Optimal thinning of MCMC output.

arXiv preprint arXiv:2005.03952, 2020.

  • C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In
  • Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971),
  • Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.
  • C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein’s method:

expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., Beachwood, OH, 2004.

  • D. Wang and Q. Liu. Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning.

arXiv:1611.01722, Nov. 2016.

  • M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

SLIDE 34

Comparing Discrepancies

[Figure: Left: discrepancy value (IMQ KSD, graph Stein discrepancy, Wasserstein) vs. number of sample points n for samples drawn i.i.d. from the mixture target P or from a single mixture component. Right: computation time (sec) vs. n in dimensions d = 1 and d = 4.]

Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ exp(−(1/2)(x + 1.5)²) + exp(−(1/2)(x − 1.5)²) or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.

SLIDE 35

The Importance of Kernel Choice

[Figure: Kernel Stein discrepancy vs. number of sample points n for Gaussian, Matérn, and inverse multiquadric kernels in dimensions d = 5, 8, 20, for i.i.d. on-target and off-target samples.]

Target P = N(0, Id)
Off-target Qn has all ‖xi‖₂ ≤ 2 n^(1/d) log n and ‖xi − xj‖₂ ≥ 2 log n
Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P
The IMQ KSD (β = −1/2, c = 1) does not have this deficiency

SLIDE 36

Selecting Sampler Hyperparameters

Setup [Welling and Teh, 2011]

Consider the posterior distribution P induced by L datapoints yl drawn i.i.d. from a Gaussian mixture likelihood
Yl | X iid ∼ (1/2) N(X1, 2) + (1/2) N(X1 + X2, 2)
under Gaussian priors on the parameters X ∈ R²: X1 ∼ N(0, 10) ⊥⊥ X2 ∼ N(0, 1)

Draw m = 100 datapoints yl with parameters (x1, x2) = (0, 1); induces a posterior with a second mode at (x1, x2) = (1, −1)

For a range of parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample Qn

Use minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε

Compare with the standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation

Compute the median of each diagnostic over 50 random sequences
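For illustration, a sketch of the unnormalized log posterior for this setup, interpreting the second arguments of N(·, ·) as variances (as in Welling and Teh [2011]); the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def log_posterior(x, y):
    """Unnormalized log posterior for the mixture model above:
    y_l | x ~ 0.5 N(x1, 2) + 0.5 N(x1 + x2, 2), x1 ~ N(0, 10), x2 ~ N(0, 1)."""
    x1, x2 = x
    log_prior = norm.logpdf(x1, scale=np.sqrt(10)) + norm.logpdf(x2, scale=1.0)
    comp1 = norm.pdf(y, loc=x1, scale=np.sqrt(2))
    comp2 = norm.pdf(y, loc=x1 + x2, scale=np.sqrt(2))
    log_lik = np.sum(np.log(0.5 * comp1 + 0.5 * comp2))
    return log_prior + log_lik
```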

SLIDE 37

Selecting Samplers

Setup: MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]

10000 images, 51 features, binary label indicating whether the image depicts a 7 or a 9

Bayesian logistic regression posterior P

L independent observations (yl, vl) ∈ {1, −1} × R^d with P(Yl = 1 | vl, X) = 1/(1 + exp(−⟨vl, X⟩))
Flat improper prior on the parameters X ∈ R^d

Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10^5 sample points and discarding the first half as burn-in

For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10^5 sample points [Ahn, Korattikara, and Welling, 2012]

SLIDE 38

The Importance of Tightness

Goal: Show S(Qn, TP, Gk) → 0 only if Qn converges to P

A sequence (Qn)n≥1 is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Qn(‖X‖₂ > R(ε)) ≤ ε
Intuitively, no mass in the sequence escapes to infinity

Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017]): Suppose that P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C2 with a non-vanishing generalized Fourier transform. If (Qn)n≥1 is uniformly tight and S(Qn, TP, Gk) → 0, then (Qn)n≥1 converges weakly to P.

Good news, but, ideally, the KSD would detect non-tight sequences automatically...
