SLIDE 1

Measuring Sample Quality with Stein’s Method

Lester Mackey∗

Joint work with Jackson Gorham†, Andrew Duncan‡, Sebastian Vollmer∗∗

Microsoft Research∗, Opendoor Labs†, University of Sussex‡, University of Warwick∗∗

July 30, 2018


SLIDE 2

Motivation: Large-scale Posterior Inference

Example: Bayesian logistic regression

1. Unknown parameter vector: $\beta \sim \mathcal{N}(0, I)$
2. Fixed covariate vector: $v_l \in \mathbb{R}^d$ for each datapoint $l = 1, \dots, L$
3. Binary class label: $Y_l \mid v_l, \beta \overset{\text{ind}}{\sim} \mathrm{Ber}\big(\frac{1}{1 + e^{-\langle \beta, v_l \rangle}}\big)$

The generative model is simple to express, but the posterior distribution over the unknown parameters is complex: the normalization constant is unknown, and exact integration is intractable.

Standard inferential approach: use Markov chain Monte Carlo (MCMC) to (eventually) draw samples from the posterior distribution.

Benefit: approximates intractable posterior expectations $\mathbb{E}_P[h(Z)] = \int_{\mathcal{X}} p(x) h(x)\, dx$ with asymptotically exact sample estimates $\mathbb{E}_{Q_n}[h(X)] = \frac{1}{n} \sum_{i=1}^n h(x_i)$.

Problem: each new MCMC sample point $x_i$ requires iterating over the entire observed dataset: prohibitive when the dataset is large! (A sketch of this model's score function appears after this slide.)

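The posterior score $\nabla \log p(\beta) = -\beta + \sum_l v_l (y_l - \sigma(\langle \beta, v_l \rangle))$ is computable without the normalization constant. A minimal sketch in Python (my own illustration with synthetic data, not code from the talk; labels here are coded in {0, 1}):

```python
import numpy as np

def log_posterior_score(beta, V, y):
    """Score of the Bayesian logistic regression posterior with N(0, I) prior.

    V: (L, d) covariate matrix; y: (L,) labels in {0, 1}; beta: (d,) parameters.
    Returns grad log p(beta | data); the unknown normalizer drops out.
    """
    probs = 1.0 / (1.0 + np.exp(-V @ beta))  # Ber(1 / (1 + e^{-<beta, v_l>}))
    return -beta + V.T @ (y - probs)         # prior score + likelihood score

# Synthetic data drawn from the generative model above.
rng = np.random.default_rng(0)
L, d = 1000, 5
V = rng.normal(size=(L, d))
beta_true = rng.normal(size=d)
y = (rng.random(L) < 1.0 / (1.0 + np.exp(-V @ beta_true))).astype(float)
print(log_posterior_score(np.zeros(d), V, y))
```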

SLIDE 3

Motivation: Large-scale Posterior Inference

Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?

MCMC benefit: approximates intractable posterior expectations $\mathbb{E}_P[h(Z)] = \int_{\mathcal{X}} p(x) h(x)\, dx$ with asymptotically exact sample estimates $\mathbb{E}_{Q_n}[h(X)] = \frac{1}{n} \sum_{i=1}^n h(x_i)$.

Problem: each point $x_i$ requires iterating over the entire dataset!

Template solution: approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

  • Approximate the standard MCMC procedure in a manner that uses only a small subset of datapoints per sample.
  • Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance.
  • Introduces asymptotic bias: the target distribution is not stationary.
  • Hope: for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced.


SLIDE 4

Motivation: Large-scale Posterior Inference

Template solution: Approximate MCMC with subset posteriors

[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Hope that for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced.

This introduces new challenges:

  • How do we compare and evaluate samples from approximate MCMC procedures?
  • How do we select samplers and their tuning parameters?
  • How do we quantify the bias-variance trade-off explicitly?

Difficulty: standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias.

This talk: introduce a new quality measure suitable for comparing the quality of approximate MCMC samples.


SLIDE 5

Quality Measures for Samples

Challenge: develop a measure suitable for comparing the quality of any two samples approximating a common target distribution.

Given:

  • A continuous target distribution $P$ with support $\mathcal{X} = \mathbb{R}^d$ (later relaxed to any convex set) and density $p$; $p$ is known only up to normalization, and integration under $P$ is intractable.
  • Sample points $x_1, \dots, x_n \in \mathcal{X}$; we make no assumption about the provenance of the $x_i$.

Define the discrete distribution $Q_n$ with, for any function $h$, $\mathbb{E}_{Q_n}[h(X)] = \frac{1}{n} \sum_{i=1}^n h(x_i)$, used to approximate $\mathbb{E}_P[h(Z)]$.

Goal: quantify how well $\mathbb{E}_{Q_n}$ approximates $\mathbb{E}_P$ in a manner that

  I. detects when a sample sequence is converging to the target,
  II. detects when a sample sequence is not converging to the target,
  III. is computationally feasible.


SLIDE 6

Integral Probability Metrics

Goal: quantify how well $\mathbb{E}_{Q_n}$ approximates $\mathbb{E}_P$.

Idea: consider an integral probability metric (IPM) [Müller, 1997]

$$d_{\mathcal{H}}(Q_n, P) = \sup_{h \in \mathcal{H}} |\mathbb{E}_{Q_n}[h(X)] - \mathbb{E}_P[h(Z)]|$$

  • Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions $\mathcal{H}$.
  • When $\mathcal{H}$ is sufficiently large, convergence of $d_{\mathcal{H}}(Q_n, P)$ to zero implies that $(Q_n)_{n \ge 1}$ converges weakly to $P$ (Requirement II).

Examples:

  • Total variation distance: $\mathcal{H} = \{h : \sup_x |h(x)| \le 1\}$
  • Wasserstein (or Kantorovich-Rubinstein) distance $d_{\mathcal{W}_{\|\cdot\|}}$: $\mathcal{H} = \mathcal{W}_{\|\cdot\|} = \{h : \sup_{x \neq y} \frac{|h(x) - h(y)|}{\|x - y\|} \le 1\}$


SLIDE 7

Integral Probability Metrics

Goal: quantify how well $\mathbb{E}_{Q_n}$ approximates $\mathbb{E}_P$.

Idea: consider an integral probability metric (IPM) [Müller, 1997]

$$d_{\mathcal{H}}(Q_n, P) = \sup_{h \in \mathcal{H}} |\mathbb{E}_{Q_n}[h(X)] - \mathbb{E}_P[h(Z)]|$$

Problem: integration under $P$ is intractable! ⇒ Most IPMs cannot be computed in practice.

Idea: only consider functions with $\mathbb{E}_P[h(Z)]$ known a priori to be 0. Then the IPM computation depends only on $Q_n$!

  • How do we select this class of test functions?
  • Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
  • How do we solve the resulting optimization problem in practice?


SLIDE 8

Stein’s Method

Stein's method [1972] provides a recipe for controlling convergence:

1. Identify an operator $\mathcal{T}$ and a set $\mathcal{G}$ of functions $g : \mathcal{X} \to \mathbb{R}^d$ with $\mathbb{E}_P[(\mathcal{T} g)(Z)] = 0$ for all $g \in \mathcal{G}$. Together, $\mathcal{T}$ and $\mathcal{G}$ define the Stein discrepancy [Gorham and Mackey, 2015]

$$S(Q_n, \mathcal{T}, \mathcal{G}) \triangleq \sup_{g \in \mathcal{G}} |\mathbb{E}_{Q_n}[(\mathcal{T} g)(X)]| = d_{\mathcal{T}\mathcal{G}}(Q_n, P),$$

an IPM-type measure with no explicit integration under $P$.

2. Lower bound $S(Q_n, \mathcal{T}, \mathcal{G})$ by a reference IPM $d_{\mathcal{H}}(Q_n, P)$ ⇒ $S(Q_n, \mathcal{T}, \mathcal{G}) \to 0$ only if $(Q_n)_{n \ge 1}$ converges to $P$ (Requirement II). Performed once, in advance, for large classes of distributions.

3. Upper bound $S(Q_n, \mathcal{T}, \mathcal{G})$ by any means necessary to demonstrate convergence to 0 (Requirement I).

Standard use: as an analytical tool to prove convergence.
Our goal: develop the Stein discrepancy into a practical quality measure.


SLIDE 9

Identifying a Stein Operator T

Goal: identify an operator $\mathcal{T}$ for which $\mathbb{E}_P[(\mathcal{T} g)(Z)] = 0$ for all $g \in \mathcal{G}$.

Approach: the generator method of Barbour [1988, 1990] and Götze [1991]

  • Identify a Markov process $(Z_t)_{t \ge 0}$ with stationary distribution $P$.
  • Under mild conditions, its infinitesimal generator $(\mathcal{A}u)(x) = \lim_{t \to 0} (\mathbb{E}[u(Z_t) \mid Z_0 = x] - u(x))/t$ satisfies $\mathbb{E}_P[(\mathcal{A}u)(Z)] = 0$.

Overdamped Langevin diffusion: $dZ_t = \frac{1}{2} \nabla \log p(Z_t)\, dt + dW_t$

Generator: $(\mathcal{A}_P u)(x) = \frac{1}{2} \langle \nabla u(x), \nabla \log p(x) \rangle + \frac{1}{2} \langle \nabla, \nabla u(x) \rangle$

Stein operator: $(\mathcal{T}_P g)(x) \triangleq \langle g(x), \nabla \log p(x) \rangle + \langle \nabla, g(x) \rangle$ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]

  • Depends on $P$ only through $\nabla \log p$; computable even if $p$ cannot be normalized!
  • $\mathbb{E}_P[(\mathcal{T}_P g)(Z)] = 0$ for all $g : \mathcal{X} \to \mathbb{R}^d$ in the classical Stein set

$$\mathcal{G}_{\|\cdot\|} = \Big\{ g : \sup_{x \neq y} \max\Big( \|g(x)\|^*, \|\nabla g(x)\|^*, \frac{\|\nabla g(x) - \nabla g(y)\|^*}{\|x - y\|} \Big) \le 1 \Big\}$$

A numerical check of this mean-zero property is sketched below.

Stein’s Method for Sample Quality July 30, 2018 9 / 32

SLIDE 10

Detecting Convergence and Non-convergence

Goal: show that the classical Stein discrepancy $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|}) \to 0$ if and only if $(Q_n)_{n \ge 1}$ converges to $P$.

In the univariate case ($d = 1$), it is known that for many targets $P$, $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|}) \to 0$ only if the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|}}(Q_n, P) \to 0$ [Stein, Diaconis, Holmes, and Reinert, 2004, Chatterjee and Shao, 2011, Chen, Goldstein, and Shao, 2011]. Few multivariate targets have been analyzed (see [Reinert and Röllin, 2009, Chatterjee and Meckes, 2008, Meckes, 2009] for the multivariate Gaussian).

New contribution [Gorham, Duncan, Vollmer, and Mackey, 2016]:

Theorem (Stein Discrepancy-Wasserstein Equivalence). If the Langevin diffusion couples at an integrable rate and $\nabla \log p$ is Lipschitz, then $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|}) \to 0 \Leftrightarrow d_{\mathcal{W}_{\|\cdot\|}}(Q_n, P) \to 0$.

  • Examples: strongly log-concave $P$, Bayesian logistic regression or robust Student's $t$ regression with Gaussian priors, Gaussian mixtures.
  • The conditions are not necessary: the proof provides a template for bounding $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|})$.


SLIDE 11

Computing Stein Discrepancies

Question: how do we compute a Stein discrepancy $S(Q_n, \mathcal{T}_P, \mathcal{G}) = \sup_{g \in \mathcal{G}} |\mathbb{E}_{Q_n}[(\mathcal{T}_P g)(X)]|$ in practice?

Consider the classical Stein discrepancy optimization problem

$$S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|}) = \sup_g \frac{1}{n} \sum_{i=1}^n \langle g(x_i), \nabla \log p(x_i) \rangle + \langle \nabla, g(x_i) \rangle$$
$$\text{s.t.}\quad \|g(x)\|^* \le 1,\ \|\nabla g(x)\|^* \le 1\ \forall x \in \mathcal{X}, \qquad \|\nabla g(x) - \nabla g(y)\|^* \le \|x - y\|\ \forall x, y \in \mathcal{X}.$$

  • The objective depends only on the values of $g$ and $\nabla g$ at the $n$ sample points $x_i$.
  • This is an infinite-dimensional problem with an infinitude of constraints.

Idea: find an alternative Stein set $\mathcal{G}$ with equivalent convergence properties and only finitely many constraints.


SLIDE 12

Graph Stein Discrepancies

For any graph $G = (V, E)$ with vertices $V = \{x_1, \dots, x_n\}$, define the graph Stein set $\mathcal{G}_{\|\cdot\|, Q_n, G}$ of functions $g : \mathcal{X} \to \mathbb{R}^d$ with

  • boundedness constraints imposed only at the points $x_i$, and
  • smoothness constraints imposed only between pairs $(x_i, x_k) \in E$.

Benefit: the optimization problem has order $|V| + |E|$ constraints.

Proposition (Equivalence of Classical and Complete-Graph Stein Discrepancies). If $\mathcal{X} = \mathbb{R}^d$ and $G_1$ is the complete graph on $\{x_1, \dots, x_n\}$, then

$$S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|}) \le S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n, G_1}) \le \kappa_d\, S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|})$$

for $\kappa_d > 0$ depending only on the dimension $d$ and the norm $\|\cdot\|$.

  • Follows from the Whitney-Glaeser extension theorem [Glaeser, 1958].
  • $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n, G_1})$ inherits the convergence properties of the classical discrepancy.

Problem: the complete graph introduces order $n^2$ constraints!


SLIDE 13

Spanner Stein Discrepancies

Goal: find an equivalent Stein discrepancy with only $O(n)$ constraints.

Approach: geometric spanners [Chew, 1986, Peleg and Schäffer, 1989]. For a dilation factor $t \ge 1$, a $t$-spanner $G = (V, E)$ has

  • the weight $\|x - y\|$ on each edge $(x, y) \in E$, and
  • a path with total weight $\le t\|x - y\|$ between each pair $(x, y) \in V^2$.

Proposition (Equivalence of Spanner and Complete-Graph Stein Discrepancies). If $\mathcal{X} = \mathbb{R}^d$, $G_1$ is the complete graph on $\{x_1, \dots, x_n\}$, and $G_t$ is a $t$-spanner on $\{x_1, \dots, x_n\}$, then

$$1 \le \frac{S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n, G_t})}{S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n, G_1})} \le 2t^2.$$

  • For $t = 2$, a spanner with $O(\kappa_d n)$ edges can be computed in $O(\kappa_d n \log n)$ expected time [Har-Peled and Mendel, 2006].
  • We fix $t = 2$ and use the efficient greedy spanner implementation of Bouts, ten Brink, and Buchin [2014] in our experiments; a simplified greedy construction is sketched below.

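A minimal sketch of the classic greedy $t$-spanner construction (my own quadratic-time illustration, not the Bouts, ten Brink, and Buchin [2014] implementation used in the experiments):

```python
import heapq
import numpy as np

def greedy_spanner(points, t=2.0):
    """Keep an edge only if the current spanner lacks a path of weight <= t * d(i, j)."""
    n = len(points)
    dist = lambda i, j: np.linalg.norm(points[i] - points[j])
    cand = sorted((dist(i, j), i, j) for i in range(n) for j in range(i + 1, n))
    adj = [[] for _ in range(n)]  # adjacency list of accepted edges

    def graph_dist(src, dst, cutoff):
        # Dijkstra that abandons paths longer than the cutoff.
        best = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > best.get(u, np.inf):
                continue  # stale heap entry
            for v, w in adj[u]:
                nd = d + w
                if nd <= cutoff and nd < best.get(v, np.inf):
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
        return np.inf

    edges = []
    for w, i, j in cand:  # examine candidate edges in increasing length
        if graph_dist(i, j, t * w) > t * w:
            adj[i].append((j, w)); adj[j].append((i, w))
            edges.append((i, j))
    return edges

pts = np.random.default_rng(0).normal(size=(50, 2))
print(len(greedy_spanner(pts)), "edges vs", 50 * 49 // 2, "in the complete graph")
```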

SLIDE 14

Decoupled Linear Programs

Norm recommendation: $\|\cdot\| = \|\cdot\|_1$. The optimization problem then decouples across the components $g_j$: we can solve $d$ subproblems in parallel, and each subproblem is a linear program.

Recommended spanner Stein discrepancy algorithm:

1. Compute a 2-spanner $G_2$ on $V = \{x_1, \dots, x_n\}$ with edge set $E$.
2. Solve $d$ finite-dimensional linear programs in parallel:

$$\sum_{j=1}^d\ \sup_{\gamma_j \in \mathbb{R}^n,\, \Gamma_j \in \mathbb{R}^{d \times n}} \frac{1}{n} \sum_{i=1}^n \gamma_{ji} \nabla_j \log p(x_i) + \Gamma_{jji}$$

subject to $\|\gamma_j\|_\infty \le 1$, $\|\Gamma_j\|_\infty \le 1$, and, for all $i \neq l$ with $(x_i, x_l) \in E$,

$$\max\Big( \frac{|\gamma_{ji} - \gamma_{jl}|}{\|x_i - x_l\|_1}, \frac{\|\Gamma_j (e_i - e_l)\|_\infty}{\|x_i - x_l\|_1} \Big) \le 1,$$
$$\max\Big( \frac{|\gamma_{ji} - \gamma_{jl} - \langle \Gamma_j e_i, x_i - x_l \rangle|}{\frac{1}{2}\|x_i - x_l\|_1^2}, \frac{|\gamma_{ji} - \gamma_{jl} - \langle \Gamma_j e_l, x_i - x_l \rangle|}{\frac{1}{2}\|x_i - x_l\|_1^2} \Big) \le 1.$$

Here $\gamma_{ji} = g_j(x_i)$ and $\Gamma_{jki} = \nabla_k g_j(x_i)$. A one-dimensional sketch of this linear program appears below.

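A minimal sketch of the resulting linear program for $d = 1$ (my own illustration using SciPy's `linprog`, not the solvers from the paper; in one dimension the chain over the sorted points is an exact 1-spanner, so consecutive pairs serve as the edge set):

```python
import numpy as np
from scipy.optimize import linprog

def spanner_stein_discrepancy_1d(x, score):
    """Graph Stein discrepancy in 1-d; gamma_i = g(x_i), Gamma_i = g'(x_i)."""
    x = np.sort(x)
    n = len(x)
    s = score(x)  # grad log p at the sample points

    # z = [gamma; Gamma]; maximize (1/n) sum_i gamma_i s_i + Gamma_i,
    # i.e., minimize the negated objective (linprog minimizes).
    c = -np.concatenate([s, np.ones(n)]) / n

    rows, rhs = [], []
    def abs_leq(coeffs, bound):  # encode |coeffs . z| <= bound as two rows
        rows.append(coeffs); rhs.append(bound)
        rows.append(-coeffs); rhs.append(bound)

    for i in range(n - 1):  # chain edges (x_i, x_{i+1})
        l, dx = i + 1, x[i + 1] - x[i]
        e = np.zeros(2 * n); e[i], e[l] = 1.0, -1.0
        abs_leq(e.copy(), dx)                  # |g(x_i) - g(x_l)| <= dx
        eG = np.zeros(2 * n); eG[n + i], eG[n + l] = 1.0, -1.0
        abs_leq(eG, dx)                        # |g'(x_i) - g'(x_l)| <= dx
        for k in (i, l):                       # Taylor compatibility at both ends:
            row = e.copy(); row[n + k] = dx    # |g(x_i)-g(x_l)-g'(x_k)(x_i-x_l)| <= dx^2/2
            abs_leq(row, 0.5 * dx ** 2)

    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(-1.0, 1.0)] * (2 * n), method="highs")
    return -res.fun

rng = np.random.default_rng(0)
print(spanner_stein_discrepancy_1d(rng.standard_normal(200), lambda t: -t))  # P = N(0, 1)
```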

SLIDE 15

A Simple Example

[Figure: graph Stein discrepancy versus number of sample points $n$ (100 to 30,000) for the Gaussian and scaled Student's $t$ samples, alongside the samples and the functions $g$ and $h = \mathcal{T}g$ at $n$ = 300, 3,000, 30,000.]

For the target $P = \mathcal{N}(0, 1)$, compare an i.i.d. $\mathcal{N}(0, 1)$ sample $Q_n$ to a scaled Student's $t$ sample $Q_n'$ with matching variance.

Expect $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n, G_1}) \to 0$ and $S(Q_n', \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n', G_1}) \not\to 0$.


SLIDE 16

A Simple Example

[Figure: left, graph Stein discrepancy versus number of sample points $n$; middle, the recovered optimal functions $g$; right, the associated test functions $h = \mathcal{T}_P g$, each for the Gaussian and scaled Student's $t$ samples at $n$ = 300, 3,000, 30,000.]

Middle: recovered optimal functions $g$. Right: associated test functions $h(x) \triangleq (\mathcal{T}_P g)(x)$ which best discriminate the sample $Q_n$ from the target $P$.


SLIDE 17

A Simple Constrained Example

[Figure: Stein discrepancy versus number of sample points $n$ (100 to 3,000) for the Uniform and Beta samples, alongside scatter plots of the two samples over $(x_1, x_2) \in [0, 1]^2$.]

For the two-dimensional target $P = \mathrm{Unif}(0,1) \times \mathrm{Unif}(0,1)$, compare an i.i.d. $\mathrm{Unif}(0,1) \times \mathrm{Unif}(0,1)$ sample $Q_n$ to an i.i.d. $\mathrm{Beta}(3,3) \times \mathrm{Beta}(3,3)$ sample $Q_n'$.


SLIDE 18

A Simple Constrained Example

[Figure: left, Stein discrepancy versus number of sample points $n$; middle, heatmaps of the recovered optimal function components $g_1, g_2$; right, heatmaps of the associated test functions $h = \mathcal{T}_P g$, for the Uniform and Beta samples.]

Middle: recovered optimal functions $g$. Right: associated test functions $h(x) \triangleq (\mathcal{T}_P g)(x)$ which best discriminate the sample $Q_n$ from the target $P$.


SLIDE 19

Comparing Discrepancies

Setup: draw $n = 30{,}000$ points i.i.d. from $\mathcal{N}(0, 1)$ or $\mathrm{Unif}[0, 1]$, yielding the sample $Q_n$, and compare the behavior of the classical and graph Stein discrepancies.

  • When $d = 1$, the classical Stein discrepancy solves a finite-dimensional convex quadratically constrained quadratic program with $O(n)$ variables, $O(n)$ constraints, and a linear objective [Gorham and Mackey, 2015].
  • Compare to the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|}}(Q_n, P) = \int_{\mathbb{R}} |Q_n(t) - P(t)|\, dt$, where $Q_n$ and $P$ here denote cumulative distribution functions (a sketch of this computation follows the slide).
  • The smoothness constants (Stein factors) can be adjusted so that the Stein discrepancies are directly lower bounded by the Wasserstein distance.
  • For the uniform target, the classical Stein discrepancy equals the Wasserstein distance.
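A minimal sketch of the univariate Wasserstein computation (my own illustration, approximating the CDF integral on a finite grid):

```python
import numpy as np
from scipy import stats

def wasserstein_to_target(x, target_cdf, grid_size=10_000):
    """Approximate int |F_n(t) - F(t)| dt between the empirical and target CDFs."""
    t = np.linspace(x.min() - 1.0, x.max() + 1.0, grid_size)
    emp_cdf = np.searchsorted(np.sort(x), t, side="right") / len(x)
    return np.sum(np.abs(emp_cdf - target_cdf(t))) * (t[1] - t[0])

x = np.random.default_rng(0).standard_normal(30_000)
print(wasserstein_to_target(x, stats.norm.cdf))  # decays toward 0 as n grows
```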

SLIDE 20

Comparing Discrepancies

Orange = Classical Stein, Blue = Graph Stein, Green = Wasserstein

[Figure: discrepancy value versus number of sample points $n$ (100 to 30,000) on log-log axes, with one panel per seed (7, 8, 9) for the Gaussian (top) and Uniform (bottom) targets.]


SLIDE 21

Selecting Sampler Hyperparameters

Target posterior density: $p(x) \propto \pi(x) \prod_{l=1}^L \pi(y_l \mid x)$, with prior $\pi(x)$ and likelihood $\pi(y \mid x)$.

Stochastic Gradient Langevin Dynamics (SGLD) [Welling and Teh, 2011]:

$$x_{k+1} \sim \mathcal{N}\Big( x_k + \frac{\epsilon}{2} \Big( \nabla \log \pi(x_k) + \frac{L}{|B_k|} \sum_{l \in B_k} \nabla \log \pi(y_l \mid x_k) \Big),\ \epsilon I \Big)$$

An approximate MCMC procedure designed for scalability (sketched in code below):

  • Approximates the Metropolis-adjusted Langevin algorithm and the continuous-time Langevin diffusion.
  • A random subset $B_k$ of datapoints is used to generate each sample.
  • No Metropolis-Hastings correction step, so the target $P$ is not the stationary distribution.

The choice of step size $\epsilon$ is critical for accurate inference: too small ⇒ slow mixing; too large ⇒ sampling from a very different distribution. Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias.
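A minimal sketch of one SGLD update (my own illustration; `grad_log_prior` and `grad_log_lik` are assumed placeholders for the model's score functions):

```python
import numpy as np

def sgld_step(x, eps, y, batch_size, grad_log_prior, grad_log_lik, rng):
    """One SGLD update: Langevin drift with an unbiased minibatch score estimate."""
    L = len(y)
    batch = rng.choice(L, size=batch_size, replace=False)
    drift = grad_log_prior(x) + (L / batch_size) * sum(
        grad_log_lik(y[l], x) for l in batch)
    # Gaussian noise with variance eps; no Metropolis-Hastings correction.
    return x + 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(np.shape(x))
```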

SLIDE 22

Selecting Sampler Hyperparameters

Setup [Welling and Teh, 2011]: consider the posterior distribution $P$ induced by datapoints $y_l$ drawn i.i.d. from the Gaussian mixture likelihood

$$Y_l \mid X \overset{\text{iid}}{\sim} \tfrac{1}{2}\mathcal{N}(X_1, 2) + \tfrac{1}{2}\mathcal{N}(X_1 + X_2, 2)$$

under independent Gaussian priors on the parameters $X \in \mathbb{R}^2$: $X_1 \sim \mathcal{N}(0, 10)$, independent of $X_2 \sim \mathcal{N}(0, 1)$.

  • Draw $m = 100$ datapoints $y_l$ with parameters $(x_1, x_2) = (0, 1)$; this induces a posterior with a second mode at $(x_1, x_2) = (1, -1)$.
  • For a range of step sizes $\epsilon$, use SGLD with batch size 10 to draw an approximate posterior sample $Q_n$ of size $n = 1000$.
  • Use the minimum Stein discrepancy to select an appropriate $\epsilon$.
  • Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation.
  • Compute the median of each diagnostic over 50 random SGLD sequences.

SLIDE 23

Selecting Sampler Hyperparameters

[Figure: left, log median diagnostic (ESS and spanner Stein discrepancy) versus step size $\epsilon$; right, SGLD samples $(x_1, x_2)$ at step sizes $\epsilon = 5 \times 10^{-5}$, $5 \times 10^{-3}$, and $5 \times 10^{-2}$.]

  • ESS is maximized at step size $\epsilon = 5 \times 10^{-2}$.
  • The Stein discrepancy is minimized at step size $\epsilon = 5 \times 10^{-3}$.
  • Right panels: ESS values 2.6, 12.3, 14.8; Stein discrepancies 19.0, 1.5, 16.7.


SLIDE 24

Quantifying a Bias-Variance Trade-off

Target posterior density: $p(x) \propto \pi(x) \prod_{l=1}^L \pi(y_l \mid x)$, with prior $\pi(x)$ and likelihood $\pi(y \mid x)$.

Approximate Random Walk Metropolis-Hastings (ARWMH) [Korattikara, Chen, and Welling, 2014]: an approximate MCMC procedure designed for scalability.

  • Uses Gaussian random walk proposals: $x_{k+1} \sim \mathcal{N}(x_k, \sigma^2 I)$.
  • Approximates the Metropolis-Hastings correction using a random subset of datapoints to accept or reject each proposal. Exact MH accepts with probability

$$\min\Big( 1,\ \frac{\pi(x_{k+1}) \prod_{l=1}^L \pi(y_l \mid x_{k+1})}{\pi(x_k) \prod_{l=1}^L \pi(y_l \mid x_k)} \Big)$$

(a sketch of this exact step appears below).

  • A tolerance parameter $\epsilon$ controls the number of datapoints considered: larger $\epsilon$ ⇒ fewer datapoints considered, fewer likelihood computations, more rapid sampling, more rapid variance reduction; smaller $\epsilon$ ⇒ closer approximation to the true MH correction, less bias in the stationary distribution.

Question: can we quantify this bias-variance trade-off explicitly?

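A minimal sketch of the exact RWMH step that ARWMH approximates (my own illustration in log space; `log_prior` and `log_lik` are assumed placeholders for the model):

```python
import numpy as np

def rwmh_step(x, sigma, y, log_prior, log_lik, rng):
    """Exact Metropolis-Hastings accept/reject; ARWMH replaces the full-data
    log-likelihood sum with a sequential test over a random data subset."""
    prop = x + sigma * rng.standard_normal(np.shape(x))
    log_ratio = (log_prior(prop) - log_prior(x)
                 + sum(log_lik(yl, prop) - log_lik(yl, x) for yl in y))
    return prop if np.log(rng.random()) < log_ratio else x
```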

SLIDE 25

Quantifying a Bias-Variance Trade-off

Setup:

  • Nodal dataset [Canty and Ripley, 2015]: 53 patients, 6 predictors, binary response indicating whether cancer has spread from the prostate to surrounding lymph nodes.
  • Bayesian logistic regression posterior $P$: $L$ independent observations $(y_l, v_l) \in \{1, -1\} \times \mathbb{R}^d$ with $\mathbb{P}(Y_l = 1 \mid v_l, X) = 1/(1 + \exp(-\langle v_l, X \rangle))$ and a Gaussian prior on the parameters $X \in \mathbb{R}^d$: $X \sim \mathcal{N}(0, I)$.
  • Compare ARWMH ($\epsilon = 0.1$, batch size 2) to exact RWMH.
  • Each chain was run until $10^5$ likelihood evaluations were computed; the spanner Stein discrepancy was computed after a burn-in of $10^3$ likelihood computations and thinning down to 1,000 samples.
  • Expect ARWMH quality as a function of likelihood evaluations to dominate initially and RWMH quality to overtake eventually.
  • For external support, also compute the deviation between various expectations under $Q_n$ and under a MALA chain with $10^7$ samples.

SLIDE 26

Quantifying a Bias-Variance Trade-off

[Figure: spanner Stein discrepancy, normalized probability error, mean error, and second moment error versus number of likelihood evaluations, for $\epsilon = 0$ (exact RWMH) and $\epsilon = 0.1$ (ARWMH).]

  • The non-Stein measures are based on an additional, long-running chain used as a surrogate for the target distribution.
  • The Stein discrepancy is computed from the sample $Q_n$ alone.


SLIDE 27

Assessing Convergence Rates

An observation: the approximating distribution $Q_n$ in $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|, Q_n, G})$ need not be based on a random sample, so the Stein discrepancy is meaningful even for deterministic pseudosamples (e.g., from quasi-Monte Carlo or herding).

Known convergence rates:

  • Independent sampling: $\mathbb{E}[|\mathbb{E}_{Q_n}[h(X)] - \mathbb{E}_P[h(Z)]|] = O(1/\sqrt{n})$ for bounded-variance $h$.
  • Sobol sequence [Sobol, 1967]: $d_{\mathcal{H}}(Q_n, P) = O(\log^{d-1}(n)/n)$ for bounded total variation $h$.
  • Kernel herding [Chen, Welling, and Smola, 2010]: $d_{\mathcal{H}}(Q_n, P) = O(1/n)$ for a finite-dimensional Hilbert space $\mathcal{H}$ and $O(1/\sqrt{n})$ for an infinite-dimensional Hilbert space $\mathcal{H}$; the rate is often better in practice (without theoretical explanation).


SLIDE 28

Assessing Convergence Rates

Setup [Bach, Lacoste-Julien, and Obozinski, 2012]: target $P = \mathrm{Unif}[0, 1]$. Draw $n = 200$ points

  • i.i.d. from $\mathrm{Unif}[0, 1]$ (repeated 50 times),
  • from a Sobol sequence, and
  • from a herding sequence with Hilbert space $\mathcal{H}$ defined by the norm $\|h\|_{\mathcal{H}}^2 = \int_0^1 (h'(x))^2\, dx$.

Compare the median Stein discrepancy decay across the three samplers, and assess the convergence rate with a best-fit line on a log-log plot (a minimal version of this fit is sketched below).
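A minimal sketch of the rate estimate (my own illustration; the `ns` and `discrepancies` values are hypothetical stand-ins for the measured medians):

```python
import numpy as np

ns = np.array([10, 30, 100, 200])
discrepancies = np.array([0.30, 0.11, 0.033, 0.017])  # hypothetical medians
# Fit log(discrepancy) ~ slope * log(n) + intercept; slope is the rate exponent.
slope, _ = np.polyfit(np.log(ns), np.log(discrepancies), deg=1)
print(f"estimated convergence rate: n^{slope:.2f}")
```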

SLIDE 29

Assessing Convergence Rates

[Figure: median Stein discrepancy versus number of sample points $n$ on log-log axes, with best-fit rates Herding $\propto n^{-0.96}$, Independent $\propto n^{-0.49}$, Sobol $\propto n^{-1}$.]

  • Stein discrepancy convergence for the deterministic sequences, kernel herding [Chen, Welling, and Smola, 2010] and Sobol [Sobol, 1967], versus an i.i.d. sample sequence for $P = \mathrm{Unif}(0, 1)$.
  • The estimated rates for the i.i.d. and Sobol sequences accord with the expected $O(1/\sqrt{n})$ and $O(1/n)$ rates from the literature.
  • The herding rate outpaces its best known $O(1/\sqrt{n})$ bound [Bach, Lacoste-Julien, and Obozinski, 2012]: an opportunity for sharper analysis?


SLIDE 30

Future Directions

Many opportunities for future development:

1. Developing tailored Stein program solvers that exploit problem structure for greater scalability.
   • The LP constraint matrices are very sparse and, at times, banded.
   • Leverage stochastic optimization to avoid expensive summations in the Stein program objective, e.g., $\nabla \log p(x_i) = \nabla \log \pi(x_i) + \sum_{l=1}^L \nabla \log \pi(y_l \mid x_i)$.
   • Improve scalability with first-order methods?
2. Establishing reference IPM lower bounds for the Stein discrepancy. For what other families of distributions $P$ does $S(Q_n, \mathcal{T}_P, \mathcal{G}_{\|\cdot\|}) \to 0$ imply $d_{\mathcal{W}_{\|\cdot\|}}(Q_n, P) \to 0$?
3. Exploring the impact of the Stein operator choice. An infinite number of operators $\mathcal{T}$ characterize $P$. How is the discrepancy impacted? How do we select the best $\mathcal{T}$?
4. Addressing other inferential tasks: design of control variates [Oates, Girolami, and Chopin, 2014, Oates and Girolami, 2015] and one-sample testing [Chwialkowski, Strathmann, and Gretton, 2016, Liu, Lee, and Jordan, 2016].

SLIDE 31

References I

  • S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.
  • F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proc. 29th ICML, ICML'12, 2012.
  • A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. A celebration of applied probability.
  • A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990.
  • Q. W. Bouts, A. P. ten Brink, and K. Buchin. A framework for computing the greedy spanner. In Proc. 30th SOCG, pages 11:11–11:19, New York, NY, 2014. ACM.
  • A. Canty and B. Ripley. boot: Bootstrap R (S-Plus) Functions, 2015. R package version 1.3-15.
  • S. Chatterjee and E. Meckes. Multivariate normal approximation using exchangeable pairs. ALEA Lat. Am. J. Probab. Math. Stat., 4:257–283, 2008.
  • S. Chatterjee and Q. Shao. Nonnormal approximation by Stein's method of exchangeable pairs with application to the Curie-Weiss model. Ann. Appl. Probab., 21(2):464–483, 2011.
  • L. Chen, L. Goldstein, and Q. Shao. Normal approximation by Stein's method. Probability and its Applications. Springer, Heidelberg, 2011.
  • Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In UAI, 2010.
  • P. Chew. There is a planar graph almost as good as the complete graph. In Proc. 2nd SOCG, pages 169–177, New York, NY, 1986. ACM.
  • K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, 2016.
  • G. Glaeser. Étude de quelques algèbres tayloriennes. J. Analyse Math., 6:1–124; erratum, insert to 6 (1958), no. 2, 1958.
  • J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.
  • J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.
  • F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.


SLIDE 32

References II

  • S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148–1184, 2006.
  • A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, ICML'14, 2014.
  • Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, volume 48, pages 276–284, 2016.
  • L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.
  • E. Meckes. On Stein's method for multivariate normal approximation. In High dimensional probability V: the Luminy volume, volume 5 of Inst. Math. Stat. Collect., pages 153–178. Inst. Math. Statist., Beachwood, OH, 2009.
  • A. Müller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.
  • C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. arXiv:1501.03379, 2015.
  • C. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. arXiv:1410.2392, Oct. 2014.
  • C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. J. R. Stat. Soc. Ser. B (Statistical Methodology), 2016.
  • D. Peleg and A. Schäffer. Graph spanners. J. Graph Theory, 13(1):99–116, 1989.
  • G. Reinert and A. Röllin. Multivariate normal approximation with Stein's method of exchangeable pairs under a general linearity condition. Ann. Probab., 37(6):2150–2173, 2009.
  • I. Sobol. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Comput. Math. Math. Phys., 7(4):86–112, 1967.
  • C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.
  • C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., Beachwood, OH, 2004.
  • M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
