SLIDE 1

Measuring Sample Quality with Kernels

Lester Mackey∗

Joint work with Jackson Gorham†

Microsoft Research∗, Opendoor Labs†

June 25, 2018

SLIDE 2

Motivation: Large-scale Posterior Inference

Example: Bayesian logistic regression

1. Fixed covariate vector: v_l ∈ R^d for each datapoint l = 1, . . . , L
2. Unknown parameter vector: β ∼ N(0, I)
3. Binary class label: Y_l | v_l, β ~(ind) Ber(1 / (1 + e^{−⟨β, v_l⟩}))

Generative model is simple to express, but the posterior distribution over the unknown parameters is complex: the normalization constant is unknown, and exact integration is intractable

Standard inferential approach: Use Markov chain Monte Carlo (MCMC) to (eventually) draw samples from the posterior distribution

Benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) Σ_{i=1}^n h(x_i)

Problem: Each new MCMC sample point x_i requires iterating over the entire observed dataset: prohibitive when the dataset is large!
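As a concrete illustration, here is a minimal Python sketch of this generative model and of the plain Monte Carlo estimator E_Q[h(X)] = (1/n) Σ_i h(x_i). The MCMC draws themselves are assumed to be supplied by some sampler; all names and sizes are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 1000, 5                              # illustrative dataset size and dimension
V = rng.standard_normal((L, d))             # fixed covariate vectors v_l
beta = rng.standard_normal(d)               # beta ~ N(0, I)
probs = 1.0 / (1.0 + np.exp(-V @ beta))     # P(Y_l = 1 | v_l, beta)
Y = rng.binomial(1, probs)                  # binary class labels

def mc_estimate(h, draws):
    """Sample estimate E_Q[h(X)] = (1/n) sum_i h(x_i) from MCMC draws
    (an (n, d) array assumed to come from some posterior sampler)."""
    return np.mean([h(x) for x in draws], axis=0)
```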

SLIDE 3

Motivation: Large-scale Posterior Inference

Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?

MCMC Benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) Σ_{i=1}^n h(x_i)

Problem: Each point x_i requires iterating over the entire dataset!

Template solution: Approximate MCMC with subset posteriors
[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Approximate the standard MCMC procedure in a manner that uses only a small subset of datapoints per sample
Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
Introduces asymptotic bias: the target distribution is not stationary
Hope: for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced

SLIDE 4

Motivation: Large-scale Posterior Inference

Template solution: Approximate MCMC with subset posteriors

[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Hope: for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced

Introduces new challenges:
How do we compare and evaluate samples from approximate MCMC procedures?
How do we select samplers and their tuning parameters?
How do we quantify the bias-variance trade-off explicitly?

Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias

This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples

SLIDE 5

Quality Measures for Samples

Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution

Given:
Continuous target distribution P with support X = R^d and density p
  p known only up to normalization; integration under P is intractable
Sample points x_1, . . . , x_n ∈ X
  Define the discrete distribution Q_n with, for any function h, E_{Q_n}[h(X)] = (1/n) Σ_{i=1}^n h(x_i), used to approximate E_P[h(Z)]
  We make no assumption about the provenance of the x_i

Goal: Quantify how well EQn approximates EP in a manner that

  • I. Detects when a sample sequence is converging to the target
  • II. Detects when a sample sequence is not converging to the target
  • III. Is computationally feasible

SLIDE 6

Integral Probability Metrics

Goal: Quantify how well E_{Q_n} approximates E_P

Idea: Consider an integral probability metric (IPM) [Müller, 1997]

    d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|

Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H

When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies (Q_n)_{n≥1} converges weakly to P (Requirement II)

Examples:
Bounded Lipschitz (or Dudley) metric, d_{BL_{‖·‖}}: H = BL_{‖·‖} := {h : sup_x |h(x)| + sup_{x≠y} |h(x) − h(y)|/‖x − y‖ ≤ 1}
Wasserstein (or Kantorovich-Rubinstein) distance, d_{W_{‖·‖}}: H = W_{‖·‖} := {h : sup_{x≠y} |h(x) − h(y)|/‖x − y‖ ≤ 1}

SLIDE 7

Integral Probability Metrics

Goal: Quantify how well E_{Q_n} approximates E_P

Idea: Consider an integral probability metric (IPM) [Müller, 1997]

    d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|

Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies (Q_n)_{n≥1} converges weakly to P (Requirement II)

Problem: Integration under P is intractable! ⇒ Most IPMs cannot be computed in practice

Idea: Only consider functions with E_P[h(Z)] known a priori to be 0
Then the IPM computation only depends on Q_n!

How do we select this class of test functions?
Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
How do we solve the resulting optimization problem in practice?

SLIDE 8

Stein’s Method

Stein's method [1972] provides a recipe for controlling convergence:

1. Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G. Together, T and G define the Stein discrepancy [Gorham and Mackey, 2015]

    S(Q_n, T, G) := sup_{g ∈ G} |E_{Q_n}[(T g)(X)]| = d_{TG}(Q_n, P),

an IPM-type measure with no explicit integration under P.

2. Lower bound S(Q_n, T, G) by a reference IPM d_H(Q_n, P) ⇒ S(Q_n, T, G) → 0 only if (Q_n)_{n≥1} converges to P (Requirement II). Performed once, in advance, for large classes of distributions.

3. Upper bound S(Q_n, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I).

Standard use: as an analytical tool to prove convergence
Our goal: develop the Stein discrepancy into a practical quality measure

SLIDE 9

Identifying a Stein Operator T

Goal: Identify an operator T for which E_P[(T g)(Z)] = 0 for all g ∈ G

Approach: Generator method of Barbour [1988, 1990], Götze [1991]

Identify a Markov process (Z_t)_{t≥0} with stationary distribution P. Under mild conditions, its infinitesimal generator

    (Au)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x)) / t

satisfies E_P[(Au)(Z)] = 0

Overdamped Langevin diffusion: dZ_t = (1/2) ∇ log p(Z_t) dt + dW_t
Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇ log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩
Stein operator: (T_P g)(x) := ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩
[Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]

Depends on P only through ∇ log p; computable even if p cannot be normalized!

Multivariate generalization of the density method operator (T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
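To make the Stein identity concrete, here is a small numpy sketch (a hypothetical illustration, not code from the talk) that applies the Langevin Stein operator to a simple g and checks E_P[(T_P g)(Z)] ≈ 0 by Monte Carlo for a standard normal target, whose score is ∇ log p(x) = −x.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200_000
Z = rng.standard_normal((n, d))            # Z ~ P = N(0, I_d)

def stein_apply(g, div_g, Z):
    """Langevin Stein operator (T_P g)(x) = <g(x), grad log p(x)> + <grad, g(x)>
    for the standard normal target, whose score is grad log p(x) = -x."""
    return np.sum(g(Z) * (-Z), axis=1) + div_g(Z)

# Example test function g(x) = tanh(x) applied coordinatewise;
# its divergence is sum_j (1 - tanh(x_j)^2).
g = np.tanh
div_g = lambda Z: np.sum(1.0 - np.tanh(Z) ** 2, axis=1)

print(stein_apply(g, div_g, Z).mean())     # ~0, illustrating E_P[(T_P g)(Z)] = 0
```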

SLIDE 10

Identifying a Stein Set G

Goal: Identify a set G for which E_P[(T_P g)(Z)] = 0 for all g ∈ G

Approach: Reproducing kernels k : X × X → R

A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (Σ_{i,l} c_i c_l k(z_i, z_l) ≥ 0 for all z_i ∈ X, c_i ∈ R)
  Gaussian kernel: k(x, y) = e^{−(1/2)‖x − y‖²₂}
  Inverse multiquadric kernel: k(x, y) = (1 + ‖x − y‖²₂)^{−1/2}
Each kernel generates a reproducing kernel Hilbert space (RKHS) K_k

We define the kernel Stein set G_{k,‖·‖} as the vector-valued g with
  Each component g_j in K_k
  Component norms ‖g_j‖_{K_k} jointly bounded by 1

E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,‖·‖} under mild conditions [Gorham and Mackey, 2017]

SLIDE 11

Computing the Kernel Stein Discrepancy

Kernel Stein discrepancy (KSD): S(Q_n, T_P, G_{k,‖·‖})

Stein operator: (T_P g)(x) := ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩
Stein set: G_{k,‖·‖} := {g = (g_1, . . . , g_d) | ‖v‖* ≤ 1 for v_j := ‖g_j‖_{K_k}}

Benefit: Computable in closed form [Gorham and Mackey, 2017]

    S(Q_n, T_P, G_{k,‖·‖}) = ‖w‖ for w_j := (Σ_{i,i′=1}^n k_0^j(x_i, x_{i′}) / n²)^{1/2}

Reduces to parallelizable pairwise evaluations of the Stein kernels

    k_0^j(x, y) := (1/(p(x)p(y))) ∇_{x_j} ∇_{y_j} (p(x) k(x, y) p(y))

Stein set choice inspired by the control functional kernels k_0 = Σ_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]
When ‖·‖ = ‖·‖₂, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016], Liu, Lee, and Jordan [2016]
To ease notation, we will write G_k := G_{k,‖·‖₂} in the remainder of the talk
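The closed form above is straightforward to implement. Below is a minimal numpy sketch (an illustration under the ℓ2 Stein set, not the authors' code) of the summed Stein kernel k_0 = Σ_j k_0^j for the IMQ base kernel k(x, y) = (c² + ‖x − y‖²₂)^β, expanded in terms of the score ∇ log p, together with the resulting KSD. The same k_0 matrix is reused in later sketches in this deck.

```python
import numpy as np

def imq_stein_kernel(X, score, c=1.0, beta=-0.5):
    """n x n matrix of summed Stein kernel values k_0(x_i, x_{i'}) for the IMQ
    base kernel k(x, y) = (c^2 + ||x - y||_2^2)^beta. `X` is (n, d); `score(X)`
    returns the (n, d) array of rows grad log p(x_i) (no normalizer needed)."""
    S = score(X)                                  # score evaluations (n, d)
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences (n, n, d)
    r2 = np.sum(diff ** 2, axis=2)                # squared distances (n, n)
    base = c ** 2 + r2
    k = base ** beta                              # IMQ kernel matrix
    d = X.shape[1]
    # sum_j d^2 k / dx_j dy_j for the IMQ kernel
    trace = -2 * beta * d * base ** (beta - 1) \
            - 4 * beta * (beta - 1) * r2 * base ** (beta - 2)
    coef = 2 * beta * base ** (beta - 1)          # radial derivative factor
    sx_dot = np.sum(S[:, None, :] * diff, axis=2)   # <score(x_i), x_i - x_{i'}>
    sy_dot = np.sum(S[None, :, :] * diff, axis=2)   # <score(x_{i'}), x_i - x_{i'}>
    return (S @ S.T) * k - coef * sx_dot + coef * sy_dot + trace

def imq_ksd(X, score, c=1.0, beta=-0.5):
    """l2 kernel Stein discrepancy of the equal-weight sample X."""
    k0 = imq_stein_kernel(X, score, c=c, beta=beta)
    return np.sqrt(k0.sum()) / X.shape[0]

# Example: i.i.d. draws from the target P = N(0, I_2), whose score is -x
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(imq_ksd(X, lambda X: -X))
```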

SLIDE 12

Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges to P

Let P be the set of targets P with Lipschitz ∇ log p and distant strong log concavity (⟨∇ log(p(x)/p(y)), y − x⟩ / ‖x − y‖²₂ ≥ k whenever ‖x − y‖₂ ≥ r)
  Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...

For a different Stein set G, Gorham, Duncan, Vollmer, and Mackey [2016] showed that (Q_n)_{n≥1} converges to P if P ∈ P and S(Q_n, T_P, G) → 0

New contribution [Gorham and Mackey, 2017]:

Theorem (Univariate KSD detects non-convergence)
Suppose P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.

Justifies the use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case

SLIDE 13

The Importance of Kernel Choice

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P

Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017])
Suppose d ≥ 3, P = N(0, I_d), and α := (1/2 − 1/d)^{−1}. If k(x, y) and its derivatives decay at an o(‖x − y‖₂^{−α}) rate as ‖x − y‖₂ → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.

Gaussian kernels (k(x, y) = e^{−(1/2)‖x − y‖²₂}) and Matérn kernels fail for d ≥ 3
Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖²₂)^β) with β < −1 fail for d > 2β/(1 + β)
The violating sample sequences (Q_n)_{n≥1} are simple to construct

Problem: Kernels with light tails ignore excess mass in the tails

SLIDE 14

The Importance of Tightness

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(‖X‖₂ > R(ε)) ≤ ε
  Intuitively, no mass in the sequence escapes to infinity

Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017])
Suppose that P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

Good news, but, ideally, the KSD would detect non-tight sequences automatically...

SLIDE 15

Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c² + ‖x − y‖²₂)^β for some β < 0, c ∈ R
  The IMQ KSD fails to detect non-convergence when β < −1
  However, the IMQ KSD automatically enforces tightness and detects non-convergence when β ∈ (−1, 0)

Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017])
Suppose P ∈ P and k(x, y) = (c² + ‖x − y‖²₂)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

No extra assumptions on the sample sequence (Q_n)_{n≥1} are needed
Intuition: slow decay rate of the kernel ⇒ unbounded (coercive) test functions in T_P G_k ⇒ non-tight sequences detected

SLIDE 16

Detecting Convergence

Goal: Show S(Q_n, T_P, G_k) → 0 when Q_n converges to P

Proposition (KSD detects convergence [Gorham and Mackey, 2017])
If k ∈ C_b^{(2,2)} and ∇ log p is Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W_{‖·‖₂}}(Q_n, P) → 0.

Covers Gaussian, Matérn, IMQ, and other common bounded kernels k

SLIDE 17

A Simple Example

[Figure: discrepancy value vs. number of sample points n (left) and recovered functions g and h = T_P g (right), for an i.i.d. sample from the mixture target P and an i.i.d. sample from a single mixture component.]

Left plot: For the target p(x) ∝ e^{−(1/2)(x+1.5)²} + e^{−(1/2)(x−1.5)²}, compare an i.i.d. sample Q_n from P with an i.i.d. sample Q′_n from one component
  Expect S(Q_{1:n}, T_P, G_k) → 0 and S(Q′_{1:n}, T_P, G_k) ↛ 0

Compare the IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance
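As a usage illustration (a hypothetical sketch reusing the imq_ksd function from the earlier code, not the experiment code from the talk), one can reproduce the flavor of this comparison: the KSD of the on-target sample shrinks as n grows, while the KSD of the single-component sample does not.

```python
import numpy as np

def mixture_score(X):
    """grad log p for p(x) ∝ exp(-(x+1.5)^2/2) + exp(-(x-1.5)^2/2), rowwise."""
    a = np.exp(-0.5 * (X + 1.5) ** 2)
    b = np.exp(-0.5 * (X - 1.5) ** 2)
    return (-(X + 1.5) * a - (X - 1.5) * b) / (a + b)

rng = np.random.default_rng(2)
n = 1000
means = rng.choice([-1.5, 1.5], size=(n, 1))      # pick a mixture component
Q = means + rng.standard_normal((n, 1))           # i.i.d. from the mixture target
Q_single = -1.5 + rng.standard_normal((n, 1))     # i.i.d. from one component only

print(imq_ksd(Q, mixture_score), imq_ksd(Q_single, mixture_score))
```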

SLIDE 18

A Simple Example

[Same figure as on the previous slide.]

Right plot: For n = 10³ sample points, (top) the recovered optimal Stein functions g and (bottom) the associated test functions h := T_P g which best discriminate the sample Q_n from the target P

SLIDE 19

The Importance of Kernel Choice

[Figure: kernel Stein discrepancy vs. number of sample points n for Gaussian, Matérn, and inverse multiquadric kernels in dimensions d = 5, 8, 20, comparing an i.i.d. on-target sample with an off-target sample.]

Target P = N(0, I_d)
Off-target Q_n has all ‖x_i‖₂ ≤ 2 n^{1/d} log n and ‖x_i − x_j‖₂ ≥ 2 log n
Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P
The IMQ KSD (β = −1/2, c = 1) does not have this deficiency

SLIDE 20

Selecting Sampler Hyperparameters

Target posterior density: p(x) ∝ π(x) Π_{l=1}^L π(y_l | x), with prior π(x) and likelihood π(y | x)

Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
  Approximate MCMC procedure designed for scalability
  Uses a random subset of datapoints to approximate each slice sampling step
  Target P is not the stationary distribution
  Tolerance parameter ε controls the number of datapoints evaluated
    ε too small ⇒ too few sample points generated
    ε too large ⇒ sampling from a very different distribution

Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias

SLIDE 21

Selecting Sampler Hyperparameters

Setup [Welling and Teh, 2011]

Consider the posterior distribution P induced by L = 100 datapoints y_l drawn i.i.d. from the Gaussian mixture likelihood

    Y_l | X ~(iid) (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2)

under independent Gaussian priors on the parameters X ∈ R²: X_1 ∼ N(0, 10) and X_2 ∼ N(0, 1)

The datapoints y_l are drawn with parameters (x_1, x_2) = (0, 1), inducing a posterior with a second mode at (x_1, x_2) = (1, −1)

For a range of tolerance parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample Q_n
Use the minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε
Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
Compute the median of each diagnostic over 50 random sequences
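In code, the selection rule is simply "run the sampler at each tolerance under a fixed budget, compute the KSD of each output, and keep the minimizer." The sketch below is hypothetical: run_approx_slice_sampler and log_posterior_score stand in for a real sampler implementation and the posterior score function, and imq_ksd is the function from the earlier sketch.

```python
import numpy as np

def select_tolerance(tolerances, budget=148_000):
    """Pick the tolerance whose sample minimizes the IMQ KSD (lower is better)."""
    ksd_by_eps = {}
    for eps in tolerances:
        # Hypothetical call: returns the (n, 2) array of sample points produced
        # within `budget` datapoint likelihood evaluations.
        X = run_approx_slice_sampler(eps, budget)
        ksd_by_eps[eps] = imq_ksd(X, log_posterior_score, c=1.0, beta=-0.5)
    best = min(ksd_by_eps, key=ksd_by_eps.get)
    return best, ksd_by_eps
```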

SLIDE 22

Selecting Sampler Hyperparameters

[Figure: ESS (higher is better) and KSD (lower is better) as functions of the tolerance parameter ε, with scatter plots in the (x_1, x_2) plane of the samples obtained at ε = 0 (n = 230), ε = 10⁻² (n = 416), and ε = 10⁻¹ (n = 1000).]

ESS is maximized at tolerance ε = 10⁻¹; the IMQ KSD is minimized at tolerance ε = 10⁻²

SLIDE 23

Selecting Samplers

Target posterior density: p(x) ∝ π(x) Π_{l=1}^L π(y_l | x), with prior π(x) and likelihood π(y | x)

Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]
  Approximate MCMC procedure designed for scalability
  Approximates the Metropolis-adjusted Langevin algorithm and a continuous-time Langevin diffusion with preconditioner
  A random subset of datapoints is used to select each sample
  No Metropolis-Hastings correction step
  Target P is not the stationary distribution

Two variants:
  SGFS-f inverts a d × d matrix for each new sample point
  SGFS-d inverts a diagonal matrix to reduce sampling time

SLIDE 24

Selecting Samplers

Setup: MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
  10000 images, 51 features, binary label indicating whether the image is of a 7 or a 9

Bayesian logistic regression posterior P
  L independent observations (y_l, v_l) ∈ {1, −1} × R^d with P(Y_l = 1 | v_l, X) = 1/(1 + exp(−⟨v_l, X⟩))
  Flat improper prior on the parameters X ∈ R^d

Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10⁵ sample points and discarding the first half as burn-in
For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10⁵ sample points [Ahn, Korattikara, and Welling, 2012]

SLIDE 25

Selecting Samplers

[Figure: IMQ kernel Stein discrepancy vs. number of sample points n for SGFS-d and SGFS-f (left), and SGFS sample scatter plots with bivariate marginal means and 95% confidence ellipses for the best- and worst-aligned coordinate pairs of each sampler (right).]

Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points (n = 5 × 10⁴) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with the surrogate ground truth sample (red)
Both suggest that the small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the loss in inferential accuracy

SLIDE 26

Beyond Sample Quality Comparison

Goodness-of-fit testing

Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Q_n, T_P, G_k) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])
The test with the default Gaussian kernel k experienced a considerable loss of power as the dimension d increased
We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)
  For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, where z_i ~(iid) N(0, I_d) and u_i ~(iid) Unif[0, 1]. Target P = N(0, I_d).
  Compare with the standard normality test of Baringhaus and Henze [1988]

Table: Mean power of multivariate normality tests across 400 simulations

             d=2    d=5    d=10   d=15   d=20   d=25
  B&H        1.0    1.0    1.0    0.91   0.57   0.26
  Gaussian   1.0    1.0    0.88   0.29   0.12   0.02
  IMQ        1.0    1.0    1.0    1.0    1.0    1.0
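As a rough illustration of how such a test can be calibrated, here is a hedged sketch for i.i.d. samples using a multiplier (wild) bootstrap of the degenerate V-statistic, reusing imq_stein_kernel from the earlier sketch. The published tests use their own calibration procedures; this is only an assumed, simplified stand-in.

```python
import numpy as np

def ksd_gof_pvalue(X, score, n_boot=400, c=1.0, beta=-0.5, seed=0):
    """Bootstrap p-value for H0: the rows of X are i.i.d. draws from the target
    whose score function is `score`. Test statistic is n * KSD^2."""
    rng = np.random.default_rng(seed)
    k0 = imq_stein_kernel(X, score, c=c, beta=beta)
    n = X.shape[0]
    stat = k0.sum() / n
    boot = np.empty(n_boot)
    for b in range(n_boot):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher multipliers
        boot[b] = eps @ k0 @ eps / n
    return float(np.mean(boot >= stat))

# Perturbed sample from the slide's experiment: x_i = z_i + u_i * e_1
rng = np.random.default_rng(3)
d, n = 10, 500
X = rng.standard_normal((n, d))
X[:, 0] += rng.uniform(0.0, 1.0, size=n)
print(ksd_gof_pvalue(X, lambda X: -X))               # small p-value: reject N(0, I_d)
```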

SLIDE 27

Beyond Sample Quality Comparison

Improving sample quality

Given sample points (x_i)_{i=1}^n, we can minimize the KSD S(Q̃_n, T_P, G_k) over all weighted samples Q̃_n = Σ_{i=1}^n q_n(x_i) δ_{x_i}, where q_n is a probability mass function

Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−(1/h)‖x − y‖²₂}, with the bandwidth h set to the median of the squared Euclidean distance between pairs of sample points

We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)‖x − y‖²₂)^{−1/2}
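Since the squared KSD is a quadratic form in the weights, q ↦ q⊤K₀q with K₀ the n × n Stein kernel matrix, the optimal weights solve a small quadratic program over the probability simplex. Below is a minimal sketch (using scipy's general-purpose SLSQP solver rather than the authors' method, and the imq_stein_kernel function from the earlier sketch).

```python
import numpy as np
from scipy.optimize import minimize

def ksd_optimal_weights(k0):
    """Minimize q^T k0 q over the probability simplex, where k0 is the n x n
    Stein kernel matrix; returns the KSD-minimizing sample weights."""
    n = k0.shape[0]
    res = minimize(
        lambda q: q @ k0 @ q,
        np.full(n, 1.0 / n),                        # start from equal weights
        jac=lambda q: 2.0 * (k0 @ q),
        bounds=[(0.0, 1.0)] * n,                    # q_i >= 0
        constraints=({'type': 'eq', 'fun': lambda q: q.sum() - 1.0},),
        method='SLSQP',
    )
    return res.x
```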

SLIDE 28

Improving Sample Quality

[Figure: average MSE ‖E_P[Z] − E_{Q̃_n}[X]‖²₂ / d vs. dimension d for the initial sample Q_n and the Gaussian and IMQ KSD reweightings.]

MSE averaged over 500 simulations (±2 standard errors)
Target P = N(0, I_d)
Starting sample Q_n = (1/n) Σ_{i=1}^n δ_{x_i} for x_i ~(iid) P, n = 100

SLIDE 29

Future Directions

Many opportunities for future development

1. Improve KSD scalability while maintaining convergence control
   Inexpensive approximations of the kernel matrix
   Subsampling of likelihood terms in ∇ log p

2. Addressing other inferential tasks
   Control variate design [Oates, Girolami, and Chopin, 2016]
   Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
   Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]

3. Exploring the impact of Stein operator choice
   An infinite number of operators T characterize P. How is the discrepancy impacted? How do we select the best T?
   Thm: If ∇ log p is bounded and k ∈ C^{(1,1)}, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
   Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be appropriate for heavy tails

SLIDE 30

References I

S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.
A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. A celebration of applied probability.
A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990.
L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, 2016.
C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pages 185–193, 2014.
J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.
J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv:1703.01717, Mar. 2017.
J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.
F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.
A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, ICML'14, 2014.
Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv:1612.00081, 2016.
Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.
Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, Aug. 2016.
Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, volume 48, pages 276–284, 2016.
L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.
A. Müller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

SLIDE 31

References II

C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pages 4237–4246, 2017.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.
C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., Beachwood, OH, 2004.
D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, Nov. 2016.
M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

SLIDE 32

Comparing Discrepancies

[Figure: discrepancy value vs. number of sample points n for the IMQ KSD, the graph Stein discrepancy, and the Wasserstein distance (left), and discrepancy computation time in seconds vs. n for d = 1 and d = 4 (right).]

Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(1/2)(x+1.5)²} + e^{−(1/2)(x−1.5)²} or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.
