

SLIDE 1

Minimum Stein Discrepancy Estimators

François-Xavier Briol
University of Cambridge & The Alan Turing Institute
ICML Workshop on “Stein’s Method for Machine Learning and Statistics”
15th June 2019

SLIDE 2

Collaborators

  • Alessandro Barp (ICL)
  • Andrew Duncan (ICL)
  • Mark Girolami (U. Cambridge)
  • Lester Mackey (Microsoft)

Barp, A., Briol, F-X., Duncan, A., Girolami, M., Mackey, L. (2019). Minimum Stein Discrepancy Estimators. (preprint available at https://fxbriol.github.io)

SLIDE 3

Statistical Inference for Unnormalised Models

Motivation: Suppose we observe some data {x_1, ..., x_n}. Given a parametric family of distributions {P_θ : θ ∈ Θ} with densities denoted p_θ, we seek the θ* ∈ Θ which best approximates the empirical distribution

    Q_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}.

Challenge: For complex models, we often only have access to the likelihood in unnormalised form:

    p_\theta(x) = \frac{\tilde{p}_\theta(x)}{C},

where C > 0 is unknown and \tilde{p}_\theta can be evaluated pointwise. Examples include models of natural images, large graphical models, deep energy models, etc.
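
For concreteness, a deep energy model is a typical instance: \tilde{p}_\theta(x) = exp(−E_θ(x)) for some parametrised energy E_θ, which can be evaluated pointwise even though C = ∫ exp(−E_θ(x)) dx is intractable. A minimal sketch, where the toy quadratic energy is a hypothetical stand-in for, e.g., a neural network:

```python
import numpy as np

def energy(x, theta):
    # Hypothetical energy E_theta(x); in practice this could be a deep network.
    return 0.5 * np.sum(theta * x ** 2)

def unnormalised_density(x, theta):
    # p~_theta(x) = exp(-E_theta(x)); the normalising constant C is unknown,
    # so only ratios and log-gradients of p_theta are accessible.
    return np.exp(-energy(x, theta))
```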


SLIDE 6

Minimum Discrepancy Estimators

Let D be a function such that D(Q || P_θ) ≥ 0 measures the discrepancy between a distribution Q and P_θ. We say that \hat{\theta}_n ∈ Θ is a minimum discrepancy estimator if

    \hat{\theta}_n \in \mathrm{argmin}_{\theta \in \Theta} \, D(Q_n \| P_\theta).

This includes, but is not limited to:

  1. KL divergence, or other Bregman divergences
  2. Wasserstein distance, or Sinkhorn divergence
  3. Maximum Mean Discrepancy
  4. ...

Question: Which discrepancy should we use for unnormalised models?
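
Schematically, every minimum discrepancy estimator follows the same recipe. A minimal sketch (the names are illustrative, and the grid search stands in for the gradient-based optimisation used in practice):

```python
import numpy as np

def minimum_discrepancy_estimate(discrepancy, data, theta_grid):
    # discrepancy(data, theta) should return an estimate of D(Q_n || P_theta);
    # the estimator is simply the minimiser over the parameter space.
    values = [discrepancy(data, theta) for theta in theta_grid]
    return theta_grid[int(np.argmin(values))]
```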


SLIDE 9

Score Matching Estimators

The score matching estimator [Hyvärinen, 2006] is based on the Fisher divergence:

    SM(Q \| P_\theta) := \int_{\mathcal{X}} \| \nabla \log q(x) - \nabla \log p_\theta(x) \|_2^2 \, Q(dx)
                       = \int_{\mathcal{X}} \big( \| \nabla \log p_\theta(x) \|_2^2 + 2 \Delta \log p_\theta(x) \big) \, Q(dx) + Z,

where Z ∈ ℝ is independent of θ. This is one of the most competitive methods to date, with applications to inference in natural images, deep energy models and directional statistics.

Several Failure Modes: This approach requires second-order derivatives and struggles with heavy-tailed data [Swersky, 2011].
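
To make the second form concrete, here is a minimal sketch for the (hypothetical) isotropic Gaussian location model p_θ = N(θ, I), where ∇_x log p_θ(x) = θ − x and Δ_x log p_θ(x) = −d, so the objective never touches the normalising constant:

```python
import numpy as np

def score_matching_objective(X, theta):
    # Empirical version of the second form of SM(Q || P_theta), up to the
    # theta-independent constant Z: mean of ||grad log p||_2^2 + 2 * Laplacian(log p).
    n, d = X.shape
    score = theta - X                                    # grad_x log p_theta(x_i), shape (n, d)
    return np.mean(np.sum(score ** 2, axis=1)) - 2 * d   # Laplacian is -d for N(theta, I)

# Minimising over theta recovers the sample mean (the MLE in this model):
X = np.random.default_rng(0).normal(loc=1.5, size=(500, 2))
grid = np.linspace(0.0, 3.0, 61)
best = min(grid, key=lambda t: score_matching_objective(X, np.full(2, t)))
print(best)  # close to 1.5
```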


SLIDE 12

Minimum Stein Discrepancy Estimators

Let Γ(𝒴) := {f : 𝒳 → 𝒴}. A function class 𝒢 ⊂ Γ(ℝ^d) is a Stein class, with corresponding Stein operator S_{P_θ} : 𝒢 ⊂ Γ(ℝ^d) → Γ(ℝ^d), if

    \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[f] \, dP_\theta = 0 \quad \forall f \in \mathcal{G}.

This leads to the notion of Stein discrepancy (SD) [Gorham, 2015]:

    SD_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta)
      := \sup_{f \in \mathcal{S}_{P_\theta}[\mathcal{G}]} \Big| \int_{\mathcal{X}} f \, dP_\theta - \int_{\mathcal{X}} f \, dQ \Big|
       = \sup_{g \in \mathcal{G}} \Big| \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g] \, dQ \Big|,

on which we base our minimum Stein discrepancy estimators:

    \hat{\theta}_n \in \mathrm{argmin}_{\theta \in \Theta} \, SD_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q_n \| P_\theta).
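
The canonical example is the Langevin Stein operator: for sufficiently regular g with p_θ g vanishing at the boundary of 𝒳, the Stein identity follows from a single application of the divergence theorem (a standard computation, sketched here):

    \mathcal{S}_{P_\theta}[g](x)
      := \nabla \log p_\theta(x) \cdot g(x) + \nabla \cdot g(x)
       = \tfrac{1}{p_\theta(x)} \nabla \cdot \big( p_\theta(x) \, g(x) \big),
    \qquad
    \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g] \, \mathrm{d}P_\theta
      = \int_{\mathcal{X}} \nabla \cdot ( p_\theta \, g ) \, \mathrm{d}x = 0.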


SLIDE 14

Score Matching Estimators are Minimum Stein Discrepancy Estimators

Consider the Stein operator

    \mathcal{S}^{m}_{p}[g] := \frac{1}{p_\theta} \nabla \cdot (p_\theta g)

and the Stein class

    \mathcal{G} = \big\{ g = (g_1, \ldots, g_d) \in C^1(\mathcal{X}, \mathbb{R}^d) \cap L^2(\mathcal{X}; Q) : \| g \|_{L^2(\mathcal{X}; Q)} \leq 1 \big\}.

In this case, the Stein discrepancy is the score matching divergence: SD_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta) = SM(Q \| P_\theta). Our paper also shows that several other popular estimators for unnormalised models, including contrastive divergence and minimum probability flow, are minimum SD estimators.
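
To see why this operator recovers score matching: integrating by parts under Q (assuming q g vanishes at the boundary),

    \int_{\mathcal{X}} \mathcal{S}^{m}_{p}[g] \, \mathrm{d}Q
      = \int_{\mathcal{X}} \big( \nabla \log p_\theta \cdot g + \nabla \cdot g \big) \, \mathrm{d}Q
      = \int_{\mathcal{X}} \big( \nabla \log p_\theta - \nabla \log q \big) \cdot g \, \mathrm{d}Q,

and taking the supremum over the unit ball of L²(𝒳; Q) gives ‖∇ log p_θ − ∇ log q‖_{L²(𝒳;Q)}, whose square is exactly the Fisher divergence SM(Q ‖ P_θ).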


SLIDE 17

Minimum Diffusion Kernel Stein Discrepancy Estimators

More general Stein operators were considered in [Gorham, 2016]:

    \mathcal{S}^{m}_{p_\theta}[g] := \frac{1}{p_\theta} \nabla \cdot (p_\theta m g), \qquad
    \mathcal{S}^{m}_{p_\theta}[A] := \frac{1}{p_\theta} \nabla \cdot (p_\theta m A),

where g ∈ Γ(ℝ^d), A ∈ Γ(ℝ^{d×d}), and m ∈ Γ(ℝ^{d×d}). Taking 𝒢 to be the unit ball of a vector-valued RKHS ℋ_K, we get a diffusion kernel Stein discrepancy, which generalises the KSD:

    DKSD_{K,m}(Q \| P)^2 = \int_{\mathcal{X}} \int_{\mathcal{X}} k^0(x, y) \, dQ(x) \, dQ(y),

where (the superscripts 1, 2 indicating action on the first and second arguments of K)

    k^0(x, y) := \mathcal{S}^{m,2}_{p} \mathcal{S}^{m,1}_{p} K(x, y)
               = \frac{1}{p(x) p(y)} \nabla_y \cdot \nabla_x \cdot \big( p(x) \, m(x) K(x, y) m(y)^\top p(y) \big).

SLIDE 18

Diffusion Kernel Stein Discrepancy Estimators

We therefore end up with the following estimators:

    \hat{\theta}^{DKSD}_n \in \mathrm{argmin}_{\theta \in \Theta} \, DKSD_{K,m}(Q_n \| P_\theta)^2,
    \quad \text{where} \quad
    DKSD_{K,m}(Q_n \| P_\theta)^2 = \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} k^0(x_i, x_j).

Proposition (DKSD as statistical divergence): Suppose K is IPD and in the Stein class of Q, and m(x) is invertible. If ∇ log p − ∇ log q ∈ L¹(Q), then DKSD_{K,m}(Q ‖ P)² = 0 iff Q = P.

Proposition (IPD matrix kernels): (i) Let K = diag(k_1, ..., k_d). Then K is IPD iff each kernel k_i is IPD. (ii) Let K = Bk for B symmetric positive definite. Then K is IPD iff k is IPD.
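
As a minimal sketch of the U-statistic in the simplest special case m(x) = I and K = k I_d, with a scalar Gaussian RBF kernel k, k⁰ reduces to the usual Langevin KSD Stein kernel (the kernel and bandwidth choices below are illustrative):

```python
import numpy as np

def stein_kernel(x, y, score, sigma=1.0):
    # Langevin Stein kernel k0(x, y) for the RBF base kernel k(x, y) = exp(-||x-y||^2 / (2 sigma^2)):
    # k0 = k * [ s(x).s(y) + (x-y).(s(x)-s(y))/sigma^2 + d/sigma^2 - ||x-y||^2/sigma^4 ],
    # where s = grad log p.
    d = x.size
    diff = x - y
    k = np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))
    sx, sy = score(x), score(y)
    return k * (np.dot(sx, sy)
                + np.dot(diff, sx - sy) / sigma ** 2
                + d / sigma ** 2
                - np.dot(diff, diff) / sigma ** 4)

def ksd_squared_ustat(X, score, sigma=1.0):
    # U-statistic: 2 / (n(n-1)) * sum_{i < j} k0(x_i, x_j)
    n = X.shape[0]
    total = sum(stein_kernel(X[i], X[j], score, sigma)
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))

# Example: data from a standard Gaussian, whose score is grad log p(x) = -x;
# the statistic concentrates near zero because Q = P here.
X = np.random.default_rng(1).normal(size=(200, 2))
print(ksd_squared_ustat(X, score=lambda x: -x))
```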


SLIDE 21

Consistency & Asymptotic Normality

Theorem (Consistency and Asymptotic Normality of DKSD): Under smoothness and integrability conditions on K, m and θ ↦ P_θ and their derivatives, \hat{\theta}^{DKSD}_n converges to θ* a.s. Furthermore,

    \sqrt{n} \big( \hat{\theta}^{DKSD}_n - \theta^* \big) \to \mathcal{N}\big( 0, \, g^{-1}_{DKSD}(\theta^*) \, \Sigma \, g^{-1}_{DKSD}(\theta^*) \big),

where

    \Sigma = \int_{\mathcal{X}} \Big( \int_{\mathcal{X}} \nabla_{\theta^*} k^0(x, y) \, dQ(y) \Big) \Big( \int_{\mathcal{X}} \nabla_{\theta^*} k^0(x, z) \, dQ(z) \Big)^\top dQ(x)

and

    g_{DKSD}(\theta)_{ij} = \int_{\mathcal{X}} \int_{\mathcal{X}} \big( \nabla_x \partial_{\theta_j} \log p_\theta \big)^\top m_\theta(x) K(x, y) m_\theta(y)^\top \big( \nabla_y \partial_{\theta_i} \log p_\theta \big) \, dP_\theta(x) \, dP_\theta(y).

Important Remark: The choice of kernel K and diffusion matrix m will have a significant impact on the performance of these estimators!


SLIDE 23

Robustness of DKSD

The influence function describes infinitesimal corruption of the data and is given by IF(z, Q) := ∂_t θ_{Q_t} |_{t=0} when it exists, where Q_t = (1 − t)Q + tδ_z for t ∈ [0, 1]. An estimator is said to be bias robust if IF(z, Q) is bounded in z.

Proposition (Robustness of DKSD estimators): The influence function of DKSD is given by

    IF(z, P_\theta) = g_{DKSD}(\theta)^{-1} \int_{\mathcal{X}} \nabla_\theta k^0(z, y) \, dP_\theta(y).

In particular, there are various conditions on m and K which can guarantee that sup_{z ∈ 𝒳} IF(z, P_θ) < ∞.

Important Remark: Once again, carefully choosing K and m can lead to good robustness properties.
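
The definition also suggests a direct numerical check: re-weight the sample so that mass t sits on a contamination point z, and difference the resulting estimates. A minimal sketch (the weighted sample mean is purely a hypothetical stand-in estimator; its influence function z − E_Q[X] is unbounded in z, so it is not bias robust):

```python
import numpy as np

def influence_function(estimator, X, z, t=1e-3):
    # Finite-difference approximation of IF(z, Q_n) = d/dt theta_{(1-t) Q_n + t delta_z} |_{t=0}.
    n = len(X)
    theta_q = estimator(X, np.full(n, 1.0 / n))        # estimate under Q_n
    Xz = np.vstack([X, z])                             # append the contamination point z
    w = np.append(np.full(n, (1.0 - t) / n), t)        # weights of (1 - t) Q_n + t delta_z
    return (estimator(Xz, w) - theta_q) / t

weighted_mean = lambda X, w: w @ X                     # hypothetical stand-in estimator
X = np.random.default_rng(2).normal(size=(100, 1))
print(influence_function(weighted_mean, X, z=np.array([[50.0]])))  # grows linearly in z
```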


SLIDE 26

Implementation of Minimum DKSD Estimators

In order to implement our DKSD estimators \hat{\theta}^{DKSD}_n ∈ argmin_{θ ∈ Θ} DKSD_{K,m}(Q_n ‖ P_θ)², we propose to use stochastic optimisation. In particular, we can exploit the geometry induced by DKSD to obtain an efficient algorithm akin to stochastic natural gradient descent [Amari, 1998]:

    \hat{\theta}_{t+1} = \hat{\theta}_t - \gamma_t \, g^{-1}_{DKSD}(\theta_t) \, \nabla_{\theta_t} DKSD(Q_n \| P_\theta)^2,

which approximates the gradient flow

    \dot{\theta}(t) = -g^{-1}_{DKSD}(\theta(t)) \, \nabla_\theta DKSD(Q \| P_{\theta(t)})^2,

using U-statistic estimates of the metric tensor and gradient.
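
A minimal sketch of the preconditioned update (the loss and metric are generic callables; finite-difference gradients stand in for the U-statistic estimates of ∇_θ DKSD² and g_DKSD used in the paper):

```python
import numpy as np

def natural_gradient_descent(loss, metric, theta0, n_steps=200, gamma=0.1, eps=1e-5):
    # theta_{t+1} = theta_t - gamma * g(theta_t)^{-1} * grad loss(theta_t)
    theta = np.asarray(theta0, dtype=float)
    for t in range(n_steps):
        grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                         for e in np.eye(theta.size)])  # central-difference gradient
        step = np.linalg.solve(metric(theta), grad)     # g^{-1} grad without forming g^{-1}
        theta = theta - gamma * step                    # use a decaying gamma_t with stochastic gradients
    return theta

# Toy usage: quadratic loss with an anisotropic (constant) metric.
theta_hat = natural_gradient_descent(
    loss=lambda th: float(np.sum((th - 3.0) ** 2)),
    metric=lambda th: np.diag([1.0, 4.0]),
    theta0=np.zeros(2),
)
print(theta_hat)  # approaches [3., 3.]
```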

SLIDE 27

Application 1: Models with Rough Densities

Model: p_\theta(x) \propto \big( \| x - \theta_1 \|_2 / \theta_2 \big)^{s - d/2} \, K_{s - d/2}\big( \| x - \theta_1 \|_2 / \theta_2 \big), where K_{s - d/2} is a modified Bessel function of the second kind.

Parameters: (θ*_1, θ*_2) = (0, 1), with s varying.

Number of samples: n = 100.

SLIDE 28

Application 2: Models with Heavy Tails

Model: p_\theta(x) \propto (1/\theta_2) \big( 1 + (1/\nu) \big( \| x - \theta_1 \|_2 / \theta_2 \big)^2 \big)^{-(\nu + 1)/2}.

Diffusion Matrix: m_\theta(x) = 1 + \big( \| x - \theta_1 \|_2 / \theta_2 \big)^2.

Parameters: ν = 5, (θ*_1, θ*_2) = (25, 10).

Number of Samples: left panel, n = 100; right panel, n = 1000.

SLIDE 29

Summary & Conclusions

In this talk, we have:

  • Introduced a class of minimum Stein discrepancy estimators, and focused on a particular subclass called minimum DKSD estimators.
  • Shown that this class includes many popular estimators for unnormalised models, including score matching, contrastive divergence and minimum probability flow.
  • Discussed consistency, a CLT, and robustness of minimum DKSD estimators, and highlighted the importance of the choice of kernel and operator.
  • Demonstrated the advantage of these estimators for rough densities and heavy-tailed distributions.

Take-home message: The flexibility offered by the choice of Stein class and operator allows us to tailor the estimators to the model of interest.

Barp, A., Briol, F-X., Duncan, A., Girolami, M., Mackey, L. (2019). Minimum Stein Discrepancy Estimators. (preprint: https://fxbriol.github.io)
