
  1. Minimum Stein Discrepancy Estimators
     François-Xavier Briol, University of Cambridge & The Alan Turing Institute
     ICML Workshop on “Stein’s Method for Machine Learning and Statistics”, 15th June 2019

  2. Collaborators
     Alessandro Barp (ICL), Andrew Duncan (ICL), Mark Girolami (U. Cambridge), Lester Mackey (Microsoft).
     Reference: Barp, A., Briol, F-X., Duncan, A., Girolami, M., Mackey, L. (2019). Minimum Stein Discrepancy Estimators. Preprint available at https://fxbriol.github.io

  3. Statistical Inference for Unnormalised Models
     Motivation: Suppose we observe some data $\{x_1, \dots, x_n\}$. Given a parametric family of distributions $\{P_\theta : \theta \in \Theta\}$ with densities denoted $p_\theta$, we seek the $\theta^* \in \Theta$ which best approximates the empirical distribution $Q_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$.
     Challenge: For complex models, we often only have access to the likelihood in unnormalised form, $p_\theta(x) = \tilde{p}_\theta(x) / C$, where $C > 0$ is unknown and $\tilde{p}_\theta$ can be evaluated pointwise. Examples include models of natural images, large graphical models, deep energy models, etc.

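The unnormalised setting can be made concrete with a small sketch. The model below is a hypothetical toy example (not from the slides) whose normalising constant happens to be tractable; for the models listed above it would not be.

```python
import numpy as np

# Hypothetical toy model: p~_theta(x) = exp(-theta * x^2), theta > 0.
# Here C = sqrt(pi / theta) is actually known, but for natural images,
# large graphical models or deep energy models it is intractable.
def log_p_tilde(x, theta):
    # Log of the unnormalised density; evaluable pointwise.
    return -theta * x**2

# Without C, the log-likelihood log p_theta(x) = log p~_theta(x) - log C
# is only known up to an additive constant, so maximum likelihood is not
# directly available; this motivates discrepancy-based estimators.
print(log_p_tilde(np.array([0.3, -1.2, 0.7]), 2.0))
```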

  6. Minimum Discrepancy Estimators
     Let $D$ be a function such that $D(Q \| P_\theta) \geq 0$ measures the discrepancy between the empirical distribution $Q$ and $P_\theta$. We say that $\hat{\theta}_n \in \Theta$ is a minimum discrepancy estimator if $\hat{\theta}_n \in \operatorname{argmin}_{\theta \in \Theta} D(Q_n \| P_\theta)$.
     This includes, but is not limited to:
     1. KL divergence or other Bregman divergences
     2. Wasserstein distance or Sinkhorn divergence
     3. Maximum mean discrepancy
     4. ...
     Question: Which discrepancy should we use for unnormalised models?

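As a minimal illustration of this recipe, the sketch below fits the location of a N(theta, 1) model by grid search. The stand-in discrepancy is the Kolmogorov (sup-CDF) distance, chosen only for simplicity; any of the divergences listed above could be substituted.

```python
import math
import numpy as np

def model_cdf(x, theta):
    # CDF of the N(theta, 1) model P_theta
    return 0.5 * (1.0 + math.erf((x - theta) / math.sqrt(2.0)))

def discrepancy(data, theta):
    # D(Q_n || P_theta): sup-distance between the empirical CDF and the
    # model CDF, evaluated at the data points (midpoint ECDF convention).
    xs = np.sort(data)
    n = len(xs)
    ecdf = (np.arange(n) + 0.5) / n
    model = np.array([model_cdf(x, theta) for x in xs])
    return float(np.max(np.abs(ecdf - model)))

def minimum_discrepancy_estimator(data, grid):
    # theta_hat in argmin_theta D(Q_n || P_theta), by grid search
    return min(grid, key=lambda t: discrepancy(data, t))

data = np.array([1.9, 2.1, 2.0, 1.8, 2.2])
grid = np.linspace(0.0, 4.0, 401)
theta_hat = minimum_discrepancy_estimator(data, grid)
print(theta_hat)
```

For this symmetric data set the estimate lands at the centre of the sample, as expected.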

  9. Score Matching Estimators
     The score matching estimator [Hyvärinen, 2006] is based on the Fisher divergence:
     $\mathrm{SM}(Q \| P_\theta) := \int_{\mathcal{X}} \|\nabla \log q(x) - \nabla \log p_\theta(x)\|_2^2 \, Q(dx) = \int_{\mathcal{X}} \big( \|\nabla \log p_\theta(x)\|_2^2 + 2 \Delta \log p_\theta(x) \big) \, Q(dx) + Z$,
     where $Z \in \mathbb{R}$ is independent of $\theta$. The second expression follows by integration by parts and removes the dependence on the unknown data score $\nabla \log q$.
     This is one of the most competitive methods to date, with applications to inference in natural images, deep energy models and directional statistics.
     Several failure modes: this approach requires second-order derivatives and struggles with heavy-tailed data [Swersky, 2011].

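For a model with a tractable score, the theta-dependent part of the objective above is straightforward to evaluate. A sketch for the hypothetical unnormalised Gaussian location model p~_theta(x) = exp(-(x - theta)^2 / 2), where the minimiser should recover the sample mean:

```python
import numpy as np

# For p~_theta(x) = exp(-(x - theta)^2 / 2):
#   grad_x log p_theta(x) = -(x - theta),  Laplacian log p_theta(x) = -1,
# so the theta-dependent part of SM(Q_n || P_theta) is the sample average of
#   ||grad log p_theta(x_i)||^2 + 2 * Laplacian log p_theta(x_i).
def sm_objective(data, theta):
    grad_log_p = -(data - theta)
    laplacian_log_p = -1.0
    return np.mean(grad_log_p**2 + 2.0 * laplacian_log_p)

# The unknown constant C never appears, so the estimator is computable from
# the unnormalised density alone. For this model the minimiser is the sample mean.
data = np.array([0.5, 1.5, 1.0, 2.0])
grid = np.linspace(-2.0, 4.0, 601)
theta_hat = grid[np.argmin([sm_objective(data, t) for t in grid])]
print(theta_hat)
```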

  12. Minimum Stein Discrepancy Estimators
     Let $\Gamma(\mathcal{Y}) := \{f : \mathcal{X} \to \mathcal{Y}\}$. A function class $\mathcal{G} \subset \Gamma(\mathbb{R}^d)$ is a Stein class, with corresponding Stein operator $\mathcal{S}_{P_\theta} : \mathcal{G} \subset \Gamma(\mathbb{R}^d) \to \Gamma(\mathbb{R}^d)$, if $\int_{\mathcal{X}} \mathcal{S}_{P_\theta}[f] \, dP_\theta = 0$ for all $f \in \mathcal{G}$.
     This leads to the notion of Stein discrepancy (SD) [Gorham, 2015]:
     $\mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta) := \sup_{f \in \mathcal{S}_{P_\theta}[\mathcal{G}]} \left| \int_{\mathcal{X}} f \, dP_\theta - \int_{\mathcal{X}} f \, dQ \right| = \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g] \, dQ \right|$,
     on which we base our minimum Stein discrepancy estimators: $\hat{\theta}_n \in \operatorname{argmin}_{\theta \in \Theta} \mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q_n \| P_\theta)$.

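One concrete instance, an assumption here rather than the class used on the next slide, takes G to be the unit ball of a reproducing kernel Hilbert space, in which case the supremum has a closed form known as the kernel Stein discrepancy. A one-dimensional sketch for the target P = N(0, 1):

```python
import numpy as np

def stein_kernel(x, y, score, h=1.0):
    # Stein kernel k_0 built from an RBF base kernel k(x,y) = exp(-(x-y)^2 / 2h^2):
    #   k_0(x,y) = s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d^2k/(dx dy),
    # where s = grad log p is the score of the target.
    d = x - y
    k = np.exp(-d**2 / (2.0 * h**2))
    dk_dx = -d / h**2 * k
    dk_dy = d / h**2 * k
    d2k = (1.0 / h**2 - d**2 / h**4) * k
    return score(x) * score(y) * k + score(x) * dk_dy + score(y) * dk_dx + d2k

def ksd_squared(data, score, h=1.0):
    # V-statistic estimate: the sup over the RKHS unit ball reduces to
    # SD^2 = (1/n^2) sum_{i,j} k_0(x_i, x_j).
    X, Y = np.meshgrid(data, data)
    return float(np.mean(stein_kernel(X, Y, score, h)))

def score_std_normal(x):
    return -x  # grad_x log p(x) for the target P = N(0, 1)

sample_near = np.array([-1.2, -0.4, 0.1, 0.5, 1.1])  # roughly N(0, 1)-like
sample_far = sample_near + 3.0                        # shifted off-target
print(ksd_squared(sample_near, score_std_normal),
      ksd_squared(sample_far, score_std_normal))
```

The discrepancy is non-negative and grows as the sample moves away from the target, which is what makes it usable as an objective for estimation.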

  14. Score Matching Estimators are Minimum Stein Discrepancy Estimators
     Consider the Stein operator $\mathcal{S}_{p_\theta}[g] := \frac{1}{p_\theta} \nabla \cdot (p_\theta g)$ and the Stein class
     $\mathcal{G} = \left\{ g = (g_1, \dots, g_d) \in C^1(\mathcal{X}, \mathbb{R}^d) \cap L^2(\mathcal{X}; Q) : \|g\|_{L^2(\mathcal{X}; Q)} \leq 1 \right\}$.
     In this case, the Stein discrepancy is the score matching divergence: $\mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta) = \mathrm{SM}(Q \| P_\theta)$.
     Our paper also shows that several other popular estimators for unnormalised models, including contrastive divergence and minimum probability flow, are minimum SD estimators.

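The defining property of this Stein operator, that S_{p_theta}[g] integrates to zero under P_theta, can be checked numerically in one dimension. The Gaussian target and the test function g below are arbitrary choices for the sketch:

```python
import numpy as np

# Target P_theta = N(theta, 1), so grad_x log p_theta(x) = -(x - theta).
theta = 0.5
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
p = np.exp(-(xs - theta)**2 / 2.0) / np.sqrt(2.0 * np.pi)

# An arbitrary, rapidly decaying test function g in the Stein class.
g = np.sin(xs) * np.exp(-xs**2)
g_prime = np.gradient(g, xs)

# In 1D, S_p[g](x) = (1/p) d/dx (p g) = g'(x) + g(x) * grad log p(x).
stein_g = g_prime + g * (-(xs - theta))

# Quadrature approximation of E_{P_theta}[S_p[g]]; should vanish, since
# integrating (p g)' over the line gives the boundary term [p g] = 0.
integral = float(np.sum(stein_g * p) * dx)
print(integral)
```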
