

  1. Approximating likelihood ratios with calibrated classifiers
  Gilles Louppe, April 13, 2016

  2. Joint work with Kyle Cranmer (New York University) and Juan Pavez (Federico Santa María University). See the paper (Cranmer et al., 2015) for full details.

  3. Studying the constituents of the universe. (Illustration (c) Jorge Cham.)

  4. Collecting data. (Illustration (c) Jorge Cham.)

  5. Testing for new physics: compare the hypotheses via the ratio

      p(data | theory + X) / p(data | theory).

  (Illustration (c) Jorge Cham.)

  6. Likelihood-free setup
  • Complex simulator p parameterized by θ;
  • samples x ∼ p can be generated on demand;
  • ... but the likelihood p(x | θ) cannot be evaluated!

  7. Simple hypothesis testing
  • Assume some observed data D = {x1, ..., xn};
  • test a null θ = θ0 against an alternative θ = θ1;
  • the Neyman-Pearson lemma states that the most powerful test statistic is

      λ(D; θ0, θ1) = ∏_{x ∈ D} p_X(x | θ0) / p_X(x | θ1);

  • ... but neither p_X(x | θ0) nor p_X(x | θ1) can be evaluated!

  8. Straight approximation
  1. Approximate p_X(x | θ0) and p_X(x | θ1) individually, using density estimation algorithms;
  2. evaluate their ratio

      r(x; θ0, θ1) = p_X(x | θ0) / p_X(x | θ1).

  This works fine for low-dimensional data, but because of the curse of dimensionality it is in general a difficult problem. Moreover, it is not even necessary!

      "When solving a problem of interest, do not solve a more general problem as an intermediate step." (Vladimir Vapnik)
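  A minimal sketch of this naive approach on toy 1D Gaussian samples, assuming scikit-learn's KernelDensity (the data, bandwidth, and sample sizes are illustrative, not from the slides):

      import numpy as np
      from sklearn.neighbors import KernelDensity

      rng = np.random.RandomState(0)
      X0 = rng.normal(0.0, 1.0, size=(5000, 1))   # x ∼ p(x | θ0)
      X1 = rng.normal(0.5, 1.0, size=(5000, 1))   # x ∼ p(x | θ1)

      # Fit one density estimator per hypothesis, then take the ratio.
      kde0 = KernelDensity(bandwidth=0.2).fit(X0)
      kde1 = KernelDensity(bandwidth=0.2).fit(X1)

      def r_naive(x):
          # score_samples returns log densities, so the ratio is exp of a difference.
          return np.exp(kde0.score_samples(x) - kde1.score_samples(x))

  In 1D this works; in five or more dimensions the same recipe degrades quickly, which is the point of the slide.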

  9. Likelihood ratio invariance under change of variable
  Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x):

      r(x) = p_X(x | θ0) / p_X(x | θ1) = p_U(s(x) | θ0) / p_U(s(x) | θ1).
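  For intuition, a sketch of the standard change-of-variables argument behind the theorem (my gloss; it is not spelled out on the slide):

      % The density of U = s(X) integrates p_X over the level sets of s.
      % Monotonicity of s with r means r is constant on each level set, so
      % p_X(x' | θ0) = r(x') p_X(x' | θ1) = r(x) p_X(x' | θ1) wherever
      % s(x') = s(x), and the factor r(x) pulls out of the integral:
      \frac{p_U(s(x) \mid \theta_0)}{p_U(s(x) \mid \theta_1)}
        = \frac{\int \delta(s(x) - s(x'))\, p_X(x' \mid \theta_0)\, dx'}
               {\int \delta(s(x) - s(x'))\, p_X(x' \mid \theta_1)\, dx'}
        = r(x).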

  10. Approximating likelihood ratios with classifiers
  • Well, a classifier trained to distinguish x ∼ p0 from x ∼ p1 approximates

      s*(x) = p_X(x | θ1) / (p_X(x | θ0) + p_X(x | θ1)),

    which is monotonic with r(x).
  • Estimating p(s(x) | θ) is now easy, since the change of variable s(x) projects x into a 1D space where only the informative content of the ratio is preserved. This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc.).
  • This disentangles training from calibration.
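  A minimal sketch of the full recipe, assuming scikit-learn and the toy samples X0, X1 from the previous sketch (any probabilistic classifier would do in place of logistic regression):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Train on a balanced mix of samples from both hypotheses (y = 1 for θ1).
      X = np.vstack([X0, X1])
      y = np.concatenate([np.zeros(len(X0)), np.ones(len(X1))])
      clf = LogisticRegression().fit(X, y)

      # Project every sample to the 1D score s(x) ≈ p1 / (p0 + p1).
      s0 = clf.predict_proba(X0)[:, 1]
      s1 = clf.predict_proba(X1)[:, 1]

      # Calibrate: histogram estimates of p(s(x) | θ0) and p(s(x) | θ1)
      # on shared bin edges.
      edges = np.linspace(0.0, 1.0, 51)
      h0, _ = np.histogram(s0, bins=edges, density=True)
      h1, _ = np.histogram(s1, bins=edges, density=True)

      def r_hat(x, eps=1e-12):
          # r̂(x) = p(s(x) | θ0) / p(s(x) | θ1), read off the histograms.
          s = clf.predict_proba(x)[:, 1]
          idx = np.clip(np.digitize(s, edges) - 1, 0, len(h0) - 1)
          return (h0[idx] + eps) / (h1[idx] + eps)

  The classifier only needs to be monotonic with r(x); the calibration step fixes its scale.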

  11. Inference and composite hypothesis testing
  Approximated likelihood ratios can be used for inference, since dividing by the fixed term p(x | θ1) does not change the arg max:

      θ̂ = arg max_θ p(D | θ)
         = arg max_θ ∏_{x ∈ D} p(x | θ) / p(x | θ1)
         = arg max_θ ∏_{x ∈ D} p(s(x; θ, θ1) | θ) / p(s(x; θ, θ1) | θ1)    (1)

  where θ1 is fixed and s(x; θ, θ1) is a family of classifiers parameterized by (θ, θ1). Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.
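  A sketch of the resulting inference loop, with a hypothetical helper log_ratio_sum(D, theta, theta_1) that rebuilds the calibrated ratio for each candidate θ (as in the previous sketch) and sums log r̂ over the observed data:

      import numpy as np

      theta_1 = 0.0                             # fixed reference hypothesis
      theta_grid = np.linspace(-1.0, 1.0, 41)   # candidate values of θ

      # Likelihood scan of Eqn. 1; numerical optimization would work too.
      llr = [log_ratio_sum(D, theta, theta_1) for theta in theta_grid]
      theta_hat = theta_grid[int(np.argmax(llr))]  # approximate MLE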

  12. Parameterized learning
  For inference, we need to build a family s(x; θ, θ1) of classifiers.
  • One could build a classifier s independently for all (θ, θ1). But this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.
  • Solution: build a single parameterized classifier instead, where the parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016):

      T := {}
      while size(T) < N do
          draw θ0 ∼ π_Θ0; draw x ∼ p(x | θ0); T := T ∪ {((x, θ0, θ1), y = 0)}
          draw θ1 ∼ π_Θ1; draw x ∼ p(x | θ1); T := T ∪ {((x, θ0, θ1), y = 1)}
      end while
      learn a single classifier s(x; θ0, θ1) from T
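  A Python sketch of that loop, drawing both parameters at the top of each iteration for clarity; the proposal samplers draw_theta0(), draw_theta1() and the simulator draw_x(theta) are placeholders for the problem at hand:

      import numpy as np
      from sklearn.neural_network import MLPClassifier

      def build_training_set(N):
          X, y = [], []
          while len(X) < N:
              theta_0, theta_1 = draw_theta0(), draw_theta1()
              # One sample under each hypothesis, both tagged with (θ0, θ1).
              X.append(np.concatenate([draw_x(theta_0), [theta_0, theta_1]]))
              y.append(0)
              X.append(np.concatenate([draw_x(theta_1), [theta_0, theta_1]]))
              y.append(1)
          return np.array(X), np.array(y)

      X, y = build_training_set(N=100000)
      # A single classifier over (x, θ0, θ1): the parameters are just features.
      clf = MLPClassifier(hidden_layer_sizes=(20, 20)).fit(X, y)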

  13. Example: inference from multidimensional data
  Assume 5D data x generated from the following process p0:
  1. z := (z0, z1, z2, z3, z4), such that
     z0 ∼ N(μ = α, σ = 1),
     z1 ∼ N(μ = β, σ = 3),
     z2 ∼ Mixture(1/2 N(μ = −2, σ = 1), 1/2 N(μ = 2, σ = 0.5)),
     z3 ∼ Exponential(λ = 3), and
     z4 ∼ Exponential(λ = 0.5);
  2. x := R z, where R is a fixed semi-positive definite 5 × 5 matrix defining a fixed projection of z into the observed space.
  Our goal is to infer the values of α and β based on D. (Figure: pairwise scatter plots of the observed features X0, ..., X4.) Check out (Louppe et al., 2016) to reproduce this example.
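  A sketch of this generator, assuming NumPy; the matrix R below is an arbitrary illustrative choice, not the one used in the slides:

      import numpy as np

      rng = np.random.RandomState(0)
      A = rng.uniform(-1.0, 1.0, size=(5, 5))
      R = A @ A.T  # some fixed positive semi-definite 5x5 projection

      def generate(alpha, beta, n, rng=rng):
          z = np.column_stack([
              rng.normal(alpha, 1.0, size=n),        # z0 ∼ N(α, 1)
              rng.normal(beta, 3.0, size=n),         # z1 ∼ N(β, 3)
              np.where(rng.rand(n) < 0.5,            # z2 ∼ equal mixture
                       rng.normal(-2.0, 1.0, size=n),
                       rng.normal(2.0, 0.5, size=n)),
              rng.exponential(1.0 / 3.0, size=n),    # z3 ∼ Exp(λ=3); scale = 1/λ
              rng.exponential(1.0 / 0.5, size=n),    # z4 ∼ Exp(λ=0.5)
          ])
          return z @ R.T  # x = R z, projected into the observed space

      D = generate(alpha=1.0, beta=-1.0, n=1000)  # data at the true (α, β)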

  14. Example: inference from multidimensional data
  Recipe:
  1. Build a single parameterized classifier s(x; θ0, θ1), in this case a 2-layer NN trained on 5 + 2 features, with the alternative fixed to θ1 = (α = 0, β = 0).
  2. Find the approximate MLE (α̂, β̂) by solving Eqn. 1, using likelihood scans or optimization. Since the generator is inexpensive, p(s(x; θ0, θ1) | θ) can be calibrated on the fly, for every candidate (α, β), e.g. using histograms.
  3. Construct the log-likelihood ratio (LLR) statistic

      −2 log Λ(α, β) = −2 log [ p(D | α, β) / p(D | α̂, β̂) ].

  15. (Figure: the exact −2 log Λ(α, β) surface next to the approximate LLR smoothed by a Gaussian Process, scanned over α and β; the exact and approximate MLEs nearly coincide, close to the true value (α = 1, β = −1).)

  16. Diagnostics
  In practice r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of this approximation.
  1. For inference, the value of the MLE θ̂ should be independent of the value of θ1 used in the denominator of the ratio.
  2. Train a classifier to distinguish between unweighted samples from p(x | θ0) and samples from p(x | θ1) weighted by r̂(ŝ(x; θ0, θ1)); if the approximation is good, it should fail to separate them.
  (Figure: left, −2 log Λ(θ) scans for the exact ratio and for approximations with several choices of θ1, with a ±1σ band; right, ROC curves of a classifier separating p(x | θ0) from p(x | θ1) samples that are exactly reweighted, unweighted, or reweighted by the approximate ratio.)
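  A sketch of the second diagnostic, assuming scikit-learn, the r_hat from the earlier sketch, and hypothetical held-out samples X0_heldout ∼ p(x | θ0) and X1_heldout ∼ p(x | θ1):

      import numpy as np
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.metrics import roc_auc_score

      X = np.vstack([X0_heldout, X1_heldout])
      y = np.concatenate([np.zeros(len(X0_heldout)), np.ones(len(X1_heldout))])
      # Weight the θ1 samples by the approximate ratio; θ0 samples get weight 1.
      w = np.concatenate([np.ones(len(X0_heldout)), r_hat(X1_heldout)])

      diag = GradientBoostingClassifier().fit(X, y, sample_weight=w)
      auc = roc_auc_score(y, diag.predict_proba(X)[:, 1], sample_weight=w)
      # An AUC near 0.5 means the reweighted samples look like p(x | θ0),
      # i.e. the ratio approximation is good.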

  17. Density ratio estimation
  Approximating likelihood ratios relates to many other fundamental statistical inference problems, including
  • transfer learning,
  • outlier detection,
  • divergence estimation,
  • ...

  18. Transfer learning: p_train ≠ p_test
  As training data increases, i.e. as N → ∞,

      (1/N) Σ_i L(φ(x_i)) → ∫ L(φ(x)) p_train(x) dx.

  We want to be good on test data, i.e. minimize

      ∫ L(φ(x)) p_test(x) dx.

  Solution: importance weighting,

      φ* = arg min_φ (1/N) Σ_i [p_test(x_i) / p_train(x_i)] L(φ(x_i)).
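  A sketch of importance-weighted training, assuming scikit-learn; weight_fn is a hypothetical estimate of p_test(x) / p_train(x), which could itself come from a calibrated classifier trained to separate test-like from train-like samples, exactly as in the preceding slides:

      from sklearn.linear_model import LogisticRegression

      # Reweight each training example by the estimated density ratio.
      w = weight_fn(X_train)
      model = LogisticRegression().fit(X_train, y_train, sample_weight=w)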

  19. Summary
  • We proposed an approach for approximating likelihood ratios in the likelihood-free setup.
  • Evaluating likelihood ratios reduces to supervised learning; the two problems are deeply connected.
  • This is an alternative to Approximate Bayesian Computation, without the need to define a prior over the parameters.

  20. References
  Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized machine learning for high-energy physics. arXiv preprint arXiv:1601.07913.
  Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
  Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798, https://github.com/diana-hep/carl.
