Approximating likelihood ratios with calibrated classifiers

Gilles Louppe
June 22, 2016, MLHEP, Lund

Joint work with Kyle Cranmer (New York University) and Juan Pavez (Federico Santa María University).

See paper (Cranmer et al., 2015) for full details.
2 / 23
Studying the constituents of the universe
(c) Jorge Cham
3 / 23
Collecting data
(c) Jorge Cham
4 / 23
Testing for new physics
(c) Jorge Cham
p(data | theory + X) / p(data | theory)
5 / 23
Likelihood-free setup
- Complex simulator p parameterized by θ;
- Samples x ∼ p can be generated on-demand;
- ... but the likelihood p(x|θ) cannot be evaluated!
6 / 23
Simple hypothesis testing
- Assume some observed data D = {x1, . . . , xn};
- Test a null θ = θ0 against an alternative θ = θ1;
- The Neyman-Pearson lemma states that the most powerful test statistic is
  λ(D; θ0, θ1) = ∏_{x∈D} pX(x|θ0) / pX(x|θ1).
- ... but neither pX(x|θ0) nor pX(x|θ1) can be evaluated!
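For reference, when both densities can be evaluated, the statistic is computed directly. A minimal sketch assuming SciPy, with two 1D Gaussians standing in for the hypotheses (not the physics simulator):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
D = rng.normal(0.0, 1.0, 100)   # observed data, here drawn under theta_0

# log lambda(D) = sum_x [log p(x|theta_0) - log p(x|theta_1)],
# computed in log space for numerical stability
log_lambda = np.sum(norm.logpdf(D, loc=0.0) - norm.logpdf(D, loc=1.0))
print(log_lambda > 0)   # the statistic favors the true hypothesis theta_0
```

In the likelihood-free setting this computation is exactly what is unavailable, which motivates the approximations below.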
7 / 23
Straight approximation
- 1. Approximate pX(x|θ0) and pX(x|θ1) individually, using density
estimation algorithms;
- 2. Evaluate their ratio r(x; θ0, θ1).
This works for low-dimensional data, but because of the curse of dimensionality, density estimation is in general a difficult problem. Moreover, it is not even necessary!
r(x; θ0, θ1) = pX(x|θ0) / pX(x|θ1)
When solving a problem of interest, do not solve a more general problem as an intermediate step. – Vladimir Vapnik
8 / 23
Likelihood ratio invariance under change of variable
- Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x):
  r(x) = pX(x|θ0) / pX(x|θ1) = pU(s(x)|θ0) / pU(s(x)|θ1)
9 / 23
Approximating likelihood ratios with classifiers
- Well, a classifier trained to distinguish x ∼ p0 from x ∼ p1 approximates
  s∗(x) = pX(x|θ1) / (pX(x|θ0) + pX(x|θ1)),
  which is monotonic with r(x).
- Estimating p(s(x)|θ) is now easy, since the change of variable s(x) projects x into a 1D space where only the informative content of the ratio is preserved.
This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc).
- Disentangle training from calibration.
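As a concrete sketch of this train-then-calibrate pipeline (assuming scikit-learn, with two 1D Gaussians standing in for p0 and p1 — a toy stand-in, not the setup of the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 50000
x0 = rng.normal(0.0, 1.0, n)    # x ~ p0 = p(x | theta_0)
x1 = rng.normal(1.0, 1.0, n)    # x ~ p1 = p(x | theta_1)

# 1. Train a classifier to distinguish theta_0 (y=0) from theta_1 (y=1);
#    its output approximates s*(x) = p1 / (p0 + p1).
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros(n), np.ones(n)])
clf = LogisticRegression().fit(X, y)
s = lambda x: clf.predict_proba(np.reshape(x, (-1, 1)))[:, 1]

# 2. Calibrate: estimate p(s(x)|theta_0) and p(s(x)|theta_1) with
#    histograms in the 1D score space.
bins = np.linspace(0.0, 1.0, 51)
h0, _ = np.histogram(s(x0), bins, density=True)
h1, _ = np.histogram(s(x1), bins, density=True)

def r_hat(x):
    """Approximate likelihood ratio p(x|theta_0) / p(x|theta_1)."""
    idx = np.clip(np.digitize(s(x), bins) - 1, 0, len(h0) - 1)
    return h0[idx] / np.maximum(h1[idx], 1e-12)

# At x = 0.5 the two Gaussian densities are equal, so the exact ratio is 1.
print(r_hat(np.array([0.5])))
```

Any calibration method could replace the histograms here (KDE, isotonic regression, etc.); training and calibration remain decoupled.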
10 / 23
Inference and composite hypothesis testing
Approximated likelihood ratios can be used for inference, since
  θ̂ = arg max_θ p(D|θ)
    = arg max_θ ∏_{x∈D} p(x|θ) / p(x|θ1)
    = arg max_θ ∏_{x∈D} p(s(x; θ, θ1)|θ) / p(s(x; θ, θ1)|θ1)    (1)
where θ1 is fixed and s(x; θ, θ1) is a family of classifiers parameterized by (θ, θ1). Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.
11 / 23
Parameterized learning
For inference, we need to build a family s(x; θ, θ1) of classifiers.
- One could build a classifier s independently for all θ, θ1. But
this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.
- Solution: build a single parameterized classifier instead, where
parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016).
T := {}
while size(T) < N do
    Draw θ0 ∼ πΘ0; draw x ∼ p(x|θ0); T := T ∪ {((x, θ0, θ1), y = 0)}
    Draw θ1 ∼ πΘ1; draw x ∼ p(x|θ1); T := T ∪ {((x, θ0, θ1), y = 1)}
end while
Learn a single classifier s(x; θ0, θ1) from T.
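The loop above can be sketched as follows (assumptions: θ is a scalar, p(x|θ) = N(θ, 1), and the priors πΘ0, πΘ1 are uniform on [−1, 1] — all toy stand-ins):

```python
import numpy as np

rng = np.random.RandomState(0)
N = 1000

T_X, T_y = [], []
while len(T_y) < N:
    theta0 = rng.uniform(-1.0, 1.0)        # theta_0 ~ pi_Theta0
    theta1 = rng.uniform(-1.0, 1.0)        # theta_1 ~ pi_Theta1
    x = rng.normal(theta0, 1.0)            # x ~ p(x | theta_0)
    T_X.append([x, theta0, theta1]); T_y.append(0)
    x = rng.normal(theta1, 1.0)            # x ~ p(x | theta_1)
    T_X.append([x, theta0, theta1]); T_y.append(1)

X, y = np.asarray(T_X), np.asarray(T_y)
# A single classifier s(x; theta_0, theta_1) is then trained on (X, y),
# with the parameter values included as extra input features.
print(X.shape)
```

Because (θ0, θ1) are input features, the learned s(x; θ0, θ1) varies smoothly as the parameters vary, unlike a bank of independently trained classifiers.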
12 / 23
Example: Inference from multidimensional data
Assume 5D data x generated from the following process p0:
- 1. z := (z0, z1, z2, z3, z4), such that
  z0 ∼ N(µ = α, σ = 1),
  z1 ∼ N(µ = β, σ = 3),
  z2 ∼ Mixture(½ N(µ = −2, σ = 1), ½ N(µ = 2, σ = 0.5)),
  z3 ∼ Exponential(λ = 3), and
  z4 ∼ Exponential(λ = 0.5);
- 2. x := Rz, where R is a fixed positive semi-definite 5 × 5 matrix defining a projection of z into the observed space.
Our goal is to infer the values α and β based on D.
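A minimal sketch of this generator (the actual R from the paper is not reproduced here; an arbitrary positive semi-definite matrix stands in for it):

```python
import numpy as np

rng = np.random.RandomState(42)
A = rng.rand(5, 5)
R = A @ A.T   # stand-in for the fixed positive semi-definite 5x5 projection

def simulator(alpha, beta, n, rng):
    """Draw n samples x = R z from the process p_0 described above."""
    z = np.column_stack([
        rng.normal(alpha, 1.0, n),               # z0 ~ N(mu=alpha, sigma=1)
        rng.normal(beta, 3.0, n),                # z1 ~ N(mu=beta, sigma=3)
        np.where(rng.rand(n) < 0.5,              # z2 ~ equal-weight mixture
                 rng.normal(-2.0, 1.0, n),
                 rng.normal(2.0, 0.5, n)),
        rng.exponential(1.0 / 3.0, n),           # z3 ~ Exp(lambda=3); numpy
        rng.exponential(1.0 / 0.5, n),           # takes scale = 1/lambda
    ])
    return z @ R.T                               # x := R z, row-wise

D = simulator(1.0, -1.0, 1000, rng)
print(D.shape)
```

Samples are cheap to generate on demand, but the density p(x|α, β) of the projected mixture has no tractable closed form — precisely the likelihood-free setting.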
[Figure: pairwise scatter plots of the observed features X0–X4 for the observed data D.]
Check out (Louppe et al., 2016) to reproduce this example.
13 / 23
Example: Inference from multidimensional data
Recipe:
- 1. Build a single parameterized classifier s(x; θ0, θ1), in this case
a 2-layer NN trained on 5+2 features, with the alternative fixed to θ1 = (α = 0, β = 0).
- 2. Find the approximated MLE α̂, β̂ by solving Eqn. 1.
Solve Eqn. 1 using likelihood scans or through optimization. Since the generator is inexpensive, p(s(x; θ0, θ1)|θ) can be calibrated on-the-fly, for every candidate (α, β), e.g. using histograms.
- 3. Construct the log-likelihood ratio (LLR) statistic
  −2 log Λ(α, β) = −2 log [ p(D|α, β) / p(D|α̂, β̂) ]
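Steps 2–3 can be sketched with a likelihood scan in a 1D toy (assumptions: p(x|θ) = N(θ, 1) with scalar θ; the optimal score stands in for the trained parameterized classifier; calibration uses on-the-fly histograms, as in the recipe):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
theta_true, theta_1 = 0.5, 0.0
D = rng.normal(theta_true, 1.0, 2000)       # "observed" data

bins = np.linspace(0.0, 1.0, 31)

def log_lr(theta, n_cal=100000):
    """Approximate log [p(D|theta) / p(D|theta_1)] via the score space."""
    # Optimal score for theta vs theta_1 (a trained parameterized
    # classifier would approximate this).
    s = lambda x: norm.pdf(x, theta_1) / (norm.pdf(x, theta) +
                                          norm.pdf(x, theta_1))
    # On-the-fly calibration with fresh simulations from each hypothesis.
    h0, _ = np.histogram(s(rng.normal(theta, 1.0, n_cal)), bins, density=True)
    h1, _ = np.histogram(s(rng.normal(theta_1, 1.0, n_cal)), bins, density=True)
    idx = np.clip(np.digitize(s(D), bins) - 1, 0, len(h0) - 1)
    return np.sum(np.log(h0[idx] + 1e-12) - np.log(h1[idx] + 1e-12))

grid = np.linspace(0.0, 1.0, 21)
lls = np.array([log_lr(t) for t in grid])
theta_hat = grid[np.argmax(lls)]            # approximate MLE (step 2)
llr = -2.0 * (lls - lls.max())              # -2 log Lambda(theta) (step 3)
print(theta_hat)
```

The recovered θ̂ should land near the true θ = 0.5, up to grid resolution and calibration noise.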
14 / 23
[Figure: exact −2 log Λ(α, β) (left) and the approximate LLR, smoothed by a Gaussian process (right), over the (α, β) plane; the exact and approximate MLEs are marked alongside the true value α = 1, β = −1.]
15 / 23
Diagnostics
In practice r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of this approximation.
- 1. For inference, the value of the MLE θ̂ should be independent of the value of θ1 used in the denominator of the ratio.
- 2. Train a classifier to distinguish between unweighted samples from p(x|θ0) and samples from p(x|θ1) weighted by r̂(ŝ(x; θ0, θ1)).
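Diagnostic 2 can be sketched as follows (assumptions: scikit-learn, 1D Gaussians; the exact ratio r(x) stands in for r̂ to illustrate the mechanics — a good approximation should likewise yield an AUC near 0.5):

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n = 20000
x0 = rng.normal(0.0, 1.0, n)     # unweighted samples from p(x|theta_0)
x1 = rng.normal(1.0, 1.0, n)     # samples from p(x|theta_1), to be reweighted
w1 = norm.pdf(x1, 0.0) / norm.pdf(x1, 1.0)   # exact r(x), stand-in for r_hat

X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros(n), np.ones(n)])
w = np.concatenate([np.ones(n), w1])

# If the ratio is good, the reweighted theta_1 samples look like theta_0
# samples and the classifier cannot separate them: weighted AUC ~ 0.5.
clf = LogisticRegression().fit(X, y, sample_weight=w)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)
print(round(auc, 2))
```

A weighted AUC well above 0.5 would flag regions where r̂ deviates from the true ratio.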
[Figure: left, −2 log Λ(θ) as a function of α, comparing the exact curve with approximations for several choices of θ1, with a ±1σ band; right, ROC curves of the diagnostic classifier for p(x|θ1) reweighted by the exact ratio, by the approximate ratio, and with no weights.]
16 / 23
Density ratio estimation
The density ratio r(x; θ0, θ1) = p(x|θ0) / p(x|θ1) appears in many other fundamental statistical inference problems, including
- transfer learning,
- outlier detection,
- divergence estimation,
- ...
For all of them, the proposed approximation can be used as a drop-in replacement!
17 / 23
Transfer learning
Under the assumption that train and test data are drawn iid from the same distribution p,
  (1/N) Σ_{xi} L(ϕ(xi)) → ∫ L(ϕ(x)) p(x) dx
as training data increases, i.e. as N → ∞. Minimizing L over training data is therefore a good strategy.
18 / 23
Transfer learning
When train and test data are instead drawn from different distributions ptrain and ptest,
  (1/N) Σ_{xi} L(ϕ(xi)) → ∫ L(ϕ(x)) ptrain(x) dx
as training data increases, i.e. as N → ∞. But we want to be good on test data, i.e., minimize
  ∫ L(ϕ(x)) ptest(x) dx.
Minimizing L over training data is therefore a bad strategy!
19 / 23
Importance weighting
Reweight samples by ptest(xi)/ptrain(xi), such that
  (1/N) Σ_{xi} [ptest(xi)/ptrain(xi)] L(ϕ(xi)) → ∫ [ptest(x)/ptrain(x)] L(ϕ(x)) ptrain(x) dx = ∫ L(ϕ(x)) ptest(x) dx,
as training data increases, i.e. as N → ∞. Again, ptest(xi)/ptrain(xi) cannot be evaluated directly, but approximated likelihood ratios can be used as a drop-in replacement.
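A sketch of the reweighting (assumptions: 1D covariate shift between two known Gaussians, a squared stand-in for L(ϕ(x)); the exact ratio is used where an approximated one would be substituted):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
x_train = rng.normal(0.0, 1.0, 100000)               # x_i ~ p_train = N(0, 1)
w = norm.pdf(x_train, 1.0) / norm.pdf(x_train, 0.0)  # p_test / p_train, exact

loss = lambda x: x ** 2          # stand-in for L(phi(x))

naive = np.mean(loss(x_train))            # -> E_train[L] = 1
reweighted = np.mean(w * loss(x_train))   # -> E_test[L] = E_{N(1,1)}[x^2] = 2
print(round(naive, 2), round(reweighted, 2))
```

The unweighted average estimates the training expectation, while the ratio-weighted average recovers the test expectation from training samples alone.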
20 / 23
Example
p0 : α = −2, β = 2 versus p1 : α = 0, β = 0
[Figure: marginal distributions of samples from p0 and p1.]
21 / 23
Example
[Figure: p0 versus p1 reweighted by r̂.]
22 / 23
Summary
- We proposed an approach for approximating likelihood ratios in the likelihood-free setup.
- Evaluating likelihood ratios reduces to supervised learning.
Both problems are deeply connected.
- Alternative to Approximate Bayesian Computation, without
the need to define a prior over parameters.
23 / 23
References
Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized Machine Learning for High-Energy Physics. arXiv preprint arXiv:1601.07913.
Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798.