SLIDE 1

NEW APPROACHES TO LIKELIHOOD-FREE INFERENCE

Kyle Cranmer
New York University, Department of Physics and Center for Data Science
NYU Center for Data Science · Center for Cosmology and Particle Physics

http://arxiv.org/abs/1506.02169

SLIDE 2

PREFACE

  • This reminds me of the PhyStat series leading up to the LHC.
  • Thanks to Louis, Tom, Bob, Richard, …
  • Impressed by the sophistication of the discussion.
  • One thing I learned: a collaboration might converge on a high-level statistical
    procedure, then put in the likelihood / probability model and turn the crank.
  • Practical improvements to the analysis mainly lie in the techniques used for
    modeling the data! (e.g. systematics, ND→FD extrapolation, etc.)
  • Useful to factorize the discussion & software in terms of modeling and the
    high-level statistical procedure.

SLIDE 3

THE HIGGS DISCOVERY

The probability model behind the discovery:

f_{\text{tot}}(\mathcal{D}_{\text{sim}}, \mathcal{G} \mid \alpha) = \prod_{c \in \text{channels}} \left[ \text{Pois}(n_c \mid \nu_c(\alpha)) \prod_{e=1}^{n_c} f_c(x_{ce} \mid \alpha) \right] \cdot \prod_{p \in S} f_p(a_p \mid \alpha_p)

SLIDE 4

INTRODUCTION

  • In particle physics, our high-level inference goals are:
  • searches (hypothesis testing)
  • measurements (maximum likelihood estimation)
  • constraining parameters (confidence intervals)
  • Typically, we use likelihood-based techniques.
  • Surprisingly, we lack a nice technique for likelihood-based inference when we
    want to use high-dimensional observations and have to deal with the detector
    response.

SLIDE 5

Likelihood-free Inference

SLIDE 6

OVERVIEW OF PREDICTIONS

The language is Quantum Field Theory.

1) Feynman diagrams are used to predict high-energy interactions among fundamental
   particles (e.g. e⁺e⁻ → μ⁺μ⁻).
2) The interaction of the outgoing particles with the detector is simulated.
3) Finally, we run particle identification algorithms on the simulated data as if
   they were from real collisions.

[Figure residue: Feynman diagram qq̄ → H → W⁺W⁻ → ℓ⁺ℓ⁻ν ν̄ and an event display]

>100 million sensors; ~10–30 features describe the interesting part.

SLIDE 7

DETECTOR SIMULATION

  • Conceptually: Prob(detector response | particles)
  • Implementation: Monte Carlo integration over the micro-physics
  • Consequence: we cannot evaluate the likelihood for a given event

SLIDE 8

DETECTOR SIMULATION

  • Conceptually: Prob(detector response | particles)
  • Implementation: Monte Carlo integration over the micro-physics
  • Consequence: we cannot evaluate the likelihood for a given event
  • This motivates a new class of algorithms for what is called likelihood-free
    inference, which only requires the ability to generate samples from the
    simulation in the "forward mode" (see the sketch below).
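To make the "forward mode" notion concrete, here is a minimal Python sketch (all
names and the toy physics are hypothetical) of a simulator that can be sampled but
whose likelihood cannot be evaluated:

    import numpy as np

    def simulate(theta, n_events, seed=0):
        """Hypothetical forward-mode simulator: draws events x ~ p(x|theta).

        Internally it integrates over latent micro-physics states z by Monte
        Carlo, so p(x|theta) is implicit: we can sample, but not evaluate it.
        """
        rng = np.random.default_rng(seed)
        z = rng.normal(loc=theta, scale=1.0, size=(n_events, 5))   # latent physics
        x = np.tanh(z) + 0.1 * rng.standard_normal((n_events, 5))  # detector smearing
        return x                                                   # samples only

    x_sim = simulate(theta=1.0, n_events=10_000)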

SLIDE 9

10⁸ SENSORS → 1 REAL-VALUED QUANTITY

  • Most measurements and searches for new particles at the LHC are based on the
    distribution of a single variable or feature.
  • Choosing a good variable (feature engineering) is a task for a skilled physicist
    and is tailored to the goal of the measurement or new-particle search.
  • The likelihood p(x|θ) is approximated using histograms, i.e. univariate density
    estimation (see the sketch below).

[ATLAS Preliminary plot: Events/10 GeV vs. m_4l (GeV) for H → ZZ(*) → 4ℓ; data,
ZZ(*) background, Z+jets and tt̄ background, signal hypotheses m_H = 125, 190,
360 GeV, and syst. unc.; √s = 7 TeV, ∫L dt = 4.8 fb⁻¹ and √s = 8 TeV,
∫L dt = 5.8 fb⁻¹]

This doesn’t scale if x is high dimensional!
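As a concrete sketch of that standard practice (toy numbers standing in for a real
m_4l spectrum), a normalized histogram of simulated events serves as the univariate
density estimate p̂(x|θ):

    import numpy as np

    rng = np.random.default_rng(0)
    x_sim = rng.normal(125.0, 2.0, size=100_000)   # toy stand-in for simulated m_4l

    # univariate density estimation: a normalized histogram template for p(x|theta)
    counts, edges = np.histogram(x_sim, bins=60, range=(100.0, 160.0), density=True)

    def p_hat(x):
        """Histogram approximation of p(x|theta); zero outside the template range."""
        idx = np.searchsorted(edges, x, side="right") - 1
        inside = (idx >= 0) & (idx < len(counts))
        return np.where(inside, counts[np.clip(idx, 0, len(counts) - 1)], 0.0)

    print(p_hat(np.array([110.0, 125.0, 170.0])))  # the last point falls outside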

SLIDE 10

HIGH-DIMENSIONAL EXAMPLE

  • For instance, when looking for deviations from the Standard Model Higgs, we
    would like to look at all sorts of kinematic correlations.
  • Each observation x is high-dimensional.

[Grid of histograms of the H → ℓℓℓℓ decay angles cos θ*, cos θ₁, cos θ₂, Φ, Φ₁
(Events / 0.08 and Events / 0.21 bins) for several hypotheses]

SLIDE 11

MOVING CLOSER TO THE DATA

  • A more extreme example is to work with lower-level data.
  • Each observation x is high-dimensional.

LArTPC (Liquid Argon Time Projection Chamber), ν_μ event displays:

  • A neutrino interaction in LAr produces ionization and scintillation light.
  • The ionization charge is drifted in a uniform electric field.
  • The charge and light are read out using precision wires and PMTs.
  • Tracking, calorimetry, and particle ID in the same detector; goal ~80% neutrino
    efficiency. All you need for physics is the neutrino flavor and energy.
    (Jonathon Asaadi)

[ArgoNeuT data: ν_e candidate, γ candidates, neutral-current π⁰ candidates]

CNNs applied to MicroBooNE: Vic Genty @ Columbia U. (@vgenty), with the MicroBooNE
Deep Learning Team (G. Collins @ MIT, K. Terao @ Columbia, T. Wongjirad @ MIT).
MicroBooNE-NOTE-1019-PUB, "Convolutional Neural Networks Applied to Neutrino Events
in a Liquid Argon Time Projection Chamber", MicroBooNE Collaboration, July 4, 2016.
http://www-microboone.fnal.gov/publications/publicnotes/MICROBOONE-NOTE-1019-PUB.pdf

Pattern recognition with 2D ADC images in LArTPC: P. Płoński, D. Stefan, R. Sulej.

DS@HEP Workshop, NYC, July 7, 2016 (…informal input to the workshop discussions…)

SLIDE 12

LIKELIHOOD-FREE INFERENCE

  • Goal: approximate the likelihood p(x|θ) for a high-dimensional feature x using
    a generative model for the data.

[ATLAS and CMS LHC Run 1 Preliminary plots: κ_V vs. κ_F best-fit contours (68% and
95% CL) for H→γγ, H→ZZ, H→WW, H→bb, H→ττ and combined with the SM point; and the
−2 ln Λ(m_H) scan near m_H ≈ 125 GeV for H→γγ, H→ZZ→4ℓ and combined, with
stat.-only uncertainty shown]
SLIDE 13

LIKELIHOOD-FREE INFERENCE

  • Goal: approximate the likelihood p(x|θ) for a high-dimensional feature x using
    a generative model for the data.

[CKM-fit figure: constraints on the ρ̄, η̄ plane from γ, α, Δm_d, ε_K, Δm_s & Δm_d,
|V_ub|, sin 2β (sol. w/ cos 2β < 0 excluded at CL > 0.95); shaded areas have
CL > 0.95. Figure 11.2]

SLIDE 14

THE RAPID RISE OF "ABC"
("ABC" = approximate Bayesian computation)

SLIDE 15

AN ALTERNATIVE TO ABC

K.C., http://arxiv.org/abs/1506.02169

SLIDE 16

COLLABORATORS

  • Gilles Louppe: Data Science Fellow, funded via NSF DIANA/HEP, based at CERN;
    PhD in machine learning; scikit-learn developer. @glouppe
  • Juan Pavez: CS graduate student in Chile; fellowship to work at CERN,
    summer '15. @jgpavez

SLIDE 17

MACHINE LEARNING: CLASSIFIERS

  • It is common to use machine learning classifiers to separate signal (H₁) from
    background (H₀).
  • We want a function that maps signal to y = 1 and background to y = 0.
  • Think of it as applied calculus of variations: find the function s(x) that
    minimizes the loss:

L[s] = \int p(x \mid H_0)\,(0 - s(x))^2\,dx + \int p(x \mid H_1)\,(1 - s(x))^2\,dx \approx \sum_i (y_i - s(x_i))^2

[TMVA-style plot: normalized distributions of the BDT output s for signal and
background; U/O-flow (S,B): (0.0, 0.0)% / (0.0, 0.0)%]

SLIDE 18

MACHINE LEARNING: CLASSIFIERS

  • Applied calculus of variations: find the function s(x) that minimizes the loss.
  • The optimal classifier would learn the regression function, which is one-to-one
    with the likelihood ratio (sketch below).

s(x) = \frac{p(x \mid H_1)}{p(x \mid H_0) + p(x \mid H_1)}, \quad \text{which is one-to-one with} \quad \frac{p(x \mid H_1)}{p(x \mid H_0)}

L[s] = \int p(x \mid H_0)\,(0 - s(x))^2\,dx + \int p(x \mid H_1)\,(1 - s(x))^2\,dx \approx \sum_i (y_i - s(x_i))^2

[The TMVA-style plot repeated: normalized BDT output s for signal and background]
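A minimal sketch of this correspondence (toy 1-D Gaussian data; scikit-learn
assumed): train a regressor with squared loss, then map its output back to the
likelihood ratio via s/(1 − s).

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(1)
    x0 = rng.normal(0.0, 1.0, size=(50_000, 1))   # background samples from H0, y = 0
    x1 = rng.normal(1.0, 1.0, size=(50_000, 1))   # signal samples from H1, y = 1
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])

    # squared loss, so s_hat(x) approximates p(x|H1) / (p(x|H0) + p(x|H1))
    s_hat = MLPRegressor(hidden_layer_sizes=(32,), max_iter=200).fit(X, y)

    s = np.clip(s_hat.predict(np.array([[0.5]])), 1e-6, 1 - 1e-6)
    print(s / (1.0 - s))  # p(x|H1)/p(x|H0); the exact value exp(x - 0.5) = 1 here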

SLIDE 19

PARAMETRIZED CLASSIFIERS

  • We started with a classifier that was learning s(x).
  • Implicitly, that classifier depends on the H₀ and H₁ used to generate the
    training data. Make that explicit.
  • We can do the same thing for any two points in parameter space. I call this a
    parametrized classifier; see the sketch after the equations.

s(x) = \frac{p(x \mid H_1)}{p(x \mid H_0) + p(x \mid H_1)} \;\to\; s(x; H_0, H_1) = \frac{p(x \mid H_1)}{p(x \mid H_0) + p(x \mid H_1)} \;\to\; s(x; \theta_0, \theta_1) = \frac{p(x \mid \theta_1)}{p(x \mid \theta_0) + p(x \mid \theta_1)}
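One way to realize a parametrized classifier, sketched here under assumptions (a
stand-in simulator, scikit-learn, a small set of training parameter pairs), is to
append (θ₀, θ₁) to the input features so a single model learns s(x; θ₀, θ₁):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def simulate(theta, n, rng):
        return rng.normal(theta, 1.0, size=(n, 1))        # stand-in forward simulator

    rng = np.random.default_rng(2)
    blocks, labels = [], []
    for theta0, theta1 in [(0.0, 1.0), (0.0, 2.0), (0.5, 1.5)]:  # training pairs
        for theta, y in [(theta0, 0), (theta1, 1)]:
            x = simulate(theta, 10_000, rng)
            pars = np.column_stack([np.full(len(x), theta0), np.full(len(x), theta1)])
            blocks.append(np.hstack([x, pars]))
            labels.append(np.full(len(x), y))

    # a single model learns s(x; theta0, theta1) by taking the parameters as inputs
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)
    clf.fit(np.vstack(blocks), np.concatenate(labels))
    print(clf.predict_proba(np.array([[0.7, 0.0, 1.0]]))[:, 1])  # s at (theta0, theta1) = (0, 1)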

SLIDE 20

GENERALIZED LIKELIHOOD RATIO TESTS

  • The target likelihood ratio test based on high-dimensional features x is given
    below.
  • An equivalent test can be made from a 1-D projection if the map s: X → ℝ has
    the same level sets as the likelihood ratio.
  • Remember that a classifier that minimizes the squared loss ∑ᵢ (yᵢ − s(xᵢ))²
    approximates the regression function, which has the same level sets!
    (Calibration sketch below.)

[Plot: normalized BDT output s for signal and background, and its densities p(s)
under each hypothesis]

T(\mathcal{D}; \theta_0, \theta_1) = \prod_{e=1}^{n} \frac{p(x_e \mid \theta_0)}{p(x_e \mid \theta_1)} = \prod_e \frac{p(s(x_e; \theta_0, \theta_1) \mid \theta_0)}{p(s(x_e; \theta_0, \theta_1) \mid \theta_1)}

\text{provided that } s(x; \theta_0, \theta_1) = \text{monotonic}\big[\, p(x \mid \theta_0)/p(x \mid \theta_1) \,\big]

K.C., G. Louppe, J. Pavez: http://arxiv.org/abs/1506.02169
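The 1-D densities p(s|θ₀) and p(s|θ₁) in the projected test can be calibrated from
histograms of the classifier output. A sketch (toy beta-distributed scores standing
in for real classifier output):

    import numpy as np

    def calibrated_log_ratio(s_obs, s_theta0, s_theta1, bins=50):
        """Histogram estimate of log p(s|theta0)/p(s|theta1) from classifier scores."""
        edges = np.linspace(0.0, 1.0, bins + 1)
        p0, _ = np.histogram(s_theta0, bins=edges, density=True)
        p1, _ = np.histogram(s_theta1, bins=edges, density=True)
        idx = np.clip(np.digitize(s_obs, edges) - 1, 0, bins - 1)
        eps = 1e-12                                  # guard against empty bins
        return np.log(p0[idx] + eps) - np.log(p1[idx] + eps)

    # usage: score simulated samples from each hypothesis, then the observed events
    rng = np.random.default_rng(3)
    s_theta0 = rng.beta(2.0, 5.0, size=100_000)      # stand-ins for s(x; theta0, theta1)
    s_theta1 = rng.beta(5.0, 2.0, size=100_000)
    s_obs = rng.beta(2.0, 5.0, size=100)             # "data" drawn under theta0
    print(calibrated_log_ratio(s_obs, s_theta0, s_theta1).sum())  # log T(D; theta0, theta1)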

SLIDE 21

AN EXAMPLE

SLIDE 22

THE DATA

Let us assume 5-D data x generated from the following process p₀:

  1. z := (z₀, z₁, z₂, z₃, z₄), such that
     z₀ ∼ N(μ = α, σ = 1),
     z₁ ∼ N(μ = β, σ = 3),
     z₂ ∼ Mixture(½ N(μ = −2, σ = 1), ½ N(μ = 2, σ = 0.5)),
     z₃ ∼ Exponential(λ = 3), and
     z₄ ∼ Exponential(λ = 0.5);
  2. x := Rz, where R is a fixed semi-positive-definite 5 × 5 matrix defining a
     fixed projection of z into the observed space.

The distribution depends on α, β; the toy data are generated with α = 1, β = −1
(generator sketch below).
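A sketch of this generative process with numpy (the paper fixes a specific R; here
an arbitrary PSD matrix stands in):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 5))
    R = A @ A.T                                        # any fixed PSD 5x5 matrix

    def sample_p(alpha, beta, n, rng):
        """Draw n observations x = R z from the 5-D toy process."""
        z0 = rng.normal(alpha, 1.0, n)
        z1 = rng.normal(beta, 3.0, n)
        comp = rng.random(n) < 0.5                     # 50/50 mixture choice for z2
        z2 = np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 0.5, n))
        z3 = rng.exponential(scale=1.0 / 3.0, size=n)  # numpy takes scale = 1/lambda
        z4 = rng.exponential(scale=1.0 / 0.5, size=n)
        z = np.column_stack([z0, z1, z2, z3, z4])
        return z @ R.T

    x = sample_p(alpha=1.0, beta=-1.0, n=100_000, rng=rng)   # the toy data above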

SLIDE 23

LIKELIHOOD CONTOURS

[Contour plots over (α, β) near α = 1, β = −1: (a) exact likelihood,
(b) approximate likelihood, (c) approximate likelihood (smoothed), and (d) the
exact MLE vs. the approximate MLE]

SLIDE 24

DIAGNOSTICS

SLIDE 25

MAXIMUM LIKELIHOOD ESTIMATORS

\hat{\theta} = \arg\max_\theta \sum_e \ln \frac{p(x_e \mid \theta)}{p(x_e \mid \theta_1)} = \arg\max_\theta \sum_e \ln \frac{p(s(x_e; \theta, \theta_1) \mid \theta)}{p(s(x_e; \theta, \theta_1) \mid \theta_1)} \qquad (4.4)

It is important that we include the denominator p(s(x_e; θ, θ₁) | θ₁) because it
cancels Jacobian factors that vary with θ.

  • The denominator in the likelihood ratio is just a shift: the surface integrals
    cancel,

\frac{p_1(s^*)}{p_0(s^*)} = \frac{p_1(x)}{p_0(x)} \cdot \frac{\int d\Omega_{s^*}\, p_0(x)/|\hat{n} \cdot \nabla s|}{\int d\Omega_{s^*}\, p_0(x)/|\hat{n} \cdot \nabla s|} = \frac{p_1(x)}{p_0(x)}

  • This provides a non-trivial diagnostic. In practice r̂(ŝ(x; θ₀, θ₁)) will not
    be exact, and diagnostic procedures are needed to assess the quality of the
    approximation:
  1. For inference, the value of the MLE θ̂ should be independent of the value of
     θ₁ used in the denominator of the ratio (see the sketch below).
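This diagnostic is easy to automate: refit the MLE for several reference points θ₁
and check that θ̂ is stable. A self-contained sketch, with an exact Gaussian ratio
standing in for the calibrated approximation:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def exact_log_ratio(x, theta, theta1):
        """Stand-in for the calibrated log r_hat(s(x; theta, theta1)); Gaussian toy."""
        return -0.5 * ((x - theta) ** 2 - (x - theta1) ** 2)

    def mle(x_obs, theta1, log_ratio):
        nll = lambda theta: -np.sum(log_ratio(x_obs, theta, theta1))
        return minimize_scalar(nll, bounds=(-2.0, 2.0), method="bounded").x

    rng = np.random.default_rng(5)
    x_obs = rng.normal(1.0, 1.0, size=5_000)

    # diagnostic: theta_hat should be stable as the arbitrary reference theta1 varies
    print([round(mle(x_obs, t1, exact_log_ratio), 3) for t1 in (-1.0, 0.0, 0.5)])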
SLIDE 26

DIAGNOSTICS

[Six diagnostic panels: (a), (b) poorly trained, well calibrated; (c), (d) poorly
calibrated, well trained; (e), (f) well trained, well calibrated]

SLIDE 27

DIAGNOSTICS WITH AN ADVERSARY

  • Train a new classifier to discriminate between events from the target p(x|θ₀)
    and events resampled from the original distribution p(x|θ₁) with probabilities
    given by the predicted weights r̂(x|θ₀, θ₁) ≈ p(x|θ₀)/p(x|θ₁).
  • A classifier can easily distinguish the unweighted distributions; exact weights
    are perfect (AUC ≈ 0.5).

Important: performance is evaluated on an independent testing sample, as in the
sketch below.
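A sketch of that adversarial check (scikit-learn assumed; exact toy weights stand
in for r̂):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def auc_diagnostic(x_target, x_original, weights, rng):
        """AUC ~ 0.5 means the reweighted original looks like the target."""
        idx = rng.choice(len(x_original), size=len(x_target), p=weights / weights.sum())
        X = np.vstack([x_target, x_original[idx]])         # target vs. resampled
        y = np.concatenate([np.ones(len(x_target)), np.zeros(len(x_target))])
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
        clf = GradientBoostingClassifier().fit(X_tr, y_tr)  # the adversary
        return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # held-out AUC

    rng = np.random.default_rng(6)
    x_t = rng.normal(0.0, 1.0, size=(20_000, 1))            # target p(x|theta0)
    x_o = rng.normal(1.0, 1.0, size=(20_000, 1))            # original p(x|theta1)
    w = np.exp(0.5 * ((x_o[:, 0] - 1.0) ** 2 - x_o[:, 0] ** 2))  # exact p0/p1 weights
    print(auc_diagnostic(x_t, x_o, w, rng))                 # close to 0.5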

SLIDE 28

DIAGNOSTICS

[The six diagnostic panels again: (a), (b) poorly trained, well calibrated;
(c), (d) poorly calibrated, well trained; (e), (f) well trained, well calibrated]

SLIDE 29

SPECIAL CASE: MIXTURE MODELS

SLIDE 30

MIXTURE MODEL

  • Often the model for the data is a mixture of different components with
    coefficients w_c; to be more generic, consider parametrized coefficients
    w_c(θ):

p(x \mid \theta) = \sum_c w_c(\theta)\, p_c(x)

  • I worked out a way to decompose the training into pairwise comparisons:

\frac{p(x \mid \theta_0)}{p(x \mid \theta_1)} = \frac{\sum_c w_c(\theta_0)\, p_c(x)}{\sum_{c'} w_{c'}(\theta_1)\, p_{c'}(x)} = \sum_c \left[ \sum_{c'} \frac{w_{c'}(\theta_1)}{w_c(\theta_0)} \frac{p_{c'}(x)}{p_c(x)} \right]^{-1} = \sum_c \left[ \sum_{c'} \frac{w_{c'}(\theta_1)}{w_c(\theta_0)} \frac{p_{c'}(s_{c,c'})}{p_c(s_{c,c'})} \right]^{-1}

  • The last equality uses the main result of the paper: we need a classifier for
    each pairwise (c vs. c') comparison, n(n−1)/2 of them; see the sketch below.
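A sketch of the decomposition, with exact component densities standing in for the
calibrated pairwise classifiers so the identity can be checked numerically:

    import numpy as np
    from scipy.stats import norm

    def mixture_ratio(x, w0, w1, pairwise_ratio):
        """p(x|theta0)/p(x|theta1) from pairwise ratios p_c'(x)/p_c(x).

        pairwise_ratio(cp, c, x) stands in for a calibrated classifier trained
        on the (c vs. c') comparison; n(n-1)/2 classifiers in general.
        """
        n = len(w0)
        return sum(
            1.0 / sum(w1[cp] / w0[c] * pairwise_ratio(cp, c, x) for cp in range(n))
            for c in range(n)
        )

    # toy check with two Gaussian components, where the ratios are known exactly
    comps = [norm(-1.0, 1.0), norm(2.0, 0.5)]
    exact = lambda cp, c, x: comps[cp].pdf(x) / comps[c].pdf(x)
    w0, w1 = np.array([0.7, 0.3]), np.array([0.4, 0.6])
    x = 0.3
    direct = (sum(w0[c] * comps[c].pdf(x) for c in range(2))
              / sum(w1[c] * comps[c].pdf(x) for c in range(2)))
    print(mixture_ratio(x, w0, w1, exact), direct)     # the two values agree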

SLIDE 31

RESULTS FOR A 10-DIM EXAMPLE

  • Left: fit to the mixture coefficients for a single pseudo-experiment.
  • Right: histogram of the best fit of one coefficient over many
    pseudo-experiments.

SLIDE 32

CONNECTION TO REWEIGHTING

SLIDE 33

THE DATA

The same 5-D toy process as before: z := (z₀, …, z₄) with
z₀ ∼ N(μ = α, σ = 1), z₁ ∼ N(μ = β, σ = 3),
z₂ ∼ Mixture(½ N(μ = −2, σ = 1), ½ N(μ = 2, σ = 0.5)),
z₃ ∼ Exponential(λ = 3), z₄ ∼ Exponential(λ = 0.5), and x := Rz for a fixed
semi-positive-definite 5 × 5 matrix R.

p₀ has α = 1, β = −1; p₁ has α = 0, β = 0.

SLIDE 34

ORIGINAL VS. TARGET DISTRIBUTIONS

  • 1-D projections of the original and target distributions.

SLIDE 35

TWO REWEIGHTING METHODS: 100K SAMPLES

hep_ml.GBReweighter vs. carl with a calibrated MLP

SLIDE 36

EVALUATING THE QUALITY OF THE REWEIGHTING

  • Train a new classifier to discriminate between events from the target and
    events resampled from the original distribution with probabilities given by
    the predicted weights.
  • A classifier can easily distinguish the unweighted distributions; exact weights
    are perfect (AUC ≈ 0.5).
  • carl does a little better than GBReweighter on this problem (no special effort
    was made to tune either); neither is perfect.

Important: performance is evaluated on an independent testing sample. A sketch of
the hep_ml side follows.
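For reference, the hep_ml side of that comparison looks roughly like this (a
sketch: untuned, with Gaussian stand-ins for the toy samples):

    import numpy as np
    from hep_ml.reweight import GBReweighter

    rng = np.random.default_rng(7)
    x_original = rng.normal(0.0, 1.0, size=(50_000, 5))  # stand-in for p1 samples
    x_target = rng.normal(0.3, 1.0, size=(50_000, 5))    # stand-in for p0 samples

    reweighter = GBReweighter(n_estimators=50, max_depth=3)  # no special tuning
    reweighter.fit(original=x_original, target=x_target)
    weights = reweighter.predict_weights(x_original)
    # feed (x_original, weights) to the adversarial AUC diagnostic sketched earlier;
    # AUC near 0.5 on an independent testing sample means the reweighting worked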

SLIDE 37