Probability and Statistics: Basic Concepts - PowerPoint PPT Presentation

Probability and Statistics: Basic Concepts (from a physicist's point of view)
Benoit CLEMENT – Université J. Fourier / LPSC
bclement@lpsc.in2p3.fr


SLIDE 1

Probability and Statistics: Basic Concepts
(from a physicist's point of view)

Benoit CLEMENT – Université J. Fourier / LPSC
bclement@lpsc.in2p3.fr

SLIDE 2

Bibliography

  • Kendall's Advanced Theory of Statistics, Hodder Arnold Pub.
    volume 1: Distribution Theory, A. Stuart and K. Ord
    volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
    volume 2b: Bayesian Inference, A. O'Hagan, J. Forster
  • The Review of Particle Physics, K. Nakamura et al., J. Phys. G 37, 075021 (2010) (+ Booklet)
  • Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publications
  • Statistical Data Analysis, Glen Cowan, Oxford Science Publications
  • Analyse statistique des données expérimentales, K. Protassov, EDP Sciences
  • Probabilités, analyse des données et statistiques, G. Saporta, Technip
  • Analyse de données en sciences expérimentales, B.C., Dunod

SLIDE 3

Sample and population

SAMPLE: finite size, selected through a random process (e.g. the results of a measurement).
POPULATION: potentially infinite size (e.g. all possible results).

Characterizing the sample, the population and the sampling process: probability theory (first lecture).

SLIDE 4

Statistics

SAMPLE: finite size, observed values x_i. POPULATION: pdf f(x;θ), with physics parameters θ. The EXPERIMENT draws the observable sample from the population; INFERENCE goes the other way, from the sample back to the parameters.

Using the sample to estimate the characteristics of the population: statistical inference (second lecture).

SLIDE 5

Random process

A random process (« measurement » or « experiment ») is a process whose outcome cannot be predicted with certainty. It will be described by:
Universe: Ω = set of all possible outcomes.
Event: logical condition on an outcome. It can either be true or false; an event splits the universe in 2 subsets. An event A will be identified by the subset A for which A is true.

SLIDE 6

Probability

A probability function P is defined by (Kolmogorov, 1933):
P : {Events} -> [0,1]
A -> P(A)
satisfying:
P(Ω) = 1
P(A or B) = P(A) + P(B) if A and B = Ø

Interpretation of this number:
  • Frequentist approach: if we repeat the random process a great number of times n, and count the number of times n_A the outcome satisfies event A, then the ratio
P(A) = lim (n -> +∞) n_A/n
defines a probability.
  • Bayesian interpretation: a probability is a measure of the credibility associated to the event.
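The frequentist definition can be watched numerically; a minimal sketch (the die-rolling process and the event are illustrative choices, not from the slides):

```python
import random

random.seed(42)

def frequentist_probability(event, draw, n):
    """Estimate P(A) as n_A / n over n repetitions of the random process."""
    n_A = sum(1 for _ in range(n) if event(draw()))
    return n_A / n

# Random process: rolling a fair six-sided die. Event A: "the outcome is even".
p_even = frequentist_probability(lambda x: x % 2 == 0,
                                 lambda: random.randint(1, 6),
                                 100_000)
print(p_even)  # close to the exact value 1/2
```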

SLIDE 7

Simple logic

Event « not A » is associated with the complement Ā:
P(Ā) = 1 - P(A), P(Ø) = 1 - P(Ω) = 0
Event « A and B », event « A or B »:
P(A or B) = P(A) + P(B) - P(A and B)

SLIDE 8

Conditional probability

If an event B is known to be true, one can restrain the universe to Ω' = B and define a new probability function on this universe, the conditional probability:
P(A|B) = « probability of A given B » = P(A and B)/P(B)
The definition of conditional probability leads to:
P(A and B) = P(A|B)·P(B) = P(B|A)·P(A)
hence relating P(A|B) to P(B|A) by the Bayes theorem:
P(B|A) = P(A|B)·P(B)/P(A)
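A small numerical sketch of the Bayes theorem (the screening-test numbers are hypothetical, chosen only to illustrate the inversion of conditional probabilities):

```python
# P(B|A) = P(A|B) P(B) / P(A), expanding P(A) over the partition {B, not B}.
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a

# Hypothetical test: B = "ill" (1% prevalence), A = "test positive",
# with P(A|B) = 0.95 and P(A|not B) = 0.05.
posterior = bayes(0.95, 0.01, 0.05)
print(posterior)  # ~0.16: a positive result is still most likely a false positive
```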

SLIDE 9

Incompatibility and independence

Two incompatible events cannot be true simultaneously, then:
P(A and B) = 0
P(A or B) = P(A) + P(B)
Two events are independent if the realization of one is not linked in any way to the realization of the other:
P(A|B) = P(A) and P(B|A) = P(B), i.e. P(A and B) = P(A)·P(B)

SLIDE 10

Random variable

When the outcome of the random process is a number (real or integer), we associate to the random process a random variable X. Each realization of the process leads to a particular result X = x; x is a realization of X.
For a discrete variable, the probability law is p(x) = P(X = x).
For a real variable, P(X = x) = 0, so one uses the cumulative density function F(x) = P(X < x) and the probability density function (pdf) f(x) = dF/dx:
dF = F(x+dx) - F(x) = P(X < x+dx) - P(X < x)
   = P(X < x) + P(x < X < x+dx) - P(X < x)
   = P(x < X < x+dx) = f(x)dx

SLIDE 11

Density function

Probability density function f(x). Note: discrete variables can also be described by a probability density function using Dirac distributions:
f(x) = Σ_i p(i) δ(x - i), with Σ_i p(i) = 1
Cumulative density function F(x).
By construction:
P(a < X < b) = ∫ (a to b) f(x)dx = F(b) - F(a)
F(-∞) = P(Ø) = 0, F(+∞) = P(Ω) = ∫ (-∞ to +∞) f(x)dx = 1

SLIDE 12

Moments

For any function g(x), the expectation of g is:
E[g(X)] = ∫ g(x)f(x)dx
It's the mean value of g.
Moments μ_k are the expectations of X^k.
0th moment: μ_0 = 1 (pdf normalization)
1st moment: μ_1 = μ (mean)
X' = X - μ_1 is a central variable; the 2nd central moment is μ'_2 = σ² (variance).
Characteristic function:
φ(t) = E[e^(itX)] = ∫ f(x)e^(itx)dx = FT⁻¹[f]
From the Taylor expansion:
φ(t) = ∫ Σ_k (itx)^k/k! f(x)dx = Σ_k (it)^k/k! μ_k, so μ_k = i^(-k) d^kφ/dt^k |_(t=0)
The pdf is entirely defined by its moments. The CF is a useful tool for demonstrations.

SLIDE 13

Sample PDF

A sample is obtained from a random drawing within a population, described by a probability density function. We're going to discuss how to characterize, independently from one another:
  • a population
  • a sample
To this end, it is useful to consider a sample as a finite set from which one can randomly draw elements with equiprobability. We can then associate to this process a probability density, the empirical density or sample density:
f_sample(x) = (1/n) Σ_i δ(x - x_i)
This density will be useful to translate properties of distributions to a finite sample.

SLIDE 14

Characterizing a distribution

How to reduce a distribution/sample to a finite number of values?
  • Measure of location: reducing the distribution to one central value -> result
  • Measure of dispersion: spread of the distribution around the central value -> uncertainty/error
  • Frequency table/histogram (for a finite sample)

SLIDE 15

Location and dispersion

Mean value: sum (integral) of all possible values weighted by the probability of occurrence:
population: μ = ∫ (-∞ to +∞) x f(x)dx
sample (size n): μ = (1/n) Σ (i=1 to n) x_i

Standard deviation (σ) and variance (v = σ²): mean value of the squared deviation to the mean:
population: v = σ² = ∫ (x - μ)² f(x)dx
sample (size n): v = σ² = (1/n) Σ (i=1 to n) (x_i - μ)²

König's theorem:
σ² = ∫ x²f(x)dx - 2μ ∫ x f(x)dx + μ² ∫ f(x)dx = ⟨x²⟩ - ⟨x⟩²
(mean of the squares minus square of the mean).
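These definitions, and König's theorem, can be checked on a small sample (the numbers are arbitrary):

```python
# Sample mean and variance, and Koenig's theorem:
# sigma^2 = <x^2> - <x>^2 (mean of the squares minus square of the mean).
xs = [1.0, 2.0, 4.0, 7.0]
n = len(xs)
mu = sum(xs) / n
var = sum((x - mu) ** 2 for x in xs) / n
var_koenig = sum(x * x for x in xs) / n - mu ** 2
print(mu, var, var_koenig)  # 3.5 5.25 5.25
```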

SLIDE 16

Discrete distributions

Binomial distribution: randomly choosing K objects within a finite set of n, with a fixed drawing probability p.
Variable: K. Parameters: n, p.
Law: P(k; n,p) = n!/(k!(n-k)!) · p^k (1-p)^(n-k)
Mean: np. Variance: np(1-p).
(Illustration: p = 0.65, n = 10.)

Poisson distribution: limit of the binomial when n -> +∞, p -> 0, np = λ; counting events with fixed probability per time/space unit.
Variable: K. Parameter: λ.
Law: P(k; λ) = e^(-λ) λ^k/k!
Mean: λ. Variance: λ.
(Illustration: λ = 6.5.)

SLIDE 17

Real distributions

Uniform distribution: equiprobability over a finite range [a,b].
Parameters: a, b.
Law: f(x; a,b) = 1/(b-a) if a < x < b
Mean: μ = (a+b)/2. Variance: v = σ² = (b-a)²/12.

Normal distribution (Gaussian): limit of many processes.
Parameters: μ, σ.
Law: f(x; μ,σ) = 1/(σ√(2π)) · e^(-(x-μ)²/2σ²)

Chi-square distribution: sum of the squares of n reduced normal variables:
C = Σ (k=1 to n) X_k², with X_k = (X'_k - μ_k)/σ_k
Variable: C. Parameter: n.
Law: f(c; n) = c^(n/2-1) e^(-c/2) / (2^(n/2) Γ(n/2))
Mean: n. Variance: 2n.

SLIDE 18

Convergence

(Illustration only.)

SLIDE 19

Multidimensional PDF (1)

Random variables can be generalized to random vectors X = (X₁, X₂, …, X_n): the probability density function becomes:
f(x₁, …, x_n)dx₁…dx_n = P(x₁ < X₁ < x₁+dx₁ and x₂ < X₂ < x₂+dx₂ and … and x_n < X_n < x_n+dx_n)
and
P(a < X < b and c < Y < d) = ∫ (a to b) dx ∫ (c to d) dy f(x,y)
Marginal density: probability of only one of the components:
f_X(x)dx = P(x < X < x+dx and -∞ < Y < +∞) ⇒ f_X(x) = ∫ f(x,y)dy

SLIDE 20

Multidimensional PDF (2)

For a fixed value of Y = y₀, f(x|y₀)dx = « probability of x < X < x+dx knowing that Y = y₀ » is a conditional density for X. It is proportional to f(x,y) and normalized, ∫ f(x|y)dx = 1, so:
f(x|y) = f(x,y) / ∫ f(x,y)dx = f(x,y) / f_Y(y)
The two random variables X and Y are independent if all events of the form x < X < x+dx are independent from y < Y < y+dy:
f(x|y) = f_X(x) and f(y|x) = f_Y(y), hence f(x,y) = f_X(x)·f_Y(y)
Translated in terms of pdf's, Bayes' theorem becomes:
f(y|x) = f(x|y)·f_Y(y) / f_X(x) = f(x|y)·f_Y(y) / ∫ f(x|y)f_Y(y)dy

SLIDE 21

Covariance and correlation

A random vector (X,Y) can be treated as 2 separate variables: marginal densities, with a mean and a variance for each variable: μ_X, μ_Y, σ_X, σ_Y. This doesn't take into account correlations between the variables.

Generalized measure of dispersion: covariance of X and Y:
population: Cov(X,Y) = ∫∫ (x - μ_X)(y - μ_Y) f(x,y)dxdy = ⟨xy⟩ - μ_Xμ_Y = ρσ_Xσ_Y
sample: Cov(X,Y) = (1/n) Σ (i=1 to n) (x_i - μ_X)(y_i - μ_Y)

Correlation: ρ = Cov(X,Y) / (σ_X σ_Y)
Uncorrelated: ρ = 0. Independent ⇒ uncorrelated (the converse is not true). ρ only quantifies linear correlation.
(Illustrations: ρ = -0.5, ρ = 0, ρ = 0.9, and a nonlinear dependence with ρ = 0.)
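The sample formulas can be sketched directly (the two toy datasets are illustrative: one exactly linear, one uncorrelated yet dependent):

```python
from math import sqrt

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]    # y = 2x exactly -> rho = 1
zs = [1.0, -1.0, -1.0, 1.0]  # symmetric about the mean of x -> rho = 0, yet dependent
print(correlation(xs, ys), correlation(xs, zs))
```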

SLIDE 22

Decorrelation

Covariance matrix for n variables X_i: Σ_ij = Cov(X_i, X_j), i.e.

Σ = | σ₁²       ρ₁₂σ₁σ₂  …  ρ₁ₙσ₁σₙ |
    | ρ₁₂σ₁σ₂  σ₂²       …  ρ₂ₙσ₂σₙ |
    | …                              |
    | ρ₁ₙσ₁σₙ  ρ₂ₙσ₂σₙ  …  σₙ²      |

For uncorrelated variables Σ is diagonal. The matrix is real and symmetric: it can be diagonalized. One can define n new uncorrelated variables Y_i:
Σ' = B Σ B⁻¹ = diag(σ'₁², …, σ'ₙ²), Y = BX
The σ'_i² are the eigenvalues of Σ, and B contains the orthonormal eigenvectors. The Y_i are the principal components. Sorted from the largest to the smallest σ', they allow dimensional reduction.
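For two variables the decorrelation reduces to a rotation; a minimal sketch (the values σ₁ = 2, σ₂ = 1, ρ = 0.8 are arbitrary):

```python
from math import atan2, cos, sin

# Rotate a 2x2 covariance matrix so that the off-diagonal term vanishes:
# Sigma' = B Sigma B^T with an orthogonal (rotation) matrix B.
s1, s2, rho = 2.0, 1.0, 0.8
cov = [[s1 * s1, rho * s1 * s2],
       [rho * s1 * s2, s2 * s2]]

# Rotation angle that cancels the off-diagonal term: tan(2 theta) = 2 S12 / (S11 - S22).
theta = 0.5 * atan2(2 * cov[0][1], cov[0][0] - cov[1][1])
c, s = cos(theta), sin(theta)
B = [[c, s], [-s, c]]

def matmul(A, C):
    return [[sum(A[i][k] * C[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

BT = [[B[j][i] for j in range(2)] for i in range(2)]
cov_rot = matmul(matmul(B, cov), BT)
print(cov_rot)  # diagonal entries = eigenvalues of cov, off-diagonal ~ 0
```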

SLIDE 23

Regression

Measure of location:
  • a point: (μ_X, μ_Y)
  • a curve: the line closest to the points -> linear regression
Minimizing the dispersion between the line « y = ax+b » and the distribution:
w(a,b) = ∫∫ (y - ax - b)² f(x,y)dxdy = (1/n) Σ_i (y_i - ax_i - b)²
Setting ∂w/∂a = 0 and ∂w/∂b = 0 gives:
a = ρ σ_Y/σ_X, b = μ_Y - a μ_X
For fully correlated (ρ = 1) or fully anti-correlated (ρ = -1) variables, exactly Y = aX + b.

SLIDE 24

Multidimensional Pdfs

Multinomial distribution: randomly choosing K₁, K₂, …, K_s objects within a finite set of n, with a fixed drawing probability for each category p₁, p₂, …, p_s, with ΣK_i = n and Σp_i = 1.
Parameters: n, p₁, p₂, …, p_s.
Law: P(k₁, …, k_s; n, p₁, …, p_s) = n!/(k₁!k₂!…k_s!) · p₁^(k₁) p₂^(k₂) … p_s^(k_s)
Mean: μ_i = np_i. Variance: σ_i² = np_i(1-p_i). Cov(K_i, K_j) = -np_i p_j.
Rem: the variables are not independent. The binomial corresponds to s = 2, but has only one independent variable.

Multinormal distribution:
Parameters: μ⃗, Σ.
Law: f(x⃗; μ⃗, Σ) = 1/((2π)^(n/2) √|Σ|) · e^(-½ (x⃗-μ⃗)ᵀ Σ⁻¹ (x⃗-μ⃗))
If uncorrelated, Σ is diagonal and the law factorizes into 1-dimensional normals:
f(x⃗; μ⃗, Σ) = Π_i 1/(σ_i√(2π)) · e^(-(x_i-μ_i)²/2σ_i²)
For a multinormal: independent ⇔ uncorrelated.

SLIDE 25

Sum of random variables

The sum of several random variables is a new random variable S = Σ (i=1 to n) X_i. Assume the mean and variance of each variable exist.

Mean value of S:
μ_S = ∫ (Σ_i x_i) f(x₁, …, x_n)dx₁…dx_n = Σ_i ∫ x_i f_(X_i)(x_i)dx_i = Σ (i=1 to n) μ_i
The mean is an additive quantity.

Variance of S:
σ_S² = ∫ (Σ_i x_i - μ_S)² f(x₁, …, x_n)dx₁…dx_n = Σ (i=1 to n) σ_(X_i)² + 2 Σ (i<j) Cov(X_i, X_j)
For uncorrelated variables, σ_S² = Σ σ_(X_i)²: the variance is additive -> used for error combinations.
SLIDE 26

Sum of random variables

Probability density function of S: f_S(s). Using the characteristic function:
φ_S(t) = ∫ f_S(s) e^(ist) ds = ∫ f(x⃗) e^(it Σ x_i) dx⃗
For independent variables the characteristic function factorizes:
φ_S(t) = Π_k ∫ f_(X_k)(x_k) e^(itx_k) dx_k = Π_k φ_(X_k)(t)
Finally the pdf is the Fourier transform of the cf, so:
f_S = f_(X₁) ∗ f_(X₂) ∗ … ∗ f_(X_n)
The pdf of the sum is a convolution of the individual pdfs.
Sum of normal variables -> normal.
Sum of Poisson variables (λ₁ and λ₂) -> Poisson, λ = λ₁ + λ₂.
Sum of chi-2 variables (n₁ and n₂) -> chi-2, n = n₁ + n₂.

SLIDE 27

Sum of independent variables

Weak law of large numbers: a sample of size n is the realization of n independent variables with the same distribution (mean μ, variance σ²). The sample mean is a realization of M = S/n = (1/n) Σ X_i.
Mean value of M: μ_M = μ.
Variance of M: σ_M² = σ²/n.

Central-limit theorem: take n independent random variables of means μ_i and variances σ_i², and form the sum of the reduced variables:
C = (1/√n) Σ_i (X_i - μ_i)/σ_i
The pdf of C converges to a reduced normal distribution:
f_C(c) -> 1/√(2π) · e^(-c²/2) when n -> +∞
The sum of many random fluctuations is normally distributed.

SLIDE 28

Central limit theorem

(Illustration: histograms of X1, (X1+X2)·sqrt(2), (X1+X2+X3)·sqrt(3) and (X1+X2+X3+X4+X5)·sqrt(5), each compared to a Gaussian.)

SLIDE 29

Dispersion and uncertainty

Any measure (or combination of measures) is a realization of a random variable.
  • Measured value: θ
  • True value: θ₀
Uncertainty = quantifying the difference between θ and θ₀ -> a measure of dispersion. We will postulate:
Δθ = ασ_θ
(absolute error, always positive).
Usually one differentiates:
  • Statistical error: due to the measurement pdf.
  • Systematic errors or bias: a fixed but unknown deviation (equipment, assumptions, …). Systematic errors can be seen as statistical errors in a set of similar experiments.

SLIDE 30

Error sources

Observation error ΔO, position error ΔP, scaling error ΔS:
θ = θ₀ + δO + δS + δP
Each δ_i is a realization of a random variable, with mean 0 (negligible bias) and variance σ_i². For uncorrelated error sources:
Δ_O = ασ_O, Δ_S = ασ_S, Δ_P = ασ_P ⇒ Δ_tot² = α²(σ_O² + σ_S² + σ_P²) = Δ_O² + Δ_S² + Δ_P²
Choice of α? If there are many sources, the central-limit theorem gives a normal distribution:
α = 1 gives (approximately) a 68% confidence interval;
α = 2 gives 95% CL (and at least 75% from Bienaymé-Chebyshev).

SLIDE 31

Error propagation

Measure: x ± Δx. Compute f(x) -> Δf?
Assuming small errors, using a Taylor expansion:
f(x+Δx) = f(x) + (df/dx)Δx + ½(d²f/dx²)Δx² + …
f(x-Δx) = f(x) - (df/dx)Δx + ½(d²f/dx²)Δx² - …
⇒ Δf = ½ |f(x+Δx) - f(x-Δx)| = |df/dx| Δx

SLIDE 32

Error propagation

Measure: x ± Δx, y ± Δy, … Compute f(x,y,…) -> Δf?
Idea: treat the effect of each variable as separate error sources, evaluated at the measured point (x_m, y_m) on the surface z = f(x,y):
Δf_x = |∂f/∂x| Δx, Δf_y = |∂f/∂y| Δy
Then:
Δf² = Δf_x² + Δf_y² + 2ρ_xy Δf_x Δf_y
    = (∂f/∂x)² Δx² + (∂f/∂y)² Δy² + 2ρ_xy (∂f/∂x)(∂f/∂y) ΔxΔy
and in the general case:
Δf² = Σ_i (∂f/∂x_i)² Δx_i² + 2 Σ (i<j) ρ_(x_i x_j) (∂f/∂x_i)(∂f/∂x_j) Δx_i Δx_j
Special cases:
uncorrelated: Δf² = Σ_i (∂f/∂x_i)² Δx_i²
correlated: Δf = |∂f/∂x| Δx + |∂f/∂y| Δy
anticorrelated: Δf = | |∂f/∂x| Δx - |∂f/∂y| Δy |
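A sketch of the uncorrelated formula with numerical derivatives (f(x,y) = x·y and the values are illustrative):

```python
from math import sqrt

def propagate_uncorrelated(f, x, dx, y, dy, eps=1e-6):
    """Delta_f^2 = (df/dx)^2 Dx^2 + (df/dy)^2 Dy^2, derivatives by central differences."""
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return sqrt((dfdx * dx) ** 2 + (dfdy * dy) ** 2)

x, dx = 3.0, 0.1
y, dy = 4.0, 0.2
df = propagate_uncorrelated(lambda a, b: a * b, x, dx, y, dy)
print(df)  # sqrt((y*Dx)^2 + (x*Dy)^2) = sqrt(0.16 + 0.36) ~ 0.721
```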

SLIDE 33

Parametric estimation

From a finite sample {x_i} -> estimating a parameter θ.
A statistic is a function S = f({x_i}). Any statistic can be considered as an estimator of θ. To be a good estimator it needs to satisfy:
  • Consistency: the estimator converges to the true value for an infinite sample.
  • Bias: difference between the expectation of the estimator and the true value.
  • Efficiency: speed of convergence.
  • Robustness: sensitivity to statistical fluctuations.
A good estimator should at least be consistent and asymptotically unbiased. Efficient / unbiased / robust often contradict each other ⇒ different choices for different applications.

SLIDE 34

Bias and consistency

As the sample is a set of realizations of random variables (or of one vector variable), so is the estimator: it has a mean, a variance, … and a probability density function. θ̂ is a realization of the estimator Θ̂.

Bias: mean deviation of the estimator from the true value:
b(θ̂) = E[Θ̂] - θ = μ_Θ̂ - θ
Unbiased estimator: b(θ̂) = 0. Asymptotically unbiased: b(θ̂) -> 0 when n -> +∞.

Consistency: formally,
P(|θ̂ - θ| > ε) -> 0 when n -> +∞, for all ε
In practice, an asymptotically unbiased estimator is consistent if σ_Θ̂ -> 0 when n -> +∞.
(Illustration: biased, asymptotically unbiased and unbiased estimator pdfs.)

SLIDE 35

Efficiency

For any unbiased estimator of θ, the variance cannot go below the Cramér-Rao bound:
σ_Θ̂² ≥ 1 / E[(∂lnL/∂θ)²] = -1 / E[∂²lnL/∂θ²]
The efficiency of a convergent estimator is given by its variance. An efficient estimator reaches the Cramér-Rao bound (at least asymptotically): the minimal variance estimator. The MVE will often be biased, but asymptotically unbiased.
slide-36
SLIDE 36

Empirical estimator

Sam ample mean ean is a good estimator of the populati ation m mean

  • > weak

ak law o

  • f larg

rge n e numbers ers : convergent, unbiased

36

( )

2 2 2 2 μ i 2 2 2 i 2 i i 2 i 2

σ n 1 n n σ σ σ σ n 1 ] s E[ μ μ μ) (x n 1 ) μ (x n 1 s − = − = −       = − −       − = − =

∑ ∑ ∑

ˆ

ˆ ˆ ˆ ˆ n σ ] μ)

  • μ

E[( σ μ, ] μ E[ μ , x n 1 μ

2 2 2 μ μ i

= = = = = ∑ ˆ ˆ ˆ

ˆ ˆ

biased, a asympto toti tical cally unbias ased ed unbias ased ed v varian ance e ce estimato ator r : variance of the estimator (convergence)

− =

i 2 i 2

) μ (x 1

  • n

1 σ ˆ ˆ n 2σ 2 n 1 n 1

  • n

σ σ

4 2 4 2 σ2

→       + − = γ

ˆ

Sample v e vari rian ance ce as an estimator of the population v varian ance ce :
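The bias of the 1/n estimator can be seen by simulation (uniform population, σ² = 1/12; the sample sizes are arbitrary):

```python
import random

random.seed(7)

n, n_exp = 5, 40_000
biased, unbiased = 0.0, 0.0
for _ in range(n_exp):
    xs = [random.random() for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased += ss / n            # s^2: divides by n
    unbiased += ss / (n - 1)    # sigma-hat^2: divides by n-1
biased /= n_exp
unbiased /= n_exp
true_var = 1.0 / 12.0
print(biased, unbiased, true_var)  # ~ (n-1)/n * sigma^2, ~ sigma^2, sigma^2
```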

SLIDE 37

Errors on these estimators

Uncertainty ⇔ estimator standard deviation. Use an estimator of the standard deviation: σ̂ = √(σ̂²) (!!! biased).
Mean: μ̂ = (1/n) Σ x_i, σ_μ̂² = σ²/n ⇒ Δμ̂ = σ̂/√n
Variance: σ̂² = (1/(n-1)) Σ (x_i - μ̂)², σ_σ̂²² ≈ 2σ⁴/n ⇒ Δσ̂² = σ̂² √(2/n)
By the central-limit theorem, the empirical estimators of mean and variance are normally distributed for large enough samples, so μ̂ ± Δμ̂ and σ̂² ± Δσ̂² define 68% confidence intervals.

SLIDE 38

Likelihood function

Generic function k(x, θ): x = random variable(s), θ = parameter(s).

Fix θ = θ₀ (the true value): one gets the probability density function f(x; θ) = k(x, θ₀), with ∫ f(x;θ)dx = 1. For Bayesians, f(x|θ) = f(x;θ).

Fix x = u (one realization of the random variable): one gets the likelihood function L(θ) = k(u, θ), with ∫ L(θ)dθ = ??? (not normalized). For Bayesians, f(θ|x) = L(θ)/∫L(θ)dθ.

For a sample of n independent realizations of the same variable X:
L(θ) = Π_i k(x_i, θ) = Π_i f(x_i; θ)

SLIDE 39

Maximum likelihood

For a sample of measurements {x_i}, the analytical form of the density is known, but it depends on several unknown parameters θ.
  • e.g. event counting: the counts follow a Poisson distribution, with a parameter that depends on the physics, λ_i(θ):
L(θ) = Π_i e^(-λ_i(θ)) λ_i(θ)^(x_i) / x_i!
An estimator of the parameters θ is the set of values that maximizes the probability of observing the observed result -> maximum of the likelihood function:
∂L/∂θ |_(θ=θ̂) = 0
rem: this is a system of equations when there are several parameters.
rem: one often minimizes -lnL instead: it simplifies the expressions.
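A minimal sketch for the simplest case, n Poisson counts with one constant parameter λ, where the analytic MLE is the sample mean (the counts are made up):

```python
from math import lgamma, log

def nll(lam, counts):
    """-lnL for i.i.d. Poisson counts: sum of lam - k*ln(lam) + ln(k!)."""
    return sum(lam - k * log(lam) + lgamma(k + 1) for k in counts)

counts = [3, 5, 4, 6, 2]
# Brute-force scan of -lnL; its minimum sits at the sample mean.
grid = [0.5 + 0.01 * i for i in range(1000)]
lam_hat = min(grid, key=lambda l: nll(l, counts))
print(lam_hat)  # ~ sample mean = 4.0
```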

SLIDE 40

Properties of MLE

Mostly asymptotic properties: valid for large samples, often assumed in any case for lack of better information.
  • Asymptotically unbiased.
  • Asymptotically efficient (reaches the CR bound).
  • Asymptotically normally distributed -> multinormal law, with covariance given by the generalization of the CR bound:
f(θ̂; θ, Σ) = 1/((2π)^(k/2) √|Σ|) · e^(-½ (θ̂-θ)ᵀ Σ⁻¹ (θ̂-θ)), with (Σ⁻¹)_ij = -E[∂²lnL/∂θ_i∂θ_j]
Goodness of fit: the value of -2lnL(θ̂) is chi-2 distributed, with ndf = sample size - number of parameters:
p-value = ∫ (-2lnL(θ̂) to +∞) f_χ²(x; ndf)dx
the probability of getting a worse agreement.

SLIDE 41

Errors on MLE

Errors on the parameters -> from the covariance matrix:
(Σ⁻¹)_ij = -E[∂²lnL/∂θ_i∂θ_j]
With only one realization of the estimator -> empirical mean of 1 value:
(Σ⁻¹)_ij ≈ -∂²lnL/∂θ_i∂θ_j |_(θ=θ̂)
For one parameter, the 68% interval is:
Δθ = σ_θ̂ = (-∂²lnL/∂θ²)^(-1/2)
More generally, confidence contours are defined by the equation (Taylor expansion of lnL around θ̂):
ΔlnL = lnL(θ) - lnL(θ̂) = -½ Σ (i,j) (Σ⁻¹)_ij (θ_i-θ̂_i)(θ_j-θ̂_j) + O(θ³) = -β(n_θ, α)
with α = ∫ (0 to 2β) f_χ²(x; n_θ)dx.
Values of β for different numbers of parameters n_θ and confidence levels α:

α (%) \ n_θ |   1   |   2   |   3
68.3        |  0.5  |  1.15 |  1.76
95.4        |  2    |  3.09 |  4.01
99.7        |  4.5  |  5.92 |  7.08

SLIDE 42

Least squares

Set of measurements (x_i, y_i) with uncertainties Δy_i on the y_i. Theoretical law: y = f(x, θ).
Naïve approach: use regression:
w(θ) = Σ_i (y_i - f(x_i, θ))², ∂w/∂θ = 0
Better: reweight each term by the error:
K²(θ) = Σ_i ((y_i - f(x_i, θ))/Δy_i)², ∂K²/∂θ = 0
Maximum likelihood: assume each y_i is normally distributed with a mean equal to f(x_i, θ) and a standard deviation equal to Δy_i. Then the likelihood is:
L(θ) = Π_i 1/(Δy_i √(2π)) · e^(-½ ((y_i - f(x_i,θ))/Δy_i)²)
∂lnL/∂θ = 0 ⇔ -½ ∂K²/∂θ = 0 ⇔ ∂K²/∂θ = 0
The least-squares or chi-2 fit is the MLE for Gaussian errors.
Generic case with correlations:
K²(θ) = (y⃗ - f(x⃗,θ))ᵀ Σ⁻¹ (y⃗ - f(x⃗,θ))

SLIDE 43

Example: fitting a line

For f(x) = ax + b, the minimum of K² is analytic. With
A = Σ x_iy_i/Δy_i², B = Σ x_i²/Δy_i², C = Σ x_i/Δy_i², D = Σ y_i/Δy_i², E = Σ 1/Δy_i²
the estimators and their errors are:
â = (AE - DC)/(BE - C²), b̂ = (DB - AC)/(BE - C²)
Δâ² = E/(BE - C²), Δb̂² = B/(BE - C²)
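These formulas translate directly into code; tested here on points lying exactly on y = 2x + 1, so the fit must return a = 2, b = 1 whatever the (made-up) weights:

```python
from math import sqrt

def fit_line(xs, ys, dys):
    """Weighted least-squares fit of y = a x + b, analytic solution."""
    A = sum(x * y / d ** 2 for x, y, d in zip(xs, ys, dys))
    B = sum(x * x / d ** 2 for x, d in zip(xs, dys))
    C = sum(x / d ** 2 for x, d in zip(xs, dys))
    D = sum(y / d ** 2 for y, d in zip(ys, dys))
    E = sum(1 / d ** 2 for d in dys)
    det = B * E - C * C
    a = (A * E - D * C) / det
    b = (D * B - A * C) / det
    return a, b, sqrt(E / det), sqrt(B / det)  # a, b, Delta-a, Delta-b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
dys = [0.1, 0.2, 0.1, 0.3]
a, b, da, db = fit_line(xs, ys, dys)
print(a, b)  # 2.0 1.0
```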

SLIDE 44

Example: fitting a line

(Illustration: 2-dimensional error contours on a and b.)

SLIDE 45

Non parametric estimation

Directly estimating the probability density function. Uses:
  • Likelihood ratio discriminant
  • Separating power of variables
  • Data/MC agreement

Frequency table: for a sample {x_i}, i = 1..n:
1. Define successive intervals (bins) C_k = [a_k, a_(k+1)[
2. Count the number of events n_k in C_k
Histogram: graphical representation of the frequency table:
h(x) = n_k if x ∈ C_k

SLIDE 46

Histogram

N/Z for stable heavy nuclei:

1.321, 1.357, 1.392, 1.410, 1.428, 1.446, 1.464, 1.421, 1.438, 1.344, 1.379, 1.413, 1.448, 1.389, 1.366, 1.383, 1.400, 1.416, 1.433, 1.466, 1.500, 1.322, 1.370, 1.387, 1.403, 1.419, 1.451, 1.483, 1.396, 1.428, 1.375, 1.406, 1.421, 1.437, 1.453, 1.468, 1.500, 1.446, 1.363, 1.393, 1.424, 1.439, 1.454, 1.469, 1.484, 1.462, 1.382, 1.411, 1.441, 1.455, 1.470, 1.500, 1.449, 1.400, 1.428, 1.442, 1.457, 1.471, 1.485, 1.514, 1.464, 1.478, 1.416, 1.444, 1.458, 1.472, 1.486, 1.500, 1.465, 1.479, 1.432, 1.459, 1.472, 1.486, 1.513, 1.466, 1.493, 1.421, 1.447, 1.460, 1.473, 1.486, 1.500, 1.526, 1.480, 1.506, 1.435, 1.461, 1.487, 1.500, 1.512, 1.538, 1.493, 1.450, 1.475, 1.500, 1.512, 1.525, 1.550, 1.506, 1.530, 1.487, 1.512, 1.524, 1.536, 1.518, 1.577, 1.554, 1.586, 1.586

SLIDE 47

Histogram as PDF estimator

Statistical description: the n_k are multinomial random variables, with parameters:
p_k = P(x ∈ C_k) = ∫ (C_k) f(x)dx
μ_(n_k) = np_k, σ_(n_k)² = np_k(1-p_k) ≈ np_k for p_k << 1, Cov(n_k, n_r) = -np_kp_r ≈ 0 for p_k, p_r << 1
For a large sample: n_k/n -> p_k.
For small classes (width δ): p_k = ∫ (C_k) f(x)dx ≈ δ f(x_k), so p_k/δ -> f(x).
So finally:
f(x) = lim (n -> +∞, δ -> 0) h(x)/(nδ)
The histogram is an estimator of the probability density. Each bin can be described by a Poisson density; the 1σ error on n_k is then:
Δn_k = σ̂_(n_k) = √(μ̂_(n_k)) = √(n_k)
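A sketch of the frequency table with its √n_k Poisson errors (the sample and binning are arbitrary):

```python
from math import sqrt

def histogram(sample, edges):
    """Count events per bin C_k = [edges[k], edges[k+1][."""
    counts = [0] * (len(edges) - 1)
    for x in sample:
        for k in range(len(counts)):
            if edges[k] <= x < edges[k + 1]:
                counts[k] += 1
                break
    return counts

sample = [0.1, 0.2, 0.25, 0.5, 0.6, 0.9, 1.4, 1.5, 1.7, 2.3]
edges = [0.0, 1.0, 2.0, 3.0]
counts = histogram(sample, edges)
errors = [sqrt(n) for n in counts]
print(counts, errors)  # [6, 3, 1] with errors sqrt(6), sqrt(3), 1
```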

SLIDE 48

Confidence interval

For a random variable, a confidence interval with confidence level α is any interval [a,b] such that:
P(X ∈ [a,b]) = ∫ (a to b) f_X(x)dx = α
i.e. the probability of finding a realization inside the interval is α. This generalizes the concept of uncertainty: an interval that contains the true value with a given probability. The two schools attach slightly different concepts to it.
For Bayesians: the posterior density is the probability density of the true value, and can be used to derive an interval such that
P(θ ∈ [a,b]) = α
No such thing for a frequentist: the interval itself becomes the random variable. [a,b] is a realization of [A,B], with
P(A < θ and B > θ) = α
independently of θ.

SLIDE 49

Confidence interval

Mean-centered, symmetric interval [μ-a, μ+a]:
∫ (μ-a to μ+a) f(x)dx = α
Mean-centered, probability-symmetric interval [a,b]:
∫ (a to μ) f(x)dx = ∫ (μ to b) f(x)dx = α/2
Highest Probability Density (HPD) interval [a,b]:
∫ (a to b) f(x)dx = α, with f(x) > f(y) for any x ∈ [a,b] and y ∉ [a,b]

SLIDE 50

Confidence Belt

To build a frequentist interval for an estimator θ̂ of θ:
1. Make pseudo-experiments for several values of θ and compute the estimator for each (MC sampling of the estimator pdf).
2. For each θ, determine Ξ(θ) and Ω(θ) such that:
θ̂ < Ξ(θ) for a fraction (1-α)/2 of the pseudo-experiments
θ̂ > Ω(θ) for a fraction (1-α)/2 of the pseudo-experiments
These 2 curves are the confidence belt, for a CL α.
3. Invert these functions. The interval [Ω⁻¹(θ̂), Ξ⁻¹(θ̂)] satisfies:
P(Ω⁻¹(θ̂) < θ < Ξ⁻¹(θ̂)) = 1 - P(θ > Ξ⁻¹(θ̂)) - P(θ < Ω⁻¹(θ̂)) = 1 - P(θ̂ < Ξ(θ)) - P(θ̂ > Ω(θ)) = α

(Illustration: confidence belt for a Poisson parameter λ estimated with the empirical mean of 3 realizations, 68% CL.)

SLIDE 51

Dealing with systematics

The variance of the estimator only measures the statistical uncertainty. Often, we will have to deal with some parameters whose values are known with limited precision: systematic uncertainties.
The likelihood function becomes L(θ, ν), with the constraint ν = ν₀ ± Δν. The imperfectly known parameters ν are nuisance parameters.

SLIDE 52

Bayesian inference

In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior π(ν). Usually a multinormal law is used, with mean ν₀ and covariance matrix estimated from Δν₀ (+ correlations, if needed):
f(θ, ν | x) = f(x | θ, ν)π(θ)π(ν) / ∫∫ f(x | θ, ν)π(θ)π(ν) dθdν
The final posterior is obtained by marginalization over the nuisance parameters:
f(θ | x) = ∫ f(θ, ν | x)dν = ∫ f(x | θ, ν)π(θ)π(ν)dν / ∫∫ f(x | θ, ν)π(θ)π(ν) dθdν

SLIDE 53

Profile Likelihood

There is no true frequentist way to add systematic effects. Popular method of the day: profiling. Deal with the nuisance parameters as realizations of random variables, and extend the likelihood:
L'(θ, ν) = L(θ, ν) G(ν)
G(ν) is the likelihood of the new parameters (identical to the prior). For each value of θ, maximize the likelihood with respect to the nuisances: the profile likelihood PL(θ).
PL(θ) has the same asymptotic statistical properties as the regular likelihood.

SLIDE 54

Statistical Tests

Statistical tests aim at:
  • Checking the compatibility of a dataset {x_i} with a given distribution
  • Checking the compatibility of two datasets {x_i}, {y_i}: are they issued from the same distribution?
  • Comparing different hypotheses: background vs signal+background
In every case:
  • build a statistic that quantifies the agreement with the hypothesis
  • convert it into a probability of compatibility/incompatibility: the p-value

SLIDE 55

Pearson test

Test for binned data: use the Poisson limit of the histogram.
  • Sort the sample into k bins C_i, with contents n_i.
  • Compute the probability of each class: p_i = ∫ (C_i) f(x)dx.
  • The test statistic compares, for each bin, the deviation of the observation (data) from the expected (Poisson) mean to the theoretical (Poisson) standard deviation:
χ² = Σ_bins (n_i - np_i)²/(np_i)
Then χ² follows (asymptotically) a chi-2 law with k-1 degrees of freedom (1 constraint: Σn_i = n).
p-value: the probability of doing worse:
p = ∫ (χ² to +∞) f_χ²(x; k-1)dx
For a « good » agreement, χ²/(k-1) ~ 1; more precisely (1σ interval, ~68% CL):
χ² ∈ (k-1) ± √(2(k-1))
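A sketch with k = 3 bins, so k-1 = 2 degrees of freedom, for which the chi-2 survival function is simply exp(-χ²/2) (the counts are made up; the tested distribution is uniform over the bins):

```python
from math import exp

observed = [30, 42, 28]              # n = 100 events in 3 bins
probs = [1 / 3, 1 / 3, 1 / 3]        # tested law: equal bin probabilities
n = sum(observed)
chi2 = sum((ni - n * pi) ** 2 / (n * pi) for ni, pi in zip(observed, probs))
p_value = exp(-chi2 / 2)             # survival function of chi-2 with 2 dof
print(chi2, p_value)                 # 3.44, ~0.18: compatible
```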

SLIDE 56

Kolmogorov-Smirnov test

Test for unbinned data: compare the sample cumulative density function to the tested one.
Sample pdf (ordered sample): f_s(x) = (1/n) Σ δ(x - x_i), so the sample cdf is the staircase
F_s(x) = 0 if x < x₁; k/n if x_k ≤ x < x_(k+1); 1 if x > x_n
The Kolmogorov statistic is the largest deviation:
D_n = sup_x |F_s(x) - F(x)|
The test distribution has been computed by Kolmogorov:
P(√n·D_n > z) = 2 Σ (r=1 to ∞) (-1)^(r-1) e^(-2r²z²)
[0, β] defines a confidence interval for D_n:
β = 0.9584/√n for 68.3% CL
β = 1.3754/√n for 95.4% CL

SLIDE 57

Example

Test compatibility with an exponential law f(x) = λe^(-λx), λ = 0.4, for the following sample:

0.008, 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421, 0.423, 0.455, 0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774, 0.825, 0.843, 0.921, 0.987, 0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287, 1.317, 1.320, 1.333, 1.412, 1.421, 1.438, 1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146, 2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372, 2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029, 3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814, 3.854, 3.929, 4.038, 4.065, 4.089, 4.177, 4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222, 6.556, 6.670, 7.673, 8.071, 8.165, 8.181, 8.383, 8.557, 8.606, 9.032, 10.482, 14.174

Result: D_n = 0.069, p-value = 0.0617, 1σ interval: [0, 0.0875].
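A sketch of the Kolmogorov statistic and the asymptotic p-value series, applied to a tiny made-up sample against the same exponential law (with only 5 points the asymptotic formula is crude; it is shown only for illustration):

```python
from math import exp, sqrt

def kolmogorov_D(sample, cdf):
    """Largest deviation between the sample cdf (a staircase) and the tested cdf."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(k / n - cdf(x), cdf(x) - (k - 1) / n)
               for k, x in enumerate(xs, start=1))

def ks_pvalue(d, n, terms=100):
    """P(sqrt(n) D_n > z) = 2 sum_r (-1)^(r-1) exp(-2 r^2 z^2), z = sqrt(n) d."""
    z = sqrt(n) * d
    return 2 * sum((-1) ** (r - 1) * exp(-2 * r * r * z * z)
                   for r in range(1, terms + 1))

lam = 0.4
cdf = lambda x: 1 - exp(-lam * x)
sample = [0.5, 1.0, 2.0, 3.5, 6.0]
D = kolmogorov_D(sample, cdf)
print(D, ks_pvalue(D, len(sample)))
```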