
SLIDE 1

Statistics and learning

Statistical estimation

Emmanuel Rachelson and Matthieu Vignes

ISAE SupAero

Wednesday 18th September 2013


SLIDE 2

How to retrieve 'lecture' support & practical sessions

  • LMS @ ISAE, or
  • My website (clickable links)


SLIDE 3

Things you have to keep in mind

Crux of estimation:

◮ Population, sample and statistics.
◮ Concept of estimator of a parameter.
◮ Bias, comparison of estimators, Maximum Likelihood Estimator.
◮ Sufficient statistics, quantiles.
◮ Interval estimation.


SLIDE 4

Statistical estimation

Steps in the estimation procedure:

◮ Consider a population (size N) described by a random variable X (known or unknown distribution) with parameter θ,
◮ a sample of n ≤ N independent observations (x1, ..., xn) is drawn,
◮ θ is estimated through a statistic (a function of the Xi's): θ̂ = T(X1, ..., Xn).

Note: independence holds exactly only if the drawing is made with replacement; without replacement, it remains a good approximation when n ≪ N.

Mean estimation

Estimate the average life span of a bulb...

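As a minimal illustration of these steps in R (the language used for the practical sessions mentioned later), consider the bulb example under the assumption of exponentially distributed life spans; the true mean of 1000 hours, the sample size and the seed below are hypothetical choices, not course data.

    set.seed(42)
    n <- 50                          # sample size, with n << N assumed
    x <- rexp(n, rate = 1 / 1000)    # observed sample x1, ..., xn (assumed model)
    theta_hat <- mean(x)             # statistic T(X1, ..., Xn) = empirical mean
    theta_hat                        # point estimate of the average life span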

SLIDE 5

Point estimation of a parameter

Recall

n realisations of iid random variables (X1, ..., Xn) are available. Some parameters of their distribution are of interest. Computing them directly over the whole population is not feasible, so estimation is needed. Objective here: tools and mathematical grounds for estimation.

Definitions

◮ Statistical model: specification of a probability distribution Pθ (a joint distribution if the rv are discrete, a density if they are continuous), where θ is a (p-vector of) unknown parameter(s).
◮ Statistic: T : Rⁿ → Rᵖ, (xi)i=1...n ↦ T(x1, ..., xn). Examples: the empirical mean, or the empirical variance (with known/unknown mean).


SLIDE 6

Estimator, bias, comparison

Exercise

A lift can bear 1,000 kg. User weight ∼ N(75, 16²).

◮ Maximum number of people allowed in it if P(lift won't take off) = 10⁻⁶?
◮ The lift manufacturer allows 11 people inside. P(overweight) = ? (A numerical sketch in R is given at the end of this slide.)

Definitions

◮ Estimator of an unknown parameter θ: a statistic denoted θ̂ (its observed values are approximations of θ). The bias associated with θ̂ is E[θ̂] − θ (if it equals 0, θ̂ is said to be unbiased). Examples (exercises): (i) the empirical mean is an unbiased estimator of the (theoretical) mean; (ii) S²ₙ := (1/n) Σ_{i=1}^n (Xᵢ − X̄)² is a biased estimator of σ².
◮ θ̂ is asymptotically unbiased if limₙ→∞ E[θ̂] = θ.
◮ If θ̂₁ and θ̂₂ are two unbiased estimators of θ, θ̂₁ is better than θ̂₂ if Var(θ̂₁) < Var(θ̂₂); in practice, θ̂₁ converges faster than θ̂₂.

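A sketch in R of how the lift exercise can be checked numerically, under the assumption that the total weight of k independent users is N(75k, 16²·k); the answers are computed rather than asserted, and the search grid 1:13 is an arbitrary choice.

    # P(overweight) = P(total weight of k users > 1000 kg) under the iid normal assumption
    p_over <- function(k) pnorm(1000, mean = 75 * k, sd = 16 * sqrt(k), lower.tail = FALSE)
    max(which(sapply(1:13, p_over) <= 1e-6))   # question 1: largest allowed number of people
    p_over(11)                                 # question 2: P(overweight) with 11 people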

SLIDE 7

Application break

Estimating the duration of a traffic light

θ > 0 is the actual duration of a traffic light. It is unknown. We observe a sample t1, ..., tn, where ti is the waiting time of driver i.

  • 1. What is a good model for the Ti's? Density? Mean and variance?
  • 2. If T̄ = (1/n) Σ_{i=1}^n Ti, what are E[T̄] and Var(T̄)? Can you use T̄ to build an unbiased estimator of θ? Establish its convergence in probability.
  • 3. Let Mn = supᵢ Ti. Compute the cumulative distribution function of Mn. Density? Mean and variance? Plot the cdf for n = 3 and n = 30 and interpret. Use Mn to build an unbiased, probability-convergent estimator of θ.
  • 4. Compare the variances of both estimators. Which one would you use to estimate θ? (A small simulation sketch follows this list.)
  • 5. Numerical application for n = 3 and (t1, t2, t3) = (2, 24, 13).
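A Monte Carlo sketch for question 4, under the assumption (suggested by question 1) that the waiting times Ti are Uniform(0, θ); the value θ = 30 and the number of replications are hypothetical.

    set.seed(1)
    theta <- 30; n <- 3; R <- 10000            # hypothetical true duration and sample size
    est_mean <- est_max <- numeric(R)
    for (r in 1:R) {
      t <- runif(n, 0, theta)
      est_mean[r] <- 2 * mean(t)               # unbiased estimator built from the mean T
      est_max[r]  <- (n + 1) / n * max(t)      # unbiased estimator built from Mn
    }
    c(mean(est_mean), mean(est_max))           # both close to theta (unbiasedness)
    c(var(est_mean), var(est_max))             # the Mn-based estimator has smaller variance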

SLIDE 8

Convergence of estimators

Def: θ̂ converges in probability towards θ if ∀ε > 0, P(|θ̂ − θ| < ε) → 1 as n → ∞.

Theorem

An (asymptotically) unbiased estimator such that limₙ Var(θ̂) = 0 converges in probability towards θ.

Theorem

An unbiased estimator θ̂ satisfying the technical regularity hypotheses (H1)-(H5) below verifies Var(θ̂) ≥ Vn(θ), where Vn(θ) := (−E[∂² log f(X1, ..., Xn; θ) / ∂θ²])⁻¹ is the Cramér-Rao bound (the inverse of the Fisher information).

(H1) The support D := {x : f(x; θ) > 0} does not depend on θ.
(H2) θ belongs to an open interval I.
(H3) On I × D, ∂f/∂θ and ∂²f/∂θ² exist and are integrable over x.
(H4) θ ↦ ∫_A f(x; θ) dx has a second-order derivative (θ ∈ I, A ∈ B(R)).
(H5) (∂ log f(X; θ)/∂θ)² is integrable.

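A quick numerical check of the bound in a case where it can be computed by hand: for Xi ∼ N(m, σ²) with σ known, the Fisher information for m is n/σ², so Vn(m) = σ²/n, and the empirical mean attains it. The values of m, σ and n below are arbitrary.

    set.seed(7)
    m <- 2; sigma <- 3; n <- 25; R <- 20000
    means <- replicate(R, mean(rnorm(n, m, sigma)))        # R replications of the empirical mean
    c(empirical_var = var(means), cramer_rao_bound = sigma^2 / n)   # both close to 0.36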

SLIDE 9

Application to the estimation of a |N|

Definition

An unbiased estimator θ̂ of θ is efficient if its variance is equal to the Cramér-Rao bound. It is then the best possible among unbiased estimators.

Exercise

Let (Xi)i=1...n be iid rv ∼ N(m, σ²). Yi := |Xi − m| is observed.

◮ Density of Yi? Compute E[Yi]. Interpretation compared to σ?
◮ Let σ̂ := Σᵢ aᵢYᵢ. If we want σ̂ to be unbiased, give a constraint on the (aᵢ)'s. Under this constraint, show that Var(σ̂) is minimal iff all aᵢ are equal. In this case, give the variance.
◮ Compare the Cramér-Rao bound to the above variance. Is the estimator built this way efficient?

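For the first question, a numerical check of a fact worth keeping in mind: if Xi ∼ N(m, σ²), then E|Xi − m| = σ√(2/π) (the half-normal mean), which suggests how Σᵢ aᵢYᵢ must be rescaled to be unbiased for σ. The parameter values below are arbitrary.

    set.seed(3)
    m <- 5; sigma <- 2
    y <- abs(rnorm(1e6, m, sigma) - m)                 # simulated Yi = |Xi - m|
    c(empirical = mean(y), theoretical = sigma * sqrt(2 / pi))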

SLIDE 10

Likelihood function

Definition

The likelihood of a rv X = (X1, ..., Xn) is the function L : Rⁿ × Θ → R⁺, (x, θ) ↦ L(x; θ), defined as

  L(x; θ) := f(x; θ), the density of X, if X is continuous, or
  L(x; θ) := Pθ(X1 = x1, ..., Xn = xn), if X is discrete.

Examples

◮ Xi Gaussian iid rv:
  L(x; θ) = Πᵢ f(xᵢ; θ) = (1 / (σ√(2π)))ⁿ exp( −(1/2) Σᵢ ((xᵢ − m)/σ)² )
◮ Xi Bernoulli iid rv:
  L(x; θ) = p^(Σᵢ xᵢ) (1 − p)^(n − Σᵢ xᵢ)

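A short R sketch of evaluating the Gaussian log-likelihood above at different parameter values on a hypothetical sample; the data-generating values are illustrative only.

    loglik_gauss <- function(x, m, sigma) sum(dnorm(x, mean = m, sd = sigma, log = TRUE))
    set.seed(11)
    x <- rnorm(40, mean = 1, sd = 2)        # hypothetical observed sample
    loglik_gauss(x, m = 1, sigma = 2)       # log-likelihood at the generating parameters
    loglik_gauss(x, m = 5, sigma = 2)       # much lower at a poorly chosen m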

SLIDE 11

Maximum likelihood estimation (MLE)

Definition

θ̂MLE := arg max_{θ∈Θ} (log) L(x1, ..., xn; θ)

Interpretation: θ̂MLE is the parameter value that gives maximum probability (or density) to the observed values of the random variables...
Remark: the MLE does not always exist (possible alternatives: least squares or method of moments). When it exists, it is not necessarily unique, and it can be biased or not efficient. However...

Theorem

◮ θ̂MLE is asymptotically unbiased and efficient.
◮ (θ̂MLE − θ) / √(Vn(θ)) → N(0, 1) as n → ∞, where Vn(θ) is the Cramér-Rao bound.
◮ θ̂MLE converges to θ in quadratic mean.

'MLE for a proportion' exercise? Mean and variance estimation of N(µ, σ²).

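As a sketch for the 'MLE for a proportion' exercise, the log-likelihood can be maximised numerically and compared with the closed-form answer, the empirical mean of the xi's; the simulated sample and the true p = 0.3 are hypothetical.

    set.seed(5)
    x <- rbinom(100, size = 1, prob = 0.3)              # hypothetical observed 0/1 sample
    negloglik <- function(p) -sum(dbinom(x, size = 1, prob = p, log = TRUE))
    optimise(negloglik, interval = c(1e-6, 1 - 1e-6))$minimum   # numerical MLE
    mean(x)                                             # closed-form MLE: the empirical proportion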

SLIDE 12

Sufficient statistic

Remark/definition

Any sample realisation (xi) of a rv X, with an unknown distribution parameterised by θ, contains information on θ. If a statistic summarises all the information the sample carries about θ, it is sufficient. In other words, "no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter" (Fisher, 1922). In mathematical terms: P(X = x | T = t, θ) = P(X = x | T = t).

Theorem (Fisher-Neyman)

T(X) is sufficient if there exist two functions g and h such that L(x; θ) = g(t; θ) h(x), with t = T(x).
Implication: in the context of MLE, two samples yielding the same value of T yield the same inferences about θ (the dependence on θ only comes through T).


SLIDE 13

Sufficient statistic

An example

Sufficiency of an estimator of a proportion

Quality control in a factory: n items are drawn with replacement to estimate p, the proportion of faulty items. Xi = 1 if item i is cracked, 0 otherwise. Show that the 'classical' estimator of p, (1/n) Σ_{i=1}^n Xi, is sufficient.

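One possible factorisation for this exercise, written as a LaTeX sketch (with t := Σᵢ xᵢ); it is one way to apply the Fisher-Neyman theorem, not necessarily the course's official solution.

    % Bernoulli likelihood factorised as g(t; p) * h(x), with t = sum_i x_i = n T(x)
    L(x;p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}
           = \underbrace{p^{\,t}(1-p)^{\,n-t}}_{g(t;\,p)} \times \underbrace{1}_{h(x)},
    \qquad t = \sum_{i=1}^{n} x_i = n\,T(x).

Since t is a bijective function of T(x) = (1/n) Σᵢ xᵢ, the empirical proportion is sufficient.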

SLIDE 14

Quantiles

Definition

The cumulative distribution function F (F(x) = ∫_{−∞}^{x} f(t) dt, with f the density of X) is a non-decreasing function R → [0; 1]. Its inverse F⁻¹ is called the quantile function. ∀β ∈ ]0; 1[, the β-quantile is defined by F⁻¹(β). In particular: P(X ≤ F⁻¹(β)) = β and P(X ≥ F⁻¹(β)) = 1 − β.

In practice, quantiles are either read from tables of F or F⁻¹ (old-fashioned) or computed with statistical software (qnorm, qbinom, qpois, qt, qchisq, etc. in R). The quantiles of the Gaussian distribution will (most of the time) be denoted zβ, those of the Student distribution tβ, and so on. By the way: what are the χ² and Student distributions?

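For concreteness, a few of the R quantile calls mentioned above:

    qnorm(0.975)          # z_{0.975}, approximately 1.96
    qt(0.975, df = 9)     # t_{9; 0.975}, Student quantile with 9 degrees of freedom
    qchisq(0.95, df = 9)  # chi-squared quantile with 9 degrees of freedom
    pnorm(qnorm(0.975))   # sanity check: P(X <= F^{-1}(beta)) = beta = 0.975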

SLIDE 15

Interval estimation

θ̂ is a point estimate of θ; even in favourable situations, it is very unlikely that θ̂ = θ exactly. How close is it? Could we build an interval that contains the true value of θ with, say, a high probability (low error)? It should not be too wide (to remain informative), but not too narrow either (so that the true value has a good chance of lying in it). Typically, a 1/√n-neighbourhood of θ̂ will do the job. Giving an interval with its associated error is much more useful than giving many digits of a point estimate.

Definition

  • 1. A confidence interval Îₙ is defined by a pair of estimators: Îₙ = [θ̂₁; θ̂₂].
  • 2. Its associated confidence level 1 − α (α ∈ [0; 1]) is such that P(θ ∈ Îₙ) ≥ 1 − α.
  • 3. Îₙ is asymptotically of level at least 1 − α if ∀ε > 0, ∃N_ε such that P(θ ∈ Îₙ) ≥ 1 − α − ε for n ≥ N_ε.


SLIDE 16

Confidence intervals you need to know

A partial typology

◮ Xi ∼ N(m, σ²) with σ² known: I(m) = [x̄ ± z_{1−α/2} σ/√n].
◮ When σ² is unknown, it becomes I(m) = [x̄ ± t_{n−1;1−α/2} s_{n−1}/√n], with s²_{n−1} := Σᵢ (xᵢ − x̄)² / (n−1) and t_{n−1;1−α/2} the quantile of a Student distribution with n − 1 degrees of freedom (df). Note that t_{n−1;1−α/2} ≃ z_{1−α/2} for large n. (This interval is illustrated in the R sketch at the end of this slide.)
◮ If Gaussianity is lost, we can only derive asymptotic confidence intervals.
◮ As for σ²: if m is known, Iα = [ n σ̂² / χ²_{n;1−α/2} ; n σ̂² / χ²_{n;α/2} ], with σ̂² the empirical variance computed with the known mean m.
◮ When m is unknown: Iα = [ (n−1) S²_{n−1} / χ²_{n−1;1−α/2} ; (n−1) S²_{n−1} / χ²_{n−1;α/2} ].
◮ Confidence interval for a proportion: exercises (if time permits).
◮ For other distributions: use the Cramér-Rao bound!

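A minimal R sketch of the second interval above (Gaussian mean, unknown variance) on hypothetical data, checked against t.test():

    set.seed(9)
    x <- rnorm(20, mean = 10, sd = 3)                   # hypothetical Gaussian sample
    n <- length(x); alpha <- 0.05
    mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)   # sd() uses n - 1
    t.test(x, conf.level = 1 - alpha)$conf.int          # the same interval computed by t.test()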

SLIDE 17

Next time

Multivariate descriptive statistics!

Some notions of (advanced) algebra will be needed, e.g. matrices, operations, inverse, rank, projection, metrics, scalar product, eigenvalues/vectors, matrix norm, matrix approximation...
