Beyond descriptive statistics



slide-1
SLIDE 1
slide-2
SLIDE 2

Beyond descriptive statistics

  • When we have a data set, we usually want to do more with the data than just describe them.

  • Keep in mind that data are information from a sample selected or generated from a population, and our goal is to make inferences about the population.

slide-3
SLIDE 3

Statistical inference

Statistical inference can be further subdivided into two main areas: estimation and hypothesis testing.

  • Estimation is concerned with estimating the values of specific population parameters (today’s lecture).

  • Hypothesis testing is concerned with testing whether the value of a population parameter is equal to some specific value (next lecture).

slide-4
SLIDE 4

Point estimation and interval estimation

  • Sometimes we are interested in obtaining specific values as estimates of our parameters (along with a measure of their precision). These values are referred to as point estimates.

  • Sometimes we want to specify a range within which the parameter values are likely to fall. If the range is narrow, then we may feel our point estimate is good. These are called interval estimates.

slide-5
SLIDE 5
  • Purpose of inference: make decisions about population characteristics when it is impractical to observe the whole population and we only have a sample of data drawn from the population.

From Sample to Population!

slide-6
SLIDE 6

Towards statistical inference

  • Parameter: a number describing the population
  • Statistic: a number describing a sample
  • Statistical inference: Statistic → Parameter


slide-7
SLIDE 7

Estimation of population mean

  • We have a sample $(x_1, x_2, \ldots, x_n)$ randomly drawn from a population.
  • The population mean $\mu$ and variance $\sigma^2$ are unknown.
  • Question: how can we use the observed sample $(x_1, x_2, \ldots, x_n)$ to estimate $\mu$ and $\sigma^2$?

slide-8
SLIDE 8

Point estimator of population mean and variance

  • A natural estimator of the population mean µ is the sample mean

    $\bar{x} = \sum_{i=1}^{n} x_i / n$

  • A natural estimator of the population standard deviation σ is the sample standard deviation

    $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

slide-9
SLIDE 9

Point estimator of population mean

  • A natural estimator of the population mean µ is the sample mean $\bar{x} = \sum_{i=1}^{n} x_i / n$.
  • Question: How good is this estimate?

slide-10
SLIDE 10

Point estimator of population mean

  • A natural estimator of the population mean µ is the sample mean $\bar{x} = \sum_{i=1}^{n} x_i / n$.
  • Question: How good is this estimate?
  • We would like $\bar{x}$ to be close to µ; however, we don’t know the value of µ.
  • We need to study the distribution of $\bar{X}$.

slide-11
SLIDE 11

Sampling distribution of sample mean

  • To understand what properties of $\bar{X}$ make it a desirable estimator for µ, we need to forget about our particular sample for the moment and consider all possible samples of size n that could have been selected from the population.
  • The values of $\bar{X}$ in different samples will be different. These values will be denoted by $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$
  • The sampling distribution of $\bar{X}$ is the distribution of values $\bar{x}$ over all possible samples of size n that could have been selected from the study population.
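As a small illustration of the idea of a sampling distribution, the sketch below enumerates every possible sample of size n = 2 (drawn with replacement) from a tiny population and tabulates the resulting sample means. The population values and the helper name `sampling_distribution` are made up for illustration; they do not come from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sampling_distribution(population, n):
    """Exact distribution of the sample mean over all ordered samples of
    size n drawn with replacement (every such sample is equally likely)."""
    counts = Counter()
    for sample in product(population, repeat=n):
        counts[Fraction(sum(sample), n)] += 1
    total = len(population) ** n
    return {mean: Fraction(c, total) for mean, c in sorted(counts.items())}

# Hypothetical population of four values (illustrative only).
population = [2, 4, 6, 8]
dist = sampling_distribution(population, n=2)

for mean, prob in dist.items():
    print(f"x-bar = {float(mean):.1f}  probability = {float(prob):.4f}")

# Average of the sample means over all possible samples vs. population mean.
mean_of_means = sum(m * p for m, p in dist.items())
population_mean = Fraction(sum(population), len(population))
print(float(mean_of_means), float(population_mean))
```

Exact fractions are used so that the comparison between the average of the sample means and the population mean is not blurred by rounding.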

slide-12
SLIDE 12

Population Mean


Research question: center of a population

slide-13
SLIDE 13

Sample is representative of the population.

Population mean µ
Random sample 1 → $\bar{x}_1$
Random sample 2 → $\bar{x}_2$
. . .
Random sample K → $\bar{x}_K$

Research question: center of a population

slide-14
SLIDE 14

The selection of a random sample is A RANDOM EXPERIMENT. → $\bar{X}$ is a RANDOM VARIABLE. → $\bar{x}_1, \ldots, \bar{x}_K$ are observed values of $\bar{X}$.

Population mean µ
Random sample 1 → $\bar{x}_1$
Random sample 2 → $\bar{x}_2$
. . .
Random sample K → $\bar{x}_K$

Research question: center of a population

slide-15
SLIDE 15

An example of sampling distribution


slide-16
SLIDE 16

Sample mean is an unbiased estimator of population mean

  • We can show that the average of these sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ (over all possible samples) is equal to the population mean µ.
  • Unbiasedness: Let X1, X2, …, Xn be a random sample drawn from some population with mean µ. Then

    $E(\bar{X}) = \mu$
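The unbiasedness statement follows in one line from the linearity of expectation. A short derivation, assuming only that each $X_i$ has mean $\mu$:

```latex
E(\bar{X})
  = E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
  = \frac{1}{n}\cdot n\mu
  = \mu
```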

slide-17
SLIDE 17

$\bar{X}$ is the minimum variance unbiased estimator of µ

  • The unbiasedness of the sample mean is not a sufficient reason to use it as an estimator of µ.
  • There are many other unbiased estimators, such as the sample median and the average of the minimum and maximum (both are unbiased when the population is symmetric, e.g. normal).
  • We can show (but not here) that, when the underlying population is normal, the sample mean $\bar{X}$ has the smallest variance among all unbiased estimators.
  • Now, what is the variance of the sample mean $\bar{X}$?

slide-18
SLIDE 18

Standard error (SE) of the sample mean

  • The variance of the sample mean $\bar{X}$ measures the precision of the estimate:

    $\mathrm{Var}(\bar{X}) = \sigma^2 / n$

    $SE(\bar{X}) = \sigma / \sqrt{n}$

  • $\sigma^2$ is the population variance.
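For readers who want the missing step, the variance formula can be derived as follows. The sketch assumes the $X_i$ are independent with common variance $\sigma^2$ (the usual random-sample assumption):

```latex
\mathrm{Var}(\bar{X})
  = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)
  = \frac{n\sigma^2}{n^2}
  = \frac{\sigma^2}{n},
\qquad
SE(\bar{X}) = \sqrt{\mathrm{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
```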

slide-19
SLIDE 19

Use $s/\sqrt{n}$ to estimate $\sigma/\sqrt{n}$

  • In practice, the population variance $\sigma^2$ is rarely known, and the sample variance $s^2$ is a reasonable estimator for $\sigma^2$.
  • Therefore, the standard error of the mean $\sigma/\sqrt{n}$ can be estimated by $s/\sqrt{n}$ (recall that $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$).

NOTE: The larger the sample size, the smaller the standard error, and the more accurate the estimate.

slide-20
SLIDE 20

An example of standard error

  • A sample of 10 birthweights: 97, 125, 62, 120, 132, 135, 118, 137, 126, 118 (sample mean $\bar{x} = 117.00$ and sample standard deviation $s = 22.44$).
  • To estimate the population mean µ, a point estimate is the sample mean $\bar{x} = 117.00$, with standard error given by

    $SE = s/\sqrt{n} = 22.44/\sqrt{10} = 7.09$
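The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
import math
import statistics

birthweights = [97, 125, 62, 120, 132, 135, 118, 137, 126, 118]

n = len(birthweights)
x_bar = statistics.mean(birthweights)   # point estimate of the population mean
s = statistics.stdev(birthweights)      # sample SD (divides by n - 1)
se = s / math.sqrt(n)                   # estimated standard error of the mean

print(f"n = {n}, x-bar = {x_bar:.2f}, s = {s:.2f}, SE = {se:.2f}")
```

Note that `statistics.stdev` uses the n - 1 denominator, matching the definition of s in the slides, whereas `statistics.pstdev` would divide by n.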

slide-21
SLIDE 21

Summary of sampling distribution of $\bar{X}$

  • Let X1, …, Xn be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Then the mean and variance of $\bar{X}$ are $\mu$ and $\sigma^2/n$, respectively.
  • Furthermore, if X1, …, Xn is a random sample from a normal population with mean $\mu$ and variance $\sigma^2$, then by the properties of linear combinations, $\bar{X}$ is also normally distributed, that is,

    $\bar{X} \sim N(\mu, \sigma^2/n)$

  • Now the question is: if the population is NOT normal, what is the distribution of $\bar{X}$?

slide-22
SLIDE 22

The Central Limit Theorem

  • Let X1, X2, …, Xn denote n independent random variables sampled from some population with mean $\mu$ and variance $\sigma^2$.
  • When n is large, the sampling distribution of the sample mean is approximately normal even if the underlying population is not normal:

    $\bar{X} \approx N(\mu, \sigma^2/n)$

  • By standardization:

    $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$
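A quick simulation can make the theorem concrete. The sketch below is an illustration under assumed settings (an Exponential(1) population, n = 50, 5000 repetitions, a fixed seed), none of which come from the slides: it standardizes each simulated sample mean and checks how often it lands in (-1.96, 1.96).

```python
import math
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable

n = 50                          # size of each sample
repeats = 5000                  # number of simulated samples
mu, sigma = 1.0, 1.0            # mean and SD of an Exponential(1) population

z_values = []
for _ in range(repeats):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    # Standardize the sample mean as in the CLT statement.
    z_values.append((x_bar - mu) / (sigma / math.sqrt(n)))

inside = sum(1 for z in z_values if -1.96 < z < 1.96) / repeats
z_mean = sum(z_values) / repeats
z_sd = statistics.stdev(z_values)

print(f"mean of Z = {z_mean:.3f}")
print(f"SD of Z   = {z_sd:.3f}")
print(f"share of Z in (-1.96, 1.96) = {inside:.3f}")
```

Even though the exponential population is strongly skewed, the standardized means should have mean near 0, SD near 1, and roughly 95% of their values inside (-1.96, 1.96).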

slide-23
SLIDE 23

Illustration of Central limit Theorem (CLT)


slide-24
SLIDE 24

Interval estimation

  • Let X1, X2, …, Xn denote n independent random variables sampled from some population with mean $\mu$ and variance $\sigma^2$.
  • Our goal is to estimate µ. We know that $\bar{X}$ is a good point estimate.
  • Now we want a confidence interval

    $(\bar{X} - a, \bar{X} + a) = \bar{X} \pm a$

    such that a 95% confidence interval (CI) for µ satisfies the following:

    $\Pr(\bar{X} - a < \mu < \bar{X} + a) = 95\%$

slide-25
SLIDE 25

Interval estimation

  • By the CLT we have $\bar{X} \sim N(\mu, \sigma^2/n)$. Using the Z-transformation, we have

    $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

  • Thus, for a standard normal variable, what is the width of the interval within which 95% of the data are covered?

    Pr(-1 < Z < 1) = 0.6827
    Pr(-1.96 < Z < 1.96) = 0.95
    Pr(-2.58 < Z < 2.58) = 0.99
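The three probabilities quoted on this slide can be reproduced with Python's standard library. A minimal sketch; the helper name `central_probability` is introduced here for illustration.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal distribution

def central_probability(c):
    """Pr(-c < Z < c) for a standard normal Z."""
    return Z.cdf(c) - Z.cdf(-c)

for c in (1.0, 1.96, 2.58):
    print(f"Pr(-{c} < Z < {c}) = {central_probability(c):.4f}")
```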

slide-26
SLIDE 26

Interval estimation

  • Thus we have

    $0.95 = P(-1.96 \le Z \le 1.96) = P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right)$

    However, we don’t know σ, so we replace σ with s, the sample standard deviation:

    $0.95 \approx P\left(-1.96 \le \frac{\bar{X} - \mu}{s/\sqrt{n}} \le 1.96\right)$

    The term $s/\sqrt{n}$ is called the SAMPLE STANDARD ERROR, denoted $SE(\bar{X})$. Thus the 95% confidence interval for µ is given by:

    $0.95 \approx P(\bar{X} - 1.96\,SE(\bar{X}) \le \mu \le \bar{X} + 1.96\,SE(\bar{X}))$

slide-27
SLIDE 27

Ex.

  • Assume that the n = 92 students were a simple random sample of all NYU students. The sample mean weight is $\bar{x} = 145.2$ lbs, and the sample standard deviation is s = 23.7 lbs. What is the 95% confidence interval for the mean weight of NYU students?

slide-28
SLIDE 28

Ex.

  • Assume that the n = 92 students were a simple random sample of all NYU students. The sample mean weight is $\bar{x} = 145.2$ lbs, and the sample standard deviation is s = 23.7 lbs. What is the 95% confidence interval for the mean weight of NYU students?
  • Answer: First of all, the standard error is

    $SE(\bar{X}) = 23.7/\sqrt{92} = 2.47$

  • The 95% CI is

    $\bar{x} \pm 1.96 \cdot SE(\bar{X}) = 145.2 \pm (1.96)(2.47) = 145.2 \pm 4.8$ lbs = [140.4 lbs, 150.0 lbs]
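The same calculation, written out as a short script. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
import math

n = 92          # sample size
x_bar = 145.2   # sample mean weight (lbs)
s = 23.7        # sample standard deviation (lbs)
z = 1.96        # critical value for 95% confidence

se = s / math.sqrt(n)          # estimated standard error of the mean
margin = z * se                # half-width of the interval
lower, upper = x_bar - margin, x_bar + margin

print(f"SE = {se:.2f}, margin = {margin:.1f}")
print(f"95% CI: [{lower:.1f}, {upper:.1f}] lbs")
```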

slide-29
SLIDE 29

Interval estimation for an arbitrary level of confidence 1-α

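$z_{\alpha/2}$ is the Z critical value for level α/2:

$P(Z \le z_{\alpha/2}) = \alpha/2$

In particular,

$P(Z \le z_{0.025}) = 0.025$

[Figure: standard normal density with area 1-α between $z_{\alpha/2}$ and $z_{1-\alpha/2}$, and area α/2 in each tail]

EXCEL: NORM.INV(probability, mean, sd) = critical value, e.g. NORM.INV(0.025, 0, 1) = -1.96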

slide-30
SLIDE 30

Z critical value table

1-α       0.80   0.90   0.95    0.99
α         0.20   0.10   0.05    0.01
α/2       0.10   0.05   0.025   0.005
Z1-α/2    1.28   1.64   1.96    2.58

E.G. For a 0.99 level of confidence, go out 2.58 standard errors from the sample mean.
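The critical values in the table can be regenerated with the inverse normal CDF in Python's standard library, which plays the same role as Excel's NORM.INV. A minimal sketch; `z_critical` is a helper name introduced here.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{1-alpha/2} for a two-sided interval with the given confidence."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for confidence in (0.80, 0.90, 0.95, 0.99):
    print(f"1-alpha = {confidence:.2f}  z = {z_critical(confidence):.2f}")

# Counterpart of NORM.INV(0.025, 0, 1): the lower-tail critical value.
lower_tail = NormalDist(0, 1).inv_cdf(0.025)
print(f"{lower_tail:.2f}")
```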

slide-31
SLIDE 31

Confidence interval (when n is large)

  • Confidence interval for the mean of a normal distribution (large sample case)
  • A 100%×(1-α) CI for the mean µ of a normal distribution with unknown variance is given by

    $(\bar{x} - z_{1-\alpha/2}\,SE(\bar{X}),\ \bar{x} + z_{1-\alpha/2}\,SE(\bar{X}))$, where $SE(\bar{X}) = s/\sqrt{n}$

    A shorthand notation for the CI is

    $\bar{x} \pm z_{1-\alpha/2}\,SE(\bar{X})$

slide-32
SLIDE 32

Motivation for t-distribution

  • $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ has an approximately normal distribution only when it is computed using a large sample.
  • For small n (e.g. n = 5, 10, 25, up to about 30), this is not true. When the sample size is small, we need to use Student’s t distribution.

slide-33
SLIDE 33

T-distribution

  • If X1, …, Xn ~ N(µ, σ²) and are independent, then the new random variable

    $t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$, where $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

    follows what is called the t-distribution with n-1 degrees of freedom, denoted $t_{n-1}$.

slide-34
SLIDE 34

Normal density and t densities


slide-35
SLIDE 35

Comparison of normal and t distributions

If n is the sample size, the corresponding t distribution has n-1 degrees of freedom.

Comparison of the 97.5th percentile of the t distribution and the normal distribution:

d     td,.975   Z.975
4     2.776     1.960
9     2.262     1.960
29    2.045     1.960
60    2.000     1.960
∞     1.960     1.960

The larger the degrees of freedom, the closer the t distribution is to the standard normal distribution.

The t distribution is more spread out than the normal distribution, so td,.975 is farther from 0 than Z.975.
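The t percentiles in the comparison can be reproduced numerically. The sketch below deliberately avoids third-party libraries: it integrates the t density with Simpson's rule and inverts the CDF by bisection. The function names, step count, and search bounds are choices made here for illustration, not part of the slides; a statistics package would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=400):
    """P(T <= x), using symmetry plus Simpson's rule on [0, |x|]."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3          # integral of the density from 0 to |x|
    return 0.5 + half if x >= 0 else 0.5 - half

def t_quantile(p, df, lo=-50.0, hi=50.0):
    """Invert t_cdf by bisection (assumes the quantile lies in [lo, hi])."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for d in (4, 9, 29, 60):
    print(f"d = {d:2d}  t(d, .975) = {t_quantile(0.975, d):.3f}")
```

As the degrees of freedom grow, the printed values shrink toward the normal value 1.960, which is the pattern the comparison describes.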

slide-36
SLIDE 36

[Figure: t density with central area 100%×(1-α) between $t_{n-1,\alpha/2} = -t_{n-1,1-\alpha/2}$ and $t_{n-1,1-\alpha/2}$, and area α/2 in each tail]

  • Define the critical values $t_{n-1,1-\alpha/2}$ and $-t_{n-1,1-\alpha/2}$ as follows:

    $P(t_{n-1} > t_{n-1,1-\alpha/2}) = \alpha/2$ and $P(t_{n-1} < -t_{n-1,1-\alpha/2}) = \alpha/2$

EXCEL: T.INV(probability, degrees of freedom) = critical value, e.g. T.INV(0.025, 60) = -2.000

slide-37
SLIDE 37

Confidence interval

  • Confidence interval for the mean of a normal distribution
  • A 100%×(1-α) CI for the mean µ of a normal distribution with unknown variance is given by

    $(\bar{x} - t_{n-1,1-\alpha/2}\,SE(\bar{X}),\ \bar{x} + t_{n-1,1-\alpha/2}\,SE(\bar{X}))$, where $SE(\bar{X}) = s/\sqrt{n}$

    A shorthand notation for the CI is

    $\bar{x} \pm t_{n-1,1-\alpha/2}\,SE(\bar{X})$
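As an illustration of the t-based recipe, the sketch below applies it to the 10 birthweights from the earlier standard error example. This application is not worked in the slides; it simply combines that sample with the tabulated critical value t(9, .975) = 2.262 from the comparison of t and normal percentiles.

```python
import math
import statistics

birthweights = [97, 125, 62, 120, 132, 135, 118, 137, 126, 118]
t_crit = 2.262   # t(9, .975), taken from the slides' comparison table

n = len(birthweights)
x_bar = statistics.mean(birthweights)
se = statistics.stdev(birthweights) / math.sqrt(n)   # s / sqrt(n)

lower = x_bar - t_crit * se
upper = x_bar + t_crit * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

With n = 10 the t critical value (2.262) is noticeably larger than 1.96, so the interval is wider than the large-sample normal recipe would give.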

slide-38
SLIDE 38

Factors affecting the length of a CI

The length of a 100%×(1-α) confidence interval for µ equals $2\,t_{n-1,1-\alpha/2}\,s/\sqrt{n}$ and is determined by n, s and α.

  • n: As the sample size increases, the length of the CI decreases.
  • s: As the standard deviation s, which reflects the variability of the distribution of individual observations, increases, the length of the CI increases.
  • α: As the confidence desired increases (α decreases), the length of the CI increases.

slide-39
SLIDE 39

Estimating the population proportion

Setting:

  • X follows a binomial distribution Bin(n, p)
  • Intuition:
  • Parameter p is the unknown population proportion of success
  • We consider the following sample proportion of success:

    $\hat{p} = X/n$

slide-40
SLIDE 40

An example of estimating p

  • Estimating the prevalence of malignant melanoma in 45- to 54-year-old women in the US. Suppose a random sample of 5000 women is selected from this age group, of whom 28 are found to have the disease.
  • Suppose the prevalence of the disease (population proportion of the disease) in this age group is p. How can p be estimated?

slide-41
SLIDE 41

Cancer example continued

  • Let X be the number of women with the disease among the n women.
  • X can be viewed as a binomial random variable with parameters n and p.
  • We have E(X) = np and Var(X) = npq, where q = 1-p.

slide-42
SLIDE 42

Properties of sample proportion

  • Sample proportion of the disease among n women:

    $\hat{p} = X/n$

  • Its mean:

    $E(\hat{p}) = E(X)/n = p$

    Thus, $\hat{p}$ is an unbiased estimator for p.

  • Its variance and standard deviation:

    $\mathrm{Var}(\hat{p}) = \mathrm{Var}(X)/n^2 = p(1-p)/n$

    $SD(\hat{p}) = \sqrt{p(1-p)/n}$

    Replacing p by $\hat{p}$, the standard error of the sample proportion is

    $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$

slide-43
SLIDE 43

An example of point estimate of p

  • (Cancer example continued): Estimate the prevalence of malignant melanoma.
  • Solution:

    $\hat{p} = 28/5000 = 0.0056$

    $SE(\hat{p}) = \sqrt{0.0056(0.9944)/5000} = 0.0011$
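The point estimate and its standard error can be verified directly. A minimal sketch of the arithmetic, with illustrative variable names:

```python
import math

x = 28       # women found to have the disease
n = 5000     # women sampled

p_hat = x / n                               # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error

print(f"p-hat = {p_hat:.4f}, SE = {se:.4f}")
```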

slide-44
SLIDE 44

Why sample proportion is best estimator for p

  • Actually, the sample proportion is also a sample mean. Writing $X_i = 1$ if the i-th individual is a success and $X_i = 0$ otherwise,

    $\hat{p} = X/n = \sum_{i=1}^{n} X_i / n$

  • Actually, the population proportion p is also a population mean:

    $\mu = E(X_i) = p$

  • Recall that the sample mean is the minimum variance unbiased estimate of the population mean.
  • Therefore, the sample proportion $\hat{p}$ is also the minimum variance unbiased estimate of the population proportion.

slide-45
SLIDE 45

Interval estimation for p

  • EX. Suppose we are interested in estimating the prevalence rate of breast cancer among 50- to 54-year-old women whose mothers have had breast cancer. Suppose that in a random sample of 10,000 such women, 400 are found to have had breast cancer at some point in their lives. The best point estimate of the prevalence rate p is given by the sample proportion

    $\hat{p} = 400/10000 = 0.04$, with $\sqrt{\mathrm{Var}(\hat{p})} = \sqrt{0.04(0.96)/10{,}000} = 0.002$

  • How can an interval estimate of p be obtained?

slide-46
SLIDE 46

Central limit theorem plays an important role

Since

$E(\hat{p}) = p, \quad \mathrm{Var}(\hat{p}) = pq/n$

and, as discussed earlier, the sample proportion is a special case of the sample mean, when the sample size n is large the central limit theorem gives

$\hat{p} \approx N(p, pq/n)$

slide-47
SLIDE 47

CI for p – Normal theory method

  • An approximate 100%×(1-α) CI for the binomial parameter p based on the normal approximation is given by

    $(\hat{p} - z_{1-\alpha/2}\,SE(\hat{p}),\ \hat{p} + z_{1-\alpha/2}\,SE(\hat{p})) = \hat{p} \pm z_{1-\alpha/2}\,SE(\hat{p})$

    where $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$ and $\hat{q} = 1 - \hat{p}$.

slide-48
SLIDE 48

An example of CI for p

Breast cancer example continued

We have

$\hat{p} = 0.04, \quad \alpha = 0.05, \quad z_{1-\alpha/2} = 1.96, \quad n = 10{,}000$

Therefore, the 95% CI is given by

$(0.04 - 1.96\sqrt{0.04(0.96)/10{,}000},\ 0.04 + 1.96\sqrt{0.04(0.96)/10{,}000}) = (0.04 - 0.004,\ 0.04 + 0.004) = (0.036, 0.044)$
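The normal-theory interval for a proportion can be wrapped in a small helper. This is a sketch under the large-sample assumption stated in the slides; the function name `proportion_ci` is introduced here for illustration.

```python
import math
from statistics import NormalDist

def proportion_ci(x, n, confidence=0.95):
    """Normal-approximation CI for a binomial proportion (assumes large n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{1-alpha/2}
    return p_hat - z * se, p_hat + z * se

lower, upper = proportion_ci(400, 10_000)
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```

Passing a different `confidence` (for example 0.99) widens the interval, since the critical value grows from 1.96 to 2.58.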

slide-49
SLIDE 49

Summary– three recipes for CI

  • Proportion (large sample): $\hat{p} \pm z_{1-\alpha/2}\,SE(\hat{p})$
  • Large sample population mean: $\bar{x} \pm z_{1-\alpha/2}\,SE(\bar{X})$
  • Small sample population mean: $\bar{x} \pm t_{n-1,1-\alpha/2}\,SE(\bar{X})$
  • Each of those SEs is proportional to the magic number $1/\sqrt{n}$