
Course 02402 Introduction to Statistics Lecture 10: Simulation based statistical methods Per Bruun Brockhoff

DTU Informatics Building 305 - room 110 Danish Technical University 2800 Lyngby – Denmark e-mail: pbb@imm.dtu.dk

Per Bruun Brockhoff (pbb@imm.dtu.dk) Introduction to Statistics, Lecture 10 Fall 2012 1 / 27

Overview

1. Introduction to simulation
     Example 1
2. Propagation of error
     Example 1, cont.
3. Confidence intervals using simulation: Bootstrapping
     Example 2, one-sample
     Two-sample situation
     Example 3
4. Hypothesis testing using simulation
     By bootstrap confidence intervals
     One-sample setup, Example 2, cont.
     Hypothesis testing using permutation tests
     Two-sample setup, Example 3, cont.

Introduction to simulation

Motivation

Table 8.1 has a "missing link": small samples that are NOT from a normal distribution.
In the old days: non-parametric tests, e.g. chapter 14.
More common now: simulation-based statistics:
  • Confidence intervals are much easier to obtain
  • They are much easier to apply in more complicated situations
  • They better reflect today's reality: they are simply used in many contexts now

Requires: use of a computer. R is a super tool for this!

What is simulation really?

  • (Pseudo) random numbers are generated by a computer
  • A random number generator is an algorithm that can generate x_{i+1} from x_i
  • The resulting sequence of numbers appears random
  • It requires a "start" called a "seed" (e.g. taken from the computer clock)
  • Basically the uniform distribution is simulated in this way, and then:

If U ∼ Uniform(0, 1) and F is the distribution function of any probability distribution, then F^{-1}(U) follows the distribution given by F.
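The inverse-transform idea can be tried directly in R. The following is a minimal sketch, not part of the original slides: the exponential distribution and the rate value 2 are chosen here purely for illustration, and the seed is fixed only to make the run reproducible.

```r
# Inverse-transform sampling: turn Uniform(0, 1) draws into exponential draws.
set.seed(2402)                    # fixed seed (illustrative choice)
k <- 10000
u <- runif(k)                     # U ~ Uniform(0, 1)
rate <- 2                         # illustrative rate parameter (an assumption)
x <- qexp(u, rate = rate)         # F^{-1}(U), with F the Exp(rate) distribution function
x_manual <- -log(1 - u) / rate    # the same inverse written out by hand
all.equal(x, x_manual)            # the two versions agree up to rounding
mean(x)                           # should be close to 1 / rate = 0.5
```

In practice one would simply call rexp(k, rate); the sketch only makes the F^{-1}(U) construction visible.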


In practice in R

The following (02402 relevant) distributions are ready for simulation:

  rbinom   Binomial distribution
  rpois    Poisson distribution
  rhyper   Hypergeometric distribution
  rnorm    Normal distribution
  rlnorm   Log-normal distribution
  rexp     Exponential distribution
  runif    Uniform distribution
  rt       t-distribution
  rchisq   χ2-distribution
  rf       F-distribution
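As a small illustration of how these functions are called and of what the seed does, here is a sketch with arbitrarily chosen parameter values (they are assumptions, not taken from the slides). Resetting the same seed reproduces exactly the same "random" numbers.

```r
set.seed(1)                                  # arbitrary seed, chosen for illustration
z <- rnorm(3, mean = 2, sd = 0.1)            # three draws from N(2, 0.1^2)
b <- rbinom(3, size = 10, prob = 0.5)        # three draws from Binomial(10, 0.5)

set.seed(1)
u_first <- runif(5)                          # five Uniform(0, 1) draws
set.seed(1)
u_again <- runif(5)                          # same seed, so the same five numbers
identical(u_first, u_again)                  # TRUE: the generator is deterministic given the seed
```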

Example 1

A company produces rectangular plates. The length of the plates (in meters), X, is assumed to follow a normal distribution N(2, 0.1^2), and the width of the plates (in meters), Y, is assumed to follow a normal distribution N(3, 0.2^2). We are interested in the area of the plates, which of course is given by A = XY.
  • What is the mean area?
  • What is the standard deviation of the area from plate to plate?
  • How often do such plates have an area that differs by more than 0.1 m^2 from the targeted 6 m^2?
  • What is the probability of other events?
  • Generally: what is the probability distribution of the random variable A?

Example 1, solution in R

Code:
    k = 10000                    # number of simulated plates
    X = rnorm(k, 2, 0.1)         # lengths, N(2, 0.1^2)
    Y = rnorm(k, 3, 0.2)         # widths, N(3, 0.2^2)
    A = X * Y                    # areas
    mean(A)                      # simulated mean area
    sd(A)                        # simulated standard deviation of the area
    sum(abs(A - 6) > 0.1) / k    # share of plates more than 0.1 m^2 off target

Result:
    > mean(A)
    [1] 5.999061
    > sd(A)
    [1] 0.5030009
    > sum(abs(A - 6) > 0.1)/k
    [1] 0.8462

Propagation of error

Must be able to find:
\[ \sigma^2_{f(X_1,\ldots,X_n)} = \mathrm{Var}\left(f(X_1,\ldots,X_n)\right) \]

We already know:
\[ \sigma^2_{f(X_1,\ldots,X_n)} = \sum_{i=1}^{n} a_i^2 \sigma_i^2, \quad \text{if } f(X_1,\ldots,X_n) = \sum_{i=1}^{n} a_i X_i \]

New rule for non-linear functions:
\[ \sigma^2_{f(X_1,\ldots,X_n)} \approx \sum_{i=1}^{n} \left(\frac{\partial f}{\partial X_i}\right)^2 \sigma_i^2 \]
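The rule for non-linear functions can be evaluated in R with the base function D(), which differentiates simple expressions symbolically. This is only a sketch: it uses f(x, y) = xy with the means and standard deviations from Example 1 as the evaluation point, which is a choice made here for illustration.

```r
f <- quote(x * y)                  # the function f, here the plate area A = XY
dfdx <- D(f, "x")                  # partial derivative with respect to x (gives y)
dfdy <- D(f, "y")                  # partial derivative with respect to y (gives x)

vals <- list(x = 2, y = 3)         # evaluation point: mean length and mean width
sx <- 0.1                          # standard deviation of X
sy <- 0.2                          # standard deviation of Y

# Error propagation law: sum of (partial derivative)^2 * variance
var_approx <- eval(dfdx, vals)^2 * sx^2 + eval(dfdy, vals)^2 * sy^2
var_approx                         # 3^2 * 0.1^2 + 2^2 * 0.2^2 = 0.25
sqrt(var_approx)                   # approximate standard deviation, 0.5
```

The approximate standard deviation of 0.5 is close to the simulated value of about 0.503 found in Example 1.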


Or by simulation:
  • Simulate k outcomes of all n measurements as $N(X_i, \sigma_i^2)$: $X_i^{(j)}$, $j = 1, \ldots, k$
  • Calculate the standard deviation directly as the observed standard deviation of the k values for f:

\[ \sigma_{f(X_1,\ldots,X_n)} = \sqrt{\frac{1}{k-1} \sum_{j=1}^{k} \left(f_j - \bar{f}\right)^2}, \qquad f_j = f\left(X_1^{(j)}, \ldots, X_n^{(j)}\right) \]

Example 1, cont.

We already used the simulation method in the first part of the example. Now we are given two specific measurements of X and Y: x = 2.05 m and y = 2.99 m. What is the variance of A = 2.05 × 2.99 = 6.13 using the error propagation law?
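One way to answer this, and to compare with the simulation approach, is sketched below. It assumes that the measurement standard deviations are the 0.1 and 0.2 from Example 1; the seed and the variable names are illustrative choices.

```r
x <- 2.05; y <- 2.99               # the two specific measurements
sx <- 0.1; sy <- 0.2               # assumed measurement standard deviations (from Example 1)

# Error propagation law for A = xy: (dA/dx)^2 sx^2 + (dA/dy)^2 sy^2
var_law <- y^2 * sx^2 + x^2 * sy^2
var_law                            # 2.99^2 * 0.01 + 2.05^2 * 0.04 = 0.257501

# Simulation alternative: simulate around the measured values
set.seed(2402)                     # illustrative seed
k <- 10000
Xs <- rnorm(k, x, sx)
Ys <- rnorm(k, y, sy)
var_sim <- var(Xs * Ys)            # observed variance of the k simulated areas
c(law = var_law, simulation = var_sim)
```

The two numbers should be close but not identical: the law is a first-order approximation, and the simulation carries Monte Carlo noise.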

Actually one can deduce the variance of A theoretically (using that X and Y are independent):

\begin{align*}
\mathrm{Var}(XY) &= E\left[(XY)^2\right] - \left[E(XY)\right]^2 \\
&= E(X^2)E(Y^2) - E(X)^2 E(Y)^2 \\
&= \left[\mathrm{Var}(X) + E(X)^2\right]\left[\mathrm{Var}(Y) + E(Y)^2\right] - E(X)^2 E(Y)^2 \\
&= \mathrm{Var}(X)\mathrm{Var}(Y) + \mathrm{Var}(X)E(Y)^2 + \mathrm{Var}(Y)E(X)^2 \\
&= 0.1^2 \times 0.2^2 + 0.1^2 \times 3^2 + 0.2^2 \times 2^2 \\
&= 0.0004 + 0.09 + 0.16 = 0.2504
\end{align*}

Confidence intervals using simulation: Bootstrapping

What to do with a small sample size (n < 30), and NO assumption of a normal distribution? Two possible solutions:
  1. Find/identify/assume a different and more suitable distribution for the population ("the system")
  2. Do not assume any distribution whatsoever

Bootstrapping exists in two versions:
  1. Parametric bootstrap: Simulate multiple samples from the assumed distribution.
  2. Non-parametric bootstrap: Simulate multiple samples directly from the data.
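The slides only work through the non-parametric version in detail, so here is a rough sketch of what a parametric bootstrap could look like. Everything specific in it is an assumption made for illustration: the data are hypothetical (not from the lecture), and the exponential distribution is simply assumed as the population model.

```r
# Hypothetical data (NOT from the lecture): 10 waiting times
x <- c(2.1, 0.4, 5.3, 1.7, 0.9, 3.8, 0.2, 7.5, 1.1, 2.6)
n <- length(x)
rate_hat <- 1 / mean(x)            # fitted rate under the ASSUMED exponential model

set.seed(2402)                     # illustrative seed
k <- 10000
# Simulate k samples of size n from the fitted distribution (n x k matrix)
simsamples <- replicate(k, rexp(n, rate = rate_hat))
simmeans <- apply(simsamples, 2, mean)
ci <- quantile(simmeans, c(0.025, 0.975))   # 95% parametric bootstrap interval for the mean
ci
```

The only difference from the non-parametric version is the sampling step: rexp() draws from the assumed distribution, where sample(x, replace = TRUE) would draw from the data themselves.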


Non-parametric bootstrap for the one-sample situation

Data: $x_1, \ldots, x_n$. 100(1 − α)% confidence interval for µ:

  • Simulate k samples of size n by randomly sampling among the available data (with replacement; large k, e.g. k > 1000)
  • Calculate the average in each of the k samples: $\bar{x}^*_1, \ldots, \bar{x}^*_k$
  • Calculate the 100α/2% and 100(1 − α/2)% percentiles for these
  • The confidence interval is:

\[ \left[ \text{quantile}_{100\alpha/2\%},\ \text{quantile}_{100(1-\alpha/2)\%} \right] \]

Example 2, one-sample

In a study, women's cigarette consumption before and after giving birth was explored. The following numbers of cigarettes smoked per day were observed:

  before  after    before  after
     8       5       13      15
    24      11       15      19
     7       0       11      12
    20      15       22       0
     6       0       15       6
    20      20

Example 2, solution in R

Data:
    x1 = c(8,24,7,20,6,20,13,15,11,22,15)    # before
    x2 = c(5,11,0,15,0,20,15,19,12,0,6)      # after
    dif = x1 - x2                            # paired differences

R-Method 1:
    k = 10000
    mysamples = replicate(k, sample(dif, replace = TRUE))   # one bootstrap sample per column
    mymeans = apply(mysamples, 2, mean)                     # mean of each bootstrap sample
    quantile(mymeans, c(0.025,0.975))                       # 95% confidence interval

R-Method 2 (first install the package "bootstrap"):
    library(bootstrap)
    quantile(bootstrap(dif, k, mean)$thetastar, c(0.025,0.975))
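The four steps of the one-sample non-parametric bootstrap can also be wrapped in a small function with a general α. This is a sketch only: the function name boot_ci_mean and its arguments are hypothetical helpers introduced here, not part of the slides or of any package.

```r
x1 <- c(8,24,7,20,6,20,13,15,11,22,15)    # before
x2 <- c(5,11,0,15,0,20,15,19,12,0,6)      # after
dif <- x1 - x2

# Hypothetical helper: percentile bootstrap interval for the mean
boot_ci_mean <- function(x, k = 10000, alpha = 0.05) {
  # mean of each of k resamples drawn with replacement from x
  means <- replicate(k, mean(sample(x, replace = TRUE)))
  # lower and upper percentiles give the 100(1 - alpha)% interval
  quantile(means, c(alpha / 2, 1 - alpha / 2))
}

set.seed(2402)                            # illustrative seed
ci95 <- boot_ci_mean(dif)                 # 95% interval
ci99 <- boot_ci_mean(dif, alpha = 0.01)   # 99% interval, should be wider
ci95
ci99
```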

Two-sample situation

Data: $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$. 100(1 − α)% confidence interval for µ1 − µ2:

  • Simulate k sets of 2 samples of size n1 and n2 by sampling randomly from the respective groups (with replacement; large k, e.g. k > 1000)
  • Calculate the difference between the averages for each of the k sample pairs: $\bar{x}^*_1 - \bar{y}^*_1, \ldots, \bar{x}^*_k - \bar{y}^*_k$
  • Calculate the 100α/2% and 100(1 − α/2)% percentiles for these
  • The confidence interval is:

\[ \left[ \text{quantile}_{100\alpha/2\%},\ \text{quantile}_{100(1-\alpha/2)\%} \right] \]


Example 3

In a study it was explored whether children who had received milk from a bottle had worse or better dental health than those who had not. For 19 randomly selected children it was recorded when they had their first incident of caries:

  bottle  age    bottle  age    bottle  age
  no        9    no       10    yes      16
  yes      14    no        8    yes      14
  yes      15    no        6    yes       9
  no       10    yes      12    no       12
  no       12    yes      13    yes      12
  no        6    no       20    yes      19
  yes      13

Find the confidence interval for the difference!

Example 3, solution in R

Data input:
    x = c(9,10,12,6,10,8,6,20,12)            # no bottle
    y = c(14,15,19,12,13,13,16,14,9,12)      # bottle

Bootstrapping in R:
    k = 10000
    xsamples = replicate(k, sample(x, replace = TRUE))   # resample within the x group
    ysamples = replicate(k, sample(y, replace = TRUE))   # resample within the y group
    mymeandifs = apply(xsamples, 2, mean) - apply(ysamples, 2, mean)
    quantile(mymeandifs, c(0.025,0.975))                 # 95% confidence interval

Hypothesis testing using simulation

Hypothesis testing using bootstrap confidence intervals

Relationship between confidence intervals and hypothesis testing:
  H0: θ = θ0 is accepted ⇔ θ0 is in the confidence interval for θ

E.g. a one-sided hypothesis test using the bootstrap:
  H0: θ = θ0 versus H1: θ > θ0
  H0 is accepted ⇔ θ0 > the 100α% percentile of the bootstrap values for θ

One-sample setup, Example 2, cont.

We continue the cigarette consumption example. We would now like to show that cigarette consumption has decreased after giving birth:
  H0: µ1 − µ2 = 0 versus the alternative H1: µ1 − µ2 > 0

The P-value is found in R as:
    sum(mymeans < 0) / k
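For completeness, here is a self-contained sketch that recomputes the bootstrap means and then applies both the P-value formula and the percentile rule for the one-sided test with α = 0.05. The seed and the names pval and q05 are illustrative choices, not part of the slides.

```r
x1 <- c(8,24,7,20,6,20,13,15,11,22,15)    # before
x2 <- c(5,11,0,15,0,20,15,19,12,0,6)      # after
dif <- x1 - x2

set.seed(2402)                            # illustrative seed
k <- 10000
mymeans <- replicate(k, mean(sample(dif, replace = TRUE)))

pval <- sum(mymeans < 0) / k              # share of bootstrap means below theta0 = 0
q05 <- quantile(mymeans, 0.05)            # 5% percentile of the bootstrap means

pval                                      # P-value for H1: mu1 - mu2 > 0
q05 > 0                                   # TRUE means theta0 = 0 lies below the 5% percentile,
                                          # i.e. H0 is rejected at the 5% level
```

The two views agree by construction: the P-value is below 0.05 when less than 5% of the bootstrap means fall below 0, which is (apart from ties) the same as the 5% percentile being above 0.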


Hypothesis testing using permutation tests

We now have the two samples: $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$. A permutation test for the hypothesis µ1 = µ2 is defined by:

  • Simulate k sets of 2 samples of size n1 and n2 by permuting the available data (large k, e.g. k > 1000)
  • Calculate the difference between the averages for each of the k sample pairs: $\bar{x}^*_1 - \bar{y}^*_1, \ldots, \bar{x}^*_k - \bar{y}^*_k$
  • Find the P-value from the position of $\bar{x} - \bar{y}$ in this distribution (2-sided or 1-sided, in the usual manner)

Two-sample setup, Example 3, cont.

We continue the tooth health example. We want to perform a two-sided test for the hypothesis µ1 = µ2. The following R code implements the calculations:

    x = c(9,10,12,6,10,8,6,20,12)
    y = c(14,15,19,12,13,13,16,14,9,12)
    k = 100000
    perms = replicate(k, sample(c(x,y)))      # one permutation of the pooled data per column
    # first 9 rows play the role of x, the remaining 10 rows the role of y
    mymeandifs = apply(perms[1:9,], 2, mean) - apply(perms[10:19,], 2, mean)
    sum(abs(mymeandifs) > abs(mean(x)-mean(y))) / k   # two-sided P-value
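The one-sided variant mentioned in the recipe can be sketched with a small helper. The function perm_pvalue and its alternative argument are hypothetical names introduced here for illustration. Note one deliberate difference from the code above: this sketch counts ties with >= and <=, where the slide code uses a strict >.

```r
x <- c(9,10,12,6,10,8,6,20,12)            # no bottle
y <- c(14,15,19,12,13,13,16,14,9,12)      # bottle

# Hypothetical helper: permutation P-value for the difference in means
perm_pvalue <- function(x, y, k = 10000,
                        alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n1 <- length(x)
  pooled <- c(x, y)
  obs <- mean(x) - mean(y)                # observed difference
  difs <- replicate(k, {
    p <- sample(pooled)                   # random permutation of the pooled data
    mean(p[1:n1]) - mean(p[-(1:n1)])      # difference for the permuted groups
  })
  switch(alternative,
         two.sided = mean(abs(difs) >= abs(obs)),
         less      = mean(difs <= obs),   # H1: mu1 < mu2
         greater   = mean(difs >= obs))   # H1: mu1 > mu2
}

set.seed(2402)                            # illustrative seed
p_two  <- perm_pvalue(x, y)
p_less <- perm_pvalue(x, y, alternative = "less")
c(two.sided = p_two, less = p_less)
```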
