Marc Mehlman
Inference for Distributions
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven
Based on Rare Event Rule: “rare events happen – but not to me”.
Marc Mehlman (University of New Haven) Inference for Distributions 1 / 42
t Distribution
When comparing the density curves of the standard Normal distribution and t distributions, several facts are apparent:
The density curves of the t distributions are similar in shape to the standard Normal curve.
The spread of the t distributions is a bit greater than that of the standard Normal distribution.
The t distributions have more probability in the tails and less in the center than does the standard Normal.
As the degrees of freedom increase, the t density curve approaches the standard Normal curve ever more closely.
We can use Table D in the back of the book to determine critical values t* for t distributions with different degrees of freedom.
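The heavier tails and the convergence to the standard Normal can be checked numerically. The following is a minimal R sketch; the degrees of freedom and the 0.975 quantile level are illustrative choices, not values taken from the slides.

```r
# Upper 2.5% critical values t* for several degrees of freedom.
# The df values below are arbitrary illustrative choices.
dfs <- c(2, 5, 15, 30, 100, 1000)
tstar <- qt(0.975, dfs)   # critical values of t(df)
zstar <- qnorm(0.975)     # critical value of the standard Normal

# Every t* exceeds z* (heavier tails), and the excess shrinks as df grows.
print(data.frame(df = dfs, tstar = tstar, excess = tstar - zstar))
```

The same comparison works for any other tail probability by changing the first argument of qt and qnorm.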
The t procedures are exactly correct when the population is exactly Normal.
The t procedures are robust to small deviations from Normality, but:
The sample must be a random sample from the population.
Outliers and skewness strongly influence the mean and therefore the t procedures. Their impact diminishes as the sample size grows because of the Central Limit Theorem.
As a guideline:
When n < 15, the data must be close to Normal and without outliers.
When 15 ≤ n ≤ 40, mild skewness is acceptable, but not outliers.
When n > 40, the t statistic will be valid even with strong skewness.
CI for µ: σ unknown
A level C confidence interval is a range of values, computed from the sample by a method that captures the true population parameter with probability (confidence level) C. We have a set of data from a population with both µ and σ unknown. We use x̅ to estimate µ, and s to estimate σ, using a t distribution (df n − 1).
C is the area between −t* and t*. We find t* in the line of Table D for df = n − 1. The margin of error m is:
$$m = t^{*}\,\frac{s}{\sqrt{n}}$$
so the confidence interval is x̅ ± m.
The total rainfall, in meters, for Jupa, Beliana in the first decade of each of the last sixteen centuries is given below:
3.790155 3.628361 3.989105 5.124677 4.227491 3.183561 5.286963 3.323666
4.116425 2.771781 6.243354 5.040272 6.821760 6.170435 5.439190 5.206938
Find a 90% confidence interval for the mean rainfall per first decade of each century.
Solution: The following normal quantile plot indicates that the data comes from a normal (or at least almost normal) distribution.
[Figure: Normal Q−Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
Since σ is unknown and the distribution is close to normal with no outliers, R gives us
> mdat=c(3.790155,3.628361,3.989105,5.124677,4.227491,3.183561,5.286963,3.323666,
+ 4.116425,2.771781,6.243354,5.040272,6.821760,6.170435,5.439190,5.206938)
> mean(mdat)-qt(0.95,15)*sd(mdat)/sqrt(16)
[1] 4.124238
> mean(mdat)+qt(0.95,15)*sd(mdat)/sqrt(16)
[1] 5.171278
Thus a 90% confidence interval is (4.124238, 5.171278). Notice
> t.test(mdat,mu=4.5,conf.level=0.90)
One Sample t-test
data: mdat
t = 0.4948, df = 15, p-value = 0.6279
alternative hypothesis: true mean is not equal to 4.5
90 percent confidence interval:
4.124238 5.171278
sample estimates:
mean of x
4.647758
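The interval computation can be wrapped in a small reusable function. This is a sketch only: t_ci is a hypothetical helper name (not a base R function), and it assumes the conditions for the t procedures hold for the data passed in.

```r
# Rainfall data from the example.
mdat <- c(3.790155, 3.628361, 3.989105, 5.124677, 4.227491, 3.183561,
          5.286963, 3.323666, 4.116425, 2.771781, 6.243354, 5.040272,
          6.821760, 6.170435, 5.439190, 5.206938)

# t confidence interval for a mean when sigma is unknown.
# t_ci is a hypothetical helper, not part of base R.
t_ci <- function(x, conf.level = 0.95) {
  n <- length(x)
  tstar <- qt(1 - (1 - conf.level) / 2, df = n - 1)  # critical value t*
  m <- tstar * sd(x) / sqrt(n)                        # margin of error
  c(lower = mean(x) - m, upper = mean(x) + m)
}

ci <- t_ci(mdat, conf.level = 0.90)
print(ci)
```

For a 90% interval the upper tail area is 0.05, which is why the example calls qt(0.95,15).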
t-test: Mean (σ unknown)
With test statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ and T ∼ t(n − 1) under H0 : µX = µ0, the p–value of a test of H0
1 versus H1 : µX > µ0 is P(T ≥ t).
2 versus H2 : µX < µ0 is P(T ≤ t).
3 versus H3 : µX ≠ µ0 is 2P(T ≥ |t|).
The number of hours of Sesame Street that American 4 year olds watch each year is assumed to be normally distributed. Twenty-five 4 year olds are randomly sampled and one finds $\bar{x} = 125$ and $s_X^2 = 100$. What is the p–value for a test of H0 : µX = 120 versus HA : µX > 120?
Solution: The population is normally distributed, so the test statistic
$$t = \frac{125 - 120}{10/\sqrt{25}} = 2.5$$
comes from t(24) under H0. Thus the p–value is
> 1-pt(2.5,24)
[1] 0.009827088
It seems unlikely that the average number of Sesame Street watching hours is 120 or less.
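When only summary statistics are available, as in this example, the three p–values can be computed directly from the t distribution. The sketch below assumes the conditions for the t test hold; t_pvalue is a hypothetical helper name, not a base R function.

```r
# p-value of a one-sample t test from summary statistics.
# t_pvalue is a hypothetical helper, not part of base R.
t_pvalue <- function(xbar, s, n, mu0,
                     alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  tstat <- (xbar - mu0) / (s / sqrt(n))  # test statistic
  df <- n - 1
  switch(alternative,
         greater   = pt(tstat, df, lower.tail = FALSE),          # P(T >= t)
         less      = pt(tstat, df),                              # P(T <= t)
         two.sided = 2 * pt(abs(tstat), df, lower.tail = FALSE)) # 2P(T >= |t|)
}

# Sesame Street example: xbar = 125, s^2 = 100, n = 25, mu0 = 120.
p_greater <- t_pvalue(xbar = 125, s = sqrt(100), n = 25, mu0 = 120)
print(p_greater)
```

Using lower.tail = FALSE instead of 1 - pt(...) gives the same value here and avoids loss of precision for very small p–values.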
> set.seed(1234)
> ntab=rnorm(99,5,1)
> t.test(ntab,mu=4.6)
One Sample t-test
data: ntab
t = 2.2299, df = 98, p-value = 0.02804
alternative hypothesis: true mean is not equal to 4.6
95 percent confidence interval:
4.624239 5.016220
sample estimates:
mean of x
4.820229
> t.test(ntab,mu=4.6,alternative="greater")
One Sample t-test
data: ntab
t = 2.2299, df = 98, p-value = 0.01402
alternative hypothesis: true mean is greater than 4.6
95 percent confidence interval:
4.65623 Inf
sample estimates:
mean of x
4.820229
t test: Matched Pairs
Sometimes we want to compare treatments or conditions at the individual level. The data sets produced this way are not independent. The individuals in one sample are related to those in the other sample.
Pre-test and post-test studies look at data collected on the same sample elements before and after some experiment is performed.
Twin studies often try to sort out the influence of genetic factors by comparing a variable between sets of twins.
Using people matched for age, sex, and education in social studies allows us to cancel out the effect of these potential lurking variables.
Randomly selected caffeine-dependent individuals were deprived of all caffeine–rich foods and assigned to receive daily pills. At […] scores represent greater depression). Find the p–value of the test H0 : µdiff = 0 versus HA : µdiff > 0.
Subject      1  2  3  4  5  6  7  8  9 10 11
Caffeine     5  5  4  3  8  5  0  0  2 11  1
Placebo     16 23  5  7 14 24  6  3 15 12  0
Difference  11 18  1  4  6 19  6  3 13  1 -1
Solution: The normal quantile plot of the differences indicates that the data comes from a normal (or at least almost normal) distribution.
[Figure: Normal Q−Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
> cdat=c(5,5,4,3,8,5,0,0,2,11,1)
> pdat=c(16,23,5,7,14,24,6,3,15,12,0)
> tstat=(mean(pdat-cdat)-0)/(sd(pdat-cdat)/sqrt(11))
> tstat
[1] 3.530426
> 1-pt(tstat,10)
[1] 0.002721472
The small p–value is strong evidence that caffeine deprivation increases depression scores.
Using R, one can find the p–value two other ways too.
> t.test(pdat-cdat,mu=0,alternative="greater")
One Sample t-test
data: pdat - cdat
t = 3.5304, df = 10, p-value = 0.002721
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
3.583269 Inf
sample estimates:
mean of x
7.363636
> t.test(pdat,cdat,paired=TRUE,mu=0,alternative="greater")
Paired t-test
data: pdat and cdat
t = 3.5304, df = 10, p-value = 0.002721
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
3.583269 Inf
sample estimates:
mean of the differences
7.363636
Without any assumptions of population distribution types or sample sizes one has:
Ignore pairs with differences of zero and let n be the count of the remaining pairs. The test statistic is the count, X, of pairs with a positive difference. To find the p–value of a test, note that X ∼ BIN(n, 1/2) under the null hypothesis H0 that the medians of the two matched populations are the same.
In the previous example concerning caffeine deprivation, ten out of eleven subjects experienced an increase in depression. Consider the sign test for matched pairs, H0 : median depression with caffeine = median depression with placebo versus HA : median depression with caffeine < median depression with placebo. Noticing that x = 10, the p–value is
$$P(X \ge 10) = P(X = 10) + P(X = 11) = \binom{11}{10}\left(\tfrac{1}{2}\right)^{10}\left(1 - \tfrac{1}{2}\right)^{1} + \binom{11}{11}\left(\tfrac{1}{2}\right)^{11}\left(1 - \tfrac{1}{2}\right)^{0}$$
> 1-pbinom(9,11,1/2)
[1] 0.005859375
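The sign test steps above (drop zero differences, count positive differences, use the binomial distribution) can be sketched in R as follows. The data are the caffeine and placebo scores from the example; the variable names d, n, x, p_direct and p_binom are illustrative. binom.test is the exact binomial test in base R's stats package.

```r
# Depression scores from the caffeine example.
cdat <- c(5, 5, 4, 3, 8, 5, 0, 0, 2, 11, 1)
pdat <- c(16, 23, 5, 7, 14, 24, 6, 3, 15, 12, 0)

d <- pdat - cdat   # placebo minus caffeine
d <- d[d != 0]     # ignore pairs with a difference of zero
n <- length(d)     # number of remaining pairs
x <- sum(d > 0)    # count of positive differences

# P(X >= x) for X ~ BIN(n, 1/2), summed directly from the binomial pmf ...
p_direct <- sum(dbinom(x:n, n, 0.5))
# ... and via the exact binomial test.
p_binom <- binom.test(x, n, p = 0.5, alternative = "greater")$p.value

print(c(n = n, x = x, p_direct = p_direct, p_binom = p_binom))
```

Both routes give the same p–value as 1-pbinom(9,11,1/2).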
Two Sample z–Test: Means (σX and σY known)
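Under H0 : µX = µY the test statistic is
$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \sim N(0, 1).$$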
The range of Brand X wireless routers is distributed as N(µX , 17.7) and the range of Brand Y wireless routers is distributed as N(µY , 15.3), where the second parameter is the standard deviation. Suppose ten Brand X routers are sampled and $\bar{X} = 131.2$ meters. Further suppose that nine Brand Y routers are sampled and $\bar{Y} = 141.5$ meters. Find the p–value of the test of H0 : µX = µY versus HA : µX < µY .
Solution:
$$z = \frac{131.2 - 141.5}{\sqrt{\frac{17.7^2}{10} + \frac{15.3^2}{9}}} = -1.360229,$$
which comes from N(0, 1) under H0. The p–value for the test is
> zstat=(131.2-141.5)/sqrt((17.7^2/10)+(15.3^2/9))
> zstat
[1] -1.360229
> pnorm(zstat,0,1)
[1] 0.08687867
One fails to reject H0 at the 0.05 significance level but rejects H0 at the 0.10 significance level.
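Base R has no built-in two-sample z test, so the calculation is usually done by hand as above. The following sketch wraps it in a function; z_two_sample is a hypothetical helper name, and the function assumes both population standard deviations are known.

```r
# Two-sample z test from summary statistics (sigma_x and sigma_y known).
# z_two_sample is a hypothetical helper, not part of base R.
z_two_sample <- function(xbar, ybar, sigma_x, sigma_y, n_x, n_y) {
  z <- (xbar - ybar) / sqrt(sigma_x^2 / n_x + sigma_y^2 / n_y)
  c(z = z,
    p_less = pnorm(z),                          # HA: mu_X < mu_Y
    p_greater = pnorm(z, lower.tail = FALSE),   # HA: mu_X > mu_Y
    p_two_sided = 2 * pnorm(-abs(z)))           # HA: mu_X != mu_Y
}

# Router example: Brand X (n = 10) versus Brand Y (n = 9).
res <- z_two_sample(xbar = 131.2, ybar = 141.5,
                    sigma_x = 17.7, sigma_y = 15.3, n_x = 10, n_y = 9)
print(res)
```

The p_less entry corresponds to the left-tailed test in the example.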
Two Sample t–test: Means (σX and σY unknown)
Let X1, · · · , XnX and Y1, · · · , YnY be independent random samples. Assume either that the random samples were sampled from normal populations or that each of the sample sizes is ≥ 5. Let H0 : µX = µY . The test statistic is
$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}} \sim t(\mathrm{df}) \quad \text{where} \quad \mathrm{df} = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{1}{n_X - 1}\left(\frac{s_X^2}{n_X}\right)^2 + \frac{1}{n_Y - 1}\left(\frac{s_Y^2}{n_Y}\right)^2}$$
for H0. When using a table, let the degrees of freedom be the largest integer that is less than or equal to df for a slightly more conservative test (statistical software can handle non–integer degrees of freedom). A (1 − α)100% confidence interval for µX − µY is
$$\bar{x} - \bar{y} \pm t^{\star}(\mathrm{df})\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}.$$
Does smoking damage the lungs of children exposed to parental smoking? Forced Vital Capacity (FVC) is the volume (in milliliters) of air that an individual can exhale in 6 seconds. FVC was obtained for a sample of children not exposed to parental smoking and for a sample of children exposed to parental smoking.
Parental smoking   Mean FVC   s      n
Yes                75.5       9.3    30
No                 88.2       15.1   30
Is the mean FVC lower in the population of children exposed to parental smoking? We test:
H0: µsmoke = µno <=> (µsmoke − µno) = 0
Ha: µsmoke < µno <=> (µsmoke − µno) < 0 (one-sided)
The test statistic is
> (75.5-88.2)/sqrt(9.3^2/30 + 15.1^2/30)
[1] -3.922419
The conservative test gives 29 for the degrees of freedom and a p–value of
> tstat=(75.5-88.2)/sqrt(9.3^2/30 + 15.1^2/30)
> tstat
[1] -3.922419
> pt(tstat,29)
[1] 0.0002468389
The degrees of freedom for Welch’s Test is
$$\frac{\left(\frac{9.3^2}{30} + \frac{15.1^2}{30}\right)^2}{\frac{1}{30 - 1}\left(\frac{9.3^2}{30}\right)^2 + \frac{1}{30 - 1}\left(\frac{15.1^2}{30}\right)^2} = 48.23342.$$
(Use 48 if you are looking up the p–value in a table.) Using R, Welch’s test gives a p–value of
> pt(-3.922419, 48.23342)
[1] 0.0001385454
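The degrees-of-freedom formula is easy to mistype, so it can help to code it once. The sketch below implements it from summary statistics; welch_df is a hypothetical helper name, not a base R function (t.test computes the same quantity internally when given raw data).

```r
# Welch degrees of freedom from sample standard deviations and sizes.
# welch_df is a hypothetical helper, not part of base R.
welch_df <- function(s_x, n_x, s_y, n_y) {
  vx <- s_x^2 / n_x   # estimated variance of the first sample mean
  vy <- s_y^2 / n_y   # estimated variance of the second sample mean
  (vx + vy)^2 / (vx^2 / (n_x - 1) + vy^2 / (n_y - 1))
}

# FVC example: exposed (s = 9.3, n = 30) versus not exposed (s = 15.1, n = 30).
df <- welch_df(9.3, 30, 15.1, 30)
tstat <- (75.5 - 88.2) / sqrt(9.3^2 / 30 + 15.1^2 / 30)
p <- pt(tstat, df)   # left-tailed p-value

print(c(df = df, tstat = tstat, p = p))
```

Because pt accepts non-integer degrees of freedom, there is no need to round df down when using software.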
> dat1=rnorm(25,0.9,2.5)
> dat2=rnorm(43,0.6,1.3)
> t.test(dat1,dat2,mu=0.0)
Welch Two Sample t-test
data: dat1 and dat2
t = 1.2977, df = 29.947, p-value = 0.2043
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
[…] 2.0679355
sample estimates:
mean of x mean of y
1.2426692 0.4392566
Pooled Two Sample t–test: Means (σX = σY unknown)
Let X1, · · · , XnX and Y1, · · · , YnY be independent random samples. Assume σ = σX = σY and that either the random samples were sampled from a normal population or the sum of the sample sizes, nX + nY , is ≥ 30. Define the pooled estimator of σ²,
$$S_p^2 \overset{\mathrm{def}}{=} \frac{(n_X - 1)S_X^2 + (n_Y - 1)S_Y^2}{n_X + n_Y - 2}.$$
Let H0 : µX = µY . The test statistic is
$$T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \sim t(n_X + n_Y - 2)$$
for H0. A (1 − α)100% confidence interval for µX − µY is
$$\bar{x} - \bar{y} \pm t^{\star}(n_X + n_Y - 2)\, s_p\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}.$$
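The pooled statistic can be computed by hand and compared with R's t.test using var.equal=TRUE. This is a sketch on simulated data: the seed, sample sizes, means and standard deviation are arbitrary assumptions, and pooled_t is a hypothetical helper name.

```r
# Pooled two-sample t test computed from the formulas.
# pooled_t is a hypothetical helper, not part of base R.
pooled_t <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  # Pooled estimator of the common variance sigma^2.
  sp2 <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
  tstat <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n_x + 1 / n_y))
  df <- n_x + n_y - 2
  c(t = tstat, df = df, p_two_sided = 2 * pt(-abs(tstat), df))
}

set.seed(1)                        # arbitrary seed for reproducibility
x <- rnorm(20, mean = 5, sd = 2)   # simulated samples with equal sigma
y <- rnorm(25, mean = 4, sd = 2)

res <- pooled_t(x, y)
ref <- t.test(x, y, var.equal = TRUE)   # built-in pooled test
print(res)
print(c(t = unname(ref$statistic), df = unname(ref$parameter),
        p_two_sided = ref$p.value))
```

Without var.equal = TRUE, t.test runs Welch's test instead, which does not assume σX = σY.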
Two Sample F–test: Variance
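With observed test statistic $f = s_X^2 / s_Y^2$ and F ∼ F(n − 1, m − 1) under H0 : σX = σY , the p–value of a test of H0 versus
1 HA : σX > σY is P(F ≥ f).
2 HA : σX < σY is P(F ≤ f).
3 HA : σX ≠ σY is twice the smaller of the two one-sided p–values above.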
The number of tail wags per minute of 11 randomly chosen Labrador dogs at feeding time was tallied and a sample variance of 6.73 was calculated. The number of tail wags per minute of 9 randomly chosen terriers at feeding time was tallied and a sample variance of 25.91 was calculated. Assuming that the number of wags for both types of dogs is normal, find the p–value of a test of H0 : σL = σT versus HA : σL < σT .
Solution:
$$F \overset{\mathrm{def}}{=} \frac{S_L^2}{S_T^2} \sim F(11 - 1, 9 - 1) \text{ under } H_0 \quad \text{and} \quad f = \frac{6.73}{25.91} = 0.26.$$
The p–value of the left–tailed test is
> pf(6.73/25.91,10,8)
[1] 0.02510348
There is reason to reject the hypothesis that the variance of tail wagging is the same between terriers and Labradors in favor of more tail wagging variance for terriers.
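When only the sample variances are known, var.test cannot be used because it needs the raw data. A minimal sketch of the F test from summary statistics follows; f_test_summary is a hypothetical helper name, and the function assumes both populations are normal.

```r
# Two-sample F test for variances from summary statistics.
# f_test_summary is a hypothetical helper, not part of base R.
f_test_summary <- function(var_x, n_x, var_y, n_y) {
  f <- var_x / var_y                                        # observed statistic
  p_less <- pf(f, n_x - 1, n_y - 1)                         # HA: sigma_X < sigma_Y
  p_greater <- pf(f, n_x - 1, n_y - 1, lower.tail = FALSE)  # HA: sigma_X > sigma_Y
  c(f = f, p_less = p_less, p_greater = p_greater,
    p_two_sided = 2 * min(p_less, p_greater))               # HA: sigma_X != sigma_Y
}

# Tail wag example: Labradors (variance 6.73, n = 11), terriers (25.91, n = 9).
res <- f_test_summary(var_x = 6.73, n_x = 11, var_y = 25.91, n_y = 9)
print(res)
```

The p_less entry is the left-tailed p–value computed in the example.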
> dat1=rnorm(25,0.9,2.5)
> dat2=rnorm(43,0.6,1.5)
> var.test(dat1,dat2)
F test to compare two variances
data: dat1 and dat2
F = 2.4715, num df = 24, denom df = 42, p-value = 0.009983
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.242724 5.280972
sample estimates:
ratio of variances
2.471486
1 To use an F table, choose X to be the random variable with the larger […]
2 If the test is two–sided, use α/2 for the table.
3 When a degree of freedom cannot be found in the table, use the […]
Chapter #8 R Assignment
> set.seed(4321)
> ndat1=rnorm(73,3.1,1.9)
> ndat2=rnorm(57,2.9,2.3)
> ndat3=rnorm(83,3.5,2.3)
3 Assuming the means and standard deviations of ndat1 and ndat2 are […]
4 Assuming the means and standard deviations of ndat2 and ndat3 are […]