MATH 185 – University of California San Diego – Ery Arias-Castro 1 / 30
Univariate Continuous Data
MATH 185 Introduction to Computational Statistics
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/ eariasca/math185.html
Lung dysfunction in workers in the detergent industry
We consider the following dataset (quoted in Larsen & Marx, exercise 5.3.1).
> str(bacillus)
'data.frame':	19 obs. of  1 variable:
 $ ratio: num  0.61 0.7 0.63 0.76 0.67 0.72 0.64 0.82 0.88 0.82 ...
NULL
The FEV1/VC ratio is a measure of lung capacity.
FEV1: forced expiratory volume
VC: vital capacity
Normal FEV1/VC ratio is 0.80
This ratio was measured for certain workers in the detergent industry exposed to a Bacillus subtilis enzyme.
Boxplot
A basic plot that helps visualize how the data is spread out.
[Figure: boxplot of ratio; axis from 0.60 to 0.85]
Boxplot
The middle box represents the inter-quartile range and contains the middle 50% of the data.
The upper edge (hinge) of the box indicates the 75th percentile of the data set.
The lower hinge indicates the 25th percentile.
The line within the box indicates the median value of the data.
The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case each whisker extends to the most extreme data point within 1.5 times the inter-quartile range of the box.
The points outside the ends of the whiskers are outliers or suspected outliers.
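The quantities described above can be computed directly in R. The following is a minimal sketch that uses only the ten values visible in the str() output earlier (the full 19-observation dataset is not reproduced here, so the numbers will differ from the figure). The names x10, box_len and flagged are introduced for illustration.

```r
# First ten values shown by str(bacillus); NOT the full dataset.
x10 <- c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82)

# Tukey's five numbers: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x10)

# boxplot.stats() returns the whisker ends, hinges and median in $stats
# and the points drawn individually as outliers in $out.
bs <- boxplot.stats(x10, coef = 1.5)

# By hand: points more than 1.5 box lengths beyond a hinge are flagged.
box_len <- five[4] - five[2]
lower_fence <- five[2] - 1.5 * box_len
upper_fence <- five[4] + 1.5 * box_len
flagged <- x10[x10 < lower_fence | x10 > upper_fence]

print(five)
print(bs$stats)
print(flagged)   # the same points as bs$out
```

When no point is flagged, the whisker ends in bs$stats coincide with the minimum and maximum, as the description above states.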
Boxplot
The following table provides similar information
> summary(ratio)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.6100  0.7100  0.7800  0.7663  0.8350  0.8800
Histogram
The histogram is more detailed: it approximates the actual distribution of the data.
[Figure: Histogram of ratio; x-axis: ratio (0.60 to 0.90), y-axis: Frequency]
Histogram
A bin’s width is the range it covers.
A bin’s height is proportional to the number of points that fall within that range.
Rescaled to have total area 1, the histogram is a (piecewise constant) approximation of the population’s probability density function.
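The relation between bin counts and the density scale can be checked numerically. This is a minimal sketch, again using only the ten values visible in the str() output (not the full dataset); the break points are an assumption chosen to mirror the axis of the figure.

```r
# First ten values shown by str(bacillus); NOT the full dataset.
x10 <- c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82)

# Bins of width 0.05 from 0.60 to 0.90 (assumed, to mirror the plot axis).
h <- hist(x10, breaks = seq(0.60, 0.90, by = 0.05), plot = FALSE)

widths <- diff(h$breaks)

# h$counts holds the number of points per bin;
# h$density rescales them so that the bars have total area 1.
density_by_hand <- h$counts / (length(x10) * widths)
total_area <- sum(h$density * widths)

print(h$counts)
print(h$density)
print(total_area)
```

Calling hist(x10, freq = FALSE) draws the density version, which is the one comparable to a probability density function.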
Boxplot v. Histogram
> library(UsingR)
> simple.hist.and.boxplot(ratio)
[Figure: "Histogram of x" together with the corresponding boxplot, over the range 0.60 to 0.90]
Main question
Are workers exposed to a Bacillus subtilis enzyme more likely to suffer from lung dysfunction?
Since we know what the normal level for the FEV1/VC ratio is (0.80), we want to compare the observations to this baseline.
Testing the Mean – Student’s t-Test
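Let µ be the population mean. Consider H0 : µ = 0.80 versus H1 : µ < 0.80.
The Student t-test rejects for large negative values of T, where, with µ0 = 0.80 the value of µ under H0,

$$ T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$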
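The statistic T can be computed by hand and compared with what t.test() reports. The sketch below does so for the one-sided alternative, using only the ten values visible in the str() output (not the full dataset, so the result differs from the one for all 19 observations); the variable names are illustrative.

```r
# First ten values shown by str(bacillus); NOT the full dataset.
x10 <- c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82)
mu0 <- 0.80

n    <- length(x10)
xbar <- mean(x10)
s    <- sd(x10)                      # sd() uses the 1/(n - 1) normalisation

t_stat <- (xbar - mu0) / (s / sqrt(n))
p_less <- pt(t_stat, df = n - 1)     # lower tail, matching H1: mu < mu0

# Same test through the built-in function.
fit <- t.test(x10, mu = mu0, alternative = "less")

print(c(by_hand = t_stat, t.test = unname(fit$statistic)))
print(c(by_hand = p_less, t.test = fit$p.value))
```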
Testing the Mean – Student’s t-Test
> t.test(ratio, mu = 0.8)

	One Sample t-test

data:  ratio
t = -1.7091, df = 18, p-value = 0.1046
alternative hypothesis: true mean is not equal to 0.8
95 percent confidence interval:
 0.7249096 0.8077219
sample estimates:
mean of x
0.7663158
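The p-value is based on the assumption that the observations are an i.i.d. sample from a normal distribution.
Note that t.test defaults to the two-sided alternative, as the output states. For H1 : µ < 0.80 one would pass alternative = "less"; since t is negative, the one-sided p-value is half the two-sided one, about 0.052.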
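As a consistency check, the figures in the t.test output can be recovered from the reported t statistic, degrees of freedom and sample mean alone. The sketch below does this with pt() and qt(). The variable names are illustrative, and the inputs are the rounded values printed in the output, so agreement is only up to rounding.

```r
# Values copied from the printed t.test output (rounded as printed).
t_reported <- -1.7091
df         <- 18
xbar       <- 0.7663158
mu0        <- 0.80

# Two-sided p-value: both tails beyond |t|.
p_two_sided <- 2 * pt(-abs(t_reported), df)

# One-sided p-value for H1: mu < mu0 (lower tail only).
p_less <- pt(t_reported, df)

# Standard error implied by t = (xbar - mu0) / se, then the 95% interval.
se <- (xbar - mu0) / t_reported
ci <- xbar + c(-1, 1) * qt(0.975, df) * se

print(p_two_sided)
print(p_less)
print(ci)
```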
Quantile-Quantile Plot
Helps visualize whether a sample comes from a given translation/scale family of distributions. Here we compare with the normal family.
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
If the points lie close to the line, then the normality assumption is reasonable.
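The coordinates of a normal Q-Q plot can be built by hand: the sorted sample is plotted against standard normal quantiles at the plotting positions given by ppoints(). This is a minimal sketch on the ten values visible in the str() output (not the full dataset); the names theo and samp are illustrative.

```r
# First ten values shown by str(bacillus); NOT the full dataset.
x10 <- c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82)
n <- length(x10)

# Theoretical quantiles: standard normal quantiles at the plotting positions.
theo <- qnorm(ppoints(n))

# Sample quantiles: the sorted observations.
samp <- sort(x10)

# qqnorm() computes the same pairs (returned in the original data order).
qq <- qqnorm(x10, plot.it = FALSE)

print(cbind(theoretical = theo, sample = samp))

# To draw the plot with its reference line:
# qqnorm(x10); qqline(x10)
```

qqline() draws the line through the first and third quartiles of the sample and of the normal distribution, so it is less affected by the extreme points than a least-squares fit would be.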
Wilcoxon Sign Test
Let m be the population median. Consider H0 : m = m0 versus H1 : m < m0.
– The Wilcoxon sign test rejects for small values of N+, where

$$ N_+ = \#\{i : X_i > m_0\} $$

– In fact, for large n, the following statistic Z is approximately standard normal

$$ Z = \frac{N_+ + N_0/2 - n/2}{\sqrt{n/4}}, \qquad N_0 = \#\{i : X_i = m_0\} $$

Here we get a p-value of 0.4092729.
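The normal approximation above takes only a few lines of R. The sketch below defines a hypothetical helper sign_test_z() (not part of any package) and applies it to the ten values visible in the str() output, not the full dataset. It also notes that the reported p-value is reproduced by N+ + N0/2 = 9 with n = 19; that count is inferred here from the p-value and is not stated in the source.

```r
# Hypothetical helper: sign test of H0: m = m0 against H1: m < m0,
# using the normal approximation.
sign_test_z <- function(x, m0) {
  n      <- length(x)
  n_plus <- sum(x > m0)
  n_zero <- sum(x == m0)
  z <- (n_plus + n_zero / 2 - n / 2) / sqrt(n / 4)
  list(n_plus = n_plus, n_zero = n_zero, z = z, p_less = pnorm(z))
}

# First ten values shown by str(bacillus); NOT the full dataset.
x10 <- c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82)
res <- sign_test_z(x10, m0 = 0.80)

# Exact version when no observation equals m0:
# under H0, N+ follows a Binomial(n, 1/2) distribution.
exact <- binom.test(res$n_plus, length(x10), p = 0.5, alternative = "less")

# Inferred check of the reported p-value (n = 19, N+ + N0/2 = 9 assumed).
p_slide <- pnorm((9 - 19 / 2) / sqrt(19 / 4))

print(res)
print(exact$p.value)
print(p_slide)
```

For a sample this small the exact binomial p-value and the normal approximation can differ noticeably, which is why the approximation is stated for large n.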
Wilcoxon Signed Rank Test
Let F be the population’s cumulative distribution function, which we assume to be symmetric about its median. Let m be the median of F.
Say we want to test H0 : m = m0 versus H1 : m < m0.
– The Wilcoxon signed-rank test rejects for small values of W, where

$$ W = \sum_{i=1}^{n} Y_i R_i, \qquad R_i = \mathrm{rank}(|X_i - m_0|), \qquad Y_i = \mathrm{sign}(X_i - m_0) $$

(Actually, R returns V, the sum of the ranks R_i over those i with X_i > m0. When no X_i equals m0, V = W/2 + n(n + 1)/4.)
– In fact, for large n, the following statistic Z is approximately standard normal

$$ Z = \frac{V - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}} $$

Here we get p-value = 0.07084.
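The signed-rank computation can be sketched by hand and compared with wilcox.test(). The example uses the ten values visible in the str() output (not the full dataset). It relies on two facts: wilcox.test() reports V, the sum of the ranks of the positive differences, and when no difference is zero V equals W/2 + n(n + 1)/4, where W is the signed sum of ranks. The variable names are illustrative.

```r
# First ten values shown by str(bacillus); NOT the full dataset.
x10 <- c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82)
m0  <- 0.80

# Round the differences so that equal |d| are recognised as ties
# despite floating-point noise.
d <- round(x10 - m0, 2)
n <- length(d)

r        <- rank(abs(d))          # midranks are used for ties
w_signed <- sum(sign(d) * r)      # signed sum of ranks
v_plus   <- sum(r[d > 0])         # sum of ranks of positive differences

# Normal approximation, without a correction for ties.
z      <- (v_plus - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
p_less <- pnorm(z)

# Built-in test; its p-value includes a tie correction, so it may
# differ slightly from p_less when ties are present.
fit <- wilcox.test(d, mu = 0, alternative = "less",
                   exact = FALSE, correct = FALSE)

print(c(V_by_hand = v_plus, V_wilcox = unname(fit$statistic)))
print(c(p_by_hand = p_less, p_wilcox = fit$p.value))
```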