Data analysis and Basic Statistics. FAO training course, June 11, 2013. Yoshiki Tsukakoshi, Ph.D. (statistical science), National Food Research Institute, National Agriculture and Food Research Organization. Email: Yoshiki.tsukakoshi@gmail.com
Today’s Agenda
- Descriptive statistics
– Summary statistics
– Graphical presentation
- Statistical Inference
– Estimation (point/interval)
– Hypothesis testing
– Regression analysis
– Analysis of variance
- Statistical Packages
Analytical Results (Repeated Analysis) AAS for cereal
Test | Cadmium (mg/kg)
1 | 0.0394
2 | 0.0412
3 | 0.0420
4 | 0.0398
5 | 0.0407
6 | 0.0395
Know Data Types
- Univariate Data
– A single variable from a sample
- Bivariate Data
– Two variables from a sample
– Analytical value and brand or origin
– Comparison of two methods
- Multivariate Data
– Multiple variables from a single sample
– ICP-MS, DNA arrays
Univariate Data
- Single set of samples
Test | Cadmium (mg/kg)
1 | 0.0394
2 | 0.0412
3 | 0.0420
4 | 0.0398
5 | 0.0407
6 | 0.0395
Average | 0.0404
What can be done with Univariate Data Descriptive Statistics
- Location/Shift/Central Tendency
– Mean (arithmetic, geometric, harmonic)
– Trimmed mean
– Interquartile mean
– Median, mode
- Scatter/Scale/Dispersion
– Variance, median absolute deviation
– Range, interquartile range
- Shape
– Skewness, kurtosis
- Frequency distribution
– Histogram, cumulative distribution, stem-and-leaf
Location Parameters
- Sample data set (5,4,5,6,5)
- Arithmetic mean
– ((5+4+5+6+5)/5) = 5
- Geometric mean
– (5×4×5×6×5)^(1/5) = 4.959344…
– Positive numbers only
- Harmonic mean
– 5/((1/5) + (1/4) + (1/5) + (1/6) + (1/5)) = 4.918033…
- Arithmetic ≥ geometric ≥ harmonic
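The three means above can be checked in R; a minimal sketch using the slide's sample data:

```r
# Sample data from the slide
x <- c(5, 4, 5, 6, 5)

arith <- mean(x)                 # (5+4+5+6+5)/5 = 5
geom  <- prod(x)^(1/length(x))   # (5*4*5*6*5)^(1/5) = 4.959344...
harm  <- length(x) / sum(1/x)    # n / (sum of reciprocals) = 4.918033...

c(arith, geom, harm)             # arithmetic >= geometric >= harmonic
```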
Location parameter - 2
- Location based on quantiles
– Median = 5 (for a symmetric distribution, equal to the mean)
– Quartiles: 1st, 2nd, 3rd
– Percentiles: 1st, 2nd, …, 100th
- Mode
– If continuous, use a kernel density method
- Empirical rule: (mean − mode) ≈ 3 (mean − median)
Scale parameters
- Variance, Standard Deviation
- Coefficient of Variation
– Geometric coefficient of variation
- Range
– (minimum, maximum)
Population variance
Test | Result | Deviation | Deviation²
1 | 0.0394 | −0.00103 | 1.06778E-06
2 | 0.0412 | 0.000767 | 5.87778E-07
3 | 0.0420 | 0.001567 | 2.45444E-06
4 | 0.0398 | −0.00063 | 4.01111E-07
5 | 0.0407 | 0.000267 | 7.11111E-08
6 | 0.0395 | −0.00093 | 8.71111E-07
Average | 0.0404 | | 9.08889E-07
Coefficient of variation (relative standard deviation): s/μ
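Using the cadmium data from the earlier slide, the coefficient of variation can be computed in R (a sketch; note that sd() uses the sample formula with n−1):

```r
cd <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)

m  <- mean(cd)   # 0.0404 (rounded)
s  <- sd(cd)     # sample standard deviation (divides by n-1)
cv <- s / m      # coefficient of variation, about 2.6%
```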
Descriptive Statistics Graphical representation of data
- Histogram
- Stem-and-leaf
– Data: (1.1, 1.1, 1.1, 1.2, 1.2, 2.5, 2.5)
– 1 | 1 1 1 2 2
– 2 | 5 5
[Figure: Histogram of r0]
Starting R
Command line window
Exercise Plotting histogram
- > Orange
–   Tree  age  circumference
– 1    1  118   30
– 2    1  484   58
– 3    1  664   87
- > hist(Orange$circumference)
[Figure: Histogram of Orange$circumference]
Exercise Plotting Stem and Leaf Plot
- > stem(Orange$circumference)
- The decimal point is 1 digit(s) to the right of the |
- 2 | 00023
- 4 | 918
- 6 | 295
- 8 | 17
- 10 | 81255
- 12 | 059
- 14 | 02256
- 16 | 72479
- 18 |
- 20 | 3394
Exercise Plotting empirical distribution
- > plot(ecdf(Orange$circumference))
[Figure: ecdf(Orange$circumference) — empirical CDF, Fn(x) vs x]
Shape parameter
- Skewness
- Kurtosis
[Figure: Histogram of rnorm(1000) — symmetric]
[Figure: Histogram of rnorm(100)^2 — right-skewed]
skewness
- Non parametric skew = (mean-median)/s.d.
[Figure: Histograms illustrating negative and positive skew — rnorm(1000) (symmetric) vs rnorm(100)^2 (positive skew)]
Calculating skewness
Test | Result | Deviation | Deviation³
1 | 0.0394 | −0.00103 | −1.10337E-09
2 | 0.0412 | 0.000767 | 4.5063E-10
3 | 0.0420 | 0.001567 | 3.8453E-09
4 | 0.0398 | −0.00063 | −2.54037E-10
5 | 0.0407 | 0.000267 | 1.8963E-11
6 | 0.0395 | −0.00093 | −8.13037E-10
Average | 0.0404 | | 3.57407E-10
- Divide the mean Deviation³ by s.d.³
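The table's calculation can be reproduced in R; a sketch using population-style moments (dividing by n, as in the slides):

```r
cd  <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
dev <- cd - mean(cd)

m3   <- mean(dev^3)        # average Deviation^3 = 3.57407E-10
sdev <- sqrt(mean(dev^2))  # population-style s.d.
skew <- m3 / sdev^3        # about 0.41: slightly right-skewed
```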
Interval Range
- 1 standard deviation (σ)
- Inter-Quartile Range (IQR)
– 3rd quartile (75th percentile) − 1st quartile (25th percentile)
- Normalized IQR = 0.7413 × IQR
– If the data are normally distributed, converges to the standard deviation
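A quick numerical check of the 0.7413 factor (an illustrative simulation; the seed is an arbitrary choice):

```r
set.seed(1)  # arbitrary seed for reproducibility
x <- rnorm(1e5, mean = 0, sd = 2)

niqr <- 0.7413 * IQR(x)  # should approximate the true s.d. = 2
```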
Terminology : Shape
- Unimodal
– Single peak
- Bimodal
– Dual peak
- Multimodal
[Figure: Histogram of rnorm(1e+06, 0, 1) — unimodal]
[Figure: Histogram of c(rnorm(1e+06, 0, 1), rnorm(1e+06, 8, 2)) — bimodal]
Terminology : Shape - 2
- Symmetric
- Asymmetric
– Skewness not zero
[Figure: Histogram of c(rnorm(1e+06, 0, 1), rnorm(1e+06, 4, 2)) — asymmetric]
[Figure: Histogram of rnorm(1e+06, 0, 1) — symmetric]
Terminology : Shape - 3
- Heavy tail
– Decreases slower than exponential distribution
- Light tail
– Decreases faster than exponential distribution
[Figure: Histogram of rlnorm(1000, 0, 1) — heavy right tail]
Graphical Distribution Box-plot
[Figure: Box plot — median, 1st and 3rd quartiles, whiskers at ±1.5 IQR, outliers beyond 1.5 IQR]
Graphical Distribution Error bars
- The width can be:
– Standard error – Confidence interval
[Figure: Bar charts of five replicates with error bars (value, arbitrary units)]
Exercise on calculation
- =RAND()
- Copy the formula down
- =NORM.INV(RAND(), 0, 1)
Generating (pseudo) random number
- > rnorm(10, 0, 1)
- [1] -0.11896589  0.34946077  0.69093896  0.12069319 -0.39211479  0.04787566  0.17911326  0.67266441  1.16644580 -0.35335929
- > r0 = rnorm(1e5, 0, 1)
- > r1 = rnorm(1e5, 1, 1)
- > hist(r0, xlim=c(-6,6))
- > hist(r1, xlim=c(-6,6))
[Figure: Histogram of r0 (mean 0)]
[Figure: Histogram of r1 (mean 1)]
Expectation and variance
- E[aX+bY]=aE[X]+bE[Y]
- V[X+Y] =V[X]+V[Y]+2cov(X,Y)
- SD[X+Y] = √V[X+Y]
Exercise Adding variable (cont.)
- # E[aX+bY]=aE[X]+bE[Y]
- > mean(r0)
- [1] 0.001574978
- > mean(r1)
- [1] 0.9990033
- > mean(r0+r1)
- [1] 1.000578
- > mean(r0-r1)
- [1] -0.9974283
Exercise Estimating Standard Deviation
- # V[X+Y]=V[X]+V[Y]+2Cov(X,Y)
- > sd(r0)
- [1] 0.999807 (1)
- > sd(r1)
- [1] 0.997682 (1)
- > sd(r0+r1) # (Adding independent variables)
- [1] 1.411946 # √(1×1 + 1×1 )
- > sd(r0-r1)
- [1] 1.412932 # √(1×1 + 1×1 )
- > sd(r0+r0) # (Adding correlated variables)
- [1] 1.999614 # √(1×1 + 1×1 + 2×1)
- > sd(r0-r0)
- [1] 0 (0) # √(1×1 + 1×1 - 2×1)
Probability Process and Distributions (Univariate)
- Discrete Distribution
– Count data (colony count) – Food poisoning incidents
- Continuous Distribution
– Concentration of chemicals
Discrete distributions
- Constant
- Binomial
- Poisson
- Negative binomial
Distribution and Random Process: Binomial Distribution
- Toss of a coin
– Heads or tails
– Number of heads
- Throwing dice
– Probability of getting the number one
- Bernoulli trials
– n trials, success probability p
Binomial Distribution
- Expectation: np
- Variance: np(1−p)
- Standard deviation: √(np(1−p))
[Figure: Probability mass functions, N=10, p=0.1 and N=10, p=0.3]
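These formulas can be verified numerically with R's dbinom (a small sketch):

```r
n <- 10; p <- 0.3
k <- 0:n
pmf <- dbinom(k, n, p)      # binomial probability mass function

m <- sum(k * pmf)           # expectation = n*p     = 3
v <- sum((k - m)^2 * pmf)   # variance = n*p*(1-p)  = 2.1
```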
Poisson distribution
- A well-dispersed solution
- Limit of the binomial: n large, p small, np constant (= λ)
Poisson Distribution Probability mass functions
[Figure: Probability mass functions — Poisson λ=1 vs binomial n=10, p=0.1; Poisson λ=3 vs binomial n=10, p=0.3]
Other discrete distributions
- Geometric distribution
– The number of trials needed to get one success
– Parameter p
- Hypergeometric distribution
– k successes in n draws without replacement from a population of size N
- Negative binomial distribution
– The number of successes before n failures
Central Limit Theorem
- The sum of several variables from any distribution approaches a normal distribution
[Figure: Histogram of sample(1:6, 10000, replace = TRUE) — one die, uniform]
[Figure: Histogram of the sum of several dice — approximately normal]
Central limit theorem an example
[Figure: Histogram of one die — uniform]
[Figure: Histogram of the sum of two dice — triangular]
[Figure: Histogram of the sum of three dice]
[Figure: Histogram of the sum of four dice — approximately normal]
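The dice example can be reproduced in R (a sketch; the seed is an arbitrary choice):

```r
set.seed(1)  # arbitrary seed
one  <- sample(1:6, 10000, replace = TRUE)   # uniform
four <- sample(1:6, 10000, replace = TRUE) +
        sample(1:6, 10000, replace = TRUE) +
        sample(1:6, 10000, replace = TRUE) +
        sample(1:6, 10000, replace = TRUE)   # roughly bell-shaped

mean(four)   # close to 4 * 3.5 = 14
```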
Continuous Distribution: Normal (Gaussian) Distribution — distributional shape
[Figure: Probability density function dnorm(k, 0, 1)]
- Bell shape
- Symmetrical
- Converges to zero at ±infinity
Normal Distribution — cumulative distribution function
- 68% within ±1 σ
- 95% within ±2 σ
[Figure: pnorm(k, 0, 1) — standard normal CDF]
Log normal Distribution
- Skewed shape
- Mean and mode differ
- Heavy tail
[Figure: dlnorm(k, 0, 1) — log-normal density]
Exponential distribution and Gamma distribution
- Exponential distribution
– Time until one event occurs, for events occurring at a constant rate
- Gamma distribution
– Time until k events occur, for events occurring at a constant rate
Gamma Distribution
- Shape parameter k
- Scale parameter θ
[Figure: Histogram of rgamma(1e+05, 1) — k=1, θ=1]
[Figure: Histogram of rgamma(1e+05, 2) — k=2, θ=1]
Gamma Distribution
- Mean = kθ
[Figure: Histogram of rgamma(1e+05, 3) — k=3, θ=1]
[Figure: Histogram of rgamma(1e+05, 4) — k=4, θ=1]
Exponential Distribution
- Density proportional to exp(−λx)
- Monotonically decreasing
- Gamma distribution with shape k = 1
[Figure: Histogram of rexp(1e+05) — λ=1]
[Figure: Histogram of rexp(1e+05, 3) — λ=3]
Generating random number
- R
– rnorm(n, mean, s.d.)
– > hist(rnorm(1e5, 0, 1))
– > hist(rnorm(1e5, 0, 2))
- Excel 2010
– =NORM.INV(RAND(), mean, s.d.)
[Figure: Histogram of rnorm(1e+05, 0, 1)]
[Figure: Histogram of rnorm(1e+05, 0, 2)]
Central Limit theorem exercise
- > hist(runif(1e5))
[Figure: Histogram of runif(1e+05) — uniform]
- > hist(runif(1e5) + runif(1e5))
[Figure: Histogram of runif(1e+05) + runif(1e+05) — triangular]
Central Limit theorem exercise
[Figure: Histogram of the sum of three uniforms]
[Figure: Histogram of the sum of four uniforms — approximately normal]
Chi-square distribution
- Distribution of the sum of squares of standard normal random numbers
[Figure: Histograms of (n−1) × sample variance — sample size 10 (d.f. 9) and sample size 4 (d.f. 3)]
Exercise Plant Growth
- >data(PlantGrowth)
- > PlantGrowth
- weight group
- 1 4.17 ctrl
- 2 5.58 ctrl
- 3 5.18 ctrl
Exercise Plotting notched Box Plot
[Figure: Notched box plots of PlantGrowth dried weight by group (ctrl, trt1, trt2)]
boxplot(weight ~ group, data = PlantGrowth, main = "PlantGrowth data", ylab = "Dried weight of plants", col = "lightgray", notch = TRUE, varwidth = TRUE)
Level of Measurement
- Ratio Data (quantitative)
- Interval Data (quantitative)
- Ordinal Data (qualitative)
– Can be ranked
- Categorical Data (qualitative)
– Binary Data
Descriptive Statistic Bivariate Data
- Dependence index
– Correlation: Pearson's r, Kendall's τ, Spearman's ρ
- Cross-tabulation
– Binary × binary, binary × nominal, nominal × nominal
- Scatterplots
– Ordinal/interval × ordinal/interval
- Quantile-Quantile plots
– Ordinal × ordinal
[Figure: Scatter plot of c(rr0) vs c(rr1)]
Statistical inference
- Drawing conclusions from data based on a model/assumption
- Data are independently and identically distributed
– Random sampling from a population
– Randomized experiment
- Set a model or assumption
- Estimate
– Parameter (mean, proportion, variance)
– Interval: confidence, tolerance, prediction
- Test of hypothesis
Types of statistical inference
- Point estimation
– Obtain a single estimate
- Interval estimation
– Interval of plausible values
- Hypothesis testing
– Making a decision from data
- Checking model assumptions
Point Estimation
- Obtain the best single value of a population parameter from a sample
- Unbiasedness
- Minimum variance
- Parametric distribution
– Maximum likelihood estimator
– Moment estimator
Unbiasedness
- True parameter: θ0
- Estimator: θ̂
- Unbiased if E[θ̂] = θ0
- Example: estimators of the standard deviation
– Calculated from 5 normal samples (true s.d. 1)
– Left: mean 0.8 (biased); right: mean 1.0 (unbiased)
[Figure: Sampling distributions of two s.d. estimators]
Unbiased variance
Test | Result | Deviation | Deviation²
1 | 0.0394 | −0.00103 | 1.06778E-06
2 | 0.0412 | 0.000767 | 5.87778E-07
3 | 0.0420 | 0.001567 | 2.45444E-06
4 | 0.0398 | −0.00063 | 4.01111E-07
5 | 0.0407 | 0.000267 | 7.11111E-08
6 | 0.0395 | −0.00093 | 8.71111E-07
Average | 0.0404 | | Biased estimate: 9.08889E-07
Unbiased estimate = (sum of Deviation²)/(6 − 1)
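In R, var() returns the unbiased estimate; the biased (population) version must be computed by hand. A sketch with the slide's data:

```r
cd   <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
dev2 <- (cd - mean(cd))^2

biased   <- sum(dev2) / 6        # 9.08889E-07, divides by n
unbiased <- sum(dev2) / (6 - 1)  # divides by n-1; equals var(cd)
```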
Minimum variance
- Normal distribution mean
- 5 samples, true mean = 0
- The sample mean has a narrower sampling distribution than the sample median
[Figure: Histogram of apply(matrix(rnorm(5e+05), ncol = 5), 1, mean)]
[Figure: Histogram of apply(matrix(rnorm(5e+05), ncol = 5), 1, median)]
Goodness-of-fit Test
- Graphical method
– Quantile-Quantile plot
Exercise plotting Q-Q plot
- Fit to normal distribution
- > qqnorm(rnorm(1e2))
- > qqnorm(rlnorm(1e2))
[Figure: Normal Q-Q plot of rnorm(1e2) — approximately straight]
[Figure: Normal Q-Q plot of rlnorm(1e2) — curved, indicating non-normality]
Pearson’s Chi-square
- Yates's correction
Observed:   10   2   7   9
Expected:    8   4   9   7
Difference:  2  −2  −2   2
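The Pearson statistic for these counts can be computed directly (a sketch; the expected counts are taken from the slide's "Hypothesis" row):

```r
observed <- c(10, 2, 7, 9)
expected <- c(8, 4, 9, 7)

x2 <- sum((observed - expected)^2 / expected)   # about 2.52
```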
Other test of fit
- Based on the empirical distribution function
– Kolmogorov–Smirnov test
– Anderson–Darling test
– Lilliefors test
– Cramér–von Mises test
- For normality
– Jarque–Bera test — based on skewness and kurtosis
– Shapiro–Wilk test — statistic based on the variances and covariances of order statistics (ranks)
Interval Estimation Types of Interval
- Confidence
– Covers the true parameter with the stated probability
– Nominal vs. actual coverage probability
- Prediction
– Another sample falls within the prediction interval with the stated probability
- Tolerance
– N percent of the population falls within the interval, with the stated confidence level
Confidence Intervals Example of 95%
Table of t-values
d.f. | t0.95 | t0.975 | t0.995
1 | 6.3 | 12.7 | 63.7
2 | 2.9 | 4.3 | 10
3 | 2.4 | 3.2 | 5.8
4 | 2.13 | 2.8 | 4.6
5 | 2.02 | 2.6 | 4.0
6 | 1.94 | 2.4 | 3.7
∞ (z) | 1.645 | 1.960 | 2.576
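As a worked example, a 95% confidence interval for the cadmium mean from the earlier slide, using t0.975 with d.f. = 5 (a sketch):

```r
cd <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
n  <- length(cd)

se <- sd(cd) / sqrt(n)                              # standard error of the mean
ci <- mean(cd) + c(-1, 1) * qt(0.975, n - 1) * se   # qt(0.975, 5) = 2.57...
ci   # about (0.0393, 0.0415)
```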
Statistical inference
- Model, assumption, hypothesis
- Parametric
– The data-generating process is parameterized
- Non-parametric
– The data-generating process is not parameterized
- Asymptotic — commonly used
– Critical value based on a table
- Exact — computer-intensive
– Critical values based on the data
Statistical inference and error
- Type I error
– False positive
– α error
– Rejecting a null hypothesis that should have been accepted
- Type II error
– False negative
– β error
– Accepting a null hypothesis that should have been rejected
Statistical Test
- Test for location
- Test for dispersion
- Test for outlier
- P-value, error
- Detection Power
- Uniformly most powerful test
Ratio data
- Quantitative data
- Unlike interval data, it has a natural zero
- Multiplication and division are meaningful
- Age, length, etc.
Interval Data
- Quantitative data
- Addition and subtraction are meaningful
- Multiplication and division are not
- e.g., temperature in °C
Z-test
- The critical value does not depend on sample size
- Standard deviation: known
- Exercise (proficiency testing)
– Target: 700 μg/g
– Standard deviation: 25 μg/g
– Test whether a laboratory reporting 640 μg/g differs significantly from the target
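The z-test for this exercise is a direct calculation:

```r
target   <- 700   # assigned value, ug/g
sigma    <- 25    # known standard deviation, ug/g
reported <- 640   # laboratory result, ug/g

z <- (reported - target) / sigma   # -2.4
p <- 2 * pnorm(-abs(z))            # two-sided p-value, about 0.016
# p < 0.05, so the laboratory differs significantly from the target
```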
Test for normal interval data single set of samples
- One-sample t-test
- What is tested
– Whether the population mean differs from a given value
– Standard deviation: unknown (cf. z-test)
- Therefore the s.d. is estimated from the data, which adds error
- Exercise: does the mean of the data set (150, 120, 180, 130) differ significantly from 100?
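The exercise can be run with t.test (one-sample, two-sided):

```r
x   <- c(150, 120, 180, 130)
res <- t.test(x, mu = 100)

res$statistic   # t = 3.40 with df = 3
res$p.value     # about 0.042: significant at the 5% level
```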
Test for interval data 2 levels
- t-test (paired or unpaired)
- Variance of the two groups
– Equal: Student's t-test
– Unequal: Welch (Aspin–Welch) test
T-distribution
- Distribution of the sample mean divided by its estimated standard error
- Normality assumed
- Parameter: degrees of freedom
- If the standard deviation is known, or the degrees of freedom are infinite, it becomes the z-test
T-distribution and normal distribution
- Green = degree of freedom (d.f.) 2
- Blue = degree of freedom (d.f.) 10
- Red = degree of freedom (d.f.) +infinity
[Figure: t densities — d.f. 2 (green), d.f. 10 (blue), d.f. ∞ = normal (red)]
T-test assumptions
- Each of the two data sets follows a normal distribution
– especially important when the sample size is small
- The two data sets are sampled independently
- These assumptions are rarely met strictly, so care is needed in strict discussions
Robustness of inference
- How violations of the assumptions affect the test
- Outliers
- Distribution
– e.g., a mixture of distributions with different s.d.
- The t-test is somewhat robust to some violations
- Some suggest pre-testing whether the assumptions hold in the data; others argue against it
Case of unequal variance, equal sample size n=10, σ1/σ2=4
[Figure: Histogram of p-values — Student's t-test (non-uniform)]
[Figure: Histogram of p-values — Aspin–Welch test (approximately uniform)]
T test exercise Sample Data
- > ToothGrowth
- len supp dose
- 1 4.2 VC 0.5
- 2 11.5 VC 0.5
- 3 7.3 VC 0.5
- 4 5.8 VC 0.5
- 5 6.4 VC 0.5
- 6 10.0 VC 0.5
- 7 11.2 VC 0.5
T test exercise 2 Sample Data
- > d0 <- ToothGrowth$len[1:10]
– Vitamin C, dose 0.5 mg
- > d1 <- ToothGrowth$len[11:20]
– Vitamin C, dose 1.0 mg
T-test using R
- exercise 3-
- t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0,
paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
- >t.test(d0, d1)
- Welch Two Sample t-test
- data: d0 and d1
- t = -7.4634, df = 17.862, p-value = 6.811e-07
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- -11.265712 -6.314288
- sample estimates:
- mean of x mean of y
- 7.98 16.77
T-test exercise using random number
- alpha error-
- > t.test(rnorm(1e1),rnorm(1e1))
- t = -0.9106, df = 17.085, p-value = 0.3752
- t = 0.7685, df = 17.982, p-value = 0.4522
- t = 0.8858, df = 12.341, p-value = 0.3927
- t = -1.0532, df = 17.886, p-value = 0.3063
- t = 0.0694, df = 17.496, p-value = 0.9455
- t = -0.0784, df = 13.86, p-value = 0.9386
- t = -0.606, df = 17.528, p-value = 0.5523
Summary
- Under the null hypothesis the p-value is uniform: nominally and actually, P(p ≥ 0.5) = 50% and P(p < 0.5) = 50%
Test for Proportions
- > prop.test(10,20,p=0.5)
- 1-sample proportions test without continuity correction
- data: 10 out of 20, null probability 0.5
- X-squared = 0, df = 1, p-value = 1
- alternative hypothesis: true p is not equal to 0.5
- 95 percent confidence interval:
- 0.299298 0.700702
- sample estimates:
- p
- 0.5
One-sided(tailed) and two- sided(tailed) test
- One sided
– Null hypothesis μ=μ0 – Alternative hypothesis μ>μ0
- Two sided
– Null hypothesis μ=μ0 – Alternative hypothesis μ≠μ0
- One-sided p-value = ½ × two-sided p-value (for symmetric test statistics)
Balanced vs. Unbalanced
- Balanced
– Equal sample sizes (or numbers of experiments) assigned to each treatment
- Unbalanced
– Unequal sample sizes assigned to each treatment
Ordinal Data
- Several levels with order
- Excellent – good – fair
- Ranks in the race
- Interval data is an ordinal data
- But ordinal data is not always interval data
What can be done with Ordinal data
- Wilcoxon rank-sum test
– Also known as the Mann–Whitney U test
– Ordinal analogue of the unpaired t-test (interval data)
- Wilcoxon signed-rank test
– Ordinal analogue of the paired t-test (interval data)
Wilcoxon test
- exercise-
- > wilcox.test(d0,d1)
- Wilcoxon rank sum test with continuity correction
- data: d0 and d1
- W = 0, p-value = 0.0001796
- alternative hypothesis: true location shift is not equal to 0
- Warning message:
- In wilcox.test.default(d0, d1) : cannot compute exact p-value with ties
Wilcoxon test
- unequal variance-
[Figure: Histograms of p-values — equal variance: P(p<0.05) = 0.044; unequal variance: P(p<0.05) = 0.12]
Comparison of the t-test and Wilcoxon test
- Detection power: about 98% of the t-test's
- With extremely small sample sizes and large differences, use the t-test
Detection power of t-test
- s.d. = 1
- μ1 − μ2 = 1, 0.5, 0.1
Detection power of the Wilcoxon test
[Figure: Detection power vs sample size — normal data, difference = 1, s.d. = 1]
T-test Detection power
- > power.t.test(n=10,delta=1, sd=NULL, sig.level =0.05, power=.5)
- Two-sample t test power calculation
- n = 10
- delta = 1
- sd = 1.079782
- sig.level = 0.05
- power = 0.5
- alternative = two.sided
- NOTE: n is number in *each* group
Calculating sample size - t-test
- power.t.test(delta=1, sd=1, sig.level =0.05, power=.95)
- Two-sample t test power calculation
- n = 26.98922
- delta = 1
- sd = 1
- sig.level = 0.05
- power = 0.95
- alternative = two.sided
- NOTE: n is number in *each* group
t.test detection power (β-error)
- exercise-
- > t.test(rnorm(1e1, sd=1.079782), rnorm(1e1, sd=1.079782))
- t = -1.0004, df = 10.752, p-value = 0.3391 (P > 0.05)
- t = 0.1531, df = 17.229, p-value = 0.8801 (P > 0.05)
What is Nominal Data
- Categorically discrete
- Order of category is arbitrary
- Red, blue, green
- Origin(Region)
Analysis of Nominal Data
- Contingency table
- Binomial test
- Chi-square test
- Fisher's exact test
- McNemar test
Correlation
- analysis of bivariate data-
- r=0(left), 1.0(middle), 0.7(right)
[Figure: Scatter plots — rr0 vs rr1 (r = 0), rr0 vs rr0 (r = 1), (rr0 + rr1)/2 vs rr0 (r ≈ 0.7)]
Exercise Formaldehyde
- carb optden
- 1 0.1 0.086
- 2 0.3 0.269
- 3 0.5 0.446
- 4 0.6 0.538
- 5 0.7 0.626
- 6 0.9 0.782
Covariance
Data A      | 0.1      | 0.3      | 0.5      | 0.6     | 0.7     | 0.9     | mean 0.51667
Deviation A | −0.41667 | −0.21667 | −0.01667 | 0.08333 | 0.18333 | 0.38333 |
Data B      | 0.086    | 0.269    | 0.446    | 0.538   | 0.626   | 0.782   | mean 0.45783
Deviation B | −0.37183 | −0.18883 | −0.01183 | 0.08017 | 0.16817 | 0.32417 |
D.A²        | 0.17361  | 0.04694  | 0.00028  | 0.00694 | 0.03361 | 0.14694 | mean 0.06806
D.B²        | 0.13826  | 0.03566  | 0.00014  | 0.00642 | 0.02828 | 0.10508 | mean 0.05230
D.A × D.B   | 0.15493  | 0.04091  | 0.00019  | 0.00669 | 0.03084 | 0.12426 | mean 0.05964
Correlation = covariance / ((s.d. of A) × (s.d. of B))
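The same numbers can be reproduced from the built-in Formaldehyde data, using population-style moments as in the table; the n/(n−1) factors cancel, so the result equals cor():

```r
A <- Formaldehyde$carb
B <- Formaldehyde$optden

covAB <- mean((A - mean(A)) * (B - mean(B)))   # 0.05964, the table's D.A x D.B mean
r <- covAB / (sqrt(mean((A - mean(A))^2)) *
              sqrt(mean((B - mean(B))^2)))     # about 0.9996

r - cor(A, B)   # 0: identical to Pearson's correlation
```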
Exercise Formaldehyde
> plot(Formaldehyde$carb, Formaldehyde$optden) > cor(Formaldehyde$carb, Formaldehyde$optden)
[Figure: Scatter plot of Formaldehyde$carb vs Formaldehyde$optden]
Correlation - exercise
- > rr0 = rnorm(1e2, 0, 1)
- > rr1 = rnorm(1e2, 0, 1)
- > plot(rr0,rr1,xlim=c(-4,4),ylim=c(-4,4))
- > plot(rr0,rr0,xlim=c(-4,4),ylim=c(-4,4))
- > plot((rr0+rr1)/2,rr0,xlim=c(-4,4),ylim=c(-4,4))
Correlation test
- > cor.test(rr0, rr1)
- Pearson's product-moment correlation
- data: rr0 and rr1
- t = 0.2471, df = 98, p-value = 0.8054
- alternative hypothesis: true correlation is not equal to 0
- 95 percent confidence interval:
- -0.1723101 0.2202910
- sample estimates:
- cor
- 0.02495251
[Figure: Scatter plot of c(rr0) vs c(rr1)]
Regression and correlation Goodness-of-fit
- R
– Pearson’s correlation coefficient
- R2
– Coefficient of determination – proportion of variance in Y explained by a linear function of X
- 1-R2
– Fraction of variance unexplained
Correlation and outlier exercise
- > cor.test(c(rr0,5), c(rr1,5))
- Pearson's product-moment correlation
- data: c(rr0, 5) and c(rr1, 5)
- t = 2.0633, df = 99, p-value = 0.0417
- alternative hypothesis: true correlation is not equal to 0
- 95 percent confidence interval:
- 0.007928624 0.383282118
- sample estimates:
- cor
- 0.2030532
[Figure: Scatter plot of c(rr0, 5) vs c(rr1, 5) — a single outlier at (5, 5) inflates the correlation]
Regression analysis
- ordinal data to ordinal data-
- Spearman or Kendall correlation
– Spearman: convert the data to ranks and compute Pearson's correlation coefficient
– Kendall: based on concordant and discordant pairs
- Both can also be used for interval data
Rank correlation test
- exercise-
- > cor.test(c(rr0,5), c(rr1,-5), method="spearman")
- Spearman's rank correlation rho
- data: c(rr0, 5) and c(rr1, -5)
- S = 164262, p-value = 0.6666
- alternative hypothesis: true rho is not equal to 0
- sample estimates:
- rho
- 0.04331974
- > cor.test(c(rr0,5), c(rr1,-5), method="kendall")
Regression Analysis
- Relation between two variables
– Y ~ a + bX
- Scatter plot
- Univariate and multivariate
- Linear regression
- Non-linear regression
- R² and prediction interval
Y ~ a + bX
- Y
– Dependent variable, regressand, endogenous variable, response variable, measured variable
- X
– Independent variable, regressor, exogenous variable, explanatory variable, covariate, predictor variable
[Figure: Scatter plot of x vs y]
Regression Predicted and mean response
- Standard deviation (S.D.) → prediction interval
- Standard error of the mean (S.E.M.) → confidence interval for the mean response
[Figure: Regression line with outer prediction band and inner mean-response band]
Regression
- assumption-
- Error is mean zero and its variance is consistent
across observations
- If not, try transforming variable
– exp(X) – log(X) – Sqrt(x)
- Weighted least square
Regression – Robustness against outliers
- Ridge/Lasso regression
- Least absolute deviation regression
Regression by polynomials relation is not straight line
[Figure: Scatter plot of rr vs rr*rr with a quadratic (polynomial) fit]
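A polynomial fit in R uses lm with I(); an illustrative sketch on hypothetical simulated data (the true coefficients 1, −2, 3 are assumptions of the simulation):

```r
set.seed(1)  # arbitrary seed
rr <- rnorm(100)
y  <- 1 - 2 * rr + 3 * rr^2 + rnorm(100, sd = 0.5)  # hypothetical curved relation

fit <- lm(y ~ rr + I(rr^2))   # quadratic regression
coef(fit)                     # recovers roughly (1, -2, 3)
```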
Regression diagnosis
- Residual
- Standardized residual
– Residual divided by its estimated standard error
- Studentized residual
– Remove the observation, refit the regression, and divide its residual by the estimated error
- Durbin–Watson statistic
- Graphical diagnostics
Regression analysis fitting binary data to continuous function
- Logistic regression
– Logistic function
– logit p = log(p/(1−p)) = a + bx
- Probit regression
– Inverse of the cumulative distribution function of the normal distribution
Level:    1 2 3 4 5 6 7 8 9
Positive: 0 1 1 1 1 1
Logistic regression
- Logit p = log (p/(1-p)) = a + bx
Level:    1 2 3 4 5 6 7 8 9
Positive: 0 1 1 1 1 1
[Figure: Fitted logistic curve (res$fit) vs level]
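A sketch of fitting a logistic regression with glm, on hypothetical 0/1 data with some overlap (not the slide's table, to avoid perfect separation):

```r
level <- 1:9
pos   <- c(0, 0, 1, 0, 1, 1, 0, 1, 1)  # hypothetical 0/1 outcomes

fit <- glm(pos ~ level, family = binomial)  # logit p = a + b*level
coef(fit)[["level"]]                        # positive slope: p rises with level
```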
Other data
- Poisson regression
– Count data
- Multinomial logistic/probit regression
- Ordered logistic/Probit regression
Design of Experiment Principle by R.A. Fisher
- Comparison
- Randomization
- Replication
- Blocking
- Orthogonality
- Factorial experiment
Completely Randomised Design
Spray A Spray B Spray C 10 11 7 17 1 20 21 7 14 11 2 14 16 3
- data(InsectSprays)
Analysis of several sets of data Analysis of Variance
- Compares the variance between group means with the variance within groups
- Extension of the two-sample test to multiple samples
- One-way ANOVA
- Two-way ANOVA
- Repeated-measures ANOVA
– Measurements on the same block or subject
- 2 or more levels per factor
One-way ANOVA
- 1 interval and 1 nominal variable
- Group 1: 1.1 1.3 1.4
- Group 2: 1.3 1.7 1.5
- Group 3: 1.2 1.1 1.0
T-test and ANOVA
- One-way ANOVA and the unpaired t-test
- Compare the results
- Generate 10 normal random numbers
- > r0 <- rnorm(10, 0, 1)
– -0.73 0.61 1.03 1.81 -0.05 -1.21 1.04 0.65 -0.99 0.18
- >r1 <- rnorm(10, 1, 1)
– 1.94 0.82 0.23 2.07 -1.04 0.18 -1.80 1.35 -0.06 -0.61
Output for (Student’s) T-test
- > t.test(r1, r0, var.equal=TRUE)
- Two Sample t-test
- data: r1 and r0
- t = 2.827, df = 18, p-value = 0.01117
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- 0.2579993 1.7510822
- sample estimates:
- mean of x mean of y
- 1.2375198 0.2329791
Output for anova
- > group <- c(rep(0,10), rep(1,10))
- > rs <- c(r0, r1)
- > summary(aov(rs ~ factor(group)))
ANOVA result
- Df Sum Sq Mean Sq F value p-value
- group 1 5.046 5.046 7.992 0.0112
- Residuals 18 11.364 0.631
[Table: ANOVA decomposition — within-group deviations s1j − m1 and s2j − m2 around the group means m1 and m2, and between-group deviations m1 − m12 and m2 − m12 around the grand mean m12]
F-value
- Distribution of the ratio of the sample variance of dataset A to the sample variance of dataset B
F-table
alpha = 0.1
df2 \ df1 | 1 | 2 | 5 | 20
1 | 40 | 50 | 57 | 62
2 | 8.5 | 9.0 | 9.3 | 9.4
5 | 4.1 | 4.3 | 4.1 | 3.2
20 | 3.0 | 2.6 | 2.2 | 1.8

alpha = 0.05
df2 \ df1 | 1 | 2 | 5 | 20
1 | 161 | 200 | 230 | 248
2 | 19 | 19 | 19 | 19
5 | 6.6 | 5.8 | 5.1 | 4.6
20 | 4.4 | 3.5 | 2.7 | 2.1
ANOVA with unequal variances
- > oneway.test(rs ~ group)
- One-way analysis of means (not assuming equal variances)
- data: rs and group
- F = 7.9918, num df = 1.000, denom df = 13.923, p-value = 0.01351
t.test by Aspin-Welch
- > t.test(r1, r0)
- Welch Two Sample t-test
- data: r1 and r0
- t = 2.827, df = 13.923, p-value = 0.01351 # slightly larger p-value
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- 0.2420163 1.7670653
- sample estimates:
- mean of x mean of y
- 1.2375198 0.2329791
Exercise Plant Growth
- >data(PlantGrowth)
- > PlantGrowth
- weight group
- 1 4.17 ctrl
- 2 5.58 ctrl
- 3 5.18 ctrl
Exercise ANOVA
- > summary(aov(weight ~ group, data = PlantGrowth))
- Analysis of Variance Table
- Response: weight
- Df Sum Sq Mean Sq F value Pr(>F)
- group 2 3.7663 1.8832 4.8461 0.01591 *
- Residuals 27 10.4921 0.3886
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple Comparison
- Tukey’s range test
– Also as Tukey’s honest significance test – Compares all possible pair of means
- Scheffé's test
Exercise Tukey HSD
- > fit <- aov(weight ~ group, data = PlantGrowth)
- > TukeyHSD(fit)
- Tukey multiple comparisons of means
- 95% family-wise confidence level
- Fit: aov(formula = weight ~ group, data = PlantGrowth)
- $group
- diff lwr upr p adj
- trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
- trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
- trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
- For comparisons against a control, Dunnett's test is better, but it requires installing the multcomp package (internet connection needed)
Analysis of variance
- assumptions-
- Normality of the data
– Deviations are normally distributed
- Homoscedasticity
– The variances of the groups are equal
- All data are independent
Anova for ordinal data
- One-way
– Kruskal-Wallis – Van der Waerden test
- One-way repeated measure
– Friedman test
Exercise Kruskal-wallis
- > kruskal.test(weight ~ group, data = PlantGrowth)
- Kruskal-Wallis rank sum test
- data: weight by group
- Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Comparison of wilcoxon and Kruskal-Wallis tests
- > wilcox.test(weight ~ group, data = PlantGrowth, subset = group != 'trt1', exact = FALSE, correct = FALSE)
- W = 25, p-value = 0.05878
- > kruskal.test(weight ~ group, data = PlantGrowth, subset = group != 'trt1')
- Kruskal-Wallis chi-squared = 3.5714, df = 1, p-value = 0.05878
One way repeated measures
        Age 1  Age 2  Age 3  Age 4
Tree 1    30     58     87    115
Tree 2    33     69    111    156
Tree 3    30     51     75    108
- data(Orange)
Subset data
- Compare circumference at age 118 and age 484
- > Orange.1 <- subset(Orange, age==118)
- > Orange.2 <- subset(Orange, age==484)
Paired t-test
- > t.test(Orange.1$circumference, Orange.2$circumference, paired=TRUE)
- Paired t-test
- data: Orange.1$circumference and Orange.2$circumference
- t = -8.6768, df = 4, p-value = 0.000971
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- -35.37558 -18.22442
- sample estimates:
- mean of the differences
- -26.8
ANOVA
- Repeated measures-
- > summary(aov(circumference ~ age + Error(Tree/age), data=rbind(Orange.1, Orange.2)))
- Error: Tree
- Df Sum Sq Mean Sq F value Pr(>F)
- Residuals 4 179.4 44.85
- Error: Tree:age
- Df Sum Sq Mean Sq F value Pr(>F)
- age 1 1795.6 1795.6 75.29 0.000971 ***
- Residuals 4 95.4 23.8
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Wilcox signed rank
- > wilcox.test(Orange.1$circumference, Orange.2$circumference, paired=TRUE)
- Wilcoxon signed rank test
- data: Orange.1$circumference and Orange.2$circumference
- V = 0, p-value = 0.0625
- alternative hypothesis: true location shift is not equal to 0
Friedman test
- > friedman.test(circumference ~ age | Tree, data=rbind(Orange.1, Orange.2))
- Friedman rank sum test
- data: circumference and age and Tree
- Friedman chi-squared = 5, df = 1, p-value = 0.02535
Factorial Design
- two-way anova-
              Low dose    Middle dose   High dose
Vitamin C     4.2/11.5    16.5/16.5     23.6/18.5
Orange juice  15.2/21.5   19.7/23.3     25.5/26.4
- > ToothGrowth
ANOVA on toothgrowth (2-way)
- > summary(aov(len ~ supp*dose, data=ToothGrowth))
- Df Sum Sq Mean Sq F value Pr(>F)
- supp 1 205.3 205.3 12.317 0.000894 ***
- dose 1 2224.3 2224.3 133.415 < 2e-16 ***
- supp:dose 1 88.9 88.9 5.333 0.024631 *
- Residuals 56 933.6 16.7
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Test for heteroscedasticity
- F-test
- Cochran’s C
- Bartlett’s test
- Levene’s test
- Brown-Forsythe test
- Goldfeld-Quandt test
- Breusch–Pagan test
[Figure: Histograms of rnorm(10000, 0, 2) and rnorm(10000, 0, 1) — unequal variances]
Cochran's C
- Data 1=(1,2,1,3,1,2)
- Data 2=(1,3,5,2,5,2)
- Data 3=(1,4,1,5,1,2)
- Detects high within-group variation
- H0: all variance equal
- H1: at least one variance is not equal, or an outlier is included
Cochran’s C
- > library(outliers)
- > cochran.test(rs ~ group)
- Cochran test for outlying variance
- data: rs ~ group
- C = 0.7706, df = 10, k = 2, p-value = 0.0856
- alternative hypothesis: Group 0 has outlying variance
- sample estimates:
- 0 1
- 0.9729731 0.2896914
Extension of ANOVA
- ANCOVA
– Analysis of Covariance
- MANOVA
– Multivariate ANOVA
Mechanism of missing data
- Missing Completely at Random
- Missing at Random
- Impossible to Observe
- Censored data (Left censored, Right censored)
Statistical packages
- Open source
– Free redistribution, no royalty – Source code open
- Freeware
– Free distribution, no royalty
- Proprietary
– Requires royalty
Excel 2010
- One way ANOVA
- Two-way ANOVA
- Two-way with replication
- Correlation
- Covariance
- Basic Statistics
Excel - 2
- Generating random number
- Rank and percentile
- regression
- sampling
- Paired t-test
- t-test (equal variance)
- t-test (unequal variance)
- z-test
Excel - 3
- Summary statistics
- F test
- Fourier analysis
- Histogram
- Moving average
Paired t
Two way ANOVA
Exporting Data from Excel
- Prepare the data
- 1st line = title (column headers)
- 2nd and later lines = data
Exporting Data from Excel
- Save the data in CSV format
Exporting data from R to Excel
- > write.csv(data, "filename")
Using real data Read the csv file
- > sp = read.csv("filename")
- To see the file path, drag and drop the file onto the console window:
- > load("C:\\Users\\yoshiEPSON\\Documents\\ICCAGSA2013.docx")
- Change "load" to "read.csv":
- > read.csv("C:\\Users\\yoshiEPSON\\Documents\\ICCAGSA2013.docx")