
FAO training course, June 11, 2013, Yoshiki Tsukakoshi Ph.D.

Data analysis and Basic Statistics. FAO training course, June 11, 2013. Yoshiki Tsukakoshi, Ph.D. (statistical science), National Food Research Institute, National Agriculture and Food Research Organization. Email: Yoshiki.tsukakoshi@gmail.com


  1. Central limit theorem: an example
     [Figure: four histograms (y axis: Frequency) of sample(1:6, 10000, replace = TRUE) and of sums of two or more such draws]
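
The dice example can be reproduced numerically. The following is a minimal sketch, assuming the four panels show totals of one to four dice; the seed and the names `dice_sum` and `clt_summary` are illustrative and not part of the slides.

```r
# Sums of k fair dice, n rolls each (n and k = 1..4 are illustrative choices).
set.seed(1)
n <- 10000
dice_sum <- function(k) {
  rolls <- matrix(sample(1:6, n * k, replace = TRUE), ncol = k)
  rowSums(rolls)
}
sums <- lapply(1:4, dice_sum)

# One die has mean 3.5 and variance 35/12, so the sum of k independent
# dice has mean 3.5 * k and variance 35 * k / 12.
clt_summary <- data.frame(
  k           = 1:4,
  sample_mean = sapply(sums, mean),
  theory_mean = 3.5 * (1:4),
  sample_var  = sapply(sums, var),
  theory_var  = 35 * (1:4) / 12
)
print(clt_summary)

# Draw the four histograms only in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (k in 1:4) hist(sums[[k]], main = paste(k, "dice"), xlab = "total")
  par(op)
}
```

As k grows, the histogram of the totals moves from flat (one die) through triangular (two dice) towards a bell shape.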

  2. Continuous Distribution: Normal (Gaussian) Distribution, distributional shape
     [Figure: probability density dnorm(k, 0, 1) plotted for k from -3 to 3]
     • Bell shape
     • Symmetrical
     • Converges to zero as k goes to ±infinity

  3. Normal Distribution: cumulative distribution function
     [Figure: pnorm(k, 0, 1) plotted for k from -3 to 3]
     • About 68% within ±1σ
     • About 95% within ±2σ
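
The coverage figures follow directly from the cumulative distribution function. A small sketch (the helper name `coverage` is illustrative):

```r
# Probability that a standard normal value lies within +/- k standard deviations.
coverage <- function(k) pnorm(k) - pnorm(-k)

within_sigma <- sapply(1:3, coverage)
names(within_sigma) <- paste0("within_", 1:3, "_sd")
print(round(within_sigma, 4))
```

This gives roughly 0.68 for ±1σ, 0.95 for ±2σ and 0.997 for ±3σ.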

  4. Log-normal Distribution
     [Figure: density dlnorm(k, 0, 1) plotted for k from 0 to 10]
     • Skewed shape
     • Mean and mode differ
     • Heavy tail

  5. Exponential distribution and Gamma distribution
     • Exponential distribution
       – Time until an event occurs, when events are expected to occur at a constant rate
     • Gamma distribution
       – Time until k events occur, when each is expected to occur at the same constant rate

  6. Gamma Distribution (γ)
     • Shape parameter k
     • Scale parameter θ
     [Figure: histograms of rgamma(1e+05, 1) (k = 1, θ = 1) and rgamma(1e+05, 2) (k = 2, θ = 1)]

  7. Gamma Distribution
     • Mean = k · θ
     [Figure: histograms of rgamma(1e+05, 3) (k = 3, θ = 1) and rgamma(1e+05, 4) (k = 4, θ = 1)]

  8. Exponential Distribution
     [Figure: histograms of rexp(1e+05) and rexp(1e+05, 3)]
     • Density proportional to exp(-λx)
     • Monotonically decreasing
     • Gamma distribution with shape = 1
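
The relation between the two distributions can be checked by simulation. This is a sketch under assumed values (k = 3, θ = 1, and a fixed seed), not part of the original exercise:

```r
set.seed(2)
k <- 3        # number of events (gamma shape), illustrative
theta <- 1    # mean waiting time per event (gamma scale), illustrative
n <- 1e5

# Waiting time until k events = sum of k exponential waiting times.
waits <- rowSums(matrix(rexp(n * k, rate = 1 / theta), ncol = k))
g <- rgamma(n, shape = k, scale = theta)
print(c(mean_sum_of_exp = mean(waits), mean_gamma = mean(g), k_theta = k * theta))

# The exponential density equals the gamma density with shape 1.
x <- seq(0, 10, by = 0.5)
max_diff <- max(abs(dexp(x, rate = 3) - dgamma(x, shape = 1, rate = 3)))
print(max_diff)
```

Both simulated means should sit close to k · θ, and `max_diff` should be zero up to floating-point error.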

  9. Generating random numbers
     • R
       – rnorm(n, mean, s.d.)
       – > hist(rnorm(1e5, 0, 1))
       – > hist(rnorm(1e5, 0, 2))
     • Excel 2010
       – Norm.dist()
     [Figure: histograms of rnorm(1e+05, 0, 1) and rnorm(1e+05, 0, 2)]

  10. Central Limit theorem exercise
      • > hist(runif(1e5))
      • > hist(runif(1e5) + runif(1e5))
      [Figure: histograms of runif(1e+05) (0.0 to 1.0) and of runif(1e+05) + runif(1e+05) (0.0 to 2.0)]

  11. Central Limit theorem exercise
      [Figure: histograms of the sum of three runif(1e+05) draws (0.0 to 3.0) and of the sum of four runif(1e+05) draws (0 to 4)]

  12. Chi-square distribution
      [Figure: histogram of apply(r0, 1, var) * 3 for sample size = 4 (degrees of freedom = 3), and the corresponding histogram for sample size = 10 (degrees of freedom = 9)]
      • Distribution of a sum of squared normal random numbers
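
The slide does not show how `r0` was built. The sketch below assumes it is a matrix of standard normal draws with one sample of size 4 per row, so that (n - 1) times the sample variance follows a chi-square distribution with n - 1 degrees of freedom.

```r
set.seed(3)
n_rep <- 1e4   # number of simulated samples (assumed)
n <- 4         # sample size, as on the slide

# Assumed construction of r0: each row is one sample of n standard normals.
r0 <- matrix(rnorm(n_rep * n), ncol = n)
stat <- apply(r0, 1, var) * (n - 1)

# A chi-square variable with 3 degrees of freedom has mean 3 and variance 6.
print(c(mean = mean(stat), var = var(stat)))

# Compare an empirical quantile with the theoretical one.
print(c(empirical = unname(quantile(stat, 0.95)), theory = qchisq(0.95, df = n - 1)))

if (interactive()) hist(stat, breaks = 50, main = "apply(r0, 1, var) * 3")
```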

  13. Exercise: Plant Growth
      • > data(PlantGrowth)
      • > PlantGrowth
          weight group
        1   4.17  ctrl
        2   5.58  ctrl
        3   5.18  ctrl

  14. Exercise: Plotting a notched Box Plot
      • > boxplot(weight ~ group, data = PlantGrowth, main = "PlantGrowth data", ylab = "Dried weight of plants", col = "lightgray", notch = TRUE, varwidth = TRUE)
      [Figure: notched box plots of weight (3.5 to 6.0) for groups ctrl, trt1 and trt2]

  15. Level of Measurement
      • Ratio Data (quantitative)
      • Interval Data (quantitative)
      • Ordinal Data (qualitative)
        – Can be ranked
      • Categorical Data (qualitative)
        – Binary Data

  16. Descriptive Statistics: Bivariate Data
      • Dependence index
        – Correlation: Pearson's chi-square, Kendall's τ, Spearman's ρ
      • Cross-tabulation
        – Binary and binary
        – Binary and nominal
        – Nominal and nominal
      • Scatterplots
        – Ordinal/Interval and ordinal/Interval
      • Quantile-Quantile plots
        – Ordinal and ordinal
      [Figure: scatterplot of c(rr1) against c(rr0), both axes from -4 to 4]

  17. Statistical inference
      • Drawing conclusions from data based on a model/assumption
      • Data are independent and identically distributed
        – Random sampling from population
        – Randomized experiment
      • Set Model or Assumption
      • Estimate
        – Parameter (mean, proportion, variance)
      • Interval
        – Confidence, Tolerance, Prediction
      • Test of Hypothesis

  18. Types of statistical inference • Point Estimate – Obtain single estimate • Estimate Interval – Interval of possible values • Hypothesis testing – Making decision from data • Check model assumption

  19. Point Estimation
      • Obtain the best single value of a population parameter from a subset
      • Unbiasedness
      • Minimum variance
      • Parametric Distribution
        – Maximum Likelihood Estimator
        – Moment Estimator

  20. Unbiasedness
      • True parameter: θ₀
      • Estimate: θ̂
      • E[θ̂] = θ₀
      [Figure: two histograms of aa (y axis: Frequency)]
      • Estimator of dispersion
        – Variance calculated from 5 normal samples
        – Left (dividing by n): mean = 0.8; right (dividing by n - 1): mean = 1.0

  21. Unbiased variance
      No.      Test result  Deviation  Squared deviation
      1        0.0394       -0.00103   1.06778E-06
      2        0.0412        0.000767  5.87778E-07
      3        0.0420        0.001567  2.45444E-06
      4        0.0398       -0.00063   4.01111E-07
      5        0.0407        0.000267  7.11111E-08
      6        0.0395       -0.00093   8.71111E-07
      Average  0.0404
      • Biased estimate = (sum of squared deviations)/6 = 9.08889E-07
      • Unbiased estimate = (sum of squared deviations)/(6-1)
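
The two estimates can be recomputed from the six test results in the table. A short sketch (variable names are illustrative):

```r
# The six test results from the table.
x <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
n <- length(x)

ss <- sum((x - mean(x))^2)   # sum of squared deviations from the mean
biased <- ss / n             # divides by n
unbiased <- ss / (n - 1)     # divides by n - 1, same as var(x)

print(c(biased = biased, unbiased = unbiased, var_builtin = var(x)))
```

R's built-in `var()` uses the n - 1 divisor, so it matches the unbiased estimate.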

  22. Minimum variance
      • Normal distribution mean
      • 5 samples
      • True mean = 0
      [Figure: histograms of apply(matrix(rnorm(5e+05), ncol = 5), 1, mean) and apply(matrix(rnorm(5e+05), ncol = 5), 1, median), both on the range -2 to 2]
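
The comparison in the two histograms can be summarised by the variances of the two estimators. A sketch using the same calls as the figure titles, with an assumed seed and a shared matrix so that both estimators see the same samples:

```r
set.seed(4)
# 1e5 samples of size 5 from a standard normal distribution (true mean 0).
m <- matrix(rnorm(5e+05), ncol = 5)

means <- apply(m, 1, mean)
medians <- apply(m, 1, median)

# Both estimators are centred on 0, but the mean has the smaller variance.
print(c(var_mean = var(means), var_median = var(medians)))
```

The variance of the sample mean should be close to 1/5, and the variance of the median noticeably larger.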

  23. Goodness-of-fit Test • Graphical method – Quantile-Quantile plot

  24. Exercise: plotting a Q-Q plot
      • Fit to normal distribution
      • > qqnorm(rnorm(1e2))
      • > qqnorm(rlnorm(1e2))
      [Figure: two Normal Q-Q Plots, Sample Quantiles against Theoretical Quantiles (-2 to 2)]

  25. Pearson's Chi-square
      Observed     10   2   7   9
      Hypothesis    8   4   9   7
      Diff          2  -2  -2   2
      • Yates's correction
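
The statistic for the table can be computed by hand. The sketch below assumes the hypothesised counts are fully specified in advance (no parameter estimated from the data), which gives 4 - 1 = 3 degrees of freedom; no continuity correction is applied.

```r
observed <- c(10, 2, 7, 9)
expected <- c(8, 4, 9, 7)    # counts under the hypothesis

# Pearson's chi-square: sum of (observed - expected)^2 / expected.
x2 <- sum((observed - expected)^2 / expected)

# Assumption: expected counts fixed in advance, so df = number of cells - 1.
df <- length(observed) - 1
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

print(c(x2 = x2, df = df, p_value = p_value))
```

One expected count is below 5, so the chi-square approximation is rough for this table; `chisq.test(observed, p = expected / sum(expected))` returns the same statistic with a warning to that effect.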

  26. Other tests of fit
      • Based on empirical distribution function
        – Kolmogorov-Smirnov test
        – Anderson-Darling test
        – Lilliefors test
        – Cramer-von Mises test
      • For normality
        – Jarque-Bera test: based on skewness and kurtosis
        – Shapiro-Wilk test: statistic based on variance and covariance of rank

  27. Interval Estimation: Types of Interval
      • Confidence
        – Covers the true parameter with probability alpha (alpha here denotes the coverage level, e.g. 0.95)
        – Nominal and actual coverage probability
      • Prediction
        – Another sample falls within the prediction interval with probability alpha
      • Tolerance
        – N percent of the data fall within the interval with confidence level alpha

  28. Confidence Intervals Example of 95%

  29. Table of t-values
      d.f.    t0.95   t0.975  t0.995
      1       6.3     12.7    63.7
      2       2.9     4.3     10
      3       2.4     3.2     5.8
      4       2.13    2.8     4.6
      5       2.02    2.6     4.0
      6       1.94    2.4     3.7
      Z (∞)   1.645   1.960   2.576
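
The table can be regenerated with `qt()`. A minimal sketch; the layout of `t_table` is an illustrative choice:

```r
df <- c(1:6, Inf)                  # Inf gives the normal (z) quantiles
probs <- c(0.95, 0.975, 0.995)

# One column per probability, one row per degrees of freedom.
t_table <- sapply(probs, function(p) qt(p, df))
dimnames(t_table) <- list(as.character(df), paste0("t", probs))

print(round(t_table, 3))
```

The last row reproduces the standard normal quantiles, which the t quantiles approach as the degrees of freedom grow.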

  30. Statistical inference -Model, Assumption, Hypothesis-
      • Parametric
        – Data generation process is parameterized
      • Non-parametric
        – Data generation process is not parameterized
      • Asymptotic
        – Commonly used
        – Critical value based on table
      • Exact
        – Computer intensive
        – Critical values based on data

  31. Statistical inference and error • Type I error – False Positive – α error – Rejecting a hypothesis that should have been accepted • Type Ⅱ error – False negative – β error – Accepting a hypothesis that should have been rejected

  32. Statistical Test • Test for location • Test for dispersion • Test for outlier • P-value, error • Detection Power • Uniformly most powerful test

  33. Ratio data
      • Quantitative data
      • Unlike interval data, it has a natural zero
      • Multiplication and division are meaningful
      • Age, length, etc.

  34. Interval Data
      • Quantitative data
      • Values can be added or subtracted
      • Multiplication and division are not meaningful
      • E.g. temperature

  35. Z-test
      • Critical value does not depend on sample size
      • Standard deviation: known
      • Exercise
        – Proficiency testing
        – Target: 700 μg/g
        – Standard deviation: 25 μg/g
        – Test whether a laboratory reporting 640 μg/g differs significantly from the target
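
The exercise reduces to one z-score and a normal tail probability. A sketch, assuming a two-sided test of a single reported value:

```r
target <- 700     # assigned value, in micrograms per gram
sigma <- 25       # known standard deviation, in micrograms per gram
reported <- 640   # value reported by the laboratory

z <- (reported - target) / sigma
p_two_sided <- 2 * pnorm(-abs(z))   # two-sided p-value (assumed alternative)

print(c(z = z, p_two_sided = p_two_sided))
```

Here z = -2.4 and the two-sided p-value is about 0.016, so the result differs from the target at the 5% level.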

  36. Test for normal interval data: single set of samples
      • One sample t-test
      • What is tested
        – Whether the population mean differs from a hypothesized value (0 by default)
        – Standard deviation: unknown
      • Cf. z-test
      • Therefore, the s.d. is estimated from the data
      • The estimate includes error
      • Example: test whether the mean of the data set (150, 120, 180, 130) differs significantly from 100
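
The example can be run with `t.test()` and checked against the formula for the t statistic. A sketch (the name `t_manual` is illustrative):

```r
x <- c(150, 120, 180, 130)

# One-sample t-test against the hypothesized mean 100.
res <- t.test(x, mu = 100)
print(res)

# The same statistic by hand: (mean - mu0) / (s / sqrt(n)).
t_manual <- (mean(x) - 100) / (sd(x) / sqrt(length(x)))
print(t_manual)
```

With 3 degrees of freedom the two-sided p-value falls just below 0.05.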

  37. Test for interval data: 2 levels
      • T-test (paired or unpaired)
      • Variance of the two groups
        – Equal: Student's t-test
        – Unequal: Welch-Aspin test

  38. T-distribution
      • Distribution of the sample mean divided by its estimated standard error (sample standard deviation / √n)
      • Normality assumed
      • Degrees of freedom
      • Probability
      • If the standard deviation is known, or the degrees of freedom are infinite, it becomes the z-test

  39. T-distribution and normal distribution
      [Figure: density curves plotted over seq(-10, 10, by = 0.01), including dnorm]
      • Green = degrees of freedom (d.f.) 2
      • Blue = degrees of freedom (d.f.) 10
      • Red = degrees of freedom (d.f.) +infinity

  40. T-test assumptions
      • Each of the two data sets follows a normal distribution
        – This matters especially when the sample size is small
      • The two data sets are sampled independently
      • These assumptions are rarely met exactly, so care is needed when drawing strict conclusions

  41. Robustness of inference
      • How violations of the assumptions affect the test
      • Outliers
        – Robustness against outliers
      • Distribution
        – Mixture distribution of different s.d.
      • The t-test is somewhat robust to some violations
      • Some recommend testing whether the assumptions hold in the data before applying the test, but this practice is debated

  42. Case of unequal variance, equal sample size
      • n = 10, σ1/σ2 = 4
      [Figure: histograms of the p-values p (0.0 to 1.0) for the Student test and the Aspin-Welch test]
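
The slide does not show the code behind the two histograms. One plausible reconstruction, assuming both groups have the same mean (so every rejection is a false positive), σ1 = 4, σ2 = 1, and 5000 repetitions:

```r
set.seed(5)
n_sim <- 5000   # number of simulated experiments (assumed)

# Each column holds the Student and Welch p-values for one simulated data set.
p_vals <- replicate(n_sim, {
  x <- rnorm(10, sd = 4)
  y <- rnorm(10, sd = 1)
  c(student = t.test(x, y, var.equal = TRUE)$p.value,
    welch   = t.test(x, y, var.equal = FALSE)$p.value)
})

# Actual rejection rate at the nominal 5% level.
print(rowMeans(p_vals < 0.05))

if (interactive()) {
  op <- par(mfrow = c(1, 2))
  hist(p_vals["student", ], main = "Student", xlab = "p")
  hist(p_vals["welch", ], main = "Aspin-Welch", xlab = "p")
  par(op)
}
```

With equal sample sizes the two tests share the same t statistic and differ only in degrees of freedom, so the Welch p-value is never smaller than the Student p-value.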

  43. T-test exercise: Sample Data
      • > ToothGrowth
           len supp dose
        1  4.2   VC  0.5
        2 11.5   VC  0.5
        3  7.3   VC  0.5
        4  5.8   VC  0.5
        5  6.4   VC  0.5
        6 10.0   VC  0.5
        7 11.2   VC  0.5

  44. T-test exercise 2: Sample Data
      • d0 <- ToothGrowth$len[1:10]
        – Vitamin C dose 0.5 mg
      • d1 <- ToothGrowth$len[11:20]
        – Vitamin C dose 1.0 mg

  45. T-test using R -exercise 3-
      • t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
      • > t.test(d0, d1)
          Welch Two Sample t-test
          data: d0 and d1
          t = -7.4634, df = 17.862, p-value = 6.811e-07
          alternative hypothesis: true difference in means is not equal to 0
          95 percent confidence interval:
           -11.265712 -6.314288
          sample estimates:
          mean of x mean of y
               7.98     16.77

  46. T-test exercise using random numbers -alpha error-
      • > t.test(rnorm(1e1), rnorm(1e1))
          t = -0.9106, df = 17.085, p-value = 0.3752
          t = 0.7685, df = 17.982, p-value = 0.4522
          t = 0.8858, df = 12.341, p-value = 0.3927
          t = -1.0532, df = 17.886, p-value = 0.3063
          t = 0.0694, df = 17.496, p-value = 0.9455
          t = -0.0784, df = 13.86, p-value = 0.9386
          t = -0.606, df = 17.528, p-value = 0.5523

  47. Summary
                        Actual   Nominal
      P-value >= 0.5             50%
      P-value < 0.5              50%
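
The "Actual" column can be filled in by repeating the random-number exercise many times. A sketch with an assumed seed and 5000 repetitions:

```r
set.seed(6)
# Both samples come from the same distribution, so the null hypothesis is true.
p <- replicate(5000, t.test(rnorm(1e1), rnorm(1e1))$p.value)

# Under the null hypothesis the p-value is approximately uniform on (0, 1).
print(c(below_0.5 = mean(p < 0.5), below_0.05 = mean(p < 0.05)))
```

The share of p-values below 0.5 should be near 50%, and the share below 0.05 (the alpha error rate) near 5%.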

  48. Test for Proportions
      • > prop.test(10, 20, p = 0.5)
          1-sample proportions test without continuity correction
          data: 10 out of 20, null probability 0.5
          X-squared = 0, df = 1, p-value = 1
          alternative hypothesis: true p is not equal to 0.5
          95 percent confidence interval:
           0.299298 0.700702
          sample estimates:
            p
          0.5

  49. One-sided (tailed) and two-sided (tailed) test
      • One-sided
        – Null hypothesis μ = μ0
        – Alternative hypothesis μ > μ0
      • Two-sided
        – Null hypothesis μ = μ0
        – Alternative hypothesis μ ≠ μ0
      • One-sided p-value = ½ two-sided p-value (for a symmetric test statistic, when the observed effect lies in the direction of the alternative)
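
The halving rule can be seen with the `alternative` argument of `t.test()`. The sketch below reuses the ToothGrowth subsets `d0` and `d1` from the t-test exercise; the choice of data is illustrative.

```r
d0 <- ToothGrowth$len[1:10]    # vitamin C dose 0.5 mg
d1 <- ToothGrowth$len[11:20]   # vitamin C dose 1.0 mg

p_two     <- t.test(d0, d1)$p.value
p_less    <- t.test(d0, d1, alternative = "less")$p.value     # H1: mean(d0) < mean(d1)
p_greater <- t.test(d0, d1, alternative = "greater")$p.value  # H1: mean(d0) > mean(d1)

print(c(two_sided = p_two, less = p_less, greater = p_greater))
```

The observed difference is negative, so the "less" p-value is half the two-sided one, while the "greater" p-value is one minus that half.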

  50. Balanced v.s. Unbalanced • Balanced – Equal sample or experiment assigned to each treatment • Unbalanced – Unequal sample or experiment assigned to each treatment

  51. Ordinal Data
      • Several levels with an order
      • Excellent, good, fair
      • Ranks in a race
      • Interval data are also ordinal data
      • But ordinal data are not always interval data

  52. What can be done with Ordinal data
      • Wilcoxon rank sum test
        – Also known as Mann-Whitney's U test
        – Counterpart for interval data: unpaired t-test
      • Wilcoxon signed rank test
        – Counterpart for interval data: paired t-test

  53. Wilcoxon test -exercise-
      • > wilcox.test(d0, d1)
          Wilcoxon rank sum test with continuity correction
          data: d0 and d1
          W = 0, p-value = 0.0001796
          alternative hypothesis: true location shift is not equal to 0
          Warning message:
          In wilcox.test.default(d0, d1) : cannot compute exact p-value with ties

  54. Wilcoxon test -unequal variance-
      [Figure: two histograms of p (0.0 to 1.0; y axis: Frequency)]
      • Equal variance: P(p<0.05) = 0.044
      • Unequal variance: P(p<0.05) = 0.12

  55. Comparison of t-test and Wilcoxon test
      • Detection power
        – 98%
      • Extremely small sample size and large difference
        – t-test

  56. Detection power of t-test • S.d. =1 • μ1 - μ2 = 1, 0.5, 0.1

  57. Detection power of Wilcoxon test
      [Figure: "norm, diff=1", powvec (0.0 to 0.8) plotted against mvec (0 to 100)]
      • S.d. = 1
      • μ1 - μ2 = 1

  58. T-test Detection power
      • > power.t.test(n = 10, delta = 1, sd = NULL, sig.level = 0.05, power = .5)
          Two-sample t test power calculation
                    n = 10
                delta = 1
                   sd = 1.079782
            sig.level = 0.05
                power = 0.5
          alternative = two.sided
          NOTE: n is number in *each* group

  59. Calculating sample size - t-test
      • > power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = .95)
          Two-sample t test power calculation
                    n = 26.98922
                delta = 1
                   sd = 1
            sig.level = 0.05
                power = 0.95
          alternative = two.sided
          NOTE: n is number in *each* group

  60. t.test detection power (β-error) -exercise-
      • > t.test(rnorm(1e1, sd = 1.079782), rnorm(1e1, sd = 1.079782))
          t = -1.0004, df = 10.752, p-value = 0.3391
          t = 0.1531, df = 17.229, p-value = 0.8801
      • Tally of results: P > 0.05, P <= 0.05
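
The call shown on the slide draws both samples with mean 0. To observe the β-error, the sketch below assumes the intended setup is a true difference of delta = 1 between the groups, with the s.d. obtained from power.t.test for 50% power. The seed, the 5000 repetitions and the pooled-variance test (which is the test power.t.test refers to) are choices made here, not taken from the slides.

```r
set.seed(7)
sd_half_power <- 1.079782   # s.d. giving power 0.5 for n = 10, delta = 1

# Assumed setup: the first group is shifted by delta = 1.
p <- replicate(5000, t.test(rnorm(1e1, mean = 1, sd = sd_half_power),
                            rnorm(1e1, mean = 0, sd = sd_half_power),
                            var.equal = TRUE)$p.value)

simulated_power <- mean(p <= 0.05)   # share of correct rejections
theoretical_power <- power.t.test(n = 10, delta = 1, sd = sd_half_power)$power

print(c(simulated = simulated_power, theoretical = theoretical_power,
        beta_error = 1 - simulated_power))
```

About half of the simulated experiments should reject the null hypothesis; the other half are β-errors.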

  61. What is Nominal Data • Categorically discrete • Order of category is arbitrary • Red, blue, green • Origin(Region)

  62. Analysis of Nominal Data
      • Contingency table
      • Binomial test
      • Chi-square test
      • Fisher's exact test
      • McNemar test
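
Several of these tests are available in base R. The sketch below uses a hypothetical 2 x 2 contingency table; the counts and the row and column labels are invented for illustration and do not come from the course.

```r
# Hypothetical counts: rows are two groups, columns are two outcomes.
tab <- matrix(c(12, 5, 3, 10), nrow = 2,
              dimnames = list(group = c("A", "B"),
                              outcome = c("positive", "negative")))
print(tab)

chi <- chisq.test(tab)    # Pearson's chi-square with Yates's correction (2 x 2 default)
fis <- fisher.test(tab)   # Fisher's exact test
bin <- binom.test(10, 20, p = 0.5)   # binomial test: 10 successes out of 20

print(c(chisq_p = chi$p.value, fisher_p = fis$p.value, binom_p = bin$p.value))
```

For paired nominal data, `mcnemar.test()` takes a 2 x 2 table of paired outcomes in the same way.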
