Data analysis and Basic Statistics. FAO training course, June 11, 2013. Yoshiki Tsukakoshi, Ph.D. (statistical science), National Food Research Institute, National Agriculture and Food Research Organization. Email: Yoshiki.tsukakoshi@gmail.com
Today’s Agenda
- Descriptive statistics
– Summary statistics
– Graphical presentation
- Statistical Inference
– Estimation (point/interval)
– Hypothesis testing
– Regression analysis
– Analysis of variance
- Statistical Packages
Analytical Results (Repeated Analysis) AAS for cereal
Test | Cadmium (mg/kg)
1 | 0.0394
2 | 0.0412
3 | 0.0420
4 | 0.0398
5 | 0.0407
6 | 0.0395
Know Data Types
- Univariate Data
– A single variable from a sample
- Bivariate Data
– Two variables from a sample
– Analytical value and brand or origin
– Comparison of two methods
- Multivariate Data
– Multiple variables from a single sample
– ICP-MS, DNA arrays
Univariate Data
- Single set of samples
Test | Cadmium (mg/kg)
1 | 0.0394
2 | 0.0412
3 | 0.0420
4 | 0.0398
5 | 0.0407
6 | 0.0395
Average | 0.0404
What can be done with Univariate Data Descriptive Statistics
- Location/Shift/Central Tendency
– Mean (arithmetic, geometric, harmonic)
– Trimmed mean
– Interquartile mean
– Median, mode
- Scatter/Scale/Dispersion
– Variance, median absolute deviation
– Range, interquartile range
- Shape
– Skewness, kurtosis
- Frequency distribution
– Histogram, cumulative distribution, stem-and-leaf
Location Parameters
- Sample data set (5,4,5,6,5)
- Arithmetic mean
– ((5+4+5+6+5)/5) = 5
- Geometric mean
– (5×4×5×6×5)^(1/5) = 4.959344…
– Positive numbers only
- Harmonic mean
– 5/((1/5) + (1/4) + (1/5) + (1/6) + (1/5)) = 4.918033…
- Arithmetic ≥ geometric ≥ harmonic
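The three means above can be checked in R; a minimal sketch using the slide's sample data:

```r
# Sample data from the slide
x <- c(5, 4, 5, 6, 5)

arith <- mean(x)                 # (5+4+5+6+5)/5 = 5
geom  <- prod(x)^(1/length(x))   # (5*4*5*6*5)^(1/5) = 4.959344...
harm  <- length(x) / sum(1/x)    # n / (sum of reciprocals) = 4.918033...

c(arith, geom, harm)             # arithmetic >= geometric >= harmonic
```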
Location parameter - 2
- Location based on quantiles
– Median = 5 (for a symmetric distribution, equal to the mean)
– Quartiles: 1st, 2nd, 3rd
– Percentiles: 1st, 2nd, …, 100th
- Mode
– If continuous, use a kernel density method
- Empirical rule: (mean − mode) ≈ 3 (mean − median)
Scale parameters
- Variance, Standard Deviation
- Coefficient of Variation
– Geometric coefficient of variation
- Range
– (minimum, maximum)
Population variance
Test | Result | Deviation | Deviation²
1 | 0.0394 | −0.00103 | 1.06778E-06
2 | 0.0412 | 0.000767 | 5.87778E-07
3 | 0.0420 | 0.001567 | 2.45444E-06
4 | 0.0398 | −0.00063 | 4.01111E-07
5 | 0.0407 | 0.000267 | 7.11111E-08
6 | 0.0395 | −0.00093 | 8.71111E-07
Average | 0.0404 | | 9.08889E-07
Coefficient of variation (relative standard deviation): s/μ
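Using the cadmium data from the earlier slide, the coefficient of variation can be computed in R (a sketch; note that sd() uses the sample formula with n−1):

```r
cd <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)

m  <- mean(cd)   # 0.0404 (rounded)
s  <- sd(cd)     # sample standard deviation (divides by n-1)
cv <- s / m      # coefficient of variation, about 2.6%
```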
Descriptive Statistics Graphical representation of data
- Histogram
- Stem-and-leaf
– Data: (1.1, 1.1, 1.1, 1.2, 1.2, 2.5, 2.5)
– 1 | 1 1 1 2 2
– 2 | 5 5
[Figure: Histogram of r0]
Starting R
Command line window
Exercise Plotting histogram
- > Orange
–   Tree  age  circumference
– 1    1  118   30
– 2    1  484   58
– 3    1  664   87
- > hist(Orange$circumference)
[Figure: Histogram of Orange$circumference]
Exercise Plotting Stem and Leaf Plot
- > stem(Orange$circumference)
- The decimal point is 1 digit(s) to the right of the |
- 2 | 00023
- 4 | 918
- 6 | 295
- 8 | 17
- 10 | 81255
- 12 | 059
- 14 | 02256
- 16 | 72479
- 18 |
- 20 | 3394
Exercise Plotting empirical distribution
- > plot(ecdf(Orange$circumference))
[Figure: ecdf(Orange$circumference) — empirical CDF, Fn(x) vs x]
Shape parameter
- Skewness
- Kurtosis
[Figure: Histogram of rnorm(1000) — symmetric]
[Figure: Histogram of rnorm(100)^2 — right-skewed]
skewness
- Non parametric skew = (mean-median)/s.d.
[Figure: Histograms illustrating negative and positive skew — rnorm(1000) (symmetric) vs rnorm(100)^2 (positive skew)]
Calculating skewness
Test | Result | Deviation | Deviation³
1 | 0.0394 | −0.00103 | −1.10337E-09
2 | 0.0412 | 0.000767 | 4.5063E-10
3 | 0.0420 | 0.001567 | 3.8453E-09
4 | 0.0398 | −0.00063 | −2.54037E-10
5 | 0.0407 | 0.000267 | 1.8963E-11
6 | 0.0395 | −0.00093 | −8.13037E-10
Average | 0.0404 | | 3.57407E-10
- Divide the mean Deviation³ by s.d.³
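The table's calculation can be reproduced in R; a sketch using population-style moments (dividing by n, as in the slides):

```r
cd  <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
dev <- cd - mean(cd)

m3   <- mean(dev^3)        # average Deviation^3 = 3.57407E-10
sdev <- sqrt(mean(dev^2))  # population-style s.d.
skew <- m3 / sdev^3        # about 0.41: slightly right-skewed
```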
Interval Range
- 1 standard deviation (σ)
- Inter-Quartile Range (IQR)
– 3rd quartile (75th percentile) − 1st quartile (25th percentile)
- Normalized IQR = 0.7413 × IQR
– If the data are normally distributed, converges to the standard deviation
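A quick numerical check of the 0.7413 factor (an illustrative simulation; the seed is an arbitrary choice):

```r
set.seed(1)  # arbitrary seed for reproducibility
x <- rnorm(1e5, mean = 0, sd = 2)

niqr <- 0.7413 * IQR(x)  # should approximate the true s.d. = 2
```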
Terminology : Shape
- Unimodal
– Single peak
- Bimodal
– Dual peak
- Multimodal
[Figure: Histogram of rnorm(1e+06, 0, 1) — unimodal]
[Figure: Histogram of c(rnorm(1e+06, 0, 1), rnorm(1e+06, 8, 2)) — bimodal]
Terminology : Shape - 2
- Symmetric
- Asymmetric
– Skewness not zero
[Figure: Histogram of c(rnorm(1e+06, 0, 1), rnorm(1e+06, 4, 2)) — asymmetric]
[Figure: Histogram of rnorm(1e+06, 0, 1) — symmetric]
Terminology : Shape - 3
- Heavy tail
– Decreases slower than exponential distribution
- Light tail
– Decreases faster than exponential distribution
[Figure: Histogram of rlnorm(1000, 0, 1) — heavy right tail]
Graphical Distribution Box-plot
[Figure: Box plot — median, 1st and 3rd quartiles, whiskers at ±1.5 IQR, outliers beyond 1.5 IQR]
Graphical Distribution Error bars
- The width can be:
– Standard error – Confidence interval
[Figure: Bar charts of five replicates with error bars (value, arbitrary units)]
Exercise on calculation
- =RAND()
- Copy the formula down
- =NORM.INV(RAND(), 0, 1)
Generating (pseudo) random number
- > rnorm(10, 0, 1)
- [1] -0.11896589  0.34946077  0.69093896  0.12069319 -0.39211479  0.04787566  0.17911326  0.67266441  1.16644580 -0.35335929
- > r0 = rnorm(1e5, 0, 1)
- > r1 = rnorm(1e5, 1, 1)
- > hist(r0, xlim=c(-6,6))
- > hist(r1, xlim=c(-6,6))
[Figure: Histogram of r0 (mean 0)]
[Figure: Histogram of r1 (mean 1)]
Expectation and variance
- E[aX+bY]=aE[X]+bE[Y]
- V[X+Y] =V[X]+V[Y]+2cov(X,Y)
- SD[X+Y] = √V[X+Y]
Exercise Adding variable (cont.)
- # E[aX+bY]=aE[X]+bE[Y]
- > mean(r0)
- [1] 0.001574978
- > mean(r1)
- [1] 0.9990033
- > mean(r0+r1)
- [1] 1.000578
- > mean(r0-r1)
- [1] -0.9974283
Exercise Estimating Standard Deviation
- # V[X+Y]=V[X]+V[Y]+2Cov(X,Y)
- > sd(r0)
- [1] 0.999807 (1)
- > sd(r1)
- [1] 0.997682 (1)
- > sd(r0+r1) # (Adding independent variables)
- [1] 1.411946 # √(1×1 + 1×1 )
- > sd(r0-r1)
- [1] 1.412932 # √(1×1 + 1×1 )
- > sd(r0+r0) # (Adding correlated variables)
- [1] 1.999614 # √(1×1 + 1×1 + 2×1)
- > sd(r0-r0)
- [1] 0 (0) # √(1×1 + 1×1 - 2×1)
Probability Process and Distributions (Univariate)
- Discrete Distribution
– Count data (colony count) – Food poisoning incidents
- Continuous Distribution
– Concentration of chemicals
Discrete distributions
- Constant
- Binomial
- Poisson
- Negative binomial
Distribution and Random Process: Binomial Distribution
- Toss of a coin
– Heads or tails
– Number of heads
- Throwing dice
– Probability of getting the number one
- Bernoulli trials
– n trials, success probability p
Binomial Distribution
- Expectation: np
- Variance: np(1−p)
- Standard deviation: √(np(1−p))
[Figure: Probability mass functions, N=10, p=0.1 and N=10, p=0.3]
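These formulas can be verified numerically with R's dbinom (a small sketch):

```r
n <- 10; p <- 0.3
k <- 0:n
pmf <- dbinom(k, n, p)      # binomial probability mass function

m <- sum(k * pmf)           # expectation = n*p     = 3
v <- sum((k - m)^2 * pmf)   # variance = n*p*(1-p)  = 2.1
```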
Poisson distribution
- A well-dispersed solution
- Limit of the binomial: n large, p small, np constant (= λ)
Poisson Distribution Probability mass functions
[Figure: Probability mass functions — Poisson λ=1 vs binomial n=10, p=0.1; Poisson λ=3 vs binomial n=10, p=0.3]
Other discrete distributions
- Geometric distribution
– The number of trials needed to get one success
– Parameter p
- Hypergeometric distribution
– k successes in n draws without replacement from a population of size N
- Negative binomial distribution
– The number of successes before n failures
Central Limit Theorem
- The sum of several variables from any distribution approaches a normal distribution
[Figure: Histogram of sample(1:6, 10000, replace = TRUE) — one die, uniform]
[Figure: Histogram of the sum of several dice — approximately normal]
Central limit theorem an example
[Figure: Histogram of one die — uniform]
[Figure: Histogram of the sum of two dice — triangular]
[Figure: Histogram of the sum of three dice]
[Figure: Histogram of the sum of four dice — approximately normal]
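The dice example can be reproduced in R (a sketch; the seed is an arbitrary choice):

```r
set.seed(1)  # arbitrary seed
one  <- sample(1:6, 10000, replace = TRUE)   # uniform
four <- sample(1:6, 10000, replace = TRUE) +
        sample(1:6, 10000, replace = TRUE) +
        sample(1:6, 10000, replace = TRUE) +
        sample(1:6, 10000, replace = TRUE)   # roughly bell-shaped

mean(four)   # close to 4 * 3.5 = 14
```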
Continuous Distribution: Normal (Gaussian) Distribution — distributional shape
[Figure: Probability density function dnorm(k, 0, 1)]
- Bell shape
- Symmetrical
- Converges to zero at ±infinity
Normal Distribution — cumulative distribution function
- 68% within ±1 σ
- 95% within ±2 σ
[Figure: pnorm(k, 0, 1) — standard normal CDF]
Log normal Distribution
- Skewed shape
- Mean and mode differ
- Heavy tail
[Figure: dlnorm(k, 0, 1) — log-normal density]
Exponential distribution and Gamma distribution
- Exponential distribution
– Time until one event occurs, for events occurring at a constant rate
- Gamma distribution
– Time until k events occur, for events occurring at a constant rate
Gamma Distribution
- Shape parameter k
- Scale parameter θ
[Figure: Histogram of rgamma(1e+05, 1) — k=1, θ=1]
[Figure: Histogram of rgamma(1e+05, 2) — k=2, θ=1]
Gamma Distribution
- Mean = kθ
[Figure: Histogram of rgamma(1e+05, 3) — k=3, θ=1]
[Figure: Histogram of rgamma(1e+05, 4) — k=4, θ=1]
Exponential Distribution
- Density proportional to exp(−λx)
- Monotonically decreasing
- Gamma distribution with shape k = 1
[Figure: Histogram of rexp(1e+05) — λ=1]
[Figure: Histogram of rexp(1e+05, 3) — λ=3]
Generating random number
- R
– rnorm(n, mean, s.d.)
– > hist(rnorm(1e5, 0, 1))
– > hist(rnorm(1e5, 0, 2))
- Excel 2010
– =NORM.INV(RAND(), mean, s.d.)
[Figure: Histogram of rnorm(1e+05, 0, 1)]
[Figure: Histogram of rnorm(1e+05, 0, 2)]
Central Limit theorem exercise
- > hist(runif(1e5))
[Figure: Histogram of runif(1e+05) — uniform]
- > hist(runif(1e5) + runif(1e5))
[Figure: Histogram of runif(1e+05) + runif(1e+05) — triangular]
Central Limit theorem exercise
[Figure: Histogram of the sum of three uniforms]
[Figure: Histogram of the sum of four uniforms — approximately normal]
Chi-square distribution
- Distribution of the sum of squares of standard normal random numbers
[Figure: Histograms of (n−1) × sample variance — sample size 10 (d.f. 9) and sample size 4 (d.f. 3)]
Exercise Plant Growth
- >data(PlantGrowth)
- > PlantGrowth
- weight group
- 1 4.17 ctrl
- 2 5.58 ctrl
- 3 5.18 ctrl
Exercise Plotting notched Box Plot
[Figure: Notched box plots of PlantGrowth dried weight by group (ctrl, trt1, trt2)]
boxplot(weight ~ group, data = PlantGrowth, main = "PlantGrowth data", ylab = "Dried weight of plants", col = "lightgray", notch = TRUE, varwidth = TRUE)
Level of Measurement
- Ratio Data (quantitative)
- Interval Data (quantitative)
- Ordinal Data (qualitative)
– Can be ranked
- Categorical Data (qualitative)
– Binary Data
Descriptive Statistic Bivariate Data
- Dependence index
– Correlation: Pearson's r, Kendall's τ, Spearman's ρ
- Cross-tabulation
– Binary × binary, binary × nominal, nominal × nominal
- Scatterplots
– Ordinal/interval × ordinal/interval
- Quantile-Quantile plots
– Ordinal × ordinal
[Figure: Scatter plot of c(rr0) vs c(rr1)]
Statistical inference
- Drawing conclusions from data based on a model/assumption
- Data are independently and identically distributed
– Random sampling from a population
– Randomized experiment
- Set a model or assumption
- Estimate
– Parameter (mean, proportion, variance)
– Interval: confidence, tolerance, prediction
- Test of hypothesis
Types of statistical inference
- Point estimation
– Obtain a single estimate
- Interval estimation
– Interval of plausible values
- Hypothesis testing
– Making a decision from data
- Checking model assumptions
Point Estimation
- Obtain the best single value of a population parameter from a sample
- Unbiasedness
- Minimum variance
- Parametric distribution
– Maximum likelihood estimator
– Moment estimator
Unbiasedness
- True parameter: θ0
- Estimator: θ̂
- Unbiased if E[θ̂] = θ0
- Example: estimators of the standard deviation
– Calculated from 5 normal samples (true s.d. 1)
– Left: mean 0.8 (biased); right: mean 1.0 (unbiased)
[Figure: Sampling distributions of two s.d. estimators]
Unbiased variance
Test | Result | Deviation | Deviation²
1 | 0.0394 | −0.00103 | 1.06778E-06
2 | 0.0412 | 0.000767 | 5.87778E-07
3 | 0.0420 | 0.001567 | 2.45444E-06
4 | 0.0398 | −0.00063 | 4.01111E-07
5 | 0.0407 | 0.000267 | 7.11111E-08
6 | 0.0395 | −0.00093 | 8.71111E-07
Average | 0.0404 | | Biased estimate: 9.08889E-07
Unbiased estimate = (sum of Deviation²)/(6 − 1)
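In R, var() returns the unbiased estimate; the biased (population) version must be computed by hand. A sketch with the slide's data:

```r
cd   <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
dev2 <- (cd - mean(cd))^2

biased   <- sum(dev2) / 6        # 9.08889E-07, divides by n
unbiased <- sum(dev2) / (6 - 1)  # divides by n-1; equals var(cd)
```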
Minimum variance
- Normal distribution mean
- 5 samples, true mean = 0
- The sample mean has a narrower sampling distribution than the sample median
[Figure: Histogram of apply(matrix(rnorm(5e+05), ncol = 5), 1, mean)]
[Figure: Histogram of apply(matrix(rnorm(5e+05), ncol = 5), 1, median)]
Goodness-of-fit Test
- Graphical method
– Quantile-Quantile plot
Exercise plotting Q-Q plot
- Fit to normal distribution
- > qqnorm(rnorm(1e2))
- > qqnorm(rlnorm(1e2))
[Figure: Normal Q-Q plot of rnorm(1e2) — approximately straight]
[Figure: Normal Q-Q plot of rlnorm(1e2) — curved, indicating non-normality]
Pearson’s Chi-square
- Yates's correction
Observed:   10   2   7   9
Expected:    8   4   9   7
Difference:  2  −2  −2   2
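The Pearson statistic for these counts can be computed directly (a sketch; the expected counts are taken from the slide's "Hypothesis" row):

```r
observed <- c(10, 2, 7, 9)
expected <- c(8, 4, 9, 7)

x2 <- sum((observed - expected)^2 / expected)   # about 2.52
```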
Other test of fit
- Based on the empirical distribution function
– Kolmogorov–Smirnov test
– Anderson–Darling test
– Lilliefors test
– Cramér–von Mises test
- For normality
– Jarque–Bera test — based on skewness and kurtosis
– Shapiro–Wilk test — statistic based on the variances and covariances of order statistics (ranks)
Interval Estimation Types of Interval
- Confidence
– Covers the true parameter with the stated probability
– Nominal vs. actual coverage probability
- Prediction
– Another sample falls within the prediction interval with the stated probability
- Tolerance
– N percent of the population falls within the interval, with the stated confidence level
Confidence Intervals Example of 95%
Table of t-values
d.f. | t0.95 | t0.975 | t0.995
1 | 6.3 | 12.7 | 63.7
2 | 2.9 | 4.3 | 10
3 | 2.4 | 3.2 | 5.8
4 | 2.13 | 2.8 | 4.6
5 | 2.02 | 2.6 | 4.0
6 | 1.94 | 2.4 | 3.7
∞ (z) | 1.645 | 1.960 | 2.576
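As a worked example, a 95% confidence interval for the cadmium mean from the earlier slide, using t0.975 with d.f. = 5 (a sketch):

```r
cd <- c(0.0394, 0.0412, 0.0420, 0.0398, 0.0407, 0.0395)
n  <- length(cd)

se <- sd(cd) / sqrt(n)                              # standard error of the mean
ci <- mean(cd) + c(-1, 1) * qt(0.975, n - 1) * se   # qt(0.975, 5) = 2.57...
ci   # about (0.0393, 0.0415)
```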
Statistical inference
- Model, assumption, hypothesis
- Parametric
– The data-generating process is parameterized
- Non-parametric
– The data-generating process is not parameterized
- Asymptotic — commonly used
– Critical value based on a table
- Exact — computer-intensive
– Critical values based on the data
Statistical inference and error
- Type I error
– False positive
– α error
– Rejecting a null hypothesis that should have been accepted
- Type II error
– False negative
– β error
– Accepting a null hypothesis that should have been rejected
Statistical Test
- Test for location
- Test for dispersion
- Test for outlier
- P-value, error
- Detection Power
- Uniformly most powerful test
Ratio data
- Quantitative data
- Unlike interval data, it has a natural zero
- Multiplication and division are meaningful
- Age, length, etc.
Interval Data
- Quantitative data
- Addition and subtraction are meaningful
- Multiplication and division are not
- e.g., temperature in °C
Z-test
- The critical value does not depend on sample size
- Standard deviation: known
- Exercise (proficiency testing)
– Target: 700 μg/g
– Standard deviation: 25 μg/g
– Test whether a laboratory reporting 640 μg/g differs significantly from the target
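The z-test for this exercise is a direct calculation:

```r
target   <- 700   # assigned value, ug/g
sigma    <- 25    # known standard deviation, ug/g
reported <- 640   # laboratory result, ug/g

z <- (reported - target) / sigma   # -2.4
p <- 2 * pnorm(-abs(z))            # two-sided p-value, about 0.016
# p < 0.05, so the laboratory differs significantly from the target
```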
Test for normal interval data single set of samples
- One-sample t-test
- What is tested
– Whether the population mean differs from a given value
– Standard deviation: unknown (cf. z-test)
- Therefore the s.d. is estimated from the data, which adds error
- Exercise: does the mean of the data set (150, 120, 180, 130) differ significantly from 100?
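The exercise can be run with t.test (one-sample, two-sided):

```r
x   <- c(150, 120, 180, 130)
res <- t.test(x, mu = 100)

res$statistic   # t = 3.40 with df = 3
res$p.value     # about 0.042: significant at the 5% level
```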
Test for interval data 2 levels
- t-test (paired or unpaired)
- Variance of the two groups
– Equal: Student's t-test
– Unequal: Welch (Aspin–Welch) test
T-distribution
- Distribution of the sample mean divided by its estimated standard error
- Normality assumed
- Parameter: degrees of freedom
- If the standard deviation is known, or the degrees of freedom are infinite, it becomes the z-test
T-distribution and normal distribution
- Green = degree of freedom (d.f.) 2
- Blue = degree of freedom (d.f.) 10
- Red = degree of freedom (d.f.) +infinity
[Figure: t densities — d.f. 2 (green), d.f. 10 (blue), d.f. ∞ = normal (red)]
T-test assumptions
- Each of the two data sets follows a normal distribution
– especially important when the sample size is small
- The two data sets are sampled independently
- These assumptions are rarely met strictly, so care is needed in strict discussions
Robustness of inference
- How violations of the assumptions affect the test
- Outliers
- Distribution
– e.g., a mixture of distributions with different s.d.
- The t-test is somewhat robust to some violations
- Some suggest pre-testing whether the assumptions hold in the data; others argue against it
Case of unequal variance, equal sample size n=10, σ1/σ2=4
[Figure: Histogram of p-values — Student's t-test (non-uniform)]
[Figure: Histogram of p-values — Aspin–Welch test (approximately uniform)]
T test exercise Sample Data
- > ToothGrowth
- len supp dose
- 1 4.2 VC 0.5
- 2 11.5 VC 0.5
- 3 7.3 VC 0.5
- 4 5.8 VC 0.5
- 5 6.4 VC 0.5
- 6 10.0 VC 0.5
- 7 11.2 VC 0.5
T test exercise 2 Sample Data
- > d0 <- ToothGrowth$len[1:10]
– Vitamin C, dose 0.5 mg
- > d1 <- ToothGrowth$len[11:20]
– Vitamin C, dose 1.0 mg
T-test using R
- exercise 3-
- t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0,
paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
- >t.test(d0, d1)
- Welch Two Sample t-test
- data: d0 and d1
- t = -7.4634, df = 17.862, p-value = 6.811e-07
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- -11.265712 -6.314288
- sample estimates:
- mean of x mean of y
- 7.98 16.77
T-test exercise using random number
- alpha error-
- > t.test(rnorm(1e1),rnorm(1e1))
- t = -0.9106, df = 17.085, p-value = 0.3752
- t = 0.7685, df = 17.982, p-value = 0.4522
- t = 0.8858, df = 12.341, p-value = 0.3927
- t = -1.0532, df = 17.886, p-value = 0.3063
- t = 0.0694, df = 17.496, p-value = 0.9455
- t = -0.0784, df = 13.86, p-value = 0.9386
- t = -0.606, df = 17.528, p-value = 0.5523
Summary
- Under the null hypothesis the p-value is uniform: nominally and actually, P(p ≥ 0.5) = 50% and P(p < 0.5) = 50%
Test for Proportions
- > prop.test(10,20,p=0.5)
- 1-sample proportions test without continuity correction
- data: 10 out of 20, null probability 0.5
- X-squared = 0, df = 1, p-value = 1
- alternative hypothesis: true p is not equal to 0.5
- 95 percent confidence interval:
- 0.299298 0.700702
- sample estimates:
- p
- 0.5
One-sided(tailed) and two- sided(tailed) test
- One sided
– Null hypothesis μ=μ0 – Alternative hypothesis μ>μ0
- Two sided
– Null hypothesis μ=μ0 – Alternative hypothesis μ≠μ0
- One-sided p-value = ½ × two-sided p-value (for symmetric test statistics)
Balanced vs. Unbalanced
- Balanced
– Equal sample sizes (or numbers of experiments) assigned to each treatment
- Unbalanced
– Unequal sample sizes assigned to each treatment
Ordinal Data
- Several levels with order
- Excellent – good – fair
- Ranks in the race
- Interval data is an ordinal data
- But ordinal data is not always interval data
What can be done with Ordinal data
- Wilcoxon rank-sum test
– Also known as the Mann–Whitney U test
– Ordinal analogue of the unpaired t-test (interval data)
- Wilcoxon signed-rank test
– Ordinal analogue of the paired t-test (interval data)
Wilcoxon test
- exercise-
- > wilcox.test(d0,d1)
- Wilcoxon rank sum test with continuity correction
- data: d0 and d1
- W = 0, p-value = 0.0001796
- alternative hypothesis: true location shift is not equal to 0
- Warning message:
- In wilcox.test.default(d0, d1) : cannot compute exact p-value with ties
Wilcoxon test
- unequal variance-
[Figure: Histograms of p-values — equal variance: P(p<0.05) = 0.044; unequal variance: P(p<0.05) = 0.12]
Comparison of the t-test and Wilcoxon test
- Detection power: about 98% of the t-test's
- With extremely small sample sizes and large differences, use the t-test
Detection power of t-test
- s.d. = 1
- μ1 − μ2 = 1, 0.5, 0.1
Detection power of the Wilcoxon test
[Figure: Detection power vs sample size — normal data, difference = 1, s.d. = 1]
T-test Detection power
- > power.t.test(n=10,delta=1, sd=NULL, sig.level =0.05, power=.5)
- Two-sample t test power calculation
- n = 10
- delta = 1
- sd = 1.079782
- sig.level = 0.05
- power = 0.5
- alternative = two.sided
- NOTE: n is number in *each* group
Calculating sample size - t-test
- power.t.test(delta=1, sd=1, sig.level =0.05, power=.95)
- Two-sample t test power calculation
- n = 26.98922
- delta = 1
- sd = 1
- sig.level = 0.05
- power = 0.95
- alternative = two.sided
- NOTE: n is number in *each* group
t.test detection power (β-error)
- exercise-
- > t.test(rnorm(1e1, sd=1.079782), rnorm(1e1, sd=1.079782))
- t = -1.0004, df = 10.752, p-value = 0.3391 (P > 0.05)
- t = 0.1531, df = 17.229, p-value = 0.8801 (P > 0.05)
What is Nominal Data
- Categorically discrete
- Order of category is arbitrary
- Red, blue, green
- Origin(Region)
Analysis of Nominal Data
- Contingency table
- Binomial test
- Chi-square test
- Fisher's exact test
- McNemar test
Correlation
- analysis of bivariate data-
- r=0(left), 1.0(middle), 0.7(right)
[Figure: Scatter plots — rr0 vs rr1 (r = 0), rr0 vs rr0 (r = 1), (rr0 + rr1)/2 vs rr0 (r ≈ 0.7)]
Exercise Formaldehyde
- carb optden
- 1 0.1 0.086
- 2 0.3 0.269
- 3 0.5 0.446
- 4 0.6 0.538
- 5 0.7 0.626
- 6 0.9 0.782
Covariance
Data A      | 0.1      | 0.3      | 0.5      | 0.6     | 0.7     | 0.9     | mean 0.51667
Deviation A | −0.41667 | −0.21667 | −0.01667 | 0.08333 | 0.18333 | 0.38333 |
Data B      | 0.086    | 0.269    | 0.446    | 0.538   | 0.626   | 0.782   | mean 0.45783
Deviation B | −0.37183 | −0.18883 | −0.01183 | 0.08017 | 0.16817 | 0.32417 |
D.A²        | 0.17361  | 0.04694  | 0.00028  | 0.00694 | 0.03361 | 0.14694 | mean 0.06806
D.B²        | 0.13826  | 0.03566  | 0.00014  | 0.00642 | 0.02828 | 0.10508 | mean 0.05230
D.A × D.B   | 0.15493  | 0.04091  | 0.00019  | 0.00669 | 0.03084 | 0.12426 | mean 0.05964
Correlation = covariance / ((s.d. of A) × (s.d. of B))
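The same numbers can be reproduced from the built-in Formaldehyde data, using population-style moments as in the table; the n/(n−1) factors cancel, so the result equals cor():

```r
A <- Formaldehyde$carb
B <- Formaldehyde$optden

covAB <- mean((A - mean(A)) * (B - mean(B)))   # 0.05964, the table's D.A x D.B mean
r <- covAB / (sqrt(mean((A - mean(A))^2)) *
              sqrt(mean((B - mean(B))^2)))     # about 0.9996

r - cor(A, B)   # 0: identical to Pearson's correlation
```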
Exercise Formaldehyde
> plot(Formaldehyde$carb, Formaldehyde$optden) > cor(Formaldehyde$carb, Formaldehyde$optden)
[Figure: Scatter plot of Formaldehyde$carb vs Formaldehyde$optden]
Correlation - exercise
- > rr0 = rnorm(1e2, 0, 1)
- > rr1 = rnorm(1e2, 0, 1)
- > plot(rr0,rr1,xlim=c(-4,4),ylim=c(-4,4))
- > plot(rr0,rr0,xlim=c(-4,4),ylim=c(-4,4))
- > plot((rr0+rr1)/2,rr0,xlim=c(-4,4),ylim=c(-4,4))
Correlation test
- > cor.test(rr0, rr1)
- Pearson's product-moment correlation
- data: rr0 and rr1
- t = 0.2471, df = 98, p-value = 0.8054
- alternative hypothesis: true correlation is not equal to 0
- 95 percent confidence interval:
- -0.1723101 0.2202910
- sample estimates:
- cor
- 0.02495251
[Figure: Scatter plot of c(rr0) vs c(rr1)]
Regression and correlation Goodness-of-fit
- R
– Pearson’s correlation coefficient
- R2
– Coefficient of determination – proportion of variance in Y explained by a linear function of X
- 1-R2
– Fraction of variance unexplained
Correlation and outlier exercise
- > cor.test(c(rr0,5), c(rr1,5))
- Pearson's product-moment correlation
- data: c(rr0, 5) and c(rr1, 5)
- t = 2.0633, df = 99, p-value = 0.0417
- alternative hypothesis: true correlation is not equal to 0
- 95 percent confidence interval:
- 0.007928624 0.383282118
- sample estimates:
- cor
- 0.2030532
[Figure: Scatter plot of c(rr0, 5) vs c(rr1, 5) — a single outlier at (5, 5) inflates the correlation]
Regression analysis
- ordinal data to ordinal data-
- Spearman or Kendall correlation
– Spearman: convert the data to ranks and compute Pearson's correlation coefficient
– Kendall: based on concordant and discordant pairs
- Both can also be used for interval data
Rank correlation test
- exercise-
- > cor.test(c(rr0,5), c(rr1,-5), method="spearman")
- Spearman's rank correlation rho
- data: c(rr0, 5) and c(rr1, -5)
- S = 164262, p-value = 0.6666
- alternative hypothesis: true rho is not equal to 0
- sample estimates:
- rho
- 0.04331974
- > cor.test(c(rr0,5), c(rr1,-5), method="kendall")
Regression Analysis
- Relation between two variables
– Y ~ a + bX
- Scatter plot
- Univariate and multivariate
- Linear regression
- Non-linear regression
- R² and prediction interval
Y ~ a + bX
- Y
– Dependent variable, regressand, endogenous variable, response variable, measured variable
- X
– Independent variable, regressor, exogenous variable, explanatory variable, covariate, predictor variable
[Figure: Scatter plot of x vs y]
Regression Predicted and mean response
- Standard deviation (S.D.) → prediction interval
- Standard error of the mean (S.E.M.) → confidence interval for the mean response
[Figure: Regression line with outer prediction band and inner mean-response band]
Regression
- assumption-
- Error is mean zero and its variance is consistent
across observations
- If not, try transforming variable
– exp(X) – log(X) – Sqrt(x)
- Weighted least square
Regression – Robustness against outliers
- Ridge/Lasso regression
- Least absolute deviation regression
Regression by polynomials relation is not straight line
[Figure: Scatter plot of rr vs rr*rr with a quadratic (polynomial) fit]
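A polynomial fit in R uses lm with I(); an illustrative sketch on hypothetical simulated data (the true coefficients 1, −2, 3 are assumptions of the simulation):

```r
set.seed(1)  # arbitrary seed
rr <- rnorm(100)
y  <- 1 - 2 * rr + 3 * rr^2 + rnorm(100, sd = 0.5)  # hypothetical curved relation

fit <- lm(y ~ rr + I(rr^2))   # quadratic regression
coef(fit)                     # recovers roughly (1, -2, 3)
```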
Regression diagnosis
- Residual
- Standardized residual
– Residual divided by its estimated standard error
- Studentized residual
– Remove the observation, refit the regression, and divide its residual by the estimated error
- Durbin–Watson statistic
- Graphical diagnostics
Regression analysis fitting binary data to continuous function
- Logistic regression
– Logistic function
– logit p = log(p/(1−p)) = a + bx
- Probit regression
– Inverse of the cumulative distribution function of the normal distribution
Level:    1 2 3 4 5 6 7 8 9
Positive: 0 1 1 1 1 1
Logistic regression
- Logit p = log (p/(1-p)) = a + bx
Level:    1 2 3 4 5 6 7 8 9
Positive: 0 1 1 1 1 1
[Figure: Fitted logistic curve (res$fit) vs level]
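A sketch of fitting a logistic regression with glm, on hypothetical 0/1 data with some overlap (not the slide's table, to avoid perfect separation):

```r
level <- 1:9
pos   <- c(0, 0, 1, 0, 1, 1, 0, 1, 1)  # hypothetical 0/1 outcomes

fit <- glm(pos ~ level, family = binomial)  # logit p = a + b*level
coef(fit)[["level"]]                        # positive slope: p rises with level
```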
Other data
- Poisson regression
– Count data
- Multinomial logistic/probit regression
- Ordered logistic/Probit regression
Design of Experiment Principle by R.A. Fisher
- Comparison
- Randomization
- Replication
- Blocking
- Orthogonality
- Factorial experiment
Completely Randomised Design
Spray A Spray B Spray C 10 11 7 17 1 20 21 7 14 11 2 14 16 3
- data(InsectSprays)
Analysis of several sets of data Analysis of Variance
- Compares the variance between group means with the variance within groups
- Extension of the two-sample test to multiple samples
- One-way ANOVA
- Two-way ANOVA
- Repeated-measures ANOVA
– Measurements on the same block or subject
- 2 or more levels per factor
One-way ANOVA
- 1 interval and 1 nominal variable
- Group 1: 1.1 1.3 1.4
- Group 2: 1.3 1.7 1.5
- Group 3: 1.2 1.1 1.0
T-test and ANOVA
- One-way ANOVA and the unpaired t-test
- Compare the results
- Generate 10 normal random numbers
- > r0 <- rnorm(10, 0, 1)
– -0.73 0.61 1.03 1.81 -0.05 -1.21 1.04 0.65 -0.99 0.18
- >r1 <- rnorm(10, 1, 1)
– 1.94 0.82 0.23 2.07 -1.04 0.18 -1.80 1.35 -0.06 -0.61
Output for (Student’s) T-test
- > t.test(r1, r0, var.equal=TRUE)
- Two Sample t-test
- data: r1 and r0
- t = 2.827, df = 18, p-value = 0.01117
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- 0.2579993 1.7510822
- sample estimates:
- mean of x mean of y
- 1.2375198 0.2329791
Output for anova
- > group <- c(rep(0,10), rep(1,10))
- > rs <- c(r0, r1)
- > summary(aov(rs ~ factor(group)))
ANOVA result
- Df Sum Sq Mean Sq F value p-value
- group 1 5.046 5.046 7.992 0.0112
- Residuals 18 11.364 0.631
[Table: ANOVA decomposition — within-group deviations s1j − m1 and s2j − m2 around the group means m1 and m2, and between-group deviations m1 − m12 and m2 − m12 around the grand mean m12]
F-value
- Distribution of the ratio of the sample variance of dataset A to the sample variance of dataset B
F-table
alpha = 0.1
df2 \ df1 | 1 | 2 | 5 | 20
1 | 40 | 50 | 57 | 62
2 | 8.5 | 9.0 | 9.3 | 9.4
5 | 4.1 | 4.3 | 4.1 | 3.2
20 | 3.0 | 2.6 | 2.2 | 1.8

alpha = 0.05
df2 \ df1 | 1 | 2 | 5 | 20
1 | 161 | 200 | 230 | 248
2 | 19 | 19 | 19 | 19
5 | 6.6 | 5.8 | 5.1 | 4.6
20 | 4.4 | 3.5 | 2.7 | 2.1
ANOVA with unequal variances
- > oneway.test(rs ~ group)
- One-way analysis of means (not assuming equal variances)
- data: rs and group
- F = 7.9918, num df = 1.000, denom df = 13.923, p-value = 0.01351
t.test by Aspin-Welch
- > t.test(r1, r0)
- Welch Two Sample t-test
- data: r1 and r0
- t = 2.827, df = 13.923, p-value = 0.01351 # slightly larger p-value
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- 0.2420163 1.7670653
- sample estimates:
- mean of x mean of y
- 1.2375198 0.2329791
Exercise Plant Growth
- >data(PlantGrowth)
- > PlantGrowth
- weight group
- 1 4.17 ctrl
- 2 5.58 ctrl
- 3 5.18 ctrl
Exercise ANOVA
- > summary(aov(weight ~ group, data = PlantGrowth))
- Analysis of Variance Table
- Response: weight
- Df Sum Sq Mean Sq F value Pr(>F)
- group 2 3.7663 1.8832 4.8461 0.01591 *
- Residuals 27 10.4921 0.3886
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple Comparison
- Tukey’s range test
– Also as Tukey’s honest significance test – Compares all possible pair of means
- Scheffé's test
Exercise Tukey HSD
- > fit <- aov(weight ~ group, data = PlantGrowth)
- > TukeyHSD(fit)
- Tukey multiple comparisons of means
- 95% family-wise confidence level
- Fit: aov(formula = weight ~ group, data = PlantGrowth)
- $group
- diff lwr upr p adj
- trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
- trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
- trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
- For comparisons against a control, Dunnett's test is better, but it requires installing the multcomp package (internet connection needed)
Analysis of variance
- assumptions-
- Normality of the data
– Deviations are normally distributed
- Homoscedasticity
– The variances of the groups are equal
- All data are independent
Anova for ordinal data
- One-way
– Kruskal-Wallis – Van der Waerden test
- One-way repeated measure
– Friedman test
Exercise Kruskal-wallis
- > kruskal.test(weight ~ group, data = PlantGrowth)
- Kruskal-Wallis rank sum test
- data: weight by group
- Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Comparison of wilcoxon and Kruskal-Wallis tests
- > wilcox.test(weight ~ group, data = PlantGrowth, subset = group != 'trt1', exact = FALSE, correct = FALSE)
- W = 25, p-value = 0.05878
- > kruskal.test(weight ~ group, data = PlantGrowth, subset = group != 'trt1')
- Kruskal-Wallis chi-squared = 3.5714, df = 1, p-value = 0.05878
One way repeated measures
        Age 1  Age 2  Age 3  Age 4
Tree 1    30     58     87    115
Tree 2    33     69    111    156
Tree 3    30     51     75    108
- data(Orange)
Subset data
- Compare circumference at age 118 and age 484
- > Orange.1 <- subset(Orange, age==118)
- > Orange.2 <- subset(Orange, age==484)
Paired t-test
- > t.test(Orange.1$circumference, Orange.2$circumference, paired=TRUE)
- Paired t-test
- data: Orange.1$circumference and Orange.2$circumference
- t = -8.6768, df = 4, p-value = 0.000971
- alternative hypothesis: true difference in means is not equal to 0
- 95 percent confidence interval:
- -35.37558 -18.22442
- sample estimates:
- mean of the differences
- -26.8
ANOVA
- Repeated measures-
- > summary(aov(circumference ~ age + Error(Tree/age), data=rbind(Orange.1, Orange.2)))
- Error: Tree
- Df Sum Sq Mean Sq F value Pr(>F)
- Residuals 4 179.4 44.85
- Error: Tree:age
- Df Sum Sq Mean Sq F value Pr(>F)
- age 1 1795.6 1795.6 75.29 0.000971 ***
- Residuals 4 95.4 23.8
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Wilcox signed rank
- > wilcox.test(Orange.1$circumference, Orange.2$circumference, paired=TRUE)
- Wilcoxon signed rank test
- data: Orange.1$circumference and Orange.2$circumference
- V = 0, p-value = 0.0625
- alternative hypothesis: true location shift is not equal to 0
Friedman test
- > friedman.test(circumference ~ age | Tree, data=rbind(Orange.1, Orange.2))
- Friedman rank sum test
- data: circumference and age and Tree
- Friedman chi-squared = 5, df = 1, p-value = 0.02535
Factorial Design
- two-way anova-
              Low dose    Middle dose   High dose
Vitamin C     4.2/11.5    16.5/16.5     23.6/18.5
Orange juice  15.2/21.5   19.7/23.3     25.5/26.4
- > ToothGrowth
ANOVA on toothgrowth (2-way)
- > summary(aov(len ~ supp*dose, data=ToothGrowth))
- Df Sum Sq Mean Sq F value Pr(>F)
- supp 1 205.3 205.3 12.317 0.000894 ***
- dose 1 2224.3 2224.3 133.415 < 2e-16 ***
- supp:dose 1 88.9 88.9 5.333 0.024631 *
- Residuals 56 933.6 16.7
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Test for heteroscedasticity
- F-test
- Cochran’s C
- Bartlett’s test
- Levene’s test
- Brown-Forsythe test
- Goldfeld-Quandt test
- Breusch–Pagan test
[Figure: Histograms of rnorm(10000, 0, 2) and rnorm(10000, 0, 1) — unequal variances]
Cochran's C
- Data 1=(1,2,1,3,1,2)
- Data 2=(1,3,5,2,5,2)
- Data 3=(1,4,1,5,1,2)
- Detects high within-group variation
- H0: all variance equal
- H1: at least one variance is not equal, or an outlier is included
Cochran’s C
- > library(outliers)
- > cochran.test(rs ~ group)
- Cochran test for outlying variance
- data: rs ~ group
- C = 0.7706, df = 10, k = 2, p-value = 0.0856
- alternative hypothesis: Group 0 has outlying variance
- sample estimates:
- 0 1
- 0.9729731 0.2896914
Extension of ANOVA
- ANCOVA
– Analysis of Covariance
- MANOVA
– Multivariate ANOVA
Mechanism of missing data
- Missing Completely at Random
- Missing at Random
- Impossible to Observe
- Censored data (Left censored, Right censored)
Statistical packages
- Open source
– Free redistribution, no royalty – Source code open
- Freeware
– Free distribution, no royalty
- Proprietary
– Requires royalty
Excel 2010
- One way ANOVA
- Two-way ANOVA
- Two-way with replication
- Correlation
- Covariance
- Basic Statistics
Excel - 2
- Generating random number
- Rank and percentile
- regression
- sampling
- Paired t-test
- t-test (equal variance)
- t-test (unequal variance)
- z-test
Excel - 3
- Summary statistics
- F test
- Fourier analysis
- Histogram
- Moving average
Paired t
Two way ANOVA
Exporting Data from Excel
- Prepare the data
- 1st line = title (column headers)
- 2nd and later lines = data
Exporting Data from Excel
- Save the data in CSV format
Exporting data from R to Excel
- > write.csv(data, "filename")
Using real data Read the csv file
- > sp = read.csv("filename")
- To see the file path, drag and drop the file onto the console window:
- > load("C:\\Users\\yoshiEPSON\\Documents\\ICCAGSA2013.docx")
- Change "load" to "read.csv":
- > read.csv("C:\\Users\\yoshiEPSON\\Documents\\ICCAGSA2013.docx")