
Statistics Toolbox in A Review of Analysis Techniques for Scientific Research Professional Development Opportunity for the Flow Cytometry Core Facility October 12, 2018 LKG Consulting Email: consulting.lkg@gmail.com Website:


SLIDE 1

Statistics Toolbox in

Professional Development Opportunity for the

Flow Cytometry Core Facility

October 12, 2018

LKG Consulting

Email: consulting.lkg@gmail.com Website: www.consultinglkg.com

A Review of Analysis Techniques for Scientific Research

SLIDE 2

The goal of this workshop is to give you the knowledge & tools to be confident in your ability to collect & analyze your data as well as correctly interpret your results… …Think of me as your new resource!

SLIDE 3

Laura Gray-Steinhauer

www.ualberta.ca/~lkgray

BSc in Mathematics, Statistics and Environmental Studies (UVIC, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)

Designated Professional Statistician with The Statistical Society of Canada (2014) Research: Climate Change, Policy Evaluation, Adaptation, Mitigation, Risk management for forest resources, Conservation…

A little about me…

SLIDE 4

Workshop Schedule

8:15 – 8:30   Arrive at the Lab & start up the computers
8:30 – 8:45   Welcome to the Workshop (housekeeping & today's goals)
8:45 – 9:15   Statistics Toolbox – refresh useful vocabulary, introduce a decision tree to plan your analysis path
9:15 – 9:45   Hypothesis Testing – refresher on p-values, Type I and Type II error, and statistical power
9:45 – 10:00  Break
10:00 – 11:00 Parametric versus Non-Parametric Tests – testing for parametric assumptions, ANOVA, permutational ANOVA
11:00 – 11:30 Multivariate Statistics – introduction to principal component analysis (PCA)
11:30 – 1:00  Work period (questions are welcome)
After 1:00    Enjoy your weekend!

This may be A LOT of information to absorb OR we may not cover the specific topic you came to learn in class today. Feel free to reach out to me via email with more questions: consulting.lkg@gmail.com.

SLIDE 5

Workbook

  • Yours to keep!
  • R code is identified by Century Gothic font (everything else is Arial)
  • Arbitrary object names are bold to indicate these could change depending on what you name your variables
  • Referenced data is provided at www.ualberta.ca/~lkgray
  • Please contact me to obtain permission to redistribute content outside of the workshop attendees

Topics Included:

  • Descriptive statistics
  • Confidence intervals
  • Data distributions
  • Parametric assumptions
  • T-tests
  • ANOVA
  • ANCOVA
  • Non-parametric tests
  • Permutational ANOVA & T-tests
  • Z-test for Proportions
  • Chi-squared test
  • Outlier tests and treatments
  • Correlation
  • Linear regression
  • Multiple linear regression
  • Akaike Information Criterion
  • Non-linear regression
  • Logistic regression
  • Binomial ANOVA
  • Principal component analysis (PCA)
  • Discriminant analysis
  • Multivariate analysis of variance (MANOVA)

SLIDE 6

R Project Website

https://cran.r-project.org/index.html

SLIDE 7

https://www.rstudio.com/

RStudio (IDE: Integrated Development Environment)

Preferred among programmers, we will use it in this workshop

SLIDE 8

Statistics Toolbox

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”

Aaron Levenstein (Author)

SLIDE 9

Statistical Vocabulary

Statistical Term – Real World / Research World

  • Population – Class of things (e.g. cancer patients) / What you want to learn about (e.g. cancer patients in Alberta)
  • Sample – Group representing a class (e.g. 1000 cancer patients in Alberta) / What you actually study (e.g. 1000 cancer patients from 10 treatment centres in Alberta)
  • Experimental Unit – Individual thing (e.g. each of the 1000 cancer patients) / Individual research subject (e.g. cancer patients n=1000, hospital populations n=10; depends on the research question)
  • Dependent Variable – Property of things (e.g. white blood cell count) / What you measure about subjects (e.g. white blood cell count)
  • Independent Variable – Environment of things (e.g. treatment options, climate, etc.) / What you think might influence the dependent variable (e.g. amount of treatment, combination of treatments, etc.)
  • Data – Values of variables / What you record (the information you collect)

SLIDE 10
Other important statistical terms

  • Experiment – any controlled process of study which results in data collection, and for which the outcome is unknown
  • Descriptive statistics – numerical/graphical summary of data
  • Inferential statistics – predict or control the values of variables (make conclusions with)
  • Statistical inference – makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken
  • Parameter – an unknown value (needs to be estimated) used to represent a population characteristic (e.g. population mean)
  • Statistic – an estimate of a parameter (e.g. mean of a sample)
  • Sampling distribution (a.k.a. probability distribution or probability density function) – the probability associated with each possible value of a variable
  • Error – difference between an observed (or calculated) value and its true (or expected) value

Also see Appendix 1 in your workbook

SLIDE 11

Statistics Toolbox

What is the goal of my analysis? What kind of data do I have to answer my research question? How many variables do I want to include in my analysis? Does my data meet the analysis assumptions?

SLIDE 12

Statistics Toolbox

Analysis Goal / Parametric (assumptions met) / Non-Parametric (alternative if assumptions fail) / Binomial (binary data, event likelihood):

  • Describe data characteristics – Parametric: mean, standard deviation, standard error, etc. Non-parametric: median, quartiles, percentiles, proportions. Probability distributions and graphics are always appropriate to describe data.
  • Compare 2 distinct/independent groups – Parametric: T-test, Paired t-test. Non-parametric: Wilcoxon Rank-Sum Test, Kolmogorov–Smirnov Test, Permutational T-test. Binomial: Z-Test for proportions.
  • Compare > 2 distinct/independent groups – Parametric: ANOVA, Multi-Way ANOVA, ANCOVA, Blocking. Non-parametric: Kruskal–Wallis Test, Friedman Rank Test, Permutational ANOVA. Binomial: Chi-Squared Test, Binomial ANOVA.
  • Estimate the degree of association between 2 variables – Parametric: Pearson's correlation. Non-parametric: Spearman rank correlation, Kendall's rank correlation. Binomial: Logistic regression.
  • Predict outcome based on relationship – Parametric: Linear regression, Multiple linear regression, Non-linear regression. Binomial: Logistic regression, Odds Ratio.

SLIDE 13

If you have a continuous response variable… … and one predictor variable:

  • Predictor is categorical, two treatment levels: T-Test (parametric); Permutational T-Test, Kolmogorov–Smirnov (KS) Test, Wilcoxon Test (non-parametric)
  • Predictor is categorical, > two treatment levels: One-Way ANOVA (parametric); Kruskal–Wallis Test, Friedman Rank Test (non-parametric)
  • Predictor is continuous: Pearson's Correlation (parametric); Spearman's Rank Correlation, Kendall's Rank Correlation (non-parametric); Linear/Non-linear Regression

What you get

  • Two treatment levels: p-value indicating if 2 groups are significantly different
  • > Two treatment levels: p-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs
  • Correlation: correlation coefficient indicating direction and magnitude of relationship
  • Regression: "goodness of fit" indicating how well predictor is linked to response (R² or AIC)

SLIDE 14

… and two or more predictor variables:

  • Predictors are categorical, two or more treatment levels for each predictor: Multi-Way ANOVA, ANCOVA, Blocking (parametric); Permutational ANOVA (non-parametric)
  • Predictors are continuous: Multiple Regression

What you get

  • ANOVA: p-value indicating if there is a significant effect of each treatment; size of a significant effect (no interactions); need to consider the possibility of interactions; need pairwise comparisons with adjusted p-values to determine the difference among treatments with interactions
  • ANCOVA/Blocking: also get the effect of the blocking term and/or undesired covariate; do not need to consider the interaction between treatments and blocks and/or covariates
  • Multiple Regression: fit of how well predictors are linked to the response variable (Adjusted R², AIC); p-values to indicate which predictors significantly affect the response variable

SLIDE 15

If you have a categorical response variable… … and one predictor variable:

  • Predictor is categorical, two treatment levels: Z-Test for Proportions
  • Predictor is categorical, two or more treatment levels: Chi-squared Test, Binomial ANOVA

… and two or more predictor variables:

  • Predictors are continuous: Logistic Regression

What you get

  • Z-Test for Proportions: p-value indicating if 2 groups are significantly different
  • Chi-squared Test: p-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs
  • Binomial ANOVA: p-value indicating if there is a significant effect of each treatment; size of a significant effect (no interactions); need to consider the possibility of interactions; need pairwise comparisons with adjusted p-values to determine the difference among treatments with interactions
  • Logistic Regression: fit of how well predictors are linked to the response variable (Adjusted R², AIC); p-values to indicate which predictors significantly affect the response variable

SLIDE 16

Example research questions: Does the yield of different lentil varieties differ between the 2 farms? Do the varieties differ among themselves? Does the density of the plants impact their average height?

The Lentil datasets (You are now a farmer)

[Diagram: Farm 1 and Farm 2, each divided into plots with 1 variety per plot (A, B, or C) and individual lentil plants within each plot]

SLIDE 17

Datasets Available in R

  • Over 100 datasets available for you to use
  • We will use:
  • iris: The famous (Fisher's or Anderson's) iris data set gives the sepal and

petal measurements for 50 flowers from each of Iris setosa, versicolor, and virginica.

  • USArrests: Data on US Arrests for violent crimes by US State.
SLIDE 18

Hypothesis Testing

“Statistics are no substitute for judgment.”

Henry Clay (Former US Senator)

SLIDE 19

Formal hypothesis testing

[Figure: samples A and B drawn from a population; compare mean heights. Is the difference due to random chance?]

H0: ȳA = ȳB
H1: ȳA ≠ ȳB

If actual p < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1)

SLIDE 20

“Is this difference due to random chance?”

[Figure: distributions of mean height for samples A and B drawn from the population]

P-value – the probability the observed value or larger is due to random chance

Theory: We can never really prove if the 2 samples are truly different or the same – only ask if what we observe (or a greater difference) is due to random chance

How to interpret p-values: P-value = 0.05 – “Yes, 1 out of 20 times.” P-value = 0.01 – “Yes, 1 out of 100 times.”

The lower the probability that a difference is due to random chance, the more likely it is the result of an effect (what we test for)

In other words: “Is random chance a plausible explanation?”
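This logic can be sketched in R with a hypothetical two-sample comparison (the numbers are illustrative, not from the workshop data):

```r
# Hypothetical example: two samples drawn from populations with different means.
set.seed(42)                              # make the random draws reproducible
groupA <- rnorm(20, mean = 10, sd = 2)    # 20 observations centred on 10
groupB <- rnorm(20, mean = 12, sd = 2)    # 20 observations centred on 12
result <- t.test(groupA, groupB)          # Welch two-sample t-test
result$p.value  # probability of seeing this difference (or larger) by random chance
```

A small p-value here says random chance is not a plausible explanation for the observed difference.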

SLIDE 21

Type I Error – reject the null hypothesis (H0)

when it is actually true

Type II Error – failing to reject the null

hypothesis (H0) when it is not true

Remember rejection or acceptance of a p-value (and therefore the chance you will make an error) depends on the arbitrary α-level you choose

  • Decreasing the α-level will decrease the probability of making a Type I Error, but this increases the probability of making a Type II Error

The α-level you choose is completely up to you (typically it is set at 0.05), however, it should be chosen with consideration of the consequences of making a Type I or a Type II Error. Based on your study, would you rather err on the side of false positives or false negatives?

  • Fail to reject the null hypothesis: ☺ Correct Decision if the null hypothesis is true; Incorrect Decision (False Negative, Type II Error) if the alternative hypothesis is true
  • Reject the null hypothesis: Incorrect Decision (False Positive, Type I Error) if the null hypothesis is true; ☺ Correct Decision if the alternative hypothesis is true

SLIDE 22

Example: Will current forests adequately protect genetic resources

under climate change?

Birch Mountain Wildlands

HO: Range of the current climate for the BMW protected area

= Range of the BMW protected area under climate change

Ha: Range of the current climate for the BMW protected area

≠ Range of the BMW protected area under climate change

If we reject HO: Climate ranges are different, therefore genetic resources are not adequately protected and new protected areas need to be created

Consequences if I make:

  • Type I Error: Climates are actually the same and genetic resources are

indeed adequately protected in the BMW protected area – we created new parks when we didn’t need to

  • Type II Error: Climates are different and genetic resources are vulnerable – we didn't create new protected areas and we should have

From an ecological standpoint it is better to make a Type I Error, but from an economic standpoint it is better to make a Type II Error. Which standpoint should I take?

SLIDE 23

Power is your ability to reject the null hypothesis when it is false (i.e. your ability to detect an effect when there is one). There are many ways to increase power:

  • 1. Increase your sample size (sample more of the population)
  • 2. Increase your alpha value (e.g. from 0.01 to 0.05) – watch for Type I

Error!

  • 3. Use a one-tailed test (you know the direction of the expected effect)
  • 4. Use a paired test (control and treatment are same sample)

Given you are testing whether or not what you observed (or greater) is due to random chance, more data gives you a better understanding of what is truly happening within the population; therefore increasing sample size will decrease the probability of making a Type II Error

Statistical Power
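The first two levers can be explored with power.t.test() from the base stats package (the 1-SD effect size here is an assumption for illustration):

```r
# Power for a two-sample t-test, assuming the groups differ by 1 SD (delta = 1, sd = 1).
power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05)$power  # power with 10 per group
# Sample size per group needed to reach 80% power for the same effect:
power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.80)$n
```

Raising alpha from 0.01 to 0.05 in sig.level, or raising n, both increase the reported power.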

SLIDE 24

BREAK 9:45 – 10:00

Go grab a coffee. Next we will cover specific tools in your new tool box.

SLIDE 25

Statistics Toolbox

Parametric versus Non-Parametric Tests

“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”

Andrew Lang (Scottish poet)

SLIDE 26

Univariate Test Options

Type: Parametric vs. Non-Parametric

Characteristics

Parametric:
  • Analysis to test group means
  • Based on raw data
  • More statistical power than non-parametric tests

Non-Parametric:
  • Analysis to test group medians
  • Based on ranked data
  • Less statistical power

Assumptions

Parametric:
  • Independent samples
  • Normality (data OR errors)
  • Homogeneity of variances

Non-Parametric:
  • Independent samples

When to use?

Parametric:
  • Parametric assumptions are met
  • Non-normal, BUT larger sample size (CLT); however, equal variances must be met

Non-Parametric:
  • Parametric assumptions are not met
  • Medians better represent your data (skewed data distribution)
  • Small sample size
  • Ordinal data, ranked data, or outliers that you can't remove

Examples

Parametric:
  • T-test
  • ANOVA (One-way, Two-way, Paired)

Non-Parametric:
  • Wilcoxon Rank Sum Test
  • Kruskal–Wallis Test
  • Permutational tests (non-traditional)

SLIDE 27

Assumption #1: Independence of samples

“Your samples have to come from a randomized or randomly sampled design.”

  • Meaning rows in your data do NOT influence one another
  • Address this with experimental design (3 main things to consider):
  • 1. Avoid pseudoreplication and potential confounding factors by designing your experiment in a randomized design
  • 2. Avoid systematic arrangements, which are distinct patterns in how treatments are laid out. If your treatments affect one another, the individual treatment effects could be masked or overinflated
  • 3. Maintain temporal independence. If you need to take multiple samples from one individual over time, record and test your data considering the change in time (e.g. paired tests)

NOTE: ANOVA needs to have at least 1 degree of freedom – this means you need at least 2 reps per treatment to execute an ANOVA. Rule of thumb: you need more rows than columns.

SLIDE 28

The Normal Distribution

s² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² / (n − 1),    SD = √s²

Based on this curve:

  • 68.27% of observations are within 1 stdev of ȳ
  • 95.45% of observations are within 2 stdev of ȳ
  • 99.73% of observations are within 3 stdev of ȳ

For confidence intervals:

  • 95% of observations are within 1.96 stdev of ȳ

The base of parametric statistics
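These percentages follow directly from the normal CDF and can be checked with pnorm() and qnorm():

```r
# The 68-95-99.7 rule, computed from the standard normal distribution:
pnorm(1) - pnorm(-1)   # 0.6827: probability within 1 stdev of the mean
pnorm(2) - pnorm(-2)   # 0.9545: within 2 stdev
pnorm(3) - pnorm(-3)   # 0.9973: within 3 stdev
qnorm(0.975)           # 1.96: the multiplier for a 95% confidence interval
```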

SLIDE 29

Assumption #2: Data/Experimental errors are normally distributed

[Figure: Farm 1 plot with varieties A, B, C, showing each group mean (ȳA, ȳB, ȳC), the overall mean, and the residuals]

Residuals: "If I were to repeat my sampling and calculate the means, those means would be normally distributed."

Determine if the assumption is met by:

  • 1. Looking at the residuals of your sample
  • 2. Shapiro–Wilk Test for Normality – if your data is mainly unique values
  • 3. D'Agostino-Pearson normality test – if you have lots of repeated values
  • 4. Lilliefors normality test – mean and variance are unknown
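A minimal sketch of option 2 in R, using simulated stand-ins for residuals (shapiro.test() is in the base stats package):

```r
set.seed(1)
resid_like <- rnorm(50)      # simulated stand-in for well-behaved model residuals
shapiro.test(resid_like)     # large p-value: no evidence against normality
shapiro.test(rexp(50))       # strongly skewed data: expect a small p-value
```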
SLIDE 30

t-distribution (sampling distribution) vs. the Normal distribution

Assumption #2: Data/Experimental errors are normally distributed

You may not need to worry about normality.

Central Limit Theorem: "Sample means tend to cluster around the central population value." Therefore:

  • When sample size is large, you can assume that ȳ is close to the value of μ
  • With a small sample size you have a better chance of getting a mean that is far off the true population mean
SLIDE 31

Assumption #2: Data/Experimental errors are normally distributed

You may not need to worry about normality.

Central Limit Theorem: "Sample means tend to cluster around the central population value." Therefore:

  • When sample size is large, you can assume that ȳ is close to the value of μ
  • With a small sample size you have a better chance of getting a mean that is far off the true population mean

What does this mean?

  • For large N, the assumption of Normality can be relaxed
  • You may have decreased power to detect a difference among groups, BUT your test is not really compromised if your residuals are not normal
  • The assumption of Normality is important when:
    1. Very small N
    2. Data is highly non-normal
    3. Significant outliers are present
    4. Small effect size

SLIDE 32

Assumption #3: Equal variances between groups/treatments

[Figure: two distributions on a 0–24 scale, both centred at 12: ȳA = 12, sA = 4 and ȳB = 12, sB = 6]

Let's say 5% of the A data fall above some threshold; then >5% of the B data fall above the same threshold.

So with larger variances, you can expect a greater number of observations at the extremes of the distributions This can have real implications on inferences we make from comparisons between groups

SLIDE 33

[Figure: Farm 1 plot with varieties A, B, C, group means and residuals]

Residuals: "Does the known probability of observations between my two samples hold true?"

Determine if the assumption is met by:

  • 1. Looking at the residuals of your sample
  • 2. Bartlett Test

Assumption #3: Equal variances between treatments
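Option 2 can be sketched with bartlett.test() from the base stats package (the three "varieties" below are simulated for illustration):

```r
# Sketch: Bartlett's test for homogeneity of variances across three groups.
set.seed(7)
yield   <- c(rnorm(10, 100, 5), rnorm(10, 110, 5), rnorm(10, 120, 5))
variety <- factor(rep(c("A", "B", "C"), each = 10))
bartlett.test(yield ~ variety)   # large p-value: no evidence of unequal variances
```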

SLIDE 34

Assumption #3: Equal variances between treatments

Testing for Equal Variances – Residual Plots

[Figure: four residual plots of predicted values vs. observed residuals (original units)]

  • NORMAL distribution: equal number of points along the observed axis; EQUAL variances: equal spread on either side of the mean (predicted value = 0). Good to go!
  • NON-NORMAL distribution: unequal number of points along the observed axis; EQUAL variances: equal spread on either side of the mean (predicted value = 0). Optional to fix.
  • NORMAL/NON-NORMAL: look at a histogram or test; UNEQUAL variances: cone shape, away from or towards zero. This needs to be fixed for ANOVA (transformations).
  • OUTLIERS: points that deviate from the majority of data points. This needs to be fixed for ANOVA (transformations or removal).
SLIDE 35
  • Treatment – predictor variable

(e.g. variety, fertilization, irrigation, etc.)

  • Treatment level – groups within treatments

(e.g. A,B,C or Control, 1xN, 2xN)

  • Covariate – undesired, uncontrolled predictor variable, confounding
  • F-value – variation between / variation within = mean square treatment / mean square error

  • P-value – probability that the observed difference or larger in the

treatment means is due to random chance

Analysis of Variance (ANOVA) – Vocabulary

SLIDE 36

[Figure: Farm 1 plot with varieties A, B, C, group means and residuals]

Analysis of Variance (ANOVA)

F-value = variation between / variation within = mean square treatment / mean square error = SIGNAL / NOISE

variance between = [ Σᵢⁿ (ȳᵢ − ȳALL)² / (n − 1) ] × r
variance within = ( Σᵢⁿ varianceᵢ ) / n

SLIDE 37

Analysis of Variance (ANOVA)

Think of Pac-Man!

  • All of the dots on the board represent the Total Variation in your study
  • Every treatment you use in your analysis is a different Pac-Man player on the board
  • The number of dots each player eats represents variation between (e.g. the amount of variation each treatment can explain)
  • The number of dots left on the board after all players have died represents the variation within
  • If players have a big effect they will eat more dots, reducing the dots left on the board (lowering variation within), increasing the F-value
  • A large F-value indicates a significant difference

F = signal / noise = variance between / variance within

SLIDE 38

F Distribution (family of distributions)

F = signal / noise = variance between / variance within

variance between = [ Σᵢⁿ (ȳᵢ − ȳALL)² / (n − 1) ] × r
variance within = ( Σᵢⁿ varianceᵢ ) / n

P-value: the probability that signal > noise – the upper tail of the F distribution beyond the observed F

pf(F, df1, df2) – cumulative probability (percentiles, probabilities)
qf(p, df1, df2) – F quantiles

α = 0.05 corresponds to the 0.95 quantile of the F distribution
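In R the two functions map onto this picture as follows (the F-value and degrees of freedom below are hypothetical):

```r
# Upper-tail probability of an F-value (this is the ANOVA p-value):
F_value <- 4.5; df1 <- 2; df2 <- 18   # hypothetical F and degrees of freedom
1 - pf(F_value, df1, df2)             # probability of a signal this strong by chance
qf(0.95, df1, df2)                    # critical F at alpha = 0.05 (the 0.95 quantile)
```

Any F-value above the qf(0.95, …) cutoff has p < 0.05.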

SLIDE 39

How to report results from an ANOVA

Source of Variation      df   Sum of Squares   Mean Squares    F-value    P-value
Variety (A)               2           20             10         0.0263     0.9741
Farm (B)                  1       435243         435243      1125.7085      <0.05
Variety x Farm (AxB)      2       253561         126781       327.9046      <0.05
Error                    18         6959            387

SLIDE 40

How to report results from an ANOVA

Source of Variation      df   Sum of Squares   Mean Squares    F-value    P-value
Variety (A)               2           20             10         0.0263     0.9741
Farm (B)                  1       435243         435243      1125.7085      <0.05
Variety x Farm (AxB)      2       253561         126781       327.9046      <0.05
Error                    18         6959            387

P-values: 1 - pf(F_A, df_A, df_ERROR); 1 - pf(F_B, df_B, df_ERROR); 1 - pf(F_AxB, df_AxB, df_ERROR)
F-values: MS_A/MS_ERROR; MS_B/MS_ERROR; MS_AxB/MS_ERROR
Sums of Squares: MS_A*df_A; MS_B*df_B; MS_AxB*df_AxB; MS_ERROR*df_ERROR
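For example, the Variety (A) row of the table can be rebuilt by hand from the values shown on the slide:

```r
# Rebuild the Variety (A) row: MS = SS/df, F = MS/MS_error, p = upper tail of F.
SS_A <- 20; df_A <- 2; MS_error <- 387; df_error <- 18
MS_A <- SS_A / df_A            # 10
F_A  <- MS_A / MS_error        # ~0.026
1 - pf(F_A, df_A, df_error)    # ~0.974, close to the reported p-value of 0.9741
```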

SLIDE 41

How to report results from an ANOVA

Source of Variation df Sum of Squares Mean Squares F-value P-value Variety (A) 2 20 10 0.0263 0.9741 Farm (B) 1 435243 43243 1125.7085 <0.05 Variety x Farm (AxB) 2 253561 126781 327.9046 <0.05 Error 18 6959 387

If the interaction is significant – you should ignore the main effects because the story is not that simple!

SLIDE 42

Interaction plots – Different story under different conditions

[Figure: four interaction plots of yield (Farm1 vs. Farm2) for varieties A and B]

  • Plot 1: VARIETY is significant (*); FARM is significant (*); FARM2 has better yield than FARM1; No Interaction
  • Plot 2: VARIETY is not significant; FARM is significant (*); VARIETY A is better on FARM2 and VARIETY B is better on FARM1; Significant Interaction
  • Plot 3: VARIETY is significant (*); FARM is significant (*) with a small difference; main effects are significant, BUT hard to interpret with overall means; Significant Interaction
  • Plot 4: VARIETY is not significant; FARM is not significant; cannot distinguish a difference between VARIETY or FARM; No Interaction

SLIDE 43

Interaction plots – Different story under different conditions

  • An interaction detects non-parallel lines
  • Difficult to interpret interaction plots for more than a 2-way ANOVA
  • If the interaction effect is NOT significant then you can just interpret the main effects
  • BUT if you find a significant interaction you don't want to interpret main effects because the combination of treatment levels results in different outcomes

SLIDE 44

Pairwise comparisons – What to do when you have an interaction

a.k.a. pairwise t-tests. Number of comparisons:

C = t(t − 1) / 2,  where t = number of treatment levels

Lentil example: 3 VARIETIES (A, B, and C): A–B, A–C, B–C

C = t(t − 1) / 2 = 3(2) / 2 = 3

Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I Error at all. Experiment-wise Type I Error for α = 0.05:

probability of Type I Error = 1 − 0.95^C

Lentil example: probability of Type I Error = 1 − 0.95³ = 1 − 0.86 = 0.14. Significantly increased probability of making an error!

Therefore pairwise comparisons lead to a compromised experiment-wise α-level. You can correct for multiple comparisons by calculating an adjusted p-value (Bonferroni, Holm, etc.)
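The inflation, and a standard correction, can be checked in R (p.adjust() is in base stats; the three raw p-values are made up for illustration):

```r
t_levels <- 3
C <- t_levels * (t_levels - 1) / 2   # 3 pairwise comparisons among 3 levels
1 - 0.95^C                           # ~0.14: experiment-wise Type I error rate
# Bonferroni correction simply multiplies each raw p-value by C (capped at 1):
p.adjust(c(0.010, 0.040, 0.200), method = "bonferroni")
```

Holm ("holm", the p.adjust default) is a slightly less conservative alternative.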

SLIDE 45

Pairwise comparisons – Tukey Honest Significant Differences test

a.k.a. pairwise t-tests with adjusted p-values. If we have a significant interaction effect, use these values; if we have NO significant interaction effect, we can just look at the main effects.

Only need to consider relevant pairwise comparisons – think about it logically

SLIDE 46

How to report a significant difference in a graph

Matrix of pairwise significance for groups W, X, Y, Z:

        X     Y     Z
  W    NS     *    NS
  X           *    NS
  Y                NS

Letter codes for the bars W, X, Y, Z: A, A, B, A,B

Same letter = non-significant; different letter = significant

Create a matrix of significance and use it to code your graph

SLIDE 47

Permutational Non-parametric tests

  • PNPTs make NO assumptions, therefore any data can be used
  • PNPTs work with absolute differences, a.k.a. distances
  • Smaller values indicate similarity
  • Makes the calculations equivalent to sum-of-squares

Calculating D (delta) & its distribution:

D = signal / noise = distance between groups / distance within groups

  • For our test we can compare D to an expected distribution of D the same way we do when we calculate an F-value
  • Use permutations (iterations) to generate the distribution of D from our raw data
  • Therefore the shape of the D distribution is dependent on your data
SLIDE 48

Permutational Non-parametric tests

Determining the distribution of D

  • After you permute this process 5000 times (your choice) a distribution of D will emerge
  • Shape depends on your data – may be normal or not (it doesn't matter)

[Histogram of the permuted D values, with the observed D = 10 marked]

SLIDE 49

Permutational Non-parametric tests

Determining the distribution of D

  • After you permute this process 5000 times (your choice) a distribution of D will emerge
  • Shape depends on your data – may be normal or not (it doesn't matter)

[Histogram of the permuted D values, with the observed D = 10 marked]

4921 D calculations < 10 from permutations; 79 D calculations ≥ 10 from permutations. P-value: 79/5000 = 0.0158
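The whole procedure fits in a few lines of base R (simulated data; 5000 label permutations as on the slide):

```r
# Permutation test for a difference in group means, from scratch.
set.seed(3)
y <- c(rnorm(10, mean = 10), rnorm(10, mean = 12))   # two simulated groups
g <- rep(c("A", "B"), each = 10)
obs_D <- abs(diff(tapply(y, g, mean)))    # observed between-group distance
perm_D <- replicate(5000, {
  abs(diff(tapply(y, sample(g), mean)))   # shuffle labels, recompute the distance
})
mean(perm_D >= obs_D)                     # p-value: share of permutations >= observed
```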

SLIDE 50

Permutational ANOVA

Permutational ANOVA in R:

library(lmPerm)
summary(aovp(YIELD~FARM*VARIETY, seqs=T))

Pairwise Permutational ANOVA in R:

out1 = aovp(YIELD~FARM*VARIETY, seqs=T)
TukeyHSD(out1)

Permutational Non-parametric tests in R

The option seqs=T calculates sequential sums of squares (similar to regular ANOVA) – a good choice for balanced designs. You can change the maximum number of iterations with the maxIter= option.
SLIDE 51

Permutational Non-parametric tests

  • For parametric tests we know what the Normal, t-, and F-distributions look like
  • Therefore we can use the standard calculations (e.g. a t-value) to calculate statistics
  • When we violate the known distribution we need some other curve to work with
  • It is hard to estimate a theoretical distribution that fits your data
  • The best solution is to permute your data to generate a distribution
  • Permutational non-parametric statistics are just as powerful as parametric tests
  • This technique is similar to bootstrapping, but bootstrapping resamples the data rather than reshuffling the class of every observation
SLIDE 52

If permutational techniques are so good why not always use them?

  • You say permutational non-parametric tests are as powerful as parametric statistics – YES they are!
  • But they are still fairly new to statistical practice
  • Still unknown or not well understood among many users
  • Best practice is to stick with parametric statistics when you can, but when you can't, permutational tests are great options!

SLIDE 53

Extension of the Statistics Toolbox

Multivariate Tests (Rotation-Based)

“Definition of Statistics: The science of producing unreliable facts from reliable figures.”

Evan Esar (Humorist & Writer)

SLIDE 54
  • Population – Class of things (What you want to learn about)
  • Sample – group representing a class (What you actually study)
  • Experimental unit – individual research subject (e.g. location, entity, etc.)
  • Response variable – property of thing that you believe is the result of

predictors (What you actually measure – e.g. lentil height, patient response) a.k.a dependent variable

  • Predictor variable(s) – environment of things which you believe is

influencing a response variable (e.g. climate, topography, drug combination, etc.) a.k.a independent variable

  • Error - difference between an observed value (or calculated) value and its

true (or expected) value

A Reminder from Univariate Statistics…

SLIDE 55

Experimental Unit (row)

In Multivariate statistics:

  • Variables can be either numeric or categorical (depends on the

technique)

  • Focus is often placed on graphical representation of results

Example variables: frequency of species, climate variables, soil characteristics, nutrient concentrations, drug levels, etc. Example groups (Type): regions, ecosystems, forest types, treatments, etc.

Rotation-based Methods

[Table: one row per experimental unit; columns Data.ID, Type, Variable 1, Variable 2, Variable 3, Variable 4, …]

SLIDE 56

Final results based on multiple variables give different inferences than just 2 variables.

[Table of experimental units and variables; scatterplot of Variable 1 vs. Variable 2]

Find an equation to rotate the data so that each new axis explains multiple variables (e.g. one axis for Variables 1 and 2 and another for Variable 3; or one for Variables 1, 2, 4, 9, 10 and another for Variables 3, 6, 8).

Repeat the rotation process to achieve the analysis objective.

Rotation-based Methods

SLIDE 57

1. Rotate so that new axis explains the greatest amount of variation within the

data

Principal Component Analysis (PCA) Factor Analysis

2. Rotate so that the variation between groups is maximized

Discriminant Analysis (DISCRIM) Multivariate Analysis of Variance (MANOVA)

3. Rotate so that one dataset explains the most variation in another dataset

Canonical Correspondence Analysis (CCA)

Objective of Rotation-based Methods

SLIDE 58

Z1 = a11X1 + a12X2 + … + a1nXn

Here Z1 is the first principal component (a column vector), X1, …, Xn are column vectors of the original variables, and a11, …, a1n are the coefficients of the linear model.

  • PCA Objective: Find linear combinations of the original variables X1, X2, …, Xn to produce components Z1, Z2, …, Zn that are uncorrelated, in order of their importance, and that describe the variation in the original data.
  • Principal components are the linear combinations of the original variables
  • Principal component 1 is NOT a replacement for variable 1 – ALL variables are used to calculate each principal component

For each component: the constraint that a11² + a12² + … + a1n² = 1 ensures Var(Z1) is as large as possible.

The Math Behind PCA

SLIDE 59
  • Z2 is calculated using the same formula and constraint on the a2n values. However, there is an additional condition that Z1 and Z2 have zero correlation for the data.
  • The correlation condition continues for all successive principal components, i.e. Z3 is uncorrelated with both Z1 and Z2.
  • The number of principal components calculated will match the number of predictor variables included in the analysis.
  • The amount of variation explained decreases with each successive principal component.
  • Generally you base your inferences on the first two or three components because they explain the most variation in your data.
  • Typically when you include a lot of predictor variables the last couple of principal components explain very little (< 1%) of the variation in your data – not useful variables.

The Math Behind PCA

SLIDE 60

PCA in R:

princomp(dataMatrix, cor=T/F)   (stats package)

  • dataMatrix – the data matrix of predictor variables. You will assign the results to an object once the PCs have been calculated.
  • cor – defines whether the PCs are calculated from the correlation matrix or the covariance matrix (derived within the function from the data). You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales. Using the correlation matrix (cor=TRUE) standardizes the data before the PCs are calculated, removing the effect of the different units; note that princomp() itself defaults to the covariance matrix (cor=FALSE).
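A worked sketch on the built-in USArrests data used in the following slides (the variables are on different scales, so cor=TRUE is appropriate):

```r
pca <- princomp(USArrests, cor = TRUE)   # PCA on the correlation matrix
summary(pca)       # proportion of variance explained by each component
pca$loadings       # eigenvectors: how each variable drives each component
head(pca$scores)   # component scores per state, used for plotting
biplot(pca)        # scores and loading vectors in one figure
```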

PCA in R

SLIDE 61

PCA in R

SLIDE 62

Loadings – these are the correlations between the original predictor variables and the principal components Identifies which of the original variables are driving the principal component

Example:

Comp.1 – is negatively related to Murder, Assault, and Rape Comp.2 – is negatively related to UrbanPop

Eigenvectors

PCA in R

SLIDE 63

Scores – these are the calculated principal components Z1, Z2, …, Zn These are the values we plot to make inferences

PCA in R

SLIDE 64

Variance – summary of the output displays the variance explained by each principal component Identifies how much weight you should put in your principal components

Example:

Comp.1 – 62 % Comp.2 – 25% Comp.3 – 9% Comp.4 – 4 %

Eigenvalues divided by the sum of all eigenvalues (for a correlation-matrix PCA this sum equals the number of PCs)

PCA in R

SLIDE 65

Data points are plotted using their Comp.1 and Comp.2 scores (displaying row names). The direction of the arrows (+/−) indicates the trend of the points (towards the arrow indicates more of that variable). If vector arrows are perpendicular then the variables are not correlated. If your original variables do not have some level of correlation then PCA will NOT work for your analysis – i.e. you won't learn anything!

PCA in R - Biplot

SLIDE 66

WORK PERIOD 11:30 – 1:00

Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

SLIDE 67

Statistics Toolbox

What is the goal of my analysis? What kind of data do I have to answer my research question? How many variables do I want to include in my analysis? Does my data meet the analysis assumptions?

… to characterize my data. … to find if there is a significant difference between my groups … to see what predictor conditions are associated with my groups. … normally distributed? … equal variances? … multiple response variables? … continuous or discrete? … binary data? … single treatment or multiple treatment?

SLIDE 68

Thank You for Attending the Stats Workshop

If you have any further questions please feel free to contact me. Flow Cytometry Core Facility

LKG Consulting

Email: consulting.lkg@gmail.com Website: www.consultinglkg.com