Statistics Toolbox: A Review of Analysis Techniques for Scientific Research


slide-1
SLIDE 1

Statistics Toolbox

Professional Development Opportunity for the

Flow Cytometry Core Facility

December 7, 2018

LKG Consulting

Email: consulting.lkg@gmail.com Website: www.consultinglkg.com

A Review of Analysis Techniques for Scientific Research

slide-2
SLIDE 2

The goal of this workshop is to give you the knowledge & tools to be confident in your ability to collect & analyze your data, as well as correctly interpret your results… Think of me as your new resource!

slide-3
SLIDE 3

Laura Gray-Steinhauer

www.ualberta.ca/~lkgray
BSc in Mathematics, Statistics and Environmental Studies (UVIC, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)

Designated Professional Statistician with The Statistical Society of Canada (2014) Research: Climate Change, Policy Evaluation, Adaptation, Mitigation, Risk management for forest resources, Conservation…

A little about me…

slide-4
SLIDE 4

Workshop Schedule

8:15 – 8:30  Arrive at the Lab & start up the computers
8:30 – 8:45  Welcome to the Workshop (housekeeping & today's goals)
8:45 – 9:15  Statistics Toolbox – refresh useful vocabulary, introduce a decision tree to plan your analysis path
9:15 – 9:45  Hypothesis Testing – refresher on p-values, Type I and Type II error, and statistical power
9:45 – 10:00  Break
10:00 – 11:00  Parametric versus Non-Parametric Tests – testing for parametric assumptions, ANOVA, permutational ANOVA
11:00 – 12:00  Work period (questions are welcome)
12:00 – 1:00  Lunch
1:00 – 2:00  Correlation & Regression – differences and interpretation of correlations, as well as linear, multiple linear and logistic techniques
2:00 – 3:00  Work period (questions are welcome)
3:00 – 3:45  Multivariate Statistics – introduction to principal component analysis (PCA) and discriminant analysis (DISCRIM)
3:45 – 5:00  Work period (questions are welcome)
After 5:00  Enjoy your weekend!

This may be A LOT of information to absorb OR we may not cover the specific topic you came to learn in class today. Feel free to reach out to me via email with more questions: consulting.lkg@gmail.com.

slide-5
SLIDE 5

Workbook

  • Yours to keep!
  • R code is identified by Century Gothic font (everything else is Arial)
  • Arbitrary object names are bold to indicate these could change depending on what you name your variables.
  • Referenced data is provided at www.ualberta.ca/~lkgray
  • Please contact me to obtain permission to redistribute content outside of the workshop attendees.

Topics Included:

  • Descriptive statistics
  • Confidence intervals
  • Data distributions
  • Parametric assumptions
  • T-tests
  • ANOVA
  • ANCOVA
  • Non-parametric tests
  • Permutational ANOVA & t-tests
  • Z-test for proportions
  • Chi-squared test
  • Outlier tests and treatments
  • Correlation
  • Linear regression
  • Multiple linear regression
  • Akaike Information Criterion
  • Non-linear regression
  • Logistic regression
  • Binomial ANOVA
  • Principal component analysis (PCA)
  • Discriminant analysis
  • Multivariate analysis of variance (MANOVA)

slide-6
SLIDE 6

R Project Website

https://cran.r-project.org/index.html

slide-7
SLIDE 7

https://www.rstudio.com/

RStudio (IDE: Integrated Development Environment)

Preferred among programmers, we will use it in this workshop

slide-8
SLIDE 8

Statistics Toolbox

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”

Aaron Levenstein (Author)

slide-9
SLIDE 9

Statistical Vocabulary

Statistical Term | Real World | Research World

  • Population: Class of things (e.g. cancer patients) | What you want to learn about (e.g. cancer patients in Alberta)
  • Sample: Group representing a class (e.g. 1000 cancer patients in Alberta) | What you actually study (e.g. 1000 cancer patients from 10 treatment centres in Alberta)
  • Experimental Unit: Individual thing (e.g. each of the 1000 cancer patients) | Individual research subject (e.g. cancer patients n=1000; hospital populations n=10 – depends on the research question)
  • Dependent Variable: Property of things (e.g. white blood cell count) | What you measure about subjects (e.g. white blood cell count)
  • Independent Variable: Environment of things (e.g. treatment options, climate, etc.) | What you think might influence the dependent variable (e.g. amount of treatment, combination of treatments, etc.)
  • Data: Values of variables | What you record / the information you collect

slide-10
SLIDE 10
  • Experiment – any controlled process of study which results in data collection, and for which the outcome is unknown
  • Descriptive statistics – numerical/graphical summary of data
  • Inferential statistics – predict or control the values of variables (make conclusions with)
  • Statistical inference – makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken
  • Parameter – an unknown value (needs to be estimated) used to represent a population characteristic (e.g. population mean)
  • Statistic – estimation of a parameter (e.g. mean of a sample)
  • Sampling distribution (a.k.a. probability distribution or probability density function) – probability associated with each possible value of a variable
  • Error – difference between an observed (or calculated) value and its true (or expected) value

Other important statistical terms

Also see Appendix 1 in your workbook

slide-11
SLIDE 11

Statistics Toolbox

What is the goal of my analysis? What kind of data do I have to answer my research question? How many variables do I want to include in my analysis? Does my data meet the analysis assumptions?

slide-12
SLIDE 12

Analysis Goal | Parametric (assumptions met) | Non-parametric (alternative if assumptions fail) | Binomial (binary data / event likelihood)

  • Describe data characteristics: Mean, standard deviation, standard error, etc. | Median, quartiles, percentiles | Proportions. (Probability distributions and graphics are always appropriate to describe data.)
  • Compare 2 distinct/independent groups: T-test, paired t-test | Wilcoxon rank-sum test, Kolmogorov-Smirnov test, permutational t-test | Z-test for proportions
  • Compare > 2 distinct/independent groups: ANOVA, multi-way ANOVA, ANCOVA, blocking | Kruskal-Wallis test, Friedman rank test, permutational ANOVA | Chi-squared test, binomial ANOVA
  • Estimate the degree of association between 2 variables: Pearson's correlation | Spearman rank correlation, Kendall's rank correlation | Logistic regression
  • Predict outcome based on relationship: Linear regression, multiple linear regression, non-linear regression | Logistic regression, odds ratio

Statistics Toolbox

slide-13
SLIDE 13

If you have a continuous response variable… and one predictor variable:

  • Predictor is categorical, two treatment levels: T-test; permutational t-test; Kolmogorov-Smirnov (KS) test; Wilcoxon test
  • Predictor is categorical, > two treatment levels: One-way ANOVA; Kruskal-Wallis test; Friedman rank test
  • Predictor is continuous: Pearson's correlation; Spearman's rank correlation; Kendall's rank correlation; linear/non-linear regression

What you get (non-parametric / parametric / regression / binomial):

  • P-value indicating if 2 groups are significantly different
  • P-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs
  • Correlation coefficient indicating direction and magnitude of the relationship
  • "Goodness of fit" indicating how well the predictor is linked to the response (R² or AIC)

slide-14
SLIDE 14

… and two or more predictor variables:

  • Predictors are categorical (two or more treatment levels for each predictor): Multi-way ANOVA; permutational ANOVA; blocking; ANCOVA
  • Predictors are continuous: Multiple regression

What you get (non-parametric / parametric / regression / binomial):

  • P-value indicating if there is a significant effect of each treatment.
  • Size of a significant effect (no interactions).
  • Need to consider the possibility of interactions.
  • Need pairwise comparisons with adjusted p-values to determine the difference among treatments with interactions.
  • Also get the effect of the blocking term and/or undesired covariate.
  • Do not need to consider the interaction between treatments and blocks and/or covariates.
  • Fit of how well predictors are linked to the response variable (adjusted R², AIC).
  • P-values to indicate which predictors significantly affect the response variable.

slide-15
SLIDE 15

If you have a categorical response variable…

…and one predictor variable:
  • Predictor is categorical, two treatment levels: Z-test for proportions
  • Predictor is categorical, two or more treatment levels: Chi-squared test; binomial ANOVA
  • Predictor is continuous: Logistic regression

…and two or more predictor variables (predictors continuous): Logistic regression

What you get (non-parametric / parametric / regression / binomial):

  • P-value indicating if 2 groups are significantly different.
  • P-value indicating there is a significant effect of each treatment (or of "treatment").
  • Size of a significant effect (no interactions).
  • Need to consider the possibility of interactions.
  • Need pairwise comparisons with adjusted p-values to determine the difference among treatments with interactions (i.e. to find where the difference between groups occurs).
  • Fit of how well predictors are linked to the response variable (adjusted R², AIC).
  • P-values to indicate which predictors significantly affect the response variable.

slide-16
SLIDE 16

Example research questions: Does the yield of different lentil varieties differ between the 2 farms? Do the varieties differ among themselves? Does the density of the plants impact their average height?

The Lentil datasets (You are now a farmer)

[Diagram: Farm 1 and Farm 2, each divided into plots with 1 variety per plot; individual lentil plants of varieties A, B and C are shown within the plots]

slide-17
SLIDE 17

Datasets Available in R

  • Over 100 datasets are available for you to use
  • We will use:
  • iris: the famous (Fisher's or Anderson's) iris data set gives the sepal and petal measurements for 50 flowers from each of Iris setosa, versicolor, and virginica.
  • USArrests: data on US arrests for violent crimes by US state.
slide-18
SLIDE 18

Hypothesis Testing

“Statistics are no substitute for judgment.”

Henry Clay (Former US Senator)

slide-19
SLIDE 19

Formal hypothesis testing

[Figure: two samples, A and B, drawn from a population; their mean heights are compared – is the difference due to random chance?]

H0: ȳA = ȳB
H1: ȳA ≠ ȳB

If the actual p-value < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1).

slide-20
SLIDE 20

“Is this difference due to random chance?”

[Figure: distributions of mean height for samples A and B drawn from the population]

P-value – the probability the observed value or larger is due to random chance

Theory: We can never really prove if the 2 samples are truly different or the same – only ask if what we observe (or a greater difference) is due to random chance

How to interpret p-values: P-value = 0.05 – “Yes, 1 out of 20 times.” P-value = 0.01 – “Yes, 1 out of 100 times.”

The lower the probability that a difference is due to random chance, the more likely it is the result of an effect (what we test for).

In other words: “Is random chance a plausible explanation?”
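To make the p-value interpretation concrete, here is a minimal R sketch (not from the workbook) comparing the mean heights of two simulated samples; the object names groupA and groupB are arbitrary:

# Two-sample t-test sketch: is the observed difference in mean height
# plausibly due to random chance?
set.seed(42)
groupA <- rnorm(20, mean = 170, sd = 5)   # simulated heights, sample A
groupB <- rnorm(20, mean = 174, sd = 5)   # simulated heights, sample B
t.test(groupA, groupB)   # output includes the p-value: the probability of seeing
                         # a difference this large (or larger) by random chance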

slide-21
SLIDE 21

Type I Error – rejecting the null hypothesis (H0) when it is actually true

Type II Error – failing to reject the null hypothesis (H0) when it is not true

Remember: rejection or acceptance based on a p-value (and therefore the chance you will make an error) depends on the arbitrary α-level you choose

  • Decreasing the α-level will decrease the probability of making a Type I Error, but this increases the probability of making a Type II Error

The α-level you choose is completely up to you (typically it is set at 0.05); however, it should be chosen with consideration of the consequences of making a Type I or a Type II Error. Based on your study, would you rather err on the side of false positives or false negatives?

  • Fail to reject the null hypothesis when the null hypothesis is true: ☺ correct decision
  • Fail to reject the null hypothesis when the alternative hypothesis is true: incorrect decision (false negative, Type II Error)
  • Reject the null hypothesis when the null hypothesis is true: incorrect decision (false positive, Type I Error)
  • Reject the null hypothesis when the alternative hypothesis is true: correct decision

slide-22
SLIDE 22

Example: Will current forests adequately protect genetic resources under climate change?

Birch Mountain Wildlands

H0: Range of the current climate for the BMW protected area = range of the BMW protected area under climate change

Ha: Range of the current climate for the BMW protected area ≠ range of the BMW protected area under climate change

If we reject H0: climate ranges are different, therefore genetic resources are not adequately protected and new protected areas need to be created

Consequences if I make a:

  • Type I Error: Climates are actually the same and genetic resources are indeed adequately protected in the BMW protected area – we created new parks when we didn't need to
  • Type II Error: Climates are different and genetic resources are vulnerable – we didn't create new protected areas and we should have

From an ecological standpoint it is better to make a Type I Error, but from an economic standpoint it is better to make a Type II Error. Which standpoint should I take?

slide-23
SLIDE 23

Power is your ability to reject the null hypothesis when it is false (i.e. your ability to detect an effect when there is one). There are many ways to increase power:

  • 1. Increase your sample size (sample more of the population)
  • 2. Increase your alpha value (e.g. from 0.01 to 0.05) – watch for Type I Error!
  • 3. Use a one-tailed test (you know the direction of the expected effect)
  • 4. Use a paired test (control and treatment are the same sample)

Given you are testing whether or not what you observed (or greater) is due to random chance, more data gives you a better understanding of what is truly happening within the population; therefore a larger sample size will decrease the probability of making a Type II Error.

Statistical Power
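As a hedged illustration of the sample-size/power trade-off, base R's power.t.test() can be used; the effect size (delta) and standard deviation below are made-up values:

# What sample size per group is needed to detect a difference of 5 units
# (sd = 8) with 80% power at alpha = 0.05?
power.t.test(delta = 5, sd = 8, sig.level = 0.05, power = 0.80)
# And what power do we have if we only collect n = 10 per group?
power.t.test(n = 10, delta = 5, sd = 8, sig.level = 0.05)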

slide-24
SLIDE 24

BREAK 9:45 – 10:00

Go grab a coffee. Next we will cover specific tools in your new tool box.

slide-25
SLIDE 25

Statistics Toolbox

Parametric versus Non-Parametric Tests

“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”

Andrew Lang (Scottish poet)

slide-26
SLIDE 26

Univariate Test Options

Parametric tests
  • Characteristics: analysis to test group means; based on raw data; more statistical power than non-parametric tests
  • Assumptions: independent samples; normality (data OR errors); homogeneity of variances
  • When to use: parametric assumptions are met; OR non-normal data BUT larger sample size (CLT), however equal variances must be met
  • Examples: t-test; ANOVA (one-way, two-way, paired)

Non-parametric tests
  • Characteristics: analysis to test group medians; based on ranked data; less statistical power
  • Assumptions: independent samples
  • When to use: parametric assumptions are not met; medians better represent your data (skewed data distribution); small sample size; ordinal data, ranked data, or outliers that you can't remove
  • Examples: Wilcoxon rank-sum test; Kruskal-Wallis test; permutational tests (non-traditional)

slide-27
SLIDE 27

Assumption #1: Independence of samples

“Your samples have to come from a randomized or randomly sampled design.”

  • Meaning rows in your data do NOT influence one another
  • Address this with experimental design (3 main things to consider):
  • 1. Avoid pseudoreplication and potential confounding factors by designing your experiment in a randomized design
  • 2. Avoid systematic arrangements, which are distinct patterns in how treatments are laid out. If your treatments affect one another, the individual treatment effects could be masked or overinflated
  • 3. Maintain temporal independence. If you need to take multiple samples from one individual over time, record and test your data considering the change in time (e.g. paired tests)

NOTE: ANOVA needs to have at least 1 degree of freedom – this means you need at least 2 reps per treatment to execute an ANOVA. Rule of Thumb: you need more rows than columns.

slide-28
SLIDE 28

The Normal Distribution

s² = Σj=1..n (yj − ȳ)² / (n − 1)      SD = √s²

Based on this curve:

  • 68.27% of observations are within 1 stdev of ȳ
  • 95.45% of observations are within 2 stdev of ȳ
  • 99.73% of observations are within 3 stdev of ȳ

For confidence intervals:

  • 95% of observations are within 1.96 stdev of ȳ

The base of parametric statistics

slide-29
SLIDE 29

Assumption #2: Data/Experimental errors are normally distributed

[Figure: Farm 1 plots of varieties A, B and C with their group means (ȳA,Farm1, ȳB,Farm1, ȳC,Farm1) and the overall mean (ȳABC,Farm1); residuals are the distances of individual plants from these means]

Residuals: "If I were to repeat my sampling and calculate the means, those means would be normally distributed."

Determine if the assumption is met by:

  • 1. Looking at the residuals of your sample
  • 2. Shapiro-Wilk test for normality – if your data is mainly unique values
  • 3. D'Agostino-Pearson normality test – if you have lots of repeated values
  • 4. Lilliefors normality test – mean and variance are unknown
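Of the tests listed above, the Shapiro-Wilk test is available in base R; a minimal sketch on the residuals of a one-way model (the data and object names are purely illustrative):

# Normality check sketch on model residuals (illustrative data)
set.seed(1)
yield   <- c(rnorm(10, 100, 10), rnorm(10, 120, 10))
variety <- factor(rep(c("A", "B"), each = 10))
mod <- aov(yield ~ variety)
shapiro.test(residuals(mod))   # p > 0.05: no evidence against normality
hist(residuals(mod))           # also inspect the residuals visually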
slide-30
SLIDE 30

t-distribution (sampling distribution) versus the Normal distribution

Assumption #2: Data/Experimental errors are normally distributed

You may not need to worry about Normality?

Central Limit Theorem: "Sample means tend to cluster around the central population value." Therefore:

  • When sample size is large, you can assume that ȳ is close to the value of μ
  • With a small sample size you have a better chance of getting a mean that is far off the true population mean
slide-31
SLIDE 31

Assumption #2: Data/Experimental errors are normally distributed

You may not need to worry about Normality?

Central Limit Theorem: "Sample means tend to cluster around the central population value." Therefore:

  • When sample size is large, you can assume that ȳ is close to the value of μ
  • With a small sample size you have a better chance of getting a mean that is far off the true population mean

What does this mean?

  • For large N, the assumption of Normality can be relaxed
  • You may have decreased power to detect a difference among groups, BUT your test is not really compromised if your residuals are not normal
  • The assumption of Normality is important when: 1. Very small N; 2. Data is highly non-normal; 3. Significant outliers are present; 4. Small effect size

slide-32
SLIDE 32

Assumption #3: Equal variances between groups/treatments

[Figure: two distributions with the same mean but different spread – ȳA = 12, sA = 4 versus ȳB = 12, sB = 6]

Let’s say 5% of the A data fall above this threshold But >5% of the B data fall above the same threshold

So with larger variances, you can expect a greater number of observations at the extremes of the distributions This can have real implications on inferences we make from comparisons between groups

slide-33
SLIDE 33

[Figure: Farm 1 plots of varieties A, B and C – residuals around the group means]

Residuals: "Does the known probability of observations between my two samples hold true?"

Determine if the assumption is met by:

  • 1. Looking at the residuals of your sample
  • 2. Bartlett Test

Assumption #3: Equal variances between treatments
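A minimal sketch of checking this assumption with base R's bartlett.test() and a residual plot; the data and object names are illustrative:

# Equal-variance check sketch (illustrative data)
set.seed(2)
yield <- c(rnorm(10, 100, 5), rnorm(10, 100, 12))
farm  <- factor(rep(c("Farm1", "Farm2"), each = 10))
bartlett.test(yield ~ farm)         # small p-value suggests unequal variances
mod <- aov(yield ~ farm)
plot(fitted(mod), residuals(mod))   # look for a cone/fan shape in the spread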

slide-34
SLIDE 34

Assumption #3: Equal variances between treatments

Testing for Equal Variances – Residual Plots

[Figure: four residual plots – observed values (original units) plotted against predicted values]

  • NORMAL distribution: equal number of points along observed; EQUAL variances: equal spread on either side of the mean predicted value = 0 → Good to go!
  • NON-NORMAL distribution: unequal number of points along observed; EQUAL variances: equal spread on either side of the mean predicted value = 0 → Optional to fix
  • NORMAL/NON-NORMAL: look at a histogram or test; UNEQUAL variances: cone shape – away from or towards zero → This needs to be fixed for ANOVA (transformations)
  • OUTLIERS: points that deviate from the majority of data points → This needs to be fixed for ANOVA (transformations or removal)
slide-35
SLIDE 35
  • Treatment – predictor variable (e.g. variety, fertilization, irrigation, etc.)
  • Treatment level – groups within treatments (e.g. A, B, C or Control, 1xN, 2xN)
  • Covariate – undesired, uncontrolled, confounding predictor variable
  • F-value = variation between / variation within = mean square treatment / mean square error
  • P-value – probability that the observed difference (or larger) in the treatment means is due to random chance

Analysis of Variance (ANOVA) – Vocabulary

slide-36
SLIDE 36

[Figure: Farm 1 plots of varieties A, B and C with the group means (ȳA,Farm1, ȳB,Farm1, ȳC,Farm1) and the overall mean (ȳABC,Farm1) used to calculate residuals]

Analysis of Variance (ANOVA)

F-value = variation between / variation within = mean square treatment / mean square error = SIGNAL / NOISE

variance between = [ Σi (ȳi − ȳALL)² / (n − 1) ] × r      (n = number of groups; r = observations per group)
variance within = Σi variancei / n      (the average of the within-group variances)

slide-37
SLIDE 37

Analysis of Variance (ANOVA)

Think of Pac-Man!

  • All of the dots on the board represent the Total Variation in your study
  • Every treatment you use in your analysis is a different Pac-Man player on the board
  • The amount of dots each player eats represents variation between (i.e. the amount of variation each treatment can explain)
  • The amount of dots left on the board after all players have died represents the variation within
  • If players have a big effect they will eat more dots, reducing the dots left on the board (lowering variation within) and increasing the F-value
  • A large F-value indicates a significant difference

F = signal / noise = variance between / variance within

slide-38
SLIDE 38

F Distribution (family of distributions)

F = signal / noise = variance between / variance within

variance between = [ Σi (ȳi − ȳALL)² / (n − 1) ] × (number of observations per treatment)
variance within = Σi variancei / n

[Figure: the F distribution (a family of distributions) – the area to the right of the observed F (signal > noise) is the P-value; the 0.50 and 0.95 percentiles are marked, with α = 0.05]

In R: pf(F, df1, df2) returns probabilities (percentiles) and qf(p, df1, df2) returns quantiles of the F distribution.
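A small sketch of pf() and qf() in use, taking the interaction F-value and degrees of freedom from the ANOVA table on the next slide purely for illustration:

# Upper-tail p-value for an observed F of 327.9 with df1 = 2 and df2 = 18
pf(327.9, df1 = 2, df2 = 18, lower.tail = FALSE)
# Critical F value at alpha = 0.05 for the same degrees of freedom
qf(0.95, df1 = 2, df2 = 18)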

slide-39
SLIDE 39

How to report results from an ANOVA

Source of Variation      df   Sum of Squares   Mean Squares    F-value     P-value
Variety (A)               2             20             10        0.0263     0.9741
Farm (B)                  1         435243         435243     1125.7085     <0.05
Variety x Farm (AxB)      2         253561         126781      327.9046     <0.05
Error                    18           6959            387

slide-40
SLIDE 40

How to report results from an ANOVA

Source of Variation      df   Sum of Squares   Mean Squares    F-value     P-value
Variety (A)               2             20             10        0.0263     0.9741
Farm (B)                  1         435243         435243     1125.7085     <0.05
Variety x Farm (AxB)      2         253561         126781      327.9046     <0.05
Error                    18           6959            387

How each value is obtained:
  • Sum of Squares = Mean Square × df (MSA × dfA, MSB × dfB, MSAxB × dfAxB, MSERROR × dfERROR)
  • F-value = Mean Square of the effect / MSERROR (MSA/MSERROR, MSB/MSERROR, MSAxB/MSERROR)
  • P-value = upper tail of the F distribution, e.g. pf(FA, dfA, dfERROR, lower.tail = FALSE)

slide-41
SLIDE 41

How to report results from an ANOVA

Source of Variation      df   Sum of Squares   Mean Squares    F-value     P-value
Variety (A)               2             20             10        0.0263     0.9741
Farm (B)                  1         435243         435243     1125.7085     <0.05
Variety x Farm (AxB)      2         253561         126781      327.9046     <0.05
Error                    18           6959            387

If the interaction is significant – you should ignore the main effects because the story is not that simple!
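A hedged sketch of producing a table like the one above with aov(); the data frame is simulated and the names (LENTILS, YIELD, VARIETY, FARM) are illustrative stand-ins for the workbook data:

# Two-way ANOVA sketch (simulated lentil-style data)
set.seed(3)
LENTILS <- expand.grid(VARIETY = c("A", "B", "C"),
                       FARM    = c("Farm1", "Farm2"),
                       rep     = 1:4)
LENTILS$YIELD <- rnorm(nrow(LENTILS), mean = 100, sd = 10)
mod <- aov(YIELD ~ VARIETY * FARM, data = LENTILS)
summary(mod)   # Df, Sum Sq, Mean Sq, F value and Pr(>F), as in the table above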

slide-42
SLIDE 42

Interaction plots – Different story under different conditions

[Figure: four interaction plots of Yield for varieties A and B at Farm1 and Farm2]

  • VARIETY is significant (*); FARM is significant (*); FARM2 has better yield than FARM1; no interaction
  • VARIETY is not significant; FARM is significant (*); VARIETY A is better on FARM2 and VARIETY B is better on FARM1; significant interaction
  • VARIETY is significant (*); FARM is significant (*) – small difference; main effects are significant, BUT hard to interpret with overall means; significant interaction
  • VARIETY is not significant; FARM is not significant; cannot distinguish a difference between VARIETY or FARM; no interaction

slide-43
SLIDE 43

Interaction plots – Different story under different conditions

  • An interaction detects non-parallel lines
  • It is difficult to interpret interaction plots for more than a 2-way ANOVA
  • If the interaction effect is NOT significant then you can just interpret the main effects
  • BUT if you find a significant interaction you don't want to interpret main effects, because the combination of treatment levels results in different outcomes
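A minimal sketch of drawing such a plot with base R's interaction.plot(); the data below are simulated, so the pattern is only illustrative:

# Interaction plot sketch (simulated data)
set.seed(4)
FARM    <- factor(rep(c("Farm1", "Farm2"), each = 12))
VARIETY <- factor(rep(c("A", "B"), times = 12))
YIELD   <- rnorm(24, mean = 100, sd = 10) +
           ifelse(FARM == "Farm2" & VARIETY == "A", 15, 0)
interaction.plot(FARM, VARIETY, YIELD)   # non-parallel lines suggest an interaction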

slide-44
SLIDE 44

Pairwise comparisons – What to do when you have an interaction

a.k.a. pairwise t-tests. Number of comparisons:

C = t(t − 1) / 2, where t = number of treatment levels

Lentil example: 3 VARIETIES (A, B and C) give the comparisons A–B, A–C and B–C:

C = t(t − 1) / 2 = 3(2) / 2 = 3

Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I Error at all. Experiment-wise Type I Error for α = 0.05:

probability of Type I Error = 1 − 0.95^C

Lentil example: probability of Type I Error = 1 − 0.95³ = 1 − 0.86 = 0.14. A significantly increased probability of making an error!

Therefore pairwise comparisons lead to a compromised experiment-wise α-level. You can correct for multiple comparisons by calculating an adjusted p-value (Bonferroni, Holm, etc.).

slide-45
SLIDE 45

Pairwise comparisons – Tukey Honest Differences (Test)

a.k.a. pairwise t-tests with adjusted p-values. If we have a significant interaction effect – use these values. If we have NO significant interaction effect – we can just look at the main effects.

Only need to consider relevant pairwise comparisons – think about it logically
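A minimal sketch of Tukey's HSD (and the pairwise t-test alternative) in R; the data are simulated and the names illustrative:

# Tukey HSD sketch after a one-way ANOVA (simulated data)
set.seed(5)
dat <- data.frame(VARIETY = factor(rep(c("A", "B", "C"), each = 8)),
                  YIELD   = rnorm(24, mean = rep(c(100, 110, 130), each = 8), sd = 8))
mod <- aov(YIELD ~ VARIETY, data = dat)
TukeyHSD(mod)   # pairwise differences with adjusted p-values
pairwise.t.test(dat$YIELD, dat$VARIETY, p.adjust.method = "bonferroni")   # alternative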

slide-46
SLIDE 46

How to report a significant difference in a graph

[Figure: pairwise significance matrix for groups W, X, Y and Z – each pair marked NS (not significant) or * (significant) – and a bar chart of W, X, Y and Z labelled with the letter codes A, B and A,B derived from that matrix]

Same letter = non-significant difference; different letter = significant difference.

Create a matrix of significance and use it to code your graph.

slide-47
SLIDE 47

Plotting “Bars” – What do you want to show in your graph?

Standard Error versus Standard Deviation

"How confident are we in our statistic?" Standard error – the standard deviation of a statistic. The standard error of the mean reflects the overall distribution of the means you would get from repeatedly resampling.

SEȳ = s / √n

Small values = the more representative the sample will be of the overall population. Large values = the less likely the sample adequately represents the overall population.

"How much dispersion from the average exists?" Standard deviation – the amount of variation or dispersion within a set of data values.

s² = Σj=1..n (yj − ȳ)² / (n − 1)      s = √s²

Small values = data points are very close to the mean. Large values = data points are spread out over a wide range.
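Both formulas translate directly into R; a tiny sketch with an arbitrary example vector:

# Standard deviation versus standard error of the mean (illustrative vector)
y <- c(12, 15, 9, 14, 11, 13, 10, 16)
sd(y)                     # dispersion of the data values
sd(y) / sqrt(length(y))   # standard error of the mean, s / sqrt(n)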
slide-48
SLIDE 48

Nonparametric Tests – When assumptions fail

  • NPTs make no assumptions for normality, equal variances, or outliers
  • Because of the lack of assumptions, NPTs are not as powerful as standard parametric tests
  • NPTs work with ranked data
  • If you were to repeatedly sample from the same non-normal population and repeatedly calculate the difference in rank-sums, the distribution of your differences would appear normal with a mean of zero
  • The spread of rank-sum data (variance) is a function of your sample size (max rank value)
  • P-value: "What is the probability that I get a difference as big or bigger in my rank sums by random chance?"
  • If there are no treatment effects, the expectation is that the difference among rank-sums is zero

slide-49
SLIDE 49

Nonparametric Tests – When assumptions fail

T-test equivalents when your data distributions are similarly shaped

  • Wilcoxon Signed Ranks Test – (one-sample t-test) Tests a hypothesis about the location (median) of a population distribution
  • Wilcoxon Mann-Whitney Test – (two-sample t-test) Tests the null hypothesis that two populations have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all

T-test equivalent when distributions are of different shape

  • Kolmogorov-Smirnov Test (less powerful than the Wilcoxon rank-sum tests)
  • (one-sample) Tests whether or not the sample of data is consistent with a specified distribution function
  • (two-sample) Tests whether or not these two samples may reasonably be assumed to come from the same distribution

One-way ANOVA equivalent for non-normal distributions

  • Kruskal-Wallis Test – Tests the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the samples differ only with respect to location (median), if at all.
  • Interaction rules apply, so a significant interaction must be followed up by pairwise Wilcoxon tests, comparing each of the treatment levels
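A minimal sketch of the base-R calls for these tests (simulated, skewed data; names are illustrative):

# Non-parametric test sketch (simulated skewed samples)
set.seed(6)
a <- rexp(15, rate = 1)     # skewed sample A
b <- rexp(15, rate = 0.5)   # skewed sample B
wilcox.test(a, b)           # Wilcoxon rank-sum (Mann-Whitney) test
ks.test(a, b)               # two-sample Kolmogorov-Smirnov test
g <- factor(rep(c("A", "B", "C"), each = 10))
x <- rexp(30, rate = rep(c(1, 0.7, 0.5), each = 10))
kruskal.test(x ~ g)         # Kruskal-Wallis test for more than 2 groups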

slide-50
SLIDE 50

Permutational Non-parametric tests

  • PNPTs make NO assumptions, therefore any data can be used
  • PNPTs work with absolute differences, a.k.a. distances
  • Smaller values indicate similarity
  • This makes the calculations equivalent to sums of squares

D = signal / noise = distance between groups / distance within groups

Calculating D (delta) and its distribution:

  • For our test we can compare D to an expected distribution of D the same way we do when we calculate an F-value
  • Use permutations (iterations) to generate the distribution of D from our raw data
  • Therefore the shape of the D distribution is dependent on your data
slide-51
SLIDE 51

Permutational Non-parametric tests

Determining the distribution of D

  • After you permute this process 5000 times (your choice) a distribution of D will emerge
  • The shape depends on your data – it may be normal or not (it doesn't matter)

[Figure: histogram of permuted D values with the observed D = 10 marked]

slide-52
SLIDE 52

Permutational Non-parametric tests

Determining the distribution of D

  • After you permute this process 5000 times (your choice) a distribution of D will emerge
  • The shape depends on your data – it may be normal or not (it doesn't matter)

[Figure: histogram of permuted D values with the observed D = 10 marked]

4921 D calculations < 10 from permutations; 79 D calculations ≥ 10 from permutations. P-value: 79/5000 = 0.0158

slide-53
SLIDE 53

Permutational ANOVA

Permutational ANOVA in R:

library(lmPerm)
summary(aovp(YIELD ~ FARM*VARIETY, seqs=T))

Pairwise Permutational ANOVA in R:

out1 = aovp(YIELD ~ FARM*VARIETY, seqs=T)
TukeyHSD(out1)

Permutational Non-parametric tests in R

The option seqs=T calculates sequential sums of squares (similar to a regular ANOVA) – a good choice for balanced designs. You can change the maximum number of iterations with the maxIter= option.
slide-54
SLIDE 54

Permutational Non-parametric tests

  • For parametric tests we know what the Normal, t- and F-distributions look like
  • Therefore we can use the standard calculations (e.g. the t-value) to calculate statistics
  • When we violate the known distribution we need some other curve to work with
  • It is hard to estimate a theoretical distribution that fits your data
  • The best solution is to permute your data to generate a distribution
  • Permutational non-parametric statistics are just as powerful as parametric tests
  • This technique is similar to bootstrapping
  • But bootstrapping resamples the data rather than reshuffling all observation classes
slide-55
SLIDE 55

If permutational techniques are so good why not always use them?

  • You say permutational non-parametric tests are as powerful as parametric statistics – YES they are!
  • But they are still fairly new to statistical practice
  • They are still unknown or not well understood among many users
  • Best practice is to stick with parametric statistics when you can, but when you can't, permutational tests are great options!

slide-56
SLIDE 56

WORK PERIOD 11:00 – 12:15

Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

slide-57
SLIDE 57

LUNCH 12:15 – 1:00

Get some brain food – more statistics coming this afternoon!

slide-58
SLIDE 58

Statistics Toolbox

Regression

“If the statistics are boring, then you've got the wrong numbers.”

Edward R. Tufte (Statistics Professor, Yale University)

slide-59
SLIDE 59

Correlation Coefficients

[Figure: scatterplots of response versus predictor illustrating r = 1, 1 > r > 0, r = 0, −1 < r < 0, and r = −1]

Positive relationship:
  • Increase in X = increase in Y
  • r = 1 doesn't have to be a one-to-one relationship

Negative relationship:
  • Increase in X = decrease in Y
  • r = −1 doesn't have to be a one-to-one relationship

No relationship:
  • Increase in X has no (or no consistent) effect on Y

r = correlation coefficient, range −1 to 1

slide-60
SLIDE 60

Correlation Methods

  • Pearson's Correlation:
    – Requires parametric assumptions – the relationship order (direction) and magnitude of the data values are determined
  • Kendall's & Spearman's Correlation:
    – Non-parametric (based on ranks) – the relationship order (direction) of the data values is determined; magnitude cannot be taken from this value because it is based on ranks, not raw data
    – Be careful with inferences made with these – order is OK (positive vs negative), but the magnitude is misleading

Comparison between methods

  • Kendall and Spearman coefficients will likely be larger than Pearson coefficients for the same data because the coefficients are calculated on ranks rather than the raw data
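A minimal sketch of the three methods via cor.test(); x and y are made-up vectors:

# Correlation sketch (simulated vectors)
set.seed(7)
x <- rnorm(30)
y <- 0.6 * x + rnorm(30, sd = 0.8)
cor.test(x, y, method = "pearson")    # parametric
cor.test(x, y, method = "spearman")   # rank-based
cor.test(x, y, method = "kendall")    # rank-based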

slide-61
SLIDE 61

Dealing with Multiple Inferences

Making inferences from tables of correlation coefficients and p-values

  • If we want to use multiple correlation coefficients and p-values to make general conclusions we need to be cautious about inflating our Type I Error due to the multiple tests/comparisons

Research Question: Does lentil growth depend on climate?

Climate variable   Correlation w/ growth (r2)   p-value
Temp Jan   0.03   0.4700
Temp Feb   0.24   0.2631
Temp Mar   0.38   0.1235
Temp Apr   0.66   0.0063
Temp May   0.57   0.0236
Temp Jun   0.46   0.1465
Temp Jul   0.86   0.0001
Temp Aug   0.81   0.0036
Temp Sep   0.62   0.0669
Temp Oct   0.43   0.1801
Temp Nov   0.46   0.1465
Temp Dec   0.07   0.4282

Answer (based on a cursory examination of this table): yes, there are significant relationships with temperature in April, May, July, and August at α = 0.05. But this is not quite right – we need to adjust the p-values for multiple inferences.
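A hedged sketch of adjusting such a set of p-values with p.adjust(); the vector simply reuses the twelve monthly p-values from the table above:

# Adjusting p-values for multiple inferences
p <- c(0.4700, 0.2631, 0.1235, 0.0063, 0.0236, 0.1465,
       0.0001, 0.0036, 0.0669, 0.1801, 0.1465, 0.4282)
p.adjust(p, method = "bonferroni")   # conservative adjustment
p.adjust(p, method = "holm")         # step-down adjustment, less conservative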

slide-62
SLIDE 62

Correlation DOES NOT imply causation! A relationship DOES NOT imply causation!

Both of these values imply a relationship rather than one factor causing another factor value Be careful of your interpretations!

Important to Remember

slide-63
SLIDE 63

Correlation vs Causation

Example:

If you look at historic records there is a highly significant positive correlation between ice cream sales and the number of drowning deaths. Do you think drowning deaths cause ice cream sales to increase? Of course NOT! Both occur in the summer months – therefore there is another mechanism responsible for the observed relationship.

slide-64
SLIDE 64

Linear Regression

Output from R Estimate of model parameters (intercept and slope) Standard error of estimates Tests the null hypothesis that the coefficient is equal to zero (no effect)

A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable A large p-value suggests that changes in the predictor are not associated with changes in the response

Coefficient of determination a.k.a “Goodness of fit”

Measure of how close the data are to the fitted regression line

The significance of the overall relationship described by the model
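A minimal sketch of fitting and summarizing a simple linear regression; the simulated data and object names are illustrative, and summary() returns the pieces described above (coefficient estimates, standard errors, t-tests against zero, R² and the overall F-test):

# Simple linear regression sketch (simulated data)
set.seed(8)
predictor <- runif(40, 0, 10)
response  <- 2 + 1.5 * predictor + rnorm(40, sd = 2)
mod <- lm(response ~ predictor)
summary(mod)   # Estimate, Std. Error, t value, Pr(>|t|), R-squared, F-statistic
# plot(mod)    # optional residual diagnostics (see the assumptions slide)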

slide-65
SLIDE 65

Linear Regression Assumptions

  • 1. For any given value of X, the distribution of Y must be normal
  • BUT Y does not have to be normally distributed as a whole
  • 2. For any given value of X, the distribution of Y must have equal variances

You can again check this by using the Shapiro test, Bartlett test, and residual plots on the residuals of your model (see section 4.1)

No assumptions for X – but be conscious of your data. The relationship you detect is obviously reflective of the data you include in your study.

slide-66
SLIDE 66

Multiple Linear Regression

ID   DBH   VOL   AGE   DENSITY
1    11.5  1.09  23    0.55
2     5.5  0.52  24    0.74
3    11.0  1.05  27    0.56
4     7.6  0.71  23    0.71
5    10.0  0.95  22    0.63
6     8.4  0.78  29    0.63

Relating the model back to the data table: DENSITY is the response variable (Y), AGE is predictor variable 1 (x1) and VOL is predictor variable 2 (x2).

Multiple linear regression: y = β0 + β1*x1 + β2*x2, i.e. DENSITY = intercept + β1*AGE + β2*VOL

β1 and β2 are what I need to multiply AGE and VOL by (respectively) to get the predicted value of DENSITY. Remember the differences between the observed and predicted DENSITY are our regression residuals. Smaller residuals = better model.
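A hedged sketch of the corresponding lm() call; the data frame name is arbitrary and simply re-enters the six rows of the table above (DBH is included but not used in this particular model):

# Multiple linear regression sketch using the small table above
dat <- data.frame(DBH     = c(11.5, 5.5, 11.0, 7.6, 10.0, 8.4),
                  VOL     = c(1.09, 0.52, 1.05, 0.71, 0.95, 0.78),
                  AGE     = c(23, 24, 27, 23, 22, 29),
                  DENSITY = c(0.55, 0.74, 0.56, 0.71, 0.63, 0.63))
mod <- lm(DENSITY ~ AGE + VOL, data = dat)
summary(mod)     # beta estimates for the intercept, AGE and VOL
residuals(mod)   # observed minus predicted DENSITY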

slide-67
SLIDE 67

Multiple Linear Regression

Output from R Estimate of model parameters (βi values) Standard error of estimates Tests the null hypothesis that the coefficient is equal to zero (no effect)

A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable A large p-value suggests that changes in the predictor are not associated with changes in the response

Coefficient of determination a.k.a “Goodness of fit”

Measure of how close the data are to the fitted regression line Adjusted R2

The significance of the overall relationship described by the model

slide-68
SLIDE 68

Non-Linear Regression

  • NLR makes no assumptions for normality, equal variances, or outliers
  • However the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply
  • We don't have to worry about statistical power here because we are fitting relationships
  • All we care about is if (or how well) we can model the relationship between our response and predictor variables

Assumptions

slide-69
SLIDE 69

Non-Linear Regression

Curve Examples

There are MANY more examples you could choose from – what makes sense for your data?

slide-70
SLIDE 70

Non-Linear Regression

Curve Fitting

Procedure:

  • 1. Plot your variables to visualize the relationship
    a. What curve does the pattern resemble?
    b. What might alternative options be?
  • 2. Decide on the curves you want to compare and run a non-linear regression curve fitting
    a. You will have to estimate your parameters from your curve to have starting values for your curve-fitting function
  • 3. Once you have parameters for your curves, compare the models with AIC
  • 4. Plot the model with the lowest AIC on your point data to visualize the fit
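A minimal sketch of this procedure with nls(): simulate data that follow an asymptotic-rise curve, fit two candidate curves from rough starting values, and compare them with AIC (all names, curves and starting values are illustrative):

# Non-linear curve-fitting sketch (simulated data, rough starting values)
set.seed(9)
x <- seq(1, 20, length.out = 40)
y <- 5 * (1 - exp(-0.3 * x)) + rnorm(40, sd = 0.3)
fit1 <- nls(y ~ a * (1 - exp(-b * x)), start = list(a = 4, b = 0.1))   # candidate 1
fit2 <- nls(y ~ a * x / (b + x),       start = list(a = 5, b = 2))     # candidate 2
AIC(fit1, fit2)                      # lower AIC = better fit/complexity balance
plot(x, y); lines(x, fitted(fit1))   # visualize the chosen curve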

slide-71
SLIDE 71

Non-Linear Regression

Output from R Non-linear model that we fit

Simplified logarithmic with slope=0

Estimates of model parameters Residual sum-of-squares for your non-linear model Number of iterations needed to estimate the parameters If you are stuck for starting points in your R code this website may be able to help: http://www.xuru.org/rt/NLR.asp Copy paste data & desired model formula

slide-72
SLIDE 72

Non-Linear Regression

  • Calculating an R² is NOT APPROPRIATE for non-linear regression
  • Why?
  • For linear models, the sums of the squared errors always add up in a specific manner: SS_Regression + SS_Error = SS_Total
  • Therefore R² = SS_Regression / SS_Total, which mathematically must produce a value between 0 and 100%
  • But in non-linear regression SS_Regression + SS_Error ≠ SS_Total
  • Therefore the ratio used to construct R² is biased in non-linear regression
  • It is best to use the AIC value and the residual sum-of-squares to pick the best model, then plot the curve to visualize the fit

R² for "goodness of fit"

slide-73
SLIDE 73

Akaike’s Information Criterion (AIC)

How do we decide which model is best?

  • AIC considers both the fit of the model and the model complexity
  • Complexity is measured as the number of parameters or the use of higher-order polynomials
  • Allows us to balance over- and under-fitting in our modelled relationships
    – We want a model that is as simple as possible, but no simpler
    – A reasonable amount of explanatory power is traded off against model complexity
    – AIC measures this balance for us
  • Can be calculated for any kind of model, allowing comparisons across different modelling approaches and model-fitting techniques
    – The model with the lowest AIC value is the model that fits your data best (e.g. minimizes your model residuals)

Hirotugu Akaike, 1927–2009: in the 1970s he used information theory to build a numerical equivalent of Occam's razor. Occam's razor: all else being equal, the simplest explanation is the best one.

  • For model selection, this means the simplest model is preferred to a more complex one
  • Of course, this needs to be weighed against the ability of the model to actually predict anything

slide-74
SLIDE 74

Logistic Regression (a.k.a logit regression)

Relationship between a binary response variable and predictor variables

  • The binary response variable can be considered a class (1 or 0)
  • Yes or No
  • Present or Absent
  • The linear part of the logistic regression equation is used to find the probability of being in a category based on the combination of predictors
  • Predictor variables are usually (but not necessarily) continuous
  • But it is harder to make inferences from regression outputs that use discrete or categorical variables

Logit model: y = e^(β0 + β1x1 + β2x2 + … + βnxn) / (1 + e^(β0 + β1x1 + β2x2 + … + βnxn))

slide-75
SLIDE 75

Logistic Regression (a.k.a logit regression)

  • Logistic regression makes no assumptions for normality, equal variances, or outliers
  • However the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply
  • Logistic regression assumes the response variable is binary (0 & 1)
  • We don't have to worry about statistical power here because we are fitting relationships
  • All we care about is if (or how well) we can model the relationship between our response and predictor variables

Assumptions

slide-76
SLIDE 76

Binomial distribution vs Normal distribution

  • Key difference: values are continuous (Normal) vs discrete (Binomial)
  • As sample size increases the binomial distribution appears to resemble the normal distribution
  • The binomial distribution is a family of distributions because the shape references both the number of observations and the probability of "getting a success" (a value of 1): "What is the probability of x successes in n independent and identically distributed Bernoulli trials?"
  • Bernoulli trial (or binomial trial) – a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted

slide-77
SLIDE 77

Regression Differences

Logistic Regression
  • References the binomial distribution
  • Estimates the probability (p) of an event occurring (y=1) rather than not occurring (y=0) from knowledge of the relevant independent variables (our data)
  • Regression coefficients are estimated using maximum likelihood estimation (an iterative process)

Linear Regression
  • References the Gaussian (normal) distribution
  • Uses ordinary least squares to find a best-fitting line and estimates the parameters that predict the change in the dependent variable for a change in the independent variable

[Figure: response (y) plotted against predictor (x) for each regression type]

slide-78
SLIDE 78

Maximum likelihood estimation

  • A complex iterative process to find the coefficient values that maximize the likelihood function
  • Likelihood function – the probability of the occurrence of an observed set of values X and Y given a function with defined parameters

Process:

  • 1. Begins with a tentative solution for each coefficient
  • 2. Revises it slightly to see if the likelihood function can be improved
  • 3. Repeats this revision until the improvement is minute, at which point the process is said to have converged

slide-79
SLIDE 79

Logistic Regression (a.k.a logit regression)

Output from R Estimate of model parameters (intercept and slope) Standard error of estimates Tests the null hypothesis that the coefficient is equal to zero (no effect)

A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable A large p-value suggests that changes in the predictor are not associated with changes in the response

AIC value for the model
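A minimal sketch of fitting such a model with glm() and the binomial family; the NITROGEN/SURVIVAL variables are simulated stand-ins that echo the odds-ratio example later in the deck:

# Logistic regression sketch (simulated binary response)
set.seed(10)
NITROGEN <- runif(100, 0, 800)
SURVIVAL <- rbinom(100, size = 1, prob = plogis(-2 + 0.005 * NITROGEN))
mod <- glm(SURVIVAL ~ NITROGEN, family = binomial)
summary(mod)   # coefficient estimates, standard errors, z-tests and the AIC value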

slide-80
SLIDE 80

Binomial ANOVA

  • Can be used when there are 2 or more predictor variables and a binomial response
  • Uses output from a generalized linear model referencing logistic regression and the binomial distribution – acts like a chi-squared test
  • First, build a logistic model
  • Next, input the resulting model into the ANOVA test with a specification to calculate p-values using the chi-squared distribution rather than the F-distribution (reserved for parametric statistics)
  • Provides a good indication of which predictor variables to include in your logistic model
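A hedged sketch of that two-step recipe: fit the logistic model, then pass it to anova() asking for chi-squared p-values (simulated data, illustrative names):

# Binomial ANOVA sketch: deviance table with chi-squared p-values
set.seed(11)
dose  <- runif(80, 0, 10)
block <- factor(rep(c("B1", "B2"), each = 40))
alive <- rbinom(80, size = 1, prob = plogis(-1 + 0.4 * dose))
mod <- glm(alive ~ dose + block, family = binomial)
anova(mod, test = "Chisq")   # deviance per term with chi-squared p-values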

slide-81
SLIDE 81

Binomial ANOVA

Rather than sums of squares and mean sums of squares, the output now shows the deviance and residual deviance for each parameter. It tests the null hypothesis that the variable has no effect on achieving a "success" (value of 1).

A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to achieving a success in the response variable. A large p-value suggests that changes in the predictor are not associated with success in the response.

Remember deviance is a measure of the lack of fit between the model and the data, with larger values indicating poorer fit.

slide-82
SLIDE 82

Odds Ratios

  • Determine the odds of a "success" based on the predictors and the modelled relationship
  • The odds ratio for the model is the increase in odds above the value of the intercept when you add a unit to the predictor(s).
  • E.g.: a one-unit increase in NITROGEN increases the odds of survival in a plant by 0.22
  • Odds ratios can be converted into an estimated probability of survival for a given value of the predictor(s) using the logit model equation and the coefficient estimates from the logistic model.

Logit model: y = e^(β0 + β1x1 + … + βnxn) / (1 + e^(β0 + β1x1 + … + βnxn))

  • E.g.: if 200 units of NITROGEN are applied to a plot there is a 40% chance of plant survival, but if NITROGEN is increased to 500 or 750 units the probability of survival increases to 77% and 93%, respectively.
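A minimal sketch of extracting odds ratios and predicted probabilities from a fitted glm(); it reuses the simulated NITROGEN/SURVIVAL idea from the logistic regression sketch, so its numbers will not reproduce the 40%/77%/93% example above:

# Odds ratios and predicted probabilities (simulated data)
set.seed(12)
NITROGEN <- runif(100, 0, 800)
SURVIVAL <- rbinom(100, size = 1, prob = plogis(-2 + 0.005 * NITROGEN))
mod <- glm(SURVIVAL ~ NITROGEN, family = binomial)
exp(coef(mod))   # odds ratios: multiplicative change in odds per unit of NITROGEN
predict(mod, newdata = data.frame(NITROGEN = c(200, 500, 750)),
        type = "response")   # estimated survival probabilities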

slide-83
SLIDE 83

WORK PERIOD 2:00 – 3:00

Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

slide-84
SLIDE 84

Extension of the Statistics Toolbox

Multivariate Tests (Rotation-Based)

“Definition of Statistics: The science of producing unreliable facts from reliable figures.”

Evan Esar (Humorist & Writer)

slide-85
SLIDE 85
  • Population – class of things (what you want to learn about)
  • Sample – group representing a class (what you actually study)
  • Experimental unit – individual research subject (e.g. location, entity, etc.)
  • Response variable – property of a thing that you believe is the result of predictors (what you actually measure – e.g. lentil height, patient response); a.k.a. dependent variable
  • Predictor variable(s) – environment of things which you believe is influencing a response variable (e.g. climate, topography, drug combination, etc.); a.k.a. independent variable
  • Error – difference between an observed (or calculated) value and its true (or expected) value

A Reminder from Univariate Statistics…

slide-86
SLIDE 86

Experimental Unit (row)

In multivariate statistics:

  • Variables can be either numeric or categorical (depends on the technique)
  • Focus is often placed on graphical representation of results

[Table schematic: each row is an experimental unit (Data.ID 1, 2, 3, …) with a Type column (e.g. regions, ecosystems, forest types, treatments) and columns Variable 1, Variable 2, Variable 3, Variable 4, … (e.g. frequency of species, climate variables, soil characteristics, nutrient concentrations, drug levels)]

Rotation-based Methods

slide-87
SLIDE 87

Final results based on multiple variables give different inferences than 2 variables

[Figure: scatterplot of Variable 1 versus Variable 2 for the data table; the data are rotated so that one new axis explains multiple variables (e.g. Variables 1, 2, 4, 9, 10) and another explains others (e.g. Variables 3, 6, 8)]

Find an equation to rotate the data so that an axis explains multiple variables, then repeat the rotation process to achieve the analysis objective.

Rotation-based Methods

slide-88
SLIDE 88

1. Rotate so that the new axis explains the greatest amount of variation within the data
   Principal Component Analysis (PCA); Factor Analysis

2. Rotate so that the variation between groups is maximized
   Discriminant Analysis (DISCRIM); Multivariate Analysis of Variance (MANOVA)

3. Rotate so that one dataset explains the most variation in another dataset
   Canonical Correspondence Analysis (CCA)

Objective of Rotation-based Methods

slide-89
SLIDE 89

Z1 = a11X1 + a12X2 + … + a1nXn

Z1 is the first principal component (a column vector), X1 … Xn are column vectors of the original variables, and a11 … a1n are the coefficients of the linear model.

  • PCA objective: find linear combinations of the original variables X1, X2, …, Xn to produce components Z1, Z2, …, Zn that are uncorrelated, in order of their importance, and that describe the variation in the original data.
  • Principal components are the linear combinations of the original variables
  • Principal component 1 is NOT a replacement for variable 1 – all variables are used to calculate each principal component

For each component: the constraint that a11² + a12² + … + a1n² = 1 ensures Var(Z1) is as large as possible.

The Math Behind PCA

slide-90
SLIDE 90
  • Z2 is calculated using the same formula and constraint on the a2n values; however, there is an additional condition that Z1 and Z2 have zero correlation for the data
  • The correlation condition continues for all successive principal components, i.e. Z3 is uncorrelated with both Z1 and Z2
  • The number of principal components calculated will match the number of predictor variables included in the analysis
  • The amount of variation explained decreases with each successive principal component
  • Generally you base your inferences on the first two or three components because they explain the most variation in your data
  • Typically when you include a lot of predictor variables the last couple of principal components explain very little (< 1%) of the variation in your data – not useful variables

The Math Behind PCA

slide-91
SLIDE 91

PCA in R:

princomp(dataMatrix, cor=T/F)   (stats package)

  • dataMatrix – the data matrix of predictor variables; you will assign the results back to an object once the PCs have been calculated
  • cor= – defines whether the PCs should be calculated using the correlation or the covariance matrix (derived within the function from the data). You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales. The correlation matrix standardizes the data before the PCs are calculated, removing the effect of the different units (note that princomp() itself defaults to cor = FALSE, i.e. the covariance matrix).

PCA in R
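A minimal sketch using the built-in USArrests data mentioned earlier; the output pieces (loadings, scores, variance explained, biplot) are what the following slides walk through:

# PCA sketch on the USArrests data
pca <- princomp(USArrests, cor = TRUE)   # correlation matrix: variables on different scales
summary(pca)       # proportion of variance explained by each component
loadings(pca)      # relationships between original variables and components
head(pca$scores)   # component scores, the values we plot
biplot(pca)        # points plus variable arrows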

slide-92
SLIDE 92

PCA in R

slide-93
SLIDE 93

Loadings – these are the correlations between the original predictor variables and the principal components Identifies which of the original variables are driving the principal component

Example:

Comp.1 – is negatively related to Murder, Assault, and Rape Comp.2 – is negatively related to UrbanPop

Eigenvectors

PCA in R

slide-94
SLIDE 94

Scores – these are the calculated principal components Z1, Z2, …, Zn These are the values we plot to make inferences

PCA in R

slide-95
SLIDE 95

Variance – summary of the output displays the variance explained by each principal component Identifies how much weight you should put in your principal components

Example:

Comp.1 – 62 % Comp.2 – 25% Comp.3 – 9% Comp.4 – 4 %

Eigenvalues divided by the number of PCs

PCA in R

slide-96
SLIDE 96

Data points are plotted using their Comp.1 and Comp.2 scores (row names are displayed). The direction of the arrows (+/−) indicates the trend of the points (towards the arrow indicates more of that variable). If vector arrows are perpendicular then the variables are not correlated. If your original variables do not have some level of correlation then PCA will NOT work for your analysis – i.e. you won't learn anything!

PCA in R - Biplot

slide-97
SLIDE 97
  • DISCRIM objective: rotate the data so that the variation between groups is maximized ("reduce complexity").
  • "What distinguishes my groups?"
  • A different question compared to PCA, which maximizes the variation explained
  • Discriminant functions (DF) are linear combinations of the original variables.
  • A DF value is created for every observation in the dataset (like PCA scores)
  • NOT average group measurements, but rather measurements on individuals within the pre-determined groups.

For each function:

DF1 = aX1 + bX2 + … + zXn

DF1 is the linear discriminant (a column vector), X1 … Xn are column vectors of the original variables, and a, b, …, z are the coefficients of the linear model.

The Math Behind Discriminant Analysis (DISCRIM)

slide-98
SLIDE 98
  • 1. Find the axis that gives the greatest separation between 2 groups
  • 2. Fix that axis
  • 3. Rotate around the fixed axis to maximize the difference between the first 2 groups and the 3rd group
  • 4. Repeat steps 2 & 3 for all groups included

[Figure: original axes x and y rotated through angle α to new axes x' and y']

How DISCRIM works

slide-99
SLIDE 99

  • Proportion of variance explained by the linear discriminants
  • Mean observation values for the variables in each pre-defined group
  • The initial (prior) probability of belonging to a group (more important for predicting class)
  • Coefficients of linear discriminants are the solutions to our linear functions
  • MASS will only display solutions for the most significant linear discriminants; discriminants that explain a very small portion of the variance are removed

DISCRIM in R – MASS package output
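A minimal sketch of a discriminant analysis with MASS::lda() on the iris data mentioned earlier, including a prediction for a new observation (the skull example two slides on works the same way); object names are arbitrary:

# Discriminant analysis sketch with the MASS package
library(MASS)
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)
fit                    # priors, group means, coefficients, proportion of trace
head(predict(fit)$x)   # discriminant function values for each observation
newobs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0,
                     Petal.Length = 4.5, Petal.Width = 1.5)
predict(fit, newobs)$class   # which pre-defined group is the new observation closest to?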

slide-100
SLIDE 100

  • Proportion of variance explained by the linear discriminants
  • Mean discriminant values for each pre-defined group; standard errors of the means are also given
  • By querying the analysis structure we can see the discriminant loadings, which tell us the relationship between the DF values and the original variables (like PCA)
  • Again, candisc will only display solutions for the discriminants that explain the most variation
  • Less information is displayed in the candisc output, but you can get the loadings, which are important! candisc also produces a nicer plot

DISCRIM in R – candisc package output

slide-101
SLIDE 101

Problem: a new skull is found, but we don't know whether it belongs to Homo erectus or Homo habilis, or if it's a new group.

[Figure: discriminant plot of skull measurements showing the Homo erectus and Homo habilis groups, their group centroids, and the new find (unknown origin)]

How predictions work:

1. Calculate the group centroids
2. Find out which centroid is closest to the unknown data point

New groups are defined when we find a significant difference between the new find and the predefined groups. This is a popular method in taxonomy and anthropology.

Using DISCRIM to predict which group

slide-102
SLIDE 102

WORK PERIOD 3:45 – 5:00

Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

slide-103
SLIDE 103

Statistics Toolbox

What is the goal of my analysis? What kind of data do I have to answer my research question? How many variables do I want to include in my analysis? Does my data meet the analysis assumptions?

… to characterize my data. … to find if there is a significant difference between my groups … to see what predictor conditions are associated with my groups. … normally distributed? … equal variances? … multiple response variables? … continuous or discrete? … binary data? … single treatment or multiple treatment?

slide-104
SLIDE 104

Thank You for Attending the Stats Workshop

If you have any further questions please feel free to contact me. Flow Cytometry Core Facility

LKG Consulting

Email: consulting.lkg@gmail.com Website: www.consultinglkg.com