Statistics Toolbox in R
A Review of Analysis Techniques for Scientific Research
Professional Development Opportunity for the Flow Cytometry Core Facility
December 7, 2018
LKG Consulting
Email: consulting.lkg@gmail.com Website: www.consultinglkg.com
Laura Gray-Steinhauer
www.ualberta.ca/~lkgray
BSc in Mathematics, Statistics and Environmental Studies (UVic, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)
Designated Professional Statistician with the Statistical Society of Canada (2014)
Research: climate change, policy evaluation, adaptation, mitigation, risk management for forest resources, conservation…
Workshop Schedule
8:15 – 8:30 Arrive at the lab & start up the computers
8:30 – 8:45 Welcome to the workshop (housekeeping & today's goals)
8:45 – 9:15 Statistics Toolbox: refresh useful vocabulary, introduce a decision tree to plan your analysis path
9:15 – 9:45 Hypothesis Testing: refresher on p-values, Type I and Type II error, and statistical power
9:45 – 10:00 Break
10:00 – 11:00 Parametric versus Non-Parametric Tests: testing for parametric assumptions, ANOVA, permutational ANOVA
11:00 – 12:00 Work period (questions are welcome)
12:00 – 1:00 Lunch
1:00 – 2:00 Correlation & Regression: differences and interpretation of correlations, as well as linear, multiple linear and logistic techniques
2:00 – 3:00 Work period (questions are welcome)
3:00 – 3:45 Multivariate Statistics: introduction to principal component analysis (PCA) and discriminant analysis (DISCRIM)
3:45 – 5:00 Work period (questions are welcome)
After 5:00 Enjoy your weekend!
This may be A LOT of information to absorb OR we may not cover the specific topic you came to learn in class today. Feel free to reach out to me via email with more questions: consulting.lkg@gmail.com.
Topics included: testing parametric assumptions, parametric and non-parametric tests, principal component analysis (PCA), and multivariate analysis of variance (MANOVA).
Download R: https://cran.r-project.org/index.html
Download RStudio: https://www.rstudio.com/
RStudio is preferred among programmers; we will use it in this workshop.
“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Aaron Levenstein (Author)
Statistical Term | Real World | Research World
Population | Class of things (e.g. cancer patients) | What you want to learn about (e.g. cancer patients in Alberta)
Sample | Group representing a class (e.g. 1000 cancer patients in Alberta) | What you actually study (e.g. 1000 cancer patients from 10 treatment centres in Alberta)
Experimental Unit | Individual thing (e.g. each of the 1000 cancer patients) | Individual research subject (e.g. cancer patients n=1000, hospital populations n=10 – depends on research question)
Dependent Variable | Property of things (e.g. white blood cell count) | What you measure about subjects (e.g. white blood cell count)
Independent Variable | Environment of things (e.g. treatment options, climate, etc.) | What you think might influence the dependent variable (e.g. amount of treatment, combination of treatments, etc.)
Data | Values of variables | What you record / information you collect
Experiment – a procedure for which the outcome is unknown
Statistical inference – drawing conclusions (inferences) about the population from which the sample was taken
Parameter – a population characteristic (e.g. population mean)
Probability distribution – the probability associated with each possible value of a variable
Error – deviation from the true (or expected) value
Also see Appendix 1 in your workbook
What is the goal of my analysis? What kind of data do I have to answer my research question? How many variables do I want to include in my analysis? Does my data meet the analysis assumptions?
Analysis Goal | Parametric (assumptions met) | Non-Parametric (alternative if assumptions fail) | Binomial (binary data / event likelihood)
Describe data characteristics | Mean, standard deviation, standard error, etc. | Median, quartiles, percentiles | Proportions
Compare 2 distinct/independent groups | T-test, paired t-test | Wilcoxon rank-sum test, Kolmogorov-Smirnov test, permutational t-test | Z-test for proportions
Compare > 2 distinct/independent groups | ANOVA, multi-way ANOVA, ANCOVA, blocking | Kruskal-Wallis test, Friedman rank test, permutational ANOVA | Chi-squared test, binomial ANOVA
Estimate the degree of association between 2 variables | Pearson's correlation | Spearman rank correlation, Kendall's rank correlation | Logistic regression
Predict outcome based on relationship | Linear regression, multiple linear regression, non-linear regression | (see non-parametric correlations) | Logistic regression, odds ratio
Probability distributions are always appropriate to describe data. Graphics are always appropriate to describe data.
If you have a continuous response variable… and one predictor variable:
- Predictor is categorical, two treatment levels: T-test (parametric); permutational t-test, Kolmogorov-Smirnov (KS) test, Wilcoxon test (non-parametric)
- Predictor is categorical, more than two treatment levels: One-way ANOVA (parametric); Kruskal-Wallis test, Friedman rank test (non-parametric)
- Predictor is continuous: Pearson's correlation (parametric); Spearman's rank correlation, Kendall's rank correlation (non-parametric); linear/non-linear regression
What you get:
- A p-value indicating whether groups are significantly different, i.e. whether there is a significant effect of "treatment", plus post-hoc comparisons to find where the difference between groups lies.
- A correlation coefficient indicating the direction and magnitude of the relationship.
- A measure of how well the predictor is linked to the response (R2 or AIC).
… and two or more predictor variables:
- Predictors are categorical, two or more treatment levels for each predictor: Multi-way ANOVA, blocking, ANCOVA (parametric); permutational ANOVA, blocking (non-parametric)
- Predictors are continuous: Multiple regression
What you get:
- A p-value for each treatment and interaction, with post-hoc comparisons (adjusted p-values) to determine the differences among treatments with interactions, accounting for covariates.
- A measure of how well predictors are linked to the response variable (adjusted R2, AIC) and which predictors significantly affect the response variable.
If you have a categorical response variable… and one predictor variable:
- Predictor is categorical, two treatment levels: Z-test for proportions
- Predictor is categorical, two or more treatment levels: Chi-squared test
- Predictor is continuous: Logistic regression
What you get: a p-value indicating whether groups are significantly different, with pairwise comparisons to find where the difference between groups lies.
… and two or more predictor variables:
- Predictors are categorical (two or more treatment levels): Binomial ANOVA
- Predictors are continuous: Logistic regression
What you get:
- A p-value indicating whether there is a significant effect of each treatment (and the possibility of interactions), with adjusted p-values to determine the differences among treatments with interactions.
- A measure of how well predictors are linked to the response variable (adjusted R2, AIC) and which predictors significantly affect the response variable.
Example research questions: Does the yield of different lentil varieties differ between 2 farms? Do the varieties differ among themselves? Does the density of the plants impact their average height?
[Diagram: Farm 1 and Farm 2, each divided into plots with 1 variety in each plot (A, B, or C); individual lentil plants are measured within each plot.]
Example dataset (iris): sepal and petal measurements for 50 flowers from each of Iris setosa, versicolor, and virginica.
“Statistics are no substitute for judgment.”
Henry Clay (Former US Senator)
Does the sample represent the population? "Is this difference in mean height due to random chance?"
H0: x̄A = x̄B
H1: x̄A ≠ x̄B
If the actual p < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1).
“Is this difference due to random chance?”
P-value – the probability the observed value or larger is due to random chance
Theory: We can never really prove if the 2 samples are truly different or the same – only ask if what we observe (or a greater difference) is due to random chance
How to interpret p-values: P-value = 0.05 – “Yes, 1 out of 20 times.” P-value = 0.01 – “Yes, 1 out of 100 times.”
The lower the probability that a difference is due to random chance, the more likely it is the result of an effect (what we test for).
In other words: “Is random chance a plausible explanation?”
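As a minimal sketch of this idea, the R code below draws two samples from the same population, so any observed difference in means really is due to random chance alone (all numbers here are illustrative, not from the workshop data):

```r
# Two samples from the SAME population: any mean difference is random chance
set.seed(123)
a <- rnorm(30, mean = 12, sd = 4)   # e.g. heights of sample A
b <- rnorm(30, mean = 12, sd = 4)   # e.g. heights of sample B

result <- t.test(a, b)
result$p.value
# A large p-value means random chance is a plausible explanation for the
# observed difference; even with no true effect, p < 0.05 will occur for
# about 1 in 20 such experiments (the Type I Error rate at alpha = 0.05).
```

Rerunning without `set.seed` shows the p-value bouncing around, which is exactly what "due to random chance" means.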
Type I Error – rejecting the null hypothesis (H0) when it is actually true.
Type II Error – failing to reject the null hypothesis (H0) when it is not true.
Remember: rejection or acceptance of the null hypothesis based on a p-value (and therefore the chance you will make an error) depends on the arbitrary α-level you choose.
β – the probability of making a Type II Error (power = 1 − β).
The α-level you choose is completely up to you (typically it is set at 0.05); however, it should be chosen with consideration of the consequences of making a Type I or a Type II Error. Based on your study, would you rather err on the side of false positives or false negatives?
 | Null hypothesis is true | Alternative hypothesis is true
Fail to reject the null hypothesis | ☺ Correct decision | Incorrect decision: false negative (Type II Error)
Reject the null hypothesis | Incorrect decision: false positive (Type I Error) | ☺ Correct decision
Example: Will current forests adequately protect genetic resources
under climate change?
Birch Mountain Wildlands
H0: Range of the current climate for the BMW protected area = range of the climate of the BMW protected area under climate change
Ha: Range of the current climate for the BMW protected area ≠ range of the climate of the BMW protected area under climate change
If we reject H0: the climate ranges are different, therefore genetic resources are not adequately protected and new protected areas need to be created.
Consequences if I make:
- a Type I Error: genetic resources were indeed adequately protected in the BMW protected area, and we created new parks when we didn't need to.
- a Type II Error: genetic resources were actually vulnerable, and we didn't create new protected areas when we should have.
From an ecological standpoint it is better to make a Type I Error, but from an economic standpoint it is better to make a Type II Error. Which standpoint should I take?
Power is your ability to reject the null hypothesis when it is false (i.e. your ability to detect an effect when there is one). There are many ways to increase power; for example, raising the α-level increases power, but at the cost of more Type I Error!
The most reliable way is to increase sample size: given you are testing whether what you observed (or greater) is due to random chance, more data gives you a better understanding of what is truly happening within the population. Therefore, increasing sample size will decrease the probability of making a Type II Error.
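The sample-size effect can be seen directly with R's built-in `power.t.test`; the effect size and standard deviation below are illustrative stand-ins, not values from the workshop:

```r
# Power to detect a difference of 2 units (sd = 4) at alpha = 0.05,
# for a two-sample t-test with n observations per group:
p_small <- power.t.test(n = 10,  delta = 2, sd = 4, sig.level = 0.05)$power
p_large <- power.t.test(n = 100, delta = 2, sd = 4, sig.level = 0.05)$power
p_small   # low power with small n
p_large   # much higher power with large n

# Ask the reverse question: what n per group gives 80% power?
power.t.test(power = 0.8, delta = 2, sd = 4, sig.level = 0.05)$n
```

Running this before data collection (a priori power analysis) tells you whether your planned design can realistically detect the effect you care about.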
Go grab a coffee. Next we will cover specific tools in your new tool box.
“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”
Andrew Lang (Scottish poet)
Type | Parametric | Non-Parametric
Characteristics | Based on distribution parameters (e.g. means and variances) | Rank- or permutation-based; make fewer assumptions than parametric tests
Assumptions | Normality, equal variances | Relaxed or none
When to use? | Normality can be relaxed with a large sample size (CLT), however equal variances must be met | When parametric assumptions cannot be met (skewed data distribution)
Examples | T-test (incl. paired), ANOVA (traditional) | Wilcoxon tests, Kruskal-Wallis test, permutational tests
“Your samples have to come from a randomized or randomly sampled design.”
(Pseudoreplication is something to consider.) Lay out your experiment in a randomized design, and pay attention to how treatments are laid out; otherwise effects could be masked or overinflated. If measurements are repeated over time, test your data considering the change in time (e.g. paired tests).
NOTE: ANOVA needs at least 1 degree of freedom for error, which means you need at least 2 reps per treatment to execute an ANOVA. Rule of thumb: you need more rows (observations) than columns (variables).
s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)    SD = √s²
[Figure: the normal curve. Based on this curve we know the expected proportion of observations within given distances of x̄, and for confidence intervals the expected range around x̄.]
The normal distribution is the base of parametric statistics.
[Diagram: Farm 1 plots of varieties A, B, C with the plot means x̄A,Farm1, x̄B,Farm1, x̄C,Farm1 and the overall mean x̄ABC,Farm1; residuals are the deviations of individual plants from these means.]
Residuals: "If I were to repeat my sampling and calculate the means, those means would be normally distributed."
Determine if the assumption is met by examining the residuals, e.g. with the Shapiro test, or by comparing their distribution (the sampling distribution) against the t-distribution / normal distribution.
You may not need to worry about normality. Central Limit Theorem: "Sample means tend to cluster around the central population value." Therefore, as sample size increases, x̄ is close to the value of μ.
What does this mean? Your test is not really compromised if your residuals are not normal, except when: 1. N is very small; 2. the data are highly non-normal; 3. significant outliers are present; 4. the effect size is small.
[Figure: two normal curves on a common axis (0 to 24), both centred at 12: x̄A = 12, sA = 4; x̄B = 12, sB = 6.]
Let's say 5% of the A data fall above some threshold; because B has the larger variance, more than 5% of the B data fall above the same threshold. So with larger variances you can expect a greater number of observations at the extremes of the distributions. This has real implications for the inferences we make from comparisons between groups.
Residuals: "Does the known probability of observations hold true between my two samples?" (i.e. do the groups have equal variances?)
Determine if the assumption is met by testing for equal variances and examining residual plots: plot the residuals (observed values in original units) against the predicted values and check for even scatter around the residual value = 0 line.
ANOVA compares means across levels of one or more treatments (e.g. variety, fertilization, irrigation, etc.), each with two or more levels (e.g. A, B, C or Control, 1xN, 2xN).
F = variation between / variation within = mean square treatment / mean square error
The p-value tells you the probability that the observed difference among treatment means is due to random chance.
F-value = variation between / variation within = mean square treatment / mean square error = SIGNAL / NOISE
variance between = [ Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1) ] × r
variance within = ( Σᵢⁿ varianceᵢ ) / n
(n = number of treatment groups, r = number of observations per treatment)
Think of Pac-Man! The variation in your study is the board, and each treatment is a different Pac-Man player on the board. How much of the board each player clears represents the variation between (e.g. the amount of variation each treatment can explain); what remains after the players have died represents the variation within. Explaining more of the board (lowering variation within) increases the F-value.
F = SIGNAL / NOISE = variance between / variance within
variance between = [ Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1) ] × r (observations per treatment)
variance within = ( Σᵢⁿ varianceᵢ ) / n
The F-distribution gives the probability of obtaining a given F-value: SIGNAL > NOISE (large F) versus SIGNAL < NOISE (small F) determines the p-value.
In R you can query the F-distribution (percentiles, probabilities):
pf(F, df1, df2)    qf(p, df1, df2)
e.g. quantiles at p = 0.50 and p = 0.95, with the rejection region defined by α = 0.05.
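A quick sketch of these two functions, using the degrees of freedom from the lentil ANOVA table (df1 = 2 for Variety, df2 = 18 for Error):

```r
# Critical F at alpha = 0.05: an observed F above this rejects H0
f_crit <- qf(0.95, df1 = 2, df2 = 18)
f_crit                                  # about 3.55

# p-value for an observed F-value, e.g. F = 0.0263 for Variety:
p <- 1 - pf(0.0263, df1 = 2, df2 = 18)
p                                       # ~0.9741, matching the ANOVA table
```

Note that `pf` gives the cumulative probability below F, so the upper-tail p-value is `1 - pf(...)`.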
Source of Variation | df | Sum of Squares | Mean Squares | F-value | P-value
Variety (A) | 2 | 20 | 10 | 0.0263 | 0.9741
Farm (B) | 1 | 435243 | 435243 | 1125.7085 | <0.05
Variety x Farm (AxB) | 2 | 253561 | 126781 | 327.9046 | <0.05
Error | 18 | 6959 | 387 | |
How the table is built: SS = MS × df for each row (SSA = MSA × dfA, etc.); each F-value is MS/MSERROR (FA = MSA/MSERROR, etc.); and each p-value comes from the F-distribution, e.g. 1 − pf(FA, dfA, dfERROR).
If the interaction is significant – you should ignore the main effects because the story is not that simple!
[Figure: four interaction plots of yield for varieties A and B at Farm 1 and Farm 2 (including the Farm 1 and Farm 2 averages) – parallel lines indicate no interaction; crossing or converging lines indicate an interaction.]
In a multi-way ANOVA, if there is no interaction you can just interpret the main effects. If there is an interaction, it does not make sense to interpret the main effects, because the combination of treatment levels results in different outcomes.
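The two-way analysis discussed above can be sketched in R. The `lentils` data frame below is simulated stand-in data (column names YIELD, FARM, VARIETY mirror the slides but are assumptions), with an interaction deliberately built in:

```r
# Simulated stand-in for the lentil data, 4 reps per FARM x VARIETY cell
set.seed(1)
lentils <- data.frame(
  FARM    = rep(c("Farm1", "Farm2"), each = 12),
  VARIETY = rep(rep(c("A", "B", "C"), each = 4), times = 2)
)
lentils$YIELD <- 100 + 20 * (lentils$FARM == "Farm2") +
  10 * (lentils$FARM == "Farm2" & lentils$VARIETY == "C") +  # interaction
  rnorm(24, sd = 5)

out <- aov(YIELD ~ FARM * VARIETY, data = lentils)
summary(out)   # F-values and p-values for FARM, VARIETY, FARM:VARIETY

# Non-parallel lines in this plot suggest an interaction
interaction.plot(factor(lentils$FARM), factor(lentils$VARIETY),
                 lentils$YIELD)
```

With 24 observations and a 2x3 design, the error degrees of freedom come out to 18, matching the table above.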
a.k.a. pairwise t-tests. Number of comparisons:
C = t(t − 1) / 2, where t = number of treatment levels
Lentil example: 3 varieties (A, B, and C) give the comparisons A–B, A–C, B–C:
C = 3(2) / 2 = 3
Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I Error at all. Experiment-wise Type I Error for α = 0.05:
probability of Type I Error = 1 − 0.95^C
Lentil example:
probability of Type I Error = 1 − 0.95³ = 1 − 0.86 ≈ 0.14
A significantly increased probability of making an error! Therefore pairwise comparisons lead to a compromised experiment-wise α-level. You can correct for multiple comparisons by calculating an adjusted p-value (Bonferroni, Holm, etc.).
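R does this correction for you with `pairwise.t.test`. The sketch below uses simulated stand-in yields for the three varieties (all values are illustrative):

```r
# Simulated yields: varieties A and B similar, C clearly higher
set.seed(2)
yield   <- c(rnorm(8, 100, 5), rnorm(8, 100, 5), rnorm(8, 120, 5))
variety <- rep(c("A", "B", "C"), each = 8)

# Pairwise t-tests with a Bonferroni-adjusted experiment-wise alpha
res <- pairwise.t.test(yield, variety, p.adjust.method = "bonferroni")
res$p.value

# Compare with the uncorrected version to see what the correction guards
# against (inflated experiment-wise Type I Error):
pairwise.t.test(yield, variety, p.adjust.method = "none")$p.value
```

`p.adjust.method` also accepts `"holm"`, which is uniformly less conservative than Bonferroni while still controlling the experiment-wise error.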
a.k.a. pairwise t-tests with adjusted p-values. If we have a significant interaction effect, use these values; if we have NO significant interaction effect, we can just look at the main effects. You only need to consider the relevant pairwise comparisons – think about it logically.
Create a matrix of significance and use it to code your graph, e.g. for groups W, X, Y, Z:
[Matrix of pairwise results: * = significant, NS = not significant]
Then assign letters to the groups (e.g. W = A, X = B, Y = A,B): same letter = not significantly different; different letter = significantly different.
Standard Error: "How confident are we in our statistic?" The standard error is the standard deviation of a statistic; the standard error of the mean reflects the spread of means you would get from repeatedly resampling.
SEx̄ = s / √n
Small values = the sample is more representative of the overall population. Large values = the sample is less likely to adequately represent the overall population.
Standard Deviation: "How much dispersion from the average exists?" The standard deviation is the amount of variation or dispersion within a set of data values.
s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)    s = √s²
Small values = data points are very close to the mean. Large values = data points are spread out.
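The two formulas are one line each in R. The vector below reuses the DBH column from the data table later in this deck, purely as example numbers:

```r
x <- c(11.5, 5.5, 11.0, 7.6, 10.0, 8.4)   # e.g. the DBH column

s  <- sd(x)                # standard deviation: spread of the data values
se <- s / sqrt(length(x))  # standard error of the mean: precision of x-bar
s    # about 2.27
se   # about 0.93
# SE shrinks as n grows; SD does not. They answer different questions.
```

This is why error bars must always be labelled: an "SD bar" and an "SE bar" on the same data can look very different.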
Rank-based tests replace the data values with their ranks and are alternatives to standard parametric tests. If you repeatedly resampled the population and calculated the difference in rank-sums, the distribution of your differences would appear normal with a mean of zero, bounded by the sample size (max rank value). The question becomes: "Could I get a difference this big in my rank-sums by random chance?" The null hypothesis is that the difference among rank-sums is zero.
Wilcoxon rank-sum test – the t-test equivalent when your data distributions are similarly shaped. It tests the location (median) of a population distribution: the null hypothesis is that the two populations have identical distribution functions, against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all.
Kolmogorov-Smirnov test – the t-test equivalent when distributions are of different shape. It compares the empirical distribution functions: the null hypothesis is that the two samples come from the same distribution.
Kruskal-Wallis test – the one-way ANOVA equivalent for non-normal distributions. The null hypothesis is that all samples have identical distribution functions, against the alternative hypothesis that at least two of the samples differ only with respect to location (median), if at all. Post-hoc: pairwise Wilcoxon tests, comparing each of the treatment levels.
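All three tests are in base R. This sketch uses simulated skewed (log-normal) data, the kind of situation where these rank-based tests are preferred:

```r
# Skewed data: exactly the case where rank-based tests shine
set.seed(3)
g1 <- rlnorm(15, meanlog = 0)
g2 <- rlnorm(15, meanlog = 1)
g3 <- rlnorm(15, meanlog = 1)

wilcox.test(g1, g2)             # two groups (Wilcoxon rank-sum)
ks.test(g1, g2)                 # two groups, different shapes allowed
kw <- kruskal.test(list(g1, g2, g3))   # more than two groups
kw

# Post-hoc after a significant Kruskal-Wallis: pairwise Wilcoxon tests
# (with adjusted p-values, for the same multiple-comparison reason as above)
pairwise.wilcox.test(c(g1, g2, g3),
                     rep(c("g1", "g2", "g3"), each = 15))
```

`pairwise.wilcox.test` applies the Holm correction by default, so the experiment-wise α-level stays protected.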
D = SIGNAL / NOISE = distance between groups / distance within groups
Calculating D (delta) and its distribution works the same way as when we calculate an F-value. To determine the distribution of D, shuffle (permute) the group labels many times, recalculating D each time; a distribution of D under random chance will emerge.
Example (observed D = 10, 5000 permutations): 4921 D calculations < 10 from the permutations, 79 D calculations ≥ 10 from the permutations. P-value: 79/5000 = 0.0158.
Permutational ANOVA in R:
library(lmPerm)
out1 <- aovp(YIELD ~ FARM*VARIETY, seqs=T)
summary(out1)
Pairwise permutational ANOVA in R:
TukeyHSD(out1)
The option seqs=T calculates sequential sums of squares (similar to regular ANOVA) – a good choice for balanced designs. You can change the maximum number of iterations with the maxIter= option.
Permutation builds what the distribution of your test statistic would look like under random chance, giving you a reference curve to work with even when the assumptions of parametric tests fail. Are permutational tests as powerful as parametric statistics? YES they are! Use parametric tests when you can, but when you can't, permutational tests are great options!
Follow the Workbook Examples for the Analyses You are Interested In. Any questions?
Get some brain food – more statistics coming this afternoon!
“If the statistics are boring, then you've got the wrong numbers.”
Edward R. Tufte (Statistics Professor, Yale University)
[Figure: scatterplots of response vs. predictor illustrating r = 1, r = −1, 1 > r > 0, and r = 0 – positive relationship, negative relationship, no relationship.]
r = correlation coefficient, ranging from −1 to 1
Pearson's correlation – requires parametric assumptions; both the order (direction) and the magnitude of the relationship in the data values are determined.
Spearman's and Kendall's rank correlations – non-parametric (based on rank); the order (direction) of the relationship is determined, but the magnitude cannot be taken from this value because it is based on ranks, not raw data. Be careful with inferences made from these: the sign (positive vs. negative) is OK, but the magnitude is misleading.
Comparison between methods: the rank-based methods give different coefficients for the same data because the coefficients are calculated on ranks rather than the raw data.
Making inferences from tables of correlation coefficients and p-values: before drawing conclusions we need to be cautious about inflating our Type I Error due to the multiple tests/comparisons.
Research question: does lentil growth depend on climate?
Climate variable | Correlation w/ growth (r²) | p-value
Temp Jan | 0.03 | 0.4700
Temp Feb | 0.24 | 0.2631
Temp Mar | 0.38 | 0.1235
Temp Apr | 0.66 | 0.0063
Temp May | 0.57 | 0.0236
Temp Jun | 0.46 | 0.1465
Temp Jul | 0.86 | 0.0001
Temp Aug | 0.81 | 0.0036
Temp Sep | 0.62 | 0.0669
Temp Oct | 0.43 | 0.1801
Temp Nov | 0.46 | 0.1465
Temp Dec | 0.07 | 0.4282
Answer (based on a cursory examination of this table): yes, there are significant relationships with temperature in April, May, July, and August at α = 0.05. But this is not quite right – we need to adjust the p-values for multiple inferences.
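Applying the adjustment to the table's own p-values makes the point concrete:

```r
# The twelve p-values copied from the monthly temperature table above
p <- c(Jan = 0.4700, Feb = 0.2631, Mar = 0.1235, Apr = 0.0063,
       May = 0.0236, Jun = 0.1465, Jul = 0.0001, Aug = 0.0036,
       Sep = 0.0669, Oct = 0.1801, Nov = 0.1465, Dec = 0.4282)

round(p.adjust(p, method = "holm"), 4)
# After the Holm correction only July and August remain significant at
# alpha = 0.05; April and May no longer qualify.
sum(p.adjust(p, method = "holm") < 0.05)   # 2
```

So a cursory reading of the raw table would double the number of "significant" months.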
Both of these values imply a relationship rather than one factor causing the other. Be careful with your interpretations!
Example:
If you look at historic records there is a highly significant positive correlation between ice cream sales and the number of drowning deaths Do you think drowning deaths cause ice cream sales to increase? Of course NOT! Both occur in the summer months – therefore there is another mechanism responsible for the observed relationship
Output from R (summary of a linear model):
- Estimate: the model parameters (intercept and slope)
- Std. Error: the standard error of the estimates
- t value / Pr(>|t|): tests the null hypothesis that the coefficient is equal to zero (no effect)
A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable A large p-value suggests that changes in the predictor are not associated with changes in the response
R² – coefficient of determination, a.k.a. "goodness of fit": a measure of how close the data are to the fitted regression line. The F-statistic p-value gives the significance of the overall relationship described by the model.
Assumptions:
1. The residuals must be normally distributed.
2. For any given value of X, the values of Y must have equal variances.
You can again check this by using the Shapiro test, Bartlett test, and residual plots on the residuals of your model (see section 4.1). There are no assumptions for X, but be conscious of your data: the relationship you detect is obviously reflective of the data you include in your study.
ID | DBH | VOL | AGE | DENSITY
1 | 11.5 | 1.09 | 23 | 0.55
2 | 5.5 | 0.52 | 24 | 0.74
3 | 11.0 | 1.05 | 27 | 0.56
4 | 7.6 | 0.71 | 23 | 0.71
5 | 10.0 | 0.95 | 22 | 0.63
6 | 8.4 | 0.78 | 29 | 0.63
Relating the model back to the data table:
Multiple linear regression: y = β0 + β1·x1 + β2·x2
DENSITY = intercept + β1·AGE + β2·VOL
Response variable (Y) = DENSITY; predictor variable 1 (x1) = AGE; predictor variable 2 (x2) = VOL.
β1, β2: what I need to multiply AGE and VOL by (respectively) to get the predicted DENSITY. Remember, the differences between the observed and predicted DENSITY values are our regression residuals. Smaller residuals = better model.
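Fitting exactly that model to the six rows of the data table is a two-liner in R (a tiny sketch; with n = 6 the estimates are for illustration only):

```r
# The six rows from the data table above
trees <- data.frame(
  DBH     = c(11.5, 5.5, 11.0, 7.6, 10.0, 8.4),
  VOL     = c(1.09, 0.52, 1.05, 0.71, 0.95, 0.78),
  AGE     = c(23, 24, 27, 23, 22, 29),
  DENSITY = c(0.55, 0.74, 0.56, 0.71, 0.63, 0.63)
)

fit <- lm(DENSITY ~ AGE + VOL, data = trees)
summary(fit)      # beta estimates, standard errors, t-tests, R-squared
residuals(fit)    # observed minus predicted DENSITY for each row
```

`coef(fit)` returns the intercept, β1 (AGE), and β2 (VOL) directly, and `predict(fit)` returns the fitted DENSITY values the residuals are measured against.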
Output from R (summary of a multiple linear model):
- Estimate: the model parameters (βi values)
- Std. Error: the standard error of the estimates
- t value / Pr(>|t|): tests the null hypothesis that each coefficient is equal to zero (no effect)
A predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable; a large p-value suggests that changes in the predictor are not associated with changes in the response.
R² / adjusted R² – coefficient of determination, a.k.a. "goodness of fit": a measure of how close the data are to the fitted regression line (the adjusted R² penalizes for the number of predictors). The F-statistic p-value gives the significance of the overall relationship described by the model.
Sampling considerations (spatial and temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply, because we are fitting relationships between our response and predictor variables.
Assumptions: the same residual-based assumptions as linear regression.
Curve examples: there are MANY more examples you could choose from – what makes sense for your data?
Curve fitting procedure:
1. Plot your data. a. What curve does the pattern resemble? b. What might alternative options be?
2. Fit candidate curves with non-linear regression curve fitting. a. You will have to estimate your parameters from your curve to have starting values for your curve-fitting function.
3. Compare candidate models with AIC.
4. Plot the chosen curve over the data to visualize fit.
Output from R (nls): the non-linear model that we fit (e.g. a simplified logarithmic with slope = 0), the estimates of the model parameters, the residual sum-of-squares for your non-linear model, and the number of iterations needed to estimate the parameters. If you are stuck for starting points in your R code, this website may be able to help: http://www.xuru.org/rt/NLR.asp (copy-paste your data & desired model formula).
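Step 2 of the procedure can be sketched with `nls` on simulated data; the exponential form and the starting values below are illustrative assumptions (in practice you would eyeball them from your plotted curve):

```r
# Simulated data following y = 5 * exp(0.10 * x) plus noise
set.seed(4)
x <- 1:20
y <- 5 * exp(0.10 * x) + rnorm(20, sd = 0.5)

# Starting values (a = 4, b = 0.12) come from eyeballing the plotted curve;
# poor starting values can make nls fail to converge
fit <- nls(y ~ a * exp(b * x), start = list(a = 4, b = 0.12))
summary(fit)      # parameter estimates and residual sum-of-squares

plot(x, y)        # visualize the fit over the raw data
lines(x, predict(fit))
```

`summary(fit)` also reports the number of iterations, matching the output fields described above.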
R² partitions the variation in the following manner: SSRegression + SSError = SSTotal, and
R² = SSRegression / SSTotal, which mathematically must produce a value between 0 and 100%.
For non-linear models, do not rely on R² for "goodness of fit"; instead compare residual sums of squares to pick the best model, then plot the curve to visualize the fit.
How do we decide which model is best? You can always improve fit by adding parameters (e.g. higher-order polynomials), but:
- We want a model that is as simple as possible, but no simpler.
- A reasonable amount of explanatory power is traded off against model complexity – AIC measures the balance of this for us.
- AIC can compare different modelling approaches and model fitting techniques.
- The model with the lowest AIC value is the model that fits your data best (e.g. minimizes your model residuals).
Hirotugu Akaike (1927-2009): in the 1970s he used information theory to build a numerical equivalent of Occam's razor. Occam's razor: all else being equal, the simplest explanation is the best one. A simple model is preferred over a more complex one that fits the data too closely to actually predict anything.
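A minimal sketch of AIC-based model choice, with simulated data whose true relationship is linear:

```r
# The true relationship is linear; the polynomial terms are pure complexity
set.seed(5)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)

m1 <- lm(y ~ x)             # simple model
m2 <- lm(y ~ poly(x, 5))    # needlessly complex polynomial

aic_tab <- AIC(m1, m2)
aic_tab
# AIC will usually favour m1 here: the extra polynomial terms buy almost
# no explanatory power, so the complexity penalty dominates.
```

Note that AIC comparisons are only meaningful between models fitted to the same response data.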
Logistic regression models the relationship between a binary response variable and predictor variables: the probability of being in a category based on the combination of predictors (continuous and/or categorical variables).
Logistic (logit) model: y = e^(β0 + β1x1 + β2x2 + … + βnxn) / (1 + e^(β0 + β1x1 + β2x2 + … + βnxn))
Assumptions: logistic regression does not rely on the normal distribution, so there are no requirements for normality, equal variances, or absence of outliers. Sampling and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply, because we are fitting relationships between our response and predictor variables.
The binomial distribution references both the number of observations and the probability of "getting a success" (a value of 1): "What is the probability of x successes in n independent and identically distributed Bernoulli trials?" A Bernoulli trial is an experiment with two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted.
Logistic regression vs. linear regression:
- Linear regression finds the best-fitting line: it estimates the parameters that predict the change in the response (y) for a change in the predictor (x).
- Logistic regression models the probability of an event occurring (y = 1) rather than not, given the relevant independent variables (our data). Coefficients are estimated using maximum likelihood estimation (an iterative process) based on the likelihood function.
Likelihood function – the probability of the occurrence of an observed set of values X and Y, given a function with defined parameters.
Process: the parameter estimates are updated iteratively until the improvement in likelihood becomes negligible, at which point the model is said to have converged. This is how coefficients are estimated for logistic regression.
Output from R (summary of a logistic model):
- Estimate: the model parameters (intercept and slope)
- Std. Error: the standard error of the estimates
- z value / Pr(>|z|): tests the null hypothesis that the coefficient is equal to zero (no effect)
A predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable; a large p-value suggests that changes in the predictor are not associated with changes in the response.
- AIC: the AIC value for the model
Analysis of deviance on the logistic model combines logistic regression and the binomial distribution – it acts like a chi-squared test: p-values are calculated using the chi-squared distribution rather than the F-distribution (which is reserved for parametric statistics).
Rather than sums of squares and mean sums of squares, the table now shows the deviance and residual deviance for each parameter, testing the null hypothesis that the variable has no effect on achieving a "success" (value of 1). A predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to achieving a success in the response variable; a large p-value suggests that changes in the predictor are not associated with success in the response.
Remember, deviance is a measure of the lack of fit between the model and data, with larger values indicating poorer fit.
Interpreting the modelled relationship: the coefficients are on the logit (log-odds) scale, e.g. a one-unit increase in the predictor changes the log-odds of survival of a plant by 0.22. You can calculate the predicted probability for a given value of the predictor(s) using the logit model equation and the coefficient estimates from the logistic model:
y = e^(β0 + β1x1 + … + βnxn) / (1 + e^(β0 + β1x1 + … + βnxn))
Example: at low NITROGEN there is little chance of plant survival, but if NITROGEN is increased to 500 or 750 units the probability of survival increases to 77% and 93%, respectively.
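A sketch of the NITROGEN example with simulated data: the true coefficients below were chosen so that survival is roughly 77% at 500 units and 93% at 750 units, mirroring the slide; the variable names and data are assumptions, not the workshop dataset:

```r
# Simulate survival data with the logit relationship described above
set.seed(6)
NITROGEN <- runif(300, 0, 1000)
p_true   <- plogis(-1.55 + 0.0055 * NITROGEN)   # logit model probabilities
SURVIVAL <- rbinom(300, size = 1, prob = p_true)

fit <- glm(SURVIVAL ~ NITROGEN, family = binomial)
summary(fit)                  # coefficients on the log-odds (logit) scale
anova(fit, test = "Chisq")    # analysis of deviance, chi-squared p-values

# Predicted probability of survival at chosen nitrogen levels
pr <- predict(fit, newdata = data.frame(NITROGEN = c(500, 750)),
              type = "response")
pr
```

`type = "response"` applies the logit back-transformation for you; without it, `predict` returns log-odds.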
Follow the Workbook Examples for the Analyses You are Interested In. Any questions?
Multivariate Statistics (Rotation-Based)
“Definition of Statistics: The science of producing unreliable facts from reliable figures.”
Evan Esar (Humorist & Writer)
Response – what you actually measure (e.g. lentil height, patient response), a.k.a. the dependent variable.
Predictor – what you think may be influencing a response variable (e.g. climate, topography, drug combination, etc.), a.k.a. the independent variable.
Error – deviation from the true (or expected) value.
Experimental unit – an individual subject (a row of the data matrix).
In multivariate statistics (depending on the technique), the variables may be frequencies of species, climate variables, soil characteristics, nutrient concentrations, drug levels, etc., and the groups may be regions, ecosystems, forest types, treatments, etc.
Data matrix: Data.ID | Type | Variable 1 | Variable 2 | Variable 3 | Variable 4 | …
Final results based on multiple variables give different inferences than results based on 2 variables. The idea: find an equation to rotate the data so that each new axis explains multiple variables at once (e.g. one axis combining variables 1, 2, 4, 9, 10 and another combining variables 3, 6, 8), then repeat the rotation process to achieve the analysis objective.
1. Rotate so that the new axis explains the greatest amount of variation within the data: Principal Component Analysis (PCA), Factor Analysis
2. Rotate so that the variation between groups is maximized: Discriminant Analysis (DISCRIM), Multivariate Analysis of Variance (MANOVA)
3. Rotate so that one dataset explains the most variation in another dataset: Canonical Correspondence Analysis (CCA)
PCA takes the column vectors of the original variables X1, X2, …, Xn and produces components Z1, Z2, …, Zn that are uncorrelated, in order of their importance, and that describe the variation in the original data. For each component, a linear model gives the coefficients used to calculate it, e.g. for the first principal component (column vector):
Z1 = a11·X1 + a12·X2 + … + a1n·Xn
The constraint that a11² + a12² + … + a1n² = 1 ensures Var(Z1) is as large as possible. Each subsequent component carries the additional condition of zero correlation with the earlier components for the data, i.e. Z3 is uncorrelated with both Z1 and Z2. Usually you interpret only the first couple of principal components, because they explain the most variation in your data; the last couple of principal components explain very little (< 1%) of the variation in your data – not useful variables.
PCA in R:
princomp(dataMatrix, cor=T/F)   (stats package)
The cor= option defines whether the PCs should be calculated using the correlation or the covariance matrix (derived within the function from the data). You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when the variables are on different scales: cor=T standardizes the data before the PCs are calculated, removing the effect of different units (note that princomp's default is the covariance matrix, cor=FALSE). dataMatrix is the data matrix of predictor variables; assign the result to an object once the PCs have been calculated.
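The output described below can be reproduced with R's built-in USArrests data (Murder, Assault, UrbanPop, Rape — the same variables the example loadings refer to); since the variables are on different scales, the correlation matrix is used:

```r
# Variables on different scales -> use the correlation matrix
pca <- princomp(USArrests, cor = TRUE)

summary(pca)       # variance explained: ~62%, 25%, 9%, 4% for Comp.1-4
loadings(pca)      # relationships between original variables & components
head(pca$scores)   # the rotated values (Z1, Z2, ...) we plot for inference
biplot(pca)        # points on Comp.1 vs Comp.2 with variable arrows
```

The sign of any loading column may flip between platforms (the rotation is only defined up to sign), so interpret loadings by which variables move together, not by absolute sign.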
Loadings – these are the correlations between the original predictor variables and the principal components (the eigenvectors); they identify which of the original variables are driving each principal component.
Example (USArrests data): Comp.1 is negatively related to Murder, Assault, and Rape; Comp.2 is negatively related to UrbanPop.
Scores – these are the calculated principal components Z1, Z2, …, Zn; these are the values we plot to make inferences.
Variance – the summary of the output displays the variance explained by each principal component (the eigenvalues divided by the number of PCs); it identifies how much weight you should put on each principal component.
Example: Comp.1 – 62%; Comp.2 – 25%; Comp.3 – 9%; Comp.4 – 4%.
The biplot shows the data points plotted on their Comp.1 and Comp.2 scores (displaying row names). The direction of the arrows (+/−) indicates the trend of the points (towards the arrow indicates more of that variable). If vector arrows are perpendicular, the variables are not correlated. If your original variables do not have some level of correlation, then PCA will NOT work for your analysis – i.e. you won't learn anything!
Like PCA, discriminant analysis rotates the data into linear combinations of the original variables ("reduce complexity"), but the rotation maximizes the variation between, rather than within, the pre-determined groups. For each function, a linear model gives the coefficients for the linear discriminant (column vector) from the column vectors of the original variables:
LD1 = a·X1 + b·X2 + … + z·Xn
[Figure: original axes (x, y) rotated to discriminant axes (x′, y′) so that the pre-defined groups – and a 3rd group – separate along the new axes.]
Output (MASS lda): the proportion of variance explained by each linear discriminant, the mean observation values for the variables in each pre-defined group, and the prior (initial) probability of belonging to each group. The coefficients of the linear discriminants are the solutions to our linear functions (larger values = more important for predicting class). MASS will only display solutions for the most significant linear discriminants; discriminants that explain a very small portion of the variance are removed.
Output (candisc): the proportion of variance explained by the linear discriminants and the mean discriminant values for each pre-defined group (standard errors of the means are also given). By querying the analysis structure we can see the discriminant loadings, which tell us the relationship between the discriminant values and the original variables (like PCA). Again, candisc will only display solutions for the discriminants that explain the most variation. Less information is displayed in the candisc output, but you can get the loadings, which are important! candisc also produces a nicer plot.
Problem: a new skull is found, but we don't know whether it belongs to Homo erectus or Homo habilis, or whether it represents a new group.
[Figure: skull measurements plotted by group, with the group centroids for Homo erectus and Homo habilis and the new find (unknown origin).]
How predictions work: 1. Calculate each group centroid. 2. Find which centroid is closest to the unknown data point. New groups are defined when we find a significant difference between the new find and the predefined groups. This is a popular method in taxonomy and anthropology.
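The same classify-by-group logic can be sketched with MASS's `lda` on the iris data mentioned earlier (three pre-defined species instead of skull groups):

```r
library(MASS)   # provides lda()

fit <- lda(Species ~ ., data = iris)
fit             # priors, group means, coefficients of linear discriminants
                # "Proportion of trace" = variance explained per discriminant

# Classifying observations (the same logic used for the unknown skull):
pred <- predict(fit)
table(pred$class, iris$Species)   # nearly all 150 flowers land in the
                                  # correct group

# For a truly new find you would pass its measurements via newdata= and
# inspect pred$posterior; a point far from every centroid gets weak
# posterior support from all groups.
```

`predict` assigns each observation to the group whose centroid (weighted by the priors) it is closest to in discriminant space.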
Follow the Workbook Examples for the Analyses You are Interested In. Any questions?
What is the goal of my analysis? … to characterize my data; … to find if there is a significant difference between my groups; … to see what predictor conditions are associated with my groups.
What kind of data do I have to answer my research question? … continuous or discrete? … binary data?
How many variables do I want to include in my analysis? … multiple response variables? … a single treatment or multiple treatments?
Does my data meet the analysis assumptions? … normally distributed? … equal variances?
If you have any further questions please feel free to contact me.
Flow Cytometry Core Facility
LKG Consulting
Email: consulting.lkg@gmail.com Website: www.consultinglkg.com