Statistics Toolbox: A Review of Analysis Techniques for Scientific Research
Professional Development Opportunity for the Flow Cytometry Core Facility
October 12, 2018
LKG Consulting
Email: consulting.lkg@gmail.com | Website: www.consultinglkg.com
Laura Gray-Steinhauer (www.ualberta.ca/~lkgray)
BSc in Mathematics, Statistics and Environmental Studies (UVic, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)
Designated Professional Statistician with the Statistical Society of Canada (2014)
Research: climate change, policy evaluation, adaptation, mitigation, risk management for forest resources, conservation…
Workshop Schedule
8:15 – 8:30   Arrive at the lab & start up the computers
8:30 – 8:45   Welcome to the workshop (housekeeping & today's goals)
8:45 – 9:15   Statistics Toolbox – refresh useful vocabulary, introduce a decision tree to plan your analysis path
9:15 – 9:45   Hypothesis Testing – refresher on p-values, Type I and Type II error, and statistical power
9:45 – 10:00  Break
10:00 – 11:00 Parametric versus Non-Parametric Tests – testing for parametric assumptions, ANOVA, permutational ANOVA
11:00 – 11:30 Multivariate Statistics – introduction to principal component analysis (PCA)
11:30 – 1:00  Work period (questions are welcome)
After 1:00    Enjoy your weekend!

This may be A LOT of information to absorb, or we may not cover the specific topic you came to learn today. Feel free to reach out to me via email with more questions: consulting.lkg@gmail.com.
R: https://cran.r-project.org/index.html
RStudio: https://www.rstudio.com/ – preferred among programmers; we will use it in this workshop
“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Aaron Levenstein (Author)
Statistical Term     | Real World                                                        | Research World
Population           | Class of things (e.g. cancer patients)                            | What you want to learn about (e.g. cancer patients in Alberta)
Sample               | Group representing a class (e.g. 1000 cancer patients in Alberta) | What you actually study (e.g. 1000 cancer patients from 10 treatment centres in Alberta)
Experimental Unit    | Individual thing (e.g. each of the 1000 cancer patients)          | Individual research subject (e.g. cancer patients n=1000, hospital populations n=10 – depends on the research question)
Dependent Variable   | Property of things (e.g. white blood cell count)                  | What you measure about subjects (e.g. white blood cell count)
Independent Variable | Environment of things (e.g. treatment options, climate, etc.)     | What you think might influence the dependent variable (e.g. amount of treatment, combination of treatments, etc.)
Data                 | Values of variables                                               | What you record / the information you collect
More vocabulary:
Experiment – a process for which the outcome is unknown
Inference – drawing conclusions (inferences) about the population from which the sample was taken
Parameter – a population characteristic (e.g. population mean)
Probability distribution – the probability associated with each possible value of a variable
Error – deviation from the true (or expected) value
Also see Appendix 1 in your workbook
What is the goal of my analysis? What kind of data do I have to answer my research question? How many variables do I want to include in my analysis? Does my data meet the analysis assumptions?
Analysis Goal | Parametric | Non-Parametric (alternative if assumptions fail) | Binomial (binary data / event likelihood)
Describe data characteristics | Mean, standard deviation, standard error, etc. | Median, quartiles, percentiles | Proportions
(Probability distributions are always appropriate to describe data. Graphics are always appropriate to describe data.)
Compare 2 distinct/independent groups | T-test, paired t-test | Wilcoxon rank-sum test, Kolmogorov–Smirnov test, permutational t-test | Z-test for proportions
Compare > 2 distinct/independent groups | ANOVA, multi-way ANOVA, ANCOVA, blocking | Kruskal–Wallis test, Friedman rank test, permutational ANOVA | Chi-squared test, binomial ANOVA
Estimate the degree of association between 2 variables | Pearson's correlation | Spearman rank correlation, Kendall's rank correlation | Logistic regression
Predict outcome based on relationship | Linear regression, multiple linear regression, non-linear regression | — | Logistic regression, odds ratio
If you have a continuous response variable…
… and one predictor variable:
  Predictor is categorical, two treatment levels: t-test, permutational t-test, Kolmogorov–Smirnov (KS) test, Wilcoxon test
  Predictor is categorical, > two treatment levels: one-way ANOVA, Kruskal–Wallis test, Friedman rank test
  Predictor is continuous: Pearson's correlation, Spearman's rank correlation, Kendall's rank correlation, linear/non-linear regression
What you get:
  Non-parametric/parametric tests: a p-value telling you whether groups are significantly different, i.e. whether there is a significant effect of "treatment"; post-hoc comparisons show where the difference between groups lies
  Regression/correlation: a coefficient indicating direction and magnitude of relationship, plus a measure of how well the predictor is linked to the response (R² or AIC)
… and two or more predictor variables:
  Predictors are categorical, two or more treatment levels for each predictor: multi-way ANOVA, permutational ANOVA, blocking
  Predictors are mixed categorical and continuous: ANCOVA, blocking
  Predictors are continuous: multiple regression
What you get:
  ANOVA-type tests: p-values telling you whether predictors significantly affect the response variable, and post-hoc comparisons among treatments with interactions, adjusting for covariates
  Regression: a measure of how well predictors are linked to the response variable (adjusted R², AIC)
If you have a categorical response variable…
… and one predictor variable:
  Predictor is categorical, two treatment levels: Z-test for proportions
  Predictor is categorical, two or more treatment levels: chi-squared test, binomial ANOVA
  Predictor is continuous: logistic regression
… and two or more predictor variables: binomial ANOVA, logistic regression
What you get:
  A p-value telling you whether there is a significant effect of each treatment (and the possibility of interactions); pairwise comparisons with adjusted p-values to determine where groups are significantly different, including differences among treatments with interactions
  Regression: a measure of how well predictors are linked to the response variable (adjusted R², AIC) and which predictors significantly affect the response variable
Example research questions: Does the yield of different lentil varieties differ at 2 farms? Do the varieties differ among themselves? Does the density of the plants impact their average height?
[Diagram: Farm 1 and Farm 2, each divided into plots with 1 variety (A, B, or C) per plot, and individual lentil plants within each plot.]
Another example dataset: sepal and petal measurements for 50 flowers from each of Iris setosa, versicolor, and virginica.
“Statistics are no substitute for judgment.”
Henry Clay (former US Senator)
[Figure: distributions of mean height for a population and a sample – “Is this difference due to random chance?”]

H0: ȳ_A = ȳ_B    H1: ȳ_A ≠ ȳ_B
If the actual p-value < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1).
P-value – the probability the observed value or larger is due to random chance
Theory: We can never really prove if the 2 samples are truly different or the same – only ask if what we observe (or a greater difference) is due to random chance
How to interpret p-values: P-value = 0.05 – “Yes, 1 out of 20 times.” P-value = 0.01 – “Yes, 1 out of 100 times.”
The lower the probability that a difference is due to random chance, the more likely it is the result of an effect (what we test for)
In other words: “Is random chance a plausible explanation?”
Type I Error – rejecting the null hypothesis (H0) when it is actually true
Type II Error – failing to reject the null hypothesis (H0) when it is not true
Remember: rejection or acceptance based on a p-value (and therefore the chance you will make an error) depends on the arbitrary α-level you choose
β – the probability of making a Type II Error
The α-level you choose is completely up to you (typically it is set at 0.05); however, it should be chosen with consideration of the consequences of making a Type I or a Type II Error. Based on your study, would you rather err on the side of false positives or false negatives?
                               | Null hypothesis is true                              | Alternative hypothesis is true
Fail to reject the null        | ☺ Correct decision                                   | Incorrect decision – false negative (Type II Error)
Reject the null                | Incorrect decision – false positive (Type I Error)   | ☺ Correct decision
Example: Will current forests adequately protect genetic resources under climate change? (Birch Mountain Wildlands, BMW)
H0: Range of the current climate for the BMW protected area = range of the BMW protected area under climate change
Ha: Range of the current climate for the BMW protected area ≠ range of the BMW protected area under climate change
If we reject H0: climate ranges are different, therefore genetic resources are not adequately protected and new protected areas need to be created
Consequences if I make:
A Type I Error – genetic resources were indeed adequately protected in the BMW protected area; we created new parks when we didn't need to
A Type II Error – genetic resources were vulnerable; we didn't create new protected areas and we should have
From an ecological standpoint it is better to make a Type I Error, but from an economic standpoint it is better to make a Type II Error. Which standpoint should I take?
Power is your ability to reject the null hypothesis when it is false (i.e. your ability to detect an effect when there is one). There are many ways to increase power:
Increase the α-level – but this increases the chance of a Type I Error!
Increase the sample size – given you are testing whether what you observed (or greater) is due to random chance, more data gives you a better understanding of what is truly happening within the population; a larger sample size will therefore decrease the probability of making a Type II Error.
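The trade-off between sample size and power can be sketched in R with power.t.test from the stats package; the effect size and SD below are made-up illustration values, not workshop data:

```r
# How many observations per group do we need to detect a true
# difference of 2 units between group means, if the within-group
# SD is 4, at alpha = 0.05 with 80% power? (Illustrative numbers.)
pwr <- power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = 0.80)
ceiling(pwr$n)   # observations needed per group: 64

# Flip it around: with only 20 per group, what power do we have?
power.t.test(n = 20, delta = 2, sd = 4, sig.level = 0.05)$power
```

With only 20 observations per group the power drops well below the conventional 0.8, i.e. the chance of a Type II Error rises.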
Go grab a coffee. Next we will cover specific tools in your new tool box.
“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”
Andrew Lang (Scottish poet)
Type: Parametric versus Non-Parametric
Characteristics – parametric tests rely on assumptions about the distribution of the data; non-parametric tests make fewer distributional assumptions than parametric tests
Assumptions – parametric: normality of residuals and equal variances; non-parametric: few or none
When to use – parametric: normality can be relaxed with a large enough sample size (CLT), however equal variances must be met; non-parametric: when parametric assumptions are not met (skewed data distribution)
Examples – parametric: t-test (including paired), ANOVA; non-parametric: rank-based and permutational alternatives to the traditional tests
“Your samples have to come from a randomized or randomly sampled design.”
Things to consider: set up your experiment in a randomized design, and think about how treatments are laid out – otherwise effects could be masked or overinflated.
If measurements are repeated over time, test your data considering the change in time (e.g. paired tests).
NOTE: ANOVA needs at least 1 degree of freedom – this means you need at least 2 reps per treatment to execute an ANOVA. Rule of thumb: you need more rows than columns.
s² = Σⱼ₌₁ⁿ (yⱼ − ȳ)² / (n − 1)        SD = √s²
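The variance formula can be checked directly in R against the built-in var() and sd() functions (the numbers below are arbitrary):

```r
y <- c(10, 12, 9, 14, 15)              # arbitrary example data
n <- length(y)
s2 <- sum((y - mean(y))^2) / (n - 1)   # the formula above
s2                                      # 6.5
sqrt(s2)                                # the standard deviation
all.equal(s2, var(y))                   # TRUE: matches R's built-in
all.equal(sqrt(s2), sd(y))              # TRUE
```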
[Figure: normal curve centred on ȳ – the base of parametric statistics; confidence intervals are constructed from this curve.]
[Diagram: Farm 1 with plots of varieties A, B, and C and their group means ȳ_A,Farm1, ȳ_B,Farm1, ȳ_C,Farm1; residuals are the deviations of individual plants from their group mean.]
Normality of residuals: “If I were to repeat my sampling many times and calculate the means, those means would be normally distributed.”
Determine if the assumption is met by comparing the distribution of your residuals to the theoretical normal (or t) sampling distribution.
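A common way to do this comparison in R is a Shapiro–Wilk test plus a normal Q-Q plot; the sketch below uses a deterministic, perfectly normal-shaped sample standing in for model residuals:

```r
# Theoretical normal quantiles, standing in for residuals from a model:
resid <- qnorm(ppoints(50))
shapiro.test(resid)$p.value    # large p-value: no evidence against normality
qqnorm(resid); qqline(resid)   # points falling on the line support normality
```

In practice you would pass residuals(yourModel) rather than simulated values; a small Shapiro–Wilk p-value or strong curvature in the Q-Q plot suggests the assumption is violated.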
You may not need to worry about normality!
Central Limit Theorem: “Sample means tend to cluster around the central population value.” Therefore, as sample size grows, ȳ gets close to the value of μ and the sampling distribution of the mean approaches normality.
What does this mean? Your test is not really compromised if your residuals are not normal – EXCEPT when:
1. Very small N
2. Data is highly non-normal
3. Significant outliers are present
4. Small effect size
[Figure: two distributions on a 0–24 axis with ȳ_A = 12, s_A = 4 and ȳ_B = 12, s_B = 6.]
Let's say 5% of the A data fall above some threshold; then more than 5% of the B data fall above the same threshold.
So with larger variances you can expect a greater number of observations at the extremes of the distributions. This can have real implications on inferences we make from comparisons between groups.
[Diagram: Farm 1 with varieties A, B, and C and their group means; residuals are used to check the variance assumption.]
Equal variances: “Does the known probability of observations hold true between my two samples?”
Determine if the assumption is met by testing for equal variances and by examining residual plots (predicted values versus observed values, in original units); residuals should scatter evenly around the value 0.
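The formal check can be sketched in R with var.test (F test for two variances) or bartlett.test for two or more groups; the data below are simulated with the SDs from the earlier example (4 versus 6), not workshop data:

```r
set.seed(42)
a <- rnorm(200, mean = 12, sd = 4)   # simulated group A
b <- rnorm(200, mean = 12, sd = 6)   # simulated group B, larger spread
var.test(a, b)$p.value               # small p-value: variances differ
bartlett.test(list(a, b))$p.value    # alternative, works for > 2 groups
```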
ANOVA compares group means for a categorical treatment (e.g. variety, fertilization, irrigation, etc.) with two or more levels (e.g. A, B, C or Control, 1xN, 2xN):

variation between / variation within = mean square treatment / mean square error

It tests whether the observed difference among treatment means is due to random chance.
[Diagram: Farm 1 with varieties A, B, and C; between-group variation is the spread of the group means, within-group variation is the spread of residuals around each mean.]

F-value = variation between / variation within = mean square treatment / mean square error = SIGNAL / NOISE

variance between = [Σᵢⁿ (ȳᵢ − ȳ_ALL)² / (n − 1)] × r        variance within = Σᵢⁿ varianceᵢ / n

where n is the number of treatment groups, ȳᵢ the mean of group i, ȳ_ALL the grand mean, and r the number of replicates per group.
Think of Pac-Man! Variation in your study is like a maze of dots, and each treatment is a different Pac-Man player on the board. The dots each player eats represent the variation between treatments (the amount of variation each treatment can explain); the dots left over when the players have died represent the variation within. Explaining more variation with your treatments (lowering the variation within) increases the F-value.
The F-distribution gives the probability of obtaining an F-value at least this large by chance: signal > noise gives a large F (small p-value); signal < noise gives an F near or below 1 (large p-value).
In R (percentiles, probabilities): pf(F, df1, df2) returns the cumulative probability for an observed F; qf(p, df1, df2) returns the F-value for a given cumulative probability.
[Figure: F-distribution with the 0.50 and 0.95 quantiles marked; the right tail beyond the critical value corresponds to α = 0.05.]
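For example, with df1 = 2 and df2 = 18 (the F-value here is illustrative):

```r
# Critical F-value for alpha = 0.05 with df1 = 2, df2 = 18:
qf(0.95, df1 = 2, df2 = 18)        # about 3.55
# Cumulative probability of an observed F = 3.55:
pf(3.55, df1 = 2, df2 = 18)        # about 0.95
# P-value for an observed F (right tail):
1 - pf(3.55, df1 = 2, df2 = 18)    # about 0.05
```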
Source of Variation    df   Sum of Squares   Mean Squares    F-value    P-value
Variety (A)             2               20             10     0.0263     0.9741
Farm (B)                1           435243         435243  1125.7085      <0.05
Variety x Farm (AxB)    2           253561         126781   327.9046      <0.05
Error                  18             6959            387

How the columns relate: Mean Square = Sum of Squares / df; each F-value is MS_effect / MS_ERROR (MSA/MSERROR, MSB/MSERROR, MSAxB/MSERROR); each P-value comes from the right tail of the F-distribution, e.g. 1 - pf(F_A, df_A, df_ERROR) in R.
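As a sanity check, a row of such a table can be reproduced in R from its SS and df entries (using the Variety row; pf gives the right-tail p-value):

```r
ss_variety <- 20;   df_variety <- 2
ss_error   <- 6959; df_error   <- 18
ms_variety <- ss_variety / df_variety                  # 10
ms_error   <- ss_error / df_error                      # about 387
f_variety  <- ms_variety / ms_error                    # about 0.026
p_variety  <- 1 - pf(f_variety, df_variety, df_error)  # about 0.974
c(F = f_variety, p = p_variety)
```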
If the interaction is significant – you should ignore the main effects because the story is not that simple!
[Figures: possible interaction-plot outcomes for yield of varieties A and B at Farm 1 and Farm 2 – parallel lines (no interaction), diverging lines, and crossing lines.]

In a multi-way ANOVA:
If the interaction is NOT significant – just interpret the main effects.
If the interaction IS significant – it is misleading to interpret the main effects, because the combination of treatment levels results in different outcomes.
a.k.a. pairwise t-tests. Number of comparisons:
C = t(t − 1) / 2, where t = number of treatment levels
Lentil example: 3 VARIETIES (A, B, and C) give the comparisons A–B, A–C, B–C:
C = t(t − 1)/2 = 3(2)/2 = 3
Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I Error at all. Experiment-wise Type I Error for α = 0.05:
probability of Type I Error = 1 − 0.95^C
Lentil example:
probability of Type I Error = 1 − 0.95³ = 1 − 0.857 ≈ 0.14
A significantly increased probability of making an error! Therefore pairwise comparisons compromise the experiment-wise α-level. You can correct for multiple comparisons by calculating an adjusted p-value (Bonferroni, Holm, etc.).
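In R, p.adjust applies these corrections; the raw p-values below are hypothetical, standing in for the three lentil comparisons:

```r
1 - 0.95^3                        # experiment-wise error rate, about 0.143
p_raw <- c(0.010, 0.040, 0.200)   # hypothetical raw p-values: A-B, A-C, B-C
p.adjust(p_raw, method = "bonferroni")  # each multiplied by 3 (capped at 1)
p.adjust(p_raw, method = "holm")        # stepwise, less conservative
```

With real data you would more often call pairwise.t.test(response, group, p.adjust.method = "holm"), which computes the raw comparisons and the adjustment in one step.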
a.k.a. pairwise t-tests with adjusted p-values. If we have a significant interaction effect, use these adjusted values; if we have NO significant interaction effect, we can just look at the main effects. Only the relevant pairwise comparisons need to be considered – think about it logically.
[Table: significance matrix for groups W, X, Y, Z – each cell marked significant (*) or not significant (NS); groups are then labelled with letters A, B, A,B.]
Same letter = not significant; different letter = significant.
Create a matrix of significance and use it to code your graph.
Calculating D (delta) and its distribution:
D = signal / noise = distance between groups / distance within groups
calculated the same way we do when we calculate an F-value.
Determining the distribution of D: shuffle (permute) the observations among the groups many times, recalculating D each time, and a distribution of D will emerge. Then compare the observed D (say D = 10) against that distribution.
Example with 5000 permutations: 4921 D calculations < 10 and 79 D calculations ≥ 10 from the permutations, so the P-value is 79/5000 = 0.0158.
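The shuffling idea can be sketched by hand in R using a simple difference in means as the statistic; the yields below are made up for illustration:

```r
set.seed(1)
a <- c(14, 15, 13, 16, 15)       # hypothetical yields, variety A
b <- c(11, 10, 12, 11, 13)       # hypothetical yields, variety B
obs <- abs(mean(a) - mean(b))    # observed difference in means
pool <- c(a, b)
# Shuffle group labels 5000 times and recompute the difference each time:
perm <- replicate(5000, {
  idx <- sample(length(pool), length(a))
  abs(mean(pool[idx]) - mean(pool[-idx]))
})
mean(perm >= obs)   # p-value: share of shuffles at least as extreme as observed
```

No normality or equal-variance assumption is needed: the null distribution comes entirely from the data.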
Permutational ANOVA in R:

library(lmPerm)
out1 <- aovp(YIELD ~ FARM * VARIETY, seqs = TRUE)
summary(out1)

Pairwise permutational comparisons:

TukeyHSD(out1)

The option seqs=TRUE calculates sequential sums of squares (similar to regular ANOVA) – a good choice for balanced designs. You can change the maximum number of iterations with the maxIter= argument.
Why this works: the permutations show you what the distribution of your test statistic looks like for your own data, so you do not need a theoretical curve to work with or the assumptions of parametric tests.
Are permutational tests as powerful as parametric statistics? YES they are! Use parametric tests when you can, but when you can't, permutational tests are great options!

Multivariate Statistics (Rotation-Based)
“Definition of Statistics: The science of producing unreliable facts from reliable figures.”
Evan Esar (Humorist & Writer)
Response variable – what you actually measure (e.g. lentil height, patient response); a.k.a. dependent variable
Predictor variable – what you think might be influencing a response variable (e.g. climate, topography, drug combination, etc.); a.k.a. independent variable
Error – deviation from the true (or expected) value
In multivariate statistics the data form a matrix: each row is an experimental unit and each column a variable (whether a column acts as response or predictor depends on the technique). Columns might hold frequency of species, climate variables, soil characteristics, nutrient concentrations, drug levels, etc.; a "Type" column can label regions, ecosystems, forest types, treatments, etc.

[Table sketch: Data.ID | Type | Variable 1 | Variable 2 | Variable 3 | Variable 4 | …]

Final results based on multiple variables give different inferences than just 2 variables. The idea: find an equation to rotate the data so that a new axis explains multiple variables at once (e.g. one axis for Variables 1, 2, 4, 9, 10 and another for Variables 3, 6, 8).
Repeat the rotation process to achieve the analysis objective:
1. Rotate so that each new axis explains the greatest amount of variation within the data – Principal Component Analysis (PCA), Factor Analysis
2. Rotate so that the variation between groups is maximized – Discriminant Analysis (DISCRIM), Multivariate Analysis of Variance (MANOVA)
3. Rotate so that one dataset explains the most variation in another dataset – Canonical Correspondence Analysis (CCA)
PCA takes the column vectors of the original variables X1, X2, …, Xn and produces components Z1, Z2, …, Zn that are uncorrelated, in order of their importance, and that describe the variation in the original data. Each principal component (column vector) is a linear combination of the original variables; for the first component:
Z1 = a11·X1 + a12·X2 + … + a1n·Xn
The coefficients a11, …, a1n used to calculate each principal component are chosen under the constraint that a11² + a12² + … + a1n² = 1, which ensures Var(Z1) is as large as possible. Z2 is found the same way, with the additional condition that Z1 and Z2 have zero correlation for the data, and so on for later components (i.e. Z3 is uncorrelated with both Z1 and Z2).
Focus on the first few principal components, because they explain the most variation in your data; the last couple of principal components explain very little (< 1%) of the variation – not useful variables.
PCA in R:

princomp(dataMatrix, cor = TRUE/FALSE)   # stats package

You pass a data matrix of predictor variables and assign the result to an object once the PCs have been calculated. The cor argument defines whether the PCs are calculated using the correlation or the covariance matrix (derived within the function from the data). You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales; the default correlation matrix standardizes the data before calculating the PCs, removing the effect of different units.
Loadings – the correlations between the original predictor variables and the principal components (the eigenvectors). They identify which of the original variables are driving each principal component.
Example: Comp.1 is negatively related to Murder, Assault, and Rape; Comp.2 is negatively related to UrbanPop.
Scores – the calculated principal components Z1, Z2, …, Zn; these are the values we plot to make inferences.
Variance – the summary of the output displays the variance explained by each principal component (eigenvalues divided by the number of PCs); it identifies how much weight you should put on each principal component. Example: Comp.1 – 62%, Comp.2 – 25%, Comp.3 – 9%, Comp.4 – 4%.
Biplot – data points plotted by their Comp.1 and Comp.2 scores (displaying row names), with the original variables drawn as arrows. The direction of an arrow indicates the trend of the points (towards the arrow indicates more of that variable); if vector arrows are perpendicular, the variables are not correlated. If your original variables do not have some level of correlation, then PCA will NOT work for your analysis – i.e. you won't learn anything!
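The whole workflow can be run on R's built-in USArrests data (Murder, Assault, UrbanPop, Rape), which appears to be the example behind the output described above:

```r
# Correlation matrix because the variables are on different scales:
pca <- princomp(USArrests, cor = TRUE)
summary(pca)       # proportion of variance: roughly 62%, 25%, 9%, 4%
loadings(pca)      # which original variables drive each component
head(pca$scores)   # the rotated values (Z1, Z2, ...) you plot
biplot(pca)        # points plus variable arrows on Comp.1 vs Comp.2
```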
Follow the Workbook Examples for the Analyses You are Interested In. Any questions?
What is the goal of my analysis? … to characterize my data; … to find if there is a significant difference between my groups; … to see what predictor conditions are associated with my groups.
What kind of data do I have to answer my research question? … continuous or discrete? … binary data?
How many variables do I want to include in my analysis? … multiple response variables? … single treatment or multiple treatments?
Does my data meet the analysis assumptions? … normally distributed? … equal variances?
If you have any further questions, please feel free to contact me. Flow Cytometry Core Facility
LKG Consulting
Email: consulting.lkg@gmail.com Website: www.consultinglkg.com