Multivariate Ordination Analyses: Principal Component Analysis
Dilys Vela, Tatiana Boza
Multivariate Analyses
A multivariate data set includes more than one variable recorded from a number of replicate sampling or experimental units, sometimes referred to as objects.
If these objects are organisms, the variables might be morphological or physiological measurements. If the objects are ecological sampling units, the variables might be physicochemical measurements or species abundances.
What are ordination analyses?
Ordination is arranging items along a scale (axis) or multiple axes. The purpose of ordination is to summarize complex relationships graphically, extracting one or a few dominant patterns from an infinite number of possible patterns. The placement of variables along an axis is possible because the ordination is based on the correlation among the variables.
What do ordination analyses help us to see?
- Select the most important variables from multiple variables imagined or hypothesized.
- Reveal unforeseen patterns and suggest unforeseen processes.
What type of question can we answer with ordination analysis?
- In ecology, to seek and describe patterns of process.
- In community ecology, to describe the strongest patterns in species composition.
- In systematics, to recognize and define species boundaries.
Multivariate Analysis
- Classification (or Clustering Analysis)
- Ordination Analysis
  - Direct Gradient Analysis
    - Few species: Linear Regression
    - Many species: Canonical CA (CCA), Redundancy Analysis (RDA)
  - Indirect Gradient Analysis
    - Distance values: Principal Coordinate Analysis (PCoA), Non-metric Multidimensional Scaling (NMDS)
    - Raw data available: Principal Components Analysis (PCA), Correspondence Analysis (CA), Detrended CA (DCA)
Principal Components Analysis
Principal component analysis (PCA) is a statistical technique that has been specifically developed to address data reduction.
In general terms, the major aim of PCA is to reduce the complexity of the interrelationships among a potentially large number of observed variables to a relatively small number of linear combinations of them, which are referred to as principal components.
Principal components analysis finds a set of orthogonal standardized linear combinations which together explain all of the variation in the original data.
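The idea of orthogonal linear combinations that together account for all of the variation can be checked numerically. The sketch below is only an illustration and not the procedure used in this presentation: the `pca` helper, the simulated data, and all variable names are assumptions introduced here, and the analysis is run on the covariance matrix with NumPy.

```python
import numpy as np

def pca(X):
    """PCA by eigendecomposition of the covariance matrix (rows = objects)."""
    Xc = X - X.mean(axis=0)                  # center each variable
    S = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # objects in the new coordinates
    return eigvals, eigvecs, scores

# Hypothetical data: 50 objects, 4 variables driven by one shared factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.4]]) + 0.3 * rng.normal(size=(50, 4))

eigvals, eigvecs, scores = pca(X)
explained = eigvals / eigvals.sum()          # proportion of variance per PC
total_variance = np.trace(np.cov(X, rowvar=False))
print("Proportion of variance per component:", np.round(explained, 3))
```

The columns of `eigvecs` are mutually orthogonal unit vectors, and the eigenvalues add up to `total_variance`, so keeping only the first few components discards a known share of the variation.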
What are the assumptions of PCA?
PCA assumes linear relationships among variables.
The cloud of points in p-dimensional space has linear dimensions that can be effectively summarized by the principal axes.
If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p-dimensional space), the principal axes will not be an efficient and informative summary of the data.
Considerations before running a PCA
- Normal distributions
- Data outliers
- Transformations
- Standardization
- Data matrix
Normal Distributions
- When using PCA, data normality is not essential. However, these methods are based on the correlation or covariance matrix, which is strongly affected by non-normally distributed data and the presence of outliers.
Data outliers
- Extreme values as well as outliers can have a severe influence on PCA, since it is based on the correlation or covariance matrix (Pison et al., 2003).
- Outliers should thus be removed prior to the statistical analysis, or statistical methods able to handle outliers should be employed, and the influence of extreme values needs to be reduced (e.g., via a suitable transformation).
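A small numerical illustration of why outliers matter: because PCA starts from the correlation (or covariance) matrix, anything that distorts a correlation distorts the components. The values below are invented purely for illustration and the variable names are assumptions.

```python
import numpy as np

# Hypothetical measurements on 8 objects: two weakly related variables.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([4, 5, 3, 6, 3, 6, 4, 5], dtype=float)
r_clean = np.corrcoef(x, y)[0, 1]

# Add one extreme object far from the rest of the cloud.
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without the outlier: {r_clean:.2f}")
print(f"r with the outlier:    {r_outlier:.2f}")
```

A single extreme point raises the correlation from roughly 0.2 to above 0.9, which would make the first principal component look far more important than the bulk of the data supports.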
Transformations
Transformations change the scale of measurement of the data, usually to meet the normality assumption of parametric analyses and the homogeneity of variance assumption of most of these analyses.
Transformations are particularly important for multivariate procedures based on eigenanalysis (e.g. principal components analysis) because covariances and correlations measure linear relationships between variables.
Transformations that improve linearity will increase the efficiency with which the eigenanalysis extracts the eigenvectors.
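As a minimal sketch of how a transformation can improve linearity, consider a curved (exponential) relationship. The data are synthetic and the choice of a log transformation is an assumption made for this example; the appropriate transformation depends on the data at hand.

```python
import numpy as np

# Hypothetical variables: y grows exponentially with x (a curved relationship).
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

r_raw = np.corrcoef(x, y)[0, 1]            # Pearson r on the raw scale
r_log = np.corrcoef(x, np.log(y))[0, 1]    # Pearson r after a log transformation

print(f"r on the raw scale: {r_raw:.3f}")
print(f"r on the log scale: {r_log:.3f}")
```

Pearson correlation only measures the linear part of a relationship, so it understates the association on the raw scale and recovers it fully once the log transformation has straightened the curve.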
Standardization
The first stage in rotating the data cloud is to standardize the data by subtracting the mean and dividing by the standard deviation.
It may be argued that we should not divide by the standard deviation. By standardizing, we are giving all species the same variation, i.e. a standard deviation of 1.
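Standardization as described here (subtract the mean, divide by the standard deviation) can be written in a few lines. The `standardize` helper and the small matrix are hypothetical, introduced only to illustrate the step; the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores: subtract the mean, divide by the sample SD."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical matrix: 5 objects (rows) by 3 variables (columns).
X = np.array([
    [12.0, 3.1, 150.0],
    [15.0, 2.8, 210.0],
    [ 9.0, 3.5, 180.0],
    [11.0, 3.0, 120.0],
    [14.0, 2.6, 240.0],
])
Z = standardize(X)

print("Column means after standardizing:", np.round(Z.mean(axis=0), 10))
print("Column SDs after standardizing:  ", Z.std(axis=0, ddof=1))
```

After this step every variable has mean 0 and standard deviation 1, so no variable can dominate the analysis simply because it was recorded in larger numbers.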
Data Matrix
We actually can have it both ways:
- A PCA without dividing by the standard deviation is an analysis of the covariance matrix.
- A PCA in which you do indeed divide by the standard deviation is an analysis of the correlation matrix.
When using species/variables measured in different units, you must use a correlation matrix.
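The link between standardization and the choice of matrix can be verified directly. The sketch below uses two simulated, unrelated variables on very different scales; the data and names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical variables on different scales: SD near 1 and SD near 1000.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

S = np.cov(X, rowvar=False)          # covariance matrix (no division by SD)
R = np.corrcoef(X, rowvar=False)     # correlation matrix

# The covariance matrix of standardized data is the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
S_of_Z = np.cov(Z, rowvar=False)

# First principal axis of each matrix (eigh sorts eigenvalues ascending,
# so the last column belongs to the largest eigenvalue).
pc1_cov = np.linalg.eigh(S)[1][:, -1]
pc1_cor = np.linalg.eigh(R)[1][:, -1]

print("PC1 loadings, covariance matrix: ", np.round(np.abs(pc1_cov), 3))
print("PC1 loadings, correlation matrix:", np.round(np.abs(pc1_cor), 3))
```

With the covariance matrix, PC1 is almost entirely the large-scale variable; with the correlation matrix, both variables receive equal weight. This is the practical reason for using the correlation matrix when units differ.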
Look at Descriptors
- Homogeneous nature? All the same kind? Same units? Same order of magnitude? Use the S matrix (covariance).
- Heterogeneous nature? Different kinds? Different units? Different orders of magnitude? Use the R matrix (correlation).
Correlation Matrix
Advantages:
- The results of analyses for different sets of random variables are more directly comparable.
Disadvantages:
- There are considerable differences in the standard deviations, caused mainly by differences in scale.
- None of the correlations is particularly large in absolute value.
- PCs have moderate-sized coefficients for several of the variables.
- PCs give coefficients for standardized variables and are therefore less easy to interpret directly.

Covariance Matrix
Advantages:
- PCs for the covariance matrix are each dominated by a single variable.
- The variances and total variance are more meaningful indices for measuring variability in data sets that are symmetric.
Disadvantages:
- The PCs are sensitive to the units of measurement used for each element of the variables.
- If there are large differences between the variances of the elements of the variables, then those variables whose variances are largest will tend to dominate the first few PCs.
Eigenvalues & Eigenvectors
The eigenvectors are the loadings of the principal components spanning the new PCA coordinate system.
The amount of variability contained in each principal component is expressed by the eigenvalues, which are simply the variances of the scores.
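The statement that the eigenvalues are the variances of the scores can be confirmed with a short calculation. The data below are simulated and the names are assumptions; the covariance matrix is used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 40 objects, 3 variables with unequal spread.
X = rng.normal(size=(40, 3)) * np.array([3.0, 2.0, 1.0])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest eigenvalue first

scores = Xc @ eigvecs                      # coordinates of objects on the PCs
score_variances = scores.var(axis=0, ddof=1)

print("Eigenvalues:       ", np.round(eigvals, 4))
print("Variance of scores:", np.round(score_variances, 4))
```

The two printed rows agree, and the covariance matrix of the scores is diagonal, which is another way of saying that the components are uncorrelated with each other.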
PCA searches for the direction in the multivariate space that contains the maximum variability. This is the direction of the first principal component (PC1).
The second principal component (PC2) has to be orthogonal (perpendicular) to PC1 and will contain the maximum amount of the remaining data variability.
Subsequent principal components are found by the same principle.
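One way to see that PC1 really is the direction of maximum variability is to compare it against many arbitrary directions. The sketch below obtains the principal axes from a singular value decomposition of the centered data, which yields the same axes as an eigenanalysis of the covariance matrix; the data, the mixing matrix `A`, and the helper name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 100 objects, 3 correlated variables.
A = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.5, 0.5]])
X = rng.normal(size=(100, 3)) @ A
Xc = X - X.mean(axis=0)

# Right singular vectors of the centered data are the principal axes.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

def projected_variance(direction):
    """Sample variance of the data projected onto a unit direction."""
    return (Xc @ direction).var(ddof=1)

# Compare PC1 against 1000 random unit directions.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_variance(d) for d in random_dirs)

print(f"Variance along PC1:               {projected_variance(pc1):.3f}")
print(f"Best of 1000 random directions:   {best_random:.3f}")
print(f"Dot product of PC1 and PC2:       {pc1 @ pc2:.1e}")
```

No random direction captures more variance than PC1, and PC1 and PC2 are perpendicular (their dot product is zero up to rounding error).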
Biplots
A biplot is a visualization tool to present results of PCA. The PCA biplot is called the scaling process.
The loadings (arrows) represent the variables. The lengths of the arrows in the plot are directly proportional to the variability included in the two components (PC1 and PC2) displayed, and the angle between any two arrows is a measure of the correlation between those variables.
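The relationship between arrow angles and correlations can be made concrete. In the sketch below, which assumes a PCA of the correlation matrix and arrows scaled by the square root of each eigenvalue (one common biplot scaling, not necessarily the one used in these slides), the cosine of the angle between two arrows equals the correlation exactly when all components are kept, and approximates it in the two-dimensional plot. Data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 60 objects, 3 variables; the first two are strongly related.
base = rng.normal(size=60)
X = np.column_stack([
    base + 0.3 * rng.normal(size=60),
    base + 0.3 * rng.normal(size=60),
    rng.normal(size=60),
])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest eigenvalue first

# One arrow per variable: eigenvector entries scaled by sqrt(eigenvalue).
arrows = eigvecs * np.sqrt(eigvals)      # rows = variables, columns = PCs
arrows_2d = arrows[:, :2]                # what a PC1-PC2 biplot displays

def cosine(u, v):
    """Cosine of the angle between two arrows."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

cos_full = cosine(arrows[0], arrows[1])        # all components retained
cos_2d = cosine(arrows_2d[0], arrows_2d[1])    # only PC1 and PC2 retained

print(f"Correlation of variables 1 and 2: {R[0, 1]:.3f}")
print(f"Cosine, all components:           {cos_full:.3f}")
print(f"Cosine, PC1-PC2 plane only:       {cos_2d:.3f}")
```

Arrows pointing in nearly the same direction indicate positively correlated variables, arrows at right angles indicate little correlation, and arrows pointing in opposite directions indicate negative correlation. The 2-D reading is only as good as the share of variance the two displayed components capture.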
Misconceptions
- PCA cannot cope with missing values (but neither can most other statistical methods).
- It does not require normality.
- It is not a hypothesis test.
- There are no clear distinctions between response variables and explanatory variables.
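The point about missing values is easy to demonstrate: a single missing entry contaminates the covariance matrix. The sketch below uses an invented matrix and shows listwise deletion (dropping incomplete objects) as one simple workaround; that choice is an assumption made for this example and is not a recommendation from the slides.

```python
import numpy as np

# Hypothetical matrix with one missing measurement (NaN).
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 1.0, 2.0],
    [4.0, 3.0, 5.0],
])

S_raw = np.cov(X, rowvar=False)            # NaN propagates into the matrix

# Listwise deletion: keep only objects with complete records.
complete = ~np.isnan(X).any(axis=1)
X_complete = X[complete]
S_complete = np.cov(X_complete, rowvar=False)

print("NaN in covariance matrix before deletion:", np.isnan(S_raw).any())
print("NaN in covariance matrix after deletion: ", np.isnan(S_complete).any())
print("Objects retained:", X_complete.shape[0], "of", X.shape[0])
```

Deleting objects costs sample size, so in practice the amount and pattern of missing data should be examined before deciding how to handle it.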
When should PCA be used?
In community ecology, PCA is useful for summarizing variables whose relationships are approximately linear or at least monotonic.
e.g. A PCA of many soil properties might be used to extract a few components that summarize main dimensions of soil variation
PCA is generally NOT useful for ordinating community data. Why? Because relationships among species are highly nonlinear.
Community trends along environmental gradients appear as “horseshoes” in PCA ordinations.
None of the PC axes effectively summarizes the trend in species composition along the gradient.
[Figure: PCA ordination of community data, Axis 1 vs. Axis 2; panel label: Beta Diversity 2R, Covariance]
The “Horseshoe” Effect
Curvature of the gradient and the degree of infolding of the extremes increase with beta diversity.
PCA ordinations are not useful summaries of community data except when beta diversity is very low.
Using correlation generally does better than covariance. This is because standardization by species improves the correlation between Euclidean distance and environmental distance.
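The curvature can be reproduced with a simulated community. The sketch below builds a hypothetical data set in which species have Gaussian (unimodal) response curves along a single gradient, then runs a covariance-matrix PCA. Every number (site positions, optima, tolerance) is an assumption chosen for illustration, with moderate species turnover so that the arch is visible without severe infolding.

```python
import numpy as np

# Hypothetical coenocline: 41 sites along one gradient, 21 species with
# Gaussian (unimodal) response curves. All numbers are illustrative.
gradient = np.linspace(0.0, 10.0, 41)          # site positions
optima = np.linspace(0.0, 10.0, 21)            # species optima
tolerance = 2.0                                # width of each response curve
abundance = np.exp(-(gradient[:, None] - optima[None, :]) ** 2
                   / (2 * tolerance ** 2))     # sites x species

# PCA on the covariance matrix of species abundances.
Xc = abundance - abundance.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvecs = eigvecs[:, ::-1]                     # largest eigenvalue first
scores = Xc @ eigvecs[:, :2]                   # site scores on PC1 and PC2

centered = gradient - gradient.mean()
r_pc1_linear = np.corrcoef(scores[:, 0], centered)[0, 1]
r_pc2_linear = np.corrcoef(scores[:, 1], centered)[0, 1]
r_pc2_curved = np.corrcoef(scores[:, 1], centered ** 2)[0, 1]

print(f"|r| between PC1 and gradient position:        {abs(r_pc1_linear):.2f}")
print(f"|r| between PC2 and gradient position:        {abs(r_pc2_linear):.2f}")
print(f"|r| between PC2 and squared gradient position: {abs(r_pc2_curved):.2f}")
```

Although only one gradient was simulated, PC2 is not noise: it is a curved function of position along that same gradient, which is what bends the sites into an arch when PC1 is plotted against PC2. Plotting `scores[:, 0]` against `scores[:, 1]` shows the shape directly.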
What if there’s more than one underlying ecological gradient?
When there are two or more underlying gradients with high beta diversity, a “horseshoe” is usually not detectable.
Interpretation problems are more severe.
Data Set
Morphological and anatomical variation of Calophyllum L. (Calophyllaceae) in South America.
- D. Vela
Kielmeyeroideae
Calophylleae
- Calophyllum
- Neotatea
- Marila
- Mahurea
- Clusiella
- Kielmeyera
- Caraipa
- Haploclathra
- Poeciloneuron
- Mesua
- Kayea
- Mammea
Endodesmieae
- Endodesmia
- Lebrunia
Stevens, 2006
Wurdack & Davis (2009)
Distribution of Calophyllaceae
[Map: species counts by region: 144 species, 259 species, 10 species, 176 species]
Stevens, 2006 http://www.mobot.org/MOBOT/research/APweb/
www.wikimedia.org
Vein Resin canal
http://www.botany.hawaii.edu/faculty/carr/images/cal_ino.jpg http://pakuwon.wordpress.com/2009/02/13/nyamplung‐calophyllum‐inophyllum/ http://www.flickr.com/photos/mauroguanandi
Calophyllum brasiliense
- There is infraspecific variation in tepal number between individuals of the same species, and between flowers from the same inflorescence.
Stevens (1974,1980)
Calophyllum brasiliense http://www.nationaalherbarium.nl/sungaiwain/ Calophyllum pisiferum Calophyllum lanigerum
- 1. Main objective
1.A To distinguish species limits of Calophyllum in South America.
- 2. Specific objectives
2.A To analyze morphological and anatomical variation.
Data collection for morphological observations
- Herbarium and personal collections.
- Collection sort: qualitative characteristics (Systematics Association Committee for Descriptive Biological Terminology, cited by Stearn 2006).
- Measurement: ruler and a digital caliper.
- Excel data matrix: specimen collections in rows and variables in columns.
Leaf characters
- Petiole length, mm (PTL)
- Leaf length, cm (LL)
- Leaf length at widest part, cm (LWWP)
- Leaf width, cm (LW)
- Apex length, mm (PL)
- Midrib width at abaxial side, mm (MW)
- Vein angle, degree (VA)
- Venation density (VD)

Flower characters
- Pedicel length, mm (PDL)
- Perianth width, mm (PW)
- Perianth length, mm (PRL)
- Anther length, mm (AL)
- Anther width, mm (AW)
- Stamen length, mm (STL)
- Filament length, mm (FL)
- Style length, mm (STYL)
- Gynoecium length, mm (GL)
- Ovary length, mm (OL)
- Stigma width, mm (SL)

Fruit characters
- External fruit length, mm (FrLEx)
- External fruit width, mm (FrWEx)
- Internal fruit length, mm (FrLIn)
- Internal fruit width, mm (FrWIn)
- Stigma remained, mm (StygR)
- Basal discoloration, mm (BsDis)
- Stone, mm (Stn)
- Corky, mm (CRK)