Multivariate Ordination Analyses: Principal Component Analysis


Oct 08, 2022


SLIDE 1

Multivariate Ordination Analyses: Principal Component Analysis

Dilys Vela, Tatiana Boza

SLIDE 2

Multivariate Analyses

A multivariate data set includes more than one variable recorded from a number of replicate sampling or experimental units, sometimes referred to as objects.

SLIDE 3

If these objects are organisms, the variables might be morphological or physiological measurements.

If the objects are ecological sampling units, the variables might be physicochemical measurements or species abundances.

SLIDE 4

What are ordination analyses?

Ordination is arranging items along a scale (axis) or multiple axes. The purpose of ordination is to summarize complex relationships graphically, extracting one or a few dominant patterns from an infinite number of possible patterns. The placement of variables along an axis is possible because the ordination is based on the correlations among the variables.

SLIDE 5

What do ordination analyses help us to see?

Select the most important variables from multiple variables imagined or hypothesized.
Reveal unforeseen patterns and suggest unforeseen processes.

SLIDE 6

What type of question can we answer with ordination analysis?

In ecology, to seek and describe patterns of process.
In community ecology, to describe the strongest patterns in species composition.
In systematics, to recognize and to define species boundaries.

SLIDE 7

Multivariate Analysis
  Classification (or Clustering Analysis)
  Ordination Analysis
    Direct Gradient Analysis
      Few species: Linear Regression
      Many species: Canonical CA (CCA), Redundancy Analysis (RDA)
    Indirect Gradient Analysis
      Distance values: Principal Coordinate Analysis (PCoA), Non-metric Multidimensional Scaling (NMDS)
      Raw data available: Principal Components Analysis (PCA), Correspondence Analysis (CA), Detrended CA (DCA)

SLIDE 8

Principal Components Analysis

Principal component analysis (PCA) is a statistical technique that has been specifically developed to address data reduction.

In general terms, the major aim of PCA is to reduce the complexity of the interrelationships among a potentially large number of observed variables to a relatively small number of linear combinations of them, which are referred to as principal components.

Principal components analysis finds a set of orthogonal standardized linear combinations which together explain all of the variation in the original data.
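The idea of a few linear combinations that together account for all of the variation can be sketched in Python with NumPy. This is only an illustration, not the presenters' own code: the data are simulated, and the names `pca`, `shared`, and `data` are assumptions made for the example. The variables are standardized first, so the analysis is of the correlation matrix.

```python
import numpy as np


def pca(data):
    """PCA on standardized variables (i.e. on the correlation matrix)."""
    x = np.asarray(data, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs                     # objects in the new coordinates
    return eigvals, eigvecs, scores


# Simulated data (an assumption): 50 objects, 4 variables,
# three of which share a common signal and one that is pure noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 1))
data = np.hstack([
    shared + 0.3 * rng.normal(size=(50, 3)),
    rng.normal(size=(50, 1)),
])

eigvals, eigvecs, scores = pca(data)
explained = eigvals / eigvals.sum()  # proportion of variance per component
print(np.round(explained, 3))
```

With four standardized variables the eigenvalues sum to 4, so the proportions in `explained` add up to 1: all components together reproduce the total variation, while the first few usually carry most of it.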

SLIDE 9

What are the assumptions of PCA?

Assumes that relationships among variables are linear: the cloud of points in p‐dimensional space has linear dimensions that can be effectively summarized by the principal axes.

If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p‐dimensional space), the principal axes will not be an efficient and informative summary of the data.

SLIDE 10

Considerations before running a PCA

Normal Distributions
Data Outliers
Transformations
Standardization
Data Matrix

SLIDE 11

Normal Distributions

When using PCA, data normality is not essential. However, these methods are based on the correlation or covariance matrix, which is strongly affected by non‐normally distributed data and the presence of outliers.

SLIDE 12

Data outliers

Extreme values as well as outliers can have a severe influence on PCA, since it is based on the correlation or covariance matrix (Pison et al., 2003).

Outliers should thus be removed prior to the statistical analysis, or statistical methods able to handle outliers should be employed, and the influence of extreme values needs to be reduced (e.g., via a suitable transformation).
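How strongly a single extreme object can affect the correlation that PCA is built on can be shown with a small, made-up example. The values below are hypothetical and chosen only to make the effect obvious.

```python
import numpy as np

x = np.arange(1.0, 11.0)  # 1, 2, ..., 10
y = 11.0 - x              # perfectly negatively related to x

r_clean = np.corrcoef(x, y)[0, 1]

# Append one extreme object at (100, 100) and recompute the correlation.
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))
```

Ten objects show a perfect negative relationship, yet adding one extreme object turns the correlation strongly positive (roughly 0.98 here). Any principal axes derived from that matrix would then describe the outlier rather than the bulk of the data.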

SLIDE 13

Transformations

Transformations change the scale of measurement of the data. They are usually applied to meet the normality assumption of parametric analyses and the homogeneity of variance assumption of most of these analyses.

Transformations are particularly important for multivariate procedures based on eigenanalysis (e.g. principal components analysis) because covariances and correlations measure linear relationships between variables.

Transformations that improve linearity will increase the efficiency with which the eigenanalysis extracts the eigenvectors.
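A brief sketch of why linearity matters for correlation-based methods: below, an idealized, noise-free exponential relationship (an assumption made purely for illustration) is measured with Pearson's correlation before and after a log transformation.

```python
import numpy as np

x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)  # curved (exponential) relationship, no noise

r_raw = np.corrcoef(x, y)[0, 1]          # correlation on the original scale
r_log = np.corrcoef(x, np.log(y))[0, 1]  # correlation after log-transforming y

print(round(r_raw, 3), round(r_log, 3))
```

The relationship between `x` and `y` is perfectly deterministic, but the correlation on the raw scale understates it because the relationship is curved. After the log transformation the relationship is a straight line and the correlation reaches 1.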

SLIDE 14

Standardization

The first stage in rotating the data cloud is to standardize the data by subtracting the mean and dividing by the standard deviation.

It may be argued that we should not divide by the standard deviation. By standardizing, we are giving all species the same variation, i.e. a standard deviation of 1.
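The two steps, centering and dividing by the standard deviation, can be written out as follows. The abundance values are hypothetical and serve only as an example.

```python
import numpy as np

# Hypothetical abundances: 5 sampling units (rows) x 3 species (columns).
abundance = np.array([
    [120.0, 3.0, 0.0],
    [80.0, 5.0, 1.0],
    [200.0, 2.0, 0.0],
    [40.0, 8.0, 3.0],
    [160.0, 4.0, 1.0],
])

# Step 1: subtract each species' mean (centering).
centered = abundance - abundance.mean(axis=0)

# Step 2: divide by each species' sample standard deviation.
standardized = centered / abundance.std(axis=0, ddof=1)

print(np.round(standardized.std(axis=0, ddof=1), 3))
```

After step 2 every species has a standard deviation of 1, so the abundant first species no longer outweighs the rare third one. Whether that equal weighting is desirable is exactly the point being argued.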

SLIDE 15

Data Matrix

We actually can have it both ways:

A PCA without dividing by the standard deviation is an analysis of the covariance matrix.
A PCA in which you do indeed divide by the standard deviation is an analysis of the correlation matrix.

When using species/variables measured in different units, you must use a correlation matrix.
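The sensitivity of the covariance matrix to units, and the equivalence between standardizing and using the correlation matrix, can be checked numerically. The sketch below uses simulated measurements; the variable names and the helper `first_axis` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
leaf_length_cm = rng.normal(10.0, 2.0, size=100)
petiole_length_mm = 1.5 * leaf_length_cm + rng.normal(0.0, 1.0, size=100)
data = np.column_stack([leaf_length_cm, petiole_length_mm])

# Same objects, but leaf length re-expressed in millimetres.
data_mm = data * np.array([10.0, 1.0])


def first_axis(matrix):
    """Eigenvector with the largest eigenvalue of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return eigvecs[:, np.argmax(eigvals)]


# Covariance-based first axis depends on the units chosen.
cov_axis_cm = first_axis(np.cov(data, rowvar=False))
cov_axis_mm = first_axis(np.cov(data_mm, rowvar=False))

# The correlation matrix does not.
corr_cm = np.corrcoef(data, rowvar=False)
corr_mm = np.corrcoef(data_mm, rowvar=False)

# Covariance of standardized data equals the correlation matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
cov_of_z = np.cov(z, rowvar=False)

print(np.round(np.abs(cov_axis_cm), 2), np.round(np.abs(cov_axis_mm), 2))
```

Changing one variable from centimetres to millimetres rotates the first covariance-based axis toward that variable, while the correlation matrix (and therefore a correlation-based PCA) is unchanged.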

SLIDE 16

Look at Descriptors

Homogeneous nature? All same kind? Same units? Same order of magnitude? Use the S matrix (Covariance).
Heterogeneous nature? Different kind? Different units? Different order of magnitude? Use the R matrix (Correlation).

SLIDE 17

Correlation Matrix
Advantages: The results of analyses for different sets of random variables are more directly comparable.
Disadvantages: There are considerable differences in the standard deviations, caused mainly by differences in scale. None of the correlations is particularly large in absolute value. PCs have moderate‐sized coefficients for several of the variables. PCs give coefficients for standardized variables and are therefore less easy to interpret directly.

Covariance Matrix
Advantages: PCs for the covariance matrix are each dominated by a single variable. The variances and total variance are more meaningful indices for measuring variability in data sets that are symmetric.
Disadvantages: The PCs are sensitive to the units of measurement used for each element of the variables. If there are large differences between the variances of the elements of the variables, then those variables whose variances are largest will tend to dominate the first few PCs.

SLIDE 18

Eigenvalues & Eigenvectors

The eigenvectors are the loadings of the principal components spanning the new PCA coordinate system.

The amount of variability contained in each principal component is expressed by the eigenvalues, which are simply the variances of the scores.
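That the eigenvalues are the variances of the scores can be verified directly. The following sketch uses simulated data (an assumption) and a covariance-matrix PCA computed with `numpy.linalg.eigh`.

```python
import numpy as np

rng = np.random.default_rng(2)
mixing = np.array([
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 0.3],
    [0.0, 0.0, 0.5],
])
data = rng.normal(size=(200, 3)) @ mixing  # three correlated variables

centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: coordinates of each object on the principal components.
scores = centered @ eigvecs
score_variances = scores.var(axis=0, ddof=1)

print(np.round(eigvals, 4))
print(np.round(score_variances, 4))
```

The two printed vectors agree, and the covariance matrix of the scores is diagonal, meaning the components are uncorrelated with each other.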

SLIDE 19

PCA searches for the direction in the multivariate space that contains the maximum variability. This is the direction of the first principal component (PC1).

The second principal component (PC2) has to be orthogonal (perpendicular) to PC1 and will contain the maximum amount of the remaining data variability.

Subsequent principal components are found by the same principle.
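The maximum-variance property can be illustrated by comparing PC1 against many randomly chosen directions. This is a sketch on simulated data; the number of random directions (1000) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
mixing = np.array([
    [3.0, 1.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 0.4],
])
data = rng.normal(size=(300, 3)) @ mixing
centered = data - data.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]

# Variance of the data projected onto many random unit directions.
directions = rng.normal(size=(1000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
projected_var = (centered @ directions.T).var(axis=0, ddof=1)

pc1_var = (centered @ pc1).var(ddof=1)
print(round(float(pc1_var), 3), round(float(projected_var.max()), 3))
```

No random direction captures more variance than PC1, and the dot product of `pc1` and `pc2` is zero up to rounding error, which is the orthogonality constraint described above.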

SLIDE 20

Biplots

A biplot is a visualization tool to present results of PCA. The PCA biplot is called the scaling process. The loadings (arrows) represent the elements. The lengths of the arrows in the plot are directly proportional to the variability included in the two components (PC1 and PC2) displayed, and the angle between any two arrows is a measure of the correlation between those variables.
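The link between arrow angles and correlations can be checked numerically under one particular scaling, which is an assumption here because the slide does not say which scaling was used: a correlation-matrix PCA with each eigenvector multiplied by the square root of its eigenvalue. The data are simulated and `cosine` is a helper defined for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
shared = rng.normal(size=(100, 1))
data = np.hstack([
    shared + 0.5 * rng.normal(size=(100, 2)),  # two correlated variables
    rng.normal(size=(100, 2)),                 # two unrelated variables
])

corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Arrow coordinates (assumed scaling): eigenvector * sqrt(eigenvalue).
# Row i holds the arrow of variable i; column j is component j.
arrows = eigvecs * np.sqrt(eigvals)


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cos_full = cosine(arrows[0], arrows[1])        # using all components
cos_2d = cosine(arrows[0, :2], arrows[1, :2])  # PC1-PC2 plane only
arrow_lengths_2d = np.linalg.norm(arrows[:, :2], axis=1)

print(round(corr[0, 1], 3), round(cos_full, 3), round(cos_2d, 3))
```

With all components, the cosine of the angle between two arrows equals the correlation between the variables exactly. In the two-dimensional plot it is only an approximation, and it is better for variables with long arrows, that is, variables well represented by PC1 and PC2.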

SLIDE 21

Misconceptions

PCA cannot cope with missing values (but neither can most other statistical methods).
It does not require normality.
It is not a hypothesis test.
There are no clear distinctions between response variables and explanatory variables.

SLIDE 22

When should PCA be used?

In community ecology, PCA is useful for summarizing variables whose relationships are approximately linear or at least monotonic.

e.g. A PCA of many soil properties might be used to extract a few components that summarize the main dimensions of soil variation.

PCA is generally NOT useful for ordinating community data. Why? Because relationships among species are highly nonlinear.

SLIDE 23

Community trends along environmental gradients appear as “horseshoes” in PCA ordinations.

None of the PC axes effectively summarizes the trend in species composition along the gradient.

[Figure: PCA ordination titled Beta Diversity 2R - Covariance, plotted against Axis 1 and a second axis.]

SLIDE 24

The “Horseshoe” Effect

Curvature of the gradient and the degree of infolding of the extremes increase with beta diversity.

PCA ordinations are not useful summaries of community data except when beta diversity is very low.

Using correlation generally does better than covariance. This is because standardization by species improves the correlation between Euclidean distance and environmental distance.
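The mismatch between Euclidean distance and environmental distance can be sketched with a simulated gradient. Everything here is an assumption for illustration: species are given Gaussian response curves with equal tolerance and evenly spaced optima, and the functions `abundances` and `euclidean` are hypothetical helpers.

```python
import numpy as np

tolerance = 5.0                        # width of each species' response curve
optima = np.arange(-30.0, 131.0, 2.0)  # species optima along the gradient


def abundances(position):
    """Hypothetical Gaussian response of every species at one site."""
    return np.exp(-((position - optima) ** 2) / (2.0 * tolerance ** 2))


def euclidean(a, b):
    """Euclidean distance between the species vectors of two sites."""
    return float(np.linalg.norm(abundances(a) - abundances(b)))


d_near = euclidean(0.0, 5.0)   # environmental distance 5
d_mid = euclidean(0.0, 50.0)   # environmental distance 50
d_far = euclidean(0.0, 100.0)  # environmental distance 100

print(round(d_near, 3), round(d_mid, 3), round(d_far, 3))
```

Once two sites share no species, their Euclidean distance stops growing: sites 50 and 100 gradient units apart end up equally far from the starting site. A linear method such as PCA can only represent that saturation by bending the gradient, which is consistent with the curvature increasing with beta diversity.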

SLIDE 25

What if there’s more than one underlying ecological gradient?

When there are two or more underlying gradients with high beta diversity, a “horseshoe” is usually not detectable.

Interpretation problems are more severe.

SLIDE 26

Data Set

SLIDE 27

Morphological and anatomical variation of Calophyllum L. (Calophyllaceae) in South America.

D. Vela

SLIDE 28

[Figure: Kielmeyeroideae and its tribes (Stevens, 2006).]

Calophylleae: Calophyllum, Neotatea, Marila, Mahurea, Clusiella, Kielmeyera, Caraipa, Haploclathra, Poeciloneuron, Mesua, Kayea, Mammea

Endodesmieae: Endodesmia, Lebrunia

SLIDE 29

Wurdack & Davis (2009)

SLIDE 30

Distribution of Calophyllaceae

[Map labels: 144 species; 259 species; 10 species; 176 species]

Stevens, 2006 http://www.mobot.org/MOBOT/research/APweb/

SLIDE 31

www.wikimedia.org

SLIDE 32

Vein Resin canal

SLIDE 33

SLIDE 34

SLIDE 35

http://www.botany.hawaii.edu/faculty/carr/images/cal_ino.jpg http://pakuwon.wordpress.com/2009/02/13/nyamplung‐calophyllum‐inophyllum/ http://www.flickr.com/photos/mauroguanandi

Calophyllum brasiliense

SLIDE 36

There is infraspecific variation in tepal number between individuals of the same species, and between flowers from the same inflorescence.

Stevens (1974, 1980)

SLIDE 37

Calophyllum brasiliense http://www.nationaalherbarium.nl/sungaiwain/ Calophyllum pisiferum Calophyllum lanigerum

SLIDE 38

1. Main objective

1.A To distinguish species limits of Calophyllum in South America.

2. Specific objectives

2.A To analyze morphological and anatomical variation.

SLIDE 39

Data collection for morphological observations

Herbarium and personal collections.
Collection sort: qualitative characteristics (Systematic Association Committee for Descriptive Biological Terminology, cited by Stearn 2006).
Measurement: ruler and a digital caliper.
Excel data matrix: specimen collections in rows and variables in columns.

SLIDE 40

Leaf characters
Petiole length mm (PTL)
Leaf length cm (LL)
Leaf length at widest part cm (LWWP)
Leaf width cm (LW)
Apex length mm (PL)
Midrib width at abaxial side mm (MW)
Vein angle degree (VA)
Venation density (VD)

Flower characters
Pedicel length mm (PDL)
Perianth width mm (PW)
Perianth length mm (PRL)
Anther length mm (AL)
Anther width mm (AW)
Stamen length mm (STL)
Filament length mm (FL)
Style length mm (STYL)
Gynoecium length mm (GL)
Ovary length mm (OL)
Stigma width mm (SL)

Fruit characters
External Fruit length mm (FrLEx)
External Fruit width mm (FrWEx)
Internal Fruit length mm (FrLIn)
Internal Fruit width mm (FrWIn)
Stigma remained mm (StygR)
Basal discoloration mm (BsDis)
Stone mm (Stn)
Corky mm (CRK)

SLIDE 41

REFERENCES

Claude, Julien. 2008. Morphometrics with R. Springer.
Gotelli, Nicholas J., and Aaron M. Ellison. 2004. A primer of ecological statistics. Sinauer Associates Publishers.
Jolliffe, I. T. 2002. Principal component analysis. Springer.
Legendre, Pierre, and Louis Legendre. 1998. Numerical ecology. Elsevier.
Quinn, Gerald Peter, and Michael J. Keough. 2002. Experimental design and data analysis for biologists. Cambridge University Press.