ABabcdfghiejkl . . . . . .. . . . . . . .. . . . . . . .. . - - PowerPoint PPT Presentation

ababcdfghiejkl
SMART_READER_LITE
LIVE PREVIEW

ABabcdfghiejkl . . . . . .. . . . . . . .. . . . . . . .. . - - PowerPoint PPT Presentation

Using distances to address the challenges of heterogeneous data Susan Holmes http:/ /www-stat.stanford.edu/susan/ Bio-X and Statistics, Stanford University July 29, 2015 ABabcdfghiejkl . . . . . .. . . . . . . .. . . . . . . .. .


slide-1
SLIDE 1

Using distances to address the challenges of heterogeneous data

Susan Holmes http:/ /www-stat.stanford.edu/˜susan/

Bio-X and Statistics, Stanford University

July 29, 2015

ABabcdfghiejkl

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-2
SLIDE 2

The messes we deal with

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-3
SLIDE 3

Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way.

. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-4
SLIDE 4

Goals in Modern Biology: Systems Approach

Look at the data/ all the data: data integration . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-5
SLIDE 5

Goals in Modern Biology: Systems Approach

Look at the data/ all the data: data integration

Tumor Cells 5000 10000 15000 20000 5000 10000 15000 20000 5e-04 0.001 0 1 1 0 -1 1 0 0 0 -1 0 1 1 0 0 0 0 0 0 1 0 1 -1 0 -1 0 0 0 0 -1 0 1 1 0 0 -1 1 0 1 1 0 1 1 0 0 0 0 1 0 1

( (

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-6
SLIDE 6

What do statisticians do?

▶ Design new experiments to test scientific hypotheses. ▶ Visualize and summarize data in ways that account for

uncertainties.

▶ Look for meaningful differences or structure in high

dimensional noisy data.

▶ Predict the class of new observations given previously

  • bserved ones.

▶ Predict the value of a response variable given a whole

set of other explanatory variables.

▶ Combine different sources of data to understand complex

interactions. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-7
SLIDE 7

Today's challenge

▶ Data are not uniformly

distributed from some manifold.

▶ Data are not an identically

distributed random sample.

▶ Data are not independent.

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-8
SLIDE 8

Data can often be seen as points in a state space

Rp

x x

1 2

x x x x x

2 . . . . . . p i 1 3 .

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-9
SLIDE 9

Distances in Statistics

▶ Euclidean Distances, spatial distances. ▶ Weighted Euclidean distances: Mahalanobis distance for

discriminant analysis.

▶ Chisquare distances for contingency tables and discrete

data.

▶ Jaccard distances for presence absence is one of 50

distances used in Ecology.

▶ Earth Mover's distance on trees or graphs. ▶ Biologically meaningful distances (DNA, haplotype,

Proteins). . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-10
SLIDE 10

What do statisticians use distances for?

▶ Summaries through Fréchet Means and Medians and

pseudo variances.

▶ Center of Cloud of Objects Tk (equal weights): Find T0

that minimizes either ∑K

k=1 d2(T0, Tk)

this is the (L2) definition of the Fréchet mean object,

▶ or ∑K

k=1 d(T0, Tk) (L1 or Geometric Median).

▶ Pseudovariance=

1 K−1

∑K

k=1 d2(T0, Tk) = ˆ

  • s2. Dimension

reduction and visualization. Nearest Neighbor Methods. Clustering. Make network edges from close points. Prediction by minimizing weighted residual distances. Cross-products: correlations, autocorrelations. Generalizations of analysis of variance.

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-11
SLIDE 11

What do statisticians use distances for?

▶ Summaries through Fréchet Means and Medians and

pseudo variances.

▶ Dimension reduction and visualization. ▶ Nearest Neighbor Methods. ▶ Clustering. ▶ Make network edges from close points. ▶ Prediction by minimizing weighted residual distances. ▶ Cross-products: correlations, autocorrelations. ▶ Generalizations of analysis of variance.

Finding the right distance usually solves the statistical problem. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-12
SLIDE 12

Part I The Geometries of Data

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-13
SLIDE 13

First example: cell segmentation

Joint work with Adam Kapelner and PP Lee. Stained biopsy slides. Multispectral imaging (8 levels/wavelengths). Stained Lymph Node Aim to identify cell. Points similar in feature space are of the same type. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-14
SLIDE 14

Problem : Staining is heterogeneous

Both images are from the same image set. The stained cells are cancer cells stained with Fast Red red. Some regions of the tissue stain like the image on the left and other regions stain as the left. This shows the level of heterogeneity These are two ``subclasses'' of the same phenotype (the left is named subclass ``A,'' the right, subclass ``B''). . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-15
SLIDE 15

Problem : Staining is heterogeneous

Extreme variability in the image colors/intensity/contrast. Pixels from a same cell not independent and identically distributed across the different slides or across different cell types. Simple nearest neighbor approach:

  • Take 8 dimensional pixels points.
  • Assigning the point to the closest neighbor

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-16
SLIDE 16

Problem : Staining is heterogeneous

Extreme variability in the image colors/intensity/contrast. Pixels from a same cell not independent and identically distributed across the different slides or across different cell types. ? Simple nearest neighbor approach:

  • Take 8 dimensional pixels points.
  • Assigning the point to the closest neighbor

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-17
SLIDE 17

5 10 −2 2 4 6 8 Orange Red

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-18
SLIDE 18

5 10 −2 2 4 6 8 Orange Red

  • (3.2,2)

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-19
SLIDE 19

5 10 −2 2 4 6 8 Orange Red

  • (3.2,2)

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-20
SLIDE 20

5 10 −2 2 4 6 8 Orange Red

  • (3.2,2)

D1

2(p,m1)=19.7

D2

2(p,m2)=16

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-21
SLIDE 21

Multivariate Normal Data

Mahalanobis Transformation. Several different clusters with different variance-covariance matrices and different means. (µ1, Σ1) (µ2, Σ2) D2

1(x, µ1) = (x − µ1)TΣ−1 1 (x − µ1)

D2

2(x, µ2) = (x − µ2)TΣ−1 2 (x − µ2)

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-22
SLIDE 22

Corresponding Data Transformation

H = I − 1Dn1T, S = X′HDnHX

  • zi. = S− 1

2 (xi. − ¯

x) This is sometimes called `data sphering'.

5 10 −2 2 4 6 8 Sphered O Spehered Red

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-23
SLIDE 23

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-24
SLIDE 24

Output Data

Tumor

Tumor Cells

5000 10000 15000 20000 5000 10000 15000 20000 5e−04 0.001

Number of Tumor cells: 27,822 . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-25
SLIDE 25

We can add information through choice of distances

Sample data can often be seen Variables are `vectors' as points in a state space. in data point space Rp Rn x x

1 2

x x x x x

2 . . . . . . p i 1 3 .

x x 1

2

x x x x x

2 . . . . . . n j 1 3 .

x

4 .

x 3

.

xtQy =< x, y >Q xtDy =< x, y >D Duality : Transposable data. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-26
SLIDE 26

Data Analysis: Geometrical Approach

  • i. The data are p variables measured on n observations.
  • ii. X with n rows (the observations) and p columns (the

variables).

  • iii. D is an n × n matrix of weights on the ``observations'',

which is most often diagonal but not always. iv Symmetric definite positive matrix Q, weights on . variables, often Q =       

1 σ2

1

...

1 σ2

2

... ... ... . . . ... ...

1 σ2

p

       . x x

1 2

x x x x x

2 . . . . . . p i 1 3 .

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-27
SLIDE 27

Euclidean Space and dimension reduction

These three matrices form the essential ``triplet" (X, Q, D) defining a multivariate data analysis. Q and D define geometries or inner products in Rp and Rn, respectively, through xtQy =< x, y >Q x, y ∈ Rp xtDy =< x, y >D x, y ∈ Rn. This can be extended to more inner products giving what is known as Kernel methods. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-28
SLIDE 28

Principal Component Analysis: Dimension Reduction

PCA seeks to replace the original (centered) matrix X by a matrix of lower rank, this can be solved using the singular value decomposition of X: X = USV′, with U′DU = In and V′QV = Ip and S diagonal XX′ = US2U′, with U′DU = In and S2 = Λ PCA is a linear nonparametric multivariate method for dimension reduction. D and Q are the relevant metrics on the dual row and column spaces of n samples and p variables. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-29
SLIDE 29

A Commutative Diagram Approach

Caillez and Pages, 1976. Escoufier, 1977 . Statisticians search for approximations with certain properties, for the case of PCA for instance, we rephrase the problem as follows:

▶ Q can be seen as a linear function from Rp to

Rp∗ = L(Rp), the space of scalar linear functions on Rp.

▶ D can be seen as a linear function from Rn to

Rn∗ = L(Rn).

V = XtDX Rp∗ − − − − →

X

Rn

Q

   V

D

 

W Rp ← − − − −

Xt

Rn∗ W = XQXt This duality gives `transposable' data. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-30
SLIDE 30

Properties of the Diagram

Rank of the diagram: X, Xt, VQ and WD all have the same rank. For Q and D symmetric matrices, VQ and WD are diagonalisable and have the same eigenvalues. λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λr ≥ 0 ≥ · · · ≥ 0. Eigendecomposition of the diagram: VQ is Q symmetric, thus we can find Z such that VQZ = ZΛ, ZtQZ = Ip, where Λ = diag(λ1, λ2, . . . , λp). (1) Modern extensions to this approach include Kernel methods in Machine Learning. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-31
SLIDE 31

Predicting and Summarizing through distances

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-32
SLIDE 32

Comparing Two Diagrams: the RV coefficient

Many problems can be rephrased in terms of comparison of two ``duality diagrams" or put more simply, two characterizing

  • perators, built from two ``triplets", usually with one of the

triplets being a response or having constraints imposed on it. Most often what is done is to compare two such diagrams, and try to get one to match the other in some optimal way.(O = WD) To compare two symmetric operators, there is either a vector covariance as inner product covV(O1, O2) = Tr(Ot

1O2) =< O1, O2 > or a vector correlation

(Escoufier, 1977) RV(O1, O2) = Tr(Ot

1O2)

√ Tr(Ot

1O1)tr(Ot 2O2)

. If we were to compare the two triplets ( Xn×1, 1, 1

nIn

) and ( Yn×1, 1, 1

nIn

) we would have RV = ρ2. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-33
SLIDE 33

Part II Dimension Reduction: the Euclidean embedding workhorse: MDS

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-34
SLIDE 34

Metric Multidimensional Scaling

Schoenberg (1935) . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-35
SLIDE 35

From Coordinates to Distances and Back

If we started with original data in Rp that are not centered: Y, apply the centering matrix X = HY, with H = (I − 1 n11′), and 1′ = (1, 1, 1 . . . , 1) Call B = XX′, if D(2) is the matrix of squared distances between rows of X in the euclidean coordinates, we can show that −1 2HD(2)H = B Schoenberg's result: exact Euclidean distance If B is positive semi-definite then D can be seen as a distance between points in a Euclidean space. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-36
SLIDE 36

Reverse engineering an Euclidean embedding

We can go backwards from a matrix D to X by taking the eigendecomposition of B = − 1

2HD(2)H in much the same way

that PCA provides the best rank r approximation for data by taking the singular value decomposition of X, or the eigendecomposition of XX′. X(r) = US(r)V′ with S(r) =       s1 ... s2 ... ... ... ... ... sr ... ... ... ...       . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-37
SLIDE 37

Multidimensional Scaling (MDS)

Simple classical multidimensional scaling.

▶ Square D elementwise D(2) = D2. ▶ Compute −1 2 HD2H = B. ▶ Diagonalize B to find the principal coordinates SV′. ▶ Choose a number of dimensions by inspecting the

eigenvalue's screeplot. The advantage is that the original distances don't have to be Euclidean. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-38
SLIDE 38

Taking Categorical Data and Making it into a Continuum

Horseshoe Example:Joint with Persi Diaconis and Sharad Goel (Annals of Applied Stats, 2005). Data from 2005 U.S. House

  • f Representatives roll call votes. We further restricted our

analysis to the 401 Representatives that voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent) leading to a 401 × 669 matrix of voting data. The Data

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 R1

  • 1
  • 1

1

  • 1

1 1 1 1 1 ... R2

  • 1
  • 1

1

  • 1

1 1 1 1 1 ... R3 1 1

  • 1

1

  • 1

1 1 -1 -1

  • 1 ...

R4 1 1

  • 1

1

  • 1

1 1 -1 -1

  • 1 ...

R5 1 1

  • 1

1

  • 1

1 1 -1 -1

  • 1 ...

R6

  • 1
  • 1

1

  • 1

1 1 1 1 1 ... R7

  • 1
  • 1

1

  • 1
  • 1

1 1 1 1 1 ... R8

  • 1
  • 1

1

  • 1

1 1 1 1 1 ... R9 1 1

  • 1

1

  • 1

1 1 -1 -1

  • 1 ...

R10 -1

  • 1

1

  • 1

1 1 0 ...

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-39
SLIDE 39

L1 distance

We define a distance between legislators as ˆ d(li, lj) = 1 669

669

k=1

|vik − vjk|. Roughly, ˆ d(li, lj) is the percentage of roll calls on which legislators li and lj disagreed. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-40
SLIDE 40 !0.1 !0.05 0.05 0.1 !0.2 !0.1 0.1 0.2 !0.2 !0.15 !0.1 !0.05 0.05 0.1 0.15

3-Dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes. We used dissimilarity indices 1-exp(−λd(R1, R2)) . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-41
SLIDE 41 !0.1 !0.05 0.05 0.1 !0.2 !0.1 0.1 0.2 !0.2 !0.15 !0.1 !0.05 0.05 0.1 0.15

3-Dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes. Color has been added to indicate the party affiliation of each representative. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-42
SLIDE 42 50 100 150 200 250 300 350 400 10 20 30 40 50 60 70 80 90 100 MDS Rank National Journal Score

Comparison of the MDS derived rank for Representatives with the National Journal's liberal score . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-43
SLIDE 43

An Application: Visualizing Geodesic Distances between Trees

▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4 2 3 1 Fill-in of NNI moves: Billera, Holmes, Vogtmann (2001)(BHV). The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an `unresolved' tree.

More details here

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-44
SLIDE 44

An Application: Visualizing Geodesic Distances between Trees

▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4 2 3 1

4 3 2 1

Fill-in of NNI moves: Billera, Holmes, Vogtmann (2001)(BHV). The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an `unresolved' tree.

More details here

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-45
SLIDE 45

An Application: Visualizing Geodesic Distances between Trees

▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4 2 3 1

4 3 2 1 4 1 3 2

Fill-in of NNI moves: Billera, Holmes, Vogtmann (2001)(BHV). The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an `unresolved' tree.

More details here

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-46
SLIDE 46

An Application: Visualizing Geodesic Distances between Trees

▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4 2 3 1

4 3 2 1 4 1 3 2

▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann

(2001)(BHV). The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an `unresolved' tree.

More details here

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-47
SLIDE 47

A Cone Path

A path between two trees T and T′ always exists. Since all

  • rthants connect at the origin, any two trees T and T′ can

be connected by a two-segment path, this is called the cone-path. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-48
SLIDE 48

c a c b b a

Theorem(Billera, Holmes, Vogtmann (BHV ,2001)): Tree space with BHV metric is a CAT(0) space, that is, it has non-positive curvature. This implies there are geodesic between any two trees (Gromov). Note: This space of trees is not an Euclidean space. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-49
SLIDE 49

c a c b b a

The size of the ``pseudo-variance'' can be estimated from ∑ pid(T0, Ti)2. Properties of the Fréchet mean of a set of trees has been (Bhattacharya et al.2010, Miller, Mattingley, Owen, Marron, al. 2013). . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-50
SLIDE 50

Phylogenetic Trees

Malaria Data as seen using ape

Pre1 Pme2 Plo6 Pga11 Pma3 Pbe5 Pfr7 Pkn8 Pcy9 Pvi10 Pfa4

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-51
SLIDE 51

Sampling Distribution for Trees

Data 1 2 3

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-52
SLIDE 52

Data 1 2 3

Treespace T n

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-53
SLIDE 53

Data 1 2 3 4

True Sampling Distribution

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-54
SLIDE 54

Data 1 2 3 4

Bootstrap Sampling Distribution (non parametric)

n

^ * * * *

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-55
SLIDE 55

Bootstrap of Malaria Data

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-56
SLIDE 56

Hierarchical Clustering Trees

HEA25_EFFE_3 MEL39_EFFE_2 HEA31_EFFE_2 MEL67_EFFE_4 HEA55_EFFE_4 HEA59_EFFE_5 HEA26_EFFE_1 MEL51_EFFE_5 MEL36_EFFE_1 MEL53_EFFE_3 HEA31_NAI_2 HEA55_NAI_4 MEL67_NAI_4 MEL53_NAI_3 HEA25_NAI_3 MEL51_NAI_5 HEA59_NAI_5 HEA26_NAI_1 MEL36_NAI_1 MEL39_NAI_2 MEL51_MEM_5 HEA26_MEM_1 MEL67_MEM_4 HEA31_MEM_2 HEA55_MEM_4 HEA25_MEM_3 HEA59_MEM_5 MEL53_NAI_3 MEL36_MEM_1 MEL39_MEM_2

Human AF5q31 protein (AF5q31) intracellular hyaluronan−bindi selectin L (lymphocyte adhesio Human cDNA FLJ10470 fis, clone Human mRNA for KIAA0303 gene, KIAA0303 protein STAT induced STAT inhibitor 3 KIAA0752 protein GRB2−related adaptor protein delta (Drosophila)−like 1 stannin proteoglycan link protein Incyte EST amyloid beta (A4) precursor pr Human genomic DNA, chromosome follicular lymphoma variant tr Human, clone IMAGE:3875338, mR Human, Similar to phosphodiest Human sodium/myo−inositol cotr ESTs, Weakly similar to MUC2_H Human CpG island DNA genomic M POU domain, class 2, transcrip Human zinc finger protein ZNF2 Human 54 kDa progesterone rece protein tyrosine phosphatase, platelet/endothelial cell adhe Human cDNA FLJ20849 fis, clone Human mRNA for KIAA0972 protei KIAA0290 protein Human clone 295, 5cM region su eukaryotic translation initiat ferritin, heavy polypeptide 1 Human cDNA: FLJ22008 fis, clon Human insulin−like growth fact granzyme K (serine protease, g Human Epstein−Barr virus induc PAS−serine/threonine kinase lymphotoxin beta (TNF superfam Human mRNA for nel−related pro chemokine (C−C motif) receptor PAS−serine/threonine kinase Human mRNA for alpha−actinin, Human mRNA encoding the c−myc Human RATS1 mRNA, complete cds hyaluronoglucosaminidase 2 Human DNA for muscle nicotinic Human epithelial V−like antige interferon gamma receptor 2 (i Homo sapiens clone 24775 mRNA syntaphilin Human mRNA for endosialin prot Human zinc finger protein PLAG short−chain dehydrogenase/redu Human, short−chain dehydrogena

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-57
SLIDE 57
  • 20000

40000 60000 80000 120000

Eigenvalues of MDS for bootstrapped trees

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-58
SLIDE 58

−40 −20 20 40 −60 −40 −20 20 40

Bootstrapped trees

  • 1

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 3839 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 7576 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

  • . .

. . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-59
SLIDE 59

Part III Combine and Compare Trees, Graphs and Contingent Count Data

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-60
SLIDE 60

Layers of Data in the Microbiome

Joshua Lederberg:`the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share

  • ur body space and have been all but ignored as

determinants of health and disease' Microbiome Complete collection of genes contained in the genomes of microbes living in a given environment. Numbers Humans shelter 100 trillion microbes (1014), (we are made of 10 ×1012 cells). Metagenome Composition of all genes present in an environment (soil, gut, seawater), regardless of species. Transciptome These are the mRNA transcripts in the cell, it reflects the genes that are being actively expressed at any given time. Metabolome The metabolites (small molecules) nucleic or fatty acids, sugars,... present in the sample either endogenous or exogenous (medication, pollution). . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-61
SLIDE 61

. Source: YK Lee and SK Mazmanian Science, 2010. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-62
SLIDE 62

Bacteria etc... and Us

The human microbiome or human microbiota is the assemblage of microorganisms that reside on the surface and in deep layers of skin, in the saliva and oral mucosa, in the conjunctiva, and in the gastrointestinal tracts.

▶ They include bacteria, fungi, and archaea. ▶ Some of these organisms perform tasks that are useful

for the human host. (live in symbiosis)

▶ Majority have no known beneficial or harmful effect.

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-63
SLIDE 63

Human Microbiome: What are the data?

DNA The Genomic material present (16sRNA-gene especially, but also shotgun). RNA What genes are being turned on (gene expression), transcriptomics. Mass Spec Specific signatures of chemical compounds present (LC/MS, GC/MS). Clinical Multivariate information about patients' clinical status, medication, weight. Environmental Location, nutrition, drugs, chemicals, temperature, time. Domain Knowledge Metabolic networks, phylogenetic trees, gene ontologies. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-64
SLIDE 64

Heterogeneous Data Objects

Object oriented input and data manipulation with phyloseq (McMurdie and Holmes, 2013, Plos ONE) Object oriented data in R:

Taxonomy Table taxonomyTable slots: .Data OTU Abundance class: otuTable slots: .Data, speciesAreRows Sample Variables sampleData slots: .Data, names, row.names, .S3Class Phylogenetic Tree class: phylo slots: see ape package matrix matrix data.frame phyloseq slots:

  • tuTable

sampleData taxTab tre Experiment-level data object: Component data objects:

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-65
SLIDE 65

Part IV Heteroscedasticity: Mixtures and to Normalize them

Source: xkcd. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-66
SLIDE 66

Points are measured with unequal variance

x x

1 2

x x x x x

2 . . . . . . p i 1 3 .

xn .

. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-67
SLIDE 67

Some real data (Caporoso et al, 2011) > GlobalPatterns phyloseq-class experiment-level object

  • tu_table()

OTU Table: [ 19216 taxa and 26 samples ] sample_data()Sample Data: [ 26 samples by 7 sample variables ] tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ] phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ] > sample_sums(GlobalPatterns) CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr 864077 1135457 697509 1543451 2076476 718943 433894 186297 ..... NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1 1478965 1652754 58688 493126 279704 937466 1211071 1216137 > summary(sample_sums(GlobalPatterns))

  • Min. 1st Qu.

Median Mean 3rd Qu. Max. 58690 567100 1107000 1085000 1527000 2357000

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-68
SLIDE 68

Points are measured with unequal variance

x x

1 2

x x x x x

2 . . . . . . p i 1 3 .

xn .

. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-69
SLIDE 69

Equalization of variances

In this binomial example the variance of the proportion estimate is Var( X

n ) = pq n = q nE( X n ), a function of the mean.

This is a common occurrence and one that is traditionally dealt with in statistics by applying variance-stabilizing transformations. However, in order to find the right transformation, we need a good model for the error. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-70
SLIDE 70

Variance Stabilization

Prefer to deal with errors across samples which are independent and identically distributed. In particular homoscedasticity (equal variances) across all the noise levels. This is not the case when we have unequal sample sizes and variations in the accuracy across instruments. A standard way of dealing with heteroscedastic noise is to try to decompose the sources of heterogeneity and apply transformations that make the noise variance almost constant. These are called variance stabilizing transformations. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-71
SLIDE 71

Mixture Modeling works Miracles

▶ Beta-Binomial (deepSNV). ▶ Zero inflated Poisson or Gaussian. ▶ Gamma-Poisson.

Mixtures are ubiquitous because of a mathematical theorem De Finnetti's Theorem . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-72
SLIDE 72

Wolfgang Huber . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-73
SLIDE 73

Correct transformations are available

McMurdie and Holmes (2014) ``Waste Not, Want Not: Why rarefying microbiome data is inadmissible'', PLOS Computational Biology, Methods. We propose to model the read counts If technical replicates have same number of reads: sj, Poisson variation with mean µ = sjui. Taxa i incidence proportion ui. Number of reads for the sample j and taxa i would be Kij ∼ Poisson (sjui) . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-74
SLIDE 74

A distance on the known tree

Monge-Kantorovich earth mover's distance on the tree. Used to compare two samples or body sites for instance. Incorporate taxa abundances and phylogenetic tree

Epulopiscium Clostridium Adlercreutzia Lachnospira Alistipes Roseburia Coprococcus Clostridium Blautia Coprococcus Dehalobacterium Clostridium Clostridium Clostridium Coprobacillus Coprococcus Clostridium Clostridium Moryella Abundance 1 25 625 Class Actinobacteria (class) Bacilli Bacteroidia Clostridia Erysipelotrichi Gammaproteobacteria Mollicutes Verrucomicrobiae YS2

Duality diagram methods that can use any dependency structure. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-75
SLIDE 75

Unifrac Distance (Lozupone and Knight, 2005)

is a distance between groups of organisms that are related to each other by a tree. Suppose we have the OTUs present in sample 1 (blue) and in sample 2(red). Question: Do the two samples differ phylogenetically? It is defined as the ratio of the sum of the lengths of the branches leading to members of group A or members of group B but not both to the total branch length of the tree. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-76
SLIDE 76

Weighted Unifrac distance A modification of UniFrac,

weighted UniFrac is defined in (Lozupone et al., 2007) as

n

i=1

bi × | Ai AT − Bi BT |

▶ n = number of branches in the tree ▶ bi = length of the ith branch ▶ Ai = number of descendants of

ith branch in group A

▶ AT = total number of sequences

in group A [7]. [6]. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-77
SLIDE 77

Costello et al. 2010 . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-78
SLIDE 78

Rao's Distance

We start with a distance between individuals. The heterogeneity of a population (Hi ) is the average distance between members of that population. The heterogeneity between two populations (Hij) is the average distance between a member of population i and a member of population j. The distance between two populations is Dij = Hij − 1 2(Hi + Hj) . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-79
SLIDE 79

Decomposition of Diversity

If we have populations 1, . . . , k with frequencies π1, . . . , πk, then the diversity of all the populations together is H0 =

k

i=1

πiHi + ∑

i

j

πiπjDij = H(w) + D(b) . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-80
SLIDE 80

Double Principal Coordinate Analysis

Pavoine, Dufour and Chessel (2004), Purdom (2010) and Fukuyama et al. (2011). . Suppose we have n species in p locations and a (euclidean) matrix ∆ giving the squares of the pairwise distances between the species. Then we can

▶ Use the distances between species to find an embedding

in n − 1 -dimensional space such that the euclidean distances between the species is the same as the distances between the species defined in ∆.

▶ Place each of the p locations at the barycenter of its

species profile. The euclidean distances between the locations will be the same as the square root of the Rao dissimilarity between them.

▶ Use PCA to find a lower-dimensional representation of

the locations. Give the species and communities coordinates such that the inertia decomposes the same way the diversity does. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-81
SLIDE 81

Fukuyama and Holmes, 2012.

Method Original description New formula Properties DPCoA square root of Rao's distance based on the square root of the patristic distances [∑

i bi(Ai/AT − Bi/BT)2]1/2

Most sensitive to outliers, least sensitive to noise, upweights deep differences, gives OTU locations wUniFrac ∑

i bi |Ai/AT − Bi/BT|

i bi |Ai/AT − Bi/BT|

Less sensitive to outliers/more sensitive to noise than DPCoA UniFrac fraction of branches leading to exactly one group ∑

i bi1{ Ai/AT−Bi/BT Ai/AT+Bi/BT ≥ 1}

Sensitive to noise, upweights shallow differences on the tree

Summary of the methods under consideration. ``Outliers" refers to highly abundant OTUs, and noise refers to noise in detecting low-abundance OTUs (see the text for more detail). . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-82
SLIDE 82

Antibiotic Time Course Data

Measurements of about 2500 different bacterial OTUs from stool samples of three patients (D, E, F) Each patient sampled ∼ 50 times during the course of treatment with ciprofloxacin (an antibiotic). Times categorized as Pre Cp, 1st Cp, 1st WPC (week post cipro), Interim, 2nd Cp, 2nd WPC, and Post Cp. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-83
SLIDE 83

UniFrac

Axis 1: 14.7% Axis 2: 10.3%

−0.2 −0.1 0.0 0.1 0.2
  • −0.2 −0.1 0.0 0.1 0.2 0.3 0.4

weighted UniFrac

Axis 1: 47.6% Axis 2: 12.3%

−0.2 −0.1 0.0 0.1 0.2
  • −0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3

weighted UF on presence/absence

Axis 1: 32.7% Axis 2: 15.1%

−0.05 0.00 0.05 0.10
  • −0.10
−0.05 0.000.050.100.150.20 subject
  • D
E F

Comparing the UniFrac variants. From left to right: PCoA/MDS with unweighted UniFrac, with weighted UniFrac, and with weighted UniFrac performed on presence/absence data extracted from the abundance data used in the other two plots . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-84
SLIDE 84

(a) MDS of OTUs

Axis 1: 6.2% Axis 2: 3.7% −1.5 −1.0 −0.5 0.0 0.5
  • −1.0
−0.5 0.0 0.5

(c) DPCoA OTU plot

CS1 CS2 −1.0 −0.5 0.0 0.5 1.0
  • −1.5 −1.0 −0.5 0.0
0.5 1.0 phylum
  • 4C0d−2
  • Actinobacteria
  • Bacteroidetes
  • Candidate division TM7
  • Cyanobacteria
  • Firmicutes
  • Fusobacteria
  • Lentisphaerae
  • Proteobacteria
  • Synergistetes
  • Verrucomicrobia

(b) DPCoA community plot

Axis 1: 40.9% Axis 2: 13.3% −0.8 −0.6 −0.4 −0.2 0.0 0.2 0.4
  • −1.0
−0.5 0.0 0.5 subject
  • D
E F

(a) PCoA/MDS of the OTUs based on the patristic distance, (b) community and (c) species points for DPCoA after removing two outlying species. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-85
SLIDE 85

Antibiotic Stress

We next want to visualize the effect of the antibiotic. Ordinations of the communities due to DPCoA and UniFrac with information about the whether the community was stressed or not stressed (pre cipro, interim, and post cipro were considered ``not stressed'', while first cipro, first week post cipro, second cipro, and second week post cipro were considered ``stressed''). We see that for UniFrac, the first axis seems to separate the stressed communities from the not stressed communities. DPCoA also seems to separate the out the stressed communities along the first axis (in the direction associated with Bacteroidetes), although only for subjects D and E. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-86
SLIDE 86

Axis1 Axis2

−0.2 −0.1 0.0 0.1 0.2

  • D 1

D 2 E 1 E 2 F 1 F 2

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4 Antibiotic stress

  • 1: not stressed

2: stressed Subject

  • D
  • E
  • F

PCoA/MDS with unweighted UniFrac. The labels represent subject plus antibiotic condition. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-87
SLIDE 87

Axis1 Axis2

−0.8 −0.6 −0.4 −0.2 0.0 0.2 0.4

  • D 1

D 2 E 1 E 2 F 1 F 2

−1.0 −0.5 0.0 0.5

Community points as represented by DPCoA. The labels represent subject plus antibiotic condition. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-88
SLIDE 88

Conclusions for Antibiotic Stress

Since UniFrac emphasizes shallow differences on the tree and since PCoA/MDS with UniFrac seems to separate the subjects from each other better than the other two methods, we can conclude that the differences between subjects are mainly shallow ones. However, DPCoA also separates the subjects and the stressed versus non-stressed communities, and examining the community and OTU ordinations can tell us about the differences in the compositions of these communities. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-89
SLIDE 89

Distances enable statisticians to....

▶ Summarize data with medians, means and principal

directions.

▶ Encode some variations in uncertainty. ▶ Make comparisons of heterogeneous sources of

information.

▶ Integrate network and tree information. ▶ Measure diversity, inertia and generalize the notion of

variance. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-90
SLIDE 90

Questions for mathematicians ?

▶ How to make a method designed for uniformly distributed

points work for points generated by mixtures of heterogeneous distributions? Examples from work by Edelsbrunner, Carlsson, Zoromodian and co-authors. Source:Zoromodian. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-91
SLIDE 91

Questions for mathematicians

▶ How to build distances between images that account for

unequal measurement errors, even locally?

x x

1 2

x x x x x

2 . . . . . . p i 1 3 .

xn .

.

Work by Adler, Taylor and Worsley (2003,2005,2007) using Random Fields. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-92
SLIDE 92

Questions for mathematicians

▶ How well can the Euclidean embedding approximations

do? Are there better ways of approximating the commutative diagrams? This is also an important point of contact with the use of Stein's method in probability theory. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-93
SLIDE 93

Questions for mathematicians

▶ How well can the Euclidean embedding approximations

do?

▶ Are there better ways of approximating the commutative

diagrams? This is also an important point of contact with the use of Stein's method in probability theory. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-94
SLIDE 94

Questions for mathematicians

▶ How to distinguish between the effect of the curvature

  • f a state space and the effect of the unequal sampling?

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-95
SLIDE 95

Answers come from Differential Geometry.

Xavier Pennec, Yann Ollivier, Tom Fletcher, Rabi Bhattacharya. In particular enable us to incorporate the relevant data dependent transformations into localized metrics. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-96
SLIDE 96

Output showing posterior uncertainty measures

. . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-97
SLIDE 97

Benefitting from the tools and schools of Statisticians.......

Thanks to the R community:

▶ RStudio for tools for reproducible research and Hadley

Wickham for ggplot2.

▶ Ecologists and biologists: Chessel, Jombart, Dray,

Thioulouse ade4 and Emmanuel Paradis ape. Collaborators: David Relman, Alfred Spormann, Yves Escoufier, Les Dethfelsen, Justin Sonnenburg, Persi Diaconis, Sergio Baccallado, Elisabeth Purdom. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-98
SLIDE 98

Lab Group

Postdoctoral Fellows Paul (Joey) McMurdie, Ben Callahan, Simon Rubinstein-Salzado, Christof Seiler. Students: John Chakerian, Julia Fukuyama, Kris Sankaran. Funding from NIH/ NIGMS R01, NSF-VIGRE and NSF-DMS. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-99
SLIDE 99

References

  • L. Billera, S. Holmes, and K. Vogtmann.

The geometry of tree space.

  • Adv. Appl. Maths, 771--801, 2001.
  • J. Chakerian and S. Holmes.

distory:Distances between trees, 2010. Daniel Chessel, Anne Dufour, and Jean Thioulouse. The ade4 package - i: One-table methods. R News, 4(1):5--10, 2004.

  • P. Diaconis, S. Goel, and S. Holmes.

Horseshoes in multidimensional scaling and kernel methods. Annals of Applied Statistics, 2007 .

  • Y. Escoufier.

Operators related to a data matrix. In J.R. et al. Barra, editor, Recent developments in Statistics., pages 125--131. North Holland,, 1977 . . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-100
SLIDE 100

Steven N Evans and Frederick A Matsen. The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples. arXiv, q-bio.PE, Jan 2010. M Hamady, C Lozupone, and R Knight. Fast unifrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and phylochip data. The ISME Journal, Jan 2009. Susan Holmes. Multivariate analysis: The French way. In D. Nolan and T. P. Speed, editors, Probability and Statistics: Essays in Honor of David A. Freedman, volume 56 of IMS Lecture Notes--Monograph Series. IMS, Beachwood, OH, 2006. Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-101
SLIDE 101

Journal of Computational and Graphical Statistics, 5(3):299--314, 1996.

  • K. Mardia, J. Kent, and J. Bibby.

Multiariate Analysis. Academic Press, NY., 1979.

  • P. J. McMurdie and S. Holmes.

Phyloseq: Reproduible research platform for bacterial census data. PlosONE, 2013. April 22,. Serban Nacu, Rebecca Critchley-Thorne, Peter Lee, and Susan Holmes. Gene expression network analysis and applications to immunology. Bioinformatics, 23(7):850--8, Apr 2007 . Sandrine Pavoine, Anne-Béatrice Dufour, and Daniel Chessel. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .

slide-102
SLIDE 102

From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology, 228(4):523--537, 2004. Elizabeth Purdom. Analysis of a data matrix and a graph: Metagenomic data and the phylogenetic tree. Annals of Applied Statistics, Jul 2010.

  • C. R. Rao.

The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329--359., 1964. . . . . . .. . . . . . . .. . . . . . . .. . . . . . . .. . . .. . . . . .