The Average Consensus Procedure: Combination of Weighted Trees Containing Identical or Overlapping Sets of Taxa
François-Joseph Lapointe, Université de Montréal, Montreal, Canada
Guy Cucumel, Université du Québec à Montréal, Montreal
Consensus Trees
The Problem
Suppose we have multiple phylogenies on overlapping leaf sets. How can we combine them effectively?
*Modified to show the overlapping leaf sets
Criteria
Intuitively, we want the best possible tree, i.e. the one that is closest to every input tree in both topology and branch lengths. To make the problem precise, we first have to decide which criteria matter:
- Tree Topology
- Branch Lengths
Notion of Tree Distance: Identical Leaf Set
Suppose two trees are defined on the same leaf set. How "close" are they to one another? Let S = {1, 2, ..., n} be the taxa, and let T_1 and T_2 be two trees on S. Define

$$
\Delta(T_1, T_2) := \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d_1(i,j) - d_2(i,j) \bigr)^2 \tag{1}
$$

where $d_\ell(i,j)$ is the distance from i to j in $T_\ell$.

Note. It is important to normalize the branch lengths for this to make sense. Assume all branch lengths are in [0, 1].
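As a minimal sketch of definition (1), assume each tree has already been reduced to a symmetric matrix of leaf-to-leaf path lengths. The function name `tree_distance` and the two example matrices are illustrative assumptions, not taken from the slides.

```python
def tree_distance(d1, d2):
    """Sum of squared differences of pairwise leaf distances, as in (1)."""
    n = len(d1)
    if len(d2) != n or any(len(row) != n for row in d1 + d2):
        raise ValueError("both matrices must be n x n on the same leaf set")
    # The double sum runs over all ordered pairs (i, j), so every
    # unordered pair of taxa contributes twice, exactly as in (1).
    return sum((d1[i][j] - d2[i][j]) ** 2 for i in range(n) for j in range(n))


# Hypothetical path-length matrices for two trees on the taxa {1, 2, 3}.
D1 = [[0.0, 0.5, 1.0],
      [0.5, 0.0, 1.0],
      [1.0, 1.0, 0.0]]
D2 = [[0.0, 1.0, 0.5],
      [1.0, 0.0, 1.0],
      [0.5, 1.0, 0.0]]

delta = tree_distance(D1, D2)
print(delta)  # 1.0
```

Two of the three pairs differ by 0.5, each squared difference is 0.25, and each pair is counted twice, which gives 1.0.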
Optimal Consensus Tree: Identical Leaf Set
What is the best tree? Given k trees T_1, T_2, ..., T_k on S, we want the tree T_c that minimizes

$$
\sum_{\ell=1}^{k} \Delta(T_c, T_\ell) = \sum_{\ell=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d_c(i,j) - d_\ell(i,j) \bigr)^2 \tag{2}
$$

But how do we compute this?

Claim. Let $\bar{d}(i,j)$ be the average of $d_\ell(i,j)$ over all trees $T_\ell$. Then to optimize $\sum_{\ell=1}^{k} \Delta(T_c, T_\ell)$ it suffices to optimize

$$
\sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d_c(i,j) - \bar{d}(i,j) \bigr)^2
$$
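Writing $\bar{d}(i,j) = \frac{1}{k} \sum_{\ell=1}^{k} d_\ell(i,j)$ for the average distance, expand the square and exchange the order of summation:

$$
\begin{aligned}
\sum_{\ell=1}^{k} \Delta(T_c, T_\ell)
&= \sum_{\ell=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d_c(i,j) - d_\ell(i,j) \bigr)^2 \\
&= \sum_{\ell=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl[ d_c(i,j)^2 - 2 d_c(i,j)\, d_\ell(i,j) + d_\ell(i,j)^2 \bigr] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \Bigl[ k \cdot d_c(i,j)^2 - 2 d_c(i,j) \sum_{\ell} d_\ell(i,j) + \sum_{\ell} d_\ell(i,j)^2 \Bigr] \\
&= k \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} \Bigl[ d_c(i,j)^2 - 2 d_c(i,j)\, \bar{d}(i,j) + \bar{d}(i,j)^2 - \bar{d}(i,j)^2 + \frac{1}{k} \sum_{\ell} d_\ell(i,j)^2 \Bigr] \\
&= k \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} \Bigl[ \bigl( d_c(i,j) - \bar{d}(i,j) \bigr)^2 - g(T_1, \dots, T_k) \Bigr] \\
&= f(k) \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d_c(i,j) - \bar{d}(i,j) \bigr)^2 - h(T_1, \dots, T_k)
\end{aligned}
$$

Here $g(T_1, \dots, T_k) = \bar{d}(i,j)^2 - \frac{1}{k} \sum_{\ell} d_\ell(i,j)^2$ for each pair (i, j), $f(k) = k$, and $h(T_1, \dots, T_k) = k \sum_{i} \sum_{j} g(T_1, \dots, T_k)$. Neither g nor h depends on T_c, and f(k) is a positive constant. So optimizing $\sum_{\ell=1}^{k} \Delta(T_c, T_\ell)$ is the same as optimizing $\sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d_c(i,j) - \bar{d}(i,j) \bigr)^2$.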
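The identity can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, and random symmetric matrices stand in for tree distance matrices, which is acceptable here because the derivation is purely algebraic and never uses the tree structure. If the claim holds, the gap between the full objective and k times the distance to the average must be the same for every candidate.

```python
import random


def delta(da, db):
    """Sum over all ordered pairs (i, j) of squared distance differences."""
    n = len(da)
    return sum((da[i][j] - db[i][j]) ** 2 for i in range(n) for j in range(n))


def average_matrix(mats):
    """Entrywise mean of k matrices of equal size."""
    k, n = len(mats), len(mats[0])
    return [[sum(m[i][j] for m in mats) / k for j in range(n)] for i in range(n)]


def random_symmetric(n, rng):
    """Random symmetric matrix with zero diagonal and entries in [0, 1)."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.random()
    return m


rng = random.Random(0)
n, k = 5, 4
inputs = [random_symmetric(n, rng) for _ in range(k)]
d_bar = average_matrix(inputs)


def offset(candidate):
    """Full objective minus k times the distance to the average matrix."""
    full = sum(delta(candidate, d) for d in inputs)
    return full - k * delta(candidate, d_bar)


# The offset should not depend on the candidate, up to rounding error.
offsets = [offset(random_symmetric(n, rng)) for _ in range(3)]
print(offsets)
```

Because the offset is constant in the candidate, any candidate that minimizes the distance to the average matrix also minimizes the sum of distances to the individual inputs.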
Nonidentical Leaf Sets
So we just want the tree whose leaf distances are closest to the average distances $\bar{d}(i,j)$ in the least-squares sense. This can be done via a heuristic search using, e.g., PHYLIP.
What if there is missing data for $\bar{d}(i,j)$? We take the weighted average
$$
\bar{d}(i,j)^{*} = \frac{1}{N(i,j)} \sum_{\ell=1}^{k}
$$