SLIDE 1 POLYTOMY REFINEMENT FOR THE CORRECTION OF DUBIOUS DUPLICATIONS IN GENE TREES
Manuel Lafond1, Cedric Chauve2,3, Riccardo Dondi4, Nadia El-Mabrouk1
1 Université de Montréal, Canada 2 Université Bordeaux 1, France 3 Simon Fraser University, Canada 4 Università degli Studi di Bergamo, Italy
SLIDE 2 Introduction
- Gene tree for the SLC24a2 gene family (solute carrier 24)
G :
SLC24 Mouse SLC24 Rat SLC24 Microbat SLC24 Human SLC24 Chimp SLC24 Megabat SLC24 Squirrel
SLIDE 3 Introduction
- Species tree for the species having a gene in G.
S :
Mouse Rat Microbat Human Chimp Megabat Squirrel
SLIDE 4 Introduction
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
SLIDE 5 Introduction
- LCA MAPPING : associate each ancestral gene with the
species it belonged to
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
u v w x y z v z w z z z
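A minimal sketch of LCA mapping, assuming a hypothetical species tree topology (the slides do not give one; the internal node names `Murid`, `Rodent`, `Primate`, `Euarch`, `Bat` and `Root` are made up for illustration). Each internal gene-tree node is mapped to the lowest common ancestor of the species of its children.

```python
# Hypothetical species tree, stored as child -> parent.
# The topology and the internal node names are assumptions, not taken from the slides.
PARENT = {
    "Mous": "Murid", "Rat": "Murid",
    "Murid": "Rodent", "Sqrl": "Rodent",
    "Hum": "Primate", "Chmp": "Primate",
    "Rodent": "Euarch", "Primate": "Euarch",
    "MicBat": "Bat", "MegBat": "Bat",
    "Euarch": "Root", "Bat": "Root",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lca(u, v):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(path_to_root(u))
    for w in path_to_root(v):
        if w in seen:
            return w
    raise ValueError("nodes are not in the same tree")

def lca_mapping(gene_tree):
    """Map a gene tree to a species-tree node.

    A gene tree is either a species name (leaf) or a pair (left, right).
    """
    if isinstance(gene_tree, str):
        return gene_tree
    left, right = gene_tree
    return lca(lca_mapping(left), lca_mapping(right))
```

For instance, under this assumed topology, `lca_mapping((("Mous", "Rat"), "MicBat"))` maps to the root of S, because the mouse/rat ancestor and the microbat only meet there.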
SLIDE 6 Introduction
- G and S disagree => Duplication of an ancestral gene
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
u v w x y z v z w z z z
SLIDE 7 Introduction
- Extant species are expected to have 2 copies of the gene
- None of them do. That’s dubious !
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
u v w x y z v z w z z z
SLIDE 8 Introduction
- If some species were represented on both sides of the
duplication, it would be an Apparent Duplication (AD)
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl SLC Hum
u v w x y z v z w z z z
SLIDE 9 Introduction
- Non-apparent duplication (NAD) : the left and right
subtrees of the duplication share no gene from the same species.
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G : NAD
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
u v w x y z v z w z z z
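The three event types can be sketched as a small classifier. This is an illustration under assumptions: the species tree is a hypothetical topology written as a list of clades (`CLADES`), and `classify` is a made-up helper name. A node is a speciation when the LCA of all its species differs from the LCA of each child; otherwise it is a duplication, apparent if the two sides share a species and non-apparent if they do not.

```python
# Hypothetical species tree given as its internal clades (an assumption;
# the slides do not spell out the topology).
CLADES = [frozenset(c.split()) for c in (
    "Mous Rat",
    "Mous Rat Sqrl",
    "Hum Chmp",
    "Mous Rat Sqrl Hum Chmp",
    "MicBat MegBat",
    "Mous Rat Sqrl Hum Chmp MicBat MegBat",
)]

def lca(species):
    """Smallest clade of the species tree containing all the given species."""
    species = frozenset(species)
    if len(species) == 1:
        return species  # a leaf of S maps to itself
    return min((c for c in CLADES if species <= c), key=len)

def classify(left_species, right_species):
    """Label the node joining two subtrees as 'Spec', 'AD' or 'NAD'."""
    left, right = frozenset(left_species), frozenset(right_species)
    if left & right:
        return "AD"    # some species appears on both sides
    whole = lca(left | right)
    if whole != lca(left) and whole != lca(right):
        return "Spec"  # the two sides map to separate lineages
    return "NAD"       # duplication with no species in common

print(classify({"MicBat", "MegBat"}, {"Mous", "Hum"}))
print(classify({"Mous"}, {"Rat", "MicBat"}))
```

The second call is an NAD in this sketch: the right side already maps to the root of S, so joining the mouse gene to it cannot be a speciation, yet no species is shared.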
SLIDE 10 Introduction
- Missing gene copies must have been lost some time ago.
- NADs usually imply many losses.
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G : NAD
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
u v w x y z v z w z z z
SLIDE 11 Introduction
- NADs are called dubious, or ambiguous duplications in
the Ensembl database.
- About 44% of duplication nodes are dubious.
- The SLC24 gene tree has 32 duplication nodes, 24 of which are
dubious.
- Simulations showed that only 5% of duplications
were actually NADs (Chauve & El-Mabrouk, 2009).
SLIDE 12 Introduction
- Alternative scenario for the root of G : no duplication
occurred.
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G : NAD
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
SLIDE 13 Introduction
- Alternative scenario for the root of G : no duplication
occurred => speciation => the bat genes should be
separated from the others.
S :
Mous Rat MicBat Hum Chmp MegBat Sqrl
G : NAD
SLC MicBat/MegBat SLC Hum/Mo/Rat/Chmp/Sqrl
SLIDE 14 Introduction
- Break G as little as possible : send the maximal bat
subtrees left, and the maximal rodent/primate subtrees right
S :
Mo Rat MicBat Hum Chmp MegBat Sqrl
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
SLIDE 15 Introduction
- Break G as little as possible : send the maximal bat
subtrees left, and the maximal rodent/primate subtrees right
G :
SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl
G’ :
SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl SLC MicBat SLC MegBat
SLIDE 16 Introduction
- G’ ends up with possibly two unresolved polytomies.
- We are looking for a binary refinement of these
polytomies.
G’ :
SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl SLC MicBat SLC MegBat
SLIDE 17 Introduction
- Other sources of polytomies :
- Lack of phylogenetic signal in the sequences, causing some gene
tree construction algorithms to leave the gene tree partially unresolved.
- Contraction of gene tree branches having low support (e.g.
bootstrap values).
SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl
SLIDE 18 Previous works
- Find a binary refinement minimizing:
- Duplications + losses (Chang & Eulenstein, 2006, O(n^3));
- Duplications + losses (Lafond & Swenson & El-Mabrouk, 2012,
O(n))
- Duplications and then losses (Zheng, Wu, Zhang, 2012, O(n))
- Losses: It’s a linear problem.
- Our problem here:
Minimize NAD nodes
- For all these optimization criteria, polytomies can be
refined independently. Thus we reduce the problem to a single polytomy.
SLIDE 19 Introduction
- Given : a polytomy P and a species tree S
- Find : a binary refinement of P that minimizes the number
of NADs created.
SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl Rat Hum Chmp Sqrl Mous
P S
SLIDE 20 Introduction
- Given : a polytomy P and a species tree S
- Objective : find a binary refinement of P that minimizes
the number of NADs created.
SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl Rat Hum Chmp Sqrl Mous SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl
NAD P S
SLIDE 21 Introduction
- Given : a polytomy P and a species tree S
- Objective : find a binary refinement of P that minimizes
the number of NADs created.
SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl Rat Hum Chmp Sqrl Mous SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl
P S
SLIDE 22 A simple example
a c d e S b P a1 a2 b1 c1 c2 d1 e1
SLIDE 23 Reconnecting subtrees
a1 a2 b1 c1 c2 d1 e1 a c d e S b
SLIDE 31 Reconnecting subtrees
a1 a2 b1 c1 c2 d1 e1 a c d e S b
a1,c1 are connected by Speciation (S)
SLIDE 32 Reconnecting subtrees
a1 a2 b1 c1 c2 d1 e1 a c d e S b
SLIDE 33 Reconnecting subtrees
a1 a2 b1 c1 c2 d1 e1 a c d e S b
a1,(a2, b1) are connected by Apparent Duplication (AD)
SLIDE 34 Reconnecting subtrees
a1 a2 b1 c1 c2 d1 e1 a c d e S b
SLIDE 35 Reconnecting subtrees
a1 a2 b1 c1 c2 d1 e1 a c d e S b
a1,(a2, b1) are connected by Non-Apparent Duplication (NAD)
SLIDE 36 Relationship graph
a c d e S b a a b c c d e
Each subtree is a vertex. Each pair of vertices (x,y) is connected by an edge labeled by the connection type of x and y.
SLIDE 38 Relationship graph
a c d e S b a a b c c d e Spec AD NAD
Each subtree is a vertex. Each pair of vertices (x,y) is connected by an edge labeled by the connection type of x and y.
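As a rough sketch of how such a relationship graph could be built, the snippet below labels every pair of polytomy subtrees with its connection type. Everything concrete here is an assumption for illustration: the species tree topology `(((a,b),c),(d,e))`, the subtree names `A1`, `AB`, `C1`, `BD`, `E1` and their species sets, and the helper names.

```python
from itertools import combinations

# Hypothetical species tree (((a,b),c),(d,e)), given as its internal clades.
CLADES = [frozenset(c) for c in ("ab", "abc", "de", "abcde")]

def lca(species):
    """Smallest clade of S containing all the given species."""
    species = frozenset(species)
    if len(species) == 1:
        return species
    return min((c for c in CLADES if species <= c), key=len)

def connection_type(x, y):
    """Connection type of two subtrees, given their species sets."""
    x, y = frozenset(x), frozenset(y)
    if x & y:
        return "AD"
    whole = lca(x | y)
    return "Spec" if whole != lca(x) and whole != lca(y) else "NAD"

def relationship_graph(subtrees):
    """Return {(name1, name2): label} for every pair of subtrees."""
    return {
        (p, q): connection_type(subtrees[p], subtrees[q])
        for p, q in combinations(sorted(subtrees), 2)
    }

# Hypothetical polytomy: each child subtree is summarized by its species set.
SUBTREES = {"A1": "a", "AB": "ab", "C1": "c", "BD": "bd", "E1": "e"}
for pair, label in relationship_graph(SUBTREES).items():
    print(pair, label)
```

In this toy instance, `A1` and `AB` share species a (AD), `A1` and `C1` map to separate lineages (Spec), and `A1` and `BD` are an NAD because `BD` already maps to the root of S.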
SLIDE 39 Relationship graph
a c d e S b a a b c c d e Spec AD NAD
Speciation clique : a clique exclusively made up of “Spec” edges.
SLIDE 42 Theorem
Spec AD NAD
There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.
a a b c c d e
SLIDE 44 Theorem
Spec AD NAD
There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.
a a b c c d e a c e
SLIDE 45 Theorem
Spec AD NAD
There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.
a a b c c d e a c e a b
SLIDE 46 Theorem
Spec AD NAD
There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.
a a b c c d e a c e a b c d
SLIDE 47
Theorem
The minimum number of NADs over all binary refinements is d iff the minimum number of connected components of W + the AD edges, over all sets W of disjoint speciation cliques in the relationship graph, is d + 1.
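The quantity in the theorem can be computed for a given W with a union-find pass. The sketch below is only an illustration: `count_components` is a made-up name, and the vertices, AD edges and cliques in the example are hypothetical. It counts the connected components of W + the AD edges; by the theorem, the smallest such count over all valid W, minus one, is the minimum number of NADs.

```python
def count_components(vertices, ad_edges, cliques):
    """Number of connected components once AD edges and the cliques of W are added.

    `cliques` is assumed to be a list of disjoint speciation cliques;
    this function does not check that assumption.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(u, v):
        parent[find(u)] = find(v)

    for u, v in ad_edges:
        union(u, v)
    for clique in cliques:
        members = list(clique)
        for other in members[1:]:
            union(members[0], other)
    return len({find(v) for v in vertices})

# Hypothetical instance: three AD components {A1, AB, BD}, {C1}, {E1}.
V = ["A1", "AB", "C1", "BD", "E1"]
AD = [("A1", "AB"), ("AB", "BD")]
print(count_components(V, AD, []))                    # no clique chosen
print(count_components(V, AD, [{"A1", "C1", "E1"}]))  # one speciation clique
```

With the clique {A1, C1, E1} everything ends up in one component, which corresponds to the zero-NAD case of the theorem.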
SLIDE 48
Problem reformulation
Given a graph with Spec and AD edges, find a set of cliques W such that W + the AD edges has the minimum number of connected components.
SLIDE 50
Problem reformulation
Given a graph R with Spec and AD edges, find a set of cliques W such that R restricted to W + the AD edges has the minimum number of connected components.
In general, finding W is an NP-hard problem.
But R is not just any graph !
SLIDE 51 Characterization of the relationships
The relationship graph restricted to the Spec edges is {P4, 2K2}-free.
a a b c c d e
P4 2K2
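For small graphs, the {P4, 2K2}-free property can be checked by brute force over all 4-vertex subsets. The sketch below is an assumption-laden illustration (the function name and the example graph are hypothetical): an induced P4 is recognized by the degree pattern 1,1,2,2 and an induced 2K2 by the pattern 1,1,1,1 within the four chosen vertices.

```python
from itertools import combinations

def is_p4_2k2_free(vertices, edges):
    """Return True if the graph has no induced P4 and no induced 2K2."""
    edge_set = {frozenset(e) for e in edges}
    for quad in combinations(vertices, 4):
        # Degrees inside the subgraph induced by these four vertices.
        degree = dict.fromkeys(quad, 0)
        for u, v in combinations(quad, 2):
            if frozenset((u, v)) in edge_set:
                degree[u] += 1
                degree[v] += 1
        pattern = sorted(degree.values())
        # [1, 1, 2, 2] is a path on 4 vertices; [1, 1, 1, 1] is two disjoint edges.
        if pattern in ([1, 1, 2, 2], [1, 1, 1, 1]):
            return False
    return True

# Hypothetical Spec-edge graph on five subtrees.
V = ["A1", "AB", "C1", "BD", "E1"]
SPEC = [("A1", "C1"), ("A1", "E1"), ("AB", "C1"), ("AB", "E1"), ("C1", "E1")]
print(is_p4_2k2_free(V, SPEC))
```

This check takes O(n^4) time, so it is only meant as a sanity test on small relationship graphs.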
SLIDE 54
Problem reformulation
Given a graph R with Spec and AD edges, find a set of cliques W such that R restricted to W + the AD edges has the minimum number of connected components.
In general, finding W is an NP-hard problem.
But R is not just any graph ! R is {P4, 2K2}-free.
Complexity for this class of graphs : who knows?
SLIDE 55
Heuristic
Since our goal is to connect AD connected components using Spec edges, take Spec edges that “link” two AD-components until there is no possible choice left.
SLIDE 56 Heuristic
Since our goal is to connect AD connected components using Spec edges, take Spec edges that “link” two AD-components until there is no possible choice left.
NOT THIS ONE !
SLIDE 58 Heuristic
And since we are looking for cliques, remove edges that couldn’t form a clique with our chosen edge.
CAN’T BE CHOSEN !
SLIDE 59
Heuristic
And since we are looking for cliques, remove edges that couldn’t form a clique with our chosen edge. Update the graph, and repeat.
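The idea on these slides can be illustrated with a simplified greedy pass. This is not the heuristic from the talk, only a sketch under stated assumptions: Spec edges are scanned in sorted order, an edge is taken only if it links two different components, and it is accepted only if it starts a new clique or extends an existing clique by one vertex that is Spec-adjacent to every member. The function name and the example instance are hypothetical.

```python
def greedy_speciation_cliques(vertices, spec_edges, ad_edges):
    """Greedily pick disjoint speciation cliques that merge AD components.

    Returns (cliques, number_of_components_left).
    """
    spec = {frozenset(e) for e in spec_edges}
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in ad_edges:          # start from the AD components
        parent[find(u)] = find(v)

    clique_of = {}                 # vertex -> clique (a set shared by its members)
    for u, v in sorted(tuple(sorted(e)) for e in spec):
        if find(u) == find(v):
            continue               # edge does not link two components
        cu, cv = clique_of.get(u), clique_of.get(v)
        if cu is None and cv is None:
            clique = {u, v}        # start a new clique
        elif cv is None and all(frozenset((v, w)) in spec for w in cu):
            clique = cu
            clique.add(v)          # v is Spec-adjacent to the whole clique of u
        elif cu is None and all(frozenset((u, w)) in spec for w in cv):
            clique = cv
            clique.add(u)
        else:
            continue               # edge cannot be part of a valid clique
        clique_of[u] = clique_of[v] = clique
        parent[find(u)] = find(v)

    unique = {id(c): c for c in clique_of.values()}
    cliques = sorted(sorted(c) for c in unique.values())
    return cliques, len({find(v) for v in vertices})

# Hypothetical instance with AD components {A1, AB, BD}, {C1}, {E1}.
V = ["A1", "AB", "C1", "BD", "E1"]
SPEC = [("A1", "C1"), ("A1", "E1"), ("AB", "C1"), ("AB", "E1"), ("C1", "E1")]
AD = [("A1", "AB"), ("AB", "BD")]
print(greedy_speciation_cliques(V, SPEC, AD))
```

Rejecting an edge that cannot extend a clique plays the role of the "can't be chosen" edges above; unlike the heuristic described in the talk, this sketch makes no claim about approximation quality.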
SLIDE 60 Heuristic
- Bounds : using this idea, we developed a heuristic whose solution
is at most twice as bad as the best solution (in terms of AD components connected)
- If the graph has no Spec edge inside an AD-Component,
the heuristic is exact.
SLIDE 61
Random polytomy/species tree
We generated 1000 random polytomies having n subtrees for each n = 4..14 (along with a random species tree).
Heuristic vs Brute force
- The heuristic always found a refinement with the minimum number of NADs.
Minimizing NADs vs # of duplications + losses
- In 39.7% of random trees, finding a binary refinement that minimizes dups + losses does not minimize the number of NADs created.
SLIDE 62 Ensembl updates : what happens to NADs ?
Ensembl V70 => Assembly/sequences update => Ensembl V74
Events inferred at the root of NAD clades after Ensembl update (993 trees of fish genes, highest NAD only)
NAD fate (v70 => v74)          % of NADs
NAD => Speciation              63.4 % (630 trees)
NAD => NAD                     35.5 % (352 trees)
NAD => Apparent Duplication     1.1 % (11 trees)
SLIDE 63 Comparison with Ensembl updates
a b c d a b c d a b c d
Corrected Ensembl V74 Ensembl V70
SLIDE 64 RF-Distance between Our correction vs Ensembl Updated Tree
(Distance ratio)
- 65% of corrected trees share > 80% of their clades with the Ensembl updated tree.
- For those trees that Ensembl made NAD => Spec, 44% are identical to
SLIDE 65
Likelihood
We found 4454 NAD nodes in 1896 Ensembl fish gene trees.
For each tree T and each NAD node x :
- Tx is the tree obtained by correcting NAD node x
- R(x) = LogLH(T) / LogLH(Tx)
43.9% of NAD nodes yielded a better likelihood (R(x) > 1) after correction.
62.4% of the trees contained at least one NAD yielding a better likelihood after correction.
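To see why R(x) > 1 means a better likelihood, note that log-likelihoods are typically negative, so the ratio exceeds 1 exactly when LogLH(Tx) is closer to zero than LogLH(T). A tiny sketch with made-up numbers (the values below are hypothetical, not results from the study):

```python
def likelihood_ratio(loglh_original, loglh_corrected):
    """R(x) = LogLH(T) / LogLH(Tx), assuming both log-likelihoods are negative."""
    return loglh_original / loglh_corrected

# Hypothetical log-likelihoods: the corrected tree is closer to zero, hence better.
r = likelihood_ratio(-1250.0, -1200.0)
print(r > 1)
```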
SLIDE 66 Conclusion
- When does NAD correction/minimization apply ?
- Our heuristic builds a resolution that places duplications as
“high” as possible. We should consider exploring other (or all) solutions.
- Is the problem NP-Hard ? Is there a polynomial time
algorithm that solves it ?