
slide-1
SLIDE 1

POLYTOMY REFINEMENT FOR THE CORRECTION OF DUBIOUS DUPLICATIONS IN GENE TREES

Manuel Lafond1, Cedric Chauve2,3, Riccardo Dondi4, Nadia El-Mabrouk1

1 Université de Montréal, Canada 2 Université Bordeaux 1, France 3 Simon Fraser University, Canada 4 Università degli Studi di Bergamo, Italy

slide-2
SLIDE 2

Introduction

  • Gene tree for the SLC24a2 gene family (solute carrier 24)

G :

SLC24 Mouse SLC24 Rat SLC24 Microbat SLC24 Human SLC24 Chimp SLC24 Megabat SLC24 Squirrel

slide-3
SLIDE 3

Introduction

  • Species tree for the species having a gene in G.

S :

Mouse Rat Microbat Human Chimp Megabat Squirrel

slide-4
SLIDE 4

Introduction

  • G and S disagree

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl

G :

SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

slide-5
SLIDE 5

Introduction

  • LCA (lowest common ancestor) mapping: associate each ancestral gene with the species it belonged to

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl

G :

SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

u v w x y z v z w z z z
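A minimal sketch of the LCA mapping, under assumptions that are not taken from the slides: trees are nested Python tuples, gene leaves are named "<family>_<species>", and the topologies of S and G below are illustrative guesses rather than the trees drawn in the figures. Each node of G is sent to the smallest clade of S that contains every species found below it.

```python
def clades(tree):
    """All clades (leaf sets) of a nested-tuple tree; the tree's own clade comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    sub = [clades(child) for child in tree]
    top = frozenset().union(*[s[-1] for s in sub])
    return [c for s in sub for c in s] + [top]

def species_of(gene):
    # Assumed naming convention: "<family>_<species>".
    return gene.split("_")[-1]

def lca_mapping(G, S):
    """Map every node of gene tree G to a clade of species tree S."""
    s_clades = clades(S)
    mapping = {}
    def walk(node):
        if isinstance(node, str):
            species = frozenset([species_of(node)])
        else:
            species = frozenset().union(*[walk(child) for child in node])
        # Clades containing a given species set are nested, so the smallest is the LCA.
        mapping[node] = min((c for c in s_clades if species <= c), key=len)
        return species
    walk(G)
    return mapping

# Hypothetical topologies, for illustration only.
S = (("Microbat", "Megabat"), ((("Mouse", "Rat"), "Squirrel"), ("Human", "Chimp")))
G = (("SLC_Mouse", "SLC_Microbat"), ("SLC_Human", "SLC_Chimp"))
M = lca_mapping(G, S)
```

In this toy input the root of G and its left child map to the same node of S (the root), which is the kind of disagreement between G and S that signals a duplication.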

slide-6
SLIDE 6

Introduction

  • G and S disagree => Duplication of an ancestral gene

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G :

SLC Mous

u v w x y z v z w z z z

slide-7
SLIDE 7

Introduction

  • If a duplication occurred at the root, extant species are expected to have 2 copies of the gene.
  • None of them do. That’s dubious!

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G : u v w x y z v z w z z z

slide-8
SLIDE 8

Introduction

  • If some species were represented on both sides of the duplication, it would be an Apparent Duplication (AD).

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G : u v w x y z v z w z z z

SLC Hum

slide-9
SLIDE 9

Introduction

  • Non-apparent duplication (NAD): the left and right subtrees of the duplication share no gene from the same species.

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G : NAD u v w x y z v z w z z z
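The three node types introduced so far (speciation, AD, NAD) can be told apart mechanically. The sketch below assumes the same hypothetical conventions as before (nested-tuple trees, leaves named "<family>_<species>", guessed topologies) and is only an illustration of the definitions on the slides.

```python
def clades(tree):
    """All clades (leaf sets) of a nested-tuple tree; the tree's own clade comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    sub = [clades(child) for child in tree]
    top = frozenset().union(*[s[-1] for s in sub])
    return [c for s in sub for c in s] + [top]

def species_of(gene):
    # Assumed naming convention: "<family>_<species>".
    return gene.split("_")[-1]

def classify(G, S):
    """Label every internal node of a binary gene tree G as 'Spec', 'AD' or 'NAD'."""
    s_clades = clades(S)
    def lca(species):
        return min((c for c in s_clades if species <= c), key=len)
    labels = {}
    def walk(node):
        if isinstance(node, str):
            return frozenset([species_of(node)])
        left, right = [walk(child) for child in node]
        if left & right:
            labels[node] = "AD"    # some species appears on both sides
        elif lca(left | right) in (lca(left), lca(right)):
            labels[node] = "NAD"   # duplication by LCA mapping, no shared species
        else:
            labels[node] = "Spec"
        return left | right
    walk(G)
    return labels

# Hypothetical topologies, for illustration only.
S = (("Microbat", "Megabat"), ((("Mouse", "Rat"), "Squirrel"), ("Human", "Chimp")))
G = (("SLC_Mouse", "SLC_Microbat"), ("SLC_Human", "SLC_Chimp"))
G2 = (("SLC_Human", "SLC_Chimp"), "SLC2_Human")
labels = classify(G, S)
labels2 = classify(G2, S)
```

With these inputs the root of G is an NAD (its left child already maps to the root of S, yet no species is shared), while the root of G2 is an AD because Human appears on both sides.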

slide-10
SLIDE 10

Introduction

  • Missing gene copies must have been lost some time ago.
  • NADs usually imply many losses.

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G : NAD

SLC Mous

u v w x y z v z w z z z

slide-11
SLIDE 11

Introduction

  • NADs are called dubious, or ambiguous, duplications in the Ensembl database.
  • About 44% of duplication nodes are dubious.
  • The SLC24 gene tree has 32 duplication nodes, 24 of which are dubious.
  • Simulations showed that only 5% of duplications were actually NADs (Chauve & El-Mabrouk, 2009).

slide-12
SLIDE 12

Introduction

  • Alternative scenario for the root of G: no duplication occurred.

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G : NAD

SLC Mous

slide-13
SLIDE 13

Introduction

S :

Mous Rat MicBat Hum Chmp MegBat Sqrl SLC MicBat/ MegBat SLC Hum/Mo/ Rat/Chmp/ Sqrl

NAD G :

  • Alternative scenario for the root of G: no duplication occurred => speciation => the bat genes should be separated from the others.

slide-14
SLIDE 14

Introduction

  • Break G as little as possible: send the maximal bat subtrees left, and the maximal rodent/primate subtrees right.

S :

Mo Rat MicBat Hum Chmp MegBat Sqrl SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G :

slide-15
SLIDE 15

Introduction

SLC Mous SLC Rat SLC MicBat SLC Hum SLC Chmp SLC MegBat SLC Sqrl

G : G’ :

SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl SLC MicBat SLC MegBat

  • Break G as little as possible: send the maximal bat subtrees left, and the maximal rodent/primate subtrees right.

slide-16
SLIDE 16

Introduction

  • G’ ends up with possibly two unresolved polytomies.
  • We are looking for a binary refinement of these polytomies.

G’ :

SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl SLC MicBat SLC MegBat

slide-17
SLIDE 17

Introduction

  • Other sources of polytomies:
      • Lack of phylogenetic signal in the sequences, causing some gene tree construction algorithms to leave the gene tree partially unresolved.
      • Contraction of gene tree branches having low support (e.g. bootstrap values).

SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl

slide-18
SLIDE 18

Previous works

  • Find a binary refinement minimizing:
      • Duplications + losses (Chang & Eulenstein, 2006, O(n^3));
      • Duplications + losses (Lafond, Swenson & El-Mabrouk, 2012, O(n));
      • Duplications and then losses (Zheng, Wu & Zhang, 2012, O(n));
      • Losses only: solvable in linear time.
  • Our problem here: minimize NAD nodes.
  • For all these optimization criteria, polytomies can be refined independently. Thus we reduce the problem to a single polytomy.

slide-19
SLIDE 19

Introduction

  • Given: a polytomy P and a species tree S
  • Find: a binary refinement of P that minimizes the number of NADs created.

SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl Rat Hum Chmp Sqrl Mous

P S

slide-20
SLIDE 20

Introduction

  • Given : a polytomy P and a species tree S
  • Objective: find a binary refinement of P that minimizes the number of NADs created.

SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl Rat Hum Chmp Sqrl Mous SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl

NAD P S

slide-21
SLIDE 21

Introduction

  • Given : a polytomy P and a species tree S
  • Objective: find a binary refinement of P that minimizes the number of NADs created.

SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl Rat Hum Chmp Sqrl Mous SLC Mous SLC Rat SLC Hum SLC Chmp SLC Sqrl

P S

slide-22
SLIDE 22

A simple example

a c d e S b P a1 a2 b1 c1 c2 d1 e1

slide-23
SLIDE 23

Reconnecting subtrees

a1 a2 b1 c1 c2 d1 e1 a c d e S b

slide-31
SLIDE 31

Reconnecting subtrees

a1 a2 b1 c1 c2 d1 e1 a c d e S b

a1,c1 are connected by Speciation (S)

slide-33
SLIDE 33

Reconnecting subtrees

a1 a2 b1 c1 c2 d1 e1 a c d e S b

a1,(a2, b1) are connected by Apparent Duplication (AD)

slide-35
SLIDE 35

Reconnecting subtrees

a1 a2 b1 c1 c2 d1 e1 a c d e S b

a1,(a2, b1) are connected by Non-Apparent Duplication (NAD)

slide-36
SLIDE 36

Relationship graph

a c d e S b a a b c c d e

Each subtree is a vertex. Each pair of vertices (x,y) is connected by an edge labeled by the connection type of x and y.
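The construction of the relationship graph can be sketched as follows. Everything concrete here is an assumption made for illustration: the species tree topology, the vertex names (a1, ab, c1, cd, e1, where ab stands for the subtree (a2, b1) and cd for (c2, d1)), and the choice to describe each subtree only by its set of species. The connection type of two subtrees is the type of the node obtained by joining them under a common parent.

```python
from itertools import combinations

def clades(tree):
    """All clades (leaf sets) of a nested-tuple tree; the tree's own clade comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    sub = [clades(child) for child in tree]
    top = frozenset().union(*[s[-1] for s in sub])
    return [c for s in sub for c in s] + [top]

def connection(x, y, s_clades):
    """Type of the parent node created by joining subtrees with species sets x and y."""
    def lca(species):
        return min((c for c in s_clades if species <= c), key=len)
    if x & y:
        return "AD"
    if lca(x | y) in (lca(x), lca(y)):
        return "NAD"
    return "Spec"

def relationship_graph(subtrees, S):
    """Complete graph on the subtrees, each edge labeled Spec, AD or NAD."""
    s_clades = clades(S)
    return {(u, v): connection(subtrees[u], subtrees[v], s_clades)
            for u, v in combinations(subtrees, 2)}

# Hypothetical species tree and polytomy children (species sets only).
S = ((("a", "c"), "b"), ("d", "e"))
subtrees = {
    "a1": frozenset("a"),
    "ab": frozenset("ab"),   # the subtree (a2, b1)
    "c1": frozenset("c"),
    "cd": frozenset("cd"),   # the subtree (c2, d1)
    "e1": frozenset("e"),
}
R = relationship_graph(subtrees, S)
```

Under this assumed species tree, a1 and c1 are joined by a speciation, a1 and ab by an AD, and c1 and ab by an NAD.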

slide-38
SLIDE 38

Relationship graph

a c d e S b a a b c c d e Spec AD NAD

Each subtree is a vertex. Each pair of vertices (x,y) is connected by an edge labeled by the connection type of x and y.

slide-39
SLIDE 39

Relationship graph

a c d e S b a a b c c d e Spec AD NAD

Speciation clique : a clique exclusively made up of “Spec” edges.

slide-42
SLIDE 42

Theorem

Spec AD NAD

There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.

a a b c c d e

slide-44
SLIDE 44

Theorem

Spec AD NAD

There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.

a a b c c d e a c e

slide-45
SLIDE 45

Theorem

Spec AD NAD

There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.

a a b c c d e a c e a b

slide-46
SLIDE 46

Theorem

Spec AD NAD

There exists a binary refinement with zero NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges form a single connected component.

a a b c c d e a c e a b c d

slide-47
SLIDE 47

Theorem

There exists a binary refinement with a minimum of d NADs iff there exists a set of disjoint speciation cliques W in the relationship graph such that W + the AD edges have a minimum of d + 1 connected components.
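To make the counting in the theorem concrete, the sketch below takes one candidate set W of disjoint speciation cliques, adds the AD edges, and counts connected components with a small union-find; components minus one is then the number of NADs associated with that particular W. The graph is a hypothetical five-vertex example (vertex names and edge labels are assumptions, not read off the slides' figure).

```python
from itertools import combinations

def nads_for(vertices, spec_edges, ad_edges, W):
    """Number of components of (W + AD edges) minus one, for a given clique set W."""
    spec = {frozenset(e) for e in spec_edges}
    used = [v for clique in W for v in clique]
    assert len(used) == len(set(used)), "cliques in W must be disjoint"
    for clique in W:
        assert all(frozenset(p) in spec for p in combinations(clique, 2)), \
            "every pair inside a clique must be a Spec edge"

    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    edges = list(ad_edges) + [p for clique in W for p in combinations(clique, 2)]
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices}) - 1

# Hypothetical relationship graph (NAD edges are simply the remaining pairs).
V = ["a1", "ab", "c1", "cd", "e1"]
SPEC = [("a1", "c1"), ("a1", "e1"), ("c1", "e1"), ("ab", "e1")]
AD = [("a1", "ab"), ("c1", "cd")]

with_clique = nads_for(V, SPEC, AD, [("a1", "c1", "e1")])
without_clique = nads_for(V, SPEC, AD, [])
```

Choosing the clique {a1, c1, e1} links all AD-components into one, so no NAD is needed; with W empty, three components remain and two NADs are needed to join them.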

slide-48
SLIDE 48

Problem reformulation

Given a graph with Spec and AD edges, find a set of cliques W such that W + the AD edges has the minimum number of connected components.

slide-50
SLIDE 50

Problem reformulation

Given a graph R with Spec and AD edges, find a set of cliques W such that R restricted to W + the AD edges has the minimum number of connected components.

In general, finding W is an NP-Hard problem. But R is not just any graph!

slide-51
SLIDE 51

Characterization of the relationships

The relationship graph restricted to the Spec edges is {P4, 2K2}-free.

a a b c c d e

P4 2K2
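For small graphs, the {P4, 2K2}-free property can be checked by brute force over all 4-vertex subsets. The sketch below is a generic check, not code from the authors; P4 is an induced path on four vertices and 2K2 is a pair of edges with no edge between them. The Spec edge list is a hypothetical example.

```python
from itertools import combinations, permutations

def find_p4_or_2k2(vertices, edges):
    """Return ('P4', quad) or ('2K2', quad) for the first induced copy found, else None."""
    E = {frozenset(e) for e in edges}
    def adj(u, v):
        return frozenset((u, v)) in E
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            # Induced path a-b-c-d: the three path edges and no chord.
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not (adj(a, c) or adj(a, d) or adj(b, d))):
                return ("P4", (a, b, c, d))
            # Induced 2K2: edges a-b and c-d, nothing between the two pairs.
            if (adj(a, b) and adj(c, d)
                    and not (adj(a, c) or adj(a, d) or adj(b, c) or adj(b, d))):
                return ("2K2", (a, b, c, d))
    return None

# Hypothetical Spec subgraph of a relationship graph.
V = ["a1", "ab", "c1", "cd", "e1"]
SPEC = [("a1", "c1"), ("a1", "e1"), ("c1", "e1"), ("ab", "e1")]
spec_result = find_p4_or_2k2(V, SPEC)

# Two small graphs that do contain the forbidden patterns.
path_result = find_p4_or_2k2(["1", "2", "3", "4"], [("1", "2"), ("2", "3"), ("3", "4")])
pair_result = find_p4_or_2k2(["1", "2", "3", "4"], [("1", "2"), ("3", "4")])
```

The check is exponential only in the pattern size (four vertices), so it runs in O(n^4) time on n vertices, which is adequate for inspecting individual polytomies.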

slide-54
SLIDE 54

Problem reformulation

Given a graph R with Spec and AD edges, find a set of cliques W such that R restricted to W + the AD edges has the minimum number of connected components.

In general, finding W is an NP-Hard problem. But R is not just any graph! R restricted to the Spec edges is {P4, 2K2}-free. Complexity for this class of graphs: who knows?

slide-55
SLIDE 55

Heuristic

Since our goal is to connect AD connected components using Spec edges, take Spec edges that “link” two AD-components until there is no possible choice left.

slide-56
SLIDE 56

Heuristic

Since our goal is to connect AD connected components using Spec edges, take Spec edges that “link” two AD-components until there is no possible choice left.

NOT THIS ONE !

slide-58
SLIDE 58

Heuristic

And since we are looking for cliques, remove edges that couldn’t form a clique with our chosen edge.

CAN’T BE CHOSEN !

slide-59
SLIDE 59

Heuristic

And since we are looking for cliques, remove edges that couldn’t form a clique with our chosen edge. Update the graph, and repeat.
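The loop described on these slides can be approximated by the greedy sketch below. It is a simplified reading of the idea, not the authors' heuristic, and it makes no claim to the approximation bound stated in the talk: it repeatedly picks a Spec edge joining two different components, accepts it only if the two cliques it touches can merge into a larger speciation clique, and stops when no edge qualifies. The example graph is hypothetical.

```python
from itertools import combinations

def greedy_cliques(vertices, spec_edges, ad_edges):
    """Greedily grow disjoint speciation cliques that link AD-components."""
    spec = {frozenset(e) for e in spec_edges}
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v in ad_edges:                 # start from the AD-components
        parent[find(u)] = find(v)

    clique = {v: frozenset([v]) for v in vertices}
    progress = True
    while progress:
        progress = False
        for edge in sorted(spec, key=sorted):      # fixed order, for reproducibility
            u, v = sorted(edge)
            if find(u) == find(v):
                continue                            # edge links nothing new
            merged = clique[u] | clique[v]
            if all(frozenset(p) in spec for p in combinations(merged, 2)):
                for w in merged:
                    clique[w] = merged
                parent[find(u)] = find(v)
                progress = True
                break                               # "update the graph, and repeat"
    cliques = {c for c in clique.values() if len(c) > 1}
    n_components = len({find(v) for v in vertices})
    return cliques, n_components

# Hypothetical relationship graph.
V = ["a1", "ab", "c1", "cd", "e1"]
SPEC = [("a1", "c1"), ("a1", "e1"), ("c1", "e1"), ("ab", "e1")]
AD = [("a1", "ab"), ("c1", "cd")]
cliques, n_components = greedy_cliques(V, SPEC, AD)
```

On this input the sketch selects the single clique {a1, c1, e1} and ends with one connected component, which corresponds to a refinement without NADs.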

slide-60
SLIDE 60

Heuristic

  • Bounds: using this idea, we developed a heuristic whose result is at most twice as bad as the best solution (in terms of AD-components connected).
  • If the graph has no Spec edge inside an AD-component, the heuristic is exact.

slide-61
SLIDE 61

Random polytomy/species tree

We generated 1000 random polytomies having n subtrees for each n = 4..14 (along with a random species tree).

Heuristic vs brute force
The heuristic always found a refinement with the minimum number of NADs.

Minimizing NADs vs # of duplications + losses
In 39.7% of random trees, finding a binary refinement that minimizes dups + losses does not minimize the number of NADs created.
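The slides do not show the brute-force baseline. One way to obtain an exact answer on small instances, relying on the theorem that the minimum number of NADs equals the minimum number of connected components minus one, is to enumerate every partition of the vertices into speciation cliques (singletons allowed). The sketch below does that on a hypothetical graph; the number of partitions grows as the Bell numbers, so it is only practical for small n.

```python
from itertools import combinations

def partitions(items):
    """Yield every set partition of a list, as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def count_components(vertices, edges):
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices})

def min_nads(vertices, spec_edges, ad_edges):
    """Exhaustive minimum of (components - 1) over all disjoint speciation cliques."""
    spec = {frozenset(e) for e in spec_edges}
    best = len(vertices)
    for part in partitions(list(vertices)):
        pairs = [p for block in part for p in combinations(block, 2)]
        if all(frozenset(p) in spec for p in pairs):   # every block is a Spec clique
            best = min(best, count_components(vertices, pairs + list(ad_edges)))
    return best - 1

# Hypothetical relationship graph.
V = ["a1", "ab", "c1", "cd", "e1"]
SPEC = [("a1", "c1"), ("a1", "e1"), ("c1", "e1"), ("ab", "e1")]
AD = [("a1", "ab"), ("c1", "cd")]
exact = min_nads(V, SPEC, AD)
```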

slide-62
SLIDE 62

Ensembl updates : what happens to NADs ?

[Figure: a clade on species a, b, c, d in Ensembl V70 and in Ensembl V74, after an assembly/sequences update]

Events inferred at the root of NAD clades after Ensembl update (993 trees of fish genes, highest NAD only):

NAD fate (v70 => v74)           % of NADs
NAD => Speciation               63.4 % (630 trees)
NAD => NAD                      35.5 % (352 trees)
NAD => Apparent Duplication      1.1 % (11 trees)

slide-63
SLIDE 63

Comparison with Ensembl updates

[Figure: three trees on species a, b, c, d, labeled Ensembl V70, Corrected, and Ensembl V74]

slide-64
SLIDE 64

RF distance: our correction vs Ensembl updated tree

(Distance ratio) 65% of corrected trees share > 80% of their clades with the Ensembl updated tree.
For those trees in which Ensembl turned an NAD into a speciation (NAD => Spec), 44% are identical to our corrected tree.

slide-65
SLIDE 65

Likelihood

We found 4454 NAD nodes in 1896 Ensembl fish gene trees.

For each tree T and each NAD node x:
  • Tx is the tree obtained by correcting NAD node x
  • R(x) = LogLH(T) / LogLH(Tx)

43.9% of NAD nodes yielded a better likelihood (R(x) > 1) after correction.
62.4% of the trees contained at least one NAD yielding a better likelihood after correction.
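Why R(x) > 1 means an improvement is easy to miss: assuming the log-likelihoods are negative, as is usual, a corrected tree with a higher (less negative) log-likelihood makes the denominator smaller in absolute value, so the ratio exceeds 1. The numbers below are made up purely to illustrate the arithmetic.

```python
def likelihood_ratio(loglh_t, loglh_tx):
    """R(x) = LogLH(T) / LogLH(Tx), as defined on the slide."""
    return loglh_t / loglh_tx

# Hypothetical values: correction raises the log-likelihood from -1050 to -1000.
r_better = likelihood_ratio(-1050.0, -1000.0)   # 1.05, corrected tree is better
# Hypothetical values: correction lowers the log-likelihood from -1000 to -1050.
r_worse = likelihood_ratio(-1000.0, -1050.0)    # below 1, corrected tree is worse
```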

slide-66
SLIDE 66

Conclusion

  • When does NAD correction/minimization apply?
  • Our heuristic builds a resolution that places duplications as “high” as possible. We should consider exploring other (or all) solutions.
  • Is the problem NP-Hard? Is there a polynomial time algorithm that solves it?