Introduction Gene family Several similar genes that have evolved - - PowerPoint PPT Presentation



slide-1
SLIDE 1

AN OPTIMAL RECONCILIATION ALGORITHM FOR GENE TREES WITH POLYTOMIES

Manuel Lafond, Krister M. Swenson, Nadia El Mabrouk
DIRO, Université de Montréal

slide-2
SLIDE 2

Introduction

- Gene family: several similar genes that have evolved from a common ancestor
  - Usually identified by sequence similarity
- Dup-loss model: the evolutionary scenario is determined by three kinds of events
  - Speciation: a new species is created, with one copy of the gene existing in both species
  - Duplication: the gene is duplicated, giving the species at least two copies of it
  - Loss: the gene disappears from the family

slide-3
SLIDE 3

Gene family history

[Figure: a species tree (nodes a, b, c, d, e, f, g), a gene tree on a1, b1, b2, c1, d1, and a tree on a1, a2, b1, b2, c1, d1; legend: speciation, duplication, loss.]

slide-4
SLIDE 4

Reconciliation

- Given: a set of genes in the same family, a gene tree G and a species tree S
- Infer: the evolutionary events that have led to the observed gene tree

[Figure: the gene tree on a1, b1, b2, c1, d1, the species tree, and a tree on a1, a2, b1, b2, c1, d1.]

slide-5
SLIDE 5

Reconciliation

- A reconciliation is an "extension" of G that is consistent with S, i.e. reflects the same phylogeny

[Figure: gene tree (a1, b1, b2, c1, d1), species tree (nodes a, b, c, d, e, f, g) and reconciliation tree (a1, b1, b2, c1, d1, a2) with internal nodes labeled e, f, g.]

slide-6
SLIDE 6

Reconciliation

- Parsimony criterion: minimum number of duplications + losses (mutation cost)

[Figure: gene tree (a1, b1, b2, c1, d1), species tree (nodes a, b, c, d, e, f, g) and reconciliation tree (a1, b1, b2, c1, d1, a2) with internal nodes labeled e, f, g.]

slide-7
SLIDE 7

LCA Mapping

- Many possible reconciliation trees
- LCA mapping (Bonizzoni et al., 2003)
  - Map each node of G to the lowest common ancestor, in S, of the species of its leaves
  - Minimizes the duplication + loss cost in linear time
  - The label of a node x is the LCA mapping of x

[Figure: gene tree (a1, b1, b2, c1, d1) with internal nodes labeled by their LCA mapping, next to the species tree; one node is marked as a duplication.]
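As a minimal sketch of the LCA mapping, the code below labels every internal node of a binary gene tree with the lowest common ancestor of its children's labels. The species tree e = (a, b), f = (c, d), g = (e, f) is read from the later slides; the gene tree topology, the nested-tuple encoding and all function names are assumptions made for illustration. The naive ancestor walk used here is not the linear-time method cited on the slide, and the duplication test (a node mapped to the same species as one of its children) is the usual rule rather than something stated in the slides.

```python
# Species tree from the slides: e = (a, b), f = (c, d), g = (e, f) is the root.
SPECIES_PARENT = {"a": "e", "b": "e", "c": "f", "d": "f",
                  "e": "g", "f": "g", "g": None}

# Hypothetical binary gene tree (nested tuples); gene "b2" belongs to species "b".
GENE_TREE = (("a1", "b1"), ("b2", ("c1", "d1")))


def ancestors(s):
    """Path from species s up to the root, s included."""
    path = []
    while s is not None:
        path.append(s)
        s = SPECIES_PARENT[s]
    return path


def lca(s, t):
    """Lowest common ancestor of two species (naive walk up the tree)."""
    seen = set(ancestors(s))
    for u in ancestors(t):
        if u in seen:
            return u
    raise ValueError("species are not in the same tree")


def lca_mapping(node, labels):
    """Return the label of node; append (label, is_duplication) in post-order."""
    if isinstance(node, str):
        return node[0]  # leaf: first character of the gene name is its species
    left, right = node
    l = lca_mapping(left, labels)
    r = lca_mapping(right, labels)
    here = lca(l, r)
    # Usual rule: a node mapped to the same species as a child is a duplication.
    labels.append((here, here in (l, r)))
    return here


labels = []
root_label = lca_mapping(GENE_TREE, labels)
for label, is_dup in labels:
    print(label, "duplication" if is_dup else "speciation")
```

On this hypothetical tree only the root is flagged as a duplication, because it and its right child are both mapped to g.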

slide-8
SLIDE 8

Motivation

- Most known methods work with binary gene trees
- In case of uncertainty, a gene tree can be non-binary (weak edges)
- Non-binary nodes are called polytomies
- Reconciliation trees are binary

[Figure: species tree S and a non-binary gene tree G.]

slide-9
SLIDE 9

Polytomies

- Each polytomy can be solved independently (Chang & Eulenstein, 2006)
- Cubic time algorithm for each polytomy

[Figure: species tree S, gene tree G and the polytomy G1.]

slide-10
SLIDE 10

Polytomies

- Each polytomy can be solved independently (Chang & Eulenstein, 2006)

[Figure: species tree S, gene tree G and the polytomy G2.]

slide-11
SLIDE 11

Polytomies

- Each polytomy can be solved independently (Chang & Eulenstein, 2006)

[Figure: species tree S, gene tree G and the polytomy G3.]

slide-12
SLIDE 12

Polytomies

- Each polytomy can be solved independently (Chang & Eulenstein, 2006)

[Figure: species tree S, gene tree G and the polytomy G3, continued.]

slide-13
SLIDE 13

The core problem

- Find the minimum-cost reconciliation between a species tree and a polytomy

[Figure: species tree S and polytomy G.]

slide-14
SLIDE 14

Resolution

- A resolution of G is a reconciliation between S and a binary refinement of G.

[Figure: species tree S and polytomy G.]

slide-15
SLIDE 15

Resolution

- B(G) is a binary refinement of G

[Figure: species tree S and binary refinement B(G).]

slide-16
SLIDE 16

Resolution

- R(B(G)) is a reconciliation between S and B(G)

[Figure: species tree S and reconciliation R(B(G)).]

slide-17
SLIDE 17

Problem statement

- Given: a binary species tree S and a polytomy G
- Find: a minimum mutation cost resolution of G

[Figure: species tree S and polytomy G.]

slide-18
SLIDE 18

Partial resolution at node s

- A tree obtained from G in which every subtree rooted at a node labeled s is consistent with the species tree.
- Every descendant of s is part of one of these subtrees.

[Figure: species tree S, polytomy G and a partial resolution G'.]

slide-19
SLIDE 19

Partial resolution cost

- The mutation cost of a partial resolution is the sum of the costs of all of its subtrees

[Figure: species tree S, polytomy G and a partial resolution G'.]

slide-20
SLIDE 20

k-partial resolution at node s

- A partial resolution with exactly k maximal subtrees rooted at s.

[Figure: species tree S, polytomy G and a partial resolution G'.]

slide-21
SLIDE 21

k-partial resolution at node s

- A partial resolution with exactly k maximal subtrees rooted at s.

[Figure: species tree S, polytomy G and a partial resolution G'.]

slide-22
SLIDE 22

Methodology

- Idea: an optimal resolution contains a minimum k-partial resolution at s, for every node s in V(S)

[Figure: species tree S and polytomy G.]

slide-23
SLIDE 23

Methodology

- R(B(G)) has a 1-partial resolution at e
- It also has a 2-partial resolution at e
- For which values of k does the optimal resolution contain a k-partial resolution?

[Figure: species tree S and the resolution R(B(G)).]

slide-24
SLIDE 24

Methodology

- M(s, k) denotes the minimum cost of a k-partial resolution at s
- M(root(S), 1) is the minimum cost of the full resolution of G
  - The solution is a 1-partial resolution at root(S)

[Figure: R(B(G)), a 1-partial resolution at g = root(S).]

slide-25
SLIDE 25

Computation of M(s, k)

- We compute the values of M(s, k) for each node s in V(S) in a bottom-up manner, and for every k.

k         1  2  3  4  5  6
M(a, k)
M(b, k)
M(c, k)
M(d, k)
M(e, k)
M(f, k)
M(g, k)

[Figure: species tree S and polytomy G.]

slide-26
SLIDE 26

Computation of M(s, k)

- M(a, 4) = 0

k         1  2  3  4  5  6
M(a, k)            0
M(b, k)
M(c, k)
M(d, k)
M(e, k)
M(f, k)
M(g, k)

[Figure: species tree S and polytomy G.]

slide-27
SLIDE 27

Computation of M(s, k)

- M(a, 5) = 1 (one loss in a)

k         1  2  3  4  5  6
M(a, k)            0  1
M(b, k)
M(c, k)
M(d, k)
M(e, k)
M(f, k)
M(g, k)

[Figure: species tree S, polytomy G and a partial resolution G'.]

slide-28
SLIDE 28

Computation of M(s, k)

- M(a, 3) = 1 (one duplication in a)

k         1  2  3  4  5  6
M(a, k)         1  0  1
M(b, k)
M(c, k)
M(d, k)
M(e, k)
M(f, k)
M(g, k)

[Figure: species tree S, polytomy G and a partial resolution G'.]

slide-29
SLIDE 29

Computation of M(s, k)

- Let nb(s) denote the number of leaves of G labeled s
  - For instance, nb(a) = 4, nb(b) = 2, …
- In general, if s is a leaf, then M(s, k) = |k - nb(s)|

[Figure: polytomy G with leaves a, a, a, a, b, b, c.]
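The leaf formula can be checked with a short sketch. The leaf multiset of G (four a, two b, one c, no d) is read from the slides; the names `leaf_row` and `G_LEAVES` and the cut-off at k = 6 are illustrative assumptions.

```python
from collections import Counter

# Leaves of the polytomy G as drawn on the slides: a, a, a, a, b, b, c.
G_LEAVES = ["a", "a", "a", "a", "b", "b", "c"]
K_MAX = 6  # arbitrary cut-off, as in the slides' table


def leaf_row(nb_s, k_max=K_MAX):
    """Row M(s, k) for a leaf species s, for k = 1..k_max."""
    return [abs(k - nb_s) for k in range(1, k_max + 1)]


nb = Counter(G_LEAVES)  # nb["d"] is 0 because d has no gene in G
rows = {s: leaf_row(nb[s]) for s in "abcd"}
for s, row in rows.items():
    print(f"M({s}, k)", row)
```

Each row is zero exactly at k = nb(s), except for d, whose zero would sit at column 0, outside the table.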

slide-30
SLIDE 30

Computation of M(s, k)

- The leaf values are easy to compute
  - M(s, k) = |k - nb(s)|

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)
M(f, k)
M(g, k)

[Figure: species tree S and polytomy G.]

slide-31
SLIDE 31

Computation of M(s, k)

- Computing M(e, k)

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)

[Figure: species tree S and polytomy G.]

slide-32
SLIDE 32

Computation of M(s, k)

- Either
  - M(e, 2) = M(a, 2) + M(b, 2) (from above: indicates a speciation)
  - M(e, 2) = M(e, 1) + 1 (from the left: indicates a loss)
  - M(e, 2) = M(e, 3) + 1 (from the right: indicates a duplication)

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   x  y  z

Here y = M(e, 2) is reached from x = M(e, 1) with +1 loss, from z = M(e, 3) with +1 dup, or from above as the sum M(a, 2) + M(b, 2).
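The three options listed for M(e, 2) can be written for a general internal node. This is a restatement under the assumption that s has children s1 and s2 and that the same three moves apply at every k; it is not a formula quoted from the slides.

```latex
M(s, k) = \min
\begin{cases}
M(s_1, k) + M(s_2, k) & \text{(from above: speciation)} \\
M(s, k - 1) + 1       & \text{(from the left: one loss)} \\
M(s, k + 1) + 1       & \text{(from the right: one duplication)}
\end{cases}
```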

slide-33
SLIDE 33

Computation of M(s, k)

- Temporarily let M(s, k) = M(s1, k) + M(s2, k) for every k

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   4  2  2  2  4  6

slide-34
SLIDE 34

Computation of M(s, k)

- Keep the minimum values only
  - If there is more than one, they will be grouped together

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)      2  2  2

slide-35
SLIDE 35

Computation of M(s, k)

- Extend the minimums, adding one for each cell traversed

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4

(+1 for each of the cells k = 1, 5 and 6 reached from the minimums.)
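The three steps for an internal row (sum the children, keep the minimums, extend by one per cell) can be sketched as follows. The function name `internal_row` is an assumption, and the rows are truncated at k = 6 as in the slides.

```python
def internal_row(row1, row2):
    """Row of an internal node from the rows of its two children."""
    # Step 1: temporary values, the column-wise sum of the children.
    total = [x + y for x, y in zip(row1, row2)]
    # Step 2: keep only the columns holding the minimum value.
    best = min(total)
    minima = [i for i, v in enumerate(total) if v == best]
    # Step 3: extend from the minimums, adding one per cell traversed.
    return [best + min(abs(i - j) for j in minima) for i in range(len(total))]


M_a = [3, 2, 1, 0, 1, 2]  # leaf row for a, nb(a) = 4
M_b = [1, 0, 1, 2, 3, 4]  # leaf row for b, nb(b) = 2
M_e = internal_row(M_a, M_b)
print(M_e)
```

With these inputs the temporary sum is 4 2 2 2 4 6 and the final row is 3 2 2 2 3 4, as on the slides.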

slide-36
SLIDE 36

Computation of M(s, k)

- The whole table can be filled this way

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

[Figure: species tree S and polytomy G.]

slide-37
SLIDE 37

Computation of M(s, k)

- The minimum cost of a resolution of G is M(g, 1) = 4

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

[Figure: species tree S and polytomy G.]
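A minimal sketch of the bottom-up fill, under the assumptions that the species tree is e = (a, b), f = (c, d), g = (e, f), that nb is read from the slides' example, and that the table is cut at k = 6. `SPECIES_CHILDREN`, `NB` and `fill_table` are illustrative names.

```python
SPECIES_CHILDREN = {"e": ("a", "b"), "f": ("c", "d"), "g": ("e", "f")}
NB = {"a": 4, "b": 2, "c": 1, "d": 0}  # number of leaves of G per species
K_MAX = 6                              # arbitrary cut-off, as in the slides


def fill_table(root):
    """Compute every row M(s, k), k = 1..K_MAX, in post-order."""
    table = {}

    def visit(s):
        if s not in SPECIES_CHILDREN:  # leaf: M(s, k) = |k - nb(s)|
            table[s] = [abs(k - NB[s]) for k in range(1, K_MAX + 1)]
            return
        left, right = SPECIES_CHILDREN[s]
        visit(left)
        visit(right)
        total = [x + y for x, y in zip(table[left], table[right])]
        best = min(total)
        minima = [i for i, v in enumerate(total) if v == best]
        table[s] = [best + min(abs(i - j) for j in minima)
                    for i in range(K_MAX)]

    visit(root)
    return table


table = fill_table("g")
for s in "abcdefg":
    print(f"M({s}, k)", table[s])
print("minimum cost:", table["g"][0])  # M(g, 1)
```

This version takes time proportional to |S| times the cut-off (and more in the naive extension step); it only reproduces the table and does not aim at the O(|S|) bound.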

slide-38
SLIDE 38

Building the resolution

- Using the table, we'll find the number of duplications and losses for each node s of S.

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-39
SLIDE 39

Building the resolution

- Backtrack to find where the value of M(g, 1) came from

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-40
SLIDE 40

Building the resolution

- Backtrack to find where the value of M(g, 1) came from
  - M(g, 1) = M(e, 1) + M(f, 1)

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-41
SLIDE 41

Building the resolution

- Backtrack to find where the value of M(g, 1) came from
  - M(f, 1) = M(c, 1) + M(d, 1)

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-42
SLIDE 42

Building the resolution

- Backtrack to find where the value of M(g, 1) came from
  - M(e, 1) = M(e, 2) + 1
  - One duplication in e!

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-43
SLIDE 43

Building the resolution

- Backtrack to find where the value of M(g, 1) came from
  - M(e, 2) = M(a, 2) + M(b, 2)

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-44
SLIDE 44

Building the resolution

- For leaves, go to the cell with value zero
  - Two duplications in a!

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-45
SLIDE 45

Building the resolution

- For leaves, go to the cell with value zero
  - If there is no zero, assume it is at column 0
  - One loss in d

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

slide-46
SLIDE 46

Building the resolution

- This gives:
  - 1 duplication in e
  - 1 loss in d
  - 2 duplications in a

[Figure: species tree S, polytomy G and the resulting resolution R.]
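The backtracking walk can be sketched as below. It assumes the same species tree and nb values as the slides' example and a table cut at k = 6; the rule "move to the nearest column where the children's sum is minimal" is one reading of the slides, and every name here is illustrative.

```python
SPECIES_CHILDREN = {"e": ("a", "b"), "f": ("c", "d"), "g": ("e", "f")}
NB = {"a": 4, "b": 2, "c": 1, "d": 0}
K_MAX = 6


def build_rows(s, rows, sums):
    """Fill rows[s] (final values) and sums[s] (children's column-wise sum)."""
    if s not in SPECIES_CHILDREN:
        rows[s] = [abs(k - NB[s]) for k in range(1, K_MAX + 1)]
        return
    left, right = SPECIES_CHILDREN[s]
    build_rows(left, rows, sums)
    build_rows(right, rows, sums)
    sums[s] = [x + y for x, y in zip(rows[left], rows[right])]
    best = min(sums[s])
    minima = [i for i, v in enumerate(sums[s]) if v == best]
    rows[s] = [best + min(abs(i - j) for j in minima) for i in range(K_MAX)]


def backtrack(s, k, sums, dups, losses):
    """Record how many duplications and losses happen in each species."""
    if s in SPECIES_CHILDREN:
        best = min(sums[s])
        minima = [i + 1 for i, v in enumerate(sums[s]) if v == best]
        j = min(minima, key=lambda m: abs(m - k))  # nearest minimum column
    else:
        j = NB[s]  # the zero cell; column 0 when the species has no gene
    if j > k:
        dups[s] = j - k    # moving right in the row: duplications
    elif j < k:
        losses[s] = k - j  # moving left in the row: losses
    for child in SPECIES_CHILDREN.get(s, ()):
        backtrack(child, j, sums, dups, losses)


rows, sums, dups, losses = {}, {}, {}, {}
build_rows("g", rows, sums)
backtrack("g", 1, sums, dups, losses)
print("duplications:", dups)
print("losses:", losses)
```

The events found add up to M(g, 1) = 4, which gives a quick consistency check on the walk.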

slide-47
SLIDE 47

Computing the table

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

- Problem: we stopped at k = 6, but this value was arbitrary
- How do we know when to stop?

slide-48
SLIDE 48

Computing the table

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

- Computing this table takes O(|S| * k_max) steps

slide-49
SLIDE 49

Computing the table

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

- The values of a row follow a pattern

[Figure: the pattern followed by the row M(a, k).]

slide-50
SLIDE 50

Computing the table

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

- The values of a row follow a pattern

[Figure: the pattern followed by the row M(b, k).]

slide-51
SLIDE 51

Computing the table

k         1  2  3  4  5  6
M(a, k)   3  2  1  0  1  2
M(b, k)   1  0  1  2  3  4
M(c, k)   0  1  2  3  4  5
M(d, k)   1  2  3  4  5  6
M(e, k)   3  2  2  2  3  4
M(f, k)   1  2  3  4  5  6
M(g, k)   4  4  5  6  7  8

- The values of a row follow a pattern

[Figure: the pattern followed by the row M(e, k).]

slide-52
SLIDE 52

Computing the table

- The values of a row follow a pattern
- If we know m1, m2 and γ, we can find the value of M(s, k) for any k in constant time
- m1, m2 are called breakpoints, and γ the minimum value

[Figure: a row with its breakpoints m1, m2 and minimum value γ marked.]
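A sketch of the constant-time lookup, assuming the pattern seen in the example rows: the row decreases by one per column down to γ at m1, stays at γ between m1 and m2, then increases by one per column. The function name `row_value` is an assumption.

```python
def row_value(m1, m2, gamma, k):
    """M(s, k) from the breakpoints m1 <= m2 and the minimum value gamma."""
    if k < m1:
        return gamma + (m1 - k)  # left of the flat part
    if k > m2:
        return gamma + (k - m2)  # right of the flat part
    return gamma                 # on the flat part


# Row of leaf a in the example: nb(a) = 4, so m1 = m2 = 4 and gamma = 0.
row_a = [row_value(4, 4, 0, k) for k in range(1, 7)]
# Row of e in the example: minimum value 2 on the columns 2..4.
row_e = [row_value(2, 4, 2, k) for k in range(1, 7)]
print(row_a)
print(row_e)
```

Because only three numbers are stored per row, no cut-off on k has to be chosen in advance.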

slide-53
SLIDE 53

Computing the table

- Finding m1, m2, γ
  - Easy for leaf nodes: M(a, k) = |k - nb(a)|, so m1 = m2 = nb(a) and γ = 0

slide-54
SLIDE 54

Computing the table

- For an internal node s with children a, b
  - The breakpoints and minimum value of M(s, k) can be computed in constant time if we know the breakpoints and minimum value of M(a, k) and M(b, k)

[Figure: the rows M(a, k) and M(b, k), with breakpoints a1, a2, b1, b2 and minimum values γa, γb, and the sum M(a, k) + M(b, k), with breakpoints m1, m2 and minimum value γ.]
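One way the constant-time combination could work is sketched below. The rule is derived here from the row shape (slope -1, flat, slope +1) and checked only against the slides' example; it is an assumption, not the formula given by the authors. If the flat parts of the two children overlap, the new flat part is their intersection; otherwise it is the gap between them and γ grows by the width of that gap. Column 0 is allowed as a breakpoint, as suggested for leaf d.

```python
def leaf_shape(nb_s):
    """(m1, m2, gamma) for a leaf species with nb_s genes in G."""
    return (nb_s, nb_s, 0)


def combine(a, b):
    """(m1, m2, gamma) of a parent from the shapes of its two children."""
    a1, a2, gamma_a = a
    b1, b2, gamma_b = b
    lo, hi = max(a1, b1), min(a2, b2)
    if lo <= hi:
        # Flat parts overlap: both children are minimal on [lo, hi].
        return (lo, hi, gamma_a + gamma_b)
    # Flat parts are disjoint: the sum is constant on the gap [hi, lo],
    # where one child rises by one while the other falls by one.
    return (hi, lo, gamma_a + gamma_b + (lo - hi))


# Example from the slides: nb(a) = 4, nb(b) = 2, nb(c) = 1, nb(d) = 0.
e = combine(leaf_shape(4), leaf_shape(2))
f = combine(leaf_shape(1), leaf_shape(0))
g = combine(e, f)
print(e, f, g)
```

For the example this yields a minimum value of 4 at the root with the flat part starting at column 1, in agreement with M(g, 1) = 4.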

slide-55
SLIDE 55

Conclusion

- Computing one row takes constant time, and there are |S| rows, so the "table" can be computed in O(|S|) steps
- Finding the number of duplications and losses for each node can be done in O(|S|) steps
- Building the resolution can be done in O(|S|) steps as well

slide-56
SLIDE 56

Conclusion

- One polytomy can be solved in O(|S|) steps
- A complete gene tree can have up to |G| polytomies, so a complete resolution can be obtained in O(|G||S|) steps
- In the worst case, a resolution has O(|G||S|) nodes
- Therefore, this algorithm is optimal
  - It runs in as many steps as the maximum size of the output