GENE TREE CORRECTION GUIDED BY ORTHOLOGY


slide-1
SLIDE 1

GENE TREE CORRECTION GUIDED BY ORTHOLOGY

Manuel Lafond (1), Magali Semeria (2), Krister M. Swenson (1,4), Eric Tannier (2,3) and Nadia El-Mabrouk (1)

(1) Université de Montréal, (2) Laboratoire de Biométrie et Biologie Évolutive, (3) INRIA Grenoble Rhône-Alpes, (4) McGill Center for Bioinformatics

slide-2
SLIDE 2

Introduction

  • Gene trees reflect the evolutionary history of a family of homologous genes
  • Ancestral genes may have undergone duplication or speciation

[Figure: gene tree G with leaves ZNF800Z1, ZNF800Z2, ZNF800S, ZNF800M and ZNF800T; internal nodes are marked as duplications or speciations]

G : Gene tree of the Ensembl zinc finger protein 800 gene, for the species

  • Zebrafish
  • Stickleback
  • Medaka
  • Tetraodon
slide-3
SLIDE 3

Introduction

[Figure: gene tree G with leaves ZNF800Z1, ZNF800Z2, ZNF800S, ZNF800M and ZNF800T; internal nodes are marked as duplications or speciations]

  • Pairwise relationships between extant genes
  • Orthologs : LCA is a speciation (e.g. ZNF800Z2, ZNF800T)
  • Paralogs : LCA is a duplication (e.g. ZNF800Z1, ZNF800T)

(LCA = Lowest Common Ancestor)
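
The two definitions can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding, the function names and the tree topology are assumptions (the topology is a hypothetical one chosen to agree with the two examples quoted on the slide, not a transcription of the figure).

```python
def leaves(node):
    """Leaf names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def pairwise_relations(node, out=None):
    """Map each pair of extant genes to 'orthologs' or 'paralogs'.

    An internal node is (event, left, right) with event 'spec' or 'dup'.
    The LCA of x and y is the node that has them under different children.
    """
    out = {} if out is None else out
    if isinstance(node, str):
        return out
    event, left, right = node
    kind = "orthologs" if event == "spec" else "paralogs"
    for x in leaves(left):
        for y in leaves(right):
            out[frozenset((x, y))] = kind
    pairwise_relations(left, out)
    pairwise_relations(right, out)
    return out


# Hypothetical labeled topology (assumption, see lead-in).
G = ("dup", "Z1", ("spec", "Z2", ("spec", "S1", ("spec", "M1", "T1"))))
relations = pairwise_relations(G)
print(relations[frozenset(("Z2", "T1"))])  # orthologs
print(relations[frozenset(("Z1", "T1"))])  # paralogs
```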

slide-4
SLIDE 4

Introduction

  • Each gene tree has an associated species tree
  • Each extant gene g is mapped to an extant species by a function s(g)

[Figure: G, the gene tree for the zinc finger protein 800, with leaves ZNF800Z1, ZNF800Z2, ZNF800S, ZNF800M and ZNF800T, next to the species tree S with leaves Z, T, S and M]

slide-5
SLIDE 5

Introduction

  • Each gene tree has an associated species tree
  • Each extant gene g is mapped to an extant species by a function s(g)
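  • We use this mapping to simplify notation : each gene is named after its species

[Figure: species tree S with leaves Z, T, S and M, and the gene tree G for the zinc finger protein 800 with leaves relabeled Z1, Z2, S1, M1 and T1]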

slide-6
SLIDE 6

Introduction

  • Each gene tree has an associated species tree
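  • s(g) for ancestral genes : we use LCA mapping, where each ancestral gene is mapped to the LCA, in S, of the species its descendants are mapped to

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and gene tree G (leaves Z1, Z2, S1, M1, T1; internal nodes β1, γ1, β2, γ2, each named after the species node it maps to)]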

slide-7
SLIDE 7

Introduction

  • Reconciliation infers speciation/duplication events
  • If g has the same mapping as one of its children, infer a duplication (otherwise, infer a speciation)

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and gene tree G (leaves Z1, Z2, S1, M1, T1; internal nodes β1, γ1, β2, γ2)]
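
The LCA mapping and the duplication rule can be written as one bottom-up pass over the gene tree. The sketch below is a minimal illustration, assuming a species tree stored as a child-to-parent dictionary and a gene tree stored as nested pairs; the species names, gene names and helper names are hypothetical and are not the fish example from the slides.

```python
def species_lca(sp_parent, u, v):
    """Lowest common ancestor of species u and v (tree given as child -> parent)."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = sp_parent.get(u)
    while v not in ancestors:
        v = sp_parent[v]
    return v


def reconcile(gene_tree, species_of, sp_parent):
    """Return (mapping, event): s(g) for every node, and the event at internal nodes."""
    mapping, event = {}, {}

    def visit(node):
        if isinstance(node, str):          # extant gene
            mapping[node] = species_of[node]
            return
        left, right = node
        visit(left)
        visit(right)
        mapping[node] = species_lca(sp_parent, mapping[left], mapping[right])
        # Same mapping as one of the children means duplication.
        same_as_child = mapping[node] in (mapping[left], mapping[right])
        event[node] = "duplication" if same_as_child else "speciation"

    visit(gene_tree)
    return mapping, event


# Hypothetical species tree ((a, b), (c, d)) and gene family.
sp_parent = {"a": "ab", "b": "ab", "c": "cd", "d": "cd", "ab": "root", "cd": "root"}
species_of = {"a1": "a", "a2": "a", "b1": "b", "c1": "c"}
G = ((("a1", "b1"), "a2"), "c1")
mapping, event = reconcile(G, species_of, sp_parent)
print(event[("a1", "b1")])          # speciation
print(event[(("a1", "b1"), "a2")])  # duplication
print(event[G])                     # speciation
```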

slide-9
SLIDE 9

Introduction

  • Orthology and paralogy are inferred given the gene tree.
  • But instead, can we infer (or correct) parts of the gene tree, given orthology/paralogy relationships?

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and gene tree G (leaves Z1, Z2, S1, M1, T1; internal nodes β1, γ1, β2, γ2)]

slide-10
SLIDE 10

Introduction

  • CASE 1 : Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
  • Correct the gene tree making the minimum number of “moves”

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and gene tree G (leaves Z1, Z2, S1, M1, T1; internal nodes β1, γ1, β2, γ2), with one node marked as an untrusted duplication]

slide-11
SLIDE 11

Introduction

  • CASE 1 : Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
  • Correct the gene tree making the minimum number of “moves”

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and the corrected gene tree G (leaves Z1, Z2, T1, M1, S1; internal nodes β1, γ1, α1, γ2)]

slide-12
SLIDE 12

Introduction

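  • CASE 2 : Suppose we KNOW Z1 and T1 are orthologous
  • Correct the gene tree making the minimum number of “moves”

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and gene tree G (leaves Z1, Z2, M1, T1, S1; internal nodes β1, γ1, α1, γ2), with one node marked as an untrusted duplication]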

slide-13
SLIDE 13

Introduction

  • CASE 2 : Suppose we KNOW Z1 and T1 are orthologous
  • Correct the gene tree making the minimum number of “moves”

[Figure: species tree S (leaves Z, T, S, M; internal nodes α, β, γ) and the corrected gene tree G (leaves Z1, Z2, M1, T1, S1)]

slide-14
SLIDE 14

Two correction problems

  • Case 1 and 2 give us speciation (orthology) constraints
  • Given G containing untrusted duplications, find a gene tree G’ that satisfies the given constraints AND modifies G as little as possible
  • e.g. minimize the Robinson-Foulds distance

[Figure: G with leaves Z1, Z2, S1, M1, T1 and G’ with leaves Z1, Z2, M1, T1, S1]

slide-15
SLIDE 15

RF distance

  • In the case of rooted binary trees T1, T2 with the same leaves :
  • RFDist(T1, T2) is simply two times the number of clades in T1, but not in T2

[Figure: T1 and T2 on leaves g1, g2, g3, g4, g5. T1 has root r and clades x = {g1, g2, g3}, y = {g2, g3} and z = {g4, g5}; the clades of T1 are checked against T2 one at a time, giving RFDist(T1, T2) = 2]
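
The clade-counting definition translates directly into code. In this sketch, trees are nested tuples and a clade is the set of leaves below an internal node. T1 follows the clades listed on the slide; the topology of T2 is an assumption (the transcript only gives its leaf order), chosen so that the result agrees with the value shown on the slide.

```python
def clades(tree):
    """Set of clades (as frozensets of leaf names), one per internal node."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(visit(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Number of clades found in exactly one of the two trees."""
    c1, c2 = clades(t1), clades(t2)
    return len(c1 - c2) + len(c2 - c1)


T1 = (("g1", ("g2", "g3")), ("g4", "g5"))   # clades x, y, z and root r
T2 = ((("g1", "g3"), "g2"), ("g4", "g5"))   # assumed topology
print(rf_distance(T1, T2))                   # 2
```

For two rooted binary trees on the same leaves, both trees have the same number of internal nodes, so the symmetric count equals twice the number of clades of T1 missing from T2.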

slide-20
SLIDE 20

Detecting untrustworthy duplications

  • Some duplications are labeled “dubious” or given low confidence values by Ensembl
  • We can use synteny to infer orthology/paralogy relationships [1]
  • Software inferring ancestral adjacencies might pick up erroneous duplications
  • Using DeCo, one can identify bad duplications when more than two adjacencies are inferred on an ancestral gene [2]

[1] Lafond, Swenson, El-Mabrouk, “Error detection and correction of gene trees”, MAGE (2013)
[2] Chauve, El-Mabrouk, Guéguen, Semeria, Tannier, “Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later”, MAGE (2013)

slide-21
SLIDE 21

Detecting untrustworthy duplications

  • Suppose genes a1, b1 from genomes a and b are in syntenic blocks (they are in a conserved region of homologous genes)
  • In this example, a conserved region involving 5 gene families
  • Look at the gene trees of each involved family

[Figure: genome a with genes a--, a-, a1, a+, a++ aligned with genome b with genes b--, b-, b1, b+, b++]

slide-23
SLIDE 23

Detecting untrustworthy duplications

  • If all the other homologous genes in the regions are orthologous, we expect a1 and b1 to also be orthologous
  • If not, some unlikely event occurred

[Figure: genome a with genes a--, a-, a1, a+, a++ aligned with genome b with genes b--, b-, b1, b+, b++]
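
As a rough illustration of this test, the check below flags a gene pair whose tree says "paralogs" while every flanking pair in the syntenic block is orthologous. The function name and the string labels are hypothetical, and how the relations are read off the gene trees is left out.

```python
def is_suspicious(center_relation, flanking_relations):
    """True when the central pair is paralogous although all flanking pairs are orthologous."""
    flanking_relations = list(flanking_relations)
    return (
        center_relation == "paralogs"
        and len(flanking_relations) > 0
        and all(r == "orthologs" for r in flanking_relations)
    )


# Relations of (a--, b--), (a-, b-), (a+, b+), (a++, b++) in their own gene trees.
flanks = ["orthologs", "orthologs", "orthologs", "orthologs"]
print(is_suspicious("paralogs", flanks))   # True
print(is_suspicious("orthologs", flanks))  # False
```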

slide-25
SLIDE 25

Detecting untrustworthy duplications

[Figure: genome a with genes a--, a-, a1, a+, a++ and genome b with genes b--, b-, b1, b+, b++, together with the ancestral gene ab and its duplicate abcopy]

  • What’s wrong with this?
  • If only the ancestral gene ab duplicated, the copy typically went somewhere else on the ancestral genome
  • And somehow, it ended up in a region similar to the original gene… mostly by chance.

slide-26
SLIDE 26

Detecting untrustworthy duplications

[Figure: genome a with genes a--, a-, a1, a+, a++ aligned with genome b with genes b--, b-, b1, b+, b++]

  • We looked at ~6000 Ensembl gene trees
  • The trees for the Zebrafish, Medaka, Tetraodon and Stickleback species
  • 22% (~1200) of these trees contained this type of bad duplication

slide-27
SLIDE 27

Problem 1

  • Given : a gene tree G, a species tree S, and a set C of clades that are required to be speciations
  • Find : A corrected gene tree G’ in which all clades in C are preserved, are speciations, and such that RFDist(G, G’) is minimized (as many clades as possible are preserved)

[Figure: G with leaves a1, c2, c1, b1, d2, d1 and G’ with leaves a1, b1, c1, d2, d1, c2; preserved clades are shown in green]

slide-28
SLIDE 28

Problem 1

  • A solution doesn’t always exist
  • In this example, if C = {x, y}, we cannot correct both x and y into speciations
  • A solution exists iff there are no two nodes x, y in C such that x is an ancestor of y and s(x) = s(y)
  • We will assume there exists a solution

[Figure: G with leaves a1, c1, d1, b1 and internal nodes x and y, with C = {x, y} and s(x) = s(y); S with leaves a, b, c, d]
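
The existence condition is easy to check by walking up from each node of C. The sketch assumes the gene tree is given as a child-to-parent dictionary and that s is already known for the nodes of C; the node name t and the exact topology are hypothetical.

```python
def has_solution(parent, s, C):
    """False iff some x in C is a proper ancestor of some y in C with s(x) == s(y)."""
    for y in C:
        x = parent.get(y)
        while x is not None:
            if x in C and s[x] == s[y]:
                return False
            x = parent.get(x)
    return True


# Hypothetical gene tree: x = (y, b1), y = (t, d1), t = (a1, c1).
parent = {"a1": "t", "c1": "t", "t": "y", "d1": "y", "y": "x", "b1": "x"}
s = {"x": "root", "y": "root"}
print(has_solution(parent, s, {"x", "y"}))  # False
print(has_solution(parent, s, {"x"}))       # True
```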

slide-29
SLIDE 29

Problem 1

  • To transform x into a speciation
  • Let L and R be the two children of s(x)

[Figure: S (leaves a, b, c, d; s(x) has children L and R) and G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x)]

slide-30
SLIDE 30

Problem 1

  • Find GL (resp. GR), the set of maximal subtrees of G that contain only genes mapped to species in L (resp. R)

[Figure: S (leaves a, b, c, d; s(x) has children L and R) and G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x)]

slide-31
SLIDE 31

Problem 1

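  • Form G* by making two polytomies (non-binary subtrees) with GL and GR, joined under a common parent

[Figure: S (leaves a, b, c, d; s(x) has children L and R), G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x) and G*, with polytomy L1 over a1, b1, b2 and polytomy R1 over c1, c3, c2, d2, d1]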
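
The construction of G* can be sketched as follows. Trees are nested tuples, a gene's species is read from the first letter of its name, and the topology of the subtree rooted at x is an assumption, since the transcript only preserves the leaf names of the figure. All function names are hypothetical.

```python
def leaves(node):
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def maximal_subtrees(node, species_of, allowed):
    """Maximal subtrees below `node` whose genes all map to species in `allowed`."""
    if all(species_of[g] in allowed for g in leaves(node)):
        return [node]
    if isinstance(node, str):
        return []
    return [t for child in node for t in maximal_subtrees(child, species_of, allowed)]


def build_g_star(x, species_of, left_species, right_species):
    """Two polytomies (tuples of any arity) joined under a common parent."""
    GL = maximal_subtrees(x, species_of, left_species)
    GR = maximal_subtrees(x, species_of, right_species)
    return (tuple(GL), tuple(GR))


# Hypothetical topology for the subtree rooted at x.
x = ((("a1", ("c3", "c2")), "b1"), (("d2", "d1"), ("c1", "b2")))
species_of = {g: g[0] for g in leaves(x)}
g_star = build_g_star(x, species_of, {"a", "b"}, {"c", "d"})
print(g_star)
# (('a1', 'b1', 'b2'), (('c3', 'c2'), ('d2', 'd1'), 'c1'))
```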

slide-32
SLIDE 32

Problem 1

  • Theorem : any binary resolution of G* is a solution to Problem 1.
  • In fact, every solution is the result of a binary resolution of G*.

[Figure: S (leaves a, b, c, d; s(x) has children L and R), G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x) and G*, with polytomy L1 over a1, b1, b2 and polytomy R1 over c1, c3, c2, d2, d1]

slide-35
SLIDE 35

Problem 1

  • Triplet maximizing solution :
  • For leaves x, y, z, a triplet ((x, y), z) is in G if LCA(x, y, z) is above LCA(x, y).
  • e.g. ((c2, c3), c1) is a triplet

[Figure: S (leaves a, b, c, d; s(x) has children L and R) and G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x)]
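
The triplet condition is equivalent to saying that some clade of G contains x and y but not z. A minimal sketch, with trees as nested tuples and a hypothetical topology for G in which c2 and c3 are siblings, as the slide's example suggests:

```python
def leaf_set(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaf_set(child) for child in node))


def has_triplet(tree, x, y, z):
    """True if ((x, y), z) is in the tree: LCA(x, y) lies strictly below LCA(x, y, z)."""
    if isinstance(tree, str):
        return False
    below = leaf_set(tree)
    if x in below and y in below and z not in below:
        return True
    return any(has_triplet(child, x, y, z) for child in tree)


# Hypothetical topology on the slide's leaf names.
G = ((("a1", ("c3", "c2")), "b1"), (("d2", "d1"), ("c1", "b2")))
print(has_triplet(G, "c2", "c3", "c1"))  # True
print(has_triplet(G, "c1", "c2", "c3"))  # False
```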

slide-36
SLIDE 36

Problem 1

  • Triplet maximizing solution :
  • Make GL (resp. GR) by taking the maximum induced tree of G containing only leaves mapped to species in L (resp. R).

[Figure: S (leaves a, b, c, d; s(x) has children L and R) and G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x), with GL highlighted in green and GR highlighted in red]

slide-38
SLIDE 38

Problem 1

  • Triplet maximizing solution :
  • Join GL and GR
  • This minimizes the RF distance and the triplet distance!

[Figure: S (leaves a, b, c, d; s(x) has children L and R), G (leaves a1, c3, c2, b1, d2, d1, c1, b2; node x) and the resulting tree obtained by joining GL and GR]
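
Taking an induced tree means deleting the unwanted leaves and contracting any node left with a single child. The sketch below restricts a hypothetical subtree rooted at x to each side of s(x) and joins the two results under a new root. The topology, the first-letter species convention and the function names are assumptions made for illustration.

```python
def restrict(node, keep):
    """Induced subtree on the leaves for which keep(leaf) is true, or None if empty."""
    if isinstance(node, str):
        return node if keep(node) else None
    kids = [r for r in (restrict(child, keep) for child in node) if r is not None]
    if not kids:
        return None
    # A node left with one child is contracted away.
    return kids[0] if len(kids) == 1 else tuple(kids)


def triplet_maximizing(x, species_of, left_species, right_species):
    GL = restrict(x, lambda g: species_of[g] in left_species)
    GR = restrict(x, lambda g: species_of[g] in right_species)
    return (GL, GR)   # new root: genes of L on one side, genes of R on the other


# Hypothetical topology for the subtree rooted at x.
x = ((("a1", ("c3", "c2")), "b1"), (("d2", "d1"), ("c1", "b2")))
species_of = {g: g[0] for g in ["a1", "c3", "c2", "b1", "d2", "d1", "c1", "b2"]}
corrected = triplet_maximizing(x, species_of, {"a", "b"}, {"c", "d"})
print(corrected)
# ((('a1', 'b1'), 'b2'), (('c3', 'c2'), (('d2', 'd1'), 'c1')))
```

Unlike the polytomy construction, the restriction keeps the relative branching order of the genes on each side, which is why the triplets inside GL and GR are untouched.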

slide-39
SLIDE 39

Problem 2

  • Problem 2 : given a reconciled gene tree G, a species tree S, and a set P of pairs of genes that are required to be orthologous
  • Find : A corrected gene tree G’ in which all gene pairs in P are orthologous, such that RFDist(G, G’) is minimized

[Figure: G and G’, both with leaves a1, a2, b1, c1, d1, and P = {(a1, b1)}; preserved clades are shown in green]

slide-40
SLIDE 40

Problem 2 : simple example

  • a1, d2 should not be paralogs
  • Which clades in {u, v, w, x, y, z} can we preserve?

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d) and P = {(a1, d2)}]

slide-41
SLIDE 41

Problem 2 : simple example

  • Can we preserve the u clade?
  • No! Wherever d2 ends up, by reconciliation LCA(a1, d2) will be a duplication (because of d1, d2)

[Figure: S (leaves a, b, c, d), the u clade with leaves a1, d1, b1, c1, and the gene d2; P = {(a1, d2)}]

slide-43
SLIDE 43

Problem 2 : simple example

  • Can we preserve the w clade?
  • Sure! Here’s how!

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and a corrected tree G’ on the same leaves in which the w clade is preserved]

slide-45
SLIDE 45

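Problem 2 : simple example

  • What about the x clade?
  • Just send a1 near d2!

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and a corrected tree G’ on the same leaves in which the x clade is preserved]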

slide-46
SLIDE 46

Problem 2 : simple example

  • For some constraint (a, b) in P :
  • Let g = LCA(a, b)
  • Let h(a, b) be the highest node on the path from a to g such that s(h(a, b)) is a descendant of s(g).
  • Every node on the path from h(a, b) to g (excluding h(a, b) and g) corresponds to an unpreservable clade.
  • Define h(b, a) analogously

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2) with g = LCA(a1, d2), s(g) = s(u), and the nodes h(a1, d2) and h(d2, a1) marked; S (leaves a, b, c, d); P = {(a1, d2)}]

  • For every constraint (a, b) in P, compute h(a, b) and h(b, a) and find the unpreservable clades they imply
  • This identifies all unpreservable clades
  • Can be done in time O(|P| |V(G)|)
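
The procedure can be sketched as a walk up from each constrained gene. The sketch reads "descendant of s(g)" as "proper descendant", which is an interpretation, and it uses a hypothetical gene tree on the slide's leaf and node names (z = (u, v), u = (x, w), v = (b2, y), x = (a1, b1), w = (d1, c1), y = (d2, c2)) with species tree ((a, b), (c, d)). The data layout and function names are assumptions.

```python
def path_to_root(parent, node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def gene_lca(parent, a, b):
    ancestors_of_a = set(path_to_root(parent, a))
    return next(n for n in path_to_root(parent, b) if n in ancestors_of_a)


def strictly_below(sp_parent, u, v):
    """True if species u is a proper descendant of species v."""
    while u in sp_parent:
        u = sp_parent[u]
        if u == v:
            return True
    return False


def unpreservable_clades(parent, s, sp_parent, P):
    bad = set()
    for a, b in P:
        g = gene_lca(parent, a, b)
        for leaf in (a, b):
            node = leaf
            # Climb while the next node still maps strictly below s(g);
            # the node where this stops plays the role of h.
            while parent[node] != g and strictly_below(sp_parent, s[parent[node]], s[g]):
                node = parent[node]
            # Every node strictly between h and g is unpreservable.
            node = parent[node]
            while node != g:
                bad.add(node)
                node = parent[node]
    return bad


sp_parent = {"a": "ab", "b": "ab", "c": "cd", "d": "cd", "ab": "root", "cd": "root"}
parent = {"a1": "x", "b1": "x", "d1": "w", "c1": "w", "d2": "y", "c2": "y",
          "x": "u", "w": "u", "b2": "v", "y": "v", "u": "z", "v": "z"}
s = {"a1": "a", "b1": "b", "b2": "b", "c1": "c", "c2": "c", "d1": "d", "d2": "d",
     "x": "ab", "w": "cd", "y": "cd", "u": "root", "v": "root", "z": "root"}
bad = unpreservable_clades(parent, s, sp_parent, [("a1", "d2")])
print(sorted(bad))  # ['u', 'v']
```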

slide-48
SLIDE 48

Problem 2 : simple example

  • We can identify preservable nodes rather easily (in O(|P||V(G)|) time).
  • But can we preserve them all at once?

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d) and P = {(a1, d2)}]

slide-49
SLIDE 49

Problem 2 : simple example

  • It turns out we can!
  • Highest preservable descendant : a preservable node whose only preservable ancestor is the root
  • The set of highest preservable descendants in G is {w, x, b2, y} (the leaves are always preservable)
  • This set partitions the leaves of G

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d) and P = {(a1, d2)}]
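
Once the unpreservable nodes are known, the highest preservable descendants are found by descending from the root and stopping at the first preservable node on each branch. The gene tree below is a hypothetical one on the slide's node names, and the set of unpreservable nodes {u, v} is assumed as input; both are chosen so that the output agrees with the set stated on the slide.

```python
def highest_preservable(children, root, bad):
    """Preservable nodes whose only preservable ancestor is the root."""
    found = []

    def descend(node):
        for child in children.get(node, []):
            if child in bad:
                descend(child)        # keep looking below an unpreservable node
            else:
                found.append(child)   # first preservable node on this branch

    descend(root)
    return found


def leaves_below(children, node):
    kids = children.get(node, [])
    if not kids:
        return [node]
    return [leaf for kid in kids for leaf in leaves_below(children, kid)]


# Hypothetical gene tree rooted at z; u and v are assumed unpreservable.
children = {"z": ["u", "v"], "u": ["x", "w"], "v": ["b2", "y"],
            "x": ["a1", "b1"], "w": ["d1", "c1"], "y": ["d2", "c2"]}
top = highest_preservable(children, "z", bad={"u", "v"})
print(top)  # ['x', 'w', 'b2', 'y']

# The subtrees rooted at these nodes partition the leaves of G.
covered = [leaf for node in top for leaf in leaves_below(children, node)]
print(sorted(covered) == sorted(leaves_below(children, "z")))  # True
```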

slide-50
SLIDE 50

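Problem 2 : simple example

  • Extract all the highest preservable subtrees

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and the extracted subtrees, which together cover the leaves a1, b1, d1, c1, b2, d2, c2]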

slide-51
SLIDE 51

Problem 2 : simple example

  • Extract all the highest preservable subtrees
  • Join the subtrees in the order given by a bottom-up traversal of S
  • i.e. prioritize creating a new root r such that s(r) is the lowest in S

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and the extracted subtrees being joined step by step]
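
One possible reading of this joining rule is a greedy loop: at each step, join the two subtrees whose new root would map to the deepest node of S. The sketch below implements that reading only; it is not necessarily the authors' exact procedure, and the subtrees, the species tree and all names are hypothetical.

```python
from itertools import combinations


def depth(sp_parent, v):
    d = 0
    while v in sp_parent:
        v = sp_parent[v]
        d += 1
    return d


def species_lca(sp_parent, u, v):
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = sp_parent.get(u)
    while v not in ancestors:
        v = sp_parent[v]
    return v


def greedy_join(subtrees, sp_parent):
    """subtrees: list of (tree, s(root)) pairs. Returns the joined (tree, s(root))."""
    forest = list(subtrees)
    while len(forest) > 1:
        def new_root_depth(pair):
            i, j = pair
            return depth(sp_parent, species_lca(sp_parent, forest[i][1], forest[j][1]))

        # Pick the pair whose new root maps lowest (deepest) in S.
        i, j = max(combinations(range(len(forest)), 2), key=new_root_depth)
        (ti, si), (tj, sj) = forest[i], forest[j]
        joined = ((ti, tj), species_lca(sp_parent, si, sj))
        forest = [f for k, f in enumerate(forest) if k not in (i, j)] + [joined]
    return forest[0]


sp_parent = {"a": "ab", "b": "ab", "c": "cd", "d": "cd", "ab": "root", "cd": "root"}
subtrees = [(("a1", "b1"), "ab"), (("d1", "c1"), "cd"), ("b2", "b"), (("d2", "c2"), "cd")]
tree, root_species = greedy_join(subtrees, sp_parent)
print(tree)
# ((('a1', 'b1'), 'b2'), (('d1', 'c1'), ('d2', 'c2')))
```

In this run the last join puts all genes of a and b on one side and all genes of c and d on the other, so the new root is a speciation and a1, d2 come out as orthologs.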

slide-56
SLIDE 56

Problem 2 : simple example

  • Theorem : We have formed every possible orthology relationship with the given subtrees.
  • Corollary : our required orthologs are orthologs

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and the tree obtained by joining the extracted subtrees]

slide-57
SLIDE 57

Problem 2 : simple example

  • We’re done. Now a1, d2 are orthologs, and we saved every clade we could.

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and the tree obtained by joining the extracted subtrees]

slide-58
SLIDE 58

Problem 2 : simple example

  • If there are still bad duplications, then they are in the highest preservable subtrees.
  • Recursively repeat the procedure for every one of them, until we get to the leaves.

[Figure: G (leaves a1, d1, b1, c1, b2, d2, c2; internal nodes u, v, w, x, y, z), S (leaves a, b, c, d), P = {(a1, d2)}, and the tree obtained by joining the extracted subtrees]

slide-59
SLIDE 59

Some results

  • Using synteny to find required orthologs, we corrected 1000 Ensembl gene trees with the Problem 2 algorithms (with our four favorite fish species)
  • Then used the AU test to verify the plausibility of our corrected gene trees
  • 82.3% of our trees were statistically viable
  • 17.7% of our trees were rejected
  • 14.8% of the original Ensembl trees were rejected

slide-60
SLIDE 60

Open avenues

  • Can we find required ortholog/paralog gene relationships without the gene tree?
  • The more we have, the more precise the gene tree will be.
  • Problem : given a set of required orthologs AND required paralogs, are they compatible?
  • Does there exist a gene tree that satisfies the given constraints?

slide-61
SLIDE 61

Open avenues

  • Is the RF distance the best?
  • Other distances : NNI, SPR, …
  • Can we incorporate orthology/paralogy constraints into the gene tree building procedure, instead of correcting it a posteriori?