gene tree correction

  1. 1 GENE TREE CORRECTION GUIDED BY ORTHOLOGY
     Manuel Lafond (1), Magali Semeria (2), Krister M. Swenson (1,4), Eric Tannier (2,3) and Nadia El-Mabrouk (1)
     (1) Université de Montréal; (2) Laboratoire de Biométrie et Biologie Évolutive; (3) INRIA Grenoble Rhône-Alpes; (4) McGill Center for Bioinformatics

  2. 2 Introduction
     • Gene trees reflect the evolutionary history of a family of homologous genes
     • Ancestral genes may have undergone duplication or speciation
     [Figure: gene tree G of the Ensembl zinc finger protein 800 (ZNF800) gene for the species Zebrafish, Stickleback, Medaka and Tetraodon, with leaves ZNF800_Z1, ZNF800_M, ZNF800_Z2, ZNF800_S, ZNF800_T; internal nodes are marked as duplications or speciations.]

  3. 3 Introduction (LCA = Lowest Common Ancestor)
     • Pairwise relationships between extant genes:
     • Orthologs: the LCA is a speciation (e.g. ZNF800_Z2 and ZNF800_T)
     • Paralogs: the LCA is a duplication (e.g. ZNF800_Z1 and ZNF800_T)
     [Figure: the ZNF800 gene tree G (leaves ZNF800_Z1, ZNF800_M, ZNF800_Z2, ZNF800_S, ZNF800_T) with duplication and speciation nodes marked.]

  4. 4 Introduction
     • Each gene tree has an associated species tree
     • Each extant gene g is mapped to an extant species by a function s(g)
     [Figure: gene tree G for the zinc finger protein 800 gene (leaves ZNF800_Z1, ZNF800_Z2, ZNF800_S, ZNF800_M, ZNF800_T) next to the species tree S (leaves M, T, S, Z).]

  5. 5 Introduction
     • Each gene tree has an associated species tree
     • Each extant gene g is mapped to an extant species by a function s(g)
     • We use this mapping to simplify the notation: each gene is named after its species (T1, M1, Z1, Z2, S1)
     [Figure: gene tree G for the zinc finger protein 800 gene (leaves T1, M1, Z1, Z2, S1) next to the species tree S (leaves M, T, S, Z).]

  6. 6 Introduction
     • Each gene tree has an associated species tree
     • For ancestral genes, s(g) is defined by the LCA mapping: each ancestral gene is mapped to the LCA, in S, of the species of its descendant genes
     [Figure: gene tree G for the zinc finger protein 800 gene (leaves T1, M1, Z1, Z2, S1) and species tree S (leaves M, T, S, Z), with internal nodes labeled by Greek letters (α, β, γ and subscripted variants).]

  7. 7 Introduction
     • Reconciliation infers speciation/duplication events
     • If g has the same mapping as one of its children, infer a duplication (otherwise, infer a speciation)
     [Figure: gene tree G for the zinc finger protein 800 gene (leaves T1, M1, Z1, Z2, S1) and species tree S (leaves M, T, S, Z), with labeled internal nodes.]
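
The LCA mapping and the reconciliation rule from the slides above can be sketched in a few lines of Python. This is only an illustrative sketch: the slides' exact tree topologies are not recoverable from the transcript, so `SPECIES_TREE` and `GENE_TREE` below are assumed topologies, chosen so that Z2/T1 come out as orthologs and Z1/T1 as paralogs, as stated on slide 3. The function names (`lca`, `s`, `event`, `relation`) are hypothetical helpers, not part of any tool named in the talk.

```python
# Trees are nested 2-tuples; leaves are strings.
# ASSUMPTION: hypothetical species tree topology (not taken from the slides).
SPECIES_TREE = (("medaka", ("stickleback", "tetraodon")), "zebrafish")
# ASSUMPTION: hypothetical gene tree; genes are named after their species.
GENE_TREE = (("Z1", "M1"), ("Z2", ("S1", "T1")))
GENE_TO_SPECIES = {
    "Z1": "zebrafish", "Z2": "zebrafish", "M1": "medaka",
    "S1": "stickleback", "T1": "tetraodon",
}


def leaves(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def lca(species_tree, species_set):
    """Lowest node of species_tree whose leaves include species_set."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if species_set <= leaves(child):
                node = child  # a child still covers the set: go lower
                break
        else:
            return node  # no single child covers the set: this is the LCA
    return node


def s(gene_node):
    """LCA mapping: species-tree node associated with a gene-tree node."""
    species = frozenset(GENE_TO_SPECIES[g] for g in leaves(gene_node))
    return lca(SPECIES_TREE, species)


def event(gene_node):
    """Duplication if the node has the same mapping as one of its children."""
    mapping = s(gene_node)
    if any(s(child) == mapping for child in gene_node):
        return "duplication"
    return "speciation"


def relation(gene_tree, g1, g2):
    """Orthologs if the LCA of two distinct genes is a speciation, else paralogs."""
    node = gene_tree
    while True:
        for child in node:
            if {g1, g2} <= leaves(child):
                node = child
                break
        else:
            break  # node is the LCA of g1 and g2
    return "orthologs" if event(node) == "speciation" else "paralogs"


print(event(GENE_TREE))                  # duplication
print(relation(GENE_TREE, "Z2", "T1"))   # orthologs
print(relation(GENE_TREE, "Z1", "T1"))   # paralogs
```

Recomputing `leaves` at every step keeps the sketch short; a real implementation would cache leaf sets or use a constant-time LCA structure.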

  9. 12 Introduction
     • Orthology and paralogy are inferred given the gene tree.
     • But instead, can we infer (or correct) parts of the gene tree, given orthology/paralogy relationships?
     [Figure: gene tree G for the zinc finger protein 800 gene (leaves T1, M1, Z1, Z2, S1) and species tree S (leaves M, T, S, Z), with labeled internal nodes.]

  10. 13 Introduction
      • CASE 1: Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
      • Correct the gene tree making the minimum number of “moves”
      [Figure: gene tree G (leaves T1, M1, Z1, Z2, S1) with one node marked “Untrusted duplication”, next to the species tree S (leaves M, T, S, Z).]

  11. 14 Introduction
      • CASE 1: Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
      • Correct the gene tree making the minimum number of “moves”
      [Figure: the corrected gene tree G (leaves S1, M1, Z1, Z2, T1) next to the species tree S (leaves M, T, S, Z).]

  12. 15 Introduction
      • CASE 2: Suppose we KNOW Z1 and T1 are orthologous
      • Correct the gene tree making the minimum number of “moves”
      [Figure: gene tree G (leaves S1, M1, Z1, Z2, T1) with one node marked “Untrusted duplication”, next to the species tree S (leaves M, T, S, Z).]

  13. 16 Introduction
      • CASE 2: Suppose we KNOW Z1 and T1 are orthologous
      • Correct the gene tree making the minimum number of “moves”
      [Figure: the corrected gene tree G next to the species tree S (leaves M, T, S, Z).]

  14. 17 Two correction problems
      • Cases 1 and 2 give us speciation (orthology) constraints
      • Given G containing untrusted duplications, find a gene tree G’ that satisfies the given constraints AND changes G as little as possible
      • e.g. minimize the Robinson-Foulds distance
      [Figure: the gene tree G and a corrected tree G’ on the leaves M1, T1, Z1, Z2, S1.]

  15. 18 RF distance
      • In the case of rooted binary trees T1, T2 with the same leaves:
      • RFDist(T1, T2) is simply twice the number of clades that are in T1 but not in T2
      [Figure: trees T1 and T2 on the leaves g1..g5, with root r and internal nodes x, y, z in T1.]

  16. 19 RF distance
      • In the case of rooted binary trees T1, T2 with the same leaves:
      • RFDist(T1, T2) is simply twice the number of clades that are in T1 but not in T2
      [Figure: trees T1 and T2 on the leaves g1..g5; clade z = {g4, g5} of T1 is highlighted.]

  17. 20 RF distance
      • In the case of rooted binary trees T1, T2 with the same leaves:
      • RFDist(T1, T2) is simply twice the number of clades that are in T1 but not in T2
      [Figure: trees T1 and T2 on the leaves g1..g5; clades x = {g1, g2, g3} and z = {g4, g5} of T1 are highlighted.]

  18. 21 RF distance
      • In the case of rooted binary trees T1, T2 with the same leaves:
      • RFDist(T1, T2) is simply twice the number of clades that are in T1 but not in T2
      [Figure: trees T1 and T2 on the leaves g1..g5; clades x = {g1, g2, g3}, z = {g4, g5} and y = {g2, g3} of T1 are highlighted.]

  19. 22 RF distance
      • In the case of rooted binary trees T1, T2 with the same leaves:
      • RFDist(T1, T2) is simply twice the number of clades that are in T1 but not in T2
      • In the example, RFDist(T1, T2) = 2
      [Figure: trees T1 and T2 on the leaves g1..g5, with root r and internal nodes x, y, z in T1.]
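
The clade-counting definition of the RF distance above can be checked with a short Python sketch. T1 follows the clades listed on the slides (x = {g1, g2, g3}, y = {g2, g3}, z = {g4, g5}); the topology of T2 is not recoverable from the transcript, so the version below, which groups g1 with g3, is an assumption chosen to be consistent with the stated distance of 2. The helper names `clades` and `rf_distance` are illustrative.

```python
def clades(tree):
    """Set of clades (leaf sets of internal nodes) of a rooted tree.

    Trees are nested tuples; leaves are strings.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(visit(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Number of clades found in exactly one of the two trees."""
    c1, c2 = clades(t1), clades(t2)
    return len(c1 - c2) + len(c2 - c1)


# T1 has the clades named on the slides: x, y, z and the root r.
T1 = (("g1", ("g2", "g3")), ("g4", "g5"))
# ASSUMPTION: hypothetical T2 in which g1 groups with g3 instead of g2 with g3.
T2 = ((("g1", "g3"), "g2"), ("g4", "g5"))

print(rf_distance(T1, T2))               # 2
print(2 * len(clades(T1) - clades(T2)))  # 2, the "twice the clades of T1 not in T2" form
```

For two rooted binary trees on the same leaves, both trees have the same number of clades, which is why the symmetric count equals twice the one-sided count.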

  20. 23 Detecting untrustworthy duplications
      • Some duplications are labeled “dubious” or given low confidence values by Ensembl
      • We can use synteny to infer orthology/paralogy relationships [1]
      • Software inferring ancestral adjacencies might pick up erroneous duplications
      • Using DeCo, one can identify bad duplications when more than two adjacencies are inferred on an ancestral gene [2]
      [1] Lafond, Swenson, El-Mabrouk, “Error detection and correction of gene trees”, MAGE (2013)
      [2] Chauve, El-Mabrouk, Guéguen, Semeria, Tannier, “Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later”, MAGE (2013)

  21. 24 Detecting untrustworthy duplications
      • Suppose genes a1, b1 from genomes a and b are in syntenic blocks (they are in a conserved region of homologous genes)
      • In this example, the conserved region involves 5 gene families
      [Figure: Genome a with consecutive genes a--, a-, a1, a+, a++ aligned with Genome b with consecutive genes b--, b-, b1, b+, b++.]

  22. 25 Detecting untrustworthy duplications
      • Suppose genes a1, b1 from genomes a and b are in syntenic blocks (they are in a conserved region of homologous genes)
      • In this example, the conserved region involves 5 gene families
      • Look at the gene trees of each involved family
      [Figure: Genome a with consecutive genes a--, a-, a1, a+, a++ aligned with Genome b with consecutive genes b--, b-, b1, b+, b++.]

  23. 26 Detecting untrustworthy duplications
      • If all the homologous genes in the regions are orthologous, we expect a1 and b1 to also be orthologous
      [Figure: Genome a with consecutive genes a--, a-, a1, a+, a++ aligned with Genome b with consecutive genes b--, b-, b1, b+, b++.]

  24. 27 Detecting untrustworthy duplications
      • If all the homologous genes in the regions are orthologous, we expect a1 and b1 to also be orthologous
      • If not, some unlikely event occurred
      [Figure: Genome a with consecutive genes a--, a-, a1, a+, a++ aligned with Genome b with consecutive genes b--, b-, b1, b+, b++.]

  25. 28 Detecting untrustworthy duplications
      • What’s wrong with this?
      • If only the ancestral gene ab duplicated, the copy typically went somewhere else on the ancestral genome
      • And somehow, it ended up in a region similar to that of the original gene, mostly by chance
      [Figure: the ancestral gene ab and its copy, above Genome a (a--, a-, a1, a+, a++) and Genome b (b--, b-, b1, b+, b++).]

  26. 29 Detecting untrustworthy duplications
      • We looked at ~6000 Ensembl gene trees
      • These are the trees for the Zebrafish, Medaka, Tetraodon and Stickleback species
      • 22% (~1200) of these trees contained this type of bad duplication
      [Figure: Genome a with consecutive genes a--, a-, a1, a+, a++ aligned with Genome b with consecutive genes b--, b-, b1, b+, b++.]

  27. 31 Problem 1
      • Given: a gene tree G, a species tree S, and a set C of clades that are required to be speciations
      • Find: a corrected gene tree G’ in which all clades in C are preserved and are speciations, and such that RFDist(G, G’) is minimized (as many clades as possible are preserved)
      [Figure: gene tree G and corrected tree G’ on the leaves a1, b1, c1, c2, d1, d2; preserved clades are shown in green.]

  28. 32 Problem 1
      • A solution doesn’t always exist
      • In this example, if C = {x, y}, we cannot correct both x and y into speciations
      • A solution exists iff there are no two clades x, y in C such that x is an ancestor of y and s(x) = s(y)
      • We will assume there exists a solution
      [Figure: gene tree G (leaves a1, c1, d1, b1) with nodes x and y such that s(x) = s(y), next to the species tree S (leaves a, b, c, d); C = {x, y}.]
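
The existence condition above is easy to test mechanically. The sketch below represents each clade of C by its set of genes, so "x is an ancestor of y" becomes "y is a proper subset of x", and it compares s(x) and s(y) through the leaf sets of their LCA nodes in S. The species tree topology ((a, b), (c, d)) and the gene names are assumptions for illustration, and `has_solution` is a hypothetical helper name.

```python
def leaf_set(tree):
    """Set of leaf names below a node (nested tuples, string leaves)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def lca_leaf_set(species_tree, species):
    """Leaf set of the lowest species-tree node covering `species`."""
    node = species_tree
    while not isinstance(node, str):
        covering = [c for c in node if species <= leaf_set(c)]
        if not covering:
            break  # no child covers the set: node is the LCA
        node = covering[0]
    return leaf_set(node)


def has_solution(species_tree, gene_to_species, constrained_clades):
    """True iff no clade of C has a descendant clade in C with the same mapping."""
    mapped = {
        clade: lca_leaf_set(
            species_tree, frozenset(gene_to_species[g] for g in clade)
        )
        for clade in constrained_clades
    }
    for x in constrained_clades:
        for y in constrained_clades:
            # y < x is a proper-subset test: x is a strict ancestor of y.
            if y < x and mapped[x] == mapped[y]:
                return False
    return True


# ASSUMPTION: hypothetical species tree and gene-to-species map.
S = (("a", "b"), ("c", "d"))
GENE_TO_SPECIES = {"a1": "a", "b1": "b", "c1": "c", "d1": "d"}

x = frozenset({"a1", "c1", "d1", "b1"})   # maps to the root of S
y = frozenset({"c1", "d1", "b1"})         # also maps to the root of S
print(has_solution(S, GENE_TO_SPECIES, [x, y]))  # False
print(has_solution(S, GENE_TO_SPECIES, [x]))     # True
```

The check is quadratic in the size of C, which is fine for a sketch; walking the gene tree once and comparing each constrained node with its nearest constrained ancestor would avoid the double loop.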

  29. 33 Problem 1
      • To transform x into a speciation:
      • Let L and R be the two children of s(x)
      [Figure: species tree S (leaves a, b, c, d) with node s(x) and its children L and R; gene tree G rooted at x with leaves b2, a1, c1, c2, c3, b1, d1, d2.]

  30. 34 Problem 1
      • Find G_L (resp. G_R), the set of maximal subtrees of G that contain only genes mapped to species in L (resp. R)
      [Figure: species tree S (leaves a, b, c, d) with node s(x) and its children L and R; gene tree G rooted at x with leaves b2, a1, c1, c2, c3, b1, d1, d2.]

  31. 35 Problem 1
      • Form G* by making two polytomies (non-binary subtrees) with G_L and G_R, joined under a common parent
      [Figure: species tree S (leaves a, b, c, d) with s(x), L and R; gene tree G rooted at x (leaves b2, a1, c1, c2, c3, b1, d1, d2); tree G* with polytomies L1 and R1 over the leaves a1, b1, b2, c1, c2, c3, d1, d2.]

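  32. 36 Problem 1
      • Theorem: any binary resolution of G* is a solution to Problem 1.
      [Figure: species tree S (leaves a, b, c, d) with s(x), L and R; gene tree G rooted at x (leaves b2, a1, c1, c2, c3, b1, d1, d2); tree G* with polytomies L1 and R1 over the leaves a1, b1, b2, c1, c2, c3, d1, d2.]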

  33. 37 Problem 1
      • Theorem: any binary resolution of G* is a solution to Problem 1.
      • In fact, every solution is the result of a binary resolution of G*.
      [Figure: species tree S (leaves a, b, c, d) with s(x), L and R; gene tree G rooted at x (leaves b2, a1, c1, c2, c3, b1, d1, d2); tree G* with polytomies L1 and R1 over the leaves a1, b1, b2, c1, c2, c3, d1, d2.]
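
The construction of G* for a single clade x can be sketched as follows. This is a minimal illustration, not the authors' implementation: it handles one constrained clade rather than a whole set C, the gene tree topology is an assumption (only its leaf order b2, a1, c1, c2, c3, b1, d1, d2 comes from the slides), and the species sets L = {a, b} and R = {c, d} assume the species tree ((a, b), (c, d)). Polytomies are represented as tuples with more than two children.

```python
def leaf_set(tree):
    """Set of leaf names below a node (nested tuples, string leaves)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def maximal_subtrees(tree, allowed_species, gene_to_species):
    """Maximal subtrees whose genes all map to species in allowed_species."""
    if all(gene_to_species[g] in allowed_species for g in leaf_set(tree)):
        return [tree]  # the whole subtree qualifies, so it is maximal
    if isinstance(tree, str):
        return []      # a single gene from the other side
    result = []
    for child in tree:
        result.extend(maximal_subtrees(child, allowed_species, gene_to_species))
    return result


def build_g_star(gene_subtree, left_species, right_species, gene_to_species):
    """Join the polytomy over G_L and the polytomy over G_R under one parent."""
    def polytomy(parts):
        if not parts:
            raise ValueError("x cannot become a speciation: one side has no genes")
        return parts[0] if len(parts) == 1 else tuple(parts)

    g_left = maximal_subtrees(gene_subtree, left_species, gene_to_species)
    g_right = maximal_subtrees(gene_subtree, right_species, gene_to_species)
    return (polytomy(g_left), polytomy(g_right))


# ASSUMPTION: hypothetical topology for the subtree of G rooted at x.
G_X = (("b2", ("a1", "c1")), (("c2", "c3"), ("b1", ("d1", "d2"))))
GENE_TO_SPECIES = {g: g[0] for g in leaf_set(G_X)}  # "a1" -> species "a", etc.

g_star = build_g_star(G_X, {"a", "b"}, {"c", "d"}, GENE_TO_SPECIES)
print(g_star)
# (('b2', 'a1', 'b1'), ('c1', ('c2', 'c3'), ('d1', 'd2')))
```

Subtrees such as (c2, c3) and (d1, d2) survive intact inside the right polytomy, which is how the clades of G that are compatible with the constraint get preserved; any binary resolution of the two polytomies then yields a candidate corrected tree.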
