GENE TREE CORRECTION GUIDED BY ORTHOLOGY

1 GENE TREE CORRECTION GUIDED BY ORTHOLOGY
Manuel Lafond (1), Magali Semeria (2), Krister M. Swenson (1,4), Eric Tannier (2,3) and Nadia El-Mabrouk (1)
(1) Université de Montréal
(2) Laboratoire de Biométrie et Biologie Évolutive
(3) INRIA Grenoble Rhône-Alpes
(4) McGill Center for Bioinformatics

2 Introduction
• Gene trees reflect the evolutionary history of a family of homologous genes
• Ancestral genes may have undergone duplication or speciation
[Figure: gene tree G of the Ensembl ZincFinger protein 800 (ZNF800) gene for the species Zebrafish, Stickleback, Medaka and Tetraodon, with internal nodes marked as duplication or speciation; leaves ZNF800 Z1, ZNF800 M, ZNF800 Z2, ZNF800 S, ZNF800 T]

3 Introduction (LCA = Lowest Common Ancestor)
• Pairwise relationships between extant genes
• Orthologs: the LCA is a speciation (e.g. ZNF800 Z2, ZNF800 T)
• Paralogs: the LCA is a duplication (e.g. ZNF800 Z1, ZNF800 T)
[Figure: gene tree G with duplication and speciation nodes marked; leaves ZNF800 Z1, ZNF800 M, ZNF800 Z2, ZNF800 S, ZNF800 T]

4-6 Introduction
• Each gene tree has an associated species tree
• Each extant gene g is mapped to an extant species by a function s(g)
• We use this mapping to simplify notation (e.g. ZNF800 Z1 is written Z1)
• s(g) for ancestral genes: we use LCA Mapping, where each ancestral gene is mapped to the LCA, in S, of the mappings of its descendants
[Figure: gene tree G for the ZincFinger protein 800 with leaves T1, M1, Z1, Z2, S1 and internal nodes γ1, γ2, β1, β2, next to the species tree S with leaves M, T, S, Z and internal nodes α, β, γ]

7-8 Introduction
• Reconciliation infers speciation/duplication events
• If g has the same mapping as one of its children, infer a duplication (otherwise, infer a speciation)
[Figure: gene tree G for the ZincFinger protein 800 with leaves T1, M1, Z1, Z2, S1 and internal nodes γ1, γ2, β1, β2, next to the species tree S with leaves M, T, S, Z and internal nodes α, β, γ]
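The LCA mapping and the duplication/speciation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: trees are nested tuples, the species tree and gene tree are hypothetical toy inputs (not the ZNF800 trees from the slides), and the convention that gene "a1" belongs to species "a" is an assumption made for the example.

```python
def leaves(tree):
    """Leaf names below a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def species_of(gene):
    # Assumed naming convention: gene "a1" belongs to species "a".
    return gene.rstrip("0123456789")


def lca(species_tree, species):
    """LCA in species_tree of a set of species, identified by its leaf set."""
    node = species_tree
    while not isinstance(node, str):
        covering = [child for child in node if species <= leaves(child)]
        if not covering:
            break
        node = covering[0]
    return leaves(node)


def reconcile(gene_tree, species_tree, events):
    """Return s(root of gene_tree); record each internal node's event in `events`."""
    if isinstance(gene_tree, str):
        return lca(species_tree, frozenset([species_of(gene_tree)]))
    child_maps = [reconcile(child, species_tree, events) for child in gene_tree]
    here = lca(species_tree, frozenset().union(*child_maps))
    # Same mapping as one of the children: duplication; otherwise: speciation.
    events[leaves(gene_tree)] = "duplication" if here in child_maps else "speciation"
    return here


def relation(events, g1, g2):
    """Orthologs if the LCA of g1 and g2 in the gene tree is a speciation."""
    lca_clade = min((c for c in events if g1 in c and g2 in c), key=len)
    return "orthologs" if events[lca_clade] == "speciation" else "paralogs"


# Hypothetical toy trees, chosen only to exercise both outcomes.
S = (("a", "b"), ("c", "d"))
G = (("a1", "b1"), ("a2", ("c1", "d1")))
events = {}
root_mapping = reconcile(G, S, events)
print(events[leaves(G)])             # duplication
print(relation(events, "a1", "b1"))  # orthologs
print(relation(events, "a1", "a2"))  # paralogs
```

Each node is identified by the set of leaves below it, which keeps the sketch short at the cost of recomputing leaf sets; a real implementation would more likely store the mapping on node objects.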

12 Introduction
• Orthology and paralogy are usually inferred from the gene tree
• Can we instead infer (or correct) parts of the gene tree, given orthology/paralogy relationships?
[Figure: gene tree G for the ZincFinger protein 800 with leaves T1, M1, Z1, Z2, S1 and internal nodes γ1, γ2, β1, β2, next to the species tree S with leaves M, T, S, Z and internal nodes α, β, γ]

13-14 Introduction
• CASE 1: Suppose we KNOW β1 is a speciation, and we want to keep the β1 clade (i.e. do not insert/remove leaves in the β1 subtree)
• Correct the gene tree making the minimum number of “moves”
[Figure, before: gene tree G with leaves T1, M1, Z1, Z2, S1 and an untrusted duplication marked, next to the species tree S with leaves M, T, S, Z. After: corrected gene tree with leaves S1, M1, Z1, Z2, T1 and internal nodes γ1, γ2, β1, α1]

15-16 Introduction
• CASE 2: Suppose we KNOW Z1 and T1 are orthologous
• Correct the gene tree making the minimum number of “moves”
[Figure, before: gene tree G with leaves S1, M1, Z1, Z2, T1 and an untrusted duplication marked, next to the species tree S with leaves M, T, S, Z. After: corrected gene tree on the same leaves, with node labels γ1, β1, Z3, α1]

17 Two correction problems
• Cases 1 and 2 give us speciation (orthology) constraints
• Given G containing untrusted duplications, find a gene tree G’ that satisfies the given constraints AND modifies G as little as possible
• e.g. minimize the Robinson-Foulds distance between G and G’
[Figure: G and a corrected tree G’, both on the leaves M1, T1, Z1, Z2, S1]

18-22 RF distance
• In the case of rooted binary trees T1, T2 with the same leaves:
• RFDist(T1, T2) is simply twice the number of clades of T1 that are not in T2
[Figure: trees T1 (leaf order g1, g2, g3, g4, g5) and T2 (leaf order g1, g3, g2, g4, g5). T1 has root r and clades x : {g1, g2, g3}, z : {g4, g5}, y : {g2, g3}. Only y is missing from T2, so RFDist(T1, T2) = 2]
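The clade-counting definition above is short enough to check by hand or in code. The sketch below computes the set of clades of each tree and counts the clades found in only one of them. The shape of T1 follows the clades listed on the slide; the shape of T2 is an assumption read off its leaf order (g1 grouped with g3), since the transcript does not preserve the drawing.

```python
def clades(tree, found=None):
    """Return (leaf set of tree, set of all clades of internal nodes)."""
    if found is None:
        found = set()
    if isinstance(tree, str):
        return frozenset([tree]), found
    below = frozenset().union(*(clades(child, found)[0] for child in tree))
    found.add(below)
    return below, found


def rf_distance(t1, t2):
    """Robinson-Foulds distance: clades present in exactly one of the trees."""
    c1 = clades(t1)[1]
    c2 = clades(t2)[1]
    return len(c1 - c2) + len(c2 - c1)


# T1 has clades x = {g1, g2, g3}, y = {g2, g3}, z = {g4, g5}.
T1 = (("g1", ("g2", "g3")), ("g4", "g5"))
# Assumed shape for T2: {g1, g3} replaces {g2, g3}.
T2 = ((("g1", "g3"), "g2"), ("g4", "g5"))
print(rf_distance(T1, T2))  # 2
```

For rooted binary trees on the same leaves both trees have the same number of clades, so the symmetric difference counted here equals twice the number of clades of T1 missing from T2.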

23 Detecting untrustworthy duplications
• Some duplications are labeled “dubious” or given low confidence values by Ensembl
• We can use synteny to infer orthology/paralogy relationships [1]
• Software inferring ancestral adjacencies might pick up erroneous duplications
• Using DeCo, one can identify bad duplications when more than two adjacencies are inferred on an ancestral gene [2]
[1] Lafond, Swenson, El-Mabrouk, “Error detection and correction of gene trees”, MAGE (2013)
[2] Chauve, El-Mabrouk, Guéguen, Semeria, Tannier, “Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later”, MAGE (2013)

24-27 Detecting untrustworthy duplications
• Suppose genes a1, b1 from genomes a and b are in syntenic blocks (they are in a conserved region of homologous genes)
• In this example, the conserved region involves 5 gene families
• Look at the gene trees of each involved family
• If all the other homologous genes in the regions are orthologous, we expect a1 and b1 to also be orthologous
• If not, some unlikely event occurred
[Figure: conserved region with genes a--, a-, a1, a+, a++ on genome a aligned with genes b--, b-, b1, b+, b++ on genome b]

28 Detecting untrustworthy duplications
• What’s wrong with this?
• If only the ancestral gene ab duplicated, the copy typically went somewhere else on the ancestral genome
• And somehow, it ended up in a region similar to that of the original gene… mostly by chance
[Figure: ancestral gene ab and its copy, above the conserved region with genes a--, a-, a1, a+, a++ on genome a and b--, b-, b1, b+, b++ on genome b]

29 Detecting untrustworthy duplications
• We looked at ~6000 Ensembl gene trees
• The trees cover the Zebrafish, Medaka, Tetraodon and Stickleback species
• 22% (~1200) of these trees contained this type of bad duplication
[Figure: conserved region with genes a--, a-, a1, a+, a++ on genome a aligned with genes b--, b-, b1, b+, b++ on genome b]
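The synteny argument on these slides can be illustrated with a small filter. This is a deliberately simplified sketch and not the detection procedure of [1]: it assumes the two regions are already aligned position by position, and that a hypothetical predicate `orthologous(a, b)` reports what the gene tree of each family says about a pair. A pair is flagged when its tree calls it paralogous while every other pair in the region is orthologous.

```python
def suspect_pairs(region_a, region_b, orthologous):
    """Pairs called paralogous although all other pairs in the region are orthologous."""
    calls = [orthologous(a, b) for a, b in zip(region_a, region_b)]
    suspects = []
    for i, is_ortholog in enumerate(calls):
        others = calls[:i] + calls[i + 1:]
        if not is_ortholog and others and all(others):
            suspects.append((region_a[i], region_b[i]))
    return suspects


# Gene names follow the slide's figure; the orthology calls are made up.
genome_a = ["a--", "a-", "a1", "a+", "a++"]
genome_b = ["b--", "b-", "b1", "b+", "b++"]


def orthologous(a, b):
    # Hypothetical gene-tree verdicts: only (a1, b1) is called paralogous.
    return (a, b) != ("a1", "b1")


print(suspect_pairs(genome_a, genome_b, orthologous))  # [('a1', 'b1')]
```

Requiring every flanking pair to be orthologous is the strictest reading of the slide; a practical filter would probably tolerate some disagreement in the flanking families.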

31 Problem 1
• Given: a gene tree G, a species tree S, and a set C of clades that are required to be speciations
• Find: a corrected gene tree G’ in which all clades in C are preserved and are speciations, and such that RFDist(G, G’) is minimized (as many clades as possible are preserved)
[Figure: G and G’ on the leaves a1, b1, c1, c2, d1, d2; preserved clades shown in green]

32 Problem 1
• A solution doesn’t always exist
• In this example, if C = {x, y}, we cannot correct both x and y into speciations
• A solution exists iff there are no two x, y in C such that x is an ancestor of y and s(x) = s(y)
• We will assume there exists a solution
[Figure: gene tree G with leaves a1, c1, d1, b1, where x is an ancestor of y and s(x) = s(y); species tree S with leaves a, b, c, d; C = {x, y}]
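The existence condition above translates directly into a pairwise check. In this sketch each constraint is a hypothetical pair (clade, mapping): the clade is the set of genes below the node, and the mapping is s of that node, written as the set of species below it in S. A node x is an ancestor of y exactly when the clade of y is a proper subset of the clade of x. The clades used for x and y are an assumed reading of the slide's figure.

```python
from itertools import permutations


def has_solution(constraints):
    """True iff no constrained clade has a constrained ancestor with the same mapping."""
    return not any(
        clade_y < clade_x and s_x == s_y
        for (clade_x, s_x), (clade_y, s_y) in permutations(constraints, 2)
    )


root_species = frozenset({"a", "b", "c", "d"})
x = (frozenset({"a1", "b1", "c1", "d1"}), root_species)
# Assumed clade for y: it contains b1, c1, d1, so s(y) is also the root of S.
y = (frozenset({"b1", "c1", "d1"}), root_species)
# A different constrained clade that maps strictly below s(x).
w = (frozenset({"c1", "d1"}), frozenset({"c", "d"}))

print(has_solution([x, y]))  # False
print(has_solution([x, w]))  # True
```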

33-35 Problem 1
• To transform x into a speciation:
• Let L and R be the two children of s(x)
• Find G_L (resp. G_R), the set of maximal subtrees of G that contain only genes mapped to species in L (resp. R)
• Form G* by making two polytomies (non-binary subtrees) with G_L and G_R, joined under a common parent
[Figure: species tree S with leaves a, b, c, d, where s(x) has children L and R; gene tree G rooted at x with leaves b2, a1, c1, c2, c3, b1, d1, d2; tree G* with two polytomies L1 and R1 over the leaves a1, b1, b2, c1, c2, c3, d1, d2]

36-37 Problem 1
• Theorem: any binary resolution of G* is a solution to Problem 1
• In fact, every solution is the result of a binary resolution of G*
[Figure: species tree S with leaves a, b, c, d, where s(x) has children L and R; gene tree G rooted at x with leaves b2, a1, c1, c2, c3, b1, d1, d2; tree G* with two polytomies L1 and R1 over the leaves a1, b1, b2, c1, c2, c3, d1, d2]
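The construction of G* for a single node x can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: trees are nested tuples, gene "a1" is assumed to belong to species "a", every gene below x is assumed to map to a species below s(x), and the shape of the example gene tree is hypothetical (only its leaf order matches the slide).

```python
def leaves(tree):
    """Leaf names below a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def species_of(gene):
    # Assumed naming convention: gene "a1" belongs to species "a".
    return gene.rstrip("0123456789")


def maximal_subtrees(tree, allowed):
    """Maximal subtrees of `tree` whose genes all map to species in `allowed`."""
    if {species_of(g) for g in leaves(tree)} <= allowed:
        return [tree]
    if isinstance(tree, str):
        return []
    return [sub for child in tree for sub in maximal_subtrees(child, allowed)]


def join(parts):
    """Put the subtrees under one parent (a polytomy when there are 3 or more)."""
    if not parts:
        raise ValueError("no gene on one side: x cannot become a speciation")
    return parts[0] if len(parts) == 1 else tuple(parts)


def star_tree(gene_subtree, left_species, right_species):
    """G*: the G_L polytomy and the G_R polytomy joined under a common parent."""
    return (join(maximal_subtrees(gene_subtree, left_species)),
            join(maximal_subtrees(gene_subtree, right_species)))


# Hypothetical shape for the subtree rooted at x; L covers {a, b}, R covers {c, d}.
G_x = ((("b2", "a1"), ("c1", "c2")), (("c3", "b1"), ("d1", "d2")))
G_star = star_tree(G_x, {"a", "b"}, {"c", "d"})
print(G_star)
```

With this input the left side keeps the subtree (b2, a1) and the single leaf b1, and the right side becomes a polytomy over (c1, c2), c3 and (d1, d2). Any way of resolving that polytomy into binary nodes gives a tree in which the root maps to s(x) while its children map to L and R, so the root is a speciation.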
