What is a phylogenetic tree? Bioinformatics Algorithms (Fundamental - PDF document

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester

Phylogenetics I [1]

What is a phylogenetic tree?

[Figure: a tree whose internal nodes are labeled "speciation events" and whose leaves are the species (taxa) wolf, cat, lion, horse, rhino.]

Phylogenetic trees display the evolutionary relationships among a set of objects (species). Contemporary species are represented by the leaves. Internal nodes of the tree represent speciation events (≈ common ancestors, usually extinct).

[1] These slides are partially based on the Lecture Notes from Bielefeld University "Algorithms for Phylogenetic Reconstruction" (2016/17), by J. Stoye, R. Wittler, et al.

Different types of phylogenetic trees

• rooted vs. unrooted (root on top/bottom vs. root in the middle)
• binary (fully resolved) vs. multifurcating (polytomies)
• are edge lengths significant?
• is there a time scale on the side?

Phylogenetic reconstruction

Goal
Given n objects and data on these objects, find a phylogenetic tree with these objects at the leaves which best reflects the input data.

Note: We need to define more precisely
• what kind of input data we have,
• what kind of tree we want (e.g. rooted or unrooted), and
• what we mean by "reflect the data."

There are two main issues:
1. How well does a tree reflect my data?
2. How do we find such a tree?

Number of phylogenetic trees

Say we have answered these questions, then: Could we just list all possible trees and then choose the/a best one?

[Figure: All phylogenetic trees (rooted and unrooted) on 4 taxa.]

    # taxa    # unrooted trees    # rooted trees
    n         (2n − 5)!!          (2n − 3)!!
    1         1                   1
    2         1                   1
    3         1                   3
    4         3                   15
    5         15                  105
    6         105                 945
    7         945                 10,395
    8         10,395              135,135
    9         135,135             2,027,025
    10        2,027,025           34,459,425

Theorem
There are $U_n = (2n-5)!! = \prod_{i=3}^{n} (2i-5)$ unrooted binary phylogenetic trees on n objects, and $R_n = (2n-3)!! = \prod_{i=2}^{n} (2i-3)$ rooted binary phylogenetic trees on n objects.

Proof
By induction on n, using that
(1) we can get every unrooted tree on n + 1 objects in a unique way by adding the (n + 1)st leaf to an unrooted tree on the first n objects;
(2) an unrooted binary tree with n leaves has 2n − 3 edges;
(3) every unrooted tree on n objects can be rooted in (number of edges) ways, yielding a rooted tree on n objects.

So there are super-exponentially many trees: We cannot check all of them!

Types of input data

We can have two kinds of input data:
• distance data: n × n matrix of pairwise distances between the taxa, or
• character data: n × m matrix giving the states of m characters for the n taxa
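Returning to the counting theorem: the table of $U_n$ and $R_n$ can be reproduced with a few lines of Python. This is only a sketch; the function names are made up for illustration, and treating (−1)!! and (−3)!! as 1 (so that the rows n = 1, 2 come out as 1) is a convention assumed here.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... over the positive terms; 1 if k <= 1.

    Assumed convention: (-1)!! = (-3)!! = 1, matching the n = 1, 2 table rows.
    """
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n: int) -> int:
    """U_n = (2n - 5)!!: unrooted binary trees on n labeled leaves."""
    return double_factorial(2 * n - 5)


def rooted_trees(n: int) -> int:
    """R_n = (2n - 3)!!: rooted binary trees on n labeled leaves."""
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, unrooted_trees(n), rooted_trees(n))
```

Steps (2) and (3) of the proof show up as the identity R_n = (2n − 3) · U_n for n ≥ 2, and step (1) as U_{n+1} = (2n − 3) · U_n, so rooted_trees(n) equals unrooted_trees(n + 1).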

Distance data

Distance data is given as an (n × n) matrix M with the pairwise distances between the taxa.

Example

        a   b   c
    a   0   5   2
    b   5   0   4
    c   2   4   0

E.g., $M_{a,b} = 5$ means that the distance between a and b is 5. Often, this is the edit distance (between two genomic sequences, or between homologous proteins, ...).

We want to find a tree with a, b, c at the leaves s.t. the distance in the tree (the path metric) between a and b is 5, between a and c is 2, etc.

Path metric of a tree

Given a tree T, the path-metric of T is $d_T$, defined as: $d_T(u, v)$ = sum of edge weights on the (unique) path between u and v.

Ex.
[Figure: an unrooted tree with leaves a, b, c, d, e and edge weights 2, 3, 1, 4, 3, 2, 5.]

$d_T(a, b) = 5$, $d_T(a, d) = 11$, $d_T(c, d) = 9$, ...

Note
$d_T(u, v)$ is also defined for inner nodes u, v, but we only need it for leaves.

Example

For our earlier example, we can find such a tree:

Ex. 1 (from before)

        a   b   c
    a   0   5   2
    b   5   0   4
    c   2   4   0

[Figure: an unrooted tree with one inner node and three leaves; the edges to a, b, c have lengths 1.5, 3.5, 0.5.]

Distance data

First of all, the input matrix M has to define a metric (= a distance function), i.e. for all x, y, z,
• M(x, y) ≥ 0 and (M(x, y) = 0 iff x = y) (positive definite)
• M(x, y) = M(y, x) (symmetry)
• M(x, y) + M(y, z) ≥ M(x, z) (triangle inequality)

For example, the edit distance is a metric (on strings), the Hamming distance (on strings of the same length), the Euclidean distance (on $\mathbb{R}^2$).

Question
Is it always possible to find a tree s.t. its path-metric equals the input distances? I.e. does such a tree exist for any input matrix M?

Conditions on distance matrix

Question: When does a tree exist whose path metric agrees with a distance matrix M?

Answer:
• if we want a rooted tree: M needs to be ultrametric
• if we want an unrooted tree: M needs to be additive

Rooted trees and the molecular clock

[Figure: a rooted tree whose internal nodes are labeled "speciation events" and whose leaves are the species (taxa) wolf, cat, lion, horse, rhino.]

In a rooted phylogenetic tree, the molecular clock assumption holds: that the speed of evolution is the same along all branches, i.e. the path distance from each leaf to the root is the same. Such a tree is also called an ultrametric tree.
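To make the path metric and the metric axioms concrete, here is a minimal Python sketch. Two things in it are assumptions rather than slide content: the inner-node names p, q, r, x, and the way the seven edge weights are placed on the five-leaf tree. The figure did not survive extraction, so the placement below is one assignment that reproduces every distance quoted on the slides.

```python
from collections import defaultdict

# Assumed reconstruction of the five-leaf example (inner nodes p, q, r are hypothetical names).
EXAMPLE_TREE = [("a", "p", 3), ("b", "p", 2), ("p", "q", 4), ("c", "q", 5),
                ("q", "r", 1), ("d", "r", 3), ("e", "r", 2)]

# Tree found for Ex. 1: one inner node x with edges of length 1.5, 3.5, 0.5.
STAR_TREE = [("a", "x", 1.5), ("b", "x", 3.5), ("c", "x", 0.5)]

# Distance matrix of Ex. 1 as a dict of dicts.
EX1 = {"a": {"a": 0, "b": 5, "c": 2},
       "b": {"a": 5, "b": 0, "c": 4},
       "c": {"a": 2, "b": 4, "c": 0}}


def path_metric(edges, source):
    """Return d_T(source, v) for every node v by walking the tree from source."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:          # paths in a tree are unique, so the first visit is final
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def is_metric(labels, M):
    """Check positive definiteness, symmetry and the triangle inequality."""
    for x in labels:
        for y in labels:
            if M[x][y] < 0 or (M[x][y] == 0) != (x == y) or M[x][y] != M[y][x]:
                return False
            if any(M[x][y] + M[y][z] < M[x][z] for z in labels):
                return False
    return True


if __name__ == "__main__":
    print(path_metric(EXAMPLE_TREE, "a"))
    print(path_metric(STAR_TREE, "a"), is_metric(["a", "b", "c"], EX1))
```

With this placement, path_metric(EXAMPLE_TREE, "a") gives 5 for b and 11 for d, and the star tree gives back the entries 5, 2, 4 of Ex. 1.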

Ultrametrics and the three-point condition

Three point condition
Let d be a metric on a set of objects O, then d is an ultrametric if ∀ x, y, z ∈ O:

$d(x, y) \le \max\{ d(x, z), d(z, y) \}$

[Figure: Three point condition, $d_{xy} \le d_{xz} = d_{yz}$. It implies that the path metric of a rooted tree is an ultrametric.]

In other words, among the three distances, there is no unique maximum.

Example

Ex. 2

        a    b    c    d
    a   0    10   10   10
    b   10   0    2    6
    c   10   2    0    6
    d   10   6    6    0

[Figure: rooted tree on the leaves a, b, c, d, with labels 5, 3, 1.]

Checking the ultrametric condition, we see that:
• for a, b, c we get 2, 10, 10: okay
• for a, b, d we get 6, 10, 10: okay
• for a, c, d we get 6, 10, 10: okay
• for b, c, d we get 2, 6, 6: okay

Example

Compare this to our earlier example. There the matrix M does not define an ultrametric!

Ex. 1 (from before)

        a   b   c
    a   0   5   2
    b   5   0   4
    c   2   4   0

For the triple a, b, c (the only triple), we get: 2, 4, 5, and there is a unique maximum: 5.

Indeed, the only tree we found was not rooted:
[Figure: the unrooted tree with edges of length 1.5, 3.5, 0.5 to a, b, c.]

Ultrametrics and the three-point condition

Theorem
Given an (n × n) distance matrix M. There is a rooted tree whose path metric agrees with M if and only if M defines an ultrametric (i.e. if and only if it is a metric and the 3-point-condition holds). This tree is unique [2].

Algorithm
The algorithm UPGMA (unweighted pair group method using arithmetic averages, Michener & Sokal 1957), a hierarchical clustering algorithm, constructs this tree, given an input matrix which is ultrametric. Its running time is $O(n^2)$.

[2] i.e. there is only one such tree

Additive metrics and the four-point condition

So what is the condition on the matrix M for unrooted trees?

Four point condition
Let d be a metric on a set of objects O, then d is an additive metric if ∀ x, y, u, v ∈ O:

$d(x, y) + d(u, v) \le \max\{ d(x, u) + d(y, v),\ d(x, v) + d(y, u) \}$

In other words, among the three sums of two distances, there is no unique maximum.

[Figure: The four point condition, $d_{xy} + d_{uv} \le d_{xu} + d_{yv} = d_{xv} + d_{yu}$, for a tree with x, y on one side and u, v on the other. It implies that the path metric of a tree is an additive metric.]
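Returning to the three-point condition and UPGMA: the sketch below checks the condition on Ex. 1 and Ex. 2 and then runs a deliberately naive UPGMA on Ex. 2. It is an illustration, not a reference implementation. The function names and the nested-tuple output format (left, right, height) are choices made here, and this simple version takes cubic time, not the $O(n^2)$ quoted on the slide.

```python
from itertools import combinations


def matrix(labels, rows):
    """Turn a list of rows into a dict-of-dicts distance matrix."""
    return {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}


def is_ultrametric(labels, dist):
    """Three-point condition: in every triple the maximum is attained at least twice."""
    for x, y, z in combinations(labels, 3):
        d = sorted([dist[x][y], dist[x][z], dist[y][z]])
        if d[1] != d[2]:
            return False
    return True


def upgma(labels, dist):
    """Naive UPGMA. A leaf is its label; an inner node is (left, right, height)."""
    clusters = {i: (lab, 1) for i, lab in enumerate(labels)}   # id -> (subtree, size)
    d = {frozenset((i, j)): dist[labels[i]][labels[j]]
         for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((i, j))] / 2
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Distance to the merged cluster: size-weighted average of the two old distances.
        for k in clusters:
            d[frozenset((next_id, k))] = (
                ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]) / (ni + nj)
        clusters[next_id] = ((ti, tj, height), ni + nj)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


EX1 = matrix("abc", [[0, 5, 2], [5, 0, 4], [2, 4, 0]])
EX2 = matrix("abcd", [[0, 10, 10, 10], [10, 0, 2, 6], [10, 2, 0, 6], [10, 6, 6, 0]])

if __name__ == "__main__":
    print(is_ultrametric("abc", EX1), is_ultrametric("abcd", EX2))
    print(upgma("abcd", EX2))
```

On Ex. 2 the sketch returns ('a', ('d', ('b', 'c', 1.0), 3.0), 5.0): b and c are joined at height 1, then d at height 3, then a at height 5, which agrees with the labels 5, 3, 1 that survive from the Ex. 2 figure.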

Example

[Figure: the unrooted tree with leaves a, b, c, d, e and edge weights 2, 3, 1, 4, 3, 2, 5.]

For ex., choose these 4 points: a, b, c, e. Then we get the three sums:
d(a, b) + d(c, e) = 5 + 8 = 13,
d(a, c) + d(b, e) = 12 + 9 = 21, and
d(a, e) + d(b, c) = 10 + 11 = 21.
Among 13, 21, 21, there is no unique maximum: okay.

(Careful, this has to hold for all quadruples; how many are there?)

Additive metrics and the four-point condition

Theorem
Given an (n × n) distance matrix M. There is an unrooted tree whose path metric agrees with M if and only if M defines an additive metric (i.e. if and only if it is a metric and the 4-point-condition holds). This tree is unique.

Algorithm
The algorithm NJ (Neighbor Joining) constructs this tree, given an additive matrix M (Saitou & Nei, 1987). Its running time is $O(n^3)$.

In fact, it is even possible to compute a "good" tree if the matrix is not additive but "almost" (all this needs to be defined precisely, of course).

Summary for distance data

• When the input is a distance matrix, then we are looking for a tree whose path metric agrees with M.
• A rooted tree agreeing with M exists if and only if the distance matrix M defines an ultrametric.
• This tree can then be computed efficiently (i.e. in polynomial time), with UPGMA.
• An unrooted tree agreeing with M exists if and only if the distance matrix M defines an additive metric.
• It can be computed efficiently (i.e. in polynomial time), with Neighbor Joining.
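The four-point check from the example can be run over all quadruples with a short Python sketch. Eight of the ten distances below are quoted on the slides; the values d(b, d) = 10 and d(d, e) = 5 are assumptions, taken from one placement of the edge weights that is consistent with the quoted distances. The function names are illustrative.

```python
from itertools import combinations
from math import comb

LABELS = "abcde"

# Pairwise distances of the five-leaf example; (b, d) and (d, e) are assumed values.
PAIRS = {("a", "b"): 5, ("a", "c"): 12, ("a", "d"): 11, ("a", "e"): 10,
         ("b", "c"): 11, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 9, ("c", "e"): 8, ("d", "e"): 5}


def d(x, y, pairs=PAIRS):
    """Symmetric lookup in the table of pairwise distances."""
    return pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]


def three_sums(x, y, u, v, pairs=PAIRS):
    """The three ways to split four points into two pairs, sorted ascending."""
    return sorted([d(x, y, pairs) + d(u, v, pairs),
                   d(x, u, pairs) + d(y, v, pairs),
                   d(x, v, pairs) + d(y, u, pairs)])


def is_additive(labels, pairs=PAIRS):
    """Four-point condition: for every quadruple the two largest sums are equal."""
    return all(s[1] == s[2]
               for s in (three_sums(*q, pairs) for q in combinations(labels, 4)))


if __name__ == "__main__":
    print(comb(len(LABELS), 4), "quadruples")      # n choose 4
    print(three_sums("a", "b", "c", "e"))          # the quadruple from the slide
    print(is_additive(LABELS))
```

For n taxa there are n choose 4 quadruples, so 5 here; the quadruple a, b, c, e yields the sums 13, 21, 21 as on the slide, and under the assumed values the other four quadruples pass as well.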
