Character data Bioinformatics Algorithms (Fundamental Algorithms, - PDF document

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II semester

Phylogenetics II

These slides are partially based on the Lecture Notes from Bielefeld University "Algorithms for Phylogenetic Reconstruction" (2016/17), by J. Stoye, R. Wittler, et al.

Character data

Now the input data consists of states of characters for the given objects, e.g.
• morphological data, e.g. number of toes, reproductive method, type of hip bone, ... or
• molecular data, e.g. which nucleotide occurs at a certain position.

Character data: Example

              $C_1$: # wheels    $C_2$: existence of engine
  bicycle           2                  0
  motorcycle        2                  1
  car               4                  1
  tricycle          3                  0

• objects (species): bicycle, motorcycle, tricycle, car
• characters: number of wheels; existence of an engine
• character states: 2, 3, 4 for $C_1$; 0, 1 for $C_2$ (1 = YES, 0 = NO)
• This matrix $M$ is called a character-state matrix, of dimension $(n \times m)$, where for $1 \le i \le n$, $1 \le j \le m$: $M_{ij}$ = state of character $j$ for object $i$. (Here: $n = 4$, $m = 2$.)

[Figure: (a) tree "invention of engine" with leaves motorcycle (1), car (1), tricycle (0), bicycle (0); (b) tree "number of wheels" with leaves motorcycle (2), bicycle (2), tricycle (3), car (4).]
Two different phylogenetic trees for the same set of objects.

We want to avoid
• parallel evolution (= convergence)
• reversals
Together these two phenomena are also called homoplasies. Mathematical formulation: compatibility.

Compatibility

Definition: A character is compatible with a tree if all inner nodes of the tree can be labeled such that each character state induces one connected subtree.

[Figure: tree (a), "invention of engine", with inner nodes labeled 0, 1, 0.]
This tree is compatible with $C_2$; one possible labeling of the inner nodes is shown.
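The compatibility definition can be checked mechanically once a labeling of the inner nodes is given. The following is a minimal sketch, not the method from the slides: the function name `is_convex_labeling`, the edge-list representation, and the node names `root`, `x`, `y` are assumptions made for illustration, and the topology ((motorcycle, car), (tricycle, bicycle)) is only one plausible reading of tree (a). Note that it tests a single labeling; a character is compatible with a tree as soon as some labeling passes.

```python
from collections import defaultdict, deque


def is_convex_labeling(edges, label):
    """Return True if every state induces one connected subtree.

    edges: list of (u, v) pairs describing an undirected tree.
    label: dict mapping every node (leaves and inner nodes) to its state.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    for state in set(label.values()):
        nodes = {x for x in label if label[x] == state}
        # Breadth-first search restricted to nodes carrying this state.
        start = next(iter(nodes))
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        if seen != nodes:
            return False
    return True


# Assumed topology for tree (a): ((motorcycle, car), (tricycle, bicycle)).
edges_a = [("root", "x"), ("root", "y"),
           ("x", "motorcycle"), ("x", "car"),
           ("y", "tricycle"), ("y", "bicycle")]
# Character C2 (engine) with inner nodes labeled 0, 1, 0.
engine = {"root": 0, "x": 1, "y": 0,
          "motorcycle": 1, "car": 1, "tricycle": 0, "bicycle": 0}
print(is_convex_labeling(edges_a, engine))  # True

# The same character on an assumed topology ((motorcycle, bicycle), (tricycle, car))
# with all inner nodes labeled 0: state 1 is split into two pieces.
edges_b = [("root", "u"), ("root", "v"),
           ("u", "motorcycle"), ("u", "bicycle"),
           ("v", "tricycle"), ("v", "car")]
engine_b = {"root": 0, "u": 0, "v": 0,
            "motorcycle": 1, "bicycle": 0, "tricycle": 0, "car": 1}
print(is_convex_labeling(edges_b, engine_b))  # False
```

A breadth-first search per state keeps the sketch close to the wording of the definition ("each character state induces one connected subtree").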

[Figure: tree (b), "number of wheels", with leaves motorcycle (2), bicycle (2), tricycle (3), car (4).]
Tree (b) is compatible with $C_1$. (We have to give a labeling of the inner nodes to prove this.) It is not compatible with $C_2$ (why?)

[Figure: tree (a), "invention of engine", with leaves motorcycle, car, tricycle, bicycle.]
Tree (a) is also compatible with $C_1$: we have to give a labeling of the inner nodes (w.r.t. $C_1$) to prove this. (Exercise!)

Our goal is to find a tree that is compatible with every character. Such a tree is called a perfect phylogeny.

Perfect Phylogeny

Definition: A tree $T$ is called a perfect phylogeny (PP) for the character-state matrix $M$, with set of characters $\mathcal{C}$, if all characters $C \in \mathcal{C}$ are compatible with $T$.

Example: Here is another input character-state matrix (here $n = 5$, $m = 2$):

              $C_1$   $C_2$
  $\alpha$      A       A
  $\beta$       A       C
  $\gamma$      C       C
  $\delta$      C       G
  $\epsilon$    G       G

[Figure: tree with leaves alpha = AA, beta = AC, gamma = CC, delta = CG, epsilon = GG and inner nodes labeled AC, AC, CC, CG.]
To show that a tree is a PP, we have to find a labeling of the inner nodes such that, for both characters $C_1$ and $C_2$, each state induces a connected subtree.

Theorem: Let $M$ be a character-state matrix of dimension $n \times m$, and for $1 \le i \le m$, let $r_i$ = number of distinct states in column $i$ (i.e. the number of states which actually occur). Then a tree $T$ is a perfect phylogeny (PP) for $M$ if and only if $\mathrm{pc}(T) = \sum_{i=1}^{m} (r_i - 1)$, where $\mathrm{pc}(T)$ is the parsimony cost of $T$.

Example: For the previous example, we have $r_1 = r_2 = 3$, so a tree $T$ is a PP iff $\mathrm{pc}(T) = 2 + 2 = 4$.

Example: For the vehicle example, we have $r_1 = 3$ (states 2, 3, 4) and $r_2 = 2$ (states 0, 1), so a tree $T$ is a PP iff $\mathrm{pc}(T) = 2 + 1 = 3$.

Note: Our tree (a) for the vehicles is a PP, since it is compatible with both $C_1$ and $C_2$.
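The right-hand side of the theorem depends only on the matrix, so it can be computed directly. The sketch below is illustrative only; the function name `pp_cost_bound` and the row-wise encoding of the matrices are assumptions, not part of the slides.

```python
def pp_cost_bound(matrix):
    """Sum over all columns of (number of distinct states - 1).

    matrix: sequence of rows, one per object; each row is a sequence
    of character states (a tuple, a list, or a string).
    """
    columns = zip(*matrix)  # transpose rows into columns
    return sum(len(set(column)) - 1 for column in columns)


# Vehicle example: columns are (# wheels, existence of engine).
vehicles = [(2, 0), (2, 1), (4, 1), (3, 0)]
print(pp_cost_bound(vehicles))  # 3

# Five-taxon example: alpha..epsilon, two nucleotide characters.
taxa = ["AA", "AC", "CC", "CG", "GG"]
print(pp_cost_bound(taxa))  # 4
```

By the theorem, a tree for the vehicles is a perfect phylogeny exactly when its parsimony cost equals 3, and a tree for the five-taxon matrix exactly when its cost equals 4.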

[Figure: tree with leaves alpha = AA, beta = AC, gamma = CC, delta = CG, epsilon = GG and inner nodes labeled AC, AC, CC, CG.]
Why is this tree "perfect"?

Parsimony

What is a best possible tree?
• Ideally, we would like to find a PP for our input data.
• Deciding in general whether a PP exists is NP-hard. (More precisely: for characters with number of states ≥ 4, the PP problem is NP-hard.)
• This doesn't really matter, since most of the time no PP exists anyway. Why: due to homoplasies; because our input data has errors; because our evolutionary model may not be adequate; and so on.
• Therefore we usually want to find a best possible tree.

Why is the tree above "perfect"? Because it has few changes of states!

Definition: The parsimony cost of a phylogenetic tree with labeled inner nodes is the number of state changes along the edges (i.e. the sum of the edge costs, where the cost of an edge = number of characters whose state differs between child and parent).

[Figure: the same labeled tree, with the edges that carry state changes marked in red, each with cost 1.]
In red, we marked the edges where there are state changes (an evolutionary event happened), and how many (in this case, always 1). The parsimony cost of this labeled tree is 4.

Definition: The parsimony cost of a phylogenetic tree (without labels on the inner nodes) is the minimum of the parsimony cost over all possible labelings of the inner nodes.

[Figure: the same tree without inner labels, leaves alpha = AA, beta = AC, gamma = CC, delta = CG, epsilon = GG.]
The parsimony cost of this tree is 4, because the best labeling has cost 4.

Phylogenetic Reconstruction with Character Data

Given a character-state matrix $M$, our goal is to find a phylogenetic tree which minimizes the parsimony cost. We split the problem into two sub-problems:
1. Small Parsimony: Given a phylogenetic tree, find its parsimony cost, i.e. find a most parsimonious labeling of the inner nodes. This problem can be solved efficiently.
2. Large Parsimony or Maximum Parsimony: Find a tree with minimum parsimony cost. This problem is NP-hard.
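The parsimony cost of a tree with labeled inner nodes, as defined above, is a plain sum over edges. The following sketch assumes a (parent, child) edge list and one state string per node; the function name `parsimony_cost`, the inner node names `root`, `u`, `v`, `w`, and the topology are assumptions chosen to match the inner labels AC, AC, CC, CG of the example, since the original figure did not survive extraction.

```python
def parsimony_cost(edges, label):
    """Parsimony cost of a fully labeled tree.

    edges: list of (parent, child) pairs.
    label: dict mapping every node to its string of character states.
    The cost of an edge is the number of positions where parent and
    child differ; the tree cost is the sum over all edges.
    """
    return sum(
        sum(a != b for a, b in zip(label[parent], label[child]))
        for parent, child in edges
    )


# Assumed topology: ((alpha, beta), (gamma, (delta, epsilon))).
edges = [("root", "u"), ("root", "v"),
         ("u", "alpha"), ("u", "beta"),
         ("v", "gamma"), ("v", "w"),
         ("w", "delta"), ("w", "epsilon")]
label = {"root": "AC", "u": "AC", "v": "CC", "w": "CG",
         "alpha": "AA", "beta": "AC", "gamma": "CC",
         "delta": "CG", "epsilon": "GG"}
print(parsimony_cost(edges, label))  # 4
```

Four edges carry exactly one state change each (root to v, u to alpha, v to w, w to epsilon), which agrees with the cost of 4 stated for the labeled tree.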

Small Parsimony

Small Parsimony Problem
Given: a phylogenetic tree $T$ with character states at the leaves.
Find: a labeling of the inner nodes with states, with minimum parsimony cost.

Algorithm: This problem can be solved using Fitch's algorithm, which runs in time $O(nmr)$, where $n$ = number of species, $m$ = number of characters, and $r$ = maximum number of states over all characters.

Maximum Parsimony

Maximum Parsimony Problem
Given a character-state matrix, find a phylogenetic tree with lowest parsimony cost (= a "most parsimonious tree").
• When a PP exists, then it is also the most parsimonious tree.
• In general, the Maximum Parsimony Problem is NP-hard.

Summary for character data
• When the input is a character-state matrix, then we would like to find a tree which is compatible with each character.
• Such a tree is called a perfect phylogeny (PP).
• The perfect phylogeny problem (PPP) is NP-hard (for number of states ≥ 4).
• Usually, no PP exists, therefore in general we are looking for a most parsimonious tree (a tree with lowest parsimony cost).
• The parsimony cost is defined as the minimum number of state changes on the edges over all possible labelings of the inner nodes.

Summary for character data (cont'd)
• The problem of finding a most parsimonious tree (a tree with lowest parsimony cost) is split into Small Parsimony and Maximum Parsimony:
• Small Parsimony can be solved efficiently, e.g. by Fitch's algorithm.
• Maximum Parsimony is NP-hard, so probably no efficient algorithms exist.
• Recall: There are super-exponentially many trees on $n$ taxa (both rooted and unrooted), so we cannot try them all.
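The slides name Fitch's algorithm for Small Parsimony without spelling it out. The following is a minimal sketch of its bottom-up phase only, assuming a rooted binary tree given as nested 2-tuples of leaf names; the function names `fitch_cost` and `_fitch_single` and this tree encoding are assumptions for illustration. It returns the parsimony cost; recovering an actual optimal labeling would need the additional top-down pass, which is omitted here.

```python
def _fitch_single(node, leaf_states, j):
    """Bottom-up Fitch pass for character j.

    Returns (candidate state set for this node, cost of the subtree).
    """
    if isinstance(node, str):  # leaf: its state is fixed
        return {leaf_states[node][j]}, 0
    left, right = node
    left_set, left_cost = _fitch_single(left, leaf_states, j)
    right_set, right_cost = _fitch_single(right, leaf_states, j)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one state change.
    return left_set | right_set, left_cost + right_cost + 1


def fitch_cost(tree, leaf_states):
    """Parsimony cost of a rooted binary tree, summed over all characters."""
    m = len(next(iter(leaf_states.values())))
    return sum(_fitch_single(tree, leaf_states, j)[1] for j in range(m))


leaf_states = {"alpha": "AA", "beta": "AC", "gamma": "CC",
               "delta": "CG", "epsilon": "GG"}
# Assumed topology: ((alpha, beta), (gamma, (delta, epsilon))).
tree = (("alpha", "beta"), ("gamma", ("delta", "epsilon")))
print(fitch_cost(tree, leaf_states))  # 4
```

Each character is processed independently, and each inner node does set operations on at most r states, which is in line with the stated $O(nmr)$ running time. For this tree the cost of 4 equals $\sum (r_i - 1)$, so under the theorem it is a perfect phylogeny for the five-taxon matrix.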
