Bioinformatics Algorithms (Fundamental Algorithms, module 2)

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester

Suffix Trees (and other string indexes)¹

¹ Some of these slides are based on slides by Jens Stoye.

Text indexes

Let T be a string of length n over alphabet Σ (referred to as the text in the following). A text index (or string index) is a data structure built on the text that allows answering a certain type of query (e.g. pattern matching) without traversing the whole text.

Typically, we want
1. the index not to use too much space (linear or sublinear in n), and
2. the query time to be fast (ideally: independent of n).

A common string problem: Pattern matching

Pattern matching (aka exact string matching) is at the core of almost every text-managing application.

Pattern matching: Given a (typically long) string T (the text) and a (typically much shorter) string P (the pattern) over the same alphabet Σ, find all occurrences of P as a substring of T.

Variants:
• all-occurrences version: output all occurrences of P in T
• decision version: decide whether P occurs in T (yes/no)
• counting version: output the number of occurrences of P in T

We usually refer to the number of occurrences of P as occ_P.
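The three variants can be illustrated with a naive scan over the text. This is only a sketch for fixing the definitions, not an efficient algorithm; the function names are made up for illustration, and positions are reported 1-based to match the slides.

```python
def all_occurrences(text, pattern):
    """All-occurrences version: 1-based start positions of pattern in text."""
    n, m = len(text), len(pattern)
    # Compare the pattern against every window of length m (O(n * m) time).
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


def occurs(text, pattern):
    """Decision version: does pattern occur in text at all?"""
    return len(all_occurrences(text, pattern)) > 0


def count_occurrences(text, pattern):
    """Counting version: occ_P, the number of occurrences of pattern."""
    return len(all_occurrences(text, pattern))


print(all_occurrences("ACAGGGCA", "CA"))   # [2, 7]
print(occurs("ACAGGGCA", "TT"))            # False
print(count_occurrences("ACAGGGCA", "GG")) # 2
```

Here the decision and counting versions are simply derived from the all-occurrences version; an index can often answer them faster than it can list every occurrence.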

Pattern matching

Pattern matching (p.m.): text T = T_1 ... T_n of length n, pattern P = P_1 ... P_m of length m
• The best non-index-based algorithms solve this problem in time O(n + m) (e.g. Knuth-Morris-Pratt).
• This is optimal, since one has to read both strings at least once.
• But this is not tolerable with the data sizes we are seeing now!
• That is why we need text indexes.
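As a reminder of what an O(n + m) non-index-based algorithm looks like, here is a compact sketch of Knuth-Morris-Pratt. The slides only name the algorithm, so the function names and the 1-based output convention are assumptions made for this example.

```python
def border_table(pattern):
    """border[i] = length of the longest proper border of pattern[:i + 1]."""
    border = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k
    return border


def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text in O(n + m)."""
    if not pattern:
        return []
    border = border_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = border[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # 0-based end i -> 1-based start
            k = border[k - 1]       # continue, allowing overlaps
    return hits


print(kmp_search("BANANA", "ANA"))  # [2, 4]
```

Note that every query still scans the whole text, which is exactly the cost a text index is meant to avoid.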

The k-mer index

The k-mer index

Recall that a k-mer (or k-gram) is a string of length k. Earlier in this course, we saw the k-mer profile P_k(s) (or q-gram profile) of a string s.

Ex. s = ACAGGGCA; the table below is P_2(s).

  r    u_r   P_2(s)
  0    AA    0
  1    AC    1
  2    AG    1
  3    AT    0
  4    CA    2
  5    CC    0
  6    CG    0
  7    CT    0
  8    GA    0
  9    GC    1
 10    GG    2
 11    GT    0
 12    TA    0
 13    TC    0
 14    TG    0
 15    TT    0
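The profile can be computed in a few lines. The sketch below assumes the DNA alphabet ACGT in lexicographic order, so that the rank r of a k-mer is its value as a base-4 number; the helper names are illustrative only.

```python
def kmer_rank(kmer, alphabet="ACGT"):
    """Rank r of a k-mer: its value read as a number in base len(alphabet)."""
    rank = 0
    for ch in kmer:
        rank = rank * len(alphabet) + alphabet.index(ch)
    return rank


def kmer_profile(s, k, alphabet="ACGT"):
    """P_k(s): for each rank r, the number of occurrences of k-mer u_r in s."""
    profile = [0] * (len(alphabet) ** k)
    for i in range(len(s) - k + 1):
        profile[kmer_rank(s[i:i + k], alphabet)] += 1
    return profile


print(kmer_profile("ACAGGGCA", 2))
# [0, 1, 1, 0, 2, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0]
```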

The k-mer index

k-mer index of s: Replacing the number of occurrences by the occurrences themselves, we get the k-mer index of s.

Ex. s = ACAGGGCA; the table below is the 2-mer index of s.

  r    u_r   occurrences
  0    AA
  1    AC    1
  2    AG    3
  3    AT
  4    CA    2, 7
  5    CC
  6    CG
  7    CT
  8    GA
  9    GC    6
 10    GG    4, 5
 11    GT
 12    TA
 13    TC
 14    TG
 15    TT

Analysis (for p.m.)
• Space: total space is O(σ^k + n), where σ = |Σ|, since the number of rows is σ^k and the total number of entries is n − k + 1.
• Time (p.m.): O(k) for decision, O(k + occ_P) for all-occurrences.
• N.B.: works only for patterns of length exactly k.
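A minimal sketch of building and querying such an index, again assuming the alphabet ACGT and 1-based positions as on the slide. The names build_kmer_index and lookup are hypothetical, chosen for this example.

```python
def kmer_rank(kmer, alphabet="ACGT"):
    """Rank r of a k-mer: its value read as a number in base len(alphabet)."""
    rank = 0
    for ch in kmer:
        rank = rank * len(alphabet) + alphabet.index(ch)
    return rank


def build_kmer_index(s, k, alphabet="ACGT"):
    """One row per possible k-mer (sigma^k rows), n - k + 1 entries in total."""
    index = [[] for _ in range(len(alphabet) ** k)]
    for i in range(len(s) - k + 1):
        index[kmer_rank(s[i:i + k], alphabet)].append(i + 1)  # 1-based
    return index


def lookup(index, pattern, alphabet="ACGT"):
    """All occurrences of a pattern of length exactly k, in O(k + occ_P)."""
    if len(alphabet) ** len(pattern) != len(index):
        raise ValueError("pattern length must equal the k of the index")
    return index[kmer_rank(pattern, alphabet)]


index = build_kmer_index("ACAGGGCA", 2)
print(lookup(index, "CA"))        # [2, 7]
print(bool(lookup(index, "TT")))  # False (decision version)
```

The length check in lookup makes the limitation from the analysis explicit: a pattern of any other length cannot be answered by this index directly.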

The suffix tree

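The suffix tree

T = BANANA$ (add sentinel character $ ∉ Σ)

[Figure: the suffix tree of BANANA$, drawn once with its (only conceptual) edge labels and once with each label replaced by two pointers into the string. Edges from the root: $ [7,7] to leaf 7; A [2,2]; BANANA$ [1,7] to leaf 1; NA [3,4]. Below the A edge: $ [7,7] to leaf 6, and NA [3,4], which branches into $ [7,7] (leaf 4) and NA$ [5,7] (leaf 2). Below the root's NA edge: $ [7,7] (leaf 5) and NA$ [5,7] (leaf 3).]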

The suffix tree

Given a string T over Σ (finite ordered alphabet), and $ ∉ Σ.

Definitions
• ST(T) is a rooted tree with edge labels from (Σ ∪ {$})^+ such that
  • the labels of all edges outgoing from a node begin with different characters;
  • the paths from the root to the leaves of ST(T) spell the suffixes of T$;
  • each node in ST(T) is either the root, a leaf, or a branching node;
• L(u) is the path-label of node u: the concatenation of the edge labels on the path from the root to u;
• a leaf v has leaf-label i if and only if L(v) = T_i ... T_n $ (the i'th suffix);
• sd(v), the string-depth of a node v, is the length of its path-label;
• a locus (u, d) is a position on an edge (v, u), where u is a node of ST(T) and sd(v) < d ≤ sd(u); d is the string-depth of the locus (u, d).

• N.B.: the edge labels are not stored explicitly:
  • they are represented by two pointers [b, e] into T: the beginning and end of an occurrence of the edge label;
  • this representation is not necessarily unique;
  • e.g. in the example, any edge with label NA can be represented by [3,4] or [5,6].

[Figure: the suffix tree of BANANA$ again, with conceptual edge labels and with pointer pairs into the string.]
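To make the definitions and the pointer representation concrete, here is a sketch that builds a suffix tree by inserting the suffixes one at a time. This naive construction takes quadratic time in the worst case and is not a linear-time algorithm; it is an illustration chosen for brevity, and the class and function names are assumptions. It also assumes that the sentinel does not occur in the text. Edge labels are stored only as 1-based, inclusive pointer pairs [b, e].

```python
class Node:
    def __init__(self, b=0, e=0, leaf_label=None):
        self.b, self.e = b, e         # label of the edge entering this node
        self.children = {}            # first character of edge label -> child
        self.leaf_label = leaf_label  # i for the leaf of the i'th suffix


def build_suffix_tree(text, sentinel="$"):
    t = text + sentinel
    n = len(t)
    root = Node()
    for i in range(1, n + 1):  # insert the i'th suffix t[i..n]
        node, j = root, i      # j: next position of the suffix to match
        while True:
            child = node.children.get(t[j - 1])
            if child is None:  # no edge starts with t[j]: add a new leaf
                node.children[t[j - 1]] = Node(j, n, leaf_label=i)
                break
            # Match along the edge label t[child.b .. child.e].
            edge_len = child.e - child.b + 1
            k = 0
            while k < edge_len and t[child.b - 1 + k] == t[j - 1 + k]:
                k += 1
            if k == edge_len:  # whole edge matched: descend
                node, j = child, j + k
                continue
            # Mismatch inside the edge: split it at a new branching node.
            mid = Node(child.b, child.b + k - 1)
            node.children[t[j - 1]] = mid
            child.b += k
            mid.children[t[child.b - 1]] = child
            mid.children[t[j + k - 1]] = Node(j + k, n, leaf_label=i)
            break
    return root


def leaves(node, t, path=""):
    """Yield (leaf-label, path-label) for every leaf below node."""
    if node.leaf_label is not None:
        yield node.leaf_label, path
    for child in node.children.values():
        yield from leaves(child, t, path + t[child.b - 1:child.e])


root = build_suffix_tree("BANANA")
for label, suffix in sorted(leaves(root, "BANANA$")):
    print(label, suffix)
```

On BANANA$ this sketch happens to produce the pointer pairs shown in the figure, for instance [2,2] for the edge A and [3,4] for both NA edges. Since the representation is not unique, a different construction could equally store [5,6] for an NA edge.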

Suffix tree properties
• The leaves of ST(T) correspond to the suffixes of T$.
