Bioinformatics Algorithms (Fundamental Algorithms, module 2)

  1. Bioinformatics Algorithms (Fundamental Algorithms, module 2). Zsuzsanna Lipták. Masters in Medical Bioinformatics, academic year 2018/19, II. semester. Suffix Trees (and other string indexes). Some of these slides are based on slides by Jens Stoye.

  2. Text indexes. Let T be a string of length n over an alphabet Σ (we refer to T as the text in the following). A text index (or string index) is a data structure built on the text that allows us to answer a certain type of query (e.g. pattern matching) without traversing the whole text. Typically, we want (1) the index not to use too much space (linear or sublinear in n), and (2) the query time to be fast (ideally: independent of n).

  5. A common string problem: Pattern matching. Pattern matching (aka exact string matching) is at the core of almost every text-managing application.
     Pattern matching: Given a (typically long) string T (the text) and a (typically much shorter) string P (the pattern) over the same alphabet Σ, find all occurrences of P as a substring of T.
     Variants:
     • output all occurrences of P in T ("all-occurrences version")
     • decide whether P occurs in T, yes or no ("decision version")
     • output the number of occurrences of P in T ("counting version")
     We usually refer to the number of occurrences of P as occ_P.
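
The three variants above can be illustrated with a minimal Python sketch. The function names are hypothetical (they do not come from the slides), and the code simply relies on Python's built-in `str.find`, so it says nothing about how an index would answer these queries. Positions are reported 1-based, as on the slides.

```python
from typing import List


def occurrences(text: str, pattern: str) -> List[int]:
    """All-occurrences version: 1-based start positions of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)            # convert 0-based to 1-based
        start = text.find(pattern, start + 1)  # step by one so overlaps are found
    return positions


def occurs(text: str, pattern: str) -> bool:
    """Decision version: does pattern occur in text at all?"""
    return pattern in text


def count_occurrences(text: str, pattern: str) -> int:
    """Counting version: the number occ_P of occurrences of pattern in text."""
    return len(occurrences(text, pattern))
```

For T = BANANA and P = ANA, the two occurrences overlap; the all-occurrences version reports both start positions.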

  9. Pattern matching (p.m.). Text: T = T_1 ... T_n of length n; pattern: P = P_1 ... P_m of length m.
     • The best non-index-based algorithms solve this problem in time O(n + m) (e.g. Knuth-Morris-Pratt).
     • This is optimal, since one has to read both strings at least once.
     • But it is not tolerable with the data sizes we are seeing now!
     • That is why we need text indexes.
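
As an illustration of an O(n + m) non-index-based algorithm, here is a minimal sketch of Knuth-Morris-Pratt in Python. The slides only name the algorithm; the function name `kmp_search` and the variable names are assumptions made for this example. Note that the whole text is scanned on every query, which is exactly the cost a text index is meant to avoid.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []

    # border[i] = length of the longest proper border of pattern[:i + 1]
    # (a border is a string that is both a prefix and a suffix).
    border = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k

    # Scan the text once; k is the number of pattern characters currently matched.
    result = []
    k = 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = border[k - 1]          # fall back to the next shorter border
        if c == pattern[k]:
            k += 1
        if k == m:                     # full match ending at 0-based index i
            result.append(i - m + 2)   # 1-based start position
            k = border[k - 1]          # continue, allowing overlapping matches
    return result
```

Preprocessing the pattern takes O(m) time and the scan takes O(n) time, which gives the O(n + m) bound stated above.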

  10. The k-mer index

  11. The k-mer index. Recall that a k-mer (or k-gram) is a string of length k. Earlier in this course, we saw the k-mer profile P_k(s) (or q-gram profile) of a string s. Example: for s = ACAGGGCA, the table gives P_2(s), where r is the rank of the 2-mer u_r.

       r   u_r   P_2(s)
       0   AA    0
       1   AC    1
       2   AG    1
       3   AT    0
       4   CA    2
       5   CC    0
       6   CG    0
       7   CT    0
       8   GA    0
       9   GC    1
      10   GG    2
      11   GT    0
      12   TA    0
      13   TC    0
      14   TG    0
      15   TT    0
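
The profile in the table can be recomputed with a short Python sketch. It assumes the DNA alphabet in the order A, C, G, T and reads the rank r of a k-mer as its value as a base-4 number, which matches the r column above; the names `ALPHABET`, `rank` and `kmer_profile` are hypothetical.

```python
ALPHABET = "ACGT"  # assumed alphabet order; it determines the ranks r


def rank(kmer):
    """Rank of a k-mer: its value read as a number in base len(ALPHABET)."""
    r = 0
    for c in kmer:
        r = r * len(ALPHABET) + ALPHABET.index(c)
    return r


def kmer_profile(s, k):
    """k-mer profile P_k(s): entry r counts the occurrences of the k-mer of rank r."""
    profile = [0] * (len(ALPHABET) ** k)
    for i in range(len(s) - k + 1):
        profile[rank(s[i:i + k])] += 1
    return profile
```

Calling `kmer_profile("ACAGGGCA", 2)` gives a list of 16 counts whose entries at ranks 1, 2, 4, 9 and 10 (AC, AG, CA, GC, GG) are 1, 1, 2, 1 and 2.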

  15. The k-mer index of s. Replacing the number of occurrences by the occurrences themselves, we get the k-mer index of s. Example: for s = ACAGGGCA, the table gives the 2-mer index of s (empty entries mean that the 2-mer does not occur).

       r   u_r   occurrences
       0   AA
       1   AC    1
       2   AG    3
       3   AT
       4   CA    2, 7
       5   CC
       6   CG
       7   CT
       8   GA
       9   GC    6
      10   GG    4, 5
      11   GT
      12   TA
      13   TC
      14   TG
      15   TT

      Analysis (for p.m.)
      Space: total space is O(σ^k + n), where σ = |Σ|, since the number of rows is σ^k and the total number of entries is n − k + 1.
      Time (p.m.): O(k) for decision, O(k + occ_P) for all-occurrences.
      N.B.: works only for patterns of length exactly k.
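
A minimal Python sketch of the k-mer index and its lookup, under the same assumptions as the table (alphabet order A, C, G, T; rank = base-4 value; 1-based positions). All names (`ALPHABET`, `rank`, `kmer_index`, `lookup`) are hypothetical and chosen for this example only.

```python
ALPHABET = "ACGT"  # assumed alphabet order; it determines the ranks r


def rank(kmer):
    """Rank of a k-mer: its value read as a number in base len(ALPHABET)."""
    r = 0
    for c in kmer:
        r = r * len(ALPHABET) + ALPHABET.index(c)
    return r


def kmer_index(s, k):
    """Table with sigma^k rows; row r lists the 1-based occurrences of k-mer u_r."""
    index = [[] for _ in range(len(ALPHABET) ** k)]
    for i in range(len(s) - k + 1):
        index[rank(s[i:i + k])].append(i + 1)
    return index


def lookup(index, pattern):
    """All occurrences of a pattern whose length is exactly the k of the index."""
    return index[rank(pattern)]
```

Computing the rank takes O(k) time, and reporting the stored list takes time proportional to occ_P, which matches the bounds in the analysis. The table itself has σ^k rows regardless of n, which is why this index is practical only for small k.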

  16. The suffix tree

  17. The suffix tree. T = BANANA$ (add sentinel character $ ∉ Σ). [Figure: the suffix tree of BANANA$ with leaves labeled 1 to 7, drawn twice: once with the edge labels written out as strings ("labels only conceptual!") and once with each edge label given as two pointers into the string, e.g. [1,7] for BANANA$, [2,2] for A, [3,4] for NA, [5,7] for NA$ and [7,7] for $.]

  22. The suffix tree. Given a string T over Σ (a finite ordered alphabet) and $ ∉ Σ.
      Definitions:
      • ST(T) is a rooted tree with edge labels from (Σ ∪ {$})^+ such that
        • the labels of all edges outgoing from a node begin with different characters;
        • the paths from the root to the leaves of ST(T) spell the suffixes of T$;
        • each node in ST(T) is either the root, a leaf, or a branching node;
      • L(u) is the path-label of node u: the concatenation of the edge labels on the path from the root to u;
      • a leaf v has leaf-label i if and only if L(v) = T_i ... T_n $ (the i'th suffix);
      • sd(v), the string-depth of a node v, is the length of its path-label;
      • a locus (u, d) is a position on an edge (v, u), where u is a node of ST(T) and sd(v) < d ≤ sd(u); d is the string-depth of the locus (u, d).

  23. N.B.: the edge labels are not stored explicitly:
      • they are represented by two pointers [b, e] into T: the beginning and end of an occurrence of the edge label;
      • this representation is not necessarily unique;
      • e.g. in the example, any edge with label NA can be represented by [3,4] or [5,6].
      [Figure: the suffix tree of BANANA$ with leaves 1 to 7, shown with conceptual string labels ("labels only conceptual!") and with the corresponding pointer pairs ("two pointers into string").]
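
To make the pointer representation concrete, here is a minimal Python sketch that builds the suffix tree of T$ by inserting the suffixes one at a time and stores every edge label as a pair [b, e] of 1-based positions. This naive construction takes O(n^2) time in the worst case and is not claimed to be the construction used in the course; it assumes that $ does not occur in the text. The names `Node`, `build_suffix_tree` and `find_occurrences` are hypothetical.

```python
class Node:
    def __init__(self, b=0, e=0, leaf_label=None):
        self.b = b                    # label of the edge into this node is T[b..e],
        self.e = e                    # 1-based and inclusive (unused for the root)
        self.children = {}            # first character of edge label -> child node
        self.leaf_label = leaf_label  # i for the leaf of the i'th suffix, else None


def build_suffix_tree(text):
    """Naive construction of ST(text); returns the root and the string T$."""
    T = text + "$"                    # sentinel, assumed not to occur in text
    n = len(T)
    root = Node()
    for i in range(1, n + 1):         # insert the suffix T[i..n]
        node, j = root, i             # j = next position of the suffix to match
        while True:
            c = T[j - 1]
            child = node.children.get(c)
            if child is None:         # no edge starts with c: add a new leaf
                node.children[c] = Node(j, n, leaf_label=i)
                break
            p = child.b               # walk along the edge label T[child.b..child.e]
            while p <= child.e and T[p - 1] == T[j - 1]:
                p += 1
                j += 1
            if p > child.e:           # whole edge matched: continue below child
                node = child
                continue
            mid = Node(child.b, p - 1)  # mismatch inside the edge: split it at p
            node.children[c] = mid
            child.b = p
            mid.children[T[p - 1]] = child
            mid.children[T[j - 1]] = Node(j, n, leaf_label=i)
            break
    return root, T


def find_occurrences(root, T, pattern):
    """Sorted 1-based start positions of pattern, read off the leaf labels."""
    node, j = root, 0                 # j = number of pattern characters matched
    while j < len(pattern):
        child = node.children.get(pattern[j])
        if child is None:
            return []
        p = child.b
        while p <= child.e and j < len(pattern):
            if T[p - 1] != pattern[j]:
                return []
            p += 1
            j += 1
        node = child
    stack, result = [node], []        # collect all leaves below the reached node
    while stack:
        v = stack.pop()
        if v.leaf_label is not None:
            result.append(v.leaf_label)
        stack.extend(v.children.values())
    return sorted(result)
```

For the text BANANA, this sketch produces the pointer pairs [1,7], [2,2], [3,4] and [7,7] on the edges leaving the root, as in the figure. Which of the equivalent pairs (such as [3,4] or [5,6] for NA) ends up on an edge depends on the insertion order.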

  24. Suffix tree properties
      • The leaves of ST(T) correspond to the suffixes of T$.
