
Yangjun Chen, Yujia Wu Department of Applied Computer Science - PowerPoint PPT Presentation



  1. BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches
     Yangjun Chen, Yujia Wu
     Department of Applied Computer Science, University of Winnipeg

  2. Outline
     - Motivation
       - Statement of Problem
       - Related work
     - BWT Arrays – A space-economic Index for String Matching
     - String Matching with k Mismatches
       - Search trees
       - Mismatching information
       - Mismatching trees
     - Experiments
     - Conclusion and Future Work

  3. Statement of Problem
     - String matching with k mismatches: find all the occurrences of a pattern string r in a target string s, with each occurrence having up to k positions that differ between r and s.
     - In DNA databases, due to polymorphisms or mutations among individuals or even sequencing errors, a read (a short sample DNA sequence) may disagree in some positions at any of its occurrences in a genome.
     - Example (k = 4):
         pattern: a a a a a c a a a c
         target:  a c a c a c a g a a g c c c
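The problem statement can be made concrete with a brute-force baseline. The sketch below is not the method of the slides; it simply counts differing positions in every window of the target. The function name `k_mismatch_occurrences` and the 0-based offsets are assumptions made for illustration, as is reading the first sequence of the example as the pattern and the second as the target.

```python
def k_mismatch_occurrences(r, s, k):
    """Return the 0-based start offsets where r occurs in s with at most k mismatches.

    Brute-force baseline (O(|r| * |s|) time), used only to illustrate the
    problem definition; it does not use any index.
    """
    m = len(r)
    hits = []
    for start in range(len(s) - m + 1):
        mismatches = 0
        for a, b in zip(r, s[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Pattern and target taken from the k = 4 example.
    print(k_mismatch_occurrences("aaaaacaaac", "acacacagaagccc", 4))
```

On the example, the windows starting at offsets 0 and 2 each differ from the pattern in exactly 4 positions, so both are reported.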

  4. Related Work
     - Exact string matching
       - On-line algorithms: Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick
       - Index based: suffix trees (Weiner; McCreight; Ukkonen), suffix arrays (Manber, Myers), BWT transformation (Burrows-Wheeler), Hash (Karp, Rabin)
     - Inexact string matching
       - String matching with k mismatches, Hamming distance (Landau, Vishkin; Amir et al.; Cole)
       - String matching with k differences, Levenshtein distance (Chang, Lampe)
       - String matching with wild-cards (Manber, Baeza-Yates)

  5. BWT-Index
     - Burrows-Wheeler Transform (BWT) of s = a1 c1 a2 g1 a3 c2 a4 $ (the subscripts number the occurrences of each character).
     - BWT construction: the rotations of s are sorted; F is the first column and L the last column of the sorted rotations, with
         L[i] = $,              if SA[i] = 1;
         L[i] = s[SA[i] - 1],   otherwise,
       where SA[...] is the suffix array.
     - Rank correspondence: rk_F(e) = rk_L(e). For example, a1 has rank 3 in both F and L.

         i   F    rk_F   L    rk_L   SA
         1   $    -      a4   1      8
         2   a4   1      c2   1      7
         3   a3   2      g1   1      5
         4   a1   3      $    -      1
         5   a2   4      c1   2      3
         6   c2   1      a3   2      6
         7   c1   2      a1   3      2
         8   g1   1      a2   4      4
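The construction rule for L can be checked with a few lines of Python. This is a minimal sketch that sorts suffixes naively (fine for a slide-sized string, not for a genome); the function name `build_bwt` is an assumption, and SA values are kept 1-based to match the slide.

```python
def build_bwt(s):
    """Return (SA, F, L) for a string s that ends with the sentinel '$'.

    SA holds 1-based suffix start positions, as on the slide.
    L[i] = '$' if SA[i] = 1, and s[SA[i] - 1] (1-based) otherwise.
    """
    n = len(s)
    # Naive suffix sorting; '$' compares smaller than the letters in ASCII.
    sa = sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    f_col = "".join(s[i - 1] for i in sa)
    l_col = "".join("$" if i == 1 else s[i - 2] for i in sa)
    return sa, f_col, l_col


if __name__ == "__main__":
    sa, f_col, l_col = build_bwt("acagaca$")
    print(sa)     # suffix array
    print(f_col)  # first column F
    print(l_col)  # last column L (the BWT array)
```

For s = acagaca$ this gives SA = 8, 7, 5, 1, 3, 6, 2, 4 and L = acg$caaa, which agrees with the F, L and SA columns shown on the later slides.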

  6. Backward Search of BWT-Index
     s = a1 c1 a2 g1 a3 c2 a4 $
     search(z, π) = <z, [α, β]>,  if z appears in L_π;
                    ∅,            otherwise.
     - z: a character
     - π: a range in F
     - L_π: a range in L, corresponding to π
     Example: search p = aca by backward search over the following columns.

         F    L    SA
         $    a4   8
         a4   c2   7
         a3   g1   5
         a1   $    1
         a2   c1   3
         c2   a3   6
         c1   a1   2
         g1   a2   4

  7. Backward Search of BWT-Index
     - Calls: search(c, <a, [2, 5]>), then search(a, <c, [1, 2]>)
     - Search sequence: <a, [2, 5]> → <c, [1, 2]> → <a, [3, 4]>
     [Figure: the F, L and suffix array columns of s = a1 c1 a2 g1 a3 c2 a4 $, repeated for each step of the search with the current range highlighted.]
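The search for p = aca can be traced with a small backward-search sketch. It is written under stated assumptions: the function name `backward_search` is illustrative, occurrences are counted by scanning L directly (no rank arrays), and every range is reported as 1-based rows of F. Under that convention the middle step covers rows 6 to 7, which are the rows of the two c's that the slide records by rank as [1, 2].

```python
def backward_search(l_col, pattern):
    """Return the 1-based inclusive row range (top, bot) of F whose suffixes
    start with `pattern`, or None if the pattern does not occur.

    l_col is the BWT array L of a '$'-terminated string.
    """
    # smaller[c] = number of characters in the text strictly smaller than c
    smaller = {c: sum(ch < c for ch in l_col) for c in set(l_col)}
    top, bot = 1, len(l_col)
    for z in reversed(pattern):          # process the pattern right to left
        if z not in smaller:
            return None
        # Count z in L[1 .. top-1] and L[1 .. bot] by direct scanning.
        top = smaller[z] + l_col[:top - 1].count(z) + 1
        bot = smaller[z] + l_col[:bot].count(z)
        if top > bot:
            return None
    return top, bot


if __name__ == "__main__":
    print(backward_search("acg$caaa", "aca"))
```

With L = acg$caaa the three steps produce the row ranges [2, 5], [6, 7] and [3, 4]; rows 3 and 4 have suffix array values 5 and 1, the two places where aca starts in s.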

  8. rankAll
     - Arrange |Σ| arrays, one for each character X ∈ Σ, such that A_X[i] (the i-th entry in the array for X) is the number of appearances of X within L[1 .. i].
     - Instead of scanning a certain segment L[α .. β] (α ≤ β) to find a subrange for a certain X ∈ Σ, we can simply look up A_X to see whether A_X[α - 1] = A_X[β]. If that is the case, then X does not occur in L[α .. β]. Otherwise, [A_X[α - 1] + 1, A_X[β]] should be the found range.
     - Example:

         i   A_$   A_a   A_c   A_g   A_t   F    L
         1   0     1     0     0     0     $    a4
         2   0     1     1     0     0     a4   c2
         3   0     1     1     1     0     a3   g1
         4   1     1     1     1     0     a1   $
         5   1     1     2     1     0     a2   c1
         6   1     2     2     1     0     c2   a3
         7   1     3     2     1     0     c1   a1
         8   1     4     2     1     0     g1   a2

       To find the first and the last appearance of c in L[2 .. 5], we only need to find A_c[2 - 1] = A_c[1] = 0 and A_c[5] = 2. So the corresponding range is [A_c[2 - 1] + 1, A_c[5]] = [1, 2].
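The lookup rule can be sketched directly. The names `build_rank_all` and `rank_range` are assumptions; each array gets an extra entry at index 0 (value 0) so that A_X[α - 1] is defined when α = 1.

```python
def build_rank_all(l_col, alphabet):
    """Build one array per character X with A[X][i] = number of X in L[1 .. i].

    Index 0 is a sentinel holding 0, so arrays have len(L) + 1 entries.
    """
    arrays = {x: [0] * (len(l_col) + 1) for x in alphabet}
    for i, ch in enumerate(l_col, start=1):
        for x in alphabet:
            arrays[x][i] = arrays[x][i - 1] + (1 if ch == x else 0)
    return arrays


def rank_range(arrays, x, alpha, beta):
    """Return the rank range of x inside L[alpha .. beta] (1-based, inclusive),
    or None if x does not occur there."""
    lo, hi = arrays[x][alpha - 1], arrays[x][beta]
    return None if lo == hi else (lo + 1, hi)


if __name__ == "__main__":
    arrays = build_rank_all("acg$caaa", "$acgt")
    print(rank_range(arrays, "c", 2, 5))  # the example from the slide
```

For the slide's example this returns the range (1, 2), and the array for a comes out as 1, 1, 1, 1, 1, 2, 3, 4, matching the A_a column.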

  9. Reduce rankAll-Index Size
     - F-ranks: F_α = <α; x_α, y_α>, where x_α and y_α are the first and last positions of α in F.
     - BWT array: L
     - Reduced appearance array: A_α with bucket size β.
     - Reduced suffix array: SA* with bucket size γ.
     - Find a range:
         top ← F(x_α) + A_α[⌊(top - 1)/β⌋] + r + 1
         bot ← F(x_α) + A_α[⌊bot/β⌋] + r'
       where r is the number of α's appearances in L between the bucket boundary at ⌊(top - 1)/β⌋ and position top - 1, and r' is the number of α's appearances in L between the bucket boundary at ⌊bot/β⌋ and position bot.
     - Example (full arrays, before reduction):

         i   A_$   A_a   A_c   A_g   A_t   F    L    rk_L   SA
         1   0     1     0     0     0     $    a4   1      8
         2   0     1     1     0     0     a4   c2   1      7
         3   0     1     1     1     0     a3   g1   1      5
         4   1     1     1     1     0     a1   $    -      1
         5   1     1     2     1     0     a2   c1   2      3
         6   1     2     2     1     0     c2   a3   2      6
         7   1     3     2     1     0     c1   a1   3      2
         8   1     4     2     1     0     g1   a2   4      4

       F-ranks: F_$ = <$; 1, 1>, F_a = <a; 2, 5>, F_c = <c; 6, 7>, F_g = <g; 8, 8>
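The space saving comes from keeping only every bucket-th entry of each appearance array and recounting the remainder by a short scan of L. The sketch below illustrates that idea under assumptions of its own: the names `build_checkpoints` and `occ`, the parameter `bucket`, and the choice that checkpoint j stores the count for L[1 .. j * bucket] are not taken from the slides.

```python
def build_checkpoints(l_col, alphabet, bucket):
    """Keep, for each character, its running count at positions bucket, 2*bucket, ...

    cp[x][j] = number of x in L[1 .. j * bucket]; cp[x][0] = 0.
    """
    counts = {x: 0 for x in alphabet}
    cp = {x: [0] for x in alphabet}
    for i, ch in enumerate(l_col, start=1):
        if ch in counts:
            counts[ch] += 1
        if i % bucket == 0:
            for x in alphabet:
                cp[x].append(counts[x])
    return cp


def occ(l_col, cp, bucket, x, i):
    """Number of x in L[1 .. i]: nearest checkpoint at or before i, plus a
    scan of at most bucket - 1 characters of L (the quantity r on the slide)."""
    j = i // bucket
    r = l_col[j * bucket:i].count(x)
    return cp[x][j] + r


if __name__ == "__main__":
    l_col = "acg$caaa"
    cp = build_checkpoints(l_col, "$acgt", 3)
    print(occ(l_col, cp, 3, "a", 7))  # count of 'a' in L[1 .. 7]
```

A larger bucket stores fewer counters but scans more of L per lookup, which is the trade-off the reduced arrays expose.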

  10. String Matching with k Mismatches
      - Search Trees
      - pattern: r = tcaca; target: s = acagaca; k = 2.
      [Figure: search tree T for r over the BWT index of s. The root is v0 = <-, [1, 8]>; each level corresponds to one pattern character, from r[1] = t down to r[5] = a; the nodes v1 to v19 are pairs such as <a, [1, 4]>, <c, [1, 2]> and <g, [1, 1]>; the root-to-leaf paths are labelled P1 to P4.]
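A search tree of this kind can be explored with a depth-first search that tries every character at each pattern position and charges one mismatch whenever the tried character differs from the pattern. The sketch below shows only that plain exploration, not the mismatching-tree optimisation of the talk. Several choices are assumptions made here: the index is built over the reversed target so that the pattern can be consumed from r[1] to r[m], ranges are 0-based and half-open, occurrences are counted by scanning L, and the name `k_mismatch_bwt` is illustrative.

```python
def k_mismatch_bwt(r, s, k):
    """Return sorted 0-based start offsets in s where r matches with <= k mismatches,
    found by depth-first backward search over the BWT of reversed(s) + '$'."""
    t = s[::-1] + "$"
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])   # 0-based suffix array (naive)
    l_col = "".join(t[i - 1] for i in sa)        # t[-1] is '$' when i == 0
    alphabet = sorted(set(s))
    smaller = {c: sum(ch < c for ch in t) for c in alphabet}
    hits = set()

    def descend(depth, top, bot, mismatches):
        # [top, bot) is the current row range; depth characters of r are consumed.
        if depth == len(r):
            for row in range(top, bot):
                end_in_s = len(s) - sa[row]      # exclusive end of the occurrence in s
                hits.add(end_in_s - len(r))
            return
        for c in alphabet:                       # one child per character
            cost = mismatches + (c != r[depth])
            if cost > k:
                continue                         # prune: mismatch budget exceeded
            new_top = smaller[c] + l_col[:top].count(c)
            new_bot = smaller[c] + l_col[:bot].count(c)
            if new_top < new_bot:
                descend(depth + 1, new_top, new_bot, cost)

    descend(0, 0, n, 0)
    return sorted(hits)


if __name__ == "__main__":
    print(k_mismatch_bwt("tcaca", "acagaca", 2))
```

For r = tcaca, s = acagaca and k = 2 the search reports offsets 0 and 2: the windows acaga and agaca each differ from r in two positions, while cagac differs in all five.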

  11. String Matching with k Mismatches
      - Mismatching information
      - R: mismatching table for r with |r| = m.
      - R_ij: contains the positions of the first 2k + 1 mismatches between r[i .. m - q + i] and r[j .. m - q + j], where q = max{i, j}, such that if R_ij[l] = x (≠ ∅) then r[i + x - 1] ≠ r[j + x - 1] or one of them does not exist, and it is the l-th mismatch between them.
      - Example (r = tcacg, with r_i denoting the suffix of r starting at position i):

          R_12: r_1 = tcacg, r_2 = cacg   →   1 2 3 4
          R_13: r_1 = tcacg, r_3 = acg    →   1 3
          R_14: r_1 = tcacg, r_4 = cg     →   1 2
          R_15: r_1 = tcacg, r_5 = g      →   1
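The table entries of the example can be reproduced with a short function. This is a sketch under assumptions: the name `mismatch_positions` and the `limit` parameter (standing for 2k + 1) are illustrative, and only the overlapping part r[i .. m - q + i] versus r[j .. m - q + j] is compared, so the "does not exist" case never arises here.

```python
def mismatch_positions(r, i, j, limit):
    """Return the first `limit` positions x (1-based) with r[i + x - 1] != r[j + x - 1],
    comparing r[i .. m - q + i] with r[j .. m - q + j], q = max(i, j).

    i and j are 1-based, as on the slide.
    """
    m = len(r)
    q = max(i, j)
    positions = []
    for x in range(1, m - q + 2):              # the overlap has m - q + 1 characters
        if r[i + x - 2] != r[j + x - 2]:       # convert 1-based r[i + x - 1] to 0-based
            positions.append(x)
            if len(positions) == limit:
                break
    return positions


if __name__ == "__main__":
    r, k = "tcacg", 2
    for j in range(2, len(r) + 1):
        print(f"R_1{j} =", mismatch_positions(r, 1, j, 2 * k + 1))
```

Running it for r = tcacg yields the four rows of the example: 1 2 3 4, then 1 3, then 1 2, then 1.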

  12. String Matching with k Mismatches
      - Derivation of mismatching information
      - We store only part of the mismatching information, specifically R_12, …, R_1m, while all the other mismatching information is derived dynamically.
      - Example: derive the mismatching information between β1 = r[2 .. 5] = cacg and β2 = r[3 .. 5] = acg from R_12 and R_13.
          A1 = R_12: 1 2 3 4   (scanned with pointer p)
          A2 = R_13: 1 3       (scanned with pointer q)
        The two arrays are merged in three steps. Where both arrays contain the same position, the characters are compared directly: β1[1] = c ≠ β2[1] = a, and β1[3] = c ≠ β2[3] = g.
      [Figure: Steps 1 to 3 of the merge, showing the positions of p and q in A1 and A2 and the resulting array A.]

  13. String Matching with k Mismatches
      - Algorithm for derivation of mismatching information
      - Let α, β1 and β2 be three strings. Let A1 and A2 be two arrays containing all the positions of mismatches between α and β1, and between α and β2, respectively.
      - Create a new array A such that if A[i] = j (≠ ∅), then β1[j] ≠ β2[j], or one of them does not exist. It is the i-th mismatch between them.
        (1) p := 1; q := 1; l := 1;
        (2) If A2[q] < A1[p], then {A[l] := A2[q]; q := q + 1; l := l + 1;}
        (3) If A1[p] < A2[q], then {A[l] := A1[p]; p := p + 1; l := l + 1;}
        (4) If A1[p] = A2[q], then {if β1[A1[p]] ≠ β2[A2[q]], then {A[l] := A1[p]; l := l + 1;} p := p + 1; q := q + 1;}
        (5) If p > |A1|, q > |A2|, or both A1[p] and A2[q] are ∅, stop. (If A1 or A2 has some remaining elements which are not ∅, first append them to the rear of A, and then stop.)
        (6) Otherwise, go to (2).
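The steps above amount to a two-pointer merge of two sorted position lists. The following is a minimal sketch of that merge, with assumptions stated plainly: the name `derive_mismatches` is illustrative, ∅ padding is dropped in favour of plain Python lists, and at a shared position the characters compared are those at that position of β1 and β2 (a missing character counts as different).

```python
def derive_mismatches(a1, a2, beta1, beta2):
    """Merge mismatch lists a1 (alpha vs beta1) and a2 (alpha vs beta2) into the
    list of 1-based positions where beta1 and beta2 differ.

    a1 and a2 must be sorted lists of 1-based positions.
    """
    def char_at(text, x):
        # Character at 1-based position x, or None if the position does not exist.
        return text[x - 1] if x <= len(text) else None

    merged = []
    p = q = 0
    while p < len(a1) and q < len(a2):
        if a2[q] < a1[p]:
            merged.append(a2[q])     # only beta2 differs from alpha here
            q += 1
        elif a1[p] < a2[q]:
            merged.append(a1[p])     # only beta1 differs from alpha here
            p += 1
        else:
            x = a1[p]                # both differ from alpha: compare directly
            if char_at(beta1, x) != char_at(beta2, x):
                merged.append(x)
            p += 1
            q += 1
    merged.extend(a1[p:])            # leftovers are mismatches as well
    merged.extend(a2[q:])
    return merged


if __name__ == "__main__":
    # alpha = "abcde", beta1 = "abxdy", beta2 = "axcdy"
    print(derive_mismatches([3, 5], [2, 5], "abxdy", "axcdy"))
```

In the usage example both strings differ from α at position 5 but agree with each other there (both have y), so position 5 is dropped and the result is [2, 3].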

  14. String Matching with k Mismatches
      - Derivation of mismatching information for paths in a search tree.
      - This part of P3 will not be created. We derive the mismatching information for it according to P1 and R_21.
      [Figure: the search tree for r rooted at <-, [1, 8]>, with levels r[1] = t to r[5] = a and the paths P1 and P3 marked; a schematic relates positions i and j on two paths.]
