Yangjun Chen, Yujia Wu Department of Applied Computer Science - PowerPoint PPT Presentation

BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches
Yangjun Chen, Yujia Wu
Department of Applied Computer Science, University of Winnipeg

Outline
- Motivation
  - Statement of Problem
  - Related work
- BWT Arrays: A space-economic Index for String Matching
- String Matching with k Mismatches
  - Search trees
  - Mismatching information
  - Mismatching trees
- Experiments
- Conclusion and Future Work

Statement of Problem
- String matching with k mismatches: find all the occurrences of a pattern string r in a target string s, where each occurrence may differ from r in up to k positions.
- In DNA databases, due to polymorphisms or mutations among individuals, or even sequencing errors, a read (a short sample DNA sequence) may disagree in some positions at any of its occurrences in a genome.

Example: k = 4
  pattern: a a a a a c a a a c
  target:  a c a c a c a g a a g c c c
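The problem statement can be made concrete with a brute-force scan. The sketch below is only a baseline for checking answers, not the index-based method of these slides; the function name `k_mismatch_naive` and the 1-based reporting of positions are assumptions made for illustration.

```python
def k_mismatch_naive(s, r, k):
    """Return the 1-based start positions where r occurs in s with at most k mismatches."""
    hits = []
    for start in range(len(s) - len(r) + 1):
        mismatches = 0
        for offset, ch in enumerate(r):
            if s[start + offset] != ch:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append(start + 1)
    return hits


# The example from the slide: k = 4
print(k_mismatch_naive("acacacagaagccc", "aaaaacaaac", 4))
```

On the slide's example this reports positions 1 and 3, each with exactly four mismatches. The scan costs O(|s| * |r|) character comparisons in the worst case, which is the cost an index is meant to avoid.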

Related Work
- Exact string matching
  - On-line algorithms: Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick
  - Index based: suffix trees (Weiner; McCreight; Ukkonen), suffix arrays (Manber, Myers), BWT transformation (Burrows, Wheeler), hashing (Karp, Rabin)
- Inexact string matching
  - String matching with k mismatches, Hamming distance (Landau, Vishkin; Amir et al.; Cole)
  - String matching with k differences, Levenshtein distance (Chang, Lampe)
  - String matching with wild-cards (Manber, Baeza-Yates)

BWT-Index
- Burrows-Wheeler Transform (BWT)
- s = a1 c1 a2 g1 a3 c2 a4 $

BWT construction (SA[...] is the suffix array):
  L[i] = $, if SA[i] = 1;
  L[i] = s[SA[i] - 1], otherwise.

Rank correspondence: rk_F(e) = rk_L(e). For example, a1 has rank 3 in both F and L.

  i   SA[i]  F[i]  rk_F  L[i]  rk_L
  1   8      $     -     a4    1
  2   7      a4    1     c2    1
  3   5      a3    2     g1    1
  4   1      a1    3     $     -
  5   3      a2    4     c1    2
  6   6      c2    1     a3    2
  7   2      c1    2     a1    3
  8   4      g1    1     a2    4
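The construction rule can be checked on the running example with a few lines of Python. This is a minimal sketch that sorts suffixes directly, which is fine for a slide-sized string but is not how a genome-scale index would be built; the function name `bwt` is an assumption.

```python
def bwt(s):
    """Return (SA, F, L) for s, which must end with the sentinel '$'.

    SA holds 1-based suffix start positions in sorted order.
    """
    sa = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
    first = "".join(s[i - 1] for i in sa)                       # F: first characters
    last = "".join("$" if i == 1 else s[i - 2] for i in sa)     # L[i] = s[SA[i] - 1]
    return sa, first, last


sa, first, last = bwt("acagaca$")
print(sa)     # suffix array
print(first)  # column F
print(last)   # column L
```

For s = acagaca$ this gives SA = 8, 7, 5, 1, 3, 6, 2, 4, F = $aaaaccg and L = acg$caaa, matching the slide's columns once the occurrence subscripts are dropped.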

Backward Search of BWT-Index
- s = a1 c1 a2 g1 a3 c2 a4 $
- search(z, π) = <z, [α, β]>, if z appears in L_π; φ, otherwise.
    z: a character
    π: a range in F
    L_π: a range in L, corresponding to π
- Search p = aca

[Figure: the columns F and L with the suffix array SA = 8, 7, 5, 1, 3, 6, 2, 4, repeated for each step of the backward search.]

Backward Search of BWT-Index
- search(c, <a, [2, 5]>)
- search(a, <c, [1, 2]>)
- Search sequence: <a, [2, 5]>, <c, [1, 2]>, <a, [3, 4]>

[Figure: the columns F and L with the suffix array, with the range selected at each step highlighted.]
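The backward search for p = aca can be traced with the sketch below. It is a simplified illustration under stated assumptions: ranges are kept as half-open row intervals over F rather than as the <character, range> pairs of the slides, and occurrences inside L are counted by scanning instead of by a rank structure. The names `bwt_index` and `backward_search` are hypothetical.

```python
def bwt_index(s):
    """Return (SA, F, L) for s ending in '$'; SA holds 0-based suffix starts."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    first = "".join(s[i] for i in sa)
    last = "".join(s[i - 1] for i in sa)  # s[-1] is '$' when i == 0
    return sa, first, last


def backward_search(s, p):
    """Return the sorted 1-based positions of exact occurrences of p in s."""
    sa, first, last = bwt_index(s)
    lo, hi = 0, len(s)  # half-open row range [lo, hi); starts as all rows
    for z in reversed(p):
        below = last[:lo].count(z)     # occurrences of z in L above the range
        inside = last[lo:hi].count(z)  # occurrences of z in L within the range
        if inside == 0:
            return []                  # z does not appear in this part of L
        start = first.index(z)         # first row of F holding z
        lo, hi = start + below, start + below + inside
    return sorted(sa[row] + 1 for row in range(lo, hi))


print(backward_search("acagaca$", "aca"))
```

On the running example the row ranges visited are rows 2 to 5 (for a), rows 6 to 7 (for c) and rows 3 to 4 (for a), and the suffix array entries of the last range give the occurrences at positions 1 and 5.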

rankAll
- Arrange |Σ| arrays, one for each character X ∈ Σ, such that A_X[i] (the i-th entry in the array for X) is the number of appearances of X within L[1 .. i].
- Instead of scanning a certain segment L[α .. β] (α ≤ β) to find a subrange for a certain X ∈ Σ, we can simply look up A_X to see whether A_X[α - 1] = A_X[β]. If it is the case, then X does not occur in L[α .. β]. Otherwise, [A_X[α - 1] + 1, A_X[β]] should be the found range.

  i   A_$  A_a  A_c  A_g  A_t   F    L
  1   0    1    0    0    0     $    a4
  2   0    1    1    0    0     a4   c2
  3   0    1    1    1    0     a3   g1
  4   1    1    1    1    0     a1   $
  5   1    1    2    1    0     a2   c1
  6   1    2    2    1    0     c2   a3
  7   1    3    2    1    0     c1   a1
  8   1    4    2    1    0     g1   a2

Example: To find the first and the last appearance of c in L[2 .. 5], we only need to find A_c[2 - 1] = A_c[1] = 0 and A_c[5] = 2. So the corresponding range is [A_c[2 - 1] + 1, A_c[5]] = [1, 2].
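The lookup described above can be sketched as follows. The arrays are stored with an extra entry at index 0 (the empty prefix) so that A_X[α - 1] is defined for α = 1; that padding and the function names `rank_all` and `find_range` are assumptions of this sketch.

```python
def rank_all(last, alphabet):
    """Build A_X for every X: arrays[X][i] = number of X in L[1 .. i], arrays[X][0] = 0."""
    arrays = {x: [0] for x in alphabet}
    for ch in last:
        for x in alphabet:
            arrays[x].append(arrays[x][-1] + (1 if ch == x else 0))
    return arrays


def find_range(arrays, x, alpha, beta):
    """Rank range of x within L[alpha .. beta] (1-based, inclusive), or None if absent."""
    lo, hi = arrays[x][alpha - 1], arrays[x][beta]
    if lo == hi:
        return None  # x does not occur in L[alpha .. beta]
    return (lo + 1, hi)


A = rank_all("acg$caaa", "$acgt")
print(find_range(A, "c", 2, 5))  # the slide's example: c in L[2 .. 5]
```

For the slide's example the lookup returns the range [1, 2] with two array accesses, instead of scanning four entries of L.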

Reduce rankAll-Index Size
- F-ranks: F_α = <α; x_α, y_α>
- BWT array: L
- Reduced appearance array: A_α with bucket size β.
- Reduced suffix array: SA* with bucket size γ.

Find a range:
  top ← F(x_α) + A_α[⌊(top - 1)/β⌋] + r + 1
  bot ← F(x_α) + A_α[⌊bot/β⌋] + r'
where r is the number of α's appearances within L[⌊(top - 1)/β⌋ .. top - 1], and r' is the number of α's appearances within L[⌊bot/β⌋ .. bot].

F-ranks for the example: F_$ = <$; 1, 1>, F_a = <a; 2, 5>, F_c = <c; 6, 7>, F_g = <g; 8, 8>

[Figure: the example index, showing L, the sampled SA* entries, the arrays A_$, A_a, A_c, A_g, A_t, the columns F and L with rk_L, and the full suffix array SA.]
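The space saving comes from storing only every β-th entry of each appearance array and recovering the rest by a short scan of L. The sketch below illustrates that idea for a single character; the checkpoint layout (one count per completed bucket, with an initial 0) and the names `build_checkpoints` and `sampled_rank` are assumptions, not the slides' exact data layout.

```python
def build_checkpoints(last, x, bucket):
    """checkpoints[j] = number of x in the first j * bucket characters of L."""
    checkpoints, count = [0], 0
    for i, ch in enumerate(last, start=1):
        if ch == x:
            count += 1
        if i % bucket == 0:
            checkpoints.append(count)
    return checkpoints


def sampled_rank(last, checkpoints, x, i, bucket):
    """Number of x in L[1 .. i], using the nearest checkpoint at or before i."""
    j = i // bucket
    # Stored count up to the checkpoint plus a scan of at most bucket - 1 characters.
    return checkpoints[j] + last[j * bucket:i].count(x)


L = "acg$caaa"
cp = build_checkpoints(L, "a", 4)
print(cp)
print([sampled_rank(L, cp, "a", i, 4) for i in range(1, 9)])
```

With bucket size 4 only three counts are stored for a instead of eight, and each query scans at most three characters of L. A larger bucket trades more scanning for less space.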

String Matching with k Mismatches
- Search Trees
  pattern: r = tcaca; target: s = acagaca; k = 2.

[Figure: search tree T. The root is v0 = <-, [1, 8]>. Each level corresponds to one pattern character, from r[1] = t down to r[5] = a, and each node v1, ..., v19 is a pair <character, range> such as <a, [1, 4]>, <c, [1, 2]> or <g, [1, 1]>. The root-to-leaf paths are labelled P1, P2, P3 and P4.]
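A search tree of this kind can be explored by backtracking over an index: at each level every character is tried, and a branch is cut once it has used more than k mismatches. The sketch below is a plain version of that idea and does not include the mismatching-tree optimisation of these slides. It indexes the reversed target so that the pattern can be consumed from r[1] onward; that choice, the half-open row ranges and all function names are assumptions.

```python
def build_index(text):
    """Return (SA, C, occ) for text ending in '$' (0-based suffix array)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    last = [text[i - 1] for i in sa]            # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    C, total = {}, 0
    for ch in alphabet:                          # C[ch] = first row of ch in F
        C[ch] = total
        total += text.count(ch)
    occ = {ch: [0] * (n + 1) for ch in alphabet}
    for i, ch in enumerate(last):                # occ[a][i] = count of a in L[:i]
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return sa, C, occ


def k_mismatch_search(s, r, k):
    """Return sorted 1-based positions where r occurs in s with at most k mismatches."""
    t = s[::-1] + "$"                            # index the reversed target
    sa, C, occ = build_index(t)
    n, m = len(s), len(r)
    hits = []
    stack = [(0, len(t), 0, 0)]                  # (lo, hi, depth, mismatches so far)
    while stack:
        lo, hi, depth, used = stack.pop()
        if depth == m:                           # a full path: report its occurrences
            hits.extend(n - sa[row] - m + 1 for row in range(lo, hi))
            continue
        for ch in C:
            if ch == "$":
                continue
            cost = used + (1 if ch != r[depth] else 0)
            if cost > k:
                continue                         # prune: too many mismatches
            new_lo = C[ch] + occ[ch][lo]
            new_hi = C[ch] + occ[ch][hi]
            if new_lo < new_hi:                  # ch extends at least one occurrence
                stack.append((new_lo, new_hi, depth + 1, cost))
    return sorted(hits)


print(k_mismatch_search("acagaca", "tcaca", 2))
```

For the slide's input (r = tcaca, s = acagaca, k = 2) the two windows acaga and agaca each differ from r in two positions, so the search reports positions 1 and 3.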

String Matching with k Mismatches
- Mismatching information
  R: the mismatching table for r with |r| = m.
  R_ij: contains the positions of the first 2k + 1 mismatches between r[i .. m - q + i] and r[j .. m - q + j], where q = max{i, j}, such that if R_ij[l] = x (≠ φ), then r[i + x - 1] ≠ r[j + x - 1] or one of them does not exist, and it is the l-th mismatch between them.

Example: r = tcacg, where r_i denotes the suffix of r starting at position i.
  r_1 = tcacg, r_2 = cacg:  R_12 = [1, 2, 3, 4]
  r_1 = tcacg, r_3 = acg:   R_13 = [1, 3]
  r_1 = tcacg, r_4 = cg:    R_14 = [1, 2]
  r_1 = tcacg, r_5 = g:     R_15 = [1]
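The entries R_12, ..., R_1m of the example can be reproduced by direct comparison. The sketch below compares the two substrings named in the definition and stops after `limit` mismatches (2k + 1 in the slides); the function name `mismatch_positions` and the choice to return a plain list without φ padding are assumptions.

```python
def mismatch_positions(r, i, j, limit):
    """Positions x (1-based) with r[i + x - 1] != r[j + x - 1], for the first `limit` mismatches.

    Compares r[i .. m - q + i] with r[j .. m - q + j], where q = max(i, j).
    """
    m, q = len(r), max(i, j)
    length = m - q + 1                  # common length of the two substrings
    found = []
    for x in range(1, length + 1):
        if r[i + x - 2] != r[j + x - 2]:    # shift by 1 for 0-based indexing
            found.append(x)
            if len(found) == limit:
                break
    return found


r, k = "tcacg", 2
for j in range(2, len(r) + 1):
    print(f"R_1{j} =", mismatch_positions(r, 1, j, 2 * k + 1))
```

Only the m - 1 arrays R_12, ..., R_1m need to be built this way; each takes time linear in the substring length.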

String Matching with k Mismatches
- Derivation of mismatching information
  We store only part of the mismatching information, specifically R_12, ..., R_1m, while all the other mismatching information is derived dynamically.

Example: derive the mismatching information A between α1 = r[2 .. 5] = cacg and α2 = r[3 .. 5] = acg from A1 = R_12 = [1, 2, 3, 4] and A2 = R_13 = [1, 3]. Pointer p scans A1 and pointer q scans A2.
  Step 1: A1[p] = A2[q] = 1 and α1[1] = c ≠ α2[1] = a, so A = [1].
  Step 2: A1[p] = 2 < A2[q] = 3, so A = [1, 2].
  Step 3: A1[p] = A2[q] = 3 and α1[3] = c ≠ α2[3] = g, so A = [1, 2, 3].

String Matching with k Mismatches
- Algorithm for derivation of mismatching information
- Let α, α1 and α2 be three strings. Let A1 and A2 be two arrays containing all the positions of mismatches between α and α1, and between α and α2, respectively.
- Create a new array A such that if A[i] = j (≠ φ), then α1[j] ≠ α2[j], or one of them does not exist. It is the i-th mismatch between them.

1. p := 1; q := 1; l := 1;
2. If A2[q] < A1[p], then {A[l] := A2[q]; q := q + 1; l := l + 1;}
3. If A1[p] < A2[q], then {A[l] := A1[p]; p := p + 1; l := l + 1;}
4. If A1[p] = A2[q], then {if α1[A1[p]] ≠ α2[A2[q]], then {A[l] := A1[p]; l := l + 1;} p := p + 1; q := q + 1;}
5. If p > |A1|, q > |A2|, or both A1[p] and A2[q] are φ, stop. (If A1 or A2 has some remaining elements which are not φ, first append them to the rear of A, and then stop.)
6. Otherwise, go to (2).
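The merge can be sketched in Python as below. It assumes that A1 and A2 are complete lists of mismatch positions without φ padding, and it treats a position beyond the end of a string as a character that does not exist; the function name `derive_mismatches` is hypothetical.

```python
def derive_mismatches(alpha1, alpha2, a1, a2):
    """Merge a1 (mismatches of alpha vs alpha1) and a2 (alpha vs alpha2)
    into the sorted mismatch positions between alpha1 and alpha2 (1-based)."""

    def char(s, j):
        # 1-based access; None stands for "this position does not exist"
        return s[j - 1] if j <= len(s) else None

    merged, p, q = [], 0, 0
    while p < len(a1) and q < len(a2):
        if a2[q] < a1[p]:
            # alpha1 agrees with alpha here but alpha2 does not
            merged.append(a2[q])
            q += 1
        elif a1[p] < a2[q]:
            # alpha2 agrees with alpha here but alpha1 does not
            merged.append(a1[p])
            p += 1
        else:
            # both differ from alpha at this position: compare them directly
            j = a1[p]
            if char(alpha1, j) != char(alpha2, j):
                merged.append(j)
            p += 1
            q += 1
    # one list is exhausted: the rest of the other are mismatches as well
    merged.extend(a1[p:])
    merged.extend(a2[q:])
    return merged


# The example of the previous slide: R_12 and R_13 for r = tcacg
print(derive_mismatches("cacg", "acg", [1, 2, 3, 4], [1, 3]))
```

On the slide's example the merge yields positions 1, 2 and 3 from the main loop and then appends the remaining position 4 of A1, where cacg still has a character and acg does not. Only positions listed in A1 or A2 are inspected, so the cost is O(|A1| + |A2|) rather than the length of the strings.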

String Matching with k Mismatches
- Derivation of mismatching information for paths in a search tree.
  This part of P3 will not be created. We derive the mismatching information for it according to P1 and R_21.

[Figure: paths P1 and P3 of the search tree, from the root <-, [1, 8]> through the levels r[1] = t to r[5] = a, with the part of P3 that is not created marked.]
