The q-gram distance (Bioinformatics Algorithms)



  1. The q-gram distance

Bioinformatics Algorithms (Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics, academic year 2018/19, II. semester

• In many situations, edit distance is a good model for differences / similarity between strings.
• But sometimes, other distance functions serve the purpose better.

Motivations for using q-gram distance

1. If two parts of a sequence are exchanged (e.g. two paragraphs, two long substrings, two genes), then one can argue that the resulting strings still have high similarity; however, the edit distance will be large. The q-gram distance can be more appropriate in this case.
2. The edit distance needs quadratic computation time, which is often too slow. The q-gram distance can be computed in linear time.

What is a q-gram?

Let $\Sigma$ be the alphabet, with $|\Sigma| = \sigma$.

Def. A q-gram is a string of length $q$ over $\Sigma$.

Note: q-grams are also called k-mers, w-words, or k-tuples. Typically, $q$ (or $k$, $w$, etc.) is small, much smaller than the strings we will want to compare.

We will fix $q$, and use the number of occurrences of q-grams to compute distances between strings.

Occurrence count

Let $s$ be a string of length $n \ge q$, and $u$ be a q-gram. The occurrence count of $u$ in $s$ is

$$N(s, u) = |\{\, i : s_i \ldots s_{i+q-1} = u \,\}|,$$

the number of times q-gram $u$ occurs in $s$.
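As a minimal sketch (not part of the original slides), the occurrence count $N(s, u)$ can be computed by testing every window of length $q = |u|$ in $s$. The function name `occurrence_count` is a hypothetical choice made for this illustration.

```python
def occurrence_count(s: str, u: str) -> int:
    """N(s, u): number of start positions i such that s[i:i+q] == u, with q = len(u)."""
    q = len(u)
    # Test every window explicitly: overlapping occurrences must each be counted,
    # which the built-in str.count would not do.
    return sum(1 for i in range(len(s) - q + 1) if s[i:i + q] == u)


s = "ACAGGGCA"
for u in ("AC", "CA", "GG", "TT"):
    print(u, occurrence_count(s, u))
```

Note that overlapping occurrences count separately: in `ACAGGGCA` the 2-gram `GG` starts at two adjacent positions, so its count is 2.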

  2. Occurrence count

Ex. Let $s$ = ACAGGGCA and $q = 2$. Then $N(s, \mathrm{AC}) = N(s, \mathrm{AG}) = N(s, \mathrm{GC}) = 1$, $N(s, \mathrm{CA}) = N(s, \mathrm{GG}) = 2$, and for all other q-grams $u$ over $\Sigma$, $N(s, u) = 0$.

q-gram profile

Fix some enumeration (listing) of $\Sigma^q$, i.e. some order in which we want to list all q-grams; e.g. the lexicographic order.

Def. Let $s$ be a string over $\Sigma$, $|s| \ge q$. The q-gram profile of $s$, $P_q(s)$, is an array of size $\sigma^q$, where the $i$th entry is

$$P_q(s)[i] = N(s, u_i),$$

and $u_i$ is the $i$th q-gram in the enumeration.

Example: Let $\Sigma = \{A, C, G, T\}$ and $q = 2$. Let

s = ACAGGGCA,
t = GGGCAACA,
v = AAGGACA.

Then the q-gram profiles of s, t, v are:

| u  | P_q(s) | P_q(t) | P_q(v) |
|----|--------|--------|--------|
| AA | 0 | 1 | 1 |
| AC | 1 | 1 | 1 |
| AG | 1 | 0 | 1 |
| AT | 0 | 0 | 0 |
| CA | 2 | 2 | 1 |
| CC | 0 | 0 | 0 |
| CG | 0 | 0 | 0 |
| CT | 0 | 0 | 0 |
| GA | 0 | 0 | 1 |
| GC | 1 | 1 | 0 |
| GG | 2 | 2 | 1 |
| GT | 0 | 0 | 0 |
| TA | 0 | 0 | 0 |
| TC | 0 | 0 | 0 |
| TG | 0 | 0 | 0 |
| TT | 0 | 0 | 0 |

Notice that the sum of all entries of $P_q(s)$ equals $|s| - q + 1$ = total number of q-gram occurrences in $s$ = number of distinct positions in $s$ where a q-gram starts.

q-gram distance

(Introduced by Ukkonen, 1992)

Def.: Given two strings $s, t$, the q-gram distance of $s$ and $t$ is

$$\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_{u \in \Sigma^q} |N(s, u) - N(t, u)|.$$

Equivalent def.: Given two strings $s, t$, the q-gram distance of $s$ and $t$ is

$$\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_{i=1}^{\sigma^q} |P_q(s)[i] - P_q(t)[i]|,$$

which is the Manhattan distance of the q-gram profiles of $s$ and $t$. (The Manhattan distance, or $L_1$-distance, of two vectors $x, y \in \mathbb{R}^n$ is defined as $\sum_{i=1}^{n} |x_i - y_i|$.)

In the previous example ($q = 2$, s = ACAGGGCA, t = GGGCAACA, and v = AAGGACA), we have

$$\mathrm{dist}_{2\text{-gram}}(s, t) = 2, \quad \mathrm{dist}_{2\text{-gram}}(s, v) = 5, \quad \text{and} \quad \mathrm{dist}_{2\text{-gram}}(t, v) = 5.$$

Note that it is possible to have distinct strings with q-gram distance 0, e.g. for w = AGGGCACA, we have $\mathrm{dist}_{2\text{-gram}}(s, w) = 0$. (Don't just believe this, double check it!)

The q-gram distance is a pseudo-metric

Lemma. The q-gram distance is a pseudo-metric, i.e. it is non-negative, symmetric, and obeys the triangle inequality (but it is possible to have $x \ne y$ with $\mathrm{dist}_{q\text{-gram}}(x, y) = 0$).

Proof: The three properties follow from the fact that the Manhattan distance is a metric. The example above shows that $\mathrm{dist}_{q\text{-gram}}(x, y) = 0$ does not imply $x = y$.

Exercise: Prove the lemma explicitly.
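The definitions above can be double-checked on the example strings with a short Python sketch. It builds each profile explicitly in lexicographic order and takes the Manhattan distance of the two profiles. The function names `qgram_profile` and `qgram_distance`, and the default alphabet argument, are assumptions made for this illustration and do not come from the slides.

```python
from itertools import product


def qgram_profile(s: str, q: int, alphabet: str = "ACGT") -> list[int]:
    """Return P_q(s) as a list of length sigma^q, in lexicographic q-gram order."""
    # product() enumerates tuples in the order of `alphabet`, i.e. lexicographically
    # when the alphabet string itself is sorted.
    grams = ["".join(p) for p in product(alphabet, repeat=q)]
    index = {u: i for i, u in enumerate(grams)}
    profile = [0] * len(grams)
    for i in range(len(s) - q + 1):
        profile[index[s[i:i + q]]] += 1
    return profile


def qgram_distance(s: str, t: str, q: int, alphabet: str = "ACGT") -> int:
    """Manhattan (L1) distance between the q-gram profiles of s and t."""
    ps = qgram_profile(s, q, alphabet)
    pt = qgram_profile(t, q, alphabet)
    return sum(abs(a - b) for a, b in zip(ps, pt))


s, t, v, w = "ACAGGGCA", "GGGCAACA", "AAGGACA", "AGGGCACA"
print(qgram_distance(s, t, 2))  # 2
print(qgram_distance(s, v, 2))  # 5
print(qgram_distance(t, v, 2))  # 5
print(qgram_distance(s, w, 2))  # 0, although s != w
```

This direct version spends time proportional to $\sigma^q$ on building the enumeration for each profile, so it is meant for checking small examples rather than as an efficient implementation.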

  3. Connection to edit distance

q-gram Lemma

Let $d_{edit}(s, t)$ denote the (unit-cost) edit distance of $s$ and $t$. Then

$$\frac{\mathrm{dist}_{q\text{-gram}}(s, t)}{2q} \le d_{edit}(s, t).$$

Proof: Every edit operation contributes at most $2q$ to the q-gram distance. Consider the simplest case, a substitution in position $i$ of $s$, where character $s_i$ is substituted by character $x$, and let $s'$ be the resulting string. If $q \le i \le |s| - q + 1$, then exactly $q$ q-grams of $s$ are affected by the substitution: $s_{i-q+1} \ldots s_i$, up to $s_i \ldots s_{i+q-1}$ (otherwise fewer); the counts of all these are decremented by 1, while the counts of the new q-grams $s_{i-q+1} \ldots s_{i-1} x$, up to $x s_{i+1} \ldots s_{i+q-1}$, are incremented by 1. Therefore, $\mathrm{dist}_{q\text{-gram}}(s, s') \le 2q$ (it could be less because these q-grams need not all be distinct). For a deletion, the number of q-grams whose count is decremented is at most $q$, while the number whose count is incremented is at most $q - 1$; for an insertion it is the other way around. The claim follows by induction on the number of edit operations.

Examples

With the earlier examples, we have

1. Exchange of two long substrings: $d_{edit}(s, t) = 6$, $d_{edit}(s, w) = 4$ (compare to: $\mathrm{dist}_{q\text{-gram}}(s, t) = 2$, $\mathrm{dist}_{q\text{-gram}}(s, w) = 0$, with $q = 2$).
2. The q-gram distance is at most $2q$ times edit distance (q-gram lemma): $d_{edit}(s, v) = 2$ (compare to: $\mathrm{dist}_{q\text{-gram}}(s, v) = 5 \le 8 = d_{edit}(s, v) \cdot 2q$, with $q = 2$).

Based on the q-gram lemma and the fact that the q-gram distance can be computed in linear time, we can use the q-gram distance as a filter for edit distance computations.

Computation of the q-gram distance

Basic ideas

• Use a sliding window of size $q$ over $s$ and $t$
• Use an array $d_q$ of size $\sigma^q$
• First slide a window over $s$, increment the respective entry for every q-gram seen
• Then slide over $t$, decrement the respective entry for every q-gram seen
• Now $d_q[r] = N(s, u_r) - N(t, u_r)$.
• Sum up the absolute values of the entries: $\mathrm{dist}_{q\text{-gram}}(s, t) = \sum_i |d_q[i]|$

We will see: This algorithm runs in linear time.
But: how do we know where to find the entry for the current q-gram? This is called ranking (see Ranking functions below).

Algorithm for computing q-gram distance

input: Strings $s, t$ of length $|s| = n$ and $|t| = m$
output: $\mathrm{dist}_{q\text{-gram}}(s, t)$

1. initialize $d_q[0 \ldots \sigma^q - 1]$ with 0s
2. for $i = 1, \ldots, n - q + 1$: $r \leftarrow rank(s_i \ldots s_{i+q-1})$; $d_q[r] \leftarrow d_q[r] + 1$
3. for $i = 1, \ldots, m - q + 1$: $r \leftarrow rank(t_i \ldots t_{i+q-1})$; $d_q[r] \leftarrow d_q[r] - 1$
4. $d \leftarrow 0$
5. for $i = 0, \ldots, \sigma^q - 1$: $d \leftarrow d + |d_q[i]|$
6. return $d$

Example: s = ACAGGGCA, t = GGGCAACA. The table shows the array $d_q$ after line 2 of the algorithm (at this point $d_q$ equals $P_q(s)$) and after line 3.

| r  | u_r | d_q after the pass thru s | d_q after the pass thru t |
|----|-----|---------------------------|---------------------------|
| 0  | AA | 0 | -1 |
| 1  | AC | 1 | 0 |
| 2  | AG | 1 | 1 |
| 3  | AT | 0 | 0 |
| 4  | CA | 2 | 0 |
| 5  | CC | 0 | 0 |
| 6  | CG | 0 | 0 |
| 7  | CT | 0 | 0 |
| 8  | GA | 0 | 0 |
| 9  | GC | 1 | 0 |
| 10 | GG | 2 | 0 |
| 11 | GT | 0 | 0 |
| 12 | TA | 0 | 0 |
| 13 | TC | 0 | 0 |
| 14 | TG | 0 | 0 |
| 15 | TT | 0 | 0 |

Finally, we have $d_2(s, t) = |-1| + 1 = 2$.

Ranking functions

Goal: Given a q-gram $u$, we want to know which entry of the array $u$ corresponds to.

Ex.: Where is the q-gram CG? In position 6.

• A ranking function is a bijection $rank : \Sigma^q \to [0 \ldots \sigma^q - 1]$.
• $rank(u)$ gives us the position of $u$ in the enumeration of $\Sigma^q$
• needs to be very efficiently computable
• the ranking function we use will give us constant time per q-gram of $s$
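The sliding-window algorithm and the ranking step can be sketched together in Python. The slides in this excerpt do not show which ranking function is used, so the sketch below assumes one common choice: read a q-gram as a base-$\sigma$ number (with A=0, C=1, G=2, T=3) and update the rank incrementally as the window moves. The names `rank`, `qgram_distance_linear`, `ALPHABET`, `CODE` and `SIGMA` are hypothetical, introduced only for this illustration.

```python
ALPHABET = "ACGT"                                # assumed alphabet, lexicographic order
CODE = {c: i for i, c in enumerate(ALPHABET)}    # character -> digit
SIGMA = len(ALPHABET)


def rank(u: str) -> int:
    """Rank of q-gram u: its value as a base-SIGMA number (assumed ranking function)."""
    r = 0
    for c in u:
        r = r * SIGMA + CODE[c]
    return r


def qgram_distance_linear(s: str, t: str, q: int) -> int:
    """q-gram distance via one difference array d_q, following steps 1-6 above."""
    d = [0] * (SIGMA ** q)        # step 1: d_q[0 .. sigma^q - 1] = 0
    top = SIGMA ** (q - 1)        # weight of the leftmost character of a window

    def slide(x: str, delta: int) -> None:
        if len(x) < q:
            return                # no complete window
        r = rank(x[:q])           # rank of the first window, computed from scratch
        d[r] += delta
        for i in range(q, len(x)):
            # Drop the leftmost character, shift by one digit, append x[i]:
            # constant work per window.
            r = (r - CODE[x[i - q]] * top) * SIGMA + CODE[x[i]]
            d[r] += delta

    slide(s, +1)                  # step 2: increment for every q-gram of s
    slide(t, -1)                  # step 3: decrement for every q-gram of t
    return sum(abs(v) for v in d) # steps 4-6: sum of absolute values


print(rank("CG"))                                        # 6
print(qgram_distance_linear("ACAGGGCA", "GGGCAACA", 2))  # 2
```

Under these assumptions each window costs a constant number of arithmetic operations, so the two passes take time proportional to $n + m$, plus $\sigma^q$ for initializing and summing the array.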
