Counting occurrences for a finite set of words: an inclusion-exclusion approach
Pierre Nicodème, CNRS - LIX, École polytechnique
joint work with Frédérique Bassino, Julien Clément and Julien Fayolle


  1. Counting occurrences for a finite set of words: an inclusion-exclusion approach
Pierre Nicodème, CNRS - LIX, École polytechnique
joint work with Frédérique Bassino, Julien Clément and Julien Fayolle

  2. Problem setting
Compute separately the number of occurrences of the words of a non-reduced set U in a random text under a (non-uniform) Bernoulli model.
Reduced set: no word is a factor of another word.
Reduced: U = { aab, ba, bb }        Non-reduced: U = { aa, aab, bbaabb }
Methods:
– Formal language manipulations (Régnier-Szpankowski) (fails in the non-reduced case)
– Aho-Corasick (automaton) + Chomsky-Schützenberger
– Inclusion-Exclusion (Goulden-Jackson, Noonan-Zeilberger)

  3. Analytic Aim
U = { u_1, ..., u_r } non-reduced set of words.
O_n^{(i)}: random variable counting the number of occurrences of the word u_i in a random text of size n (Bernoulli model).
We want to compute
F(z, x_1, ..., x_r) = Σ_{k_1 ≥ 0, ..., k_r ≥ 0, n ≥ 0} Pr(O_n^{(1)} = k_1, ..., O_n^{(r)} = k_r) x_1^{k_1} ... x_r^{k_r} z^n
From there
E[O_n^{(1)} × ... × O_n^{(r)}] = [z^n] ∂/∂x_1 ... ∂/∂x_r F(z, x_1, ..., x_r) |_{x_1 = ... = x_r = 1}

  4. (Auto)-Correlation Set
auto-correlation: h = ababa, C_{ababa,ababa} = { ε, ba, baba }
C_{h,h} = { w, |w| < |h| : h.w = r.h for some word r }
correlation: C_{h1,h2} = { w, |w| < |h2| : h1.w = r.h2 for some word r }
h1 = baba, h2 = abaaba → C_{baba,abaaba} = { aba, baaba }
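Correlation sets are easy to compute directly from this definition; a small sketch in Python (the function name and the use of strings for words are ours):

```python
def correlation(h1, h2):
    """Correlation set C_{h1,h2}: words w with |w| < |h2| such that
    h1.w = r.h2 for some word r; equivalently, a non-empty suffix
    of h1 of length k is a prefix of h2, and w is the rest h2[k:]."""
    return {h2[k:]
            for k in range(1, min(len(h1), len(h2)) + 1)
            if h1.endswith(h2[:k])}

# Examples from the slide (epsilon is the empty string ""):
# correlation("ababa", "ababa") == {"", "ba", "baba"}
# correlation("baba", "abaaba") == {"aba", "baaba"}
```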

  5. Generating function of a language
language = set of words; alphabet A = { a, b }
A⋆ = ε + A + A² + ... + Aⁿ + ... (all the words), L ⊂ A⋆
F_L(a, b) = Σ_{w ∈ L} commute(w)
Example: L = (aabaa)⋆ + bbb, with (aabaa)⋆ = ε + aabaa + (aabaa)² + (aabaa)³ + ...
⇒ F_L(a, b) = 1/(1 − a⁴b) + b³
if X·Y non-ambiguous, F_{X·Y}(a, b) = F_X(a, b) × F_Y(a, b)
if X and Y disjoint, F_{X+Y}(a, b) = F_X(a, b) + F_Y(a, b)
if X⋆ non-ambiguous, F_{X⋆}(a, b) = 1/(1 − F_X(a, b))

  6. Weighted and Counting Generating Function
Generating function of the language L: M(a, b) = Σ_{α ∈ L} commute(α)
Weighted generating function: W(z) = M(ω_a z, ω_b z) = Σ_{α ∈ L} p_α z^{|α|} = Σ π_n z^n,
with ω_a = Pr(a), ω_b = Pr(b), p_α the probability of the word α, and π_n the probability that a word of size n belongs to L.
Counting generating function: F(z) = M(z, z) = Σ_{α ∈ L} z^{|α|} = Σ f_n z^n,
with f_n the number of words of size n in the language.
Example: L = { ε, aa, ab, ba, aaab } (ε the empty word):
M(a, b) = 1 + a² + 2ab + a³b  ⇒  F(z) = 1 + 3z² + z⁴
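For a finite language both generating functions can be checked by direct enumeration. A sketch (helper names ours), using exact rationals for the weighted version:

```python
from collections import Counter
from fractions import Fraction

def counting_gf(words):
    """Coefficients f_n of F(z): number of words of each length n."""
    return Counter(len(w) for w in words)

def weighted_gf_at(words, z, prob):
    """Evaluate W(z) = sum_{alpha in L} p_alpha z^{|alpha|} exactly,
    for a Bernoulli model given by the letter probabilities `prob`."""
    total = Fraction(0)
    for w in words:
        p = Fraction(1)
        for letter in w:
            p *= prob[letter]
        total += p * Fraction(z) ** len(w)
    return total

L = ["", "aa", "ab", "ba", "aaab"]
print(dict(counting_gf(L)))   # F(z) = 1 + 3z^2 + z^4
```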

  7. Formal Languages Analysis (Régnier-Szpankowski - 1998)
"parse" the text with respect to the occurrences:
Right R – set of texts obtained by reading up to the first occurrence
Minimal M – set of texts separating two occurrences
Ultimate U – set of texts following the last occurrence
Not N – set of texts with no occurrence
A⋆ = N + R·(M)⋆·U  ⇒  L_x = N + R x·(M x)⋆·U

  8. Equations over the languages
C = C_{h,h}, π_h = Pr(h) (Bernoulli model)
(I)   A⋆ = U + M·A⋆
(II)  A⋆·h = R·C + R·A⋆·h
(III) M⁺ = A⋆·h + C − ε
(IV)  N·A = R + N − ε
Solving, with D(z) = π_h z^{|h|} + (1 − z) C(z):
R(z) = π_h z^{|h|} / D(z)        U(z) = 1 / D(z)
N(z) = C(z) / D(z)               M(z) = 1 + (z − 1) / D(z)
L(z, x) = 1 / ( 1 − z + (1 − x) π_h z^{|h|} / (x + (1 − x) C(z)) )

  9. Reduced sets (Régnier)
R_i, M_{i,j}, U_i  →  R_i(z), M_{i,j}(z), U_i(z), functions of C_{h1,h1}(z), C_{h2,h2}(z), C_{h1,h2}(z), C_{h2,h1}(z)
F(z, x_1, x_2) = N(z) + (x_1 R_1(z), x_2 R_2(z)) · [ x_1 M_{1,1}(z)  x_2 M_{1,2}(z) ; x_1 M_{2,1}(z)  x_2 M_{2,2}(z) ]⋆ · (U_1(z), U_2(z))ᵀ
(where M⋆ = (I − M)^{−1})
This collapses in the case of non-reduced sets.

  10. Aho-Corasick
– Input: non-reduced set of words U.
– Output: automaton A_U recognizing A⋆·U.
Algorithm:
1. build T_U, the ordinary trie representing the set U
2. build A_U = (A, Q, δ, ε, T):
   – Q = Pref(U)
   – T = A⋆·U ∩ Pref(U)
   – δ(q, x) = qx if qx ∈ Pref(U), Border(qx) otherwise
Border(v) = the longest proper suffix of v which belongs to Pref(U) if defined, or ε otherwise.
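The construction on the slide can be transcribed almost literally. The sketch below (names ours) computes Border naively from the definition, which is quadratic; the classic Aho-Corasick algorithm instead computes it incrementally with failure links in linear time:

```python
def aho_corasick(U, alphabet):
    """Automaton A_U as on the slide: states Q = Pref(U),
    terminal states T = A*.U ∩ Pref(U), transitions via Border."""
    prefs = {w[:i] for w in U for i in range(len(w) + 1)}   # Q = Pref(U)

    def border(v):
        # longest proper suffix of v belonging to Pref(U), else epsilon
        for i in range(1, len(v)):
            if v[i:] in prefs:
                return v[i:]
        return ""

    delta = {(q, x): (q + x if q + x in prefs else border(q + x))
             for q in prefs for x in alphabet}
    terminal = {q for q in prefs if any(q.endswith(w) for w in U)}
    return prefs, delta, terminal

# For U = {aab, aa} this reproduces the transitions of slides 12-16,
# e.g. delta[("aa", "a")] == "aa" and delta[("aab", "a")] == "a".
```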

  11. Example U = { aab, aa }: the trie T_U of U, with states ε, a, aa, aab.

  12. Example U = { aab, aa }: δ(ε, b) = Border(b) = ε

  13. Example U = { aab, aa }: δ(a, b) = Border(ab) = ε

  14. Example U = { aab, aa }: δ(aa, a) = Border(aaa) = aa

  15. Example U = { aab, aa }: δ(aab, a) = Border(aaba) = a

  16. Example U = { aab, aa }: δ(aab, b) = Border(aabb) = ε

  17. Example U = { aab, aa }
Transition matrix of the marked automaton, states in the order (ε, a, aa, aab):
T(x_1, x_2) =
[ b   a    0      0     ]
[ b   0    a x_2  0     ]
[ 0   0    a x_2  b x_1 ]
[ b   a    0      0     ]
x_1, x_2: marks for aab, aa.

  18. Example U = { aab, aa }
F(a, b, x_1, x_2) = (1, 0, 0, 0) (I − T(a, b, x_1, x_2))^{−1} (1, 1, 1, 1)ᵀ
                  = (1 − a(x_2 − 1)) / (1 − a x_2 − b + ab(x_2 − 1) − a²b x_2 (x_1 − 1)).
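This closed form can be checked against brute force: expand F(z, z, x_1, x_2) (both letters weighted by z) as a power series and compare the coefficient of z^n with the sum of x_1^{#aab} · x_2^{#aa} over all 2^n texts of length n. A sketch at numeric x_1, x_2 (helper names ours):

```python
from itertools import product

def count_occ(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    return sum(text[i:i + len(w)] == w for i in range(len(text) - len(w) + 1))

def brute(n, x1, x2):
    """Sum of x1^#aab * x2^#aa over all words of length n on {a,b}."""
    return sum(x1 ** count_occ(w, "aab") * x2 ** count_occ(w, "aa")
               for w in ("".join(t) for t in product("ab", repeat=n)))

def series_div(num, den, N):
    """First N+1 coefficients of num(z)/den(z), assuming den[0] == 1."""
    c = []
    for n in range(N + 1):
        s = num[n] if n < len(num) else 0
        s -= sum(den[j] * c[n - j] for j in range(1, min(n, len(den) - 1) + 1))
        c.append(s)
    return c

def F_coeffs(N, x1, x2):
    """[z^n] F(z, z, x1, x2) from the closed form of slide 18."""
    num = [1, -(x2 - 1)]                          # 1 - a(x2 - 1)
    den = [1, -(x2 + 1), x2 - 1, -x2 * (x1 - 1)]  # denominator with a = b = z
    return series_div(num, den, N)

assert all(F_coeffs(6, 2, 3)[n] == brute(n, 2, 3) for n in range(7))
```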

  19. Inclusion-Exclusion Principle - Analytic Version
Set F of the camelus genus (camel and dromedary); the number of humps is counted by the formal variable x:
F(x) = x² + x
Φ = { "objects of F in which each elementary configuration (hump) is either distinguished or not" }
Φ(t) = (t + 1) + (t² + t + t + 1) = 2 + 3t + t² = F(1 + t)
Inclusion-Exclusion principle: if Φ(t) is easy to get, then F(x) = Φ(x − 1).

  20. Application: counts for one word
word aaa; f(x): unknown p.g.f. of the counts of aaa.
In a text such as bbbbbaaaaaaaabbbbb, each occurrence is distinguished or not (flip-flop) ⇒ 2^k configurations for a text with k occurrences.
φ(x) = f(1 + x)  and  f(x) = φ(x − 1):
computing the easier φ(t) and substituting t ↦ x − 1 gives the harder f(x) (Inclusion-Exclusion paradigm).

  21. One word - Clusters
word aaa, C_{aaa,aaa} = { ε, a, aa }
clusters: C_aaa = aaa• ( ε + a• + aa• + a•a• + a•a•a• + a•aa• + aa•a• + ... ) = aaa• ( ε + ((C_{aaa,aaa} − ε)•)⁺ )
double counting (further removed by the inclusion-exclusion principle):
(C_{aaa,aaa} − ε)⁺(z) = (z + z²)/(1 − (z + z²)) = z + 2z² + 3z³ + 5z⁴ + 8z⁵ + 13z⁶ + ...
                      ≠ z + z² + z³ + z⁴ + z⁵ + z⁶ + ...

  22. Word aaa - Clusters - Generating function
C_{aaa,aaa} = { ε, a, aa },  C_{aaa,aaa}(z) = 1 + z + z²
C_aaa = aaa• ( ε + a• + aa• + a•a• + a•a•a• + a•aa• + aa•a• + ... ) = aaa• ( ε + ((C_{aaa,aaa} − ε)•)⁺ )
C_aaa(z, x) = z³x (1 + zx + z²x + zxzx + zxzxzx + zxz²x + z²xzx + ...)
            = xz³ ( 1 + (xz + xz²)/(1 − (xz + xz²)) ) = xz³ / (1 − (xz + xz²))

  23. Parsing of a text with respect to clusters
C = C_{h,h}, word h, clusters C = h• Seq((C − ε)•), so that
C(z, x) = x h(z) / (1 − x (C(z) − 1))
When reading a random text T, at each position either we read a letter of the alphabet A or we begin a cluster C:
T = ε + A + C + AA + AC + CA + CC + AAA + AAC + ACA + CAA + ACC + ... = Seq(A + C)
Therefore, counting with x the number of occurrences of the word h and removing the double counting by inclusion-exclusion:
F(z, x) = 1 / (1 − A(z) − C(z, x − 1)) = 1 / ( 1 − A(z) − (x − 1) h(z) / (1 − (x − 1)(C(z) − 1)) )
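For h = aaa over {a, b}, the counting version has A(z) = 2z, h(z) = z³ and C(z) − 1 = z + z², so the formula can be verified coefficient by coefficient against brute-force enumeration. A sketch at a numeric value of x (helper names ours):

```python
from itertools import product

def count_occ(text, w):
    """Overlapping occurrences of w in text."""
    return sum(text[i:i + len(w)] == w for i in range(len(text) - len(w) + 1))

def brute(n, x):
    """Sum of x^{#occurrences of aaa} over all words of length n on {a,b}."""
    return sum(x ** count_occ("".join(t), "aaa") for t in product("ab", repeat=n))

def F_aaa(N, x):
    """[z^n] of F(z,x) = 1 / (1 - 2z - (x-1) z^3 / (1 - (x-1)(z + z^2))).

    Clearing the inner denominator gives the rational form
    (1 - t(z + z^2)) / (1 - (t+2) z + t z^2 + t z^3) with t = x - 1,
    expanded here by series division."""
    t = x - 1
    num = [1, -t, -t]
    den = [1, -(t + 2), t, t]
    c = []
    for n in range(N + 1):
        s = num[n] if n < len(num) else 0
        s -= sum(den[j] * c[n - j] for j in range(1, min(n, 3) + 1))
        c.append(s)
    return c

assert F_aaa(6, 2) == [brute(n, 2) for n in range(7)]
```

At x = 2 the coefficient of z^n is the total number of "texts with a distinguished subset of occurrences" of length n, which the flip-flop argument of slide 20 predicts.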

  24. Reduced set - (Goulden-Jackson - 1979, 1983)
U = { aba, bab, aa }; in a text such as bbbbbabababaabbbbb, a cluster is a maximal chain of overlapping distinguished occurrences.
Clusters C_{i,j} begin with w_i and finish with w_j:
C_{i,j} = δ_{ij} w_i• + Σ_{1≤k≤3} C_{i,k} · (C_{w_k,w_j}• − δ_{kj} ε)
Summing over all begin/end pairs:
C = (w_1•, w_2•, w_3•) [ I − ( C_{w_1,w_1}• − ε   C_{w_1,w_2}•       C_{w_1,w_3}•
                               C_{w_2,w_1}•       C_{w_2,w_2}• − ε   C_{w_2,w_3}•
                               C_{w_3,w_1}•       C_{w_3,w_2}•       C_{w_3,w_3}• − ε ) ]^{−1} (1, 1, 1)ᵀ
T = Seq(A + C)  ⇒  Φ(z, x_1, x_2, x_3) = 1 / (1 − A(z) − C(z, x_1, x_2, x_3))
F(z, x_1, x_2, x_3) = Φ(z, x_1 − 1, x_2 − 1, x_3 − 1) = 1 / (1 − A(z) − C(z, x_1 − 1, x_2 − 1, x_3 − 1))
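The matrix construction can be checked numerically for the counting version over {a, b}: build the correlation polynomials, solve V = 1 + M·V by a truncated-series fixed point (the matrix star), assemble C(z, t), and invert 1 − 2z − C(z, x − 1). A sketch at numeric values of the x_i (all helper names ours):

```python
from itertools import product

def corr_poly(h1, h2, N):
    """Counting GF of the correlation set C_{h1,h2}, as a list up to z^N."""
    c = [0] * (N + 1)
    for k in range(1, min(len(h1), len(h2)) + 1):
        if h1.endswith(h2[:k]) and len(h2) - k <= N:
            c[len(h2) - k] += 1
    return c

def smul(a, b, N):
    """Truncated product a(z) * b(z) mod z^{N+1} on coefficient lists."""
    c = [0] * (N + 1)
    for i in range(min(len(a), N + 1)):
        if a[i]:
            for j in range(min(len(b), N + 1 - i)):
                c[i + j] += a[i] * b[j]
    return c

def cluster_gf(U, t, N):
    """C(z, t_1..t_r): solve V_i = 1 + sum_j M_ij V_j by iteration,
    with M_ij = t_j (C_{w_i,w_j}(z) - delta_ij); then
    C = sum_i t_i z^{|w_i|} V_i."""
    r = len(U)
    M = [[[t[j] * v for v in corr_poly(U[i], U[j], N)] for j in range(r)]
         for i in range(r)]
    for i in range(r):
        M[i][i][0] -= t[i]                  # remove epsilon on the diagonal
    V = [[1] + [0] * N for _ in range(r)]
    for _ in range(N):                      # M has valuation >= 1: converges
        newV = []
        for i in range(r):
            acc = [1] + [0] * N
            for j in range(r):
                prod = smul(M[i][j], V[j], N)
                acc = [u + v for u, v in zip(acc, prod)]
            newV.append(acc)
        V = newV
    C = [0] * (N + 1)
    for i, w in enumerate(U):
        for n in range(len(w), N + 1):
            C[n] += t[i] * V[i][n - len(w)]
    return C

def F_coeffs(U, x, N):
    """[z^n] of F = 1/(1 - 2z - C(z, x-1)) over the alphabet {a, b}."""
    C = cluster_gf(U, [xi - 1 for xi in x], N)
    den = [-v for v in C]
    den[0] += 1
    den[1] -= 2
    f = []
    for n in range(N + 1):
        s = (1 if n == 0 else 0) - sum(den[j] * f[n - j] for j in range(1, n + 1))
        f.append(s)
    return f

def brute(U, x, n):
    """Direct check: sum over all 2^n texts of prod_i x_i^{#occ(w_i)}."""
    total = 0
    for tup in product("ab", repeat=n):
        text, term = "".join(tup), 1
        for w, xi in zip(U, x):
            occ = sum(text[i:i + len(w)] == w for i in range(n - len(w) + 1))
            term *= xi ** occ
        total += term
    return total

U = ["aba", "bab", "aa"]
assert F_coeffs(U, (2, 2, 2), 4) == [brute(U, (2, 2, 2), n) for n in range(5)]
```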

  25. General Case: Non-Reduced Set of Words
U = { aa, ab, baaaab }; example text aaaabbbbbbbabaaaabbbb, with two clusters I and II of distinguished occurrences.
Create clusters of distinguished occurrences.
Reduced cluster: no induced factor occurrence (cluster I). Count distinguished occurrences by t_i ↦ x_i − 1 (Inclusion-Exclusion principle).
Induced factor occurrences: the occurrence baaaab of the reduced cluster II induces 0, 1, 2, or 3 distinguished occurrences of aa (its factor aa occurs three times). To recover the correct count of 2³ = 8 marked configurations, count them by (1 + t_i)³ ↦ x_i³.
