The Combinatorics of Overlapping Squares

SLIDE 1

Runs Overlapping Squares Applications?

The Combinatorics of Overlapping Squares

Bill Smyth

Algorithms Research Group, Department of Computing & Software, McMaster University, Hamilton, Canada
Department of Mathematics & Statistics, University of Western Australia, Perth, Australia
email: smyth@mcmaster.ca

Challenges in Combinatorics on Words, The Fields Institute, Toronto, 24 April 2013

SLIDE 2

Abstract

I briefly review two closely-related research topics pursued over the last ten years or so:

◮ What is the maximum number of runs (maximal periodicities) in a string of length n?

◮ What are the limitations on the occurrence of overlapping squares in a string?

I suggest new strategies for dealing with these questions, as well as possible algorithmic consequences.

SLIDE 3

Outline

  1. Runs
  2. Overlapping Squares
  3. Applications?

SLIDE 4

Repetitions & Runs

◮ If $x = v u^e w$, with integer $e > 1$ and $u$ neither a suffix of $v$ nor a prefix of $w$ (so that $e$ is maximum), then $u^e$ is said to be a repetition in $x$. The integers $|u|$ and $e$ are the period and exponent, respectively, of the repetition.

◮ For example, in

    $x = abaababaab$,    (1)

there are repetitions $a^2$ (twice), $(ab)^2$, $(ba)^2$, $(aba)^2$, and $(abaab)^2$. Each of these repetitions is a square ($e = 2$). In general, every repetition has a square prefix.

◮ If $v = x[i..j]$ has period $|u|$, where $|v|/|u| \ge 2$, and if neither $x[i-1..j]$ nor $x[i..j+1]$ (whenever these are defined) has period $|u|$, then $v$ is said to be a maximal periodicity or run in $x$ [M89], and $v$ is said to have exponent $e = \lfloor |v|/|u| \rfloor$ and tail $t = |v| \bmod |u|$. When $t = 0$, the run is also a repetition.

◮ All of the repetitions in (1) are runs except for $(ab)^2$ and $(ba)^2$: these are a prefix and a suffix, respectively, of the run $v = ababa$.

◮ In general, every repetition is a substring of some run; thus computing all the runs implicitly computes all the repetitions.
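The definition of a run can be checked directly on example (1) with a brute-force sketch. The function name `find_runs` and the 0-based `(start, end, period)` output are illustrative choices, not part of the talk, and the quadratic scan below is only a definition-level check, not one of the algorithms cited on these slides.

```python
def find_runs(x):
    """Return all runs in x as (start, end, period) triples (0-based, inclusive)."""
    n = len(x)
    runs = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            # Extend the stretch of positions k with x[k] == x[k + p] as far as possible.
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            # x[i .. j+p-1] has period p and cannot be extended left or right.
            if j - i >= p:  # length j - i + p is at least 2p
                root = x[i:i + p]
                # Keep it only if p is the minimum period, i.e. the root is primitive.
                if (root + root).find(root, 1) == p:
                    runs.append((i, j + p - 1, p))
            i = j
    return runs


if __name__ == "__main__":
    x = "abaababaab"
    for start, end, period in find_runs(x):
        v = x[start:end + 1]
        print(v, "period", period, "exponent", len(v) // period, "tail", len(v) % period)
```

On example (1) this reports five runs: the two occurrences of aa, the run ababa of period 2 and tail 1, and the squares of aba and abaab.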

SLIDE 5

Computing Repetitions

In the early 1980s three $O(n \log n)$-time algorithms were proposed to compute all the repetitions in a given string $x$ of length $n = |x|$:

◮ Crochemore [C81] describes a method of successive refinement that identifies all equal substrings of lengths 1, 2, ... until for some length $\ell$ every substring is unique. As remarked in [S03], his method is essentially an algorithm for suffix tree construction. Crochemore also showed that a string $x$ can contain on the order of $n \log n$ repetitions, so that all these algorithms are optimal.

◮ Apostolico & Preparata [AP83] use suffix trees plus auxiliary data structures.

◮ Main & Lorentz [ML84] use a divide-and-conquer approach based on prior computation of the Lempel-Ziv factorization $LZ_x$.

Note: all use global data structures.

SLIDE 6

Computing LZ [ZL77]

Figure: A wide variety of algorithmic approaches to the computation of the Lempel-Ziv factorization, all of them based on the computation of global data structures (from [ACIKSTY13])
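For orientation, here is a deliberately naive sketch of one common variant of the Lempel-Ziv factorization. It assumes that each factor is either a letter not seen before or the longest prefix of the remaining suffix that also begins at an earlier position, with overlap allowed. The name `lz_factorize` is illustrative, and this is not any of the index-based algorithms surveyed in [ACIKSTY13], which replace the inner search with global data structures.

```python
def lz_factorize(x):
    """Naive LZ factorization: list of factors whose concatenation is x."""
    factors = []
    n = len(x)
    i = 0
    while i < n:
        best = 0
        # Try every earlier starting position j and measure the common extension.
        for j in range(i):
            k = 0
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            best = max(best, k)
        # A letter with no earlier occurrence forms a factor of length 1.
        length = max(best, 1)
        factors.append(x[i:i + length])
        i += length
    return factors


if __name__ == "__main__":
    print(lz_factorize("abaababaab"))
```

The quadratic-or-worse search over earlier positions is exactly the work that suffix trees, suffix arrays and related structures are used to avoid.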

SLIDE 7

Computing Runs

◮ In 1989 Main [M89] showed how to compute all “leftmost” runs, again from $LZ_x$, in linear time; thus still using global data structures.

◮ In 1999 Kolpakov & Kucherov [KK99, KK00] showed how to compute all runs from the leftmost ones, also in linear time.

◮ To establish linearity, they proved that the maximum number $\rho(n)$ of runs over all strings of length $n$ satisfies

    $\rho(n) \le k_1 n - k_2 \sqrt{n} \log_2 n$    (2)

for some universal positive constants $k_1$ and $k_2$.

◮ They provided computational evidence (up to $n = 60$) that $\rho(n) \le n$; this was their conjecture.

◮ Based on work by many authors over the last 10 years, it has been shown that $0.944575 < \rho(n)/n \le 1.029$: the lower bound is combinatorial [S10], the upper largely computational [CIT11].
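The kind of computational evidence mentioned above can be reproduced for very small n by exhaustive search. The sketch below assumes that restricting to a binary alphabet is enough for these lengths. The helper names `count_runs` and `max_runs_binary` are illustrative, and the exponential enumeration is only practical for n up to roughly 20.

```python
from itertools import product


def count_runs(x):
    """Count runs (maximal periodicities with primitive root) by direct scan."""
    n, total = len(x), 0
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            root = x[i:i + p]
            # Need length >= 2p and p equal to the minimum period.
            if j - i >= p and (root + root).find(root, 1) == p:
                total += 1
            i = j
    return total


def max_runs_binary(n):
    """Maximum number of runs over all binary strings of length n."""
    return max(count_runs("".join(w)) for w in product("ab", repeat=n))


if __name__ == "__main__":
    for n in range(1, 13):
        r = max_runs_binary(n)
        print(n, r, round(r / n, 3))
```

For such short strings the ratio stays well below the asymptotic lower bound quoted above, which only takes effect for long strings.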

SLIDE 8

Unsatisfactory Situation

Moreover, the expected number of runs in a string of length n is small (Puglisi & Simpson [PS08]):

◮ 0.41n for alphabet size σ = 2;
◮ 0.25n for DNA (Σ = {A, C, G, T});
◮ 0.04n for protein (σ = 20);
◮ 0.01n for English-language text.

Runs (hence repetitions) in most strings are sparse! We have to use global data structures to compute something that is not only local in the string, but that generally occurs sparsely. Obviously we need to understand better what is going on.
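The sparsity claim can be explored empirically with a small Monte Carlo sketch. It assumes uniformly random strings over an alphabet of size sigma, which models the σ = 2 figure directly but is only a rough stand-in for real DNA, protein or English text. The function names, the sample sizes and the seed are all illustrative.

```python
import random


def count_runs(x):
    """Count runs (maximal periodicities with primitive root) by direct scan."""
    n, total = len(x), 0
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            root = x[i:i + p]
            if j - i >= p and (root + root).find(root, 1) == p:
                total += 1
            i = j
    return total


def mean_runs_per_symbol(n, sigma, trials, seed=0):
    """Average number of runs divided by n over random strings (sigma <= 26)."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:sigma]
    total = 0
    for _ in range(trials):
        x = "".join(rng.choice(alphabet) for _ in range(n))
        total += count_runs(x)
    return total / (trials * n)


if __name__ == "__main__":
    for sigma in (2, 4, 20):
        print(sigma, round(mean_runs_per_symbol(200, sigma, 30), 3))
```

The estimates should fall as the alphabet grows, in line with the trend in the figures quoted from [PS08].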

SLIDE 9

Combinatorial Insight?

If ρ(n)/n is limited to be near one, it means that on average there is at most about one run starting at each position. So ... if TWO runs start at some position, then there must be some other position, probably nearby, at which NO runs start.

Runs always start with squares. What do we know about squares that begin at about the same position? What combinatorial insight do we have into the restrictions that might be imposed upon occurrences of overlapping squares?

Until recently, very little:

SLIDE 10

From 1906 to 1995!

Lemma (Crochemore & Rytter [CR95])

Suppose $u$ is not a repetition, and suppose $v \ne u^j$ for any $j \ge 1$. If $u^2$ is a prefix of $v^2$, in turn a proper prefix of $w^2$, then $|w| \ge |u| + |v|$.

The Fibonacci string demonstrates that this result is best possible (squares ending at positions 6, 10, 16 = 6+10, 26 = 10+16):

i =  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
x =  a  b  a  a  b  a  b  a  a  b  a  a  b  a  b  a  a  b  a  b  a  a  b  a  a  b

The Three Squares Lemma is a result of great insight: it tells us that if three squares occur at the same position, then one of them has to be “large”. But we need to know much more: what if the three squares just overlap, just occur in the same neighbourhood? What then???
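The Fibonacci example can be verified with a few lines of code. This sketch assumes the usual construction f = a, ab, aba, abaab, ..., in which each string is the previous one followed by the one before it. The helper names are illustrative.

```python
def fibonacci_string(min_len):
    """Return the first Fibonacci string (a, ab, aba, abaab, ...) of length >= min_len."""
    prev, cur = "b", "a"
    while len(cur) < min_len:
        prev, cur = cur, cur + prev
    return cur


def prefix_square_lengths(x):
    """Lengths 2h of all prefixes of x that are squares, i.e. x[0:h] == x[h:2h]."""
    return [2 * h for h in range(1, len(x) // 2 + 1) if x[:h] == x[h:2 * h]]


if __name__ == "__main__":
    f = fibonacci_string(26)[:26]
    ends = prefix_square_lengths(f)
    print(f)
    print(ends)
    # Half-lengths of consecutive prefix squares: each is the sum of the previous two,
    # so the bound |w| >= |u| + |v| is attained with equality.
    halves = [e // 2 for e in ends]
    for u, v, w in zip(halves, halves[1:], halves[2:]):
        print(u, v, w, w == u + v)
```

The half-lengths 3, 5, 8, 13 are consecutive Fibonacci numbers, which is why the bound is met exactly.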

SLIDE 11

New Ideas (since 2005)

We paraphrase the accumulated results of [FSS05, PST05, S05, FPST06, S07, KS12, FFSS12]:

The bulk of the research considers two squares $u^2$ and $v^2$, $|u| < |v| < 2|u|$, so that $u$, but not $u^2$, is a prefix of $v$. There are two cases, whose analysis is quite different, but whose results are qualitatively the same, namely a breakdown of the string into runs of small period:

(C1) $|v| \le 3|u|/2$;
(C2) $|v| > 3|u|/2$.

The details are complicated, but the main results are as follows:

SLIDE 12

$|u| < |v| \le 3|u|/2$: $w$ not required

Theorem (C1)

If $x = v^2$ with prefix $u^2$, $|u| < |v| \le 3|u|/2$, then

    $x = u_1^m u_2 u_1^{m+1} u_2 u_1$,

where $|u_1| = |v| - |u| \le |u|/2$, $|u_2| = |u| \bmod |u_1| \ge 0$, $m = \lfloor |u|/|u_1| \rfloor \ge 2$, and $u_2$ is a proper prefix of $u_1$. Moreover, $x$ contains no runs of period $\ge |u_1|$ other than specific identifiable ones described in [KS12].

For example, the prefix $f[1..10] = v^2 = (abaab)^2$ of the Fibonacci string $f$ given above has proper prefix $u^2 = (aba)^2$; hence $|u| = 3$ and $|v| = 5$, and we find $3|u|/2 < |v| < 2|u|$. In the notation of Theorem (C2), $u_1 = a$ and $u_2 = b$: this is the shortest possible instance of (C2). Also the prefix $f[1..16] = v^2 = (abaababa)^2$ has proper prefix $u^2 = (abaab)^2$, so that now $|u| = 5$, $|v| = 8$, again satisfying $3|u|/2 < |v| < 2|u|$, and $u_1 = ab$, $u_2 = a$.
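The factorization in Theorem (C1) can be recomputed mechanically from v and the length of u. The sketch below assumes the reading in which u_1 is the prefix of v of length |v| - |u| and u_2 is the prefix of u_1 of length |u| mod |u_1|. The function name `c1_factors` and the sample string are illustrative and not taken from the talk.

```python
def c1_factors(v, u_len):
    """Return (u1, u2, m) with v^2 = u1^m u2 u1^(m+1) u2 u1 for case (C1)."""
    v_len = len(v)
    if not (u_len < v_len and 2 * v_len <= 3 * u_len):
        raise ValueError("requires |u| < |v| <= 3|u|/2")
    x = v + v
    if x[:u_len] != x[u_len:2 * u_len]:
        raise ValueError("v^2 has no square prefix u^2 with this |u|")
    u1 = v[:v_len - u_len]           # |u1| = |v| - |u|
    m, r = divmod(u_len, len(u1))    # m = floor(|u| / |u1|), r = |u| mod |u1|
    u2 = u1[:r]                      # proper prefix of u1 (possibly empty)
    return u1, u2, m


if __name__ == "__main__":
    v, u_len = "ababaab", 5          # u = ababa, so |u| = 5 < |v| = 7 <= 7.5
    u1, u2, m = c1_factors(v, u_len)
    rebuilt = u1 * m + u2 + u1 * (m + 1) + u2 + u1
    print(u1, u2, m, rebuilt == v + v)
```

Here u = ababa and v = ababaab give u_1 = ab, u_2 = a and m = 2, and the rebuilt string equals v^2.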

SLIDE 13

$3|u|/2 < |v| < 2|u|$

Theorem (C2)

Suppose $u^2$ and $v^2$, $3|u|/2 < |v| < 2|u|$, occur at the same position $i$ in $x$. Then $v = u_1 u_2 u_1 u_1 u_2$, where $|u_1| = 2|u| - |v|$, $|u_2| = 2|v| - 3|u|$. If moreover a third square $w^2$ occurs at position $i+k$, where $|v| - |u| < |w| < |v|$, $|w| \ne |u|$, $0 \le k < |v| - |u|$, then $x[i..i+2|v|-1]$ breaks down into runs of small period according to 14 well-defined subcases [KS12, FFSS12].

I confess that it is an exaggeration to call this a “theorem”: two of the 14 subcases have been only partly proved [FPST06, FFSS12]. Nevertheless there is convincing evidence from extensive computer simulations [KS12] that the incomplete cases do satisfy the stated constraint.
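The first part of Theorem (C2) can be checked on the Fibonacci prefixes discussed earlier. The sketch assumes the lengths |u_1| = 2|u| - |v| and |u_2| = 2|v| - 3|u|, which are the values that make |v| = 3|u_1| + 2|u_2|. The function name `c2_factors` is illustrative, and the 14 subcases involving the third square are not modelled.

```python
def c2_factors(v, u_len):
    """Return (u1, u2) with v = u1 u2 u1 u1 u2 for case (C2)."""
    v_len = len(v)
    if not (3 * u_len < 2 * v_len < 4 * u_len):
        raise ValueError("requires 3|u|/2 < |v| < 2|u|")
    x = v + v
    if x[:u_len] != x[u_len:2 * u_len]:
        raise ValueError("v^2 has no square prefix u^2 with this |u|")
    n1 = 2 * u_len - v_len           # |u1|
    n2 = 2 * v_len - 3 * u_len       # |u2|
    u1 = v[:n1]
    u2 = v[n1:n1 + n2]
    return u1, u2


if __name__ == "__main__":
    for v, u_len in (("abaab", 3), ("abaababa", 5)):
        u1, u2 = c2_factors(v, u_len)
        print(v, u1, u2, u1 + u2 + u1 + u1 + u2 == v)
```

For |u| = 3, |v| = 5 this gives u_1 = a, u_2 = b, and for |u| = 5, |v| = 8 it gives u_1 = ab, u_2 = a.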

SLIDE 14

Two Subcases

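We show Subcases 5 & 13: for both it is true [KS12] that $v = d^{|v|/|d|}$, with $d$ a prefix of $v$ of length $|d| = \gcd(|u|, |v|, |w|)$.

[Figure: Subcase 5: $0 \le k \le |u_1|$, $|u| + |u_1| < k + |w| \le |v|$. The diagram shows the factors $u_1 u_2 u_1 u_1 u_2 u_1 u_2 u_1 u_1$ with the extents of $u$ and $v$, the offset $k$, and the labels $w^{(1)}$, $w^{(2)}$ marked.]

[Figure: Subcase 13: $|u_1| < k < |u_1| + |u_2|$, $|v| < k + |w| \le 2|u|$. The diagram shows the factors $u_1 u_2 u_1 u_1 u_2 u_1 u_2 u_1 u_1 u_2$ with the extents of $u$ and $v$, the offset $k$, the labels $w^{(1)}$, $w^{(2)}$, and the extent of $x[k+1..k+2|w|]$ marked.]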

SLIDE 15

Along the Way ...

In connection with (C2), a new and useful lemma¹ emerged: what happens when both x and some rotation (cyclic shift) of x have the same period?

Lemma

Suppose both $x$ and $R_v(x)$, $0 < v < |x|$, have period $u$, where $\ell = |x| \bmod u > 0$ and $e = \lfloor |x|/u \rfloor$. Let $x_v$ denote $R_v(x)$, and let $d = \gcd(u, \ell)$. Then

(a) if $e = 1$ and $v \ge \ell$, $x_{v-\ell}[1..2\ell]$ is a square of period $\ell$;
(b) if $e = 1$ and $v \le \ell$, $x[1..v+\ell]$ has period $\ell$;
(c) if $e > 1$ and $v < u$, $x[1..v+\ell]$ has period $\ell$; if moreover $v + d \ge u$, then $x$ is a repetition of period $d$;
(d) if $e > 1$ and $u \le v \le |x| - u$, $x[1..u+\ell]$, hence $x$, is a repetition of period $d$;
(e) if $e > 1$ and $|x| - u < v$, where $v' = v - (|x| - u)$, $x[v'+1..u+\ell]$ has period $\ell$; if moreover $v' \le d$, then $x$ is a repetition of period $d$.

¹ Credit to 23 PhD students in Informatics at the University of Warsaw, who verified the result up to $|x| = 4000$!
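Part (b) of the lemma is easy to test exhaustively on short binary strings. The sketch below assumes that $R_v(x)$ is the left rotation $x[v+1..|x|]\,x[1..v]$ and that "has period u" means $x[i] = x[i+u]$ wherever both positions exist, not necessarily the minimum period. Both readings are assumptions about the slide, and all the names are illustrative. It checks case (b) only.

```python
from itertools import product


def rotate(x, v):
    """Left rotation by v positions (assumed meaning of R_v(x))."""
    return x[v:] + x[:v]


def has_period(x, p):
    """True if 0 < p <= |x| and x[i] == x[i + p] for every valid i."""
    return 0 < p <= len(x) and all(x[i] == x[i + p] for i in range(len(x) - p))


def check_case_b(max_len):
    """Exhaustively test case (b) on all binary strings of length <= max_len."""
    for n in range(2, max_len + 1):
        for t in product("ab", repeat=n):
            x = "".join(t)
            for u in range(1, n):
                ell, e = n % u, n // u
                if ell == 0 or e != 1 or not has_period(x, u):
                    continue
                for v in range(1, ell + 1):   # case (b): v <= ell
                    if has_period(rotate(x, v), u) and not has_period(x[:v + ell], ell):
                        return False
    return True


if __name__ == "__main__":
    x, u, v = "abaab", 3, 1
    print(rotate(x, v), has_period(x, u), has_period(rotate(x, v), u))
    print(check_case_b(10))
```

For x = abaab with u = 3 and v = 1, the rotation baaba also has period 3, and the prefix aba of length v + ℓ = 3 has period ℓ = 2, as case (b) predicts.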

SLIDE 16

The General Case

Clearly, quite apart from the two missing subcases, there is much more work to be done:

◮ What can be said when $|w| > |v|$ (as in the case of the Three Squares Lemma), but with $k > 0$?

◮ What if $u^2$ and $v^2$ are not coincident?

◮ What if $w^2$ occurs to the left of $u^2$, or somewhere in between $u^2$ and $v^2$?

◮ In other words, we need a general case that puts together two configurations:

[Diagram: the square $u\,u$ with the square $v\,v$ at offset $k_1$, and the square $v\,v$ with the square $w\,w$ at offset $k_2$]

In fact, a start has recently been made in this direction [S13], but an analysis of the combinatorial possibilities requires consideration of many more subcases.

SLIDE 17

Putting It All Together

◮ With deeper combinatorial insight, perhaps we can classify the possible periodic structures at each position in a string,

◮ so that a computer program can do a left-to-right scan to compute all the repetitions using an order of magnitude less time and space than present algorithms;

◮ and thus deal with terabytes tomorrow the way we process gigabytes today:

◮ an advance in software based on combinatorics.

????

SLIDE 18

Anisa Al-Hafeedh, Maxime Crochemore, Lucian Ilie, Evguenia Kopylova, W. F. Smyth, German Tischler & Munina Yusufu, A comparison of index-based Lempel-Ziv LZ77 factorization algorithms, ACM Computing Surveys (2012) to appear.

Alberto Apostolico & Franco P. Preparata, Optimal off-line detection of repetitions in a string, Theoret. Comput. Sci. 22 (1983) 297–315.

Maxime Crochemore, An optimal algorithm for computing all the repetitions in a word, Inform. Process. Lett. 12–5 (1981) 244–248.

Maxime Crochemore, Lucian Ilie & Liviu Tinta, The “runs” conjecture, Theoret. Comput. Sci. 412–27 (2012) 2931–2941.

Maxime Crochemore & Wojciech Rytter, Squares, cubes, and time-space efficient strings searching, Algorithmica 13 (1995) 405–425.

Kangmin Fan, Simon J. Puglisi, W. F. Smyth & Andrew Turpin, A new periodicity lemma, SIAM J. Discrete Math. 20–3 (2005) 656–668.

Kangmin Fan, R. J. Simpson & W. F. Smyth, A new periodicity lemma (preliminary version), Proc. 16th Annual Symp. Combinatorial Pattern Matching, Springer Lecture Notes in Computer Science LNCS 3537 (2005) 257–265.

Frantisek Franek, Robert C. G. Fuller, Jamie Simpson & W. F. Smyth, More results on overlapping squares, J. Discrete Algorithms (2012) to appear.

SLIDE 19

Mathieu Giraud, Not so many runs in strings, Proc. 2nd Internat. Conf. on Language & Automata Theory & Applications, Carlos Martín-Vide, Friedrich Otto & Henning Fernau (eds.), Lecture Notes in Computer Science, LNCS 5196, Springer-Verlag (2008) 232–239.

Mathieu Giraud, Asymptotic behavior of the numbers of runs and microruns, Inform. & Computation 207–11 (2009) 1221–1228.

Roman Kolpakov & Gregory Kucherov, Finding maximal repetitions in a word in linear time, Proc. 40th Annual IEEE Symp. Found. Computer Science (1999) 596–604.

Roman Kolpakov & Gregory Kucherov, On maximal repetitions in words, J. Discrete Algorithms 1 (2000) 159–186.

Evguenia Kopylova & W. F. Smyth, The three squares lemma revisited, J. Discrete Algorithms 11 (2012) 3–14.

Michael G. Main, Detecting leftmost maximal periodicities, Discrete Applied Maths. 25 (1989) 145–153.

Michael G. Main & Richard J. Lorentz, An O(n log n) algorithm for finding all repetitions in a string, J. Algorithms 5 (1984) 422–432.

Simon J. Puglisi & R. J. Simpson, The expected number of runs in a word, Australasian J. Combinatorics 42 (2008) 45–54.

SLIDE 20

Simon J. Puglisi, R. J. Simpson & W. F. Smyth, How many runs can a string contain?, Theoret. Comput. Sci. 401 (2008) 165–171.

Simon J. Puglisi, W. F. Smyth & Andrew Turpin, Some restrictions on periodicity in strings, Proc. 16th Australasian Workshop on Combinatorial Algs. (2005) 263–268.

Wojciech Rytter, The number of runs in a string: improved analysis of the linear upper bound, Proc. 23rd Symp. Theoretical Aspects of Computer Science, B. Durand & W. Thomas (eds.), LNCS 2884, Springer-Verlag (2006) 184–195.

R. J. Simpson, Intersecting periodic words, Theoret. Comput. Sci. 374 (2007) 58–65.

Jamie Simpson, Modified Padovan words and the maximum number of runs in a word, Australasian J. Combinatorics 46 (2010) 129–145.

Bill Smyth, Computing Patterns in Strings, Pearson Addison-Wesley (2003) 423 pp.

W. F. Smyth, Computing periodicities in strings — a new approach, Proc. 16th Australasian Workshop on Combinatorial Algs. (2005) 481–488.

W. F. Smyth, Three overlapping squares: the general case characterized, Theoret. Comput. Sci., submitted for publication (2013).

SLIDE 21

Jacob Ziv & Abraham Lempel, A universal algorithm for sequential data compression, IEEE Trans. Information Theory 23 (1977) 337–343.
