The Expected Number of Repetitions in Random Words
Arseny M. Shur
Ural Federal University, Ekaterinburg, Russia
Shanghai, April 25, 2015
A discipline that studies properties of sequences of symbols
Born: 1906
Named: 1983
Mathematics and Its Applications (1983)
Sources: algebra (terms); symbolic dynamics (trajectories); grammars and rewriting systems; algorithms for sequential data; biological strings; ...
A palindrome is a word which is equal to its reversal.
[Figure: an example palindrome]
Palindromes are among the simplest and most common repetitions in words, along with squares, which are words consisting of two equal parts.
[Figure: an example square]
Palindromes are in some sense counterparts of squares:
in a sequence of states of some finite-state machine, a square indicates repeated behaviour, while a palindrome shows that the machine reversed back to front;
among the basic data structures, palindromes correspond to stacks, while squares correspond to queues;
as a consequence, the language of all palindromes is context-free, while the language of all squares is not.
We consider finite words over finite (k-letter) alphabets; we write w = w[1..n] for a word of length n; words of the form w[i..j] are factors of w.
There are many results on the possible number of distinct palindromic factors and square factors in a word of length n:
the maximum number of palindromes is n (Droubay, Pirillo, 2001);
the maximum number of squares is between n − O(√n) and 2n − O(log n) (Ilie, 2007);
the minimum number of palindromes is k for k ≥ 3 and 8 for k = 2, n ≥ 9;
the minimum number of squares is 0 for k ≥ 3 (Thue, 1912) and 3 for k = 2 (Fraenkel, Simpson, 1995);
any number of palindromes between the minimum and the maximum is available;
for k ≥ 4, a word can contain k palindromes and 0 squares simultaneously.
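For short words, the quantities in these bounds can be computed directly. The following is a minimal brute-force sketch; the function names are illustrative and do not come from the talk.

```python
def distinct_palindromes(w):
    """Set of distinct non-empty palindromic factors of w (brute force)."""
    n = len(w)
    return {w[i:j] for i in range(n) for j in range(i + 1, n + 1)
            if w[i:j] == w[i:j][::-1]}


def distinct_squares(w):
    """Set of distinct non-empty square factors uu of w (brute force)."""
    n = len(w)
    found = set()
    for i in range(n):
        # Try every possible half-length that still fits inside w.
        for half in range(1, (n - i) // 2 + 1):
            if w[i:i + half] == w[i + half:i + 2 * half]:
                found.add(w[i:i + 2 * half])
    return found


word = "abaabaab"
print(len(word), len(distinct_palindromes(word)), len(distinct_squares(word)))
```

The word abaabaab attains the maximum number of palindromes for its length, while the ternary word abcacb contains no squares at all.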
Problems
Find the expected number of distinct palindromic factors and of distinct square factors in a random word of length n.
Theorem
The expected number of distinct palindromic factors in a random word of length n over a fixed nontrivial alphabet is Θ(√n).
As a by-product of the technique used, we also get
Theorem
The expected number of distinct square factors in a random word of length n over a fixed nontrivial alphabet is Θ(√n).
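Let k (the alphabet size) be fixed; E(n) is the expectation studied. The expected number E_m(n) of distinct palindromic factors of length m is bounded above by both
⋆ the total number of k-ary palindromes of length m (blue curve);
⋆ the expected number of occurrences of palindromic factors of length m in a random word of length n (red curve).
[Figure: two plots against m. Panel "Length 2m": the curves k^m and (n−2m+1)/k^m, crossing at m = p_e. Panel "Length 2m+1": the curves k^(m+1) and (n−2m)/k^m, crossing at m = p_o.]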
[Figure: the two plots for lengths 2m and 2m+1, with the curves crossing at m = p_e and m = p_o.]
E(n) = Σ_m E_m(n) is bounded by the total area under the graphs;
since all graphs are exponential curves, the area under each pair of curves equals the height of their crossing point up to a constant multiple, and this height is O(√n) for fixed k;
thus, E(n) = O(√n);
some additional considerations show that the upper bound is sharp up to a constant multiple, implying E(n) = Θ(√n).
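The area argument can be checked numerically. The sketch below assumes the two bounds read off the plots: a length-2m palindrome count of k^m against (n−2m+1)/k^m expected occurrences, and k^(m+1) against (n−2m)/k^m for length 2m+1. The function name `upper_bound` is illustrative.

```python
import math


def upper_bound(n, k):
    """Sum over all lengths of min(number of palindromes, expected occurrences)."""
    total = 0.0
    for length in range(1, n + 1):
        # k^m palindromes of length 2m, k^(m+1) palindromes of length 2m+1.
        count = k ** ((length + 1) // 2)
        # (n - length + 1) positions, each palindromic with probability k^(-floor(length/2)).
        occurrences = (n - length + 1) / k ** (length // 2)
        total += min(count, occurrences)
        if occurrences < 1e-9:
            break  # the remaining terms are negligible
    return total


for n in (10**3, 10**4, 10**5, 10**6):
    print(n, upper_bound(n, 2) / math.sqrt(n))
```

The printed ratios stay within a bounded range as n grows, which is what the O(√n) bound predicts for fixed k.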
Refinement of the obtained result: consider E(n, k) instead of E(n) and find how the constant in the Θ(√n) expression depends on k.
Intuition: the more letters, the more luck is needed to get a palindrome;
this intuition is broken by the picture: the peak on the right graph is ≈ √(kn);
is E(n, k) = Θ(√(kn))? Not so easy.
[Figure: the two plots for lengths 2m and 2m+1, with the curves crossing at m = p_e and m = p_o.]
If p_o is an integer [half-integer], we get √(kn) [2√n];
similarly for p_e, but the values are √k times smaller;
note that p_e ≈ p_o + 1/2 (in fact, p_e = p_o + 1/2 exactly!)
◮ the upper bound oscillates between the orders of √n and √(kn)
◮ what's next?
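The relation between the two crossing points can be checked numerically. The sketch below assumes p_e solves k^m = (n−2m+1)/k^m and p_o solves k^(m+1) = (n−2m)/k^m, as read off the plots. Substituting m = p_o + 1/2 into the first equation gives the second, which is why the shift is exactly 1/2. The function name `crossing` is illustrative.

```python
import math


def crossing(n, k, odd):
    """Real m at which the two curves of one plot meet, found by bisection."""
    def gap(m):
        if odd:   # length 2m+1: k^(m+1) versus (n - 2m) / k^m
            return k ** (m + 1) - (n - 2 * m) / k ** m
        # length 2m: k^m versus (n - 2m + 1) / k^m
        return k ** m - (n - 2 * m + 1) / k ** m

    lo, hi = 0.0, math.log(n + 1, k)  # gap(lo) < 0 < gap(hi) when n > k
    for _ in range(200):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n, k = 10**6, 3
p_e = crossing(n, k, odd=False)
p_o = crossing(n, k, odd=True)
print(p_e, p_o, p_e - p_o)
print(k ** (p_o + 1), math.sqrt(k * n))  # height of the odd-length peak
print(k ** p_e, math.sqrt(n))            # height of the even-length peak
```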
To get more intuition, suppose (even if this is not true) that for a random k-ary word of length n all events of type “to contain a given palindrome of length m” are independent and equiprobable.
Balls: palindromic factors of length m of our random word: w[i_1..j_1], w[i_2..j_2], w[i_3..j_3], ..., w[i_s..j_s]
Bins: distinct palindromes of length m: aaaaa, aabaa, ababa, ..., bbbbb
Folklore Proposition
For N bins and CN balls, the expected number of empty bins is ≈ N e^(−C).
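The folklore proposition follows from the exact value N(1 − 1/N)^(CN) for uniformly and independently thrown balls, which tends to N e^(−C) for large N. A small sketch compares the exact value, the approximation, and a simulation; the function names and parameters are illustrative.

```python
import math
import random


def expected_empty_exact(bins, balls):
    """Exact expected number of empty bins for independent uniform throws."""
    return bins * (1 - 1 / bins) ** balls


def simulate_empty(bins, balls, trials, seed=0):
    """Average number of empty bins over several simulated experiments."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hit = {rng.randrange(bins) for _ in range(balls)}
        total += bins - len(hit)
    return total / trials


N, C = 1000, 2
print(expected_empty_exact(N, C * N))       # exact expectation
print(N * math.exp(-C))                     # folklore approximation
print(simulate_empty(N, C * N, trials=200)) # Monte Carlo estimate
```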
Conjecture
The function E(n, k) oscillates between its maximums
\[ E(n, k) = \left(1 - \frac{1}{e} + \frac{4}{k-1} - \frac{k+1}{2(k^3-1)} - O\!\left(\frac{1}{k e^{k}}\right)\right)\sqrt{kn} + O\!\left(\frac{\sqrt{k}\,\log n}{\sqrt{n}}\right) \]
for integer p_o and minimums
\[ E(n, k) = \left(3 - \frac{1}{e} + \frac{4}{k-1} - \frac{k^2+1}{2(k^3-1)} - O\!\left(\frac{1}{e^{k}}\right)\right)\sqrt{n} + O\!\left(\frac{\sqrt{k}\,\log n}{\sqrt{n}}\right) \]
for integer p_e.
A large amount of experimental data for C(k) = E(n, k)/√n, for different k and n ∼ 10^6 to 10^8, was obtained by M. Rubinchik with the use of a novel data structure (the palindromic tree). His data
supports the conjecture in general;
cannot definitely tell whether the coefficients are correct.
One of the problems: for small alphabets, the suggested maximums and minimums are almost indistinguishable.
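Experiments of this kind need a fast way to count distinct palindromic factors. The following is a textbook-style sketch of a palindromic tree, not Rubinchik's implementation; the function names, the alphabet choice (at most 26 letters), and the experiment sizes are assumptions made for illustration.

```python
import random


def count_distinct_palindromes(s):
    """Number of distinct non-empty palindromic factors of s (palindromic tree)."""
    # Node 0 is an imaginary palindrome of length -1, node 1 is the empty word.
    length = [-1, 0]
    link = [0, 0]          # suffix links
    edges = [{}, {}]       # edges[v][c] = node of the palindrome c + pal(v) + c
    last = 1               # node of the longest palindromic suffix seen so far
    for i, c in enumerate(s):
        # Find the longest palindromic suffix X such that cXc ends at position i.
        cur = last
        while i - length[cur] - 1 < 0 or s[i - length[cur] - 1] != c:
            cur = link[cur]
        if c in edges[cur]:
            last = edges[cur][c]       # cXc was already seen
            continue
        new = len(length)
        length.append(length[cur] + 2)
        edges.append({})
        if length[new] == 1:
            link.append(1)             # a single letter links to the empty word
        else:
            t = link[cur]
            while i - length[t] - 1 < 0 or s[i - length[t] - 1] != c:
                t = link[t]
            link.append(edges[t][c])
        edges[cur][c] = new
        last = new
    return len(length) - 2             # every node except the two roots


def estimate_ratio(n, k, trials, seed=0):
    """Monte Carlo estimate of E(n, k) / sqrt(n) for k <= 26."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:k]
    total = 0
    for _ in range(trials):
        word = "".join(rng.choice(letters) for _ in range(n))
        total += count_distinct_palindromes(word)
    return total / trials / n ** 0.5


if __name__ == "__main__":
    print(estimate_ratio(n=10_000, k=2, trials=20))
```

Each letter is processed in amortized constant time for a fixed alphabet, which is what makes word lengths in the millions feasible in a compiled language.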
Bad news: our assumption was totally wrong, because the events “to contain a given palindrome of length m” are dependent and have different probabilities. For example, aaa···aaa is less probable than baa···aab, and each of them “suppresses” the other.
Why were the predictions with balls and bins good?
Good news: the probabilities for all words of length m are quite close (any word of length m is more probable as a factor than any word of length m+1); moreover, most of the palindromes have almost the same (or even exactly the same) probability. There is also a way to avoid considering dependencies.
Approach: the theory of factor avoidance.
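The difference in probabilities is already visible on tiny cases. The sketch below counts, by exhaustive enumeration over a binary alphabet, how many words of length n contain aaaa and how many contain baab; the function name and the chosen lengths are illustrative.

```python
from itertools import product


def count_containing(pattern, n, alphabet="ab"):
    """Number of words of length n over `alphabet` having `pattern` as a factor."""
    return sum(pattern in "".join(t) for t in product(alphabet, repeat=n))


for n in (5, 8, 12):
    total = 2 ** n
    print(n, count_containing("aaaa", n) / total, count_containing("baab", n) / total)
```

The palindrome aaaa overlaps itself heavily, so its occurrences cluster together and fewer distinct words contain it.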
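A word u avoids a word w if w is not a factor of u.
Lemma
\[ E(n, k, m) = \sum_{w \in P} \frac{k^n - A_w(n)}{k^n}, \]
where E(n, k, m) is the expected number of distinct palindromic factors of length m in a random word of length n (over a fixed k-letter alphabet), P is the set of all k-ary palindromes of length m, and A_w(n) is the number of k-ary words of length n avoiding w.
Since
\[ E(n, k) = \sum_{m}^{n/2} E(n, k, m), \]
all we need is a good asymptotics for A_w(n).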
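The lemma is linearity of expectation: each palindrome w of length m contributes the probability 1 − A_w(n)/k^n that a random word contains it. A brute-force sketch for small parameters confirms that both sides agree exactly; the function names are illustrative and the enumeration is only feasible for short words.

```python
from fractions import Fraction
from itertools import product


def palindromes(letters, m):
    """All palindromes of length m over the given letters."""
    half = (m + 1) // 2
    result = []
    for t in product(letters, repeat=half):
        left = "".join(t)
        # Mirror the left half, dropping the middle letter when m is odd.
        result.append(left + left[::-1][m % 2:])
    return result


def expectation_direct(letters, n, m):
    """Average number of distinct palindromic factors of length m over all words."""
    total = 0
    for t in product(letters, repeat=n):
        w = "".join(t)
        total += len({w[i:i + m] for i in range(n - m + 1)
                      if w[i:i + m] == w[i:i + m][::-1]})
    return Fraction(total, len(letters) ** n)


def expectation_lemma(letters, n, m):
    """Right-hand side of the lemma, with A_w(n) counted by brute force."""
    k = len(letters)
    words = ["".join(t) for t in product(letters, repeat=n)]
    total = Fraction(0)
    for p in palindromes(letters, m):
        avoiding = sum(p not in w for w in words)   # A_p(n)
        total += Fraction(k ** n - avoiding, k ** n)
    return total


print(expectation_direct("ab", 8, 3), expectation_lemma("ab", 8, 3))
```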
A word u is a border of a word w if u is both a prefix and a suffix of w (including the case u = w).
With a word w of length m we associate its border array, which is a word ŵ[1..m] over {0, 1} such that ŵ[i] = 1 if and only if w has a border of length m−i+1.
The border array can be interpreted as the array of coefficients of a real-valued border polynomial f_w(x) such that ŵ[i] is the coefficient of x^(m−i). Since ŵ[1] = 1, this polynomial has degree m−1.
Example
The word w = aabaabaa has non-empty borders w, aabaa, aa, and a, so ŵ equals 10010011.
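The definitions translate directly into code. A minimal sketch, with illustrative function names, that computes the border array and evaluates the border polynomial:

```python
def border_array(w):
    """Border array of w as a 0/1 string: position i (1-based) is 1
    iff w has a border of length m - i + 1."""
    m = len(w)
    return "".join("1" if w[:m - i + 1] == w[i - 1:] else "0"
                   for i in range(1, m + 1))


def border_polynomial(w, x):
    """Value f_w(x): position i of the border array is the coefficient of x^(m-i)."""
    m = len(w)
    bits = border_array(w)
    return sum(x ** (m - i) for i in range(1, m + 1) if bits[i - 1] == "1")


print(border_array("aabaabaa"))
print(border_polynomial("aabaabaa", 2))
```

For x = 2 the value of the polynomial is the border array read as a binary number.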
Recall that f_w(x) is the border polynomial of w.
Theorem (Guibas, Odlyzko, 1978, 1981)
1) The number A_w(n) of words of length n avoiding a given word w of length m > 3 is
\[ A_w(n) = C_w \theta_w^{\,n} + O(1.7^n), \]
where
\[ \theta_w = k - \frac{1}{f_w(k)} - \frac{f_w'(k)}{f_w^{3}(k)} - O\!\left(\frac{m^2}{k^{3m}}\right), \qquad C_w = \frac{1}{1 - (k - \theta_w)^2 f_w'(\theta_w)}. \]
2) The condition f_u(k) < f_w(k) implies A_u(n) ≤ A_w(n) for all n ≥ 0 and, in particular, θ_u ≤ θ_w.
f_u(k) < f_w(k) iff û < ŵ as integers written in binary.
W.h.p., if w is a palindrome, then ŵ = 10···0z, where |z| = O(log |w|).
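The expansion of θ_w can be compared with the actual growth rate of A_w(n). The sketch below counts avoiding words with a dynamic programme over the states "length of the longest prefix of w that is a suffix of the text so far", then compares A_w(n+1)/A_w(n) for large n with the first three terms of the expansion. The function names and the choice n = 400 are assumptions made for illustration.

```python
def border_exponents(w):
    """Exponents with coefficient 1 in f_w: a border of length L gives x^(L-1)."""
    m = len(w)
    return [L - 1 for L in range(1, m + 1) if w[:L] == w[m - L:]]


def theta_estimate(w, k):
    """k - 1/f_w(k) - f_w'(k)/f_w(k)^3, the expansion without its O-term."""
    exps = border_exponents(w)
    f = sum(k ** e for e in exps)
    df = sum(e * k ** (e - 1) for e in exps if e > 0)
    return k - 1 / f - df / f ** 3


def avoiding_counts(w, letters, n_max):
    """List of A_w(n) for n = 0..n_max."""
    m = len(w)

    def step(state, c):
        # Longest prefix of w that is a suffix of w[:state] + c.
        t = w[:state] + c
        for j in range(min(len(t), m), -1, -1):
            if t.endswith(w[:j]):
                return j

    trans = [[step(s, c) for c in letters] for s in range(m)]
    counts = [0] * m
    counts[0] = 1
    result = [1]
    for _ in range(n_max):
        new = [0] * m
        for s in range(m):
            if counts[s]:
                for t in trans[s]:
                    if t < m:           # state m would mean w has occurred
                        new[t] += counts[s]
        counts = new
        result.append(sum(counts))
    return result


w, letters = "aabaabaa", "ab"
A = avoiding_counts(w, letters, 400)
print(A[400] / A[399], theta_estimate(w, len(letters)))
```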
Since almost all palindromes of length m have “almost equal” border polynomials, we can derive the following formula:
\[ E(n, k, m) = \begin{cases} k^{\varepsilon}\left(1 - e^{-1/k^{2\varepsilon}}\right)\sqrt{n} + O(\ldots), & m \text{ is even},\\[4pt] k^{\varepsilon}\left(1 - e^{-1/k^{2\varepsilon}}\right)\sqrt{kn} + O(\ldots), & m \text{ is odd}, \end{cases} \]
where m = 2(p_e + ε) for even m and m = 2(p_o + ε) + 1 for odd m.
The function g(x) = x(1 − e^(−1/x²)), appearing as the coefficient (for x = k^ε), has a tricky behaviour. In particular, g(1) = 1 − 1/e ≈ 0.6321, but this is not the maximum value!
max_{x>0} g(x) ≈ 0.6382 is reached at x ≈ 0.8921;
thus, the coefficients suggested by the balls-and-bins approach were slightly incorrect.
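The maximum of g can be located numerically. Setting t = 1/x², the derivative of g vanishes when e^t = 1 + 2t, and g increases before that point and decreases after it, so a ternary search applies. The sketch below uses an illustrative search interval [0.1, 3].

```python
import math


def g(x):
    """g(x) = x * (1 - exp(-1 / x^2))."""
    return x * (1 - math.exp(-1 / x ** 2))


def argmax_g(lo=0.1, hi=3.0, iterations=200):
    """Ternary search for the maximiser of the unimodal function g on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


x_star = argmax_g()
print(x_star, g(x_star), g(1.0))
```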
Theorem
Let k ≥ 2.
(1) The expected palindromic richness E(n, k) of a random k-ary word of length n is Θ(√n).
(2) The ratio E(n, k)/√n has no limit as n → ∞ with k fixed.
(3) The function $\underline{C}(k) = \liminf_{n\to\infty} E(n, k)/\sqrt{n}$ is Θ(1) as k → ∞.
(4) The function $\overline{C}(k) = \limsup_{n\to\infty} E(n, k)/\sqrt{n}$ is Θ(√k) as k → ∞.
(5) $\lim_{k\to\infty} \underline{C}(k) = 3 - 1/e$.
(6) $\lim_{k\to\infty} \overline{C}(k)/\sqrt{k} = \chi$, where χ ≈ 0.6382 is the maximum of the function f(x) = x(1 − e^(−1/x²)) in the interval (0, ∞).
Some particular values:
k     lim inf (lower C(k))   lim sup (upper C(k))
2     ≈ 6.17315              ≈ 6.17368
3     ≈ 4.40121              ≈ 4.41410
10    ≈ 3.02693              ≈ 3.41133
50    ≈ 2.70152              ≈ 5.09183
Forget about squares. They are much like even-length palindromes. The only slight difference is in borders, but still, almost all squares have “almost equal” border polynomials. The rest is easy, because there are no analogs of odd-length palindromes.