Factor Automata of Automata and Applications
Mehryar Mohri (1,2), Pedro Moreno (2), Eugene Weinstein (1,2)
mohri@cs.nyu.edu, pedro@google.com, eugenew@cs.nyu.edu
(1) Courant Institute of Mathematical Sciences, (2) Google Inc.
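Factor automaton of a single string $x$: at most $2|x| - 2$ states and $3|x| - 4$ transitions [Crochemore ’85; Blumer et al. ’86].
Factor automaton of a set of strings $U$, studied in [Blumer et al. ’87]: with $\|U\|$ the sum of the lengths of the strings in $U$, at most $2\|U\| - 1$ states and $3\|U\| - 3$ transitions.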
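$S(A)$ and $F(A)$: the minimal deterministic automata recognizing the suffixes and factors of $A$, respectively.
Construction of $S(A)$: make every state of $A$ initial (add epsilons), determinize, minimize.
$F(A)$ is obtained from $S(A)$ by making all states final and minimizing, so $|F(A)| \le |S(A)|$.
[Figure: example automaton $A$; $A$ with ε-transitions added; the resulting $S(A)$.]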
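The sizes of $S(A)$ and $F(A)$ can be cross-checked on small inputs. The sketch below is not the epsilon/determinize/minimize pipeline itself: it swaps in a brute-force count of residual languages (Myhill-Nerode), which gives the number of states of the minimal trim DFA of a finite language. The example set `U` and all function names are assumptions made for illustration.

```python
def suffixes(words):
    """All suffixes (including the empty string) of the given strings."""
    return {w[i:] for w in words for i in range(len(w) + 1)}


def factors(words):
    """All factors (substrings, including the empty string) of the given strings."""
    return {w[i:j] for w in words
            for i in range(len(w) + 1)
            for j in range(i, len(w) + 1)}


def minimal_dfa_states(language):
    """Number of states of the minimal trim DFA of a finite language.

    Each state corresponds to one distinct residual of the language,
    taken over the prefixes u of its strings (Myhill-Nerode).
    """
    prefixes = {w[:i] for w in language for i in range(len(w) + 1)}
    residuals = {
        frozenset(w[len(u):] for w in language if w.startswith(u))
        for u in prefixes
    }
    return len(residuals)


U = {"ac", "acab", "acba"}  # hypothetical example set of accepted strings
s_states = minimal_dfa_states(suffixes(U))  # |S(A)| in states
f_states = minimal_dfa_states(factors(U))   # |F(A)| in states
print(s_states, f_states)
```

This enumeration is exponential in spirit and only meant for toy inputs; it is useful mainly for sanity-checking $|F(A)| \le |S(A)|$.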
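Goal: bound $|F(A)|$ in terms of $|A|$.
Since $|F(A)| \le |S(A)|$, the size of $F(A)$ is directly related to the size of $S(A)$, so it suffices to bound $|S(A)|$.
Approach: count the number of possible sets of suffixes at the states of $S(A)$.
Earlier analyses: single strings in [Blumer et al. ’86]; string sets in [Blumer et al. ’87].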
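$A$ is $k$-suffix-unique if no two strings accepted by $A$ share the same $k$-length suffix. Suffix-unique if $k = 1$.
end-set($x$): the set of states of $A$ reached by the paths in $A$ that begin with $x$. Example: end-set($ac$) = $\{2, 3, 4, 5\}$.
$x \equiv y$ if and only if end-set($x$) = end-set($y$); $[x]$ denotes the equivalence class of $x$.
[Figure: example automaton $A$.]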
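The two definitions on this slide can be tried out directly. The sketch below assumes that end-set($x$) collects every state lying on a path that begins with $x$, the reading consistent with end-set($ac$) = $\{2, 3, 4, 5\}$. The transition table `DELTA` (states 0 to 5, accepting ac, acab, acba) is a hypothetical reconstruction of the figure, and the function names are illustrative.

```python
# Hypothetical deterministic automaton A accepting {ac, acab, acba}.
DELTA = {(0, "a"): 1, (1, "c"): 2, (2, "a"): 3,
         (2, "b"): 4, (3, "b"): 5, (4, "a"): 5}
STATES = {0, 1, 2, 3, 4, 5}


def is_k_suffix_unique(words, k):
    """True if no two strings share the same length-k suffix."""
    words = list(words)
    if k < 1 or any(len(w) < k for w in words):
        raise ValueError("expects k >= 1 and strings of length >= k")
    tails = [w[-k:] for w in words]
    return len(set(tails)) == len(tails)


def end_set(delta, states, x):
    """States reached by paths that begin with x, from any start state."""
    reached = set()
    for start in states:
        q = start
        for ch in x:
            q = delta.get((q, ch))
            if q is None:
                break  # x cannot be read from this start state
        else:
            # x was read completely: collect q and everything reachable from it.
            stack = [q]
            while stack:
                r = stack.pop()
                if r not in reached:
                    reached.add(r)
                    stack.extend(t for (p, _), t in delta.items() if p == r)
    return reached


print(end_set(DELTA, STATES, "ac"))
print(is_k_suffix_unique({"ac", "acab", "acba"}, 1))
```

Under this reading, two factors are equivalent exactly when `end_set` returns the same set for both, for example `"c"` and `"ac"` in this automaton.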
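suff($q$): the set of strings labeling paths from a state $q$ of $S(A)$ to a final state.
$N(q)$: the set of states of $A$ from which a non-empty string in suff($q$) can be read to reach a final state.
Example: suff(3) = $\{ab, ba\}$ and $N(3) = \{2, 1\}$.
[Figure: $S(A)$ and $A$ for the example.]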
$N_{str}$: the number of strings accepted by $A$.
Lemma: let $A$ be a suffix-unique automaton and let $q$ and $q'$ be two states of $S(A)$ such that $N(q) \cap N(q') \ne \emptyset$. Then
$(\mathrm{suff}(q) \subseteq \mathrm{suff}(q') \text{ and } N(q) \subseteq N(q'))$ or $(\mathrm{suff}(q') \subseteq \mathrm{suff}(q) \text{ and } N(q') \subseteq N(q))$.
Proof sketch:
$S(A)$ is minimal, so $q$ and $q'$ are reached from its initial state by some strings $u$ and $u'$.
Since $N(q) \cap N(q') \ne \emptyset$, there exists $p \in N(q) \cap N(q')$, with non-empty strings $v \in \mathrm{suff}(q)$ and $v' \in \mathrm{suff}(q')$ such that both can be read from $p$ to a final state of $A$.
$uv$ is a suffix of a string accepted by $A$; since $A$ is suffix-unique, $u$ and $u'$ are both suffixes of the string reaching $p$ in $A$, so one is a suffix of the other. Assume $u$ is a suffix of $u'$.
Any path in $A$ ending in $u'$ must also end in $u$. Thus, $\mathrm{suff}(q') \subseteq \mathrm{suff}(q)$, which implies $N(q') \subseteq N(q)$. This proves the statement of the lemma. QED.
[Figure: paths labeled $u$, $u'$ reaching $q$, $q'$ in $S(A)$, and $v$, $v'$ read from $p$ in $A$.]
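The nesting property stated in the lemma can be checked exhaustively on a toy automaton. The sketch below is only a sanity check under assumptions, not a proof. The automaton (`DELTA`, `FINALS`) is a hypothetical suffix-unique example accepting ac, acab, acba. The states of $S(A)$ are represented by their suffix sets, computed as residuals of the suffix language, and the empty string is kept in suff($q$) when $q$ is final.

```python
from itertools import combinations

# Hypothetical suffix-unique acyclic DFA A accepting {ac, acab, acba}.
DELTA = {(0, "a"): 1, (1, "c"): 2, (2, "a"): 3,
         (2, "b"): 4, (3, "b"): 5, (4, "a"): 5}
FINALS = {2, 5}
STATES = range(6)


def right_language(p):
    """Strings labeling paths from state p of A to a final state."""
    out = {""} if p in FINALS else set()
    for (src, ch), dst in DELTA.items():
        if src == p:
            out |= {ch + w for w in right_language(dst)}
    return out


LANG = {p: right_language(p) for p in STATES}
suffix_lang = {w[i:] for w in LANG[0] for i in range(len(w) + 1)}

# One suffix set suff(q) per state q of S(A): the residuals of suffix_lang.
prefixes = {w[:i] for w in suffix_lang for i in range(len(w) + 1)}
suff_sets = {frozenset(w[len(u):] for w in suffix_lang if w.startswith(u))
             for u in prefixes}


def N(suff_q):
    """States of A from which a non-empty string of suff(q) reaches a final state."""
    return frozenset(p for p in STATES
                     if any(v and v in LANG[p] for v in suff_q))


def lemma_holds():
    """Check: whenever N(q) and N(q') intersect, suff and N are nested the same way."""
    for s1, s2 in combinations(suff_sets, 2):
        n1, n2 = N(s1), N(s2)
        if n1 & n2:
            if not ((s1 <= s2 and n1 <= n2) or (s2 <= s1 and n2 <= n1)):
                return False
    return True


print(lemma_holds())
```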
Bound: let $A$ be a suffix-unique automaton; then the number of states of $S(A)$ is bounded as
$|S(A)|_Q \le 2|A|_Q - 3.$
Proof idea:
By the lemma, either suffix sets are disjoint, or one includes the other.
Each state $q$ of $S(A)$ corresponds to a distinct equivalence class $[x]$; count these to get the bound.
The resulting hierarchy of classes is what we will analyze.
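The bound $|S(A)|_Q \le 2|A|_Q - 3$ can be spot-checked numerically. The sketch below takes $A$ to be the minimal DFA of a small suffix-unique set of strings and counts states by brute force through residual languages. The example sets and helper names are assumptions. Degenerate inputs such as the single string "a" do not satisfy the inequality as written, so the check is restricted to slightly longer strings.

```python
def suffixes(words):
    """All suffixes (including the empty string) of the given strings."""
    return {w[i:] for w in words for i in range(len(w) + 1)}


def minimal_dfa_states(language):
    """State count of the minimal trim DFA of a finite language (residual count)."""
    prefixes = {w[:i] for w in language for i in range(len(w) + 1)}
    return len({
        frozenset(w[len(u):] for w in language if w.startswith(u))
        for u in prefixes
    })


def check_suffix_unique_bound(words):
    """Return (|S(A)|_Q, 2|A|_Q - 3) for the minimal DFA A of `words`."""
    a_states = minimal_dfa_states(words)            # |A|_Q
    s_states = minimal_dfa_states(suffixes(words))  # |S(A)|_Q
    return s_states, 2 * a_states - 3


# Hypothetical suffix-unique sets: all strings end in distinct symbols.
for example in ({"ac", "acab", "acba"}, {"ab", "ba"}):
    size, bound = check_suffix_unique_bound(example)
    print(sorted(example), size, bound, size <= bound)
```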
Identify each state of $S(A)$ with an equivalence class $[x]$ of a factor $x$.
Branching nodes: classes $[x]$ such that $ax$ and $bx$ ($a \ne b$) are both factors (since $\equiv$ is a right-equivalence relation).
Non-branching nodes: at most $N_{nb} \le |A|_Q - 2 + N_{str}$ nodes.
Branching nodes are bounded separately, in terms of $|A|_Q - 2$ and $N_{str}$.
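[Figure: hierarchy of suffix sets; panels labeled Disjoint, Includes, Includes.]
If $a_1, \ldots, a_{N_{str}}$ are the final symbols of the strings accepted by $A$, then each $[a_i]$ is a child of the root node $[\epsilon]$.
Let $[a_{N_{str}+1}], \ldots, [a_{N_{str}+k}]$ be the remaining children of the root. The sub-tree rooted at $[a_i]$ has $n_{a_i}$ leaves and hence at most $n_{a_i} - 1$ branching nodes; the total number of leaves is at most $|A|_Q - 2$.
Branching nodes: $N_b \le \sum_{i=1}^{N_{str}+k} (n_{a_i} - 1) + 1 \le |A|_Q - 2 - N_{str}$.
Total: at most $N_{nb} + N_b \le 2|A|_Q - 4$ nodes, which gives
$|S(A)|_Q \le 2|A|_Q - 3.$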
Bounds for a set of strings $U$, with $A$ a prefix-tree automaton accepting $U$:
$|S(U)|_Q \le 2|A|_Q - 2$, $|F(U)|_Q \le 2|A|_Q - 2$, $|S(U)|_E \le 3|A|_E - 4$, $|F(U)|_E \le 3|A|_E - 4$.
Previously known bounds: $|S(U)|_Q \le 2\|U\| - 1$ and $|F(U)|_E \le 3\|U\| - 3$.
Bounds for a $k$-suffix-unique automaton $A$ that accepts $n$ strings, where $A_k$ is the part of $A$ after removing all suffixes of length $k$:
$|S(A)|_Q \le 2|A_k|_Q + 2kn - 3$, $|F(A)|_Q \le 2|A_k|_Q + 2kn - 3$,
$|S(A)|_E \le 2|A_k|_E + 3kn - 3k - 1$, $|F(A)|_E \le 2|A_k|_E + 3kn - 3k - 1$.
Idea: make the automaton suffix-unique by adding symbols, construct the suffix automaton, then remove the added symbols.
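To get a feel for how the prefix-tree bound compares with the $\|U\|$-based one, the bounds can be evaluated on a small set. In the sketch below, the prefix tree is taken to have one state per distinct prefix, $\|U\|$ is the sum of string lengths, and the factor automaton size is obtained by brute force through residual languages. The example set and all function names are assumptions for illustration.

```python
def prefix_tree_size(words):
    """(states, transitions) of the prefix tree (trie) of a set of strings."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    return len(prefixes), len(prefixes) - 1  # one incoming edge per non-root node


def factor_automaton_states(words):
    """|F(U)|_Q by brute force: distinct residuals of the set of factors."""
    facts = {w[i:j] for w in words
             for i in range(len(w) + 1)
             for j in range(i, len(w) + 1)}
    prefixes = {f[:i] for f in facts for i in range(len(f) + 1)}
    return len({
        frozenset(f[len(u):] for f in facts if f.startswith(u))
        for u in prefixes
    })


def k_suffix_unique_state_bound(ak_states, k, n):
    """Evaluate 2|A_k|_Q + 2kn - 3 for given |A_k|_Q, k and n."""
    return 2 * ak_states + 2 * k * n - 3


U = {"ac", "acab", "acba"}  # hypothetical example set
q, e = prefix_tree_size(U)
prefix_tree_bound = 2 * q - 2                   # 2|A|_Q - 2
length_bound = 2 * sum(len(w) for w in U) - 1   # 2||U|| - 1
print(factor_automaton_states(U), prefix_tree_bound, length_bound)
```

Strings sharing long prefixes make the prefix tree much smaller than $\|U\|$, which is where the prefix-tree bound gains most.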
Application: music identification, where the set of music phonemes is the alphabet.
[Figure: size (# states and # arcs) of the factor automaton versus # songs (2,000 to 16,000), compared with # states/arcs of the non-factor automaton.]
$|F(A)|_E \approx 2.1\,|A|_E$
[Figure: number of non-unique songs versus $k$ (suffix length).]
[Figure: number of non-unique factors versus factor length.]
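New bounds on the size of the factor automaton of a set of strings, or more generally of another automaton.
For $k$-suffix-unique automata accepting $n$ strings, the bounds grow with $kn$ and $k$.
Factor automata can be used to index potentially very large sets of strings.
Applied in a music identification system.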