Factor Automata of Automata and Applications

slide-1
SLIDE 1

Factor Automata of Automata and Applications

Mehryar Mohri1,2, Pedro Moreno2, Eugene Weinstein1,2 mohri@cs.nyu.edu, pedro@google.com, eugenew@cs.nyu.edu

1 Courant Institute of Mathematical Sciences 2 Google Inc.

slide-2
SLIDE 2

Introduction

  • Objective: construct full index for a large set of strings
  • We want to efficiently search for factors (subwords)
  • Deterministic minimal factor automaton is a good option
  • Optimal lookup speed (linear in size of query)
  • Set of strings might be given as an automaton
  • Smaller representation
  • Might be produced by another application
  • Hence, consider factor automata of automata

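The lookup cost mentioned above can be illustrated with a minimal sketch. It assumes the factor automaton is stored as a dictionary of dictionaries, which is a hypothetical representation not taken from the slides, and it uses a hand-built factor automaton of the single string "abc" as a toy example.

```python
# Hypothetical hand-built factor automaton of the string "abc".
# Every state is final, so a query is a factor iff it can be read at all.
FACTOR_AUTOMATON = {
    0: {"a": 1, "b": 2, "c": 3},
    1: {"b": 2},
    2: {"c": 3},
    3: {},
}


def is_factor(automaton, query, start=0):
    """Follow one arc per query symbol, so time is linear in len(query)."""
    state = start
    for symbol in query:
        nxt = automaton[state].get(symbol)
        if nxt is None:
            return False  # no arc for this symbol: query is not a factor
        state = nxt
    return True


print(is_factor(FACTOR_AUTOMATON, "bc"))  # True
print(is_factor(FACTOR_AUTOMATON, "ac"))  # False
```

Because the automaton is deterministic, the loop never backtracks, and its running time does not depend on the size of the indexed set.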

slide-3
SLIDE 3

Past Work

  • Factor automaton of a string x has at most 2|x| − 2 states and 3|x| − 4 transitions [Crochemore ’85; Blumer et al. ’86]
  • Can be constructed by a linear-time online algorithm
  • Size bounds for a set of strings U have also previously been studied [Blumer et al. ’87]
  • If ||U|| is the sum of the lengths of all the strings in U, the factor automaton of U has at most 2||U|| − 1 states and 3||U|| − 3 transitions
  • We prove a substantially better bound here
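The single-string bounds above can be checked on small inputs with a brute-force sketch. This is not the linear-time online construction cited on the slide: it simply groups factors by their sets of continuations, which identifies the states of the minimal deterministic factor automaton. The function name is an assumption made for illustration.

```python
def factor_automaton_size(x):
    """Return (states, transitions) of the minimal DFA accepting the factors of x.

    Brute force: two factors reach the same state exactly when they have the
    same set of continuations, so distinct continuation sets are the states.
    """
    n = len(x)
    factors = {x[i:j] for i in range(n + 1) for j in range(i, n + 1)}

    def continuations(u):
        # All w such that u + w is again a factor of x.
        return frozenset(f[len(u):] for f in factors if f.startswith(u))

    state_of = {u: continuations(u) for u in factors}
    states = set(state_of.values())
    # One transition per distinct (source state, symbol) pair.
    transitions = {(state_of[f[:-1]], f[-1]) for f in factors if f}
    return len(states), len(transitions)


print(factor_automaton_size("abc"))  # (4, 5)
for x in ["abb", "abcbc", "aabab"]:
    states, transitions = factor_automaton_size(x)
    print(x, states <= 2 * len(x) - 2, transitions <= 3 * len(x) - 4)
```

For "abc" the bounds 2|x| − 2 = 4 and 3|x| − 4 = 5 are met exactly. The sketch is polynomial but far from linear, so it is only suitable for short strings.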

slide-4
SLIDE 4

Suffix & Factor Automata

  • We start out with an automaton A recognizing strings in U
  • Let S(A) and F(A) be the deterministic minimal automata recognizing the suffixes and factors of A, respectively
  • To construct S(A), make each state of A initial (by adding epsilons), determinize, minimize
  • To construct F(A), make each state of S(A) final, minimize
  • Consequence: |F(A)| ≤ |S(A)|

[Figure: example automaton A with states 1 to 5 and arcs labeled a, b, c]

slide-5
SLIDE 5

Suffix & Factor Automata

[Figure, built up over slides 5 to 7: the example automaton A with an ε-transition added to each of its five states, followed by a six-state deterministic automaton]
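The construction described on these slides can be sketched as follows. The representation (a dictionary of arcs), the function names, and the toy automaton are assumptions made for illustration. The input is assumed to be trim, meaning every state lies on some accepting path, and minimization uses plain Moore partition refinement rather than any particular algorithm from the talk.

```python
def determinize(arcs, initials, finals):
    """Subset construction. arcs maps state -> list of (symbol, next_state)."""
    start = frozenset(initials)
    dfa, stack = {}, [start]
    while stack:
        subset = stack.pop()
        if subset in dfa:
            continue
        moves = {}
        for q in subset:
            for symbol, nxt in arcs.get(q, []):
                moves.setdefault(symbol, set()).add(nxt)
        dfa[subset] = {a: frozenset(t) for a, t in moves.items()}
        stack.extend(dfa[subset].values())
    return dfa, start, {s for s in dfa if s & set(finals)}


def minimize(dfa, start, finals):
    """Moore partition refinement on a trim partial DFA (no dead state)."""
    block = {s: int(s in finals) for s in dfa}
    while True:
        ids, refined = {}, {}
        for s in dfa:
            sig = (block[s], tuple(sorted((a, block[t]) for a, t in dfa[s].items())))
            refined[s] = ids.setdefault(sig, len(ids))
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:
            break
    merged = {block[s]: {a: block[t] for a, t in dfa[s].items()} for s in dfa}
    return merged, block[start], {block[s] for s in finals}


def suffix_automaton(arcs, states, finals):
    # Making every state initial has the same effect as adding an epsilon
    # arc from a fresh initial state to each state of A.
    return minimize(*determinize(arcs, states, finals))


def factor_automaton(arcs, states, finals):
    dfa, start, _ = suffix_automaton(arcs, states, finals)
    return minimize(dfa, start, set(dfa))  # every state of S(A) made final


# Toy automaton accepting the single string "abb" (an assumed example).
arcs = {0: [("a", 1)], 1: [("b", 2)], 2: [("b", 3)]}
S, _, _ = suffix_automaton(arcs, {0, 1, 2, 3}, {3})
F, _, _ = factor_automaton(arcs, {0, 1, 2, 3}, {3})
print(len(S), len(F))  # 5 4
```

On this example S(A) has five states and F(A) has four, in line with |F(A)| ≤ |S(A)|. Subset construction can be exponential in general, so this sketch is meant for small inputs only.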

slide-8
SLIDE 8

Size Bound: Strategy

  • Goal: a bound on |F(A)| in terms of |A|
  • Work on bounding |S(A)| – consider suffixes only for now
  • Idea: each state in S(A) accepts a distinct set of suffixes, so count the number of possible sets of suffixes
  • The suffix sets can be arranged in a hierarchy, which is directly related in size to A
  • Motivated by similar arguments for single-string case in [Blumer et al. ’86]; string sets in [Blumer et al. ’87]

slide-9
SLIDE 9

Suffix Sets

  • Automaton A is k-suffix unique if no two strings accepted by A share the same k-length suffix. Suffix-unique if k = 1
  • Define end-set(x): set of states in A reachable after reading x
  • e.g., end-set(ac) = {2, 3, 4, 5}
  • x ≡ y denotes end-set(x) = end-set(y)
  • This is a right-invariant equivalence relation
  • [x] is the equivalence class of x

[Figure: example automaton A with states 1 to 5 and arcs labeled a, b, c]
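The k-suffix uniqueness condition can be sketched directly on an explicit list of strings. The function names are assumptions, and so is the treatment of strings shorter than k, which are compared by their full text here because the slides do not say how such strings are handled.

```python
from collections import Counter


def is_k_suffix_unique(strings, k):
    """True if no two distinct strings share the same length-k suffix.

    Assumption: a string shorter than k is compared by its full text.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    suffix_counts = Counter(s[-k:] for s in set(strings))
    return all(count == 1 for count in suffix_counts.values())


def smallest_unique_k(strings):
    """Smallest k for which the set is k-suffix unique."""
    longest = max(len(s) for s in strings)
    for k in range(1, longest + 1):
        if is_k_suffix_unique(strings, k):
            return k
    return None


U = ["abc", "abd", "cbc"]
print(is_k_suffix_unique(U, 1))  # False
print(smallest_unique_k(U))      # 3
```

In this toy set "abc" and "cbc" share the suffixes "c" and "bc", so the set only becomes suffix unique at k = 3.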

slide-10
SLIDE 10

Notation

  • N_str is the number of strings accepted by A
  • If q is a state of S(A), suff(q) is the set of suffixes accepted from q
  • e.g., suff(3) = {ab, ba}
  • N(q) is the set of states in A from which a non-empty string in suff(q) can be read to reach a final state
  • e.g., N(3) = {2, 1}

[Figure: the six-state suffix automaton S(A) and the five-state automaton A]

slide-11
SLIDE 11

Suffix Set Inclusion

  • Lemma: Let A be a suffix-unique automaton and let q and q′ be two states of S(A) such that N(q) ∩ N(q′) ≠ ∅, then
      (suff(q) ⊆ suff(q′) and N(q) ⊆ N(q′)) or (suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q))
  • Proof: Let paths in S(A) to q and q′ be labeled with u and u′.
  • Thus A must have a state p ∈ N(q) ∩ N(q′)
  • Thus, there exist paths labeled v and v′ from p to a final state, with v ∈ suff(q) and v′ ∈ suff(q′)

[Figure: paths labeled u and u′ leading to q and q′ in S(A); paths labeled u and u′ into p, and v and v′ out of p, in A]

slide-16
SLIDE 16

Suffix Set Inclusion

  • Since A is suffix-unique, any string accepted by A and ending in v must also end in uv
  • Thus, any path from initial to p must end in u
  • By same reasoning, it must also end in u′
  • Hence, u is a suffix of u′, or vice versa
  • Assume the former, then suff(q′) ⊆ suff(q), thus N(q′) ⊆ N(q). QED.

[Figure: paths labeled u and u′ leading to q and q′ in S(A); paths labeled u and u′ into p, and v and v′ out of p, in A]

slide-17
SLIDE 17

Suffix-unique Bound

  • Theorem: If A is a suffix-unique deterministic and minimal automaton, then the number of states of S(A) is bounded as |S(A)|_Q ≤ 2|A|_Q − 3.
  • Proof (sketch):
  • Lemma: For any two states of the suffix automaton, either suffix sets are disjoint, or one includes the other
  • We can show that each state q of S(A) corresponds to a distinct equivalence class [x], count these to get bound
  • The equivalence sets induce a suffix sets hierarchy which we will analyze

slide-18
SLIDE 18

Suffix Sets: Non-branching

  • Count non-branching, branching nodes separately
  • Consider a state in S(A) with equivalence class [x], where x is the longest string in the class
  • The only way to have a branching node is if there exist factors ax, bx (a ≠ b) (since ≡ is a right-equivalence relation)
  • Node is only non-branching when x is a prefix or suffix
  • |A|_Q − 2 distinct prefixes, suffix only when final state: N_str
  • Total non-branching nodes: N_nb ≤ |A|_Q − 2 + N_str

slide-19
SLIDE 19

Suffix Sets: Non-branching

[Figure: relationships between pairs of suffix sets, labeled Disjoint, Includes, Includes]

slide-20
SLIDE 20

Suffix Sets: Branching

  • If a_1, ..., a_Nstr are the distinct final symbols of each string accepted by A then each [a_i] is a child of the root
  • Let tree rooted at [a_i] have n_{a_i} leaves (n_{a_i} − 1 branching nodes)
  • Total number of leaves is |A|_Q − 2 (not initial and super-final)
  • Total branching: N_b ≤ Σ_{i=1}^{N_str+k} (n_{a_i} − 1) + 1 ≤ |A|_Q − 2 − N_str
  • Total size of tree: N_nb + N_b ≤ 2|A|_Q − 4
  • Add “super-final” state, get |S(A)|_Q ≤ 2|A|_Q − 3. QED.

[Figure: suffix-set tree with root [ε] and children [a_1], [a_2], ..., [a_Nstr], ..., [a_Nstr+k], each [a_i] rooting a sub-tree]
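The two counts from the non-branching and branching slides combine as follows. This is only the arithmetic of the final step, written with the symbols already used on the slides; the N_str terms cancel when the two inequalities are added.

```latex
\begin{align*}
N_{nb} &\le |A|_Q - 2 + N_{str} \\
N_{b}  &\le |A|_Q - 2 - N_{str} \\
N_{nb} + N_{b} &\le 2|A|_Q - 4 \\
|S(A)|_Q &\le (2|A|_Q - 4) + 1 = 2|A|_Q - 3
\end{align*}
```

The extra 1 in the last line accounts for the added super-final state.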

slide-21
SLIDE 21

Final Size Result

  • If A is a prefix tree representing a set of strings U then
      |S(U)|_Q ≤ 2|A|_Q − 2,  |S(U)|_E ≤ 3|A|_E − 4
      |F(U)|_Q ≤ 2|A|_Q − 2,  |F(U)|_E ≤ 3|A|_E − 4
  • Substantial improvement over previous: |S(U)|_Q ≤ 2||U|| − 1, |F(U)|_E ≤ 3||U|| − 3
  • When A is k-suffix unique, deterministic and minimal, and accepts n strings and A_k is the part of A after removing all suffixes of length k
      |S(A)|_Q ≤ 2|A_k|_Q + 2kn − 3,  |S(A)|_E ≤ 2|A_k|_E + 3kn − 3k − 1
      |F(A)|_Q ≤ 2|A_k|_Q + 2kn − 3,  |F(A)|_E ≤ 2|A_k|_E + 3kn − 3k − 1
  • Proof idea: add terminal symbols to make string set suffix-unique, construct suffix automaton, remove symbols
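The gap between the prefix-tree bounds and the earlier ||U||-based bounds can be illustrated with a small worked example. The helper name and the toy string set are assumptions. The sketch only evaluates the bound formulas from this slide and does not build the suffix or factor automaton.

```python
def prefix_tree_size(strings):
    """Return (states, arcs) of the prefix tree (trie) of a non-empty string set."""
    prefixes = {s[:i] for s in strings for i in range(len(s) + 1)}
    # One state per distinct prefix, one arc per non-empty prefix.
    return len(prefixes), len(prefixes) - 1


U = ["abcd", "abce", "abcf"]  # toy set with a long shared prefix
states, arcs = prefix_tree_size(U)
total_length = sum(len(s) for s in set(U))  # ||U||

new_state_bound = 2 * states - 2        # 2|A|_Q - 2
old_state_bound = 2 * total_length - 1  # 2||U|| - 1
new_arc_bound = 3 * arcs - 4            # 3|A|_E - 4
old_arc_bound = 3 * total_length - 3    # 3||U|| - 3

print(states, arcs, total_length)        # 7 6 12
print(new_state_bound, old_state_bound)  # 12 23
print(new_arc_bound, old_arc_bound)      # 14 33
```

The more prefixes the strings share, the smaller the prefix tree is relative to ||U||, and the larger the improvement.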

slide-25
SLIDE 25

Application

  • Application: large-scale music identification
  • Matching audio recording to a large song database
  • Approach: learn inventory of music sounds (“phonemes”)
  • A song is described by unique music phone sequence
  • Each song represented by unique string, set of music

phonemes is the alphabet


slide-26
SLIDE 26

Music ID Experiments

  • In our music ID application, we have |F(A)|_E ≈ 2.1|A|_E
  • Factor automaton size scales linearly with # of songs

[Figure: automaton size (up to 6e+07) against number of songs (2,000 to 16,000); series: # States factor, # Arcs factor, # States/Arcs Non-factor]

slide-27
SLIDE 27

Music ID Experiments

  • For 15,000+ songs, string set is 45-suffix unique
  • Number of “collisions” among song suffixes/factors drops off rapidly with increasing length

[Figure: non-unique songs against k (suffix length), k from 5 to 45]

[Figure: non-unique factors against factor length, lengths from 20 to 120]

slide-28
SLIDE 28

Summary

  • We have addressed the size of a factor automaton of a set of strings, or more generally of another automaton
  • We have proven substantially better size bounds
  • This suggests factor automata are useful for indexing potentially very large sets of strings
  • Our conclusions are verified experimentally in our music identification system
  • In the future, do a finer analysis
  • Tighten the kn term in the k-suffix unique bound
