Factor Automata of Automata and Applications

Mehryar Mohri (1,2), Pedro Moreno (2), Eugene Weinstein (1,2)
mohri@cs.nyu.edu, pedro@google.com, eugenew@cs.nyu.edu
(1) Courant Institute of Mathematical Sciences, (2) Google Inc.


  1. Factor Automata of Automata and Applications

  2. Introduction
     • Objective: construct a full index for a large set of strings
     • We want to efficiently search for factors (subwords)
     • Deterministic minimal factor automaton is a good option
       • Optimal lookup speed (linear in size of query)
     • Set of strings might be given as an automaton
       • Smaller representation
       • Might be produced by another application
     • Hence, consider factor automata of automata

  3. Past Work
     • Factor automaton of a string x has at most 2|x| − 2 states and 3|x| − 4 transitions [Crochemore ’85; Blumer et al. ’86]
     • Can be constructed by a linear-time online algorithm
     • Size bounds for a set of strings U have also been studied previously [Blumer et al. ’87]
     • If ||U|| is the sum of the lengths of all the strings in U:
       • Factor automaton of U has at most 2||U|| − 1 states and 3||U|| − 3 transitions
     • We prove a substantially better bound here
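The linear-time online construction mentioned on this slide can be illustrated for a single string. What follows is a minimal Python sketch of the standard online suffix-automaton (DAWG) construction, not code from the talk; the names `build_suffix_automaton` and `is_factor` are illustrative. It builds the suffix automaton, which has at most 2|x| − 1 states for |x| ≥ 2. Treating every state as accepting lets it recognize all factors, but the result is not necessarily the minimal factor automaton to which the 2|x| − 2 bound applies.

```python
def build_suffix_automaton(s):
    """Online construction: one state list entry per state, state 0 is initial.

    nxt[q] maps a symbol to the next state; link and length are the usual
    suffix links and longest-string lengths used by the construction.
    """
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the new symbol along the suffix-link chain.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's transitions with a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt


def is_factor(nxt, w):
    """Lookup is linear in len(w): follow one transition per symbol."""
    state = 0
    for ch in w:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


x = "abcbc"
sa = build_suffix_automaton(x)
print(is_factor(sa, "bcb"), is_factor(sa, "cc"), len(sa))
```

The lookup loop touches each query symbol once, which is the "linear in size of query" property the introduction asks for.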

  4. Suffix & Factor Automata
     • We start out with an automaton A recognizing strings in U
     • Let S(A) and F(A) be the deterministic minimal automata recognizing the suffixes and factors of A, respectively
     • To construct S(A): make each state of A initial (by adding epsilons), determinize, minimize
     • To construct F(A): make each state of S(A) final, minimize
     • Consequence: |F(A)| ≤ |S(A)|
     [Figure: example automaton A with states 0–5 and transitions labeled a, b, c]
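The two constructions on this slide can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: A is assumed to be trim and epsilon-free, the subset construction is started from the set of all states (which has the same effect as adding epsilon-transitions from the initial state to every state), and minimization uses simple Moore-style partition refinement. The automaton `A_TRANS`, which accepts acab and acba, and all function names are hypothetical.

```python
def determinize(trans, starts, finals):
    """Subset construction. trans maps state -> {symbol: set of next states}."""
    start = frozenset(starts)
    dtrans, stack = {}, [start]
    while stack:
        subset = stack.pop()
        if subset in dtrans:
            continue
        dtrans[subset] = {}
        symbols = {a for q in subset for a in trans.get(q, {})}
        for a in symbols:
            target = frozenset(r for q in subset for r in trans.get(q, {}).get(a, ()))
            dtrans[subset][a] = target
            stack.append(target)
    dfinals = {s for s in dtrans if s & finals}
    return dtrans, start, dfinals


def minimize(dtrans, start, finals):
    """Moore refinement on a partial DFA; assumes no dead states."""
    block = {q: int(q in finals) for q in dtrans}
    n_blocks = len(set(block.values()))
    while True:
        ids, new = {}, {}
        for q in dtrans:
            sig = (block[q], tuple(sorted((a, block[t]) for a, t in dtrans[q].items())))
            new[q] = ids.setdefault(sig, len(ids))
        block = new
        if len(ids) == n_blocks:  # no block was split: partition is stable
            break
        n_blocks = len(ids)
    mtrans = {block[q]: {a: block[t] for a, t in dtrans[q].items()} for q in dtrans}
    return mtrans, block[start], {block[q] for q in finals}


def suffix_automaton(trans, states, finals):
    """S(A): every state of A initial, then determinize and minimize."""
    return minimize(*determinize(trans, states, finals))


def factor_automaton(s_aut):
    """F(A): every state of S(A) final, then minimize."""
    mtrans, start, _ = s_aut
    return minimize(mtrans, start, set(mtrans))


def accepts(aut, w):
    trans, q, finals = aut
    for a in w:
        if a not in trans[q]:
            return False
        q = trans[q][a]
    return q in finals


# Hypothetical example automaton accepting "acab" and "acba".
A_TRANS = {0: {"a": {1}}, 1: {"c": {2}}, 2: {"a": {3}, "b": {4}},
           3: {"b": {5}}, 4: {"a": {5}}}
S = suffix_automaton(A_TRANS, range(6), {5})
F = factor_automaton(S)
print(len(S[0]), len(F[0]))  # state counts of S(A) and F(A)
```

On this example the sketch yields 7 states for S(A) and 6 for F(A), consistent with |F(A)| ≤ |S(A)|: making all states final can only merge states during minimization.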

  8. Size Bound: Strategy
     • Goal: a bound on |F(A)| in terms of |A|
     • Work on bounding |S(A)|; consider suffixes only for now
     • Idea: each state in S(A) accepts a distinct set of suffixes, so count the number of possible sets of suffixes
     • The suffix sets can be arranged in a hierarchy, which is directly related in size to A
     • Motivated by similar arguments for the single-string case in [Blumer et al. ’86] and for string sets in [Blumer et al. ’87]

  9. Suffix Sets
     • Automaton A is k-suffix unique if no two strings accepted by A share the same k-length suffix. Suffix-unique if k = 1
     • Define end-set(x): set of states in A reachable after reading x
       • e.g., end-set(ac) = {2, 3, 4, 5}
     • x ≡ y denotes end-set(x) = end-set(y)
       • This is a right-invariant equivalence relation
       • [x] is the equivalence class of x
     [Figure: example automaton A with states 0–5 and transitions labeled a, b, c]
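The k-suffix-unique condition can be checked directly when A is acyclic and deterministic. The sketch below is an illustration only: the function names and the example automaton (accepting acab and acba) are hypothetical, and a string shorter than k is compared by its whole length, which is an assumption the slide does not address.

```python
def accepted_strings(trans, start, finals):
    """Enumerate the language of an acyclic deterministic automaton.

    trans maps state -> {symbol: next state}.
    """
    out, stack = [], [(start, "")]
    while stack:
        q, w = stack.pop()
        if q in finals:
            out.append(w)
        for sym, r in trans.get(q, {}).items():
            stack.append((r, w + sym))
    return out


def is_k_suffix_unique(strings, k):
    """True if no two of the given distinct strings share their last k symbols."""
    suffixes = [w[-k:] for w in strings]
    return len(set(suffixes)) == len(suffixes)


# Hypothetical example automaton accepting "acab" and "acba".
A_TRANS = {0: {"a": 1}, 1: {"c": 2}, 2: {"a": 3, "b": 4}, 3: {"b": 5}, 4: {"a": 5}}
strings = accepted_strings(A_TRANS, 0, {5})
print(sorted(strings), is_k_suffix_unique(strings, 1))
```

Here the two accepted strings end in different symbols, so the example is suffix-unique (k = 1). By contrast, the set {ab, cb} is 2-suffix unique but not suffix-unique.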

  10. Notation
     • N_str is the number of strings accepted by A
     • If q is a state of S(A), suff(q) is the set of suffixes accepted from q
       • e.g., suff(3) = {ab, ba}
     • N(q) is the set of states in A from which a non-empty string in suff(q) can be read to reach a final state
       • e.g., N(3) = {2, 1}
     [Figure: the example automata S(A) and A]
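The definition of N(q) can be made concrete with a small sketch. It assumes an acyclic deterministic A and takes suff(q) as a given set of strings rather than reading it off S(A); the automaton (accepting acab and acba) and the names `right_language` and `n_set` are hypothetical, and the automaton is not the one drawn on the slide.

```python
def right_language(trans, q, finals):
    """All strings that lead from q to a final state (acyclic automaton)."""
    out, stack = set(), [(q, "")]
    while stack:
        p, w = stack.pop()
        if p in finals:
            out.add(w)
        for a, r in trans.get(p, {}).items():
            stack.append((r, w + a))
    return out


def n_set(a_trans, a_finals, suff_q):
    """N(q): states p of A from which some non-empty z in suff(q) reaches a final state."""
    states = set(a_trans) | {r for arcs in a_trans.values() for r in arcs.values()}
    return {p for p in states
            if any(z and z in right_language(a_trans, p, a_finals) for z in suff_q)}


# Hypothetical example automaton accepting "acab" and "acba".
A_TRANS = {0: {"a": 1}, 1: {"c": 2}, 2: {"a": 3, "b": 4}, 3: {"b": 5}, 4: {"a": 5}}
print(n_set(A_TRANS, {5}, {"ab", "ba"}))
print(n_set(A_TRANS, {5}, {"", "a", "b"}))
```

The empty string is skipped explicitly, matching the "non-empty string" requirement in the definition.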

  15. Suffix Set Inclusion
     • Lemma: Let A be a suffix-unique automaton and let q and q′ be two states of S(A) such that N(q) ∩ N(q′) ≠ ∅. Then suff(q) ⊆ suff(q′) and N(q) ⊆ N(q′), or suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q)
     • Proof: Let paths in S(A) to q and q′ be labeled with u and u′
     • Thus, A must have a state p ∈ N(q) ∩ N(q′)
     • Thus, there exist paths labeled v ∈ suff(q) and v′ ∈ suff(q′) from p to a final state
     [Figure: paths labeled u and u′ to q and q′ in S(A); paths labeled u, v and u′, v′ through p in A]

  16. Suffix Set Inclusion
     • Since A is suffix-unique, any string accepted by A and ending in v must also end in uv
     • Thus, any path from the initial state to p must end in u
     • By the same reasoning, it must also end in u′
     • Hence, u is a suffix of u′, or vice versa
     • Assume the former; then suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q). The other case is obtained similarly. QED.
     [Figure: paths labeled u, v and u′, v′ through p in A]
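The last step of the proof can be spelled out as a short derivation. It is a sketch that assumes suff(q) consists of the strings z such that uz is a suffix of a string accepted by A whenever u labels a path to q in S(A). It treats the case where u is a suffix of u′; the other case is symmetric.

```latex
\begin{align*}
z \in \mathrm{suff}(q')
  &\iff u'z \text{ is a suffix of some string accepted by } A \\
  &\implies uz \text{ is a suffix of that string}
     \quad (\text{since } u \text{ is a suffix of } u') \\
  &\iff z \in \mathrm{suff}(q).
\end{align*}
% Hence suff(q') \subseteq suff(q). Any state of A from which a non-empty
% string of suff(q') reaches a final state also does so with a string of
% suff(q), which gives N(q') \subseteq N(q).
```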

  17. Suffix-unique Bound
     • Theorem: If A is a suffix-unique deterministic and minimal automaton, then the number of states of S(A) is bounded as |S(A)|_Q ≤ 2|A|_Q − 3
     • Proof (sketch):
       • Lemma: For any two states of the suffix automaton, either their suffix sets are disjoint, or one includes the other
       • We can show that each state q of S(A) corresponds to a distinct equivalence class [x]; count these to get the bound
       • The equivalence sets induce a suffix-set hierarchy, which we will analyze
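The bound can be checked numerically on small finite, suffix-unique string sets. The sketch below is an illustration, not part of the talk: it assumes |A|_Q counts the states of the trim minimal deterministic automaton, and it computes that count as the number of distinct non-empty residual languages (Myhill–Nerode classes) of a finite language. The names `min_dfa_states`, `suffix_set` and `check_bound` are hypothetical.

```python
def min_dfa_states(lang):
    """Number of states of the trim minimal DFA of a finite language.

    Each state corresponds to one distinct residual {z : uz in lang},
    taken over the prefixes u of the language.
    """
    prefixes = {w[:i] for w in lang for i in range(len(w) + 1)}
    residuals = {frozenset(w[len(u):] for w in lang if w.startswith(u))
                 for u in prefixes}
    return len(residuals)


def suffix_set(lang):
    """All suffixes (including the empty one) of the strings in lang."""
    return {w[i:] for w in lang for i in range(len(w) + 1)}


def check_bound(lang):
    """Return (|A|_Q, |S(A)|_Q, whether |S(A)|_Q <= 2|A|_Q - 3)."""
    a_states = min_dfa_states(lang)
    s_states = min_dfa_states(suffix_set(lang))
    return a_states, s_states, s_states <= 2 * a_states - 3


for U in ({"acab", "acba"}, {"abb"}, {"ab", "abc"}):
    print(sorted(U), check_bound(U))
```

For the single string abb the bound is met with equality (5 = 2·4 − 3), which agrees with the classical 2|x| − 1 bound for the suffix automaton of a string.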

  19. Suffix Sets: Non-branching
     [Figure: suffix-set hierarchy, with relations labeled "Includes" and "Disjoint"]
     • Count non-branching and branching nodes separately
     • Consider a state in S(A) with equivalence class [x], x longest
     • The only way to have a branching node is if there exist factors ax, bx (a ≠ b) (since ≡ is a right-equivalence relation)
     • Node is only non-branching when x is a prefix or suffix
       • |A|_Q − 2 distinct prefixes; suffix only when final state: N_str
     • Total non-branching nodes: at most N_nb ≤ |A|_Q − 2 + N_str
