
Factor Automata of Automata and Applications
Mehryar Mohri (1,2), Pedro Moreno (2), Eugene Weinstein (1,2)
mohri@cs.nyu.edu, pedro@google.com, eugenew@cs.nyu.edu
(1) Courant Institute of Mathematical Sciences
(2) Google Inc.

Introduction
• Objective: construct a full index for a large set of strings
• We want to efficiently search for factors (subwords)
• A deterministic minimal factor automaton is a good option
  • Optimal lookup speed (linear in size of query)
• The set of strings might be given as an automaton
  • Smaller representation
  • Might be produced by another application
• Hence, consider factor automata of automata

Past Work
• The factor automaton of a string x has at most 2|x| − 2 states and 3|x| − 4 transitions [Crochemore ’85; Blumer et al. ’86]
  • Can be constructed by a linear-time online algorithm
• Size bounds for a set of strings U have also previously been studied [Blumer et al. ’87]
  • Let ||U|| be the sum of the lengths of all the strings in U
  • The factor automaton of U has at most 2||U|| − 1 states and 3||U|| − 3 transitions
• We prove a substantially better bound here
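To make the single-string case concrete, here is a minimal Python sketch of the textbook online suffix-automaton construction with suffix links. It is not necessarily the exact algorithm of the cited papers, and the names `build_suffix_automaton` and `is_factor` are illustrative. Treating every state as final turns the automaton into a recognizer of factors, with lookup time linear in the query length. The result is not minimized as a factor automaton, so the sketch does not check the 2|x| − 2 bound.

```python
def build_suffix_automaton(s):
    """Online construction: returns nxt, where nxt[state][symbol] -> state.

    State 0 is the initial state. link and length are the usual suffix links
    and longest-string lengths used by the textbook algorithm.
    """
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt


def is_factor(nxt, w):
    """True if w can be read from the initial state (every state is final)."""
    state = 0
    for ch in w:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


automaton = build_suffix_automaton("abcbc")
print(is_factor(automaton, "bcb"), is_factor(automaton, "cc"))
```

Each query follows one transition per symbol, which is the "linear in size of query" lookup the slides refer to.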

Suffix & Factor Automata
• We start out with an automaton A recognizing the strings in U
• Let S(A) and F(A) be the deterministic minimal automata recognizing the suffixes and factors of A, respectively
• To construct S(A): make each state of A initial (by adding epsilons), determinize, minimize
• To construct F(A): make each state of S(A) final, minimize
• Consequence: |F(A)| ≤ |S(A)|
[Figure: example automaton A, shown first as given, then with ε-transitions added, then alongside the automaton constructed from it]
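The construction steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation. A is given as a trim deterministic automaton `delta[state][symbol] -> state`. Starting the subset construction from the set of all states plays the role of the added ε-transitions, and minimization uses plain Moore partition refinement. All function names and the example automaton (accepting "aba" and "abc") are hypothetical.

```python
from itertools import product


def determinize(starts, delta, finals):
    """Subset construction whose start subset is `starts`; returns
    (start, n_states, trans, accepting) with trans[(state, symbol)] -> state."""
    start = frozenset(starts)
    ids, trans, acc, todo = {start: 0}, {}, set(), [start]
    while todo:
        subset = todo.pop()
        if subset & finals:
            acc.add(ids[subset])
        for a in {a for q in subset for a in delta.get(q, {})}:
            target = frozenset(delta[q][a] for q in subset if a in delta.get(q, {}))
            if target not in ids:
                ids[target] = len(ids)
                todo.append(target)
            trans[ids[subset], a] = ids[target]
    return 0, len(ids), trans, acc


def minimize(dfa):
    """Moore partition refinement on a partial DFA whose states are all useful."""
    start, n, trans, acc = dfa
    alphabet = sorted({a for _, a in trans})
    block = [int(q in acc) for q in range(n)]
    while True:
        sig, refined = {}, []
        for q in range(n):
            key = (block[q],) + tuple(
                block[trans[q, a]] if (q, a) in trans else -1 for a in alphabet)
            refined.append(sig.setdefault(key, len(sig)))
        if len(sig) == len(set(block)):  # no block was split: partition is stable
            break
        block = refined
    q_trans = {(refined[q], a): refined[t] for (q, a), t in trans.items()}
    return refined[start], len(sig), q_trans, {refined[q] for q in acc}


def suffix_automaton(states, delta, finals):
    # Every state of A is initial, then determinize and minimize.
    return minimize(determinize(states, delta, finals))


def factor_automaton(states, delta, finals):
    # Every state of S(A) is final, then minimize again.
    start, n, trans, _ = suffix_automaton(states, delta, finals)
    return minimize((start, n, trans, set(range(n))))


def accepts(dfa, w):
    start, _, trans, acc = dfa
    q = start
    for a in w:
        if (q, a) not in trans:
            return False
        q = trans[q, a]
    return q in acc


# Hypothetical example: A accepts U = {"aba", "abc"}.
delta = {0: {"a": 1}, 1: {"b": 2}, 2: {"a": 3, "c": 3}}
states, finals = {0, 1, 2, 3}, {3}
S = suffix_automaton(states, delta, finals)
F = factor_automaton(states, delta, finals)
words = ["".join(p) for k in range(5) for p in product("abc", repeat=k)]
print(sorted(w for w in words if accepts(S, w)))
print(sorted(w for w in words if accepts(F, w)))
```

Because F(A) is obtained from S(A) by changing only the final states and minimizing, its state count cannot exceed that of S(A), which is the consequence stated on the slide.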

Size Bound: Strategy
• Goal: a bound on |F(A)| in terms of |A|
• Work on bounding |S(A)|: consider suffixes only for now
• Idea: each state in S(A) accepts a distinct set of suffixes, so count the number of possible sets of suffixes
• The suffix sets can be arranged in a hierarchy, which is directly related in size to A
• Motivated by similar arguments for the single-string case in [Blumer et al. ’86] and for string sets in [Blumer et al. ’87]

Suffix Sets
• Automaton A is k-suffix-unique if no two strings accepted by A share the same k-length suffix. Suffix-unique if k = 1
• Define end-set(x): the set of states in A reachable after reading x
  • e.g., end-set(ac) = {2, 3, 4, 5}
• x ≡ y denotes end-set(x) = end-set(y)
  • This is a right-invariant equivalence relation
  • [x] is the equivalence class of x
[Figure: example automaton A with states 0 to 5]
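As a small illustration of end-sets and the relation ≡, the Python sketch below assumes that "reachable after reading x" means reading x starting from any state of A, mirroring end positions of occurrences in the single-string case. That reading is an assumption, and the example automaton (for the single string "abab") and the names `end_set` and `equivalent` are hypothetical, not taken from the slides.

```python
def end_set(delta, states, x):
    """States of A reached by reading x starting from any state of A
    (assumed reading of the slide's definition)."""
    current = set(states)
    for a in x:
        current = {delta[q][a] for q in current if a in delta.get(q, {})}
    return frozenset(current)


def equivalent(delta, states, x, y):
    """x ≡ y iff end-set(x) = end-set(y)."""
    return end_set(delta, states, x) == end_set(delta, states, y)


# Hypothetical example: the automaton of the single string "abab".
delta = {0: {"a": 1}, 1: {"b": 2}, 2: {"a": 3}, 3: {"b": 4}}
states = {0, 1, 2, 3, 4}

print(sorted(end_set(delta, states, "b")))      # both occurrences of "b"
print(equivalent(delta, states, "b", "ab"))     # same end-set, so b ≡ ab
print(equivalent(delta, states, "ba", "aba"))   # right-invariance: append "a"
```

Right-invariance is visible in the last line: since "b" and "ab" have the same end-set, appending the same symbol to both again yields equal end-sets.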

Notation
• N_str is the number of strings accepted by A
• If q is a state of S(A), suff(q) is the set of suffixes accepted from q
  • e.g., suff(3) = {ab, ba}
• N(q) is the set of states in A from which a non-empty string in suff(q) can be read to reach a final state
  • e.g., N(3) = {2, 1}
[Figure: example automaton A and its suffix automaton S(A)]

Suffix Set Inclusion
• Lemma: Let A be a suffix-unique automaton and let q and q′ be two states of S(A) such that N(q) ∩ N(q′) ≠ ∅. Then
  suff(q) ⊆ suff(q′) and N(q) ⊆ N(q′),
  or suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q)
• Proof: Let paths in S(A) to q and q′ be labeled with u and u′
• By assumption, A must have a state p ∈ N(q) ∩ N(q′)
• Thus, there exist paths labeled v ∈ suff(q) and v′ ∈ suff(q′) from p to a final state
[Figure: in S(A), paths labeled u and u′ reaching q and q′; in A, paths labeled u and u′ entering p and paths labeled v and v′ leaving p]

Suffix Set Inclusion
[Figure: in S(A), paths labeled u and u′ reaching q and q′; in A, paths labeled u and u′ entering p and paths labeled v and v′ leaving p]
• Since A is suffix-unique, any string accepted by A and ending in v must also end in uv
• Thus, any path from an initial state to p must end in u
• By the same reasoning, it must also end in u′
• Hence, u is a suffix of u′, or vice versa
• Assume the former; then every string that can follow u′ can also follow u, thus suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q). The other statement of the lemma is obtained similarly. QED.

Suffix-unique Bound
• Theorem: If A is a suffix-unique deterministic and minimal automaton, then the number of states of S(A) is bounded as |S(A)|_Q ≤ 2|A|_Q − 3
• Proof (sketch):
  • Lemma: For any two states of the suffix automaton, either their suffix sets are disjoint, or one includes the other
  • We can show that each state q of S(A) corresponds to a distinct equivalence class [x]; count these to get the bound
  • The equivalence sets induce a suffix set hierarchy, which we will analyze

Suffix Sets: Non-branching
[Figure: suffix set hierarchy, with relations labeled "Includes" and "Disjoint"]
• Count non-branching and branching nodes separately
• Consider a state in S(A) with equivalence class [x], x being its longest string
• The only way to have a branching node is if there exist factors ax, bx (a ≠ b), since ≡ is a right-equivalence relation
• A node is non-branching only when x is a prefix or suffix
  • |A|_Q − 2 distinct prefixes; suffix only when final state: N_str
• Total non-branching nodes: at most N_nb ≤ |A|_Q − 2 + N_str
