Constant delay algorithms for regular document spanners
Fernando Florenzano, Cristian Riveros, Domagoj Vrgoč (PUC Chile)
Martín Ugarte, Stijn Vansummeren (Université Libre de Bruxelles)
Document d:

18:30 ERROR 06
19:10 OK 00
20:00 ERROR 19

Task: “Extract all pairs (time,id)”

[Figure: d shown character by character with positions 1 to 41; the separators between lines occupy positions 15 and 27.]

Rule (RGX formula): Σ* ⋅ x{δδ:δδ} ⋅ ERROR ⋅ y{δδ} ⋅ Σ*, where δ = (0 + 1 + ... + 9)

Output (mappings):
x          y
[1, 6⟩     [13, 15⟩
[28, 33⟩   [40, 42⟩
Problem: Evaluation of rules in information extraction.
Input: RGX formula R and document d.
Output: Enumerate all mappings of d that satisfy R.
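As a rough illustration of this problem statement, the sketch below evaluates the example rule with Python's `re` module. Several things here are assumptions rather than part of the talk: the named groups `x` and `y` stand in for the RGX variables, the spaces around ERROR are written explicitly in the pattern, and `re.finditer` returns only non-overlapping leftmost matches, which happens to coincide with the set of mappings for this particular rule but does not enumerate all mappings in general.

```python
import re

# Example document from the slides; the three lines are separated by "\n".
doc = "18:30 ERROR 06\n19:10 OK 00\n20:00 ERROR 19"

# Assumed Python rendering of the rule: named groups play the role of x and y.
rule = re.compile(r"(?P<x>\d\d:\d\d) ERROR (?P<y>\d\d)")


def span_of(match, var):
    """Return the span of `var` as a 1-indexed, end-exclusive pair [i, j>."""
    start, end = match.span(var)  # Python spans are 0-indexed, end-exclusive
    return (start + 1, end + 1)


mappings = [
    {"x": span_of(m, "x"), "y": span_of(m, "y")} for m in rule.finditer(doc)
]

for mapping in mappings:
    print(mapping)
```

Each printed dictionary is one mapping, with spans written in the same 1-indexed [i, j⟩ convention as the output table.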
Rule (RGX formula): Σ* ⋅ x1{δδ} ⋅ Σ* ⋅ x2{δδ} ⋅ Σ*, where δ = (0 + 1 + ... + 9)

Output (mappings), of size Θ(|d|^2):
x1         x2
[1, 3⟩     [4, 6⟩
[1, 3⟩     [13, 15⟩
⋮          ⋮
[1, 3⟩     [40, 42⟩
[4, 6⟩     [13, 15⟩
[4, 6⟩     [16, 18⟩
⋮          ⋮

In general, an RGX formula with k variables can have an output of size Θ(|d|^k).
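To make the quadratic output concrete, here is a small brute-force sketch that lists every pair of two-digit spans (x1, x2) with x1 ending no later than x2 starts. It assumes the same three-line example document with "\n" as line separator; the helper names are hypothetical, and this is plain enumeration, not the constant delay algorithm.

```python
# Assumed example document (same as in the earlier slides).
doc = "18:30 ERROR 06\n19:10 OK 00\n20:00 ERROR 19"


def two_digit_spans(text):
    """All spans [i, i+2> (1-indexed, end-exclusive) covering two digits."""
    return [
        (i + 1, i + 3)
        for i in range(len(text) - 1)
        if text[i].isdigit() and text[i + 1].isdigit()
    ]


def all_pairs(text):
    """All mappings (x1, x2) where x1 ends at or before the start of x2."""
    spans = two_digit_spans(text)
    return [(s1, s2) for s1 in spans for s2 in spans if s1[1] <= s2[0]]


pairs = all_pairs(doc)
print(len(two_digit_spans(doc)), "candidate spans,", len(pairs), "mappings")
```

With s candidate spans that do not overlap, the number of mappings is s(s-1)/2, which grows quadratically with the length of the document.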
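Given an RGX rule R and a document d, a constant delay algorithm is a two-phase enumeration algorithm:
1. Precomputation phase: process R and d once to build a data structure that represents the output.
2. Enumeration phase: output the mappings one by one, with constant delay between consecutive outputs.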
Can we have an efficient constant delay algorithm for RGX formulas?
Specifically, our contributions are:
1. A constant delay algorithm for evaluating deterministic functional extended variable-set automata.
2. An extension of this algorithm to variable-set automata and spanner algebra.
In this talk: only the main ideas of the constant delay algorithm.
Outline:
1. Variable-set automata and their variants
2. The constant delay algorithm
[Figure: a variable-set automaton (VA) with states 1 to 7 and transitions labeled x⊢, y⊢, a, b, ⊣x, ⊣y.]

Document: a a b (positions 1 to 3)

Accepting runs (transition labels in order) and their output mappings:
- x⊢, y⊢, a, a, ⊣x, b, ⊣y   gives   x = [1, 3⟩, y = [1, 4⟩
- y⊢, x⊢, a, a, ⊣y, b, ⊣x   gives   x = [1, 4⟩, y = [1, 3⟩
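The way a run determines its output mapping can be sketched in a few lines. The encoding below is an assumption made for illustration: a marker is a pair such as `("open", "x")` or `("close", "x")`, a letter is a plain string, and the run is assumed to be valid (every close follows its open).

```python
def mapping_from_run(labels):
    """Turn the label sequence of a valid run into a mapping {var: (i, j)}.

    Spans are 1-indexed and end-exclusive, i.e. (i, j) stands for [i, j>.
    """
    position = 1          # position of the next letter to be read
    starts, spans = {}, {}
    for label in labels:
        if isinstance(label, tuple):      # a variable marker
            kind, var = label
            if kind == "open":
                starts[var] = position
            else:
                spans[var] = (starts[var], position)
        else:                             # a letter: advance in the document
            position += 1
    return spans


run1 = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "y")]
run2 = [("open", "y"), ("open", "x"), "a", "a", ("close", "y"), "b", ("close", "x")]

print(mapping_from_run(run1))
print(mapping_from_run(run2))
```

The two runs read the same document a a b but close the variables in a different order, which is why they produce different mappings.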
The evaluation problem of variable-set automata is NP-complete. How do we restrict VA to have constant delay algorithms?
[Figure: the same VA with states 1 to 7.]

Problem: A VA can have accepting runs that are NOT valid.

Example of an accepting run that is not valid: x⊢, y⊢, a, a, ⊣x, b, ⊣x (x is closed twice and y is never closed).
A VA is functional if every accepting run is a valid run.
[Figure: a functional VA with states 1, 2, 3, 4, 5, 5', 6, 6', 7 and transitions labeled x⊢, y⊢, a, b, ⊣x, ⊣y.]
Every VA is equivalent to a functional VA of at most exponential size.
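A validity check for a single run can be sketched as follows. The exact definition of a valid run is not spelled out on the slides, so this sketch assumes the following reading: no variable is opened or closed more than once, every close comes after the matching open, and every opened variable is eventually closed. Markers are encoded as hypothetical pairs `("open", var)` and `("close", var)`.

```python
def is_valid_run(labels):
    """Check the assumed validity conditions on the markers of a run."""
    opened, closed = set(), set()
    for label in labels:
        if not isinstance(label, tuple):
            continue                      # letters do not affect validity
        kind, var = label
        if kind == "open":
            if var in opened:
                return False              # opened twice
            opened.add(var)
        else:
            if var not in opened or var in closed:
                return False              # closed before open, or closed twice
            closed.add(var)
    return opened == closed               # nothing left open


valid = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "y")]
invalid = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "x")]

print(is_valid_run(valid), is_valid_run(invalid))
```

In a functional VA this check would succeed for every accepting run, so the evaluation algorithm never needs to perform it.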
Problem: A VA can use several paths of variable transitions for the same extraction of spans (for example, x⊢ followed by y⊢, or y⊢ followed by x⊢, at the same position).
An extended VA uses transitions labeled with sets of variable markers, such that at most one of these transitions is used between each pair of consecutive letters.
[Figure: an extended VA with states 3, 4, 5, 5', 6, 6', 7 and transitions labeled {x⊢, y⊢}, a, b, {⊣x}, {⊣y}.]
Every VA is equivalent to an extended VA of at most exponential size.
Problem: A VA can have several runs that witness the same output.

Example: two runs that go through different states but use the same labels {x⊢, y⊢}, a, a, {⊣x}, b, {⊣y}, and therefore produce the same mapping.
An extended VA is deterministic if the transition relation is a function.
[Figure: a deterministic extended VA with states 3, 4, 5, 5', 6, 6', 7 and transitions labeled {x⊢, y⊢}, a, b, {⊣x}, {⊣y}.]
Every extended VA is equivalent to a deterministic extended VA.
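The determinism condition is easy to test on an explicit transition list. In the sketch below the triple encoding `(source, label, target)` and the two tiny transition lists are hypothetical; a label is either a letter or a frozenset of markers.

```python
def is_deterministic(transitions):
    """Return True if no (state, label) pair leads to two different targets."""
    targets = {}
    for source, label, target in transitions:
        # setdefault stores the first target seen and returns the stored one.
        if targets.setdefault((source, label), target) != target:
            return False
    return True


X_CLOSE = frozenset({("close", "x")})

# Hypothetical fragments: the first has two a-transitions leaving state 4.
nondeterministic = [(4, "a", 4), (4, "a", 5), (5, X_CLOSE, 6)]
deterministic = [(4, "a", 5), (5, "a", 5), (5, X_CLOSE, 6)]

print(is_deterministic(nondeterministic), is_deterministic(deterministic))
```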
Outline:
1. Variable-set automata and their variants
2. The constant delay algorithm
Given a deterministic and functional extended VA A = (Q, q0, F, δ).

procedure Evaluate(A, a1 ... an)
    for all q ∈ Q \ {q0} do
        list_q ← ε
    list_q0 ← []
    for i := 1 to n do
        Capturing(i)
        Reading(i)
    Capturing(n + 1)
    Enumerate({list_q}_{q ∈ Q}, F)

procedure Capturing(i)
    for all q ∈ Q do
        list_q^old ← list_q.lazycopy
    for all q ∈ Q with list_q^old ≠ ε do
        for all S ∈ Markers_δ(q) do
            node ← Node((S, i), list_q^old)
            p ← δ(q, S)
            list_p.add(node)

procedure Reading(i)
    for all q ∈ Q do
        list_q^old ← list_q
        list_q ← ε
    for all q ∈ Q with list_q^old ≠ ε do
        p ← δ(q, a_i)
        list_p.append(list_q^old)
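The procedures above can be sketched in Python as follows. This is a simplified reading, not the authors' implementation, and it rests on several assumptions: the list of q0 starts with a sentinel `BOTTOM` node that ends every path; `lazycopy` is replaced by an eager tuple snapshot, so the precomputation cost differs from the bound in the talk; enumeration is recursive; and the example automaton is a hypothetical deterministic functional extended VA loosely modelled on the a a b example.

```python
BOTTOM = None  # assumed sentinel node that terminates every path


class Node:
    """A marker set S used at position i, plus the nodes that may precede it."""

    def __init__(self, markers, position, preds):
        self.markers, self.position, self.preds = markers, position, preds


def preprocess(states, q0, delta, doc):
    """Precomputation phase: Capturing and Reading for each letter of doc."""
    lists = {q: [] for q in states}
    lists[q0] = [BOTTOM]

    def capturing(i):
        # Eager snapshot standing in for list_q.lazycopy.
        old = {q: tuple(lists[q]) for q in states}
        for (q, label), p in delta.items():
            if isinstance(label, frozenset) and old[q]:
                lists[p].insert(0, Node(label, i, old[q]))

    def reading(i):
        old = dict(lists)
        for q in states:
            lists[q] = []
        for q in states:
            p = delta.get((q, doc[i - 1]))
            if old[q] and p is not None:
                lists[p].extend(old[q])

    for i in range(1, len(doc) + 1):
        capturing(i)
        reading(i)
    capturing(len(doc) + 1)
    return lists


def enumerate_mappings(lists, finals):
    """Enumeration phase: follow every path from a final list back to BOTTOM."""

    def walk(node, spans):
        if node is BOTTOM:
            yield spans
            return
        spans = dict(spans)
        for kind, var in node.markers:
            start, end = spans.get(var, (None, None))
            if kind == "open":
                spans[var] = (node.position, end)
            else:
                spans[var] = (start, node.position)
        for pred in node.preds:
            yield from walk(pred, spans)

    for q in finals:
        for node in lists[q]:
            yield from walk(node, {})


# Hypothetical automaton: open x and y, read a+, close one variable,
# read b, close the other variable.
XY_OPEN = frozenset({("open", "x"), ("open", "y")})
X_CLOSE = frozenset({("close", "x")})
Y_CLOSE = frozenset({("close", "y")})
delta = {
    (0, XY_OPEN): 1, (1, "a"): 2, (2, "a"): 2,
    (2, X_CLOSE): 3, (2, Y_CLOSE): 4,
    (3, "b"): 5, (4, "b"): 6,
    (5, Y_CLOSE): 7, (6, X_CLOSE): 7,
}

lists = preprocess(range(8), 0, delta, "aab")
results = list(enumerate_mappings(lists, [7]))
for mapping in results:
    print(mapping)
```

Keeping `preprocess` and `enumerate_mappings` separate mirrors the two phases: all work that depends on the length of the document happens in the first function, and the second only follows pointers.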
Given a deterministic and functional extended VA A = (Q, q0, F, δ) and a document d:

[Figure: the document d = a a b with positions d1 to d4, and the graph A_d obtained by running A (states q0 to q7, transitions labeled {x⊢, y⊢}, a, b, {⊣x}, {⊣y}) over d.]

1. Annotate the variable transitions with the position of d where they take place, for example ({x⊢, y⊢}, 1), ({⊣x}, 3), ({⊣y}, 3), ({⊣y}, 4), ({⊣x}, 4).
2. Replace the letter transitions with ε-transitions and construct the “forward” ε-closure of the resulting graph.
3. Finally, enumerate all paths of the resulting acyclic labeled graph, which can easily be done with constant delay between outputs.

Given that the VA is functional, extended and deterministic: each path in the graph corresponds exactly to an output mapping, and every path is different (i.e. there are no duplicates).
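The last step, enumerating all paths of an acyclic labeled graph, can be sketched with an explicit stack. The adjacency dictionary below is an assumed reading of the example graph for the document a a b (node names are just the annotated labels as strings), and the `"start"` node is a hypothetical addition. Building each path as a tuple costs time proportional to its length, so this sketch only approximates the constant delay bound.

```python
def all_paths(graph, source, sinks):
    """Yield every path from source to a sink in an acyclic graph."""
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        if node in sinks:
            yield path
        for successor in graph.get(node, ()):
            stack.append((successor, path + (successor,)))


# Assumed example graph after the forward epsilon-closure.
graph = {
    "start": ["{x⊢,y⊢},1"],
    "{x⊢,y⊢},1": ["{⊣x},3", "{⊣y},3"],
    "{⊣x},3": ["{⊣y},4"],
    "{⊣y},3": ["{⊣x},4"],
}
sinks = {"{⊣y},4", "{⊣x},4"}

paths = list(all_paths(graph, "start", sinks))
for path in paths:
    print(" -> ".join(path[1:]))
```

Each printed path lists the annotated marker sets of one output mapping, for instance opening x and y at position 1, closing x at 3 and closing y at 4.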
Given a VA A and a document d, let:
n = number of states of A
m = number of transitions of A
ℓ = number of variables of A

Class of regular spanners                 Precomputation phase
deterministic functional extended VA      (n + m) ⋅ |d|
functional extended VA                    2^n ⋅ m ⋅ |d|
functional VA / functional RGX            2^n ⋅ (n^2 + |Σ|) ⋅ |d|
VA / RGX                                  (2^n ⋅ 5^ℓ + 2^n ⋅ 3^ℓ ⋅ |Σ|) ⋅ |d|

In the paper, we give some evidence that the exponential blow-up for functional extended VA seems unavoidable.
We provide a simple constant delay algorithm for evaluating deterministic functional extended VA.
We extend this algorithm to the full class of variable-set automata and (also) regular spanner algebra.

Future work: [...] in rule-based information extraction.

Thanks!