Constant delay algorithms for regular document spanners



slide-1
SLIDE 1

Constant delay algorithms for regular document spanners

Fernando Florenzano, Cristian Riveros, Domagoj Vrgoč (PUC Chile)

Martín Ugarte, Stijn Vansummeren (Université Libre de Bruxelles)

slide-2
SLIDE 2

Rule-based information extraction by example

18:30 ERROR 06 19:10 OK 00 20:00 ERROR 19

“Extract all pairs (time, id) of ERROR events”

[Figure: the document with its characters numbered from position 1 (the first “1”) to position 41 (the final “9”)]

Rule (RGX formula): Σ∗ ⋅ x{δδ:δδ} ⋅ ERROR ⋅ y{δδ} ⋅ Σ∗, where δ = (0 + 1 + . . . + 9)

Output (mappings):

x          y
[1, 6⟩     [13, 15⟩
[28, 33⟩   [40, 42⟩
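The example can be reproduced with an ordinary regular expression. The sketch below is only an approximation of the RGX rule: it assumes single spaces around ERROR, uses named groups for the variables x and y, and relies on `finditer`, which returns non-overlapping matches rather than all mappings. On this document the two notions happen to coincide. The names DOC, RULE and extract are illustrative.

```python
import re

DOC = "18:30 ERROR 06 19:10 OK 00 20:00 ERROR 19"

# Assumed translation of the rule: named groups play the roles of x and y,
# and the separators around ERROR are taken to be single spaces.
RULE = re.compile(r"(?P<x>\d\d:\d\d) ERROR (?P<y>\d\d)")


def extract(doc):
    """Return one mapping per match, as 1-based half-open spans (start, end)."""
    mappings = []
    for match in RULE.finditer(doc):
        mappings.append(
            {var: (match.start(var) + 1, match.end(var) + 1) for var in ("x", "y")}
        )
    return mappings


MAPPINGS = extract(DOC)
for mapping in MAPPINGS:
    print(mapping)
```

Spans are shifted by one because Python indexes from 0 while the slide numbers positions from 1.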

slide-3
SLIDE 3

Rule-based information extraction by example

Problem: Evaluation of rules in information extraction.

Input: RGX formula R and document d.

Output: Enumerate all mappings of d that satisfy R.


slide-4
SLIDE 4

Unfortunately, the output can easily become exponential

[Figure: the same document, with its characters numbered from position 1 to position 41]

Rule (RGX formula): Σ∗ ⋅ x1{δδ} ⋅ Σ∗ ⋅ x2{δδ} ⋅ Σ∗, where δ = (0 + 1 + . . . + 9)

Output (mappings):

x1         x2
[1, 3⟩     [4, 6⟩
[1, 3⟩     [13, 15⟩
⋮          ⋮
[1, 3⟩     [40, 42⟩
[4, 6⟩     [13, 15⟩
[4, 6⟩     [16, 18⟩
⋮          ⋮

The output has size Θ(∣d∣^2). In general, an RGX formula with k variables can have an output of size Θ(∣d∣^k).
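To make the growth concrete, the following brute-force sketch lists every mapping of the two-variable rule on the example document. It is not an evaluation algorithm from the talk, just a direct reading of the rule: x1 and x2 each capture two consecutive digits, and x1 must end no later than x2 starts. The function names are illustrative.

```python
DOC = "18:30 ERROR 06 19:10 OK 00 20:00 ERROR 19"


def digit_pairs(doc):
    """1-based half-open spans of every two-digit substring of doc."""
    return [
        (i + 1, i + 3)
        for i in range(len(doc) - 1)
        if doc[i].isdigit() and doc[i + 1].isdigit()
    ]


def two_variable_output(doc):
    """All pairs (x1, x2) of two-digit spans where x1 ends before x2 starts."""
    spans = digit_pairs(doc)
    return [(s1, s2) for s1 in spans for s2 in spans if s1[1] <= s2[0]]


OUTPUT = two_variable_output(DOC)
print(len(digit_pairs(DOC)), "two-digit spans,", len(OUTPUT), "mappings")
```

With s two-digit spans that do not overlap, the number of mappings is s(s - 1)/2, which is quadratic in the length of the document.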

slide-5
SLIDE 5

Constant delay algorithms to the rescue

Definition

Given an RGX rule R and a document d, a constant delay algorithm is a two-phase enumeration algorithm:

  • 1. Preprocessing phase: linear in ∣d∣ and, hopefully, linear in ∣R∣.
  • 2. Enumeration phase: constant time between two consecutive outputs.

Can we have an efficient constant delay algorithm for RGX formulas?

slide-6
SLIDE 6

In this paper, we propose a constant delay algorithm for variable-set automata

Specifically, our contributions are:

  • 1. We study the class of extended and deterministic variable-set automata.
  • 2. We give a simple constant delay algorithm for deterministic functional extended variable-set automata.
  • 3. We extend this algorithm to the full class of variable-set automata and the spanner algebra.

  • 4. We study the complexity of counting the number of output mappings.

In this talk: only the main ideas of the constant delay algorithm.

slide-7
SLIDE 7

Outline

  • 1. Variable-set automata and their variants
  • 2. The constant delay algorithm

slide-8
SLIDE 8

Outline

  • 1. Variable-set automata and their variants
  • 2. The constant delay algorithm

slide-9
SLIDE 9

Variable-set automata (VA)

[Figure: a VA with states 1 to 7 and transitions labelled x⊢, y⊢, a, b, ⊣x and ⊣y]

document: a a b (positions 1, 2, 3)

slide-10
SLIDE 10

Variable-set automata (VA)

[Figure: a VA with states 1 to 7 and transitions labelled x⊢, y⊢, a, b, ⊣x and ⊣y]

document: a a b (positions 1, 2, 3)

Run with labels x⊢, y⊢, a, a, ⊣x, b, ⊣y: x = [1, 3⟩, y = [1, 4⟩

slide-11
SLIDE 11

Variable-set automata (VA)

[Figure: a VA with states 1 to 7 and transitions labelled x⊢, y⊢, a, b, ⊣x and ⊣y]

document: a a b (positions 1, 2, 3)

Run with labels x⊢, y⊢, a, a, ⊣x, b, ⊣y: x = [1, 3⟩, y = [1, 4⟩

Run with labels y⊢, x⊢, a, a, ⊣y, b, ⊣x: x = [1, 4⟩, y = [1, 3⟩
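How a run determines its output mapping can be sketched in a few lines. The encoding is an assumption made for illustration: a run is given only by its sequence of transition labels, a letter is a plain string, and a variable marker is a tuple such as ("open", "x") or ("close", "x"). The current position starts at 1 and advances on every letter.

```python
def mapping_of_run(labels):
    """Turn the label sequence of a run into a mapping variable -> (start, end)."""
    position = 1  # 1-based position of the next unread letter
    opened, mapping = {}, {}
    for label in labels:
        if isinstance(label, tuple):
            kind, var = label
            if kind == "open":
                opened[var] = position
            else:  # "close": the span ends at the current position
                mapping[var] = (opened[var], position)
        else:
            position += 1  # reading a letter moves one position forward
    return mapping


# The two runs on document "aab" shown on the slide.
RUN_1 = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "y")]
RUN_2 = [("open", "y"), ("open", "x"), "a", "a", ("close", "y"), "b", ("close", "x")]

print(mapping_of_run(RUN_1))
print(mapping_of_run(RUN_2))
```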

slide-12
SLIDE 12

Variable-set automata (VA)

[Figure: a VA with states 1 to 7 and transitions labelled x⊢, y⊢, a, b, ⊣x and ⊣y]

document: a a b (positions 1, 2, 3)

Theorem (Freydenberger17, MRV18)

The evaluation problem of variable-set automata is NP-complete.

How do we restrict VA to have constant delay algorithms?

slide-13
SLIDE 13

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA
slide-14
SLIDE 14

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA with states 1 to 7]

Problem: A VA can have accepting runs that are NOT valid.

slide-15
SLIDE 15

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA with states 1 to 7]

Problem: A VA can have accepting runs that are NOT valid.

Example of an accepting run that is not valid

Run with labels x⊢, y⊢, a, a, ⊣x, b, ⊣x

slide-16
SLIDE 16

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA with states 1 to 7]

Definition: functional VA

A VA is functional if every accepting run is a valid run.
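Validity of a single run is easy to check once the run is written as a sequence of labels. The sketch below assumes that "valid" means every variable that occurs is opened exactly once and then closed exactly once; the label encoding (tuples for markers, strings for letters) is illustrative. Checking that a whole automaton is functional is a different task and is not attempted here.

```python
def is_valid(labels):
    """Check that each variable in the run is opened once and then closed once."""
    status = {}  # variable -> "open" or "closed"
    for label in labels:
        if not isinstance(label, tuple):
            continue  # letters do not affect validity
        kind, var = label
        if kind == "open":
            if var in status:  # opened twice, or reopened after closing
                return False
            status[var] = "open"
        else:
            if status.get(var) != "open":  # closed before opening, or closed twice
                return False
            status[var] = "closed"
    return all(state == "closed" for state in status.values())


VALID_RUN = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "y")]
# The run from the talk that closes x twice and never closes y.
INVALID_RUN = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "x")]

print(is_valid(VALID_RUN), is_valid(INVALID_RUN))
```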

slide-17
SLIDE 17

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA, now with states 1, 2, 3, 4, 5, 5’, 6, 6’, 7]

Definition: functional VA

A VA is functional if every accepting run is a valid run.

Theorem (FKRV15)

Every VA is equivalent to a functional VA of at most exponential size.

slide-18
SLIDE 18

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA with states 1, 2, 3, 4, 5, 5’, 6, 6’, 7]

slide-19
SLIDE 19

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA with states 1, 2, 3, 4, 5, 5’, 6, 6’, 7]

Problem: VA can use several paths of variables for the same extraction of spans.

slide-20
SLIDE 20

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example VA with states 1, 2, 3, 4, 5, 5’, 6, 6’, 7]

Definition: extended VA

An extended VA uses transitions labelled with sets of variable markers, such that between each pair of consecutive letters at most one of these transitions is used.

slide-21
SLIDE 21

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example as an extended VA with states 3, 4, 5, 5’, 6, 6’, 7 and transitions {x⊢, y⊢}, a, {⊣x}, {⊣y}, b]

Definition: extended VA

An extended VA uses transitions labelled with sets of variable markers, such that between each pair of consecutive letters at most one of these transitions is used.

Theorem

Every VA is equivalent to an extended VA of at most exponential size.
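At the level of a single run, the idea behind extended VA can be sketched as follows: every maximal block of consecutive markers is collapsed into one set, so the order in which markers were taken no longer matters. This only illustrates the effect on runs under an assumed label encoding; it is not the automaton construction behind the theorem.

```python
def to_extended(labels):
    """Group each maximal block of consecutive markers into a single frozenset."""
    extended, block = [], set()
    for label in labels:
        if isinstance(label, tuple):  # a marker such as ("open", "x")
            block.add(label)
            continue
        if block:  # a letter ends the current block of markers
            extended.append(frozenset(block))
            block = set()
        extended.append(label)
    if block:  # markers after the last letter
        extended.append(frozenset(block))
    return extended


# Two runs that differ only in the order of the opening markers.
RUN_XY = [("open", "x"), ("open", "y"), "a", "a", ("close", "x"), "b", ("close", "y")]
RUN_YX = [("open", "y"), ("open", "x"), "a", "a", ("close", "x"), "b", ("close", "y")]

print(to_extended(RUN_XY) == to_extended(RUN_YX))
```

Both runs become the same extended run, which is why extended VA avoid "several paths of variables" for one extraction.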

slide-22
SLIDE 22

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example as an extended VA with states 3, 4, 5, 5’, 6, 6’, 7 and transitions {x⊢, y⊢}, a, {⊣x}, {⊣y}, b]

Problem: A VA can have several runs that witness the same output.

Example of several runs with the same input/output

Two runs through different states that both read {x⊢, y⊢}, a, a, {⊣x}, b, {⊣y}

slide-23
SLIDE 23

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example as an extended VA with states 3, 4, 5, 5’, 6, 6’, 7 and transitions {x⊢, y⊢}, a, {⊣x}, {⊣y}, b]

Definition: deterministic (Input/Output) VA

An extended VA is deterministic if the transition relation is a function.

slide-24
SLIDE 24

Problematic behaviors of VA and their classes

  • 1. Functional VA
  • 2. Extended VA
  • 3. Deterministic VA

[Figure: the example extended VA with states 3, 4, 5, 5’, 6, 6’, 7, now with one fewer a-transition]

Definition: deterministic (Input/Output) VA

An extended VA is deterministic if the transition relation is a function.

Theorem

Every extended VA is equivalent to a deterministic extended VA of at most exponential size.
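The determinism condition can be checked mechanically. The sketch below assumes the transition relation is given as (source, label, target) triples, where a label is either a letter or a set of markers; the state numbers in the usage example are hypothetical and not taken from the figure.

```python
def is_deterministic(transitions):
    """True if no (source, label) pair leads to two different targets."""
    chosen = {}
    for source, label, target in transitions:
        # setdefault records the first target seen for this (source, label).
        if chosen.setdefault((source, label), target) != target:
            return False
    return True


OPEN_XY = frozenset({("open", "x"), ("open", "y")})

# Hypothetical fragments: the first has two a-transitions leaving state 4.
NONDETERMINISTIC = [(3, OPEN_XY, 4), (4, "a", 4), (4, "a", 5)]
DETERMINISTIC = [(3, OPEN_XY, 4), (4, "a", 5), (5, "a", 6)]

print(is_deterministic(NONDETERMINISTIC), is_deterministic(DETERMINISTIC))
```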
slide-25
SLIDE 25

Outline

  • 1. Variable-set automata and their variants
  • 2. The constant delay algorithm

slide-26
SLIDE 26

The constant delay algorithm for extended VA

Given a deterministic and functional extended VA A = (Q, q0, F, δ).

procedure Evaluate(A, a1 . . . an)
    for all q ∈ Q \ {q0} do
        list_q ← ε
    list_q0 ← [⊥]
    for i := 1 to n do
        Capturing(i)
        Reading(i)
    Capturing(n + 1)
    Enumerate({list_q}q∈Q, F)

procedure Capturing(i)
    for all q ∈ Q do
        list_q^old ← list_q.lazycopy
    for all q ∈ Q with list_q^old ≠ ε do
        for all S ∈ Markers_δ(q) do
            node ← Node((S, i), list_q^old)
            p ← δ(q, S)
            list_p.add(node)

procedure Reading(i)
    for all q ∈ Q do
        list_q^old ← list_q
        list_q ← ε
    for all q ∈ Q with list_q^old ≠ ε do
        p ← δ(q, ai)
        list_p.append(list_q^old)
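The pseudocode can be tried out with a small Python sketch. Several things here are assumptions made for illustration rather than part of the talk: δ is a dict from (state, label) to state, marker sets are frozensets of (variable, kind) pairs, the start of every run is marked by a sentinel BOTTOM, and lists are immutable tuples. Tuples make the snapshot in Capturing trivial, but concatenating them is not constant time, so this sketch shows the logic only and not the linear preprocessing or constant delay bounds. The automaton in the usage example is hypothetical, modelled on the running example.

```python
BOTTOM = None  # sentinel node: the start of every run


def paths(node, partial):
    """Walk back from node to BOTTOM, yielding one mapping per path."""
    if node is BOTTOM:
        yield {
            var: (partial[var, "open"], partial[var, "close"])
            for var, kind in partial
            if kind == "open"
        }
        return
    markers, position, previous = node
    extended = {**partial, **{marker: position for marker in markers}}
    for prev in previous:
        yield from paths(prev, extended)


def evaluate(delta, q0, finals, doc):
    """Yield all mappings of a deterministic functional extended VA on doc."""
    lists = {q0: (BOTTOM,)}  # list_q for each state q with a live run

    def capturing(i):
        old = dict(lists)  # snapshot: new nodes must not be chained in this step
        for (q, label), p in delta.items():
            if isinstance(label, frozenset) and q in old:
                node = (label, i, old[q])
                lists[p] = (node,) + lists.get(p, ())

    def reading(letter):
        old = dict(lists)
        lists.clear()
        for q, nodes in old.items():
            p = delta.get((q, letter))
            if p is not None:  # runs without a matching transition die here
                lists[p] = lists.get(p, ()) + nodes

    for i, letter in enumerate(doc, start=1):
        capturing(i)
        reading(letter)
    capturing(len(doc) + 1)
    for q in finals:
        for node in lists.get(q, ()):
            yield from paths(node, {})


# Hypothetical automaton: open x and y together, read "aa", then close the
# variables in either order around the letter "b".
OPEN_XY = frozenset({("x", "open"), ("y", "open")})
CLOSE_X = frozenset({("x", "close")})
CLOSE_Y = frozenset({("y", "close")})
DELTA = {
    (0, OPEN_XY): 1, (1, "a"): 2, (2, "a"): 3,
    (3, CLOSE_X): 4, (4, "b"): 5, (5, CLOSE_Y): 7,
    (3, CLOSE_Y): 6, (6, "b"): 8, (8, CLOSE_X): 7,
}

RESULTS = list(evaluate(DELTA, 0, {7}, "aab"))
for mapping in RESULTS:
    print(mapping)
```

Each node stores a marker set, the position where it was taken, and the list of nodes it can continue from, so shared prefixes of runs are stored once.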

slide-27
SLIDE 27

Sketch idea of the constant delay algorithm in 3 steps

Given a deterministic and functional extended VA A = (Q, q0, F, δ).

  • 1. Convert the document d into a deterministic extended VA Ad.

document d: a a b (positions 1, 2, 3)

[Figure: the VA Ad with states d1, d2, d3, d4, letter transitions a, a, b, and marker-set transitions {x⊢}, {y⊢}, {⊣x, y⊢}, . . .]

slide-28
SLIDE 28

Sketch idea of the constant delay algorithm in 3 steps

Given a deterministic and functional extended VA A = (Q, q0, F, δ).

  • 1. Convert the document d into a deterministic extended VA Ad.
  • 2. Build the product between A and Ad, and annotate the variable transitions with the position of d where they take place.

slide-29
SLIDE 29
  • 2. Build the product between A and Ad

[Figure: the product of A (states q0 to q7) with Ad (states d1 to d4 over the letters a, a, b); the variable transitions are annotated with positions: ({x⊢, y⊢}, 1), ({⊣x}, 3), ({⊣y}, 3), ({⊣y}, 4), ({⊣x}, 4)]

slide-30
SLIDE 30

Sketch idea of the constant delay algorithm in 3 steps

Given a deterministic and functional extended VA A = (Q, q0, F, δ).

  • 1. Convert the document d into a deterministic eVA Ad.
  • 2. Build the product between A and Ad, and annotate the variable transitions with the position of d where they take place.
  • 3. Replace all the letters in the transitions of A × Ad with ε, and construct the “forward” ε-closure of the resulting graph.

slide-31
SLIDE 31
  • 3. Replace all the letters with ε-transitions and construct the forward ε-closure

[Figure: the product graph with letter transitions a, a, b and annotated variable transitions ({x⊢, y⊢}, 1), ({⊣x}, 3), ({⊣y}, 3), ({⊣y}, 4), ({⊣x}, 4)]

slide-32
SLIDE 32
  • 3. Replace all the letters with ε-transitions and construct the forward ε-closure

[Figure: the same graph with every letter transition a, b replaced by an ε-transition]

slide-33
SLIDE 33
  • 3. Replace all the letters with ε-transitions and construct the forward ε-closure

[Figure: the ε-closure under construction; edge built so far: ({x⊢, y⊢}, 1)]

slide-34
SLIDE 34
  • 3. Replace all the letters with ε-transitions and construct the forward ε-closure

[Figure: the ε-closure under construction; edges built so far: ({x⊢, y⊢}, 1), ({⊣x}, 3)]

slide-35
SLIDE 35
  • 3. Replace all the letters with ε-transitions and construct the forward ε-closure

[Figure: the ε-closure under construction; edges built so far: ({x⊢, y⊢}, 1), ({⊣x}, 3), ({⊣y}, 3)]

slide-36
SLIDE 36
  • 3. Replace all the letters with ε-transitions and construct the forward ε-closure

[Figure: the resulting graph with edges ({x⊢, y⊢}, 1), ({⊣x}, 3), ({⊣y}, 3), ({⊣y}, 4), ({⊣x}, 4)]

Given that the VA is functional, extended and deterministic: each path in the graph corresponds exactly to an output mapping, and every path is different (i.e. there are no duplicates).

slide-37
SLIDE 37

Sketch idea of the constant delay algorithm in 3 steps

Given a deterministic and functional extended VA A = (Q, q0, F, δ).

  • 1. Convert the document d into a deterministic eVA Ad.
  • 2. Build the product between A and Ad, and annotate the variable transitions with the position of d where they take place.
  • 3. Replace all the letters in the transitions of A × Ad with ε, and construct the “forward” ε-closure of the resulting graph.

Finally, we enumerate all paths of the resulting acyclic labeled graph, which can easily be done with constant delay between outputs.
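The final enumeration step amounts to listing all paths of an acyclic labeled graph. The sketch below is a plain depth-first traversal with an explicit stack. It does not reproduce the bookkeeping needed for a strict constant delay bound, and it assumes every node can reach the end node. The node names s, u, v, w, t are hypothetical; the edge labels follow the annotated transitions of the running example, with marker sets written as strings.

```python
def enumerate_paths(graph, start, end):
    """Yield the label sequence of every path from start to end in a DAG.

    graph maps a node to a list of (label, successor) pairs.
    """
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        if node == end:
            yield labels  # one complete path corresponds to one mapping
            continue
        for label, successor in graph.get(node, []):
            stack.append((successor, labels + (label,)))


# Hypothetical encoding of the closure graph from the example.
GRAPH = {
    "s": [(("open x, open y", 1), "u")],
    "u": [(("close x", 3), "v"), (("close y", 3), "w")],
    "v": [(("close y", 4), "t")],
    "w": [(("close x", 4), "t")],
}

PATHS = list(enumerate_paths(GRAPH, "s", "t"))
for path in PATHS:
    print(path)
```

The two paths correspond to the mappings x = [1, 3⟩, y = [1, 4⟩ and x = [1, 4⟩, y = [1, 3⟩.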

slide-38
SLIDE 38

Efficiency of the constant delay algorithm

Given a VA A and a document d, let:

n = number of states of A
m = number of transitions of A
ℓ = number of variables of A

Class of regular spanners                 Precomputation phase
deterministic functional extended VA      (n + m) ⋅ ∣d∣
functional extended VA                    2^n ⋅ m ⋅ ∣d∣
functional VA / functional RGX            2^n ⋅ (n^2 + ∣Σ∣) ⋅ ∣d∣
VA / RGX                                  (2^n 5^ℓ + 2^n 3^ℓ ∣Σ∣) ⋅ ∣d∣

In the paper, we give some evidence that the exponential blow-up for functional extended VA seems unavoidable.

slide-39
SLIDE 39

Conclusions and future work

We provide a simple constant delay algorithm for evaluating deterministic functional extended VA.

We extend this algorithm to the full class of variable-set automata and (also) the regular spanner algebra.

Future work:

  • 1. Code the algorithm and show that it works in practice.
  • 2. Extend the algorithm to include other features used in rule-based information extraction.

Thanks!