Cross-Document Pattern Matching

SLIDE 1

Cross-Document Pattern Matching

Gregory Kucherov (1), Yakov Nekrich (2), Tatiana Starikovskaya (3, 1)

(1) Université Paris-Est & CNRS, (2) University of Chile, (3) Lomonosov Moscow State University

G. Kucherov, Y. Nekrich, T. Starikovskaya. Cross-Document Pattern Matching. CPM 2012.

SLIDE 2

Pattern Matching Problem

Given a text T and a pattern P, count all occurrences of P in T.

[Figure: suffix tree ST(T); the path labeled P leads to the node u = locus(P).]

Solution with the suffix tree: O(|P|) time, O(|T|) space.
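
As a point of comparison, here is a minimal sketch of counting with a suffix array instead of the suffix tree on the slide. It is a swapped-in, simpler structure: the construction is naive and a query costs O(|P| log |T|) rather than O(|P|). The function names are illustrative and do not come from the talk.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count the suffixes of text that start with pattern (two binary searches)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first rank whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


sa = build_suffix_array("banana")
print(count_occurrences("banana", sa, "ana"))  # 2
```

The two binary searches delimit the contiguous block of suffixes that start with P, which plays the role of the leaves below locus(P) in the suffix tree.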

SLIDE 3

Cross-Document Pattern Matching Problem

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], count all occurrences of P in Tℓ.

Example
documents: genomic sequences
pattern: a fragment of one of the sequences

SLIDE 4

Cross-Document Pattern Matching Problem

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], count all occurrences of P in Tℓ.

[Figure: suffix tree ST(Tℓ); the path labeled P leads to the node u = locus(P).]

Standard solution: O(|P|) time, O(|Tℓ|) space.

Faster solution? Yes.

SLIDE 5

Variants

◮ Counting
◮ Reporting
◮ Document counting and reporting
◮ Dynamic counting and reporting

SLIDE 7

Counting

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], count all occurrences of P in Tℓ.

1) Identify a position p of some occurrence of P in Tℓ.
2) Find the locus of Tℓ[p..p + |P| − 1] in ST(Tℓ) and retrieve the number of leaves in its subtree.

[Figure: suffix tree ST(Tℓ); u = locus(P), where P = Tℓ[p..p + |P| − 1].]

SLIDE 8

Counting: step 1

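1) Identify a position p of some occurrence of P in Tℓ.

[Figure: the generalized suffix array GSA with the document array D below it. The suffix Tk[i..] has rank r and D[r] = k; the suffixes Tℓ[p1..] and Tℓ[p2..] have ranks r1 and r2, with D[r1] = D[r2] = ℓ.]

p1, p2: starting positions of the suffixes of Tℓ that are closest to Tk[i..] in the GSA.

r1 = select(ℓ, rank(D[1..r − 1], ℓ))
r2 = select(ℓ, rank(D[1..r − 1], ℓ) + 1)

[Golynski et al. 2006] Rank and select queries on D can be supported in O(1) and O(log log m) time respectively.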
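
The two formulas for r1 and r2 can be tried out with a simple stand-in for the rank/select structure. The sketch below keeps, for every document, the sorted list of its ranks in D and answers rank by binary search, so it does not achieve the bounds quoted from [Golynski et al. 2006]. The class and function names are illustrative; ranks are 1-based as on the slide.

```python
from bisect import bisect_right


class DocArray:
    """Document array D with naive rank/select (1-based ranks)."""

    def __init__(self, D):
        self.pos = {}
        for r, d in enumerate(D, start=1):
            self.pos.setdefault(d, []).append(r)

    def rank(self, r, doc):
        """Number of entries equal to doc in D[1..r]."""
        return bisect_right(self.pos.get(doc, []), r)

    def select(self, doc, j):
        """Rank of the j-th entry equal to doc, or None if there is none."""
        occ = self.pos.get(doc, [])
        return occ[j - 1] if 1 <= j <= len(occ) else None


def neighbours(da, r, doc):
    """Ranks r1 < r < r2 of the closest suffixes of document doc around rank r."""
    c = da.rank(r - 1, doc)
    return da.select(doc, c), da.select(doc, c + 1)


da = DocArray([1, 2, 1, 2, 2, 1])
print(neighbours(da, 3, 2))  # (2, 4)
```

A missing neighbour (no suffix of the document on that side) is returned as None.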

SLIDE 9

Counting: step 1

1) Identify a position p of some occurrence of P in Tℓ.

⇒ Positions p1 and p2 can be computed in O(log log m) time.

P occurs at p1 ⇔ lcp(Tℓ[p1..], Tk[i..]) ≥ |P|, and likewise for p2. The suffixes that start with P occupy a contiguous range of the GSA containing Tk[i..], so if P occurs in Tℓ at all, it occurs at p1 or at p2.

Step 1 takes O(log log m) time.
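
Step 1 can be sketched end to end on small inputs. The code below is an illustration under simplifying assumptions: the generalized suffix array is built by naive sorting, the neighbours are found by scanning rather than by rank/select, and lcp is computed character by character. Indices are 0-based and the pattern is the half-open slice docs[k][i:j], unlike the inclusive Tk[i..j] of the slides. All names are hypothetical.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_occurrence(docs, k, i, j, l):
    """Return a start position of P = docs[k][i:j] in docs[l], or None."""
    if k == l:
        return i  # P trivially occurs at position i of its own document
    # Generalized suffix array: (document, start) pairs in lexicographic order.
    gsa = sorted(((d, s) for d, t in enumerate(docs) for s in range(len(t))),
                 key=lambda e: (docs[e[0]][e[1]:], e))
    r = gsa.index((k, i))
    m = j - i
    # Closest suffixes of docs[l] before and after rank r.
    before = next((gsa[x] for x in range(r - 1, -1, -1) if gsa[x][0] == l), None)
    after = next((gsa[x] for x in range(r + 1, len(gsa)) if gsa[x][0] == l), None)
    for cand in (before, after):
        if cand is not None and lcp(docs[l][cand[1]:], docs[k][i:]) >= m:
            return cand[1]
    return None


docs = ["banana", "ananas"]
print(find_occurrence(docs, 0, 1, 4, 1))  # a position of "ana" in "ananas"
```

Only the two neighbouring suffixes of Tℓ are ever tested, which is what makes the step independent of |P| once lcp queries are supported in constant time.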

SLIDE 10

Counting: step 2

2) Find the locus of Tℓ[p..p + |P| − 1] in ST(Tℓ) and retrieve the number of leaves in its subtree.

[Figure: suffix tree ST(Tℓ); u = locus(P), where P = Tℓ[p..p + |P| − 1], lies on the path from the root to the leaf v = locus(Tℓ[p..]).]

weight(w): the string depth of a node w
w = wla(v, q): the ancestor of v of minimal depth such that weight(w) ≥ q
u = wla(v, |P|)
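
The definition of wla can be made concrete with a naive walk towards the root. This is only a sketch of what the query returns, not of how to answer it fast. The tree is a hand-built fragment of the suffix tree of "banana$", stored as hypothetical parent and weight arrays.

```python
def wla_naive(parent, weight, v, q):
    """Shallowest ancestor w of v (v itself allowed) with weight[w] >= q."""
    if weight[v] < q:
        return None
    w = v
    # Weights decrease towards the root, so stop at the first ancestor below q.
    while parent[w] is not None and weight[parent[w]] >= q:
        w = parent[w]
    return w


# Fragment of the suffix tree of "banana$":
# 0: root, 1: "a", 2: "ana", 3: leaf "anana$", 4: leaf "ana$", 5: leaf "a$"
parent = [None, 0, 1, 2, 2, 1]
weight = [0, 1, 3, 6, 4, 2]
leaves = {1: 3, 2: 2}  # number of leaves below the internal nodes

u = wla_naive(parent, weight, 3, 2)  # locus of P = "an", found from leaf "anana$"
print(u, leaves[u])  # 2 2
```

Here the locus of "an" is the node "ana", and its two leaves correspond to the two occurrences of "an" in "banana".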

SLIDE 11

Weighted Level Ancestor Problem

[Farach et al. 1996, Amir et al. 2007] w = wla(v, q) can be found in O(log log W) time and linear space, where W is the maximal weight of a node in the tree.

Theorem

w = wla(v, q) can be found in O(min{√(log n_w / log log n_w), log log q}) time and linear space, where n_w is the number of leaves in the subtree of w.
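
For experimentation, weighted level ancestor queries can also be answered with jump pointers (binary lifting). This is a different, textbook technique, named here plainly: it takes O(log n) time per query and O(n log n) space, so it matches neither of the bounds above. The array layout and function names are assumptions made for this sketch.

```python
def build_jumps(parent):
    """up[k][v] = ancestor of v at distance 2**k, clamped at the root."""
    n = len(parent)
    up = [[p if p is not None else v for v, p in enumerate(parent)]]
    while (1 << len(up)) < n:
        prev = up[-1]
        up.append([prev[prev[v]] for v in range(n)])
    return up


def wla_jump(up, weight, v, q):
    """Shallowest ancestor w of v (v itself allowed) with weight[w] >= q."""
    if weight[v] < q:
        return None
    for level in reversed(up):  # try the longest jumps first
        if weight[level[v]] >= q:
            v = level[v]
    return v


# Fragment of the suffix tree of "banana$":
# 0: root, 1: "a", 2: "ana", 3: leaf "anana$", 4: leaf "ana$", 5: leaf "a$"
parent = [None, 0, 1, 2, 2, 1]
weight = [0, 1, 3, 6, 4, 2]
up = build_jumps(parent)
print(wla_jump(up, weight, 3, 2))  # 2, the node "ana"
```

The greedy descent over jump lengths works because weights strictly decrease along the path from v to the root.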

SLIDE 12

Counting: step 2

2) Find the locus of Tℓ[p..p + |P| − 1] in ST(Tℓ) and retrieve the number of leaves in its subtree.

[Figure: suffix tree ST(Tℓ); u = locus(P), where P = Tℓ[p..p + |P| − 1], is an ancestor of the leaf v = locus(Tℓ[p..]).]

u = wla(v, |P|) and n_u = occ ⇒ u can be found in O(min{√(log occ / log log occ), log log |P|}) time.

SLIDE 13

Counting: step 2

2) Find the locus of Tℓ[p..p + |P| − 1] in ST(Tℓ) and retrieve the number of leaves in its subtree.

[Figure: suffix tree ST(Tℓ); u = locus(P), where P = Tℓ[p..p + |P| − 1], is an ancestor of the leaf v = locus(Tℓ[p..]).]

Theorem

Counting takes O(t + log log m) time and O(n) space, where t = min{√(log occ / log log occ), log log |P|}.

SLIDE 15

Reporting

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], report all occurrences of P in Tℓ.

1) Identify a position p of Tℓ at which P occurs: Step 1 of Counting, O(log log m) time.
2) Report all s such that lcp(Tℓ[p..], Tℓ[s..]) ≥ |P|.

[Figure: P is a prefix of both Tℓ[p..] and Tℓ[s..] ⇔ lcp(Tℓ[p..], Tℓ[s..]) ≥ |P|.]

SLIDE 17

Reporting: step 2

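2) Report all s such that lcp(Tℓ[p..], Tℓ[s..]) ≥ |P|.

[Figure: the suffix array SA(Tℓ) with the suffix Tℓ[p..] marked.]

Starting from the rank of Tℓ[p..], scan SA(Tℓ) in both directions: while lcp(Tℓ[s..], Tℓ[p..]) ≥ |P|, report s.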

Theorem

Reporting takes O(log log m + occ) time and O(n) space.
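
The scan can be sketched with a suffix array and the lcp values of adjacent ranks. This is an illustration under simplifying assumptions: both arrays are built naively inside the function, positions are 0-based, and P is assumed to occur at p with length m. The names are hypothetical.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def report_occurrences(text, p, m):
    """All s with lcp(text[p:], text[s:]) >= m, assuming m <= len(text) - p."""
    sa = sorted(range(len(text)), key=lambda s: text[s:])
    # lcp_adj[x] = lcp of the suffixes at ranks x - 1 and x (lcp_adj[0] unused).
    lcp_adj = [0] + [lcp(text[sa[x - 1]:], text[sa[x]:]) for x in range(1, len(sa))]
    r = sa.index(p)
    out = [p]
    x = r
    while x > 0 and lcp_adj[x] >= m:  # scan towards smaller ranks
        x -= 1
        out.append(sa[x])
    x = r + 1
    while x < len(sa) and lcp_adj[x] >= m:  # scan towards larger ranks
        out.append(sa[x])
        x += 1
    return sorted(out)  # sorted only to make the output easy to read


print(report_occurrences("banana", 1, 3))  # [1, 3]
```

Each reported position costs one comparison against a precomputed adjacent lcp value, which is where the occ term of the bound comes from.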

SLIDE 19

Document counting and reporting

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], count or report all documents in which P occurs.

1) Find u = locus(P) in the generalized suffix tree (GST) by reduction to the WLA problem: O(min{√(log docc / log log docc), log log |P|}) time.
2) Report or count the distinct documents in the subtree of u.

[Figure: the GST above the document array D; u = locus(P), where P = Tk[i..j], is an ancestor of the leaf v = locus(Tk[i..]).]

SLIDE 21

Document counting and reporting: step 2

2) Report or count distinct documents in the subtree of u ⇔ report or count distinct documents in the corresponding segment of the document array D.

[Muthukrishnan 2002] Reporting the distinct documents in a segment of D takes O(ndocs) time and O(n) space.

Theorem

Document reporting takes O(t + ndocs) time and O(n) space, where t = min{√(log docc / log log docc), log log |P|}.

[Bozanis et al. 1995] Counting the distinct documents in a segment of D takes O(log n) time and O(n) space.

Theorem

Document counting takes O(log n) time and O(n) space.
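
Reporting distinct documents in a segment of D is commonly done with a "previous occurrence" array and range-minimum queries, in the spirit of [Muthukrishnan 2002]. The sketch below follows that idea but answers each range-minimum query by a naive scan, so it does not reach the O(ndocs) bound; a real structure would use constant-time range minima. Indices are 0-based and the names are illustrative.

```python
def build_prev(D):
    """C[i] = largest i' < i with D[i'] == D[i], or -1 if there is none."""
    last, C = {}, []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


def distinct_documents(D, C, lo, hi):
    """Distinct values in D[lo..hi] (inclusive), each reported once."""
    out = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        i = min(range(a, b + 1), key=C.__getitem__)  # naive range-minimum query
        if C[i] < lo:  # D[i] is the first occurrence of its document in [lo, hi]
            out.append(D[i])
            stack.append((a, i - 1))
            stack.append((i + 1, b))
    return out


D = [0, 1, 0, 2, 1, 1, 0]
C = build_prev(D)
print(sorted(distinct_documents(D, C, 2, 5)))  # [0, 1, 2]
```

A subrange whose minimum of C is at least lo contains no first occurrence and is abandoned at once, so the number of range-minimum queries is proportional to the number of documents reported.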

SLIDE 23

Dynamic counting and reporting

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], count all occurrences of P in Tℓ. Dynamic operation: adding a document.

1) Find a position p of some occurrence of P in Tℓ.
2) Find the locus of Tℓ[p..p + |P| − 1] in ST(Tℓ) and retrieve the number of leaves in its subtree.

SLIDE 25

Dynamic counting and reporting

Given a set of documents T1, T2, ..., Tm and a pattern P = Tk[i..j], count all occurrences of P in Tℓ. Dynamic operation: adding a document.

1) Find a position p of some occurrence of P in Tℓ: O(log n) time.

[Figure: the GST with the leaves Tk[i..], Tℓ[p1..] and Tℓ[p2..], above the suffix array of Tℓ.]

The structure of [Dietz et al. 1987] is used to compare the ranks of any two leaves in O(1) time.
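
One way to read the O(log n) bound is as a binary search over the suffix array of Tℓ that uses O(log n) rank comparisons. The sketch below illustrates that reading and is an assumption, not a description taken from the slides: a plain string comparison stands in for the constant-time leaf comparison, the suffix array is rebuilt naively, and all names are hypothetical. Indices are 0-based and P = docs[k][i:j].

```python
def find_occurrence_by_search(docs, k, i, j, l):
    """Return a start position of P = docs[k][i:j] in docs[l], or None."""
    text, query, m = docs[l], docs[k][i:], j - i
    sa = sorted(range(len(text)), key=lambda s: text[s:])
    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose suffix is >= the query suffix
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:  # stands in for an O(1) comparison of leaf ranks
            lo = mid + 1
        else:
            hi = mid
    # The query falls between ranks lo - 1 and lo: test both neighbours.
    for x in (lo - 1, lo):
        if 0 <= x < len(sa) and text[sa[x]:sa[x] + m] == query[:m]:
            return sa[x]
    return None


docs = ["banana", "ananas"]
print(find_occurrence_by_search(docs, 0, 1, 4, 1))  # 0
```

As in the static case, only the two suffixes of Tℓ adjacent to Tk[i..] need to be checked against P.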

SLIDE 27

Summary of the results

m, n: the number of documents and their total length, respectively.

◮ Counting: O(log log m + min{√(log occ / log log occ), log log |P|}) time
◮ Reporting: O(log log m + occ) time
◮ Document counting: O(log n) time
◮ Document reporting: O(min{√(log docc / log log docc), log log |P|} + ndocs) time
◮ Dynamic counting: O(log n) time
◮ Dynamic reporting: O(log n + occ) time
  (update: O(log n) time per letter)
◮ Succinct data structures for counting, reporting and document reporting
