I/O-Efficient Data Structures for Colored Range and Prefix Reporting

SLIDE 1

I/O-Efficient Data Structures for Colored Range and Prefix Reporting

Kasper Green Larsen, MADALGO, Aarhus University

Rasmus Pagh, IT University of Copenhagen

Presenter: Yakov Nekrich

SLIDE 2

Motivating example

  • Store a collection of documents (= sets of words).
  • Given a string p, return the IDs of all documents that contain p as a word.
  • Optimal solution: “Inverted index”, precomputing the answer for each p.
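As a rough illustration of the inverted index mentioned above, the following minimal sketch precomputes, for each word, the list of document IDs containing it. The function name `build_inverted_index` and the three tiny documents are assumptions made for this example; they are not taken from the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of the documents containing it.

    docs: dict mapping a document ID to its set of words (hypothetical input format).
    """
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy collection: each document is a set of words.
docs = {
    1: {"sanningen", "finns", "marken"},
    2: {"sanningen", "ligger", "gatan"},
    3: {"jag", "ligger", "somna"},
}
index = build_inverted_index(docs)

# A query is a single lookup; the answer was precomputed.
print(index.get("ligger", []))  # [2, 3]
```

A query for a word that occurs nowhere simply returns the empty list.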

Example documents (Swedish):

Sanningen finns på marken/men ingen vågar ta den./Sanningen ligger på gatan./Ingen gör den till sin. Jag ligger och ska somna, jag ser okända bilder/och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./Vänner! Ni drack mörkret/och blev synliga.

SLIDE 3

Motivating example

  • Store a collection of documents (= sets of words).
  • Given a string p, return the IDs of all documents that contain p as a prefix of a word.
  • Optimal solution: Topic of this talk.

This problem is called “colored prefix reporting”.

SLIDE 4

Simple data structures

  • Inverted list for each prefix: Too much space for fixed alphabet size.
  • Inverted list for each word: Too much time if many documents have many different words with prefix p.
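To make the second drawback concrete, here is a small sketch of a prefix query answered with one inverted list per word. It counts how many list entries are read compared with how many distinct document IDs are reported. The helper `prefix_query` and the toy index are assumptions for illustration only.

```python
from bisect import bisect_left

def prefix_query(sorted_words, index, p):
    """Answer a prefix query by merging the inverted lists of all words with prefix p.

    Returns (sorted distinct doc IDs, number of list entries read).
    """
    lo = bisect_left(sorted_words, p)  # first word >= p in lexicographic order
    result, entries_read = set(), 0
    for word in sorted_words[lo:]:
        if not word.startswith(p):
            break  # words with prefix p are contiguous, so we can stop here
        entries_read += len(index[word])
        result.update(index[word])
    return sorted(result), entries_read

# Hypothetical index: two documents share many words starting with "s".
index = {
    "sanningen": [1, 2],
    "ska": [1, 2],
    "somna": [1, 2],
    "springan": [1, 2],
    "vakenhet": [2],
}
sorted_words = sorted(index)

ids, entries_read = prefix_query(sorted_words, index, "s")
print(ids, entries_read)  # [1, 2] 8
```

Only two IDs are reported, yet eight list entries are read; the gap grows with the number of distinct matching words per document.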

SLIDE 7

Relation to range reporting

[GHJS ’03] Words in lexicographic order; the query p* selects a contiguous range of words.

y-coord = x-coord of previous point of same color

SLIDE 8

Relation to range reporting

[GHJS ’03] Words in lexicographic order; the query p* selects a contiguous range of words.

3-sided range query captures only first occurrence of color
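The reduction on these slides can be sketched as follows. Each (word, color) pair becomes a point whose x-coordinate is its position in lexicographic order and whose y-coordinate is the x-coordinate of the previous point of the same color (here -1 if there is none, an assumed convention). A prefix query then becomes the 3-sided query a ≤ x < b, y < a. The brute-force scan below only stands in for a real range-reporting structure, and all function names are hypothetical.

```python
from bisect import bisect_left

def build_points(pairs):
    """pairs: list of (word, color), sorted by word.

    Returns points (x, y, color): x is the position in sorted order and
    y is the x of the previous point with the same color (-1 if none).
    """
    last_x, points = {}, []
    for x, (_, color) in enumerate(pairs):
        points.append((x, last_x.get(color, -1), color))
        last_x[color] = x
    return points

def colored_prefix_report(pairs, points, p):
    """Report each color (document ID) having a word with prefix p exactly once."""
    words = [w for w, _ in pairs]
    a = bisect_left(words, p)  # left end of the x-range of p*
    b = a
    while b < len(words) and words[b].startswith(p):
        b += 1                 # right end (exclusive)
    # 3-sided query: a <= x < b and y < a. A point qualifies only if it is
    # the first occurrence of its color inside the x-range.
    return [c for x, y, c in points if a <= x < b and y < a]

pairs = sorted([
    ("gatan", 2), ("ligger", 2), ("ligger", 3),
    ("sanningen", 1), ("sanningen", 2), ("ska", 3), ("somna", 3),
])
points = build_points(pairs)
print(colored_prefix_report(pairs, points, "s"))  # [1, 2, 3]
print(colored_prefix_report(pairs, points, "l"))  # [2, 3]
```

Document 3 has two words starting with "s", but only the first of them has its predecessor outside the query range, so the color is reported once.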

SLIDE 11

New result

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os.

  • Optimal time and space.
  • Implies optimal colored prefix reporting.
  • Parameters:

k = number of points returned

B = number of points in a memory block (assume query string fits in one block)

SLIDE 13

Model of computation

  • I/O model with Θ(B log n) bits per block.
  • Memory is a sequence of blocks. Cost model: Number of block retrievals (I/Os).
  • O(1) blocks can be stored in “cache” and accessed at no cost.

Many other papers: Block stores B “items”.
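The cost model can be illustrated with a toy simulation. It uses the simpler "block stores B items" view rather than the Θ(B log n)-bit blocks of the talk, and the class `BlockMemory` and function `scan` are hypothetical helpers written for this sketch.

```python
class BlockMemory:
    """Toy external memory: items are grouped into blocks of B items,
    and every block retrieval is charged as one I/O."""

    def __init__(self, items, B):
        self.B = B
        self.blocks = [items[i:i + B] for i in range(0, len(items), B)]
        self.io_count = 0

    def read_block(self, block_id):
        self.io_count += 1
        return self.blocks[block_id]

def scan(memory, start, k):
    """Read k >= 1 consecutive items beginning at index start."""
    first = start // memory.B
    last = (start + k - 1) // memory.B
    out = []
    for block_id in range(first, last + 1):
        out.extend(memory.read_block(block_id))
    offset = start - first * memory.B
    return out[offset:offset + k]

mem = BlockMemory(list(range(100)), B=8)
items = scan(mem, start=5, k=20)
print(len(items), mem.io_count)  # 20 4
```

Reading k consecutively stored items touches at most ⌈k/B⌉ + 1 blocks, which is the shape of cost that an O(1+k/B) bound refers to.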

SLIDE 14

Selected previous results

Data structures for 3-sided range reporting:

Reference         Space overhead   Search time          Model
Arge et al. ’99   O(1)             O(log_B(n))          Comparison
Nekrich ’07       O(1)             log_B log_B ...(n)   Unrestricted
Nekrich ’07       log_B*(n)        O(log_B*(n))         Unrestricted
Nekrich ’07       (log_B(n))^2     O(1)                 Unrestricted
Here              O(1)             O(1)                 Unrestricted

SLIDE 15

High-level description

  1. Place points in a binary priority search tree.

[Figure legend: points to report; points in range]

SLIDE 16

High-level description

  2. Search for points near the “fringe” using O(1) searches in smaller “base” data structures.

SLIDE 17

High-level description

  3. Read blocks containing remaining points (easy).
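Step 1 builds on the priority search tree. As background only, here is a minimal in-memory sketch of the classical structure: each node stores the point with the smallest y in its subtree, and the remaining points are split at the median x. This is not the I/O-optimal structure from the talk (it has no blocking and no base structures), and the names `build` and `query` are hypothetical.

```python
def build(points):
    """Build a priority search tree over a list of (x, y) points."""
    if not points:
        return None
    rest = sorted(points)                    # order by x
    top = min(rest, key=lambda p: p[1])      # minimum-y point goes to this node
    rest.remove(top)
    mid = len(rest) // 2
    split = rest[mid][0] if rest else top[0] # x-values <= split may be on the left
    return {
        "point": top,
        "split": split,
        "left": build(rest[:mid]),
        "right": build(rest[mid:]),
    }

def query(node, x1, x2, y_max, out):
    """3-sided query: append all points with x1 <= x <= x2 and y <= y_max to out."""
    if node is None or node["point"][1] > y_max:
        return out  # heap order on y: nothing below this node can qualify
    x, _ = node["point"]
    if x1 <= x <= x2:
        out.append(node["point"])
    if x1 <= node["split"]:
        query(node["left"], x1, x2, y_max, out)
    if x2 >= node["split"]:
        query(node["right"], x1, x2, y_max, out)
    return out

points = [(1, 5), (2, 1), (3, 7), (4, 2), (5, 9), (6, 3), (7, 4)]
tree = build(points)
print(sorted(query(tree, 2, 6, 3, [])))  # [(2, 1), (4, 2), (6, 3)]
```

The heap order on y lets the search stop as soon as a node's point lies above the query's open side, which is what makes the 3-sided query output-sensitive.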

SLIDE 18

Base data structure

  • Core technical contribution of paper.
  • Able to handle point sets of size poly(B log n) optimally.
  • Main technique: Use of tabulation and fusion trees allows us to make a dynamic data structure partially persistent free of charge when the number of updates to it is small.
  • Refer to paper for more details.

SLIDE 20

Indivisibility assumption

  • Assumption often used to show lower bounds: Each block contains at most B points/items; reading a block is required to report an item.
  • Our data structure is among the few to break this assumption (with the dictionary of Iacono and Pătrașcu, previous presentation).
  • Open problem: Is there a nontrivial lower bound under the indivisibility assumption?

SLIDE 21

Memory models

  • There may be alternatives to the cache-oriented I/O model:

“Unlike conventional processors that rely on the hardware to automatically bring data and instructions close to the processor with a hierarchy of hardware caches, the Cell Broadband Engine requires the programmer to create a ‘shopping list’ of the data that the program requires.” (from ibm.com)

SLIDE 23

Scatter-I/O model

  • One I/O operation can read or write any set of B words in memory.
  • In this stronger model, we give a much simpler data structure for colored prefix search.
  • Open problem: Can notoriously I/O-difficult graph problems such as BFS and DFS be efficiently solved in this model?
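To see why the scatter-I/O model is stronger, the following sketch compares the number of I/Os needed to fetch a set of word addresses under both models. It assumes all addresses are known in advance and that blocks in the standard model are aligned groups of B consecutive words; both function names are hypothetical.

```python
import math

def block_model_ios(addresses, B):
    """Standard I/O model: one I/O fetches one aligned block of B consecutive words."""
    return len({addr // B for addr in addresses})

def scatter_model_ios(addresses, B):
    """Scatter-I/O model: one I/O fetches any set of B words."""
    return math.ceil(len(set(addresses)) / B)

B = 8
addresses = [i * 100 for i in range(16)]  # 16 words, far apart in memory
print(block_model_ios(addresses, B), scatter_model_ios(addresses, B))  # 16 2
```

When the requested words are scattered, the standard model pays one I/O per word, while the scatter model still pays only ⌈k/B⌉ for k words.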

SLIDE 25

Conclusion

  • Theoretically optimal solutions in the I/O model for colored prefix/range reporting.
  • In fact, optimal solution to 3-sided range reporting in two dimensions.
  • Main open problem: Efficient extension to top-k searches, where only the k highest ranked colors are reported.
