
SLIDE 1

I/O-Efficient Data Structures for Colored Range and Prefix Reporting

Kasper Green Larsen, MADALGO, Aarhus University
Rasmus Pagh, IT University of Copenhagen
Presenter: Yakov Nekrich

SLIDE 2

Motivating example

Store a collection of documents (= sets of words).
Given a string p, return the IDs of all documents that contain p as a word.
Optimal solution: “Inverted index”, precomputing the answer for each p.

The truth is on the ground/but no one dares to take it./The truth lies in the street./No one makes it their own.

I lie here about to fall asleep, I see unknown images/and signs scribbling themselves behind my eyelids/on the wall of darkness. In the gap between wakefulness and dream/a large letter tries in vain to push its way in.

I open door number two./Friends! You drank the darkness/and became visible.
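The inverted index mentioned on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not the presenters' implementation; the toy collection and the names `build_inverted_index` and `query_word` are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    # Precompute the answer for every word p as a sorted list of IDs.
    return {word: sorted(ids) for word, ids in index.items()}

def query_word(index, p):
    """Return the IDs of all documents that contain p as a word."""
    return index.get(p, [])

# Hypothetical toy collection: document ID -> set of words.
documents = {
    1: {"truth", "ground", "street"},
    2: {"sleep", "dream", "letter"},
    3: {"door", "dream", "darkness"},
}
index = build_inverted_index(documents)
print(query_word(index, "dream"))  # [2, 3]
```

A query is a single lookup followed by reading the stored answer, which is why precomputing one list per word is hard to beat for whole-word queries.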


SLIDE 3

Motivating example

Store a collection of documents (= sets of words).
Given a string p, return the IDs of all documents that contain p as a prefix of a word.
Optimal solution: Topic of this talk.


“colored prefix reporting”


SLIDE 4

Simple data structures

Inverted list for each prefix:

Too much space for fixed alphabet size.

Inverted list for each word:

Too much time if many documents have many different words with prefix p.
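To make the second drawback concrete, here is a small sketch of the "inverted list for each word" approach applied to a prefix query. It is an illustration under assumed names (`build_word_lists`, `prefix_query_naive`) and a hypothetical toy collection; it counts how many list entries are scanned, which can far exceed the number of distinct documents reported.

```python
import bisect

def build_word_lists(documents):
    """Sorted list of (word, sorted doc IDs): one inverted list per word."""
    lists = {}
    for doc_id, words in documents.items():
        for w in words:
            lists.setdefault(w, set()).add(doc_id)
    return sorted((w, sorted(ids)) for w, ids in lists.items())

def prefix_query_naive(word_lists, p):
    """Return (distinct doc IDs, number of list entries scanned) for prefix p."""
    words = [w for w, _ in word_lists]
    # Words with prefix p are consecutive in lexicographic order.
    lo = bisect.bisect_left(words, p)
    result, scanned = set(), 0
    for w, ids in word_lists[lo:]:
        if not w.startswith(p):
            break
        scanned += len(ids)
        result.update(ids)
    return sorted(result), scanned

# Hypothetical toy collection: two documents with many words sharing a prefix.
documents = {
    1: {"dream", "dreams", "dreamer"},
    2: {"dream", "dreamt", "dreaming"},
    3: {"door"},
}
word_lists = build_word_lists(documents)
print(prefix_query_naive(word_lists, "dream"))  # ([1, 2], 6)
```

Only two documents are reported, yet six list entries are read, one for every (word, document) pair that matches the prefix.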


SLIDE 5

Relation to range reporting

[GHJS ’03]

[Figure: words in lexicographic order along the x-axis; query range p*]

SLIDE 6

Relation to range reporting

[GHJS ’03]

[Figure: words in lexicographic order along the x-axis; query range p*]

SLIDE 7

Relation to range reporting

[GHJS ’03]

[Figure: words in lexicographic order along the x-axis; query range p*]

y-coordinate = x-coordinate of the previous point of the same color

SLIDE 8

Relation to range reporting

[GHJS ’03]

[Figure: words in lexicographic order along the x-axis; query range p*]

A 3-sided range query captures only the first occurrence of each color.
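The reduction described on these slides can be sketched as follows. Each (word, document) pair becomes a point whose x-coordinate is its position in lexicographic order and whose y-coordinate is the x-coordinate of the previous point of the same color (document). For a prefix range [a, b), the 3-sided query "x in [a, b) and y < a" then returns each color exactly once. The function names, the sentinel -1 for "no previous point", and the brute-force scan standing in for a real range-reporting structure are assumptions of this sketch.

```python
import bisect

def build_points(documents):
    """Return (words, points), where points[i] = (x, y, color), sorted by x."""
    pairs = sorted((w, doc_id) for doc_id, ws in documents.items() for w in ws)
    last_x = {}  # color -> x-coordinate of its most recent point
    words, points = [], []
    for x, (w, color) in enumerate(pairs):
        y = last_x.get(color, -1)  # -1 means: no previous point of this color
        words.append(w)
        points.append((x, y, color))
        last_x[color] = x
    return words, points

def colored_prefix_query(words, points, p):
    """Report each color with a word having prefix p exactly once."""
    a = bisect.bisect_left(words, p)
    b = a
    while b < len(words) and words[b].startswith(p):
        b += 1
    # 3-sided query [a, b) x (-inf, a). The linear scan below is only a
    # stand-in for an actual 3-sided range reporting data structure.
    return [color for x, y, color in points[a:b] if y < a]

# Hypothetical toy collection: document ID (color) -> set of words.
documents = {
    1: {"dream", "dreams", "dreamer"},
    2: {"dream", "dreamt", "dreaming"},
    3: {"door"},
}
words, points = build_points(documents)
print(colored_prefix_query(words, points, "dream"))  # [1, 2]
```

A point passes the filter only if its color has no earlier point inside the range, so later occurrences of the same color are excluded by the y < a condition.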


SLIDE 9

New result

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os.

Parameters:
k = number of points returned
B = number of points in a memory block (assume query string fits in one block)

SLIDE 10

New result

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os.

Optimal time and space

Parameters:
k = number of points returned
B = number of points in a memory block (assume query string fits in one block)

SLIDE 11

New result

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os.

Optimal time and space
Implies optimal colored prefix reporting

Parameters:
k = number of points returned
B = number of points in a memory block (assume query string fits in one block)

SLIDE 12

Model of computation

I/O model with Θ(B log n) bits per block.
Memory is a sequence of blocks.
Cost model: Number of block retrievals (I/Os).
O(1) blocks can be stored in “cache” and accessed at no cost.

SLIDE 13

Model of computation

I/O model with Θ(B log n) bits per block.
Memory is a sequence of blocks.
Cost model: Number of block retrievals (I/Os).
O(1) blocks can be stored in “cache” and accessed at no cost.

Many other papers: Block stores B “items”.
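The cost model can be made tangible with a tiny simulator that counts block retrievals. This sketch uses the simpler "block stores B items" view from the last line of the slide rather than Θ(B log n) bits, and it assumes an LRU policy for the O(1)-block cache; the class name `BlockMemory` and the default cache size of 2 are illustrative choices, not part of the model.

```python
from collections import OrderedDict

class BlockMemory:
    """Memory as a sequence of blocks of B items; counts block retrievals."""

    def __init__(self, items, B, cache_blocks=2):
        self.B = B
        self.blocks = [items[i:i + B] for i in range(0, len(items), B)]
        self.cache = OrderedDict()        # block ID -> block contents
        self.cache_blocks = cache_blocks  # O(1) blocks (assumed: 2, LRU)
        self.io_count = 0

    def read(self, index):
        """Read one item; costs one I/O only if its block is not cached."""
        block_id = index // self.B
        if block_id in self.cache:
            self.cache.move_to_end(block_id)    # cached access is free
        else:
            self.io_count += 1                  # one block retrieval
            self.cache[block_id] = self.blocks[block_id]
            if len(self.cache) > self.cache_blocks:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[block_id][index % self.B]

# Reading 25 consecutive items with B = 10 touches 3 blocks.
mem = BlockMemory(list(range(100)), B=10)
for i in range(25):
    mem.read(i)
sequential_ios = mem.io_count

# Reading 10 items spread over 10 different blocks costs 10 I/Os.
mem2 = BlockMemory(list(range(100)), B=10)
for i in range(0, 100, 10):
    mem2.read(i)
scattered_ios = mem2.io_count
print(sequential_ios, scattered_ios)  # 3 10
```

The gap between the two counts is the reason an O(1+k/B) query bound requires the k reported points to be laid out in few blocks.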


SLIDE 14

Selected previous results

Data structures for 3-sided range reporting:

Reference         Space overhead   Search time           Model
Arge et al. ’99   O(1)             O(log_B(n))           Comparison
Nekrich ’07       O(1)             log_B log_B ...(n)    Unrestricted
Nekrich ’07       log*_B(n)        O(log*_B(n))          Unrestricted
Nekrich ’07       (log_B(n))^2     O(1)                  Unrestricted
Here              O(1)             O(1)                  Unrestricted

SLIDE 15

High-level description

1. Place points in a binary priority search tree.

[Figure legend: points to report; points in range]

SLIDE 16

High-level description

2. Search for points near the “fringe” using O(1) searches in smaller “base” data structures.

[Figure legend: points to report; points in range]

SLIDE 17

High-level description

3. Read blocks containing remaining points (easy).

[Figure legend: points to report; points in range]
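For readers unfamiliar with step 1, the following is a minimal in-memory sketch of a binary priority search tree answering a 3-sided query [x1, x2] × (-inf, y0]. It only illustrates the textbook structure: it has none of the blocking, base data structures, or I/O guarantees of the talk's solution, and the names `Node`, `build`, and `query` are assumptions.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    point: tuple              # point with the smallest y in this subtree
    split: float              # left subtree: x <= split; right: x >= split
    left: Optional["Node"]
    right: Optional["Node"]

def _build_sorted(pts):
    """Build from points sorted by x (simple, not optimized)."""
    if not pts:
        return None
    i = min(range(len(pts)), key=lambda j: pts[j][1])
    top, rest = pts[i], pts[:i] + pts[i + 1:]
    if not rest:
        return Node(top, top[0], None, None)
    mid = len(rest) // 2
    return Node(top, rest[mid][0],
                _build_sorted(rest[:mid]), _build_sorted(rest[mid:]))

def build(points):
    return _build_sorted(sorted(points))

def query(node, x1, x2, y0, out=None):
    """Report all points with x1 <= x <= x2 and y <= y0."""
    if out is None:
        out = []
    # Heap order on y: if the top point is too high, so is the whole subtree.
    if node is None or node.point[1] > y0:
        return out
    if x1 <= node.point[0] <= x2:
        out.append(node.point)
    if x1 <= node.split:
        query(node.left, x1, x2, y0, out)
    if x2 >= node.split:
        query(node.right, x1, x2, y0, out)
    return out

# Hypothetical example points (x, y).
tree = build([(1, 5), (2, 1), (3, 4), (4, 2), (5, 3), (6, 0)])
print(sorted(query(tree, 2, 5, 3)))  # [(2, 1), (4, 2), (5, 3)]
```

The search descends only along the two boundary paths and into subtrees whose top point still satisfies the y-bound, which is the behavior that steps 2 and 3 of the slides make I/O-efficient.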


SLIDE 18

Base data structure

Core technical contribution of the paper.
Able to handle point sets of size poly(B log n) optimally.
Main technique: Tabulation and fusion trees allow us to make a dynamic data structure partially persistent free of charge when the number of updates to it is small.
Refer to paper for more details.

SLIDE 19

Indivisibility assumption

Assumption often used to show lower bounds: Each block contains at most B points/items; reading a block is required to report an item.
Our data structure is among the few to break this assumption (along with the dictionary of Iacono and Pătraşcu, previous presentation).

SLIDE 20

Indivisibility assumption

Assumption often used to show lower bounds: Each block contains at most B points/items; reading a block is required to report an item.
Our data structure is among the few to break this assumption (along with the dictionary of Iacono and Pătraşcu, previous presentation).
Open problem: Is there a nontrivial lower bound under the indivisibility assumption?

SLIDE 21

Memory models

There may be alternatives to the cache-oriented I/O model:

Unlike conventional processors that rely on the hardware to automatically bring data and instructions close to the processor with a hierarchy of hardware caches, the Cell Broadband Engine requires the programmer to create a “shopping list” of the data that the program requires.
(from ibm.com)

SLIDE 22

Scatter-I/O model

One I/O operation can read or write any set of B words in memory.
In this stronger model, we give a much simpler data structure for colored prefix search.

SLIDE 23

Scatter-I/O model

One I/O operation can read or write any set of B words in memory.
In this stronger model, we give a much simpler data structure for colored prefix search.
Open problem: Can notoriously I/O-difficult graph problems such as BFS and DFS be efficiently solved in this model?
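A rough way to see why the scatter-I/O model is stronger is to compare the cost of fetching the same set of word addresses under both models. The sketch below ignores caching and treats each address as one word; the function names `block_ios` and `scatter_ios` and the example addresses are assumptions made for illustration.

```python
import math

def block_ios(addresses, B):
    """Standard I/O model: one I/O per distinct block of B consecutive words."""
    return len({a // B for a in addresses})

def scatter_ios(addresses, B):
    """Scatter-I/O model: one I/O fetches any set of B words."""
    return math.ceil(len(set(addresses)) / B)

# Hypothetical access pattern: 10 words, each in a different block of size 10.
addresses = list(range(0, 1000, 100))
print(block_ios(addresses, 10), scatter_ios(addresses, 10))  # 10 1
```

When the k words to report are known but scattered, the scatter model pays about k/B I/Os regardless of layout, whereas the standard model can pay up to k.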


SLIDE 24

Conclusion

Theoretically optimal solutions in the I/O model for colored prefix/range reporting.
In fact, optimal solution to 3-sided range reporting in two dimensions.

SLIDE 25

Conclusion

Theoretically optimal solutions in the I/O model for colored prefix/range reporting.
In fact, optimal solution to 3-sided range reporting in two dimensions.
Main open problem: Efficient extension to top-k searches, where only the k highest ranked colors are reported.
