Space and Time-Efficient Data Structures for Massive Datasets
Giulio Ermanno Pibiri
giulio.pibiri@di.unipi.it
Supervisor: Rossano Venturini
Computer Science Department, University of Pisa
10/10/2017
1
2
High Level Thesis: Data Structures +
3
Journal paper
Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS), 2017.
Conference paper
Giulio Ermanno Pibiri and Rossano Venturini. Annual Symposium on Combinatorial Pattern Matching (CPM), 2017.
Conference paper
Giulio Ermanno Pibiri and Rossano Venturini. ACM Conference on Research and Development in Information Retrieval (SIGIR), 2017.
EVERYTHING that I do (papers, slides and code) is fully accessible at my page: http://pages.di.unipi.it/pibiri/
4
Inverted indexes owe their popularity to the efficient resolution of queries such as: "return all documents in which the terms {t1,…,tk} occur".
[Example: a collection of 5 documents over the vocabulary T = {always, boy, good, house, hungry, is, red, the}, with one posting list per term ti.]
Lt1=[1, 3]  Lt2=[4, 5]  Lt3=[1]  Lt4=[2, 3]  Lt5=[3, 5]  Lt6=[1, 2, 3, 4, 5]  Lt7=[1, 2, 4]  Lt8=[2, 3, 5]
Query q = {boy, is, the}: intersecting Lboy = Lt2 = [4, 5], Lis = Lt6 = [1, 2, 3, 4, 5] and Lthe = Lt8 = [2, 3, 5] returns document 5.
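To make the query-resolution claim concrete, here is a minimal C++ sketch of conjunctive query processing by intersecting sorted posting lists (my illustration, not the thesis code; the function name and in-memory layout are mine):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

// Answer a conjunctive query: return the documents present in every list.
std::vector<uint32_t> intersect(std::vector<std::vector<uint32_t>> lists) {
    if (lists.empty()) return {};
    // Start from the shortest list: the result can never be larger than it.
    std::sort(lists.begin(), lists.end(),
              [](auto const& a, auto const& b) { return a.size() < b.size(); });
    std::vector<uint32_t> result = lists.front();
    for (size_t i = 1; i != lists.size() && !result.empty(); ++i) {
        std::vector<uint32_t> out;
        std::set_intersection(result.begin(), result.end(),
                              lists[i].begin(), lists[i].end(),
                              std::back_inserter(out));
        result.swap(out);
    }
    return result;
}

int main() {
    // q = {boy, is, the} on the example index above.
    auto docs = intersect({{4, 5},            // L_boy
                           {1, 2, 3, 4, 5},   // L_is
                           {2, 3, 5}});       // L_the
    for (uint32_t d : docs) std::cout << d << '\n';  // prints 5
}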
5
Every encoder represents each sequence individually: no exploitation of the redundancy shared across lists.
Idea: encode clusters of posting lists.
6
A cluster of posting lists is encoded with respect to a reference list R.
Encoding a docID directly takes log u bits; since |R| << u, encoding its position inside R takes only log |R| bits.
Problems
1. Build the clusters.
2. Synthesise the reference list.
This is an NP-hard problem already for a simplified formulation.
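A minimal sketch of the saving this enables, under the simplifying assumption that every docID of a clustered list also appears in R (the function and names are mine, not the paper's algorithm): each docID is replaced by its rank in R, which then fits in ceil(log2 |R|) bits instead of ceil(log2 u).

#include <algorithm>
#include <cstdint>
#include <vector>

// Remap a posting list onto a reference list R: each docID is replaced
// by its position in R, shrinking the universe from u to |R|.
// Assumes R is sorted and contains every docID of the list.
std::vector<uint32_t> remap(std::vector<uint32_t> const& list,
                            std::vector<uint32_t> const& R) {
    std::vector<uint32_t> positions;
    positions.reserve(list.size());
    for (uint32_t doc : list) {
        auto it = std::lower_bound(R.begin(), R.end(), doc);
        positions.push_back(uint32_t(it - R.begin()));
    }
    return positions;  // encode these with ceil(log2(|R|)) bits each
}

Decoding simply inverts the mapping by reading each stored position back through R.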
7
Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%).
Speed: much faster than BIC (103% on average), slightly slower than PEF (20% on average).
8
A dynamic ordered set S is a data structure representing n keys and supporting insert, delete, search, minimum/maximum and predecessor/successor operations.
In the comparison model this is solved optimally by any self-balancing tree data structure in O(log n) time and O(n) space. More efficient solutions exist if the considered keys are integers drawn from a bounded universe of size u.
How can we optimally solve the integer dynamic ordered set problem?
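For reference, a balanced search tree already gives all of these operations in O(log n); e.g., with std::set in C++ (typically a red-black tree in common standard-library implementations):

#include <cstdint>
#include <iostream>
#include <set>

int main() {
    std::set<uint64_t> S;                 // self-balancing tree: O(log n) per op
    S.insert(3); S.insert(9); S.insert(7);
    S.erase(9);                           // delete
    bool found = S.count(7) > 0;          // search
    auto it = S.lower_bound(5);           // successor(5): smallest key >= 5
    std::cout << found << ' ' << *it << ' '
              << *S.begin() << ' ' << *S.rbegin() << '\n';  // 1 7 3 7 (min = 3, max = 7)
}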
9
Integer data structures: fast operations, but large space.
Elias-Fano encoding: encodes an ordered integer sequence S in compressed space, but is static.
Can we grab the best from both?
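As background for what follows, a compact sketch of plain Elias-Fano encoding (my illustration, simplified): each of the n sorted integers in [0, u) is split into l = ceil(log2(u/n)) low bits, stored verbatim, and the remaining high bits, stored as gaps in a unary-coded bitmap of at most 2n bits; random access additionally needs a select structure on the bitmap, omitted here.

#include <cmath>
#include <cstdint>
#include <vector>

// Plain Elias-Fano: split each value into a high and a low part.
struct EliasFano {
    uint64_t l;                  // number of low bits per element
    std::vector<uint64_t> low;   // low parts (l bits each in a real implementation)
    std::vector<bool> high;      // high parts: n ones among at most u/2^l zeros

    EliasFano(std::vector<uint64_t> const& S, uint64_t u) {
        uint64_t n = S.size();
        l = (u > n) ? uint64_t(std::ceil(std::log2(double(u) / double(n)))) : 0;
        high.assign((u >> l) + n + 1, false);
        for (uint64_t i = 0; i != n; ++i) {
            low.push_back(S[i] & ((uint64_t(1) << l) - 1));  // keep l low bits
            high[(S[i] >> l) + i] = true;                    // unary bucket code
        }
    }
};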
10
For u = n^γ, γ = Θ(1):
Result 1
Result 2
Result 3
11
EF(S(n,u)) = n log(u/n) + 2n bits
Lower level: S is cut into mini blocks of size b = log n / log log n, grouped into blocks of log² n mini blocks each.
T is a k-ary tree of constant height, built on each block.
Upper level: Y is a y-fast trie and P is a dynamic prefix-sums data structure, each storing one entry per block, i.e., O(n / (b · log² n)) entries of O(log u) bits = o(n) bits each.
+ some technicalities
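To give the space bound a feel, a quick worked instance (numbers mine):

\[
\mathrm{EF}(S(n,u)) = n \left\lceil \log_2 \tfrac{u}{n} \right\rceil + 2n \text{ bits.}
\]
For $n = 2^{20}$ integers drawn from a universe of size $u = 2^{32}$:
\[
2^{20} \cdot \log_2\!\left(2^{32}/2^{20}\right) + 2 \cdot 2^{20}
  = 2^{20} (12 + 2) = 14 \cdot 2^{20} \text{ bits} \approx 1.75\,\text{MB},
\]
versus $32 \cdot 2^{20}$ bits $= 4\,\text{MB}$ for plain 32-bit storage.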
12
N-grams: strings of N words, where N typically ranges from 1 to 5. Extracted from text using a sliding-window approach.

N    number of grams
1         24,359,473
2        667,284,771
3      7,397,041,901
4      1,644,807,896
5      1,415,355,596
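A minimal sketch of sliding-window extraction (illustrative, not the thesis tooling): slide a window of N consecutive words over the tokenized text and emit each window as one gram.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Emit every N-gram of a text via a sliding window over its tokens.
std::vector<std::string> extract_ngrams(std::string const& text, size_t N) {
    std::vector<std::string> words, grams;
    std::istringstream in(text);
    for (std::string w; in >> w;) words.push_back(w);
    if (N == 0 || words.size() < N) return grams;
    for (size_t i = 0; i + N <= words.size(); ++i) {   // window start
        std::string gram = words[i];
        for (size_t j = 1; j != N; ++j) gram += ' ' + words[i + j];
        grams.push_back(gram);
    }
    return grams;
}

int main() {
    for (auto const& g : extract_ngrams("the house is red", 2))
        std::cout << g << '\n';   // "the house", "house is", "is red"
}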
13
Compressed Tries with Context-based ID Remapping - SIGIR'17
High-level idea: map a word ID to the position it takes within its sibling IDs (the IDs following a context of fixed length k).
Observation: the number of words following a given context is small: a funnel-shaped distribution.
[Figure: the ratio u/n obtained by varying the context length k = 1, 2, 3, 4.]
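The remapping can be sketched as follows for context length k = 1 (the structure and names are mine, simplified to a dictionary from each context to its sorted successor IDs):

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Context-based ID remapping, sketched for context length k = 1:
// a word is stored as its position among the (few) words that follow
// its context, so it fits in log2(#successors) bits, not log2(vocab).
using WordId = uint32_t;

struct Remapper {
    // successors[c] = sorted IDs of the words observed after context word c.
    std::map<WordId, std::vector<WordId>> successors;

    uint32_t remap(WordId context, WordId word) const {
        auto const& sib = successors.at(context);
        auto it = std::lower_bound(sib.begin(), sib.end(), word);
        return uint32_t(it - sib.begin());   // rank among sibling IDs
    }
    WordId unmap(WordId context, uint32_t pos) const {
        return successors.at(context)[pos];  // inverse mapping
    }
};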
14
Compressed Tries with Context-based ID Remapping - SIGIR'17
Test machine: Intel Xeon E5-2630 v3, 2.4 GHz; 193 GB of RAM; Linux, 64 bits.
C++ implementation, compiled with gcc 5.4.1 with the highest optimization setting.
[Chart annotations: Context-based ID Remapping; "you will notice this!"; "will you notice this?"]
15
Compressed Tries with Context-based ID Remapping - SIGIR'17
[Figure: improvement factors over competitors, ranging from 2X up to 5.8X: 2X, 2.3X, 2.5X, 2.7X, 2.8X, 3X, 3.5X, 5.5X, plus the ranges 2.5X to 5.2X and 3.1X to 5.8X.]
16
Scalable Modified Kneser-Ney Language Model Estimation

Dataset 1: 455,265,524 grams (1.3 GB), 233,035,325 total words.
1-grams: 1,255,027; 2-grams: 20,431,391; 3-grams: 82,815,629; 4-grams: 153,984,231; 5-grams: 196,779,246.

seconds          counting   normalization   interpolation
Tongrams - 1        11           17              36
Tongrams - 2        20           30              60
Tongrams - 4        35           52             104

seconds          counting   normalization   interpolation
Tongrams            11           17              36
KenLM               51           60             138

Dataset 2: 863,966,768 grams (3.2 GB), 495,527,349 total words.
1-grams: 15,039,323; 2-grams: 44,033,774; 3-grams: 142,894,817; 4-grams: 280,714,113; 5-grams: 381,284,741.

seconds          counting   normalization   interpolation
Tongrams - 1        22           26              46
Tongrams - 2        32           42              77
Tongrams - 4        53           62             112
Tongrams - 8        84          104             179
Tongrams - 16      148          155             297

seconds          counting   normalization   interpolation
Tongrams            22           26              46
KenLM              153          171             261
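For context, "modified Kneser-Ney" refers to Chen and Goodman's interpolated formulation, recalled here as textbook background; the phases timed above mirror its ingredients (collecting counts, computing normalization constants, interpolating with lower orders):

\[
P(w \mid c) \;=\; \frac{\max\{\,\mathrm{count}(cw) - D(\mathrm{count}(cw)),\, 0\,\}}{\mathrm{count}(c)}
\;+\; \lambda(c)\, P(w \mid c'),
\]
where $c'$ is the context $c$ with its first word dropped, the discount $D(\cdot)$ takes one of three values $D_1$, $D_2$, $D_{3+}$ depending on whether the count is $1$, $2$, or $\ge 3$, and $\lambda(c)$ is the normalization constant that makes the distribution sum to one.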
17
TCS   VLDBJ
Inverted indexes with false positives allowed.
Compressed tries based on double-arrays.
Data structures for a features repository.
18