Efficient Processing of Set-Similarity Joins on Large Computer Clusters

SLIDE 1

Efficient Processing of Set-Similarity Joins on Large Computer Clusters

Rares Vernica rares@ics.uci.edu

Department of Computer Science University of California, Irvine

Rares Vernica (UC Irvine) Parallel Fuzzy-Joins 1 / 39

SLIDE 4

Research Overview

In 2005, I joined UC Irvine...

SQL Standard, 5th edition
Query processing and optimization
Indexes: B-tree, R-tree, Hash
Transactions and recovery

Fuzzy or “similar to” query processing

SLIDE 5

Application: Master-Data Management

SLIDE 6

Application: Biometrics

SLIDE 7

Application: Spell Checking

Requires similarity search on large amounts of data.

SLIDE 8

Scalable Solutions

N-gram: 1T tokens in 95B sequences
GenBank: 100B bases in 100M sequences
DHS: 100M identities and 140K transactions/day
FBI: 66M identities and 8K transactions/day

Challenges

Data or processing does not fit in one machine
Use a cluster of machines and a parallel algorithm

SLIDE 10

Research Overview - Fuzzy Query Processing

Query relaxation: VLDB 2006
Selectivity estimation: VLDBJ 2008
Top-k queries: KEYS 2009
Keyword query cleaning: TR 2010
Parallel fuzzy-joins: SIGMOD 2010

SLIDE 12

Outline

1. Motivation
2. Problem Statement
3. Preliminaries
4. Parallel Algorithms
   Overview
   Processing Stages
   Set-Similarity Joins in MapReduce
   Set-Similarity Joins in ASTERIX
5. Summary & Impact

SLIDE 13

Example: Bibliography Cleaning

Title → Set

10: {mapreduce, simplified, data, processing, ...}
20: {map, reduce, simplified, data, processing, ...}

Set-Similarity Metric

Jaccard similarity/Tanimoto coefficient: jaccard(x, y) = |x∩y| / |x∪y|

jaccard(10, 20) = 6/9
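The metric above can be sketched in a few lines of Python. This is only an illustration: the slide elides the tail of both sets, so the full titles and the `tokenize` helper below are assumptions chosen so that the result agrees with the 6/9 shown on the slide.

```python
from fractions import Fraction


def jaccard(x: set, y: set) -> Fraction:
    """Jaccard similarity |x ∩ y| / |x ∪ y| as an exact fraction."""
    union = x | y
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return Fraction(1)
    return Fraction(len(x & y), len(union))


def tokenize(title: str) -> set:
    """Hypothetical tokenizer: lowercase, drop punctuation, split on spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return set(cleaned.split())


# Assumed full titles; the slide shows only the first tokens of each set.
title_10 = tokenize("MapReduce: Simplified Data Processing on Large Clusters")
title_20 = tokenize("Map Reduce: Simplified Data Processing on Large Clusters")

print(len(title_10 & title_20), len(title_10 | title_20))  # 6 9
print(jaccard(title_10, title_20))                         # 2/3 (6/9 reduced)
```

Under these assumptions the two titles share 6 tokens out of 9 distinct ones, so a threshold of, say, 0.6 would report them as a likely duplicate.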

SLIDE 22

Problem Statement: Set-Similarity Join

Input

Two files of records, e.g., R(RID, a, b) and S(RID, c, d)
A join column on each file, e.g., R.a and S.c
A similarity function, sim, e.g., Jaccard
A similarity threshold, τ

Output

All pairs of records from R and S where sim(R.a, S.c) ≥ τ
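The input/output contract above can be written down directly as a brute-force reference. This is a minimal sketch, not the algorithm from the talk; the function name `similarity_join` and the toy records are hypothetical.

```python
from itertools import product


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def similarity_join(R, S, sim, tau):
    """Return (rid_r, rid_s, similarity) for all pairs with sim >= tau.

    R and S map a record id (RID) to the token set of its join column.
    """
    result = []
    for (rid_r, a), (rid_s, c) in product(R.items(), S.items()):
        score = sim(a, c)
        if score >= tau:
            result.append((rid_r, rid_s, score))
    return result


# Hypothetical toy inputs, not taken from the talk.
R = {1: {"a", "b", "c"}, 2: {"d", "e", "f"}}
S = {7: {"a", "b", "f"}, 8: {"x", "y"}}
print(similarity_join(R, S, jaccard, 0.5))  # [(1, 7, 0.5)]
```

Comparing every pair costs |R| × |S| similarity computations, which is the cost the filtering techniques later in the talk try to avoid.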

SLIDE 24

Single Machine Set-Similarity Join

1. Nested loops
2. Inverted list index [Sarawagi and Kirpal, 2004]
   1. Indexing phase
   2. Candidate generation phase
   3. Verification phase
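The three phases of the inverted-list approach can be sketched as follows. This is a simplified illustration under stated assumptions (a self-join, Jaccard similarity, τ > 0, everything in memory), not the cited algorithm itself; `index_self_join` and the toy data are hypothetical.

```python
from collections import defaultdict


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def index_self_join(records, tau):
    """Self-join via an inverted list index; records maps RID -> token set."""
    # 1. Indexing phase: token -> list of RIDs containing that token.
    index = defaultdict(list)
    for rid, tokens in records.items():
        for token in tokens:
            index[token].append(rid)

    # 2. Candidate generation phase: pairs that share at least one token.
    candidates = set()
    for rids in index.values():
        for i, r1 in enumerate(rids):
            for r2 in rids[i + 1:]:
                candidates.add((min(r1, r2), max(r1, r2)))

    # 3. Verification phase: compute the real similarity of each candidate.
    result = []
    for r1, r2 in sorted(candidates):
        score = jaccard(records[r1], records[r2])
        if score >= tau:
            result.append((r1, r2, score))
    return result


records = {1: {"a", "b", "c"}, 2: {"a", "b", "d"}, 3: {"x", "y"}}  # toy data
print(index_self_join(records, 0.5))  # [(1, 2, 0.5)]
```

Unlike nested loops, record 3 is never compared with records 1 or 2 because it shares no inverted list with them.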

SLIDE 30

Set-Similarity Filtering

Length Filtering [Arasu et al., 2006]

Similar records have similar lengths

E.g.,
sim is Jaccard
τ = 0.8
Record length is 5
Similar records have length ∈ [4, 6]

[Figure: example records (RIDs 10 to 90) with their lengths (3 to 7).]
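The interval in the example follows from two bounds on Jaccard: the intersection is at most the smaller set and the union is at least the larger set. A short sketch of that calculation (the function name is hypothetical, and exact fractions are used to avoid floating-point rounding at the boundaries):

```python
import math
from fractions import Fraction


def jaccard_length_bounds(length: int, tau: Fraction) -> tuple:
    """Inclusive range of lengths a record similar to one of `length` can have.

    For Jaccard, |x ∩ y| <= min(|x|, |y|) and |x ∪ y| >= max(|x|, |y|),
    so jaccard(x, y) >= tau implies tau * |x| <= |y| <= |x| / tau.
    """
    lower = math.ceil(tau * length)
    upper = math.floor(length / tau)
    return lower, upper


print(jaccard_length_bounds(5, Fraction("0.8")))  # (4, 6)
```

For a record of length 5 and τ = 0.8 this gives ceil(4) = 4 and floor(6.25) = 6, matching the slide.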

SLIDE 35

Set-Similarity Filtering

Prefix Filtering [Chaudhuri et al., 2006]

Pigeonhole principle
Global order for set elements: Sort each record’s tokens
E.g., sim is intersection size, τ = 4
Prefix length is 2
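A small sketch of the prefix filter for the intersection-size case. The figure with the example records is not part of this transcript, so the 5-token records and the alphabetical global order below are assumptions; with 5 tokens and τ = 4 the prefix length is 5 - 4 + 1 = 2, which is consistent with the slide.

```python
def prefix(tokens, order, overlap_threshold):
    """Return the prefix of a record under a global token order.

    `order` maps token -> rank. If two records share at least
    `overlap_threshold` tokens, their prefixes of length
    len(record) - overlap_threshold + 1 must share at least one token
    (pigeonhole principle).
    """
    ranked = sorted(tokens, key=lambda t: order[t])
    prefix_len = len(ranked) - overlap_threshold + 1
    return ranked[:max(prefix_len, 0)]


# Hypothetical 5-token records and an alphabetical global order.
order = {t: i for i, t in enumerate("ABCDEFGH")}
x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
z = {"D", "E", "F", "G", "H"}

tau = 4  # sim is intersection size
px, py, pz = (prefix(r, order, tau) for r in (x, y, z))
print(px, py, pz)         # ['A', 'B'] ['B', 'C'] ['D', 'E']
print(set(px) & set(py))  # {'B'}: x and y remain candidates
print(set(px) & set(pz))  # set(): x and z are pruned without verification
```

Here x and y share 4 tokens and their prefixes overlap, while x and z share only 2 tokens and are discarded by comparing prefixes alone.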

SLIDE 39

Parallel Joins

SLIDE 52

Parallel Set-Similarity Joins

1. Use Prefix Filter
2. Use infrequent tokens in the prefix
3. Project records
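Points 2 and 3 can be illustrated with a short sketch: order tokens by ascending frequency so that prefixes contain the rarest tokens, and carry only the RID and the join column through the join. The helper names, the tie-breaking rule, and the sample table are assumptions made for this example.

```python
from collections import Counter


def token_order(records):
    """Global order: rarest tokens first, ties broken alphabetically."""
    counts = Counter(t for tokens in records.values() for t in tokens)
    ranked = sorted(counts, key=lambda t: (counts[t], t))
    return {token: rank for rank, token in enumerate(ranked)}


def project(rid, record, join_column):
    """Keep only the RID and the join column; other columns are dropped."""
    return rid, record[join_column]


# Hypothetical records with a join column "a" and a payload column "b".
table = {
    1: {"a": {"A", "B", "C"}, "b": "payload-1"},
    21: {"a": {"B", "A", "F"}, "b": "payload-21"},
    20: {"a": {"F", "G"}, "b": "payload-20"},
}
projected = dict(project(rid, rec, "a") for rid, rec in table.items())
order = token_order(projected)
print(sorted(order, key=order.get))  # ['C', 'G', 'A', 'B', 'F']
```

Rare tokens appear in few records, so grouping records by rare prefix tokens tends to produce fewer candidate pairs per group than grouping by frequent ones.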

SLIDE 53

Processing Stages

SLIDE 57

MapReduce Review

map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k3,v3)
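The two signatures can be imitated in memory to make the data flow concrete. This is a sequential toy, not Hadoop: `run_mapreduce` and the token-count job are hypothetical names used only for illustration.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential, in-memory imitation of one MapReduce job."""
    # Map: each (k1, v1) yields a list of (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # shuffle: group values by key k2
    # Reduce: each (k2, list(v2)) yields a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def count_map(rid, text):
    return [(token, 1) for token in text.split()]


def count_reduce(token, ones):
    return [(token, sum(ones))]


docs = [(1, "a b a"), (2, "b c")]  # toy input
print(run_mapreduce(docs, count_map, count_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```

In a real cluster the map calls, the grouping, and the reduce calls run on different machines; the sketch only preserves the key/value contract.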

SLIDE 64

Stage 2: RID-Pair Generation

[Figure: Stage 2 data flow in MapReduce.
Input: records (RID, a, b), e.g., 1: A B C; 2: D E F; 10: C F; 11: E C D; 20: F G; 21: B A F, plus the global token order (G, ...).
Map: emits (Key, Value) pairs keyed by prefix token, e.g., (A; 1,A B C), (B; 1,A B C), (G; 20,F G).
Group by key: e.g., keys A and B each receive 1,A B C and 21,B A F.
Reduce: outputs (RID1, RID2, Sim.), e.g., (1, 21, 0.5) and (2, 11, 0.5).]

Reduce Alternatives

Basic Kernel (BK): nested loops
PPJoin+ Kernel (PK): inverted list index
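A sequential sketch of Stage 2 with a Basic Kernel style reducer (nested loops), using the six records visible in the slide's figure. Several things are assumptions here: Jaccard with τ = 0.5, the token order after "G" (ascending frequency in these six records), and the function names. The real job runs map and reduce tasks in parallel on Hadoop.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(x, y):
    return len(x & y) / len(x | y)


def stage2_map(rid, tokens, order, tau):
    """Emit (prefix token, (RID, tokens)) for each token in the prefix."""
    ranked = sorted(tokens, key=lambda t: order[t])
    # Jaccard prefix length: |x| - ceil(tau * |x|) + 1.
    prefix_len = len(ranked) - math.ceil(tau * len(ranked)) + 1
    return [(token, (rid, tokens)) for token in ranked[:prefix_len]]


def stage2_reduce_bk(token, values, tau):
    """Basic Kernel: nested loops over all records that share `token`."""
    out = []
    by_rid = sorted(values, key=lambda v: v[0])
    for (r1, t1), (r2, t2) in combinations(by_rid, 2):
        sim = jaccard(t1, t2)
        if sim >= tau:
            out.append((r1, r2, sim))
    return out


def stage2(records, order, tau):
    groups = defaultdict(list)  # shuffle: group map output by key
    for rid, tokens in records.items():
        for key, value in stage2_map(rid, tokens, order, tau):
            groups[key].append(value)
    pairs = set()  # the same pair can be produced under several keys
    for key, values in groups.items():
        pairs.update(stage2_reduce_bk(key, values, tau))
    return sorted(pairs)


# Records from the slide's figure; the order beyond "G" is an assumption.
records = {1: set("ABC"), 2: set("DEF"), 10: set("CF"),
           11: set("ECD"), 20: set("FG"), 21: set("BAF")}
order = {t: i for i, t in enumerate("GABDECF")}
print(stage2(records, order, tau=0.5))  # [(1, 21, 0.5), (2, 11, 0.5)]
```

Under these assumptions the output matches the RID pairs shown in the figure. Records 1 and 21 meet under both keys A and B, which is why the sketch deduplicates pairs before returning them.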

SLIDE 72

Experimental Setting

Hardware

Software

Ubuntu 9.06, 64-bit, server edition OS
Java 1.6, 64-bit, server
Hadoop 0.20.1

SLIDE 73

Experimental Setting

Datasets

DBLP
  Average length: 259 bytes
  Number of records: 1.2M
  Total size: 300MB

CITESEERX
  Average length: 1374 bytes
  Number of records: 1.3M
  Total size: 1.8GB

Increased each up to ×25, preserving join properties
  DBLP: 31M records, 8.2GB
  CITESEERX: 32M records, 45GB

SLIDE 74

Running Time

Self-join DBLP×n, n ∈ [5, 25]
10-node cluster
Best time
Bulk of the time

Legend

Stage 1

BTO: Basic Token Ordering
OPTO: One Phase Token Ordering

Stage 2

BK: Basic Kernel
PK: PPJoin+ Kernel

Stage 3

BRJ: Basic Record Join
OPRJ: One Phase Record Join

slide-78
SLIDE 78

Speedup

[Figure: speedup (y-axis, 1 to 5) versus number of nodes (x-axis, 2 to 10) for BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, and Ideal.]

Relative running time
Self-join DBLP×10
Different cluster sizes

slide-79
SLIDE 79

Outline

1. Motivation
2. Problem Statement
3. Preliminaries
4. Parallel Algorithms
   Overview
   Processing Stages
   Set-Similarity Joins in MapReduce
   Set-Similarity Joins in ASTERIX
5. Summary & Impact

slide-80
SLIDE 80

Beyond MapReduce

MapReduce Limitations

Simplicity over performance
Rigid framework
Base “query” language: Java
Declarative query languages as add-ons

MapReduce-inspired Alternatives

Include elements of MapReduce
More runtime choices
Built-in declarative query language
Examples:

Scope/Dryad (Microsoft)
Nephele/PACTs (TU Berlin)
ASTERIX/Hyracks (UC Irvine)

slide-83
SLIDE 83

ASTERIX/Hyracks

ASTERIX Overview

Scalable data platform
Semi-structured data model
Declarative query language
Rule-based optimizer
Runs on Hyracks
Built-in Set-Similarity Joins

Hyracks Overview

Partition-parallel framework
More flexible than MapReduce
Library of operators and connectors

Operators: Map, Join, Aggregate, etc.
Connectors: 1:1, M:N Hash, M:N Replicate, etc.
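The three connector kinds named above differ only in how tuples from producer partitions are routed to consumer partitions. The sketch below illustrates that routing in plain Python; it is not the Hyracks API, and the function names and the list-of-lists representation of partitions are assumptions made for the example.

```python
def one_to_one_connector(partitions):
    """1:1: producer i feeds consumer i unchanged."""
    return [list(part) for part in partitions]


def hash_connector(partitions, key, n_out):
    """M:N Hash: each tuple goes to exactly one consumer, chosen by its key."""
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for tup in part:
            out[hash(key(tup)) % n_out].append(tup)
    return out


def replicate_connector(partitions, n_out):
    """M:N Replicate: every consumer receives every tuple."""
    everything = [tup for part in partitions for tup in part]
    return [list(everything) for _ in range(n_out)]


# Two producer partitions of (RID, payload) tuples.
producers = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
by_rid = hash_connector(producers, key=lambda t: t[0], n_out=2)
copies = replicate_connector(producers, n_out=3)
```

Hash routing brings tuples with equal keys together, which is what grouping and hash joins need, while replication suits broadcasting a small input to every partition.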

slide-86
SLIDE 86

Hyracks Native Plan

[Figure: Hyracks job graph for the set-similarity join. Operators: FileScan(R), FileScan(S), Tokenize, TokenizeRIDPrefixToken, HashGroup, Sort, Split, HashJoin, HashJoinWithEvaluator, FileWrite. Connectors: 1:1, M:N Hash, M:N Replicate.]

slide-87
SLIDE 87

Hadoop vs. Hyracks

[Figure: running time in seconds (y-axis, 100 to 900) versus dataset size as a multiple of the original (x-axis: 5, 10, 25) for Hadoop, Hyracks Compat, and Hyracks Native.]

Running time
Self-join DBLP×n, n ∈ [5, 25]
10-node cluster

slide-88
SLIDE 88

Set-Similarity Join in ASTERIX

Fuzzy-Join Query

for $dblp in dataset('DBLP')
for $citeseer in dataset('CITESEER')
where $dblp.title ~= $citeseer.title
return { 'dblp' : $dblp, 'citeseer' : $citeseer }

slide-89
SLIDE 89

ASTERIX/Hyracks Stack

[Figure: ASTERIX query compilation pipeline: Query → Parser → Expr. Tree → Translator → Logical Plan → Optimizer → Logical Plan → Hyracks Job Generator → Hyracks Job → Hyracks. Callouts: "where ... ~= ..." and "Logical plan with the three stages".]

slide-91
SLIDE 91

Hyracks Plan for Set-Similarity Joins

[Figure: Hyracks job graph. Operators: Scan (R), Scan (S), AsterixMeta (assign, unnest, running-agg, stream-select, stream-project, sink-write), HashGroup, PreclusteredGroup, Sort, Split, HashJoin, HashLeftOuterJoin. Connectors: 1:1, M:N Hash, M:N Replicate, 1:N Replicate, M:1 Hash Merge.]

slide-92
SLIDE 92

ASTERIX Console

slide-94
SLIDE 94

Impact

Source-code release for Hadoop: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
Interest from industry:

Fox Audience Network
Yahoo! (Pig)
Ask.com

Source-code release for ASTERIX (coming soon)

slide-95
SLIDE 95

Summary

Set-similarity joins in MapReduce
Three-stage approach
Balance workload and minimize replication
End-to-end algorithms

Self-join
R-S join

Experiments

Speedup and scaleup
40-core, 40-disk cluster

Similarity joins in ASTERIX/Hyracks – in progress
High-level query language integration – in progress

slide-96
SLIDE 96

Future Work

Large domain for set elements
Records with large sets
Skew in join-result
Weights for set elements

slide-97
SLIDE 97

Publications

ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving World Models. Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras. Journal of Distributed and Parallel Databases, 2011. (to appear)
Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing. Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, Rares Vernica. ICDE 2011.
CIRCUMFLEX: A Scheduling Optimizer for MapReduce Workloads Involving Shared Scans. Joel Wolf, Deepak Rajan, Kirsten Hildrum, Rohit Khandekar, Sujay Parekh, Kun-Lung Wu, Andrey Balmin, Rares Vernica. Submitted for publication, VLDB 2011.
Adaptive MapReduce using Situation-Aware Mappers. Rares Vernica, Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac. Submitted for publication, VLDB 2011.
AKYRA: Efficient Keyword-Query Cleaning in Relational Databases. Rares Vernica, Chen Li. Technical Report, University of California, Irvine 2010.
Efficient Parallel Set-Similarity Joins Using MapReduce. Rares Vernica, Michael J. Carey, Chen Li. SIGMOD 2010.
Efficient Top-k Algorithms for Fuzzy Search in String Collections. Rares Vernica, Chen Li. KEYS 2009, Workshop on Keyword Search on Structured Data, SIGMOD 2009.
Entity Categorization Over Large Document Collections. Venkatesh Ganti, Arnd Christian König, Rares Vernica. KDD 2008.
SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDB J. 2008.
Relaxing Join and Selection Queries. Nick Koudas, Chen Li, Anthony K. H. Tung, Rares Vernica. VLDB 2006.

slide-98
SLIDE 98

Set-Similarity Join in ASTERIX

Fuzzy-Join Query

for $dblp in dataset('DBLP')
for $citeseer in dataset('CITESEER')
let [$match, $sim] := $dblp.title ~= $citeseer.title with similarity, simfunction 'Jaccard', simthreshold .8
where $match
order by $sim
return { 'dblp' : $dblp, 'citeseer' : $citeseer, 'sim' : $sim }
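What this query asks for can be illustrated with a naive nested-loop evaluation in Python. This is only a sketch of the intended result, not of how ASTERIX evaluates it: the whitespace word tokenizer, the `>=` comparison against the threshold, and the helper names are assumptions, since the slide specifies only the Jaccard function and the 0.8 threshold.

```python
def word_tokens(text):
    """Assumed tokenizer: lowercase words split on whitespace."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuzzy_join(dblp, citeseer, threshold=0.8):
    """Return matching record pairs with their similarity, ordered by sim."""
    results = []
    for d in dblp:
        for c in citeseer:
            sim = jaccard(word_tokens(d["title"]), word_tokens(c["title"]))
            if sim >= threshold:
                results.append({"dblp": d, "citeseer": c, "sim": sim})
    return sorted(results, key=lambda r: r["sim"])


dblp = [{"title": "Efficient Parallel Set-Similarity Joins Using MapReduce"}]
citeseer = [
    {"title": "efficient parallel set-similarity joins using mapreduce"},
    {"title": "Relaxing Join and Selection Queries"},
]
matches = fuzzy_join(dblp, citeseer)
```

The nested loop compares every DBLP title with every CITESEER title; the three-stage approach exists to avoid exactly that quadratic comparison on large inputs.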

slide-99
SLIDE 99

Speedup Breakdown

[Figure: three speedup plots, each showing speedup (y-axis, 1 to 5) versus number of nodes (x-axis, 2 to 10).]

Stage 1: BTO, OPTO, Ideal
Stage 2: BK, PK, Ideal
Stage 3: BRJ, OPRJ, Ideal

Relative running time
Self-join DBLP×10
Different cluster sizes

slide-100
SLIDE 100

Scaleup

[Figure: running time in seconds (y-axis, 200 to 1000) versus "# Nodes and Dataset Size (times 2.5 × original)" (x-axis, 2 to 10) for BTO-BK-BRJ, BTO-PK-BRJ, and BTO-PK-OPRJ.]

Running time
Self-joining DBLP×n, n ∈ [5, 25]
Proportional cluster

slide-101
SLIDE 101

Scaleup Breakdown

[Figure: three scaleup plots, each showing running time in seconds versus number of nodes and dataset size (x-axis, 2 to 10).]

Stage 1: BTO, OPTO, Ideal (y-axis, 40 to 160 seconds)
Stage 2: BK, PK, Ideal (y-axis, 100 to 600 seconds)
Stage 3: BRJ, OPRJ, Ideal (y-axis, 50 to 200 seconds)

Running time
Self-joining DBLP×n, n ∈ [5, 25]
Proportional cluster

slide-102
SLIDE 102

Stage 1: Basic Token Ordering (BTO)

[Figure: two-phase MapReduce data flow over input records (RID, a, b), using the example tokens A to G.]

Phase 1: Compute token frequencies. Map emits (token, 1) for each token in a record; after Group by key, Reduce sums the counts, e.g. (A, 2), (C, 3), (F, 4).

Phase 2: Sort tokens by frequency. Map swaps each pair to (frequency, token); after Group by key, a single Reduce outputs the tokens in increasing frequency order: G, A B D E, C, F.

Alternative

One Phase Token Ordering (OPTO)
One MapReduce phase: sort in memory
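The two BTO phases can be simulated in memory with a few lines of Python. This is a sketch of the data flow only, with hypothetical function names; breaking ties between equally frequent tokens alphabetically is an assumption, as the slide does not say how ties are ordered.

```python
from collections import defaultdict


def phase1_map(tokens):
    """Emit (token, 1) for every token of one record."""
    for token in tokens:
        yield token, 1


def phase1_reduce(token, counts):
    """Sum the counts of one token to get its frequency."""
    return token, sum(counts)


def phase2_map(token, frequency):
    """Swap the pair so that the frequency becomes the key."""
    return frequency, token


def basic_token_ordering(records):
    # Phase 1: group the (token, 1) pairs by token and sum them.
    groups = defaultdict(list)
    for _rid, tokens in records:
        for token, one in phase1_map(tokens):
            groups[token].append(one)
    frequencies = [phase1_reduce(t, c) for t, c in groups.items()]
    # Phase 2: a single reducer sees the keys in sorted (frequency) order.
    swapped = [phase2_map(t, f) for t, f in frequencies]
    return [token for _frequency, token in sorted(swapped)]


# Example records taken from the figure.
records = [(1, ["A", "B", "C"]), (2, ["D", "E", "F"]), (10, ["C", "F"]),
           (11, ["E", "C", "D"]), (20, ["F", "G"]), (21, ["B", "A", "F"])]
print(basic_token_ordering(records))
```

OPTO collapses the second phase by sorting the frequencies in memory instead of running another MapReduce job.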

slide-103
SLIDE 103

Stage 3: Basic Record Join (BRJ)

[Figure: Phase 1 MapReduce data flow. Inputs are the original records (RID, a, b) and the RID pairs (RID1, RID2, Sim.). Map emits each entire record keyed by its RID, and each RID pair keyed by both of its RIDs. After Group by key, Reduce outputs, for each pair, the pair as key and one record plus the similarity as value, e.g. key (2,11), value "2,D E F,...,0.5".]

Phase 1: Duplicate the RID pairs and fill half on each

slide-104
SLIDE 104

Stage 3: Basic Record Join (BRJ)

[Figure: Phase 2 MapReduce data flow. An identity Map passes the half-filled pairs through; Group by key collects both halves under the same RID-pair key, e.g. (2,11); Reduce outputs the joined records (RID1, a1, b1, Sim., RID2, a2, b2), e.g. (2, D E F, ..., 0.5, 11, E C D, ...).]

Phase 2: Bring together and fill in the half-filled pairs

Alternative

One Phase Record Join (OPRJ)
One MapReduce phase: map-side join
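Both BRJ phases can be sketched as two small Python generators. This is an in-memory illustration with hypothetical names, assuming records are held in a dict keyed by RID and that duplicate RID pairs from the previous stage may be dropped with a set.

```python
from collections import defaultdict


def brj_phase1(records, rid_pairs):
    """Key records and RID pairs by RID, then emit half-filled pairs."""
    groups = defaultdict(lambda: {"record": None, "pairs": set()})
    for rid, record in records.items():
        groups[rid]["record"] = record
    for rid1, rid2, sim in rid_pairs:
        # Each pair is duplicated: once under each of its two RIDs.
        groups[rid1]["pairs"].add((rid1, rid2, sim))
        groups[rid2]["pairs"].add((rid1, rid2, sim))
    for rid, group in groups.items():
        for rid1, rid2, sim in group["pairs"]:
            yield (rid1, rid2), (rid, group["record"], sim)


def brj_phase2(half_filled):
    """Group the halves by RID pair and emit the complete joined pair."""
    groups = defaultdict(dict)
    for pair, (rid, record, sim) in half_filled:
        groups[pair][rid] = (record, sim)
    for (rid1, rid2), halves in sorted(groups.items()):
        rec1, sim = halves[rid1]
        rec2, _ = halves[rid2]
        yield rid1, rec1, rid2, rec2, sim


# Example data taken from the figure, including a duplicated RID pair.
records = {1: "A B C", 2: "D E F", 10: "C F", 11: "E C D", 21: "B A F"}
rid_pairs = [(2, 11, 0.5), (2, 11, 0.5), (1, 21, 0.5)]
joined = list(brj_phase2(brj_phase1(records, rid_pairs)))
```

Records that appear in no RID pair, such as RID 10 here, produce no output in Phase 1 and never reach Phase 2.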

slide-105
SLIDE 105

Arasu, A., Ganti, V., and Kaushik, R. (2006). Efficient exact set-similarity joins. In VLDB, pages 918–929.
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In ICDE, page 5.
Sarawagi, S. and Kirpal, A. (2004). Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743–754.
