Efficient Processing of Set-Similarity Joins
on Large Computer Clusters
Rares Vernica rares@ics.uci.edu
Department of Computer Science University of California, Irvine
Rares Vernica (UC Irvine) Parallel Fuzzy-Joins 1 / 39
Research Overview
[Figure: data flow from the input table T (columns RID, a, b) through two rounds of intermediate (Key, Value) pairs to the output (columns RID1, RID2, Sim.)]
[Figure: parallel job plan over inputs R and S. Operators: FileScan(R), FileScan(S), Tokenize, TokenizeRIDPrefixToken, HashGroup, Sort, Split, HashJoin, HashJoinWithEvaluator, FileWrite. Connectors: 1:1, M:N Hash, M:N Replicate.]
[Figure: parallel job plan over inputs R and S using AsterixMeta operators. Operators: Scan (R), Scan (S), AsterixMeta (assign,unnest), AsterixMeta (assign,running-agg,assign), AsterixMeta (stream-select,assign,stream-select), AsterixMeta (assign,assign,stream-project), AsterixMeta (stream-project), AsterixMeta (sink-write), HashGroup, PreclusteredGroup, Sort, Split, HashJoin, HashLeftOuterJoin. Connectors: 1:1, 1:N Replicate, M:N Hash, M:N Replicate, M:1 Hash Merge.]
[Figure: speedup vs. number of nodes (2 to 10), three panels: BTO and OPTO vs. Ideal; BK and PK vs. Ideal; BRJ and OPRJ vs. Ideal.]
[Figure: running time (seconds) vs. number of nodes and dataset size (2 to 10), three panels: BTO and OPTO vs. Ideal; BK and PK vs. Ideal; BRJ and OPRJ vs. Ideal.]
Input table (RID, a, b), split across three mappers:
  RID 1:  A B C ...     RID 2:  D E F ...
  RID 10: C F ...       RID 11: E C D ...
  RID 20: F G ...       RID 21: B A F ...

Phase 1: Compute token frequencies
  Map (one per split; emits (token, count) pairs, counted per split):
    (A, 1) (B, 1) ...     (C, 2) (F, 1) ...     (F, 2) (G, 1) ...
  Group by key:
    B: 1, 1 ...     A: 1, 1 ...     C: 1, 2 ...
  Reduce (sums the counts of each token):
    (B, 2) (D, 2) ...     (A, 2) (F, 4) ...     (C, 3) (E, 2) ...

Phase 2: Sort tokens by frequency
  Map (swaps key and value, so the frequency becomes the key):
    (2, B) (2, D) ...     (2, A) (4, F) ...     (3, C) (2, E) ...
  Group by key:
    1: G ...     2: A B D E ...     3: C ...     4: F ...
  Reduce (single reducer):
    G ... A B D E ... C ... F ...

Output: the token list T, ordered by increasing frequency.
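The two phases above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not the author's implementation: the names RECORDS, SPLITS, group_by_key, count_tokens_map, token_frequencies, and token_order are hypothetical, the six records are the ones visible in the figure, and breaking frequency ties alphabetically is an assumption made to get a deterministic order.

```python
from collections import Counter, defaultdict

# Records visible in the figure (RID -> tokens of the join attribute).
RECORDS = {1: "A B C", 2: "D E F", 10: "C F", 11: "E C D", 20: "F G", 21: "B A F"}
# Assumed assignment of records to the three mappers.
SPLITS = [[1, 2], [10, 11], [20, 21]]


def group_by_key(pairs):
    """Simulate the shuffle step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def count_tokens_map(rids):
    """Phase 1 map: emit (token, count) pairs counted within one split."""
    local = Counter()
    for rid in rids:
        local.update(RECORDS[rid].split())
    return list(local.items())


def token_frequencies():
    """Phase 1 reduce: sum the per-split counts of each token."""
    pairs = [pair for split in SPLITS for pair in count_tokens_map(split)]
    return {token: sum(counts) for token, counts in group_by_key(pairs).items()}


def token_order(freqs):
    """Phase 2: swap to (frequency, token), group, and emit in increasing frequency."""
    groups = group_by_key((freq, token) for token, freq in freqs.items())
    order = []
    for freq in sorted(groups):
        order.extend(sorted(groups[freq]))  # alphabetical tie-break is an assumption
    return order


FREQS = token_frequencies()
ORDER = token_order(FREQS)
print(FREQS)
print(ORDER)
```

On the six example records this reproduces the frequencies shown in the figure (G occurs once, F four times) and the order G, A, B, D, E, C, F.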
Inputs:
  RID pairs (RID1, RID2, Sim.):  (2, 11, 0.5)   (1, 21, 0.5)   ...
  Records (RID, a, b):           1: A B C ...   2: D E F ...   10: C F ...   11: E C D ...

Phase 1: Duplicate the RID pairs and fill half on each
  Map (key = RID, value = entire record or RID pair):
    records:    (1, "1,A B C,...")   (2, "2,D E F,...")   (10, "10,C F,...")   (11, "11,E C D,...")   ...
    RID pairs:  (2, "(2,11),0.5")    (11, "(2,11),0.5")   ...
  Group by key:
    2:  "2,D E F,..."     "(2,11),0.5"   ...
    1:  "1,A B C,..."     "(1,21),0.5"   ...
    11: "11,E C D,..."    "(2,11),0.5"   ...
  Reduce (key = RID pair, value = one record plus the similarity):
    ((2,11), "2,D E F,...,0.5")    ((1,21), "21,B A F,...,0.5")   ...
    ((1,21), "1,A B C,...,0.5")    ...
    ((2,11), "11,E C D,...,0.5")   ...
Phase 2: Bring together and fill in the half-filled pairs
  Map (identity map):
    ((2,11), "2,D E F,...,0.5")    ((1,21), "21,B A F,...,0.5")   ...
    ((1,21), "1,A B C,...,0.5")    ...
    ((2,11), "11,E C D,...,0.5")   ...
  Group by key:
    (2,11): "2,D E F,...,0.5"      "11,E C D,...,0.5"
    (1,21): "21,B A F,...,0.5"     "1,A B C,...,0.5"
    ...
  Reduce (output):
    RID1  a1     b1   Sim.  RID2  a2     b2
    2     D E F  ...  0.5   11    E C D  ...
    1     A B C  ...  0.5   21    B A F  ...
    ...   ...    ...  ...   ...   ...    ...
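The record-join step (duplicate each RID pair under both RIDs, attach one record to each copy, then bring the two halves together) can be sketched as follows. This is a single-process illustration under stated assumptions, not the author's code: RECORDS, RID_PAIRS, and all function names are hypothetical, each record is reduced to its RID and one text field, and the shuffle is simulated with a dictionary.

```python
from collections import defaultdict

# Records and RID pairs visible in the figure.
RECORDS = {1: "A B C", 2: "D E F", 10: "C F", 11: "E C D", 20: "F G", 21: "B A F"}
RID_PAIRS = [(2, 11, 0.5), (1, 21, 0.5)]


def group_by_key(pairs):
    """Simulate the shuffle step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def phase1_map():
    """Key every record by its RID; emit each RID pair twice, once per RID."""
    for rid, text in RECORDS.items():
        yield rid, ("record", text)
    for rid1, rid2, sim in RID_PAIRS:
        yield rid1, ("pair", (rid1, rid2), sim)
        yield rid2, ("pair", (rid1, rid2), sim)


def phase1_reduce(rid, values):
    """Attach the record for this RID to every pair it takes part in (half-filled pairs)."""
    text = next((v[1] for v in values if v[0] == "record"), None)
    if text is None:
        return
    for v in values:
        if v[0] == "pair":
            yield v[1], (rid, text, v[2])


def phase2_reduce(pair, halves):
    """Combine the two half-filled entries of one RID pair into a full output row."""
    rid1, rid2 = pair
    by_rid = {rid: (text, sim) for rid, text, sim in halves}
    text1, sim = by_rid[rid1]
    text2, _ = by_rid[rid2]
    return (rid1, text1, sim, rid2, text2)


def record_join():
    half_filled = [
        out
        for rid, values in group_by_key(phase1_map()).items()
        for out in phase1_reduce(rid, values)
    ]
    # Phase 2 uses an identity map, so the shuffle is applied to the pairs directly.
    grouped = group_by_key(half_filled)
    return sorted(phase2_reduce(pair, halves) for pair, halves in grouped.items())


JOINED = record_join()
for row in JOINED:
    print(row)
```

Records that appear in no RID pair (RIDs 10 and 20 here) produce no output in phase 1, so only the two matching pairs reach phase 2.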