Efficient Processing of Set-Similarity Joins on Large Computer Clusters Rares Vernica rares@ics.uci.edu Department of Computer Science University of California, Irvine Rares Vernica (UC Irvine) Parallel Fuzzy-Joins 1 / 39 Research Overview


  1. Set-Similarity Filtering. Prefix filtering [Chaudhuri et al., 2006]: pigeonhole principle. Global order for set elements: sort each record’s tokens. E.g., sim is intersection size, τ = 4; prefix length is 2.
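
The prefix-filtering idea on this slide can be sketched as follows. This is a minimal illustration, assuming overlap (intersection size) similarity with threshold τ and records whose tokens are sorted by a shared global order. For a record of n tokens the prefix length is n − τ + 1, so the slide's prefix length of 2 with τ = 4 corresponds to records of 5 tokens. The function names and sample records are hypothetical and do not come from the slides.

```python
def prefix(tokens, tau, order):
    """Return the prefix of a record under a global token order.

    For overlap threshold tau, a record with n tokens has a prefix of
    n - tau + 1 tokens; records shorter than tau get an empty prefix.
    """
    ranked = sorted(set(tokens), key=order.index)
    return ranked[: max(len(ranked) - tau + 1, 0)]


def may_match(r, s, tau, order):
    """Prefix filter: only pairs whose prefixes share a token can reach overlap tau."""
    return bool(set(prefix(r, tau, order)) & set(prefix(s, tau, order)))


# Hypothetical data: a global order over tokens and three 5-token records.
ORDER = list("ABCDEFGHIJ")
r = ["A", "B", "C", "D", "E"]
s = ["B", "C", "D", "E", "F"]   # overlap with r is 4
t = ["F", "G", "H", "I", "J"]   # overlap with r is 0

print(prefix(r, 4, ORDER))        # prefix has length 5 - 4 + 1 = 2
print(may_match(r, s, 4, ORDER))  # candidate pair, still needs verification
print(may_match(r, t, 4, ORDER))  # pruned without computing the overlap
```

A pair that passes the filter is only a candidate; its actual similarity still has to be verified.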

  2. Outline: 1. Motivation; 2. Problem Statement; 3. Preliminaries; 4. Parallel Algorithms (Overview; Processing Stages; Set-Similarity Joins in MapReduce; Set-Similarity Joins in ASTERIX); 5. Summary & Impact.

  3. Parallel Joins

  7. Parallel Set-Similarity Joins: (1) use the prefix filter; (2) use infrequent tokens in the prefix; (3) project records.
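
One way to place infrequent tokens in the prefix is to use increasing token frequency as the global order. The sketch below is illustrative only and runs on a single machine rather than in parallel. The helper names and sample records are hypothetical, ties are broken alphabetically as an assumption, and "project" is taken here to mean keeping only the RID and the join attribute's tokens, which is one reading of the slide's third point.

```python
from collections import Counter


def token_order(records):
    """Global token order: rarest tokens first, ties broken alphabetically (assumption)."""
    freq = Counter(tok for rec in records for tok in set(rec))
    return sorted(freq, key=lambda tok: (freq[tok], tok))


def project(rid, tokens, order):
    """Keep only the RID and the join attribute's tokens, sorted by the global order."""
    rank = {tok: i for i, tok in enumerate(order)}
    return rid, sorted(set(tokens), key=rank.__getitem__)


# Hypothetical records: RID -> tokens of the join attribute.
records = {1: ["A", "B", "C"], 2: ["B", "C"], 3: ["C", "D"]}
order = token_order(records.values())
projected = [project(rid, toks, order) for rid, toks in records.items()]

print(order)      # rare tokens A and D come before B and the most common token C
print(projected)
```

Because rare tokens come first, each prefix token is shared by few records, which tends to keep the candidate groups small.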

  17. Processing Stages

  21. MapReduce Review. map (k1, v1) → list(k2, v2); reduce (k2, list(v2)) → list(k3, v3).
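
The two signatures above can be exercised with a small in-memory simulation. This is a sketch for intuition only, not Hadoop code: the `map_reduce` driver and the token-counting job are hypothetical, and everything runs in one process.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run map, group-by-key, and reduce over an in-memory list of (k1, v1) pairs."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):           # map (k1, v1) -> list(k2, v2)
            groups[k2].append(v2)               # group values by k2
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))  # reduce (k2, list(v2)) -> list(k3, v3)
    return output


# Hypothetical job: count token occurrences across records keyed by RID.
def count_mapper(rid, text):
    return [(token, 1) for token in text.split()]


def count_reducer(token, ones):
    return [(token, sum(ones))]


result = map_reduce([(1, "A B C"), (21, "B A F")], count_mapper, count_reducer)
print(result)
```

In a real cluster the grouping step is the shuffle between map and reduce tasks; here it is just a dictionary.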

  28. Stage 2: RID-Pair Generation [figure: MapReduce data flow]. Token order: G, ... Input records (RID: tokens): 1: A B C; 2: D E F; 10: C F; 11: E C D; 20: F G; 21: B A F. Map emits (Key, Value) pairs keyed by token, e.g., (A; 1,A B C), (B; 1,A B C), (C; 10,C F), (D; 11,E C D), (G; 20,F G), (A; 21,B A F). Group by key, e.g., B: (1,A B C), (21,B A F); A: (1,A B C), (21,B A F). Reduce outputs (RID1, RID2, Sim.), e.g., (1, 21, 0.5) and (2, 11, 0.5). Reduce alternatives: Basic Kernel (BK): nested loops; PPJoin+ Kernel (PK): inverted list index.
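
The data flow in the Stage 2 figure can be imitated on one machine. The sketch below uses the six records shown in the figure and a nested-loop reducer in the spirit of the Basic Kernel. Several details are assumptions, not taken from the slides: the token order beyond its first element G, Jaccard similarity with threshold 0.5, and the prefix length n − ⌈0.5·n⌉ + 1 that follows from that threshold.

```python
import math
from collections import defaultdict
from itertools import combinations

# Records from the figure: RID -> tokens of the join attribute.
RECORDS = {1: "A B C", 2: "D E F", 10: "C F", 11: "E C D", 20: "F G", 21: "B A F"}
# Assumed global token order (the slide shows only that it starts with G).
ORDER = ["G", "A", "B", "D", "E", "C", "F"]
THRESHOLD = 0.5  # assumed Jaccard threshold


def jaccard(x, y):
    return len(x & y) / len(x | y)


def mapper(rid, text):
    """Emit (prefix token, (RID, token set)) for each token in the record's prefix."""
    tokens = sorted(set(text.split()), key=ORDER.index)
    prefix_len = len(tokens) - math.ceil(THRESHOLD * len(tokens)) + 1
    return [(tok, (rid, frozenset(tokens))) for tok in tokens[:prefix_len]]


def reducer(token, values):
    """Nested loops over all records that share this prefix token."""
    out = []
    for (rid1, x), (rid2, y) in combinations(sorted(values, key=lambda v: v[0]), 2):
        sim = jaccard(x, y)
        if sim >= THRESHOLD:
            out.append((rid1, rid2, sim))
    return out


groups = defaultdict(list)
for rid, text in RECORDS.items():
    for key, value in mapper(rid, text):
        groups[key].append(value)              # group by key

pairs = [p for key in groups for p in reducer(key, groups[key])]
print(sorted(pairs))       # a pair appears once per shared prefix token
print(sorted(set(pairs)))  # distinct RID pairs with their similarity
```

Under these assumptions the same RID pair is produced by more than one reducer, which appears consistent with the repeated rows in the figure's Reduce output, so a later step has to remove duplicates.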

  36. Experimental Setting (hardware, software). Software: OS: Ubuntu 9.06, 64-bit, server edition; Java 1.6, 64-bit, server; Hadoop 0.20.1.

  37. Experimental Setting: Datasets. DBLP: average record length 259 bytes; 1.2M records; total size 300MB. CITESEERX: average record length 1374 bytes; 1.3M records; total size 1.8GB. Each dataset was increased up to ×25, preserving join properties: DBLP: 31M records, 8.2GB; CITESEERX: 32M records, 45GB.

  38. Running Time [figure]. Self-join DBLP × n, n ∈ [5, 25], 10-node cluster. Annotations: best time; bulk of the time. Legend: Stage 1: BTO (Basic Token Ordering), OPTO (One Phase Token Ordering); Stage 2: BK (Basic Kernel), PK (PPJoin+ Kernel); Stage 3: BRJ (Basic Record Join), OPRJ (One Phase Record Join).

  42. Speedup [figure]. Relative running time, self-join DBLP × 10, different cluster sizes. x-axis: # Nodes (2 to 10); y-axis: Speedup (1 to 5). Series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal.
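
For reference, one common way to define the plotted quantity is given below. It assumes, which the slide does not state, that running times are normalized to the smallest cluster shown, n_0 = 2 nodes; the symbols T, n, and n_0 are introduced here for illustration.

```latex
\[
  \mathrm{speedup}(n) = \frac{T(n_0)}{T(n)}, \qquad
  \mathrm{speedup}_{\mathrm{ideal}}(n) = \frac{n}{n_0}, \qquad n_0 = 2,
\]
```

Here T(n) is the running time on n nodes. Under this assumption the ideal line reaches 10/2 = 5 at 10 nodes, which matches the top of the plotted speedup axis.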

  44. Beyond MapReduce. MapReduce limitations: simplicity over performance; rigid framework; base “query” language: Java; declarative query languages as add-ons. MapReduce-inspired alternatives: include elements of MapReduce; more runtime choices; built-in declarative query language. Examples: Scope/Dryad (Microsoft), Nephele/PACTs (TU Berlin), ASTERIX/Hyracks (UC Irvine).

  47. ASTERIX/Hyracks. ASTERIX overview: scalable data platform; semi-structured data model; declarative query language; rule-based optimizer; runs on Hyracks; built-in set-similarity joins. Hyracks overview: partition-parallel framework; more flexible than MapReduce; library of operators and connectors. Operators: Map, Join, Aggregate, etc. Connectors: 1:1, M:N Hash, M:N Replicate, etc.

  50. Hyracks Native Plan [figure: operator DAG]. Operators: FileScan(R), FileScan(S), Tokenize, HashGroup, Sort, Split, TokenizeRIDPrefixToken, HashJoinWithEvaluator, HashJoin, FileWrite. Connectors: 1:1, M:N Hash, M:N Replicate.

  51. Hadoop vs. Hyracks [figure]. Running time, self-join DBLP × n, n ∈ [5, 25], 10-node cluster. x-axis: Dataset Size (times the original): 5, 10, 25; y-axis: Time (seconds), 0 to 900. Series: Hadoop, Hyracks Compat, Hyracks Native.

  52. Set-Similarity Join in ASTERIX. Fuzzy-Join Query:
      for $dblp in dataset('DBLP')
      for $citeseer in dataset('CITESEER')
      where $dblp.title ~= $citeseer.title
      return { 'dblp' : $dblp, 'citeseer' : $citeseer }

  53. ASTERIX/Hyracks Stack [figure]. Query → Parser → Expr. Tree → Translator → Logical Plan → Optimizer → Logical Plan → Job Generator → Hyracks Job → Hyracks. Annotations: where ... ~= ...; logical plan with the three stages.

  55. Hyracks Plan for Set-Similarity Joins [figure: operator DAG]. Operators: Scan (R), Scan (S), AsterixMeta (assign, unnest), AsterixMeta (assign, running-agg, assign), AsterixMeta (stream-select, assign, stream-select), AsterixMeta (stream-project), AsterixMeta (assign, assign, stream-project), AsterixMeta (sink-write), HashGroup, PreclusteredGroup, Sort, Split, HashLeftOuterJoin, HashJoin. Connectors: 1:1, M:N Hash, M:N Replicate, 1:N Replicate, M:1 Hash Merge.

  56. ASTERIX Console

  58. Impact. Source-code release for Hadoop (http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/). Interest from industry: Fox Audience Network, Yahoo! (Pig), Ask.com. Source-code release for ASTERIX (coming soon).

  59. Summary. Set-similarity joins in MapReduce: three-stage approach; balance workload and minimize replication. End-to-end algorithms: self-join, R-S join. Experiments: speedup and scaleup on a cluster with 40 cores and 40 disks. Similarity joins in ASTERIX/Hyracks: in progress. High-level query language integration: in progress.

  60. Future Work. Large domain for set elements; records with large sets; skew in join-result; weights for set elements.

  61. Publications.
      ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving World Models. Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras. Journal of Distributed and Parallel Databases, 2011 (to appear).
      Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing. Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, Rares Vernica. ICDE 2011.
      CIRCUMFLEX: A Scheduling Optimizer for MapReduce Workloads Involving Shared Scans. Joel Wolf, Deepak Rajan, Kirsten Hildrum, Rohit Khandekar, Sujay Parekh, Kun-Lung Wu, Andrey Balmin, Rares Vernica. Submitted for publication, VLDB 2011.
      Adaptive MapReduce using Situation-Aware Mappers. Rares Vernica, Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac. Submitted for publication, VLDB 2011.
      AKYRA: Efficient Keyword-Query Cleaning in Relational Databases. Rares Vernica, Chen Li. Technical Report, University of California, Irvine, 2010.
      Efficient Parallel Set-Similarity Joins Using MapReduce. Rares Vernica, Michael J. Carey, Chen Li. SIGMOD 2010.
      Efficient Top-k Algorithms for Fuzzy Search in String Collections. Rares Vernica, Chen Li. KEYS 2009, Workshop on Keyword Search on Structured Data, SIGMOD 2009.
      Entity Categorization Over Large Document Collections. Venkatesh Ganti, Arnd Christian König, Rares Vernica. KDD 2008.
      SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDB J. 2008.
      Relaxing Join and Selection Queries. Nick Koudas, Chen Li, Anthony K. H. Tung, Rares Vernica. VLDB 2006.

  62. Set-Similarity Join in ASTERIX. Fuzzy-Join Query:
      for $dblp in dataset('DBLP')
      for $citeseer in dataset('CITESEER')
      let [$match, $sim] := $dblp.title ~= $citeseer.title with similarity, simfunction 'Jaccard', simthreshold .8
      where $match
      order by $sim
      return { 'dblp' : $dblp, 'citeseer' : $citeseer, 'sim' : $sim }
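
To make the query's meaning concrete, here is a naive single-machine reading of it in Python. It is not how ASTERIX evaluates the query, since the deck describes a logical plan with three stages running on Hyracks. The word tokenization, the lowercasing, the field names, and the sample records are all assumptions made for illustration.

```python
def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def fuzzy_join(dblp, citeseer, threshold=0.8):
    """Nested-loop reading of the query: keep title pairs with Jaccard >= threshold."""
    results = []
    for d in dblp:
        d_tokens = set(d["title"].lower().split())      # word tokens (assumption)
        for c in citeseer:
            sim = jaccard(d_tokens, set(c["title"].lower().split()))
            if sim >= threshold:                         # where $match
                results.append({"dblp": d, "citeseer": c, "sim": sim})
    return sorted(results, key=lambda row: row["sim"])   # order by $sim


# Hypothetical records, for illustration only.
dblp = [{"id": 1, "title": "Efficient Parallel Set-Similarity Joins Using MapReduce"}]
citeseer = [
    {"id": 7, "title": "efficient parallel set-similarity joins using mapreduce"},
    {"id": 8, "title": "Relaxing Join and Selection Queries"},
]
rows = fuzzy_join(dblp, citeseer)
for row in rows:
    print(row["dblp"]["id"], row["citeseer"]["id"], row["sim"])
```

The nested loops compare every pair of records, which is the cost that prefix filtering and the parallel stages are meant to avoid on large datasets.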

  63. Speedup Breakdown [figure, three panels]. Relative running time, self-join DBLP × 10, different cluster sizes. Panels: Stage 1 (BTO, OPTO, Ideal); Stage 2 (BK, PK, Ideal); Stage 3 (BRJ, OPRJ, Ideal). x-axis: # Nodes (2 to 10); y-axis: Speedup (1 to 5).

  64. Scaleup [figure]. Running time, self-joining DBLP × n, n ∈ [5, 25], proportional cluster. x-axis: # Nodes and Dataset Size (times 2.5 × original), 2 to 10; y-axis: Time (seconds), 0 to 1000. Series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ.
