A fast approach for parallel deduplication on multicore processors

A fast approach for parallel deduplication on multicore processors. Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser. Overview: General Blocking, MD-Approach Overview, MapReduce Implementation, Evaluation, Discussion.


  1. A fast approach for parallel deduplication on multicore processors Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser

  2. Overview ● General Blocking ● MD-Approach Overview ● MapReduce Implementation ● Evaluation ● Discussion

  3. General Blocking

     DiscID  DiscName                          Genre  Year  ...
     1       From The Cradle - Eric Clapton    Blues  1994  ...
     2       Marvin Gaye - Here, My Dear       Soul   1975  ...
     3       The Beatles - A Hard Day’s Night  Blues  1964  ...
     4       Eric Clapton - From the Cradle    Blues  1995  ...
     5       Beatles - A Hard Day’s Night      Rock   1964  ...
     6       Curtis Mayfield - Curtis          Soul   1970  ...
     ...     ...                               ...    ...   ...

  4. General Blocking - Blocking Key

     DiscID  DiscName                          Genre  Year  ...
     1       From The Cradle - Eric Clapton    Blues  1994  ...
     2       Marvin Gaye - Here, My Dear       Soul   1975  ...
     3       The Beatles - A Hard Day’s Night  Blues  1964  ...
     4       Eric Clapton - From the Cradle    Blues  1995  ...
     5       Beatles - A Hard Day’s Night      Rock   1964  ...
     6       Curtis Mayfield - Curtis          Soul   1970  ...
     ...     ...                               ...    ...   ...

  5. General Blocking - Balance Problem

     DiscID  DiscName                          Genre  Year  ...
     1       From The Cradle - Eric Clapton    Blues  1994  ...
     3       The Beatles - A Hard Day’s Night  Blues  1964  ...
     4       Eric Clapton - From the Cradle    Blues  1995  ...
     2       Marvin Gaye - Here, My Dear       Soul   1975  ...
     6       Curtis Mayfield - Curtis          Soul   1970  ...
     5       Beatles - A Hard Day’s Night      Rock   1964  ...

  6. General Blocking - Keys Problem

     DiscID  DiscName                          Genre  Year  ...
     1       From The Cradle - Eric Clapton    Blues  1994  ...
     2       Marvin Gaye - Here, My Dear       Soul   1975  ...
     3       The Beatles - A Hard Day’s Night  Blues  1964  ...
     4       Eric Clapton - From the Cradle    Blues  1995  ...
     5       Beatles - A Hard Day’s Night      Rock   1964  ...
     6       Curtis Mayfield - Curtis          Soul   1970  ...
     ...     ...                               ...    ...   ...

  7. Blocking Functions & Multipass
     ● blocking functions are defined as follows:
        ○ bf1(record) = {genre}
        ○ bf2(record) = {year, genre}
        ○ bf3(record) = {first 3 letters of genre, first 3 digits of year}
     ● in an n-pass multipass, several blocking functions are applied to each record
        ○ BFS = {bf1, bf2, ..., bfn}
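The three blocking functions on this slide can be sketched in a few lines of Python. This is an illustration only: the dictionary field names (`genre`, `year`), the concatenation order inside each key, and the helper `multipass_keys` are assumptions, not taken from the paper.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

def bf1(record: Record) -> str:
    # bf1(record) = {genre}
    return record["genre"]

def bf2(record: Record) -> str:
    # bf2(record) = {year, genre}; concatenation order is an assumption
    return record["year"] + record["genre"]

def bf3(record: Record) -> str:
    # bf3(record) = {first 3 letters of genre, first 3 digits of year}
    return record["genre"][:3] + record["year"][:3]

# BFS = {bf1, bf2, ..., bfn}: the set of blocking functions used in a multipass
BFS: List[Callable[[Record], str]] = [bf1, bf2, bf3]

def multipass_keys(record: Record) -> List[str]:
    """Apply every blocking function to one record (hypothetical helper)."""
    return [bf(record) for bf in BFS]

record = {
    "disc_id": "2",
    "disc_name": "Marvin Gaye - Here, My Dear",
    "genre": "Soul",
    "year": "1975",
}
print(multipass_keys(record))  # ['Soul', '1975Soul', 'Sou197']
```

Each additional blocking function gives a record one more chance to land in the same block as its duplicate, at the cost of more candidate comparisons.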

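  8. MD-Approach - Idea
     [Figure: the blocking step splits dataset D into blocks B1, B2, B3, B4.]

  9. MD-Approach - Idea
     [Figure: the blocking step splits dataset D into blocks B1, B2, B3, B4; each block is passed to a match step M.]

  10. MD-Approach - Idea
     [Figure: blocking step, MD-Approach, match. Dataset D is split into blocks B1, B2, B3, B4; B3 is split further into B3,1 and B3,2; B1, B2, B3,1, B3,2 and B4 are each passed to a match step M.]

  11. MD-Approach - Idea
     [Figure: blocking step, MD-Approach, match. Dataset D is split into blocks B1, B2, B3, B4; B3 is split further into B3,1 and B3,2; B1, B2, B3,1, B3,2 and B4 are each passed to a match step M.]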

  12. MD-Approach - MapReduce Overview

  13. Map-Reduce Implementation Phase I - First Blocking Step
     ● create dataset segments
     ● map phase only
     ● emits key-value pairs
        ○ key: the generated blocking key, e.g. bf(record) = {first 3 letters of genre, first 3 digits of year}
        ○ value: the record

     Input record:          2  Marvin Gaye - Here, My Dear  Soul  1975  ...
     Emitted pair:  Sou197  2  Marvin Gaye - Here, My Dear  Soul  1975  ...

  14. Map-Reduce Implementation Phase I - First Blocking Step
     ● multi-passing
        ○ a set of n blocking functions
           ■ BFS = {bf1, bf2, ..., bfn}
        ○ for each record, all keys are emitted at once:
           ■ <k_bf1 : record_1> ... <k_bf1 : record_n>
             <k_... : record_1> ... <k_... : record_n>
             <k_bfn : record_1> ... <k_bfn : record_n>

     Input record:            2  Marvin Gaye - Here, My Dear  Soul  1975  ...
     Emitted by bf1: Sou197   2  Marvin Gaye - Here, My Dear  Soul  1975  ...
     Emitted by bf2: MarvSou  2  Marvin Gaye - Here, My Dear  Soul  1975  ...
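A minimal single-process sketch of this map phase follows. It is plain Python rather than Phoenix code, and it makes assumptions: the second blocking function is guessed from the example key MarvSou (first 4 letters of the disc name plus first 3 letters of the genre), and `map_record` and `build_blocks` are hypothetical names.

```python
from collections import defaultdict

def bf_genre_year(record):
    # {first 3 letters of genre, first 3 digits of year}, e.g. "Sou197"
    return record["genre"][:3] + str(record["year"])[:3]

def bf_name_genre(record):
    # Assumption inferred from the slide's example key "MarvSou"
    return record["name"][:4] + record["genre"][:3]

BFS = [bf_genre_year, bf_name_genre]

def map_record(record):
    """Map step: emit one (blocking key, record) pair per blocking function."""
    for bf in BFS:
        yield bf(record), record

def build_blocks(records):
    """Group emitted pairs by key, standing in for the framework's shuffle."""
    blocks = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            blocks[key].append(value)
    return dict(blocks)

records = [
    {"id": 1, "name": "From The Cradle - Eric Clapton", "genre": "Blues", "year": 1994},
    {"id": 2, "name": "Marvin Gaye - Here, My Dear", "genre": "Soul", "year": 1975},
    {"id": 4, "name": "Eric Clapton - From the Cradle", "genre": "Blues", "year": 1995},
]

blocks = build_blocks(records)
for key, members in blocks.items():
    print(key, [r["id"] for r in members])
```

Because every record is emitted once per blocking function in a single map call, the multipass needs only one run over the dataset.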

  15. Map-Reduce Implementation Phase II - Sort Blocks & Match
     ● identify unbalanced blocks
        ○ compare the record count of each block with a threshold
        ○ use the reduce function until a certain threshold is reached
     ● reduce step (match step)
        ○ receives all records with the same key (here: the same block)
        ○ nested-loop pairwise comparison
        ○ outputs pairs of similar records
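The two parts of Phase II can be sketched as follows. The size threshold, the match threshold, and the use of `difflib.SequenceMatcher` on the disc name are all stand-in assumptions; the slides do not say which similarity metric or threshold values the authors use.

```python
from difflib import SequenceMatcher

MAX_BLOCK_SIZE = 2     # hypothetical size threshold for "unbalanced"
MATCH_THRESHOLD = 0.8  # hypothetical similarity threshold

def split_by_balance(blocks, max_size=MAX_BLOCK_SIZE):
    """Compare each block's record count with a threshold."""
    balanced = {k: v for k, v in blocks.items() if len(v) <= max_size}
    unbalanced = {k: v for k, v in blocks.items() if len(v) > max_size}
    return balanced, unbalanced

def similarity(a, b):
    # Stand-in metric: character-level ratio on the lower-cased disc name
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def reduce_match(key, records, threshold=MATCH_THRESHOLD):
    """Reduce step: nested-loop pairwise comparison inside one block."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((records[i]["id"], records[j]["id"]))
    return pairs

r1 = {"id": 1, "name": "From The Cradle - Eric Clapton"}
r3 = {"id": 3, "name": "The Beatles - A Hard Day's Night"}
r4 = {"id": 4, "name": "Eric Clapton - From the Cradle"}
r5 = {"id": 5, "name": "Beatles - A Hard Day's Night"}

blocks = {"196": [r3, r5], "Blu": [r1, r3, r4]}
balanced, unbalanced = split_by_balance(blocks)

for key, records in balanced.items():
    print(key, reduce_match(key, records))
print("deferred to second blocking step:", list(unbalanced))
```

The nested loop costs n(n-1)/2 comparisons for a block of n records, which is why oversized blocks are held back for the second blocking step instead of being matched directly.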

  16. Map-Reduce Implementation Phase III - Second Blocking Step
     ● applied to unbalanced blocks only
     ● map: expand the blocking key from the first blocking step
        ■ e.g. bf1(record) = {first 3 letters of genre, first 3 digits of year} → bf1'(record) = {all letters of genre, all digits of year}
        ■ creates very fine-grained blocks

     Key before expansion:  Blu199     1  From The Cradle - Eric Clapton  Blues  1994  ...
                            Blu199     4  Eric Clapton - From the Cradle  Blues  1995  ...
     Key after expansion:   Blues1994  1  From The Cradle - Eric Clapton  Blues  1994  ...
                            Blues1995  4  Eric Clapton - From the Cradle  Blues  1995  ...
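The key expansion from the example above can be reproduced with a short sketch. The function names `coarse_key`, `expanded_key`, and `reblock` are hypothetical; only the key definitions come from the slide.

```python
from collections import defaultdict

def coarse_key(record):
    # bf1: first 3 letters of genre + first 3 digits of year
    return record["genre"][:3] + str(record["year"])[:3]

def expanded_key(record):
    # bf1': all letters of genre + all digits of year
    return record["genre"] + str(record["year"])

def reblock(unbalanced_records):
    """Second blocking step: regroup one unbalanced block by the expanded key."""
    fine_blocks = defaultdict(list)
    for record in unbalanced_records:
        fine_blocks[expanded_key(record)].append(record)
    return dict(fine_blocks)

records = [
    {"id": 1, "name": "From The Cradle - Eric Clapton", "genre": "Blues", "year": 1994},
    {"id": 4, "name": "Eric Clapton - From the Cradle", "genre": "Blues", "year": 1995},
]

print([coarse_key(r) for r in records])  # ['Blu199', 'Blu199']
print(sorted(reblock(records)))          # ['Blues1994', 'Blues1995']
```

The example also shows the risk: records 1 and 4 are likely duplicates, yet the expanded key separates them.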

  17. Map-Reduce Implementation Phase III - Second Blocking Step
     ● to avoid losing true positives, use a "sliding window approach"
        ○ create an index structure for the fine-grained keys after the map phase
        ○ compare each key with its k nearest neighbors
        ○ if the similarity is high enough, merge records with very similar keys into bigger blocks again
     ● the reduce step (match) is the same as in Phase II
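One way to read the sliding window step is sketched below: sort the fine-grained keys, compare each key with the next few keys in sorted order, and merge blocks whose keys are similar enough. This is an interpretation, not the authors' code. The sorted list as index structure, the `difflib` key similarity, the union-find merge, and the parameter values are all assumptions.

```python
from difflib import SequenceMatcher

def key_similarity(a, b):
    # Stand-in similarity between two blocking keys
    return SequenceMatcher(None, a, b).ratio()

def merge_similar_keys(fine_blocks, window=2, threshold=0.85):
    """Merge fine-grained blocks whose keys are close in sorted order."""
    keys = sorted(fine_blocks)        # sorted keys act as the index structure
    parent = {k: k for k in keys}     # union-find over keys

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i, key in enumerate(keys):
        # Slide a window over the next `window` neighbors in sorted order
        for neighbor in keys[i + 1 : i + 1 + window]:
            if key_similarity(key, neighbor) >= threshold:
                parent[find(neighbor)] = find(key)

    merged = {}
    for key in keys:
        merged.setdefault(find(key), []).extend(fine_blocks[key])
    return merged

# Values are record ids only, to keep the example short
fine_blocks = {
    "Blues1994": [1],
    "Blues1995": [4],
    "Rock1964": [5],
}
print(merge_similar_keys(fine_blocks))
```

With these inputs the keys Blues1994 and Blues1995 fall back into one block, so records 1 and 4 are compared in the following match step.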

  18. Map-Reduce Implementation Phase IV - Merge Pairs
     ● short map-reduce operations clean the output file
        ○ identify and remove replicated pairs
        ○ needed because the multipass reports the same detected pair more than once
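Removing replicated pairs amounts to normalizing each pair and keeping it once. A minimal sketch, assuming pairs are tuples of record ids and that `merge_pairs` (a hypothetical name) receives one list of pairs per match run:

```python
def merge_pairs(pair_lists):
    """Collapse pairs reported by several passes into a duplicate-free list."""
    unique = set()
    for pairs in pair_lists:
        for a, b in pairs:
            # Order the ids so that (4, 1) and (1, 4) count as the same pair
            unique.add((a, b) if a <= b else (b, a))
    return sorted(unique)

pass_outputs = [
    [(1, 4), (3, 5)],   # pairs found under the first blocking function
    [(4, 1)],           # same pair found again under another function
    [(5, 3), (2, 6)],
]
print(merge_pairs(pass_outputs))  # [(1, 4), (2, 6), (3, 5)]
```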

  19. Evaluation
     ● the Phoenix MR framework was used for the implementation (shared-memory architecture)
     ● synthetic datasets generated by Febrl (1M, 2M, and 4M records, each with 10% duplicates)
     ● compared with BTO-BK
     ● different similarity metrics were used for the different approaches

  20. Relevance for the seminar
     ● interesting and intuitive main idea
     ● sometimes hard to understand because of weaknesses in the English
     ● MR-specific implementation details are sparse
     ● mapping the approach from a shared-memory architecture (Phoenix) onto a shared-nothing architecture (Hadoop, Stratosphere) will be challenging
     ● the best points in summary:
        ○ single-run multi-pass
        ○ load balancing through re-blocking

  21. Sources 1. Dal Bianco, Guilherme, Renata Galante, and Carlos A. Heuser. A fast approach for parallel deduplication on multicore processors. In Proceedings of the ACM Symposium on Applied Computing, 2011.

  22. Map-Reduce Implementation First MR-Step
     ● map-step
        ○ emits (blocking-key, value)
     ● identify unbalanced blocks
     ● reduce-step (balanced blocks only)
        ○ similarity function
        ○ arithmetic average
        ○ find duplicates by threshold
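The reduce-step bullets suggest that per-attribute similarities are averaged and the average is compared with a threshold. The sketch below illustrates that reading. The compared fields, the `difflib` metric, and the threshold value are assumptions, since the slides name none of them.

```python
from difflib import SequenceMatcher

FIELDS = ("name", "genre", "year")  # hypothetical choice of compared attributes

def field_similarity(a, b):
    # Stand-in per-attribute similarity function
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def record_similarity(r1, r2, fields=FIELDS):
    """Arithmetic average of the per-attribute similarities."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

def is_duplicate(r1, r2, threshold=0.8):
    """Report a duplicate when the averaged similarity reaches the threshold."""
    return record_similarity(r1, r2) >= threshold

r3 = {"id": 3, "name": "The Beatles - A Hard Day's Night", "genre": "Blues", "year": 1964}
r5 = {"id": 5, "name": "Beatles - A Hard Day's Night", "genre": "Rock", "year": 1964}

print(round(record_similarity(r3, r5), 3), is_duplicate(r3, r5))
```

Averaging lets one disagreeing attribute (here the genre) pull the score down, so the threshold has to be tuned against how noisy each attribute is.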

  23. Map-Reduce Implementation Second MR-Step
     ● map-step
        ○ emits the expanded blocking-key
     ● "sliding window sort" (binary search)
     ● reduce-step
        ○ same as in the First MR-Step
