A fast approach for parallel deduplication on multicore processors


SLIDE 1

A fast approach for parallel deduplication on multicore processors

Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser

SLIDE 2

Overview

  • General Blocking
  • MD-Approach Overview
  • MapReduce Implementation
  • Evaluation
  • Discussion
SLIDE 3

General Blocking

DiscID  DiscName                          Genre  Year  ...
1       From The Cradle - Eric Clapton    Blues  1994  ...
2       Marvin Gaye - Here, My Dear       Soul   1975  ...
3       The Beatles - A Hard Day’s Night  Blues  1964  ...
4       Eric Clapton - From the Cradle    Blues  1995  ...
5       Beatles - A Hard Day’s Night      Rock   1964  ...
6       Curtis Mayfield - Curtis          Soul   1970  ...
...     ...                               ...    ...   ...

SLIDE 4

General Blocking - Blocking Key

DiscID  DiscName                          Genre  Year  ...
1       From The Cradle - Eric Clapton    Blues  1994  ...
2       Marvin Gaye - Here, My Dear       Soul   1975  ...
3       The Beatles - A Hard Day’s Night  Blues  1964  ...
4       Eric Clapton - From the Cradle    Blues  1995  ...
5       Beatles - A Hard Day’s Night      Rock   1964  ...
6       Curtis Mayfield - Curtis          Soul   1970  ...
...     ...                               ...    ...   ...

SLIDE 5

General Blocking - Balance Problem

DiscID  DiscName                          Genre  Year  ...
1       From The Cradle - Eric Clapton    Blues  1994  ...
3       The Beatles - A Hard Day’s Night  Blues  1964  ...
4       Eric Clapton - From the Cradle    Blues  1995  ...
2       Marvin Gaye - Here, My Dear       Soul   1975  ...
6       Curtis Mayfield - Curtis          Soul   1970  ...
5       Beatles - A Hard Day’s Night      Rock   1964  ...

SLIDE 6

General Blocking - Keys Problem

DiscID  DiscName                          Genre  Year  ...
1       From The Cradle - Eric Clapton    Blues  1994  ...
2       Marvin Gaye - Here, My Dear       Soul   1975  ...
3       The Beatles - A Hard Day’s Night  Blues  1964  ...
4       Eric Clapton - From the Cradle    Blues  1995  ...
5       Beatles - A Hard Day’s Night      Rock   1964  ...
6       Curtis Mayfield - Curtis          Soul   1970  ...
...     ...                               ...    ...   ...

SLIDE 7

Blocking Functions & Multipass

  • blocking functions are defined as follows:

bf1(record) = {genre}

bf2(record) = {year, genre}

bf3(record) = {1st 3 letters of genre, 1st 3 digits of year}

  • in an n-multipass, several blocking functions are applied to each record

BFS = {bf1, bf2, ..., bfn}
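
The three blocking functions above can be sketched in Python as follows. This is only an illustration: the `Record` type, the `multipass_keys` helper and the way key parts are concatenated into a string are assumptions, since the slides give each key only as a set of attributes.

```python
from typing import Callable, NamedTuple


class Record(NamedTuple):
    disc_id: int
    disc_name: str
    genre: str
    year: int


def bf1(record: Record) -> str:
    """Blocking key built from the genre only."""
    return record.genre


def bf2(record: Record) -> str:
    """Blocking key built from year and genre (concatenation order is assumed)."""
    return f"{record.year}{record.genre}"


def bf3(record: Record) -> str:
    """First 3 letters of the genre plus first 3 digits of the year."""
    return record.genre[:3] + str(record.year)[:3]


# BFS = {bf1, bf2, ..., bfn}; here n = 3
BFS: list[Callable[[Record], str]] = [bf1, bf2, bf3]


def multipass_keys(record: Record) -> list[str]:
    """Apply every blocking function in BFS to a single record."""
    return [bf(record) for bf in BFS]


record = Record(2, "Marvin Gaye - Here, My Dear", "Soul", 1975)
print(multipass_keys(record))  # ['Soul', '1975Soul', 'Sou197']
```

Coarse keys such as `bf1` give few large blocks, while narrower keys such as `bf2` give many small ones.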

SLIDE 8

MD-Approach - Idea

[Figure: the blocking step partitions the dataset D into blocks B1, B2, B3 and B4]

SLIDE 9

MD-Approach - Idea

[Figure: the blocking step partitions the dataset D into blocks B1, B2, B3 and B4; three match steps (M) follow the blocks]

SLIDE 10

MD-Approach - Idea

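[Figure: blocking step, MD-Approach and match step; D is partitioned into blocks B1, B2, B3 and B4, block B3 is further split into B3,1 and B3,2, and five match steps (M) follow]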

SLIDE 11

MD-Approach - Idea

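[Figure: blocking step, MD-Approach and match step; D is partitioned into blocks B1, B2, B3 and B4, block B3 is further split into B3,1 and B3,2, and five match steps (M) follow]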

SLIDE 12

MD-Approach - MapReduce Overview

SLIDE 13

Map-Reduce Implementation Phase I - First Blocking Step

  • create dataset segments
  • only map phase
  • emits key-value pairs

○ generated blocking key as key, e.g. bf(record) = {1st 3 letters of genre, 1st 3 digits of year}
○ record as value

Example:
input:  2 Marvin Gaye - Here, My Dear Soul 1975 ...
output: <Sou197 : 2 Marvin Gaye - Here, My Dear Soul 1975 ...>
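
A minimal sketch of such a map-only phase in plain Python, independent of any MapReduce framework. The tuple layout of `Record` and the names `blocking_key` and `map_phase` are assumptions made for illustration.

```python
from typing import Iterator

# Assumed record layout: (DiscID, DiscName, Genre, Year)
Record = tuple[int, str, str, int]


def blocking_key(record: Record) -> str:
    """bf(record) = first 3 letters of genre + first 3 digits of year."""
    _, _, genre, year = record
    return genre[:3] + str(year)[:3]


def map_phase(segment: list[Record]) -> Iterator[tuple[str, Record]]:
    """Emit one (blocking key, record) pair per record of a dataset segment."""
    for record in segment:
        yield blocking_key(record), record


segment = [(2, "Marvin Gaye - Here, My Dear", "Soul", 1975)]
pairs = list(map_phase(segment))
print(pairs)  # [('Sou197', (2, 'Marvin Gaye - Here, My Dear', 'Soul', 1975))]
```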

SLIDE 14

Map-Reduce Implementation Phase I - First Blocking Step

  • multi-passing

set of n blocking functions

BFS = {bf1, bf2, ..., bfn}

for each record, emit one pair per blocking function at once:

<k_bf1 : record_1> ... <k_bf1 : record_n>
<k_... : record_1> ... <k_... : record_n>
<k_bfn : record_1> ... <k_bfn : record_n>

Example:
input:        2 Marvin Gaye - Here, My Dear Soul 1975 ...
output (bf1): <Sou197 : 2 Marvin Gaye - Here, My Dear Soul 1975 ...>
output (bf2): <MarvSou : 2 Marvin Gaye - Here, My Dear Soul 1975 ...>
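
The single-run multipass emission can be sketched as below. Note the assumptions: the slides do not define bf2, so the version here (first 4 letters of the disc name plus first 3 letters of the genre) is a guess that merely reproduces the key MarvSou from the example. `multipass_map` and the grouping into `blocks` are also illustrative names.

```python
from collections import defaultdict
from typing import Iterator

# Assumed record layout: (DiscID, DiscName, Genre, Year)
Record = tuple[int, str, str, int]


def bf1(record: Record) -> str:
    """First 3 letters of genre + first 3 digits of year."""
    return record[2][:3] + str(record[3])[:3]


def bf2(record: Record) -> str:
    """Guessed definition: first 4 letters of disc name + first 3 of genre."""
    return record[1][:4] + record[2][:3]


BFS = [bf1, bf2]


def multipass_map(records: list[Record]) -> Iterator[tuple[str, Record]]:
    """Emit one (key, record) pair per blocking function in a single pass."""
    for record in records:
        for bf in BFS:
            yield bf(record), record


records = [
    (1, "From The Cradle - Eric Clapton", "Blues", 1994),
    (2, "Marvin Gaye - Here, My Dear", "Soul", 1975),
    (4, "Eric Clapton - From the Cradle", "Blues", 1995),
]

# Group the emitted pairs by key to see the resulting blocks (DiscIDs only).
blocks: dict[str, list[int]] = defaultdict(list)
for key, record in multipass_map(records):
    blocks[key].append(record[0])

print(blocks["Blu199"])   # [1, 4]
print(blocks["MarvSou"])  # [2]
```

Each record appears in one block per blocking function, so a duplicate pair missed by one key can still meet under another.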

SLIDE 15

Map-Reduce Implementation Phase II - Sort Blocks & Match

  • identify unbalanced blocks

○ compare the record count of each block with a threshold
○ apply the reduce function only to blocks whose size does not exceed the threshold

  • reduce step (match step)

○ receives all records with the same key (here: the same block)
○ nested-loop pairwise comparison
○ outputs pairs of similar records
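
The two parts of this phase might look roughly like this. The names `split_blocks` and `reduce_match` are hypothetical, and the similarity measure (Jaccard overlap of the words in the disc name) is an assumption; the slides do not specify the metric at this point.

```python
# Assumed record layout: (DiscID, DiscName, Genre, Year)
Record = tuple[int, str, str, int]


def split_blocks(blocks: dict[str, list[Record]], threshold: int):
    """Separate blocks whose record count exceeds the threshold."""
    balanced, unbalanced = {}, {}
    for key, records in blocks.items():
        target = unbalanced if len(records) > threshold else balanced
        target[key] = records
    return balanced, unbalanced


def name_similarity(a: Record, b: Record) -> float:
    """Assumed metric: Jaccard overlap of the lower-cased name tokens."""
    tokens_a = set(a[1].lower().split())
    tokens_b = set(b[1].lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def reduce_match(records: list[Record], min_similarity: float = 0.8):
    """Nested-loop pairwise comparison inside one block."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if name_similarity(records[i], records[j]) >= min_similarity:
                pairs.append((records[i][0], records[j][0]))
    return pairs


blocks = {
    "Blues": [
        (1, "From The Cradle - Eric Clapton", "Blues", 1994),
        (3, "The Beatles - A Hard Day's Night", "Blues", 1964),
        (4, "Eric Clapton - From the Cradle", "Blues", 1995),
    ],
    "Soul": [
        (2, "Marvin Gaye - Here, My Dear", "Soul", 1975),
        (6, "Curtis Mayfield - Curtis", "Soul", 1970),
    ],
    "Rock": [(5, "Beatles - A Hard Day's Night", "Rock", 1964)],
}

balanced, unbalanced = split_blocks(blocks, threshold=2)
print(sorted(unbalanced))             # ['Blues']
print(reduce_match(blocks["Soul"]))   # []
# Shown on the Blues block only to illustrate the nested loop.
print(reduce_match(blocks["Blues"]))  # [(1, 4)]
```

The nested loop performs n(n-1)/2 comparisons for a block of n records, which is why oversized blocks are set aside instead of being matched directly.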

SLIDE 16

Map-Reduce Implementation Phase III - Second Blocking Step

  • only unbalanced blocks
  • map: expand blocking key from first blocking step

○ e.g. bf1(record) = {1st 3 letters of genre, 1st 3 digits of year} → bf1'(record) = {all letters of genre, all digits of year}
○ creates very fine-grained blocks

Example:
<Blu199 : 1 From The Cradle - Eric Clapton Blues 1994 ...> → <Blues1994 : 1 From The Cradle - Eric Clapton Blues 1994 ...>
<Blu199 : 4 Eric Clapton - From the Cradle Blues 1995 ...> → <Blues1995 : 4 Eric Clapton - From the Cradle Blues 1995 ...>
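
As a sketch, the key expansion amounts to dropping the truncation from the first blocking function. The function names and the tuple layout are assumptions made for illustration.

```python
# Assumed record layout: (DiscID, DiscName, Genre, Year)
Record = tuple[int, str, str, int]


def bf1(record: Record) -> str:
    """First blocking step: 3 letters of genre + 3 digits of year."""
    return record[2][:3] + str(record[3])[:3]


def bf1_expanded(record: Record) -> str:
    """Second blocking step: all letters of genre + all digits of year."""
    return record[2] + str(record[3])


unbalanced_block = [
    (1, "From The Cradle - Eric Clapton", "Blues", 1994),
    (4, "Eric Clapton - From the Cradle", "Blues", 1995),
]

rekeyed = [(bf1(r), bf1_expanded(r), r[0]) for r in unbalanced_block]
for old_key, new_key, disc_id in rekeyed:
    print(old_key, "->", new_key, "for record", disc_id)
# Blu199 -> Blues1994 for record 1
# Blu199 -> Blues1995 for record 4
```

The two duplicate records now fall into different blocks, which is the loss of true positives that the next slide addresses.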

SLIDE 17

Map-Reduce Implementation Phase III - Second Blocking Step

  • to avoid loss of true positives, use a 'sliding window approach'

○ create an index structure for fine-grained keys after map phase
○ compare with k-nearest neighbors
○ if the similarity is high enough, merge records with very similar keys to bigger blocks again

  • reduce step (match) is the same as in Phase II
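
One way to picture the sliding window over fine-grained keys is the sketch below. It is not the paper's implementation: a sorted list stands in for the index structure, the window size plays the role of k, and the key similarity (shared prefix length divided by the longer key length) is an assumption chosen for simplicity.

```python
def prefix_similarity(a: str, b: str) -> float:
    """Assumed key similarity: common prefix length / longer key length."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def merge_similar_keys(keys, window: int = 2, threshold: float = 0.8):
    """Slide a window over the sorted keys and group keys that are similar."""
    ordered = sorted(set(keys))
    parent = {key: key for key in ordered}

    def find(key: str) -> str:
        # Union-find lookup with path halving.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for i, key in enumerate(ordered):
        # Compare each key only with its next `window` neighbours.
        for neighbour in ordered[i + 1 : i + 1 + window]:
            if prefix_similarity(key, neighbour) >= threshold:
                parent[find(neighbour)] = find(key)

    groups: dict[str, list[str]] = {}
    for key in ordered:
        groups.setdefault(find(key), []).append(key)
    return list(groups.values())


keys = ["Blues1995", "Soul1970", "Blues1994", "Rock1964", "Soul1975"]
print(merge_similar_keys(keys))
# [['Blues1994', 'Blues1995'], ['Rock1964'], ['Soul1970', 'Soul1975']]
```

With these settings the keys Blues1994 and Blues1995 end up in one merged block again, so records 1 and 4 from the earlier example can still be compared in the reduce step.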
SLIDE 18

Map-Reduce Implementation Phase IV - Merge Pairs

  • short map-reduce operations to clean the output file

○ identify and remove replicated pairs
○ multipass produces the same detected pair more than once
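
Removing replicated pairs can be sketched as follows, assuming each detected pair is a tuple of two record IDs. The helper `merge_pairs` is a hypothetical name; it normalizes the order inside each pair so that (1, 4) and (4, 1) count as the same pair.

```python
def merge_pairs(pairs: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Drop replicated pairs, treating (a, b) and (b, a) as identical."""
    seen: set[tuple[int, int]] = set()
    unique = []
    for a, b in pairs:
        canonical = (a, b) if a <= b else (b, a)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


# The same pair found under several blocking keys during multipass.
detected = [(1, 4), (4, 1), (3, 5), (1, 4)]
print(merge_pairs(detected))  # [(1, 4), (3, 5)]
```

In MapReduce terms, the canonical pair would serve as the key, so that all replicas arrive at the same reducer and are emitted once.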

SLIDE 19

Evaluation

  • the Phoenix MR framework was used for the implementation (shared-memory architecture)
  • synthetic datasets generated by Febrl (1M, 2M, 4M records, each with 10% duplicates)

  • compared with BTO-BK
  • used different similarity metrics for different approaches
SLIDE 20

Relevance for the seminar

  • interesting and intuitive main idea
  • the paper is sometimes hard to understand because of weaknesses in its English
  • MR-specific implementation details are scarce
  • mapping the approach from a shared-memory architecture (Phoenix) onto a shared-nothing architecture (Hadoop, Stratosphere) will be challenging
  • to sum up the best ideas:

○ single-run multi-pass
○ load balancing through re-blocking

SLIDE 21

Sources

1. Dal Bianco, Guilherme, Renata Galante, and Carlos A. Heuser. A fast approach for parallel deduplication on multicore processors. In Proceedings of the ACM Symposium on Applied Computing, 2011.

SLIDE 22

Map-Reduce Implementation First MR-Step

  • map-step

○ emits (blocking-key, value)

  • identify unbalanced blocks
  • reduce-step (balanced blocks only)

○ similarity function
○ arithmetic average
○ find duplicate by threshold
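
One possible reading of these three sub-steps is sketched below: compute a similarity per attribute, take the arithmetic average, and compare it with a threshold. This interpretation, the use of `difflib.SequenceMatcher` as the per-attribute similarity function and the threshold value are all assumptions; the slides do not name the concrete metric.

```python
from difflib import SequenceMatcher

# Assumed record layout for matching: (DiscName, Genre, Year), all as strings
Record = tuple[str, str, str]


def field_similarity(a: str, b: str) -> float:
    """Assumed per-attribute similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Record, r2: Record) -> float:
    """Arithmetic average of the per-attribute similarities."""
    scores = [field_similarity(x, y) for x, y in zip(r1, r2)]
    return sum(scores) / len(scores)


def is_duplicate(r1: Record, r2: Record, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicate when the average reaches the threshold."""
    return record_similarity(r1, r2) >= threshold


marvin = ("Marvin Gaye - Here, My Dear", "Soul", "1975")
beatles = ("Beatles - A Hard Day's Night", "Rock", "1964")

print(is_duplicate(marvin, marvin))   # True
print(is_duplicate(marvin, beatles))  # False
```

Because every attribute carries the same weight in the average, a single wrong attribute (such as a mistyped genre) lowers the score noticeably, so the threshold has to be chosen with the expected data quality in mind.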

SLIDE 23

Map-Reduce Implementation Second MR-Step

  • map-step

○ emits expanded blocking-key

  • "sliding window sort" (binary search)

  • reduce-step

○ same as in First MR-Step