Semi-Joins and Bloom Join Databases: The Complete Book Ch 20 1 - - PowerPoint PPT Presentation

semi joins and bloom join
SMART_READER_LITE
LIVE PREVIEW

Semi-Joins and Bloom Join Databases: The Complete Book Ch 20 1 - - PowerPoint PPT Presentation

Semi-Joins and Bloom Join Databases: The Complete Book Ch 20 1 Practical Concerns UNION R 1 S 1 R 1 S 2 R 2 S 1 R N S M R 1 R 2 R N S 1 S 2 S M 2 Practical Concerns UNION R 1 S 1 R 1 S 2 R 2 S 1 R N


slide-1
SLIDE 1

Semi-Joins and Bloom Join

Databases: The Complete Book Ch 20

1

slide-2
SLIDE 2

Practical Concerns

2

R1⋈S1 R1⋈S2 R2⋈S1

RN⋈SM

R1 R2 RN

S1 S2 SM

UNION

slide-3
SLIDE 3

Practical Concerns

2

R1⋈S1 R1⋈S2 R2⋈S1

RN⋈SM

R1 R2 RN

S1 S2 SM

UNION

Where does the computation happen? How does the data get there?

slide-4
SLIDE 4

Distributing the Work

3

S ⋈B R

Let’s start simple… what can we do with no partitions?

slide-5
SLIDE 5

Distributing the Work

3

S ⋈B R

Let’s start simple… what can we do with no partitions? R and S may be any RA expression…

slide-6
SLIDE 6

Distributing the Work

4

S ⋈B R

Node 1

slide-7
SLIDE 7

Distributing the Work

4

S ⋈B R

Node 1

No Parallelism!

slide-8
SLIDE 8

Distributing the Work

5

S ⋈B R

Node 2 Node 1 Node 3

slide-9
SLIDE 9

Distributing the Work

5

S ⋈B R

Node 2 Node 1 Node 3

Lots of Data Transfer! All of R and All of S get sent!

slide-10
SLIDE 10

Distributing the Work

6

S ⋈B R

Node 2 Node 1

slide-11
SLIDE 11

Distributing the Work

6

S ⋈B R

Node 2 Node 1

All of R get sent Better! We can guess whether R or S is smaller.

slide-12
SLIDE 12

Distributing the Work

7

What can we do if R is partitioned?

R2 ⋈B S R1 ⋈B U

slide-13
SLIDE 13

Distributing the Work

8

There are lots of partitioning strategies, but this one is interesting….

R2 ⋈B S R1 ⋈B U

Node 2 Node 3 Node 1

slide-14
SLIDE 14

R2 ⋈B S1 R1 ⋈B U

Distributing the Work

9

… it can be used as a model for partitioning S…

Node 2 Node 3 Node 1

slide-15
SLIDE 15

R2 ⋈B S2 R1 ⋈B U

Distributing the Work

10

… it can be used as a model for partitioning S…

Node 2 Node 3 Node 1

slide-16
SLIDE 16

R2 ⋈B S R1 ⋈B U

Distributing the Work

11

…and neatly captures the data transfer issue.

Node 2 Node 3 Node 1

slide-17
SLIDE 17

Distributing the Work

12

Si joins with R1,R2,…,RN locally. So let’s use it:

slide-18
SLIDE 18

Distributing the Work

12

Si joins with R1,R2,…,RN locally. So let’s use it: Goal: Minimize amount of data sent from Rk to Si

slide-19
SLIDE 19

Distributing the Work

12

Si joins with R1,R2,…,RN locally. So let’s use it: Goal: Minimize amount of data sent from Rk to Si Solution 1: Use the partitioning strategy (like last lecture)

slide-20
SLIDE 20

Distributing the Work

12

Si joins with R1,R2,…,RN locally. So let’s use it: Goal: Minimize amount of data sent from Rk to Si Solution 1: Use the partitioning strategy (like last lecture) Solution 2: “Hints” to figure out what Rk should send

slide-21
SLIDE 21

Sending Hints

13

Node 1 Node 2

Rk Si

Rk ⋈B Si

The naive approach…

slide-22
SLIDE 22

Sending Hints

14

Node 1 Node 2

Rk Si

Rk ⋈B Si

The naive approach…

Send me Rk

slide-23
SLIDE 23

Sending Hints

15

Node 1 Node 2

Rk Si

Rk ⋈B Si

The naive approach…

Rk

slide-24
SLIDE 24

Sending Hints

16

Node 1 Node 2

Rk Si

Rk ⋈B Si

The smarter approach…

πB( )

Si

slide-25
SLIDE 25

Sending Hints

17

Node 1 Node 2

Rk Si

Rk ⋈B Si

The smarter approach…

πB( )

Si

⋈ πB( )

Rk Si

slide-26
SLIDE 26

Sending Hints

18

Node 1 Node 2

Rk ⋈B Si

The smarter approach…

<1,A> <2,B> <2,C> <3,D> <4,E> <2,X> <3,Y> <6,Y>

slide-27
SLIDE 27

Sending Hints

19

Node 1 Node 2

Rk ⋈B Si

The smarter approach…

<2,X>

Send me rows with a ‘B’ of 2,3, or 6

<3,Y> <1,A> <2,B> <2,C> <3,D> <4,E> <6,Y>

slide-28
SLIDE 28

Sending Hints

20

Node 1 Node 2

Rk ⋈B Si

The smarter approach…

<1,A> <2,B> <2,C> <3,D> <2,X>

Send me rows with a ‘B’ of 2,3, or 6

<3,Y>

<2,B> <2,C> <3,D>

<4,E> <6,Y>

slide-29
SLIDE 29

Sending Hints

20

Node 1 Node 2

Rk ⋈B Si

The smarter approach…

<1,A> <2,B> <2,C> <3,D> <2,X>

Send me rows with a ‘B’ of 2,3, or 6

<3,Y>

<2,B> <2,C> <3,D>

<4,E>

This is called a semi-join.

<6,Y>

slide-30
SLIDE 30

Sending Hints

21

Now Node 1 sends as little data as possible… … but Node 2 needs to send a lot of data. Can we do better?

slide-31
SLIDE 31

Sending Hints

22

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 1: Parity Bits

1 1

<6,Y>

slide-32
SLIDE 32

Sending Hints

23

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 1: Parity Bits

1 1

Send me data with a parity bit of ‘0’

<6,Y>

slide-33
SLIDE 33

Sending Hints

24

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 1: Parity Bit

1 1

Send me data with a parity bit of ‘0’

<2,B> <2,C> <4,E>

<6,Y>

slide-34
SLIDE 34

Sending Hints

24

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 1: Parity Bit

1 1

Send me data with a parity bit of ‘0’

<2,B> <2,C> <4,E>

Node 1 sending too much is ok! (Node 2 still needs to compute ⋈B)

<6,Y>

slide-35
SLIDE 35

Sending Hints

24

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 1: Parity Bit

1 1

Send me data with a parity bit of ‘0’

<2,B> <2,C> <4,E>

Node 1 sending too much is ok! (Node 2 still needs to compute ⋈B)

<6,Y>

Problem: One parity bit is too little

slide-36
SLIDE 36

Sending Hints

25

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 1: Parity Bit

1 1 1

Problem: One parity bit is too little

<3,Y> <6,Y>

slide-37
SLIDE 37

Sending Hints

26

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 2: Parity Bits

01 10 10 00 11 10 11<3,Y>

<6,Y>

10

Send me data with parity bits 10 or 11

<2,B> <2,C> <3,D>

slide-38
SLIDE 38

Sending Hints

26

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 2: Parity Bits

01 10 10 00 11 10 11<3,Y>

<6,Y>

10

Send me data with parity bits 10 or 11

<2,B> <2,C> <3,D>

Problem: Almost as much data as πB

slide-39
SLIDE 39

Sending Hints

27

Can we summarize the parity bits?

slide-40
SLIDE 40

Bloom Filters

28

Alice Bob Carol Dave

slide-41
SLIDE 41

Bloom Filters

29

Bloom Filter

Alice Bob Carol Dave

slide-42
SLIDE 42

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?
slide-43
SLIDE 43

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?

Yes

slide-44
SLIDE 44

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?

Is Eve part of the set? Yes

slide-45
SLIDE 45

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?

Is Eve part of the set? Yes No

slide-46
SLIDE 46

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?

Is Eve part of the set? Is Fred part

  • f the set?

Yes No

slide-47
SLIDE 47

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?

Is Eve part of the set? Is Fred part

  • f the set?

Yes No Yes

slide-48
SLIDE 48

Bloom Filters

30

Bloom Filter

Alice Bob Carol Dave

Is Alice part

  • f the set?

Is Eve part of the set? Is Fred part

  • f the set?

Yes No Yes

Bloom Filter Guarantee Test definitely returns Yes if the element is in the set Test usually returns No if the element is not in the set

slide-49
SLIDE 49

Bloom Filters

31

A Bloom Filter is a bit vector M - # of bits in the bit vector K - # of hash functions For ONE key (or record): For i between 0 and K: bitvector[ hashi (key) % M ] = 1

slide-50
SLIDE 50

Bloom Filters

31

A Bloom Filter is a bit vector M - # of bits in the bit vector K - # of hash functions For ONE key (or record): For i between 0 and K: bitvector[ hashi (key) % M ] = 1

Each bit vector has ~K bits set

slide-51
SLIDE 51

Bloom Filters

32

00101010 01010110 10000110 01001100 Key 1 Key 2 Key 3 Key 4

Filters are combined by Bitwise-OR e.g. (Key 1 | Key 2) = 01111110

slide-52
SLIDE 52

Bloom Filters

32

00101010 01010110 10000110 01001100 Key 1 Key 2 Key 3 Key 4

Filters are combined by Bitwise-OR e.g. (Key 1 | Key 2) = 01111110 How do we test for inclusion?

slide-53
SLIDE 53

Bloom Filters

32

00101010 01010110 10000110 01001100 Key 1 Key 2 Key 3 Key 4

Filters are combined by Bitwise-OR e.g. (Key 1 | Key 2) = 01111110 How do we test for inclusion? (Key & Filter) == Key? (Key 1 & S) = 00101010 (Key 3 & S) = 00000110 (Key 4 & S) = 01001100

slide-54
SLIDE 54

Bloom Filters

32

00101010 01010110 10000110 01001100 Key 1 Key 2 Key 3 Key 4

Filters are combined by Bitwise-OR e.g. (Key 1 | Key 2) = 01111110 How do we test for inclusion? (Key & Filter) == Key? (Key 1 & S) = 00101010 (Key 3 & S) = 00000110 (Key 4 & S) = 01001100

slide-55
SLIDE 55

Bloom Filters

32

00101010 01010110 10000110 01001100 Key 1 Key 2 Key 3 Key 4

Filters are combined by Bitwise-OR e.g. (Key 1 | Key 2) = 01111110 How do we test for inclusion? (Key & Filter) == Key? (Key 1 & S) = 00101010 (Key 3 & S) = 00000110 (Key 4 & S) = 01001100

X √

slide-56
SLIDE 56

Bloom Filters

32

00101010 01010110 10000110 01001100 Key 1 Key 2 Key 3 Key 4

Filters are combined by Bitwise-OR e.g. (Key 1 | Key 2) = 01111110 How do we test for inclusion? (Key & Filter) == Key? (Key 1 & S) = 00101010 (Key 3 & S) = 00000110 (Key 4 & S) = 01001100

X √ False Positive √

slide-57
SLIDE 57

Sending Hints

33

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 3: Bloom Filters

<3,Y> <6,Y>

slide-58
SLIDE 58

Sending Hints

34

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 3: Bloom Filters

<3,Y> <6,Y>

Send me rows with a ‘B’ in the bloom filter summarizing the set {2,3,6}

slide-59
SLIDE 59

Sending Hints

35

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 3: Bloom Filters

<3,Y> <6,Y>

<2,B> <2,C> <3,D> <4,E>

Send me rows with a ‘B’ in the bloom filter summarizing the set {2,3,6}

slide-60
SLIDE 60

Sending Hints

35

Node 1 Node 2

Rk ⋈B Si

<1,A> <2,B> <2,C> <3,D> <2,X> <4,E>

Strategy 3: Bloom Filters

<3,Y> <6,Y>

<2,B> <2,C> <3,D> <4,E>

This is called a bloom-join.

Send me rows with a ‘B’ in the bloom filter summarizing the set {2,3,6}