Hash Tables Lecture # 06 Database Systems Andy Pavlo AP AP - - PowerPoint PPT Presentation

hash tables
SMART_READER_LITE
LIVE PREVIEW

Hash Tables Lecture # 06 Database Systems Andy Pavlo AP AP - - PowerPoint PPT Presentation

Hash Tables Lecture # 06 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018 2 UPCO M IN G DATABASE EVEN TS MapD Talk Thursday Sept 20 th @ 12:00pm CIC 4 th Floor CMU 15-445/645 (Fall


slide-1
SLIDE 1

Database Systems 15-445/15-645 Fall 2018 Andy Pavlo Computer Science Carnegie Mellon Univ.

AP AP

Lecture # 06

Hash Tables

slide-2
SLIDE 2

CMU 15-445/645 (Fall 2018)

UPCO M IN G DATABASE EVEN TS

MapD Talk

→ Thursday Sept 20th @ 12:00pm → CIC 4th Floor

2

slide-3
SLIDE 3

CMU 15-445/645 (Fall 2018)

ADM IN ISTRIVIA

Project #1 is due Wednesday Sept 26th @ 11:59pm Homework #2 is due Friday Sept 28th @ 11:59pm

3

slide-4
SLIDE 4

CMU 15-445/645 (Fall 2018)

REM IN DER

If you have a question during the lecture, raise your hand and stop me. Do not come up to the front after the lecture. There are no stupid questions(*).

4

slide-5
SLIDE 5

CMU 15-445/645 (Fall 2018)

CO URSE STATUS

We are now going to talk about how to support the DBMS's execution engine to read/write data from pages. Two types of data structures:

→ Hash Tables → Trees

5

Query Planning Operator Execution Access Methods Buffer Pool Manager Disk Manager

slide-6
SLIDE 6

CMU 15-445/645 (Fall 2018)

DATA STRUCTURES

Internal Meta-data Core Data Storage Temporary Data Structures Table Indexes

6

slide-7
SLIDE 7

CMU 15-445/645 (Fall 2018)

DESIGN DECISIO N S

Data Organization

→ How we layout data structure in memory/pages and what information to store to support efficient access.

Concurrency

→ How to enable multiple threads to access the data structure at the same time without causing problems.

7

slide-8
SLIDE 8

CMU 15-445/645 (Fall 2018)

H ASH TABLES

A hash table implements an associative array abstract data type that maps keys to values. It uses a hash function to compute an offset into the array, from which the desired value can be found.

8

slide-9
SLIDE 9

CMU 15-445/645 (Fall 2018)

STATIC H ASH TABLE

Allocate a giant array that has one slot for every element that you need to record. To find an entry, mod the key by the number of elements to find the offset in the array.

9

hash(key)

1 2 n abc def xyz Ø

slide-10
SLIDE 10

CMU 15-445/645 (Fall 2018)

STATIC H ASH TABLE

Allocate a giant array that has one slot for every element that you need to record. To find an entry, mod the key by the number of elements to find the offset in the array.

9

hash(key)

1 2 n abcdefghi defghijk xyz123

slide-11
SLIDE 11

CMU 15-445/645 (Fall 2018)

ASSUM PTIO N S

You know the number of elements ahead of time. Each key is unique. Perfect hash function.

→ If key1≠key2, then hash(key1)≠hash(key2)

10

hash(key)

1 2 n abcdefghi defghijk xyz123

slide-12
SLIDE 12

CMU 15-445/645 (Fall 2018)

H ASH TABLE

Design Decision #1: Hash Function

→ How to map a large key space into a smaller domain. → Trade-off between being fast vs. collision rate.

Design Decision #2: Hashing Scheme

→ How to handle key collisions after hashing. → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.

11

slide-13
SLIDE 13

CMU 15-445/645 (Fall 2018)

TO DAY'S AGEN DA

Hash Functions Static Hashing Schemes Dynamic Hashing Schemes

12

slide-14
SLIDE 14

CMU 15-445/645 (Fall 2018)

H ASH FUN CTIO N S

We don’t want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate.

13

slide-15
SLIDE 15

CMU 15-445/645 (Fall 2018)

H ASH FUN CTIO N S

MurmurHash (2008)

→ Designed to a fast, general purpose hash function.

Google CityHash (2011)

→ Based on ideas from MurmurHash2 → Designed to be faster for short keys (<64 bytes).

Google FarmHash (2014)

→ Newer version of CityHash with better collision rates.

CLHash (2016)

→ Fast hashing function based on carry-less multiplication.

14

slide-16
SLIDE 16

CMU 15-445/645 (Fall 2018)

H ASH FUN CTIO N BEN CH M ARKS

15

6000 12000 18000 1 51 101 151 201 251

Throughput (MB/sec) Key Size (bytes) std::hash MurmurHash3 CityHash FarmHash CLHash

Source: Fredrik Widlund

Intel Core i7-8700K @ 3.70GHz

32 64 128 192

slide-17
SLIDE 17

CMU 15-445/645 (Fall 2018)

H ASH FUN CTIO N BEN CH M ARKS

16

12000 24000 36000 1 51 101 151 201 251

Throughput (MB/sec) Key Size (bytes) std::hash MurmurHash3 CityHash FarmHash CLHash

Source: Fredrik Widlund

Intel Core i7-8700K @ 3.70GHz

32 64 128 192

slide-18
SLIDE 18

CMU 15-445/645 (Fall 2018)

STATIC H ASH IN G SCH EM ES

Approach #1: Linear Probe Hashing Approach #2: Robin Hood Hashing Approach #3: Cuckoo Hashing

17

slide-19
SLIDE 19

CMU 15-445/645 (Fall 2018)

LIN EAR PRO BE H ASH IN G

Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.

→ To determine whether an element is present, hash to a location in the index and scan for it. → Have to store the key in the index to know when to stop scanning. → Insertions and deletions are generalizations of lookups.

18

slide-20
SLIDE 20

CMU 15-445/645 (Fall 2018)

<key>|<value>

LIN EAR PRO BE H ASH IN G

19

A B C D hash(key) | val

A

E F

slide-21
SLIDE 21

CMU 15-445/645 (Fall 2018)

LIN EAR PRO BE H ASH IN G

19

A B C D hash(key) | val

A

| val

B

E F

slide-22
SLIDE 22

CMU 15-445/645 (Fall 2018)

LIN EAR PRO BE H ASH IN G

19

A B C D hash(key) | val

A

| val

B

| val

C

E F

slide-23
SLIDE 23

CMU 15-445/645 (Fall 2018)

LIN EAR PRO BE H ASH IN G

19

A B C D hash(key) | val

A

| val

B

| val

C

| val

D

E F

slide-24
SLIDE 24

CMU 15-445/645 (Fall 2018)

LIN EAR PRO BE H ASH IN G

19

A B C D hash(key) | val

A

| val

B

| val

C

| val

D

E | val

E

F

slide-25
SLIDE 25

CMU 15-445/645 (Fall 2018)

LIN EAR PRO BE H ASH IN G

19

A B C D hash(key) | val

A

| val

B

| val

C

| val

D

E | val

E

F | val

F

slide-26
SLIDE 26

CMU 15-445/645 (Fall 2018)

N O N- UN IQ UE KEYS

Choice #1: Separate Linked List

→ Store values in separate storage area for each key.

20

slide-27
SLIDE 27

CMU 15-445/645 (Fall 2018)

N O N- UN IQ UE KEYS

Choice #1: Separate Linked List

→ Store values in separate storage area for each key.

Choice #2: Redundant Keys

→ Store duplicate keys entries together in the hash table.

20

XYZ ABC

value1 value2 value3

Value Lists

value1 value2

XYZ|value1 ABC|value1 XYZ|value2 XYZ|value3 ABC|value2

slide-28
SLIDE 28

CMU 15-445/645 (Fall 2018)

O BSERVATIO N

To reduce the # of wasteful comparisons, it is important to avoid collisions of hashed keys. This requires a hash table with ~2x the number of slots as the number of elements.

21

slide-29
SLIDE 29

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

Variant of linear probe hashing that steals slots from "rich" keys and give them to "poor" keys.

→ Each key tracks the number of positions they are from where its optimal position in the table. → On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.

22

slide-30
SLIDE 30

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

E

# of "Jumps" From First Position

F

slide-31
SLIDE 31

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

E F

slide-32
SLIDE 32

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

E F

A[0] == C[0]

slide-33
SLIDE 33

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

| val [1]

D

E F

C[1] > D[0]

slide-34
SLIDE 34

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

| val [1]

D

E

A[0] == E[0]

F

slide-35
SLIDE 35

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

| val [1]

D

E

A[0] == E[0] C[1] == E[1]

F

slide-36
SLIDE 36

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

| val [1]

D

E

A[0] == E[0] C[1] == E[1] D[1] < E[2]

F

slide-37
SLIDE 37

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

E | val [2]

E A[0] == E[0] C[1] == E[1] D[1] < E[2]

F | val [2]

D

slide-38
SLIDE 38

CMU 15-445/645 (Fall 2018)

RO BIN H O O D H ASH IN G

23

A B C D hash(key) | val [0]

A

| val [0]

B

| val [1]

C

E | val [2]

E

F | val [2]

D

| val [1]

F D[2] > F[0]

slide-39
SLIDE 39

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

Use multiple hash tables with different hash functions.

→ On insert, check every table and pick anyone that has a free slot. → If no table has a free slot, evict the element from one of them and then re-hash it find a new location.

Look-ups and deletions are always O(1) because

  • nly one location per hash table is checked.

24

slide-40
SLIDE 40

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

slide-41
SLIDE 41

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

A|val

slide-42
SLIDE 42

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

A|val

slide-43
SLIDE 43

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

B|val A|val

slide-44
SLIDE 44

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

Insert C

hash1(C) hash2(C)

B|val A|val

slide-45
SLIDE 45

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

Insert C

hash1(C) hash2(C)

A|val C|val

slide-46
SLIDE 46

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

Insert C

hash1(C) hash2(C) hash1(B)

A|val C|val

slide-47
SLIDE 47

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

Insert C

hash1(C) hash2(C) hash1(B)

C|val B|val

slide-48
SLIDE 48

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

25

Hash Table #1

Hash Table #2

Insert A

hash1(A) hash2(A)

Insert B

hash1(B) hash2(B)

Insert C

hash1(C) hash2(C) hash1(B) hash2(A)

A|val C|val B|val

slide-49
SLIDE 49

CMU 15-445/645 (Fall 2018)

CUCKO O H ASH IN G

Make sure that we don’t get stuck in an infinite loop when moving keys. If we find a cycle, then we can rebuild the entire hash tables with new hash functions.

→ With two hash functions, we (probably) won’t need to rebuild the table until it is at about 50% full. → With three hash functions, we (probably) won’t need to rebuild the table until it is at about 90% full.

26

slide-50
SLIDE 50

CMU 15-445/645 (Fall 2018)

O BSERVATIO N

The previous hash tables require knowing the number of elements you want to store ahead of time.

→ Otherwise you have rebuild the entire table if you need to grow/shrink.

27

SELECT A.id FROM A, B WHERE A.id = B.id ORDER BY A.id

A B

τ A.id

A.id

σ

A.id=B.id

slide-51
SLIDE 51

CMU 15-445/645 (Fall 2018)

O BSERVATIO N

The previous hash tables require knowing the number of elements you want to store ahead of time.

→ Otherwise you have rebuild the entire table if you need to grow/shrink.

Dynamic hash tables are able to grow/shrink on demand.

→ Extendible Hashing → Linear Hashing

27

Directory

Page Data Page Data Page Data

Page Table

slide-52
SLIDE 52

CMU 15-445/645 (Fall 2018)

CH AIN ED H ASH IN G

Maintain a linked list of buckets for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket.

→ To determine whether an element is present, hash to its bucket and scan for it. → Insertions and deletions are generalizations of lookups.

28

slide-53
SLIDE 53

CMU 15-445/645 (Fall 2018)

CH AIN ED H ASH IN G

29

Ø

hash(key)

⋮ ⋮

Buckets

slide-54
SLIDE 54

CMU 15-445/645 (Fall 2018)

CH AIN ED H ASH IN G

29

Ø

hash(key)

⋮ ⋮

Buckets

slide-55
SLIDE 55

CMU 15-445/645 (Fall 2018)

CH AIN ED H ASH IN G

The hash table can grow infinitely because you just keep adding new buckets to the linked list. You only need to take a latch on the bucket to store a new entry or extend the linked list.

30

slide-56
SLIDE 56

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

Chained-hashing approach where we split buckets instead of letting the linked list grow forever. This requires reshuffling entries on split, but the change is localized.

31

slide-57
SLIDE 57

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

slide-58
SLIDE 58

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

hash(A) = 01110… Find A

slide-59
SLIDE 59

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

hash(A) = 01110… Find A

slide-60
SLIDE 60

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

hash(A) = 01110… Find A hash(B) = 10111… Insert B

slide-61
SLIDE 61

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

hash(A) = 01110… Find A hash(B) = 10111… Insert B

10111…

slide-62
SLIDE 62

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

hash(A) = 01110… Find A hash(B) = 10111… Insert B hash(C) = 10100… Insert C

10111…

slide-63
SLIDE 63

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

01… 00… 10… 11…

local local local

00010… 01110…

1

10101… 10011…

2

11010…

2

hash(A) = 01110… Find A hash(B) = 10111… Insert B hash(C) = 10100… Insert C

10111…

slide-64
SLIDE 64

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

00010… 01110…

1

11010…

2

hash(A) = 01110… Find A hash(B) = 10111… Insert B hash(C) = 10100… Insert C

010… 000… 100… 110… 011… 001… 101… 111…

3

10011…

3

10101… 10111…

3

slide-65
SLIDE 65

CMU 15-445/645 (Fall 2018)

EXTEN DIBLE H ASH IN G

32

global 2

00010… 01110…

1

11010…

2

hash(A) = 01110… Find A hash(B) = 10111… Insert B hash(C) = 10100… Insert C

010… 000… 100… 110… 011… 001… 101… 111…

3

10011…

3

10101… 10111…

3

10100…

slide-66
SLIDE 66

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

Maintain a pointer that tracks the next bucket to split. When any bucket overflows, split the bucket at the pointer location. Overflow criterion is left up to the implementation.

→ Space Utilization → Average Length of Overflow Chains

33

slide-67
SLIDE 67

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8 5 9 13 6 7 11

Split Pointer hash1(key) = key % n

20

slide-68
SLIDE 68

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

20

slide-69
SLIDE 69

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17

20

slide-70
SLIDE 70

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17

17 20

Overflow!

slide-71
SLIDE 71

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17

17

4

hash2(key) = key % 2n

20

Overflow!

slide-72
SLIDE 72

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17

17

4

20

hash2(key) = key % 2n

slide-73
SLIDE 73

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17

17

4

20

hash2(key) = key % 2n

slide-74
SLIDE 74

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17 hash1(20) = 20%4 = 0 Find 20

17

4

20

hash2(key) = key % 2n

slide-75
SLIDE 75

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

34

1 2 3

8

hash1(6) = 6%4 = 2 Find 6

5 9 13 6 7 11

Split Pointer hash1(key) = key % n

hash1(17) = 17%4 = 1 Insert 17 hash1(20) = 20%4 = 0 Find 20

17

4

20

hash2(key) = key % 2n

hash2(20) = 20%8 = 4

slide-76
SLIDE 76

CMU 15-445/645 (Fall 2018)

LIN EAR H ASH IN G

Splitting buckets based on the split pointer will eventually get to all overflowed buckets.

→ When the pointer reaches the last slot, delete the first hash function and move back to beginning.

The pointer can also move backwards when buckets are empty.

35

slide-77
SLIDE 77

CMU 15-445/645 (Fall 2018)

CO N CLUSIO N

Fast data structures that support O(1) look-ups that are used all throughout the DBMS internals.

→ Trade-off between speed and flexibility.

Hash tables are usually not what you want to use for a table index… Postgres Demo

36

slide-78
SLIDE 78

CMU 15-445/645 (Fall 2018)

N EXT CLASS

B+Trees Skip Lists Radix Trees

37