

SLIDE 1

Online Bigtable merge compaction

Claire Mathieu (CNRS Paris), Carl Staelin (Google Haifa), Neal E. Young¹ (UC Riverside), Arman Yousefi (UCLA)

Northeastern University, September 17, 2015

¹ Funded by a faculty re$earch award.

SLIDE 2

BIGTABLE — data storage at Google

Google Maps, Search/Crawl, Gmail, ... use BIGTABLE to store data.

- 24,500 Bigtable servers
- 1.2 million requests per second
- 16 GB/s of outgoing RPC traffic
- over a petabyte of data just for Google Crawl and Analytics
- (these figures are from 2006)

Similar to other "NoSQL" databases: Accumulo, AsterixDB, Cassandra, HBase, Hypertable, Spanner, ... Used by Adobe, Ebay, Facebook, GitHub, Meetup, Netflix, Twitter, ...

"Log-structured merge tree" architecture — for high-volume, highly reliable, distributed, real-time data storage.

SLIDE 3

BIGTABLE — implements the dictionary data type

Operations supported by a Bigtable instance:

- write(key, value)
- read(key) — return the most recent value written for key

... there's more, but not today ...

SLIDES 4–13

BIGTABLE — writes and flushes

write(key, value):
1. Store the key/value pair in the cache (e.g. a hash table in RAM).

The environment periodically forces a flush of the cache to a new immutable disk file.

Example

write(1, a); write(2, b); write(3, c); write(4, d);
  cache: (1, a) (2, b) (3, c) (4, d)    file sequence: –empty–
flush();
  cache: –empty–    file sequence: [(1, a) (2, b) (3, c) (4, d)]  ← from 1st flush
write(5, e); write(6, f); write(7, g); flush();
  cache: –empty–    file sequence: [(1, a) (2, b) (3, c) (4, d)] [(5, e) (6, f) (7, g)]  ← from 1st and 2nd flushes
write(8, h); write(9, i); flush();
  cache: –empty–    file sequence: [(1, a) (2, b) (3, c) (4, d)] [(5, e) (6, f) (7, g)] [(8, h) (9, i)]  ← from 1st, 2nd, and 3rd flushes

The environment forces flushes at arbitrary times.
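The write/flush mechanics above can be sketched in a few lines (a toy model for illustration, not Bigtable's actual implementation; the name MiniLSM is mine):

```python
# Minimal sketch of the write/flush mechanics: writes go to an in-RAM
# cache; a flush moves the cache's contents to a new immutable "file"
# appended to the file sequence.

class MiniLSM:
    def __init__(self):
        self.cache = {}          # in-RAM cache (hash table)
        self.files = []          # sequence of immutable files, oldest first

    def write(self, key, value):
        self.cache[key] = value  # 1. store key/value pair in cache

    def flush(self):
        # The environment forces this at arbitrary times.
        if self.cache:
            self.files.append(dict(self.cache))  # new immutable disk file
            self.cache = {}

db = MiniLSM()
for k, v in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    db.write(k, v)
db.flush()
for k, v in [(5, "e"), (6, "f"), (7, "g")]:
    db.write(k, v)
db.flush()
print(len(db.files))  # → 2, matching the first two flushes in the example
```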

SLIDES 14–16

BIGTABLE — reads and compactions

cache: –empty–    file sequence: [(1, a) (2, b) (3, c) (4, d)] [(5, e) (6, f) (7, g)] [(8, h) (9, i)]

read(key):
1. Check the cache for key.
2. If not found, check the files (most recent first).  ← cost = O(#files)

compaction():  ← asynchronous background process, to reduce read costs
Periodically select files to merge.  ← cost = O(SIZE of the merged files)!

After merging the files from the 2nd and 3rd flushes:
  file sequence: [(1, a) (2, b) (3, c) (4, d)] [(5, e) (6, f) (7, g) (8, h) (9, i)]

goals: (i) keep read costs low; (ii) keep compaction costs low.
constraint: each merge must merge a contiguous subsequence of files.
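Continuing the toy sketch (helper names are mine): a read checks the cache and then the files newest-first, and a compaction merges a contiguous run of files at a cost equal to their total size:

```python
# Sketch of read and merge-compaction over an immutable file sequence.
# "Most recent" = last file in the list.

def read(cache, files, key):
    if key in cache:                 # 1. check cache for key
        return cache[key]
    for f in reversed(files):        # 2. check files, most recent first
        if key in f:                 #    cost grows with #files
            return f[key]
    return None

def compact(files, i, j):
    """Merge the contiguous subsequence files[i:j] into one file.
    Returns the merge cost = total size of the merged files."""
    cost = sum(len(f) for f in files[i:j])
    merged = {}
    for f in files[i:j]:             # oldest first, so newer entries win
        merged.update(f)
    files[i:j] = [merged]
    return cost

files = [{1: "a", 2: "b", 3: "c", 4: "d"},
         {5: "e", 6: "f", 7: "g"},
         {8: "h", 9: "i"}]
print(compact(files, 1, 3))          # merge 2nd and 3rd files → 5
print(len(files), read({}, files, 9))
```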

SLIDES 17–27

Bigtable Merge Compaction (BMC) — formal definition

given: A sequence x1, x2, ..., xn  ← xt is the size of the file resulting from flush t
       An integer k > 0  ← tuned to the workload; typically 3–40

choose: Compactions, ensuring the number of files never exceeds k.

objective: Minimize the total compaction cost.

If k = ∞, the problem is easy — never merge. Total compaction cost = 0.

If k = 1, the problem is easy — everything must be merged after each flush:

after flush 2:  ← compaction cost x1 + x2
after flush 3:  ← compaction cost x1 + x2 + x3
...
after flush n:  ← compaction cost x1 + ··· + xn

Total compaction cost = ∑_{i=2}^{n} (x1 + x2 + ··· + xi) ≈ ∑_{i=1}^{n} (n − i + 1) xi.
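The k = 1 total cost is easy to check numerically (a small sketch; the function name is mine):

```python
# Checking the k = 1 cost: merging everything after every flush costs
# x1+x2 after flush 2, x1+x2+x3 after flush 3, and so on, which totals
# the sum over i = 2..n of (x1 + ... + xi).

def total_cost_k1(x):
    total, prefix = 0, x[0]
    for xi in x[1:]:
        prefix += xi
        total += prefix          # merge all files: cost = prefix sum
    return total

x = [1] * 10                     # uniform input, n = 10
print(total_cost_k1(x))          # → 2 + 3 + ... + 10 = 54
```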

SLIDES 28–37

Google's default compaction algorithm:

Merge the minimal suffix of files so as to maintain (i) #files ≤ k and (ii) each file's size exceeds the total size of the files to its right.

Example: k = 2, on the uniform input x = 1, 1, 1, ... (steps 1–5 animated on the slides).

Total compaction cost = Θ(n²).

For general k, the cost is Θ(n² / 3^(k−1)).
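The default policy's quadratic behavior on uniform input can be simulated (this is my reading of the two invariants above, not Google's code):

```python
# Simulate the default policy: after each flush, merge the minimal
# suffix of files restoring (i) #files <= k and (ii) each file's size
# strictly exceeds the total size of the files to its right.

def invariants_ok(sizes, k):
    if len(sizes) > k:
        return False
    suffix = 0
    for s in reversed(sizes):
        if s <= suffix:          # must exceed total size to the right
            return False
        suffix += s
    return True

def default_cost(n, k):
    sizes, cost = [], 0
    for _ in range(n):
        sizes.append(1)          # flush of a size-1 file (uniform input)
        for m in range(1, len(sizes) + 1):   # m = 1 means "no merge"
            candidate = sizes[:-m] + [sum(sizes[-m:])]
            if invariants_ok(candidate, k):
                if m > 1:
                    cost += candidate[-1]    # cost = size of merged files
                sizes = candidate
                break
    return cost

print(default_cost(40, 2), default_cost(400, 2))  # grows roughly like n**2
```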

SLIDE 38

OPTIMAL solution for k = 2, uniform x = 1, 1, 1, ...

← "big" merges: O(√n) of them, each of size O(n)
← "small" merges: O(n) of them, each of size O(√n)

Total compaction cost = O(n^(3/2)).

For general k, OPT's cost is Θ(k · n^(1+1/k)).
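The √n schedule above can be sketched for uniform input (my construction, following the slide; the offline optimum for non-uniform inputs would differ):

```python
# k = 2 schedule on uniform input: keep one "big" file and one "small"
# file; fold each flushed file into the small one, and fold the small
# file into the big one whenever it reaches size ~ sqrt(n).

import math

def sqrt_schedule_cost(n):
    period = max(1, round(math.sqrt(n)))
    big, small, cost = 0, 0, 0
    for _ in range(n):
        if small == 0:
            small = 1            # flushed file simply becomes the small file
        else:
            small += 1           # small merge, cost O(sqrt(n))
            cost += small
        if small >= period:
            big += small         # big merge, cost O(n)
            cost += big
            small = 0
    return cost

print(sqrt_schedule_cost(10_000))  # ~ c * n**1.5, far below the default's n**2
```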

SLIDE 39

Definition: c-competitive online algorithm

A compaction algorithm is c-competitive if, on any input (k, x), its solution costs at most c times the optimal cost. A compaction algorithm is online if its choice of merge after flush t depends only on k and x1, x2, ..., xt (the files flushed so far).

- The default algorithm's cost can be n times OPT's cost (for any k).
- So the default is no better than n-competitive.
  → It may have high compaction cost even on "easy" inputs.

Theorem 1. There is a k-competitive online algorithm for BMC.  ← today

Theorem 2. No deterministic online algorithm is less than k-competitive.

SLIDES 40–47

Idea behind the 2-competitive online algorithm (for k = 2) ...

Q: At each step, do a "big" merge or a small merge?
A: Do a big merge when the cost C of the previous big merge ≈ the total cost of the small merges since then.
(So the algorithm's cost during each interval between big merges is 2C.)

Why 2-competitive? Focus on a time interval between two big merges.
case 1 (during this interval, OPT does a big merge): OPT's cost for its big merge during the interval is at least C.
case 2 (during this interval, OPT does no big merge): OPT's cost for small merges during the interval is at least C.

SLIDE 48

Idea behind the k-competitive online algorithm for general k

idea: Do a big merge, then recurse with k − 1 on the remaining files.
Q: When to do the next big merge?
A: When the cost of the previous big merge ≈ (the cost of the recursion)/(k − 1).

The "balanced rent-or-buy algorithm" (BRB).
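The k = 2 base case of the rent-or-buy rule (the 2-competitive algorithm of the previous slides) can be sketched as follows (my online implementation, not the paper's exact pseudocode):

```python
# k = 2 "balanced rent-or-buy" sketch: do small merges until their total
# cost since the last big merge reaches that big merge's cost C, then do
# the next big merge. File sizes may be arbitrary, not just uniform.

def brb2(xs):
    big, small = 0, 0        # sizes of the (at most two) files on disk
    C = 0                    # cost of the previous big merge
    small_cost = 0           # small-merge cost accumulated since then
    total = 0                # total compaction cost
    for x in xs:
        if big == 0:
            big = x          # first flush: the file just lands on disk
        elif small_cost + small + x >= C:
            big += small + x             # big merge: fold everything in
            total += big
            C, small, small_cost = big, 0, 0
        elif small == 0:
            small = x        # new file becomes the small file (no merge)
        else:
            small += x       # small merge: fold the new file into small
            total += small
            small_cost += small
    return total

print(brb2([1] * 10_000))    # grows like n**1.5, versus ~n**2 for the default
```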

SLIDE 49

Recap of analyses in the worst-case model

The Bigtable default is at best n-competitive...

Theorem 1. BRB is a k-competitive online algorithm for BMC.  ← today

Theorem 2. No deterministic online algorithm is less than k-competitive.

What about "typical" inputs?

SLIDE 50

Preliminary benchmarks (one example with k = 5)

[Two plots of cost per step vs. n, comparing Default, BRB, and Optimal.]

The xt's are i.i.d. from a log-normal distribution.

Conjectures
1. BRB's and OPT's cost per time step ∼ x · k · n^(1/k) / e.
2. Default's cost per time step ∼ x · n / (2 · 3^(k−1)).
SLIDE 51

Lots of work in progress

theoretical:
- average-case analyses: absolute and relative costs on i.i.d. inputs
- randomized online algorithms (o(k)-competitive?)
- optimal compaction schedules ≡ optimal binary search trees

practical:
- realistic testing ... on AsterixDB, then at Google

problem variants:
- allow expiration/deletion of key/value pairs (done)
- allowing k to vary — BMC with read costs ... (open!)

Working paper available on arxiv.org. (Search the web for "bigtable merge compaction".)

SLIDES 52–53

BMC with read costs (geometric interpretation)

given: Staircase step-lengths and step-heights (x1, y1), (x2, y2), ...
do: Partition the region below the staircase into axis-parallel rectangles.
objective: Minimize the sum of the widths and heights of the rectangles.

[Figure: a staircase with steps (x1, y1), ..., (x7, y7).]

Open problem: is there an O(1)-competitive online algorithm?
SLIDE 54

Thank you

SLIDES 55–57

A geometric interpretation of BMC

given: An uneven staircase with step-lengths x1, x2, ..., xn; an integer k > 0.
do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.
objective: Minimize the sum of the widths of the rectangles.

[Figures: an uneven staircase with 10 steps and k = 2; one valid solution; and a cheaper partition that is not a solution, since it is not valid for k = 2.]