SLIDE 1 Online Bigtable merge compaction
Claire Mathieu
CNRS Paris
Carl Staelin
Google Haifa
Neal E. Young¹
UC Riverside
Arman Yousefi
UCLA
Northeastern University, September 17, 2015
¹ funded by a faculty re$earch award
SLIDE 2
BIGTABLE — data storage at Google
Maps, Search/Crawl, Gmail, . . . use BIGTABLE to store data.
- 24,500 Bigtable servers
- 1.2 million requests per second
- 16 GB/s of outgoing RPC traffic
- over a petabyte of data just for Google Crawl and Analytics
(these figures are from 2006)
Similar to other “NoSQL” databases: Accumulo, AsterixDB, Cassandra, HBase, Hypertable, Spanner, . . .
Used by Adobe, Ebay, Facebook, GitHub, Meetup, Netflix, Twitter, . . .
A “log-structured merge tree” architecture — for high-volume, highly reliable, distributed, real-time data storage.
SLIDE 3 BIGTABLE — implements dictionary data type
Operations supported by a Bigtable instance:
- write(key, value)
- read(key) — return the most recent value written for key
. . . there’s more, but not today . . .
SLIDE 4 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: –empty–
file sequence: –empty–
SLIDE 5 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: (1, a)
file sequence: –empty–
write(1, a);
SLIDE 6 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: (1, a) (2, b)
file sequence: –empty–
write(1, a); write(2, b);
SLIDE 7 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: (1, a) (2, b) (3, c)
file sequence: –empty–
write(1, a); write(2, b); write(3, c);
SLIDE 8 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: (1, a) (2, b) (3, c) (4, d)
file sequence: –empty–
write(1, a); write(2, b); write(3, c); write(4, d);
SLIDE 9 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
write(1, a); write(2, b); write(3, c); write(4, d); flush();
SLIDE 10 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: (5, e) (6, f) (7, g)
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
write(1, a); write(2, b); write(3, c); write(4, d); flush(); write(5, e); write(6, f); write(7, g);
SLIDE 11 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
               (5, e) (6, f) (7, g)  ← from 2nd flush
write(1, a); write(2, b); write(3, c); write(4, d); flush(); write(5, e); write(6, f); write(7, g); flush();
SLIDE 12 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
               (5, e) (6, f) (7, g)  ← from 2nd flush
               (8, h) (9, i)  ← from 3rd flush
write(1, a); write(2, b); write(3, c); write(4, d); flush(); write(5, e); write(6, f); write(7, g); flush(); write(8, h); write(9, i); flush();
SLIDE 13 BIGTABLE — writes and flushes
write(key, value):
1. Store the key/value pair in the cache (e.g., a hash table in RAM).
The environment periodically forces a flush of the cache to a new immutable disk file.
Example
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
               (5, e) (6, f) (7, g)  ← from 2nd flush
               (8, h) (9, i)  ← from 3rd flush
The environment forces flushes at arbitrary times.
SLIDE 14 BIGTABLE — reads and compactions
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
               (5, e) (6, f) (7, g)  ← from 2nd flush
               (8, h) (9, i)  ← from 3rd flush
read(key):
1. Check the cache for key.
2. If not found, check the files (most recent first).  ← cost = O(#files)
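The write/flush/read mechanics on these slides can be sketched as a toy model (the class and method names are illustrative, not Bigtable’s actual API):

```python
class ToyLSM:
    """Toy log-structured store: a RAM cache plus immutable flushed files."""

    def __init__(self):
        self.cache = {}   # in-RAM cache (hash table)
        self.files = []   # immutable disk files, oldest first

    def write(self, key, value):
        # 1. Store the key/value pair in the cache.
        self.cache[key] = value

    def flush(self):
        # Forced by the environment at arbitrary times: cache -> new file.
        if self.cache:
            self.files.append(dict(self.cache))
            self.cache = {}

    def read(self, key):
        # 1. Check the cache.  2. If not found, check the files,
        # most recent first.  Worst-case cost is O(#files).
        if key in self.cache:
            return self.cache[key]
        for f in reversed(self.files):
            if key in f:
                return f[key]
        return None
```

Replaying the running example (write keys 1–4, flush, write 5–7, flush, write 8–9, flush) leaves three files of sizes 4, 3, and 2.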
SLIDE 15 BIGTABLE — reads and compactions
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
               (5, e) (6, f) (7, g)  ← from 2nd flush
               (8, h) (9, i)  ← from 3rd flush
read(key):
1. Check the cache for key.
2. If not found, check the files (most recent first).  ← cost = O(#files)
compaction():  ← asynchronous background process, to reduce read costs
Periodically select files to merge.
SLIDE 16 BIGTABLE — reads and compactions
cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d)  ← from 1st flush
               (5, e) (6, f) (7, g) (8, h) (9, i)  ← merge of 2nd and 3rd flushes
read(key):
1. Check the cache for key.
2. If not found, check the files (most recent first).  ← cost = O(#files)
compaction():  ← asynchronous background process, to reduce read costs
Periodically select files to merge.  ← cost = O(SIZE of merged files) !!
goals: (i) keep read costs low; (ii) keep compaction costs low.
constraint: each merge must merge a contiguous subsequence of files.
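A merge of a contiguous subsequence of files, with cost equal to the total size of the merged files, can be sketched as follows (taking a file’s “size” to be its number of pairs, a simplification):

```python
def compact(files, i, j):
    """Merge the contiguous subsequence files[i:j] (dicts, oldest first)
    into a single file.  Returns the new file list and the merge cost,
    i.e. the total size of the files merged."""
    merged = {}
    for f in files[i:j]:          # newer files overwrite older keys
        merged.update(f)
    cost = sum(len(f) for f in files[i:j])
    return files[:i] + [merged] + files[j:], cost
```

Merging the 2nd and 3rd files from the example (sizes 3 and 2, as on this slide) costs 5 and leaves two files.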
SLIDE 17 Bigtable Merge Compaction (bmc) — formal definition
given: Sequence x1, x2, . . . , xn.
← xt is size of file resulting from flush t
Integer k > 0.
← tuned to workload; typically 3–40.
choose: Compactions. Ensure number of files never exceeds k.
objective: Minimize total compaction cost.
SLIDE 18 Bigtable Merge Compaction (bmc) — formal definition
given: Sequence x1, x2, . . . , xn.
← xt is size of file resulting from flush t
Integer k > 0.
← tuned to workload; typically 3–40.
choose: Compactions. Ensure number of files never exceeds k.
objective: Minimize total compaction cost.
If k = ∞, problem is easy — never merge
SLIDE 21 Bigtable Merge Compaction (bmc) — formal definition
given: Sequence x1, x2, . . . , xn.
← xt is size of file resulting from flush t
Integer k > 0.
← tuned to workload; typically 3–40.
choose: Compactions. Ensure number of files never exceeds k.
objective: Minimize total compaction cost.
If k = ∞, the problem is easy — never merge.
after flush 1: after flush 2: after flush 3: after flush 4: . . .
Total compaction cost = 0.
SLIDE 27 Bigtable Merge Compaction (bmc) — formal definition
given: Sequence x1, x2, . . . , xn.
← xt is size of file resulting from flush t
Integer k > 0.
← tuned to workload; typically 3–40.
choose: Compactions. Ensure number of files never exceeds k.
objective: Minimize total compaction cost.
If k = 1, the problem is easy — everything must be merged after every flush.
after flush 1: (no merge needed)
after flush 2: ← compaction cost x1 + x2
after flush 3: ← compaction cost x1 + x2 + x3
. . .
after flush n: ← compaction cost x1 + · · · + xn
Total compaction cost = ∑_{i=2}^{n} (x1 + x2 + · · · + xi) ≈ ∑_{i=1}^{n} (n − i + 1)·xi.
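The k = 1 total above can be checked numerically; the two sums agree up to the single missing x1 term (hence the ≈), since flush 1 triggers no merge:

```python
def k1_cost(xs):
    """Total compaction cost when k = 1: after every flush t >= 2,
    everything merges, at cost x1 + ... + xt."""
    n = len(xs)
    return sum(sum(xs[:t]) for t in range(2, n + 1))

xs = [3, 1, 4, 1, 5, 9, 2, 6]            # arbitrary positive file sizes
n = len(xs)
closed_form = sum((n - i + 1) * xs[i - 1] for i in range(1, n + 1))
# The sums differ by exactly x1 (flush 1 needs no merge):
assert k1_cost(xs) == closed_form - xs[0]
```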
SLIDE 28
Google’s default compaction algorithm:
Merge the minimal suffix so as to maintain (i) #files ≤ k and (ii) each file’s size exceeds the total size of the files to its right.
Example: k = 2, on uniform input x = 1, 1, 1, . . .
SLIDE 37 Google’s default compaction algorithm:
Merge the minimal suffix so as to maintain (i) #files ≤ k and (ii) each file’s size exceeds the total size of the files to its right.
Example: k = 2, on uniform input x = 1, 1, 1, . . .
Total compaction cost = Θ(n^2).
For general k, the default’s cost is Θ(n^2 / 3^(k−1)).
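The Θ(n^2) behavior can be observed by simulating the default rule; this is my rendering of the rule as stated above, not Google’s code:

```python
def default_cost(sizes, k):
    """Simulate the default rule: after each flush, merge the minimal
    suffix so that #files <= k and each file's size exceeds the total
    size of the files to its right.  Returns the total compaction cost."""
    files, cost = [], 0
    for x in sizes:
        files.append(x)
        for m in range(1, len(files) + 1):       # suffix length to merge
            cand = files[:len(files) - m] + [sum(files[-m:])]  # m=1: no-op
            if len(cand) <= k and all(
                cand[i] > sum(cand[i + 1:]) for i in range(len(cand) - 1)
            ):
                if m > 1:
                    cost += cand[-1]             # cost = size of merged files
                files = cand
                break
    return cost

# On uniform input, the cost grows quadratically in n:
c1, c2 = default_cost([1] * 100, 2), default_cost([1] * 200, 2)
assert c2 > 3 * c1          # roughly 4x for a doubling of n
```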
SLIDE 38 OPTIMAL solution for k = 2, uniform x = 1, 1, 1, . . .
“big” merges: O(√n) of them, each of size O(n); “small” merges: O(n) of them, each of size O(√n).
Total compaction cost = O(n^(3/2)).
For general k, Opt’s cost is Θ(k·n^(1+1/k)).
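One way to realize the O(n^(3/2)) bound for k = 2 on uniform input is a block schedule: fold each flush into a “small” file, and every ~√n flushes fold everything into the “big” file. This is my sketch of such a schedule, not the paper’s construction:

```python
import math

def block_cost(n):
    """Cost for k = 2 on uniform unit-size input: small merge every
    flush, big merge every ~sqrt(n) flushes.  Total cost is O(n**1.5)."""
    b = max(2, math.isqrt(n))      # block length ~ sqrt(n)
    big, small, cost = 0, 0, 0
    for t in range(1, n + 1):
        if small == 0:             # new flush becomes the small file: free
            small = 1
            continue
        if t % b:                  # small merge: fold flush into small file
            small += 1
            cost += small
        else:                      # big merge: fold everything into big file
            big += small + 1
            cost += big
            small = 0
    return cost
```

Each block costs about b^2/2 in small merges plus one big merge of size at most n; with n/b blocks and b ≈ √n, both contributions are O(n^(3/2)).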
SLIDE 39
Definition: c-competitive online algorithm
A compaction algorithm is c-competitive if, on any input (k, x), its solution costs at most c times the optimal cost. A compaction algorithm is online if its choice of merge after flush t depends only on k and x1, x2, . . . , xt (the files flushed so far).
- The default’s cost can be n times Opt’s cost (for any k).
- So the default is no better than n-competitive.
→ It may have high compaction cost even on “easy” inputs.
Theorem 1. There is a k-competitive online algorithm for bmc.  ← today
Theorem 2. No deterministic online algorithm is less than k-competitive.
SLIDE 40 Idea behind the 2-competitive online algorithm (for k = 2) . . .
Q: At each step, do a “big” merge or a small merge?
A: Do a big merge when the cost C of the previous big merge ≈ the total cost of the small merges since then.
SLIDE 47 Idea behind the 2-competitive online algorithm (for k = 2) . . .
Q: At each step, do a “big” merge or a small merge?
A: Do a big merge when the cost C of the previous big merge ≈ the total cost of the small merges since then.
The algorithm’s cost during an interval between two consecutive big merges is ≈ 2C.
Why 2-competitive? Focus on a time interval between two big merges.
case 1 (during this interval, Opt does a big merge): Opt’s cost for that big merge is at least C.
case 2 (during this interval, Opt does no big merge): Opt’s cost for small merges during the interval is at least C.
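The k = 2 rule above can be sketched in code; this is my toy rendering of the rent-or-buy idea, not the paper’s exact brb:

```python
def brb2_cost(sizes):
    """k = 2 rent-or-buy sketch: keep a big file and a small file.  When
    a third file appears, do a small merge (fold the two rightmost files)
    while the small merges since the last big merge cost at most C, the
    cost of that big merge; otherwise merge everything (big merge)."""
    files, C, spent, total = [], 0, 0, 0
    for x in sizes:
        files.append(x)
        if len(files) <= 2:              # at most 2 files: no merge needed
            continue
        small = files[1] + files[2]      # cost of the small merge
        if spent + small <= C:
            files = [files[0], small]    # small merge
            spent += small
            total += small
        else:
            big = sum(files)             # big merge: everything into one file
            files, C, spent = [big], big, 0
            total += big
    return total
```

On uniform input this grows far more slowly than the default’s quadratic cost (the first ten unit flushes cost 26 in total under this sketch).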
SLIDE 48
Idea behind the k-competitive online algorithm for general k
idea: Do a big merge, then recurse with k = k − 1 on the files to its right.
Q: When to do the next big merge?
A: When the cost of the previous big merge ≈ (cost of the recursion)/(k − 1).
“Balanced rent-or-buy algorithm (brb)”
SLIDE 49
Recap of analyses in worst-case model
The Bigtable default is no better than n-competitive . . .
Theorem 1. Brb is a k-competitive online algorithm for bmc.  ← today
Theorem 2. No deterministic online algorithm is less than k-competitive.
What about “typical” inputs?
SLIDE 50 Preliminary benchmarks (one example with k = 5)
[Two plots of cost per step vs. n: one comparing Default, BRB, and Optimal; the other, at larger n, comparing Default and BRB.]
The xt are i.i.d. from a log-normal distribution.
Conjectures
1. Brb’s and Opt’s cost per time step ∼ x̄·k·n^(1/k)/e (x̄ the mean file size).
2. Default’s cost per time step ∼ x̄·n/(2·3^(k−1)).
SLIDE 51
Lots of work in progress
theoretical:
- average-case analyses: absolute and relative costs on i.i.d. inputs
- randomized online algorithms (o(k)-competitive?)
- optimal compaction schedules ≡ optimal binary search trees
practical:
- realistic testing . . . on AsterixDB, then at Google
problem variants:
- allow expiration/deletion of key/value pairs (done)
- allowing k to vary — bmc with read costs . . . (open!)
Working paper available on arxiv.org.
(Search the web for “bigtable merge compaction”.)
SLIDE 52 Bmc with read costs (geometric interpretation)
given: Staircase step-lengths and step-heights (x1, y1), (x2, y2), . . .
do: Partition the region below the staircase into axis-parallel rectangles.
objective: Minimize the sum of the widths and heights of the rectangles.
[Figure: a staircase with steps (x1, y1), . . . , (x7, y7).]
open problem: Is there an O(1)-competitive online algorithm?
SLIDE 54
Thank you
SLIDE 55 A geometric interpretation of bmc
given: Uneven staircase with step-lengths x1, x2, . . . , xn. Integer k > 0.
do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.
objective: Minimize the sum of the widths of the rectangles.
input: an uneven staircase with 10 steps (x1, . . . , x10); k = 2.
SLIDE 56 A geometric interpretation of bmc
given: Uneven staircase with step-lengths x1, x2, . . . , xn. Integer k > 0.
do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.
objective: Minimize the sum of the widths of the rectangles.
input: an uneven staircase with 10 steps; k = 2. (a valid solution is shown)
SLIDE 57 A geometric interpretation of bmc
given: Uneven staircase with step-lengths x1, x2, . . . , xn. Integer k > 0.
do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.
objective: Minimize the sum of the widths of the rectangles.
input: an uneven staircase with 10 steps; k = 2. (not a solution)
This partition is cheaper. . . but not valid for k = 2.