Dremel: Interactive Analysis of Web-Scale Datasets
By Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Presented by: Alex Zahdeh
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal, 100), COUNT(*) FROM t
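As a rough illustration of what the query above asks for, here is a minimal Python sketch. It assumes TOP(signal, 100) paired with COUNT(*) means "the 100 most frequent signal values, each with its count", as the similarly named TOP function does in BigQuery legacy SQL. The function name `top_values` and the in-memory list are hypothetical, and the sketch is exact, whereas a web-scale system may compute this approximately.

```python
from collections import Counter

def top_values(signals, k=100):
    """Return the k most frequent values with their counts.

    Exact, single-machine stand-in for SELECT TOP(signal, k), COUNT(*);
    'signals' is a hypothetical in-memory column of values.
    """
    return Counter(signals).most_common(k)

if __name__ == "__main__":
    print(top_values(["a", "b", "a", "c", "a", "b"], k=2))
```

Only the signal column is touched here, which is the access pattern a columnar layout is designed to serve.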
– Google uses GFS
– Protocol Buffers
– Lossless representation of record structure in
– Fast encoding and decoding (assembly) of
– Repetition level
– Definition level
– Compressed field values
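The two levels can be made concrete with a small sketch. It stripes one nested path into (value, repetition level, definition level) triples, mirroring the Name.Language.Code path from the Dremel paper's running example: Name and Language are repeated, Code is required. The function name `stripe_codes` and the list-of-lists record layout are assumptions made for illustration, not the paper's general algorithm.

```python
def stripe_codes(records):
    """Stripe Name.Language.Code into (value, r, d) triples.

    Each record is a list of Name groups; each Name is a list of
    language codes. r says at which repeated field the value repeats
    (0 = new record, 1 = Name, 2 = Language). d counts how many of
    the repeated fields on the path are actually present (max 2).
    """
    column = []
    for names in records:
        if not names:
            # Record has no Name at all: nothing on the path is defined.
            column.append((None, 0, 0))
            continue
        for name_idx, codes in enumerate(names):
            # First Name starts a new record (r=0); later Names repeat at level 1.
            r_name = 0 if name_idx == 0 else 1
            if not codes:
                # Name is present but has no Language: defined up to level 1.
                column.append((None, r_name, 1))
                continue
            for code_idx, code in enumerate(codes):
                # Later codes within the same Name repeat at level 2.
                r = r_name if code_idx == 0 else 2
                column.append((code, r, 2))
    return column

if __name__ == "__main__":
    r1 = [["en-us", "en"], [], ["en-gb"]]
    r2 = [[]]
    for triple in stripe_codes([r1, r2]):
        print(triple)
```

The NULL entries carry no value but keep enough level information to rebuild the record structure without loss.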
– Slots are available execution threads on leaf
– Amount of data processed larger than
– Redispatch work that is taking too long
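One way to picture redispatching slow work is sketched below. The rule used here (redispatch anything running longer than a multiple of the median finished time) is an assumption chosen for illustration, not the dispatcher's documented policy; `pick_stragglers`, `factor`, and the tablet ids are hypothetical names.

```python
from statistics import median

def pick_stragglers(finished_secs, running_secs, factor=2.0):
    """Return ids of running tablets that look like stragglers.

    finished_secs: durations (seconds) of tablets already processed.
    running_secs: dict of tablet id -> seconds elapsed so far.
    A tablet is flagged once it has run longer than factor x the
    median finished duration (an assumed heuristic).
    """
    if not finished_secs:
        return []  # nothing to compare against yet
    cutoff = factor * median(finished_secs)
    return sorted(t for t, elapsed in running_secs.items() if elapsed > cutoff)

if __name__ == "__main__":
    print(pick_stragglers([1.0, 1.2, 0.9], {"t7": 5.0, "t8": 1.1}))
```

Flagged tablets would then be handed to another free slot, and whichever copy finishes first is used.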
21 / 32
– Examine access characteristics on a single machine
– Show benefits of columnar storage for MR execution
– Show Dremel's performance
300k record fragment of Table T1 (1GB) used
interactive speeds on disk resident datasets of up to 1 trillion records
columns and servers is achievable for systems containing thousands of nodes
expensive
– Software layers need to be optimized to directly consume column-oriented database
system can benefit from economies of scale while offering a better user experience
return most of the data to trade off speed and accuracy
time bounds is hard
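The speed versus accuracy trade-off mentioned above can be sketched as an aggregation that stops once a chosen fraction of tablets has reported. This is a minimal sketch under assumptions: `partial_sum`, `min_fraction`, and the arrival-ordered list of per-tablet results are hypothetical, and a real system would also have to decide what to do with the tablets it skipped.

```python
import math

def partial_sum(arrivals, total_tablets, min_fraction=0.98):
    """Sum per-tablet results until min_fraction of tablets has reported.

    arrivals: per-tablet partial results in the order they arrive.
    Returns (sum so far, fraction of tablets actually included).
    """
    needed = math.ceil(min_fraction * total_tablets)
    total = 0
    seen = 0
    for value in arrivals:
        total += value
        seen += 1
        if seen >= needed:
            break  # stop waiting for the slowest tablets
    return total, seen / total_tablets

if __name__ == "__main__":
    print(partial_sum([5, 7, 100, 1], total_tablets=4, min_fraction=0.5))
```

Reporting the scanned fraction alongside the result lets the caller judge how approximate the answer is.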
– MapReduce, Hadoop
– HadoopDB
Nested Data
– XMill
– Complex value models
– Nested relational models
– Recursive Algebra and Query Optimizations for Nested Relations
– Pig
– SCOPE
– DryadLINQ
– Replica consistency issues, etc.