NoDB: Efficient Query Processing on Raw Data Files
Ioannis Alagiannis‡ Renata Borovica‡ Miguel Branco‡ Stratos Idreos* Anastasia Ailamaki‡
*CWI, Amsterdam
‡École Polytechnique
Fédérale de Lausanne
NoDB: Efficient Query Processing on Raw Data Files Ioannis - - PowerPoint PPT Presentation
NoDB: Efficient Query Processing on Raw Data Files Ioannis Alagiannis Renata Borovica Miguel Branco Stratos Idreos * Anastasia Ailamaki cole Polytechnique *CWI, Amsterdam Fdrale de Lausanne From data to results Time
*CWI, Amsterdam
‡École Polytechnique
Fédérale de Lausanne
Time loading data preparation
3
0|ALGERIA|0| haggle. carefully fina 1|ARGENTINA|1|al foxes promise 2|BRAZIL|1|y alongside of the pen 3|CANADA|1|eas hang ironic, silen 4|EGYPT|4|y above the carefully th ...
raw file
1 ARGENTINA 1 al foxes promis
Fields
0|ALGERIA|0| haggle. carei… 1|ARGENTINA|1|al foxes p…
2|BRAZIL|1|y alongside o… 3|CANADA|1|eas hang iro… ……………………………….…………
convert to binary
parsing tokenize
50 100 150 200 250
Straw-man approach
Execution Time (sec)
I/O parse token. convert query
4
Raw files first-class citizen No data loading Driven by the workload Instant gateway to data
Time
5
6
7
Raw File
tuples attributes positions of attributes
Created on-the-fly Reduce parsing Reduce tokenizing
8
Raw File
known position looking for
Created on-the-fly Reduce parsing Reduce tokenizing
9
tuples attributes Indexed attributes:
PM:
10
tuples attributes a4, a6
PM:
p4, p6 p4, p6 p4, p6 p4, p6
Indexed attributes:
11
tuples attributes a4, a6
PM:
p4, p6 p4, p6 p4, p6 p4, p6
Indexed attributes:
p4, p6, p9 p4, p6, p9 p4, p6, p9 p4, p6, p9
a4, a6, a9
12
10 20 30 40 50 200 400 600 800 1000 1200 Execution Time (sec) # pointers (in millions)
13
15% from full positional map
2x improvement
Random queries on 10 attributes Vary storage capacity (15MB-2GB)
1 10 100 Q1 Q5 Q10 Q15 Q20 Q25 Q30 Q35 Q40 Execution Time (sec) - log scale Query Sequence PostgreSQL PostgresRaw
14
200 400 600 800 1000 1200 1400 1600 1800 MySQL CSV Engine MySQL DBMS X DBMS X w/ external files PostgreSQL PostgresRaw Execution Time (sec) Q20 Q19 Q18 Q17 Q16 Q15 Q14 Q13 Q12 Q11 Q10 Q9 Q8 Q7 Q6 Q5 Q4 Q3 Q2 Q1 Load
15
16
200 400 600 800 1000 1200 1400 1600 1800 MySQL CSV Engine MySQL DBMS X DBMS X w/ external files PostgreSQL PostgresRaw Execution Time (sec) Q20 Q19 Q18 Q17 Q16 Q15 Q14 Q13 Q12 Q11 Q10 Q9 Q8 Q7 Q6 Q5 Q4 Q3 Q2 Q1 Load
200 400 600 800 1000 1200 1400 1600 1800 MySQL CSV Engine MySQL DBMS X DBMS X w/ external files PostgreSQL PostgresRaw Execution Time (sec) Q20 Q19 Q18 Q17 Q16 Q15 Q14 Q13 Q12 Q11 Q10 Q9 Q8 Q7 Q6 Q5 Q4 Q3 Q2 Q1 Load
17
18