SLIDE 1 The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses
Subi Arumugam¹, Alin Dobra¹, Christopher M. Jermaine², Niketan Pansare², Luis Perez²
¹University of Florida, ²Rice University
June 9, 2010
SLIDE 2 Motivation
- Storage is cheap: a 1TB disk costs 80-100$
- Disks have high throughput
- a 100$ 1TB disk can do 150MB/s reads/writes
- a 4,000$ 1TB SSD (OCZ P88) reads at 1.4GB/s
- Processors are fast: 6 GFLOPs/core, 24 GFLOPs for 100$
- TPC-H Q1 (at 1TB scale factor)
- 8 aggregates over 95-97% of lineitem
- need to read about 160-700GB: 2 P88s scan this in 60-250s
- need to perform 30 FLOPs × 6·10^9 tuples = 180 GFLOP; about 8s at 24 GFLOPs
- Q1 should be I/O bound; should be able to run 8 in parallel
SLIDE 3 Motivation
- Best non-clustered performer: 142s for 1.7M$
- 64 cores, 512GB memory, 576 disks
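The back-of-envelope numbers in the Motivation bullets can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide; the struct and function names are hypothetical, and reading "8 in parallel" as the ratio of scan time to compute time is an assumption, not something the slide states.

```cpp
// Rough estimate for TPC-H Q1 at 1TB, built from the slide's own figures.
// All identifiers here are illustrative, not part of DataPath.
struct Estimate {
    double scanSecondsLow;   // time to scan 160GB
    double scanSecondsHigh;  // time to scan 700GB
    double computeSeconds;   // time to do the arithmetic
    double queriesPerScan;   // compute passes that fit into the shortest scan
};

Estimate EstimateQ1() {
    const double ssdReadGBps = 1.4;   // one OCZ P88
    const int numSsds = 2;
    const double dataLowGB = 160.0;
    const double dataHighGB = 700.0;
    const double flopsPerTuple = 30.0;
    const double tuples = 6e9;        // lineitem rows at 1TB scale
    const double cpuGflops = 24.0;    // "24 GFLOPs for 100$"

    const double bandwidthGBps = ssdReadGBps * numSsds;  // 2.8 GB/s
    Estimate e;
    e.scanSecondsLow = dataLowGB / bandwidthGBps;    // about 57 s
    e.scanSecondsHigh = dataHighGB / bandwidthGBps;  // about 250 s
    e.computeSeconds = flopsPerTuple * tuples / (cpuGflops * 1e9);  // 7.5 s
    e.queriesPerScan = e.scanSecondsLow / e.computeSeconds;         // about 7.6
    return e;
}
```

Even in the best case for I/O (160GB), the scan takes several times longer than the arithmetic, which is why the slide argues Q1 should be I/O bound.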
SLIDE 4 Large Scale Analytics
Goals
- Deal with analytical queries on large data (1-10TB)
- Get closer to theoretical CPU performance
- gap stands at 100-1000x for most databases
- Sub-100,000$ system with minute-scale response time (1TB)
- stay I/O bound even with fast disks and multiple queries
- No or little tuning: no indexing, no tunable partitioning
SLIDE 5 Large Scale Analytics
DataPath
- System designed from the ground up to meet these goals.
SLIDE 6 Benchmark System
Old System (2008) – 60,000$
- 8 processors, 32 cores
- 128GB DDR2 RAM (16 bays)
- 2 Averatec RAID controllers, 4 12-disk enclosures
- 47 VelociRaptor disks, 8 Barracuda disks
- Maximum aggregate throughput 2.2GB/s
New System (2010) – 20,000$
- 4 processors, 48 cores
- 128GB DDR3 memory
- 2 OCZ Z-drive 1TB PCI SSD disks
- Maximum aggregate throughput 2.8GB/s
SLIDE 7
Data-centric Computation
SLIDE 8 DataPath Execution Model
- Tuple-oriented execution model
- Tuples shared by queries in the system
- Chunks of tuples pushed into waypoints for processing
- Waypoints implement operations for multiple queries
- Tuple processing loops at full CPU speed
for (int i = 0; i < numTuples; i++) {  // numTuples: tuples in the current chunk (name assumed)
    if (tuple[i].BelongsTo (Q1)) Q1.Process (tuple[i]);
    if (tuple[i].BelongsTo (Q2)) Q2.Process (tuple[i]);
    if (tuple[i].BelongsTo (Q3)) Q3.Process (tuple[i]);
}
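A compilable version of this shared loop might look as follows. This is a minimal sketch under assumptions: the slide does not say how a tuple records which queries it belongs to, so the per-tuple bitmask, the `Tuple` layout, and the `ProcessChunk` function are all hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical tuple: two payload columns plus a bitmap of owning queries.
struct Tuple {
    double quantity;
    double extendedPrice;
    std::uint32_t queryMask;  // bit k set => tuple belongs to query k

    bool BelongsTo(int queryId) const { return ((queryMask >> queryId) & 1u) != 0; }
};

constexpr int kQ1 = 0;  // SUM(quantity)
constexpr int kQ2 = 1;  // SUM(extendedPrice)

struct Sums {
    double q1Quantity;
    double q2ExtendedPrice;
};

// One pass over the chunk serves both queries: each tuple is read once
// and dispatched to every query whose bit is set.
Sums ProcessChunk(const std::vector<Tuple>& chunk) {
    Sums s{0.0, 0.0};
    for (const Tuple& t : chunk) {
        if (t.BelongsTo(kQ1)) s.q1Quantity += t.quantity;
        if (t.BelongsTo(kQ2)) s.q2ExtendedPrice += t.extendedPrice;
    }
    return s;
}
```

The point of the sketch is that the cost of reading the chunk is paid once, however many queries share it.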
SLIDE 9 Query Execution – Example
Q1: SELECT SUM (l_quantity)
FROM lineitem WHERE l_shipdate > '1-1-06';
[Figure: query plan for Q1. Surviving labels: lineitem; Q1: l_shipdate > '1-1-06'; Q1: SUM(l_quantity)]
SLIDE 10 Query Execution – Example
Q1: SELECT SUM (l_quantity)
FROM lineitem WHERE l_shipdate > '1-1-06';
Q2: SELECT SUM (l_extendedprice)
FROM lineitem, orders WHERE l_shipmode <> 'rail' AND o_orderdate < '1-1-08' AND l_orderkey = o_orderkey;
[Figure: shared query plan for Q1 and Q2. Surviving labels: lineitem; Q1: l_shipdate > '1-1-06'; Q2: l_shipmode <> 'rail'; Q2: o_orderdate < '1-1-08'; Q2: l_orderkey = o_orderkey; Q1: SUM(l_quantity); Q2: SUM(l_extendedprice). Edges are annotated with the queries whose tuples flow along them (Q1, Q2).]
SLIDE 11 Query Execution – Example
Q1: SELECT SUM (l_quantity)
FROM lineitem WHERE l_shipdate > '1-1-06';
Q2: SELECT SUM (l_extendedprice)
FROM lineitem, orders WHERE l_shipmode <> 'rail' AND o_orderdate < '1-1-08' AND l_orderkey = o_orderkey;
Q3: SELECT AVG (l_discount)
FROM lineitem, orders WHERE l_orderkey = o_orderkey;
[Figure: shared query plan. Surviving labels: lineitem; Q1: l_shipdate > '1-1-06'; Q2: l_shipmode <> 'rail'; Q2: o_orderdate < '1-1-08'; Q2: l_orderkey = o_orderkey; Q1: SUM(l_quantity); Q2: SUM(l_extendedprice). Edges are annotated with the queries whose tuples flow along them (Q1, Q2).]
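To make the shared plan concrete, here is a sketch of how a single selection waypoint over lineitem could evaluate the predicates of Q1 and Q2 in one pass and tag each tuple with the queries it survives for. The struct, the bitmask scheme, and the yyyymmdd date encoding are assumptions made for this illustration; the slides do not show DataPath's actual data layout.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical lineitem tuple carrying only the columns used here.
struct LineItem {
    int shipDate;             // yyyymmdd (assumed encoding of '1-1-06' etc.)
    std::string shipMode;
    double quantity;
    std::uint32_t queryMask;  // filled in by the selection waypoint
};

constexpr std::uint32_t kQ1 = 1u << 0;
constexpr std::uint32_t kQ2 = 1u << 1;

// Selection waypoint: one scan of the chunk evaluates both predicates.
void SelectionWaypoint(std::vector<LineItem>& chunk) {
    for (LineItem& t : chunk) {
        t.queryMask = 0;
        if (t.shipDate > 20060101) t.queryMask |= kQ1;  // Q1: l_shipdate > '1-1-06'
        if (t.shipMode != "rail") t.queryMask |= kQ2;   // Q2: l_shipmode <> 'rail'
    }
}

// Aggregate waypoint for Q1: SUM(l_quantity) over tuples tagged for Q1.
double AggregateQ1(const std::vector<LineItem>& chunk) {
    double sum = 0.0;
    for (const LineItem& t : chunk) {
        if (t.queryMask & kQ1) sum += t.quantity;
    }
    return sum;
}
```

Tuples tagged only for Q2 would continue to the join with orders; that step is omitted to keep the sketch short.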
SLIDE 12 Tuple Processing Loop
Usual problems:
- branch mis-prediction
- instruction cache misses
- per-tuple overhead
DataPath solution – Use a C++ meta-compiler
- generates new tuple-processing loops for each waypoint when new queries are added
- generated code is human-readable (it even has comments)
- compiled as a library with -O3 -msse4.1
- everything is hardcoded
- the C++ compiler then finds shared work, reduces branch misprediction, and applies SSE
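The code-generation idea can be illustrated with a toy generator that emits the source text of a hardcoded loop from a list of query fragments. This is only a sketch of the general technique: `QuerySpec`, `GenerateLoop`, and the shape of the emitted code are invented for this example and are not DataPath's meta-compiler.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one query's work at a waypoint.
struct QuerySpec {
    std::string name;       // e.g. "Q1"
    std::string predicate;  // C++ expression over `t`
    std::string action;     // C++ statement to run when the predicate holds
};

// Emits C++ source for a loop with every predicate and constant hardcoded,
// one commented branch per query.
std::string GenerateLoop(const std::string& waypoint,
                         const std::vector<QuerySpec>& queries) {
    std::string src;
    src += "// Generated for waypoint " + waypoint + "\n";
    src += "for (size_t i = 0; i < n; i++) {\n";
    src += "    const Tuple& t = tuples[i];\n";
    for (const QuerySpec& q : queries) {
        src += "    // " + q.name + "\n";
        src += "    if (" + q.predicate + ") { " + q.action + " }\n";
    }
    src += "}\n";
    return src;
}
```

When a new query arrives, the generator would be rerun with the longer query list and the result recompiled as a library (the slide mentions -O3 -msse4.1), so the optimizer sees literal constants rather than interpreted predicates.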
SLIDE 13-23
File Scanner
[Animated figure across slides 13-23. Recoverable content: a staging area with slots for Chunk 1 through Chunk 4, fed by Disk 1 through Disk 5, which hold blocks numbered 1-20. In each frame one more block moves from the disks into the staging area, in the order 2, 3, 5, 1, 4, 6, 8. On slide 20 the status line reads "Finished: Chunk 1". On slide 21 that slot is relabeled Chunk 5 and the blocks of Chunk 1 leave the staging area; blocks 7, 10, and 9 then arrive on slides 21-23.]
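The behavior suggested by the animation (blocks arriving out of order from several disks, a chunk being reported as finished once all of its blocks are present, and its slot then being reused) can be sketched as a small bookkeeping class. The slides do not specify how blocks map to chunks or how many blocks a chunk needs, so `blocksPerChunk` and the whole `StagingArea` interface are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical staging area: counts arrived blocks per chunk and reports
// a chunk as finished when the expected number has arrived.
class StagingArea {
public:
    explicit StagingArea(int blocksPerChunk) : blocksPerChunk_(blocksPerChunk) {
        assert(blocksPerChunk > 0);
    }

    // Records one block of `chunkId`; returns true if it completes the chunk.
    bool BlockArrived(int chunkId) {
        int& count = arrived_[chunkId];  // starts at 0 for a new chunk
        ++count;
        if (count < blocksPerChunk_) return false;
        arrived_.erase(chunkId);         // free the slot for the next chunk
        finished_.push_back(chunkId);
        return true;
    }

    const std::vector<int>& Finished() const { return finished_; }
    std::size_t InFlight() const { return arrived_.size(); }

private:
    int blocksPerChunk_;
    std::map<int, int> arrived_;  // chunkId -> blocks received so far
    std::vector<int> finished_;   // chunks in completion order
};
```

Because completion depends only on the count of arrived blocks, each disk can deliver its blocks in whatever order is cheapest for it, which matches the out-of-order arrivals shown in the frames.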