DBMS Data Loading: An Analysis on Modern Hardware Adam Dziedzic, - - PowerPoint PPT Presentation
DBMS Data Loading: An Analysis on Modern Hardware Adam Dziedzic, - - PowerPoint PPT Presentation
DBMS Data Loading: An Analysis on Modern Hardware Adam Dziedzic, Manos Karpathiotakis* , Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki Data loading: A necessary evil Volume => Expensive Top query performance 40 zettabytes by
Data loading: A necessary evil
2
Data loading is a persistent analysis bottleneck
Top query performance ACID guarantees Volume => Expensive 40 zettabytes by 2020* Velocity => Continuous Fresh data = Interesting data₸
₸ Abad [IISWC12]
* [IDC12]
Loading a DBMS
3
Read Load Convert
How does hardware affect loading?
Loading a DBMS
3
Read Load Convert
How does hardware affect loading?
Experimental setup
- Hardware
– Dual socket 8 cores Intel(R) Xeon(R) CPU E5-2640 – 64 GB RAM – HDD: 4 x 500 GB 7.5k RPM SATA disks – SSD: 3 x 200GB SSD disks – DAS: 24 x 500 GB 7.5k RPM SATA disks
- Software
– PostgreSQL, DBMS-R – MonetDB, DBMS-C – PostgreSQL parallel external loader (“PCOPY”)
- Benchmarks & Real-world Datasets
4
1500 3000 4500 6000 7500 25 50 75 100
Loading Time (sec) File size (GB) Symantec Loading Time
Single-threaded data loading
5
Effect of compression
1500 3000 4500 6000 7500 25 50 75 100
Loading Time (sec) File size (GB) TPC-H Loading Time
DBMS-C PostgreSQL MonetDB DBMS-R
[Input storage: HDD Destination storage: DAS]
Dataset characteristics matter
1000 2000 3000 25 50 75 100
Loading Time (sec) File size (GB) Symantec Loading Time
Parallel data loading
6
Sublinear speedup for 16 threads
1000 2000 3000 25 50 75 100
Loading Time (sec) File size (GB) TPC-H Loading Time
DBMS-C MonetDB DBMS-R PCOPY
Input storage: HDD - Destination storage: DAS 16 threads Speedup – 16 threads DBMS-R PCOPY MonetDB DBMS-C TPC-H 100GB 1.25 2.77 1.72 2.84 Symantec 100GB 0.87 1.9 2.1
20 40 60 80 100 25 50 75 100 Utilzation (%)
Time (sec) DBMS-R, TPC-H SF10 In: HDD - Out: DAS
Write BW
7
Resource Utilization
Unable to saturate resources
20 40 60 80 100 25 50 75 100 Utilzation (%)
Time (sec) DBMS-R, TPC-H SF10 In: HDD - Out: DAS
Write BW
7
Resource Utilization
Unable to saturate resources
20 40 60 80 100 25 50 75 100 Utilzation (%)
Time (sec) DBMS-R, TPC-H SF10 In: HDD - Out: DAS
Read BW Write BW
7
Resource Utilization
Unable to saturate resources
20 40 60 80 100 25 50 75 100 Utilzation (%)
Time (sec) DBMS-R, TPC-H SF10 In: HDD - Out: DAS
CPU Read BW Write BW
7
Resource Utilization
Unable to saturate resources
5 10 15 20 25 30 25 50 75 100 I/O Wait (%) DBMS-R MonetDB 20 40 60 80 100 25 50 75 100 Utilzation (%)
Time (sec) DBMS-R, TPC-H SF10 In: HDD - Out: DAS
CPU Read BW Write BW
7
Resource Utilization
Unable to saturate resources
Read patterns
8
Block Address Elapsed Time
MonetDB
Block Address Elapsed Time
DBMS-R
Random I/O causes underutilization
[TPC-H SF10 Input storage: HDD Destination storage: DAS]
Serial reader vs. Parallel readers
9
# readers depends on input device speed Serial reader improves read utilization
20 40 60 80 100 10 20 30 40 50 60 70 80 90 100 Read Utilzation (%)
Time (sec) PCOPY read utilization
Serial Reader Parallel Readers [TPC-H SF10 Input storage: HDD Destination storage: DAS]
Impact of storage
10
Impact of storage
11
100 200 300 400 PCOPY DBMS-R DBMS-C
Loading Time (sec)
HDD Source Storage
HDD to DAS HDD to SSD HDD to ramfs
[TPC-H SF10]
100 200 300 400 PCOPY DBMS-R DBMS-C
Loading Time (sec)
DAS Destination Storage
HDD to DAS SSD to DAS ramfs to DAS
Write bottleneck when source storage is fast Slow source storage bottlenecks all systems
Slow input storage Varying input storage
Best-case storage scenario
12
100% CPU utilization, yet B/W still underutilized
100 200 300 400
PCOPY DBMS-R DBMS-C Loading Time (sec)
ramfs Source Storage
ramfs to HDD ramfs to DAS ramfs to SSD ramfs to ramfs
[TPC-H SF10] 20 40 60 80 100 5 10 15 20 25 30 CPU Utilization (%)
Time (sec) DBMS-R In: ramfs – Out: ramfs
Device Bandwidth: 12.8 GB/sec Read Rate: 250 MB/sec
Data loading: Where does time go?
13
0% 20% 40% 60% 80% 100% PostgreSQL MonetDB
CPU breakdown (%)
Other Write data Logging Tuple creation Conversion Tokenizing Parsing
Parsing, conversion, tokenization hotspots
[10 int columns; 10GB]
Reducing data loading overheads
14
NoDB Data Vaults Instant Loading RAW FPGAs GPUs
HW
Reduced Accesses
Optimized Code Path
SDS/Q
Bulk loading on modern hardware
- General case: Resource under-utilization
- Slow destination storage matters
- Complex code paths bound max speed
15
Bulk loading on modern hardware
- General case: Resource under-utilization
- Slow destination storage matters
- Complex code paths bound max speed
15
Thank You! Questions?
Backup Slides
16
50x data growth from 2010 to 2020
23
Can DBMS keep up with data growth?
[IDC2012]
Storage Characteristics
Name Capacity Configuration Read Speed Write Speed RPM HDD 2TB 4 x HDD (RAID-0) 170 MB/s 160 MB/s 7.5K DAS 12TB 24 x HDD (RAID-0) 1100 MB/s 330 MB/s 7.5K SSD 600GB 3 x SSD (RAID-0) 565 MB/s 268 MB/s n/a
24
Parallel data loading – 16 threads
7
Sublinear speedup for 16x DoP
4 8 12 16 25 50 75 100
Speedup File size (GB) TPC-H
PCOPY MonetDB DBMS-R DBMS-C
[Input storage: HDD Destination storage: DAS]
1 2 3 4 25 50 75 100
Speedup File size (GB) Symantec
Single-threaded loading – Extra datasets
26
Column stores invest in compression
1500 3000 4500 6000 7500 25 50 75 100
Loading Time (sec) File size (GB) TPC-C Loading Time
PostgreSQL MonetDB DBMS-R DBMS-C
Input storage: HDD Destination storage: DAS
1500 3000 4500 6000 7500 25 50 75 100
Loading Time (sec) File size (GB) SDSS Loading Time
Parallel data loading – Extra datasets
27
1000 2000 3000 4000 25 50 75 100
Loading Time (sec) File size (GB) TPC-C Loading Time
PCOPY MonetDB DBMS-R DBMS-C
Input storage: HDD Destination storage: DAS
1000 2000 3000 25 50 75 100
Loading Time (sec) File size (GB) SDSS Loading Time
4 8 12 16 25 50 75 100
Speedup File size (GB) SDSS
Parallel data loading – Extra datasets
28
1 2 3 4 25 50 75 100
Speedup File size (GB) TPC-C
PostgreSQL MonetDB DBMS-R DBMS-C
Input storage: HDD Destination storage: DAS
The effect of compression
Name TPC-H TPC-C SDSS Symantec DBMS-R 1.5 1.3 1.5 1.5 PostgreSQL 1.4 1.4 1.4 1.1 DBMS-C 0.27 0.82 0.18 0.25 MonetDB 1.1 1.4 1.0 0.92
29
Column stores: Reduced footprint favors OLAP
[10GB] DB size / input file
5 10 15 20 25 30 25 50 75 100 I/O Wait (%) PCOPY DBMS-R MonetDB DBMS-C 20 40 60 80 100 25 50 75 100 Utilzation (%)
Time (sec) DBMS-R, TPC-H SF10 In: HDD - Out: DAS
CPU Read BW Write BW
8
Resource Utilization
Unable to saturate resources
MonetDB utilization
31
20 40 60 80 100 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150
Utilzation (%) Time (sec)
[Data: TPCH – SF10 Input storage: HDD Destination storage: DAS]
PCOPY utilization
32
[Data: TPCH – SF10 Input storage: HDD Destination storage: DAS] 20 40 60 80 100 10 20 30 40 50 60 70 80 90 100 110 Utilzation (%)
Time (sec)
CPU Read BW
DBMS-C utilization
33
[Data: TPCH – SF10 Input storage: HDD Destination storage: DAS] 20 40 60 80 100 25 50 75 100 125 150 175 200 225 Utilzation (%)
Time (sec)
DBMS-C read patterns
34
Reducing data loading overheads
- In situ querying [SIGMOD12, VLDB14]
- Data Vaults: Exploit metadata [Ivanova12,Kargin15]
- Instant Loading: SIMD & Code gen. [Muehlbauer13]
- Accelerators (FPGAs, GPUs)
35