Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?

SLIDE 1

Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?

Iya Arefyeva, Gabriel Campero Durand, Marcus Pinnecke, David Broneske, Gunter Saake

Workgroup Databases and Software Engineering, University of Magdeburg

SLIDE 2

Motivation: Context

GPGPUs are becoming essential for accelerating computation:

  • 3 of the top 5 systems on the TOP500 list (June 2018) are powered by GPUs
  • 56% of the flops on the list come from GPU acceleration [10]

Summit Supercomputer, Oak Ridge

SLIDE 3

Motivation: Context

GPUs are also important for accelerating database workloads:

Online analytical processing (OLAP):

  • a few long-running tasks performed on big chunks of data
  • easy to exploit data parallelism → good for GPUs

GPU-accelerated systems for OLAP: GDB [1], HyPE [2], CoGaDB [3], Ocelot [4], Caldera [5], MapD [8]

Online transaction processing (OLTP):

  • thousands of short-lived transactions within a short period of time
  • data should be processed as soon as possible due to user interaction
  • comparably less studied on GPUs than OLAP

GPU-accelerated systems for OLTP: GPUTx [6]

SLIDE 4

Motivation: Context

Hybrid transactional and analytical processing (HTAP):

  • real-time analytics on data that is ingested and modified in the transactional database engine
  • challenging due to conflicting requirements in the workloads

GPU-accelerated systems for HTAP: Caldera [5]*
*However, in Caldera, GPUs don't process OLTP workloads → possible underutilization

Caldera Architecture [5]

SLIDE 5

Motivation: GPUs for OLTP

Intrinsic GPU challenges:

1. SPMD processing
2. Coalesced memory access
3. Branch divergence overheads
4. Communication bottleneck: data needs to be transferred from RAM to the GPU and back over the PCIe bus
5. Bandwidth bottleneck: the bandwidth of the PCIe bus is lower than the bandwidth of a GPU
6. Limited memory

SM structure of Nvidia's Pascal GP100 [9]

SLIDE 6

Motivation: GPUs for OLTP

OLTP challenges:

  • Managing isolation and consistency with massive parallelism
  • Previous research (GPUTx [6]) proposed a bulk execution model and K-set transaction handling

Experiments with GPUTx [6]

SLIDE 7

Our contributions

In this early work, we:

1. Evaluate a simplified version of the K-set execution model from GPUTx, assuming single-key operations and massive point queries.
2. Test on a CRUD benchmark, reporting the impact of batch sizes and bounded staleness.
3. Suggest two possible characteristics that could aid the adoption of GPUs for OLTP, as we seek to adopt them in the design of a GPU OLTP query processor.

SLIDE 8

Prototype Design

SLIDE 9

Implementation

  • Storage engine is implemented in C++
  • OpenCL is used for GPU programming
  • The table is stored on the GPU (in case the GPU is used); only the necessary data is transferred
  • Client requests are handled in a single thread
  • In order to support operator-based K-sets, several cases need to be considered; these cases determine our transaction manager (see the sketch below)

[Diagram: clients 1..N send requests to the transaction manager, which pipelines batch collection, batch processing, and replying to clients]
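
To make the flow concrete, here is a minimal sketch of the structures involved; the names (OpType, Request, Batch) are our own illustration, not identifiers from the actual prototype.

```cpp
// Minimal batching structures (illustrative, not the prototype's code):
// every client request is one record-level CRUD operator, and the single
// server thread appends it to an in-memory batch.
#include <cstddef>
#include <string>
#include <vector>

enum class OpType { Create, Read, Update, Delete };  // record-level CRUD operators

struct Request {
    OpType      op;
    std::string key;       // 24-byte keys in the YCSB setup
    std::string value;     // payload for writes; empty for reads
    int         clientId;  // who gets the reply after batch processing
};

struct Batch {
    std::vector<Request> requests;
    std::size_t capacity = 1024;  // configurable batch size (varied in the experiments)
    bool full() const { return requests.size() >= capacity; }
};
```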

SLIDE 10

Implementation

[Diagram: a collected batch of writes (keys 22, 4, 19, 8, 10, 1, 56); a new request (write 8, key 5) arrives]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).
Case 1: Reads or independent writes are appended to the corresponding batch, which is executed once it is full (illustrated on this and the next two slides).

SLIDE 11

Implementation

[Diagram: the new request (write 8, key 5) is appended to the collected batch of writes]

SLIDE 12

Implementation

[Diagram: the batch of writes is now full and is sent for batch processing]
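
A minimal sketch of Cases 0 and 1 under these assumptions; tryReceive (a non-blocking poll of the client queue) and execute (runs a batch and replies to clients) are hypothetical helpers, not the prototype's API.

```cpp
// Case 1: independent requests are appended to the batch and processed when
// it is full. Case 0: a partially filled batch is executed anyway once K
// seconds have passed since the last request arrived.
#include <chrono>

bool tryReceive(Request& out);  // assumed: non-blocking poll for a client request
void execute(Batch& batch);     // assumed: runs the batch, replies to clients

void collectionLoop(Batch& writes, double K = 0.1 /* seconds, as in our experiments */) {
    using Clock = std::chrono::steady_clock;
    auto lastArrival = Clock::now();
    for (;;) {
        Request req;
        if (tryReceive(req)) {
            writes.requests.push_back(req);       // Case 1: independent write
            lastArrival = Clock::now();
            if (writes.full()) {                  // batch full -> process immediately
                execute(writes);
                writes.requests.clear();
            }
        } else if (!writes.requests.empty() &&
                   std::chrono::duration<double>(Clock::now() - lastArrival).count() > K) {
            execute(writes);                      // Case 0: timeout flush
            writes.requests.clear();
        }
    }
}
```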

SLIDE 13

Implementation

Case 2: Write after Write

[Diagram: the collected batch of writes (keys 22, 4, 19, 8, 10, 1) already contains a write for key 4; a new request (write 8, key 4) arrives → conflict]

SLIDE 14

Implementation

Case 2: Write after Write

[Diagram: the collected writes are flushed (sent for batch processing) before the conflicting request is admitted]

SLIDE 15

Implementation

Case 2: Write after Write

[Diagram: after the flush, a new batch of writes is started with the conflicting request (write 8, key 4)]

SLIDE 16

Implementation

Case 3: Read after Write

[Diagram: a new read for key 4 arrives while the collected batch of writes contains a write for key 4 → the writes are flushed, and the read joins the collected batch of reads (keys 7, 13, 32, 25)]

SLIDE 17

Implementation

Case 4: Write after Read

[Diagram: a new write for key 4 arrives while the collected batch of reads contains a read for key 4 → the reads are flushed, and the write joins the collected batch of writes (keys 7, 13, 32, 25)]
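
Putting the four cases together, the dispatch logic the slides describe looks roughly as follows; this is a sketch using the structures and the hypothetical execute helper from above, and the actual prototype may differ in details.

```cpp
// Dispatch for Cases 1-4: a request that conflicts with a pending batch
// forces that batch to be flushed first; otherwise it is simply appended.
#include <unordered_set>

struct TransactionManager {
    Batch reads, writes;
    std::unordered_set<std::string> readKeys, writeKeys;

    void flush(Batch& b, std::unordered_set<std::string>& keys) {
        if (!b.requests.empty()) {
            execute(b);                 // run the batch, reply to clients
            b.requests.clear();
            keys.clear();
        }
    }

    void admit(const Request& r) {
        if (r.op == OpType::Read) {
            if (writeKeys.count(r.key)) flush(writes, writeKeys);  // Case 3: read after write
            reads.requests.push_back(r);                           // Case 1 otherwise
            readKeys.insert(r.key);
            if (reads.full()) flush(reads, readKeys);
        } else {
            if (writeKeys.count(r.key)) flush(writes, writeKeys);  // Case 2: write after write
            if (readKeys.count(r.key))  flush(reads, readKeys);    // Case 4: write after read
            writes.requests.push_back(r);
            writeKeys.insert(r.key);
            if (writes.full()) flush(writes, writeKeys);
        }
    }
};
```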

SLIDE 18

Evaluation

SLIDE 19

YCSB (Yahoo! Cloud Serving Benchmark)

YCSB client architecture [7]

SLIDE 20

Workloads

10k records in the table; each tuple consists of 10 fields (100 bytes each), key length is 24 bytes

Read-only Workload R:
  • 100k read operations
  • all fields of a tuple are read
  • Zipfian distribution of requests

Write-only Workload W:
  • 1 million update operations
  • only one field is updated
  • Zipfian distribution of requests

Mixed Workload M:
  • 100k read/update operations (50% reads and 50% updates)
  • 80% of operations access the last entries (20% of tuples)

Setup:
  • CPU: Intel Xeon E5-2630
  • GPU: Nvidia Tesla K40c
  • OpenCL 1.2
  • CentOS 7.1 (kernel version 3.10.0)

Goal for R and W: evaluating performance on independent reads or writes to find the impact of batch size
Goal for M: what is the impact of concurrency control? Do stale reads improve performance?
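
The Zipfian request pattern can be reproduced with a simple generator like the one below; this is our own sketch, not YCSB's actual implementation (YCSB uses the faster Gray et al. approximation).

```cpp
// Zipfian key generator sketch: rank i is requested with probability
// proportional to 1/i^theta. Precomputing the CDF is fine for 10k records.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

class ZipfianGenerator {
    std::vector<double> cdf_;
    std::mt19937 rng_{42};
    std::uniform_real_distribution<double> uniform_{0.0, 1.0};
public:
    explicit ZipfianGenerator(std::size_t n, double theta = 0.99) : cdf_(n) {
        double norm = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            norm += 1.0 / std::pow(double(i + 1), theta);
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            acc += 1.0 / std::pow(double(i + 1), theta);
            cdf_[i] = acc / norm;        // cumulative probability of ranks 0..i
        }
    }
    std::size_t next() {                 // returns a key index, skewed toward low ranks
        double u = uniform_(rng_);
        return std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin();
    }
};
```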

SLIDE 21

Evaluation (workload R - read only)

  • CPU & row store provides the best performance
  • Small batches reduce collection time
  • Very small batches are not efficient for GPUs
  • Execution is faster with bigger batches; however, this does not compensate for the slow response time

SLIDE 22

Evaluation (workload W - update only)

  • CPU & row store provides the best performance
  • Small batches reduce collection time
  • Very small batches are not efficient for GPUs
  • Execution is faster with bigger batches; however, this does not compensate for the slow response time

SLIDE 23

Evaluation (workload M - read/update, CPU)

  • Concurrency control is beneficial for the CPU: smaller batches → clients get replies quicker
  • Allowing stale reads (0.01 s) improves the performance for the CPU due to the shorter waiting time before execution
  • Big batches are better because of the reduced waiting time in case of conflicting operations: big batches → more operations are executed & the server waits less often

[Charts: performance with CC vs. w/o CC]

SLIDE 24

Evaluation (workload M - read/update, GPU)

  • Concurrency control is not beneficial for the GPU: smaller batches → the GPU is not utilized efficiently
  • Allowing stale reads improves the performance for the GPU & column store due to the shorter waiting time before execution
  • Big batches are better because of the reduced waiting time in case of conflicting operations: more operations are executed → the server waits less often

[Charts: performance with CC vs. w/o CC]

SLIDE 25

Conclusions and Future Work

SLIDE 26

Discussion & Conclusion

The GPU batch size conundrum for OLTP:

Case 1: small batches are processed
  • clients get replies quicker
  • GPUs are not utilized efficiently due to the small number of data elements (this could be improved by splitting requests into fine-grained operations)

Case 2: big batches are processed
  • many data elements are beneficial for GPUs
  • but it takes long to collect batches and throughput can decrease (this gets faster with higher arrival rates)

+ Other considerations: transfer overhead in case the table is not stored on the GPU

SLIDE 27

Future Work

+ More complex transactions and support for rollbacks
+ Concepts for recovery and logging
+ Comparison with state-of-the-art systems

SLIDE 28

Thank you!

Questions?

SLIDE 29

References

1. He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q. and Sander, P.V., 2009. Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS), 34(4), p.21.
2. Breß, S. and Saake, G., 2013. Why it is time for a HyPE: A hybrid query processing engine for efficient GPU coprocessing in DBMS. Proceedings of the VLDB Endowment, 6(12), pp.1398-1403.
3. Breß, S., 2014. The design and implementation of CoGaDB: A column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, 14(3), pp.199-209.
4. Heimel, M., Saecker, M., Pirk, H., Manegold, S. and Markl, V., 2013. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9), pp.709-720.
5. Appuswamy, R., Karpathiotakis, M., Porobic, D. and Ailamaki, A., 2017. The case for heterogeneous HTAP. In 8th Biennial Conference on Innovative Data Systems Research (CIDR).
6. He, B. and Yu, J.X., 2011. High-throughput transaction executions on graphics processors. Proceedings of the VLDB Endowment, 4(5), pp.314-325.
7. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R. and Sears, R., 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 143-154.
8. MapD product website: https://www.mapd.com/
9. Soyata, T., 2018. GPU Parallel Program Development Using CUDA. Chapman and Hall/CRC.

SLIDE 30

References

10. TOP500 news: https://www.top500.org/news/new-gpu-accelerated-supercomputers-change-the-balance-of-power-on-the-top500/
11. Appuswamy, R., Anadiotis, A.C., Porobic, D., Iman, M.K. and Ailamaki, A., 2017. Analyzing the impact of system architecture on the scalability of OLTP engines for high-contention workloads. Proceedings of the VLDB Endowment, 11(2), pp.121-134.

SLIDE 31

Extra Slides: Isolation level

+ By assuming single-operation transactions and no consistency checks, our results avoid divergence and report on larger batches.
+ We are at the single-key CRUD level, not the SQL level.
+ All anomalies are preventable through K-sets and managing the dependency graph, so this more general approach supports complete serializability.

Source: https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/

SLIDE 32

Implementation: Introduction

Two core components:

  • Process Manager (PM): in charge of how processes are executed.
  • Transactional Storage Manager (TxSM): includes the concurrency control functionality.

State-of-the-art PM and TxSM design for GPUs [6]

From the PM approaches we adopt the latter, but while GPUTx is stored-procedure-based, our implementation is operator-based.

Assumptions:

  • Break transactions into smaller record-level operators (basic CRUD)
  • More complex transactions have to be managed higher in the hierarchy

SLIDE 33

GPU computing

  • A GPU is composed of many cores → executes multiple threads at a time
  • efficient at data parallelism
  • limited memory size
  • For optimal execution behavior, each thread within a work group should access sequential blocks of memory (coalesced memory access)
  • Communication bottleneck: data needs to be transferred from RAM to the GPU and back over a PCIe bus
  • Bandwidth bottleneck: the bandwidth of a PCIe bus is lower than the bandwidth of a GPU

[Diagram: GPU memory hierarchy - per-thread registers and local memory, per-block shared memory, and global, constant, and texture memory accessible from the CPU]
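
As an illustration of coalescing (our own example, not code from the prototype), consider an OpenCL kernel that updates one field of many tuples, embedded as a C++ string:

```cpp
// Consecutive work items read consecutive entries of `keys`, so that read
// coalesces; the writes through keys[i] scatter across the column, which is
// one reason skewed OLTP updates are harder to run efficiently on GPUs.
const char* kUpdateKernel = R"CLC(
__kernel void update_field(__global float* field,     // one attribute (column store)
                           __global const int* keys,  // row positions to update
                           const float value,
                           const int n) {
    int i = get_global_id(0);
    if (i < n)
        field[keys[i]] = value;  // coalesced read of keys; scattered write to field
}
)CLC";
```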

SLIDE 34

Motivation: GPUs for OLTP

GPUs for OLTP:

  • The goal: to reduce the cost of database ownership through improvements in throughput
  • There are challenges both from the device and from the workload

SLIDE 35

Row vs. column store

Row store:

  • allows to quickly perform operations that affect all attributes
  • one pointer to access a tuple
  • fields of a tuple are likely to be pre-fetched into the cache
  • good fit for OLTP
  • if only a fraction of the attributes is needed, unnecessary fields are retrieved together with the relevant data

Column store:

  • allows to read only the necessary data
  • good fit for OLAP
  • better compression rate
  • requires accessing each field separately

[Diagram: tuples (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) laid out row-wise vs. column-wise]
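
In C++ terms the two layouts can be sketched as follows (illustrative types): a row store is an array of structs, a column store a struct of arrays.

```cpp
// Row store vs. column store as C++ data layouts (illustrative sketch).
#include <string>
#include <vector>

struct Tuple { std::string a, b, c; };   // all fields of a tuple adjacent in memory

using RowStore = std::vector<Tuple>;     // one pointer reaches a whole tuple; neighboring
                                         // fields are likely pre-fetched together

struct ColumnStore {                     // each attribute stored contiguously: scanning one
    std::vector<std::string> a;          // attribute touches only its array, but reading a
    std::vector<std::string> b;          // full tuple requires one access per column
    std::vector<std::string> c;
};
```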