FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures


SLIDE 1

USENIX ATC’20 2020 USENIX Annual Technical Conference JULY 15–17, 2020

FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures

Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
Renmin University of China, Technische Universität Berlin, National University of Singapore


SLIDE 2

Outline

  • 1. Background
  • 2. Motivation
  • 3. Challenges
  • 4. FineStream
  • 5. Evaluation
  • 6. Conclusion


SLIDE 3
  • 1. Background
  • Bulk-synchronous parallel model: query granularity
  • Continuous operator model: operator granularity (this paper)

[Figure: in both models, a query (operator 1, operator 2, …, operator n) runs across the CPU and the GPU; CPU and GPU can concurrently execute in both cases, only the granularity differs.]

Related work: Saber [SIGMOD'16], window-based hybrid stream processing for heterogeneous architectures.

SLIDE 4
  • 2. Integrated Architectures
  • 2011, Jan: AMD APU
  • 2012, Jan: Intel Ivy Bridge
  • 2014, Apr: Nvidia Tegra

Benefits:
  • No PCI-e transfer overhead
  • Shared global memory
  • High energy efficiency

SLIDE 5
  • 1. Background
  • Integrated architectures vs. discrete architectures

                   Integrated architectures       Discrete architectures
Architecture       A10-7850K    Ryzen5 2400G      GTX 1080Ti    V100
#cores             512+4        704+4             3584          5120
TFLOPS             0.9          1.7               11.3          14.1
Bandwidth (GB/s)   25.6         38.4              484.4         900
Price ($)          209          169               1100          8999
TDP (W)            95           65                250           300

SLIDE 6
  • 3. Stream Processing with SQL
  • Data stream
  • Window
  • Operator
  • Query

  • Batch

[Figure: a data stream of tuples is partitioned into windows w1, w2, … defined by a window size and a window slide; a query (operator 1, operator 2, …, operator n) runs over each window and emits results.]
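The window semantics above (a size and a slide, here counted in tuples) can be sketched in a few lines of Python; the function name and the count-based windows are illustrative assumptions, not FineStream's implementation:

```python
def count_based_windows(stream, size, slide):
    """Yield successive windows of `size` tuples, advancing by `slide`.

    When slide < size, consecutive windows overlap (sliding window);
    when slide == size, windows are disjoint (tumbling window).
    """
    for start in range(0, len(stream) - size + 1, slide):
        yield stream[start:start + size]

# A stream of 6 tuples, window size 4, slide 2 -> windows w1 and w2:
ws = list(count_based_windows([1, 2, 3, 4, 5, 6], size=4, slide=2))
# ws == [[1, 2, 3, 4], [3, 4, 5, 6]]
```

With slide equal to size the same function yields the tumbling-window special case.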

SLIDE 7

Outline

  • 1. Background
  • 2. Motivation
  • 3. Challenges
  • 4. FineStream
  • 5. Evaluation
  • 6. Conclusion


SLIDE 8
  • 2. Motivation
  • Varying Operator-Device Preference

[Figure: a two-operator query (operator 1: group-by, operator 2: aggregation) scheduled on CPU and GPU queues over time; the query takes 18.2 ms on the CPU and 6.7 ms on the GPU, with per-operator timings of 5.2 ms and 5.8 ms shown in the figure.]

SLIDE 9
  • 2. Motivation
  • Performance (tuples/s) of operators on the CPU and the GPU of the integrated architecture.

Operator      CPU only   GPU only   Device choice
Projection    14.2       14.3       GPU
Selection     13.1       14.1       GPU
Aggregation   14.7       13.5       CPU
Group-by      8.1        12.4       GPU
Join          0.7        0.1        CPU
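The per-operator device choice in the table follows directly from the profiled throughputs. A minimal sketch (the numbers are taken from the table; the dictionary layout and function name are illustrative):

```python
# Profiled throughput per operator (values from the table above).
profile = {
    "projection":  {"CPU": 14.2, "GPU": 14.3},
    "selection":   {"CPU": 13.1, "GPU": 14.1},
    "aggregation": {"CPU": 14.7, "GPU": 13.5},
    "group-by":    {"CPU": 8.1,  "GPU": 12.4},
    "join":        {"CPU": 0.7,  "GPU": 0.1},
}

def preferred_device(op):
    """Choose the device with the higher profiled throughput."""
    rates = profile[op]
    return max(rates, key=rates.get)

choices = {op: preferred_device(op) for op in profile}
# choices matches the table's "Device choice" column.
```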

SLIDE 10
  • 2. Motivation
  • Fine-Grained Stream Processing
  • A fine-grained stream processing method that considers both integrated-architecture characteristics and operator features should achieve better performance.
  • Memory bandwidth limit
  • Operators have preferred devices
  • Both the CPU and the GPU deliver good performance
  • Consider the interplay of operator features and architectural differences.
SLIDE 11

Outline

  • 1. Background
  • 2. Motivation
  • 3. Challenges
  • 4. FineStream
  • 5. Evaluation
  • 6. Conclusion


SLIDE 12
  • 3. Challenges
  • Challenge 1: Application topology combined with architectural characteristics

[Figure: an integrated CPU-GPU system: CPU cores with a CPU cache and GPU cores with a GPU cache share a memory management unit and system DRAM; query operators (OP2, OP3, OP5, OP6, OP7, OP9, OP10, OP11) are placed across both devices.]

SLIDE 13
  • 3. Challenges
  • Challenge 2: SQL query plan optimization with shared main memory

[Figure: CPU and GPU queues over time for three plans: CPU only takes 18.2 ms, GPU only 6.7 ms, and a naive CPU-GPU co-run 22.4 ms; with shared main memory, co-running can be slower than a single device, so the query plan must be optimized.]

SLIDE 14
  • 3. Challenges
  • Challenge 3: Adjustment for dynamic workload

[Figure: a three-operator DAG in which the workload split between OP2 and OP3 shifts at runtime from 90%/10% to 10%/90%.]

SLIDE 15

Outline

  • 1. Background
  • 2. Motivation
  • 3. Challenges
  • 4. FineStream
  • 5. Evaluation
  • 6. Conclusion


SLIDE 16
  • 4. FineStream
  • Overview

[Figure: FineStream overview: stream batches and the SQL query feed online profiling, which builds a performance model; the model yields an operator-to-device mapping, and a dispatcher executes the mapped operators and emits results.]
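The overview above pairs a performance model with a dispatcher. A toy sketch of those two stages under simplifying assumptions (per-operator cost tables, the function names, and the example numbers are all illustrative, not FineStream's actual API):

```python
def build_mapping(profiled_cost):
    """Performance model: map each operator to its cheaper (faster) device."""
    return {op: min(costs, key=costs.get) for op, costs in profiled_cost.items()}

def dispatch(batch, plan, mapping, run):
    """Dispatcher: run each operator of the plan on its mapped device."""
    for op in plan:
        batch = run(op, batch, mapping[op])
    return batch

# Toy per-batch costs (ms) as online profiling might report them.
profiled_cost = {"group-by":    {"CPU": 9.0, "GPU": 4.0},
                 "aggregation": {"CPU": 4.0, "GPU": 6.0}}
mapping = build_mapping(profiled_cost)
# mapping == {"group-by": "GPU", "aggregation": "CPU"}
```

The `run` callback stands in for actual operator execution on a device; in the real system this is where the CPU or GPU kernel would be invoked.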

SLIDE 17
  • 4. FineStream
  • Topology

[Figure: an operator DAG (OP1–OP11) with three branches; the critical path runs through the longest branch.]
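The critical path of the topology (the costliest chain of dependent operators) bounds end-to-end latency. A sketch of computing it; the DAG, the per-operator costs, and the function name below are illustrative, not the slide's exact topology:

```python
def critical_path_cost(dag, cost):
    """Longest-path cost in a DAG given as {op: [successor, ...]}."""
    memo = {}
    def longest_from(op):
        if op not in memo:
            succs = dag.get(op, [])
            memo[op] = cost[op] + max((longest_from(s) for s in succs), default=0)
        return memo[op]
    return max(longest_from(op) for op in dag)

# Three branches joining at OP4: the critical path is the slowest branch.
dag = {"OP1": ["OP4"], "OP2": ["OP4"], "OP3": ["OP4"], "OP4": []}
cost = {"OP1": 2, "OP2": 5, "OP3": 3, "OP4": 1}
# critical_path_cost(dag, cost) == 6  (OP2 -> OP4)
```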

SLIDE 18
  • 4. FineStream
  • Optimization 1: Branch Co-Running

[Figure: (a) branch parallelism: branches 1–3 execute across stages t_stage1, t_stage2, t_stage3; (b) branch scheduling optimization: reordering branch execution across stages shortens the total time.]

SLIDE 19
  • 4. FineStream
  • Optimization 2: Batch Pipeline

[Figure: (a) phase partitioning: the operator DAG is split into phase 1 and phase 2 (PHi: phase i, Bi: batch i); (b) batch pipeline: over time, PH1 of batch 2 overlaps with PH2 of batch 1, and so on.]
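The batch pipeline above overlaps the two phases across batches. A timing sketch under the simplifying assumption that each phase takes one time step per batch (the function name and the step model are illustrative):

```python
def pipeline_schedule(num_batches):
    """Return (time_step, phase, batch) triples for a 2-phase pipeline."""
    events = []
    for t in range(num_batches + 1):
        if t < num_batches:
            events.append((t, "PH1", t + 1))  # phase 1 works on batch t+1
        if t >= 1:
            events.append((t, "PH2", t))      # phase 2 finishes batch t
    return events

# Two batches finish in 3 steps instead of the 4 a serial schedule needs:
# [(0, 'PH1', 1), (1, 'PH1', 2), (1, 'PH2', 1), (2, 'PH2', 2)]
```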

SLIDE 20
  • 4. FineStream
  • Optimization 3: Handling Dynamic Workload
  • Light-Weight Resource Reallocation
  • Query Plan Adjustment

[Figure: on the integrated architecture with shared memory, CPU and GPU compute units (CUs) are re-divided between OP2 and OP3 as the workload shifts: (a) 90% of the workload goes to OP2; (b) 90% goes to OP3.]
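Light-weight resource reallocation can be pictured as re-dividing compute units in proportion to each operator's current workload share; the CU count, split rule, and function name below are illustrative assumptions:

```python
def reallocate(total_cus, workload_share):
    """Split compute units proportionally to each operator's workload share."""
    alloc = {op: int(total_cus * share) for op, share in workload_share.items()}
    # Give any rounding remainder to the most loaded operator.
    busiest = max(workload_share, key=workload_share.get)
    alloc[busiest] += total_cus - sum(alloc.values())
    return alloc

# When 90% of the workload goes to OP2:
# reallocate(10, {"OP2": 0.9, "OP3": 0.1}) == {"OP2": 9, "OP3": 1}
```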

SLIDE 21
  • 4. FineStream

  • Execution flow

[Figure: execution flow: stream batches are processed by CPU/GPU threads according to the default DAG's per-operator CPU%/GPU% mapping, using branch co-running and batch pipelining; dataflow monitoring detects dynamic workload and triggers resource reallocation; if performance is still low, operator migration and query plan adjustment produce a new DAG.]
SLIDE 22

Outline

  • 1. Background
  • 2. Motivation
  • 3. Challenges
  • 4. FineStream
  • 5. Evaluation
  • 6. Conclusion


SLIDE 23
  • 5. Evaluation
  • Platforms
  • AMD A10- 7850K
  • Ryzen 5 2400G
  • Datasets
  • Google compute cluster monitoring
  • Anomaly detection in smart grids
  • Linear road benchmark
  • Synthetically generated dataset
  • Benchmarks
  • Nine queries

Example - Q1 (Google compute cluster monitoring):

select timestamp, category, sum(cpu) as totalCPU
from TaskEvents [range 256 slide 1]
group by category
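The grouped aggregation in Q1 can be sketched over one window's batch of tuples; the tuple fields follow the query (timestamp, category, cpu), while the function name and sample values are illustrative:

```python
from collections import defaultdict

def q1_window(tuples):
    """sum(cpu) per category over one window of TaskEvents tuples."""
    total_cpu = defaultdict(float)
    for t in tuples:
        total_cpu[t["category"]] += t["cpu"]
    return dict(total_cpu)

window = [{"timestamp": 1, "category": "A", "cpu": 0.5},
          {"timestamp": 2, "category": "B", "cpu": 0.25},
          {"timestamp": 3, "category": "A", "cpu": 0.25}]
# q1_window(window) == {"A": 0.75, "B": 0.25}
```

In the streaming setting this computation repeats for every window produced by the [range 256 slide 1] specification.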

SLIDE 24
  • 5. Evaluation
  • Throughput: FineStream achieves the best performance in most cases.

[Figure: throughput (1E5 tuples/s) of Single, Saber, and FineStream on Q1–Q9, on A10-7850K and Ryzen5 2400G.]

SLIDE 25
  • 5. Evaluation
  • Latency: Low latency in most cases.

[Figure: latency (s) of Single, Saber, and FineStream on Q1–Q9, on A10-7850K and Ryzen5 2400G.]

SLIDE 26
  • 5. Evaluation
  • Throughput vs. latency
  • Queries with high throughput usually have low latency, and vice versa.

[Figure: throughput (1E5 tuples/s) vs. latency (s) for FineStream and Saber on A10-7850K and Ryzen5 2400G.]

SLIDE 27
  • 5. Evaluation
  • Utilization
  • FineStream utilizes the GPU device better on the integrated architecture.

[Figure: CPU and GPU utilization (%) of Saber and FineStream on Q1–Q9.]

SLIDE 28
  • 5. Evaluation
  • Comparison with Discrete Architectures
  • Throughput: The discrete GPUs exhibit 1.8x to 5.7x higher throughput than the integrated architectures, due to the greater computational power of discrete GPUs.

  • Latency:
  • Discrete GPUs: T_total = T_PCIe_transmit + T_compute
  • Integrated GPUs: T_total = T_compute


SLIDE 29
  • 5. Evaluation
  • Comparison with Discrete Architectures
  • High Price-Throughput Ratio

[Figure: price-performance ratio (performance/USD) on Q1–Q9 for GTX 1080Ti, V100, A10-7850K, and Ryzen5 2400G.]

SLIDE 30
  • 5. Evaluation
  • Comparison with Discrete Architectures
  • High Energy Efficiency

[Figure: energy efficiency (performance/Watt) on Q1–Q9 for GTX 1080Ti, V100, A10-7850K, and Ryzen5 2400G.]

SLIDE 31

Outline

  • 1. Background
  • 2. Motivation
  • 3. Challenges
  • 4. FineStream
  • 5. Evaluation
  • 6. Conclusion


SLIDE 32
  • 6. Conclusion
  • The first fine-grained window-based relational stream processing engine on CPU-GPU integrated architectures.
  • Lightweight query plan adaptations for handling dynamic workloads.
  • Evaluation of FineStream on a set of stream queries.


fengzhang@ruc.edu.cn, yanglin2330@ruc.edu.cn, shuhao.zhang@tu-berlin.de, hebs@comp.nus.edu.sg, lu-wei@ruc.edu.cn, duyong@ruc.edu.cn

Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
Renmin University of China, Technische Universität Berlin, National University of Singapore

USENIX ATC’20 2020 USENIX Annual Technical Conference JULY 15–17, 2020