SLIDE 1 LSDS Cambridge, October 31
Alexandros Koliousis
a.koliousis@imperial.ac.uk
Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, Peter Pietzuch and, more recently, George Theodorakis, Panos Garefalakis and Holger Pirk
Large-Scale Data & Systems Group Department of Computing, Imperial College London http://lsds.doc.ic.ac.uk
The design of a hybrid stream processing system for heterogeneous servers
SLIDE 2
Streams of Data Everywhere
Many new data sources are now available:
Linear access patterns make data processing a streaming problem
SLIDE 3
High-Throughput Low-Latency Analytics
Google Zeitgeist: 40K user queries/s, results within milliseconds
Feedzai: 40K card transactions/s, results in 25 ms
NovaSparks: 150M stock options/s, results in less than 1 ms
Facebook Insights: 9 GB of data, results in less than 10 s
SLIDE 4
Algorithmic Complexity Increases
From publish/subscribe, to complex event processing (CEP), to stream processing
Applications: topic-based filtering, content-based filtering, complex pattern matching, stream queries, online machine learning and data mining
Operations: pre-process, parallelise, share state, aggregate, iterate, …
SLIDE 5
Design Space for Data-Intensive Systems
Tension between performance & algorithmic complexity
[Figure: design space over data amount (MBs to TBs) and result latency (1 ms to 10 s): easy for most algorithms, hard for machine learning algorithms, hard for all algorithms]
SLIDE 6
Scale Out in Data Centres
SLIDE 7
[Figure: input data streams processed by servers in a data centre, producing results]
select highway, segment, direction, AVG(speed)
from Vehicles [range 5 seconds slide 1 second]
group by highway, segment, direction
having AVG(speed) < 40
Task parallelism:
Multiple data processing jobs
Data parallelism:
Single data processing job
Task vs Data Parallelism
SLIDE 8
Idea: Execute data-parallel tasks
Tasks organised as dataflow graph
Almost all big data systems do this:
– Apache Hadoop, Apache Spark, Apache Storm, Apache Flink, Google TensorFlow, ...
[Figure: dataflow graph with operators at parallelism degree 3 and parallelism degree 2]
Distributed Dataflow Systems
SLIDE 9
“Nobody Ever Got Fired for Using a Hadoop Cluster” [HotCDP’12]
Or Flink or Spark ;)
- 2012 study of MapReduce workloads
– Microsoft: median job size < 14 GB
– Yahoo: median job size < 12.5 GB
– Facebook: 90% of jobs < 100 GB
Many data-intensive jobs easily fit into memory
It is expensive to scale out in terms of hardware and engineering!
☛ In many cases a single server is cheaper/more efficient than a cluster
The size of the workloads has changed, but so has the size/price of memory!
SLIDE 10
Exploit Single-Node Heterogeneous Hardware
[Figure: server with two CPU sockets (8 cores each, with L2/L3 caches and DRAM) connected over a PCIe bus (DMA, command queue) to a GPU with 10s of streaming processors, 1000s of cores and 10s of GB of RAM]
Servers with CPUs and GPUs now common
– 10x higher linear memory access throughput
– Limited data transfer throughput
Use both CPU & GPU resources for stream processing
SLIDE 11
CQL: SQL-based declarative language for continuous queries [Arasu et al., VLDBJ’06]
Credit card fraud detection example:
– Find attempts to use same card in different regions within 5-min window
select distinct W.cid
from Payments [range 300 seconds] as W,
     Payments [partition-by 1 row] as L
where W.cid = L.cid and W.region != L.region
CQL offers correct window semantics
With Well-Defined High-Level Queries
Self-join
SLIDE 12
Challenges & Contributions
- 1. How to parallelise sliding-window queries across CPU and GPU?
Decouple query semantics from system parameters
- 2. When to use CPU or GPU for a CQL operator?
Hybrid processing: offload tasks to both CPU and GPU
- 3. How to reduce GPU data movement costs?
Amortise data movement delays with deep pipelining
SABER
Window-Based Hybrid Stream Processing Engine for CPUs & GPUs
Details omitted
SLIDE 13
Problem: Window semantics affect system throughput and latency
– Pick task size based on window size?
How to Parallelise Window Computation?
[Figure: window size 4 sec, slide 1 sec; tasks T1 and T2 each process a full window and output window results in order]
Window-based parallelism results in redundant computation
SLIDE 14
Problem: Window semantics affect system throughput and latency
– Pick task size based on window size?
– Or pick it based on the window slide?
How to Parallelise Window Computation?
[Figure: window size 4 sec, slide 1 sec; tasks T1–T5 each process one slide; window results are composed from partial results]
Slide-based parallelism limits GPU parallelism
SLIDE 15
Avoid coupling throughput/latency of queries to window definition
– e.g. Spark imposes lower bound on window slide:
How to Relate Slides to Tasks?
[Figure: throughput (10^6 tuples/s) vs. window slide for micro-batch sizes of 1s to 5s; the micro-batch size is limited by the window slide]
SLIDE 16
Idea: Decouple task size from window size/slide
– Pick based on underlying hardware features
– Task contains one or more window fragments
- E.g. closing/pending/opening windows in T2 (sketched below)
SABER's Window Processing Model
[Figure: a stream of 15 tuples split into tasks T1–T3 of 5 tuples each; windows w1–w5 (size 7 rows, slide 2 rows) span task boundaries]
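To make the fragment types concrete, here is a minimal Java sketch (illustrative only, not SABER's code; all names are assumptions) that classifies the windows overlapping a fixed-size task as complete, closing, pending, or opening for a row-based window definition:

// A minimal sketch of window-fragment classification for row-based windows.
enum FragmentType { OPENING, CLOSING, PENDING, COMPLETE }

final class WindowFragments {
    // taskStart/taskEnd: global tuple indices covered by this task (end exclusive)
    // size/slide: window definition in rows
    static void classify(long taskStart, long taskEnd, long size, long slide) {
        // first window that overlaps this task
        long firstWindow = Math.max(0, (taskStart - size + slide) / slide);
        for (long w = firstWindow; w * slide < taskEnd; w++) {
            long wStart = w * slide, wEnd = wStart + size;          // end exclusive
            FragmentType t;
            if (wStart >= taskStart && wEnd <= taskEnd)      t = FragmentType.COMPLETE;
            else if (wStart < taskStart && wEnd <= taskEnd)  t = FragmentType.CLOSING;
            else if (wStart >= taskStart)                    t = FragmentType.OPENING;
            else                                             t = FragmentType.PENDING;
            System.out.printf("window %d [%d,%d) is %s in task [%d,%d)%n",
                              w, wStart, wEnd, t, taskStart, taskEnd);
        }
    }

    public static void main(String[] args) {
        // T2 from the slide: second task of 5 tuples, window size 7 rows, slide 2 rows
        classify(5, 10, 7, 2);   // prints closing, closing, pending, opening, opening
    }
}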
SLIDE 17
Idea: Decouple task size from window size/slide
– Assemble window fragment results
– Output them in correct order
[Figure: result stage with slots in an output circular buffer; worker A processes T1 (fragments of w1–w3), worker B processes T2 (fragments of w1–w5)]
Worker B stores T2 results and exits (nothing to forward)
Worker A stores T1 results, merges window fragment results and forwards complete windows downstream
Merging Window Fragment Results
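A minimal Java sketch (assumed structure, not SABER's result stage) of this ordering discipline: each worker stores its task's fragment results in a slot keyed by task id, and whichever worker holds the next expected slot forwards all consecutive filled slots downstream, so results leave in task order even when tasks finish out of order:

// A sketch of ordered result forwarding; merging of fragments that span
// tasks is glossed over inside forward().
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class ResultStage {
    private final ConcurrentMap<Integer, String> slots = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger(0);   // next task id to forward

    // Called by a worker when it finishes task `taskId`.
    void store(int taskId, String fragmentResults) {
        slots.put(taskId, fragmentResults);
        // Only the worker that can advance the next expected slot forwards; others exit.
        int n;
        while (slots.containsKey(n = next.get())) {
            if (!next.compareAndSet(n, n + 1)) continue;       // another worker advanced it
            String r = slots.remove(n);
            forward(n, r);                                     // merge and emit complete windows
        }
    }

    private void forward(int taskId, String windows) {
        System.out.println("task " + taskId + " -> " + windows);
    }
}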
SLIDE 18
Operator Implementations / API
Fragment function, ff
Processes window fragments
Assembly function, fa
Merges partial window results
Batch function, fb
Composes fragment functions within a task
Allows incremental processing
[Figure: size 7 rows, slide 2 rows, 5 tuples/task; fragment functions ff compute fragment results in T1 and T2, the batch function fb composes fragments within a task, and assembly functions fa merge the w1 and w2 results]
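The API can be illustrated with a minimal Java sketch (not SABER's actual interfaces; the names and the AVG example are assumptions) of the three functions for a windowed average:

// ff processes a window fragment, fa merges partial window results, and fb
// composes fragments within a task, which enables incremental processing.
import java.util.List;

interface FragmentFunction<IN, PARTIAL> {          // ff
    PARTIAL processFragment(List<IN> fragmentTuples);
}

interface AssemblyFunction<PARTIAL, OUT> {         // fa
    OUT assemble(List<PARTIAL> partialsOfOneWindow);
}

interface BatchFunction<PARTIAL> {                 // fb
    PARTIAL compose(PARTIAL left, PARTIAL right);  // e.g. merge adjacent fragments
}

// Example instantiation for AVG: the partial result is (sum, count).
record AvgPartial(double sum, long count) {}

class AvgOperator implements FragmentFunction<Double, AvgPartial>,
                             AssemblyFunction<AvgPartial, Double>,
                             BatchFunction<AvgPartial> {
    public AvgPartial processFragment(List<Double> tuples) {
        double s = 0; for (double v : tuples) s += v;
        return new AvgPartial(s, tuples.size());
    }
    public AvgPartial compose(AvgPartial a, AvgPartial b) {
        return new AvgPartial(a.sum() + b.sum(), a.count() + b.count());
    }
    public Double assemble(List<AvgPartial> partials) {
        AvgPartial acc = new AvgPartial(0, 0);
        for (AvgPartial p : partials) acc = compose(acc, p);
        return acc.count() == 0 ? 0.0 : acc.sum() / acc.count();
    }
}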
SLIDE 19
How to Pick the Task Size?
[Figure: throughput (GB/s) vs. task size (32 KB to 4,096 KB) on CPU and GPU]
SLIDE 20
[Figure: SABER throughput (GB/s) and latency (sec) vs. window slide (64 to 16,384 bytes)]
How Does Window Slide Affect Performance?
Performance of window-based queries remains predictable
Aggregation (avg) [rows 1024, slide x]
SLIDE 21
Challenges & Contributions
- 1. How to parallelise sliding-window queries across CPU and GPU?
Decouple query semantics from system parameters
- 2. When to use CPU or GPU for a CQL operator?
Hybrid processing: offload tasks to both CPU and GPU
- 3. How to reduce GPU data movement costs?
Amortise data movement delays with deep pipelining
SABER
Window-Based Hybrid Stream Processing Engine for CPUs & GPUs
SLIDE 22
Idea: Enable tasks to run on both processors
– Scheduler assigns tasks to idle processors
[Figure: observed past behaviour: QA tasks take 3 ms on the CPU and 2 ms on the GPU; QB tasks take 3 ms on the CPU and 1 ms on the GPU; a queue of tasks T1–T10 is assigned first-come first-served to whichever processor becomes idle]
SABER's Hybrid Stream Processing Model
FCFS ignores effectiveness of processor for given task
SLIDE 23
Idea: Idle processor skips tasks that could be executed faster by another processor
– Decision based on observed query task throughput
[Figure: with the same task queue and past behaviour, HLS lets each idle processor skip tasks that the other would run faster: the CPU takes the QA tasks and the GPU takes the QB tasks]
Heterogeneous Look-Ahead Scheduler (HLS)
HLS fully utilises processors
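A minimal Java sketch of the look-ahead decision (illustrative only; the data structures and the fallback policy are assumptions, not SABER's implementation): an idle processor scans the task queue for the first task it is observed to run at least as fast as the other processor, and otherwise takes the oldest task so that it never sits idle:

// Observed query task throughput drives which tasks a processor skips.
import java.util.*;

enum Proc { CPU, GPU }

record Task(int id, String query) {}

class HlsScheduler {
    // observed task throughput per (query, processor), e.g. tasks/sec
    private final Map<String, EnumMap<Proc, Double>> observed = new HashMap<>();
    private final Deque<Task> queue = new ArrayDeque<>();

    void submit(Task t) { queue.addLast(t); }

    void record(String query, Proc p, double tasksPerSec) {
        observed.computeIfAbsent(query, q -> new EnumMap<>(Proc.class)).put(p, tasksPerSec);
    }

    // Called by an idle processor to pick its next task.
    Task next(Proc idle) {
        for (Iterator<Task> it = queue.iterator(); it.hasNext(); ) {
            Task t = it.next();
            if (prefers(t.query(), idle)) { it.remove(); return t; }
        }
        return queue.pollFirst();   // nothing preferred: take the oldest task anyway
    }

    private boolean prefers(String query, Proc p) {
        EnumMap<Proc, Double> m = observed.getOrDefault(query, new EnumMap<>(Proc.class));
        double mine   = m.getOrDefault(p, 0.0);
        double others = m.getOrDefault(p == Proc.CPU ? Proc.GPU : Proc.CPU, 0.0);
        return mine >= others;      // unknown queries default to "run it here"
    }
}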
SLIDE 24
The SABER Architecture
[Figure: input tuples flow through a dispatching stage, CPU and GPU workers, and a result stage]
Dispatching stage: dispatch fixed-size tasks
Scheduling & execution stage: dequeue tasks based on HLS
Result stage: merge & forward partial window results
Implementation: Java (15K LOC), C & OpenCL (4K LOC)
SLIDE 25
Evaluation: Set-up & Workloads
Intel Xeon 2.6 GHz: 16 cores, 64 GB RAM
NVIDIA Quadro K5200: 2,304 cores, 8 GB RAM, PCIe 3.0 x16
Ubuntu Linux 14.04, NVIDIA driver 346.47, 10 Gbps NIC
Workloads:
– Google Cluster Data: 144M job events from Google infrastructure
– Smart Grid Measurements: 974M plug measurements from houses
– Linear Road Benchmark: 11M car positions and speeds on a highway
SLIDE 26
Is Hybrid Stream Processing Effective?
[Figure: throughput (10^6 tuples/s) of queries CM2, SG1, SG2, LRB3, LRB4 (cluster management, smart grid, and LRB workloads; operators include select, group-by avg/cnt, and avg aggregation), split into SABER's CPU and GPU contributions]
Different queries result in different CPU:GPU processing splits that are hard to predict offline
SLIDE 27
Is Hybrid Stream Processing Effective?
[Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only), and SABER for aggregation, group-by, and θ-join; the GPU is faster for some operators and the CPU for others; hybrid throughput is not additive due to queue contention]
Aggregate CPU+GPU throughput is always higher than either processor alone
SLIDE 28
What is the CPU/GPU Trade-Off?
[Figure: throughput (GB/s) vs. number of selection predicates (1–64) and number of join predicates (1–64) for SABER (CPU only), SABER (GPU only), and SABER; with few predicates the system is dispatch-bound; rows 1024, slide 1024]
Hybrid processing model benefits from the GPU's ability to process complex predicates fast
SLIDE 29
W1 benefits from static scheduling but HLS fully utilises GPU:
– GPU also runs ~1% of γ tasks
W2 benefits from FCFS but HLS better utilises GPU:
– HLS CPU:GPU split is 1:2.5 for π and 1:0.5 for α
Is Heterogeneous Look-Ahead Scheduling Effective?
[Figure: throughput (GB/s) under FCFS, static, and HLS scheduling for W1 (π and γ; rows 1024, slide 512) and W2 (π and α; rows 1024, slide 1024); GPU speed-up over CPU is 5x for π and 6x for γ in W1, and 1.5x for both π and α in W2]
SLIDE 30
Is Heterogeneous Look-Ahead Scheduling Adaptive?
[Figure: SABER throughput (GB/s) and its GPU contribution over time (seconds) as query selectivity changes]
Example: with higher selectivity more predicates are evaluated, so the GPU is preferred
HLS periodically uses idle, non-preferred processor to run tasks to update query task throughput
SLIDE 31
H/W-Oblivious Tasks, H/W-Conscious Operators
To begin with, can SABER compete with popular distributed stream processing systems?
https://lsds.doc.ic.ac.uk/blog/do-we-need-distributed-stream-processing
SLIDE 32
Enter Yahoo! Stream Benchmark
An industry standard (wannabe)
Used by Storm, Flink, Spark, Apex, Drizzle, Differential Dataflow
Tumbling-window query, bottlenecked by external factors
[Query dataflow: ad click events → σ (filter) → π (project) → join with the Campaigns relation R → γcnt]
Counts how many times each campaign has been seen in a tumbling window
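A minimal Java sketch of the query's logic (the schema, field names, and 10-second window are assumptions based on the public benchmark description, not the benchmark implementation): filter view events, join ad ids to campaigns, and count per campaign per tumbling window:

// Projection is implicit: only the ad id, event type, and event time are used.
import java.util.HashMap;
import java.util.Map;

class CampaignCounter {
    private final Map<String, String> adToCampaign;                           // the Campaigns relation
    private final Map<Long, Map<String, Long>> counts = new HashMap<>();      // window -> campaign -> count
    private static final long WINDOW_MS = 10_000;

    CampaignCounter(Map<String, String> adToCampaign) { this.adToCampaign = adToCampaign; }

    void onEvent(String adId, String eventType, long eventTimeMs) {
        if (!"view".equals(eventType)) return;                // σ: filter
        String campaign = adToCampaign.get(adId);             // join with Campaigns
        if (campaign == null) return;
        long window = eventTimeMs / WINDOW_MS;                // tumbling window index
        counts.computeIfAbsent(window, w -> new HashMap<>())
              .merge(campaign, 1L, Long::sum);                // γcnt: count per campaign
    }
}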
SLIDE 33
Systems Compared
Apache Flink (1.3.2)
Apache Spark Streaming (2.4.0)
SABER (1.0), without GPU support
StreamBox: a single-server system with emphasis on out-of-order processing
SLIDE 34
Experimental Setup
6 servers (1 master and 5 slaves): 2 Intel Xeon E5-2660 v3 2.60 GHz CPUs
○ 20 physical CPU cores
○ 25 MB LLC
○ 32 GB of memory
10 GigE connection between the nodes
In-memory data generation (8 cores per node)
SLIDE 35
On a Single Server…
Reduced serialization costs; keeping data in LLC
[Figure: single-server throughput comparison, with 3.4×, 1.9×, and 6.6× differences]
SLIDE 36
On Multiple Servers…
SLIDE 37
On Multiple Servers…
[Figure: multi-server throughput compared with Flink and Spark (3.4× and 1.2× annotations); 64 million tuples/sec with 6 cores]
SLIDE 38
COST [HotOS’15]
Throughput (million tuples/sec): Spark 2, Flink 4.8, SABER 11.8, handwritten C++ …
Pipeline Strategy [HyPer, VLDB'11]:
- keep data in CPU registers
- perform as many sequential operations as possible per tuple
- maximise data locality
Can we do better than the LLC?
With a compiler-based approach that generates custom code based on a set of hardware-specific optimisations for any given query
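A minimal Java sketch (an assumed example, not generated code) of the pipelining idea: the selection and the aggregation are fused into a single loop over the batch, so the per-tuple state lives in local variables rather than in materialised intermediate results:

// Fused σ + γ: one sequential pass, aggregation state kept in locals (likely registers).
final class FusedSelectAggregate {
    // select AVG(speed) from Vehicles where highway = h, over one batch
    static double fusedAvgSpeed(int[] highway, double[] speed, int h) {
        double sum = 0; long count = 0;
        for (int i = 0; i < speed.length; i++) {   // one pass over the batch
            if (highway[i] == h) {                 // σ: selection, no intermediate buffer
                sum += speed[i];                   // γ: aggregation state in locals
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }
}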
SLIDE 39
H/W-Efficient Streaming Operators
HammerSlide: Work- and CPU-efficient Streaming Window Aggregation [ADMS'18]
- Incremental computation for both invertible and non-invertible functions (see the sketch below)
- Parallel processing within a slide (>1) with SIMD instructions
- Bridges the gap between sliding and tumbling window computation
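For the non-invertible case, a minimal Java sketch of the textbook two-stack technique (illustrative only, not HammerSlide's SIMD implementation): each stack entry carries a running maximum, so evicting the oldest tuple and querying the window maximum are O(1) amortised instead of recomputing the whole window:

// For an invertible function like sum, the evicted value is simply subtracted;
// max is not invertible, so two stacks with running aggregates are used instead.
import java.util.ArrayDeque;
import java.util.Deque;

class TwoStackMax {
    // each element stores {value, running max of everything below it}
    private final Deque<double[]> front = new ArrayDeque<>();
    private final Deque<double[]> back  = new ArrayDeque<>();

    void insert(double v) {
        double agg = back.isEmpty() ? v : Math.max(v, back.peek()[1]);
        back.push(new double[]{v, agg});
    }

    void evict() {                       // remove the oldest value in the window
        if (front.isEmpty()) {           // flip the back stack, rebuilding running maxima
            while (!back.isEmpty()) {
                double v = back.pop()[0];
                double agg = front.isEmpty() ? v : Math.max(v, front.peek()[1]);
                front.push(new double[]{v, agg});
            }
        }
        front.pop();
    }

    double query() {                     // max over the current window
        if (front.isEmpty()) return back.peek()[1];
        if (back.isEmpty())  return front.peek()[1];
        return Math.max(front.peek()[1], back.peek()[1]);
    }
}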
SLIDE 40
HammerSlide + SABER
SLIDE 41
Window processing model
Decouples query semantics from system parameters
Hybrid stream processing model
Can achieve aggregate throughput of heterogeneous processors
Heterogeneous Look-Ahead Scheduling (HLS)
Allows use of both CPU and GPU opportunistically for arbitrary workloads
Alexandros Koliousis
github.com/lsds/saber
Thank you! Any Questions?
Summary