SLIDE 1: The design of a hybrid stream processing system for heterogeneous servers

Alexandros Koliousis
a.koliousis@imperial.ac.uk

Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, Peter Pietzuch and, more recently, George Theodorakis, Panos Garefalakis and Holger Pirk

Large-Scale Data & Systems Group
Department of Computing, Imperial College London
http://lsds.doc.ic.ac.uk

LSDS Cambridge, October 31

SLIDE 2: Streams of Data Everywhere

Many new data sources are now available:

Linear access patterns make data processing a streaming problem

SLIDE 3: High-Throughput Low-Latency Analytics

Google Zeitgeist: 40K user queries/s, answered within milliseconds
Feedzai: 40K card transactions/s, in 25 ms
NovaSparks: 150M stock options/s, in less than 1 ms
Facebook Insights: 9 GB of page metrics/s, in less than 10 s

[Figure: a window sliding over the stream from t to t+1]

SLIDE 4: Algorithmic Complexity Increases

[Figure: spectrum from publish/subscribe (topic-based and content-based filtering), to complex event processing (CEP, complex pattern matching over tuples), to stream processing (stream queries, online machine learning, data mining), with growing needs to pre-process, parallelise, share state, aggregate and iterate]

SLIDE 5: Design Space for Data-Intensive Systems

Tension between performance & algorithmic complexity

[Figure: design space spanning data amounts from MBs to TBs and result latencies from 10 s down to 1 ms; easy for most algorithms at one end, hard for machine learning algorithms in between, hard for all algorithms at the other end]

SLIDE 6: Scale Out in Data Centres

SLIDE 7: Task vs Data Parallelism

Input data flows into servers in a data centre, which produce the results.

Task parallelism: multiple data processing jobs

Data parallelism: a single data processing job, run as multiple parallel instances

Example continuous queries:

select highway, segment, direction, AVG(speed)
from Vehicles [range 5 seconds slide 1 second]
group by highway, segment, direction
having avg < 40

select distinct W.cid
from Payments [range 300 seconds] as W, Payments [partition-by 1 row] as L
where W.cid = L.cid and W.region != L.region

SLIDE 8: Distributed Dataflow Systems

Idea: Execute data-parallel tasks on cluster nodes

Tasks are organised as a dataflow graph. Almost all big data systems do this:
– Apache Hadoop, Apache Spark, Apache Storm, Apache Flink, Google TensorFlow, ...

[Figure: dataflow graph with operators of parallelism degree 3 and 2]

SLIDE 9: "Nobody Ever Got Fired for Using a Hadoop Cluster" [HotCDP'12]

Or Flink or Spark ;)

• 2012 study of MapReduce workloads:
– Microsoft: median job size < 14 GB
– Yahoo: median job size < 12.5 GB
– Facebook: 90% of jobs < 100 GB

Many data-intensive jobs easily fit into memory. It's expensive to scale out in terms of hardware and engineering!

☛ In many cases a single server is cheaper/more efficient than a cluster

The size of the workloads has changed, but so has the size/price of memory!

SLIDE 10: Exploit Single-Node Heterogeneous Hardware

Servers with CPUs and GPUs are now common.

GPUs offer 10s of streaming processors, 1000s of cores and 10s of GB of RAM:
– 10x higher linear memory access throughput
– Limited data transfer throughput (command queue, DMA over the PCIe bus)

[Figure: two CPU sockets, each with 8 cores, L2/L3 caches and DRAM, connected over the PCIe bus to the GPU]

Use both CPU & GPU resources for stream processing

SLIDE 11: With Well-Defined High-Level Queries

CQL: SQL-based declarative language for continuous queries [Arasu et al., VLDBJ'06]

Credit card fraud detection example:
– Find attempts to use the same card in different regions within a 5-min window

select distinct W.cid
from Payments [range 300 seconds] as W, Payments [partition-by 1 row] as L
where W.cid = L.cid and W.region != L.region

(a self-join over the Payments stream)

CQL offers correct window semantics

SLIDE 12: Challenges & Contributions

  • 1. How to parallelise sliding-window queries across CPU and GPU?

Decouple query semantics from system parameters

  • 2. When to use CPU or GPU for a CQL operator?

Hybrid processing: offload tasks to both CPU and GPU

  • 3. How to reduce GPU data movement costs?

Amortise data movement delays with deep pipelining

SABER: Window-Based Hybrid Stream Processing Engine for CPUs & GPUs

Details omitted

SLIDE 13: How to Parallelise Window Computation?

Problem: Window semantics affect system throughput and latency
– Pick task size based on window size?

[Figure: tuples 1-6 with window size 4 sec and slide 1 sec; each window becomes a task (Task T1, Task T2); window results are output in order]

Window-based parallelism results in redundant computation

SLIDE 14: How to Parallelise Window Computation?

Problem: Window semantics affect system throughput and latency
– Pick task size based on window size? On window slide?

[Figure: tuples 1-6 with window size 4 sec and slide 1 sec; each slide becomes a task (T1-T5); window results are composed from partial results]

Slide-based parallelism limits GPU parallelism

SLIDE 15: How to Relate Slides to Tasks?

Avoid coupling the throughput/latency of queries to the window definition
– e.g. Spark imposes a lower bound on the window slide:

[Figure: throughput (10^6 tuples/s) against window slide (10^6 tuples/s) for batch sizes from 1s to 5s]

Window slide limited by the minimum latency (~500 ms)
Micro-batch size limited by the window slide

SLIDE 16: SABER's Window Processing Model

Idea: Decouple task size from window size/slide
– Pick task size based on underlying hardware features, e.g. PCIe throughput
– A task contains one or more window fragments
  • e.g. closing/pending/opening windows in T2 (see the sketch after the figure below)

[Figure: 15 tuples of a stream with window size 7 rows and slide 2 rows, split into tasks T1-T3 of 5 tuples each; windows w1-w5 span task boundaries]
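As an illustration of this fragment bookkeeping, the sketch below (assumed names and row-based arithmetic, not SABER's actual code) maps a fixed-size task covering rows [taskStart, taskEnd) to the window fragments it contains:

import java.util.ArrayList;
import java.util.List;

enum FragmentType { COMPLETE, OPENING, CLOSING, PENDING }

record Fragment(long windowId, FragmentType type, long from, long to) {}

class WindowFragments {
    // Window w covers rows [w*slide, w*slide + size); a task covers rows [taskStart, taskEnd).
    static List<Fragment> fragmentsOf(long taskStart, long taskEnd, long size, long slide) {
        List<Fragment> fragments = new ArrayList<>();
        long w = Math.max(0, (taskStart - size) / slide);   // lower bound on the first overlapping window
        while (w * slide + size <= taskStart) w++;          // skip windows that closed before this task
        for (; w * slide < taskEnd; w++) {
            long start = w * slide, end = start + size;
            FragmentType type;
            if (start >= taskStart && end <= taskEnd) type = FragmentType.COMPLETE; // fully inside the task
            else if (start >= taskStart)              type = FragmentType.OPENING;  // opens here, closes later
            else if (end <= taskEnd)                  type = FragmentType.CLOSING;  // opened earlier, closes here
            else                                      type = FragmentType.PENDING;  // spans the whole task
            fragments.add(new Fragment(w, type, Math.max(start, taskStart), Math.min(end, taskEnd)));
        }
        return fragments;
    }
}

For the figure's parameters (windows of 7 rows, slide of 2 rows, 5 tuples per task), the second task yields closing fragments of the first two windows, a pending fragment of the third, and opening fragments of the fourth and fifth, matching the closing/pending/opening example above.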

SLIDE 17: Merging Window Fragment Results

Idea: Decouple task size from window size/slide
– Assemble window fragment results
– Output them in correct order

[Figure: Worker A processes task T1 (fragments of w1-w3) and Worker B processes task T2 (fragments of w1-w5); both store their results in slots of the result stage's output circular buffer, from which the w1 and w2 results are emitted]

Worker B stores T2 results and exits (nothing to forward). Worker A stores T1 results, merges window fragment results and forwards complete windows downstream.
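A minimal sketch of that result-stage logic (assumed structure, not SABER's code): each task writes its fragment results into a slot of a circular buffer, and whichever worker fills the slot at the head merges and forwards windows in order, so the output order never depends on which worker finishes first.

import java.util.concurrent.atomic.AtomicReferenceArray;

class ResultStage {
    private final AtomicReferenceArray<Object> slots;   // fragment results, indexed by task id
    private Object open = null;                          // windows still open after the last forwarded task
    private int next = 0;                                // id of the next task whose results may be forwarded

    ResultStage(int numberOfSlots) { slots = new AtomicReferenceArray<>(numberOfSlots); }

    // Called by a worker when its task completes; forwarding only happens if this task is next in order.
    void taskFinished(int taskId, Object fragmentResults) {
        slots.set(taskId % slots.length(), fragmentResults);
        tryForward();
    }

    private synchronized void tryForward() {
        Object r;
        while ((r = slots.get(next % slots.length())) != null) {
            // Merge fragments of windows left open by earlier tasks with the closing fragments of
            // this task, emit complete windows downstream in order, and carry still-open windows over.
            open = assembleAndEmit(open, r);
            slots.set(next % slots.length(), null);
            next++;
        }
    }

    private Object assembleAndEmit(Object openWindows, Object fragmentResults) {
        // Placeholder: apply the assembly function, push complete windows downstream,
        // and return the partial results of windows that remain open.
        return fragmentResults;
    }
}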

SLIDE 18: Operator Implementations / API

Fragment function, ff: processes window fragments
Assembly function, fa: merges partial window results
Batch function, fb: composes fragment functions within a task and allows incremental processing
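As a concrete example of these function shapes (an assumed sketch, not SABER's actual interfaces), a windowed average can be expressed as follows:

import java.util.List;

record AvgPartial(double sum, long count) {
    double value() { return count == 0 ? Double.NaN : sum / count; }
}

class AvgOperator {
    // Fragment function ff: computes a partial result over one window fragment.
    static AvgPartial ff(List<Double> fragment) {
        double sum = 0;
        for (double v : fragment) sum += v;
        return new AvgPartial(sum, fragment.size());
    }

    // Assembly function fa: merges the partial results of two fragments of the same window.
    static AvgPartial fa(AvgPartial a, AvgPartial b) {
        return new AvgPartial(a.sum() + b.sum(), a.count() + b.count());
    }

    // Batch function fb: composes fragment results within a task so that tuples shared by
    // overlapping fragments are processed incrementally rather than from scratch.
    static AvgPartial fb(AvgPartial previousFragment, List<Double> newTuples) {
        return fa(previousFragment, ff(newTuples));
    }
}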

[Figure: tuples 1-10 with window size 7 rows, slide 2 rows and 5 tuples per task; ff processes the fragments within tasks T1 and T2, fb composes them, and fa merges the partial results into the w1 and w2 outputs]

SLIDE 19: How to Pick the Task Size?

[Figure: CPU and GPU throughput (GB/s) as a function of task size (32 KB to 4096 KB)]

SLIDE 20: How Does Window Slide Affect Performance?

Aggregation (avg) over windows of 1024 rows with varying slide

[Figure: SABER throughput (GB/s) and latency (sec) against window slide (bytes, 64 to 16384)]

Performance of window-based queries remains predictable

SLIDE 21: Challenges & Contributions

  • 1. How to parallelise sliding-window queries across CPU and GPU?

Decouple query semantics from system parameters

  • 2. When to use CPU or GPU for a CQL operator?

Hybrid processing: offload tasks to both CPU and GPU

  • 3. How to reduce GPU data movement costs?

Amortise data movement delays with deep pipelining

SABER: Window-Based Hybrid Stream Processing Engine for CPUs & GPUs

SLIDE 22: SABER's Hybrid Stream Processing Model

Idea: Enable tasks to run on both processors
– Scheduler assigns tasks to idle processors (whichever comes first)

[Figure: past behaviour per query task: QA takes 3 ms on the CPU and 2 ms on the GPU, QB takes 3 ms and 1 ms; with First-Come First-Served scheduling of a queue of ten tasks (T1-T10), each idle processor simply takes the task at the head of the queue]

FCFS ignores the effectiveness of a processor for a given task

SLIDE 23: Heterogeneous Look-Ahead Scheduler (HLS)

Idea: An idle processor skips tasks that could be executed faster by the other processor
– Decision based on observed query task throughput (see the sketch below)

[Figure: with the same past behaviour (QA: 3 ms on CPU, 2 ms on GPU; QB: 3 ms on CPU, 1 ms on GPU) and the same queue of ten tasks, the HLS assignment keeps both processors busy and finishes the queue sooner than FCFS]

HLS fully utilises both processors
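A minimal sketch of the look-ahead decision (assumed data structures, not SABER's implementation): an idle processor scans the shared queue and skips tasks whose query it is known to run more slowly, up to a bounded look-ahead; if nothing preferred is in sight, it takes the head rather than sit idle.

import java.util.Deque;
import java.util.Iterator;
import java.util.Map;

class HlsScheduler {
    enum Processor { CPU, GPU }
    record Task(String queryId, Object batch) {}

    private final Deque<Task> queue;                    // hybrid task queue shared by both processors
    private final Map<String, Double> cpuThroughput;    // observed task throughput per query on the CPU
    private final Map<String, Double> gpuThroughput;    // observed task throughput per query on the GPU
    private final int lookAhead;                        // how far an idle processor may look ahead

    HlsScheduler(Deque<Task> queue, Map<String, Double> cpu, Map<String, Double> gpu, int lookAhead) {
        this.queue = queue; this.cpuThroughput = cpu; this.gpuThroughput = gpu; this.lookAhead = lookAhead;
    }

    // Called by an idle processor to pick its next task.
    synchronized Task nextTaskFor(Processor idle) {
        Iterator<Task> it = queue.iterator();
        for (int inspected = 0; it.hasNext() && inspected < lookAhead; inspected++) {
            Task t = it.next();
            if (prefers(t.queryId(), idle)) { it.remove(); return t; }  // take the first task we run fastest
        }
        return queue.pollFirst();  // nothing preferred in sight: take the head rather than sit idle
    }

    private boolean prefers(String queryId, Processor p) {
        double cpu = cpuThroughput.getOrDefault(queryId, 0.0);
        double gpu = gpuThroughput.getOrDefault(queryId, 0.0);
        return p == Processor.CPU ? cpu >= gpu : gpu >= cpu;
    }
}

As slide 30 discusses, the throughput estimates are kept fresh by periodically letting the idle, non-preferred processor run a task, which is what lets the scheduler adapt when the workload changes.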

SLIDE 24: The SABER Architecture

Dispatching stage: dispatch fixed-size tasks
Scheduling & execution stage: dequeue tasks based on HLS and run them on the CPU or the GPU
Result stage: merge & forward partial window results

Implementation: 15K LOC of Java plus 4K LOC of C & OpenCL

[Figure: tasks flow from the dispatcher through the CPU and GPU execution pipelines to the result stage]

SLIDE 25: Evaluation Set-up & Workloads

Hardware and software:
– Intel Xeon 2.6 GHz, 16 cores, 64 GB RAM
– NVIDIA Quadro K5200, 2,304 cores, 8 GB RAM
– PCIe 3.0 x16, 10 Gbps NIC
– Ubuntu Linux 14.04, NVIDIA driver 346.47

Workloads:
– Google Cluster Data: 144M job events from Google infrastructure
– SmartGrid Measurements: 974M plug measurements from houses
– Linear Road Benchmark: 11M car positions and speeds on a highway

SLIDE 26: Is Hybrid Stream Processing Effective?

[Figure: SABER throughput (10^6 tuples/s) for queries CM2 (Cluster Mgmt.), SG1 and SG2 (Smart Grid), LRB3 and LRB4 (Linear Road), covering selection, group-by count and group-by/aggregation average operators, broken down into CPU and GPU contributions; Intel Xeon 2.6 GHz (16 cores) and NVIDIA Quadro K5200 (2,304 cores)]

Different queries result in different CPU:GPU processing splits that are hard to predict offline

SLIDE 27: Is Hybrid Stream Processing Effective?

[Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only) and SABER for aggregation, group-by and θ-join queries; the GPU is faster for some, the CPU for others; the combined throughput is not additive due to queue contention]

The aggregate throughput with both CPU and GPU is always higher than with either processor alone

SLIDE 28: What is the CPU/GPU Trade-Off?

[Figure: throughput of SABER (CPU only), SABER (GPU only) and SABER, over windows of 1024 rows with a slide of 1024, as the number of selection predicates grows from 1 to 64 (up to the dispatch bound) and as the number of join predicates grows from 1 to 64]

The hybrid processing model benefits from the GPU's ability to process complex predicates fast

SLIDE 29: Is Heterogeneous Look-Ahead Scheduling Effective?

Two workloads: W1 combines a projection (π) and a grouped aggregation (γ) over windows of 1024 rows with a slide of 512; W2 combines a projection (π) and an aggregation (α) over windows of 1024 rows with a slide of 1024.

[Figure: throughput (GB/s) of FCFS, Static and HLS scheduling for W1 and W2; per-operator CPU vs GPU comparison: W1: π 5x, γ 6x; W2: π 1.5x, α 1.5x]

W1 benefits from static scheduling, but HLS fully utilises the GPU:
– the GPU also runs ~1% of the γ tasks

W2 benefits from FCFS, but HLS better utilises the GPU:
– the HLS CPU:GPU split is 1:2.5 for π and 1:0.5 for α

SLIDE 30: Is Heterogeneous Look-Ahead Scheduling Adaptive?

[Figure: SABER throughput (GB/s) and its GPU contribution over time (seconds), alongside the query's selectivity as it changes]

Example: with higher selectivity more predicates are evaluated, so the GPU is preferred

HLS periodically uses the idle, non-preferred processor to run tasks in order to update the observed query task throughput

SLIDE 31: H/W-Oblivious Tasks, H/W-Conscious Operators

To begin with, can SABER compete with popular distributed stream processing systems?

https://lsds.doc.ic.ac.uk/blog/do-we-need-distributed-stream-processing

SLIDE 32: Enter the Yahoo! Streaming Benchmark

An industry standard (wannabe): Storm, Flink, Spark, Apex, Drizzle, Differential Dataflow

A tumbling-window query, bottlenecked by factors other than computation:
– a stream of ad click events is filtered (σ), joined with a campaigns relation (R), projected (π) and counted per campaign (γcnt)
– the query reports how many times a campaign has been seen in a tumbling window

SLIDE 33: Systems Compared

– Apache Flink (1.3.2)
– Apache Spark Streaming (2.4.0)
– SABER (1.0), without GPU support
– StreamBox: a single-server system with an emphasis on out-of-order processing

SLIDE 34: Experimental Setup

– 6 servers (1 master and 5 slaves), each with 2 Intel Xeon E5-2660 v3 2.60 GHz CPUs
  ○ 20 physical CPU cores
  ○ 25 MB LLC
– 32 GB of memory
– 10 GigE connection between the nodes
– In-memory data generation, 8 cores per node

SLIDE 35: On a Single Server…

[Figure: single-server throughput comparison, with annotated speedups of 3.4x, 1.9x and 6.6x]

Reduced serialization costs; keeping data in the LLC

SLIDE 36: On Multiple Servers…

SLIDE 37: On Multiple Servers…

[Figure: multi-server throughput comparison, with annotated speedups of 3.4x and 1.2x]

64 million tuples/sec with 6 cores! Flink outperforms Spark!

SLIDE 38: COST [HotOS'15]

Throughput (million tuples/sec), comparing Spark, Flink, SABER and handwritten C++: 2, 4.8, 11.8

Pipeline strategy [HyPer, VLDB'11]:
• keep data in CPU registers
• as many sequential operations as possible per tuple
• maximize data locality

Do better than the LLC? With a compiler-based approach that generates custom code based on a set of hardware-specific optimisations for any given query.
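To make the pipeline strategy concrete, here is an illustrative sketch (not generated code from any of these systems) contrasting operator-at-a-time execution, which materialises intermediate results in memory, with a fused loop that keeps each tuple in registers from the scan through the predicate to the aggregate update:

class FusedPipeline {
    // Operator-at-a-time: each operator materialises its output, so every tuple is
    // written to and re-read from memory between the filter and the aggregate.
    static long[] countPerKeyMaterialised(int[] keys, double[] values, double threshold, int numKeys) {
        boolean[] selected = new boolean[keys.length];          // materialised filter result
        for (int i = 0; i < keys.length; i++) selected[i] = values[i] > threshold;
        long[] counts = new long[numKeys];
        for (int i = 0; i < keys.length; i++) if (selected[i]) counts[keys[i]]++;
        return counts;
    }

    // Fused pipeline: one pass, maximising data locality and sequential work per tuple.
    static long[] countPerKeyFused(int[] keys, double[] values, double threshold, int numKeys) {
        long[] counts = new long[numKeys];
        for (int i = 0; i < keys.length; i++) {
            if (values[i] > threshold) counts[keys[i]]++;
        }
        return counts;
    }
}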
SLIDE 39: H/W-Efficient Streaming Operators

HammerSlide: Work- and CPU-efficient Streaming Window Aggregation [ADMS'18]
• Incremental computation for both invertible and non-invertible functions
• Parallel processing within a slide (>1) with SIMD instructions
• Bridges the gap between sliding and tumbling window computation
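For intuition about incremental computation over a non-invertible function, the classic two-stacks technique maintains a sliding max in amortised constant time per tuple; the sketch below illustrates the idea and is not HammerSlide's code (HammerSlide additionally vectorises such computations with SIMD instructions).

import java.util.ArrayDeque;
import java.util.Deque;

class SlidingMax {
    // Each entry keeps a value and the running max of everything beneath it on its stack.
    private record Entry(double value, double max) {}
    private final Deque<Entry> front = new ArrayDeque<>();  // oldest elements; evictions pop from here
    private final Deque<Entry> back = new ArrayDeque<>();   // newest elements; insertions push here

    void insert(double v) {
        double runningMax = back.isEmpty() ? v : Math.max(v, back.peek().max());
        back.push(new Entry(v, runningMax));
    }

    void evict() {  // remove the oldest element of the window
        if (front.isEmpty()) {
            // Flip: move the back stack onto the front stack, recomputing running maxima,
            // so that the oldest element ends up on top of the front stack.
            while (!back.isEmpty()) {
                double v = back.pop().value();
                double runningMax = front.isEmpty() ? v : Math.max(v, front.peek().max());
                front.push(new Entry(v, runningMax));
            }
        }
        front.pop();
    }

    double query() {  // max over the current window in O(1)
        if (front.isEmpty()) return back.isEmpty() ? Double.NEGATIVE_INFINITY : back.peek().max();
        if (back.isEmpty()) return front.peek().max();
        return Math.max(front.peek().max(), back.peek().max());
    }
}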


SLIDE 40: HammerSlide + SABER

SLIDE 41: Summary

Window processing model: decouples query semantics from system parameters

Hybrid stream processing model: can achieve the aggregate throughput of heterogeneous processors

Heterogeneous Look-Ahead Scheduling (HLS): allows opportunistic use of both CPU and GPU for arbitrary workloads

Alexandros Koliousis
github.com/lsds/saber

Thank you! Any Questions?