Designing Hybrid Data Processing Systems for Heterogeneous Servers



SLIDE 1

Designing Hybrid Data Processing Systems for Heterogeneous Servers

Peter Pietzuch

Large-Scale Distributed Systems (LSDS) Group Imperial College London http://lsds.doc.ic.ac.uk <prp@imperial.ac.uk>

University of Cambridge – Cambridge, United Kingdom – November 2017

SLIDE 2

Data is the New Oil

  • Many new sources of data become available

– Most data is produced continuously

  • Data powers a plethora of new and personalised services…

[Figure: data sources: mobile devices, scientific instruments, cameras, social feeds, IoT devices, Internet services and web sites, RFID tags, data repositories]

Peter Pietzuch - Imperial College London

SLIDE 3

Data-Intensive Systems

  • Data analytics over web click streams

– How to maximise user experience with relevant content? – How to analyse “click paths” to trace most common user routes?

  • Machine learning models for online prediction

– E.g. serving adverts on search engines

[Figure: web analytics levels (hits, page views, visits, unique visitors, uniquely identified visitors) against volume of available data]

[Figure: online prediction loop with features f1 … fn and label y ∈ {−1, 1}; the model alternates predict and update steps]

  • Solution: AdPredictor

– Bayesian learning algorithm ranks adverts according to click probabilities

SLIDE 4

Throughput and Result Freshness Matter

  • Facebook Insights: aggregates 9 GB/s, < 10 sec latency
  • Feedzai: 40K credit card transactions/s, < 25 ms latency
  • Google Zeitgeist: 40K user queries/s, < 1 ms latency
  • NovaSparks: 150M trade options/s, < 1 ms latency

[Figure: a data-intensive system, labelled with high-throughput processing and low-latency results]

SLIDE 5

Design Space for Data-Intensive Systems

  • Tension between performance and algorithmic complexity

[Figure: design space with axes result latency (10 s, 1 s, 100 ms, 10 ms, 1 ms) and data amount (MBs, GBs, TBs); regions labelled "easy for most algorithms", "hard for machine learning algorithms" and "hard for all algorithms"]

SLIDE 6

Algorithmic Complexity Increases

[Figure: spectrum of increasing algorithmic complexity. System classes: Publish/Subscribe, Complex Event Processing (CEP), Stream processing. Workloads: topic-based filtering, content-based filtering, complex pattern matching, stream queries, online machine learning and data mining. Operations: pre-process, parallelize, aggregate, iterate, share state. Examples: tuples with schema (highway, segment, direction, speed); patterns T1(a, b, c), T2(c, d, e), T3(g, i, h)]

SLIDE 7

Scale Out Model in Data Centres

SLIDE 8

Task Parallelism vs. Data Parallelism

[Figure: input data flows into servers in the data centre, which produce results]

  • Task parallelism: multiple data processing jobs
  • Data parallelism: single data processing job

Example queries shown on the servers:

select highway, segment, direction, AVG(speed) as avg
from Vehicles[range 5 seconds slide 1 second]
group by highway, segment, direction
having avg < 40

select distinct W.cid
from Payments [range 300 seconds] as W, Payments [partition-by 1 row] as L
where W.cid = L.cid and W.region != L.region

SLIDE 9

Distributed Dataflow Systems

  • Idea: Execute data-parallel tasks on cluster nodes
  • Tasks organised as dataflow graph
  • Almost all big data systems do this:

– Apache Hadoop, Apache Spark, Apache Storm, Apache Flink, Google TensorFlow, ...

[Figure: dataflow graph whose operators run with parallelism degree 3 and parallelism degree 2]

SLIDE 10

Nobody Ever Got Fired For Using Hadoop/Spark

  • 2012 study of MapReduce workloads

– Microsoft: median job size < 14 GB – Yahoo: median job size < 12.5 GB – Facebook: 90% of jobs < 100 GB

  • Many data-intensive jobs easily fit into memory
  • One server cheaper/more efficient than compute cluster

(A. Rowstron, D. Narayanan, A. Donnelly, G. O’Shea, A. Douglas, HotCDP’12)

SLIDE 11

Parallelism of Heterogeneous Servers

  • Servers have many parallel CPU cores
  • Heterogeneous servers with GPUs common
  • New types of compute accelerators: Xeon Phi, Google's TPUs, FPGAs, ...

[Figure: two-socket server (Socket 1, Socket 2), each socket with cores C1 to C8, a shared L3 cache and DRAM (10s of CPU cores); a GPU with SMX1 ... SMXN, L2 cache and DRAM (1000s of GPU cores), connected through a command queue and DMA over the PCIe bus]

SLIDE 12

Slide courtesy of Torsten Hoefler (Systems Group, ETH Zürich)

Servers Are Becoming Increasingly Heterogeneous

→ How can Data-Intensive Systems Exploit Heterogeneous Hardware?

SLIDE 13

Roadmap

  • SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD’16]
  • (1) How to parallelise computation on modern hardware?
  • (2) How to utilise heterogeneous servers?
  • (3) Experimental performance results

SLIDE 14

Analytics with Window-based Stream Queries

  • Real-time analytics over data streams
  • Windows define finite data amount for processing

[Figure: stream of tuples with schema (highway, segment, direction, speed); a window covers the most recent tuples up to "now"]

  • Time-based window with size τ at current time t: [t - τ : t]

– Vehicles[Range τ seconds]

  • Count-based window with size n: last n tuples

– Vehicles[Rows n]
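
The two window types can be sketched in a few lines of Python. This is a minimal illustration only, assuming the stream is held in memory as (timestamp, payload) pairs; the names `time_window` and `count_window` are hypothetical and not taken from any system mentioned in the talk.

```python
from collections import deque


def time_window(stream, tau, t):
    """Tuples whose timestamp lies in [t - tau, t] (both bounds inclusive)."""
    return [item for item in stream if t - tau <= item[0] <= t]


def count_window(stream, n):
    """The last n tuples of the stream, in arrival order."""
    return list(deque(stream, maxlen=n))


# Hypothetical Vehicles stream: (timestamp in seconds, speed in km/h).
vehicles = [(1, 52), (2, 47), (3, 38), (4, 35), (5, 61)]
recent = time_window(vehicles, tau=2, t=5)   # tuples stamped 3, 4 and 5
last_two = count_window(vehicles, n=2)       # tuples stamped 4 and 5
```

A real engine would maintain these windows incrementally rather than rescanning the stream on every evaluation.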

SLIDE 15

Defining Stream Query Semantics

  • Windows convert data streams to dynamic relations (database tables)

[Figure: cycle between streams and relations]

  • Streams to relations: window specification
  • Relations to relations: any relational query (select, project, join, group by, etc.)
  • Relations to streams: stream operators Istream, Dstream, Rstream
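
As a rough illustration of the three relation-to-stream operators, the sketch below compares the contents of a relation at two consecutive instants. It assumes set semantics for brevity (the usual definitions work on bags), and the function signatures are illustrative rather than any engine's API.

```python
def istream(previous, current):
    """Insert stream: tuples in the relation now that were absent one instant ago."""
    return current - previous


def dstream(previous, current):
    """Delete stream: tuples present one instant ago that have left the relation."""
    return previous - current


def rstream(current):
    """Relation stream: every tuple currently in the relation."""
    return set(current)


# Window contents at two consecutive instants (hypothetical vehicle ids).
before = {"car1", "car2", "car3"}
after = {"car2", "car3", "car4"}
inserted = istream(before, after)   # car4 entered the window
deleted = dstream(before, after)    # car1 left the window
```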

SLIDE 16

SQL Stream Queries

  • SQL provides well-defined declarative semantics for queries

– Based on relational algebra (select, project, join, …)

  • Example: Identify slow moving traffic on highway

– Input stream: Vehicles(highway, segment, direction, speed)
– Find highway segments with average speed below 40 km/h


select highway, segment, direction, AVG(speed) as avg
from Vehicles[range 5 sec slide 1 sec]
group by highway, segment, direction
having avg < 40
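
To make the query semantics concrete, here is a small Python sketch that evaluates it over an in-memory list. It assumes tuples of the form (timestamp, highway, segment, direction, speed) and an inclusive window [now - 5, now]; the function name `slow_segments` and the sample data are made up for illustration.

```python
from collections import defaultdict


def slow_segments(vehicles, now, window=5, threshold=40):
    """Evaluate the query for the window [now - window, now].

    Returns {(highway, segment, direction): average speed} for slow groups.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group key -> [speed sum, tuple count]
    for ts, highway, segment, direction, speed in vehicles:
        if now - window <= ts <= now:
            acc = totals[(highway, segment, direction)]
            acc[0] += speed
            acc[1] += 1
    averages = {key: s / c for key, (s, c) in totals.items()}
    return {key: avg for key, avg in averages.items() if avg < threshold}


# Hypothetical input tuples.
vehicles = [
    (1, "A1", 7, "N", 30),
    (2, "A1", 7, "N", 40),
    (3, "A1", 8, "N", 80),
    (9, "A1", 8, "N", 20),
]
# A slide of 1 sec means the window is re-evaluated once per second.
results = {now: slow_segments(vehicles, now) for now in range(5, 11)}
```

Each evaluation rescans the input, which is exactly the redundancy that the later slides on window fragments aim to avoid.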

[Annotations on the query: input data, operators, output]

SLIDE 17

(1) How to Parallelise Computation?

  • Perform query evaluation across sliding windows in parallel

– Exploit data parallelism across stream

[Figure: stream with timestamps 1 to 6 and overlapping sliding windows w1 to w4; size: 4 sec, slide: 1 sec]

SLIDE 18

How to use GPUs with Stream Queries?

  • Naive strategy parallelises computation along window boundaries

[Figure: stream with timestamps 1 to 6 (size: 4 sec, slide: 1 sec); tasks T1 and T2 each process a window, then partial results are combined]

→ Window-based parallelism results in redundant computation
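
The amount of redundant work can be estimated with a short calculation. The sketch below uses a count-based analogue of the example (one tuple per second is an assumption made here for illustration), and `window_based_work` is a hypothetical helper name.

```python
def window_based_work(num_tuples, size, slide):
    """Tuples touched when each complete window is processed as a separate task."""
    if num_tuples < size:
        return 0
    num_windows = (num_tuples - size) // slide + 1
    return num_windows * size


# Count-based analogue of the example: 6 tuples, window size 4, slide 1.
touched = window_based_work(6, size=4, slide=1)   # 3 windows x 4 tuples
redundancy = touched / 6                          # average reads per tuple
```

With non-overlapping windows (slide equal to size) the same function reports no redundancy, since every tuple is touched once.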

SLIDE 19

How to use GPUs with Stream Queries?

  • Parallel processing of non-overlapping window data?

[Figure: stream with timestamps 1 to 6 and windows w1 to w4 (size: 4 sec, slide: 1 sec); tasks T1 to T5 each process one slide of data, then partial results are combined]

→ Slide-based parallelism limits degree of parallelism

SLIDE 20

Apache Spark: Small Slides → Low Throughput

  • Spark relates window slide to micro-batch size used for parallelisation

[Figure: throughput (10^6 tuples/s) against window slide (10^6 tuples) for the query below]

select AVG(S.1) from S [rows 1024 slide x]

→ Avoid coupling system parameters with query definition

SLIDE 21

SABER: Parallel Window Processing

  • Idea: Parallelise using task size that is best for hardware
  • Task contains one or more window fragments

[Figure: stream of tuples 1 to 15 split into tasks T1, T2 and T3, overlaid with windows w1 to w5; size: 7 tuples, slide: 2 tuples, 5 tuples/task]
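
The relationship between fixed-size tasks and window fragments can be worked out mechanically. The following sketch, which is not SABER's code, computes which part of each count-based window falls inside a given task; the function name and the 1-based numbering are assumptions chosen to match the figure.

```python
def window_fragments(task_index, task_size, window_size, slide):
    """Map window number -> (first, last) tuple position covered inside one task.

    Tuples, windows and tasks are numbered from 1.
    """
    first = (task_index - 1) * task_size + 1
    last = first + task_size - 1
    fragments = {}
    window = 1
    while True:
        w_first = (window - 1) * slide + 1
        if w_first > last:          # this and all later windows start after the task
            break
        w_last = w_first + window_size - 1
        if w_last >= first:         # window overlaps the task
            fragments[window] = (max(first, w_first), min(last, w_last))
        window += 1
    return fragments


t1 = window_fragments(1, task_size=5, window_size=7, slide=2)
t2 = window_fragments(2, task_size=5, window_size=7, slide=2)
```

For the parameters in the example, T1 holds fragments of w1 to w3 and T2 holds fragments of w1 to w5, so the task size can be chosen for the hardware without changing the window definition.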

SLIDE 22

SABER: Window Fragment Processing

  • Process window fragments in parallel
  • Reassemble partial results to obtain overall result

[Figure: Worker A runs T1 and produces fragment results for w1, w2, w3; Worker B runs T2 and produces fragment results for w1 to w5; the result stage stores partial results in the slots of a circular buffer (Slot 1, Slot 2, empty slots) and outputs the w1 and w2 results]

Partial result reassembly must also be done in parallel

SLIDE 23

API for Operator Implementation

  • Fragment function ff

– Processes window fragments

  • Assembly function fa

– Merges partial window results

  • Batch function fb

– Composes fragment functions within task
– Allows incremental processing

[Figure: tuples 1 to 10 split into tasks T1 and T2 (size: 7 tuples, slide: 2 tuples, 5 tuples/task); ff is applied to the fragments of w1 and w2 in each task, fb groups the ff calls of a task, and fa merges them into the w1 and w2 results at the output]
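
The three functions can be illustrated for an AVG operator. This is a hedged sketch in Python, not SABER's actual Java/OpenCL interface: the signatures, the (sum, count) partial result and the sample values are all assumptions made for the example.

```python
def ff(values):
    """Fragment function: partial aggregate (sum, count) for one window fragment."""
    return (sum(values), len(values))


def fa(partials):
    """Assembly function: merge the partial aggregates of one window into its average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def fb(task_values, task_first, fragments):
    """Batch function: run ff on every window fragment that falls inside one task.

    fragments maps window number -> (first, last) stream positions, numbered from 1.
    """
    return {
        window: ff(task_values[lo - task_first: hi - task_first + 1])
        for window, (lo, hi) in fragments.items()
    }


stream = list(range(1, 11))   # hypothetical values: tuple i carries value i
# Fragments of w1 (tuples 1-7) and w2 (tuples 3-9) inside T1 (1-5) and T2 (6-10).
t1 = fb(stream[0:5], 1, {1: (1, 5), 2: (3, 5)})
t2 = fb(stream[5:10], 6, {1: (6, 7), 2: (6, 9)})
w1_avg = fa([t1[1], t2[1]])
w2_avg = fa([t1[2], t2[2]])
```

This version recomputes each fragment from scratch; an incremental fb, as the bullet above suggests, could reuse the sum of one fragment when computing the next.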

SLIDE 24

SABER: Performance of Window-based Queries

[Figure: latency (sec) and throughput (GB/s) against window slide (64 to 16384 tuples) for the query below]

select AVG(S.1) from S [rows 1024 slide x]

→ Performance of window-based queries remains predictable

SLIDE 25

How to Pick the Task Size?

  • Problem: Small data transfers over PCIe bus costly

– Example: select * from S where p1 [rows 1 slide 1]

[Figure: throughput (GB/s) against task size (0.0625 KB to 4096 KB) for CPU-only processing and GPU-only processing; annotations: "limited by dispatcher and thread contention", "limited by data movement"]

SLIDE 26

Roadmap

  • SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD’16]
  • (1) How to parallelise computation on modern hardware?
  • (2) How to utilise heterogeneous servers?

→ Avoid coupling system parameters with processing semantics

SLIDE 27

(2) How to Utilise Heterogeneous Servers?

  • Hard to decide acceleration potential of heterogeneous processors

– Depends on operator semantics, window definition, data distribution, …

[Figure: throughput (GB/s) of CPU execution and GPU execution for aggregation, group-by and θ-join queries; annotations: "GPU is faster", "CPU is faster"]

→ Don’t leave decision about heterogeneous processors to users

SLIDE 28

SABER: Hybrid Execution Model

  • Idea: Execute tasks on all heterogeneous processors (CPUs, GPUs, ...)
  • Fully utilise all hardware parallelism available in dedicated servers

[Figure: task queue holding tasks T1 to T10 of queries QA and QB, served by a CPU worker and a GPU worker; a timeline (3, 6, 9, 12) shows which tasks each processor executes]

SLIDE 29

Static Task Scheduling using Cost Model?

  • Profile tasks to obtain cost model
  • Assign tasks to processor with shortest execution time

Cost model (task execution time):

Query   CPU    GPU
QA      3 ms   2 ms
QB      3 ms   1 ms

Task queue: T1 (QA), T2 (QB), T3 (QB), T4 (QB), T5 (QB), T6 (QB), T7 (QA), T8 (QB), T9 (QA), T10 (QA)

Static: QA on CPU and QB on GPU

[Figure: timeline (3, 6, 9, 12) with T1, T7, T9, T10 on the CPU workers and T2, T3, T4, T5, T6, T8 on the GPU worker]

→ Static scheduling under-utilises processors
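
The under-utilisation can be checked with a little arithmetic. The sketch below adds up the busy time per processor for the static assignment given in the example (QA on CPU, QB on GPU), assuming one worker per processor and the 3 ms / 2 ms / 1 ms cost model; the helper name `static_load` is hypothetical.

```python
# Cost model from the example: execution time in ms per (query, processor).
COST_MS = {"QA": {"CPU": 3, "GPU": 2}, "QB": {"CPU": 3, "GPU": 1}}
# Static assignment from the example.
STATIC = {"QA": "CPU", "QB": "GPU"}
# Task queue in arrival order: (task, query).
QUEUE = [
    ("T1", "QA"), ("T2", "QB"), ("T3", "QB"), ("T4", "QB"), ("T5", "QB"),
    ("T6", "QB"), ("T7", "QA"), ("T8", "QB"), ("T9", "QA"), ("T10", "QA"),
]


def static_load(queue, assignment, cost):
    """Total busy time in ms per processor under a fixed query-to-processor mapping."""
    load = {}
    for _, query in queue:
        proc = assignment[query]
        load[proc] = load.get(proc, 0) + cost[query][proc]
    return load


load = static_load(QUEUE, STATIC, COST_MS)
idle_ms = max(load.values()) - min(load.values())   # time the faster side sits idle
```

Under these assumptions the CPU is busy for 12 ms while the GPU finishes after 6 ms, so the GPU is idle for half of the run.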
SLIDE 30

First-Come First-Served Task Scheduling?

  • Assign tasks to processors first-come, first-served

– CPU/GPU execute both QA and QB tasks

[Figure: cost model (QA: 3 ms on CPU, 2 ms on GPU; QB: 3 ms on CPU, 1 ms on GPU); task queue T1 to T10; first-come first-served timeline (3, 6, 9, 12) with T1, T4, T8 on the CPU workers and T2, T3, T5, T6, T7, T9, T10 on the GPU worker]

→ FCFS ignores effectiveness of processors for given task

SLIDE 31

Heterogeneous Lookahead Scheduling (HLS)

  • Idea: Scheduler assigns tasks to idle processors dynamically

– Skips tasks that could be executed faster by another processor

[Figure: cost model (QA: 3 ms on CPU, 2 ms on GPU; QB: 3 ms on CPU, 1 ms on GPU); task queue T1 to T10; HLS timeline (3, 6, 9, 12) for CPU workers and GPU worker]

Scheduling steps for the CPU worker:

  • T1 is slower on CPU: skip
  • T2 is slower on CPU: skip
  • T3 is slower on CPU, but there is already 3 ms of work for the GPU: select T3
  • Skip T4, T5 and T6 and select T7
  • Finally skip T8 and T9 and select T10

→ HLS achieves aggregate throughput of all heterogeneous processors
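
The lookahead rule can be simulated on the ten-task example. What follows is a simplified sketch, not SABER's scheduler: it assumes one worker per processor, fixed costs, that the CPU picks first when both processors are idle, and that a worker falls back to the head of the queue when no task qualifies. It also leaves out the periodic use of the non-preferred processor described later in the talk. All names are hypothetical.

```python
# Cost model from the example: execution time in ms per (query, processor).
COST_MS = {"QA": {"CPU": 3, "GPU": 2}, "QB": {"CPU": 3, "GPU": 1}}
QUEUE = [
    ("T1", "QA"), ("T2", "QB"), ("T3", "QB"), ("T4", "QB"), ("T5", "QB"),
    ("T6", "QB"), ("T7", "QA"), ("T8", "QB"), ("T9", "QA"), ("T10", "QA"),
]


def preferred(query):
    """Processor with the shortest execution time for this query."""
    return min(COST_MS[query], key=COST_MS[query].get)


def hls_pick(queue, proc):
    """Index of the task that an idle processor `proc` should run next."""
    delay = 0  # ms of work skipped so far in favour of the preferred processor
    for i, (_, query) in enumerate(queue):
        # Take the task if proc is preferred, or if the preferred processor
        # already has at least as much work queued as proc needs for this task.
        if preferred(query) == proc or delay >= COST_MS[query][proc]:
            return i
        delay += COST_MS[query][preferred(query)]
    return 0  # assumption: fall back to the head of the queue


def simulate(tasks, procs=("CPU", "GPU")):
    """Run all tasks; returns (tasks per processor, makespan in ms)."""
    queue = list(tasks)
    free_at = {p: 0 for p in procs}
    schedule = {p: [] for p in procs}
    while queue:
        # The processor that becomes idle first picks next; ties go to procs[0].
        proc = min(procs, key=lambda p: free_at[p])
        name, query = queue.pop(hls_pick(queue, proc))
        schedule[proc].append(name)
        free_at[proc] += COST_MS[query][proc]
    return schedule, max(free_at.values())


schedule, makespan = simulate(QUEUE)
```

Under these assumptions the simulation makes the same choices as the steps listed above: the CPU runs T3, T7 and T10, the GPU runs the remaining seven tasks, and both finish after 9 ms.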

SLIDE 32

SABER Hybrid Stream Processing Engine

[Figure: SABER architecture; tasks T1 and T2 pass through three stages, with operators (op) running on CPU and GPU]

  • Dispatching stage: dispatch fixed-size tasks
  • Scheduling & execution stage: dequeue tasks based on HLS
  • Result stage: merge & forward partial window results

Implementation: Java (15K LOC), C & OpenCL (4K LOC)


Roadmap

  • SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD’16]
  • (1) How to parallelise computation on modern hardware?
  • (2) How to utilise heterogeneous servers?
  • (3) Experimental performance results

→ Hybrid execution utilises all heterogeneous processors

→ Avoid coupling system parameters with processing semantics

SLIDE 34

Experimental Evaluation

Experimental setup:

  • Ubuntu Linux 14.04, NVIDIA driver 346.47
  • CPU: Intel Xeon 2.6 GHz, 16 cores, 64 GB RAM
  • GPU: NVIDIA Quadro K5200, 2,304 cores, 8 GB RAM
  • PCIe 3.0 x16, 10 Gbps NIC

Datasets:

  • Google Cluster Data: 144M job events from Google infrastructure
  • SmartGrid Measurements: 974M plug measurements from houses
  • Linear Road Benchmark: 11M car positions and speeds on highway

SLIDE 35

What is SABER’s Performance?

[Figure: throughput (10^6 tuples/s) of SABER, split into CPU and GPU contributions, for queries CM2, SG1, SG2, LRB3 and LRB4 from the Cluster Mgmt., Smart Grid and LRB workloads; operator labels include select, agg-avg, group-by-avg and group-by-cnt; hardware: Intel Xeon 2.6 GHz (16 cores) and NVIDIA Quadro K5200 (2,304 cores)]

→ SABER exploits both CPUs and GPUs effectively for different queries

SLIDE 36

Is Hybrid Throughput Additive?

[Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only) and SABER for aggregation, group-by and θ-join queries; annotations: "GPU is faster", "CPU is faster", "Not additive due to queue contention"]

→ Aggregate throughput of CPU + GPU always highest

SLIDE 37

What is the Trade-Off between CPUs and GPUs?

  • Hybrid processing model benefits from GPU's ability to process complex predicates fast

[Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only) and SABER against the number of selection predicates (1 to 64) and the number of join predicates (1 to 64), both with [rows 1024, slide 1024]; annotation: "Dispatch bound"]

SLIDE 38

Does SABER Adapt to Workload Changes?

[Figure: selectivity and throughput (GB/s) over time (0 to 60 seconds) for SABER and SABER (GPU contribution)]

  • HLS periodically uses idle, non-preferred processor to run tasks to update query task throughput matrix

→ Higher selectivity → more predicates evaluated → GPU preferred

SLIDE 39

Summary

  • Heterogeneous servers have huge impact on data-intensive systems

– Shift from scale out to scale up model
– Need new general-purpose system designs for heterogeneous servers

☛ SABER: Hybrid Stream Processing Engine for CPUs & GPUs

  • (1) Parallelise computation to fit hardware capabilities

→ Decouple hardware/system parameters from processing semantics

  • (2) Fully utilise all heterogeneous processors independently of workload

→ Hybrid processing model to achieve aggregate CPU/GPU throughput

SLIDE 40

Acknowledgement: LSDS Group at Imperial College London

Peter Pietzuch

http://lsds.doc.ic.ac.uk <prp@imperial.ac.uk>

Thank you! Any questions?

We're hiring! Post-docs, PhDs