Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with - PowerPoint PPT Presentation

Large-Scale Data & Systems Group The design of a hybrid stream processing system for heterogeneous servers Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, Peter Pietzuch and, more recently, George Theodorakis, Panos Garefalakis and Holger Pirk Large-Scale Data & Systems Group Department of Computing, Imperial College London http://lsds.doc.ic.ac.uk LSDS Cambridge, October 31

Streams of Data Everywhere Many new data sources are now available: Linear access patterns make data processing a streaming problem LSDS Cambridge, October 31 2

High-Throughput Low-Latency Analytics NovaSparks Facebook Insights Google Zeitgeist Feedzai 150M 9GB 40K 40K of page metrics/s user queries/s card trans/s stock options/s In less than 1 ms In less than 10 s Within ms In 25 ms window t+ t+1 t LSDS Cambridge, October 31 3

Algorithmic Complexity Increases T1 Pre-process … T1(a, b, c) Share state … highway highway highway highway highway Iterate segment segment segment segment segment T2 Parallelize T2(c, d, e) direction direction direction direction direction speed speed speed speed speed … Aggregate T3 T3(g, i, h) Complex Topic- Content- Online machine Stream pattern based based learning, data queries matching filtering filtering mining Complex Event Stream Publish/Subscribe Processing (CEP) processing LSDS Cambridge, October 31 4

Design Space for Data-Intensive Systems Tension between performance & algorithmic complexity TBs Data amount Hard for all GBs algorithms Hard for machine learning MBs algorithms Easy for most algorithms 10s 1s 100ms 10ms 1ms Result latency LSDS Cambridge, October 31 5

Scale Out in Data Centres LSDS Cambridge, October 31 6

Task vs Data Parallelism ... Input data Results Servers in data centre select distinct W.cid select distinct W.cid select highway, segment, direction, AVG(speed) From Payments [ range 300 seconds] as W, select distinct W.cid from Vehicles[ range 5 seconds slide 1 second] From Payments [ range 300 seconds] as W, Payments [ partition-by 1 row] as L select highway, segment, direction, AVG(speed) From Payments [ range 300 seconds] as W, group by highway, segment, direction Payments [ partition-by 1 row] as L where from Vehicles[ range 5 seconds slide 1 second] W.cid = L.cid and W.region != L.region having avg < 40 Payments [ partition-by 1 row] as L where W.cid = L.cid and W.region != L.region group by highway, segment, direction where W.cid = L.cid and W.region != L.region having avg < 40 Task parallelism: Data parallelism: Multiple data processing jobs Single data processing job LSDS Cambridge, October 31 7

Peter Pietzuch - Imperial College London Distributed Dataflow Systems Idea: Execute data-parallel tasks on cluster nodes Tasks organised as dataflow graph parallelism degree 2 Almost all big data systems do this: Apache Hadoop, Apache Spark, Apache – parallelism Storm, Apache Flink, degree 3 Google TensorFlow, ... LSDS Cambridge, October 31 8

“Nobody Ever Got Fired for Using a Hadoop Cluster” [HotCDP’12] Or Flink or Spark ;) • 2012 study of MapReduce workloads – Microsoft: median job size < 14 GB The size of the workloads has changed, – Yahoo: median job size < 12.5 GB but so has the size/price of memory ! – Facebook: 90% of jobs < 100 GB Many data-intensive jobs easily fit into memory It’s expensive to scale-out in terms of hardware and engineering! ☛ In many cases a single server is cheaper/more efficient than a cluster LSDS Cambridge, October 31 9

Exploit Single-Node Heterogeneous Hardware Servers with CPUs and GPUs now common – 10x higher linear memory access throughput – Limited data transfer throughput PCIe PC Ie B Bus Command Queue Co Processor 1 ... N Socket 1 Socket 2 C 1 C 5 C 1 C 5 10s of C 2 C 6 C 2 C 6 streaming processors C 3 C 7 C 3 C 7 1000s of C 4 C 8 C 4 C 8 cores L3 L3 10s GB of L2 Cache RAM DMA DM DRAM DRAM Use both CPU & GPU resources for stream processing LSDS Cambridge, October 31 10

With Well-Defined High-Level Queries CQL: SQL-based declarative language for continuous queries [Arasu et al. , VLDBJ’06] Credit card fraud detection example: – Find attempts to use same card in different regions within 5-min window CQL offers correct window semantics <\> Self-join se select di distinct W.cid fr from Payments [ ra range 300 seconds] as as W, Payments [ pa partition-by by 1 row] as as L wh where W.cid = L.cid an and W.region != L.region LSDS Cambridge, October 31 11

SABER Window-Based Hybrid Stream Processing Engine for CPUs & GPUs Challenges & Contributions 1. How to parallelise sliding-window queries across CPU and GPU? Decouple query semantics from system parameters 2. When to use CPU or GPU for a CQL operator? Hybrid processing: offload tasks to both CPU and GPU 3. How to reduce GPU data movement costs? Amortise data movement delays with deep pipelining Details omitted LSDS Cambridge, October 31 12

How to Parallelise Window Computation? Problem: Window semantics affect system throughput and latency – Pick task size based on window size? size: 4 sec 6 5 4 3 2 1 slide: 1 sec Task T 1 Output window results in order Task T 2 Window-based parallelism results in redundant computation LSDS Cambridge, October 31 13

How to Parallelise Window Computation? Problem: Window semantics affect system throughput and latency – Pick task size based on window size? On window slide? size: 4 sec 6 5 4 3 2 1 slide: 1 sec T 1 T 2 Compose window results from partial results T 3 T 4 T 5 … Slide-based parallelism limits GPU parallelism LSDS Cambridge, October 31 14

How to Relate Slides to Tasks? Avoid coupling throughput/latency of queries to window definition – e.g. Spark imposes lower bound on window slide: 1s 2s 3s 4s 5s 2 Window slide limited by 1.5 (10 6 tuples/s) Throughput min. latency (~500 ms) 1 Micro - batch size limited 0.5 by window slide (0.1, 0.2) 0 0 1 2 3 4 5 6 7 8 9 Window slide (10 6 tuples/s) LSDS Cambridge, October 31 15

SABER’s Window Processing Model Idea: Decouple task size from window size/slide – Pick based on underlying hardware features • e.g. PCIe throughput T 3 T 2 T 1 5 tuples/task 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 size: 7 rows w 1 slide: 2 rows w 2 w 3 w 4 w 5 – Task contains one or more window fragments • E.g. closing/pending/opening windows in T 2 LSDS Cambridge, October 31 16

Merging Window Fragment Results Idea: Decouple task size from window size/slide – Assemble window fragment results – Output them in correct order Worker A: T 1 w 1 w 2 w 2 result w 3 w 1 result w 1 w 2 w 3 Slot 2 Slot 1 Output result w 4 Result Stage w 5 circular buffer Worker B : T 2 Worker B stores T 2 results and exits (nothing to forward) Worker A stores T 1 results, merges window fragment results and forwards complete windows downstream LSDS Cambridge, October 31 17

Operator Implementations / API T 2 T 1 5 tuples/sec Fragment function, f f 10 9 8 7 6 5 4 3 2 1 size: 7 rows w 1 slide: 2 rows Processes window fragments w 2 Assembly function, f a f f f f f b f f f f Merges partial window results w 2 results w 1 results Batch function, f b f a f a Composes fragment functions within a task output Allows incremental processing LSDS Cambridge, October 31 18

How to Pick the Task Size? 8 CPU GPU Throughput (GB/s) 6 4 2 0 32 64 128 256 512 1024 2048 4096 Task Size (KB) LSDS Cambridge, October 31 19

How Does Window Slide Affect Performance? Performance of window-based queries remains predictable 8 0.2 Throughput (GB/s) 6 0.15 Latency (sec) 4 0.1 SABER SABER Latency 2 0.05 0 0 64 256 1024 4096 16384 Window Slide (Bytes) Aggregation avg [ ro rows 1024, sl slide x ] LSDS Cambridge, October 31 20

SABER Window-Based Hybrid Stream Processing Engine for CPUs & GPUs Challenges & Contributions 1. How to parallelise sliding-window queries across CPU and GPU? Decouple query semantics from system parameters 2. When to use CPU or GPU for a CQL operator? Hybrid processing: offload tasks to both CPU and GPU 3. How to reduce GPU data movement costs? Amortise data movement delays with deep pipelining LSDS Cambridge, October 31 21

SABER’s Hybrid Stream Processing Model Idea: Enable tasks to run on both processors – Scheduler assigns tasks to idle processors Past behavior: Task Queue: CPU CPU GPU comes first T 10 T 9 T 8 T 7 T 6 T 5 T 4 T 3 T 2 T 1 GPU Q A 3 ms 2 ms Q A Q A Q B Q A Q B Q B Q B Q B Q A Q B Q B 3 ms 1 ms 0 3 6 9 12 First-Come First-Served T 1 T 4 T 8 T 10 CPU GPU T 2 T 3 T 5 T 6 T 7 T 9 Idle FCFS ignores effectiveness of processor for given task LSDS Cambridge, October 31 22

Heterogeneous Look-Ahead Scheduler (HLS) Idea: Idle processor skips tasks that could be executed faster by another processor – Decision based on observed query task throughput Past behavior: Task Queue: CPU CPU GPU comes first T 10 T 9 T 8 T 7 T 6 T 5 T 4 T 3 T 2 T 1 GPU Q A 3 ms 2 ms Q A Q A Q B Q A Q B Q B Q B Q B Q A Q B Q B 3 ms 1 ms 0 0 3 3 6 6 9 9 12 12 HLS T 7 T 3 T 10 CPU T 1 T 2 T 4 T 5 T 6 T 8 T 9 GPU HLS fully utilises processors LSDS Cambridge, October 31 23

Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with - PowerPoint PPT Presentation

Large-Scale Data & Systems Group The design of a hybrid stream processing system for heterogeneous servers Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo

Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with Matthias Weidlich, Raul Castro

Crossbow: Scaling Deep Learning on Multi-GPU Servers Peter Pietzuch with Alexandros Koliousis,

of Antioch 100 BC Nico Ordozgoiti Venus de Milo 2015 Alexandros of Antioch 100 BC Nico

Transactions 09 Transactions Alexandros Labrinidis University of Pittsburgh 2 Alexandros

A Pixel Format Guide to the galaxy Alexandros Frantzis alexandros.frantzis@collabora.com

AV Access Randomized Trial: Outcomes Through Six Months Alexandros Mallios, MD Institut

LOCKING CS 2550 / Spring 2006 Principles of Database Systems 10 Locking Alexandros

Recommender Systems Alexandros Karatzoglou Research Scientist @ Telefonica Research, Barcelona

Posterior Arthroscopic Tibiotalocalcaneal Fusion: Surgical Technique and early resul ts Alexandros

PROVING AND DISCOVERING WITH JAVA GEMETRY EXPERT (JGEX) Kostas Georgios-Alexandros, Bampatsias

Assistance to agricultural development in Africa: Decline and Reversal Alexandros Sarris

Determinants of Index Insurance Uptake Alexandros Sarris Professor, department of Economics,

OF COMPANIES WATER AND ENERGY TECHNOLOGIES Focused on biogas Dr.-Ing. Alexandros D. Yfantis

Adding Unusual Transports to The Serval Project Alexandros Tsiridis & Joseph Hill

Can Social Group-Formation Norms Influence Behavior?: An Experimental Study Alexandros Rigos

in developing countries Alexandros Sarris Emeritus professor of economics, National and

TagAlong: Efficient Integration of Battery-Free Sensor Tags in Standard Wireless Networks Carlos

CS270: Lecture 2. Admin: CS270: Lecture 2. Admin: Check Piazza. CS270: Lecture 2. Admin:

Bloom Filters Anna Karlin Most slides by Shreya Jayaraman, Luxi Wang, Alex Tsun Bloom Filters:

Dragging Proofs out of Pictures Ralf Hinze 1 Dan Marsden 2 1 Institute for Computing and

Synthesis of Domain Specific Encoders for Bit- Vector Solvers Jeevana Priya Inala with Rohit

1 Coalescing Logistics Computing the Interference Graph (in MiniJava compiler) Rule Use results

r 4Th 0 b1 4 5 -fr 4$- ci 4 C v-s C 3 - 6 r 3 4 Z n C -c C - a

Concurrency Control Instructor: Matei Zaharia cs245.stanford.edu Outline What makes a schedule

Sambuz

Useful Links

Newsletter

Mail Us

Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with - PowerPoint PPT Presentation

Large-Scale Data & Systems Group The design of a hybrid stream processing system for heterogeneous servers Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo

Alexandros Koliousis a.koliousis@imperial.ac.uk Joint work with Matthias Weidlich, Raul Castro

Crossbow: Scaling Deep Learning on Multi-GPU Servers Peter Pietzuch with Alexandros Koliousis,

of Antioch 100 BC Nico Ordozgoiti Venus de Milo 2015 Alexandros of Antioch 100 BC Nico

Transactions 09 Transactions Alexandros Labrinidis University of Pittsburgh 2 Alexandros

A Pixel Format Guide to the galaxy Alexandros Frantzis alexandros.frantzis@collabora.com

AV Access Randomized Trial: Outcomes Through Six Months Alexandros Mallios, MD Institut

LOCKING CS 2550 / Spring 2006 Principles of Database Systems 10 Locking Alexandros

Recommender Systems Alexandros Karatzoglou Research Scientist @ Telefonica Research, Barcelona

Posterior Arthroscopic Tibiotalocalcaneal Fusion: Surgical Technique and early resul ts Alexandros

PROVING AND DISCOVERING WITH JAVA GEMETRY EXPERT (JGEX) Kostas Georgios-Alexandros, Bampatsias

Assistance to agricultural development in Africa: Decline and Reversal Alexandros Sarris

Determinants of Index Insurance Uptake Alexandros Sarris Professor, department of Economics,

OF COMPANIES WATER AND ENERGY TECHNOLOGIES Focused on biogas Dr.-Ing. Alexandros D. Yfantis

Adding Unusual Transports to The Serval Project Alexandros Tsiridis &amp; Joseph Hill

Can Social Group-Formation Norms Influence Behavior?: An Experimental Study Alexandros Rigos

in developing countries Alexandros Sarris Emeritus professor of economics, National and

TagAlong: Efficient Integration of Battery-Free Sensor Tags in Standard Wireless Networks Carlos

CS270: Lecture 2. Admin: CS270: Lecture 2. Admin: Check Piazza. CS270: Lecture 2. Admin:

Bloom Filters Anna Karlin Most slides by Shreya Jayaraman, Luxi Wang, Alex Tsun Bloom Filters:

Dragging Proofs out of Pictures Ralf Hinze 1 Dan Marsden 2 1 Institute for Computing and

Synthesis of Domain Specific Encoders for Bit- Vector Solvers Jeevana Priya Inala with Rohit

1 Coalescing Logistics Computing the Interference Graph (in MiniJava compiler) Rule Use results

r 4Th 0 b1 4 5 -fr 4$- ci 4 C v-s C 3 - 6 r 3 4 Z n C -c C - a

Concurrency Control Instructor: Matei Zaharia cs245.stanford.edu Outline What makes a schedule

Sambuz

Useful Links

Newsletter

Mail Us

Adding Unusual Transports to The Serval Project Alexandros Tsiridis & Joseph Hill