Data stream statistics over sliding windows: How to summarize 150 Million updates per second on a single node


SLIDE 1

Data stream statistics over sliding windows: How to summarize 150 Million updates per second on a single node

Grigorios Chrysos†, Odysseas Papapetrou‡, Dionisios Pnevmatikatos†, Apostolos Dollas†, Minos Garofalakis†

†Technical University of Crete, Greece ‡Eindhoven University of Technology, Netherlands ATHENA Research and Innovation Center, Greece

SLIDE 2

Why process data streams in real-time?

Real-time, continuous, high-volume data streams:

  • Network monitoring for DoS attacks
  • Monitoring market data to guide algorithmic trading
  • Adaptive online advertising, etc.

Too big to store in memory => build approximate sketch synopses

Our focus here: Exponential Count-Min (ECM) sketches

  • Papapetrou et al. [VLDB12, VLDBJ15]
  • Space and time efficient
  • Support frequency and inner product queries
  • Bounded error data structures

Contribution: Explore ECM Sketch acceleration architectures on FPGA

SLIDE 3

Outline

  • ECM Sketch Primer
  • ECM Acceleration Architectures
  • Evaluation
  • Conclusions

SLIDE 4

Example: Distribution statistics at routers

  • Maintain sliding-window data stream statistics

IP address     Timestamp (msec)
194.42.1.1
194.44.2.6     2
194.42.1.1     4
220.40.41.4    7
194.42.1.1     8
…
220.40.41.4    999
222.1.34.7     1001
194.42.1.1     1003
194.42.1.1     1009

1000 msec sliding window

IP counters

[Figure: per-IP frequency counters maintained over the 1000 msec sliding window, updated as tuples arrive and expire.]

SLIDE 5

ECM Sketch Primer

  • Sketch is a set of d hash functions f1, f2, …, fd and a 2-dimensional array of d × w "counters"
  • A "counter" is an Exponential Histogram structure (space efficient for large time windows)
  • Each incoming key:
    • is hashed d times to select which EH to update in each of the d rows
    • the d selected EHs are updated
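The update/query path above can be sketched in a few lines of Python. This toy model is illustrative only: each "counter" is an exact deque of timestamps rather than an Exponential Histogram, and the seeded blake2b hashes stand in for the sketch's d independent hash functions.

```python
import hashlib
from collections import deque

class ECMSketchDemo:
    """Toy model of the ECM update/query path (a sketch, not the paper's code).
    For clarity each "counter" is an exact deque of timestamps; the real ECM
    sketch replaces it with a space-efficient Exponential Histogram."""

    def __init__(self, d, w, window):
        self.d, self.w, self.window = d, w, window
        self.cells = [[deque() for _ in range(w)] for _ in range(d)]

    def _hash(self, key, row):
        # One hash per row, derived from a row-seeded digest (a stand-in).
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.w

    def update(self, key, t):
        # The key is hashed d times; one cell per row receives (+1, t).
        for row in range(self.d):
            cell = self.cells[row][self._hash(key, row)]
            while cell and cell[0] <= t - self.window:
                cell.popleft()          # expire tuples outside the window
            cell.append(t)

    def estimate(self, key, now):
        # Count-Min style query: minimum over the d selected cells.
        est = None
        for row in range(self.d):
            cell = self.cells[row][self._hash(key, row)]
            c = sum(1 for ts in cell if ts > now - self.window)
            est = c if est is None else min(est, c)
        return est
```

Taking the minimum over the d rows is what bounds the overestimate caused by hash collisions.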

SLIDE 6

ECM Sketch

[Diagram: key 132.1.3.4, observed at time 31, is hashed by f1, f2, …, fd; one EH in each of the d rows (w columns) receives a (+1, t) update.]

Updating the individual EHs

SLIDE 7

ECM Sketch

[Diagram: same update path as the previous slide: key 132.1.3.4, observed at time 31, is hashed into the d × w grid and the selected EHs are updated.]

Updating an individual EH (bucket sizes shown oldest first; Levels 0/1/2 hold buckets of size 2^0/2^1/2^2; event times 0, 14, 19, 23, 26, 28, 31):

Before insert:   4 2 2 1 1
After insert:    4 2 2 1 1 1   (invariant invalidated: 3 buckets of size 1)
1st merge:       4 2 2 2 1
2nd merge:       4 4 2 1
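The merge cascade can be sketched as a minimal DGIM-style Exponential Histogram. This is a simplification assuming at most two buckets per size, matching the invariant in the example; the paper's k-parameterized EH generalizes it.

```python
from collections import deque

class ExpHistogram:
    """Minimal Exponential Histogram sketch (after Datar et al.): at most
    `max_per_size` buckets of each size; exceeding that merges the two
    oldest buckets of that size into one of double size."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size
        self.buckets = deque()   # (size, newest_timestamp), oldest first

    def expire(self, now):
        while self.buckets and self.buckets[0][1] <= now - self.window:
            self.buckets.popleft()

    def add(self, t):
        self.expire(t)
        self.buckets.append((1, t))            # new event: bucket of size 1
        size = 1
        while True:
            idx = [i for i, (s, _) in enumerate(self.buckets) if s == size]
            if len(idx) <= self.max_per_size:
                break                          # invariant holds at this level
            i, j = idx[0], idx[1]              # two oldest buckets of `size`
            merged = (2 * size, self.buckets[j][1])
            b = list(self.buckets)
            b[i] = merged
            del b[j]
            self.buckets = deque(b)
            size *= 2                          # a merge may cascade upward

    def count(self):
        # Standard EH estimate: all buckets, minus half of the oldest one.
        if not self.buckets:
            return 0
        return sum(s for s, _ in self.buckets) - self.buckets[0][0] // 2
```

Starting from the "Before" state above (sizes 4 2 2 1 1) and inserting at t=31 reproduces the two merges on the slide, ending at sizes 4 4 2 1.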

SLIDE 8

Sizing ECM Sketches

The ECM sketch provides frequency estimates with an error less than ε·N, with probability at least 1 − δ, where N denotes the length of the sliding window.

ECM sketch parameters:

  • Number of rows: d = ln(1/δ)
  • Number of Exponential Histograms (EHs) in each row: w = e/ε
  • Number of positions at each bucket level: k = 1/ε
  • Number of bucket levels for each EH: L ≥ log2(2N/k) + 1

Update complexity: O(log N) worst case; amortized complexity is constant, with 2 expected merges per update.
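A small helper applying these sizing rules. One caveat: the per-level bucket count is computed here as ceil(1/(2ε)) + 1, an interpretation (not the literal 1/ε above) chosen because it matches the k = 11 instantiated later in the evaluation for ε = 0.05.

```python
import math

def ecm_parameters(epsilon, delta, window_len):
    """Sizing per the slide's rules; k uses the ceil(1/(2*eps)) + 1
    interpretation noted in the lead-in."""
    d = math.ceil(math.log(1.0 / delta))                # rows: ln(1/delta)
    w = math.ceil(math.e / epsilon)                     # EHs per row: e/eps
    k = math.ceil(1.0 / (2.0 * epsilon)) + 1            # buckets per level
    L = math.ceil(math.log2(2.0 * window_len / k)) + 1  # bucket levels
    return d, w, k, L

# eps = delta = 0.05 reproduces the evaluated design: d = 3, w = 55, k = 11
d, w, k, L = ecm_parameters(0.05, 0.05, 10**6)
```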

SLIDE 9

Outline

  • ECM sketch primer
  • ECM Acceleration Architectures
  • Evaluation
  • Conclusions

SLIDE 10

Accelerator Architecture #1

ECM sketches are 3-D structures: d × w × L

  • Only one EH per row is active at any time
  • Have d independent structures
  • Group the data for each of the w EHs of an ECM row in BRAMs
  • Update takes ≥ 1 cycle, but pipelined!

Result: + fully pipelined, guaranteed-throughput design

  • Worst-case design: each EH has L pipeline stages, but only 2 are active on average

SLIDE 11

Fully pipelined architecture (FC)

[Diagram: FC architecture. Each of the d ECM rows has its own hash function; a row is a pipeline of bucket levels #0 … #L, each level holding per-EH bucket memories, an expiration ("Expires?") check against the window size, and pipeline registers between levels.]

Problem: Did not fit in Convey V6 FPGA due to high BRAM use

SLIDE 12

Accelerator Architecture #2

Our Convey HC-2ex platform uses Virtex-6 devices

=> Not particularly large devices. Together with the "shell", the FC architecture did not fit; BRAM space was the bottleneck.

Go for space efficiency:

  • BRAMs are underused (w is 55, minimum BRAM depth is 512 rows)
  • Amortized update cost is 2 => most pipeline levels are idle!
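The underutilization is easy to quantify, assuming the 512-entry minimum BRAM depth stated above:

```python
# Back-of-envelope BRAM depth utilization for the FC design
# (w and minimum depth taken from this slide).
w = 55                 # EHs per ECM row, i.e. entries actually stored
bram_min_depth = 512   # minimum addressable depth of a BRAM

utilization = w / bram_min_depth
print(f"{utilization:.1%} of each BRAM's depth is used")  # ~10.7%
```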

SLIDE 13

Key idea to exploit amortized ECM update cost

[Diagram: instead of a full pipeline of bucket levels #1 … #L per row, only bucket level #1 is kept behind the hash function; levels #2 … #L are folded into a single sequential "ECM Worker".]

CAUTION:

  • Space: the counters of the L−1 upper levels are mapped into one worker (BRAM size?)
  • Multiple hits in the same row => more work for the worker
  • Multiple hits in the same EH => more work for the worker

SLIDE 14

ECM Worker Internal Structure

[Diagram: ECM Worker internals. An Update FIFO and a New-Merge FIFO feed one bucket-level datapath (bucket memories, window-size expiration check, pipeline register) that serves all folded levels sequentially.]

SLIDE 15

Cost-Aware architecture (CA)

[Diagram: CA architecture. The d hash functions and first bucket levels (one per ECM row) feed a pool of ECM Workers #0 … #P through an interconnect, providing additional memory & processing bandwidth.]

SLIDE 16

Now we can play:

One parameter: how many bucket levels to instantiate before the Worker

  • More levels => better tolerance to skewed workloads

What about LARGE windows? L becomes large, BUT the update load decreases exponentially with the level => store the upper levels in DRAM! DRAM is slower than BRAM => need to get there infrequently.
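A back-of-envelope model of why this works, assuming a merge propagates to level l roughly once every 2^l insertions; this assumption is consistent with the ~2 expected merges per update quoted earlier, since the per-level rates sum to about 2.

```python
# Per-level merge traffic model: level l is touched ~2**-l times per update,
# so the expected total merges per update is sum(2**-l) ~ 2.  With the first
# K levels kept in BRAM, roughly a 2**-K fraction of updates reach DRAM.
def dram_fraction(K):
    return 2.0 ** -K

total_merges = sum(2.0 ** -l for l in range(64))   # ~2.0
print(f"expected merges/update ~ {total_merges:.3f}")
print(f"K=5 BRAM levels -> ~{dram_fraction(5):.1%} of updates reach DRAM")
```

With K = 5 (the value used later in the evaluation), only about 3% of updates need to touch DRAM, which is why occasional slow DRAM accesses are tolerable.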

SLIDE 17

Hybrid Architecture

[Diagram: Hybrid architecture. Each of the d "ECM FrontStage" pipelines keeps bucket levels #1 … #K in BRAM (hash function, per-level bucket memories, window-size expiration checks, pipeline registers); the remaining levels live in a shared "ECM BackStage" with Updates/New-Merge FIFOs backed by DRAM.]

CAUTION: Choose K carefully so that DRAM BW is sufficient most of the time

SLIDE 18

Can we Exceed 1 tuple per cycle?

All architectures so far assume an input of one tuple per cycle. What if I have T input tuples per cycle?

  • Hash d*T tuples
  • Update d*T EHs
  • If d*T << #EHs, chances are good that different EHs will be updated

Corollaries:

  • Cannot group into d rows (d << d*T)
  • Multiple updates to same EH at same cycle are possible!
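How often do simultaneous updates actually collide? A birthday-style estimate, assuming uniform independent hashes (illustrative, not from the paper): within one row, the T updates of a cycle compete for that row's w EHs.

```python
# Probability that at least two of T same-cycle updates within one ECM row
# select the same EH, assuming uniform independent hashing over w EHs.
def row_collision(T, w):
    p_distinct = 1.0
    for i in range(T):
        p_distinct *= (w - i) / w
    return 1.0 - p_distinct

# Evaluation-style numbers: T = 3 tuples/cycle, w = 55 EHs per row, d = 3 rows.
p_row = row_collision(3, 55)
p_any_row = 1.0 - (1.0 - p_row) ** 3   # collision in at least one of d rows
```

With these numbers a single row collides in only a few percent of cycles, which is why "chances are good that different EHs will be updated" — but the hardware must still handle the collision case correctly.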

SLIDE 19

> 1 tuple per cycle: Multithreaded Architecture

[Diagram: T tuples per cycle are hashed (d·T hash units) and routed through an interconnect (ICN) to T·d Hybrid-like "ECM FrontStage" pipelines, each with an extra EH structure and heavy-hitter detection; a shared Hybrid-like "ECM BackStage" with DDR serves as the "overflow" pipeline.]

SLIDE 20

Outline

  • ECM sketch primer
  • ECM Acceleration Architectures
  • Evaluation
  • Conclusions

SLIDE 21

System implementation

System Parameters

  • ε = 0.05, δ = 0.05
  • w = 55, d = 3, k = 11
  • CA architecture: P was set to 6 (2 workers per row)
  • Hybrid: K = 5 (bucket levels before DRAM)
  • MT: K = 5, T = 3, #FrontStages = 10

Target platform

Convey HC-2ex: two six-core Xeon E5-2640 processors, 128 GB RAM, and four Xilinx Virtex-6 LX760 FPGAs (only one used)

  • Shell logic clock fixed at 150MHz
  • 474K LUTs, 948K flip flops, and 1440x18 Kbit BRAMs

SLIDE 22

Evaluation

Five Input Datasets

  • Crawdad SNMP Fall 03/04 [11]
  • CAIDA Anonymized Internet Traces 2011
  • WC, the data set from World Cup 98 [2]
  • Two randomly generated traces

Software baseline

  • Reference software from Papapetrou et al. [VLDBJ15]
  • Multi-thread parallelized version of the reference SW (lock limited)

FPGA versions

  • Implemented & tested on Convey

SLIDE 23

Performance comparison (single FPGA)

Note: SW performance is between 10-27 Mtuples/sec
† FP operating frequency is estimated
FP performance is guaranteed; {CA, Hybrid, MT} are best effort

SLIDE 24

Resource utilization

Numbers DO NOT include the "shell" logic

  • CA is more cost-effective than FP (6x in logic, 3x in BRAMs)
  • MT cost is significant; Hybrid is affordable
  • FP & CA are the best overall options

SLIDE 25

Performance on Recent Devices: US+ xczu17eg

Note: post-P&R tool results. FP is affordable, CA is even better (in cost)! Hybrid and MT are not really worth it.

SLIDE 26

Conclusions

  • Sliding-window statistics on streaming data is an important application domain
  • ECM sketches offer error bounds on common queries and are HW friendly
  • A range of efficient accelerators is possible, offering 5-10x speedup over multithreaded SW
  • Guaranteed or best-effort operation? Cost vs. error-tolerance tradeoff!
  • Additional resources in modern FPGAs can be used to implement better ECM sketches: larger time windows and/or tighter error bounds ε and δ

SLIDE 27

Thank you! Questions?


This work was supported in part by EU projects:

  • FP7 Qualimaster (#619525)
  • FET-HPC EXTRA (#671653)
  • Marie Sklodowska-Curie MSCA-COFUND-2017 project AQuViDa (#665667)