Milan Čermák, Daniel Tovarňák, Martin Laštovička, Pavel Čeleda

A Performance Benchmark for NetFlow Data Analysis on Distributed Stream Processing Systems
NOMS 2016, Wednesday 27th April, 2016
Milan Čermák, Daniel Tovarňák, Martin Laštovička, Pavel Čeleda

NetFlow/IPFIX Monitoring and Analysis Performance Benchmark of Stream Processing Systems Page 2 / 18

Network Monitoring using NetFlow/IPFIX
Flow Monitoring
Groups packets into n-tuples that have common properties.
From the IP point of view we know who communicates with whom, when, and for how long.
Used for network traffic measurement in high-speed and large-scale networks.
RFC 7011
A flow is defined as "a set of IP packets passing an observation point in the network during a certain time interval, such that all packets belonging to a particular flow have a set of common properties".
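The grouping of packets into flows described on this slide can be sketched in a few lines of Python. This is only an illustration: the packet field names (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `protocol`, `ts`, `size`) and the choice of the classic 5-tuple as the flow key are assumptions, and the active/inactive timeouts that a real NetFlow/IPFIX exporter applies are left out.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """Summary record for all packets sharing one flow key."""
    first_seen: float
    last_seen: float
    packets: int = 0
    bytes: int = 0


def aggregate_packets(packets):
    """Group packets that share the same 5-tuple into flow records.

    `packets` is an iterable of dicts; the field names are hypothetical.
    """
    flows = {}
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flow = flows.get(key)
        if flow is None:
            flow = flows[key] = Flow(first_seen=pkt["ts"], last_seen=pkt["ts"])
        flow.first_seen = min(flow.first_seen, pkt["ts"])
        flow.last_seen = max(flow.last_seen, pkt["ts"])
        flow.packets += 1
        flow.bytes += pkt["size"]
    return flows


# Made-up packets: two belong to the same TCP connection, one is a DNS query.
PACKETS = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "ts": 0.0, "size": 60},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "ts": 0.5, "size": 40},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 53000,
     "dst_port": 53, "protocol": 17, "ts": 0.2, "size": 80},
]

if __name__ == "__main__":
    for key, flow in aggregate_packets(PACKETS).items():
        print(key, flow)
```

Each resulting record answers the questions listed on the slide: who communicated with whom (the key), when (`first_seen`), and for how long (`last_seen - first_seen`).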

Disadvantages of Flow Data Analysis
Not real-time
Flow data typically analysed in 5 minute intervals
Delayed detection of serious network attacks
Hidden network traffic characteristics
Invisible peaks
Distorted traffic statistics
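A small numeric illustration of the "invisible peaks" point, using purely synthetic numbers (they are not measurements from the benchmark): a short burst is clearly visible in per-second counts but almost disappears when the same data is averaged over a 5 minute interval.

```python
def window_average(per_second_counts):
    """Average flow rate over the whole window, in flows per second."""
    return sum(per_second_counts) / len(per_second_counts)


# Synthetic 5 minute window: 100 flows/s baseline with a 10 second burst.
per_second = [100] * 300
per_second[120:130] = [5000] * 10

if __name__ == "__main__":
    print("peak rate:   ", max(per_second), "flows/s")
    print("5 min average:", round(window_average(per_second), 2), "flows/s")
```

With these made-up values the peak is 5000 flows/s, while the 5 minute average is only about 263 flows/s.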

Solution?

Distributed Stream Processing Systems

Characteristic              Samza                                     Storm                                           Spark
Data source                 Consumer                                  Spout                                           Receiver
Cluster manager             YARN, Mesos                               YARN, Mesos                                     Standalone, YARN, Mesos
Parallelism                 Stream partitions based                   Configured in Topology                          Configured in SparkContext
Message processing          Sequential                                Sequential                                      Small batches
Data sharing between nodes  Database, user-implemented communication  Database, user-implemented communication        Proprietary (SparkContext, Tachyon)
Programming language        Java, Scala                               Java, Clojure, Scala, any other using JSON API  Java, Scala, Python
Time window                 Proprietary                               User definition of Spout                        Proprietary
Count window                Separate Job                              User definition of Bolt                         Accumulator

Table: Characteristics of Distributed Stream Processing Systems

Benchmark of Distributed Stream Processing Systems

Benchmark of Stream Processing Systems
Benchmark characteristics
Follows the universal StreamBench benchmark by Lu et al.
Focuses only on flow throughput, not on fault tolerance or durability.
Uses real network data and common operations.
Benchmarks standard systems without specific optimizations.
Throughput is computed from the dataset size and the time between the start of the computation and the arrival of a predetermined computation result.
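The throughput measurement described above amounts to dividing the number of flows in the dataset by the elapsed time. A minimal sketch, assuming the timestamps are taken on a single clock; the helper names `throughput` and `measure` and the callable `run_until_result` are hypothetical and not taken from the benchmark source code.

```python
import time


def throughput(dataset_size, started_at, finished_at):
    """Flows per second between computation start and result arrival."""
    elapsed = finished_at - started_at
    if elapsed <= 0:
        raise ValueError("result must arrive after the computation starts")
    return dataset_size / elapsed


def measure(run_until_result, dataset_size):
    """Time a callable that blocks until the expected result has arrived."""
    started_at = time.perf_counter()
    run_until_result()
    finished_at = time.perf_counter()
    return throughput(dataset_size, started_at, finished_at)


if __name__ == "__main__":
    # Illustrative numbers only: 32 million flows processed in 16 seconds.
    print(throughput(32_000_000, 0.0, 16.0), "flows/s")
```

Waiting for a predetermined result, rather than for the last input message to be consumed, ensures that the measured interval covers the whole computation and not only data ingestion.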

Benchmark of Stream Processing Systems
Dataset
Based on the public CAIDA network traffic dataset.
PCAP transformed into flows represented in the JSON format (∼270 bytes).
The basis is formed from one million flows of one IP address.
The final dataset consists of repeated insertions of the basis, corresponding to the number of available processor cores.
{"date_first_seen":"2015-07-18T18:07:33.475+01:00","date_last_seen":"2015-07-18T18:07:33.475+01:00","duration":0.000,"src_ip_addr":"86.135.210.175","dst_ip_addr":"31.157.1.1","src_port":54700,"dst_port":80,"protocol":6,"flags":".A....","tos":0,"packets":1,"bytes":56}
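A sketch of how such a dataset could be handled in Python. The sample record is the one shown on the slide; `build_dataset` is a hypothetical helper, and reading "corresponding to the number of available processor cores" as "the basis is inserted once per core" is an assumption.

```python
import json

# The flow record from the slide, with whitespace debris removed.
SAMPLE = (
    '{"date_first_seen":"2015-07-18T18:07:33.475+01:00",'
    '"date_last_seen":"2015-07-18T18:07:33.475+01:00",'
    '"duration":0.000,"src_ip_addr":"86.135.210.175",'
    '"dst_ip_addr":"31.157.1.1","src_port":54700,"dst_port":80,'
    '"protocol":6,"flags":".A....","tos":0,"packets":1,"bytes":56}'
)


def build_dataset(basis, cores):
    """Yield the basis records once per available core (assumed reading)."""
    for _ in range(cores):
        yield from basis


if __name__ == "__main__":
    flow = json.loads(SAMPLE)
    print(flow["src_ip_addr"], "->", flow["dst_ip_addr"], flow["flags"])
    # Tiny stand-in for the one-million-flow basis.
    dataset = list(build_dataset([SAMPLE] * 3, cores=4))
    print(len(dataset), "records")
```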

Benchmark of Stream Processing Systems
Selected operations
1. Identity: Input data processing without executing any operation on them.
2. Filter: Only flows fitting a filtering rule are selected from the input dataset and sent to the output.
3. Count: Flows containing a given value are filtered and their count is returned as a result.
4. Aggregation: Contrary to the count operation, the aggregation sums specific values over all flows.
5. TOP N: An extension of the aggregation returning only a given number of flows with the highest sums of values.
6. SYN DoS: The detection of an attack represented by a high number of flows from one source IP address with TCP SYN packets only.
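The six operations can be sketched as plain Python functions over a list of parsed flow records. This is a batch illustration, not the Samza, Storm, or Spark implementations used in the benchmark. Several details are assumptions: filtering and counting on the destination address, summing `packets` per source address, the detection threshold, and the flag string `"....S."` for SYN-only flows (following the NFDUMP-style `UAPRSF` flag notation seen in the sample record).

```python
from collections import Counter


def identity(flows):
    """1. Identity: pass every flow through unchanged."""
    return list(flows)


def filter_flows(flows, dst_ip):
    """2. Filter: keep only flows matching the rule (assumed: dst address)."""
    return [f for f in flows if f["dst_ip_addr"] == dst_ip]


def count(flows, dst_ip):
    """3. Count: number of flows containing the given value."""
    return sum(1 for f in flows if f["dst_ip_addr"] == dst_ip)


def aggregate(flows):
    """4. Aggregation: sum packets per source address over all flows."""
    totals = Counter()
    for f in flows:
        totals[f["src_ip_addr"]] += f["packets"]
    return totals


def top_n(flows, n):
    """5. TOP N: the n source addresses with the highest packet sums."""
    return aggregate(flows).most_common(n)


def syn_dos(flows, threshold):
    """6. SYN DoS: sources with at least `threshold` SYN-only TCP flows."""
    syn_only = Counter(
        f["src_ip_addr"] for f in flows
        if f["protocol"] == 6 and f["flags"] == "....S."
    )
    return sorted(ip for ip, seen in syn_only.items() if seen >= threshold)


# Made-up flows for demonstration only.
FLOWS = [
    {"src_ip_addr": "10.0.0.1", "dst_ip_addr": "10.0.0.9",
     "protocol": 6, "flags": "....S.", "packets": 1},
    {"src_ip_addr": "10.0.0.1", "dst_ip_addr": "10.0.0.9",
     "protocol": 6, "flags": "....S.", "packets": 1},
    {"src_ip_addr": "10.0.0.2", "dst_ip_addr": "10.0.0.9",
     "protocol": 6, "flags": ".A....", "packets": 5},
    {"src_ip_addr": "10.0.0.3", "dst_ip_addr": "10.0.0.8",
     "protocol": 17, "flags": "......", "packets": 2},
]

if __name__ == "__main__":
    print(count(FLOWS, "10.0.0.9"))
    print(top_n(FLOWS, 1))
    print(syn_dos(FLOWS, threshold=2))
```

In a stream processing system the same logic is applied per message or per small batch, with the counters kept as operator state instead of being rebuilt from a complete list.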

Benchmark of Stream Processing Systems
Benchmark architecture
Corresponds to a typical deployment architecture of distributed stream processing systems.
Kafka is used as the messaging system.
Two environments: a) single host and b) multiple hosts.

Benchmark Results

Testbed Configuration
Common configuration of nodes
2 x Intel® Xeon® E5-2670 (16/32 HT cores in total), 192 GB 1600 MHz RDIMM ECC RAM, 2 x HDD 600 GB SAS 10k RPM, 2.5" (RAID1), 10 Gbit/s network connection, 1 Gbit/s virtual NICs.
Virtual machines configuration

Type       vCPUs  Memory  Hard Drive
vm_large   32     128 GB  300 GB
vm_normal  16     64 GB   300 GB
vm_medium  8      32 GB   300 GB
vm_small   4      16 GB   300 GB

Benchmark Results
One vm_large node (32 vCPUs in total)
[Figure: throughput in flows/s (axis from 0 to 3 000 k) of Storm, Spark, and Samza for the Identity, Filter, Count, Aggregation, TOP N, and SYN DoS operations]
Samza provides almost constant throughput for all operations.
Storm and Spark decrease to 700 k flows/s.
The throughput slowdown was probably caused by shuffling of incoming messages, which led to input socket overloading.

Benchmark Results
One vm_normal node (16 vCPUs in total)
[Figure: throughput in flows/s (axis from 0 to 3 000 k) of Storm, Spark, and Samza for the Identity, Filter, Count, Aggregation, TOP N, and SYN DoS operations]
Lower computational resources reduce the internal data processing speed and shuffling of messages.
Input socket not overloaded.
Significant increase in Spark throughput.

Benchmark Results
Four vm_medium nodes (32 vCPUs in total)
[Figure: throughput in flows/s (axis from 0 to 3 000 k) of Storm, Spark, and Samza for the Identity, Filter, Count, Aggregation, TOP N, and SYN DoS operations]
The systems are better adapted to deployment in cluster mode.
Spark provides throughput similar to that of Samza.
The large throughput variance was probably caused by network load or system errors.

Benchmark Results
Four vm_small nodes (16 vCPUs in total)
[Figure: throughput in flows/s (axis from 0 to 3 000 k) of Storm, Spark, and Samza for the Identity, Filter, Count, Aggregation, TOP N, and SYN DoS operations]
No increase in data processing speed.
Throughput of Storm reduced by half.
Samza, deployed on 32 vCPUs, was probably limited by network bandwidth saturation.

Benchmark Summary
The benchmarked systems are able to process at least 500 k flows/s.
Spark and Samza offer much higher throughput than Storm.
Higher throughput may be possible with a more efficient data format than JSON (for example MessagePack).
High throughput on a single node makes it possible to combine stream processing with standard flow processing tools such as NFDUMP.
Each of the tested systems shows specific behaviour depending on the cluster setup.
Samza has the best throughput but restricts the number of partitions to the number of available cores.

Framework for Real-time Analysis of NetFlow Data

Real-time Analysis of NetFlow Data
Framework for the real-time generation of network traffic statistics using Apache Spark Streaming.
Possibility to implement the same basic methods for flow data analysis.
Will be presented at the Demo Session on Thursday.

Conclusion
Proposed a novel performance benchmark of flow data analysis on distributed stream processing systems.
Testing uses a real network traffic dataset and common data analysis operations.
Only Samza and Spark provide a high enough flow throughput.
The benchmark source code and dataset preparation scripts are available at: https://is.muni.cz/repo/1323006

A PERFORMANCE BENCHMARK FOR NETFLOW DATA ANALYSIS ON DISTRIBUTED STREAM PROCESSING SYSTEMS Milan Čermák cermak@ics.muni.cz
