Large-scale NetFlow Information Management Adrien Raulot, Shahrukh Zaidi University of Amsterdam Supervisor: Wim Biemolt (SURFnet) February 5, 2018 Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 1 / 24


SLIDE 1

Large-scale NetFlow Information Management

Adrien Raulot, Shahrukh Zaidi

University of Amsterdam Supervisor: Wim Biemolt (SURFnet)

February 5, 2018

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 1 / 24

SLIDE 2

What is NetFlow?

Traffic monitoring technology originally developed by Cisco.

Flow: “a set of IP packets passing an observation point in the network during a certain time interval. All packets belonging to a particular flow have a set of common properties.” [4]

Important differences with regular packet capture methods:

  • NetFlow is considered to be less privacy sensitive
  • NetFlow requires fewer computational resources for analysis
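The flow definition above can be made concrete with a small sketch. It assumes that the "common properties" are the usual 5-tuple (source address, destination address, source port, destination port, protocol); the `Packet` type and `aggregate_flows` function are hypothetical names for illustration, not part of any NetFlow tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    ts: float    # capture time in seconds
    src: str     # source IP address
    dst: str     # destination IP address
    sport: int   # source port
    dport: int   # destination port
    proto: str   # transport protocol
    size: int    # packet size in bytes


def aggregate_flows(packets):
    """Group packets that share the same 5-tuple and summarise each group."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)  # assumed flow key
        flow = flows[key]
        flow["packets"] += 1
        flow["bytes"] += p.size
        flow["first"] = p.ts if flow["first"] is None else min(flow["first"], p.ts)
        flow["last"] = p.ts if flow["last"] is None else max(flow["last"], p.ts)
    return dict(flows)


packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 40000, 23, "TCP", 60),
    Packet(0.5, "10.0.0.1", "10.0.0.2", 40000, 23, "TCP", 1500),
    Packet(0.7, "10.0.0.3", "10.0.0.2", 40001, 80, "TCP", 400),
]
flows = aggregate_flows(packets)
for key, summary in flows.items():
    print(key, summary)
```

Only the per-flow summary is kept, not the packet payloads, which is one way to see why flow data is both smaller and less privacy sensitive than a full packet capture.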

SLIDE 3

What is NetFlow?

Figure 1: Schematic overview of the NetFlow export process.[2]

SLIDE 4

NetFlow Analysis

Three main application areas [3]:

  • Flow analysis and reporting
  • Threat detection
  • Performance monitoring

SLIDE 5

NetFlow analysis techniques

NfDump:

Figure 2: Schematic overview of the NfDump tool set.[1]

SLIDE 6

NetFlow analysis techniques

Limitations of this setup [5]:

  • Inefficient file-based store: NfDump typically stores NetFlow data in a separate file for every 5-minute time frame.
  • Very slow processing speed: each file is read line by line from the beginning, so analysis of large amounts of NetFlow data takes a lot of time.
  • Limited analysis methods: as network situations become more and more complex, new approaches to NetFlow data analysis are required.
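To give a feel for the file-based store, the sketch below lists the capture files a query would have to read for a given time range. It assumes one file per 5 minutes named `nfcapd.YYYYMMDDhhmm`, which matches the file name used elsewhere in this deck; the `nfcapd_files` helper is a hypothetical name.

```python
from datetime import datetime, timedelta


def nfcapd_files(start, end, step_minutes=5):
    """List the assumed nfcapd file names covering the interval [start, end)."""
    names = []
    current = start
    while current < end:
        names.append("nfcapd." + current.strftime("%Y%m%d%H%M"))
        current += timedelta(minutes=step_minutes)
    return names


start = datetime(2018, 1, 1, 12, 45)
one_hour = nfcapd_files(start, start + timedelta(hours=1))
seven_hours = nfcapd_files(start, start + timedelta(hours=7))
print(one_hour[0], len(one_hour), len(seven_hours))
```

Under these assumptions a 7-hour query touches 84 files, each of which is scanned from the beginning.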

SLIDE 7

Research question

Which data analysis technique could be used in order to analyse the current SURFnet NetFlow data in a more time-efficient manner?

SLIDE 8

What is Apache Hadoop?

  • Framework for processing large datasets
  • Distributed, local computation & storage
  • Hadoop Distributed File System (HDFS)
  • YARN (Yet Another Resource Negotiator)
  • Batch, interactive & real-time jobs
  • Designed to be scalable
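The idea of distributed storage with local computation can be illustrated without a cluster. The sketch below is plain Python, not Hadoop code: records are spread over a few blocks (standing in for nodes), each block computes a partial result on its own data, and only the small partial results are merged. All function names and sample records are hypothetical.

```python
from collections import Counter


def split_into_blocks(records, n_nodes):
    """Spread records over n_nodes blocks, as a stand-in for distributed storage."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def local_bytes_per_destination(block):
    """Computation that runs where the block is stored: bytes per destination."""
    counts = Counter()
    for _src, dst, n_bytes in block:
        counts[dst] += n_bytes
    return counts


def merge(partials):
    """Combine the partial results from all blocks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


records = [
    ("10.0.0.1", "10.0.0.2", 100),
    ("10.0.0.3", "10.0.0.2", 50),
    ("10.0.0.1", "10.0.0.4", 70),
    ("10.0.0.5", "10.0.0.4", 30),
    ("10.0.0.6", "10.0.0.2", 10),
]
blocks = split_into_blocks(records, n_nodes=3)
totals = merge(local_bytes_per_destination(block) for block in blocks)
print(totals.most_common(1))
```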

SLIDE 9

What is Apache Hadoop?

Figure 3: Schematic overview of Hadoop 2.0.

SLIDE 10

What is Apache Spark?

  • Hadoop-related project, but not limited to Hadoop
  • Powerful computing engine for Big Data processing
  • In-memory
  • Built-in modules for streaming, SQL, machine learning, etc.
  • Bindings for Java, Scala, Python and R
  • Ease of use

SLIDE 11

What is Apache Parquet?

  • Data store for Hadoop
  • Column-oriented
  • Fast access to data

Figure 4: Schematic overview of a row vs column-oriented database.
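The difference between the two layouts can be sketched in plain Python. This is not the Parquet file format itself, only an illustration of why a column-oriented layout helps when a query needs a few fields out of many. The field names `ts`, `sa` and `da` follow the query used later in this deck; `ibyt` (byte count) and the sample values are assumptions.

```python
# Row-oriented: one record per flow, all fields stored together.
rows = [
    {"ts": "2018-01-01 12:45:00", "sa": "10.0.0.1", "da": "10.0.0.2", "ibyt": 1560},
    {"ts": "2018-01-01 12:45:01", "sa": "10.0.0.3", "da": "10.0.0.2", "ibyt": 400},
    {"ts": "2018-01-01 12:45:02", "sa": "10.0.0.1", "da": "10.0.0.4", "ibyt": 90},
]


def to_columns(rows):
    """Column-oriented: one list per field, holding that field for every flow."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Summing the byte counts touches every full record in the row layout ...
total_from_rows = sum(row["ibyt"] for row in rows)
# ... but only a single list in the column layout.
total_from_columns = sum(columns["ibyt"])
print(total_from_rows, total_from_columns)
```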

SLIDE 12

Choice for analysis technique (summary)

Figure 5: Apache Parquet logo. Figure 6: Apache Hadoop logo. Figure 7: Apache Spark logo.

To-Do list:

1 Store NetFlow data into Parquet files on HDFS
2 Load Parquet files using PySpark (Python API)
3 Query the data using Spark SQL

SLIDE 13

Experiments: test environment

Hadoop cluster specifications:

  • ∼ 100 nodes
  • ∼ 600 cores
  • ∼ 4TB of memory
  • ∼ 2PB of storage
  • Apache Hadoop 2.7.2
  • Apache Spark 2.1.1

NfDump server specifications:

  • 1x Dell PowerEdge R230
  • Intel Xeon CPU E3-1240L v5 @ 2.10GHz
  • 4 cores
  • 16GB of RAM
  • ∼ 200GB of SSD storage
  • NfDump v1.6.12

SLIDE 14

Experiments: implementation

1 Convert NetFlow binary data to CSV

nfdump -r nfcapd.201801011245 -o csv

2 Write two Spark jobs in Python:

  • Converter: converts CSV data to Parquet format
  • Querier: loads Parquet data & executes queries

3 Write SQL query

query = 'SELECT ts, sa, da FROM nf_data'

4 Using the Querier, execute and cache the results

5 Proceed with next operations on the cached results

print(results.count())
results.show()
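The two Spark jobs described above might look roughly like the following. This is a minimal sketch, not the authors' code: the function names `convert` and `run_query`, the HDFS paths, and the assumption that the CSV has a header row with column names such as `ts`, `sa` and `da` are all illustrative. The PySpark calls themselves (`spark.read.csv`, `DataFrame.write.parquet`, `createOrReplaceTempView`, `spark.sql`, `cache`) are standard in Spark 2.x.

```python
def convert(spark, csv_path, parquet_path):
    """Converter: read nfdump CSV output and write it out in Parquet format."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)


def run_query(spark, parquet_path, query, view="nf_data"):
    """Querier: load Parquet data, run a Spark SQL query and cache the result."""
    df = spark.read.parquet(parquet_path)
    df.createOrReplaceTempView(view)  # makes the data visible to SQL as nf_data
    return spark.sql(query).cache()


def main():
    # Imported here so the helpers above can be read and tested without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("netflow-sketch").getOrCreate()
    # Hypothetical paths; adjust to the actual HDFS layout.
    convert(spark, "hdfs:///netflow/csv/", "hdfs:///netflow/parquet/")
    results = run_query(
        spark, "hdfs:///netflow/parquet/", "SELECT ts, sa, da FROM nf_data"
    )
    print(results.count())  # first action: runs the query and fills the cache
    results.show()          # later actions reuse the cached result
    spark.stop()
```

In a script submitted with `spark-submit`, `main()` would be called at the end of the file. Caching matters here because each action on an uncached result would otherwise re-read the Parquet data.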

SLIDE 15

Experiments: test queries

  • Retrieve all flows containing a specific IP address
  • Retrieve all flows with a byte count larger than 100 MB
  • List the top 10 Telnet connections with only the SYN flag set in the TCP header, ordered by the number of bits per second
  • List the top 10 IP addresses receiving the largest amount of traffic
  • Retrieve all flows with only the SYN flag set in the TCP header
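The five test queries could be phrased in SQL along the following lines. The deck does not show the actual query text, so everything here is an assumption: the column names (`td` duration, `sp`/`dp` ports, `pr` protocol, `flg` flags, `ibyt` bytes) follow nfdump's CSV naming, the flag string `'....S.'` for "SYN only" depends on the nfdump version, and 100 MB is taken as 100,000,000 bytes. To keep the sketch runnable without a cluster, Python's built-in `sqlite3` stands in for Spark SQL; the queries use only syntax common to both.

```python
import sqlite3

IP = "10.0.0.1"          # hypothetical address of interest
SYN_ONLY = "....S."      # assumed flag string for a SYN-only flow

QUERIES = {
    "flows_with_ip": f"SELECT * FROM nf_data WHERE sa = '{IP}' OR da = '{IP}'",
    "large_flows": "SELECT * FROM nf_data WHERE ibyt > 100000000",
    "top_telnet_syn": (
        "SELECT sa, da, ibyt * 8.0 / td AS bps FROM nf_data "
        f"WHERE pr = 'TCP' AND dp = 23 AND flg = '{SYN_ONLY}' AND td > 0 "
        "ORDER BY bps DESC LIMIT 10"
    ),
    "top_receivers": (
        "SELECT da, SUM(ibyt) AS total_bytes FROM nf_data "
        "GROUP BY da ORDER BY total_bytes DESC LIMIT 10"
    ),
    "syn_only": f"SELECT * FROM nf_data WHERE flg = '{SYN_ONLY}'",
}

# Tiny made-up data set: (ts, td, sa, da, sp, dp, pr, flg, ibyt)
rows = [
    ("2018-01-01 12:45:00", 2.0, "10.0.0.1", "10.0.0.2", 40000, 23, "TCP", "....S.", 120),
    ("2018-01-01 12:45:01", 10.0, "10.0.0.3", "10.0.0.2", 40001, 80, "TCP", ".AP.SF", 200000000),
    ("2018-01-01 12:45:02", 1.0, "10.0.0.4", "10.0.0.5", 40002, 23, "TCP", "....S.", 40),
    ("2018-01-01 12:45:03", 5.0, "10.0.0.2", "10.0.0.1", 53, 40003, "UDP", "......", 300),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nf_data (ts TEXT, td REAL, sa TEXT, da TEXT, "
    "sp INTEGER, dp INTEGER, pr TEXT, flg TEXT, ibyt INTEGER)"
)
conn.executemany("INSERT INTO nf_data VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, result in results.items():
    print(name, len(result))
```

With Spark, each string in `QUERIES` would be passed to `spark.sql(...)` after registering the Parquet data as the `nf_data` view.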

SLIDE 16

Results: retrieve all flows containing a specific IP address

Execution time in minutes:

Time frame   NfDump   Hadoop+Spark
5min         0:08     3:00
30min        0:33     2:37
1hr          1:05     2:58
3.5hrs       3:33     3:22
7hrs         6:42     3:46

Figure 8: Execution time of retrieving all flows containing a specific IP address.

SLIDE 17

Results: retrieve all flows with byte count >100MB

Execution time in minutes:

Time frame   NfDump   Hadoop+Spark
5min         0:08     3:15
30min        0:28     3:07
1hr          1:06     3:09
3.5hrs       3:39     2:50
7hrs         6:52     2:53

Figure 9: Execution time of retrieving all flows with byte count larger than 100MB.

SLIDE 18

Results: list top 10 Telnet connections with only SYN flag set ordered by bps

Execution time in minutes:

Time frame   NfDump   Hadoop+Spark
5min         0:09     3:22
30min        0:49     2:29
1hr          2:09     3:09
3.5hrs       -        3:15
7hrs         -        3:15

Figure 10: Execution time of retrieving the top 10 Telnet connections with only the SYN flag set, ordered by the number of bits per second.

SLIDE 19

Results: List top 10 IPs receiving most traffic

Execution time in minutes:

Time frame   NfDump   Hadoop+Spark
5min         0:19     2:39
30min        1:25     2:38
1hr          4:04     2:42
3.5hrs       11:12    3:37
7hrs         23:22    4:03

Figure 11: Execution time of retrieving the top 10 IP addresses receiving the largest amount of traffic.

SLIDE 20

Results: Retrieve all flows with only SYN flag set

Execution time in minutes:

Time frame   NfDump   Hadoop+Spark
5min         1:02     3:14
30min        5:22     3:28
1hr          11:53    3:21
3.5hrs       41:44    5:06
7hrs         89:23    5:52

Figure 12: Execution time of retrieving all flows with only the SYN flag set in the TCP header.

SLIDE 21

Discussion

  • Execution time of NfDump increases linearly with the length of the time frame.
  • Hadoop scales very well: the execution time of Spark with Hadoop does not increase significantly when dealing with larger amounts of data.
  • NfDump struggles with more complex queries, whereas these pose no problem for Spark and Hadoop.
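The scaling behaviour can be checked with a little arithmetic on the values shown for the "only SYN flag set" query (Figure 12). The sketch assumes that the labels are in minutes:seconds and that the steadily growing series belongs to NfDump, which is consistent with the first point above; the variable names are illustrative.

```python
def to_seconds(label):
    """Convert a 'minutes:seconds' label such as '89:23' to seconds."""
    minutes, seconds = label.split(":")
    return int(minutes) * 60 + int(seconds)


TIME_FRAMES = ["5min", "30min", "1hr", "3.5hrs", "7hrs"]
NFDUMP = ["1:02", "5:22", "11:53", "41:44", "89:23"]   # assumed NfDump series
SPARK = ["3:14", "3:28", "3:21", "5:06", "5:52"]       # assumed Hadoop+Spark series

# Ratio > 1 means Hadoop+Spark finished faster than NfDump.
ratios = {
    frame: to_seconds(n) / to_seconds(s)
    for frame, n, s in zip(TIME_FRAMES, NFDUMP, SPARK)
}
crossover = next(frame for frame in TIME_FRAMES if ratios[frame] > 1)

for frame in TIME_FRAMES:
    print(f"{frame:>7}: NfDump / Spark = {ratios[frame]:.2f}")
print("Spark is faster from", crossover, "onwards")
```

Under these assumptions NfDump wins only for the shortest time frame, where Spark's roughly constant start-up cost dominates, and the gap widens as the time frame grows.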

SLIDE 22

Conclusion and future work

  • The combination of Hadoop and Apache Spark is a viable option for analyzing large-scale NetFlow data.
  • Tuning and optimization of the Spark implementation and the Hadoop cluster may lead to even better performance.

SLIDE 23

Questions?

SLIDE 24

References

[1] NfDump. http://nfdump.sourceforge.net/.

[2] Cisco. Introduction to Cisco IOS NetFlow - a technical overview, 2007.

[3] R. Hofstede, P. Čeleda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, and A. Pras. Flow monitoring explained: From packet capture to data analysis with NetFlow and IPFIX. IEEE Communications Surveys & Tutorials, 16(4):2037–2064, 2014.

[4] G. Sadasivan. Architecture for IP flow information export, 2009.

[5] Z. Tian. Management of large scale NetFlow data by distributed systems. Master’s thesis, NTNU, 2016.
