

SLIDE 1

UMBC

Towards Adaptive Big Data Cyber-attack Detection via Semantic Link Networks

George Karabatis1, Jianwu Wang1, Ahmed AlEroud2

{georgek, jianwu, ahmed21}@umbc.edu

1Department of Information Systems

University of Maryland, Baltimore County

2Department of Computer Information Systems

Yarmouk University, Irbid, Jordan

Mission Critical Big Data Analytics

MCBDA – Prairie View, TX, May 2016

SLIDE 2

Why are Cyber Attacks an issue?

  • Grid security
  • Sabotage of operations
  • Data security (database and communication)
  • Communication interference
  • Financial fraud

SLIDE 3

Intrusion Detection Systems

  • Packet-based IDSs: Analyze the content of network packets to predict attacks

– A fairly hard task on today’s high-speed Gigabit networks, which carry vast volumes of network packets

  • Flow-based IDSs: Detect cyber attacks by analyzing net-flows

– The content of packets is not available
– Only traffic-based features are used

SLIDE 4

Packet-based Intrusion Detection

Advantages

  • Have full access to payload
  • More information is available
  • More accurate intrusion detection

Disadvantages

  • Increasing network bandwidth generates huge amounts of data

  • Analysis of data is computationally expensive

Result: Perfect big data problem

SLIDE 5

Network Flows (flows)

Think of it like phone call metadata: who called whom, when, but without the conversation

  • Source/Destination IP
  • Input/Output Router Interface
  • Protocol
  • Type of Service
  • Packet Count
  • Octet Count
  • Start/End Time
  • TCP Flags
  • Source/Dest Network Mask
  • Input/Output Interface encapsulation size
  • IP Address of next hop within the peer
  • Router IP of cache shortcut in supervisor

SLIDE 6

NetFlow

[Figure: a packet stream partitioned into flow 1, flow 2, flow 3, and flow 4]

  • Set of packets that “belong together”

– Source/destination IP addresses and port numbers
– Same protocol, …
– Same input/output interfaces at a router (if known)

  • Packets that are “close” together in time

– Maximum spacing between packets (e.g., 30 sec)
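
The two grouping criteria above (shared key, bounded gap between packets) can be sketched roughly as follows. This is a minimal illustration, not how any particular router or NetFlow exporter does it: the `Packet` tuple, the `group_into_flows` helper, and the choice of a 5-field key are assumptions made for the example, and the 30-second gap is the value quoted on the slide.

```python
from collections import namedtuple

# Hypothetical packet record; real captures carry many more fields.
Packet = namedtuple("Packet", "timestamp src_ip dst_ip src_port dst_port protocol")


def group_into_flows(packets, max_gap=30.0):
    """Group packets that share a key and arrive within max_gap seconds."""
    flows = []      # every flow found so far, each a list of packets
    open_flow = {}  # flow key -> index in `flows` of the flow still open
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)
        idx = open_flow.get(key)
        if idx is not None and pkt.timestamp - flows[idx][-1].timestamp <= max_gap:
            # Same key and close enough in time: extend the open flow.
            flows[idx].append(pkt)
        else:
            # New key, or the gap was exceeded: start a new flow.
            flows.append([pkt])
            open_flow[key] = len(flows) - 1
    return flows


packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP"),
    Packet(5.0, "10.0.0.3", "10.0.0.2", 5001, 22, "TCP"),
    Packet(10.0, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP"),
    Packet(50.0, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP"),
]
flows = group_into_flows(packets)
```

In this toy input the packet at 50 s has the same key as the first two web packets but arrives 40 s after the previous one, so it opens a new flow.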

SLIDE 7

Flow-based intrusion detection

Advantages

  • Less data has to be analyzed
  • Detection process is faster due to less data

Disadvantages

  • Have no access to payload
  • Subset of attacks detected
  • Accuracy not as good as packet-based

SLIDE 8

Semantic Link Networks (SLN)

  • An SLN is a graph with nodes and edges
  • Nodes: Represent alerts or benign activity
  • Edges: Weighted links representing the similarity of the nodes

– Measured in terms of context: time, location, numerical, and descriptive features

SLIDE 9

Contextual features

  • Time

– Start, end time of flows

  • Location

– Source, destination IP addresses, port numbers

  • Numerical

– Traffic statistics, e.g. # of packets, octets

  • Descriptive

– Other characteristics, e.g. flags, protocol

SLIDE 10

Constructing SLNs

Nodes represent either alerts or benign activities

– Each node is initially represented as a feature vector

Binary feature vectors (e.g. TCP flags): one 0/1 entry per feature f1 to f4 for each of the nodes n1 and n2

Feature vectors using numerical weights:

      n1    n2
f1    0.7   0.8
f2    0.02  0.5
f3    0.9   0.03
f4    0.01  0.01

$$p(n_1, n_2) = \frac{\mathrm{Sim}(n_1, n_2)}{\sum_{m=1}^{p} \mathrm{Sim}_m}$$

Diagram labels: PR_Sim(n1, n2), AN_Sim(n1, n2), p(n1, n2)

Edges: weighted links (calculated using Anderberg and Pearson)
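
As a rough illustration of one of the two measures named above, the sketch below computes a Pearson correlation between two numerical feature vectors, using the n1 and n2 weights from this slide. The `pearson` helper is written for this example only. How the authors combine Pearson and Anderberg scores into a final edge weight is not shown here, and the Anderberg coefficient for binary vectors is deliberately left out.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant vector has no defined correlation; treat as no similarity.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Numerical weights for features f1..f4, taken from the slide.
n1 = [0.7, 0.02, 0.9, 0.01]
n2 = [0.8, 0.5, 0.03, 0.01]

weight = pearson(n1, n2)  # candidate weight for the edge between n1 and n2
```

A value near 1 would indicate that the two nodes weight their features in the same pattern, while a value near 0 or below indicates little or opposite agreement.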

SLIDE 11

Intrusion Detection with SLNs

After SLN is complete, and during run-time

  • Investigate features of an incoming flow
  • Find the start node in the SLN with features similar to the incoming flow

– Individual flows are classified by a rule-based classifier that works on flow features (J48)

  • Expand the set of nodes with additional ones based on:

– Connectivity on the graph
– Threshold value (controls scope of expansion)

  • Recall is increased, but there may be false positives
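
The expansion step can be pictured with a small sketch. It assumes the SLN is stored as a dictionary of weighted neighbors and that expansion follows every edge whose weight meets the threshold, over any number of hops. Both are assumptions for illustration, as are the node names and the `expand_nodes` helper; the slides do not say whether the authors limit expansion to direct neighbors.

```python
from collections import deque


def expand_nodes(sln, start, threshold):
    """Collect nodes reachable from start over edges with weight >= threshold."""
    selected = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, weight in sln.get(node, {}).items():
            if weight >= threshold and neighbor not in selected:
                selected.add(neighbor)
                queue.append(neighbor)
    return selected


# Hypothetical SLN: node -> {neighbor: similarity weight}.
sln = {
    "port_scan": {"syn_flood": 0.8, "benign_web": 0.1},
    "syn_flood": {"port_scan": 0.8, "dos": 0.6},
    "dos": {"syn_flood": 0.6},
    "benign_web": {"port_scan": 0.1},
}

wide = expand_nodes(sln, "port_scan", 0.5)
narrow = expand_nodes(sln, "port_scan", 0.7)
```

Lowering the threshold widens the result set, which is the mechanism by which recall rises at the cost of possible false positives.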

SLIDE 12

Intrusion Detection with SLNs

  • Apply context filters

– Limit the expanded result set
– Reduce the false positives/negatives

  • Precision increases
  • SLN must be updated when new attacks (nodes) are discovered

– Graph re-generation is expensive
– Dynamic approach is more promising

SLIDE 13

Attack Prediction Process

[Figure: incoming flow, then classification rules (R1, R2, …, Rn) for initial prediction, then filtering of FPs, then final predictions]

SLIDE 14

Hybrid intrusion detection

  • Combines flow-based and packet-based detection
  • Takes advantage of both approaches
  • Requires a big data platform
  • Is expected to increase prediction accuracy, since payload data supplements flow data

SLIDE 15

Hybrid intrusion detection

Layer one

  • Flow-based approach is applied
  • If prediction is benign, allow flow to pass
  • If prediction is suspicious, analyze further

– If the flow is marked suspicious with high probability, enforce the appropriate policy:

  • Deny entry
  • Divert to another system (e.g. honeypot)

– If the flow is marked suspicious with medium probability, proceed to layer two
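
The layer-one routing described above can be sketched as a small decision function. The `Action` names, the `layer_one_decision` helper, and the 0.9 cut-off for "high probability" are all assumptions for illustration. The slides give no numeric thresholds and do not say what happens to low-probability suspicious flows, so this sketch sends every suspicious flow below the cut-off to layer two.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"                          # benign: let the flow pass
    ENFORCE_POLICY = "enforce_policy"        # e.g. deny entry or divert to a honeypot
    SEND_TO_LAYER_TWO = "send_to_layer_two"  # packet-level analysis needed


def layer_one_decision(is_suspicious, probability, high=0.9):
    """Route a flow based on the flow-based prediction and its probability."""
    if not is_suspicious:
        return Action.ALLOW
    if probability >= high:
        return Action.ENFORCE_POLICY
    # Assumption: anything suspicious but below the cut-off is examined further.
    return Action.SEND_TO_LAYER_TWO
```

For example, `layer_one_decision(True, 0.6)` routes the flow to the packet-based layer, while `layer_one_decision(True, 0.95)` triggers the policy directly.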

SLIDE 16

Hybrid intrusion detection

Layer two

  • More information is needed to decide
  • Corresponding packets are passed to a Spark-based platform
  • Spark DStream is applied
  • The map function runs in parallel for both individual and multi-stage packet analysis

SLIDE 17

Hybrid Big-Data IDS

[Figure: hybrid big-data IDS architecture, including the flow-based layer]

SLIDE 18

Hybrid intrusion detection

  • Multistage attacks

– Require current and past packets (historical packets with the same IP address)
– A NoSQL DB (Cassandra) stores suspicious packets and is queried for matched patterns
– Newly discovered attacks are used to dynamically update the SLN

SLIDE 19

Advantages of Hybrid Approach

  • Flows that are predicted as benign, or as suspicious with high probability, do not reach the second layer (packet examination), saving computational resources
  • Only questionable flows are further examined at the packet level
  • Accuracy of the prediction is expected to rise, since more information (payload) is available
  • More attacks may be recognized (since there is access to payload, in addition to flow data)
  • Compared to packet-based approaches, our approach requires fewer computational resources

SLIDE 20

Packet Analysis on Spark

  • Create a Spark streaming context with a batch interval of n seconds
  • Create a DStream from the incoming network socket; each batch of the DStream contains all packets received within the batch interval
  • Apply the full packet analysis function to each packet in parallel through the DStream’s map function, and output each suspicious packet and its attackType as a key-value pair
  • Report new types of attacks to update the SLN
  • Apply the multistage packet analysis function to each DStream element in parallel through the DStream’s map function, and output each suspicious multistage packet and its attackType as a key-value pair
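
The first three steps above might look roughly like the following PySpark sketch, which uses the legacy DStream API (`StreamingContext`, `socketTextStream`, `map`, `filter`, `pprint`). The packet format (one comma-separated line per packet), the `analyze_packet` rule, the host and port, and the application name are all assumptions for illustration. The authors' actual analysis functions are not shown in the slides, and the SLN update and multistage steps are omitted.

```python
def analyze_packet(line):
    """Hypothetical full-packet check.

    Expects 'src_ip,dst_port,payload' and returns a (packet, attackType)
    key-value pair for suspicious packets, or None otherwise.
    """
    packet = line.strip()
    fields = packet.split(",", 2)
    if len(fields) != 3:
        return None
    payload = fields[2]
    # Toy rule standing in for real payload inspection.
    if "/etc/passwd" in payload:
        return (packet, "path_traversal")
    return None


def run(host="localhost", port=9999, batch_seconds=1):
    """Wire the analysis function into a DStream pipeline (needs pyspark)."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="PacketAnalysisSketch")
    ssc = StreamingContext(sc, batch_seconds)    # batch interval in seconds
    packets = ssc.socketTextStream(host, port)   # one text line per packet
    suspicious = packets.map(analyze_packet).filter(lambda kv: kv is not None)
    suspicious.pprint()                          # print each batch's pairs
    ssc.start()
    ssc.awaitTermination()
```

Calling `run()` starts the stream and blocks until it is stopped. Keeping `analyze_packet` as a plain function means it can be exercised without a cluster.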

SLIDE 21

Conclusions

  • A promising technique for huge amounts of network data
  • Takes advantage of flow and packet approaches
  • Builds on previous success on packet-based and flow-based intrusion detection
  • Work in progress on the hybrid approach for big data

– Implementation for Spark platform
– Evaluation with datasets

SLIDE 22

Questions?
