

Slide 1

MIDeA: A Multi-Parallel Intrusion Detection Architecture

Giorgos Vasiliadis, FORTH-ICS, Greece
Michalis Polychronakis, Columbia U., USA
Sotiris Ioannidis, FORTH-ICS, Greece

CCS 2011, 19 October 2011

Slide 2

Network Intrusion Detection Systems

  • Typically deployed at ingress/egress points
    – Inspect all network traffic
    – Look for suspicious activities
    – Alert on malicious actions

[Figure: a NIDS on a 10 GbE link between the Internet and the internal network]

Slide 3
Challenges

  • Traffic rates are increasing
    – 10 Gbit/s Ethernet speeds are common in metro/enterprise networks
    – Up to 40 Gbit/s at the core
  • Need to perform increasingly complex analysis at higher speeds
    – Deep packet inspection
    – Stateful analysis
    – 1000s of attack signatures

Slide 4

Designing NIDS

  • Fast
    – Need to handle many Gbit/s
  • Scalable
    – Moore's law no longer delivers faster single cores; performance must come from more cores
  • Commodity hardware
    – Cheap
    – Easily programmable

Slide 5

Today: fast or commodity

  • Fast “hardware” NIDS
    – FPGA/TCAM/ASIC based
    – Throughput: high
  • Commodity “software” NIDS
    – Processing by general-purpose processors
    – Throughput: low

Slide 6

MIDeA

  • A NIDS out of commodity components
    – Single-box implementation
    – Easy programmability
    – Low price

Can we build a 10 Gbit/s NIDS with commodity hardware?

Slide 7

Outline

  • Architecture
  • Implementation
  • Performance Evaluation
  • Conclusions

Slide 8

Single-threaded performance

  • Vanilla Snort: 0.2 Gbit/s

[Figure: single-threaded pipeline (NIC → Preprocess → Pattern matching → Output)]

Slide 9

Problem #1: Scalability

  • Single-threaded NIDS have limited performance
    – Do not scale with the number of CPU cores

Slide 10

Multi-threaded performance

  • Vanilla Snort: 0.2 Gbit/s
  • With multiple CPU-cores: 0.9 Gbit/s

[Figure: NIC feeding three parallel Preprocess → Pattern matching → Output pipelines]

Slide 11

Problem #2: How to split traffic

  • Pitfalls when splitting traffic across cores:
    – Synchronization overheads
    – Cache misses
  • Solution: Receive-Side Scaling (RSS); a flow-hashing sketch follows the figure

[Figure: the NIC distributing packets across per-core Rx queues]
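To make the split concrete, here is a minimal host-side sketch (not from the talk) of the flow-direction idea behind RSS: hash each packet's 5-tuple and let the hash pick the Rx queue. Real NICs, such as the Intel 82599 used in the evaluation, compute a Toeplitz hash in hardware; the FNV-1a hash and every name below are illustrative stand-ins.

    #include <stdint.h>
    #include <stddef.h>

    /* Zero-initialize before filling, so padding bytes hash deterministically. */
    struct flow_5tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    /* FNV-1a over the 5-tuple; a software stand-in for the NIC's Toeplitz hash. */
    static uint32_t flow_hash(const struct flow_5tuple *f)
    {
        const uint8_t *p = (const uint8_t *)f;
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < sizeof *f; i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Every packet of a connection hashes to the same queue, hence the same
     * core: no locks between cores, and warm caches for per-flow state. */
    static unsigned rx_queue_for(const struct flow_5tuple *f, unsigned num_queues)
    {
        return flow_hash(f) % num_queues;
    }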

Slide 12

Multi-queue performance

  • Vanilla Snort: 0.2 Gbit/s
  • With multiple CPU-cores: 0.9 Gbit/s
  • With multiple Rx-queues: 1.1 Gbit/s

[Figure: RSS-enabled NIC feeding three parallel Preprocess → Pattern matching → Output pipelines]

Slide 13

Problem #3: Pattern matching is the bottleneck

Solution: offload pattern matching to the GPU

[Figure: pattern matching accounts for more than 75% of the total processing time]

Slide 14

Why GPU?

  • General-purpose computing

– Flexible and programmable

  • Powerful and ubiquitous

– Constant innovation

  • Data-parallel model

– More transistors for data processing rather than data caching and flow control

Slide 15

Offloading pattern matching to the GPU

  • Vanilla Snort: 0.2 Gbit/s
  • With multiple CPU-cores: 0.9 Gbit/s
  • With multiple Rx-queues: 1.1 Gbit/s
  • With GPU: 5.2 Gbit/s

[Figure: RSS-enabled NIC feeding parallel pipelines, with pattern matching offloaded to the GPU]

Slide 16

Outline

  • Architecture
  • Implementation
  • Performance Evaluation
  • Conclusions

Slide 17

Multiple data transfers

  • Several data transfers between different devices

Are the data transfers worth the computational gains offered?

[Figure: the data path from the NIC to the CPU to the GPU, crossing PCIe twice]

Slide 18

Capturing packets from NIC

  • Packets are hashed in the NIC and distributed to different Rx-queues
  • Memory-mapped ring buffers for each Rx-queue (a consumer sketch follows the figure)

[Figure: the NIC hashing packets to Rx queues; each queue's ring buffer is memory-mapped from kernel space into user space]
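A minimal sketch of draining one such memory-mapped ring from user space, assuming a simple slot layout with a ready flag; the layout is invented for illustration and does not match any particular capture framework.

    #include <stdint.h>

    #define RING_SLOTS 4096                 /* power of two, so masking works */

    struct slot {
        volatile uint32_t ready;            /* set by the kernel-side producer */
        uint32_t          len;              /* packet length in bytes */
        uint8_t           data[2048];
    };

    struct ring {
        struct slot slots[RING_SLOTS];
        uint32_t    next;                   /* consumer cursor, user-space owned */
    };

    /* Drain one Rx queue's ring, handing each packet to the analyzer. */
    static void consume_ring(struct ring *r,
                             void (*process)(const uint8_t *pkt, uint32_t len))
    {
        for (;;) {
            struct slot *s = &r->slots[r->next & (RING_SLOTS - 1)];
            if (!s->ready)
                break;                      /* ring is empty for now */
            process(s->data, s->len);
            s->ready = 0;                   /* hand the slot back to the kernel */
            r->next++;
        }
    }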

Slide 19

CPU Processing

  • Packet capturing is performed by different CPU cores in parallel
    – Process affinity (a pinning sketch follows this list)
  • Each core normalizes and reassembles captured packets into streams
    – Remove ambiguities
    – Detect attacks that span multiple packets
  • Packets of the same connection always end up at the same core
    – No synchronization
    – Cache locality
  • Reassembled packet streams are then transferred to the GPU for pattern matching
    – How to access the GPU?
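A sketch of the per-core pinning using the Linux pthread affinity API; capture_worker, NUM_CORES, and the one-thread-per-core mapping are assumptions for illustration.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NUM_CORES 8

    static void *capture_worker(void *arg)
    {
        long core = (long)arg;
        /* capture, normalize, and reassemble packets from this
         * core's Rx queue (omitted) */
        (void)core;
        return NULL;
    }

    /* One capture thread per core, pinned there, so each Rx queue is
     * always drained by the same core. */
    static void spawn_pinned_workers(pthread_t tids[NUM_CORES])
    {
        for (long i = 0; i < NUM_CORES; i++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)i, &set);
            pthread_create(&tids[i], NULL, capture_worker, (void *)i);
            pthread_setaffinity_np(tids[i], sizeof set, &set);
        }
    }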

Slide 20

Accessing the GPU

  • Solution #1: Master/Slave model
  • Execution flow example

[Figure: master/slave execution flow; one master thread performs all transfers to and from the GPU and all kernel launches on behalf of the other threads; measured throughput: 14.6 Gbit/s, against a PCIe capacity of 64 Gbit/s]
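A sketch of what the master/slave model implies, assuming a simple condition-variable work queue (all names invented): workers enqueue batches, and one master thread performs every GPU operation. Serializing the device behind one thread is what caps throughput at the measured 14.6 Gbit/s.

    #include <pthread.h>
    #include <stddef.h>

    struct gpu_batch { const unsigned char *data; size_t len; };

    #define QCAP 64

    struct work_queue {
        struct gpu_batch items[QCAP];
        unsigned         head, tail;    /* tail - head == queued batches */
        pthread_mutex_t  lock;          /* PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t   nonempty;      /* PTHREAD_COND_INITIALIZER  */
    };

    /* Worker side: hand a batch to the master, go back to packets. */
    void enqueue(struct work_queue *q, struct gpu_batch b)
    {
        pthread_mutex_lock(&q->lock);
        q->items[q->tail++ % QCAP] = b; /* overflow check omitted */
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Master side: the only thread that touches the GPU; every transfer
     * and kernel launch funnels through here, which is the bottleneck. */
    struct gpu_batch dequeue(struct work_queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->head == q->tail)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct gpu_batch b = q->items[q->head++ % QCAP];
        pthread_mutex_unlock(&q->lock);
        return b;
    }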

Slide 21

Accessing the GPU

  • Solution #2: Shared execution by multiple threads
  • Execution flow example

[Figure: shared execution flow; each of threads 1-4 transfers its own batches (P1, P2, P3) to the GPU and launches kernels concurrently; measured throughput: 48.1 Gbit/s, against a PCIe capacity of 64 Gbit/s]
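By contrast, a sketch of shared execution: each worker thread submits its own transfers and kernel launches on its own CUDA stream within the shared context, so no single thread is a choke point. The scan kernel is a trivial stand-in; the real matching kernel is sketched under slide 23.

    #include <cuda_runtime.h>

    /* Trivial stand-in kernel: count occurrences of one suspicious byte. */
    __global__ void scan(const unsigned char *buf, size_t len, int *hits)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len && buf[i] == 0x90)
            atomicAdd(hits, 1);
    }

    /* Called concurrently from several host threads; each owns a stream,
     * but all share one GPU context, so nobody waits for a master. */
    void submit_batch(const unsigned char *host_buf, unsigned char *dev_buf,
                      size_t len, int *dev_hits, int *host_hits,
                      cudaStream_t stream)
    {
        cudaMemcpyAsync(dev_buf, host_buf, len, cudaMemcpyHostToDevice, stream);
        scan<<<(unsigned)((len + 255) / 256), 256, 0, stream>>>(dev_buf, len, dev_hits);
        cudaMemcpyAsync(host_hits, dev_hits, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
    }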

Slide 22

Transferring to GPU

  • Small transfers result in PCIe throughput degradation
  • Solution: each core batches many reassembled packets into a single buffer (a sketch follows the figure)

[Figure: a CPU core pushing batched buffers to the GPU, which scans them]
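A sketch of the per-core batching, assuming a fixed-size page-locked staging buffer (all names invented): packets are appended until the buffer fills, then shipped in one cudaMemcpyAsync. The synchronize keeps this simple version correct; the double-buffering of slide 24 removes that stall.

    #include <cuda_runtime.h>
    #include <string.h>

    #define BATCH_BYTES (1 << 20)           /* 1 MB staging buffer per core */

    struct batch {
        unsigned char *host;                /* page-locked staging buffer */
        unsigned char *dev;
        size_t         used;
        cudaStream_t   stream;
    };

    void batch_init(struct batch *b)
    {
        cudaMallocHost((void **)&b->host, BATCH_BYTES);  /* pinned: DMA-friendly */
        cudaMalloc((void **)&b->dev, BATCH_BYTES);
        cudaStreamCreate(&b->stream);
        b->used = 0;
    }

    /* One large PCIe transfer instead of one tiny transfer per packet. */
    void batch_flush(struct batch *b)
    {
        cudaMemcpyAsync(b->dev, b->host, b->used,
                        cudaMemcpyHostToDevice, b->stream);
        /* the matching kernel would be launched on b->stream here */
        cudaStreamSynchronize(b->stream);   /* simplistic; see slide 24 */
        b->used = 0;
    }

    void batch_append(struct batch *b, const unsigned char *pkt, size_t len)
    {
        if (b->used + len > BATCH_BYTES)
            batch_flush(b);
        memcpy(b->host + b->used, pkt, len);
        b->used += len;
    }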

Slide 23

Pattern Matching on GPU

  • One GPU core is assigned to each reassembled packet stream, so streams are scanned uniformly in parallel (a kernel sketch follows the figure)

[Figure: GPU cores scanning streams from a packet buffer in parallel and reporting matches]
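A minimal CUDA sketch of the data-parallel matching step, assuming an Aho-Corasick-style DFA flattened into one state table, with negative entries marking accepting transitions; MIDeA's actual table layout differs, and every name here is illustrative. A launch of the form match_streams<<<(num_streams + 255) / 256, 256>>>(...) gives one thread per stream.

    #include <cuda_runtime.h>

    #define ALPHABET 256

    /* One GPU thread walks the DFA over one reassembled stream.
     * dfa[state * ALPHABET + byte] holds the next state; a negative
     * entry encodes an accepting transition (a signature matched). */
    __global__ void match_streams(const int *dfa,
                                  const unsigned char *buf, /* packed streams  */
                                  const int *offset,        /* start of stream */
                                  const int *length,        /* bytes in stream */
                                  int *first_match,         /* result, or -1   */
                                  int num_streams)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_streams)
            return;

        const unsigned char *p = buf + offset[i];
        int state = 0;
        int hit = -1;
        for (int j = 0; j < length[i]; j++) {
            state = dfa[state * ALPHABET + p[j]];
            if (state < 0) {                /* accepting transition */
                if (hit < 0)
                    hit = j;                /* offset of first match */
                state = -state;             /* continue from decoded state */
            }
        }
        first_match[i] = hit;
    }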

Slide 24

Pipelining CPU and GPU

  • Double-buffering (a rotation sketch follows the figure)
    – Each CPU core collects new reassembled packets, while the GPUs process the previous batch
    – Effectively hides GPU communication costs

[Figure: each CPU core alternates between two packet buffers, filling one while the GPU processes the other]
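A sketch of the rotation, assuming each core owns two pinned-buffer/stream pairs (names invented): the just-filled buffer is handed to the GPU asynchronously, and the other buffer, once its previous round has drained, becomes the new collection target.

    #include <cuda_runtime.h>

    struct buf {
        unsigned char *host, *dev;          /* pinned host + device buffers */
        size_t         used;
        cudaStream_t   stream;
    };

    /* Ship the filled buffer without blocking; the CPU keeps collecting
     * packets into the other buffer while the GPU crunches this one. */
    void rotate(struct buf *filled, struct buf *refill_next)
    {
        cudaMemcpyAsync(filled->dev, filled->host, filled->used,
                        cudaMemcpyHostToDevice, filled->stream);
        /* matching kernel launch on filled->stream would go here */

        cudaStreamSynchronize(refill_next->stream);  /* previous round done */
        refill_next->used = 0;              /* safe to overwrite now */
    }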

Slide 25

Recap

[Figure: recap of the pipeline]

  • NIC: demultiplexes the incoming 1-10 Gbit/s of traffic into per-flow packet streams
  • CPUs: per-flow protocol analysis, producing reassembled packet streams
  • GPUs: data-parallel content matching over the reassembled streams

Slide 26

Outline

  • Architecture
  • Implementation
  • Performance Evaluation
  • Conclusions

Slide 27

Setup: Hardware

  • NUMA architecture, QuickPath Interconnect

[Figure: NUMA topology; two sockets (CPU-0, CPU-1), each with local memory and an IOH, connected by QuickPath; the two GPUs and the NIC attach to the IOHs]

Qty       Model            Specs
2 × CPU   Intel E5520      2.27 GHz × 4 cores
2 × GPU   NVIDIA GTX480    1.4 GHz × 480 cores
1 × NIC   Intel 82599EB    10 GbE
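A sketch of NUMA-aware pairing on this box, assuming cores 0-3 sit on socket 0 and CUDA device i hangs off socket i's IOH (both hard-coded assumptions, not queried from the system): each worker selects the GPU on its own socket so batches avoid crossing the QPI link.

    #include <cuda_runtime.h>

    /* Pick the GPU attached to this core's own IOH, keeping DMA traffic
     * off the inter-socket QPI link. */
    void bind_gpu_for_core(int core)
    {
        int socket = core / 4;    /* E5520: cores 0-3 on CPU-0, 4-7 on CPU-1 */
        cudaSetDevice(socket);    /* assume CUDA device i sits under IOH i  */
    }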

Slide 28

Pattern Matching Performance

  • The performance of a single GPU increases as the number of CPU cores increases
    – Bounded by PCIe capacity

[Figure: single-GPU pattern matching throughput vs. number of CPU cores: 14.6, 26.7, 42.5, and 48.1 Gbit/s for 1, 2, 4, and 8 cores]

Slide 29

Pattern Matching Performance

  • Adding a second GPU raises aggregate pattern matching throughput to 70.7 Gbit/s

[Figure: pattern matching throughput vs. number of CPU cores (14.6, 26.7, 42.5, 48.1 Gbit/s with one GPU); with a second GPU, the 8-core configuration reaches 70.7 Gbit/s]

Slide 30

Setup: Network

[Figure: test setup, a traffic generator/replayer connected to MIDeA over a 10 GbE link]

Slide 31

Synthetic traffic

  • Randomly generated traffic

[Figure: throughput vs. packet size (200, 800, and 1500 bytes) for Snort on 8 cores and MIDeA; bar values as extracted: 1.5, 4.8, 7.2 and 2.1, 1.1, 2.4 Gbit/s]

Slide 32

Real traffic

  • 5.2 Gbit/s with zero packet loss
    – Replayed trace captured at the gateway of a university campus

[Figure: real traffic throughput, Snort on 8 cores: 1.1 Gbit/s vs. MIDeA: 5.2 Gbit/s]

Slide 33

Summary

  • MIDeA: a multi-parallel network intrusion detection architecture
    – Single-box implementation
    – Based on commodity hardware
    – Less than $1500
  • Operates at 5.2 Gbit/s with zero packet loss
    – 70 Gbit/s pattern matching throughput

Slide 34

Thank you!

gvasil@ics.forth.gr
