

  1. MIDeA: A Multi-Parallel Intrusion Detection Architecture
     Giorgos Vasiliadis, FORTH-ICS, Greece
     Michalis Polychronakis, Columbia U., USA
     Sotiris Ioannidis, FORTH-ICS, Greece
     CCS 2011, 19 October 2011

  2. Network Intrusion Detection Systems
     • Typically deployed at ingress/egress points
       – Inspect all network traffic
       – Look for suspicious activities
       – Alert on malicious actions
     [Diagram: Internet ↔ 10 GbE link with NIDS ↔ Internal Network]

  3. Challenges
     • Traffic rates are increasing
       – 10 Gbit/s Ethernet speeds are common in metro/enterprise networks
       – Up to 40 Gbit/s at the core
     • Increasingly complex analysis must be performed at these higher speeds
       – Deep packet inspection
       – Stateful analysis
       – Thousands of attack signatures

  4. Designing a NIDS
     • Fast
       – Needs to handle many Gbit/s
       – Scalable: Moore's law no longer delivers faster single cores, so performance must come from parallelism
     • Commodity hardware
       – Cheap
       – Easily programmable

  5. Today: fast or commodity
     • Fast "hardware" NIDS
       – FPGA/TCAM/ASIC based
       – Throughput: high
     • Commodity "software" NIDS
       – Processing by general-purpose processors
       – Throughput: low

  6. MIDeA
     • A NIDS built from commodity components
       – Single-box implementation
       – Easy programmability
       – Low price
     • Can we build a 10 Gbit/s NIDS with commodity hardware?

  7. Outline
     • Architecture
     • Implementation
     • Performance Evaluation
     • Conclusions

  8. Single-threaded performance
     [Diagram: NIC → Preprocess → Pattern matching → Output]
     • Vanilla Snort: 0.2 Gbit/s

  9. Problem #1: Scalability
     • Single-threaded NIDS have limited performance
       – Do not scale with the number of CPU cores

  10. Multi-threaded performance
      [Diagram: NIC feeding three parallel Preprocess → Pattern matching → Output pipelines]
      • Vanilla Snort: 0.2 Gbit/s
      • With multiple CPU-cores: 0.9 Gbit/s

  11. Problem #2: How to split traffic across cores
      • Splitting traffic to cores in software incurs:
        – Synchronization overheads
        – Cache misses
      • Instead, let the NIC do it: Receive-Side Scaling (RSS) hashes each packet to one of several Rx queues (a simplified hash sketch follows)
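
A minimal sketch of the idea behind RSS, assuming a simplified symmetric hash in place of the Toeplitz hash that real NICs such as the Intel 82599 implement. Ordering the two endpoints before mixing guarantees that both directions of a connection map to the same Rx queue, and hence to the same core; all names here are illustrative.

```cuda
// Host-side sketch of RSS-style flow hashing: the NIC hashes each
// packet's addressing tuple to pick an Rx queue. The hash must be
// symmetric so both directions of a connection land on the same
// queue. Real NICs use a Toeplitz hash; the ordered mix below is a
// simplified stand-in for illustration.
#include <cstdint>
#include <algorithm>

struct FlowKey {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

uint32_t symmetric_flow_hash(const FlowKey& k) {
    // Order the (ip, port) endpoints so A->B and B->A hash identically.
    uint64_t a = ((uint64_t)k.src_ip << 16) | k.src_port;
    uint64_t b = ((uint64_t)k.dst_ip << 16) | k.dst_port;
    uint64_t lo = std::min(a, b), hi = std::max(a, b);
    uint64_t h = lo * 0x9E3779B97F4A7C15ULL ^ hi;  // cheap mixer
    return (uint32_t)(h ^ (h >> 32));
}

int rx_queue_for(const FlowKey& k, int num_queues) {
    return symmetric_flow_hash(k) % num_queues;  // queue maps to a core
}
```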

  12. Multi-queue performance
      [Diagram: NIC with RSS feeding three parallel Preprocess → Pattern matching → Output pipelines]
      • Vanilla Snort: 0.2 Gbit/s
      • With multiple CPU-cores: 0.9 Gbit/s
      • With multiple Rx-queues: 1.1 Gbit/s

  13. Problem #3: Pattern matching is the bottleneck
      • Pattern matching accounts for more than 75% of the processing time
      [Diagram: NIC → Preprocess → Pattern matching (CPU) → Output]
      • Solution: offload pattern matching to the GPU
      [Diagram: NIC → Preprocess → Pattern matching (GPU) → Output]

  14. Why GPU?
      • General-purpose computing
        – Flexible and programmable
      • Powerful and ubiquitous
        – Constant innovation
      • Data-parallel model
        – More transistors for data processing rather than data caching and flow control

  15. Offloading pattern matching to the GPU
      [Diagram: NIC with RSS feeding three parallel Preprocess → Pattern matching (GPU) → Output pipelines]
      • Vanilla Snort: 0.2 Gbit/s
      • With multiple CPU-cores: 0.9 Gbit/s
      • With multiple Rx-queues: 1.1 Gbit/s
      • With GPU: 5.2 Gbit/s

  16. Outline
      • Architecture
      • Implementation
      • Performance Evaluation
      • Conclusions

  17. Multiple data transfers
      [Diagram: NIC → (PCIe) → CPU → (PCIe) → GPU]
      • Several data transfers between different devices
      • Are the data transfers worth the computational gains offered?

  18. Capturing packets from the NIC
      [Diagram: NIC with multiple Rx queues; per-queue ring buffers are memory-mapped from kernel space into user space]
      • Packets are hashed in the NIC and distributed to different Rx-queues
      • Memory-mapped ring buffers for each Rx-queue (a consumer-loop sketch follows)
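
A sketch of the user-space side of capture, under the assumption of a simple single-producer/single-consumer ring; the actual memory-mapped layout is defined by the capture driver, so the struct below is a hypothetical stand-in.

```cuda
// Host-side sketch of one memory-mapped Rx ring. Because each Rx
// queue has exactly one consumer core, no locks are needed.
#include <cstdint>
#include <atomic>

struct PacketSlot { uint16_t len; uint8_t data[2048]; };

struct PacketRing {
    std::atomic<uint32_t> head;  // next slot the NIC/kernel will fill
    std::atomic<uint32_t> tail;  // next slot the consumer will read
    uint32_t mask;               // num_slots - 1 (power of two)
    PacketSlot* slots;           // points into the mmap'ed region
};

// One consumer loop per Rx queue, pinned to its own core.
void consume_queue(PacketRing* ring,
                   void (*handle)(const uint8_t*, uint16_t)) {
    for (;;) {
        uint32_t t = ring->tail.load(std::memory_order_relaxed);
        if (t == ring->head.load(std::memory_order_acquire))
            continue;                        // ring empty: busy-poll
        PacketSlot* s = &ring->slots[t & ring->mask];
        handle(s->data, s->len);             // normalize + reassemble
        ring->tail.store(t + 1, std::memory_order_release);
    }
}
```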

  19. CPU Processing
      • Packet capturing is performed by different CPU-cores in parallel
        – Process affinity
      • Each core normalizes and reassembles captured packets into streams
        – Removes ambiguities
        – Detects attacks that span multiple packets
      • Packets of the same connection always end up on the same core (a per-core reassembly sketch follows)
        – No synchronization
        – Cache locality
      • Reassembled packet streams are then transferred to the GPU for pattern matching
        – How to access the GPU?
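
A minimal per-core reassembly sketch. The flow-table key and the in-order-only handling are illustrative assumptions, but the thread-local design reflects the slide's point: since RSS pins every connection to one core, no synchronization is needed.

```cuda
// Sketch of per-core stream reassembly: each capture thread owns a
// private flow table, so it is lock-free by construction. Real
// reassembly must also handle out-of-order segments, overlaps, and
// traffic normalization; those parts are omitted here.
#include <cstdint>
#include <vector>
#include <unordered_map>

struct Stream { uint32_t next_seq = 0; std::vector<uint8_t> payload; };

// One table per capture thread.
thread_local std::unordered_map<uint64_t, Stream> flow_table;

void on_tcp_segment(uint64_t flow_id, uint32_t seq,
                    const uint8_t* data, uint16_t len) {
    Stream& s = flow_table[flow_id];
    if (seq == s.next_seq) {            // in-order: extend the stream
        s.payload.insert(s.payload.end(), data, data + len);
        s.next_seq += len;
    }
    // else: buffer and reorder out-of-order data (omitted here)
}
```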

  20. Accessing the GPU
      • Solution #1: Master/slave model
        – A single master thread performs all transfers and kernel launches on behalf of the other threads (PCIe link: 64 Gbit/s)
      • Execution flow example: one batch in flight at a time, so transfer to GPU, GPU execution, and transfer from GPU are serialized
        – Achieves 14.6 Gbit/s

  21. Accessing the GPU
      • Solution #2: Shared execution by multiple threads
        – Every thread accesses the GPU directly (PCIe link: 64 Gbit/s)
      • Execution flow example: batches from different threads (P1, P2, P3, ...) are pipelined, so transfers and GPU execution overlap
        – Achieves 48.1 Gbit/s (a CUDA-streams sketch follows)
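
A sketch of what solution #2 could look like with CUDA streams: each worker thread owns a stream and queues its own copies and kernel launches, so batches from different threads overlap on the PCIe link and the GPU instead of serializing behind a master thread. The Worker layout and the match_kernel interface are assumptions for illustration, not the paper's exact code.

```cuda
// Host buffers must be pinned (cudaHostAlloc) for the async copies
// below to actually overlap with other streams' work.
#include <cuda_runtime.h>
#include <cstdint>

// DFA matching kernel; its body is shown in the slide-23 sketch below.
__global__ void match_kernel(const uint8_t* batch, const uint32_t* offsets,
                             const uint16_t* lens, int n,
                             const uint32_t* dfa, uint32_t* matches);

struct Worker {
    cudaStream_t stream;              // private stream per CPU thread
    uint8_t  *h_batch,   *d_batch;    // pinned host + device buffers
    uint32_t *h_offsets, *d_offsets;  // per-stream start offsets
    uint16_t *h_lens,    *d_lens;     // per-stream lengths
    uint32_t *h_matches, *d_matches;  // match counts (results)
    uint32_t *d_dfa;                  // read-only DFA, uploaded once
};

void process_batch(Worker& w, int n, size_t batch_bytes) {
    // All operations below are queued on this worker's stream only;
    // other workers' streams make progress concurrently.
    cudaMemcpyAsync(w.d_batch, w.h_batch, batch_bytes,
                    cudaMemcpyHostToDevice, w.stream);
    cudaMemcpyAsync(w.d_offsets, w.h_offsets, n * sizeof(uint32_t),
                    cudaMemcpyHostToDevice, w.stream);
    cudaMemcpyAsync(w.d_lens, w.h_lens, n * sizeof(uint16_t),
                    cudaMemcpyHostToDevice, w.stream);
    match_kernel<<<(n + 127) / 128, 128, 0, w.stream>>>(
        w.d_batch, w.d_offsets, w.d_lens, n, w.d_dfa, w.d_matches);
    cudaMemcpyAsync(w.h_matches, w.d_matches, n * sizeof(uint32_t),
                    cudaMemcpyDeviceToHost, w.stream);
    cudaStreamSynchronize(w.stream);  // waits for this batch only
}
```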

  22. Transferring to the GPU
      [Diagram: CPU core pushes buffers over PCIe; the GPU scans them]
      • Small transfers degrade PCIe throughput
      • Instead, each core batches many reassembled packets into a single buffer (sketched below)
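
A sketch of the batching step: appending packets into one pinned staging buffer amortizes the per-transfer PCIe overhead across a whole batch. The Batch layout and the simple "return false when full" flush policy are illustrative assumptions.

```cuda
#include <cstdint>
#include <cstring>

struct Batch {
    uint8_t*  buf;       // pinned (cudaHostAlloc) staging buffer
    uint32_t* offsets;   // start offset of each packet in buf
    uint16_t* lens;      // length of each packet
    size_t    used, cap; // bytes used / total buffer capacity
    int       count;     // packets accumulated so far
};

// Append one reassembled packet; returns false when the batch is
// full, signaling the caller to flush it to the GPU (one
// cudaMemcpyAsync for the whole buffer, as in the stream sketch
// above) and start refilling.
bool batch_append(Batch& b, const uint8_t* pkt, uint16_t len) {
    if (b.used + len > b.cap) return false;
    b.offsets[b.count] = (uint32_t)b.used;
    b.lens[b.count]    = len;
    b.count++;
    memcpy(b.buf + b.used, pkt, len);
    b.used += len;
    return true;
}
```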

  23. Pattern Matching on GPU
      [Diagram: packet buffer fanned out across GPU cores, producing matches]
      • Work is split uniformly: one GPU core scans each reassembled packet stream (a minimal kernel sketch follows)
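
MIDeA's matching engine is a DFA-based variant of Aho-Corasick; the kernel below is a minimal sketch of that scheme, with the flat state-table layout and the match-count output as simplifying assumptions rather than the paper's exact data structures.

```cuda
// Each GPU thread walks the DFA over one reassembled packet stream,
// consuming one input byte per state transition.
#include <cstdint>

// dfa[state * 256 + byte] holds the next state; the high bit flags
// states at which some pattern ends.
#define MATCH_BIT 0x80000000u

__global__ void match_kernel(const uint8_t* batch, const uint32_t* offsets,
                             const uint16_t* lens, int n,
                             const uint32_t* dfa, uint32_t* matches) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;   // one thread per reassembled stream

    const uint8_t* p = batch + offsets[i];
    uint32_t state = 0, hits = 0;
    for (int j = 0; j < lens[i]; j++) {
        state = dfa[(state & ~MATCH_BIT) * 256 + p[j]];
        if (state & MATCH_BIT) hits++;   // some pattern ends at byte j
    }
    matches[i] = hits;  // e.g., CPU post-processes streams with hits > 0
}
```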

  24. Pipelining CPU and GPU
      [Diagram: CPU alternating between two packet buffers]
      • Double-buffering (sketched below)
        – Each CPU core collects new reassembled packets while the GPUs process the previous batch
        – Effectively hides GPU communication costs
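
A sketch of the per-core double-buffering loop, reusing the hypothetical Worker and Batch types from the earlier sketches; the helper functions are illustrative stand-ins, declared but not defined here.

```cuda
#include <cuda_runtime.h>

struct Worker;  // see the shared-execution sketch (slide 21)
struct Batch;   // see the transfer-batching sketch (slide 22)

void submit_batch_async(Worker&, Batch&);  // async copies + kernel launch
void capture_fill(Batch&);                 // per-core capture + reassembly
void report_matches(Worker&, Batch&);      // alert on streams with hits
cudaStream_t stream_of(Worker&);           // the worker's private stream

void core_loop(Worker& w, Batch* buf[2]) {
    int cur = 0;
    for (;;) {
        // 1. Queue the filled batch on this worker's stream (async).
        submit_batch_async(w, *buf[cur]);

        // 2. While the GPU works on that batch, refill the other one.
        int nxt = 1 - cur;
        capture_fill(*buf[nxt]);

        // 3. Only now wait for the GPU: its transfer and execution
        //    latency is hidden behind the capture work in step 2.
        cudaStreamSynchronize(stream_of(w));
        report_matches(w, *buf[cur]);
        cur = nxt;
    }
}
```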

  25. Recap
      [Diagram, bottom-up: Packets (1-10 Gbit/s) → NIC: demux → Packet streams → Per-flow CPUs: protocol analysis → Reassembled packet streams → Data-parallel GPUs: content matching]

  26. Outline
      • Architecture
      • Implementation
      • Performance Evaluation
      • Conclusions

  27. Setup: Hardware
      [Diagram: two NUMA nodes, each with a CPU, local memory, an IOH, and a GPU; the NIC hangs off one IOH; nodes linked by QPI]
      • NUMA architecture, QuickPath Interconnect

        Component | Model          | Specs
        2 x CPU   | Intel E5520    | 2.27 GHz x 4 cores
        2 x GPU   | NVIDIA GTX480  | 1.4 GHz x 480 cores
        1 x NIC   | Intel 82599EB  | 10 GbE

  28. Pattern Matching Performance
      [Chart: single-GPU pattern-matching throughput (Gbit/s) vs. number of CPU-cores: 14.6 at 1, 26.7 at 2, 42.5 at 4, 48.1 at 8; the curve flattens as it becomes bounded by PCIe capacity]
      • The performance of a single GPU increases as the number of CPU-cores increases

  29. Pattern Matching Performance
      [Chart: same data as the previous slide, plus a second GPU at 8 cores]
      • Adding a second GPU raises pattern-matching throughput to 70.7 Gbit/s

  30. Setup: Network
      [Diagram: traffic generator/replayer → 10 GbE link → MIDeA]

  31. Synthetic traffic
      [Chart: throughput (Gbit/s) vs. packet size for MIDeA and Snort (8 cores): MIDeA 2.4/4.8/7.2 and Snort 1.1/1.5/2.1 at 200/800/1500-byte packets]
      • Randomly generated traffic

  32. Real traffic
      [Chart: MIDeA 5.2 Gbit/s vs. Snort (8 cores) 1.1 Gbit/s]
      • 5.2 Gbit/s with zero packet loss
        – Replayed trace captured at the gateway of a university campus

  33. Summary
      • MIDeA: a multi-parallel network intrusion detection architecture
        – Single-box implementation
        – Based on commodity hardware
        – Less than $1500
      • Operates at 5.2 Gbit/s with zero packet loss
        – 70 Gbit/s pattern-matching throughput

  34. Thank you!
      gvasil@ics.forth.gr
