 
Chair of Network Architectures and Services
Department of Informatics
Technical University of Munich

Comparison of Queuing Data Structures for Traffic Analysers

Maximilian Pudelko
August 3, 2016

Contents
• Motivation
• Existing Data Structures and Found Problems
• Analysis
• Implementation of QQ
• Evaluation of QQ
• Conclusion/Future work

Motivation

Level        Speed
NIC          40 Gbit/s
RAM (copy)   225 Gbit/s
tcpdump      up to 5.9 Gbit/s [1]
SSD          3.6 Gbit/s (400 MB/s)

Table 1: Common throughput values

• The speed gap between NICs and storage devices is widening
• Capturing becomes more complex (special drivers, filtering, ...)
• Exploiting multi-core architectures is a must
• An efficient data structure is required

Tested Data Structures
• Cameron: ReaderWriterQueue and ConcurrentQueue
• Folly: ProducerConsumerQueue and MPMCQueue
• DPDK: rte_ring + rte_mbuf

Features:
• Lock-free (atomic, CAS) in critical path
• Optimized for low latency
• Generic (RW, CQ, PCQ, MPMC), packet structure required
• Specialized for packet processing (DPDK)

Found Problems - Throughput

[Figure 1: Throughput comparison with 64 byte packets, 4 inserters. Bar chart of packet rate [Mpps] and data rate [Gbit/s] per data structure (RWQ, PCQ, CQ, MPMC, rte_ring), for individual allocation and message buffers.]

Per-packet allocation, malloc(packet_len):
• Synchronized across all threads, slow
• Memory reuse not possible
• Leads to heap fragmentation (TLB pollution)
• Specialized implementations exist, out of scope

Found Problems - Throughput

Message buffers (see Figure 1):
• Fixed-size chunks allocated in bulk
• Chained together to accommodate larger packets
• Not implemented by general-purpose queues
• Space overhead from partially used buffers
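
The message buffer idea can be illustrated with a toy pool. This is only a sketch under assumptions: the class `MessageBufferPool`, its method names and the 256 byte buffer size are hypothetical and are not taken from DPDK or from any of the tested queues.

```python
class MessageBufferPool:
    """Toy pool of fixed-size buffers carved out of one bulk allocation."""

    def __init__(self, buf_size, count):
        self.buf_size = buf_size
        self.memory = bytearray(buf_size * count)   # single bulk allocation
        self.free = list(range(count - 1, -1, -1))  # indices of unused buffers

    def store(self, packet):
        """Copy a packet into as many chained buffers as it needs."""
        needed = max(1, -(-len(packet) // self.buf_size))  # ceiling division
        if needed > len(self.free):
            raise MemoryError("pool exhausted")
        chain = [self.free.pop() for _ in range(needed)]
        for n, idx in enumerate(chain):
            part = packet[n * self.buf_size:(n + 1) * self.buf_size]
            start = idx * self.buf_size
            self.memory[start:start + len(part)] = part
        return chain

    def load(self, chain, length):
        """Reassemble a packet of the given length from its buffer chain."""
        size = self.buf_size
        data = b"".join(bytes(self.memory[i * size:(i + 1) * size]) for i in chain)
        return data[:length]

    def overhead(self, chain, length):
        """Bytes reserved by the chain but not used by the packet."""
        return len(chain) * self.buf_size - length

    def release(self, chain):
        """Return the buffers of a chain to the pool for reuse."""
        self.free.extend(chain)
```

With 256 byte buffers, a 64 byte packet occupies one buffer and leaves 192 bytes unused, while a 1514 byte packet needs a chain of six buffers.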

Found Problems - DPDK Limitations
• Memory pool capacity limit: 2^27 ≈ 134 M slots
• Only 2.68 s of traffic at 50 Mpps
• Space overhead of > 50 B per packet
• Not compatible with regular threads
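
The capacity numbers follow from simple arithmetic, sketched here. The function names are illustrative, and the overhead figure is only a lower bound that assumes exactly 50 B per packet and a completely filled pool.

```python
POOL_SLOTS = 2 ** 27  # memory pool capacity limit stated on the slide


def seconds_of_traffic(slots, packets_per_second):
    """How long a pool lasts if every packet occupies one slot."""
    return slots / packets_per_second


def overhead_gib(slots, overhead_bytes_per_packet):
    """Total per-packet overhead of a full pool, in GiB."""
    return slots * overhead_bytes_per_packet / 2 ** 30


capture_window = seconds_of_traffic(POOL_SLOTS, 50e6)  # about 2.68 s at 50 Mpps
minimum_overhead = overhead_gib(POOL_SLOTS, 50)        # 6.25 GiB at 50 B per packet
```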

Found Problems - Missing Special Features
• Random access to packets
• Roles: Peek access for analysers
• Time coherent storing, even distribution at low rates
• Option to dump to persistent storage

Analysis

A suitable data structure must ...
1. ... handle 10 Gbit/s single and 40 Gbit/s combined.
2. ... store a large number of packets.
3. ... provide the special features.
4. ... scale on multi-core architectures.

Approach/Implementation - Overview

[Figure 2: QQ overview. Diagram labels: three NICs, each with an Inserter calling enqueue() on QQ; an Analyser with peek() access that issues signal(filter, <range>); a Dumper/Writer calling dequeue() and writing pcap to HDD.]

Approach/Implementation - Outer Queue
• 2-Layer model: Queue of Queues
• One continuous memory block for packet data
• Allocated at initialization via transparent hugepages
• Segmented into equally sized chunks for independent inner queues
• Threads request references to inner queues
• Operations guarded by mutex
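
A minimal sketch of the outer layer, assuming a simple free list and a filled list. The class and method names (`OuterQueue`, `acquire_empty`, `submit_filled`, `take_filled`, `give_back`) are hypothetical and do not reproduce the QQ API, and hugepage allocation is not modelled.

```python
import threading
from collections import deque


class OuterQueue:
    """Toy outer queue: one memory block split into equally sized chunks."""

    def __init__(self, num_chunks, chunk_size):
        self._memory = bytearray(num_chunks * chunk_size)  # one contiguous block
        view = memoryview(self._memory)
        # Each chunk is a zero-copy window into the shared block.
        self._free = deque(
            view[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)
        )
        self._filled = deque()
        self._lock = threading.Lock()  # guards every outer-queue operation

    def acquire_empty(self):
        """Hand an unused chunk to an inserter, or None if none is left."""
        with self._lock:
            return self._free.popleft() if self._free else None

    def submit_filled(self, chunk):
        """Queue a filled chunk for consumers, in arrival order."""
        with self._lock:
            self._filled.append(chunk)

    def take_filled(self):
        """Hand the oldest filled chunk to a consumer, or None if empty."""
        with self._lock:
            return self._filled.popleft() if self._filled else None

    def give_back(self, chunk):
        """Return a processed chunk to the free list for reuse."""
        with self._lock:
            self._free.append(chunk)
```

Threads only exchange references to chunks under the lock; the packet bytes themselves stay in the shared block and are never copied between queues.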

Approach/Implementation - Inner Queue

[Figure 3: Inner queue structure. Diagram labels: offset vector (0x0000, 0x0060, ...), timer, mutex, head and base pointers, and the granted memory block holding packet 0, packet 1, ..., each preceded by its header.]

• Separate mutex, "lock once, store often"
• In-place storage of packets in granted block
• Hold timer to prevent monopolization
• API similar to std::vector
• Direct access possible, management via wrapper preferred
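
The inner queue layout (offset vector, head position, in-place storage, one lock held across many stores) could look roughly like this. It is a sketch only: the names `InnerQueue`, `store`, `get_packet` and `hold_expired`, and the 1 ms hold time, are assumptions rather than the actual implementation.

```python
import threading
import time


class InnerQueue:
    """Toy inner queue storing packets back to back in a granted block."""

    def __init__(self, block, hold_time=0.001):
        self._block = memoryview(block)  # granted memory block
        self._offsets = []               # offset vector: start of each packet
        self._head = 0                   # next free byte in the block
        self._mutex = threading.Lock()
        self._hold_time = hold_time
        self._deadline = 0.0

    def __enter__(self):
        # "Lock once, store often": take the mutex once, then store many packets.
        self._mutex.acquire()
        self._deadline = time.monotonic() + self._hold_time
        return self

    def __exit__(self, *exc_info):
        self._mutex.release()
        return False

    def store(self, packet):
        """Copy a packet in place; return False if the block is full."""
        end = self._head + len(packet)
        if end > len(self._block):
            return False
        self._block[self._head:end] = packet
        self._offsets.append(self._head)
        self._head = end
        return True

    def size(self):
        return len(self._offsets)

    def get_packet(self, index):
        """Random access to a stored packet as a zero-copy view."""
        if not 0 <= index < len(self._offsets):
            raise IndexError(index)
        start = self._offsets[index]
        last = index + 1 == len(self._offsets)
        end = self._head if last else self._offsets[index + 1]
        return self._block[start:end]

    def hold_expired(self):
        """True once the holder should give the queue back."""
        return time.monotonic() >= self._deadline
```

An inserter would keep storing inside one `with` block until `store` returns False or `hold_expired` reports that the hold timer has run out.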

Approach/Implementation - Packet Header
• Prepended to the actual packet
• POD struct, compatible with FFIs
• Stores metadata: timestamp, VLAN, packet length
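
A fixed-layout header of this kind can be mirrored through an FFI, shown here with Python's `ctypes`. The field widths, the field order and the `reserved` padding field are assumptions chosen for illustration; the slide only names timestamp, VLAN and packet length.

```python
import ctypes


class PacketHeader(ctypes.Structure):
    """Hypothetical 16 byte POD header (field widths are assumed)."""

    _fields_ = [
        ("timestamp", ctypes.c_uint64),
        ("vlan", ctypes.c_uint16),
        ("length", ctypes.c_uint16),
        ("reserved", ctypes.c_uint32),  # explicit padding, no hidden gaps
    ]


def prepend_header(payload, timestamp, vlan=0):
    """Build one stored record: header bytes followed by the packet bytes."""
    header = PacketHeader(timestamp, vlan, len(payload), 0)
    return bytes(header) + payload


def split_record(record):
    """Recover the header and the packet bytes from a stored record."""
    header = PacketHeader.from_buffer_copy(record)
    start = ctypes.sizeof(PacketHeader)
    return header, record[start:start + header.length]
```

Because the struct contains only plain integer fields, the same bytes can be read from C, LuaJIT's FFI or Python without a conversion step.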

Evaluation - Throughput

[Figure 4: Theoretical QQ throughput compared to STREAM. Rate [Gbit/s] over number of inputs (1 to 5) for packet sizes 64 B, 128 B, 256 B, 512 B, 1024 B and 1514 B, with STREAM as reference.]

Evaluation - Influence of Inner Queue Capacity

[Figure 5: Single inner queue throughput with 64 byte packets. Throughput [Gbit/s] over inner queue capacity [MiB], from 2 to 128 MiB.]

• Less impact than expected
• Caching benefits offset by large amount of data

Evaluation - Packet Delay

$$\mathrm{delay}_{\min} = \min\left(\frac{\text{Storage size}}{r_{\max}},\ \text{timeout}\right) \cdot \#\text{Inputs}$$

Storage size              2 MiB      4 MiB      32 MiB
10 Gbit/s & 2 inserters   3.335 ms   6.711 ms   53.69 ms
40 Gbit/s & 4 inserters   1.667 ms   3.335 ms   26.84 ms

Table 2: Minimum packet traversal time at various configurations
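
The delay formula can be evaluated directly. This sketch assumes that 1 MiB is 2^20 bytes, that r_max is the combined line rate, and that the timeout is not the limiting term; the function name and parameters are illustrative.

```python
def min_delay_seconds(storage_bytes, rate_bits_per_s, inputs, timeout_s=float("inf")):
    """Minimum traversal time: fill time (capped by the timeout) per input."""
    fill_time = storage_bytes * 8 / rate_bits_per_s
    return min(fill_time, timeout_s) * inputs


MIB = 2 ** 20

delay_4mib_10g = min_delay_seconds(4 * MIB, 10e9, 2)    # about 6.711 ms
delay_32mib_10g = min_delay_seconds(32 * MIB, 10e9, 2)  # about 53.69 ms
delay_32mib_40g = min_delay_seconds(32 * MIB, 40e9, 4)  # about 26.84 ms
```

Under these assumptions the sketch matches the 6.711 ms, 53.69 ms and 26.84 ms entries of Table 2. The remaining three entries come out slightly higher than the table states, at about 3.355 ms (2 MiB at 10 Gbit/s and 4 MiB at 40 Gbit/s) and 1.678 ms (2 MiB at 40 Gbit/s).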

Conclusion/Future work
• > 130 Gbit/s and > 170 Mpps
• 2.3× to 100× more than existing data structures
• Delay still within low millisecond range
• Future work: Replacement of handwritten filter functions with JIT compiled filter expressions (BPF, tcpdump -ni )

Bibliography
[1] A. Brown. I3: Maximizing packet capture performance. Wireshark Developer and User Conference, 2014.

    function dumpTask(qq, path)
        local pcap_writer = pcapLib.create_pcap_writer(path)
        while mg.running() do
            -- Signaling with analyzer omitted
            local storage = qq:dequeue() -- QQ API call
            for i = 0, tonumber(storage:size()) - 1 do -- loop over al
                local pkt = storage:getPacket(i)
                local udpPkt = pktLib.getUdpPacket(pkt) -- MoonGens p
                if udpPkt.ip4:getSrc() == ip_match then
                    pcap_writer:store(pkt:getTimestamp(), pkt:getLength(), pkt.data)
                end
            end
            storage:release() -- explicit release at end of loop
        end
    end

Listing 1: Dump task