MCC: a Predictable and Scalable Massive Client Load Generator



  1. Nov 14-16, 2019 // Denver, Colorado, USA. MCC: a Predictable and Scalable Massive Client Load Generator. Wenqing Wu*, Xiao Feng*, Wenli Zhang, Mingyu Chen. 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Peng Cheng Laboratory, Shenzhen, China

  2. Outline: I. Background/Motivation II. Design III. Evaluation IV. Conclusion

  3. Background. A load generator drives traffic into a DUT (Device under Test), e.g. data-center or IoT systems. Stateless generators open no connections and send uni-directional traffic; they support bandwidth and packet-loss tests. Stateful generators are TCP connection oriented and bi-directional; they additionally support latency analysis, NAT/firewall tests, and application interaction. The network load to simulate should be stateful, tremendous in scale, and follow various distributions.

  4. State of the Art: hardware-based load generators. OSNT (NetFPGA, $2,000); Ixia and Spirent (specialized devices, > $100,000). Pros: precise and accurate, high throughput. Cons: stateless, inflexible (firmware, not open source), expensive.

  5. State of the Art: software-based load generators.
  - trafgen: packet generator based on AF_PACKET (stateless).
  - MoonGen: packet generator fueled by DPDK (stateless).
  - D-ITG: a distributed, multiplatform framework (stateless).
  - Iperf: bandwidth and jitter analysis (stateful); closed-loop, no concurrency support.
  - wrk: HTTP benchmarking tool (stateful); limited throughput, poor scalability.
  Pros: cheap (run on commodity hardware); some are flexible (open source, easy to add new features). Cons: stateful generators cannot achieve microsecond precision and show poor scalability.

  6. Imprecision in stateful load generation. Why are software-based stateful load generators less precise?
  - Scheduling policies in the Operating System (OS): e.g., sleep() does not guarantee one-microsecond precision.
  - POSIX blocking I/O interfaces: e.g., select() introduces at least 20 µs of deviation to timed tasks.
  - Heavy kernel protocol stack: uncertain stack processing time poisons the precision of timed operations running in the application layer.

  7. MCC (Massive Client Connections). Design goal: a predictable and scalable massive client load generator.
  - Stateful: TCP connection oriented.
  - Predictable load generation: a two-stage timer mechanism achieves one-microsecond precision.
  - High throughput while performing flow-level simulation: a lightweight protocol stack simplifies packet processing; high-speed I/O via kernel-bypass techniques.
  - Scalability on multi-core systems: shared-nothing multi-threaded model.

  8. MCC Overview. The load generation model of MCC:
  - Load modeling in the application layer: (1) data modeling adjusts packet sizes and adds timestamps; (2) load modeling adjusts packet inter-departure times.
  - Reactor NIO pattern with a connection manager and app timer in the control logic.
  - Built on a user-level TCP/IP stack.
  - Customized I/O layer with controllable I/O above the device driver.
  - Two-stage timer mechanism controls packet I/O precisely.

  9. User-level load generator. MCC runs fully in user space, so the full path of load generation can be optimized, and an I/O thread is added to achieve precise control. Kernel-based solutions run the load-generation application over the socket API, kernel stack, and device driver; in MCC's approach, the load-generation thread uses a stateful socket-like API over a user-level stack thread, and a stateless I/O thread with a customized I/O layer performs packet I/O through the DPDK I/O library on top of the device driver.

  10. Two-stage timer. The two-stage timer helps generate predictable load.
  - App timer: flow-level; the connection manager uses it to control the send() operation in the load-generation thread.
  - I/O timer: packet-level; controls the xmit() operation according to the timestamp added in the application layer.
  Steps: (1) initialize the load-generation thread; (2) initialize the I/O thread; (3) send data, with timestamp, according to the app timer; (4) the stack thread encapsulates each packet and enqueues it into the register queue (RQ); (5) transmit the packet via the DPDK I/O library according to the I/O timer.

  11. App timer: a polling-based application-layer timer that avoids the imprecision resulting from OS scheduling policies.
  - Structures: a timed task is a 2-tuple (timestamp, function); the task set is stored in an RB-tree (fast insertion/deletion).
  - User API: sched(ts, func) adds a timed task from the non-blocking event loop.
  - Steps: I. Register (add a timed task); II. Trigger timeout (polling check); III. Execute (run the callback function).
  Achieves one-microsecond precision.

  12. I/O timer: a novel timer added in the customized I/O layer that eliminates the timing error introduced by the protocol stack (tens of microseconds).
  - A dedicated I/O thread and a lockless register queue (RQ) sit between the stack thread (with its send buffer) and the DPDK ring queue feeding the NIC.
  - Steps: I. insert the encapsulated packet into the RQ; II. polling check the RQ; III. send the packet out at the specified time.
  Achieves one-microsecond precision.
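A minimal sketch of the RQ-and-polling idea, with illustrative names (`pkt`, `rq_push`, `io_poll`; the `xmit` stub stands in for DPDK's TX path): the stack thread enqueues packets tagged with a transmit timestamp, and the I/O thread busy-polls the queue head, releasing each packet only once its time arrives.

```c
#include <stdbool.h>
#include <stdint.h>

/* A packet carrying the transmit timestamp set in the application layer. */
typedef struct {
    uint64_t xmit_ts;  /* microseconds */
    int      payload;  /* stand-in for packet data */
} pkt;

#define RQ_CAP 256
static pkt rq[RQ_CAP];
static unsigned rq_head, rq_tail;  /* single producer, single consumer */

/* Producer side (stack thread): enqueue, fail if full. */
static bool rq_push(pkt p) {
    if ((rq_tail + 1) % RQ_CAP == rq_head)
        return false;
    rq[rq_tail] = p;
    rq_tail = (rq_tail + 1) % RQ_CAP;
    return true;
}

/* Stand-in for the DPDK transmit call. */
static int sent;
static void xmit(pkt p) { (void)p; sent++; }

/* Consumer side (I/O thread): packets arrive in timestamp order, so
 * only the head needs checking; transmit every packet that is due. */
static int io_poll(uint64_t now) {
    int n = 0;
    while (rq_head != rq_tail && rq[rq_head].xmit_ts <= now) {
        xmit(rq[rq_head]);
        rq_head = (rq_head + 1) % RQ_CAP;
        n++;
    }
    return n;
}
```

A real single-producer/single-consumer queue shared between two threads would also need the appropriate memory-ordering guarantees (e.g. C11 acquire/release atomics on head and tail); they are omitted here for brevity.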

  13. Scalable Multi-threaded Model: a shared-nothing multi-threaded model.
  - Per-core threading: each core runs a worker (a load-generation thread plus a stack thread) with its own listening queue and file descriptors, in a run-to-completion model.
  - Core affinity pins each worker to its core.
  - RSS (Receive-Side Scaling) spreads the NIC's receive queues across cores via DPDK.
  - A distributor thread parses and distributes the configuration and aggregates statistics.
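Core affinity on Linux is set per thread; a small sketch of how a worker might be pinned (the `pin_to_core` helper is illustrative, not MCC's code):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the given thread to a single CPU core, as in a per-core,
 * shared-nothing worker model. Returns 0 on success. */
static int pin_to_core(pthread_t t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}
```

Pinning keeps each worker's cache-hot state (listening queue, file descriptors, stack data) on one core, which is what makes the run-to-completion, shared-nothing model pay off.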

  14. Scalable Multi-threaded Model: message passing between the Distributor and the Workers.
  - Avoids synchronization primitives (locks, memory barriers, atomic operations, ...).
  - Easy to extend to multiple Workers.
  - The Distributor pushes into a task queue that each Worker pulls from; Workers push statistics and state into a result queue that the Distributor pulls from.

  15. Evaluation

  16. Experimental Setup.
  - Machines (client/server): CPU: Intel(R) Xeon(R) E5645, 12 cores @ 2.40 GHz; Memory: 96 GB; NIC: Intel 82599ES 10 Gb Ethernet Adapter.
  - Experiments: microbenchmark (precision with different timers); predictable load generation; throughput & scalability.

  17. Precision with different timers. The timers in MCC deliver one-microsecond precision. Baseline: sleep() + Linux kernel stack ("Linux" in the table indicates the Linux kernel stack). [Table: precision of the load generator when generating constant bit rate (CBR) traffic.]

  18. Predictable Load Generation. MCC generates traffic following the analytical Poisson distribution: the measured packet-interval densities, both with the app timer alone and with the app timer + I/O timer, match the analytic curve. [Figure: probability density of packet intervals (µs), Poisson traffic generation.]

  19. Throughput & Scalability. MCC achieves 2.4x-3.5x higher throughput than wrk (a kernel-based HTTP benchmark) and scales almost linearly before reaching line rate. [Figure: HTTP load generation (file size: 64 B); requests per second (x10^5) for 1/2/4/6/8/10 CPU cores: wrk 1.5/2.4/4.75/6.3/6.96/7.65 vs. MCC 3.56/8.6/13.87/19.27/28.57/30.83.]

  20. Conclusion. MCC: a predictable and scalable massive client load generator.
  - Predictable load generation: two-stage timer mechanism.
  - High throughput: lightweight user-level stack; kernel bypass.
  - Scalability on multi-core systems: shared-nothing multi-threaded model.

  21. Please feel free to email us at wuwenqing@ict.ac.cn if you have any questions.
