 
Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters
Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda
Department of Computer Science & Engineering
The Ohio State University
Outline
• Introduction & Motivation
• Problem Statement
• Design
• Performance Evaluation and Results
• Conclusions and Future Work
ICPP '10
Introduction & Motivation
• Supercomputing clusters are growing in size and scale
• MPI is the predominant programming model for HPC
• High performance interconnects like InfiniBand have increased network capacity
• Compute capacity outstrips network capacity with the advent of multi/many core processors
• The imbalance gets aggravated as jobs get assigned to random nodes and share links
[Figure: Fabric Card Switches connected to Line Card Switches]
Analysis of Traffic Pattern in a Supercomputer
• Traffic flow in the Ranger supercomputer at Texas Advanced Computing Center (TACC) shows heavy link sharing
  – http://www.tacc.utexas.edu
[Figure: traffic flow map of Ranger. Courtesy - TACC]

Color of Dot    Description
Green           Network Elements
Black           Line Card Switches
Red             Fabric Card Switches

Color of Line   Number of Streams
Black           1
Blue            2
Red             3 – 4
Orange          5 – 8
Green           > 8
Possible Issue with Link Sharing
• Few communicating peers – No Problem
• Packets get backed up as the number of communicating peers increases
• Results in delayed arrival of packets at the destination
[Figure: two compute nodes connected through a switch]
Frequency Distribution of Inter Arrival Times
• Packet size – 2 KB (results same for 1 KB to 16 KB)
• Inter arrival time is directly proportional to the load on the links
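To make the measurement concrete, the following is a minimal sketch of how a frequency distribution of inter arrival times could be computed from packet arrival timestamps. The function names, the 1 us bin width and the timestamps are all hypothetical and chosen for illustration; they are not taken from the slides or from the modified perftest benchmark.

```python
from collections import Counter


def inter_arrival_times(arrivals_us):
    """Return the gaps between consecutive arrival timestamps (microseconds)."""
    return [later - earlier for earlier, later in zip(arrivals_us, arrivals_us[1:])]


def frequency_distribution(gaps_us, bin_width_us=1.0):
    """Count how many gaps fall into each bin; keys are the lower bin edges."""
    counts = Counter(int(gap // bin_width_us) for gap in gaps_us)
    return {bin_index * bin_width_us: counts[bin_index] for bin_index in sorted(counts)}


# Hypothetical timestamps, NOT measurements from the talk.
arrivals = [0.0, 1.0, 2.0, 3.5, 4.5, 8.0]
gaps = inter_arrival_times(arrivals)
histogram = frequency_distribution(gaps, bin_width_us=1.0)
print(histogram)
```

A narrow histogram means packets arrive at a steady pace, while a long tail indicates packets that were held up behind other traffic on a shared link.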
Introduction & Motivation (Cont)
Can modern networks like InfiniBand alleviate this?
InfiniBand Architecture
• An industry standard for low latency, high bandwidth System Area Networks
• Multiple features
  – Two communication types
    • Channel Semantics
    • Memory Semantics (RDMA mechanism)
  – Queue Pair (QP) based communication
  – Quality of Service (QoS) support
  – Multiple Virtual Lanes (VL)
  – QPs associated to VLs by means of pre-specified Service Levels
• Multiple communication speeds available for Host Channel Adapters (HCA)
  – 10 Gbps (SDR) / 20 Gbps (DDR) / 40 Gbps (QDR)
InfiniBand Network Buffer Architecture
• Buffers in most IB HCAs and switches are grouped into two
  – Common Buffer Pool and,
  – Private VL buffers
• Most current generation MPIs only use one VL
• Inefficient use of available network resources
• Why not use more VLs?
• Possible con – Would it take more time to poll all the VLs?
[Figure: InfiniBand Host Channel Adapter (HCA) with a Common Buffer Pool, Virtual Lanes 0 to 15, a Virtual Lane Arbiter and the Physical Link]
Problem Statement
• Can multiple virtual lanes be used to improve the performance of HPC applications?
• How can we integrate this design into an MPI library so that end applications benefit?
Proposed Framework and Goals
• No change to application
• Re-design MPI library to use multiple VLs
• Need new methods to take advantage of multiple VLs
  – Traffic Distribution
    • Load balance traffic across multiple VLs
  – Traffic Segregation
    • Ensure one kind of traffic does not disturb other
    • Distinguish between
      – Low & High priority traffic
      – Small & Large messages
[Figure: Job Scheduler and Application on top of the MPI Library (Traffic Distribution, Traffic Segregation), which sits on the InfiniBand Network]
Proposed Design
• Re-design MPI library to use multiple VLs
  – Multiple Virtual Lanes configured with different characteristics
    • Transmit fewer packets at high priority
    • Transmit more packets at lower priority etc
  – Multiple Service Levels (SL) defined to match VLs
  – Queue Pairs (QPs) assigned proper SLs at QP creation time
• Multiple ways to assign Service Levels to applications
  – Assign SLs with similar characteristics in a round robin fashion
    • Traffic Distribution
  – Assign SLs with desired characteristic based on type of application
    • Traffic Segregation
  – Other designs being explored
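The two assignment policies can be illustrated with a short sketch. This is not the MVAPICH2 implementation: the class, the function and the traffic class names below are hypothetical, and the code only models which Service Level number each new QP would receive. In libibverbs the chosen value would end up in the `sl` field of `struct ibv_ah_attr`.

```python
from itertools import cycle


class RoundRobinAssigner:
    """Traffic distribution: hand out Service Levels to new QPs in rotation."""

    def __init__(self, service_levels):
        self._cycle = cycle(service_levels)

    def sl_for_new_qp(self):
        return next(self._cycle)


# Traffic segregation: hypothetical traffic classes mapped to dedicated SLs.
SL_BY_TRAFFIC_CLASS = {
    "small_high_priority": 0,
    "large_low_priority": 1,
}


def sl_for_traffic_class(traffic_class):
    """Return the SL reserved for a traffic class (KeyError if unknown)."""
    return SL_BY_TRAFFIC_CLASS[traffic_class]


# Ten QPs spread over eight SLs, assuming SLs 0..7 map to eight VLs.
assigner = RoundRobinAssigner(range(8))
qp_sls = [assigner.sl_for_new_qp() for _ in range(10)]
print(qp_sls)
print(sl_for_traffic_class("large_low_priority"))
```

Round robin keeps the number of QPs per lane within one of each other, which is the load balancing goal of Traffic Distribution. The lookup table keeps unlike traffic on separate lanes, which is the goal of Traffic Segregation.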
Proposed Design (Cont)
[Figure: Job Scheduler, Application and MPI Library; Service Levels mapped to Virtual Lanes 0 to 15; the Virtual Lane Arbiter feeds the Physical Link]
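The role of the Virtual Lane Arbiter can be sketched as a weighted round robin over per-lane queues. This is a simplified model under stated assumptions: the function name and the packet labels are hypothetical, the real InfiniBand arbiter works from priority tables of weighted entries, and here a single table counts whole packets.

```python
from collections import deque


def arbitrate(vl_queues, weights):
    """Drain per-VL queues with weighted round robin and return the send order.

    vl_queues: dict mapping VL number to a list of queued packets.
    weights:   list of (vl, packets_per_turn) entries, visited in order.
    """
    queues = {vl: deque(packets) for vl, packets in vl_queues.items()}
    served = {vl for vl, weight in weights if weight > 0}
    unserved = [vl for vl, queue in queues.items() if queue and vl not in served]
    if unserved:
        raise ValueError(f"VLs with traffic but no positive weight: {unserved}")

    order = []
    while any(queues.values()):
        for vl, weight in weights:
            queue = queues.get(vl)
            for _ in range(weight):
                if not queue:
                    break  # lane is empty (or unused); move to the next entry
                order.append(queue.popleft())
    return order


# Hypothetical packets: VL 0 may send two packets per turn, VL 1 one packet.
send_order = arbitrate(
    {0: ["a0", "a1", "a2"], 1: ["b0", "b1", "b2"]},
    weights=[(0, 2), (1, 1)],
)
print(send_order)
```

Because every lane with a positive weight gets a turn in each pass, a burst on one lane cannot indefinitely delay packets queued on another lane.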
Experimental Testbed
• Compute platforms – Intel Nehalem
  – Intel Xeon E5530 dual quad-core processors operating at 2.40 GHz
  – 12 GB RAM, 8 MB cache
  – PCIe 2.0 interface
• Network equipment
  – MT26428 QDR ConnectX HCAs
  – 36-port Mellanox QDR switch used to connect all the nodes
• Red Hat Enterprise Linux Server release 5.3 (Tikanga)
• OFED-1.4.2
• OpenSM version 3.1.6
• Benchmarks
  – Modified version of OFED perftest for verbs level tests
  – MPIBench collective benchmark
  – CPMD used for application level evaluation
MVAPICH / MVAPICH2 Software
• High Performance MPI Library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Used by more than 1255 organizations in 59 countries
  – More than 44,500 downloads from OSU site directly
  – Empowering many TOP500 clusters
    • 11th ranked 62,976-core cluster (Ranger) at TACC
  – Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED)
  – Also supports uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
Verbs Level Performance
• Tests use 8 communicating pairs
• One QP per pair
• Packet size – 2 KB
• Results same for 1 KB to 16 KB
• Traffic distribution using multiple VLs results in more predictable inter arrival time
• Slight increase in average latency
MPI Level Point to Point Performance
[Figure: three panels comparing 1-VL and 8-VLs for message sizes from 1K to 64K bytes: Bandwidth (MBps), Latency (us) and Message Rate (in Millions)]
• Tests use 8 communicating pairs
• One QP per pair
• Traffic distribution using multiple VLs results in better overall performance
• 13% performance improvement over case with just one VL
MPI Level Collective Performance
[Figure: two panels of Latency (us) against Message Size (Bytes) from 1K to 16K. Traffic Distribution: 1-VL and 8-VLs. Traffic Segregation: 2 Alltoalls (No Segregation), 2 Alltoalls (Segregation) and 1 Alltoall]
• For 64 process Alltoall, Traffic Distribution and Traffic Segregation through use of multiple VLs result in better performance
• 20% performance improvement seen with Traffic Distribution
• 12% performance improvement seen with Traffic Segregation
Application Level Performance
[Figure: Normalized Time for Total Time and Time in Alltoall, comparing 1 VL and 8 VLs]
• CPMD application
• 64 processes
• Traffic Distribution through use of multiple VLs results in better performance
• 11% performance improvement in Alltoall performance
• 6% improvement in overall performance
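The two CPMD percentages can be related with a back-of-the-envelope check. This is an inference and not a result reported in the talk: it assumes that only the Alltoall portion of the run gets faster, and the symbol f (the fraction of the 1 VL run time spent in Alltoall) is introduced here for illustration.

```latex
\[
\frac{T_{8\,\mathrm{VL}}}{T_{1\,\mathrm{VL}}}
  = (1 - f) + f\,(1 - 0.11)
  = 1 - 0.11\,f
\]
\[
1 - 0.11\,f = 1 - 0.06
  \;\Longrightarrow\;
  f = \frac{0.06}{0.11} \approx 0.55
\]
```

Under that assumption, the reported numbers would be consistent with CPMD spending roughly half of its run time in Alltoall.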
Conclusions & Future Work
• Explored the use of Virtual Lanes to improve predictability and performance of HPC applications
• Integrated our scheme into the MVAPICH2 MPI library and conducted performance evaluations at various levels
• Consistent increase in performance at verbs, MPI and application level evaluations
• Future work: explore advanced schemes to improve performance using multiple virtual lanes
• Proposed solution will be available in future MVAPICH2 releases
Thank you!
{subramon, laipi, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu/