Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters
Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda
Department of Computer Science & Engineering
ICPP '10
Clusters are growing in size and scale
MPI is a widely used programming model for HPC
Interconnects like InfiniBand have increased network capacity
Demand on the network rises with the advent of multi/many-core processors
Jobs get assigned to random nodes and share links
[Figure: network topology showing line card switches and fabric card switches. Courtesy of TACC]

Color    Number of Streams
Black    1
Blue     2
Red      3 – 4
Orange   5 – 8
Green    > 8

Color of Dot   Description
Green          Network Elements
Black          Line Card Switches
Red            Fabric Card Switches
Traffic at the Texas Advanced Computing Center (TACC) shows heavy link sharing
– http://www.tacc.utexas.edu
[Figure: a switch and two compute nodes]
– Two communication types
– Queue Pair (QP) based communication
– Quality of Service (QoS) support
– Multiple Virtual Lanes (VLs)
– QPs associated with VLs by means of pre-specified Service Levels (SLs)
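The QP-to-VL association described above can be pictured as a two-step lookup: a QP carries a Service Level, and a table maps that Service Level to a Virtual Lane. The sketch below is only a toy model of that idea. The `QueuePair` class, the `sl2vl` table and the choice of folding 16 SLs onto 8 VLs are all assumptions made for illustration, not the actual fabric configuration or any verbs API.

```python
# Toy model of the SL -> VL lookup. All names and values are hypothetical.
NUM_SLS = 16       # InfiniBand Service Levels are numbered 0..15
NUM_DATA_VLS = 8   # assumption: 8 data VLs are configured

# Hypothetical SL-to-VL table: fold the 16 SLs onto the 8 VLs.
sl2vl = {sl: sl % NUM_DATA_VLS for sl in range(NUM_SLS)}


class QueuePair:
    """Stand-in for a QP; it only records the SL chosen at creation time."""

    def __init__(self, qp_num, service_level):
        if service_level not in sl2vl:
            raise ValueError(f"unknown service level: {service_level}")
        self.qp_num = qp_num
        self.service_level = service_level


def virtual_lane_for(qp):
    """Return the VL that traffic from this QP would use under sl2vl."""
    return sl2vl[qp.service_level]


if __name__ == "__main__":
    for qp in (QueuePair(1, 0), QueuePair(2, 9)):
        print(qp.qp_num, qp.service_level, virtual_lane_for(qp))
```

Because the SL is fixed when the QP is created, choosing a different VL for a flow in this model means creating its QP with a different SL.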
[Figure: a common buffer pool and Virtual Lanes 0, 1, ..., 15 feeding a virtual lane arbiter onto the physical link]
Buffers in HCAs and switches are grouped into two classes
– Common buffer pool
– Private VL buffers
MPI libraries use only one VL, which leaves network resources underused
– Would it take more time to poll all the VLs?
InfiniBand Host Channel Adapter (HCA)
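To make the role of the virtual lane arbiter more concrete, here is a minimal sketch of weighted round-robin arbitration over per-VL queues. It is a simplification under stated assumptions: a single weight table, weights counted in packets, and the function name `arbitrate` are all hypothetical, and real hardware arbitration has more detail than this.

```python
from collections import deque


def arbitrate(queues, weights):
    """Return the (vl, packet) send order under weighted round robin.

    queues:  dict mapping VL number -> deque of pending packets
    weights: dict mapping VL number -> packets allowed per turn
             (hypothetical unit; a missing entry counts as 1)
    """
    order = []
    # Keep cycling over the VLs until every queue is drained.
    while any(queues.values()):
        for vl in sorted(queues):
            for _ in range(weights.get(vl, 1)):
                if not queues[vl]:
                    break
                order.append((vl, queues[vl].popleft()))
    return order


if __name__ == "__main__":
    pending = {0: deque(["a0", "a1", "a2"]), 1: deque(["b0"])}
    print(arbitrate(pending, {0: 2, 1: 1}))
```

In this toy model an empty VL costs the arbiter only one check per cycle, which is one way to think about the polling question raised above.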
Use multiple VLs in the MPI library
Two ways to take advantage of multiple VLs
– Traffic Distribution: spread traffic across multiple VLs
– Traffic Segregation: one kind of traffic does not disturb other kinds
– Low & high priority traffic
– Small & large messages
[Figure: design overview with Application, Job Scheduler, MPI Library (Traffic Segregation, Traffic Distribution) and InfiniBand Network]
– Multiple Virtual Lanes configured with different characteristics
– Multiple Service Levels (SLs) defined to match VLs
– Queue Pairs (QPs) assigned proper SLs at QP creation time
– Traffic Distribution: assign SLs with similar characteristics in a round-robin fashion
– Traffic Segregation: assign SLs with the desired characteristics based on the type of application
– Other designs being explored
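The two assignment policies listed above can be sketched in a few lines. This is an illustrative model only: the class name `RoundRobinSLAssigner`, the traffic class labels and the SL numbers are hypothetical, and the sketch says nothing about how the MPI library actually creates QPs.

```python
from itertools import cycle


class RoundRobinSLAssigner:
    """Round-robin policy: hand out equivalent SLs in turn, one per new QP."""

    def __init__(self, service_levels):
        levels = list(service_levels)
        if not levels:
            raise ValueError("at least one service level is required")
        self._levels = cycle(levels)

    def sl_for_new_qp(self):
        return next(self._levels)


# Class-based policy: one SL per traffic class.
# Both the class names and the SL numbers are assumptions for illustration.
SL_BY_TRAFFIC_CLASS = {"high_priority": 0, "low_priority": 1}


def sl_for_traffic_class(traffic_class):
    """Return the SL reserved for the given (hypothetical) traffic class."""
    return SL_BY_TRAFFIC_CLASS[traffic_class]


if __name__ == "__main__":
    assigner = RoundRobinSLAssigner(range(8))
    print([assigner.sl_for_new_qp() for _ in range(10)])
    print(sl_for_traffic_class("high_priority"))
```

Either policy yields an SL that would be supplied when the QP is created, which is consistent with the point that SLs are fixed at QP creation time.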
[Figure: Application, Job Scheduler and MPI Library above three Service Levels, which lead to Virtual Lanes 0, 1, ..., 15, the virtual lane arbiter and the physical link]
– Intel Nehalem
– MT26428 QDR ConnectX HCAs
– 36-port Mellanox QDR switch used to connect all the nodes
– Modified version of OFED perftest for verbs level tests
– MPIBench collective benchmark
– CPMD used for application level evaluation
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 1255 organizations in 59 countries
– More than 44,500 downloads directly from the OSU site
– Empowering many TOP500 clusters
– Available with software stacks of many IB, 10GE and server vendors, including OpenFabrics Enterprise Distribution (OFED)
– Also supports uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/
Use of multiple VLs results in more predictable inter-arrival time
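One way to put a number on "predictable" is the coefficient of variation of the gaps between packet arrivals: the closer it is to zero, the steadier the stream. The sketch below shows that calculation. The timestamps are synthetic values invented for illustration, not measurements from this work, and the choice of metric is an assumption rather than the one used in the evaluation.

```python
from statistics import mean, pstdev


def inter_arrival_times(timestamps):
    """Gaps between consecutive arrival timestamps (same unit as the input)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def coefficient_of_variation(gaps):
    """Population standard deviation of the gaps divided by their mean."""
    return pstdev(gaps) / mean(gaps)


if __name__ == "__main__":
    # Synthetic arrival times in microseconds (hypothetical, not measured).
    steady = [0, 10, 20, 30, 40]
    bursty = [0, 2, 20, 22, 40]
    for name, stamps in (("steady", steady), ("bursty", bursty)):
        gaps = inter_arrival_times(stamps)
        print(name, gaps, coefficient_of_variation(gaps))
```

Both synthetic streams deliver five packets in the same 40 microseconds, so the average rate is identical and only the regularity differs.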
[Figure: three plots comparing 1-VL and 8-VLs for message sizes from 1K to 64K bytes: Bandwidth (MBps), Message Rate (in Millions) and Latency (us)]
Multiple VLs result in better overall performance than the case with just one VL
Use of multiple VLs results in better performance
[Figure: Latency (us) versus Message Size (1K to 16K bytes) in two panels. Traffic Distribution: 1-VL and 8-VLs. Traffic Segregation: 2 Alltoalls (No Segregation), 2 Alltoalls (Segregation) and 1 Alltoall]
[Figure: Normalized Time for 1 VL and 8 VLs, shown for Total Time and Time in Alltoall]
Use of multiple VLs results in better performance
Improvement in Alltoall performance
Improvement in overall performance
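Multiple virtual lanes improve the performance and predictability of HPC applications
Performance evaluations conducted at various levels
Gains seen in verbs, MPI and application level evaluations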
virtual lanes