SLIDE 1

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda

Department of Computer Science & Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction & Motivation
  • Problem Statement
  • Design
  • Performance Evaluation and Results
  • Conclusions and Future Work


SLIDE 3

Introduction & Motivation


  • Supercomputing clusters are growing in size and scale
  • MPI is the predominant programming model for HPC
  • High performance interconnects like InfiniBand have increased network capacity
  • Compute capacity outstrips network capacity with the advent of multi/many-core processors
  • This gets aggravated as jobs get assigned to random nodes and share links


(Figure: fat-tree network topology built from line card switches and fabric card switches)

SLIDE 4

Analysis of Traffic Pattern in a Supercomputer

Figure legend (courtesy of TACC):
  • Line color – number of streams: Black = 1, Blue = 2, Red = 3–4, Orange = 5–8, Green = more than 8
  • Dot color – network element: Green = network elements, Black = line card switches, Red = fabric card switches

  • Traffic flow in the Ranger supercomputer at the Texas Advanced Computing Center (TACC) shows heavy link sharing
    – http://www.tacc.utexas.edu

SLIDE 5

Possible Issue with Link Sharing

  • Few communicating peers – no problem
  • Packets get backed up as the number of communicating peers increases
  • This results in delayed arrival of packets at the destination


(Figure: compute nodes communicating through a shared switch link)

SLIDE 6

Frequency Distribution of Inter Arrival Times


  • Packet size – 2 KB (results are the same for 1 KB to 16 KB)
  • Inter-arrival time is directly proportional to the load on the links
SLIDE 7

Introduction & Motivation (Cont)

Can modern networks like InfiniBand alleviate this?


SLIDE 8

InfiniBand Architecture

  • An industry standard for low latency, high bandwidth System Area Networks
  • Multiple features
    – Two communication types (see the sketch below)
      • Channel Semantics
      • Memory Semantics (RDMA mechanism)
    – Queue Pair (QP) based communication
    – Quality of Service (QoS) support
    – Multiple Virtual Lanes (VLs)
    – QPs associated with VLs by means of pre-specified Service Levels
  • Multiple communication speeds available for Host Channel Adapters (HCAs) – 10 Gbps (SDR) / 20 Gbps (DDR) / 40 Gbps (QDR)
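As a concrete illustration of the two communication types, here is a generic libibverbs sketch (not code from the paper): it posts either a send that relies on channel semantics or an RDMA write that uses memory semantics. The buffer, memory region, remote address and rkey are placeholders a real application obtains from memory registration and connection setup.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one work request on a connected QP.  Channel semantics
 * (IBV_WR_SEND) deliver the data into a receive buffer the peer has
 * posted; memory semantics (IBV_WR_RDMA_WRITE) place the data directly
 * into remote memory named by a remote address and rkey. */
static int post_one(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *buf, uint32_t len, int use_rdma,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = mr->lkey,            /* from a prior ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (use_rdma) {                    /* memory semantics */
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
    } else {                           /* channel semantics */
        wr.opcode = IBV_WR_SEND;
    }
    return ibv_post_send(qp, &wr, &bad_wr);
}
```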


SLIDE 9

InfiniBand Network Buffer Architecture

(Figure: InfiniBand Host Channel Adapter (HCA) buffer architecture – a common buffer pool and private buffers feed Virtual Lanes 0 through 15, and a virtual lane arbiter schedules them onto the physical link)

  • Buffers in most IB HCAs and switches are grouped into two
    – Common Buffer Pool, and
    – Private VL buffers
  • Most current generation MPIs only use one VL
  • This is an inefficient use of available network resources
  • Why not use more VLs?
  • Possible con
    – Would it take more time to poll all the VLs? (see the sketch below)
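The polling concern can be pictured with the hedged sketch below. It assumes, purely for illustration, that the MPI library keeps one completion queue per VL-specific set of connections (the slides do not prescribe this layout), so the progress engine has to sweep all of them on every progress call.

```c
#include <infiniband/verbs.h>

/* Hypothetical progress-engine fragment: sweep one completion queue
 * per virtual lane.  With a single VL the loop body runs once per
 * call; with more VLs each progress call makes proportionally more
 * ibv_poll_cq() calls, which is the potential overhead the slide asks
 * about (an empty queue costs one cheap call). */
static int poll_all_vls(struct ibv_cq **cq_per_vl, int num_vls)
{
    struct ibv_wc wc;
    int completions = 0;

    for (int vl = 0; vl < num_vls; vl++) {
        while (ibv_poll_cq(cq_per_vl[vl], 1, &wc) > 0)
            completions++;   /* hand wc off to the MPI library here */
    }
    return completions;
}
```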



SLIDE 10

Outline

  • Introduction & Motivation
  • Problem Statement
  • Design
  • Performance Evaluation and Results
  • Conclusions and Future Work


SLIDE 11

Problem Statement

  • Can multiple virtual lanes be used to improve the performance of HPC applications?
  • How can we integrate this design into an MPI library so that end applications benefit?


SLIDE 12

Outline

  • Introduction & Motivation
  • Problem Statement
  • Design
  • Performance Evaluation and Results
  • Conclusions and Future Work


SLIDE 13

Proposed Framework and Goals

  • No change to the application
  • Re-design the MPI library to use multiple VLs
  • Need new methods to take advantage of multiple VLs (see the sketch at the end of this slide)
    – Traffic Distribution
      • Load balance traffic across multiple VLs
    – Traffic Segregation
      • Ensure one kind of traffic does not disturb the other
  • Distinguish between
    – Low & high priority traffic
    – Small & large messages

(Figure: proposed framework – the Application and Job Scheduler sit above the MPI Library, which applies Traffic Segregation and Traffic Distribution over the InfiniBand network)
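As a rough illustration of the two policies (the exact MVAPICH2 logic is not shown in the slides), a library could choose service levels as in the sketch below; num_sls, the counter and the 16 KB message-size threshold are hypothetical names and values used only for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical Traffic Distribution policy: hand out service levels in
 * round robin order as connections are created, so traffic spreads
 * across the virtual lanes those SLs map to. */
static uint8_t next_service_level(unsigned num_sls)
{
    static unsigned counter = 0;          /* per-process connection count */
    return (uint8_t)(counter++ % num_sls);
}

/* Hypothetical Traffic Segregation policy: pick the SL by traffic
 * class instead, e.g. one SL for latency-sensitive small messages and
 * another for bandwidth-heavy large messages (the threshold is an
 * assumption, not a value from the paper). */
static uint8_t service_level_for_message(size_t bytes)
{
    return bytes <= 16 * 1024 ? 0 : 1;
}
```

Either policy would be consulted at the point where a new connection is given its SL, which is exactly where the next slide assigns SLs at QP creation time.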


SLIDE 14

Proposed Design

  • Re-design the MPI library to use multiple VLs
    – Multiple Virtual Lanes configured with different characteristics
      • Transmit fewer packets at high priority
      • Transmit more packets at lower priority, etc.
    – Multiple Service Levels (SLs) defined to match the VLs
    – Queue Pairs (QPs) assigned the proper SLs at QP creation time (see the sketch below)
  • Multiple ways to assign Service Levels to applications
    – Assign SLs with similar characteristics in a round robin fashion
      • Traffic Distribution
    – Assign SLs with the desired characteristics based on the type of application
      • Traffic Segregation
    – Other designs being explored
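Below is a minimal verbs sketch of the step this design hinges on: giving a reliable-connected QP its Service Level when it is moved to the ready-to-receive state. The SL travels in the address-handle attributes passed to ibv_modify_qp(); the MTU, LID, PSN and QP number shown are placeholders, and the actual MVAPICH2 code paths may differ.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Move a reliable-connected QP to the ready-to-receive state and stamp
 * it with a service level; the subnet manager's SL-to-VL tables then
 * map that SL onto a virtual lane.  remote_lid, remote_qpn and
 * remote_psn are placeholders normally exchanged out of band during
 * connection setup. */
static int move_to_rtr_with_sl(struct ibv_qp *qp, uint8_t sl,
                               uint16_t remote_lid, uint32_t remote_qpn,
                               uint32_t remote_psn, uint8_t port_num)
{
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.sl         = sl;   /* service level selects the virtual lane */
    attr.ah_attr.port_num   = port_num;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```

With the SL set this way, the subnet manager's SL-to-VL mapping and VL arbitration tables determine how the corresponding lanes share the physical link.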


SLIDE 15

Proposed Design (Cont)

(Figure: proposed design – the Application and Job Scheduler drive the MPI Library, which assigns Service Levels that map onto Virtual Lanes 0 through 15; the virtual lane arbiter multiplexes the lanes onto the physical link)


SLIDE 16

Outline

  • Introduction & Motivation
  • Problem Statement
  • Design
  • Performance Evaluation & Results
  • Conclusions and Future Work


SLIDE 17

Experimental Testbed

  • Compute platforms
    – Intel Nehalem
      • Intel Xeon E5530 dual quad-core processors operating at 2.40 GHz
      • 12 GB RAM, 8 MB cache
      • PCIe 2.0 interface
  • Network equipment
    – MT26428 QDR ConnectX HCAs
    – 36-port Mellanox QDR switch used to connect all the nodes
  • Red Hat Enterprise Linux Server release 5.3 (Tikanga)
  • OFED-1.4.2
  • OpenSM version 3.1.6
  • Benchmarks
    – Modified version of OFED perftest for verbs level tests
    – MPIBench collective benchmark
    – CPMD used for application level evaluation


SLIDE 18

MVAPICH / MVAPICH2 Software

  • High Performance MPI Library for IB and 10GE
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
    – Used by more than 1,255 organizations in 59 countries
    – More than 44,500 downloads from the OSU site directly
    – Empowering many TOP500 clusters
      • 11th ranked 62,976-core cluster (Ranger) at TACC
    – Available with the software stacks of many IB, 10GE and server vendors, including the Open Fabrics Enterprise Distribution (OFED)
    – Also supports a uDAPL device to work with any network supporting uDAPL
    – http://mvapich.cse.ohio-state.edu/


SLIDE 19

Verbs Level Performance

  • Tests use 8 communicating pairs
  • One QP per pair
  • Packet size – 2 KB
  • Results are the same for 1 KB to 16 KB
  • Traffic distribution using multiple VLs results in more predictable inter-arrival times
  • Slight increase in average latency


SLIDE 20

MPI Level Point to Point Performance

(Figure: bandwidth (MBps), message rate (millions of messages/s) and latency (us) versus message size from 1 KB to 64 KB, comparing 1-VL and 8-VLs)

  • Tests use 8 communicating pairs
  • One QP per pair
  • Traffic distribution using multiple VLs results in better overall performance
  • 13% performance improvement over the case with just one VL


SLIDE 21

MPI Level Collective Performance


  • For a 64-process Alltoall, Traffic Distribution and Traffic Segregation through the use of multiple VLs result in better performance
  • 20% performance improvement seen with Traffic Distribution
  • 12% performance improvement seen with Traffic Segregation

(Figure: Alltoall latency (us) versus message size from 1 KB to 16 KB – Traffic Distribution compares 1-VL and 8-VLs; Traffic Segregation compares two concurrent Alltoalls with and without segregation against a single Alltoall)

SLIDE 22

Application Level Performance

(Figure: normalized total time and time spent in Alltoall for CPMD, 1 VL versus 8 VLs)


  • CPMD application
  • 64 processes
  • Traffic Distribution through the use of multiple VLs results in better performance
  • 11% improvement in Alltoall performance
  • 6% improvement in overall performance


SLIDE 23

Outline

  • Introduction & Motivation
  • Problem Statement
  • Design
  • Performance Evaluation & Results
  • Conclusions & Future Work


SLIDE 24

Conclusions & Future Work

  • Explored the use of Virtual Lanes to improve the predictability and performance of HPC applications
  • Integrated our scheme into the MVAPICH2 MPI library and conducted performance evaluations at various levels
  • Observed consistent performance increases in the verbs, MPI and application level evaluations
  • Future work: explore advanced schemes to improve performance using multiple virtual lanes
  • The proposed solution will be available in future MVAPICH2 releases


SLIDE 25

Thank you!

{subramon, laipi, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu/
