Howard Pritchard and Igor Gorodetsky, Cray, Inc.
Cray User Group Conference 2011

[Figure: MPICH2 software stack, with a legend marking Cray specific components. Labels: Application; MPI Interface; MPICH2; ROMIO; ADI3; ADIO; GPFS, Lus., ...; CH3 Device; CH3 Interface; PMI; Job launcher; Nemesis; xpmem; Nemesis NetMod Interface; netmods PSM, GM, GNI, TCP, IB, MX; mvapich2 1.5.]

[Figure: software stack diagram, dated July 21, 2010, marked Cray Proprietary. Components: LIBPGAS, SHMEM, MPICH2, PMI, libxpmem, DMAPP, XPMEM, ALPS, UGNI, UDREG, KGNI, libjob, KDREG, JOB, GHAL. Groupings: PE LIBS; OS LIBS; OS bypass; OS device drivers (gem/ari specific); OS device drivers (HSN independent). Legend: package A uses methods of B.]

- A connection-oriented approach based on GNI SMSG mailboxes is used
  - Lowest latency, highest message rates
  - Reliable connections, can ride through network faults
- Characteristics of Gemini memory registration hardware influenced MPICH2 GNI Network Module (Netmod) design.
- All network transactions are tracked. There is clean separation between data transfers and control messages. No fire-and-forget. This makes fault tolerance support much simpler.

- Uses DSMN hardware in Gem/Ari
- Messages delivered in order even though 'adaptive' routing is used
- Tolerant to transient network faults
- FLOW CONTROL. If the receiver stops dequeuing messages, sender runs out of credits and stops sending. No polling remote variables, queue overruns, etc.
- MPICH2 and GNILND (Lustre, DVS, etc.) share same mailbox code
- Memory per mailbox controlled by application. It can be small, ~1000 bytes or so.
- User-space has option to use shared mailboxes (MSGQs) to reduce memory footprint.

[Figure: Endpoint X sends PUT_MSG to a mailbox at Endpoint Y. Labels: message data; flag data; Command buffer; Data buffer.]
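The credit-based flow control described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and method names are hypothetical, nothing here is the GNI API, and the return of credits is modeled as instantaneous, whereas a real implementation has to communicate it back to the sender.

```python
from collections import deque


class CreditedMailbox:
    """Toy model of one sender->receiver mailbox with credit-based flow control.

    Hypothetical illustration only; not the GNI SMSG interface.
    """

    def __init__(self, credits):
        self.credits = credits   # free slots the sender believes the receiver has
        self.slots = deque()     # messages delivered but not yet dequeued

    def try_send(self, msg):
        """Deliver msg if a credit is available; otherwise refuse."""
        if self.credits == 0:
            return False         # sender stops sending; the mailbox cannot overrun
        self.credits -= 1
        self.slots.append(msg)   # messages stay in order
        return True

    def dequeue(self):
        """Receiver consumes the oldest message, returning one credit."""
        if not self.slots:
            return None
        msg = self.slots.popleft()
        self.credits += 1
        return msg


mbox = CreditedMailbox(credits=2)
sent = [mbox.try_send(m) for m in ("a", "b", "c")]  # third send is refused
first = mbox.dequeue()                              # frees one credit
retry = mbox.try_send("c")                          # now succeeds
```

The point of the sketch is that the sender decides locally whether it may send, so it never needs to poll remote state and the receiver's mailbox is never overwritten.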

- By default, connections (mailboxes) are established dynamically using the scalable, but low performance datagram (BTE_SEND) path.
- Mailboxes are normally mapped to large pages to reduce TLB pressure when processing messages from many different mailboxes. For better performance a subset of mailboxes/rank will soon be placed on DIE0 memory if user chooses.
- An RX Completion Queue (part of DSMN) is used to look up which mailbox to check for incoming messages. If the CQ becomes overrun, app doesn't die, just scan all the mailboxes.
  - Some users very much like this: the "I just want to get through this silly part of the code without dying or doing big rewrite" crowd
  - Some users don't like this because they'd prefer to die and figure out how to fix things rather than run slow.
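The completion-queue behavior above (use the CQ to find the mailbox, fall back to scanning every mailbox after an overrun) can be sketched as follows. This is a simplified model, not the Netmod code: `RxDispatcher`, `deliver`, and `progress` are hypothetical names, and the CQ is modeled as a bounded Python deque.

```python
from collections import deque


class RxDispatcher:
    """Toy receive path: a bounded CQ names the mailboxes that hold new messages."""

    def __init__(self, num_mailboxes, cq_depth):
        self.mailboxes = [deque() for _ in range(num_mailboxes)]
        self.cq = deque()
        self.cq_depth = cq_depth
        self.overrun = False

    def deliver(self, mailbox_id, msg):
        """Model an arrival: data lands in the mailbox, an event goes to the CQ."""
        self.mailboxes[mailbox_id].append(msg)
        if len(self.cq) < self.cq_depth:
            self.cq.append(mailbox_id)
        else:
            self.overrun = True  # event lost, but the message is safe in its mailbox

    def progress(self):
        """Drain pending messages: CQ-guided normally, full scan after an overrun."""
        if self.overrun:
            ids = range(len(self.mailboxes))    # slow path: check every mailbox
            self.overrun = False
        else:
            ids = list(dict.fromkeys(self.cq))  # fast path: only mailboxes in the CQ
        self.cq.clear()
        out = []
        for i in ids:
            while self.mailboxes[i]:
                out.append((i, self.mailboxes[i].popleft()))
        return out


rx = RxDispatcher(num_mailboxes=4, cq_depth=2)
rx.deliver(3, "x")
rx.deliver(1, "y")
rx.deliver(2, "z")           # CQ is full, so this arrival overruns it
recovered = rx.progress()    # full scan still finds all three messages
rx.deliver(0, "w")
fast = rx.progress()         # back on the CQ-guided path
```

The slow path costs time proportional to the number of mailboxes, which is the "run slow rather than die" trade-off the slide describes.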

[Figure: Mailbox memory usage per rank (MB, axis 0 to 250) versus ranks in job (axis 0 to 140000), for SMSG and MSGQ.]

- Eager Protocol
  - For a message that can fit in a GNI SMSG mailbox (E0)
  - For a message that can't fit into a mailbox but is less than MPICH_GNI_MAX_EAGER_MSG_SIZE in length (E1)
- Rendezvous protocol (LMT)

- Protocol for messages that can fit into a GNI SMSG mailbox
- The default varies with job size, although this can be tuned by the user to some extent

  Ranks in job             Maximum bytes of user data
  <= 1024                  984
  > 1024 && <= 16384       472
  > 16384                  216
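The defaults in the table above can be written as a small lookup. The function name is hypothetical and the sketch only encodes the three rows shown on the slide; it does not model the user tuning the slide mentions.

```python
def default_smsg_max_user_bytes(ranks_in_job):
    """Default maximum user payload (bytes) for an E0 message, per the slide's table."""
    if ranks_in_job < 1:
        raise ValueError("ranks_in_job must be positive")
    if ranks_in_job <= 1024:
        return 984
    if ranks_in_job <= 16384:
        return 472
    return 216


small_job = default_smsg_max_user_bytes(512)
medium_job = default_smsg_max_user_bytes(4096)
large_job = default_smsg_max_user_bytes(100000)
```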

- For good performance, switching from an Eager protocol to Rendezvous at the small maximum message sizes possible for GNI SMSG mailboxes is not acceptable, except for IMB, etc.
- For this reason, the GNI Netmod has a leave-the-data-at-the-source-but-send-the-header GET-based Eager protocol for messages too large to fit into a mailbox, but less than or equal to MPICH_GNI_MAX_EAGER_MSG_SIZE bytes
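The three-way choice between E0, E1, and LMT can be sketched as a size comparison. This is an illustration, not the Netmod's actual decision code: `select_protocol` is a hypothetical name, both limits are passed in as parameters, and the 8192-byte eager limit in the usage lines is an assumed value chosen only for the example (the slides do not state the default of MPICH_GNI_MAX_EAGER_MSG_SIZE). The E1 upper bound is treated as inclusive, following the "less than or equal to" wording.

```python
def select_protocol(msg_bytes, mailbox_max_bytes, max_eager_bytes):
    """Pick a transfer protocol from the message size.

    mailbox_max_bytes: largest user payload that fits in an SMSG mailbox.
    max_eager_bytes:   stands in for MPICH_GNI_MAX_EAGER_MSG_SIZE.
    """
    if msg_bytes <= mailbox_max_bytes:
        return "E0"    # eager, payload travels inside the mailbox message
    if msg_bytes <= max_eager_bytes:
        return "E1"    # eager, header sent and data fetched with a GET
    return "LMT"       # rendezvous (Long Message Transfer)


# Example limits: 984 is the mailbox payload for small jobs; 8192 is an assumption.
choices = [select_protocol(n, 984, 8192) for n in (64, 984, 985, 8192, 8193)]
```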

[Figure: sender/receiver message flow. Labels: RTS via SMSG; RDMA Read; RCV DONE via SMSG; app send buffer; app rcv buffer; DMA buffer used by Nemesis; CH3 Header; memcpy.]

- LMT stands for Long Message Transfer.
- This is a rendezvous protocol. The Nemesis match engine has to have matched the receive with the send before an LMT begins
- Nemesis provides the infrastructure for RDMA style NICs like Gemini to make use of zero-copy without reinventing wheels
- The GNI Netmod makes use of this infrastructure, as does the XPMEM component of the shared memory part of Nemesis (intra-node transfers)
- Two methods are used by the GNI Netmod, depending on size of the message
  - RDMA read method (receiver pulls the data)
  - RDMA write method (max bandwidth)

[Figure: LMT RDMA read path between SENDER and RECEIVER. Labels: Register buffer with Gemini (app send buffer); RTS via SMSG (includes CH3 header); Register buffer with Gemini (app rcv buffer); RDMA Read; RCV DONE via SMSG.]

[Figure: LMT RDMA write path between Sender and Receiver, marked Cray Inc. Proprietary. Labels: RTS via SMSG (includes CH3 header); Register buffer chunk with Gemini (app rcv buffer); CTS via SMSG; Register buffer chunk with Gemini (app send buffer); RDMA PUT; SEND DONE via SMSG; Keep repeating until all sender data transferred.]
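The "keep repeating until all sender data transferred" loop in the RDMA write path amounts to walking the message in chunks. The sketch below only plans the (offset, length) pairs for each round. The function name and the chunk size are hypothetical, and the slides do not say how the chunk size is chosen.

```python
def plan_rdma_write_chunks(total_bytes, chunk_bytes):
    """Return the (offset, length) pair for each register/CTS/PUT round."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    chunks = []
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)  # last chunk may be short
        chunks.append((offset, length))
        offset += length
    return chunks


# A 10-byte message moved in 4-byte chunks takes three rounds.
rounds = plan_rdma_write_chunks(10, 4)
```

Registering one chunk at a time, rather than the whole buffer, bounds how much memory is registered at any moment.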

- RDMA Read path offers best opportunity with current MPICH2 to get some overlap of compute with communicate, at least for the sender
- There are alignment restrictions for source when using RDMA Read path
  - Dword aligned start addr
  - Integral number of dwords message length
- RDMA read delivers suboptimal network bandwidth utilization in the general alignment case for send and receive buffers
- RDMA Write offers highest bandwidth path, not sensitive to alignment of send and receive buffers
- Not possible to get much overlap of compute with communicate for the RDMA Write path with current MPICH2 software
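The two alignment restrictions on the RDMA Read source can be checked with simple modular arithmetic. This is a sketch under one loud assumption: a dword is taken to be 4 bytes, which the slides do not state. The function name is hypothetical.

```python
DWORD_BYTES = 4  # assumption: the slides say "dword" without giving its size


def rdma_read_source_ok(src_addr, length_bytes):
    """True if the source meets both restrictions: aligned start, whole dwords."""
    start_aligned = src_addr % DWORD_BYTES == 0
    whole_dwords = length_bytes % DWORD_BYTES == 0
    return start_aligned and whole_dwords


aligned = rdma_read_source_ok(0x1000, 4096)       # aligned start, whole dwords
odd_start = rdma_read_source_ok(0x1002, 4096)     # start not dword aligned
odd_length = rdma_read_source_ok(0x1000, 4099)    # length not a whole number of dwords
```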

- Tests were done on the following system
  - Cray XE system with 2.0 GHz Magny Cours (12), 24 cores per node
  - Cray Linux Environment (CLE) 3.1.61 and a pre-release MPT 5.3 (MPICH2)
- Not intended as advertising material for maximum possible performance (use 2.4 GHz processors for that)
