  1. Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap. By Rajesh Nishtala (1), Paul Hargrove (2), Dan Bonachea (1), and Katherine Yelick (1,2). (1) University of California, Berkeley; (2) Lawrence Berkeley National Laboratory. (To appear at IEEE IPDPS 2009.)

  2. Observations
  • Performance gains are delivered through increasing concurrency rather than increasing clock rates
  • Application scalability is essential for future performance improvements
  • 100,000s of processors will be the norm in the very near future
  • Maximize the use of available resources: leverage communication/communication and communication/computation overlap
  • Systems will favor many slower, power-efficient processors over fewer faster, power-inefficient ones
  • Light-weight communication and runtime systems minimize software overhead
  • Close semantic match to the underlying hardware

  3. Overview
  • Discuss our new port of GASNet, the communication subsystem for the Berkeley UPC compiler, to the BlueGene/P
  • Outline the key differences between one-sided and two-sided communication and their applicability to modern networks
  • Show how the microbenchmark performance advantages translate to real applications
  • We chose the communication-bound NAS FT benchmark as the case study
  • Thesis statement: the one-sided communication model found in GASNet is a better semantic fit to modern highly concurrent systems because it better leverages features such as RDMA, and thus allows applications to realize better scaling

  4. BlueGene/P Overview
  • Representative example of future highly concurrent systems
  • Compute node: 4 cores running at 850 MHz with 2 GB of RAM and 13.6 GB/s of bandwidth between main memory and the cores
  • Total cores = (4 cores / node) x (32 nodes / node card) x (32 node cards / rack) x (up to 72 racks), worked out below
  • Different networks for different tasks:
  • 3D Torus for general point-to-point communication (5.1 GB/s per node)
  • Global Interrupt network for barriers (1.3 us for 72 racks)
  • Global Collective network for one-to-many broadcasts and many-to-one reductions (0.85 GB/s per link)
  [Figure and data from: "IBM System Blue Gene Solution: Blue Gene/P Application Development" by Carlos Sosa and Brant Knudson, published Dec. 2008 by IBM Redbooks, ISBN 0738432113]
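For concreteness, the factors above multiply out as follows (a worked product only; the 4 cores per node comes from the compute-node bullet):

    \[ 4\ \mathrm{cores/node} \times 32\ \mathrm{nodes/card} \times 32\ \mathrm{cards/rack} \times 72\ \mathrm{racks} = 294{,}912\ \mathrm{cores} \]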

  5. Partitioned Global Address Space (PGAS) Languages
  • Programming model suitable for both shared and distributed memory systems
  • The language presents a logically shared memory: any thread may directly read/write data located on a remote processor
  • The address space is partitioned so each processor has affinity to a memory region
  • Accesses to "local" memory are potentially much faster
  [Figure: threads P0-P3 each have a private address space plus a partition of the shared address space]
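As an illustrative sketch (not taken from the slides), the UPC fragment below declares a blocked shared array: each element has affinity to exactly one thread, every thread can read or write any element directly, and accesses to elements with local affinity are the fast ones. The array size and block factor are arbitrary choices.

    #include <upc.h>
    #include <stdio.h>

    /* Shared array distributed in blocks of 4 consecutive elements per thread. */
    shared [4] double grid[4 * THREADS];

    int main(void) {
        int i;

        /* Each thread initializes the elements it has affinity to (local, fast). */
        upc_forall (i = 0; i < 4 * THREADS; i++; &grid[i])
            grid[i] = MYTHREAD;

        upc_barrier;   /* make all writes visible before any remote reads */

        /* Any thread may also read a remote element directly (slower). */
        if (MYTHREAD == 0)
            printf("grid[4] has affinity to thread %d and holds %.1f\n",
                   (int)upc_threadof(&grid[4]), grid[4]);
        return 0;
    }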

  6. Data Transfers in UPC
  • Example: send P0's version of A to P1
  • MPI code:
      double A; MPI_Status stat;
      if (myrank == 0) {
        A = 42.0;
        MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      } else if (myrank == 1) {
        MPI_Recv(&A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
      }
  • UPC code:
      shared [1] double A[4];
      if (MYTHREAD == upc_threadof(&A[0])) {
        A[0] = 42.0;
        upc_memput(&A[1], (double *)&A[0], sizeof(double));
      }
  [Figure: in the MPI version each of P0-P3 holds a private copy of A; in the UPC version the shared array A[4] has one element with affinity to each of P0-P3]
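A complete, compilable version of the UPC fragment above might look like the sketch below; the upc_barrier and the printf on the receiving thread are additions for illustration (they are not on the slide), and they show how synchronization is expressed separately from the one-sided data movement.

    #include <upc.h>
    #include <stdio.h>

    shared [1] double A[4];   /* elements distributed round-robin across threads */

    int main(void) {
        if (MYTHREAD == upc_threadof(&A[0])) {
            A[0] = 42.0;
            /* One-sided put: copy our local A[0] into A[1], which may live on another thread. */
            upc_memput(&A[1], (double *)&A[0], sizeof(double));
        }
        upc_barrier;   /* synchronization is decoupled from the data movement */

        if (MYTHREAD == upc_threadof(&A[1]))
            printf("Thread %d sees A[1] = %.1f\n", MYTHREAD, A[1]);
        return 0;
    }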

  7. One-Sided versus Two-Sided Communication
  [Figure: a one-sided put (e.g. GASNet) carries the destination address with the data payload and is deposited by the NIC directly into memory; a two-sided send/recv (e.g. MPI) carries a message id and must be matched against a pre-posted receive by the host cores]
  • A one-sided put/get can transfer data directly without interrupting the host cores
  • The message itself carries the remote address that says where to put the data
  • The CPU need not be involved if the NIC supports Remote Direct Memory Access (RDMA)
  • Synchronization is decoupled from the data movement
  • Two-sided send/recv requires a rendezvous with the host cores to agree on where the data goes before RDMA can be used
  • Bounce buffers can be used for small enough messages, but the extra serial copying can make them prohibitively expensive
  • Most modern networks provide RDMA functionality, so why not just use it directly?
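To make the one-sided side concrete, here is a minimal sketch in the style of the GASNet-1 API (the peer node, remote address, and buffers are placeholders, and the usual gasnet_init/gasnet_attach setup is omitted):

    #include <stddef.h>
    #include <gasnet.h>

    /* Sketch: one-sided transfer of 'len' bytes from a local buffer into a
     * known address 'remote_dst' on node 'peer'.  No receive is posted on the
     * peer; a capable NIC can deposit the payload directly via RDMA. */
    void push_block(gasnet_node_t peer, void *remote_dst, void *local_src, size_t len) {
        /* Initiate an implicit-handle nonblocking put. */
        gasnet_put_nbi(peer, remote_dst, local_src, len);

        /* ...the initiator is free to issue more puts or do other work here... */

        /* Completion is checked later, decoupled from any action on the peer. */
        gasnet_wait_syncnbi_puts();
    }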

  8. GASNet Overview
  • Portable, high-performance runtime system for many different PGAS languages
  • Projects: Berkeley UPC, GCC-UPC, Titanium, Rice Co-Array Fortran, Cray Chapel, Cray UPC & Co-Array Fortran, and many other experimental projects
  • Supported networks: BlueGene/P (DCMF), InfiniBand (VAPI and IBV), Cray XT (Portals), Quadrics (Elan), Myrinet (GM), IBM LAPI, SHMEM, SiCortex (soon to be released), UDP, MPI
  • 100% open source, under a BSD license
  • Features:
  • Multithreaded (works in VN, Dual, or SMP modes)
  • Provides efficient nonblocking puts and gets, often just a thin wrapper around hardware puts and gets
  • Also supports Vector, Indexed, and Strided (VIS) operations
  • Provides a rich Active Messaging API
  • Provides nonblocking collective communication (collectives will soon be automatically tuned)
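The nonblocking puts and gets mentioned above are what enable communication/computation overlap. A minimal sketch with the explicit-handle variants of the GASNet-1 API (the buffers and the compute callback are placeholders):

    #include <stddef.h>
    #include <gasnet.h>

    /* Sketch: overlap a put with local computation using an explicit handle. */
    void put_with_overlap(gasnet_node_t peer, void *remote_dst, void *local_src,
                          size_t len, void (*do_local_work)(void)) {
        gasnet_handle_t h = gasnet_put_nb(peer, remote_dst, local_src, len);

        /* Useful work proceeds while the network moves the data. */
        do_local_work();

        /* Poll once without blocking... */
        if (gasnet_try_syncnb(h) != GASNET_OK) {
            /* ...and block only when completion is actually required. */
            gasnet_wait_syncnb(h);
        }
    }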

  9. GASNet Latency Performance
  • GASNet is implemented on top of the Deep Computing Messaging Framework (DCMF)
  • DCMF is lower level than MPI; it provides Puts, Gets, AMSend, and Collectives
  • Point-to-point ping-ack latency performance: an N-byte transfer followed by a 0-byte acknowledgement
  • GASNet takes advantage of DCMF remote completion notification, the minimum semantics needed to implement the UPC memory model
  • Almost a factor of two difference until 32-byte transfers
  • Indication of a better semantic match to the underlying communication system
  [Figure: roundtrip latency (microseconds) versus transfer size (1 to 512 bytes) for MPI Send/Recv, GASNet Get + sync, and GASNet Put + sync; lower is better]
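A sketch of how such a ping-ack latency figure can be measured with blocking GASNet puts; the timer, iteration count, and buffers are illustrative assumptions, not the benchmark actually used in the paper:

    #include <stddef.h>
    #include <sys/time.h>
    #include <gasnet.h>

    static double now_us(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    /* Average latency of a blocking N-byte put: gasnet_put does not return until
     * the data is complete at the target, so each iteration includes the round
     * trip (payload out, completion acknowledgement back). */
    double avg_put_latency_us(gasnet_node_t peer, void *remote_buf,
                              void *local_buf, size_t nbytes, int iters) {
        double start = now_us();
        for (int i = 0; i < iters; i++)
            gasnet_put(peer, remote_buf, local_buf, nbytes);
        return (now_us() - start) / iters;
    }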

  10. GASNet Multilink Bandwidth
  • Each node has six 850 MB/s* bidirectional torus links
  • We vary the number of links used from 1 to 6 and initiate a series of nonblocking puts on the links (round-robin): communication/communication overlap
  • Both MPI and GASNet asymptote to the same bandwidth
  • GASNet outperforms MPI at midrange message sizes
  • Lower software overhead implies more efficient message injection
  • GASNet avoids the rendezvous needed to leverage RDMA
  • * Kumar et al. showed the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak bandwidth; see "The deep computing messaging framework: generalized scalable message passing on the blue gene/P supercomputer", Kumar et al., ICS '08
  [Figure: flood bandwidth (MB/s, 1 MB = 2^20 bytes) versus transfer size from 512 bytes to 2 MB for GASNet and MPI using 1, 2, 4, and 6 links, with the one-link and six-link peaks marked]
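A sketch of the flood-bandwidth measurement described above: implicit-handle nonblocking puts are issued round-robin to several peers (one per torus link in the experiment), then synced as a group. The peer list, buffer management, and message count are illustrative assumptions.

    #include <stddef.h>
    #include <gasnet.h>

    /* Issue 'count' nonblocking puts of 'nbytes' each, striped round-robin over
     * the given peers, then wait for all of them; returns megabytes injected. */
    double flood_puts_mb(gasnet_node_t *peers, void **remote_bufs, int npeers,
                         void *local_buf, size_t nbytes, int count) {
        for (int i = 0; i < count; i++) {
            int p = i % npeers;                    /* round-robin over the links */
            gasnet_put_nbi(peers[p], remote_bufs[p], local_buf, nbytes);
        }
        gasnet_wait_syncnbi_puts();                /* wait for all outstanding puts */
        return (double)count * (double)nbytes / (1024.0 * 1024.0);
    }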

  11. Case Study: NAS FT Benchmark
  • Perform a large 3D FFT
  • Used in many areas of computational science: molecular dynamics, CFD, image processing, signal processing, astrophysics, etc.
  • Representative of a class of communication-intensive algorithms
  • Requires parallel many-to-many communication, which stresses the communication subsystem
  • Limited by the bandwidth (namely the bisection bandwidth) of the network
  • Building on our previous work, we perform a 2D partition of the domain
  • This requires two rounds of communication rather than one; each processor communicates in two rounds with O(√T) threads in each (see the worked example below)
  [Figure: 1D versus 2D decomposition of the FFT domain across the processor grid]
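As a worked illustration of the O(√T) claim (the specific thread count is an assumption chosen for the example): on a square TY x TZ processor grid,

    \[ T = T_Y \cdot T_Z,\quad T_Y = T_Z = \sqrt{T} \;\Longrightarrow\; \text{each round exchanges data among } \sqrt{T} = O(\sqrt{T}) \text{ threads;}\quad \text{e.g. } T = 1024 \Rightarrow 32\text{-thread teams per round.} \]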

  12. Our Terminology
  • The domain is NX columns by NY rows by NZ planes
  • We overlay a TY x TZ processor grid (i.e. NX is the only contiguous dimension)
  • Plane: NX columns by NY rows, shared amongst a team of TY processors
  • Slab: NX columns by NY/TY rows of elements that reside entirely on one thread; each thread owns NZ/TZ slabs
  • Packed Slab: NX columns by NY/TY rows by NZ/TZ planes, i.e. all the data a particular thread owns
  [Figure: the TY x TZ processor grid overlaid on the NX x NY x NZ domain]
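A small sketch of the per-thread sizes implied by this layout (the struct and function names are only for illustration, and NX, NY, NZ are assumed to divide evenly by TY and TZ):

    #include <stddef.h>

    /* Per-thread data-layout sizes for an NX x NY x NZ domain on a TY x TZ grid. */
    typedef struct {
        size_t rows_per_slab;      /* NY / TY rows, each of NX contiguous elements */
        size_t elems_per_slab;     /* one slab = NX * (NY / TY) elements           */
        size_t slabs_per_thread;   /* NZ / TZ slabs owned by each thread           */
        size_t elems_per_thread;   /* packed slab = NX * (NY/TY) * (NZ/TZ)         */
    } ft_layout_t;

    static ft_layout_t make_layout(size_t NX, size_t NY, size_t NZ, size_t TY, size_t TZ) {
        ft_layout_t L;
        L.rows_per_slab    = NY / TY;
        L.elems_per_slab   = NX * L.rows_per_slab;
        L.slabs_per_thread = NZ / TZ;
        L.elems_per_thread = L.elems_per_slab * L.slabs_per_thread;
        return L;
    }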

  13. 3D-FFT Algorithm
  • Perform a 3D FFT (as part of NAS FT) across a large rectangular prism
  • Perform an FFT in each of the 3 dimensions
  • Need a team exchange for the other 2 of the 3 dimensions under a 2D processor layout
  • Performance is limited by the bisection bandwidth of the network
  • Algorithm: perform the FFTs across the rows (an outline of the full sequence follows below)
  [Figure: the domain shown as four planes of 4x4 blocks, A00-A33 through D00-D33; each processor owns a row of 4 blocks (16 processors in the example)]
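Putting the earlier slides together, a high-level outline of the two-round, 2D-decomposed 3D FFT might look like the sketch below. The helper names are placeholders (empty stubs here), the exchange steps are where the nonblocking one-sided puts and the overlap discussed earlier come in, and the exact ordering of dimensions is an assumption for illustration.

    /* Outline of a 3D FFT on a TY x TZ processor grid.  The helpers are empty
     * stubs standing in for the real local-FFT and team-exchange steps. */
    static void fft_rows_x(void)               { /* local 1D FFTs along X            */ }
    static void exchange_within_row_team(void) { /* put slabs to the TY teammates    */ }
    static void fft_rows_y(void)               { /* local 1D FFTs along Y            */ }
    static void exchange_within_col_team(void) { /* put packed slabs to TZ teammates */ }
    static void fft_rows_z(void)               { /* local 1D FFTs along Z            */ }

    void fft_3d_2d_decomposition(void) {
        fft_rows_x();                 /* each thread FFTs the contiguous NX rows it owns */
        exchange_within_row_team();   /* communication round 1 (team of TY threads)      */
        fft_rows_y();
        exchange_within_col_team();   /* communication round 2 (team of TZ threads)      */
        fft_rows_z();
    }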
