Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT



SLIDE 1

GASNet at UC Berkeley / LBNL

Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT

Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick

Cray User Group (CUG) 2009

http://gasnet.cs.berkeley.edu
http://upc.lbl.gov

SLIDE 2

What is GASNet?

  • GASNet is:
  • A high-performance, one-sided communication layer
  • Portable abstraction layer for the network
  • Runs on most architectures of interest to HPC
  • Native ports to a wide variety of low-level network APIs
  • Can run over portable network interfaces (MPI, UDP)
  • Designed as compilation target for PGAS languages
  • UPC, Co-array Fortran, Titanium, Chapel,...
  • Targeted by 7 separate parallel compiler efforts and counting

  – Berkeley UPC, GCC UPC, Cray XT UPC
  – Rice CAF, Cray XT CAF, Berkeley Titanium, Cray Chapel
  – Numerous prototyping efforts

SLIDE 3

PGAS Compiler System Stack

[Figure: software stack, top to bottom: PGAS code (UPC, Titanium, CAF, etc.); PGAS compiler; compiler-generated code (C, asm); language runtime system; GASNet communication system; network hardware. The layers are annotated as platform-independent, network-independent, language-independent, and compiler-independent.]

SLIDE 4

GASNet Design Overview: System Architecture

  • Two-Level architecture is mechanism for portability
  • GASNet Core API
  • Most basic required primitives, narrow and general
  • Implemented directly on each network
  • Based on Active Messages lightweight RPC paradigm
  • GASNet Extended API

  – Wider interface that includes higher-level operations
  – Puts and gets w/ flexible sync, split-phase barriers, collective operations, etc.
  – Have reference implementation of the extended API in terms of the core API
  – Directly implement selected subset of interface for performance
  – Leverage hardware support for higher-level operations

[Figure: layer stack, top to bottom: compiler-generated code; compiler-specific runtime system; GASNet Extended API; GASNet Core API; network hardware.]
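The two-level design can be illustrated with a toy model in which an Extended-API-style put is built only from Active Message requests and replies, as a reference implementation would be. This is a minimal sketch in Python, not GASNet code: `Node`, `Network`, `put_nb`, `try_sync`, and the handler ids are hypothetical names invented for illustration, and message delivery is simulated as a direct function call.

```python
# Hypothetical sketch: a put layered on an Active Message core.
# None of these names are part of the real GASNet API.

PUT_REQ, PUT_ACK = 1, 2   # illustrative handler ids


class Node:
    def __init__(self, rank, segment_size):
        self.rank = rank
        self.segment = bytearray(segment_size)
        self.pending = set()          # ids of puts not yet acknowledged
        self.handlers = {PUT_REQ: put_request_handler,
                         PUT_ACK: put_ack_handler}


class Network:
    """Simulated network: delivering an AM runs the target's handler."""

    def __init__(self, nodes):
        self.nodes = {n.rank: n for n in nodes}

    def am_request(self, src, dst, handler_id, payload, *args):
        target = self.nodes[dst]
        target.handlers[handler_id](self, target, src, payload, *args)

    def am_reply(self, src, dst, handler_id, *args):
        target = self.nodes[dst]
        target.handlers[handler_id](self, target, src, b"", *args)


def put_request_handler(net, node, src, payload, offset, op_id):
    # Runs at the target: copy the payload into the segment, then acknowledge.
    node.segment[offset:offset + len(payload)] = payload
    net.am_reply(node.rank, src, PUT_ACK, op_id)


def put_ack_handler(net, node, src, payload, op_id):
    # Runs at the initiator: the put is now remotely complete.
    node.pending.discard(op_id)


def put_nb(net, node, dst, offset, data, op_id):
    """Non-blocking put expressed purely in terms of AM request/reply."""
    node.pending.add(op_id)
    net.am_request(node.rank, dst, PUT_REQ, bytes(data), offset, op_id)
    return op_id


def try_sync(node, op_id):
    """True once the acknowledgement for op_id has been processed."""
    return op_id not in node.pending
```

A native conduit can replace `put_nb` with a direct RDMA call while leaving the AM core untouched, which is the point of splitting the interface in two.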

SLIDE 5

GASNet Design Progression on XT

  • Pure MPI: mpi-conduit
  • Fully portable implementation of GASNet over MPI-1
  • “Runs everywhere, optimally nowhere”
  • Portals/MPI Hybrid
  • Replaced Extended API (put/get) with Portals calls
  • Zero-copy RDMA transfers using SeaStar support
  • Pure Portals: portals-conduit
  • Native Core API (AM) implementation over Portals
  • Eliminated reliance on MPI
  • Firehose integration
  • Reduce memory registration overheads
SLIDE 6

Portals Message Processing

  • Lowest-level software interface to the XT network is Portals
  • All data movement via Put/Get between pre-registered memory regions
  • Provides sophisticated receive-side processing of all incoming messages
  • Designed to allow NIC offload of MPI message matching
  • Provides (more than) sufficient generality for our purposes

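[Figure: Portals message processing. An incoming message arrives at the NIC and uses its portal index to select an entry in the Portal Table. That entry's match list holds match entries (MEs), shown with example match bits <0001>, <1100>, and <0110>. Each ME leads to a memory descriptor (MD) that describes an application memory region and has an optional event queue (EQ).]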
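The receive-side path described above (portal index, match list, ME, MD, optional EQ) can be modelled in a few lines. This is a rough sketch under stated assumptions rather than the Portals API: the class and function names are hypothetical, and the matching rule used here (compare the incoming bits against the ME's match bits, skipping any bit set in the ME's ignore mask) is our understanding of Portals-style matching.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryDescriptor:
    buffer: bytearray
    event_queue: Optional[List[tuple]] = None   # None models "no EQ"


@dataclass
class MatchEntry:
    match_bits: int
    ignore_bits: int
    md: MemoryDescriptor

    def matches(self, incoming_bits: int) -> bool:
        # Bits set in ignore_bits do not take part in the comparison.
        return ((incoming_bits ^ self.match_bits) & ~self.ignore_bits) == 0


def deliver_put(portal_table, portal_index, incoming_bits, offset, data):
    """Walk the match list for one portal index; return the ME that accepted."""
    for me in portal_table[portal_index]:
        if me.matches(incoming_bits):
            me.md.buffer[offset:offset + len(data)] = data
            if me.md.event_queue is not None:
                me.md.event_queue.append(("PUT_END", incoming_bits, len(data)))
            return me
    return None   # no ME matched: the message is dropped in this model


# Three MEs with the example match bits from the figure.
events = []
table = {
    0: [
        MatchEntry(0b0001, 0, MemoryDescriptor(bytearray(8))),
        MatchEntry(0b1100, 0, MemoryDescriptor(bytearray(8), events)),
        MatchEntry(0b0110, 0, MemoryDescriptor(bytearray(8))),
    ]
}
accepted = deliver_put(table, 0, 0b1100, 2, b"hi")
```

In this run the second ME accepts the message, the payload lands at offset 2 of its buffer, and one event is posted to its queue.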

SLIDE 7

GASNet Put in Portals-conduit

Node 0’s gasnet_put of A to B becomes: PortalsPut(RARSRC, offset(A), RARME | op_id, offset(B))

[Figure: A lies in the GASNet segment in Node 0's memory and B in the GASNet segment in Node 1's memory. On Node 0, the RARSRC MD posts to the SAFE EQ: SEND_END signals local completion and ACK signals remote completion. On Node 1, the message passes through the RAR PTE in the Portal Table and its match list to the RAR ME and the RAR MD, which has no EQ. The operation identifier is smuggled through ignored match bits.]
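The "smuggling" trick can be sketched as simple bit packing: the low bits select the target ME, and the remaining bits, which the target ME ignores when matching, carry the operation identifier back to the initiator in completion events. The field widths below are assumptions chosen for illustration; the slides do not state the actual layout.

```python
# Assumed layout (not from the slides): low 4 bits select the ME,
# the remaining bits of a 64-bit match value carry the operation id.
SELECTOR_BITS = 4
SELECTOR_MASK = (1 << SELECTOR_BITS) - 1
ALL_BITS = (1 << 64) - 1
IGNORE_OP_ID = ALL_BITS & ~SELECTOR_MASK   # the ME ignores every op_id bit

RAR_ME = 0x0      # match-bit values as listed in the data structures table
RARSRC_ME = 0x2


def pack_match_bits(selector: int, op_id: int) -> int:
    """Combine an ME selector and an operation id into one match value."""
    assert 0 <= selector <= SELECTOR_MASK
    assert 0 <= op_id < (1 << (64 - SELECTOR_BITS))
    return (op_id << SELECTOR_BITS) | selector


def unpack_match_bits(bits: int):
    """Recover (selector, op_id) from a match value seen in an event."""
    return bits & SELECTOR_MASK, bits >> SELECTOR_BITS


def me_accepts(me_match_bits: int, me_ignore_bits: int, incoming: int) -> bool:
    # Same masked comparison used for Portals-style matching.
    return ((incoming ^ me_match_bits) & ~me_ignore_bits & ALL_BITS) == 0


bits = pack_match_bits(RAR_ME, 0xBEEF)
```

Here `bits` is accepted by an ME with match bits `RAR_ME` and ignore mask `IGNORE_OP_ID`, regardless of the operation id, while the initiator can still recover `0xBEEF` from the bits echoed in its SEND_END and ACK events.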

SLIDE 8

GASNet Get in Portals-conduit

Node 0’s gasnet_get of B to C becomes: PortalsGet(TMPMD, 0, RARME | op_id, offset(B))

[Figure: C lies in Node 0's memory and B in the GASNet segment in Node 1's memory. On Node 0, TMPMD is a dynamically-created MD for a large out-of-segment reference; it posts REPLY_END to the SAFE EQ, which signals get completion. On Node 1, the request passes through the RAR PTE in the Portal Table and its match list to the RAR ME and the RAR MD, which has no EQ.]

SLIDE 9

Performance: Small Put Latency

  • All performance results taken on 2 nodes of Franklin, quad-core XT4 @ NERSC
  • Portals-conduit outperforms GASNet-over-MPI by about 2x
  • Semantically-induced costs of implementing put/get over message passing
  • Leverages Portals-level acknowledgement for remote completion
  • Outperforms a raw MPI ping/pong by eliminating software overheads

[Figure: Latency of Blocking Put (µs) vs. Payload Size (bytes, 1 to 1024) for mpi-conduit Put, MPI Ping-Ack, and portals-conduit Put. Down is good.]

SLIDE 10

Performance: Large Put Bandwidth

  • Portals-conduit exposes the full zero-copy RDMA bandwidth of the SeaStar
  • Meets or exceeds achievable bandwidth of a raw MPI flood test
  • mpi-conduit bandwidth suffers because the payload is copied twice

[Figure: Bandwidth of Non-Blocking Put (MB/s) vs. Payload Size (bytes, 2K to 2M) for portals-conduit Put, OSU MPI BW test, and mpi-conduit Put. Up is good.]

SLIDE 11

GASNet AM Request in Portals-conduit

Node 0’s gasnet_AMRequestMedium becomes: PortalsPut(ReqSB_MD, offset(sendbuffer), Req_ME | op_id | <AM metadata>, 0)

[Figure: On Node 0, the AM Request is sent from the request send buffers through the ReqSB MD, which posts to the SAFE EQ. On Node 1, it passes through the AM PTE in the Portal Table and its match list to the Req ME, and lands in the request receive buffers (ReqRB MD, triple buffered). A PUT_END event on the AM EQ causes the AM Request handler to be executed. ReqRB has a locally-managed offset.]
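A locally-managed offset means the receiving side, not the sender, decides where each incoming request lands, so senders can all pass offset 0 and messages are simply appended. The following is a simplified model of that behaviour with rotating receive buffers; the class names, buffer sizes, and the rotation policy are illustrative assumptions, not the conduit's actual logic.

```python
class ManagedRecvBuffer:
    """One receive buffer whose append position is tracked by the receiver."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.local_offset = 0

    def try_append(self, payload):
        # Return the offset used, or None if the payload does not fit.
        if self.local_offset + len(payload) > len(self.data):
            return None
        start = self.local_offset
        self.data[start:start + len(payload)] = payload
        self.local_offset += len(payload)
        return start


class RequestReceiver:
    """Rotates through several buffers (three by default)."""

    def __init__(self, nbuffers=3, size=64):
        self.buffers = [ManagedRecvBuffer(size) for _ in range(nbuffers)]
        self.active = 0

    def receive(self, payload):
        # Try the active buffer first, then move on to the next ones.
        for _ in range(len(self.buffers)):
            index = self.active
            offset = self.buffers[index].try_append(payload)
            if offset is not None:
                return index, offset
            self.active = (self.active + 1) % len(self.buffers)
        return None   # every buffer is full: flow control must prevent this

    def recycle(self, index):
        # Called once all handlers for messages in this buffer have run.
        self.buffers[index].local_offset = 0
```

The `None` return from `receive` is the overflow case that a flow-control scheme has to rule out.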
SLIDE 12

GASNet AM Reply in Portals-conduit

Node 1’s gasnet_AMReplyMedium becomes: PortalsPut(RplSB_MD, offset(sendbuffer), Rpl_ME | op_id | <AM metadata>, request_offset)

[Figure: On Node 1, the AM Reply is sent from the reply send buffers through the RplSB MD, which posts to the SAFE EQ. On Node 0, it passes through the AM PTE in the Portal Table and its match list to the Rpl ME, and lands in the request send buffers (ReqSB MD). A PUT_END event on the SAFE EQ causes the AM Reply handler to be executed.]

SLIDE 13

Portals-conduit Data Structures

  • RAR PTE: covers GASNet segment with 3 MD’s with diff EQs
  • AM PTE: Active Message buffers
  • 3 MD’s: Request Send/Reply Recv, Request Recv, and Reply Send
  • EQ separation for deadlock-free AM
  • TMPMD’s created dynamically for transfers with out-of-segment local side

MD      PTE   Match Bits  Ops Allowed  Offset Mgt.  Event Queue  Description
RAR     RAR   0x0         PUT/GET      REMOTE       NONE         Remote segment: dst of Put, src of Get
RARAM   RAR   0x1         PUT          REMOTE       AM_EQ        Remote segment: dst of RequestLong payload
RARSRC  RAR   0x2         PUT          REMOTE       SAFE_EQ      Remote segment: dst of ReplyLong payload. Local segment: src of Put/Long payload, dst of Get
ReqRB   AM    0x3         PUT          LOCAL        AM_EQ        Dest of AM Request Header (double-buffered)
ReqSB   AM    0x4         PUT          REMOTE       SAFE_EQ      Bounce buffers for out-of-segment Put/Long/Get, AM Request Header src, AM Reply Header dst
RplSB   none  none        N/A          N/A          SAFE_EQ      Src of AM Reply Header
TMPMD   none  none        N/A          N/A          SAFE_EQ      Large out-of-segment local addressing: src of Put/AM Long payload, dest of Get
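Read together, the table rows imply a choice of local MD for the initiator side of a put or get: in-segment memory uses RARSRC, small out-of-segment transfers go through the ReqSB bounce buffers, and large out-of-segment transfers get a dynamically created TMPMD. The sketch below encodes that reading. The function name, the segment bounds, and the `bounce_limit` threshold are assumptions for illustration; the slides do not give the cutoff between "bounce buffer" and "large".

```python
# MD name -> (PTE, match bits, event queue), transcribed from the table.
# None marks entries listed as "none" in the table.
MD_TABLE = {
    "RAR":    ("RAR", 0x0, None),
    "RARAM":  ("RAR", 0x1, "AM_EQ"),
    "RARSRC": ("RAR", 0x2, "SAFE_EQ"),
    "ReqRB":  ("AM",  0x3, "AM_EQ"),
    "ReqSB":  ("AM",  0x4, "SAFE_EQ"),
    "RplSB":  (None,  None, "SAFE_EQ"),
    "TMPMD":  (None,  None, "SAFE_EQ"),
}


def local_md_for(addr, nbytes, seg_base, seg_size, bounce_limit=1024):
    """Pick the local MD for the initiator side of a put or get.

    bounce_limit is a hypothetical threshold, not a value from the slides.
    """
    in_segment = seg_base <= addr and addr + nbytes <= seg_base + seg_size
    if in_segment:
        return "RARSRC"   # local segment: src of Put, dst of Get
    if nbytes <= bounce_limit:
        return "ReqSB"    # copy through a bounce buffer
    return "TMPMD"        # create and destroy an MD for this transfer
```

All three choices post their local completion events to SAFE_EQ according to the table, so the completion path is the same whichever MD is picked.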

SLIDE 14

Portals-conduit Flow Control

  • Most significant challenge in the AM implementation
  • Prevent overflowing recv buffers at the target
  • Prevent overflowing EQ space at either end
  • Local-side resources managed using send tokens
  • Request injection acquires EQ and buffer space for send and Reply recv
  • Still need to prevent overflows at remote (target) end
  • Initial approach: Statically Partition recv resources between peers
  • Reserve worst-case space at target for each sender to get full B/W
  • Initiator-managed, per-target credit system
  • Requests consume credits (based on payload sz), Replies return them
  • Downside: Non-scalable buffer memory utilization
  • Final approach: Dynamic credit redistribution
  • Reserve space for each receiver to get full B/W
  • Each peer starts with minimal credits, rest banked at the target
  • Target loans additional credits to “chatty” peers, and revokes from “quiet” ones
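The credit scheme above can be sketched as a small simulation: each initiator spends credits to send Requests and gets them back with Replies, while the target keeps a bank of spare credits that it loans to busy peers and reclaims from idle ones. The classes, the one-credit-per-Reply loan policy, and the `floor` parameter are illustrative assumptions, not the conduit's actual policy.

```python
class Initiator:
    """Initiator-side view of the credits it holds for one target."""

    def __init__(self, credits):
        self.credits = credits

    def try_request(self, cost):
        # A Request may be injected only if enough credits are on hand.
        if cost > self.credits:
            return False
        self.credits -= cost
        return True

    def on_reply(self, returned, loaned=0):
        # Replies return the credits the Request consumed, plus any loan.
        self.credits += returned + loaned

    def surrender(self, wanted, floor):
        # Give back up to `wanted` credits without dropping below `floor`.
        give = max(0, min(wanted, self.credits - floor))
        self.credits -= give
        return give


class Target:
    """Target-side bank of credits not currently held by any peer."""

    def __init__(self, total_credits, n_peers, initial):
        assert total_credits >= n_peers * initial
        self.bank = total_credits - n_peers * initial

    def reply(self, initiator, cost, chatty):
        loan = 0
        if chatty and self.bank > 0:   # assumed policy: loan one per Reply
            loan = 1
            self.bank -= 1
        initiator.on_reply(cost, loan)

    def revoke(self, initiator, wanted, floor=1):
        self.bank += initiator.surrender(wanted, floor)


target = Target(total_credits=8, n_peers=2, initial=2)
chatty, quiet = Initiator(2), Initiator(2)
```

In this model the sum of the bank, the credits held by initiators, and the credits consumed by in-flight Requests always equals the target's total, which is what keeps the receive buffers from overflowing.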

SLIDE 15

Performance: Active Message Latency

  • Shows the benefit of implementing AM natively
  • Portals-conduit AMs outperform mpi-conduit
  • Less per-message metadata, big advantage under 1 packet
  • Beyond one packet, lower software overhead w/o MPI

[Figure: AM Medium Round-trip Latency (µs) vs. Payload Size (bytes, 1 to 1024) for mpi-conduit and portals-conduit. Down is good.]

SLIDE 16

Performance: Out-of-segment Put Bandwidth (Firehose)

  • Blocking put test (no overlap), exaggerates software overheads
  • TMPMD pays synchronous MD create/destroy every transfer
  • Incurs a pinning cost linear in the page count (on CNL)
  • Firehose exploits spatial/temporal locality to reuse local MDs
  • LRU algorithm with region coalescing – quickly discovers the working set
  • Provides 4% to 8% bandwidth improvement

[Figure: Bandwidth of Blocking Put (MB/s) vs. Payload Size (bytes, 2K to 2M) for portals-conduit w/Firehose and portals-conduit w/TMPMD. Up is good.]
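The idea of reusing registrations can be shown with a plain LRU cache of pinned pages. This is a deliberately simplified stand-in, not the Firehose algorithm: it tracks individual pages rather than coalesced regions, and the `RegistrationCache` name, the 4096-byte page size, and the page limit are assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096   # assumed page size


class RegistrationCache:
    """LRU cache of pinned pages; a miss models a registration cost."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.pinned = OrderedDict()   # page number -> True, oldest first

    def acquire(self, addr, nbytes):
        """Ensure [addr, addr + nbytes) is pinned; return pages newly pinned."""
        first = addr // PAGE_SIZE
        last = (addr + nbytes - 1) // PAGE_SIZE
        newly_pinned = 0
        for page in range(first, last + 1):
            if page in self.pinned:
                self.pinned.move_to_end(page)   # mark as recently used
            else:
                self.pinned[page] = True
                newly_pinned += 1
        # Evict least recently used pages once over the limit. A transfer
        # larger than max_pages would evict its own pages here; a real
        # implementation has to handle that case separately.
        while len(self.pinned) > self.max_pages:
            self.pinned.popitem(last=False)
        return newly_pinned


cache = RegistrationCache(max_pages=4)
```

Repeated transfers from the same buffer return 0 from `acquire`, which corresponds to skipping the per-transfer MD create/destroy that the TMPMD path pays.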

SLIDE 17

Conclusions

  • Portals-conduit delivers good GASNet performance on Cray XT
  • Outperforms generic GASNet-over-MPI by about 2x
  • Microbenchmark performance competitive with raw MPI
  • Solid comm. foundation for many PGAS compilers
  • Future Work
  • Expand Firehose integration to include remote memory
  • Acknowledgements:
  • Thanks to all at Cray who helped in our efforts!
  • Office of Science DOE Contracts DE-AC02-05CH11231, DE-FC03-01ER25509

  • NERSC, DOE Contract DE-AC02-05CH11231
  • ORNL, DOE Contract DE-AC05-00OR22725
  • NSF TeraGrid & PSC System Access

For more information: http://gasnet.cs.berkeley.edu http://upc.lbl.gov