Basic features of the API / Memory allocation and sample API calls



SLIDE 2

• DMAPP in context
• Basic features of the API
• Memory allocation and sample API calls
• Preliminary Gemini performance measurements

SLIDE 3

The Distributed Memory Application (DMAPP) API

• Supports features of the Gemini Network Interface
• Used by higher levels of the software stack:
  – PGAS compiler runtime
  – SHMEM library
• Balance between portability and hardware intimacy
• Intended to be used by system software developers
• Application developers should use SHMEM

SLIDE 4

(Figure: Gemini software stack. Apps layer: MPICH2, Cray SHMEM, PGAS compilers; PE layer: DMAPP and user-level GNI; kernel layer: kernel-level GNI and Linux Core; HW layer: Gemini HW Abstraction Layer and Gemini network processor.)

SLIDE 5

• Distributed memory model
• One-sided model for participating (SPMD) processes launched by the ALPS aprun command
• Each PE has local memory but has one-sided access (PUT/GET) to remote memory
• Remote memory has to be in an accessible memory segment

SLIDE 6


• Network supports direct remote get/put from user process to user process
• Mechanisms:
  – Block Transfer (BTE)
  – Fast Memory Access (FMA), including Atomic Memory Operations (AMOs)

(Figure: a put moves data from a source PE to a destination PE.)

SLIDE 7

• Remote source or destination can be in either the data or the symmetric-heap segment
• Symmetry means we can use local address information in a remote context

(Figure: a remote operation targets another process's exported memory segments.)

SLIDE 8

• dmapp_init
  – Sets up access to data and symmetric heap (exports memory)
  – barrier
  – You can set or read available resource limits
• dmapp_get_jobinfo (used in the sketch below)
  – Returns a structure with useful information: the number of PEs, the index of this PE, and pointers to the data and symmetric heap segments required in other calls
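A minimal initialization sketch follows. It is not taken from the slides: the dmapp.h header name, the DMAPP_RC_SUCCESS return code, and the jobinfo field names (pe, npes, data_seg, sheap_seg) are assumptions to check against your DMAPP headers, and a real code would fill in the requested RMA attributes rather than leaving them zeroed.

#include <stdio.h>
#include <dmapp.h>                      /* assumed header name */

int main(void)
{
    dmapp_rma_attrs_t requested = {0};  /* resource limits left at defaults here */
    dmapp_rma_attrs_t actual;
    dmapp_jobinfo_t   job;

    /* Export the data and symmetric-heap segments. */
    if (dmapp_init(&requested, &actual) != DMAPP_RC_SUCCESS)
        return 1;

    /* Query the job layout: my PE index, number of PEs, segment descriptors. */
    dmapp_get_jobinfo(&job);
    printf("PE %d of %d\n", (int)job.pe, (int)job.npes);

    /* job.data_seg and job.sheap_seg are needed by later put/get calls. */

    dmapp_finalize();                   /* assumed shutdown counterpart to dmapp_init */
    return 0;
}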

SLIDE 9

dmapp_put(*target_addr, *target_seg, target_pe, source_addr, nelems, type)

• Remote locations defined by: address, segment, PE
• This is a blocking operation
• type can be DMAPP_{BYTE, DW, QW, DQW} for 1, 4, 8 and 16 bytes
• Analogous get call
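A hedged sketch of the blocking put, continuing from the initialization sketch above (the job variable comes from dmapp_get_jobinfo). Passing &job.data_seg as the segment descriptor and the dmapp_pe_t type name are assumptions; remote_buf works as a remote address because static data lives in the exported data segment (see the later slide on exported memory).

static double remote_buf[8];   /* static, so it sits in the exported data segment */
double        local_buf[8] = {0};
dmapp_pe_t    dest = 1;        /* hypothetical target PE */

/* Blocking put of eight 8-byte elements into remote_buf on PE 'dest'. */
dmapp_put(remote_buf,          /* target address, interpreted on the target PE */
          &job.data_seg,       /* target segment descriptor */
          dest,                /* target PE */
          local_buf,           /* local source address */
          8,                   /* number of elements */
          DMAPP_QW);           /* 8-byte (quad-word) elements */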

SLIDE 10

• Blocking (no suffix)
  – dmapp_put, dmapp_get
• Non-blocking explicit (_nb suffix; see the example below)
  – dmapp_put_nb(…, syncid)
• Non-blocking implicit (_nbi suffix)
  – No handle to test for completion
• Synchronization (memory completion/visibility)
  – Can wait on a specific syncid
  – Can wait for all implicit operations to complete
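A sketch of the explicit non-blocking variant, reusing the names from the blocking sketch above. The handle type name dmapp_syncid_handle_t, and the assumption that dmapp_put_nb takes the same arguments as dmapp_put plus the syncid pointer, should be checked against dmapp.h.

dmapp_syncid_handle_t syncid;

/* Starts the transfer and returns immediately; completion is tracked via syncid. */
dmapp_put_nb(remote_buf, &job.data_seg, dest, local_buf, 8, DMAPP_QW, &syncid);

/* ... overlap independent computation here ... */

dmapp_syncid_wait(&syncid);    /* block until this particular transfer completes */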

SLIDE 11


• Strided calls
  – dmapp_iput…, dmapp_iget…
• Additional arguments define the source and destination stride in terms of elements
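A sketch of a strided put, again reusing job and dest from the earlier sketches. The slide only says that extra arguments give the source and destination strides in elements; the position and order of those arguments below (target stride, then source stride, before nelems) is an assumption to verify against dmapp.h.

static double remote_vec[32];  /* exported (static) target array on each PE */
double        local_vec[8] = {0};

/* Write local_vec[0..7] to remote_vec[0], remote_vec[4], ..., remote_vec[28]. */
dmapp_iput(remote_vec, &job.data_seg, dest, local_vec,
           4,                  /* target stride in elements (assumed position) */
           1,                  /* source stride in elements (assumed position) */
           8,                  /* number of elements */
           DMAPP_QW);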

(Figure: put writes a contiguous block of remote data; iput writes remote data at a stride.)

SLIDE 12

• Scatter/gather calls
  – dmapp_ixput…, dmapp_ixget…
• Local data is contiguous
• Remote data is distributed as defined by an array of offsets
(Figure: put writes a contiguous block of remote data; ixput scatters to remote locations given by the offset array.)

SLIDE 13

• Put with indexed PE-stride calls
  – dmapp_put_ixpe…, dmapp_get_ixpe…
• Local data is contiguous
• Remote data is distributed (as defined by an array of PE offsets) to the same address on each PE
• Use for small amounts of data
• These are not collective operations

(Figure: comparison of put and put_ixpe with nelems=3, writing to the same address on PE 0, PE 1 and PE 2.)

SLIDE 14

• Scatter/gather with indexed PE-stride calls
  – dmapp_scatter_ixpe, dmapp_gather_ixpe
• Local data is contiguous
• Source is scattered to (or gathered from) PEs, nelems elements at a time

(Figure: comparison of put and scatter_ixpe with nelems=1, moving one element at a time to PE 2, PE 4 and PE 6.)

SLIDE 15

Atomic operations to 8-byte (QW) remote data

Command   Operation
AADD      Atomic ADD
AAND      Atomic AND
AOR       Atomic OR
AXOR      Atomic EXCLUSIVE OR
AFADD     Atomic fetch and ADD
AFAND     Atomic fetch and AND
AFOR      Atomic fetch and OR
AFXOR     Atomic fetch and XOR
AFAX      Atomic fetch AND-EXCLUSIVE OR
ACSWAP    Compare and SWAP

SLIDE 16

• Direct support in the NIC
• Be careful to only read values via the DMAPP API


SLIDE 17

• Some calls return a syncid (_nb)
• Can test or wait on completion
  – dmapp_syncid_wait(*syncid)
  – dmapp_syncid_test(*syncid, *flag)
• For implicit non-blocking (_nbi)
  – dmapp_gsync_wait()
  – dmapp_gsync_test(*flag)
  – Use for many small messages
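A sketch of the implicit non-blocking path for many small messages, assuming dmapp_put_nbi takes the same arguments as the blocking put (with no syncid) and reusing job from the initialization sketch.

static uint64_t mailbox[1024];  /* exported; one slot per sending PE (sized generously) */
uint64_t        msg = 42;
int             i;

/* Send one element to every other PE; no per-transfer handle is returned. */
for (i = 0; i < (int)job.npes; i++) {
    if (i != (int)job.pe)
        dmapp_put_nbi(&mailbox[job.pe],   /* same exported address on every target PE */
                      &job.data_seg, (dmapp_pe_t)i,
                      &msg, 1, DMAPP_QW);
}

/* Wait for all outstanding implicit operations issued by this PE to complete. */
dmapp_gsync_wait();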

SLIDE 18

• DMAPP applications can allocate memory in the symmetric heap

double *a;
a = (double *) dmapp_sheap_malloc(N * sizeof(double));

• Associated realloc and free calls
• Application is responsible for maintaining symmetry of allocations
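A sketch tying the symmetric heap to the earlier put call: because every PE makes the same allocation, the local pointer can be used as the remote address, together with the symmetric-heap segment descriptor from dmapp_get_jobinfo (field name sheap_seg assumed). The names dmapp_sheap_free and dmapp_sheap_realloc are assumed from the "associated realloc and free calls" above; job, dest and local_buf come from the earlier sketches.

/* Same allocation on every PE, so 'a' denotes the matching symmetric-heap
   object on each of them. */
double *a = (double *) dmapp_sheap_malloc(N * sizeof(double));

/* Local address + symmetric-heap segment + PE identifies the remote copy. */
dmapp_put(a, &job.sheap_seg, dest, local_buf, 8, DMAPP_QW);

/* ... use the data, then release the symmetric allocation on every PE ... */
dmapp_sheap_free(a);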

SLIDE 19

DMAPP exports the data and symmetric heap segments for you. This means:

• For C
  – File scope and static data inside functions
  – Data allocated in the symmetric heap
• For Fortran (no API yet, but if there were)
  – SAVEd data
  – Data in COMMON

SLIDE 20

• Atomic add for the master counter (FADD for testing)
• Master compares (with n-1) and swaps with 0
• … master releases the other PEs

(Figure: several PEs each atomically add 1 to a barrier counter held on one PE.)

SLIDE 21

static uint64_t barrier_counter, bc;

if (mype == master) {
    do { // wait until counter is npes-1, swap with 0
        dmapp_acswap_qw(&bc, (void *)&barrier_counter,
                        seg_data, mype, npes - 1, 0);
    } while (bc != (npes - 1));
} else {
    dmapp_aadd_qw((void *)&barrier_counter, seg_data, master, 1);
}
// now release barrier…

SLIDE 22

• SHMEM
  – Has the same SPMD model
  – Requires use of symmetric memory
  – Original interface is blocking
  – Non-standard extensions for non-blocking put/get
  – Varying-sized data items with a typed API
  – Get/put with strided and gather/scatter variants
  – Barrier and collective operations on sets of PEs
  – Has the same atomic memory operations
• SHMEM is implemented using DMAPP on Gemini systems
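For comparison, a sketch of roughly the same blocking put written against SHMEM rather than DMAPP. The mpp/shmem.h header name and the start_pes/shmem_my_pe/shmem_n_pes spellings are assumptions about the Cray SHMEM of that period; the point is that SHMEM hides the segment descriptors that the DMAPP calls pass explicitly.

#include <mpp/shmem.h>             /* assumed Cray SHMEM header of the period */

static double remote_buf[8];       /* symmetric: the same static object on every PE */

int main(void)
{
    double local_buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int    dest;

    start_pes(0);                  /* SPMD start-up, analogous to dmapp_init */
    dest = (shmem_my_pe() + 1) % shmem_n_pes();

    /* Blocking put of eight doubles into remote_buf on PE 'dest';
       no segment descriptor is needed, unlike the DMAPP call. */
    shmem_double_put(remote_buf, local_buf, 8, dest);

    shmem_barrier_all();           /* synchronize before anyone reads remote_buf */
    return 0;
}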

SLIDE 23

• Data measured on a prototype system during Q1 2010
• 2100 MHz Opteron processors
• 2400 MHz HyperTransport interface
• Dual-node tests run between PEs on neighbouring Gemini routers

SLIDE 24

(Figure: time in microseconds versus transfer size from 8 to 1024 bytes, for PUT measured ping-pong, PUT measured at the source, and GET.)

SLIDE 25

(Figure: bandwidth in Mbytes/sec versus element size from 8 bytes to 64 Kbytes, for PPN=1, PPN=2 and PPN=4.)

SLIDE 26

(Figure: bandwidth in Mbytes/sec versus the number of non-blocking puts (1 to 64), for 8-, 64- and 256-byte transfers.)

SLIDE 27

(Figure: rate in millions per second versus stride in 64-bit words (2 to 4096), for vector lengths of 16, 64 and 4096.)

SLIDE 28

(Figure: AMO rate in millions per second versus the number of processes (256 to 1024), for the 1 AMO and 8192 AMOs cases.)

SLIDE 29

• Latency (~1 µs) far better than SeaStar
• Good aggregate bandwidths on small transfers
• High AMO rates, especially when multiple processes target the same variables
• Strided puts are an important case for CAF
• Ongoing optimization effort (for example, reducing the number of FMA descriptor updates)

SLIDE 30

• What is DMAPP and where does it fit?
• Basic features of the API
• Memory allocation and sample API calls
• Preliminary Gemini performance data
