SLIDE 1

LRTS: A Portable High Performance Low-level Communication Interface

Yanhua Sun, Laxmikant (Sanjay) V. Kalé

University of Illinois at Urbana-Champaign

sun51@illinois.edu

April 15, 2013

Yanhua Sun U of Illinois at Urbana-Champaign 1/24


SLIDE 4

Motivation

What the vendors provide

Modern supercomputers, especially networks, are complicated

What the programming models require

Global address space models
Message passing model
Message-driven (active message) models

A minimum set of functions to implement runtime systems


SLIDE 5

Outline

Goals of LRTS
Charm++ architecture on LRTS
Core APIs and extended APIs
Performance of microbenchmarks and NAMD
Future work


SLIDE 6

Goals of LRTS

Goal = Completeness + Productivity + Portability + Performance


SLIDE 7

Goals of LRTS

Completeness

Sufficient to run Charm++

Productivity

Porting requires no knowledge of Charm++
Easy for Charm++ developers to add new features (e.g., Replica)

Portability

Functions should not depend on specific machines

Performance

Leaves room for optimization



SLIDE 12

Charm++ Architecture Based on LRTS

[Figure: the Charm++ software stack based on LRTS. Applications (NAMD, ChaNGa, OpenAtom) and libraries/languages (Charm++, SDAG, MSA, Charisma) sit on the CHARM++ programming model (chares, chare arrays, entry methods, load balancing, Projections). Below it, the Converse runtime system provides the message scheduler, queues, threads, seed load balancer, and Converse initialization. LRTS splits this layer into machine-independent parts (common broadcast, SMP/non-SMP implementation) and machine-specific parts (machine-specific init, communication), implemented on DCMF, TCP/IP, MPI, uGNI, and other machine layers.]


SLIDE 13

Charm++ Naming Rules

CkFoo (most used by Charm++ programmers)
CmiFoo (Converse programs)
LrtsFoo (only for vendors)



SLIDE 16

Messaging Flow

Non-SMP mode: one process per core (hardware thread)
SMP mode: one thread per core (hardware thread)

Intra-node communication by passing pointers
Dedicated communication thread

[Figure: SMP-mode messaging flow. On Node 0, worker threads 0 and 1 each own a message queue; a dedicated communication thread drains the sending message queue onto the network, and incoming messages from Node 1 are received and delivered into the per-thread queues.]


SLIDE 17

Core APIs

required to run Charm++

Startup and Shutdown

void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
void LrtsExit()
void LrtsBarrier()



SLIDE 20

Core APIs - P2P communication

Sending messages

CmiCommHandle LrtsSendFunc(int destNode, int destPE, int size, char *msg, int mode);
Different protocols depending on message size
Buffering scheme in the machine layer

LrtsAdvanceCommunication

void LrtsAdvanceCommunication(int whileidle);
Sending buffered messages
Polling the network

void handleOneRecvedMsg(int size, char *msg) — upcall that hands a received message to Converse


SLIDE 21

Extended APIs - Memory

Memory Management

void *LrtsAlloc(int n_bytes)
void LrtsFree(void *msg)
Pinned memory pool (uGNI)
L2Atomic queues for freed messages



SLIDE 23

Extended APIs - Persistent Messages

Persistent messages

Communication partners and sizes do not change
RDMA support (uGNI, PAMI, InfiniBand Verbs)
void LrtsSendPersistentMsg(PersistentHandle h, int destNode, int size, void *msg)



SLIDE 25

Extended APIs - Collectives

void LrtsBroadcast()
Common implementation + machine-specific implementations

Spanning tree
Hypercube

All functions are asynchronous


SLIDE 26

Status of LRTS

Cray machines with uGNI : XE, XK, XC

Sun et al., "A uGNI-Based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect," IPDPS 2012
Sun et al., "Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6," SC 2012

IBM machines : BlueGene/P with DCMF; BlueGene/Q with PAMI

Kumar et al., "Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q," IPDPS 2013

Machines supporting MPI
InfiniBand clusters


SLIDE 27

Performance - Latency on BGQ

[Figure: one-way latency (µs) on BG/Q for 32-byte, 1024-byte, and 8192-byte messages across four Charm++ architectures: PAMI, PAMI-LRTS, PAMI SMP, and PAMI-LRTS SMP.]

SLIDE 28

Performance - Bandwidth on BGQ

[Figure: bandwidth (GB/s) on BG/Q for 1024-byte, 32 KB, and 1 MB messages across four Charm++ architectures: PAMI, PAMI-LRTS, PAMI SMP, and PAMI-LRTS SMP.]

SLIDE 29

Application Performance

NAMD ApoA1 (92K atoms) with PME every 4 steps on BG/Q

[Figure: NAMD ApoA1 timestep (ms/step) on 32 nodes (2,048 hardware threads) and 64 nodes (4,096 hardware threads) of BG/Q, comparing PAMI and PAMI-LRTS in SMP and non-SMP modes.]

SLIDE 30

100M-atom Simulation on State-of-the-Art Machines

Best performance on Blue Waters: 8.9 ms/step with 25K nodes
Titan: 13 ms/step with 18K nodes
Blue Gene/Q: 17.9 ms/step with 16K nodes



SLIDE 32

Conclusion and Future work

Conclusion

The LRTS interface simplifies runtime implementation on new hardware
LRTS maintains good performance

Future work

Message buffering and scheduling
Fault tolerance interface
Implementing other runtime systems (Unistack)
