SLIDE 1

UCSB cs240b project Fall 1999

PTMPI

Threaded MPI execution on a cluster of SMP machines

Zoran Dimitrijevic, Department of Computer Science, University of California at Santa Barbara. E-mail: zoran@cs.ucsb.edu

SLIDE 2

Introduction

  • Cluster of SMP machines
  • Each cluster node is an SMP machine
  • Communication between the nodes is over Ethernet (TCP/IP)
  • Current MPI implementations for shared-memory machines:
  • TMPI – threaded MPI execution – each MPI node is a thread inside one process
      – Fast
      – Not scalable – a regular OS process can run on only one machine
  • MPICH – each MPI node is a process – communication between the nodes involves operating-system activity
      – Slow
      – Scalable – each node can run on a different machine

SLIDE 3

Problem Statement

  • The system consists of several processes
  • Scalability – each process can run on a different machine
  • Communication between the processes is through sockets
  • Processes can run anywhere on the network
  • Each MPI node is a thread inside a process
  • Fast communication between the MPI nodes inside the same process – through shared memory
  • The nodes are created during startup – each process can have a different number of MPI nodes running inside it

SLIDE 4

Proposed Solution

  • PTMPI Startup (sketched below):
  • Configuration is read from a resource file
  • Each process is started with a single initialization argument – its process ID
  • Each process gets its IP address and listening port number
  • There are p processes in the system
  • A complete socket graph is created – p(p-1)/2 sockets
  • Each process creates local_MPI_count receiver queues
  • Each process creates a thread for each MPI node running on it
  • Each process creates two communication threads:
      – In communicator – reads from the sockets and dispatches messages
      – Out communicator – reads from its queues (one per MPI thread) and writes to the sockets
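A minimal sketch of this startup sequence, in C++ with pthreads, is shown below. The resource-file format, the helper functions (parse_resource_file, connect_all_pairs, mpi_node_entry, in_communicator, out_communicator) and the queue types are assumptions made for illustration; the real PTMPI code is not shown on the slides.

    // Hypothetical PTMPI process startup (illustrative names, not the real code).
    // Assumed resource-file format, one line per process:
    //   <process_id> <ip> <port> <mpi_node_count>
    #include <pthread.h>
    #include <cstdlib>
    #include <string>
    #include <vector>

    struct ProcInfo  { int id; std::string ip; int port; int node_count; };
    struct RecvQueue { /* use_mutex, recv_cond, list of MPI_QueueElem (slide 8) */ };
    struct OutQueue  { /* outgoing MPI_QueueElem list, drained by the out communicator */ };

    // Stubs standing in for the real parsing, connection and thread bodies.
    std::vector<ProcInfo> parse_resource_file(const char *) { return {}; }
    std::vector<int> connect_all_pairs(const std::vector<ProcInfo> &, int) { return {}; }
    void *mpi_node_entry(void *arg)  { delete (int *)arg; return nullptr; } // create MPI_Node, run mpi_main
    void *in_communicator(void *)    { return nullptr; }                    // sockets -> recv_queues
    void *out_communicator(void *)   { return nullptr; }                    // out_queues -> sockets

    int main(int argc, char **argv) {
        int my_pid = (argc > 1) ? std::atoi(argv[1]) : 0;        // single argument: process ID
        std::vector<ProcInfo> procs = parse_resource_file("ptmpi.rc");

        // Complete socket graph over all p processes: p(p-1)/2 connections in total.
        std::vector<int> sockets = connect_all_pairs(procs, my_pid);

        int local_count = procs.empty() ? 0 : procs[my_pid].node_count;
        std::vector<RecvQueue> recv_queue(local_count);          // one receiver queue per local node
        std::vector<OutQueue>  out_queue(local_count);           // one outgoing queue per local node

        std::vector<pthread_t> nodes(local_count);
        for (int i = 0; i < local_count; ++i)                    // one thread per local MPI node
            pthread_create(&nodes[i], nullptr, mpi_node_entry, new int(i));

        pthread_t in_thr, out_thr;                               // the two communication threads
        pthread_create(&in_thr,  nullptr, in_communicator,  nullptr);
        pthread_create(&out_thr, nullptr, out_communicator, nullptr);

        for (int i = 0; i < local_count; ++i)
            pthread_join(nodes[i], nullptr);
        return 0;
    }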

SLIDE 5
  • MPI Node Thread Startup (see the class sketch after this list):
  • Each MPI node is an instance of class MPI_Node
  • PTMPI main creates a thread for each MPI node and passes it the local ID
  • Each thread creates a new instance of class MPI_Node
  • SPMD in shared memory:
      – All global data of the MPI program must be replicated for each thread
      – This is achieved because all MPI functions are either friend functions of class MPI_Node or defined inside class MPI_Node, and all global MPI data are members of class MPI_Node
      – All MPI global data can be placed in mpi_global_data.h, which is included in the MPI_Node class
  • Each thread calls the method mpi_main(int argc, char **argv)
      – Arguments are passed from the PTMPI main function except the first one (and the program name is set to mpi_program)
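A minimal sketch of this arrangement is given below; apart from mpi_main and mpi_global_data.h, which are named on the slide, the identifiers are assumptions made for illustration.

    // Every "global" of the MPI program becomes a data member of MPI_Node,
    // so each thread's instance carries its own copy of the program state.
    class MPI_Node {
    public:
        explicit MPI_Node(int local_id) : my_local_id(local_id) {}

        // The user's MPI program is compiled as a method instead of a free main().
        int mpi_main(int argc, char **argv) { /* user program body */ return 0; }

        // MPI calls are members of, or friends of, MPI_Node so they can reach
        // this node's state; a friend declaration might look like:
        // friend int MPI_Send(void *buf, int count, int dest, int tag);

    private:
        int my_local_id;                 // rank of this node inside the process

        // All globals of the user program are collected in one header and
        // included here, e.g.:
        // #include "mpi_global_data.h"
    };

    // Thread entry point: each thread builds its own MPI_Node and runs mpi_main.
    void *mpi_node_entry(void *arg) {
        int local_id = *(int *)arg;
        delete (int *)arg;
        MPI_Node node(local_id);
        static char prog_name[] = "mpi_program";   // argv[0] is set to mpi_program
        char *node_argv[] = { prog_name, nullptr };
        node.mpi_main(1, node_argv);
        return nullptr;
    }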

SLIDE 6
  • PTMPI System Layout:
[System layout diagram: three processes (Process 0: IP0, Process 1: IP1, Process 2: IP2), each containing an output daemon, an input daemon, and several MPI thread nodes (a different number per process); the processes are fully connected to each other by sockets.]

SLIDE 7
  • Process node layout:

[Process layout diagram: the In Communicator reads from p-1 sockets and dispatches incoming messages into the recv_queue of the destination MPI node thread; each local MPI node thread runs MPI_Node::mpi_main and owns one recv_queue; the Out Communicator drains one outgoing queue (Out_comm_queue) per local MPI thread and writes to p-1 sockets.]

Each thread writes to and reads from the recv_queues directly in shared memory (a sketch of the two communicator loops follows).
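A rough sketch of the two communicator loops implied by this layout is shown below; the message-header fields and the queue/socket helpers are assumptions made for illustration, not the actual PTMPI code.

    // Illustrative communicator thread loops (names and header layout assumed).
    #include <cstddef>

    struct MsgHeader { int dest_local_id; int dest_process; int tag; std::size_t size; };
    struct MPI_QueueElem;                                 // described on slide 9

    // Assumed helpers, stubbed out here; the real socket/queue code is not on the slides.
    bool read_message(int /*sock*/, MsgHeader &, MPI_QueueElem *&) { return false; }
    void enqueue_recv(int /*local_id*/, MPI_QueueElem *) {}   // push into recv_queue, signal recv_cond
    MPI_QueueElem *dequeue_any_out_queue(MsgHeader &) { return nullptr; }
    void write_message(int /*sock*/, const MsgHeader &, MPI_QueueElem *) {}

    // In communicator: reads from the p-1 sockets and dispatches each message
    // into the recv_queue of the destination MPI node thread.
    void in_communicator_loop(const int *socks, int nsocks) {
        for (;;) {
            for (int i = 0; i < nsocks; ++i) {
                MsgHeader h;
                MPI_QueueElem *elem = nullptr;
                if (read_message(socks[i], h, elem))
                    enqueue_recv(h.dest_local_id, elem);
            }
        }
    }

    // Out communicator: drains one outgoing queue per local MPI thread and
    // writes each message to the socket of its destination process.
    void out_communicator_loop(const int *socks) {
        for (;;) {
            MsgHeader h;
            MPI_QueueElem *elem = dequeue_any_out_queue(h);
            if (elem)
                write_message(socks[h.dest_process], h, elem);
        }
    }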

SLIDE 8
  • Receiver Queues

[Receiver queue diagram: a recv_queue is a chain of MPI_QueueElem entries, each with its own mutex and condition variable; the queue itself is protected by use_mutex and recv_cond and holds the pending recv_buffer and recv_request.]

SLIDE 9
  • Messages: MPI_QueueElem
  • Goal: minimize the number of memory copies in the system
  • All queues in the system use the same element class
  • Broadcast does not copy the message
  • Threads use the mutex and condition-variable members of MPI_QueueElem
  • The last waiter frees the message (if it is buffered) and deletes the element (see the sketch below)
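A hedged sketch of what such an element might look like is given below; apart from the mutex and condition members named on the slide, the field and method names are assumptions made for illustration.

    // Illustrative MPI_QueueElem: one heap-allocated element can sit in several
    // queues at once (e.g. a broadcast), so the payload is never copied per receiver.
    #include <pthread.h>
    #include <cstdlib>

    struct MPI_QueueElem {
        pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;

        void *data     = nullptr;    // message payload (may point at a sender-owned buffer)
        bool  buffered = false;      // true if the payload was copied into a temporary buffer
        int   waiters  = 0;          // receivers that still have to consume this element

        // Called by each receiver when it is done with the message; the last
        // waiter frees the buffered payload and deletes the element itself.
        void release() {
            pthread_mutex_lock(&mutex);
            bool last = (--waiters == 0);
            pthread_mutex_unlock(&mutex);
            if (last) {
                if (buffered)
                    std::free(data);
                delete this;         // elements are always allocated with new
            }
        }
    };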
SLIDE 10
  • MPI functions implemented:
  • MPI_Init
  • MPI_Comm_rank
  • MPI_Comm_size
  • MPI_Finalize
  • MPI_Send
  • MPI_Isend
  • MPI_Recv
  • MPI_Irecv
  • MPI_Wait
  • MPI_Broadcast
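The subset above is already enough for simple message-passing programs. A minimal example restricted to calls from this list is shown below (standard MPI C API; under PTMPI the same body would be the mpi_main method described on slide 5).

    // Minimal program using only the implemented subset.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Status status;
        if (rank == 0) {
            // Rank 0 sends one integer to every other rank and collects the replies.
            for (int dest = 1; dest < size; ++dest) {
                int token = dest;
                MPI_Send(&token, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, dest, 1, MPI_COMM_WORLD, &status);
                std::printf("rank 0 got %d back from rank %d\n", token, dest);
            }
        } else {
            int token;
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            token *= 2;                               // trivial "work"
            MPI_Send(&token, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }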
SLIDE 11

Initial Performance Evaluation

[Figure 3: Block-based matrix multiplication execution time in seconds for 16 MPI nodes running on four two-processor SMP nodes (MPICH vs. PTMPI, problem sizes 1024x16 to 2048x64).]

[Figure 4: Block-based matrix multiplication execution time in seconds for 8 MPI nodes running on four two-processor SMP nodes (MPICH vs. PTMPI, problem sizes 1024x16 to 2048x64).]

SLIDE 12

[Figure 6: Block-based matrix multiplication execution time in seconds for 16 MPI nodes running on four four-processor SMP nodes (MPICH vs. PTMPI, problem sizes 1024x16 to 2048x64).]

[Figure 5: Block-based matrix multiplication execution time in seconds for 32 MPI nodes running on four four-processor SMP nodes (MPICH vs. PTMPI, problem sizes 1024x16 to 2048x64).]

[Figure 7: PTMPI block-based matrix multiplication execution time in seconds as a function of the number of two-processor SMP nodes (1 to 16 nodes; 2048x32 with one node per CPU and with two nodes per CPU).]

[Figure 8: PTMPI block-based matrix multiplication execution time in seconds as a function of the number of four-processor SMP nodes (1 to 4 nodes; 2048x32 with one node per CPU and with two nodes per CPU).]

SLIDE 13

[Figure 10: PTMPI block-based matrix multiplication MFLOPS rate per processor as a function of the number of four-processor SMP nodes (1 to 4 nodes; problem sizes 1024x16 and 2048x32; one thread per processor).]

[Figure 9: PTMPI block-based matrix multiplication MFLOPS rate as a function of the number of two-processor SMP nodes (1 to 16 nodes; problem sizes 1024x16 and 2048x32; one thread per processor).]

SLIDE 14

Conclusions and Future Improvements

  • Basic MPI functions are implemented
  • The current MPI node-to-process mapping is a basic one; it is expected that smarter mapping can significantly improve the execution speedup for some applications
  • Since communication between threads is faster than communication through sockets, the MPI gathering functions need to be implemented
  • Spin waiting for send and receive inside the process when running on a real SMP (sketched below)
  • Sending only the message header through the socket if the message is big, and waiting for a message-data request until the receiver is ready
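A rough sketch of the proposed spin-waiting idea is given below. It is not part of the current implementation; the flag name and the fallback to the condition variable are assumptions made for illustration.

    // Proposed improvement (sketch): on a real SMP a receiver can spin briefly on a
    // ready flag before blocking on the condition variable, saving a context switch
    // for messages exchanged inside the same process.
    #include <pthread.h>
    #include <atomic>

    struct WaitSlot {
        std::atomic<bool> ready{false};                // set by the sending thread
        pthread_mutex_t   mutex = PTHREAD_MUTEX_INITIALIZER;
        pthread_cond_t    cond  = PTHREAD_COND_INITIALIZER;
    };

    // The sender sets slot.ready and signals slot.cond while holding slot.mutex.
    void wait_for_message(WaitSlot &slot, int spin_limit) {
        for (int i = 0; i < spin_limit; ++i)           // spin phase: only pays off on a real SMP
            if (slot.ready.load())
                return;

        pthread_mutex_lock(&slot.mutex);               // fall back to blocking on the condition
        while (!slot.ready.load())
            pthread_cond_wait(&slot.cond, &slot.mutex);
        pthread_mutex_unlock(&slot.mutex);
    }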