A Scalable Multicast Scheme for Distributed DAG Scheduling


  1. May 14, 2009. A Scalable Multicast Scheme for Distributed DAG Scheduling. Fengguang Song (1), Jack Dongarra (1,2), Shirley Moore (1). (1) University of Tennessee, (2) Oak Ridge National Laboratory. Scheduling for Large-Scale Systems Workshop, Knoxville, May 13-15, 2009.

  2. Contents
     • Motivation
     • Background: a programming model for DAG scheduling
     • Overview of the multicast scheme
     • Topology ID
     • Compact routing tables
     • Multicast examples
     • Experimental results

  3. Motivation
     • High performance on multicore machines
     • New software should have two characteristics:
       • Fine-grain threads
       • Asynchronous execution
     • We want to use dynamic DAG scheduling
     • Extremely scalable: we are thinking of millions of processing cores
     • Distributed-memory

  4. A DAG Example for Cholesky Factorization
     [Figure: task DAG over a 4x4 tile grid, tasks (1,1) through (4,4), showing loop-independent dependencies, loop-carried dependencies, and the critical path (T∞) ending at task (4,4)]

  5. Simple Programming Model
     Symbolic DAG interface:
     • int get_num_parents(const Task t);
     • int get_children(const Task t, Task *children);
     • set_entry_task(const Task t);
     • set_exit_task(const Task t);

  6. Interface Definition for Cholesky Factorization

     struct Task {
         int type;       // what task
         int k;          // iteration index
         int i, j;       // row, column index
         int priority;
     };

     int get_num_parents(Task t) {
         if (t.type = POTRF) return 1
         if (t.type = TRSM)  return 2
         if (t.type = GEMM) {
             if (diagonal) return 2
             else return 3
         }
     }

     int get_children(Task p, Task *buf, int nblks) {
         if (p.type = POTRF) {
             /* along p's column but below p */
             buf := {TRSM task t | t.j = p.j & t.i ∈ (p.i, nblks]}
         }
         if (p.type = TRSM) {
             /* a row and a column (both with index p.i) */
             buf := {GEMM task t | t.i = p.i & t.j ∈ (p.j, p.i]
                                or t.j = p.i & t.i ∈ [p.i, nblks]}
         }
         if (p.type = GEMM) {
             /* has a single child */
             if (diagonal)        buf := a POTRF task
             else if (below diag) buf := a TRSM task
             else                 buf := a GEMM task
         }
         return |buf|
     }
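Read back into a single column, the slide's pseudocode can be made concrete. Below is a minimal, runnable C sketch of the two interface functions for tiled Cholesky; the enum values, the 1-based tile indexing over 1..nblks, the use of (i, j) to decide "diagonal"/"below diagonal", and the k+1 iteration hand-off for GEMM's child are assumptions for illustration, not the authors' implementation.

```c
typedef enum { POTRF, TRSM, GEMM } TaskType;

typedef struct {
    TaskType type;
    int k;      /* iteration index */
    int i, j;   /* tile row, tile column (1-based) */
} Task;

/* How many parent tasks must finish before t may run. */
int get_num_parents(Task t) {
    if (t.type == POTRF) return 1;
    if (t.type == TRSM)  return 2;
    /* GEMM: 2 parents on the diagonal, 3 below it */
    return (t.i == t.j) ? 2 : 3;
}

/* Fill buf with the children of p; returns the number of children. */
int get_children(Task p, Task *buf, int nblks) {
    int n = 0;
    if (p.type == POTRF) {
        /* TRSM tasks along p's column, below p */
        for (int i = p.i + 1; i <= nblks; i++)
            buf[n++] = (Task){ TRSM, p.k, i, p.j };
    } else if (p.type == TRSM) {
        /* GEMM tasks in row p.i with column in (p.j, p.i] ... */
        for (int j = p.j + 1; j <= p.i; j++)
            buf[n++] = (Task){ GEMM, p.k, p.i, j };
        /* ... and in column p.i with row in [p.i, nblks]; the diagonal
           tile (p.i, p.i) was already emitted by the loop above */
        for (int i = p.i + 1; i <= nblks; i++)
            buf[n++] = (Task){ GEMM, p.k, i, p.i };
    } else { /* GEMM: a single child on the same tile, next iteration */
        if (p.i == p.j)      buf[n++] = (Task){ POTRF, p.k + 1, p.i, p.j };
        else if (p.i > p.j)  buf[n++] = (Task){ TRSM,  p.k + 1, p.i, p.j };
        else                 buf[n++] = (Task){ GEMM,  p.k + 1, p.i, p.j };
    }
    return n;
}
```

With this shape, the scheduler needs no global DAG: each process queries children of completed tasks and counts down parents of pending ones.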

  7. Performance on SGI Altix 3700 BX2
     [Chart: weak scalability of chol_dag on SGI Altix 3700 BX2 (peak 6.4 GFLOPS); GFLOPS per CPU vs. number of CPUs (1-128) for chol_dag and sgi_pdpotrf, with peak and dgemm reference lines]

  8. Performance on the Grig Cluster
     [Chart: weak scalability of chol_dag on the Grig cluster (peak 6.4 GFLOPS); GFLOPS per CPU vs. number of CPUs (1-128) for chol_dag and pdpotrf, with peak and dgemm reference lines]

  9. The Multicast Problem
     • Problem: a set of processes are executing a DAG where multiple sources notify different groups simultaneously.
     [Figure: processes P0 through P5 of a 1024-process run, with several simultaneous multicasts from different sources to different groups]

  10. Multicast Scheme Overview
     • Application-level routing layer
     • Hierarchical abstraction of a system
     • Each process has a topology ID
       • Like a zip code
       • The longer the common prefix of two topo_ids, the closer they are
     • Compact routing table
       • An extension to Plaxton's neighbor table [1]

     [1] Plaxton, C. G., Rajaraman, R., and Richa, A. W. 1997. Accessing nearby copies of replicated objects in a distributed environment. SPAA '97.

  11. Topology ID
     • Assign IDs to the whole system (i.e., T_system)
       • T_program of a user program ⊂ T_system
     • A topology ID is a sequence of digits
       • E.g., 256 nodes: 4 digits with base 4 (2 bits per digit)
       • E.g., 2048 nodes: 4 digits with base 8 (3 bits per digit)
     • We assume that two nodes with a longer common prefix are closer on the physical network
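As a small illustration (not the authors' code), the digit view of a topology ID and the common-prefix comparison can be sketched in C; the 4-digit, base-4 layout below matches the slide's 256-node example.

```c
#define NDIGITS 4   /* 4 digits with base 4 covers 4^4 = 256 nodes */
#define BASE    4

/* Split a flat node rank into NDIGITS base-BASE digits, most
   significant digit first. */
void rank_to_id(int rank, int id[NDIGITS]) {
    for (int d = NDIGITS - 1; d >= 0; d--) {
        id[d] = rank % BASE;
        rank /= BASE;
    }
}

/* Length of the common prefix of two topology IDs. The scheme assumes
   that a longer common prefix means the nodes are physically closer. */
int common_prefix_len(const int a[NDIGITS], const int b[NDIGITS]) {
    int len = 0;
    while (len < NDIGITS && a[len] == b[len]) len++;
    return len;
}
```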

  12. Topology ID Example
     [Figure: topology ID assignment on an SGI Altix 3700 system]

  13. Compact Routing Table
     • Suppose process x has a routing table and Table[i,j] stores process z; then ID_x and ID_z must satisfy:
       • ID_x[0] ID_x[1] ... ID_x[i-1] = ID_z[0] ID_z[1] ... ID_z[i-1], and ID_z[i] = j (i.e., the (i+1)th digit of ID_z is j)
     • The routing table can have empty entries
     • Always search for the forwarding host on the LCP(x, y) row
     • At most (log₂ P)/(log₂ base) steps
     • O(lg P) space cost
       • 1 million cores → 80 (5 x 16) entries
       • 1 billion cores → 192 (6 x 32) entries
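The lookup rule can be written out as a short C sketch, here over a toy 8-node ID space of 3 binary digits (an illustration; the table contents and the -1 convention for empty entries are assumptions): to forward toward y, process x takes the row at LCP(ID_x, ID_y) and the column given by y's next digit.

```c
#define NDIGITS 3   /* toy ID space: 3 binary digits, nodes 000..111 */
#define BASE    2

/* table[i][j] holds the rank of a process z that shares its first i
   digits with x and whose (i+1)th digit is j; -1 marks an empty entry. */
int next_hop(int table[NDIGITS][BASE],
             const int x[NDIGITS], const int y[NDIGITS]) {
    int i = 0;                              /* i = LCP(x, y) */
    while (i < NDIGITS && x[i] == y[i]) i++;
    if (i == NDIGITS) return -1;            /* y is x itself */
    return table[i][y[i]];                  /* may be -1 (empty entry) */
}
```

Each hop lengthens the common prefix with the destination by at least one digit, which is why the step count is bounded by the number of digits.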

  14. A Multicast Example
     • Node 001 multicasts data to nodes {010, 100, 101}.
     [Figure: forwarding tree over the 8-node ID space 000-111, grouped by prefixes 0**, 1** and 00*, 01*, 10*, 11*]
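The example can be simulated end to end. The sketch below is an illustration, not the paper's implementation: it uses the toy 8-node ID space and a canonical routing-table choice (the entry for row i, column j is the node that copies x's first i digits, sets digit i to j, and zeroes the rest; a real table would hold actual nearby processes). Each node delivers locally if it is a destination, then partitions the remaining destinations by next hop and forwards once per distinct hop.

```c
#define NDIGITS 3
#define BASE    2
#define NNODES  8   /* BASE^NDIGITS */

static void to_id(int r, int id[NDIGITS]) {
    for (int d = NDIGITS - 1; d >= 0; d--) { id[d] = r % BASE; r /= BASE; }
}

static int to_rank(const int id[NDIGITS]) {
    int r = 0;
    for (int d = 0; d < NDIGITS; d++) r = r * BASE + id[d];
    return r;
}

static int lcp(int a, int b) {
    int ia[NDIGITS], ib[NDIGITS], l = 0;
    to_id(a, ia); to_id(b, ib);
    while (l < NDIGITS && ia[l] == ib[l]) l++;
    return l;
}

/* Canonical table entry for node x, row i, column j (an assumption:
   any node matching the prefix pattern would do). */
static int table_entry(int x, int i, int j) {
    int id[NDIGITS];
    to_id(x, id);
    id[i] = j;
    for (int d = i + 1; d < NDIGITS; d++) id[d] = 0;
    return to_rank(id);
}

int delivered[NNODES];   /* how many copies each node received */

/* Node x handles a multicast toward dests[0..n): deliver locally if x
   is a destination, then partition the rest by next hop and forward
   the message once per distinct hop. */
void mcast(int x, const int *dests, int n) {
    int group[BASE * NDIGITS][NNODES], gn[BASE * NDIGITS] = {0};
    int hops[BASE * NDIGITS], nhops = 0;
    for (int k = 0; k < n; k++) {
        int y = dests[k];
        if (y == x) { delivered[y]++; continue; }
        int i = lcp(x, y), idy[NDIGITS];
        to_id(y, idy);
        int h = table_entry(x, i, idy[i]);
        int g;
        for (g = 0; g < nhops; g++) if (hops[g] == h) break;
        if (g == nhops) hops[nhops++] = h;
        group[g][gn[g]++] = y;
    }
    for (int g = 0; g < nhops; g++)
        mcast(hops[g], group[g], gn[g]);
}
```

Because every hop strictly lengthens the common prefix toward each destination in its group, the recursion is deadlock-free and each destination receives exactly one copy.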

  15. Grig Cluster
     • grig.sinrg.cs.utk.edu
     • 64 nodes, dual CPUs per node
     • Intel Xeon 3.20 GHz
     • Peak performance 6.4 GFLOPS
     • Myrinet interconnect (MX 1.0.0)
     • Goto BLAS 1.26
       • DGEMM performance 5.57 GFLOPS
       • 87% of peak performance (upper bound)
     • MPICH-MX 1.x
     • gcc, 64-bit

  16. Multicast on a Cluster (4 CPUs)
     [Chart: multicast time vs. message size (1 B to 4 MB) on 4 CPUs for dag_mcast, flat_mcast, and mpi_bcast]

  17. Multicast on a Cluster (8 CPUs)
     [Chart: multicast time vs. message size (1 B to 4 MB) on 8 CPUs for dag_mcast, flat_mcast, and mpi_bcast]

  18. Multicast on a Cluster (16 CPUs)
     [Chart: multicast time vs. message size (1 B to 4 MB) on 16 CPUs for dag_mcast, flat_mcast, and mpi_bcast]

  19. Multicast on a Cluster (32 CPUs)
     [Chart: multicast time vs. message size (1 B to 4 MB) on 32 CPUs for dag_mcast, flat_mcast, and mpi_bcast]

  20. Multicast on a Cluster (64 CPUs)
     [Chart: multicast time vs. message size (1 B to 4 MB) on 64 CPUs for dag_mcast, flat_mcast, and mpi_bcast]

  21. Multicast on a Cluster (128 CPUs)
     [Chart: multicast time vs. message size (1 B to 4 MB) on 128 CPUs for dag_mcast, flat_mcast, and mpi_bcast]

  22. Summary
     • Support scalable multicast in distributed DAG scheduling
     • Important features:
       • Non-blocking
       • Topology-aware
       • Scalable in terms of routing-table space and number of steps
       • Deadlock-free
       • No communication-group creation required
       • Supports multiple concurrent multicasts
     • Performance is close to the vendor's MPI_Bcast
