May 14, 2009
A Scalable Multicast Scheme for Distributed DAG Scheduling
Fengguang Song (1), Jack Dongarra (1,2), Shirley Moore (1)
(1) University of Tennessee, (2) Oak Ridge National Laboratory
Scheduling for Large-Scale Systems Workshop, Knoxville, May 14, 2009
Contents
- Motivation
- Background: a programming model for DAG scheduling
- Overview of the multicast scheme
- Topology ID
- Compact routing tables
- Multicast examples
- Experimental results
Motivation
- High performance on multicore machines
- New software should have two characteristics:
- Fine grain threads
- Asynchronous execution
- We want to use dynamic DAG scheduling
- Extremely scalable
- We are thinking of millions of processing cores.
- Distributed memory
A DAG Example for Cholesky Factorization
[Figure: task DAG for a 4x4-tile Cholesky factorization; nodes are labeled by tile indices (i,j). The figure highlights the critical path (T-infinity), loop-carried dependencies, and loop-independent dependencies.]
Simple Programming Model
- Symbolic DAG interface:
- int get_num_parents(const Task t);
- int get_children(const Task t, Task *children);
- set_entry_task(const Task t);
- set_exit_task(const Task t);
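The interface above can be driven by a small task-firing loop. The following is a minimal sketch, not the paper's implementation: the 3-task chain DAG, the run_dag() helper, and NTASKS are invented for illustration; only the interface names follow the slides.

```c
#include <assert.h>

/* Task record; field names follow the slides' struct Task. */
typedef struct { int type; int k; int i, j; int priority; } Task;

#define NTASKS 3

/* Toy DAG for illustration: task k depends only on task k-1. */
static int get_num_parents(Task t) { return t.k == 0 ? 0 : 1; }

static int get_children(Task t, Task *buf) {
    if (t.k + 1 < NTASKS) { buf[0] = t; buf[0].k = t.k + 1; return 1; }
    return 0;
}

/* Core of a dynamic scheduler: fire a task once all parents finished. */
static int run_dag(int *order) {
    int waiting[NTASKS];
    Task ready[NTASKS];
    int nready = 0, done = 0;
    for (int n = 0; n < NTASKS; n++) {
        Task t = {0}; t.k = n;
        waiting[n] = get_num_parents(t);
        if (waiting[n] == 0) ready[nready++] = t;   /* entry task */
    }
    while (nready > 0) {
        Task t = ready[--nready];
        order[done++] = t.k;                        /* "execute" t */
        Task kids[NTASKS];
        int nk = get_children(t, kids);
        for (int c = 0; c < nk; c++)
            if (--waiting[kids[c].k] == 0) ready[nready++] = kids[c];
    }
    return done;
}
```

The scheduler never materializes the whole DAG: it only tracks a parent counter per known task and asks get_children() on completion, which is what makes the symbolic interface suitable for fine-grain, asynchronous execution.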
Interface Definition for Cholesky Factorization
struct Task {
    int type;       // what task: POTRF, TRSM, or GEMM
    int k;          // iteration index
    int i, j;       // row, column index
    int priority;
};

int get_children(Task p, Task *buf, int nblks) {
    if (p.type == POTRF) {
        /* along p's column but below p */
        buf := { TRSM task t | t.j == p.j && t.i in (p.i, nblks] }
    }
    if (p.type == TRSM) {
        /* a row and a column (both with index p.i) */
        buf := { GEMM task t | (t.i == p.i && t.j in (p.j, p.i])
                            || (t.j == p.i && t.i in [p.i, nblks]) }
    }
    if (p.type == GEMM) {
        /* has a single child */
        if (diagonal)         buf := a POTRF task
        else if (below diag)  buf := a TRSM task
        else                  buf := a GEMM task
    }
    return |buf|;
}

int get_num_parents(Task t) {
    if (t.type == POTRF) return 1;
    if (t.type == TRSM)  return 2;
    if (t.type == GEMM)  return (diagonal) ? 2 : 3;
}
Performance on SGI Altix 3700 BX2
[Chart: weak scalability of chol_dag on the SGI Altix 3700 BX2 (peak 6.4 GFLOPS per CPU); x-axis: number of CPUs (1-128), y-axis: GFLOPS per CPU; series: chol_dag, sgi_pdpotrf, peak dgemm.]
Performance on the Grig Cluster
[Chart: weak scalability of chol_dag on the Grig cluster (peak 6.4 GFLOPS per CPU); x-axis: number of CPUs (1-128), y-axis: GFLOPS per CPU; series: chol_dag, pdpotrf, peak dgemm.]
The Multicast Problem
- Problem: a set of processes is executing a DAG in which multiple source processes must notify different groups of destination processes simultaneously.
[Figure: example with processes P0-P5 among 1024 processes, showing multiple concurrent multicast groups.]
Multicast Scheme Overview
- Application-level routing layer
- Hierarchical abstraction of a system
- Each process has a topology ID.
- Like a zip code
- The longer the common prefix of two topo_ids, the
closer they are.
- Compact routing table
- An extension to Plaxton’s neighbor table [1]
[1] Plaxton, C. G., Rajaraman, R., and Richa, A. W. 1997. Accessing nearby copies of replicated objects in a distributed environment. SPAA '97.
Topology ID
- Assign IDs to the whole system (i.e., Tsystem)
- Tprogram of a user program ⊂ Tsystem
- A topology ID is a string of digits.
- E.g., 256 nodes can be addressed with 4 digits in base 4 (4^4 = 256).
- E.g., 2048 nodes can be addressed with 4 digits in base 8 (8^4 = 4096).
- We assume that two nodes with a longer common prefix
are closer on the physical network.
[Figure: topology ID layouts: four 2-bit digits (base 4) and four 3-bit digits (base 8).]
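The closeness assumption can be made concrete with a common-prefix length over digit arrays. This is a sketch under the slides' conventions, not the paper's code; lcp_len and the digit-array layout (most significant digit first) are illustrative choices.

```c
#include <assert.h>

/* Length of the longest common prefix (in digits) of two topology IDs,
 * each stored as an array of `ndigits` base-b digits, most significant
 * digit first. The longer the result, the closer the two nodes are
 * assumed to be on the physical network. */
static int lcp_len(const int *a, const int *b, int ndigits) {
    int n = 0;
    while (n < ndigits && a[n] == b[n]) n++;
    return n;
}
```

For example, with 256 nodes as 4 base-4 digits, IDs 1-2-3-0 and 1-2-0-3 share a 2-digit prefix, so those nodes are assumed closer than a pair sharing only 1 digit.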
Topology ID Example
[Figure: topology ID assignment on an SGI Altix 3700 system.]
Compact Routing Table
- Suppose process x has a routing table, and Table[i][j] stores process z; then ID_x and ID_z must satisfy:
- ID_x[0] ID_x[1] ... ID_x[i-1] = ID_z[0] ID_z[1] ... ID_z[i-1] (the first i digits match),
- ID_z[i] = j (i.e., the (i+1)-th digit of ID_z is j).
- The routing table may have empty entries.
- To forward toward a destination y, process x always searches row LCP(x, y) of its table.
- At most log2(P) / log2(base) forwarding steps
- O(log P) space cost
- 1 million cores → 80 entries (5 rows × base 16)
- 1 billion cores → 192 entries (6 rows × base 32)
[Figure: routing-table entry Table[i][j] = z, where ID_z agrees with ID_x on digits 0..i-1 and has digit j in position i.]
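One Plaxton-style forwarding step can be sketched as follows. This is illustrative, not the paper's code: next_hop, the rank encoding, and the tiny 3-digit base-2 configuration are assumptions; only the row/column rule follows the slides.

```c
#include <assert.h>

#define D 3   /* digits per ID: 8 nodes as 3 base-2 digits */
#define B 2   /* digit base */

/* table[i][j] holds the rank of a process whose ID matches ours on
 * digits 0..i-1 and whose digit i is j, or -1 for an empty entry. */
static int next_hop(const int me[D], const int dst[D], int table[D][B]) {
    int i = 0;
    while (i < D && me[i] == dst[i]) i++;   /* i = LCP(me, dst) */
    if (i == D) return -1;                  /* dst is me: deliver locally */
    return table[i][dst[i]];                /* row = LCP length, column = dst's next digit */
}
```

Each hop extends the matched prefix by at least one digit, so a message reaches any destination in at most D = log_base(P) steps, matching the step bound on this slide.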
A Multicast Example
- Node 001 multicasts data to nodes {010, 100, 101}.
[Figure: binary prefix tree over 8 nodes (000-111) with internal prefixes 0**, 1**, 00*, 01*, 10*, 11*; the multicast from 001 follows this tree.]
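The key saving at each hop is that destinations mapping to the same routing-table cell share one forwarded copy. The sketch below is illustrative (count_copies and the 3-digit base-2 setup are invented), but it reproduces the slide's example count.

```c
#include <assert.h>

#define D 3   /* digits per ID */
#define B 2   /* digit base */

/* Number of distinct messages node `me` must send to cover `ndst`
 * destinations: destinations whose LCP row and next digit coincide
 * fall in the same routing-table cell and share one copy. */
static int count_copies(const int me[D], int ndst, const int dst[][D]) {
    int used[D][B] = {{0}};
    int copies = 0;
    for (int k = 0; k < ndst; k++) {
        int i = 0;
        while (i < D && me[i] == dst[k][i]) i++;   /* LCP(me, dst[k]) */
        if (i == D) continue;                      /* destination is me */
        if (!used[i][dst[k][i]]) { used[i][dst[k][i]] = 1; copies++; }
    }
    return copies;
}
```

For the example on this slide, node 001 covering {010, 100, 101} sends only two messages: one toward subtree 01* (for 010) and one toward subtree 1**, which then fans out locally to 100 and 101.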
Grig Cluster
- grig.sinrg.cs.utk.edu
- 64 nodes, dual-CPU per node
- Intel Xeon 3.20GHz
- Peak performance 6.4 GFLOPS
- Myrinet interconnection (MX 1.0.0)
- Goto BLAS 1.26
- DGEMM performance 5.57 GFLOPS
- 87% of peak performance (upper bound)
- MPICH-MX 1.x
- 64-bit gcc
Multicast on a Cluster (4 CPUs)
[Chart: multicast performance on 4 CPUs; x-axis: message size (1 B to 4 MB), y-axis: time (s); series: dag_mcast, flat_mcast, mpi_bcast.]
Multicast on a Cluster (8 CPUs)
[Chart: multicast performance on 8 CPUs; x-axis: message size (1 B to 4 MB), y-axis: time (s); series: dag_mcast, flat_mcast, mpi_bcast.]
Multicast on a Cluster (16 CPUs)
[Chart: multicast performance on 16 CPUs; x-axis: message size (1 B to 4 MB), y-axis: time (s); series: dag_mcast, flat_mcast, mpi_bcast.]
Multicast on a Cluster (32 CPUs)
[Chart: multicast performance on 32 CPUs; x-axis: message size (1 B to 4 MB), y-axis: time (s); series: dag_mcast, flat_mcast, mpi_bcast.]
Multicast on a Cluster (64 CPUs)
[Chart: multicast performance on 64 CPUs; x-axis: message size (1 B to 4 MB), y-axis: time (s); series: dag_mcast, flat_mcast, mpi_bcast.]
Multicast on a Cluster (128 CPUs)
[Chart: multicast performance on 128 CPUs; x-axis: message size (1 B to 4 MB), y-axis: time (s); series: dag_mcast, flat_mcast, mpi_bcast.]
Summary
- Supports scalable multicast in distributed DAG scheduling
- Important features:
- Non-blocking
- Topology-aware
- Scalable in both routing-table space and number of forwarding steps
- Deadlock-free
- No communication-group creation required
- Supports multiple concurrent multicasts
- Performance is close to the vendor's MPI_Bcast.