SLIDE 1

May 14, 2009

A Scalable Multicast Scheme for Distributed DAG Scheduling

Fengguang Song (1), Jack Dongarra (1,2), Shirley Moore (1); (1) University of Tennessee, (2) Oak Ridge National Laboratory. Scheduling for Large-Scale Systems Workshop, Knoxville, May 13-15, 2009

SLIDE 2

Contents

  • Motivation
  • Background: a programming model for DAG scheduling
  • Overview of the multicast scheme
  • Topology ID
  • Compact routing tables
  • Multicast examples
  • Experimental results
SLIDE 3

Motivation

  • High performance on multicore machines
  • New software should have two characteristics:
  • Fine-grained threads
  • Asynchronous execution
  • We want to use dynamic DAG scheduling
  • Extremely scalable
  • We are thinking of millions of processing cores.
  • Distributed-memory
SLIDE 4

A DAG Example for Cholesky Factorization

[Figure: DAG of a Cholesky factorization on a 4x4 tile matrix; nodes are labeled by tile indices (i,j), and the legend marks the critical path (T∞), loop-carried dependencies, and loop-independent dependencies.]

SLIDE 5

Simple Programming Model

  • Symbolic DAG interface:
  • int get_num_parents(const Task t);
  • int get_children(const Task t, Task *children);
  • void set_entry_task(const Task t);
  • void set_exit_task(const Task t);
SLIDE 6

Interface Definition for Cholesky Factorization

struct Task {
  int type;       // what task
  int k;          // iteration index
  int i, j;       // row, column index
  int priority;
};

int get_children(Task p, Task* buf, int nblks) {
  if (p.type = POTRF) {
    /* along p's column but below p */
    buf := {TRSM task t | t.j = p.j & t.i ∈ (p.i, nblks]}
  }
  if (p.type = TRSM) {
    /* a row and a column (both with index p.i) */
    buf := {GEMM task t | t.i = p.i & t.j ∈ (p.j, p.i]
                       or t.j = p.i & t.i ∈ [p.i, nblks]}
  }
  if (p.type = GEMM) {
    /* has a single child */
    if (diagonal)        buf := a POTRF task
    else if (below diag) buf := a TRSM task
    else                 buf := a GEMM task
  }
  return |buf|
}

int get_num_parents(Task t) {
  if (t.type = POTRF) return 1
  if (t.type = TRSM)  return 2
  if (t.type = GEMM) {
    if (diagonal) return 2
    else          return 3
  }
}

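To make the flow concrete, here is a minimal sketch (not from the talk) of how a runtime could drive this symbolic interface when a task completes. struct Task, get_children, and get_num_parents come from the slide above; the counter array, ready queue, task_slot() mapping, and the size bounds are hypothetical illustration.

/* Hypothetical scheduler hook (sketch only): when task t finishes, expand
 * its children symbolically and release any child whose parents are done.
 * task_slot(), the bounds, and the ready queue are illustrative assumptions,
 * with task types assumed to be numbered 0..2. */
#define NUM_TYPES 3                      /* POTRF, TRSM, GEMM              */
#define MAX_TILES 64                     /* assumed bound on nblks         */
#define MAX_SLOTS (MAX_TILES * MAX_TILES * NUM_TYPES)

static int  remaining[MAX_SLOTS];        /* unfinished parents; 0 = unseen */
static Task ready_q[MAX_SLOTS];          /* simple ready queue             */
static int  nready = 0;

static int task_slot(Task t)             /* hypothetical task -> slot map  */
{
    return (t.i * MAX_TILES + t.j) * NUM_TYPES + t.type;
}

void on_task_completed(Task t, int nblks)
{
    Task children[2 * MAX_TILES];
    int n = get_children(t, children, nblks);
    for (int c = 0; c < n; c++) {
        int s = task_slot(children[c]);
        if (remaining[s] == 0)                            /* first visit  */
            remaining[s] = get_num_parents(children[c]);
        if (--remaining[s] == 0)
            ready_q[nready++] = children[c];              /* now runnable */
    }
}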
SLIDE 7

Performance on SGI Altix 3700 BX2

[Chart: weak scalability of chol_dag on SGI (peak 6.4 GFLOPS per CPU); x-axis: number of CPUs (1-128); y-axis: GFLOPS per CPU; series: chol_dag, sgi_pdpotrf, peak dgemm.]

SLIDE 8

Performance on the Grig Cluster

[Chart: weak scalability of chol_dag on the Grig cluster (peak 6.4 GFLOPS per CPU); x-axis: number of CPUs (1-128); y-axis: GFLOPS per CPU; series: chol_dag, pdpotrf, peak dgemm.]

SLIDE 9

The Multicast Problem

  • Problem: a set of processes is executing a DAG in which multiple source processes must notify different groups of destination processes simultaneously.

[Figure: a source process notifying a large group of destination processes (P0, P1, ..., P5, ...; 1024 processes in the illustration).]

SLIDE 10

Multicast Scheme Overview

  • Application-level routing layer
  • Hierarchical abstraction of a system
  • Each process has a topology ID.
  • Like zip code
  • The longer the common prefix of two topology IDs, the closer the two processes are.

  • Compact routing table
  • An extension to Plaxton’s neighbor table [1]

[1] Plaxton, C. G., Rajaraman, R., and Richa, A. W. 1997. Accessing nearby copies of replicated objects in a distributed environment. SPAA '97.

SLIDE 11

Topology ID

  • Assign IDs to the whole system (i.e., Tsystem)
  • Tprogram of a user program ⊂ Tsystem
  • A topology ID is a sequence of digits.
  • E.g., IDs for 256 nodes use 4 digits in base 4 (2 bits per digit).
  • E.g., IDs for 2048 nodes use 4 digits in base 8 (3 bits per digit).
  • We assume that two nodes with a longer common prefix are closer on the physical network.

[Figure: a topology ID laid out as four 2-bit digits (256 nodes) or four 3-bit digits (2048 nodes).]
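As a concrete illustration (not code from the talk), one way to represent such IDs is to pack the digits into an unsigned integer, most significant digit first; the helper names, digit width, and digit count below are assumptions chosen to match the examples above.

#include <stdint.h>

/* Extract the i-th digit (0 = most significant) of a topology ID that
 * consists of ndigits digits of `bits` bits each.                      */
static unsigned topo_digit(uint32_t id, int i, int ndigits, int bits)
{
    return (id >> ((ndigits - 1 - i) * bits)) & ((1u << bits) - 1);
}

/* Length, in digits, of the longest common prefix of two topology IDs.
 * Under the assumption above, a larger value means the two nodes are
 * physically closer.                                                   */
static int topo_lcp(uint32_t a, uint32_t b, int ndigits, int bits)
{
    int i = 0;
    while (i < ndigits &&
           topo_digit(a, i, ndigits, bits) == topo_digit(b, i, ndigits, bits))
        i++;
    return i;
}

/* Example: 256 nodes  -> ndigits = 4, bits = 2 (base 4);
 *          2048 nodes -> ndigits = 4, bits = 3 (base 8). */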

SLIDE 12

Topology ID Example

[Figure: topology IDs assigned to an SGI Altix 3700 system.]

SLIDE 13

Compact Routing Table

  • Suppose process x has a routing table and Table[i, j] stores process z. Then IDx and IDz must satisfy:
  • IDx[0] IDx[1] ... IDx[i-1] = IDz[0] IDz[1] ... IDz[i-1] (their first i digits agree), and
  • IDz[i] = j (i.e., the (i+1)-th digit of IDz is j).

  • The routing table may have empty entries.
  • To forward a message toward y, always search the row LCP(x, y) of the table.
  • At most log2(P) / log2(base) forwarding steps
  • O(log P) space cost
  • 1 million cores: 80 entries (5 rows x 16 columns)
  • 1 billion cores: 192 entries (6 rows x 32 columns)

[Figure: the routing table of process x; the entry in row i, column j holds a process z whose ID has the form IDx[0:i-1] j xxxxx, i.e., z matches x on the first i digits and has digit j in position i.]
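A sketch of how the lookup could work in code, again an illustration rather than the paper's implementation; it reuses the hypothetical topo_digit()/topo_lcp() helpers from the topology-ID sketch, and EMPTY marks an unfilled entry.

#include <stdint.h>

#define MAX_DIGITS 6                 /* enough for the 1-billion-core example */
#define MAX_BASE   32
#define EMPTY      0xFFFFFFFFu       /* marker for an empty table entry       */

/* entry[i][j] holds the ID of some process z that matches this process on
 * the first i digits and has digit j in position i (or EMPTY).              */
typedef struct {
    uint32_t entry[MAX_DIGITS][MAX_BASE];
    uint32_t my_id;
    int ndigits, bits;
} routing_table_t;

/* Next hop toward destination y: search the row LCP(my_id, y), in the column
 * of y's next digit. Each hop extends the matched prefix by one digit, so a
 * message needs at most ndigits = log2(P) / log2(base) hops.                 */
static uint32_t next_hop(const routing_table_t *rt, uint32_t y)
{
    int row = topo_lcp(rt->my_id, y, rt->ndigits, rt->bits);
    if (row == rt->ndigits)
        return rt->my_id;                        /* y is this very process    */
    int col = topo_digit(y, row, rt->ndigits, rt->bits);
    return rt->entry[row][col];                  /* may be EMPTY; the fallback
                                                    is omitted in this sketch */
}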

SLIDE 14

A Multicast Example

  • Node 001 multicasts data to nodes {010, 100, 101}.

[Figure: prefix tree of 3-bit IDs: 0** and 1** at the top level; 00*, 01*, 10*, 11* below; leaves 000 through 111.]
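One plausible reading of this example in code, offered only as an illustration: destinations are grouped by the next digit, one copy is forwarded per group, and each representative repeats the split one digit further down. It builds on the hypothetical topo_digit()/next_hop() sketches above, and p2p_send() stands for a point-to-point primitive that hands the payload and the remaining destination list to the chosen representative.

#define MAX_GROUP 1024

/* Hypothetical point-to-point primitive: deliver msg to process `dest`,
 * which then calls dag_mcast(rt, level, dests, ndests, msg, len) itself.   */
void p2p_send(uint32_t dest, int level, const uint32_t *dests, int ndests,
              const void *msg, int len);

/* Recursively split the destination set by the digit at position `level`
 * and forward the message once per non-empty group.                        */
void dag_mcast(const routing_table_t *rt, int level,
               const uint32_t *dests, int ndests, const void *msg, int len)
{
    if (level >= rt->ndigits)
        return;                                  /* nothing left to forward */
    for (unsigned d = 0; d < (1u << rt->bits); d++) {
        uint32_t group[MAX_GROUP];
        int g = 0;
        for (int k = 0; k < ndests; k++)
            if (topo_digit(dests[k], level, rt->ndigits, rt->bits) == d)
                group[g++] = dests[k];
        if (g == 0)
            continue;
        if (d == topo_digit(rt->my_id, level, rt->ndigits, rt->bits)) {
            /* Destinations sharing our digit: delivery to ourselves already
               happened on receipt, so just keep splitting locally.          */
            if (g == 1 && group[0] == rt->my_id)
                continue;
            dag_mcast(rt, level + 1, group, g, msg, len);
        } else {
            /* One message per group; the receiver continues at level + 1.   */
            p2p_send(next_hop(rt, group[0]), level + 1, group, g, msg, len);
        }
    }
}

/* In the slide's example, 001 sends one copy toward 010 (the 0** group) and
 * one copy toward the 1** group (e.g., to 100), which then forwards to 101. */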

SLIDE 15

Grig Cluster

  • grig.sinrg.cs.utk.edu
  • 64 nodes, dual-CPU per node
  • Intel Xeon 3.20GHz
  • Peak performance 6.4 GFLOPS
  • Myrinet interconnection (MX 1.0.0)
  • Goto BLAS 1.26
  • DGEMM performance 5.57 GFLOPS
  • 87% of peak performance (upper bound)
  • MPICH-MX 1.x
  • gcc (64-bit)
SLIDE 16

Multicast on a Cluster (4 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 4 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 17

Multicast on a Cluster (8 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 8 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 18

Multicast on a Cluster (16 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 16 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 19

Multicast on a Cluster (32 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 32 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 20

Multicast on a Cluster (64 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 64 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 21

Multicast on a Cluster (128 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 128 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 22

Summary

  • Supports scalable multicast in distributed DAG scheduling

  • Important features:
  • Non-blocking
  • Topology-aware
  • Scalable in terms of routing-table space and #steps
  • Deadlock-free
  • No need to create communication groups
  • Supports multiple concurrent multicasts
  • Performance close to the vendor's MPI_Bcast