SLIDE 1

May 14, 2009

A Scalable Multicast Scheme for Distributed DAG Scheduling

Fengguang Song (1), Jack Dongarra (1,2), Shirley Moore (1); (1) University of Tennessee, (2) Oak Ridge National Laboratory. Scheduling for Large-Scale Systems Workshop, Knoxville, May 13-15, 2009

SLIDE 2

Contents

  • Motivation
  • Background: a programming model for DAG scheduling
  • Overview of the multicast scheme
  • Topology ID
  • Compact routing tables
  • Multicast examples
  • Experimental results
SLIDE 3

Motivation

  • High performance on multicore machines
  • New software should have two characteristics:
  • Fine-grained threads
  • Asynchronous execution
  • We want to use dynamic DAG scheduling
  • Extremely scalable
  • We are thinking of millions of processing cores.
  • Distributed-memory
SLIDE 4

A DAG Example for Cholesky Factorization

[Figure: DAG of a Cholesky factorization on a 4x4 tile matrix; nodes are labeled by tile indices (i,j), and the legend marks the critical path (T∞), loop-carried dependencies, and loop-independent dependencies.]

SLIDE 5

Simple Programming Model

  • Symbolic DAG interface:
  • int get_num_parents(const Task t);
  • int get_children(const Task t, Task *children);
  • void set_entry_task(const Task t);
  • void set_exit_task(const Task t);
SLIDE 6

Interface Definition for Cholesky Factorization

struct Task {
  int type;       // what task
  int k;          // iteration index
  int i, j;       // row, column index
  int priority;
};

int get_children(Task p, Task* buf, int nblks) {
  if (p.type = POTRF) {
    /* along p's column but below p */
    buf := {TRSM task t | t.j = p.j & t.i ∈ (p.i, nblks]}
  }
  if (p.type = TRSM) {
    /* a row and a column (both with index p.i) */
    buf := {GEMM task t | t.i = p.i & t.j ∈ (p.j, p.i]
                       or t.j = p.i & t.i ∈ [p.i, nblks]}
  }
  if (p.type = GEMM) {
    /* has a single child */
    if (diagonal)        buf := a POTRF task
    else if (below diag) buf := a TRSM task
    else                 buf := a GEMM task
  }
  return |buf|
}

int get_num_parents(Task t) {
  if (t.type = POTRF) return 1
  if (t.type = TRSM)  return 2
  if (t.type = GEMM) {
    if (diagonal) return 2
    else          return 3
  }
}

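To make the flow concrete, here is a minimal sketch (not from the talk) of how a runtime could drive this symbolic interface when a task completes. struct Task, get_children, and get_num_parents come from the slide above; the counter array, ready queue, task_slot() mapping, and the size bounds are hypothetical illustration.

/* Hypothetical scheduler hook (sketch only): when task t finishes, expand
 * its children symbolically and release any child whose parents are done.
 * task_slot(), the bounds, and the ready queue are illustrative assumptions,
 * with task types assumed to be numbered 0..2. */
#define NUM_TYPES 3                      /* POTRF, TRSM, GEMM              */
#define MAX_TILES 64                     /* assumed bound on nblks         */
#define MAX_SLOTS (MAX_TILES * MAX_TILES * NUM_TYPES)

static int  remaining[MAX_SLOTS];        /* unfinished parents; 0 = unseen */
static Task ready_q[MAX_SLOTS];          /* simple ready queue             */
static int  nready = 0;

static int task_slot(Task t)             /* hypothetical task -> slot map  */
{
    return (t.i * MAX_TILES + t.j) * NUM_TYPES + t.type;
}

void on_task_completed(Task t, int nblks)
{
    Task children[2 * MAX_TILES];
    int n = get_children(t, children, nblks);
    for (int c = 0; c < n; c++) {
        int s = task_slot(children[c]);
        if (remaining[s] == 0)                            /* first visit  */
            remaining[s] = get_num_parents(children[c]);
        if (--remaining[s] == 0)
            ready_q[nready++] = children[c];              /* now runnable */
    }
}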
SLIDE 7

Performance on SGI Altix 3700 BX2

[Chart: weak scalability of chol_dag on SGI (peak 6.4 GFLOPS per CPU); x-axis: number of CPUs (1-128); y-axis: GFLOPS per CPU; series: chol_dag, sgi_pdpotrf, peak dgemm.]

SLIDE 8

Performance on the Grig Cluster

[Chart: weak scalability of chol_dag on the Grig cluster (peak 6.4 GFLOPS per CPU); x-axis: number of CPUs (1-128); y-axis: GFLOPS per CPU; series: chol_dag, pdpotrf, peak dgemm.]

SLIDE 9

The Multicast Problem

  • Problem: a set of processes is executing a DAG in which multiple source processes must notify different groups of destination processes simultaneously.

[Figure: a source process notifying a large group of destination processes (P0, P1, ..., P5, ...; 1024 processes in the illustration).]

SLIDE 10

Multicast Scheme Overview

  • Application-level routing layer
  • Hierarchical abstraction of a system
  • Each process has a topology ID.
  • Like zip code
  • The longer the common prefix of two topology IDs, the closer the two processes are.

  • Compact routing table
  • An extension to Plaxton’s neighbor table [1]

[1] Plaxton, C. G., Rajaraman, R., and Richa, A. W. 1997. Accessing nearby copies of replicated objects in a distributed environment. SPAA '97.

SLIDE 11

Topology ID

  • Assign IDs to the whole system (i.e., Tsystem)
  • Tprogram of a user program ⊂ Tsystem
  • A topology ID is a sequence of digits.
  • E.g., IDs for 256 nodes use 4 digits in base 4 (2 bits per digit).
  • E.g., IDs for 2048 nodes use 4 digits in base 8 (3 bits per digit).
  • We assume that two nodes with a longer common prefix are closer on the physical network.

[Figure: a topology ID laid out as four 2-bit digits (256 nodes) or four 3-bit digits (2048 nodes).]
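As a concrete illustration (not code from the talk), one way to represent such IDs is to pack the digits into an unsigned integer, most significant digit first; the helper names, digit width, and digit count below are assumptions chosen to match the examples above.

#include <stdint.h>

/* Extract the i-th digit (0 = most significant) of a topology ID that
 * consists of ndigits digits of `bits` bits each.                      */
static unsigned topo_digit(uint32_t id, int i, int ndigits, int bits)
{
    return (id >> ((ndigits - 1 - i) * bits)) & ((1u << bits) - 1);
}

/* Length, in digits, of the longest common prefix of two topology IDs.
 * Under the assumption above, a larger value means the two nodes are
 * physically closer.                                                   */
static int topo_lcp(uint32_t a, uint32_t b, int ndigits, int bits)
{
    int i = 0;
    while (i < ndigits &&
           topo_digit(a, i, ndigits, bits) == topo_digit(b, i, ndigits, bits))
        i++;
    return i;
}

/* Example: 256 nodes  -> ndigits = 4, bits = 2 (base 4);
 *          2048 nodes -> ndigits = 4, bits = 3 (base 8). */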

SLIDE 12

Topology ID Example

[Figure: topology IDs assigned to an SGI Altix 3700 system.]

SLIDE 13

Compact Routing Table

  • Suppose process x has a routing table and Table[i, j] stores process z. Then IDx and IDz must satisfy:
  • IDx[0] IDx[1] ... IDx[i-1] = IDz[0] IDz[1] ... IDz[i-1] (their first i digits agree), and
  • IDz[i] = j (i.e., the (i+1)-th digit of IDz is j).

  • The routing table may have empty entries.
  • To forward a message toward y, always search the row LCP(x, y) of the table.
  • At most log2(P) / log2(base) forwarding steps
  • O(log P) space cost
  • 1 million cores: 80 entries (5 rows x 16 columns)
  • 1 billion cores: 192 entries (6 rows x 32 columns)

[Figure: the routing table of process x; the entry in row i, column j holds a process z whose ID has the form IDx[0:i-1] j xxxxx, i.e., z matches x on the first i digits and has digit j in position i.]
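A sketch of how the lookup could work in code, again an illustration rather than the paper's implementation; it reuses the hypothetical topo_digit()/topo_lcp() helpers from the topology-ID sketch, and EMPTY marks an unfilled entry.

#include <stdint.h>

#define MAX_DIGITS 6                 /* enough for the 1-billion-core example */
#define MAX_BASE   32
#define EMPTY      0xFFFFFFFFu       /* marker for an empty table entry       */

/* entry[i][j] holds the ID of some process z that matches this process on
 * the first i digits and has digit j in position i (or EMPTY).              */
typedef struct {
    uint32_t entry[MAX_DIGITS][MAX_BASE];
    uint32_t my_id;
    int ndigits, bits;
} routing_table_t;

/* Next hop toward destination y: search the row LCP(my_id, y), in the column
 * of y's next digit. Each hop extends the matched prefix by one digit, so a
 * message needs at most ndigits = log2(P) / log2(base) hops.                 */
static uint32_t next_hop(const routing_table_t *rt, uint32_t y)
{
    int row = topo_lcp(rt->my_id, y, rt->ndigits, rt->bits);
    if (row == rt->ndigits)
        return rt->my_id;                        /* y is this very process    */
    int col = topo_digit(y, row, rt->ndigits, rt->bits);
    return rt->entry[row][col];                  /* may be EMPTY; the fallback
                                                    is omitted in this sketch */
}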

SLIDE 14

A Multicast Example

  • Node 001 multicasts data to nodes {010, 100, 101}.

[Figure: prefix tree of 3-bit IDs: 0** and 1** at the top level; 00*, 01*, 10*, 11* below; leaves 000 through 111.]
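One plausible reading of this example in code, offered only as an illustration: destinations are grouped by the next digit, one copy is forwarded per group, and each representative repeats the split one digit further down. It builds on the hypothetical topo_digit()/next_hop() sketches above, and p2p_send() stands for a point-to-point primitive that hands the payload and the remaining destination list to the chosen representative.

#define MAX_GROUP 1024

/* Hypothetical point-to-point primitive: deliver msg to process `dest`,
 * which then calls dag_mcast(rt, level, dests, ndests, msg, len) itself.   */
void p2p_send(uint32_t dest, int level, const uint32_t *dests, int ndests,
              const void *msg, int len);

/* Recursively split the destination set by the digit at position `level`
 * and forward the message once per non-empty group.                        */
void dag_mcast(const routing_table_t *rt, int level,
               const uint32_t *dests, int ndests, const void *msg, int len)
{
    if (level >= rt->ndigits)
        return;                                  /* nothing left to forward */
    for (unsigned d = 0; d < (1u << rt->bits); d++) {
        uint32_t group[MAX_GROUP];
        int g = 0;
        for (int k = 0; k < ndests; k++)
            if (topo_digit(dests[k], level, rt->ndigits, rt->bits) == d)
                group[g++] = dests[k];
        if (g == 0)
            continue;
        if (d == topo_digit(rt->my_id, level, rt->ndigits, rt->bits)) {
            /* Destinations sharing our digit: delivery to ourselves already
               happened on receipt, so just keep splitting locally.          */
            if (g == 1 && group[0] == rt->my_id)
                continue;
            dag_mcast(rt, level + 1, group, g, msg, len);
        } else {
            /* One message per group; the receiver continues at level + 1.   */
            p2p_send(next_hop(rt, group[0]), level + 1, group, g, msg, len);
        }
    }
}

/* In the slide's example, 001 sends one copy toward 010 (the 0** group) and
 * one copy toward the 1** group (e.g., to 100), which then forwards to 101. */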

SLIDE 15

Grig Cluster

  • grig.sinrg.cs.utk.edu
  • 64 nodes, dual-CPU per node
  • Intel Xeon 3.20GHz
  • Peak performance 6.4 GFLOPS
  • Myrinet interconnection (MX 1.0.0)
  • Goto BLAS 1.26
  • DGEMM performance 5.57 GFLOPS
  • 87% of peak performance (upper bound)
  • MPICH-MX 1.x
  • gcc (64-bit)
SLIDE 16

Multicast on a Cluster (4 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 4 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 17

Multicast on a Cluster (8 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 8 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 18

Multicast on a Cluster (16 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 16 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 19

Multicast on a Cluster (32 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 32 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 20

Multicast on a Cluster (64 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 64 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 21

Multicast on a Cluster (128 CPUs)

[Chart: multicast time (s) vs. message size (1 B to 4 MB) on 128 CPUs; series: dag_mcast, flat_mcast, mpi_bcast.]

SLIDE 22

Summary

  • Supports scalable multicast in distributed DAG scheduling

  • Important features:
  • Non-blocking
  • Topology-aware
  • Scalable in terms of routing-table space and #steps
  • Deadlock-free
  • No need to create communication groups
  • Supports multiple concurrent multicasts
  • Performance close to the vendor's MPI_Bcast