Optimizing Collective Communication on Multicores. Rajesh Nishtala, Katherine Yelick, University of California, Berkeley (2009).


SLIDE 1

Optimizing Collective Communication on Multicores

Rajesh Nishtala¹  Katherine Yelick¹

¹ University of California, Berkeley

(2009)

1 / 57

SLIDE 2

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, John M. Mellor-Crummey, Michael L. Scott (1991)

2 / 57

SLIDE 3

PGAS Languages

◮ Focus on Partitioned Global Address Space languages

3 / 57

SLIDE 4

Partitioned Address Space

◮ One address space

Figure: threads T1, T2, ..., Tn over the partitioned address space

4 / 57

SLIDE 5

One Sided Communication

Figure: threads T1, T2, ..., Tn performing one-sided writes and reads into each other's partitions

5 / 57

SLIDE 6

PGAS Languages

◮ UPC, Unified Parallel C
◮ CAF, Co-array Fortran
◮ Titanium, a Java dialect

6 / 57

SLIDE 7

Context

◮ The gap between processors and memory systems is still enormous

7 / 57

SLIDE 8

http://images.bit-tech.net/content_images/2007/11/the_secrets_of_pc_memory_part_1/hei.png

8 / 57

SLIDE 9

◮ Today: processors don’t get faster, but we see more and more processors on a single chip

9 / 57

SLIDE 10

Processor          GHz    Cores (Threads)   Sockets
Intel Clovertown   2.66   8 (8)             2
AMD Barcelona      2.3    32 (32)           8
Sun Niagara 2      1.4    32 (256)          4

Table: Experimental Platforms

Nishtala, R., Yelick, K. Optimizing Collective Communication on Multicores

10 / 57

SLIDE 11

Sun Niagara 2

http://www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/rechnersysteme/beschreibung_der_hpc_systeme/ultrasparc_t2/rba/ultrasparc_t2_architectural_details/?lang=de

11 / 57

SLIDE 12

◮ The number of processors on a chip grows at an exponential pace


12 / 57

SLIDE 13

Intel Single-Chip Cloud Computer (48 Cores)

http://techresearch.intel.com/ProjectDetails.aspx?Id=1

13 / 57

SLIDE 14

◮ Communication in its most general form is the movement of data within cores, between cores, or within memory systems


14 / 57

SLIDE 15

Figure: four memory banks (RAM), each shared by four CPUs, all connected by a communication network

15 / 57

SLIDE 16

Collective Communication

◮ Communication-intensive problems often involve global communication

16 / 57

SLIDE 17

Broadcast

Figure: broadcast of a value from one thread to threads 1–4

17 / 57

SLIDE 18

Gather

Figure: gather of values from threads 1–4

18 / 57

SLIDE 19

◮ These operations are thought of as collective communication operations

19 / 57

SLIDE 20

Example: Sum of Vector Elements

1 2 3 4 5 6 7 8 9 10

20 / 57

SLIDE 21

Example: Sum of Vector Elements

◮ Create workers

Figure: vector 1..10 split among workers W1..W5, two elements each

21 / 57

SLIDE 22

Example: Sum of Vector Elements

◮ Every worker sums up its part of the vector

Figure: workers W1..W5 compute the partial sums 3, 7, 11, 15, 19

22 / 57

SLIDE 23

Example: Sum of Vector Elements

◮ The main thread gathers the partial results and sums them up

Figure: the partial sums 3, 7, 11, 15, 19 are gathered and summed to 55

23 / 57

SLIDE 24

Example: Sum of Vector Elements

Pseudocode (main thread):

double[] vector = read_vector();
Thread[] workers = spawn_workers();
start_workers(workers);
double result = calculate_result(workers);

24 / 57

SLIDE 25

Example: Sum of Vector Elements

Pseudocode (main thread):

double[] vector = read_vector();
Thread[] workers = spawn_workers();
start_workers(workers);
wait_until_everything_finished(workers);
double result = calculate_result(workers);

25 / 57

SLIDE 26

Barrier

◮ Synchronization method for a group of threads
◮ A thread can only continue its execution after every thread has called the barrier

26 / 57

SLIDE 27

Figure: the partial sums 3, 7, 11, 15, 19 are combined to 55 once all workers have reached the barrier

27 / 57

SLIDE 28

Collective Communication Operation

“... group of threads works together to perform a global communication operation ...”

28 / 57

SLIDE 29

Reduce

◮ Divide a problem into smaller subproblems
◮ Every thread contributes its part to the solution
◮ Example: Calculate the smallest entry of a vector

29 / 57

SLIDE 30

Flat vs. Tree

◮ For communication among threads, different topologies can be used

30 / 57

SLIDE 31

Flat Topology

◮ Example: we have a reduce operation
◮ In the end, the main thread Wmain has to wait for every worker thread W1, ..., W7

Figure: Wmain and workers W1..W7 in a flat topology

31 / 57

SLIDE 32

Figure: flat reduce, step 1 of 7 (Wmain has combined 1 partial result)

32 / 57

SLIDE 33

Figure: flat reduce, step 2 of 7 (Wmain has combined 2 partial results)

33 / 57

SLIDE 34

Figure: flat reduce, step 3 of 7 (Wmain has combined 3 partial results)

34 / 57

SLIDE 35

Figure: flat reduce, step 4 of 7 (Wmain has combined 4 partial results)

35 / 57

SLIDE 36

Figure: flat reduce, step 5 of 7 (Wmain has combined 5 partial results)

36 / 57

SLIDE 37

Figure: flat reduce, step 6 of 7 (Wmain has combined 6 partial results)

37 / 57

SLIDE 38

Figure: flat reduce, step 7 of 7 (Wmain has combined all 7 partial results)

38 / 57

SLIDE 39

Tree Topology

◮ Example: we have a reduce operation
◮ In the end, the main thread Wmain has to wait for every worker thread W1, ..., W7

Figure: Wmain and workers W1..W7 in a tree topology

39 / 57

SLIDE 40

Figure: tree reduce, step 1 of 3 (surviving partials: Wmain, W2, W4, W6)

40 / 57

SLIDE 41

Figure: tree reduce, step 2 of 3 (surviving partials: Wmain, W4)

41 / 57

SLIDE 42

Figure: tree reduce, step 3 of 3 (Wmain holds the result)

42 / 57

SLIDE 43

Analysis

Figure: Barrier Performance


43 / 57

SLIDE 44

Barrier Implementation

#define N 4
pthread_t threads[N];
volatile int ready[N];
volatile int go[N];

void barrier(int id) {
    if (id == 0) {
        // wait for each thread
        for (int i = 1; i < N; i++)
            while (ready[i] == 0);
        // reset the ready flags
        for (int i = 0; i < N; i++)
            ready[i] = 0;
        // signal each thread
        for (int i = 0; i < N; i++)
            go[i] = 1;
    } else {
        ready[id] = 1;
        // wait until thread is signalled
        while (go[id] == 0);
        go[id] = 0;
    }
}

44 / 57

SLIDE 45

Experiment: Barrier Implementation

45 / 57

SLIDE 46

◮ Strict synchronization: Data movement can only start after all threads have entered the collective and must be completed before the first thread exits the collective


46 / 57

SLIDE 47

Strict Synchronization

Figure: values v1..v8 and threads T1..T7 under strict synchronization

47 / 57

SLIDE 48

Loosening Synchronization Requirements

◮ Loose synchronization: Data movement can begin as soon as any thread has entered the collective and continue until the last thread leaves the collective


48 / 57

SLIDE 49

Loose Synchronization

Figure: values v1..v8 and threads T1..T7 under loose synchronization

49 / 57

SLIDE 50

Figure: loose synchronization, animation frame (three copies of v1..v8 in flight)

50 / 57

SLIDE 51

Figure: loose synchronization, animation frame (two copies of v1..v8 in flight)

51 / 57

SLIDE 52

Figure: loose synchronization, animation frame (one copy of v1..v8 in flight)

52 / 57

SLIDE 53

(Sun Niagara 2: 32 cores, 256 threads)


53 / 57

SLIDE 54

(Intel Clovertown: 8 cores, 8 threads)


54 / 57

SLIDE 55

(AMD Barcelona: 32 cores, 32 threads)


55 / 57

SLIDE 56

Summary

◮ Best strategy depends on the hardware and on the problem
◮ Using a library that can automatically adapt to a given situation can bring a great performance improvement, since hand tuning takes far too long

56 / 57

SLIDE 57

Words on the Paper

◮ Very high level
◮ Description of the problem without a concrete solution
◮ No implementation
◮ Plots aren’t always clear and precise

57 / 57