
SLIDE 1

MulticoreBSP for C

a high-performance library for shared-memory parallel programming

Albert-Jan Yzelman, Rob H. Bisseling, D. Roose, and K. Meerbergen. 2nd of July 2013

at the ‘International Symposium on High-level Parallel Programming and Applications’, Paris 1-2 July 2013.

© 2013, ExaScience Lab - A. N. Yzelman

SLIDE 2

Introduction

A BSP computer: C = (p, r, g, l).

Primary assumption: the bottlenecks of communication are its exit and entry points.

Parameters:
- a BSP computer has p processors, each running at speed r;
- sending and receiving data during an all-to-all communication costs g per data word;
- preparing the network for an all-to-all communication costs l.

SLIDE 3

Introduction

For Bulk Synchronous Parallel algorithms, computations are grouped into phases: there is no communication during a computation phase, but communication is allowed in between computation phases.

[Figure: supersteps 1, 2, and so on for processes 1-4, each separated by a synchronisation & communication phase.]

SLIDE 4

Introduction

The time spent in computation during the $i$-th superstep is
$$T_{\mathrm{comp},i} = \max_s w_i^{(s)} / r.$$

The total cost of communication is
$$T_{\mathrm{comm}} = \sum_{i=0}^{N-1} h_i g.$$

Adding up the computation and communication costs, and accounting for $l$, gives us the full BSP cost:
$$T = \sum_{i=0}^{N-1} \left( \max_s w_i^{(s)} / r + h_i g + l \right).$$

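To make the cost model concrete, here is a minimal sketch (not from the slides; all names are illustrative) that evaluates the BSP cost formula above in C:

#include <stddef.h>

/* BSP cost of N supersteps on a machine (p, r, g, l):
 * w[i] = max_s w_i^(s), the maximum local work (in flops) of superstep i;
 * h[i] = h_i, the maximum number of words sent or received in superstep i. */
double bsp_cost( const size_t N, const double * const w, const double * const h,
                 const double r, const double g, const double l ) {
    double T = 0.0;
    for( size_t i = 0; i < N; ++i )
        T += w[ i ] / r + h[ i ] * g + l;
    return T;
}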

SLIDE 5

Goals

Why another BSP library? We aim to show that:
- existing BSP software runs equally well on shared-memory systems as it does on distributed-memory systems;
- BSP can attain high performance on non-trivial applications, comparable to the state-of-the-art.


SLIDE 8

Goals

Thus, MulticoreBSP for C:
- is (optionally) fully backwards-compatible with BSPlib,
- is based on BSPlib, but with an updated interface,
- defines two new high-performance primitives.

Technologies employed: MulticoreBSP for C is written in ANSI C99, and depends on two standard extensions:
1. POSIX Threads for shared-memory threading.
2. POSIX realtime for high-resolution timings.

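As a bridge to the primitive list on a later slide, a minimal SPMD "hello world" sketch against this interface follows; the header name and the exact integer types returned by bsp_pid/bsp_nprocs are assumptions (classic BSPlib uses int, MulticoreBSP for C unsigned types):

#include <stdio.h>
#include "mcbsp.h"   /* assumed header; a BSPlib-compatible mode also exists */

void spmd( void ) {
    bsp_begin( bsp_nprocs() );   /* start the SPMD section on all available cores */
    printf( "Hello from BSP process %u out of %u\n",
            (unsigned int)bsp_pid(), (unsigned int)bsp_nprocs() );
    bsp_sync();                  /* one (empty) communication phase */
    bsp_end();
}

int main( int argc, char **argv ) {
    bsp_init( &spmd, argc, argv );   /* register the SPMD entry point */
    spmd();
    return 0;
}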


SLIDE 10

Changes from BSPlib

Programming interface updates:
- size_t instead of int where appropriate;
- unsigned types whenever appropriate.

Standard updates:
- asymptotic running times for all BSP primitives;
- support for hierarchical execution (Multi-BSP);
- two new primitives, bsp_direct_get and bsp_hpsend.

The library additionally features thread affinity/pinning.

SLIDE 11

All 22 BSP primitives

SPMD:
  bsp_init: Θ(1)
  bsp_begin: O(p)
  bsp_nprocs: Θ(1)
  bsp_end: O(l)
  bsp_pid: Θ(1)
  bsp_sync: Θ(l + g·h_i)
  bsp_time: Θ(1)
  bsp_abort: Θ(1)

High-performance:
  bsp_hpput: Θ(1)
  bsp_hpget: Θ(1)
  bsp_hpsend: Θ(1)
  bsp_hpmove: Θ(1)
  bsp_direct_get: Θ(size)

DRMA:
  bsp_push_reg: Θ(1)
  bsp_pop_reg: Θ(1)
  bsp_put: Θ(size)
  bsp_get: Θ(1)

BSMP:
  bsp_send: Θ(size)
  bsp_set_tagsize: Θ(1)
  bsp_qsize: O(messages)
  bsp_get_tag: Θ(1)
  bsp_move: Θ(size)

SLIDE 12

BSP ‘direct get’

The ‘direct get’ is a blocking one-sided get instruction. It bypasses the BSP model, but is consistent with bsp_hpget. Its intended use is within supersteps that contain only BSP ‘get’ primitives, which guarantee that source data remains unchanged. Replacing those primitives with calls to bsp_direct_get allows merging such a superstep with the one that follows it, thus saving a synchronisation step.

A. N. Yzelman & Rob H. Bisseling, "An Object-Oriented Bulk Synchronous Parallel Library for Multicore Programming", Concurrency and Computation: Practice and Experience 24(5), pp. 533-553 (2012).
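A hedged sketch of the superstep-merging idea (names are illustrative; remote_x is assumed registered via bsp_push_reg, and process() is a hypothetical user function):

/* Buffered version: the requested data only arrives at the next sync,
 * costing an extra superstep before local_x may be read. */
bsp_get( src_pid, remote_x, 0, local_x, n * sizeof( double ) );
bsp_sync();
process( local_x );

/* Direct version: the copy happens immediately, so the intermediate
 * bsp_sync disappears -- valid only if every process guarantees that
 * remote_x stays unchanged during this superstep. */
bsp_direct_get( src_pid, remote_x, 0, local_x, n * sizeof( double ) );
process( local_x );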

SLIDE 15

BSP ‘hp send’

A BSMP message consists of two parts: an arbitrarily-sized payload, and a fixed-size identifier tag. BSPlib is "buffered on source, buffered on receive": when sending a BSMP message, the source data is copied into the outgoing communication queue; when receiving a BSMP message, the message is put into an incoming queue (during the communication phase).

(Dual buffering also occurs for bsp_put and bsp_get.)

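A minimal sketch of this doubly-buffered BSMP flow; the exact integer types and the SIZE_MAX empty-queue sentinel (from <stdint.h>) follow the convention of the bsp_hpmove loop on a later slide, and are assumptions here:

size_t tagsize = sizeof( unsigned int );
bsp_set_tagsize( &tagsize );   /* takes effect from the next superstep */
bsp_sync();

unsigned int tag = 7;
double payload = 3.14;
bsp_send( dest_pid, &tag, &payload, sizeof( double ) );  /* copy 1: into the outgoing queue */
bsp_sync();                    /* message moves to the destination's incoming queue */

size_t status;
unsigned int recv_tag;
bsp_get_tag( &status, &recv_tag );        /* peek at the next queued message */
if( status != SIZE_MAX ) {                /* queue is non-empty */
    double recv;
    bsp_move( &recv, sizeof( double ) );  /* copy 2: queue into user memory */
}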

SLIDE 18

BSP ‘hp send’

BSP programming is transparent and safe because of
1. buffering on destination,
2. buffering on source.

This costs memory. The alternative: high-performance (hp) variants.
- bsp_move copies a message from the incoming communication queue into local memory.
- bsp_hpmove evades this copy by returning the user a pointer into the queue.
- bsp_hpsend delays reading the source data until the message is actually sent. The local source data should remain unchanged until then!

(bsp_hpput and bsp_hpget also exist.)

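A short sketch of the bsp_hpsend contract just described (names are illustrative):

double buf[ 2 ] = { 1.0, 2.0 };

bsp_send(   dest_pid, &tag, buf, sizeof( buf ) );  /* safe: buffered copy on send */
bsp_hpsend( dest_pid, &tag, buf, sizeof( buf ) );  /* unbuffered: buf read later  */

buf[ 0 ] = -1.0;  /* fine after bsp_send; a race after bsp_hpsend! */
bsp_sync();       /* only now may buf be modified again            */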

SLIDE 23

Two applications

1. The SpMV multiplication: can we attain state-of-the-art performance?
2. The Fast Fourier Transform (FFT): can we indeed run older (distributed-memory) BSP algorithms on shared memory, without penalty?

SLIDE 24

BSP 2D SpMV

Two-dimensional sparse matrix–vector (SpMV) multiplication Ax = y, using two processors (p = 2). Three steps: (1) fan-out, (2) local SpMV multiply, (3) fan-in.

SLIDE 25

BSP 2D SpMV

Step 1: fan-out. Request contiguous ranges of x.

typedef std::vector< fanQuadlet >::const_iterator IT;
for( IT it = fanIn.begin(); it != fanIn.end(); ++it ) {
    const unsigned long int src_P    = it->remoteP;
    const unsigned long int src_ind  = it->remoteStart;
    const unsigned long int dest_ind = it->localStart;
    const unsigned long int length   = it->length;
    bsp_direct_get( src_P, x, src_ind * sizeof( double ),
                    x + dest_ind, length * sizeof( double ) );
}

SLIDE 26

BSP 2D SpMV

Step 2: local SpMV multiplication:

if( A != NULL )      //purely local block
    A->zax( x, y );  //('zax' stands for z=Ax)
if( S != NULL )      //separator blocks
    S->zax( x, y );

We use Compressed BICRS storage with the nonzeroes in row-major order.

Yzelman & Roose, "High-level strategies for parallel shared-memory sparse matrix–vector multiplication", IEEE TPDS, 2013 (in press); paper: http://dx.doi.org/10.1109/TPDS.2013.31, software: http://albert-jan.yzelman.net/software/#SL
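The slides use the SparseLibrary's Compressed BICRS zax kernel; purely as an illustration of what zax computes (y = y + Ax), a plain CRS version might look as follows (all names are assumptions):

#include <stddef.h>

/* y += A x for an m-row CRS matrix: row_start has m+1 entries, and
 * col[ k ], val[ k ] describe the k-th nonzero in row-major order. */
void zax_crs( const size_t m, const size_t * const row_start,
              const size_t * const col, const double * const val,
              const double * const x, double * const y ) {
    for( size_t i = 0; i < m; ++i )
        for( size_t k = row_start[ i ]; k < row_start[ i + 1 ]; ++k )
            y[ i ] += val[ k ] * x[ col[ k ] ];
}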

SLIDE 27

BSP 2D SpMV

Step 3: fan-in (I). Send individual row contributions.

//the tagsize is initialised to 2*sizeof( ULI )
//fanOut[ i ] has the following layout:
//{ ULI remoteP, localStart, remoteStart, length; }
typedef unsigned long int ULI;
for( ULI i = 0; i < fanOut.size(); ++i ) {
    const ULI dest_P  = fanOut[ i ].remoteP;
    const ULI src_ind = fanOut[ i ].localStart;
    const ULI length  = fanOut[ i ].length;
    bsp_hpsend( dest_P, &( fanOut[ i ].remoteStart ),
                y + src_ind, length * sizeof( double ) );
}
bsp_sync();

SLIDE 28

BSP 2D SpMV

Step 3: fan-in (II). Handle incoming contributions.

unsigned long int *msg_tag;
double *msg_payload;
while( bsp_hpmove( (void**)&msg_tag, (void**)&msg_payload ) != SIZE_MAX ) {
    const unsigned long int y_dest = msg_tag[ 0 ];
    const unsigned long int length = msg_tag[ 1 ];
    for( unsigned long int i = 0; i < length; ++i )
        y[ y_dest + i ] += msg_payload[ i ];
}

This finishes our implementation of the 2D SpMV multiply.

SLIDE 29

Fast Fourier Transform

The presented algorithm is a simplified version of the one in Chapter 3 of

Rob H. Bisseling, "Parallel Scientific Computation – a structured approach using BSP and MPI", Oxford University Press (2004).

The BSP FFT algorithm was designed for use on classical distributed-memory systems, and is modified here to use optimised sequential FFT kernels.

(The experiment code uses the full algorithm described in the book.)

SLIDE 30

Fast Fourier Transform

The discrete Fourier transform takes an input vector $x \in \mathbb{C}^n$ and calculates $y \in \mathbb{C}^n$:
$$y = \mathrm{DFT}(x) = F_n x, \quad \text{s.t.} \quad y_i = \sum_{k=0}^{n-1} x_k e^{-2\pi\imath ik/n}.$$

The FFT computes this in $\Theta(5n \log_2 n)$ flops:
$$F_n = \left( \prod_{i=0}^{m-1} I_{2^i} \otimes B_{n/2^i} \right) \left( \prod_{i=1}^{m} I_{n/2^i} \otimes S_{2^i} \right),$$

with $m = \log_2 n$, $B$ butterfly matrices, and $S$ even-odd sorting matrices.

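To ground the DFT formula, a direct Θ(n²) evaluation in C (illustrative reference code only, not from the slides):

#include <stddef.h>
#include <complex.h>
#include <math.h>

/* Direct evaluation of y_i = sum_{k=0}^{n-1} x_k e^{-2 pi I i k / n};
 * Theta(n^2) flops, versus Theta(5 n log2 n) for the FFT. */
void dft( const size_t n, const double complex * const x, double complex * const y ) {
    for( size_t i = 0; i < n; ++i ) {
        y[ i ] = 0.0;
        for( size_t k = 0; k < n; ++k )
            y[ i ] += x[ k ] *
                cexp( -2.0 * M_PI * I * (double)i * (double)k / (double)n );
    }
}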

SLIDE 31

Fast Fourier Transform

The right-hand series of products amounts to a bit-reversal:
$$\prod_{i=1}^{m} I_{n/2^i} \otimes S_{2^i} = R_n.$$

The left-hand series is an unordered FFT (UFFT):
$$\mathrm{UFFT}(v) = U_n v = \left( \prod_{i=0}^{m-1} I_{2^i} \otimes B_{n/2^i} \right) v.$$

Setting $q = \log_2 p$ and splitting the UFFT yields:
$$\left( \prod_{i=0}^{m-q-1} I_{2^i} \otimes B_{n/2^i} \right) \left( \prod_{i=m-q}^{m-1} I_{2^i} \otimes B_{n/2^i} \right) = G_n \left( I_{n/p} \otimes U_p \right).$$

SLIDE 32

Fast Fourier Transform

For cyclically distributed x, the local operation $G_n x$ is actually a generalised FFT with shift s/p. We can:
- use a single unordered GFFT with shift s/p of length n/p to finish our computation,
- use a single multiplication with a diagonal matrix followed by a regular UFFT ($y = G_n x = U_{n/p} R_{n/p} D^{s/p}_{n/p} R_{n/p} x$, where $D^{\alpha}_{n}$ is a diagonal matrix with $d_{jj} = e^{-2\pi\imath\alpha j/n}$), or
- use a single multiplication with a diagonal matrix, followed by a multiplication with $R_n^{-1}$, followed by a regular FFT.

...which optimised sequential kernels are available?

SLIDE 35

Fast Fourier Transform

Using sequential FFTW locally, the final algorithm:

1. Initialise x cyclically.
2. Do the local bit-reversion.
3. Undo n/p² bit-reversions of length p.
4. Do n/p² local optimised FFTs of length p.
5. Redistribute y to a cyclic distribution.
6. Twiddle y with $T^{s/p}_{n/p}$.
7. Undo the local bit-reversion of length n/p.
8. Do one local optimised FFT of length n/p.

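A hedged sketch of the local FFTW plans behind steps 4 and 8 (a fragment, assuming xloc points to this process' n/p complex values stored as contiguous batches, and that n and p are in scope; the actual experiment code may differ):

#include <fftw3.h>

/* Step 4: n/p^2 independent FFTs of length p over contiguous batches. */
const int len_p = (int)p;
fftw_plan step4 = fftw_plan_many_dft( 1, &len_p, (int)( n / ( p * p ) ),
        xloc, NULL, 1, len_p,    /* input : stride 1, batch distance p */
        xloc, NULL, 1, len_p,    /* output: in-place, same layout      */
        FFTW_FORWARD, FFTW_MEASURE );

/* Step 8: one FFT of length n/p, also in-place. */
fftw_plan step8 = fftw_plan_dft_1d( (int)( n / p ), xloc, xloc,
        FFTW_FORWARD, FFTW_MEASURE );

fftw_execute( step4 );
/* ...steps 5-7: redistribute, twiddle, undo the local bit-reversal... */
fftw_execute( step8 );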

SLIDE 36

Thread affinity

Different affinity options:
- scattered, maximises bandwidth;
- compact, maximises data locality.

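MulticoreBSP pins threads internally; purely as an illustration, the two strategies can be expressed with raw POSIX calls as below (assuming cores are numbered consecutively per socket; all names are illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

/* Pin BSP process s on a machine with `cores` cores spread evenly
 * over `sockets` sockets. */
void pin( const unsigned int s, const unsigned int cores,
          const unsigned int sockets, const int compact ) {
    const unsigned int per_socket = cores / sockets;
    cpu_set_t mask;
    CPU_ZERO( &mask );
    if( compact )    /* compact: fill each socket before the next (locality) */
        CPU_SET( s, &mask );
    else             /* scattered: round-robin over sockets (bandwidth) */
        CPU_SET( ( s % sockets ) * per_socket + s / sockets, &mask );
    pthread_setaffinity_np( pthread_self(), sizeof( mask ), &mask );
}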

SLIDE 37

MultiBSP

MulticoreBSP supports nested BSP runs. E.g., instead of reverting to optimised sequential (U/G)FFTs, we can also use optimised parallel FFT kernels. Consider a machine with eight quad-core processors, one per socket: we can start 8 BSP FFT processes that each revert to a parallel FFT using 4 cores. While this introduces more data redistribution stages, the BSP g and l are lower in each of these steps; each redistribution step is cheaper.

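A heavily hedged sketch of such a nested (Multi-BSP style) run; the exact nesting mechanics follow the MulticoreBSP documentation, and everything here is illustrative:

void inner( void ) {
    bsp_begin( 4 );   /* four cores within one socket */
    /* ...socket-local parallel FFT kernel... */
    bsp_end();
}

void outer( void ) {
    bsp_begin( 8 );   /* one BSP process per socket */
    /* ...cross-socket redistribution... */
    bsp_init( &inner, 0, NULL );   /* nested SPMD section */
    inner();
    bsp_end();
}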

SLIDE 40

UFFT – shmem comparison

Behaviour on shared memory is similar to that on distributed memory, but the latter has larger combined caches and bandwidth.

[Figure: speedups of the BSP FFT of length 2^26, Lynx vs. DL980 (scattered affinity); log2 speedup versus log2 p, against perfect speedup.]

SLIDE 41

FFTW – shmem comparison

Using FFTW is about 1.7x faster on average; 2.6x at most. All extra operations cost a factor two in scalability, however.

[Figure: speedups of the BSP FFT of length 2^26 on two platforms (Lynx - BSPonMPI, DL980 - scattered); log2 speedup versus log2 p, against perfect speedup.]

SLIDE 42

FFTW – weak scalability

Behaviour in weak scalability is also similar. Note that BSPonMPI performs better for small n.

[Figure: speedups of the BSP FFT with p = 64 on two platforms, versus log2 n: Lynx - BSPonMPI, DL980 - scattered, DL980 - BSPonMPI.]

SLIDE 43

FFTW – raw speeds

Peak performance ≈ 273 Gflop/s. Peak bandwidth ≈ 85 GByte/s (27 Gflop/s with 5n log2 n flops/double).

[Figure: computation speeds (Gflop/s) versus number of processors on the DL980, for log2 n = 9, 13, 19, and 26.]

SLIDE 44

SpMV – new primitives

We test the new primitives using the BSP 2D SpMV multiply:

[Figure: usefulness of the new primitives on the DL580: speedups on the matrices FS1, ldr, cg15, adap, road, and wiki, non-hp versus full hp.]

SLIDE 45

SpMV – new primitives

Same test, different architecture:

[Figure: the same experiment on the DL980: speedups on FS1, ldr, cg15, adap, road, and wiki, non-hp versus full hp.]

SLIDE 46

SpMV – comparison

The BSP 2D SpMV is often faster than the previous state-of-the-art!

[Figure: SpMV multiplication speeds (Gflop/s) on the DL980 for FS1, ldr, cg15, adap, and wiki: OpenMP CRS, Cilk CSB, PThread 1D, and BSP 2D.]

SLIDE 47

Future work

- Implement and demonstrate the practical gain of hierarchical execution for the BSP FFT algorithm.
- Compare the BSP FFT to the state-of-the-art in multicore FFTs.
- Find the limits of high-performance BSP computing.
- Incorporate distributed-memory capabilities.
- Avoid global synchronisation barriers.
- Enable fault tolerance.

SLIDE 48

Conclusions

We have introduced MulticoreBSP for C and its novel concepts, shown that running existing BSP algorithms on shared memory attains similar performance, and shown that BSP algorithms compete with the state-of-the-art in high-performance computing.

Thank you for your attention!

SLIDE 49

FFT – thread affinity

We experiment on an 8-socket, 64-core machine.

[Figure: speedups of the BSP FFT of length 2^26 on the DL980, compact versus scattered affinity, against perfect speedup.]

SLIDE 50

FFT – BSP vs. BSPonMPI

[Figure: speedups of the BSP FFT of length 2^26, McBSP versus BSPonMPI on the DL980, against perfect speedup.]

A dedicated shared-memory library is indeed faster than BSPonMPI.

SLIDE 51

FFT – Mortals vs. FFTW

[Figure: sequential FFT computation speeds (Gflop/s, log scale) versus log n on Lynx: unoptimised UFFTs versus FFTW3.]

But how fast is the original UFFT implementation?

SLIDE 52

Fast Fourier Transform

Consider an entry with global index (1001101)_2; its bit-reversed index is (1011001)_2.
- If p = 4 and x is distributed block-wise: the entry is on process (10)_2 with local index (01101)_2; local bit-reversal yields the local index (10110)_2 (still at process (10)_2).
- If p = 4 and x is distributed cyclically: the entry is on process (01)_2 with local index (10011)_2; local bit-reversion results in local index (11001)_2.

Thus local bit-reversion of x yields a global bit-reversion with x block-distributed and with bit-reversed process numbers.

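For reference, a small C helper (not from the slides) that performs the bit reversal used in this example:

#include <stddef.h>

/* Reverse the lowest `bits` bits of x; e.g. bitrev( 0x4D, 7 ) maps
 * (1001101)_2 to (1011001)_2, matching the example above. */
unsigned int bitrev( unsigned int x, const unsigned int bits ) {
    unsigned int r = 0;
    for( unsigned int i = 0; i < bits; ++i ) {
        r = ( r << 1 ) | ( x & 1u );
        x >>= 1;
    }
    return r;
}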
