Bulk-synchronous pseudo-streaming for many-core accelerators - PowerPoint PPT Presentation




Bulk-synchronous pseudo-streaming for many-core accelerators

Jan-Willem Buurlage [1], Tom Bannink [1,2], Abe Wits [3]

[1] Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands
[2] QuSoft, Amsterdam, The Netherlands
[3] Utrecht University, The Netherlands

1


Overview

  • Parallella
  • Epiphany BSP
  • Extending BSP with streams
  • Examples: inner product, matrix multiplication, sorting

2


Parallella


Parallella

  • ‘A supercomputer for everyone, with the lofty goal of democratizing access to parallel computing’
  • Crowd-funded development board; raised almost $1M in 2012.

3


Epiphany co-processor

  • N × N grid of RISC processors, clocked by default at 600 MHz (current generations have 16 or 64 cores), each with limited local memory.
  • Efficient communication network with ‘zero-cost start-up’ communication.
  • Asynchronous connection to an external memory pool using DMA engines (used for software caching).
  • Energy efficient at 50 GFLOPs/W (single precision); in 2011, top GPUs were about 5× less efficient.

4


Epiphany memory

  • Each Epiphany core has 32 kB of local memory; on the 16-core model, 512 kB is available in total. There are no caches.
  • On each core, the kernel binary and stack already take up a large section of this memory.
  • On the Parallella, there is 32 MB of external RAM shared between the cores, and 1 GB of additional RAM accessible from the ARM host processor.

5


Many-core co-processors

  • Applications: mobile, education, possibly even HPC.
  • There are also specialized (co)processors on the market for e.g. machine learning and computer vision.
  • KiloCore (UC Davis, 2016): 1000 processors on a single chip.

6


Epiphany BSP


Epiphany BSP

  • Parallella: a powerful platform, especially for students and hobbyists, but it suffers from poor tooling.
  • Epiphany BSP: an implementation of the BSPlib standard for the Parallella.
  • Custom implementations for many rudimentary operations: memory management, printing, barriers.

7


Hello World: ESDK (124 LOC)

    // host
    const unsigned ShmSize = 128;
    const char ShmName[] = "hello_shm";
    const unsigned SeqLen = 20;

    int main(int argc, char *argv[])
    {
        unsigned row, col, coreid, i;
        e_platform_t platform;
        e_epiphany_t dev;
        e_mem_t mbuf;
        int rc;

        srand(1);
        e_set_loader_verbosity(H_D0);
        e_set_host_verbosity(H_D0);
        e_init(NULL);
        e_reset_system();
        e_get_platform_info(&platform);
        rc = e_shm_alloc(&mbuf, ShmName, ShmSize);
        if (rc != E_OK)
            rc = e_shm_attach(&mbuf, ShmName);
        // ...

    // kernel
    int main(void)
    {
        const char ShmName[] = "hello_shm";
        const char Msg[] = "Hello World from core 0x%03x!";
        char buf[256] = { 0 };
        e_coreid_t coreid;
        e_memseg_t emem;
        unsigned my_row;
        unsigned my_col;

        // Who am I? Query the CoreID from hardware.
        coreid = e_get_coreid();
        e_coords_from_coreid(coreid, &my_row, &my_col);

        if (E_OK != e_shm_attach(&emem, ShmName)) {
            return EXIT_FAILURE;
        }
        snprintf(buf, sizeof(buf), Msg, coreid);
        // ...

8


Hello World: Epiphany BSP (18 LOC)

    // host
    #include <host_bsp.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        bsp_init("e_hello.elf", argc, argv);
        bsp_begin(bsp_nprocs());
        ebsp_spmd();
        bsp_end();
        return 0;
    }

    // kernel
    #include <e_bsp.h>

    int main()
    {
        bsp_begin();
        int n = bsp_nprocs();
        int p = bsp_pid();
        ebsp_printf("Hello world from core %d/%d", p, n);
        bsp_end();
        return 0;
    }

9


BSP computers

  • The BSP model [Valiant, 1990] describes a general way to perform parallel computations.
  • An abstract BSP computer is associated to the model; it has p processors, which all have access to a communication network.

[Figure: processors 1, 2, 3, 4, . . . , p attached to a shared communication network]

10


BSP computers (cont.)

  • BSP programs consist of a number of supersteps, which each have a computation phase and a communication phase. Each superstep is followed by a barrier synchronisation.
  • Each processor on a BSP computer has a processing rate r. The communication network has two parameters: g, related to the communication speed, and l, the latency.
  • The running time of a BSP program can be expressed in terms of these parameters! We denote this by T(g, l).

11
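As an aside (not on the slide itself): for a single superstep with local work w and an h-relation of size h, the standard BSP cost is usually written T = w + h·g + l, with g and l the machine parameters above. A minimal sketch, with made-up illustration values; the function name is our own:

```c
/* Standard BSP cost of one superstep: local work w, h-relation h,
 * machine parameters g (per-word communication cost) and l (latency).
 * All costs are expressed in FLOP units, as on the Parallella slide. */
long bsp_superstep_cost(long w, long h, long g, long l)
{
    return w + h * g + l;
}
```

The total running time T(g, l) of a program is then the sum of these costs over all supersteps.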


BSP on low-memory

  • With limited local memory, classic BSP programs cannot run.
  • The primary goal should be to minimize communication with external memory.
  • Many known performance models can be applied to this system (EM-BSP, MBSP, Multi-BSP), but there is no portable way to write and develop algorithms.

12


BSP accelerator

  • We view the Epiphany processor as a BSP computer with limited local memory of capacity L.
  • We have a shared external memory unit of capacity E, from which we can read data asynchronously with inverse bandwidth e.
  • Parameter pack: (p, r, g, l, e, L, E).

13


Parallella as a BSP accelerator

  • p = 16 or p = 64
  • r = (600 × 10^6)/5 = 120 × 10^6 FLOPS (*)
  • l = 1.00 FLOP
  • g = 5.59 FLOP/word
  • e = 43.4 FLOP/word
  • L = 32 kB
  • E = 32 MB

(*): In practice one FLOP every 5 clock cycles; in theory up to 2 FLOPs per clock cycle.

14
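The parameter pack above can be collected in a small struct; this is our own illustration (the struct name and field layout are not part of any API), filled in with the 16-core Parallella values from this slide:

```c
/* The BSP accelerator parameter pack (p, r, g, l, e, L, E). */
typedef struct {
    int    p;  /* number of cores                        */
    double r;  /* processing rate, FLOPS                 */
    double g;  /* communication cost, FLOP/word          */
    double l;  /* latency, FLOP                          */
    double e;  /* inverse external bandwidth, FLOP/word  */
    long   L;  /* local memory capacity, bytes           */
    long   E;  /* external memory capacity, bytes        */
} bsp_accelerator_t;

/* Parallella (16-core Epiphany) as a BSP accelerator. */
static const bsp_accelerator_t parallella_16 = {
    .p = 16,
    .r = 120e6,          /* 600 MHz / 5 cycles per FLOP */
    .g = 5.59,
    .l = 1.00,
    .e = 43.4,
    .L = 32L * 1024,
    .E = 32L * 1024 * 1024,
};
```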


Extending BSP with streams


External data access: streams

  • Idea: present the input of the algorithm as streams for each core. Each stream consists of a number of tokens.
  • The ith stream for the sth processor: Σ_i^s = (σ_1, σ_2, . . . , σ_n).
  • Tokens fit in local memory: |σ_i| < L.
  • We call the BSP programs that run on the tokens loaded on the cores hypersteps.

15


Structure of a program

  • In a hyperstep, while the computation is underway, the next tokens are loaded in (asynchronously).
  • The time a hyperstep takes is bound either by bandwidth or by computation.
  • Cost function:

    T̃ = Σ_{h=0}^{H−1} max( T_h , e Σ_i C_i ).

Here, C_i is the token size of the ith stream, and T_h is the (BSP) cost of the hth hyperstep.

16
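The cost function above can be evaluated directly; a minimal sketch (our own illustration, not part of the library), where the bandwidth term e Σ_i C_i is the same in every hyperstep because the token sizes are fixed per stream:

```c
/* Evaluate T~ = sum_{h=0}^{H-1} max(T_h, e * sum_i C_i).
 * T[h]: BSP cost of hyperstep h; C[i]: token size of stream i;
 * e: inverse external bandwidth. */
double pseudo_stream_cost(const double *T, int H,
                          const double *C, int num_streams, double e)
{
    double tokens = 0.0;
    for (int i = 0; i < num_streams; i++)
        tokens += C[i];
    double io_cost = e * tokens;   /* bandwidth bound of one hyperstep */

    double total = 0.0;
    for (int h = 0; h < H; h++)    /* each hyperstep is either        */
        total += (T[h] > io_cost)  /* compute-bound or bandwidth-bound */
                     ? T[h]
                     : io_cost;
    return total;
}
```

A hyperstep with T_h below the I/O cost contributes e Σ_i C_i (bandwidth-bound); otherwise it contributes T_h (compute-bound).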


Pseudo-streaming

  • In video streaming, by default the video just ‘runs’, but the viewer can skip ahead or rewatch portions. In this context this is referred to as pseudo-streaming.
  • Here, by default the next logical token is loaded in, but the programmer can seek within the stream.
  • This minimizes the amount of code necessary for communication with external memory.
  • We call the resulting programs bulk-synchronous pseudo-streaming algorithms.

17


BSPlib extension for streaming

    // host
    void* bsp_stream_create(
        int processor_id,
        int stream_size,
        int token_size,
        const void* initial_data);

    // kernel
    int bsp_stream_open(int stream_id);
    void bsp_stream_close(int stream_id);

18


BSPlib extension for streaming (2)

    int bsp_stream_move_down(
        int stream_id,
        void** buffer,
        int preload);

    int bsp_stream_move_up(
        int stream_id,
        const void* data,
        int data_size,
        int wait_for_completion);

    void bsp_stream_seek(
        int stream_id,
        int delta_tokens);

19


Examples


Example 1: Inner product

  • Input: vectors v, u of size n.
  • Output: v · u = Σ_i v_i u_i.

[Figure: the vector v is split into blocks v(0), v(1), v(2); the stream Σ_v^0 for core 0 consists of tokens (σ_v^0)_1, (σ_v^0)_2, . . .]

20


Example 1: Inner product (cont.)

  • Input: vectors v, u of size n.
  • Output: v · u = Σ_i v_i u_i.

  1. Make a p-way distribution of v, u (e.g. in blocks), resulting in subvectors v(s) and u(s).
  2. These subvectors are then split into tokens that each fit in L. We have two streams for each core s:

     Σ_v^s = ((σ_v^s)_1, (σ_v^s)_2, . . . , (σ_v^s)_H),
     Σ_u^s = ((σ_u^s)_1, (σ_u^s)_2, . . . , (σ_u^s)_H).

  3. Maintain a partial answer α_s throughout the algorithm, adding (σ_v^s)_h · (σ_u^s)_h in the hth hyperstep. After the final tokens, sum over all α_s.

21
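The per-core loop of step 3 can be sketched as plain C; this is a sequential simulation of what a single core s does (our own illustration, not using the actual stream API), where each iteration of the outer loop is one hyperstep over one token of v and one token of u:

```c
#include <stddef.h>

/* Simulate the streamed inner product on one core: process the
 * subvectors token by token (token_len elements per hyperstep),
 * keeping the running partial answer alpha. */
double streamed_inner_product(const double *v, const double *u,
                              size_t n, size_t token_len)
{
    double alpha = 0.0;
    for (size_t start = 0; start < n; start += token_len) {
        size_t end = start + token_len < n ? start + token_len : n;
        /* one hyperstep: dot product of the current pair of tokens */
        for (size_t i = start; i < end; i++)
            alpha += v[i] * u[i];
    }
    return alpha;
}
```

In the real program, the next tokens of Σ_v^s and Σ_u^s would be fetched asynchronously while this inner loop runs, and the α_s are summed over all cores at the end.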


Example 2: Matrix multiplication

  • Input: matrices A, B of size n × n
  • Output: C = AB

We decompose the (large) matrix multiplication into smaller problems that can be performed on the accelerator (with N × N cores). This is done by decomposing the input matrices into M × M outer blocks, where M is chosen suitably large.

    AB = ( A_11 A_12 . . . A_1M )   ( B_11 B_12 . . . B_1M )
         ( A_21 A_22 . . . A_2M )   ( B_21 B_22 . . . B_2M )
         (  ...  ...  ...  ...  )   (  ...  ...  ...  ...  )
         ( A_M1 A_M2 . . . A_MM )   ( B_M1 B_M2 . . . B_MM )

22


Example 2: Matrix multiplication (cont.)

We compute the outer blocks of C in row-major order. Since

    C_ij = Σ_{k=1}^{M} A_ik B_kj,

a complete outer block is computed every M hypersteps, where in a hyperstep we perform the multiplication of one outer block of A and one of B. Each block is again decomposed into inner blocks that fit into a core:

    A_ij = ( (A_ij)_11 (A_ij)_12 . . . (A_ij)_1N )
           ( (A_ij)_21 (A_ij)_22 . . . (A_ij)_2N )
           (    ...       ...    ...     ...     )
           ( (A_ij)_N1 (A_ij)_N2 . . . (A_ij)_NN ).

23
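The outer-block accumulation C_ij = Σ_k A_ik B_kj can be sketched as a plain sequential loop; this is our own illustration (no accelerator, no streams), where each iteration of the `bk` loop corresponds to one hyperstep contributing one product A_ik B_kj to the current outer block. It assumes the block size `bs` divides n:

```c
/* Blocked matrix multiplication, C = A * B, all n*n row-major.
 * bs is the outer block size; bs must divide n. */
void blocked_matmul(const double *A, const double *B, double *C,
                    int n, int bs)
{
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (int bi = 0; bi < n; bi += bs)          /* outer block row i  */
        for (int bj = 0; bj < n; bj += bs)      /* outer block col j  */
            for (int bk = 0; bk < n; bk += bs)  /* one "hyperstep" k  */
                for (int i = bi; i < bi + bs; i++)
                    for (int j = bj; j < bj + bs; j++)
                        for (int k = bk; k < bk + bs; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```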


Example 2: Matrix multiplication (cont.)

The streams for core (s, t) are the inner blocks of A that belong to the core, laid out in row-major order, and the inner blocks of B in column-major order.

    Σ_st^A = ( (A_11)_st (A_12)_st . . . (A_1M)_st   [row 1, repeated M times]
               (A_21)_st (A_22)_st . . . (A_2M)_st   [row 2, repeated M times]
               . . .
               (A_M1)_st (A_M2)_st . . . (A_MM)_st ) [row M, repeated M times],

    Σ_st^B = ( (B_11)_st (B_21)_st . . . (B_M1)_st
               (B_12)_st (B_22)_st . . . (B_M2)_st
               . . .
               (B_1M)_st (B_2M)_st . . . (B_MM)_st ), the whole sequence repeated M times.

24


Example 2: Matrix multiplication (cont.)

In a hyperstep, a suitable BSP algorithm (e.g. Cannon’s algorithm) is used for the matrix multiplication on the accelerator. We show that the cost function can be written as:

    T̃_cannon = max( 2n³/N² + (2Mn²/N) g + NM³ l ,  (2Mn²/N²) e ).

25


Example 3: Sorting

  • Input: an array A of comparable objects.
  • Output: the sorted array Ã.

  1. Parallel bucket sort: create p buckets, put each element of A in the appropriate bucket, and let the sth core sort the sth bucket.
  2. Sample sort samples elements of A in order to balance the buckets.

26


Sorting: Splitters

  1. Split the input array to create p equally sized streams. Also create p initially empty streams that will be the buckets.
  2. We adapt the sample sort algorithm; first we need to find the buckets, which is Phase 1 of our algorithm.
  3. Each core samples k elements randomly from its stream. We do this using a classic streaming algorithm called reservoir sampling. These samples are then sorted.

27
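Reservoir sampling (step 3) can be sketched in a few lines of C; this is the textbook "Algorithm R", shown here as our own self-contained illustration rather than the deck's actual kernel code:

```c
#include <stdlib.h>

/* Keep a uniform random sample of k elements from a stream of n
 * elements, using only O(k) memory: fill the reservoir with the
 * first k elements, then replace a random slot with element i
 * with probability k/(i+1). */
void reservoir_sample(const int *stream, int n, int *reservoir, int k)
{
    for (int i = 0; i < k; i++)       /* fill the reservoir */
        reservoir[i] = stream[i];
    for (int i = k; i < n; i++) {
        int j = rand() % (i + 1);     /* uniform index in [0, i] */
        if (j < k)                    /* happens with prob. k/(i+1) */
            reservoir[j] = stream[i];
    }
}
```

On the Epiphany, each core would run this over the tokens of its own stream as they arrive, so the whole stream never has to fit in local memory.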


Sorting: Splitters (cont.)

  • Each core chooses p equally spaced elements and sends these to the first core.
  • The first core sorts its p² values, and chooses p − 1 equally spaced global splitters.
  • The global splitters are communicated to the other cores, and define the bucket boundaries.

28
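The splitter selection on the first core is a one-liner once the p² gathered samples are sorted; a minimal sketch (our own illustration, with a function name of our choosing):

```c
/* Given the p*p sorted sample values gathered on the first core,
 * pick the p-1 equally spaced global splitters: every p-th element,
 * skipping the first group. `splitters` must hold p-1 ints. */
void choose_splitters(const int *sorted_samples, int p, int *splitters)
{
    for (int i = 1; i < p; i++)
        splitters[i - 1] = sorted_samples[i * p];
}
```

Bucket s then holds the elements between splitter s − 1 and splitter s (with buckets 0 and p − 1 open-ended below and above).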


Sorting: Bucketing

  • In Phase 2 of the algorithm we fill the buckets with data.
  • In a hyperstep, we run a BSP sort on the current tokens. Afterwards, each core holds consecutive elements that can be sent to the correct buckets efficiently.
  • These buckets are the p additional streams that were created, which were initially empty.

29


Sorting the individual buckets

  • In Phase 3, the sth core sorts the sth bucket stream using an external sorting algorithm.
  • We use a merge sort variant for this.

30


Summary

  • Parallella and the Epiphany: a great platform for BSP.
  • Pseudo-streaming algorithms are a convenient way to think about algorithms for this platform.
  • We can often (re)use BSP algorithms and generalize them to this streaming framework, even if local memory is limited.

31


Thank you for your attention. Questions?

32


Sources

  1. Parallella, Adapteva Epiphany: http://www.adapteva.org
  2. Epiphany BSP: http://www.codu.in/ebsp
  3. KiloCore: https://www.ucdavis.edu/news/worlds-first-1000-processor-chip

33