Fast Distributed Process Creation with the XMOS XS1 Architecture
SLIDE 1

Fast Distributed Process Creation with the XMOS XS1 Architecture

James Hanlon

Department of Computer Science University of Bristol, UK

20th June 2011

SLIDE 2

Introduction
  ◮ Processors as a resource
  ◮ Scalable parallel programming
  ◮ Contributions
Implementation
  ◮ Platform
  ◮ Explicit processor allocation
Demonstration & evaluation
  ◮ Rapid process distribution
  ◮ Sorting
Conclusions
Future work

SLIDE 3

Processors as a resource

◮ Current parallel programming models provide little support for management of processors.

◮ Many are closely coupled to the machine and parameterised by the number of processors.

◮ The programmer is left responsible for scheduling processes on the underlying system.

◮ As the level of parallelism increases (10^6 processes at exascale), it is clear that we require a means to automatically allocate processors.

◮ We don’t expect to have to write our own memory allocation routines!

SLIDE 4

Scalable parallel programming

◮ For parallel computations to scale it will be necessary to express programs in an intrinsically parallel manner, focusing on dependencies between processes.

◮ Excess parallelism enables scalability (parallel slackness hides communication latency).

◮ It is also more expressive:
  ◮ For irregular and unbounded structures.
  ◮ Allows composite structures and construction of parallel subroutines.

◮ The scheduling of processes and allocation of processors is then a property of the language and runtime.

◮ But this requires the ability to rapidly initiate processes and collect results from them as they terminate.

SLIDE 5

Contributions

1. The design of an explicit, lightweight scheme for distributed dynamic processor allocation.

2. A convincing proof-of-concept implementation on a sympathetic architecture.

3. Predictions for larger systems based on accurate performance models.

SLIDE 6

Platform

◮ XMOS XS1 architecture:
  ◮ General-purpose, multi-threaded, message-passing and scalable.
  ◮ Primitives for threading, synchronisation and communication execute in the same time as standard load/store, branch and arithmetic operations.
  ◮ Support for position-independent code.
  ◮ Predictable.

◮ XK-XMP-64:
  ◮ Experimental board with 64 XCore processors connected in a hypercube.
  ◮ 64kB of memory and 8 hardware threads per core.
  ◮ Aggregate 512-way concurrency, 25.6 GIPS and 4MB RAM.

◮ A bespoke language and runtime with a simple set of features to demonstrate and experiment with distributed process creation.

SLIDE 7

Explicit processor allocation: notation

◮ Processor allocation is exposed in the language with the on statement:

  on p do Q

  This executes process Q synchronously on processor p.

◮ The execution of all processes is implicitly on the current processor.

◮ We can compose on in parallel to exploit multi-threaded parallelism:

  { Q1 || on p do Q2 }

  which offloads and executes Q2 while executing Q1.

◮ Processes must be disjoint.
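The semantics of `on` and parallel composition can be sketched in Python rather than the talk's bespoke language. This is an illustrative model only: threads stand in for XS1 hardware threads, and the processor id `p` is just a tag passed to the offloaded process.

```python
import threading

def par(*procs):
    """Parallel composition { P1 || P2 || ... }: run all, wait for all."""
    threads = [threading.Thread(target=p) for p in procs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def on(p, q):
    """Model 'on p do q': defer q, tagged with its target processor id."""
    return lambda: q(p)

# Q1 and Q2 are disjoint: they write to different keys.
results = {}
def q1(p): results['Q1'] = p
def q2(p): results['Q2'] = p

# { Q1 || on 1 do Q2 }, with Q1 implicitly on processor 0.
par(on(0, q1), on(1, q2))
```

Because the processes are disjoint, the composition needs no locking beyond the final join.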

SLIDE 8

Explicit processor allocation: implementation

[Diagram, built up over several slides: source and host processors; the source forms C(P) and sends it; the host initialises P and later sends back updates.]

◮ on forms a closure C of process P including the variable context and a list of procedures including P and those it calls.

◮ A connection is initialised between the source and host processors and the host creates a new thread for the incoming process.

◮ It then receives C(P) and initialises P on the new thread.

◮ All call branches are performed through a table (with the instruction BLACP), so the host updates this to record the new address of each procedure contained in C.

◮ When P has terminated, the host sends back any updated free variables of P stored at the source (as P is disjoint).
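The closure-passing steps above can be sketched in Python. This is a simplified model, not the XS1 implementation: a queue stands in for the source-host connection, a dictionary stands in for the BLACP call table, and the closure carries procedure names plus the argument context rather than position-independent code.

```python
import threading
import queue

chan = queue.Queue()   # connection between source and host processors

CALL_TABLE = {}        # host-side call table: procedure name -> code
def register(f):
    """Record a procedure so the host can branch to it by name."""
    CALL_TABLE[f.__name__] = f
    return f

@register
def work(x, out):
    out.append(x * x)

def source(proc_name, args):
    """Form the closure C(P): procedure list plus variable context."""
    chan.put({'procs': [proc_name], 'args': args})

def host(results):
    """Receive C(P), resolve P via the call table, run it on a new thread."""
    c = chan.get()
    p = CALL_TABLE[c['procs'][0]]
    t = threading.Thread(target=p, args=(*c['args'], results))
    t.start()
    t.join()
    return results

out = []
source('work', (7,))
host(out)   # out now holds work's result
```

The real mechanism additionally patches the call table with the new address of each received procedure, which this name-based lookup sidesteps.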

SLIDE 13

Rapid process distribution

◮ We can combine recursion and parallelism to rapidly generate processes:

  proc distribute (t, n) is
    if n = 1
    then node (t)
    else { distribute (t, n/2) ||
           on t + n/2 do distribute (t + n/2, n/2) }

◮ This distributes the process node over n processors in O(log n) time.

◮ The execution of distribute (0, 4) proceeds in time and space (built up over several slides; each step doubles the number of active processors):

  p0: distribute (0,4)   distribute (0,2)   distribute (0,1)   node (0)
  p1:                                       distribute (1,1)   node (1)
  p2:                    distribute (2,2)   distribute (2,1)   node (2)
  p3:                                       distribute (3,1)   node (3)
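The recursive doubling above can be sketched in Python, with threads standing in for remote process creation (an assumption: the talk's language offloads the second branch to another processor, while here both branches share one machine).

```python
import threading

placed = []                       # which "processors" ran node
lock = threading.Lock()

def node(t):
    with lock:
        placed.append(t)

def distribute(t, n):
    """Distribute node over processors t .. t+n-1 in O(log n) steps."""
    if n == 1:
        node(t)
    else:
        # { distribute(t, n/2) || on t+n/2 do distribute(t+n/2, n/2) }
        remote = threading.Thread(target=distribute,
                                  args=(t + n // 2, n // 2))
        remote.start()            # offload the upper half
        distribute(t, n // 2)     # recurse locally on the lower half
        remote.join()

distribute(0, 8)                  # reaches 8 processors in 3 doubling steps
```

Each level of the recursion doubles the number of processors creating processes, which is what gives the logarithmic distribution time.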

SLIDE 18

Rapid process distribution: execution time

[Graph: distribution time (µs) against number of processors, predicted vs. measured, up to 64 processors.]

◮ 114.60µs (11,460 cycles) for 64 processors.

◮ Predicted 190µs for 1024 processors.
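The 1024-processor prediction is consistent with the O(log n) distribution structure. As a rough check (an assumption: a constant cost per doubling step, derived here from the 64-processor measurement rather than from the talk's actual performance model):

```python
import math

t64 = 114.60                         # measured time for 64 processors (µs)
per_level = t64 / math.log2(64)      # cost per doubling step: ~19.1 µs
t1024 = per_level * math.log2(1024)  # 10 doubling steps for 1024 processors
# t1024 comes out at ~191 µs, close to the predicted 190 µs
```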

SLIDE 19

Mergesort

◮ Same structure as distribute but with work performed at leaves.

[Diagram: the sort tree over processors p0–p7; splits fan out down the tree and sorted sections merge back up towards p0.]
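The mergesort's shape can be sketched in Python with the same recursive structure as distribute: split in parallel down to the leaves, sort there, and merge results back up the tree. As before, threads stand in for remote processors, and the fixed recursion depth stands in for the processor count.

```python
import threading

def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def msort(xs, depth=3):
    """distribute's structure, with sorting work performed at the leaves."""
    if depth == 0 or len(xs) <= 1:
        return sorted(xs)                     # leaf: local work
    mid = len(xs) // 2
    right_out = []
    # { msort(left) || on remote do msort(right) }
    t = threading.Thread(
        target=lambda: right_out.extend(msort(xs[mid:], depth - 1)))
    t.start()
    left = msort(xs[:mid], depth - 1)
    t.join()
    return merge(left, right_out)             # results merge back up

print(msort([5, 3, 8, 1, 9, 2, 7, 4]))        # → [1, 2, 3, 4, 5, 7, 8, 9]
```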

SLIDE 20

Mergesort: execution time I

[Graph: execution time (ms) against processors for 256B, 512B and 1kB inputs, including the cost of process distribution.]

◮ Minimum when the input array is subdivided into 64B sections.

SLIDE 21

Mergesort: execution time II

◮ Measured (up to 64 cores) and predicted (up to 1024 cores) for 256B input.

[Graph: execution time (ms) against processors, showing process distribution, process-and-data distribution and total runtime.]

SLIDE 22

Mergesort: execution time III

◮ Predicted up to 1024 cores for 1GB input.

[Graph: predicted execution time (ms) against processors for the 1GB input, showing the same three series.]

◮ Single-source data-distribution is a worst-case.

SLIDE 23

Conclusions

◮ We have built a lightweight mechanism for dynamically allocating processors in a distributed system.

◮ Combined with recursion we can rapidly distribute processes: over 64 processors in 114.60µs.

◮ It is possible to operate at a fine granularity: creation of a remote process to operate on just 64B of data.

◮ We can establish a lower bound on the performance of the approach: distribution over 1024 processors in ∼200µs (20,000 cycles).

◮ This scheme works well with large arrays of processors with small memories, and allows you to express programs that exploit this.
  ◮ No need for powerful cores with large memories.
  ◮ Emphasis changes from data structures to process structures.

SLIDE 24

Future work

1. Automatic placement of processes.

2. MPI implementation for evaluation on and comparison with supercomputer architectures.

3. Optimisation of the processor allocation mechanism, such as pipelining the reception and execution of closures.

SLIDE 25

Any questions?

Email:

hanlon@cs.bris.ac.uk

Project web page:

http://www.cs.bris.ac.uk/~hanlon/sire