Parallel Software Design ASD Shared Memory HPC Workshop


SLIDE 1

Parallel Software Design ASD Shared Memory HPC Workshop

Computer Systems Group, ANU

Research School of Computer Science Australian National University Canberra, Australia

February 14, 2020

SLIDE 2

Schedule - Day 5

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 2 / 141

SLIDE 3

Software Patterns

Outline

1

Software Patterns
  Finding Concurrency

2

Algorithmic Structure Patterns

3

Program and Data Structure Patterns

4

Systems on chip: Introduction

5

System-on-chip Processors

6

Emerging Paradigms and Challenges in Parallel Computing

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 3 / 141

SLIDE 4

Software Patterns

Software Design Patterns

(figure: Waterfall Model)

Software Design is a fundamental aspect of the software development life-cycle
A design pattern describes a good solution to a recurring problem in a particular context in software design

Purpose:

Capture expert design knowledge and make it accessible in a standardized expression format
Promote communication and streamline documentation by providing a shorthand for designers
Increase software quality/reusability and designer/programmer productivity
Provide a basis or model to improve upon

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 4 / 141

SLIDE 5

Software Patterns

Designing Software for Parallel Systems

The parallel hardware landscape evolves at a rapid pace
Objectives for using parallel hardware:
High performance - supercomputing
High availability and scalability - cloud computing
Programming models and frameworks for writing software for parallel systems revolve around three major paradigms:
Shared memory systems - Open Multi-Processing (OpenMP)
Distributed memory systems - Message Passing Interface (MPI)
Accelerators - Open Computing Language (OpenCL), Compute Unified Device Architecture (CUDA)
Multiple other programming frameworks are available
Software is often designed, implemented and tuned to run on a specific parallel system - low code portability
How do you design high-performance, scalable and portable software for parallel systems?

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 5 / 141

SLIDE 6

Software Patterns

Reference Material

Patterns for Parallel Programming, T. Mattson, B. Sanders, B. Massingill. Addison Wesley 2005 (ISBN 0-321-94078-4)
Introduction to Software Engineering Design: Processes, Principles and Patterns with UML2, Christopher Fox. Pearson Addison Wesley 2006 (ISBN 0-321-41013-0)
Patterns of Enterprise Application Architecture, Martin Fowler. Addison Wesley 2003 (ISBN 0-321-12742-0)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 6 / 141

SLIDE 7

Software Patterns

Applying Patterns to a Problem

Before trying to write parallel software, consider whether it is worth it!

Is the problem large enough?
Are the results significant enough?
Are the key features and data items well understood?

If the answers to the questions above are positive, consider the following steps to design a parallel program:

1

Finding Concurrency: Identify independent components within the problem to expose exploitable concurrency

2

Apply an Algorithm Structure: Structure the algorithm to take advantage of potential concurrency

3

Consider Program and Data Structures: Decide on what data structures to use and apply a known design pattern to the program

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 7 / 141

SLIDE 8

Software Patterns

Design Spaces

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 8 / 141

SLIDE 9

Software Patterns Finding Concurrency

Finding Concurrency

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 9 / 141

SLIDE 10

Software Patterns Finding Concurrency

Finding Concurrency – Objectives

Produce the following software design artefacts given a problem and a target hardware specification:
A task decomposition that identifies tasks that can execute concurrently
A data decomposition that identifies data local to each task
A way of grouping tasks and ordering the groups to satisfy temporal constraints
An analysis of dependencies among tasks

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 10 / 141

SLIDE 11

Software Patterns Finding Concurrency

Finding Concurrency – Decomposition

Decomposing a problem into elements that can execute concurrently:
Task Decomposition: Divide the problem into groups of operations, called tasks, that can be executed independently or concurrently with some restrictions
Data Decomposition: Concentrate on breaking the data into distinct chunks that can be operated on independently or with some restrictions

Choices in either decomposition will naturally constrain choices in the other

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 11 / 141

SLIDE 12

Software Patterns Finding Concurrency

Task Decomposition

Considering Hardware Features: Decide whether task decomposition is suitable for the target hardware:

Maintain Load Balance: Can enough tasks be spawned to keep all the processing elements busy?
Minimize Overheads: Does each task have enough work to outweigh the cost of spawning tasks and managing inter-task dependencies?
Memory Structure: Is the system primarily shared-memory or distributed-memory?
A task decomposition is more suitable if it avoids distributing a complex data structure across many distributed-memory nodes
If memory size per node on a distributed-memory system is limited, replicating data may not be an option, and a data decomposition (i.e. dividing the data into chunks and allocating them to nodes) is required

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 12 / 141

SLIDE 13

Software Patterns Finding Concurrency

Task Decomposition

Identifying task units: Tasks can be found in several aspects of a problem:
Functional: Each task corresponds to a distinct function call
Loop-split: If loop iterations are independent and enough of them exist, one or more iterations can be mapped to a task
Data-driven: When multiple concurrent executions over different sections of the data are required, each of these can be mapped to a task

Parameterizing task units: To maintain flexibility and scalability in the design, parameterize the number and size of task units on an appropriate dimension, for example the number of CPU cores or GPU accelerators present in the target system, or the total number of processing elements

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 13 / 141

SLIDE 14

Software Patterns Finding Concurrency

Data Decomposition

Considering Problem Characteristics: Decomposing a problem w.r.t. its data is a good starting point if the following conditions are met:
Data Structure Size: The problem revolves around the manipulation of a large data structure
Operation Similarity: Identical operations are applied to different parts of the data structure, independently or with minimal dependencies or restrictions

Parameterizing Granularity: To keep the design flexible, scalable and able to cater to various parallel systems, the number of data chunks and their sizes, i.e. their granularity, should be hinged on specific hardware features such as the size of main memory per node
Common examples of suitable problems include:
Array-based computations: Concurrency is defined in terms of updates to different segments of an array. A multi-dimensional array can be decomposed in many ways (rows, columns, blocks/tiles of different shapes) to suit hardware features such as cache sizes
Recursive data structures: Parallel updates to sub-trees of large tree data structures

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 14 / 141

SLIDE 15

Software Patterns Finding Concurrency

Dependency Analysis – Grouping Tasks

Once the problem has been decomposed to outline the concurrent tasks and data associated with each task, often the tasks can be restructured to form groups based on the following types of constraints:

Temporal Dependency: A constraint on the order in which tasks must execute
Collective (Runtime) Dependency: A collection of tasks that must run at the same time, for example to satisfy boundary conditions, otherwise resulting in a deadlock situation
Independent: No ordering constraint between tasks

Since ordering constraints can now be applied to groups instead of individual tasks, establishing partial orders between tasks is simplified

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 15 / 141

SLIDE 16

Software Patterns Finding Concurrency

Dependency Analysis – Task Execution Order

Once separate task groups are identified, the order in which these groups must execute needs to be determined. To identify ordering constraints, consider:

Task Data Requirement: What data is required by a task group before execution, and whether another task group is responsible for generating it
Shared Data Requirement: Whether task groups must write to a shared resource in a certain order
Lack of Constraints: Whether task groups and their associated data are completely independent, i.e. embarrassingly parallel problems

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 16 / 141

SLIDE 17

Software Patterns Finding Concurrency

Dependency Analysis - Data Sharing

Given a clear ordering between task groups, identify and manage access to shared data
Overly coarse-grained synchronization can lead to poor scaling, e.g. using barriers between phases of a computation
Shared data can be categorized as follows:
Read-only: No access protection required on shared memory systems; usually replicated on distributed memory systems
Effectively-local: Data partitioned into task-local chunks which need not have access protections
Read-write: Data accessed by multiple task groups, which must be protected by exclusive-access methods such as locks, semaphores or monitors. Two special cases of read-write data are:
Accumulate: Multiple tasks update each shared data item to accumulate a result, i.e. a reduction operation, for example a global sum, maximum or minimum
Multiple-read/single-write: Data read by multiple tasks to obtain initial values, but eventually modified by a single task

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 17 / 141

SLIDE 18

Software Patterns Finding Concurrency

Design Evaluation

Is the decomposition and dependency analysis good enough to move on to the next design space?
Software design is an iterative process and is rarely perfected in a single iteration
Evaluate the suitability of the design for the intended target platform
Does the design need to cater to other parallel systems?

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 18 / 141

SLIDE 19

Software Patterns Finding Concurrency

Hands-on Exercise: Finding Concurrency

Objective: To identify tasks and the different possible decompositions and perform dependency analysis

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 19 / 141

SLIDE 20

Algorithmic Structure Patterns

Outline

1

Software Patterns

2

Algorithmic Structure Patterns
  Task Parallelism
  Divide & Conquer
  Geometric Decomposition
  Recursive Data
  Pipeline
  Event-Based Coordination

3

Program and Data Structure Patterns

4

Systems on chip: Introduction

5

System-on-chip Processors

6

Emerging Paradigms and Challenges in Parallel Computing

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 20 / 141

SLIDE 21

Algorithmic Structure Patterns

Algorithm Structure Patterns

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 21 / 141

SLIDE 22

Algorithmic Structure Patterns

Algorithmic Structure – Objectives

Develop an algorithmic structure that exploits the concurrency identified in the previous design space
How can the concurrency be mapped to multiple units of execution (UEs)?
Many parallel algorithms exist, but most adhere to the six basic patterns described in this space
At this stage, consider:
Target platform characteristics, such as the number of processing elements and how they communicate
Avoid the tendency to over-constrain the design by making it too specific to the target platform!
The major organizing principle implied by the exposed concurrency, i.e. is there a particular way of looking at it that stands out? For example, tasks, data, or the flow of data

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 22 / 141

SLIDE 23

Algorithmic Structure Patterns

Organize by Tasks

For when the execution of the tasks themselves is the best organizing principle:
If the task groups form a linear set or can be spawned linearly, use the Task Parallelism pattern. This includes both embarrassingly parallel problems and situations where task groups have some dependencies, share data, and/or require communication
If the task groups are recursive, use the Divide and Conquer pattern. The problem is recursively broken into sub-problems, each sub-problem is solved independently, and the sub-solutions are then recombined to form the complete solution

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 23 / 141

SLIDE 24

Algorithmic Structure Patterns Task Parallelism

The Task Parallelism Pattern

Three key aspects:
Task Definition: The functionality that constitutes a task or a group of tasks. There must be enough tasks to ensure a proper load balance
Inter-task Dependencies: How different task groups interact and what the computation/communication costs or overheads of these interactions are
Task Schedule: How task groups are assigned to different units of execution

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 24 / 141

SLIDE 25

Algorithmic Structure Patterns Task Parallelism

The Task Parallelism Pattern – Dependencies

Dependencies between tasks can be further categorized as follows:
Ordering Constraints: The program order in which task groups must execute
Shared Data Dependencies: These can be further categorized into:
Removable Dependencies: Those that can be resolved by simple code transformations
Separable Dependencies: When accumulation into a shared data structure is required. The data structure can often be replicated for each task group, which then accumulates a local result; the local results are combined to produce the final result once all task groups have finished. This is also known as a reduction operation (see the sketch below)
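The separable-dependency case maps directly onto a reduction clause in OpenMP. A minimal sketch (not from the slides): each thread accumulates into a private copy of sum, and the private copies are combined into the shared result when the loop finishes.

#include <stdio.h>

int main(void) {
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) a[i] = i;

    /* each thread accumulates into a replicated, private copy of sum;
       the copies are combined when the parallel loop ends */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %.1f\n", sum);  /* 499500.0 */
    return 0;
}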

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 25 / 141

SLIDE 26

Algorithmic Structure Patterns Task Parallelism

The Task Parallelism Pattern – Scheduling

The manner in which task groups are assigned to units of execution (UEs) in order to ensure a good computational load balance. The two primary categories of schedule are:
Static Schedule: The distribution of task groups among UEs is determined before program execution starts. For example, round-robin assignment of similar-sized task groups to UEs
Dynamic Schedule: The distribution of task groups varies during program execution and is non-deterministic. This form of scheduling is used when:
Task group sizes vary widely and/or are unpredictable
The capabilities of the UEs are unknown or vary unpredictably
Common approaches to implementing dynamic scheduling are (see the sketch below):
Global Task Queue: A global queue contains all task groups. Each UE removes task groups from the global queue when free and executes them
Work Stealing: Each UE has its own local task queue, which is populated before program execution starts. When a UE finishes the tasks in its local queue, it attempts to steal tasks from other UEs' local queues
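A minimal OpenMP sketch (not from the slides) of the two schedule categories: schedule(static) fixes the iteration-to-thread mapping before the loop runs, while schedule(dynamic, 1) has idle threads take the next iteration from a shared pool at runtime, much like a global task queue. process() is a hypothetical task body.

void process(int t) { (void)t; }  /* hypothetical task of irregular cost */

void run_static(int n) {
    /* static: the distribution is fixed before execution starts */
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < n; t++)
        process(t);
}

void run_dynamic(int n) {
    /* dynamic: idle threads grab the next iteration at runtime */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < n; t++)
        process(t);
}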

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 26 / 141

SLIDE 27

Algorithmic Structure Patterns Divide & Conquer

The Divide And Conquer Pattern – Sequential

The sequential divide-and-conquer strategy solves a problem in the following manner:

func solve returns Solution;      // a solution stage
func baseCase returns Boolean;    // direct solution test
func baseSolve returns Solution;  // direct solution
func merge returns Solution;      // combine sub-solutions
func split returns Problem[];     // split into subprobs

Solution solve(Problem P) {
    if (baseCase(P))
        return baseSolve(P);
    else {
        Problem subProblems[N];
        Solution subSolutions[N];
        subProblems = split(P);
        for (int i = 0; i < N; i++)
            subSolutions[i] = solve(subProblems[i]);
        return merge(subSolutions);
    }
}

The recursive solve() function doubles the concurrency at each stage if it is not a baseCase()
The baseSolve() function should only be called if the overhead of further splits/merges would worsen performance, or when the size of the problem is optimal for the target node, e.g. when the data fits into cache

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 27 / 141

SLIDE 28

Algorithmic Structure Patterns Divide & Conquer

The Divide And Conquer Pattern – Sequential

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 28 / 141

SLIDE 29

Algorithmic Structure Patterns Divide & Conquer

The Divide and Conquer Pattern – Parallel

The concurrency is obvious, as the sub-problems can be solved independently
One task, mapped to a single UE, is created for each invocation of the solve() function
Each task in effect dynamically generates and then absorbs a task for each sub-problem
For efficiency, a sequential solution may be used as soon as the size of a task falls below a particular threshold, or once all processing elements have enough work
When task sizes are irregular, it is helpful to use a Global Task Queue to maintain a healthy load balance (see the sketch below)
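A minimal sketch (not from the slides) of this pattern using OpenMP tasks: a recursive array sum that forks a task per sub-problem and falls back to a sequential baseSolve below a threshold.

#define THRESHOLD 1024

long solve(const long *a, int n) {
    if (n <= THRESHOLD) {                      /* baseCase: solve sequentially */
        long s = 0;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }
    long left;
    #pragma omp task shared(left)              /* fork a task for one sub-problem */
    left = solve(a, n / 2);
    long right = solve(a + n / 2, n - n / 2);  /* solve the other in this task */
    #pragma omp taskwait                       /* join before merging */
    return left + right;                       /* merge the sub-solutions */
}

solve() would be called from inside a #pragma omp parallel region with #pragma omp single, so that the team of threads executes the dynamically generated tasks.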

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 29 / 141

SLIDE 30

Algorithmic Structure Patterns Divide & Conquer

The Divide and Conquer Pattern – Parallel

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 30 / 141

SLIDE 31

Algorithmic Structure Patterns Divide & Conquer

Organize by Data Decomposition

For when the decomposition of data forms the major organizing principle in understanding the concurrency of the problem
If the data can be broken into discrete subsets and operated on independently by task groups, use the Geometric Decomposition pattern. Solutions for a subset may require data from a small number of other subsets, e.g. to satisfy boundary conditions in grid-based problems
If the problem requires the use of a recursive data structure such as a binary tree, use the Recursive Data pattern

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 31 / 141

SLIDE 32

Algorithmic Structure Patterns Geometric Decomposition

The Geometric Decomposition Pattern

An expression of coarse-grained data parallelism
Applicable to linear data structures such as arrays
The key aspects of this pattern are:
Data Decomposition: Decomposition of the data structure into substructures or chunks, in a manner analogous to dividing a geometric region into sub-regions
Update: Each chunk has an associated update task that computes a local result
Exchange: Each task gets the data it requires, possibly from neighbouring chunks, to perform its update
Task Schedule: Mapping the task groups and data chunks to UEs

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 32 / 141

SLIDE 33

Algorithmic Structure Patterns Geometric Decomposition

The Geometric Decomposition Pattern

Important points to note:
Chunk Granularity: The granularity or size of the data chunks directly impacts the efficiency of the program:
A small number of large chunks ⇒ fewer, larger message exchanges ⇒ reduced communication overhead, increased load balancing difficulty
A large number of small chunks ⇒ more, smaller message exchanges ⇒ increased communication overhead, decreased load balancing difficulty
It is important to parameterize the granularity so that it can be fine-tuned at compile time or at runtime
Chunk Shape: The data to be exchanged between tasks often lies at the boundaries of their respective data chunks, so minimizing the surface area of the chunks should reduce the amount of data that must be exchanged
Ghost Boundaries: To reduce communication during execution, the boundary data required from other chunks can be replicated; a ghost boundary refers to these duplicates of data at the boundaries of neighbouring chunks (see the sketch below)
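A minimal sketch (not from the slides) of a 1D geometric decomposition with ghost boundaries: each chunk carries one ghost cell per side, the exchange step refreshes the ghosts from the neighbouring chunks, and the update step is a 3-point stencil that reads them.

#define N     1024
#define NCHK  4
#define CHUNK (N / NCHK)

/* each chunk stores CHUNK interior cells plus one ghost cell on each side */
double chunk[NCHK][CHUNK + 2];

/* exchange: refresh ghost cells from the neighbouring chunks' boundary cells */
void exchange(void) {
    for (int c = 0; c < NCHK; c++) {
        chunk[c][0]         = (c > 0)        ? chunk[c - 1][CHUNK] : 0.0;
        chunk[c][CHUNK + 1] = (c < NCHK - 1) ? chunk[c + 1][1]     : 0.0;
    }
}

/* update: a 3-point stencil over the interior, reading ghosts at the edges */
void update(double out[NCHK][CHUNK + 2]) {
    for (int c = 0; c < NCHK; c++)
        for (int i = 1; i <= CHUNK; i++)
            out[c][i] = (chunk[c][i - 1] + chunk[c][i] + chunk[c][i + 1]) / 3.0;
}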

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 33 / 141

SLIDE 34

Algorithmic Structure Patterns Recursive Data

The Recursive Data Pattern

Applies to problems involving a recursive data structure that appears to require sequential processing to update all of its elements
The key aspects of this pattern are:
Data Decomposition: The recursive data structure is completely decomposed into individual elements; each element or group of elements is assigned to a task group running on a separate UE, which is responsible for updating its partial result
Structure: The top-level operation is a sequential loop. Each iteration is a parallel update on all elements to produce a partial result. The loop ends when a result convergence condition is met
Synchronization: A partial result calculation might require combining results from neighbouring elements, leading to a requirement for communication between UEs at each loop iteration

There are distinct similarities to the Divide and Conquer pattern; however, this pattern does not require recursive spawning of tasks and starts with a static task schedule

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 34 / 141

SLIDE 35

Algorithmic Structure Patterns Recursive Data

Organize by Flow of Data

For when the flow of data imposes an ordering on the execution of task groups and represents the major organizing principle
If the flow of data is regular (static) and does not change during program execution, the task groups can be structured into a Pipeline pattern through which the data flows
If the data flows in an irregular, dynamic or unpredictable manner, use the Event-Based Coordination pattern, where the task groups may interact through asynchronous events

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 35 / 141

SLIDE 36

Algorithmic Structure Patterns Pipeline

The Pipeline Pattern

The overall computation involves performing a calculation on many sets of data
The calculation can be viewed as data flowing through a pre-determined sequence of stages, similar to a factory assembly line
Each stage in the pipeline computes the i-th step of the computation
Each stage may be assigned to a different task, and data elements may be passed from one task to another as operations are completed
Notice that some resources are idle initially, while the pipeline is being filled, and again at the end of the computation, while the pipeline is drained

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 36 / 141

SLIDE 37

Algorithmic Structure Patterns Event-Based Coordination

The Event-Based Coordination Pattern

This pattern is used when the problem can be decomposed into semi-independent tasks interacting in an irregular fashion, with the interactions determined by the flow of data between the tasks
There is no restriction to a linear structure as in the Pipeline pattern
Express the data flow using abstractions called events
Each event must have a task that generates it and a task that processes it
The computation within each task can be defined as follows:

initialize
while (not done) {
    receive event
    process event
    send event
}
finalize

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 37 / 141

SLIDE 38

Algorithmic Structure Patterns Event-Based Coordination

Hands-on Exercise: Algorithm Structure Patterns

Objective: To identify an algorithm structure for a parallel stencil-based computation and complete an implementation of it

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 38 / 141

SLIDE 39

Program and Data Structure Patterns

Outline

1

Software Patterns

2

Algorithmic Structure Patterns

3

Program and Data Structure Patterns
  SPMD
  Master-worker
  Loop parallelism
  Fork-join
  Shared data
  Shared queue
  Distributed Array

4

Systems on chip: Introduction

5

System-on-chip Processors

6

Emerging Paradigms and Challenges in Parallel Computing

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 39 / 141

SLIDE 40

Program and Data Structure Patterns

Program and Data Structure Patterns

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 40 / 141

SLIDE 41

Program and Data Structure Patterns

Objective – Choosing the Right Pattern

Program structures represent an intermediate stage between an algorithmic structure and the implemented source code
They describe software constructions or structures that support the expression of parallel algorithms
Choosing a program structure pattern is usually straightforward: the outcomes of the algorithmic structure design space analysis should point towards a suitable program structure pattern
In the table below, the number of ✓'s indicates the likelihood that a program structure pattern suits a particular algorithmic structure

                  Task         Divide and  Geometric      Recursive  Pipeline  Event-Based
                  Parallelism  Conquer     Decomposition  Data                 Coordination
SPMD              ✓✓✓✓         ✓✓✓         ✓✓✓✓           ✓✓         ✓✓✓       ✓✓
Loop Parallelism  ✓✓✓✓         ✓✓          ✓✓✓            ✗          ✗         ✗
Master/Worker     ✓✓✓✓         ✓✓          ✓              ✓          ✓         ✓
Fork/Join         ✓✓           ✓✓✓✓        ✓✓             ✗          ✓✓✓✓      ✓✓✓✓

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 41 / 141

SLIDE 42

Program and Data Structure Patterns SPMD

Program Structures: SPMD Pattern

Single Program Multiple Data
All UEs run the same program in parallel, but each has its own set of data
Different UEs can follow different paths through the program
A unique ID is associated with each UE, which determines its course through the program and its share of the global data that it needs to process

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 42 / 141

SLIDE 43

Program and Data Structure Patterns SPMD

Program Structures: SPMD Pattern

The following program structure is implied by the SPMD Pattern:

1 Initialize: Load the program on a UE, perform book-keeping, and establish communications with the other UEs
2 Obtain a Unique Identifier: Computation may proceed differently on different UEs, conditional on the ID
3 Execution on UE: Start the computation and have different UEs take different paths through the source code, using:
Branching statements to give specific blocks of code to different UEs
The UE identifier in loop index calculations to split loop iterations among the UEs
4 Distribute Data: Global data is decomposed into chunks and stored in UE-local memory based on the UE's unique identifier
5 Finalize: Recombine the local results into a global data structure and perform cleanup and book-keeping (see the MPI sketch below)
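A minimal MPI sketch (not from the slides) of the five SPMD steps, using the UE identifier (the MPI rank) to split loop iterations:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);              /* 1: initialize, establish comms */
    int id, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);  /* 2: obtain a unique identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* 3+4: use the identifier to split loop iterations (and hence data) */
    double local = 0.0, global = 0.0;
    for (int i = id; i < 1000000; i += p)
        local += 1.0 / (i + 1);

    /* 5: finalize - recombine the local results into a global result */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (id == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}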

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 43 / 141

SLIDE 44

Program and Data Structure Patterns Master-worker

Program Structures: Master-Worker Pattern

A master process or thread sets up a pool of worker processes or threads and a bag of tasks
The workers execute concurrently, each repeatedly removing a task from the bag of tasks and processing it, until all tasks have been processed or some other termination condition is reached
Some implementations may have more than one master, or no explicit master

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 44 / 141

SLIDE 45

Program and Data Structure Patterns Master-worker

Program Structures: Master Worker Pattern

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 45 / 141

SLIDE 46

Program and Data Structure Patterns Loop parallelism

Program Structures: Loop Parallelism Pattern

This pattern addresses the problem of transforming a serial program whose runtime is dominated by a set of compute-intensive loops
The concurrent tasks are identified as iterations of parallelized loops
It applies particularly to problems that already have a mature sequential code base, in which major restructuring for parallelism cannot be justified
When existing code is available, the goal becomes incremental evolution of the sequential code into its final parallel form, one loop at a time
Ideally, most of the required changes are localized loop transformations that remove loop-carried dependencies
The OpenMP programming API was primarily created to support loop parallelism on shared-memory computers

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 46 / 141

SLIDE 47

Program and Data Structure Patterns Loop parallelism

Program Structures: Loop Parallelism Pattern

The steps for applying this pattern to a sequential problem are:
Identify Bottlenecks: Locate the most computationally intensive loops, either by code inspection, by understanding the problem, or by using performance analysis tools
Eliminate Loop-Carried Dependencies: Loop iterations must be independent in order to be parallelized
Parallelize the Loops: Distribute the iterations among the UEs
Optimize the Loop Schedule: The distribution among UEs must be evenly balanced
Two commonly used loop transformations are (see the sketch below):
Merge Loops: If a problem consists of a sequence of loops with consistent loop limits, the loops can often be merged into a single loop with more complicated iterations
Coalesce Nested Loops: Nested loops can often be combined into a single loop with a larger combined iteration count, which can offset the overhead of nesting and create more tasks per UE, i.e. achieve better load balance
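A minimal sketch (not from the slides) of coalescing nested loops with OpenMP: the collapse(2) clause combines the i and j loops into a single iteration space of N*M iterations, all of which are distributed among the threads instead of only the N outer ones.

#define N 1000
#define M 1000

void scale(double a[N][M], double s) {
    /* collapse(2) coalesces i and j into one N*M iteration space, giving the
       runtime N*M units of work to balance instead of only N */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] *= s;
}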

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 47 / 141

SLIDE 48

Program and Data Structure Patterns Fork-join

Program Structures: Fork/Join Pattern

This pattern applies to problems where the number of concurrent tasks may vary during program execution
Tasks are spawned dynamically (forked) and later terminated (joined) with the forking task
A main UE forks off some number of other UEs, which then run in parallel to accomplish a portion of the overall work
The tasks map onto UEs in different ways (a direct-mapping sketch follows below):
A simple direct mapping, with one task per UE
An indirect mapping, where a pool of UEs works on sets of tasks. If the problem consists of multiple fork-join sequences (which are expensive), it is more efficient to first create a pool of UEs matching the number of processing elements, then use a global task queue to map tasks to UEs as they are created
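A minimal sketch (not from the slides) of the direct mapping using POSIX threads: the main UE forks one thread per task and joins them all before continuing.

#include <pthread.h>
#include <stdio.h>

static void *work(void *arg) {            /* the task given to each forked UE */
    printf("task %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    enum { NUM_TASKS = 4 };
    pthread_t ue[NUM_TASKS];
    for (long i = 0; i < NUM_TASKS; i++)  /* fork: one task per UE */
        pthread_create(&ue[i], NULL, work, (void *)i);
    for (int i = 0; i < NUM_TASKS; i++)   /* join: wait for all UEs */
        pthread_join(ue[i], NULL);
    return 0;
}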

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 48 / 141

SLIDE 49

Program and Data Structure Patterns Shared data

Data Structures: Shared Data Pattern

This pattern addresses the problem of handling data that is shared by more than one UE
Typical problems to which this pattern applies have the following characteristics:
At least one data structure is accessed by multiple tasks in the course of the program's execution
At least one task modifies the shared data structure
The tasks potentially need to use the modified value during the concurrent computation

For any order of execution of the tasks, the computation must be correct, i.e. shared data access must obey program order

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 49 / 141

SLIDE 50

Program and Data Structure Patterns Shared data

Data Structures: Shared Data Pattern

To manage shared data, follow these steps:
ADT Definition: Start by defining an abstract data type (ADT) with a fixed set of operations on the data
If the ADT is a stack, the operations would be push and pop
If executed serially, these operations should leave the data in a consistent state
Concurrency-Control Protocol: Devise a protocol to ensure that, if used concurrently, the operations on the ADT are sequentially consistent. Approaches for this include (see the sketch below):
Mutual exclusion and critical sections on a shared memory system; minimize the length of critical sections
Assigning shared data to a particular UE on a distributed memory system
Identifying non-interfering sets of operations
Readers/writers protocols
Nested locks
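A minimal sketch (not from the slides) of the first approach: a stack ADT whose push and pop operations are short critical sections protected by a mutex.

#include <pthread.h>

#define CAP 1024

typedef struct {
    int data[CAP];
    int top;                         /* initialize to 0 */
    pthread_mutex_t lock;            /* initialize with PTHREAD_MUTEX_INITIALIZER */
} Stack;

int stack_push(Stack *s, int v) {
    pthread_mutex_lock(&s->lock);    /* keep the critical section short */
    int ok = (s->top < CAP);
    if (ok) s->data[s->top++] = v;
    pthread_mutex_unlock(&s->lock);
    return ok;
}

int stack_pop(Stack *s, int *v) {
    pthread_mutex_lock(&s->lock);
    int ok = (s->top > 0);
    if (ok) *v = s->data[--s->top];
    pthread_mutex_unlock(&s->lock);
    return ok;
}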

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 50 / 141

SLIDE 51

Program and Data Structure Patterns Shared queue

Data Structures: Shared Queue Pattern

An important shared data structure, commonly utilized in the Master/Worker pattern
This pattern represents a "thread-safe" implementation of the familiar queue ADT
Concurrency-control protocols that encompass too much of the shared queue in a single synchronization construct increase the chances of UEs being blocked waiting for access
Maintaining a single queue on systems with complicated memory hierarchies (such as NUMA systems) can cause excess communication and increase parallel overhead

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 51 / 141

SLIDE 52

Program and Data Structure Patterns Shared queue

Data Structures: Shared Queue Pattern

Types of shared queues that can be implemented:
Non-blocking queue: No interference between multiple UEs accessing the queue concurrently
Block-on-empty queue: A UE trying to pop from an empty queue waits until the queue has an element available (see the sketch below)
Distributed shared queue: Each UE has a local queue that is shared with the other UEs; commonly used for work stealing
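A minimal sketch (not from the slides) of a block-on-empty queue using a pthreads condition variable: q_pop() blocks while the queue is empty, and q_push() signals a waiting popper.

#include <pthread.h>

#define QCAP 256

typedef struct {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} Queue;

void q_push(Queue *q, int v) {
    pthread_mutex_lock(&q->lock);
    q->buf[q->tail] = v;               /* assumes the queue never overflows */
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->nonempty); /* wake one UE blocked in q_pop */
    pthread_mutex_unlock(&q->lock);
}

int q_pop(Queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)              /* block while empty */
        pthread_cond_wait(&q->nonempty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return v;
}

A queue can be initialized statically, e.g. Queue q = { .lock = PTHREAD_MUTEX_INITIALIZER, .nonempty = PTHREAD_COND_INITIALIZER };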

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 52 / 141

SLIDE 53

Program and Data Structure Patterns Distributed Array

Data Structures: Distributed Array Pattern

An array is one of the most commonly used data structures
This pattern represents arrays of one or more dimensions that are decomposed into sub-arrays and distributed among the available UEs
Particularly important when the Geometric Decomposition algorithmic structure is used together with the SPMD program structure
Although it primarily applies to distributed memory systems, it also has applications on NUMA systems
Some commonly used array distributions are (see the sketch below):
1D Block: The array is decomposed in one dimension only and distributed one block per UE. Sometimes referred to as column block or row block in the context of 2D arrays
2D Block: One 2D block or tile per UE
Block-Cyclic: More blocks than UEs, with blocks assigned to UEs in a round-robin fashion
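A minimal sketch (not from the slides) of the index arithmetic behind these distributions: which UE owns element i of an N-element array under a 1D block distribution, and under a block-cyclic distribution with block size b.

/* 1D block: UE u owns elements [u*ceil(N/p), (u+1)*ceil(N/p)) */
int block_owner(int i, int N, int p) {
    int chunk = (N + p - 1) / p;   /* ceil(N/p) */
    return i / chunk;
}

/* block-cyclic, block size b: blocks are dealt to the p UEs round-robin */
int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}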

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 53 / 141

SLIDE 54

Program and Data Structure Patterns Distributed Array

Summary

Structured thinking is key when designing parallel software
Patterns help the thought process but are not set in stone!
The quality of the software design matters most, since software outlives hardware

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 54 / 141

SLIDE 55

Program and Data Structure Patterns Distributed Array

Hands-on Exercise: Program and Data Structure Patterns

Objective: To identify and exploit parallelism in large matrix multiplication

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 55 / 141

SLIDE 56

Systems on chip: Introduction

Outline

1

Software Patterns

2

Algorithmic Structure Patterns

3

Program and Data Structure Patterns

4

Systems on chip: Introduction

5

System-on-chip Processors

6

Emerging Paradigms and Challenges in Parallel Computing

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 56 / 141

SLIDE 57

Systems on chip: Introduction

Systems on a Chip

Formerly separate systems can now be integrated into a single chip
Usually for special-purpose systems
High speed per unit price and power
Often have hierarchical networks

(figure courtesy EDA360 Insider)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 57 / 141

SLIDE 58

Systems on chip: Introduction

On-chip Networks: Bus-based

The traditional model; buses have address and data pathways
There can be several 'masters' (devices operating the bus, e.g. a CPU or a DMA engine)
In a multicore context, there may be many! (a scalability issue)
Hence arbitration is a complex issue (and takes time!)
Techniques for improving bus utilization:
Burst transfer mode: multiple requests (in a regular pattern) once granted access
Pipelining: place the next address on the bus while the data from the previous request is transferred
Broadcast: one master sends to the others, e.g. cache coherency snoops and invalidations

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 58 / 141

SLIDE 59

Systems on chip: Introduction

On-Chip Networks: Current and Next Generation

Buses: only one device can access the network (make a 'transaction') at a time
Crossbars: devices are split into 2 groups of size p
Can have p transactions at once, provided there is at most 1 per device
e.g. UltraSPARC T2: p = 8 cores and L2$ banks; Fermi GPU: p = 14
For largish p, needs to be internally organized as a series of switches
May also be organized as a ring
For larger p, a more scalable topology may be needed, such as a 2-D mesh (or torus, e.g. Intel SCC), or hierarchies of these

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 59 / 141

SLIDE 60

Systems on chip: Introduction

Sandy Bridge Ring On-Die Interconnect

A ring-based interconnect between the Cores, Graphics, Last Level Cache (LLC) and System Agent domains
Has 4 physical rings: Data (32 B), Request, Acknowledge and Snoop
Fully pipelined; bandwidth, latency and power scale with the number of cores
The shortest path is chosen to minimize latency
Has distributed arbitration and a sophisticated protocol to handle coherency and ordering

(figure courtesy www.lostcircuits.com)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 60 / 141

SLIDE 61

Systems on chip: Introduction

Cache Coherency Considered Harmful

(also known as the 'Coherency Wall')
When a core writes to address x in its L1$, x must be invalidated in the other L1$s
Standard protocols require a broadcast message for each invalidation
Maintaining a (MOESI) protocol also requires a broadcast on every miss
This also causes contention (and delay) in the network (worse than O(p^2)?)
Directory-based protocols can direct invalidation messages to only the caches holding the same data:
Far more scalable (e.g. SGI Altix SMP), for lightly-shared data
For each cached line, a bit vector of length p is needed: O(p^2) storage cost
False sharing in any case results in wasted traffic, e.g. cores P0..P7 each writing a different byte b0..b7 of the same cache line (a padding sketch follows below)
Hey, what about GPUs?
Atomic instructions synchronize down to the LLC, costing O(p) energy each!
The cache line size is sub-optimal for messages on on-chip networks
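A minimal sketch (not from the slides) of avoiding the false-sharing traffic above: padding each core's counter out to its own cache line, so that P0..P7 no longer invalidate each other's lines. The 64-byte line size is an assumption.

#include <omp.h>

#define LINE 64   /* assumed cache line size in bytes */

struct padded_counter {
    long value;
    char pad[LINE - sizeof(long)];  /* keep each counter on its own line */
};

struct padded_counter counters[8];

void count_events(long n) {
    #pragma omp parallel num_threads(8)
    {
        int me = omp_get_thread_num();
        for (long i = 0; i < n; i++)
            counters[me].value++;   /* padding prevents cross-core invalidations */
    }
}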

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 61 / 141

SLIDE 62

System-on-chip Processors

Outline

1

Software Patterns

2

Algorithmic Structure Patterns

3

Program and Data Structure Patterns

4

Systems on chip: Introduction

5

System-on-chip Processors
  Motivation
  Case Study: TI Keystone II SoC
  Bare-metal Runtime on DSP
  Programming DSP cores with OpenCL and OpenMP

6

Emerging Paradigms and Challenges in Parallel Computing

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 62 / 141

SLIDE 63

System-on-chip Processors Motivation

High Performance Computing

Using more than one computer to solve a large-scale problem
Using clusters of compute nodes
Large clusters = supercomputers!
Applications include:
Data analysis
Numerical simulations
Modeling
Complex mathematical calculations
Computations are dominated by floating-point operations (FLOPs)
History dates back to the 1960s!

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 63 / 141

SLIDE 64

System-on-chip Processors Motivation

High Performance Computing

Summit Supercomputer (image courtesy nbcnews.com)

Each node has two 22-core IBM POWER9 processors and six NVIDIA V100 GPUs
4608 nodes = 2,414,592 compute cores
Peak performance of 225 PetaFLOPs
No. 1 on the Top 500 list, Nov 2019
Power consumption: 13 MW

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 64 / 141

slide-65
SLIDE 65

System-on-chip Processors Motivation

High Performance Computing

Power consumption is a major problem
Power consumption ∝ heat generation ∝ cooling requirement ∝ maintenance cost
The majority of research targets energy efficiency
Alternative building blocks for supercomputers?
Energy-efficient system-on-chips with accelerators ✓

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 65 / 141

SLIDE 66

System-on-chip Processors Motivation

System-on-chip Processors

All the components of a computer on a single chip:
A single- or multi-core CPU
One or more on-chip accelerators

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 66 / 141

SLIDE 67

System-on-chip Processors Motivation

A few of the ones we play with. . .

TI Keystone II Evaluation Module
NVIDIA Jetson TK1 Board
Adapteva Parallella Epiphany Board

(images courtesy http://ti.com/, http://anandtech.com/, http://arstechnica.com/)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 67 / 141

SLIDE 68

System-on-chip Processors Case Study: TI Keystone II SoC

Case Study: TI Keystone II SoC

Overview of the TI Keystone II (Hawking) SoC
Very Long Instruction Word (VLIW) architecture
Software-managed cache coherency on the DSP cores
How shared memory is managed between the ARM and DSP cores
Execution of a binary on the DSP cores from ARM Linux
Bare-metal runtime on the DSP cores
Programming for HPC on the Hawking

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 68 / 141

SLIDE 69

System-on-chip Processors Case Study: TI Keystone II SoC

TI ’Hawking’ SoC

TI K2H ARM-DSP SoC (image courtesy ti.com)

Host: Quad-core ARM Cortex-A15
Accelerator: Eight-core floating-point C66x DSP
Communication: Shared memory, hardware queues
Features:
DSP: 32 KB L1-D and L1-P cache
DSP: 1 MB L2 cache (configurable as SRAM)
DSP: Aggregate 157.184 SP GFLOPS
ARM: 32 KB L1-D and L1-P cache
ARM: 4 MB shared L2 cache
ARM: Aggregate 38.4 SP GFLOPS
Common: 6 MB shared SRAM
ARM cores are cache coherent, but the DSP cores are not
The DSPs have no MMU and no virtual memory
Power consumption around 15 W TDP

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 69 / 141

SLIDE 70

System-on-chip Processors Case Study: TI Keystone II SoC

ARM Cortex A15 MPCore

ARM Cortex-A15 Host CPU (image courtesy geek.com)

1-4 cache-coherent cores per on-chip cluster
32-bit ARMv7 Reduced Instruction Set Computer (RISC) instructions
40-bit Large Physical Address Extension (LPAE) addressing
Out-of-order execution, branch prediction
Vector Floating Point (VFPv4) unit per core
Advanced Single Instruction Multiple Data (SIMD) extension, aka NEON unit, per core
4 32-bit registers (a quad) can hold a single 128-bit vector
4 SP multiplies/cycle

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 70 / 141

SLIDE 71

System-on-chip Processors Case Study: TI Keystone II SoC

TI C66x Digital Signal Processor

(block diagram: C66x DSP core - 32 KB L1P and L1D SRAM/cache, 1 MB L2 SRAM/cache, register files A and B, fetch/dispatch/execute pipeline, two sides of L M S D functional units, DMA, prefetch, interrupt controller, power management, embedded debug and emulation)

8-way Very Long Instruction Word (VLIW) processor:
Instruction-level parallelism: the compiler generates VLIW instructions composed of instructions for separate functional units that can run in parallel
8 RISC functional units on two sides:
Multiplier (M): multiplication
Data (D): load/store
ALU (L) and Control (S): addition and branch
Single Instruction Multiple Data (SIMD) up to 128-bit vectors:
4 32-bit registers (a quad) can hold a single 128-bit vector
M: 4 SP multiplies/cycle; L and S: 2 SP adds/cycle
8 Multiply-Accumulate (MAC) operations/cycle

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 71 / 141

SLIDE 72

System-on-chip Processors Case Study: TI Keystone II SoC

VLIW Architecture

(block diagram: C66x DSP core, as on the previous slide)

A very long instruction word consists of multiple independent instructions, which may be logically unrelated, packed together by the compiler
The onus is on the compiler to statically schedule independent instructions into a single VLIW instruction
There are multiple functional units in the processor
Instructions in a bundle are statically aligned to be fed directly into the functional units in lock-step
A simple expression of Instruction-Level Parallelism

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 72 / 141

SLIDE 73

System-on-chip Processors Case Study: TI Keystone II SoC

VLIW Trade-offs

Advantages:
Hardware is simplified, i.e. no dynamic scheduling is required
VLIW instructions contain independent sub-instructions, i.e. no dependency checking is required, i.e. a simplified instruction issue unit
Instruction alignment/distribution to separate functional units is not required after fetch, i.e. simplified hardware
Disadvantages:
Compiler complexity, i.e. independent operations need to be found for every cycle
NOPs are inserted when suitable operations cannot be found for a cycle
When functional units or instruction latencies change, i.e. when executing on another processor having the same architecture, recompilation is required
Lock-step execution causes independent operations to stall until the longest-latency instruction completes

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 73 / 141

SLIDE 74

System-on-chip Processors Case Study: TI Keystone II SoC

C66x DSP Cache Coherency

C66x DSPs are not cache-coherent with each other
Between flush points, threads do not access the same data – that would be a data race!
The runtime triggers software-managed cache operations at flush points
Each flush operation costs around 1350 clock cycles

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 74 / 141

SLIDE 75

System-on-chip Processors Case Study: TI Keystone II SoC

Memory Hierarchy

Available to both ARM and DSP:
8 GB DDR3 RAM, ~100-cycle access time
6 MB Scratchpad RAM (SRAM), ~20-cycle access time
Available to DSP only:
1 MB L2 cache per core, configurable as SRAM, ~7-cycle access time
32 KB L1 data and L1 instruction cache per core, configurable as SRAM, ~2-cycle access time

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 75 / 141

SLIDE 76

System-on-chip Processors Case Study: TI Keystone II SoC

Sharing memory between ARM and DSP

Obstacles:
No shared MMU between the ARM and DSP cores
No MMU on the DSP cores
No shared virtual memory
What elements are required for ARM and DSP programs to share memory?
A Linux virtual memory mapping to shared physical memory
A shared heap between the ARM and DSP cores
A memory management library that provides malloc/free routines into the shared heap (TI's Contiguous Memory (CMEM) package)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 76 / 141

SLIDE 77

System-on-chip Processors Case Study: TI Keystone II SoC

Executing a binary on a DSP core

The TI Multiprocess Manager (MPM) package allows Linux userspace programs to load and run binaries on the DSP cores individually
It has two major components: a daemon (mpmsrv) and a CLI utility (mpmcl)
It uses the remoteproc driver; the DSP output (trace) is obtained via the rpmsg bus (this requires a resource table entry in the loaded binary's ELF sections)
Maintained at: git.ti.com/keystone-linux/multi-proc-manager.git

root@k2hk-evm:~# mpmcl status dsp0
dsp0 is in running state
root@k2hk-evm:~# mpmcl reset dsp0
reset succeeded
root@k2hk-evm:~# mpmcl status dsp0
dsp0 is in reset state
root@k2hk-evm:~# mpmcl load dsp0 main.out
load successful
root@k2hk-evm:~# mpmcl run dsp0
run succeeded
root@k2hk-evm:~# cat /sys/kernel/debug/remoteproc/remoteproc0/trace0
Main started on core 0
...

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 77 / 141

SLIDE 78

System-on-chip Processors Bare-metal Runtime on DSP

C66x DSP Runtime Support System

Eight DSP cores
What does bare-metal execution mean?
No OS running on the DSP cores
Each core boots every time a binary is loaded onto it
The executable binary loaded on the DSP cores must at least provide:
Task execution
Memory management
File I/O
Inter-process communication (IPC)
Basic task scheduling

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 78 / 141

SLIDE 79

System-on-chip Processors Bare-metal Runtime on DSP

A bare-metal runtime system

A runtime library or runtime system is software intended to support the execution of a program by providing:
API implementations of programming language features
Type-checking, debugging, code generation and optimization, possibly garbage collection
Access to runtime environment data structures
An interface to OS system calls
Without an OS present, a runtime system must also provide the basic services of an OS, such as those provided by a typical micro-kernel:
Memory management
Thread management
IPC

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 79 / 141

SLIDE 80

System-on-chip Processors Bare-metal Runtime on DSP

Memory management

Defining a physical memory region to place the heap data structure:
Can be in shared MSMC SRAM or DDR3 RAM
The exact location is determined by the linker
Memory sections can be specified via a linker command file
Initialization of the heap
Mutually exclusive access to the heap from all cores – locking mechanisms for critical sections, i.e. mutex/semaphore
C API functions: malloc, free, calloc, realloc, memalign

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 80 / 141

SLIDE 81

System-on-chip Processors Bare-metal Runtime on DSP

Thread management

The runtime system can be considered a single process running on a DSP core, with multiple threads of work running within it. It manages:
The execution of units of work, or tasks, on each DSP core:
A pointer to a function in memory: void (*fn)(void*)
A pointer to an argument buffer in memory: void* args
Multiplexing threads of execution:
Task/thread dispatch
Scheduling of threads (pre-defined policy)
Maintaining thread state data structures in non-cacheable memory
Sharing the same address space on each core
Sharing locks in local memory (L2 SRAM)
Teams of threads
Possible pre-emption

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 81 / 141

SLIDE 82

System-on-chip Processors Bare-metal Runtime on DSP

Inter-process Communication

Fundamental to process execution
Exchanging data between threads on one or more DSP cores
Three primary methods:
Mutually exclusive access to objects in shared memory
Mutually exclusive access to objects in local memory
Atomic access to hardware queues
We focus on the use of the hardware queues present in the Keystone II SoC

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 82 / 141

SLIDE 83

System-on-chip Processors Bare-metal Runtime on DSP

Hardware Queues

Part of the Multicore Navigator present on the K2 SoC
Queue Manager Sub-System (QMSS)
16384 queues
Can be used via:
QMSS low-level drivers (LLD)
Open Event Machine (OpenEM): an abstraction above the QMSS LLD
LIFO and FIFO configurations are available

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 83 / 141

SLIDE 84

System-on-chip Processors Bare-metal Runtime on DSP

Hardware Queues

What can you push to a hardware queue?
The address of a single message descriptor at a time
A message descriptor can be any data structure created by the user; typically a C struct of fixed size
There are 20 available memory regions, of which at least one must be mapped and configured for message descriptor storage
The descriptor size must be a multiple of 16 bytes and at least 32 bytes (see the sketch below)
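A hypothetical descriptor sketch (not TI's API): the struct fields are illustrative assumptions, but the compile-time check enforces the size constraints quoted above.

#include <stdint.h>

typedef struct {
    uint32_t type;         /* hypothetical: user-defined message kind */
    uint32_t src_core;     /* hypothetical: sending DSP core */
    uint64_t payload_addr; /* hypothetical: address of the data referred to */
    uint64_t payload_len;
    uint8_t  pad[8];       /* pad the struct out to 32 bytes */
} msg_desc_t;

/* the size must be a multiple of 16 bytes and at least 32 bytes */
_Static_assert(sizeof(msg_desc_t) % 16 == 0 && sizeof(msg_desc_t) >= 32,
               "descriptor violates the QMSS size constraints");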

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 84 / 141

SLIDE 85

System-on-chip Processors Bare-metal Runtime on DSP

Hardware Queues

Using hardware queues (figure courtesy www.deyisupport.com)

Each push and pop operation is atomic.

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 85 / 141

SLIDE 86

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenCL

A parallel programming library API specification that provides:
A consistent execution model: Host, Devices, Compute Units, Processing Elements
A consistent memory model: Global, Constant, Local, Private
Asynchronous execution on the device
Data-parallel: NDRange index space, work-groups, work-items
Task-parallel: in-order and out-of-order queues, asynchronous dispatch
An architecture-invariant kernel specification

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 86 / 141

SLIDE 87

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenCL Platform Model

OpenCL Platform Model (figure courtesy developer.amd.com)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 87 / 141

SLIDE 88

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenCL Memory Model

OpenCL Memory Model (figure courtesy developer.amd.com)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 88 / 141

SLIDE 89

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenCL Example: Vector Addition

const char *kernelStr =
    "kernel void VectorAdd(global const short4* a, "
    "                      global const short4* b, "
    "                      global short4* c) "
    "{ "
    "    int id = get_global_id(0); "
    "    c[id] = a[id] + b[id]; "
    "}";

/* Create device context */
Context context(CL_DEVICE_TYPE_ACCELERATOR);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

/* Declare device buffers */
Buffer bufA  (context, CL_MEM_READ_ONLY,  bufsize);
Buffer bufB  (context, CL_MEM_READ_ONLY,  bufsize);
Buffer bufDst(context, CL_MEM_WRITE_ONLY, bufsize);

/* Create program from kernel string, compile, associate with context */
Program::Sources source(1, std::make_pair(kernelStr, strlen(kernelStr)));
Program program = Program(context, source);
program.build(devices);

/* Set kernel arguments */
Kernel kernel(program, "VectorAdd");
kernel.setArg(0, bufA); kernel.setArg(1, bufB); kernel.setArg(2, bufDst);

/* Create command queue */
CommandQueue Q(context, devices[d], CL_QUEUE_PROFILING_ENABLE);

/* Write data to device */
Q.enqueueWriteBuffer(bufA, CL_FALSE, 0, bufsize, srcA, NULL, &ev1);
Q.enqueueWriteBuffer(bufB, CL_FALSE, 0, bufsize, srcB, NULL, &ev2);

/* Enqueue kernel */
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(NumVecElements),
                       NDRange(WorkGroupSize), NULL, &ev3);

/* Read result */
Q.enqueueReadBuffer(bufDst, CL_TRUE, 0, bufsize, dst, NULL, &ev4);

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 89 / 141

SLIDE 90

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenMP

Shared memory parallel programming specification:
Uses compiler directives to partition work across cores
Fork-join model - the master thread creates a team of threads for each parallel region
The memory model does not require hardware cache coherency
Data- and task-parallel
Mature programming model: spec v1.0 released in 1997
Suited to multi-core systems with shared memory
Widely used in the HPC community

int i;
#pragma omp parallel for
for (i = 0; i < size; i++)
    c[i] = a[i] + b[i];

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 90 / 141

SLIDE 91

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenMP Fork-Join Model

OpenMP Fork-Join Model (figure courtesy computing.llnl.gov)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 91 / 141

SLIDE 92

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenMP for Accelerators

Recent addition in OpenMP 4.0 (released July 2013)
Notion of a host and target devices
Use target constructs to offload work from the host to a target device
Target regions contain OpenMP parallel regions
Map clauses specify data synchronization

#pragma omp target map(to: a[0:size], b[0:size], size) \
                   map(from: c[0:size])
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 92 / 141

slide-93
SLIDE 93

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

OpenMPAcc Library and compiler

clacc: Shell compiler
omps2s: Source-to-source translator
libOpenMPAcc: Thin layer on top of OpenCL
ARM - DSP communication and synchronization using OpenCL over shared memory
Source-to-source lowering generates separate ARM and DSP source code and libOpenMPAcc API calls
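As a rough sketch of what such a lowering might emit on the ARM side for the vector-add target region on the earlier slide, consider the code below; the __omp_acc_* entry points are hypothetical stand-ins invented for illustration, not the actual libOpenMPAcc API:

/* Hypothetical runtime entry points (illustrative stand-ins only): */
void __omp_acc_map_to(void *host_ptr, unsigned long bytes);
void __omp_acc_map_from(void *host_ptr, unsigned long bytes);
void __omp_acc_launch(const char *dsp_kernel_name, int n);

/* Host (ARM) code a lowering pass might generate for
   '#pragma omp target map(to: a, b, size) map(from: c)': */
void vadd_lowered(float *a, float *b, float *c, int size) {
    __omp_acc_map_to(a, size * sizeof(float));    /* map(to: a[0:size])   */
    __omp_acc_map_to(b, size * sizeof(float));    /* map(to: b[0:size])   */
    __omp_acc_launch("vadd_region_dsp", size);    /* separately generated DSP code */
    __omp_acc_map_from(c, size * sizeof(float));  /* map(from: c[0:size]) */
}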

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 93 / 141

slide-94
SLIDE 94

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

CLACC compiler

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 94 / 141

slide-95
SLIDE 95

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

Source-to-source translation

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 95 / 141

slide-96
SLIDE 96

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

Leveraging CMEM

CMEM: Contiguous (Shared) Memory
Can be used to access DRAM outside the Linux memory space
Buffer usage is the same as for normal malloc'd buffers
No memcpy required when mapping data to target regions
CMEM cache operations are performed in libOpenMPAcc to maintain data consistency
CMEM_wbInvAll() is used when the data size is greater than a threshold

float* buf_in_ddr  = (float*) __malloc_ddr(size_bytes);
float* buf_in_msmc = (float*) __malloc_msmc(size_bytes);

__free_ddr(buf_in_ddr);
__free_msmc(buf_in_msmc);

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 96 / 141

slide-97
SLIDE 97

System-on-chip Processors Programming DSP cores with OpenCL and OpenMP

Leveraging fast local memory

Each DSP has 1 MB L2 cache
896 KB out of the 1 MB is configured as SRAM
OpenCL uses 128 KB, leaving 768 KB free
Using the local map-type

float* local_buf = malloc(sizeof(float)*size);

#pragma omp target map(to: a[0:size], b[0:size], size) \
                   map(from: c[0:size]) \
                   map(local: local_buf[0:size])
{
    #pragma omp parallel
    {
        int i;
        for (i = 0; i < size; i++)
            local_buf[i] = a[i] + b[i];
        for (i = 0; i < size; i++)
            c[i] = local_buf[i];
    }
}
free(local_buf);

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 97 / 141

slide-98
SLIDE 98

Emerging Paradigms and Challenges in Parallel Computing

Outline

1

Software Patterns

2

Algorithmic Structure Patterns

3

Program and Data Structure Patterns

4

Systems on chip: Introduction

5

System-on-chip Processors

6

Emerging Paradigms and Challenges in Parallel Computing
Directed Acyclic Task-Graph Execution Models
Accelerators and Energy Efficiency

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 98 / 141

slide-99
SLIDE 99

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

DAG Execution Models: Motivations and Ideas

In most programming models, serial or parallel, the algorithm is over-specified:

sequencing that is not necessary is often specified
specifying what (sub-)tasks of a program can run in parallel is difficult and error-prone
the model may constrain the program to run on a particular architecture (e.g. single memory image)

Directed acyclic task graph programming models specify only the necessary semantic ordering constraints

we express an instance of an executing program as a graph of tasks a node has an edge pointing to a second node if there is a (data) dependency between them

The DAG run-time system can then determine when and where each task executes, with the potential to extract maximum concurrency

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 99 / 141

slide-100
SLIDE 100

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

DAG Execution Models: Motivations and Ideas

In DAG programming models, we version data (say by iteration count)

it thus has declarative, write-once semantics
a node in a DAG will have associated with it:
  the input data items (including version number) required
  the output data items produced (usually with an updated version number)
  the function which performs this task
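A toy sketch of this write-once, versioned-data idea (the names and fixed-size store below are invented for illustration, not any particular runtime's API):

#include <assert.h>

#define MAX_VER  8
#define MAX_IDX 16

typedef struct { double value; int written; } Item;
static Item store[MAX_VER][MAX_IDX];   /* store[version][index] */

/* Producers write each (version, index) pair exactly once */
static void put_item(int ver, int idx, double v) {
    assert(!store[ver][idx].written);  /* write-once: overwriting is an error */
    store[ver][idx].value = v;
    store[ver][idx].written = 1;
}

/* Consumers may read only items that have already been produced */
static double get_item(int ver, int idx) {
    assert(store[ver][idx].written);
    return store[ver][idx].value;
}

/* A task that consumes items of version k and produces version k+1
   (caller ensures idx+1 < MAX_IDX and k+1 < MAX_VER) */
static void relax_task(int k, int idx) {
    put_item(k + 1, idx, 0.5 * (get_item(k, idx) + get_item(k, idx + 1)));
}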

Running a task-DAG program involves:

generating the graph
allowing an execution engine to schedule tasks to processors:
  a task may execute when all of its input data items are ready
  the task informs the engine that its output data items have been produced before exiting
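A minimal sketch of such an execution engine, assuming a single-threaded ready queue and simple dependence counting (a real engine would dispatch ready tasks to worker threads):

#define MAX_TASKS 64
#define MAX_SUCC   8

typedef struct Task {
    void (*fn)(void);             /* the work this node performs   */
    int unmet;                    /* predecessors not yet finished */
    int nsucc;
    struct Task *succ[MAX_SUCC];  /* edges to dependent tasks      */
} Task;

static Task *ready[MAX_TASKS];
static int nready;

/* A finished task releases any successor whose inputs are now all ready */
static void retire(Task *t) {
    for (int i = 0; i < t->nsucc; i++)
        if (--t->succ[i]->unmet == 0)
            ready[nready++] = t->succ[i];
}

void run_dag(Task *tasks, int n) {
    for (int i = 0; i < n; i++)        /* seed with dependence-free tasks */
        if (tasks[i].unmet == 0)
            ready[nready++] = &tasks[i];
    while (nready > 0) {
        Task *t = ready[--nready];     /* a real engine hands this to a worker */
        t->fn();
        retire(t);
    }
}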

Potential advantages include:

maximizing parallelism, transparent load balancing
arguably simpler programming model: no race hazards / deadlocks!
abstraction over the underlying architecture
permitting fault-tolerance: tasks on a failed process may be re-executed (requires that data items are kept in a 'resilient store')

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 100 / 141

slide-101
SLIDE 101

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

DAG Example: Tiled Cholesky Factorization

A (+ve definite) symmetric matrix A may be factored A = LL^T, where L is triangular
The 'right-looking' tiled algorithm works in place (A may be stored as a triangular matrix), with L overwriting A (tile by tile)
The (PLASMA) DAG pseudo-code is

for k = 0 .. nb_tiles-1
    A[k][k] <- DPOTRF(A[k][k])                    // A[k][k] = sqrt(A[k][k])
    for m = k+1 .. nb_tiles-1
        A[m][k] <- DTRSM(A[k][k], A[m][k])        // A[m][k] = A[k][k]^-1 A[m][k]
    for n = k+1 .. nb_tiles-1
        A[n][n] <- DSYRK(A[n][k], A[n][n])        // A[n][n] -= A[n][k] A[n][k]^T
        for m = n+1 .. nb_tiles-1
            A[m][n] <- DGEMM(A[m][k], A[n][k], A[m][n])  // A[m][n] -= A[m][k] A[n][k]^T

The size of nb_tiles is a trade-off: parallelism / load balance vs amortizing task startup and data fetch/store costs, cache performance, etc (see the sketch below)
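One way to see the scaling is to count tasks as a function of nb_tiles; the small program below simply mirrors the loop nest above (for T = 5 it reports 35 tasks: 5 DPOTRF, 10 DTRSM, 10 DSYRK, 10 DGEMM):

#include <stdio.h>

int main(void) {
    int T = 5;   /* nb_tiles */
    int potrf = 0, trsm = 0, syrk = 0, gemm = 0;
    for (int k = 0; k < T; k++) {
        potrf++;
        for (int m = k + 1; m < T; m++) trsm++;
        for (int n = k + 1; n < T; n++) {
            syrk++;
            for (int m = n + 1; m < T; m++) gemm++;
        }
    }
    printf("T=%d: %d tasks (%d DPOTRF, %d DTRSM, %d DSYRK, %d DGEMM)\n",
           T, potrf + trsm + syrk + gemm, potrf, trsm, syrk, gemm);
    return 0;
}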

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 101 / 141

slide-102
SLIDE 102

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

Tiled DAG Cholesky Factorization (II)

(courtesy Haidar et al, Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 102 / 141

slide-103
SLIDE 103

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

Tiled DAG Cholesky Factorization (III)

(courtesy Haidar et al, —)

Task graph with nb_tiles = 5
Column on left shows the depth:width ratio of the DAG

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 103 / 141

slide-104
SLIDE 104

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

Cholesky Factorization in PLASMA

For each 'core' task, we define a function to insert it into PLASMA's task-DAG scheduler and another to perform the task:

int Dsched_dpotrf(Dsched *dsched, int nb, double *A, int lda) {
    DSCHED_Insert_Task(dsched, TASK_core_dpotrf,
        sizeof(int),           &nb,  VALUE,
        sizeof(double)*nb*nb,  A,    INOUT | LOCALITY,
        sizeof(int),           &lda, VALUE,
        0);
}

void TASK_core_dpotrf(Dsched *dsched) {
    int nb, lda;
    double *A;
    dsched_unpack_args_3(dsched, nb, A, lda);
    dpotrf_("L", &nb, A, &lda, ...);
}

And these are inserted into the scheduler, which works out the implicit dependencies:

for (k = 0; k < nb_tiles; k++) {
    Dsched_dpotrf(dsched, nb, A[k][k], lda);
    for (m = k+1; m < nb_tiles; m++)
        Dsched_dtrsm(dsched, nb, A[k][k], A[m][k], lda);
    for (n = k+1; n < nb_tiles; n++) {
        Dsched_dsyrk(dsched, nb, A[n][k], A[n][n], lda);
        for (m = n+1; m < nb_tiles; m++)
            Dsched_dgemm(dsched, nb, A[m][k], A[n][k], A[m][n]);
    }
}

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 104 / 141

slide-105
SLIDE 105

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

PLASMA Architecture

(courtesy Haidar et al)

Architecture for scheduler

inserted tasks go into an implicit DAG
tasks can be in NotReady, Queued, or Done states
the workers execute queued tasks
the descendants of Done tasks are examined to see if they are ready

(courtesy Haidar et al)

Execution trace of Cholesky: large vs small window sizes

a full DAG becomes very large
it is necessary only to expand a 'window' of the next tasks to be executed
the window needs to be reasonably large to get good scheduling

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 105 / 141

slide-106
SLIDE 106

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

Intel Concurrent Collections (CnC)

CnC is based on two sources of ordering requirements:

producer / consumer (data dependence)
controller / controllee (control dependence)

CnC combines ideas of streaming, tuple spaces and dataflow
users need to supply a CnC dependency graph and code for step functions
step functions get and put data into the CnC store

(figs courtesy Knobe&Sarker, CnC || Programming Model) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 106 / 141

slide-107
SLIDE 107

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

Cholesky in CnC

Dependency Graph specification:

(courtesy Knobe&Sarker, CnC || Programming Model)

(Cholesky = dpotrf, Trisolve = dtrsm, Update = dsyrk+dgemm)
Note that the iteration is explicitly part of the tag
Step function for the symmetric rank-k update (shown in figure)
The task graph is generated by executing a user-supplied harness

(figures courtesy Schlimbach, Brodman & Knobe, Concurrent Collections on Distributed Memory) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 107 / 141

slide-108
SLIDE 108

Emerging Paradigms and Challenges in Parallel Computing Directed Acyclic Task-Graph Execution Models

DAG Task Graph Programming Models: Summary

The programmer needs only to break the computation into tasks
Otherwise not too much more than specifying sequential computation:

specify a task 'executor' and a task-harness generator
becomes trickier when the task DAG is data-dependent, e.g. iterative linear system solvers

Shows promise for both small and large scale parallel execution
Challenge: hierarchical systems require hierarchical task decomposition!

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 108 / 141

slide-109
SLIDE 109

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Accelerators and Energy Efficiency

In 1993 some people decided to compile a list of the 500 most powerful computers in the world (www.top500.org). This provides an interesting window into computer hardware trends and issues

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 109 / 141

slide-110
SLIDE 110

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

The Top500: June 2018

(http://www.top500.org/resources/presentations/ (54th TOP500)) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 110 / 141

slide-111
SLIDE 111

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Top500: Performance Trend

(http://www.top500.org/resources/presentations/ (54th TOP500)) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 111 / 141

slide-112
SLIDE 112

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Moore’s Law & Dennard Scaling

Two “laws” underpin what we observe in the time evolution of the Top500

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 112 / 141

slide-113
SLIDE 113

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Chip Size and Clock Speed

Moore's Law and Dennard Scaling allowed construction of ever faster and more complex processors - eventually so fast that a signal could no longer cross the chip in a single clock cycle

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 113 / 141

slide-114
SLIDE 114

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Divide & Conquer

To address this, multiple processor cores were placed on a single piece of silicon

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 114 / 141

slide-115
SLIDE 115

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Top500: Multicore Emergence

(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 115 / 141

slide-116
SLIDE 116

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Top500: Accelerator Trend

(http://www.top500.org/resources/presentations/ (54th TOP500)) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 116 / 141

slide-117
SLIDE 117

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Moore’s Law & Dennard Scaling

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 117 / 141

slide-118
SLIDE 118

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Moore’s Law & Dennard Scaling

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 118 / 141

slide-119
SLIDE 119

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

A New Chip Development Era

1960-2010                   | 2010-?
Few transistors             | No shortage of transistors
No shortage of power        | Limited power
Maximize transistor utility | Minimize energy
Generalize                  | Customize

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 119 / 141

slide-120
SLIDE 120

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Top500: Power Efficiency

(http://www.top500.org/resources/presentations/ (54th TOP500)) Computer Systems (ANU) Parallel Software Design Feb 14, 2020 120 / 141

slide-121
SLIDE 121

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Heterogeneity and Energy Efficiency

The following slides are taken from Parallel Computer Architecture and Programming, Kayvon Fatahalian, Carnegie Mellon University Course 15-418/618 (http://15418.courses.cs.cmu.edu/spring2015/)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 121 / 141

slide-122
SLIDE 122

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

You need to buy a computer system


Processor A

4 cores; each core has sequential performance P

Processor B

16 cores; each core has sequential performance P/2

All other components of the system are equal.

Which do you pick?

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 122 / 141

slide-123
SLIDE 123

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Amdahl’s law revisited

speedup(f, n) = 1 / ((1 - f) + f/n)

f = fraction of program that is parallelizable
n = number of parallel processors
Assumption: parallelizable work distributes perfectly onto n processors of equal capability
(e.g. f = 0.9, n = 16 gives speedup = 1/(0.1 + 0.9/16) = 6.4)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 123 / 141

slide-124
SLIDE 124

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Rewrite Amdahl’s law in terms of resource limits

speedup(f, n, r) = 1 / ((1 - f)/perf(r) + f*r / (perf(r)*n))

f = fraction of program that is parallelizable
n = total processing resources (e.g., transistors on a chip)
r = resources dedicated to each processing core (each of the n/r cores has sequential performance perf(r))

Two examples where n = 16: rA = 4, rB = 1
(speedup is relative to a processor with 1 unit of resources, n = 1; assume perf(1) = 1)

A more general form of Amdahl's Law in terms of f, n and r [Hill and Marty 08]

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 124 / 141

slide-125
SLIDE 125

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Speedup (relative to n=1)

Each line corresponds to a different chip configuration
All lines on the same graph correspond to configurations with the same number of resources (constant n per chip)
X-axis = r (many small cores to the left, fewer "fatter" cores to the right)
perf(r) is modeled as sqrt(r)

Panels: up to 16 cores (n=16); up to 256 cores (n=256)
[Figure credit: Hill and Marty 08]

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 125 / 141

slide-126
SLIDE 126

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Asymmetric set of processing cores


Example: n = 16; one core with r = 4, the other 12 cores with r = 1

speedup(f, n, r) = 1 / ((1 - f)/perf(r) + f / (perf(r) + n - r))

(of a heterogeneous processor with n resources relative to a uniprocessor with one unit worth of resources, n = 1)
one perf(r) processor + (n - r) processors with perf(1) = 1

[Hill and Marty 08]
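A small numeric sketch of both forms of the model, assuming perf(r) = sqrt(r) as in Hill and Marty's figures (the f and n values below are arbitrary):

#include <math.h>
#include <stdio.h>

static double perf(double r) { return sqrt(r); }

/* symmetric: n/r cores, each built from r resource units */
static double speedup_sym(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
}

/* asymmetric: one r-sized core plus (n - r) unit cores */
static double speedup_asym(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
}

int main(void) {
    double f = 0.9, n = 16.0;
    printf("sym  r=4: %.2f\n", speedup_sym(f, n, 4.0));   /* ~6.2 */
    printf("sym  r=1: %.2f\n", speedup_sym(f, n, 1.0));   /* ~6.4 */
    printf("asym r=4: %.2f\n", speedup_asym(f, n, 4.0));  /* ~8.7 */
    return 0;
}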

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 126 / 141

slide-127
SLIDE 127

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Speedup (relative to n=1)

X-axis for asymmetric architectures gives r for the single "fat" core (assume the rest of the cores are r = 1)
X-axis for symmetric architectures gives r for all cores (many small cores to the left, few "fat" cores to the right)

(chip from prev. slide)

[Source: Hill and Marty 08]

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 127 / 141

slide-128
SLIDE 128

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Heterogeneous processing

Observation: most “real world” applications have complex workload characteristics *

They have components that can be widely parallelized, and components that are difficult to parallelize
They have components that are amenable to wide SIMD execution, and components that are not (divergent control flow)
They have components with predictable data access, and components with unpredictable access (though those accesses might cache well)

* You will likely make a similar observation during your projects

Idea: the most efficient processor is a heterogeneous mixture of resources (“use the most efficient tool for the job”)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 128 / 141

slide-129
SLIDE 129

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Example: Intel "Haswell" (2013)

(4th Generation Core i7 architecture)

Four CPU cores + many GPU cores

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 129 / 141

slide-130
SLIDE 130

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Heterogeneous Architectures for Supercomputing

from Green500 List, Nov 2019

rank | Top500 rank | system | cores (×1000) | Rmax (PF) | power (KW) | efficiency (GF/W)
1 | 159 | A64FX prototype: A64FX 2GHz, Fujitsu Numazu | 37 | 2.0 | 118 | 16.9
2 | 420 | ZettaScaler-2.2: Xeon D-1571 + PEZY-SC2 700MHz, PEZY Computing, Japan | 1271 | 1.3 | 80 | 16.2
3 | 24 | AiMOS: POWER9 3.5GHz + Volta GV100, Rensselaer Poly, USA | 130 | 8.0 | 510 | 15.8
4 | 173 | Satori: POWER9 2.4GHz + Volta GV100, MIT, USA | 23 | 1.5 | 94 | 15.6
5 | 1 | Summit: POWER9 3.1GHz + Volta GV100, Oak Ridge National Laboratory | 2414 | 148.6 | 10096 | 14.7
6 | 8 | ABCI: Xeon Gold 2.4GHz + Tesla V100 SXM2, AIST, Japan | 392 | 19.9 | 1649 | 14.4
7 | 494 | MareNostrum: POWER9 3.1GHz + Tesla V100, Barcelona | 18 | 1.5 | 81 | 14.1
8 | 23 | TSUBAME3.0: Xeon E5 2.4GHz + Tesla P100 SXM2, Tokyo Tech | 136 | 1145 | 792 | 13.7
9 | 11 | Pangea III: POWER9 3.5GHz + Volta GV100, Total Exploration, France | 291 | 18 | 1367 | 13.1
10 | 2 | Sierra: POWER9 3.1GHz + Volta GV100, LLNL | 1572 | 95 | 7438 | 12.7

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 130 / 141

slide-131
SLIDE 131

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Energy-constrained computing

▪ Supercomputers are energy constrained

  • Due to sheer scale
  • Overall cost to operate (power for machine and for cooling)

▪ Datacenters are energy constrained

  • Reduce cost of cooling
  • Reduce physical space requirements

▪ Mobile devices are energy constrained

  • Limited battery life
  • Heat dissipation

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 131 / 141

slide-132
SLIDE 132

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Limits on chip power consumption

▪ General rule in mobile processing: the longer a task runs, the less power it can use

  • A processor's power consumption is limited by the heat generated (efficiency is required for more than just maximizing battery life)


Electrical limit: max power that can be supplied to the chip
Die temp (junction temp, Tj): chip becomes unreliable above this temp (chip can run at high power for a short period until it heats to Tj)
Case temp: mobile device gets too hot for the user to comfortably hold (chip is at a suitable operating temp, but heat is dissipating into the case)
Battery life: chip and case are cool, but we want to reduce power consumption to sustain long battery life for a given task

Slide credit: adapted from an original slide by M. Shebanow, HPG 2013 keynote
iPhone 5 battery: 5.4 watt-hours; 4th gen iPad battery: 42.5 watt-hours; 15in Macbook Pro: 95 watt-hours

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 132 / 141

slide-133
SLIDE 133

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Hardware specialization increases efficiency

[Chung et al. MICRO 2010] (figure axes: lg2(N) data set size; FPGA and GPU curves)
ASIC delivers the same performance as one CPU core with ~1/1000th the chip area
GPU cores: ~5-7 times more area efficient than CPU cores
ASIC delivers the same performance as one CPU core with only ~1/100th the power

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 133 / 141

slide-134
SLIDE 134

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Efficiency benefits of compute specialization

▪ Rules of thumb: compared to good-quality C code on CPU... ▪ Throughput-maximized processor architectures: e.g., GPU cores

  • Approximately 10x improvement in perf / watt
  • Assuming code maps well to wide data-parallel execution and is compute bound

▪ Fixed-function ASIC ("application-specific integrated circuit")

  • Can approach 100x or greater improvement in perf/watt
  • Assuming code is compute bound and is not floating-point math

[Source: Chung et al. 2010, Dally 08] [Figure credit: Eric Chung]

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 134 / 141

slide-135
SLIDE 135

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Research: ARM + GPU Supercomputer

▪ Observation: the heavy lifting in supercomputing applications is the data-parallel part of the workload

  • Less need for “beefy” sequential performance cores

▪ Idea: build supercomputer out of power-efficient building blocks

  • ARM CPUs (for control/scheduling) + GPU cores (primary compute engine)

▪ Goal: 7 GFLOPS/Watt efficiency
▪ Project underway at Barcelona Supercomputing Center

http://www.montblanc-project.eu Computer Systems (ANU) Parallel Software Design Feb 14, 2020 135 / 141

slide-136
SLIDE 136

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Challenges of heterogeneity

▪ Heterogeneous system: preferred processor for each task

  • Challenge for system designer: what is the right mixture of resources?
  • Too few throughput oriented resources (lower peak throughput for parallel workloads)
  • Too few sequential processing resources (limited by sequential part of workload)
  • How much chip area should be dedicated to a specific function, like video? (these resources are taken away from general-purpose processing)

  • Work balance must be anticipated at chip design time
  • System cannot adapt to changes in usage over time, new algorithms, etc.
  • Challenge to software developer: how to map programs onto a heterogeneous collection of resources?

  • Challenge: "Pick the right tool for the job": design algorithms that decompose well into components that each map well to different processing components of the machine

  • The scheduling problem is more complex on a heterogeneous system
  • Available mixture of resources can dictate choice of algorithm
  • Software portability nightmare (we'll revisit in a future lecture on domain-specific programming languages)

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 136 / 141

slide-137
SLIDE 137

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Data movement has high energy cost

▪ Rule of thumb in mobile system design: reduce the amount of data transferred from memory

  • Earlier in class we discussed minimizing communication to reduce stalls (poor performance); now we wish to reduce communication to reduce energy consumption

▪ “Ballpark” numbers

  • Integer op: ~ 1 pJ *
  • Floating point op: ~20 pJ *
  • Reading 64 bits from small local SRAM (1mm away on chip): ~ 26 pJ
  • Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pJ

▪ Implications

  • Reading 10 GB/sec from memory: ~1.6 watts
  • Entire power budget for mobile GPU: ~1 watt (remember the phone is also running CPU, display, radios, etc.)

  • iPhone 5 battery: ~5.5 watt-hours (note: my Macbook Pro laptop: 77 watt-hour battery)
  • Exploiting locality matters!!!

* Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.

[Sources: Bill Dally (NVIDIA), Tom Olson (ARM)]
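A back-of-envelope check of the ~1.6 W figure above, using only the slide's ballpark numbers:

#include <stdio.h>

int main(void) {
    double pj_per_read = 1200.0;              /* 64-bit LPDDR read, in pJ */
    double pj_per_byte = pj_per_read / 8.0;   /* = 150 pJ per byte        */
    double bytes_per_s = 10e9;                /* 10 GB/sec from memory    */
    double watts = pj_per_byte * 1e-12 * bytes_per_s;
    printf("~%.1f W just to stream the data\n", watts);   /* ~1.5 W */
    return 0;
}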

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 137 / 141

slide-138
SLIDE 138

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

From: http://15418.courses.cs.cmu.edu/spring2015/lecture/heterogeneity

CMU 15-418/618, Spring 2015

Three trends in energy-focused computing

▪ Compute less!

  • Computing more costs energy: parallel algorithms that do more work than sequential counterparts may not be desirable even if they run faster

▪ Reduce bandwidth requirements

  • Exploit locality (restructure algorithms to reuse on-chip data as much as possible)
  • Aggressive use of compression: perform extra computation to compress application data before transferring to memory (likely to see fixed-function HW to reduce the overhead of general data compression/decompression)

▪ Specialize compute units:

  • Heterogeneous processors: CPU-like cores + throughput-optimized cores (GPU-like cores)
  • Fixed-function units: audio processing, "movement sensor processing", video decode/encode, image processing/computer vision?

  • Specialized instructions: expanding set of AVX vector instructions, new instructions for accelerating AES encryption (AES-NI)

  • Programmable soft logic: FPGAs

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 138 / 141

slide-139
SLIDE 139

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Summary: Energy efficiency

Moore's Law continues, but Dennard Scaling has ended
We are in a new era of processor design
Heterogeneity and energy usage are the issues of the day!

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 139 / 141

slide-140
SLIDE 140

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Hands-on Exercise: Matrix Multiplication Implementation

Objective: to show a parallel multi-core implementation of matrix multiplication using the Master/Worker pattern
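One possible starting point (not the official exercise solution): a minimal master/worker sketch using OpenMP tasks, where a single 'master' thread generates one task per row and the worker threads in the team execute them; N and the per-row granularity are arbitrary choices:

#include <stdio.h>
#include <omp.h>
#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)            /* simple test data: B = identity */
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = (i == j) ? 1.0 : 0.0;
        }

    #pragma omp parallel
    #pragma omp single                     /* master: create one task per row */
    for (int i = 0; i < N; i++) {
        #pragma omp task firstprivate(i)   /* workers pick up row tasks */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }                                      /* implicit barrier joins all tasks */

    printf("C[0][0] = %.1f (expect 1.0)\n", C[0][0]);
    return 0;
}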

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 140 / 141

slide-141
SLIDE 141

Emerging Paradigms and Challenges in Parallel Computing Accelerators and Energy Efficiency

Summary

Topics covered today - parallel algorithm design and parallel futures:

Finding parallelism
Algorithmic Structure Patterns
Program and Data Structure Patterns
System-on-chip processors
Challenges: reducing synchronization, heterogeneity and energy

Computer Systems (ANU) Parallel Software Design Feb 14, 2020 141 / 141