Deriving Efficient Data Movement From Decoupled Access/Execute

The Queen’s Tower, Imperial College London, South Kensington, SW7
Deriving Efficient Data Movement From Decoupled Access/Execute Specifications
Lee W. Howes, Anton Lokhmotov, Alastair F. Donaldson and Paul H. J. Kelly
Imperial College London and Codeplay Software
January 2009
Lee Howes 27th Jan 2008 | Ashley Brown

Multi-core architectures
• Require parallel programming
• Must divide computation
• Must communicate data
• High-throughput computation
  – Efficient use of memory bandwidth essential
Source: AMD

Cell's hardware solution
• Target the memory wall:
  – Distributed local memories: 256kB each
  – Separate data movement from computation using DMA engines
• Bulk transfers increase efficiency
• Increased programming challenge:
  – Must write data movement code
  – Must deal with alignment constraints
• Premature optimisation
  – Platform independence is lost
Source: IBM

Mainstream programming models
• No explicit support for separation of computation from data access
• Freely mix computation and data movement
• Complexity of compiler analysis => Difficult to extract separation
• Orthogonal issues:
  – extracting parallelism
  – creating data movement code

The proposal
• Allow the programmer to express explicitly:
  – Separation between data communication and computation
  – Parallelism of the computation

Streams
• Approaches the separation ideal
• Simple kernel applied to each element of a data set
• Each element of stream typically independent of others
  – No feedback as a parallel processing model
  – Dependencies only on input and output elements

Parallelism in stream programming
• Independence of executions => simple inference of parallelism
• Sliding windows of elements on inputs
  – access multiple elements
  – parallelism still predictable
• AMD, NVIDIA use a stream model for parallel hardware

Streams? A 2D convolution filter
• Reads region of input
• Processes region
• Writes single point in the output

Representing convolution as 1D streams
• One option: flatten 2D dataset
  – Requires multiple sliding windows or long FIFO structures
• Mapping 2D structures to 1D streams is untidy
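To make the cost of flattening concrete, the sketch below computes which flat indices a square window touches in a row-major layout, and how many consecutive stream elements a single FIFO would have to hold to cover one window. The row-major layout and the helper names (`flatIndex`, `fifoSpan`, `windowIndices`) are assumptions for illustration and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flat row-major index of element (x, y) in a grid of the given width.
inline std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t width) {
    return y * width + x;
}

// Number of consecutive 1D stream elements that must be buffered so that a
// (2*radius+1) x (2*radius+1) window is completely available at once.
inline std::size_t fifoSpan(std::size_t width, std::size_t radius) {
    const std::size_t first = flatIndex(0, 0, width);                    // top-left corner
    const std::size_t last = flatIndex(2 * radius, 2 * radius, width);   // bottom-right corner
    return last - first + 1;
}

// Flat indices touched by the window centred on (cx, cy).
// The caller must keep the window inside the grid.
inline std::vector<std::size_t> windowIndices(std::size_t cx, std::size_t cy,
                                              std::size_t width, std::size_t radius) {
    assert(cx >= radius && cy >= radius && cx + radius < width);
    std::vector<std::size_t> indices;
    for (std::size_t y = cy - radius; y <= cy + radius; ++y) {
        for (std::size_t x = cx - radius; x <= cx + radius; ++x) {
            indices.push_back(flatIndex(x, y, width));
        }
    }
    return indices;
}
```

For a 32-wide grid and a 3x3 window, the nine elements are spread over 67 consecutive stream positions, which is why either a long FIFO or several separate sliding windows are needed.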

Representing convolution as 2D streams
• Stanford's Brook language uses stencils on 2D shaped streams

    floats x;
    floats2 y;
    streamShape(x, 2, 32, 32);
    streamStencil(y, x, STREAM_STENCIL_CLAMP, 2, 1, -1, 1, -1);

    kernel void neighborAvg(floats2 a, out floats b) {
      b = 0.25*(a[0][1]+a[2][1]+a[1][0]+a[1][2]);
    }

Representing convolution as 2D streams
• Stencil stream passed to kernel
• Treated as if it is a small set of accessible elements
• Limited addressing capabilities
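For comparison with the Brook `neighborAvg` kernel, here is a plain C++ sketch of the same four-neighbour average over a row-major grid. It assumes that `STREAM_STENCIL_CLAMP` means out-of-range coordinates are clamped to the nearest edge; the slides do not spell this out, and the helper `clampedAt` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Read grid(x, y) with coordinates clamped to the grid edges
// (assumed interpretation of the clamp stencil mode).
inline float clampedAt(const std::vector<float>& grid, int width, int height, int x, int y) {
    x = std::max(0, std::min(x, width - 1));
    y = std::max(0, std::min(y, height - 1));
    return grid[static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
                + static_cast<std::size_t>(x)];
}

// Average of the four direct neighbours of every element.
inline std::vector<float> neighborAvg(const std::vector<float>& in, int width, int height) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float sum = clampedAt(in, width, height, x, y - 1)
                            + clampedAt(in, width, height, x, y + 1)
                            + clampedAt(in, width, height, x - 1, y)
                            + clampedAt(in, width, height, x + 1, y);
            out[static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
                + static_cast<std::size_t>(x)] = 0.25f * sum;
        }
    }
    return out;
}
```

In this version the loop nest, the boundary handling and the arithmetic are interleaved, which is the mixing of access and computation that the stencil declaration factors out.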

Generalising streams
• View streams as:
  – A kernel, executed separately on each data element
  – A simple mapping of that kernel onto the data
  – elementwise or moving windowed
• This is a simplistic separation of access from execution, hence the Decoupled Access/Execute (Æcute) model

Æcute as a generalisation of streams
• Take a similar kernel-per-element declarative programming model
• View in terms of an iteration space that is independent of the data sets
• With a separate, flexible mapping to the data
• Mapping allows clean descriptions of complicated data access patterns
• Simpler kernel implementations with localised data sets

Execute
• Define an iteration space (e.g. as polyhedral constraints)
• Execute a computation kernel for each point in the iteration space

Data access
• On each iteration, the kernel accesses a set of data elements
• Accessed elements treated as local to the iteration
• Eases programming of the kernel

Decoupled access/execute
• Decouple access to remote memory from local execution
• Separate mapping of local store to global data

Multiple iterations
• Decouple access and execute for multiple iterations for efficiency
• Manually supporting this flexibility can be challenging
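The idea of decoupling access from execution across several iterations can be sketched for a 1D mean filter. One bulk copy brings in everything that a run of iterations needs, including a halo of `radius` elements on each side, and the kernel then reads only the local buffer. The function names `fetchChunk` and `executeChunk` are hypothetical, and the `std::vector` copy merely stands in for a DMA transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Access phase: copy the elements needed by iterations [begin, end),
// plus a halo of `radius` on each side, into a local buffer.
inline std::vector<float> fetchChunk(const std::vector<float>& global,
                                     std::size_t begin, std::size_t end, std::size_t radius) {
    assert(begin <= end && begin >= radius && end + radius <= global.size());
    return std::vector<float>(global.begin() + static_cast<std::ptrdiff_t>(begin - radius),
                              global.begin() + static_cast<std::ptrdiff_t>(end + radius));
}

// Execute phase: mean over a window of 2*radius+1 elements per iteration,
// reading only the local buffer filled by fetchChunk.
inline std::vector<float> executeChunk(const std::vector<float>& local,
                                       std::size_t count, std::size_t radius) {
    assert(local.size() >= count + 2 * radius);
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        float sum = 0.0f;
        for (std::size_t k = 0; k <= 2 * radius; ++k) {
            sum += local[i + k];  // local index i + k maps to global index begin - radius + i + k
        }
        out[i] = sum / static_cast<float>(2 * radius + 1);
    }
    return out;
}
```

Fetching once per chunk rather than once per iteration is what turns many small transfers into a single bulk one; the price is the index bookkeeping between local and global addresses, which is the part that is tedious to maintain by hand.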

Add in alignment issues
• DMAs must be adapted to correct for alignment
• Data can often be read with alignment tweaks to fix performance
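One common way to adapt a transfer for alignment is to widen it to the enclosing aligned range and then skip the leading padding in the local buffer. The sketch below shows that arithmetic; the 16-byte default is an illustrative assumption rather than a figure from the slides, and `AlignedRange` and `alignTransfer` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Result of widening a transfer to aligned boundaries.
struct AlignedRange {
    std::uintptr_t start;  // aligned start address of the widened transfer
    std::size_t bytes;     // widened size, a multiple of the alignment
    std::size_t skip;      // offset of the requested data within the widened transfer
};

// Widen [addr, addr + bytes) to the smallest enclosing range whose start and
// end are multiples of `alignment` (which must be a power of two).
inline AlignedRange alignTransfer(std::uintptr_t addr, std::size_t bytes,
                                  std::size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t start = addr & ~mask;                 // round start down
    const std::uintptr_t end = (addr + bytes + mask) & ~mask;  // round end up
    return AlignedRange{start,
                        static_cast<std::size_t>(end - start),
                        static_cast<std::size_t>(addr - start)};
}
```

A request for 40 bytes at address 0x1004 becomes a 48-byte transfer starting at 0x1000, with the wanted data beginning 4 bytes into the local buffer.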

In code: The iterator

    Neighbourhood2D_Read inputPointSet(iterationSpace, input, K);
    Point2D_Write outputPointSet(iterationSpace, output);
    ...
    void kernel( const IterationSpace2D::element_iterator &eit ) {
      // compute mean
      rgb mean( 0.0f, 0.0f, 0.0f );
      for(int w = -K; w <= K; ++w) {
        for(int z = -K; z <= K; ++z) {
          mean += inputPointSet(eit, w, z); // input[x+w][y+z]
        }
      }
      outputPointSet( eit ) = mean / ((2*K+1)*(2*K+1));
    }

In code: Use of access descriptors

    Neighbourhood2D_Read inputPointSet(iterationSpace, input, K);
    Point2D_Write outputPointSet(iterationSpace, output);
    ...
    void kernel( const IterationSpace2D::element_iterator &eit ) {
      // compute mean
      rgb mean( 0.0f, 0.0f, 0.0f );
      for(int w = -K; w <= K; ++w) {
        for(int z = -K; z <= K; ++z) {
          mean += inputPointSet(eit, w, z); // input[x+w][y+z]
        }
      }
      outputPointSet( eit ) = mean / ((2*K+1)*(2*K+1));
    }

In code: Computation in the kernel

    Neighbourhood2D_Read inputPointSet(iterationSpace, input, K);
    Point2D_Write outputPointSet(iterationSpace, output);
    ...
    void kernel( const IterationSpace2D::element_iterator &eit ) {
      // compute mean
      rgb mean( 0.0f, 0.0f, 0.0f );
      for(int w = -K; w <= K; ++w) {
        for(int z = -K; z <= K; ++z) {
          mean += inputPointSet(eit, w, z); // input[x+w][y+z]
        }
      }
      outputPointSet( eit ) = mean / ((2*K+1)*(2*K+1));
    }

Æcute iteration spaces
• Define an n-dimensional iteration space
• Specify sizes for each dimension
  – can be run time defined
• For example:
  – IterationSpace<2> iSpace( 0, 0, 10, 10 );
• Over which we can iterate using fairly standard syntax:
  – for( IterationSpace<2>::iterator it = iSpace.begin()..... ){...}
• Can treat the iterator loop much as an OpenMP blocked loop
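The iterator style described above can be mimicked with a small stand-alone class. This is only a sketch of the idea, not the framework's `IterationSpace`: the class name `SketchSpace2D`, the `Point` struct, and the reading of the four constructor arguments as origin plus extent are all assumptions.

```cpp
#include <cassert>

// A point in a 2D iteration space (hypothetical type).
struct Point {
    int x;
    int y;
};

// Minimal rectangular 2D iteration space with a forward iterator.
class SketchSpace2D {
public:
    // Assumed argument order: origin (x0, y0) followed by extent (width, height).
    SketchSpace2D(int x0, int y0, int width, int height)
        : x0_(x0), y0_(y0), width_(width), height_(height) {
        assert(width >= 0 && height >= 0);
    }

    class iterator {
    public:
        iterator(const SketchSpace2D* space, int index) : space_(space), index_(index) {}
        // Convert the linear position into a 2D point, x varying fastest.
        Point operator*() const {
            return Point{space_->x0_ + index_ % space_->width_,
                         space_->y0_ + index_ / space_->width_};
        }
        iterator& operator++() {
            ++index_;
            return *this;
        }
        bool operator!=(const iterator& other) const { return index_ != other.index_; }

    private:
        const SketchSpace2D* space_;
        int index_;
    };

    iterator begin() const { return iterator(this, 0); }
    iterator end() const { return iterator(this, width_ * height_); }

private:
    int x0_, y0_, width_, height_;
};
```

Because the iterator is backed by a single linear index, a contiguous sub-range of it is a natural unit of work to hand to one worker, which is what makes the comparison with a blocked OpenMP loop apt.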

Æcute access descriptors
• Define a mapping from an iteration space to an array
• Specify shape and mapping functions
• For example:
  – Region2D<Array<rgb,2>,IterationSpace<2>> inputPointSet( iSpace, data, RADIUS );
• Which we can access using an iterator
  – inputPointSet(it,1,0).r = 3;

Æcute address modifiers
• Base address of a region combines:
  – iterator address in its iteration space
  – address modifier function
• A modifier, or modifier chain, is applied (optionally) to each access descriptor:
  – Point2D< Project2D1D< 1, 0 > > inputPointSet( iSpace, data, RADIUS );
  – Projects a 2D address into a 1D address to access a 1D dataset

The Æcute framework
• Implementation of the Æcute model for data movement on the STI Cell processor

Iterating
• PPE takes a chunk of the iteration space
  – Blocking is configurable

Delegation
• Transmits chunk to appropriate SPE runtime as a message
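The chunking and delegation steps can be sketched as splitting a linearised iteration space into fixed-size blocks and assigning each block to a worker. The slides do not say how the "appropriate" SPE is chosen, so the round-robin policy below is purely an assumption, as are the names `Chunk`, `makeChunks` and `workerFor`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of linearised iteration-space points.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split `total` iterations into blocks of at most `blockSize` iterations.
inline std::vector<Chunk> makeChunks(std::size_t total, std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<Chunk> chunks;
    for (std::size_t b = 0; b < total; b += blockSize) {
        chunks.push_back(Chunk{b, std::min(b + blockSize, total)});
    }
    return chunks;
}

// Assumed delegation policy: chunks are handed to workers in round-robin order.
inline std::size_t workerFor(std::size_t chunkIndex, std::size_t workers) {
    assert(workers > 0);
    return chunkIndex % workers;
}
```

Keeping the block size as a parameter mirrors the configurable blocking mentioned above: larger blocks mean fewer messages and larger transfers, at the cost of more local memory per worker.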

Loading data
• SPE loads appropriate data for the chunk into an internal buffer in each access descriptor object

Loading data
• SPE processes one buffer set while receiving the next block to process

Loading data
• DMAs loading the next buffers operate in parallel with computation

Loading data
• On completion of a block, input buffers cleared, output DMAs initiated
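The buffer rotation described in these steps can be sketched with two alternating buffers. In the sketch the fetch is a plain synchronous copy, so it shows only the order of operations and which buffer is used when; on real hardware the fetch would be an asynchronous DMA that overlaps with the computation. The function name and the per-block sum used as the "kernel" are placeholders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum each block of up to `blockSize` elements, alternating between two buffers.
inline std::vector<float> sumBlocksDoubleBuffered(const std::vector<float>& global,
                                                  std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<float> results;
    if (global.empty()) return results;

    const std::size_t blocks = (global.size() + blockSize - 1) / blockSize;
    std::vector<float> buffers[2];

    // Stand-in for a DMA get of one block into a local buffer.
    auto fetch = [&](std::size_t block, std::vector<float>& dst) {
        const std::size_t b = block * blockSize;
        const std::size_t e = std::min(b + blockSize, global.size());
        dst.assign(global.begin() + static_cast<std::ptrdiff_t>(b),
                   global.begin() + static_cast<std::ptrdiff_t>(e));
    };

    fetch(0, buffers[0]);  // prime the first buffer before the loop starts
    for (std::size_t block = 0; block < blocks; ++block) {
        const std::size_t current = block % 2;
        const std::size_t next = 1 - current;
        if (block + 1 < blocks) {
            fetch(block + 1, buffers[next]);  // would be issued asynchronously on real hardware
        }
        float sum = 0.0f;
        for (float v : buffers[current]) sum += v;  // compute on the buffer fetched earlier
        results.push_back(sum);                     // stands in for the output DMA
    }
    return results;
}
```

The rotation is safe without extra synchronisation only because the address of block `n + 1` does not depend on the result of block `n`.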

Advantages
• Separation of buffering maintains simplicity
• Double/triple buffering comes naturally when there are no data dependent loads
• Remove complexity of manual software pipelining
• Complicated addressing schemes not precluded
