SLIDE 1

Decoupled Access/Execute Metaprogramming

Anton Lokhmotov, Lee Howes, Paul H.J. Kelly (Imperial); Alastair F. Donaldson (Oxford/Codeplay) University of Birmingham, 3 July 2009

SLIDE 2

Challenge Æcute model Vision

Recent meeting on accelerated computing at Imperial

(35–40 attendees summarised by their affiliation)

  • Computing (software optimisation, cognitive robotics, visual information processing, reconfigurable computing)
  • Electrical Engineering (reconfigurable computing, design automation)
  • Mechanical Engineering (multiscale flow dynamics, vibration technology)
  • Earth Science and Engineering (applied modelling & computation)
  • Physics (plasma, experimental solid state)
  • Chemistry (computational, biological & biophysical)
  • Biomedical Engineering
  • Chemical Engineering
  • Civil Engineering
  • Aeronautics

  • A. Lokhmotov, L. Howes, A.F. Donaldson, P.H.J. Kelly

Decoupled Access/Execute Metaprogramming

SLIDE 3

Berkeley motifs (dwarfs)

  • Dense Linear Algebra
  • Sparse Linear Algebra
  • N-Body Methods
  • Spectral Methods
  • Structured Grids
  • Unstructured Grids
  • MapReduce
  • Combinational Logic
  • Graph Traversal
  • Dynamic Programming
  • Backtrack and Branch-and-Bound
  • Graphical Models
  • Finite State Machines

SLIDE 5

Why is accelerator programming challenging?

Accelerator hardware
  • hundreds of functional units
  • software-managed memory hierarchies, e.g. host memory (main memory), device global memory (on-board), device local memory (on-chip)

Accelerator software
  • low-level, hence unproductive
  • architecture-specific, hence non-portable

SLIDE 6

The fundamental software engineering challenge

How to use accelerator technology but keep maintainability, composability, reusability, portability?

SLIDE 7

Decoupled Access/Execute (Æcute) model

Decoupled Access/Execute metaprogramming
  • kernel code written for uniform memory
  • execute metadata describe execution constraints
  • access metadata describe memory access pattern
  • part of the kernel’s interface specification

Goals
  • robust translation into efficient low-level code
  • ample opportunities for optimisation
  • convenience and flexibility

SLIDE 8

Execute metadata

Execute metadata for a kernel is a tuple $E = (I, R, P)$, where:
  • $I \subset \mathbb{Z}^n$ is a finite, n-dimensional iteration space, for some $n > 0$;
  • $R \subseteq I \times I$ is a precedence relation such that $(i_1, i_2) \in R$ iff iteration $i_1$ must be executed before iteration $i_2$;
  • $P$ is a partition of $I$ into a set of non-empty, disjoint iteration subspaces:

$$P = \Big\{ I_k : \bigcup_k I_k = I;\ I_i \cap I_j = \emptyset,\ i \neq j \Big\}$$

SLIDE 9

Access metadata

Access metadata for a kernel is a tuple $A = (M_r, M_w)$, where:
  • $M_r : I \to \mathcal{P}(M)$ specifies the set of memory locations $M_r(i)$ that may be read on iteration $i \in I$;
  • $M_w : I \to \mathcal{P}(M)$ specifies the set of memory locations $M_w(i)$ that may be written on iteration $i \in I$.

SLIDE 10

Example: 2D convolution

$$O_{y,x} = \sum_{u=-K}^{K} \sum_{v=-K}^{K} C_{u,v} \cdot I_{y+u,x+v}$$

  • I: input image
  • O: output image
  • C: coefficients
  • W: image width
  • H: image height
  • K: neighbourhood radius

for $K \le y < H - K$, $K \le x < W - K$.

SLIDE 11

Memory access of a single (y, x) iteration (K = 1)

$$O_{y,x} = \sum_{u=-K}^{K} \sum_{v=-K}^{K} C_{u,v} \cdot I_{y+u,x+v}$$

[Figure: iteration (y, x) reads a 3 × 3 region of I and all of C (3 × 3), and writes a 1 × 1 region of O.]

SLIDE 12

Æcute specification (h × w rectangular tiling)

$$O_{y,x} = \sum_{u=-K}^{K} \sum_{v=-K}^{K} C_{u,v} \cdot I_{y+u,x+v}$$

Execute metadata (I, R, P):

$$I = \{ (y, x) : K \le y < H - K,\ K \le x < W - K \}$$

$$R = \emptyset$$

$$P = \big\{ \{ (y, x) \in I : h(j-1) \le y - K < hj,\ w(i-1) \le x - K < wi \} : 1 \le j < (H - 2K)/h,\ 1 \le i < (W - 2K)/w \big\}$$

Access metadata (Mr, Mw):

$$M_r = \{ I_{y+u,x+v},\ C_{u,v} : (y, x) \in I,\ -K \le u, v \le K \}$$

$$M_w = \{ O_{y,x} : (y, x) \in I \}$$
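The partition P can be checked mechanically. The following C++ sketch is not part of the Æcute framework: `Tile`, `makeTiles` and `isPartition` are hypothetical names, and the edge tiles are clamped to the iteration space, which is an assumption that goes beyond the formula above. It enumerates h × w tiles and verifies that every iteration of I lands in exactly one tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a half-open rectangle [y0, y1) x [x0, x1) of iterations.
struct Tile { int y0, y1, x0, x1; };

// Enumerate h x w tiles covering K <= y < H-K, K <= x < W-K.
// Edge tiles are clamped (assumption), so tiles never leave the iteration space.
std::vector<Tile> makeTiles(int H, int W, int K, int h, int w) {
    assert(h > 0 && w > 0 && K >= 0 && H > 2 * K && W > 2 * K);
    std::vector<Tile> tiles;
    for (int y = K; y < H - K; y += h)
        for (int x = K; x < W - K; x += w)
            tiles.push_back({y, std::min(y + h, H - K), x, std::min(x + w, W - K)});
    return tiles;
}

// A valid partition covers every iteration of I exactly once.
bool isPartition(const std::vector<Tile>& tiles, int H, int W, int K) {
    std::vector<int> hits(static_cast<std::size_t>(H) * W, 0);
    for (const Tile& t : tiles)
        for (int y = t.y0; y < t.y1; ++y)
            for (int x = t.x0; x < t.x1; ++x)
                ++hits[static_cast<std::size_t>(y) * W + x];
    for (int y = K; y < H - K; ++y)
        for (int x = K; x < W - K; ++x)
            if (hits[static_cast<std::size_t>(y) * W + x] != 1) return false;
    return true;
}
```

For example, a 10 × 12 image with K = 1 and 3 × 4 tiles has an 8 × 10 iteration space, which `makeTiles(10, 12, 1, 3, 4)` splits into 3 × 3 tiles.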
SLIDE 13

Arrays: $I_{i,j}$, $O_{i,j}$ for $0 \le i < H$, $0 \le j < W$; $C_{i,j}$ for $-K \le i, j \le K$

C++

    rgb I[H][W];            // first index is the row: 0 <= i < H
    rgb O[H][W];
    rgb C[2*K+1][2*K+1];

Æcute (data wrappers)

    Array2D<rgb> arrayI(&I[0][0], W, H);
    Array2D<rgb> arrayO(&O[0][0], W, H);
    Array2D<rgb> arrayC(&C[0][0], 2*K+1, 2*K+1);

SLIDE 14

Iteration space: K ≤ y < H − K, K ≤ x < W − K

C++

    for(y = K; y < H-K; ++y)
        for(x = K; x < W-K; ++x)
            // Kernel code for each (y,x)

Æcute (execute metadata)

    IterationSpace1D y(K,H-K);
    IterationSpace1D x(K,W-K);
    IterationSpace2D iterYX(y,x);

SLIDE 15

Access regions: implicit in C++, explicit in Æcute

C++

    // Kernel code for each (y,x)
    rgb sum(0.0f, 0.0f, 0.0f);
    for(u = -K; u <= K; ++u)
        for(v = -K; v <= K; ++v)
            sum += C[K+u][K+v] * I[y+u][x+v]; // read from C and I
    O[y][x] = sum; // write to O

Æcute (access metadata)

    // Access descriptors
    Neighbourhood2D_R accessI(iterYX, arrayI, K);
    Point2D_W accessO(iterYX, arrayO);
    All_R accessC(iterYX, arrayC);

SLIDE 16

Kernel code

C++

    // Kernel code for each (y,x)
    int u, v;
    rgb sum(0.0f, 0.0f, 0.0f);
    for(u = -K; u <= K; ++u)
        for(v = -K; v <= K; ++v)
            sum += C[K+u][K+v] * I[y+u][x+v];
    O[y][x] = sum;

Æcute (kernel method)

    void kernel(const IterationSpace2D::iterator &it) {
        int u, v;
        rgb sum(0.0f, 0.0f, 0.0f);
        for(u = -K; u <= K; ++u)
            for(v = -K; v <= K; ++v)
                sum += accessC(u, v) * accessI(it, u, v);
        accessO(it) = sum;
    }

SLIDE 17

Bringing it all together

    // Data wrappers
    Array2D<rgb> arrayI(&I[0][0], W, H);
    Array2D<rgb> arrayO(&O[0][0], W, H);
    Array2D<rgb> arrayC(&C[0][0], 2*K+1, 2*K+1);

    // Execute metadata
    IterationSpace1D y(K,H-K);
    IterationSpace1D x(K,W-K);
    IterationSpace2D iterYX(y,x);

    // Access metadata
    Neighbourhood2D_R accessI(iterYX, arrayI, K);
    Point2D_W accessO(iterYX, arrayO);
    All_R accessC(iterYX, arrayC);

    // Filter initialisation and execution
    ConvolutionFilter2D conv(iterYX, accessI, accessO, accessC);
    iterYX.tile(h, w);
    conv.run();
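To make the effect of `tile(h, w)` concrete, here is a host-only sketch. The classes below are hypothetical stand-ins that only mimic the call shapes on the slides; they are not the framework's implementation, which the slides do not show. The sketch uses `float` pixels instead of `rgb` and a free function `runConvolution` (also hypothetical) that walks the iteration space tile by tile, where a real back end would stage each tile's working set into local memory first.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in: a non-owning view of a row-major 2D array.
template <typename T>
struct Array2D {
    T* data; int width; int height;
    Array2D(T* d, int w, int h) : data(d), width(w), height(h) {}
    T& at(int y, int x) const { return data[y * width + x]; }
};

// Hypothetical stand-in: half-open range [begin, end).
struct IterationSpace1D {
    int begin, end;
    IterationSpace1D(int b, int e) : begin(b), end(e) {}
};

// Hypothetical stand-in: 2D iteration space with a rectangular tiling.
struct IterationSpace2D {
    IterationSpace1D ys, xs;
    int tileH = 1, tileW = 1;
    IterationSpace2D(IterationSpace1D y, IterationSpace1D x) : ys(y), xs(x) {}
    void tile(int h, int w) { assert(h > 0 && w > 0); tileH = h; tileW = w; }
};

// Visit tiles in order; within each tile run the convolution kernel per (y, x).
void runConvolution(const IterationSpace2D& it, const Array2D<float>& in,
                    const Array2D<float>& out, const Array2D<float>& coef, int K) {
    for (int ty = it.ys.begin; ty < it.ys.end; ty += it.tileH)
        for (int tx = it.xs.begin; tx < it.xs.end; tx += it.tileW)
            for (int y = ty; y < std::min(ty + it.tileH, it.ys.end); ++y)
                for (int x = tx; x < std::min(tx + it.tileW, it.xs.end); ++x) {
                    float sum = 0.0f;
                    for (int u = -K; u <= K; ++u)
                        for (int v = -K; v <= K; ++v)
                            sum += coef.at(K + u, K + v) * in.at(y + u, x + v);
                    out.at(y, x) = sum;
                }
}
```

Because R = ∅ for the convolution, the result does not depend on the tile shape; only locality and transfer sizes change.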

SLIDE 18

Æcute metadata benefits

  • data movement synthesis and optimisation (e.g. software pipelining and exploiting data reuse)
  • machine-independent abstraction, machine-dependent tuning (via partitioning)
  • potential for inter-kernel optimisations (e.g. loop fusion and array contraction)

SLIDE 19

Memory access of an iteration subspace

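Let $I_k \subset I$ be an iteration subspace. Then

$$M_r(I_k) = \bigcup_{i \in I_k} M_r(i), \qquad M_w(I_k) = \bigcup_{i \in I_k} M_w(i)$$

[Figure: one 2 × 2 block of iterations reads a 4 × 4 region of I and all of C (3 × 3), and writes a 2 × 2 region of O.]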

SLIDE 20

Memory access of two iteration subspaces

Let $I_k$ and $I_n$ be two iteration subspaces
  • $M_r(I_k) \cap M_r(I_n)$ determines reuse
  • $M_w(I_k) \cap M_r(I_n)$ determines true dependence
  • $M_r(I_k) \cap M_w(I_n)$ determines anti-dependence
  • $M_w(I_k) \cap M_w(I_n)$ determines output dependence

[Figure: two blocks read two (overlapping) regions of I and all of C, and write two (disjoint) regions of O.]
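These intersections can be computed directly for small cases. The sketch below is illustrative only: `Loc`, `Block`, `readSet`, `writeSet` and `intersect` are hypothetical names, and memory locations are encoded as (array, row, column) tuples. It builds the read and write sets of a convolution block by enumerating iterations and then intersects them.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <tuple>

// Hypothetical encoding of a memory location: (array name, row, column).
using Loc = std::tuple<char, int, int>;
using LocSet = std::set<Loc>;

// Half-open block of iterations [y0, y1) x [x0, x1).
struct Block { int y0, y1, x0, x1; };

// M_r(I_k) for the 2D convolution: the union, over the block, of each
// iteration's (2K+1) x (2K+1) neighbourhood in I, plus all of C.
LocSet readSet(const Block& b, int K) {
    LocSet s;
    for (int y = b.y0; y < b.y1; ++y)
        for (int x = b.x0; x < b.x1; ++x)
            for (int u = -K; u <= K; ++u)
                for (int v = -K; v <= K; ++v) {
                    s.emplace('I', y + u, x + v);
                    s.emplace('C', u, v);
                }
    return s;
}

// M_w(I_k): one element of O per iteration.
LocSet writeSet(const Block& b) {
    LocSet s;
    for (int y = b.y0; y < b.y1; ++y)
        for (int x = b.x0; x < b.x1; ++x)
            s.emplace('O', y, x);
    return s;
}

// Plain set intersection of two location sets.
LocSet intersect(const LocSet& a, const LocSet& b) {
    LocSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}
```

For two horizontally adjacent 2 × 2 blocks with K = 1, the read sets share a 4 × 2 strip of I plus all nine coefficients (reuse), while the write sets are disjoint (no output dependence).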

SLIDE 21

Inter-kernel optimisation (fusion and contraction)

[Figure: kernel A maps a region of I to a region of T; kernel B maps a region of T to a region of O.]

SLIDE 22

Inter-kernel optimisation (fusion and contraction)

[Figure: the fused kernel A/B maps a region of I directly to a region of O.]

  • Improved locality, reduced communication
  • Tricky but within the reach of polyhedral code generation
  • Using metadata bypasses fragile inter-kernel analysis
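As a minimal illustration of fusion and contraction, assume two point-wise kernels, so that iteration i of kernel B reads exactly the element of T that iteration i of kernel A writes. The kernels themselves (doubling, then adding one) and the function names are made up for this sketch. Under that assumption the two loops can be fused and the temporary array T contracts to a scalar.

```cpp
#include <cstddef>
#include <vector>

// Unfused: kernel A writes a full temporary T, kernel B reads it back.
std::vector<float> unfused(const std::vector<float>& in) {
    std::vector<float> tmp(in.size()), out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = 2.0f * in[i];   // kernel A
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = tmp[i] + 1.0f;  // kernel B
    return out;
}

// Fused: both kernels run in one loop and T is contracted to a scalar.
std::vector<float> fused(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float t = 2.0f * in[i];  // contracted temporary
        out[i] = t + 1.0f;
    }
    return out;
}
```

With neighbourhood accesses, as in the convolution, the fused kernel would need a halo of T per tile rather than a single scalar, which is where the access metadata become useful.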

SLIDE 23

Prototype framework for the Cell BE architecture

Sony/Toshiba/IBM Cell Broadband Engine
  • 1 PowerPC Processing Element (PPE)
  • 8 Synergistic Processing Elements (SPEs)
  • Main memory, 256 KiB local memory per SPE
  • DMA to copy data between main and local memories

Benchmark variants (kernel code is basically the same)
  • Æcute
  • Hand-written C
  • Software cache (IBM Cell SDK)

Experimental setup
  • 3.2 GHz Cell (Sony PlayStation 3, 6 SPEs available)
  • IBM Cell SDK 2.1, Fedora Linux 7

SLIDE 24

Matrix-vector multiply: y = A · x

[Figure: execution time normalised to hand-written code against matrix size (2048x2048, 2048x4096, 4096x4096, 4096x8192) for hand-written code, software cache, and Æcute with 4x1024, 4x512 and 4x256 tiles.]

  • Bandwidth limited
  • Tiling increases locality: x is reused
  • Longer strips (e.g. 1024) are more efficient
  • Æcute is 2.3–3.6× slower than hand-written code, software cache is 5–10× slower

SLIDE 25

Closest-to-mean filter on square D × D images

[Figure: execution time normalised to hand-written code for (D=256, N=15), (D=256, N=63), (D=1024, N=15), (D=1024, N=63), where D is the image dimension and N the pixel neighbourhood diameter; series: hand-written code, software cache, and Æcute with 20x20, 5x40 and 40x5 tiles.]

  • No tile size was universally best
  • 40 × 5 is best for D=256
  • 20 × 20 is best for D=1024
  • 5 × 40 is near best for both D=256 and 1024
  • Æcute is 12–40% slower than hand-written code, software cache is 2.2–9.6× slower

SLIDE 26

Bit-reversed copy: $y[\sigma_n(i)] = x[i]$, for $0 \le i < N = 2^n$

($\sigma_n(i)$ reverses the bits of an n-bit index, e.g. $\sigma_4(0111_2) = 1110_2$)

[Figure: bandwidth (GiB/s) against data size (index bits, 14–24) for software cache, hand-written code and Æcute.]

  • Tiled algorithm (Carter & Gatlin)
  • Vector kernel (Lokhmotov & Mycroft)
  • Non-affine access specification
  • For large problem sizes (n = 22–24), Æcute is 1.6× slower, software cache is 20× slower
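For reference, the index mapping itself is short. The sketch below is the naive element-by-element loop, not the tiled algorithm or the vector kernel cited above; the function names `bitReverse` and `bitReversedCopy` are chosen here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sigma_n(i): reverse the low n bits of i.
std::size_t bitReverse(std::size_t i, unsigned n) {
    std::size_t r = 0;
    for (unsigned b = 0; b < n; ++b) {
        r = (r << 1) | (i & 1u);  // append the lowest remaining bit of i
        i >>= 1;
    }
    return r;
}

// Naive bit-reversed copy: y[sigma_n(i)] = x[i] for 0 <= i < 2^n.
std::vector<int> bitReversedCopy(const std::vector<int>& x, unsigned n) {
    assert(x.size() == (std::size_t{1} << n));
    std::vector<int> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[bitReverse(i, n)] = x[i];
    return y;
}
```

The writes scatter across the whole of y, which is why the access pattern is non-affine and why a cache-friendly version needs tiling.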

SLIDE 27

Work in progress: Æcute metaprogramming for GPUs

  • GPUs are the most widely available and popular accelerators
  • Complex software-managed memory hierarchy (access)
      • Currently, two levels
      • Speculatively, more levels (NVIDIA)
  • Complex iteration space mappings (execute)

SLIDE 28

Motivating example: vertical mean image filter

$$O_{x,y} = \frac{1}{D} \sum_{k=0}^{D-1} I_{x,y+k}$$

  • I is a W × H grey-scale input image;
  • O is a W × (H − D) grey-scale output image;
  • D is the diameter of the filter (D ≪ H);
  • 0 ≤ x < W, 0 ≤ y < H − D.

SLIDE 29

Scalable algorithm for vertical mean image filter

$$O_{x,y} = \begin{cases} \dfrac{1}{D} \displaystyle\sum_{k=0}^{D-1} I_{x,y+k}, & \text{for } y = y_0; \\[2ex] O_{x,y-1} + \dfrac{1}{D} \left( I_{x,y+D-1} - I_{x,y-1} \right), & \text{for } 1 \le y - y_0 < T. \end{cases}$$

    for(int x = 0; x < W; ++x) {                 // for each column
        for(int y0 = 0; y0 < H-D; y0 += T) {     // for each strip of rows
            // first phase: convolution
            float sum = 0.0f;
            for(int k = 0; k < D; ++k)
                sum += I[(y0+k)*W + x];
            O[y0*W + x] = sum / (float)D;

            // second phase: rolling sum
            for(int dy = 1; dy < min(T,H-D-y0); ++dy) {
                int y = y0 + dy;
                sum -= I[(y-1)*W + x];
                sum += I[(y-1+D)*W + x];
                O[y*W + x] = sum / (float)D;
            }
        }
    }

SLIDE 30

Memory access of iterations (x, y : y + 3) [D = 2]

[Figure: for D = 2, iterations (x, y0 : y0+3) read a 1 × 5 strip of I and write a 1 × 4 strip of O.]

SLIDE 31

Complexity and parallelism

For the strip algorithm, where the first output of each strip of T rows is computed by a full sum and the rest by a rolling sum:

  • Writes to O: $N = W \times (H - D)$
  • Reads from I and arithmetic operations: $\Theta(N + ND/T)$
      • $\Theta(ND)$ when $T \ll D$;
      • $\Theta(N)$ when $T \gg D$.
  • Thread parallelism: $\lceil N/T \rceil$
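The read count can be reproduced by counting, per strip, D reads for the first output and two reads for each further output. The helper below (`countReads` is a name chosen for this sketch) does exactly that; for full strips it gives N · (2 + (D − 2)/T) reads, consistent with Θ(N + ND/T).

```cpp
#include <algorithm>
#include <cassert>

// Count reads from I performed by the strip algorithm:
// each strip does D reads in the first phase and 2 reads per remaining row.
long long countReads(int W, int H, int D, int T) {
    assert(W > 0 && D > 0 && T > 0 && H > D);
    long long readsPerColumn = 0;
    for (int y0 = 0; y0 < H - D; y0 += T) {
        const int rows = std::min(T, H - D - y0);  // the last strip may be shorter
        readsPerColumn += D + 2LL * (rows - 1);
    }
    return readsPerColumn * W;
}
```

With H − D = 8 output rows and D = 4, a single column needs 32 reads when T = 1 (that is, N · D) but only 18 when T = 8, at the cost of one thread per column instead of eight.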

SLIDE 32

Two-dimensional grid mapping

[Figure: the W × (H − D) output image covered by a two-dimensional grid of blocks numbered 1–12, each of size WPBX × WPBY.]

SLIDE 33

One-dimensional grid mapping

[Figure: the W × (H − D) output image covered by a one-dimensional grid of numbered WPBX × WPBY blocks that wraps across the image.]

Round to SIMD size

SLIDE 34

Results on NVIDIA GTX 280, 5120 × 3200 image

[Figure: throughput (Mpixel/s) against output pixels per thread (T) for the 2D grid and the 1D grid; W=5120, H=3200, D=40, TPBX=128, TPBY=1.]

SLIDE 35

Results on NVIDIA GTX 280, 5121 × 3200 image

2D grid. Data padded to multiples of 16, 32, 64, and 128 pixels.

[Figure: throughput (Mpixel/s) against output pixels per thread (T) for the 2D grid with the width padded to 5136, 5152, 5184 and 5248; W=5121, H=3200, D=40, TPBX=128, TPBY=1.]

SLIDE 36

Results on NVIDIA GTX 280, 5121 × 3200 image

Data padded to a multiple of 64 pixels. 1D grid wrapped on multiples of 16, 32 and 64 pixels.

[Figure: throughput (Mpixel/s) against output pixels per thread (T) for the 1D grid wrapped on 5121, 5136, 5152 and 5184; W=5121 (padded to 5184), H=3200, D=40, TPBX=128, TPBY=1.]

SLIDE 37

Results on NVIDIA GTX 280, 5121 × 3200 image

Data padded to a multiple of 64 pixels. 2D grid; 1D grid wrapped on a multiple of 32 pixels.

[Figure: throughput (Mpixel/s) against output pixels per thread (T) for the 2D grid and the 1D grid wrapped on 5152; W=5121 (padded to 5184), H=3200, D=40, TPBX=128, TPBY=1.]

SLIDE 38

Towards Æcute metaprogramming

  • Separating algorithm representation from mapping and tuning
  • Specifying a hierarchy of iteration space partitions, e.g.
      • at the lowest level, by individual threads:

            iterXY.partitionThreads(1,T); // 1xT outputs/thread

      • at the middle level, by blocks of possibly cooperating threads:

            iterXY.partitionBlocks(128,T); // 128xT outputs/block

      • at the highest level, by possibly cooperating compute devices:

            iterXY.partitionDevices(W/2,H-D); // (W/2)x(H-D) outputs/device

  • Synthesizing data movement for software-managed memories
      • alignment, padding

SLIDE 39

Future work

  • Integrating Æcute metadata into existing languages
      • Sieve C++ (EPSRC CASE Award Imperial/Codeplay)
      • OpenCL (AMD)
      • OpenMP?
  • Handling both regular and irregular computations
  • Extracting Æcute metadata statically and dynamically
  • Generating code for iterative and adaptive optimisation
  • Scaling to large applications

SLIDE 40

Parallel programming systems for scientists

Short-term goals
  • investigate code generation and optimisation
  • build convenient and efficient tools
  • drive scientists away from low-level programming

Long-term goals
  • accumulate libraries of components
  • investigate composability and reusability
  • drive scientists towards high-level programming

SLIDE 42

Parallel programming as a basic tool of sciences

“. . . today the calculus can be taught without a trace of mystery and with complete rigor. There is no longer any reason why this basic instrument of the sciences should not be understood by every educated person.”
    Richard Courant, Herbert Robbins. “What is mathematics?” (1946)

“. . . today parallel programming can be taught without a trace of mystery and with complete rigor. There is no longer any reason why this basic instrument of the sciences should not be understood by every educated person.”
    ?

SLIDE 43

Acknowledgements

The work on the Æcute model is in collaboration with
  • Lee Howes (Imperial)
  • Paul Kelly (Imperial)
  • Alastair Donaldson (Oxford/Codeplay)

It is inspired by my PhD research in collaboration with
  • Alan Mycroft (Cambridge)
  • Ben Gaster (ClearSpeed/AMD)
  • Andrew Richards (Codeplay)

Collaboration with the Russian Academy of Sciences involves
  • Arutyun Avetisyan
  • Andrey Belevantsev
  • Alexander Monakov

SLIDE 44

Questions?

SLIDE 45

Æcute Cell framework Æcute specifications Extra material

Prototype framework for the Cell BE architecture

PPE and SPE runtimes

Benchmarks

SLIDE 46

Prototype framework for the Cell BE architecture

PPE takes a block of the iteration space (blocking is configurable)

SLIDE 47

Prototype framework for the Cell BE architecture

PPE assigns the block of iterations to a ready SPE by sending a message

SLIDE 48

Prototype framework for the Cell BE architecture

SPE copies the block’s working set into buffers associated with access descriptors

SLIDE 49

Prototype framework for the Cell BE architecture

Whilst SPE processes the block, it receives another block for execution

SLIDE 50

Prototype framework for the Cell BE architecture

SPE copies the new block’s working set into another set of buffers

SLIDE 51

Prototype framework for the Cell BE architecture

SPE copies out the first block’s results and clears its buffers
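The buffer rotation described on these slides can be sketched on the host as follows. This is an illustration only: `processDoubleBuffered` is a hypothetical name, `std::memcpy` stands in for DMA (so the copies are synchronous and nothing actually overlaps), and the kernel simply doubles each element.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Process `in` block by block using two local buffers that alternate roles.
void processDoubleBuffered(const std::vector<float>& in, std::vector<float>& out,
                           std::size_t block) {
    out.resize(in.size());
    if (in.empty() || block == 0) return;
    std::vector<float> buf[2] = {std::vector<float>(block), std::vector<float>(block)};
    const std::size_t nBlocks = (in.size() + block - 1) / block;
    // Length of block b; the last block may be shorter.
    auto len = [&](std::size_t b) { return std::min(block, in.size() - b * block); };

    std::memcpy(buf[0].data(), in.data(), len(0) * sizeof(float));  // fetch first block
    for (std::size_t b = 0; b < nBlocks; ++b) {
        std::vector<float>& cur = buf[b % 2];
        if (b + 1 < nBlocks)  // fetch the next block into the other buffer
            std::memcpy(buf[(b + 1) % 2].data(), in.data() + (b + 1) * block,
                        len(b + 1) * sizeof(float));
        for (std::size_t i = 0; i < len(b); ++i) cur[i] *= 2.0f;  // kernel on local data
        std::memcpy(out.data() + b * block, cur.data(), len(b) * sizeof(float));  // copy out
    }
}
```

On real hardware the fetch of block b + 1 would be an asynchronous transfer issued before the kernel runs on block b, which is what hides the memory latency.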

SLIDE 52

Matrix-vector multiply

Execute metadata (I, R, P):

$$I = \{ (i, j) : 0 \le i < H,\ 0 \le j < W \}$$

$$R = \{ ((i, j), (i, k)) : 0 \le i < H,\ 0 \le j < k < W \}$$

$$P = \big\{ \{ (i, j) \in I : h(k-1) \le i < hk,\ w(l-1) \le j < wl \} : 1 \le k < H/h,\ 1 \le l < W/w \big\}$$

Access metadata (Mr, Mw):

$$M_r(i, j) = \{ A[i][j],\ x[j] \}, \qquad M_w(i, j) = \{ y[i] \}$$

SLIDE 53

Closest-to-mean filter

Execute metadata (I, R, P):

$$I = \{ (x, y) : K \le x < W - K,\ K \le y < H - K \}$$

$$R = \emptyset$$

$$P = \big\{ \{ (x, y) \in I : w(i-1) \le x - K < wi,\ h(j-1) \le y - K < hj \} : 1 \le i < (W - 2K)/w,\ 1 \le j < (H - 2K)/h \big\}$$

Access metadata (Mr, Mw):

$$M_r = \{ I[x+u][y+v] : (x, y) \in I,\ -K \le u, v \le K \}$$

$$M_w = \{ O[x][y] : (x, y) \in I \}$$
SLIDE 54

Bit-reversed copy

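Execute metadata (I, R, P):

$$I = \{ t : 0 \le t < N/B^2 \}, \qquad R = \emptyset, \qquad P = \{ \{t\} : t \in I \}$$

Access metadata (Mr, Mw), where u.t.v denotes the concatenation of the bit strings u, t and v:

$$M_r(t) = \{ x[u.t.v] : t \in I,\ 0 \le u, v < B \}$$

$$M_w(t) = \{ y[u.\sigma_n(t).v] : t \in I,\ 0 \le u, v < B \}$$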

SLIDE 55

Parallel scopes

Joint work with A. Mycroft (Cambridge), A. Donaldson, A. Richards (Codeplay)

  • Aliasing complicates dependence analysis
  • The compiler must produce reliable, if inefficient, code
  • The programmer needs to tell more to the compiler

SLIDE 56

Delayed side-effects in Codeplay’s Sieve C++

In Codeplay’s C++ extension, the programmer can place a code fragment in a parallel scope, thereby instructing the compiler to delay writes to memory locations defined outside of the scope, and apply them in order on exit from the scope.

Sieve C++

    float *pa, *pb;
    ...
    sieve { // sieve block
        for(int i = 1; i <= n; ++i) {
            pb[i] = pa[i] + 42;
        }
    } // writes to pb[1:n] happen here

Fortran 90

    pb(1:n) = pa(1:n) + 42
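The difference between delayed and immediate writes shows up as soon as pa and pb alias. The following plain C++ emulation is a sketch, not how the Sieve compiler implements parallel scopes; `addDelayed` and `addImmediate` are names made up here. It queues the writes and commits them in order at the end of the "scope".

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Emulated sieve block: every write is queued, then applied in program order.
void addDelayed(float* pb, const float* pa, std::size_t n) {
    std::vector<std::pair<float*, float>> queue;  // (address, value)
    for (std::size_t i = 1; i <= n; ++i)
        queue.emplace_back(&pb[i], pa[i] + 42.0f);    // delayed write: reads see old values
    for (const auto& w : queue) *w.first = w.second;  // commit on "scope exit"
}

// The same loop with immediate writes, for comparison.
void addImmediate(float* pb, const float* pa, std::size_t n) {
    for (std::size_t i = 1; i <= n; ++i)
        pb[i] = pa[i] + 42.0f;
}
```

With a = {0, 1, 2, 3, 4}, pa = a and pb = a + 1, the delayed version yields {0, 1, 43, 44, 45}, matching the array-assignment semantics, whereas the immediate version propagates each result into the next read and yields {0, 1, 43, 85, 127}.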

SLIDE 57

Example: N-body simulation

Force acting on particle i according to the law of gravity

$$\vec{F}_i = - \sum_{j \neq i} G \, \frac{M_i M_j}{|\vec{r}_{ij}|^3} \, \vec{r}_{ij}$$

SLIDE 58

Example: N-body simulation

C++ code

    int Size;
    float3 rNormalised(float3 * Pos, int i, int j);
    ...
    void computeForces(float3 * Force, float3 * Pos, float * Mass) {
        for(int i = 0; i < Size; ++i) {
            float3 Potential = { 0.0f, 0.0f, 0.0f };
            for(int j = 0; j < Size; ++j) {
                float3 r = rNormalised(Pos, i, j);
                Potential -= r * Mass[j];
            }
            Force[i] = Potential * Mass[i];
        }
    }

SLIDE 59

Example: N-body simulation

Sieve C++ code

    int Size;
    sieve float3 rNormalised(outer float3 * Pos, int i, int j);
    ...
    void computeForces(float3 * Force, float3 * Pos, float * Mass) {
        sieve {
            for(int i = 0; i < Size; ++i) {
                float3 Potential = { 0.0f, 0.0f, 0.0f };
                for(int j = 0; j < Size; ++j) {
                    float3 r = rNormalised(Pos, i, j);
                    Potential -= r * Mass[j];
                }
                // Delayed write
                Force[i] = Potential * Mass[i];
            }
        } // Side-effects to Force[] committed here
    }

SLIDE 60

Sieve/OpenMP C++ (Codeplay/Microsoft) vs. C++

Dual quad-core AMD Opteron (2GHz Barcelona), 4GiB RAM, Windows Server 2003

SLIDE 61

Sieve/OpenMP C++ (Codeplay/IBM XL Alpha) vs. C++

Sony PlayStation 3 (3.2GHz Cell BE, 6 SPEs), IBM SDK 3, Fedora Core 7

SLIDE 62

Lessons learnt

The concept of delayed side-effects is useful
  • Eliminates the need for fragile whole-program analysis
  • Enables safe speculative parallelisation
  • Provides deterministic behaviour

Efficient implementation is hard
  • Data movement is the culprit for heterogeneous architectures
  • Mainstream languages mix memory access and compute operations
  • Efficient programs decouple memory access and computation (and then overlap them asynchronously)

Conclusion: we need a programming model that promotes this decoupling (hence Æcute)!

SLIDE 63

Sieve/OpenMP C++ benchmarks

  • GRAVITY: N-body molecular dynamics simulation of 8192 particles
  • NOISE RGB: Noise reduction filter applied to 512 × 512 colour image
  • NOISE GREY: Noise reduction filter applied to 512 × 512 greyscale image
  • CRC: Cyclic redundancy check on random 8M ($1\mathrm{M} = 2^{20}$) word message
  • MAND: Calculates 1024 × 1024 fragment of the Mandelbrot set
  • FFT3D: Fast Fourier transform of complex $128^3$ data set

SLIDE 64

Bit-reversal on NVIDIA GTX 280

[Figure: bandwidth (GiB/s) against data size (index bits, 12–26) for the variants -gp-sm, +gp-sm, -gp+sm-sp, -gp+sm+sp, +gp+sm-sp and +gp+sm+sp.]
