Decoupled Access/Execute Metaprogramming
Anton Lokhmotov, Lee Howes, Paul H.J. Kelly (Imperial); Alastair F. Donaldson (Oxford/Codeplay) University of Birmingham, 3 July 2009
Challenge Æcute model Vision
Recent meeting on accelerated computing (35–40 attendees, summarised by their affiliation):
Computing (software optimisation, cognitive robotics, visual information processing, reconfigurable computing)
Electrical Engineering (reconfigurable computing, design automation)
Mechanical Engineering (multiscale flow dynamics, vibration technology)
Earth Science and Engineering (applied modelling & computation)
Physics (plasma, experimental solid state)
Chemistry (computational, biological & biophysical)
Biomedical Engineering
Chemical Engineering
Civil Engineering
Aeronautics
Dense Linear Algebra
Sparse Linear Algebra
N-Body Methods
Spectral Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
Accelerator hardware:
hundreds of functional units
software-managed memory hierarchies, e.g. host memory (main memory), device global memory (on-board), device local memory (on-chip)
Accelerator software:
low-level, hence unproductive
architecture-specific, hence nonportable
How to use accelerator technology but keep maintainability, composability, reusability, portability?
Challenge Æcute model Vision Access/execute metadata Cell BE implementation Current & future work
Decoupled Access/Execute metaprogramming:
kernel code written for uniform memory
execute metadata describe execution constraints
access metadata describe the memory access pattern
metadata are part of the kernel's interface specification
Goals:
robust translation into efficient low-level code
ample opportunities for optimisation
convenience and flexibility
Execute metadata for a kernel is a tuple E = (I, R, P), where:
I ⊂ Z^n is a finite, n-dimensional iteration space, for some n > 0;
R ⊆ I × I is a precedence relation such that (i1, i2) ∈ R iff iteration i1 must be executed before iteration i2;
P is a partition of I into a set of non-empty, disjoint iteration subspaces: P = {I_1, ..., I_m}, where ∪_j I_j = I and I_i ∩ I_j = ∅ for i ≠ j.
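As a concrete illustration, a rectangular tiling of a 2D iteration space satisfies the partition conditions: tiles are non-empty, pairwise disjoint, and cover I. A minimal C++ sketch (function and type names are hypothetical, not part of the Æcute API):

```cpp
#include <algorithm>
#include <set>
#include <utility>
#include <vector>

using Iter = std::pair<int, int>;

// Partition the 2D iteration space [y0,y1) x [x0,x1) into h x w tiles,
// returning non-empty, disjoint subspaces I_j whose union is I.
std::vector<std::set<Iter>> tilePartition(int y0, int y1, int x0, int x1,
                                          int h, int w) {
    std::vector<std::set<Iter>> parts;
    for (int ty = y0; ty < y1; ty += h)
        for (int tx = x0; tx < x1; tx += w) {
            std::set<Iter> p;
            for (int y = ty; y < std::min(ty + h, y1); ++y)
                for (int x = tx; x < std::min(tx + w, x1); ++x)
                    p.insert({y, x});
            parts.push_back(p);
        }
    return parts;
}
```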
Access metadata for a kernel is a tuple A = (Mr, Mw), where: Mr : I → P(M) specifies the set of memory locations Mr(i) that may be read on iteration i ∈ I; Mw : I → P(M) specifies the set of memory locations Mw(i) that may be written on iteration i ∈ I.
O_{y,x} = Σ_{u=−K..K} Σ_{v=−K..K} C_{u,v} · I_{y+u,x+v}

I: input image; O: output image; C: coefficients; W: image width; H: image height; K: neighbourhood radius; K ≤ y < H − K, K ≤ x < W − K.
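For reference, the convolution above can be written directly in C++ as a plain uniform-memory sketch (the row-major array layout and function name are assumptions, not part of the Æcute API):

```cpp
#include <vector>

// Direct evaluation of O[y][x] = sum_{u,v in [-K,K]} C[K+u][K+v] * I[y+u][x+v]
// over the interior K <= y < H-K, K <= x < W-K. I and O are row-major
// H x W arrays; C is a (2K+1) x (2K+1) array.
void convolve2d(const std::vector<float>& I, std::vector<float>& O,
                const std::vector<float>& C, int W, int H, int K) {
    const int CW = 2 * K + 1;  // width of the coefficient array
    for (int y = K; y < H - K; ++y)
        for (int x = K; x < W - K; ++x) {
            float sum = 0.0f;
            for (int u = -K; u <= K; ++u)
                for (int v = -K; v <= K; ++v)
                    sum += C[(K + u) * CW + (K + v)] * I[(y + u) * W + (x + v)];
            O[y * W + x] = sum;
        }
}
```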
[Figure: for iteration (y, x), the kernel reads a (2K+1)×(2K+1) region of I and all of C, and writes a single point of O.]
Execute metadata (I, R, P) for the 2D convolution:
I = {(y, x) : K ≤ y < H − K, K ≤ x < W − K}
R = ∅
P = { {(y, x) ∈ I : h(j−1) ≤ y − K < hj, w(i−1) ≤ x − K < wi} : 1 ≤ j < (H − 2K)/h, 1 ≤ i < (W − 2K)/w }

Access metadata (Mr, Mw):
Mr(y, x) = {C[K+u][K+v], I[y+u][x+v] : −K ≤ u, v ≤ K}
Mw(y, x) = {O[y][x]}
C++
rgb I[H][W]; rgb O[H][W]; rgb C[2*K+1][2*K+1];
Æcute (data wrappers)
Array2D<rgb> arrayI(&I[0][0], W, H);
Array2D<rgb> arrayO(&O[0][0], W, H);
Array2D<rgb> arrayC(&C[0][0], 2*K+1, 2*K+1);
C++
for(y = K; y < H-K; ++y)
  for(x = K; x < W-K; ++x)
    // Kernel code for each (y,x)
Æcute (execute metadata)
IterationSpace1D y(K,H-K);
IterationSpace1D x(K,W-K);
IterationSpace2D iterYX(y,x);
C++
// Kernel code for each (y,x)
int u, v;
rgb sum(0.0f, 0.0f, 0.0f);
for(u = -K; u <= K; ++u)
  for(v = -K; v <= K; ++v)
    sum += C[K+u][K+v] * I[y+u][x+v]; // read from C and I
O[y][x] = sum;                        // write to O
Æcute (access metadata)
// Access descriptors
Neighbourhood2D_R accessI(iterYX, arrayI, K);
Point2D_W accessO(iterYX, arrayO);
All_R accessC(iterYX, arrayC);
C++
// Kernel code for each (y,x)
int u, v;
rgb sum(0.0f, 0.0f, 0.0f);
for(u = -K; u <= K; ++u)
  for(v = -K; v <= K; ++v)
    sum += C[K+u][K+v] * I[y+u][x+v];
O[y][x] = sum;
Æcute (kernel method)
void kernel(const IterationSpace2D::iterator &it) {
  int u, v;
  rgb sum(0.0f, 0.0f, 0.0f);
  for(u = -K; u <= K; ++u)
    for(v = -K; v <= K; ++v)
      sum += accessC(u, v) * accessI(it, u, v);
  accessO(it) = sum;
}
// Data wrappers
Array2D<rgb> arrayI(&I[0][0], W, H);
Array2D<rgb> arrayO(&O[0][0], W, H);
Array2D<rgb> arrayC(&C[0][0], 2*K+1, 2*K+1);

// Execute metadata
IterationSpace1D y(K,H-K);
IterationSpace1D x(K,W-K);
IterationSpace2D iterYX(y,x);

// Access metadata
Neighbourhood2D_R accessI(iterYX, arrayI, K);
Point2D_W accessO(iterYX, arrayO);
All_R accessC(iterYX, arrayC);

// Filter initialisation and execution
ConvolutionFilter2D conv(iterYX, accessI, accessO, accessC);
iterYX.tile(h, w);
conv.run();
data movement synthesis and optimisation (e.g. software pipelining and exploiting data reuse)
machine-independent abstraction, machine-dependent tuning (via partitioning)
potential for inter-kernel optimisations (e.g. loop fusion and array contraction)
Let Ik ⊂ I be an iteration subspace:
Mr(Ik) = ∪_{i∈Ik} Mr(i)
Mw(Ik) = ∪_{i∈Ik} Mw(i)
[Figure: the working set of one block of iterations: a region of I, all of C, and a region of O.]
Let Ik and In be two iteration subspaces:
Mr(Ik) ∩ Mr(In) determines reuse
Mw(Ik) ∩ Mr(In) determines true dependence
Mr(Ik) ∩ Mw(In) determines anti dependence
Mw(Ik) ∩ Mw(In) determines output dependence
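These tests reduce to set intersections over block working sets. A minimal C++ sketch (representation and names are mine; a real implementation would intersect symbolic regions, not enumerated addresses):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using Locs = std::set<int>;  // a working set: memory locations as flat addresses

bool intersects(const Locs& a, const Locs& b) {
    Locs both;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(both, both.begin()));
    return !both.empty();
}

// Classify the relationship between blocks I_k and I_n from their
// working sets Mr/Mw, checking dependences before reuse.
std::string classify(const Locs& rK, const Locs& wK,
                     const Locs& rN, const Locs& wN) {
    if (intersects(wK, rN)) return "true dependence";
    if (intersects(rK, wN)) return "anti dependence";
    if (intersects(wK, wN)) return "output dependence";
    if (intersects(rK, rN)) return "reuse";
    return "independent";
}
```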
[Figure: two blocks of iterations read two (overlapping) regions of I and all of C, and write two (disjoint) regions of O.]
[Figure: Kernel A reads a region of I and writes a region of an intermediate array T; Kernel B reads a region of T and writes a region of O.]
[Figure: the fused kernel A/B reads a region of I and writes a region of O directly, without the intermediate T.]
Improved locality, reduced communication Tricky but within the reach of polyhedral code generation Using metadata bypasses fragile inter-kernel analysis
Sony/Toshiba/IBM Cell Broadband Engine:
1 PowerPC Processing Element (PPE), 8 Synergistic Processing Elements (SPEs)
main memory; 256 KiB local memory per SPE; DMA to copy data between main and local memories
Benchmark variants (kernel code is basically the same):
Æcute; hand-written C; software cache (IBM Cell SDK)
Experimental setup:
3.2 GHz Cell (Sony PlayStation 3, 6 SPEs available); IBM Cell SDK 2.1; Fedora Linux 7
Framework details
[Figure: matrix-vector multiply. Execution time normalised to hand-written code for matrix sizes 2048×2048, 2048×4096, 4096×4096 and 4096×8192; variants: hand-written, software cache, Æcute with 4×1024, 4×512 and 4×256 tiles. (Æcute spec)]
Bandwidth limited Tiling increases locality: x is reused Longer strips (e.g. 1024) are more efficient Æcute is 2.3–3.6× slower, software cache is 5–10× slower
[Figure: closest-to-mean filter. Execution time normalised to hand-written code for image dimension D = 256, 1024 and pixel neighbourhood diameter N = 15, 63; variants: hand-written, software cache, Æcute with 20×20, 5×40 and 40×5 tiles (off-scale values: 5.7, 5.1, 9.6). (Æcute spec)]
No tile size was universally best 40 × 5 is best for D=256 20 × 20 is best for D=1024 5 × 40 is near best for both D=256 and 1024 Æcute is 12–40% slower, software cache is 2.2–9.6× slower
[Figure: bit-reversed copy. Bandwidth (GiB/s) against data size (14–24 index bits) for software cache, hand-written and Æcute variants. (Æcute spec)]
(σ_n(i) reverses the bits of an n-bit index, e.g. σ_4(0111₂) = 1110₂)
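The index permutation σ_n can be sketched in a few lines of C++ (function name is mine):

```cpp
#include <cstdint>

// sigma_n: reverse the low n bits of index i (bits above n assumed zero).
uint32_t bitrev(uint32_t i, int n) {
    uint32_t r = 0;
    for (int b = 0; b < n; ++b) {
        r = (r << 1) | (i & 1u);  // append the next low bit of i to r
        i >>= 1;
    }
    return r;
}
```

Note that σ_n is its own inverse, which is what makes the bit-reversed copy a permutation.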
Tiled algorithm (Carter & Gatlin) Vector kernel (Lokhmotov & Mycroft) Non-affine access specification For large problem sizes (n = 22–24), Æcute is 1.6× slower, software cache is 20× slower
GPUs are the most widely available and popular accelerators.
Complex software-managed memory hierarchy (access): currently two levels; speculatively, more levels (NVIDIA).
Complex iteration space mappings (execute).
O_{x,y} = (1/D) Σ_{k=0..D−1} I_{x,y+k}

I is a W × H grey-scale input image; O is a W × (H − D) grey-scale output image; D is the diameter of the filter (D ≪ H); 0 ≤ x < W, 0 ≤ y < H − D.
O_{x,y} = (1/D) Σ_{k=0..D−1} I_{x,y+k}                     for y = y0;
O_{x,y} = O_{x,y−1} + (1/D) (I_{x,y−1+D} − I_{x,y−1})      for 1 ≤ y − y0 < T.
for(int x = 0; x < W; ++x) {            // for each column
  for(int y0 = 0; y0 < H-D; y0 += T) {  // for each strip of rows
    // first phase: convolution
    float sum = 0.0f;
    for(int k = 0; k < D; ++k)
      sum += I[(y0+k)*W + x];
    O[y0*W + x] = sum / (float)D;

    // second phase: rolling sum
    for(int dy = 1; dy < min(T,H-D-y0); ++dy) {
      int y = y0 + dy;
      sum -= I[(y-1)*W + x];
      sum += I[(y-1+D)*W + x];
      O[y*W + x] = sum / (float)D;
    }
  }
}
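The rolling-sum recurrence can be checked against the direct definition; a self-contained C++ sketch with the same row-major layout (function names are mine):

```cpp
#include <algorithm>
#include <vector>

// Direct column mean: O[x][y] = (1/D) * sum_{k=0..D-1} I[x][y+k].
std::vector<float> meanDirect(const std::vector<float>& I, int W, int H, int D) {
    std::vector<float> O((H - D) * W, 0.0f);
    for (int x = 0; x < W; ++x)
        for (int y = 0; y < H - D; ++y) {
            float sum = 0.0f;
            for (int k = 0; k < D; ++k) sum += I[(y + k) * W + x];
            O[y * W + x] = sum / (float)D;
        }
    return O;
}

// Rolling-sum version: per strip of T rows, one full sum then updates.
std::vector<float> meanRolling(const std::vector<float>& I, int W, int H,
                               int D, int T) {
    std::vector<float> O((H - D) * W, 0.0f);
    for (int x = 0; x < W; ++x)
        for (int y0 = 0; y0 < H - D; y0 += T) {
            float sum = 0.0f;
            for (int k = 0; k < D; ++k) sum += I[(y0 + k) * W + x];
            O[y0 * W + x] = sum / (float)D;
            for (int dy = 1; dy < std::min(T, H - D - y0); ++dy) {
                int y = y0 + dy;
                sum -= I[(y - 1) * W + x];      // drop the row leaving the window
                sum += I[(y - 1 + D) * W + x];  // add the row entering the window
                O[y * W + x] = sum / (float)D;
            }
        }
    return O;
}
```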
[Figure: iterations (x, y0:y0+3) read a strip of I and write a strip of O.]
Writes to O: N = W × (H − D)
Reads from I and arithmetic operations: Θ(N + ND/T)
Θ(ND) when T ≪ D; Θ(N) when T ≫ D.
Thread parallelism: ⌈N/T⌉
[Figure: 2D grid mapping: the W × (H−D) output space is tiled into WPBX × WPBY blocks, numbered 1–12.]
[Figure: 1D grid mapping: blocks of width WPBX wrap around the W × (H−D) output space, rounded to SIMD size.]
[Figure: 2D grid vs 1D grid. Throughput (Mpixel/s) against output pixels per thread (T); W=5120, H=3200, D=40, TPBX=128, TPBY=1.]
[Figure: 2D grid, data padded to multiples of 16, 32, 64 and 128 pixels (row lengths 5136, 5152, 5184, 5248). Throughput (Mpixel/s) against output pixels per thread (T); W=5121, H=3200, D=40, TPBX=128, TPBY=1.]
[Figure: data padded to a multiple of 64 pixels; 1D grid wrapped on 5121 and on multiples of 16, 32 and 64 pixels (5136, 5152, 5184). Throughput (Mpixel/s) against output pixels per thread (T); W=5121 (padded to 5184), H=3200, D=40, TPBX=128, TPBY=1.]
[Figure: data padded to a multiple of 64 pixels; 2D grid vs 1D grid wrapped on a multiple of 32 pixels (5152). Throughput (Mpixel/s) against output pixels per thread (T); W=5121 (padded to 5184), H=3200, D=40, TPBX=128, TPBY=1.]
Separating algorithm representation from mapping and tuning Specifying a hierarchy of iteration space partitions, e.g.
at the lowest level, by individual threads:
iterXY.partitionThreads(1,T); // 1xT outputs/thread
at the middle level, by blocks of possibly cooperating threads:
iterXY.partitionBlocks(128,T); // 128xT outputs/block
at the highest level, by possibly cooperating compute devices:
iterXY.partitionDevices(W/2,H-D); // (W/2)x(H-D) outputs/device
Synthesizing data movement for software-managed memories
alignment, padding
Integrating Æcute metadata into existing languages
Sieve C++ (EPSRC CASE Award Imperial/Codeplay) OpenCL (AMD) OpenMP?
Handling both regular and irregular computations Extracting Æcute metadata statically and dynamically Generating code for iterative and adaptive optimisation Scaling to large applications
Short-term goals:
investigate code generation and optimisation
build convenient and efficient tools
drive scientists away from low-level programming
Long-term goals:
accumulate libraries of components
investigate composability and reusability
drive scientists towards high-level programming
“. . . today the calculus can be taught without a trace of mystery and with complete rigor. There is no longer any reason why this basic instrument of the sciences should not be understood by every educated person.” — Richard Courant, Herbert Robbins. “What is mathematics?” (1946)
“. . . today parallel programming can be taught without a trace of mystery and with complete rigor. There is no longer any reason why this basic instrument of the sciences should not be understood by every educated person.” — ?
The work on the Æcute model is in collaboration with:
Lee Howes (Imperial), Paul Kelly (Imperial), Alastair Donaldson (Oxford/Codeplay)
It is inspired by my PhD research in collaboration with:
Alan Mycroft (Cambridge), Ben Gaster (ClearSpeed/AMD), Andrew Richards (Codeplay)
Collaboration with the Russian Academy of Sciences involves:
Arutyun Avetisyan, Andrey Belevantsev, Alexander Monakov
Æcute Cell framework Æcute specifications Extra material
PPE and SPE runtimes
Benchmarks
PPE takes a block of the iteration space (blocking is configurable)
PPE assigns the block of iterations to a ready SPE by sending a message
SPE copies the block’s working set into buffers associated with access descriptors
Whilst SPE processes the block, it receives another block for execution
SPE copies the new block’s working set into another set of buffers
SPE copies out the first block’s results and clears its buffers
Æcute Cell framework Æcute specifications Extra material Matrix-vector multiply Closest-to-mean filter Bit-reversed copy
Execute metadata (I, R, P):
I = {(i, j) : 0 ≤ i < H, 0 ≤ j < W}
R = {((i, j), (i, k)) : 0 ≤ i < H, 0 ≤ j < k < W}
P = { {(i, j) ∈ I : h(k−1) ≤ i < hk, w(l−1) ≤ j < wl} : 1 ≤ k < H/h, 1 ≤ l < W/w }

Access metadata (Mr, Mw):
Mr(i, j) = {A[i][j], x[j]}
Mw(i, j) = {y[i]}
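A uniform-memory C++ sketch of a kernel consistent with this metadata (array layout and function name are mine, not the benchmark's code):

```cpp
#include <vector>

// Iteration (i, j) reads Mr(i, j) = {A[i][j], x[j]} and contributes to
// Mw(i, j) = {y[i]}; the serial inner loop respects the precedence
// relation R, which orders iterations within a row (j before k for j < k).
void matVec(const std::vector<float>& A, const std::vector<float>& x,
            std::vector<float>& y, int H, int W) {
    for (int i = 0; i < H; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < W; ++j)
            acc += A[i * W + j] * x[j];  // row-major H x W matrix
        y[i] = acc;
    }
}
```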
Benchmark results
Execute metadata (I, R, P):
I = {(x, y) : K ≤ x < W − K, K ≤ y < H − K}
R = ∅
P = { {(x, y) ∈ I : w(i−1) ≤ x − K < wi, h(j−1) ≤ y − K < hj} : 1 ≤ i < (W − 2K)/w, 1 ≤ j < (H − 2K)/h }

Access metadata (Mr, Mw):
Mr(x, y) = {I[x+u][y+v] : −K ≤ u, v ≤ K}
Mw(x, y) = {O[x][y]}
Execute metadata (I, R, P): I = {t : 0 ≤ t < N/B²}; R = ∅; P =
Mr(t) = {x[u.t.v] : t ∈ I, 0 ≤ u, v < B} Mw(t) = {y[u.σn(t).v] : t ∈ I, 0 ≤ u, v < B}
Benchmark results
Æcute Cell framework Æcute specifications Extra material Parallel scopes Bit-reversal on NVIDIA GTX 280
Joint work with A. Mycroft (Cambridge), A. Donaldson, A. Richards (Codeplay)
Aliasing complicates dependence analysis The compiler must produce reliable, if inefficient, code The programmer needs to tell more to the compiler
Sieve C++: in Codeplay’s C++ extension, the programmer can place a code fragment in a parallel scope, thereby instructing the compiler to delay writes to memory locations defined outside of the scope, and apply them in order on exit from the scope.
float *pa, *pb;
...
sieve { // sieve block
  for(int i = 1; i <= n; ++i) {
    pb[i] = pa[i] + 42;
  }
} // writes to pb[1:n] happen here
Fortran 90
pb(1:n) = pa(1:n) + 42
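The delayed-write semantics of the sieve block can be emulated in plain C++ by buffering writes and committing them at scope exit (a sketch of the semantics only, not of Codeplay's implementation; the function name is mine):

```cpp
#include <utility>
#include <vector>

// Emulate a sieve block: writes to pb made "inside the scope" are queued
// and applied in order only on exit from the scope.
void addConstDelayed(const float* pa, float* pb, int n) {
    std::vector<std::pair<int, float>> queue;  // (index, value) delayed writes
    for (int i = 1; i <= n; ++i)
        queue.push_back({i, pa[i] + 42.0f});   // delayed write to pb[i]
    for (const auto& w : queue)                // commit point: exit from scope
        pb[w.first] = w.second;
}
```

Because reads inside the scope never observe the queued writes, the loop body is free of loop-carried dependences even if pa and pb alias.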
Force acting on particle i according to the law of gravity
C++ code
int Size;
float3 rNormalised(float3 * Pos, int i, int j);
...
void computeForces(float3 * Force, float3 * Pos, float * Mass) {
  for(int i = 0; i < Size; ++i) {
    float3 Potential = { 0.0f, 0.0f, 0.0f };
    for(int j = 0; j < Size; ++j) {
      float3 r = rNormalised(Pos, i, j);
      Potential -= r * Mass[j];
    }
    Force[i] = Potential * Mass[i];
  }
}
Sieve C++ code
int Size;
sieve float3 rNormalised(outer float3 * Pos, int i, int j);
...
void computeForces(float3 * Force, float3 * Pos, float * Mass) {
  sieve {
    for(int i = 0; i < Size; ++i) {
      float3 Potential = { 0.0f, 0.0f, 0.0f };
      for(int j = 0; j < Size; ++j) {
        float3 r = rNormalised(Pos, i, j);
        Potential -= r * Mass[j];
      }
      // Delayed write
      Force[i] = Potential * Mass[i];
    }
  } // Side-effects to Force[] committed here
}
Dual quad-core AMD Opteron (2GHz Barcelona), 4GiB RAM, Windows Server 2003
Benchmark details
Sony PlayStation 3 (3.2GHz Cell BE, 6 SPEs), IBM SDK 3, Fedora Core 7
Benchmark details
The concept of delayed side-effects is useful:
eliminates the need for fragile whole-program analysis
enables safe speculative parallelisation
provides deterministic behaviour
Efficient implementation is hard:
data movement is the culprit for heterogeneous architectures
mainstream languages mix memory access and compute
Efficient programs decouple memory access and computation (and then overlap them asynchronously). Conclusion: we need a programming model that promotes this decoupling (hence Æcute)!
GRAVITY: N-body molecular dynamics simulation of 8192 particles
NOISE RGB: noise reduction filter applied to 512 × 512 colour image
NOISE GREY: noise reduction filter applied to 512 × 512 greyscale image
CRC: cyclic redundancy check on random 8M (1M = 2²⁰) word message
MAND: calculates 1024 × 1024 fragment of the Mandelbrot set
FFT3D: fast Fourier transform of complex 128³ data set
x86 results Cell results
[Figure: bit-reversal on NVIDIA GTX 280. Bandwidth (GiB/s) against data size (12–26 index bits); variants: +gp−sm, +gp+sm−sp, +gp+sm+sp.]