The Multicore Challenge


  1. SaC – Functional Programming for HP³. Sven-Bodo Scholz, Chalmers Tekniska Högskola, 3.5.2018

  2. The Multicore Challenge: performance? sustainability? affordability?
     [Figure: a grid of SVP cores.]
     HP³ stands for High Performance, High Portability, High Productivity.

  3. Typical Scenario
     [Figure: one algorithm, a separate hand-written implementation per target platform: VHDL, μTC, OpenCL, MPI/OpenMP, each mapped onto a grid of SVP cores.]

  4. Tomorrow’s Scenario
     [Figure: the same algorithm now has to target an even more heterogeneous mix of platforms: VHDL, μTC, OpenCL, MPI/OpenMP.]

  5. The HP³ Vision
     [Figure: a single algorithm source, compiled automatically to all targets: VHDL, μTC, OpenCL, MPI/OpenMP.]

  6. SaC: HP³-Driven Language Design
     HIGH PRODUCTIVITY:
     - easy to learn: C-like look and feel
     - easy to program: Matlab-like style, OO-like power, FP-like abstractions
     - easy to integrate: light-weight C interface
     HIGH PERFORMANCE:
     - no frills: lean language core
     - performance focus: strictly controlled side-effects, implicit memory management
     - concurrency apt: data-parallelism at core
     HIGH PORTABILITY:
     - no low-level facilities: no notion of memory, no explicit concurrency/parallelism, no notion of communication

  7. What is Data-Parallelism? Formulate algorithms in space rather than time!
     In time (sequential loop):
         prod = 1;
         for( i=1; i<=10; i++) {
           prod = prod*i;
         }
     In space (data-parallel):
         prod = prod( iota( 10)+1);
     Either way, the intermediate values 1, 2, 6, ..., 3628800 combine into the same result: 3628800.
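     A minimal complete SaC program for the data-parallel version; a sketch assuming the standard StdIO and Array modules provide printf, iota, and prod (the variable name p is mine, chosen to avoid shadowing the library function prod):

         use StdIO: all;
         use Array: all;

         int main()
         {
           /* iota( 10) yields [0,...,9]; adding 1 gives [1,...,10] */
           p = prod( iota( 10) + 1);
           printf( "10! = %d\n", p);
           return( 0);
         }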

  8. Why is Space Better than Time?
     [Figure: from the single specification prod( iota( n)) the compiler can generate sequential code, multi-threaded code, or micro-threaded code; the partial products of each chunk (1, 2, 6, ...; 4, 20, 120, ...; 7, 56, 5040, ...) all combine into the same result, 3628800.]

  9. Another Example: Fibonacci Numbers
         if( n<=1) {
           return n;
         } else {
           return fib( n-1) + fib( n-2);
         }
     [Figure: the call tree of fib(4), branching into fib(3) and fib(2), down to fib(1) and fib(0).]

  10. Another Example: Fibonacci Numbers
          int fib( int n)
          {
            if( n<=1) {
              return n;
            } else {
              return fib( n-1) + fib( n-2);
            }
          }
      [Figure: the same call tree; the calls fib(2), fib(1), and fib(0) appear several times, i.e. the naive recursion recomputes them.]

  11. Fibonacci Numbers – now linearised!
          int fib'( int fst, int snd, int n)
          {
            if( n==0) {
              return fst;
            } else {
              return fib'( snd, fst+snd, n-1);
            }
          }
      [Figure: the accumulator trace for fib(4): (fst, snd) = (0, 1) -> (1, 1) -> (1, 2) -> (2, 3) -> ...]
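      The deck does not show the wrapper that seeds the two accumulators; a minimal sketch, with the hypothetical name fib2 standing in for fib' (the quote is not a legal identifier character in C-like syntax):

          int fib2( int fst, int snd, int n)
          {
            if( n==0) {
              return fst;
            } else {
              /* shift the window: (fst, snd) -> (snd, fst+snd) */
              return fib2( snd, fst+snd, n-1);
            }
          }

          int fib( int n)
          {
            return fib2( 0, 1, n);   /* matches the trace: fst = 0, snd = 1 */
          }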

  12. Fibonacci Numbers – now data-parallel!
          matprod( genarray( [n], [[1, 1], [1, 0]]))
      [Figure: multiplying copies of the step matrix [[1,1],[1,0]] yields [[2,1],[1,1]], then [[3,2],[2,1]], ...; the product of n copies contains fib(n+1), fib(n), and fib(n-1).]
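      The slide shows only the matprod expression; a sketch of how it could be completed, assuming a user-defined 2×2 matrix product folded over n copies of the step matrix (the names matmul and fib and the fold-with-loop formulation are mine, not from the deck):

          int[2,2] matmul( int[2,2] A, int[2,2] B)
          {
            return { [i,j] -> sum( A[[i,.]] * B[[.,j]]) };
          }

          int fib( int n)
          {
            /* fold n copies of [[1,1],[1,0]], with the identity matrix as neutral element */
            m = with {
                  ([0] <= iv < [n]) : [[1, 1], [1, 0]];
                } : fold( matmul, [[1, 0], [0, 1]]);
            return m[[0, 1]];   /* m == [[fib(n+1), fib(n)], [fib(n), fib(n-1)]] */
          }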

  13. Everything is an Array. Think Arrays!
      - Vectors are arrays.
      - Matrices are arrays.
      - Tensors are arrays.
      - ........ are arrays.

  14. Everything is an Array. Think Arrays!
      - Vectors are arrays.
      - Matrices are arrays.
      - Tensors are arrays.
      - ........ are arrays.
      - Even scalars are arrays.
      - Any operation maps arrays to arrays.
      - Even iteration spaces are arrays.

  15. Multi-Dimensional Arrays
      - vector [1, 2, 3]: shape vector [3], data vector [1, 2, 3]
      - rank-3 array of shape 2×2×3 (indices i, j, k): shape vector [2, 2, 3], data vector [1, 2, 3, ..., 11, 12]
      - scalar 42: shape vector [ ], data vector [42]
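      The same three examples in code; a hedged sketch assuming shape, dim, and reshape from the standard array module:

          v = [1, 2, 3];                          /* shape vector [3],     data vector [1,2,3]    */
          m = reshape( [2,2,3], iota( 12) + 1);   /* shape vector [2,2,3], data vector [1,...,12] */
          s = 42;                                 /* shape vector [ ],     data vector [42]       */
          /* shape( m) == [2,2,3];  dim( m) == 3;  dim( s) == 0 */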

  16. Index-Free Combinator-Style Computations
      L2 norm:          sqrt( sum( square( A)))
      Convolution step: W1 * shift( -1, A) + W2 * A + W1 * shift( 1, A)
      Convergence test: all( abs( A-B) < eps)

  17. Shape-Invariant Programming
      l2norm( [1,2,3,4])
      -> sqrt( sum( sqr( [1,2,3,4])))
      -> sqrt( sum( [1,4,9,16]))
      -> sqrt( 30)
      -> 5.4772

  18. Shape-Invariant Programming
      l2norm( [[1,2],[3,4]])
      -> sqrt( sum( sqr( [[1,2],[3,4]])))
      -> sqrt( sum( [[1,4],[9,16]]))
      -> sqrt( [5,25])
      -> [2.2361, 5]

  19. Where do these Operations Come from?
          double l2norm( double[*] A)
          {
            return( sqrt( sum( square( A))));
          }

          double square( double A)
          {
            return( A*A);
          }

  20. Where do these Operations Come from?
          double square( double A)
          {
            return( A*A);
          }

          double[+] square( double[+] A)
          {
            res = with {
                    (. <= iv <= .) : square( A[iv]);
                  } : modarray( A);
            return( res);
          }
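      Together, the two definitions make square applicable at any rank; for example (expected results follow the shape-invariant traces above):

          square( 3.0)                       /* 9.0                       */
          square( [1.0, 2.0, 3.0])           /* [1.0, 4.0, 9.0]           */
          square( [[1.0, 2.0], [3.0, 4.0]])  /* [[1.0, 4.0], [9.0, 16.0]] */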

  21. With-Loops
          with {
            ([0,0] <= iv < [3,4]) : square( iv[0]);
          } : genarray( [3,4], 42);
      indices:                    values:
      [0,0] [0,1] [0,2] [0,3]     0 0 0 0
      [1,0] [1,1] [1,2] [1,3]     1 1 1 1
      [2,0] [2,1] [2,2] [2,3]     4 4 4 4

  22. With-Loops
          with {
            ([0,0] <= iv <= [1,1]) : square( iv[0]);
            ([0,2] <= iv <= [1,3]) : 42;
            ([2,0] <= iv <= [2,2]) : 0;
          } : genarray( [3,4], 21);
      indices:                    values:
      [0,0] [0,1] [0,2] [0,3]     0 0 42 42
      [1,0] [1,1] [1,2] [1,3]     1 1 42 42
      [2,0] [2,1] [2,2] [2,3]     0 0  0 21
      (The default element 21 fills the one index, [2,3], not covered by any partition.)

  23. With-Loops
          with {
            ([0,0] <= iv <= [1,1]) : square( iv[0]);
            ([0,2] <= iv <= [1,3]) : 42;
            ([2,0] <= iv <= [2,3]) : 0;
          } : fold( +, 0);
      map step (values):
      0 0 42 42
      1 1 42 42
      0 0  0  0
      reduce step: sum of all twelve values = 170

  24. Set-Notation and With-Loops
          { iv -> a[iv] + 1 }
      is equivalent to
          with {
            ( 0*shape(a) <= iv < shape(a)) : a[iv] + 1;
          } : genarray( shape( a), zero( a))
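      A small worked instance of the equivalence; the concrete array value is mine, for illustration only:

          a = [1, 2, 3];
          b = { iv -> a[iv] + 1 };          /* == [2, 3, 4] */
          /* the corresponding with-loop: */
          b = with {
                ([0] <= iv < [3]) : a[iv] + 1;
              } : genarray( [3], 0);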

  25. Observation
      - Most operations boil down to with-loops.
      - With-loops are the source of concurrency.

  26. Computation of π
      [Figure: midpoint-rule integration of f(x) = 4/(1+x²) over [0,1]; the area under the curve equals π.]

  27. Computation of π
          double f( double x)
          {
            return 4.0 / (1.0+x*x);
          }

          int main()
          {
            num_steps = 10000;
            step_size = 1.0 / tod( num_steps);
            x = (0.5 + tod( iota( num_steps))) * step_size;
            y = { iv -> f( x[iv])};
            pi = sum( step_size * y);
            printf( " ...and pi is: %f\n", pi);
            return( 0);
          }

  28. Example: Matrix Multiply
      (AB)_{i,j} = Σ_k A_{i,k} · B_{k,j}
          { [i,j] -> sum( A[[i,.]] * B[[.,j]]) }
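      A small worked instance with concrete values (the matrices are mine, for illustration):

          A = [[1.0, 2.0], [3.0, 4.0]];
          B = [[5.0, 6.0], [7.0, 8.0]];
          C = { [i,j] -> sum( A[[i,.]] * B[[.,j]]) };
          /* C == [[19.0, 22.0], [43.0, 50.0]] */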

  29. Example: Relaxation
      [Figure: 3×3 stencil of weights: 0 1/8 0 / 1/8 4/8 1/8 / 0 1/8 0]
          weights = [[0d,1d,0d],
                     [1d,4d,1d],
                     [0d,1d,0d]] / 8d;
          in = ....
          out = { iv -> sum( { ov -> weights[ov] * rotate( 1-ov, in)[iv]} ) };

  30. Programming in a Data-Parallel Style – Consequences
      • much less error-prone indexing!
      • combinator style
      • increased reuse
      • better maintenance
      • easier to optimise
      • huge exposure of concurrency!

  31. What not How (1): re-computation not considered harmful!
          a = potential( firstDerivative(x));
          a = kinetic( firstDerivative(x));

  32. What not How (1): re-computation not considered harmful!
          a = potential( firstDerivative(x));
          a = kinetic( firstDerivative(x));
      The compiler transforms this into:
          tmp = firstDerivative(x);
          a = potential( tmp);
          a = kinetic( tmp);

  33. What not How (2): variable declaration not required!
          int main()
          {
            istep = 0;
            nstop = istep;
            x, y = init_grid();
            u = init_solv( x, y);
            ...

  34. What not How (2): variable declaration not required, ... but sometimes useful!
          int main()
          {
            double[ 256] x, y;    /* acts like an assertion here! */
            istep = 0;
            nstop = istep;
            x, y = init_grid();
            u = init_solv( x, y);
            ...

  35. What not How (3): data structures do not imply memory layout
          a = [1,2,3,4];
          b = genarray( [1024], 0.0);
          c = stencilOperation( a);
          d = stencilOperation( b);

  36. What not How (3): data structures do not imply memory layout
          a = [1,2,3,4];
          b = genarray( [1024], 0.0);
          c = stencilOperation( a);
          d = stencilOperation( b);
      The vector a could be implemented by four scalars:
          int a0 = 1;
          int a1 = 2;
          int a2 = 3;
          int a3 = 4;

  37. What not How (3): data structures do not imply memory layout
          a = [1,2,3,4];
          b = genarray( [1024], 0.0);
          c = stencilOperation( a);
          d = stencilOperation( b);
      or by a C array:
          int a[4] = {1,2,3,4};

  38. What not How (3): data structures do not imply memory layout
          a = [1,2,3,4];
          b = genarray( [1024], 0.0);
          c = stencilOperation( a);
          d = stencilOperation( b);
      or by a heap-allocated array with a descriptor:
          adesc_t a = malloc(...);
          a->data = malloc(...);
          a->data[0] = 1;
          a->data[1] = 2;
          a->data[2] = 3;
          a->data[3] = 4;

  39. What not How (4): data modification does not imply in-place operation!
          a = [1,2,3,4];               /* 1 2 3 4 */
          b = modarray( a, [0], 5);    /* 5 2 3 4 */
          c = modarray( a, [1], 6);    /* 1 6 3 4 */
      [Figure: b is a copy of a with element 0 replaced; c can either copy a or update it in place.]

  40. What not How (5): truly implicit memory management
          qpt   = transpose( qp);
          deriv = dfDxBoundary( qpt);
          qp    = transpose( deriv);
      ≡
          qp = transpose( dfDxNoBoundary( transpose( qp), DX));

  41. Challenge: Memory Management. What does the λ-calculus teach us?
      [Figure: a call f( a, b, c) and a body { ... a ..... a .... b ..... b .... c ... }; every use of an argument is a conceptual copy.]

  42. How do we implement this? – the scalar case
      (conceptual copies, as before)
      operation | implementation
      read      | read from stack
      funcall   | push copy on stack

  43. How do we implement this? – the non-scalar case, naive approach
      operation | non-delayed copy
      read      | O(1) + free
      update    | O(1)
      reuse     | O(1)
      funcall   | O(1) / O(n) + malloc

  44. How do we implement this? – the non-scalar case, widely adopted approach (garbage collection)
      operation | delayed copy + delayed GC
      read      | O(1)
      update    | O(n) + malloc
      reuse     | malloc
      funcall   | O(1)

  45. How do we implement this? – the non-scalar case, reference counting approach
      operation | delayed copy + non-delayed GC
      read      | O(1) + DEC_RC_FREE
      update    | O(1) / O(n) + malloc
      reuse     | O(1) / malloc
      funcall   | O(1) + INC_RC
