The Multicore Challenge – performance? sustainability? – PowerPoint PPT Presentation
SaC: Functional Programming for HP3
Chalmers Tekniska Högskola, 3.5.2018
Sven-Bodo Scholz
The Multicore Challenge

performance? sustainability? affordability?

[figure: a sea of SVP cores – High Performance, High Portability, High Productivity]
Typical Scenario ☹

[figure: one algorithm, ported by hand to each target: MPI/OpenMP, OpenCL, VHDL, μTC]
Tomorrow’s Scenario

[figure: the algorithm still has to be mapped by hand to MPI/OpenMP, OpenCL, VHDL, μTC]
The HP3 Vision

[figure: one algorithm, compiled automatically to MPI/OpenMP, OpenCL, VHDL, μTC] ☺
SaC: HP3-Driven Language Design

HIGH-PRODUCTIVITY
Ø easy to learn
  - C-like look and feel
Ø easy to program
  - Matlab-like style
  - OO-like power
  - FP-like abstractions
Ø easy to integrate
  - light-weight C interface

HIGH-PERFORMANCE
Ø no frills
  - lean language core
Ø performance focus
  - strictly controlled side-effects
  - implicit memory management
Ø concurrency apt
  - data-parallelism at core

HIGH-PORTABILITY
Ø no low-level facilities
  - no notion of memory
  - no explicit concurrency/parallelism
  - no notion of communication
What is Data-Parallelism?

Formulate algorithms in space rather than time!

time (sequential loop):

  prod = 1;
  for( i=1; i<=10; i++) {
    prod = prod*i;
  }

space (data-parallel):

  prod = prod( iota( 10) + 1);

both yield 1 * 2 * ... * 10 = 3628800
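The same contrast can be sketched in plain Python, used here only as an executable stand-in for the SaC code above (`iota` is modelled with `range`):

```python
import math

# Time: a sequential loop, one multiplication per step.
prod = 1
for i in range(1, 11):
    prod = prod * i

# Space: build the whole index vector first (SaC's iota(10) + 1),
# then reduce it in one data-parallel step.
iota = list(range(10))                     # [0, 1, ..., 9]
prod_dp = math.prod(x + 1 for x in iota)   # product of 1..10

assert prod == prod_dp == 3628800
```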
Why is Space Better than Time?

[figure: from prod( iota( n)) the compiler can generate sequential code, multi-threaded code, or micro-threaded code]
Another Example: Fibonacci Numbers

  int fib( int n)
  {
    if( n <= 1) {
      return n;
    } else {
      return fib( n-1) + fib( n-2);
    }
  }

[figure: the call tree of fib(4) recomputes fib(2) twice and fib(1) three times]
Fibonacci Numbers – now linearised!

  int fib'( int fst, int snd, int n)
  {
    if( n == 0)
      return fst;
    else
      return fib'( snd, fst+snd, n-1);
  }

[figure: fib(4) unfolds into a linear chain of calls, accumulating (fst, snd) = (0,1), (1,1), (1,2), (2,3)]
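A direct Python transliteration of the linearised version (the names `fib_lin` and `fib` are mine):

```python
def fib_lin(fst, snd, n):
    # Mirrors fib'( fst, snd, n): carries the last two Fibonacci
    # numbers forward instead of branching into two recursive calls.
    if n == 0:
        return fst
    return fib_lin(snd, fst + snd, n - 1)

def fib(n):
    return fib_lin(0, 1, n)
```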
Fibonacci Numbers – now data-parallel!

  matprod( genarray( [n], [[1, 1], [1, 0]])) [0,0]

[figure: a balanced tree of 2×2 matrix products; the powers of [[1,1],[1,0]] contain the Fibonacci numbers]
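A minimal Python sketch of the matrix-product formulation (helper names `matmul2` and `fib_mat` are mine; note that entry [0][1] of the n-fold product gives fib(n), while entry [0][0] gives fib(n+1), which is what the slide reads out):

```python
from functools import reduce

def matmul2(a, b):
    # 2x2 matrix product, written out element by element.
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

def fib_mat(n):
    # n copies of [[1,1],[1,0]] (SaC's genarray([n], ...)),
    # reduced pairwise by matrix product -- the reduction tree is
    # where the parallelism comes from.
    m = [[1, 1], [1, 0]]
    prod = reduce(matmul2, [m] * n)
    return prod[0][1]
```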
Everything is an Array
Think Arrays!
Ø Vectors are arrays.
Ø Matrices are arrays.
Ø Tensors are arrays.
Ø ........ are arrays.
Everything is an Array
Think Arrays!
Ø Vectors are arrays.
Ø Matrices are arrays.
Ø Tensors are arrays.
Ø ........ are arrays.
Ø Even scalars are arrays.
Ø Any operation maps arrays to arrays.
Ø Even iteration spaces are arrays.
Multi-Dimensional Arrays

vector [1, 2, 3] (index i):
  shape vector: [ 3]        data vector: [ 1, 2, 3]

2×2×3 array (indices i, j, k):
  shape vector: [ 2, 2, 3]  data vector: [ 1, 2, 3, ..., 11, 12]

scalar 42:
  shape vector: [ ]         data vector: [ 42 ]
Index-Free Combinator-Style Computations
L2 norm:           sqrt( sum( square( A)))
Convolution step:  W1 * shift( -1, A) + W2 * A + W1 * shift( 1, A)
Convergence test:  all( abs( A-B) < eps)
Shape-Invariant Programming
l2norm( [1,2,3,4] )
⇒ sqrt( sum( sqr( [1,2,3,4])))
⇒ sqrt( sum( [1,4,9,16]))
⇒ sqrt( 30)
⇒ 5.4772
Shape-Invariant Programming
l2norm( [[1,2],[3,4]] )
⇒ sqrt( sum( sqr( [[1,2],[3,4]])))
⇒ sqrt( sum( [[1,4],[9,16]]))
⇒ sqrt( [5,25])
⇒ [2.2361, 5]
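For the one-dimensional case, the rewrite chain above can be checked with a small Python stand-in (treating `sum` as a full reduction over the vector):

```python
import math

def l2norm(a):
    # 1-D case of the slides' l2norm: sqrt( sum( sqr( a))).
    return math.sqrt(sum(x * x for x in a))

# l2norm([1,2,3,4]) -> sqrt(1+4+9+16) = sqrt(30) ~ 5.4772
```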
Where do these Operations Come from?
double l2norm( double[*] A) { return( sqrt( sum( square( A))); } double square( double A) { return( A*A); }
18
Where do these Operations Come from?

double square( double A)
{
  return( A*A);
}

double[+] square( double[+] A)
{
  res = with {
          (. <= iv <= .) : square( A[iv]);
        } : modarray( A);
  return( res);
}
With-Loops

with {
  ([0,0] <= iv < [3,4]) : square( iv[0]);
} : genarray( [3,4], 42);

indices:                        values:
[0,0] [0,1] [0,2] [0,3]         0 0 0 0
[1,0] [1,1] [1,2] [1,3]         1 1 1 1
[2,0] [2,1] [2,2] [2,3]         4 4 4 4
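A Python emulation of this genarray with-loop, with a nested comprehension playing the role of the index space (names are mine):

```python
def square(x):
    return x * x

# genarray with-loop: one value per index iv in the iteration space [3,4];
# square(iv[0]) depends only on the row index, so row i holds i*i everywhere.
shape = (3, 4)
result = [[square(i) for j in range(shape[1])] for i in range(shape[0])]

# rows: [0,0,0,0], [1,1,1,1], [4,4,4,4]
```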
With-Loops

with {
  ([0,0] <= iv <= [1,1]) : square( iv[0]);
  ([0,2] <= iv <= [1,3]) : 42;
  ([2,0] <= iv <= [2,2]) : 0;
} : genarray( [3,4], 21);

indices:                        values:
[0,0] [0,1] [0,2] [0,3]         0 0 42 42
[1,0] [1,1] [1,2] [1,3]         1 1 42 42
[2,0] [2,1] [2,2] [2,3]         0 0  0 21
With-Loops

with {
  ([0,0] <= iv <= [1,1]) : square( iv[0]);
  ([0,2] <= iv <= [1,3]) : 42;
  ([2,0] <= iv <= [2,3]) : 0;
} : fold( +, 0);

map:
  0 0 42 42
  1 1 42 42
  0 0  0  0
reduce: 170
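The fold variant, emulated in Python (the partition logic mirrors the with-loop above; function names are mine):

```python
def square(x):
    return x * x

def val(i, j):
    # The three partitions of the with-loop; together they cover
    # the whole iteration space [3,4].
    if 0 <= i <= 1 and 0 <= j <= 1:
        return square(i)      # ([0,0] <= iv <= [1,1]) : square( iv[0]);
    if 0 <= i <= 1 and 2 <= j <= 3:
        return 42             # ([0,2] <= iv <= [1,3]) : 42;
    return 0                  # ([2,0] <= iv <= [2,3]) : 0;

# fold( +, 0): map every index to its value, then reduce with +.
total = sum(val(i, j) for i in range(3) for j in range(4))
```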
Set-Notation and With-Loops

{ iv -> a[iv] + 1 }

≡

with {
  ( 0*shape(a) <= iv < shape(a)) : a[iv] + 1;
} : genarray( shape( a), zero(a))
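In Python, the set-notation corresponds to a comprehension over the index range of the array:

```python
a = [5, 7, 9]

# { iv -> a[iv] + 1 } : one result element per index of a
b = [a[iv] + 1 for iv in range(len(a))]
```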
Observation
Ø most operations boil down to With-Loops
Ø With-Loops are the source of concurrency
Computation of π

[figure: π = ∫₀¹ 4 / (1+x²) dx, approximated by summing midpoint rectangles]
double f( double x)
{
  return 4.0 / (1.0 + x*x);
}

int main()
{
  num_steps = 10000;
  step_size = 1.0 / tod( num_steps);

  x = (0.5 + tod( iota( num_steps))) * step_size;
  y = { iv -> f( x[iv])};

  pi = sum( step_size * y);

  printf( " ...and pi is: %f\n", pi);
  return( 0);
}
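The same computation transliterated to plain Python (`iota` becomes `range`, the set-notation a comprehension):

```python
import math

def f(x):
    return 4.0 / (1.0 + x * x)

num_steps = 10000
step_size = 1.0 / num_steps

# midpoints (0.5 + iota(num_steps)) * step_size, evaluated data-parallel style
x = [(0.5 + i) * step_size for i in range(num_steps)]
y = [f(xi) for xi in x]                 # { iv -> f( x[iv]) }
pi = sum(step_size * yi for yi in y)    # sum( step_size * y)
```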
Example: Matrix Multiply

(AB)[i,j] = Σₖ A[i,k] * B[k,j]

{ [i,j] -> sum( A[[i,.]] * B[[.,j]]) }
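A Python rendering of this definition, with one comprehension per index pair and an inner sum over k (function name `matmul` is mine):

```python
def matmul(A, B):
    # { [i,j] -> sum( A[[i,.]] * B[[.,j]]) } spelled out with comprehensions:
    # result[i][j] is the dot product of row i of A and column j of B.
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]
```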
Example: Relaxation

[figure: 5-point stencil with weights 1/8, 1/8, 4/8, 1/8, 1/8]

weights = [[0d,1d,0d],
           [1d,4d,1d],
           [0d,1d,0d]] / 8d;
in = ….

out = { iv -> sum( { ov -> weights[ov] * rotate( 1-ov, in)[iv]} ) };
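A reduced, one-dimensional Python sketch of the rotate-based stencil (the 3-point weights [1/8, 6/8, 1/8] are mine, chosen only to sum to 1; the slide's kernel is two-dimensional):

```python
def rotate(k, a):
    # cyclic shift by k, roughly SaC's rotate on a 1-D array
    k %= len(a)
    return a[-k:] + a[:-k]

weights = [1/8, 6/8, 1/8]   # hypothetical 1-D weights, summing to 1

def relax_step(a):
    # out[iv] = sum over offsets ov of weights[ov] * rotate(1-ov, a)[iv]
    # i.e. out[i] = (a[i-1] + 6*a[i] + a[i+1]) / 8, cyclically.
    shifted = [rotate(1 - ov, a) for ov in range(len(weights))]
    return [sum(weights[ov] * shifted[ov][i] for ov in range(len(weights)))
            for i in range(len(a))]
```

Because the weights sum to 1, a uniform field is a fixed point of the step.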
Programming in a Data-Parallel Style – Consequences
- much less error-prone indexing!
- combinator style
- increased reuse
- better maintenance
- easier to optimise
- huge exposure of concurrency!
What not How (1)

re-computation not considered harmful!

  a = potential( firstDerivative(x));
  a = kinetic( firstDerivative(x));

      ⇓ compiler

  tmp = firstDerivative(x);
  a = potential( tmp);
  a = kinetic( tmp);
What not How (2)

variable declaration not required!

  int main()
  {
    istep = 0;
    nstop = istep;
    x, y = init_grid();
    u = init_solv( x, y);
    ...

... but sometimes useful:

  int main()
  {
    double[ 256] x, y;    /* acts like an assertion here! */
    istep = 0;
    nstop = istep;
    x, y = init_grid();
    u = init_solv( x, y);
    ...
What not How (3)

data structures do not imply memory layout

  a = [1,2,3,4];
  b = genarray( [1024], 0.0);
  c = stencilOperation( a);
  d = stencilOperation( b);

a could be implemented by:

  int a0 = 1; int a1 = 2; int a2 = 3; int a3 = 4;

or by:

  int a[4] = {1,2,3,4};

or by:

  adesc_t a = malloc(...);
  a->data = malloc(...);
  a->data[0] = 1;
  a->data[1] = 2;
  a->data[2] = 3;
  a->data[3] = 4;
What not How (4)
data modification does not imply in-place operation!
  a = [1,2,3,4];
  b = modarray( a, [0], 5);
  c = modarray( a, [1], 6);

[figure: a = [1,2,3,4]; b = [5,2,3,4]; c = [1,6,3,4] – each modarray is a conceptual copy, or an in-place update when the argument is not needed afterwards]
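The value semantics of modarray can be modelled in Python with an explicit copy (the function name `modarray` is reused here purely for illustration):

```python
def modarray(a, idx, v):
    # value semantics: the result is a fresh array; a is untouched
    r = list(a)
    r[idx] = v
    return r

a = [1, 2, 3, 4]
b = modarray(a, 0, 5)
c = modarray(a, 1, 6)

assert a == [1, 2, 3, 4]   # a is unchanged by both modifications
assert b == [5, 2, 3, 4]
assert c == [1, 6, 3, 4]
```

A compiler is free to turn the copy into an in-place update whenever it can prove the argument is dead afterwards.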
What not How (5)

truly implicit memory management

  qpt   = transpose( qp);
  deriv = dfDxBoundary( qpt);
  qp    = transpose( deriv);

      ≡

  qp = transpose( dfDxNoBoundary( transpose( qp), DX));
Challenge: Memory Management: What does the λ-calculus teach us?

  f( a, b, c)   where   f( a, b, c) { ... a ..... a .... b ..... b .... c ... }

every use of an argument is a conceptual copy

How do we implement this? – the scalar case

  operation   implementation
  read        read from stack
  funcall     push copy on stack
How do we implement this? – the non-scalar case: naive approach

non-delayed copy:
  read      O(1) + free
  update    O(1)
  reuse     O(1)
  funcall   O(1) / O(n) + malloc
How do we implement this? – the non-scalar case: widely adopted approach

delayed copy + delayed GC:
  read      O(1)
  update    O(n) + malloc
  reuse     malloc
  funcall   O(1)
How do we implement this? – the non-scalar case: reference counting approach

delayed copy + non-delayed GC:
  read      O(1) + DEC_RC_FREE
  update    O(1) / O(n) + malloc
  reuse     O(1) / malloc
  funcall   O(1) + INC_RC
How do we implement this? – the non-scalar case: a comparison of approaches

  operation | non-delayed copy     | delayed copy + delayed GC | delayed copy + non-delayed GC
  read      | O(1) + free          | O(1)                      | O(1) + DEC_RC_FREE
  update    | O(1)                 | O(n) + malloc             | O(1) / O(n) + malloc
  reuse     | O(1)                 | malloc                    | O(1) / malloc
  funcall   | O(1) / O(n) + malloc | O(1)                      | O(1) + INC_RC
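A toy Python model of the reference-counting column (the class `RCArray` and its method names are mine; in SaC the compiler inserts these operations, they are not user-visible):

```python
class RCArray:
    # Toy model of "delayed copy + non-delayed GC" (reference counting):
    # an update copies only when the array is shared (rc > 1).
    def __init__(self, data):
        self.data = list(data)
        self.rc = 1

    def inc_rc(self):            # funcall: O(1) + INC_RC
        self.rc += 1
        return self

    def dec_rc(self):            # read side: O(1) + DEC_RC_FREE
        self.rc -= 1
        # rc == 0 would free the buffer here

    def update(self, i, v):      # update: O(1) if unshared, O(n) + malloc if shared
        if self.rc == 1:
            self.data[i] = v     # sole owner: reuse the buffer in place
            return self
        self.dec_rc()            # shared: give up our reference and copy
        fresh = RCArray(self.data)
        fresh.data[i] = v
        return fresh

a = RCArray([1, 2, 3, 4])
b = a.inc_rc().update(0, 5)      # a is shared -> update copies
assert a.data == [1, 2, 3, 4] and b.data == [5, 2, 3, 4]
c = b.update(1, 6)               # b is unshared -> update in place
assert c is b and b.data == [5, 6, 3, 4]
```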
Avoiding Reference Counting Operations

  a = [1,2,3,4];
  b = a[1];      <-- we would like to avoid RC here!
  c = f( a, 1);  <-- BUT, we cannot avoid RC here!
  d = a[2];      <-- clearly, we can avoid RC here!
  e = f( a, 2);  <-- ... and here!
NB: Why don’t we have RC-world-domination?

[figure: object graph annotated with reference counts 1, 1, 2, 3, 3]
Going Multi-Core

[figure: rc-ops on shared arrays, single-threaded vs data-parallel execution]

Ø local variables do not escape!
  => use thread-local heaps
Ø relatively free variables can only benefit from reuse in 1/n cases!
  => inhibit rc-ops on rel-free vars
Bi-Modal RC:

[figure: references are counted in "local" mode within a thread and switched to "norc" mode across fork and join]
SaC Tool Chain

- sac2c – main compiler for generating executables; try:
  - sac2c -h
  - sac2c -o hello_world hello_world.sac
  - sac2c -t mt_pth
  - sac2c -t cuda
- sac4c – creates C and Fortran libraries from SaC libraries
- sac2tex – creates TeX docu from SaC files
More Material
Ø www.sac-home.org
§ Compiler
§ Tutorial
Ø [GS06b] Clemens Grelck and Sven-Bodo Scholz. SAC: A functional array language for efficient multithreaded execution. International Journal of Parallel Programming, 34(4):383–427, 2006.
Ø [WGH+12] V. Wieser, C. Grelck, P. Haslinger, J. Guo, F. Korzeniowski, R. Bernecky, B. Moser, and S.B. Scholz. Combining high productivity and high performance in image processing using Single Assignment C on multi-core CPUs and many-core GPUs. Journal of Electronic Imaging, 21(2), 2012.
Ø [vSB+13] A. Šinkarovs, S.B. Scholz, R. Bernecky, R. Douma, and C. Grelck. SAC/C formulations of the all-pairs N-body problem and their performance on SMPs and GPGPUs. Concurrency and Computation: Practice and Experience, 2013.
Outlook
- There are still many challenges ahead, e.g.:
  Ø Non-array data structures
  Ø Arrays on clusters
  Ø Joining data and task parallelism
  Ø Better memory management
  Ø Application studies
  Ø Novel architectures
  Ø ... and many more ...
- If you are interested in joining the team:
Ø talk to me ☺