Lecture 14: Cilk, Shankar Balachandran (bshankar@ee.iitb.ac.in)



SLIDE 1

EE717

Lecture 14: Cilk

The lecture is partly based on Charles Leiserson’s Slides on Cilk

Shankar Balachandran bshankar@ee.iitb.ac.in

SLIDE 2

Cilk

A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.

Example applications:

  • virus shell assembly
  • graphics rendering
  • n-body simulation
  • heuristic search
  • dense and sparse matrix computations
  • friction-stir welding simulation
  • artificial evolution

SLIDE 3

Shared-Memory Multiprocessor

[Figure: processors (P), each with its own cache ($), connected through a network to memory and I/O.]

In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!

SLIDE 4

Cilk Is Simple

  • Cilk extends the C language with just a handful of keywords.
  • Every Cilk program has a serial semantics.
  • Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
  • Cilk is processor-oblivious.
  • Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
  • Cilk supports speculative parallelism.

SLIDE 5

Fibonacci

C elision:

    int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }

Cilk code:

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

SLIDE 6

Basic Cilk Keywords

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

  • cilk: Identifies a function as a Cilk procedure, capable of being spawned in parallel.
  • spawn: The named child Cilk procedure can execute in parallel with the parent caller.
  • sync: Control cannot pass this point until all spawned children have returned.

SLIDE 7

Dynamic Multithreading

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

The computation dag unfolds dynamically. (“Processor oblivious”)

Example: fib(4)

[Figure: the computation dag for fib(4), with one group of threads per call to fib.]

SLIDE 8

Multithreaded Computation

  • The dag G = (V, E) represents a parallel instruction stream.
  • Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
  • Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.

[Figure: a computation dag; the legend marks spawn edges, return edges, continue edges, the initial thread, and the final thread.]

SLIDE 9

LECTURE 14

Performance Measures

SLIDE 10

Algorithmic Complexity Measures

T_P = execution time on P processors

SLIDE 11

Algorithmic Complexity Measures

T_P = execution time on P processors
T_1 = work

SLIDE 12

Algorithmic Complexity Measures

T_P = execution time on P processors
T_1 = work
T_∞ = span*

* Also called critical-path length or computational depth.

SLIDE 13

Algorithmic Complexity Measures

T_P = execution time on P processors
T_1 = work
T_∞ = span*

LOWER BOUNDS

  • T_P ≥ T_1/P
  • T_P ≥ T_∞

* Also called critical-path length or computational depth.
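The two lower bounds can be combined into a single inequality. The annotations below give the usual one-line justification for each bound; they are a summary of the standard argument, not text taken from the slides.

```latex
\begin{align*}
T_P &\ge \frac{T_1}{P}
  && \text{($P$ processors complete at most $P$ units of work per time step)} \\
T_P &\ge T_\infty
  && \text{(a finite machine cannot finish faster than an unbounded one)} \\
T_P &\ge \max\!\left(\frac{T_1}{P},\; T_\infty\right)
\end{align*}
```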

SLIDE 14

Speedup

Definition: T_1/T_P = speedup on P processors.

If T_1/T_P

  • = Θ(P) ≤ P, we have linear speedup;
  • = P, we have perfect linear speedup;
  • > P, we have superlinear speedup, which is not possible in our model, because of the lower bound T_P ≥ T_1/P.

SLIDE 15

Parallelism

Because we have the lower bound T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is

T_1/T_∞ = parallelism = the average amount of work per step along the span.

SLIDE 16

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8

[Figure: the fib(4) dag, with the eight threads on the critical path numbered 1 through 8.]

SLIDE 17

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Parallelism: T_1/T_∞ = 2.125

Using many more than 2 processors makes little sense.

SLIDE 18

Lesson

Work and span can predict performance on large machines better than running times on small machines can.

SLIDE 19

Suggested Reading

Amdahl's Law in the Multicore Era, Mark D. Hill and Michael R. Marty, IEEE Computer, July 2008.

SLIDE 20

Parallelizing Vector Addition

C:

    void vadd (real *A, real *B, int n){
      int i;
      for (i=0; i<n; i++) A[i]+=B[i];
    }

SLIDE 21

Parallelizing Vector Addition

Parallelization strategy:

  1. Convert loops to recursion.

C (original loop):

    void vadd (real *A, real *B, int n){
      int i;
      for (i=0; i<n; i++) A[i]+=B[i];
    }

C (recursive):

    void vadd (real *A, real *B, int n){
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i]+=B[i];
      } else {
        vadd (A, B, n/2);
        vadd (A+n/2, B+n/2, n-n/2);
      }
    }

SLIDE 22

Parallelizing Vector Addition

Parallelization strategy:

  1. Convert loops to recursion.
  2. Insert Cilk keywords.

C (original loop):

    void vadd (real *A, real *B, int n){
      int i;
      for (i=0; i<n; i++) A[i]+=B[i];
    }

Cilk:

    cilk void vadd (real *A, real *B, int n){
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i]+=B[i];
      } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
      }
    }

Side benefit: divide and conquer (D&C) is generally good for caches!

SLIDE 23

Vector Addition

    cilk void vadd (real *A, real *B, int n){
      if (n<=BASE) {
        int i;
        for (i=0; i<n; i++) A[i]+=B[i];
      } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
      }
    }

SLIDE 24

Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T_1 = Θ(n)
Span: T_∞ = Θ(lg n)
Parallelism: T_1/T_∞ = Θ(n/lg n)

[Figure: the recursion tree for vadd, with leaves of size BASE.]
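These bounds follow from the shape of the recursion. The recurrences below are a sketch of the standard argument, assuming each call does constant work outside its two recursive calls and that the base case costs Θ(1) because BASE = Θ(1).

```latex
\begin{align*}
T_1(n) &= 2\,T_1(n/2) + \Theta(1) = \Theta(n)
  && \text{(both halves contribute to the work)} \\
T_\infty(n) &= T_\infty(n/2) + \Theta(1) = \Theta(\lg n)
  && \text{(the halves run in parallel, so only one counts)} \\
\frac{T_1(n)}{T_\infty(n)} &= \Theta\!\left(\frac{n}{\lg n}\right)
\end{align*}
```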

SLIDE 25

Another Parallelization

C:

    void vadd1 (real *A, real *B, int n){
      int i;
      for (i=0; i<n; i++) A[i]+=B[i];
    }

    void vadd (real *A, real *B, int n){
      int j;
      for (j=0; j<n; j+=BASE) {
        vadd1(A+j, B+j, min(BASE, n-j));
      }
    }

Cilk:

    cilk void vadd1 (real *A, real *B, int n){
      int i;
      for (i=0; i<n; i++) A[i]+=B[i];
    }

    cilk void vadd (real *A, real *B, int n){
      int j;
      for (j=0; j<n; j+=BASE) {
        spawn vadd1(A+j, B+j, min(BASE, n-j));
      }
      sync;
    }

SLIDE 26

Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T_1 = Θ(n)
Span: T_∞ = Θ(n)
Parallelism: T_1/T_∞ = Θ(1)   PUNY!

[Figure: a serial chain of spawns, each launching a base case of size BASE.]