

slide-1
SLIDE 1

Asynchronous Parallel DLA in Concurrent Collections

Aparna Chandramowlishwaran, Richard Vuduc (Georgia Tech); Kathleen Knobe (Intel)

May 14, 2009 Workshop on Scheduling for Large-Scale Systems @ UTK

1

slide-2
SLIDE 2

Motivation and goals

Motivating recent work for multicore systems

Tile algorithms for DLA, e.g., Buttari et al. (2007); Chan et al. (2007)
General parallel programming models suited to this algorithmic style, e.g., Concurrent Collections (CnC) by Knobe & Offner (2004)

Goals

Study: apply and evaluate CnC using parallel DLA examples
Talk: CnC tutorial crash course; a platform for your work?

2

To download CnC, see: whatif.intel.com


slide-3
SLIDE 3

Outline

Overview of the Concurrent Collections (CnC) language
Asynchronous parallel Cholesky & symmetric eigensolver in CnC
Experimental results (preliminary)

3

slide-4
SLIDE 4

Concurrent Collections (CnC) programming model

Separates computation semantics from expression of parallelism

Program = components + scheduling constraints

Components: computation, control, data
Constraints: relations among components
No overwriting of data, no arbitrary serialization, and no side effects

Combines tuple-space, streaming, and dataflow models

4

slide-5
SLIDE 5

CnC example: Outer product

5

Z ← x·yᵀ


slide-6
SLIDE 6

CnC example: Outer product

6

Z ← x·yᵀ
zi,j ← xi · yj

Example only; coarser grain may be more realistic in practice.


slide-7
SLIDE 7

CnC example: Outer product

7

Collections:

Static representation of dynamic instances

zi,j ← xi · yj


slide-8
SLIDE 8

CnC example: Outer product

8

*

Step

Unit of execution

Collections:

Static representation of dynamic instances

Set of all (dynamic) multiplications

zi,j ← xi · yj


slide-9
SLIDE 9

CnC example: Outer product

9

*

<i,j> Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

<a, b, …> = tuple of tag components

zi,j ← xi · yj


slide-10
SLIDE 10

CnC example: Outer product

10

*

<i,j> Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Says whether, not when, step executes

zi,j ← xi · yj


slide-11
SLIDE 11

CnC example: Outer product

11

*

<i,j> Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Tags prescribe steps

zi,j ← xi · yj


slide-12
SLIDE 12

CnC example: Outer product

12

*

x y Z

<i> <j> <i,j> <i,j> Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Item

Data

zi,j ← xi · yj


slide-13
SLIDE 13

CnC example: Outer product

13

Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Item

Data

→ shows producer/consumer relations

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-14
SLIDE 14

CnC example: Outer product

14

Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Item

Data

“Environment” may produce/consume

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-15
SLIDE 15

Essential properties of a CnC program

15

Written in terms of values, without overwriting ⇒ race-free (dynamic single assignment)
No arbitrary serialization; only explicit ordering constraints (avoids analysis)
Steps are side-effect free (functional)

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-16
SLIDE 16

CnC example: Tree search

16

Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Item

Data

match ← find (value x in tree T)


slide-17
SLIDE 17

CnC example: Tree search (controller/controllee relations)

17

Step

Unit of execution

Collections:

Static representation of dynamic instances

Tag

Control

Item

Data

=

T

<node> <root> <match>

x

<⋅> match ← find (value x in tree T)


slide-18
SLIDE 18

Execution model

18

Recall: Outer product example

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-19
SLIDE 19

Execution model

19

Tag <i=2, j=5> available

<2,5>

zi,j ← xi · yj


slide-20
SLIDE 20

Execution model

20

Tag <i=2, j=5> available ⇒ Step prescribed

*

<2,5>

zi,j ← xi · yj


slide-21
SLIDE 21

Execution model

21

Tag <2,5> available ⇒ Step prescribed
Items x:<2>, y:<5> available ⇒ Step inputs-available

*

x y

<2> <5> <2,5>

zi,j ← xi · yj


slide-22
SLIDE 22

Execution model

22

Tag <2,5> available ⇒ Step prescribed
Items x:<2>, y:<5> available ⇒ Step inputs-available
Prescribed + inputs-available ⇒ enabled

*

x y

<2> <5> <2,5>

zi,j ← xi · yj


slide-23
SLIDE 23

Execution model

23

Tag <2,5> available ⇒ Step prescribed
Items x:<2>, y:<5> available ⇒ Step inputs-available
Prescribed + inputs-available ⇒ enabled
Executes ⇒ Z:<2,5> available

*

x y Z

<2> <5> <2,5> <2,5>

zi,j ← xi · yj


slide-24
SLIDE 24

Coding and execution

24

[1] Write the specification (graph).
[2] Implement steps in a “base” language (C/C++).
[3] Build using the CnC translator + compiler.
[4] The run-time system maintains collections and schedules step execution.

slide-25
SLIDE 25

Textual notation

25

Recall: Outer product example

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj



slide-27
SLIDE 27

Textual notation

27

// Input: env → <*: i,j>;

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-28
SLIDE 28

Textual notation

28

// Input: env → <*: i,j>, [x: i], [y: j];

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj



slide-30
SLIDE 30

Textual notation

30

// Input: env → <*: i,j>, [x: i], [y: j];
// Prescription relations:
<*: i,j> :: (*: i,j);

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-31
SLIDE 31

Textual notation

31

// Input: env → <*: i,j>, [x: i], [y: j];
// Prescription relations:
<*: i,j> :: (*: i,j);
// Producer/consumer relations:
[x: i], [y: j] → (*: i, j);
(*: i, j) → [Z: i, j];

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj


slide-32
SLIDE 32

Textual notation

32

// Input: env → <*: i,j>, [x: i], [y: j];
// Prescription relations:
<*: i,j> :: (*: i,j);
// Producer/consumer relations:
[x: i], [y: j] → (*: i, j);
(*: i, j) → [Z: i, j];
// Output: [Z: i, j] → env;

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj



slide-34
SLIDE 34

Step code written in a sequential base language

34

Return_t mult (Graph_t& G, const Tag_t& t)
{
  int i = t[0], j = t[1];
  double x_i = G.x.Get (Tag_t(i));
  double y_j = G.y.Get (Tag_t(j));
  G.Z.Put (Tag_t(i, j), x_i*y_j);
  return CNC_Success;
}

*

x y Z

<i> <j> <i,j> <i,j>

zi,j ← xi · yj

Intel’s implementation uses C++; Rice University’s uses Java (Habanero)



slide-40
SLIDE 40

Run-time system

Built on top of Intel Threading Building Blocks (TBB)

Implements a Cilk-style work-stealing scheduler
Work queues use LIFO; FIFO and other strategies are in development

Other run-times possible

DEC/HP TStreams on MPI; Rice's Habanero uses Java threads
Intel-specific issues with queuing (more later)

40

slide-41
SLIDE 41

Tile Cholesky: A → L⋅Lᵀ

Buttari, et al. (2007)

41

Iteration k: // Over diagonal tiles
  SeqCholesky (Lk,k ← Ak,k)
  Trisolve (Lk+1:p,k ← Ak+1:p,k, Lk,k)
  Update (Ak+1:p,k+1:p ← Lk+1:p,k, Ak+1:p,k+1:p)

[Figure: tiled lower-triangular matrix, tiles A1,1 … Ap,p]

41


slide-46
SLIDE 46

Tile Cholesky in CnC

46

SeqCholesky (Lk,k ← Ak,k)
Trisolve (Lk+1:p,k ← Ak+1:p,k, Lk,k)
Update (Ak+1:p,k+1:p ← Lk+1:p,k, Ak+1:p,k+1:p)

[Figure: tiled lower-triangular matrix, tiles Ak,k … Ap,p]

46

slide-47
SLIDE 47

Tile Cholesky in CnC

47

C  T  U

Omitted: Items

47

slide-48
SLIDE 48

Tile Cholesky in CnC

48

C <k>   T   U

Iteration index is a natural tag

48

slide-49
SLIDE 49

Tile Cholesky in CnC

49

C <k>   T <i,k>   U

Given k, multiple T steps could go ⇒ 2-D tag

49

slide-50
SLIDE 50

Tile Cholesky in CnC

50

C <k>   T <i,k>   U <i,j,k>

Given k, a 2-D iteration space of Update steps could go

50

slide-51
SLIDE 51

Tile Cholesky in CnC

51

C <k>   T <i,k>   U <i,j,k>

The sequential Cholesky step enables the Trisolve steps

51

slide-52
SLIDE 52

Tile Cholesky in CnC

52

C <k>   T <i,k>   U <i,j,k>

Similarly, each Trisolve step enables Update steps

52

slide-53
SLIDE 53

Tile Cholesky in CnC

53

C <k>   T <i,k>   U <i,j,k>

Other arrangements are possible, e.g., pre-generate all tags.

53

slide-54
SLIDE 54

Dense symmetric generalized eigensolver

“Straightforward” translation of LAPACK’s _sygvx for Az = λBz

Pieces: Cholesky / reduction to standard form; tridiagonal reduction
Only partly “asynchronous,” but a useful proof of concept
Performance limited by the tridiagonal reduction step (BLAS-2)

54

slide-55
SLIDE 55

Experimental results

55

slide-56
SLIDE 56

56

Cholesky performance: Intel 2-socket x 4-core Harpertown @ 2 GHz + Intel MKL 10.1

[Figure: performance (GFlop/s, % of theoretical peak) vs. matrix size (1000–10000); DGEMM and theoretical peaks marked. Series: Baseline, ScaLAPACK+MPICH2/nemesis, OpenMP+MKL(seq), Cilk++ rec+MKL(seq), MKL(multithreaded BLAS), CnC+MKL(seq).]

slide-57
SLIDE 57

57

CnC-based Cholesky timeline (n=1000): Intel 2-socket x 4-core Harpertown @ 2 GHz + Intel MKL 10.1 for sequential components

[Figure: per-thread timeline, threads 1–8 vs. normalized execution time; phases: Unblocked Cholesky, Triangular Solve, Symmetric Rank-k Update, Idle, Requeue; lower bound on execution time and critical path marked.]

slide-58
SLIDE 58

58

Cholesky performance: AMD 4-socket x 4-core Barcelona @ 2 GHz

[Figure: performance (GFlop/s, % of theoretical peak) vs. matrix size (1000–10000); DGEMM and theoretical peaks marked. Series: Baseline, ScaLAPACK+MPICH2/nemesis, OpenMP+MKL(seq), Cilk++ rec+MKL(seq), MKL(multithreaded BLAS), CnC+MKL(seq).]

slide-59
SLIDE 59

59

Eigensolver performance (dsygvx)

[Figure: two panels, Intel Harpertown (2x4 = 8 cores) and AMD Barcelona (4x4 = 16 cores); performance (GFlop/s) vs. matrix size (1000–10000). Series: Baseline, MKL(multithreaded BLAS), CnC+MKL(seq).]

slide-60
SLIDE 60

Summary and future work

60

CnC’s key ideas

Decompose computation into steps + (data) items + (control) tags, with constraint relations among these components (dataflow-like)
Goal: separate computation semantics (orderings) from parallelism

Ongoing

“Finish” the proof-of-concept example by adding, e.g., blocked data layouts
New language primitives to simplify tag management and improve modularity and performance
Extending the run-time scheduling infrastructure
Other applications and architectures

slide-61
SLIDE 61

Additional limitations

Tag types: integers only
Cannot handle continuous (streaming) input
More natural support needed for in-place algorithms
Tools, e.g., debugging

61