Asynchronous Parallel DLA in Concurrent Collections
Aparna Chandramowlishwaran, Richard Vuduc (Georgia Tech); Kathleen Knobe (Intel)
May 14, 2009 Workshop on Scheduling for Large-Scale Systems @ UTK
Motivation and goals

Motivating recent work for multicore systems:
Tile algorithms for DLA, e.g., Buttari et al. (2007); Chan et al. (2007)
General parallel programming models suited to this algorithmic style, e.g., Concurrent Collections (CnC) by Knobe & Offner (2004)

Goals:
Study: apply and evaluate CnC using PDLA examples
Talk: a CnC tutorial crash course; a platform for your work?
To download CnC, see: whatif.intel.com
Overview of the Concurrent Collections (CnC) language
Asynchronous parallel Cholesky & symmetric eigensolver in CnC
Experimental results (preliminary)
Separates computation semantics from expression of parallelism
Program = components + scheduling constraints
Components: computation, control, data
Constraints: relations among components
No overwriting of data, no arbitrary serialization, and no side effects
Combines tuple-space, streaming, and dataflow models
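For reference, the bracket conventions of CnC's textual notation, as used in the graph specification later in this talk (a summary of the notation, not new syntax):

(step: tag)   step collection: computation, one unit of execution per tag
<tag: i,j>    tag collection: control, prescribes step instances
[item: i]     item collection: data, dynamic single assignment
::            prescription relation (tag prescribes step)
→             producer/consumer relation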
The running example is the outer product, Z_{i,j} = x_i * y_j. It is an example only; coarser grain may be more realistic in practice.
Collections: static representations of dynamic instances.

Step: unit of execution. A step collection is the set of all dynamic instances, e.g., the set of all (dynamic) multiplications.
Tag: control. <a, b, …> denotes a tuple of tag components. A tag says whether, not when, a step executes: tags prescribe steps.
Item: data.

In the graph, → shows producer/consumer relations, and the "environment" may produce and consume collections at the graph's boundary.

[Diagram: outer-product graph with tag collections <i>, <j>, <i,j>, item collections [x: i], [y: j], [Z: i,j], and step (mult: i,j).]
Written in terms of values, without overwriting ⇒ race-free (dynamic single assignment).
No arbitrary serialization: only the stated constraints order execution (avoids analysis).
Steps are side-effect free (functional).
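A minimal fragment illustrating dynamic single assignment, using the C++ Put API shown later in this talk (the tag values here are arbitrary):

// Each (collection, tag) pair names exactly one value; re-Put is an error.
G.Z.Put (Tag_t(2, 5), 10.0);      // defines Z:<2,5>
// G.Z.Put (Tag_t(2, 5), 11.0);   // illegal: would overwrite Z:<2,5>
G.Z.Put (Tag_t(2, 6), 10.0);      // fine: a different tag, a different value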
Another example: match ← find (value x in tree T).

[Diagram: tree-search graph with tag collections <node> and <root> and result <match>.]
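The slides give only the graph; here is a hedged sketch of a find step in the style of the C++ step code shown later. The collection names (tree, nodeTags, match), the Node_t record, and putting tags from within a step are all assumptions, not details from the talk:

// Hypothetical node record, stored as items [tree: n]:
struct Node_t { double value; int left, right; };   // child ids, -1 if absent

Return_t find (Graph_t& G, const Tag_t& t)
{
    int n = t[0];                              // tag <node> = node id
    double x = G.x.Get (Tag_t(0));             // the search value, item [x: 0]
    Node_t node = G.tree.Get (Tag_t(n));       // the node itself, item [tree: n]
    if (node.value == x)
        G.match.Put (Tag_t(0), n);             // produce the result item [match]
    else {
        if (node.left  >= 0) G.nodeTags.Put (Tag_t(node.left));   // prescribe find on children,
        if (node.right >= 0) G.nodeTags.Put (Tag_t(node.right));  // extending control dynamically
    }
    return CNC_Success;
}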
Recall: Outer product example
Tag <i=2, j=5> available ⇒ step prescribed.
Items x:<2>, y:<5> available ⇒ step inputs-available.
Prescribed + inputs-available ⇒ step enabled.
Enabled step executes ⇒ Z:<2,5> available.
[1] Write the specification (graph).
[2] Implement steps in a "base" language (C/C++).
[3] Build using the CnC translator + compiler.
[4] The run-time system maintains collections and schedules step execution.
Recall: Outer product example (the complete graph specification):

// Input:
env → <*: i,j>, [x: i], [y: j];
// Prescription relations:
<*: i,j> :: (*: i,j);
// Producer/consumer relations:
[x: i], [y: j] → (*: i, j);
(*: i, j) → [Z: i, j];
// Output:
[Z: i, j] → env;
The step implementation in the base language:

Return_t mult (Graph_t& G, const Tag_t& t)
{
    int i = t[0], j = t[1];               // unpack the tag components <i,j>
    double x_i = G.x.Get (Tag_t(i));      // consume items [x: i] and [y: j]
    double y_j = G.y.Get (Tag_t(j));
    G.Z.Put (Tag_t(i, j), x_i * y_j);     // produce item [Z: i,j]
    return CNC_Success;
}

Intel's implementation uses C++; Rice University's uses Java (Habanero).
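The environment side is not shown on the slides; here is a hedged sketch of a driver, assuming a tag collection T and a run call (both names are hypothetical) alongside the Get/Put API above:

void outer_product (Graph_t& G, const double* x, const double* y, int n)
{
    for (int i = 0; i < n; ++i) G.x.Put (Tag_t(i), x[i]);   // env → [x: i]
    for (int j = 0; j < n; ++j) G.y.Put (Tag_t(j), y[j]);   // env → [y: j]
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            G.T.Put (Tag_t(i, j));        // env → <i,j>: prescribe every mult step
    G.run ();                             // assumed: execute until quiescent
    double z00 = G.Z.Get (Tag_t(0, 0));   // [Z: i,j] → env
    (void) z00;
}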
Built on top of Intel Threading Building Blocks (TBB)
Implements a Cilk-style work-stealing scheduler. Work queues currently use LIFO order; FIFO and other strategies are in development (a minimal mock of the stealing discipline appears below).
Other run-times possible
DEC/HP TStreams runs on top of MPI; Rice's Habanero uses Java threads. There are Intel-specific issues with queuing (more later).
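A minimal, serial mock of the Cilk-style stealing discipline mentioned above (illustrative only; TBB's actual scheduler is more sophisticated):

#include <cstdlib>
#include <deque>
#include <functional>
#include <vector>

using Task  = std::function<void()>;
using WorkQ = std::deque<Task>;

// One scheduling step for a single worker: pop the LIFO end of its own
// queue (depth-first, cache-friendly); when empty, steal from the FIFO
// end of a random victim (takes the oldest, typically largest, work).
void worker_step (WorkQ& mine, std::vector<WorkQ*>& all)
{
    if (!mine.empty ()) {
        Task t = mine.back (); mine.pop_back ();              // own work: LIFO
        t ();
    } else {
        WorkQ* victim = all[std::rand () % all.size ()];
        if (victim != &mine && !victim->empty ()) {
            Task t = victim->front (); victim->pop_front ();  // stolen work: FIFO
            t ();
        }
    }
}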
Tiled Cholesky, iteration k (over diagonal tiles):

SeqCholesky (L_{k,k} ← A_{k,k})
Trisolve (L_{k+1:p,k} ← A_{k+1:p,k}, L_{k,k})
Update (A_{k+1:p,k+1:p} ← L_{k+1:p,k}, A_{k+1:p,k+1:p})
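For reference, a hedged sequential sketch of this tile loop nest (the Tile type and kernel signatures are assumptions; the kernels would wrap, e.g., LAPACK dpotrf and BLAS dtrsm/dsyrk/dgemm):

#include <vector>

struct Tile { /* b x b block of doubles; storage details omitted */ };

void seq_cholesky (Tile& Akk);                              // L_{k,k} ← chol(A_{k,k})
void trisolve (Tile& Aik, const Tile& Lkk);                 // L_{i,k} ← A_{i,k} L_{k,k}^{-T}
void update (Tile& Aij, const Tile& Lik, const Tile& Ljk);  // A_{i,j} ← A_{i,j} - L_{i,k} L_{j,k}^T

void tiled_cholesky (std::vector<std::vector<Tile>>& A, int p)
{
    for (int k = 0; k < p; ++k) {
        seq_cholesky (A[k][k]);                  // factor the diagonal tile
        for (int i = k + 1; i < p; ++i)
            trisolve (A[i][k], A[k][k]);         // solve the tiles below it
        for (int i = k + 1; i < p; ++i)
            for (int j = k + 1; j <= i; ++j)     // lower triangle of trailing submatrix
                update (A[i][j], A[i][k], A[j][k]);
    }
}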
Mapping to CnC (items omitted from the graph):

<k>: the iteration index is a natural tag for the sequential Cholesky step.
<i,k>: given k, multiple Trisolve steps could go, hence a 2-D tag.
<i,j,k>: given k, a 2-D iteration space of Update steps could go, hence a 3-D tag.

The sequential Cholesky step enables the Trisolve steps; similarly, a Trisolve step enables Update steps. Other arrangements are possible, e.g., pre-generate all tags; a sketch of that arrangement follows below.
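A hedged sketch of the "pre-generate all tags" arrangement (the collection names cholTag/triTag/updTag and tag-collection Put are assumptions in the style of the earlier code). Every step instance is prescribed up front; data availability alone then drives the asynchronous execution order:

void prescribe_all (Graph_t& G, int p)
{
    for (int k = 0; k < p; ++k) {
        G.cholTag.Put (Tag_t(k));               // <k>: SeqCholesky steps
        for (int i = k + 1; i < p; ++i) {
            G.triTag.Put (Tag_t(i, k));         // <i,k>: Trisolve steps
            for (int j = k + 1; j <= i; ++j)
                G.updTag.Put (Tag_t(i, j, k));  // <i,j,k>: Update steps
        }
    }
}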
“Straightforward” translation of LAPACK’s _sygvx for Az = λBz
Pieces: Cholesky / reduction to standard form; tridiagonal reduction.
Only partly "asynchronous," but a useful proof of concept.
Performance is limited by the tridiagonal reduction step (BLAS-2).
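For context, the standard LAPACK call chain underneath _sygvx (this decomposition is my summary of LAPACK's structure, not a slide from the talk):

// A z = λ B z, itype = 1:
//   dpotrf (B)       factor B = L L^T
//   dsygst (A, L)    reduce to standard form C = L^{-1} A L^{-T}
//   dsyevx (C)       tridiagonalize (dsytrd, the BLAS-2 bottleneck noted above),
//                    then compute selected eigenpairs of C
//   dtrsm  (L, y)    back-transform eigenvectors: z = L^{-T} y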
[Figure: Cholesky performance on Intel 2-socket x 4-core Harpertown @ 2 GHz + Intel MKL 10.1. X-axis: matrix size, 1000-10000; y-axes: performance (GFlop/s) and percentage of theoretical peak, with DGEMM peak and theoretical peak marked. Series: Baseline, ScaLAPACK+MPICH2/nemesis, OpenMP+MKL(seq), Cilk++ rec+MKL(seq), MKL(multithreaded BLAS), CnC+MKL(seq).]
[Figure: CnC-based Cholesky timeline (n=1000) on Intel 2-socket x 4-core Harpertown @ 2 GHz, with Intel MKL 10.1 for the sequential components. Per-thread (1-8) normalized execution time, broken down into unblocked Cholesky, triangular solve, symmetric rank-k update, idle, and requeue; the critical path and a lower bound on execution time are marked.]
[Figure: Cholesky performance on AMD 4-socket x 4-core Barcelona @ 2 GHz. X-axis: matrix size, 1000-10000; y-axes: performance (GFlop/s) and percentage of theoretical peak, with DGEMM peak and theoretical peak marked. Series: Baseline, ScaLAPACK+MPICH2/nemesis, OpenMP+MKL(seq), Cilk++ rec+MKL(seq), MKL(multithreaded BLAS), CnC+MKL(seq).]
[Figure: Eigensolver (dsygvx) performance, GFlop/s vs. matrix size (1000-10000), on Intel Harpertown (2x4 = 8 cores) and AMD Barcelona (4x4 = 16 cores). Series: Baseline, MKL(multithreaded BLAS), CnC+MKL(seq).]
CnC’s key ideas
Decompose computation into steps + (data) items + (control) tags, with constraint relations among these components (dataflow-like).
Goal: separate computation semantics (orderings) from parallelism.
Ongoing work:
"Finish" the proof-of-concept example by adding, e.g., blocked data layouts
New language primitives to simplify tag management and improve modularity and performance
Extending the run-time scheduling infrastructure
Other applications and architectures
Current limitations:
Tag types: integers only
Cannot handle continuous (streaming) input
Lacks more natural support for in-place algorithms
Tools needed, e.g., for debugging