T4: Compiling Sequential Code for Effective Speculative Parallelization in Hardware
VICTOR A. YING MARK C. JEFFREY* DANIEL SANCHEZ ISCA 2020
*University of Toronto starting Fall 2020
Parallelization: Gap between programmers and hardware
Multicores are everywhere
Programmers still write sequential code
Speculative parallelization: new architectures and compilers to parallelize sequential code without knowing what is safe to run in parallel
Intel Skylake-SP (2017): 28 cores per die
» Tiny tasks create opportunities to reduce communication and improve locality
swarm.csail.mit.edu
[Krishnan et al. (’98-’01), STAMPede (’98-’08), Cintra et al. (’00, ’02), IMT (’03), TCC (’04), POSH (’06), Bulk (’06), Luo et al. (’09-’13), RASP (’11), MTX (’10-’20), and many others]
Prior TLS systems did not scale many real-world programs beyond a few cores due to
for (int v = 0; v < numVertices; v++) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : neighbors(v))
      state[nbr] = EXCLUDED;
  }
}
[Figure: timeline of tasks A-F, annotated with reads (rd) and writes (wr)]
Indirect memory accesses
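To make the example loop concrete, here is a minimal runnable sketch of it. The `Graph` adjacency-list type, the `greedy_mis` function name, and the `State` enum are assumptions introduced for illustration; the slides show only the loop and a `neighbors(v)` helper. With symmetric adjacency lists this loop computes a greedy maximal independent set, and it shows why the accesses are indirect: the `state` entries an iteration writes are chosen by the graph data.

```cpp
#include <cassert>
#include <vector>

// Hypothetical vertex states; the slides name them but do not define them.
enum State { UNVISITED, INCLUDED, EXCLUDED };

// Hypothetical graph type: adj[v] lists the neighbors of vertex v.
using Graph = std::vector<std::vector<int>>;

// Serial version of the loop on the slide.
std::vector<State> greedy_mis(const Graph& adj) {
    const int numVertices = static_cast<int>(adj.size());
    std::vector<State> state(adj.size(), UNVISITED);
    for (int v = 0; v < numVertices; v++) {
        if (state[v] == UNVISITED) {        // read of state[v]
            state[v] = INCLUDED;
            for (int nbr : adj[v]) {        // nbr comes from the graph data
                assert(nbr >= 0 && nbr < numVertices);
                state[nbr] = EXCLUDED;      // write a later iteration may read
            }
        }
    }
    return state;
}
```

On the path graph 0-1-2-3, written as `Graph{{1}, {0, 2}, {1, 3}, {2}}`, vertices 0 and 2 end up `INCLUDED` and vertices 1 and 3 end up `EXCLUDED`.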
[Figure: timeline of tasks A-F with rd/wr annotations; D, E, and F abort and re-execute as D′, E′, F′]
[Jeffrey et al. MICRO’15, MICRO’16, MICRO’18; Subramanian et al. ISCA’17]
Execution model:
equal timestamp
in timestamp order
Detects order violations and selectively aborts dependent tasks
Distributed task units queue, dispatch, and commit multiple tasks per cycle
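The ordering guarantee of this execution model can be sketched in software. The following is a serial reference model only, not the hardware mechanism: it runs timestamped tasks strictly in timestamp order and lets a task spawn children at an equal or greater timestamp. The class name `SerialTaskModel`, the `TaskFn` signature, and the tie-breaking by spawn order are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;

class SerialTaskModel;  // hypothetical name, not an API from the slides
using TaskFn = std::function<void(SerialTaskModel&, Timestamp)>;

class SerialTaskModel {
public:
    // Enqueue a task. A running task may only spawn children whose
    // timestamp is greater than or equal to its own.
    void spawn(Timestamp ts, TaskFn fn) {
        assert(ts >= current_ && "child timestamp must not be in the past");
        pending_.push(Entry{ts, next_seq_++, std::move(fn)});
    }

    // Run every task in timestamp order. Ties are broken by spawn order,
    // which is an assumption made here to keep the sketch deterministic.
    void run() {
        while (!pending_.empty()) {
            Entry e = pending_.top();
            pending_.pop();
            current_ = e.ts;
            e.fn(*this, e.ts);
        }
    }

private:
    struct Entry {
        Timestamp ts;
        std::uint64_t seq;
        TaskFn fn;
    };
    struct Later {  // orders the priority queue as a min-heap on (ts, seq)
        bool operator()(const Entry& a, const Entry& b) const {
            return a.ts != b.ts ? a.ts > b.ts : a.seq > b.seq;
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, Later> pending_;
    Timestamp current_ = 0;
    std::uint64_t next_seq_ = 0;
};
```

Speculative hardware would run many of these tasks at once and out of order; the point of the sketch is only the result it has to preserve, which is the one this ordered loop produces.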
cheap selective aborts
[Figure: timeline of spawner and worker tasks A-F with rd/wr annotations; only D aborts and re-executes, as D′]
[Figure: two timelines of writes (wr), labeled "Parallelize outer loop only" and "Parallelize both loops"]
Tiny tasks (a few instructions) are difficult to spawn effectively
[Figure: spawner and worker tasks]
T4 divides the entire program into tasks starting from the first instruction of main()
T4 automatically generates tasks from
T4 extracts nested parallelism from the entire program despite
Source code:
int i = 0;
while (status[i]) {
  if (foo(i)) break;
  i++;
}

void iter(Timestamp i) {
  if (!done) {
    if (!status[i]) done = 1;
    else if (foo(i)) done = 1;
  }
}

[Figure: tree of spawner nodes 2, 4, 6, 8, 10, 12 launching tasks iter(0) through iter(13)]
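The relationship between the source loop and the per-iteration task can be checked with a small sketch. It runs the `iter` tasks one after another in timestamp order, which is a software stand-in and not how the hardware executes them. `LoopState`, `stop_at`, `run_serial`, and `run_as_tasks` are hypothetical names introduced here; `status`, `foo`, `done`, and `iter` follow the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Timestamp = std::uint64_t;

// Hypothetical stand-ins for the slide's status[] array and foo().
struct LoopState {
    std::vector<int> status;  // the loop continues while status[i] is nonzero
    Timestamp stop_at = 0;    // foo(i) returns true when i == stop_at
    int foo_calls = 0;
    bool done = false;
    bool foo(Timestamp i) { ++foo_calls; return i == stop_at; }
};

// The original serial loop; returns how many times foo() was called.
int run_serial(LoopState s) {
    Timestamp i = 0;
    while (s.status[i]) {
        if (s.foo(i)) break;
        i++;
    }
    return s.foo_calls;
}

// One loop iteration as a task, following iter() on the slide.
void iter(LoopState& s, Timestamp i) {
    if (!s.done) {
        assert(i < s.status.size());
        if (!s.status[i]) s.done = true;
        else if (s.foo(i)) s.done = true;
    }
}

// Run num_tasks iteration tasks in timestamp order.
int run_as_tasks(LoopState s, Timestamp num_tasks) {
    for (Timestamp i = 0; i < num_tasks; i++) iter(s, i);
    return s.foo_calls;
}
```

Once `done` is set, every later task returns without touching `status` or calling `foo`, so in this sketch launching more tasks than the loop has iterations is harmless. The `status` array is assumed to contain a terminating zero.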
for (int i = 0; i < N; i++) {
  float x = f();
  if (x > 0.0) g(x);
}
[Figure: f and g tasks on a chip with four memory controller / IO blocks; value 0xE6823 shown]
Intraprocedural passes: small compile times (linear in code size)
Use all standard LLVM optimizations to generate high-quality code
More in the paper:
[Figure: compilation pipeline; stages shown: C/C++ source code, Clang frontend, T4 Parallelization Passes (shown twice), Optimizations (e.g., -O3), x86_64 code generation, LLVM backend, Object file]
4-wide superscalar out-of-order cores (Haswell-like)
32 KB L1 caches
1 MB L2 cache per 4-core tile
4 MB L3 slice per 4-core tile
256 entries
64 entries/tile (1024 tasks for 64-core chip)
4 4×4 mesh networks
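The per-tile figures above can be cross-checked with a little arithmetic. The sketch below takes the 64-core chip, the 4-core tiles, the 64 entries per tile, and the per-tile L2 and L3 sizes from the configuration; the constant names and the derived chip-wide totals are computed here for illustration and are not stated on the slide.

```cpp
// Values taken from the configuration above.
constexpr int kCores = 64;
constexpr int kCoresPerTile = 4;
constexpr int kEntriesPerTile = 64;  // the "64 entries/tile" figure
constexpr int kL2MBPerTile = 1;
constexpr int kL3MBPerTile = 4;
constexpr int kMeshDim = 4;          // each mesh network is 4x4

// Derived values (computed here, not quoted from the slide).
constexpr int kTiles = kCores / kCoresPerTile;
constexpr int kTotalEntries = kTiles * kEntriesPerTile;
constexpr int kTotalL2MB = kTiles * kL2MBPerTile;
constexpr int kTotalL3MB = kTiles * kL3MBPerTile;

static_assert(kTiles == kMeshDim * kMeshDim, "16 tiles fill a 4x4 mesh");
static_assert(kTotalEntries == 1024, "matches the 1024 tasks quoted above");
```

Under these assumptions the chip has 16 tiles, which matches both the 4×4 mesh and the quoted 1024 tasks, and would hold 16 MB of L2 and 64 MB of L3 in total.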
Hot loops have some independent iterations
Hot loops have serializing variables updated every iteration
[Figure legend: Serial code, -O3; Parallelized with T4]
Task-spawn overheads are 31% (geometric mean)
[Figures: parallel speedup]
9am in Los Angeles, Noon in New York, 6pm in Brussels, Midnight in Beijing
swarm.csail.mit.edu