Data-Centric Execution of Speculative Parallel Programs
MARK JEFFREY, SUVINAY SUBRAMANIAN, MALEEN ABEYDEERA, JOEL EMER, DANIEL SANCHEZ
MICRO 2016
Executive summary
Many-cores must exploit cache locality to scale
Current speculative systems, e.g. TLS or TM, do not exploit locality
Spatial Hints: run tasks likely to access the same data in the same place
Our techniques make speculative parallelism practical at large scale
DATA-CENTRIC EXECUTION OF SPECULATIVE PARALLEL PROGRAMS
TRANSACTIONAL MEMORY (TM) SCHEDULERS
Reduce wasted work of coarse-grain transactions
Limit concurrency: When to run a task?
SPATIAL HINTS
Make accesses local for fine-grain tasks
Less data movement: Where to run a task?
STATIC TASK MAPPING
Data dependences known a priori
Graph partitioning
DYNAMIC TASK MAPPING
Work stealing
Programs consist of timestamped tasks
swarm::enqueue(function_pointer, timestamp, arguments...);
Speculatively executes tasks out of order
Large hardware task queues
Scalable ordered speculation
Scalable ordered commits
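The timestamped-task model can be illustrated with a sequential reference executor that simply runs tasks in timestamp order. This is only a software sketch of the ordering semantics: the hardware described in these slides runs tasks speculatively out of order and commits them in order, which this code does not attempt. Every name in the `sketch` namespace is hypothetical and is not the Swarm API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical names, not the Swarm API

using Timestamp = uint64_t;

struct Task {
    Timestamp ts;                        // program-specified order
    uint64_t seq;                        // tie-breaker: enqueue order
    std::function<void(Timestamp)> fn;   // task body
};

// Orders the priority queue so the earliest timestamp is on top.
struct Later {
    bool operator()(const Task& a, const Task& b) const {
        if (a.ts != b.ts) return a.ts > b.ts;
        return a.seq > b.seq;
    }
};

class Executor {
public:
    // Tasks may call enqueue() while running to create children.
    void enqueue(Timestamp ts, std::function<void(Timestamp)> fn) {
        assert(fn != nullptr);
        queue_.push(Task{ts, nextSeq_++, std::move(fn)});
    }

    // Runs every task, always picking the earliest timestamp next.
    void run() {
        while (!queue_.empty()) {
            Task t = queue_.top();  // copy, so children can be pushed safely
            queue_.pop();
            t.fn(t.ts);
        }
    }

private:
    std::priority_queue<Task, std::vector<Task>, Later> queue_;
    uint64_t nextSeq_ = 0;
};

}  // namespace sketch
```

Any out-of-order speculative execution must produce the same result as this in-order loop, which is why it is useful as a mental reference.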
64-tile, 256-core chip
[Figure: tile organization. Each tile has 4 cores, each with L1I/D caches, plus an L2, an L3 slice, a router, and a task unit. The chip diagram shows 4 Mem / IO blocks.]
COMBINING SPECULATION AND LOCALITY
[Figure, animated over several builds: a small circuit with inputs r and s, gates A, B, C, D, and E, and output t = r XOR s, simulated event by event. Axis: Order = Simulated time (ns), 1 to 6. Tasks shown: r=1, A=1, C0=0, D0=1, E1=1, t=1, s=1, C1=1, B=1, D1=0, E1=0, t=0.]
Execute independent tasks out of order
[Figure: tasks r, A, C, D, E, t, s, C, B, D, E, t plotted against Order = Simulated time (ns), 1 to 6, shown twice: once with their data dependences and once as a valid schedule.]
2.4x parallelism (more in larger circuits)
Swarm sends new tasks to random tiles
Work stealing: a non-speculative scheduler
[Figure: des speedup from 1 to 256 cores (1c, 128c, 256c); series: Random, Stealing]
Each task operates on a single gate
The gate is known when the task is created
With fine-grain tasks, most data accessed is known at creation time
[Figure: DES schedule for tasks r, A, C, D, E, t, s, C, B, D, E, t]
Hints: map each gate to a statically-chosen tile
Send new tasks for a gate to its corresponding tile
[Figure: gates A, B, C, D, and E mapped to tiles]
[Figure: des speedup from 1 to 256 cores (1c, 128c, 256c); series: Stealing, Random, Hints; plot annotation: 186x]
But we can do better!
Static gate-to-tile mapping may cause hotspots
Dynamically remap gates (Hints) across tiles
[Figure: des speedup from 1 to 256 cores (1c, 128c, 256c); series: Stealing, Random, Hints, Load-Balanced Hints; plot annotation: 236x]
Programmer knows most of the data accessed
Spatial Hints convey program-level knowledge to exploit locality
SOFTWARE
A Spatial Hint is an integer value
HARDWARE
Hashes each new task's Hint to a tile ID
Serializes same-Hint tasks
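The two hardware behaviors above can be sketched in software as follows. The slides do not say which hash function the hardware uses, so the splitmix64-style mixer here is a stand-in assumption, and `tileForHint` and `HintSerializer` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

namespace sketch {  // hypothetical names, not the hardware design

// Stand-in hash (splitmix64 finalizer). Assumption: the real hardware
// hash is not specified in the slides.
inline uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Maps an integer Hint to a tile ID in [0, numTiles).
inline uint32_t tileForHint(uint64_t hint, uint32_t numTiles) {
    assert(numTiles > 0);
    return static_cast<uint32_t>(mix64(hint) % numTiles);
}

// Tracks which Hints have a running task so that a second task with
// the same Hint waits instead of running concurrently.
class HintSerializer {
public:
    // Returns true if the task may start now, false if a task with
    // the same Hint is still running.
    bool tryStart(uint64_t hint) { return running_.insert(hint).second; }

    // Called when the running task with this Hint finishes.
    void finish(uint64_t hint) { running_.erase(hint); }

private:
    std::unordered_set<uint64_t> running_;
};

}  // namespace sketch
```

Because the mapping is a pure function of the Hint, every task created with the same Hint lands on the same tile, which is what makes both the locality and the serialization possible.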
Localize most data accesses within a tile
Serialize tasks likely to conflict
Static hint-to-tile mapping may cause imbalance
[Figure: a Hint (e.g., 0xF00) is hashed (H) to a bucket, and a reconfigurable tile map translates the bucket to a tile ID. Example tile map entries: 1, 7, 1, …, 61, 63, 40.]
Instead, periodically remap hints across tiles to equalize load
Non-speculative systems use # queued tasks as a proxy for load
When imbalanced, speculative systems often
Remap hints to tiles to balance # of committed cycles per tile
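One way to picture this remapping is a greedy step over a bucket-to-tile map. The slides do not describe the actual reconfiguration algorithm, so the policy below is an assumption made for illustration only: move the largest bucket that still fits in the gap from the tile with the most committed cycles to the tile with the fewest. `rebalanceStep` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {  // hypothetical greedy policy, not the paper's algorithm

// bucketToTile[b] is the tile that bucket b currently maps to.
// bucketCycles[b] is the committed cycles attributed to bucket b.
// Moves at most one bucket; returns true if a bucket was moved.
bool rebalanceStep(std::vector<uint32_t>& bucketToTile,
                   const std::vector<uint64_t>& bucketCycles,
                   uint32_t numTiles) {
    assert(numTiles > 0 && bucketToTile.size() == bucketCycles.size());

    // Committed cycles per tile under the current mapping.
    std::vector<uint64_t> tileLoad(numTiles, 0);
    for (std::size_t b = 0; b < bucketToTile.size(); ++b) {
        assert(bucketToTile[b] < numTiles);
        tileLoad[bucketToTile[b]] += bucketCycles[b];
    }

    const auto hotIt = std::max_element(tileLoad.begin(), tileLoad.end());
    const auto coldIt = std::min_element(tileLoad.begin(), tileLoad.end());
    const uint64_t gap = *hotIt - *coldIt;
    const uint32_t hot = static_cast<uint32_t>(hotIt - tileLoad.begin());
    const uint32_t cold = static_cast<uint32_t>(coldIt - tileLoad.begin());

    // Pick the largest bucket on the hot tile whose load is below the
    // gap, so the move cannot make the cold tile hotter than the hot
    // tile was.
    bool found = false;
    std::size_t best = 0;
    for (std::size_t b = 0; b < bucketToTile.size(); ++b) {
        const uint64_t c = bucketCycles[b];
        if (bucketToTile[b] != hot || c == 0 || c >= gap) continue;
        if (!found || c > bucketCycles[best]) {
            best = b;
            found = true;
        }
    }
    if (!found) return false;

    bucketToTile[best] = cold;
    return true;
}

}  // namespace sketch
```

Calling `rebalanceStep` repeatedly until it returns false converges, because each move strictly reduces the imbalance between the two tiles involved. Using committed cycles rather than queue length as the load signal follows the slide above.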
void desTask(Timestamp ts, GateInput* input) {
  Gate* g = input->gate();
  // g is a pointer, so members are reached with ->
  bool toggledOutput = g->simulateToggle(input);
  if (toggledOutput) {
    // Toggle all inputs connected to this gate
    for (GateInput* i : g->connectedInputs())
      swarm::enqueue(desTask,
                     /*Timestamp*/ ts + delay(g, i),
                     /*Hint*/ i->gate()->id,
                     i);
  }
}
One line of code to express the Gate ID as a Hint
Benchmark                Hint                      Why?
des                      Gate ID                   Map tasks for same gate to same tile
nocsim                   Router ID                 Frequent intra-router communication
bfs, sssp, astar, color  Cache-line address        Several vertices reside on the same line
silo                     (Table ID, primary key)   Each task accesses one database tuple
genome, kmeans           Multiple
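Two of the hints in the table can be computed in a line or two each. The sketch below assumes 64-byte cache lines and a simple bit-packing of (Table ID, primary key); neither detail comes from the slides, and `cacheLineHint` and `tupleHint` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // hypothetical helpers for computing integer Hints

// Assumption: 64-byte cache lines.
constexpr uint64_t kLineBytes = 64;

// Cache-line address hint: objects on the same line share a Hint.
inline uint64_t cacheLineHint(const void* p) {
    return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(p)) / kLineBytes;
}

// (Table ID, primary key) hint folded into one integer. Distinct
// tuples may occasionally share a Hint; that only affects placement,
// since a Hint is advisory.
inline uint64_t tupleHint(uint32_t tableId, uint64_t primaryKey) {
    return (static_cast<uint64_t>(tableId) << 48) ^ primaryKey;
}

}  // namespace sketch
```

With 4-byte vertex records, for example, 16 consecutive vertices in a line-aligned array would produce the same cache-line hint and so be steered to the same tile.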
Load balance reconfiguration algorithm
Choice of application hints
Relationship between task size and hint effectiveness
Event-driven, Pin-based simulator
Target system: 256-core, 64-tile chip
Scalability experiments from 1–256 cores
64 MB shared L3 (1 MB/tile)
256 KB per-tile L2s
16 KB per-core L1s
16K task queue entries (64/core)
4K commit queue entries (16/core)
In-order, single-issue, scoreboarded cores
Load-Balanced Hints 3.3x faster than Random (193x gmean vs 58x)
[Figure: speedup versus core count (1c, 128c, 256c), one panel each for bfs, sssp, astar, color, des, nocsim, silo, genome, and kmeans; series: Random, Hints, LBHints, Stealing]
Stealing is inconsistent across benchmarks
Load-Balanced Hints 17% – 27% faster than Hints
[Figure: two bar charts, Aborted Cycles and NoC data transferred, each on a 0.0 to 1.0 axis, with bars labeled R and L for bfs, sssp, astar, color, des, nocsim, silo, genome, and kmeans]
Reduce wasted work by 6.4x
Reduce network traffic by 3.5x
Speculative architectures must exploit locality to scale to 100s of cores
Spatial Hints convey app-level knowledge to exploit cache locality
Hardware leverages hints by:
Our techniques make speculation practical on large-scale systems