SAM: Optimizing Multithreaded Cores for Speculative Parallelism
MALEEN ABEYDEERA, SUVINAY SUBRAMANIAN, MARK JEFFREY, JOEL EMER, DANIEL SANCHEZ
PACT 2017
Executive Summary
Analyzes the interplay between hardware multithreading and speculative parallelism
(e.g., Thread-Level Speculation and Transactional Memory)
Conventional multithreading causes performance pathologies on speculative workloads
Why? All threads are treated equally
Speculation Aware Multithreading (SAM)
SAM makes multithreading more useful
SAM : OPTIMIZING MULTITHREADED CORES FOR SPECULATIVE PARALLELISM 2
Background on speculative parallelism
Pitfalls of speculative parallelism with conventional multithreading
SAM on in-order cores
SAM on out-of-order cores
Parallelize tasks when the dependences are not known in advance
Hardware executes all tasks in parallel, aborting upon conflicts
Which task to abort? The conflict resolution policy decides
Speculative Parallelism
Ordered, e.g. Thread-Level Speculation (TLS)
(Program order dictates the conflict resolution order)
Unordered, e.g. Hardware Transactional Memory
(Any execution order is valid, but high-performance conflict resolution policies define an order)
Implicit order among all tasks in any speculative system
void desTask(Timestamp ts, GateInput* input) {
    Gate* g = input->gate();
    // Simulate the toggle on this input; the result says whether the gate's output toggled.
    bool toggledOutput = g->simulateToggle(input);
    if (toggledOutput) {
        // Enqueue one child task per connected input, at a later timestamp.
        for (GateInput* i : g->connectedInputs()) {
            swarm::enqueue(desTask, ts + delay(g, i), i);
        }
    }
}

Timestamped tasks
Tasks create children tasks (function ptr, timestamp, args)
Tasks appear to execute in timestamp order
Unordered execution via equal timestamps
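The "appear to execute in timestamp order" rule can be illustrated with a sequential reference executor. This is a minimal sketch, not the hardware mechanism: `Executor`, `Task`, and `demoOrder` are hypothetical names introduced here, and the queue stands in for whatever order speculative execution must appear to follow.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a timestamped task (function, timestamp).
struct Task {
    uint64_t ts;
    std::function<void(uint64_t)> fn;
};

// Comparator that keeps the smallest timestamp on top of the queue.
struct LaterFirst {
    bool operator()(const Task& a, const Task& b) const { return a.ts > b.ts; }
};

// Sequential reference executor: runs tasks one at a time in timestamp order.
class Executor {
public:
    void enqueue(std::function<void(uint64_t)> fn, uint64_t ts) {
        pending_.push(Task{ts, std::move(fn)});
    }

    // Runs until no tasks remain; returns the timestamps in execution order.
    std::vector<uint64_t> run() {
        std::vector<uint64_t> order;
        while (!pending_.empty()) {
            Task t = pending_.top();
            pending_.pop();
            order.push_back(t.ts);
            t.fn(t.ts);  // the task may enqueue children
        }
        return order;
    }

private:
    std::priority_queue<Task, std::vector<Task>, LaterFirst> pending_;
};

// A task at timestamp 10 creates children at 15 and 12; another task sits at 11.
std::vector<uint64_t> demoOrder() {
    Executor ex;
    ex.enqueue([&ex](uint64_t ts) {
        ex.enqueue([](uint64_t) {}, ts + 5);
        ex.enqueue([](uint64_t) {}, ts + 2);
    }, 10);
    ex.enqueue([](uint64_t) {}, 11);
    return ex.run();
}
```

Tasks with equal timestamps come out of this queue in an unspecified relative order, which mirrors the "unordered execution via equal timestamps" point.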
[Figure: 16-tile, 64-core CMP with Mem / IO blocks. Tile organization: four cores, each with L1I/D caches, plus an L2, an L3 slice, a router, and a task unit.]
Equal timestamps: global order via Virtual Time (VT)
Virtual Time = Timestamp : Tiebreaker
Tasks execute out-of-order, but commit in VT order
Commit queue: state of tasks waiting to commit
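A virtual time can be modeled as a (timestamp, tiebreaker) pair compared lexicographically. The sketch below assumes that a lower tiebreaker means an earlier task and that tiebreakers are unique; the slides do not state either detail, and `VirtualTime` and `commitOrder` are illustrative names.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Assumed layout: timestamp first, tiebreaker second.
struct VirtualTime {
    uint64_t timestamp;
    uint64_t tiebreaker;
};

// Lexicographic order: timestamp decides, tiebreaker resolves equal timestamps.
inline bool operator<(const VirtualTime& a, const VirtualTime& b) {
    return std::tie(a.timestamp, a.tiebreaker) < std::tie(b.timestamp, b.tiebreaker);
}

// Tasks may finish in any order; this returns the order in which they would commit.
std::vector<VirtualTime> commitOrder(std::vector<VirtualTime> finished) {
    std::sort(finished.begin(), finished.end());
    return finished;
}
```

Under these assumptions, tasks that finished with virtual times 52:9, 95:4, and 17:1 would commit as 17:1, then 52:9, then 95:4.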
Pitfalls of speculative parallelism with conventional multithreading
System configuration:
64-core SMT system
In-order core with 2-wide issue
Speculation-oblivious round-robin order
[Figure: issue-slot breakdown. Legend: micro-ops issued from committed tasks, no ready micro-ops to issue, micro-ops issued from aborted tasks, resource stalls.]
Insights:
However, multithreading can also lead to:
Unlikely-to-commit tasks hurt the throughput of likely-to-commit ones
Prioritize threads according to their conflict resolution priorities
Reduce Speculation Resource Stalls (tasks commit early)
Reduce Aborts (focus resources on tasks likely to commit)
SAM on in-order cores
[Figure: in-order SMT core with SAM. Components: fetch, decode, thread micro-op queues, SMT issue, register files, Pipe 0 and Pipe 1 (Int ALU, FP ALU, Int ALU, Mem/DCache), task unit.]
Task Unit: sends conflict resolution priority updates (Virtual Times)
Virtual Times: 52:9, 52:7, 17:1, 95:4
SAM issue priorities (higher is better): 3, 2, 4, 1
Issue logic: sort, then take the max-priority ready thread as the issue ThreadID
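The difference between speculation-aware and speculation-oblivious issue can be sketched as two thread-selection functions. This is an illustration rather than the actual issue logic: `pickSam` and `pickRoundRobin` are hypothetical names, a virtual time is modeled as a (timestamp, tiebreaker) pair, and an earlier virtual time is assumed to mean a higher issue priority.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed encoding of a virtual time: (timestamp, tiebreaker).
using VT = std::pair<uint64_t, uint64_t>;

// Speculation-aware pick: among ready threads, the earliest virtual time wins.
// Returns the thread index, or -1 if no thread has a ready micro-op.
int pickSam(const std::vector<VT>& vts, const std::vector<bool>& ready) {
    assert(vts.size() == ready.size());
    int best = -1;
    for (std::size_t t = 0; t < vts.size(); ++t) {
        if (!ready[t]) continue;
        if (best < 0 || vts[t] < vts[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(t);
        }
    }
    return best;
}

// Speculation-oblivious baseline: first ready thread after the last one issued.
int pickRoundRobin(const std::vector<bool>& ready, std::size_t last) {
    for (std::size_t i = 1; i <= ready.size(); ++i) {
        std::size_t t = (last + i) % ready.size();
        if (ready[t]) return static_cast<int>(t);
    }
    return -1;
}
```

With the virtual times from the figure and every thread ready, `pickSam` selects the thread holding 17:1, while `pickRoundRobin` ignores virtual times and simply rotates.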
Baseline System
Benchmarks
[Figure: performance on ordered and unordered benchmarks for 1 thread, 8-thread round robin, and 8-thread SAM.]
With round robin, 8-threaded cores outperform single-threaded cores by 1.85X
With SAM, the benefit increases to 2.33X
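As a rough worked figure, the two speedups imply the gain of SAM over round robin at 8 threads. This assumes both numbers are averages over the same benchmarks and the same single-threaded baseline, which the slide suggests but does not state.

```latex
\frac{\text{speedup}_{\text{SAM}}}{\text{speedup}_{\text{RR}}} = \frac{2.33}{1.85} \approx 1.26
```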
SAM matches RR when there are no pathologies
SAM reduces wasted work
SAM reduces resource stalls
[Figure: issue-slot breakdown. Micro-ops issued: committed, aborted. Unused issue slots (reason): resource, not ready, other.]
SAM on out-of-order cores
Unlike in in-order cores, priorities affect pipeline efficiency
Despite this, prioritizing tasks is better
The need for aggressive prioritization affects core design
[Figure: out-of-order SMT core with SAM. Components: fetch, decode, thread micro-op queues, SMT issue, issue buffer, physical reg file, reorder buffer, Pipe 0 and Pipe 1.]
Conflict resolution priority updates (from task unit)
In-flight uops (for ICount): 3, 9, 4, 2
Conflict res. priorities: 2, 3, 2, 1
SAM priorities: 3, 4, 2, 1
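The example values on this slide are consistent with one simple reading: rank threads by conflict resolution priority first, and break ties with ICount (fewer in-flight micro-ops is better). The sketch below implements that reading only. It is an inference from the numbers, not a confirmed description of the design, and `samPriorities` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns one rank per thread (1 = lowest priority, higher is better).
std::vector<int> samPriorities(const std::vector<int>& conflictPrio,
                               const std::vector<int>& inflightUops) {
    assert(conflictPrio.size() == inflightUops.size());
    const std::size_t n = conflictPrio.size();

    // Thread indices, to be sorted from worst to best.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});

    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        // Lower conflict resolution priority sorts first (worse).
        if (conflictPrio[a] != conflictPrio[b]) return conflictPrio[a] < conflictPrio[b];
        // Tie: the thread with more in-flight micro-ops sorts first (worse).
        return inflightUops[a] > inflightUops[b];
    });

    std::vector<int> rank(n);
    for (std::size_t i = 0; i < n; ++i) rank[order[i]] = static_cast<int>(i) + 1;
    return rank;
}
```

Calling `samPriorities({2, 3, 2, 1}, {3, 9, 4, 2})` yields ranks 3, 4, 2, 1, matching the SAM priorities listed on the slide.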
[Figure: issue-slot breakdown for sssp, 8 threads. Legend: micro-ops issued and unused issue slots (reason), with categories committed, aborted, resource, not ready, other, wrong path.]
Baseline policy: ICount (IC)
SAM is more beneficial with dynamically shared ROBs
Reduces aborts + resource stalls
But reduced pipeline efficiency
Increase in wrong-path issues + not-ready stalls
Hardware counters to track cycles
Cycles lost to task-level speculation: aborted + resource
Cycles lost to pipeline inefficiencies: not ready + wrong path
Is more lost to task-level speculation than to pipeline inefficiencies?
True: use SAM
False: use ICount
[Figure legend: micro-ops issued and unused issue slots (reason), with categories committed, aborted, resource, not ready, other, wrong path.]
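The adaptive choice between SAM and ICount can be sketched as a comparison of two counter sums. This assumes the decision is a plain "greater than" test and that a tie falls back to ICount; the slide shows only the true/false outcomes. `LossCounters`, `IssuePolicy`, and `choosePolicy` are hypothetical names.

```cpp
#include <cstdint>

enum class IssuePolicy { Sam, ICount };

// Hypothetical per-core counters, one per loss category named on the slide.
struct LossCounters {
    uint64_t aborted = 0;    // issue slots spent on tasks that later aborted
    uint64_t resource = 0;   // issue slots lost to resource stalls
    uint64_t notReady = 0;   // issue slots lost because no micro-op was ready
    uint64_t wrongPath = 0;  // issue slots spent on wrong-path micro-ops
};

// Picks SAM when task-level speculation costs more than pipeline inefficiency.
IssuePolicy choosePolicy(const LossCounters& c) {
    const uint64_t speculationLoss = c.aborted + c.resource;
    const uint64_t pipelineLoss = c.notReady + c.wrongPath;
    // Assumption: ties fall back to the baseline ICount policy.
    return speculationLoss > pipelineLoss ? IssuePolicy::Sam : IssuePolicy::ICount;
}
```

A core would evaluate something like this periodically and reset the counters afterwards; the interval is not given in the slides.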
At 8 threads / core:
Adaptive policy slightly increases performance at 2 and 4 threads
[Figure: average over all benchmarks. Legend: micro-ops issued and unused issue slots (reason), with categories committed, aborted, resource, not ready, other, wrong path.]
Conventional multithreading causes performance pathologies on speculative workloads
Speculation Aware Multithreading (SAM)
Prioritize threads running tasks more likely to commit
SAM makes multithreading more useful