SAM: Optimizing Multithreaded Cores for Speculative Parallelism



slide-1
SLIDE 1

SAM: Optimizing Multithreaded Cores for Speculative Parallelism

MALEEN ABEYDEERA, SUVINAY SUBRAMANIAN, MARK JEFFREY, JOEL EMER, DANIEL SANCHEZ

PACT 2017

slide-2
SLIDE 2

Executive Summary

Analyzes the interplay between hardware multithreading and speculative parallelism (e.g., Thread-Level Speculation and Transactional Memory)

Conventional multithreading causes performance pathologies on speculative workloads

  • Increase in aborted work
  • Inefficient use of speculation resources

Why? All threads are treated equally

Speculation Aware Multithreading (SAM)

  • Prioritize threads running tasks more likely to commit

SAM makes multithreading more useful

SAM : OPTIMIZING MULTITHREADED CORES FOR SPECULATIVE PARALLELISM 2

slide-4
SLIDE 4

Outline

  • Background on speculative parallelism
  • Pitfalls of speculative parallelism with conventional multithreading
  • SAM on in-order cores
  • SAM on out-of-order cores

slide-7
SLIDE 7

Background on Speculative Parallelism

Parallelize tasks when the dependences are not known in advance

Hardware executes all tasks in parallel, aborting upon conflicts

Which task to abort? Conflict resolution policy

Speculative Parallelism

  • Ordered, e.g. Thread-Level Speculation (TLS): program order dictates the conflict resolution order
  • Unordered, e.g. Hardware Transactional Memory: any execution order is valid, but high-performance conflict resolution policies define an order

Implicit order among all tasks in any speculative system
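As a rough illustration of a conflict resolution policy built on that implicit order, the sketch below aborts whichever of two conflicting tasks comes later. `SpecTask`, `order`, and `taskToAbort` are hypothetical names invented here, and real TLS or HTM hardware may apply other policies (for example, stalling a requester instead of aborting).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical task descriptor. `order` is the task's position in the
// implicit order: program order for TLS, a policy-assigned order for HTM.
struct SpecTask {
    std::uint64_t id;
    std::uint64_t order;  // smaller = earlier = higher conflict resolution priority
};

// On a conflict between two speculative tasks, return the id of the task
// to abort. Assumed policy: the later task loses, so the task that is more
// likely to commit keeps its work.
inline std::uint64_t taskToAbort(const SpecTask& a, const SpecTask& b) {
    assert(a.order != b.order && "the order is assumed to be total");
    return (a.order < b.order) ? b.id : a.id;
}
```

For instance, a conflict between a task with `order` 10 and one with `order` 20 aborts the second, regardless of which one detected the conflict.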

slide-10
SLIDE 10

Baseline System - Swarm [Jeffrey, MICRO’15]

Timestamped tasks

    // Discrete event simulation task: one gate input toggles at time ts.
    void desTask(Timestamp ts, GateInput* input) {
        Gate* g = input->gate();
        bool toggledOutput = g->simulateToggle(input);
        if (toggledOutput) {
            // Propagate the toggle to every input driven by this gate.
            for (GateInput* i : g->connectedInputs()) {
                swarm::enqueue(desTask, ts + delay(g, i), i);
            }
        }
    }

Tasks create children tasks (function ptr, timestamp, args)

Tasks appear to execute in timestamp order

Unordered execution via equal timestamps
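To make "tasks appear to execute in timestamp order" concrete, here is a minimal sequential model of that guarantee. It is not the Swarm API: `SequentialModel`, `Task`, and `LaterFirst` are hypothetical names, and the sketch assumes a child's timestamp is never smaller than its parent's. A speculative system would run these tasks in parallel and still produce a result equivalent to this serial run.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;

// A task is a function plus a timestamp; arguments live in the closure.
struct Task {
    Timestamp ts;
    std::function<void(Timestamp)> fn;
};

// Comparator that keeps the smallest timestamp on top of the queue.
struct LaterFirst {
    bool operator()(const Task& a, const Task& b) const { return a.ts > b.ts; }
};

// Hypothetical sequential reference model (not the real runtime).
class SequentialModel {
public:
    void enqueue(Timestamp ts, std::function<void(Timestamp)> fn) {
        assert(ts >= now_);  // assumption: children never run before their parent
        queue_.push(Task{ts, std::move(fn)});
    }

    // Runs all tasks in timestamp order and returns the executed timestamps.
    // Tasks with equal timestamps run in an unspecified relative order.
    std::vector<Timestamp> run() {
        std::vector<Timestamp> trace;
        while (!queue_.empty()) {
            Task t = queue_.top();
            queue_.pop();
            now_ = t.ts;
            trace.push_back(t.ts);
            t.fn(t.ts);  // the task body may enqueue children
        }
        return trace;
    }

private:
    Timestamp now_ = 0;
    std::priority_queue<Task, std::vector<Task>, LaterFirst> queue_;
};
```

Enqueueing tasks at timestamps 5, 1, and 6, where the task at 5 creates a child at 7, yields the execution order 1, 5, 6, 7.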

slide-13
SLIDE 13

Swarm Microarchitecture

[Figure: 16-tile, 64-core CMP with four Mem / IO interfaces. Tile organization: four cores, each with L1I/D caches, plus an L2, an L3 slice, a router, and a task unit.]

Equal timestamps: global order via Virtual Time (VT)

  • Virtual Time = (Timestamp, Tiebreaker)

Tasks execute out-of-order, but commit in VT order

Commit queue: state of tasks waiting to commit
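The virtual time ordering can be sketched as a lexicographic comparison on (timestamp, tiebreaker). The field widths and the `VirtualTime` and `commitOrder` names are assumptions made for this example; the slides do not specify how tiebreakers are assigned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Virtual time: a timestamp plus a tiebreaker (widths are assumed here).
struct VirtualTime {
    std::uint64_t timestamp;
    std::uint64_t tiebreaker;
};

// Timestamp decides first; the tiebreaker only matters on equal timestamps.
inline bool operator<(const VirtualTime& a, const VirtualTime& b) {
    return std::tie(a.timestamp, a.tiebreaker) <
           std::tie(b.timestamp, b.tiebreaker);
}

// Tasks may finish executing in any order. Sorting the finished tasks by
// virtual time gives the order in which they are allowed to commit.
inline std::vector<VirtualTime> commitOrder(std::vector<VirtualTime> finished) {
    std::sort(finished.begin(), finished.end());
    assert(std::is_sorted(finished.begin(), finished.end()));
    return finished;
}
```

With this comparison, two tasks that share timestamp 52 are still strictly ordered, here by tiebreakers 7 and 9.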

slide-14
SLIDE 14

Outline

  • Background on speculative parallelism
  • Pitfalls of speculative parallelism with conventional multithreading
  • SAM on in-order cores
  • SAM on out-of-order cores

slide-20
SLIDE 20

Pitfalls of Speculation-Oblivious Multithreading

System configuration:

  • 64-core SMT system
  • In-order core with 2-wide issue
  • Speculation-oblivious round-robin order

[Figure legend: micro-ops issued from committed tasks; micro-ops issued from aborted tasks; no ready micro-ops to issue; resource stalls]

Insights:

  • 1. Multithreading can be highly beneficial

However, multithreading can also lead to:

  • 2. Increased aborts
  • 3. Inefficient use of speculation resources

Unlikely-to-commit tasks hurt the throughput of likely-to-commit ones

slide-21
SLIDE 21

Speculation-Aware Multithreading

Prioritize threads according to their conflict resolution priorities

  • Reduce speculation resource stalls (tasks commit early)
  • Reduce aborts (focus resources on tasks likely to commit)

slide-22
SLIDE 22

Outline

  • Background on speculative parallelism
  • Pitfalls of speculative parallelism with conventional multithreading
  • SAM on in-order cores
  • SAM on out-of-order cores

slide-26
SLIDE 26

SAM on in-order cores

[Figure: in-order SMT pipeline with Fetch, Decode, thread micro-op queues, SMT Issue, register files, and two pipes (Pipe 0, Pipe 1) feeding Int ALU, FP ALU, Int ALU, and Mem/DCache units.]

The task unit sends conflict resolution priority updates (Virtual Times) to the core

SMT issue logic: Sort the threads' virtual times into SAM issue priorities (higher is better), then take the Max over Ready threads to pick the Issue ThreadID

  • Virtual Times: 52:9, 52:7, 17:1, 95:4
  • SAM issue priorities: 3, 2, 4, 1
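The sort / max / ready selection can be sketched as follows. This is an illustrative model, not the hardware design: `ThreadState`, `samPriorities`, and `selectIssueThread` are hypothetical names, the sketch assumes an earlier virtual time means a higher priority and that virtual times are unique, and the sample values in the note below are only loosely based on the figure (the text does not preserve which value belongs to which thread).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

struct ThreadState {
    std::uint64_t timestamp;   // virtual time, first component
    std::uint64_t tiebreaker;  // virtual time, second component
    bool ready;                // has a micro-op that can issue this cycle
};

// Sort step: the thread with the earliest virtual time gets priority N,
// the latest gets 1 (higher is better). Assumes unique virtual times.
std::vector<int> samPriorities(const std::vector<ThreadState>& threads) {
    std::vector<std::size_t> order(threads.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return std::tie(threads[a].timestamp, threads[a].tiebreaker) <
               std::tie(threads[b].timestamp, threads[b].tiebreaker);
    });
    std::vector<int> prio(threads.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank) {
        prio[order[rank]] = static_cast<int>(order.size() - rank);
    }
    return prio;
}

// Max-over-ready step: index of the ready thread with the highest
// priority, or -1 when no thread can issue this cycle.
int selectIssueThread(const std::vector<ThreadState>& threads) {
    const std::vector<int> prio = samPriorities(threads);
    assert(prio.size() == threads.size());
    int best = -1;
    for (std::size_t i = 0; i < threads.size(); ++i) {
        if (threads[i].ready && (best < 0 || prio[i] > prio[best])) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Given threads with virtual times 52:9, 52:7, 17:1, and 95:4 where the 17:1 thread is stalled, this model issues from the 52:7 thread, the earliest one that is ready. A round-robin policy would instead rotate through all ready threads regardless of their chance to commit.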

slide-27
SLIDE 27

Experimental Methodology

Baseline System

  • Swarm + Wait-N-GoTM [Jafri et al., ASPLOS’13] conflict resolution techniques
  • Cycle-accurate, event-driven, Pin-based simulator
  • Models systems of up to 64 cores
  • Cores: 2-wide issue, up to 8 threads per core

Benchmarks

  • Ordered: Swarm [Jeffrey et al., MICRO’15, MICRO’16], 8 benchmarks
  • Unordered: STAMP [Minh et al., IISWC’08], 8 benchmarks

slide-33
SLIDE 33

SAM makes multithreading more effective

[Figure: performance on ordered and unordered benchmarks for 1 thread, 8-thread round robin, and 8-thread SAM]

  • 8-threaded cores outperform single-threaded cores by 1.85x
  • With SAM, the benefit increases to 2.33x

slide-36
SLIDE 36

Why does SAM help?

  • SAM matches round robin (RR) when there are no pathologies
  • SAM reduces wasted work
  • SAM reduces resource stalls

[Figure legend: micro-ops issued (committed, aborted); unused issue slots by reason (resource, not ready, other)]

slide-37
SLIDE 37

Outline

  • Background on speculative parallelism
  • Pitfalls of speculative parallelism with conventional multithreading
  • SAM on in-order cores
  • SAM on out-of-order cores

slide-38
SLIDE 38

SAM on out-of-order cores

Unlike in-order cores, priorities affect pipeline efficiency

  • A single thread can clog core resources
  • Increased wrong path execution

Despite these, prioritizing tasks is better

Need for aggressive prioritization affects core design

  • Shared, not partitioned ROBs

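[Figure: out-of-order SMT pipeline with Fetch, Decode, thread micro-op queues, SMT Issue, issue buffer, physical register file, reorder buffer, Pipe 0, and Pipe 1. The task unit sends conflict resolution priority updates to the issue logic.]

Per-thread values shown in the figure:

  • In-flight uops (for ICount): 3, 9, 4, 2
  • Conflict res. priorities: 2, 3, 2, 1
  • SAM priorities: 3, 4, 2, 1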

slide-42
SLIDE 42

SAM tradeoffs with out-of-order cores

Baseline policy - ICount (IC)

[Figure: sssp, 8 threads. Legend: micro-ops issued and unused issue slots (by reason); categories: committed, aborted, resource, not ready, other, wrong path]

SAM is more beneficial with dynamically shared ROBs

  • Reduces aborts + resource stalls

But reduced pipeline efficiency

  • Increase in wrong-path issues + not-ready stalls

slide-48
SLIDE 48

Adaptive SAM policy

Hardware counters to track cycles:

  • Cycles lost to task-level speculation = Aborted + Resource
  • Cycles lost to pipeline inefficiencies = Not ready + Wrong path

If cycles lost to task-level speculation > cycles lost to pipeline inefficiencies, use SAM (true); otherwise use ICount (false)

[Figure legend: micro-ops issued and unused issue slots (by reason); categories: committed, aborted, resource, not ready, other, wrong path]
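The adaptive decision can be written down in a few lines. This is a sketch of the comparison as it reads on the slide, not the hardware implementation: `CycleCounters`, `IssuePolicy`, and `choosePolicy` are hypothetical names, and the sampling interval and counter reset behavior are not given in the slides.

```cpp
#include <cassert>
#include <cstdint>

enum class IssuePolicy { SAM, ICount };

// Cycle counters named after the four categories on the slide.
// How often they are sampled and reset is an open assumption here.
struct CycleCounters {
    std::uint64_t aborted;    // cycles spent on work that was later aborted
    std::uint64_t resource;   // cycles stalled on speculation resources
    std::uint64_t notReady;   // cycles with no ready micro-op to issue
    std::uint64_t wrongPath;  // cycles spent on wrong-path micro-ops
};

// Use SAM when task-level speculation costs more cycles than pipeline
// inefficiencies; otherwise fall back to ICount.
inline IssuePolicy choosePolicy(const CycleCounters& c) {
    const std::uint64_t speculationLoss = c.aborted + c.resource;
    const std::uint64_t pipelineLoss = c.notReady + c.wrongPath;
    return (speculationLoss > pipelineLoss) ? IssuePolicy::SAM
                                            : IssuePolicy::ICount;
}
```

The strict comparison means a tie selects ICount, matching the "True: use SAM, False: use ICount" branches on the slide.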

slide-51
SLIDE 51

SAM on OoO cores (all benchmarks)

At 8 threads / core:

  • Multithreading improves performance over single-threaded cores by 1.1x
  • With SAM, improvement rises to 1.5x

Adaptive policy slightly increases performance at 2 and 4 threads

[Figure: average over all benchmarks. Legend: micro-ops issued and unused issue slots (by reason); categories: committed, aborted, resource, not ready, other, wrong path]

slide-52
SLIDE 52

Conclusion


Conventional multithreading causes performance pathologies on speculative workloads

  • Increase in aborted work
  • Inefficient use of speculation resources

Speculation Aware Multithreading (SAM)

  • Prioritize threads running tasks more likely to commit

SAM makes multithreading more useful

slide-53
SLIDE 53

Questions?
