Non-Speculative Store Coalescing in Total Store Order (A. Ros & S. Kaxiras, ISCA’18)



SLIDE 1

NON-SPECULATIVE STORE COALESCING IN TOTAL STORE ORDER

Alberto Ros (Universidad de Murcia, aros@ditec.um.es)

Stefanos Kaxiras (Uppsala University, stefanos.kaxiras@it.uu.se)

June 4th, 2018

  • A. Ros & S. Kaxiras

ISCA’18 @ Los Angeles, CA, USA June 4th, 2018 1

SLIDE 6

OUTLINE

Goal: Coalescing stores into a single write operation to reduce processor stalls and memory accesses

Problem: Coalescing can break store order (and programming intuition for x86 processors)

Solution: Atomicity gives the illusion of order but is hard to scale and costly (e.g., centralized, transactional)

Challenge: To perform multiple stores atomically without speculation (without rollback) and in a distributed manner

We present the first solution to this challenge

SLIDE 7

OUTLINE

  • 1. GOAL: COALESCING STORES
  • 2. PROBLEM: COALESCING BREAKS STORE ORDER
  • 3. SOLUTION: ATOMICITY
  • 4. CHALLENGE: DISTRIBUTED, NON-SPECULATIVE ATOMICITY
  • 5. RESULTS
  • 6. CONCLUSIONS

SLIDE 13

STORE BUFFER AND TOTAL STORE ORDER

Store operations in current x86 processors

  • The processor issues its stores (st a, then st b) in program order
  • Each store is placed in the store buffer, between the processor and memory
  • The store buffer drains to memory in First-In First-Out (FIFO) order: a first, then b
  • Memory observes the stores in program order: Total Store Order (TSO)
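
The FIFO behavior described on this slide can be sketched in a few lines of Python. This is a toy model written for illustration only: the class name `FifoStoreBuffer`, its methods, and the capacity of 2 are assumptions, not anything taken from the talk or from a real processor.

```python
from collections import deque


class FifoStoreBuffer:
    """Toy non-coalescing store buffer (hypothetical model, not the authors')."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()   # oldest store sits at the left
        self.memory_log = []     # order in which memory observes the stores

    def store(self, addr, value):
        """Insert a store; return False (a stall) when no entry is free."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append((addr, value))
        return True

    def drain_one(self):
        """Write the oldest store to memory (First-In First-Out)."""
        if self.entries:
            self.memory_log.append(self.entries.popleft())


sb = FifoStoreBuffer(capacity=2)
sb.store("a", 1)
sb.store("b", 2)
sb.drain_one()
sb.drain_one()
print(sb.memory_log)  # [('a', 1), ('b', 2)]
```

Because entries leave only from the head, memory sees the stores in the order the processor issued them, which is the store-order part of TSO.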

SLIDE 19

LIMITATIONS AND THE SOLUTION OF COALESCING

Store buffer without coalescing:

  • With a and b already in the store buffer, a new store (st c) finds no free entry ⇒ Stall
  • Individual writes

Coalescing store buffer:

  • Stores to the same cache line (same color in the example, here a and c) can coalesce in a single write
  • st c merges into the entry that holds a ⇒ No stall
  • Coalesced writes

Performance and energy improvements
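
As a rough illustration of the difference between the two buffers, the sketch below merges stores that fall in the same cache line instead of allocating a new entry. The 64-byte line size, the addresses, and the class `CoalescingStoreBuffer` are assumptions made for the example. Note that this naive version ignores store order entirely, which is exactly the problem the talk turns to next.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; the slides do not state it)


def line_of(addr):
    """Cache-line number of a byte address."""
    return addr // LINE_SIZE


class CoalescingStoreBuffer:
    """Toy store buffer that merges stores to the same cache line."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # each entry: {"line": int, "data": {addr: value}}

    def store(self, addr, value):
        """Return 'coalesced', 'allocated', or 'stall'."""
        for entry in self.entries:
            if entry["line"] == line_of(addr):
                entry["data"][addr] = value   # merge into the existing entry
                return "coalesced"
        if len(self.entries) == self.capacity:
            return "stall"
        self.entries.append({"line": line_of(addr), "data": {addr: value}})
        return "allocated"


# a and c share a cache line; b lives in a different one (made-up addresses).
A, B, C = 0x1000, 0x2000, 0x1008
sb = CoalescingStoreBuffer(capacity=2)
print([sb.store(x, 1) for x in (A, B, C)])
# ['allocated', 'allocated', 'coalesced']
```

With the same two entries, a non-coalescing buffer would have stalled on the third store.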


SLIDE 24

THE PROBLEM OF COALESCING STORES

The coalescing store buffer holds two entries: b (blue cache line) and the coalesced pair a, c (green cache line). Program order is a, b, c.

  • First green ⇒ c overtakes b
  • First blue ⇒ b overtakes a

Store order? Whichever entry is written first, one store overtakes an older one.

In the paper: A new litmus test that captures a TSO violation when breaking store order
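
The two cases on this slide can be checked mechanically. The sketch below assumes program order a, b, c and two buffer entries named after the slide's colors; the helper names `visible_order` and `overtakes` are made up for the example. It is not the litmus test from the paper, only a check that either write order lets a younger store become visible before an older one.

```python
from itertools import permutations

PROGRAM_ORDER = ["a", "b", "c"]
# Two coalesced entries, named after the colors used on the slide.
ENTRIES = {"green": ["a", "c"], "blue": ["b"]}


def visible_order(entry_order):
    """Order in which stores become visible if entries are written one by one."""
    return [store for entry in entry_order for store in ENTRIES[entry]]


def overtakes(order):
    """Pairs (younger, older) where the younger store becomes visible first."""
    pos = {s: i for i, s in enumerate(order)}
    return [(y, o)
            for i, o in enumerate(PROGRAM_ORDER)
            for y in PROGRAM_ORDER[i + 1:]
            if pos[y] < pos[o]]


for entry_order in permutations(ENTRIES):
    print(entry_order, overtakes(visible_order(entry_order)))
# ('green', 'blue') [('c', 'b')]
# ('blue', 'green') [('b', 'a')]
```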


SLIDE 26

ATOMICITY: ILLUSION OF STORE ORDER

Store order ⇒ Atomicity: if the writes to b and to a, c become visible together, no other processor can observe one store overtaking another.

Coalescing forms atomic write groups

SLIDE 31

FORMING ATOMIC WRITE GROUPS

Match younger write to the same cache line

  • The store buffer holds a and b; st c arrives and coalesces with a. Entries: (a, c) and b. Writes to a, b, and c are indivisible
  • st d arrives and takes its own entry. Entries: (a, c), b, and d
  • st e arrives and coalesces with b. Entries: (a, c), (b, e), and d. Writes to a, b, c, d, and e are indivisible
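
One plausible reading of this sequence is: when a younger store coalesces with an older entry, every entry allocated after that older entry joins its atomic group. The sketch below implements that reading only. The class `AtomicGroupBuffer`, its fields, and the line labels (`line_ac`, `line_be`, `line_d`) are hypothetical, and the rule itself is an interpretation of the slides rather than the hardware design from the paper.

```python
class AtomicGroupBuffer:
    """Toy coalescing store buffer that tracks atomic write groups."""

    def __init__(self):
        self.entries = []      # oldest first; each entry: [line, stores, group_id]
        self._next_group = 0

    def store(self, name, line):
        for i, entry in enumerate(self.entries):
            if entry[0] == line:
                entry[1].append(name)          # coalesce with the older write
                self._merge_from(i)
                return
        self.entries.append([line, [name], self._next_group])
        self._next_group += 1

    def _merge_from(self, i):
        """Everything allocated after entry i joins entry i's atomic group."""
        target = self.entries[i][2]
        absorbed = {e[2] for e in self.entries[i + 1:]}
        for e in self.entries:
            if e[2] in absorbed:
                e[2] = target

    def groups(self):
        """Return the atomic groups as sorted lists of store names."""
        by_id = {}
        for _, stores, gid in self.entries:
            by_id.setdefault(gid, []).extend(stores)
        return sorted(sorted(g) for g in by_id.values())


buf = AtomicGroupBuffer()
for name, line in [("a", "line_ac"), ("b", "line_be"), ("c", "line_ac")]:
    buf.store(name, line)
print(buf.groups())   # [['a', 'b', 'c']]
buf.store("d", "line_d")
print(buf.groups())   # [['a', 'b', 'c'], ['d']]
buf.store("e", "line_be")
print(buf.groups())   # [['a', 'b', 'c', 'd', 'e']]
```

The last step shows why groups can grow quickly: e merges into an entry older than d, so d is pulled into the same group.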

SLIDE 34

KNOWN WAYS TO PERFORM WRITES ATOMICALLY

  • 1. Mutual exclusion (TCC ISCA’04, BulkSC ISCA’07)
  • 2. Transactional (Oklahoma PDTSA’93, Store-Wait-Free ISCA’07)

Example: Proc 1 and Proc 2 each want to write a and b to memory atomically.

Mutual exclusion: the processors compete for permission to write (“Me first!”)

  • Centralized and non-scalable solution

Transactional: both processors write, a conflict is detected, and one of them aborts

  • Speculation: rollback on conflict
  • Canceling memory writes is a costly operation


SLIDE 38

A NEW PERSPECTIVE

  • Writing a number of cache lines atomically is similar to the problem of acquiring a number of locks in parallel programming
  • Deadlock, if locks are taken in opposite order (Proc 1 holds a and waits for b while Proc 2 holds b and waits for a)

Perform writes following a global order: deadlock-free, considering unlimited resources [1]. Both processors start with a; one takes it and the other waits.

[1] Dijkstra, “Hierarchical ordering of sequential processes”
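
The lock analogy can be made concrete with ordinary software locks. In the sketch below, two threads name the same two locks in opposite orders, but each sorts its list before acquiring, so both follow one global order and no circular wait can form. The function `write_atomically` and the thread names are made up for the example; this shows the general lock-ordering idea, not the hardware mechanism.

```python
import threading


def write_atomically(locks, wanted, log, name):
    """Acquire every lock in `wanted` in one global (sorted) order, then release."""
    ordered = sorted(wanted)            # global order: sort by lock name
    for key in ordered:
        locks[key].acquire()
    log.append(name)                    # "perform the writes" while holding all locks
    for key in reversed(ordered):
        locks[key].release()


locks = {"a": threading.Lock(), "b": threading.Lock()}
log = []
# The two processors list their lines in opposite orders; sorting removes the cycle.
t1 = threading.Thread(target=write_atomically, args=(locks, ["a", "b"], log, "proc1"))
t2 = threading.Thread(target=write_atomically, args=(locks, ["b", "a"], log, "proc2"))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)
print(sorted(log))  # ['proc1', 'proc2']
```

Without the `sorted` call, the second thread could take b while the first holds a, and each would wait for the other forever.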

SLIDE 48

LEXICOGRAPHICAL ORDER

System: Proc 1 and Proc 2, each with an atomic group that writes a and b, and memory (Mem)

  • Private caches
  • Shared directory

Write in lexicographical (Lex) order ⇒ physical address. In both groups, a is written first (1) and b second (2).

  • Both processors request a, and Proc 1 obtains it
  • A write locks the cache line permission (lock bit)
  • Proc 2 must wait. Group writes have been ordered ⇒ Proc 1 first
  • A “conflict” between atomic groups always happens at their minimum common address
  • Proc 1 then obtains b and completes its group
  • All lock bits reset in bulk, and Proc 2 can proceed with a and b
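
The sequence on these slides can be approximated by a small round-based simulation. Each group takes the lock bit of its next line if that line is free, and releases all of its lock bits at once when the group is complete. Everything here is an illustrative assumption: the function `run_lex_order`, the `lex` flag, the round-robin scheduling, and the addresses. Unlimited resources are assumed, as on the slides up to this point.

```python
def run_lex_order(groups, lex=True):
    """Simulate atomic groups taking cache-line lock bits one line per round.

    `groups` maps a processor name to the list of line addresses it writes.
    With lex=True each group requests its lines in ascending address order.
    Returns the order in which the groups complete.
    """
    order = sorted if lex else list
    pending = {p: order(lines) for p, lines in groups.items()}
    held = {p: [] for p in groups}
    owner = {}                      # line address -> processor holding its lock bit
    finished = []
    while pending:
        progress = False
        for proc in list(pending):
            todo = pending[proc]
            if todo and todo[0] not in owner:     # lock bit free: take it
                line = todo.pop(0)
                owner[line] = proc
                held[proc].append(line)
                progress = True
            if not todo:                          # all lines locked: group is written
                for line in held[proc]:           # reset its lock bits in bulk
                    del owner[line]
                finished.append(proc)
                del pending[proc]
                progress = True
        if not progress:
            raise RuntimeError("deadlock")
    return finished


groups = {"proc1": [0xA0, 0xB0], "proc2": [0xB0, 0xA0]}
print(run_lex_order(groups))          # ['proc1', 'proc2']
```

Calling `run_lex_order(groups, lex=False)` on the same input raises `RuntimeError("deadlock")`, because each processor then holds the line the other one needs.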

SLIDE 52

RESOURCE-CONFLICT DEADLOCKS

Lex order is deadlock-free, assuming unlimited resources

But resources are limited

Locking cache lines introduces resource-conflict deadlocks

⇒ Need resources to keep all locks simultaneously

  • 1. Intra-group: resource deadlocks for a single group (example: one group writing a and b)
  • 2. Inter-group: resource deadlocks for multiple groups (example: one group writing a and b, another writing d and e)

SLIDE 54

INTRA-GROUP CONFLICTS IN PRIVATE RESOURCES

Caches must be able to hold all locked cache lines

E.g., if direct-mapped cache and a and b map to the same set ⇒ deadlock (a is locked in the set, so there is no place to hold b)

Sub-address lex order

$\mathrm{rank} = \mathrm{addr}_{\mathrm{line}} \bmod (\mathrm{sets}_{\mathrm{cache}} \times \mathrm{assoc}_{\mathrm{cache}})$

⇒ Addresses of the same rank cannot be in the same atomic group

Reduces coalescing opportunities
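
The rank rule lends itself to a short numeric check. The sketch below computes the rank as the line address modulo sets × assoc and refuses to add a line to a group that already contains a line of the same rank. It also assumes conventional modulo set indexing (set = line address % sets), which the slides do not state; under that assumption, lines with distinct ranks can never place more than `assoc` lines in one set. The function names are hypothetical.

```python
def rank(line_addr, sets, assoc):
    """Sub-address rank of a cache-line address: addr_line % (sets * assoc)."""
    return line_addr % (sets * assoc)


def can_join(group, line_addr, sets, assoc):
    """A line may join an atomic group only if no member already has its rank."""
    new_rank = rank(line_addr, sets, assoc)
    return all(rank(member, sets, assoc) != new_rank for member in group)


def max_lines_per_set(group, sets):
    """Largest number of group members that map to one cache set."""
    counts = {}
    for line_addr in group:
        index = line_addr % sets          # modulo set indexing (assumed)
        counts[index] = counts.get(index, 0) + 1
    return max(counts.values(), default=0)


# Direct-mapped cache with 4 sets: lines 3 and 7 collide in set 3.
print(can_join([3], 7, sets=4, assoc=1))   # False
print(can_join([3], 6, sets=4, assoc=1))   # True

# 4 sets x 2 ways: greedily build the largest group from lines 0..63.
group = []
for line in range(64):
    if can_join(group, line, sets=4, assoc=2):
        group.append(line)
print(len(group), max_lines_per_set(group, sets=4))   # 8 2
```

The last line shows the cost noted on the slide: a group can never hold more lines than there are ranks, so some coalescing opportunities are lost.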

SLIDE 58

INTER-GROUP CONFLICTS IN SHARED RESOURCES

Proc 1 holds an atomic group that writes a and b; Proc 2 holds one that writes d and e. In full-address lex order these are positions 1, 2 and 4, 5, so the groups share no address.

  • Proc 1 locks a and Proc 2 locks d in the shared directory
  • If the shared structure has no room left for b and e, neither group can complete ⇒ Deadlock

$\mathrm{rank} = \mathrm{addr}_{\mathrm{line}} \bmod (\mathrm{sets}_{\mathrm{dir}} \times \mathrm{assoc}_{\mathrm{dir}})$

The formation of atomic groups with sub-address order prevents different atomic groups from overflowing shared structures

Sub-address order: both groups now have ranks 1 and 2, so they conflict on the first rank. Proc 1 takes a and Proc 2 waits.

SLIDE 61

SYSTEM-WIDE SUB-ADDRESS LEX ORDER

Deadlock free: $\mathrm{rank} = \mathrm{addr}_{\mathrm{line}} \bmod \min_i(\mathrm{sets}_i \times \mathrm{assoc}_i)$

Sub-address lex order intuition

  • Each rank in an order either has resources or conflicts with the minimum common address when taking the resource
  • A conflict orders the atomic group writes

Simple implementation in the store buffer

  • Just stop coalescing on rank conflict

No significant protocol changes

  • Just request waiting and prefetch nacks
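
A simplified reading of "stop coalescing on rank conflict" is sketched below: the number of ranks is the smallest sets × assoc product over all structures, and a group is closed as soon as an incoming line has a rank already present in it. The function names, the 64-byte line size, the 8-way L1, and the directory geometry are all assumptions chosen for illustration. With a 32KB L1 and 64-byte lines, the L1 holds 512 lines, which would match the 512 ranks quoted later in the talk's simulation setup.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; not stated on the slides)


def system_ranks(structures):
    """Number of ranks: the minimum sets * assoc over all (sets, assoc) pairs."""
    return min(sets * assoc for sets, assoc in structures)


def split_into_groups(line_addrs, num_ranks):
    """Close the current atomic group whenever a new line's rank is already in it."""
    groups, current, used = [], [], set()
    for line_addr in line_addrs:
        if line_addr in current:
            continue                      # same line again: plain coalescing
        r = line_addr % num_ranks
        if r in used:                     # rank conflict: stop coalescing here
            groups.append(current)
            current, used = [], set()
        current.append(line_addr)
        used.add(r)
    if current:
        groups.append(current)
    return groups


l1_lines = 32 * 1024 // LINE_SIZE            # 512 lines in a 32KB L1
structures = [(l1_lines // 8, 8),            # L1: 64 sets x 8 ways (assumed)
              (1024, 16)]                    # directory: made-up geometry
num_ranks = system_ranks(structures)
print(num_ranks)                                       # 512
print(split_into_groups([0, 1, 512, 2], num_ranks))    # [[0, 1], [512, 2]]
```

Line 512 has the same rank as line 0, so it starts a new group instead of joining the first one.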


SLIDE 64

SIMULATION ENVIRONMENT

Schemes evaluated:

  • NSB: Unified SQ/SB, no coalescing (Intel-like)
  • LSB: Split SQ/SB, line coalescing
  • CSB-TSO: Split SQ/SB, coalescing, TSO
  • CSB-RC: Split SQ/SB, coalescing, release consistency

GEMS + in-house TSO processor model

  • 8 out-of-order Haswell-like cores
  • Store queue (SQ) + store buffer (SB): 42 entries
  • Lex order: 512 ranks (L1 cache: 32KB)

Benchmarks: Parsec-3.0

SLIDE 66

ENERGY CONSUMPTION (L1 & SQ/SB)

  • Normalized to NSB
  • Reductions of writes due to coalescing
  • Reductions of reads due to hits in the SB (more coalescing)

[Figure: normalized energy consumption (0.0 to 1.0) of 1. NSB, 2. LSB, 3. CSB-TSO, and 4. CSB-RC for blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, freqmine, streamcluster, swaptions, vips, x264, and the average. Legend: L1_Tag, L1_Read, L1_Write, SQSB_Search, SQSB_Read, SQSB_Write.]

  • CSB-TSO: 23.3% reduction w.r.t. NSB
  • CSB-TSO on par with CSB-RC

SLIDE 68

EXECUTION TIME

  • Normalized to NSB
  • Improvements due to fewer processor stalls

[Figure: normalized execution time (0.0 to 1.1) of NSB, LSB, CSB-TSO, and CSB-RC for blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, freqmine, streamcluster, swaptions, vips, x264, and the geometric mean.]

  • CSB-TSO improves on NSB by 6.2%
  • CSB-TSO close to CSB-RC


SLIDE 72

CONCLUSIONS

First solution to perform writes atomically

⇒ Non-centralized
⇒ Non-speculative
⇒ Deadlock-free

Thanks to LEX order

⇒ Non-deadlocking
⇒ Accommodates resource limitations

The result is a simpler, higher performing solution

SLIDE 73

Questions?