Memory Consistency Models - Virendra Singh, Computer Architecture & Dependable Systems Lab, IIT Bombay (PowerPoint presentation)


SLIDE 1

Memory Consistency Models

Virendra Singh
Computer Architecture & Dependable Systems Lab
Dept. of Electrical Engineering, Indian Institute of Technology Bombay

Courtesy: Sarita Adve, University of Illinois at Urbana-Champaign
CS-683: Advanced Computer Architecture (08 Nov 2013)

SLIDE 2

Memory Consistency Model: Definition

The memory consistency model defines the order in which memory operations will appear to execute ⇒ what value can a read return? It affects both ease of programming and performance.


SLIDE 4

Implicit Memory Model

Sequential consistency (SC) [Lamport] Result of an execution appears as if

  • All operations executed in some sequential order
  • Memory operations of each process in program order

No caches, no write buffers

[Figure: processors P1 ... Pn connected directly to a single shared memory.]

Two aspects: program order and atomicity.

SLIDE 5

Understanding Program Order – Example 1

Initially X = 2

P1:              P2:
r0 = Read(X)     r1 = Read(X)
r0 = r0 + 1      r1 = r1 + 1
Write(r0, X)     Write(r1, X)

Possible execution sequences:

Interleaved (x = 3):     Serialized (x = 4):
P1: r0 = Read(X)         P2: r1 = Read(X)
P2: r1 = Read(X)         P2: r1 = r1 + 1
P1: r0 = r0 + 1          P2: Write(r1, X)
P2: r1 = r1 + 1          P1: r0 = Read(X)
P1: Write(r0, X)         P1: r0 = r0 + 1
P2: Write(r1, X)         P1: Write(r0, X)
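The two execution sequences above can be replayed deterministically as straight-line code; this sketch (function names are ours) shows why the interleaved schedule loses an increment while the serialized one keeps both:

```cpp
// Replay the slide's two schedules on X (initially 2).
int interleaved() {
    int x = 2;
    int r0 = x;        // P1: r0 = Read(X)
    int r1 = x;        // P2: r1 = Read(X)  -- both read before either writes
    r0 = r0 + 1;       // P1
    r1 = r1 + 1;       // P2
    x = r0;            // P1: Write(r0, X)
    x = r1;            // P2: Write(r1, X) overwrites P1's result
    return x;          // one increment is lost
}

int serialized() {
    int x = 2;
    int r1 = x; r1 = r1 + 1; x = r1;   // P2 runs to completion
    int r0 = x; r0 = r0 + 1; x = r0;   // then P1
    return x;          // both increments take effect
}
```

Both schedules are sequentially consistent; SC alone does not rule out the lost update, which is exactly the point the next slide makes about atomicity.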

SLIDE 6

Atomic Operations

  • Sequential consistency has nothing to do with atomicity, as shown by the example on the previous slide
  • For atomicity, use atomic operations such as exchange
  • exchange(r, M): atomically swap the contents of register r and memory location M

r0 = 1;
do exchange(r0, S) while (r0 != 0);   // S is a memory location
// enter critical section
.....
// exit critical section
S = 0;
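The exchange-based lock above maps directly onto C++11 atomics; a minimal sketch (lock word and function names are ours):

```cpp
#include <atomic>

// S is the lock word: 0 = free, nonzero = held.
std::atomic<int> S{0};

void lock() {
    int r0 = 1;
    do {
        r0 = S.exchange(r0);   // atomically swap r0 with S
    } while (r0 != 0);         // loop until we stored 1 while S was 0
}

void unlock() {
    S.store(0);                // S = 0: exit critical section
}
```

Because the swap is atomic, exactly one contender observes the old value 0 and enters the critical section; the lost-update outcome of the previous slide cannot occur.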


SLIDE 9

Understanding Program Order – Example 1

Initially Flag1 = Flag2 = 0

P1:                  P2:
Flag1 = 1            Flag2 = 1
if (Flag2 == 0)      if (Flag1 == 0)
    critical section     critical section

Execution:
P1 (Operation, Location, Value)    P2 (Operation, Location, Value)
Write, Flag1, 1                    Write, Flag2, 1
Read, Flag2, 0                     Read, Flag1, 0

SLIDE 10

Understanding Program Order – Example 1

P1: Write, Flag1, 1 then Read, Flag2, 0
P2: Write, Flag2, 1 then Read, Flag1, 0

Both reads returning 0 can happen if:

  • Write buffers with read bypassing
  • Overlap, reorder write followed by read in h/w or compiler
  • Allocate Flag1 or Flag2 in registers

On AlphaServer, NUMA-Q, T3D/T3E, Ultra Enterprise Server
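The flag example can be written with C++11 atomics; under the default `memory_order_seq_cst` the forbidden outcome (both reads return 0) cannot occur, because some sequentially consistent interleaving must order at least one write before the other's read. A sketch, with our own function name:

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Run the two-flag example once under seq_cst and report what each read saw.
std::pair<int, int> run_flags() {
    std::atomic<int> Flag1{0}, Flag2{0};
    int r1 = -1, r2 = -1;
    std::thread p1([&] { Flag1.store(1); r1 = Flag2.load(); });  // P1
    std::thread p2([&] { Flag2.store(1); r2 = Flag1.load(); });  // P2
    p1.join();
    p2.join();
    return {r1, r2};   // at least one of r1, r2 must be 1 under SC
}
```

With `memory_order_relaxed` (or plain non-atomic variables) both reads may return 0, which is exactly the write-buffer-bypassing behaviour the slide describes for real machines.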


SLIDE 13

Understanding Program Order - Example 2

Initially A = Flag = 0

P1:        P2:
A = 23;    while (Flag != 1) {;}
Flag = 1;  ... = A;

P1:              P2:
Write, A, 23     Read, Flag, 0
Write, Flag, 1   Read, Flag, 1
                 Read, A, 0

Can happen if hardware or the compiler overlaps or reorders writes or reads. Observed on AlphaServer, T3D/T3E.
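The standard repair for this example is to pair the Flag write and read as release/acquire synchronization, which forbids the "Read, A, 0" outcome. A C++11 sketch (the function name is ours):

```cpp
#include <atomic>
#include <thread>

// Producer publishes A via Flag; acquire/release makes A = 23 visible.
int handoff() {
    int A = 0;
    std::atomic<int> Flag{0};
    int result = -1;
    std::thread p1([&] {
        A = 23;                                    // data write
        Flag.store(1, std::memory_order_release);  // publish: A ordered before Flag
    });
    std::thread p2([&] {
        while (Flag.load(std::memory_order_acquire) != 1) {}  // spin until published
        result = A;                                // guaranteed to see 23
    });
    p1.join();
    p2.join();
    return result;
}
```

The release store keeps the write of A from drifting after Flag, and the acquire load keeps the read of A from drifting before it; together they restore exactly the ordering the unsynchronized version loses.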

SLIDE 14

Understanding Program Order: Summary

SC limits program order relaxation: Write → Read Write → Write Read → Read, Write

SLIDE 15

Sequential Consistency

SC constrains all memory operations: Write → Read Write → Write Read → Read, Write

  • Simple model for reasoning about parallel programs
  • But intuitively reasonable reordering of memory operations in a uniprocessor may violate the sequential consistency model

Modern microprocessors reorder operations all the time to obtain performance (write buffers, overlapped writes, non-blocking reads, ...). Question: how do we reconcile the sequential consistency model with the demands of performance?

SLIDE 16

Understanding Atomicity – Caches 101

A mechanism needed to propagate a write to other copies ⇒ Cache coherence protocol

[Figure: processors P1 ... Pn, each with a cache holding A = OLD, connected by a bus to memory, which also holds A = OLD.]

SLIDE 17

Notes

  • Sequential consistency is not really about memory operations from different processors (although we do need to make sure memory operations are atomic).
  • Sequential consistency is not really about dependent memory operations in a single processor's instruction stream (these are respected even by processors that reorder instructions).
  • The problem of relaxing sequential consistency is really all about independent memory operations in a single processor's instruction stream that have some high-level dependence (such as locks guarding data) that should be respected to obtain correct results.

SLIDE 18

Relaxing Program Orders

  • Weak ordering:
    • Divide memory operations into data operations and synchronization operations
    • Synchronization operations act like a fence:
      • All data operations before a synch in program order must complete before the synch is executed
      • All data operations after a synch in program order must wait for the synch to complete
      • Synchs are performed in program order
  • Implementation of a fence: the processor keeps a counter that is incremented when a data op is issued and decremented when a data op completes
  • Example: PowerPC has the SYNC instruction (caveat: its semantics are somewhat more complex than described here)
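The counter-based fence bookkeeping described above can be sketched in a few lines (names are ours; a real implementation lives in hardware issue/retire logic, not software):

```cpp
#include <atomic>

// Count of data operations issued but not yet complete.
std::atomic<int> pending{0};

void issue_data_op()    { pending.fetch_add(1); }  // data op leaves the core
void complete_data_op() { pending.fetch_sub(1); }  // data op globally performed

// A synch/fence operation stalls until every earlier data op has completed.
void fence() {
    while (pending.load() != 0) { /* stall */ }
}
```

Data ops after the fence are simply not issued until `fence()` returns, which gives both halves of the weak-ordering rule.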

SLIDE 19

Another model: Release consistency

  • Further relaxation of weak ordering
  • Synchronization accesses are divided into:
    • Acquires: operations like lock
    • Releases: operations like unlock
  • Semantics of an acquire: the acquire must complete before all following memory accesses
  • Semantics of a release:
    • All memory operations before the release must be complete
    • But accesses after the release in program order do not have to wait for the release
    • Operations that follow the release and need to wait must be protected by an acquire
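The acquire/release asymmetry above is exactly what C++11's one-sided memory orders express; a minimal spinlock sketch (names are ours):

```cpp
#include <atomic>

std::atomic<int> lock_word{0};
int guarded = 0;   // illustrative data protected by the lock

// Acquire: completes before all following accesses (they cannot move above it).
void acquire() {
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) {}
}

// Release: all earlier accesses complete first, but later accesses
// need not wait for the release itself.
void release() {
    lock_word.store(0, std::memory_order_release);
}
```

A later critical section still sees the protected data correctly because its own `acquire()` synchronizes with this `release()`; that is the "must be protected by an acquire" clause in action.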

SLIDE 20

Cache Coherence Protocols

How to propagate a write?
  • Invalidate: remove old copies from other caches
  • Update: update old copies in other caches to the new value


SLIDE 22

Understanding Atomicity - Example 1

Initially A = B = C = 0

P1:      P2:      P3:                   P4:
A = 1;   A = 2;   while (B != 1) {;}   while (B != 1) {;}
B = 1;   C = 1;   while (C != 1) {;}   while (C != 1) {;}
                  tmp1 = A;  // 1      tmp2 = A;  // 2

Can happen if the updates of A reach P3 and P4 in different orders. The coherence protocol must serialize writes to the same location (writes to the same location should be seen in the same order by all).

SLIDE 23

Understanding Atomicity - Example 2

Initially A = B = 0

P1:      P2:                   P3:
A = 1;   while (A != 1) {;}   while (B != 1) {;}
         B = 1;               tmp = A;

P1:            P2:           P3:
Write, A, 1    Read, A, 1    Read, B, 1
               Write, B, 1   Read, A, 0

Can happen if a read returns the new value before all copies see it; the "read others' write early" optimization is unsafe.


SLIDE 26

Program Order and Write Atomicity Example

Initially all locations = 0

P1:                 P2:
Flag1 = 1;          Flag2 = 1;
A = 1;              A = 2;
... = A;     // 1   ... = A;     // 2
... = Flag2; // 0   ... = Flag1; // 0

Can happen if reads are satisfied early from the write buffer; the "read own write early" optimization can be unsafe.

SLIDE 27

SC Summary

SC limits:
  • Program order relaxation: Write → Read, Write → Write, Read → Read/Write
  • Reading others' writes early
  • Reading own writes early
  • Unserialized writes to the same location

Alternative: give up sequential consistency and use relaxed models.

SLIDE 28

Note: Aggressive Implementations of SC

Optimizations are actually possible under SC with some care. Hardware has been fairly successful at this; compilers have had only limited success. But that is not the issue here: many current architectures do not give SC, and compiler optimizations under SC remain limited.

SLIDE 29

Classification for Relaxed Models

Typically described in terms of system optimizations - system-centric.
Optimizations:
  • Program order relaxation: Write → Read, Write → Write, Read → Read/Write
  • Read others' write early
  • Read own write early
All models provide a safety net, and all maintain uniprocessor data and control dependences and write serialization.

SLIDE 30

Some Current System-Centric Models

Relaxations: W → R order, W → W order, R → RW order, read others' write early, read own write early.

Model      W→R   W→W   R→RW   Others'   Own   Safety net
IBM 370     ✓     -     -       -       -    serialization instructions
TSO         ✓     -     -       -       ✓    RMW
PC          ✓     -     -       ✓       ✓    RMW
PSO         ✓     ✓     -       -       ✓    RMW, STBAR
WO          ✓     ✓     ✓       -       ✓    synchronization
RCsc        ✓     ✓     ✓       -       ✓    release, acquire, nsync, RMW
RCpc        ✓     ✓     ✓       ✓       ✓    release, acquire, nsync, RMW
Alpha       ✓     ✓     ✓       -       ✓    MB, WMB
RMO         ✓     ✓     ✓       -       ✓    various MEMBARs
PowerPC     ✓     ✓     ✓       ✓       ✓    SYNC

SLIDE 31

System-Centric Models: Assessment

System-centric models provide higher performance than SC, BUT consider the 3P criteria:
  • Programmability? The intuitive interface of SC is lost
  • Portability? Many different models
  • Performance? Can we do better?
We need a higher level of abstraction.

SLIDE 32

An Alternate Programmer-Centric View

Many models give informal software rules for correct results, BUT the rules are often ambiguous when applied generally, and what is a "correct result"? Why not formalize one notion of correctness - the base model - so that a relaxed model = software rules that give the appearance of the base model. Which base model? What rules? What if the rules are not obeyed?

SLIDE 33

Which Base Model?

Choose sequential consistency as the base model. Specify the memory model as a contract: the system gives sequential consistency IF the programmer obeys certain rules. + Programmability + Performance + Portability [Adve and Hill; Gharachorloo, Gupta, and Hennessy]

SLIDE 34

What Software Rules?

Rules must:
  • Pertain to program behavior on an SC system
  • Enable optimizations without violating SC
Possible rules:
  • Prohibit certain access patterns
  • Ask for certain information
  • Use given constructs in prescribed ways
Examples coming up.

SLIDE 35

What if a Program Violates Rules?

What about programs that don't obey the rules?
  • Option 1: Provide a system-centric specification - but this path has pitfalls
  • Option 2: Avoid a system-centric specification - only guarantee that a read returns a value written to its location

SLIDE 36

Programmer-Centric Models

Several models proposed Motivated by previous system-centric optimizations (and more) Data-race-free-0 (DRF0) / properly-labeled-1 model Application to Java

SLIDE 37

The Data-Race-Free-0 Model: Motivation

Different operations have different semantics

P1:        P2:
A = 23;    while (Flag != 1) {;}
B = 37;    ... = B;
Flag = 1;  ... = A;

Flag = synchronization; A, B = data. Data operations can be reordered once data and synchronization are distinguished. Need to:

  • Characterize data / synchronization
  • Prove characterization allows optimizations w/o violating SC
SLIDE 38

Data-Race-Free-0: Some Definitions

Two operations conflict if they access the same location and at least one of them is a write.
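The conflict definition above is simple enough to state as a predicate; a sketch with hypothetical types:

```cpp
#include <string>

// A memory operation: the location it touches and whether it writes.
struct Op {
    std::string loc;
    bool is_write;
};

// Two operations conflict iff same location and at least one write.
bool conflict(const Op& a, const Op& b) {
    return a.loc == b.loc && (a.is_write || b.is_write);
}
```

Two reads of the same location never conflict, and any two operations on different locations never conflict - only the read/write and write/write cases on one location remain, which is what the race definition on the next slide builds on.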

SLIDE 39

Data-Race-Free-0: Some Definitions (Cont.)

(Consider SC executions ⇒ a global total order.) Two conflicting operations race if they are from different processors and execute one after another (consecutively).

P1:               P2:
Write, A, 23
Write, B, 37
                  Read, Flag, 0
Write, Flag, 1
                  Read, Flag, 1
                  Read, B, ___
                  Read, A, ___

Races usually “synchronization,” others “data” Can optimize operations that never race

SLIDE 40

Data-Race-Free-0 (DRF0) Definition

Data-race-free-0 program:
  • All accesses are distinguished as either synchronization or data
  • All races are distinguished as synchronization (in any SC execution)
Data-race-free-0 model: guarantees SC to data-race-free-0 programs. (For others, reads return the value of some write to the location.)

SLIDE 41

Programming with Data-Race-Free-0

Information required: does this operation ever race (in any SC execution)?

1. Write the program assuming SC.
2. For every memory operation specified in the program:
   • Never races? Yes → distinguish it as data
   • No, or don't know (or don't care) → distinguish it as synchronization
SLIDE 42

Programming With Data-Race-Free-0

Programmer’s interface is sequential consistency Knowledge of races needed even with SC “Don't-know” option helps

SLIDE 43

Distinguishing/Labeling Memory Operations

Need to distinguish/label operations at all levels

  • High-level language
  • Hardware

The compiler must translate language labels to hardware labels. There are tradeoffs at all levels: flexibility, ease of use, performance, and interaction with the other levels.

SLIDE 44

Language Support for Distinguishing Accesses

  • Synchronization with special constructs
  • Support to distinguish individual accesses

SLIDE 45

Synchronization with Special Constructs

Example: synchronized in Java. The programmer must ensure races are limited to the special constructs, but the provided construct may be inappropriate for some races, e.g., producer-consumer in Java:

P1:        P2:
A = 23;    while (Flag != 1) {;}
B = 37;    ... = B;
Flag = 1;  ... = A;

SLIDE 46

Distinguishing Individual Memory Operations

Option 1: Annotations at the statement level

P1:                       P2:
data = ON                 synchronization = ON
A = 23;                   while (Flag != 1) {;}
B = 37;                   data = ON
synchronization = ON      ... = B;
Flag = 1;                 ... = A;

Option 2: Declarations at the variable level

synch int: Flag
data int: A, B

SLIDE 47

Distinguishing Individual Memory Operations (Cont.)

Default declarations:
  • To decrease errors, make synchronization the default
  • To decrease the number of additional labels, make data the default

SLIDE 48

Distinguishing/Labeling Operations for Hardware

  • Different flavors of load/store
    • E.g., ld.acq, st.rel in IA-64
  • Fences or memory barrier instructions - most popular today
    • E.g., MB/WMB in Alpha, MEMBAR in SPARC V9
    • For DRF0, insert the appropriate fence before/after each synch - an extra instruction for all synchronization
    • Default = synchronization can give bad performance
  • Special instructions for synchronization
    • E.g., Compare&Swap
SLIDE 49

Interactions Between Language and Hardware

  • If hardware uses fences, the language should not encourage a default of synchronization
  • If hardware only distinguishes based on special instructions, the language should not distinguish individual operations
  • Languages other than Java do not provide explicit support; high-level programmers directly use hardware fences

SLIDE 50

Performance: Data-Race-Free-0 Implementations

Can prove that implementations may:
  • Reorder and overlap data operations between consecutive synchronizations
  • Make data writes non-atomic

P1:        P2:
A = 23;    while (Flag != 1) {;}
B = 37;    ... = B;
Flag = 1;  ... = A;

⇒ Weak ordering obeys data-race-free-0.

SLIDE 51

Data-Race-Free-0 Implementations (Cont.)

DRF0 also allows more aggressive implementations than WO:

  • Don't need Data → Read-synch or Write-synch → Data ordering (like RCsc)

P1:        P2:
A = 23;    while (Flag != 1) {;}
B = 37;    ... = B;
Flag = 1;  ... = A;

  • Can postpone the writes of A and B until "Read, Flag, 1"
  • Can postpone the writes of A and B until the reads of A and B
  • Can exploit the last two observations with lazy invalidations and lazy release consistency on software DSMs

SLIDE 52

Portability: DRF0 Program on System-Centric Models

  • WO - direct port
  • Alpha, RMO - precede a synch write with a fence, follow a synch read with a fence, fence between a synch write and read
  • RCsc - synchronization = competing
  • IBM 370, TSO, PC - replace synch reads with read-modify-writes
  • PSO - replace synch reads with read-modify-writes, precede a synch write with STBAR
  • PowerPC - combination of Alpha/RMO and TSO/PC
  • RCpc - combination of RCsc and PC

SLIDE 53

Data-Race-Free-0 vs. Weak Ordering

  • Programmability: a DRF0 programmer can assume SC; WO requires reasoning with out-of-order execution and non-atomicity
  • Performance: DRF0 allows higher-performance implementations
  • Portability: DRF0 programs are correct on more implementations than WO, and can be run correctly on all the system-centric models discussed earlier

SLIDE 54

Data-Race-Free-0 vs. Weak Ordering (Cont.)

Caveats:

  • Asynchronous programs
  • It is theoretically possible to distinguish operations better than DRF0 for a given system

SLIDE 55

Programmer-Centric Models: Summary

The idea: the programmer follows prescribed rules (for behavior on SC) and the system gives SC.
  • For the programmer: reason with SC; enhanced portability
  • For system designers: more flexibility

SLIDE 56

Programmer-Centric Models: A Systematic Approach

In general

  • What software rules are useful?
  • What further optimizations are possible?

My thesis characterizes

  • Useful rules
  • Possible optimizations
  • Relationship between the above
SLIDE 57

Conclusions

  • Sequential consistency limits performance optimizations
  • System-centric relaxed memory models are harder to program
  • Programmer-centric approach for relaxed models: software obeys rules, system gives SC
  • Application to Java: can develop software rules for SC for idioms of interest - easier for programmers than a system-centric specification

SLIDE 58

Challenges faced by the Multicore Industry

  • Applications written for single-core processing cannot benefit from multi-core processing if they are not multi-threaded in nature
  • Sharing a single memory, cache, or data bus among multiple cores can create a bottleneck, meaning the extra cores will be largely wasted
  • Enhancing the performance of each core and optimizing it for multi-core operation
  • Improving the memory subsystem and optimizing data access in ways that ensure data can be used as fast as possible among all cores
  • Optimizing the interconnect fabric that connects the cores, to improve performance between cores and memory units

SLIDE 59

Merits and Demerits of Sharing the Last-Level Cache

Merits:
  • Sharing the last-level cache when co-running threads work on the same data can prove to be very beneficial
  • Inter-core communication
  • Bigger cache size

Demerits:
  • Cross-core interference
SLIDE 60

Workload Characteristics

The amount of cross-core interference depends largely on the characteristics of the workload (the set of applications that run on neighbouring cores). Characteristics that affect the interference:
1. Frequency of memory accesses
2. Working set size
3. Memory reuse (temporal locality)
...

SLIDE 61

Review: Shared Caches Between Cores

Advantages:
  • Dynamic partitioning of available cache space - no fragmentation due to static partitioning
  • Easier to maintain coherence - shared data and locks do not ping-pong between caches

Disadvantages:
  • Cores incur conflict misses due to other cores' accesses - misses due to inter-core interference; some cores can destroy the hit rate of other cores (what kinds of access patterns could cause this?)
  • Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)
  • High bandwidth is harder to obtain (N cores ⇒ N ports?)

SLIDE 62

Shared Caches: How to Share?

Free-for-all sharing: placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU) and are not thread/application aware - an incoming block evicts a block regardless of which threads the blocks belong to.

Problems:
  • A cache-unfriendly application can destroy the performance of a cache-friendly application
  • Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
  • Reduced performance, reduced fairness
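The first problem can be seen with a tiny simulation of one thread-oblivious LRU set (a sketch; the 4-way set and tag naming are our assumptions):

```cpp
#include <algorithm>
#include <deque>
#include <string>

// A 4-way LRU set with free-for-all replacement: an incoming block
// evicts the global LRU block no matter which core owns it.
struct LruSet {
    std::deque<std::string> ways;   // front = MRU, back = LRU

    bool access(const std::string& tag) {
        auto it = std::find(ways.begin(), ways.end(), tag);
        bool hit = (it != ways.end());
        if (hit)
            ways.erase(it);              // refresh to MRU
        else if (ways.size() == 4)
            ways.pop_back();             // evict LRU, regardless of owner
        ways.push_front(tag);
        return hit;
    }
};
```

If core 1 caches two reusable blocks and a streaming core 2 then touches four distinct blocks, core 1's blocks are pushed out and its next access misses - the cache-unfriendly streamer has destroyed its neighbour's hit rate.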

SLIDE 65

Problem with Shared Caches

[Figure: threads t1 and t2 on two cores, each with a private L1 cache, sharing the L2; t1's blocks occupy most of the shared L2.]

t2’s throughput is significantly reduced due to unfair cache sharing.

SLIDE 66

Controlled Cache Sharing

Utility based cache partitioning

Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002.

Fair cache partitioning

Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004.

Shared/private mixed cache mechanisms

Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009.


SLIDE 67

Utility Based Shared Cache Partitioning

Goal: maximize system throughput. Observation: not all threads/applications benefit equally from caching ⇒ simple LRU replacement is not good for system throughput. Idea: allocate more cache space to applications that obtain the most benefit from more space. The high-level idea can be applied to other shared resources as well. Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.

SLIDE 68

Utility Based Cache Partitioning (I)

Utility of b ways over a ways: U(a, b) = Misses with a ways − Misses with b ways

[Figure: misses per 1000 instructions vs. number of ways allocated from a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications.]
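The utility metric above is a simple difference of miss counts; a sketch (the MPKI vector and its values are illustrative):

```cpp
#include <vector>

// mpki[w] = misses per 1000 instructions when the application owns w ways.
// Utility of growing from a to b ways: U(a, b) = mpki[a] - mpki[b].
int utility(const std::vector<int>& mpki, int a, int b) {
    return mpki[a] - mpki[b];
}
```

A high-utility application has large U(a, b) for small increases in ways; a saturating-utility application's U flattens out, so extra ways are better spent elsewhere.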

SLIDE 69

Utility Based Cache Partitioning (II)

[Figure: misses per 1000 instructions (MPKI) for equake and vpr under LRU vs. utility-based (UTIL) partitioning.]

Idea: give more cache to the application that benefits more from the cache.

SLIDE 70

Utility Based Cache Partitioning (III)

Three components:
  • Utility monitors (UMON), one per core
  • Partitioning algorithm (PA)
  • Replacement support to enforce partitions

[Figure: Core1 and Core2, each with private I$ and D$ and a per-core UMON, feeding the partitioning algorithm that controls the shared L2 cache in front of main memory.]

SLIDE 71

Utility Monitors

  • For each core, simulate LRU using an auxiliary tag store (ATS)
  • Hit counters in the ATS count hits per recency position
  • LRU is a stack algorithm, so hit counts give utility directly, e.g., hits(2 ways) = H0 + H1

[Figure: main tag store (MTS) and auxiliary tag store (ATS) over sets A-H, with per-recency-position hit counters H0 (MRU) through H15 (LRU).]
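The stack property above makes the UMON arithmetic a prefix sum; a sketch (counter values are illustrative):

```cpp
#include <vector>

// hits[i] = ATS hits recorded at recency position i (0 = MRU).
// Because LRU is a stack algorithm, a cache with n ways would hit
// exactly the accesses that hit at positions 0..n-1.
long hits_with_ways(const std::vector<long>& hits, int n) {
    long total = 0;
    for (int i = 0; i < n && i < static_cast<int>(hits.size()); ++i)
        total += hits[i];
    return total;
}
```

One pass over the counters thus yields the hit count for every possible allocation at once, which is what lets the partitioning algorithm compare utilities without re-running the workload.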