CSC2/458 Parallel and Distributed Systems
Parallel Memory Systems: Coherence
Sreepathi Pai
February 06, 2018
URCS

Outline
• Introduction to Parallel Memory Systems
• Memory Systems in Parallel Processors
• Coherence
• Implementations in Hardware
• Implications for Parallel Programs

Traditional View of Memory
• Memory is accessed through CPU loads/stores
• Memory contents have addresses
• Smallest unit of access varies across machines
  • Usually 8 bits (i.e. 1 byte)
• Some machines have other memory correctness constraints
  • e.g. alignment
  • x86 has very few correctness constraints

Memory in most machines today
• Memory hierarchy
  • Multiple levels of cache memory
  • pron. cash
• Multiple loads/stores can be in flight at the same time
  • Called memory-level parallelism (MLP)
  • Stall-on-use, not stall-on-issue
[Figure: CPU pipeline, then Level 1 (L1) cache, then Level 2 (L2) cache (the last-level cache, LLC), then memory]

Caches
• Caches are faster than main memory
  • Closer to pipeline
  • Smaller than main memory
  • Usually SRAM instead of DRAM (main memory)
• Caches contain copies of data in main memory
• If address requested by load/store exists in cache: hit
• If address does not exist in cache: miss
• Addresses that hit can be satisfied from the cache
• Addresses that miss are forwarded to the next level of the memory hierarchy
  • Forwarding continues until found
  • What is the last level?

Internal Organization of Caches
• Unit of data organization in caches: line
  • Line size may vary from 64 to 128 bytes across CPU models
• Each line is a non-overlapping chunk of main memory
• Each line in the cache contains:
  • State (e.g. valid or invalid)
  • Tag (part of address)
  • Data
[Figure: one cache line with columns tags, state, data; example row: tag 0xdae..., state valid, data "hello world"]
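The state/tag/data layout and the hit/miss rule can be sketched in C. This is only an illustration under assumptions that are not in the slides: a direct-mapped cache, a 64-byte line, 64 lines in total, and the hypothetical names `cache_read` and `cache_fill`. Real caches are usually set-associative and differ in many details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u  /* assumed line size in bytes */
#define NUM_LINES 64u  /* assumed number of lines (direct-mapped) */

struct cache_line {
    bool     valid;            /* state: valid or invalid */
    uint64_t tag;              /* upper part of the address */
    uint8_t  data[LINE_SIZE];  /* copy of one line of main memory */
};

struct cache {
    struct cache_line lines[NUM_LINES];
};

/* Split an address into byte offset within the line, line index, and tag. */
uint64_t line_offset(uint64_t addr) { return addr % LINE_SIZE; }
uint64_t line_index(uint64_t addr)  { return (addr / LINE_SIZE) % NUM_LINES; }
uint64_t line_tag(uint64_t addr)    { return addr / LINE_SIZE / NUM_LINES; }

/* Hit: store the byte in *out and return true.
   Miss: return false; the caller forwards the request to the next level. */
bool cache_read(const struct cache *c, uint64_t addr, uint8_t *out)
{
    const struct cache_line *l = &c->lines[line_index(addr)];
    if (l->valid && l->tag == line_tag(addr)) {
        *out = l->data[line_offset(addr)];
        return true;
    }
    return false;
}

/* Install a line fetched from the next level; src points at LINE_SIZE bytes. */
void cache_fill(struct cache *c, uint64_t addr, const uint8_t *src)
{
    assert(src != NULL);
    struct cache_line *l = &c->lines[line_index(addr)];
    l->valid = true;
    l->tag = line_tag(addr);
    memcpy(l->data, src, LINE_SIZE);
}
```

A read that misses would typically be followed by `cache_fill` with the line returned by the next level, after which the same read hits.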

Symmetric Multiprocessors (SMPs)
• Each processor occupies one “socket”; all processors are identical
• All processors share the same main memory
  • Known as Shared Memory Multiprocessors
• Not all levels of the hierarchy are shared
  • Caches are private
• Not shown: interconnect between processors
• Superseded by Chip Multiprocessors
[Figure: CPU0 and CPU1, each with its own pipeline, Level 1 (L1) cache and Level 2 (L2) cache, above a shared memory]

Chip Multiprocessors (CMPs)
• Each socket contains multiple cores (all identical)
• All processors share the same main memory
• Some levels of cache can also be shared
  • Some still private
[Figure: one CPU containing Core0 and Core1, each with its own pipeline and Level 1 (L1) cache, sharing a Level 2 (L2) cache above memory]

Reads and writes in xMPs - reads/reads
[Figure: two processors/cores, each with a private cache (columns: tags, cc, data), above shared memory. Memory holds X at 0xdae...; one cache already holds a copy of X, and the other processor issues READ 0xdae...]

Reads and writes in xMPs - reads/reads
[Figure: after the read, both caches and memory hold X at 0xdae...]

Reads and writes in xMPs - read/writes
[Figure: both caches and memory hold X at 0xdae...; one processor issues WRITE Y to 0xdae...]

Reads and writes in xMPs - write/write
[Figure: both caches and memory hold X at 0xdae...; one processor issues WRITE Y to 0xdae... while the other issues WRITE Z to 0xdae...]

The problem of coherence
• Multiple copies of the same address exist in the memory hierarchy
• How do we keep all the copies the same?
• How do we resolve ordering of writes to the same address?
Usually resolved through a coherence protocol.

Coherence Protocols
• Can be transparent (in hardware)
• You might need to implement one in software
  • if you’re creating your own caches
• Basic idea: every read and write needs to participate in a “coherence protocol”
  • Usually a finite state machine (FSM)
• Each line in the cache has a state associated with it
• Reads, writes and cache evictions in the coherence domain may change the line’s coherence state
  • Coherence domain can consist of other CPUs, I/O devices, etc.
• State determines which actions (reads/writes/evictions) are valid
  • Validity conditions?

MESI coherence protocol
• States in MESI
  • Modified: line contains modified data
  • Exclusive: line is not shared
  • Shared: line is shared read-only
  • Invalid: line contains no data

Simplified MESI state diagram
[Figure: state diagram over INVALID, EXCLUSIVE, SHARED and MODIFIED. Edge labels: read from memory; read from other processor; read by other processor; write by this processor (twice); write by other processor; write to memory]
• Solid lines: actions by this processor
• Dashed lines: actions by other processor
• Many transitions not shown, e.g. INVALID from SHARED, EXCLUSIVE on cache line replacement

The MESI Protocol (Simplified)
• Every cache line begins in INVALID state
• On a read, the cache line is put into:
  • EXCLUSIVE: if it was read from memory
  • SHARED: if it was read from another copy
• On a write, the line is moved to MODIFIED state
  • If it was previously SHARED, all other copies are INVALIDATED
  • It will eventually be written back to memory
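The simplified rules can be written as a small state-transition function in C. This is a sketch, not a hardware description. The event names and `mesi_next` are hypothetical, write-back to memory is not modeled, and the handling of a read by another processor (move to SHARED) is an assumption based on how a snooping implementation typically behaves.

```c
#include <assert.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Hypothetical event names, seen from one cache's point of view. */
enum event {
    LOCAL_READ_FROM_MEMORY,       /* this processor reads; no other copy exists */
    LOCAL_READ_FROM_OTHER_CACHE,  /* this processor reads; another cache has a copy */
    LOCAL_WRITE,                  /* this processor writes */
    REMOTE_READ,                  /* another processor reads the same line */
    REMOTE_WRITE                  /* another processor writes the same line */
};

/* Next state of one cache line for one event. */
enum mesi mesi_next(enum mesi s, enum event e)
{
    switch (e) {
    case LOCAL_READ_FROM_MEMORY:
        return s == INVALID ? EXCLUSIVE : s;  /* a valid line is read in place */
    case LOCAL_READ_FROM_OTHER_CACHE:
        return s == INVALID ? SHARED : s;
    case LOCAL_WRITE:
        return MODIFIED;  /* other copies receive REMOTE_WRITE */
    case REMOTE_READ:
        return s == INVALID ? INVALID : SHARED;  /* assumption: downgrade to SHARED */
    case REMOTE_WRITE:
        return INVALID;   /* our copy is invalidated */
    }
    assert(0 && "unknown event");
    return INVALID;
}
```

Driving two lines through the sequence read, read, write leaves the writer in MODIFIED and the other copy in INVALID, which is how the protocol limits a line to one writer at a time.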

MESI protocol test
• How many concurrent readers does MESI allow?
• How many concurrent writers does MESI allow?

MESI protocol test 2
• How does the MESI protocol order concurrent writers?

Snoop Protocols
• Requires a shared bus among all processors
• All requests to read/write are broadcast on the bus
• All processors “snoop”/listen to memory requests
• If a processor has a copy in EXCLUSIVE/SHARED/MODIFIED state:
  • It responds with a copy of its data
  • Moves its line to SHARED
• Processors broadcast “INVALIDATE” to all processors before writing
  • Must wait for acknowledgements

Directory-based Protocols
• Requires a shared structure called “directory”
• Directory tracks contents of every cache in the system
  • Addresses only
• Caches talk to the directory only
• Directory sends messages only to caches that contain affected data
• Used in systems with a large number of processors
  • > 8
• Implementation need NOT be a centralized structure
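One common way to track which caches hold a line is a bitmask of sharers per directory entry. The sketch below assumes that representation; the names `dir_entry`, `dir_record_read` and `dir_record_write` are hypothetical, and the slides do not say how the directory is implemented.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CACHES 32u  /* assumed upper bound, one bit per cache */

/* One directory entry: a line address and the set of caches holding it. */
struct dir_entry {
    uint64_t line_addr;  /* addresses only; the directory stores no data */
    uint32_t sharers;    /* bit i set means cache i has a copy */
};

/* Cache cache_id has read the line: remember it as a sharer. */
void dir_record_read(struct dir_entry *e, unsigned cache_id)
{
    assert(cache_id < MAX_CACHES);
    e->sharers |= UINT32_C(1) << cache_id;
}

/* Cache writer wants to write the line. Returns the bitmask of caches that
   must be sent an invalidate; only caches holding the line are included. */
uint32_t dir_record_write(struct dir_entry *e, unsigned writer)
{
    assert(writer < MAX_CACHES);
    uint32_t me = UINT32_C(1) << writer;
    uint32_t to_invalidate = e->sharers & ~me;
    e->sharers = me;  /* the writer is now the only holder */
    return to_invalidate;
}
```

Compared with a snooping bus, the write contacts only the caches in the returned mask instead of broadcasting to every processor.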

Summary of Cache Coherence
• Reads and writes to shared data involve communication with other processors
  • Expensive
  • Possible serialization bottleneck

Shared Variables
Variables that are read/written by multiple threads are called shared variables.

Compilers and Cache Coherence
    int *a;

    T0:
        while(*a < 1000)
            *a = *a + 1;

    T1:
        while(1)
            printf("%d\n", *a);

Volatiles
    volatile int *a;

    T0:
        while(*a < 1000)
            *a = *a + 1;

    T1:
        while(1)
            printf("%d\n", *a);

False Sharing
    int sums[NTHREADS];

    Tx:
        for(...)
            sums[x] += a[...];

Memory Layout for sums
    offset:  0    4    8    0xC  0x10 0x14 0x18 0x1C
    thread:  T0   T1   T2   T3   T4   T5   T6   T7
sums[] occupies a single cache line.

Cache Line Bouncing
• No thread shares data with another thread
• However, thread data resides within the same cache line
• Coherence operates at cache-line granularity
• Every write to the cache line will potentially be serialized
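Whether per-thread slots collide can be checked by mapping each slot's byte offset to a cache-line number. The sketch below assumes a 64-byte line and an array that starts on a line boundary; `lines_touched` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64u  /* assumed cache-line size in bytes */

/* Count the distinct cache lines that hold the first byte of n slots placed
   `stride` bytes apart, assuming slot 0 starts on a line boundary. */
size_t lines_touched(size_t n, size_t stride)
{
    assert(stride > 0);
    size_t count = 0;
    size_t prev_line = 0;
    for (size_t i = 0; i < n; i++) {
        size_t line = (i * stride) / LINE_SIZE;  /* line holding slot i */
        if (i == 0 || line != prev_line) {
            count++;
            prev_line = line;
        }
    }
    return count;
}
```

With 8 slots of 4 bytes each (the layout of `sums[]` with 4-byte `int`), the function returns 1, so every thread writes to the same line. With a stride of 64 bytes it returns 8, one line per thread, at the cost of extra memory.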

Summary
• Memory locations may be stored in registers by the compiler
  • Will not participate in cache coherence
• Data layout may cause inadvertent conflicts between threads
  • One solution: privatize and then merge
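Privatize-and-merge can be sketched as follows: each thread accumulates into a private local variable and writes to the shared array once, and one thread combines the partial results afterwards. Thread creation is omitted so that the sketch stays independent of any threading library. `sum_chunk`, `merge` and the value of `NTHREADS` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS 4  /* assumed thread count */

/* Work for one thread: sum a[begin..end) into a private local, then publish
   the result with a single write to the shared slot *out. */
void sum_chunk(const int *a, size_t begin, size_t end, long *out)
{
    assert(begin <= end);
    long local = 0;  /* private to this thread: a register or stack slot */
    for (size_t i = begin; i < end; i++)
        local += a[i];
    *out = local;    /* the only write to shared data */
}

/* Merge step: run by one thread after all workers have finished. */
long merge(const long sums[NTHREADS])
{
    long total = 0;
    for (int x = 0; x < NTHREADS; x++)
        total += sums[x];
    return total;
}
```

Each thread x would call `sum_chunk` on its own range with `&sums[x]` as the output slot. The shared cache line is then written once per thread rather than once per element, which removes most of the bouncing even when the slots still share a line.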
