  1. Cache Coherence in Scalable Machines

  2. Scalable Cache Coherent Systems • Scalable, distributed memory plus coherent replication • Scalable distributed memory machines • P-C-M nodes connected by network • communication assist interprets network transactions, forms interface • Final point was shared physical address space • cache miss satisfied transparently from local or remote memory • Natural tendency of cache is to replicate • but coherence? • no broadcast medium to snoop on • Not only hardware latency/bw, but also protocol must scale

  3. What Must a Coherent System Do? • Provide set of states, state transition diagram, and actions • Manage coherence protocol: (0) determine when to invoke coherence protocol; (a) find source of info about state of line in other caches – whether need to communicate with other cached copies; (b) find out where the other copies are; (c) communicate with those copies (inval/update) • (0) is done the same way on all systems • state of the line is maintained in the cache • protocol is invoked if an “access fault” occurs on the line • Different approaches distinguished by (a) to (c)
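A minimal sketch of step (0), using generic MSI state names; the helper and its signature are illustrative, not from the slides. The protocol is invoked only when the line's locally cached state is insufficient for the access:

/* Sketch only: when step (0) invokes the coherence protocol. */
#include <stdbool.h>
#include <stdio.h>

/* Per-line state as the local cache sees it (generic MSI names). */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* Step (0): the protocol runs only on an "access fault", i.e. when the
 * line's local state is insufficient for the access. */
static bool access_fault(line_state_t state, bool is_write)
{
    if (state == INVALID)
        return true;                  /* miss: need a copy                 */
    if (is_write && state == SHARED)
        return true;                  /* upgrade: must invalidate others   */
    return false;                     /* hit: no protocol action           */
}

int main(void)
{
    /* Steps (a)-(c) only run when this returns true. */
    printf("write to SHARED  -> fault? %d\n", access_fault(SHARED, true));
    printf("read of MODIFIED -> fault? %d\n", access_fault(MODIFIED, false));
    return 0;
}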

  4. Bus-based Coherence • All of (a), (b), (c) done through broadcast on bus • faulting processor sends out a “search” • others respond to the search probe and take necessary action • Could do it in scalable network too • broadcast to all processors, and let them respond • Conceptually simple, but broadcast doesn’t scale with p • on bus, bus bandwidth doesn’t scale • on scalable network, every fault leads to at least p network transactions • Scalable coherence: • can have same cache states and state transition diagram • different mechanisms to manage protocol

  5. Approach #1: Hierarchical Snooping • Extend snooping approach: hierarchy of broadcast media • tree of buses or rings (KSR-1) • processors are in the bus- or ring-based multiprocessors at the leaves • parents and children connected by two-way snoopy interfaces – snoop both buses and propagate relevant transactions • main memory may be centralized at root or distributed among leaves • Issues (a) - (c) handled similarly to bus, but not full broadcast • faulting processor sends out “search” bus transaction on its bus • propagates up and down hierarchy based on snoop results • Problems: • high latency: multiple levels, and snoop/lookup at every level • bandwidth bottleneck at root • Not popular today

  6. Scalable Approach #2: Directories • Every memory block has associated directory information • keeps track of copies of cached blocks and their states • on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary • in scalable networks, comm. with directory and copies is through network transactions • [Figure: (a) Read miss to a block in dirty state: 1. read request to directory (home) node; 2. reply with owner identity; 3. read request to owner; 4a. data reply to requestor, 4b. revision message to directory. (b) Write miss to a block with two sharers: 1. RdEx request to directory; 2. reply with sharers’ identities; 3a/3b. invalidation requests to sharers; 4a/4b. invalidation acks to requestor.] • Many alternatives for organizing directory information
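Purely as a sketch of "find directory entry, look it up": the address split, constants, and names below are assumptions, not any particular machine. When memory (and hence the directory) is interleaved across home nodes, a miss can locate the block's directory entry directly from the physical address:

/* Hypothetical address layout: block offset in the low bits, home-node
 * id in the next bits, the rest indexing into that node's directory. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6                 /* 64-byte blocks (assumed)      */
#define NODE_BITS  6                 /* up to 64 home nodes (assumed) */

typedef struct {
    unsigned node;                   /* home node holding the entry   */
    uint64_t index;                  /* entry index at that node      */
} dir_location_t;

static dir_location_t dir_locate(uint64_t paddr)
{
    uint64_t block = paddr >> BLOCK_BITS;
    dir_location_t loc = {
        .node  = (unsigned)(block & ((1u << NODE_BITS) - 1)),
        .index = block >> NODE_BITS,
    };
    return loc;                      /* the miss request is sent here */
}

int main(void)
{
    dir_location_t loc = dir_locate(0x12345678u);
    printf("home node %u, directory index %llu\n",
           loc.node, (unsigned long long)loc.index);
    return 0;
}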

  7. A Popular Middle Ground • Two-level “hierarchy” • Individual nodes are multiprocessors, connected non-hierarchically • e.g. mesh of SMPs • Coherence across nodes is directory-based • directory keeps track of nodes, not individual processors • Coherence within nodes is snooping or directory • orthogonal, but needs a good interface of functionality • Examples: • Convex Exemplar: directory-directory • Sequent, Data General, HAL: directory-snoopy

  8. Example Two-level Hierarchies • [Figure: four two-level organizations. (a) Snooping-snooping: bus-based snooping nodes (processors and caches on bus B1, with main memory and snooping adapters) joined by a second bus B2. (b) Snooping-directory: bus-based snooping nodes, each with main memory, a directory, and an assist, joined by a network. (c) Directory-directory: directory-based nodes (processors, caches, memory/directory, and assists on Network1) joined through directory adapters by Network2. (d) Directory-snooping: directory-based nodes joined through dir/snoopy adapters by a bus (or ring).]

  9. Advantages of Multiprocessor Nodes • Potential for cost and performance advantages • amortization of node fixed costs over multiple processors – applies even if processors simply packaged together but not coherent • can use commodity SMPs • fewer nodes for directory to keep track of • much communication may be contained within node (cheaper) • nodes prefetch data for each other (fewer “remote” misses) • combining of requests (like hierarchical, only two-level) • can even share caches (overlapping of working sets) • benefits depend on sharing pattern (and mapping) – good for widely read-shared: e.g. tree data in Barnes-Hut – good for nearest-neighbor, if properly mapped – not so good for all-to-all communication

  10. Disadvantages of Coherent MP Nodes • Bandwidth shared among nodes • all-to-all example • applies to coherent or not • Bus increases latency to local memory • With coherence, typically wait for local snoop results before sending remote requests • Snoopy bus at remote node increases delays there too, increasing latency and reducing bandwidth • Overall, may hurt performance if sharing patterns don’t comply

  11. Outline • Overview of directory-based approaches • Directory Protocols • Correctness, including serialization and consistency • Implementation • study through case studies: SGI Origin2000, Sequent NUMA-Q • discuss alternative approaches in the process • Synchronization • Implications for parallel software • Relaxed memory consistency models • Alternative approaches for a coherent shared address space

  12. Basic Operation of Directory • k processors • With each cache-block in memory: k presence-bits, 1 dirty-bit • With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit • [Figure: processor–cache nodes connected by an interconnection network to memory and its directory, which holds the presence bits and dirty bit per block.] • Read from main memory by processor i: • if dirty-bit OFF then { read from main memory; turn p[i] ON; } • if dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; } • Write to main memory by processor i: • if dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... } • ...
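The read and write cases above can be sketched as code run at the home node. The stub functions and the 64-bit presence vector are illustrative assumptions; the dirty-bit ON write case, elided on the slide, is left out here too:

/* Sketch of the directory actions described on this slide. Message
 * sending and data movement are reduced to stubs (recall_from_owner,
 * send_invalidations, supply_data); these names are assumptions, not a
 * real machine's interface. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t presence;        /* p[i]: one bit per processor (k <= 64) */
    bool     dirty;           /* single owner when set                 */
} dir_entry_t;

/* Stubs standing in for network transactions issued by the assist. */
static void recall_from_owner(dir_entry_t *e)  { (void)e; } /* owner -> shared, memory updated   */
static void send_invalidations(dir_entry_t *e) { (void)e; } /* to every cache named in presence  */
static void supply_data(int i)                 { (void)i; }

/* Read from main memory by processor i. */
void dir_read(dir_entry_t *e, int i)
{
    if (e->dirty) {
        recall_from_owner(e);        /* recall line; owner drops to shared */
        e->dirty = false;            /* memory now up to date              */
    }
    e->presence |= 1ull << i;        /* turn p[i] ON                       */
    supply_data(i);                  /* from memory or the recalled copy   */
}

/* Write to main memory by processor i (dirty-bit OFF case only). */
void dir_write(dir_entry_t *e, int i)
{
    if (!e->dirty) {
        send_invalidations(e);       /* to all caches that have the block  */
        e->presence = 1ull << i;     /* assumption: presence becomes {i}   */
        e->dirty = true;             /* turn dirty-bit ON                  */
        supply_data(i);
    }
    /* dirty-bit ON write case: not shown on the slide */
}

int main(void)
{
    dir_entry_t e = { .presence = 0, .dirty = false };
    dir_read(&e, 3);     /* processor 3 reads: p[3] set, block clean   */
    dir_write(&e, 5);    /* processor 5 writes: invalidate, take owner */
    return 0;
}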

  13. Scaling with No. of Processors • Scaling of memory and directory bandwidth provided • Centralized directory is bandwidth bottleneck, just like centralized memory • How to maintain directory information in distributed way? • Scaling of performance characteristics • traffic: no. of network transactions each time protocol is invoked • latency = no. of network transactions in critical path each time • Scaling of directory storage requirements • Number of presence bits needed grows with the number of processors • How directory is organized affects all these, as well as performance at a target scale and coherence management issues
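As a rough illustration of the storage point (the block size and processor counts below are assumed example values, not from the slides), a full bit-vector directory keeps P presence bits plus a dirty bit for every memory block, so its size relative to memory grows linearly with P:

/* Illustrative only: overhead of a full bit-vector directory, with P
 * presence bits plus a dirty bit per B-byte memory block. */
#include <stdio.h>

int main(void)
{
    const int block_bytes = 64;                   /* assumed block size */
    const int procs[] = { 16, 64, 256, 1024 };    /* assumed P values   */

    for (int i = 0; i < 4; i++) {
        double dir_bits  = procs[i] + 1;          /* presence + dirty   */
        double data_bits = 8.0 * block_bytes;
        printf("P = %4d: directory overhead ~ %.1f%% of memory\n",
               procs[i], 100.0 * dir_bits / data_bits);
    }
    return 0;
}

With these assumed numbers the overhead is a few percent of memory at 16 processors but exceeds the data itself at 1024, which is one reason directory organization matters at scale.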

  14. Insights into Directories • Inherent program characteristics: • determine whether directories provide big advantages over broadcast • provide insights into how to organize and store directory information • Characteristics that matter – frequency of write misses – how many sharers on a write miss – how these scale

  15. Cache Invalidation Patterns • [Figure: histograms of invalidation patterns for LU and Ocean – percentage of shared writes vs. number of invalidations (0, 1, ..., 60 to 63). Both are concentrated at very small counts: LU has about 91% of writes invalidating 1 copy and about 9% invalidating none; Ocean has about 81% invalidating none and about 15% invalidating 1, with a negligible tail.]

  16. Cache Invalidation Patterns • [Figure: histograms of invalidation patterns for Barnes-Hut and Radiosity – percentage of shared writes vs. number of invalidations. Barnes-Hut: about 48% at 0 invalidations and 23% at 1; Radiosity: about 58% at 0 and 12% at 1; both have longer tails, reaching tens of invalidations, than LU or Ocean.]
