
DIRECTORY COHERENCE Mahdi Nazm Bojnordi Assistant Professor School - PowerPoint PPT Presentation


  1. DIRECTORY COHERENCE
     Mahdi Nazm Bojnordi, Assistant Professor, School of Computing, University of Utah
     CS/ECE 7810: Advanced Computer Architecture

  2. Overview
     - Upcoming deadline
       - Tonight: project proposal
     - This lecture
       - Snooping wrap-up
       - Directory coherence
       - Implementation challenges
       - Token-based coherence protocol

  3. Recall: Cache Coherence
     - Definition of coherence
       - Write propagation
         - Writes are visible to other processors
       - Write serialization
         - All writes to the same location are seen in the same order by all processors
     [Figure labels: P1, P2, A]

  4. Implementation Challenges
     - MSI implementation
       - Stable states
     [Vantrease’11]

  5. Implementation Challenges
     - MSI implementation
       - Stable states
       - Busy states
     [Vantrease’11]

  6. Implementation Challenges
     - MSI implementation
       - Stable states
       - Busy states
       - Races: unexpected events from concurrent requests to the same block
     [Vantrease’11]

  7. Cache Coherence Complexity
     - A broadcast snooping bus (L2 MOETSI)
     [Lepak’03]

  8. Implementation Tradeoffs
     - Reduce unnecessary invalidates and transfers of blocks
       - Optimize the protocol with more states and prediction mechanisms
     - Adding more states and optimizations
       - Is difficult to design and verify
         - Leads to more cases to take care of
         - Introduces race conditions
       - May yield a benefit smaller than its cost (diminishing returns)

  9. Coherence Cache Miss
     - Recall: cache miss classification
       - Cold (compulsory): first access to block
       - Capacity: due to limited capacity
       - Conflict: many blocks are mapped to the same set
     - New class: misses due to sharing
       - True vs. false sharing
     [Figure labels: A, B]
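The true vs. false sharing distinction on this slide can be made concrete with a minimal Python sketch. The 64-byte block size, the byte-level granularity, and the names `block_of` and `classify_sharing` are assumptions for illustration and do not come from the lecture.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed for illustration)


def block_of(addr: int) -> int:
    """Return the index of the cache block that contains a byte address."""
    return addr // BLOCK_SIZE


def classify_sharing(write_addr: int, read_addr: int) -> str:
    """Classify what a remote write to write_addr means for a reader of read_addr."""
    if block_of(write_addr) != block_of(read_addr):
        # Different blocks: the write does not invalidate the reader's copy.
        return "no sharing"
    if write_addr == read_addr:
        # Same location: the reader really needs the newly written value.
        return "true sharing"
    # Same block but different location: the invalidation was unnecessary.
    return "false sharing"
```

For example, with these assumptions, addresses 0 and 8 fall in the same block, so a write to one invalidates a reader of the other even though no data is actually shared.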

  10. Summary of Snooping Protocols
      - Advantages
        - Short miss latency
        - Shared bus provides a global point of serialization
        - Simple implementation based on buses in uniprocessors
      - Disadvantages
        - Must broadcast messages to preserve the order
        - The global point of serialization is not scalable
          - It needs a virtual bus (or a totally-ordered interconnect)

  11. Scalable Coherence Protocols
      - Problem: shared interconnect is not scalable
      - Solution: make explicit requests for blocks
      - Directory-based coherence: every cache block has additional information
        - To keep track of copies of cached blocks and their states
        - To track ownership for each block
        - To coordinate invalidation appropriately

  12. Directory Information
      - P+1 additional bits for every cache block
        - One presence bit per cache to indicate whether that cache holds the block
        - One exclusive bit to indicate that a cache has the only copy (it can update without notifying others)
      - On a read, set the requesting cache’s bit and arrange the supply of data
      - On a write, invalidate all caches that have the block and reset their bits
      - How to organize directory information?
      [Figure: directory entry attached to a cache block for P=4, with presence bits and the exclusive bit E]
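The read and write rules on this slide can be sketched as operations on a bit vector. This is a minimal Python sketch, not the lecture's implementation: the class name `DirectoryEntry` and its methods are hypothetical, and the data supply on a read (for example, fetching from a current exclusive holder) is deliberately left out.

```python
class DirectoryEntry:
    """Directory bits for one block: P presence bits plus one exclusive bit."""

    def __init__(self, num_caches: int):
        self.num_caches = num_caches
        self.presence = 0        # bit i set => cache i holds a copy
        self.exclusive = False   # True => one cache holds the only copy

    def sharers(self):
        """List the ids of caches whose presence bit is set."""
        return [i for i in range(self.num_caches) if (self.presence >> i) & 1]

    def on_read(self, cache_id: int) -> None:
        """On a read, set the requester's bit; the block is no longer exclusive."""
        self.presence |= 1 << cache_id
        self.exclusive = False

    def on_write(self, cache_id: int):
        """On a write, return the caches to invalidate and reset their bits."""
        to_invalidate = [i for i in self.sharers() if i != cache_id]
        self.presence = 1 << cache_id   # only the writer keeps a copy
        self.exclusive = True
        return to_invalidate
```

With P=4, two reads by caches 0 and 2 followed by a write from cache 1 would, under these assumptions, invalidate caches 0 and 2 and leave cache 1 as the exclusive holder.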

  13. Directory Organization
      - Example: central directory for P processors
        - For each cache block in memory
          - p presence bits, 1 dirty bit
        - For each cache block in cache
          - 1 valid bit, and 1 dirty (owner) bit
      [Figure: processors (P) with caches, each block holding 1 valid and 1 dirty (exclusive) bit, connected through an interconnection network to memory and its directory (presence bits, dirty bit)]
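The storage cost of this organization follows directly from the bit counts on the slide: each memory block carries p presence bits plus 1 dirty bit. A minimal Python sketch of that arithmetic follows; the function name `directory_overhead` and the 64-byte block size in the example are assumptions, not values from the lecture.

```python
def directory_overhead(num_processors: int, block_bytes: int) -> float:
    """Directory bits per block divided by data bits per block."""
    directory_bits = num_processors + 1   # p presence bits + 1 dirty bit
    block_bits = block_bytes * 8
    return directory_bits / block_bits
```

Assuming 64-byte blocks, 63 processors cost 64 directory bits per 512 data bits (12.5% extra storage), while 255 processors cost 50%, which illustrates why directory storage becomes a scalability concern.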

  14. Directory Protocol
      - Three states (similar to snoopy protocol)
        - Shared: one or more processors have the data; memory is up-to-date
        - Uncached: no processor has it; not valid in any cache
        - Exclusive: one processor has the data; memory is out-of-date
      - Basic terminology
        - Local node: where a request originates
        - Home node: where the memory location of an address resides
        - Remote node: has a copy of a cache block, whether exclusive or shared
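The three directory states can be read off the presence and dirty bits of the central directory described earlier. The following is a minimal Python sketch of that mapping; the names `DirState` and `directory_state` are hypothetical, and treating "dirty with more than one holder" as an error is an assumption about how an inconsistent entry should be handled.

```python
from enum import Enum


class DirState(Enum):
    UNCACHED = "Uncached"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"


def directory_state(presence_bits, dirty: bool) -> DirState:
    """Derive the directory state from per-cache presence bits and the dirty bit."""
    holders = sum(presence_bits)
    if holders == 0:
        return DirState.UNCACHED      # no cache has the block
    if dirty:
        if holders != 1:
            # Assumed invariant: a dirty block has exactly one holder.
            raise ValueError("dirty block must have exactly one holder")
        return DirState.EXCLUSIVE     # memory is out-of-date
    return DirState.SHARED            # memory is up-to-date
```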

  15. Read Request
      - P0 reads a cache location
      [Figure: nodes P0, Home, and P1; messages: 1. Read, 2. DatEx (DatShr)]
      [Culler/Singh]

  16. ReadEx Request
      - Avoid a round trip to home by sending data directly from the owner
      [Figure: nodes P0, Home, and Owner; messages: 1. RdEx, 2. Invl, 3a. Rev, 3b. DatEx]
      [Culler/Singh]

  17. Write Contention
      - NACKing mechanism
      - What are the challenges?
      [Figure: nodes P0, Home, and P1; messages: 1a. RdEx, 1b. RdEx, 2a. DatEx, 2b. NACK, 3. RdEx, 4. Invl, 5a. Rev, 5b. DatEx]
      [Culler/Singh]

  18. Design Challenges
      - Fairness: which requester is preferred on a conflict?
        - Consider distance and delivery order of the interconnect
      - Race condition: how to keep the proper sequence
        - NACK requests to busy blocks (pending invalidate)
          - Original requester retries
        - Queue requests and grant them in sequence
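The two options for handling races at a busy block, NACK-and-retry versus queuing, can be contrasted in a small model of the home node's bookkeeping for one block. This is a minimal Python sketch under simplifying assumptions: the class `HomeBlock`, its `use_queue` flag, and the string replies are hypothetical, and real protocols track far more state per transaction.

```python
from collections import deque


class HomeBlock:
    """Home-node bookkeeping for one block while a transaction is pending."""

    def __init__(self, use_queue: bool):
        self.use_queue = use_queue
        self.busy = False
        self.current = None      # requester currently being served
        self.pending = deque()   # waiting requesters (queue mode only)

    def request(self, requester: int) -> str:
        if not self.busy:
            self.busy = True
            self.current = requester
            return "GRANT"
        if self.use_queue:
            self.pending.append(requester)   # served later, in arrival order
            return "QUEUED"
        return "NACK"                        # requester must retry later

    def complete(self):
        """Finish the current transaction; return the next granted requester, if any."""
        if self.pending:
            self.current = self.pending.popleft()
            return self.current
        self.busy = False
        self.current = None
        return None
```

In NACK mode the order in which retries succeed depends on timing, which is where the fairness question arises; in queue mode the home node grants in arrival order at the cost of buffering requests.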

  19. Summary of Directory Protocols
      - Advantages
        - Does not require broadcast to all caches
        - Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)
      - Disadvantages
        - Adds indirection to miss latency (critical path)
          - request → directory → memory
        - Requires extra storage space to track directory states
        - Protocols and race conditions are more complex

  20. Avoid Indirection
      - Can we get the best of both snooping and directory protocols?
        - Direct cache-to-cache misses (broadcast is ok)
        - What if an unordered interconnect (e.g., a mesh) is used?
      [Figure: numbered message sequences among processors (P) and memory (M) for a directory protocol and for a hybrid protocol]

  21. An Example Problem
      - P0 issues a request to write (delayed to P2)
      - P1 issues a request to read
      [Figure: P0 (No Copy), P1 (No Copy), P2 (Read/Write); messages labeled "Request to write" (delayed in interconnect), "Ack", and "Request to read"]

  22. An Example Problem
      - P2 responds with data to P1
      [Figure: P0 (No Copy), P1 (Read-only), P2 (Read-only)]

  23. An Example Problem
      - P0’s delayed request arrives at P2
      [Figure: P0 (No Copy), P1 (Read-only), P2 (Read-only)]

  24. An Example Problem
      - P2 responds to P0
      [Figure: P0 (Read/Write), P1 (Read-only), P2 (No Copy)]

  25. An Example Problem
      - Problem: P0 and P1 are in inconsistent states
      - Locally “correct” operation, globally inconsistent
      [Figure: P0 (Read/Write), P1 (Read-only), P2 (No Copy)]

  26. Token Coherence
      - P0 issues a request to write (delayed to P2)
      - P1 issues a request to read
      [Figure: P0 (T=0), P1 (T=0), P2 (T=16, R/W, max tokens); messages labeled "Request to write" (delayed) and "Request to read"]
      [Martin’03]

  27. Token Coherence
      - P2 responds with data to P1
      [Figure: P0 (T=0), P1 (T=1, R), P2 (T=15, R); the response carries T=1]
      [Martin’03]

  28. Token Coherence
      - P0’s delayed request arrives at P2
      [Figure: P0 (T=0), P1 (T=1, R), P2 (T=15, R)]
      [Martin’03]

  29. Token Coherence
      - P2 responds to P0
      [Figure: P0 (T=15, R), P1 (T=1, R), P2 (T=0); the response carries T=15]
      [Martin’03]

  30. Token Coherence
      - Now what? (P0 wants all tokens)
      [Figure: P0 (T=15, R), P1 (T=1, R), P2 (T=0)]
      [Martin’03]

  31. Token Coherence
      - Timeout!
      - P0 reissues request
      - P1 responds with a token
      [Figure: P0 (T=15, R), P1 (T=1, R), P2 (T=0); the response carries T=1]
      [Martin’03]

  32. Token Coherence
      - P0’s request completed
      - One final issue: What about starvation?
      [Figure: P0 (T=16, R/W), P1 (T=0), P2 (T=0)]
      [Martin’03]
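The token walk-through in these slides can be replayed with simple token counting: a processor may read with at least one token and may write only when it holds all of them. This is a minimal Python sketch of that bookkeeping, with T=16 taken from the slides; the helper names `can_read`, `can_write`, and `transfer` are hypothetical, and the data that travels with the tokens, the timeout mechanism, and starvation handling are not modeled.

```python
TOTAL_TOKENS = 16  # T = 16, as in the slides' example


def can_read(tokens: int) -> bool:
    """Reading requires at least one token."""
    return tokens >= 1


def can_write(tokens: int) -> bool:
    """Writing requires all tokens for the block."""
    return tokens == TOTAL_TOKENS


def transfer(tokens: dict, src: str, dst: str, count: int) -> None:
    """Move tokens between processors; tokens are never created or destroyed."""
    if count > tokens[src]:
        raise ValueError("cannot send more tokens than held")
    tokens[src] -= count
    tokens[dst] += count


tokens = {"P0": 0, "P1": 0, "P2": 16}
transfer(tokens, "P2", "P1", 1)            # P2 answers P1's read request
transfer(tokens, "P2", "P0", 15)           # P0's delayed write request reaches P2
stuck = not can_write(tokens["P0"])        # P0 holds 15 of 16 tokens
transfer(tokens, "P1", "P0", 1)            # after the timeout, P0 reissues; P1 responds
```

Because the total token count is conserved, the inconsistent outcome of the earlier example (one writer and one reader at the same time) cannot occur: P0 cannot write while P1 still holds a token.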
