DIRECTORY COHERENCE Mahdi Nazm Bojnordi Assistant Professor School - - PowerPoint PPT Presentation



slide-1
SLIDE 1

DIRECTORY COHERENCE

CS/ECE 7810: Advanced Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor School of Computing University of Utah

slide-2
SLIDE 2

Overview

• Upcoming deadline
  • Tonight: project proposal
• This lecture
  • Snooping wrap-up
  • Directory coherence
  • Implementation challenges
  • Token-based coherence protocol

slide-3
SLIDE 3

Recall: Cache Coherence

• Definition of coherence
  • Write propagation
    • Writes are visible to other processors
  • Write serialization
    • All writes to the same location are seen in the same order by all processors
[Figure: processors P1 and P2 accessing location A]
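Write serialization can be made concrete with a small check. The sketch below is illustrative only: it assumes each write to location A carries a unique value, and it tests pairwise agreement between observers, which is a necessary condition rather than a full proof that a single global order exists. The function name and the sample traces are hypothetical.

```python
from itertools import combinations

def is_write_serialized(observations):
    """Return True if every pair of observers agrees on the relative order
    of the writes (to one location) that both of them saw."""
    for a, b in combinations(observations.values(), 2):
        common = set(a) & set(b)
        order_a = [w for w in a if w in common]
        order_b = [w for w in b if w in common]
        if order_a != order_b:
            return False
    return True

# Each list is the sequence of values one processor observed at location A.
ok = {"P1": [1, 2, 3], "P2": [1, 3]}   # P2 never saw value 2 but agrees on order
bad = {"P1": [1, 2], "P2": [2, 1]}     # P1 and P2 disagree on the order
print(is_write_serialized(ok), is_write_serialized(bad))
```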

slide-4
SLIDE 4

Implementation Challenges

• MSI implementation
  • Stable states
[Vantrease’11]

slide-5
SLIDE 5

Implementation Challenges

• MSI implementation
  • Stable states
  • Busy states
[Vantrease’11]

slide-6
SLIDE 6

Implementation Challenges

• MSI implementation
  • Stable states
  • Busy states
  • Races: unexpected events from concurrent requests to the same block
[Vantrease’11]
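One way to see why busy states and races complicate an MSI controller is to write the cache-side transitions as a table. This is a minimal sketch, not the state machine from the cited figure: the transient state names (IS_D, IM_D, SM_D) and the event names are assumptions chosen for illustration.

```python
# Stable states: M, S, I. Busy (transient) states such as IS_D mean
# "left I, heading to S, waiting for Data". All names here are illustrative.
TRANSITIONS = {
    ("I", "load"): ("IS_D", "send GetS"),
    ("I", "store"): ("IM_D", "send GetM"),
    ("IS_D", "data"): ("S", "load completes"),
    ("IM_D", "data"): ("M", "store completes"),
    ("S", "load"): ("S", "hit"),
    ("S", "store"): ("SM_D", "send GetM"),
    ("SM_D", "data"): ("M", "store completes"),
    ("S", "other_GetM"): ("I", "invalidate"),
    ("M", "load"): ("M", "hit"),
    ("M", "store"): ("M", "hit"),
    ("M", "other_GetS"): ("S", "supply data"),
    ("M", "other_GetM"): ("I", "supply data"),
}

def step(state, event):
    """Return (next_state, action); flag events the table does not cover."""
    if (state, event) not in TRANSITIONS:
        return state, "UNHANDLED"
    return TRANSITIONS[(state, event)]

print(step("I", "load"))            # enters a busy state
print(step("IS_D", "other_GetM"))   # a race: another request arrives while busy
```

Every (busy state, event) pair missing from the table is a case the designer still has to decide on (stall, NACK, or a new transient state), which is where much of the verification effort goes.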

slide-7
SLIDE 7

Cache Coherence Complexity

• A broadcast snooping bus (L2 MOETSI)

[Lepak’03]

slide-8
SLIDE 8

Implementation Tradeoffs

• Reduce unnecessary invalidates and transfers of blocks
  • Optimize the protocol with more states and prediction mechanisms
• Adding more states and optimizations
  • Difficult to design and verify
    • Leads to more cases to handle
    • Race conditions
  • Gained benefit may be less than the cost (diminishing returns)

slide-9
SLIDE 9

Coherence Cache Miss

• Recall: cache miss classification
  • Cold (compulsory): first access to a block
  • Capacity: due to limited capacity
  • Conflict: many blocks are mapped to the same set
• New class: misses due to sharing
  • True vs. false sharing
[Figure: a cache block containing A and B]
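The true/false sharing distinction can be sketched as a simple address comparison. The example assumes a 64-byte block and word-aligned addresses, and the function names are hypothetical; a miss caused by a remote write to the same word is true sharing, while a remote write to a different word in the same block is false sharing.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed for illustration)

def block_of(addr):
    """Block number that contains a byte address."""
    return addr // BLOCK_SIZE

def classify_sharing_miss(written_addr, missed_addr):
    """Classify a miss that follows another processor's write."""
    if block_of(written_addr) != block_of(missed_addr):
        return "not a sharing miss"
    if written_addr == missed_addr:
        return "true sharing"
    return "false sharing"

A, B = 0x1000, 0x1008  # two words that fall in the same 64-byte block
print(classify_sharing_miss(A, A))  # remote write to A, local access to A
print(classify_sharing_miss(A, B))  # remote write to A, local access to B
```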

slide-10
SLIDE 10

Summary of Snooping Protocols

• Advantages
  • Short miss latency
  • Shared bus provides global point of serialization
  • Simple implementation based on buses in uniprocessors
• Disadvantages
  • Must broadcast messages to preserve the order
  • The global point of serialization is not scalable
    • It needs a virtual bus (or a totally-ordered interconnect)

slide-11
SLIDE 11

Scalable Coherence Protocols

• Problem: shared interconnect is not scalable
• Solution: make explicit requests for blocks
• Directory-based coherence: every cache block has additional information
  • To keep track of copies of cached blocks and their states
  • To track ownership for each block
  • To coordinate invalidation appropriately

slide-12
SLIDE 12

Directory Information

• P+1 additional bits for every cache block
  • One presence bit per cache to indicate whether the block is in that cache
  • One exclusive bit to indicate the cache has the only copy (can update without notifying others)
• On a read, set the cache’s bit and arrange the supply of data
• On a write, invalidate all caches that have the block and reset their bits
[Figure: a cache block with P=4 presence bits and an exclusive bit E]

How to organize directory information?
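The P+1 bits and the read/write rules above can be sketched as a small class. This is a hedged illustration, not the lecture's implementation: the class and method names are hypothetical, and clearing the exclusive bit on every read is an assumption made to keep the example short.

```python
class DirectoryEntry:
    """P presence bits plus one exclusive bit for a single cache block."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.presence = 0       # bit i set => cache i holds the block
        self.exclusive = False  # True => exactly one cache may update silently

    def sharers(self):
        return [i for i in range(self.num_caches) if (self.presence >> i) & 1]

    def read(self, cache_id):
        """On a read, set the reader's bit (assumed: the block becomes shared)."""
        self.presence |= 1 << cache_id
        self.exclusive = False

    def write(self, cache_id):
        """On a write, invalidate other holders and reset their bits."""
        invalidated = [i for i in self.sharers() if i != cache_id]
        self.presence = 1 << cache_id
        self.exclusive = True
        return invalidated

entry = DirectoryEntry(num_caches=4)
entry.read(0)
entry.read(2)
to_invalidate = entry.write(1)  # caches 0 and 2 must drop their copies
print(to_invalidate, entry.sharers(), entry.exclusive)
```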

slide-13
SLIDE 13

Directory Organization

• Example: central directory for P processors
  • For each cache block in memory
    • P presence bits, 1 dirty bit
  • For each cache block in cache
    • 1 valid bit, and 1 dirty (owner) bit
[Figure: processors (P) with caches connected through an interconnection network to memory and its directory; the directory holds presence bits and a dirty bit, and each cache holds 1 valid and 1 dirty (exclusive) bit per block]
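A quick calculation shows how the storage cost of this organization grows with the processor count. The sketch assumes a 64-byte memory block, which the slides do not specify, and the function name is hypothetical.

```python
def directory_overhead(num_processors, block_bytes):
    """Fraction of extra memory needed for P presence bits + 1 dirty bit
    per memory block."""
    directory_bits = num_processors + 1
    return directory_bits / (block_bytes * 8)

for p in (4, 64, 256):
    print(p, f"{directory_overhead(p, 64):.1%}")
```

Under these assumptions the overhead is about 1% for 4 processors but roughly half of memory for 256, which is why the bit-vector scheme is usually described as scaling poorly in storage.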

slide-14
SLIDE 14

Directory Protocol

• Three states (similar to snoopy protocol)
  • Shared: one or more processors have the data; memory is up to date
  • Uncached: no processor has it; not valid in any cache
  • Exclusive: one processor has the data; memory is out of date
• Basic terminology
  • Local node, where a request originates
  • Home node, where the memory location of an address resides
  • Remote node, which has a copy of a cache block, whether exclusive or shared
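The three directory states can be sketched as a state machine kept at the home node. This is a simplified illustration under stated assumptions: messages are modeled as returned action strings, the class and method names are hypothetical, and transient states are ignored.

```python
class HomeDirectory:
    """Directory state for one block at its home node."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def read_miss(self, node):
        actions = []
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            # Memory is out of date, so the owner must supply the block.
            actions.append(f"fetch block from owner {owner}")
        self.sharers.add(node)
        self.state = "Shared"
        actions.append(f"send data to {node}")
        return actions

    def write_miss(self, node):
        actions = []
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            actions.append(f"fetch and invalidate block at owner {owner}")
        else:
            actions += [f"invalidate copy at {s}"
                        for s in sorted(self.sharers - {node})]
        self.sharers = {node}
        self.state = "Exclusive"
        actions.append(f"send data to {node}")
        return actions

d = HomeDirectory()
d.read_miss("P0")
d.read_miss("P1")
print(d.write_miss("P2"))  # invalidates P0 and P1, then P2 owns the block
print(d.state, d.sharers)
```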
slide-15
SLIDE 15

Read Request

• P0 reads a cache location
[Figure: nodes P0, Home, and P1, with the messages below] [Culler/Singh]
  • 1. Read
  • 2. DatEx (DatShr)

slide-16
SLIDE 16

ReadEx Request

• Avoid roundtrip to home by sending data directly from owner
[Figure: nodes P0, Home, and Owner, with the messages below] [Culler/Singh]
  • 1. RdEx
  • 2. Invl
  • 3a. Rev
  • 3b. DatEx
slide-17
SLIDE 17

Write Contention

• NACKing mechanism
[Figure: nodes P0, Home, and P1, with the messages below] [Culler/Singh]
  • 1a. RdEx
  • 1b. RdEx
  • 2a. DatEx
  • 2b. NACK
  • 3. RdEx
  • 4. Invl
  • 5a. Rev
  • 5b. DatEx

What are the challenges?

slide-18
SLIDE 18

Design Challenges

• Fairness: which requester is preferred on a conflict?
  • Consider distance and delivery order of interconnect
• Race condition: how to keep the proper sequence?
  • NACK requests to busy blocks (pending invalidate)
    • Original requester retries
  • Queuing requests and granting in sequence
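The NACK-and-retry option can be sketched as follows. This is a deliberately small model, not the protocol from the figure: the `Home` class, the `complete` step, and the retry helper are hypothetical, and invalidation of the previous owner is not modeled.

```python
class Home:
    """Home node for one block; busy while a transaction is in flight."""

    def __init__(self):
        self.busy = False
        self.owner = None
        self.pending = None

    def rdex(self, requester):
        """Handle a read-exclusive request; NACK it if the block is busy."""
        if self.busy:
            return "NACK"
        self.busy = True
        self.pending = requester
        return "DatEx"

    def complete(self):
        """The in-flight transaction finished; the block is no longer busy."""
        self.owner, self.busy = self.pending, False


def request_with_retry(home, requester, max_tries, on_nack=lambda: None):
    """Retry a NACKed request; return the attempt that succeeded, or None."""
    for attempt in range(1, max_tries + 1):
        if home.rdex(requester) == "DatEx":
            return attempt
        on_nack()  # stand-in for time passing before the retry
    return None

home = Home()
first = home.rdex("P0")  # P0 wins the race
attempts = request_with_retry(home, "P1", max_tries=3, on_nack=home.complete)
print(first, attempts)
```

The `max_tries` bound hints at the fairness concern: a requester that keeps losing the race can be NACKed repeatedly, which is one motivation for queuing requests instead.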

slide-19
SLIDE 19

Summary of Directory Protocols

• Advantages
  • Does not require broadcast to all caches
  • Exactly as scalable as interconnect and directory storage (much more scalable than bus)
• Disadvantages
  • Adds indirection to miss latency (critical path)
    • request → directory → memory
  • Requires extra storage space to track directory states
  • Protocols and race conditions are more complex

slide-20
SLIDE 20

Avoid Indirection

• Can we get the best of both snooping and directory protocols?
  • Direct cache-to-cache misses (broadcast is ok)
  • What if an unordered interconnect (e.g., a mesh) were used?
[Figure: two diagrams, each with three processors (P) and memory (M). Directory Protocol: messages 1, 2, 3. Hybrid Protocol: messages 1, 2.]

slide-21
SLIDE 21

An Example Problem

[Figure: P2 (Read/Write), P1 (No Copy), and P0 (No Copy) exchanging messages 1 to 3: a request to write that is delayed in the interconnect, an Ack, and a request to read]
  • P0 issues a request to write (delayed to P2)
  • P1 issues a request to read

slide-22
SLIDE 22

An Example Problem

[Figure: messages 1 to 4. P2: Read/Write → Read-only. P1: No Copy → Read-only. P0: No Copy.]
  • P2 responds with data to P1
slide-23
SLIDE 23

An Example Problem

[Figure: messages 1 to 5. P2: Read/Write → Read-only. P1: No Copy → Read-only. P0: No Copy.]
  • P0’s delayed request arrives at P2
slide-24
SLIDE 24

An Example Problem

[Figure: messages 1 to 7. P2: Read/Write → Read-only → No Copy. P1: No Copy → Read-only. P0: No Copy → Read/Write.]
  • P2 responds to P0
slide-25
SLIDE 25

An Example Problem

[Figure: messages 1 to 7. P2: Read/Write → Read-only → No Copy. P1: No Copy → Read-only. P0: No Copy → Read/Write.]
• Problem: P0 and P1 are in inconsistent states
• Locally “correct” operation, globally inconsistent
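The inconsistency can be stated as a violated invariant: at any time a block has either one writer or any number of readers, never both. The check below is a sketch using the permission labels from this example; the function name is hypothetical, and the invariant's usual name (single-writer, multiple-reader) comes from general coherence terminology rather than from these slides.

```python
def swmr_holds(states):
    """True if the block has at most one writer and no readers alongside it."""
    writers = sum(s == "Read/Write" for s in states.values())
    readers = sum(s == "Read-only" for s in states.values())
    return writers == 0 or (writers == 1 and readers == 0)

initial = {"P0": "No Copy", "P1": "No Copy", "P2": "Read/Write"}
final = {"P0": "Read/Write", "P1": "Read-only", "P2": "No Copy"}
print(swmr_holds(initial), swmr_holds(final))
```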

slide-26
SLIDE 26

Token Coherence

[Martin’03]

[Figure: P2 (T=16, R/W, the maximum number of tokens), P1 (T=0), and P0 (T=0) exchanging messages 1 to 3, including a delayed request to write and a request to read]
  • P0 issues a request to write (delayed to P2)
  • P1 issues a request to read

slide-27
SLIDE 27

Token Coherence

[Figure: messages 1 to 4; the data response carries T=1. P2: T=16 (R/W) → T=15 (R). P1: T=0 → T=1 (R). P0: T=0.] [Martin’03]
  • P2 responds with data to P1

slide-28
SLIDE 28

Token Coherence

[Figure: messages 1 to 5. P2: T=16 (R/W) → T=15 (R). P1: T=0 → T=1 (R). P0: T=0.] [Martin’03]
  • P0’s delayed request arrives at P2

slide-29
SLIDE 29

Token Coherence

[Figure: messages 1 to 7; the response to P0 carries T=15. P2: T=16 (R/W) → T=15 (R) → T=0. P1: T=0 → T=1 (R). P0: T=0 → T=15 (R).] [Martin’03]
  • P2 responds to P0

slide-30
SLIDE 30

Token Coherence

[Figure: P2: T=0. P1: T=1 (R). P0: T=15 (R).] [Martin’03]

Now what? (P0 wants all tokens)

slide-31
SLIDE 31

Token Coherence

[Figure: messages 8 and 9, after a timeout at P0; the response carries T=1. P2: T=0. P1: T=1 (R). P0: T=15 (R).] [Martin’03]
  • Timeout!
  • P0 reissues request
  • P1 responds with a token

slide-32
SLIDE 32

Token Coherence

[Figure: P2: T=0. P0: T=16 (R/W). P1: T=0.] [Martin’03]
  • P0’s request completed

One final issue: What about starvation?
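The token-counting walk-through above can be replayed in a few lines. This sketch assumes the rules implied by the example (a processor may read with at least one token and may write only with all 16, and tokens are never created or destroyed); the class and method names are hypothetical, and timeouts, data transfer, and starvation handling are not modeled.

```python
TOTAL_TOKENS = 16  # tokens per block, as in the example

class TokenBlock:
    """Token counts for one block across a fixed set of processors."""

    def __init__(self, holders, initial_owner):
        self.tokens = {h: 0 for h in holders}
        self.tokens[initial_owner] = TOTAL_TOKENS

    def transfer(self, src, dst, count):
        if not 0 < count <= self.tokens[src]:
            raise ValueError("cannot send tokens that are not held")
        self.tokens[src] -= count
        self.tokens[dst] += count
        # Conservation: tokens only move, so the total never changes.
        assert sum(self.tokens.values()) == TOTAL_TOKENS

    def can_read(self, p):
        return self.tokens[p] >= 1

    def can_write(self, p):
        return self.tokens[p] == TOTAL_TOKENS

blk = TokenBlock(["P0", "P1", "P2"], initial_owner="P2")
blk.transfer("P2", "P1", 1)        # P2 responds with data to P1
blk.transfer("P2", "P0", 15)       # P2 responds to P0's delayed write request
stalled = not blk.can_write("P0")  # P0 holds 15 of 16 tokens, so it cannot write
blk.transfer("P1", "P0", 1)        # P0 reissues; P1 responds with its token
print(stalled, blk.can_write("P0"), blk.tokens)
```

Because a writer must hold every token, the inconsistent outcome of the earlier example cannot occur here; what remains open is whether a requester is guaranteed to collect all tokens eventually, which is the starvation question.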