slide-1
SLIDE 1

ZSIM TUTORIAL – MEMORY SYSTEM

NATHAN BECKMANN

MICRO 2015 – WAIKIKI HAWAII – 5 DEC 2015

slide-12
SLIDE 12

What we’ll talk about

• ZSim has a full-featured memory system (originally designed for caches)

[Diagram labels: Core ×2, L1I ×2, L1D ×2, Filter $ ×2, Prefetcher ×2, L2 ×2, Coherence, NoC, Directory, Memory; Cache, with Array, Replacement, Contention Model and Coherence]

Modular design!

slide-13
SLIDE 13

Topics covered

• ZSim memory system design & important classes/files
• Configuration options & available models
• How to extend zsim yourself (with example!)

slide-35
SLIDE 35

Example: “A day in the life of a memory request”

[Diagram: a MemReq travels through the hierarchy (Core, Filter $, L1I/L1D, Prefetcher, L2, Coherence, NoC, Directory, Memory). Call labels, in order of appearance: load()/store(), access(), access(), Latency, access() (with a second MemReq), invalidate() ×2 (triggered by an InvReq), access(), lookup(), rankCands() over cands]

• Bound-phase function simulation
• Some components add weave-phase modeling
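The walk-through above can be approximated in a few lines: each level receives a request, adds its latency, forwards misses to its parent, and returns the completion cycle. This is a minimal sketch under stated assumptions. `Req`, `Level`, the fixed `hit` flag and the free function `load()` are hypothetical stand-ins, not ZSim's `MemReq`, `MemObject` or `FilterCache`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a memory request.
struct Req {
    uint64_t lineAddr;  // line address being accessed
    uint64_t cycle;     // cycle at which the request is issued
};

// Hypothetical level of the hierarchy (a cache or main memory).
struct Level {
    uint64_t latency;     // assumed fixed access latency of this level
    bool hit;             // toy assumption: a level either always hits or always misses
    const Level* parent;  // next level toward memory; nullptr for memory itself

    // Returns the cycle at which the access completes.
    uint64_t access(const Req& req) const {
        uint64_t done = req.cycle + latency;
        if (hit || parent == nullptr) return done;
        Req miss{req.lineAddr, done};  // forward the miss to the parent
        return parent->access(miss);
    }
};

// Core-side entry point: issue a load at curCycle through the first-level cache.
inline uint64_t load(const Level& l1, uint64_t lineAddr, uint64_t curCycle) {
    return l1.access(Req{lineAddr, curCycle});
}
```

With a 4-cycle L1, a 10-cycle L2 and a 100-cycle memory, a load that misses in both caches at cycle 1000 completes at cycle 1114, while an L1 hit completes at cycle 1004.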

slide-36
SLIDE 36

Important ZSim memory classes

[Class diagram: MemReq]

slide-37
SLIDE 37

MemReq

• Represents an in-flight memory request
• Important fields:
    • uint64_t lineAddr – line address (the address shifted right to drop the line offset)
    • AccessType type – GETS, GETX, PUTS, PUTX
    • uint64_t cycle – requesting cycle
    • MESIState* state – coherence state (M, E, S, or I)
• Important methods:
    • N/A

slide-39
SLIDE 39

Important ZSim memory classes

[Class diagram: MemReq, MemObject]

slide-40
SLIDE 40

MemObject

• Generic interface for things that handle memory requests
• Important fields:
    • N/A
• Important methods:
    • uint64_t access(MemReq& req) – performs an access and returns completion time

slide-43
SLIDE 43

Implementing a simple model for main memory

class SimpleMemory : public MemObject {
    uint64_t latency;
    g_string name;
  public:
    SimpleMemory(uint64_t _latency, g_string _name) : latency(_latency), name(_name) {}
    const char* getName() { return name.c_str(); }
    uint64_t access(MemReq& req) {
        // Set the coherence state in the requestor.
        switch (req.type) {
            case PUTS:
            case PUTX:  // write
                *req.state = I;
                break;
            case GETS:
                *req.state = req.is(MemReq::NOEXCL) ? S : E;
                break;
            case GETX:
                *req.state = M;
                break;
        }
        return req.cycle + latency;  // completion cycle
    }
};

slide-45
SLIDE 45

Important ZSim memory classes

[Class diagram: MemReq, MemObject, SimpleMemory, MD1Memory, DDRMemory; edges labeled “is a”]

slide-46
SLIDE 46

Memory controllers

• Different models for main memory
• SimpleMemory: fixed-latency, no contention
    • Important fields: latency
• MD1Memory: contention modeled using M/D/1 queue
    • Important fields: megabytesPerSecond (bandwidth), zeroLoadLatency, etc.
• DDRMemory & DRAMSimMemory: detailed modeling of DDR timings
    • Important fields: lots of configuration parameters (CAS, RAS, bus MHz)
    • Timings modeled in weave-phase
    • Requires TimingCore or OOO core models
    • Similar accuracy, but DDRMemory is much faster
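To make the M/D/1 idea concrete, here is a small sketch of the textbook formula for the mean waiting time in an M/D/1 queue, Wq = ρ·S / (2·(1 − ρ)), where S is the fixed service time and ρ the utilization. The function names and the way utilization is supplied are assumptions of this sketch; ZSim's MD1Memory derives load from its own bandwidth counters and may differ in detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Mean queueing delay of an M/D/1 queue, in cycles.
// serviceCycles: deterministic service time per request (S).
// utilization:   fraction of peak bandwidth in use (rho), must be in [0, 1).
inline double md1QueueDelay(double serviceCycles, double utilization) {
    assert(utilization >= 0.0 && utilization < 1.0);
    return utilization * serviceCycles / (2.0 * (1.0 - utilization));
}

// Total latency = zero-load latency + rounded queueing delay.
inline uint64_t md1Latency(uint64_t zeroLoadLatency, double serviceCycles, double utilization) {
    double wait = md1QueueDelay(serviceCycles, utilization);
    return zeroLoadLatency + static_cast<uint64_t>(std::llround(wait));
}
```

At zero load the latency equals zeroLoadLatency. As utilization approaches 1 the queueing term grows without bound, which is the contention effect a fixed-latency model cannot capture.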

slide-48
SLIDE 48

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache; edges labeled “is a”]

slide-49
SLIDE 49

InvReq

• Represents an invalidation request from coherence controller/directory
• Important fields:
    • uint64_t lineAddr – shifted address
    • InvType type – INV, INVX, FWD
    • uint64_t cycle – requesting cycle
• Important methods:
    • N/A

slide-50
SLIDE 50

BaseCache

• Generic interface for cache-like objects
• Important fields:
    • N/A
• Important methods:
    • void setParents(…) – register the caches above it in the hierarchy
    • void setChildren(…) – register the caches below it in the hierarchy
    • uint64_t invalidate(const InvReq& req) – invalidate line locally & in children
    • uint64_t access(MemReq& req)

slide-52
SLIDE 52

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache; edges labeled “is a”]

slide-53
SLIDE 53

Cache

• Inclusive cache
    • Contains tag array, coherence controller, replacement policy (discussed later)
    • Adds logic to control these components
• Important fields (that aren’t discussed later):
    • uint32_t accLat – access latency
    • uint32_t invLat – invalidation latency
• Important methods:
    • void setParents(…) – register the caches above it in the hierarchy
    • void setChildren(…) – register the caches below it in the hierarchy
    • uint64_t invalidate(const InvReq& req) – invalidate line locally & in children
    • uint64_t access(MemReq& req)

slide-54
SLIDE 54

How ZSim allows concurrency

[Diagram: Core, Core, L1, L1, L2, L3]

slide-60
SLIDE 60

How ZSim allows concurrency

• Naïve “big lock” implementation won’t work

[Diagram: Core, Core, L1, L1, L2, L3]

slide-70
SLIDE 70

How ZSim allows concurrency

• There is concurrency available!

[Diagram: Core, Core, L1, L1, L2, L3, with two MemReqs in flight]

Requires handling many complex transients!

slide-74
SLIDE 74

How ZSim allows concurrency

• Locking each cache leads to deadlock on invalidations

[Diagram: Core, Core, L1, L1, L2, L3, with two MemReqs and an InvReq in flight]

L1 is waiting on L2 on MemReq
L2 is waiting on L1 on InvReq
→ Deadlock!

slide-77
SLIDE 77

How ZSim allows concurrency

• Goal: block further accesses going up while allowing invalidations going down
• Caches have two locks: access lock + invalidation lock
• Invalidations are prioritized
    • Accesses acquire both locks
    • Invalidations need only invalidation lock

uint64_t Cache::access(MemReq& req) {
    invLock.acquire();
    accLock.acquire();
    // look up address etc.
    invLock.release();
    uint64_t completionTime = parent->access(req);
    // check if we got an invalidation!
    accLock.release();
    return completionTime;
}

uint64_t Cache::invalidate(const InvReq& req) {
    invLock.acquire();
    // do invalidation
    uint64_t completionTime = children.invalidate(req);
    invLock.release();
    return completionTime;
}
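The two-lock scheme can be tried out with standard C++ mutexes. The sketch below is an illustration under stated assumptions, not ZSim's Cache class: `ToyCache`, its single `valid` flag and the cycle-only interface are hypothetical. One deliberate deviation from the slide's pseudo-code: the access lock is taken before the invalidation lock, so an access that is waiting its turn never holds the invalidation lock.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Toy cache level illustrating the access-lock / invalidation-lock split.
struct ToyCache {
    std::mutex accLock;  // serializes accesses through this level
    std::mutex invLock;  // protects line state; the only lock invalidations need
    ToyCache* parent = nullptr;
    std::vector<ToyCache*> children;
    uint64_t accLat = 0;  // assumed fixed access latency
    uint64_t invLat = 0;  // assumed fixed invalidation latency
    bool valid = false;   // single-line "array", kept tiny for the sketch

    // Every access is treated as a miss that continues to the parent.
    uint64_t access(uint64_t cycle) {
        std::lock_guard<std::mutex> acc(accLock);  // held for the whole access
        invLock.lock();
        valid = true;      // stand-in for lookup and allocation
        invLock.unlock();  // from here on, invalidations may come down
        uint64_t done = cycle + accLat;
        if (parent) done = parent->access(done);
        // A real cache would re-check here whether the line was invalidated meanwhile.
        return done;
    }

    // Invalidates the line here and in all children; returns the completion cycle.
    uint64_t invalidate(uint64_t cycle) {
        std::lock_guard<std::mutex> inv(invLock);
        valid = false;
        uint64_t done = cycle + invLat;
        for (ToyCache* c : children) {
            uint64_t childDone = c->invalidate(cycle + invLat);
            if (childDone > done) done = childDone;
        }
        return done;
    }
};
```

Because an access going up holds only its access lock while it waits on the parent, an invalidation coming down can always obtain the child's invalidation lock, which breaks the wait cycle that per-cache locking produces.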

slide-83
SLIDE 83

How ZSim allows concurrency

[Diagram: Core, Core, L1, L1, L2, L3, with two MemReqs and an InvReq in flight; legend: invalidation lock, access lock]

slide-92
SLIDE 92

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher; edges labeled “is a”]

slide-96
SLIDE 96

NUCACache

• Non-uniform cache access: banks distributed around the chip
• Important fields:
    • BankDir* bankDir – see below
    • g_vector<BaseCache*> banks – the distributed banks
• Important methods: none beyond those of BaseCache
• Supports dynamic NUCA policies via BankDir class
    • uint32_t preAccess(MemReq& req) – give destination bank
    • int32_t getPrevBank(MemReq& req, uint32_t curBank) – get old bank (if moved)
• Wide-ranging support
    • First-touch, R-NUCA [Hardavellas ISCA’09], [Awasthi HPCA’09], idealized private D-NUCA [Herrero ISCA’10], Jigsaw [Beckmann PACT’13, Beckmann HPCA’15]
    • Some yet-to-be-released

slide-97
SLIDE 97

NUCACache::access pseudo-code

uint64_t NUCACache::access(MemReq& req) {
    uint32_t bank = bankDir->preAccess(req);
    int32_t prevBank = bankDir->getPrevBank(req, bank);
    if (prevBank != -1 && bank != prevBank) {
        // move the line from prevBank to bank
    }
    uint64_t completionCycle = banks[bank]->access(req);
    return completionCycle;
}

slide-98
SLIDE 98

Implementing your own D-NUCA

• Idealized “last-touch” bank dir that migrates lines to wherever they are referenced

uint32_t LastTouchBankDir::preAccess(MemReq& req) {
    // first entry of the sorted RTT list, i.e. the bank closest to the requester
    uint32_t closestBank = nuca->getSortedRTTs(req.childId)[0].second;
    return closestBank;
}

int32_t LastTouchBankDir::getPrevBank(MemReq& req, uint32_t currentBank) {
    ScopedMutex sm(mutex);  // avoid races
    // assumes lineMap maps a line address to the bank currently holding it
    auto prevBankId = lineMap.find(req.lineAddr);
    if (prevBankId == lineMap.end()) {
        lineMap[req.lineAddr] = currentBank;  // first touch: remember the bank
        return -1;
    } else if (prevBankId->second == currentBank) {
        return -1;
    } else {
        uint32_t prevBank = prevBankId->second;
        prevBankId->second = currentBank;
        return prevBank;
    }
}
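For readers who want to run the last-touch idea outside the simulator, the following self-contained sketch pairs a bank directory with a toy banked cache. All names (`ToyLastTouchDir`, `ToyNuca`, the `closestBank` argument that stands in for `preAccess()`) are hypothetical, and each bank is reduced to a set of line addresses; none of this is ZSim's actual BankDir or NUCACache code.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy last-touch bank directory: remembers which bank holds each line.
class ToyLastTouchDir {
    std::mutex mutex_;
    std::unordered_map<uint64_t, uint32_t> lineMap_;  // lineAddr -> current bank
  public:
    // Records that lineAddr now lives in curBank. Returns the previous bank,
    // or -1 if the line is new or did not move.
    int32_t getPrevBank(uint64_t lineAddr, uint32_t curBank) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = lineMap_.find(lineAddr);
        if (it == lineMap_.end()) {
            lineMap_.emplace(lineAddr, curBank);
            return -1;
        }
        if (it->second == curBank) return -1;
        int32_t prev = static_cast<int32_t>(it->second);
        it->second = curBank;
        return prev;
    }
};

// Toy NUCA cache: each bank is just the set of lines it currently holds.
struct ToyNuca {
    std::vector<std::unordered_set<uint64_t>> banks;
    ToyLastTouchDir dir;

    explicit ToyNuca(uint32_t numBanks) : banks(numBanks) {}

    // closestBank stands in for the result of bankDir->preAccess(req).
    void access(uint64_t lineAddr, uint32_t closestBank) {
        int32_t prev = dir.getPrevBank(lineAddr, closestBank);
        if (prev != -1) banks[prev].erase(lineAddr);  // migrate out of the old bank
        banks[closestBank].insert(lineAddr);
    }
};
```

A line first touched from near bank 1 and later from near bank 3 ends up only in bank 3, which is the migration behavior the slide describes.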

slide-99
SLIDE 99

StreamPrefetcher

• Implements stream prefetcher
• Important fields:
    • Entry array[16] – the streams it is following
• Important methods: none beyond those of BaseCache
• Prefetcher will issue its own MemReqs to parents
• Validated against Westmere

slide-101
SLIDE 101

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache; edges labeled “is a”]

slide-102
SLIDE 102

Other caches

• NonInclusiveCache – self explanatory, requires separate directory for coherence
• TimingCache – adds weave-phase models for cache contention
• FilterCache – boundary between core models & memory models
    • Speeds up simulator by accelerating loads & stores
    • Important methods: uint64_t load/store(Address vAddr, uint64_t curCycle)
    • FilterCache adds a virtually-indexed, direct-mapped cache to filter accesses before they reach the more expensive Cache-hierarchy
    • Filter is kept coherent and checks for timing hazards (e.g., OOO store execution)
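The filtering idea can be sketched as a small direct-mapped table consulted before the full cache model. This is an illustrative sketch only: `ToyFilter`, its `slowLatency` parameter and the `slowAccesses` counter are assumptions, and the slow path is a fixed latency rather than a call into a real cache hierarchy.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy direct-mapped filter placed in front of a slower cache model.
class ToyFilter {
    struct Entry {
        uint64_t line = UINT64_MAX;  // UINT64_MAX marks an empty entry
        uint64_t availCycle = 0;     // cycle at which the line's data is available
    };
    std::vector<Entry> sets_;
    uint32_t lineBits_;
    uint64_t slowLatency_;

  public:
    uint64_t slowAccesses = 0;  // how often the full hierarchy was consulted

    ToyFilter(uint32_t numSets, uint32_t lineBits, uint64_t slowLatency)
        : sets_(numSets), lineBits_(lineBits), slowLatency_(slowLatency) {}

    uint64_t load(uint64_t vAddr, uint64_t curCycle) {
        uint64_t line = vAddr >> lineBits_;
        Entry& e = sets_[line % sets_.size()];
        if (e.line == line) {
            // Fast path: filter hit; data is ready no earlier than its fill time.
            return curCycle > e.availCycle ? curCycle : e.availCycle;
        }
        // Slow path: stand-in for calling access() on the real cache.
        slowAccesses++;
        e.line = line;
        e.availCycle = curCycle + slowLatency_;
        return e.availCycle;
    }

    // Keeps the filter coherent when the backing cache loses a line.
    void invalidate(uint64_t line) {
        Entry& e = sets_[line % sets_.size()];
        if (e.line == line) e.line = UINT64_MAX;
    }
};
```

Repeated loads to the same line take the fast path and never reach the slow model until the line is invalidated or displaced, which is where the simulation speedup comes from.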

slide-104
SLIDE 104

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache, CacheArray; edges labeled “is a” and “has a”]

slide-105
SLIDE 105

CacheArray

• Implements a tag array with different organizations
• Important fields: None
• Important methods:
    • int32_t lookup(…) – does the array hold this address? If so, which line is it?
    • uint32_t preinsert(…) – make space (i.e., find a victim to evict)
    • void postinsert(…) – allocate space (i.e., finalize eviction)
• Replacement split into phases to avoid invalidation races
• ZSim supports set associative, fully associative, zcaches
    • Compressed caches in development

slide-107
SLIDE 107

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache, CacheArray, ReplPolicy; edges labeled “is a” and “has a”]

slide-108
SLIDE 108

ReplPolicy

• The replacement policy
• Important fields: None
• Important methods:
    • void update(uint32_t id, const MemReq* req) – called upon hit
    • void replaced(uint32_t id) – called upon eviction
    • template<class C> uint32_t rankCands(const MemReq* req, C cands) – find a victim
        • For performance, this is specialized at compile time for the different array types
        • Different versions auto-generated from DECL_RANK_BINDINGS() macro
• ZSim supports LRU, pseudo-LRU, NRU, LFU, random, SRRIP, DRRIP, SHiP, PDP, and many more!

slide-109
SLIDE 109

Example: Implementing LRU

• Timestamp-based implementation, evict the oldest line

class LRUReplPolicy : public ReplPolicy {
    uint64_t timestamp;  // global access count
    uint64_t* array;     // last-use timestamp per line
    uint64_t numLines;
  public:
    explicit LRUReplPolicy(uint32_t _numLines) : timestamp(1), numLines(_numLines) {
        array = gm_calloc<uint64_t>(numLines);
    }
    ~LRUReplPolicy() { gm_free(array); }
    void update(uint32_t id, const MemReq* req) {  // called upon hit
        array[id] = timestamp++;
    }
    void replaced(uint32_t id) {  // called upon eviction
        array[id] = 0;
    }
    …

slide-110
SLIDE 110

Example: Implementing LRU

• Timestamp-based implementation, evict the oldest line

    …
    template<typename C> uint32_t rank(const MemReq* req, C cands) {
        uint32_t bestCand = -1;
        uint64_t bestAge = 0;
        for (auto ci = cands.begin(); ci != cands.end(); ci++) {
            if (array[*ci] == 0) {
                return *ci;  // unused line: take it right away
            }
            uint64_t age = timestamp - array[*ci];
            if (bestCand == (uint32_t)-1 || age > bestAge) {  // keep the oldest line
                bestCand = *ci;
                bestAge = age;
            }
        }
        return bestCand;
    }
    DECL_RANK_BINDINGS();
};
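The same timestamp scheme can be exercised in isolation. The sketch below is a standalone toy, not ZSim's LRUReplPolicy: `ToyLRU` is a hypothetical name, candidates are passed as a plain vector of line ids, and the request argument is dropped.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy timestamp-based LRU following the update / replaced / rank split.
class ToyLRU {
    uint64_t timestamp_ = 1;         // global access counter; 0 is reserved for "unused"
    std::vector<uint64_t> lastUse_;  // last-use timestamp per line

  public:
    explicit ToyLRU(uint32_t numLines) : lastUse_(numLines, 0) {}

    void update(uint32_t id) { lastUse_[id] = timestamp_++; }  // on a hit or fill
    void replaced(uint32_t id) { lastUse_[id] = 0; }           // on an eviction

    // Returns the candidate with the smallest timestamp: an unused line if
    // there is one, otherwise the least recently used line.
    uint32_t rank(const std::vector<uint32_t>& cands) const {
        assert(!cands.empty());
        uint32_t best = cands[0];
        for (uint32_t id : cands) {
            if (lastUse_[id] < lastUse_[best]) best = id;
        }
        return best;
    }
};
```

Comparing raw timestamps (smallest wins) is equivalent to picking the largest age, and it handles unused lines without a special case because their timestamp is 0.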

slide-112
SLIDE 112

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache, CacheArray, ReplPolicy, PartReplPolicy, PartitionMonitor; edges labeled “is a” and “has a”]

slide-113
SLIDE 113

PartReplPolicy

• ZSim implements cache partitioning in the replacement policy
• Important fields:
    • PartitionMapper* mapper – maps MemReqs to a partition
    • PartitionMonitor* monitor – measure stats about different partitions, e.g. miss curves
• Important methods:
    • void setPartitionSizes(const uint32_t* sizes) – reset partition sizes
• ZSim supports way partitioning, idealized LRU partitioning, and Vantage
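Way partitioning is the simplest of the listed schemes, and a few lines are enough to show how partition sizes restrict the replacement candidates. This is a hedged sketch: `ToyWayPartitioner` and `candidates()` are hypothetical, sizes are passed as a vector rather than a raw pointer, and ZSim's own partitioned policies are more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy way partitioning: each way of a set belongs to exactly one partition.
class ToyWayPartitioner {
    std::vector<uint32_t> wayOwner_;  // wayOwner_[w] = partition allowed to evict from way w

  public:
    explicit ToyWayPartitioner(uint32_t ways) : wayOwner_(ways, 0) {}

    // sizes[p] = number of ways given to partition p; must sum to the associativity.
    void setPartitionSizes(const std::vector<uint32_t>& sizes) {
        uint32_t way = 0;
        for (uint32_t p = 0; p < sizes.size(); p++) {
            for (uint32_t i = 0; i < sizes[p]; i++) wayOwner_.at(way++) = p;
        }
        assert(way == wayOwner_.size());
    }

    // Ways that a request from partition `part` may replace in any set.
    std::vector<uint32_t> candidates(uint32_t part) const {
        std::vector<uint32_t> cands;
        for (uint32_t w = 0; w < wayOwner_.size(); w++) {
            if (wayOwner_[w] == part) cands.push_back(w);
        }
        return cands;
    }
};
```

A replacement policy would then rank only the returned ways, so a partition can never evict another partition's lines.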

slide-115
SLIDE 115

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache, CacheArray, ReplPolicy, PartReplPolicy, PartitionMonitor, CC, MESICC, MESITopCC, MESIBottomCC; edges labeled “is a” and “has a”]

slide-119
SLIDE 119

CC

• Implements coherence across cache levels
• Important fields: None
• Important methods:
    • void setParents/setChildren(…) – similar to Cache
    • Regular accesses:
        • bool startAccess(MemReq& req)
        • bool shouldAllocate(const MemReq& req)
        • uint64_t processEviction(…)
        • uint64_t processAccess(…)
        • void endAccess(const MemReq& req)
    • Invalidations:
        • void startInv()
        • uint64_t processInv(…)
    • Querying (e.g., ReplPolicies):
        • uint64_t numSharers(uint32_t lineId)
        • bool isValid(uint32_t lineId)
        • MESIState getState(uint32_t lineId)
        • bool isSharer(uint32_t lineId, uint32_t childId)

slide-120
SLIDE 120

CC naming convention

[Diagram: L2 and L3, each with a MESITopCC and a MESIBottomCC; arrow labels: Top, Parent, Bottom, Child]

slide-122
SLIDE 122

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache, CacheArray, ReplPolicy, PartReplPolicy, PartitionMonitor, CC, MESICC, MESITopCC, MESIBottomCC, MESITerminalCC, MESIDirCC; edges labeled “is a” and “has a”]

slide-124
SLIDE 124

Important ZSim memory classes

[Class diagram: MemReq, InvReq, MemObject, SimpleMemory, MD1Memory, DDRMemory, BaseCache, Cache, NUCACache, StreamPrefetcher, TimingCache, NonInclusiveCache, FilterCache, CacheArray, ReplPolicy, PartReplPolicy, PartitionMonitor, CC, MESICC, MESITopCC, MESIBottomCC, MESITerminalCC, MESIDirCC, Network; edges labeled “is a” and “has a”]

slide-125
SLIDE 125

Network

• Tracks round-trip latency between objects in the target system
• ZSim does not currently model network contention
• Important fields:
    • string → uint32_t delayMap – maps object pairs (by name) to their round-trip communication latency
• Important methods:
    • Network(const char* filename) – initialize from a network file
    • uint32_t getRTT(const char* src, const char* dst) – look up network latency
• Will show example network file in Config session later
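The lookup side of such a latency table is easy to sketch. The class below is a toy under stated assumptions: `ToyNetwork` and `addLink()` are hypothetical, entries are added in code because the network file format is not shown here, links are assumed symmetric, and unknown pairs return 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Toy fixed-latency network: maps (src, dst) name pairs to a round-trip time.
class ToyNetwork {
    std::unordered_map<std::string, uint32_t> delayMap_;  // "src dst" -> round-trip cycles

    static std::string key(const std::string& a, const std::string& b) {
        return a + " " + b;
    }

  public:
    // Registers a link; this sketch assumes latency is the same in both directions.
    void addLink(const std::string& src, const std::string& dst, uint32_t rtt) {
        delayMap_[key(src, dst)] = rtt;
        delayMap_[key(dst, src)] = rtt;
    }

    // Returns the round-trip latency, or 0 for pairs that were never registered.
    uint32_t getRTT(const char* src, const char* dst) const {
        auto it = delayMap_.find(key(src, dst));
        return it == delayMap_.end() ? 0 : it->second;
    }
};
```

Because the table holds fixed values, lookups are independent of traffic, which matches the slide's note that network contention is not modeled.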

slide-126
SLIDE 126

ZSim memory system

Class | Is a | What is it? | File
MemReq | - | An in-flight memory request going up the cache hierarchy | memory_hierarchy.h
MemObject | - | Base object for anything that takes MemReqs. Provides access method. | memory_hierarchy.h
SimpleMemory | MemObject | Fixed-latency main memory. | mem_ctrls.h
MD1Memory | MemObject | M/D/1 queuing latency model for main memory. | mem_ctrls.h
DDRMemory | MemObject | Full DDR timing simulation, requires TimingEvents. | ddr_mem.h
InvReq | - | An invalidation going down the cache hierarchy | memory_hierarchy.h
BaseCache | MemObject | Base object for caches, prefetchers, etc. Provides setChildren/Parents and invalidate methods. | memory_hierarchy.h
Cache | BaseCache | An inclusive cache. | cache.h
NonInclusiveCache | Cache | A non-inclusive cache. | non_incl_cache.h
TimingCache | Cache | Connects timing events through hierarchy for DDR memory models. | timing_cache.h
FilterCache | Cache | Implements optimized load/store methods for efficiency. | filter_cache.h
NUCACache | BaseCache | Cache with distributed banks internally. Maps addresses to banks via BankDir object. | nuca_cache.h
StreamPrefetcher | BaseCache | Streaming prefetcher. | prefetcher.h
CacheArray | - | Tracks addresses in cache and structure of array. | cache_arrays.h
ReplPolicy | - | A replacement policy. Notified of accesses/evictions through update/replaced methods. Chooses victim in rankCands method. | repl_policies.h
PartReplPolicy | ReplPolicy | Implements cache partitioning. Adds setPartitionSizes method and monitoring. | part_repl_policies.h
PartitionMonitor | - | Monitors partition miss curves. Provides access and getMissCurve methods. | monitor.h
Network | - | getRTT method provides (fixed) latency between two modeled objects. | network.h
CC | - | Generic coherence controller (zsim implements MESI). | coherence_ctrls.h