What we'll talk about: ZSim has a full-featured memory system

Example: “A day in the life of a memory request”
- Bound-phase function simulation
- Some components add weave-phase modeling
[Figure: request path through Core, Filter $, L1I/L1D, Prefetcher, L2, NoC, Cache, and Memory, connected by load()/store(), access(), and invalidate() calls. Inside a cache: Array lookup(), Replacement rankCands() over the candidates (cands), Coherence, and Coherence Directory. Other labels: MemReq, Latency, Contention Model.]

Important ZSim memory classes: MemReq

MemReq
- Represents an in-flight memory request
- Important fields:
  - uint64_t lineAddr: shifted address
  - AccessType type: GETS, GETX, PUTS, PUTX
  - uint64_t cycle: requesting cycle
  - MESIState* state: coherence state (M, E, S, or I)
- Important methods: N/A
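As a small illustration of the "shifted address" field: a line address is the byte address with the within-line offset bits dropped. The sketch below assumes 64-byte lines (6 offset bits); the constant and the helper name are illustrative and are not taken from ZSim.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64-byte cache lines, i.e. 6 offset bits. The real line size is a configuration choice.
constexpr uint32_t kLineBits = 6;

// Hypothetical helper: drop the within-line offset to obtain the line address.
constexpr uint64_t toLineAddr(uint64_t byteAddr) {
    return byteAddr >> kLineBits;
}

// All bytes of one line map to the same line address.
static_assert(toLineAddr(0x1000) == toLineAddr(0x103F), "same 64-byte line");
```

Two byte addresses that differ only in their low 6 bits produce the same lineAddr, which is why caches and directories can key their state on it.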

Important ZSim memory classes: MemReq, MemObject

MemObject
- Generic interface for things that handle memory requests
- Important fields: N/A
- Important methods:
  - uint64_t access(MemReq& req): performs an access and returns completion time

Implementing a simple model for main memory

    class SimpleMemory : public MemObject {
        uint64_t latency;
        g_string name;
      public:
        SimpleMemory(uint64_t _latency, g_string _name) : latency(_latency), name(_name) {}
        const char* getName() { return name.c_str(); }

        uint64_t access(MemReq& req) {
            // Set coherence state in the requestor
            switch (req.type) {
                case PUTS:
                case PUTX:  // write
                    *req.state = I;
                    break;
                case GETS:
                    *req.state = req.is(MemReq::NOEXCL) ? S : E;
                    break;
                case GETX:
                    *req.state = M;
                    break;
            }
            return req.cycle + latency;  // completion cycle
        }
    };
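To try the fixed-latency model outside of ZSim, the sketch below stubs out just enough of MemReq and MemObject to compile on its own. The stand-in types, the flag value, the use of std::string in place of g_string, and the issue() helper are all assumptions made for illustration; only the access() logic follows the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Minimal stand-ins for ZSim types (assumptions, just enough to compile).
enum AccessType { GETS, GETX, PUTS, PUTX };
enum MESIState { I, S, E, M };

struct MemReq {
    enum Flag { NOEXCL = 1 << 2 };  // flag value is illustrative
    uint64_t lineAddr;
    AccessType type;
    uint64_t cycle;
    MESIState* state;
    uint32_t flags;
    bool is(Flag f) const { return (flags & f) != 0; }
};

class MemObject {
  public:
    virtual ~MemObject() {}
    virtual uint64_t access(MemReq& req) = 0;
    virtual const char* getName() = 0;
};

class SimpleMemory : public MemObject {
    uint64_t latency;
    std::string name;  // std::string stands in for g_string here
  public:
    SimpleMemory(uint64_t _latency, const std::string& _name) : latency(_latency), name(_name) {}
    const char* getName() override { return name.c_str(); }
    uint64_t access(MemReq& req) override {
        switch (req.type) {
            case PUTS:
            case PUTX: *req.state = I; break;
            case GETS: *req.state = req.is(MemReq::NOEXCL) ? S : E; break;
            case GETX: *req.state = M; break;
        }
        return req.cycle + latency;  // fixed latency, no contention
    }
};

// Hypothetical helper: issue one request and return its completion cycle.
uint64_t issue(MemObject& mem, AccessType type, uint64_t cycle, MESIState& state, uint32_t flags = 0) {
    MemReq req = {0x40, type, cycle, &state, flags};
    return mem.access(req);
}
```

With a latency of 100, a GETS issued at cycle 10 completes at cycle 110 and leaves the requestor's state at E, or at S when NOEXCL is set.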

Important ZSim memory classes (“is a” hierarchy): MemReq; MemObject, with SimpleMemory, MD1Memory, and DDRMemory

Memory controllers
- Different models for main memory
- SimpleMemory: fixed-latency, no contention
  - Important fields: latency
- MD1Memory: contention modeled using M/D/1 queue
  - Important fields: megabytesPerSecond (bandwidth), zeroLoadLatency, etc.
- DDRMemory & DRAMSimMemory: detailed modeling of DDR timings
  - Important fields: lots of configuration parameters (CAS, RAS, bus MHz)
  - Timings modeled in weave-phase
  - Requires TimingCore or OOO core models
  - Similar accuracy, but DDRMemory is much faster
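For intuition on the M/D/1 option: in the textbook M/D/1 queue (Poisson arrivals, deterministic service time S, utilization rho), the mean time spent waiting in the queue is rho * S / (2 * (1 - rho)). The sketch below adds that delay to a zero-load latency. How MD1Memory actually measures load and applies the model is not shown on the slides, so the function name and parameters here are assumptions.

```cpp
#include <cassert>
#include <limits>

// Textbook M/D/1 mean queueing delay added to a zero-load latency.
// serviceTime: deterministic time to serve one request (e.g. cycles per line at peak bandwidth).
// arrivalRate: mean requests per unit time, in the same time unit as serviceTime.
double md1Latency(double zeroLoadLatency, double serviceTime, double arrivalRate) {
    double rho = arrivalRate * serviceTime;  // utilization
    if (rho >= 1.0) {
        return std::numeric_limits<double>::infinity();  // saturated: the queue grows without bound
    }
    double queueDelay = rho * serviceTime / (2.0 * (1.0 - rho));
    return zeroLoadLatency + queueDelay;
}
```

At 50% utilization with a service time of 4, the queueing delay is 2, so a zero-load latency of 100 becomes 102; the delay grows sharply as utilization approaches 1.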

Important ZSim memory classes (“is a” hierarchy): MemReq; MemObject, with SimpleMemory, MD1Memory, DDRMemory, and BaseCache; InvReq

InvReq
- Represents an invalidation request from coherence controller/directory
- Important fields:
  - uint64_t lineAddr: shifted address
  - InvType type: INV, INVX, FWD
  - uint64_t cycle: requesting cycle
- Important methods: N/A

BaseCache
- Generic interface for cache-like objects
- Important fields: N/A
- Important methods:
  - void setParents(…): register the caches above it in the hierarchy
  - void setChildren(…): register the caches below it in the hierarchy
  - uint64_t invalidate(const InvReq& req): invalidate line locally & in children
  - uint64_t access(MemReq& req)

Important ZSim memory classes (“is a” hierarchy): MemReq; MemObject, with SimpleMemory, MD1Memory, DDRMemory, and BaseCache; InvReq; Cache

Cache
- Inclusive cache
- Contains tag array, coherence controller, replacement policy (discussed later)
- Adds logic to control these components
- Important fields (that aren’t discussed later):
  - uint32_t accLat: access latency
  - uint32_t invLat: invalidation latency
- Important methods:
  - void setParents(…): register the caches above it in the hierarchy
  - void setChildren(…): register the caches below it in the hierarchy
  - uint64_t invalidate(const InvReq& req): invalidate line locally & in children
  - uint64_t access(MemReq& req)
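One plausible way the completion time returned by access() composes across levels is sketched below: each level adds its own access latency and, on a miss, forwards the request to its parent at the resulting cycle. This is a toy with unbounded capacity and no coherence or replacement; the ToyLevel, ToyMemory, and ToyCacheLevel names are hypothetical and are not ZSim classes.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical interface: take a line address and a request cycle, return the completion cycle.
struct ToyLevel {
    virtual ~ToyLevel() {}
    virtual uint64_t access(uint64_t lineAddr, uint64_t cycle) = 0;
};

struct ToyMemory : ToyLevel {
    uint64_t latency;
    explicit ToyMemory(uint64_t l) : latency(l) {}
    uint64_t access(uint64_t, uint64_t cycle) override { return cycle + latency; }
};

struct ToyCacheLevel : ToyLevel {
    uint32_t accLat;
    ToyLevel* parent;
    std::unordered_set<uint64_t> lines;  // unbounded: no replacement is modeled
    ToyCacheLevel(uint32_t lat, ToyLevel* p) : accLat(lat), parent(p) {}

    uint64_t access(uint64_t lineAddr, uint64_t cycle) override {
        uint64_t respCycle = cycle + accLat;        // pay the local lookup latency
        if (lines.insert(lineAddr).second) {        // newly inserted means it was a miss
            respCycle = parent->access(lineAddr, respCycle);
        }
        return respCycle;
    }
};
```

With an L1 of latency 4, an L2 of latency 10, and a memory of latency 100, a cold access issued at cycle 0 completes at cycle 114, while a later hit in the L1 costs only 4 cycles.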

How ZSim allows concurrency
[Figure: cache hierarchy with an L3, an L2, two L1s, and two Cores]
- Naïve “big lock” implementation won’t work
- There is concurrency available!
  [Animation: a MemReq from each core moves up through the hierarchy and back down to its core at the same time]
- Requires handling many complex transients!

How ZSim allows concurrency
- Locking each cache leads to deadlock on invalidations
  - L1 is waiting on L2 on MemReq
  - L2 is waiting on L1 on InvReq
  - Deadlock!
[Figure: L3 / L2 / two L1s / two Cores, with one MemReq at L2 sending an InvReq down to an L1 whose own MemReq is going up to L2]

How ZSim allows concurrency
- Blocks more accesses going up, allows invalidations going down
- Caches have two locks: access lock + invalidation lock
- Invalidations are prioritized
  - Accesses acquire both locks
  - Invalidations need only invalidation lock

    uint64_t Cache::access(MemReq& req) {
        invLock.acquire();
        accLock.acquire();
        // look up address etc
        invLock.release();
        parent->access(req);
        // check if we got an invalidation!
        accLock.release();
        return completionTime;
    }

    uint64_t Cache::invalidate(InvReq& req) {
        invLock.acquire();
        // do invalidation
        children.invalidate(req);
        invLock.release();
        return completionTime;
    }
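The two-lock idea can be exercised with standard C++ mutexes. The sketch below is a toy under stated assumptions, not ZSim's implementation: ToyCache and runTwoCores are hypothetical names, there is one thread per L1 (one core per private cache), the shared level invalidates the other L1 on every access, and only counters are tracked instead of real coherence state.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct ToyCache {
    std::mutex accLock;  // held for the whole access, including the trip to the parent
    std::mutex invLock;  // held only while local state is being touched
    ToyCache* parent = nullptr;
    std::vector<ToyCache*> children;
    uint64_t accesses = 0;       // protected by accLock
    uint64_t invalidations = 0;  // protected by invLock
    uint64_t racedAccesses = 0;  // accesses that saw an invalidation arrive meanwhile

    // Invalidations need only invLock, so they get through while an access waits on the parent.
    void invalidate() {
        std::lock_guard<std::mutex> guard(invLock);
        ++invalidations;
        for (ToyCache* c : children) c->invalidate();
    }

    // Accesses acquire both locks, then release invLock before going up.
    void access(ToyCache* requester = nullptr) {
        invLock.lock();
        accLock.lock();
        ++accesses;
        // "look up address etc": here a shared level invalidates the other children
        for (ToyCache* c : children) {
            if (c != requester) c->invalidate();
        }
        uint64_t seen = invalidations;
        invLock.unlock();
        if (parent) parent->access(this);
        invLock.lock();  // "check if we got an invalidation"
        if (invalidations != seen) ++racedAccesses;
        invLock.unlock();
        accLock.unlock();
    }
};

// Drive two L1s from two threads, n accesses each.
void runTwoCores(ToyCache& a, ToyCache& b, int n) {
    auto hammer = [n](ToyCache* c) { for (int i = 0; i < n; i++) c->access(); };
    std::thread ta(hammer, &a);
    std::thread tb(hammer, &b);
    ta.join();
    tb.join();
}
```

In this setup an L1 never holds its invalidation lock while waiting on the L2, so the L2 can always deliver an invalidation to it. The sketch does not cover deeper hierarchies or several threads sharing one L1, which need additional care.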

How ZSim allows concurrency
[Animation: the L3 / L2 / two L1s / two Cores hierarchy, with a legend for the invalidation lock and the access lock. A MemReq from each core moves up through its L1 toward L2 and L3; while MemReqs are in flight, an InvReq travels down from L2 to an L1; both MemReqs then return to their cores.]

Important ZSim memory classes (“is a” hierarchy): MemReq; MemObject, with SimpleMemory, MD1Memory, DDRMemory, and BaseCache; InvReq; Cache, NUCACache, StreamPrefetcher

NUCACache
- Non-uniform cache access: banks distributed around the chip
- Important fields:
  - BankDir* bankDir: see below
  - g_vector<BaseCache*> banks: the distributed banks
- Important methods: none over BaseCache
- Supports dynamic NUCA policies via BankDir class
  - uint32_t preAccess(MemReq& req): give destination bank
  - int32_t getPrevBank(MemReq& req, uint32_t curBank): get old bank (if moved)
- Wide-ranging support
  - First-touch, R-NUCA [Hardavellas ISCA’09], [Awasthi HPCA’09], idealized private D-NUCA [Herrero ISCA’10], Jigsaw [Beckmann PACT’13, Beckmann HPCA’15]
  - Some yet-to-be-released

NUCACache::access pseudo-code

    uint64_t NUCACache::access(MemReq& req) {
        uint32_t bank = bankDir->preAccess(req);
        int32_t prevBank = bankDir->getPrevBank(req, bank);
        if (prevBank != -1 && bank != prevBank) {
            // move the line from prevBank to bank
        }
        uint64_t completionCycle = banks[bank]->access(req);
        return completionCycle;
    }

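Implementing your own D-NUCA
- Idealized “last-touch” bank dir that migrates lines to wherever they are referenced

    uint32_t LastTouchBankDir::preAccess(MemReq& req) {
        // First entry of the sorted RTT list for this child, i.e. its closest bank
        uint32_t closestBank = nuca->getSortedRTTs(req.childId)[0].second;
        return closestBank;
    }

    int32_t LastTouchBankDir::getPrevBank(MemReq& req, uint32_t currentBank) {
        ScopedMutex sm(mutex);  // avoid races
        // Assumes lineMap is a map from line address to the bank holding the line
        auto prevBankId = lineMap.find(req.lineAddr);
        if (prevBankId == lineMap.end()) {
            lineMap[req.lineAddr] = currentBank;  // first touch: start tracking the line
            return -1;
        } else if (currentBank == prevBankId->second) {
            return -1;  // line is already in the requested bank
        } else {
            uint32_t prevBank = prevBankId->second;
            prevBankId->second = currentBank;  // record the migration
            return prevBank;
        }
    }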
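The bookkeeping behind getPrevBank can be tried in isolation. The sketch below is a standalone approximation: ToyLastTouchDir is a hypothetical class that uses std::unordered_map and std::mutex in place of ZSim's containers and ScopedMutex, and it takes the line address directly instead of a MemReq.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

class ToyLastTouchDir {
    std::mutex lock_;
    std::unordered_map<uint64_t, uint32_t> lineMap;  // line address -> bank currently holding it
  public:
    // Returns the bank the line must be moved from, or -1 if no move is needed.
    int32_t getPrevBank(uint64_t lineAddr, uint32_t currentBank) {
        std::lock_guard<std::mutex> guard(lock_);  // avoid races between cores
        auto it = lineMap.find(lineAddr);
        if (it == lineMap.end()) {
            lineMap[lineAddr] = currentBank;  // first touch: nothing to migrate
            return -1;
        }
        if (it->second == currentBank) {
            return -1;  // already in the right bank
        }
        uint32_t prevBank = it->second;
        it->second = currentBank;  // the line now lives where it was last touched
        return static_cast<int32_t>(prevBank);
    }
};
```

A line first touched via bank 2 and later referenced via bank 5 yields a previous bank of 2 exactly once; subsequent references via bank 5 report no move.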

StreamPrefetcher
- Implements stream prefetcher
- Important fields:
  - Entry array[16]: the streams it is following
- Important methods: none over BaseCache
- Prefetcher will issue its own MemReqs to parents
- Validated against Westmere
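The slides do not show the prefetcher's internals, so the following is only a generic sketch of stream detection with a fixed number of tracked streams. ToyStreamPrefetcher, its round-robin replacement, and the "next sequential line" policy are all assumptions; the only detail borrowed from the slide is the 16-entry array.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct ToyStreamPrefetcher {
    static constexpr int kStreams = 16;  // matches the 16-entry array on the slide
    struct Entry {
        uint64_t lastLine;
        bool valid;
    };
    std::array<Entry, kStreams> streams;
    int nextVictim;

    ToyStreamPrefetcher() : nextVictim(0) {
        for (Entry& e : streams) {
            e.lastLine = 0;
            e.valid = false;
        }
    }

    // Returns true and sets prefetchLine if this access extends a tracked ascending stream.
    bool onAccess(uint64_t lineAddr, uint64_t& prefetchLine) {
        for (Entry& e : streams) {
            if (e.valid && lineAddr == e.lastLine + 1) {
                e.lastLine = lineAddr;
                prefetchLine = lineAddr + 1;  // fetch one line ahead of the stream
                return true;
            }
        }
        // No match: start tracking a new stream, replacing entries round-robin.
        streams[nextVictim].lastLine = lineAddr;
        streams[nextVictim].valid = true;
        nextVictim = (nextVictim + 1) % kStreams;
        return false;
    }
};
```

In a simulator, a true return would be the point where the prefetcher issues its own request for prefetchLine to the parent cache.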

ZSim Tutorial: Memory System. Nathan Beckmann. MICRO 2015, Waikiki, Hawaii, 5 Dec 2015.
What we'll talk about: ZSim has a full-featured memory system (originally designed for caches)
[Figure: Core, Memory]
