
The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies - PowerPoint PPT Presentation



  1. ISCA 2010. The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. Jeffrey Stuecheli (1,2), Dimitris Kaseridis (1), David Daly (3), Hillery C. Hunter (3) & Lizy K. John (1). (1) ECE Department, The University of Texas at Austin; (2) IBM Corp., Austin; (3) IBM Thomas J. Watson Research Center. 6/21/2010

  2. Background: Memory Terminology
  - Target system: a multi-core CMP with 8-16 cores (and up) and a shared cache and memory subsystem
  - Terminology: Channel / Rank / Chip / Bank
  - Area of focus: improving scheduling of the memory interface in light of many cores combined with DRAM technology challenges

  3. Motivation: The Memory Wall (Labyrinth)
  - The traditional concern is read latency, which is fixed at ~26 ns
  - Beyond latency, many parameters limit efficient utilization
  - Data bus frequency roughly doubles each DDRx generation (DDR 200-400, DDR2 400-1066, DDR3 800-1600), but internal latency stays ~constant
  - Fixed latencies: bank precharge (50 ns, ~7 operations @ 1066 MHz) and write-to-read turnaround (7.5 ns, ~2 operations @ 1066 MHz)

  4. Motivation: Implications
  - Scheduling efficiency: reads are on the critical path to execution, while writes are decoupled from it
  - Queuing: we need more write buffering (to make the most of each opportunity to execute writes), not more read buffering, because of the latency criticality of loads

  5. The Virtual Write Queue
  - Grow effective write reordering by an order of magnitude through a two-level structure
  - Writes can only execute out of the physical write queue
  - Keep the physical queue full with a good mix of operations
  - The physical write queue becomes a staging ground, covering the latency to pull data from the LLC
  (Diagram: the Virtual Write Queue spans the dirty LRU ways of the Last-Level Cache; the Cache Cleaner moves lines from the LLC into the Physical Write Queue, which feeds the DRAM Scheduler.)
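A minimal C sketch of the two-level idea, under illustrative assumptions: the queue size and all names (`phys_wq_t`, `wq_entry_t`) are hypothetical, not from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define PHYS_WQ_ENTRIES 32   /* hypothetical size of the on-chip queue */

/* One write staged out of the LLC, tagged with the DRAM resources
 * it will occupy so the scheduler can reorder around conflicts. */
typedef struct {
    uint64_t addr;   /* physical address of the dirty line */
    unsigned rank;   /* target DRAM rank */
    unsigned bank;   /* target DRAM bank */
    bool     valid;
} wq_entry_t;

/* The physical write queue: the only structure writes execute from.
 * The "virtual" extension is not a buffer at all; it is the dirty
 * data already resident in the LLC's LRU ways, which the cache
 * cleaner pulls in to keep this queue full. */
typedef struct {
    wq_entry_t entry[PHYS_WQ_ENTRIES];
    unsigned   count;
} phys_wq_t;

static inline bool phys_wq_needs_refill(const phys_wq_t *q)
{
    return q->count < PHYS_WQ_ENTRIES;   /* cleaner should stage more */
}
```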

  6. VWQ Details

  7. VWQ Details: Cache-to-Memory Writeback Evolution
  - Forced writeback: the traditional approach; a dirty line is written back only when an eviction forces it out
  - Eager writeback: decouples the cache fill from the writeback with early "eager" writeback of dirty data (Lee, MICRO 2000)
  - Scheduled writeback: our proposal; places writeback under the control of the memory scheduler, as sketched below
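The three policies differ only in what triggers a writeback; a hedged sketch where the trigger predicates are illustrative names, not the paper's interface:

```c
#include <stdbool.h>

typedef enum { FORCED, EAGER, SCHEDULED } wb_policy_t;

/* When does a dirty line leave the cache for memory?
 *  - FORCED:    only when a fill evicts it (writeback on the miss path).
 *  - EAGER:     early, once it reaches the LRU position, so the later
 *               eviction finds it already clean.
 *  - SCHEDULED: when the DRAM scheduler pulls it, i.e. when the write
 *               fits the current rank/bank/page-mode plan.            */
bool should_write_back(wb_policy_t p, bool being_evicted,
                       bool reached_lru, bool scheduler_pull)
{
    switch (p) {
    case FORCED:    return being_evicted;
    case EAGER:     return being_evicted || reached_lru;
    case SCHEDULED: return being_evicted || scheduler_pull;
    }
    return false;
}
```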

  8. VWQ Details: Filling the Physical Write Queue
  - Key concept: there are relatively few classes of writes
    - Rank classification: which rank does the write target?
    - Page mode: quality level (does the write hit an already-open page?)
    - Bank conflicts: avoid writes to the same bank but a different page
  - Physical write queue content:
    - Maintain high-quality writes in the structure
    - Keep writes queued for each rank (see the candidate filter sketched below)
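One way the cleaner might screen candidates against the queue's current contents; a sketch continuing the `phys_wq_t` types above, with a hypothetical `page_of()` helper and an assumed DRAM page size:

```c
/* Assumed 4 KiB DRAM page for illustration; real devices vary. */
static inline uint64_t page_of(uint64_t addr) { return addr >> 12; }

/* Reject a candidate write that targets the same bank as a queued
 * write but a different page: that pairing forces a precharge and
 * activate between them, the "bank conflict" the slide warns about. */
static bool is_good_candidate(const phys_wq_t *q, const wq_entry_t *cand)
{
    for (unsigned i = 0; i < PHYS_WQ_ENTRIES; i++) {
        const wq_entry_t *e = &q->entry[i];
        if (!e->valid)
            continue;
        if (e->rank == cand->rank && e->bank == cand->bank &&
            page_of(e->addr) != page_of(cand->addr))
            return false;
    }
    return true;   /* either page-mode friendly or conflict-free */
}
```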

  9. VWQ Details: Address Mapping
  - The set address of the cache contains:
    - All rank selection bits
    - All bank selection bits
    - Some number of column bits (the address within a DRAM page)
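What this buys the cleaner, shown with a made-up bit layout (field positions and widths are illustrative, not the paper's):

```c
#include <stdint.h>

/* Hypothetical physical-address layout, line offset in bits [5:0]:
 *   column-low [12:6], bank [15:13], rank [16], row [..:17].
 * With a 2048-set LLC the set index covers bits [16:6], so rank,
 * bank, and some column bits all live inside the set index: the
 * cleaner can tell which DRAM resources a dirty line's writeback
 * will use from its SET alone, without reading the full address. */
static inline unsigned rank_of_set(unsigned set) { return (set >> 10) & 0x1; }
static inline unsigned bank_of_set(unsigned set) { return (set >> 7)  & 0x7; }
static inline unsigned col_of_set (unsigned set) { return  set        & 0x7f; }
```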

  10. VWQ Details: The Cache Cleaner
  - Goal: fast/efficient search of the large LLC directory
  - Based around a Set State Vector (SSV)
  - The SSV enables efficient communication of the dirty lines to be cleaned
  - The cleaner selects lines based on the current physical write queue contents: keep the queue full with a uniform mix of operations to each DRAM resource
  (Diagram: the Cache Cleaner sits between the Last-Level Cache and the Physical Write Queue, guided by the Set State Vector over each set's MRU-to-LRU ways.)
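A sketch of how an SSV scan could work, assuming one bit per set and the made-up address layout from the previous slide's sketch:

```c
#include <stdint.h>

#define NUM_SETS 2048   /* matches the illustrative layout above */

/* One bit per LLC set: set when the set holds dirty data near LRU
 * that is worth cleaning.  Far smaller than the directory itself. */
static uint64_t ssv[NUM_SETS / 64];

static inline int ssv_test(unsigned set)
{
    return (ssv[set >> 6] >> (set & 63)) & 1;
}

/* Because rank and bank bits are embedded in the set index, the sets
 * mapping to one (rank, bank) pair form a regular stripe of the SSV.
 * Scanning just that stripe finds a cleanable line for exactly the
 * DRAM resource the physical write queue is short on. */
int find_cleanable_set(unsigned rank, unsigned bank)
{
    for (unsigned col = 0; col < 128; col++) {       /* 7 column bits */
        unsigned set = (rank << 10) | (bank << 7) | col;
        if (ssv_test(set))
            return (int)set;
    }
    return -1;   /* nothing dirty for this rank/bank right now */
}
```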

  11. VWQ Details: Read/Write Priority in the Scheduler
  - Goal: defer write operations as long as possible
  - Forced writeback: queuing depth is quite limited
  - Eager writeback: the write queue is always full, so how do we know when we must execute writes?
  - Virtual Write Queue: monitor overall fullness on a per-rank basis; a much larger effective buffering capability (see the drain-policy sketch below)
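A sketch of a per-rank drain rule with made-up high/low water marks; the thresholds and hysteresis shape are illustrative, not the paper's exact policy:

```c
#include <stdbool.h>

#define NUM_RANKS  4
#define HIGH_WATER 0.80   /* occupancy at which writes must drain */
#define LOW_WATER  0.50   /* occupancy at which reads resume      */

typedef struct {
    double occupancy[NUM_RANKS]; /* physical + virtual fullness, 0..1 */
    bool   draining[NUM_RANKS];  /* rank currently in a write burst?  */
} sched_state_t;

/* Reads win by default; only when a rank's combined write occupancy
 * crosses the high-water mark does the scheduler burst writes to it,
 * with hysteresis so it does not thrash between read and write mode. */
bool should_issue_writes(sched_state_t *s, unsigned rank)
{
    if (s->draining[rank]) {
        if (s->occupancy[rank] < LOW_WATER)
            s->draining[rank] = false;
    } else if (s->occupancy[rank] > HIGH_WATER) {
        s->draining[rank] = true;
    }
    return s->draining[rank];
}
```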

  12. Evaluation/Results

  13. Evaluation/Results: Bandwidth Improvement Example
  - From the SPEC mcf workload
  (Plot: memory bus utilization, 0 to 0.9, over 0-350 million instructions, comparing Baseline against VWQ.)

  14. Evaluation/Results: Virtual Write Queue IPC Gains
  - Each experiment consists of 8 copies of the same benchmark; IPC was observed to be uniform across cores (the symmetrical system was fair)
  - Improvements in 1-, 2-, and 4-rank systems
  - Largest improvement with 1 rank, due to the exposed "write to read same rank" penalty
  (Chart: IPC improvement, up to ~25%, for 1-, 2-, and 4-rank systems across hmmer, libquantum, mcf, omnetpp, bzip2, bwaves, cactus, dealII, gcc, GemsFDTD, leslie3d, soplex, and the average.)

  15. Evaluation/Results: Power Reduction Due to Increased Write Page-Mode Access
  - Overall DRAM power reduction is shown (chart)

  16. Conclusion
  - Memory scheduling is critical to CMP design
  - We must leverage all state in the SoC/CMP

  17. Thank You. Questions? Laboratory for Computer Architecture, The University of Texas at Austin & IBM Austin & IBM T. J. Watson Lab
