

SLIDE 1


The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies

Jeffrey Stuecheli (1,2), Dimitris Kaseridis (1), David Daly (3), Hillery C. Hunter (3) & Lizy K. John (1)

(1) ECE Department, The University of Texas at Austin  (2) IBM Corp., Austin  (3) IBM Thomas J. Watson Research Center

ISCA 2010

SLIDE 2


Memory terminology

Target System: Multi-Core CMP
– 8-16 cores (and up)
– Shared cache and memory subsystem
Terminology:
– Channel / Rank / Chip / Bank
Area of focus: improving the scheduling of the memory interface in light of many cores combined with DRAM technology challenges

Background

SLIDE 3


Memory Wall (Labyrinth)

Traditional concern is read latency
– Fixed at ~26 ns
Beyond latency, many parameters limit efficient utilization
Data bus frequency doubles with each DDRx generation
– DDR 200-400, DDR2 400-1066, DDR3 800-1600
– But internal latency is ~constant
Fixed latencies:
– Bank precharge (50 ns, ~7 operations @ 1066 MHz)
– Write→Read turnaround (7.5 ns, ~2 operations @ 1066 MHz)
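As a quick sanity check on those figures (assuming DDR3-1066 with burst length 8): one burst occupies the data bus for 8 transfers / 1066 MT/s ≈ 7.5 ns, so a 50 ns precharge window spans roughly 50 / 7.5 ≈ 7 back-to-back bus operations, which is where the "~7 operations" above comes from.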

Motivation

SLIDE 4

Implications

Scheduling efficiency
– Reads: critical path to execution
– Writes: decoupled queuing
We need more write buffering (to make the most of each opportunity to execute writes)
– Not read buffering, due to the latency criticality of loads

Motivation

SLIDE 5

The Virtual Write Queue

Grow effective write reordering by an order of magnitude through a two-level structure
– Writes can only execute out of the physical write queue
– Keep the physical queue full with a good mix of operations
– The physical write queue becomes a staging ground, covering the latency to pull data from the LLC (a C sketch of the two levels follows the figure)

[Figure: the Virtual Write Queue spans the MRU-to-LRU ways of each last-level cache set; the Cache Cleaner pulls dirty lines from the LRU ways into the Physical Write Queue, which feeds the DRAM scheduler.]
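To make the two-level idea concrete, here is a minimal C sketch; the sizes, field names, and the per-set dirty counter are illustrative assumptions, not parameters from the paper:

    #include <stdbool.h>
    #include <stdint.h>

    #define PWQ_ENTRIES  32        /* level 1: small physical write queue in the scheduler */
    #define LLC_SETS     8192
    #define VWQ_LRU_WAYS 4         /* level 2: LRU ways of each set act as the virtual queue */

    typedef struct {
        uint64_t addr;             /* line address; carries rank/bank/page/column bits */
        bool     valid;
    } PwqEntry;

    typedef struct {
        PwqEntry pwq[PWQ_ENTRIES];    /* writes execute only from here */
        uint8_t  dirty_lru[LLC_SETS]; /* per-set count of dirty lines staged in the
                                         VWQ_LRU_WAYS LRU ways, awaiting the cleaner */
    } VirtualWriteQueue;

The point of the split is that the small physical queue only has to hide the latency of pulling a line from the LLC, while the reordering freedom comes from the much larger population of dirty lines still sitting in the cache.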

SLIDE 6

VWQ Details

SLIDE 7


Cache→Memory Writeback Evolution

– Forced Writeback: the traditional approach to writeback.
– Eager Writeback: decouple cache fill from writeback with early "eager" writeback of dirty data (Lee, MICRO 2000).
– Scheduled Writeback: our proposal; place writeback under the control of the memory scheduler.

VWQ Details

SLIDE 8

Filling the Physical Write Queue

Key concept:

– Relatively few classes of writes:

  • Rank classification: which rank?
  • Page mode: quality level
  • Bank conflicts: avoid writes to the same bank but a different page

– Physical write queue content (see the selection sketch after this list):

  • Maintain high-quality writes in the structure
  • Keep writes to each rank available
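A hypothetical selection test in the same C sketch, building on the struct above (rank_of, bank_of, and page_of are assumed address-field helpers; one possible definition appears under Address Mapping on the next slide):

    unsigned rank_of(uint64_t a);   /* assumed helpers: extract DRAM address fields */
    unsigned bank_of(uint64_t a);
    unsigned page_of(uint64_t a);

    /* Is 'cand' a good line to pull into the physical write queue?
     * It must refill the rank we are short on, and must not conflict
     * with a queued write to the same bank but a different DRAM page. */
    static bool is_good_write(const VirtualWriteQueue *q, uint64_t cand, unsigned want_rank)
    {
        if (rank_of(cand) != want_rank)
            return false;                      /* keep writes available for every rank */
        for (int i = 0; i < PWQ_ENTRIES; i++) {
            const PwqEntry *e = &q->pwq[i];
            if (e->valid && bank_of(e->addr) == bank_of(cand)
                         && page_of(e->addr) != page_of(cand))
                return false;                  /* bank conflict: same bank, different page */
        }
        return true;                           /* same-page candidates also give page-mode hits */
    }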

VWQ Details

SLIDE 9

Address Mapping

The set address of the cache contains:
– All rank-selection bits
– All bank-selection bits
– Some number of column bits (the address within a DRAM page)

VWQ Details
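For concreteness, one possible DDR3-style bit layout consistent with these bullets; the exact field positions are assumptions, not taken from the paper:

    #include <stdint.h>

    /* Assumed layout of a physical line address (64-byte lines):
     *   offset[5:0] | column-low[12:6] | bank[15:13] | rank[16] | row[63:17]
     * With 8192 sets, the set index is bits [18:6]: it contains the rank
     * bit, all three bank bits, and the low column bits (plus two low
     * row bits), as the bullets above require. */
    unsigned rank_of(uint64_t a) { return (unsigned)(a >> 16) & 0x1;    }
    unsigned bank_of(uint64_t a) { return (unsigned)(a >> 13) & 0x7;    }
    unsigned page_of(uint64_t a) { return (unsigned)(a >> 17);          }  /* DRAM row == page */
    unsigned set_of (uint64_t a) { return (unsigned)(a >> 6) & 0x1FFF;  }

Because the set index embeds the rank and bank bits, the cleaner can tell which DRAM resources a set's dirty lines target without reading the tags.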

SLIDE 10


The Cache Cleaner

Goal: fast/efficient search of the large LLC directory
Based around the Set State Vector (SSV)
The SSV enables:
– Efficient communication of which dirty lines should be cleaned
– The cleaner selects lines based on the current physical write queue contents
– Keeps the queue full with a uniform mix of operations to each DRAM resource (a scan sketch follows the figure)

[Figure: as in the earlier diagram, with the Set State Vector alongside the LLC summarizing which cache sets hold dirty lines in their LRU ways for the Cache Cleaner to scan.]
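A minimal sketch of what an SSV scan could look like, assuming a one-bit-per-set encoding (the representation and the rank_bits_of_set helper are assumptions):

    #include <stdint.h>

    #define LLC_SETS 8192
    unsigned rank_bits_of_set(int set);   /* assumed: rank field embedded in the set index */

    /* One bit per LLC set: set when dirty lines sit in the set's LRU ways. */
    typedef struct { uint64_t bits[LLC_SETS / 64]; } SetStateVector;

    /* Scan the compact SSV (1 KB here) instead of the full directory to find
     * the next set with cleanable lines targeting the rank we want to refill. */
    int ssv_next_dirty_set(const SetStateVector *ssv, int start, unsigned want_rank)
    {
        for (int s = start; s < LLC_SETS; s++) {
            if (((ssv->bits[s / 64] >> (s % 64)) & 1) == 0)
                continue;                          /* nothing dirty staged in this set */
            if (rank_bits_of_set(s) == want_rank)
                return s;
        }
        return -1;                                 /* no candidate for this rank */
    }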

VWQ Details

SLIDE 11

Read/Write Priority in the Scheduler

Goal: Defer write operations as long as possible

– Forced Writeback: queuing depth is quite limited.
– Eager Writeback: the write queue is always full; how do we know when we must execute writes?
– Virtual Write Queue: monitor overall fullness on a per-rank basis, giving a much larger effective buffering capability (sketched below).
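One plausible way to act on per-rank fullness, with illustrative watermark values (the thresholds and hysteresis are assumptions, not the paper's policy):

    /* Reads keep priority until a rank's share of the Virtual Write Queue
     * fills past a high watermark; then its writes are drained until a low
     * watermark is reached.  Hysteresis avoids flipping modes every cycle. */
    #define HIGH_WATERMARK 0.80
    #define LOW_WATERMARK  0.50

    typedef enum { PRIORITIZE_READS, DRAIN_WRITES } SchedMode;

    SchedMode next_mode(SchedMode cur, double rank_fullness)
    {
        if (cur == PRIORITIZE_READS && rank_fullness > HIGH_WATERMARK)
            return DRAIN_WRITES;       /* writes can no longer be deferred */
        if (cur == DRAIN_WRITES && rank_fullness < LOW_WATERMARK)
            return PRIORITIZE_READS;   /* enough headroom to defer again */
        return cur;
    }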

VWQ Details

SLIDE 12

Evaluation/Results

SLIDE 13

Bandwidth Improvement Example

From SPEC mcf workload

[Figure: DRAM bus utilization (0.1-0.9) over 350 million instructions of the SPEC mcf workload, Baseline vs. VWQ.]

Evaluation/Results

SLIDE 14


Virtual Write Queue IPC Gains

Each experiment consists of 8 copies of the same benchmark
– IPC was observed to be uniform across cores (the symmetrical system was fair)
Improvements in 1-, 2-, and 4-rank systems
– Largest improvement with 1 rank, due to the exposed "write-to-read same rank" penalty

Evaluation/Results

[Figure: IPC improvement (0-25%) for bzip2, bwaves, cactusADM, dealII, gcc, GemsFDTD, hmmer, leslie3d, libquantum, mcf, omnetpp, soplex, and the average, on 1-, 2-, and 4-rank systems.]

SLIDE 15


Power Reduction Due to Increased Write Page Mode Access

Overall DRAM power reduction is shown

Evaluation/Results

SLIDE 16

Conclusion

Memory scheduling is critical to CMP design
We must leverage all state in the SoC/CMP

SLIDE 17


Thank You, Questions?

Laboratory for Computer Architecture, The University of Texas at Austin; IBM Austin; IBM T. J. Watson Research Center