SLIDE 1

Rank Idle Time Prediction Driven Last-Level Cache Writeback

Zhe Wang, Samira M. Khan, Daniel A. Jiménez Computer Science Department University of Texas at San Antonio

SLIDE 2

Memory Latency is a Performance Bottleneck

  • Memory wall
  • Microprocessors are much faster than memory

[Diagram: fast microprocessor and caches issuing reads and writes to slow DRAM]

  • System performance is sensitive to memory read latency
  • Write-Induced Interference [Lee et al. 2010]
  • Writes can delay the service of reads, degrading performance

SLIDE 3

Write-Induced Interference

[Timing diagram: DRAM command and data lines servicing writes from the write buffer, followed by a read from the read buffer. The servicing write cycles, bus turnaround, and wait cycles add write-induced interference before the read's data transfer; in the illustrated example the interference spans 108 processor cycles]

Servicing write requests delays the service of following read requests, causing performance degradation.

SLIDE 4

Quantifying Write-Induced Interference

[Figure: per-benchmark speedup achievable without write-induced interference for 18 SPEC CPU2006 benchmarks (400.perlbench through 483.xalancbmk) and their geometric mean; y-axis 1.0–1.5]

Without write-induced interference, system performance improves by 23% on average.

SLIDE 5

Traditional Writeback

  • Dirty cache blocks are sent to the write buffer when evicted

[Diagram: evicted writes flow from the last-level cache into a small write buffer, then to DRAM]

  • The problems
  • Clustered memory traffic: bursty reads mixed with evicted writes
  • Writeback inefficiency: the small size of the write buffer limits scheduling

SLIDE 6

Contributions of This Paper

  • Propose a technique that services write requests at the point that minimizes the delay caused to following read requests
  • Propose a low-overhead rank idle time predictor to predict long periods of idle time in memory ranks
  • Propose an LLC writeback management policy that intelligently writes back writes with bank-level parallelism during long rank idle periods
  • Balance the memory bandwidth
  • Isolate the service of memory read and write requests as much as possible

SLIDE 7

Outline

  • Motivation
  • Rank Idle Time Prediction Driven Last-Level Cache Writeback Technique
  • System Structure
  • Rank Idle Time Predictor
  • Evaluation
  • Conclusion

SLIDE 8

Reducing Write-Induced Interference

  • When to service write requests
  • Memory write requests should be serviced at times that cause minimal interference with read requests

[Diagram: a read access pattern over time, comparing perfect writeback (write requests fill the gaps between reads) with traditional writeback (write requests cluster with reads)]

  • How to schedule write requests
  • Schedule high-locality write requests
  • Use a large write scheduling space

SLIDE 9

Related Work: LLC Writeback Techniques

  • Eager Writeback [Lee et al. 2000]
  • Memory scheduling space is limited by the write buffer
  • Has no knowledge of how long the rank idle period will last
  • Virtual Write Queue [Stuecheli et al. 2010]
  • Requires a specific memory address mapping scheme
  • Has no knowledge of how long the rank idle period will last

[Diagram: scheduled writes are drawn from the LRU end of the LLC (the virtual write queue) into the write buffer, then written to DRAM]

SLIDE 10

Quantifying Rank Idle Time

[Figure: rank idle percentage for workloads mix1–mix8 and their geometric mean; y-axis 0%–100%]

Ranks are idle 38% of the time on average.

SLIDE 11

Rank Idle Time Prediction Driven LLC Writeback

Insight: allow writes to be serviced during long rank idle periods.

  • Use a predictor to predict a long rank idle period once a rank starts to become idle
  • Scheduled write requests are generated from the LLC and sent to DRAM for service during the predicted long rank idle period
  • Redistribute write requests into long rank idle periods
  • Isolate the service of memory read and write requests as much as possible
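
The policy above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: `predictor`, `cleaner`, and `issue_write` are hypothetical interfaces standing in for the rank idle time predictor, the cache cleaner, and the memory controller's write path.

```python
# Illustrative sketch of the writeback policy: when a rank goes idle, drain
# scheduled writes only if a long idle period is predicted. The predictor,
# cleaner, and issue_write interfaces are hypothetical stand-ins.

def on_rank_idle(rank, last_miss_pc, predictor, cleaner, issue_write):
    """Called when `rank` has no pending read requests.
    Returns the number of scheduled writes issued."""
    if not predictor.predict(last_miss_pc):
        return 0  # likely a short idle gap: keep the rank free for reads
    issued = 0
    # The cleaner supplies dirty LLC blocks chosen for bank-level parallelism.
    for block in cleaner(rank):
        issue_write(block)  # write back; the block stays in the LLC, now clean
        issued += 1
    return issued
```

The key design point is the guard: writes are issued only when the idle period is predicted to be long enough to absorb them, which keeps reads and writes isolated from each other.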

SLIDE 12

System Structure

[Diagram: the rank idle time predictor, indexed by the PC of the last LLC miss, is consulted when a rank becomes idle; on a long-rank-idle-time prediction it signals the cache cleaner, which selects dirty blocks (tracked via dirty bits) near the LRU end of the LLC and issues writes with bank-level parallelism through the write buffer to DRAM]
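
The cache cleaner's selection step can be sketched as follows. This is an illustrative sketch, not the paper's hardware: the dictionary block representation, the set/bank mapping, and the LRU-window depth are assumptions made for the example.

```python
# Hedged sketch of the cache cleaner: when a long idle period is predicted
# for a rank, pick dirty blocks from near the LRU end of LLC sets that map
# to *different* banks of that rank, so the resulting writes can be serviced
# with bank-level parallelism. Block layout and window depth are assumptions.

def select_writebacks(llc_sets, rank, num_banks=8, lru_window=4):
    """llc_sets: list of sets; each set is a list of blocks ordered MRU->LRU.
    Each block is a dict with 'dirty', 'rank', 'bank', and 'addr' keys.
    Returns at most one dirty block per bank of the idle rank."""
    chosen = {}  # bank -> block
    for cache_set in llc_sets:
        for block in cache_set[-lru_window:]:  # only look near the LRU end
            if (block['dirty'] and block['rank'] == rank
                    and block['bank'] not in chosen):
                chosen[block['bank']] = block
        if len(chosen) == num_banks:
            break  # one candidate per bank is enough for parallel service
    return list(chosen.values())
```

Restricting the search to the LRU end mirrors the diagram's dirty-bit scan: blocks close to eviction are written back early, without evicting them.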


SLIDE 13

Rank Idle Time Predictor

  • Based on the observation that if an instruction PC leads to a long rank idle period, there is a high probability that the next time this instruction is reached it will also lead to a long rank idle period
  • Two-level predictor

[Diagram: the PC of memory read accesses indexes two tables of 2-bit counters (first-level and second-level predictors), each producing a prediction result; a rank idle cycle counter measures the actual idle time]

SLIDE 14

[Diagram: predictor operation timeline. The PC of the last LLC miss indexes the 2-bit counters in both predictor levels; when a rank becomes idle, the rank idle cycle counter measures the idle period, and an idle time exceeding 300 CPU cycles is recorded as a long rank idle time, training the predictors and triggering the cache cleaner]
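
A minimal software sketch of the two-level predictor follows. The table size, the exact training rule, and the use of the second level purely as a false-positive filter are assumptions for illustration; only the 2-bit counters, PC indexing, and the 300-cycle long-idle threshold come from the slides.

```python
# Hedged sketch of the two-level rank idle time predictor. Table sizing and
# the second-level training rule are illustrative assumptions.

LONG_IDLE_CYCLES = 300   # "long idle" threshold from the slides (CPU cycles)
TABLE_ENTRIES = 8192     # counters per level; assumed size

class TwoLevelIdlePredictor:
    def __init__(self):
        # Two tables of 2-bit saturating counters (values 0..3), PC-indexed.
        self.level1 = [0] * TABLE_ENTRIES
        self.level2 = [0] * TABLE_ENTRIES

    def _index(self, pc):
        return pc % TABLE_ENTRIES

    def _bump(self, table, i, up):
        table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)

    def predict(self, pc):
        """Predict whether the idle period starting now will be long."""
        i = self._index(pc)
        if self.level1[i] < 2:
            return False
        return self.level2[i] >= 2  # second level filters false positives

    def train(self, pc, idle_cycles):
        """Update both levels once the actual idle period length is known."""
        i = self._index(pc)
        long_idle = idle_cycles >= LONG_IDLE_CYCLES
        self._bump(self.level1, i, long_idle)
        if self.level1[i] >= 2:   # train level 2 only where level 1 fires
            self._bump(self.level2, i, long_idle)
```

Requiring both counters to be in the taken half before predicting "long" is one plausible way to trade a few missed opportunities for fewer costly false positives, matching the slide's per-level false positive rates.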

SLIDE 15

Evaluation Methodology

  • Simulator
  • MARSSx86 [Patel et al. 2011] + DRAMSim2 [Rosenfeld et al. 2011]
  • Execution core
  • Out-of-order, 8-core processor
  • Caches
  • 64KB L1 I + D caches, 2-cycle
  • 16MB 16-way set-associative LLC, 14-cycle
  • DRAM system
  • DDR3, 1600 MHz
  • 2 channels, 2/4 ranks per channel, 8 banks per rank
  • CMP workloads
  • SPEC CPU2006 benchmarks
  • Six mixes of SPEC CPU2006 benchmarks for the 8-core processor

SLIDE 16

Performance Evaluation

[Figure: speedup of Eager-WB, VWQ, and RITPD-WB over the baseline for SPEC CPU2006 benchmarks (401.bzip2 through 483.xalancbmk) and their geometric mean; y-axis 1.0–1.2, with one bar reaching 1.30]

Baseline: 32-entry per-channel write buffer.

The technique improves the performance of eight benchmarks by at least 10% and delivers an average speedup of 10.5% with the two-rank configuration and 10.1% with the four-rank configuration.

SLIDE 17

Prediction Evaluation

[Figure: false positive rates of the first-level and second-level predictors for mix1–mix6 and their arithmetic mean; y-axis 0%–20%]

The false positive rates of the first-level and second-level predictors are 8.5% and 14.7% on average.

SLIDE 18

Read Latency Evaluation

[Figure: read latency of Eager-WB, VWQ, and RITPD-WB normalized to the baseline for mix1–mix6 and their geometric mean; y-axis 0.6–1.1]

The technique reduces read latency by 12.7% on average with the two-rank configuration and 14.8% with the four-rank configuration.

SLIDE 19

Storage Overhead

  Two-level rank idle time predictor : 4KB (2 bits × 8096 entries × 2)
  Cache cleaner                      : 2KB
  Total                              : 18KB for 2-rank / 34KB for 4-rank
  Percentage of LLC capacity         : ~0.3%

SLIDE 20

Conclusion

  • Write-induced interference causes significant performance degradation
  • Proposed a rank idle time predictor that predicts long rank idle times
  • Proposed an LLC writeback management policy that intelligently writes back writes with bank-level parallelism during long rank idle periods
  • Balance the memory bandwidth
  • Isolate the service of memory read and write requests as much as possible

SLIDE 21

Thank You! Questions?