
Rank Idle Time Prediction Driven Last-Level Cache Writeback



  1. Rank Idle Time Prediction Driven Last-Level Cache Writeback
     Zhe Wang, Samira M. Khan, Daniel A. Jiménez
     Computer Science Department, University of Texas at San Antonio

  2. Memory Latency is Performance Bottleneck
     • Memory wall
       - Microprocessor is faster than memory
       [Figure: a fast microprocessor (CPUs, caches) issuing reads and writes to slow DRAM]
     • System performance is sensitive to memory read latency
     • Write-induced interference [Lee et al. 2010]
       - Writes can delay the service of reads and degrade performance

  3. Write-Induced Interference
     • Servicing write requests delays the service of the read requests that follow, causing performance degradation
     [Figure: timing diagram of the command line and data line for a write followed by a read. The read waits in the read buffer during the write service cycles and the data bus turnaround; write-induced interference cycles: 108 processor cycles]

  4. Quantifying Write-Induced Interference
     • Without write-induced interference, system performance improves 23% on average
     [Figure: speedup (y-axis, 1 to 1.5) for 400.perlbench, 401.bzip2, 403.gcc, 410.bwaves, 429.mcf, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 473.astar, 482.sphinx3, 483.xalancbmk, and Gmean]

  5. Traditional Writeback
     • Dirty cache blocks are sent to the write buffer when evicted
       [Figure: evicted writes move from the last-level cache into a small write buffer, which writes to DRAM]
     • The problem
       - Clustered memory traffic: bursty reads arrive together with evicted writes
       - Writeback inefficiency: the write buffer is small

  6. Contributions of This Paper
     • Propose a technique that services write requests at the point that minimizes the delay caused to the following read requests
     • Propose a low-overhead rank idle time predictor to predict long periods of idle time in memory ranks
     • Propose an LLC writeback management policy that intelligently writes back writes with bank-level parallelism during long rank idle periods
       - Balance the memory bandwidth
       - Isolate the service of memory read and write requests as much as possible

  7. Outline
     • Motivation
     • Rank Idle Time Prediction Driven Last-Level Cache Writeback Technique
       - System Structure
       - Rank Idle Time Predictor
     • Evaluation
     • Conclusion

  8. Reducing Write-Induced Interference
     • When to service write requests
       - Memory write requests should be serviced at the times that interfere least with read requests
       [Figure: timelines of read requests and write requests comparing the read access pattern, perfect writeback, and traditional writeback]
     • How to schedule write requests
       - Schedule high-locality write requests
       - Large write scheduling space

  9. Related Work: LLC Writeback Technique
     • Eager Writeback [Lee et al. 2000]
       - Memory scheduling space is limited by the write buffer
       - Has no knowledge of how long the rank idle period will last
     • Virtual Write Queue [Stuecheli et al. 2010]
       - Requires a specific memory address mapping scheme
       - Has no knowledge of how long the rank idle period will last
     [Figure: scheduled writes sent from the LLC (MRU to LRU positions, with a virtual write queue region) through the write buffer to DRAM]

  10. Quantifying Rank Idle Time
      • Ranks are idle 38% of the time on average
      [Figure: rank idle percentage (y-axis, 0% to 100%) for mix1 through mix8 and Gmean, with a 38% label]

  11. Rank Idle Time Prediction Driven LLC Writeback
      Insight: allow writes to be serviced during long rank idle periods
      • Use a predictor to predict a long rank idle period as soon as a rank becomes idle
      • Scheduled write requests are generated from the LLC and sent to DRAM for service during the predicted long rank idle period
        - Redistribute the write requests into long rank idle periods
        - Isolate the service of memory read and write requests as much as possible

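  12. System Structure
      [Figure: block diagram. The "rank is idle" signal and the PC of the LLC miss feed the rank idle time predictor. A "long rank idle time" prediction triggers the cache cleaner, which uses the dirty bits in the LLC (MRU to LRU positions) to send writes with bank-level parallelism through the write buffer to DRAM]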
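The cache cleaner's role in the system structure can be illustrated with a rough sketch. The slides do not give the selection algorithm, so everything below is an assumption made for illustration: the `Block` record, the `rank_and_bank` address mapping, the 64-byte block size, the `lru_ways` window, and the `per_bank` limit are all hypothetical. The sketch only shows the general idea of picking dirty blocks near the LRU end that map to the idle rank and spreading them across banks.

```python
from collections import namedtuple

# Hypothetical cache block record: physical address plus dirty bit.
Block = namedtuple("Block", "addr dirty")

NUM_RANKS = 2    # assumption for this sketch
NUM_BANKS = 8    # 8 banks per rank, as in the evaluation setup
BLOCK_BITS = 6   # assumption: 64-byte cache blocks


def rank_and_bank(addr):
    """Hypothetical address mapping: bank bits sit just above the block
    offset, rank bits just above the bank bits."""
    block = addr >> BLOCK_BITS
    bank = block % NUM_BANKS
    rank = (block // NUM_BANKS) % NUM_RANKS
    return rank, bank


def select_writebacks(cache_set, idle_rank, lru_ways=4, per_bank=1):
    """Pick dirty blocks to clean from one cache set.

    cache_set is ordered MRU first, LRU last. Only the last `lru_ways`
    positions are scanned, starting from the LRU block. At most
    `per_bank` blocks are chosen per bank so the writes spread across
    banks of the idle rank.
    """
    chosen = []
    per_bank_count = {}
    for block in reversed(cache_set[-lru_ways:]):
        if not block.dirty:
            continue
        rank, bank = rank_and_bank(block.addr)
        if rank != idle_rank:
            continue
        if per_bank_count.get(bank, 0) >= per_bank:
            continue
        per_bank_count[bank] = per_bank_count.get(bank, 0) + 1
        chosen.append(block)
    return chosen


# Example set, MRU first. Block numbers 0, 1, 16, 2 map to rank 0; 8 maps to rank 1.
example_set = [
    Block(0 << BLOCK_BITS, True),    # MRU, outside the scanned window
    Block(1 << BLOCK_BITS, True),    # rank 0, bank 1
    Block(8 << BLOCK_BITS, True),    # rank 1, bank 0
    Block(16 << BLOCK_BITS, True),   # rank 0, bank 0
    Block(2 << BLOCK_BITS, False),   # LRU, clean
]
selected = select_writebacks(example_set, idle_rank=0)
```

Limiting each bank to one block per pass is one simple way to favor bank-level parallelism; a real design could use a different limit or ordering.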

  13. Rank Idle Time Predictor
      • Two-level predictor
        [Figure: a first-level predictor and a second-level predictor, each with a 2-bit counter that produces a prediction result; inputs shown are the PC of memory read accesses and a rank idle cycle counter]
      • Based on the observation that if an instruction PC leads to a long rank idle period, then there is a high probability that the next time this instruction is reached it will also lead to a long rank idle period

  14. Rank Idle Time Predictor
      [Figure: predictor operation on a timeline with points T1, T2, and T3, starting when the rank is idle. The first-level and second-level predictors each use a 2-bit counter indexed by the PC of the last LLC miss; a rank idle cycle counter measures the idle period, and a long rank idle time is 300 CPU cycles. A "long rank idle time" prediction activates the cache cleaner]
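The PC-indexed 2-bit counter idea behind the predictor can be sketched as follows. This is a minimal single-table illustration, not the two-level design from the slides: the table size, the modulo indexing, the "counter >= 2" decision rule, and the increment/decrement training rule are all assumptions. Only the 300-cycle definition of a long idle period and the use of the PC of the last LLC miss come from the slides.

```python
LONG_IDLE_CYCLES = 300   # "long rank idle time" threshold from the slides
TABLE_ENTRIES = 8192     # assumption: power-of-two table size


class IdlePredictor:
    """One table of 2-bit saturating counters indexed by PC (hypothetical)."""

    def __init__(self, entries=TABLE_ENTRIES):
        self.entries = entries
        self.counters = [0] * entries   # each counter saturates in [0, 3]

    def _index(self, pc):
        # Assumption: simple modulo indexing; a real design may hash the PC.
        return pc % self.entries

    def predict_long_idle(self, pc):
        """Predict a long idle period when the counter is in its upper half."""
        return self.counters[self._index(pc)] >= 2

    def train(self, pc, idle_cycles):
        """Update the counter once the actual idle length is known."""
        i = self._index(pc)
        if idle_cycles >= LONG_IDLE_CYCLES:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = IdlePredictor()
pc = 0x400A10                       # hypothetical PC of the last LLC miss
before = predictor.predict_long_idle(pc)
predictor.train(pc, 500)            # long idle period observed
predictor.train(pc, 500)            # long idle period observed again
after_long = predictor.predict_long_idle(pc)
predictor.train(pc, 50)             # short idle period observed
after_short = predictor.predict_long_idle(pc)
```

Two consecutive long idle periods are needed before this sketch predicts "long", and a single short one flips it back, which is the usual hysteresis trade-off of a 2-bit counter.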

  15. Evaluation Methodology
      • Simulator
        - MARSSx86 [Patel et al. 2011] + DRAMSim2 [Rosenfeld et al. 2011]
      • Execution core
        - Out-of-order, 8-core processor
      • Caches
        - 64KB L1 I + D caches, 2-cycle
        - 16MB 16-way set-associative LLC, 14-cycle
      • DRAM system
        - DDR3 1600MHz
        - 2 channels, 2 or 4 ranks per channel, 8 banks per rank
      • CMP workloads
        - SPEC CPU2006 benchmarks
        - Six mixes of SPEC CPU2006 benchmarks for the 8-core processor

  16. Performance Evaluation
      • Baseline: 32-entry per-channel write buffer (WB)
      • The technique improves the performance of eight benchmarks by at least 10% and delivers an average speedup of 10.5% with the two-rank configuration and 10.1% with the four-rank configuration
      [Figure: speedup over the baseline (y-axis, 1 to 1.2, with one value labeled 1.30) for Eager-WB, VWQ, and RITPD-WB on 401.bzip2, 410.bwaves, 429.mcf, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 473.astar, 482.sphinx3, 483.xalancbmk, and GMean]

  17. Prediction Evaluation
      • False positive rates for the first-level and second-level predictors are 8.5% and 14.7% on average
      [Figure: false positive rates (y-axis, 0% to 20%) of the first and second predictors for mix1 through mix6 and AMean]

  18. Read Latency Evaluation
      • The technique reduces the read latency on average by 12.7% with the two-rank configuration and 14.8% with the four-rank configuration
      [Figure: normalized read latency (y-axis, 0.6 to 1.1) for Eager-WB, VWQ, and RITPD-WB on mix1 through mix6 and GMean]

  19. Storage Overhead
      • Two-level rank idle time predictor: 4KB = 2 bits * 8192 entries * 2
      • Cache cleaner: 2KB
      • Total: 18KB for 2-rank / 34KB for 4-rank
      • Percentage of LLC capacity: ~0.3%
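The storage figures can be checked with a short calculation. The sketch assumes 8192 entries per table (the power of two that makes the 4KB predictor figure exact) and assumes one predictor per rank plus a single shared cache cleaner; the slide does not state that breakdown, but it is the reading under which the listed totals add up with 2 channels.

```python
COUNTER_BITS = 2             # 2-bit counters, from the slide
ENTRIES_PER_TABLE = 8192     # assumption: power of two giving exactly 4KB
TABLES = 2                   # first-level and second-level tables
CLEANER_BYTES = 2 * 1024     # cache cleaner, from the slide
CHANNELS = 2                 # from the evaluation setup


def predictor_bytes():
    """Size of one two-level predictor in bytes."""
    return COUNTER_BITS * ENTRIES_PER_TABLE * TABLES // 8


def total_bytes(ranks_per_channel):
    """Total overhead, assuming one predictor per rank (an assumption)."""
    ranks = CHANNELS * ranks_per_channel
    return ranks * predictor_bytes() + CLEANER_BYTES


two_rank_kb = total_bytes(2) // 1024
four_rank_kb = total_bytes(4) // 1024
print(predictor_bytes(), two_rank_kb, four_rank_kb)
```

Under these assumptions the totals come out to 18KB for the two-rank configuration and 34KB for the four-rank configuration, matching the slide.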

  20. Conclusion
      • Write-induced interference causes significant performance degradation
      • Proposed a rank idle time predictor that predicts long rank idle periods
      • Proposed an LLC writeback management policy that intelligently writes back writes with bank-level parallelism during long rank idle periods
        - Balance the memory bandwidth
        - Isolate the service of memory read and write requests as much as possible

  21. Thank You! Questions?
