SLIDE 1

Rank Idle Time Prediction Driven Last-Level Cache Writeback

Zhe Wang, Samira M. Khan, Daniel A. Jiménez Computer Science Department University of Texas at San Antonio

SLIDE 2

Memory Latency is a Performance Bottleneck

  • Memory wall
  • Microprocessors are much faster than memory

[Diagram: fast microprocessor and caches issuing reads and writes to slow DRAM]

  • System performance is sensitive to memory read latency
  • Write-Induced Interference [Lee et al. 2010]
  • Writes can delay the service of reads, degrading performance

SLIDE 3

Write-Induced Interference

[Timing diagram: DRAM command and data lines servicing writes from the write buffer, followed by a read from the read buffer. The servicing write cycles, bus turnaround, and wait cycles add write-induced interference before the read's data transfer; in the illustrated example the interference spans 108 processor cycles]

Servicing write requests delays the service of following read requests, causing performance degradation.

SLIDE 4

Quantifying Write-Induced Interference

[Figure: per-benchmark speedup achievable without write-induced interference for 18 SPEC CPU2006 benchmarks (400.perlbench through 483.xalancbmk) and their geometric mean; y-axis 1.0–1.5]

Without write-induced interference, system performance improves by 23% on average.

SLIDE 5

Traditional Writeback

  • Dirty cache blocks are sent to the write buffer when evicted

[Diagram: evicted writes flow from the last-level cache into a small write buffer, then to DRAM]

  • The problems
  • Clustered memory traffic: bursty reads mixed with evicted writes
  • Writeback inefficiency: the small size of the write buffer limits scheduling

SLIDE 6

Contributions of This Paper

  • Propose a technique that services write requests at the point that minimizes the delay caused to following read requests
  • Propose a low-overhead rank idle time predictor to predict long periods of idle time in memory ranks
  • Propose an LLC writeback management policy that intelligently writes back writes with bank-level parallelism during long rank idle periods
  • Balance the memory bandwidth
  • Isolate the service of memory read and write requests as much as possible

SLIDE 7

Outline

  • Motivation
  • Rank Idle Time Prediction Driven Last-Level Cache Writeback Technique
  • System Structure
  • Rank Idle Time Predictor
  • Evaluation
  • Conclusion

SLIDE 8

Reducing Write-Induced Interference

  • When to service write requests
  • Memory write requests should be serviced at times that cause minimal interference with read requests

[Diagram: a read access pattern over time, comparing perfect writeback (write requests fill the gaps between reads) with traditional writeback (write requests cluster with reads)]

  • How to schedule write requests
  • Schedule high-locality write requests
  • Use a large write scheduling space

SLIDE 9

Related Work: LLC Writeback Techniques

  • Eager Writeback [Lee et al. 2000]
  • Memory scheduling space is limited by the write buffer
  • Has no knowledge of how long the rank idle period will last
  • Virtual Write Queue [Stuecheli et al. 2010]
  • Requires a specific memory address mapping scheme
  • Has no knowledge of how long the rank idle period will last

[Diagram: scheduled writes are drawn from the LRU end of the LLC (the virtual write queue) into the write buffer, then written to DRAM]

SLIDE 10

Quantifying Rank Idle Time

[Figure: rank idle percentage for workloads mix1–mix8 and their geometric mean; y-axis 0%–100%]

Ranks are idle 38% of the time on average.

SLIDE 11

Rank Idle Time Prediction Driven LLC Writeback

Insight: allow writes to be serviced during long rank idle periods.

  • Use a predictor to predict a long rank idle period once a rank starts to become idle
  • Scheduled write requests are generated from the LLC and sent to DRAM for service during the predicted long rank idle period
  • Redistribute write requests into long rank idle periods
  • Isolate the service of memory read and write requests as much as possible
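
The policy above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: `predictor`, `cleaner`, and `issue_write` are hypothetical interfaces standing in for the rank idle time predictor, the cache cleaner, and the memory controller's write path.

```python
# Illustrative sketch of the writeback policy: when a rank goes idle, drain
# scheduled writes only if a long idle period is predicted. The predictor,
# cleaner, and issue_write interfaces are hypothetical stand-ins.

def on_rank_idle(rank, last_miss_pc, predictor, cleaner, issue_write):
    """Called when `rank` has no pending read requests.
    Returns the number of scheduled writes issued."""
    if not predictor.predict(last_miss_pc):
        return 0  # likely a short idle gap: keep the rank free for reads
    issued = 0
    # The cleaner supplies dirty LLC blocks chosen for bank-level parallelism.
    for block in cleaner(rank):
        issue_write(block)  # write back; the block stays in the LLC, now clean
        issued += 1
    return issued
```

The key design point is the guard: writes are issued only when the idle period is predicted to be long enough to absorb them, which keeps reads and writes isolated from each other.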

SLIDE 12

System Structure

[Diagram: the rank idle time predictor, indexed by the PC of the last LLC miss, is consulted when a rank becomes idle; on a long-rank-idle-time prediction it signals the cache cleaner, which selects dirty blocks (tracked via dirty bits) near the LRU end of the LLC and issues writes with bank-level parallelism through the write buffer to DRAM]
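
The cache cleaner's selection step can be sketched as follows. This is an illustrative sketch, not the paper's hardware: the dictionary block representation, the set/bank mapping, and the LRU-window depth are assumptions made for the example.

```python
# Hedged sketch of the cache cleaner: when a long idle period is predicted
# for a rank, pick dirty blocks from near the LRU end of LLC sets that map
# to *different* banks of that rank, so the resulting writes can be serviced
# with bank-level parallelism. Block layout and window depth are assumptions.

def select_writebacks(llc_sets, rank, num_banks=8, lru_window=4):
    """llc_sets: list of sets; each set is a list of blocks ordered MRU->LRU.
    Each block is a dict with 'dirty', 'rank', 'bank', and 'addr' keys.
    Returns at most one dirty block per bank of the idle rank."""
    chosen = {}  # bank -> block
    for cache_set in llc_sets:
        for block in cache_set[-lru_window:]:  # only look near the LRU end
            if (block['dirty'] and block['rank'] == rank
                    and block['bank'] not in chosen):
                chosen[block['bank']] = block
        if len(chosen) == num_banks:
            break  # one candidate per bank is enough for parallel service
    return list(chosen.values())
```

Restricting the search to the LRU end mirrors the diagram's dirty-bit scan: blocks close to eviction are written back early, without evicting them.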


SLIDE 13

Rank Idle Time Predictor

  • Based on the observation that if an instruction PC leads to a long rank idle period, there is a high probability that the next time this instruction is reached it will also lead to a long rank idle period
  • Two-level predictor

[Diagram: the PC of memory read accesses indexes two tables of 2-bit counters (first-level and second-level predictors), each producing a prediction result; a rank idle cycle counter measures the actual idle time]

SLIDE 14

[Diagram: predictor operation timeline. The PC of the last LLC miss indexes the 2-bit counters in both predictor levels; when a rank becomes idle, the rank idle cycle counter measures the idle period, and an idle time exceeding 300 CPU cycles is recorded as a long rank idle time, training the predictors and triggering the cache cleaner]
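
A minimal software sketch of the two-level predictor follows. The table size, the exact training rule, and the use of the second level purely as a false-positive filter are assumptions for illustration; only the 2-bit counters, PC indexing, and the 300-cycle long-idle threshold come from the slides.

```python
# Hedged sketch of the two-level rank idle time predictor. Table sizing and
# the second-level training rule are illustrative assumptions.

LONG_IDLE_CYCLES = 300   # "long idle" threshold from the slides (CPU cycles)
TABLE_ENTRIES = 8192     # counters per level; assumed size

class TwoLevelIdlePredictor:
    def __init__(self):
        # Two tables of 2-bit saturating counters (values 0..3), PC-indexed.
        self.level1 = [0] * TABLE_ENTRIES
        self.level2 = [0] * TABLE_ENTRIES

    def _index(self, pc):
        return pc % TABLE_ENTRIES

    def _bump(self, table, i, up):
        table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)

    def predict(self, pc):
        """Predict whether the idle period starting now will be long."""
        i = self._index(pc)
        if self.level1[i] < 2:
            return False
        return self.level2[i] >= 2  # second level filters false positives

    def train(self, pc, idle_cycles):
        """Update both levels once the actual idle period length is known."""
        i = self._index(pc)
        long_idle = idle_cycles >= LONG_IDLE_CYCLES
        self._bump(self.level1, i, long_idle)
        if self.level1[i] >= 2:   # train level 2 only where level 1 fires
            self._bump(self.level2, i, long_idle)
```

Requiring both counters to be in the taken half before predicting "long" is one plausible way to trade a few missed opportunities for fewer costly false positives, matching the slide's per-level false positive rates.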

SLIDE 15

Evaluation Methodology

  • Simulator
  • MARSSx86 [Patel et al. 2011] + DRAMSim2 [Rosenfeld et al. 2011]
  • Execution core
  • Out-of-order, 8-core processor
  • Caches
  • 64KB L1 I + D caches, 2-cycle
  • 16MB 16-way set-associative LLC, 14-cycle
  • DRAM system
  • DDR3, 1600 MHz
  • 2 channels, 2/4 ranks per channel, 8 banks per rank
  • CMP workloads
  • SPEC CPU2006 benchmarks
  • Six mixes of SPEC CPU2006 benchmarks for the 8-core processor

SLIDE 16

Performance Evaluation

[Figure: speedup of Eager-WB, VWQ, and RITPD-WB over the baseline for SPEC CPU2006 benchmarks (401.bzip2 through 483.xalancbmk) and their geometric mean; y-axis 1.0–1.2, with one bar reaching 1.30]

Baseline: 32-entry per-channel write buffer.

The technique improves the performance of eight benchmarks by at least 10% and delivers an average speedup of 10.5% with the two-rank configuration and 10.1% with the four-rank configuration.

SLIDE 17

Prediction Evaluation

[Figure: false positive rates of the first-level and second-level predictors for mix1–mix6 and their arithmetic mean; y-axis 0%–20%]

The false positive rates of the first-level and second-level predictors are 8.5% and 14.7% on average.

SLIDE 18

Read Latency Evaluation

[Figure: read latency of Eager-WB, VWQ, and RITPD-WB normalized to the baseline for mix1–mix6 and their geometric mean; y-axis 0.6–1.1]

The technique reduces read latency by 12.7% on average with the two-rank configuration and 14.8% with the four-rank configuration.

SLIDE 19

Storage Overhead

  Two-level rank idle time predictor : 4KB (2 bits × 8096 entries × 2)
  Cache cleaner                      : 2KB
  Total                              : 18KB for 2-rank / 34KB for 4-rank
  Percentage of LLC capacity         : ~0.3%

SLIDE 20

Conclusion

  • Write-induced interference causes significant performance degradation
  • Proposed a rank idle time predictor that predicts long rank idle times
  • Proposed an LLC writeback management policy that intelligently writes back writes with bank-level parallelism during long rank idle periods
  • Balance the memory bandwidth
  • Isolate the service of memory read and write requests as much as possible

SLIDE 21

Thank You! Questions?