Improving DRAM Performance by Parallelizing Refreshes with Accesses
Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim,
– Degrades system performance and energy efficiency
– Becomes exacerbated as DRAM density increases

– 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
– 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays

– 20.2% and 9.0% for 8-core systems using 32Gb DRAM
– Very close to the ideal scheme without refreshes
[Figure: memory controller and DRAM cell; labels: Refresh, Read, Data, Capacitor, Access transistor]
Refresh delays requests by 100s of ns
[Figure: timelines of reads (RD) and refreshes; labels: Refresh, Round-robin order, Per-Bank Refresh, Subarray Refresh]
tRefPeriod (tREFI): Remains constant
tRefLatency (tRFC): Varies based on DRAM chip density (e.g., 350ns)
Read/Write: roughly 50ns
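The two timing parameters above can be combined into a rough estimate of how much of the time a rank is blocked by refresh: tRefLatency divided by tRefPeriod. The following is a minimal sketch under stated assumptions, not a model from the presentation. The 350 ns latency is the example given above; the 7.8 µs tRefPeriod is an assumed value (the usual DDR3 interval at normal temperature) that the slide does not state, and the function name is hypothetical.

```python
def refresh_busy_fraction(t_ref_latency_ns: float, t_ref_period_ns: float) -> float:
    """Fraction of time spent refreshing, assuming one refresh of length
    t_ref_latency_ns is issued every t_ref_period_ns."""
    if t_ref_period_ns <= 0:
        raise ValueError("tRefPeriod must be positive")
    if not 0 <= t_ref_latency_ns <= t_ref_period_ns:
        raise ValueError("tRefLatency must lie between 0 and tRefPeriod")
    return t_ref_latency_ns / t_ref_period_ns


T_REF_LATENCY_NS = 350.0   # tRFC example taken from the slide
T_REF_PERIOD_NS = 7800.0   # ASSUMED tREFI; not given on the slide

busy = refresh_busy_fraction(T_REF_LATENCY_NS, T_REF_PERIOD_NS)
print(f"time spent refreshing: {busy:.1%}")
```

Because tRefPeriod stays constant while tRefLatency grows with chip density, this fraction grows with density as well.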
[Figure: timelines for Bank 0 and Bank 1 showing Refresh and Read operations, with tRefLatency and tRefPeriod marked]
[Figure: timelines for Bank 0 and Bank 1 with a request queue per bank, showing Refresh and Read operations; annotation: Saved cycles]
2) Write-Refresh Parallelization
[Figure: timelines for Bank 0 and Bank 1 showing Read, Write, Refresh, and Turnaround; annotation: Saved cycles]
[Figure: Row Buffer; timeline for Subarray 0 and Subarray 1 showing Data, Refresh, and Read]
[Figure: evaluated system; labels: L1 $: 32KB, L2 $: 512KB/core, Memory Controller, DDR3 Rank, Bank 0, Bank 1, …, Bank 7]
[Figure: chart by DRAM Chip Density (8Gb, 16Gb, 32Gb); series: All-Bank, Per-Bank, Elastic, DARP, SARP, DSARP, Ideal]
[Figure: Energy per Access (nJ) by DRAM Chip Density (8Gb, 16Gb, 32Gb); series: All-Bank, Per-Bank, Elastic, DARP, SARP, DSARP, Ideal]
– Degrades system performance and energy efficiency
– Becomes exacerbated as DRAM density increases

– 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
– 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays

– 20.2% and 9.0% for 8-core systems using 32Gb DRAM
– Very close to the ideal scheme without refreshes
– 1) Sub-rank refresh → refreshes a subset of banks within a rank
– 2) Subarray refresh → refreshes one subarray at a time
– 3) Dynamic sub-rank refresh scheduling policies

– 1) Leverage idle subarrays to serve accesses
– 2) Schedule refreshes to idle banks first

– 1) Exploit write draining periods to hide refresh latency
– 2) We provide a detailed analysis of existing per-bank refresh in mobile DRAM
– 3) Concrete description of our scheduling algorithm
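The policy "schedule refreshes to idle banks first" from the list above can be illustrated with a small selection function. This is a simplified sketch, not the scheduling algorithm from the presentation: the function name, the per-bank queue lengths, and the per-bank count of owed refreshes are all hypothetical inputs chosen for illustration.

```python
from typing import Optional, Sequence


def pick_bank_to_refresh(queue_lengths: Sequence[int],
                         pending_refreshes: Sequence[int]) -> Optional[int]:
    """Return the index of the bank to refresh next, or None if no bank owes one.

    Among banks that still owe a refresh, pick the one with the fewest queued
    demand requests, so an idle bank (queue length 0) is chosen first.
    Ties are broken by the lowest bank index.
    """
    if len(queue_lengths) != len(pending_refreshes):
        raise ValueError("both sequences need exactly one entry per bank")
    candidates = [bank for bank, owed in enumerate(pending_refreshes) if owed > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda bank: (queue_lengths[bank], bank))


# Bank 1 is idle and owes a refresh, so it is selected ahead of busy bank 0.
chosen = pick_bank_to_refresh(queue_lengths=[3, 0, 2], pending_refreshes=[1, 1, 0])
print(chosen)
```

A real controller would also have to respect refresh deadlines and DRAM timing constraints, which this sketch leaves out.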
[Figure: Current vs. Future (by year 2020*); data labels: 43%, 23%, 6.7%; annotation: Potential Range]
*ITRS Roadmap, 2011
[Figure: DRAM timeline of numbered refresh commands; annotations: Delayed by 1 refresh command, Ahead by 1 refresh command]
[Figure: DRAM timelines of refresh commands 1, 2, 3, …, N, N+1 targeting Row1; partially legible annotations: tRefreshWindow = … = 31.948 …; tRetention > tRefreshWindow + …]
$$\text{MaximumSlowdown} = \max_i \frac{IPC_i^{alone}}{IPC_i^{shared}}$$
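The maximum-slowdown metric above takes, over all applications i, the largest ratio of the IPC measured when running alone to the IPC measured when sharing the system. A minimal sketch follows; the function name and the IPC values are hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence


def maximum_slowdown(ipc_alone: Sequence[float], ipc_shared: Sequence[float]) -> float:
    """Largest per-application slowdown, IPC_alone / IPC_shared."""
    if len(ipc_alone) != len(ipc_shared) or len(ipc_alone) == 0:
        raise ValueError("need one alone/shared IPC pair per application")
    if any(value <= 0 for value in (*ipc_alone, *ipc_shared)):
        raise ValueError("IPC values must be positive")
    return max(alone / shared for alone, shared in zip(ipc_alone, ipc_shared))


# Hypothetical two-application workload: slowdowns are 2.0 and 1.25.
worst = maximum_slowdown(ipc_alone=[2.0, 1.0], ipc_shared=[1.0, 0.8])
print(worst)
```

A lower value means the most-penalized application suffers less when the system is shared.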
$$PowerOverhead_{tFAW} = \frac{4 \cdot I_{ACT} + I_{REF}}{4 \cdot I_{ACT}}$$
$$tFAW_{SARP} = tFAW \cdot PowerOverhead_{tFAW}$$
$$tRRD_{SARP} = tRRD \cdot PowerOverhead_{tFAW}$$
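The scaling above can be worked through numerically. The sketch below reads the first expression as the ratio (4·IACT + IREF) / (4·IACT), which is an interpretation of the slide's flat notation. The function names and all current and timing values are hypothetical placeholders, not figures from the presentation.

```python
from typing import Tuple


def power_overhead_faw(i_act: float, i_ref: float) -> float:
    """Ratio of (four activations plus one refresh) to four activations."""
    if i_act <= 0 or i_ref < 0:
        raise ValueError("I_ACT must be positive and I_REF non-negative")
    return (4 * i_act + i_ref) / (4 * i_act)


def scale_timings_for_sarp(t_faw: float, t_rrd: float,
                           i_act: float, i_ref: float) -> Tuple[float, float]:
    """Scale tFAW and tRRD by the same power-overhead factor."""
    overhead = power_overhead_faw(i_act, i_ref)
    return t_faw * overhead, t_rrd * overhead


# HYPOTHETICAL inputs: currents in mA, timings in ns.
overhead = power_overhead_faw(i_act=100.0, i_ref=40.0)
t_faw_sarp, t_rrd_sarp = scale_timings_for_sarp(30.0, 6.0, i_act=100.0, i_ref=40.0)
print(f"overhead={overhead:.3f} tFAW_SARP={t_faw_sarp:.2f} tRRD_SARP={t_rrd_sarp:.2f}")
```

Since the factor is never below 1, both timing constraints can only be lengthened by this adjustment.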
[Figure: chart with data labels 3.3%, 5.3%, 9.1%]
[Figure: WS Improvement (%) for 8Gb, 16Gb, and 32Gb DRAM; panels: Compared to REFab, Compared to REFpb; x-axis categories: 0, 25, 50, 75, 100, Avg]
[Figure: Normalized WS by DRAM Density (8Gb, 16Gb, 32Gb); series: REFab, FGR 2x, FGR 4x, AR, DSARP]
[Figure: Normalized Weighted Speedup across 100 Workloads; label: REFpb]