SLIDE 1 TITLE
Optimal DDR4 System with Data Bus Inversion
Hing Yan (Thomas) To (Xilinx Inc.), Changyi Su (Xilinx Inc.), Juan Wang (Xilinx Inc.), Dmitry Klokotov (Xilinx Inc.), Lizhi Zhu (Xilinx Inc.), John Schmitz (Xilinx Inc.), Penglin Niu (Xilinx Inc.), Yong Wang (Xilinx Inc.)
SLIDE 2 SPEAKER
Hing Yan (Thomas) To
Technical Director, Xilinx Inc.
tto@xilinx.com
Thomas is a Technical Director in the System Memory Signal Integrity & Device Power Group at Xilinx, Inc. Prior to joining Xilinx, he was with the NVIDIA Advanced Technology Group, where he focused on high-speed (32 GT/s) circuits and system channel designs and supported test chips for advanced process nodes such as 20 nm SoC and 16 nm FinFET. Before NVIDIA, Thomas worked at Intel for more than 16 years, where he covered and led many types of system memory IO development, such as the Sandy Bridge server DDR IO, across system memory technologies ranging from DDR1 to DDR4. Thomas received his PhD in Electrical Engineering from the Ohio State University in 1995. He holds over 37 patents in the fields of mixed-signal IO circuits, system memory configurations, and high-speed clocking for high-speed memory designs.
SLIDE 3 Outline
- High Performance Computing Performance Requirement Trend
- Typical Power Distribution in Computing System Example
- System Memory Power Improvement Approach
- Technology Process Node Scaling Trend
- IO Voltage Scaling Trend
- DDR4 IO signaling
- Data Bus Inversion (DBI) in DDR4 Interface
- DQ bus data Functional View with DBI enabled
- DDR4 System Power Improvement Example
- DDR4 IO Interface Training & Calibration with DBI
- Power Noise Improvement with DBI
- Experimental Data Margin Validation and Results
- Summary & Conclusions
SLIDE 4 Computation Requirement Trend
[Figure: Top #1 System TFLOPs, log-scale axis from 1.00E+00 to 1.00E+10. Source: Top500.org]
Computing performance requirements increase exponentially, while the power envelope is expected to stay the same or shrink.
SLIDE 5 Typical Power Distribution Comparison
Power distribution by component (values paired with labels in the order they appear on the slide):

Component   Xeon + DDR3   Atom + DDR3
CPU         60%           30%
Board       14%           4%
Net         2%            6%
Mem         19%           48%
Store       5%            12%

Traditionally the CPU has been the dominant component. As CPU power improves, system memory becomes a relatively larger factor.
SLIDE 6 System Memory Power Improvement Approach
- Technology Process Node Scaling Trends
– Improving Process Technology improves speed, power and memory density.
- IO Voltage Scaling Trends
– Scaling down the IO voltage improves IO power.
- IO signaling Improvements
– Changes to IO signaling can reduce IO power.
SLIDE 7 DRAM Process Technology Trend
[Figure: DRAM process technology node timeline, 2005 to 2014, with nodes 8xnm, 6xnm, 5xnm, 4xnm, 3xnm, 2xnm, 2ynm, 2znm]
* Customer sample shipping date for the first product of each node.
A new DRAM process technology node is introduced every year.
SLIDE 8 DRAM Power Improvement between DDR3 and DDR4
[Figure: IDD current comparison, DDR3 vs. DDR4, for IDD0, IDD2N, IDD4R, IDD4W, IDD5, IDD5N; axis from 20 to 100; annotation: ~35%]
DDR4 devices consume less power than DDR3 devices.
SLIDE 9
DRAM IO Voltage Scaling Trend
DDR IO voltage has been scaling down from generation to generation, but the scaling rate is slowing down.
SLIDE 10 Change of IO Standard
[Figure: IO standard circuit diagrams with VDDQ labels]
Only Logic Low in DDR4 dissipates DC power.
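A rough sketch of why only a logic low draws DC current when the line is terminated to VDDQ, as in DDR4's pseudo open drain signaling. The function name and the resistor values are illustrative assumptions, not figures from this deck; VDDQ is set to the DDR4 nominal 1.2 V.

```python
def pod_dc_power(vddq, r_term, r_on, driving_low):
    """Steady-state DC power (watts) of one line terminated to VDDQ."""
    if not driving_low:
        # Driving high: both ends of the termination sit at VDDQ,
        # so no DC current flows through it.
        return 0.0
    # Driving low: current flows from VDDQ through the termination
    # resistor and the driver pull-down to ground.
    return vddq ** 2 / (r_term + r_on)

VDDQ = 1.2      # volts, DDR4 nominal
R_TERM = 48.0   # ohms, assumed termination value
R_ON = 34.0     # ohms, assumed driver pull-down value

p_low = pod_dc_power(VDDQ, R_TERM, R_ON, driving_low=True)
p_high = pod_dc_power(VDDQ, R_TERM, R_ON, driving_low=False)
print("DC power, logic low:  %.2f mW" % (p_low * 1e3))
print("DC power, logic high: %.2f mW" % (p_high * 1e3))
```

Under these assumptions, every bit that can be sent as a high instead of a low removes one such DC path for that bit time.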
SLIDE 11 DDR4 Per Unit Power Distribution Comparison
Relative power distribution (assuming 70% read / 30% write, DBI not enabled):

Total Activate Power      17%
Total RD/WR/Term Power    62%
Total Background Power    21%

Even with the power reduction relative to DDR3, RD/WR/Term power is still a large portion. DDR4 can enable DBI to opportunistically reduce IO power further.
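To give a feel for the size of the opportunity, the sketch below counts how many lines are driven low per byte lane for uniformly random data, with and without DBI. It assumes the DDR4 DBI rule that a byte is inverted when more than four of its bits are low, with DBI# driven low to flag the inversion. Uniformly random data is an assumption; real traffic will differ.

```python
def lows_without_dbi(byte):
    """Number of DQ lines driven low for this byte (DBI# unused, held high)."""
    return 8 - bin(byte).count("1")

def lows_with_dbi(byte):
    """Number of lines driven low across 8 DQ plus DBI# after DBI encoding."""
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:
        # Byte is inverted: the former ones become lows, plus DBI# itself is low.
        return (8 - zeros) + 1
    return zeros

total_plain = sum(lows_without_dbi(b) for b in range(256))
total_dbi = sum(lows_with_dbi(b) for b in range(256))
reduction = 1.0 - total_dbi / total_plain

print("lows over all 256 byte values, no DBI:", total_plain)  # 1024
print("lows over all 256 byte values, DBI:   ", total_dbi)    # 837
print("relative reduction in lows: %.1f%%" % (100 * reduction))
```

This counts only driven-low bit times on one lane. It is not a system power estimate, since activate and background power are unaffected.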
SLIDE 12
DBI Functional View
[Figure: data from the core passes through a controller with DBI capability enabled, then over the channel (DQ & DQS, DBI#) to the DRAM]
SLIDE 13 DBI Functional Burst Length View
[Figure: burst-length view of the DBI path: data from the core, controller with DBI capability enabled, channel (DQ & DQS, DBI#), DRAM]
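The functional view can be sketched in a few lines of Python. This is a minimal model of one byte lane, assuming the DDR4 rule that a byte is inverted when more than four bits are low and that DBI# is driven low to mark inverted data. The function names and the example burst are hypothetical.

```python
def dbi_encode(byte):
    """Return (dq_byte, dbi_n) as driven on the channel for one 8-bit beat."""
    byte &= 0xFF
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:
        return (~byte) & 0xFF, 0   # inverted data, DBI# asserted (low)
    return byte, 1                 # data as-is, DBI# deasserted (high)

def dbi_decode(dq_byte, dbi_n):
    """Recover the original byte at the receiver."""
    return (~dq_byte) & 0xFF if dbi_n == 0 else dq_byte & 0xFF

# Example burst of 8 beats (burst length 8) from the core.
burst = [0x00, 0x0F, 0x01, 0xFE, 0x80, 0xFF, 0x10, 0x37]
encoded = [dbi_encode(b) for b in burst]
decoded = [dbi_decode(dq, dbi_n) for dq, dbi_n in encoded]

for original, (dq, dbi_n) in zip(burst, encoded):
    print("core 0x%02X -> DQ 0x%02X, DBI# %d" % (original, dq, dbi_n))
```

Decoding needs only the DBI# level that travels with each beat, so the receiver does not have to know how the sender made its decision.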
SLIDE 14
System Power Comparison Set Up
[Figure: FPGA connected to DRAM. Test programs (traffic generators TG_A to TG_M) with different read/write percentage ratios are run without DBI and with DBI.]
SLIDE 15 Read & Write Percentage Ratio for Relative Power Comparison
Test program   Rd %   Wr %
TG_A           79     21
TG_B           77     23
TG_C           75     25
TG_D           73     28
TG_E           70     30
TG_F           67     33
TG_G           63     37
TG_H           50     50
TG_J           57     43
TG_K           44     56
TG_M           40     60
Analyze the relative power improvement across different workloads.
SLIDE 16 Relative Power Improvement with DBI
[Figure: power normalized to the no-DBI case, with DBI disabled and DBI enabled, plus % improvement. Axes: Relative System Power (%) and Relative Improved System Power (%) referenced to no DBI]
The system with DBI enabled shows a relative power improvement. The amount of improvement varies with the read/write percentage ratio.
SLIDE 17 DBI Needs Calibration
[Figure: IO block diagram with DQ0, DQ1, DQS, DQS#, and DBI# I/O pins; blocks labeled CK_GEN_DQ, CK_GEN_DQS, TX FIR, CTLE, RCV, Vref, and Delay]
The DBI bit needs to be calibrated together with the other DQ bits.
SLIDE 18 Step Function Representation with DQ Pattern
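\[
\begin{aligned}
DQ[0](t) &= DQ[0]_r(t - r_1 T) - DQ[0]_f(t - f_1 T) + \cdots + DQ[0]_r(t - r_i T) - DQ[0]_f(t - f_i T) + \cdots \\
DQ[7](t) &= DQ[7]_r(t - r_1 T) - DQ[7]_f(t - f_1 T) + \cdots + DQ[7]_r(t - r_i T) - DQ[7]_f(t - f_i T) + \cdots \\
DQS(t) &= DQS_r\!\left(t - r_1\left(T - \tfrac{T}{2}\right)\right) - DQS_f\!\left(t - f_1\left(T - \tfrac{T}{2}\right)\right) + \cdots + DQS_r\!\left(t - r_i\left(T - \tfrac{T}{2}\right)\right) - DQS_f\!\left(t - f_i\left(T - \tfrac{T}{2}\right)\right) + \cdots
\end{aligned}
\]
[Figure: Channel Configuration System]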
SLIDE 19 DQ Eye Reference to DQS
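\[
\mathrm{DQ\_DQS\ Eye}(t) = y(t + k_i T), \quad 0 \le t \le T, \quad \forall\, k_i \in \mathbb{N}_0, \quad i = r, f
\]
[Figure: eye diagram annotated with TdivW_total, VdivW_total, Tjit, and DV]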
Based on the rise and fall unit step responses and their combinations, construct a calibration pattern and search for the worst-case jitter and eye height.
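The superposition idea can be illustrated with a small, simplified sketch for a single DQ line: a rise step response is added at every 0-to-1 transition, a fall step response is subtracted at every 1-to-0 transition, and the result is folded into one unit interval T to form an eye. The first-order exponential step response, the time constants, the bit pattern, and all function names are assumptions made for illustration. A real flow would use measured or simulated channel step responses.

```python
import math

def step_response(t, tau):
    """Hypothetical first-order unit step response; zero before t = 0."""
    return 0.0 if t < 0 else 1.0 - math.exp(-t / tau)

def waveform(bits, T, tau_rise, tau_fall, samples_per_ui):
    """Superpose rise and fall step responses at each bit transition."""
    rises = [i for i in range(1, len(bits)) if bits[i - 1] == 0 and bits[i] == 1]
    falls = [i for i in range(1, len(bits)) if bits[i - 1] == 1 and bits[i] == 0]
    wave = []
    for k in range(len(bits) * samples_per_ui):
        t = k * T / samples_per_ui
        v = float(bits[0])  # line starts settled at the first bit's level
        v += sum(step_response(t - r * T, tau_rise) for r in rises)
        v -= sum(step_response(t - f * T, tau_fall) for f in falls)
        wave.append(v)
    return wave

def fold_eye(wave, samples_per_ui):
    """Fold the waveform into one UI: eye[p] holds every sample at phase p."""
    eye = [[] for _ in range(samples_per_ui)]
    for k, v in enumerate(wave):
        eye[k % samples_per_ui].append(v)
    return eye

def eye_height(eye, phase, vref=0.5):
    """Vertical opening at one sampling phase, relative to vref."""
    highs = [v for v in eye[phase] if v > vref]
    lows = [v for v in eye[phase] if v <= vref]
    return min(highs) - max(lows)

bits = [0, 1, 0, 0, 1, 1, 0, 1]
SAMPLES = 16
fast = fold_eye(waveform(bits, 1.0, 0.2, 0.2, SAMPLES), SAMPLES)
slow = fold_eye(waveform(bits, 1.0, 0.4, 0.4, SAMPLES), SAMPLES)
height_fast = eye_height(fast, SAMPLES - 1)
height_slow = eye_height(slow, SAMPLES - 1)
print("eye height, tau = 0.2 UI: %.3f" % height_fast)
print("eye height, tau = 0.4 UI: %.3f" % height_slow)
```

Searching over bit patterns for the smallest eye height, or the largest crossing spread, is one way to arrive at a worst-case calibration pattern under this model.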
SLIDE 20 DBI bit Calibration with DQ
Make sure all DQ bits will have toggling coverage.
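A short sketch of why toggling coverage needs attention once DBI is enabled: after encoding, a pattern that looks busy at the core can leave the DQ lines static. The encoder below assumes the DDR4 rule that a byte is inverted when more than four bits are low. The function names and both example patterns are hypothetical.

```python
def dbi_encode(byte):
    """Return (dq_byte, dbi_n) for one beat under the assumed DBI rule."""
    byte &= 0xFF
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:
        return (~byte) & 0xFF, 0
    return byte, 1

def toggling_lines(pattern):
    """Names of the lines (DQ0..DQ7, DBI#) that change level at least once."""
    names = ["DQ%d" % i for i in range(8)] + ["DBI#"]
    toggled = set()
    prev = None
    for byte in pattern:
        dq, dbi_n = dbi_encode(byte)
        levels = [(dq >> i) & 1 for i in range(8)] + [dbi_n]
        if prev is not None:
            toggled.update(n for n, a, b in zip(names, prev, levels) if a != b)
        prev = levels
    return toggled

naive = [0x00, 0xFF] * 4          # alternating all-zeros / all-ones at the core
better = [0x0F, 0xF0, 0x00, 0x0F]

print("naive pattern toggles: ", sorted(toggling_lines(naive)))
print("better pattern toggles:", sorted(toggling_lines(better)))
```

With the naive pattern, 0x00 is inverted to 0xFF on the wire, so only DBI# moves and the DQ lines receive no transitions to calibrate against.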
SLIDE 21
Power Noise Improvement with DBI enabled
PDN impedance (Z_pdn) is a function of frequency.
Jitter is a function of Z_pdn and the step current load characteristic.
SLIDE 22
Voltage Droop Improvement with DBI Enabled
Enabling DBI reduces the average step current, so voltage droop performance improves.
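A first-order sketch of the droop argument, using the common approximation that droop is roughly the step current times the PDN impedance magnitude. It looks at the worst-case step on one byte lane rather than the average. The per-line current, the single impedance value, and the function names are illustrative assumptions. Z_pdn is frequency dependent in practice, and the limit of four low lines with DBI assumes the DDR4 rule that a byte is inverted when more than four bits are low.

```python
def max_lows_per_lane(dbi_enabled):
    """Worst-case number of lines driven low at once on one byte lane."""
    # Without DBI all 8 DQ lines can be low together. With DBI at most
    # 4 of the 9 lines (8 DQ plus DBI#) are low in any beat.
    return 4 if dbi_enabled else 8

def droop_estimate(n_lows, i_per_line, z_pdn):
    """First-order droop in volts: step current times |Z_pdn|."""
    return n_lows * i_per_line * z_pdn

I_PER_LINE = 1.2 / (48.0 + 34.0)  # amps per low line, assumed resistor values
Z_PDN = 0.1                       # ohms, assumed at the frequency of interest

droop_no_dbi = droop_estimate(max_lows_per_lane(False), I_PER_LINE, Z_PDN)
droop_dbi = droop_estimate(max_lows_per_lane(True), I_PER_LINE, Z_PDN)
print("worst-case droop, no DBI: %.2f mV" % (droop_no_dbi * 1e3))
print("worst-case droop, DBI:    %.2f mV" % (droop_dbi * 1e3))
```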
SLIDE 23
System Eye Margin Improvement Validation Set Up
Validation methods:
- Direct measurement of the DQ eye at the DRAM inputs.
- Write and read eye shmoo.
- Comparison with and without DBI enabled.
SLIDE 24
Direct Write Eye Measurement at DRAM
The write eye measurement shows a 5% UI jitter improvement. Validation is then extended to functional read and write eye shmoos.
SLIDE 25
Read and Write Shmoo Set Up
SLIDE 26
Read Eye Shmoo without DBI Enabled
SLIDE 27
Read Eye Shmoo with and without DBI Enabled
SLIDE 28
Write Eye Shmoo without DBI Enabled
SLIDE 29
Write Eye Shmoo with and without DBI Enabled
SLIDE 30 Eye Shmoo Comparison
[Figure: write eye width @ Vref and read eye width @ Vref, no DBI vs. DBI; axis from 94 to 112; annotations: ~11%, ~7%]
- Eye width improvement is observed.
- The amount of improvement differs: write improved by 11%, read improved by 7%.
- The different improvement implies a different step current impact.
- The PDN differs between the DRAM unit and the controller PHY.
SLIDE 31 Summary and Conclusions
- Computing Performance requirements drive the need to reduce system power.
- System memory power has become one of the major factors in total system power.
- Traditional improvement methods, such as scaling the process node and IO voltage, are slowing down.
- DDR4 IO introduced the DBI function to opportunistically reduce IO power.
- The amount of power improvement varies with the write/read ratio.
- DBI reduces the average step current in the memory system and hence improves channel margin.
- Experimental data showed that the channel jitter improvement differs between the write and read directions.