Modern DRAM Memory Systems - Brian T. Davis, MTU Interview Seminar (PowerPoint PPT Presentation)


SLIDE 1

Modern DRAM Memory Systems

Brian T. Davis, MTU Interview Seminar, Advanced Computer Architecture Laboratory, University of Michigan, April 24, 2000

SLIDE 2

  • Introduction

❍ Memory system
❍ Research objective

  • DRAM Primer

❍ Array
❍ Access sequence
❍ SDRAM
❍ Motivation for further innovation

  • Modern DRAM Architectures

❍ DRDRAM
❍ DDR2
❍ Cache enhanced DDR2 low-latency variants

  • Performance and Controller Policy Research

❍ Simulation methodologies
❍ Results

  • Conclusions
  • Future Work
SLIDE 3

Processor Memory System

  • Architecture Overview

❍ This is the architecture of most desktop systems
❍ Cache configurations may vary
❍ DRAM Controller is typically an element of the chipset
❍ Speed of all busses can vary depending upon the system

  • DRAM Latency Problem

[Diagram: the CPU with primary and secondary caches (secondary cache on the backside bus) connects over the frontside bus to the North-Bridge chipset containing the DRAM controller, which drives the DRAM system over the DRAM bus; other chipset devices and I/O systems also attach to the chipset.]

SLIDE 4

Research Objective

  • Determine the highest-performance memory controller policy for each DRAM architecture
  • Compare the performance of various DRAM architectures for different classes of applications, with each architecture operating under its best controller policy
SLIDE 5

DRAM Array

❍ One transistor & capacitor per bit in the DRAM (256 or 512 Mbit devices currently)
❍ Three events in the hardware access sequence (see the sketch below)

  • Precharge
  • Energize word line, based upon the de-muxed row address
  • Select bits from the row in sense-amps

❍ Refresh is mandatory
❍ Page and row are synonymous terminology

[Diagram: DRAM array - the row decoder drives word lines across the array, bit lines feed the sense amplifiers, and the column decoder selects the requested bits.]
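The access sequence above can be made concrete with a toy latency model. This is a sketch only; the timing constants are illustrative placeholders, not figures from the talk, and the hit/miss cases correspond to the controller policies introduced later.

```python
# Toy model of the three hardware events in a DRAM access.
# Timing values are illustrative placeholders, not vendor numbers.
T_PRECHARGE = 20  # close the previously open row, precharge the bit lines
T_ACTIVATE  = 20  # energize a word line and latch the row into the sense-amps
T_COLUMN    = 20  # select the requested bits from the row in the sense-amps

def access_latency(row_open: bool, same_row: bool) -> int:
    """Latency (ns) for one access, given the state of the sense-amps."""
    if row_open and same_row:
        return T_COLUMN                              # page hit: column select only
    if row_open and not same_row:
        return T_PRECHARGE + T_ACTIVATE + T_COLUMN   # page miss: full sequence
    return T_ACTIVATE + T_COLUMN                     # bank idle: no precharge needed

print(access_latency(row_open=True, same_row=True))   # 20
print(access_latency(row_open=True, same_row=False))  # 60
```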

SLIDE 6

Arrays per Device

  • Multiple arrays per device & aspect ratio

❍ Larger arrays: longer bit lines, higher capacitance, higher latency
❍ Multiple smaller arrays: lower latency, more concurrency (if the interface allows)
❍ Tradeoff: fewer & larger = cheaper; more & smaller = higher performance

  • Controller policies

❍ Close-Page-AutoPrecharge (CPA)
❍ Open-Page (OP)

SLIDE 7

Fast-Page-Mode (FPM) DRAM Interface

❍ All signals required by the DRAM array provided by the DRAM controller
❍ Three events in the FPM interface access sequence

  • Row Address Strobe - RAS
  • Column Address Strobe - CAS
  • Data response

❍ Dedicated interface - only a single transaction at any time
❍ Address bus multiplexed between row & column

[Timing diagram: RAS is asserted with the row address, then CAS toggles for Col 1, Col 2, and Col 3 while Data 1, Data 2, and Data 3 are returned on the data bus.]

SLIDE 8

SDRAM Interface

❍ All I/O synchronous rather than async - buffered on the device
❍ Split-transaction interface
❍ Allows concurrency, in a pipeline-like fashion, to unique banks
❍ Requires latches for address & data - low device overhead
❍ Double Data Rate (DDR) increases only the data transition frequency

SLIDE 9

SDRAM DIMM/System Architecture

❍ Devices per DIMM affects effective page size, thus potentially performance
❍ Each device only covers a "slice" of the data bus
❍ DIMMs can be single or double sided - single sided shown
❍ Data I/O per device is a bond-out issue

  • Has been increasing as devices get larger

[Diagram: 168-pin SDRAM DIMM - eight x8 devices share the DIMM address lines and each drives an 8-bit slice of the 64-bit data bus; the DRAM controller connects to this DIMM and to additional DIMMs.]

SLIDE 10

Motivation for a New DRAM Architecture

  • SDRAM limits performance of high-performance processors

❍ TPC-C: 4-wide issue machines achieve CPI of 4.2-4.5 (DEC)
❍ STREAM: 8-wide machine at 1 GHz: CPI of 3.6-9.7; at 5 GHz: CPI of 7.7-42.0
❍ PERL: 8-wide machine at 1 GHz: CPI of 0.8-1.1; at 5 GHz: CPI of 1.0-4.7

  • DRAM array has essentially remained static for 25 years

❍ Device size grows (x4) per 3 years - Moore's law
❍ Processor performance (not clock speed) improves 60% annually
❍ Latency decreases at 7% annually

  • Bandwidth vs. Latency

❍ Potential bandwidth = (data bus width) * (operating frequency) - see the worked example after this list
❍ 64-bit desktop bus at 100-133 MHz: 0.8 - 1.064 GB/s
❍ 256-bit server (parity) bus at 83-100 MHz: 2.666 - 3.2 GB/s

  • Workstation manufacturers migrating to enhanced DRAM
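A quick calculation reproduces the peak-bandwidth figures quoted above (and the 1.6 GB/s Direct Rambus channel discussed later). This is simply the formula from the slide; the double-data-rate case is handled by counting two transfers per clock.

```python
def peak_bandwidth(bus_bits: int, mhz: float, transfers_per_cycle: int = 1) -> float:
    """Peak bandwidth in GB/s = (bus width in bytes) * (transfer rate)."""
    return (bus_bits / 8) * mhz * 1e6 * transfers_per_cycle / 1e9

print(peak_bandwidth(64, 133))      # 1.064 GB/s (64-bit PC133 desktop bus)
print(peak_bandwidth(256, 100))     # 3.2 GB/s   (256-bit server bus)
print(peak_bandwidth(16, 400, 2))   # 1.6 GB/s   (Direct Rambus channel, DDR)
```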
SLIDE 11

Modern DRAM Architectures

  • DRAM architectures examined

❍ PC100 - baseline SDRAM
❍ DDR133 (PC2100) - SDRAM 9 months out
❍ Rambus -> Concurrent Rambus -> Direct Rambus
❍ DDR2
❍ Cache Enhanced Architecture - possible with any interface; applied here to DDR2

  • Not all novel DRAM will be discussed here

❍ SyncLink - death by standards organization
❍ Cached DRAM - two-port notebook single-solution
❍ MultiBanked DRAM - low-latency core w/ many small banks

  • Common elements

❍ Interface should enable parallelism between accesses to unique banks
❍ Exploit the extra bits retrieved, but not requested

  • Focus on DDR2 low-latency variants

❍ JEDEC 42.3 Future DRAM Task Group
❍ Low-Latency DRAM Working Group

SLIDE 12

DRDRAM RIMM/System Architecture

❍ Smaller arrays: 32 per 128 Mbit device (4 Mbit arrays; 1 KByte pages)
❍ Devices in series on the RIMM rather than in parallel
❍ Many more banks than in an equivalent-size SDRAM memory system
❍ Sense-amps are shared between neighboring banks
❍ Clock flows in both directions along the channel

SLIDE 13

Direct Rambus (DRDRAM) Channel

❍ Narrow bus architecture
❍ All activity occurs in OCTCYCLES (4 clock cycles; 8 signal transitions)
❍ Three bus components

  • Row (3 bits); Col (5 bits); Data (16 bits)

❍ Allows 3 transactions to use the bus concurrently
❍ All signals are Double Data Rate (DDR)

SLIDE 14

DDR2 Architecture

❍ Four arrays per 512 Mbit device
❍ Simulations assume 4 (x16) devices per DIMM
❍ Few, large arrays: 64 MByte effective banks and 8 KByte effective pages (see the arithmetic below)
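The "effective" figures follow from arithmetic on the stated organization. A minimal check, assuming the four x16 devices on a DIMM are accessed in lockstep; the 2 KByte per-device row is an assumption consistent with the 8 KByte effective page, not a number from the slide.

```python
device_bits  = 512 * 2**20     # 512 Mbit device
arrays       = 4               # arrays (banks) per device
devices_dimm = 4               # x16 devices per DIMM, accessed in lockstep

array_bytes = device_bits // arrays // 8      # bytes per array in one device
bank_bytes  = array_bytes * devices_dimm      # effective bank across the DIMM
print(bank_bytes // 2**20)                    # 64 -> 64 MByte effective bank

row_bytes_per_device = 2 * 2**10              # 2 KByte row per device (assumed)
print(row_bytes_per_device * devices_dimm)    # 8192 -> 8 KByte effective page
```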

SLIDE 15

DDR2 Interface

❍ Changes from current SDRAM interface

  • Additive Latency (AL = 2; CL = 3 in this figure; see the worked example below)
  • Fixed burst size of 4
  • Reduced power considerations

❍ Leverages existing knowledge
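With additive latency, the device holds a READ internally for AL cycles before the usual CAS latency applies, so data returns roughly AL + CL cycles after the command is issued. Below is a rough calculation using the figures quoted above; the 200 MHz command clock is assumed from the later comparison slide, and real devices add bus and controller overheads.

```python
AL, CL = 2, 3          # additive latency and CAS latency, as in the figure
clock_mhz = 200        # DDR2 command clock (assumed from the comparison slide)

read_latency_cycles = AL + CL
read_latency_ns = read_latency_cycles * 1000 / clock_mhz
print(read_latency_cycles, read_latency_ns)   # 5 cycles, 25.0 ns

# Fixed burst of 4: on a double-data-rate bus the 4 beats occupy 2 clock cycles.
burst_beats = 4
print(burst_beats / 2)                         # 2.0 cycles of data-bus time
```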

SLIDE 16

EMS Cache-Enhanced Architecture

❍ Full SRAM cache array for each row
❍ Precharge latency can always be hidden
❍ Adds the capacity for No-Write-Transfer
❍ Controller requires no additional storage - only control for NW-Xfer

SLIDE 17

Virtual Channel Architecture

❍ Channels are SRAM cache on the DRAM die - 16 channels = 16-line cache
❍ Reads and writes can only occur through a channel
❍ Controller can manage channels in many ways (see the sketch below)

  • FIFO
  • Bus-master based

❍ Controller complexity & storage increase dramatically
❍ Designed to reduce conflict misses
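As a rough illustration of FIFO channel management, here is a minimal sketch of a 16-channel controller. The segment naming and sizes are illustrative assumptions, not the actual Virtual Channel command set.

```python
from collections import OrderedDict

class FifoChannelController:
    """Toy Virtual Channel controller: 16 channels, FIFO allocation.

    Segments are identified here by an arbitrary hashable id, e.g.
    (bank, row, quarter), since a channel holds roughly a quarter-row;
    this naming is illustrative only.
    """
    def __init__(self, num_channels: int = 16):
        self.num_channels = num_channels
        self.channels = OrderedDict()   # segment id -> channel contents (ignored)

    def access(self, segment) -> bool:
        """Return True on a channel hit, False if the segment had to be fetched."""
        if segment in self.channels:
            return True
        if len(self.channels) >= self.num_channels:
            self.channels.popitem(last=False)   # evict the oldest channel (FIFO)
        self.channels[segment] = None           # fetch the segment into a channel
        return False
```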

SLIDE 18

Comparison of DRAM architectures (PC133, DDR2, DDR2_VC, DDR2_EMS, DRDRAM)

  • PC133
❍ Potential bandwidth: 1.064 GB/s
❍ Interface: bus; 64 data bits; 168 pads on DIMM; 133 MHz
❍ Latency to first 64 bits (min : max): (3 : 9) cycles, (22.5 : 66.7) nS
❍ Advantage: cost

  • DDR2
❍ Potential bandwidth: 3.2 GB/s
❍ Interface: bus; 64 data bits; 184 pads on DIMM; 200 MHz
❍ Latency to first 64 bits (min : max): (3.5 : 9.5) cycles, (17.5 : 47.5) nS
❍ Advantage: cost

  • DDR2_VC
❍ Interface and potential bandwidth: as DDR2 (3.2 GB/s)
❍ Latency to first 64 bits (min : max): (2.5 : 18.5) cycles, (12.5 : 92.5) nS
❍ Latency advantage: 16-line cache per device; line size is 1/4 of the row size
❍ Advantage: fewer misses in a "hot bank"
❍ Disadvantage: area (3-6%); controller complexity; more misses on purely linear accesses

  • DDR2_EMS
❍ Interface and potential bandwidth: as DDR2 (3.2 GB/s)
❍ Latency to first 64 bits (min : max): (3.5 : 9.5) cycles, (17.5 : 47.5) nS
❍ Latency advantage: cache line per bank; line size is the row size
❍ Advantage: precharge always hidden; full array bandwidth utilized
❍ Disadvantage: area (5-8%); more conflict misses

  • DRDRAM
❍ Potential bandwidth: 1.6 GB/s
❍ Interface: channel; 16 data bits; 184 pads on RIMM; 400 MHz
❍ Latency to first 64 bits (min : max): (14 : 32) cycles, (35 : 80) nS
❍ Latency advantage: many smaller banks; more open pages
❍ Advantage: narrow bus; smaller incremental granularity
❍ Disadvantage: area (10%); sense amps shared between adjacent banks

SLIDE 19

Comparison of Controller Policies

  • Close-Page Auto Precharge (CPA)

❍ After each access, data in the sense-amps is discarded
❍ ADV: subsequent accesses to a unique row/page incur no precharge latency
❍ DIS: subsequent accesses to the same row/page must repeat the full access

  • Open-Page (OP)

❍ After each access, data in the sense-amps is maintained
❍ ADV: subsequent accesses to the same row/page get page-mode access
❍ DIS: adjacent accesses to a unique row/page incur precharge latency (the two policies are compared in the sketch below)

  • EMS considerations

❍ No-Write-Transfer mode - how to identify write-only streams or rows

  • Virtual Channel (VC) considerations

❍ How many channels can the controller manage?
❍ Dirty virtual channel writeback
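A minimal sketch of the two baseline policies makes the tradeoff concrete: open-page wins when consecutive accesses hit the same row, close-page wins when they do not. The timing constants are placeholders, and a real controller would also model bank conflicts and bus scheduling.

```python
# Placeholder timings (ns); real values depend on the DRAM architecture.
T_PRE, T_ACT, T_CAS = 20, 20, 20

def latency(policy, open_row, row):
    """Return (latency_ns, row left in the sense-amps) for one access."""
    if policy == "OP":                       # Open-Page
        if open_row == row:
            return T_CAS, row                # page hit
        pre = T_PRE if open_row is not None else 0
        return pre + T_ACT + T_CAS, row      # page miss or empty bank
    # Close-Page-AutoPrecharge: the bank was precharged after the last access
    return T_ACT + T_CAS, None

def run(policy, rows):
    total, open_row = 0, None
    for r in rows:
        t, open_row = latency(policy, open_row, r)
        total += t
    return total

same_row = [1, 1, 1, 1]
new_rows = [1, 2, 3, 4]
print(run("OP", same_row), run("CPA", same_row))   # 100 160 -> open-page wins
print(run("OP", new_rows), run("CPA", new_rows))   # 220 160 -> close-page wins
```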

SLIDE 20

Execution Driven Simulation

❍ SimpleScalar - standard processor simulation tool
❍ Advantages

  • Feedback from DRAM latency
  • Parameters of the system are easy to modify with full reliability
  • Confidence in results can be very high

❍ Disadvantages

  • SLOW to execute
  • Limited to architectures which can be simulated by SimpleScalar

[Diagram: SimpleScalar models the compiled binaries, CPU, primary cache, secondary cache, backside and frontside busses, North-Bridge chipset DRAM controller, and DRAM system; other chipset devices and I/O systems are not modeled.]

SLIDE 21

Trace Driven Simulation

❍ Advantages

  • FAST to simulate
  • Allows traces from SMPs or more complex architectures
  • Appropriate for model verification and hit-rate studies

❍ Disadvantages

  • No feedback from one access to subsequent accesses
  • Without timestamps, this is essentially a limit-study framework (see the sketch below)
  • Not appropriate for time based results
  • Simulation parameters limited to those of the gathered system

[Diagram: frontside-bus-level, graphics, and I/O accesses feed the North-Bridge chipset DRAM controller, which drives the DRAM system over the DRAM bus; other chipset devices and I/O systems also attach to the chipset.]
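The sketch below shows why a timestamp-free trace is a limit study: accesses are replayed back-to-back with no feedback into when the next one arrives, which is adequate for hit-rate style results but not for time-based ones. The trace format, page size, and bank count here are hypothetical.

```python
def bank_and_row(addr: int, row_bytes: int = 8192, banks: int = 16):
    """Map an address to (bank, row); page size and bank count are hypothetical."""
    row = addr // row_bytes
    return row % banks, row // banks

def replay(trace_lines):
    """Replay a timestamp-free trace and report the open-page hit rate."""
    open_rows = {}                      # bank -> currently open row
    hits = total = 0
    for line in trace_lines:
        op, addr = line.split()         # hypothetical "R 0x1a2b3c" trace format
        bank, row = bank_and_row(int(addr, 16))
        total += 1
        if open_rows.get(bank) == row:
            hits += 1                   # row-buffer (page) hit
        open_rows[bank] = row           # open-page policy: leave the row open
    return hits / total if total else 0.0

print(replay(["R 0x0000", "R 0x0040", "W 0x2000", "R 0x0080"]))   # 0.5
```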

SLIDE 22

Results

  • Execution driven based upon:

❍ SimpleScalar

  • Version 2.0 MSHR - written by Todd Austin, modified by Doug Burger, customized for these simulations

  • 8-way superscalar / 2 memory ports
  • 32K I/D split L1 caches
  • 256K Unified L2
  • 16 MSHRs provide concurrent memory access support
  • Trace driven based upon:

❍ IBM OLTP (On-Line Transaction Processing) traces

  • SMP 1-way or 8-way processor - elements are cache snoop data

❍ Transmeta Crusoe processor running Windows applications

  • Includes processor, AGP graphics & I/O as access sources
  • DRAM & controller models

❍ SDRAM model (PC100 - DDR133)
❍ DRDRAM model
❍ DDR2 model (std, vc & ems)

SLIDE 23

[Chart: SPEC benchmark runtimes (seconds) for cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, pegwit, and perl, comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc.]

SLIDE 24

[Chart: bandwidth benchmark runtimes (seconds) for random_walk, stream, and stream_no_unroll, comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc.]

SLIDE 25

[Chart: STREAM execution time (seconds) at processor speeds of 1 GHz, 5 GHz, and 10 GHz, comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc.]

SLIDE 26

cc1 com− − press go ijpeg li linear _walk mpeg 2dec mpeg 2enc pegwit perl ran− dom_ walk strea m strea m_no _un−

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Data Bus Utilization

pc100 ddr133 ddr2 ddr2ems ddr2vc

Benchmark Fraction

SLIDE 27

[Chart: fraction of adjacent accesses to the same bank for the oltp1w, oltp8w, xm_access, xm_cpumark, xm_gcc, and xm_quake traces, comparing ddr2_cpa, ddr2_cpa_remap, ddr2_op, ddr2_op_remap, ddr2ems, ddr2ems_remap, ddr2vc, and ddr2vc_remap.]

SLIDE 28

[Chart: hit rates (fraction) for the oltp1w, oltp8w, xm_access, xm_cpumark, xm_gcc, and xm_quake traces, comparing ddr2_cpa, ddr2_op, ddr2ems_cpa, and ddr2vc_cpa.]

SLIDE 29

[Chart: remapping effectiveness for adjacent accesses (fraction) by DRAM architecture: ddr2_cpa, ddr2_op, ddr2ems, and ddr2vc.]

SLIDE 30

[Chart: average latency (nanoseconds) for cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, pegwit, perl, random_walk, stream, and stream_no_unroll, comparing ddr2_cpa, ddr2_cpa_inv, ddr2_op, ddr2ems_cpa, ddr2ems_cpa_inv, ddr2ems_op, and ddr2vc_cpa.]

SLIDE 31

Conclusions

  • More bandwidth can be had, at a cost
  • The target for architectural improvements must be latency
  • Controller can significantly affect average latency
  • DDR2 is evolutionary, but provides the required performance
  • Cache Enhanced DRAM can improve performance, but the price for improvement is dependent upon market penetration

  • Packetized interfaces incur increased latency
SLIDE 32

Future Work

  • VC controller performance

❍ Cache line allocation policy(s) for channels
❍ When to write back dirty channels - avoid the maximal penalty
❍ Price/performance in the controller

  • EMS controller performance

❍ When to use No-Write-Transfer

  • Controller onto processor die
  • Embedded DRAM architectures
  • SMP primary memory partitioning
SLIDE 33

Conventional DRAM

❍ Basic DRAM core (memory array) used in all DRAM memories
❍ Delay is propagation through all circuits; no pipelining
❍ Limit on memory array size due to bit-line capacitance
❍ Remainder of the row, accessed but not used, is discarded

[Diagram: conventional DRAM - a memory array with row decoder driving word lines, sense amps/word drivers on the bit lines, column decoder, row and column buffers, data in/out buffers, and clock & refresh circuitry; control inputs are ras, cas, rd/wr, and the multiplexed address.]

SLIDE 34

Conventional DRAM Upgrades

  • Fast Page Mode (FPM) DRAM

❍ Eliminates the RAS transition requirement between each access
❍ Utilizes the sense-amp contents as a cache

  • Extended Data Out (EDO) DRAM

❍ Latch added between the sense-amps and the output drivers
❍ Allows parallel operation of two DRAM components

  • Output drivers function while next access is being done
  • Memory array (precharge or access) is somewhat overlapped
  • Burst EDO DRAM

❍ Burst capability for accessing large contiguous segments of a row
❍ Toggling the CAS line sequences to the next datum in the burst

SLIDE 35

Conventional (FPM) DRAM Interface

❍ Dedicated interface - only a single transaction at any time
❍ Address bus multiplexed between row & column
❍ All signals required by the DRAM array are provided by the DRAM controller

SLIDE 36

Synchronous DRAM (SDRAM)

❍ Make all I/O synchronous rather than async
❍ 66 MHz SDRAM -> PC100 -> DDR133 (PC2100)
❍ Overhead is very low - latches for address & data

[Diagram: SDRAM - a DRAM array with row decoder, sense amps/word drivers on the bit lines, column decoder, control-signal and address buffers, read and write registers, I/O buffers, and a clock generator; inputs include ras, cas, address, and chip select.]

SLIDE 37

Interleaved Memory

❍ Relatively uncommon
❍ Used to get concurrency from asynchronous DRAM (see the sketch below)

[Diagram: interleaved memory - banks 0 through N share the data bus, each with its own individual address and control signals.]
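A small sketch of low-order interleaving, one way such a controller might spread consecutive blocks across banks so that accesses to different banks can overlap; the bank count and block size here are illustrative.

```python
def bank_and_offset(addr: int, num_banks: int = 4, block_bytes: int = 64):
    """Low-order interleaving: consecutive 64-byte blocks land in successive banks."""
    block = addr // block_bytes
    return block % num_banks, (block // num_banks) * block_bytes + addr % block_bytes

for a in (0, 64, 128, 192, 256):
    print(a, bank_and_offset(a))   # banks cycle 0, 1, 2, 3, then back to 0
```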

SLIDE 38

Direct Rambus (DRDRAM) Device Architecture