Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems
ISPASS 2011
Hyojin Choi*, Jongbok Lee+, and Wonyong Sung*
hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr
* Seoul National University, + Hansung University
Seoul, Korea

Introduction
• The memory-wall problem in the multi-core era
  • The rate at which an increasing number of cores generates memory traffic is growing faster than the rate at which that traffic can be serviced.
  [Figure: a processor whose cores each have L1I$ and L1D$ caches, connected through an interconnection network and caches to a memory controller (address translator and scheduler) that drives the DIMMs]
• Our research focuses on main memory subsystem design.
• This paper proposes an analytical DRAM performance model.
Multimedia Systems Lab. @ SoEE, SNU

Outline
• Background
• Motivation
• Approach
• Objective
• Modeling Bank Busy Time
  • Minimum inter-command delays
  • Pattern parameters
  • Average bank busy time
• Evaluation Results
• Concluding Remarks

DRAM architecture
• Multiple banks (typically 4 or 8)
  • Each bank has a cell array, a row-buffer, and address/control logic.
  • The address, command, and data buses are shared by all banks.
DRAM operations
• Activate (ACT)
  • An entire row of data is read from the cell array and stored in the row-buffer (the row-buffer is open).
• Precharge (PRE)
  • The contents of the row-buffer are restored to the cell array (the row-buffer is closed) and the bitlines are precharged.
• Read (RD) or write (WR)
  • Data is read from or written to the row-buffer.
[Figure: DRAM bank diagram labeled with the PRE, ACT, RD, and WR operations]

DRAM timing trends
[Figure: DRAM timing parameters (tRAS, tFAW, tWR, tCL, tWTR, tRCD (= tRP), tRRD, tCK) plotted as time in nsec (0 to 50) across DRAM generations: DDR200, DDR266, DDR333, DDR400A, DDR2-400, DDR2-533, DDR2-667, DDR2-800, DDR3-800, DDR3-1066, DDR3-1333, DDR3-1600. Source: JEDEC DDR/DDR2/DDR3 standards]
• The goal is to find an analytical model that shows the impact of each DRAM timing parameter on performance.

Challenge
• DRAM access performance depends on a program's memory access behavior.
  (a) Row-buffer hit: row(x) is stored and row(x) is requested → RD (WR)
  (b) Row-buffer miss: row(x) is stored and row(y) is requested → PRE-ACT-RD (WR)
  [Figure: a miss issues (1) PRE, (2) ACT, (3) RD/WR; a hit issues RD/WR only]
  • The DRAM command chain generated to serve a memory request depends on the incoming request and on the row-buffer status (open or closed, and the row index if open), which is determined by the previously serviced requests.
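The hit/miss distinction can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' simulator: the function name `command_chain` is hypothetical, and the handling of a closed row-buffer (ACT followed by the column command, with no PRE) is an assumption added here.

```python
from typing import List, Optional


def command_chain(open_row: Optional[int], row: int, is_write: bool) -> List[str]:
    """Return the DRAM commands needed to serve one request under an open policy.

    open_row is the row currently held in the row-buffer, or None when the
    row-buffer is closed (the closed case is an assumption, not from the slides).
    """
    column_cmd = "WR" if is_write else "RD"
    if open_row == row:
        return [column_cmd]              # row-buffer hit
    if open_row is None:
        return ["ACT", column_cmd]       # closed row-buffer: nothing to precharge
    return ["PRE", "ACT", column_cmd]    # row-buffer miss


# Serve a short request stream to one bank and track the row-buffer state.
open_row = None
for row, is_write in [(3, False), (3, True), (7, False)]:
    print(row, command_chain(open_row, row, is_write))
    open_row = row  # under the open policy the accessed row stays open
```

The same request can therefore cost one command or three, depending only on what was served before it.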

Objective
• To find an analytical model of the form: performance metric = f(w, DRAM timings)
  • performance metric: the output of the model
  • w: characteristics of memory access behavior
  • DRAM timings: tRP, tRCD, tRAS, tCCD, …
  • f: a simple function of w and the DRAM timings
• Key questions
  • What is the performance metric?
  • How should the memory access behavior of a program be characterized?
  • What is the relationship between the input parameters and the performance metric?

Assumptions
1) One memory request is serviced by one column command
  • All memory references are cache misses.
  • Cache block size = 64 bytes, data bus width = 64 bits, burst length = 8
2) There are four DRAM commands: PRE, ACT, RD, and WR
  • The effect of REF (refresh) on access performance is negligible.
  • RDAP/WRAP (auto-precharge after RD/WR) are not generated when the memory controller adopts the open policy.
3) Open policy for row-buffer management
  • Row-buffer misses → PRE-ACT-RD, PRE-ACT-WR
  • Row-buffer hits → RD, WR
4) First-Ready First-Come First-Served (FR-FCFS) scheduling
  • Row-buffer hit requests are prioritized over miss requests to maximize data bus utilization.
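The FR-FCFS rule in assumption 4 can be sketched as follows. This is a simplified, single-bank illustration under stated assumptions: the `(arrival order, row index)` request layout and the function name `fr_fcfs_pick` are hypothetical, and a real controller also checks command readiness across banks.

```python
from typing import List, Optional, Tuple

Request = Tuple[int, int]  # (arrival order, row index); hypothetical layout


def fr_fcfs_pick(queue: List[Request], open_row: Optional[int]) -> Request:
    """Pick the next request for one bank: the oldest row-buffer hit if any
    exists, otherwise the oldest request overall. The queue must be non-empty."""
    hits = [req for req in queue if req[1] == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda req: req[0])


# Row 7 is open, so the younger hit (1, 7) is served before the older miss (0, 4).
print(fr_fcfs_pick([(0, 4), (1, 7), (2, 7)], open_row=7))
```

Serving hits first avoids the PRE-ACT pair that a miss would require, which is why the policy favors data bus utilization.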

Approach
[Figure: timing diagram for requests from proc. 0 and proc. 1 to the same bank, Q1 (miss) followed by Q2 (hit). Command bus: PRE, ACT, RD, RD. Data bus: D1, D2. The bank row marks the tRP, tRCD, and tCCD intervals and the bank busy time for Q1.]
  latency of Q1 = tRP + tRCD + tCAS + tCCD
  latency of Q2 = (waiting time) + tCAS + tCCD
  data transfer time = tCCD (4 × tCK)
• Memory access latency includes the queuing delay.
• Among the DRAM timings, data transfer time depends only on tCK.
• Modeling the time needed for a bank to service DRAM commands → bank busy time
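The two latency expressions can be evaluated directly. The sketch below assumes all timings are given in clock cycles; the cycle counts and the waiting time are hypothetical values chosen for illustration, not figures from the slides.

```python
def miss_latency(t_rp: int, t_rcd: int, t_cas: int, t_ccd: int) -> int:
    """Latency of a row-buffer miss served without queuing: PRE, ACT, RD, data."""
    return t_rp + t_rcd + t_cas + t_ccd


def hit_latency(waiting: int, t_cas: int, t_ccd: int) -> int:
    """Latency of a row-buffer hit that first waits in the queue."""
    return waiting + t_cas + t_ccd


# Hypothetical cycle counts (assumptions for illustration only).
T_RP, T_RCD, T_CAS, T_CCD = 6, 6, 6, 4
print(miss_latency(T_RP, T_RCD, T_CAS, T_CCD))    # Q1: miss
print(hit_latency(10, T_CAS, T_CCD))              # Q2: hit after 10 cycles of waiting
```

Because the waiting time depends on what other requests are queued, latency alone is awkward to model analytically, which motivates the bank busy time metric.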

Bank busy time
• A bank is said to be busy when the memory controller cannot issue any command to the bank because of timing constraints. Otherwise, the bank is idle.
• Considerations:
  • 1) Simple cases: PRE (tRP), ACT (tRCD)
  • 2) Dependency on the command that follows → handled in a pair-wise fashion (minimum inter-command delays)
    ex) RD-RD (tCCD) vs. RD-WR (tRTW)
  • 3) Multiple timing constraints on PRE
    ex) RD-PRE: the delay depends on the number of RDs between ACT and PRE
      (a) RD-PRE (tRTP)  (b) RD-PRE (tRAS − tRCD)

Minimum inter-command delays
• The minimum inter-command delay can be defined for all possible DRAM command pairs based on the DRAM timing constraints defined in the data sheet.
• RD(x) represents x consecutive RD commands (x = 1, …, m)
  • m = ⌈(tRAS − tRCD − tRTP) / tCCD⌉ (m = 2, 3, 3, and 4 for DDR3-800/-1066/-1333/-1600)
• RD(others) means the row-buffer miss cases that are not included in WR-PRE and RD(x)-PRE.
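The bound m can be computed as below, reading the expression as a ceiling, which is an assumption about the brackets lost from the slide text. The timing values are assumed cycle counts in the style of fast DDR3 speed bins; the slides do not list them. With these particular values the results agree with the m values quoted on the slide.

```python
import math


def max_tracked_reads(t_ras: int, t_rcd: int, t_rtp: int, t_ccd: int) -> int:
    """m = ceil((tRAS - tRCD - tRTP) / tCCD), all timings in clock cycles."""
    return math.ceil((t_ras - t_rcd - t_rtp) / t_ccd)


# Assumed (tRAS, tRCD, tRTP, tCCD) in cycles; not taken from the slides.
ASSUMED_TIMINGS = {
    "DDR3-800": (15, 5, 4, 4),
    "DDR3-1066": (20, 6, 4, 4),
    "DDR3-1333": (24, 7, 5, 4),
    "DDR3-1600": (28, 8, 6, 4),
}

for name, timings in ASSUMED_TIMINGS.items():
    print(name, max_tracked_reads(*timings))
```

Beyond m consecutive reads, tRAS has already elapsed, so the RD-PRE delay is set by tRTP alone and no further RD(x) cases need to be distinguished.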

Pattern parameters
• Pattern parameters := the number of occurrences of each DRAM command pair
  • They can be interpreted as characteristics of memory access streams.
    cf) The open policy is assumed for row-buffer management.
• The number of row-buffer misses (Nm) = Nwp + Nrx + Nrt
• The number of row-buffer hits = Nww + Nrw + Nwr + Nrr
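As a small worked example, the hit and miss totals follow directly from the pattern parameters. The counts below are made-up numbers for illustration, and the helper name `hit_miss_counts` is hypothetical.

```python
from typing import Dict, Tuple


def hit_miss_counts(params: Dict[str, int]) -> Tuple[int, int]:
    """Return (row-buffer hits, row-buffer misses) from the pattern parameters."""
    misses = params["Nwp"] + params["Nrx"] + params["Nrt"]
    hits = params["Nww"] + params["Nrw"] + params["Nwr"] + params["Nrr"]
    return hits, misses


# Hypothetical counts for one bank (not measured data).
example = {"Nwp": 10, "Nrx": 25, "Nrt": 5, "Nww": 20, "Nrw": 5, "Nwr": 5, "Nrr": 30}
hits, misses = hit_miss_counts(example)
print(hits, misses, misses / (hits + misses))  # hits, misses, miss ratio
```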

The proposed model
• The bank busy time is a linear combination of the minimum inter-command delays and the pattern parameters.
  Bank busy time = Σ (i = 1 to n) Ni × Di
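The linear combination is a sum of count-times-delay products, one per command pair. In the sketch below, the pair names, counts, and delay values are hypothetical placeholders; the full delay table from the paper is not reproduced here.

```python
from typing import Dict


def bank_busy_time(counts: Dict[str, int], delays: Dict[str, int]) -> int:
    """Sum of N_i * D_i over all command pairs i (delays in clock cycles)."""
    return sum(counts[pair] * delays[pair] for pair in counts)


# Hypothetical pattern parameters and delays (tCCD = 4, tRTW = 6 assumed).
counts = {"RD-RD": 30, "RD-WR": 5}
delays = {"RD-RD": 4, "RD-WR": 6}
print(bank_busy_time(counts, delays))
```

Keeping the counts and the delays in separate tables mirrors the model's split between workload behavior and DRAM timing.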

Average bank busy time
• Average bank busy time := the bank busy time per memory request
  • N: the number of memory requests to a bank during program execution
  Average bank busy time = w0 × tRP + w1 × tRCD + w2 × tCCD + w3 × tCWL + w4 × tRTW + w5 × tWTR + w6 × tRAS + w7 × tWR + w8 × tRTP
  • where the weights w0, …, w8 are ratios built from the pattern parameters and N, such as the row-buffer miss ratio
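Dividing the bank busy time by N turns each count into a per-request weight, which is how a weighted form arises from the linear combination. The sketch below shows this at the level of command pairs; the counts and delays are hypothetical, and it does not reproduce the slide's exact definitions of w0 to w8.

```python
from typing import Dict


def average_bank_busy_time(counts: Dict[str, int], delays: Dict[str, int],
                           n_requests: int) -> float:
    """Bank busy time per memory request: (sum of N_i * D_i) / N."""
    total = sum(counts[pair] * delays[pair] for pair in counts)
    return total / n_requests


def per_request_weights(counts: Dict[str, int], n_requests: int) -> Dict[str, float]:
    """Each count divided by N, so the average becomes sum of weight * delay."""
    return {pair: count / n_requests for pair, count in counts.items()}


# Hypothetical values for illustration only.
counts = {"RD-RD": 30, "RD-WR": 10}
delays = {"RD-RD": 4, "RD-WR": 6}
weights = per_request_weights(counts, 40)
print(average_bank_busy_time(counts, delays, 40))
print(sum(weights[pair] * delays[pair] for pair in weights))  # same value
```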

Experimental setup
kernel/application   description
FFT.MT               matrix transpose (512 × 512)
FFT.MM               matrix multiplication (512 × 512)
OceanContig          grid size: 258 × 258
Cholesky             input: tk23.O
LUContig             matrix size: 512 × 512
Raytrace             input: teapot.env
FMM                  2048 particles
[Figure: simulated system in M5: cores with L1I$ and L1D$ caches on a shared single bus, a shared L2 cache, and a memory controller (FR-FCFS, address mapping) connected by addr/cmd and data (64 bit) buses to bank 0 through bank 7]
• Architecture simulator configuration (M5)
  • In-order processor model (P = 1, 2, …, 64), 2 GHz
  • L1 cache: private, separate, 64 KB, 2-way, 64 bytes, 1 cycle
  • L2 cache: shared, unified, 512 KB, 2-way, 64 bytes, 20 cycles
  • Shared bus with no overhead
• Main memory subsystem
  • A cycle-accurate DRAM timing simulator extension for M5
  • Memory controller: FR-FCFS, [row:bank:col], open policy
  • 2 Gbytes, 8 banks, DDR3-800/-1066/-1333/-1600, data bus width: 64 bits
• Seven multi-threaded workloads from the SPLASH-2 benchmark

(1) Pattern parameters
[Figure: number of memory requests (×10^3) per bank (bank0 to bank7) for 1, 2, 4, 8, 16, and 32 processors: (a) FFT.MT, (b) Raytrace. Row-buffer hits: Nww (write-write/hit), Nrw (read-write/hit), Nwr (write-read/hit), Nrr (read-read/hit). Row-buffer misses: Nr2 (miss after 2 reads), Nr1 (miss after 1 read), Nrt (miss after read, other cases), Nwp (miss after write).]
• The pattern parameters are obtained during the simulation, as shown in the figure.
  • Other results are included in the paper.
• Selecting representative pattern parameters for a workload
  • This applies when the memory accesses are distributed non-uniformly across banks.
  • 1) Select the bank that has the maximum number of requests.
  • 2) Use the pattern parameters of that bank.
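The two selection steps can be sketched as a simple maximum over banks. The sketch assumes that a bank's request count equals the sum of its pattern parameters (hits plus misses); the function name `representative_bank` and the counts are hypothetical.

```python
from typing import Dict, List, Tuple


def representative_bank(per_bank: List[Dict[str, int]]) -> Tuple[int, Dict[str, int]]:
    """Return (bank index, pattern parameters) of the busiest bank.

    Assumes the number of requests to a bank is the sum of its pattern
    parameters. Ties go to the lowest bank index.
    """
    totals = [sum(params.values()) for params in per_bank]
    best = max(range(len(per_bank)), key=lambda bank: totals[bank])
    return best, per_bank[best]


# Hypothetical per-bank counts (not measured data).
banks = [
    {"Nrr": 30, "Nwp": 10},
    {"Nrr": 80, "Nwp": 25},
    {"Nrr": 50, "Nwp": 5},
]
print(representative_bank(banks))
```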
