Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization - - PowerPoint PPT Presentation



SLIDE 1

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

  • Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun,
  • G. Pekhimenko, Y. Luo, O. Mutlu,
  • P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

SLIDE 2

  • Bulk data copy and initialization
  • Unnecessarily move data on the memory channel
  • Degrade system performance and energy efficiency
  • RowClone – perform copy in DRAM with low cost
  • Uses row buffer to copy large quantity of data
  • Source row → row buffer → destination row
  • 11X lower latency and 74X lower energy for a bulk copy
  • Accelerate Copy-on-Write and Bulk Zeroing
  • Forking, checkpointing, zeroing (security), VM cloning
  • Improves performance and energy efficiency at low cost
  • 27% and 17% for 8-core systems (0.01% DRAM chip area)
SLIDE 3

[Figure: Baseline system. Cores and cache connect through the memory controller (MC) and the memory channel to memory. The channel has limited bandwidth, and transfers over it consume high energy.]

SLIDE 4

[Figure: The same system diagram: cores, cache, MC, memory, channel.]

Goal: Reduce unnecessary data movement over the memory channel.

SLIDE 5

[Figure: Two operations. Bulk data copy: src → dst. Bulk data initialization: val → dst.]

SLIDE 6

[Figure: Same as slide 5 (animation step): bulk data copy (src → dst) and bulk data initialization (val → dst).]

SLIDE 7

Use cases:

  • Forking
  • Zero initialization (e.g., for security)
  • VM cloning
  • Deduplication
  • Checkpointing
  • Page migration
  • Many more

SLIDE 8

[Figure: Baseline bulk copy. Data moves from src to dst in memory by traveling over the channel, through the MC and the cache hierarchy.]

High latency (1046 ns to copy 4 KB), interference with other memory requests, and high energy (3600 nJ to copy 4 KB).

SLIDE 9

[Figure: The same copy with the channel path crossed out. Can the copy from src to dst be performed without the high latency, interference, and high energy of moving data over the channel?]

SLIDE 10

Introduction

  • DRAM Background
  • RowClone
  • Fast Parallel Mode
  • Pipelined Serial Mode
  • End-to-end Design
  • Evaluation
SLIDE 11

[Figure: DRAM organization: Memory Channel → Chip I/O → Bank I/O → Bank → Subarray → Row Buffer → Row of DRAM cells.]

SLIDE 12

[Figure: The access path through Chip I/O and Bank I/O.]

ACTIVATE: copy data from a row into the row buffer. READ: transfer data from the row buffer to the channel over the shared bus.

SLIDE 13

[Figure: A DRAM cell connected to a sense amplifier (row buffer). Bitlines are precharged to VDD/2; the cell stores charge at VDD.]

SLIDE 14

[Figure: ACTIVATE, step by step.]

ACTIVATE: 1) The cell is connected to the bitline and loses charge, perturbing the bitline from VDD/2 to VDD/2 + δ. 2) The sense amplifier amplifies the difference, driving the bitline to VDD. 3) In the stable state, the sense amplifier drives the cell, restoring the cell's data. READ/WRITE operations are then served from the row buffer.
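The activation sequence can be illustrated with a small charge-sharing model. All constants below are typical illustrative values, not figures from the talk:

```python
# Charge-sharing model of ACTIVATE (all constants are illustrative).
VDD = 1.2            # supply voltage (V), assumed
C_CELL = 24e-15      # cell capacitance (F), assumed typical value
C_BITLINE = 144e-15  # bitline capacitance (F), assumed typical value

def bitline_after_charge_sharing(v_cell):
    """Bitline voltage after a cell at v_cell is connected to a bitline
    precharged to VDD/2 (charge conservation across both capacitors)."""
    return (C_CELL * v_cell + C_BITLINE * VDD / 2) / (C_CELL + C_BITLINE)

v_bitline = bitline_after_charge_sharing(VDD)  # cell stored a '1'
delta = v_bitline - VDD / 2                    # small positive perturbation
# The sense amplifier detects the sign of delta and drives the bitline,
# and therefore the connected cell, to full VDD, restoring the cell data.
v_restored = VDD if delta > 0 else 0.0
```

With these assumed capacitances, δ is only about 86 mV, which is why the amplification and restore steps are essential.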

SLIDE 15

Introduction DRAM Background

  • RowClone
  • Fast Parallel Mode
  • Pipelined Serial Mode
  • End-to-end Design
  • Evaluation
SLIDE 16

[Figure: A subarray with a source row (src), a destination row (dst), and the shared Row Buffer.]

Could a copy be performed in two steps?

  • 1. Source row to row buffer
  • 2. Row buffer to destination row

SLIDE 17

[Figure: Sense amplifier (row buffer) operation reused for copy. Activating dst while the amplifiers hold src's data perturbs the bitline to VDD/2 + δ; the amplifiers amplify the difference and drive it to VDD, so the data gets copied from src to dst.]

SLIDE 18

[Figure: The same subarray with src, dst, and the Row Buffer.]

RowClone intra-subarray copy:

  • 1. Activate src row (copy data from src to row buffer)
  • 2. Activate dst row (disconnect src from row buffer, connect dst – copy data from row buffer to dst)
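From the memory controller's perspective, this copy is just two back-to-back row activations followed by a precharge. A minimal sketch (the function and command tuples are my naming, not an interface from the talk):

```python
# Hedged sketch: the DRAM command sequence for an intra-subarray RowClone
# copy. Both rows must share the same subarray (and hence the row buffer).
def rowclone_fpm_commands(src_row, dst_row):
    """Return the command sequence that copies src_row to dst_row."""
    return [
        ("ACTIVATE", src_row),   # src data is latched into the row buffer
        ("ACTIVATE", dst_row),   # sense amps now drive dst's cells -> copy
        ("PRECHARGE", None),     # close the row, re-precharge the bitlines
    ]
```

No data ever crosses the chip's I/O pins, which is where the latency and energy savings come from.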

SLIDE 19

Bulk data copy with RowClone (4 KB):

  • Latency: 1046 ns → 90 ns (11x lower)
  • Energy: 3600 nJ → 40 nJ (74x lower)
  • No bandwidth consumption on the memory channel
  • Very little change to the DRAM chip

SLIDE 20

Constraints:

  • Location of source/destination: both must be in the same subarray
  • Size of the copy: always copies the entire source row to the destination row
SLIDE 21

[Figure: Memory Channel, Chip I/O, and two banks connected by the shared internal bus.]

Pipelined Serial Mode: transfer data between banks over the shared internal bus, overlapping the latency of the read and the write.

1.9x latency reduction, 3.2x energy reduction.
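A rough timing sketch of why the overlap helps. The per-line times and row/line sizes below are assumptions for illustration; only the roughly 2x pipelining effect is the point:

```python
# Hedged timing model of Pipelined Serial Mode (parameters are illustrative).
T_READ, T_WRITE = 5.0, 5.0    # per-cache-line read/write time (arbitrary units)

def serial_latency(n_lines):
    """No overlap: each line is fully read, then fully written."""
    return n_lines * (T_READ + T_WRITE)

def pipelined_latency(n_lines):
    """Overlap: after the first read, each write hides the next read."""
    return T_READ + n_lines * T_WRITE

n_lines = 8192 // 64          # assumed 8 KB row, 64 B cache lines
speedup = serial_latency(n_lines) / pipelined_latency(n_lines)
```

With symmetric read/write times the speedup approaches 2x as the row gets long, in line with the 1.9x the slide reports.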

SLIDE 22

[Figure: Memory Channel → Chip I/O → Bank I/O → Bank → Subarray.]

  • Intra-subarray copy: use FPM
  • Inter-bank copy: use PSM
  • Inter-subarray copy (same bank): use PSM twice
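The selection rule can be sketched as a small function (the `Loc` type and names are mine, not an interface from the talk):

```python
# Hedged sketch: choosing the RowClone mechanism from the physical
# locations of the source and destination rows.
from collections import namedtuple

Loc = namedtuple("Loc", ["bank", "subarray"])  # physical row location

def choose_mechanism(src, dst):
    if src.bank == dst.bank and src.subarray == dst.subarray:
        return "FPM"      # fastest: back-to-back activations in one subarray
    if src.bank != dst.bank:
        return "PSM"      # pipelined transfer over the shared internal bus
    # Same bank, different subarray: stage through another bank, PSM twice.
    return "PSM x2"
```

For example, `choose_mechanism(Loc(0, 3), Loc(0, 3))` selects FPM, while a copy between subarrays of the same bank falls back to two PSM transfers.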

SLIDE 23

  • Initialization with arbitrary data
  • Initialize one row
  • Copy the data to other rows
  • Zero initialization (most common)
  • Reserve a row in each subarray (always zero)
  • Copy data from reserved row (FPM mode)
  • 6.0X lower latency, 41.5X lower DRAM energy
  • 0.2% loss in capacity
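A quick arithmetic check of the capacity cost, assuming a common subarray size of 512 rows (the row count is my assumption; the slide only quotes the 0.2% result):

```python
# One reserved all-zero row per subarray; the geometry is assumed.
ROWS_PER_SUBARRAY = 512
capacity_loss = 1 / ROWS_PER_SUBARRAY  # fraction of DRAM rows reserved
# 1/512 ~= 0.195%, consistent with the ~0.2% capacity loss on the slide.
```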
SLIDE 24

[Figure: Bar charts of latency reduction and energy reduction for copy (Intra-Subarray, Inter-Bank, Inter-Subarray) and zero initialization (Intra-Subarray).]

  Latency reduction: 11.6x (intra-subarray copy), 1.9x (inter-bank), 1.0x (inter-subarray), 6.0x (zero)
  Energy reduction: 74.4x (intra-subarray copy), 3.2x (inter-bank), 1.5x (inter-subarray), 41.5x (zero)

Very low cost: 0.01% increase in die area

SLIDE 25

Introduction DRAM Background RowClone

  • Fast Parallel Mode
  • Pipelined Serial Mode
  • End-to-end Design
  • Evaluation
SLIDE 26

[Figure: System stack: Application, Operating System, ISA, Microarchitecture, DRAM (RowClone).]

  • How does the software communicate occurrences of bulk copy/initialization to hardware?
  • How to maximize use of the Fast Parallel Mode?
  • How to ensure cache coherence?
  • How to handle data reuse after zero initialization?

SLIDE 27

  • Two new instructions: memcopy and meminit
  • Similar instructions are present in existing ISAs
  • Microarchitecture implementation
  • Checks if an instruction can be sped up by RowClone
  • Exports the instruction to the memory controller
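One plausible shape for that check (the row size and function name are my assumptions; real hardware would also consult the physical address mapping):

```python
# Hedged sketch: deciding whether a memcopy can be served by RowClone
# instead of an ordinary load/store copy loop.
ROW_BYTES = 8192   # DRAM row size; an assumed value for illustration

def rowclone_eligible(src_addr, dst_addr, size):
    """A copy can use RowClone only if it covers whole, row-aligned rows."""
    return (size % ROW_BYTES == 0
            and src_addr % ROW_BYTES == 0
            and dst_addr % ROW_BYTES == 0)
```

Eligible requests are exported to the memory controller, which issues the in-DRAM copy; anything else takes the normal copy path.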
SLIDE 28

  • RowClone modifies data in memory
  • Need to maintain coherence of cached data
  • Similar to DMA
  • Source and destination in memory
  • Can leverage hardware support for DMA
  • Additional optimizations
SLIDE 29

  • Make the operating system subarray-aware
  • Make primitives amenable to use of FPM
  • Copy-on-Write: allocate the destination in the same subarray as the source; use FPM to copy
  • Bulk Zeroing: use FPM to copy data from the reserved zero row
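A sketch of subarray-aware destination allocation for Copy-on-Write (the data structures and names are mine, not the paper's interface):

```python
# Hedged sketch: a subarray-aware page allocator, so that a CoW copy can
# use the fast intra-subarray mechanism whenever possible.
def allocate_cow_destination(free_pages_by_subarray, src_subarray):
    """Prefer a free page in the source page's subarray; fall back to any."""
    pool = free_pages_by_subarray.get(src_subarray)
    if pool:
        return src_subarray, pool.pop()      # FPM-eligible copy
    for subarray, pages in free_pages_by_subarray.items():
        if pages:
            return subarray, pages.pop()     # falls back to a slower mode
    raise MemoryError("no free pages")
```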

SLIDE 30

  • Data reuse after zero initialization
  • Phase 1: OS zeroes out the page
  • Phase 2: Application uses cachelines of the page
  • RowClone
  • Avoids misses in phase 1
  • But incurs misses in phase 2
  • RowClone-Zero-Insert (RowClone-ZI)
  • Inserts clean zero cache lines, avoiding the misses in phase 2
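A toy model of RowClone-ZI's cache-side action (the dictionary-based cache is my illustration; real hardware installs lines into the on-chip cache):

```python
# Hedged sketch: after an in-DRAM bulk zero (elided here), also install
# clean, all-zero cache lines for the page so phase-2 accesses hit.
LINE_BYTES, PAGE_BYTES = 64, 4096

def rowclone_zi(cache, page_addr):
    """Insert clean zero lines covering the just-zeroed page."""
    for offset in range(0, PAGE_BYTES, LINE_BYTES):
        cache[page_addr + offset] = {"data": bytes(LINE_BYTES),
                                     "dirty": False}
    # Lines are clean: if evicted untouched, no writeback is needed.

cache = {}
rowclone_zi(cache, 0x10000)
```

Because the inserted lines match the zeroed memory, coherence is preserved while the application's first touches of the page hit in the cache.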
SLIDE 31

Introduction DRAM Background RowClone

  • Fast Parallel Mode
  • Pipelined Serial Mode

End-to-end Design

  • Evaluation
SLIDE 32

  • Out-of-order multi-core simulator
  • 1MB/core last-level cache
  • Cycle-accurate DDR3 DRAM simulator
  • 6 copy/initialization-intensive applications (+SPEC CPU2006 for multi-core)

  • Performance
  • Instruction throughput for single-core
  • Weighted Speedup for multi-core
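Weighted speedup, the multi-core metric used here, is the standard sum of each application's IPC in the shared run normalized to its IPC when running alone. A minimal sketch:

```python
# Weighted speedup: sum over applications of IPC(shared) / IPC(alone).
def weighted_speedup(ipc_shared, ipc_alone):
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# e.g. two apps each running at 80% of their standalone IPC:
# weighted_speedup([0.8, 1.6], [1.0, 2.0]) -> 1.6
```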
SLIDE 33

  • System bootup (Booting the Debian OS)
  • Compile (GNU C compiler – executing cc1)
  • Forkbench (A fork microbenchmark)
  • Memcached (Inserting a large number of objects)
  • MySQL (Loading a database)
  • Shell script (find with ls on each subdirectory)
SLIDE 34

[Figure: Stacked bars (0 to 1) of the fraction of memory traffic due to Zero, Copy, Write, and Read for bootup, compile, forkbench, mcached, mysql, and shell.]

SLIDE 35

[Figure: Bars (0% to 70%) of IPC improvement and memory energy reduction compared to the baseline for bootup, compile, forkbench, mcached, mysql, and shell.]

Improvements correlate with the fraction of memory traffic due to copy/initialization.

SLIDE 36

  • Reduced bandwidth consumption benefits all applications.
  • Run copy/initialization-intensive applications with memory-intensive SPEC applications.
  • Half the cores run copy/initialization-intensive applications; the remaining half run SPEC applications.

SLIDE 37

[Figure: Bars (0% to 30%) of system performance and memory energy efficiency improvement over the baseline for 2-core, 4-core, and 8-core systems.]

Performance improvement increases with core count; consistent improvement in energy per instruction.

SLIDE 38

  • Discussion on interleaving and copy granularity
  • Detailed analysis of the fork benchmark
  • Detailed multi-core results and analysis
  • Results with the PSM mode
  • Analysis of RowClone-ZI
  • Comparison to memory-controller-based DMA
SLIDE 39

  • Bulk data copy and initialization
  • Unnecessarily move data on the memory channel
  • Degrade system performance and energy efficiency
  • RowClone – perform copy in DRAM with low cost
  • Uses row buffer to copy large quantity of data
  • Source row → row buffer → destination row
  • 11X lower latency and 74X lower energy for a bulk copy
  • Accelerate Copy-on-Write and Bulk Zeroing
  • Forking, checkpointing, zeroing (security), VM cloning
  • Improves performance and energy efficiency at low cost
  • 27% and 17% for 8-core systems (0.01% chip area overhead)
SLIDE 40

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

  • Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun,
  • G. Pekhimenko, Y. Luo, O. Mutlu,
  • P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

SLIDE 41

SLIDE 42

                                   2-core   4-core   8-core
  # Workloads                        138       50       40
  Weighted Speedup                   15%      20%      27%
  Instruction Throughput             14%      15%      25%
  Harmonic Speedup                   13%      16%      29%
  Max Slowdown Reduction              6%      12%      23%
  Bandwidth/Instruction Reduction    29%      27%      28%
  Energy/Instruction Reduction       19%      17%      17%

SLIDE 43

[Figure: Bars (0.5 to 2.5) of instructions per cycle for Baseline, RowClone, and RowClone-ZI on bootup, compile, forkbench, mcached, mysql, and shell.]

SLIDE 44

[Figure: Bars (0.9 to 1.4) of normalized weighted speedup for Baseline, RowClone, and RowClone-ZI.]

SLIDE 45

[Figure: Curves (0.1 to 0.7) vs. number of pages updated (2 to 16k), for 64 MB and 128 MB working sets.]

SLIDE 46

[Figure: Normalized IPC (0.5 to 2.5) vs. number of pages updated (2 to 16k), for 64 MB and 128 MB working sets.]

SLIDE 47

[Figure: Normalized energy (0.2 to 1.2) vs. number of pages updated (2 to 16k) for Baseline, RowClone-PSM, and RowClone-FPM.]

SLIDE 48

  • Copy engines (Zhao et al. 2005, Jiang et al. 2009)
  • Addresses cache pollution, pipeline stalls due to copy
  • But requires data transfer over the memory channel
  • IRAM (Patterson et al. 1997)
  • Compute + memory using same technology
  • Exploit high DRAM bandwidth
  • Goal: Wider range of SIMD operations
  • High cost
SLIDE 49

  • Copy/Initialization is important
  • But not well known
  • Opportunity to perform in DRAM
  • Not well known
  • This paper: Proof of concept
  • More challenges to be addressed