Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses
Vivek Seshadri, Thomas Mullins, Amirali Boroumand, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
Executive summary
- Problem: Non-unit strided accesses
– Present in many applications
– Inefficient in cache-line-optimized memory systems
- Our Proposal: Gather-Scatter DRAM
– Gathers/scatters values of a strided access from multiple chips
– Ideal memory bandwidth/cache utilization for power-of-2 strides
– Requires very few changes to the DRAM module
- Results
– In-memory databases: the best of both row store and column store
– Matrix multiplication: eliminates software gather for SIMD optimizations
Strided access pattern
[Figure: an in-memory database table stored in row-store layout; Record 1 through Record n are contiguous in memory, so reading one field (e.g., Field 1 or Field 3) of every record produces a non-unit strided access]
Shortcomings of existing systems
[Figure: a strided access fetches entire cache lines even though only one field per line is useful]
Data is unnecessarily transferred on the memory channel and stored in the on-chip cache.
- High latency
- Wasted bandwidth
- Wasted cache space
- High energy
Prior approaches
Improving efficiency of fine-grained memory accesses
- Impulse Memory Controller (HPCA 1999)
- Adaptive/Dynamic Granularity Memory System (ISCA 2011/12)
Costly in a commodity system:
- Modules that support fine-grained memory accesses
– E.g., mini-rank, threaded-memory module
- Sectored caches
Goal: Eliminate inefficiency
Can we retrieve only the useful data?
Gather-Scatter DRAM (Power-of-2 strides)
DRAM modules have multiple chips
All chips within a “rank” operate in unison!
READ addr Cache Line
? Two Challenges!
Data Cmd/Addr
Challenge 1: Chip conflicts
Data of each cache line is spread across all the chips!
[Figure: cache lines 0 and 1, each spread across all eight chips]
Useful data mapped to only two chips!
Challenge 2: Shared address bus
All chips share the same address bus!
No flexibility for the memory controller to read different addresses from each chip!
One address bus per chip is costly!
Gather-Scatter DRAM
- Challenge 1 (minimizing chip conflicts) → Solution 1: Column-ID-based data shuffling (shuffle the data of each cache line differently)
- Challenge 2 (shared address bus) → Solution 2: Pattern-ID-based in-DRAM address translation (locally compute the column address at each chip)
Column-ID-based data shuffling
[Figure: a cache line passes through a multi-stage shuffling network before being written across chips 0-7; the DRAM column address bits (e.g., 1 0 1) control the stages]
Stage n is enabled only if the n-th LSB of the column ID is set.
(implemented in the memory controller)
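As a sketch of this shuffling network (a toy Python model, not the hardware; it assumes 8 chips with one value per chip, and the function name is hypothetical): stage k, when enabled by the k-th LSB of the column ID, swaps adjacent blocks of 2^k values, so value i ends up on chip i XOR (column ID mod 8).

```python
def shuffle_cache_line(values, col_id):
    """Toy model of the multi-stage shuffling network (one value per chip)."""
    n = len(values)            # number of chips; assumed a power of 2
    out = list(values)
    k = 0
    while (1 << k) < n:
        if (col_id >> k) & 1:  # stage k enabled by the k-th LSB of the column ID
            # Swap adjacent blocks of 2**k values.
            out = [out[i ^ (1 << k)] for i in range(n)]
        k += 1
    return out
```

The net effect is out[chip] == values[chip ^ (col_id % n)]: field f of eight consecutive cache lines (column IDs 0-7) lands on eight distinct chips, which is what minimizes chip conflicts.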
Effect of data shuffling
[Figure: columns 0-3 mapped across chips 0-7, before and after shuffling: chip conflicts before; minimal chip conflicts after]
Can be retrieved in a single command
Gather-Scatter DRAM
- Challenge 1 (minimizing chip conflicts) → Solution 1: Column-ID-based data shuffling (shuffle the data of each cache line differently)
- Challenge 2 (shared address bus) → Solution 2: Pattern-ID-based in-DRAM address translation (locally compute the column address at each chip)
Per-chip column translation logic
READ addr, pattern
[Figure: the controller sends cmd, addr, and pattern over the shared bus; each chip contains column translation logic (CTL)]
The CTL at each chip ANDs the pattern with the chip ID and XORs the result into the broadcast address:
output address = addr XOR (pattern AND chip ID), applied when cmd = READ/WRITE.
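As a one-line sketch of this translation (hypothetical function name, Python for illustration):

```python
def ctl_column_address(addr, pattern, chip_id):
    # Column translation logic: AND the pattern with the chip ID,
    # then XOR the result into the broadcast column address.
    return addr ^ (pattern & chip_id)
```

With pattern 0, every chip reads the same column (the default cache-line read); a non-zero pattern makes different chips read different columns, with no extra address-bus wires.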
Gather-Scatter DRAM (GS-DRAM)
32 values contiguously stored in DRAM (at the start of a DRAM row)
read addr 0, pattern 0 (stride = 1, default operation)
read addr 0, pattern 1 (stride = 2)
read addr 0, pattern 3 (stride = 4)
read addr 0, pattern 7 (stride = 8)
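These commands can be checked against a small end-to-end Python model combining the shuffling and the per-chip translation (a sketch under simplifying assumptions: 8 chips, one value per chip per column, 64 values stored rather than the slide's 32, and gathered values shown sorted; all names are hypothetical):

```python
NUM_CHIPS = 8

def shuffle(values, col_id):
    # Column-ID-based shuffling: value i of column col_id is stored
    # on chip i XOR (col_id mod NUM_CHIPS).
    return [values[chip ^ (col_id % NUM_CHIPS)] for chip in range(NUM_CHIPS)]

def ctl(addr, pattern, chip):
    # Per-chip column translation logic.
    return addr ^ (pattern & chip)

# Column c of the DRAM row holds values 8c .. 8c+7, shuffled across chips.
dram = {c: shuffle(list(range(8 * c, 8 * c + 8)), c) for c in range(8)}

def gs_read(addr, pattern):
    # Each chip independently computes its column and contributes one value.
    return sorted(dram[ctl(addr, pattern, chip)][chip]
                  for chip in range(NUM_CHIPS))

# For power-of-2 strides, pattern = stride - 1:
# gs_read(0, 0) gathers values 0,1,2,...,7   (stride 1, default)
# gs_read(0, 1) gathers values 0,2,4,...,14  (stride 2)
# gs_read(0, 3) gathers values 0,4,8,...,28  (stride 4)
# gs_read(0, 7) gathers values 0,8,16,...,56 (stride 8)
```

Each gather touches each chip exactly once, so a full cache line of useful data comes back in a single command.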
End-to-end system support for GS-DRAM
[Figure: CPU, on-chip cache, and memory controller; the cache tag store is extended with a Pattern ID per cache line]
- New instructions: pattload/pattstore, e.g., pattload reg, addr, patt
- On a GS-DRAM miss for cacheline(addr) with pattern patt, the controller accesses DRAM column(addr) with patt
- Support for coherence of overlapping cache lines
Methodology
- Simulator
– gem5 x86 simulator
– Use the "prefetch" instruction to implement pattern load
– Cache hierarchy: 32KB L1 D/I caches, 2MB shared L2 cache
– Main memory: DDR3-1600, 1 channel, 1 rank, 8 banks
- Energy evaluations
– McPAT + DRAMPower
- Workloads
– In-memory databases
– Matrix multiplication
In-memory databases
Layouts: Row Store, Column Store, GS-DRAM
Workloads: Transactions, Analytics, Hybrid
Workload
- Database
– 1 table with a million records
– Each record = 1 cache line
- Transactions
– Operate on a random record
– Varying number of read-only/write-only/read-write fields
- Analytics
– Sum of one/two columns
- Hybrid
– Transactions thread: random records with 1 read-only, 1 write-only field
– Analytics thread: sum of one column
Transaction throughput and energy
[Chart: transaction throughput (millions/second) and energy (mJ per 10000 transactions) for Row Store, GS-DRAM, and Column Store; a 3X difference is highlighted]
Analytics performance and energy
[Chart: analytics execution time (msec) and energy (mJ) for Row Store, GS-DRAM, and Column Store; a 2X difference is highlighted]
Hybrid Transactions/Analytical Processing
[Chart: hybrid workload results for Row Store, GS-DRAM, and Column Store; analytics execution time (msec) and transaction throughput (millions/second)]
Conclusion
- Problem: Non-unit strided accesses
– Present in many applications
– Inefficient in cache-line-optimized memory systems
- Our Proposal: Gather-Scatter DRAM
– Gathers/scatters values of a strided access from multiple chips
– Ideal memory bandwidth/cache utilization for power-of-2 strides
– Low DRAM cost: logic to perform two bitwise operations per chip
- Results
– In-memory databases: the best of both row store and column store
– Many more applications: scientific computation, key-value stores
Backup
Maintaining Cache Coherence
- Restrict each data structure to only two patterns
– Default pattern
– One additional strided pattern
- Additional invalidations on read-exclusive requests
– Cache controller generates the list of cache lines overlapping with the modified cache line
– Invalidates all overlapping cache lines
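The overlap test can be sketched as follows (a toy Python model with hypothetical names): a gathered cache line occupies one (chip, DRAM column) cell per chip, and two cache lines overlap iff their cell sets intersect.

```python
NUM_CHIPS = 8

def cells(addr, pattern):
    # The (chip, DRAM column) cells occupied by the cache line
    # fetched with this column address and pattern ID.
    return {(chip, addr ^ (pattern & chip)) for chip in range(NUM_CHIPS)}

def overlaps(addr1, patt1, addr2, patt2):
    # True if a write to one line must invalidate the other.
    return bool(cells(addr1, patt1) & cells(addr2, patt2))
```

Restricting each data structure to the default pattern plus one strided pattern keeps the set of candidate lines small enough to enumerate on each read-exclusive request.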
Hybrid Transactions/Analytical Processing
[Chart: hybrid workload with and without prefetching ("w/o Pref." vs. "Pref.") for Row Store, GS-DRAM, and Column Store; transaction throughput (millions/second) and analytics execution time (msec)]
Transactions Results
[Chart: transaction results across read-only/write-only/read-write field configurations: 1-0-1, 2-1-2, 0-2-2, 2-4-2, 5-0-1, 2-0-4, 6-1-2, 4-2-2]