An Effective Approach to Processing in DRAM
Jinho Lee, Kiyoung Choi, and Jung Ho Ahn
Seoul National University
Outline
- Introduction
- Our Approach
- Buffered Compare Architecture
- Evaluation
- Summary
Introduction – Memory Wall
- The number of cores per chip keeps increasing
- Memory bandwidth is not increasing as fast --> the "memory wall" problem
- Emerging big-data applications require even more bandwidth
- In reality, much of the bandwidth is being wasted!
Introduction – Table Scan
- Which items are made of wood?
- Which items are heavier than 5kg?
Item#  Material  Weight
A      Wood      10kg
B      Metal     1.5kg
C      Metal     7kg
D      Stone     3kg
E      Wood      2kg
…
Introduction – Table Scan
- Data are read out and the comparisons are done on the host
- We only need the result – the raw data transfer wastes bandwidth!
[Figure: the host reads D0–D3 from DRAM, compares each against the search key, and keeps only the result]
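The wasteful pattern above can be sketched in a few lines (Python as illustrative pseudocode; the function name is hypothetical): every item crosses the bus even though only a one-bit-per-item result is needed.

```python
def host_table_scan(table, key):
    """Conventional scan: every element of `table` travels from DRAM
    to the host, which then performs the comparisons itself."""
    results = []
    for item in table:              # each read consumes bus bandwidth
        results.append(item == key)
    return results                  # only these bits were actually needed

# Scanning the example table's Material column for "Wood"
matches = host_table_scan(["Wood", "Metal", "Metal", "Stone", "Wood"], "Wood")
```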
Introduction – Table Scan
- Do the compare within the memory
- Only two transfers are needed instead of many
- Essentially a PIM (processing-in-memory) approach
[Figure: the host sends the key to DRAM, the compare runs next to the data, and only the result returns – reduced traffic, plus parallelism over the wider internal bandwidth]
Introduction – PIM
- PIM research was active from the late '90s to the early '00s
  – EXECUBE, IRAM, FlexRAM, Smart Memory, Yukon, DIVA, etc.
  – Multiple cores in DRAM
  – Hard to integrate --> not successful
- Regaining interest due to
  – Big-data workloads
  – Limited improvement of memory bandwidth
  – 3D-stacked memory (HMC, HBM, etc.) enables integration of cores
Introduction – PIM
- PIM with 3D stacked memory
[Figure: two PIM designs built on 3D-stacked memory – Tesseract (J. Ahn et al., ISCA 2015), with in-order cores and prefetch buffers behind each DRAM controller in an HMC, and PEI (PIM-enabled instructions) (J. Ahn et al., ISCA 2015), with PIM compute units managed by a PIM directory and locality monitor]
Our Approach – DRAM Architecture & Motivation
- A single chip comprises 8–16 banks
- When data are accessed, a row in a bank is "activated" and stored in a row buffer
- A cache line (64B) is fetched in one burst
[Figure: DRAM chip organization – each bank has a global row decoder, column decoder, and global sense amps (bank I/O); each mat of 512 x 512 cells has a local row decoder and local sense amps (the row buffer); banks share an internal bus to the chip I/O and the off-chip link; one row per bank can be activated into its row buffer]
Our Approach – DRAM Architecture & Motivation
- Multiple banks are interleaved because activating a row takes a long time
- A single bank can already saturate the off-chip link bandwidth
- Thus there is 8X–16X internal bandwidth, most of which is wasted
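A back-of-the-envelope check of that claim, under the stated (and here purely illustrative) assumption that one activated bank can stream data as fast as the off-chip link can drain it:

```python
# Illustrative arithmetic only; real rates depend on the device and timing.
banks_per_chip = 16        # 8-16 banks per chip, per the slide
link_gbps = 16.0           # assumed off-chip link rate (hypothetical figure)
per_bank_gbps = link_gbps  # assumption: one bank can saturate the link

internal_bw = banks_per_chip * per_bank_gbps  # aggregate internal bandwidth
overprovision = internal_bw / link_gbps       # -> 16X with these numbers
```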
[Figure: the same chip with only one bank activated – the remaining banks sit idle ("Zzz..") while the off-chip link is already full]
Our Approach – DRAM Architecture & Motivation
- Compute inside each bank to utilize the excess bandwidth
[Figure: a compute unit attached to every bank exploits the aggregate internal bandwidth]
Our Approach – What to compute with PIM?
- We focus only on the ‘compare-n-op’ pattern over a long range of data
[Figure: a key is compared (CMP) against items D0…DN inside DRAM]
Our Approach – What to compute with PIM?
- Compare-n-read
  – Returns the match result for each item
[Figure: compare-n-read returns a result vector, e.g. (=, <, =, … , >)]
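Behaviorally, compare-n-read can be modeled as below (a Python sketch, not the hardware interface; in the real design the per-item comparisons run inside each bank and only the compact result vector leaves the chip):

```python
def compare_n_read(data, key):
    """Return one of '=', '<', '>' per item, comparing against the key."""
    def cmp(item):
        if item == key:
            return '='
        return '<' if item < key else '>'
    return [cmp(item) for item in data]   # result vector, one entry per item
```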
Our Approach – What to compute with PIM?
- Compare-n-select
  – Returns the min/max among the items
[Figure: compare-n-select returns only the maximum, e.g. Max: (D7)]
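A behavioral sketch of compare-n-select (again plain Python, not the hardware interface): the bank keeps the current winner in its key buffer and returns only that single value.

```python
def compare_n_select(data):
    """Return the maximum item; the running best models the key buffer."""
    best = data[0]
    for item in data[1:]:
        if item > best:       # the arithmetic unit's compare
            best = item       # key buffer updated with the new winner
    return best
```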
Our Approach – What to compute with PIM?
- Compare-n-increment
  – Increments the values of matching items
[Figure: compare-n-increment matches key K2 among (key, value) pairs (K0,V0)…(KN,VN) and increments its value: K2, V2++]
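Compare-n-increment behaves roughly as follows (Python sketch over (key, value) tuples; in hardware this is a read-modify-write inside the bank, so the table never crosses the bus):

```python
def compare_n_increment(table, key):
    """Increment the value of every pair whose key matches."""
    for i, (k, v) in enumerate(table):
        if k == key:                 # compare against the key buffer
            table[i] = (k, v + 1)    # in-place update of the matching item
    return table
```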
Buffered Compare Architecture
- Key buffer: holds a value written by the processor
- Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
- Result queue: stores compare results
- CGEN: repeats the bank-local commands
- The datapath is 64 bits wide
- 0.53% overhead in DRAM area
[Figure: the added logic – key buffer, arithmetic unit, result queue, and CGEN – sits beside the column decoder and bank I/O of each bank, between the mats and the internal shared bus]
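Putting the four components together, here is a toy software model of one bank's datapath (class and method names are hypothetical; the real unit moves 64-bit words between the bank I/O and the arithmetic unit under CGEN control):

```python
from collections import deque

class BufferedCompareBank:
    """Toy model of the per-bank buffered-compare hardware."""

    def __init__(self):
        self.key_buffer = None       # value written by the processor
        self.result_queue = deque()  # compare results awaiting readout

    def set_key(self, key):
        self.key_buffer = key

    def cmp_read(self, row):
        # CGEN repeats the bank-local compare once per word of the row.
        for word in row:
            self.result_queue.append(word == self.key_buffer)

    def read_results(self):
        # The host drains only this compact queue, not the row itself.
        out = list(self.result_queue)
        self.result_queue.clear()
        return out
```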
Buffered Compare Architecture – Programming Model
- OpenCL-based programming model
- Programmers need not be aware of DRAM parameters (page size, number of banks, …)
[Figure: N work items, one per table item, each performing a CMP; cores issue the operation through the memory controller to the DRAM banks]

The slide maps software down to a DRAM command:

  __kernel search(keys[], skey, d[]) {     // SW code
      int id = get_global_id(0);
      if (keys[id] == skey) d[id] = 1;
  }
  BC_cmp_read(skey, keys, N)               // Instruction
  CMP_RD(skey, addr, range)                // DRAM cmd
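Functionally, the BC_cmp_read step in the middle of that mapping is equivalent to the loop below (a Python sketch of the semantics only; the actual lowering to per-bank CMP_RD commands happens in hardware):

```python
def bc_cmp_read(skey, keys, d):
    """Set d[id] = 1 wherever keys[id] == skey, mirroring the kernel."""
    for i, k in enumerate(keys):     # one 'work item' per table entry
        if k == skey:
            d[i] = 1
    return d
```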
Evaluation – Setup
- McSimA+ simulator
- Processor
  – 22nm, 16 OoO cores running at 3GHz
  – 16KB private L1
  – 32MB S-NUCA L2
  – Directory-based MESI coherence
- Memory
  – 28nm
  – DDR4-2000
  – 4 ranks per channel
  – 16 banks per chip
  – PAR-BS (parallelism-aware batch scheduling)
Evaluation – Setup
- Six workloads
  – TSC: in-memory linear scan (column-store)
  – TSR: in-memory linear scan (row-store)
  – BT: B+ tree traversal (index scan)
  – MAX: MAX aggregation
  – SA: sequence assembly
  – KV: key-value store
- BC was evaluated against the baseline and AMO (Active Memory Operation)
Evaluation – Speedup
- BC performs 3.62 times better than the baseline
Evaluation – Energy Reduction
- Energy consumption reduced by 73.3% on average
  – Processor: 77.2%
  – Memory: 43.9%
[Chart: normalized energy, split into processor and memory portions, for Baseline, BC, and AMO across TSC, TSR, MAX, BT, KV, SA, and the geomean; one bar extends to 6.05]
Summary
- We proposed buffered compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM
  – Minimal overhead in DRAM area
  – Less invasive to existing DDR protocols
  – 3.62X speedup and 73.3% energy reduction
- Limitations
  – Cannot utilize the cache
  – Cannot utilize the critical-word-first policy
  – When using x4 devices, only up to 32-bit operands are supported