  1. An Effective Approach to Processing in DRAM
     Jinho Lee, Kiyoung Choi, and Jung Ho Ahn
     Seoul National University

  2. Outline
     • Introduction
     • Our Approach
     • Buffered Compare Architecture
     • Evaluation
     • Summary

  3. Introduction – Memory Wall
     • The number of cores per chip keeps increasing
     • Memory bandwidth is not scaling as fast -> the "memory wall" problem
     • Emerging big-data applications demand even more bandwidth
     • In reality, much of the available bandwidth is being wasted!

  4. Introduction – Table Scan
     • Which items are made of wood?
     • Which items are heavier than 5kg?

       Item#   Material   Weight
       A       Wood       10kg
       B       Metal      1.5kg
       C       Metal      7kg
       D       Stone      3kg
       E       Wood       2kg
       ...

  5. Introduction – Table Scan
     [Figure: conventional scan. ① The data in the table (D0, D1, D2, D3) are read from DRAM, ② the host compares them against the search key, ③ the result is produced]
     • Data are read out and the comparisons are done on the host
     • We only need the result – the bandwidth is wasted!

  6. Introduction – Table Scan
     [Figure: in-memory scan. The host sends only the key; the DRAM compares D0-D3 internally and returns only the result. Reduced traffic, and parallelism from the wider internal bandwidth]
     • Do the compare within the memory
     • Only two transfers are needed instead of many
     • Essentially a PIM (processing-in-memory) approach
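The traffic saving sketched on this slide can be put in rough numbers. The following is an illustrative model of my own, not from the talk: it assumes 64-bit items (matching the 64-bit datapath mentioned later) and a one-bit match result per item.

```python
# Hypothetical traffic model (assumed sizes, not from the slides):
# a host-side scan transfers every item over the off-chip link,
# while an in-DRAM compare transfers only the key and packed result bits.

ITEM_BYTES = 8  # assumed 64-bit items

def host_scan_traffic(n_items):
    """Conventional scan: every item crosses the off-chip link."""
    return n_items * ITEM_BYTES

def in_memory_scan_traffic(n_items):
    """In-memory compare: send the key once, read back one bit per item."""
    key_bytes = ITEM_BYTES
    result_bytes = (n_items + 7) // 8  # 1-bit match result per item
    return key_bytes + result_bytes

n = 1_000_000
print(host_scan_traffic(n) // in_memory_scan_traffic(n))  # 63, roughly the 64:1 item-to-bit ratio
```

Under these assumptions the off-chip traffic shrinks by almost the ratio of the item width to one result bit.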

  7. Introduction – PIM
     • PIM research was active from the late '90s to the early '00s
       – EXECUBE, IRAM, FlexRAM, Smart Memories, Yukon, DIVA, etc.
       – Multiple cores in DRAM
       – Hard to integrate -> not successful
     • Regaining interest due to
       – Big-data workloads
       – Limited improvement in memory bandwidth
       – 3D-stacked memory (HMC, HBM, etc.), which enables integration of cores

  8. Introduction – PIM
     • PIM with 3D-stacked memory
     [Figure: two recent designs. PEI (PIM-enabled instructions) [J. Ahn et al., ISCA 2015]: a host processor with out-of-order cores, L1/L2 and last-level caches, a PIM directory and locality monitor, issuing operations to PCUs in the HMC's DRAM controllers over a crossbar network. Tesseract [J. Ahn et al., ISCA 2015]: in-order cores with list/message-triggered prefetchers, prefetch buffers, and message queues attached to the DRAM, connected by a crossbar network]

  9. Our Approach – DRAM Architecture & Motivation
     [Figure: DRAM chip organization. A chip contains banks; a bank contains mats of 512 x 512 cells with local wordlines, local bitlines, and local sense amps (the row buffer); global row and column decoders, global datalines, and global sense amps (bank I/O) feed an internal shared bus and the chip I/O to the off-chip link]
     • A single chip comprises 8-16 banks
     • To access data, a row in a bank is "activated" and stored in the row buffer
     • A cache line (64B) is fetched in one burst

  10. Our Approach – DRAM Architecture & Motivation
     [Figure: the same chip organization; one bank is activated while the other banks sit idle]
     • Multiple banks are interleaved because activating a row takes a long time
     • A single bank can already saturate the off-chip link bandwidth
     • Thus there is 8x-16x internal bandwidth, most of which is wasted
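The 8x-16x figure follows directly from the bank count. A back-of-the-envelope sketch (the per-chip link bandwidth here is an assumed placeholder, not a number from the talk):

```python
# If one bank can saturate the off-chip link, then with B banks the
# aggregate internal bandwidth is B times the link bandwidth, and the
# fraction unused by off-chip transfers is 1 - 1/B.

banks_per_chip = 16   # slide: 8-16 banks per chip
offchip_gbps = 2.0    # assumed per-chip link bandwidth in GB/s (placeholder)

internal_gbps = banks_per_chip * offchip_gbps
wasted_fraction = 1 - offchip_gbps / internal_gbps

print(internal_gbps)    # 32.0 GB/s of aggregate internal bandwidth
print(wasted_fraction)  # 0.9375 -> about 94% unused by off-chip transfers
```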

  11. Our Approach – DRAM Architecture & Motivation
     [Figure: the same chip organization, with compute logic placed at every bank]
     • Compute inside each bank to utilize the excess internal bandwidth

  12. Our Approach – What to Compute with PIM?
     • We focus only on the 'compare-n-op' pattern over a long range of data
     [Figure: DRAM holding D0, D1, D2, ..., DN, each compared (CMP) against a key]

  13. Our Approach – What to Compute with PIM?
     • Compare-n-read – returns the match result for each item
     [Figure: D0, D1, D2, ..., DN each compared against the key; result: (=, <, =, ..., >)]

  14. Our Approach – What to Compute with PIM?
     • Compare-n-select – returns the min/max among the items
     [Figure: D0, D1, D2, ..., DN compared under Max; result: (D7)]

  15. Our Approach – What to Compute with PIM?
     • Compare-n-increment – increments the values of matching items
     [Figure: keys K0, K1, K2, ..., KN with values V0, V1, V2, ..., VN; a match against key K2 triggers V2++]
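The three primitives above can be modeled functionally. This is my own Python sketch of the semantics the slides describe; the names mirror the slide terms, not a real API:

```python
# Functional model of the three compare-n-op primitives.

def compare_n_read(data, key):
    """Return the match/order result for each item ('=', '<', or '>')."""
    return ['=' if d == key else '<' if d < key else '>' for d in data]

def compare_n_select(data, op=max):
    """Return the min/max among the items (select via the comparator)."""
    return op(data)

def compare_n_increment(kv, key):
    """Increment the value of every item whose key matches."""
    for k in kv:
        if k == key:
            kv[k] += 1
    return kv

print(compare_n_read([3, 5, 3, 9], 3))                # ['=', '>', '=', '>']
print(compare_n_select([3, 5, 3, 9]))                 # 9
print(compare_n_increment({'K0': 0, 'K2': 4}, 'K2'))  # {'K0': 0, 'K2': 5}
```

In the proposed hardware these loops run inside each bank against the row buffer, so only the key goes in and only the (small) result comes out.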

  16. Buffered Compare Architecture
     [Figure: the DRAM chip organization from before, with per-bank additions next to the bank I/O: a key buffer, an arithmetic unit, a result queue, and a command generator (CGEN)]
     • Key buffer: holds a value written by the processor
     • Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
     • Result queue: stores compare results
     • CGEN: repeats the bank-local commands
     • The datapath is 64 bits wide
     • 0.53% overhead in DRAM area
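To make the interaction between the four components concrete, here is a toy per-bank model of my own; widths and timing are simplified, and the method names are illustrative rather than the paper's interface:

```python
# Toy model of one bank's buffered-compare datapath: the key buffer holds
# the operand written by the processor, the arithmetic unit compares it
# against words streamed from the row buffer (CGEN repeating the bank-local
# command), and the result queue collects the outcomes.
from collections import deque

class BufferedCompareBank:
    def __init__(self, rows):
        self.rows = rows             # bank contents: rows of 64-bit words
        self.key_buffer = None       # value written by the processor
        self.result_queue = deque()  # compare results awaiting readout

    def write_key(self, key):
        self.key_buffer = key

    def cmp_read(self, row, count):
        # CGEN repeats this bank-local compare `count` times; the
        # arithmetic unit's operands are the bank I/O and the key buffer.
        for word in self.rows[row][:count]:
            self.result_queue.append(word == self.key_buffer)

    def read_results(self):
        out = list(self.result_queue)
        self.result_queue.clear()
        return out

bank = BufferedCompareBank([[7, 3, 7, 1]])
bank.write_key(7)
bank.cmp_read(row=0, count=4)
print(bank.read_results())  # [True, False, True, False]
```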

  17. Buffered Compare Architecture – Programming Model
     SW code:
       __kernel search(keys[], skey, d[]) {
         int id = get_global_id(0);
         if (keys[id] == skey)
           d[id] = 1;
       }
     Instruction: BC_cmp_read(skey, keys, N)
     DRAM cmd:    CMP_RD(skey, addr, range)
     [Figure: work items (CMP over Item 0 ... Item N-1 of the target data) are mapped through the cores and the memory controller down to the DRAM banks]
     • OpenCL-based programming model
     • Programmers need not be aware of DRAM parameters (page size, number of banks, ...)
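The slide only names the instruction and the DRAM command, so the controller-side expansion below is a sketch under my own assumptions (a row-interleaved address mapping and an assumed row width), showing why the programmer can stay unaware of banks and page size:

```python
# Hypothetical expansion of one BC_cmp_read instruction into per-bank
# CMP_RD commands by the memory controller. The address mapping and
# constants are assumptions for illustration, not from the paper.

WORDS_PER_ROW = 1024  # assumed words per DRAM row
NUM_BANKS = 16        # slide: 16 banks per chip

def bc_cmp_read(skey, base_addr, n_words):
    """Split a range scan into bank-local (CMP_RD, key, bank, row, count) commands."""
    cmds = []
    for start in range(0, n_words, WORDS_PER_ROW):
        word_addr = base_addr + start
        bank = (word_addr // WORDS_PER_ROW) % NUM_BANKS  # row-interleaved mapping
        row = word_addr // (WORDS_PER_ROW * NUM_BANKS)
        count = min(WORDS_PER_ROW, n_words - start)
        cmds.append(('CMP_RD', skey, bank, row, count))
    return cmds

cmds = bc_cmp_read(skey=42, base_addr=0, n_words=3000)
print(len(cmds))  # 3 bank-local commands
print(cmds[0])    # ('CMP_RD', 42, 0, 0, 1024)
```

With row interleaving, consecutive rows of the scanned range land in different banks, so the bank-local compares proceed in parallel.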

  18. Evaluation – Setup
     • McSimA+ simulator
     • Processor
       – 22nm, 16 OoO cores running at 3GHz
       – 16KB private L1
       – 32MB S-NUCA L2
       – Directory-based MESI coherence
     • Memory
       – 28nm, DDR4-2000
       – 4 ranks per channel
       – 16 banks per chip
       – PAR-BS (parallelism-aware batch scheduling)

  19. Evaluation – Setup
     • Six workloads
       – TSC: in-memory linear scan (column-store)
       – TSR: in-memory linear scan (row-store)
       – BT: B+-tree traversal (index scan)
       – MAX: MAX aggregation
       – SA: sequence assembly
       – KV: key-value store
     • BC was evaluated against the baseline and AMO (Active Memory Operation)

  20. Evaluation – Speedup
     • BC performs 3.62x better than the baseline

  21. Evaluation – Energy Reduction
     [Figure: normalized energy, split into processor and memory, for Baseline, AMO, and BC across TSC, TSR, MAX, BT, KV, SA, and the geomean; the y-axis tops out at 2 with one bar clipped at 6.05]
     • Energy consumption reduced by 73.3% on average
       – Processor: 77.2%
       – Memory: 43.9%

  22. Summary
     • We proposed buffered compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM
       – Minimal overhead to the DRAM area
       – Less invasive to existing DDR protocols
       – 3.62x speedup and 73.3% energy reduction
     • Limitations
       – Gives up cache utilization
       – Gives up the critical-word-first policy
       – With x4 devices, operands are limited to 32 bits
