An Effective Approach to Processing in DRAM
Jinho Lee, Kiyoung Choi, and Jung Ho Ahn
Seoul National University
Outline
- Introduction
- Our Approach
- Buffered Compare Architecture
- Evaluation
- Summary
Introduction – Memory Wall
- The number of cores per chip keeps increasing
- Memory bandwidth is not increasing as fast --> the "memory wall" problem
- Emerging big-data applications require even more bandwidth
- In reality, much of the bandwidth is being wasted!
Introduction – Table Scan
- Which items are made of wood?
- Which items are heavier than 5kg?
Item#  Material  Weight
A      Wood      10kg
B      Metal     1.5kg
C      Metal     7kg
D      Stone     3kg
E      Wood      2kg
…
Introduction – Table Scan
- Data are read out and the comparisons are done on the host
- We only need the result – the raw data transfer wastes bandwidth!
[Figure: the host reads D0–D3 from DRAM, compares each against the search key, and keeps only the result]
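The wasteful pattern above can be sketched in a few lines (Python as illustrative pseudocode; the function name is hypothetical): every item crosses the bus even though only a one-bit-per-item result is needed.

```python
def host_table_scan(table, key):
    """Conventional scan: every element of `table` travels from DRAM
    to the host, which then performs the comparisons itself."""
    results = []
    for item in table:              # each read consumes bus bandwidth
        results.append(item == key)
    return results                  # only these bits were actually needed

# Scanning the example table's Material column for "Wood"
matches = host_table_scan(["Wood", "Metal", "Metal", "Stone", "Wood"], "Wood")
```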
Introduction – Table Scan
- Do the compare within the memory
- Only two transfers are needed instead of many
- Essentially a PIM (processing-in-memory) approach
[Figure: the host sends the key to DRAM, the compare runs next to the data, and only the result returns – reduced traffic, plus parallelism over the wider internal bandwidth]
Introduction – PIM
- PIM research was active from the late '90s to the early '00s
  – EXECUBE, IRAM, FlexRAM, Smart Memory, Yukon, DIVA, etc.
  – Multiple cores in DRAM
  – Hard to integrate --> not successful
- Regaining interest due to
  – Big-data workloads
  – Limited improvement of memory bandwidth
  – 3D-stacked memory (HMC, HBM, etc.) enables integration of cores
Introduction – PIM
- PIM with 3D stacked memory
[Figure: two PIM designs built on 3D-stacked memory – Tesseract (J. Ahn et al., ISCA 2015), with in-order cores and prefetch buffers behind each DRAM controller in an HMC, and PEI (PIM-enabled instructions) (J. Ahn et al., ISCA 2015), with PIM compute units managed by a PIM directory and locality monitor]
Our Approach – DRAM Architecture & Motivation
- A single chip comprises 8–16 banks
- When data are accessed, a row in a bank is "activated" and stored in a row buffer
- A cache line (64B) is fetched in one burst
[Figure: DRAM chip organization – each bank has a global row decoder, column decoder, and global sense amps (bank I/O); each mat of 512 x 512 cells has a local row decoder and local sense amps (the row buffer); banks share an internal bus to the chip I/O and the off-chip link; one row per bank can be activated into its row buffer]
Our Approach – DRAM Architecture & Motivation
- Multiple banks are interleaved because activating a row takes a long time
- A single bank can already saturate the off-chip link bandwidth
- Thus there is 8X–16X internal bandwidth, most of which is wasted
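A back-of-the-envelope check of that claim, under the stated (and here purely illustrative) assumption that one activated bank can stream data as fast as the off-chip link can drain it:

```python
# Illustrative arithmetic only; real rates depend on the device and timing.
banks_per_chip = 16        # 8-16 banks per chip, per the slide
link_gbps = 16.0           # assumed off-chip link rate (hypothetical figure)
per_bank_gbps = link_gbps  # assumption: one bank can saturate the link

internal_bw = banks_per_chip * per_bank_gbps  # aggregate internal bandwidth
overprovision = internal_bw / link_gbps       # -> 16X with these numbers
```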
[Figure: the same chip with only one bank activated – the remaining banks sit idle ("Zzz..") while the off-chip link is already full]
Our Approach – DRAM Architecture & Motivation
- Compute inside each bank to utilize the excess bandwidth
[Figure: a compute unit attached to every bank exploits the aggregate internal bandwidth]
Our Approach – What to compute with PIM?
- We focus only on the ‘compare-n-op’ pattern over a long range of data
[Figure: a key is compared (CMP) against items D0…DN inside DRAM]
Our Approach – What to compute with PIM?
- Compare-n-read
  – Returns the match result for each item
[Figure: compare-n-read returns a result vector, e.g. (=, <, =, … , >)]
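Behaviorally, compare-n-read can be modeled as below (a Python sketch, not the hardware interface; in the real design the per-item comparisons run inside each bank and only the compact result vector leaves the chip):

```python
def compare_n_read(data, key):
    """Return one of '=', '<', '>' per item, comparing against the key."""
    def cmp(item):
        if item == key:
            return '='
        return '<' if item < key else '>'
    return [cmp(item) for item in data]   # result vector, one entry per item
```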
Our Approach – What to compute with PIM?
- Compare-n-select
  – Returns the min/max among the items
[Figure: compare-n-select returns only the maximum, e.g. Max: (D7)]
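A behavioral sketch of compare-n-select (again plain Python, not the hardware interface): the bank keeps the current winner in its key buffer and returns only that single value.

```python
def compare_n_select(data):
    """Return the maximum item; the running best models the key buffer."""
    best = data[0]
    for item in data[1:]:
        if item > best:       # the arithmetic unit's compare
            best = item       # key buffer updated with the new winner
    return best
```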
Our Approach – What to compute with PIM?
- Compare-n-increment
  – Increments the values of matching items
[Figure: compare-n-increment matches key K2 among (key, value) pairs (K0,V0)…(KN,VN) and increments its value: K2, V2++]
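Compare-n-increment behaves roughly as follows (Python sketch over (key, value) tuples; in hardware this is a read-modify-write inside the bank, so the table never crosses the bus):

```python
def compare_n_increment(table, key):
    """Increment the value of every pair whose key matches."""
    for i, (k, v) in enumerate(table):
        if k == key:                 # compare against the key buffer
            table[i] = (k, v + 1)    # in-place update of the matching item
    return table
```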
Buffered Compare Architecture
- Key buffer: holds a value written by the processor
- Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
- Result queue: stores compare results
- CGEN: repeats the bank-local commands
- The datapath is 64 bits wide
- 0.53% overhead in DRAM area
[Figure: the added logic – key buffer, arithmetic unit, result queue, and CGEN – sits beside the column decoder and bank I/O of each bank, between the mats and the internal shared bus]
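Putting the four components together, here is a toy software model of one bank's datapath (class and method names are hypothetical; the real unit moves 64-bit words between the bank I/O and the arithmetic unit under CGEN control):

```python
from collections import deque

class BufferedCompareBank:
    """Toy model of the per-bank buffered-compare hardware."""

    def __init__(self):
        self.key_buffer = None       # value written by the processor
        self.result_queue = deque()  # compare results awaiting readout

    def set_key(self, key):
        self.key_buffer = key

    def cmp_read(self, row):
        # CGEN repeats the bank-local compare once per word of the row.
        for word in row:
            self.result_queue.append(word == self.key_buffer)

    def read_results(self):
        # The host drains only this compact queue, not the row itself.
        out = list(self.result_queue)
        self.result_queue.clear()
        return out
```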
Buffered Compare Architecture – Programming Model
- OpenCL-based programming model
- Programmers need not be aware of DRAM parameters (page size, number of banks, …)
[Figure: N work items, one per table item, each performing a CMP; cores issue the operation through the memory controller to the DRAM banks]

The slide maps software down to a DRAM command:

  __kernel search(keys[], skey, d[]) {     // SW code
      int id = get_global_id(0);
      if (keys[id] == skey) d[id] = 1;
  }
  BC_cmp_read(skey, keys, N)               // Instruction
  CMP_RD(skey, addr, range)                // DRAM cmd
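Functionally, the BC_cmp_read step in the middle of that mapping is equivalent to the loop below (a Python sketch of the semantics only; the actual lowering to per-bank CMP_RD commands happens in hardware):

```python
def bc_cmp_read(skey, keys, d):
    """Set d[id] = 1 wherever keys[id] == skey, mirroring the kernel."""
    for i, k in enumerate(keys):     # one 'work item' per table entry
        if k == skey:
            d[i] = 1
    return d
```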
Evaluation – Setup
- McSimA+ simulator
- Processor
  – 22nm, 16 OoO cores running at 3GHz
  – 16KB private L1
  – 32MB S-NUCA L2
  – Directory-based MESI coherence
- Memory
  – 28nm
  – DDR4-2000
  – 4 ranks per channel
  – 16 banks per chip
  – PAR-BS (parallelism-aware batch scheduling)
Evaluation – Setup
- Six workloads
  – TSC: in-memory linear scan (column-store)
  – TSR: in-memory linear scan (row-store)
  – BT: B+ tree traversal (index scan)
  – MAX: MAX aggregation
  – SA: sequence assembly
  – KV: key-value store
- BC was evaluated against the baseline and AMO (Active Memory Operation)
Evaluation – Speedup
- BC performs 3.62 times better than the baseline
Evaluation – Energy Reduction
- Energy consumption reduced by 73.3% on average
  – Processor: 77.2%
  – Memory: 43.9%
[Chart: normalized energy, split into processor and memory portions, for Baseline, BC, and AMO across TSC, TSR, MAX, BT, KV, SA, and the geomean; one bar extends to 6.05]
Summary
- We proposed buffered compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM
  – Minimal overhead in DRAM area
  – Less invasive to existing DDR protocols
  – 3.62X speedup and 73.3% energy reduction
- Limitations
  – Cannot utilize the cache
  – Cannot utilize the critical-word-first policy
  – When using x4 devices, only up to 32-bit operands are supported