An Effective Approach to Processing in DRAM - Jinho Lee, Kiyoung Choi, and Jung Ho Ahn - PowerPoint PPT Presentation



SLIDE 1

An Effective Approach to Processing in DRAM

Jinho Lee, Kiyoung Choi, and Jung Ho Ahn Seoul National University

SLIDE 2

Outline

  • Introduction
  • Our Approach
  • Buffered Compare Architecture
  • Evaluation
  • Summary

SLIDE 3

Introduction – Memory Wall

  • The number of cores in a chip keeps increasing
  • Memory bandwidth is not growing as fast → the "memory wall" problem
  • Emerging big-data applications require even more bandwidth
  • In reality, much of the bandwidth is being wasted!

(Figure: CPU ↔ Memory)

SLIDE 4

Introduction – Table Scan

  • Which items are made of wood?
  • Which items are heavier than 5 kg?

Item#  Material  Weight
A      Wood      10 kg
B      Metal     1.5 kg
C      Metal     7 kg
D      Stone     3 kg
E      Wood      2 kg
…

SLIDE 5

Introduction – Table Scan


  • The data are read out and the comparisons are done on the host
  • Only the result is needed – a waste of bandwidth!

(Figure: ① the host sends the search key to DRAM, ② the table data D0–D3 are transferred to the host, ③ the host compares each item against the key to produce the result)

SLIDE 6

Introduction – Table Scan


  • Do the compare within the memory
  • Only two transfers are needed instead of many
  • Essentially a PIM (processing-in-memory) approach
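The traffic saving can be made concrete with a small host-side counting model. This is our own sketch, not the paper's hardware: the function names and the packed one-bit-per-item result format are assumptions.

```c
#include <stddef.h>

/* Baseline table scan: the host reads every 64-bit item over the link. */
size_t baseline_words_moved(size_t n_items) {
    return n_items;
}

/* In-memory compare: one word carries the key in, and a packed bitmask
 * (bits_per_word items per result word) carries the match results out. */
size_t pim_words_moved(size_t n_items, size_t bits_per_word) {
    return 1 + (n_items + bits_per_word - 1) / bits_per_word;
}
```

For a 1024-item scan with 64-bit result words, the baseline moves 1024 words while the in-memory compare moves 17, matching the "only two transfers instead of many" intuition.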

(Figure: the key is sent into DRAM once, the compare runs next to the data, and only the result is returned: reduced traffic, and parallelism from the wider internal bandwidth)

SLIDE 7

Introduction – PIM

  • PIM research was active from the late '90s to the early '00s
    – EXECUBE, IRAM, FlexRAM, Smart Memory, Yukon, DIVA, etc.
    – Multiple cores in DRAM
    – Hard to integrate → not successful
  • Regaining interest due to
    – Big-data workloads
    – Limited improvement in memory bandwidth
    – 3D-stacked memory (HMC, HBM, etc.) enables integration of cores

SLIDE 8

Introduction – PIM

  • PIM with 3D stacked memory

(Figure: two PIM designs on 3D-stacked memory. Tesseract [J. Ahn et al., ISCA 2015] places an in-order core, message queue, and prefetch buffer at each DRAM partition behind a crossbar network. PEI (PIM-enabled instructions) [J. Ahn et al., ISCA 2015] adds PIM compute units (PCUs) and a PIM management unit (PMU, with a PIM directory and locality monitor) alongside the host's out-of-order cores and HMC controller.)

SLIDE 9

Our Approach – DRAM Architecture & Motivation

  • A single chip comprises 8–16 banks
  • On an access, a row in a bank is "activated" and latched into the row buffer
  • A cache line (64B) is fetched in one burst

(Figure: DRAM chip organization: the off-chip link feeds the chip I/O and an internal shared bus connecting the banks; each bank has a global row decoder, column decoder, and global sense amps (bank I/O), and is built from mats of 512 × 512 cells with local row decoders, local wordlines and bitlines, and local sense amps that form the row buffer)

SLIDE 10

Our Approach – DRAM Architecture & Motivation

  • Multiple banks are interleaved because activating a row takes a long time
  • A single bank can already fill the bandwidth of the off-chip link
  • Thus there is 8×–16× internal bandwidth, most of which is wasted
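The excess-bandwidth arithmetic can be written out directly; a trivial sketch (the helper names are ours):

```c
/* If any one of n_banks banks can saturate the off-chip link, the chip's
 * aggregate internal bandwidth is n_banks times the external bandwidth. */
unsigned internal_bw_ratio(unsigned n_banks) {
    return n_banks;
}

/* Share of internal bandwidth stranded when banks only serve the link,
 * in whole percent (integer arithmetic). */
unsigned wasted_internal_pct(unsigned n_banks) {
    return 100u - 100u / n_banks;
}
```

With 16 banks per chip, roughly 94% of the internal bandwidth goes unused by ordinary reads.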

(Figure: same DRAM chip organization; only one bank is activated while the others sit idle, leaving the internal shared bus underused)

SLIDE 11

Our Approach – DRAM Architecture & Motivation

  • Compute inside each bank to utilize the excess bandwidth

(Figure: same DRAM chip organization, with a compute unit added inside every bank to exploit the excess internal bandwidth)

SLIDE 12

Our Approach – What to compute with PIM?

  • We focus only on the "compare-n-op" pattern over a long range of data

(Figure: a key in DRAM is compared (CMP) against items D0, D1, D2, …, DN)

SLIDE 13

Our Approach – What to compute with PIM?

  • Compare-n-read
    – Returns the match result for each item
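A functional model of compare-n-read may help; this is our sketch of the semantics only (the type and function names are assumptions), producing the per-item relation the slides write as (=, <, =, …, >):

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { CMP_LT = -1, CMP_EQ = 0, CMP_GT = 1 } cmp_t;

/* compare-n-read semantics: compare every item in the range against the
 * buffered key and record the relation for each. */
void compare_n_read(const int64_t *data, size_t n, int64_t key, cmp_t *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = (data[i] < key) ? CMP_LT
               : (data[i] > key) ? CMP_GT
               : CMP_EQ;
}
```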

(Figure: compare-n-read of a key against D0…DN returns a relation per item, e.g. Result: (=, <, =, …, >))

SLIDE 14

Our Approach – What to compute with PIM?

  • Compare-n-select
    – Returns the min/max among the items
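Likewise for compare-n-select, a semantics-only sketch (the names are ours) returning the index of the maximum:

```c
#include <stddef.h>
#include <stdint.h>

/* compare-n-select semantics: keep a running winner while scanning the
 * range, as the arithmetic unit would using the key buffer, and return
 * the index of the maximum (first occurrence on ties). */
size_t compare_n_select_max(const int64_t *data, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (data[i] > data[best])
            best = i;
    return best;
}
```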

(Figure: compare-n-select over D0…DN returns the extreme item, e.g. Max: (D7))

SLIDE 15

Our Approach – What to compute with PIM?

  • Compare-n-increment
    – Increments the values of matching items
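And compare-n-increment, sketched over key-value pairs as in the slide (the struct and names are assumptions):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t key; int64_t value; } kv_t;

/* compare-n-increment semantics: bump the value of every entry whose key
 * matches the buffered key; returns the number of entries updated. */
size_t compare_n_increment(kv_t *table, size_t n, int64_t key) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (table[i].key == key) {
            table[i].value++;
            hits++;
        }
    return hits;
}
```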

(Figure: compare-n-increment of key K2 against key-value pairs (K0,V0)…(KN,VN) updates the matching entry: K2, V2++)

SLIDE 16

Buffered Compare Architecture

  • Key buffer: holds a value written by the processor
  • Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
  • Result queue: stores compare results
  • CGEN: repeats the bank-local commands
  • The datapath is 64 bits wide
  • 0.53% DRAM area overhead
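How these pieces cooperate on one activated row can be sketched in software (our model only; the 8KB row size and all names are assumptions, and the real CGEN issues DRAM column commands rather than running a C loop):

```c
#include <stddef.h>
#include <stdint.h>

#define ROW_BYTES 8192u                            /* assumed DRAM page size */
#define WORDS_PER_ROW (ROW_BYTES / sizeof(uint64_t))

/* One buffered-compare pass over an activated row: CGEN repeats a column
 * access per 64-bit word, the arithmetic unit compares it with the key
 * buffer, and each result lands in the result queue. */
size_t scan_row(const uint64_t *row, uint64_t key_buffer,
                uint8_t *result_queue) {
    size_t matches = 0;
    for (size_t col = 0; col < WORDS_PER_ROW; col++) { /* CGEN repeat */
        uint8_t eq = (row[col] == key_buffer);         /* arithmetic unit */
        result_queue[col] = eq;                        /* result queue */
        matches += eq;
    }
    return matches;
}
```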

(Figure: DRAM chip with the per-bank additions: a key buffer, arithmetic unit, result queue, and CGEN placed beside the column decoder and bank I/O)

SLIDE 17

Buffered Compare Architecture – Programming Model

  • OpenCL-based programming model
  • Programmers need not be aware of DRAM parameters (page size, number of banks, …)

(Figure: work items apply a CMP to each target data item; cores issue the operation through the memory controller to the DRAM banks. The software kernel is lowered to a buffered-compare instruction and then to a DRAM command:)

    // SW code
    __kernel search(keys[], skey, d[]) {
        int id = get_global_id(0);
        if (keys[id] == skey) d[id] = 1;
    }
    // Instruction
    BC_cmp_read(skey, keys, N)
    …
    // DRAM cmd
    CMP_RD(skey, addr, range)

SLIDE 18

Evaluation – Setup

  • McSimA+ simulator
  • Processor
    – 22nm, 16 OoO cores running at 3GHz
    – 16KB private L1
    – 32MB S-NUCA L2
    – Directory-based MESI coherence
  • Memory
    – 28nm
    – DDR4-2000
    – 4 ranks per channel
    – 16 banks per chip
    – PAR-BS (parallelism-aware batch scheduling)

SLIDE 19

Evaluation – Setup

  • Six workloads
    – TSC: in-memory linear scan (column-store)
    – TSR: in-memory linear scan (row-store)
    – BT: B+-tree traversal (index scan)
    – MAX: MAX aggregation
    – SA: sequence assembly
    – KV: key-value store
  • BC was evaluated against the baseline and AMO (Active Memory Operation)

SLIDE 20

Evaluation – Speedup

  • BC performs 3.62 times better than the baseline

SLIDE 21

Evaluation – Energy Reduction

  • Energy consumption is reduced by 73.3% on average
    – Processor energy: 77.2%
    – Memory energy: 43.9%

(Chart: energy normalized to the baseline, split into processor and memory components, for Baseline, BC, and AMO on TSC, TSR, MAX, BT, KV, SA, and the geomean; the y-axis runs 0.5–2, with one bar off the scale at 6.05)

SLIDE 22

Summary

  • We proposed buffered compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM
    – Minimal DRAM area overhead
    – Less invasive to existing DDR protocols
    – 3.62× speedup and 73.3% energy reduction
  • Limitations
    – No utilization of the cache
    – No utilization of the critical-word-first policy
    – With ×4 devices, operands are limited to 32 bits
