Real Processing in Memory using Memristors

SLIDE 1

Memristive Memory Processing Unit (mMPU)

Real Processing in Memory using Memristors

Nishil Talati, Rotem Ben Hur, Nimrod Wald, Ameer Haj Ali, Ben Perach, Natan Peled, Ronny Ronen and Shahar Kvatinsky
Technion – Israel Institute of Technology
Yale80 in 2019, July 2, 2019

SLIDE 2

The ASIC2 Group

  • Simulation tools
  • Memory design
  • Memristive Memory Processing Unit (mMPU)
  • Efficient processors
  • Neuromorphic computing
  • Mixed signal & RF circuits
  • Cytomorphic electronics
  • Hardware security

  • Emerging technologies: Design, Simulation, Modeling, Applications
  • Explore computation beyond von Neumann architectures
SLIDE 3

Talk Summary in a Single Slide

  • A memristor memory cell
  • NOR logic gate (MAGIC), crossbar compatible
  • SIMD computing in memory: control & data
  • True Processing in Memory

[Figure: memristive crossbar with VG applied; each row holds its own IN1, IN2, and OUT cells]

SLIDE 4

The von Neumann Bottleneck

[Figure: the von Neumann machine, a CPU connected to memory (DRAM); plot of CPU and memory performance over time, showing the performance gap]

Data movement costs latency and energy.

Pedram et al., IDT, 2017

Operation (16-bit operand)    Energy/Op (45 nm)    Cost (vs. Add)
Add operation                 0.18 pJ              1X
Load from on-chip SRAM        11 pJ                ~60X
Send to off-chip DRAM         640 pJ               ~3600X

Processing In Memory (PIM) – Higher Performance, Lower Energy!
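The cost column follows directly from the energy figures quoted on this slide. The short sketch below recomputes the ratios; the dictionary keys and the function name `cost_vs_add` are illustrative names, not taken from the talk.

```python
# Energy per 16-bit operation at 45 nm, in picojoules (figures quoted on the slide).
ENERGY_PJ = {
    "add": 0.18,
    "load_sram": 11.0,
    "send_dram": 640.0,
}

def cost_vs_add(op: str) -> float:
    """Return how many add operations consume the same energy as one `op`."""
    return ENERGY_PJ[op] / ENERGY_PJ["add"]

for op in ENERGY_PJ:
    print(f"{op:10s} {cost_vs_add(op):8.0f}x")
```

The computed ratios land close to the rounded ~60X and ~3600X figures, which is the quantitative motivation for avoiding off-chip transfers.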

SLIDE 5

The Problem – and the Solution Strategy

Onur Mutlu – July 1, 2019

Processing In Memory (PIM) – Higher Performance, Lower Energy!


62.7% of the total system energy is spent on data movement

SLIDE 6

In-/Near-/Out- Memory Computing

  • OUT: All computations are done out of the memory array → data movement, dedicated processing units
  • NEAR: Some computations are done out of the memory array → data movement
  • IN: All computations are performed in the memory array

[Figure: block diagrams of the IN-, NEAR-, and OUT-memory organizations. Labels: on-chip controller, memristive memory array, peripheral circuit, processing element, controller, commands, data movement, read inputs, write results]

IN: Real Processing in Memory

  • J. Reuben et al., "Memristive Logic: A Framework for Evaluation and Comparison", PATMOS 2017
SLIDE 7

mMPU: Potential Solution to the von Neumann Bottleneck

  • mMPU: performing computation USING the memristive memory cells
  • Moving from conventional DRAM to memristive memory

[Figure: CPU connected to mMPUs through clock, address, data, and control lines]
SLIDE 8

Outline

  • Background
  • Memristor basics
  • Logic using memory cells
  • Processing in Memory - the mMPU
  • System design using the mMPU
  • Conclusions

SLIDE 9

Memristor: The Fourth Basic Element

L.O. Chua, “Memristor – The Missing Circuit Element,” IEEE Trans. on Circuit Theory, 1971

  • STT MRAM
  • Resistive RAM (RRAM)
  • Phase Change Memory (PCM)
SLIDE 10

Memristor – Memory Resistor

Resistor with Varying Resistance

  • Current in one direction decreases the resistance; current in the opposite direction increases it
  • Low resistive state: RON (LRS)
  • High resistive state: ROFF (HRS)

[Figure: memristor symbol and its current-voltage characteristic]

SLIDE 11

Applications for Memristors

  • Logic circuits
  • Security
  • Neuromorphic computing
  • Analog circuits
  • Memory
SLIDE 12

Important Memristors Attributes for Logic

  • Hysteresis – state is preserved
  • Distinct high/low resistance states (HRS/LRS) – binary applications
  • Threshold current/voltage for switching

[Figure: current-voltage curve marking RON, ROFF, Vth,off, and Vth,on]
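The three attributes can be captured in a very small behavioral model. The sketch below is an idealization written for illustration only: the class name, the resistance values, the threshold values, and the polarity convention (positive voltage switches toward ROFF) are all assumptions, not device data from the talk.

```python
from dataclasses import dataclass

@dataclass
class BinaryMemristor:
    """Idealized two-state memristor with switching thresholds (all values hypothetical)."""
    r_on: float = 1e3       # low resistive state (LRS), ohms
    r_off: float = 1e6      # high resistive state (HRS), ohms
    v_th_on: float = -1.0   # assumed: voltages at or below this switch the device to RON
    v_th_off: float = 1.0   # assumed: voltages at or above this switch the device to ROFF
    resistance: float = 1e6

    def apply(self, voltage: float) -> float:
        """Apply a voltage; the state changes only if a threshold is crossed."""
        if voltage >= self.v_th_off:
            self.resistance = self.r_off
        elif voltage <= self.v_th_on:
            self.resistance = self.r_on
        return self.resistance

m = BinaryMemristor()
m.apply(-1.5)   # crosses Vth,on: the device switches to RON
m.apply(0.5)    # between the thresholds: the state is preserved
print(m.resistance)
```

Sub-threshold voltages leave the state untouched, which is what allows a cell to be read, or to serve as a gate input, without being disturbed.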

SLIDE 13

Logic Families Using Memristors

  • IMPLY
  • MAGIC
  • MRL
  • MAD
  • PINATUBO
  • Akers

  • J. Reuben et al., "Memristive Logic: A Framework for Evaluation and Comparison", PATMOS 2017
SLIDE 14

Outline

  • Background
  • Memristor basics
  • Logic using memory cells
  • Processing in Memory - the mMPU
  • System design using the mMPU
  • Conclusions

SLIDE 15

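MAGIC – Memristor Aided loGIC

NOR Operation

  • RON = Logic '1', ROFF = Logic '0', with ROFF >> RON
  • Initialize OUT to RON
  • Apply VG to the input memristors IN1 and IN2
  • The voltage across OUT is << VG when both inputs are ROFF, so OUT stays at RON
  • The voltage across OUT is > VG/2 when at least one input is RON, so OUT switches to ROFF

  • S. Kvatinsky et al., "MAGIC – Memristor Aided LoGIC," TCAS II, Nov. 2014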
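The NOR behavior can be checked with a simple voltage-divider calculation. This is a minimal sketch that assumes the usual MAGIC arrangement (the two input memristors in parallel, in series with the output memristor) and ideal two-state devices; the resistance values, VG, and the threshold `V_TH_OFF` are hypothetical numbers chosen only so that ROFF >> RON and the threshold sits below VG/2.

```python
R_ON, R_OFF = 1e3, 1e6   # hypothetical resistances with ROFF >> RON
V_G = 1.0                # hypothetical gate voltage
V_TH_OFF = 0.4 * V_G     # hypothetical switching threshold, chosen below VG/2

def parallel(resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)

def magic_nor(in1: int, in2: int) -> int:
    """Evaluate a two-input MAGIC NOR gate; logic 1 is RON, logic 0 is ROFF."""
    r_inputs = parallel([R_ON if bit else R_OFF for bit in (in1, in2)])
    r_out = R_ON  # OUT is initialized to RON (logic 1) before evaluation
    v_out = V_G * r_out / (r_inputs + r_out)  # voltage divider across OUT
    # OUT switches to ROFF (logic 0) only if its voltage exceeds the threshold.
    return 0 if v_out > V_TH_OFF else 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, magic_nor(a, b))
```

With both inputs at ROFF almost all of VG drops across the inputs, so OUT keeps its initialized value; any input at RON pushes the voltage across OUT above VG/2 and flips it.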

SLIDE 16

MAGIC NOR in Memristive Crossbar

  • Crossbar compatible
  • Functionally complete

[Figure: crossbar with VG applied; cells IN1, IN2, and OUT]

SLIDE 17

MAGIC NOR in a Memristive Memory

[Figure: memristive memory array with VG applied; cells IN1, IN2, and OUT]

SLIDE 18

Parallel Execution of MAGIC Gates

[Figure: crossbar with VG applied; several rows, each with its own IN1, IN2, and OUT cells]

Efficient SIMD Realization

  • N. Talati et al., “Logic Design within Memristive Memories using Memristor-Aided loGIC (MAGIC),” TNANO, 2016

SLIDE 19

Outline

  • Background
  • Memristor basics
  • Logic using memory cells
  • Processing in Memory - the mMPU
  • System design using the mMPU
  • Conclusions

SLIDE 20

The Idea: Throughput Improvement

  • Map the computation of each element to a single row
    => Execution of a single instance in each row, all rows in parallel
  • Implement the desired function using a NOR/NOT sequence

[Figure: control pattern (NOR sequence) applied to the array]

$\mathrm{Throughput} = \dfrac{\#\,\mathrm{instances}}{\mathrm{Latency}}$
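As an illustration of the throughput relation, the sketch below models a row-parallel OR built from a NOR step followed by a NOT step. It is a plain software model under stated assumptions: one list element stands for one memory row, every gate step costs one cycle for all rows at once, and initialization cycles are ignored. The function names are illustrative.

```python
def nor(a: int, b: int) -> int:
    return int(not (a or b))

def row_parallel_or(a_col, b_col):
    """Compute OR(a_i, b_i) for every row using a NOR step followed by a NOT step."""
    # Cycle 1: the same NOR control pattern is applied to all rows at once.
    tmp_col = [nor(a, b) for a, b in zip(a_col, b_col)]
    # Cycle 2: NOT is a one-input NOR, again applied to all rows at once.
    out_col = [nor(t, t) for t in tmp_col]
    cycles = 2  # independent of the number of rows; initialization is ignored
    return out_col, cycles

a_col = [0, 0, 1, 1]
b_col = [0, 1, 0, 1]
out_col, cycles = row_parallel_or(a_col, b_col)
throughput = len(out_col) / cycles  # instances per cycle
print(out_col, throughput)
```

Because the cycle count does not depend on the number of rows, throughput in this model grows linearly with the number of instances stored in the array.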

SLIDE 21

Example: In-Memory Parallel Execution

$\mathrm{OR}(A_i, B_i) \quad \forall i = 1, \dots, N$

SLIDE 22

Example: In-Memory Parallel Execution

$\mathrm{OR}(A_i, B_i) \quad \forall i = 1, \dots, N$


* Per element, ignoring initialization

SLIDE 23

Hierarchy of Logical Functions

[Figure: hierarchy of functions built on MAGIC NOR/NOT. Functions shown: COPY, AND, NAND, OR, XOR, ADD, SUB, MUL, DIV, SQRT, POW, LOG, convolution, matrix multiplication]

SLIDE 24

Example: Full Adder (1)

$S = A \oplus B \oplus C_{in}$
$C_{out} = A \cdot B + B \cdot C_{in} + A \cdot C_{in}$

[Figure: full-adder block with inputs A, B, CIN and outputs S, COUT]

  • Generate NOR/NOT sequence
  • Existing CAD tools do it very well!
  • Can reuse interim results to save space!

[Figure: NOR/NOT gate network of the full adder with inputs CIN, A, B and outputs COUT, S; gates numbered 1 to 12]

  • R. Ben Hur, et al., "Synthesis and Mapping of Boolean Functions for Memristor Aided Logic (MAGIC)", ICCAD 2017

1 Per element, ignoring initialization
2 Can be done w/ 9 NORs

SLIDE 25

Example: Full Adder (2)

Naïve column allocation (no cell reuse): 12 cells

full_adder_12gates.v. in/out+gates = 3/2+10;
 1: T1  = NOT I1          % alloc: R3  = NOT I1
 2: T2  = NOT I2          % alloc: R4  = NOT I2
 3: T5  = NOR2 T1,T2      % alloc: R7  = NOR2 R3,R4
 4: T4  = NOR2 I1,I2      % alloc: R6  = NOR2 I1,I2
 5: T6  = NOR2 T5,T4      % alloc: R8  = NOR2 R7,R6
 6: T8  = NOR2 T6,I3      % alloc: R10 = NOR2 R8,I3
 7: T7  = NOT T6          % alloc: R9  = NOT R8
 8: T3  = NOT I3          % alloc: R5  = NOT I3
 9: T9  = NOR2 T7,T3      % alloc: R11 = NOR2 R9,R5
10: O11 = NOR2 T8,T9      % alloc: R1  = NOR2 R10,R11
11: T10 = NOR2 T5,T9      % alloc: R12 = NOR2 R7,R11
12: O12 = NOT T10         % alloc: R2  = NOT R12

Smart column allocation (best cell reuse): 5 cells

full_adder_12gates.v. in/out+gates = 3/2+10;
 1: T1  = NOT I1          % alloc: R3 = NOT I1
 2: T2  = NOT I2          % alloc: R1 = NOT I2
 3: T5  = NOR2 T1,T2      % alloc: R2 = NOR2 R3,R1
 4: T4  = NOR2 I1,I2      % alloc: R3 = NOR2 I1,I2
 5: T6  = NOR2 T5,T4      % alloc: R1 = NOR2 R2,R3
 6: T8  = NOR2 T6,I3      % alloc: R3 = NOR2 R1,I3
 7: T7  = NOT T6          % alloc: R4 = NOT R1
 8: T3  = NOT I3          % alloc: R1 = NOT I3
 9: T9  = NOR2 T7,T3      % alloc: R5 = NOR2 R4,R1
10: O11 = NOR2 T8,T9      % alloc: R1 = NOR2 R3,R5
11: T10 = NOR2 T5,T9      % alloc: R3 = NOR2 R2,R5
12: O12 = NOT T10         % alloc: R2 = NOT R3

[Figure: NOR/NOT gate network of the full adder with inputs CIN, A, B and outputs COUT, S; gates numbered 1 to 12]

Smart Reuse ➔ Column usage reduction! (12 cells vs. 5 cells)

Per element, ignoring initialization
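Both allocations can be checked by executing them on a small register file. The sketch below transcribes the two `% alloc` columns of the listing as (destination, sources) pairs, treats NOT as a one-input NOR, and, like the slide, ignores cell initialization. The helper names `run` and `cells_used` are illustrative.

```python
from itertools import product

def nor(*bits):
    return int(not any(bits))

# Each step is (destination cell, source cells); NOT is a one-input NOR.
NAIVE = [("R3", ["I1"]), ("R4", ["I2"]), ("R7", ["R3", "R4"]), ("R6", ["I1", "I2"]),
         ("R8", ["R7", "R6"]), ("R10", ["R8", "I3"]), ("R9", ["R8"]), ("R5", ["I3"]),
         ("R11", ["R9", "R5"]), ("R1", ["R10", "R11"]), ("R12", ["R7", "R11"]),
         ("R2", ["R12"])]
SMART = [("R3", ["I1"]), ("R1", ["I2"]), ("R2", ["R3", "R1"]), ("R3", ["I1", "I2"]),
         ("R1", ["R2", "R3"]), ("R3", ["R1", "I3"]), ("R4", ["R1"]), ("R1", ["I3"]),
         ("R5", ["R4", "R1"]), ("R1", ["R3", "R5"]), ("R3", ["R2", "R5"]),
         ("R2", ["R3"])]

def run(program, a, b, cin):
    """Execute an allocation and return (S, COUT), which end up in R1 and R2."""
    cells = {"I1": a, "I2": b, "I3": cin}
    for dest, srcs in program:
        cells[dest] = nor(*(cells[s] for s in srcs))
    return cells["R1"], cells["R2"]

def cells_used(program):
    """Count the distinct work cells written by the program."""
    return len({dest for dest, _ in program})

for a, b, cin in product((0, 1), repeat=3):
    expected = ((a + b + cin) % 2, (a + b + cin) // 2)
    assert run(NAIVE, a, b, cin) == expected
    assert run(SMART, a, b, cin) == expected

print(cells_used(NAIVE), cells_used(SMART))
```

Both programs produce the sum in R1 and the carry in R2 for all eight input combinations; they differ only in how many distinct work cells they touch.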

SLIDE 26

Example: N-bit Multiplication

  • One adder at a time
  • One partial product at a time

[Figure: row layout with fields labeled Input 1, Input 2, Result, Sequence of adders, and Partial products; field widths shown: N bits, N bits, 3N bits, 12N bits, 2N bits]

Requires ~300 < 512 cells for N=16

Area: 3700 vs. 300 cells (12×)

Ameer Haj-Ali et al., “Efficient Algorithms for In-memory Fixed Point Multiplication Using MAGIC,” 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018

  • Showcase PIM implementation of a complex operation
  • N=16 ➔ ~3700 NOR operations (O(N²))
  • Naïve column allocation: ~3700 cells (O(N²))
  • Optimized manual/automatic allocation: ~300 cells! (O(N))


* Per element, ignoring initialization
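The ~300-cell figure is consistent with the field widths shown on the slide, if one assumes that the five widths (N, N, 3N, 12N, 2N) together make up the whole per-row footprint. That reading of the figure is an assumption, and the function name below is illustrative.

```python
def cells_per_row(n_bits: int) -> int:
    """Sum the field widths shown on the slide: N + N + 3N + 12N + 2N (assumed to be the full row)."""
    return sum(k * n_bits for k in (1, 1, 3, 12, 2))

N = 16
optimized = cells_per_row(N)   # 19 * N cells under the assumption above
naive = 3700                   # approximate figure quoted on the slide
print(optimized, optimized < 512, round(naive / 300, 1))
```

Under this assumption the footprint grows as 19N, which matches the O(N) claim and fits in a 512-column row for N=16.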

SLIDE 27

Outline

  • Background
  • Memristor basics
  • Logic using memory cells
  • Processing in Memory - the mMPU
  • System design using the mMPU
  • Conclusions

SLIDE 28

System Design using the mMPU

  • Applications
  • Programming Model
  • mMPU Controller Design and Optimization
  • Periphery Design
  • Memory Design

[Figure: CPU, mMPU, and mMPU controller]

SLIDE 29

Challenge 1: mMPU Controller μ-architecture

[Figure: mMPU controller block diagram, including a compute block]

SLIDE 30

Challenge 1: Arithmetic/Logical Operations in the mMPU

[Figure: mMPU controller with its compute block; operands A, B, and C in the memory array]

SLIDE 31

Challenge 2: Code Generation and Storage allocation

  • Inputs: logic function (.blif / .pla, or a Verilog module: module ckt( . . . endmodule) and the array size
  • Synthesis tool (off-the-shelf), using a customized standard cell library (.genlib) with GATE inv … and GATE nor2 … entries
  • Synthesis output: minimized NOR and NOT netlist (.v)
  • Map/Opt tool (custom): maps the netlist to a single row
  • Result: spatially independent execution sequence in a single row

[Figure: successive states of a 10-column row (column numbers 1 to 10) for a full adder. The row holds A, B, Cin and gate outputs g1 to g7; cells are re-initialized to 1 and reused for g8, g9, and g10, until S and Cout are in the row]

  • (adapted from) R. Ben Hur, et al., "Synthesis and Mapping of Boolean Functions for Memristor Aided Logic (MAGIC)", ICCAD 2017
  • R. Ben Hur, R. Ronen, et al., "SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput", submitted

SLIDE 32

Challenge 2a: Mapping into a Limited-size Array

*DAG: Directed Acyclic Graph

  • Netlist ➔ DAG*
  • Traverse (DFS) – similar to register allocation
  • The execution order influences:
      • How many cells are enough for the execution
      • How many initialization cycles are necessary
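The register-allocation analogy can be sketched as a liveness-based cell allocator. The code below is not the authors' mapping tool: it takes an execution order as given (the DFS traversal that chooses the order is not shown), frees a cell as soon as its value is dead, and ignores initialization cycles. The netlist is the full-adder NOR/NOT sequence used elsewhere in the talk, and all function and variable names are hypothetical.

```python
import heapq

# Full-adder NOR/NOT netlist in execution order: (output wire, input wires).
NETLIST = [("w1", ["A"]), ("w2", ["B"]), ("w5", ["w1", "w2"]), ("w4", ["A", "B"]),
           ("w6", ["w5", "w4"]), ("w8", ["w6", "Cin"]), ("w7", ["w6"]), ("w3", ["Cin"]),
           ("w9", ["w7", "w3"]), ("S", ["w8", "w9"]), ("w10", ["w5", "w9"]),
           ("Cout", ["w10"])]
PRIMARY_INPUTS = {"A", "B", "Cin"}
OUTPUTS = {"S", "Cout"}

def allocate(netlist):
    """Assign each gate output to a cell, recycling cells whose value is dead."""
    last_use = {}
    for step, (_, srcs) in enumerate(netlist):
        for s in srcs:
            last_use[s] = step
    free, cell_of, next_cell = [], {}, 1
    for step, (out, srcs) in enumerate(netlist):
        # The output cell must differ from the gate's inputs, so allocate first.
        if free:
            cell_of[out] = heapq.heappop(free)
        else:
            cell_of[out] = next_cell
            next_cell += 1
        # Release cells of intermediate wires that are never read again.
        for s in set(srcs):
            if s not in PRIMARY_INPUTS and s not in OUTPUTS and last_use[s] == step:
                heapq.heappush(free, cell_of[s])
    return cell_of, next_cell - 1

cell_of, n_cells = allocate(NETLIST)
print(n_cells, cell_of["S"], cell_of["Cout"])
```

A different execution order changes when values die, and therefore both the number of cells needed and how often freed cells have to be re-initialized before reuse.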
SLIDE 33

Challenge 2a: Mapping into a Limited-size Array

Minimum: 8 columns (cells) → 16 cycles for execution of N instances

Cycle  Netlist               Physical Allocation
1      w1 = NOT A            R2 = NOT I1
2      w2 = NOT B            R1 = NOT I2
3      w5 = NOR2 w1,w2       R4 = NOR2 R2,R1
4      w4 = NOR2 A,B         R3 = NOR2 I1,I2
5      w6 = NOR2 w5,w4       R5 = NOR2 R4,R3
6      **ReUsing 3 registers = 2 1 3
7      w8 = NOR2 w6,Cin      R2 = NOR2 R5,I3
8      w7 = NOT w6           R1 = NOT R5
9      w3 = NOT Cin          R3 = NOT I3
10     **ReUsing 1 registers = 5
11     w9 = NOR2 w7,w3       R5 = NOR2 R1,R3
12     **ReUsing 2 registers = 1 3
13     S = NOR2 w8,w9        R1 = NOR2 R2,R5
14     w10 = NOR2 w5,w9      R3 = NOR2 R4,R5
15     **ReUsing 3 registers = 2 4 5
16     Cout = NOT w10        R2 = NOT R3
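The role of the four "ReUsing" cycles can be seen by simulating the schedule with a cell model in which a gate can only switch its output from 1 to 0. That model is an assumption made for this sketch (it follows from OUT being initialized to RON before a MAGIC NOR); the `INIT` marker and the function name are illustrative.

```python
from itertools import product

INIT = "init"
# The 16-cycle schedule from the table: gate steps and cell re-initialization steps.
SCHEDULE = [
    ("R2", ["I1"]), ("R1", ["I2"]), ("R4", ["R2", "R1"]), ("R3", ["I1", "I2"]),
    ("R5", ["R4", "R3"]), (INIT, ["R2", "R1", "R3"]),
    ("R2", ["R5", "I3"]), ("R1", ["R5"]), ("R3", ["I3"]), (INIT, ["R5"]),
    ("R5", ["R1", "R3"]), (INIT, ["R1", "R3"]),
    ("R1", ["R2", "R5"]), ("R3", ["R4", "R5"]), (INIT, ["R2", "R4", "R5"]),
    ("R2", ["R3"]),
]

def run(a, b, cin):
    """Run the schedule and return (S, Cout, number of columns used)."""
    # Work cells start initialized to logic 1 (RON).
    cells = {"I1": a, "I2": b, "I3": cin, **{f"R{i}": 1 for i in range(1, 6)}}
    for dest, srcs in SCHEDULE:
        if dest == INIT:
            for cell in srcs:
                cells[cell] = 1
        else:
            # Assumed model: a gate can only switch its output from 1 to 0,
            # so a cell that was not re-initialized would keep a stale 0.
            cells[dest] = cells[dest] & int(not any(cells[s] for s in srcs))
    return cells["R1"], cells["R2"], len(cells)

for a, b, cin in product((0, 1), repeat=3):
    s, cout, columns = run(a, b, cin)
    assert (s, cout) == ((a + b + cin) % 2, (a + b + cin) // 2)

print(len(SCHEDULE), columns)
```

The schedule uses three input columns plus five work cells, and its 16 cycles are the 12 gate steps plus the 4 re-initialization steps.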

SLIDE 34

Challenge 2a: Latency/Area Relative to Unrestricted Area (No Reuse)

[Figure: bar chart over the EPFL benchmarks. Y-axis: relative latency and relative area, 0% to 140%. Series: area at minimum area; area at minimum area + max{5%,10}; #cycles at minimum area; #cycles at minimum area + max{5%,10}]

  • Area decreases by 5.8x, at the cost of a 6.2% latency degradation
  • When area decreases by 5.4x, the latency overhead reduces to 2.3%

SLIDE 35

Challenge 3: Data Movement

[Figure: memory hierarchy. Labels: DIMM, rank, chip, bank, internal bus, row decoder, bank I/O, MAT, subarray, WL, BL, memristive cell]

Processing-in-memory instruction: PIM_ADD #Op1, #Op2, #Out

Relative locations of op1 and op2:

  • Same MAT, aligned
  • Same MAT, not aligned
  • Different MATs
  • Different banks
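A controller has to tell these four cases apart before it can schedule a PIM instruction. The sketch below shows one way to express that classification; the `Operand` fields and the definition of "aligned" (both operands start at the same row of the MAT) are assumptions made for illustration, not the address format of the mMPU.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operand:
    """Hypothetical physical location of a PIM operand."""
    bank: int
    mat: int
    row: int  # first row occupied by the operand

def relative_location(op1: Operand, op2: Operand) -> str:
    """Classify two operands into the four cases listed on the slide."""
    if op1.bank != op2.bank:
        return "different banks"
    if op1.mat != op2.mat:
        return "different MATs"
    # Assumption: "aligned" means both operands start at the same row of the MAT.
    if op1.row == op2.row:
        return "same MAT, aligned"
    return "same MAT, not aligned"

print(relative_location(Operand(0, 3, 16), Operand(0, 3, 16)))
print(relative_location(Operand(0, 3, 16), Operand(1, 3, 16)))
```

Only the first case can execute in place; each of the others implies some data transfer before the operation.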

SLIDE 36

Challenge 3: Modes of Data Transfer

  • Intra-MAT data transfer
  • Intra-bank data transfer
  • Inter-bank data transfer

[Figure: the three transfer modes. Intra-MAT labels: MAT, PN, SN. Intra-bank labels: row decoder, MAT1,1 to MATn,m, sense amps, Ctrl, GBL, bank I/O, R, W, limited B/W. Inter-bank labels: bank, bank I/O, IB, R, W, chip internal bus]

  • N. Talati et al., "Practical Challenges in Delivering the Promises of Real Processing-in-Memory Machines," DATE 2018

SLIDE 37

More mMPU Research Challenges

  • Defining the mMPU operation
  • Instructions; Micro-Operation Format
  • Control Storage
  • Optimized compilation of mMPU “program”
  • Passing info from the mMPU to the runtime systems
  • Data mapping & alignment
  • “Where is address A+1?”
  • Which variables stay in the same MAT?
  • Cost of data alignment
  • Memory controller to support PIM
  • Understanding the electrical limitations
  • Rigorous evaluation of mMPU benefits
  • Endurance aspects

….

SLIDE 38

Conclusions

  • Data transfer → the von Neumann bottleneck
  • MAGIC – the scalable building block for processing in memory

  • mMPU – a real processing-in-memory machine
  • Requires a new computing paradigm
  • Our group aims to develop a working end-to-end mMPU system

[Figure: CPU alongside the mMPU (memory + processing)]

SLIDE 39

Thanks!