Real Processing in Memory using Memristors

Memristive Memory Processing Unit (mMPU): Real Processing in Memory using Memristors
Nishil Talati, Rotem Ben Hur, Nimrod Wald, Ameer Haj Ali, Ben Perach, Natan Peled, Ronny Ronen and Shahar Kvatinsky
Technion – Israel Institute of Technology
Yale80 in 2019, July 2, 2019

General: The ASIC² Group
• Emerging technologies: design, simulation, modeling, applications
• Explore computation beyond von Neumann architectures
• Research areas: neuromorphic computing, memory design, simulation tools, efficient processors, Memristive Memory Processing Unit (mMPU), hardware security, mixed-signal & RF circuits, cytomorphic electronics

General: Talk Summary in a Single Slide
• A memristor memory cell
• NOR logic gate (MAGIC), crossbar compatible
• SIMD computing in memory
• Control & data: true processing in memory
[Figure: MAGIC NOR gates mapped onto crossbar rows, with cells IN1, IN2, OUT and V_G voltage labels]

Background: The von Neumann Bottleneck
• The von Neumann machine: CPU and memory (DRAM)
• Bottleneck in both latency and energy

Operation (16-bit operand) | Energy/Op (45nm) | Cost (vs. Add)
Add operation              | 0.18 pJ          | 1X
Load from on-chip SRAM     | 11 pJ            | ~60X
Send to off-chip DRAM      | 640 pJ           | ~3600X
Pedram et al., IDT, 2017

[Figure: CPU and memory performance over time, showing the widening performance gap]
Processing In Memory (PIM) – Higher Performance, Lower Energy!
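The "Cost (vs. Add)" figures on this slide follow directly from the energy figures. A minimal sketch of that arithmetic, using only the three values quoted from Pedram et al.; the dictionary keys and the function name are illustrative, not from the talk:

```python
# Energy per 16-bit operation at 45 nm, in picojoules (values quoted on the slide).
ENERGY_PJ = {
    "add": 0.18,
    "load_on_chip_sram": 11.0,
    "send_off_chip_dram": 640.0,
}

def cost_vs_add(energy_pj):
    """Return each operation's energy relative to a single add."""
    base = energy_pj["add"]
    return {op: e / base for op, e in energy_pj.items()}

ratios = cost_vs_add(ENERGY_PJ)
for op, ratio in ratios.items():
    print(f"{op}: {ratio:.0f}x the energy of an add")
```

The exact quotients are about 61 and about 3556, which the slide rounds to ~60X and ~3600X.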

Background: The Problem – and the Solution Strategy
• 62.7% of the total system energy is spent on data movement (Onur Mutlu, July 1, 2019)
• Processing In Memory (PIM) – Higher Performance, Lower Energy!

Background: In-/Near-/Out- Memory Computing
• OUT: All computations are done outside the memory array → data movement, dedicated processing units
• NEAR: Some computations are done outside the memory array → data movement
• IN: All computations are performed in the memory array (real processing in memory)
[Figure: IN-, NEAR-, and OUT- organizations side by side, each with a memristive memory array, a peripheral circuit, and an on-chip controller; labels mark the processing element, read/write, inputs/results, controller commands, and data movement]
J. Reuben et al., "Memristive Logic: A Framework for Evaluation and Comparison", PATMOS 2017

Background: mMPU: Potential Solution to the von Neumann Bottleneck
• Moving from conventional DRAM to memristive memory
• mMPU: performing computation USING the memristive memory cells
[Figure: CPU connected to mMPUs through clock, address, data, and control lines]

Outline
• Background
• Memristor basics
• Logic using memory cells
• Processing in Memory - the mMPU
• System design using the mMPU
• Conclusions

Basics: Memristor: The Fourth Basic Element
• Resistive RAM (RRAM)
• Phase Change Memory (PCM)
• STT MRAM
L.O. Chua, "Memristor – The Missing Circuit Element," IEEE Trans. on Circuit Theory, 1971

Basics: Memristor – Memory Resistor
• Resistor with varying resistance
• Low resistive state (R_ON, LRS)
• High resistive state (R_OFF, HRS)
[Figure: memristor symbol and current-voltage curve; current in one direction decreases the resistance, current in the other direction increases it]

Basics: Applications for Memristors
• Logic circuits
• Memory
• Analog circuits
• Neuromorphic computing
• Security

Basics: Important Memristor Attributes for Logic
• Hysteresis – state is preserved
• Distinct high/low resistance states (HRS/LRS) – binary applications
• Threshold current/voltage for switching
[Figure: current-voltage curve marked with R_ON, R_OFF, V_th,on, and V_th,off]

Basics: Logic Families Using Memristors
• IMPLY
• MAGIC
• MRL
• MAD
• PINATUBO
• Akers
J. Reuben et al., "Memristive Logic: A Framework for Evaluation and Comparison", PATMOS 2017

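Logic: MAGIC – Memristor Aided loGIC
• Initialize OUT to R_ON, then perform the NOR operation
• R_ON = Logic '1', R_OFF = Logic '0'
• R_OFF >> R_ON

IN1 | IN2 | OUT
 0  |  0  |  1
 0  |  1  |  0
 1  |  0  |  0
 1  |  1  |  0

[Figure: MAGIC NOR circuit with two input memristors and one output memristor driven by V_G; voltage annotations "<< V_G" and "> V_G/2"; resistance states R_ON and R_OFF marked on the cells]
S. Kvatinsky et al., "MAGIC – Memristor Aided LoGIC," TCAS II, Nov. 2014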
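The truth table on this slide can be reproduced with a simple resistive-divider model: the two inputs in parallel, in series with an OUT cell that starts at R_ON. This is a minimal sketch and not the authors' device model. The resistance values, the gate voltage, and the switching threshold `V_TH_OFF` are all assumed numbers chosen only so that R_OFF >> R_ON.

```python
R_ON = 1e3       # assumed low-resistance state (logic 1), in ohms
R_OFF = 1e6      # assumed high-resistance state (logic 0), in ohms
V_G = 1.0        # assumed gate voltage, in volts
V_TH_OFF = 0.4   # assumed threshold for switching OUT from R_ON to R_OFF, in volts

def resistance(bit):
    """Map a logic value to a memristor resistance."""
    return R_ON if bit else R_OFF

def magic_nor(in1, in2):
    """Evaluate one MAGIC NOR gate; returns (logic output, voltage across OUT)."""
    # The two input memristors are modeled in parallel.
    r_in = 1.0 / (1.0 / resistance(in1) + 1.0 / resistance(in2))
    r_out = R_ON                          # OUT is initialized to R_ON (logic 1)
    v_out = V_G * r_out / (r_in + r_out)  # voltage divider across OUT
    switched = v_out > V_TH_OFF           # above threshold, OUT switches to R_OFF
    return (0 if switched else 1), v_out

for a in (0, 1):
    for b in (0, 1):
        out, v = magic_nor(a, b)
        print(f"IN1={a} IN2={b} -> OUT={out} (V_out = {v:.3f} V)")
```

With these assumed values, the voltage across OUT is far below V_G when both inputs are R_OFF and slightly above V_G/2 as soon as one input is R_ON, which matches the two voltage annotations on the slide.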

Logic: MAGIC NOR in Memristive Crossbar
• Functionally complete
• Crossbar compatible
[Figure: crossbar with cells IN1, IN2, and OUT in one row and two V_G voltage labels]

Logic: MAGIC NOR in a Memristive Memory
[Figure: memory array with cells IN1, IN2, and OUT in one row and two V_G voltage labels]

Logic: Parallel Execution of MAGIC Gates
• Efficient SIMD realization
[Figure: several crossbar rows, each holding IN1, IN2, and OUT cells, driven by the same V_G voltages]
N. Talati et al., "Logic Design within Memristive Memories using Memristor-Aided loGIC (MAGIC)," TNANO, 2016
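The SIMD idea on this slide can be sketched as one NOR applied to the same columns of every row. This is a behavioral model only: the crossbar is a list of bit rows, and the helper names `init_column` and `nor_columns` are hypothetical, not part of any mMPU interface.

```python
def init_column(xbar, col, value=1):
    """Initialize one column in every row (OUT cells start at logic 1, i.e. R_ON)."""
    for row in xbar:
        row[col] = value

def nor_columns(xbar, in1, in2, out):
    """Apply NOR(in1, in2) -> out on every row; models one parallel step."""
    init_column(xbar, out, 1)
    # In hardware all rows switch at once; this loop is only a software model.
    for row in xbar:
        if row[in1] or row[in2]:
            row[out] = 0

# Four rows, three columns per row: [IN1, IN2, OUT]
xbar = [[0, 0, 0],
        [0, 1, 0],
        [1, 0, 0],
        [1, 1, 0]]
nor_columns(xbar, in1=0, in2=1, out=2)
print([row[2] for row in xbar])
```

Each row holds an independent instance, so one column-level operation produces as many results as there are rows.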

PIM: The Idea: Throughput Improvement
• Place the computation for each element in a single row => one instance executes in each row, all rows in parallel
• Implement the desired function using a NOR/NOT sequence
• Control pattern (NOR sequence)
$\text{Throughput} = \dfrac{\#\text{instances}}{\text{Latency}}$

PIM: Example: In-Memory Parallel Execution
$\mathrm{OR}(A_i, B_i) \quad \forall i = 1, \dots, N$
* Per element, ignoring initialization
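A minimal sketch of this example, assuming OR is realized as a NOR followed by a NOT (one common decomposition; the slide does not spell out the sequence). The step count ignores initialization, as the footnote does, and throughput is taken as instances divided by latency in steps. All function names are illustrative.

```python
def nor(a, b):
    return int(not (a or b))

def or_in_rows(a_bits, b_bits):
    """Compute OR(A_i, B_i) for every row with a two-step NOR, NOT sequence."""
    # Step 1: NOR on every row in parallel.
    tmp = [nor(a, b) for a, b in zip(a_bits, b_bits)]
    # Step 2: NOT on every row in parallel, modeled as NOR of a value with itself.
    out = [nor(t, t) for t in tmp]
    return out, 2  # two steps, independent of the number of rows

def throughput(n_instances, latency_steps):
    """Instances completed per step."""
    return n_instances / latency_steps

A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
result, steps = or_in_rows(A, B)
print(result, steps, throughput(len(A), steps))
```

Because the step count does not grow with N, throughput under this model scales linearly with the number of rows.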

PIM: Hierarchy of Logical Functions
From the highest level to the lowest:
• Matrix multiplication, Convolution
• POW, SQRT, LOG, DIV, MUL, ADD, SUB
• AND, XOR, OR, COPY, NAND
• MAGIC NOR/NOT
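A minimal sketch of how the lower levels of this hierarchy can be composed from NOR alone, up to an N-bit ADD. The decompositions below are standard Boolean identities, not necessarily the gate sequences the authors use, and the function names are illustrative.

```python
def nor(a, b):
    return int(not (a or b))

# Basic gates, each built only from NOR.
def not_(a):     return nor(a, a)
def or_(a, b):   return not_(nor(a, b))
def and_(a, b):  return nor(not_(a), not_(b))
def nand(a, b):  return not_(and_(a, b))
def xor(a, b):   return nor(and_(a, b), nor(a, b))
def copy(a):     return not_(not_(a))

def add(x, y, n_bits):
    """Ripple-carry ADD of two n_bits-wide integers using only the gates above."""
    result, carry = 0, 0
    for i in range(n_bits):
        a, b = (x >> i) & 1, (y >> i) & 1
        p = xor(a, b)
        result |= xor(p, carry) << i
        carry = or_(and_(a, b), and_(p, carry))
    return result | (carry << n_bits)  # final carry becomes bit n_bits

print(add(5, 3, 4), add(15, 1, 4))
```

Higher levels such as MUL or matrix multiplication can be layered on top in the same way, at the cost of longer NOR sequences.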

PIM: Example: Full Adder (1)
$S = A \oplus B \oplus C_{in}$
$C_{out} = A \cdot B + B \cdot C_{in} + A \cdot C_{in}$
• Generate NOR/NOT sequence
• Existing CAD tools do it very well!
• Can reuse interim results to save space!
1 Per element, ignoring initialization
2 Can be done w/ 9 NORs
[Figure: full-adder gate network with inputs A, B, C_IN, outputs S, C_OUT, and gates numbered 1 to 12]
R. Ben Hur, et al., "Synthesis and Mapping of Boolean Functions for Memristor Aided Logic (MAGIC)", ICCAD 2017
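The two full-adder equations on this slide can be checked against plain binary addition. A short sketch with an illustrative function name; it uses Python's bitwise operators rather than NOR gates and only confirms what the equations compute.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Sum and carry-out as given by the slide's equations."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (b & c_in) | (a & c_in)
    return s, c_out

for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    # The pair (c_out, s) is the 2-bit binary encoding of a + b + c_in.
    assert a + b + c_in == s + 2 * c_out
    print(a, b, c_in, "->", s, c_out)
```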

PIM: Example: Full Adder (2)

Naïve column allocation (no cell reuse, 12 cells):
    full_adder_12gates.v. in/out+gates = 3/2+10;
     1: T1  = NOT I1        % alloc: R3  = NOT I1
     2: T2  = NOT I2        % alloc: R4  = NOT I2
     3: T5  = NOR2 T1,T2    % alloc: R7  = NOR2 R3,R4
     4: T4  = NOR2 I1,I2    % alloc: R6  = NOR2 I1,I2
     5: T6  = NOR2 T5,T4    % alloc: R8  = NOR2 R7,R6
     6: T8  = NOR2 T6,I3    % alloc: R10 = NOR2 R8,I3
     7: T7  = NOT T6        % alloc: R9  = NOT R8
     8: T3  = NOT I3        % alloc: R5  = NOT I3
     9: T9  = NOR2 T7,T3    % alloc: R11 = NOR2 R9,R5
    10: O11 = NOR2 T8,T9    % alloc: R1  = NOR2 R10,R11
    11: T10 = NOR2 T5,T9    % alloc: R12 = NOR2 R7,R11
    12: O12 = NOT T10       % alloc: R2  = NOT R12

Smart column allocation (best cell reuse, 5 cells):
    full_adder_12gates.v. in/out+gates = 3/2+10;
     1: T1  = NOT I1        % alloc: R3 = NOT I1
     2: T2  = NOT I2        % alloc: R1 = NOT I2
     3: T5  = NOR2 T1,T2    % alloc: R2 = NOR2 R3,R1
     4: T4  = NOR2 I1,I2    % alloc: R3 = NOR2 I1,I2
     5: T6  = NOR2 T5,T4    % alloc: R1 = NOR2 R2,R3
     6: T8  = NOR2 T6,I3    % alloc: R3 = NOR2 R1,I3
     7: T7  = NOT T6        % alloc: R4 = NOT R1
     8: T3  = NOT I3        % alloc: R1 = NOT I3
     9: T9  = NOR2 T7,T3    % alloc: R5 = NOR2 R4,R1
    10: O11 = NOR2 T8,T9    % alloc: R1 = NOR2 R3,R5
    11: T10 = NOR2 T5,T9    % alloc: R3 = NOR2 R2,R5
    12: O12 = NOT T10       % alloc: R2 = NOT R3

Smart reuse ➔ column usage reduction! (12 cells → 5 cells)
Per element, ignoring initialization
[Figure: full-adder gate network with inputs A, B, C_IN, outputs S, C_OUT, and gates numbered 1 to 12]
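Both allocations on this slide can be executed as small programs over named cells to confirm that they compute the same full adder and to count the cells they touch. This is a minimal sketch: the step lists are transcribed from the "alloc" column, with I1, I2, I3 read as A, B, C_IN and the outputs O11 and O12 (cells R1 and R2) read as S and C_OUT. The helper names `run` and `cells_used` are illustrative.

```python
from itertools import product

# Each step is (destination cell, source cells); a single source means NOT.
NAIVE = [
    ("R3", ["I1"]), ("R4", ["I2"]), ("R7", ["R3", "R4"]), ("R6", ["I1", "I2"]),
    ("R8", ["R7", "R6"]), ("R10", ["R8", "I3"]), ("R9", ["R8"]), ("R5", ["I3"]),
    ("R11", ["R9", "R5"]), ("R1", ["R10", "R11"]), ("R12", ["R7", "R11"]),
    ("R2", ["R12"]),
]
SMART = [
    ("R3", ["I1"]), ("R1", ["I2"]), ("R2", ["R3", "R1"]), ("R3", ["I1", "I2"]),
    ("R1", ["R2", "R3"]), ("R3", ["R1", "I3"]), ("R4", ["R1"]), ("R1", ["I3"]),
    ("R5", ["R4", "R1"]), ("R1", ["R3", "R5"]), ("R3", ["R2", "R5"]),
    ("R2", ["R3"]),
]

def run(program, a, b, c_in):
    """Execute a NOR/NOT sequence; returns (S, C_OUT) read from cells R1 and R2."""
    cells = {"I1": a, "I2": b, "I3": c_in}
    for dst, srcs in program:
        cells[dst] = int(not any(cells[s] for s in srcs))  # NOT is a one-input NOR
    return cells["R1"], cells["R2"]

def cells_used(program):
    """Number of distinct destination cells the sequence writes."""
    return len({dst for dst, _ in program})

for a, b, c_in in product((0, 1), repeat=3):
    total = a + b + c_in
    expected = (total & 1, total >> 1)
    assert run(NAIVE, a, b, c_in) == expected
    assert run(SMART, a, b, c_in) == expected

print(cells_used(NAIVE), cells_used(SMART))
```

In the smart listing a cell is overwritten only after its previous value has been consumed by every later step that needs it, which is why the same 12 gates fit in 5 cells.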

PIM: Example: N-bit Multiplication
• Showcase PIM implementation of a complex operation
• N=16 ➔ ~3700 NOR operations (O(N²))
• Naïve column allocation: ~3700 cells (O(N²))
• Optimized manual/automatic allocation: ~300 cells! (O(N))
* Per element, ignoring initialization
[Figure: row layout with Input 1 (N bits), Input 2 (N bits), partial products (3N bits), sequence of adders (12N bits), and Result (2N bits); one partial product at a time, one adder at a time; requires ~300 < 512 cells for N=16; area labels 3700 and 12 × 300]
Ameer Haj-Ali et al., "Efficient Algorithms for In-memory Fixed Point Multiplication Using MAGIC," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018
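The "~300 cells for N=16" figure is consistent with simply adding up the region widths labeled on this slide. A small sketch of that arithmetic follows. The pairing of each width with a region (N, N, 3N, 12N, 2N) is a reading of the flattened figure labels and should be treated as an assumption, as should the 512-cell row width taken from the "~300<512" label.

```python
ROW_WIDTH = 512  # assumed cells per crossbar row, from the "~300<512" label

def cells_per_row(n_bits):
    """Sum the per-row regions for an n_bits x n_bits multiplication (assumed layout)."""
    layout = {
        "input_1": n_bits,
        "input_2": n_bits,
        "partial_products": 3 * n_bits,  # one partial product at a time
        "adders": 12 * n_bits,           # one adder at a time
        "result": 2 * n_bits,
    }
    return sum(layout.values()), layout

total, layout = cells_per_row(16)
print(total, total < ROW_WIDTH)
```

Under this reading the total is 19N cells, which grows linearly with N, in line with the O(N) claim for the optimized allocation.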

System: System Design using the mMPU
• Memory design
• Periphery
• CPU design
• mMPU controller design and optimization
• mMPU programming model
• Applications
[Figure: system stack centered on the mMPU controller]

System: Challenge 1: mMPU Controller μ-architecture
[Figure: compute block]

System: Challenge 1: Arithmetic/Logical Operations in the mMPU
[Figure: compute block with operands A, B, and C]
