 
Memristive Memory Processing Unit (mMPU): Real Processing in Memory using Memristors
Nishil Talati, Rotem Ben Hur, Nimrod Wald, Ameer Haj Ali, Ben Perach, Natan Peled, Ronny Ronen and Shahar Kvatinsky
Technion – Israel Institute of Technology
Yale80 in 2019, July 2, 2019
[General] The ASIC² Group
• Emerging technologies: design, simulation, modeling, applications
• Explore computation beyond von Neumann architectures
• Research areas: neuromorphic computing, memory design, simulation tools, efficient processors, hardware security, mixed-signal & RF circuits, cytomorphic electronics, Memristive Memory Processing Unit (mMPU)
[General] Talk Summary in a Single Slide
• A memristor memory cell
• NOR logic gate (MAGIC), crossbar compatible
• SIMD computing in memory (control & data)
• True processing in memory
[Figure: MAGIC NOR gate with inputs IN1 and IN2 and output OUT, driven by V_G, shown standalone and replicated across crossbar rows.]
[Background] The von Neumann Bottleneck
[Figure: CPU vs. memory performance over time, showing the performance gap; bottleneck costs: latency and energy.]
[Figure: the von Neumann machine, a CPU connected to memory (DRAM).]
Operation (16-bit operand, 45 nm)    Energy/Op    Cost (vs. Add)
Add operation                        0.18 pJ      1X
Load from on-chip SRAM               11 pJ        ~60X
Send to off-chip DRAM                640 pJ       ~3600X
Pedram et al., IDT, 2017
Processing In Memory (PIM) – Higher Performance, Lower Energy!
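The cost column follows directly from the energy column. A quick check in Python, using only the three energy figures quoted on the slide (the dictionary keys and the function name are illustrative labels, not from the talk):

```python
# Energy per 16-bit operation at 45 nm, in picojoules (figures from the slide).
ENERGY_PJ = {
    "add": 0.18,
    "load_sram": 11.0,
    "send_dram": 640.0,
}


def cost_vs_add(op: str) -> float:
    """Return the energy of `op` relative to a single add operation."""
    return ENERGY_PJ[op] / ENERGY_PJ["add"]


for op in ENERGY_PJ:
    print(f"{op:10s} {cost_vs_add(op):8.1f}x")
```

The computed ratios land close to the rounded ~60X and ~3600X shown on the slide.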
[Background] The Problem – and the Solution Strategy
Onur Mutlu, July 1, 2019: 62.7% of the total system energy is spent on data movement.
Processing In Memory (PIM) – Higher Performance, Lower Energy!
[Background] In-/Near-/Out-Memory Computing
• OUT: all computations are done outside the memory array → data movement, dedicated processing units
• NEAR: some computations are done outside the memory array → data movement
• IN: all computations are performed in the memory array (real processing in memory)
[Figure: IN-, NEAR-, and OUT-memory organizations, each built around a memristive memory array with peripheral read/write circuits and a controller; labels: processing element, on-chip controller, inputs, results, commands, data movement.]
J. Reuben et al., "Memristive Logic: A Framework for Evaluation and Comparison", PATMOS 2017
[Background] mMPU: Potential Solution to the von Neumann Bottleneck
• Moving from conventional DRAM to memristive memory
• mMPU: performing computation USING the memristive memory cells
[Figure: CPU connected to the mMPU through clock, address, data, and control lines.]
Outline
• Background
• Memristor basics
• Logic using memory cells
• Processing in Memory - the mMPU
• System design using the mMPU
• Conclusions
[Basics] Memristor: The Fourth Basic Element
Examples: Resistive RAM (RRAM), Phase Change Memory (PCM), STT MRAM
L.O. Chua, "Memristor – The Missing Circuit Element," IEEE Trans. on Circuit Theory, 1971
[Basics] Memristor – Memory Resistor
A resistor with varying resistance:
• Low resistive state (R_ON, LRS)
• High resistive state (R_OFF, HRS)
[Figure: current-voltage curve of a memristor; current in one direction decreases the resistance, current in the other direction increases it.]
[Basics] Applications for Memristors
• Logic circuits
• Memory
• Analog circuits
• Neuromorphic computing
• Security
[Basics] Important Memristor Attributes for Logic
• Hysteresis – state is preserved
• Distinct high/low resistance states (HRS/LRS) – binary applications
• Threshold current/voltage for switching
[Figure: current-voltage curve annotated with R_ON, R_OFF, V_th,on, and V_th,off.]
[Basics] Logic Families Using Memristors
IMPLY, MAGIC, MRL, MAD, PINATUBO, Akers
J. Reuben et al., "Memristive Logic: A Framework for Evaluation and Comparison", PATMOS 2017
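[Logic] MAGIC – Memristor Aided loGIC
• R_ON = logic '1', R_OFF = logic '0', with R_OFF >> R_ON
• Step 1: initialize OUT to R_ON
• Step 2: NOR operation
[Figure: input memristors IN1 and IN2 and output memristor OUT driven by V_G; annotations: voltage across OUT << V_G (OUT stays at R_ON) or > V_G/2 (OUT switches to R_OFF).]
IN1  IN2  OUT
 0    0    1
 0    1    0
 1    0    0
 1    1    0
S. Kvatinsky et al., "MAGIC – Memristor Aided LoGIC," TCAS II, Nov. 2014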
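The NOR behavior can be reproduced with a static voltage-divider model: the two input memristors in parallel, in series with the output memristor. The sketch below is a behavioral illustration only; the resistance values, V_G, and the switching threshold are assumed numbers chosen to satisfy R_OFF >> R_ON, and switching dynamics are ignored.

```python
from itertools import product

R_ON = 1e3              # low resistive state, logic '1' (assumed value, ohms)
R_OFF = 1e6             # high resistive state, logic '0' (assumed value, ohms)
V_G = 1.0               # applied gate voltage (assumed value, volts)
V_TH_OFF = 0.45 * V_G   # assumed switching threshold, just below V_G / 2


def resistance(bit: int) -> float:
    """Map a logic value to the resistance that stores it."""
    return R_ON if bit else R_OFF


def magic_nor(in1: int, in2: int) -> int:
    """Static model of one MAGIC NOR evaluation."""
    r_out = R_ON  # step 1: initialize OUT to R_ON (logic '1')
    # Inputs are in parallel; the pair is in series with OUT.
    r_in = 1.0 / (1.0 / resistance(in1) + 1.0 / resistance(in2))
    v_out = V_G * r_out / (r_in + r_out)  # voltage divider across OUT
    if v_out > V_TH_OFF:                  # step 2: OUT switches to R_OFF
        r_out = R_OFF
    return 1 if r_out == R_ON else 0


for a, b in product((0, 1), repeat=2):
    print(a, b, magic_nor(a, b))
```

With both inputs at R_OFF the voltage across OUT is a small fraction of V_G and OUT keeps its initialized '1'; with at least one input at R_ON the voltage across OUT exceeds V_G/2 and OUT switches to '0'.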
[Logic] MAGIC NOR in Memristive Crossbar
• Functionally complete
• Crossbar compatible
[Figure: crossbar row holding IN1, IN2, and OUT, with V_G applied to the input columns.]
[Logic] MAGIC NOR in a Memristive Memory
[Figure: memory array with IN1, IN2, and OUT cells in one row, with V_G applied to the input columns.]
[Logic] Parallel Execution of MAGIC Gates
Efficient SIMD realization
[Figure: the same IN1, IN2, OUT gate placed in several rows, all driven by the same V_G column voltages.]
N. Talati et al., "Logic Design within Memristive Memories using Memristor-Aided loGIC (MAGIC)," TNANO, 2016
[PIM] The Idea: Throughput Improvement
• Map the computation of each element to a single row => a single instance executes in each row, with all rows in parallel
• Implement the desired function as a NOR/NOT sequence (the control pattern)
$\text{Throughput} = \dfrac{\#\text{instances}}{\text{Latency}}$
[PIM] Example: In-Memory Parallel Execution
$\mathrm{OR}(A_i, B_i) \quad \forall i = 1, \dots, N$
* Per element, ignoring initialization
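A minimal sketch of the row-parallel OR, assuming each row holds one (A_i, B_i) pair and that OR is built as NOT(NOR(A, B)). The column layout, the `column_nor` and `parallel_or` names, and the cycle counting are assumptions made for illustration; initialization is ignored, as on the slide.

```python
from typing import List, Tuple

Crossbar = List[List[int]]  # rows x columns of cells; 1 = R_ON, 0 = R_OFF


def column_nor(xbar: Crossbar, in_cols: List[int], out_col: int) -> int:
    """Apply one MAGIC NOR/NOT to every row at once; returns cycles used."""
    for row in xbar:  # the loop models rows that switch simultaneously
        row[out_col] = 0 if any(row[c] for c in in_cols) else 1
    return 1


def parallel_or(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Compute OR(a[i], b[i]) for all i; return (results, cycles)."""
    # Columns: 0 = A, 1 = B, 2 = temporary, 3 = result (pre-initialized to 1).
    xbar = [[ai, bi, 1, 1] for ai, bi in zip(a, b)]
    cycles = column_nor(xbar, [0, 1], 2)  # T = NOR(A, B)
    cycles += column_nor(xbar, [2], 3)    # OUT = NOT(T) = OR(A, B)
    return [row[3] for row in xbar], cycles


result, cycles = parallel_or([0, 0, 1, 1], [0, 1, 0, 1])
throughput = len(result) / cycles  # instances per cycle
print(result, cycles, throughput)
```

The cycle count stays at two however many rows are used, so throughput (#instances divided by latency) grows linearly with the number of rows.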
[PIM] Hierarchy of Logical Functions
From most complex (top) to the basic primitive (bottom):
• Matrix multiplication, Convolution
• POW, SQRT, LOG, DIV, MUL, ADD, SUB
• AND, XOR, OR, COPY, NAND
• MAGIC NOR/NOT
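The lowest levels of the hierarchy can be illustrated by composing the basic gates from NOR alone. This is a plain Boolean sketch, not a device model; NOT is written as a NOR with both inputs tied together, and the NOR counts in the comments belong to this straightforward construction and are not claimed to be optimal.

```python
def nor(a: int, b: int) -> int:
    """The single primitive: output is 1 only when both inputs are 0."""
    return 0 if (a or b) else 1


def not_(a: int) -> int:
    return nor(a, a)                    # 1 NOR


def or_(a: int, b: int) -> int:
    return not_(nor(a, b))              # 2 NORs


def and_(a: int, b: int) -> int:
    return nor(not_(a), not_(b))        # 3 NORs


def nand(a: int, b: int) -> int:
    return not_(and_(a, b))             # 4 NORs


def xor(a: int, b: int) -> int:
    return nor(and_(a, b), nor(a, b))   # 5 NORs


for a in (0, 1):
    for b in (0, 1):
        print(a, b, or_(a, b), and_(a, b), nand(a, b), xor(a, b))
```

Higher levels (ADD, MUL, and so on) are then sequences of these gates, which is why the total NOR count and the number of cells used become the main cost metrics.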
[PIM] Example: Full Adder (1)
$S = A \oplus B \oplus C_{in}$
$C_{out} = A \cdot B + B \cdot C_{in} + A \cdot C_{in}$
• Generate NOR/NOT sequence
• Existing CAD tools do it very well!
• Can reuse interim results to save space!
[Figure: full-adder NOR/NOT schematic with gates numbered 1–12, inputs A, B, C_IN and outputs S, C_OUT.]
Footnotes: (1) Per element, ignoring initialization. (2) Can be done w/ 9 NORs.
R. Ben Hur, et al., "Synthesis and Mapping of Boolean Functions for Memristor Aided Logic (MAGIC)", ICCAD 2017
[PIM] Example: Full Adder (2)
[Figure: full-adder NOR/NOT schematic with gates numbered 1–12, inputs A, B, C_IN and outputs S, C_OUT.]

Naïve column allocation (no cell reuse, 12 cells):
full_adder_12gates.v. in/out+gates = 3/2+10;
 1: T1 = NOT I1        % alloc: R3 = NOT I1
 2: T2 = NOT I2        % alloc: R4 = NOT I2
 3: T5 = NOR2 T1,T2    % alloc: R7 = NOR2 R3,R4
 4: T4 = NOR2 I1,I2    % alloc: R6 = NOR2 I1,I2
 5: T6 = NOR2 T5,T4    % alloc: R8 = NOR2 R7,R6
 6: T8 = NOR2 T6,I3    % alloc: R10 = NOR2 R8,I3
 7: T7 = NOT T6        % alloc: R9 = NOT R8
 8: T3 = NOT I3        % alloc: R5 = NOT I3
 9: T9 = NOR2 T7,T3    % alloc: R11 = NOR2 R9,R5
10: O11 = NOR2 T8,T9   % alloc: R1 = NOR2 R10,R11
11: T10 = NOR2 T5,T9   % alloc: R12 = NOR2 R7,R11
12: O12 = NOT T10      % alloc: R2 = NOT R12

Smart column allocation (best cell reuse, 5 cells):
full_adder_12gates.v. in/out+gates = 3/2+10;
 1: T1 = NOT I1        % alloc: R3 = NOT I1
 2: T2 = NOT I2        % alloc: R1 = NOT I2
 3: T5 = NOR2 T1,T2    % alloc: R2 = NOR2 R3,R1
 4: T4 = NOR2 I1,I2    % alloc: R3 = NOR2 I1,I2
 5: T6 = NOR2 T5,T4    % alloc: R1 = NOR2 R2,R3
 6: T8 = NOR2 T6,I3    % alloc: R3 = NOR2 R1,I3
 7: T7 = NOT T6        % alloc: R4 = NOT R1
 8: T3 = NOT I3        % alloc: R1 = NOT I3
 9: T9 = NOR2 T7,T3    % alloc: R5 = NOR2 R4,R1
10: O11 = NOR2 T8,T9   % alloc: R1 = NOR2 R3,R5
11: T10 = NOR2 T5,T9   % alloc: R3 = NOR2 R2,R5
12: O12 = NOT T10      % alloc: R2 = NOT R3

Smart reuse ➔ column usage reduction (12 cells → 5 cells)!
Per element, ignoring initialization.
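The smart allocation can be checked by replaying its 12 steps on a small dictionary of cells and comparing the two outputs against the full-adder equations for all eight input combinations. This is a logic-level sketch: the `SMART_SEQUENCE` encoding and the `run_row` helper are illustrative, and, like the slide, the model ignores the re-initialization a cell needs before it is reused.

```python
from itertools import product

# (destination cell, source cells); one source means NOT, two mean NOR2.
# I1..I3 are the inputs (A, B, C_in); R1..R5 are the working cells.
SMART_SEQUENCE = [
    ("R3", ("I1",)), ("R1", ("I2",)), ("R2", ("R3", "R1")),
    ("R3", ("I1", "I2")), ("R1", ("R2", "R3")), ("R3", ("R1", "I3")),
    ("R4", ("R1",)), ("R1", ("I3",)), ("R5", ("R4", "R1")),
    ("R1", ("R3", "R5")), ("R3", ("R2", "R5")), ("R2", ("R3",)),
]


def run_row(a: int, b: int, c_in: int):
    """Replay the sequence for one row; returns (S, C_out)."""
    cells = {"I1": a, "I2": b, "I3": c_in}
    for dest, srcs in SMART_SEQUENCE:
        cells[dest] = 0 if any(cells[s] for s in srcs) else 1
    return cells["R1"], cells["R2"]  # O11 ends up in R1, O12 in R2


work_cells = {dest for dest, _ in SMART_SEQUENCE}
for a, b, c in product((0, 1), repeat=3):
    s, c_out = run_row(a, b, c)
    assert s == a ^ b ^ c
    assert c_out == (a & b) | (b & c) | (a & c)
print(len(SMART_SEQUENCE), "gates,", len(work_cells), "work cells")
```

Because every row runs the same sequence, the same 12 steps add N operand pairs stored in N rows, while the reuse keeps the footprint at 5 working cells per row.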
[PIM] Example: N-bit Multiplication
[Figure: row layout with regions for Input 1, Input 2, partial products, a sequence of adders, and the result; width labels N bits, N bits, 3N bits, 12N bits, and 2N bits; area annotations 3700 and 12 × 300. Callouts: one partial product at a time; one adder at a time; requires ~300 < 512 cells for N=16.]
• Showcase PIM implementation of a complex operation
• N=16 ➔ ~3700 NOR operations (O(N²))
• Naïve column allocation: ~3700 cells (O(N²))
• Optimized manual/automatic allocation: ~300 cells! (O(N))
* Per element, ignoring initialization
Ameer Haj-Ali et al., "Efficient Algorithms for In-memory Fixed Point Multiplication Using MAGIC," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018
[System] System Design using the mMPU
[Figure: CPU, mMPU controller, and mMPU, annotated with the design topics: memory design, periphery design, mMPU controller design and optimization, programming model, applications.]
[System] Challenge 1: mMPU Controller μ-architecture
[Figure: mMPU controller and a compute block of size m.]
[System] Challenge 1: Arithmetic/Logical Operations in the mMPU
[Figure: compute block of size m holding operands A, B, and C.]