Processing Architecture with Memory Channel Network Mohammad Alian 1 - PowerPoint PPT Presentation

Application-Transparent Near-Memory Processing Architecture with Memory Channel Network Mohammad Alian 1 , Seung Won Min 1 , Hadi Asgharimoghaddam 1 , Ashutosh Dhar 1 , Dong Kai Wang 1 , Thomas Roewer 2 , Adam McPadden 2 , Oliver O'Halloran 2 , Deming Chen 1 , Jinjun Xiong 2 , Daehoon Kim 1 , Wen-mei Hwu 1 , and Nam Sung Kim 1,3 1 University of Illinois Urbana-Champaign 2 IBM Research and Systems 3 Samsung Electronics

2 Executive Summary ry • Processing In Memory (PIM), Near Memory Processing (NMP), … ✓ EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, … • Question : why haven’t they been commercialized yet? ✓ Demand changes in application code and/or memory subsystem of host processor 48MB Memory CPU+ Vector Unit 3M $ Qaud 48MB Memory Processing Tile or DRAM Block Qaud Network IRAM’97 SmartMemories’00 ISCA’15

3 Executive Summary ry • Processing In Memory (PIM), Near Memory Processing (NMP), … ✓ EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, … • Question : why haven’t they been commercialized yet? ✓ Demand changes in application code and/or memory subsystem of host processor • Solution : memory module based NMP + Memory Channel Network (MCN) ✓ Recognize NMP memory modules as distributed computing nodes over Ethernet  no change in application code or memory subsystem of host processors ✓ Seamlessly integrate NMP w/ distributed computing frameworks for better scalability

4 Executive Summary ry • Processing In Memory (PIM), Near Memory Processing (NMP), … ✓ EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, … • Question : why haven’t they been commercialized yet? ✓ Demand changes in application code and/or memory subsystem of host processor • Solution : memory module based NMP + Memory Channel Network (MCN) ✓ Recognize NMP memory modules as distributed computing nodes over Ethernet  no change in application code or memory subsystem of host processors ✓ Seamlessly integrate NMP w/ distributed computing frameworks for better scalability • Feasibility & Performance: ✓ Demonstrate the feasibility w/ an IBM POWER8 + experimental memory module ✓ Improve the performance and processing bandwidth by 43% and 4× , respectively

5 Overview of f MCN-based NMP host MCN DIMM DDR4 DIMM DDR4 DIMM MCN DIMM MCN DIMM DRAM DRAM DRAM DRAM MC-0 MCN PROC OS CPU DRAM DRAM DRAM DRAM DDR4 DIMM DDR4 DIMM MCN DIMM MCN DIMM MC-1 local MCN PROC OS channels regular node MCN node global channel *Application Processor • Buffered DIMM w/ a low-power but powerful AP * in a buffer device Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

6 Overview of f MCN-based NMP host MCN DIMM DDR4 DIMM DDR4 DIMM MCN DIMM MCN DIMM DRAM DRAM DRAM DRAM MC-0 MCN PROC OS CPU DRAM DRAM DRAM DRAM DDR4 DIMM DDR4 DIMM MCN DIMM MCN DIMM MC-1 local MCN PROC OS channels regular node MCN node global channel *Application Processor • Buffered DIMM w/ a low-power but powerful AP * in a buffer device ✓ An MCN processor runs its own lightweight OS including the minimum network stack Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

7 Overview of f MCN-based NMP MCN distributed computing MCN node host MCN DIMM DDR4 DIMM DDR4 DIMM MCN DIMM MCN DIMM DRAM DRAM DRAM DRAM MC-0 MCN PROC OS CPU DRAM DRAM DRAM DRAM DDR4 DIMM DDR4 DIMM MCN DIMM MCN DIMM MC-1 local MCN PROC OS channels regular node MCN node global channel *Application Processor • Buffered DIMM w/ a low-power but powerful AP * in a buffer device • Special driver faking memory channels as Ethernet connections Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

8 Higher Processing BW * w/ Commodity DRAM *bandwidth • Conventional memory system ✓ More DIMMs  larger capacity but the same bandwidth DDR4 DIMM DDR4 DIMM Memory Controller DRAM DRAM DRAM DRAM Data Buffer Data Buffer Global/Shared Channel Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

9 Higher Processing BW * w/ Commodity DRAM *bandwidth • Conventional memory system w/ near memory processing DIMMs ✓ an MCN processor w/ local DRAM devices through private channels  scaling aggregate processing memory bandwidth w/ # of MCN DIMMs Local/Private Channels DDR4 DIMM MCN DIMM DDR4 DIMM DDR4 DIMM Memory Controller Memory Controller DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM MCN PROC MCN PROC Data Buffer Data Buffer Global/Shared Channel Global/Shared Channel Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

10 MCN DIM IMM Architecture IBM Centaur DIMM Buffer Device 80 DDR DRAM chips + buffer chip w/ a tall form factor Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

11 MCN DIM IMM Architecture IBM Centaur DIMM Buffer Device Near-Memory Processor MCN Processor MCN PROC core core core core 0 1 2 3 global DDR interface DDR IRQ LLC / interconnect channel dual-port SRAM control MC TX RX local DRAM 80 DDR DRAM chips Snapdragon AP w/ 4 A57 ARM cores + + buffer chip ~20W TDP & 2MB LLC, GPU, 2 MCs, etc. @ w/ a tall form factor ~10 mm ×10 mm ~5W & ~8×8 mm 2 (1.8W & ~2×2 mm 2 ) Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

12 MCN DIM IMM Architecture: In Interface Logic MCN PROC core core core core 0 1 2 3 global DDR interface DDR IRQ LLC / interconnect channel dual-port SRAM control MC TX RX MCN Interface local DRAM Serving as a fake network Interface card (NIC) Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

13 MCN DIM IMM Architecture: In Interface Logic MCN buffer layout MCN PROC core core core core 64 Bytes ... 0 1 2 3 0 4 8 12 63 global DDR interface Tx-head Tx-tail Tx-poll reserved DDR IRQ LLC / interconnect Rx-head Rx-tail Rx-poll reserved channel Tx circular buffer 48KB dual-port SRAM control MC TX Rx circular buffer 48KB RX Mapped to a range of physical memory space MCN Interface directly accessed by MC like normal DRAM local DRAM Serving as a fake network Interface card (NIC) Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

14 MCN DIM IMM Architecture: In Interface Logic MCN buffer layout MCN PROC core core core core 64 Bytes ... 0 1 2 3 0 4 8 12 63 global DDR interface Tx-head Tx-tail Tx-poll reserved DDR IRQ LLC / interconnect Rx-head Rx-tail Rx-poll reserved channel Tx circular buffer 48KB dual-port SRAM control MC TX Rx circular buffer 48KB RX Mapped to a range of physical memory space MCN Interface directly accessed by MC like normal DRAM local DRAM Serving as a fake network Interface card (NIC) Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

15 MCN Driver application user space host TX RX DDR4 DIMM MCN DIMM linux network stack MC-0 SRAM kernel space MCN driver NIC forwarding engine polling agent CPU Driver DDR4 DIMM memcpy MCN DIMM MC-1 SRAM memory channel regular access MCN access hardware MCN SRAM DDR memory Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

16 MCN Packet Routing 1. A packet is passed from the host network stack and the packet goes to • Host  MCN the corresponding MCN DIMM or NIC application user space host TX RX DDR4 DIMM MCN DIMM IP: X.X.X.X linux network stack MC-0 SRAM Header|Data/2 kernel space MCN driver NIC forwarding engine polling agent CPU Driver DDR4 DIMM memcpy MCN DIMM MC-1 SRAM memory channel Header|Data/2 regular access MCN access hardware MCN SRAM DDR memory Header | Data Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

17 MCN Packet Routing 2. If the packet needs to be sent to an MCN DIMM, the forwarding engine checks the MAC of the packet stored in • Host  MCN main memory application user space host TX RX DDR4 DIMM MCN DIMM linux network stack MC-0 SRAM Header|Data/2 kernel space MCN driver NIC forwarding engine polling agent CPU Driver DDR4 DIMM memcpy MCN DIMM MC-1 SRAM memory channel MAC: AA.AA.AA.AA.AA.AA Header|Data/2 regular access MCN access hardware MCN SRAM DDR memory Header | Data Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

18 MCN Packet Routing 3. If the MAC matches w/ that of an MCN DIMM, the packet is copied to the • Host  MCN SRAM buffer of the MCN DIMM application user space host TX RX DDR4 DIMM MCN DIMM linux network stack MC-0 SRAM Header|Data/2 kernel space MCN driver NIC forwarding engine polling agent CPU Driver DDR4 DIMM memcpy MCN DIMM MC-1 SRAM memory channel MAC: AA.AA.AA.AA.AA.AA Header|Data/2 regular access MCN access hardware MCN SRAM DDR memory Header | Data Overview Architecture Driver Optimizations Proof of Concept Evaluation Conclusion

Processing Architecture with Memory Channel Network Mohammad Alian 1 - PowerPoint PPT Presentation

Application-Transparent Near-Memory Processing Architecture with Memory Channel Network Mohammad Alian 1 , Seung Won Min 1 , Hadi Asgharimoghaddam 1 , Ashutosh Dhar 1 , Dong Kai Wang 1 , Thomas Roewer 2 , Adam McPadden 2 , Oliver O'Halloran 2 ,

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Memory Memory processing is the ability to: Acquire (Short term memory) Manipulate

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Personal SE Computer Memory Addresses C Pointers Computer Memory Organization Memory is a

Memory Management Memory Manager Requirements Minimize primary memory access time

Long-Term Memory Introduction STM versus LTM Episodic Memory Semantic Memory

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

Virtual Memory and Virtual Memory and Demand Paging Demand Paging Virtual Memory Illustrated

Dynamic Memory Management 333 Dynamic Memory Management Process Memory Layout Process Memory

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

Memory Management Ideally programmers want memory that is large fast non

28.05.04 09:50 Memory Management The computer memory is a limited resource so the Memory

FOOD PROCESSING FOOD PROCESSING GREEN BEAN PROCESSING GREEN BEAN PROCESSING GREEN BEAN

The Method of Types and Its Application to Information Hiding Pierre Moulin University of

Formal Modeling in Cognitive Science 1 Noisy Channel Model Channel Capacity Lecture 29: Noisy

NDN-RTC and Experimental Library Func:onality Peter Gusev NDNComm, March 2017 Flume Slack

Resistive Memories Marwen Zorgui, Mohammed E. Fouda, Zhiying Wang, Ahmed Eltawil, and Fadi Kurdahi

Jingwen Bai ECE, Rice University Joint work with Chenxi Liu* and Ashutosh Sabharwal Rice

Improving Spectrum Efficiency with ACKs Jiansong Zhang # , Haichen Shen , Kun Tan ,

References 2 Material Related to LTE comes from 3GPP LTE: System Overview, Product

Determinating Timing Channels in Compute Clouds Amittai Aviram, Sen Hu, Bryan Ford Yale

Sambuz

Useful Links

Newsletter

Mail Us