Emerging Memory Technologies for Improved Energy Efficiency
Martin Wenzel
Advanced Seminar WS2015

Memory Bandwidth

Technology         BW (GB/s)
DDR3-1333 2GB      10.66
DDR4-2667 4GB      21.34

Sources: Hennessy, Patterson, Computer Architecture: A Quantitative Approach; http://www.extremetech.com/computing/197720-beyond-ddr4-understand-the-differences-between-wide-io-hbm-and-hybrid-memory-cube
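The two bandwidth figures above can be reproduced from the transfer rate in the module name. The sketch below assumes a standard 64-bit DDR channel (the slide does not state the bus width) and treats the number in the name as mega-transfers per second; the function name is illustrative only.

```python
# Peak bandwidth of one DDR channel: transfers per second times bytes per transfer.
BUS_WIDTH_BITS = 64  # assumption: standard 64-bit DDR channel, not stated on the slide


def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=BUS_WIDTH_BITS):
    """Return the theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * bytes_per_transfer / 1000


for name, rate in [("DDR3-1333", 1333), ("DDR4-2667", 2667)]:
    print(f"{name}: {peak_bandwidth_gbs(rate):.2f} GB/s")
```

These are theoretical peaks; sustained bandwidth in a real system is lower.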

Power Consumption
Source: Architectures and Technology for Extreme Scale Computing, 2009

Stacking
• Pricey
• Thermal Resistance
• High Density
• Low Interconnect Length
• High Internal Interconnect Width: ~400
• Package limited: < 4

Stacked Memory: Hybrid Memory Cube
• 32 Vaults
  • Vertical Memory Partitions
• Vault Logic
  • DRAM Controller
• Packetized Interconnect
• Support for Atomics
  • Arithmetic
  • Bitwise swap / write
  • Boolean
  • Compare and Swap
Source: HMC Specification V1.0

Hybrid Memory Cube Interconnect
• Packet-based Interconnect
• 20 GB/s per Link
• 8 Links per HMC
• Aggregate Link Bandwidth
• Connect additional HMCs

Technology         BW (GB/s)
DDR3-1333 2GB      10.66
DDR4-2667 4GB      21.34

Source: HMC Specification V1.0
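As a rough illustration of what aggregating the links means, the sketch below multiplies the per-link figure by the link count from this slide and compares the result with the DDR4-2667 entry in the table. The constant and function names are illustrative, not part of the HMC specification.

```python
LINK_BW_GBS = 20        # per-link bandwidth, from the slide
LINKS_PER_HMC = 8       # links per cube, from the slide
DDR4_2667_GBS = 21.34   # DDR4-2667 entry from the bandwidth table


def aggregate_link_bandwidth(links=LINKS_PER_HMC, per_link_gbs=LINK_BW_GBS):
    """Total off-chip bandwidth of one HMC when all links are used together."""
    return links * per_link_gbs


total = aggregate_link_bandwidth()
print(f"Aggregate link bandwidth: {total} GB/s")
print(f"Ratio to DDR4-2667: {total / DDR4_2667_GBS:.1f}x")
```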

Processing in Memory (PIM): Instruction Offloading
• Problematic Workload
  • Low Computation Intensity
  • Low Locality
• Expectation
  • Efficient Bandwidth Usage

Compare and Swap
• Conventional
    ReadCacheline(PTR)                64B Data
    CAS(PTR, CompVal, New)
    WriteCacheline(PTR)               64B Data
• Atomic CAS
    Request_CAS(PTR, CompVal, New)    16B Data
    Response                          16B Data
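The bandwidth argument on this slide comes down to simple arithmetic over the transfer sizes listed. The sketch below adds them up and also spells out the usual compare-and-swap semantics in plain Python; the function names are illustrative, and the `memory` dictionary is only a stand-in for DRAM, not a model of the HMC packet format.

```python
CACHELINE_BYTES = 64      # size of each cache-line transfer on the slide
ATOMIC_PACKET_BYTES = 16  # data size of the CAS request and of its response


def conventional_cas_traffic():
    """Host-side CAS: read the cache line, modify it, write it back."""
    return CACHELINE_BYTES + CACHELINE_BYTES


def offloaded_cas_traffic():
    """Offloaded CAS: one request packet and one response packet."""
    return ATOMIC_PACKET_BYTES + ATOMIC_PACKET_BYTES


def compare_and_swap(memory, ptr, comp_val, new):
    """Store `new` at `ptr` only if the current value equals `comp_val`.

    Returns the old value so the caller can tell whether the swap happened.
    """
    old = memory[ptr]
    if old == comp_val:
        memory[ptr] = new
    return old


print(conventional_cas_traffic(), "bytes vs", offloaded_cas_traffic(), "bytes")
```

With the sizes from the slide, the offloaded variant moves a quarter of the data per operation.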

Example Workload: Graph Computing
Graph Search
• Breadth-first Search
  • Check all Neighbors
  • Move to the next Level
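The two steps named on the slide (check all neighbors, then move to the next level) correspond to a level-synchronous breadth-first search. A minimal sequential sketch, assuming the graph is given as an adjacency-list dictionary:

```python
def bfs_levels(graph, source):
    """Return a dict mapping each reachable vertex to its BFS level."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for vertex in frontier:
            # Check all neighbors of the current vertex.
            for neighbor in graph[vertex]:
                if neighbor not in level:
                    level[neighbor] = depth
                    next_frontier.append(neighbor)
        # Move to the next level.
        frontier = next_frontier
    return level


example = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_levels(example, 0))
```

The neighbor accesses jump around memory and involve little computation per byte fetched, which is the low-locality, low-intensity pattern described for problematic workloads. In a parallel version, the check-and-set on `level` is a natural candidate for an atomic compare-and-swap.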

Processing in Memory: Offloading
Source: Nai, Kim, 2015, Instruction Offloading with HMC 2.0 Standard – A Case Study for Graph Traversals

Processing in Memory: Application Offloading (Tesseract)
• Problematic Workload
  • Low Computation Intensity
  • Low Locality
• Expectation
  • Efficient Bandwidth Usage
  • High Energy Efficiency
  • Scalability

Processing in Memory: Tesseract
• Single HMC
  • Max Interconnect Bandwidth: 160 GB/s
  • Max Memory Bandwidth: 256 GB/s
• Tesseract
  • Processing Unit (PU) in every Vault
  • 16 HMCs in Network
  • Max Interconnect Bandwidth: 160 GB/s
  • Max Memory Bandwidth: 4 TB/s
Sources: HMC Specification V1.0; Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
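The 4 TB/s figure follows from the per-cube numbers if every vault's processing unit uses its local memory bandwidth. The sketch below redoes that arithmetic with the values from the slides (32 vaults per HMC is taken from the Hybrid Memory Cube slide); the even split of bandwidth across vaults is an assumption made for illustration.

```python
VAULTS_PER_HMC = 32         # from the Hybrid Memory Cube slide
MEM_BW_PER_HMC_GBS = 256    # max internal memory bandwidth of one HMC
LINK_BW_PER_HMC_GBS = 160   # max external interconnect bandwidth of one HMC
HMCS_IN_NETWORK = 16        # cubes in the Tesseract network

# Assumption: the internal bandwidth is shared evenly among the vaults.
per_vault_gbs = MEM_BW_PER_HMC_GBS / VAULTS_PER_HMC
total_internal_gbs = MEM_BW_PER_HMC_GBS * HMCS_IN_NETWORK
processing_units = VAULTS_PER_HMC * HMCS_IN_NETWORK

print(f"Per-vault bandwidth: {per_vault_gbs:.0f} GB/s")
print(f"Aggregate internal bandwidth: {total_internal_gbs} GB/s (about 4 TB/s)")
print(f"Processing units: {processing_units}")
```

The internal total grows with the number of cubes, while a host that reaches memory only over the links stays bound by the interconnect figure.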

Processing in Memory: Tesseract Core Architecture
• Distributed Memory Architecture
  • No Cache Coherence
  • Remote Function Call
• List Prefetcher
  • Prefetch Stride (Cache Lines)
• Message Triggered Prefetcher
  • Preload Data before Message Handling
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
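To make the remote-function-call idea concrete, here is a small software analogy: each vault owns a slice of the data and a message queue, and updates to data owned elsewhere are sent as queued calls instead of being made through shared, coherent memory. The class names, the modulo partitioning, and the `put`/`barrier` methods are hypothetical choices for this sketch and should not be read as the Tesseract interface.

```python
from collections import deque


class Vault:
    """Hypothetical stand-in for one vault: local data plus a message queue."""

    def __init__(self, vault_id):
        self.vault_id = vault_id
        self.data = {}        # vertex -> value, only touched by this vault
        self.inbox = deque()  # pending remote function calls

    def drain(self):
        """Run every queued call against local data and count them."""
        handled = 0
        while self.inbox:
            func, vertex, arg = self.inbox.popleft()
            func(self.data, vertex, arg)
            handled += 1
        return handled


class Network:
    """Hypothetical network of vaults with static vertex ownership."""

    def __init__(self, num_vaults):
        self.vaults = [Vault(i) for i in range(num_vaults)]

    def owner(self, vertex):
        # Assumption: simple modulo partitioning of vertices across vaults.
        return self.vaults[vertex % len(self.vaults)]

    def put(self, vertex, func, arg):
        """Non-blocking remote call: enqueue at the owner and return."""
        self.owner(vertex).inbox.append((func, vertex, arg))

    def barrier(self):
        """Let every vault process its queue; return the number of calls run."""
        return sum(vault.drain() for vault in self.vaults)


def add_to(data, vertex, arg):
    """Example remote function: accumulate a value at a vertex."""
    data[vertex] = data.get(vertex, 0) + arg


net = Network(num_vaults=4)
for v in [0, 1, 5, 5, 9]:
    net.put(v, add_to, 1)
handled = net.barrier()
print(handled, net.vaults[1].data)
```

Because each vertex is only ever modified by its owning vault, no coherence traffic is needed in this model, which is the property the slide highlights.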

Processing in Memory: Tesseract – Speedup

Architecture    Cores                   HMCs    Memory Bandwidth
HMC-OoO         32 performance cores    16      320 GB/s
HMC-MC          512 low-power cores     16      320 GB/s
Tesseract       512 low-power cores     16      4 TB/s

Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Processing in Memory: Tesseract – Energy Efficiency
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Processing in Memory: Tesseract – Scalability
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Conclusion: Processing in Memory
• High Speedup
• Highly Energy Efficient
• Scales proportionally with Memory Capacity
• Currently usable via Instruction Offloading
• Current Designs optimized for Graph Computing
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Future Work
• Additional Workloads
• Processing Units
• Internode Communication
• Application specific
• General Purpose
• FPGA technology?

Further Information
• MEMSYS International Symposium on Memory Systems
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Through-Silicon Via (figure labels)
• µBumps on top Metal Layer: ~50 µm pitch
• Through-Metal Via: ~2-50 µm
• µBumps under Substrate: ~50 µm pitch
• 200 µm

Processing in Memory: Tesseract Core Architecture
• Distributed Memory Architecture
  • No Coherence Traffic
  • Message / Instruction Passing
• Optional List Prefetcher
  • Optimize Locality
• Message Triggered Prefetcher
  • Preload Data before Message Handling
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Processing in Memory: Tesseract – Latency
Source: Ahn, Hong, Yoo, Mutlu, Choi, 2015, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
