Zehra Sura
The Active Memory Cube:
A Processing-in-Memory System for High Performance Computing
IBM T.J. Watson Research Center Yorktown Heights, New York
AMC Team Members
AMC: Active Memory Cube (August 25, 2015)
Ravi Nair, Samuel Antao, Carlo Bertolli, Pradip Bose, Jose Brunheroto, Tong Chen, Chen-Yong Cher, Carlos Costa, Jun Doi, Constantinos Evangelinos, Bruce Fleischer, Thomas Fox, Diego Gallo, Leopold Grinberg, John Gunnels, Arpith Jacob, Philip Jacob, Hans Jacobson, Tejas Karkhanis, Changhoan Kim, Jaime Moreno, Kevin O'Brien, Martin Ohmacht, Yoonho Park, Daniel Prener, Bryan Rosenburg, Kyung Ryu, Olivier Sallenave, Mauricio Serrano, Patrick Siegl, Krishnan Sugavanam, Zehra Sura

Supported in part by the US Department of Energy
– High power affects:
§ Transistor reliability at the circuit level
§ Power delivery/cooling costs at the system level
– Memory Wall: % of time for memory ops ↑, % of time for compute ops ↓
– Many others…
[Figure: memory stack with compute logic]
Projected to be 20 GFlops/W for DGEMM in 14nm at 1.25 GHz
Source: green500.org
BlueGene/Q vs. AMC: ~10x the power efficiency (source: green500.org)
AMC: 20 GFlops/W for DGEMM in 14nm at 1.25 GHz
§ Latency range: 26 cycles to 250+ cycles
– No caches
– Large register files:
§ 16 vector registers * 32 elements * 8 bytes * 4 slices → 16 KB per lane
§ 32 scalar registers
§ 4 vector mask registers
– Buffers in vault controllers
– Load combining
– Page policy
Flop efficiency is the % of peak flops achieved in execution. Theoretical peak for a lane is 8 Flops per cycle.
DAXPY: for (i = 0; i < N; i++) B[i] = B[i] + x * A[i];
Memory bound.
Maximum bandwidth utilization for kernel: 47.8% of peak (153.2 GB/s of 320 GB/s)
Expected bandwidth utilization in apps: 30.9% of peak (99 GB/s of 320 GB/s)
For a node with 16 AMCs: 1.58 TB/s (99 GB/s * 16 AMCs)
Peak bandwidth available to host: 256 GB/s
§ Latency range: 26 cycles to 250+ cycles
– No caches
– Large register files
– Buffers in vault controllers
– Load combining
– Page policy
§ High bandwidth
– On-chip ★
– Deep LSQ
– Multiple load-store units
– Multiple striping policies
§ Support for programming/heterogeneity:
– Shared memory
– Effective address space same as host processors ★
– Hardware coherence/consistency ★
– In-memory atomic operations ★
base: data allocated across the AMC, default optimizations
lh: base + latency-hiding optimizations
aff: base, but data allocated in a specific quadrant
lh+aff: with all optimizations
           | MANUAL                    | COMPILER
DET        | 71.1 GF/s (22.2% of peak) | 121.6 GF/s (38% of peak)
DAXPY (BW) | 99 GB/s (30.9% of peak)   | 99 GB/s (30.9% of peak)
DGEMM*     | 266 GF/s (83% of peak)    | 246 GF/s (77% of peak)
Supports an MPI + OpenMP 4.0 programming model
* DGEMM: the compiler currently needs the 2 innermost loops to be manually blocked
THE GOOD

§ Unified loop optimization:
– Blocking
– Distribution
– Unrolling
– Versioning
§ Array scalarization
§ Scheduling
§ Register allocation
§ Function calls, SIMD/predicated functions
§ Software instruction caching
§ Latency prediction
§ Data placement
§ Sequence of accesses

§ Alias analysis
§ Automatic coarse-grained parallelization
§ The AMC design demonstrates an aggressive "hardware enablement, software exploitation" model for power-efficient architecture design
– Judicious division of responsibility between layers of the system stack
§ Processing-in-memory is viable with the adoption of 3D-stacked memory
– Saves on data movement cost
– Easier to support programmability for node-level computation
IBM, BG/Q, Blue Gene/Q, and Active Memory Cube are trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.