  1. Microscope on Memory: MPSoC-enabled Computer Memory System Assessments FCCM 2018 Abhishek Kumar Jain, Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing May 1, 2018 LLNL-PRES-750335 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

  2. Introduction
▪ Recent advances in memory technology and packaging
— High bandwidth memories – HBM, HMC
— Non-volatile memory – 3D XPoint
— Potential for logic and compute functions co-located with the memory
— Brought attention to computer memory system design and evaluation
[Images: HMC, HBM, 3D XPoint. Credits: Hongshin Jun et al., IMW 2017; Creative Commons Attribution; Micron Technology.]

  3. Introduction
▪ Emerging memories exhibit a wide range of bandwidths, latencies, and capacities
— Challenge for computer architects to navigate the design space
— Challenge for application developers to assess performance implications

Memory/Storage Hierarchy:
Tier        Capacity   Latency
SRAM        8 MB       10 ns
Near DRAM   128 MB     45 ns
DDR DRAM    8 GB       70 ns
Far DRAM    8 GB       100 ns
NVM         64 GB      200 ns
SSD         128 GB     50 us
HDD         6 TB       10 ms

  4. Introduction
▪ Need for system-level exploration of the design space
— Combinations of memory technology
— Various memory hierarchies
— Potential benefit of near-memory accelerators
— Prototype architectural ideas in detail
▪ Need to quantitatively evaluate the performance impact on applications, beyond an isolated function
— Accelerator communication overhead
— Cache management overhead
— Operating system overhead
— Byte addressable vs. block addressable
— Scratchpad vs. cache
— Cache size relative to working data set size
— Latency impact

  5. Background
▪ M. Butts, J. Batcheller and J. Varghese, "An efficient logic emulation system," Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors, Cambridge, MA, 1992, pp. 138-141.
— Realizer System
▪ "Virtex-7 2000T FPGA for ASIC Prototyping & Emulation," https://www.xilinx.com/video/fpga/virtex-7-2000t-asic-prototyping-emulation.html
— Prototype ARM A9 processor subsystem (dual-core, caches) mapped into a single Virtex-7 2000T FPGA
▪ Our approach uses the native hard IP cores/cache hierarchy and focuses on external memory

  6. LiME (Logic in Memory Emulator) ZCU102 development board with Xilinx Zynq UltraScale+ MPSoC device

  7. LiME (Logic in Memory Emulator) Implementation
▪ Use embedded CPU and cache hierarchy in Zynq MPSoC to save FPGA logic and development time
▪ Route memory traffic through hardware IP blocks deployed in programmable logic
▪ Emulate the latencies of a wide range of memories by using programmable delay units in the loopback path
▪ Capture time-stamped memory transactions using the trace subsystem
[Block diagram of the Zynq UltraScale+ MPSoC. Processing System (PS): four ARM cores with L1 caches, a shared L2 cache, coherent interconnect, main switch, and DDR memory controller, with one path marked "Not Used". Programmable Logic (PL): a memory subsystem with AXI peripheral interconnect, program DRAM, accelerator, and monitor; delay units on the loopback path between the HPM0/HPM1 and HP0,1/HP2,3 ports; and a trace subsystem with AXI Performance Monitor (APM), trace capture device, and trace DRAM.]
Open Source: http://bitbucket.org/perma/emulator_st/

  8. LiME (Logic in Memory Emulator) Implementation
LLNL hardware IP blocks: AXI Delay, AXI Shim, AXI Trace Capture Device
LiME uses only 13% of the device resources

  9. Emulation Method: Clock Domains
▪ ARM cores are slowed to run at a frequency similar to the programmable logic
▪ A scaling factor of 20x is applied to the entire system
▪ Other scaling factors can be used depending on the target peak bandwidth to memory
▪ CPU peak bandwidth is limited to 44 GB/s
[Clock-domain diagrams of the actual system and the system as emulated at 20x. Actual: accelerator 62.5 MHz; PL memory subsystem 300 MHz with 4.8 GB/s program DRAM; host subsystem and APU 137.5 MHz (2.2 GB/s); a 475 MHz, 7.6 GB/s path; PS DDR 950 MHz (15.2 GB/s); 128- and 64-bit data paths. Emulated at 20x: accelerator 1.25 GHz; memory subsystem 6 GHz (96 GB/s); host and APU 2.75 GHz (44 GB/s); a 9.5 GHz, 152 GB/s path; DDR 19 GHz (304 GB/s).]

  10. Emulation Method: Scaling by 20 Example

Component                     Actual           Emulated
Memory Bandwidth (PL)         4.8 GB/s         96 GB/s
Memory Latency (PL)           230 ns           12 ns (too low)
Memory Latency (PL) w/delay   230 ns           12 + 88 = 100 ns
CPU Frequency                 137.5 MHz        2.75 GHz
CPU Bandwidth                 2.2 GB/s         44 GB/s
Accelerator Frequency         62.5 MHz         1.25 GHz
Accelerator Bandwidth         Up to 4.8 GB/s   Up to 96 GB/s

Delay is programmable over a wide range: 0 - 174 us in 0.16 ns increments
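As a worked example of the scaling table above, the short standalone C calculation below (an illustration, not code from the emulator) turns a target memory latency into a delay-unit setting: at 20x, the 230 ns physical round trip appears as roughly 12 ns of emulated time, so about 88 ns must be added, programmed in 0.16 ns increments.

#include <stdio.h>

int main(void)
{
    const double scale        = 20.0;   /* emulation scaling factor               */
    const double actual_ns    = 230.0;  /* measured PL memory latency (real time) */
    const double target_ns    = 100.0;  /* latency of the memory being emulated   */
    const double increment_ns = 0.16;   /* delay unit programming resolution      */

    double emulated_ns = actual_ns / scale;        /* 11.5 ns; the slide rounds to 12 ns */
    double extra_ns    = target_ns - emulated_ns;  /* roughly 88 ns of delay to add      */
    long   increments  = (long)(extra_ns / increment_ns + 0.5);

    printf("base emulated latency: %.1f ns\n", emulated_ns);
    printf("programmed delay: %.1f ns = %ld increments of %.2f ns\n",
           extra_ns, increments, increment_ns);
    return 0;
}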

  11. Emulation Method: Address Space
[Figure: original vs. modified address map; contiguous AXI peripheral regions, loopback path through the PL.]

  12. Emulation Method: Delay & Loopback
▪ Address ranges R1, R2 are intended to have different access latencies (e.g. SRAM, DRAM)
▪ Shims shift and separate address ranges (R1, R2) for easier routing
▪ Standard AXI interconnect routes requests through different delay units
▪ Delay units have separate programmable delays for read and write access
[Block diagram: R1 and R2 both arrive at 0x04_0000_0000 (40-bit address, 128-bit data; R1 is a 1 MB range, R2 a 4 GB range). The R1 shim (map width 20 bits, map in 0x04000, map out 0x08000) and the R2 shim (map width 8 bits, map in 0x04, map out 0x18) separate them to 0x08_0000_0000 (R1) and 0x18_0010_0000 (R2). AXI SmartConnect then routes each range through its own delay unit (36-bit addresses), emerging at 0x08_0000_0000 (R1) and 0x08_0010_0000 (R2) on S_AXI_HP0/S_AXI_HP1, while M_AXI_HPM0 connects the PL to the PS host subsystem (main switch, coherent interconnect, DDR memory controller, APU).]
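The shim behavior on this slide can be modeled in a few lines of C. The sketch below is an assumed software model of the shift-and-remap step, using the map widths and map in/out values from the diagram; it is not the actual RTL.

/* A shim compares the top map_width bits of a 40-bit AXI address against
 * map_in and, on a match, substitutes map_out, leaving the low bits intact. */
#include <stdint.h>
#include <stdio.h>

#define ADDR_WIDTH 40

static uint64_t shim_remap(uint64_t addr, unsigned map_width,
                           uint64_t map_in, uint64_t map_out)
{
    unsigned shift    = ADDR_WIDTH - map_width;       /* bits left unchanged */
    uint64_t low_mask = ((uint64_t)1 << shift) - 1;   /* untouched low bits  */
    if ((addr >> shift) == map_in)
        return (map_out << shift) | (addr & low_mask);
    return addr;                                      /* outside the window  */
}

int main(void)
{
    /* R1 shim: 20-bit map, 0x04000 -> 0x08000 (values from the slide) */
    uint64_t r1 = shim_remap(0x0400000000ULL, 20, 0x04000, 0x08000);
    /* R2 shim: 8-bit map, 0x04 -> 0x18 */
    uint64_t r2 = shim_remap(0x0400100000ULL, 8, 0x04, 0x18);
    printf("R1 -> 0x%010llx\n", (unsigned long long)r1);  /* 0x0800000000 */
    printf("R2 -> 0x%010llx\n", (unsigned long long)r2);  /* 0x1800100000 */
    return 0;
}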

  13. Emulation Method: Macro Insertion
▪ Insert macros at the start and end of the region of interest (ROI)
▪ CLOCKS_EMULATE / CLOCKS_NORMAL
— Modify the clock frequencies and configure the delay units
▪ TRACE_START / TRACE_STOP
— Trigger the hardware to start/stop recording memory events in Trace DRAM
▪ STATS_START / STATS_STOP
— Trigger the hardware to start/stop the performance monitor counters
▪ TRACE_CAP
— Save captured trace from Trace DRAM to SD card
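A minimal sketch of how these macros might bracket a region of interest in an application. The header name (lime.h), the placeholder kernel, and the exact macro usage are assumptions; only the macro names and their described effects come from the slide.

#include "lime.h"             /* assumed header exporting the LiME macros        */

extern void run_kernel(void); /* placeholder for the application code under test */

int main(void)
{
    CLOCKS_EMULATE;  /* slow the clocks and configure the delay units */
    TRACE_START;     /* start recording memory events in Trace DRAM   */
    STATS_START;     /* start the performance monitor counters        */

    run_kernel();    /* region of interest */

    STATS_STOP;      /* stop the performance monitor counters  */
    TRACE_STOP;      /* stop recording memory events           */
    TRACE_CAP;       /* save the captured trace to the SD card */
    CLOCKS_NORMAL;   /* restore the normal clock frequencies   */
    return 0;
}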

  14. Memory Trace Capture
Flow: LiME → trace.bin → parser.c → trace.csv
Master encoding: CPU = 0, Accelerator = 1
Each timestamp count represents 0.16 ns
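A minimal sketch of a trace.bin to trace.csv converter in the spirit of parser.c. The binary record layout below is an assumption made for illustration; the only details taken from the slide are the 0.16 ns timestamp unit and the master encoding (CPU = 0, Accelerator = 1).

#include <stdio.h>
#include <stdint.h>

struct trace_rec {              /* ASSUMED layout, for illustration only */
    uint64_t timestamp;         /* in 0.16 ns counts                     */
    uint64_t address;           /* AXI address of the transaction        */
    uint32_t length;            /* transfer length in bytes              */
    uint8_t  master;            /* 0 = CPU, 1 = accelerator              */
    uint8_t  is_write;          /* 0 = read, 1 = write                   */
    uint8_t  pad[2];
};

int main(void)
{
    FILE *in  = fopen("trace.bin", "rb");
    FILE *out = fopen("trace.csv", "w");
    struct trace_rec r;
    if (!in || !out) return 1;

    fprintf(out, "time_ns,master,rw,address,bytes\n");
    while (fread(&r, sizeof r, 1, in) == 1) {
        fprintf(out, "%.2f,%s,%c,0x%llx,%u\n",
                r.timestamp * 0.16,                  /* counts -> ns */
                r.master ? "ACC" : "CPU",
                r.is_write ? 'W' : 'R',
                (unsigned long long)r.address, r.length);
    }
    fclose(in);
    fclose(out);
    return 0;
}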

  15. Use Cases: Bandwidth Analysis from Trace
[Figure: bandwidth computed from the captured trace.]
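One way to produce a bandwidth analysis is to bin the traced bytes into fixed time windows. The sketch below assumes the trace.csv columns from the parser sketch above (time_ns, master, rw, address, bytes), which are themselves an assumption, and reports GB/s per 1 us window.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("trace.csv", "r");
    if (!f) return 1;

    char   line[256];
    double window_ns   = 1000.0;     /* 1 us bins */
    long   cur_bin     = -1;
    double bytes_in_bin = 0.0;

    fgets(line, sizeof line, f);     /* skip the header row */
    double t_ns; char master[8]; char rw; unsigned long long addr; unsigned bytes;
    while (fscanf(f, "%lf,%7[^,],%c,%llx,%u\n",
                  &t_ns, master, &rw, &addr, &bytes) == 5) {
        long bin = (long)(t_ns / window_ns);
        if (bin != cur_bin) {
            if (cur_bin >= 0)
                printf("%ld us: %.2f GB/s\n", cur_bin,
                       bytes_in_bin / window_ns);    /* bytes per ns == GB/s */
            cur_bin = bin;
            bytes_in_bin = 0.0;
        }
        bytes_in_bin += bytes;
    }
    if (cur_bin >= 0)
        printf("%ld us: %.2f GB/s\n", cur_bin, bytes_in_bin / window_ns);
    fclose(f);
    return 0;
}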

  16. Use Cases: Access Pattern Analysis from Trace (STREAM Benchmark)
[Figure: memory address vs. time.]

  17. Use Cases: Access Pattern Analysis from Trace (STREAM Benchmark)
[Figure: memory address vs. time.]

  18. Use Cases: Access Pattern Analysis from Trace (STREAM Benchmark)
[Figure: memory address vs. time.]

  19. Use Cases: Access Pattern Analysis from Trace (STREAM Benchmark)
[Figure: memory address vs. time.]

  20. Use Cases: Evaluation of Future Storage Class Memory
▪ DGEMM execution time on a 64-bit processor at varying latencies and varying cache-to-memory ratios: the cache can hide memory latency for a working set size up to twice the size of the cache.
▪ SpMV execution time on a 64-bit processor at varying latencies with a cache-to-memory ratio of 1:112: latency has a direct impact. The application will need a high level of concurrency and greater throughput to offset the loss in performance.

  21. Use Cases: Evaluation of Near-Memory Acceleration Engines
▪ Multiple memory channels
▪ Up to 16 concurrent memory requests
▪ DREs are located in the Memory Subsystem
▪ Scratchpad is used to communicate parameters and results between CPU and accelerator
[Block diagram: a memory subsystem with four memory channels, a switch with links to other switches, Data Rearrangement Engines (DREs), and a scratchpad; a DRE detail showing a load-store unit and control; and a processor with CPU cores, per-core caches, and a shared cache.]
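A sketch of what the CPU side of a DRE offload through the scratchpad could look like. The scratchpad address, mailbox layout, and command/status encoding below are hypothetical; the presentation only states that parameters and results pass between the CPU and the accelerator through the scratchpad.

#include <stdint.h>

#define SCRATCHPAD_BASE 0x0810000000ULL   /* hypothetical scratchpad address */

struct dre_mailbox {            /* hypothetical parameter block in the scratchpad */
    volatile uint64_t src;      /* source array in memory                */
    volatile uint64_t index;    /* index array for a gather              */
    volatile uint64_t dst;      /* destination buffer for gathered data  */
    volatile uint64_t count;    /* number of elements                    */
    volatile uint32_t command;  /* e.g. 1 = gather (hypothetical)        */
    volatile uint32_t status;   /* 0 = idle, 1 = busy, 2 = done          */
};

void dre_gather(uint64_t src, uint64_t index, uint64_t dst, uint64_t count)
{
    struct dre_mailbox *mb = (struct dre_mailbox *)(uintptr_t)SCRATCHPAD_BASE;

    mb->src   = src;            /* parameters pass through the scratchpad */
    mb->index = index;
    mb->dst   = dst;
    mb->count = count;
    mb->command = 1;            /* start the near-memory engine */

    while (mb->status != 2)     /* poll until the DRE reports completion */
        ;
}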

  22. Use Cases: Evaluation of Near-Memory Acceleration Engines
The results demonstrate that substantial speedup can be gained with a DRE due to the higher number of in-flight requests issued by the near-memory accelerator.

  23. Use Cases: Comparing Performance Across CPUs
▪ 32-bit ARM A9 (out-of-order, 11-stage pipeline) using Zynq 7000
— L1 cache: two separate 32 KB (4-way set-associative) for instruction and data
— L2 cache: shared 512 KB (8-way set-associative)
— Cache line size: 32 bytes
▪ 64-bit ARM A53 (in-order, 8-stage pipeline) using Zynq UltraScale+
— L1 cache: two separate 32 KB (4-way set-associative) for instruction and data
— L2 cache: shared 1 MB (16-way set-associative)
— Cache line size: 64 bytes
Bandwidth-dominated STREAM-triad runs significantly faster on the 64-bit processor with wider data paths. Random Access is mostly dependent on memory latency, with little difference from CPU architecture. Image Difference requires some computation, giving the 64-bit core an advantage.
