Scaling Datacenter Accelerators With Compute-Reuse Architectures (Adi Fuchs and David Wentzlaff, ISCA 2018)

  1. Scaling Datacenter Accelerators With Compute-Reuse Architectures. Adi Fuchs and David Wentzlaff. ISCA 2018, Session 5A, June 5, 2018, Los Angeles, CA.

  2. Sources: "Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016.


  5. Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google, https://cloud.google.com/tpu/; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; "NVIDIA Tesla V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/.


  7. Transistor scaling stops; chip specialization runs out of steam. What's next?

  8. Observation I: the density of emerging memories is projected to increase (ITRS logic roadmap).


  10. Observation II: datacenter accelerators perform redundant computations. ▪ Temporal locality introduces redundancy in video encoders (recurrent blocks shown in white): 0% recurrence at t = 0 s, 38% at t = 2 s, 61% at t = 4 s. Source: "Face Recognition in Unconstrained Videos with Matched Background Similarity", Wolf et al., CVPR 2011.
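The recurrence percentages above can be reproduced in miniature by hashing fixed-size blocks of each frame and counting how many were already seen. This is a minimal sketch with assumed parameters (a 16-byte block over a flat grayscale buffer, toy frames), not the encoder's actual block-matching logic:

```python
import hashlib

def block_recurrence(frame, seen, block=16):
    """Fraction of fixed-size blocks in `frame` (a bytes-like, row-major
    grayscale buffer) whose contents were already observed in `seen`."""
    # Split the flat frame buffer into `block`-byte chunks and hash each one.
    hashes = [hashlib.sha256(bytes(frame[i:i + block])).digest()
              for i in range(0, len(frame), block)]
    hits = sum(1 for h in hashes if h in seen)
    seen.update(hashes)
    return hits / len(hashes)

seen = set()
frame0 = bytes(range(256))          # first frame: nothing seen yet
frame2 = frame0[:128] + bytes(128)  # later frame: half its blocks recur
print(block_recurrence(frame0, seen))  # 0.0 on the first frame
print(block_recurrence(frame2, seen))  # 0.5: half the blocks recur
```

A real encoder tolerates near-matches; exact hashing only catches bit-identical blocks, which is the same conservative notion of recurrence memoization relies on.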


  12. Observation II: datacenter accelerators perform redundant computations. ▪ Search-term commonality retrieves similar content, e.g. "intercontinental downtown los angeles" and "hotel in downtown los angeles near intercontinental". Source: Google.


  14. Observation II: datacenter accelerators perform redundant computations. ▪ Power laws suggest highly recurrent processing of popular content. Source: Twitter.

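The power-law observation can be illustrated with a small simulation: when request popularity follows a Zipf distribution (the parameters below are illustrative, not taken from the talk), most requests repeat an earlier one and would hit a memoization table:

```python
import random

def zipf_hit_rate(n_items=1000, n_requests=10000, s=1.0, seed=0):
    """Simulate requests whose popularity follows a Zipf power law and
    report the fraction that repeat an earlier request (potential hits)."""
    rng = random.Random(seed)
    # Item at popularity rank r is requested with weight 1 / (r + 1)^s.
    weights = [1.0 / (rank + 1) ** s for rank in range(n_items)]
    requests = rng.choices(range(n_items), weights=weights, k=n_requests)
    seen, hits = set(), 0
    for item in requests:
        if item in seen:
            hits += 1
        else:
            seen.add(item)
    return hits / n_requests

print(zipf_hit_rate())  # well above 0.9 for these parameters
```

With 10,000 requests over at most 1,000 distinct items, at least 90% of requests are structurally repeats; the skew of the power law concentrates them further on the most popular items.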


  17. COREx: a compute-reuse architecture for accelerators. Memoization: tables store past computation outputs; outputs of recurring inputs are reused instead of recomputed. [Diagram: host processors and the accelerator (core, scratchpad, DMA engine) connected over the shared LLC/NoC; a compute-reuse storage answers input lookups, returning the fetched result on a hit in place of the core result.]

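The memoization idea on this slide can be sketched in a few lines of software. This is a toy model, not COREx itself: COREx's hardware tables hash inputs and bound their capacity, while this sketch keys an unbounded dict on the raw inputs, and the kernel below is a hypothetical stand-in for an accelerated computation:

```python
def memoize_accelerator(compute):
    """Wrap a kernel with a compute-reuse table: outputs of recurring
    inputs are served from the table instead of being recomputed."""
    table = {}  # models the compute-reuse storage: input -> stored output
    def lookup(inputs):
        if inputs in table:          # hit: reuse the stored output
            return table[inputs]
        result = compute(inputs)     # miss: run the accelerator core
        table[inputs] = result       # install the result for future reuse
        return result
    return lookup

# hypothetical kernel standing in for an accelerated computation
conv = memoize_accelerator(lambda xs: tuple(3 * x + 1 for x in xs))
print(conv((1, 2, 3)))  # computed by the kernel: (4, 7, 10)
print(conv((1, 2, 3)))  # recurring input: served from the reuse table
```

The wrapper changes nothing about the kernel's interface, which mirrors the slide's point that memoization fits the existing input-compute-output flow.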


  22. Architectural guidelines. Goal: extend specialization with workload-specific memoization.
  ▪ Accelerator memoization is natural: little or no additional programming effort, and a built-in input-compute-output flow.
  ▪ But it is not straightforward: high lookup costs, unnecessary accesses, high access costs.
  ▪ COREx key ideas: hashing (reduce lookup costs), lookup filtering (fewer accesses), banking (reduce access costs).
  [Diagram: accelerator core with specialized compute lanes, scratchpad, and DMA engine alongside a general-purpose CMP and shared LLC.]
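Of the key ideas above, lookup filtering can be sketched with a small Bloom filter. The slides do not specify COREx's filter design, so this is an assumed implementation: a "definitely absent" answer from the filter lets the accelerator skip the costly reuse-table access entirely, at the price of rare false positives:

```python
import hashlib

class LookupFilter:
    """Tiny Bloom filter: a membership pre-check placed in front of the
    reuse table, so never-seen inputs skip the table access."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit array packed into a single int

    def _positions(self, key):
        # Derive `hashes` independent bit positions from the key.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def may_contain(self, key):
        # False means "definitely absent": safe to skip the table lookup.
        return all(self.array >> p & 1 for p in self._positions(key))

f = LookupFilter()
f.add("input-a")
print(f.may_contain("input-a"))  # True: no false negatives
print(f.may_contain("input-b"))  # False with high probability
```

Bloom filters never report a stored key as absent, so filtering can only remove unnecessary accesses, never correct reuse opportunities.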


  26. Top-level architecture. ▪ New modules:
  o Input Hashing Unit (IHU)
  o Input Lookup Unit (ILU)
  o Computation History Table (CHT)
  [Diagram: the accelerator (core, scratchpad, DMA engine), general-purpose CMP, and shared LLC sit on the SoC interconnect; the IHU connects through the COREx interconnect to the ILU (an associative cache of hashes with its cache controller) and the CHT (a RAM-array table with fetch and RAM-array controllers), each a memory chip/functional block with control and datapath.]
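The lookup path through the new modules can be sketched in software. This is a toy model: the dict and list stand in for the ILU's associative cache and the CHT's RAM array, the kernel is hypothetical, and real hardware would also verify the full input against hash collisions and bound table capacity:

```python
class CorexSketch:
    """Toy model of the COREx lookup path: the IHU hashes the input,
    the ILU checks an associative cache of recent hashes, and on a hit
    the CHT (computation history table) supplies the stored output."""
    def __init__(self, compute):
        self.compute = compute
        self.ilu = {}   # hash -> CHT index (models the associative cache)
        self.cht = []   # stored outputs (models the RAM-array table)

    def run(self, inputs):
        h = hash(inputs)                  # IHU: hash the input
        if h in self.ilu:                 # ILU: lookup hit?
            return self.cht[self.ilu[h]]  # CHT: fetch the stored output
        output = self.compute(inputs)     # miss: run the accelerator core
        self.ilu[h] = len(self.cht)       # install hash -> table index
        self.cht.append(output)           # store the output for reuse
        return output

acc = CorexSketch(lambda xs: sum(x * x for x in xs))  # hypothetical kernel
print(acc.run((1, 2, 3)))  # 14, computed by the core
print(acc.run((1, 2, 3)))  # 14, served from the CHT
```

Splitting the hash cache (ILU) from the output store (CHT) mirrors the slide's structure: the small associative lookup decides hit or miss cheaply before the larger table is touched.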
