SLIDE 1

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs and David Wentzlaff

ISCA 2018, Session 5A, June 5, 2018, Los Angeles, CA

SLIDE 2

Sources:

"Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016



SLIDE 5

Sources:

“Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective”, Hazelwood et al., HPCA 2018; “Cloud TPU”, Google, https://cloud.google.com/tpu/; “FPGA Accelerated Computing Using AWS F1 Instances”, David Pellerin, AWS Summit 2017; “Microsoft unveils Project Brainwave for real-time AI”, Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; “NVIDIA TESLA V100”, NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/



SLIDE 7

Transistor scaling stops. Chip specialization runs out of steam. What’s Next?

SLIDE 8

Observation I: The Density of Emerging Memories Is Projected to Increase

[Figure: ITRS Logic Roadmap projections]


SLIDE 10

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Temporal locality introduces redundancy in video encoders (recurrent blocks shown in white)

[Figure: video frames at t=0 s, t=2 s, and t=4 s; 0%, 38%, and 61% of blocks recur. Source: “Face recognition in unconstrained videos with matched background similarity”, Wolf et al., CVPR 2011]
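This block-level recurrence can be sketched in software: hash each tile of a frame and count how many tiles of the next frame were already seen. The tile size and frame buffers below are hypothetical toy values, not the encoder's real 16x16 pixel blocks.

```python
import hashlib

def tile_digests(frame: bytes, tile: int = 4):
    # Digest each fixed-size tile of a flat frame buffer
    # (a stand-in for the encoder's 16x16 pixel blocks).
    return [hashlib.sha256(frame[i:i + tile]).digest()
            for i in range(0, len(frame), tile)]

def recurrence(prev: bytes, cur: bytes, tile: int = 4) -> float:
    # Fraction of tiles in the current frame whose content
    # already appeared somewhere in the previous frame.
    seen = set(tile_digests(prev, tile))
    cur_tiles = tile_digests(cur, tile)
    return sum(d in seen for d in cur_tiles) / len(cur_tiles)

# A mostly static scene: only the last tile changes between frames.
frame0 = bytes([1] * 12 + [7] * 4)
frame1 = bytes([1] * 12 + [9] * 4)
print(recurrence(frame0, frame1))  # 0.75: 3 of 4 tiles recur
```

Every recurring tile is a computation whose output could be reused instead of recomputed.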


SLIDE 12

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Search-term commonality retrieves similar content

Example queries (Source: Google): “intercontinental downtown los angeles” and “hotel in downtown los angeles near intercontinental”


SLIDE 14

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Power laws suggest highly recurrent processing of popular content (Source: Twitter)
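A toy model of this effect: requests drawn from a Zipf-like popularity distribution revisit a small set of popular items, so the vast majority of computations recur. All parameters below (item count, request count, skew) are hypothetical.

```python
import random

def zipf_requests(n_items=1000, n_reqs=100_000, s=1.0, seed=0):
    # Draw request IDs from a Zipf-like popularity distribution:
    # item k gets weight 1/k**s (hypothetical parameters).
    rng = random.Random(seed)
    weights = [1.0 / k ** s for k in range(1, n_items + 1)]
    return rng.choices(range(n_items), weights=weights, k=n_reqs)

seen, hits = set(), 0
for r in zipf_requests():
    if r in seen:
        hits += 1        # a request whose result could be reused
    else:
        seen.add(r)      # first sighting: must be computed
print(f"recurrence: {hits / 100_000:.0%}")
```

With only 1,000 distinct items behind 100,000 requests, nearly every request after the warm-up is a repeat, which is exactly the opportunity memoization exploits.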


SLIDE 16

Memoization: tables store the outputs of past computations; recurring inputs reuse the stored outputs instead of recomputing. COREx: a Compute-Reuse Architecture for Accelerators.

[Diagram: baseline acceleration fabric. Host processors and the shared LLC/NoC feed the accelerator core and its scratchpad memory through a DMA engine; each input reaches the core, which produces the core result as the output.]
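The memoization flow can be sketched as a small software model; the MemoTable class and its API below are illustrative, not the hardware interface.

```python
import hashlib

class MemoTable:
    # Minimal software model of compute-reuse storage (hypothetical API):
    # keys are digests of the accelerator's input block, values are outputs.
    def __init__(self):
        self.table = {}
        self.hits = self.misses = 0

    def run(self, kernel, inp: bytes) -> bytes:
        key = hashlib.sha256(inp).digest()
        if key in self.table:           # input lookup hit:
            self.hits += 1              # reuse the stored output,
            return self.table[key]      # skip the core entirely
        self.misses += 1
        out = kernel(inp)               # miss: compute on the core
        self.table[key] = out           # and record input -> output
        return out

memo = MemoTable()
double = lambda b: bytes(2 * x % 256 for x in b)
memo.run(double, b"\x01\x02")
memo.run(double, b"\x01\x02")   # recurring input: served from the table
print(memo.hits, memo.misses)   # 1 1
```

The accelerator's built-in input-compute-output flow maps directly onto this lookup-or-compute pattern, which is why the slides call accelerator memoization "natural".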

SLIDE 17

[Diagram: the same fabric augmented with compute-reuse storage. Each input is first looked up; on a hit, the fetched result is used in place of the core result and the accelerator core is bypassed.]



SLIDE 22

Architectural Guidelines

▪ Accelerator Memoization is Natural

  • Little or no additional programming effort
  • Built-in input-compute-output flow

▪ But Not Straightforward!

  • High lookup costs
  • Unnecessary accesses
  • High access costs

▪ COREx Key Ideas:

  • Hashing (reduce lookup costs)
  • Lookup filtering (fewer accesses)
  • Banking (reduce access costs)

[Diagram: accelerator core with specialized compute lanes and scratchpad, DMA engine, and a general-purpose CMP with shared LLC; input -> compute -> output flow.]

Goal: Extend Specialization with Workload-Specific Memoization
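The three key ideas can be sketched together in software, with hypothetical sizes and names; a real design would use hardware hash units and set-associative banks rather than Python dicts.

```python
import hashlib

class FilteredBankedTable:
    # Sketch of the three key ideas (all names and sizes hypothetical):
    # hashing (compact fixed-size keys), lookup filtering (a small
    # bit-vector rejects inputs that cannot hit, avoiding a storage
    # access), and banking (the table is split so each lookup touches
    # only one small bank).
    def __init__(self, banks=4, filter_bits=1024):
        self.banks = [dict() for _ in range(banks)]
        self.filter = [False] * filter_bits

    def _key(self, inp: bytes):
        h = hashlib.sha256(inp).digest()
        return h, int.from_bytes(h[:4], "big")

    def lookup(self, inp: bytes):
        h, v = self._key(inp)
        if not self.filter[v % len(self.filter)]:
            return None                      # filtered: no storage access
        return self.banks[v % len(self.banks)].get(h)

    def insert(self, inp: bytes, out: bytes):
        h, v = self._key(inp)
        self.filter[v % len(self.filter)] = True
        self.banks[v % len(self.banks)][h] = out

t = FilteredBankedTable()
assert t.lookup(b"new input") is None        # likely rejected by the filter
t.insert(b"seen", b"result")
print(t.lookup(b"seen"))
```

The filter can give false positives (a wasted bank access) but never false negatives, so correctness only depends on the full hash match inside the bank.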


SLIDE 26

Top Level Architecture

▪ New Modules:

  • Input Hashing Unit (IHU)
  • Input Lookup Unit (ILU)
  • Computation History Table (CHT)

[Diagram: the IHU hashes each input; the ILU, an associative cache with its own controller, matches input hashes; on a match, the stored output is fetched from the CHT, a RAM-array table with its own controller, and used. The modules sit on a COREx interconnect beside the accelerator core (specialized compute lanes, scratchpad, DMA engine) and the SoC interconnect to the general-purpose CMP and shared LLC.]
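The three modules can be modeled as a hash stage, a small LRU lookup cache, and a dense output table. Class names follow the slide's acronyms; the sizes and the LRU eviction policy are assumptions for illustration.

```python
import hashlib
from collections import OrderedDict

def ihu(inp: bytes) -> bytes:
    # Input Hashing Unit: reduce an arbitrary input block to a fixed digest.
    return hashlib.sha256(inp).digest()

class ILU:
    # Input Lookup Unit: small associative cache of recently seen hashes,
    # mapping a hash to a CHT slot; least-recently-used entries are evicted.
    def __init__(self, entries=4):
        self.entries, self.map = entries, OrderedDict()
    def probe(self, h):
        if h in self.map:
            self.map.move_to_end(h)
            return self.map[h]
        return None
    def fill(self, h, slot):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)
        self.map[h] = slot

class CHT:
    # Computation History Table: dense RAM array of stored outputs.
    def __init__(self):
        self.rows = []
    def append(self, out):
        self.rows.append(out)
        return len(self.rows) - 1
    def fetch(self, slot):
        return self.rows[slot]

ilu, cht = ILU(), CHT()
def run(kernel, inp: bytes) -> bytes:
    h = ihu(inp)                    # hash the input
    slot = ilu.probe(h)             # match against recent input hashes
    if slot is not None:
        return cht.fetch(slot)      # hit: fetch and use the stored output
    out = kernel(inp)               # miss: run the accelerator core
    ilu.fill(h, cht.append(out))
    return out

rev = lambda b: b[::-1]
print(run(rev, b"ab"), run(rev, b"ab"))  # second call served from the CHT
```

Splitting the small, frequently probed ILU from the large CHT mirrors the slide's point that only a match should pay the cost of accessing the dense output storage.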



SLIDE 31

Building COREx

IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.

Case Study: Acceleration of Video Motion Estimation

▪ Optimization Goals:

  • Runtime, Energy, and Energy-Delay Product (EDP)

▪ Baseline: highly-tuned accelerators

  • Sweep the space of design alternatives (Aladdin)
  • Find the optimal accelerator design for each goal

Runtime OPT: 5.8 µs. Energy OPT: 6.2 µJ. EDP OPT: 148.7 pJ·s.
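Why three separate optima exist: the designs that minimize runtime, energy, and EDP are generally different points in the sweep. A toy comparison with made-up numbers (not the paper's design points):

```python
# Energy-delay product for candidate designs (numbers hypothetical):
# the design that minimizes runtime, energy, and EDP need not be the same.
designs = {
    "A (fast, hungry)": {"runtime_us": 4.0, "energy_uj": 12.0},
    "B (balanced)":     {"runtime_us": 6.0, "energy_uj": 7.0},
    "C (frugal, slow)": {"runtime_us": 11.0, "energy_uj": 5.0},
}
for d in designs.values():
    d["edp_pjs"] = d["runtime_us"] * d["energy_uj"]  # us * uJ = pJ*s
for metric in ("runtime_us", "energy_uj", "edp_pjs"):
    best = min(designs, key=lambda n: designs[n][metric])
    print(f"best {metric}: {best}")
```

Here design A wins on runtime, C on energy, and B on EDP, which is why the study reports a separate optimal configuration per goal.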


SLIDE 33

Building COREx

▪ Memoization-Layers Specialization

  • Extract input traces; examine hit and miss rates for different ILU/CHT sizes.
  • Integrate accelerators with an emerging-memory-based ILU+CHT and sweep the space of gains.

▪ Example: Resistive-RAM-based COREx

  • Energy Optimization: 56.6% energy saved (64 KB ILU, 8 MB CHT)
  • EDP Optimization: 63.5% EDP saved (512 KB ILU, 2 GB CHT)
  • Runtime Optimization: 2.7x speedup (512 KB ILU, 32 GB CHT)


SLIDE 38

Experimental Setup

Methodology

  • Evaluate ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack (Destiny)
  • Integrate with highly-tuned accelerators (Aladdin)

Workloads

Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 million search queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets, 10 million Zipfian transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing, 10 million Zipfian transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize, 10 million Zipfian transactions.

Redundancy sources: temporal redundancy (video workloads), search commonality (query workloads), and content popularity (Zipfian workloads, at 75%, 90%, and 95% recurrence).


SLIDE 41

Results

▪ Runtime-OPT: Avg. 6.0-6.4x Speedup

  • Negligible differences between memories

▪ EDP-OPT: Avg. 50%-68% Savings

  • PCM/Racetrack: high write energy
  • Lower gains for low-bias apps (frequent updates)

▪ Energy-OPT: Avg. 22%-50% Savings

  • PCM not beneficial for 75%-bias SSSP/RBM

▪ General Trends:

  • Large CHTs (MBs-TBs) for speedup; smaller (KBs-GBs) for EDP; smallest (KBs-MBs) for energy


SLIDE 43

Conclusions

▪ Memoization is Fit for Accelerators

  • Memoization-ready programming environment and interface

▪ Memoization is Fit for Datacenters

  • Temporal redundancy, search commonality, content popularity


SLIDE 45

Conclusions

▪ COREx Extends Hardware Specialization

  • Memoization-layer specialization tailored to the workload

▪ COREx Opens New Opportunities for Future Architectures

  • Shift compute from non-scaling CMOS to still-scaling memories

SLIDE 46

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs (adif@princeton.edu) and David Wentzlaff (wentzlaf@princeton.edu)