Scaling Datacenter Accelerators With Compute-Reuse Architectures
Adi Fuchs and David Wentzlaff
ISCA 2018 Session 5A June 5, 2018 Los Angeles, CA
Sources:
"Cramming more components onto integrated circuits", G. E. Moore, Electronics, 1965
"Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016
Sources:
"Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018
"Cloud TPU", Google, https://cloud.google.com/tpu/
"FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017
"Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
"NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/
Observation I: The Density of Emerging Memories Is Projected to Increase
ITRS Logic Roadmap
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Temporal locality introduces redundancy in video encoders (recurrent blocks in white)
t=0 sec: 0% recurrence; t=2 sec: 38% recurrence; t=4 sec: 61% recurrence
Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Search term commonality retrieves similar content:
"intercontinental downtown los angeles"
"hotel in downtown los angeles near intercontinental"
Source: Google
▪ Power laws suggest high recurrent processing of popular content
Source: Twitter
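To make the power-law intuition concrete, here is a small illustrative sketch (not from the talk) showing that under a Zipf popularity distribution, a tiny fraction of distinct inputs accounts for most requests, which is exactly what makes memoizing popular computations worthwhile. The function and its parameters are hypothetical.

```python
# Illustrative sketch (not from the talk): with Zipf-distributed
# popularity, the top few items receive most of the requests, so a
# memoization table covering only popular inputs captures most reuse.

def zipf_coverage(n_items, top_k, s=1.0):
    """Fraction of requests hitting the top_k most popular of n_items,
    assuming item i occurs with probability proportional to 1 / i**s."""
    weights = [1.0 / (i ** s) for i in range(1, n_items + 1)]
    return sum(weights[:top_k]) / sum(weights)

# With 1,000,000 distinct inputs, memoizing just the top 1% already
# covers roughly two thirds of all requests at s = 1.0.
print(round(zipf_coverage(1_000_000, 10_000), 3))
```

The exact coverage depends on the Zipf exponent `s`; heavier-tailed traffic (larger `s`) concentrates even more reuse in a smaller table.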
Memoization: tables store past computation outputs, so outputs of recurring inputs are reused instead of recomputed.
COREx: Compute-Reuse Architecture For Accelerators
[Diagram: the acceleration fabric (accelerator core, scratchpad memory, DMA engine) connects to the host processors through the shared LLC/NoC; an input-lookup stage checks each input, and on a hit the fetched result is used in place of the core result.]
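The memoization idea above can be captured in a few lines. This is an illustrative software sketch, not the COREx hardware: a table keyed by the input holds past outputs, and a recurring input returns the stored output instead of recomputing. The kernel and call counter are made up for demonstration.

```python
# Memoization in one screen (software analogy, not the hardware design):
# reuse outputs of recurring inputs instead of recomputing them.
def memoize(compute):
    table = {}                    # past input -> stored output
    def run(x):
        if x not in table:        # first occurrence: compute and store
            table[x] = compute(x)
        return table[x]           # recurring input: reuse stored output
    return run

calls = 0
def expensive_kernel(x):
    global calls
    calls += 1                    # stands in for the accelerator core
    return x * x

fast_kernel = memoize(expensive_kernel)
results = [fast_kernel(v) for v in (3, 5, 3, 3, 5)]
print(results, calls)  # five invocations, only two real computations
```

The savings grow with the recurrence rate of the input stream, which is why the datacenter observations above (temporal locality, search commonality, content popularity) matter.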
▪ Accelerator Memoization is Natural
▪ But Not Straightforward!
▪ COREx Key Ideas:
[Diagram: baseline accelerator core with specialized compute lanes, scratchpad, and DMA engine, attached to a general-purpose CMP via the shared LLC; each invocation reads an input, computes, and writes an output.]
▪ New Modules:
IHU: hashes the accelerator's inputs.
ILU: hashes + associative cache + cache controller; matches the input hash against recently seen inputs.
CHT: RAM-array table + RAM-array controller; stores past outputs for fetching.
COREx interconnect, alongside the existing SoC interconnect and datapath control.
[Diagram: input → IHU hash → ILU match → CHT fetch → use output; on a miss, the accelerator core (specialized compute lanes, scratchpad, DMA engine, shared LLC with the general-purpose CMP) computes as usual.]
IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.
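The module roles can be sketched as a software analogy, under stated assumptions rather than as the hardware design: the IHU produces a compact hash of the input block, the small ILU tracks recently seen hashes, and on a match the output is fetched from the larger CHT instead of being recomputed. Class and function names here are invented for illustration; a real design would also have to deal with hash collisions, which this sketch ignores.

```python
# Software analogy for the COREx lookup path (illustrative sketch only).
import hashlib
from collections import OrderedDict

def ihu_hash(data: bytes) -> int:
    """IHU analogy: compact signature of an input block."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

class CorexLookup:
    def __init__(self, compute_fn, ilu_entries=4):
        self.compute = compute_fn         # fallback: the accelerator core
        self.ilu = OrderedDict()          # small: input hash -> CHT slot
        self.ilu_entries = ilu_entries
        self.cht = {}                     # large: slot -> stored output
        self.next_slot = 0

    def run(self, data: bytes):
        h = ihu_hash(data)
        if h in self.ilu:                 # ILU match: fetch from the CHT
            self.ilu.move_to_end(h)       # refresh LRU position
            return self.cht[self.ilu[h]], "hit"
        result = self.compute(data)       # miss: the core computes
        if len(self.ilu) >= self.ilu_entries:
            _, old_slot = self.ilu.popitem(last=False)
            del self.cht[old_slot]        # evict the LRU entry
        slot, self.next_slot = self.next_slot, self.next_slot + 1
        self.ilu[h] = slot                # install for future reuse
        self.cht[slot] = result
        return result, "miss"

lookup = CorexLookup(len)
print(lookup.run(b"frame0")[1], lookup.run(b"frame0")[1])  # miss hit
```

The two-level split mirrors the slide: a small, fast associative structure (ILU) filters lookups before touching the large, dense output store (CHT).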
Case Study: Acceleration of Video Motion Estimation
▪ Optimization Goals: runtime, energy, and EDP
▪ Baseline: highly-tuned accelerators (Runtime OPT: 5.8 µs; Energy OPT: 6.2 µJ; EDP OPT: 148.7 pJ·s)
▪ Memoization-Layers Specialization
▪ Example: Resistive-RAM-based COREx
Energy optimization: 56.6% energy saved (64 KB ILU, 8 MB CHT)
EDP optimization: 63.5% EDP saved (512 KB ILU, 2 GB CHT)
Runtime optimization: 2.7x speedup (512 KB ILU, 32 GB CHT)
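The case study picks a different ILU/CHT sizing for each target, because runtime, energy, and energy-delay product (EDP = energy × runtime) reward different trade-offs. The sketch below uses hypothetical candidate numbers, not measurements from the paper, to show how the winning configuration changes with the metric:

```python
# Illustrative only: hypothetical (runtime, energy) points per sizing,
# showing that each optimization target can prefer a different config.
def edp(runtime_s, energy_j):
    return runtime_s * energy_j   # energy-delay product

# (name, runtime in seconds, energy in joules) -- made-up candidates
configs = [
    ("small ILU/CHT",  5.0e-6, 5.0e-6),   # cheap lookups, fewer hits
    ("medium ILU/CHT", 3.0e-6, 6.0e-6),
    ("large ILU/CHT",  2.5e-6, 8.0e-6),   # most hits, costly lookups
]

best_runtime = min(configs, key=lambda c: c[1])[0]
best_energy  = min(configs, key=lambda c: c[2])[0]
best_edp     = min(configs, key=lambda c: edp(c[1], c[2]))[0]
print(best_runtime, best_energy, best_edp)
```

A larger CHT raises the hit rate (helping runtime) but costs more energy per lookup, which is consistent with the slide's pattern of small tables for the energy target and large tables for the runtime target.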
Workloads

Kernel          Domain            Use-Case                              App Source          Input Source and Description
DCT             Video Encoding    Video Server                          x264                YouTube Faces; 10 videos, 10 seconds, 24 FPS
SAD             Video Encoding    Video Server                          PARBOIL             YouTube Faces; 10 videos, 10 seconds, 24 FPS
SNAPPY ("SNP")  Compression       Web-Server Traffic Compression        TailBench Snappy-C  Wikipedia Abstracts; 13 million search queries
SSSP ("SSP")    Graph Processing  Maps Service: Shortest Walking Route  Internal            DIMACS NYC Streets; 10 million Zipfian transactions
BFS             Graph Processing  Online Retail                         MachSuite           Amazon Co-Purchasing; 10 million Zipfian transactions
RBM             Machine Learning  Collaborative Filtering               CortexSuite         Netflix Prize; 10 million Zipfian transactions

Redundancy sources: Temporal Redundancy, Search Commonality, Content Popularity (75%, 90%, 95% Recurrence)

Methodology
▪ Runtime-OPT: Avg. 6.0-6.4x Speedup
▪ EDP-OPT: Avg. 50%-68% Savings
▪ Energy-OPT: Avg. 22%-50% Savings
▪ General Trends:
▪ Memoization is Fit for Accelerators
▪ Memoization is Fit for Datacenters
▪ COREx Extends Hardware Specialization
▪ COREx Opens New Opportunities for Future Architectures
Adi Fuchs David Wentzlaff adif@princeton.edu wentzlaf@princeton.edu