SLIDE 1

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs and David Wentzlaff

ISCA 2018, Session 5A, June 5, 2018, Los Angeles, CA

SLIDE 2

Sources:

"Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016



SLIDE 5

Sources:

“Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective”, Hazelwood et al., HPCA 2018; “Cloud TPU”, Google, https://cloud.google.com/tpu/; “FPGA Accelerated Computing Using AWS F1 Instances”, David Pellerin, AWS Summit 2017; “Microsoft unveils Project Brainwave for real-time AI”, Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; “NVIDIA TESLA V100”, NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/



SLIDE 7

Transistor scaling stops. Chip specialization runs out of steam. What’s Next?

SLIDE 8

Observation I: The Density of Emerging Memories Is Projected to Increase

[Figure: ITRS Logic Roadmap projections]


SLIDE 10

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Temporal locality introduces redundancy in video encoders (recurrent blocks shown in white)

[Figure: video frames at t=0 s, t=2 s, and t=4 s; 0%, 38%, and 61% of blocks recur. Source: “Face recognition in unconstrained videos with matched background similarity”, Wolf et al., CVPR 2011]
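This block-level recurrence can be sketched in software: hash each tile of a frame and count how many tiles of the next frame were already seen. The tile size and frame buffers below are hypothetical toy values, not the encoder's real 16x16 pixel blocks.

```python
import hashlib

def tile_digests(frame: bytes, tile: int = 4):
    # Digest each fixed-size tile of a flat frame buffer
    # (a stand-in for the encoder's 16x16 pixel blocks).
    return [hashlib.sha256(frame[i:i + tile]).digest()
            for i in range(0, len(frame), tile)]

def recurrence(prev: bytes, cur: bytes, tile: int = 4) -> float:
    # Fraction of tiles in the current frame whose content
    # already appeared somewhere in the previous frame.
    seen = set(tile_digests(prev, tile))
    cur_tiles = tile_digests(cur, tile)
    return sum(d in seen for d in cur_tiles) / len(cur_tiles)

# A mostly static scene: only the last tile changes between frames.
frame0 = bytes([1] * 12 + [7] * 4)
frame1 = bytes([1] * 12 + [9] * 4)
print(recurrence(frame0, frame1))  # 0.75: 3 of 4 tiles recur
```

Every recurring tile is a computation whose output could be reused instead of recomputed.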


SLIDE 12

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Search-term commonality retrieves similar content

Example queries (Source: Google): “intercontinental downtown los angeles” and “hotel in downtown los angeles near intercontinental”


SLIDE 14

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Power laws suggest highly recurrent processing of popular content (Source: Twitter)
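A toy model of this effect: requests drawn from a Zipf-like popularity distribution revisit a small set of popular items, so the vast majority of computations recur. All parameters below (item count, request count, skew) are hypothetical.

```python
import random

def zipf_requests(n_items=1000, n_reqs=100_000, s=1.0, seed=0):
    # Draw request IDs from a Zipf-like popularity distribution:
    # item k gets weight 1/k**s (hypothetical parameters).
    rng = random.Random(seed)
    weights = [1.0 / k ** s for k in range(1, n_items + 1)]
    return rng.choices(range(n_items), weights=weights, k=n_reqs)

seen, hits = set(), 0
for r in zipf_requests():
    if r in seen:
        hits += 1        # a request whose result could be reused
    else:
        seen.add(r)      # first sighting: must be computed
print(f"recurrence: {hits / 100_000:.0%}")
```

With only 1,000 distinct items behind 100,000 requests, nearly every request after the warm-up is a repeat, which is exactly the opportunity memoization exploits.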


SLIDE 16

Memoization: tables store the outputs of past computations; recurring inputs reuse the stored outputs instead of recomputing. COREx: a Compute-Reuse Architecture for Accelerators.

[Diagram: baseline acceleration fabric. Host processors and the shared LLC/NoC feed the accelerator core and its scratchpad memory through a DMA engine; each input reaches the core, which produces the core result as the output.]
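The memoization flow can be sketched as a small software model; the MemoTable class and its API below are illustrative, not the hardware interface.

```python
import hashlib

class MemoTable:
    # Minimal software model of compute-reuse storage (hypothetical API):
    # keys are digests of the accelerator's input block, values are outputs.
    def __init__(self):
        self.table = {}
        self.hits = self.misses = 0

    def run(self, kernel, inp: bytes) -> bytes:
        key = hashlib.sha256(inp).digest()
        if key in self.table:           # input lookup hit:
            self.hits += 1              # reuse the stored output,
            return self.table[key]      # skip the core entirely
        self.misses += 1
        out = kernel(inp)               # miss: compute on the core
        self.table[key] = out           # and record input -> output
        return out

memo = MemoTable()
double = lambda b: bytes(2 * x % 256 for x in b)
memo.run(double, b"\x01\x02")
memo.run(double, b"\x01\x02")   # recurring input: served from the table
print(memo.hits, memo.misses)   # 1 1
```

The accelerator's built-in input-compute-output flow maps directly onto this lookup-or-compute pattern, which is why the slides call accelerator memoization "natural".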

SLIDE 17

[Diagram: the same fabric augmented with compute-reuse storage. Each input is first looked up; on a hit, the fetched result is used in place of the core result and the accelerator core is bypassed.]



SLIDE 22

Architectural Guidelines

▪ Accelerator Memoization is Natural

  • Little or no additional programming effort
  • Built-in input-compute-output flow

▪ But Not Straightforward!

  • High lookup costs
  • Unnecessary accesses
  • High access costs

▪ COREx Key Ideas:

  • Hashing (reduce lookup costs)
  • Lookup filtering (fewer accesses)
  • Banking (reduce access costs)

[Diagram: accelerator core with specialized compute lanes and scratchpad, DMA engine, and a general-purpose CMP with shared LLC; input -> compute -> output flow.]

Goal: Extend Specialization with Workload-Specific Memoization
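The three key ideas can be sketched together in software, with hypothetical sizes and names; a real design would use hardware hash units and set-associative banks rather than Python dicts.

```python
import hashlib

class FilteredBankedTable:
    # Sketch of the three key ideas (all names and sizes hypothetical):
    # hashing (compact fixed-size keys), lookup filtering (a small
    # bit-vector rejects inputs that cannot hit, avoiding a storage
    # access), and banking (the table is split so each lookup touches
    # only one small bank).
    def __init__(self, banks=4, filter_bits=1024):
        self.banks = [dict() for _ in range(banks)]
        self.filter = [False] * filter_bits

    def _key(self, inp: bytes):
        h = hashlib.sha256(inp).digest()
        return h, int.from_bytes(h[:4], "big")

    def lookup(self, inp: bytes):
        h, v = self._key(inp)
        if not self.filter[v % len(self.filter)]:
            return None                      # filtered: no storage access
        return self.banks[v % len(self.banks)].get(h)

    def insert(self, inp: bytes, out: bytes):
        h, v = self._key(inp)
        self.filter[v % len(self.filter)] = True
        self.banks[v % len(self.banks)][h] = out

t = FilteredBankedTable()
assert t.lookup(b"new input") is None        # likely rejected by the filter
t.insert(b"seen", b"result")
print(t.lookup(b"seen"))
```

The filter can give false positives (a wasted bank access) but never false negatives, so correctness only depends on the full hash match inside the bank.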


SLIDE 26

Top Level Architecture

▪ New Modules:

  • Input Hashing Unit (IHU)
  • Input Lookup Unit (ILU)
  • Computation History Table (CHT)

[Diagram: the IHU hashes each input; the ILU, an associative cache with its own controller, matches input hashes; on a match, the stored output is fetched from the CHT, a RAM-array table with its own controller, and used. The modules sit on a COREx interconnect beside the accelerator core (specialized compute lanes, scratchpad, DMA engine) and the SoC interconnect to the general-purpose CMP and shared LLC.]
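The three modules can be modeled as a hash stage, a small LRU lookup cache, and a dense output table. Class names follow the slide's acronyms; the sizes and the LRU eviction policy are assumptions for illustration.

```python
import hashlib
from collections import OrderedDict

def ihu(inp: bytes) -> bytes:
    # Input Hashing Unit: reduce an arbitrary input block to a fixed digest.
    return hashlib.sha256(inp).digest()

class ILU:
    # Input Lookup Unit: small associative cache of recently seen hashes,
    # mapping a hash to a CHT slot; least-recently-used entries are evicted.
    def __init__(self, entries=4):
        self.entries, self.map = entries, OrderedDict()
    def probe(self, h):
        if h in self.map:
            self.map.move_to_end(h)
            return self.map[h]
        return None
    def fill(self, h, slot):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)
        self.map[h] = slot

class CHT:
    # Computation History Table: dense RAM array of stored outputs.
    def __init__(self):
        self.rows = []
    def append(self, out):
        self.rows.append(out)
        return len(self.rows) - 1
    def fetch(self, slot):
        return self.rows[slot]

ilu, cht = ILU(), CHT()
def run(kernel, inp: bytes) -> bytes:
    h = ihu(inp)                    # hash the input
    slot = ilu.probe(h)             # match against recent input hashes
    if slot is not None:
        return cht.fetch(slot)      # hit: fetch and use the stored output
    out = kernel(inp)               # miss: run the accelerator core
    ilu.fill(h, cht.append(out))
    return out

rev = lambda b: b[::-1]
print(run(rev, b"ab"), run(rev, b"ab"))  # second call served from the CHT
```

Splitting the small, frequently probed ILU from the large CHT mirrors the slide's point that only a match should pay the cost of accessing the dense output storage.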



SLIDE 31

Building COREx

IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.

Case Study: Acceleration of Video Motion Estimation

▪ Optimization Goals:

  • Runtime, Energy, and Energy-Delay Product (EDP)

▪ Baseline: highly-tuned accelerators

  • Sweep the space of design alternatives (Aladdin)
  • Find the optimal accelerator design for each goal

Runtime OPT: 5.8 µs. Energy OPT: 6.2 µJ. EDP OPT: 148.7 pJ·s.
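Why three separate optima exist: the designs that minimize runtime, energy, and EDP are generally different points in the sweep. A toy comparison with made-up numbers (not the paper's design points):

```python
# Energy-delay product for candidate designs (numbers hypothetical):
# the design that minimizes runtime, energy, and EDP need not be the same.
designs = {
    "A (fast, hungry)": {"runtime_us": 4.0, "energy_uj": 12.0},
    "B (balanced)":     {"runtime_us": 6.0, "energy_uj": 7.0},
    "C (frugal, slow)": {"runtime_us": 11.0, "energy_uj": 5.0},
}
for d in designs.values():
    d["edp_pjs"] = d["runtime_us"] * d["energy_uj"]  # us * uJ = pJ*s
for metric in ("runtime_us", "energy_uj", "edp_pjs"):
    best = min(designs, key=lambda n: designs[n][metric])
    print(f"best {metric}: {best}")
```

Here design A wins on runtime, C on energy, and B on EDP, which is why the study reports a separate optimal configuration per goal.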


SLIDE 33

Building COREx

▪ Memoization-Layers Specialization

  • Extract input traces; examine hit and miss rates for different ILU/CHT sizes.
  • Integrate accelerators with an emerging-memory-based ILU+CHT and sweep the space of gains.

▪ Example: Resistive-RAM-based COREx

  • Energy Optimization: 56.6% energy saved (64 KB ILU, 8 MB CHT)
  • EDP Optimization: 63.5% EDP saved (512 KB ILU, 2 GB CHT)
  • Runtime Optimization: 2.7x speedup (512 KB ILU, 32 GB CHT)


SLIDE 38

Experimental Setup

Methodology

  • Evaluate ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack (Destiny)
  • Integrate with highly-tuned accelerators (Aladdin)

Workloads

Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 million search queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets, 10 million Zipfian transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing, 10 million Zipfian transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize, 10 million Zipfian transactions.

Redundancy sources: temporal redundancy (video workloads), search commonality (query workloads), and content popularity (Zipfian workloads, at 75%, 90%, and 95% recurrence).


SLIDE 41

Results

▪ Runtime-OPT: Avg. 6.0-6.4x Speedup

  • Negligible differences between memories

▪ EDP-OPT: Avg. 50%-68% Savings

  • PCM/Racetrack: high write energy
  • Lower gains for low-bias apps (frequent updates)

▪ Energy-OPT: Avg. 22%-50% Savings

  • PCM not beneficial for 75%-bias SSSP/RBM

▪ General Trends:

  • Large CHTs (MBs-TBs) for speedup; smaller (KBs-GBs) for EDP; smallest (KBs-MBs) for energy


SLIDE 43

Conclusions

▪ Memoization is Fit for Accelerators

  • Memoization-ready programming environment and interface

▪ Memoization is Fit for Datacenters

  • Temporal redundancy, search commonality, content popularity


SLIDE 45

Conclusions

▪ COREx Extends Hardware Specialization

  • Memoization-layer specialization tailored to the workload

▪ COREx Opens New Opportunities for Future Architectures

  • Shift compute from non-scaling CMOS to still-scaling memories

SLIDE 46

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs (adif@princeton.edu) and David Wentzlaff (wentzlaf@princeton.edu)