


SLIDE 1

Convolution Engine

Balancing Efficiency & Flexibility in Specialized Computing

Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark A. Horowitz
Stanford University

Kevin Loughlin and Ian Neal

SLIDE 2

The Problem

  • Many workloads require specialized hardware to meet performance expectations.

  • Ex. image processing (more on that shortly…)
  • Unfortunately, performant specialized hardware comes at the cost of flexibility.

○ In some cases, power-efficiency is sacrificed as well!

  • How can we create specialized HW that can balance these 3 factors?

○ Reasonably performant
○ Reasonably power-efficient
○ Reasonably flexible (programmable)

Qadeer et al. Convolution Engine

SLIDE 3

Example: Image Processing

  • Image (and video) processing calls for specialized HW

○ General-purpose HW not optimized for such high data parallelism

  • Traditional Solution: Single Instruction Multiple Data (SIMD) Units

○ Extremely flexible (programmable), but way too slow

  • Alternative: GPUs

○ Still flexible, and more performant than SIMD Units, but consume way too much power

  • Another Alternative: ASIC Accelerators

○ Very performant and power-efficient, but at the cost of flexibility -- only apply to 1 algorithm

SLIDE 4

Image Processing in Action


Source: Powell, Victor. “Image Kernels: Explained Visually.” http://setosa.io/ev/image-kernels/

SLIDE 5

Key Insight: Image Processing is Convolution-like

  • Convolution: Apply a mapping function to a stencil (chunk) of data, perform a reduction on the result, then shift stencil and repeat

○ Iterative map-then-reduce: common occurrence in image processing

  • Use this insight to abstract over image processing algorithms

○ Rather than build an ASIC for 1 algorithm, build specialized HW for the class of algorithms
○ Allow users to program this specialized HW based on specific application needs

  • Idea: Convolution Engine (CE)

○ An architecture that yields reasonable performance, power, and flexibility numbers for convolution-like algorithms
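
The map-then-reduce pattern described on this slide can be sketched in a few lines of Python. This is only an illustrative software model, not the CE hardware or its instruction set; the function name `stencil_map_reduce`, its parameters, and the sample data are assumptions made for this example.

```python
from functools import reduce
from operator import add, mul


def stencil_map_reduce(data, coeffs, map_op, reduce_op):
    """Slide a stencil over `data`; at each position apply `map_op`
    element-wise against `coeffs`, then fold the results with `reduce_op`."""
    width = len(coeffs)
    out = []
    for start in range(len(data) - width + 1):
        window = data[start:start + width]                       # current stencil
        mapped = [map_op(x, c) for x, c in zip(window, coeffs)]  # map step
        out.append(reduce(reduce_op, mapped))                    # reduce step
    return out


def abs_diff(a, b):
    return abs(a - b)


signal = [1, 2, 3, 4, 5]

# multiply + add: an ordinary 1D filter (coefficients applied without flipping)
smoothed = stencil_map_reduce(signal, [1, 2, 1], mul, add)

# absolute difference + add: sum of absolute differences (SAD),
# the kind of block-matching metric used in motion estimation
sad = stencil_map_reduce(signal, [2, 3, 4], abs_diff, add)
```

Only the two operator arguments change between the filter and the SAD computation, which is the sense in which one piece of hardware could cover a class of algorithms rather than a single one.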

SLIDE 6

Design: Improving Efficiency

  • Register file overheads (1D and 2D registers)

○ Shift registers are a natural extension of moving stencils

  • Load/store unit

○ Multiple memory access widths, unaligned accesses

  • Keeping things simple

○ Interface Units (IF) arrange data as needed for map operation
○ Functional Units are just 2-input ALUs on pre-arranged data

  • Complex Graph Fusion Unit (CGFU)

○ Combine up to 9 different convolution instructions into one “super instruction” in reduce

  • Lightweight SIMD Unit for all else

○ No multiplication, just add/subtract-type instructions
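
To see why shift registers suit a moving stencil, here is a minimal software model, assuming a hypothetical `ShiftRegister1D` class. Each step loads only the one new element, and the remaining stencil values are reused from the register instead of being re-read from memory. The real CE register files and interface units are hardware structures; this sketch only illustrates the data reuse.

```python
from collections import deque


class ShiftRegister1D:
    """Hypothetical model of a 1D shift register holding one stencil."""

    def __init__(self, width):
        self.cells = deque(maxlen=width)  # oldest value drops out automatically
        self.loads = 0                    # elements fetched from "memory" so far

    def shift_in(self, value):
        self.cells.append(value)
        self.loads += 1

    def full(self):
        return len(self.cells) == self.cells.maxlen

    def window(self):
        return list(self.cells)


row = [10, 20, 30, 40, 50, 60]
reg = ShiftRegister1D(width=3)
windows = []
for pixel in row:
    reg.shift_in(pixel)           # exactly one new load per step
    if reg.full():
        windows.append(reg.window())

# Loads needed if every 3-wide stencil were re-read from memory instead
naive_loads = len(windows) * 3
```

In this toy run the shift register performs 6 loads for 4 stencil positions, where re-reading each stencil would take 12.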

SLIDE 7

Design: Providing Flexibility

  • CE is a processor extension

○ Small set of new ISA instructions
○ Issued through C code compiler intrinsics

  • Configuration registers for kernel-constant values

○ Convolution size, ALU operation, etc.

  • Completely software controlled

○ Can interleave non-CE instructions before next convolution iteration

  • Chained processors (slices) can be used for more complex convolution
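
The programming model on this slide (set kernel-constant values once, then issue convolution steps under software control, with ordinary instructions in between) might look roughly like the following. This is a Python stand-in, not the actual C intrinsics, which the slides do not list; `CEConfig` and `ce_step` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable


@dataclass
class CEConfig:
    """Hypothetical stand-in for the configuration registers."""
    kernel_size: int
    map_op: Callable[[int, int], int]
    reduce_op: Callable[[int, int], int]


def ce_step(cfg, data, coeffs, pos):
    """Hypothetical stand-in for one convolution instruction at position `pos`."""
    window = data[pos:pos + cfg.kernel_size]
    mapped = [cfg.map_op(x, c) for x, c in zip(window, coeffs)]
    return reduce(cfg.reduce_op, mapped)


# Configure once per kernel: size and ALU operations stay constant.
cfg = CEConfig(kernel_size=3,
               map_op=lambda x, c: x * c,
               reduce_op=lambda a, b: a + b)
data, coeffs = [4, 8, 12, 16], [1, 1, 1]

results = []
for pos in range(len(data) - cfg.kernel_size + 1):
    total = ce_step(cfg, data, coeffs, pos)    # "CE instruction"
    results.append(total // cfg.kernel_size)   # ordinary code between iterations
```

Swapping `map_op` or `kernel_size` in the configuration changes the kernel without touching the loop, which mirrors the flexibility the slide attributes to configuration registers.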

SLIDE 8

Evaluation

  • 3 different algorithms

○ H.264 motion estimation (video encoding)
○ SIFT (feature detection)
○ Demosaic (reconstruct a full-color image from raw camera sensor data)

  • Measure “custom” ASIC vs CE vs SIMD
  • Vary programmability of CE as well

○ Fixed kernel (equivalent to custom ASIC)
○ Multiple kernel sizes (more flexibility in interface units, register files, reduction stage)
○ Multiple flows (different dimensions, access patterns, but same operations)
○ Multiple arithmetic operations (full flexibility)

SLIDE 9

Evaluation: ASIC vs CE vs SIMD

SLIDE 10

Evaluation: Varying Flexibility

SLIDE 11

Key Results

  • 8-15x less energy use than SIMD
  • 2-3x more energy use than custom ASIC
  • Within 6x performance of custom ASIC, 7x better than SIMD
  • All programmable versions do better performance-wise than SIMD

SLIDE 12

Conclusion

  • Better performance and power than SIMD
  • Worse than fixed-application ASIC
  • Moderate amount of flexibility

○ The greater the degree of programmability, the more performance gains are lost

SLIDE 13

Discussion Points

  • Is image processing a broad enough domain to claim that they apply their architecture to a “wide range” of applications?

  • The authors compared CE to SIMD units and ASICs. Should a GPU comparison have also been given?

  • CE doesn’t optimize any of the 3 relevant categories (performance, power, and flexibility). Is there sufficient motivation to use it?