Convolution Engine
Balancing Efficiency & Flexibility in Specialized Computing
Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark A. Horowitz
Stanford University
Kevin Loughlin and Ian Neal
The Problem
expectations.
○
flexibility.
○ In some cases, power-efficiency is sacrificed as well!
○ Reasonably performant
○ Reasonably power-efficient
○ Reasonably flexible (programmable)
Qadeer et al. Convolution Engine
○ General-purpose HW not optimized for such high data parallelism
○ Extremely flexible (programmable), but way too slow
○ Still flexible, and more performant than SIMD Units, but consume way too much power
○ Very performant and power-efficient, but at the cost of flexibility -- only apply to 1 algorithm
Source: Powell, Victor. “Image Kernels: Explained Visually.” http://setosa.io/ev/image-kernels/
reduction on the result, then shift stencil and repeat
○ Iterative map-then-reduce: common occurrence in image processing
○ Rather than build an ASIC for 1 algorithm, build specialized HW for the class of algorithms
○ Allow users to program this specialized HW based on specific application needs
○ An architecture that yields reasonable performance, power, and flexibility numbers for convolution-like algorithms
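The iterative map-then-reduce pattern can be sketched in software. The following is a minimal illustration only, not the hardware design from the paper: the function name `stencil_map_reduce` and its parameters are hypothetical, and the kernel is applied without flipping (strictly a correlation), which matches how image kernels are usually shown.

```python
from functools import reduce
import operator


def stencil_map_reduce(image, kernel, map_op=operator.mul, reduce_op=operator.add):
    """Slide a kernel-sized stencil over a 2D image (valid positions only).

    map_op and reduce_op are illustrative parameters: swapping them changes
    the algorithm while the data movement stays the same.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Map: combine each stencil coefficient with the pixel under it.
            mapped = [map_op(image[y + j][x + i], kernel[j][i])
                      for j in range(kh) for i in range(kw)]
            # Reduce: collapse the mapped values into one output pixel.
            row.append(reduce(reduce_op, mapped))
        # Shift the stencil down one row and repeat.
        out.append(row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
box = [[1, 1],
       [1, 1]]
result = stencil_map_reduce(image, box)
```

Passing a different `map_op` or `reduce_op` (for example absolute difference and minimum) reuses the same stencil loop, which is the kind of reuse the class-of-algorithms argument relies on.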
○ Shift registers are a natural extension of moving stencils
○ Multiple memory access widths, unaligned accesses
○ Interface Units (IF) arrange data as needed for map operation
○ Functional Units are just 2-input ALUs on pre-arranged data
○ Combine up to 9 different convolution instructions into one “super instruction” in reduce
○ No multiplication, just add/subtract-type instructions
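A rough software analogy for the shift-register datapath, assuming a 1D stencil: a fixed-length window plays the role of the shift register, a 2-input function stands in for each ALU lane, and a second function performs the reduction. All names here (`shift_register_stencil`, `alu`, `reduce_op`) are hypothetical and chosen for illustration; the real hardware processes lanes in parallel rather than in a loop.

```python
from collections import deque


def shift_register_stencil(samples, coeffs, alu, reduce_op):
    """Model a 1D stencil driven by a shift register (illustrative only)."""
    taps = len(coeffs)
    window = deque(maxlen=taps)  # software stand-in for the shift register
    out = []
    for s in samples:
        window.append(s)  # shift in one sample; the oldest one drops out
        if len(window) < taps:
            continue  # register not yet full, no output
        # Map: one 2-input ALU operation per (data, coefficient) pair.
        mapped = [alu(a, b) for a, b in zip(window, coeffs)]
        # Reduce: fold the lane results into a single value.
        acc = mapped[0]
        for v in mapped[1:]:
            acc = reduce_op(acc, v)
        out.append(acc)
    return out


def abs_diff(a, b):
    return abs(a - b)


def add(a, b):
    return a + b


out = shift_register_stencil([3, 5, 8, 6, 2], [4, 4, 4], abs_diff, add)
```

Each new output needs only one new sample to be shifted in; the other inputs are already in place, which is why a shift register suits a moving stencil.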
○ Small set of new ISA instructions
○ Issued through C code compiler intrinsics
○ Convolution size, ALU operation, etc.
○ Can interleave non-CE instructions before next convolution iteration
○ H.264 motion estimation (video encoding)
○ SIFT (feature detection)
○ demosaic (interpret camera input)
○ Fixed kernel (equivalent to custom ASIC)
○ Multiple kernel sizes (more flexibility in interface units, register files, reduction stage)
○ Multiple flows (different dimensions, access patterns, but same operations)
○ Multiple arithmetic operations (full flexibility)
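To illustrate why motion estimation fits the same pattern, here is a small sketch of block matching with a sum of absolute differences (SAD): the map step is an absolute difference per pixel, the reduce step is a sum, and the stencil shifts across a reference frame. This is an exhaustive search written for clarity, not the search strategy of any particular H.264 encoder; `sad` and `best_match` are hypothetical helper names.

```python
def sad(block_a, block_b):
    """Sum of absolute differences: map = |a - b|, reduce = sum."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_match(ref_frame, block):
    """Exhaustively search ref_frame for the position with the lowest SAD.

    Returns (cost, y, x) for the best candidate; ties keep the first one found.
    """
    bh, bw = len(block), len(block[0])
    best = None
    for y in range(len(ref_frame) - bh + 1):
        for x in range(len(ref_frame[0]) - bw + 1):
            # Cut out the candidate block under the current stencil position.
            candidate = [row[x:x + bw] for row in ref_frame[y:y + bh]]
            cost = sad(candidate, block)
            if best is None or cost < best[0]:
                best = (cost, y, x)
    return best


ref = [[0, 0, 0, 0],
       [0, 9, 8, 0],
       [0, 7, 6, 0],
       [0, 0, 0, 0]]
block = [[9, 8],
         [7, 6]]
match = best_match(ref, block)
```

Compared with a filter-style convolution, only the map and reduce operations differ; the sliding access pattern is unchanged.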
○ The greater the degree of programmability, the more performance gains are lost
architecture to a “wide range” of applications?
comparison have also been given?
and flexibility). Is there sufficient motivation to use it?