slide-1
SLIDE 1

http://www.c2s2.org

Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing

Wajahat Qadeer, Rehan Hameed (did the heavy lifting but could not come today), Ofer Shacham (that's me), Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz
Stanford University

slide-2
SLIDE 2

Smile, you’re on camera

  • By show of hands, who here has an (HD) camera on them?
  • How many CPUs/GPUs in the room?
  • How many of those xPUs are used for the image processing?

ISCA'13 shacham@alumni.stanford.edu 2

slide-3
SLIDE 3

Imaging and video systems

  • High computational requirements, low power budget
  • Stills: ~10M pixels x 10 frames per second
  • Video: ~2M pixels x 30 frames per second
  • ~400 math operations per pixel (just for the image acquisition)
  • On CPU… not enough horsepower
  • On GPU… too much power
  • Typically use special purpose custom HW
  • About 500X better performance, 500X lower energy than CPU
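To make the size of the workload concrete, the bullet figures can be multiplied out. This is only a back-of-the-envelope sketch: `ops_per_second`, `stills_ops`, and `video_ops` are hypothetical helper names, and the inputs are the approximate numbers from the bullets above.

```c
/* Back-of-the-envelope throughput estimate.
 * All names here are illustrative; the inputs are the approximate
 * figures quoted on the slide (pixels/frame, frames/s, ops/pixel). */
static long long ops_per_second(long long pixels_per_frame,
                                long long frames_per_second,
                                long long ops_per_pixel)
{
    return pixels_per_frame * frames_per_second * ops_per_pixel;
}

/* Stills: ~10M pixels x 10 frames/s x ~400 ops/pixel */
static long long stills_ops(void)
{
    return ops_per_second(10000000LL, 10, 400);
}

/* Video: ~2M pixels x 30 frames/s x ~400 ops/pixel */
static long long video_ops(void)
{
    return ops_per_second(2000000LL, 30, 400);
}
```

Under these figures, stills come to roughly 40 billion operations per second and video to roughly 24 billion, for image acquisition alone.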


slide-4
SLIDE 4

Example: H.264 encoder on RISC vs. ASIC

  • By coupling compute and storage closely together, ASICs are orders of magnitude more efficient in performance and energy


[Figure: bar chart of energy (uJ, log scale from 100 to 10,000,000) for RISC vs. ASIC on the H.264 sub-kernels IME, FME, IP, and CABAC (labelled sub-kernel 1 through 4). The gap is 2-3 orders of magnitude.]

* R. Hameed et al., Understanding Sources of Inefficiency in General-Purpose Chips. ISCA '10

slide-5
SLIDE 5

We are solving the wrong problem!

  • Yes, ASIC is 1000X more efficient than general purpose
  • Yes, general purpose is more programmable than ASIC
  • Yes, we can make each one marginally better
  • But those are good answers to all the wrong questions!
  • The right questions:

  • Why is the RISC energy so high?
  • What type of computation can we make efficient?
  • Can we make it just 100X better but keep it programmable?

slide-6
SLIDE 6

Anatomy of a RISC Instruction

[Figure: energy breakdown of a single ADD instruction, 70 pJ in total: I-Cache access 25 pJ, register file access 4 pJ, and control overheads (instruction decode, sequencing, pipeline management, clocking, ...). The 32-bit ADD itself takes ≈ 0.5 pJ.]

* Assuming a typical 32-bit embedded RISC in 45nm @ 0.9V technology

slide-7
SLIDE 7

Other instructions overhead

[Figure: the same breakdown repeated for a sequence of five instructions (ADD, ST, BR, LD, LD), each paying 25 pJ for the I-Cache access, 4 pJ for the register file access, plus control. The instructions other than the ADD are labelled overhead instructions.]

* Assuming a typical 32-bit embedded RISC in 45nm @ 0.9V technology

slide-8
SLIDE 8

D-Cache accesses overhead

[Figure: the same five-instruction sequence (ADD, ST, BR, LD, LD), each instruction paying 25 pJ for the I-Cache access, 4 pJ for the register file access, plus control. D-Cache access overheads add three more accesses at 25 pJ each.]

* Assuming a typical 32-bit embedded RISC in 45nm @ 0.9V technology
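A rough tally of the numbers on this slide. It rests on two assumptions that the slide does not state: each of the five instructions costs the 70 pJ quoted for the ADD, and the three 25 pJ D-Cache accesses come on top of that.

```latex
E_{\text{seq}} \approx 5 \times 70\,\text{pJ} + 3 \times 25\,\text{pJ} = 425\,\text{pJ},
\qquad
\frac{E_{\text{seq}}}{E_{\text{ADD}}} \approx \frac{425\,\text{pJ}}{0.5\,\text{pJ}} = 850
```

Under those assumptions, the sequence spends several hundred times more energy than the one useful 32-bit addition it performs.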

slide-9
SLIDE 9

SIMD machines give some improvement

  • SIMD units amortize overhead and improve performance
  • Achieves 10X better energy and performance AND is programmable
  • Can we do 100X and keep it programmable?

[Figure: a scalar ADD pays for I-Cache, RF, and Control to perform one operation; a SIMD ADD pays for I-Cache, RF, and Control once for a whole vector of operations.]

slide-10
SLIDE 10

Energy efficiency in a programmable environment

Each memory and instruction fetch must be amortized by hundreds of operations
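One way to see where "hundreds of operations" comes from is a simple amortization model. This is a sketch under an assumed model, not a figure from the talk: the fixed per-instruction overhead (70 pJ minus the 0.5 pJ ADD) is paid once and shared by n operations. The names `energy_per_op_pj` and `ops_needed` are hypothetical.

```c
/* Energy figures quoted on the slides for a 32-bit embedded RISC at 45nm. */
#define INSTR_ENERGY_PJ 70.0 /* one full RISC ADD instruction */
#define OP_ENERGY_PJ     0.5 /* the 32-bit addition itself    */

/* Assumed model: the per-instruction overhead is paid once and
 * amortized over n useful operations. */
static double energy_per_op_pj(int n)
{
    double overhead = INSTR_ENERGY_PJ - OP_ENERGY_PJ;
    return overhead / n + OP_ENERGY_PJ;
}

/* Smallest n for which the energy per operation is `improvement` times
 * lower than one full instruction; returns -1 if the target is below
 * the cost of the operation itself and therefore unreachable. */
static int ops_needed(double improvement)
{
    double target = INSTR_ENERGY_PJ / improvement;
    if (target <= OP_ENERGY_PJ)
        return -1;
    double exact = (INSTR_ENERGY_PJ - OP_ENERGY_PJ) / (target - OP_ENERGY_PJ);
    int n = (int)exact;
    if ((double)n < exact)
        n++; /* round up without needing libm's ceil() */
    return n;
}
```

With these numbers, `ops_needed(10.0)` gives 11 and `ops_needed(100.0)` gives 348, so a 100X improvement needs each fetch to be shared by a few hundred operations under this model.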


slide-11
SLIDE 11

What we want to see

[Figure: three sketches of an instruction stream, each instruction paying for I-Cache, Reg File, and Control, with D-Cache accesses on LD/ST.]

  • D-Cache accesses much narrower than functional path
  • Many ops per instruction
  • Many ALU instructions per LD/ST instruction

slide-12
SLIDE 12

Image processing looks like convolution

  • Most of the computation is performed over (overlapping) stencils
  • Looks like convolution:

Out = In ⊗ coefficients

$$(Img \otimes f)[n, m] = \sum_{l=-c}^{c} \sum_{k=-c}^{c} Img[k, l] \cdot f[n-k, m-l]$$

slide-15
SLIDE 15

It does not have to be convolution

  • It only looks like convolution:

Out = In ⊗ coefficients, with callouts marking the "reduce" and "map" steps:

$$(Img \otimes_{CE} f)[n, m] = \mathrm{Reduce}_{l=-c}^{c} \, \mathrm{Reduce}_{k=-c}^{c} \, \mathrm{map}\big(Img[k, l],\ f[n-k, m-l]\big)$$
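The generalization from multiply and sum to arbitrary "map" and "reduce" steps can be sketched in plain C with function pointers. This is an illustrative software model only, not the engine's ISA: `map_fn`, `reduce_fn`, `stencil_at`, and the demo helpers are hypothetical names, and the loop uses sliding-window indexing centred on the output pixel, which is an assumption that differs from the index convention of the formula.

```c
#include <stdlib.h>

/* Hypothetical types: a "map" combines one pixel with one coefficient,
 * a "reduce" folds the mapped values into an accumulator. */
typedef int (*map_fn)(int pixel, int coeff);
typedef int (*reduce_fn)(int acc, int value);

static int map_mul(int pixel, int coeff)     { return pixel * coeff; }
static int map_absdiff(int pixel, int coeff) { return abs(pixel - coeff); }
static int reduce_add(int acc, int value)    { return acc + value; }

/* Apply a (2c+1)x(2c+1) stencil centred on (row, col) of a row-major
 * image with `width` columns. The caller must keep the window inside
 * the image; `init` is the starting value of the reduction. */
static int stencil_at(const int *img, int width, const int *f, int c,
                      int row, int col,
                      map_fn map, reduce_fn reduce, int init)
{
    int size = 2 * c + 1;
    int acc = init;
    for (int l = -c; l <= c; l++) {
        for (int k = -c; k <= c; k++) {
            int pixel = img[(row + l) * width + (col + k)];
            int coeff = f[(l + c) * size + (k + c)];
            acc = reduce(acc, map(pixel, coeff));
        }
    }
    return acc;
}

static const int demo_img[9] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

/* map = multiply, reduce = add: an ordinary filter (here a 3x3 box sum). */
static int demo_convolution(void)
{
    static const int ones[9] = { 1, 1, 1, 1, 1, 1, 1, 1, 1 };
    return stencil_at(demo_img, 3, ones, 1, 1, 1, map_mul, reduce_add, 0);
}

/* map = abs-diff, reduce = add: a sum of absolute differences. */
static int demo_sad(void)
{
    static const int fives[9] = { 5, 5, 5, 5, 5, 5, 5, 5, 5 };
    return stencil_at(demo_img, 3, fives, 1, 1, 1, map_absdiff, reduce_add, 0);
}
```

The same loop nest serves both cases; only the two function pointers change, which is the property the slide's formula is pointing at.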

slide-16
SLIDE 16

Let’s look at some convolution-like workloads

  • De-mosaic:
  • Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of smallest gradient.

* Y. Cheng et al. An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.

slide-17
SLIDE 17

Let’s look at more convolution-like workloads

  • H.264 (high definition) video encoder:
  • IME: 2D-Sum of absolute differences
  • FME: Half pixel interpolation, quarter pixel interpolation, 2D SAD

[Figure: H.264 encoder block diagram. Video frames pass through inter prediction (integer motion estimation, fractional motion estimation), intra prediction, and the CABAC entropy encoder to produce the compressed bit stream. Callout: 90% of execution time is here.]

slide-18
SLIDE 18

The main computation behind H.264

  • Trying to find best match for a stencil within a small neighborhood

[Figure: a stencil from the current frame is compared against a small neighborhood in the previous frame.]
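The search described on this slide can be sketched as an exhaustive block match that scores each candidate position with a sum of absolute differences. This is a minimal software illustration, not the encoder's actual search strategy; `match_t`, `block_sad`, `best_match`, and `demo_best_match` are hypothetical names, and frames are assumed to be 8-bit, row-major, and of equal size.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical result type: displacement of the best match and its SAD. */
typedef struct { int dx, dy, sad; } match_t;

/* Sum of absolute differences between two bsize x bsize blocks that
 * live in frames sharing the same row stride. */
static int block_sad(const unsigned char *cur, const unsigned char *ref,
                     int stride, int bsize)
{
    int sad = 0;
    for (int y = 0; y < bsize; y++)
        for (int x = 0; x < bsize; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Try every displacement within +/-range of (bx, by) in the previous
 * frame and keep the one with the lowest SAD. Candidates that would
 * fall outside the frame are skipped. */
static match_t best_match(const unsigned char *cur, const unsigned char *prev,
                          int width, int height,
                          int bx, int by, int bsize, int range)
{
    match_t best = { 0, 0, INT_MAX };
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + bsize > width || ry + bsize > height)
                continue;
            int sad = block_sad(cur + by * width + bx,
                                prev + ry * width + rx, width, bsize);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}

/* Tiny 8x8 example: a single bright pixel sits at (x=2, y=2) in the
 * current frame and at (x=4, y=3) in the previous frame. */
static match_t demo_best_match(void)
{
    unsigned char cur[64] = { 0 }, prev[64] = { 0 };
    cur[2 * 8 + 2] = 200;
    prev[3 * 8 + 4] = 200;
    return best_match(cur, prev, 8, 8, 2, 2, 2, 2);
}
```

In the demo, the 2x2 block at (2, 2) is found at displacement (2, 1) with a SAD of zero. Every candidate position repeats the same abs-diff and add pattern over overlapping data, which is what makes the search convolution-like.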

slide-19
SLIDE 19

The convolution engine must support different ops

Kernel                  | Map      | Reduce    | Stencil Size | Data Flow
IME SAD                 | Abs Diff | Add       | 4x4          | 2D Convolution
FME ½ pixel up-sample   | Multiply | Add       | 6            | 1D Horizontal & vertical conv.
FME ¼ pixel up-sample   | Average  | None      | -            | 2D Matrix operation
SIFT Gaussian blur      | Multiply | Add       | 9, 13, 15    | 1D Horizontal & vertical conv.
SIFT DoG                | Subtract | None      | -            | 2D Matrix operation
SIFT Extrema            | Compare  | Logic AND | 3            | 1D Horizontal & vertical conv.
Demosaic interpolation  | Multiply | Complex   | 3            | 1D Horizontal & vertical conv.

slide-20
SLIDE 20

Convolution Engine: An architecture for convolution-like kernels

[Figure: Convolution Engine datapath. A 2D Regfile and a 2D shift Regfile hold the coefficients and the stencil neighborhood; they feed a wide 64-lane SIMD "map" unit (ALUs), followed by a flexible "reduce" step (arithmetic / logical reduction).]

slide-21
SLIDE 21

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: the Convolution Engine datapath (2D Regfile, 2D shift Regfile, wide 64-lane SIMD "map" unit, flexible "reduce" step with arithmetic / logical reduction), now holding current frame pixels and reference frame pixels.]

slide-22
SLIDE 22

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath, holding current frame pixels and reference frame pixels. The ALUs' instruction is set to |a-b| (ABS).]

slide-23
SLIDE 23

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath. The ALUs' instruction is set to |a-b| (ABS), the reduction is a summation tree (Sum), and pixels shift left.]

slide-24
SLIDE 24

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath with ABS in the ALUs and Sum as the reduction; the reference frame pixels are shown after a shift left.]

slide-25
SLIDE 25

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath with ABS in the ALUs and Sum as the reduction; the reference frame pixels are shown after another shift left.]

slide-26
SLIDE 26

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath with ABS in the ALUs and Sum as the reduction, after the left shifts over the reference frame pixels.]

  • We performed 4K ops before the next load!
  • Pixels shift up
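One plausible way to arrive at the "4K ops" figure, assuming it counts abs-diff "map" operations: a 16x16 macroblock compared at 16 horizontal positions of a 32x16 reference window gives 16 x 256 = 4096 operations. The plain-C reference below counts them. `sad_row`, `demo_sad_row`, and `demo_out` are hypothetical names, and this is a software model, not the engine's datapath.

```c
#include <stdlib.h>

enum { MB_SIZE = 16, WINDOW_WIDTH = 32 };

/* Compute one row of 16 SAD results: the 16x16 current macroblock `cur`
 * (row-major, 16 columns) is compared against 16 horizontally shifted
 * positions of the 32x16 reference window `ref` (row-major, 32 columns).
 * Returns the number of abs-diff ("map") operations performed. */
static long sad_row(const unsigned char *cur, const unsigned char *ref,
                    unsigned sad_out[MB_SIZE])
{
    long map_ops = 0;
    for (int x = 0; x < MB_SIZE; x++) {          /* one shift-left step per x */
        unsigned sum = 0;
        for (int r = 0; r < MB_SIZE; r++) {
            for (int c = 0; c < MB_SIZE; c++) {
                sum += (unsigned)abs(cur[r * MB_SIZE + c] -
                                     ref[r * WINDOW_WIDTH + c + x]);
                map_ops++;
            }
        }
        sad_out[x] = sum;
    }
    return map_ops;
}

static unsigned demo_out[MB_SIZE];

/* Flat test data: every current pixel is 10, every reference pixel is 7,
 * so each of the 16 SAD results is 256 * 3. */
static long demo_sad_row(void)
{
    unsigned char cur[MB_SIZE * MB_SIZE];
    unsigned char ref[MB_SIZE * WINDOW_WIDTH];
    for (int i = 0; i < MB_SIZE * MB_SIZE; i++)
        cur[i] = 10;
    for (int i = 0; i < MB_SIZE * WINDOW_WIDTH; i++)
        ref[i] = 7;
    return sad_row(cur, ref, demo_out);
}
```

All 4096 operations work on data that is already in the register files, which is the point of shifting pixels instead of reloading them.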

slide-27
SLIDE 27

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath with ABS in the ALUs and Sum as the reduction; the reference frame pixels shift up.]

slide-28
SLIDE 28

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

[Figure: same datapath with ABS in the ALUs and Sum as the reduction, holding the reference frame pixels.]

  • Load just one row of data
  • Ready for pixels to start shifting again

slide-29
SLIDE 29

Our Convolution Engine as implemented

[Figure: block diagram of the implemented engine. Labels: 2D Register, 2D Shift Register, and 1D Shift Register; 2D / Column Access IF (twice) and 1D Window Access IF; "Map" ALUs and flexible "Reduce"; 16-wide Regfile and 16-way SIMD ALUs. Sizes shown: 18 entries, 16 wide, 10-bit pixel; 16 x 10-bit lane; 40 x 10-bit; 16x16x10-bit; 16x36x10-bit.]

Get full implementation details in the paper:

  • How we accomplished complex reduce steps using a “fused instructions graph”
  • How we work on BIG stencils by combining multiple convolution slices

  • The details of the ISA for the engine
  • And so on, and so forth…
slide-30
SLIDE 30

Result #1: CE is user programmable in C!

SET_CE_OPS(CE_ABSDIFF, CE_ADD);   // Set map & reduce funcs to abs-diff and add
SET_CE_OPSIZE(16);                // Set convolution size 16x16

// Load the 16x16 current macroblock into 2D coefficients register
for (int i = 0; i < 16; i++) {
    LD_COEFF_REG_128(curMBPtr, i);   // Load 16 pixels to row i of coefficient register
    curMBPtr += imgWidth;
}

// Load the first 32x16 current reference window into 2D input register
for (int i = 0; i < 16; i++) {
    LD_2D_REG_128(refPtr, 0, SHIFT_ENABLED);       // Load & shift-up 16 pixels to 2D Reg
    LD_2D_REG_128(refPtr+16, 1, SHIFT_DISABLED);   // Load next 16 pixels
    refPtr += imgWidth;
}

// Calculate one row of SAD output
for (int x = 0; x < 16; x++) {
    CONVOLVE_2D(ROTATE_LEFT, x);   // 16x16 2D convolution step and shift left
}

// Store 16 output SAD results
ST_OUT_REG_128(outPtr);

slide-31
SLIDE 31

Result #2: CE is 100X more energy efficient than RISC

  • All variations were implemented as Tensilica extensions (TIE)

[Figure: energy normalized to custom hardware (log scale from 0.1 to 100, lower is better) for SIFT-DoG, SIFT-Extrema, H.264-FME, H.264-IME, and Demosaic. Three series: SIMD (8-lane 16-bit or 16-lane 8-bit; does not do “real time”), the programmable Convolution Engine, and custom (fixed accelerator). Annotations mark gaps of ~10X and ~3X.]

slide-32
SLIDE 32

Conclusions

  • There are classes of computations for which we can build efficient hardware, and we typically build them in ASIC
  • Image and video are ubiquitous and represent one of those classes, as their computation is convolution-like
  • But when we restrict the domain, two orders of magnitude better programmable engines are also possible!
  • Flexible specialized engines are not an oxymoron
  • Flexible convolution engine improves power & performance by ~100X
  • Only 2-3X worse off than a dedicated (not flexible) accelerator

slide-33
SLIDE 33

THANK YOU FOR LISTENING!


slide-34
SLIDE 34

BACKUP SLIDES…


slide-35
SLIDE 35

Energy dissipation in RISC machines

  • Let’s do a breakdown of a typical RISC Instruction
  • Keep in mind (at 45nm):
  • Addition is ~0.1pJ for 8bits (ASIC) or ~0.5pJ for 32bits (RISC)
  • Multiplication is ~0.2pJ for 8bits (ASIC) or ~3.1pJ for 32bits (RISC)
  • But a single RISC instruction is 70pJ
  • Need to see where the overhead is, and how we can mitigate it


slide-36
SLIDE 36

Processor Integration

  • Specialized Functional Unit
  • Adds about 30 instructions to the processor ISA
  • The execution flow is controlled by the processor

[Figure: processor core block diagram. Labels: Processor Core; Instruction Decode, Pipeline Management, Program Sequencing; Compute; Register Storage; 32-bit ALU, Register File, Integer FU, Convolution Engine.]

slide-37
SLIDE 37

Evaluating the Convolution Engine

  • Applications
  • SIFT Feature extraction
  • Often a basic step for computational photography algorithms
  • HDR Imaging
  • Panorama stitching
  • Smart zoom / Super resolution
  • Multi-frame noise reduction
  • Synthetic aperture
  • Augmented reality
  • Flash – No-Flash photography
  • Video de-shake
  • ……
  • H.264 encoder
  • Every video system has one


slide-38
SLIDE 38

Let’s look at some of the workloads

  • De-mosaic:
  • Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of smallest gradient.

* Y. Cheng et al. An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.