Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing
Wajahat Qadeer and Rehan Hameed (did the heavy lifting but could not come today), Ofer Shacham (that's me), Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz
Stanford University
http://www.c2s2.org

Smile, you're on camera
- By show of hands, who here has an (HD) camera on them?
- How many CPUs/GPUs are in the room?
- How many of those xPUs are used for the image processing?
ISCA'13 shacham@alumni.stanford.edu

Imaging and video systems
- High computational requirements, low power budget
- Stills: ~10M pixels x 10 frames per second
- Video: ~2M pixels x 30 frames per second
- ~400 math operations per pixel (just for the image acquisition)
- On CPU: not enough horsepower
- On GPU: too much power
- Typically use special-purpose custom HW
- About 500X better performance and 500X lower energy than CPU
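
The figures on this slide imply a sustained operation rate that is easy to work out. The sketch below simply multiplies the slide's approximate numbers; the function and variable names are illustrative and not from the talk.

```python
# Back-of-envelope throughput from the slide's approximate figures.
OPS_PER_PIXEL = 400  # math operations per pixel, image acquisition only


def ops_per_second(pixels_per_frame, frames_per_second, ops_per_pixel=OPS_PER_PIXEL):
    """Sustained operation rate needed to keep up with the sensor."""
    return pixels_per_frame * frames_per_second * ops_per_pixel


stills_ops = ops_per_second(10_000_000, 10)  # ~10M pixels x 10 frames per second
video_ops = ops_per_second(2_000_000, 30)    # ~2M pixels x 30 frames per second

print(f"stills: {stills_ops / 1e9:.0f} Gops/s")  # stills: 40 Gops/s
print(f"video:  {video_ops / 1e9:.0f} Gops/s")   # video:  24 Gops/s
```

Tens of billions of operations per second, inside a camera's power budget, is the gap the rest of the talk addresses.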

Example: H.264 encoder on RISC vs. ASIC
- By coupling compute and storage closely together, ASICs are orders of magnitude more efficient in both performance and energy
[Figure: energy (uJ, log scale from 100 to 10,000,000) of RISC vs. ASIC for four sub-kernels (IME, FME, IP, CABAC); the gap is 2-3 orders of magnitude.]
* R. Hameed et al., Understanding Sources of Inefficiency in General-Purpose Chips. ISCA '10

We are solving the wrong problem!
- Yes, ASIC is 1000X more efficient than general purpose
- Yes, general purpose is more programmable than ASIC
- Yes, we can make each one marginally better
- But those are good answers to all the wrong questions!
- The right questions:
  - Why is the RISC energy so high?
  - What type of computation can we make efficient?
  - Can we make it just 100X better but keep it programmable?

Anatomy of a RISC instruction
[Figure: energy breakdown of an ADD instruction, labeled 70 pJ: I-cache access 25 pJ, register file access 4 pJ, control.]
- Control overheads: instruction decode, sequencing, pipeline management, clocking, ...
- Energy of a 32-bit ADD ≈ 0.5 pJ
* Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

Other instructions overhead
[Figure: an instruction sequence LD, LD, ADD, ST, BR, each paying 25 pJ (I-cache), 4 pJ (register file), and control; the instructions around the ADD are labeled as overhead instructions.]
* Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

D-Cache accesses overhead
[Figure: the same LD, LD, ADD, ST, BR sequence; each LD and ST pays an additional 25 pJ D-cache access on top of the 25 pJ I-cache access, 4 pJ register file access, and control.]
* Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V
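
To make the overhead concrete, here is a rough tally of one pass through the LD, LD, ADD, ST, BR sequence. It is a sketch under two assumptions that the slides suggest but do not state outright: that the 70 pJ label is the total cost of issuing any one instruction (the add itself included), and that each load or store adds one 25 pJ D-cache access. All names are illustrative.

```python
# Per-instruction energies read off the slides (45 nm, 0.9 V, approximate).
INSTRUCTION_PJ = 70.0  # assumed: total cost of issuing one RISC instruction
DCACHE_PJ = 25.0       # assumed: extra D-cache access paid by each load/store
ADD_PJ = 0.5           # the 32-bit addition itself

# Instruction sequence from the slides: two loads, one add, one store, one branch.
LOOP_BODY = ["LD", "LD", "ADD", "ST", "BR"]
MEMORY_OPS = {"LD", "ST"}


def loop_energy_pj(body):
    """Total energy of one pass through the instruction sequence, in pJ."""
    total = 0.0
    for op in body:
        total += INSTRUCTION_PJ
        if op in MEMORY_OPS:
            total += DCACHE_PJ
    return total


total_pj = loop_energy_pj(LOOP_BODY)
useful_pj = ADD_PJ * LOOP_BODY.count("ADD")

print(f"total per iteration: {total_pj:.0f} pJ")            # 425 pJ
print(f"useful arithmetic:   {useful_pj:.1f} pJ")           # 0.5 pJ
print(f"overhead factor:     {total_pj / useful_pj:.0f}x")  # 850x
```

Under these assumptions, roughly 425 pJ is spent to perform 0.5 pJ of useful arithmetic.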

SIMD machines give some improvement
- SIMD units amortize overhead and improve performance
[Figure: a scalar ADD and a SIMD ADD, each paying I-cache, register file, and control overheads.]
- Achieves 10X better energy and performance AND is programmable
- Can we do 100X and keep it programmable?

Energy efficiency in a programmable environment
- Each memory and instruction fetch must be amortized by hundreds of operations
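
A deliberately simple model shows why "hundreds" is the right order of magnitude. It assumes the per-instruction overhead is fixed at the earlier slides' 70 pJ minus the 0.5 pJ add, and that each extra operation costs only 0.5 pJ. It ignores the extra cost of wider register files and datapaths, so treat it as an illustration rather than a measurement.

```python
# Simplified amortization model; numbers taken from the earlier slides.
INSTRUCTION_PJ = 70.0                 # assumed total energy of one instruction
OP_PJ = 0.5                           # energy of one 32-bit add
OVERHEAD_PJ = INSTRUCTION_PJ - OP_PJ  # fetch, decode, register file, control


def energy_per_op_pj(ops_per_instruction):
    """Energy per useful operation when one instruction performs many ops."""
    return OVERHEAD_PJ / ops_per_instruction + OP_PJ


for n in (1, 16, 64, 256):
    print(f"{n:4d} ops/instruction -> {energy_per_op_pj(n):.2f} pJ per op")

# Number of ops per instruction at which overhead per op equals the op itself.
break_even_ops = OVERHEAD_PJ / OP_PJ
print(f"break-even at {break_even_ops:.0f} ops per instruction")  # 139
```

In this model the overhead only drops to the level of the arithmetic itself at about 139 operations per instruction.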

What we want to see
[Figure: one LD (D-cache, I-cache, register file, control), followed by many OP instructions (I-cache, register file, control each), followed by one ST (D-cache, I-cache, register file, control).]
- D-cache accesses much narrower than the functional path
- Many ALU instructions per LD/ST instruction
- Many ops per instruction

Image processing looks like convolution
- Most of the computation is performed over (overlapping) stencils
- Looks like convolution:
  $(Img \otimes f)_{[n,m]} = \sum_{l=-c}^{c} \sum_{k=-c}^{c} Img_{[k,l]} \cdot f_{[n-k,\,m-l]}$
[Figure: input image (In) x coefficients = output image (Out).]
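
As a minimal sketch of the stencil computation, the function below evaluates a direct 2D convolution in plain Python. It uses the equivalent form in which the bounded indices run over the coefficients (convolution is commutative), and it only produces outputs whose stencil lies fully inside the image. The function name and the border handling are choices made for this example, not details from the talk.

```python
def convolve2d(img, coeffs):
    """Direct 2D convolution of img with a (2c+1) x (2c+1) coefficient stencil.

    img and coeffs are lists of rows. Only output positions whose stencil
    fits entirely inside img are computed, so the result is smaller than img.
    """
    c = len(coeffs) // 2
    rows, cols = len(img), len(img[0])
    out = []
    for n in range(c, rows - c):
        out_row = []
        for m in range(c, cols - c):
            acc = 0
            for k in range(-c, c + 1):
                for l in range(-c, c + 1):
                    # coeffs is stored from index 0, so offset (k, l)
                    # lives at coeffs[k + c][l + c].
                    acc += img[n - k][m - l] * coeffs[k + c][l + c]
            out_row.append(acc)
        out.append(out_row)
    return out


img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
box = [[1, 1, 1] for _ in range(3)]  # 3x3 box filter (c = 1)

print(convolve2d(img, box))  # [[54, 63], [90, 99]]
```

Neighboring outputs reuse most of the same input pixels, which is the overlap between stencils that the slide points to.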

It does not have to be convolution
- It only looks like convolution:
  $(Img \overset{CE}{\otimes} f)_{[n,m]} = \mathrm{Reduce}_{l=-c}^{c} \left\{ \mathrm{Reduce}_{k=-c}^{c} \left\{ \mathrm{map}\left( Img_{[k,l]},\, f_{[n-k,\,m-l]} \right) \right\} \right\}$
[Figure: input image (In) and coefficients pass through map, then reduce, to give the output image (Out).]
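
The generalization can be sketched by passing the map and reduce operations in as functions. This is an illustrative software analogue, not the hardware's programming interface. The name `generalized_convolve2d` is made up for the example, and the two nested reductions are folded into one, which gives the same result for an associative reduce operation.

```python
from functools import reduce as fold
from operator import add, mul


def generalized_convolve2d(img, coeffs, map_fn, reduce_fn):
    """Stencil computation with user-supplied map and reduce operations."""
    c = len(coeffs) // 2
    rows, cols = len(img), len(img[0])
    out = []
    for n in range(c, rows - c):
        out_row = []
        for m in range(c, cols - c):
            # "map": combine each pixel with its coefficient.
            mapped = [
                map_fn(img[n - k][m - l], coeffs[k + c][l + c])
                for k in range(-c, c + 1)
                for l in range(-c, c + 1)
            ]
            # "reduce": collapse the mapped values into one output.
            out_row.append(fold(reduce_fn, mapped))
        out.append(out_row)
    return out


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1] for _ in range(3)]
fives = [[5, 5, 5] for _ in range(3)]

# Multiply + add is ordinary convolution.
print(generalized_convolve2d(img, ones, mul, add))                       # [[45]]
# Absolute difference + add is a sum of absolute differences.
print(generalized_convolve2d(img, fives, lambda p, f: abs(p - f), add))  # [[20]]
# Multiply + max picks the largest weighted pixel.
print(generalized_convolve2d(img, ones, mul, max))                       # [[9]]
```

The data access pattern is identical in all three calls; only the two operations change.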

Let's look at some convolution-like workloads
- De-mosaic:
  - Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of smallest gradient.
* Y. Cheng et al., An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.

Let's look at more convolution-like workloads
- H.264 (high definition) video encoder:
  - IME: 2D sum of absolute differences
  - FME: half-pixel interpolation, quarter-pixel interpolation, 2D SAD
[Figure: encoder block diagram with video frames, inter prediction (integer motion estimation and fractional motion estimation), intra prediction, CABAC entropy encoder, and compressed bit stream; an annotation marks where 90% of execution time is spent.]

The main computation behind H.264
- Trying to find the best match for a stencil within a small neighborhood
[Figure: previous frame and current frame.]
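
The search described on this slide can be sketched as exhaustive block matching with a sum of absolute differences (SAD) cost. This is a generic illustration, not the encoder's actual search strategy. Function names, the tiny frame, and the search radius are all assumptions made for the example.

```python
def sad(block, frame, top, left):
    """Sum of absolute differences between block and the window at (top, left)."""
    return sum(
        abs(block[r][c] - frame[top + r][left + c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )


def best_match(block, frame, center, radius):
    """Try every offset within +/-radius of center; return (cost, (top, left))."""
    block_h, block_w = len(block), len(block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            top, left = center[0] + dy, center[1] + dx
            # Skip candidate windows that fall outside the frame.
            if top < 0 or left < 0:
                continue
            if top + block_h > len(frame) or left + block_w > len(frame[0]):
                continue
            cost = sad(block, frame, top, left)
            if best is None or cost < best[0]:
                best = (cost, (top, left))
    return best


# Toy 6x6 "previous frame" with distinct pixel values.
previous = [[r * 6 + c for c in range(6)] for r in range(6)]
# A 2x2 block whose true position in the previous frame is (3, 2).
block = [row[2:4] for row in previous[3:5]]

print(best_match(block, previous, center=(2, 2), radius=1))  # (0, (3, 2))
```

Every candidate position evaluates the same stencil against a window shifted by one pixel, so consecutive candidates share almost all of their input data.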

The convolution engine must support different ops

| Kernel | Map | Reduce | Stencil size | Data flow |
|---|---|---|---|---|
| IME SAD | Abs diff | Add | 4x4 | 2D convolution |
| FME 1/2-pixel up-sample | Multiply | Add | 6 | 1D horizontal & vertical conv. |
| FME 1/4-pixel up-sample | Average | None | -- | 2D matrix operation |
| SIFT Gaussian blur | Multiply | Add | 9, 13, 15 | 1D horizontal & vertical conv. |
| SIFT DoG | Subtract | None | -- | 2D matrix operation |
| SIFT extrema | Compare | Logic AND | 3 | 1D horizontal & vertical conv. |
| Demosaic interpolation | Multiply | Complex | 3 | 1D horizontal & vertical conv. |

Convolution Engine: an architecture for convolution-like kernels
[Figure: coefficients in a 2D register file (rows 0-15, columns 0-15) and the stencil neighborhood in a 2D shift register file (rows 0-15, columns 0-31) feed a wide 64-lane SIMD "map" unit built from ALUs, followed by a flexible arithmetic/logical reduction ("reduce" step).]

Example from H.264's motion estimation: mapping Sum of Absolute Differences (SAD)
[Figure, built up over five slides: current-frame pixels are held in the 2D register file (rows 0-15, columns 0-15) and reference-frame pixels in the 2D shift register file (rows 0-15, columns 0-31); both feed the wide 64-lane SIMD "map" unit, followed by the flexible "reduce" step.]
- The ALUs' instruction is set to |a-b| (subtract, then ABS)
- The flexible reduction is set to summation (a sum tree)
- The reference pixels shift left: the shift register columns now read 1, 2, ..., 31, 0
- The reference pixels shift left again: the columns now read 2, 3, ..., 31, 0, 1
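
The sequence on these slides can be imitated in software as a toy model: hold the current block fixed, take |a-b| in every lane, sum the lanes, then shift the reference rows left and repeat. The sketch uses tiny sizes and does not model how the real engine assigns pixels to its 64 lanes. The rotation on shift follows what the column labels on the slides appear to show, and all names are illustrative.

```python
def sad_by_shifting(current_rows, reference_rows, steps):
    """Return one SAD per horizontal offset, shifting the reference left each step."""
    width = len(current_rows[0])
    # Copy the rows: this list plays the role of the 2D shift register.
    shift_reg = [list(row) for row in reference_rows]
    results = []
    for _ in range(steps):
        # "map": one |a - b| per lane, pairing each current pixel with the
        # reference pixel currently aligned with it.
        lanes = [
            abs(cur - ref)
            for cur_row, ref_row in zip(current_rows, shift_reg)
            for cur, ref in zip(cur_row, ref_row[:width])
        ]
        # "reduce": summation across all lanes.
        results.append(sum(lanes))
        # Shift every row left by one pixel; the column shifted out
        # reappears at the right-hand end (assumed from the slide labels).
        shift_reg = [row[1:] + row[:1] for row in shift_reg]
    return results


reference = [[0, 1, 2, 3, 4, 5, 6, 7],
             [10, 11, 12, 13, 14, 15, 16, 17]]
current = [[2, 3, 4],
           [12, 13, 14]]

print(sad_by_shifting(current, reference, steps=4))  # [12, 6, 0, 6]
```

Each step reuses the data already sitting in the registers, so one load of the reference window serves every horizontal search position; the SAD of 0 at the third position marks the matching offset.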
