
Convolution Engine: Balancing Efficiency & Flexibility in - PowerPoint PPT Presentation



  1. Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing
     Wajahat Qadeer and Rehan Hameed (did the heavy lifting but could not come today), Ofer Shacham (that’s me), Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz
     Stanford University, http://www.c2s2.org

  2. Smile, you’re on camera
     • By show of hands, who here has an (HD) camera on them?
     • How many CPUs/GPUs are in the room?
     • How many of those xPUs are used for the image processing?
     ISCA'13 shacham@alumni.stanford.edu

  3. Imaging and video systems
     • High computational requirements, low power budget
     • Stills: ~10M pixels x 10 frames per second
     • Video: ~2M pixels x 30 frames per second
     • ~400 math operations per pixel (just for the image acquisition)
     • On CPU: not enough horsepower
     • On GPU: too much power
     • Typically use special-purpose custom HW
     • About 500X better performance and 500X lower energy than a CPU

  4. Example: H.264 encoder on RISC vs. ASIC
     • By coupling compute and storage closely together, ASICs are orders of magnitude more efficient in both performance and energy
     [Figure: bar chart of energy (uJ, log scale from 100 to 10,000,000) for RISC vs. ASIC across four sub-kernels (IME, FME, IP, CABAC); the gap is 2-3 orders of magnitude.]
     * R. Hameed et al., Understanding Sources of Inefficiency in General-Purpose Chips. ISCA ’10

  5. We are solving the wrong problem!
     • Yes, ASIC is 1000X more efficient than general purpose
     • Yes, general purpose is more programmable than ASIC
     • Yes, we can make each one marginally better
     • But those are good answers to all the wrong questions!
     • The right questions:
       • Why is the RISC energy so high?
       • What type of computation can we make efficient?
       • Can we make it just 100X better but keep it programmable?

  6. Anatomy of a RISC Instruction
     [Figure: energy of one ADD instruction, 70 pJ, with segments for I-Cache access (25 pJ), register file access (4 pJ), and Control.]
     • Control overheads: instruction decode, sequencing, pipeline management, clocking, ...
     • Energy of the 32-bit ADD itself ≈ 0.5 pJ
     * Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

  7. Other instructions overhead
     [Figure: the instruction sequence LD, LD, ADD, ST, BR. Each instruction shows the same 25 pJ, 4 pJ, and Control segments; the LD, ST, and BR instructions around the ADD are labeled "Overhead instructions".]
     * Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

  8. D-Cache accesses overhead
     [Figure: the same LD, LD, ADD, ST, BR sequence. Each LD and ST now carries an extra 25 pJ segment, labeled "D-Cache access overheads", on top of the 25 pJ, 4 pJ, and Control segments that every instruction pays.]
     * Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V
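
The energy accounting on slides 6-8 can be tallied with a few lines of Python. This is only a rough sketch: it assumes the 70 pJ shown for the ADD applies to every instruction type, and the names `INSTR_TOTAL_PJ`, `DCACHE_PJ`, and `sequence_energy` are made up for illustration.

```python
# Per-instruction energy figures in pJ, read off the slides
# (45 nm, 0.9 V embedded RISC). Charging the same 70 pJ to every
# instruction type is an assumption of this sketch.
INSTR_TOTAL_PJ = 70.0  # whole instruction: fetch, register file, control
DCACHE_PJ = 25.0       # extra cost paid by each LD/ST
ADD_OP_PJ = 0.5        # the 32-bit addition itself


def sequence_energy(instrs):
    """Total energy (pJ) of a list of mnemonics such as ["LD", "ADD"]."""
    total = 0.0
    for op in instrs:
        total += INSTR_TOTAL_PJ
        if op in ("LD", "ST"):
            total += DCACHE_PJ  # loads and stores also touch the D-Cache
    return total


loop_body = ["LD", "LD", "ADD", "ST", "BR"]
total_pj = sequence_energy(loop_body)
useful_fraction = ADD_OP_PJ / total_pj
print(f"{total_pj:.0f} pJ spent, {useful_fraction:.2%} of it on the add")
```

Under these assumptions the five-instruction sequence costs 425 pJ to perform 0.5 pJ of useful arithmetic, which is the gap the following slides set out to close.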

  9. SIMD machines give some improvement
     • SIMD units amortize overhead and improve performance
     [Figure: a scalar ADD next to a SIMD ADD; each instruction pays I-Cache, RF, and Control once.]
     • Achieves 10X better energy and performance AND is programmable
     • Can we do 100X and keep it programmable?

  10. Energy efficiency in a programmable environment
     • Each memory and instruction fetch must be amortized by hundreds of operations
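
The amortization argument can be made concrete with a small model. It is a sketch under simplifying assumptions: the per-instruction overhead is taken as the 70 pJ from slide 6, each operation costs the 0.5 pJ of a 32-bit add, and the overhead is assumed not to grow as more operations are packed into one instruction. The function name `energy_per_op` is hypothetical.

```python
def energy_per_op(ops_per_instr, overhead_pj=70.0, op_pj=0.5):
    """Energy per arithmetic operation (pJ) when one instruction's
    overhead is shared by `ops_per_instr` operations."""
    return overhead_pj / ops_per_instr + op_pj


for n in (1, 16, 100, 1000):
    print(f"{n:5d} ops/instruction -> {energy_per_op(n):.3f} pJ per op")
```

In this model the overhead term only drops to the same order as the 0.5 pJ operation once an instruction carries on the order of a hundred operations, which matches the slide's "hundreds of operations".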

  11. What we want to see
     [Figure: an instruction stream of one LD, a long run of OP instructions, and one ST. Every instruction pays I-Cache, Reg File, and Control; only the LD and ST also pay D-Cache.]
     • D-Cache accesses much narrower than functional path
     • Many ALU instructions per LD/ST instruction
     • Many ops per instruction

  12. Image processing looks like convolution
     • Most of the computation is performed over (overlapping) stencils
     • Looks like convolution:
       $$(Img \otimes f)[n, m] = \sum_{l=-c}^{c} \sum_{k=-c}^{c} Img[k, l] \cdot f[n-k, m-l]$$
     [Figure: an input stencil (In) multiplied by coefficients to produce an output pixel (Out).]
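
A direct, unoptimized rendering of the convolution sum in Python may help make the stencil access pattern visible. This is a minimal sketch, not the engine's implementation: it assumes a square filter with odd side length 2c+1 stored so that `f[c][c]` is the tap for offset (0, 0), and it treats pixels outside the image as zero. The name `conv2d` is chosen for illustration.

```python
def conv2d(img, f):
    """2D convolution: out[n][m] = sum of img[k][l] * f[n-k][m-l]
    over the pixels (k, l) within distance c of (n, m)."""
    c = len(f) // 2
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for n in range(rows):
        for m in range(cols):
            acc = 0
            # Only pixels inside the image and inside the filter support
            # contribute; everything else is treated as zero.
            for k in range(max(0, n - c), min(rows, n + c + 1)):
                for l in range(max(0, m - c), min(cols, m + c + 1)):
                    acc += img[k][l] * f[n - k + c][m - l + c]
            out[n][m] = acc
    return out


img = [[1, 2], [3, 4]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(conv2d(img, box))
```

Neighboring output pixels read almost the same input pixels, which is the overlap between stencils that the slide points out.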



  15. It does not have to be convolution
     • It only looks like convolution:
       $$(Img \overset{CE}{\otimes} f)[n, m] = \mathrm{Reduce}_{l=-c}^{c} \, \mathrm{Reduce}_{k=-c}^{c} \left( \mathrm{map}\left( Img[k, l], \; f[n-k, m-l] \right) \right)$$
     [Figure: an input stencil (In) and coefficients pass through a map step and then a reduce step to produce an output pixel (Out).]
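
The generalization from multiply-and-sum to arbitrary map and reduce operators can be sketched in a few lines of Python. The helper `map_reduce_stencil` is hypothetical, and it assumes the stencil window and the coefficient block have already been extracted and aligned element by element (so any flip needed for a true convolution is done beforehand).

```python
from functools import reduce
import operator


def map_reduce_stencil(window, coeffs, map_fn, reduce_fn):
    """Apply map_fn to each (pixel, coefficient) pair, then fold the
    mapped values into one result with reduce_fn."""
    mapped = [
        map_fn(p, c)
        for wrow, crow in zip(window, coeffs)
        for p, c in zip(wrow, crow)
    ]
    return reduce(reduce_fn, mapped)


window = [[1, 2], [3, 4]]
other = [[4, 3], [2, 1]]

# map = multiply, reduce = add: one output sample of a filter.
dot = map_reduce_stencil(window, other, operator.mul, operator.add)

# map = absolute difference, reduce = add: sum of absolute differences.
sad = map_reduce_stencil(window, other, lambda a, b: abs(a - b), operator.add)

print(dot, sad)
```

Only the two operators change between the filter and the SAD; the data access pattern stays the same, which is what lets one datapath serve both.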

  16. Let’s look at some convolution-like workloads
     • De-mosaic:
       • Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of the smallest gradient.
     * Y. Cheng et al., An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.

  17. Let’s look at more convolution-like workloads
     • H.264 (high definition) video encoder:
       • IME: 2D sum of absolute differences
       • FME: half-pixel interpolation, quarter-pixel interpolation, 2D SAD
     [Figure: encoder pipeline. Video Frames feed Inter Prediction (Integer Motion Estimation, Fractional Motion Estimation) and Intra Prediction, then the CABAC Entropy Encoder, which emits the Compressed Bit Stream. An annotation next to the motion estimation blocks reads "90% of execution time is here".]

  18. The main computation behind H.264
     • Trying to find the best match for a stencil within a small neighborhood
     [Figure: Previous Frame and Current Frame.]
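
The search for a best match can be written as an exhaustive block-matching loop. This is a plain software sketch using sum of absolute differences as the cost, not the encoder's actual search strategy; the names `sad`, `best_match`, and the `radius` parameter are assumptions for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(cur_block, ref_frame, top, left, radius):
    """Try every offset within +/-radius of (top, left) in ref_frame and
    return (cost, dy, dx) for the candidate with the lowest SAD."""
    h, w = len(cur_block), len(cur_block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if y < 0 or x < 0 or y + h > len(ref_frame) or x + w > len(ref_frame[0]):
                continue
            candidate = [row[x:x + w] for row in ref_frame[y:y + h]]
            cost = sad(cur_block, candidate)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


ref_frame = [[4 * r + c for c in range(4)] for r in range(4)]
cur_block = [row[2:4] for row in ref_frame[1:3]]
print(best_match(cur_block, ref_frame, top=0, left=0, radius=2))
```

Every candidate offset repeats the same map (absolute difference) and reduce (add) over a slightly shifted window, so the work grows with the search area times the block size.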

  19. The convolution engine must support different ops

     Kernel                   Map        Reduce     Stencil Size  Data Flow
     IME SAD                  Abs Diff   Add        4x4           2D Convolution
     FME ½ pixel up-sample    Multiply   Add        6             1D Horizontal & vertical conv.
     FME ¼ pixel up-sample    Average    None       --            2D Matrix operation
     SIFT Gaussian blur       Multiply   Add        9, 13, 15     1D Horizontal & vertical conv.
     SIFT DoG                 Subtract   None       --            2D Matrix operation
     SIFT Extreme             Compare    Logic AND  3             1D Horizontal & vertical conv.
     Demosaic interpolation   Multiply   Complex    3             1D Horizontal & vertical conv.

  20. Convolution Engine: An architecture for convolution-like kernels
     [Figure: a 2D regfile holds the coefficients (rows 0-15, columns 0-15) and a 2D shift regfile holds the stencil neighborhood (rows 0-15, columns 0-31). Both feed a wide 64-lane SIMD "map" unit of ALUs, followed by a flexible arithmetic/logical reduction "reduce" step.]

  21. Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)
     [Figure: the same datapath. Current frame pixels sit in the 2D regfile (columns 0-15) and reference frame pixels in the 2D shift regfile (columns 0-31), above the wide 64-lane SIMD "map" unit and the flexible arithmetic/logical reduction "reduce" step.]

  22. Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)
     [Figure: same datapath. The ALUs' instruction is set to |a-b|: each lane of the 64-lane SIMD "map" unit subtracts and takes the absolute value.]

  23. Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)
     [Figure: same datapath. The flexible "reduce" step is configured as a summation tree, and the reference frame pixels are marked "pixels shift left".]

  24. Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)
     [Figure: same datapath after the reference pixels shift left by one; each row of the 2D shift regfile now reads columns 1, 2, ..., 31, 0.]

  25. Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)
     [Figure: same datapath after a second left shift; each row of the 2D shift regfile now reads columns 2, 3, ..., 31, 0, 1.]
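
The shift-and-compare sequence in slides 21-25 can be imitated for a single register-file row in software. This is a loose sketch of the dataflow only: it assumes the shift register rotates left (the column labels on the slides wrap from 31 back to 0), uses one lane per current-frame pixel, and stands in a plain `sum` for the reduction tree. The function name `sad_by_shifting` is hypothetical.

```python
def sad_by_shifting(cur_row, ref_row):
    """Return the SAD of cur_row against ref_row at each horizontal
    offset, produced by rotating ref_row left one pixel per step."""
    lanes = len(cur_row)
    shifted = list(ref_row)
    results = []
    for _ in range(len(ref_row) - lanes + 1):
        # "map": every lane computes |a - b| on its pair of pixels.
        diffs = [abs(a - b) for a, b in zip(cur_row, shifted[:lanes])]
        # "reduce": the summation step collapses the lanes to one number.
        results.append(sum(diffs))
        # Shift the reference pixels left by one; the first pixel wraps.
        shifted = shifted[1:] + shifted[:1]
    return results


costs = sad_by_shifting([3, 4], [1, 2, 3, 4, 5])
print(costs)
```

Each step reuses the reference pixels already held in the register row and only moves them by one position, so a new candidate offset needs no new loads.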
