Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons


  1. Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons
     Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
     School of Computer Science and Engineering, Nanyang Technological University

  2. Matrix-Form 1-D DWT
     • Formulation: $C = \mathrm{TM} \cdot X$, where $\mathrm{TM} = \prod_{j} T_j$
     • The TM matrix is highly sparse
       - Large number of multiply-by-zero operations
       - Large memory footprint consisting of zeroes
     • Goals:
       - SIMD-friendly operations on non-zero values only
       - Customized DMA routines for efficient bandwidth utilization
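To make the sparsity of the transform matrix concrete, here is a minimal sketch of one DWT level written as a dense matrix-vector product. It assumes the Haar wavelet and a single decomposition level purely for illustration; the slides do not state which wavelet or matrix layout the authors use, and the helper names (`haar_level_matrix`, `matvec`, `zero_fraction`) are hypothetical.

```python
import math

def haar_level_matrix(n):
    """Dense n x n matrix for one Haar DWT level (assumption: n is even)."""
    s = 1.0 / math.sqrt(2.0)
    half = n // 2
    t = [[0.0] * n for _ in range(n)]
    for i in range(half):
        t[i][2 * i] = s               # low-pass (approximation) row
        t[i][2 * i + 1] = s
        t[half + i][2 * i] = s        # high-pass (detail) row
        t[half + i][2 * i + 1] = -s
    return t

def matvec(m, x):
    """Plain dense matrix-vector product; multiplies by every zero too."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def zero_fraction(m):
    """Fraction of matrix entries that are exactly zero."""
    total = sum(len(row) for row in m)
    zeros = sum(1 for row in m for v in row if v == 0.0)
    return zeros / total

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
T = haar_level_matrix(len(x))
c = matvec(T, x)                      # approximation coefficients, then details
print(c)
print(zero_fraction(T))               # 48 of the 64 entries are zero
```

Even at this toy size three quarters of the multiplications are by zero, and the fraction grows with the signal length because each row keeps a fixed number of non-zero taps.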

  5. Sparse Matrix Skeleton
     [Figure: sparse matrix skeleton; surviving labels: 8, 36]
     • Remove multiply-by-zero operations
     • Reduction in memory footprint of TM
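The idea of skipping multiply-by-zero work can be sketched as follows: store only the filter taps and visit only the columns where a row of the transform matrix is non-zero. This is an illustrative single-level version with periodic extension, not the authors' skeleton data structure; `dwt_level_skeleton` and `multiply_counts` are hypothetical names, and the tap count of 6 in the usage line is chosen only as an example.

```python
import math

def dwt_level_skeleton(x, lo, hi):
    """One DWT level touching only non-zero taps (periodic extension)."""
    n = len(x)
    half = n // 2
    approx = [0.0] * half
    detail = [0.0] * half
    for i in range(half):
        for t in range(len(lo)):
            # Column of the t-th non-zero entry in output row i.
            sample = x[(2 * i + t) % n]
            approx[i] += lo[t] * sample
            detail[i] += hi[t] * sample
    return approx + detail

def multiply_counts(n, taps):
    """Multiplications for a dense n x n product vs. the skeleton form."""
    dense = n * n
    skeleton = n * taps   # n output rows, `taps` multiplies per row
    return dense, skeleton

s = 1.0 / math.sqrt(2.0)
lo, hi = [s, s], [s, -s]                       # Haar filters (assumption)
c = dwt_level_skeleton([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0], lo, hi)
print(c)
print(multiply_counts(65536, 6))               # (4294967296, 393216)
```

Only the taps need to be stored, so the memory footprint no longer depends on the matrix dimensions, and the multiply count grows linearly with the signal length instead of quadratically.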

  6. Modified Matrix-Form 1-D DWT N = 65536

  7. VectorBlox MXP
     • Lanes: 16-32
     • Scratchpad: 64-128 KB
     • DMA bandwidth: 4-32 B/cycle
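A rough back-of-the-envelope calculation shows why data movement needs care on this platform. The sketch below assumes 32-bit samples (not stated in the slides) and the N = 65536 signal length from the earlier slide; it gives lower bounds only, since it ignores scratchpad space for outputs and coefficients as well as DMA setup overhead. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def tiles_needed(num_bytes, scratchpad_bytes):
    """Minimum number of scratchpad-sized chunks needed to stream the data."""
    return ceil_div(num_bytes, scratchpad_bytes)

def dma_cycles(num_bytes, bytes_per_cycle):
    """Minimum cycles to move the data at a given DMA bandwidth."""
    return ceil_div(num_bytes, bytes_per_cycle)

N = 65536
BYTES_PER_SAMPLE = 4                      # assumption: 32-bit samples
signal_bytes = N * BYTES_PER_SAMPLE       # 262144 B = 256 KB

for scratchpad_kb in (64, 128):
    print(scratchpad_kb, "KB scratchpad:",
          tiles_needed(signal_bytes, scratchpad_kb * 1024), "tiles")

for bandwidth in (4, 32):
    print(bandwidth, "B/cycle:", dma_cycles(signal_bytes, bandwidth), "cycles")
```

Under these assumptions the input alone does not fit in either scratchpad size, so the signal has to be processed in tiles, and the transfer time varies by a factor of eight across the listed DMA bandwidths.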

  8. Results - Speedup
     [Bar chart: speedup (axis 0 to 60) of MXP-DE2, MXP-DE4, and MXP-Zedboard over baseline CPUs (Raspberry Pi, Zedboard, BeagleBone Black); $N = 2^{16}$, $L = 6$ and $k = 3$]

  11. Summary
      • We propose a Modified Matrix-Form scheme to unlock inherent parallelism in 1-D DWT
      • We exploit the sparsity pattern in TM to reduce complexity from $O(n^2)$ to $O(n)$ using:
        - Skeletons to avoid wasteful multiply-by-zero operations
        - Rearrangement of input samples
      • Speedups of 12-103x over the state-of-the-art built-in signal library in Octave (dwt function)

  12. Experimental Setup
      Matrix-form 1-D DWT (CPU):
        - Optimized OpenBLAS routines in Octave and C (compiled with -O3)
        - Performance measured using PAPI v5.4.3
        - 32b ARMv7 on BeagleBone Black and Zedboard, and ARMv6 on Raspberry Pi
      Sparse Matrix Skeletons (CPU + MXP):
        - Customized DMA routines for data transfer between host and MXP
        - 16-32 vector lanes
        - 64-128 KB scratchpad memory
        - Performance measured using MXP Timing API
        - Altera DE2/DE4 and Zedboard

  13. Results - Throughput
      [Scatter plot: energy (mJ) versus throughput (GOps/s) for ARM (BeagleBone), ARM (Zedboard), ARM (Raspberry Pi), MXP-DE2, MXP-DE4, and MXP-Zed; $N = 2^{16}$, $L = 6$ and $k = 3$]

  14. Challenges
      • Large volume of data
      • Strict real-time processing constraints
      • High accuracy demands
      • Energy constraints, especially in embedded systems

  15. Modified Matrix-Form 1-D DWT Rearrangement
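One plausible way to rearrange input samples for vector lanes is sketched below. The slide does not spell out the authors' exact layout, so this is an assumption for illustration: the input is regrouped into one contiguous stream per filter tap, after which each tap contributes a single scalar-times-vector multiply-accumulate over unit-stride data. All function names are hypothetical, and the Haar filters are used only to keep the example small.

```python
import math

def rearrange(x, taps):
    """Build one stride-2 sample stream per filter tap (periodic extension)."""
    n = len(x)
    half = n // 2
    return [[x[(2 * i + t) % n] for i in range(half)] for t in range(taps)]

def vector_mac(acc, scalar, vec):
    """Elementwise acc + scalar * vec; stands in for one vector instruction."""
    return [a + scalar * v for a, v in zip(acc, vec)]

def dwt_level_vectorized(x, lo, hi):
    """One DWT level expressed as a few whole-vector multiply-accumulates."""
    streams = rearrange(x, len(lo))
    half = len(x) // 2
    approx = [0.0] * half
    detail = [0.0] * half
    for t, stream in enumerate(streams):
        approx = vector_mac(approx, lo[t], stream)
        detail = vector_mac(detail, hi[t], stream)
    return approx + detail

s = 1.0 / math.sqrt(2.0)
lo, hi = [s, s], [s, -s]                 # Haar filters (assumption)
streams = rearrange([1.0, 2.0, 3.0, 4.0], 2)
c = dwt_level_vectorized([1.0, 2.0, 3.0, 4.0], lo, hi)
print(streams)                           # [[1.0, 3.0], [2.0, 4.0]]
print(c)
```

After rearrangement, the inner work is a fixed number of long vector operations rather than many short dot products, which is the shape of computation that a wide vector engine tends to handle well.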
