Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons
Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
School of Computer Science and Engineering, Nanyang Technological University
Matrix-Form 1-D DWT
• Formulation: the output D is the matrix-vector product of the transform matrix TM and the input signal [the defining expression for TM is illegible in the source]
• TM matrix is highly sparse
  - Large number of multiply-by-zero operations
  - Large memory footprint consisting of zeroes
• Goals:
  - SIMD-friendly operations on non-zero values only
  - Customized DMA routines for efficient bandwidth utilization
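The sparsity claim above can be made concrete with a small sketch. It assumes a single-level transform, Haar filters, and periodic wrap-around at the signal edge; none of these choices is stated on the slide, and the helper names `dwt_matrix` and `matvec` are hypothetical. The dense matrix-vector product performs N x N multiplications even though only N x L entries (L being the filter length) are non-zero.

```python
import math

def dwt_matrix(n, lo, hi):
    """Dense n x n single-level DWT matrix (assumed periodic wrap-around).

    Rows 0..n/2-1 hold the low-pass filter, shifted by 2 columns per row;
    rows n/2..n-1 hold the high-pass filter laid out the same way.
    """
    half = n // 2
    t = [[0.0] * n for _ in range(n)]
    for r in range(half):
        for k in range(len(lo)):
            c = (2 * r + k) % n
            t[r][c] += lo[k]
            t[half + r][c] += hi[k]
    return t

def matvec(t, s):
    """Plain dense matrix-vector product: n*n multiplications."""
    return [sum(a * b for a, b in zip(row, s)) for row in t]

h = 1.0 / math.sqrt(2.0)
LO, HI = [h, h], [h, -h]          # Haar filters (an assumption)
N = 8
T = dwt_matrix(N, LO, HI)
S = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
D = matvec(T, S)

nonzeros = sum(1 for row in T for v in row if v != 0.0)
print(f"non-zero entries: {nonzeros} of {N * N}")
```

For this toy size only a quarter of the entries are non-zero, and the non-zero count grows linearly with N while the matrix grows quadratically.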
Sparse Matrix Skeleton
[Figure: sparse matrix skeleton; surviving labels: 8, 36]
• Remove multiply-by-zero operations
• Reduction in memory footprint of TM
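One way to read the skeleton idea is sketched below: keep only the non-zero filter taps and the rule that locates them in each row, and never touch the zero entries. This is an illustration under assumptions (single level, Haar filters, periodic wrap-around), not necessarily the authors' exact data layout; `skeleton_dwt` is a hypothetical name.

```python
import math

def skeleton_dwt(s, lo, hi):
    """Single-level DWT that multiplies only by the non-zero taps.

    Returns (coefficients, multiply_count). Storage is just the two
    tap lists, not an n x n matrix.
    """
    n = len(s)
    half = n // 2
    d = [0.0] * n
    multiplies = 0
    for r in range(half):
        acc_lo = 0.0
        acc_hi = 0.0
        for k in range(len(lo)):
            x = s[(2 * r + k) % n]   # column index of the k-th non-zero in row r
            acc_lo += lo[k] * x
            acc_hi += hi[k] * x
            multiplies += 2
        d[r] = acc_lo
        d[half + r] = acc_hi
    return d, multiplies

h = 1.0 / math.sqrt(2.0)
LO, HI = [h, h], [h, -h]          # Haar filters (an assumption)
S = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
D, mults = skeleton_dwt(S, LO, HI)
print(f"multiplications: {mults} (dense product would need {len(S) ** 2})")
```

The multiplication count is N x L, which is linear in N for a fixed filter length.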
Modified Matrix-Form 1-D DWT N = 65536
VectorBlox MXP
• Lanes: 16-32
• Scratchpad: 64-128 KB
• DMA bandwidth: 4-32 B/cycle
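A back-of-the-envelope calculation shows why these figures matter for a signal of N = 65536 samples. The sketch assumes 32-bit samples, which the slides do not state, and it ignores scratchpad space for outputs, coefficients, and double buffering, so the results are lower bounds only. The function names are hypothetical.

```python
N = 65536                 # signal length used elsewhere in the slides
BYTES_PER_SAMPLE = 4      # assumption: 32-bit samples
SIGNAL_BYTES = N * BYTES_PER_SAMPLE

def tiles_needed(scratchpad_kb):
    """Minimum number of scratchpad-sized tiles to hold the input alone."""
    capacity = scratchpad_kb * 1024
    return -(-SIGNAL_BYTES // capacity)   # ceiling division

def dma_cycles(bytes_per_cycle):
    """Cycles to stream the input once at the given DMA bandwidth."""
    return -(-SIGNAL_BYTES // bytes_per_cycle)

for kb in (64, 128):
    print(f"{kb} KB scratchpad: at least {tiles_needed(kb)} tiles")
for bw in (4, 32):
    print(f"{bw} B/cycle DMA: at least {dma_cycles(bw)} cycles per pass")
```

Under these assumptions the input alone is 256 KB, so it cannot sit in the scratchpad at once and has to be streamed in tiles, which is consistent with the emphasis on customized DMA routines.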
Results - Speedup
[Figure: speedup (y-axis, 0 to 60) over baseline CPUs (Raspberry Pi, Zedboard, BeagleBone Black) for MXP - DE2, MXP - DE4, and MXP - Zedboard. Parameters: N = 2^16; the two other settings shown are 6 and 3.]
Summary
• We propose a Modified Matrix-Form scheme to unlock inherent parallelism in 1-D DWT
• We exploit the sparsity pattern in TM to reduce complexity from O(N^2) to O(N) using:
  - Skeletons to avoid wasteful multiply-by-zero operations
  - Rearrangement of input samples
• Speedups of 12-103x over the state-of-the-art built-in signal library in Octave (dwt function)
Experimental Setup
Matrix-form 1-D DWT (CPU)
- Optimized OpenBLAS routines in Octave and C (compiled with -O3)
- Performance measured using PAPI v5.4.3
- 32b ARMv7 on BeagleBone Black and Zedboard, and ARMv6 on Raspberry Pi
Sparse Matrix Skeletons (CPU + MXP)
- Customized DMA routines for data transfer between host and MXP
- 16-32 vector lanes
- 64-128 KB scratchpad memory
- Performance measured using MXP Timing API
- Altera DE2/DE4 and Zedboard
Results - Throughput
[Figure: energy (mJ, y-axis) versus throughput (GOps/s, x-axis) for ARM (BeagleBone), ARM (Zedboard), ARM (Raspberry Pi), MXP - DE2, MXP - DE4, and MXP - Zedboard. Parameters: N = 2^16; the two other settings shown are 6 and 3.]
CHALLENGES:
• Large volume of data
• Strict real-time processing constraints
• High accuracy demands
• Energy constraints, especially in embedded systems
Modified Matrix-Form 1-D DWT Rearrangement
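The slide does not spell out the rearrangement, so the following is only one plausible scheme, sketched under assumptions (single level, Haar filters, periodic wrap-around; `rearrange` and `vector_dwt` are hypothetical names). The input is regrouped so that each filter tap multiplies one contiguous vector of samples, which turns the transform into a few scalar-times-vector multiply-accumulate steps of the kind a SIMD or vector engine handles well.

```python
import math

def rearrange(s, taps):
    """Group the samples needed by tap k into one contiguous list per tap.

    cols[k][r] is the sample that tap k multiplies when producing output r.
    """
    n = len(s)
    half = n // 2
    return [[s[(2 * r + k) % n] for r in range(half)] for k in range(taps)]

def vector_dwt(s, lo, hi):
    """Single-level DWT as per-tap scalar x vector multiply-accumulates."""
    half = len(s) // 2
    cols = rearrange(s, len(lo))
    approx = [0.0] * half
    detail = [0.0] * half
    for k, col in enumerate(cols):
        # Each line below stands in for one vector instruction over `half` lanes.
        approx = [a + lo[k] * x for a, x in zip(approx, col)]
        detail = [d + hi[k] * x for d, x in zip(detail, col)]
    return approx + detail

h = 1.0 / math.sqrt(2.0)
LO, HI = [h, h], [h, -h]          # Haar filters (an assumption)
S = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
COLS = rearrange(S, len(LO))
D = vector_dwt(S, LO, HI)
print(COLS)
```

With two taps the rearrangement is simply a split into even-indexed and odd-indexed samples; longer filters add further shifted copies of those streams.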