Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons
Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
School of Computer Science and Engineering
Nanyang Technological University

Matrix-Form 1-D DWT
• Formulation: $C = TM \times X$, where $TM = \prod_{i} T_{i}$
• TM matrix is highly sparse
  - Large number of multiply-by-zero operations
  - Large memory footprint consisting of zeroes
• Goals:
  - SIMD-friendly operations on non-zero values only
  - Customized DMA routines for efficient bandwidth utilization
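The sparsity claim on this slide can be illustrated with a small sketch. It builds a dense single-level transformation matrix and multiplies it by the input samples. The Haar taps, the periodic wrap-around, the single decomposition level, and the function names `dense_dwt_matrix` and `matvec` are all assumptions made for illustration; the slides do not specify them.

```python
import math

def dense_dwt_matrix(n, lo, hi):
    """Build a dense n x n single-level DWT matrix (periodic wrap-around assumed).

    Rows 0..n/2-1 hold the low-pass taps and rows n/2..n-1 the high-pass taps;
    each row is the previous one shifted right by two columns.
    """
    tm = [[0.0] * n for _ in range(n)]
    half = n // 2
    for i in range(half):
        for j, (l, h) in enumerate(zip(lo, hi)):
            col = (2 * i + j) % n
            tm[i][col] += l
            tm[half + i][col] += h
    return tm

def matvec(tm, x):
    """Dense matrix-vector product: every entry is multiplied, zeros included."""
    return [sum(a * b for a, b in zip(row, x)) for row in tm]

s = 1 / math.sqrt(2)
LO, HI = [s, s], [s, -s]  # Haar taps, chosen only to keep the example small

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
tm = dense_dwt_matrix(len(x), LO, HI)
c = matvec(tm, x)

# Count how many stored entries are wasted on zeros.
zeros = sum(v == 0.0 for row in tm for v in row)
print(c)
print(f"{zeros} of {len(x) ** 2} entries are zero")
```

Even at this toy size, 48 of the 64 stored entries are zero, and the dense product multiplies by every one of them.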

Sparse Matrix Skeleton
• Remove multiply-by-zero operations
• Reduction in memory footprint of TM
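One plausible reading of a skeleton is a compact store that keeps only the non-zero entries of each row together with their column positions. The sketch below follows that reading and is not taken from the authors' implementation. The names `build_skeleton` and `skeleton_matvec`, the unnormalized Haar taps, and the periodic wrap-around are hypothetical choices.

```python
def build_skeleton(n, lo, hi):
    """Return one (columns, values) pair per row, holding only non-zero taps."""
    half = n // 2
    rows = []
    for taps in (lo, hi):           # low-pass rows first, then high-pass rows
        for i in range(half):
            cols = [(2 * i + j) % n for j in range(len(taps))]
            rows.append((cols, list(taps)))
    return rows

def skeleton_matvec(skeleton, x):
    """Multiply using the stored non-zeros only; no multiply-by-zero occurs."""
    return [sum(v * x[c] for c, v in zip(cols, vals)) for cols, vals in skeleton]

# Unnormalized Haar taps (average / difference), assumed for exact arithmetic.
LO, HI = [0.5, 0.5], [0.5, -0.5]

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
skeleton = build_skeleton(len(x), LO, HI)
stored = sum(len(vals) for _, vals in skeleton)

print(skeleton_matvec(skeleton, x))
print(f"stored values: {stored} (dense matrix would store {len(x) ** 2})")
```

For this 8-sample input the skeleton stores 16 values instead of 64. Each row keeps the same number of entries, so the work per row is uniform, which is the property that makes the layout SIMD-friendly.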

Modified Matrix-Form 1-D DWT
• N = 65536

VectorBlox MXP
• Lanes: 16-32
• Scratchpad: 64-128 KB
• DMA bandwidth: 4-32 B/cycle
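A rough back-of-the-envelope sketch shows what these ranges imply for an input of N = 65536 samples. It assumes 4-byte (32-bit) samples, which the slides do not state, and it ignores the scratchpad space needed for outputs and coefficients, so the figures are lower bounds. The helper names are hypothetical.

```python
def transfer_cycles(num_samples, bytes_per_sample, dma_bytes_per_cycle):
    """Minimum DMA cycles to move the whole input (ceiling division)."""
    total = num_samples * bytes_per_sample
    return -(-total // dma_bytes_per_cycle)

def chunks_needed(num_samples, bytes_per_sample, scratchpad_bytes):
    """Minimum number of chunks if the scratchpad held input samples only."""
    per_chunk = scratchpad_bytes // bytes_per_sample
    return -(-num_samples // per_chunk)

N = 65536
BYTES_PER_SAMPLE = 4  # assumption: 32-bit samples

for bandwidth in (4, 32):       # B/cycle, the two ends of the stated range
    print(f"{bandwidth} B/cycle: {transfer_cycles(N, BYTES_PER_SAMPLE, bandwidth)} cycles")

for kib in (64, 128):           # scratchpad sizes from the stated range
    print(f"{kib} KB scratchpad: {chunks_needed(N, BYTES_PER_SAMPLE, kib * 1024)} chunks")
```

Under these assumptions the input cannot fit in the scratchpad at once, which is consistent with the emphasis on customized DMA routines.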

Results - Speedup
[Figure: speedup (y-axis, 0 to 60) of MXP-DE2, MXP-DE4, and MXP-Zedboard over three baseline CPUs: Raspberry Pi, Zedboard, and BeagleBone Black. $N = 2^{16}$, $L = 6$ and $k = 3$.]

Summary
• We propose a Modified Matrix-Form scheme to unlock the inherent parallelism in 1-D DWT
• We exploit the sparsity pattern in TM to reduce complexity from $O(n^2)$ to $O(n)$ using:
  - Skeletons to avoid wasteful multiply-by-zero operations
  - Rearrangement of input samples
• Speedups of 12-103x over the state-of-the-art built-in signal library in Octave (dwt function)
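The gap between quadratic and linear cost can be made concrete with a short count. The sketch takes N = 65536 from the slides and assumes a 6-tap filter, a single decomposition level, and 4-byte values; these assumptions and the function names are illustrative, not the authors' accounting.

```python
def dense_multiplies(n):
    """A dense n x n matrix-vector product performs n * n multiplications."""
    return n * n

def skeleton_multiplies(n, taps):
    """With only `taps` non-zeros kept per row, the cost is n * taps."""
    return n * taps

N = 65536
TAPS = 6            # assumption: filter length
BYTES_PER_VALUE = 4  # assumption: 32-bit values

dense_ops = dense_multiplies(N)
sparse_ops = skeleton_multiplies(N, TAPS)

print(f"dense multiplies:    {dense_ops}")
print(f"skeleton multiplies: {sparse_ops}")
print(f"dense storage:    {dense_ops * BYTES_PER_VALUE / 2**30} GiB")
print(f"skeleton storage: {sparse_ops * BYTES_PER_VALUE / 2**20} MiB")
```

Under these assumptions the dense form needs 4,294,967,296 multiplications and 16 GiB of storage, against 393,216 multiplications and 1.5 MiB for the non-zeros alone.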

Experimental Setup
Matrix-form 1-D DWT (CPU)
- Optimized OpenBLAS routines in Octave and C (compiled with -O3)
- Performance measured using PAPI v5.4.3
- 32b ARMv7 on Beaglebone Black, Zedboard, and ARMv6 on Raspberry Pi
Sparse Matrix Skeletons (CPU + MXP)
- Customized DMA routines for data transfer between host and MXP
- 16-32 vector lanes
- 64-128KB scratchpad memory
- Performance measured using MXP Timing API
- Altera DE2/DE4 and Zedboard

Results - Throughput
[Figure: energy (mJ, 0 to 80) versus throughput (GOps/s, 0.1 to 1.0) for ARM (BeagleBone Black), ARM (Zedboard), ARM (Raspberry Pi), MXP-DE2, MXP-DE4, and MXP-Zedboard. $N = 2^{16}$, $L = 6$ and $k = 3$.]

CHALLENGES:
• Large volume of data
• Strict real-time processing constraints
• High accuracy demands
• Energy constraints, especially in embedded systems

Modified Matrix-Form 1-D DWT Rearrangement
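The slide text does not describe the rearrangement itself, so the following is only one possible scheme, not necessarily the authors'. It regroups the input into one strided stream per filter tap, so that each output vector becomes a sequence of scalar-times-vector multiply-accumulates over contiguous data, an access pattern that maps naturally onto vector lanes. The function names, the unnormalized Haar taps, and the periodic wrap-around are assumptions.

```python
def rearrange(x, taps):
    """Build one stream per tap: stream j holds x[2*i + j] for every output i."""
    n = len(x)
    return [[x[(2 * i + j) % n] for i in range(n // 2)] for j in range(taps)]

def vector_mac(streams, coeffs):
    """Accumulate coeff * stream element-wise, one whole vector at a time."""
    acc = [0.0] * len(streams[0])
    for stream, c in zip(streams, coeffs):
        acc = [a + c * s for a, s in zip(acc, stream)]
    return acc

# Unnormalized Haar taps (average / difference), assumed for exact arithmetic.
LO, HI = [0.5, 0.5], [0.5, -0.5]

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
streams = rearrange(x, len(LO))

low = vector_mac(streams, LO)    # approximation coefficients
high = vector_mac(streams, HI)   # detail coefficients
print(low, high)
```

The same streams serve both the low-pass and high-pass outputs, so the rearrangement cost is paid once per input block.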
