Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons

slide-1
SLIDE 1

Vector FPGA Acceleration of 1-D DWT Computations using Sparse Matrix Skeletons

Sidharth Maheshwari, Gourav Modi, Siddhartha, Nachiket Kapre
School of Computer Science and Engineering, Nanyang Technological University

slide-2
SLIDE 2
slide-3
SLIDE 3

Matrix-Form 1-D DWT

  • Formulation: $C = TM \cdot X$, where $TM = \prod_i T_i$
  • TM matrix is highly sparse
      - Large number of multiply-by-zero operations
      - Large memory footprint consisting of zeroes
  • Goals:
      - SIMD-friendly operations on non-zero values only
      - Customized DMA routines for efficient bandwidth utilization
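
To make the sparsity concrete, here is a minimal sketch of one DWT level written as a dense matrix-vector product. The Haar filter, the helper names (`haar_level_matrix`, `matvec`, `zero_fraction`) and the sample input are assumptions chosen for illustration; the slides do not state which wavelet or code was used.

```python
import math

def haar_level_matrix(n):
    """Dense n x n matrix for one level of a Haar DWT (n must be even).

    Rows 0..n/2-1 give approximation coefficients, rows n/2..n-1 give details.
    The Haar filter is an assumption; it is simply the shortest wavelet filter.
    """
    s = 1.0 / math.sqrt(2.0)
    half = n // 2
    t = [[0.0] * n for _ in range(n)]
    for i in range(half):
        t[i][2 * i] = s            # low-pass taps
        t[i][2 * i + 1] = s
        t[half + i][2 * i] = s     # high-pass taps
        t[half + i][2 * i + 1] = -s
    return t

def matvec(t, x):
    """Plain dense matrix-vector product; every zero entry is still multiplied."""
    return [sum(a * b for a, b in zip(row, x)) for row in t]

def zero_fraction(t):
    """Fraction of matrix entries that are exactly zero."""
    total = len(t) * len(t[0])
    zeros = sum(1 for row in t for v in row if v == 0.0)
    return zeros / total

x = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
T = haar_level_matrix(len(x))
c = matvec(T, x)
print("coefficients:", c)
print("zero fraction:", zero_fraction(T))
```

For this 8-sample case, 48 of the 64 matrix entries are zero, and the zero fraction grows with the signal length because each row only ever holds as many non-zeros as the filter has taps.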

slide-6
SLIDE 6

Sparse Matrix Skeleton

  • Remove multiply-by-zero operations
  • Reduction in memory footprint of TM.

36 8
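
The skeleton idea can be sketched as follows: keep only the filter taps that make up the non-zero pattern of each row, and apply them directly to the input. This is a hedged illustration, not the authors' implementation; the function names, the Haar taps and the periodic (wrap-around) boundary handling are all assumptions.

```python
import math

def skeleton_dwt_level(x, lo, hi):
    """One DWT level using only the filter taps (the non-zero row pattern).

    Wrap-around indexing at the signal boundary is an assumption.
    """
    n = len(x)
    half = n // 2
    out = [0.0] * n
    for i in range(half):
        a = 0.0
        d = 0.0
        for j in range(len(lo)):
            v = x[(2 * i + j) % n]
            a += lo[j] * v         # approximation (low-pass) output
            d += hi[j] * v         # detail (high-pass) output
        out[i] = a
        out[half + i] = d
    return out

def dense_from_skeleton(n, lo, hi):
    """Expand the skeleton into the full n x n matrix (for checking only)."""
    half = n // 2
    t = [[0.0] * n for _ in range(n)]
    for i in range(half):
        for j in range(len(lo)):
            col = (2 * i + j) % n
            t[i][col] += lo[j]
            t[half + i][col] += hi[j]
    return t

def dense_matvec(t, x):
    """Reference dense product, zeros included."""
    return [sum(a * b for a, b in zip(row, x)) for row in t]

s = 1.0 / math.sqrt(2.0)
lo, hi = [s, s], [s, -s]           # Haar taps, chosen only for illustration
x = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]

sparse_result = skeleton_dwt_level(x, lo, hi)
dense_result = dense_matvec(dense_from_skeleton(len(x), lo, hi), x)
stored_skeleton = len(lo) + len(hi)    # values kept by the skeleton
stored_dense = len(x) * len(x)         # values kept by the dense matrix
```

Both paths produce the same coefficients, but the skeleton stores 4 values instead of 64 here, and its storage does not grow with the signal length.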

slide-7
SLIDE 7

Modified Matrix-Form 1-D DWT

N = 65536

slide-8
SLIDE 8

VectorBlox MXP

  • Lanes: 16-32
  • Scratchpad: 64-128 KB
  • DMA bandwidth: 4-32 B/cycle
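
A rough sizing sketch shows why the scratchpad and DMA figures matter for a 65536-sample input. It assumes 32-bit samples, a scratchpad split into two equal buffers (double buffering), and it ignores output data and coefficients; none of these choices come from the slides, and the helper names are hypothetical.

```python
def tiles_needed(n_samples, bytes_per_sample, scratchpad_bytes, buffers=2):
    """Return (tile count, samples per tile) for a scratchpad split into
    `buffers` equal parts. Double buffering is an assumption."""
    tile_bytes = scratchpad_bytes // buffers
    tile_samples = tile_bytes // bytes_per_sample
    tiles = -(-n_samples // tile_samples)   # ceiling division
    return tiles, tile_samples

def dma_cycles(n_samples, bytes_per_sample, bytes_per_cycle):
    """Lower bound on cycles needed to stream the input once over DMA."""
    total_bytes = n_samples * bytes_per_sample
    return -(-total_bytes // bytes_per_cycle)

N = 65536                 # signal length from the slides
SAMPLE_BYTES = 4          # assumption: 32-bit samples

small = tiles_needed(N, SAMPLE_BYTES, 64 * 1024)
large = tiles_needed(N, SAMPLE_BYTES, 128 * 1024)
slow = dma_cycles(N, SAMPLE_BYTES, 4)
fast = dma_cycles(N, SAMPLE_BYTES, 32)
print(small, large, slow, fast)
```

Under these assumptions the input (256 KB) never fits in the scratchpad at once, so it has to be streamed in tiles, and the DMA bandwidth range translates into an 8x difference in the minimum transfer time.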
slide-9
SLIDE 9

Results - Speedup

$N = 2^{16}$, $L = 6$ and $k = 3$

[Figure: bar chart of speedup (axis ticks 5 to 60) per board (MXP-DE2, MXP-DE4, MXP-Zed); legend, baseline CPU: Raspberry Pi, Zedboard, BeagleBone Black]

slide-12
SLIDE 12

Summary

  • We propose a Modified Matrix-Form scheme to unlock inherent parallelism in 1-D DWT
  • We exploit the sparsity pattern in TM to reduce complexity from $O(n^2)$ to $O(n)$ using:
      - Skeletons to avoid wasteful multiply-by-zero operations
      - Rearrangement of input samples
  • Speedups of 12-103x over the state-of-the-art built-in signal library in Octave (dwt function)
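
The gap between quadratic and linear cost can be illustrated by counting multiplications. The sketch below assumes a signal of 2^16 samples, a 6-tap filter and 3 decomposition levels, with each level operating on the approximation half of the previous one; reading the slide parameters this way is an assumption, and the function names are hypothetical.

```python
def dense_multiplies(n):
    """Multiplications for one dense n x n matrix-vector product."""
    return n * n

def skeleton_multiplies(n, taps, levels):
    """Multiplications when only the filter taps are applied.

    Each level emits `length` outputs (half approximation, half detail),
    each costing `taps` multiplications; the next level works on the
    approximation half only.
    """
    total = 0
    length = n
    for _ in range(levels):
        total += length * taps
        length //= 2
    return total

n, taps, levels = 2 ** 16, 6, 3
dense = dense_multiplies(n)
sparse = skeleton_multiplies(n, taps, levels)
print(dense, sparse, dense / sparse)
```

With these numbers the dense product needs about 4.3 billion multiplications against roughly 0.69 million for the skeleton form. This is an operation count only; measured speedups also depend on memory traffic and vector width.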

slide-13
SLIDE 13

Experimental Setup

Matrix-form 1-D DWT Sparse Matrix Skeletons

CPU

  • Optimized OpenBLAS routines in Octave and C (compiled with -O3)
  • Performance measured using PAPI v5.4.3
  • 32-bit ARMv7 on BeagleBone Black and Zedboard, ARMv6 on Raspberry Pi

CPU + MXP

  • Customized DMA routines for data transfer between host and MXP
  • 16-32 vector lanes
  • 64-128 KB scratchpad memory
  • Performance measured using MXP Timing API
  • Altera DE2/DE4 and Zedboard
slide-14
SLIDE 14

Results - Throughput

$N = 2^{16}$, $L = 6$ and $k = 3$

[Figure: throughput (GOps/s) versus energy (mJ) for ARM (Beagl.), ARM (Rasp.), ARM (Zedb.), MXP-DE2, MXP-DE4 and MXP-Zed]

slide-15
SLIDE 15

CHALLENGES:

  • Large volume of data
  • Strict real-time processing constraints
  • High accuracy demands
  • Energy constraints, especially in embedded systems

slide-16
SLIDE 16

Modified Matrix-Form 1-D DWT

Rearrangement
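
The transcript does not describe the rearrangement itself. One plausible reading, sketched here purely as an assumption, is to regroup the input into one strided stream per filter tap, so that each tap becomes a scalar-times-vector multiply-accumulate over contiguous data, which suits a SIMD or vector engine. The function names, Haar taps and wrap-around boundary are hypothetical choices for illustration.

```python
import math

def rearrange(x, taps):
    """Build one stream per tap: stream j holds x[2*i + j] for every output i.

    Assumption: wrap-around indexing at the end of the signal.
    """
    n = len(x)
    half = n // 2
    return [[x[(2 * i + j) % n] for i in range(half)] for j in range(taps)]

def vector_dwt_level(x, lo, hi):
    """One DWT level as a sequence of scalar * vector accumulations."""
    streams = rearrange(x, len(lo))
    half = len(x) // 2
    approx = [0.0] * half
    detail = [0.0] * half
    for j, stream in enumerate(streams):
        # Each list comprehension stands in for one vector instruction.
        approx = [a + lo[j] * v for a, v in zip(approx, stream)]
        detail = [d + hi[j] * v for d, v in zip(detail, stream)]
    return approx + detail

s = 1.0 / math.sqrt(2.0)
x = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
streams = rearrange(x, 2)
coeffs = vector_dwt_level(x, [s, s], [s, -s])
```

For the 2-tap case the two streams are simply the even-indexed and odd-indexed samples, and the inner loops contain no multiplications by zero.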