GPU Parallel Implementation of The Approximate K-SVD Algorithm Using OpenCL



SLIDE 1

Introduction OpenCL AK-SVD PAK-SVD Conclusions

GPU Parallel Implementation of The Approximate K-SVD Algorithm Using OpenCL

Paul Irofti¹, Bogdan Dumitrescu²

¹ University Politehnica of Bucharest
² Tampere University of Technology

paul@irofti.net bogdan.dumitrescu@tut.fi

EUSIPCO’2014

SLIDE 2

Outline

1. Introduction
2. OpenCL
3. AK-SVD
4. PAK-SVD
5. Conclusions

SLIDE 3

The problem

Given:
  • initial dictionary D_0
  • set of training signals Y
  • target sparsity s
  • number of iterations K

Output:
  • trained dictionary D
  • sparse representations X

Such that Y ≈ DX.

SLIDE 4

Optimization Problem

Solving the optimization problem:

$$\min_{D,X} \; \|Y - DX\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le s, \; \forall i$$
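As a small illustration of the objective and the sparsity constraint, the following Python sketch evaluates both on toy data. It assumes x_i denotes the i-th column of X; the function names and the toy matrices are hypothetical, and the toy atoms are not normalized.

```python
def frobenius_error_sq(Y, D, X):
    """Return ||Y - D X||_F^2 for matrices stored as lists of rows."""
    p, m, n = len(Y), len(Y[0]), len(D[0])
    total = 0.0
    for i in range(p):
        for j in range(m):
            approx = sum(D[i][k] * X[k][j] for k in range(n))
            total += (Y[i][j] - approx) ** 2
    return total


def is_s_sparse(X, s):
    """Check that every column of X has at most s nonzero entries."""
    n, m = len(X), len(X[0])
    return all(sum(1 for k in range(n) if X[k][j] != 0) <= s for j in range(m))


# Toy data (hypothetical): p = 2, n = 3 atoms, m = 2 signals
D = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
X = [[2.0, 0.0],
     [0.0, 0.0],
     [0.0, 3.0]]
Y = [[2.0, 3.0],
     [0.0, 2.0]]

err = frobenius_error_sq(Y, D, X)   # objective value for this (D, X)
feasible = is_s_sparse(X, 1)        # constraint with s = 1
```

Here each column of X uses a single atom, so the pair (D, X) is feasible for s = 1, and the objective measures how far DX is from Y.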

SLIDE 5

General Approach

Most algorithm iterations involve two essential steps:
  • sparse coding Y using dictionary D, resulting in X
  • updating the dictionary using the current representations X

Existing solutions:
  • Sparse representations: SP, MP, OMP
  • Dictionary update: MOD, K-SVD, AK-SVD

SLIDE 6

Current State

Practical applications employing these methods show:
  • good results
  • low representation errors
  • slow running times
  • top consumer: the sparse representation stage
  • dictionary update performed one atom at a time
  • each update step depends on the one before it

Our approach:
  • update more than one atom at a time
  • distributed sparse coding
  • new parallel algorithm: PAK-SVD

SLIDE 7

Platform

OpenCL platform:
  • execute small functions (kernels) in parallel
  • processing elements ⊂ compute units ⊂ OpenCL device
  • workload topology defined as an n-dimensional space

Notation: NDR : x, y, z

SLIDE 8

N-Dimensional Range – 2D Example
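A minimal Python sketch of how a 2D range splits into work-groups, assuming a zero global offset and a global size that is a multiple of the local size in each dimension. The function name `ndrange_2d` and the 4x4 / 2x2 sizes are illustrative only; the relation global id = group id × local size + local id is the standard OpenCL one.

```python
def ndrange_2d(global_size, local_size):
    """Enumerate work-items of a 2D range as (global id, group id, local id)."""
    gx, gy = global_size
    lx, ly = local_size
    # Assumption: the global size is a multiple of the local size in each dimension
    assert gx % lx == 0 and gy % ly == 0
    items = []
    for x in range(gx):
        for y in range(gy):
            group = (x // lx, y // ly)   # which work-group the item belongs to
            local = (x % lx, y % ly)     # position of the item inside its group
            items.append(((x, y), group, local))
    return items


# Illustrative sizes: a 4x4 range split into 2x2 work-groups
items = ndrange_2d((4, 4), (2, 2))
groups = {group for _, group, _ in items}
```

With these sizes there are 16 work-items in 4 work-groups, and the work-item with global id (3, 1) sits at local id (1, 1) in group (1, 0).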

SLIDE 9

Memory Layout

SLIDE 10

Hardware

ATI FirePro V8800 (FireGL V) specifications:
  • 1600 streaming processors
  • 2048 MB global memory
  • 32 KB local memory
  • 256 maximum work-group size
  • 20 maximum compute units
  • OpenCL v1.2 compliant
  • 2640 single-precision GFLOPS
  • 528 double-precision GFLOPS

SLIDE 11

Time Counting

Counting in CPU ticks, bypassing:
  • unsynchronized tick counts between different cores on a multiprocessor system
  • lack of serialization with MSVC compilers on x64 systems
  • EBX/RBX register spilling issues with GCC compilers when using position-independent code

On the machine we tested, one tick represents roughly 0.3125 ns.
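The tick length can be turned into wall-clock time with simple arithmetic. The Python sketch below uses the 0.3125 ns figure from the slide; the helper names are hypothetical, and the implied clock rate assumes one tick per clock cycle, which the slides do not state.

```python
NS_PER_TICK = 0.3125  # tick length quoted on the slide


def ticks_to_seconds(ticks):
    """Convert a tick count to seconds."""
    return ticks * NS_PER_TICK * 1e-9


def implied_clock_ghz():
    """Clock rate in GHz, assuming one tick per clock cycle (an assumption)."""
    return 1.0 / NS_PER_TICK


seconds = ticks_to_seconds(3_200_000_000)
ghz = implied_clock_ghz()
```

Under that assumption the tick length corresponds to a 3.2 GHz clock, so 3.2 billion ticks amount to about one second.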

SLIDE 12

AK-SVD Algorithm

Data: given dictionary D and signal set Y, compute sparse representations X and optimize dictionary D.

Iterations:
  • sparse coding: for each signal y in Y
      • use OMP(D, y) to compute its representation x in X
  • dictionary update: for each atom d in D
      • remove d from the dictionary
      • find the signals using d in their representation
      • optimize d keeping the representations and the dictionary fixed
      • update the representations by using the new atom d
      • update the dictionary by reintroducing the optimized atom d
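The dictionary update loop can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: it assumes the usual approximate K-SVD rule (new atom d = E g normalized, then g = Eᵀ d, where E holds the error columns of the signals using the atom and g their coefficients), stores D, Y and X column by column as lists, and forms the error columns explicitly for readability. All names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def error_column(atoms, y, x, skip=None):
    """Return y - sum_k x[k] * atoms[k], optionally leaving atom `skip` out."""
    return [y[r] - sum(x[k] * atoms[k][r] for k in range(len(atoms)) if k != skip)
            for r in range(len(y))]


def update_atom(atoms, signals, codes, j):
    """One sequential update of atom j and of the coefficients that use it."""
    # Signals whose representation uses atom j
    users = [i for i, x in enumerate(codes) if x[j] != 0.0]
    if not users:
        return
    # Error columns computed as if atom j were removed from the dictionary
    E = [error_column(atoms, signals[i], codes[i], skip=j) for i in users]
    p = len(atoms[j])
    # d = E g, then normalize
    d = [sum(e[r] * codes[i][j] for e, i in zip(E, users)) for r in range(p)]
    norm = math.sqrt(dot(d, d))
    if norm == 0.0:
        return
    d = [v / norm for v in d]
    # g = E^T d: refreshed coefficients for the new atom
    for e, i in zip(E, users):
        codes[i][j] = dot(e, d)
    atoms[j] = d  # reintroduce the optimized atom


def total_error(atoms, signals, codes):
    """Squared Frobenius norm of Y - D X."""
    return sum(dot(e, e) for e in
               (error_column(atoms, y, x) for y, x in zip(signals, codes)))


# Toy data (hypothetical): 2 atoms, 2 signals, both signals use atom 0 only
atoms = [[1.0, 0.0], [0.0, 1.0]]
signals = [[2.0, 0.1], [3.0, -0.1]]
codes = [[2.0, 0.0], [3.0, 0.0]]

before = total_error(atoms, signals, codes)
for j in range(len(atoms)):          # atoms are updated one after another
    update_atom(atoms, signals, codes, j)
after = total_error(atoms, signals, codes)
```

Because each call reads the atoms and coefficients left behind by the previous call, the loop over j is inherently sequential.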

SLIDE 13

Comments

Observations:
  • the dictionary is changed on each update step
  • so are the sparse representations
  • the current atom’s update depends on all of the atoms updated before it
  • AK-SVD eliminates the need to explicitly compute the residual

SLIDE 14

PAK-SVD Sparse Coding

Data: given dictionary $D \in \mathbb{R}^{p \times n}$ and signal set $Y \in \mathbb{R}^{p \times m}$, compute sparse representations $X \in \mathbb{R}^{n \times m}$.

Sparse Coding with OMP:
  • using an NDR(m, any) splitting
  • big memory footprint O(ns), where s is the desired sparsity
  • all the matrices are kept in global memory
  • each PE computes OMP for a single data item from Y

  PE_1: X_1 = OMP(Y_1)
  PE_2: X_2 = OMP(Y_2)
  ...
  PE_m: X_m = OMP(Y_m)
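The per-signal independence can be illustrated with a small pure-Python OMP. This is a sketch under stated assumptions, not the OpenCL kernel: atoms are unit-norm and linearly independent, s does not exceed the number of atoms, and the least-squares step uses the normal equations with Gaussian elimination. The helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve the small square system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def omp(atoms, y, s, tol=1e-12):
    """Greedy sparse code of y; returns n coefficients, at most s of them nonzero."""
    support, coeffs, residual = [], [], y[:]
    for _ in range(s):
        if dot(residual, residual) <= tol:
            break  # y is already represented
        # Select the unused atom most correlated with the residual
        candidates = [k for k in range(len(atoms)) if k not in support]
        j = max(candidates, key=lambda k: abs(dot(atoms[k], residual)))
        support.append(j)
        # Least squares on the selected atoms via the normal equations
        gram = [[dot(atoms[a], atoms[b]) for b in support] for a in support]
        rhs = [dot(atoms[a], y) for a in support]
        coeffs = solve(gram, rhs)
        residual = [yi - sum(c * atoms[a][r] for c, a in zip(coeffs, support))
                    for r, yi in enumerate(y)]
    x = [0.0] * len(atoms)
    for a, c in zip(support, coeffs):
        x[a] = c
    return x


def sparse_code(atoms, signals, s):
    # One independent OMP call per signal: the work a single PE would do
    return [omp(atoms, y, s) for y in signals]


# Toy data (hypothetical): the standard basis of R^3 as dictionary
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
signals = [[0.0, 3.0, -1.0], [2.0, 0.0, 0.0]]
codes = sparse_code(atoms, signals, 2)
```

No call to `omp` reads or writes data belonging to another signal, which is what allows one processing element per column of Y.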

SLIDE 15

PAK-SVD Dictionary Update

Data: $D \in \mathbb{R}^{p \times n}$, $Y \in \mathbb{R}^{p \times m}$ and $X \in \mathbb{R}^{n \times m}$.

Dictionary update for batches of ñ atoms from D:
  • calculate the full residual matrix E = Y − DX
  • for each atom from the current batch do in parallel
      • compensate the error matrix E as if the current atom was missing from the dictionary
      • find the signals using d in their representation
      • optimize d keeping the representations and the error matrix fixed
      • update the representations by using the new atom d
      • update the dictionary by reintroducing the optimized atom d

SLIDE 16

PAK-SVD Dictionary Update (2)

We use an NDR(ñ, any) splitting for updating ñ atoms at a time:

  PE_1: D_1, X_{D_1}
  PE_2: D_2, X_{D_2}
  ...
  PE_ñ: D_ñ, X_{D_ñ}

Each PE is in charge of updating one atom.

Memory layout:
  • private: d, the atom being updated
  • local or global: I, indices of signals using d
  • global: E, X, D
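The batched update can be emulated sequentially in Python. In this sketch every atom of a batch is updated from the same snapshot of E, D and X, and the results are committed only after the whole batch is done. How the real kernels commit results and refresh E between batches is not stated on the slides, so that part is an assumption, as are all names.

```python
import math


def residual(atoms, signals, codes):
    """E = Y - D X, stored as one error column per signal."""
    p, n = len(signals[0]), len(atoms)
    return [[y[r] - sum(x[k] * atoms[k][r] for k in range(n)) for r in range(p)]
            for y, x in zip(signals, codes)]


def update_one(atoms, codes, E, j):
    """Work of one PE: returns (new atom, {signal index: new coefficient})."""
    p = len(atoms[j])
    users = [i for i, x in enumerate(codes) if x[j] != 0.0]
    if not users:
        return atoms[j], {}
    # Compensate E as if atom j were missing: F_i = E_i + d_j * x_ij
    F = {i: [E[i][r] + atoms[j][r] * codes[i][j] for r in range(p)] for i in users}
    d = [sum(F[i][r] * codes[i][j] for i in users) for r in range(p)]
    norm = math.sqrt(sum(v * v for v in d))
    if norm == 0.0:
        return atoms[j], {}
    d = [v / norm for v in d]
    return d, {i: sum(a * b for a, b in zip(F[i], d)) for i in users}


def pak_svd_update(atoms, signals, codes, batch_size):
    """Update the dictionary in batches of `batch_size` atoms (in place)."""
    n = len(atoms)
    for start in range(0, n, batch_size):
        batch = range(start, min(start + batch_size, n))
        E = residual(atoms, signals, codes)       # full residual, once per batch
        # Every call reads the same snapshot, so the calls are independent
        results = [update_one(atoms, codes, E, j) for j in batch]
        for j, (d, coeffs) in zip(batch, results):  # commit after the batch
            atoms[j] = d
            for i, c in coeffs.items():
                codes[i][j] = c


# Toy data (hypothetical): 2 atoms, 2 signals, batch covering the whole dictionary
atoms = [[1.0, 0.0], [0.0, 1.0]]
signals = [[2.0, 0.1], [3.0, -0.1]]
codes = [[2.0, 0.0], [3.0, 0.0]]
pak_svd_update(atoms, signals, codes, batch_size=2)
```

With `batch_size=1` the loop reduces to the sequential one-atom-at-a-time update; larger batches trade that dependency for parallelism.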

SLIDE 17

Matrix Multiplication

OpenCL implementation:
  • split the N-dimensional space as NDR(n, m, 64, 64)
  • block-based multiplication
  • calculating a block is performed within a work-group

Memory layout:
  • global: input and output matrices
  • local: copied input block sub-matrices
  • private: vectorized types for dot operations
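The block-based scheme can be sketched sequentially in Python. Each (bi, bj) tile of the output stands in for the work of one work-group, and the sliced tiles stand in for the sub-matrices copied to local memory. The block size of 2 and the function name are illustrative only, and the vectorized dot operations are not modelled.

```python
def blocked_matmul(A, B, block=2):
    """C = A B, computed one block x block tile of C at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for bi in range(0, n, block):            # tile row of C
        for bj in range(0, m, block):        # tile column of C
            for bk in range(0, k, block):    # sweep along the shared dimension
                # Copies standing in for the local-memory sub-matrices
                a_tile = [row[bk:bk + block] for row in A[bi:bi + block]]
                b_tile = [row[bj:bj + block] for row in B[bk:bk + block]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[bi + i][bj + j] += sum(
                            a_row[t] * b_tile[t][j] for t in range(len(b_tile)))
    return C


# Toy data (hypothetical): a 2x3 matrix times a 3x2 matrix
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
B = [[7.0, 8.0],
     [9.0, 10.0],
     [11.0, 12.0]]
C = blocked_matmul(A, B, block=2)
```

Slicing handles edge tiles that are smaller than the block size, so the dimensions need not be multiples of it in this sketch.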

SLIDE 18

Experimental Results

Error

[Figure: RMSE (dB) versus iterations (10 to 200), with curves for AK-SVD and for ñ = 64, ñ = 256 and ñ = 512.]

Error evolution for m = 16384, n = 512, s = 12.

SLIDE 19

Performance (1)

[Figure: log10(time (s)) versus number of atoms (128, 256, 512), with curves for CPU and for ñ = 1, 2, 4, 8, 16, 32, 64, 128.]

Execution times for m = 16384, s = 10, K = 200.

SLIDE 20

Performance (2)

[Figure: log10(time (s)) versus number of signals (8192, 16384, 32768, 65536), with curves for CPU and for ñ = 1, 8, 16, 32, 64, 128, 256, 512.]

Execution times for n = 512, s = 8, K = 100.

SLIDE 21

More Error Results

Table: Final errors for AK-SVD and PAK-SVD with ñ = n.

         n = 128           n = 256           n = 512
  s      AK      PAK       AK      PAK       AK      PAK
  4    0.0425  0.0407    0.0385  0.0387    0.0376  0.0372
  6    0.0374  0.0349    0.0334  0.0316    0.0311  0.0297
  8    0.0345  0.0306    0.0294  0.0272    0.0259  0.0245
 10    0.0322  0.0276    0.0276  0.0239    0.0233  0.0206
 12    0.0319  0.0249    0.0254  0.0205    0.0221  0.0176

SLIDE 22

Conclusions

PAK-SVD improves AK-SVD:
  • performs up to 12x faster
  • parallel sparse coding stage
  • parallel dictionary update
  • smaller representation error