
FPGA vs GPU Performance Comparison on the Implementation of FIR Filters

Abstract

FIR filters find place in digital signal processing applications that require stopping one frequency band while passing another, or removing noise. Due to the complex structure and inherent parallelism of FIR filters, dedicated reconfigurable hardware is preferred for implementation rather than CPUs. Recently, GPGPU has emerged as an effective technique for solving computation-intensive problems with massive levels of parallelism. In this paper, we take the FIR filtering application with different tap sizes and implement it on different FPGA and GPU models using both the OpenCL and CUDA platforms. We evaluate FIR filter performance using two different kernels on GPU and compare it with various FPGA implementations, taking an OpenMP implementation that utilizes all available cores of a single CPU as the baseline performance point. In general, FPGA outperformed GPU in terms of output samples produced per second, but GPU is a life saver when very high order filters are needed, where FPGAs cannot help due to their inadequate logic resources.

Keywords

FIR Filter, GPGPU, FPGA, heterogeneous computing

1. Introduction

FIR (finite impulse response) filters are the most common digital filters used in signal processing applications due to their linear phase response and unconditionally stable characteristics. In signal processing, FIR filters are usually used for stopping one frequency band while passing another, or for removing noise from an information-carrying signal. FIR filters find use in applications varying from radar, satellite and military systems to numerous industrial systems; in fact, whenever an application involves signals, processing operations on them are inevitable, and filtering is the most common such operation. FIR filters are inherently parallel structures, so by using extra resources they can be implemented in a parallel fashion to reduce the operation time. For high order FIR filters, FPGAs have been the common solution to achieve massive levels of parallelism. However, programming FPGAs is not as easy as programming microcontrollers or digital signal processors (DSPs). Recently, GPGPU emerged as an efficient technique for solving computation-intensive problems having massive levels of parallelism with the ease of programmability. OpenCL and CUDA are the two most common frameworks to program GPUs for general-purpose applications [13] [16]. OpenMP is a parallel programming platform for CPUs and can also be used to parallelize FIR filter applications on CPU platforms [19]. However, because CPUs have a small number of cores compared to GPUs, even with OpenMP the performance results are usually not comparable with GPUs and FPGAs.

In this work, we take the FIR filtering application and implement it on different platforms, namely CPU, GPU and FPGA. While comparing their performance, we choose different models for each platform to get a fairer comparison, since performance differs steeply between different models and architectures of CPU, GPU and FPGA. Previous work usually takes one model from each platform and compares the platforms only through the results of that single model, which may misguide researchers. Therefore, 3 different FPGA, 5 GPU, and 4 different CPU models are selected for comparison.

In section two, previous works in which FPGAs and GPUs are compared for performance are reviewed. Section three gives a summary of the GPGPU programming architecture for OpenCL and CUDA. The details of the FIR filter implementations on the different platforms are given in section four. We show comprehensive performance results and discuss them in section five. Finally, in section six the discussion is concluded.

2. Related Work

In their work, Llamocca et al. compare the energy, performance and accuracy of implementations of a 2D difference of Gaussians (DoG) filter for real-time digital video processing applications on FPGA and GPU. The article concludes that for 2D filtering applications GPUs are better for performance and precision, but FPGAs have the advantage of lower power dissipation [1]. Pauwels et al. made a comparison of FPGA and GPU performance on the computation of phase-based optical flow, stereo and local image features. Based on their work, GPUs overcome FPGAs in performance aspects, especially due to the GPU's higher memory bandwidth and clock speed [2]. Kalarot and Morris compare FPGA and GPU implementations of real-time stereo vision applications. Although prior works state that FPGAs outperform GPUs [3], they conclude that GPUs are as effective as FPGAs when graphics processors are utilized efficiently with CUDA [4]. In their work, Zhang et al. take the operation of sparse matrix-vector multiplication (SpMV) for performance comparison between FPGA and GPU. GPU greatly outperforms FPGA when memory transaction operations are considered; however, when FPGA memory performance is scaled to the GPU rates, the FPGA exceeds the performance of the GPU [5]. In the digital video processing field, the dynamic partial reconfiguration method allows designers to control resources based on energy, performance, and accuracy considerations; different approaches utilized dynamic partial reconfiguration in FPGA implementations for digital video processing [6] [7]. Recently, image and video processing applications with OpenCL and CUDA programming have achieved practical performance, as stated in [8] and [9]. In their work, Che et al. compare FPGA, GPU and CPU for three different applications: Gaussian elimination, data encryption standard, and the Needleman-Wunsch algorithm. They conclude that the application characteristics are important for choosing the platform to accelerate specific applications [10]. In their work, Howes et

al. compare Monte-Carlo simulations and FFT operations on GPUs, FPGAs and the Playstation 2 [11].

GPUs were seen as powerful computational units even before OpenCL and CUDA. Accessing the computational resources of graphics cards is now possible through these parallel programming interfaces. However, programming GPUs for general-purpose parallel applications used to be much more difficult than it is today: programmers had to translate their parallel algorithms into native graphics operations using graphics APIs such as OpenGL and DirectX [20]. Smirnov and Chiueh compared the performance of a FIR filter application on GPU and CPU [21]. When this work was proposed, only graphics APIs were available to program general-purpose applications on GPUs, and the devices were not as powerful as today's.

The performance of different platforms for FIR filtering is studied in [1] for real-time digital video processing operations like DoG. The authors implemented a 2-D FIR filter with 8, 12, 16, 20, 24 and 32 coefficients. For the DoG application the filter order is very low compared to our work, and a 2-D FIR filter is utilized; moreover, the weak point of the performance comparison in [1] is the usage of only one GPU and one FPGA model, like most comparison studies on GPUs and FPGAs. However, we show that different GPU and FPGA models have sharp performance differences for FIR filter applications. It is not a fair comparison to choose only one FPGA and one GPU to compare two platforms, each of which has vastly different models with different processing powers. We also used OpenCL to program GPUs along with CUDA, and programmed the CPU with OpenMP to get a fairer baseline performance point.

3. GPGPU Architecture

CUDA and OpenCL are both frameworks for task-based and data-based general-purpose parallel execution, and their architectures show great similarity. The key difference between the two is that OpenCL is a cross-platform framework (implemented on CPUs, GPUs, DSPs, etc.), whereas CUDA is supported only by NVIDIA GPUs. This gives programs written using OpenCL the flexibility to work on a wide range of systems; however, this flexibility brings extra programming effort with it [17]. Although both the OpenCL and CUDA frameworks have proved themselves by accelerating a majority of parallel applications, we generally use CUDA terminology and its model to explain the GPGPU work in this paper, to ease understanding.

3.1. Memory Hierarchy

The memory hierarchy of GPGPU architectures shows similarity to CPU memory hierarchies in the sense that both try to speed up memory transactions by utilizing a hierarchical structure rather than a monolithic memory resource. An overview of the memory hierarchy of CUDA devices is given in Figure 1.

At the bottom level of the memory hierarchy resides the slowest but largest-capacity memory type, named global memory in CUDA terminology. A typical global memory is 2 or 4 gigabytes in size and resides outside of the GPU chip. Global memories are usually manufactured using DRAM technology, like main memories in the CPU memory hierarchy. Although global memory can store a very large amount of data, accesses to it may take hundreds of clock cycles. Therefore, the programmer has to be very careful not to make redundant read/write operations.

Constant memory is another memory type in CUDA devices and is optimized for broadcast operations, so it can be accessed faster than global memory. Its disadvantage is being read-only. If data will not be modified in an application, it can be stored in constant memory instead of global memory to reduce the latency of memory reads. However, the constant memory of graphics cards is very limited: for example, the mobile graphics card GT555M has 64 kilobytes of constant memory, and the GTX560ti, a powerful desktop graphics processor, also has 64 kilobytes. The host can only communicate with the GPGPU through the constant memory and global memory.

Like caches in the CPU memory hierarchy, there is a faster but smaller memory type in the CUDA memory hierarchy called shared memory. It is named "shared" because it is shared among the threads within a block. To increase memory bandwidth and decrease the latency of memory transactions, shared memory can be programmed so that frequently used data is kept there. Maximizing the utilization of shared memory and minimizing accesses to global memory is a very efficient optimization method when writing CUDA programs. Registers are the remaining storage units in the CUDA memory hierarchy and are private to each thread. Registers have the smallest latency and maximum throughput, but their amount is very limited.

The OpenCL memory architecture is not very different from the CUDA memory hierarchy; a similar hierarchical structure applies to OpenCL, yet the terminology is slightly different from CUDA's [18].

Figure 1: CUDA Memory Hierarchy

3.2. Programming Model

In CUDA, the atom of the parallel execution model is called a "thread". Many threads work in parallel when a CUDA program is launched. Sets of threads constitute a block.


Communication between threads in a block is achieved through shared memory, but blocks are totally independent of each other. The number of threads that can be in a block is limited. Blocks come together to construct a grid, which can be 1D, 2D or 3D in a CUDA application.

An important optimization in CUDA is to utilize as many blocks as possible in each SM (streaming multiprocessor). On the hardware side, at most 8 blocks can be issued to an SM. Another hardware limitation is the maximum number of threads in an SM; for the GPUs we used in this work, this maximum is 1536. If there are 512 threads in each block of a grid, only three blocks can be issued. Shared memory utilization may also affect the performance of CUDA applications: if a block allocates nearly all of the shared memory, then only that block can be issued and no other. It is important to fit as many blocks as possible inside an SM in order to give the scheduler more options to hide long latency operations.
4. FIR Filter Implementation

4.1. Overview

The FIR filter structure is constructed from its transfer function and the linear difference equation obtained by taking the inverse Z-transform of the transfer function. An M-order FIR filter's transfer function and the corresponding linear difference equation are given in (1) and (2), respectively:

H(z) = b0 + b1·z^(-1) + … + b(M-1)·z^(-(M-1))   (1)

y(n) = b0·x(n) + b1·x(n-1) + … + b(M-1)·x(n-(M-1))   (2)

The output stream y(n) is calculated by multiplying the input samples [x(n), x(n-1), …, x(n-M+1)] with the corresponding filter coefficients [b0, b1, …, b(M-1)] and adding all the multiplication results together. Digital filters take sampled discrete signals, where the sample rate defines the number of points taken from a continuous signal per unit time. The precision of the magnitude of a sampled point is called its "resolution", and the process of mapping magnitudes to this precision is called "quantization". In an FIR filter application, the continuous signal is first sampled and quantized, usually by an analog-to-digital converter (ADC), then sent to the filter to be processed. A typical ADC has a sample rate of 100 kHz and 16-bit resolution [15].

FIR filters can be implemented in hardware in a variety of ways, but the most preferable method is the direct-form FIR structure. In this form, the linear difference equation in (2) is implemented as a tapped delay line. An M-order direct-form FIR filter implementation is given in Figure 2. The triangle components indicate two-input multipliers that operate on an input sample x[i] and the corresponding coefficient bi. Circles specify two-input adders, and squares denote delay elements, whose output is one sample point behind the input sample point. Delay elements are usually implemented with shift registers.
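As a concrete reference for equation (2), the direct-form computation can be sketched in a few lines of Python (an illustration, not the paper's code; the function name and the zero-padding of samples before x(0) are our assumptions):

```python
def fir_direct(x, b):
    # Direct-form FIR: y(n) = b0*x(n) + b1*x(n-1) + ... + b(M-1)*x(n-M+1),
    # with samples before x(0) taken as zero.
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        y.append(acc)
    return y
```

For a 4-tap moving-average filter b = [0.25]*4, for example, a constant input settles to its own value once the delay line fills.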

Figure 2: Direct-form FIR Filter Structure

Symmetric FIR filters can be implemented if the coefficients

of the FIR filter show the symmetry property. For an m-tap FIR filter, if b0 = b(m-1), b1 = b(m-2), …, b(i) = b(i+1), where i = m/2 - 1, then the FIR filter can be implemented in the symmetric direct-form structure [22]. An m-tap direct-form symmetric FIR filter implementation is given in Figure 3. The number of multiplier units in an m-tap symmetric FIR filter is half that of a direct-form FIR filter.
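The halving of multipliers can be illustrated with a small sketch (our own illustration, assuming zero-padded early samples) that folds each pair of equal coefficients into a single multiply:

```python
def fir_symmetric(x, b):
    # Symmetric direct-form: samples sharing the same coefficient are
    # added first, so each pair costs one multiply instead of two.
    # Requires b[i] == b[m-1-i].
    m = len(b)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(m // 2):
            xa = x[n - i] if n - i >= 0 else 0.0
            xb = x[n - (m - 1 - i)] if n - (m - 1 - i) >= 0 else 0.0
            acc += b[i] * (xa + xb)
        if m % 2:  # middle tap of an odd-length filter has no pair
            mid = m // 2
            acc += b[mid] * (x[n - mid] if n - mid >= 0 else 0.0)
        y.append(acc)
    return y
```

With b = [1, 2, 2, 1] this produces the same outputs as the plain direct form while performing only two multiplications per output instead of four.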

4.2. GPU Implementation

Three different implementations are designed to compare the performance of GPUs with the FPGAs. Two of the designs are implemented using CUDA, while the third is an OpenCL kernel implementation. The first CUDA design is a naïve, simple kernel without any significant optimization. The second, optimized CUDA kernel uses shared memory and coalesces global memory accesses. The third is an OpenCL version of the highly optimized CUDA FIR filter implementation. We name these implementations the basic CUDA, optimized CUDA and OpenCL FIR filters.

The FIR filtering application fits GPGPU computing due to its inherently parallel execution structure for the expensive multiplication operations, and because the coefficients of the FIR filter can be kept in read-only constant memory, as they stay constant during the execution of the kernel. In section 3 it was stated that the constant memory of a typical CUDA platform is usually limited to 64 kilobytes. However, this has no effect on our FIR filtering application's performance, since the largest FIR filter we use has 4096 coefficients, which allocate only 16 kilobytes of constant memory. On a CUDA architecture having 64 kilobytes of constant memory, a FIR filter of order up to 16384 can be implemented with all coefficient values residing in constant memory as single-precision floating-point numbers.

Figure 3: M-tap symmetric direct-form FIR Filter Structure


Figure 4: Basic CUDA FIR Filter Kernel

Figure 4 shows our basic CUDA kernel code. Here, each thread computes one output element by multiplying all coefficients with the corresponding input elements sequentially. All input elements are accessed from global memory, while the coefficients reside in constant memory. The variables defined in the first four lines in Figure 4 are private to each thread. "lid" and "gid" stand for local and global IDs, respectively: "lid" is the thread index within its block, while "gid" specifies the thread index over all the threads in the grid.
threads in the grid. Optimized CUDA kernel is slightly different than the basic CUDA kernel based on their memory transaction operation

  • handlings. First, in optimized CUDA kernel all the threads

in a block load multiple input elements to the shared

  • memory. Since an input element has to be read multiple

times, up to the order of the FIR filter, during the filtering process, caching these elements in the shared memory greatly increases total performance. The code portion of the

  • ptimized CUDA kernel is given in Figure 5. Input sample

values are stored in shared memory as chunks formed by two sections: New data section and overlapping section. In

  • rder to calculate an output value of the FIR filter, N input

samples are needed where N is the order of the FIR filter. The newer data section of the input samples is brought from global memory to shared memory, while the older N input samples should also exist in there. The older input samples constitute overlapping section. Threads first load an element to the new data section according to their global ids. Input samples are loaded to the shared memory and FIR filtering

  • perations related to these input values are performed. Then

the overlapping section is loaded sequentially with the offset

  • f block size. Figure 6 represents the loading methodology

we implemented for the optimized CUDA kernel. Shared memory is allocated dynamically before the kernel launched to maximize the utilization of SMs for any order FIR filters. By doing so, unnecessary compilation of FIR filter code for any order size every time is avoided. Figure 5: Optimized CUDA FIR Filter Kernel An important issue here is that the number of blocks that can reside on the SM at the same time is limited by the shared memory for large number of taps. Block size does not affect the allocated shared memory too much. Therefore, to truly utilize the GPU, it is important to select block size that will assign as much as treads near to the maximum thread capacity of the SM. For example; 4096 taps with 256 block size allocate 17408 bytes. Because of the maximum shared memory size of the GT 555M is 48KB, the maximum number of blocks per SM is limited to two. It is possible to use 256x2 threads per SM with utilizing two blocks, where the maximum number of threads is 1536. It is obvious that the shared memory will always limit the system utilizing two blocks per SM. Thus, block size of 512 have to be preferred for better occupancy.
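The block-size reasoning can be checked with a small occupancy estimate (our own sketch; the per-block footprint of (taps + block size) floats follows the loading scheme described here, and the resource limits are the GT 555M figures quoted in the text):

```python
def blocks_per_sm(taps, block_size, smem_bytes=48 * 1024,
                  max_threads=1536, max_blocks=8):
    # Shared memory per block: overlapping section (taps floats)
    # plus the new-data section (block_size floats), 4 bytes each.
    smem_per_block = (taps + block_size) * 4
    by_smem = smem_bytes // smem_per_block
    by_threads = max_threads // block_size
    return min(by_smem, by_threads, max_blocks)
```

For 4096 taps, a 256-thread block allocates (4096 + 256) * 4 = 17408 bytes, so two blocks (512 threads) fit per SM; a 512-thread block also fits twice but occupies 1024 threads, hence the preference for block size 512.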

Figure 6: Shared Memory Loading Methodology

The computation phase of the optimized CUDA kernel is not much different from the basic CUDA kernel. Instead of accessing input elements from global memory, the optimized kernel reads elements that were previously loaded into shared memory. Finally, threads write the computed results back to global memory.

float result = 0;
int lid = threadIdx.x;
int gid = blockIdx.x*blockDim.x + lid;
int offset = gid;
//do the computation
for(int i = 0; i <= taps; i++){
    if(offset >= 0)
        result += coeffs[i]*src[offset--];
    else
        break;
}
//write result to global memory
output[gid] = result;
float result = 0;
extern __shared__ float shrdMem[];
int lid = threadIdx.x;
int gid = blockIdx.x*blockDim.x + lid;
int local_offset = lid;
int global_offset = gid - taps;
//load shared mem
if(gid < srcSize) {
    shrdMem[taps + lid] = src[gid];
}
while(local_offset < taps){
    //load overlapping elements
    if(global_offset >= 0)
        shrdMem[local_offset] = src[global_offset];
    else
        shrdMem[local_offset] = 0;
    local_offset += BLOCK_SIZE;
    global_offset += BLOCK_SIZE;
}
__syncthreads();
//do the computation
for(int i = 0; i <= taps; i++){
    result += coeffs[i]*shrdMem[lid + taps - i];
}
//write result to global mem
output[gid] = result;
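To see that the tiled scheme computes the same result as a plain convolution, the kernel's indexing can be emulated sequentially in Python (a behavioral sketch of the listing above, not the CUDA code itself; note the kernel loops over taps + 1 coefficients):

```python
def fir_tiled(src, coeffs, block_size):
    # Sequential emulation of the optimized kernel: each "block" stages
    # block_size new samples plus `taps` older ones (the overlapping
    # section) in a local window, then each "thread" computes one output.
    taps = len(coeffs) - 1               # the kernel loops i = 0 .. taps
    out = [0.0] * len(src)
    for block_start in range(0, len(src), block_size):
        shrd = [0.0] * (taps + block_size)   # [overlap | new data]
        for lid in range(block_size):        # load phase
            gid = block_start + lid
            if gid < len(src):
                shrd[taps + lid] = src[gid]  # new-data section
            local, glob = lid, gid - taps    # overlapping elements,
            while local < taps:              # strided by block size
                if glob >= 0:
                    shrd[local] = src[glob]
                local += block_size
                glob += block_size
        for lid in range(block_size):        # compute phase
            gid = block_start + lid
            if gid < len(src):
                out[gid] = sum(coeffs[i] * shrd[lid + taps - i]
                               for i in range(taps + 1))
    return out
```

For any block size, the staged window reproduces exactly the src[gid - taps] … src[gid] samples each output needs, so the result matches the direct computation of equation (2).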

The OpenCL implementation is a direct translation of the optimized CUDA FIR filter kernel. We implemented it to evaluate the FIR filter application on other devices, such as AMD GPUs.

4.3. FPGA Implementation

Three different implementation techniques are selected to synthesize FIR filters on various FPGAs: direct-form, symmetric-form, and distributed arithmetic. It is possible to achieve massive levels of parallelism by utilizing the multiplier resources of the FPGAs. Most Xilinx FPGAs have DSP48 macro blocks embedded in their chips [12]. These slices have 18x18-bit multiplier units with a pre-accumulator, 48-bit accumulators and selection multiplexers to speed up DSP operations. For the direct-form and symmetric-form FPGA implementations, Xilinx's DSP48 macro slices are utilized.

The distributed arithmetic (DA) technique is an efficient method for implementing multiplication operations without using the DSP macro blocks of the FPGA [23]. In the DA technique, the coefficients of the FIR filter are represented in two's complement binary form, and all possible sums of the filter coefficients are stored in look-up tables (LUTs). Using the classical shift-add method, the multiplication operation can then be performed effectively without using the multiplier units of the FPGA. We used the 4-input LUTs of the FPGA to implement the DA form of the FIR filter structure.

We chose three different FPGAs to compare the performance results of the FIR filters. The utilized FPGAs and their properties are given in Table 1. Xilinx ISE v14.1 software is used to synthesize the circuits.

Table 1: FPGA properties used in implementing FIR filters

Device      | Spartan 3A  | Spartan 6 | Artix 7
Part        | XC3SD1800A  | XC6SLX75  | XC7A200T
Logic Cells | 37440       | 74637     | 215360
DSP Slices  | 84          | 132       | 740
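The DA principle can be sketched for one output sample (an illustration of the technique, not our FPGA code; for simplicity the inputs here are unsigned integers, whereas the hardware uses two's complement):

```python
def da_dot(coeffs, window, bits=8):
    # Distributed arithmetic: precompute a LUT holding every possible
    # sum of coefficients, then combine the input samples bit-serially
    # with shift-adds instead of multipliers.
    taps = len(coeffs)
    lut = [sum(c for k, c in enumerate(coeffs) if (pattern >> k) & 1)
           for pattern in range(1 << taps)]
    acc = 0
    for j in range(bits):                 # one pass per input bit plane
        pattern = 0
        for k, x in enumerate(window):    # gather bit j of every tap
            if (x >> j) & 1:
                pattern |= 1 << k
        acc += lut[pattern] << j          # shift-add of the LUT value
    return acc
```

Note that the LUT has 2^taps entries, which is why high order DA filters exhaust look-up table resources unless the taps are partitioned into small groups (such as the 4-input LUTs used here).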

We used the MATLAB™ Signal Processing Toolbox to generate the VHDL code of the FIR filters. For the addition operations, the tree adder type is selected to reduce latency. Instead of using the single-precision floating-point data type for coefficients, input and output values, as in the GPGPU implementations, we used fixed-point data types for the FIR filter implementations on FPGAs. The reason is to reduce the complexity of the multiply and add operations and thereby increase the performance of the FIR filter. However, precautions should be taken regarding the resolution of the coefficient values to avoid quantization errors. Using 16-bit resolution for input and output signals, 18 bits for coefficients, an 18x18 multiplier and a 40-bit accumulator, the quantization effect is negligible in the magnitude response of the FIR filters. The GPU does not have this optimization option, as CUDA and OpenCL do not allow half-precision arithmetic operations inside the GPU. Note also that a GPU needs a CPU and global memory components to work; in contrast, an FPGA can be used standalone for FIR filtering applications.
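The effect of the 18-bit coefficient format can be illustrated with a small quantization sketch (our own illustration; the choice of 16 fractional bits is an assumed scaling, not necessarily the toolbox's):

```python
def quantize(value, total_bits=18, frac_bits=16):
    # Round to a signed fixed-point grid: total_bits wide with
    # frac_bits fractional bits, saturating at the format's range.
    step = 2.0 ** -frac_bits
    q = round(value / step)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q)) * step
```

Each in-range coefficient then deviates from its double-precision value by at most half a step (2^-17 here), which is why the quantization effect on the magnitude response is negligible at these word lengths.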

5. Results and Discussions

Performance comparison in a FIR filter application is based on the output data rate in terms of the number of samples produced. We assume a 16-bit resolution, 10 kHz sample rate ADC output feeding the FIR filter [15]; therefore, a 10 kHz sample rate is used when calculating the filter coefficients. Coefficients are calculated using the MATLAB Signal Processing Toolbox in double-precision format. In the CPU and GPU implementations of the FIR filter, the double-precision coefficients are converted to single precision, and on FPGA we used fixed point. Even though the output data precision of the FIR filter differs between FPGA and the other platforms, quantization has a negligible effect on the calculations. Since the aim of this paper is to compare FIR filtering performance from an application perspective, GFLOPS or throughput with respect to bit rate is not a fair comparison methodology; for a FIR filtering application, the most important performance criterion is achieving maximum throughput with respect to the number of sample points being filtered.

We wrote a Java program to produce input sample points from a generated sine wave. Noise components are added at frequencies in the stopband. For example, if the stopband starts at 350 Hz, noise from 350 Hz up to 450 Hz is added to a pure 50 Hz sinusoidal signal, where 50 Hz resides in the passband. The sampled data points from the Java program are passed through the FIR filter, and the output data stream is plotted in MATLAB to check whether the filter worked properly and suppressed the noise components of the main signal. We tested the correctness of each filter for every platform and device to be sure that the filter works properly.

Figure 7: FPGA throughput performances obtained from three types of implementations on three different FPGA architectures. DF, S and DA stand for the implementation names Direct Form, Symmetric and Distributed Arithmetic, respectively, while the x-axis indicates filter order and the y-axis throughput in million samples per second.

The throughput rates obtained from the different implementations on the FPGA platform are shown in Figure 7. As the money spent on the FPGA increases, the performance gained also



Table 2: GPU and CPU Performance Results of the FIR Filter Application (Million Samples per Second). Columns are FIR orders 64, 128, 256, 512, 1024, 2048 and 4096; row groups are OpenMP on CPU (i5 3210M, i7 2670QM, i7 3820, i7 860), basic CUDA (GT555M, GT635M, GTX260, GTX560ti), optimized CUDA (GT555M, GT635M, GTX260, GTX560ti) and OpenCL (GT555M, GTX260, GTX560ti, HD7870).

19.70 17.87 13.31 2.18 1.06 0.55 0.29 9.68 4.59 1.98 0.90 0.64 0.41 13.57 4.29 2.07 1.03 0.51 0.36 6.58 3.16 1.54 0.78 0.46 0.32 94.39 71.53 47.70 27.68 15.18 8.12 4.19 70.67 50.96 32.05 18.61 10.15 5.40 2.76 40.21 35.64 28.29 20.01 12.51 7.29 180.24 150.88 108.54 74.19 44.48 24.60 13.05 109.42 90.18 63.44 41.04 24.80 13.28 5.40 92.02 72.17 49.52 30.51 17.15 8.92 3.59 39.19 38.83 33.59 29.00 21.55 13.80 181.14 166.48 136.28 100.89 66.87 38.81 19.89 60.35 54.56 45.39 32.90 21.49 12.29 5.53 18.59 16.89 14.78 12.74 9.48 5.73 106.82 102.93 92.54 75.00 54.86 34.55 20.02 50.71 48.81 45.06 38.67 29.41 20.57 11.27 17.25 8.49 4.42 4.02 N/A N/A

increases for all three implementations. Filters of order 1024 and higher are not implementable on the Spartan 3A in DF or symmetric form due to lack of resources, and 4096-order filters are not implementable on any of the three FPGAs for the same reason. While the DA implementation has the highest throughput, filters of order higher than 512 are not possible on the Spartan 3A due to the steep increase in its look-up table usage. On average, for a 256-order FIR filter, DA performs best with 164 MSamples/sec, while DF executes at only 67 MSamples/sec.

Table 2 shows the throughput obtained from the CPU and GPU devices. Compared to the FPGA results, the best GPU we evaluated (GTX560ti) performs worse than DA on the Artix 7. But GPUs have the advantage of being able to execute FIR filters of order 4096 and higher; only the GTX260 cannot execute a 4096-order filter, due to insufficient shared memory. As the latest GPU architectures have more shared memory, it is possible to run filters of very high order. Also, instead of the straightforward shared memory loading method, the kernel could be optimized to split the shared memory loads so that even higher order FIR filters could be executed independently of the shared memory size; we leave this implementation as future work.

Compared to the CPU, it is obvious that both GPUs and FPGAs are more efficient and have much higher throughput. Peak performance shows that an almost 10x speedup is achieved for a 64-order FIR filter by utilizing an FPGA or GPU with respect to CPU performance, and much higher gains are obtained for higher order FIR filters. Table 2 indicates that GPU performance decreases as the order of the FIR filter increases. By analyzing both the FPGA and the GPU results, we can say that:

  • FPGAs are more efficient than GPUs, especially for higher order filters, when the multiplication operations of the FIR filter can be fully parallelized. Otherwise, GPU performance outperforms FPGA when serialization occurs.

  • Resource allocation is a problem for FPGAs: for example, a 4096-order FIR filter cannot be implemented on any of the three FPGAs, and the Spartan 3A DSP's resources were not sufficient for FIR filters of order higher than 512. On GPUs, higher order FIR filters can be implemented, yet the performance decreases.

  • When comparing FPGA and GPU on the FIR filtering application, it is not fair to select one model of each; the device model affects performance sharply. For a 128-order DA FIR filter, the Artix 7 achieved 224.3 million samples per second of throughput while the Spartan 3A DSP achieved only 85.9. On the other hand, the GTX 560ti throughput was 166.5 million samples per second and the GTX 260 throughput was 38.4.

  • Development and implementation of FIR filters involve different steps on FPGA and GPU. Writing HDL code is very difficult compared to writing GPGPU kernel code. Also, synthesis takes an enormous amount of time, especially for higher order FIR filters.

6. Conclusions

GPGPUs and FPGAs are both platforms for efficiently implementing highly parallelizable applications. We have evaluated the performance of FIR filters across various models of both architectures and showed that both platforms are effective for accelerating FIR filtering applications, each with its own advantages and disadvantages. The performance comparison is made through a single metric: the output data rate in terms of the number of samples produced in unit time (e.g. million samples/sec).

The FIR filter order has a noticeable effect on performance: for lower order FIR filters, both FPGA and GPU achieved better performance than for higher order FIR filters. Serialization due to the lack of enough multiplier units is the main reason for performance decrease on FPGAs. The logic resource capacity of an FPGA is another limiting factor for implementing high order FIR filters; an FPGA must have a relatively substantial amount of logic slices to synthesize FIR filters of order higher than a thousand. GPUs, on the other hand, have to be coded thoroughly: utilizing shared memory and coalescing global memory accesses have significant performance effects, so one should optimize the kernel with these issues in mind to obtain more performance. FPGAs have relatively lower prices than GPUs, yet GPUs enjoy ease of programmability where FPGAs are still tough to program. In general, FPGA performance is higher than GPU when the FIR filter is


fully parallelized on the FPGA device. However, GPU outperforms FPGA when the FIR filter has to be implemented with serial parts on the FPGA.

7. References

[1] D. Llamocca, C. Carranza, and M. Pattichis, "Separable FIR filtering in FPGA and GPU implementations: Energy, performance, and accuracy considerations," 21st International Conference on Field Programmable Logic and Applications (FPL), Sept. 2011.
[2] K. Pauwels, M. Tomasi, J. Diaz, E. Ros, and M.M. Van Hulle, "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features," IEEE Transactions on Computers, July 2012.
[3] S. Asano, T. Maruyama, and Y. Yamaguchi, "Performance comparison of FPGA, GPU, and CPU in image processing," IEEE Trans. Image Processing, vol. 16, no. 3, pp. 879-884, Mar. 2007.
[4] R. Kalarot and J. Morris, "Comparison of FPGA and GPU implementations of real-time stereo vision," Computer Vision and Pattern Recognition Workshops (CVPRW), June 2010.
[5] Y. Zhang, Y.H. Shalabi, R. Jain, K.K. Nagar, and J.D. Bakos, "FPGA vs. GPU for sparse matrix vector multiply," International Conference on Field-Programmable Technology (FPT), Dec. 2009.
[6] D. Llamocca, M. Pattichis, and A. Vera, "A dynamically reconfigurable parallel pixel processing system," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), Sept. 2009.
[7] D. Llamocca and M. Pattichis, "Real-time dynamically reconfigurable 2-D filterbanks," in Proceedings of the IEEE Southwest Symposium on Image Analysis & Interpretation, Austin, TX, May 2010.
[8] V. Podlozhnyuk, "Image convolution with CUDA," NVIDIA, June 2007.
[9] B. Cope, P.Y.K. Cheung, W. Luk, and S. Witt, "Have GPUs made FPGAs redundant in the field of video processing?," IEEE International Conference on Field-Programmable Technology (FPT), Dec. 2005.
[10] S. Che, J. Li, J.W. Sheaffer, K. Skadron, and J. Lach, "Accelerating compute-intensive applications with GPUs and FPGAs," Symposium on Application Specific Processors (SASP), June 2008.
[11] L.W. Howes, P. Price, O. Mencer, O. Beckmann, and O. Pell, "Comparing FPGAs to graphics accelerators and the Playstation 2 using a unified source description," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), Aug. 2006.
[12] Xilinx, XtremeDSP DSP48A for Spartan-3A DSP FPGAs User Guide, Jul. 2008.
[13] NVIDIA, CUDA C Programming Guide, v5.0, Oct. 2012.
[14] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice-Hall, 1981.
[15] Analog Devices, "16-bit 100 kSPS sampling ADC," AD677 datasheet, Rev. A.
[16] Khronos OpenCL Working Group, "The OpenCL specification version 1.2," 2010, http://khronos.org/opencl.
[17] K. Karimi, N.G. Dickson, and F. Hamze, "A performance comparison of CUDA and OpenCL," arXiv e-prints, arXiv:1005.2581.
[18] J. Fang, A.L. Varbanescu, and H. Sips, "A comprehensive performance comparison of CUDA and OpenCL," International Conference on Parallel Processing (ICPP), pp. 216-225, Sept. 2011.
[19] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, Jan.-Mar. 1998.
[20] D.B. Kirk and W.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
[21] A. Smirnov and T. Chiueh, "Implementation of a FIR filter on a GPU," Technical report, ECSL, 2005.
[22] T.C. Denk and K.K. Parhi, "Synthesis of folded pipelined architectures for multirate DSP algorithms," IEEE Transactions on Very Large Scale Integration Systems, vol. 6, no. 4, pp. 595-607, Dec. 1998.
[23] P. Longa and A. Miri, "Area-efficient FIR filter design on FPGAs using distributed arithmetic," in Proc. Int. Symp. on Signal Processing and Information Theory, 2006.