Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs


Presented at the International Conference on Computational Science (ICCS 2013) by G. Bernabé, J. Cuenca and D. Giménez, University of Murcia.


SLIDE 1

Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs

  • G. Bernabé†, J. Cuenca† and D. Giménez‡

† Computer Engineering Department, University of Murcia

‡ Computer Science and Systems Department, University of Murcia

5-7 June, 2013

International Conference on Computational Science (ICCS 2013)

SLIDE 2

ICCS’13 – Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs

Outline

  • Introduction
  • The Wavelet Transform
  • Optimization techniques for 3D-FWT on a single GPU system
  • Optimization techniques for 3D-FWT on hybrid systems
  • Conclusions and Future work

SLIDE 4

Introduction

  • The application of the Wavelet Transform

– Important development: mainly applied to image and video compression
– Optimal tiled 2D and 3D FWT: reduction of almost an order of magnitude in the overall execution time (with respect to a baseline version on a CPU)
– CUDA and OpenCL provide mechanisms to optimize general-purpose applications on GPUs (GPGPU)
– Several implementations of the 3D-FWT in CUDA and OpenCL for acceleration on GPUs

  • A method to automatically compute the parameters of the 3D-FWT running on systems with multicore CPUs and manycore GPUs

SLIDE 7

The Wavelet Transform

1D-FWT

  • The wavelet transform uses simple filters for fast computation
  • The filters are applied to the signal, and the filter output is downsampled by two, generating two bands
  • The amount of data is maintained on each additional level, with minimal information loss
  • The access pattern is determined by the mother wavelet function
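
As a minimal sketch of one 1D-FWT level, the Python below applies a low-pass and a high-pass filter to a signal and downsamples each output by two. The slides do not name the mother wavelet; the Haar filters used here are an assumption chosen for brevity, and `fwt_1d_level` is a hypothetical helper name.

```python
import math

def fwt_1d_level(signal):
    """Apply one level of a Haar 1D-FWT (Haar is an assumption, not from the slides).

    Returns (low, high): the reference and detail bands, each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Filtering and downsampling by two are fused: one output per input pair.
    low = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return low, high

low, high = fwt_1d_level([4.0, 4.0, 2.0, 0.0])
```

The two bands together hold as many samples as the input, which is how the amount of data stays constant from one level to the next.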

SLIDE 8

The Wavelet Transform

2D-FWT

  • Generalizes the 1D-FWT to an image (2D)
  • Applies the 1D-FWT to each row and to each column of the image

[Figure: original image, rows transformed, columns transformed, ... result after a three-level application of the filters]
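
The row-then-column scheme can be sketched as follows. This is an illustration only: it assumes Haar filters (the slides do not specify the wavelet), performs a single level, and uses the hypothetical helper names `haar_1d` and `fwt_2d_level`.

```python
import math

def haar_1d(values):
    """One Haar level (assumed wavelet): low band followed by high band."""
    s = 1.0 / math.sqrt(2.0)
    low = [(values[i] + values[i + 1]) * s for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) * s for i in range(0, len(values), 2)]
    return low + high

def fwt_2d_level(image):
    """One 2D-FWT level: 1D-FWT on every row, then on every column."""
    rows_done = [haar_1d(row) for row in image]
    # Transpose so that columns become rows, transform them, and transpose back.
    columns = [list(col) for col in zip(*rows_done)]
    cols_done = [haar_1d(col) for col in columns]
    return [list(row) for row in zip(*cols_done)]

image = [[1.0, 1.0, 1.0, 1.0] for _ in range(4)]
result = fwt_2d_level(image)
```

For this constant test image, all non-zero coefficients end up in the top-left quadrant, which is the reference band that a further level would transform again.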

SLIDE 9

The Wavelet Transform

3D-FWT with tiling

  • Generalizes the 1D-FWT to a video sequence (3D)

1. Nrows x Ncolumns calls to the 1D-FWT across the frames
2. Nframes calls to the 2D-FWT with tiling, one per frame
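
The two steps above can be sketched in a few lines of Python. The sketch assumes Haar filters and a single decomposition level, and it omits the tiling optimization entirely; `haar_1d`, `fwt_2d` and `fwt_3d` are hypothetical names used only for illustration.

```python
import math

def haar_1d(values):
    """One Haar level (assumed wavelet): low band followed by high band."""
    s = 1.0 / math.sqrt(2.0)
    low = [(values[i] + values[i + 1]) * s for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) * s for i in range(0, len(values), 2)]
    return low + high

def fwt_2d(frame):
    """1D-FWT on each row, then on each column (no tiling in this sketch)."""
    rows_done = [haar_1d(row) for row in frame]
    cols_done = [haar_1d(list(col)) for col in zip(*rows_done)]
    return [list(row) for row in zip(*cols_done)]

def fwt_3d(video):
    """video[t][r][c]: temporal 1D-FWT per pixel, then a 2D-FWT per frame."""
    n_frames, n_rows, n_cols = len(video), len(video[0]), len(video[0][0])
    out = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_frames)]
    # Step 1: Nrows x Ncolumns calls to the 1D-FWT along the time axis.
    for r in range(n_rows):
        for c in range(n_cols):
            temporal = haar_1d([video[t][r][c] for t in range(n_frames)])
            for t in range(n_frames):
                out[t][r][c] = temporal[t]
    # Step 2: Nframes calls to the 2D-FWT, one per temporally transformed frame.
    return [fwt_2d(frame) for frame in out]

video = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]  # 4 frames of 2x2 pixels
coeffs = fwt_3d(video)
```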

SLIDE 10

The Wavelet Transform

3D-FWT on CUDA and OpenCL

  • Our 3D-FWT implementation in CUDA and OpenCL consists of the following three steps:
  • 1. The host (CPU) allocates the first four video frames in memory
  • 2. The first four frames are transferred from host to device

– The 1D-FWT is then applied to the first four frames over the time dimension

  • 3. The 2D-FWT is applied to the detailed and reference video, and the results are sent back to the CPU

SLIDE 11

The Wavelet Transform

3D-FWT on CUDA and OpenCL

  • Two more frames are read (interleaved) to complete each new step
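
One way to picture this frame streaming is as a four-frame window that advances two frames per step. The generator below is only a sketch under that reading: the window length of four and the two new frames per step come from the slides, while `frame_windows` is a hypothetical helper and the host-to-device transfers are omitted.

```python
def frame_windows(n_frames, window=4, stride=2):
    """Yield the list of frame indices that is processed at each step."""
    start = 0
    while start + window <= n_frames:
        yield list(range(start, start + window))
        start += stride

steps = list(frame_windows(8))
# steps == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
```

After the first step, every window reuses two frames that are already available and needs only two new ones.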

SLIDE 14

Optimization techniques for 3D-FWT on a single GPU system

The method consists mainly of three stages

1. Automatically detect the available GPU in the system
2. Nvidia or ATI GPU → 3D-FWT in CUDA or OpenCL
3. The key parameter, the block or work-group size, is selected automatically

  • The remaining parameters (grid size, shared memory occupation, etc.) are also calculated automatically

SLIDE 15

Optimization techniques for 3D-FWT on a single GPU system

The method consists mainly of three stages

  • 1. Automatically detect the available GPU in the system
  • 2. Nvidia or ATI GPU → 3D-FWT in CUDA or OpenCL
  • 3. The key parameter, the block or work-group size, is selected automatically
  • The block size is based on the CUDA occupancy calculator
  • 1. Select the block size that maximizes the occupancy of each multiprocessor
  • 2. If two or more values obtain the same occupancy, select the one with the maximum number of active thread blocks per multiprocessor
  • The work-group size is equal to the value of CL_DEVICE_MAX_WORK_GROUP_SIZE
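
The block-size rule can be sketched as a two-key maximum. The occupancy figures below are hypothetical values invented for illustration, not measurements from any GPU; in practice they would come from the CUDA occupancy calculator. `select_block_size` is likewise a hypothetical name.

```python
def select_block_size(candidates):
    """Pick a block size from (block_size, occupancy, active_blocks) tuples.

    Rule from the slides: maximize occupancy; on a tie, prefer the candidate
    with more active thread blocks per multiprocessor.
    """
    best = max(candidates, key=lambda c: (c[1], c[2]))
    return best[0]

# Hypothetical occupancy figures, for illustration only.
candidates = [
    (64, 2 / 3, 8),
    (128, 2 / 3, 4),
    (192, 1.0, 4),
    (256, 1.0, 3),
]
chosen = select_block_size(candidates)
```

With these made-up inputs, 192 and 256 tie on occupancy and 192 wins because it keeps more thread blocks active.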

SLIDE 16

Optimization techniques for 3D-FWT on a single GPU system

Experiments with 3D-FWT parameters for 3 GPUs

Execution times, run on 64 frames, for each frame size and block size (columns give the block size)

GPU            Frame size        64       128       192       256
Tesla C870     512x512        58.68     56.28     53.51     58.68
Tesla C870     1024x1024     225.74    214.36    209.01    217.21
Tesla C870     2048x2048     889.83    841.47    840.14    850.26
Tesla C2050    512x512        35.33     53.17     32.13     33.59
Tesla C2050    1024x1024     122.12    115.02    110.88    113.32
Tesla C2050    2048x2048     467.50    438.46    427.69    433.84
FirePro V5800  512x512       130.06    135.87    131.29    114.87
FirePro V5800  1024x1024     452.95    346.29    313.35    307.54
FirePro V5800  2048x2048    2123.60   1496.27   1284.56   1217.59

  • The optimization engine studies the problem for different block or work-group sizes
  • It selects 192 for the Tesla C870 and the Fermi Tesla C2050 (optimal)
  • It selects 256 for the ATI FirePro (optimal)
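
The reported selections can be checked directly against the measured times. The short script below copies the execution times from this slide and looks up the fastest block size for each GPU and frame size; the names `TIMES` and `fastest_block_size` are introduced here only for illustration.

```python
BLOCK_SIZES = [64, 128, 192, 256]

# Execution times copied from the slide: GPU -> frame size -> time per block size.
TIMES = {
    "Tesla C870": {
        "512x512": [58.68, 56.28, 53.51, 58.68],
        "1024x1024": [225.74, 214.36, 209.01, 217.21],
        "2048x2048": [889.83, 841.47, 840.14, 850.26],
    },
    "Tesla C2050": {
        "512x512": [35.33, 53.17, 32.13, 33.59],
        "1024x1024": [122.12, 115.02, 110.88, 113.32],
        "2048x2048": [467.50, 438.46, 427.69, 433.84],
    },
    "FirePro V5800": {
        "512x512": [130.06, 135.87, 131.29, 114.87],
        "1024x1024": [452.95, 346.29, 313.35, 307.54],
        "2048x2048": [2123.60, 1496.27, 1284.56, 1217.59],
    },
}

def fastest_block_size(times):
    """Return the block size with the smallest execution time."""
    return BLOCK_SIZES[times.index(min(times))]

best = {
    gpu: {size: fastest_block_size(times) for size, times in by_size.items()}
    for gpu, by_size in TIMES.items()
}
```

For every frame size, the lookup gives 192 on both Tesla cards and 256 on the FirePro, which matches the values chosen by the optimization engine.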

SLIDE 26

Optimization techniques for 3D-FWT on hybrid systems

Automatic algorithm for manycore GPUs and multicore CPU systems

1. Automatically detect the available GPUs and CPUs in the system
2. For each platform in the system (GPU or CPU) do

  • If Nvidia GPU: CUDA 3D-FWT → the block size is calculated automatically
  • If ATI GPU: OpenCL 3D-FWT → work-group size = CL_DEVICE_MAX_WORK_GROUP_SIZE
  • If CPU: tiling and pthreads → number of threads = number of CPU compute units
  • Send one sequence → measure the computing performance of the 3D-FWT kernel

3. Send sequences to each GPU and CPU in proportion to its measured 3D-FWT kernel performance

  • To process the 3D-FWT concurrently
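
Step 3 can be sketched as a proportional split of the sequences. The slides do not say how fractional shares are rounded, so the largest-remainder rounding below is an assumption, as are the function name `distribute_sequences` and the probe timings, which are made-up values rather than measurements.

```python
def distribute_sequences(n_sequences, seconds_per_sequence):
    """Split n_sequences among devices in proportion to their measured speed."""
    speeds = {dev: 1.0 / t for dev, t in seconds_per_sequence.items()}
    total = sum(speeds.values())
    ideal = {dev: n_sequences * s / total for dev, s in speeds.items()}
    shares = {dev: int(x) for dev, x in ideal.items()}
    # Hand out the sequences lost to truncation, largest fractional part first.
    leftover = n_sequences - sum(shares.values())
    by_fraction = sorted(ideal, key=lambda dev: ideal[dev] - shares[dev], reverse=True)
    for dev in by_fraction[:leftover]:
        shares[dev] += 1
    return shares

# Hypothetical timings of the single probe sequence on each platform.
probe = {"gpu": 1.0, "cpu": 3.0}
shares = distribute_sequences(100, probe)
# shares == {"gpu": 75, "cpu": 25}
```

A device that processes the probe sequence three times faster receives three times as many sequences, so all devices should finish at roughly the same time.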

SLIDE 27

Optimization techniques for 3D-FWT on hybrid systems

The optimization engine is executed on system1, with an Intel Core 2 Quad Q6700 CPU and an Nvidia Tesla C870

1. Automatically detect the Intel Core 2 Quad CPU and the Nvidia C870
2. For each platform in the system (GPU or CPU) do

  • Nvidia C870: CUDA 3D-FWT → the block size is 192
  • Intel Core 2 Quad: tiling and pthreads → number of threads = 4
  • Send one sequence → measure the computing performance of the 3D-FWT kernel

3. Send sequences to each GPU and CPU in proportion to its measured 3D-FWT kernel performance

Speedups

Platforms                               512x512   1024x1024   2048x2048
Nvidia C870 versus Intel Core 2 Quad       3.38        3.66        3.62

SLIDE 31

Optimization techniques for 3D-FWT on hybrid systems

For a 2-hour video sequence at 25 frames per second, split into groups of 64 frames

Execution times (seconds), with the speedup of the optimization engine in parentheses

                                                                512 x 512      1024 x 1024     2048 x 2048
Optimization engine on system1                                  127.00         470.11          1889.64
Nvidia C870 (normal user)                                       159.66 (1.26)  608.92 (1.30)   2405.02 (1.27)
Nvidia C870 (optimization techniques for a single GPU system)   150.44 (1.18)  587.63 (1.25)   2362.05 (1.25)

  • A normal user does not have the knowledge needed to choose the block or work-group size (execution times averaged over the different sizes)
  • Our proposal obtains average speedups of 1.23 and 1.28 with respect to the optimization techniques on a single GPU system and a normal user, respectively
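
The figures in parentheses can be reproduced by dividing each baseline time by the optimization-engine time. The snippet below does this with the times from the slide; the variable and function names are introduced here only for illustration.

```python
# Execution times in seconds for 512x512, 1024x1024 and 2048x2048 frames.
engine = [127.00, 470.11, 1889.64]        # optimization engine on system1
normal_user = [159.66, 608.92, 2405.02]   # Nvidia C870, normal user
single_gpu = [150.44, 587.63, 2362.05]    # Nvidia C870, single-GPU techniques

def speedups(baseline, improved):
    """Per-size speedup of `improved` over `baseline`."""
    return [b / i for b, i in zip(baseline, improved)]

def average(values):
    return sum(values) / len(values)

vs_normal_user = speedups(normal_user, engine)
vs_single_gpu = speedups(single_gpu, engine)
print([round(s, 2) for s in vs_normal_user], round(average(vs_normal_user), 2))
print([round(s, 2) for s in vs_single_gpu], round(average(vs_single_gpu), 2))
```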

SLIDE 32

Optimization techniques for 3D-FWT on hybrid systems

The optimization engine is executed on system2, with an Intel Xeon E5620 CPU, an Nvidia Fermi Tesla C2050 and an ATI FirePro V5800 DVI

1. Automatically detect the Intel Xeon CPU and both GPUs
2. For each platform in the system (GPU or CPU) do

  • Nvidia C2050: CUDA 3D-FWT → the block size is 192
  • ATI FirePro V5800 DVI: OpenCL 3D-FWT → the work-group size is 256
  • Intel Xeon: tiling and pthreads → number of threads = 8
  • Send one sequence → measure the computing performance of the 3D-FWT kernel

3. Send sequences to each GPU and CPU in proportion to its measured 3D-FWT kernel performance

Speedups

Platforms                            512x512   1024x1024   2048x2048
Nvidia C2050 versus ATI FirePro         3.58        2.77        2.85
Intel Xeon CPU versus ATI FirePro       2.81        2.01        2.02
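
As a rough illustration of step 3 on system2, the speedups for 512x512 frames can be turned into workload fractions. This assumes that the proportional rule uses these ratios directly, with the ATI FirePro as the 1.00 baseline; the slides do not give the exact split that the engine applied.

```python
# Speedups relative to the ATI FirePro for 512x512 frames, taken from the slide.
relative_speed = {"Nvidia C2050": 3.58, "Intel Xeon": 2.81, "ATI FirePro": 1.00}

total = sum(relative_speed.values())
fractions = {dev: speed / total for dev, speed in relative_speed.items()}

for dev, frac in fractions.items():
    print(f"{dev}: {frac:.1%} of the sequences")
```

Under this assumption the Nvidia C2050 would receive roughly 48% of the sequences, the Intel Xeon roughly 38% and the ATI FirePro roughly 14%.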

SLIDE 38

Optimization techniques for 3D-FWT on hybrid systems

For a 2-hour video sequence at 25 frames per second, split into groups of 64 frames

Execution times (seconds), with the speedup of the optimization engine in parentheses

                                                                 512 x 512      1024 x 1024     2048 x 2048
Optimization engine on system2                                   45.17          133.60          515.34
Nvidia C2050 (normal user)                                       134.21 (2.97)  324.27 (2.43)   1242.33 (2.41)
Nvidia C2050 (optimization techniques for a single GPU system)   90.33 (2.00)   311.74 (2.33)   1202.45 (2.33)
ATI FirePro (normal user)                                        359.93 (7.97)  998.17 (7.47)   4303.02 (8.35)
ATI FirePro (optimization techniques for a single GPU system)    322.96 (7.15)  864.65 (6.47)   3423.25 (6.64)

  • Our proposal obtains average speedups of 2.60 and 7.93 versus a normal user who sends all sequences to the Nvidia or the ATI GPU, respectively

SLIDE 39

Optimization techniques for 3D-FWT on hybrid systems

  • Our proposal obtains average speedups of 2.22 and 6.75 with respect to the optimization techniques on a single GPU system, for the Nvidia and ATI GPUs respectively
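
The four average speedups quoted for system2 follow from the execution times. The loop below recomputes them from the reported seconds; the dictionary layout and names are illustrative only.

```python
# Execution times in seconds for 512x512, 1024x1024 and 2048x2048 frames.
engine = [45.17, 133.60, 515.34]  # optimization engine on system2

baselines = {
    "Nvidia C2050 (normal user)": [134.21, 324.27, 1242.33],
    "Nvidia C2050 (single-GPU techniques)": [90.33, 311.74, 1202.45],
    "ATI FirePro (normal user)": [359.93, 998.17, 4303.02],
    "ATI FirePro (single-GPU techniques)": [322.96, 864.65, 3423.25],
}

average_speedup = {}
for name, times in baselines.items():
    ratios = [t / e for t, e in zip(times, engine)]
    average_speedup[name] = sum(ratios) / len(ratios)
    print(f"{name}: {average_speedup[name]:.2f}")
```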

SLIDE 42

Conclusions

  • An optimization engine that automatically runs the 3D-FWT kernel on integrated systems with different platforms, such as multicore CPUs and manycore GPUs
  • 1. Detects the different platforms in the system
  • 2. Executes the best implementation for each platform: CUDA for Nvidia GPUs, OpenCL for ATI GPUs and pthreads for CPUs
  • 3. Automatically computes the block or work-group size on an Nvidia or ATI GPU, the rest of the parameters needed to execute the kernel, and the number of threads on a CPU
  • 4. Distributes the video sequences among the platforms in proportion to the 3D-FWT kernel performance of each one, so that the sequences are processed concurrently
  • Average gains of up to 7.93x with respect to a normal user, who sends all groups of frames to the ATI GPU running the OpenCL 3D-FWT

SLIDE 43

Future work

  • Extend this work to develop an optimization engine for the 3D-FWT on a heterogeneous cluster of multicores and GPUs
  • The methodology used to build the optimization engine should be applicable to other complex compute applications
  • Our work is part of the development of an image processing library oriented to biomedical applications, which allows users to execute the different routines efficiently and automatically
