SLIDE 1

DYNAMIC PARTITIONING-BASED JPEG DECOMPRESSION ON HETEROGENEOUS MULTICORE ARCHITECTURES

Wasuwee Sodsong(1), Jingun Hong(1), Seongwook Chung(1), Yeongkyu Lim(2), Shin-Dug Kim(1) and Bernd Burgstaller(1)

(1) Yonsei University  (2) LG Electronics

SLIDE 2

JPEG Decompression

[Figure: the entropy-coded data of a JPEG file — a binary bitstream]

SLIDE 3

JPEG Decompression

[Figure: Huffman decoding transforms the entropy-coded bitstream into 8x8 blocks of frequency-domain coefficients]

SLIDE 4

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding (frequency domain) → IDCT (spatial domain, YCbCr), operating on 8x8 blocks]

SLIDE 5

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding → IDCT → upsampling (YCbCr color)]

SLIDE 6

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding → IDCT → upsampling → color conversion (RGB color)]

SLIDE 7

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding → IDCT → upsampling → color conversion → output (bitmap image)]

SLIDE 8

Sequential JPEG Decompression

- JPEG is an asymmetric compression scheme
  - Compression is performed once per image
  - Decompression is performed once per use
- 463 of the 500 most popular websites serve JPEG images
- JPEG operates on blocks of 8x8 pixels
  - Sequential JPEG decoders apply IDCT, upsampling and color conversion block-by-block
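The final color conversion step can be sketched with the standard JFIF YCbCr-to-RGB equations. This is illustrative reference code only, not libjpeg-turbo's fixed-point SIMD implementation:

```python
# JFIF YCbCr -> RGB conversion for a single pixel (Cb and Cr are
# centered at 128). Reference math only; real decoders use
# fixed-point SIMD arithmetic for speed.
def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, int(round(v))))  # clip to 8-bit range
    return clamp(r), clamp(g), clamp(b)
```

Because every pixel is converted independently, this step parallelizes trivially.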

SLIDE 9

Parallelism in JPEG Decompression

- Sequential part: Huffman decoding
  - NOT suitable for data-parallelism
  - Codewords have variable lengths
  - The starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded

[Diagram: .jpg → Huffman decoding (sequential part) → IDCT → upsampling → color conversion (parallelizable part)]

SLIDE 10

Parallelism in JPEG Decompression

- Sequential part: Huffman decoding
  - NOT suitable for data-parallelism
  - Codewords have variable lengths
  - The starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded
- Parallelizable part: IDCT, upsampling and color conversion
  - Suitable for GPU computing and SIMD operations on the CPU
  - Low data dependency
  - Applies the same instructions repeatedly
  - Has fixed input and output sizes

[Diagram: .jpg → Huffman decoding (sequential part) → IDCT → upsampling → color conversion (parallelizable part)]
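The sequential dependency in Huffman decoding can be seen in a toy prefix-code decoder. The codebook below is hypothetical, not a real JPEG Huffman table; the point is that the boundary of codeword i+1 is discovered only after codeword i has been consumed:

```python
# Minimal sketch (not libjpeg-turbo's decoder): decoding a prefix
# (Huffman) code is inherently sequential, because each codeword's
# start position is known only after the previous one is decoded.
CODEBOOK = {"0": "A", "10": "B", "110": "C", "111": "D"}  # hypothetical table

def huffman_decode(bits):
    symbols, buf = [], ""
    for b in bits:           # must scan bit-by-bit, in order
        buf += b
        if buf in CODEBOOK:  # codeword boundary found only now
            symbols.append(CODEBOOK[buf])
            buf = ""
    return symbols
```

There is no way to jump to the middle of the bitstream and start decoding there, which is why this stage cannot be data-parallelized.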

SLIDE 11

Research Question

How to orchestrate JPEG decompression on CPU+GPU architectures?

- An input image is characterized by its width, height and entropy
- Needed: work partitioning, a schedule and an execution infrastructure

[Diagram: .jpg → Huffman decoding (sequential part) → IDCT → upsampling → color conversion (parallelizable part)]

SLIDE 12

Our Contributions

- Heterogeneous JPEG decoder for CPU+GPU architectures
  - Profiling-based performance model
  - Dynamic partitioning scheme that automatically distributes the workload at run-time
- Pipelined execution model that overlaps sequential Huffman decoding with GPU computations
- Parallelizable part distributed across CPU and GPU
  - Data-, task- and pipeline-parallelism
- GPU kernels designed to minimize memory access overhead
- Implementation and experimental evaluation for the libjpeg-turbo library

SLIDE 13

libjpeg & libjpeg-turbo

- libjpeg is the sequential JPEG compression reference implementation by the Independent JPEG Group
  - First version released in 1991
- libjpeg-turbo is a re-implementation of libjpeg
  - Utilizes SIMD instructions on x86 and ARM platforms
  - Used by Google Chrome, Firefox, WebKit, Ubuntu, Fedora and openSUSE
- Both libraries are strictly designed to conserve memory
  - This inhibits coarse-grained parallelism
  - Conserving memory is a non-goal on today's target architectures

SLIDE 14

Re-engineering libjpeg-turbo

libjpeg-turbo:
- To conserve memory, libjpeg-turbo decodes images in units of 8 pixel rows
- 8 rows at a time do not contain enough computation to keep the data-parallel execution units of a GPU busy
- Significant constant overhead per kernel invocation and data transfer (host→device→host)

Our approach:
- Store the entire image in memory
- Fully utilize all GPU cores by processing several larger image chunks
- Reduce the number of kernel invocations and the data transfer overhead

SLIDE 15

Heterogeneous JPEG Decompression Overview

[Diagram: GPU-only timeline — CPU: Huffman decoding, dispatch, then idle; GPU: idle until the kernel runs]

- Motivation: one architecture is unutilized while the other is processing
- Observation: there is no dependency among 8x8 pixel blocks, so the CPU and the GPU can compute in parallel
- Goal: find a partitioning size at runtime such that the load on the CPU and the GPU is balanced
- Requirement: a performance model obtained through offline profiling

SLIDE 16

Performance Model

- Offline profiling step on an image training set
  - 19 master images cropped to various sizes
  - Maximum image size is 25 megapixels
- Profile the execution time of the sequential part and of the parallelizable part on CPU and GPU
- Model all decompression steps using multivariate polynomial regression up to degree 7
- Select the best-fit model by comparing Akaike information criterion (AIC) values
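The model-selection step can be sketched as follows. This is a simplified single-variable illustration with assumed details (least-squares AIC formula, NumPy polynomial fitting), not the paper's multivariate procedure:

```python
# Sketch of profiling-based model selection: fit polynomials of
# increasing degree to profiled timings, keep the lowest-AIC fit.
import numpy as np

def aic(rss, n, k):
    # AIC for least-squares fits: n*ln(RSS/n) + 2k,
    # where k is the number of model parameters
    return n * np.log(rss / n) + 2 * k

def best_degree(x, y, max_degree=7):
    best, best_aic = None, float("inf")
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
        score = aic(max(rss, 1e-12), len(x), d + 1)  # guard log(0)
        if score < best_aic:
            best, best_aic = d, score
    return best
```

AIC penalizes extra parameters (the 2k term), so a higher-degree polynomial wins only if it reduces the residual error enough to pay for its added complexity.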

SLIDE 17

Performance Model for the Parallelizable Part

- Scales linearly as the image size increases
- Image dimensions are known at the beginning of the decompression step
- Parameters: width and height

[Plot: execution time (ms) vs. pixels, for subsampling 4:2:2 and 4:4:4 — linear trend]

SLIDE 18

Performance Model for the Sequential Part

- Unlike the parallelizable part, Huffman decoding time does NOT correlate strongly with image width and height

[Plot: Huffman decoding time (ms) vs. pixels (5M–20M) — no clear trend]

SLIDE 19

Performance Model for the Sequential Part

- Huffman decoding time correlates strongly with the size of the entropy-coded data
- We observed a linear trend as entropy density (entropy size in bytes per pixel) increases
- Parameters: width, height and entropy size
- Entropy size can be roughly approximated from the JPEG file size

[Plot: Huffman decoding time vs. entropy density (bytes/pixel), for subsampling 4:2:2 and 4:4:4 — linear trend]

SLIDE 20

Overlapped Partitioning Scheme

[Diagram: GPU-only timeline (CPU: Huffman decoding, dispatch, idle; GPU: idle, then kernel) vs. overlapped timeline (CPU: Huffman decoding, dispatch, SIMD on its share; GPU: kernel on its share)]

- Share the workload of the parallelizable part between CPU and GPU

SLIDE 21

Overlapped Partitioning Scheme

[Diagram: overlapped timeline — CPU: Huffman decoding, dispatch, SIMD on its share; GPU: kernel on its share, but idle during Huffman decoding]

- Idea: share the workload of the parallelizable part between the CPU and the GPU
- Partitioning equation: T_cpu(w, n) = T_gpu(w, h - n), where n is the number of rows given to the CPU, and w and h are the image width and height (h - n rows go to the GPU)
- When the equation holds, the time spent on the CPU and on the GPU is equal
- w and h are known at runtime; we can use Newton's method to solve for n
- Problem: the GPU is unutilized during Huffman decoding
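The runtime balancing step can be sketched as a Newton solve on f(n) = T_cpu(n) - T_gpu(h - n). The cost models passed in below are hypothetical linear stand-ins, not the fitted polynomial models from the profiling step:

```python
# Sketch: find the CPU row share n where the modeled CPU time equals
# the modeled GPU time, via Newton's method with a numerical derivative.
def solve_rows(t_cpu, t_gpu, h, iters=20, eps=1e-3):
    n = h / 2.0                       # initial guess: even split
    for _ in range(iters):
        f = t_cpu(n) - t_gpu(h - n)   # imbalance at current split
        # central-difference derivative of f with respect to n
        df = (t_cpu(n + eps) - t_cpu(n - eps)
              - t_gpu(h - n - eps) + t_gpu(h - n + eps)) / (2 * eps)
        if df == 0:
            break
        n -= f / df                   # Newton update
    return n
```

For example, if the CPU is modeled as twice as slow per row as the GPU, the solver assigns it one third of the rows.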

SLIDE 22

Pipelined Partitioning Scheme

[Diagram: GPU-only vs. overlapped vs. pipelined timelines — pipelined: CPU decodes Huffman chunks 1–3, dispatching each to the GPU as soon as it is decoded; the GPU kernel for each chunk overlaps with the decoding of later chunks]

- Increase parallelism by performing Huffman decoding and the GPU kernels in a pipelined fashion
- Share the workload of the parallelizable part between CPU and GPU

SLIDE 23

Pipelined Partitioning Scheme

- Idea: execute Huffman decoding in a pipelined fashion with the GPU kernels
  - Split the image into several chunks of D rows each
  - An optimal chunk size is found through profiling
  - Kernel invocation can start as soon as an image chunk is decoded
  - On a fast GPU, only the kernel execution time of the last chunk is visible to the user
- Problem: does NOT guarantee an improvement over CPU computation

[Diagram: CPU decodes chunks 1–3 (D rows each) and dispatches each; GPU kernels for chunks 1–3 overlap with the decoding of subsequent chunks]
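The host-side shape of this pipeline can be sketched with a worker thread standing in for the GPU. A real implementation would instead enqueue asynchronous kernels (e.g. on CUDA streams); the names below are illustrative:

```python
# Sketch of the pipelined scheme: sequential Huffman decoding on the
# main thread, with each decoded chunk handed off immediately so the
# "GPU" work overlaps with decoding of the next chunk.
import queue
import threading

def pipeline(chunks, decode_chunk, run_kernel):
    q, results = queue.Queue(), []

    def gpu_worker():                 # stand-in for the GPU
        while True:
            item = q.get()
            if item is None:          # sentinel: no more chunks
                return
            results.append(run_kernel(item))

    t = threading.Thread(target=gpu_worker)
    t.start()
    for c in chunks:                  # sequential Huffman decoding
        q.put(decode_chunk(c))        # dispatch as soon as decoded
    q.put(None)
    t.join()
    return results
```

With a single FIFO consumer, chunk results come back in order, matching the in-order kernel launches on the slide.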

SLIDE 24

Combined Partitioning Scheme

[Diagram: GPU-only vs. overlapped vs. pipelined vs. combined timelines — combined: CPU decodes Huffman chunks 1–3 for the GPU and dispatches each, then decodes chunk 4 and processes it itself with SIMD while the GPU kernels for chunks 1–3 run]

SLIDE 25

Combined Partitioning Scheme

- Combine the overlapped and pipelined models to guarantee an improvement
- The partitioning equation is extended with D, the number of rows in a chunk, and e, the entropy density in bytes per pixel
- Newton's method is used to solve the equation at runtime
- Estimation errors from Huffman decoding:
  - The model assumes the same Huffman decoding time for every pixel across an image
  - In practice, entropy is not distributed evenly
  - Re-partitioning is therefore performed before Huffman decoding for the last GPU kernel

[Diagram: combined timeline — CPU: Huffman chunks 1–4, dispatches, then SIMD on chunk 4; GPU: kernels for chunks 1–3]

SLIDE 26

GPU Optimizations

- We implemented GPU kernels for IDCT, upsampling and color conversion
- Optimizations:
  - Vectorization to reduce the number of reads/writes to global memory
  - Store intermediate results in local memory (NVIDIA's shared memory)
  - Map work-items with coalesced memory access in mind
  - Fuse the color conversion kernel with the preceding kernel to avoid a global memory store between kernels

[Figures: upsampling — work-item mapping from global memory through local memory back to global memory; IDCT — an 8x8 block processed as 8 horizontal 1D IDCTs followed by 8 vertical 1D IDCTs; color conversion — Y, Cb and Cr planes combined to RGB]

SLIDE 27

Experimental Setup

- Test set
  - A new set of images; no images are reused from the training set
  - 3591 images of various sizes
  - Maximum image size is 25 megapixels
- Hardware specification:

Machine name       | GT 430         | GTX 560           | GTX 680
CPU model          | Intel i7-2600K | Intel i7-2600K    | Intel i7-3770K
CPU frequency      | 3.4 GHz        | 3.4 GHz           | 3.5 GHz
GPU model          | NVIDIA GT 430  | NVIDIA GTX 560 Ti | NVIDIA GTX 680
GPU core frequency | 700 MHz        | 822 MHz           | 1006 MHz
No. of GPU cores   | 96             | 384               | 1536
GPU memory size    | 1024 MB        | 1024 MB           | 2048 MB
Compute capability | 2.1            | 2.1               | 3.0

SLIDE 28

Speedup Over SIMD Version

[Plot: average speedup vs. pixels (5M–20M) on GT 430 (0.75–1.50x) and GTX 680 (1.5–3.0x) for the GPU-only, pipelined, overlapped and combined schemes]

SLIDE 29

Amdahl's Law: Theoretical Maximum Speedup

- Up to 95% of the theoretical maximum speedup
- Average of 88% when the problem size is large enough

[Plot: percentage of the theoretical maximum speedup vs. pixels (5M–20M), Intel i7-3770K + NVIDIA GTX 680]
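The theoretical bound on this slide follows Amdahl's law: with a parallelizable fraction p accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The example values below are illustrative, not measurements from the paper:

```python
# Amdahl's law: the sequential fraction (1 - p), here Huffman
# decoding, caps the overall speedup no matter how fast the GPU is.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)
```

For instance, if 80% of the decoding time is parallelizable, even an infinitely fast GPU cannot exceed a 5x overall speedup.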

SLIDE 30

Partitioning Errors

[Plot: average execution time (ms) — CPU time vs. GPU time — against pixels (10M, 20M) on GT 430 and GTX 680]

SLIDE 31

Conclusions

- JPEG decoding contains a large amount of data-parallel computation
- We proposed a JPEG decoding scheme for heterogeneous architectures
  - Performance model using polynomial regression
  - Dynamic partitioning scheme
- Up to 4.2x (2.5x average) speedup over the SIMD version of libjpeg-turbo
  - Guaranteed improvement regardless of the CPU+GPU combination
  - Workload is well distributed across CPU and GPU
- Our combined partitioning scheme achieves up to 95% of the theoretically attainable speedup, with an average of 88%
- Future work
  - Extension to mobile systems

SLIDE 32

Q&A

Thank you

SLIDE 33

Backup Slides

SLIDE 34

Related Work

Parallel JPEG decoding:
[1] Parallel image processing based on CUDA [Yang '08]
[2] Design, implementation and evaluation of a task-parallel JPEG decoder for the libjpeg-turbo library [Hong '12]
[3] GPUJPEG: JPEG compression and decompression accelerated on the GPU

Heterogeneous computing:
[1] Architectural exploration of heterogeneous multiprocessor systems for JPEG [Shee '08]
[2] MapReduce on a coupled CPU-GPU architecture [Chen '12]
[3] Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping [Luk '09]
[4] Cooperative heterogeneous computing for parallel processing on CPU/GPU hybrids [Lee '12]
[5] CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems [Wang '13]

SLIDE 35

GPU Results

- A 2048x2048 image

[Bar charts: normalized time (0.0–2.0) for the CPU, SIMD and GPU versions on GT 430, GTX 560 and GTX 680, broken down into Huffman decoding, host-to-device transfer, IDCT, upsampling, color conversion, fused upsampling & color conversion, and device-to-host transfer]