SLIDE 1

DYNAMIC PARTITIONING-BASED JPEG DECOMPRESSION ON HETEROGENEOUS MULTICORE ARCHITECTURES

Wasuwee Sodsong(1), Jingun Hong(1), Seongwook Chung(1), Yeongkyu Lim(2), Shin-Dug Kim(1) and Bernd Burgstaller(1)

(1) Yonsei University  (2) LG Electronics

SLIDE 2

JPEG Decompression

[Figure: the entropy-coded data of a JPEG file — a binary bitstream]

SLIDE 3

JPEG Decompression

[Figure: Huffman decoding transforms the entropy-coded bitstream into 8x8 blocks of frequency-domain coefficients]

SLIDE 4

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding (frequency domain) → IDCT (spatial domain, YCbCr), operating on 8x8 blocks]

SLIDE 5

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding → IDCT → upsampling (YCbCr color)]

SLIDE 6

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding → IDCT → upsampling → color conversion (RGB color)]

SLIDE 7

JPEG Decompression

[Figure: entropy-coded data → Huffman decoding → IDCT → upsampling → color conversion → output (bitmap image)]

SLIDE 8

Sequential JPEG Decompression

- JPEG is an asymmetric compression scheme
  - Compression is performed once per image
  - Decompression is performed once per use
- 463 of the 500 most popular websites serve JPEG images
- JPEG operates on blocks of 8x8 pixels
  - Sequential JPEG decoders apply IDCT, upsampling and color conversion block-by-block
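The final color conversion step can be sketched with the standard JFIF YCbCr-to-RGB equations. This is illustrative reference code only, not libjpeg-turbo's fixed-point SIMD implementation:

```python
# JFIF YCbCr -> RGB conversion for a single pixel (Cb and Cr are
# centered at 128). Reference math only; real decoders use
# fixed-point SIMD arithmetic for speed.
def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, int(round(v))))  # clip to 8-bit range
    return clamp(r), clamp(g), clamp(b)
```

Because every pixel is converted independently, this step parallelizes trivially.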

SLIDE 9

Parallelism in JPEG Decompression

- Sequential part: Huffman decoding
  - NOT suitable for data-parallelism
  - Codewords have variable lengths
  - The starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded

[Diagram: .jpg → Huffman decoding (sequential part) → IDCT → upsampling → color conversion (parallelizable part)]

SLIDE 10

Parallelism in JPEG Decompression

- Sequential part: Huffman decoding
  - NOT suitable for data-parallelism
  - Codewords have variable lengths
  - The starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded
- Parallelizable part: IDCT, upsampling and color conversion
  - Suitable for GPU computing and SIMD operations on the CPU
  - Low data dependency
  - Applies the same instructions repeatedly
  - Has fixed input and output sizes

[Diagram: .jpg → Huffman decoding (sequential part) → IDCT → upsampling → color conversion (parallelizable part)]
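The sequential dependency in Huffman decoding can be seen in a toy prefix-code decoder. The codebook below is hypothetical, not a real JPEG Huffman table; the point is that the boundary of codeword i+1 is discovered only after codeword i has been consumed:

```python
# Minimal sketch (not libjpeg-turbo's decoder): decoding a prefix
# (Huffman) code is inherently sequential, because each codeword's
# start position is known only after the previous one is decoded.
CODEBOOK = {"0": "A", "10": "B", "110": "C", "111": "D"}  # hypothetical table

def huffman_decode(bits):
    symbols, buf = [], ""
    for b in bits:           # must scan bit-by-bit, in order
        buf += b
        if buf in CODEBOOK:  # codeword boundary found only now
            symbols.append(CODEBOOK[buf])
            buf = ""
    return symbols
```

There is no way to jump to the middle of the bitstream and start decoding there, which is why this stage cannot be data-parallelized.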

SLIDE 11

Research Question

How to orchestrate JPEG decompression on CPU+GPU architectures?

- An input image is characterized by its width, height and entropy
- Needed: work partitioning, a schedule and an execution infrastructure

[Diagram: .jpg → Huffman decoding (sequential part) → IDCT → upsampling → color conversion (parallelizable part)]

SLIDE 12

Our Contributions

- Heterogeneous JPEG decoder for CPU+GPU architectures
  - Profiling-based performance model
  - Dynamic partitioning scheme that automatically distributes the workload at run-time
- Pipelined execution model that overlaps sequential Huffman decoding with GPU computations
- Parallelizable part distributed across CPU and GPU
  - Data-, task- and pipeline-parallelism
- GPU kernels designed to minimize memory access overhead
- Implementation and experimental evaluation for the libjpeg-turbo library

SLIDE 13

libjpeg & libjpeg-turbo

- libjpeg is the sequential JPEG compression reference implementation by the Independent JPEG Group
  - First version released in 1991
- libjpeg-turbo is a re-implementation of libjpeg
  - Utilizes SIMD instructions on x86 and ARM platforms
  - Used by Google Chrome, Firefox, WebKit, Ubuntu, Fedora and openSUSE
- Both libraries are strictly designed to conserve memory
  - This inhibits coarse-grained parallelism
  - Conserving memory is a non-goal on today's target architectures

SLIDE 14

Re-engineering libjpeg-turbo

libjpeg-turbo:
- To conserve memory, libjpeg-turbo decodes images in units of 8 pixel rows
- 8 rows at a time do not contain enough computation to keep the data-parallel execution units of a GPU busy
- Significant constant overhead per kernel invocation and data transfer (host→device→host)

Our approach:
- Store the entire image in memory
- Fully utilize all GPU cores by processing several larger image chunks
- Reduce the number of kernel invocations and the data transfer overhead

SLIDE 15

Heterogeneous JPEG Decompression Overview

[Diagram: GPU-only timeline — CPU: Huffman decoding, dispatch, then idle; GPU: idle until the kernel runs]

- Motivation: one architecture is unutilized while the other is processing
- Observation: there is no dependency among 8x8 pixel blocks, so the CPU and the GPU can compute in parallel
- Goal: find a partitioning size at runtime such that the load on the CPU and the GPU is balanced
- Requirement: a performance model obtained through offline profiling

SLIDE 16

Performance Model

- Offline profiling step on an image training set
  - 19 master images cropped to various sizes
  - Maximum image size is 25 megapixels
- Profile the execution time of the sequential part and of the parallelizable part on CPU and GPU
- Model all decompression steps using multivariate polynomial regression up to degree 7
- Select the best-fit model by comparing Akaike information criterion (AIC) values
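The model-selection step can be sketched as follows. This is a simplified single-variable illustration with assumed details (least-squares AIC formula, NumPy polynomial fitting), not the paper's multivariate procedure:

```python
# Sketch of profiling-based model selection: fit polynomials of
# increasing degree to profiled timings, keep the lowest-AIC fit.
import numpy as np

def aic(rss, n, k):
    # AIC for least-squares fits: n*ln(RSS/n) + 2k,
    # where k is the number of model parameters
    return n * np.log(rss / n) + 2 * k

def best_degree(x, y, max_degree=7):
    best, best_aic = None, float("inf")
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
        score = aic(max(rss, 1e-12), len(x), d + 1)  # guard log(0)
        if score < best_aic:
            best, best_aic = d, score
    return best
```

AIC penalizes extra parameters (the 2k term), so a higher-degree polynomial wins only if it reduces the residual error enough to pay for its added complexity.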

SLIDE 17

Performance Model for the Parallelizable Part

- Scales linearly as the image size increases
- Image dimensions are known at the beginning of the decompression step
- Parameters: width and height

[Plot: execution time (ms) vs. pixels, for subsampling 4:2:2 and 4:4:4 — linear trend]

SLIDE 18

Performance Model for the Sequential Part

- Unlike the parallelizable part, Huffman decoding time does NOT correlate strongly with image width and height

[Plot: Huffman decoding time (ms) vs. pixels (5M–20M) — no clear trend]

SLIDE 19

Performance Model for the Sequential Part

- Huffman decoding time correlates strongly with the size of the entropy-coded data
- We observed a linear trend as entropy density (entropy size in bytes per pixel) increases
- Parameters: width, height and entropy size
- Entropy size can be roughly approximated from the JPEG file size

[Plot: Huffman decoding time vs. entropy density (bytes/pixel), for subsampling 4:2:2 and 4:4:4 — linear trend]

SLIDE 20

Overlapped Partitioning Scheme

[Diagram: GPU-only timeline (CPU: Huffman decoding, dispatch, idle; GPU: idle, then kernel) vs. overlapped timeline (CPU: Huffman decoding, dispatch, SIMD on its share; GPU: kernel on its share)]

- Share the workload of the parallelizable part between CPU and GPU

SLIDE 21

Overlapped Partitioning Scheme

[Diagram: overlapped timeline — CPU: Huffman decoding, dispatch, SIMD on its share; GPU: kernel on its share, but idle during Huffman decoding]

- Idea: share the workload of the parallelizable part between the CPU and the GPU
- Partitioning equation: T_cpu(w, n) = T_gpu(w, h - n), where n is the number of rows given to the CPU, and w and h are the image width and height (h - n rows go to the GPU)
- When the equation holds, the time spent on the CPU and on the GPU is equal
- w and h are known at runtime; we can use Newton's method to solve for n
- Problem: the GPU is unutilized during Huffman decoding
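The runtime balancing step can be sketched as a Newton solve on f(n) = T_cpu(n) - T_gpu(h - n). The cost models passed in below are hypothetical linear stand-ins, not the fitted polynomial models from the profiling step:

```python
# Sketch: find the CPU row share n where the modeled CPU time equals
# the modeled GPU time, via Newton's method with a numerical derivative.
def solve_rows(t_cpu, t_gpu, h, iters=20, eps=1e-3):
    n = h / 2.0                       # initial guess: even split
    for _ in range(iters):
        f = t_cpu(n) - t_gpu(h - n)   # imbalance at current split
        # central-difference derivative of f with respect to n
        df = (t_cpu(n + eps) - t_cpu(n - eps)
              - t_gpu(h - n - eps) + t_gpu(h - n + eps)) / (2 * eps)
        if df == 0:
            break
        n -= f / df                   # Newton update
    return n
```

For example, if the CPU is modeled as twice as slow per row as the GPU, the solver assigns it one third of the rows.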

SLIDE 22

Pipelined Partitioning Scheme

[Diagram: GPU-only vs. overlapped vs. pipelined timelines — pipelined: CPU decodes Huffman chunks 1–3, dispatching each to the GPU as soon as it is decoded; the GPU kernel for each chunk overlaps with the decoding of later chunks]

- Increase parallelism by performing Huffman decoding and the GPU kernels in a pipelined fashion
- Share the workload of the parallelizable part between CPU and GPU

SLIDE 23

Pipelined Partitioning Scheme

- Idea: execute Huffman decoding in a pipelined fashion with the GPU kernels
  - Split the image into several chunks of D rows each
  - An optimal chunk size is found through profiling
  - Kernel invocation can start as soon as an image chunk is decoded
  - On a fast GPU, only the kernel execution time of the last chunk is visible to the user
- Problem: does NOT guarantee an improvement over CPU computation

[Diagram: CPU decodes chunks 1–3 (D rows each) and dispatches each; GPU kernels for chunks 1–3 overlap with the decoding of subsequent chunks]
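The host-side shape of this pipeline can be sketched with a worker thread standing in for the GPU. A real implementation would instead enqueue asynchronous kernels (e.g. on CUDA streams); the names below are illustrative:

```python
# Sketch of the pipelined scheme: sequential Huffman decoding on the
# main thread, with each decoded chunk handed off immediately so the
# "GPU" work overlaps with decoding of the next chunk.
import queue
import threading

def pipeline(chunks, decode_chunk, run_kernel):
    q, results = queue.Queue(), []

    def gpu_worker():                 # stand-in for the GPU
        while True:
            item = q.get()
            if item is None:          # sentinel: no more chunks
                return
            results.append(run_kernel(item))

    t = threading.Thread(target=gpu_worker)
    t.start()
    for c in chunks:                  # sequential Huffman decoding
        q.put(decode_chunk(c))        # dispatch as soon as decoded
    q.put(None)
    t.join()
    return results
```

With a single FIFO consumer, chunk results come back in order, matching the in-order kernel launches on the slide.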

SLIDE 24

Combined Partitioning Scheme

[Diagram: GPU-only vs. overlapped vs. pipelined vs. combined timelines — combined: CPU decodes Huffman chunks 1–3 for the GPU and dispatches each, then decodes chunk 4 and processes it itself with SIMD while the GPU kernels for chunks 1–3 run]

SLIDE 25

Combined Partitioning Scheme

- Combine the overlapped and pipelined models to guarantee an improvement
- The partitioning equation is extended with D, the number of rows in a chunk, and e, the entropy density in bytes per pixel
- Newton's method is used to solve the equation at runtime
- Estimation errors from Huffman decoding:
  - The model assumes the same Huffman decoding time for every pixel across an image
  - In practice, entropy is not distributed evenly
  - Re-partitioning is therefore performed before Huffman decoding for the last GPU kernel

[Diagram: combined timeline — CPU: Huffman chunks 1–4, dispatches, then SIMD on chunk 4; GPU: kernels for chunks 1–3]

SLIDE 26

GPU Optimizations

- We implemented GPU kernels for IDCT, upsampling and color conversion
- Optimizations:
  - Vectorization to reduce the number of reads/writes to global memory
  - Store intermediate results in local memory (NVIDIA's shared memory)
  - Map work-items with coalesced memory access in mind
  - Fuse the color conversion kernel with the preceding kernel to avoid a global memory store between kernels

[Figures: upsampling — work-item mapping from global memory through local memory back to global memory; IDCT — an 8x8 block processed as 8 horizontal 1D IDCTs followed by 8 vertical 1D IDCTs; color conversion — Y, Cb and Cr planes combined to RGB]

SLIDE 27

Experimental Setup

- Test set
  - A new set of images; no images are reused from the training set
  - 3591 images of various sizes
  - Maximum image size is 25 megapixels
- Hardware specification:

Machine name       | GT 430         | GTX 560           | GTX 680
CPU model          | Intel i7-2600K | Intel i7-2600K    | Intel i7-3770K
CPU frequency      | 3.4 GHz        | 3.4 GHz           | 3.5 GHz
GPU model          | NVIDIA GT 430  | NVIDIA GTX 560 Ti | NVIDIA GTX 680
GPU core frequency | 700 MHz        | 822 MHz           | 1006 MHz
No. of GPU cores   | 96             | 384               | 1536
GPU memory size    | 1024 MB        | 1024 MB           | 2048 MB
Compute capability | 2.1            | 2.1               | 3.0

SLIDE 28

Speedup Over SIMD Version

[Plot: average speedup vs. pixels (5M–20M) on GT 430 (0.75–1.50x) and GTX 680 (1.5–3.0x) for the GPU-only, pipelined, overlapped and combined schemes]

SLIDE 29

Amdahl's Law: Theoretical Maximum Speedup

- Up to 95% of the theoretical maximum speedup
- Average of 88% when the problem size is large enough

[Plot: percentage of the theoretical maximum speedup vs. pixels (5M–20M), Intel i7-3770K + NVIDIA GTX 680]
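The theoretical bound on this slide follows Amdahl's law: with a parallelizable fraction p accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The example values below are illustrative, not measurements from the paper:

```python
# Amdahl's law: the sequential fraction (1 - p), here Huffman
# decoding, caps the overall speedup no matter how fast the GPU is.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)
```

For instance, if 80% of the decoding time is parallelizable, even an infinitely fast GPU cannot exceed a 5x overall speedup.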

SLIDE 30

Partitioning Errors

[Plot: average execution time (ms) — CPU time vs. GPU time — against pixels (10M, 20M) on GT 430 and GTX 680]

SLIDE 31

Conclusions

- JPEG decoding contains a large amount of data-parallel computation
- We proposed a JPEG decoding scheme for heterogeneous architectures
  - Performance model using polynomial regression
  - Dynamic partitioning scheme
- Up to 4.2x (2.5x average) speedup over the SIMD version of libjpeg-turbo
  - Guaranteed improvement regardless of the CPU+GPU combination
  - Workload is well distributed across CPU and GPU
- Our combined partitioning scheme achieves up to 95% of the theoretically attainable speedup, with an average of 88%
- Future work
  - Extension to mobile systems

SLIDE 32

Q&A

Thank you

SLIDE 33

Backup Slides

SLIDE 34

Related Work

Parallel JPEG decoding:
[1] Parallel image processing based on CUDA [Yang '08]
[2] Design, implementation and evaluation of a task-parallel JPEG decoder for the libjpeg-turbo library [Hong '12]
[3] GPUJPEG: JPEG compression and decompression accelerated on the GPU

Heterogeneous computing:
[1] Architectural exploration of heterogeneous multiprocessor systems for JPEG [Shee '08]
[2] MapReduce on a coupled CPU-GPU architecture [Chen '12]
[3] Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping [Luk '09]
[4] Cooperative heterogeneous computing for parallel processing on CPU/GPU hybrids [Lee '12]
[5] CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems [Wang '13]

SLIDE 35

GPU Results

- A 2048x2048 image

[Bar charts: normalized time (0.0–2.0) for the CPU, SIMD and GPU versions on GT 430, GTX 560 and GTX 680, broken down into Huffman decoding, host-to-device transfer, IDCT, upsampling, color conversion, fused upsampling & color conversion, and device-to-host transfer]