Optimizing Discrete Wavelet Transform on the Cell Broadband Engine
Seunghwa Kang, David A. Bader

Key Contributions
• We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity
• We introduce multiple Cell/B.E. and DWT specific optimization issues and solutions
• Our implementation achieves 34 and 56 times speedup over one PPE performance, and 4.7 and 3.7 times speedup over the cutting edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.

Presentation Outline
• Discrete Wavelet Transform
• Cell Broadband Engine architecture
- Comparison with the traditional multicore processor
- Impact in performance and programmability
• Optimization Strategies
- Previous work
- Data decomposition scheme
- Real number representation
- Loop interleaving
- Fine-grain data transfer control
• Performance Evaluation
- Comparison with the AMD Barcelona
• Conclusions

Discrete Wavelet Transform (in JPEG2000)
• Decompose an image in both the vertical and horizontal directions into sub-bands representing the coarse and detail parts while preserving spatial information
[Figure: the four sub-bands LL, HL, LH, HH]

Discrete Wavelet Transform (in JPEG2000)
• Vertical filtering followed by horizontal filtering
• Highly parallel but bandwidth intensive
• Distinct memory access pattern becomes a problem
• Adopt Jasper [Adams2005] as a baseline code
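
To make the two filtering passes and their different memory access patterns concrete, the following is a minimal sketch rather than the authors' or Jasper's code. It assumes the reversible 5/3 lifting filter that JPEG2000 uses for lossless coding, applied in place so that even positions end up holding coarse coefficients and odd positions detail coefficients. The names floor_div, sample, lift53_forward and lift53_inverse are hypothetical. The stride parameter is the point of the example: a horizontal pass walks a row with stride 1, while a vertical pass walks a column with a stride equal to the row pitch.

```c
#include <assert.h>
#include <stddef.h>

/* Floor division by a positive divisor; C's '/' truncates toward zero. */
static int floor_div(int a, int b)
{
    int q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

/* Pointer to sample i of a strided signal of length n. Indices -1 and n
   are mirrored back inside (whole-sample symmetric extension). */
static int *sample(int *x, ptrdiff_t i, ptrdiff_t n, ptrdiff_t stride)
{
    if (i < 0) i = -i;
    if (i >= n) i = 2 * (n - 1) - i;
    return x + i * stride;
}

/* Forward 5/3 lifting, in place. stride = 1 filters a row; stride = row
   pitch (in elements) filters a column. */
static void lift53_forward(int *x, ptrdiff_t n, ptrdiff_t stride)
{
    ptrdiff_t i;
    assert(n >= 2 && stride >= 1);
    /* Predict step: odd samples become detail coefficients. */
    for (i = 1; i < n; i += 2)
        *sample(x, i, n, stride) -=
            floor_div(*sample(x, i - 1, n, stride) + *sample(x, i + 1, n, stride), 2);
    /* Update step: even samples become coarse coefficients. */
    for (i = 0; i < n; i += 2)
        *sample(x, i, n, stride) +=
            floor_div(*sample(x, i - 1, n, stride) + *sample(x, i + 1, n, stride) + 2, 4);
}

/* Inverse transform: undo the two steps in reverse order. */
static void lift53_inverse(int *x, ptrdiff_t n, ptrdiff_t stride)
{
    ptrdiff_t i;
    assert(n >= 2 && stride >= 1);
    for (i = 0; i < n; i += 2)
        *sample(x, i, n, stride) -=
            floor_div(*sample(x, i - 1, n, stride) + *sample(x, i + 1, n, stride) + 2, 4);
    for (i = 1; i < n; i += 2)
        *sample(x, i, n, stride) +=
            floor_div(*sample(x, i - 1, n, stride) + *sample(x, i + 1, n, stride), 2);
}
```

Calling lift53_forward(column_start, height, pitch) on every column and then lift53_forward(row_start, width, 1) on every row gives one decomposition level. With a large pitch, consecutive samples of a column fall in different cache lines, which is one way to read the "distinct memory access pattern" bullet.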

Cell/B.E. vs Traditional Multi-core Processor
SPE
• In-order
• No dynamic branch prediction
• SIMD only
=> Small and simple core
Traditional Multi-core Processor
• Out-of-order
• Dynamic branch prediction
• Scalar + SIMD
=> Large and complex core

Cell/B.E. vs Traditional Multi-core Processor
SPE
[Figure: Exe. Pipeline, LS, Main Memory]
• Isolated constant latency LS access
• Software controlled DMA data transfer between LS and main memory
Traditional Multi-core Processor
[Figure: Exe. Pipeline, I1 and D1, L2, L3, Main Memory]
• Every memory access is cache coherent
• Hardware controlled data transfer
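
The difference between hardware controlled and software controlled data transfer can be illustrated with a small, portable sketch. It is not Cell SDK code: memcpy stands in for the DMA get and put that an SPE program would issue, and LS_BUF_INTS and scale_streamed are hypothetical names chosen for this example. The point is only that the program itself decides which piece of main memory sits in the small local buffer at any moment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity of the local buffer, standing in for limited LS space. */
#define LS_BUF_INTS 64

/* Multiply n ints in "main memory" by factor, one local-buffer-sized chunk
   at a time. Each memcpy stands in for an explicit DMA transfer. */
static void scale_streamed(int *main_mem, size_t n, int factor)
{
    int local[LS_BUF_INTS];
    size_t done = 0;
    assert(main_mem != NULL || n == 0);
    while (done < n) {
        size_t count = (n - done < LS_BUF_INTS) ? n - done : LS_BUF_INTS;
        size_t i;
        memcpy(local, main_mem + done, count * sizeof(int));  /* "DMA get" */
        for (i = 0; i < count; i++)
            local[i] *= factor;                               /* compute on local data only */
        memcpy(main_mem + done, local, count * sizeof(int));  /* "DMA put" */
        done += count;
    }
}
```

On a cache-based processor the same loop would simply index main_mem and leave the movement of data to the hardware.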

Cell/B.E. Architecture - Performance
• More cores within power and transistor budget
• Invest the larger fraction of the die area for actual computation
• Highly scalable memory architecture
• Enable fine-grain data transfer control
• Efficient vectorization is even more important (No scalar unit)

Cell/B.E. Architecture - Programmability
• Software (to date, mostly the programmer) controlled data transfer
• Limited LS size
• Manual vectorization
• Manual branch hint, loop unrolling, etc.
• Efficient DMA data transfer requires cache line alignment, and the transfer size needs to be a multiple of the cache line size.
• Vectorization (SIMD) requires 16 byte alignment, and the vector size needs to be 16 bytes.
=> Challenging to deal with misaligned data !!!

Cell/B.E. Architecture - Programmability
Original loop:
    for( i = 0 ; i < n ; i++ ) {
        a[i] = b[i] + c[i];
    }
If a, b, c, and the size satisfy the alignment and size requirements:
    //n_c: a constant multiple of 4
    v_a = ( vector int* )a;
    v_b = ( vector int* )b;
    v_c = ( vector int* )c;
    for( i = 0 ; i < n_c / 4 ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
If there is no guarantee in alignment and size:
    //Head
    n_head = ( 16 - ( ( unsigned int )a % 16 ) ) / 4;
    n_head = n_head % 4;
    n_body = ( n - n_head ) / 4;
    n_tail = ( n - n_head ) % 4;
    for( i = 0 ; i < n_head ; i++ ) {
        a[i] = b[i] + c[i];
    }
    //Body
    v_a = ( vector int* )( a + n_head );
    v_b = ( vector int* )( b + n_head );
    v_c = ( vector int* )( c + n_head );
    for( i = 0 ; i < n_body ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
    //Tail
    a = ( int* )( v_a + n_body );
    b = ( int* )( v_b + n_body );
    c = ( int* )( v_c + n_body );
    for( i = 0 ; i < n_tail ; i++ ) {
        a[i] = b[i] + c[i];
    }
=> Even more complex if a, b, and c are misaligned!!!
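
The head, body, and tail split can be tried on any machine with the portable sketch below. It is an illustration under stated assumptions, not the slide's SPE code: plain C replaces the vector type and v_add, int is assumed to be 4 bytes, b and c are assumed to share the alignment offset of a, and the names split, split_for_simd and add_arrays are hypothetical. Unlike the slide's version, it also clamps the head so that arrays shorter than the head length are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* head and tail are counted in ints, body in groups of 4 ints (16 bytes). */
typedef struct { size_t head, body, tail; } split;

/* Number of scalar head elements needed before a + head is 16-byte aligned,
   then the number of full 4-int groups, then the leftover tail elements. */
static split split_for_simd(const int *a, size_t n)
{
    split s;
    size_t misalign = (size_t)((uintptr_t)a % 16);
    assert(sizeof(int) == 4 && misalign % 4 == 0);
    s.head = ((16 - misalign) / 4) % 4;
    if (s.head > n) s.head = n;   /* short arrays are all head */
    s.body = (n - s.head) / 4;
    s.tail = (n - s.head) % 4;
    return s;
}

/* a[i] = b[i] + c[i], organised as head, body, tail. */
static void add_arrays(int *a, const int *b, const int *c, size_t n)
{
    split s = split_for_simd(a, n);
    size_t i = 0, v, t;
    for (; i < s.head; i++)                 /* head: scalar */
        a[i] = b[i] + c[i];
    for (v = 0; v < s.body; v++, i += 4) {  /* body: stands in for one vector add */
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    for (t = 0; t < s.tail; t++, i++)       /* tail: scalar */
        a[i] = b[i] + c[i];
}
```

On an SPE, the four assignments in the body loop would be replaced by a single vector operation on 16-byte aligned operands.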

Previous work
• Column grouping [Chaver2002] to enhance cache behavior in vertical filtering
• Muta et al. [Muta2007] optimized convolution based (requires up to 2 times more operations than the lifting based approach) DWT for Cell/B.E.
- High single SPE performance
- Does not scale above 1 SPE

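Data Decomposition Scheme
[Figure: a 2-D array (width by height) with row padding, divided into units of data distribution to the processing elements. Figure labels: cache line aligned; a multiple of the cache line size; row padding; a unit of data transfer and computation; a unit of data distribution to the processing elements; remainder. The units whose width is a multiple of the cache line size are distributed to the SPEs; the remainder is processed by the PPE.]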
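
The arithmetic behind such a decomposition might look like the sketch below. It is one reading of the figure labels, not the authors' implementation: it assumes a 128-byte cache line, a split along the array width, and a caller-chosen chunk width given as a number of cache lines. CACHE_LINE, decomposition, round_up and decompose are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes. */
#define CACHE_LINE 128

typedef struct {
    size_t pitch_bytes;      /* padded row size, a multiple of the cache line size */
    size_t chunk_bytes;      /* width of one unit of data distribution */
    size_t num_chunks;       /* units distributed to the SPEs */
    size_t remainder_bytes;  /* leftover width per row, processed by the PPE */
} decomposition;

/* Round n up to the next multiple of m. */
static size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Split rows of `width` elements of `elem_size` bytes into chunks that are
   `lines_per_chunk` cache lines wide. */
static decomposition decompose(size_t width, size_t elem_size, size_t lines_per_chunk)
{
    decomposition d;
    size_t row_bytes = width * elem_size;
    assert(elem_size > 0 && lines_per_chunk > 0);
    d.pitch_bytes = round_up(row_bytes, CACHE_LINE);  /* row padding */
    d.chunk_bytes = lines_per_chunk * CACHE_LINE;
    d.num_chunks = row_bytes / d.chunk_bytes;
    d.remainder_bytes = row_bytes - d.num_chunks * d.chunk_bytes;
    return d;
}
```

For a row of 1000 four-byte samples and chunks one cache line wide, this gives a 4096-byte pitch, 31 chunks for the SPEs, and a 32-byte remainder for the PPE. If the array base is cache line aligned, the padded pitch keeps the start of every row and every chunk aligned as well.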
