Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

slide-1
SLIDE 1

Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Seunghwa Kang
David A. Bader
slide-2
SLIDE 2

Key Contributions

  • We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity
  • We introduce multiple Cell/B.E. and DWT specific optimization issues and solutions
  • Our implementation achieves 34 and 56 times speedup over one PPE performance, and 4.7 and 3.7 times speedup over the cutting edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.

slide-3
SLIDE 3

Presentation Outline

  • Discrete Wavelet Transform
  • Cell Broadband Engine architecture
      • Comparison with the traditional multicore processor
      • Impact in performance and programmability
  • Optimization Strategies
      • Previous work
      • Data decomposition scheme
      • Real number representation
      • Loop interleaving
      • Fine-grain data transfer control
  • Performance Evaluation
      • Comparison with the AMD Barcelona
  • Conclusions

slide-5
SLIDE 5

Discrete Wavelet Transform (in JPEG2000)

  • Decompose an image in both the vertical and horizontal directions into sub-bands representing the coarse and detail parts while preserving spatial information

Sub-bands (figure labels): LL, LH, HL, HH

slide-6
SLIDE 6

Discrete Wavelet Transform (in JPEG2000)

  • Vertical filtering followed by horizontal filtering
  • Highly parallel but bandwidth intensive
  • Distinct memory access pattern becomes a problem
  • Adopt Jasper [Adams2005] as a baseline code
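The filtering step above can be sketched in C. This is a minimal illustration, not the Jasper code: it assumes the reversible 5/3 lifting filter that JPEG2000 uses for lossless coding, an even number of samples, and symmetric extension at the borders. The names `dwt53_forward_1d` and `floor_div`, and the `stride` parameter, are hypothetical and introduced only for this example.

```c
#include <stddef.h>

/* Floor division for a positive divisor (C's '/' truncates toward zero). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && (a < 0)) {
        q--;
    }
    return q;
}

/*
 * One level of 1-D forward 5/3 lifting (sketch, n must be even).
 * x      : input samples, read at x[k * stride]
 * stride : 1 for a row (horizontal), the row width for a column (vertical)
 * low    : n / 2 coarse coefficients
 * high   : n / 2 detail coefficients
 */
static void dwt53_forward_1d(const int *x, size_t n, size_t stride,
                             int *low, int *high)
{
    size_t half = n / 2;
    size_t i;

    /* Splitting: separate even and odd samples. */
    for (i = 0; i < half; i++) {
        low[i] = x[(2 * i) * stride];
        high[i] = x[(2 * i + 1) * stride];
    }
    /* Lifting step 1 (predict): detail = odd - floor((left + right) / 2). */
    for (i = 0; i < half; i++) {
        int right = (i + 1 < half) ? low[i + 1] : low[i];
        high[i] -= floor_div(low[i] + right, 2);
    }
    /* Lifting step 2 (update): coarse = even + floor((d_prev + d + 2) / 4). */
    for (i = 0; i < half; i++) {
        int prev = (i > 0) ? high[i - 1] : high[0];
        low[i] += floor_div(prev + high[i] + 2, 4);
    }
}
```

Calling it with `stride` equal to the row width walks down a column (vertical filtering), while `stride` equal to 1 walks along a row (horizontal filtering). The two calls touch memory in very different patterns, which is the access-pattern problem noted above.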

slide-8
SLIDE 8

Cell/B.E. vs Traditional Multi-core Processor

SPE

  • In-order
  • No dynamic branch prediction
  • SIMD only

=> Small and simple core

Traditional Multi-core Processor

  • Out-of-order
  • Dynamic branch prediction
  • Scalar + SIMD

=> Large and complex core

slide-9
SLIDE 9

Cell/B.E. vs Traditional Multi-core Processor

Cell/B.E. (figure labels: Exe. Pipeline, LS, Main Memory)

  • Isolated constant latency LS access
  • Software controlled DMA data transfer between LS and main memory

Traditional Multi-core Processor (figure labels: Exe. Pipeline, I1, D1, L2, L3, Main Memory)

  • Every memory access is cache coherent
  • Hardware controlled data transfer

slide-10
SLIDE 10

Cell/B.E. Architecture - Performance

  • More cores within power and transistor budget
  • Invest the larger fraction of the die area for actual computation
  • Highly scalable memory architecture
  • Enable fine-grain data transfer control
  • Efficient vectorization is even more important (No scalar unit)

slide-11
SLIDE 11

Cell/B.E. Architecture - Programmability

  • Software (to date, mostly programmer) controlled data transfer
  • Limited LS size
  • Manual vectorization
  • Manual branch hint, loop unrolling, etc.
  • Efficient DMA data transfer requires cache line alignment, and the transfer size needs to be a multiple of the cache line size.
  • Vectorization (SIMD) requires 16 byte alignment, and the vector size needs to be 16 bytes.

=> Challenging to deal with misaligned data !!!

slide-12
SLIDE 12

Cell/B.E. Architecture - Programmability

Scalar loop:

    for( i = 0 ; i < n ; i++ ) {
        a[i] = b[i] + c[i];
    }

Vectorized loop (satisfies alignment and size requirements):

    /* v_add stands for the vector add operation, as written on the slide */
    v_a = ( vector int* )a;
    v_b = ( vector int* )b;
    v_c = ( vector int* )c;
    for( i = 0 ; i < n_c / 4 ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
    /* n_c: a constant multiple of 4 */

Vectorized loop (no guarantee in alignment and size), split into Head, Body, and Tail:

    /* Number of scalar elements before the first 16 byte boundary of a */
    n_head = ( 16 - ( ( unsigned int )a % 16 ) ) / 4;
    n_head = n_head % 4;
    n_body = ( n - n_head ) / 4;
    n_tail = ( n - n_head ) % 4;

    /* Head: scalar */
    for( i = 0 ; i < n_head ; i++ ) {
        a[i] = b[i] + c[i];
    }

    /* Body: vectorized, 4 elements per iteration */
    v_a = ( vector int* )( a + n_head );
    v_b = ( vector int* )( b + n_head );
    v_c = ( vector int* )( c + n_head );
    for( i = 0 ; i < n_body ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }

    /* Tail: scalar */
    a = ( int* )( v_a + n_body );
    b = ( int* )( v_b + n_body );
    c = ( int* )( v_c + n_body );
    for( i = 0 ; i < n_tail ; i++ ) {
        a[i] = b[i] + c[i];
    }

=> Even more complex if a, b, and c are misaligned!!! (The code above computes n_head from a only, so it relies on b and c sharing the same offset from a 16 byte boundary.)

slide-14
SLIDE 14

Previous work

  • Column grouping [Chaver2002] to enhance cache behavior in vertical filtering
  • Muta et al. [Muta2007] optimized convolution based DWT for Cell/B.E. (the convolution based approach requires up to 2 times more operations than the lifting based approach)
      • High single SPE performance
      • Does not scale above 1 SPE

slide-15
SLIDE 15

Data Decomposition Scheme

[Figure: data decomposition of the 2-D array. Labels:
  • 2-D array width, split into a multiple of the cache line size (distributed to the SPEs) and a remainder (processed by the PPE), followed by row padding
  • 2-D array height
  • A unit of data distribution to the processing elements: a multiple of the cache line size, cache line aligned
  • A unit of data transfer and computation]

slide-16
SLIDE 16

Data Decomposition Scheme

  • Satisfies the alignment and size requirements for efficient DMA data transfer and vectorization.
  • Fixed LS space requirements regardless of an input image size
  • Constant loop count

[Figure label: a unit of data transfer and computation has constant width]
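The decomposition can be sketched as plain index arithmetic. This is an illustrative sketch, not the authors' implementation: it assumes a 128 byte cache line, an even split of cache-line-wide column blocks among the SPEs, and hypothetical names (`col_range`, `padded_row_bytes`, `decompose_columns`).

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 128 /* assumption: cache line size in bytes */

/* A contiguous range of columns, in elements. */
typedef struct {
    size_t first_col;
    size_t num_cols;
} col_range;

/* Row size in bytes after padding each row up to a cache line multiple. */
static size_t padded_row_bytes(size_t width, size_t elem_bytes)
{
    size_t bytes = width * elem_bytes;
    size_t lines = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
    return lines * CACHE_LINE_BYTES;
}

/*
 * Split the array width into cache-line-wide column blocks.
 * Whole blocks are distributed to the SPEs; the remaining columns,
 * which do not fill a whole cache line, go to the PPE.
 */
static void decompose_columns(size_t width, size_t elem_bytes,
                              size_t num_spes,
                              col_range *spe, col_range *ppe)
{
    size_t cols_per_line = CACHE_LINE_BYTES / elem_bytes;
    size_t num_lines = width / cols_per_line;
    size_t base = num_lines / num_spes;
    size_t extra = num_lines % num_spes;
    size_t col = 0;
    size_t i;

    for (i = 0; i < num_spes; i++) {
        /* The first `extra` SPEs take one additional block. */
        size_t lines = base + ((i < extra) ? 1 : 0);
        spe[i].first_col = col;
        spe[i].num_cols = lines * cols_per_line;
        col += spe[i].num_cols;
    }
    ppe->first_col = col;
    ppe->num_cols = width - col;
}
```

Because every SPE range starts and ends on a cache line boundary (given a cache line aligned, padded row), each SPE can process its range one cache-line-wide column block at a time, so the LS buffer size and the inner loop count do not depend on the image size.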

slide-17
SLIDE 17

Vectorization – Real number representation

  • Jasper adopts fixed point representation
  • Replace floating point arithmetic with fixed point arithmetic
  • Not a good choice for Cell/B.E.

Inst.   Latency (SPE)
mpyh    7 cycles
mpyu    7 cycles
a       2 cycles
fm      6 cycles

Fixed point (32-bit integer) multiply on the SPE:

    mpyh $5, $3, $4
    mpyh $2, $4, $3
    mpyu $4, $3, $4
    a    $3, $5, $2
    a    $3, $3, $4

Floating point multiply on the SPE:

    fm   $3, $3, $4
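To see why the integer multiply needs five instructions, the sequence can be emulated in portable C. This sketch assumes the usual semantics of the SPU mnemonics (`mpyh` multiplies the high 16 bits of the first operand by the low 16 bits of the second and shifts the product left by 16; `mpyu` multiplies the two low 16-bit halves as unsigned values). The helper names mirror the mnemonics; `mul32_from_16` is a hypothetical name.

```c
#include <stdint.h>

/* Emulates mpyh: (high 16 bits of a) * (low 16 bits of b), shifted left 16. */
static uint32_t mpyh(uint32_t a, uint32_t b)
{
    return ((a >> 16) * (b & 0xFFFFu)) << 16;
}

/* Emulates mpyu: unsigned (low 16 bits of a) * (low 16 bits of b). */
static uint32_t mpyu(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Low 32 bits of a * b, built from three 16-bit multiplies and two adds. */
static uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t t1 = mpyh(a, b); /* mpyh $5, $3, $4 */
    uint32_t t2 = mpyh(b, a); /* mpyh $2, $4, $3 */
    uint32_t t3 = mpyu(a, b); /* mpyu $4, $3, $4 */
    return (t1 + t2) + t3;    /* a $3, $5, $2 ; a $3, $3, $4 */
}
```

The result equals the low 32 bits of the ordinary product. The point of the comparison is the cost: three multiplies and two dependent adds for one fixed point multiply, against a single `fm` for the floating point version.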

slide-18
SLIDE 18

Loop Interleaving

  • In a naïve approach, a single vertical filtering involves 3 or 6 times data transfer
  • Bandwidth becomes a bottleneck
  • Interleave splitting, lifting, and optional scaling steps

Does not fit into the LS

slide-19
SLIDE 19

Loop Interleaving

[Figure: Interleaved (low0 high0 low1 high1 low2 high2 low3 high3) -> Splitting -> (low0 low1 low2 low3 high0 high1 high2 high3) -> Lifting -> (low0* low1* low2* low3* high0* high1* high2* high3*)]

  • First interleave multiple lifting steps
  • Then, merge splitting step with the interleaved lifting step
  • Use temporary main memory buffer for the upper half

Figure annotation: Overwritten before read
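A merged loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' SPE code: it uses the reversible 5/3 lifting filter, an even sample count, symmetric extension at the borders, and separate output arrays. The names `dwt53_forward_fused` and `floor_div` are hypothetical.

```c
#include <stddef.h>

/* Floor division for a positive divisor (C's '/' truncates toward zero). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && (a < 0)) {
        q--;
    }
    return q;
}

/*
 * Splitting and both lifting steps in a single pass (n must be even).
 * Each input sample is read once and each output is written once,
 * instead of one full pass over the data per step.
 */
static void dwt53_forward_fused(const int *x, size_t n, int *low, int *high)
{
    size_t half = n / 2;
    int prev_high = 0;
    size_t i;

    for (i = 0; i < half; i++) {
        int even = x[2 * i];
        /* Symmetric extension at the right border. */
        int next_even = (i + 1 < half) ? x[2 * i + 2] : even;
        /* Predict: detail coefficient from the odd sample. */
        int h = x[2 * i + 1] - floor_div(even + next_even, 2);
        /* Symmetric extension at the left border. */
        int left = (i > 0) ? prev_high : h;
        /* Update: coarse coefficient from the even sample. */
        low[i] = even + floor_div(left + h + 2, 4);
        high[i] = h;
        prev_high = h;
    }
}
```

Writing to separate `low` and `high` arrays sidesteps the overwritten-before-read hazard that appears when the split output is stored back into the input array. The slides handle that hazard differently, with a temporary main memory buffer for the upper half.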

slide-20
SLIDE 20

Fine-grain Data Transfer Control

[Figure: Interleaved (low0 high0 low1 high1 low2 high2 low3 high3) -> Splitting -> (low0 low1 low2 low3 high0 high1 high2 high3) -> Lifting -> (low0* low1* low2* low3* high0* high1* high2* high3*)]

  • Initially, we copy data from the buffer after the interleaved loop is finished
  • Yet, we can start it just after low2 and high2 are read
  • Cell/B.E.'s software controlled DMA data transfer enables this

slide-22
SLIDE 22

Performance Evaluation

* 3800 x 2600 color image, 5 resolution levels
* Execution time and scalability up to 2 Cell/B.E. chips (IBM QS20)

slide-23
SLIDE 23

Performance Evaluation – Comparison with x86 Architecture

* Optimization for the Barcelona

Parallelization               OpenMP based parallelization
Vectorization                 Auto-vectorization with compiler directives
Real Number Representation    Identical to the Cell/B.E. case
Loop Interleaving             Identical to the Cell/B.E. case
Run-time profile feedback     Compile with run-time profile feedback

[Chart: Execution Time (ms), axis ticks 1000 to 5000; bar labels:]

                         DWT - Lossless   DWT - Lossy
Cell/B.E. (Base)         1.0              1.0
Cell/B.E. (Optimized)    34               56
Barcelona (Base)         2.4              1.9
Barcelona (Optimized)    7.3              15

  • One 3.2 GHz Cell/B.E. chip (IBM QS20)
  • One 2.0 GHz AMD Barcelona chip (AMD Quad-core Opteron 8350)
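The 4.7 and 3.7 times figures quoted for the Barcelona comparison follow from the chart labels, assuming those labels are speedups relative to the same Cell/B.E. (Base) run. A small sketch of that arithmetic (`relative_speedup` is a hypothetical helper):

```c
/* Ratio of two speedups that were measured against the same baseline. */
static double relative_speedup(double speedup_a, double speedup_b)
{
    return speedup_a / speedup_b;
}
```

With the chart values, `relative_speedup(34.0, 7.3)` is about 4.66 for lossless DWT and `relative_speedup(56.0, 15.0)` is about 3.73 for lossy DWT, which round to the reported 4.7 and 3.7.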

slide-25
SLIDE 25

Conclusions

  • Cell/B.E. has a great potential to speed-up parallel workloads but requires judicious implementation
  • We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity
  • Our implementation demonstrates 34 and 56 times speedup over one PPE, and 4.7 and 3.7 times speedup over the AMD Barcelona processor with one Cell/B.E. chip
  • Cell/B.E. can also be used as an accelerator in combination with the traditional microprocessor

slide-26
SLIDE 26

Acknowledgment of Support

David A. Bader

slide-27
SLIDE 27

References

[1] M.D. Adams. The JPEG-2000 Still Image Compression Standard, Dec. 2005.
[2] D. Chaver, M. Prieto, L. Pinuel, and F. Tirado. Parallel wavelet transform for large scale image processing, Int'l Parallel and Distributed Processing Symp., Apr. 2002.
[3] H. Muta, M. Doi, H. Nakano, and Y. Mori. Multilevel parallelization on the Cell/B.E. for a Motion JPEG 2000 encoding server, ACM Multimedia Conf., Sep. 2007.