SLIDE 1 Seunghwa Kang David A. Bader
Optimizing Discrete Wavelet Transform on the Cell Broadband Engine
SLIDE 2 Key Contributions
- We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity
- We introduce multiple Cell/B.E. and DWT specific optimization issues and solutions
- Our implementation achieves 34 and 56 times speedup over one PPE performance, and 4.7 and 3.7 times speedup over the cutting edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.
SLIDE 3 Presentation Outline
- Discrete Wavelet Transform
- Cell Broadband Engine architecture
  - Comparison with the traditional multicore processor
  - Impact in performance and programmability
- Optimization Strategies
  - Previous work
  - Data decomposition scheme
  - Real number representation
  - Loop interleaving
  - Fine-grain data transfer control
- Performance Evaluation
  - Comparison with the AMD Barcelona
- Conclusions
SLIDE 5 Discrete Wavelet Transform (in JPEG2000)
- Decompose an image in both the vertical and horizontal directions into sub-bands representing the coarse and detail parts while preserving spatial information
[Figure: sub-bands LL, LH, HL, HH]
SLIDE 6 Discrete Wavelet Transform (in JPEG2000)
- Vertical filtering followed by horizontal filtering
- Highly parallel but bandwidth intensive
- Distinct memory access pattern becomes a problem
- Adopt Jasper [Adams2005] as a baseline code
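As a minimal sketch of the vertical-then-horizontal order, the C code below applies one decomposition level to a small row-major image. To stay short it substitutes the integer Haar (S-transform) lifting step for the actual JPEG2000 filters, and the names `haar_1d`, `dwt2d_level`, and `MAX_DIM` are assumptions made for illustration, not Jasper's API. Note how the vertical pass touches samples that are a full row apart, while the horizontal pass touches adjacent samples.

```c
#include <assert.h>

#define MAX_DIM 64 /* hypothetical upper bound on width and height */

/* Floor of v / 2, also for negative v. */
int floor_half(int v)
{
    return (v >= 0) ? v / 2 : -((-v + 1) / 2);
}

/* One-level integer Haar (S-transform) over n samples that are `stride`
 * elements apart. Low-pass results end up in the first n / 2 positions,
 * high-pass results in the last n / 2 positions. */
void haar_1d(int *x, int n, int stride)
{
    int tmp[MAX_DIM];
    int half = n / 2;
    assert(n % 2 == 0 && n <= MAX_DIM);
    for (int i = 0; i < half; i++) {
        int even = x[(2 * i) * stride];
        int odd = x[(2 * i + 1) * stride];
        int d = odd - even;            /* detail (high-pass) */
        tmp[half + i] = d;
        tmp[i] = even + floor_half(d); /* coarse (low-pass) */
    }
    for (int i = 0; i < n; i++) {
        x[i * stride] = tmp[i];
    }
}

/* img is a row-major array with h rows and w columns. */
void dwt2d_level(int *img, int w, int h)
{
    /* Vertical filtering: neighbors in a column are w elements apart. */
    for (int col = 0; col < w; col++) {
        haar_1d(img + col, h, w);
    }
    /* Horizontal filtering: neighbors in a row are adjacent in memory. */
    for (int row = 0; row < h; row++) {
        haar_1d(img + row * w, w, 1);
    }
}
```

For a constant image, only the top-left quadrant (low-pass in both directions) stays nonzero after `dwt2d_level`; the other three quadrants hold zero detail coefficients.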
SLIDE 8 Cell/B.E. vs Traditional Multi-core Processor
SPE
- In-order
- No dynamic branch prediction
- SIMD only
=> Small and simple core
Traditional Multi-core Processor
- Out-of-order
- Dynamic branch prediction
- Scalar + SIMD
=> Large and complex core
SLIDE 9 Cell/B.E. vs Traditional Multi-core Processor
SPE
[Figure: Exe. Pipeline, LS, Main Memory]
- Isolated constant latency LS access
- Software controlled DMA data transfer between LS and main memory
Traditional Multi-core Processor
[Figure: Exe. Pipeline, I1, D1, L2, L3, Main Memory]
- Every memory access is cache coherent
- Hardware controlled data transfer
SLIDE 10 Cell/B.E. Architecture - Performance
- More cores within power and transistor budget
- Invest the larger fraction of the die area for actual computation
- Highly scalable memory architecture
- Enable fine-grain data transfer control
- Efficient vectorization is even more important (No scalar unit)
SLIDE 11 Cell/B.E. Architecture - Programmability
- Software (mostly the programmer, to date) controlled data transfer
- Limited LS size
- Manual vectorization
- Manual branch hint, loop unrolling, etc.
- Efficient DMA data transfer requires cache line alignment, and the transfer size needs to be a multiple of the cache line size.
- Vectorization (SIMD) requires 16 byte alignment, and the vector size needs to be 16 bytes.
=> Challenging to deal with misaligned data !!!
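The two requirements above can be checked with a few lines of C. This is a hedged sketch, not code from the presentation: the helper names are invented for illustration, and the 128 byte value is the Cell/B.E. cache line size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 128u /* cache line size on the Cell/B.E. */
#define VECTOR_BYTES 16u      /* SIMD register width on the SPE */

/* True if p sits on a multiple of `boundary` bytes. */
int is_aligned(const void *p, size_t boundary)
{
    assert(boundary > 0);
    return ((uintptr_t)p % boundary) == 0;
}

/* Smallest multiple of `multiple` that is >= n (used for row padding). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Cache line aligned start and a size that is a multiple of a cache line. */
int is_efficient_dma(const void *p, size_t bytes)
{
    return is_aligned(p, CACHE_LINE_BYTES) && bytes > 0 &&
           bytes % CACHE_LINE_BYTES == 0;
}

/* 16 byte aligned start and a size that is a multiple of 16 bytes. */
int is_vectorizable(const void *p, size_t bytes)
{
    return is_aligned(p, VECTOR_BYTES) && bytes > 0 &&
           bytes % VECTOR_BYTES == 0;
}
```

For example, a row of 3800 four-byte samples occupies 15200 bytes, which `round_up(15200, CACHE_LINE_BYTES)` pads to 15232 bytes so that the next row starts on a cache line boundary.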
SLIDE 12 Cell/B.E. Architecture - Programmability
Scalar loop:
    for( i = 0 ; i < n ; i++ ) {
        a[i] = b[i] + c[i];
    }

Satisfies alignment and size requirements:
    // v_add stands for a 4-way vector add (for example, spu_add on the SPE)
    v_a = ( vector int* )a;
    v_b = ( vector int* )b;
    v_c = ( vector int* )c;
    for( i = 0 ; i < n_c / 4 ; i++ ) { // n_c: a constant multiple of 4
        v_a[i] = v_add( v_b[i], v_c[i] );
    }

No guarantee in alignment and size:
    // Head: scalar elements before the first 16 byte boundary
    n_head = ( 16 - ( ( unsigned int )a % 16 ) ) / 4;
    n_head = n_head % 4;
    n_body = ( n - n_head ) / 4;
    n_tail = ( n - n_head ) % 4;
    for( i = 0 ; i < n_head ; i++ ) {
        a[i] = b[i] + c[i];
    }
    // Body: aligned vector operations
    v_a = ( vector int* )( a + n_head );
    v_b = ( vector int* )( b + n_head );
    v_c = ( vector int* )( c + n_head );
    for( i = 0 ; i < n_body ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
    // Tail: remaining scalar elements
    a = ( int* )( v_a + n_body );
    b = ( int* )( v_b + n_body );
    c = ( int* )( v_c + n_body );
    for( i = 0 ; i < n_tail ; i++ ) {
        a[i] = b[i] + c[i];
    }
=> Even more complex if a, b, and c are misaligned with respect to one another (the code above derives n_head from a alone)!!!
SLIDE 14 Previous work
- Column grouping [Chaver2002] to enhance cache behavior in vertical filtering
- Muta et al. [Muta2007] optimized a convolution based DWT for Cell/B.E. (the convolution based approach requires up to 2 times more operations than the lifting based approach)
  - High single SPE performance
  - Does not scale above 1 SPE
SLIDE 15 Data Decomposition Scheme
[Figure: data decomposition of the 2-D array. Labels: 2-D array width; 2-D array height; row padding; cache line aligned; a multiple of the cache line size (distributed to the SPEs); remainder (processed by the PPE); a unit of data transfer and computation (a multiple of the cache line size); a unit of data distribution to the processing elements]
SLIDE 16 Data Decomposition Scheme
- Satisfies the alignment and size requirements for efficient DMA data transfer and vectorization.
- Fixed LS space requirements regardless of an input image size
- Constant loop count
[Figure: a unit of data transfer and computation, constant width]
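One way to read the scheme is as a small planning routine that pads each row and splits the width into constant-width column chunks. The sketch below is an assumption-laden illustration: the struct, the function names, the choice of exactly one cache line per chunk (the slides only say "a multiple"), and the block distribution across SPEs are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 128 /* cache line size on the Cell/B.E. */

typedef struct {
    size_t row_stride;  /* elements per padded row, a cache line multiple */
    size_t chunk_width; /* elements per constant-width column chunk */
    size_t num_chunks;  /* full chunks, distributed to the SPEs */
    size_t remainder;   /* leftover columns, processed by the PPE */
} decomposition;

/* Plan the decomposition for rows of `width` elements of `elem_bytes` each.
 * Assumption: one chunk is exactly one cache line wide. */
decomposition plan_decomposition(size_t width, size_t elem_bytes)
{
    decomposition d;
    assert(elem_bytes > 0 && CACHE_LINE_BYTES % elem_bytes == 0);
    d.chunk_width = CACHE_LINE_BYTES / elem_bytes;
    d.num_chunks = width / d.chunk_width;
    d.remainder = width % d.chunk_width;
    d.row_stride = (width + d.chunk_width - 1) / d.chunk_width * d.chunk_width;
    return d;
}

/* Number of chunks given to SPE `spe` out of `num_spes`
 * (hypothetical policy: spread the chunks as evenly as possible). */
size_t chunks_for_spe(const decomposition *d, size_t spe, size_t num_spes)
{
    assert(num_spes > 0 && spe < num_spes);
    size_t base = d->num_chunks / num_spes;
    size_t extra = d->num_chunks % num_spes;
    return base + (spe < extra ? 1 : 0);
}
```

With a width of 3800 four-byte samples, this plan yields 118 chunks of 32 columns for the SPEs, 24 remainder columns for the PPE, and a padded row stride of 3808 elements. Because the chunk width is a constant, the inner loop count and the LS buffer size do not depend on the image size.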
SLIDE 17 Vectorization – Real number representation
- Jasper adopts fixed point representation
- Replace floating point arithmetic with fixed point arithmetic
- Not a good choice for Cell/B.E.

Inst.   Latency (SPE)
mpyh    7 cycles
mpyu    7 cycles
a       2 cycles
fm      6 cycles

32 bit integer multiply (needed for fixed point arithmetic):
    mpyh $5, $3, $4
    mpyh $2, $4, $3
    mpyu $4, $3, $4
    a $3, $5, $2
    a $3, $3, $4

Single precision floating point multiply:
    fm $3, $3, $4
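To see why the integer sequence needs five instructions, the following C sketch emulates it with ordinary 32 bit arithmetic. The functions `mpyh` and `mpyu` here are plain C stand-ins written for this illustration (they model the 16 bit partial products, not the SPE hardware), and `mul32` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mpyh: (high 16 bits of a) * (low 16 bits of b), shifted
 * left by 16. Only the low 32 bits of the result are kept. */
uint32_t mpyh(uint32_t a, uint32_t b)
{
    return ((a >> 16) * (b & 0xFFFFu)) << 16;
}

/* Stand-in for mpyu: unsigned (low 16 bits of a) * (low 16 bits of b). */
uint32_t mpyu(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Low 32 bits of a * b, built from three 16 bit multiplies and two adds. */
uint32_t mul32(uint32_t a, uint32_t b)
{
    uint32_t t1 = mpyh(a, b); /* mpyh $5, $3, $4 */
    uint32_t t2 = mpyh(b, a); /* mpyh $2, $4, $3 */
    uint32_t t3 = mpyu(a, b); /* mpyu $4, $3, $4 */
    return t1 + t2 + t3;      /* a $3, $5, $2 ; a $3, $3, $4 */
}
```

The term from the two high halves is shifted out of the 32 bit result entirely, which is why three partial products suffice. A floating point multiply, by contrast, is the single `fm` instruction.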
SLIDE 18 Loop Interleaving
- In a naïve approach, a single vertical filtering transfers the data 3 or 6 times
- Bandwidth becomes a bottleneck
- Interleave splitting, lifting, and optional scaling steps
Does not fit into the LS
SLIDE 19 Loop Interleaving
[Figure: splitting reorders low0 high0 low1 high1 low2 high2 low3 high3 into low0 low1 low2 low3 high0 high1 high2 high3; interleaved lifting then produces low0* low1* low2* low3* high0* high1* high2* high3*]
- First interleave multiple lifting steps
- Then, merge splitting step with the interleaved lifting step
- Overwritten before read
- Use temporary main memory buffer for the upper half
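The merge of splitting and lifting can be sketched in one dimension with the reversible 5/3 filter that JPEG2000 uses for lossless coding. This is an illustrative reduction, not the presented implementation: in the slides the samples are whole rows moved by DMA during vertical filtering, and the function names below are assumptions.

```c
#include <assert.h>

/* Floor of a / b for b > 0, also for negative a. */
int floor_div(int a, int b)
{
    int q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* Reference: splitting and the two lifting steps as three separate loops,
 * so the data is traversed three times. */
void dwt53_separate(const int *x, int n, int *low, int *high)
{
    int half = n / 2;
    assert(n >= 2 && n % 2 == 0);
    for (int i = 0; i < half; i++) { /* splitting */
        low[i] = x[2 * i];
        high[i] = x[2 * i + 1];
    }
    for (int i = 0; i < half; i++) { /* lifting step 1 */
        int right = (i + 1 < half) ? low[i + 1] : low[i];
        high[i] -= floor_div(low[i] + right, 2);
    }
    for (int i = 0; i < half; i++) { /* lifting step 2 */
        int left = (i > 0) ? high[i - 1] : high[i];
        low[i] += floor_div(left + high[i] + 2, 4);
    }
}

/* Interleaved: a single loop performs splitting and both lifting steps,
 * so the input is traversed once. */
void dwt53_interleaved(const int *x, int n, int *low, int *high)
{
    int half = n / 2;
    int prev_d = 0;
    assert(n >= 2 && n % 2 == 0);
    for (int i = 0; i < half; i++) {
        int even = x[2 * i];
        int next_even = (i + 1 < half) ? x[2 * i + 2] : even;
        int d = x[2 * i + 1] - floor_div(even + next_even, 2);
        int left = (i > 0) ? prev_d : d;
        low[i] = even + floor_div(left + d + 2, 4);
        high[i] = d;
        prev_d = d;
    }
}
```

Both functions produce identical output. In the interleaved version `low` may alias `x`, because position i is written only after it has been read, but `high` may not be the upper half of `x`: those positions still hold unread input, which is the "overwritten before read" hazard that the temporary buffer avoids.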
SLIDE 20 Fine–grain Data Transfer Control
[Figure: splitting reorders low0 high0 low1 high1 low2 high2 low3 high3 into low0 low1 low2 low3 high0 high1 high2 high3; interleaved lifting then produces low0* low1* low2* low3* high0* high1* high2* high3*]
- Initially, we copy data from the buffer after the interleaved loop is finished
- Yet, we can start it just after low2 and high2 are read
- Cell/B.E.’s software controlled DMA data transfer enables this
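The early copy-back can be modeled in one dimension as follows. This is a sketch under stated assumptions: plain assignments stand in for the DMA transfers, the reversible 5/3 filter stands in for the full lifting sequence, and `dwt53_inplace` with its return value is a hypothetical helper added only to expose when the first copy-back happens.

```c
#include <assert.h>

/* Floor of a / b for b > 0, also for negative a. */
int floor_div(int a, int b)
{
    int q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* In-place interleaved 5/3 transform of x[0..n-1]. The low half is written
 * in place, the high half goes through tmp (n / 2 elements) and is copied
 * back as soon as its destination has been read for the last time.
 * Returns the loop iteration at which the first copy-back happened. */
int dwt53_inplace(int *x, int n, int *tmp)
{
    int half = n / 2;
    int prev_d = 0;
    int copied = 0; /* number of high values already copied back */
    int first_copy = -1;
    assert(n >= 2 && n % 2 == 0);
    for (int i = 0; i < half; i++) {
        int even = x[2 * i];
        int next_even = (i + 1 < half) ? x[2 * i + 2] : even;
        int d = x[2 * i + 1] - floor_div(even + next_even, 2);
        int left = (i > 0) ? prev_d : d;
        x[i] = even + floor_div(left + d + 2, 4); /* low half, in place */
        tmp[i] = d;                               /* high half, buffered */
        prev_d = d;
        /* Inputs at positions <= 2 * i + 1 are no longer needed, so the
         * buffered high values destined for them can be copied back now. */
        while (copied < half && half + copied <= 2 * i + 1) {
            x[half + copied] = tmp[copied];
            if (first_copy < 0) {
                first_copy = i;
            }
            copied++;
        }
    }
    assert(copied == half);
    return first_copy;
}
```

For eight samples the first copy-back happens in iteration 2, right after the samples at positions 4 and 5 (low2 and high2 in the figure) have been read, and every copy-back is finished when the loop ends instead of starting only then.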
SLIDE 22 Performance Evaluation
* 3800 X 2600 color image, 5 resolution levels
* Execution time and scalability up to 2 Cell/B.E. chips (IBM QS20)
SLIDE 23 Performance Evaluation – Comparison with x86 Architecture
* Optimization for the Barcelona
Parallelization               OpenMP based parallelization
Vectorization                 Auto-vectorization with compiler directives
Real Number Representation    Identical to the Cell/B.E. case
Loop Interleaving             Identical to the Cell/B.E. case
Run-time profile feedback     Compile with run-time profile feedback

[Figure: bar chart of execution time (ms), axis from 1000 to 5000, bars annotated with the speedup relative to Cell/B.E. (Base)]
Configuration            DWT - Lossless   DWT - Lossy
Cell/B.E. (Base)         1.0              1.0
Cell/B.E. (Optimized)    34               56
Barcelona (Base)         2.4              1.9
Barcelona (Optimized)    7.3              15
- One 3.2 GHz Cell/B.E. chip (IBM QS20)
- One 2.0 GHz AMD Barcelona chip (AMD Quad-core Opteron 8350)
SLIDE 25 Conclusions
- Cell/B.E. has a great potential to speed up parallel workloads but requires judicious implementation
- We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity
- Our implementation demonstrates 34 and 56 times speedup over one PPE, and 4.7 and 3.7 times speedup over the AMD Barcelona processor with one Cell/B.E. chip
- Cell/B.E. can also be used as an accelerator in combination with the traditional microprocessor
SLIDE 26 Acknowledgment of Support
David A. Bader
SLIDE 27 References
[1] M.D. Adams. The JPEG-2000 Still Image Compression Standard, Dec. 2005.
[2] D. Chaver, M. Prieto, L. Pinuel, and F. Tirado. Parallel wavelet transform for large scale image processing, Int’l Parallel and Distributed Processing Symp., Apr. 2002.
[3] H. Muta, M. Doi, H. Nakano, and Y. Mori. Multilevel parallelization on the Cell/B.E. for a Motion JPEG 2000 encoding server, ACM Multimedia Conf., Sep. 2007.