Towards Exploiting Data Locality for Irregular Applications on - - PowerPoint PPT Presentation
Towards Exploiting Data Locality for Irregular Applications on Shared-Memory Multicore Architectures
Cheng Wang
Advisor: Barbara Chapman
HPCTools Group, Department of Computer Science, University of Houston, Houston, TX, 77004, USA
May 17,
Outline
1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work
May 17, 2016 Cheng Wang (cwang35@uh.edu) 2 / 22
The Reality of Parallel Computing ...
1Slide based on a post from https://highscalability.com
Why Does CPU Caching Matter?
[Figures: memory hierarchy; processor-memory performance gap (performance vs. year, 1980 to 2014, CPU vs. memory)]
- Memory has become the principal performance bottleneck
- Improving cache utilization is the key to performance optimization
1Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html
Shared-Memory Multicore Architectures
1 Shared memory
- On-chip: (Last-level) cache shared by homogeneous/heterogeneous processors
- Off-chip: Main memory shared by homogeneous/heterogeneous processors
What are Irregular Applications?
do i = 1, N
  ... = x[idx[i]]
end do
1 Indirect array reference pattern
2 Commonly found in linked-list, tree, and graph-based applications
3 Poor data locality
4 Especially challenging on shared-memory multicore architectures, as cores compete for memory bandwidth
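A minimal sketch of the indirect reference pattern above, written in Python for illustration. The names gather_sum and strides, and the sample arrays, are hypothetical and not from the slides. The stride list shows how far consecutive accesses jump in memory, which is a rough indicator of poor spatial locality.

```python
def gather_sum(x, idx):
    """Sum x[idx[i]] for all i: the address of each load depends on idx."""
    total = 0.0
    for i in range(len(idx)):
        total += x[idx[i]]  # indirect reference, unknown until runtime
    return total


def strides(idx):
    """Distance (in elements) between consecutive accesses to x."""
    return [abs(b - a) for a, b in zip(idx, idx[1:])]


x = [float(v) for v in range(8)]   # hypothetical data array
idx = [6, 1, 7, 0, 3]              # hypothetical index array
total = gather_sum(x, idx)
jumps = strides(idx)               # large jumps suggest poor spatial locality
```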
Approach: Computation/Data Reordering
[Figure: computation (1 2 3 4) mapped to data (1 2 3 4); panels: Computation reordering, Data reordering]
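The two approaches can be sketched as follows. This is an illustrative example, not the presenter's implementation: reorder_computation, apply_data_permutation, and the sample permutation are assumptions. Computation reordering changes the order in which iterations run; data reordering moves the data and rewrites the index array so the original iteration order is kept.

```python
def reorder_computation(idx):
    """Return an iteration order that visits x in ascending address order."""
    return sorted(range(len(idx)), key=lambda i: idx[i])


def apply_data_permutation(x, idx, new_pos):
    """Move x[old] to slot new_pos[old] and rewrite idx to match."""
    new_x = [None] * len(x)
    for old, new in enumerate(new_pos):
        new_x[new] = x[old]
    new_idx = [new_pos[j] for j in idx]
    return new_x, new_idx


x = [10.0, 11.0, 12.0, 13.0]       # hypothetical data
idx = [2, 0, 3, 0, 2]              # hypothetical access pattern

order = reorder_computation(idx)               # iterations, reordered
visited = [idx[i] for i in order]              # addresses now ascend

new_pos = [1, 3, 0, 2]                         # hypothetical data permutation
new_x, new_idx = apply_data_permutation(x, idx, new_pos)
```

Both transformations preserve the values read by the loop; only the order of iterations or the placement of data changes.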
Challenges in Dynamic Irregularity Removal
1 Dynamic irregularity
- Memory access pattern remains unknown until runtime and
may change during computations
- Previous work on compile-time transformations is hardly
applicable
- Need for transformation at runtime
2 Runtime transformation overhead
- Transformation overhead is placed on the critical path of the
application’s execution
- The benefits of improved data locality must outweigh the cost
of the data layout transformation at runtime
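The cost/benefit condition can be made concrete with a small break-even calculation. This is a sketch under a simple assumed model: the kernel runs n times, each run costs t_before without the transformation and t_after with it, and the transformation is paid once. The function name and the timings are hypothetical.

```python
import math


def break_even_runs(t_before, t_after, t_transform):
    """Smallest n with t_transform + n * t_after < n * t_before, or None."""
    gain = t_before - t_after
    if gain <= 0:
        return None  # the transformation never pays off
    return math.floor(t_transform / gain) + 1


# Hypothetical timings in milliseconds.
runs = break_even_runs(t_before=10, t_after=7, t_transform=12)
```

With these assumed numbers the transformation pays off from the fifth kernel run onward; if the kernel runs fewer times, the runtime overhead dominates.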
Sparse FFT
1 A novel compressive-sensing algorithm with a wide range of application
domains
2 The Fourier transform of many signals is dominated by a small number of “peaks”
- The full FFT, at O(n log n), is inefficient for such signals
3 Computes the k-sparse Fourier transform with lower time complexity
- k: the number of “large” coordinates in the frequency domain
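To illustrate what k-sparse means, the sketch below builds a signal from two frequencies and confirms that its spectrum has exactly two large coordinates (k = 2). It uses a naive O(n^2) DFT for clarity, not the sFFT algorithm; the signal size, frequencies, and amplitudes are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform, for illustration only."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


n = 16
peaks = {3: 1.0, 5: 0.5}  # hypothetical frequency -> amplitude
signal = [
    sum(a * cmath.exp(2j * cmath.pi * f * t / n) for f, a in peaks.items())
    for t in range(n)
]

spectrum = dft(signal)
# Coordinates whose magnitude is clearly above floating-point noise.
large = [f for f, c in enumerate(spectrum) if abs(c) > 1e-6]
k = len(large)
```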
Sparse Data is Ubiquitous ...
1Slide based on http://groups.csail.mit.edu/netmit/sFFT/
Irregular Memory Access Pattern in Sparse FFT
[Figure: n coordinates binned into B buckets; irregular data reference pattern]
buckets[i % B] += signal[idx] * filter[i]
- Randomly permutes the signal spectrum and bins it into a small number of buckets
- Irregular memory access pattern
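A minimal sketch of the binning loop, built around the statement buckets[i % B] += signal[idx] * filter[i] from the slide. How idx is derived is not given here, so this example assumes idx = (sigma * i) mod n for an odd sigma; that formula, the function name perm_filter, and the sample inputs are assumptions for illustration only.

```python
def perm_filter(signal, filt, B, sigma):
    """Permute the signal, apply the filter, and bin into B buckets."""
    n = len(signal)
    buckets = [0.0] * B
    touched = []  # record which signal entries are read, in order
    for i in range(len(filt)):
        idx = (sigma * i) % n  # assumed permutation; jumps across the array
        touched.append(idx)
        buckets[i % B] += signal[idx] * filt[i]
    return buckets, touched


signal = [float(v) for v in range(16)]  # hypothetical signal, n = 16
filt = [1.0] * 8                        # hypothetical flat filter
buckets, touched = perm_filter(signal, filt, B=4, sigma=5)
```

The touched list makes the irregularity visible: writes to buckets are regular (i % B), while reads from signal follow the permuted index.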
Parallel Sparse FFT
1 Modern architectures are exclusively based on multicore and
manycore processors
- e.g., Multicore CPUs, GPUs, Intel Xeon Phi, etc.
- A natural path to improving the performance of sFFT is
efficient parallel algorithm design and implementation
2 Standard full-size FFT has been well studied and implemented
- FFTW, cuFFT, Intel MKL, etc.
- Highly optimized for specific architectures
3 Ours is the first effort at a high-performance parallel sFFT
implementation
1cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs, C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016). [To Appear]
Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000)
[Plot: execution time (sec), axis ticks 0.001 to 10, vs. signal size (2^n, n = 19 to 27); series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT]
GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)
- cuFFT: full-size FFT library on Nvidia GPUs
- The MIT seq. sFFT is slower than cuFFT
- cusFFT is 5x faster than PsFFT and 25x faster than the seq. sFFT
- cusFFT is up to 12x faster than cuFFT
Exec. Time: cusFFT vs. sFFT vs. cuFFT (n = 2^25)
[Plot: execution time (sec), axis ticks 0.01 to 10, vs. signal sparsity k, 5000 to 40000; series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT]
GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)
- The seq. sFFT is slower than cuFFT
- PsFFT is faster than cuFFT until k = 3000
- cusFFT is faster than cuFFT until k = 41,000
Rethink the Consecutive Packing (CPACK) Algorithm
CPACK: A greedy algorithm which packs data into consecutive locations in the order they are first accessed by the computation
[Figure: computation steps 1 to 7 accessing data elements 9, 23, 67, 103; hits and misses marked]
Data access order: 9, 23, 103, 23, 67, 23, 67
Original layout: 7 cache misses
Data reordered by CPACK: 6 cache misses
- First-touch policy packs (9,23) together
- Not optimal
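The CPACK example can be reproduced with a short simulation. This is a sketch under an assumed toy cache model, which the slides do not state explicitly: each cache line holds two elements and the cache holds a single line. Under that assumption the counts match the slide (7 misses originally, 6 after CPACK). The function names are hypothetical.

```python
def cpack(accesses):
    """Assign each element a new slot in the order it is first accessed."""
    new_pos = {}
    for a in accesses:
        if a not in new_pos:
            new_pos[a] = len(new_pos)
    return new_pos


def count_misses(positions, line_size=2):
    """Misses for a cache that holds one line of line_size elements."""
    misses, cached_line = 0, None
    for p in positions:
        line = p // line_size
        if line != cached_line:
            misses += 1
            cached_line = line
    return misses


accesses = [9, 23, 103, 23, 67, 23, 67]
original_misses = count_misses(accesses)

layout = cpack(accesses)  # first-touch order: 9, 23, 103, 67
cpack_misses = count_misses([layout[a] for a in accesses])
```

Because 9 is touched first, it shares a line with 23 even though it is never used again, so the later alternation between 23 and 67 keeps missing.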
Rethink the Consecutive Packing (CPACK) Algorithm
Affinity-conscious data reordering ...
[Figure: computation steps 1 to 7 accessing data elements 9, 23, 67, 103; hits and misses marked]
Data access order: 9, 23, 103, 23, 67, 23, 67
Original layout: 7 cache misses
An optimal data layout (23 67 9 103): 4 cache misses
- CPACK does not consider data affinity (i.e., how often data
elements are accessed close together in time)
- Packing (23,67) rather than (9,23) should yield better locality
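One simple way to quantify affinity is to count how often two elements are accessed back to back. This is an assumed proxy chosen for illustration, not necessarily the measure used in the talk; pair_affinity is a hypothetical name.

```python
from collections import Counter


def pair_affinity(accesses):
    """Count how often each unordered pair appears in consecutive accesses."""
    pairs = Counter()
    for a, b in zip(accesses, accesses[1:]):
        if a != b:
            pairs[frozenset((a, b))] += 1
    return pairs


accesses = [9, 23, 103, 23, 67, 23, 67]
affinity = pair_affinity(accesses)
best_pair, best_count = affinity.most_common(1)[0]
```

On the slide's access order, (23,67) is adjacent three times while (9,23) is adjacent only once, which is consistent with packing (23,67) together.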
Data Reordering and NP-completeness
1 Finding an optimal data layout is an NP-complete problem [1]
2 No “best” data reordering algorithm works in general
3 Implicit constraint: each data entry has only one copy in
the transformed format
4 The complexity can be significantly reduced if additional space
may be used
[1] E. Petrank and D. Rawitz. 2002. The hardness of cache conscious data placement. POPL ’02.
A Padding Algorithm that Circumvents the Complexity
CPACKE Algorithm: Extends CPACK by creating duplicate copies of each repeatedly accessed data entry
[Figure: data reordering with padding; extra copies of repeatedly accessed entries (23, 23, 67) are added to the layout]
Data access order: 9, 23, 103, 23, 67, 23, 67
Original layout: 7 cache misses
Padding algorithm: 4 cache misses
- Advantage: Better locality than CPACK
- Disadvantage: Slight space overhead
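A minimal sketch of the padding idea, under one possible reading of the description: every access gets its own slot in access order, so repeatedly accessed entries are duplicated and the loop reads memory sequentially. The exact duplication policy of CPACKE is not given here, so padded_layout, the sample data values, and the toy cache model (two elements per line, one line cached) are assumptions.

```python
def padded_layout(data, accesses):
    """Lay out one copy of the data per access, in access order."""
    padded = [data[a] for a in accesses]
    new_idx = list(range(len(accesses)))  # accesses become sequential
    return padded, new_idx


def count_misses(positions, line_size=2):
    """Misses for a cache that holds one line of line_size elements."""
    misses, cached_line = 0, None
    for p in positions:
        line = p // line_size
        if line != cached_line:
            misses += 1
            cached_line = line
    return misses


data = {9: 0.9, 23: 2.3, 67: 6.7, 103: 10.3}   # hypothetical values
accesses = [9, 23, 103, 23, 67, 23, 67]

padded, new_idx = padded_layout(data, accesses)
padded_misses = count_misses(new_idx)
extra_slots = len(padded) - len(set(accesses))  # space overhead in elements
```

Under these assumptions the padded layout reaches 4 misses, the same as the optimal single-copy layout, at the cost of three extra slots and without solving the placement problem.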
Performance Evaluation
[Plot: execution time (sec), axis ticks 0.1 to 0.9, vs. signal size (2^n, n = 19 to 28) on Intel Xeon E5-2670 (Sandy Bridge); series: PsFFT (before trans), PsFFT (after trans)]
- Applies CPACKE to the perm+filter stage of sFFT
- Improves the performance by 30% for the irregular kernel
- Improves the overall performance of PsFFT by 20%
Conclusion & Future Work
1 A padding-based algorithm that improves the data locality of
irregular applications
2 Improves the performance of the irregular kernel in sFFT by 30% (20% overall for PsFFT)
3 Future work
- Evaluate with more irregular applications
- Evaluate with other data/computation reordering algorithms
- Let the compiler generate the transformed code automatically