SLIDE 1

Towards Exploiting Data Locality for Irregular Applications on Shared-Memory Multicore Architectures

Cheng Wang Advisor: Barbara Chapman

HPCTools Group, Department of Computer Science, University of Houston, Houston, TX, 77004, USA

May 17, 2016

SLIDE 2

Outline

1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work

May 17, 2016 Cheng Wang (cwang35@uh.edu) 2 / 22

SLIDE 3

The Reality of Parallel Computing ...

¹ Slide based on a post from https://highscalability.com

SLIDE 4

Why Does CPU Caching Matter?

[Figures: the memory hierarchy, and the processor-memory performance gap (1980-2014): CPU performance has grown far faster than memory performance, and the gap keeps widening.]

  • Memory has become the principal performance bottleneck
  • Improving cache utilization is the key to performance optimization

¹ Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html

SLIDE 5

Shared-Memory Multicore Architectures

1 Shared memory

  • On-chip: (Last-level) cache shared by homo/hetero processors
  • Off-chip: Main memory shared by homo/hetero processors

SLIDE 6

What are Irregular Applications?

do i = 1, N
  ... = x[idx[i]]
end do

1 Indirect array reference pattern
2 Commonly found in linked-list, tree, and graph-based applications
3 Poor data locality
4 Especially challenging for shared-memory multicore architectures, as cores compete for memory bandwidth

SLIDE 7

Approach: Computation/Data Reordering

[Diagram: computation reordering permutes the order in which iterations access the data, while data reordering permutes the layout of the data itself; both aim to make the access stream more sequential.]
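The two strategies can be sketched on a toy irregular reduction. All names below (`x`, `w`, `idx`) are illustrative, not taken from the slides; both reorderings must leave the result unchanged, since only the traversal or storage order differs:

```python
# Toy irregular reduction: s = sum(x[idx[i]] * w[i]).
# Computation reordering permutes the iterations; data reordering
# permutes the elements of x (and remaps idx accordingly).

x = [10.0, 20.0, 30.0, 40.0]
w = [1.0, 2.0, 3.0, 4.0]
idx = [2, 0, 3, 1]          # indirect access pattern

def original(x, w, idx):
    return sum(x[idx[i]] * w[i] for i in range(len(idx)))

def computation_reordered(x, w, idx):
    # Visit iterations in order of increasing idx[i], so the reads of x
    # become (mostly) sequential; x itself is untouched.
    order = sorted(range(len(idx)), key=lambda i: idx[i])
    return sum(x[idx[i]] * w[i] for i in order)

def data_reordered(x, w, idx):
    # Lay x out in first-access order and remap idx to the new layout;
    # the iteration order is untouched.
    perm, x_new = {}, []    # perm: old index -> new index
    for i in idx:
        if i not in perm:
            perm[i] = len(x_new)
            x_new.append(x[i])
    idx_new = [perm[i] for i in idx]
    return sum(x_new[idx_new[i]] * w[i] for i in range(len(idx_new)))

assert original(x, w, idx) == computation_reordered(x, w, idx) == data_reordered(x, w, idx)
```

Computation reordering leaves memory untouched but changes the iteration schedule; data reordering pays for a copy and an index remap up front.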

SLIDE 8

Challenges in Dynamic Irregularity Removal

1 Dynamic irregularity

  • Memory access pattern remains unknown until runtime and may change during the computation
  • Previous work on compile-time transformations can hardly apply
  • Need for transformation at runtime

2 Runtime transformation overhead

  • Transformation overhead is placed on the critical path of the application's execution
  • The benefits of improved data locality must outweigh the cost of the data layout transformation at runtime
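A common way to organize such a runtime transformation is the inspector-executor pattern: an inspector scans `idx` once, builds a locality-friendly layout, and that one-time cost is amortized over repeated executor runs. A minimal sketch with illustrative names, using a simple first-touch layout:

```python
# Inspector-executor sketch: pay the reordering cost once at runtime,
# then reuse the transformed layout across many executor invocations.

def inspector(x, idx):
    # Build a first-touch data layout and a remapped index array.
    perm, x_new = {}, []    # perm: old index -> new index
    for i in idx:
        if i not in perm:
            perm[i] = len(x_new)
            x_new.append(x[i])
    return x_new, [perm[i] for i in idx]

def executor(x, idx):
    # The irregular kernel itself: gather through the index array.
    return [x[i] for i in idx]

x = [5, 6, 7, 8]
idx = [3, 1, 3, 0, 1]

x_new, idx_new = inspector(x, idx)      # one-time runtime cost
for _ in range(3):                      # amortized over repeated runs
    assert executor(x_new, idx_new) == executor(x, idx)
```

The transformation pays off only when the executor runs enough times (or over enough data) for the saved cache misses to outweigh the inspector's cost.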

SLIDE 9

1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work

SLIDE 10

Sparse FFT


1 A novel compressive sensing algorithm with massive application domains
2 The Fourier transform is dominated by a small number of "peaks"

  • The full-size FFT (O(n log n)) is inefficient

3 Computes the k-sparse Fourier transform in lower time complexity

  • k-sparse: the number of "large" coordinates in the frequency domain
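What k-sparsity means can be illustrated with a tiny pure-Python DFT. This only demonstrates the sparse spectrum, not the sFFT algorithm itself; the signal length and frequencies are arbitrary choices:

```python
import cmath

# A signal that is k-sparse in the frequency domain: k = 2 pure tones
# out of n = 64 possible frequency bins.
n, freqs = 64, [3, 17]
x = [sum(cmath.exp(2j * cmath.pi * f * t / n) for f in freqs)
     for t in range(n)]

def dft(x):
    # Naive O(n^2) DFT, enough for a demonstration.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

X = dft(x)
peaks = [k for k, Xk in enumerate(X) if abs(Xk) > 1.0]  # the "large" coordinates
assert peaks == freqs   # everything else is numerically ~0
```

A sparse FFT aims to find those k peaks without computing all n output coordinates, which is why it can beat the O(n log n) full-size FFT when k is small.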

SLIDE 11

Sparse Data is Ubiquitous ...

¹ Slide based on http://groups.csail.mit.edu/netmit/sFFT/

SLIDE 12

Irregular Memory Access Pattern in Sparse FFT

[Diagram: n spectrum coordinates hashed into B buckets.]

Irregular data reference pattern:

buckets[i % B] += signal[idx] * filter[i]

  • Randomly permutes the signal spectrum and bins it into a small number of buckets
  • Irregular memory access pattern
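The binning step can be sketched as follows. The permutation and filter below are illustrative stand-ins only; the real sFFT uses a carefully designed flat-window filter and a random invertible permutation of the spectrum:

```python
import random

# Bin a permuted signal into B buckets, mirroring the slide's
#   buckets[i % B] += signal[idx] * filter[i]
# pattern. sigma and the filter taps are stand-ins, not sFFT's actual choices.
n, B = 1024, 16
random.seed(0)
signal = [random.random() for _ in range(n)]
filt = [1.0] * n                   # trivial filter, for the sketch only
sigma = 101                        # gcd(sigma, n) = 1 => invertible permutation mod n

buckets = [0.0] * B
for i in range(n):
    idx = (sigma * i) % n          # permuted, hence irregular, read of signal
    buckets[i % B] += signal[idx] * filt[i]

# Sanity check: the permutation visits every element exactly once,
# so total bucket mass equals total signal mass (with a trivial filter).
assert abs(sum(buckets) - sum(signal)) < 1e-9
```

The write to `buckets` is sequential modulo B, but the read `signal[idx]` jumps across the whole array, which is exactly the locality problem the later slides address.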

SLIDE 13

Parallel Sparse FFT

1 Modern architectures are exclusively based on multicore and manycore processors

  • e.g., multicore CPUs, GPUs, Intel Xeon Phi, etc.
  • A natural path to improving sFFT performance is efficient parallel algorithm design and implementation

2 The standard full-size FFT has been well studied and implemented

  • FFTW, cuFFT, Intel MKL, etc.
  • Highly optimized for specific architectures

3 Ours is the first high-performance parallel sFFT implementation¹

¹ cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs, C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016). [To appear]

SLIDE 14
Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000)

[Plot: execution time (sec, log scale) vs. signal size 2^n, for n = 19 to 27. Series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT.]

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

  • cuFFT: full-size FFT library on Nvidia GPUs
  • The MIT seq. sFFT is slower than cuFFT
  • cusFFT is 5x faster than PsFFT, 25x vs. the seq. sFFT
  • cusFFT is up to 12x faster than cuFFT

SLIDE 15
Exec. Time: cusFFT vs. sFFT vs. cuFFT (n = 2^25)

[Plot: execution time (sec, log scale) vs. signal sparsity k, for k = 5,000 to 40,000. Series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT.]

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

  • The seq. sFFT is slower than cuFFT
  • PsFFT is faster than cuFFT until k = 3000
  • cusFFT is faster than cuFFT until k = 41,000

SLIDE 16

1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work

SLIDE 17

Rethink the Consecutive Packing (CPACK) Algorithm

CPACK: a greedy algorithm that packs data into consecutive locations in the order in which they are first accessed by the computation

[Diagram: data access order 9, 23, 103, 23, 67, 23, 67. The original scattered layout incurs 7 cache misses; the layout reordered by CPACK (9, 23, 67, 103) incurs 6.]

  • First-touch policy packs (9,23) together
  • Not optimal
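The diagram's miss counts can be reproduced with a toy cache model, assuming (as the diagram implies, though the slide does not state it) two elements per cache line and a cache that holds a single line:

```python
# First-touch (CPACK-style) packing plus a tiny LRU cache model
# that reproduces the diagram's miss counts.

ACCESSES = [9, 23, 103, 23, 67, 23, 67]   # data access order from the slide

def cpack(accesses):
    # Pack entries consecutively in the order they are first accessed.
    layout = []
    for a in accesses:
        if a not in layout:
            layout.append(a)
    return layout

def misses(accesses, addr_of, line_size=2, num_lines=1):
    # Fully associative LRU cache: front of the list = most recently used.
    cache, n = [], 0
    for a in accesses:
        line = addr_of[a] // line_size
        if line in cache:
            cache.remove(line)
        else:
            n += 1
            if len(cache) == num_lines:
                cache.pop()          # evict the least-recently-used line
        cache.insert(0, line)
    return n

# Original layout: elements are scattered, one per cache line.
scattered = {a: 2 * i for i, a in enumerate(sorted(set(ACCESSES)))}
# CPACK layout: consecutive addresses in first-touch order.
packed = {a: i for i, a in enumerate(cpack(ACCESSES))}

assert misses(ACCESSES, scattered) == 7   # every access misses
assert misses(ACCESSES, packed) == 6      # CPACK saves one miss
```

Note that first-touch packing puts 9 and 23 on the same line because 9 is touched first, which is exactly the suboptimality the next slide points out.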

SLIDE 18

Rethink the Consecutive Packing (CPACK) Algorithm

Affinity-conscious data reordering ...

[Diagram: same access order 9, 23, 103, 23, 67, 23, 67. An optimal data layout (23, 67, 9, 103) incurs only 4 cache misses, vs. 7 for the original.]

  • CPACK does not consider data affinity (i.e., how close together in time nearby data elements are accessed)
  • Packing (23, 67) rather than (9, 23) should yield better locality

SLIDE 19

Data Reordering and NP-completeness

1 Finding an optimal data layout is an NP-complete problem¹
2 No "best" data reordering algorithm works in general
3 Implicit constraint: each data entry has only one copy in the transformed format
4 The complexity can be significantly reduced if more space is allowed to be used

¹ E. Petrank and D. Rawitz. 2002. The hardness of cache conscious data placement. POPL '02.

SLIDE 20

A Padding Algorithm that Circumvents the Complexity

CPACKE Algorithm: extends CPACK by creating duplicate copies of each repeatedly accessed data entry

[Diagram: same access order 9, 23, 103, 23, 67, 23, 67. The padding algorithm duplicates the repeatedly accessed entries (e.g., 23, 23, 67), reducing the count to 4 cache misses.]

  • Advantage: Better locality than CPACK
  • Disadvantage: Slight space overhead
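Under the same toy cache model as before (two elements per line, one line of cache, an assumption not stated on the slide), padding with duplicates makes the access stream fully sequential, which reproduces the diagram's four misses:

```python
# CPACKE-style padding sketch: every access gets its own consecutive
# copy (duplicating repeated entries), so access i reads address i.

ACCESSES = [9, 23, 103, 23, 67, 23, 67]   # access order from the slide

def misses(addresses, line_size=2, num_lines=1):
    # Fully associative LRU cache: front of the list = most recently used.
    cache, n = [], 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.remove(line)
        else:
            n += 1
            if len(cache) == num_lines:
                cache.pop()          # evict the least-recently-used line
        cache.insert(0, line)
    return n

padded_layout = list(ACCESSES)            # 7 entries instead of 4: the space overhead
padded_addresses = list(range(len(ACCESSES)))

assert misses(padded_addresses) == 4      # matches the diagram's 4 misses
```

Because accesses become sequential, the miss count drops to one per cache line touched; the price is one stored copy per access rather than one per distinct entry.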

SLIDE 21

Performance Evaluation

[Plot: execution time (sec) vs. signal size 2^n, for n = 19 to 28, on an Intel Xeon E5-2670 (Sandy Bridge). Series: PsFFT (before transformation), PsFFT (after transformation).]

  • Applies CPACKE to the perm+filter stage in sFFT
  • Improves the performance by 30% for the irregular kernel
  • Improves the overall performance of PsFFT by 20%

SLIDE 22

Conclusion & Future Work

1 A padding-based algorithm that improves the data locality of irregular applications
2 Improves the performance of sFFT by 30%
3 Future work

  • Evaluate with more irregular applications
  • Evaluate with other data/computation reordering algorithms
  • Let the compiler generate the transformed code automatically
