SLIDE 1

Towards Exploiting Data Locality for Irregular Applications on Shared-Memory Multicore Architectures

Cheng Wang Advisor: Barbara Chapman

HPCTools Group, Department of Computer Science, University of Houston, Houston, TX, 77004, USA

May 17, 2016

SLIDE 2

Outline

1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work

May 17, 2016 Cheng Wang (cwang35@uh.edu) 2 / 22

SLIDE 3

The Reality of Parallel Computing ...

¹ Slide based on a post from https://highscalability.com

SLIDE 4

Why Does CPU Caching Matter?

[Figures: the memory hierarchy, and the processor-memory performance gap (1980-2014): CPU performance has grown far faster than memory performance, and the gap keeps widening.]

  • Memory has become the principal performance bottleneck
  • Improving cache utilization is the key to performance optimization

¹ Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html

SLIDE 5

Shared-Memory Multicore Architectures

1 Shared memory

  • On-chip: (Last-level) cache shared by homo/hetero processors
  • Off-chip: Main memory shared by homo/hetero processors

SLIDE 6

What are Irregular Applications?

do i = 1, N
  ... = x[idx[i]]
end do

1 Indirect array reference pattern
2 Commonly found in linked-list, tree, and graph-based applications
3 Poor data locality
4 Especially challenging for shared-memory multicore architectures, as cores compete for memory bandwidth

SLIDE 7

Approach: Computation/Data Reordering

[Diagram: computation reordering permutes the order in which iterations access the data, while data reordering permutes the layout of the data itself; both aim to make the access stream more sequential.]
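The two strategies can be sketched on a toy irregular reduction. All names below (`x`, `w`, `idx`) are illustrative, not taken from the slides; both reorderings must leave the result unchanged, since only the traversal or storage order differs:

```python
# Toy irregular reduction: s = sum(x[idx[i]] * w[i]).
# Computation reordering permutes the iterations; data reordering
# permutes the elements of x (and remaps idx accordingly).

x = [10.0, 20.0, 30.0, 40.0]
w = [1.0, 2.0, 3.0, 4.0]
idx = [2, 0, 3, 1]          # indirect access pattern

def original(x, w, idx):
    return sum(x[idx[i]] * w[i] for i in range(len(idx)))

def computation_reordered(x, w, idx):
    # Visit iterations in order of increasing idx[i], so the reads of x
    # become (mostly) sequential; x itself is untouched.
    order = sorted(range(len(idx)), key=lambda i: idx[i])
    return sum(x[idx[i]] * w[i] for i in order)

def data_reordered(x, w, idx):
    # Lay x out in first-access order and remap idx to the new layout;
    # the iteration order is untouched.
    perm, x_new = {}, []    # perm: old index -> new index
    for i in idx:
        if i not in perm:
            perm[i] = len(x_new)
            x_new.append(x[i])
    idx_new = [perm[i] for i in idx]
    return sum(x_new[idx_new[i]] * w[i] for i in range(len(idx_new)))

assert original(x, w, idx) == computation_reordered(x, w, idx) == data_reordered(x, w, idx)
```

Computation reordering leaves memory untouched but changes the iteration schedule; data reordering pays for a copy and an index remap up front.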

SLIDE 8

Challenges in Dynamic Irregularity Removal

1 Dynamic irregularity

  • Memory access pattern remains unknown until runtime and may change during the computation
  • Previous work on compile-time transformations can hardly apply
  • Need for transformation at runtime

2 Runtime transformation overhead

  • Transformation overhead is placed on the critical path of the application's execution
  • The benefits of improved data locality must outweigh the cost of the data layout transformation at runtime
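A common way to organize such a runtime transformation is the inspector-executor pattern: an inspector scans `idx` once, builds a locality-friendly layout, and that one-time cost is amortized over repeated executor runs. A minimal sketch with illustrative names, using a simple first-touch layout:

```python
# Inspector-executor sketch: pay the reordering cost once at runtime,
# then reuse the transformed layout across many executor invocations.

def inspector(x, idx):
    # Build a first-touch data layout and a remapped index array.
    perm, x_new = {}, []    # perm: old index -> new index
    for i in idx:
        if i not in perm:
            perm[i] = len(x_new)
            x_new.append(x[i])
    return x_new, [perm[i] for i in idx]

def executor(x, idx):
    # The irregular kernel itself: gather through the index array.
    return [x[i] for i in idx]

x = [5, 6, 7, 8]
idx = [3, 1, 3, 0, 1]

x_new, idx_new = inspector(x, idx)      # one-time runtime cost
for _ in range(3):                      # amortized over repeated runs
    assert executor(x_new, idx_new) == executor(x, idx)
```

The transformation pays off only when the executor runs enough times (or over enough data) for the saved cache misses to outweigh the inspector's cost.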

SLIDE 9

1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work

SLIDE 10

Sparse FFT


1 A novel compressive sensing algorithm with massive application domains
2 The Fourier transform is dominated by a small number of "peaks"

  • The full-size FFT (O(n log n)) is inefficient

3 Computes the k-sparse Fourier transform in lower time complexity

  • k-sparse: the number of "large" coordinates in the frequency domain
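What k-sparsity means can be illustrated with a tiny pure-Python DFT. This only demonstrates the sparse spectrum, not the sFFT algorithm itself; the signal length and frequencies are arbitrary choices:

```python
import cmath

# A signal that is k-sparse in the frequency domain: k = 2 pure tones
# out of n = 64 possible frequency bins.
n, freqs = 64, [3, 17]
x = [sum(cmath.exp(2j * cmath.pi * f * t / n) for f in freqs)
     for t in range(n)]

def dft(x):
    # Naive O(n^2) DFT, enough for a demonstration.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

X = dft(x)
peaks = [k for k, Xk in enumerate(X) if abs(Xk) > 1.0]  # the "large" coordinates
assert peaks == freqs   # everything else is numerically ~0
```

A sparse FFT aims to find those k peaks without computing all n output coordinates, which is why it can beat the O(n log n) full-size FFT when k is small.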

SLIDE 11

Sparse Data is Ubiquitous ...

¹ Slide based on http://groups.csail.mit.edu/netmit/sFFT/

SLIDE 12

Irregular Memory Access Pattern in Sparse FFT

[Diagram: n spectrum coordinates hashed into B buckets.]

Irregular data reference pattern:

buckets[i % B] += signal[idx] * filter[i]

  • Randomly permutes the signal spectrum and bins it into a small number of buckets
  • Irregular memory access pattern
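The binning step can be sketched as follows. The permutation and filter below are illustrative stand-ins only; the real sFFT uses a carefully designed flat-window filter and a random invertible permutation of the spectrum:

```python
import random

# Bin a permuted signal into B buckets, mirroring the slide's
#   buckets[i % B] += signal[idx] * filter[i]
# pattern. sigma and the filter taps are stand-ins, not sFFT's actual choices.
n, B = 1024, 16
random.seed(0)
signal = [random.random() for _ in range(n)]
filt = [1.0] * n                   # trivial filter, for the sketch only
sigma = 101                        # gcd(sigma, n) = 1 => invertible permutation mod n

buckets = [0.0] * B
for i in range(n):
    idx = (sigma * i) % n          # permuted, hence irregular, read of signal
    buckets[i % B] += signal[idx] * filt[i]

# Sanity check: the permutation visits every element exactly once,
# so total bucket mass equals total signal mass (with a trivial filter).
assert abs(sum(buckets) - sum(signal)) < 1e-9
```

The write to `buckets` is sequential modulo B, but the read `signal[idx]` jumps across the whole array, which is exactly the locality problem the later slides address.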

SLIDE 13

Parallel Sparse FFT

1 Modern architectures are exclusively based on multicore and manycore processors

  • e.g., multicore CPUs, GPUs, Intel Xeon Phi, etc.
  • A natural path to improving sFFT performance is efficient parallel algorithm design and implementation

2 The standard full-size FFT has been well studied and implemented

  • FFTW, cuFFT, Intel MKL, etc.
  • Highly optimized for specific architectures

3 Ours is the first high-performance parallel sFFT implementation¹

¹ cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs, C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016). [To appear]

SLIDE 14
Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000)

[Plot: execution time (sec, log scale) vs. signal size 2^n, for n = 19 to 27. Series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT.]

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

  • cuFFT: full-size FFT library on Nvidia GPUs
  • The MIT seq. sFFT is slower than cuFFT
  • cusFFT is 5x faster than PsFFT, 25x vs. the seq. sFFT
  • cusFFT is up to 12x faster than cuFFT

SLIDE 15
Exec. Time: cusFFT vs. sFFT vs. cuFFT (n = 2^25)

[Plot: execution time (sec, log scale) vs. signal sparsity k, for k = 5,000 to 40,000. Series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT.]

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

  • The seq. sFFT is slower than cuFFT
  • PsFFT is faster than cuFFT until k = 3000
  • cusFFT is faster than cuFFT until k = 41,000

SLIDE 16

1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work

SLIDE 17

Rethink the Consecutive Packing (CPACK) Algorithm

CPACK: a greedy algorithm that packs data into consecutive locations in the order in which they are first accessed by the computation

[Diagram: data access order 9, 23, 103, 23, 67, 23, 67. The original scattered layout incurs 7 cache misses; the layout reordered by CPACK (9, 23, 67, 103) incurs 6.]

  • First-touch policy packs (9,23) together
  • Not optimal
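The diagram's miss counts can be reproduced with a toy cache model, assuming (as the diagram implies, though the slide does not state it) two elements per cache line and a cache that holds a single line:

```python
# First-touch (CPACK-style) packing plus a tiny LRU cache model
# that reproduces the diagram's miss counts.

ACCESSES = [9, 23, 103, 23, 67, 23, 67]   # data access order from the slide

def cpack(accesses):
    # Pack entries consecutively in the order they are first accessed.
    layout = []
    for a in accesses:
        if a not in layout:
            layout.append(a)
    return layout

def misses(accesses, addr_of, line_size=2, num_lines=1):
    # Fully associative LRU cache: front of the list = most recently used.
    cache, n = [], 0
    for a in accesses:
        line = addr_of[a] // line_size
        if line in cache:
            cache.remove(line)
        else:
            n += 1
            if len(cache) == num_lines:
                cache.pop()          # evict the least-recently-used line
        cache.insert(0, line)
    return n

# Original layout: elements are scattered, one per cache line.
scattered = {a: 2 * i for i, a in enumerate(sorted(set(ACCESSES)))}
# CPACK layout: consecutive addresses in first-touch order.
packed = {a: i for i, a in enumerate(cpack(ACCESSES))}

assert misses(ACCESSES, scattered) == 7   # every access misses
assert misses(ACCESSES, packed) == 6      # CPACK saves one miss
```

Note that first-touch packing puts 9 and 23 on the same line because 9 is touched first, which is exactly the suboptimality the next slide points out.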

SLIDE 18

Rethink the Consecutive Packing (CPACK) Algorithm

Affinity-conscious data reordering ...

[Diagram: same access order 9, 23, 103, 23, 67, 23, 67. An optimal data layout (23, 67, 9, 103) incurs only 4 cache misses, vs. 7 for the original.]

  • CPACK does not consider data affinity (i.e., how close together in time nearby data elements are accessed)
  • Packing (23, 67) rather than (9, 23) should yield better locality

SLIDE 19

Data Reordering and NP-completeness

1 Finding an optimal data layout is an NP-complete problem¹
2 No "best" data reordering algorithm works in general
3 Implicit constraint: each data entry has only one copy in the transformed format
4 The complexity can be significantly reduced if more space is allowed to be used

¹ E. Petrank and D. Rawitz. 2002. The hardness of cache conscious data placement. POPL '02.

SLIDE 20

A Padding Algorithm that Circumvents the Complexity

CPACKE Algorithm: extends CPACK by creating duplicate copies of each repeatedly accessed data entry

[Diagram: same access order 9, 23, 103, 23, 67, 23, 67. The padding algorithm duplicates the repeatedly accessed entries (e.g., 23, 23, 67), reducing the count to 4 cache misses.]

  • Advantage: Better locality than CPACK
  • Disadvantage: Slight space overhead
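Under the same toy cache model as before (two elements per line, one line of cache, an assumption not stated on the slide), padding with duplicates makes the access stream fully sequential, which reproduces the diagram's four misses:

```python
# CPACKE-style padding sketch: every access gets its own consecutive
# copy (duplicating repeated entries), so access i reads address i.

ACCESSES = [9, 23, 103, 23, 67, 23, 67]   # access order from the slide

def misses(addresses, line_size=2, num_lines=1):
    # Fully associative LRU cache: front of the list = most recently used.
    cache, n = [], 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.remove(line)
        else:
            n += 1
            if len(cache) == num_lines:
                cache.pop()          # evict the least-recently-used line
        cache.insert(0, line)
    return n

padded_layout = list(ACCESSES)            # 7 entries instead of 4: the space overhead
padded_addresses = list(range(len(ACCESSES)))

assert misses(padded_addresses) == 4      # matches the diagram's 4 misses
```

Because accesses become sequential, the miss count drops to one per cache line touched; the price is one stored copy per access rather than one per distinct entry.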

SLIDE 21

Performance Evaluation

[Plot: execution time (sec) vs. signal size 2^n, for n = 19 to 28, on an Intel Xeon E5-2670 (Sandy Bridge). Series: PsFFT (before transformation), PsFFT (after transformation).]

  • Applies CPACKE to the perm+filter stage in sFFT
  • Improves the performance by 30% for the irregular kernel
  • Improves the overall performance of PsFFT by 20%

SLIDE 22

Conclusion & Future Work

1 A padding-based algorithm that improves the data locality of irregular applications
2 Improves the performance of sFFT by 30%
3 Future work

  • Evaluate with more irregular applications
  • Evaluate with other data/computation reordering algorithms
  • Let the compiler generate the transformed code automatically
