
Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs
Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten


  1. Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs
     Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten
     Computer Systems Laboratory, Cornell University
     50th ACM/IEEE Int'l Symp. on Microarchitecture, MICRO-2017

  2. Inter-Core
     ● Task-Based Parallel Programming Frameworks
       ○ Intel TBB, Cilk
     Intra-Core
     ● Packed-SIMD Vectorization
       ○ Intel AVX, Arm NEON


  6. Challenges of Combining Tasks and Vectors

     void app_kernel_tbb_avx(int N, float* src, float* dst) {
       // Pack data into padded, aligned chunks:
       //   src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]
       //   dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]
       ...
       // Use TBB across cores
       parallel_for(range(0, NUM_CHUNKS, TASK_SIZE), [&](range r) {
         for (int i = r.begin(); i < r.end(); i++) {
           // Use packed-SIMD within a core
           #pragma simd vlen(SIMD_WIDTH)
           for (int j = 0; j < SIMD_WIDTH; j++) {
             if (src_chunks[i][j] > THRESHOLD)
               dst_chunks[i][j] = DoLightCompute(src_chunks[i][j]);
             else
               dst_chunks[i][j] = DoHeavyCompute(src_chunks[i][j]);
           }
         }
       });
       ...
     }

     Challenge #1: Intra-Core Parallel Abstraction Gap
     Challenge #2: Inefficient Execution of Irregular Tasks

  9. Native Performance Results
     [Figure: native performance results, grouped into regular and irregular application kernels]

  10. Loop-Task Accelerator (LTA) Vision ● Motivation ● Challenge #1: LTA SW ● Challenge #2: LTA HW ● Evaluation ● Conclusion

  11. LTA SW: API and ISA Hint

      void app_kernel_lta(int N, float* src, float* dst) {
        LTA_PARALLEL_FOR(0, N, (dst, src), ({
          if (src[i] > THRESHOLD)
            dst[i] = DoComputeLight(src[i]);
          else
            dst[i] = DoComputeHeavy(src[i]);
        }));
      }

      void loop_task_func(void* a, int start, int end, int step=1);

      The hint indicates that hardware can potentially accelerate task execution.
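To make the API concrete, here is a minimal sketch of how LTA_PARALLEL_FOR might lower onto the loop-task function signature above. The LoopTaskArgs struct, the THRESHOLD value, and the stub compute functions are hypothetical illustrations; the actual macro expansion and ISA hint encoding are defined in the paper.

    #include <cmath>

    // Hypothetical stand-ins for the kernel's helpers (not from the paper).
    static const float THRESHOLD = 0.5f;
    static float DoComputeLight(float x) { return 2.0f * x; }
    static float DoComputeHeavy(float x) { return std::sqrt(x) + x; }

    // Hypothetical argument struct packing the (dst, src) capture list.
    struct LoopTaskArgs {
      float*       dst;
      const float* src;
    };

    // Matches the loop-task function signature from the slide: the macro body
    // becomes the loop body, and [start, end) is the slice of the index range
    // that the runtime (or the LTA hardware) assigns to this invocation.
    static void loop_task_func(void* a, int start, int end, int step = 1) {
      LoopTaskArgs* args = static_cast<LoopTaskArgs*>(a);
      for (int i = start; i < end; i += step) {
        if (args->src[i] > THRESHOLD)
          args->dst[i] = DoComputeLight(args->src[i]);
        else
          args->dst[i] = DoComputeHeavy(args->src[i]);
      }
    }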

  12. LTA SW: Task-Based Runtime
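The slide itself carries no code, but for intuition, here is a minimal sketch (our illustration, not the paper's runtime) of how a task-based runtime could execute a loop-task on plain cores: worker threads atomically claim chunks of the iteration range and invoke the same loop_task_func on each chunk. On an LTA-enabled core, the runtime would instead hand a chunk to the accelerator via the ISA hint.

    #include <atomic>
    #include <thread>
    #include <vector>

    // Hypothetical runtime entry point: splits [start, end) into chunks and
    // lets nthreads workers claim chunks with an atomic counter.
    static void run_loop_task(void (*task)(void*, int, int, int), void* args,
                              int start, int end, int chunk = 64,
                              int nthreads = 4) {
      std::atomic<int> next(start);
      std::vector<std::thread> workers;
      for (int t = 0; t < nthreads; t++) {
        workers.emplace_back([&] {
          int lo;
          while ((lo = next.fetch_add(chunk)) < end) {
            int hi = (lo + chunk < end) ? (lo + chunk) : end;
            task(args, lo, hi, 1);  // same loop-task body on every chunk
          }
        });
      }
      for (auto& w : workers) w.join();
    }

    // Usage with the sketch from the previous slide:
    //   LoopTaskArgs args{dst, src};
    //   run_loop_task(&loop_task_func, &args, 0, N);

A production runtime would use recursive range splitting and work stealing (as in TBB); this flat chunk-claiming loop is only meant to show where loop_task_func sits in the software stack.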

  13. Loop-Task Accelerator (LTA) Vision ● Motivation ● Challenge #1: LTA SW ● Challenge #2: LTA HW ● Evaluation ● Conclusion

  14. LTA HW: Fully-Coupled LTA. Coupling is better for regular workloads (it amortizes frontend and memory accesses across lanes).

  15. LTA HW: Fully-Decoupled LTA. Decoupling is better for irregular workloads (it hides latencies).

  16. LTA HW: Task-Coupling Taxonomy
      A task group executes in lock-step. More decoupling (more task groups) in either space or time improves performance on irregular workloads (+) at the cost of higher area/energy (-).
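A back-of-the-envelope model (our illustration, not from the paper) of why more task groups help irregular workloads: lanes within a task group execute in lock-step, so a group pays for every branch path that any of its lanes takes. Assuming a hypothetical 1-cycle light path and 10-cycle heavy path:

    #include <cstdio>

    // Expected cycles for one task group of `lanes` lock-step lanes, where
    // each lane independently takes the light path (1 cycle) with probability
    // 1 - p_heavy or the heavy path (10 cycles) with probability p_heavy.
    // If the group's lanes diverge, it serializes both paths: 11 cycles.
    static double group_cycles(int lanes, double p_heavy) {
      double p_all_light = 1.0, p_all_heavy = 1.0;
      for (int i = 0; i < lanes; i++) {
        p_all_light *= 1.0 - p_heavy;
        p_all_heavy *= p_heavy;
      }
      return 1.0 * p_all_light + 10.0 * p_all_heavy +
             11.0 * (1.0 - p_all_light - p_all_heavy);
    }

    int main() {
      // Eight lanes total: one 8-wide group vs. four spatially decoupled
      // 2-wide groups running concurrently (time ~ per-group expectation).
      printf("one 8-wide group:   %.2f cycles\n", group_cycles(8, 0.3));
      printf("four 2-wide groups: %.2f cycles\n", group_cycles(2, 0.3));
      return 0;
    }

Smaller groups diverge less often, which is the performance side of the taxonomy; the area/energy cost comes from replicating frontend resources per group.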

  17. LTA HW: Task-Coupling Taxonomy. Does it matter whether we decouple in space or in time?

  18. Loop-Task Accelerator (LTA) Vision ● Motivation ● Challenge #1: LTA SW ● Challenge #2: LTA HW ● Evaluation ● Conclusion

  19. Evaluation: Methodology
      • Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism
        • Scientific computing: N-body simulation, MRI-Q, SGEMM
        • Image processing: bilateral filter, RGB-to-CMYK, DCT
        • Graph algorithms: breadth-first search, maximal matching
        • Search/sort algorithms: radix sort, substring matching
      • gem5 + PyMTL co-simulation for cycle-level performance
      • Component/event-based area/energy modeling
        • Uses an area/energy dictionary backed by VLSI results and McPAT

  20. Evaluation: Design Space Exploration
      [Figure: design space exploration under resource constraints; axes are spatial decoupling and temporal decoupling]
      • Prefer spatial decoupling over temporal decoupling
      • Reduce spatial decoupling to improve energy efficiency

  24. Evaluation: Multicore LTA Performance
      [Figure: multicore LTA speedups on regular and irregular kernels; labeled speedups include 10.7x, 5.2x, 2.9x, and 4.4x]

  25. Evaluation: Area-Normalized Performance
      [Figure: area-normalized performance; labeled speedups include 1.8x, 1.6x, and 1.2x]

  26. Loop-Task Accelerator (LTA) Vision ● Motivation ● Challenge #1: LTA SW ● Challenge #2: LTA HW ● Evaluation ● Conclusion

  27. Related Work
      • Challenge #1: Intra-Core Parallel Abstraction Gap
        • Persistent threads for GPGPUs (S. Tzeng et al.)
        • OpenCL, OpenMP, C++ AMP
        • Cilk for vectorization (B. Ren et al.)
        • And more...
      • Challenge #2: Inefficient Execution of Irregular Tasks
        • Variable warp sizing (T. Rogers et al.)
        • Temporal SIMT (S. Keckler et al.)
        • Vector-lane threading (S. Rivoire et al.)
        • And more...
      • Please see the paper for more detailed references!
