ACACES 2018 Summer School GPU Architectures: Basic to Advanced Concepts


  1. ACACES 2018 Summer School GPU Architectures: Basic to Advanced Concepts
     Adwait Jog, Assistant Professor, College of William & Mary (http://adwaitjog.github.io/)

  2. Course Outline
     - Lectures 1 and 2: Basic Concepts
       ● Basics of GPU Programming
       ● Basics of GPU Architecture
     - Lecture 3: GPU Performance Bottlenecks
       ● Memory Bottlenecks
       ● Compute Bottlenecks
       ● Possible Software and Hardware Solutions
     - Lecture 4: GPU Security Concerns
       ● Timing Channels
       ● Possible Software and Hardware Solutions

  3. Key GPU Performance Concerns
     - Memory concerns: Data transfers between SMs and global memory are costly; global memory is the bottleneck.
     - Compute concerns: Threads that do not take the same control path lead to serialization in the GPU compute pipeline (a short CUDA sketch follows this slide).
     [Figure: GPU (device) with several SMs, each with scratchpad, registers, and local memory, sharing the GPU global memory; a timeline of branches A-E shows threads T1-T4 diverging and re-converging.]
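To make the compute concern concrete, the following CUDA kernel is a minimal sketch (not from the slides; the kernel name and the even/odd split are illustrative): threads of the same warp that take different sides of the branch are executed one path after the other, so part of the SIMT pipeline idles on each path.

```cuda
// Minimal divergence sketch: within one warp, threads with even and odd global
// indices take different paths, so the hardware runs path A and then path B
// with some lanes masked off each time.
__global__ void divergent_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((i & 1) == 0) {
        out[i] = in[i] * 2;      // path A: even-indexed lanes
    } else {
        out[i] = in[i] + 1;      // path B: executed after A, other lanes masked
    }
}
```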

  4. Reducing Off-Chip Access
     - Re-writing software to use "shared memory" and avoid un-coalesced global accesses is difficult for the GPU programmer (a sketch of such a rewrite follows this slide).
     - Recent GPUs introduce hardware-managed caches (L1/L2), but the large number of threads leads to thrashing.
     - General-purpose code, now being ported to GPUs, has branches and irregular accesses; it is not always possible to fix them in the code.
     We need intelligent hardware solutions!
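As a concrete illustration of the kind of manual rewrite referred to above, here is a minimal CUDA sketch (the kernel name, tile size, and neighbor-averaging computation are assumptions for illustration): each block loads a tile of global memory into shared memory with coalesced accesses, then reuses the tile on-chip instead of issuing extra global loads.

```cuda
// Stage a tile into shared memory with coalesced loads, then reuse it on-chip.
#define TILE 256

__global__ void tile_reuse(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    // Coalesced load: consecutive threads read consecutive addresses.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    if (i < n) {
        // Neighboring elements come from shared memory (on-chip) rather than
        // from bandwidth-limited global memory.
        float left  = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        out[i] = 0.5f * (left + right);
    }
}
// Launch (assumed): tile_reuse<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```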

  5. I) Alleviating the Memory Bottlenecks
     - Memory concerns: Thousands of threads running on SMs need data from DRAM, but DRAM bandwidth is limited and increasing it is very costly.
     [Figure: SMs with private L1 caches share an L2 cache and the GPU global memory; the L2 and global memory are the bottlenecks.]
     - Q1. How can we use caches effectively to reduce the bandwidth demand?
     - Q2. Can we use data compression effectively to reduce data consumption?
     - Q3. How can we effectively and fairly allocate memory bandwidth across concurrent streams/applications?

  6. Quantifying Memory Bottlenecks
     [Chart: percentage of total execution cycles wasted waiting for data to come back from memory, across applications such as SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP, CON, AES, SD1, BLK, HS, SLA, DN, LPS, NN, PFN, LYTE, LUD, MM, STO, CP, NQU, CUTP, HW, and TPAF; averages are 32% (AVG) and 55% (AVG-T1).]
     [Jog et al., ASPLOS 2013]

  7. Strategies
     - Cache-Aware Warp Scheduling Techniques
       ● Effective caching → less pressure on memory
     - Employing Assist Warps for Helping Data Compression
       ● Bandwidth preserved
     - Bandwidth Allocation Strategies for Multi-Application Execution on GPUs
       ● Better system throughput and fairness

  8. Application-Architecture Co-Design
     - Architecture: GPUs typically employ smaller caches compared to CPUs.
     - Scheduler: Many warps concurrently access the small caches in a round-robin manner, leading to thrashing.

  9. Cache-Aware Scheduling
     - Philosophy: "one group of work at a time"
     - Working:
       ● Select a "group" (unit of work) of warps
       ● Always prioritize it over other groups
       ● Group switching is not round-robin
     - Benefits:
       ● Preserves locality
       ● Fewer cache misses

  10. Improve L1 Cache Hit Rates
      [Figure: warp groups Grp.1-Grp.4 issuing over time. With round-robin order, all four groups access the cache within time T, and when data for Grp.1 arrives it gets no prioritization. With cache-aware order, Grp.1 is prioritized as soon as its data arrives, so only three groups access the cache within time T.]
      Fewer warp groups access the cache concurrently → less cache contention (a simplified sketch of the group-prioritizing policy follows).
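The host-side C++ sketch below (a simplification written for this summary, not the actual hardware or the ASPLOS 2013 implementation) models the group-prioritizing policy of slides 9 and 10: warps are partitioned into groups, and the scheduler keeps issuing from the highest-priority group that has a ready warp, falling back to a lower-priority group only when that group is entirely stalled.

```cuda
// Simplified model of a group-prioritizing warp scheduler (host-side C++).
#include <cstdio>
#include <vector>

struct Warp { int id; bool ready; };

struct GroupScheduler {
    std::vector<std::vector<Warp>> groups;   // groups[0] has the highest priority

    // Return the first ready warp from the highest-priority group that has one;
    // unlike round-robin, lower groups run only when upper groups are stalled.
    Warp* pick() {
        for (auto& group : groups)
            for (auto& w : group)
                if (w.ready) return &w;
        return nullptr;                      // everything waits on memory
    }
};

int main() {
    GroupScheduler s;
    s.groups = {{{0, false}, {1, true}},     // group 0: warp 0 stalled, warp 1 ready
                {{2, true},  {3, true}}};    // group 1: ready, but lower priority
    if (Warp* w = s.pick())
        std::printf("issue warp %d\n", w->id);   // prints: issue warp 1
    return 0;
}
```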

  11. Reduction in L1 Miss Rates
      [Chart: L1 miss rates of the cache-aware scheduler, normalized to the round-robin scheduler, for SAD, SSC, BFS, KMN, IIX, SPMV, and BFSR; 34% average reduction.]
      - 25% improvement in IPC across 19 applications
      - Limited benefits for cache-insensitive applications
      - Software support (e.g., specify data structures that should be "uncacheable")
      [Jog et al., ASPLOS 2013]

  12. Other Sophisticated Mechanisms
      - Rogers et al., Cache-Conscious Wavefront Scheduling, MICRO'12
      - Kayiran et al., Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs, PACT'13
      - Chen et al., Adaptive Cache Management for Energy-Efficient GPU Computing, MICRO'14
      - Lee et al., CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads

  13. Strategies
      - Cache-Aware Warp Scheduling Techniques
        ● Effective caching → less pressure on memory
      - Employing Assist Warps for Helping Data Compression
        ● Bandwidth preserved
      - Bandwidth Allocation Strategies for Multi-Application Execution on GPUs
        ● Better system throughput and fairness

  14. Challenges in GPU Efficiency
      [Figure: a GPU streaming multiprocessor with threads, cores, a register file, and the memory hierarchy; cores sit idle while the memory hierarchy is full, and part of the register file goes unused.]
      - The memory bandwidth bottleneck leads to idle cores.
      - Thread limits lead to an underutilized register file.

  15. Motivation: Unutilized On-Chip Memory
      [Chart: percentage of unallocated registers per application.]
      - 24% of the register file is unallocated on average.
      - Similar trends hold for on-chip scratchpad memory.

  16. Motivation: Idle Pipelines
      [Charts: breakdown of active vs. stall cycles. Memory-bound applications (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs): 67% of cycles idle. Compute-bound applications (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc): 35% of cycles idle.]

  17. Motivation: Summary
      Heterogeneous application requirements lead to:
      - Bottlenecks in execution
      - Idle resources

  18. Our Goal
      - Use idle resources to do something useful: accelerate bottlenecks using helper threads.
      [Figure: helper threads run on the core, using the otherwise idle register file and memory hierarchy.]
      - A flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA).

  19. Helper Threads in GPUs
      - Large body of work in CPUs: [Chappell+ ISCA '99, MICRO '02], [Yang+ USC TR '98], [Dubois+ CF '04], [Zilles+ ISCA '01], [Collins+ ISCA '01, MICRO '01], [Aamodt+ HPCA '04], [Lu+ MICRO '05], [Luk+ ISCA '01], [Moshovos+ ICS '01], [Kamruzzaman+ ASPLOS '11], etc.
      - However, there are new challenges with GPUs.

  20. Challenge
      How do you efficiently manage and use helper threads in a throughput-oriented architecture?

  21. Managing Helper Threads in GPUs
      Where do we add helper threads?
      [Figure: design-space spectrum spanning thread, warp, and thread block granularities, from hardware-managed to software-managed.]

  22. Approach #1: Software-Only
      Pros:
      ● No hardware changes
      Cons:
      ● Coarse grained
      ● Synchronization is difficult
      ● Not aware of runtime program behavior
      [Figure: regular threads and helper threads coexist as ordinary software threads.]
      (A hypothetical sketch of such a software-only scheme follows.)
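For concreteness, the sketch below shows one hypothetical way a software-only helper scheme could be written today: the last warp of each thread block is repurposed to prefetch tiles into shared memory for the remaining "regular" warps. Everything here (kernel name, tile size, double buffering, the doubling computation) is an assumption for illustration, and the coordination has to go through coarse block-wide barriers, which is exactly the drawback listed on this slide.

```cuda
// Software-only helper sketch: the last warp prefetches, the rest compute.
// Launch with, e.g., 256 threads per block; 'in' holds ntiles * TILE floats.
#define TILE 128
#define WARP 32

__global__ void sw_helper(const float *in, float *out, int ntiles)
{
    __shared__ float buf[2][TILE];
    const int  nworkers = blockDim.x - WARP;           // compute threads
    const bool helper   = (threadIdx.x >= nworkers);   // last warp = helper
    const int  lane     = helper ? threadIdx.x - nworkers : threadIdx.x;

    // Helper warp preloads tile 0 into buffer 0.
    if (helper)
        for (int k = lane; k < TILE; k += WARP)
            buf[0][k] = in[k];
    __syncthreads();

    for (int t = 0; t < ntiles; ++t) {
        if (helper) {
            // Prefetch tile t+1 into the other buffer while workers compute.
            if (t + 1 < ntiles)
                for (int k = lane; k < TILE; k += WARP)
                    buf[(t + 1) & 1][k] = in[(size_t)(t + 1) * TILE + k];
        } else {
            for (int k = lane; k < TILE; k += nworkers)
                out[(size_t)t * TILE + k] = 2.0f * buf[t & 1][k];
        }
        // Coarse-grained hand-off: the whole block must reach this barrier
        // before the next tile can be consumed or a buffer reused.
        __syncthreads();
    }
}
```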

  23. Where Do We Add Helper Threads?
      [Figure: the same thread / warp / thread block spectrum between hardware and software, revisited.]

  24. Other Functionality
      In the paper:
      - More details on the hardware structures
      - Data communication and synchronization
      - Enforcing priorities

  25. CABA: Applications
      - Data compression
      - Memoization
      - Prefetching
      - Encryption
      - ...

  26. A Case for CABA: Data Compression
      - Data compression can help alleviate the memory bandwidth bottleneck: it transmits data in a more condensed form.
      [Figure: cores sit idle while uncompressed data crosses the memory hierarchy; the compressed form is much smaller.]
      - CABA employs idle compute pipelines to perform compression.

  27. Data Compression with CABA
      - Use assist warps to:
        ● Compress cache blocks before writing them to memory
        ● Decompress cache blocks before placing them into the cache
      - CABA flexibly enables various compression algorithms.
      - Example: BDI compression [Pekhimenko+ PACT '12]
        ● Parallelizable across the SIMT width
        ● Low latency
      - Others: FPC [Alameldeen+ TR '04], C-Pack [Chen+ VLSI '10]
      (A simplified sketch of the base+delta idea follows.)
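The device function below illustrates the base+delta idea behind BDI; it is a simplified, assumption-laden sketch (function name, a 32x4-byte block layout, a single base, 1-byte deltas), not the actual CABA assist-warp routine. Each lane of a 32-thread warp checks one word of a cache block, which is how the work parallelizes across the SIMT width as the slide notes.

```cuda
// Simplified base+delta check: a 128-byte block (32 x 4-byte words) compresses
// to one 4-byte base plus 32 one-byte deltas if every delta fits in a byte.
// Intended to be called by all 32 lanes of an (assist) warp.
__device__ bool bdi_compress_32x4(const int *block, int *base_out,
                                  signed char *deltas /* 32 entries */)
{
    int lane = threadIdx.x & 31;               // one word per lane
    int base = block[0];                       // first word is the base
    long long d = (long long)block[lane] - base;
    bool fits = (d >= -128 && d <= 127);

    // Compress only if all 32 lanes agree that their delta fits in one byte.
    bool all_fit = __all_sync(0xffffffffu, fits);
    if (all_fit) {
        deltas[lane] = (signed char)d;
        if (lane == 0) *base_out = base;
    }
    return all_fit;                            // false: store uncompressed
}
```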

  28. Walkthrough of Decompression
      [Figure: decompression flow involving the Assist Warp Store, Assist Warp Controller, trigger, warp scheduler, cores, L1D (hit), and L2/memory (miss).]

  29. Walkthrough of Compression
      [Figure: compression flow involving the Assist Warp Store, Assist Warp Controller, trigger, warp scheduler, cores, L1D, and L2/memory.]

  30. Effect on Performance
      [Chart: performance of CABA-BDI and an ideal No-Overhead-BDI design, normalized to the baseline.]
      - CABA provides a 41.7% performance improvement.
      - CABA achieves performance close to that of designs with no overhead for compression.

  31. Effect on Bandwidth Consumption
      [Chart: memory bandwidth consumption of the baseline vs. CABA-BDI.]
      Data compression with CABA alleviates the memory bandwidth bottleneck.

  32. Conclusion
      - Observation: Imbalances in execution leave GPU resources underutilized.
      - Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
      - Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
      - Solution: CABA (Core-Assisted Bottleneck Acceleration, ISCA'15)
        ● A new framework to enable helper threading in GPUs
        ● Enables flexible data compression to alleviate the memory bandwidth bottleneck
        ● A wide set of use cases (e.g., prefetching, memoization)

  33. Strategies
      - Cache-Aware Warp Scheduling Techniques
        ● Effective caching → less pressure on memory
      - Employing Assist Warps for Helping Data Compression
        ● Bandwidth preserved
      - Bandwidth Allocation Strategies for Multi-Application Execution on GPUs
        ● Better system throughput and fairness

  34. Discrete GPU Cards: Scaling Trends

      Year | GPU     | Architecture | CUDA Cores | Memory Bandwidth
      2008 | GTX 275 | Tesla        | 240        | 127 GB/s
      2010 | GTX 480 | Fermi        | 448        | 139 GB/s
      2012 | GTX 680 | Kepler       | 1536       | 192 GB/s
      2014 | GTX 980 | Maxwell      | 2048       | 224 GB/s
      2016 | GP100   | Pascal       | 3584       | 720 GB/s
      2018 | GV100   | Volta        | 5120       | 900 GB/s
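The table makes the scaling imbalance visible: using the numbers above, memory bandwidth per CUDA core on the GV100 is roughly a third of what it was on the GTX 275, even though absolute bandwidth grew by about 7x. The small host-side calculation below (illustrative only, compiled like ordinary C++) reproduces the numbers.

```cuda
// Bandwidth per core from the table above; shows per-core bandwidth shrinking.
#include <cstdio>

int main() {
    struct { const char *gpu; int cores; int gbps; } g[] = {
        {"GTX 275", 240, 127}, {"GTX 480", 448, 139}, {"GTX 680", 1536, 192},
        {"GTX 980", 2048, 224}, {"GP100", 3584, 720}, {"GV100", 5120, 900},
    };
    for (auto &x : g)
        std::printf("%-8s %.3f GB/s per core\n", x.gpu, (double)x.gbps / x.cores);
    return 0;
}
```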
