
Analyzing Throughput of GPUs - PowerPoint PPT Presentation



  1. Analyzing Throughput of GPUs Exploiting Within-Die Core-to-Core Frequency Variation. Jungseob Lee, Paritosh Ajgaonkar, Nam Sung Kim. Apr 12, 2011. Department of Electrical and Computer Engineering, University of Wisconsin - Madison

  2. Outline
     • Introduction
     • GPU architecture and impact of WID variations on GPUs
     • Throughput improvement techniques
       • Allowing per-SM clocking (PSMC)
       • Disabling the slowest SMs (DSSM)
     • Impact of main memory latency and bandwidth on throughput improvement
     • Conclusion

  3. Introduction
     • Goal: improve the throughput of GPU applications. GPUs can provide high throughput for general-purpose and data-intensive applications.
     • Increasing WID core-to-core (C2C) frequency variations affect the Fmax of GPUs: the slowest core limits Fmax.
     • PSMC & DSSM: two techniques for mitigating the negative impact of WID C2C frequency variations on the throughput of GPUs.
     [Figure: a die annotated with its fastest and slowest cores]

  4. Outline
     • Introduction
     • GPU architecture and impact of WID variations on GPUs
     • Throughput improvement techniques
       • Allowing per-SM clocking (PSMC)
       • Disabling the slowest SMs (DSSM)
     • Impact of main memory latency and bandwidth on throughput improvement
     • Conclusion

  5. GPU architecture
     • GPU architecture: streaming multiprocessors (SMs), off-chip DRAM, and an on-chip interconnection network.
     • Each SM: 1) 8 to 32 streaming processors (SPs), 2) an instruction scheduler, 3) an instruction cache, 4) register files, 5) special function units (SFUs), and 6) shared memory/cache.
     [Figure: an SM and its SPs in the Fermi architecture [1]]
     [1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

  6. D2D & WID variations
     • Die-to-die (D2D) variations affect all transistors on a die identically.
     • Within-die (WID) variations: transistor characteristics differ within a single die.
     • With technology scaling (i.e., more cores per die), spatially correlated WID variations cause considerable C2C Fmax variation.
     [Figure: taxonomy of variations: die-to-die (D2D) vs. within-die (WID), systematic and random, at wafer, die, and feature scales. Courtesy: K. Bowman, Intel]

  7. WID C2C Fmax variations
     • Impact of WID C2C Fmax variations:
     [Figure: a WID Vth/Leff variation map (80x80 grid points) [2] and the corresponding Fmax map for a 16-SM GPU]
     • C2C frequency variation affects a GPU's Fmax: in a GPU designed to operate all SMs at the same frequency (i.e., per-chip clocking), Fmax is limited by the slowest SM.
     • More SMs in a die → larger SM-to-SM frequency variation.
     • Power inefficiency: faster SMs consume more leakage power.
     [2] S. Herbert et al., "Characterizing chip-multiprocessor variability-tolerance," in Proc. IEEE DAC, 2008.
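
To make the per-chip clocking penalty concrete, here is a minimal sketch (not the paper's model) that approximates a spatially correlated WID variation map with a Gaussian-filtered random field, tiles it into 16 SMs, and shows the chip-wide Fmax falling to the slowest SM's Fmax. The 80x80 grid and the σ_sys = 6.4% scaling follow slides 7 and 11; the filter width and the linear variation-to-frequency mapping are assumptions.

```python
# Sketch: per-chip Fmax limited by the slowest SM under spatially
# correlated WID variation. The Gaussian-filtered random field and the
# linear variation-to-frequency mapping are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# 80x80 grid of normalized frequency variation; smoothing models the
# spatial correlation of systematic WID variation.
raw = rng.normal(0.0, 1.0, (80, 80))
systematic = gaussian_filter(raw, sigma=8)
field = systematic / systematic.std() * 0.064  # rescale to sigma_sys = 6.4%

nominal_fmax = 1.688  # GHz, entry-level core clock from slide 10
fmax_grid = nominal_fmax * (1.0 + field)

# Tile the die into a 4x4 array of SMs (16-SM GPU); each SM runs no
# faster than its slowest grid point.
sm_fmax = fmax_grid.reshape(4, 20, 4, 20).min(axis=(1, 3)).ravel()

print("per-SM Fmax (GHz):", np.round(sm_fmax, 3))
print("per-chip clocking Fmax (GHz):", sm_fmax.min())  # slowest SM limits the chip
```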

  8. Outline
     • Introduction
     • GPU architecture and impact of WID variations on GPUs
     • Throughput improvement techniques
       • Allowing per-SM clocking (PSMC)
       • Disabling the slowest SMs (DSSM)
     • Impact of main memory latency and bandwidth on throughput improvement
     • Conclusion

  9. Per-SM clocking (PSMC)
     • Each SM executes independent thread blocks, and many SP-to-SP communications go through a shared memory within an SM → PSMC can be enabled efficiently for GPUs with a per-SM PLL.
     [Figure: thread blocks (BLOCK 1 ... BLOCK 28) dispatched from a queue to SM1-SM4. With per-chip clocking, all SMs run at the chip Fmax (rel. exec. time = 1); with PSMC, each SM runs at its own Fmax,SMi (rel. exec. time = 0.57); a scheduling sketch follows below.]
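
As a sanity check on the figure's numbers, here is a minimal sketch (illustrative, not the simulator used in the paper) that greedily dispatches 28 equal-work blocks to whichever SM frees up first; with the figure's example frequencies it reproduces the 0.57 relative execution time.

```python
# Sketch of the PSMC intuition from slide 9: 28 equal-work thread blocks
# are dispatched greedily to whichever SM frees up first. Frequencies are
# the illustrative values from the figure, not measured data.
import heapq

def exec_time(num_blocks, sm_freqs, block_work=1.0):
    """Finish time when each SM runs one block at a time at its own clock."""
    # min-heap of (time when SM becomes free, SM index)
    free_at = [(0.0, i) for i in range(len(sm_freqs))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_blocks):
        t, i = heapq.heappop(free_at)
        t += block_work / sm_freqs[i]  # faster SM -> shorter block time
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish

per_chip = exec_time(28, [1.0, 1.0, 1.0, 1.0])  # all SMs at the slowest clock
psmc     = exec_time(28, [2.5, 2.0, 1.5, 1.0])  # each SM at its own Fmax
print("relative exec. time:", round(psmc / per_chip, 2))  # ~0.57, as on the slide
```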

  10. GPGPU-Sim config. and benchmarks
      • GPGPU-Sim parameters (entry-level / mid-range / high-end):
        Number of cores (SMs):           16 / 32 / 64
        Memory channels:                 4 / 8 / 8
        Core (SM) frequency (GHz):       1.688 / 1.476 / 1.401
        Memory frequency (GHz):          1.100 / 1.242 / 1.848
        Interconnection frequency (GHz): 0.85 / 1.00 / 1.50
        Memory bandwidth (GB/s):         70.4 / 159 / 236.5
        Warp size:                       32
        SIMD pipeline width:             8
        Threads / core:                  1024
        CTAs / core:                     8
        Registers / core:                16384
        Shared memory / core:            16 KB
        Constant cache size / core:      8 KB
        Texture cache size / core:       8 KB
        Bandwidth / memory module:       4 bytes/cycle
        Memory controller:               FR-FCFS
        Branch divergence method:        immediate post-dominator
        Warp scheduling policy:          round robin
      • 12 CUDA benchmarks [3, 4]: AES encryption (AES), Black-Scholes (BLK), gpuDG (DG), 3D Laplace solver (LPS), ray tracing (RAY), StoreGPU (STO), breadth-first search (BFS), LIBOR Monte Carlo (LIB), MUMmerGPU (MUM), neural network (NN), image denoising (IMG), and sum of absolute differences (SAD)
      [3] A. Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. ISPASS, 2009.
      [4] "ERCBench: A Benchmark Suite for Embedded and Reconfigurable Computing," http://ercbench.ece.wisc.edu/index.php
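
The memory bandwidths in the table are consistent with channels × memory frequency × 16 bytes per memory-clock cycle; the 16 B/cycle effective channel width is inferred from the table's own numbers (e.g., the 4 B/cycle module width times data-rate multipliers), not stated on the slide. A quick check:

```python
# Consistency check of the table's memory bandwidths:
# bandwidth = channels * memory frequency * 16 bytes per memory-clock cycle.
# The 16 B/cycle effective channel width is inferred, not stated on the slide.
configs = {                      # class: (channels, mem freq GHz, stated GB/s)
    "entry-level": (4, 1.100, 70.4),
    "mid-range":   (8, 1.242, 159.0),
    "high-end":    (8, 1.848, 236.5),
}
for name, (channels, freq_ghz, stated) in configs.items():
    computed = channels * freq_ghz * 16  # GB/s
    print(f"{name:11s}: computed {computed:6.1f} GB/s vs. stated {stated} GB/s")
```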

  11. Per-SM clocking (PSMC)
      • Theoretical throughput improvement: $\text{Speedup} = \frac{\sum_{i=1}^{N} F_{\max,i}}{N \cdot F_{\max,\mathrm{slowest}}}$
      [Figure: relative throughput across the benchmarks for σ_sys = 6.4% and σ_sys = 3.2%]
      • 10%, 14%, and 16% higher throughput for entry-level, mid-range, and high-end GPUs on average.
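
A direct transcription of the formula, with illustrative per-SM frequencies (not measured values):

```python
# Theoretical PSMC speedup from slide 11: with per-chip clocking every SM
# runs at the slowest SM's Fmax; with PSMC each SM contributes its own Fmax.
# The example frequencies are illustrative, not measured values.
def psmc_speedup(fmax):
    """Speedup = sum_i Fmax_i / (N * Fmax_slowest)."""
    return sum(fmax) / (len(fmax) * min(fmax))

sm_fmax = [1.45, 1.52, 1.60, 1.48, 1.55, 1.41, 1.58, 1.50,
           1.47, 1.62, 1.44, 1.53, 1.56, 1.49, 1.59, 1.51]  # 16 SMs, GHz
print(f"theoretical speedup: {psmc_speedup(sm_fmax):.2f}x")  # ~1.08x here
```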

  12. Outline
     • Introduction
     • GPU architecture and impact of WID variations on GPUs
     • Throughput improvement techniques
       • Allowing per-SM clocking (PSMC)
       • Disabling the slowest SMs (DSSM)
     • Impact of main memory latency and bandwidth on throughput improvement
     • Conclusion

  13. Disabling the slowest SM (DSSM)
      • Problem size-bounded applications: the problem size is small relative to the number of SMs → throughput does not increase with more available SMs.
      • Disabling the slowest SM(s) → higher Fmax for the GPU (a rounds-model sketch follows below).
      [Figure: thread blocks (BLK 1 ... BLK 12) dispatched from a queue. With all four SMs enabled, the chip runs at Fmax = 1.0 (rel. exec. time = 1); with the slowest SM disabled, the remaining three SMs run at Fmax = 1.5 (rel. exec. time = 0.67).]
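
A minimal sketch of this trade-off using a simple rounds model: with per-chip clocking the chip runs at the slowest enabled SM's Fmax, and a kernel with B equal-work blocks takes ceil(B / #active SMs) rounds. The frequencies and block count below are illustrative, chosen so the model reproduces the figure's 0.67.

```python
# Sketch of the DSSM rounds model: a problem size-bounded kernel launches
# B equal-work blocks; with per-chip clocking the chip runs at the slowest
# *enabled* SM's Fmax, and execution takes ceil(B / #SMs) rounds.
# Fmax values and block count are illustrative, not measured data.
import math

def exec_time(num_blocks, sm_fmax_sorted, num_disabled):
    """Execution time after disabling the `num_disabled` slowest SMs."""
    active = sm_fmax_sorted[:len(sm_fmax_sorted) - num_disabled]
    chip_fmax = min(active)               # per-chip clock = slowest enabled SM
    rounds = math.ceil(num_blocks / len(active))
    return rounds / chip_fmax

sm_fmax = [2.5, 2.0, 1.5, 1.0]            # sorted fastest -> slowest
baseline = exec_time(6, sm_fmax, 0)       # all SMs, chip Fmax = 1.0
dssm     = exec_time(6, sm_fmax, 1)       # slowest SM off, chip Fmax = 1.5
print("relative exec. time:", round(dssm / baseline, 2))  # ~0.67
```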

  14. Disabling the slowest SM (DSSM)
      • The slowest SMs are disabled one by one.
      [Figure: relative throughput of the 12 applications for the entry-level and mid-range GPUs]
      • Problem size-bounded applications benefit from DSSM: disabling SMs does not change the number of execution rounds, while the GPU's Fmax increases with more disabled SMs.
      • Memory-bounded applications: fewer SMs request fewer concurrent memory accesses (higher rate), while a higher Fmax requests more memory accesses (lower rate).
      • Compute-bounded applications benefit more from additional SMs than from a higher Fmax (a sweep over the number of disabled SMs is sketched below).
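
Sweeping the number of disabled SMs with the same rounds model makes the trade-off visible: each disabled SM can raise the chip's Fmax but may also add execution rounds. The frequency spread below is illustrative.

```python
# Sweep of the DSSM rounds model (see the sketch above): each disabled SM
# may raise the chip Fmax but can also add execution rounds.
# The frequency spread and block count are illustrative.
import math

def throughput(num_blocks, sm_fmax_sorted, num_disabled):
    active = sm_fmax_sorted[:len(sm_fmax_sorted) - num_disabled]
    rounds = math.ceil(num_blocks / len(active))
    return min(active) / rounds           # blocks per unit time, up to a constant

sm_fmax = sorted([1.41 + 0.02 * i for i in range(16)], reverse=True)  # 16 SMs, GHz
for d in range(8):
    rel = throughput(24, sm_fmax, d) / throughput(24, sm_fmax, 0)
    print(f"disable {d} slowest SMs -> rel. throughput {rel:.2f}")
# Throughput rises while the round count stays flat, then drops once
# disabling another SM adds a round -- the trade-off described on slide 14.
```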

  15. Disabling the slowest SM (DSSM)
      • The slowest SMs are disabled one by one.
      [Figure: relative throughput of the 12 applications for the 32-SM and 64-SM GPUs]
      • If an appropriate number of the slowest SMs is disabled (i.e., 2 to 6 out of 32 SMs and 4 to 32 out of 64 SMs), certain applications gain 1%~7% and 4%~19% in throughput, respectively.

  16. Outline
     • Introduction
     • GPU architecture and impact of WID variations on GPUs
     • Throughput improvement techniques
       • Allowing per-SM clocking (PSMC)
       • Disabling the slowest SMs (DSSM)
     • Impact of main memory latency and bandwidth on throughput improvement
     • Conclusion

  17. Per-SM clocking (PSMC)
      • Emerging memory technology: 32% lower latency and 6 times higher bandwidth.
      • Relative throughput improvement of applications adopting the PSMC scheme: 15%, 18%, and 24% higher throughput than the baselines for entry-level, mid-range, and high-end GPUs on average.

  18. Disabling the slowest SM (DSSM)
      • The slowest SMs are disabled one by one.
      [Figure: relative throughput of the 12 applications for the entry-level and mid-range GPUs]
      • Problem size-bounded applications still benefit from DSSM.
      • Memory-bounded applications look more like compute-bounded ones.

  19. Disabling the slowest SM (DSSM)
      • The slowest SMs are disabled one by one.
      [Figure: relative throughput of the 12 applications for the high-end GPU]
      • If an appropriate number of the slowest SMs is disabled for the high-end GPU, certain applications gain 7%~20% in throughput.

  20. Conclusion
      • Two throughput improvement techniques exploit WID SM-to-SM frequency variations in GPUs: allowing each SM to operate at its own Fmax (PSMC) and disabling the slowest SMs (DSSM).
      • PSMC: 10%~16% throughput improvement of applications on average.
      • DSSM: up to 19% throughput improvement of applications.
      • Impact of main memory latency and bandwidth: with emerging memory technology (lower latency and higher bandwidth), throughput improves by up to 24% (PSMC) and 20% (DSSM).
