SLIDE 1

Manuel Ujaldón

Computer Architecture Department. University of Malaga.

CUDA Fellow

New hardware features in Kepler, SMX and Tesla K40

GPGPU2: Advanced Methods for Computing with CUDA

Cape Town, April 2014
SLIDE 2

“... and if software people want good machines, they must learn more about hardware to influence that way hardware designers ...”

David A. Patterson & John Hennessy

Computer Organization and Design. McGraw-Hill (1995)

Chapter 9, page 569

SLIDE 3

Talk outline [63 slides]

  • 1. Introducing the architecture [4 slides]
  • 2. The memory [3]
  • 3. The SMX cores [9]
  • 4. How the SMX works: Front-end and back-end [21]
  • 5. Functional enhancements [11]
      • 5.1. Dynamic parallelism [5]
      • 5.2. Hyper-Q [6]
  • 6. A look to the future [15]
      • 6.1. Vectorization: The warp size [7]
      • 6.2. Stacked-DRAM: 3D memory on top of the GPU [4]
      • 6.3. Analysis based on the roofline model [4]

SLIDE 4
  • 1. Introducing the architecture

SLIDE 5

The three pillars of Kepler

Performance. Programmability. Power consumption.
SLIDE 6

And its three basic innovations

SMX: A multiprocessor with more resources and less power.
Dynamic parallelism: The GPU is autonomous and can launch CUDA kernels.
Hyper-Q: Multiple kernels can share the SMXs.
SLIDE 7

SMX Balance of Resources: Summary of improvements versus Fermi

| Resource                          | Kepler GK110 vs. Fermi GF100 |
|-----------------------------------|------------------------------|
| Floating-point throughput         | 2-3x                         |
| Maximum number of blocks per SMX  | 2x                           |
| Maximum number of threads per SMX | 1.3x                         |
| Register file bandwidth           | 2x                           |
| Register file capacity            | 2x                           |
| Shared memory bandwidth           | 2x                           |
| Shared memory capacity            | 1x                           |
| L2 bandwidth                      | 2x                           |
| L2 cache capacity                 | 2x                           |
SLIDE 8

Commercial models available for Kepler: GeForce vs. Tesla

Designed for gamers (GeForce):
  • Price is a priority (<500€).
  • Availability and popularity.
  • Little video memory (1-2 GB).
  • Frequency slightly ahead.
  • Hyper-Q only for CUDA streams.
  • Perfect for developing code which can later run on a Tesla.

Oriented to HPC (Tesla):
  • Reliable (3-year warranty).
  • For cluster deployment.
  • More video memory (6-12 GB).
  • Tested for endless runs (24/7).
  • Hyper-Q for MPI.
  • GPUDirect (RDMA) and other features for GPU clusters.

(Pictured: GeForce GTX Titan.)
SLIDE 9
  • 2. Memory

SLIDE 10

The memory in Tesla cards: Fermi vs. Kepler

| Tesla card                          | M2075       | M2090       | K20        | K20X       | K40             |
|-------------------------------------|-------------|-------------|------------|------------|-----------------|
| 32-bit register file / multiproc.   | 32768       | 32768       | 65536      | 65536      | 65536           |
| L1 cache + shared memory size       | 64 KB       | 64 KB       | 64 KB      | 64 KB      | 64 KB           |
| Width of 32 shared memory banks     | 32 bits     | 32 bits     | 64 bits    | 64 bits    | 64 bits         |
| SRAM clock freq. (same as GPU)      | 575 MHz     | 650 MHz     | 706 MHz    | 732 MHz    | 745/810/875 MHz |
| L1 and shared memory bandwidth      | 73.6 GB/s   | 83.2 GB/s   | 180.7 GB/s | 187.3 GB/s | 216.2 GB/s      |
| L2 cache size                       | 768 KB      | 768 KB      | 1.25 MB    | 1.5 MB     | 1.5 MB          |
| L2 cache bandwidth (bytes/cycle)    | 384         | 384         | 1024       | 1024       | 1024            |
| L2 on atomic ops. (shared address)  | 1/9 per clk | 1/9 per clk | 1 per clk  | 1 per clk  | 1 per clk       |
| L2 on atomic ops. (indep. address)  | 24 per clk  | 24 per clk  | 64 per clk | 64 per clk | 64 per clk      |
| DRAM memory width                   | 384 bits    | 384 bits    | 320 bits   | 384 bits   | 384 bits        |
| DRAM memory clock (MHz)             | 2x 1500     | 2x 1850     | 2x 2600    | 2x 2600    | 2x 3000         |
| DRAM bandwidth (ECC off)            | 144 GB/s    | 177 GB/s    | 208 GB/s   | 250 GB/s   | 288 GB/s        |
| DRAM memory size (all GDDR5)        | 6 GB        | 6 GB        | 5 GB       | 6 GB       | 12 GB           |
| External bus to connect to CPU      | PCI-e 2.0   | PCI-e 2.0   | PCI-e 3.0  | PCI-e 3.0  | PCI-e 3.0       |
SLIDE 11

Differences in memory hierarchy: Fermi vs. Kepler

SLIDE 12

The memory hierarchy in numbers

All Fermi and Kepler models are endowed with:
  • ECC (Error Correction Code) in the video memory controller.
  • Address bus 64 bits wide.
  • Data bus 64 bits wide for each memory controller (a few models include 4 controllers for 256 bits; most have 6 controllers for 384 bits).

| Resource                          | Fermi GF100 (CCC 2.0) | Fermi GF104 (2.1) | Kepler GK104 (3.0) | Kepler GK110 (3.5) | Limitation | Impact       |
|-----------------------------------|-----------------------|-------------------|--------------------|--------------------|------------|--------------|
| Max. 32-bit registers / thread    | 63                    | 63                | 63                 | 255                | SW         | Working set  |
| 32-bit registers / Multiprocessor | 32 K                  | 32 K              | 64 K               | 64 K               | HW         | Working set  |
| Shared memory / Multiprocessor    | 16-48 KB              | 16-48 KB          | 16-32-48 KB        | 16-32-48 KB        | HW         | Tile size    |
| L1 cache / Multiprocessor         | 48-16 KB              | 48-16 KB          | 48-32-16 KB        | 48-32-16 KB        | HW         | Access speed |
| L2 cache / GPU                    | 768 KB                | 768 KB            | 768 KB             | 1536 KB            | HW         | Access speed |
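The 16-32-48 KB shared memory / L1 split shown in the table is chosen from the host at run time. Below is a minimal sketch (not taken from the slides; the kernel and sizes are hypothetical) of how that split can be requested through the CUDA runtime:

#include <cuda_runtime.h>

// Hypothetical kernel that prefers a large shared memory tile.
__global__ void tileKernel(float *data) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = 2.0f * tile[threadIdx.x];
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    // 48 KB shared / 16 KB L1 (available on both Fermi and Kepler):
    cudaFuncSetCacheConfig(tileKernel, cudaFuncCachePreferShared);
    // On Kepler (CCC 3.x) the additional 32 KB / 32 KB split is also available:
    // cudaFuncSetCacheConfig(tileKernel, cudaFuncCachePreferEqual);

    tileKernel<<<1, 256, 256 * sizeof(float)>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}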

SLIDE 13
  • 3. The SMX cores

SLIDE 14

A brief reminder of what CUDA is about

A CUDA program is organized as a hierarchy: threads are grouped into thread blocks, and blocks into grids (Grid 0, Grid 1, ...). Threads and blocks work on on-chip memory, while grids share the memory outside the GPU chip (but within the graphics card).
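As a minimal reminder in code (an illustrative sketch, not part of the slides), each thread locates its element from the grid, block and thread coordinates:

#include <cuda_runtime.h>

// One grid of blocks; each block holds 256 threads; each thread scales one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));            // memory outside the GPU chip, on the card
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  // grid of blocks, 256 threads per block
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}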

SLIDE 15

... and how the architecture scales up

| Hardware model         | G80     | GT200   | GF100 | GF104 | GK104 (K10) | GK110 (K20) | GK110 (K40) | GeForce GTX Titan Z |
|------------------------|---------|---------|-------|-------|-------------|-------------|-------------|---------------------|
| Architecture           | Tesla   | Tesla   | Fermi | Fermi | Kepler      | Kepler      | Kepler      | Kepler              |
| Time frame             | 2006-07 | 2008-09 | 2010  | 2011  | 2012        | 2013        | 2013-14     | 2014                |
| CUDA Compute Capability| 1.0     | 1.2     | 2.0   | 2.1   | 3.0         | 3.5         | 3.5         | 3.5                 |
| N (multiprocessors)    | 16      | 30      | 16    | 7     | 8           | 14          | 15          | 30                  |
| M (cores/multiproc.)   | 8       | 8       | 32    | 48    | 192         | 192         | 192         | 192                 |
| Number of cores (N·M)  | 128     | 240     | 512   | 336   | 1536        | 2688        | 2880        | 5760                |
SLIDE 16

Kepler in perspective: Hardware resources and peak performance

| Tesla card (commercial model)      | M2075             | M2090             | K20                | K20X               | K40                |
|------------------------------------|-------------------|-------------------|--------------------|--------------------|--------------------|
| Similar GeForce model in cores     | GTX 470           | GTX 580           | -                  | GTX Titan          | GTX Titan Z (x2)   |
| GPU generation (and CCC)           | Fermi GF100 (2.0) | Fermi GF100 (2.0) | Kepler GK110 (3.5) | Kepler GK110 (3.5) | Kepler GK110 (3.5) |
| Multiprocessors x (cores/multipr.) | 14 x 32           | 16 x 32           | 13 x 192           | 14 x 192           | 15 x 192           |
| Total number of cores              | 448               | 512               | 2496               | 2688               | 2880               |
| Type of multiprocessor             | SM                | SM                | SMX (with dynamic parallelism and Hyper-Q) | SMX | SMX         |
| Manufacturing process              | 40 nm             | 40 nm             | 28 nm              | 28 nm              | 28 nm              |
| GPU clock frequency (for graphics) | 575 MHz           | 650 MHz           | 706 MHz            | 732 MHz            | 745/810/875 MHz    |
| Core clock frequency (for GPGPU)   | 1150 MHz          | 1300 MHz          | 706 MHz            | 732 MHz            | 745/810/875 MHz    |
| Number of single precision cores   | 448               | 512               | 2496               | 2688               | 2880               |
| GFLOPS (peak single precision)     | 1030              | 1331              | 3520               | 3950               | 4290               |
| Number of double precision cores   | 224               | 256               | 832                | 896                | 960                |
| GFLOPS (peak double precision)     | 515               | 665               | 1170               | 1310               | 1680               |
SLIDE 17

The new GeForce GTX Titan Z

5760 cores (2x K40). Video memory: 12 Gbytes. Peak performance: 8 TeraFLOPS. Starting price: $2999.

SLIDE 18

GPU Boost

Allows the GPU clock to be sped up by up to 17% if the power required by an application is low. The base clock is restored if we exceed 235 W. We can set up a persistent mode which keeps the values permanently, or another one for a single run.

GPU Boost maximizes graphics clocks within the specified power envelope, trading power headroom for performance: base clock 745 MHz, boost clocks 810 MHz and 875 MHz (the highest boost clock).
SLIDE 19

Every application has a different behaviour regarding power consumption

Here we see the average power (watts) on a Tesla K20X for a set of popular applications within the HPC field:

The chart shows board power (watts), roughly between 40 and 160 W, for AMBER, ANSYS, Black-Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM and SPECFEM3D.
SLIDE 20

Those applications which are less power hungry can benefit from a higher clock rate

For the Tesla K40 case, 3 clocks are defined, 8.7% apart.

The base clock (745 MHz) is set for the worst-case reference workload at 235 W. Less demanding workloads (e.g. AMBER) can run at boosted clock #1 (810 MHz), and even lighter ones (e.g. ANSYS Fluent) at boosted clock #2 (875 MHz), always within the same 235 W envelope.

Up to 40% higher performance relative to Tesla K20X. And not only GFLOPS are improved, but also effective memory bandwidth.

SLIDE 21

GPU Boost compared to other approaches

A steady-state frequency is preferable, to avoid thermal stress and improve reliability.

Other vendors switch the GPU clock automatically at run time; the Tesla K40 offers deterministic clocks (base clock, boost clock #1, boost clock #2).

|                            | Other vendors           | Tesla K40                        |
|----------------------------|-------------------------|----------------------------------|
| Default                    | Boost                   | Base                             |
| Preset options             | Lock to base clock      | 3 levels: Base, Boost1 or Boost2 |
| Boost interface            | Control panel           | Shell command: nvidia-smi        |
| Target duration for boosts | Roughly 50% of run-time | 100% of workload run time        |
SLIDE 22

GPU Boost - List of commands

| Command                                    | Effect                                                                                                  |
|--------------------------------------------|---------------------------------------------------------------------------------------------------------|
| nvidia-smi -q -d SUPPORTED_CLOCKS          | View the clocks supported by our GPU                                                                    |
| nvidia-smi -ac <MEM clock, Graphics clock> | Set one of the supported clocks                                                                         |
| nvidia-smi -pm 1                           | Enable persistent mode: clock settings are preserved after restarting the system or driver              |
| nvidia-smi -pm 0                           | Enable non-persistent mode: clock settings revert to the base clocks after restarting the system or driver |
| nvidia-smi -q -d CLOCK                     | Query the clock in use                                                                                  |
| nvidia-smi -rac                            | Reset clocks back to the base clock                                                                     |
| nvidia-smi -acp 0                          | Allow non-root users to change clock rates                                                              |
SLIDE 23

Example: Query the clock in use

nvidia-smi -q -d CLOCK --id=0000:86:00.0
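The same query can also be made programmatically through NVML, the library behind nvidia-smi. A minimal sketch (not part of the slides; error checking omitted, assumes the NVML header and library shipped with the driver):

#include <stdio.h>
#include <nvml.h>   // link with -lnvidia-ml

int main() {
    nvmlDevice_t dev;
    unsigned int sm_clock, mem_clock;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);                      // first GPU in the system
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock);    // current SM clock (MHz)
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clock);  // current memory clock (MHz)
    printf("SM clock: %u MHz, memory clock: %u MHz\n", sm_clock, mem_clock);
    nvmlShutdown();
    return 0;
}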

SLIDE 24
  • 4. How the SMX works: Front-end and back-end
SLIDE 25

Kepler GK110: Physical layout of functional units for the Tesla K40 (endowed with 15 SMX)

SLIDE 26

The SMX multiprocessor

Front-end: Instruction scheduling and issuing in warps.

Back-end: Instruction execution. 512 functional units:
  • 192 for ALUs.
  • 192 for FPUs S.P.
  • 64 for FPUs D.P.
  • 32 for load/store.
  • 32 for SFUs (log, sqrt, ...).

Interface: Memory access.
SLIDE 27

From SM multiprocessor in Fermi GF100 to SMX multiprocessor in Kepler GK110

SLIDE 28

A comparison between instructions issue and execution (front-end vs. back-end)

In Kepler, each SMX can issue 8 warp-instructions per cycle, but due to resource and dependency limitations:
  • 7 is the sustainable peak.
  • 4-5 is a good amount for instruction-limited codes.
  • <4 in memory- or latency-bound codes.

Fermi (GF100):
  • Fetch & issue (front-end): Can issue 2 warps, 1 instruction each; up to 2 warp-instructions per cycle. Active warps: 48 on each SM, chosen from up to 8 blocks. In the GTX 580: 16 x 48 = 768 active warps.
  • Execution (back-end): 32 cores [1 warp] for "int" and "float", 16 cores for "double" [1/2 warp], 16 load/store units [1/2 warp], 4 special function units [1/8 warp]. A total of up to 5 concurrent warps.

Kepler (GK110):
  • Fetch & issue (front-end): Can issue 4 warps, 2 instructions each; up to 8 warp-instructions per cycle. Active warps: 64 on each SMX, chosen from up to 16 blocks. In the K40: 15 x 64 = 960 active warps.
  • Execution (back-end): 192 cores [6 warps] for "int" and "float", 64 cores for "double" [2 warps], 32 load/store units [1 warp], 32 special function units [1 warp]. A total of up to 16 concurrent warps.
SLIDE 29

The way GigaThread scheduling works

Each grid provides a number of blocks, which are assigned to SMXs (up to 16 blocks per SMX in Kepler, 8 in Fermi). Blocks are split into warps (groups) of 32 threads. Warps are issued for each instruction in the kernel threads (up to 64 active warps in Kepler, 48 in Fermi).
SLIDE 30

Increasing concurrency and massive parallelism

| Resource                                | Fermi GF100 (2.0) | Fermi GF104 (2.1) | Kepler GK104 (3.0) | Kepler GK110 (3.5) |
|-----------------------------------------|-------------------|-------------------|--------------------|--------------------|
| Number of threads / warp (warp size)    | 32                | 32                | 32                 | 32                 |
| Max. number of warps / Multiprocessor   | 48                | 48                | 64                 | 64                 |
| Max. number of blocks / Multiprocessor  | 8                 | 8                 | 16                 | 16                 |
| Max. number of threads / Block          | 1024              | 1024              | 1024               | 1024               |
| Max. number of threads / Multiprocessor | 1536              | 1536              | 2048               | 2048               |

These increases are crucial enhancements for hiding latencies: they raise the maximum concurrency on each SMX.
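These limits can also be read from the device at run time with cudaGetDeviceProperties. A minimal sketch (not part of the slides):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("CCC:                       %d.%d\n", prop.major, prop.minor);
    printf("Warp size:                 %d\n", prop.warpSize);
    printf("Max threads / block:       %d\n", prop.maxThreadsPerBlock);
    printf("Max threads / multiproc.:  %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Multiprocessors:           %d\n", prop.multiProcessorCount);
    // The maximum number of warps per multiprocessor follows from the two values above:
    printf("Max warps / multiproc.:    %d\n", prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}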

SLIDE 31

Express as much parallelism as possible: SMXs (Kepler) are wider than SMs (Fermi)

Example: Kernel with blocks of 384 threads (12 warps).

The "Tetris" model (tile = warp_instr.): each tile is a warp-instruction, colour-coded by the functional unit it uses ("int", "float", "double", "load/store", "log/sqrt..."). The player is the GPU scheduler, which can rotate moving pieces if there are no data dependencies.

  • G80: 16 functional units; takes 4 cycles to execute each warp-instruction (e.g. sub, fmadd, fdiv64, load, sqrt).
  • SM in Fermi: 100 functional units; issues 2 warp-instructions and executes up to 5 per cycle.
  • SMX in Kepler: 512 functional units (6x32 = 192 ALUs, 192 SP FPUs, 64 DP FPUs, 32 LD/ST, 32 SFUs); issues 4 warps x 2 instructions and executes up to 16 warp-instructions per cycle (up to 512 functional units in parallel).

For the example kernel (blocks of 384 threads = 12 warps), the SMX issues 4 warp-instructions and executes up to 10 warps = 320 threads; warp-instructions are symmetric and all executed in one cycle.

SLIDE 32

Thread Level Parallelism (TLP) and Instruction Level Parallelism (ILP)

  • Increase parallelism vertically via ILP: use more independent instructions.
  • Increase parallelism horizontally via TLP: more concurrent warps (larger blocks and/or more active blocks per SMX).

SMXs can leverage available ILP interchangeably with TLP, and Kepler is much better at this than Fermi.

Sometimes it is easier to increase ILP than TLP (for example, with a small loop unrolling, as in the sketch below):
  • The number of threads may be limited by the algorithm or by hardware limits.
  • We need ILP to attain a high IPC (Instructions Per Cycle).
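A minimal sketch (not from the slides) of that kind of loop unrolling: each thread processes four independent elements, so the scheduler can find several independent instructions within the same warp:

// Baseline: one element per thread (pure TLP).
__global__ void saxpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

// Unrolled by 4: each thread issues 4 independent multiply-adds (TLP + ILP).
// Tail elements (when n is not a multiple of 4) are ignored here for brevity.
__global__ void saxpy_ilp4(float *y, const float *x, float a, int n) {
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 3 < n) {
        y[i]     += a * x[i];       // the four updates have no dependencies
        y[i + 1] += a * x[i + 1];   // among them, so they can be kept
        y[i + 2] += a * x[i + 2];   // in flight at the same time,
        y[i + 3] += a * x[i + 3];   // raising the IPC of the warp
    }
}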

SLIDE 33

Kepler GPUs can hold together all forms of parallelism: 1: thread-level parallelism (TLP), 2: instruction-level parallelism (ILP), 3: data parallelism (SIMD), and 4: vectorial parallelism (warp = 32). Example: K40.

Imagine a 3D Tetris with 15 boxes (one per SMX) and up to 64 pieces falling down simultaneously on each of them, because that is the way the K40 works when all parallelism is deployed. The K40 can schedule up to 64 x 15 warps in a single cycle: 30720 threads in 1.14 ns.

SLIDE 34

A quick introduction to our hands-on

Our code traverses a whole sparse matrix, performing operations independently on each element and deploying all four forms of parallelism: TLP, ILP, data (SIMD) and vectorial (warp = 32).

Base strategy:
  • We launch a CUDA kernel for each matrix column.
  • Each kernel will have the lowest number of blocks.
  • Each kernel will have the largest number of warps.

SLIDE 35

A quick introduction to our hands-on (2)

Pseudocode (the data type may be int, float or double, stressing different subsets of the SMX's 512 parallel functional units: 6x32 = 192 ALUs, 192 SP FPUs, 64 DP FPUs, 32 LD/ST, 32 SFUs):

    values[numelements];
    for all elements assigned to each thread:
        for numops to be done on each element:
            values[i] *= values[i];

Changing the operator to a lighter one (addition) or a heavier one (division) will also have an impact, depending on the latency required to carry out that operation.
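A possible CUDA rendering of that pseudocode (an illustrative sketch, not the actual hands-on source; the names and the grid-stride traversal are assumptions):

// TYPE may be int, float or double, to stress different functional units.
template <typename TYPE>
__global__ void stress(TYPE *values, int numelements, int numops) {
    // Elements assigned to this thread (grid-stride loop):
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numelements;
         i += gridDim.x * blockDim.x) {
        TYPE v = values[i];
        for (int op = 0; op < numops; op++)   // numops operations per element
            v *= v;
        values[i] = v;
    }
}
// Replacing "v *= v" with an addition or a division changes the latency exercised.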

SLIDE 36

Case study: Zernike moments

Fermi is more balanced in this case. With the resource distribution in Kepler, the execution of integer arithmetic improves, but floating-point arithmetic and load/store worsen. All the other units are not used.

| GPU resource | Fermi | Kepler | Kernel for Zernike | Better |
|--------------|-------|--------|--------------------|--------|
| ALU          | 32%   | 37.5%  | 54%                | Kepler |
| 32-bit FPU   | 32%   | 37.5%  | 21%                | Fermi  |
| 64-bit FPU   | 16%   | 12.5%  | 0%                 | Kepler |
| Load/store   | 16%   | 6.25%  | 25%                | Fermi  |
| SFU          | 4%    | 6.25%  | 0%                 | Fermi  |
SLIDE 37

Use the CUDA Visual Profiler to know how well your application adapts to the available resources.
SLIDE 38

The way the GPU front-end works: (1) How warps are scheduled

SM (Fermi) vs. SMX (Kepler)
SLIDE 39

The interface between front-end & back-end: (2) How warps are issued

SM (Fermi) vs. SMX (Kepler)

In the 5 cycles shown, we could have executed all this work.

In Fermi, there is a deficit in SFUs (blue), whereas in Kepler, the deficit moves to the load/store units (green). Kepler balances double precision (red) and has a good surplus in "int" and "float" computations, evidence that real codes have more presence of orange and, above all, yellow instructions.
SLIDE 40

The way the GPU back-end works: (3) Warps execution

SM (Fermi) vs. SMX (Kepler)

Let us assume that when we start the execution there are few warps pending to be executed:

Two single precision warps (orange). Two double precision warps (red).

It looks smart for the front-end to work ahead of the back-end (prefetching) in order to maximize throughput.
SLIDE 41

Some remarks about the "tetris" model

  • In Fermi, red tiles are not allowed to be combined with others.
  • In Kepler, we cannot take 8 warp_instrs. horizontally; bricks must have a minimum height of 2.
  • Instructions have different latencies, so those consuming more than one cycle (e.g. double precision floating-point) should expand vertically.
  • If a warp suffers from divergences, it will consume two cycles, not one. We can extend it vertically as in the previous case.
  • Real codes have a majority of yellow tiles ("int" predominates).
  • Some bricks are incomplete, because the warp scheduler cannot find a 4x2 structure free of dependencies.
  • Bricks can assemble tiles which are not contiguous.
SLIDE 42

Warp latency

Even if every tile were executed in one cycle, a warp's duration would not be just that. The time elapsed by a warp within the GPU is the sum of three components:
  • Scheduling time.
  • Issuing time.
  • Execution time.

Scheduling and execution are quite regular, but issuing is not: it depends on the tiles piled up at the bottom of the bucket (reservation stations). That is what explains the variance in warp duration.

SLIDE 43

The warps' behaviour teaches us that the GPU is not a regular processor at all

Unpredictable factors at run-time pose a challenge for the workload balance among multiprocessors. Here is an example of the variance for the last 8 warps executed on each multiprocessor of a G80 GPU:

SLIDE 44
  • 5. Functional improvements

SLIDE 45

5.1. Dynamic parallelism

SLIDE 46

What is dynamic parallelism?

The ability to launch new grids from the GPU:
  • Dynamically: Based on run-time data.
  • Simultaneously: From multiple threads at once.
  • Independently: Each thread can launch a different grid.

Fermi: Only the CPU can generate GPU work. Kepler: The GPU can generate work for itself.

SLIDE 47

The way we did things in the pre-Kepler era: The GPU was a slave for the CPU

High data bandwidth for communications:

  • External: More than 10 GB/s (PCI-express 3).
  • Internal: More than 100 GB/s (GDDR5 video memory on a 384-bit bus, which is like a six-channel CPU architecture).

The figure shows the CPU timeline (Init, Alloc, Function and Lib calls) driving Operations 1-3 on the GPU.
SLIDE 48

The way we do things in Kepler: GPUs launch their own kernels

The pre-Kepler GPU is a co-processor; the Kepler GPU is autonomous. Now programs run faster and dynamic parallelism is expressed in a more natural way.
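As an illustration (a minimal sketch, not from the slides), a kernel running on the GPU launches another kernel with no CPU intervention; this requires CCC 3.5 and compiling with -rdc=true plus linking against cudadevrt:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void child(int parent_thread) {
    printf("child thread %d launched by parent thread %d\n", threadIdx.x, parent_thread);
}

__global__ void parent() {
    // Each parent thread decides at run time whether to generate more work.
    if (threadIdx.x % 2 == 0)
        child<<<1, 4>>>(threadIdx.x);    // kernel launched from the GPU itself
}

int main() {
    parent<<<1, 8>>>();
    cudaDeviceSynchronize();             // waits for the parent and all its children
    return 0;
}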

SLIDE 49

Example 1: Dynamic work generation

Assign resources dynamically according to real-time demand, making it easier to compute irregular problems on the GPU and broadening the application scope where it can be useful.

  • Coarse grid: Higher performance, lower accuracy.
  • Fine grid: Lower performance, higher accuracy.
  • Dynamic grid: Target performance where accuracy is required.

SLIDE 50

Example 2: Deploying parallelism based on level of detail

CUDA until 2012:
  • The CPU launches kernels regularly.
  • All pixels are treated the same.

CUDA on Kepler:
  • The GPU launches a different number of kernels/blocks for each computational region.
  • Computational power is allocated to the regions of interest.

SLIDE 51

5.2. Hyper-Q

SLIDE 52

Hyper-Q

In Fermi, several CPU processes can send thread blocks to the same GPU, but a kernel cannot start its execution until the previous one has finished. In Kepler, we can execute simultaneously up to 32 kernels launched from different:
  • MPI processes,
  • CPU threads (POSIX threads), or
  • CUDA streams.

This increases the percentage of temporal occupancy on the GPU: Fermi handles 1 MPI task at a time, while Kepler handles up to 32 simultaneous MPI tasks.

SLIDE 53

An example: 3 streams, each composed of 3 kernels

__global__ void kernel_A(pars) {body}   // and the same for kernel_B ... kernel_Z

cudaStream_t stream_1, stream_2, stream_3;
...
cudaStreamCreateWithFlags(&stream_1, ...);
cudaStreamCreateWithFlags(&stream_2, ...);
cudaStreamCreateWithFlags(&stream_3, ...);
...
kernel_A <<< dimgridA, dimblockA, 0, stream_1 >>> (pars);
kernel_B <<< dimgridB, dimblockB, 0, stream_1 >>> (pars);
kernel_C <<< dimgridC, dimblockC, 0, stream_1 >>> (pars);
...
kernel_P <<< dimgridP, dimblockP, 0, stream_2 >>> (pars);
kernel_Q <<< dimgridQ, dimblockQ, 0, stream_2 >>> (pars);
kernel_R <<< dimgridR, dimblockR, 0, stream_2 >>> (pars);
...
kernel_X <<< dimgridX, dimblockX, 0, stream_3 >>> (pars);
kernel_Y <<< dimgridY, dimblockY, 0, stream_3 >>> (pars);
kernel_Z <<< dimgridZ, dimblockZ, 0, stream_3 >>> (pars);

stream_1 runs kernel_A, kernel_B, kernel_C; stream_2 runs kernel_P, kernel_Q, kernel_R; stream_3 runs kernel_X, kernel_Y, kernel_Z.

SLIDE 54

Grid management unit: Fermi vs. Kepler

  • Fermi: A single Stream Queue (ordered queues of grids: Kernel C-B-A, Z-Y-X, R-Q-P for Streams 1-3) feeds the Work Distributor, which tracks blocks issued from grids (16 active grids) and dispatches them to the SMs. The single hardware queue multiplexes the streams.
  • Kepler GK110: The Stream Queue feeds a Grid Management Unit that holds pending and suspended grids (1000s of pending grids), which in turn feeds the Work Distributor, now actively dispatching grids (32 active grids) to the SMXs. It offers parallel hardware streams, allows suspending of grids, and accepts CUDA-generated work.

SLIDE 55

The relation between software and hardware queues

Fermi: The three software streams (Stream 1: A--B--C, Stream 2: P--Q--R, Stream 3: X--Y--Z) multiplex into a single hardware queue, so the chances for overlapping are only at stream edges, even though up to 16 grids can run at once on GPU hardware.

SLIDE 56

The relation between software and hardware queues

Fermi: CUDA streams multiplex into a single hardware queue; the chances for overlapping are only at stream edges, with up to 16 grids able to run at once on GPU hardware.

Kepler: Each stream (A--B--C, P--Q--R, X--Y--Z) gets its own hardware queue, providing concurrency at full-stream level with no inter-stream dependencies, and up to 32 grids can run at once on GPU hardware.

SLIDE 57

CPU processes (A to F) mapped on the GPU:

  • Without Hyper-Q (multiprocess by temporal division): each process occupies the GPU in turn, leaving a large fraction of it idle (% GPU utilization well below 100).
  • With Hyper-Q (simultaneous multiprocess): the processes share the GPU concurrently, utilization approaches 100% and a significant amount of time is saved.

SLIDE 58
  • 6. A look-ahead to next generations

SLIDE 59

Overview of CUDA hardware generations

The chart plots GFLOPS in double precision per watt consumed (from 2 up to 24) against time (2008-2016) for each generation: Tesla (CUDA), Fermi (FP64), Kepler (dynamic parallelism), Maxwell (DX12), and Pascal (unified memory, 3D memory, NVLink).

SLIDE 60

6.1. The warp size

SLIDE 61

The way each multiprocessor swallows SIMD instructions

The figure compares how the control unit (CU) of a Fermi SM and a Kepler SMX fetches Instr. 1, Instr. 2, Instr. 3 for the warps of a block, 32 threads at a time.

SLIDE 62

A hypothetical GPU front-end with the warp size increased to 64

The figure shows the front-end's warp schedulers, each paired with two dispatch units, now feeding 64-thread warps.
SLIDE 63

The way each multiprocessor would swallow SIMD instructions using a warp size of 64

With 64-thread warps, the control unit (CU) would fetch half as many instructions (Instr. 1, Instr. 2) for the same amount of work:
  • The cost of the control unit is halved.
  • The penalty due to data dependencies is potentially lower, and the hardware is simpler.
  • The penalty due to control dependencies is higher.

SLIDE 64

The GPU back-end: Transforming the SMX for a warp size of 64

| Functional unit | # units | Warps (warp size = 32) | Warps (warp size = 64) |
|-----------------|---------|------------------------|------------------------|
| int/fp32        | 192     | 6                      | 3                      |
| fp64            | 64      | 2                      | 1                      |
| load/store      | 32      | 1                      | 1/2                    |
| SFU             | 32      | 1                      | 1/2                    |

The deficit lies in load/store and SFUs, but they were facing a tougher constraint during the Fermi generation, and they were able to recover from that.

SLIDE 65

Other facts promoting the warp size to 64

  • Shared memory: Concurrency is attained through banks, and they were already increased from 16 (pre-Fermi) to 32.
  • Device memory: Higher data bandwidth is required, but that is not the problem in the DDR saga (latency is).
  • Branching: Techniques minimizing penalties on divergent branches are more mature and ready to face the challenge.
  • Scalability in the number of cores: Simplicity in the control unit would allow cores of every kind to be increased. Nvidia is anticipating this move with a warning.
  • Other vendors are moving in the same direction: Graphics Core Next (GCN) from AMD is a 4 x 16-wide vector SIMD.

SLIDE 66

To benefit from this technological change

Make blocks bigger:
  • Fewer than 64 threads per block would be forbidden.
  • 256 would now be the minimum required; 384 gains momentum.

Pay more attention to warp divergences. The change is advantageous for regular computations, while the sophistication of the hardware scheduler (Hyper-Q, dynamic parallelism) lifts irregular applications.

SLIDE 67

How Kepler would look with a warp size of 64

Kepler32 (the SMX as it is, with 512 parallel functional units: 6x32 = 192 ALUs, 192 SP FPUs, 64 DP FPUs, 32 LD/ST, 32 SFUs):
  • Issues 4 warps x 2 instructions.
  • Executes up to 16 warp_instrs.

Kepler64:
  • Issues 4 warps.
  • Executes up to 8 warp_instrs.

If we take for granted that Nvidia tends to "complete" the functional units up to whole warps in the next generation, green and blue would grow, and the resemblance of Kepler64 to the Tetris video game would be striking.

SLIDE 68

6.2. Stacked (3D) DRAM

SLIDE 69

A 2013 graphics card: Kepler GPU with GDDR5 video memory

SLIDE 70

A 2017 graphics card: Pascal GPU with Stacked DRAM

SLIDE 71

Details on silicon integration

DRAM cells are organized in vaults, which borrow the interleaved memory arrays from already existing DRAM chips. A logic controller is placed at the base of the DRAM layers, with the data matrices on top. The assembly is connected with through-silicon vias (TSVs), which traverse the stack vertically using pitches between 4 and 50 um.

For a pitch of 10 um., a 1024-bit bus (16 memory channels) requires a die size of 0.32 mm2, which barely represents 0.2% of a CPU die (160 mm2). Vertical latency to traverse the height of a Stacked DRAM endowed with 20 layers is only 12 picosecs.

The final step is advanced package assembly of vaults, layers and TSVs. This prevents parasitic capacitances which reduce signal speed and increase power required to switch.

SLIDE 72

A bandwidth comparison with existing technologies

On a CPU system (PC with a 4-channel motherboard, 256 bits):

  • [2013] DDR3 @ 4 GHz (2x 2000 MHz): 128 Gbytes/s.
  • [2014] A CPU with HMC 1.0 (first generation): 320 Gbytes/s. in each direction.
  • [2015] A CPU with HMC 2.0 (second generation): 448 Gbytes/s.

On a GPU system (384-bits wide graphics card):

  • [2013] A GPU with GDDR5 @ 7 GHz (2x 3500 MHz): 336 Gbytes/s.
  • [2014] A GPU with 12 chips of 32 bits manufactured using near-memory HMC 1.0 would reach 480 Gbytes/s. (6 channels of HMC 1.0 @ 80 GB/s. each).
  • [2015] A GPU using HMC 2.0 (112 GB/s.) would reach 672 Gbytes/s., which doubles the bandwidth of the most advanced GDDR technology in 2013.
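As a quick check of the figures above (simple arithmetic, not in the original slide):

$$ \frac{384\ \text{bits}}{8} \times 2 \times 3500\ \text{MHz} = 48\ \text{bytes} \times 7\ \text{GT/s} = 336\ \text{GB/s (GDDR5)} $$
$$ 6 \times 80\ \text{GB/s} = 480\ \text{GB/s (HMC 1.0)}, \qquad 6 \times 112\ \text{GB/s} = 672\ \text{GB/s (HMC 2.0)} $$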


(*) Taking the bandwidth estimations given for HMC 1.0 and 2.0 (20 and 28 GB/s. respectively on each 16-bit link for each direction). Nvidia already confirmed at GTC'13 data bandwidths around 1 TB/s. for its Pascal GPU.

SLIDE 73

6.3. Analysis based on the roofline model

SLIDE 74

Impact on GPUs: Analysis based on the roofline model

The roofline chart plots GFLOP/s (double precision performance) against FLOP/byte (operational intensity) on a log/log scale for the GPU.

SLIDE 75

Platforms to compare

The roofline axes (GFLOP/s in double precision vs. FLOP/byte operational intensity, log/log scale) are populated with the platforms below. The highlighted point is the Tesla K20X: 1310 GFLOPS in double precision, fed by 2x 2600 MHz GDDR5 @ 384 bits (ECC off).

| Vendor | Microarchitecture | Model                    | GB/s | GFLOP/s                | Byte/FLOP     |
|--------|-------------------|--------------------------|------|------------------------|---------------|
| AMD    | Bulldozer         | Opteron 6284             | 59.7 | 217.6 (DP)             | 0.235         |
| AMD    | Southern Islands  | Radeon HD7970            | 288  | 1010 (DP)              | 0.285         |
| Intel  | Sandy Bridge      | Xeon E5-2690             | 51.2 | 243.2 (DP)             | 0.211         |
| Intel  | MIC               | Xeon Phi                 | 300  | 1024 (DP)              | 0.292         |
| Nvidia | Fermi GF110       | Tesla M2090 (16 SMs)     | 177  | 665 (DP) / 1331 (SP)   | 0.266 / 0.133 |
| Nvidia | Kepler GK110      | Tesla K20X (14 SMXs)     | 250  | 1310 (DP) / 3950 (SP)  | 0.190 / 0.063 |
| Nvidia | Pascal            | GPU with Stacked 3D DRAM | 1024 | 4000 (DP) / 12000 (SP) | 0.256 / 0.085 |

FLOP/byte (operational intensity) = GFLOP/s / GB/s
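In roofline terms, attainable performance is the minimum of the compute roof and the memory roof; with the K20X numbers above, the ridge point separating memory-bound from compute-bound kernels works out as follows (a worked example, not on the slide):

$$ P_{\text{attainable}}(OI) = \min\bigl(P_{\text{peak}},\ BW \times OI\bigr) $$
$$ OI_{\text{ridge}}^{\text{K20X, DP}} = \frac{1310\ \text{GFLOP/s}}{250\ \text{GB/s}} \approx 5.2\ \text{FLOP/byte} $$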

SLIDE 76

The Roofline model: Hardware vs. Software

The chart overlays the rooflines of the seven platforms (Opteron, Xeon, Radeon, Xeon Phi, Fermi, Kepler and a Pascal GPU with Stacked DRAM at 1 TB/s.) with the operational intensity of representative kernels: SpMxV, Stencil and 3D FFT fall among the memory-bound kernels, MxM (DGEMM in BLAS) among the compute-bound kernels, with a balance zone in between. The GB/s., GFLOP/s. and Byte/FLOP values are the same as in the table on the previous slide.

The chart places the Xeon Phi 225 as 30% slower than the K20X on DGEMM, but our experimental runs say that the K20X is 50% faster in double precision and 70% faster in single precision.

SLIDE 77

The Roofline model: Software evolution. Case study: FMM (Fast Multipole Method)

The chart shows the Kepler and Pascal rooflines together with the operational intensity of the FMM kernels (M2L Cartesian, M2L Spherical, P2P) and of a Stencil kernel for reference.

SLIDE 78

Concluding remarks

  • Kepler represents the architectural design for 2013-2014, ready to host thousands of cores on a single die.
  • It deploys all types of parallelism: task (threads), instruction (pipelines), data (SIMD) and vectorial (warps).
  • It improves power consumption and programmability, enhancing CUDA for irregular and dynamic applications.
  • The GPU is more autonomous, but at the same time allows more interaction with the CPU.
  • The memory hierarchy improves significantly, as does the connection among GPUs. The SMX-DRAM interconnect will be crucial in future designs.

SLIDE 79

Thanks for coming!

You can always reach me in Spain at the Computer Architecture Department of the University of Malaga:
  • e-mail: ujaldon@uma.es
  • Phone: +34 952 13 28 24
  • Web page: http://manuel.ujaldon.es (English/Spanish versions available)

Or, more specifically on GPUs, visit my web page as Nvidia CUDA Fellow:

http://research.nvidia.com/users/manuel-ujaldon
