27th January 2016
REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)
François Courteille |Senior Solutions Architect, NVIDIA |fcourteille@nvidia.com
ENTERPRISE | AUTO | GAMING | DATA CENTER | PRO VISUALIZATION
Productive Programming Model & Tools | Expert Co-Design | Accessibility
APPLICATION | MIDDLEWARE | SYS SW | LARGE SYSTEMS | PROCESSOR
Fast GPU Engineered for High Throughput
[Chart: Peak double-precision TFLOPS, 2008-2014: NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU]
Fast GPU + Strong CPU
[Chart: Peak Double Precision FLOPS (GFLOPS), 2008-2014: NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: Peak Memory Bandwidth (GB/s), 2008-2014: NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: GPU architecture roadmap, SGEMM/W per year, 2008-2018: Tesla, Fermi, Kepler, Maxwell, Pascal]
Pascal: Mixed Precision | 3D Memory | NVLink
[Diagram: SM block diagram: instruction cache, warp schedulers, register file, SP/DP cores, SFU and LD/ST units, shared memory / L1 cache, on-chip network]
[Diagram: Maxwell SMM block diagram: instruction cache, per-partition instruction buffer and warp scheduler, register file, 32 SP CUDA cores per partition, SFU and LD/ST units, Tex/L1 cache, shared memory]
[Chart: Bandwidth/SM (GiB/s) vs elements per thread (1-128) for Fermi M2070, Kepler K20X, Maxwell GTX 750 Ti; labeled "5.5x faster"]
Tesla accelerator roadmap, 2015-2017 (*for end-customer deployments; status: POR / in definition):
- KEPLER K40: 1.43 TF DP / 4.3 TF SP peak; 3.3 TF SGEMM / 1.22 TF DGEMM; 12 GB, 288 GB/s; 235 W; PCIe active / PCIe passive
- KEPLER K80: 2x GPU; 2.9 TF DP / 8.7 TF SP peak; 4.4 TF SGEMM / 1.59 TF DGEMM; 24 GB, ~480 GB/s; 300 W; PCIe passive
- MAXWELL M60 (GRID enabled): 2x GPU; 7.4 TF SP peak, ~6 TF SGEMM; 16 GB, 320 GB/s; 300 W; PCIe active / PCIe passive
- MAXWELL M6 (GRID enabled): 1x GPU; TBD TF SP peak; 8 GB, 160 GB/s; 75-100 W; MXM
- MAXWELL M40: 1x GPU; 7 TF SP peak (boost clock); 12 GB, 288 GB/s; 250 W; PCIe passive
- MAXWELL M4: 1x GPU; 2.2 TF SP peak; 4 GB, 88 GB/s; 50-75 W; PCIe low profile
17
Tesla platform line-up (Software | System Tools & Services | Accelerators):
- HPC: Accelerated Computing Toolkit | Enterprise Services, Data Center GPU Manager, Mesos, Docker | Tesla K80
- Enterprise Virtualization: GRID 2.0 | Tesla M60, M6
- DL Training / Hyperscale Web Services: Hyperscale Suite | Tesla M40, Tesla M4
18
[Diagram: 2014: Kepler GPU attached to x86, ARM64, or POWER CPU over PCIe. 2016: Pascal GPU attached to x86, ARM64, or POWER CPU over PCIe, and to POWER CPU over NVLink]
19
Traditional developer view: separate system memory and GPU memory.
Developer view with Unified Memory: a single unified memory space.
Developer view with Pascal & NVLink: unified memory over NVLink; share data structures at CPU memory speeds, not PCIe speeds, and oversubscribe GPU memory.
- U.S. Dept. of Energy: pre-exascale supercomputers for science
- NOAA: new supercomputer for next-gen weather forecasting
- IBM Watson: breakthrough natural language processing for cognitive computing
SUMMIT | SIERRA
- 100-300 PFLOPS peak
- 10x in scientific application performance
- IBM POWER9 CPU + NVIDIA Volta GPU
- NVLink high-speed interconnect
- 40 TFLOPS per node, >3,400 nodes
- 2017
Top500: # of Accelerated Supercomputers
[Chart: accelerated systems on the Top500 list, 2013-2015]
- 100+ accelerated systems now on the Top500 list
- 1/3 of total FLOPS powered by accelerators
- NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
- Tesla supercomputers growing at 50% CAGR
Tesla platform software solutions:
- GPU Accelerators: GPU Boost ...
- Interconnect: GPUDirect, NVLink ...
- System Management: NVML ...
- Compiler Solutions: LLVM ...
- Profile and Debug: CUDA Debugging API ...
- Libraries: cuBLAS ...
(Categories: Development Tools, Programming Languages, Infrastructure Management, Communication, System Solutions)
"Accelerators Will Be Installed in More than Half of New Systems"
Source: Top 6 predictions for HPC in 2015
"In 2014, NVIDIA enjoyed a dominant market share with 85%"
www.nvidia.com/appscatalog
Intersect360 survey of top HPC applications (Intersect360, Nov 2015, "HPC Application Support for GPU Computing"):
- Top 10 HPC apps: 90% accelerated
- Top 50 HPC apps: 70% accelerated
Top 25 apps in survey: GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, LAMMPS, NWChem, LS-DYNA, Schrodinger, Gaussian, GAMESS, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST
(Legend: all popular functions accelerated | some popular functions accelerated | in development | not supported)
Hyperscale tools: Deep Learning Toolkit | GPU REST Engine | GPU-Accelerated FFmpeg | Image Compute Engine
TESLA M40 (POWERFUL): fastest deep learning performance
TESLA M4 (LOW POWER): highest hyperscale throughput; GPU support in Mesos
http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/
TESLA ACCELERATED COMPUTING: Accelerated Computing Toolkit (Libraries | Directives | Languages)
Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, ...); see the cuBLAS-XT sketch below.
Math libraries: cuBLAS | cuSPARSE | cuRAND | cuFFT | NPP | AmgX
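As an illustration, a minimal cuBLAS-XT sketch (an assumed example, not from the slides): cublasXt takes host pointers, tiles the matrices, and spreads the SGEMM across the selected GPUs (here GPUs 0 and 1 are assumed to exist).

    #include <stdio.h>
    #include <stdlib.h>
    #include <cublasXt.h>

    int main(void) {
        const size_t n = 4096;                       /* square matrices, host-resident */
        float *A = malloc(n * n * sizeof(float));
        float *B = malloc(n * n * sizeof(float));
        float *C = malloc(n * n * sizeof(float));
        for (size_t i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

        cublasXtHandle_t handle;
        cublasXtCreate(&handle);
        int devices[2] = {0, 1};                     /* assumption: two GPUs present */
        cublasXtDeviceSelect(handle, 2, devices);

        const float alpha = 1.0f, beta = 0.0f;
        /* C = alpha*A*B + beta*C; the library streams host tiles to both GPUs */
        cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                      &alpha, A, n, B, n, &beta, C, n);

        cublasXtDestroy(handle);
        printf("C[0] = %f\n", C[0]);
        free(A); free(B); free(C);
        return 0;
    }

Compile and link against cuBLAS (e.g. nvcc example.c -lcublas).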
OpenACC: Fueling the Next Wave of Scientific Discoveries in HPC
- University of Illinois, PowerGrid (MRI reconstruction): 70x speed-up, 2 days of effort
- RIKEN Japan, NICAM (climate modeling): 7-8x speed-up, 5% of code modified
Sources:
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
    int main() {
        // <serial code>
        #pragma acc kernels   // automatically runs on GPU
        {
            // <parallel code>
        }
    }
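For instance, a minimal compilable sketch (an assumed saxpy example, not from the slides) where the kernels directive offloads the loop and the compiler handles data movement:

    #include <stdio.h>

    #define N (1 << 20)

    static float x[N], y[N];   /* static so the arrays are not on the stack */

    int main(void) {
        const float a = 2.0f;
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        /* the compiler generates GPU code for this loop and copies x, y as needed */
        #pragma acc kernels
        for (int i = 0; i < N; ++i)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f (expect 4.0)\n", y[0]);
        return 0;
    }

Built, for example, with an OpenACC compiler such as pgcc -acc; without -acc the same source still builds and runs on the CPU.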
Developers using OpenACC
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University:
"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
LS-DALTON CCSD(T) module; metrics reported: lines of code modified | # of weeks required | # of codes to maintain
[Chart: Speedup vs CPU for Alanine-1 (13 atoms), Alanine-2 (23 atoms), Alanine-3 (33 atoms); benchmarked on the Titan supercomputer (AMD CPU vs Tesla K20X)]
Application performance benchmark, speedup vs a single CPU core:

    Application               CPU: MPI+OpenMP   CPU: MPI+OpenACC   CPU+GPU: MPI+OpenACC
    359.miniGhost (Mantevo)        4.1x              4.3x                7.6x
    NEMO (climate & ocean)         5.2x              5.3x               11.9x
    CloverLeaf (physics)           7.1x              7.1x               30.3x

Configurations: 359.miniGhost: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU. NEMO: each socket Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs. CloverLeaf: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs.
OpenACC Toolkit (http://developer.nvidia.com/openacc):
- PGI Compiler: free OpenACC compiler for academia
- NVProf Profiler: easily find where to add compiler directives
- Code Samples: learn from examples of real-world algorithms
- Documentation: quick start guide, best practices, forums
- GPU Wizard: identify which GPU libraries can jumpstart code
OpenACC course schedule:

    DATE             COURSE                                           REGION
    March 2016       Intro to Performance Portability with OpenACC    China
    March 2016       Intro to Performance Portability with OpenACC    India
    May 2016         Advanced OpenACC                                 Worldwide
    September 2016   Intro to Performance Portability with OpenACC    Worldwide
Kokkos (https://github.com/kokkos):

    Kokkos::parallel_for(N, KOKKOS_LAMBDA (int i) { y[i] = a * x[i] + y[i]; });

RAJA (https://e-reports-ext.llnl.gov/pdf/782261.pdf):

    RAJA::forall<cuda_exec>(0, N, [=] __device__ (int i) { y[i] = a * x[i] + y[i]; });

Hemi, CUDA Portability Library (http://github.com/harrism/hemi):

    hemi::parallel_for(0, N, [=] HEMI_LAMBDA (int i) { y[i] = a * x[i] + y[i]; });
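For comparison, a sketch (not from the slides) of the raw CUDA kernel and launch that these lambda-based wrappers abstract away:

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // launch with enough 256-thread blocks to cover N elements:
    // saxpy<<<(N + 255) / 256, 256>>>(N, a, x, y);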
Thrust:
- Bundled with NVIDIA's CUDA Toolkit
- Supports execution on GPUs and CPUs
- Ongoing performance & feature improvements
- Functionality beyond Parallel STL
[Chart: Thrust sort speedup, CUDA 7.0 vs 6.5 (32M samples), by key type (char, short, int, long, float, double); speedups roughly 1.1x to 2.0x]
From CUDA 7.0 Performance Report. Run on K40m, ECC ON, input and output data on device. Performance may vary based on OS and software versions, and motherboard configuration.
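A minimal thrust::sort sketch (an assumed example, not from the slides): data is generated on the host, copied to a device_vector, and sorted on the GPU.

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main() {
        // 32M random integers on the host (matches the benchmark's sample count)
        thrust::host_vector<int> h(32 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        // copy to the GPU and sort there; the same code can target CPU backends
        thrust::device_vector<int> d = h;
        thrust::sort(d.begin(), d.end());

        // copy the sorted keys back to the host
        thrust::copy(d.begin(), d.end(), h.begin());
        return 0;
    }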
The Thrust library allows the same C++ code to target both NVIDIA GPUs and x86, ARM, and POWER CPUs.
Thrust was the inspiration for a proposal to the ISO C++ Committee; the committee voted unanimously to accept it as the N3960 Technical Specification Working Draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype:
https://github.com/n3554/n3554
Published as ISO/IEC TS 19570:2015, July 2015. Draft available online:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf
We have proposed adding this to C++17:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0024r0.html
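As later standardized for C++17 (P0024), parallel algorithms take an execution policy. A minimal sketch, assuming a C++17 toolchain whose standard library implements <execution>:

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> v(1 << 20, 1.0f);

        // parallel execution policy: the implementation may spread the work across cores
        std::sort(std::execution::par, v.begin(), v.end());
        std::for_each(std::execution::par, v.begin(), v.end(),
                      [](float &x) { x *= 2.0f; });
        return 0;
    }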
CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);

        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);

        use_data(data);
        free(data);
    }

CUDA 6 code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);
        cudaFree(data);
    }
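A runnable variant of the same idea (a sketch, since qsort<<<...>>> above is illustrative only): the buffer is allocated with cudaMallocManaged, filled by the CPU, sorted on the GPU via Thrust, and then consumed by the CPU through the same pointer.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <thrust/execution_policy.h>
    #include <thrust/sort.h>

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);                     // one pointer, visible to CPU and GPU

        size_t n = fread(data, 1, N, fp);                // CPU fills the managed buffer
        thrust::sort(thrust::device, data, data + n);    // GPU sorts the bytes in place
        cudaDeviceSynchronize();                         // wait before the CPU reads the result

        fwrite(data, 1, n, stdout);                      // CPU consumes the sorted data
        cudaFree(data);
    }

    int main(int argc, char **argv) {
        FILE *fp = (argc > 1) ? fopen(argv[1], "rb") : NULL;
        if (fp) { sortfile(fp, 1 << 20); fclose(fp); }
        return 0;
    }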
NCCL
GOAL: topology-aware collectives, so as to improve the scalability of multi-GPU applications.
APPROACH:
Implemented as monolithic CUDA C++ kernels combining the following:
    #include <nccl.h>

    ncclComm_t comm[4];
    int devs[4] = {0, 1, 2, 3};
    ncclCommInitAll(comm, 4, devs);          // one communicator per GPU

    foreach g in (GPUs) {                    // or foreach thread
        cudaSetDevice(g);
        double *d_send, *d_recv;
        // allocate d_send, d_recv; fill d_send with data
        ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
        // consume d_recv
    }
Collectives: Broadcast | All-Reduce | All-Gather | Reduce-Scatter
[Diagram: GPU-accelerated libraries (AmgX, cuBLAS) alongside x86 code]
CUDA ecosystem:
- Consultants & Training: ANEO, GPU Tech
- OEM Solution Providers
- Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
- Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA
- Auto-parallelizing & Cluster Tools: OpenACC, mCUDA, OpenMP, Ocelot
- Libraries: BLAS, FFT, LAPACK, NPP, Video Imaging, GPULib
- GPU Compilers: C, C++, Fortran, Java, Python
Designed for developers & gamers: available everywhere (https://developer.nvidia.com/cuda-gpus)
Designed for the data center: ECC | 24x7 runtime | GPU monitoring | cluster management | GPUDirect-RDMA | Hyper-Q for MPI | 3-year warranty | integrated OEM systems, professional support
CUDA resource center:
http://docs.nvidia.com/cuda
GTC on-demand and webinars:
http://on-demand-gtc.gputechconf.com http://www.gputechconf.com/gtc-webinars
Parallel Forall Blog:
http://devblogs.nvidia.com/parallelforall
Self-paced labs:
http://nvidia.qwiklab.com
JETSON TX1: supercomputer on a module, under 10 W for typical use cases.
KEY SPECS:
- GPU: 1 TFLOP/s 256-core Maxwell
- CPU: 64-bit ARM A57 CPUs
- Memory: 4 GB LPDDR4 | 25.6 GB/s
- Storage: 16 GB eMMC
- Wifi/BT: 802.11 2x2 ac / BT Ready
- Networking: 1 Gigabit Ethernet
- Size: 50 mm x 87 mm
- Interface: 400-pin board-to-board connector
- Power: under 10 W
NVTX (NVIDIA Tools eXtension) | Debugger | Profiler | System Trace
GPU Compute | Graphics | Deep Learning and Computer Vision
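A minimal NVTX annotation sketch (an assumed example, not from the slides): named ranges and markers inserted with NVTX show up on the profiler timeline.

    #include <nvToolsExt.h>

    void process_frame() {
        nvtxRangePushA("process_frame");   // named range visible in the profiler timeline
        // ... compute ...
        nvtxRangePop();
    }

    int main() {
        nvtxMarkA("start");                // instantaneous marker
        for (int i = 0; i < 10; ++i) process_frame();
        return 0;
    }

Link with -lnvToolsExt; the annotations are no-ops when no profiler is attached.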
[Chart: Efficiency (images/sec/Watt), Intel Core i7-6700K (Skylake) vs Jetson TX1]
Jetson TX1 for drones (*based on SGEMM performance):

                          TODAY'S DRONE (GPS-BASED)   CORE i7      JETSON TX1
    Performance*          1x                          100x         100x
    Power (compute)       2 W                         60 W         6 W
    Power (mechanical)    70 W                        100 W        80 W
    Flight time           20 minutes                  9 minutes    18 minutes
http://developer.nvidia.com/embedded-computing
Jetson TX1 Developer Kit: $599 retail, $299 EDU; pre-order Nov 12; shipping Nov 16 (US), international to follow.
Jetson TX1 Module: $299 (1,000-unit quantity); available Q1 2016 from distributors worldwide.
Jetson for Embedded | Tesla for Cloud | Titan X for PC | DRIVE PX for Auto
Summary:
- The time of accelerators has come
- NVIDIA is focused on co-design from top to bottom
- Accelerators are surging in supercomputing
- Machine learning is the next killer application for HPC
- The Tesla platform leads in every way