REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)


slide-1
SLIDE 1

27th January 2016

REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)

François Courteille | Senior Solutions Architect, NVIDIA | fcourteille@nvidia.com

slide-2
SLIDE 2

GAMING | PRO VISUALIZATION | ENTERPRISE | DATA CENTER | AUTO

THE WORLD LEADER IN VISUAL COMPUTING

slide-3
SLIDE 3


TESLA ACCELERATED COMPUTING PLATFORM

Focused on Co-Design from Top to Bottom

Productive Programming Model & Tools | Expert Co-Design | Accessibility

Co-design stack: APPLICATION, MIDDLEWARE, SYS SW, LARGE SYSTEMS, PROCESSOR

Fast GPU Engineered for High Throughput

[Chart: peak TFLOPS, 2008-2014: NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPU]

Fast GPU + Strong CPU

slide-4
SLIDE 4


PERFORMANCE LEAD CONTINUES TO GROW

[Chart: Peak Double Precision FLOPS (GFLOPS), 2008-2014: NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]

[Chart: Peak Memory Bandwidth (GB/s), 2008-2014: NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]

slide-5
SLIDE 5


GPU Architecture Roadmap

[Chart: SGEMM per watt, 2008-2018, across GPU generations: Tesla, Fermi, Kepler, Maxwell, Pascal]

Pascal: Mixed Precision, 3D Memory, NVLink

slide-6
SLIDE 6

Kepler SM (SMX)

  • Scheduler not tied to cores
  • Double issue for max utilization
  • 192 CUDA cores per SMX (SP, DP, SFU, and LD/ST units)

[Diagram: SMX layout: Instruction Cache, 4 Warp Schedulers, Register File, SP/DP/SFU/LD/ST units, Shared Memory / L1 Cache, On-Chip Network]

slide-7
SLIDE 7

Maxwell SM (SMM)

  • Simplified design: power-of-two, quadrant-based; scheduler tied to cores
  • Better utilization: single issue sufficient; lower instruction latency
  • Efficiency: <10% difference from SMX at ~50% of SMX chip area

[Diagram: SMM layout: Instruction Cache, Tex/L1 caches, Shared Memory; per quadrant: Instruction Buffer, Warp Scheduler, Register File, 32 SP CUDA cores, SFU and LD/ST units]

slide-8
SLIDE 8

Histogram: Performance per SM

[Chart: Bandwidth/SM (GiB/s) vs. elements per thread (1 to 128): Fermi M2070, Kepler K20X, Maxwell GTX 750 Ti; Maxwell up to 5.5x faster]

Higher performance expected with larger GPUs (more SMs)

slide-9
SLIDE 9


TESLA GPU ACCELERATORS 2015-2016*

KEPLER – K40: 1.43 TF DP / 4.3 TF SP peak; 3.3 TF SGEMM / 1.22 TF DGEMM; 12 GB, 288 GB/s; 235 W; PCIe active or passive

KEPLER – K80: 2x GPU; 2.9 TF DP / 8.7 TF SP peak; 4.4 TF SGEMM / 1.59 TF DGEMM; 24 GB, ~480 GB/s; 300 W; PCIe passive

MAXWELL – M60 (GRID enabled): 2x GPU; 7.4 TF SP peak, ~6 TF SGEMM; 16 GB, 320 GB/s; 300 W; PCIe active or passive

MAXWELL – M6 (GRID enabled; in definition): 1x GPU; TBD TF SP peak; 8 GB, 160 GB/s; 75-100 W; MXM

MAXWELL – M40: 1x GPU; 7 TF SP peak (boost clock); 12 GB, 288 GB/s; 250 W; PCIe passive

MAXWELL – M4: 1x GPU; 2.2 TF SP peak; 4 GB, 88 GB/s; 50-75 W; PCIe low profile

*For end-customer deployments

slide-10
SLIDE 10


TESLA PLATFORM PRODUCT STACK

HPC: Accelerated Computing Toolkit; Tesla K80
Enterprise Virtualization: GRID 2.0; Tesla M60, M6
Hyperscale (DL training and web services): Hyperscale Suite; Tesla M40, Tesla M4
System tools & services across the stack: Enterprise Services, Data Center GPU Manager, Mesos, Docker

slide-11
SLIDE 11


NVLINK

HIGH-SPEED GPU INTERCONNECT

[Diagram: 2014: Kepler GPU connected over PCIe to an x86, ARM64, or POWER CPU. 2016: Pascal GPU connected over NVLink to a POWER CPU, with PCIe to x86/ARM64 CPUs.]

NODE DESIGN FLEXIBILITY

slide-12
SLIDE 12


UNIFIED MEMORY: SIMPLER & FASTER WITH NVLINK

[Diagram: three developer views: traditional (separate system memory and GPU memory), with Unified Memory (a single unified memory), and with Pascal & NVLink (unified memory shared over NVLink)]

Share data structures at CPU memory speeds, not PCIe speeds; oversubscribe GPU memory.

slide-13
SLIDE 13


MOVE DATA WHERE IT IS NEEDED FAST

NVLink: fast GPU communication; fast access to system memory; 5x faster than PCIe
GPUDirect P2P: multi-GPU scaling; fast GPU memory access; eliminates the CPU bottleneck
GPUDirect RDMA: accelerated communication; fast access to other nodes; eliminates CPU latency; 2x app performance
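As a rough sketch of what GPUDirect P2P looks like from the CUDA runtime API (standard calls; the device IDs and buffer size here are arbitrary examples, and error handling is elided):

    #include <cuda_runtime.h>

    int main() {
        int canAccess = 0;
        // Ask whether device 0 can map device 1's memory directly (P2P)
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
        }

        const size_t bytes = 1 << 20;
        float *buf0, *buf1;
        cudaSetDevice(0); cudaMalloc(&buf0, bytes);
        cudaSetDevice(1); cudaMalloc(&buf1, bytes);

        // Direct GPU-to-GPU copy; no staging through host memory
        cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);

        cudaSetDevice(0); cudaFree(buf0);
        cudaSetDevice(1); cudaFree(buf1);
        return 0;
    }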

slide-14
SLIDE 14


U.S. Dept. of Energy

Pre-Exascale Supercomputers for Science

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED

NOAA

New Supercomputer for Next-Gen Weather Forecasting

IBM Watson

Breakthrough Natural Language Processing for Cognitive Computing

SUMMIT SIERRA

slide-15
SLIDE 15


U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS

Powered by the Tesla Platform

  • 100-300 PFLOPS peak
  • 10x in scientific app performance
  • IBM POWER9 CPU + NVIDIA Volta GPU
  • NVLink high-speed interconnect
  • 40 TFLOPS per node, >3,400 nodes
  • Coming in 2017

Major Step Forward on the Path to Exascale

slide-16
SLIDE 16

ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS

  • 100+ accelerated systems now on the Top500 list
  • 1/3 of total FLOPS powered by accelerators
  • NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
  • Tesla supercomputers growing at 50% CAGR over the past five years

[Chart: Top500: number of accelerated supercomputers, 2013-2015]

slide-17
SLIDE 17


TESLA PLATFORM FOR HPC

slide-18
SLIDE 18


TESLA ACCELERATED COMPUTING PLATFORM

Development
  • Programming Languages / Compiler Solutions: LLVM, …
  • Libraries: cuBLAS, …
  • Development Tools / Profile and Debug: CUDA Debugging API, …

Data Center Infrastructure
  • GPU Accelerators: GPU Boost, …
  • Interconnect / Communication: GPUDirect, NVLink, …
  • Infrastructure Management / System Management: NVML, …
  • System Solutions
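Since NVML is the system-management interface named above, here is a hedged sketch of a minimal NVML query (these calls are part of the NVML C API; link with -lnvidia-ml; error handling elided):

    #include <nvml.h>
    #include <stdio.h>

    int main() {
        nvmlInit();

        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        unsigned int temp = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);

        nvmlUtilization_t util;
        nvmlDeviceGetUtilizationRates(dev, &util);

        printf("GPU 0: %u C, %u%% GPU utilization\n", temp, util.gpu);

        nvmlShutdown();
        return 0;
    }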

“Accelerators Will Be Installed in More than Half of New Systems”
Source: Top 6 predictions for HPC in 2015

“In 2014, NVIDIA enjoyed a dominant market share with 85% of the accelerator market.”
slide-19
SLIDE 19


370 GPU-Accelerated Applications

www.nvidia.com/appscatalog

slide-20
SLIDE 20


70% OF TOP HPC APPS ACCELERATED

Intersect360 survey of top apps; top 25 apps in the survey:

GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, LAMMPS, NWChem, LS-DYNA, Schrodinger, Gaussian, GAMESS, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST

(Per-app status in the original chart: all popular functions accelerated | some popular functions accelerated | in development | not supported)

Top 10 HPC apps: 90% accelerated
Top 50 HPC apps: 70% accelerated

Intersect360, Nov 2015, “HPC Application Support for GPU Computing”

slide-21
SLIDE 21


TESLA FOR HYPERSCALE

Hyperscale Suite: Deep Learning Toolkit, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos

TESLA M40: POWERFUL; fastest deep learning performance
TESLA M4: LOW POWER; highest hyperscale throughput

http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/

slide-22
SLIDE 22


TESLA PLATFORM FOR DEVELOPERS

slide-23
SLIDE 23

TESLA FOR SIMULATION

ACCELERATED COMPUTING TOOLKIT: LIBRARIES | DIRECTIVES | LANGUAGES

slide-24
SLIDE 24


DROP-IN ACCELERATION WITH GPU LIBRARIES

  • 5x-10x speedups out of the box
  • Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, …)
  • 75% of developers use GPU libraries to accelerate their applications

MATH libraries: cuBLAS, cuFFT, cuSPARSE, cuRAND, NPP, AmgX
BLAS | LAPACK | SPARSE | FFT | Math | Deep Learning | Image Processing
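As a hedged illustration of this drop-in library style (standard cuBLAS v2 API; the vector size and scalar are arbitrary):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 3.0f;
        // y = alpha * x + y, computed on the GPU
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }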

slide-25
SLIDE 25


“DROP-IN” ACCELERATION: NVBLAS
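The deck gives no code for NVBLAS, so here is a hedged sketch of how its drop-in use typically looks: an unmodified BLAS application is run with libnvblas preloaded, and a small config file names a CPU BLAS to fall back on (the CPU BLAS path here is an example, not from the deck):

    # nvblas.conf
    NVBLAS_CPU_BLAS_LIB  /usr/lib/libopenblas.so
    NVBLAS_GPU_LIST      ALL

    # run an unmodified binary; Level-3 BLAS calls are routed to the GPU
    NVBLAS_CONFIG_FILE=nvblas.conf LD_PRELOAD=libnvblas.so ./my_app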

slide-26
SLIDE 26


OpenACC

Simple | Powerful | Portable

Fueling the Next Wave of Scientific Discoveries in HPC

University of Illinois, PowerGrid MRI reconstruction: 70x speed-up, 2 days of effort
RIKEN Japan, NICAM climate modeling: 7-8x speed-up, 5% of code modified

Sources:
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

    main() {
        <serial code>

        #pragma acc kernels
        {
            // automatically runs on GPU
            <parallel code>
        }
    }

8000+ developers using OpenACC
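To make the skeleton above concrete, a minimal SAXPY in OpenACC C (a hedged sketch using the standard kernels directive; compile with an OpenACC compiler, e.g. pgcc -acc):

    #include <stdio.h>

    #define N (1 << 20)
    static float x[N], y[N];

    int main(void) {
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // the compiler generates a GPU kernel for this loop
        #pragma acc kernels
        for (int i = 0; i < N; ++i)
            y[i] = 3.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }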

slide-27
SLIDE 27


Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University:

“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”

Minimal effort: <100 lines of code modified, 1 week required, 1 source code to maintain. Big performance:

[Chart: LS-DALTON CCSD(T) module speedup vs. CPU for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms); benchmarked on the Titan supercomputer, AMD CPU vs. Tesla K20X]

LS-DALTON: a large-scale application for calculating high-accuracy molecular energies

slide-28
SLIDE 28


OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY

Paving the Path Forward: Single Code for All HPC Processors

Application performance benchmark, speedup vs. a single CPU core:

                          CPU: MPI + OpenMP | CPU: MPI + OpenACC | CPU + GPU: MPI + OpenACC
359.miniGhost (Mantevo)   4.1x              | 4.3x               | 7.6x
NEMO (climate & ocean)    5.2x              | 5.3x               | 11.9x
CloverLeaf (physics)      7.1x              | 7.1x               | 30.3x

System configurations: 359.miniGhost: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU. NEMO: Intel Xeon E5-2698 v3, 16 cores per socket; GPU: NVIDIA K80, both GPUs. CloverLeaf: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs.

slide-29
SLIDE 29


INTRODUCING THE NEW OPENACC TOOLKIT

Free Toolkit Offers Simple & Powerful Path to Accelerated Computing

PGI Compiler: free OpenACC compiler for academia
NVProf Profiler: easily find where to add compiler directives
Code Samples: learn from examples of real-world algorithms
GPU Wizard: identify which GPU libraries can jumpstart your code
Documentation: quick start guide, best practices, forums

http://developer.nvidia.com/openacc

slide-30
SLIDE 30


FREE OPENACC COURSES

Begin Accelerating Applications with OpenACC

DATE           | COURSE                                        | REGION
March 2016     | Intro to Performance Portability with OpenACC | China
March 2016     | Intro to Performance Portability with OpenACC | India
May 2016       | Advanced OpenACC                              | Worldwide
September 2016 | Intro to Performance Portability with OpenACC | Worldwide

Registration page: https://developer.nvidia.com/openacc-courses
Self-paced labs: http://nvidia.qwiklab.com

slide-31
SLIDE 31


PROGRAMMING LANGUAGES

Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++, Kokkos, RAJA, Hemi, OCCA
Python: PyCUDA, Copperhead, Numba, NumbaPro
Java, C#: GPU.NET, Hybridizer (Altimesh), JCUDA, CUDA4J
Numerical analytics: MATLAB, Mathematica, LabVIEW, Scilab, Octave

slide-32
SLIDE 32


COMPILE PYTHON FOR PARALLEL ARCHITECTURES

  • Anaconda Accelerate from Continuum Analytics
  • NumbaPro: an array-oriented compiler for Python & NumPy
  • Compiles for CPUs or GPUs (uses LLVM + the NVIDIA Compiler SDK)
  • Fast development + fast execution: the ideal combination
  • http://continuum.io (free academic license)

slide-33
SLIDE 33


MORE C++ PARALLEL FOR LOOPS

GPU Lambdas Enable Custom Parallel Programming Models

Kokkos (https://github.com/kokkos):

    Kokkos::parallel_for(N, KOKKOS_LAMBDA (int i) {
        y[i] = a * x[i] + y[i];
    });

RAJA (https://e-reports-ext.llnl.gov/pdf/782261.pdf):

    RAJA::forall<cuda_exec>(0, N, [=] __device__ (int i) {
        y[i] = a * x[i] + y[i];
    });

Hemi, a CUDA portability library (http://github.com/harrism/hemi):

    hemi::parallel_for(0, N, [=] HEMI_LAMBDA (int i) {
        y[i] = a * x[i] + y[i];
    });
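For reference, a hedged sketch of the same SAXPY written directly as a CUDA kernel, which is roughly what these libraries generate under the hood (standard CUDA; not code from the deck):

    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // one thread per element, 256 threads per block
        saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }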

slide-34
SLIDE 34


THRUST LIBRARY

Programming with algorithms and policies today:

  • Bundled with NVIDIA’s CUDA Toolkit
  • Supports execution on GPUs and CPUs
  • Ongoing performance & feature improvements
  • Functionality beyond Parallel STL

[Chart: Thrust sort speedup, CUDA 7.0 vs. CUDA 6.5, 32M samples, by key type (char, short, int, long, float, double); speedups range from roughly 1.1x to 1.8x]

From CUDA 7.0 Performance Report. Run on K40m, ECC ON, input and output data on device. Performance may vary based on OS and software versions, and motherboard configuration.
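A hedged minimal example of the Thrust style described above (standard Thrust API from the CUDA Toolkit):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main() {
        // generate random data on the host
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        // copy to the GPU and sort there
        thrust::device_vector<int> d = h;
        thrust::sort(d.begin(), d.end());

        // copy the sorted result back to the host
        thrust::copy(d.begin(), d.end(), h.begin());
        return 0;
    }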

slide-35
SLIDE 35


Portable, High-level Parallel Code TODAY

The Thrust library allows the same C++ code to target both NVIDIA GPUs and x86, ARM, and POWER CPUs.

Thrust was the inspiration for a proposal to the ISO C++ Committee; the committee voted unanimously to accept it as an official technical specification working draft.

N3960 Technical Specification Working Draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf

Prototype:

https://github.com/n3554/n3554

slide-36
SLIDE 36

STANDARDIZING PARALLEL STL

Technical Specification for C++ Extensions for Parallelism, published as ISO/IEC TS 19570:2015, July 2015.

Draft available online:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf

We’ve proposed adding this to C++17:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0024r0.html
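A brief hedged example of the interface this TS defines (headers and the std::experimental::parallel namespace follow N4507; shipping implementations of the TS vary):

    #include <experimental/algorithm>
    #include <experimental/execution_policy>
    #include <vector>

    int main() {
        std::vector<int> v = {9, 1, 8, 2, 7, 3};
        // request a parallel sort via the 'par' execution policy
        std::experimental::parallel::sort(std::experimental::parallel::par,
                                          v.begin(), v.end());
        return 0;
    }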

slide-37
SLIDE 37


CUDA

Super Simplified Memory Management Code

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);

        fread(data, 1, N, fp);

        qsort(data, N, 1, compare);

        use_data(data);
        free(data);
    }

CUDA 6 code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        // GPU sort; launch configuration elided on the slide
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);
        cudaFree(data);
    }

slide-38
SLIDE 38


INTRODUCING NCCL (“NICKEL”): ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS

slide-39
SLIDE 39


INTRODUCING NCCL

Accelerating multi-GPU collective communications

GOAL:
  • Build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications

APPROACH:
  • Pattern the library after MPI’s collectives
  • Handle the intra-node communication in an optimal way
  • Provide the necessary functionality for MPI to build on top to handle inter-node communication

slide-40
SLIDE 40


NCCL FEATURES AND FUTURES

Collectives (green = currently available):

  • Broadcast
  • All-Gather
  • Reduce
  • All-Reduce
  • Reduce-Scatter
  • Scatter
  • Gather
  • All-To-All
  • Neighborhood

Key Features

  • Single-node, up to 8 GPUs
  • Host-side API
  • Asynchronous/non-blocking interface
  • Multi-thread, multi-process support
  • In-place and out-of-place operation
  • Integration with MPI
  • Topology Detection
  • NVLink & PCIe/QPI* support


slide-41
SLIDE 41


NCCL IMPLEMENTATION

Implemented as monolithic CUDA C++ kernels combining the following:

  • GPUDirect P2P Direct Access
  • Three primitive operations: Copy, Reduce, ReduceAndCopy
  • Intra-kernel synchronization between GPUs
  • One CUDA thread block per ring-direction


slide-42
SLIDE 42


NCCL EXAMPLE

All-reduce

    #include <nccl.h>

    ncclComm_t comm[4];
    ncclCommInitAll(comm, 4, {0, 1, 2, 3});

    foreach g in (GPUs) {  // or foreach thread
        cudaSetDevice(g);
        double *d_send, *d_recv;
        // allocate d_send and d_recv; fill d_send with data
        ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum,
                      comm[g], stream[g]);
        // consume d_recv
    }

slide-43
SLIDE 43


NCCL PERFORMANCE

[Chart: bandwidth at different problem sizes on 4 Maxwell GPUs: Broadcast, All-Reduce, All-Gather, Reduce-Scatter]

slide-44
SLIDE 44


AVAILABLE NOW github.com/NVIDIA/nccl

slide-45
SLIDE 45


COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS

The same libraries (AmgX, cuBLAS, …), compiler directives, and programming languages work across x86 and the other supported CPU architectures.

slide-46
SLIDE 46


GPU DEVELOPER ECO-SYSTEM

  • Consultants & Training: ANEO, GPU Tech
  • OEM Solution Providers
  • Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
  • Numerical Packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
  • Auto-parallelizing & Cluster Tools: OpenACC, mCUDA, OpenMP, Ocelot
  • Libraries: BLAS, FFT, LAPACK, NPP, Video Imaging, GPULib
  • GPU Compilers: C, C++, Fortran, Java, Python

slide-47
SLIDE 47


DEVELOP ON GEFORCE, DEPLOY ON TESLA

GeForce: designed for developers & gamers; available everywhere
https://developer.nvidia.com/cuda-gpus

Tesla: designed for the data center
  • ECC
  • 24x7 runtime
  • GPU monitoring
  • Cluster management
  • GPUDirect RDMA
  • Hyper-Q for MPI
  • 3-year warranty
  • Integrated OEM systems, professional support

slide-48
SLIDE 48


RESOURCES

CUDA resource center:

http://docs.nvidia.com/cuda

GTC on-demand and webinars:

http://on-demand-gtc.gputechconf.com
http://www.gputechconf.com/gtc-webinars

Parallel Forall Blog:

http://devblogs.nvidia.com/parallelforall

Self-paced labs:

http://nvidia.qwiklab.com

Learn more about GPUs

slide-49
SLIDE 49


TEGRA TX1

slide-50
SLIDE 50

JETSON TX1

Supercomputer on a module; under 10 W for typical use cases

KEY SPECS
  GPU: 1 TFLOP/s 256-core Maxwell
  CPU: 64-bit ARM A57 CPUs
  Memory: 4 GB LPDDR4 | 25.6 GB/s
  Storage: 16 GB eMMC
  Wifi/BT: 802.11 2x2 ac / BT ready
  Networking: 1 Gigabit Ethernet
  Size: 50 mm x 87 mm
  Interface: 400-pin board-to-board connector
  Power: under 10 W

slide-51
SLIDE 51

JETSON LINUX SDK

NVTX (NVIDIA Tools eXtension) | Debugger | Profiler | System Trace

GPU Compute | Graphics | Deep Learning and Computer Vision

slide-52
SLIDE 52


10X ENERGY EFFICIENCY FOR MACHINE LEARNING

[Chart: AlexNet efficiency in images/sec/Watt: Jetson TX1 vs. Intel Core i7-6700K (Skylake)]

slide-53
SLIDE 53

PATH TO AN AUTONOMOUS DRONE

*Based on SGEMM performance

                    | TODAY’S DRONE (GPS-BASED) | CORE i7    | JETSON TX1
Performance*        | 1x                        | 100x       | 100x
Power (compute)     | 2 W                       | 60 W       | 6 W
Power (mechanical)  | 70 W                      | 100 W      | 80 W
Flight time         | 20 minutes                | 9 minutes  | 18 minutes

slide-54
SLIDE 54


Comprehensive developer platform

http://developer.nvidia.com/embedded-computing

slide-55
SLIDE 55

Jetson TX1 Developer Kit: $599 retail, $299 EDU. Pre-order Nov 12; shipping Nov 16 (US), international to follow.

slide-56
SLIDE 56

Jetson TX1 Module: $299 (1,000-unit quantity). Available 1Q16 from distributors worldwide.

slide-57
SLIDE 57

Jetson for Embedded | Tesla for Cloud | Titan X for PC | DRIVE PX for Auto

ONE ARCHITECTURE — END-TO-END AI

slide-58
SLIDE 58

FIVE THINGS TO REMEMBER

1. The time of accelerators has come
2. NVIDIA is focused on co-design from top to bottom
3. Accelerators are surging in supercomputing
4. Machine learning is the next killer application for HPC
5. The Tesla platform leads in every way
