SLIDE 1

ORNL is managed by UT-Battelle for the US Department of Energy

Exploratory Visualization of Petascale Particle Data in Nvidia DGX-1

Benjamin Hernandez, PhD

hernandezarb@ornl.gov

Advanced Data and Workflows Group Oak Ridge Leadership Computing Facility Oak Ridge National Laboratory

SLIDE 2

Oak Ridge Leadership Computing Facility (OLCF)

  • Mission: Provide the computational and data resources required to solve the most challenging problems.
  • Highly competitive user allocation programs (INCITE, ALCC).
  • Projects receive 10x to 100x more resources than at other generally available centers.

  • We partner with users to enable science & engineering breakthroughs.
SLIDE 3

Sight: Exploratory Visualization of Scientific Data

  • Client/Server architecture to provide high-end visualization on laptops, desktops, and powerwalls.
  • Heterogeneous scientific visualization
    – Take advantage of both CPU & GPU resources within a node: DGX-1 use case.
    – Advanced shading to enable new insights into data exploration.
  • Parallel I/O & Data Staging
    – “Pluggable” for in-situ visualization
  • Lightweight tool
    – Load your data
    – Perform exploratory analysis
    – Visualize/Save results

SLIDE 4

Sight System Architecture (in progress)

Server (DGX-1 or multi-GPU node):
  • Rendering back ends: OSPray on the CPU cores and Nvidia OptiX on the multiple GPUs, alongside VTK-m.
  • ADIOS I/O system stages data from the local/parallel file system or an HPC cluster.
  • Visualization frames are compressed and streamed over WebSockets to an HTML client.

*OSPray and Nvidia OptiX are finely tuned libraries for ray tracing on multicore and manycore architectures.

SLIDE 5

ADIOS

  • ADIOS is an I/O framework
    – Provides multiple methods to stage data to a staging area (on node, off node, off machine)
    – Data output can be anything one wants
    – Different methods allow for different types of data movement, aggregation, and arrangement to the storage system, or for streaming over the local nodes, LAN, or WAN
    – It provides its own file format if you choose to use it (ADIOS-BP)
    – Compresses/decompresses data in parallel
    – Contains mechanisms to index and query data

SLIDE 6

First Approach: OpenGL

VBO (points)
  • Vertex shader: apply transfer function
  • Geometry shader: expand points into quads with texture coordinates
  • Fragment shader: sphere generation and shading

SLIDE 7

OpenGL Bindless Graphics

  • Initialization: make the VBO resident and query its GPU address pointer.
  • Display: vertices span from vboAddress to vboAddress + sizeof(float)*size.
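The initialization and display listings on this slide were lost in extraction. A hedged sketch of how bindless vertex addressing is typically set up with the NV_shader_buffer_load / NV_vertex_buffer_unified_memory extensions (illustrative only; it needs a live GL context, so it is not runnable as-is):

```cpp
// Illustrative sketch, not the talk's original listing. Requires a GL context
// plus the NV bindless extensions.
GLuint64EXT vboAddress;

void initBindless(GLuint vbo, GLsizei size) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // make the buffer resident and query its GPU virtual address
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV,
                                &vboAddress);
}

void display(GLsizei size) {
    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    glVertexFormatNV(3, GL_FLOAT, 0);
    // vertices span vboAddress .. vboAddress + sizeof(float) * size
    glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0, vboAddress,
                           sizeof(float) * size);
    glDrawArrays(GL_POINTS, 0, size / 3);
}
```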

SLIDE 8

Fragment Shader Sphere Generation

Each point is expanded to a quad with texture coordinates running from (-1,-1) to (1,1).

  • Sphere equation: r^2 = (x − x0)^2 + (y − y0)^2 + (z − z0)^2

With r = 1.0:
    x = texCoord.x
    y = texCoord.y
    zz = 1.0 − x*x − y*y
    if (zz <= 0.0)    // removes fragments outside the sphere
        discard;
    // scale to the desired radius
    // calculate diffuse illumination
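The shader math above can be checked on the CPU. A minimal C++ sketch of the same per-fragment logic (function and parameter names are hypothetical, not from Sight):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <optional>

// Given quad texture coordinates in [-1,1]^2, reconstruct the unit-sphere
// surface point and return a diffuse intensity, or nothing if the fragment
// would be discarded.
std::optional<float> shadeSphereFragment(float x, float y,
                                         std::array<float, 3> lightDir) {
    float zz = 1.0f - x * x - y * y;   // squared height above the quad
    if (zz <= 0.0f)
        return std::nullopt;           // outside the unit circle: discard
    float z = std::sqrt(zz);           // recover the sphere surface z
    // the unit-sphere normal at this fragment is simply (x, y, z)
    float diffuse = x * lightDir[0] + y * lightDir[1] + z * lightDir[2];
    return std::max(diffuse, 0.0f);    // clamp back-facing light to zero
}
```

The corner fragments of the quad always fall outside the unit circle and are discarded, which is what rounds each quad into a sphere impostor.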

SLIDE 9

Results

SLIDE 10

OpenGL Multi-GPU Rendering

  • One MPI task per device
    – Easy to implement
    – Each device initializes its own GLX/EGL context
  • Multi-threading: one thread per device
    – In EGL this is possible:
      • Create the main context in the main thread:
        mainCtx = eglCreateContext(display, config, 0, contextAttrs);
      • Each additional thread creates a shared context:
        lclThrdCtx = eglCreateContext(display, config, mainCtx, contextAttrs);
      • Implement some mutexes/semaphores to sync any updates
    – Vulkanize your viz!
      • Devices are aware of other devices and can coordinate with each other
  • That’s precisely what NVIDIA OptiX can do
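The thread-per-device pattern above can be sketched as follows. The EGL calls are left as comments since they require a display and a driver; the threading and synchronization scaffolding is real, and all names (renderOnDevice, frameMutex) are illustrative, not Sight's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

std::mutex frameMutex;              // guards state shared between devices
std::vector<int> finishedDevices;

void renderOnDevice(int device) {
    // lclThrdCtx = eglCreateContext(display, config, mainCtx, contextAttrs);
    // eglMakeCurrent(display, surface, surface, lclThrdCtx);
    // ... issue GL draw calls for this device's share of the particles ...
    std::lock_guard<std::mutex> lock(frameMutex);  // sync shared updates
    finishedDevices.push_back(device);
}

std::size_t renderFrame(int numDevices) {
    finishedDevices.clear();
    // mainCtx = eglCreateContext(display, config, 0, contextAttrs);
    std::vector<std::thread> threads;
    for (int d = 0; d < numDevices; ++d)
        threads.emplace_back(renderOnDevice, d);   // one thread per device
    for (auto &t : threads)
        t.join();                                  // wait for all devices
    return finishedDevices.size();
}
```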
SLIDE 11

Second Approach: Nvidia OptiX Ray Tracing Engine

  • “The OptiX API is an application framework for achieving optimal ray tracing performance on the GPU. It provides a simple, recursive, and flexible pipeline for accelerating ray tracing algorithms.”
  • “Similar to OpenGL in doing the ‘heavy lifting’ of ray tracing and leaving capability and technique to the developers”
    – Plus it can use all GPUs available in your system
    – Naturally fits material appearance and scene illumination

SLIDE 12

Nvidia OptiX Programming Model

  • OptiX provides eight programmable components; some of them are:
    1. Ray generation
    2. Intersection
    3. Shading (closest hit)
    plus shadows (any hit) and selector programs
  • Shaders use a CUDA-like syntax
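The ray-generation program's core job is computing one primary ray per pixel. A CPU sketch of the usual pinhole-camera math (the eye/U/V/W naming follows a common ray tracing convention and is an assumption here, not taken from the talk):

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Ray { std::array<float, 3> origin, dir; };

// Compute the primary ray through pixel (px, py) of a width x height image.
Ray generatePrimaryRay(int px, int py, int width, int height,
                       std::array<float, 3> eye,
                       std::array<float, 3> U,   // camera right axis
                       std::array<float, 3> V,   // camera up axis
                       std::array<float, 3> W) { // camera forward axis
    // map the pixel center to normalized device coordinates in [-1, 1]
    float dx = 2.0f * (px + 0.5f) / width - 1.0f;
    float dy = 2.0f * (py + 0.5f) / height - 1.0f;
    std::array<float, 3> d;
    float len = 0.0f;
    for (int i = 0; i < 3; ++i) {
        d[i] = dx * U[i] + dy * V[i] + W[i];
        len += d[i] * d[i];
    }
    len = std::sqrt(len);
    for (auto &c : d) c /= len;  // normalize the ray direction
    return {eye, d};
}
```

In OptiX this runs once per launch index on the GPU; the intersection and closest-hit programs then handle the sphere test and shading that the OpenGL approach did in the fragment shader.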

SLIDE 13

Nvidia OptiX Graph Nodes

  • Geometry is defined by graph nodes.
  • A tree-like hierarchy where:
    – Nodes at the bottom describe geometric objects (Geometry, Geometry Instance).
    – Nodes at the top describe collections of geometric objects (Group, Transform, Selector, Geometry Group).
    – Acceleration structures are attached to Groups and Geometry Groups.

SLIDE 14

Nvidia Optix Graph Nodes

  • “Keep the hierarchy as flat as possible…”

    Flat: one Geometry Group holding one Geometry Instance with all particles, under a single Acceleration structure.

  • But not too flat!

    Better: a top-level Group (with its own Acceleration) whose children are several Geometry Groups, each holding a Geometry Instance of particles with its own Acceleration structure.
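The “not too flat” layout above implies splitting the particle array into one chunk per Geometry Group (for example, one per GPU). A minimal chunking sketch, with illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split numParticles into one chunk per geometry group so each group can
// build its own acceleration structure. numGroups would typically match
// the GPU count.
std::vector<std::size_t> particlesPerGroup(std::size_t numParticles,
                                           std::size_t numGroups) {
    std::vector<std::size_t> counts(numGroups, numParticles / numGroups);
    // distribute the remainder one particle at a time to the first groups
    for (std::size_t i = 0; i < numParticles % numGroups; ++i)
        counts[i] += 1;
    return counts;
}
```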

SLIDE 15

Results: Test Systems

  • Workstation
    – CPU: Intel Xeon, 20 cores… 512 GB RAM
    – GPU: Titan Z (2 GeForce Kepler GPUs, 2x6 GB VRAM)
    – Ubuntu 16, Nvidia driver
  • Rhea node
    – CPU: Intel Xeon …
    – GPU: 2x Tesla K80 (4 Tesla Kepler GPUs, 2x24 GB VRAM)
    – RedHat 7, Nvidia driver
  • DGX-1
    – CPU: Intel …
    – GPU: 8x Tesla Pascal SXM, 8x16 GB VRAM
    – Ubuntu 16, Nvidia driver …

All systems: CUDA 8.0, OptiX 4.0.2. Acceleration structure: Trbvh. Image resolution: 1080p. Shading: Phong illumination & ambient occlusion.

SLIDE 16

Results: How fast is the acceleration structure built? (lower is better)

[Bar chart: acceleration structure build time in ms (log scale) for Workstation, Rhea node, and DGX-1 at 1, 10, and 20 million particles; reported values range from roughly 80 ms to about 1958 ms.]

SLIDE 17

Results: Performance (lower is better)

[Two charts: ms per frame at 30, 59, 127, and 600 million particles on Workstation, Rhea node, and DGX-1; worst case (25–275 ms scale) and best case (5–30 ms scale), with 60 fps and 30 fps reference lines.]

SLIDE 18

Results

SLIDE 19

Discussion

  • DGX-1 can handle particle systems up to 10x larger in our test environment.
  • For particle systems of the same size, DGX-1 is 10x faster than the workstation system and 4.6x faster than a Rhea node.
    – We expect the DGX-1 speedup to increase at larger image resolutions.
  • Our preliminary tests showed DGX-1 has enough compute power to drive a powerwall.
    – 3840 x 1080 @ 60–120 fps
  • Test larger resolutions.
    – Researchers are usually happy when they can explore datasets even at 1 fps.

SLIDE 20

Discussion

  • Nvidia OptiX provides multi-GPU support with no hassle.
    – Test whether Nvidia OptiX leaves free resources for analysis tasks.
  • Paging was removed in OptiX 4.x.
    – DGX-1 includes 40 CPU cores and 512 GB RAM.
    – Using Nvidia OptiX & the OSPRay library will allow full system allocation to handle larger systems.
  • Summit is likely to support EGL through the Nvidia GPU drivers (do not take it as a fact, or alternative fact either!).
    – Best if (pre)exascale visualization tools are 100% CUDA compliant.

SLIDE 21

Questions?

Benjamin Hernandez, PhD

hernandezarb@ornl.gov

Advanced Data and Workflows Group Oak Ridge Leadership Computing Facility Oak Ridge National Laboratory

Acknowledgements: Dylan Lacewell and the Nvidia Optix Team for their technical support. Datasets provided by Cheng-Yu Shi and Leonid Zhigilei from the Computational Materials Group at University of Virginia. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

SLIDE 22

Further Reading

  • Tom True, Alina Alt (2013), “Configuring, Programming and Debugging Applications for Multiple GPUs”, GTC 2013
  • Wil Braithwaite, “Multi-GPU Programming for Visual Computing”, SIGGRAPH 2013
    – Available in GTC On-Demand: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
  • ADIOS Manual
    – http://users.nccs.gov/~pnorbert/ADIOS-UsersManual-1.11.0.pdf
  • OptiX Tutorial Talks
    – http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=optix&searchItems=&sessionTopic=&sessionEvent=&sessionFormat=&submit=&select=+

SLIDE 23

2018 INCITE Call for Proposals

  • The 2018 INCITE Call for Proposals opened April 17, 2017 and closes June 23, 2017.
  • Features large allocations of computer time and supporting resources at the Argonne and Oak Ridge Leadership Computing Facility (LCF) centers, operated by the US Department of Energy (DOE) Office of Science.
  • Soliciting research proposals for awards of time on the 27-petaflop Cray XK7, Titan, and the 10-petaflop IBM Blue Gene/Q, Mira. In addition, certain 2018 INCITE awards will receive time on Argonne’s new Intel/Cray system, a 9.65-petaflops system called Theta.
  • The INCITE program seeks research proposals for capability computing:
    – Production simulations, including ensembles, that use a large fraction of the LCF systems, or
    – Proposals that require the unique LCF architectural infrastructure for high-performance computing projects that cannot be performed elsewhere.
  • The INCITE program is open to US and non-US based researchers.
  • The INCITE program invites you to participate in an INCITE Proposal Writing Webinar, offered on April 19, May 18, and June 6.
  • For more information visit http://www.doeleadershipcomputing.org/
SLIDE 24

Results: How fast is the acceleration structure built? (nvprof profiles)

Workstation:
Time(%)  Time      Calls  Avg       Min       Max       Name
49.63%   53.663ms  111    483.45us  1.3430us  2.5468ms  [CUDA memcpy HtoD]
22.64%   24.478ms  10     2.4478ms  2.4320ms  2.4768ms  Megakernel_CUDA_1
20.43%   22.090ms  22     1.0041ms  3.7440us  2.2334ms  [CUDA memcpy DtoH]
6.74%    7.2833ms  1      7.2833ms  7.2833ms  7.2833ms  Megakernel_CUDA_0
0.30%    324.41us  1      324.41us  324.41us  324.41us  [CUDA memcpy HtoA]
0.21%    231.71us  23     10.074us  1.2480us  150.49us  [CUDA memset]
0.05%    55.903us  44     1.2700us  1.2150us  1.6320us  [CUDA memcpy DtoD]

Rhea node:
Time(%)  Time      Calls  Avg       Min       Max       Name
50.29%   43.010ms  133    323.38us  1.2800us  1.0984ms  [CUDA memcpy HtoD]
34.61%   29.596ms  44     672.63us  3.3280us  1.0593ms  [CUDA memcpy DtoH]
9.48%    8.1082ms  10     810.82us  623.39us  1.2392ms  Megakernel_CUDA_1
5.03%    4.2986ms  1      4.2986ms  4.2986ms  4.2986ms  Megakernel_CUDA_0
0.36%    303.77us  23     13.207us  1.1520us  211.29us  [CUDA memset]
0.17%    148.67us  1      148.67us  148.67us  148.67us  [CUDA memcpy HtoA]
0.06%    51.008us  44     1.1590us  1.1200us  1.1840us  [CUDA memcpy DtoD]

DGX-1:
Time(%)  Time      Calls  Avg       Min       Max       Name
43.50%   35.509ms  133    266.99us  1.1200us  1.1818ms  [CUDA memcpy HtoD]
30.77%   25.122ms  44     570.95us  1.2800us  1.1629ms  [CUDA memcpy DtoH]
18.13%   14.800ms  44     336.36us  1.7920us  371.84us  [CUDA memcpy PtoP]
6.16%    5.0304ms  10     503.04us  332.39us  889.03us  Megakernel_CUDA_1
1.11%    904.00us  1      904.00us  904.00us  904.00us  Megakernel_CUDA_0
0.14%    112.03us  1      112.03us  112.03us  112.03us  [CUDA memcpy HtoA]
0.13%    108.03us  23     4.6970us  1.0560us  60.032us  [CUDA memset]
0.06%    47.264us  44     1.0740us  1.0560us  1.3440us  [CUDA memcpy DtoD]


SLIDE 26

ADIOS I/O

  • Abstracts metadata, data types, and dimensions from the source code into an XML file
    – Host languages: C, Fortran
    – I/O and staging methods: POSIX, MPI, MPI_LUSTRE, PHDF5, …, DATASPACES, DIMES, FLEXPATH, ICEE
    – Transforms: zlib, bzip2, szip, zfp, isobar, ALACRITY
  • All data in adios_write() calls are buffered before writing to the file system.

SLIDE 27

ADIOS I/O

  • Generate the C code from the XML file:
      gpp.py atoms.xml
    produces gwrite_Atoms.ch and gread_Atoms.ch
  • Both files contain the code to write and read ADIOS files
    – You only need to modify your XML file and generate new *.ch files
    – Main code remains the same
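For reference, an atoms.xml might look roughly like the following ADIOS 1.x-style configuration sketch; the group and variable names (Atoms, natoms, x, y, z) are assumptions, not taken from the talk:

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of an ADIOS 1.x config; names are illustrative. -->
<adios-config host-language="C">
  <adios-group name="Atoms" coordination-communicator="comm">
    <var name="natoms" type="integer"/>
    <var name="x" type="real" dimensions="natoms"/>
    <var name="y" type="real" dimensions="natoms"/>
    <var name="z" type="real" dimensions="natoms"/>
  </adios-group>
  <method group="Atoms" method="MPI"/>
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>
```

Changing the variables here and re-running gpp.py regenerates the *.ch files, so the main program does not need to change.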

SLIDE 28

ADIOS – Write/Read Example

[Side-by-side code listings: writing with gwrite_Atoms.ch and reading with gread_Atoms.ch]
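The original listings did not survive extraction. The standard ADIOS 1.x write pattern they likely followed looks roughly like this (hedged sketch; it requires MPI and ADIOS, so it is not standalone-runnable, and the names mirror the hypothetical atoms.xml above):

```cpp
// Hedged sketch of the standard ADIOS 1.x write path; not the talk's
// original listing. Requires MPI and the ADIOS library.
#include <mpi.h>
#include <adios.h>

void writeAtoms(MPI_Comm comm, int natoms, float *x, float *y, float *z) {
    int64_t fd;
    adios_init("atoms.xml", comm);                   // parse the XML config
    adios_open(&fd, "Atoms", "atoms.bp", "w", comm); // open the output group
    #include "gwrite_Atoms.ch"                       // generated adios_write calls
    adios_close(fd);                                 // flush buffered data
}
```

Reading mirrors this with the read API and the generated gread_Atoms.ch include.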