Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation


SLIDE 1

ORNL is managed by UT-Battelle for the US Department of Energy

Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation

Chris Davis, Sophie Voisin, Devin White, Andrew Hardin

Scalable and High Performance Geocomputation Team
Geographic Information Science and Technology Group
Oak Ridge National Laboratory

GTC 2017 – May 2017

SLIDE 2

Outline

  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

SLIDE 3

The Story

  • We are:

– Developing an HPC suite of applications
– Spread across multiple R&D teams
– In an Agile development process
– Delivering to a production environment
– Needing to support multiple systems / multiple capabilities
– Collecting performance metrics for system optimization

SLIDE 4

Why We Use NVIDIA-Docker

[Diagram: comparison of virtual machines, Docker, and NVIDIA-Docker across resource optimization, GPU access, flexibility, and operating systems.]

SLIDE 5

Hardware – Quadro: Compute + Display

Card         M4000    P6000
Capability   5.2      6.1
Block        32       32
SM           13       30
Cores        1664     3840
Memory       8 GB     24 GB

SLIDE 6

Hardware – Tesla: Compute Only

Card         K40      K80
Capability   3.5      3.7
Block        16       16
SM           15       13
Cores        2880     2496
Memory       12 GB    12 GB

SLIDE 7

Hardware – High End

DELL C4130
GPU           4 x K80
RAM           256 GB
Cores         48
SSD Storage   400 GB

SLIDE 8

Constructing Containers

  • Build Container:

– Based off the NVIDIA images at gitlab.com (https://gitlab.com/nvidia/cuda/tree/centos7)
– CentOS 7
– CUDA 8.0 / 7.5
– cuDNN 5.1
– GCC 4.9.2
– Cores: 24
– Mount a local folder with the code

  • Compile against the chosen compute capability (see the capability-check sketch below)
  • Copy the product inside the container
  • “docker commit” the container updates to a new image
  • “docker save” the image to Isilon

[Diagram: the build server mounts code from the Git repo, writes compile statistics to the PostgreSQL Compile Stats database, and saves the per-capability containers to Isilon, from which HPC servers running NVIDIA-Docker pull them.]
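
Since each container is compiled for a single compute capability, a quick host-side check can confirm that a capability-specific binary landed on a matching GPU. A minimal sketch using the CUDA runtime API; the check itself is our illustration, not part of the original build scripts:

```cpp
// Sketch: report the compute capability of the GPU visible in the container,
// so a capability-specific binary can refuse to run on the wrong device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA device visible inside the container\n");
        return 1;
    }
    std::printf("GPU %s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    return 0;
}
```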

SLIDE 9

Running Containers

  • For each compute capability:

– “docker load” the image from Isilon storage
– Run the container & profile script
– Send nvprof results to the Profile Stats DB
– Remove the container / image

[Diagram: each HPC server loads containers and data from Isilon and sends profiling statistics to the PostgreSQL Profile Stats database.]

SLIDE 10

Hooking It All Together

[Diagram: one build server and multiple HPC servers share the Isilon container store, the Git repo, and the PostgreSQL Compile Stats and Profile Stats databases.]

  • One server generates containers
  • All servers pull containers from Isilon
  • Data to be processed is pulled from Isilon
  • Container build stats are stored in the Compile Stats DB
  • Container execution stats are stored in the Profile Stats DB

SLIDE 11

Profiling Combinations

  • nvprof

– Output parsed
– Sent to the Profile Stats DB

  • Containers for:

– CUDA version
– Each capability
– All capabilities
– CPU only

  • Data sets: 4
  • Total of 104 profiles

[Diagram: profiling matrix of CUDA versions (7.5, 8.0) x targets (CPU, 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1, all capabilities) x datasets (D1-D4), run on the M4000, K80, P6000, and K40.]

SLIDE 12

Database

  • Postgres databases (a minimal libpq sketch follows)

– Shared fields: Hostname, Dataset, CUDA Version, Num CPU Threads, Timestamp
– Compile DB: Compile Time, Compute Capability
– Run Time DB: Execution Time, GPU Device
– NVPROF DB: Kernel / API Call, Step Time Percent, Step Time, Num Calls, Ave Time, Min Time, Max Time, Step Name
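
libpq is already one of the application's core libraries (slide 15), so recording a result is one parameterized INSERT. A minimal sketch against the Compile DB, with table and column names assumed for illustration since the deck only lists the fields:

```cpp
// Sketch: log one compile result to the Compile DB via libpq.
// Table and column names here are assumptions, not the real schema.
#include <cstdio>
#include <libpq-fe.h>

int main() {
    PGconn *conn = PQconnectdb("host=dbhost dbname=compile_stats user=profiler");
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    // Shared fields (hostname, dataset, CUDA version, threads) plus the
    // Compile DB's own fields (compute capability, compile time in seconds).
    const char *params[6] = {"hpc01", "D1", "8.0", "24", "3.7", "812.4"};
    PGresult *res = PQexecParams(conn,
        "INSERT INTO compile_stats "
        "(hostname, dataset, cuda_version, num_cpu_threads, "
        " compute_capability, compile_time, ts) "
        "VALUES ($1, $2, $3, $4, $5, $6, now())",
        6, nullptr, params, nullptr, nullptr, 0);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        std::fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}
```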

SLIDE 13

Outline

  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

SLIDE 14

Example HPC Application

  • Geospatial metadata generator

– Leverages open-source 3rd-party libraries

  • OpenCV, Caffe, GDAL, …

– GPU-enabled computer vision algorithms

  • SURF, ORB, NCC, NMI…

– Automated matching against control data
– Calculates geospatial metadata for input imagery

[Images: satellites, manned aircraft, unmanned aerial systems]

SLIDE 15

Example HPC Application - GTC16

  • Two-step image re-alignment application using NMI

[Diagram: pipeline from Input Image through Preprocessing, Source Selection, Global Localization, Registration, and Resection to Metadata and Output Image; stages marked GPU or CPU.]

Core Libraries:

  • NITRO
  • GDAL
  • Proj.4
  • libpq (Postgres)
  • OpenCV
  • CUDA
  • OpenMP

Normalized Mutual Information: $\mathrm{NMI} = \frac{H_S + H_C}{H_J}$, computed from the source, control, and joint histograms.
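
Written out with the entropy definition given on slide 18:

```latex
\mathrm{NMI}(S, C) = \frac{H_S + H_C}{H_J},
\qquad
H = -\sum_i p_i \log_2 p_i
```

where $p_i$ is estimated from the 256-bin source and control histograms and the 65536-bin joint histogram.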

SLIDE 16

Example HPC Application - GTC16

  • Global Localization

[Pipeline diagram as on slide 15]


[Images: control 382x100, tactical 258x67]

  • Objective

– Re-align the source image with the control image.

  • Method: in-house implementation

– Roughly match source and control images
– Coarse resolution
– Mask for non-valid data
– Exhaustive search

Solutions: 4250

SLIDE 17

Example HPC Application - GTC16

  • Global Localization

SLIDE 18

Example HPC Application - GTC16

  • Similarity Metric

– Normalized Mutual Information
– Histogram with masked area:

  • Missing data
  • Artifact
  • Homogeneous area

Source image and mask: $N_S \times M_S$ pixels. Control image and mask: $N_C \times M_C$ pixels. Solution space: $n \times m$ NMI coefficients.

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_i p_i \log_2 p_i$$

where $H$ is the entropy and $p_i$ the probability density function, with bins $i \in [0..255]$ for S and C and $i \in [0..65535]$ for the joint histogram J.

SLIDE 19

Example HPC Application - GTC16

Summary

  • Global Localization as coarse re-alignment

– Problem: the joint histogram must be computed for each solution

  • No compromise on the number of bins: 65536
  • Exhaustive search

– Solution: leverage the K80 specifications

  • 12 GB of memory
  • 1 thread per solution
  • Less than 25 seconds for 61K solutions on a 131K-pixel image

Kernel specifications (1 solution per thread; a sketch follows):

Occupancy                100%
Threads / block          128
Stack frame              264192 bytes
Total memory / block     33.81 MB
Total memory / SM        541.06 MB
Total memory / GPU       7.03 GB
Memory %                 61.06%
Spill stores / loads     0 / 0
Registers                27
smem / block, / SM, %    0 / 0 / 0.00%
cmem[0] / cmem[2]        448 / 20
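
A minimal sketch of this one-thread-per-solution scheme, reconstructed from the numbers above rather than taken from the production kernel; signatures, layouts, and mask handling are assumptions. Each thread owns one (dx, dy) offset and keeps its histograms in local memory, which is consistent with the ~264 KB per-thread stack frame reported in the table:

```cuda
#include <math.h>

#define JOINT_BINS 65536   // no compromise on the number of bins

// One thread evaluates one candidate offset (dx, dy) of the exhaustive search.
__global__ void globalLocalizationNMI(const unsigned char *src,
                                      const unsigned char *srcMask,
                                      const unsigned char *ctl,
                                      int srcW, int srcH, int ctlW,
                                      int nSolX, int nSolY,
                                      float *nmiOut)
{
    int sol = blockIdx.x * blockDim.x + threadIdx.x;
    if (sol >= nSolX * nSolY) return;
    int dx = sol % nSolX;
    int dy = sol / nSolX;

    // Per-thread histograms live in local memory; the 65536-bin joint
    // histogram is what drives the large per-thread stack frame.
    unsigned int hs[256] = {0}, hc[256] = {0};
    unsigned int hj[JOINT_BINS];
    for (int i = 0; i < JOINT_BINS; ++i) hj[i] = 0;

    unsigned int n = 0;
    for (int y = 0; y < srcH; ++y)
        for (int x = 0; x < srcW; ++x) {
            if (!srcMask[y * srcW + x]) continue;   // skip non-valid data
            unsigned char s = src[y * srcW + x];
            unsigned char c = ctl[(y + dy) * ctlW + (x + dx)];
            ++hs[s]; ++hc[c]; ++hj[(unsigned int)s * 256u + c]; ++n;
        }
    if (n == 0) { nmiOut[sol] = 0.f; return; }      // fully masked offset

    // NMI = (H_S + H_C) / H_J with H = -sum p log2 p.
    float eS = 0.f, eC = 0.f, eJ = 0.f;
    for (int i = 0; i < 256; ++i) {
        if (hs[i]) { float p = (float)hs[i] / n; eS -= p * log2f(p); }
        if (hc[i]) { float p = (float)hc[i] / n; eC -= p * log2f(p); }
    }
    for (int i = 0; i < JOINT_BINS; ++i)
        if (hj[i]) { float p = (float)hj[i] / n; eJ -= p * log2f(p); }

    nmiOut[sol] = (eS + eC) / eJ;
}
```

With 128 threads per block, as in the occupancy table, a 61K-solution search is about 480 blocks.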

SLIDE 20

Example HPC Application - GTC16

  • Registration

[Images: control 382x100, tactical 258x67]

[Pipeline diagram as on slide 15]

SLIDE 21

Example HPC Application - GTC16

  • Registration

[Images: control 382x100, tactical 258x67, tactical & control 4571x1555]

[Pipeline diagram as on slide 15]

  • Objective

– Refine the localization

  • Method

– Use ~400x higher resolution
– Keypoint matching

SLIDE 22

Example HPC Application - GTC16

  • Registration Workflow

[Diagram: keypoints are detected and described in the source image, search windows are detected in the control image, and matching by the metric produces a tiepoint list. Descriptors: 11x11 intensity values; search windows: 73x73 pixels.]

SLIDE 23

  • Similarity Metric

– Normalized Mutual Information
– Small “images” but numerous keypoints

  • Numerous keypoints

– up to 65536 with GPU SURF detector

  • Image / Descriptor size

– 11 x 11 intensity values to describe

  • Search area

– 73 x 73 control sub-image

  • Solution space

– 63 x 63 = 3969 solutions per keypoint

Descriptors: 11x11 intensity values. Search windows: 73x73 pixels. Solution spaces: 63x63 NMI coefficients.

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_i p_i \log_2 p_i$$

where $H$ is the entropy and $p_i$ the probability density function, with bins $i \in [0..255]$ for S and C and $i \in [0..65535]$ for the joint histogram J.

SLIDE 24

Example HPC Application - GTC16

Summary

  • Registration refines the re-alignment

– Problem: the joint histogram must be computed for each solution

  • No compromise on the number of bins: 65536
  • Exhaustive search

– Solution: leverage the K80 specifications

  • 12 GB of memory
  • 1 block per solution
  • Leverage the small number of distinct descriptor values: at most 121 << 65536
  • Less than 100 seconds for 65K keypoints (260M NMI coefficients)
  • About 10K keypoints in less than 20 seconds

Kernel: find the best match for all keypoints (a sketch follows)

– 1 block per keypoint

  • 64 threads per block (1 idle); each thread computes a “row” of solutions
  • Optimized for the 63 x 63 search windows

– Sparse joint histogram: 65536 bins but only 121 values, leveraging the 11 x 11 descriptor size

  • Create 2 lists (length 121) of intensity values: indices for the source and for the corresponding control subset
  • Update the joint histogram count from the lists
  • Loop over the lists to retrieve the aggregate count
  • Set the aggregate count to 0 after first retrieval
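
A sketch of the sparse joint-histogram scheme, reconstructed from the bullets above; names, layouts, and the final reduction are assumptions. Because one solution only touches the 121 descriptor pixels, the 65536-bin joint histogram is never materialized: two length-121 intensity lists are scanned, aggregating duplicate (source, control) pairs and consuming each pair after its first retrieval:

```cuda
#include <math.h>

#define DESC 121   // 11 x 11 descriptor values

// NMI of one descriptor against one 11x11 control patch (stride = window width).
__device__ float sparseNMI(const unsigned char *desc,
                           const unsigned char *win, int stride)
{
    unsigned char s[DESC], c[DESC];
    unsigned short hs[256] = {0}, hc[256] = {0};
    for (int i = 0; i < DESC; ++i) {
        s[i] = desc[i];
        c[i] = win[(i / 11) * stride + (i % 11)];
        ++hs[s[i]]; ++hc[c[i]];
    }

    // Sparse joint histogram: count each distinct (s, c) pair by scanning the
    // lists, marking entries consumed ("set aggregate count to 0") afterwards.
    float eJ = 0.f;
    bool used[DESC] = {false};
    for (int i = 0; i < DESC; ++i) {
        if (used[i]) continue;
        int count = 0;
        for (int j = i; j < DESC; ++j)
            if (s[j] == s[i] && c[j] == c[i]) { ++count; used[j] = true; }
        float p = count / (float)DESC;
        eJ -= p * log2f(p);
    }

    float eS = 0.f, eC = 0.f;
    for (int i = 0; i < 256; ++i) {
        if (hs[i]) { float p = hs[i] / (float)DESC; eS -= p * log2f(p); }
        if (hc[i]) { float p = hc[i] / (float)DESC; eC -= p * log2f(p); }
    }
    return eJ > 0.f ? (eS + eC) / eJ : 0.f;
}

// One block per keypoint; 64 threads, each sweeping one row of the
// 63 x 63 solution space (thread 63 idles).
__global__ void registerKeypoints(const unsigned char *descs,   // nKp x 121
                                  const unsigned char *windows, // nKp x 73*73
                                  int *bestSol /* nKp */)
{
    __shared__ float rowBest[64];
    __shared__ int rowArg[64];

    int kp = blockIdx.x, row = threadIdx.x;
    float best = -1.f; int arg = -1;
    if (row < 63) {
        const unsigned char *d = descs + kp * DESC;
        const unsigned char *w = windows + kp * 73 * 73;
        for (int col = 0; col < 63; ++col) {
            float v = sparseNMI(d, w + row * 73 + col, 73);
            if (v > best) { best = v; arg = row * 63 + col; }
        }
    }
    rowBest[row] = best; rowArg[row] = arg;
    __syncthreads();

    if (row == 0) {   // thread 0 reduces the 63 row maxima
        for (int r = 1; r < 63; ++r)
            if (rowBest[r] > best) { best = rowBest[r]; arg = rowArg[r]; }
        bestSol[kp] = arg;
    }
}
```

Scanning the two 121-entry lists costs at most 121 x 121 comparisons per solution, far cheaper than clearing and walking a 65536-bin histogram.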

SLIDE 25

Outline

  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

SLIDE 26

Compile Time Results

[Charts: compile time (seconds) and binary size (MB) for CUDA 7.5 and CUDA 8.0, by compute-capability target: OFF, 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1, 3.0-5.2, and 3.0-6.1.]

SLIDE 27

Run Time Results

[Charts: average run time (sec) for datasets D1-D4, comparing CPU, CUDA 7.5, and CUDA 8.]

SLIDE 28

K80 - Kernel Time Results in Seconds with nvprof

[Charts: Step 1 and Step 2 kernel timings (average, min, max, std) vs CUDA version (7.5 and 8) for datasets D1-D4 on the K80.]

SLIDE 29

Run Time Results

[Charts: Step 2 kernel timings (average, min, max, std) for datasets D1-D4 on the K40, K80, M4000, and P6000, for CUDA 7.5 and CUDA 8.]

SLIDE 30

Outline

  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

SLIDE 31

Lessons Learned

  • GPU isolation: we ran into an issue when swapping out the P6000 and K40.

– nvidia-smi swapped the GPU IDs of the K40 and M4000.
– This caused nvidia-docker to ignore the NV_GPU value.
– UUID vs. index: UUIDs are stable across enumeration changes.
– Our application can set the GPU index in multi-GPU environments (defaults to 0; see the sketch below).
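
A minimal sketch of that index-selection behavior; the command-line handling is our illustration, and only the default of 0 comes from the slide. Logging the PCI location next to the index helps diagnose enumeration swaps, since the PCI address does not move when nvidia-smi reorders device indices:

```cpp
// Sketch: select the GPU by index (defaulting to 0) and log its PCI location.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int count = 0;
    cudaGetDeviceCount(&count);

    // GPU index from the command line, defaulting to 0 as in the application.
    int idx = (argc > 1) ? std::atoi(argv[1]) : 0;
    if (idx < 0 || idx >= count) {
        std::fprintf(stderr, "GPU index %d out of range (%d devices)\n",
                     idx, count);
        return 1;
    }
    cudaSetDevice(idx);

    // The PCI address identifies the physical board regardless of how the
    // driver ordered the indices on this boot.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, idx);
    std::printf("using GPU %d: %s (PCI %04x:%02x:%02x)\n",
                idx, prop.name, prop.pciDomainID, prop.pciBusID,
                prop.pciDeviceID);
    return 0;
}
```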

SLIDE 32

Future Work

  • Move off desktop machines to a full testing platform with dedicated hardware and multiple GPU types
  • Investigate Docker Registry & Docker Swarm for managing containers
  • Enhance database analysis to autogenerate reports
  • Generalize the process to containerize and profile any GPU application with this architecture

SLIDE 33

Thank you!

SLIDE 34

Customer Resources

DELL C4130
GPU           4 x K80
RAM           256 GB
Cores         48
SSD Storage   400 GB

[Chart: run time with 6 threads (sec) for datasets D1-D4, CPU vs CUDA 7.5.]