High-Performance Broadcast Designs for Streaming Applications on Multi-GPU InfiniBand Clusters
GPU Technology Conference (GTC 2017)
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu


SLIDE 1

High-Performance Broadcast Designs for Streaming Applications on Multi-GPU InfiniBand Clusters

GPU Technology Conference (GTC 2017)

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

SLIDE 2

GTC 2017 2 Network Based Computing Laboratory

Streaming Applications

• Examples: surveillance, habitat monitoring, etc.
• Require efficient transport of data from/to distributed sources/sinks
• Sensitive to latency and throughput metrics
• Require HPC resources to efficiently carry out compute-intensive tasks

SLIDE 3

Nature of Streaming Applications

• Pipelined data-parallel compute phases
  – Form the crux of streaming applications; lend themselves to GPGPUs
• Data distribution to GPGPU sites
  – Over PCIe within the node
  – Over InfiniBand interconnects across nodes
• Back-to-back broadcast operation
  – Key dictator of the throughput of streaming applications

[Figure: a data source feeds a data distributor, which performs data streaming-like broadcast operations in real time to worker nodes (each with a CPU and GPUs) on HPC resources]

SLIDE 4

Drivers of Modern HPC Cluster Architectures

  • Multi-core/many-core technologies
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
  • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
  • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

[Figure: drivers of modern cluster nodes — accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); high-performance interconnects (InfiniBand: <1 µs latency, 100 Gbps bandwidth); multi-core processors; SSD, NVMe-SSD, NVRAM. Example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight]

SLIDE 5

Large-scale InfiniBand Installations

• 187 IB clusters (37%) in the Nov '16 Top500 list (http://www.top500.org)
• Installations in the Top 50 (15 systems):
  – 241,108 cores (Pleiades) at NASA/Ames (13th)
  – 220,800 cores (Pangea) in France (16th)
  – 462,462 cores (Stampede) at TACC (17th)
  – 144,900 cores (Cheyenne) at NCAR/USA (20th)
  – 72,800 cores Cray CS-Storm in US (25th)
  – 72,800 cores Cray CS-Storm in US (26th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (27th)
  – 60,512 cores (DGX SATURNV) at NVIDIA/USA (28th)
  – 72,000 cores (HPC2) in Italy (29th)
  – 152,692 cores (Thunder) at AFRL/USA (32nd)
  – 147,456 cores (SuperMUC) in Germany (36th)
  – 86,016 cores (SuperMUC Phase 2) in Germany (37th)
  – 74,520 cores (Tsubame 2.5) at Japan/GSIC (40th)
  – 194,616 cores (Cascade) at PNNL (44th)
  – 76,032 cores (Makman-2) at Saudi Aramco (49th)
  – 72,000 cores (Prolix) at Meteo France, France (50th)
  – 73,440 cores (Beaufix2) at Meteo France, France (51st)
  – 42,688 cores (Lomonosov-2) at Russia/MSU (52nd)
  – 60,240 cores SGI ICE X at JAEA, Japan (54th)
  – and many more!

SLIDE 6

InfiniBand Networking Technology

• Introduced in Oct 2000
• High-performance point-to-point data transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 µs), high bandwidth (up to 25 GB/s -> 200 Gbps), and low CPU utilization (5-10%)
• Multiple features
  – Offloaded Send/Recv, RDMA Read/Write, atomic operations
  – Hardware multicast support through Unreliable Datagram (UD)
    • A message sent from a single source can reach all destinations in a single pass over the network through switch-based replication
    • Restricted to one MTU; large messages need to be sent in a chunked manner
    • Reliability needs to be addressed
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …
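Since UD-based hardware multicast is restricted to one MTU, a large broadcast payload has to be split into MTU-sized, sequence-numbered packets and reassembled at every receiver. A minimal sketch of that chunking in Python (the 4 KB MTU and function names are illustrative assumptions, not the actual verbs):

```python
MTU = 4096  # assumed UD payload size for illustration; real MTUs range 256 B - 4 KB

def chunk(payload: bytes, mtu: int = MTU):
    """Split a broadcast payload into MTU-sized multicast packets,
    tagging each with a sequence number for reassembly and reliability."""
    return [(seq, payload[off:off + mtu])
            for seq, off in enumerate(range(0, len(payload), mtu))]

def reassemble(packets):
    """Receiver side: order packets by sequence number and concatenate."""
    return b"".join(data for _, data in sorted(packets))

msg = bytes(10_000)
pkts = chunk(msg)
assert len(pkts) == 3                 # 10,000 bytes -> three UD packets
assert reassemble(pkts) == msg        # lossless round trip
```

Sequence numbers are what make reliability addressable at all: a receiver can detect a gap and request retransmission of exactly the missing packet.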

SLIDE 7

InfiniBand Hardware Multicast Example

SLIDE 8

Multicast-aware CPU-based MPI_Bcast on Stampede using MVAPICH2 (6K nodes with 102K cores)

[Plots: MPI_Bcast latency (µs), Default vs Multicast, at 102,400 cores — small messages (2 B-512 B) and large messages (2 KB-128 KB); latency vs number of nodes for 16-byte and 32-KByte messages. Platform: ConnectX-3 FDR (54 Gbps), 2.7 GHz dual octa-core (Sandy Bridge) Intel, PCIe Gen3, Mellanox IB FDR switch]

SLIDE 9

GPUDirect RDMA (GDR) and CUDA-Aware MPI

• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect RDMA support
  – GPU-to-GPU direct transfer
  – Bypasses host memory
  – Hybrid design to avoid PCIe bottlenecks

[Figure: GPU and CPU attached to the chipset; the InfiniBand HCA reaches GPU memory directly]
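The benefit of the host-based pipeline can be seen with a toy latency model (illustrative Python; the per-chunk costs are made-up numbers): overlapping each chunk's IB transfer with the next chunk's CUDA copy drives the steady-state cost per chunk toward max(copy, send) instead of copy + send.

```python
def staged_time(n_chunks: int, t_copy: float, t_send: float) -> float:
    """No overlap: copy every chunk to the host, then send every chunk."""
    return n_chunks * t_copy + n_chunks * t_send

def pipelined_time(n_chunks: int, t_copy: float, t_send: float) -> float:
    """Overlap chunk i's IB send with chunk i+1's D2H copy.
    After filling the pipeline, each chunk costs max(t_copy, t_send)."""
    return t_copy + t_send + (n_chunks - 1) * max(t_copy, t_send)

# e.g. 8 chunks, 10 us per CUDA copy, 10 us per IB send (assumed values)
assert staged_time(8, 10, 10) == 160
assert pipelined_time(8, 10, 10) == 90   # nearly 2x with balanced stages
```

With balanced stage times the pipeline approaches a 2x improvement; when one stage dominates, its cost bounds the whole transfer, which is the motivation for bypassing the host entirely with GDR.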

SLIDE 10

Performance of MVAPICH2-GDR with GPUDirect RDMA (GDR)

[Plots: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth for MV2-GDR2.2 vs MV2-GDR2.0b vs MV2 w/o GDR across 1 B-4 KB messages; latency reaches 2.18 µs, with improvements of roughly 2x over MV2-GDR2.0b and up to 10x-11x over MV2 w/o GDR]

Platform: MVAPICH2-GDR-2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPUDirect RDMA

More details in the 2:00pm session today: S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

SLIDE 11

Multicasting Data from One GPU to Other GPUs: Shortcomings

• Host-Staged Multicast (HSM): traditional short-message broadcast operation between GPUs
  – Data copied from the GPU to host memory
  – Uses InfiniBand UD-based hardware multicast
• Sub-optimal use of the near-scale-invariant UD-multicast performance
• PCIe resources are wasted and the benefits of multicast are nullified
• GPUDirect RDMA capabilities unused

SLIDE 12

Problem Statement

• Can we design a GPU broadcast mechanism that delivers low latency and high throughput for streaming applications?
• Can we combine GDR and MCAST features to
  – Achieve the best performance?
  – Free up Host-Device PCIe bandwidth for application needs?
• Can such a design be extended to support heterogeneous configurations?
  – Host-to-Device (H2D): most common in streaming applications, e.g., a camera connected to the host with devices used for computation
  – Device-to-Device (D2D)
  – Device-to-Host (D2H)
• Can we design an efficient MCAST-based broadcast for multi-GPU systems?
• Can we design efficient reliability support on top of the UD-based MCAST broadcast?
• How much performance benefit can be achieved with the new designs?

SLIDE 13

Existing Protocol for GPU Multicast

• Copy user GPU data to host buffers (cudaMemcpy D2H)
• Perform multicast
• Copy the data back to the user GPU buffer (cudaMemcpy H2D)
• Drawbacks:
  – cudaMemcpy dictates performance
  – Requires PCIe Host-Device resources

[Figure: the GPU user buffer is staged through a host vbuf via cudaMemcpy; the HCA multicasts the host copy onto the network]

SLIDE 14

Enhanced Solution #1: GDRCOPY-based Design

• Copy user GPU data to host buffers using the GDRCOPY module*
• Perform multicast
• Copy the data back to the user GPU buffer using the GDRCOPY module
• Drawbacks:
  – The D-H operation limits performance
  – Can we avoid GDRCOPY for D-H copies?

[Figure: the GPU user buffer is staged through a host vbuf via GDRCOPY; the HCA multicasts the host copy onto the network]

*https://github.com/NVIDIA/gdrcopy

SLIDE 15

Enhanced Solution #2: (GDRCOPY + Loopback)-based Design

• Copy user GPU data to host buffers using a loopback scheme
• Perform multicast
• Copy the data back to the GPU using the GDRCOPY scheme
• Good performance for both H-D and D-H copies
  – Good performance expected only for small messages
• Still uses the PCIe H-D resources

[Figure: the GPU user buffer is staged through a host vbuf via an HCA loopback read; the HCA multicasts the host copy, and GDRCOPY writes it back to the GPU]
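One way to read the trade-off across Solutions #1 and #2 is as per-direction, per-size scheme selection: GDR reads (D-H) degrade quickly with size, GDR writes (H-D) stay fast somewhat longer, and large messages fall back to a pipelined cudaMemcpy. A sketch of such selection logic; the threshold values and the function name are illustrative assumptions, not MVAPICH2's actual tuning:

```python
# Hypothetical crossover sizes, for illustration only; real values are tuned per platform.
LOOPBACK_D2H_LIMIT = 8 * 1024    # beyond this, D2H staging reverts to cudaMemcpy
GDRCOPY_H2D_LIMIT = 32 * 1024    # GDR writes remain competitive longer than reads

def pick_copy_scheme(direction: str, nbytes: int) -> str:
    """Choose how to stage a GPU<->host copy in the broadcast path
    (Enhanced Solution #2 style): loopback for small D2H, GDRCOPY for
    small H2D, pipelined cudaMemcpy otherwise."""
    if direction == "D2H":
        return "loopback" if nbytes <= LOOPBACK_D2H_LIMIT else "cudaMemcpy"
    if direction == "H2D":
        return "gdrcopy" if nbytes <= GDRCOPY_H2D_LIMIT else "cudaMemcpy"
    raise ValueError(f"unknown direction: {direction}")

assert pick_copy_scheme("D2H", 4096) == "loopback"
assert pick_copy_scheme("H2D", 4096) == "gdrcopy"
assert pick_copy_scheme("D2H", 1 << 20) == "cudaMemcpy"
```

The asymmetry between the two limits is the point of the slide: the D-H direction is the bottleneck, which is why the design question "can we avoid staging through the host at all?" comes next.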

SLIDE 16

Can We Do Better?

• How to design an efficient and reliable host-to-device broadcast operation for streaming applications on multi-GPU-node systems?
• Challenges
  – How to handle the heterogeneity of the configuration, including H2D broadcast?
  – Can we have topology-aware broadcast designs on multi-GPU nodes?
  – Can we enhance the reliability support for streaming applications?
  – Can we mimic such behavior at the benchmark level?
    • Mimic the need for PCIe H-D resources at the application level
    • Demonstrate the benefits of such designs on such application patterns

SLIDE 17

Three Major Solutions

• Handling efficient broadcast on multi-GPU-node systems
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
• Providing reliability support
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
• Optimizing broadcast for multi-source streaming
  – C.-H. Chu, X. Lu, A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP '17, Aug 2017 (accepted for presentation).

SLIDE 18

SL-based Design for Heterogeneous Configuration (H2D)

• Combining MCAST+GDR hardware features for heterogeneous configurations:
  – Source on the host and destinations on the device
  – SL design: scatter at the destination
    • Source: data and control on the host
    • Destinations: data on the device and control on the host
  – Combines IB MCAST and GDR features at the receivers
  – CUDA IPC-based topology-aware intra-node broadcast
  – Minimizes use of PCIe resources
  – Maximizes availability of PCIe Host-Device resources

[Figure: the source node's HCA multicasts through the IB switch; at each receiving node, an IB SL (scatter) step places the data directly in GPU memory via GDR while control stays on the host]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.

SLIDE 19

Intra-node Topology-aware (Hybrid SL+IPC) Design for Multi-GPU Node Configuration

• Socket-based leader (1 HCA per socket)
• Control synchronization through host shared memory
  – Polling on a shared flag
  – Reading the buffer addresses
• IPC read of the GPU data
  – Direct (RMA-semantics) IPC read
  – IPC reads with other access patterns in the future: k-nomial tree, ring structure

[Figure: the source multicasts through the IB switch to each node's leader (SL MCAST); within a node, GPU 0 … GPU N pick up the data through CUDA IPC / cudaMemcpy (D2D)]
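The control path above — the leader publishes a buffer address in host shared memory and raises a flag, then peers poll the flag and IPC-read the leader's GPU buffer — can be mimicked with threads standing in for processes (illustrative Python, not the actual MVAPICH2 implementation):

```python
import threading

shared = {"flag": 0, "buf_addr": None}   # stands in for the host shared-memory segment
gpu_buffers = {}                          # stands in for device memory reachable via CUDA IPC
results = {}

def leader():
    gpu_buffers["leader_buf"] = b"broadcast-payload"   # data landed via SL MCAST + GDR
    shared["buf_addr"] = "leader_buf"                  # publish the IPC handle/address first,
    shared["flag"] = 1                                 # then raise the flag

def peer(rank):
    while shared["flag"] == 0:                         # poll on the shared flag
        pass
    results[rank] = gpu_buffers[shared["buf_addr"]]    # direct IPC read of the leader's buffer

peers = [threading.Thread(target=peer, args=(r,)) for r in range(3)]
for t in peers:
    t.start()
leader()
for t in peers:
    t.join()
assert all(results[r] == b"broadcast-payload" for r in range(3))
```

The ordering matters: the address must be visible before the flag is set, which is exactly why the real design synchronizes control through shared host memory rather than letting peers race on the data buffer itself.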

SLIDE 20

SL-based Design for H-D Heterogeneous Support

• Redesigned broadcast benchmark with the root buffer on the host and non-root buffers on the device
• Inter-node experiments @ Wilkes cluster, 32 GPUs, 1 GPU/node

[Plots: broadcast latency (µs) vs message size for SL-MCAST, SGL-MCAST, and Host-MCAST; SL-MCAST reduces latency by up to 56% for small messages and 39% for large messages. Lower is better]

SLIDE 21

Evaluation of the Topology-aware (SL+IPC) Design

• Evaluates H-D heterogeneous support
• Mixed inter-node and intra-node experiments @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node

[Plots: latency (µs) vs message size for IPC SL-MCAST vs SHMEM SL-MCAST; the IPC design reduces latency by up to 58% for small messages and 79% for large messages. Lower is better]

SLIDE 22

Scalability Evaluation of the Proposed Design

• Inter-node experiments @ Wilkes cluster, 32 GPUs, 1 GPU/node
  – 1 KB messages
• Maintains good scalability while yielding up to a 64% reduction in latency

[Plot: latency (µs) vs system size (2-32 GPU nodes) for SL-MCAST, SGL-MCAST, and Host-MCAST; lower is better]

SLIDE 23

Benefits of the Availability of Host-Device PCIe Resources

• Mimics the behavior of streaming applications @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Broadcast operations overlapped with application-level Host-Device transfers
• Maintains near-peak throughput over all message sizes

[Plot: throughput (GB/s) vs message size (1 B-4 MB) for IPC SL-MCAST vs SHMEM SL-MCAST; up to 3.2x higher throughput. Higher is better]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.

SLIDE 24

Three Major Solutions

• Handling efficient broadcast on multi-GPU-node systems
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
• Providing reliability support
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
• Optimizing broadcast for multi-source streaming
  – C.-H. Chu, X. Lu, A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP '17, Aug 2017 (accepted for presentation).

SLIDE 25

Efficient Reliability Support for MCAST-based Broadcast

• Remote Memory Access (RMA)-based design
  – The sender maintains a backup buffer for the MCAST packets, so the sender is not interrupted
  – On a timeout, the receiver performs an RMA Get on the sender's backup buffer to retrieve lost MCAST packets

[Figure: timeline between the broadcast sender's and receiver's HCAs; a lost multicast packet triggers a receiver-side timeout followed by an RMA Get, with no sender-side involvement]
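A toy model of that recovery path (illustrative Python; the multicast drop, timeout-driven gap detection, and RMA Get are all simulated):

```python
def multicast(packets, drop_seqs):
    """Unreliable UD multicast: some packets never arrive at this receiver."""
    return {seq: data for seq, data in packets if seq not in drop_seqs}

def rma_get(backup, seq):
    """Receiver-initiated RMA Get from the sender's backup buffer;
    the sender's CPU is never interrupted to retransmit."""
    return backup[seq]

# Sender side: multicast the chunks and keep a backup copy of each one.
packets = [(0, b"aa"), (1, b"bb"), (2, b"cc"), (3, b"dd")]
backup = dict(packets)

# Receiver side: packet 2 is lost on the wire.
received = multicast(packets, drop_seqs={2})
for seq, _ in packets:
    if seq not in received:               # gap detected after the timeout expires
        received[seq] = rma_get(backup, seq)

assert b"".join(received[s] for s in sorted(received)) == b"aabbccdd"
```

The contrast with a NACK-based scheme is visible even in the sketch: here recovery is one one-sided read by the receiver, whereas a NACK would require the sender to notice the request and retransmit, adding sender-side work precisely when it is streaming the next frames.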

SLIDE 26

Evaluation: Efficient Reliability Design

• Evaluates the RMA-based reliability support for the SL-based MCAST design @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Negligible overhead
  – The RMA-based design performs better than the NACK-based scheme for large messages

[Plots: latency (µs) vs message size (1 B-8 KB and 16 KB-4 MB) for w/o reliability, NACK, and RMA]

SLIDE 27

Benefits of the RMA-based Reliability Design

• Latency reduction compared to the existing NACK-based scheme
  – Normalized to SL-based MCAST with the NACK-based retransmission scheme

[Plots: normalized latency vs message size at injected error rates of 0.01%, 0.1%, and 1%]

Latency reduction by message size and error rate:

  Error Rate    8 KB    128 KB    2 MB
  0.01%         16%     31%       11%
  0.1%          21%     36%       19%
  1%            24%     21%       10%

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.

SLIDE 28

Three Major Solutions

• Handling efficient broadcast on multi-GPU-node systems
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
• Providing reliability support
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
• Optimizing broadcast for multi-source streaming
  – C.-H. Chu, X. Lu, A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP '17, Aug 2017 (accepted for presentation).

SLIDE 29

Optimized Design for Multi-Source Streaming

• Optimizing the MCAST+GDR broadcast:
  – Source and destination buffers are on the GPU device
    • Typically very large messages (>1 MB)
  – Pipelining data from device to host
    • Avoids the GDR read limit
    • Leverages the high-performance SL design
  – Combines IB MCAST and GDR features
  – Minimizes use of PCIe resources on the receiver side
  – Maximizes availability of PCIe Host-Device resources
• Steps:
  1. Pipelined data movement (device to host at the source)
  2. IB Gather
  3. IB hardware multicast
  4. IB Scatter + GDR Write

[Figure: the source node gathers the header and pipelined data chunks through its HCA, the IB switch replicates them, and each receiving node scatters the header to the host and GDR-writes the data into GPU memory]
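The four steps can be sketched as a per-chunk loop (illustrative Python; the 1 MB chunk size and the stand-in operations are assumptions, not the library's internals):

```python
CHUNK = 1 << 20  # assumed 1 MB pipeline unit

def mcast_gdr_opt_bcast(src_gpu: bytes, n_receivers: int):
    """Multi-source streaming broadcast, one chunk at a time:
    pipeline D2H at the source, gather header + data at the HCA,
    hardware-multicast, then scatter + GDR-write at each receiver."""
    recv_gpu = [bytearray() for _ in range(n_receivers)]
    for off in range(0, len(src_gpu), CHUNK):
        host_chunk = src_gpu[off:off + CHUNK]          # 1. pipelined D2H movement
        packet = {"header": off, "data": host_chunk}   # 2. IB gather (header + data)
        for r in range(n_receivers):                   # 3. IB hardware multicast
            recv_gpu[r] += packet["data"]              # 4. scatter + GDR write to GPU
    return recv_gpu

payload = bytes(3 * (1 << 20) + 5)   # >1 MB, so the pipeline actually kicks in
out = mcast_gdr_opt_bcast(payload, 4)
assert all(buf == payload for buf in out)
```

Chunking the D2H stage at the source is what sidesteps the GDR read limit: no single GDR read of the full large buffer is ever needed, and each chunk's multicast overlaps the next chunk's copy, exactly as in the SL pipeline.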

SLIDE 30

Analysis of the Optimized Design

• Pipelined MCAST+GDR design
  – Pipelines data from device to host on the source node
    • Streaming broadcast
  – Leverages the high-performance SL-based design
    • High scalability
    • High overlap between multiple broadcast calls

SLIDE 31

Benchmark Evaluation

• @ OSU RI2 cluster, 16 NVIDIA K80 GPUs, 1 GPU/node
• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages

[Plots: latency (µs) vs message size (4 KB-16 MB) and vs number of GPU nodes (2-16, 2 MB message) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; lower is better]
SLIDE 32

Application Evaluation: Deep Learning Frameworks

• @ OSU RI2 cluster, 16 NVIDIA K80 GPUs, 1 GPU/node
  – Microsoft Cognitive Toolkit (CNTK) with CUDA-Aware MPI*
• Reduces training time by up to 24% (AlexNet) and 15% (VGG)
  – Average training time of one epoch
• Higher improvement can be observed for larger system sizes

[Plots: training time (s) on 8 and 16 GPU nodes for the AlexNet and VGG models with MV2-GDR-Knomial vs MV2-GDR-Ring; improvements of 15%/24% (AlexNet) and 6%/15% (VGG); lower is better]

*D. Banerjee, K. Hamidouche, and D. K. Panda, "Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom '16, Dec 2016.

SLIDE 33

Conclusions

• The IB MCAST feature provides high scalability and low latency
• The NVIDIA GDR feature provides direct access between IB and GPUs
• MVAPICH2-GDR provides schemes to efficiently broadcast from/to GPU memories using host-staged techniques
• Presented a set of designs coupling the GDR and IB MCAST features for
  – Heterogeneous systems
  – Multi-GPU systems
  – Single-source and multi-source streaming
• The new designs will be available in a future MVAPICH2-GDR release

SLIDE 34

Two Additional Talks

• S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
  – Day: Today, 05/11; Time: 14:00-14:50; Location: Room 211B
• S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
  – Day: Today, 05/11; Time: 15:00-15:25; Location: Room 211B

SLIDE 35

Thank You!

panda@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/