Node-Level Deep Learning Input Pipeline Optimization on GPGPU-Accelerated HPC Systems
28 Mar 2018
Captain Justin Fletcher, Air Force Research Laboratory
Integrity – Service – Excellence


SLIDE 1

Distribution A. Approved for public release; distribution unlimited.

Integrity  Service  Excellence Node-Level Deep Learning Input Pipeline Optimization on GPGPU-Accelerated HPC Systems 28 Mar 2018 Captain Justin Fletcher Air Force Research Laboratory

SLIDE 2

Introduction

• Deep Learning on high performance computing (HPC) systems has unique challenges:
  – Shared (contested) I/O bandwidth
  – Distributed file systems
  – High compute density
• From this talk you should get:
  – The relationship between DL concurrency and I/O
  – A simple, effective method for hiding I/O on existing systems
  – An appreciation for the importance of specialized I/O systems for DL on HPC

SLIDE 3

Motivation

• Deep learning on modern HPC systems is bound by system-wide I/O
• TensorFlow optimization step running time on a 4x P100, 20-core POWER8 node is ~50 ms/step, when:
  – System under full, mostly DL, load
  – LeNet-5 training on MNIST (batch size 256)
  – Standard asynchronous data queue
  – Typically ~16 jobs running on separate nodes
• Loading the dataset into memory yields a ~17x speedup (~3 ms/step)
• Instrumenting the training shows exhausted input queues
• HPC is likely to remain I/O bound for the foreseeable future [1]

SLIDE 4

Impact of Queue Exhaustion

Queue Exhausted → Step Running Time Increases
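The effect the chart on this slide illustrates can be reproduced with a toy simulation (not from the slides; all rates and times below are illustrative assumptions, loosely matching the ~3 ms and ~50 ms figures from the Motivation slide): a training loop fed by an input queue whose net enqueue rate is lower than its dequeue rate eventually drains the queue, after which steps must block on I/O.

```python
# Toy simulation: a training loop consuming one batch per step from a
# buffered input queue. Producers refill the queue at a fixed (too slow)
# net rate; once the buffer is exhausted, steps stall on I/O.

def simulate(steps, queue_depth, enq_per_step, fast_step_ms, io_stall_ms):
    """Return per-step running times for a queue-fed training loop."""
    queue = queue_depth          # batches buffered ahead of the trainer
    times = []
    for _ in range(steps):
        queue += enq_per_step    # producers refill at a fixed net rate
        if queue >= 1:
            queue -= 1           # a batch is ready: GPU-bound step
            times.append(fast_step_ms)
        else:
            # queue exhausted: the step must wait for a storage read
            times.append(fast_step_ms + io_stall_ms)
    return times

times = simulate(steps=100, queue_depth=20, enq_per_step=0.5,
                 fast_step_ms=3.0, io_stall_ms=47.0)
# Early steps run fast off the buffered queue; once it drains, roughly
# every other step stalls on I/O and the mean step time rises sharply.
print(times[:5], times[-5:])
```

The buffered queue hides the I/O deficit only temporarily; the step-time increase is delayed, not avoided, which matches the slide's claim that exhaustion drives running time up.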

SLIDE 5

Causes of Queue Exhaustion: Data Parallelism

• Dequeue rate exceeds enqueue rate
• Data-parallel concurrency scheme increases dequeue rate
• Enqueue threads share storage I/O bandwidth
• More model copies = more data throughput
• Exacerbated by large data elements

[Diagram: enqueue threads 1…N read from off-node storage over NICs 1…M and fill a TF data queue (Element 1, 2, 3, …); model copies 1–4 dequeue batches from the element queue.]
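A back-of-the-envelope sketch of why data parallelism exhausts the queue (the function and all numbers here are illustrative assumptions, not measurements): each model copy dequeues independently, while every enqueue thread splits one shared storage bandwidth, so adding copies multiplies the drain rate without raising the fill rate.

```python
# Sketch: with data parallelism, aggregate dequeue rate scales with the
# number of model copies, but the enqueue side is capped by shared
# storage bandwidth. The queue drains whenever the deficit is positive.

def steps_until_exhaustion(queue_depth, shared_enqueue_rate,
                           per_copy_dequeue_rate, copies):
    """Number of batch dequeues before the queue drains, or None if it never does."""
    dequeue_rate = per_copy_dequeue_rate * copies   # batches/s consumed
    deficit = dequeue_rate - shared_enqueue_rate    # net drain, batches/s
    if deficit <= 0:
        return None                                 # producers keep up
    # seconds to drain the buffer, converted to batch dequeues
    return int(queue_depth / deficit * dequeue_rate)

# One copy: 20 batches/s in, 20 out -> the queue never drains.
print(steps_until_exhaustion(100, 20.0, 20.0, 1))   # None
# Four copies share the same storage bandwidth: 20 in, 80 out.
print(steps_until_exhaustion(100, 20.0, 20.0, 4))   # 133
```

Quadrupling the model copies here turns a stable pipeline into one that exhausts its buffer after a bounded number of dequeues, which is the failure mode the diagram depicts.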

SLIDE 6

Causes of Queue Exhaustion: Pipeline Parallelism

• Dequeue rate exceeds enqueue rate
• Model- or pipeline-parallel schemes increase dequeue rate
• Typically used when the model won’t fit on one device
• Pipelining operations increases throughput

[Diagram: enqueue threads 1…N read from off-node storage over NICs 1…M into a TF data queue; dequeued batches feed pipelined model ops 1–4 rather than independent model copies.]

SLIDE 7

Standard Approach: Increase Thread Count

• Enqueue threads asynchronously enqueue data elements
• Adding more enqueue threads:
  – Delays queue exhaustion
  – Decreases the slowdown caused by exhaustion
• We need to increase the net enqueue rate further
• We can’t increase the enqueue rate…
• So, we must decrease the dequeue rate.
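The standard approach can be sketched with plain Python threads and a bounded queue (a minimal stand-in for TensorFlow's input threads; the thread counts and latencies are illustrative assumptions): when reads are latency-dominated, more enqueue threads raise the net enqueue rate, until they start contending for shared storage bandwidth.

```python
import queue
import threading
import time

# Sketch: several enqueue threads feed one bounded queue, splitting a
# fixed number of storage reads between them. With latency-dominated
# reads, filling time drops roughly linearly in the thread count.

def run(n_threads, reads=64, read_latency_s=0.005):
    """Return wall time to enqueue `reads` batches using `n_threads` producers."""
    q = queue.Queue(maxsize=256)
    per_thread = reads // n_threads

    def enqueue_worker():
        for _ in range(per_thread):
            time.sleep(read_latency_s)   # stand-in for a storage read
            q.put("batch")

    threads = [threading.Thread(target=enqueue_worker)
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# Eight threads fill the queue much faster than one -- but on a real
# node the speedup flattens once threads saturate shared I/O bandwidth.
print(run(1), run(8))
```

This is exactly why adding threads only delays exhaustion on a contended system: the shared bandwidth cap, not the thread count, eventually bounds the net enqueue rate.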

SLIDE 8

Batch Repetition

• Artificially slow the dequeue rate by dequeuing batches less than once per step
• Allows the queue to fill up
• Trivial to implement
• Repeating batches introduces new problems:
  – The model is optimized with less new data per step
  – Your epochs per second will decrease
  – Generalization is more impacted by how representative any individual batch is of the true data distribution
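The batch-repetition idea is simple enough to sketch as a framework-agnostic training loop (`dequeue_batch` and `train_step` below are hypothetical stand-ins for the real input queue and optimizer step, not names from the talk): a fresh batch is dequeued only once per interval, and reused in between.

```python
# Sketch of batch repetition: dequeue from the (shared, slow) I/O path
# only every `batch_interval` steps, reusing the last batch otherwise.

def train(num_steps, batch_interval, dequeue_batch, train_step):
    """Run `num_steps` optimizer steps, refreshing data every `batch_interval` steps."""
    batch = None
    for step in range(num_steps):
        if step % batch_interval == 0:
            batch = dequeue_batch()   # touches the I/O path
        train_step(batch)             # every step still runs the optimizer

# Toy usage: count how often the I/O path is actually hit.
dequeues = []
train(num_steps=100, batch_interval=16,
      dequeue_batch=lambda: dequeues.append(1) or "batch",
      train_step=lambda b: None)
print(len(dequeues))   # 100 steps, but only 7 dequeues (steps 0, 16, ..., 96)
```

The enqueue side now only has to keep up with one dequeue per interval rather than one per step, which is why the queue can refill, at the cost of each step seeing less new data.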

SLIDE 9

Batch Repetition Prevents Queue Exhaustion

SLIDE 10

Batch Interval Impact on Net Enqueue Rate

• Running time is inversely proportional to net enqueue rate
• Validates the hypothesis that training was I/O bound
• We get diminishing returns for batch intervals > 16
• Batch intervals allow for better throughput with fewer threads
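The diminishing returns fall out of simple arithmetic: reusing a batch for an interval of steps amortizes the I/O cost over that interval, so the effective per-step I/O cost scales as 1/interval. A small sketch (the 47 ms and 3 ms figures are illustrative, chosen to match the ~50 ms loaded and ~3 ms in-memory step times from the Motivation slide):

```python
# Sketch: amortized I/O cost per step when a batch is reused
# `interval` times. The savings curve flattens quickly.

def effective_io_per_step(base_io_ms, interval):
    """I/O milliseconds attributed to each step at a given batch interval."""
    return base_io_ms / interval

for interval in (1, 2, 4, 8, 16, 32, 64):
    io = effective_io_per_step(47.0, interval)
    print(f"interval {interval:2d}: ~{3.0 + io:5.1f} ms/step")
```

Under these numbers, going from interval 1 to 16 removes about 44 ms/step of I/O, while going from 16 to 64 removes barely 2 ms more, consistent with the observed knee around an interval of 16.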

SLIDE 11

Batch Interval & Model Performance

SLIDE 12

Summary

• HPC systems are structurally likely to be I/O bound for DL workloads
• Repeating batches for an interval of steps can hide I/O latency and keep the GPUs fed
• Small refresh intervals don’t impact converged optimization, but decrease runtime
• If you want to talk more, ask me about my circular data queues

SLIDE 13

SLIDE 14

HPCMP Distro A Source