Distribution A. Approved for public release; distribution unlimited.
Node-Level Deep Learning Input Pipeline Optimization on GPGPU-Accelerated HPC Systems
28 Mar 2018
Captain Justin Fletcher, Air Force Research Laboratory
Integrity - Service - Excellence
Introduction
Deep learning on high performance computing (HPC) systems has unique challenges:
- Shared (contested) I/O bandwidth
- Distributed file systems
- High compute density

From this talk you should get:
- The relationship between DL concurrency and I/O
- A simple, effective method for hiding I/O on existing systems
- An appreciation for the importance of specialized I/O systems for DL on HPC
Motivation
Deep learning on modern HPC systems is bound by system-wide I/O.

TensorFlow optimization step running time on a 4x P100, 20-core Power8 node is ~50 ms/step, when:
- The system is under full, mostly DL, load
- Training LeNet-5 on MNIST (batch size 256)
- Using the standard asynchronous data queue
- Typically ~16 jobs are running on separate nodes

Loading the dataset into memory yields a ~17x speedup (~3 ms/step).

Instrumenting the training shows exhausted input queues.

HPC is likely to remain I/O bound for the foreseeable future [1].
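The queue dynamics behind these measurements can be sketched with a small discrete-time simulation. All numbers here are hypothetical and chosen only to illustrate the mechanism: when the aggregate dequeue rate exceeds the aggregate enqueue rate, the input queue drains and eventually every training step stalls on I/O.

```python
# Illustrative simulation of input-queue exhaustion. Rates are in
# elements per training step and are assumed values, not measurements.

def simulate_queue(capacity, enqueue_rate, dequeue_rate, steps):
    """Track queue depth step by step. Enqueue is capped by queue capacity;
    a step whose dequeue cannot be satisfied counts as exhausted (stalled)."""
    depth = capacity  # start with a full queue
    exhausted_steps = 0
    for _ in range(steps):
        depth = min(capacity, depth + enqueue_rate)
        if depth < dequeue_rate:
            exhausted_steps += 1  # the training step stalls waiting on I/O
            depth = 0
        else:
            depth -= dequeue_rate
    return exhausted_steps

# Hypothetical node: 4 model copies each consuming a 256-element batch per
# step, fed by enqueue threads sustaining only ~900 elements/step from
# shared storage. The queue drains, then nearly every later step stalls.
print(simulate_queue(capacity=8192, enqueue_rate=900,
                     dequeue_rate=4 * 256, steps=200))
```

Once the enqueue rate matches or exceeds the dequeue rate, the same simulation reports zero exhausted steps, which is the regime the rest of this talk aims for.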
Impact of Queue Exhaustion
Queue Exhausted → Step Running Time Increases
Causes of Queue Exhaustion: Data Parallelism
- Dequeue rate exceeds enqueue rate
- A data parallel concurrency scheme increases the dequeue rate
- Enqueue threads share storage I/O bandwidth
- More model copies = more data throughput
- Exacerbated by large data elements
(Diagram: enqueue threads 1..N pull data from off-node storage over NICs 1..M into the TF data queue ("Data Q"); a dequeue op batches elements into the element queue ("Element Q") feeding model copies 1-4.)
Causes of Queue Exhaustion: Pipeline Parallelism
- Dequeue rate exceeds enqueue rate
- Model or pipeline parallel schemes increase the dequeue rate
- Typically used when the model won't fit on one device
- Pipelines operations, increasing throughput
(Diagram: the same enqueue-thread, NIC, and TF queue structure as above, now feeding pipelined model ops 1-4 instead of model copies.)
Standard Approach: Increase Thread Count
- Enqueue threads asynchronously enqueue data elements
- Adding more enqueue threads:
  - Delays queue exhaustion
  - Decreases the slowdown caused by exhaustion
- We need to increase the net enqueue rate further
- We can't increase the enqueue rate...
- So, we must decrease the dequeue rate.
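Why more threads can't solve this on their own: the threads share the node's storage I/O bandwidth, so the aggregate enqueue rate plateaus. A small arithmetic sketch, with all rates assumed for illustration:

```python
# Hypothetical rates (elements per training step) illustrating the plateau:
# per-thread throughput adds up only until the shared storage bandwidth cap.

SHARED_BANDWIDTH = 1000  # cap the storage system can sustain (assumed)
PER_THREAD_RATE = 150    # what one uncontended enqueue thread delivers (assumed)
DEQUEUE_RATE = 1024      # consumption: 4 model copies x batch size 256

def net_enqueue_rate(threads):
    # Aggregate enqueue rate is capped by shared I/O bandwidth,
    # so beyond ~7 threads adding more changes nothing.
    return min(threads * PER_THREAD_RATE, SHARED_BANDWIDTH) - DEQUEUE_RATE

for n in (2, 4, 8, 16):
    print(n, net_enqueue_rate(n))
```

Under these assumed numbers the net enqueue rate improves with thread count but stays negative and flat past the bandwidth cap, so the queue still drains; only lowering the dequeue rate can flip the sign.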
Batch Repetition

Artificially slow the dequeue rate by dequeuing batches less than once per step.
- Allows the queue to fill up
- Trivial to implement

Repeating batches introduces new problems:
- The model is optimized with less new data per step
- Your epochs per second will decrease
- Generalization is more impacted by how representative any individual batch is of the true data distribution
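The idea reduces to a few lines in the training loop. This is a minimal sketch using Python's standard `queue` module in place of a TF queue; `train_step` and `batch_interval` are hypothetical names, not the talk's actual code.

```python
import queue

def train(data_q, train_step, num_steps, batch_interval):
    """Dequeue a fresh batch only every `batch_interval` steps and reuse it
    in between, giving the queue time to refill from slow storage."""
    batch = None
    for step in range(num_steps):
        if batch is None or step % batch_interval == 0:
            batch = data_q.get()  # blocking dequeue, as in the standard path
        train_step(batch)

# Toy usage: 64 steps at interval 16 consume only 4 distinct batches.
q = queue.Queue()
for i in range(4):
    q.put(f"batch-{i}")
consumed = []
train(q, consumed.append, num_steps=64, batch_interval=16)
print(len(consumed), len(set(consumed)))
```

The usage run performs 64 optimization steps but only 4 dequeues, which is exactly the dequeue-rate reduction the slide describes.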
Batch Repetition Prevents Queue Exhaustion
Batch Interval Impact on Net Enqueue Rate
- Running time is inversely proportional to the net enqueue rate
- Validates the hypothesis that training was I/O bound
- We get diminishing returns for batch intervals > 16
- Batch intervals allow better throughput with fewer threads
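The diminishing returns fall out of the arithmetic: a batch interval of k divides the effective dequeue rate by k, so each doubling of k recovers half as much as the previous one. A sketch with the same assumed rates as the earlier examples:

```python
# Amortized view of batch repetition (all rates assumed, elements/step):
# with interval k the queue is drained once every k steps, so the
# effective dequeue rate is DEQUEUE_RATE / k.

ENQUEUE_RATE = 600   # aggregate rate from the enqueue threads (assumed)
DEQUEUE_RATE = 1024  # consumption at batch interval 1

for k in (1, 2, 4, 8, 16, 32):
    net = ENQUEUE_RATE - DEQUEUE_RATE / k
    print(k, net)
```

Under these numbers the net rate flips positive at k = 2, and the gain from 16 to 32 is already small, consistent with the diminishing returns observed past interval 16.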
Batch Interval & Model Performance
Summary
- HPC systems are structurally likely to be I/O bound for DL workloads
- Repeating batches for an interval of steps can hide I/O latency and keep the GPUs fed
- Small refresh intervals don't impact converged optimization, but decrease running time
- If you want to talk more, ask me about my circular data queues