Node-Level Deep Learning Input Pipeline Optimization on GPGPU-Accelerated HPC Systems


  1. Node-Level Deep Learning Input Pipeline Optimization on GPGPU-Accelerated HPC Systems
     28 Mar 2018
     Captain Justin Fletcher, Air Force Research Laboratory
     Integrity - Service - Excellence
     Distribution A. Approved for public release; distribution unlimited.

  2. Introduction
      Deep learning on high performance computing (HPC) systems has unique challenges:
        Shared (contested) I/O bandwidth
        Distributed file systems
        High compute density
      From this talk you should get:
        The relationship between DL concurrency and I/O
        A simple, effective method for hiding I/O on existing systems
        An appreciation for the importance of specialized I/O systems for DL on HPC

  3. Motivation
      Deep learning on modern HPC systems is bound by system-wide I/O
      TensorFlow optimization step running time on a 4x P100, 20-core POWER8 node is ~50 ms/step, when:
        The system is under full, mostly DL, load
        Training LeNet-5 on MNIST (batch size 256)
        Using the standard asynchronous data queue (a minimal sketch follows below)
        Typically ~16 jobs are running on separate nodes
      Loading the dataset into memory yields a ~17x speedup (~3 ms/step)
      Instrumenting the training shows exhausted input queues
      HPC is likely to remain I/O bound for the foreseeable future [1]
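The slides do not include code, but the "standard asynchronous data queue" baseline roughly corresponds to a TensorFlow 1.x FIFOQueue fed by QueueRunner threads. The sketch below is illustrative only: the queue capacity, element shapes, and the read_element() reader are assumptions, not the setup actually used in the experiments.

    import tensorflow as tf

    # Queue holding individual (image, label) elements; enqueue threads fill it
    # asynchronously while the training loop dequeues batches.
    queue = tf.FIFOQueue(capacity=2048,
                         dtypes=[tf.float32, tf.int64],
                         shapes=[[28, 28, 1], []])

    def read_element():
        # Stand-in for reading one example from off-node storage
        # (e.g. decoding a TFRecord); details depend on the dataset layout.
        image = tf.zeros([28, 28, 1], tf.float32)
        label = tf.zeros([], tf.int64)
        return image, label

    # N enqueue threads share the node's storage I/O bandwidth.
    enqueue_op = queue.enqueue(read_element())
    qr = tf.train.QueueRunner(queue, [enqueue_op] * 4)
    tf.train.add_queue_runner(qr)

    # The optimization step dequeues one batch per step.
    images, labels = queue.dequeue_many(256)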

  4. Impact of Queue Exhaustion
      Queue exhausted → step running time increases

  5. Causes of Queue Exhaustion: Data Parallelism
     [Diagram: enqueue threads 1..N read data elements from off-node storage through the NICs into a shared TF data queue; dequeued batches feed model copies 1..M]
      Dequeue rate exceeds enqueue rate
      Data parallel concurrency scheme increases dequeue rate (see the sketch after this slide)
      Enqueue threads share storage I/O bandwidth
      More model copies = more data throughput
      Exacerbated by large data elements
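For illustration only, the following sketch shows how a data-parallel scheme multiplies the dequeue rate: each model copy dequeues its own batch per step from the shared queue built in the earlier sketch. The stand-in model, tower count, and batch size are assumptions, not the talk's configuration.

    import tensorflow as tf  # continues the queue sketch above

    tower_losses = []
    for i in range(4):                          # e.g. one model copy per P100
        with tf.device('/gpu:%d' % i):
            # Each copy dequeues its own batch, multiplying the dequeue rate.
            images, labels = queue.dequeue_many(256)
            with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
                # Stand-in model; a real LeNet-5 would go here.
                logits = tf.layers.dense(tf.layers.flatten(images), 10)
            tower_losses.append(
                tf.losses.sparse_softmax_cross_entropy(labels, logits))
    # The per-copy gradients would then be averaged and applied once per step.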

  6. Causes of Queue Exhaustion: Pipeline Parallelism
     [Diagram: same input path as before, but dequeued batches feed a pipeline of model operations 1..M spread across devices rather than independent model copies]
      Dequeue rate exceeds enqueue rate
      Model or pipeline parallel schemes increase dequeue rate (sketched below)
      Typically used when the model won't fit on one device
      Pipelines operations, increasing throughput
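Again for illustration, a minimal model/pipeline-parallel split in TensorFlow 1.x places early and late operations on different devices; with the stages pipelined, each stage can hold a different in-flight batch, which again raises the rate at which batches are dequeued. The layer choices and device placement below are assumptions.

    import tensorflow as tf  # continues the queue sketch above

    # Early operations on one device, later ones on another.
    with tf.device('/gpu:0'):
        images, labels = queue.dequeue_many(256)
        conv = tf.layers.conv2d(images, 32, 5, activation=tf.nn.relu)
        pool = tf.layers.max_pooling2d(conv, 2, 2)
    with tf.device('/gpu:1'):
        flat = tf.layers.flatten(pool)
        logits = tf.layers.dense(flat, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)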

  7. Standard Approach: Increase Thread Count
      Enqueue threads asynchronously enqueue data elements
      Adding more enqueue threads (e.g. via num_threads, sketched below):
        Delays queue exhaustion
        Decreases the slowdown caused by exhaustion
      We need to increase the net enqueue rate further
      We can't increase the enqueue rate any further…
      So, we must decrease the dequeue rate
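A common TensorFlow 1.x knob for the enqueue thread count is the num_threads argument to tf.train.shuffle_batch (or extra enqueue ops in a QueueRunner, as above). This is a generic sketch assuming the hypothetical read_element() reader from the first sketch; it is not necessarily how the experiments were configured.

    import tensorflow as tf  # continues the earlier sketches

    image, label = read_element()
    images, labels = tf.train.shuffle_batch(
        [image, label],
        batch_size=256,
        num_threads=16,          # more enqueue threads delay queue exhaustion
        capacity=10000,
        min_after_dequeue=2048)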

  8. Batch Repetition
      Artificially slow the dequeue rate by dequeuing a batch less than once per step (see the training-loop sketch below)
      Allows the queue to fill up
      Trivial to implement
      Repeating batches introduces new problems:
        The model is optimized with less new data per step
        Your epochs per second will decrease
        Generalization is more impacted by how representative any individual batch is of the true data distribution
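The slides do not show an implementation of batch repetition. One simple way to realize it, sketched below under assumed names (batch_interval, the stand-in model, the step budget), is to dequeue a fresh batch only every batch_interval steps, cache it as NumPy arrays, and re-feed it in between so the queue can refill.

    import tensorflow as tf  # continues the earlier sketches; names are illustrative

    # Graph side: one dequeue op, plus a training step driven by placeholders so
    # the same batch can be re-fed for several steps.
    images_op, labels_op = queue.dequeue_many(256)
    x = tf.placeholder(tf.float32, [None, 28, 28, 1])
    y = tf.placeholder(tf.int64, [None])
    logits = tf.layers.dense(tf.layers.flatten(x), 10)        # stand-in model
    loss = tf.losses.sparse_softmax_cross_entropy(y, logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    batch_interval = 16      # dequeue a fresh batch only once every 16 steps
    max_steps = 1000
    with tf.train.MonitoredTrainingSession() as sess:
        cached = None
        for step in range(max_steps):
            if cached is None or step % batch_interval == 0:
                # Fresh data; between refreshes the queue has time to refill.
                cached = sess.run([images_op, labels_op])
            sess.run(train_op, feed_dict={x: cached[0], y: cached[1]})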

  9. Batch Repetition Prevents Queue Exhaustion

  10. Batch Interval Impact on Net Enqueue Rate
      Running time is inversely proportional to the net enqueue rate (see the back-of-the-envelope calculation below)
      Validates the hypothesis that training was I/O bound
      We get diminishing returns for batch intervals > 16
      Batch intervals allow for better throughput with fewer threads
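A back-of-the-envelope illustration of the diminishing returns, using made-up rates rather than measured ones: with batch interval k the queue is drained at dequeue_rate / k, so the net fill rate is enqueue_rate - dequeue_rate / k, and the gains flatten once that quantity turns positive.

    # Rates below are assumed for illustration, not taken from the slides.
    enqueue_rate = 1.0      # batches/s the enqueue threads can sustain
    dequeue_rate = 20.0     # batches/s the model copies could consume
    for k in (1, 2, 4, 8, 16, 32):
        net = enqueue_rate - dequeue_rate / k
        print('interval %2d: net fill rate %+6.2f batches/s' % (k, net))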

  11. Batch Interval & Model Performance

  12. Summary
      HPC systems are structurally likely to be I/O bound for DL workloads
      Repeating batches for an interval of steps can hide I/O latency and keep the GPUs fed
      Small refresh intervals don't impact the converged optimization, but do decrease running time
      If you want to talk more, ask me about my circular data queues


  14. HPCMP Distro A Source
