Distribution A. Approved for public release; distribution unlimited.
Node-Level Deep Learning Input Pipeline Optimization on GPGPU-Accelerated HPC Systems
28 Mar 2018
Captain Justin Fletcher, Air Force Research Laboratory
Integrity - Service - Excellence
Introduction
Deep learning on high performance computing (HPC) systems has unique challenges:
- Shared (contested) I/O bandwidth
- Distributed file systems
- High compute density

From this talk you should get:
- The relationship between DL concurrency and I/O
- A simple, effective method for hiding I/O on existing systems
- An appreciation for the importance of specialized I/O systems for DL on HPC
Motivation
Deep learning on modern HPC systems is bound by system-wide I/O.

TensorFlow optimization step running time on a 4x P100, 20-core Power8 node is ~50 ms/step, when:
- The system is under full, mostly DL, load
- Training LeNet-5 on MNIST (batch size 256)
- Using the standard asynchronous data queue
- Typically ~16 jobs are running on separate nodes

Loading the dataset into memory yields a ~17x speedup (~3 ms/step).

Instrumenting the training shows exhausted input queues.

HPC is likely to remain I/O bound for the foreseeable future [1].
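The queue dynamics behind these measurements can be sketched with a small discrete-time simulation. All numbers here are hypothetical and chosen only to illustrate the mechanism: when the aggregate dequeue rate exceeds the aggregate enqueue rate, the input queue drains and eventually every training step stalls on I/O.

```python
# Illustrative simulation of input-queue exhaustion. Rates are in
# elements per training step and are assumed values, not measurements.

def simulate_queue(capacity, enqueue_rate, dequeue_rate, steps):
    """Track queue depth step by step. Enqueue is capped by queue capacity;
    a step whose dequeue cannot be satisfied counts as exhausted (stalled)."""
    depth = capacity  # start with a full queue
    exhausted_steps = 0
    for _ in range(steps):
        depth = min(capacity, depth + enqueue_rate)
        if depth < dequeue_rate:
            exhausted_steps += 1  # the training step stalls waiting on I/O
            depth = 0
        else:
            depth -= dequeue_rate
    return exhausted_steps

# Hypothetical node: 4 model copies each consuming a 256-element batch per
# step, fed by enqueue threads sustaining only ~900 elements/step from
# shared storage. The queue drains, then nearly every later step stalls.
print(simulate_queue(capacity=8192, enqueue_rate=900,
                     dequeue_rate=4 * 256, steps=200))
```

Once the enqueue rate matches or exceeds the dequeue rate, the same simulation reports zero exhausted steps, which is the regime the rest of this talk aims for.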
Impact of Queue Exhaustion
Queue Exhausted → Step Running Time Increases
Causes of Queue Exhaustion: Data Parallelism
- Dequeue rate exceeds enqueue rate
- A data parallel concurrency scheme increases the dequeue rate
- Enqueue threads share storage I/O bandwidth
- More model copies = more data throughput
- Exacerbated by large data elements
(Diagram: enqueue threads 1..N pull data from off-node storage over NICs 1..M into the TF data queue ("Data Q"); a dequeue op batches elements into the element queue ("Element Q") feeding model copies 1-4.)
Causes of Queue Exhaustion: Pipeline Parallelism
- Dequeue rate exceeds enqueue rate
- Model or pipeline parallel schemes increase the dequeue rate
- Typically used when the model won't fit on one device
- Pipelines operations, increasing throughput
(Diagram: the same enqueue-thread, NIC, and TF queue structure as above, now feeding pipelined model ops 1-4 instead of model copies.)
Standard Approach: Increase Thread Count
- Enqueue threads asynchronously enqueue data elements
- Adding more enqueue threads:
  - Delays queue exhaustion
  - Decreases the slowdown caused by exhaustion
- We need to increase the net enqueue rate further
- We can't increase the enqueue rate...
- So, we must decrease the dequeue rate.
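Why more threads can't solve this on their own: the threads share the node's storage I/O bandwidth, so the aggregate enqueue rate plateaus. A small arithmetic sketch, with all rates assumed for illustration:

```python
# Hypothetical rates (elements per training step) illustrating the plateau:
# per-thread throughput adds up only until the shared storage bandwidth cap.

SHARED_BANDWIDTH = 1000  # cap the storage system can sustain (assumed)
PER_THREAD_RATE = 150    # what one uncontended enqueue thread delivers (assumed)
DEQUEUE_RATE = 1024      # consumption: 4 model copies x batch size 256

def net_enqueue_rate(threads):
    # Aggregate enqueue rate is capped by shared I/O bandwidth,
    # so beyond ~7 threads adding more changes nothing.
    return min(threads * PER_THREAD_RATE, SHARED_BANDWIDTH) - DEQUEUE_RATE

for n in (2, 4, 8, 16):
    print(n, net_enqueue_rate(n))
```

Under these assumed numbers the net enqueue rate improves with thread count but stays negative and flat past the bandwidth cap, so the queue still drains; only lowering the dequeue rate can flip the sign.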
Batch Repetition

Artificially slow the dequeue rate by dequeuing batches less than once per step.
- Allows the queue to fill up
- Trivial to implement

Repeating batches introduces new problems:
- The model is optimized with less new data per step
- Your epochs per second will decrease
- Generalization is more impacted by how representative any individual batch is of the true data distribution
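The idea reduces to a few lines in the training loop. This is a minimal sketch using Python's standard `queue` module in place of a TF queue; `train_step` and `batch_interval` are hypothetical names, not the talk's actual code.

```python
import queue

def train(data_q, train_step, num_steps, batch_interval):
    """Dequeue a fresh batch only every `batch_interval` steps and reuse it
    in between, giving the queue time to refill from slow storage."""
    batch = None
    for step in range(num_steps):
        if batch is None or step % batch_interval == 0:
            batch = data_q.get()  # blocking dequeue, as in the standard path
        train_step(batch)

# Toy usage: 64 steps at interval 16 consume only 4 distinct batches.
q = queue.Queue()
for i in range(4):
    q.put(f"batch-{i}")
consumed = []
train(q, consumed.append, num_steps=64, batch_interval=16)
print(len(consumed), len(set(consumed)))
```

The usage run performs 64 optimization steps but only 4 dequeues, which is exactly the dequeue-rate reduction the slide describes.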
Batch Repetition Prevents Queue Exhaustion
Batch Interval Impact on Net Enqueue Rate
- Running time is inversely proportional to the net enqueue rate
- Validates the hypothesis that training was I/O bound
- We get diminishing returns for batch intervals > 16
- Batch intervals allow better throughput with fewer threads
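The diminishing returns fall out of the arithmetic: a batch interval of k divides the effective dequeue rate by k, so each doubling of k recovers half as much as the previous one. A sketch with the same assumed rates as the earlier examples:

```python
# Amortized view of batch repetition (all rates assumed, elements/step):
# with interval k the queue is drained once every k steps, so the
# effective dequeue rate is DEQUEUE_RATE / k.

ENQUEUE_RATE = 600   # aggregate rate from the enqueue threads (assumed)
DEQUEUE_RATE = 1024  # consumption at batch interval 1

for k in (1, 2, 4, 8, 16, 32):
    net = ENQUEUE_RATE - DEQUEUE_RATE / k
    print(k, net)
```

Under these numbers the net rate flips positive at k = 2, and the gain from 16 to 32 is already small, consistent with the diminishing returns observed past interval 16.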
Batch Interval & Model Performance
Summary
- HPC systems are structurally likely to be I/O bound for DL workloads
- Repeating batches for an interval of steps can hide I/O latency and keep the GPUs fed
- Small refresh intervals don't impact converged optimization, but decrease running time
- If you want to talk more, ask me about my circular data queues