SLIDE 1

Initial Characterization of I/O in Large-Scale Deep Learning Applications

Fahim Chowdhury, Jialin Liu, Quincey Koziol, Thorsten Kurth, Steven Farrell, Suren Byna, Prabhat, and Weikuan Yu

November 12, 2018

SLIDE 2

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 4

Objectives

  • Deep Learning (DL) applications demand large-scale computing facilities.
  • DL applications require efficient I/O support in the data processing pipeline to accelerate the training phase.
  • The goals of this project are:
    • Exploring the I/O patterns invoked by multiple DL applications running on HPC systems
    • Addressing possible bottlenecks caused by I/O in the training phase
    • Developing optimization strategies to overcome the possible I/O bottlenecks

SLIDE 6

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 7

HEPCNNB Overview

  • High Energy Physics Deep Learning Convolutional Neural Network Benchmark (HEPCNNB)
  • Runs on distributed TensorFlow using Horovod
  • Can generate particle events described by Standard Model physics, as well as particle events with R-parity-violating Supersymmetry
  • Uses a 496 GB dataset of 2048 HDF5 files representing particle collisions generated at CERN by Delphes, a fast Monte Carlo generator

SLIDE 8

CDB Overview

  • Climate Data Benchmark (CDB)
  • Runs on distributed TensorFlow using Horovod
  • Acts as an image recognition model to detect extreme weather patterns
  • Uses a 3.5 TB dataset of 62,738 HDF5 images representing climate data
  • Leverages the TensorFlow Dataset API and Python's multiprocessing package for input pipelining
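The input-pipelining idea in the last bullet — overlapping sample loading with training computation — can be sketched with just the Python standard library. This is a minimal illustration, not CDB's actual pipeline (which uses the TensorFlow Dataset API); every name here is hypothetical, and a thread pool from the multiprocessing package stands in for the real worker processes.

```python
from multiprocessing.pool import ThreadPool

def load_sample(path):
    """Stand-in for reading one HDF5 image; here it just
    fabricates a (path, data) pair."""
    return (path, [0.0] * 4)  # pretend decoded pixel data

def batched_loader(paths, batch_size=2, workers=4):
    """Yield batches whose samples are loaded in parallel,
    overlapping file I/O with downstream computation."""
    pool = ThreadPool(workers)
    try:
        for i in range(0, len(paths), batch_size):
            yield pool.map(load_sample, paths[i:i + batch_size])
    finally:
        pool.close()
        pool.join()

paths = [f"climate_{i:05d}.h5" for i in range(5)]
for batch in batched_loader(paths):
    print([name for name, _ in batch])
```

The point of the sketch is only the structure: workers fetch the next batch's samples while the trainer consumes the current one.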

SLIDE 9

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 10

Profiling Approaches

  • Develop a Python-based tool, TimeLogger, to profile the application layer
  • Determine the total latency of each training component from its merged interval list
  • Explore the TensorFlow Runtime Tracing Metadata Visualization (TRTMV) tool developed at Google and extract I/O-specific metadata
  • Working on integrating runtime metadata from the application and framework layers
  • Code available at: https://github.com/NERSC/DL-Parallel-IO
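The "merged interval list" idea behind TimeLogger can be sketched in a few lines of Python. This is an illustrative reimplementation, not TimeLogger's actual code: overlapping timestamp intervals recorded for one training component are merged before summing, so concurrent operations are not double-counted.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so concurrent
    operations are counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_latency(intervals):
    """Total wall-clock time covered by a component's intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))

# Two overlapping reads (0-4s, 3-6s) and one separate read (8-9s):
# covered time is 6 + 1 = 7, not the naive sum 4 + 3 + 1 = 8.
io_intervals = [(0.0, 4.0), (3.0, 6.0), (8.0, 9.0)]
print(total_latency(io_intervals))  # 7.0
```

Summing raw durations would overstate I/O latency whenever reads overlap, which is exactly the case in a parallelized input pipeline.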

SLIDE 11

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 12

HEPCNNB Latency Breakdown

[Chart: I/O share of training time under local vs. global shuffling.
Local Shuffle: 3.60%, 3.08%, 3.17%, 2.91%, 1.44%; Global Shuffle: 8.01%, 7.72%, 6.83%, 6.16%, 1.49%]

  • I/O takes more time when global shuffling is introduced
  • Global shuffling affects I/O even for a small dataset and a training run of only 5 epochs
  • The I/O bottleneck can become more severe with increasing epochs
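Why global shuffling is harder on I/O can be sketched in a few lines of Python (illustrative only — not the benchmark's code, and the helper names are made up): local shuffling permutes samples within each HDF5 file, so reads stay sequential per file, while global shuffling permutes the whole dataset and makes successive reads jump between files.

```python
import random

def local_shuffle(files, rng):
    """Shuffle samples only within each file: successive reads
    stay inside one file, which is friendly to the file system."""
    order = []
    for fname, samples in files.items():
        idx = list(samples)
        rng.shuffle(idx)
        order += [(fname, s) for s in idx]
    return order

def global_shuffle(files, rng):
    """Shuffle across the whole dataset: successive reads jump
    between files, hurting read bandwidth."""
    order = [(fname, s) for fname, samples in files.items() for s in samples]
    rng.shuffle(order)
    return order

files = {"a.h5": range(3), "b.h5": range(3)}
rng = random.Random(0)
print(local_shuffle(files, rng))
print(global_shuffle(files, rng))
```

Both orderings visit every sample once; they differ only in how far apart consecutive reads land on disk, which is what the latency and bandwidth charts measure.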
SLIDE 13

HEPCNNB Read Bandwidth

[Chart: read bandwidth under local vs. global shuffling.
Local Shuffle: 9.91, 21.69, 44.20, 91.53, 194.99; Global Shuffle: 4.33, 8.76, 15.90, 30.86, 187.99]

SLIDE 14

CDB Latency and Read Bandwidth

  • The percentage of I/O in the training process is higher when the dataset is larger
  • The I/O percentage increases with the number of nodes
  • Training benefits more from scaling than I/O does

[Chart: I/O share of training time (8.73%, 15.05%, 10.63%, 11.04%) and read bandwidth (0.30, 0.33, 1.14, 2.57)]

SLIDE 15

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 16

Future Work

  • To integrate TRTMV results with TimeLogger data for better profiling of the highly parallelized I/O pipeline
  • To explore the I/O patterns and determine possible I/O bottlenecks in distributed TensorFlow
  • To develop an optimized cross-framework I/O strategy to overcome the possible I/O bottlenecks

SLIDE 17

Thank You