SLIDE 1

Initial Characterization of I/O in Large-Scale Deep Learning Applications

Fahim Chowdhury, Jialin Liu, Quincey Koziol, Thorsten Kurth, Steven Farrell, Suren Byna, Prabhat, and Weikuan Yu

November 12, 2018

SLIDE 2

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 4

Objectives

  • Deep Learning (DL) applications demand large-scale computing facilities.
  • DL applications require efficient I/O support in the data processing pipeline to accelerate the training phase.
  • The goals of this project are:
    • Exploring the I/O patterns invoked by multiple DL applications running on HPC systems
    • Addressing possible bottlenecks caused by I/O in the training phase
    • Developing optimization strategies to overcome the possible I/O bottlenecks

SLIDE 6

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 7

HEPCNNB Overview

  • High Energy Physics Deep Learning Convolutional Neural Network Benchmark (HEPCNNB)
  • Runs on distributed TensorFlow using Horovod
  • Can generate particle events described by Standard Model physics, as well as particle events with R-parity-violating Supersymmetry
  • Uses a 496 GB dataset of 2048 HDF5 files representing particle collisions generated at CERN by Delphes, a fast Monte Carlo generator

SLIDE 8

CDB Overview

  • Climate Data Benchmark (CDB)
  • Runs on distributed TensorFlow using Horovod
  • Acts as an image recognition model to detect extreme weather patterns
  • Uses a 3.5 TB dataset of 62,738 HDF5 images representing climate data
  • Leverages the TensorFlow Dataset API and Python's multiprocessing package for input pipelining
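The input-pipelining idea in the last bullet — overlapping sample loading with training computation — can be sketched with just the Python standard library. This is a minimal illustration, not CDB's actual pipeline (which uses the TensorFlow Dataset API); every name here is hypothetical, and a thread pool from the multiprocessing package stands in for the real worker processes.

```python
from multiprocessing.pool import ThreadPool

def load_sample(path):
    """Stand-in for reading one HDF5 image; here it just
    fabricates a (path, data) pair."""
    return (path, [0.0] * 4)  # pretend decoded pixel data

def batched_loader(paths, batch_size=2, workers=4):
    """Yield batches whose samples are loaded in parallel,
    overlapping file I/O with downstream computation."""
    pool = ThreadPool(workers)
    try:
        for i in range(0, len(paths), batch_size):
            yield pool.map(load_sample, paths[i:i + batch_size])
    finally:
        pool.close()
        pool.join()

paths = [f"climate_{i:05d}.h5" for i in range(5)]
for batch in batched_loader(paths):
    print([name for name, _ in batch])
```

The point of the sketch is only the structure: workers fetch the next batch's samples while the trainer consumes the current one.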

SLIDE 9

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 10

Profiling Approaches

  • Develop a Python-based tool, TimeLogger, to profile the application layer
  • Determine the total latency of each training component from its merged interval list
  • Explore the TensorFlow Runtime Tracing Metadata Visualization (TRTMV) tool developed at Google and extract I/O-specific metadata
  • Working on integrating runtime metadata from the application and framework layers
  • Code available at: https://github.com/NERSC/DL-Parallel-IO
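The "merged interval list" idea behind TimeLogger can be sketched in a few lines of Python. This is an illustrative reimplementation, not TimeLogger's actual code: overlapping timestamp intervals recorded for one training component are merged before summing, so concurrent operations are not double-counted.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so concurrent
    operations are counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_latency(intervals):
    """Total wall-clock time covered by a component's intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))

# Two overlapping reads (0-4s, 3-6s) and one separate read (8-9s):
# covered time is 6 + 1 = 7, not the naive sum 4 + 3 + 1 = 8.
io_intervals = [(0.0, 4.0), (3.0, 6.0), (8.0, 9.0)]
print(total_latency(io_intervals))  # 7.0
```

Summing raw durations would overstate I/O latency whenever reads overlap, which is exactly the case in a parallelized input pipeline.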

SLIDE 11

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 12

HEPCNNB Latency Breakdown

[Chart: I/O share of training time under local vs. global shuffling.
Local Shuffle: 3.60%, 3.08%, 3.17%, 2.91%, 1.44%; Global Shuffle: 8.01%, 7.72%, 6.83%, 6.16%, 1.49%]

  • I/O takes more time when global shuffling is introduced
  • Global shuffling affects I/O even for a small dataset and a training run of only 5 epochs
  • The I/O bottleneck can become more severe with increasing epochs
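Why global shuffling is harder on I/O can be sketched in a few lines of Python (illustrative only — not the benchmark's code, and the helper names are made up): local shuffling permutes samples within each HDF5 file, so reads stay sequential per file, while global shuffling permutes the whole dataset and makes successive reads jump between files.

```python
import random

def local_shuffle(files, rng):
    """Shuffle samples only within each file: successive reads
    stay inside one file, which is friendly to the file system."""
    order = []
    for fname, samples in files.items():
        idx = list(samples)
        rng.shuffle(idx)
        order += [(fname, s) for s in idx]
    return order

def global_shuffle(files, rng):
    """Shuffle across the whole dataset: successive reads jump
    between files, hurting read bandwidth."""
    order = [(fname, s) for fname, samples in files.items() for s in samples]
    rng.shuffle(order)
    return order

files = {"a.h5": range(3), "b.h5": range(3)}
rng = random.Random(0)
print(local_shuffle(files, rng))
print(global_shuffle(files, rng))
```

Both orderings visit every sample once; they differ only in how far apart consecutive reads land on disk, which is what the latency and bandwidth charts measure.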
SLIDE 13

HEPCNNB Read Bandwidth

[Chart: read bandwidth under local vs. global shuffling.
Local Shuffle: 9.91, 21.69, 44.20, 91.53, 194.99; Global Shuffle: 4.33, 8.76, 15.90, 30.86, 187.99]

SLIDE 14

CDB Latency and Read Bandwidth

  • The percentage of I/O in the training process is higher when the dataset is larger
  • The I/O percentage increases with the number of nodes
  • Training benefits more from scaling than I/O does

[Chart: I/O share of training time (8.73%, 15.05%, 10.63%, 11.04%) and read bandwidth (0.30, 0.33, 1.14, 2.57)]

SLIDE 15

Outline

  • Objectives
  • DL Benchmarks at NERSC
  • Profiling Approaches
  • Experimental Results
  • Future Work
SLIDE 16

Future Work

  • To integrate TRTMV results with TimeLogger data for better profiling of the highly parallelized I/O pipeline
  • To explore the I/O patterns and determine possible I/O bottlenecks in distributed TensorFlow
  • To develop an optimized cross-framework I/O strategy to overcome the possible I/O bottlenecks

SLIDE 17

Thank You