

SLIDE 1

Przemek Tredak, Simon Layton

S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING

SLIDE 2

THE PROBLEM

SLIDE 3

CPU BOTTLENECK OF DL TRAINING

  • Multi-GPU, dense systems are more common (DGX-1V, DGX-2)
  • Using more cores / sockets is very expensive
  • The CPU to GPU ratio keeps getting lower:
  • DGX-1V: 40 cores / 8 GPUs = 5 cores per GPU
  • DGX-2: 48 cores / 16 GPUs = 3 cores per GPU

[Chart: CPU : GPU ratio]
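The cores-per-GPU arithmetic above is easy to check directly (a trivial sketch; the core and GPU counts are the ones quoted on the slide):

```python
# CPU cores and GPU counts for the two systems named on the slide.
systems = {
    "DGX-1V": (40, 8),   # 40 CPU cores, 8 GPUs
    "DGX-2": (48, 16),   # 48 CPU cores, 16 GPUs
}

# Fewer CPU cores are available to feed each GPU as systems get denser.
cores_per_gpu = {name: cores // gpus for name, (cores, gpus) in systems.items()}
```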

SLIDE 4

CPU BOTTLENECK OF DL TRAINING

Complexity of I/O pipeline

[Diagram: AlexNet pipeline: 256x256 image -> 224x224 crop and mirror -> training. ResNet-50 pipeline: 480p image -> random resize -> color augment -> 224x224 crop and mirror -> training.]

SLIDE 5

CPU BOTTLENECK OF DL TRAINING

[Chart: throughput over time. GPU throughput keeps rising while CPU throughput stalls, driven by the increased complexity of the CPU-based I/O pipeline and the higher GPU to CPU ratio.]

SLIDE 6

LOTS OF FRAMEWORKS

  • Frameworks have their own I/O pipelines (often more than one!)
  • Lots of duplicated effort to optimize them all
  • The training process is not portable even if the model is (e.g. via ONNX)

Lots of effort

[Diagram: Caffe2: ImageInputOp, Python; MXNet: ImageRecordIter, Python; TensorFlow: Dataset, Python, ImageIO, manual graph construction]

SLIDE 7

LOTS OF FRAMEWORKS

Optimized I/O pipelines are not flexible and often unsuitable for research

Lots of effort

train = mx.io.ImageRecordIter(
    path_imgrec = args.data_train,
    path_imgidx = args.data_train_idx,
    label_width = 1,
    mean_r = rgb_mean[0],
    mean_g = rgb_mean[1],
    mean_b = rgb_mean[2],
    data_name = 'data',
    label_name = 'softmax_label',
    data_shape = image_shape,
    batch_size = 128,
    rand_crop = True,
    max_random_scale = 1,
    pad = 0,
    fill_value = 127,
    min_random_scale = 0.533,
    max_aspect_ratio = args.max_random_aspect_ratio,
    random_h = args.max_random_h,
    random_s = args.max_random_s,
    random_l = args.max_random_l,
    max_rotate_angle = args.max_random_rotate_angle,
    max_shear_ratio = args.max_random_shear_ratio,
    rand_mirror = args.random_mirror,
    preprocess_threads = args.data_nthreads,
    shuffle = True,
    num_parts = 1,
    part_index = 0)

vs

image, _ = mx.image.random_size_crop(image, (data_shape, data_shape), 0.08, (3/4., 4/3.))
image = mx.nd.image.random_flip_left_right(image)
image = mx.nd.image.to_tensor(image)
image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
return mx.nd.cast(image, dtype), label

Inflexible but fast vs. flexible but slow

SLIDE 8

SOLUTION: ONE LIBRARY

  • Centralize the effort
  • Integrate into all frameworks
  • Provide both flexibility and performance

DALI

[Diagram: DALI feeding MXNet, Caffe2, PyTorch, TF, etc.]

SLIDE 9

DALI: OVERVIEW

SLIDE 10

DALI

  • Flexible, high-performance image data pipeline
  • Python / C++ frontends with C++ / CUDA backend
  • Minimal (or no) changes to the frameworks required
  • Full pipeline - from disk to GPU, ready to train
  • OSS (soon)

[Diagram: DALI connected to the framework via a DALI plugin]

SLIDE 11

GRAPH WITHIN A GRAPH

Data pipeline is just a (simple) graph

I/O in Frameworks today

[Diagram: Loader -> Decode -> Resize -> Augment -> Training, with JPEG images and labels; the I/O stages run on the CPU, training on the GPU]

SLIDE 12

GPU OPTIMIZED PRIMITIVES

High performance, GPU optimized implementations

DALI

[Diagram: the same Loader -> Decode -> Resize -> Augment -> Training pipeline, with GPU-optimized DALI implementations of the stages]

SLIDE 13

GPU ACCELERATED JPEG DECODE

Hybrid approach to JPEG decoding (Huffman decoding on the CPU, the rest on the GPU); can move fully to the GPU in the future

DALI with nvJPEG

[Diagram: the same pipeline with nvJPEG; the Decode step is split between CPU and GPU]

SLIDE 14

SET YOUR DATA FREE

Use any file format in any framework

DALI

  • LMDB (Caffe, Caffe2)
  • RecordIO (MXNet)
  • TFRecord (TensorFlow)
  • List of JPEGs (PyTorch, others)
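One way to picture "any format in any framework": the storage format only determines which reader operator starts the pipeline; everything downstream is identical. The mapping below is an illustrative sketch, and apart from the Caffe2Reader shown later in the talk, the operator names are assumptions, not a definitive API listing:

```python
# Illustrative only: map each dataset storage format to the reader operator a
# DALI-style pipeline would start with. Except for Caffe2Reader (shown in the
# RN50 example), these operator names are assumptions for this sketch.
READERS = {
    "lmdb": "Caffe2Reader",       # LMDB (Caffe, Caffe2)
    "recordio": "MXNetReader",    # RecordIO (MXNet)
    "tfrecord": "TFRecordReader", # TFRecord (TensorFlow)
    "jpeg_list": "FileReader",    # plain list of JPEG files (PyTorch, others)
}

def pick_reader(fmt):
    """Return the reader operator name for a dataset format."""
    return READERS[fmt]
```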
SLIDE 15

BEHIND THE SCENES: PIPELINE

SLIDE 16

PIPELINE

Overview

  • One pipeline per GPU
  • The same logic for multithreaded and multiprocess frameworks

SLIDE 17

PIPELINE

Overview

Single direction, 3 stages: CPU -> Mixed -> GPU
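The single-direction constraint can be sketched as a check on the pipeline graph: data may only flow forward through the three stages, never back. This is a minimal sketch of the rule, not DALI's actual validation code:

```python
# Sketch (an assumption about the rule, not DALI source): every edge in the
# pipeline graph must move "forward" through the three stages:
# CPU -> Mixed -> GPU.
STAGE_ORDER = {"cpu": 0, "mixed": 1, "gpu": 2}

def edges_are_forward(edges):
    """edges: list of (producer_stage, consumer_stage) pairs."""
    return all(STAGE_ORDER[src] <= STAGE_ORDER[dst] for src, dst in edges)

# A CPU loader feeding a mixed decoder feeding GPU augmentations is valid;
# a GPU op feeding a CPU op is not.
valid = edges_are_forward([("cpu", "mixed"), ("mixed", "gpu"), ("gpu", "gpu")])
invalid = edges_are_forward([("gpu", "cpu")])
```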

SLIDE 18

PIPELINE

Overview

Simple scheduling of operations

[Diagram: operators numbered 1-9 scheduled across the CPU, Mixed and GPU stages, feeding the framework]
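"Simple scheduling" of a pipeline that is a DAG comes down to executing operators in a topological order, as in the numbering on the slide. A minimal sketch using Kahn's algorithm (the operator names here are illustrative):

```python
# Sketch: once the pipeline is a DAG, operators can be executed in any
# topological order. Kahn's algorithm produces one such order.
from collections import deque

def schedule(ops, deps):
    """ops: list of op names; deps: {op: [ops it consumes from]}."""
    indegree = {op: len(deps.get(op, [])) for op in ops}
    consumers = {op: [] for op in ops}
    for op, srcs in deps.items():
        for s in srcs:
            consumers[s].append(op)
    ready = deque(op for op in ops if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for c in consumers[op]:  # an op becomes ready once all inputs ran
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

order = schedule(
    ["loader", "decode", "resize", "augment"],
    {"decode": ["loader"], "resize": ["decode"], "augment": ["resize"]},
)
```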

SLIDE 19

PIPELINE

CPU

Operations processed per-sample in a thread pool

[Diagram: samples 1-5 flowing through the CPU operators on worker threads]
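The per-sample CPU stage can be pictured with a plain thread pool: each worker pulls one sample and runs it through the whole operator chain, so different samples of a batch are processed concurrently. A toy sketch with integers standing in for images:

```python
# Sketch: the CPU stage runs each operator per-sample on a thread pool,
# so samples within a batch are processed concurrently.
from concurrent.futures import ThreadPoolExecutor

def cpu_stage(samples, operators, num_threads=4):
    """Apply a chain of per-sample operators to every sample in the batch."""
    def process(sample):
        for op in operators:
            sample = op(sample)
        return sample
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves input order, like a batch keeps sample order
        return list(pool.map(process, samples))

# Toy "operators" on integer samples instead of images:
out = cpu_stage([1, 2, 3, 4], [lambda x: x + 1, lambda x: x * 10])
```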

SLIDE 20

PIPELINE

GPU

Batched processing of data

[Diagram: GPU operators 8 and 9 processing whole batches]

SLIDE 21

PIPELINE

Mixed

  • A bridge between CPU and GPU
  • Per-sample input, batched output
  • Also used for batching CPU data (for CPU outputs of the pipeline)
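The "per-sample input, batched output" role of the Mixed stage can be sketched in a few lines: gather individually-sized sample buffers into one contiguous buffer plus offsets (here Python lists stand in for host/device memory):

```python
# Sketch of the Mixed stage's batching role: take per-sample buffers and
# produce one contiguous batch. Lists stand in for real host/device buffers.
def batch_samples(samples):
    """Per-sample input, batched output: one contiguous buffer + offsets."""
    offsets, flat = [], []
    for s in samples:
        offsets.append(len(flat))  # where this sample starts in the batch
        flat.extend(s)
    return flat, offsets

flat, offsets = batch_samples([[1, 2], [3, 4, 5], [6]])
```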

SLIDE 22

EXECUTOR

Pipelining the pipeline

The CPU, Mixed and GPU stages of a batch need to be executed serially, but each batch of data is independent…

[Timeline: CPU, Mixed and GPU stages of consecutive batches laid out over time]

SLIDE 23

EXECUTOR

Pipelining the pipeline

Each stage is asynchronous; the stages of a given batch are synchronized via events.

[Timeline: CPU, Mixed and GPU stages of batches 1, 2, 3, … overlapping in time]
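The overlapping timeline above can be sketched with one thread per stage and queues between them: while the GPU stage works on batch N, the Mixed stage already processes batch N+1 and the CPU stage batch N+2. Queue hand-off stands in for the CUDA events that synchronize a batch's stages in the real executor:

```python
# Sketch of the executor's pipelining: each stage runs in its own thread and
# hands batches to the next stage through a queue, so stages of different
# batches overlap in time. Queue hand-off models the per-batch event sync.
import queue
import threading

def run_pipelined(batches, stages):
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    def worker(stage, qin, qout):
        while True:
            item = qin.get()
            if item is None:        # poison pill: shut the stage down
                qout.put(None)
                return
            qout.put(stage(item))
    threads = [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for b in batches:
        qs[0].put(b)
    qs[0].put(None)
    out = []
    while (item := qs[-1].get()) is not None:
        out.append(item)
    for t in threads:
        t.join()
    return out

# Three toy stages standing in for CPU, Mixed and GPU:
results = run_pipelined([1, 2, 3], [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1])
```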

SLIDE 24

OPERATORS

Gallery

SLIDE 25

USING DALI

SLIDE 26

EXAMPLE: RESNET-50 PIPELINE

Pipeline class

import dali
import dali.ops as ops

class HybridRN50Pipe(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size, num_threads, device_id)
        # define used operators
        pass

    def define_graph(self):
        # define graph of operations
        pass

SLIDE 27

EXAMPLE: RESNET-50 PIPELINE

Defining operators

def __init__(self, batch_size, num_threads, device_id, num_devices):
    super(HybridRN50Pipe, self).__init__(batch_size, num_threads, device_id)
    self.loader = ops.Caffe2Reader(path=lmdb_path, shard_id=device_id,
                                   num_shards=num_devices)
    self.decode = ops.HybridDecode(output_type=dali.types.RGB)
    self.resize = ops.Resize(device="gpu", resize_a=256, resize_b=480,
                             random_resize=True, image_type=dali.types.RGB)
    self.crop = ops.CropMirrorNormalize(device="gpu", random_crop=True,
                                        crop=(224, 224), mirror_prob=0.5,
                                        mean=[128., 128., 128.],
                                        std=[1., 1., 1.],
                                        output_layout=dali.types.NCHW)

SLIDE 28

EXAMPLE: RESNET-50 PIPELINE

Defining graph

def define_graph(self):
    jpeg, labels = self.loader(name="Reader")
    images = self.decode(jpeg)
    resized_images = self.resize(images)
    cropped_images = self.crop(resized_images)
    return [cropped_images, labels]

[Diagram: Loader -> Decode -> Resize -> Crop -> MakeContiguous, producing Data and Label from jpeg and labels]

SLIDE 29

EXAMPLE: RESNET-50 PIPELINE

Usage: MXNet

import mxnet as mx
from dali.plugin.mxnet import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
pipe.build()
train = DALIIterator(pipe, pipe.epoch_size("Reader"))

model.fit(train,
          # other parameters
          )

SLIDE 30

EXAMPLE: RESNET-50 PIPELINE

Usage: TensorFlow

import tensorflow as tf
from dali.plugin.tf import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
train = DALIIterator()

with tf.Session() as sess:
    images, labels = train(serialized_pipe)
    # rest of the model using images and labels
    sess.run(...)

SLIDE 31

EXAMPLE: RESNET-50 PIPELINE

Usage: Caffe2

from caffe2.python import brew

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
data, label = brew.dali_input(model, ["data", "label"],
                              serialized_pipe=serialized_pipe)

# Add the rest of your network as normal
conv1 = brew.conv(model, data, "conv1", ...)

SLIDE 32

PERFORMANCE

SLIDE 33

PERFORMANCE

I/O Pipeline

[Bar chart, images/second: 5150, 5450, 8000, 14350, 23000]

Throughput, DGX-2, RN50 pipeline, Batch 128, NCHW

SLIDE 34

PERFORMANCE

End-to-end training

[Bar chart, images/second: Native 8000, DALI 15500, Synthetic 17000]

End-to-end DGX-2, RN50 training - MXNet, Batch 192 / GPU

SLIDE 35

NEXT STEPS

SLIDE 36

NEXT: MORE WORKLOADS

Segmentation

def define_graph(self):
    images, masks = self.loader(name="Reader")
    images = self.decode(images)
    masks = self.decode(masks)
    # Apply identical transformations
    resized_images, resized_masks = self.resize([images, masks], …)
    cropped_images, cropped_masks = self.crop([resized_images, resized_masks], …)
    return [cropped_images, cropped_masks]
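Why the transformations must be identical: if image and mask drew their random crop windows independently, pixels and labels would no longer line up. A toy sketch of the idea (nested lists stand in for real tensors; this is not DALI code):

```python
# Sketch of why segmentation needs *identical* random transformations for
# image and mask: draw the random parameters once, apply them to both.
import random

def random_crop_pair(image, mask, size, rng=random):
    """Draw one crop window and apply it to both image and mask."""
    h, w = len(image), len(image[0])
    y = rng.randrange(h - size + 1)
    x = rng.randrange(w - size + 1)
    def crop(t):
        return [row[x:x + size] for row in t[y:y + size]]
    return crop(image), crop(mask)

# Toy data where mask values are image values * 10, so alignment is testable:
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
msk = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
ci, cm = random_crop_pair(img, msk, 2)
```

Whatever window the RNG picks, `ci` and `cm` cover the same pixels, which is exactly what sharing the transformation parameters guarantees.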

SLIDE 37

NEXT: MORE FORMATS

What would be useful to you?

  • PNG
  • Video frames

SLIDE 38

NEXT++: MORE OFFLOADING

  • Fully GPU-based decode
  • HW-based via NVDEC
  • Transcode to video

SLIDE 39

SOON: EARLY ACCESS

Looking for: general feedback, new workloads, new transformations

Contact: Milind Kukanur, mkukanur@nvidia.com

SLIDE 40

ACKNOWLEDGEMENTS

Trevor Gale, Andrei Ivanov, Serge Panev, Cliff Woolley, and the DL Frameworks team @ NVIDIA

SLIDE 41