ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Toward Large-Scale Image Segmentation On Summit. Sudip K. Seal, Seung-Hwan Lim, Dali Wang, Jacob Hinkle, Dalton Lunga, Aristeidis Tsaris. Oak Ridge National Laboratory, USA. August 19, 2020. International Conference on Parallel Processing, Alberta, Canada.
Introduction
Semantic Segmentation of Images
- Given an image with N×N pixels and a set of c distinct classes, label each of the N^2 pixels with one of the c distinct classes.
- For example, given a 256×256 image of a car, road, buildings and people, a semantic segmentation of the image classifies each of the 256×256 = 2^16 pixels into one of c = 4 classes {car, road, building, people} (see the sketch after the figure below).
[Figure: input image and its semantic segmentation. Image credit: https://mc.ai/how-to-do-semantic-segmentation-using-deep-learning/]
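A tiny PyTorch sketch (illustrative only, not from the paper) of what "label each pixel with one of c classes" means in practice: a segmentation network outputs one score per class per pixel, and the per-pixel argmax gives the class map.

```python
import torch

c = 4                                   # {car, road, building, people}
logits = torch.randn(1, c, 256, 256)    # network output: one score per class per pixel
segmentation = logits.argmax(dim=1)     # (1, 256, 256) map of class indices in [0, c)
assert segmentation.shape == (1, 256, 256)
```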
[Figure legend: conv 3x3 + ReLU; max pool 2x2; up-conv 2x2; conv 1x1; copy and crop. Input image to segmented image.]
The U-Net Model
U-Net Architecture
[Figure: U-Net architecture, input image to output image.] Source: Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9351: 234-241, 2015.
- Refer to this as the h-region (halo).
- Halo width (h) is a function of the U-Net architecture (depth, channel width, filter sizes, etc.).
- Halo width (h) determines the receptive field of the model.
- The larger the receptive field, the wider the length-scales of identifiable objects.
- For a U-Net with L levels and n_c 3x3 convolutions per level: h = (3 · 2^(L−1) − 2) · n_c.
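A minimal sketch, assuming the halo formula reconstructed above for 3x3 convolutions (the constant in the formula is my reconstruction, not quoted verbatim from the slide). It reproduces the halo widths listed later for the benchmark models.

```python
# Halo width for a U-Net with `levels` levels and `convs_per_level` 3x3 convs per level,
# assuming h = (3 * 2**(L - 1) - 2) * n_c (reconstructed formula).
def halo_width(levels: int, convs_per_level: int) -> int:
    return (3 * 2 ** (levels - 1) - 2) * convs_per_level

assert halo_width(5, 2) == 92    # Small (standard) U-Net
assert halo_width(5, 5) == 230   # Medium-1
assert halo_width(6, 2) == 188   # Medium-2
assert halo_width(7, 2) == 380   # Large
```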
Why Is It A Summit-scale Problem?
- Satellite images collected at high resolutions (30-50 cm) yield very large 10,000 x 10,000 images.
- Most computer vision workloads deal with images of O(10^2 × 10^2) resolution (for example, ImageNet).
- This work targets ultra-wide-extent images with O(10^4 × 10^4) resolution: 10,000-fold larger data samples.
- At present, it takes many days to train a single model (even on special-purpose DL platforms like DGX boxes).
- Hyperparameter tuning of these models takes much longer.
- Need an accurate, scalable, high-speed training framework.
- Large U-Net models are needed to resolve multi-scale objects (buildings, solar panels, land cover details).
- Advanced DAQ systems generate vast amounts of high-resolution images: large data volume.
- Sample size: 10,000-fold larger image size. Model size: larger receptive fields require larger models. Data size: multi-TB of data from DAQ systems.
Sample Parallelism - Taming Large Image Size
Leveraging Summit's Vast GPU Farm
Tile size chosen such that appended tile plus model parameters fit on a single Summit GPU.
- Given an N×N image, U-Net segments an (N−h)×(N−h) inset square.
- Partition each N×N = 10000×10000 image sample into non-overlapping tiles (see the sketch below).
- Append an extra halo region of width h along each side of each tile.
- Assign each appended tile to a Summit GPU; use a standard U-Net to segment the appended tile.
- Each GPU segments an area equal to that of the original non-overlapping tile.
Blue dashed square is segmented for each appended tile.
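A minimal NumPy sketch (not the authors' implementation) of the sample-parallel decomposition: partition an N×N image into p×p non-overlapping tiles and append a halo of width h on every side of each tile, reflect-padding the image border so edge tiles also receive a full halo. The function name and padding mode are illustrative assumptions.

```python
import numpy as np

def tile_with_halo(image: np.ndarray, p: int, h: int):
    """Yield (row, col, padded_tile) for each of the p x p tiles of a square image."""
    n = image.shape[0]
    assert image.shape[1] == n and n % p == 0
    t = n // p                                        # tile size N' = N / p
    padded = np.pad(image, ((h, h), (h, h)) + ((0, 0),) * (image.ndim - 2),
                    mode="reflect")                   # full halo even at image borders
    for i in range(p):
        for j in range(p):
            r, c = i * t, j * t                       # top-left corner in the original image
            yield i, j, padded[r:r + t + 2 * h, c:c + t + 2 * h]

# Example matching the setup on the next slide: one 10000 x 10000 x 4 sample,
# an 8 x 8 tiling, and halo width h = 92.
sample = np.zeros((10000, 10000, 4), dtype=np.uint8)
i, j, tile = next(tile_with_halo(sample, p=8, h=92))
assert tile.shape == (1434, 1434, 4)                  # 1250 + 2*92 appended tile
```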
Performance of Sample-Parallel U-Net Training
100X+ Faster U-Net Training
- The optimal tiling for each 10000×10000 sample image was found to be 8×8.
- Each 1250×1250 tile was appended with a halo of width h = 92 and assigned to a single Summit GPU.
- 10-11 Summit nodes are needed per 10000×10000 image sample.
- A U-Net model was trained on a data set of 100 10000×10000×4 satellite images, collected at 30-50 cm resolution.
- The training time per epoch was ~12 seconds using 1200 Summit GPUs, compared to ~1,740 seconds on a DGX-1.
- Initial testing revealed no appreciable loss of training/validation accuracy with the new parallel framework.
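The node count follows directly from the tiling; a short worked check, assuming Summit's 6 GPUs per node:

```latex
\[
\frac{10000}{8} = 1250, \qquad 1250 + 2\cdot 92 = 1434 \ \text{(appended tile size)},
\]
\[
8 \times 8 = 64 \ \text{tiles per sample}, \qquad \frac{64}{6} \approx 10.7
\ \Rightarrow\ \text{10--11 Summit nodes per } 10000 \times 10000 \ \text{sample}.
\]
```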
Limitations of Sample Parallelism
- N: global image size
- N′: tile size
- h: halo (overlap) width
- n_c: no. of convs per level
- L: no. of U-Net levels
- N×N = p^2 (N′×N′)
- An image of size N×N is partitioned into a p×p array of N′×N′ tiles, so N = pN′.
- Overhead factor F ~ (total volume of computations per tile) / (total volume of useful computations per tile) = (N′ + 2h)^2 / N′^2 ~ (1 + 2hp/N)^2.
- Ideally, F = 1.
- Decreasing p (increasing tile size) increases the memory requirement and quickly overtakes the memory available per GPU.
- Decreasing h decreases the receptive field of the model.
- On the other hand, the goal is to decrease F and increase h.
- Decreasing F means increasing the tile size N′ and/or decreasing h, which steers away from the target receptive fields (recall h = (3 · 2^(L−1) − 2) · n_c).
- To satisfy both, U-Net models larger than can fit on a single GPU are needed.
- Need model-parallel execution.
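A quick numerical evaluation of the overhead factor for the configuration reported on the previous slide (my evaluation of the reconstructed formula; the numeric value is not quoted on the slides):

```python
# N = 10000, p = 8 (so N' = 1250) and h = 92 as in the sample-parallel experiment.
N, p, h = 10000, 8, 92
n_tile = N // p                           # N' = 1250
F = (n_tile + 2 * h) ** 2 / n_tile ** 2   # equivalently (1 + 2*h*p/N)**2
print(F)                                  # ~1.32, i.e. roughly 32% redundant computation per tile
```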
torchgpipe: a PyTorch implementation of the GPipe* framework
Model-Parallelism - Taming Large Model Size
Node-level Pipeline-Parallel Execution
[Figure: a model-parallel U-Net partitioned across GPUs 1-6 of a single Summit node (skip connections omitted for ease of presentation; no load balance), with the GPipe-style pipeline schedule of forward and backward passes over micro-batches on each partition, followed by parameter updates.]
- The set of consecutive layers mapped to a GPU is called a partition.
- The number of layers in each partition is called the balance.
- Each mini-batch of tiles is subdivided into smaller micro-batches (chunks) that are pipelined through the partitions (see the sketch below the reference).
- Memory needed per GPU = size(micro-batch) + size(partition).
* Huang, Yanping, Youlong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le and Zhifeng Chen. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS (2019).
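A minimal sketch of node-level pipeline parallelism with torchgpipe (not the authors' code). It assumes the U-Net has been flattened into an nn.Sequential (skip connections handled inside custom blocks) and that 6 GPUs are visible, as on a Summit node; the stand-in model and the balance values below are illustrative only.

```python
import torch
from torch import nn
from torchgpipe import GPipe

unet = nn.Sequential(                        # stand-in for a flattened U-Net
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
    nn.Conv2d(64, 4, 1),
)

# balance = number of consecutive layers per partition (one partition per GPU);
# chunks  = number of micro-batches each mini-batch is split into for pipelining.
model = GPipe(unet, balance=[1, 1, 1, 1, 1, 1], chunks=4)

x = torch.randn(8, 4, 256, 256)              # a mini-batch of (small) padded tiles
y = model(x.to(model.devices[0]))            # output is produced on the last GPU
loss = y.mean()
loss.backward()
```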
Model Parallel Experiments
Single Node Execution
[Figure: N×N image samples are partitioned into p×p padded N′×N′ tiles (sample parallel); each tile is processed by a model-parallel U-Net on a 96 GB Summit node.]
Benchmark U-Net Models:

Model | No. of Levels | Conv. Layers per Level | No. of Trainable Parameters | Halo width h
Small (Standard) | 5 | 2 | 72,301,856 | 92
Medium-1 | 5 | 5 | 232,687,904 | 230
Medium-2 | 6 | 2 | 289,357,088 | 188
Large | 7 | 2 | 1,157,578,016 | 380

- 10× larger number of trainable parameters.
- 4× larger receptive field.
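A generic sketch (not the authors' script) of how the "No. of Trainable Parameters" column is typically obtained in PyTorch: sum the element counts of all parameters that require gradients.

```python
from torch import nn

def count_trainable_params(model: nn.Module) -> int:
    # Total number of elements across all trainable parameter tensors.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_trainable_params(small_unet) would return 72,301,856 for the
# Small (standard) configuration in the table above.
```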
Medium-1: speedups at 192, 512 and 1024 using 6 pipeline stages. Speedup roughly doubles (Small: 1.97×; Medium-2: 2.01×) as the number of pipeline stages increases from 1 to 6.
Need for Performance Improvement
Single Node Execution
- Small, Medium-2 and Large models: 109, 129 and 149 layers, respectively.
- Balances: Small {14, 24, 30, 22, 12, 7}; Medium-2 {16, 26, 38, 26, 12, 11}; Large {18, 30, 44, 30, 14, 13}.
- Need load-balanced pipelined execution.
- Encoder memory at level ℓ: E_ℓ = O( I_ℓ^2 + 2^ℓ n_f Σ_{i=1..n_c} (I_ℓ − i·d)^2 )
- Decoder memory at level ℓ′: D_ℓ′ = O( 2^ℓ′ n_f ( 2 I_ℓ′^2 + Σ_{i=1..n_c} (I_ℓ′ − i·d)^2 ) )
- Memory profile: E_ℓ + D_ℓ′ vs. ℓ, with ℓ′ = L − ℓ (I_ℓ: input size at level ℓ; n_f: base filter count; d: per-convolution shrinkage).
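A rough numerical sketch of the per-level memory profile implied by the expressions above (my reconstruction; the constants and the level-to-level size recursion are illustrative assumptions, not values from the paper). It illustrates why an even split of layers across GPUs is poorly load balanced.

```python
def level_sizes(I0, levels, n_c=2, d=2):
    # Input size at each encoder level, assuming unpadded convs (shrink d per conv)
    # followed by 2x2 max pooling between levels.
    sizes = [I0]
    for _ in range(levels - 1):
        sizes.append((sizes[-1] - n_c * d) // 2)
    return sizes

n_f, n_c, d = 64, 2, 2                       # illustrative U-Net constants
for lvl, I in enumerate(level_sizes(1434, 5), start=1):
    conv_acts = sum((I - i * d) ** 2 for i in range(1, n_c + 1))
    E = I * I + 2 ** lvl * n_f * conv_acts                # encoder-side activations
    D = 2 ** lvl * n_f * (2 * I * I + conv_acts)          # decoder-side activations
    print(f"level {lvl}: I={I:5d}  E~{E:.2e}  D~{D:.2e}")
```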
Wrapping Up
Ongoing Work: Sample + Model + Data Parallel Framework (load balance heuristics, data parallelism)
[Figure: N×N image samples are partitioned into p×p padded tiles (sample parallel); each tile is processed by a model-parallel U-Net on a 96 GB Summit node; the node-level pipelines are replicated across nodes (data parallel).]
This Paper: Prototype Sample + Model Parallel Framework
[Figure: N×N image samples are partitioned into p×p padded tiles (sample parallel); each tile is processed by a model-parallel U-Net on a 96 GB Summit node.]
- 10× larger number of trainable parameters.
- 4× larger receptive field.
- 10000× larger image size.
- Training image segmentation neural network models becomes extremely challenging when: image sizes are very large; desired receptive fields are large; the volume of training data is large.
- Fast training/inference is needed for geo-sensing applications: satellite imagery, disaster assessment, precision agriculture, etc.
- This work is a first step: it can train 10× larger U-Net models with a 4× larger receptive field on 10000× larger images.
- Ongoing efforts are underway to integrate load-balancing heuristics and data-parallel execution to handle large volumes of training data efficiently.