Toward Large-Scale Image Segmentation On Summit


  1. Toward Large-Scale Image Segmentation On Summit
     Sudip K. Seal, Seung-Hwan Lim, Dali Wang, Jacob Hinkle, Dalton Lunga, Aristeidis Tsaris
     Oak Ridge National Laboratory, USA
     August 19, 2020, International Conference on Parallel Processing, Alberta, Canada
     ORNL is managed by UT-Battelle, LLC for the US Department of Energy

  2. Introduction: Semantic Segmentation of Images
     - Given an image with N×N pixels and a set of l distinct classes, label each of the NΒ² pixels with one of the l distinct classes.
     - For example, given a 256×256 image of cars, roads, buildings and people, a semantic segmentation of the image classifies each of the 256×256 = 2^16 pixels into one of l = 4 classes {car, road, building, people}.
     [Figure: input image and its segmented output; U-Net legend: 3×3 conv + ReLU, 2×2 max pool, 2×2 up-conv, 1×1 conv, copy and crop.]
     Image credit: https://mc.ai/how-to-do-semantic-segmentation-using-deep-learning/
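     As an illustration of the per-pixel labeling task (a generic sketch, not code from the paper), a model that outputs one score map per class segments an image by taking a per-pixel argmax; the names and shapes below are assumptions:

```python
# Illustrative sketch (not from the paper): label each pixel by the argmax of
# per-class scores, the generic form of semantic segmentation.
import numpy as np

l, N = 4, 256                      # l = 4 classes, a 256x256 image
classes = ["car", "road", "building", "people"]

scores = np.random.rand(l, N, N)   # per-pixel class scores from some model
labels = scores.argmax(axis=0)     # (N, N) map of class indices into `classes`

assert labels.shape == (N, N)      # 256*256 = 2**16 pixels, each labeled
```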

  3. The U-Net Model
     - A U-Net with L levels and n_c unpadded 3×3 convolutions per level maps an N×N input to a smaller (N βˆ’ 2πœ—)Γ—(N βˆ’ 2πœ—) output.
     - Refer to the lost border as the πœ—-region (halo), of width πœ— = (3 Β· 2^(Lβˆ’1) βˆ’ 2) Β· n_c per side (see the sketch below).
     - Halo width (πœ—) is a function of the U-Net architecture (depth, channel width, filter sizes, etc.).
     - Halo width (πœ—) determines the receptive field of the model.
     - The larger the receptive field, the wider the length-scales of identifiable objects.
     Reference: Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9351, pp. 234-241, 2015.
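     The halo formula above is reconstructed from the deck's garbled math; as a sanity check, a minimal Python sketch (a hypothetical helper, assuming 3×3 filters and unit stride) reproduces the πœ— values quoted for the four benchmark models on slide 9:

```python
# Hypothetical helper (a reconstruction, not the paper's code): per-side halo
# width of a valid-convolution U-Net with L levels and n_c 3x3 convs per level.
def halo_width(L: int, n_c: int) -> int:
    return (3 * 2 ** (L - 1) - 2) * n_c

assert halo_width(5, 2) == 92    # Small (standard U-Net)
assert halo_width(5, 5) == 230   # Medium-1
assert halo_width(6, 2) == 188   # Medium-2
assert halo_width(7, 2) == 380   # Large
```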

  4. Why Is It a Summit-Scale Problem?
     - Model size: larger receptive fields require larger models.
     - Sample size: satellite images collected at high resolution (30-50 cm) yield very large 10,000×10,000 images. Most computer vision workloads deal with images of O(10²×10²) resolution (for example, ImageNet); this work targets ultra-wide-extent images of O(10⁴×10⁴) resolution β‡’ 10,000-fold larger data samples!
     - Data size: advanced DAQ systems generate vast amounts of high-resolution images β‡’ multi-TB data volumes.
     - At present, many days are required to train a single model, even on special-purpose DL platforms like DGX boxes; hyperparameter tuning of these models takes much longer.
     - Large U-Net models are needed to resolve multi-scale objects (buildings, solar panels, land-cover details).
     - Need an accurate, scalable, high-speed training framework.

  5. Sample Parallelism: Taming Large Image Size
     Leveraging Summit's Vast GPU Farm
     - Given an N×N image, U-Net segments the (N βˆ’ 2πœ—)Γ—(N βˆ’ 2πœ—) inset square.
     - Partition each N×N = 10,000×10,000 image sample into non-overlapping tiles.
     - Append an extra halo region of width πœ— along each side of each tile (see the sketch below).
     - Assign each appended tile to a Summit GPU; use a standard U-Net to segment the appended tile.
     - Each GPU segments an area equal to that of the original non-overlapping tile.
     - The tile size is chosen such that the appended tile plus the model parameters fit on a single Summit GPU.
     [Figure: a 10,000×10,000 image partitioned into tiles; the blue dashed square is the region segmented for each appended tile.]
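     A minimal numpy sketch of this tiling scheme, assuming a single-channel image and reflect-padding at the image boundary (both are assumptions; the deck does not specify the boundary handling):

```python
# Sketch (assumptions, not the paper's code): partition an NxN image into a
# q x q grid of non-overlapping tiles, then append a halo of width `halo` on
# every side of each tile before segmentation.
import numpy as np

def padded_tiles(image: np.ndarray, q: int, halo: int):
    """Yield (row, col, tile), where tile is (N/q + 2*halo) square."""
    N = image.shape[0]
    T = N // q                                    # non-overlapping tile size
    padded = np.pad(image, halo, mode="reflect")  # halo for border tiles
    for r in range(q):
        for c in range(q):
            tile = padded[r*T : r*T + T + 2*halo,
                          c*T : c*T + T + 2*halo]
            yield r, c, tile

image = np.zeros((10000, 10000), dtype=np.float32)
for r, c, tile in padded_tiles(image, q=8, halo=92):
    assert tile.shape == (1434, 1434)             # 1250 + 2*92
    break  # each such tile would go to one Summit GPU
```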

  6. Performance of Sample-Parallel U-Net Training
     - The optimal tiling for each 10,000×10,000 sample image was found to be 8×8.
     - Each 1250×1250 tile was appended with a halo of width πœ— = 92 and assigned to a single Summit GPU.
     - The resulting 64 appended tiles occupy 10-11 Summit nodes (6 GPUs each) per 10,000×10,000 image sample.
     - A U-Net model was trained on a data set of 100 satellite images of size 10,000×10,000×4, collected at 30-50 cm resolution.
     - The training time per epoch was ~12 seconds using 1,200 Summit GPUs, compared to ~1,740 seconds on a DGX-1: more than 100× faster U-Net training.
     - Initial testing revealed no appreciable loss of training/validation accuracy using the new parallel framework.

  7. Limitations of Sample Parallelism
     Notation:
     - k β†’ filter size
     - s β†’ stride length
     - p β†’ padding size
     - n_c β†’ no. of convs per level
     - L β†’ no. of U-Net levels
     - d β†’ image-size reduction per conv (d = k βˆ’ 1)
     - πœ— = (3 Β· 2^(Lβˆ’1) βˆ’ 2) Β· n_c
     - An image of size N×N is partitioned into a q×q array of T×T tiles (N = qT); appending the halo gives padded tiles of size Tβ€²Γ—Tβ€², with Tβ€² = T + 2πœ—.
     - Compute overhead: F ∼ (total volume of computations per tile, ∝ Tβ€²Β²) / (total volume of useful computations per tile, ∝ TΒ²) = (1 + 2qπœ—/N)Β², with N = 10,000. Ideally, F = 1 (see the sketch below).
     - Decreasing q (increasing the tile size) drives up the memory requirement, which quickly overtakes the memory available per GPU.
     - Decreasing πœ— shrinks the receptive field of the model, whereas the goal is to decrease q AND increase πœ—.
     - Satisfying both requires U-Net models larger than can fit on a single GPU β‡’ need model-parallel execution.
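     A small sketch of the overhead factor as reconstructed above (the factor of 2 on the halo reflects padding on both sides of a tile, an inference from the tiling description):

```python
# Sketch of the reconstructed overhead factor F = ((T + 2*halo)/T)^2: the ratio
# of total computation on a padded tile to useful computation on its core.
def overhead(N: int, q: int, halo: int) -> float:
    T = N // q                      # core (non-overlapping) tile size
    return ((T + 2 * halo) / T) ** 2

print(overhead(10000, 8, 92))       # ~1.32 at the deck's operating point
print(overhead(10000, 8, 380))      # ~2.59 with the "Large" model's halo
```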

  8. Model Parallelism: Taming Large Model Size
     Node-Level Pipeline-Parallel Execution (single Summit node, GPUs 1-6; skip connections omitted for ease of presentation)
     - The set of consecutive layers mapped to a GPU is called a partition.
     - The number of layers in each partition is called the balance.
     - Subdivide each mini-batch of tiles into smaller micro-batches that are pipelined through the partitions; micro-batches per partition ≑ mbpp.
     - Memory needed per GPU = size(micro-batch) + size(partition).
     - Without load balancing, pipeline stages sit idle.
     - TorchGPipe: a PyTorch implementation of the GPipe* framework (see the sketch below).
     [Figure: pipeline schedule of forward/backward micro-batches and per-partition weight updates across GPUs 1-6.]
     * Huang, Yanping, et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," NeurIPS (2019).
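     TorchGPipe's documented entry point wraps an nn.Sequential model; a minimal usage sketch of the node-level setup follows (the 16-layer convolutional stack and the balance are placeholders, not the paper's U-Net, and six visible GPUs, i.e., one Summit node, are assumed):

```python
# Minimal torchgpipe sketch (placeholder layers, not the paper's U-Net).
import torch.nn as nn
from torchgpipe import GPipe

# A model must be expressed as nn.Sequential so it can be cut into partitions.
model = nn.Sequential(*[nn.Conv2d(8, 8, 3) for _ in range(16)])

model = GPipe(model,
              balance=[3, 3, 3, 3, 2, 2],  # layers per partition (6 GPUs)
              chunks=4)                    # micro-batches per mini-batch
```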

  9. Model-Parallel Experiments (Single-Node Execution)
     Workflow: N×N image samples β†’ sample-parallel q×q padded Tβ€²Γ—Tβ€² tiles β†’ model-parallel U-Net on one Summit node (6 GPUs, 96 GB aggregate GPU memory).
     Benchmark U-Net models:
       Model             | Levels | Conv. layers per level | Trainable parameters | πœ—
       Small (standard)  | 5      | 2                      | 72,301,856           | 92
       Medium-1          | 5      | 5                      | 232,687,904          | 230
       Medium-2          | 6      | 2                      | 289,357,088          | 188
       Large             | 7      | 2                      | 1,157,578,016        | 380
     - Up to 10× more trainable parameters and a 4× larger receptive field than the standard U-Net.
     - Medium-1: 2.8× (input size 192), 2.5× (512) and 2× (1024) speedup using 6 pipeline stages.
     - Speedup roughly doubles (small: 1.97×; medium-2: 2.01×) as the number of pipeline stages increases from 1 to 6.

  10. Need for Performance Improvement (Single-Node Execution)
     - Small, Medium-2 and Large models: 109, 129 and 149 layers, respectively.
     - Balances: small {14, 24, 30, 22, 12, 7}; medium-2 {16, 26, 38, 26, 12, 11}; large {18, 30, 44, 30, 14, 13}.
     - Need load-balanced pipelined execution.
     - Encoder memory: E_β„“ = O( I_β„“Β² + 2^β„“ n_f Ξ£_{i=1}^{n_c} (I_β„“ βˆ’ i Β· d)Β² )
     - Decoder memory: D_β„“β€² = O( 2 I_β„“β€²Β² + 2^β„“β€² n_f Ξ£_{i=1}^{n_c} (I_β„“β€² βˆ’ i Β· d)Β² ), where β„“β€² = L βˆ’ β„“
     - Memory profile: E_β„“ + D_β„“β€² vs. β„“ (see the sketch below)
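     A minimal Python sketch of these estimates, with the big-O constants dropped; the per-conv shrinkage d = 2 (3×3 valid convolutions) and the toy numbers are assumptions. A profile of E_β„“ + D_β„“β€² evaluated level by level in this way can seed the load-balance heuristics mentioned on the next slide:

```python
# Sketch of the slide's per-level activation-memory estimates (big-O constants
# dropped). I_l: input extent at level l; n_f: base channel count; n_c: convs
# per level; d: per-conv shrinkage (d = 2 assumes 3x3 valid convolutions).
def encoder_mem(I_l: int, l: int, n_f: int, n_c: int, d: int = 2) -> int:
    # E_l ~ I_l^2 + 2^l * n_f * sum_{i=1..n_c} (I_l - i*d)^2
    return I_l**2 + 2**l * n_f * sum((I_l - i * d)**2 for i in range(1, n_c + 1))

def decoder_mem(I_lp: int, lp: int, n_f: int, n_c: int, d: int = 2) -> int:
    # D_l' ~ 2*I_l'^2 + 2^l' * n_f * sum_{i=1..n_c} (I_l' - i*d)^2
    return 2 * I_lp**2 + 2**lp * n_f * sum((I_lp - i * d)**2 for i in range(1, n_c + 1))

# Example: finest level of a 1434x1434 padded tile, assuming n_f = 64 channels.
print(encoder_mem(1434, 0, 64, 2))
print(decoder_mem(1434, 0, 64, 2))
```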

  11. Wrapping Up
     This Paper: Prototype Sample + Model Parallel Framework
     - Training image-segmentation neural network models becomes extremely challenging when image sizes are very large, desired receptive fields are large, and the volume of training data is large.
     - Fast training/inference is needed for geo-sensing applications: satellite imagery, disaster assessment, precision agriculture, etc.
     - This work is a first step: it can train 10× larger U-Net models, with a 4× larger receptive field, on 10,000× larger images.
     Ongoing Work: Sample + Model + Data Parallel Framework
     - Ongoing efforts are underway to integrate load-balance heuristics and data-parallel execution (replicating the model-parallel U-Net across Summit nodes) to handle large volumes of training data efficiently.
     [Figure: sample-parallel padded tiles feeding data-parallel replicas of the model-parallel U-Net, one per Summit node.]

  12. THANK YOU
