Containerizing Deep Learning Frameworks with Singularity
Rengan Xu, Frank Han, Nishanth Dandapanthula HPC & AI Solutions Engineering, Dell EMC
Agenda
Dell EMC HPC & AI Solutions Engineering
– Design, develop, and integrate HPC systems
– Act as the focal point for joint R&D activities
HPC & AI Innovation Lab
– Prototype and evaluate advanced technologies
– Conduct application performance studies and develop best practices
Containers and Virtual Machines: A Recap
source: https://www.docker.com/what-container
Need for Containerization
– Simplify application building
– Application isolation
– Faster application deployment
– Validate and reproduce results
– Server consolidation / server efficiency
– Can be deployed on bare metal or on virtual machines

Why containers in particular:
– Lightweight
– Low overhead
– Easier application sharing among users
– Reproducibility

Container technologies:
– LXC
– Docker
– Singularity
Singularity vs. Docker
Feature                                    Singularity    Docker
Multiple containers on the same hardware   Yes            Yes
Created and destroyed quickly              Yes            Yes
Needs only a core runtime, not a full OS   Yes            Yes
Easily transferable to other machines      Yes            Yes
Image format                               Single file    Layered image
Use with HPC schedulers                    Yes            No
Native support for MPI                     Yes            No
Support for GPUs                           Yes            No
Root-owned daemon process                  No             Yes
Singularity: Workflow Summary
source: http://singularity.lbl.gov/docs-flow
Interoperability between Singularity and Docker
Singularity and MPI
– The MPI implementation on the host (OpenMPI, MPICH, Intel MPI, etc.) must match the MPI version inside the container
– mpirun runs on the host; the application itself runs in the container:
– mpirun -np 4 singularity exec centos_ompi.img /usr/bin/mpi_ring
source: https://wikihub.berkeley.edu/download/attachments/129695919/Containers_in_HPC_summary_Singularity.pdf
Challenges and Workarounds
– Every DL framework has many dependencies
– Each dependent library has specific version requirements
– All DL frameworks change frequently
– Most DL frameworks best support Ubuntu, whereas datacenter deployments are typically RHEL/CentOS
– Scaling containerized deep learning frameworks past a single node
– PCIe device driver mismatch
– GPUs
› The container should always use the host's GPU driver
› Create symbolic links for all GPU-driver-related files and bind them into the container
› Update to the latest driver, since GPU drivers are backward compatible
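The symlink-and-bind workaround above can be sketched as follows; the driver library paths and the image name are illustrative assumptions, not taken from the slides.

```shell
# Hypothetical sketch: collect the host's NVIDIA driver libraries into a
# directory of symbolic links, then bind that directory into the container.
# The glob patterns and image name are assumptions for illustration.
BIND_DIR=$(mktemp -d)
for f in /usr/lib64/libcuda.so* /usr/lib64/libnvidia*.so*; do
    if [ -e "$f" ]; then
        ln -sf "$f" "$BIND_DIR/"
    fi
done
# Bind the symlink directory so the container sees the host driver, e.g.:
#   singularity exec -B "$BIND_DIR":/usr/local/nvidia centos7_caffe2.img nvidia-smi
echo "$BIND_DIR"
```

Because the container resolves the symlinks through the bind mount, a driver update on the host is picked up without rebuilding the image.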
– InfiniBand
› The InfiniBand driver is kernel-dependent; the solution is to make the container OS compatible with the host OS and reuse the host's InfiniBand driver and libraries inside the container
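A minimal sketch of reusing the host InfiniBand stack via bind mounts; the bind paths and image name are assumptions, since the exact bind list is not shown on the slide.

```shell
# Hypothetical sketch: bind the host's InfiniBand user-space libraries and
# provider configuration into the container so the containerized MPI can
# use verbs. Paths and image name are illustrative assumptions.
IB_RUN="singularity exec -B /etc/libibverbs.d -B /usr/lib64 centos7_caffe2.img ibv_devinfo"
# Run on a host with the InfiniBand stack (e.g. Mellanox OFED) installed:
echo "$IB_RUN"
```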
Singularity recipe for Caffe2
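The recipe itself appears only as a screenshot on the original slide and is not recoverable from this text; the following is a hypothetical sketch of what a Singularity 2.x recipe for Caffe2 on CentOS 7 might look like. Every package name, repository URL, and path here is an assumption.

```
Bootstrap: docker
From: centos:7

%post
    # Hypothetical build steps; package list and repo URL are assumptions
    yum -y install epel-release
    yum -y install git cmake make gcc-c++ python-devel python2-pip
    pip install numpy protobuf
    git clone --recursive https://github.com/caffe2/caffe2.git /opt/caffe2

%environment
    export PYTHONPATH=/opt/caffe2/build:$PYTHONPATH
```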
Building the container
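The build commands are likewise shown only as a screenshot; a plausible sketch using the Singularity 2.4-era CLI is below. The recipe filename is an assumption; the sandbox name matches the one used on the "Run the container" slide.

```shell
# Hypothetical sketch: build a writable sandbox directory from a recipe.
# "caffe2.def" is an assumed recipe filename, not taken from the slides.
BUILD_CMD="sudo singularity build --sandbox centos7_caffe2_dev_sandbox caffe2.def"
# Run on a host with Singularity installed (build requires root):
echo "$BUILD_CMD"
```

A sandbox (a plain directory tree rather than a single image file) is useful here because the Caffe2 build on the next slide needs to write into the container.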
Build Caffe2 inside the container
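This slide's content is also a screenshot; a hypothetical sketch of compiling Caffe2 inside the writable sandbox follows. The source path and the cmake flag are assumptions.

```shell
# Hypothetical sketch: enter the writable sandbox and compile Caffe2.
# The /opt/caffe2 path and -DUSE_CUDA=ON flag are illustrative assumptions.
COMPILE_CMD='sudo singularity exec --writable centos7_caffe2_dev_sandbox /bin/bash -c "cd /opt/caffe2 && mkdir -p build && cd build && cmake .. -DUSE_CUDA=ON && make -j install"'
# Run on the build host after the sandbox has been created:
echo "$COMPILE_CMD"
```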
Run the container
( singularity exec -s /bin/bash \
    centos7_caffe2_dev_sandbox /mnt/caffe2_singularity_cmd.sh \
    ${WORK_DIR} ${gpu_arch} ${gpus_per_node} $network ${run_id} \
    ${num_nodes} $epochs $profile $debug $mpi ) >& $train_log
Testbed
– [Cluster configuration shown as a figure on the original slide]
– In the process of upgrading to 32 nodes with NVLink
Performance Results – MXNet
– 25.8x speedup in FP16 when scaling from 1 to 32 V100 GPUs
– Performance difference between Singularity and bare-metal is within 0.2%–0.5%
[Chart: MXNet ResNet-50 throughput in images/sec for 1–32 V100 GPUs; FP32 and FP16, Singularity vs. bare-metal]
Performance Results – Horovod + TensorFlow
– 23.7x speedup in FP16 when scaling from 1 to 32 V100 GPUs
– Performance difference between Singularity and bare-metal ranges from 0.0% to 1.2%
[Chart: Horovod+TensorFlow ResNet-50 throughput in images/sec for 1–32 V100 GPUs; FP32 and FP16, Singularity vs. bare-metal]
Performance Results – Caffe2
– Performance difference between Singularity and bare-metal ranges from 0.0% to 4.0%
[Chart: Caffe2 ResNet-50 throughput in images/sec for 1–32 V100 GPUs; FP32 and FP16, Singularity vs. bare-metal]
Conclusions and Future Work
Conclusions:
– Singularity simplifies the building and deployment of DL frameworks on both single-node and multi-node systems
– Easy to use Singularity on GPU servers
– Straightforward to run MPI over an InfiniBand interconnect
– No performance loss compared to bare-metal

Future work:
– File system impact on DL models
– Impact of scale on DL model accuracy
– Research on neural networks with model parallelism
– Case studies with appropriate DL models
{Rengan.Xu,Frank.Han1,Nishanth.Dandapanthu}@Dell.com