Industrial Level Deep Learning Training Infrastructure: the Practice - PowerPoint PPT Presentation



SLIDE 1

Industrial Level Deep Learning Training Infrastructure

—the Practice and Experience from SenseTime

Shengen Yan, SenseTime Group Limited

SLIDE 2

The Success of Deep Learning

[Chart: Google search interest for deep learning, 2006-01 through 2016-01, rising sharply after AlexNet won ImageNet in 2012]

SLIDE 3

What Led to the Success?

SLIDE 4

Model Capacity

The Key to High Performance

[Chart: # Layers per network: LeNet: 5, AlexNet (2012): 8, GoogLeNet (2014): 22, ResNet (2016): 169, Ours: 1207]

SLIDE 5

Computation Power

[Chart: training time scale: years, months, weeks, days]

Computation power accelerated training time from several years to several days!

SLIDE 6

01 Deep Learning Package

A deep learning framework that is efficient, scalable, and flexible.

02 DeepLink

A large-scale cluster platform designed for deep learning.

03 Applications

Delivers many application models.

SLIDE 7

Deep Learning is Complicated

The deep learning community developed frameworks to make life easier.

[Figure: GoogLeNet (2014) network architecture]

SLIDE 8

Deep Learning Training Frameworks

  • SenseTime deep learning training package
  • Memory efficient
  • Computation efficient
  • Both model parallel & data parallel
  • Supports huge models
  • Scalability
SLIDE 9

Memory Footprint Optimization

High-level compiler back-end optimizations applied to an intermediate representation.

Optimizations: liveness analysis, computation-graph transformations
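To illustrate how liveness analysis saves memory, here is a minimal sketch (not SenseTime's actual compiler; op and tensor names are made up): a tensor is live from the op that produces it until its last use, and tensors whose live intervals do not overlap can share one buffer slot.

```python
# Liveness-based buffer reuse on a linearized computation graph (sketch).

def live_intervals(schedule):
    """schedule: list of (output_tensor, input_tensors) in execution order."""
    first, last = {}, {}
    for t, (out, ins) in enumerate(schedule):
        first[out] = last[out] = t
        for name in ins:
            if name in first:          # ignore graph inputs we never allocate
                last[name] = t
    return {name: (first[name], last[name]) for name in first}

def assign_slots(intervals):
    """Greedy linear-scan assignment of tensors to reusable buffer slots."""
    free, active, slots, next_slot = [], [], {}, 0
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        free.extend(s for e, s in active if e < start)   # release dead tensors
        active = [(e, s) for e, s in active if e >= start]
        if free:
            slot = free.pop()
        else:
            slot = next_slot
            next_slot += 1
        slots[name] = slot
        active.append((end, slot))
    return slots, next_slot

# A four-op chain: each op consumes only the previous op's output.
schedule = [("a", ["x"]), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
slots, n_slots = assign_slots(live_intervals(schedule))
print(slots, n_slots)   # the chain runs in 2 slots instead of 4
```

Under this toy schedule, tensors "a" and "c" (and "b" and "d") never overlap, so four intermediate tensors fit in two buffers.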

SLIDE 10


Generated graph with mirror (re-compute) nodes

Chen T., Xu B., Zhang C., et al. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

Memory Footprint Optimization
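The re-compute ("mirror node") idea from the cited Chen et al. paper can be sketched in a few lines: for a chain of n layers, store only every sqrt(n)-th activation during the forward pass and recompute the rest on demand, trading extra compute for O(sqrt(n)) memory. The toy doubling "layer" below is illustrative only.

```python
import math

def forward_chain(x, layers, keep_every):
    """Run the layer chain, storing only every keep_every-th activation."""
    checkpoints = {0: x}                  # layer-input index -> stored value
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % keep_every == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def activation_at(i, layers, checkpoints):
    """Recompute the input of layer i from the nearest earlier checkpoint."""
    j = max(k for k in checkpoints if k <= i)
    x = checkpoints[j]
    for f in layers[j:i]:
        x = f(x)
    return x

layers = [lambda v: 2 * v] * 16           # 16 identical toy "layers"
keep = int(math.sqrt(len(layers)))        # sqrt(n) spacing -> O(sqrt(n)) memory
out, ckpts = forward_chain(1.0, layers, keep)
print(out)                                # 65536.0  (2 ** 16)
print(sorted(ckpts))                      # [0, 4, 8, 12, 16]
print(activation_at(7, layers, ckpts))    # 128.0, recomputed from checkpoint 4
```

During backprop, each segment between checkpoints is recomputed exactly once, which is the sublinear-memory trade-off the slide refers to.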

SLIDE 11

Model Capacity

Memory usage efficiency, higher is better

[Chart: memory usage efficiency per framework (Ours, MxNet, TensorFlow, Chainer, Caffe, Torch) on VGG, ResNet50, ResNet152, Inception V4, ResNet269, Inception-ResNet]

SLIDE 12

Single-GPU Performance

Time per iteration in milliseconds (lower is better):

Framework    Batch-32   Batch-64   Batch-128
Caffe        497.5      1045       1965
Chainer      200        290        543
TensorFlow   178.6      315.7      587.2
Parrots      122.7      225.6      471

SLIDE 13

Communication Optimization

Supports multi-GPU and multi-node training. Three procedures: Copy, Allreduce, Copy.

Optimizations:

  • Master-slave threads to overlap communication with computation
  • GPU direct communication
  • Ring-allreduce message passing

[Diagram: GPU0-GPU3 within a node, CPU memory, and other nodes, linked by the Copy, Allreduce, Copy procedures]
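Since the slide names ring-allreduce message passing, here is a minimal single-process simulation of that pattern (illustrative only, not SenseTime's communication library): each of N workers owns a vector split into N chunks; a reduce-scatter phase followed by an allgather phase gives every worker the full sum in 2(N-1) steps, each step moving only 1/N of the data.

```python
def ring_allreduce(bufs):
    """In place: every buffer in `bufs` ends up holding the elementwise sum."""
    n = len(bufs)
    size = len(bufs[0])
    assert size % n == 0, "pad so the vector splits into n equal chunks"
    c = size // n

    def chunk(j):
        j %= n
        return slice(j * c, j * c + c)

    # Phase 1: reduce-scatter. At step t, worker r forwards chunk (r - t) to
    # its right neighbour; after n-1 steps worker r owns the fully summed
    # chunk (r + 1) % n.
    for t in range(n - 1):
        for r in range(n):
            j = chunk(r - t)
            dst = bufs[(r + 1) % n]
            dst[j] = [a + b for a, b in zip(dst[j], bufs[r][j])]

    # Phase 2: allgather. The completed chunks circulate once around the ring.
    for t in range(n - 1):
        for r in range(n):
            j = chunk(r + 1 - t)
            bufs[(r + 1) % n][j] = bufs[r][j]
    return bufs

# Four simulated GPUs, each contributing a distinct 8-element vector.
bufs = [[float(g)] * 8 for g in range(4)]
ring_allreduce(bufs)
print(bufs[0])   # [6.0, 6.0, ...]: 0 + 1 + 2 + 3 in every position
```

The bandwidth cost per worker is 2(N-1)/N of the vector size, nearly independent of N, which is why the ring pattern scales well to many GPUs.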

SLIDE 14

Scalability

[Charts: millisec/iteration and scale efficiency vs. # GPUs, for a single node (1-4 GPUs) and multiple nodes (8-32 GPUs)]
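For reference, a common way to compute the "scale efficiency" metric plotted here, assuming weak scaling (per-GPU batch size fixed, so the ideal per-iteration time stays flat as GPUs are added): efficiency(N) = T(1 GPU) / T(N GPUs). The timings below are made up for illustration; they are not the slide's measurements.

```python
# Hypothetical per-iteration timings (ms) at fixed per-GPU batch size.
timings_ms = {1: 250.0, 2: 255.0, 4: 262.0, 8: 270.0, 16: 281.0, 32: 300.0}

# Weak-scaling efficiency: ideal is 1.0 (constant time per iteration).
efficiency = {n: timings_ms[1] / t for n, t in timings_ms.items()}
for n in sorted(efficiency):
    print(f"{n:>2} GPUs: {efficiency[n]:.3f}")
```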

SLIDE 15

01 Deep Learning Package

A deep learning framework that is efficient, scalable, and flexible.

02 DeepLink

A large-scale cluster platform designed for deep learning.

03 Applications

Delivers many application models.

SLIDE 16

The Role of the Supercomputer

It is like a highway in a city: a key infrastructure of AI.

SLIDE 17

Supercomputing Centers for AI

The key infrastructure for AI research.

[Diagram: DeepLink linking DATA, COMPUTATION, and MODEL]

SLIDE 18

Challenges

  • Interconnects at multiple levels: GPUs, nodes, sub-networks
  • Distributed data: random access becomes particularly difficult
  • Scale vs. stability: failures of individual nodes/links
  • Human resources: engineers who understand both deep learning and HPC are difficult to come by
SLIDE 19

DeepLink Clusters

Designed for deep learning through software-hardware co-design: high-performance hardware plus customized middleware, maximizing their respective strengths while ensuring optimal cooperation.

  • High-speed interconnects
  • High-performance GPU computing
  • Efficient distributed storage
  • Distributed storage & cache system (optimized for small files)
  • Distributed deep learning framework
  • Task scheduling & monitoring
SLIDE 20

Platform Overview

Software platform components:

  • Heterogeneous deep learning supercomputer
  • High-speed storage system
  • Operation/maintenance/monitoring system
  • Lightweight virtualization
  • Task scheduling system
  • Distributed training software
  • Deep learning training visualization system
  • Customized communication library for deep learning
  • Computation library
  • Distributed cache system

SLIDE 21

Training Visualization

SLIDE 22

DeepLink in SenseTime

>3000 GPUs

SLIDE 23

01 Deep Learning Package

A deep learning framework that is efficient, scalable, and flexible.

02 DeepLink

A large-scale cluster platform designed for deep learning.

03 Applications

Delivers many application models.

SLIDE 24

SLIDE 25

THANK YOU