Large-scale GPU Deep Learning Platform Design and Case Analysis - PowerPoint PPT Presentation



SLIDE 1

YOUR SUCCESS, WE SUCCEED

Large-scale GPU Deep Learning Platform Design and Case Analysis

Alfie Lew Zhang Qing

SLIDE 2

AI Age Has Arrived

Steam Age

  • In the 1760s
  • The first technological revolution

Electric Age

  • In the 1870s
  • The second technological revolution

Information Age

  • In the 1940s~1950s
  • The third technological revolution

AI Age

  • In 2012
  • The fourth technological revolution

SLIDE 3

AI Application Trend

  • More and more users
    – The Internet
    – Security and surveillance
    – Finance, health care
    – Car manufacturers
    – Robots, entertainment

  • More and more application scenarios
    – Image/video analysis
    – Speech recognition
    – NLP/OCR
    – …

Smart city · Finance · Medical care · Automobile · Household · Entertainment

SLIDE 4

Deep Learning Process Flow

Data sets → Data Preprocessing → Training → Model → Inference (example outputs: "Abnormal", "Thank you")

SLIDE 5

Deep Learning Computing Characteristics

Inference

High throughput and low latency

Training

Extreme Computing and Communication Intensity

Data Preprocess

High IO Intensity

SLIDE 6

Deep Learning Computing System Trend

  • Computing Mode
    – From single node to clusters
    – From local to cloud

  • Data Storage
    – From dedicated storage (training and inference) to unified storage

  • System Management
    – Development platform
    – Production platform
    – Cloud platform

  • Application Mode
    – From single user to multi-user
    – From single framework to multiple frameworks

SLIDE 7

Deep Learning Challenges

  • Obtaining large amounts of labeled data and preprocessing them efficiently
  • Implementing distributed parallel neural-network algorithms for speed, scale, and expandability
  • Building a large-scale deep learning computing platform

SLIDE 8

Architecture of Large-scale Deep Learning System

App Level
  • Image/video apps, speech apps, NLP apps

Framework Level
  • Caffe-MPI, TensorFlow, Caffe, CNTK, MXNet

Management Level
  • Inspur AIStation: monitoring, management, scheduling, image management, application analysis
  • Inspur Teye

Platform Level
  • GPU training platform, GPU inference platform, CPU pre-processing platform
  • Parallel storage, 10GbE/IB network

SLIDE 9

Deep Learning Challenges - Platform Level Design

  • IO efficiency for data pre-processing
  • Computing resources required for modeling, tuning, and optimization
  • Inference speed and throughput for processing large numbers of samples

SLIDE 10

Architecture of Large-scale Deep Learning Platform

  • Computing Architecture
    – Data preprocessing platform: CPU cluster
    – Training platform: CPU + P100/P40 GPUs, HPC cluster
    – Inference platform: CPU + P4 GPUs, Hadoop

  • Data Storage
    – Offline with Lustre
    – Online with HDFS

  • Network
    – Offline with InfiniBand
    – Online with 10GbE
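The stage-to-platform mapping on this slide can be sketched as a small lookup table. This is purely illustrative, not a real configuration format; pairing training and preprocessing with Lustre/InfiniBand and inference with HDFS/10GbE follows the slide's offline/online split.

```python
# Illustrative mapping of pipeline stages to the compute, storage,
# and network resources described on this slide (names are the
# slide's, the dict layout is hypothetical).
PLATFORMS = {
    "preprocessing": {"compute": "CPU cluster",
                      "storage": "Lustre", "network": "InfiniBand"},
    "training": {"compute": "CPU + P100/P40 GPU (HPC cluster)",
                 "storage": "Lustre", "network": "InfiniBand"},
    "inference": {"compute": "CPU + P4 GPU (Hadoop)",
                  "storage": "HDFS", "network": "10GbE"},
}

def platform_for(stage):
    """Look up the resources assigned to one pipeline stage."""
    return PLATFORMS[stage]
```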

SLIDE 11

Deep Learning Challenges - Management Layer

  • Managing different computing platforms and their configurations/devices
  • Managing different frameworks for different computing tasks
  • Managing the whole system and monitoring the various computing tasks

SLIDE 12

Deep Learning Management System

AIStation is deep learning cluster and training-task management software. It rapidly deploys training environments for deep learning, comprehensively manages deep learning training tasks, and provides an efficient, convenient platform for users.

Key Functions
  • GPU & CPU monitoring
  • Deployment of the deep learning environment
  • Management of deep learning training tasks
  • GPU resource management and scheduling
  • Cluster statistics & reports

SLIDE 13

AIStation - Workflow

Workflow stages: user interaction → compose training jobs → resource scheduling (assign GPUs) → container installation → containers run → applications start → training management.

  • Compose training jobs
    1. Resources: GPU
    2. Templates: TF1
    3. Images: TF/v1.0
    4. Parameters: ps, ws…
    5. Data: volume
  • Resource scheduling
    1. Job starter
    2. TF1.yaml
  • Containers run / applications start
    1. Run containers
    2. Execute job commands
  • User interaction / training management
    1. Shell access
    2. VNC access
    3. Training visualization
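The job-composition step above (resources, template, image, parameters, data volume) can be sketched as a job specification plus a minimal validation check. The field names and the `validate_job_spec` helper are hypothetical illustrations, not AIStation's actual schema or API.

```python
# Hypothetical training-job specification mirroring the five items
# composed in the AIStation workflow (field names are illustrative).
job_spec = {
    "name": "tf-training-job",
    "resources": {"gpus": 4},               # 1. Resources: GPU
    "template": "TF1",                      # 2. Template
    "image": "TF/v1.0",                     # 3. Container image
    "parameters": {"ps": 1, "workers": 4},  # 4. Parameters: ps, ws...
    "data": {"volume": "/mnt/datasets"},    # 5. Data volume
}

def validate_job_spec(spec):
    """Check that the fields a scheduler would need are present."""
    required = {"name", "resources", "image"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True
```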

SLIDE 14

AIStation - Integrating Deep Learning Frameworks

  – Supports multiple deep learning frameworks: Caffe, TensorFlow, CNTK, etc.
  – Supports various models: GoogleNet, VGG, ResNet, etc.
  – One-key deployment of the deep learning environment
  – Training job submission & scheduling
  – Training process management & visualization

GPU Resource Utilization: 20%
Training Jobs Throughput: 30%

SLIDE 15

Teye: Application Optimization Analysis Tool

  • Analyzes application bottlenecks and characteristics
    – GPU driver data: clock, ECC, power
    – GPU runtime data: memory utilization, memory copy, cache, SP/DP GFLOPS
    – CPU runtime info: AVX, SSE, SP/DP GFLOPS, CPI
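A tool that samples counters like these typically classifies each phase of an application as compute-bound or memory-bound by comparing achieved rates against hardware peaks. Here is a minimal illustrative sketch of that idea; the function and all numbers are hypothetical, not Teye's actual analysis.

```python
# Sketch of a roofline-style bottleneck check: compare the fraction of
# peak compute achieved against the fraction of peak memory bandwidth
# achieved, and report whichever dominates.
def classify_bottleneck(sp_gflops, mem_bw_gbs, peak_gflops, peak_bw_gbs):
    """Classify a sampled phase as compute-bound or memory-bound."""
    compute_frac = sp_gflops / peak_gflops   # achieved / peak FLOPS
    memory_frac = mem_bw_gbs / peak_bw_gbs   # achieved / peak bandwidth
    return "compute-bound" if compute_frac >= memory_frac else "memory-bound"
```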

SLIDE 16

Deep Learning Challenges - Framework

  • How to select from the many deep learning frameworks?

Caffe, TensorFlow, MXNet, CNTK, Torch, Theano, DeepLearning4j, PaddlePaddle …

  • What framework to use for a given scenario and model?
  • Using a single framework or multiple frameworks?
SLIDE 17

A Framework Comparison

  • Compute platform: Inspur SR-AI Rack (16 GPUs) + AIStation + Teye (management)
  • Frameworks: Caffe, TensorFlow, MXNet
  • Models: AlexNet, GoogLeNet
  • Performance
    – AlexNet: 4675.799 images/s (16 GPUs vs. 1 GPU = 14X speedup) → Caffe is best
    – GoogLeNet: 2462 images/s (16 GPUs vs. 1 GPU = 13X speedup) → MXNet is best
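The speedup figures above imply how close each run comes to linear scaling. A quick check, assuming the 14X and 13X figures are 16-GPU vs. single-GPU speedups as the slide suggests:

```python
# Scaling efficiency implied by the slide's figures: speedup divided
# by GPU count (1.0 would be perfect linear scaling).
def scaling_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling achieved."""
    return speedup / n_gpus

alexnet_eff = scaling_efficiency(14, 16)    # 0.875, i.e. 87.5% of linear
googlenet_eff = scaling_efficiency(13, 16)  # 0.8125, i.e. 81.25% of linear
```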

SLIDE 18

Factors to Consider when Selecting a Framework

  • Based on model size and complexity
  • Based on different application scenarios

– Image – Speech – NLP

  • Based on data size (when selecting a distributed framework)

– Caffe-MPI – Tensorflow – MxNet

SLIDE 19

Deep Learning Challenges - Applications Layer

  • How to improve recognition accuracy?
    – Model design
    – Data pre-processing

  • How to improve training performance?
    – CUDA programming for half precision (Pascal)
    – CUDA programming for mixed precision

  • How to improve inference performance?
    – CUDA programming for INT8
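The idea behind INT8 inference is to scale real-valued tensors into the signed 8-bit range, compute at low precision, and dequantize afterwards. A minimal pure-Python sketch of symmetric per-tensor quantization, purely for illustration (the deck's actual implementation is CUDA code, not shown here):

```python
# Symmetric INT8 quantization sketch: map values into [-127, 127]
# using a single per-tensor scale, then recover approximations by
# multiplying back.
def quantize_int8(values):
    """Quantize a list of floats to signed 8-bit integers plus a scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate real values from INT8 codes."""
    return [v * scale for v in q]
```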

SLIDE 20

Deep Learning Applications on GPU

Applications accelerated on GPU: image search, speech training, image training, network security.

[Chart: runtime comparisons of a CPU (C+MKL) version against 1-GPU and 4-GPU versions, and of CPU vs. GPU on 1M samples with 180 dimensions; reported times include 256.1 s and 115.2 s.]

SLIDE 21

Deep Learning Platform End-to-End

Hardware
  • Data processing cluster: NF5280M4
  • Training GPU cluster: 16-card GPU box, 2U 4-card, 2U 8-card, 4U 4-card
  • Inference: P8000 workstation GPU, AI cloud, terminal
  • Storage: flash storage AS5600/13000
  • Network: 10G/IB

DL Management
  • AIStation management system, T-Eye tuning tool

DL Frameworks
  • Caffe-MPI, TensorFlow, MXNet, PaddlePaddle

Models & Algorithms
  • AlexNet/GoogLeNet/ResNet, CNN/RNN/LSTM

AI Recognition
  • Speech/image/video/natural language processing

Applications
  • Speech recognition, face recognition, video monitoring, medical imaging, personal assistant
  • Example outputs: "Big Win!", "This is Daniel Wu", "Retinopathy", "Have booked G6", "Pursuit staff"

SLIDE 22

Inspur Deep Learning GPU Servers

  • 2-GPU server: NF5280M4 (inference)
  • 4-GPU server: NF5568M4 (training)
  • 8-GPU server: AGX-2 (training)
  • 64-GPU server: SR-AI Rack (training)

Inspur is a leading AI computing provider, supplying more than 60% of the AI hardware used by cloud service providers (CSPs) in China.

SLIDE 23

COMPUTING INSPIRES FUTURE

Thank You

Visit us in Booth #911