

SLIDE 1

Pluto

A Distributed Heterogeneous Deep Learning Framework

Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud

SLIDE 2

Outline


  • PAI (Platform of Artificial Intelligence)
    • PAI Overview
    • Deep Learning with PAI
  • Pluto
  • PAI DL Application
    • Chatbot Engine
  • Summary
SLIDE 3

Machine Learning Platforms

SLIDE 4

PAI Overview

[Architecture diagram]

  • Frontend: PAI Web Console, PAI IDE, PAI SDK
  • Data Storage: OSS; streaming data via DataHub/TT/Kafka; databases via ODPS/RDS
  • Distributed Computing: Fuxi scheduler on CPU/GPU/FPGA/ASIC/…; frameworks: MR/MPI/PS/Graph/Pluto/…
  • Algorithms: Feature Engineering, Statistical Methods, Machine Learning, Deep Learning, …
  • Serving

Tutorial: data.aliyun.com

SLIDE 5

PAI Project

[Screenshot of a PAI project: Search, Experiments, Data Sources, Components, Models, Serving]

SLIDE 6

Machine Learning with PAI

  • Data Preprocessing: Sampling & Filtering, Data Merge, Fill Missing Values, Normalization, …
  • Feature Engineering: Feature Transformation, Feature Selection, Feature Importance, Feature Generation
  • Statistics: Correlation Coefficients, Histogram, Hypothesis Test, Visualization, …
  • Modeling: Binary Classification, Multi-class Classification, Clustering, Regression, Prediction, Evaluation
  • Deep Learning (Pluto): DNN, CNN, RNN, à la carte
  • Application: NLP, Search & Recommendation, Image Processing, Network Analysis, Financial Sector, …

SLIDE 7

Deep Learning with PAI

SLIDE 8

PAI TensorFlow

  • Rich data IO
  • Distributed job optimization (multiple GPUs/CPUs)
  • Easy model serving
  • Hyper-parameter tuning
SLIDE 9

Pluto

SLIDE 10

Single-card Optimization

  • Compiler-oriented strategy
    • Fuse small ops into a bigger one
      • Reduces CUDA kernel-launch overhead
    • Prepare data layouts friendly to the low-level computation library
  • Memory optimization
    • Compiler-oriented tactics here again
      • Dependency analysis
      • Lifetime analysis
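The op-fusion point above can be illustrated with a toy graph rewrite. This is a sketch of the general compiler technique, not Pluto's actual pass: the op names, the `FUSABLE` set, and `fuse_elementwise` are all made up for illustration.

```python
# Toy fusion pass: collapse runs of small element-wise ops into one
# composite op, so a single fused kernel is launched instead of many.
# Illustrative only -- not Pluto's real compiler pass.
FUSABLE = {"add", "mul", "relu"}  # hypothetical fusable element-wise ops

def fuse_elementwise(ops):
    """ops: op names in execution order; returns the fused op list."""
    fused, run = [], []
    for op in ops:
        if op in FUSABLE:
            run.append(op)          # extend the current fusable run
            continue
        if run:                     # flush the run as one fused op
            fused.append("fused(" + "+".join(run) + ")")
            run = []
        fused.append(op)
    if run:
        fused.append("fused(" + "+".join(run) + ")")
    return fused

graph = ["matmul", "add", "relu", "matmul", "add", "mul", "relu"]
fused = fuse_elementwise(graph)    # 7 kernel launches become 4
```

Here each `add`/`mul`/`relu` tail collapses into a single fused op, which is exactly where the kernel-launch savings come from.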
SLIDE 11

Multi-card Optimization

  • Heuristic-based model parallelism
    • Takes both model weights and feature maps into consideration
    • Takes the memory allocator's strategy into consideration
    • A greedy allocation algorithm
    • With pre-run support
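A minimal sketch of what a greedy, memory-aware placement could look like. The function and its inputs are hypothetical stand-ins: per-layer costs would come from a pre-run that accounts for both weights and feature maps, as the bullets describe.

```python
def greedy_place(layer_costs, num_devices, capacity):
    """Greedily assign each layer to the card with the most free memory.

    layer_costs: per-layer memory cost (weights + feature maps), e.g.
    estimated by a pre-run. Returns a device index per layer. This is
    an illustrative stand-in for Pluto's allocation heuristic.
    """
    free = [capacity] * num_devices
    placement = []
    for cost in layer_costs:
        dev = max(range(num_devices), key=lambda d: free[d])
        if free[dev] < cost:
            raise MemoryError("no card has enough free memory")
        free[dev] -= cost
        placement.append(dev)
    return placement

# Four layers, two 8 GB cards: loads end up balanced at 5 + 5.
plan = greedy_place([4, 3, 2, 1], num_devices=2, capacity=8)
```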
SLIDE 12

Multi-card Optimization

  • Hybrid parallelism
    • A mixture of data parallelism and model parallelism
      • For communication-intensive parts, consider model parallelism
      • For computation-intensive parts, consider data parallelism
  • Tricks
    • Integrates seamlessly with the computation-graph style
    • Works best with pyramid-shaped networks
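The data-vs-model decision can be sketched as a per-layer heuristic on the ratio of parameters (what data parallelism must synchronize) to compute (what amortizes that cost). The threshold, layer names, and counts below are illustrative assumptions, not Pluto's actual rule.

```python
def choose_parallelism(layers, ratio_threshold=0.1):
    """layers: (name, param_count, flops_per_sample) tuples.

    Data parallelism all-reduces param_count gradient values per step,
    so parameter-heavy (communication-intensive) layers get model
    parallelism while compute-heavy layers get data parallelism.
    The threshold is an illustrative assumption.
    """
    return {name: ("model" if params / flops > ratio_threshold else "data")
            for name, params, flops in layers}

# Pyramid-style net: compute-heavy convolution at the bottom,
# parameter-heavy fully-connected layer at the top (rough counts).
net = [("conv1", 9_408, 118_000_000),
       ("fc6", 37_748_736, 75_497_472)]
plan = choose_parallelism(net)     # conv1 -> data, fc6 -> model
```

This split maps naturally onto pyramid-shaped networks: convolutional lower layers run data-parallel, fully-connected upper layers run model-parallel.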
SLIDE 13

Multi-card Optimization

  • Hybrid parallelism (cont.)

[Benchmark charts: M40 result, K40 result]

SLIDE 14

Multi-card Optimization

  • Late multiply
    • Customized for fully-connected layers
    • A trade-off between computation and communication

Wavg: [Nl, Nl+1], X: [M, Nl], E: [M, Nl+1], where Nl and Nl+1 are the layer sizes and M is the mini-batch size
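The shapes above make the trade-off concrete. A numeric sketch with illustrative sizes, assuming the usual fully-connected weight gradient dW = Xᵀ·E: communicating X and E and multiplying them after the transfer moves far less data than communicating the gradient itself, whenever the mini-batch M is much smaller than the layer sizes.

```python
import numpy as np

# Late-multiply volume comparison for one fully-connected layer.
# Shapes as on the slide: Wavg [Nl, Nl+1], X [M, Nl], E [M, Nl+1].
M, Nl, Nl1 = 32, 4096, 4096            # illustrative sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((M, Nl))       # layer inputs
E = rng.standard_normal((M, Nl1))      # back-propagated errors

early_volume = Nl * Nl1                # send dW = X.T @ E directly
late_volume = M * (Nl + Nl1)           # send X and E, multiply late
ratio = early_volume // late_volume    # 64x less traffic here

dW = X.T @ E                           # the multiply, done "late"
```

With M = 32 and 4096-wide layers, late multiply cuts communication 64x; the cost is redoing one [Nl, M] × [M, Nl+1] matmul after the transfer, which is the computation side of the trade-off.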

SLIDE 15

Multi-card Optimization

  • Late multiply (cont.)
SLIDE 16

Multi-card Optimization

  • Heuristic-based MA (model averaging)
    • Automatic batch-size selection
    • Learning-rate auto-tuning
    • Works best with sequential models
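A minimal model-averaging sketch: workers run local SGD between synchronization points, then average their weights. The local gradients and learning rate here are placeholders; the slide's heuristics for choosing batch size and learning rate are not shown.

```python
import numpy as np

def local_sgd(w, grads, lr):
    """One worker's local SGD steps between MA synchronizations."""
    for g in grads:
        w = w - lr * g
    return w

def model_average(replicas):
    """MA synchronization point: average all workers' weight replicas."""
    return np.mean(replicas, axis=0)

# Two workers diverge on different local gradients, then re-sync.
w0 = np.zeros(3)
r1 = local_sgd(w0, [np.array([1.0, 0.0, 0.0])], lr=0.1)
r2 = local_sgd(w0, [np.array([0.0, 1.0, 0.0])], lr=0.1)
w1 = model_average([r1, r2])
```

Batch size and learning rate interact with how far replicas drift between averages, which is why the heuristics above tune them automatically.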
SLIDE 17

Multi-card Optimization

  • Heuristic-based MA (cont.)

[Charts: training time (wall-clock) and model metrics]

SLIDE 18

Inference Optimization

  • Quantization
    • Significantly reduces model size (4X)
    • Around 2X speed-up on average
  • Binarized Neural Networks
    • Binarize the model weights
    • Convert floating-point computation into bit manipulation
    • Both model size and computation speed improve significantly
    • The training process must be adapted to compensate for accuracy loss
    • Works well with CNNs, but for RNNs…
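The 4X size claim follows directly from the bit widths (32-bit floats down to 8-bit integers). Below is a sketch of symmetric post-training 8-bit quantization; the exact scheme PAI uses is not stated on the slide, and the ~2X speed-up depends on hardware integer kernels.

```python
import numpy as np

def quantize(w):
    """Symmetric per-tensor 8-bit quantization of float32 weights."""
    scale = float(np.abs(w).max()) / 127.0   # map max |w| onto int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize(w)
# 32-bit -> 8-bit: exactly the 4X size reduction cited on the slide,
# with per-weight rounding error bounded by scale / 2.
```

Binarization pushes the same idea to 1 bit per weight (sign only), which is why its training process must compensate for the much larger rounding error.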
SLIDE 19

PAI DL Application

SLIDE 20

AliMe – Personal Assistant Bot in E-commerce

AliMe for Customers AliMe for Sellers AliMe for Enterprises

From 海青 @ 云栖大会 (the Computing Conference)

SLIDE 21

Open-Domain Conversations

  • Retrieval Model
  • Learning to rank
  • Generation Model
  • Sequence to Sequence (Seq2Seq) Model
  • Recurrent Neural Networks: LSTM, GRU (our choice)

[Diagram (Cho et al., 2014): the query is matched against a knowledge base of QA pairs Q1-A1: s1, Q2-A2: s2, …, Qn-An: sn; the top-scoring answer A1 is returned]


SLIDE 22

A Hybrid Conversation Model based on Seq2Seq

  • Overview

[Diagram: Query → Retrieval Module (IR over a QA-pair knowledge base built from chat logs and SNS data) → candidate answers → Seq2Seq-based rerank; if the top score > T, output that answer; otherwise output the Seq2Seq generation model's answer]


[AliMe Chat: Minghui Qiu et al., ACL 2017]
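The control flow of the hybrid model can be sketched in a few lines. `retrieve`, `score`, and `generate` are stand-ins for the real IR, Seq2Seq rerank, and Seq2Seq generation modules, and the threshold value is illustrative; see the cited paper for the actual system.

```python
def answer(query, retrieve, score, generate, T=0.5):
    """Hybrid flow: output the top reranked retrieved answer if its
    score clears the threshold T, otherwise generate an answer."""
    candidates = retrieve(query)
    if candidates:
        best = max(candidates, key=lambda c: score(query, c))
        if score(query, best) >= T:
            return best
    return generate(query)

# Toy stand-in modules: a tiny QA knowledge base with rerank scores.
kb = {"hi": [("hello!", 0.9)], "rare question": [("maybe?", 0.2)]}
retrieve = lambda q: [a for a, _ in kb.get(q, [])]
score = lambda q, a: {x: v for x, v in kb.get(q, [])}[a]
generate = lambda q: "<generated reply>"

confident = answer("hi", retrieve, score, generate)            # retrieved
fallback = answer("rare question", retrieve, score, generate)  # generated
```

The threshold T is what balances the retrieval model's precision against the generation model's coverage.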

SLIDE 23

PAI DL Support for AliMe

  • Both offline training and online serving are backed by PAI
  • With heuristic-based MA, the offline training task achieves a 2.8X convergence speed-up in a 4-card setting
  • With quantization, the online serving task achieves a 1.5X speed-up on commodity CPU servers


SLIDE 24

Conclusion

  • PAI DL
    • End-to-end machine learning platform
    • Supports big-data analytics
    • Optimized deep learning algorithms
    • Scheduling on CPU/GPU clouds
    • More data intelligence…
  • Pluto
    • The distributed optimization engine of PAI DL
  • PAI DL Application
    • PAI DL makes it easy to build DL methods for industrial applications

Scan the barcode to start your trial. 数据智能 触手可及 (Data intelligence, within reach)

SLIDE 25

We are hiring! :)

muzhuo.yj@alibaba-inc.com chenyan.cy@alibaba-inc.com

SLIDE 26

Reference

  • AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine, Minghui Qiu et al., ACL 2017.
  • Deep Learning with PAI: a Case Study of AliMe, Minghui Qiu et al., Deep Learning Summit 2017.
  • TensorFlow in AliMe, Jun Yang et al., Shanghai GDG, Mar. 2017.
SLIDE 27

Thanks!