Our team: Zehua Hu, Menghao Li, Jeffrey Zhu, Elton Zheng, Mingqin Li, Jason Li, Yuxiong He
Microsoft AI and Research
Deep Learning at Microsoft
Deep Learning Inference Service
- Serves Bing, Office, and Cortana
- Large scale
- Millions of model inferences per second
- Hundreds of models
- Tens of thousands of servers
- Forty data centers worldwide
- Variety of serving requirements
- TensorFlow, PyTorch
- Windows, Linux
- CPU, GPU
- Strict latency requirements
- Often single-digit milliseconds
Model Optimization: Example
- Large-scale BERT [1] for Bing web ranking
- 1 million queries per second
- TensorFlow latency and throughput were unacceptable
- Hand-optimized BERT on V100 GPU
- 800x throughput increase
- Millions of dollars saved
- Over a month of dev time
- Blog post: https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/
- 1. Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, https://arxiv.org/pdf/1810.04805.pdf
Model Optimization: Challenges
- Existing DL frameworks don’t fit our requirements
- Challenges:
- Reducing latency to a level each scenario can accept
- Supporting advanced models at large scale while saving cost
- Agility to bring new optimization techniques into production
- We need new solutions to ship new and exciting models
Model Optimization: Solutions
Custom Optimizations
- Rewrite models with a high-performance C++ library
- Customized serving runtime and performance tuning
- Examples: DeepCPU, DeepGPU, TensorRT
- Pros: low latency and high throughput; best utilization of hardware
- Con: low agility

Framework Integration
- Integrate custom ops with existing frameworks (e.g., TF, PyTorch)
- Replace nodes in model graphs and leverage the existing framework serving engine
- Examples: customized TensorFlow, WinML
- Pros: less development work; decent latency improvement
- Con: suboptimal performance

Can we achieve low latency, high throughput, and high agility?

Compiler
- Graph-level optimizations
- Optimized code generation
- Cross-platform, cross-device (a sketch of this path follows)
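To make the compiler path concrete, here is a minimal sketch (not code from the talk) of importing a frozen TensorFlow graph into TVM's Relay IR, compiling it, and timing it. The model path, input tensor name, and input shape are placeholder assumptions.

```python
# Minimal sketch of the compiler path: TF graph -> Relay -> compiled module.
# "model.pb", the "input" tensor name, and the (1, 128) shape are placeholders.
import tensorflow as tf
import tvm
from tvm import relay
from tvm.contrib import graph_executor  # named graph_runtime in older TVM

# Load a frozen TensorFlow graph.
with tf.io.gfile.GFile("model.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Import into Relay and compile with graph-level optimizations enabled.
mod, params = relay.frontend.from_tensorflow(graph_def, shape={"input": (1, 128)})
target = "llvm"  # cross-platform: switch to "cuda" for GPU
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Measure end-to-end latency of the compiled module.
dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
timer = module.module.time_evaluator("run", dev, number=100)
print("mean latency: %.2f ms" % (timer().mean * 1e3))
```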
Case Study 1: Query Understanding for Bing
- Generate query encoding for ranking
- Model: CNN embedding + LSTM + scoring function
- Latency SLA: 35ms
- TensorFlow: 112ms on CPU
- TVM + Custom RNN: 34ms on CPU
A Hybrid Approach: TVM + DeepCPU
- DeepCPU [1] is plugged in as a TVM external library (sketched below)
- Automatically identify high-level TF constructs
- Utilize TensorFlow scopes
- Identify single- and bi-directional LSTMs
- Rewrite the Relay graph
- Replace the subgraph with a custom op node
- 63ms -> 15ms
- CNN and the rest of the graph are optimized and auto-tuned by TVM
- 49ms -> 19ms (2.5x speedup)
- 1. Zhang et al., “DeepCPU: Serving RNN-based Deep Learning Models 10x Faster”, USENIX ATC 2018
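The external-library mechanism can be illustrated with te.extern, which emits a call to a runtime-registered packed function instead of generated code. This is a sketch: the packed-function name "tvm.contrib.deepcpu.lstm" and its argument layout are hypothetical, while TVM's bundled tvm.contrib.cblas.matmul is a real example of the same pattern.

```python
# Sketch of plugging an external library into TVM via te.extern.
# "tvm.contrib.deepcpu.lstm" and its argument layout are hypothetical;
# TVM's bundled tvm.contrib.cblas.matmul uses the same mechanism.
import tvm
from tvm import te

seq_len, batch, hidden = 128, 1, 256
X = te.placeholder((seq_len, batch, hidden), name="X")  # input sequence
W = te.placeholder((4 * hidden, 2 * hidden), name="W")  # fused LSTM weights

# te.extern creates an opaque node: TVM allocates the output buffer and,
# at runtime, invokes the registered packed function instead of codegen.
H = te.extern(
    (seq_len, batch, hidden),
    [X, W],
    lambda ins, outs: tvm.tir.call_packed(
        "tvm.contrib.deepcpu.lstm", ins[0], ins[1], outs[0]
    ),
    name="H",
)
s = te.create_schedule(H.op)
```

Because such a node is opaque to the compiler, it must not be fused with its neighbors; that is what the schedule_extern fix listed on the contributions slide addresses.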
Case Study 2: Azure QnA Maker Service
- Azure cognitive service that creates question-and-answer bots
- Model: Distilled BERT
- Latency SLA: 10ms
- TensorFlow: 73ms on CPU, 10.1ms on GPU
- TVM + our improvements: 28ms on CPU, 5.5ms on GPU
Optimizing BERT with TVM on GPU
- New operators
- OneHot, Erf, BatchMatMul with > 3 dimensions
- New softmax schedule tailored for large-vocabulary projection
- Added support for half-precision and extended GEMM on TensorCore (see the sketch after the results below)
- Still a gap with the hand-tuned version, but a decent speedup over TF-GPU (46% improvement)
Latency (ms) on an Nvidia V100:
- TF-GPU: 10.1
- TVM, unsupported ops running on CPU: 14.1
- TVM, unsupported ops added: 9.8
- TVM, optimized softmax: 7.4
- TVM, TensorCore + fp16: 5.5
- Customized optimization: 3.3
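As a rough sketch of the GPU-side knobs above (not the exact code behind these numbers): TVM can route BatchMatMul/dense to cuBLAS via the target's -libs option, and newer releases ship a pass that converts a Relay module to half precision. The tiny one-op module below is a stand-in for the real BERT graph.

```python
# Sketch: fp16 conversion and cuBLAS offload in TVM.
import tvm
from tvm import relay

# A tiny Relay module with one batch_matmul, standing in for the BERT graph.
x = relay.var("x", shape=(12, 128, 64), dtype="float32")
y = relay.var("y", shape=(12, 128, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.nn.batch_matmul(x, y)))

# Convert to half precision. ToMixedPrecision ships with newer TVM releases;
# the fp16 support described in this talk predates that pass.
mod = relay.transform.InferType()(mod)
mod = relay.transform.ToMixedPrecision("float16")(mod)

# "-libs=cublas" routes batch_matmul/dense to cuBLAS, which can use
# TensorCores for fp16 GEMMs on a V100.
target = tvm.target.Target("cuda -libs=cublas")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)
```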
Contributions to TVM
- CombineParallelDense IR pass
- Operators for TensorFlow and ONNX frontends
- Improve softmax compute and CPU schedule
- Auto-tune softmax schedule (see the tuning sketch after this list)
- > 80% improvement on 16 cores
- Fix schedule_extern to prevent fusion of external ops
- ~50% improvement when using external libraries on CPU
- Support MKL and cuBLAS for BatchMatMul
- Windows support and fixes
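For reference, this is what an auto-tuning run looks like with AutoTVM: a generic tuning loop, not the exact code behind the softmax numbers above. mod, params, and target are assumed to come from the earlier compiler-path sketch, and an op such as softmax is only tunable once a schedule template for it is registered with AutoTVM, which is what the contribution above added.

```python
# Generic AutoTVM tuning loop (a sketch; mod/params/target are assumed to
# come from the earlier compiler-path sketch).
import tvm
from tvm import autotvm, relay

# Extract tunable tasks (ops with registered schedule templates).
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
measure = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)  # cost-model-guided search
    tuner.tune(
        n_trial=min(1000, len(task.config_space)),
        measure_option=measure,
        callbacks=[autotvm.callback.log_to_file("tuning.log")],
    )

# Rebuild with the best configurations found during tuning.
with autotvm.apply_history_best("tuning.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```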
Our Experience with TVM
- Vibrant, supportive, and open community
- Developer-friendly
- Emphasis on innovating and experimenting with new techniques
- Performance improvement over popular DL frameworks
- Several models shipped to production
- We are looking forward to contributing and trying new features from the community!
- Dynamic shapes, TensorFlow dynamic RNN, bring-your-own-codegen