Microsoft AI & Research
- Traditional IR: keyword-based search; AUTB streams; inverted index
- User engagement: user clicks, metawords; inverted index
- AI: natural language search; voice, vision; context-based search; deep learning vectors
- AI: conversational search
Key Challenges
Ideas → Experiments → Trained models → Shipped models
Trade-offs – A Visual Taxonomy
“Maximum bar” metrics:
- Model latency (milliseconds @ 99)
- Cost of HW ($/ops)
- Error rate (1%)
- Model footprint
“Minimum bar” metrics:
- Throughput (@ given 99% latency)
- Cost efficiency (ops/$)
- Model accuracy (99%)
- Model fit
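For concreteness, a minimal sketch of how the latency-oriented metrics above are usually measured; model_call and the request count are placeholder assumptions, not anything from the deck:

    import time
    import numpy as np

    def measure(model_call, n_requests=1000):
        # Measure per-request latency, then report p99 latency ("maximum bar")
        # and overall throughput ("minimum bar").
        latencies_ms = []
        start = time.perf_counter()
        for _ in range(n_requests):
            t0 = time.perf_counter()
            model_call()                                    # one inference request
            latencies_ms.append((time.perf_counter() - t0) * 1000.0)
        elapsed_s = time.perf_counter() - start
        return float(np.percentile(latencies_ms, 99)), n_requests / elapsed_s

    p99_ms, qps = measure(lambda: time.sleep(0.005))        # dummy 5 ms "model"
    print(f"p99 latency: {p99_ms:.1f} ms, throughput: {qps:.0f} req/s")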
[Trade-off triangles: Latency vs. Cost of HW vs. Model error rate]
Cost to serve: BERT
Model fit: Temporal/spatial breakdown: word-based RNNs
Case study: Typical word-based RNNs
- Hidden dimensions = ~100
- Input size = 100 (or more)
- Output size = 100
- Bi-directional
- Time steps = 100+ (one timestep per word in a sentence/paragraph)
Problem:
- Native TensorFlow execution is slow
- ~20K kernel invocations at ~5 us/invocation adds up to ~100 ms of delay (a back-of-the-envelope sketch follows below)
- We need a custom implementation [1]
- TensorFlow custom op for RNN execution
- Collaboration with NVIDIA
Diagram source: https://devblogs.nvidia.com/optimizing-recurrent-neural-networks-cudnn-5
Model fit technique #1: Temporal/spatial breakdown
[1] - S81039 – GTC 2018: Accelerating Machine Reading Comprehension Models Using GPUs
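For intuition, a back-of-the-envelope sketch of the launch-overhead arithmetic quoted above; the timestep count and per-launch cost are the figures from the slide, while the kernels-per-timestep number is an assumed breakdown chosen only to reproduce the ~20K total:

    # Rough estimate of kernel-launch overhead for op-by-op RNN execution.
    TIMESTEPS = 100          # one timestep per word (slide: 100+)
    DIRECTIONS = 2           # bi-directional
    KERNELS_PER_STEP = 100   # assumed breakdown, not a measured figure
    LAUNCH_OVERHEAD_US = 5   # ~5 us per kernel invocation (slide figure)

    launches = TIMESTEPS * DIRECTIONS * KERNELS_PER_STEP     # ~20,000 launches
    overhead_ms = launches * LAUNCH_OVERHEAD_US / 1000.0     # ~100 ms of pure launch cost
    print(launches, "launches ->", overhead_ms, "ms of launch overhead")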
Block-persistent RNN results – joint collaboration with NVIDIA:

               cuDNN                  Block-persistent
               Batch 1    Batch 5     Batch 1    Batch 5
    K80        9.30       6.52        1.77       1.87
    M40        2.78       2.90        0.95       1.00
    P4         0.25       1.06        0.68       0.72

- Time is averaged over 100 back-to-back iterations (different input data)
- 40 timesteps
1) For K80 and M40 we used CUDNN_RNN_ALGO_STANDARD, since the grid-persistent approach is supported only starting from Pascal.
2) For P4 we chose CUDNN_RNN_ALGO_PERSIST_STATIC, and it is more than 2x faster than block-persistent.
[Trade-off: Latency vs. Cost of HW vs. Model accuracy]

GRNN* results: latency (ms) vs. batch size on CharRNN and Text Classification, comparing GRNN, cuDNN Persistent, and cuDNN Traditional / TensorRT.

* GRNN: Low-Latency and Scalable RNN Inference on GPUs. Connor Holmes, Daniel Mawhirter, Yuxiong He, Feng Yan, Bo Wu. To appear at EuroSys 2019.
Model fit: Quantization: ELMO
- Case study: ELMO [1]
- Deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).
- Characteristics
- Main block: Very large LSTM, 4-layer
- Hidden states = 4096
- Hard to fit in memory
- We can improve model fit through quantization
- Latency greatly improved
- We might need more expensive HW (INT8 support is only in V100)
- Model accuracy suffers
[Trade-off: Latency vs. Cost of HW vs. Model error rate]
[1] - https://allennlp.org/elmo
Projection bi-LSTM from the last stage dominates the execution time:
- 2-layer bi-LSTM, 512 input, 4096 hidden, 512 projection
- Batch 1, 100 timesteps takes about 70 ms on P40 in FP32 when using cuDNN
- Reduce precision to INT8 to meet latency requirements
Custom implementation (joint collaboration with NVIDIA; a sketch of this structure follows below):
1) Single cuBLAS call per layer for input GEMMs for all timesteps (done in FP32)
2) Elementwise ops kernel in FP32
3) Custom MatVec kernel for recurrent part (at batch 1)
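A minimal NumPy sketch (batch 1, one direction) of that breakdown for an LSTM-with-projection layer, assuming the 512/4096/512 dimensions quoted above; gate ordering, weight shapes, and variable names are illustrative assumptions, and the real implementation is the custom CUDA op, not this Python:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    T, D_IN, D_HID, D_PROJ = 100, 512, 4096, 512
    rng = np.random.default_rng(0)
    w_x = (rng.standard_normal((4 * D_HID, D_IN)) * 0.01).astype(np.float32)    # input weights
    w_h = (rng.standard_normal((4 * D_HID, D_PROJ)) * 0.01).astype(np.float32)  # recurrent weights
    w_p = (rng.standard_normal((D_PROJ, D_HID)) * 0.01).astype(np.float32)      # projection weights
    b = np.zeros(4 * D_HID, dtype=np.float32)
    x = rng.standard_normal((T, D_IN)).astype(np.float32)

    # 1) One large GEMM covers the input projections of all timesteps at once.
    x_proj = x @ w_x.T                                 # [T, 4*D_HID]

    # 2)+3) Per-timestep recurrent matvec plus elementwise LSTM ops.
    h = np.zeros(D_PROJ, dtype=np.float32)
    c = np.zeros(D_HID, dtype=np.float32)
    for t in range(T):
        gates = x_proj[t] + w_h @ h + b                # recurrent part is a matrix-vector product
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = w_p @ (sigmoid(o) * np.tanh(c))            # project hidden state back to 512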
We choose a conservative quantization approach and keep as much as possible in FP32:
- Input / output vectors are FP32
- Recurrent weights are quantized once
- All math is FP32, so older GPU architectures can be used as well
- Only hidden and projection weights are quantized; input weights and all biases are in FP32
- In the quantization scheme (e.g. https://arxiv.org/abs/1609.08144v2), each row has a separate scale factor (a runnable sketch follows below):
    rowMax = max(abs(row))
    scale = 127.0 / rowMax
    quantRow = int(round(row * scale))
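A runnable NumPy sketch of this per-row scheme, including one way the quantized recurrent weights could be applied while keeping the surrounding math in FP32; function and variable names are illustrative, and the dequantize-by-scale step is a standard way to use such weights rather than a detail spelled out on the slide:

    import numpy as np

    def quantize_rows(w):
        # Symmetric per-row INT8 quantization: one scale factor per row.
        row_max = np.max(np.abs(w), axis=1, keepdims=True)   # rowMax = max(abs(row))
        scale = 127.0 / row_max                              # scale = 127.0 / rowMax
        q = np.round(w * scale).astype(np.int8)              # quantRow = round(row * scale)
        return q, scale.astype(np.float32)

    rng = np.random.default_rng(0)
    w_h = (rng.standard_normal((4 * 4096, 512)) * 0.01).astype(np.float32)  # recurrent weights
    h = rng.standard_normal(512).astype(np.float32)                         # FP32 hidden vector

    q_w, scale = quantize_rows(w_h)                    # quantize once, offline
    # Recurrent matvec with quantized weights: math stays in FP32, outputs are FP32.
    gates = (q_w.astype(np.float32) @ h) / scale[:, 0]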
Results at batch 1:

                   cuDNN FP32, ms    Custom INT8, ms
    P40, ECC ON    70.0              20.7
    V100, ECC ON   41.9              11.9