Low-latency RNN inference using Cellular Batching Jinyang - PowerPoint PPT Presentation

Low-latency RNN inference using Cellular Batching
Jinyang Li
joint work with Pin Gao, Lingfan Yu, Yongwei Wu
New York University, Tsinghua University

Lifecycle of DNN deployments
• Training: iteratively modify weights θ to obtain optimal weights θ_opt ([0.1, 0.2, ...])
• Inference: use fixed weights θ_opt to produce predictions

DNN serving must provide low latency
• Training: all samples are available at once. Goal: good throughput
• Inference: requests arrive one at a time. Goal: good throughput & low latency

The hardware reality: batching improves performance
[Figure: throughput (reqs/sec) and latency (us) versus batch size, for batch sizes from 2 to 4096. Nvidia Tesla V100, one LSTM step (hidden state size 1024)]

Background on Recurrent Neural Network
[Figure: an RNN with input x and hidden state h, unfolded into repeated applications of f_θ over inputs x_0, x_1, x_2 with hidden states h_0, h_1, h_2]
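
The unfolding in the figure can be sketched as a loop that applies the same cell function at every step. This is only an illustration: the scalar `cell` below (a tanh of a weighted sum) and the weight values are hypothetical stand-ins for f_θ, not the LSTM cell used in the talk.

```python
import math

def cell(theta, h_prev, x):
    """Toy stand-in for f_theta: one recurrent step on scalars (assumed form)."""
    w_h, w_x = theta
    return math.tanh(w_h * h_prev + w_x * x)

def unfold(theta, xs, h_init=0.0):
    """Apply the same cell once per input, threading the hidden state through."""
    hs = []
    h = h_init
    for x in xs:
        h = cell(theta, h, x)
        hs.append(h)
    return hs

# Three inputs produce three hidden states, all computed with the same weights.
hidden_states = unfold(theta=(0.5, 1.0), xs=[1.0, 0.0, -1.0])
```

The number of cell invocations equals the input length, which is why requests of different lengths produce chains of different lengths.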

RNN’s batching challenge: chain-RNNs
• How to batch chains of different lengths?
[Figure: two requests, "Volta CPU is..." and "I love ...", unrolled into chains of LSTM cells of different lengths; token labels "Volta", "is", "GPU", "fast" and "I", "love", "california"]

State-of-the-art solution: padding
• TensorFlow, MXNet, PyTorch, CNTK batch via padding
[Figure: the two requests batched through a single chain of LSTM cells; the shorter request "I love ..." is padded with "xxxx" to match the longer one]
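
A minimal sketch of batching via padding, and of how much computation the padding costs. The token lists and the `<pad>` marker are illustrative assumptions, and the helper names are hypothetical rather than any framework's API.

```python
def pad_batch(requests, pad_token="<pad>"):
    """Extend every request to the length of the longest one in the batch."""
    max_len = max(len(r) for r in requests)
    return [r + [pad_token] * (max_len - len(r)) for r in requests]

def wasted_fraction(requests):
    """Fraction of cell executions in the padded batch that process padding."""
    max_len = max(len(r) for r in requests)
    total = max_len * len(requests)
    useful = sum(len(r) for r in requests)
    return (total - useful) / total

# Illustrative requests of length 4 and 3.
requests = [["Volta", "GPU", "is", "fast"], ["I", "love", "california"]]
padded = pad_batch(requests)
waste = wasted_fraction(requests)  # 1 of 8 cell executions is padding
```

The wasted fraction grows with the spread of request lengths in a batch, since every request is stretched to the longest one.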

RNN’s batching challenge: TreeLSTM
• How to batch trees of different structures?
[Figure: two TreeLSTM requests with different tree structures, built from LSTM leaf cells over the words "cats", "sleep", "kids", "love", "dogs" and internal LSTM cells]

State-of-the-art solutions: graph batching
• TensorFlow-Fold and DyNet merge dataflow graphs
[Figure: the dataflow graphs of the two trees merged into one graph over "kids", "love", "dogs", "sleep", "cats"]
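
One simple way to merge the graphs of several tree requests is to group nodes by height and cell type, so that each group can run as one batched operation. The sketch below assumes that grouping rule and a nested-tuple tree encoding; it is an illustration, not the actual algorithm of TensorFlow-Fold or DyNet.

```python
from collections import defaultdict

def collect(tree, groups, request_id):
    """Record every node under (height, cell type) and return the node's height."""
    if isinstance(tree, str):  # a word is a leaf cell
        groups[(0, "leaf")].append((request_id, tree))
        return 0
    left, right = tree  # a pair is an internal cell
    height = 1 + max(collect(left, groups, request_id),
                     collect(right, groups, request_id))
    groups[(height, "internal")].append((request_id, tree))
    return height

def merge_graphs(trees):
    """Merge all requests into batches keyed by (height, cell type)."""
    groups = defaultdict(list)
    for request_id, tree in enumerate(trees):
        collect(tree, groups, request_id)
    return dict(sorted(groups.items()))

# Two trees with different structures (assumed shapes).
trees = [("cats", "sleep"), ("kids", ("love", "dogs"))]
batches = merge_graphs(trees)
batch_sizes = {key: len(nodes) for key, nodes in batches.items()}
```

With these two trees the leaf level batches five cells, the next level two, and the top level only one, so the deeper levels of the merged graph run with small batches.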

Existing batching techniques do not work well for RNNs
• Long delay
  – New request has to wait for current batch to finish
• Suboptimal throughput
  – Padding wastes computation
  – Not every operator is batched thoroughly in the merged graph

Talk outline
• Motivation:
  – Existing batching techniques incur long delay for RNNs
• Our approach: cellular batching
• Batchmaker RNN inference system
• Evaluation

Our approach: cellular batching
• The insight: an RNN is made up of (a few types of) cells that are repeated many times
✓ Cellular batching: make batching decisions before executing each cell
✗ Graph batching or padding: make batching decisions before executing requests
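
The difference between the two decision points can be sketched with a small simulation. Everything here is a simplifying assumption: time advances in unit cell steps, each request is a chain of `num_cells` cells, and the request-level baseline returns a whole batch only when its longest member finishes. The function names are hypothetical.

```python
def cellular_batching(arrivals, max_batch):
    """arrivals: list of (arrival_step, num_cells). Returns finish step per request."""
    remaining = [n for _, n in arrivals]
    finish = [None] * len(arrivals)
    step = 0
    while any(f is None for f in finish):
        # Batching decision is made before every cell execution,
        # so requests that have just arrived can join immediately.
        ready = [i for i, (t, _) in enumerate(arrivals)
                 if t <= step and finish[i] is None]
        for i in ready[:max_batch]:
            remaining[i] -= 1
            if remaining[i] == 0:
                finish[i] = step + 1
        step += 1
    return finish

def request_batching(arrivals, max_batch):
    """Baseline: batching decision is made once per batch of whole requests."""
    finish = [None] * len(arrivals)
    step = 0
    while any(f is None for f in finish):
        ready = [i for i, (t, _) in enumerate(arrivals)
                 if t <= step and finish[i] is None]
        batch = ready[:max_batch]
        if not batch:
            step += 1
            continue
        # The batch occupies the device until its longest request is done.
        duration = max(arrivals[i][1] for i in batch)
        for i in batch:
            finish[i] = step + duration
        step += duration
    return finish

# A 5-cell request arrives at step 0, a 2-cell request at step 1.
arrivals = [(0, 5), (1, 2)]
cellular = cellular_batching(arrivals, max_batch=4)
baseline = request_batching(arrivals, max_batch=4)
```

In this toy run the second request finishes at step 3 under cellular batching but at step 7 under request-level batching, because it no longer waits for the first request to complete.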

Graph batching wastes batching opportunity
[Figure: request timeline under graph batching; legend: executing, finished, not yet executed]

Cellular batching reduces waiting
[Figure: request timeline under cellular batching; legend: executing, finished, not yet executed]

Performance challenges in realizing cellular batching
• Must support multiple types of cells and >1 GPUs
  – Only cells of the same type can be batched together
  – Different cell types have different priority
• Balance batching opportunity and overhead
  – Cost of scheduling is non-trivial
  – Adding new requests to a batch incurs memory copy
• Decisions must be made asynchronously with execution

Batchmaker: system architecture
[Figure: system architecture diagram; labels: user-defined cell definition, unfolding logic (saved by MXNet/TensorFlow), initialization, requests, predictions]

How the scheduler makes decisions
• Preserve locality
  – Prefer batching cells of the same set of requests
  – Prefer sticking to the same GPU for the same request
• Allow priority
  – Decoder cells > encoder cells
  – TreeLSTM internal cells > leaf cells
• Schedule several dependent batches at a time
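
The priority and locality rules above can be sketched as a single batch-selection step. This is a hedged illustration, not Batchmaker's scheduler: the `PRIORITY` table, the `pick_batch` function, and the tuple encoding of ready cells are all hypothetical, and GPU placement and multi-batch scheduling are left out.

```python
# Higher value runs first; the ordering follows the slide (decoder > encoder).
PRIORITY = {"decoder": 1, "encoder": 0}

def pick_batch(ready_cells, last_requests, max_batch):
    """ready_cells: list of (request_id, cell_type) whose inputs are available.

    Returns the chosen cell type and the request ids batched for it.
    """
    if not ready_cells:
        return None, []
    # Only cells of the same type can share a batch:
    # choose the highest-priority type that has ready cells.
    best_type = max({t for _, t in ready_cells}, key=PRIORITY.__getitem__)
    candidates = [r for r, t in ready_cells if t == best_type]
    # Locality: requests from the previous batch go first; the stable sort
    # keeps the original order among the remaining requests.
    candidates.sort(key=lambda r: r not in last_requests)
    return best_type, candidates[:max_batch]

ready = [(1, "encoder"), (2, "decoder"), (3, "decoder"), (4, "decoder")]
chosen_type, chosen_requests = pick_batch(ready, last_requests={4}, max_batch=2)
```

Here the decoder cells win over the encoder cell, and request 4 is placed ahead of requests 2 and 3 because it was part of the previous batch.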

Priority and locality in scheduling

Implementation optimizations
• Dynamically adjust the batch size (up to the configured maximum)
  – Form and execute a batch whenever GPU is idle
• Asynchronously wait for GPU kernel completion
  – GPU driver’s asynchronous callback is too slow
  – Add a signaling kernel at the end of all cell kernels
    • update a variable in host memory

Evaluation questions
• How does Batchmaker compare against baseline systems?
• Where does Batchmaker’s performance advantage come from?
