PYTORCH AND THE NEW CHALLENGES OF ML
LeCun's Law and the Rise of Deep Learning
[Chart: annual citation count of "Gradient-Based Learning Applied to Document Recognition" (LeCun et al., 1998), rising from roughly 20 in 2001 to 5,371 in 2018]
TRANSLATION SPARK AR OCULUS VR BLOOD DONATIONS
400T+ PREDICTIONS PER DAY
1B+ PHONES RUNNING NEURAL NETS GLOBALLY
WHAT IS PYTORCH?
Dynamic neural networks · Hardware-accelerated inference · Eager & graph-based execution · Distributed training · Simplicity over complexity
BUILT BY THE COMMUNITY · DESIGNED FOR RESEARCHERS · BUILT FOR PRODUCTION
~1,200 CONTRIBUTORS · 50%+ YOY GROWTH · 22K PYTORCH FORUM USERS
DESIGNED FOR RESEARCHERS · BUILT BY THE COMMUNITY · BUILT FOR PRODUCTION
GROWTH IN ARXIV MENTIONS IN RESEARCH PAPERS
UDACITY: 16K+ students enrolled in courses · 21M minutes of watch time in the last 12 months
FAST.AI: Practical Deep Learning for Coders, V3 · Part 2: Deep Learning from the Foundations · A Code-First Introduction to Natural Language Processing · Introduction to Machine Learning for Coders
BUILT FOR PRODUCTION · BUILT BY THE COMMUNITY · DESIGNED FOR RESEARCHERS
PRODUCTION RESEARCH
PYTORCH
CORE PRINCIPLES: DEVELOPER EFFICIENCY · BUILDING FOR SCALE
DEVELOPER EFFICIENCY ENABLING A HIGH VELOCITY OF MODEL ITERATION AND INNOVATION
CLEAN APIS
NAMED TENSORS (EXPERIMENTAL)

Today, we name and access dimensions by comment:

    # Tensor[N, C, H, W]
    images = torch.randn(32, 3, 56, 56)
    images.sum(dim=1)
    images.select(dim=1, index=0)

But naming explicitly leads to more readable and maintainable code:

    NCHW = ['N', 'C', 'H', 'W']
    images = torch.randn(32, 3, 56, 56, names=NCHW)
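The contrast above can be run end to end; a minimal sketch (named tensors are experimental, and the shapes follow the slide):

```python
import torch

# Unnamed: the layout lives in a comment only.
# Tensor[N, C, H, W]
images = torch.randn(32, 3, 56, 56)
images.sum(dim=1)                      # reduce over channels, by position

# Named: the layout rides along with the tensor itself.
NCHW = ['N', 'C', 'H', 'W']
images = torch.randn(32, 3, 56, 56, names=NCHW)
per_pixel = images.sum('C')            # reduce over channels, by name
red_channel = images.select('C', 0)    # index the channel dim by name

print(per_pixel.names)                 # ('N', 'H', 'W')
print(red_channel.shape)               # torch.Size([32, 56, 56])
```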
TORCHSCRIPT

Models are Python programs; TorchScript is an optimizable subset of Python.

+ Same "models are programs" idea
+ Production deployment
+ No Python dependency
+ Compilation for performance optimization

    class RNN(nn.Module):
        def __init__(self, W_h, U_h, W_y, b_h, b_y):
            super(RNN, self).__init__()
            self.W_h = nn.Parameter(W_h)
            self.U_h = nn.Parameter(U_h)
            self.W_y = nn.Parameter(W_y)
            self.b_h = nn.Parameter(b_h)
            self.b_y = nn.Parameter(b_y)

        def forward(self, x, h):
            y = []
            for t in range(x.size(0)):
                h = torch.tanh(x[t] @ self.W_h + h @ self.U_h + self.b_h)
                y += [torch.tanh(h @ self.W_y + self.b_y)]
                if t % 10 == 0:
                    print("stats: ", h.mean(), h.var())
            return torch.stack(y), h

    # one annotation!
    script_rnn = torch.jit.script(RNN(W_h, U_h, W_y, b_h, b_y))
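The "one annotation" idea also applies to much smaller functions; the sketch below is hypothetical (not from the talk) and shows that data-dependent control flow survives `torch.jit.script`:

```python
import torch

# Hypothetical example: sum the row-sums of a tensor, clipping each at `limit`.
# The loop and the branch are preserved in the compiled TorchScript program.
@torch.jit.script
def clipped_sum(x: torch.Tensor, limit: float) -> torch.Tensor:
    total = torch.zeros(1)
    for t in range(x.size(0)):          # loop bound depends on the input
        v = x[t].sum()
        if bool(v > limit):             # data-dependent branch
            v = torch.tensor(limit)
        total = total + v
    return total

x = torch.ones(3, 4)                    # each row sums to 4.0
print(clipped_sum(x, 10.0))             # tensor([12.])
print(clipped_sum(x, 3.0))              # tensor([9.])
```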
CORE PRINCIPLES: DEVELOPER EFFICIENCY · BUILDING FOR SCALE
BUILDING FOR SCALE HIGH PERFORMANCE EXECUTION FOR MODEL TRAINING AND INFERENCE
GROWTH OF DATA IN ML PIPELINES
30% of FB data was used in an ML pipeline in 2018 · 50% of FB data is used in an ML pipeline today · 3X ML data growth in one year
SCALE OF ML TRAINING AT FACEBOOK
Ranking engineers: 2X increase · Workflows trained: 3X increase · Compute consumed: 3X increase
OPTIMIZING FOR HARDWARE BACKENDS
PyTorch development env → PyTorch JIT → MKL-DNN · CUDA/cuDNN · (Q)NNPACK · FBGEMM · XLA · Glow · TVM
1. Feature Engineering: Bryce Canyon (70X HDDs + integrated compute) · Lightning (30X flash-drive JBOF) · Tioga Pass (dual CPU, high mem)
2. Training: Big Basin (8X SXM2 GPUs + 2X CPU) · Tioga Pass (dual CPU, high mem)
3. Inference: Twin Lakes (single-socket CPU card, low mem)
QUANTIZATION
Efficient inference on server and mobile devices using reduced-precision math.
Post-training quantization (simplicity of use) · Quantization-aware training (accuracy & perf control) · Dynamic quantization
4x less memory · 2-4x compute speedup
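As a minimal sketch of the dynamic flavor (the two-layer model and its sizes are illustrative, not from the talk), `torch.quantization.quantize_dynamic` swaps Linear layers for int8-weight versions:

```python
import torch
import torch.nn as nn

# Illustrative float model; layer sizes are made up for the example.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization: weights stored as int8, activations
# quantized on the fly at inference time. No retraining required.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized(x)
print(out.shape)            # torch.Size([1, 10])
```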
PYTORCH: RESEARCH + PROTOTYPING → PRODUCTION DEPLOYMENT
NAMED TENSORS
PyTorch set the bar for ML Developer UX by focusing on expressivity and productivity "I want to write a program, not to (manually) build a graph" Where are similar areas for improvement today?
Data has semantic meaning! But we force users to drop that context and use an abstract "Tensor" mathematical object.
Key Insight: Named Dimensions
Inspired by and done in collaboration with Prof. Alexander Rush, now at Cornell Tech.
Key Insight: Named Dimensions
Today we name and access dimensions by comment, but naming explicitly leads to more readable and maintainable code.
By retaining semantic meaning, we also avoid common "Tensor Pitfalls"
- Accidental Broadcasting
- Accidental Alignment
Accidental Broadcasting
We didn't expect broadcasting to happen, but it did. We can catch this automatically: broadcast by position, but check that dimension names are aligned.
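A small hypothetical sketch of this check (the `N`/`C`/`T` names and sizes are made up): multiplying tensors whose dimension names disagree raises instead of silently broadcasting.

```python
import torch

scores = torch.randn(4, 7, names=['N', 'C'])   # batch x classes
weights = torch.randn(7, names=['C'])          # per-class weights

ok = scores * weights          # names unify from the right: allowed
print(ok.names)                # ('N', 'C')

times = torch.randn(7, names=['T'])            # same size, different meaning
try:
    scores * times             # unnamed tensors would silently broadcast here
except RuntimeError as err:
    print('caught:', err)
```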
Accidental Alignment
No 1->N broadcast occurs; semantically distinct dimensions simply happen to have the same size. But there are so many formats! There is a "time bomb" if I ever normalize the wrong format and the "unaligned" dimensions have the same size. If we broadcast by name (align_as), we only need a single normalize function for all formats.
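A sketch of that single normalize function, assuming per-channel mean/stdv named along 'C' (the shapes here are illustrative): `align_as` permutes and inserts size-1 dims to match the image's named layout, whatever the format.

```python
import torch

def normalize(images, mean, stdv):
    # align_as broadcasts by name: 'C' lines up with the image's 'C',
    # and size-1 dims are inserted for the remaining named dimensions.
    return (images - mean.align_as(images)) / stdv.align_as(images)

mean = torch.zeros(3, names=['C'])
stdv = torch.ones(3, names=['C'])

nchw = torch.randn(2, 3, 8, 8, names=['N', 'C', 'H', 'W'])
nhwc = torch.randn(2, 8, 8, 3, names=['N', 'H', 'W', 'C'])

# One function covers both layouts; no per-format variants needed.
print(normalize(nchw, mean, stdv).names)   # ('N', 'C', 'H', 'W')
print(normalize(nhwc, mean, stdv).names)   # ('N', 'H', 'W', 'C')
```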
What about mixing named and unnamed Tensors? I don't want to convert my entire program at once...
Coexistence with Unnamed
Named tensors can coexist with unnamed tensors. Let's remove the requirement that mean, stdv are named: refine_names lifts unnamed tensors to named tensors.
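A sketch of incremental adoption (shapes illustrative): `refine_names` returns a view whose anonymous dimensions are lifted to named ones, and `...` leaves the rest unnamed so a program can be converted piecemeal.

```python
import torch

x = torch.randn(2, 3, 5, 5)                 # plain, unnamed tensor
print(x.names)                              # (None, None, None, None)

named = x.refine_names('N', 'C', 'H', 'W')  # lift all four dims to names
print(named.names)                          # ('N', 'C', 'H', 'W')

# '...' keeps the leading dims unnamed: convert only what you need.
partial = torch.randn(2, 3, 5, 5).refine_names(..., 'H', 'W')
print(partial.names)                        # (None, None, 'H', 'W')
```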
NAMED TENSORS: EXPERIMENTAL IN 1.3
Core functionality: common torch operators are supported in eager mode; (unnamed) autograd is supported.
Tutorial: see our in-depth MultiheadedAttention tutorial.
Future work (expanded coverage): expanded NN package coverage · named autograd support · serialization, multiprocessing, distributed, JIT, mypy.
PyTorch JIT / TorchScript
What is the PyTorch JIT? A compiler and language infrastructure for machine learning
Production Requirements P O R T A B I L I T Y P E R F O R M A N C E Models should run anywhere Whole-program optimization
Problem Statement
We need a system that can:
1. Capture the structure of PyTorch programs → TorchScript
2. Use that structure to optimize → JIT Compiler
TorchScript A static, high-performance subset of Python. 1. Prototype your model with PyTorch 2. Control flow is preserved 3. First-class support for lists, dicts, etc.
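The container support can be sketched with a hypothetical scripted function (`count_positive` is made up for illustration): lists and dicts keep their Python semantics after compilation.

```python
import torch
from typing import Dict, List

@torch.jit.script
def count_positive(xs: List[int]) -> Dict[str, int]:
    # A dict literal and a loop with a branch, all preserved by the compiler.
    counts = {'pos': 0, 'nonpos': 0}
    for x in xs:
        if x > 0:
            counts['pos'] = counts['pos'] + 1
        else:
            counts['nonpos'] = counts['nonpos'] + 1
    return counts

print(count_positive([3, -1, 0, 7]))   # {'pos': 2, 'nonpos': 2}
```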
PyTorch JIT An optimizing just-in-time compiler for PyTorch programs. 1. Lightweight, thread-safe interpreter 2. Easy to write custom transformations 3. Not just for inference! Autodiff support.
C A S E S T U D Y Recursive Neural Network Grammars — Complex dynamic behavior based on the inputs — Typically written in pure C++
Complex Control Flow
Use common data structures
Define your own classes
WHAT'S NEXT? JIT AS A PLATFORM
Quantization: model quantization done safely and automatically using JIT transformations.
Mobile: a lightweight interpreter that can run on-device.
Backends: support for lowering models to static graph compilers, like TVM, Glow, XLA.
QUANTIZATION