

  1. PYTORCH AND THE NEW CHALLENGES OF ML

  2. LeCun's Law and the Rise of Deep Learning. Gradient-Based Learning Applied to Document Recognition, LeCun et al., 1998. [Chart: citation count per year, 2001 through 2018, rising from a few dozen citations per year in the early 2000s to 3,711 in 2017 and 5,371 in 2018.]

  3. TRANSLATION · SPARK AR · OCULUS VR · BLOOD DONATIONS

  4. 400T+ PREDICTIONS PER DAY

  5. 1B+ PHONES RUNNING NEURAL NETS GLOBALLY

  6. WHAT IS PYTORCH? Dynamic neural networks · Hardware-accelerated inference · Eager & graph-based execution · Distributed training · Simplicity over complexity

  7. BUILT BY THE COMMUNITY · DESIGNED FOR RESEARCHERS · BUILT FOR PRODUCTION

  8. BUILT BY THE COMMUNITY · DESIGNED FOR RESEARCHERS · BUILT FOR PRODUCTION

  9. ~1,200 CONTRIBUTORS · 50%+ YOY GROWTH · 22K PYTORCH FORUM USERS

  10. DESIGNED FOR RESEARCHERS · BUILT BY THE COMMUNITY · BUILT FOR PRODUCTION

  11. GROWTH IN ARXIV MENTIONS IN RESEARCH PAPERS

  12. UDACITY: 16K+ students enrolled in courses · 21M minutes of watch time in the last 12 months. FAST.AI: Practical Deep Learning for Coders, V3 · Part 2: Deep Learning from the Foundations · A Code-First Introduction to Natural Language Processing · Introduction to Machine Learning for Coders

  13. BUILT FOR PRODUCTION · BUILT BY THE COMMUNITY · DESIGNED FOR RESEARCHERS

  14. PRODUCTION RESEARCH

  15. PYTORCH

  16. PYTORCH

  17. PYTORCH

  18. PYTORCH

  19. CORE PRINCIPLES: DEVELOPER EFFICIENCY · BUILDING FOR SCALE

  20. DEVELOPER EFFICIENCY: ENABLING A HIGH VELOCITY OF MODEL ITERATION AND INNOVATION

  21. CLEAN APIS

  22. NAMED TENSORS (EXPERIMENTAL)
      Today, we name and access dimensions by comment:

          # Tensor[N, C, H, W]
          images = torch.randn(32, 3, 56, 56)
          images.sum(dim=1)
          images.select(dim=1, index=0)

      But naming explicitly leads to more readable and maintainable code:

          NCHW = ['N', 'C', 'H', 'W']
          images = torch.randn(32, 3, 56, 56, names=NCHW)
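
      With names attached, the same operations can be written against dimension names rather than positions. A small sketch of what this enables, using the names defined above:

          images.sum('C')               # reduce over the channel dimension by name
          images.select('C', index=0)   # pick the first channel without counting positions
          images.names                  # ('N', 'C', 'H', 'W')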

  23. TORCHSCRIPT
      Models are Python programs; TorchScript is an optimizable subset of Python.
      + Same "models are programs" idea
      + Production deployment
      + No Python dependency
      + Compilation for performance optimization

          class RNN(nn.Module):
              def __init__(self, W_h, U_h, W_y, b_h, b_y):
                  super(RNN, self).__init__()
                  self.W_h = nn.Parameter(W_h)
                  self.U_h = nn.Parameter(U_h)
                  self.W_y = nn.Parameter(W_y)
                  self.b_h = nn.Parameter(b_h)
                  self.b_y = nn.Parameter(b_y)

              def forward(self, x, h):
                  y = []
                  for t in range(x.size(0)):
                      h = torch.tanh(x[t] @ self.W_h + h @ self.U_h + self.b_h)
                      y += [torch.tanh(h @ self.W_y + self.b_y)]
                      if t % 10 == 0:
                          print("stats: ", h.mean(), h.var())
                  return torch.stack(y), h

          # one annotation!
          script_rnn = torch.jit.script(RNN(W_h, U_h, W_y, b_h, b_y))
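
      Once scripted, the module can be serialized and reloaded without any Python dependency, which is the production-deployment path. A minimal sketch; the file name and the inputs x, h are illustrative:

          script_rnn.save("rnn.pt")            # serialize the compiled TorchScript program
          loaded = torch.jit.load("rnn.pt")    # reload it (also loadable from C++ via torch::jit::load)
          out, h_n = loaded(x, h)              # run the compiled forward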

  24. CORE PRINCIPLES: DEVELOPER EFFICIENCY · BUILDING FOR SCALE

  25. BUILDING FOR SCALE: HIGH-PERFORMANCE EXECUTION FOR MODEL TRAINING AND INFERENCE

  26. GROWTH OF DATA IN ML PIPELINES: 30% of FB data used in an ML pipeline in 2018 · 50% of FB data used in an ML pipeline today · 3X ML data growth in one year

  27. SCALE OF ML TRAINING AT FACEBOOK: ranking engineers, 2X increase · workflows trained, 3X increase · compute consumed, 3X increase

  28. OPTIMIZING FOR HARDWARE BACKENDS: PyTorch development env → PyTorch JIT → MKL-DNN · CUDA/cuDNN · (Q)NNPACK · FBGEMM · XLA · Glow · TVM

  29. ML hardware across the pipeline: 1. Feature Engineering: Bryce Canyon (70X HDDs + integrated compute), Lightning (30X flash drives, JBOF) · 2. Training: Big Basin (8X GPU, SXM2, + 2X CPU), Tioga Pass (dual CPU, high mem) · 3. Inference: Tioga Pass (dual CPU, high mem), Twin Lakes (single-socket CPU card, low mem)

  30. QUANTIZATION: Efficient inference on server and mobile devices using reduced-precision math. Three approaches: dynamic quantization, post-training quantization, and quantization-aware training, trading off simplicity of use, accuracy & perf, and control. 4x less memory, 2-4x compute speedup.
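
      As one concrete illustration, the dynamic-quantization path is a one-call workflow. A minimal sketch; the LSTM model here is illustrative:

          import torch
          import torch.nn as nn

          model = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

          # Weights are converted to int8 ahead of time; activations are
          # quantized dynamically at runtime.
          quantized = torch.quantization.quantize_dynamic(
              model, {nn.LSTM}, dtype=torch.qint8
          )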

  31. PYTORCH: RESEARCH PROTOTYPING + PRODUCTION DEPLOYMENT

  32. NAMED TENSORS

  33. PyTorch set the bar for ML developer UX by focusing on expressivity and productivity: "I want to write a program, not to (manually) build a graph." Where are similar areas for improvement today?

  34. Data has semantic meaning! But we force users to drop that context and use an abstract "Tensor" mathematical object.

  35. Key Insight: Named Dimensions. Inspired by and done in collaboration with Prof. Alexander Rush, now at Cornell Tech.

  36. Key Insight: Named Dimensions. Today we name and access dimensions by comment.

  37. Key Insight: Named Dimensions. Today we name and access dimensions by comment, but naming explicitly leads to more readable and maintainable code.

  38. By retaining semantic meaning, we also avoid common "Tensor Pitfalls": Accidental Broadcasting and Accidental Alignment.

  39. By retaining semantic meaning, we also avoid common "Tensor Pitfalls": Accidental Broadcasting and Accidental Alignment.

  40. Accidental Broadcasting: We didn't expect broadcasting to happen, but it did.

  41. Accidental Broadcasting: We didn't expect broadcasting to happen, but it did. We can catch this automatically!

  42. Accidental Broadcasting: We didn't expect broadcasting to happen, but it did. We can catch this automatically: broadcast by position, but check that dimension names are aligned.
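
      A small sketch of the kind of mistake this catches, assuming the experimental named-tensor API: positional broadcasting silently expands the mismatched operand, while the named version refuses because the names do not line up.

          import torch

          # Unnamed: a (3,) bias silently broadcasts against a (3, 3) activation.
          act = torch.randn(3, 3)
          bias = torch.randn(3)
          out = act + bias                       # broadcasts by position; maybe not what we meant

          # Named: mismatched dimension names are rejected instead of silently aligned.
          act_n = torch.randn(3, 3, names=('N', 'C'))
          bias_n = torch.randn(3, names=('T',))
          out_n = act_n + bias_n                 # raises a RuntimeError: 'C' and 'T' do not match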

  43. By retaining semantic meaning, we also avoid common "Tensor Pitfalls": Accidental Broadcasting and Accidental Alignment.

  44. Accidental Alignment: no 1->N broadcast occurs; semantically distinct dimensions happen to have the same size, so they silently align.

  45. Accidental Alignment: no 1->N broadcast occurs; semantically distinct dimensions happen to have the same size, so they silently align. But there are so many formats!

  46. Accidental Alignment: no 1->N broadcast occurs; semantically distinct dimensions happen to have the same size, so they silently align. But there are so many formats! There is a "time bomb" if I ever normalize the wrong format and the "unaligned" dimensions have the same size!

  47. Accidental Alignment: no 1->N broadcast occurs; semantically distinct dimensions happen to have the same size, so they silently align.

  48. Accidental Alignment: no 1->N broadcast occurs; semantically distinct dimensions happen to have the same size, so they silently align. If we broadcast by name (align_as), we only need a single normalize function for all formats.
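
      A minimal sketch of that single normalize function, assuming the experimental named-tensor API; the shapes, names, and per-channel statistics are illustrative:

          import torch

          def normalize(images, mean, std):
              # align_as broadcasts by *name*, so one function handles any layout
              # (NCHW, NHWC, ...) as long as a 'C' dimension exists.
              return (images - mean.align_as(images)) / std.align_as(images)

          mean = torch.randn(3, names=('C',))
          std = torch.rand(3, names=('C',)) + 0.5

          nchw = torch.randn(32, 3, 56, 56, names=('N', 'C', 'H', 'W'))
          nhwc = torch.randn(32, 56, 56, 3, names=('N', 'H', 'W', 'C'))

          out1 = normalize(nchw, mean, std)
          out2 = normalize(nhwc, mean, std)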

  49. What about mixing named and unnamed Tensors? I don't want to convert my entire program at once...

  50. Coexistence with Unnamed: named tensors can coexist with unnamed tensors. Let's remove the requirement that mean and stdv are named.

  51. Coexistence with Unnamed: named tensors can coexist with unnamed tensors. Let's remove the requirement that mean and stdv are named; refine_names lifts unnamed tensors to named tensors.
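
      A small sketch of that coexistence, assuming the experimental API; normalize and the named image tensor nchw are as in the earlier sketch:

          # mean/std arrive as plain, unnamed tensors from legacy code.
          mean = torch.randn(3)
          std = torch.rand(3) + 0.5

          # refine_names returns a view whose unnamed (None) dims now carry real names.
          out = normalize(nchw, mean.refine_names('C'), std.refine_names('C'))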

  52. NAMED TENSORS: Experimental in 1.3. Core functionality: common torch operators are supported in eager mode; (unnamed) autograd is supported. Tutorial: see our in-depth MultiheadedAttention tutorial. Future work (expanded coverage): expanded NN package coverage · named autograd support · serialization, multiprocessing, distributed, JIT, mypy.

  53. PyTorch JIT / TorchScript

  54. What is the PyTorch JIT? A compiler and language infrastructure for machine learning

  55. Production Requirements: PORTABILITY (models should run anywhere) · PERFORMANCE (whole-program optimization)

  56. Problem Statement We need a system that can: 1. Capture the structure of PyTorch programs. 2. Use that structure to optimize.

  57. Problem Statement: We need a system that can: 1. Capture the structure of PyTorch programs (TorchScript). 2. Use that structure to optimize (JIT Compiler).

  58. TorchScript: A static, high-performance subset of Python. 1. Prototype your model with PyTorch 2. Control flow is preserved 3. First-class support for lists, dicts, etc.
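
      A minimal illustrative sketch of those points: data-dependent control flow and typed containers survive scripting, which tracing alone would not preserve.

          import torch
          from typing import List

          @torch.jit.script
          def trim_and_sum(xs: List[torch.Tensor], threshold: float) -> torch.Tensor:
              # The loop and the if are captured as control flow,
              # not specialized to one particular input.
              total = torch.zeros(1)
              for x in xs:
                  if float(x.norm()) > threshold:
                      total = total + x.sum()
              return total

          out = trim_and_sum([torch.randn(4), torch.randn(4)], 1.0)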

  59. PyTorch JIT: An optimizing just-in-time compiler for PyTorch programs. 1. Lightweight, thread-safe interpreter 2. Easy to write custom transformations 3. Not just for inference! Autodiff support.
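
      To illustrate the "not just for inference" point, a scripted function still participates in autograd, and the IR the JIT optimizes can be inspected. A small illustrative sketch:

          import torch

          @torch.jit.script
          def soft_clip(x: torch.Tensor) -> torch.Tensor:
              return torch.tanh(x) * 0.5 + 0.5

          x = torch.randn(8, requires_grad=True)
          y = soft_clip(x).sum()
          y.backward()              # autodiff works through the compiled function
          print(soft_clip.graph)    # inspect the TorchScript IR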

  60. CASE STUDY: Recursive Neural Network Grammars. Complex dynamic behavior based on the inputs; typically written in pure C++.

  61. Complex Control Flow

  62. Use common data structures
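
      The code from this slide is not preserved in the transcript; a minimal illustrative sketch of TorchScript's first-class lists and dicts:

          import torch
          from typing import Dict, List

          @torch.jit.script
          def count_tokens(sentences: List[List[str]]) -> Dict[str, int]:
              # Ordinary dict/list operations compile directly.
              counts = torch.jit.annotate(Dict[str, int], {})
              for sentence in sentences:
                  for token in sentence:
                      if token in counts:
                          counts[token] += 1
                      else:
                          counts[token] = 1
              return counts

          print(count_tokens([["the", "cat"], ["the", "hat"]]))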

  63. Define your own classes

  64. Define your own classes
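
      The code from these two slides is likewise not preserved; a minimal illustrative sketch of a user-defined TorchScript class of the kind an RNNG-style model needs (a typed stack), with hypothetical names:

          import torch
          from typing import List

          @torch.jit.script
          class Stack(object):
              def __init__(self):
                  self.items = torch.jit.annotate(List[torch.Tensor], [])

              def push(self, x: torch.Tensor) -> None:
                  self.items.append(x)

              def pop(self) -> torch.Tensor:
                  return self.items.pop()

          @torch.jit.script
          def reduce_top_two(xs: List[torch.Tensor]) -> torch.Tensor:
              stack = Stack()
              for x in xs:
                  stack.push(x)
              return stack.pop() + stack.pop()

          print(reduce_top_two([torch.ones(2), torch.ones(2)]))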

  65. WHAT'S NEXT? JIT as a Platform. Quantization: model quantization done safely and automatically using JIT transformations. Mobile: a lightweight interpreter that can run on-device. Backends: support for lowering models to static graph compilers like TVM, Glow, XLA.

  66. QUANTIZATION
