

SLIDE 1

Computation

Rachel Hu and Zhi Zhang

SLIDE 2

Outline

  • Performance
    • Hybridization
    • Asynchronous computation
    • Multi-GPU/machine training
  • Computer Vision
    • Image augmentation
    • Fine tuning
SLIDE 3

A Hybrid of Imperative and Symbolic Programming

SLIDE 4

Imperative Programming

  • The common way to program in Python, Java, C/C++, …
  • Straightforward, easy to debug
  • Requires a (Python) interpreter
  • Hard to deploy models (smart phones, browsers, embedded devices)
  • Performance problems

a = 1
b = 2
c = a + b

The interpreter compiles each statement into bytecode and executes it on a virtual machine: 3 calls in total.

SLIDE 5

Symbolic Programming

  • Define the program first; feed in data to execute it later
  • Math, SQL, …
  • Easy to optimize, less frontend overhead, portable
  • Hard to use

expr = "c = a + b"
exec = compile(expr)
exec(a=1, b=2)

Knowing the whole program makes it easy to optimize. It may be used without a Python interpreter, and it runs in a single call.
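The pseudocode above can be made concrete with Python's built-in compile and exec; a minimal runnable sketch of the symbolic style: build the whole program as a string, compile it once, then execute it in a single call.

prog = '''
a = 1
b = 2
c = a + b
print(c)
'''
compiled = compile(prog, '', 'exec')  # compile the full program to bytecode
exec(compiled)                        # one call runs everything, prints 3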

SLIDE 6

Hybridization in Gluon

  • Define a model through nn.HybridSequential or nn.HybridBlock
  • Call .hybridize() to switch from imperative execution to symbolic execution

from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.hybridize()
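A hedged usage sketch: after hybridize(), the first forward pass traces the network into a symbolic graph, and later calls run the optimized version. The input shape below is an arbitrary assumption (Dense layers infer their input dimension on first use).

from mxnet import nd

net.initialize()
x = nd.random.normal(shape=(1, 512))
net(x)  # first call records the symbolic graph; subsequent calls reuse it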

SLIDE 7

Hybridize Notebook

SLIDE 8

Asynchronous Computing

SLIDE 9

Asynchronous Execution

a = 1
b = 2
c = a + b
print(c)

  • Execute one-by-one: the frontend thread runs each statement and waits for its result, paying system overhead per statement
  • With a backend thread: the frontend only pushes operations to the backend and waits when a value is actually needed (print(c)), so frontend and backend work is overlapped

(Diagram: the frontend thread pushes operations; the backend thread executes them; the only wait happens at print(c))
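A minimal timing sketch of this behavior in MXNet (nd.dot and wait_to_read are real NDArray APIs): the call returns as soon as the operation is queued, and the frontend only blocks when the result is needed.

import time
from mxnet import nd

a = nd.ones((1000, 1000))
start = time.time()
b = nd.dot(a, a)       # returns immediately: the op is only pushed to the backend
print('pushed after %.4f sec' % (time.time() - start))
b.wait_to_read()       # the frontend blocks here until the backend finishes
print('done after %.4f sec' % (time.time() - start))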

SLIDE 10

Automatic Parallelism

SLIDE 11

Writing Parallel Programs is Painful

data = next_batch()

# GPU 0 (examples 0:50)
data[gpu0].copyfrom(data[0:50])
fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])
fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])
fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])
fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])
_, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])

# GPU 1 (examples 51:100)
data[gpu1].copyfrom(data[51:100])
fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])
fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])
fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[51:100])
fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])
_, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])

# aggregate gradients on the CPU, update, and broadcast back
fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])

  • Single hidden-layer MLP with 2 GPUs
  • Scales to hundreds of layers and tens of GPUs
SLIDE 12

Auto Parallelization

Write serial programs:

A = nd.ones((2,2)) * 2
C = A + 2
B = A + 1
D = B * C

The backend builds the dependency graph and runs independent operations (here B and C) in parallel.
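A hedged sketch of the same idea across devices (assumes a machine with at least one GPU): the two matrix products are independent, so MXNet's backend schedules them concurrently on separate streams.

import mxnet as mx
from mxnet import nd

x_cpu = nd.random.uniform(shape=(2000, 2000), ctx=mx.cpu())
x_gpu = nd.random.uniform(shape=(2000, 2000), ctx=mx.gpu(0))
y_cpu = nd.dot(x_cpu, x_cpu)   # queued on the CPU stream
y_gpu = nd.dot(x_gpu, x_gpu)   # queued on the GPU stream, overlaps with the CPU op
nd.waitall()                   # block until both finish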

SLIDE 13

Multi-GPU Training

(Lunar new year, 2014)

SLIDE 14

Data Parallelism

  1. Read a data partition
  2. Pull the parameters
  3. Compute the gradient
  4. Push the gradient
  5. Update the parameters

(Diagram: workers hold examples; parameters live in a key-value store)
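The push/pull steps can be sketched with MXNet's key-value store (mx.kv.create, init, push, and pull are real calls); a minimal single-process example:

import mxnet as mx

kv = mx.kv.create('local')          # single-machine key-value store
kv.init(0, mx.nd.zeros((2, 3)))     # register parameter 0 on the store
kv.push(0, mx.nd.ones((2, 3)))      # step 4: push a (pretend) gradient
out = mx.nd.zeros((2, 3))
kv.pull(0, out=out)                 # step 2: pull the current value back
print(out)

In a real deployment an optimizer is registered on the store so that pushed gradients update the stored parameters (step 5).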

SLIDE 15

Distributed Training

(Alex’s frugal GPU cluster at CMU, 2015)

SLIDE 16

Distributed Computing

  • Multiple worker machines: examples are stored in a distributed filesystem and read over the network
  • Multiple server machines: hold the key-value store; workers push gradients and pull parameters over the network

SLIDE 17

GPU Machine Hierarchy

(Diagram: 4 GPUs attach to a PCIe switch, which connects to the CPU; machines connect through a network switch)

  • GPUs to PCIe switch: 63 GB/s aggregate (4x PCIe 3.0 16x)
  • PCIe switch to CPU: 15.75 GB/s (PCIe 3.0 16x)
  • Machine to network switch: 1.25 GB/s (10 Gbit Ethernet)

Hierarchical parameter server:

  • Workers: the GPUs
  • Level-1 servers: the CPUs within each machine
  • Level-2 servers: aggregate across machines over the network

SLIDE 18

Iterating a Batch

  • Each worker machine reads a part of the data batch

SLIDE 19

Iterating a Batch

  • Further split the partition and move a piece to each GPU

SLIDE 20

Iterating a Batch

  • Each server maintains a part of the parameters
  • Each worker pulls the whole set of parameters from the servers

SLIDE 21

Iterating a Batch

  • Copy the parameters to each GPU

SLIDE 22

Iterating a Batch

  • Each GPU computes gradients

SLIDE 23

Iterating a Batch

  • Sum the gradients over all GPUs (see the sketch below)
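This step is the classic allreduce pattern; a minimal sketch in the style of the d2l multi-GPU notebook, assuming data is a list of same-shaped NDArrays, one per GPU:

from mxnet import nd

def allreduce(data):
    # sum all per-GPU copies on the first device...
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    # ...then broadcast the sum back to every device
    for i in range(1, len(data)):
        data[0].copyto(data[i])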
SLIDE 24

Iterating a Batch

  • Push the gradients to the servers

SLIDE 25

Iterating a Batch

  • Each server sums the gradients from all workers, then updates its parameters

SLIDE 26

Synchronized SGD

  • Each worker runs synchronously
  • If there are n GPUs and each GPU processes b examples at a time, synchronized SGD is equivalent to mini-batch SGD on a single GPU with batch size nb
  • In the ideal case, training with n GPUs gives an n-times speedup over training with a single GPU

SLIDE 27

Performance

  • T1 = O(b): time to compute gradients for b examples on a GPU
  • T2 = O(m): time to send and receive m parameters/gradients for a worker
  • Wall-clock time per batch is max(T1, T2), since communication can overlap with computation (e.g. if T1 = 100 ms and T2 = 60 ms, a batch takes 100 ms)
  • Ideal case: T1 > T2, i.e. b is large enough that communication is hidden
  • But a too-large b needs more data epochs to reach the desired model quality

SLIDE 28

Performance Trade-off

(Figure: as the batch size per GPU grows, system performance (wall time per epoch) improves while training efficiency (#epochs to stop) degrades; the optimal batch size balances the two)

SLIDE 29

Practical Suggestions

  • A large dataset
  • Good GPU-GPU and machine-machine bandwidth
  • Efficient data loading/preprocessing
  • A model with a good computation (FLOP) vs. communication (model size) ratio
    • ResNet > AlexNet
  • A large enough batch size for good system performance
  • Tricks for efficient optimization with a large batch size
SLIDE 30

Multi-GPU Notebooks

SLIDE 31

Image Augmentation

SLIDE 32

Real Story from CES’19

  • Startup with a smart vending machine demo that identifies purchases via a camera
  • Demo at CES failed
    • Different light temperature
    • Light reflection from the table
  • The fix
    • Collect new data
    • Buy a tablecloth
    • Retrain all night
SLIDE 33

Data Augmentation

  • Use prior knowledge about invariances to augment the data
  • Add background noise to speech
  • Transform / augment images by altering colors, adding noise, cropping, and distorting

SLIDE 34

Training with Augmented Data

(Diagram: original dataset → augmentations generated on the fly → model)

SLIDE 35

Flip

(Images: vertical and horizontal flips of an example image)

SLIDE 36

Crop

  • Crop an area from the image and resize it (see the sketch below)
  • Random aspect ratio (e.g. [3:4, 4:3])
  • Random area size (e.g. [8%, 100%])
  • Random position
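These parameters map directly onto Gluon's RandomResizedCrop transform; a minimal sketch (the 200x200 output size is an arbitrary choice):

from mxnet.gluon.data.vision import transforms

aug = transforms.RandomResizedCrop(
    size=200,                      # resize the cropped region to 200x200
    scale=(0.08, 1.0),             # random area: 8% to 100% of the image
    ratio=(3.0 / 4.0, 4.0 / 3.0))  # random aspect ratio between 3:4 and 4:3
# apply as new_img = aug(img), where img is an HxWxC uint8 image NDArray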
SLIDE 37

Color

Scale hue, saturation, and brightness by a random factor (e.g. in [0.5, 1.5])

(Images: brightness and hue variations)
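A minimal Gluon sketch; RandomColorJitter with a jitter of 0.5 scales each property by a random factor in roughly [0.5, 1.5]:

from mxnet.gluon.data.vision import transforms

aug = transforms.RandomColorJitter(brightness=0.5, saturation=0.5, hue=0.5)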

SLIDE 38

Many Other Augmentations

https://github.com/aleju/imgaug

SLIDE 39

Fine Tuning

SLIDE 40

Labelling a Dataset is Expensive

(Table: existing datasets with 1.2M examples / 1,000 classes, 50K / 100, and 60K / 10, next to "my dataset")

Can we reuse this?

SLIDE 41

Network Structure

Two components in a deep network:

  • Feature extractor (layer 1 … layer L-1): maps raw pixels into linearly separable features
  • Linear classifier (the output layer, a softmax classifier): makes the decision

SLIDE 42

Fine Tuning

  • Layers 1 … L-1 trained on the source dataset are likely a good feature extractor for the target
  • Don't reuse the last layer, since the classification problem is different

(Diagram: feature layers carried over from the source dataset to the target dataset; output layer replaced)

SLIDE 43

Weight Initialization for Fine Tuning

(Diagram: copy the source model's weights to initialize the target model)
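A minimal sketch following the d2l fine-tuning notebook: copy the pre-trained feature extractor and re-initialize only the output layer (the 2-class target is an assumption for illustration):

from mxnet import init
from mxnet.gluon.model_zoo import vision

pretrained_net = vision.resnet18_v2(pretrained=True)  # source model with downloaded weights
finetune_net = vision.resnet18_v2(classes=2)          # target model (2 classes, for illustration)
finetune_net.features = pretrained_net.features       # reuse the feature extractor
finetune_net.output.initialize(init.Xavier())         # fresh, randomly initialized output layer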

SLIDE 44

Fix Lower Layers

  • Neural networks learn hierarchical feature representations
  • Low-level features are universal
  • High-level features are more related to the objects in the dataset
  • Fix the bottom layers' parameters during fine tuning (useful for regularization); see the sketch below
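One way to fix layers in Gluon is to disable gradient computation for their parameters; a sketch continuing the finetune_net example above (for simplicity it freezes the whole feature extractor, but freezing only the lower blocks works the same way):

# setattr on the collected ParameterDict turns off gradients for those layers,
# so only the new output layer is trained
finetune_net.features.collect_params().setattr('grad_req', 'null')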

SLIDE 45

Re-use Classifier Parameters

Lucky break:

  • The source dataset may contain some of the target categories
  • Use the corresponding weight vectors from the pre-trained model during initialization

SLIDE 46

Fine-tuning Training Recipe

  • Train on the target dataset as normal, but with strong regularization (see the sketch below)
    • Small learning rate
    • Fewer epochs
  • If the source dataset is more complex than the target dataset, fine-tuning usually leads to a better model (the source model is a good prior)
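A hedged sketch of the recipe's optimizer settings, continuing the finetune_net example (the exact values are illustrative, not prescriptive):

from mxnet import gluon

# small learning rate plus weight decay act as the strong regularizer
trainer = gluon.Trainer(finetune_net.collect_params(), 'sgd',
                        {'learning_rate': 0.01, 'wd': 0.001})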

SLIDE 47

Fine-tuning Notebook

SLIDE 48

Summary

  • To get good performance:
    • Optimize code through hybridization
    • Use multiple GPUs/machines
  • Augment image data with transformations
  • Train starting from pre-trained models (fine tuning)