

SLIDE 1

Beyond Backprop:

Online Alternating Minimization with Auxiliary Variables

NYU MIT IBM Research

Brian Kingsbury, Mattia Rigotti, Irina Rish, Sadhana Kumaravel, Paolo Di Achille, Ronny Luss, Viatcheslav Gurev, Anna Choromanska, Benjamin Cowen, Ravi Tejwani, Djallel Bouneffouf

SLIDE 2

WHAT’S WRONG WITH BACKPROP?

Biological implausibility:

  • Error feedback does not influence neural activity, unlike biological feedback mechanisms
  • Non-local weight updates, and more [Bartunov et al., 2018]

Computational Issues:

  • Vanishing gradients (the chain of derivatives shrinks with depth; e.g., sigmoid derivatives are at most 1/4, so their product decays geometrically)
  • Difficulty handling non-differentiable nonlinearities (e.g., binary spikes)
  • Lack of cross-layer weight-update parallelism
SLIDE 3

ALTERNATIVES: PRIOR WORK

  • Offline auxiliary-variable methods:
    • MAC (Carreira-Perpiñán & Wang, 2014) and other BCD methods (Zhang & Brand, 2017; Zhang & Kleijn, 2017; Askari et al., 2018; Zeng et al., 2018; Lau et al., 2018; Gotmare et al., 2018)
    • ADMM (Taylor et al., 2016; Zhang et al., 2016)
    • Offline (batch) training is not scalable to large data and continual learning
  • Target propagation methods:
    • [LeCun, 1986] [Lee, Fischer, Bengio, 2015] [Bartunov et al., 2018]
    • Below backprop-SGD performance levels on standard benchmarks
  • Proposed method: online (mini-batch, stochastic) auxiliary-variable alternating minimization
SLIDE 4

OUR APPROACH

Breaking gradient chains with auxiliary activation variables:

  • Relax nonlinear activations to noisy (Gaussian) linear activations followed by a nonlinearity (e.g., ReLU); see the sketch after the figure below
  • Alternate minimization over activations and weights: explicit activation propagation
  • Weight updates are layer-local, and thus can be parallelized (distributed, asynchronous)

[Figure: layered network with inputs X1, …, XN, outputs Y1, …, YK, auxiliary activations a1, …, am, codes c1, …, cm, and weights W1, …, WL+1]
SLIDE 5

NEURAL NETWORK FORMULATIONS

  • Nested: the standard neural network objective function
  • Constrained: add auxiliary activation variables (a hard-constrained problem)
  • Relaxed: relax the constraints; now amenable to alternating minimization
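As a concrete sketch (a MAC-style squared-penalty form with assumed notation; the paper's full version also carries the code variables c_l from the earlier figure), with a_{i,0} = x_i:

    % Nested: the standard objective, over weights only
    \min_{W} \sum_i \ell\big( h_{L+1}(W_{L+1} h_L( \cdots h_1(W_1 x_i))), \, y_i \big)

    % Constrained: auxiliary activations with hard equality constraints
    \min_{W,a} \sum_i \ell(a_{i,L+1}, y_i) \quad \text{s.t.} \quad a_{i,l} = h_l(W_l a_{i,l-1})

    % Relaxed: constraints become quadratic penalties with weight \gamma > 0
    \min_{W,a} \sum_i \ell(a_{i,L+1}, y_i) + \gamma \sum_{i,l} \| a_{i,l} - h_l(W_l a_{i,l-1}) \|_2^2

Each block (the activations a for fixed W, and each layer's W_l for fixed a) is now a local problem, which is exactly what alternating minimization exploits.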

SLIDE 6

ONLINE ALTERNATING MINIMIZATION

  • Forward: compute linear activations (codes) at layers 1, …, L
  • Backward: propagate the error by updating the codes
  • Weight updates are parallelizable across layers

Offline algorithms of prior work are not scalable to extremely large datasets and are not suitable for incremental, continual/lifelong learning, hence the online (mini-batch) variant. Note: updateWeights has two options: apply SGD to the current mini-batch, or apply BCD to a version that keeps a memory of previous samples (following Mairal et al., 2009). A sketch of one online AM step follows.
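A minimal NumPy sketch of one online AM step, assuming ReLU layers, squared penalties, and a single gradient step per block; names such as am_step, gamma, and lr are illustrative, not the paper's, and only the local-SGD weight-update option is shown (the Mairal-style memory/BCD variant is omitted):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def am_step(W, x, y, gamma=1.0, lr=0.1):
    """One mini-batch step: forward init, backward activation updates, local weight updates."""
    L = len(W)
    # Forward: initialize auxiliary activations by plain propagation.
    a = [x]
    for l in range(L):
        a.append(relu(W[l] @ a[l]))

    # Backward: nudge each auxiliary activation toward reducing the penalty
    # coming from the layer above (lower-layer penalty terms omitted for brevity).
    a[L] = a[L] - lr * (a[L] - y)  # output layer: squared-error loss to y
    for l in range(L - 1, 0, -1):
        pre = W[l] @ a[l]          # pre-activation feeding layer l+1
        # Gradient of gamma * ||a[l+1] - relu(W[l] a[l])||^2 w.r.t. a[l]:
        g = -2.0 * gamma * (W[l].T @ ((a[l + 1] - relu(pre)) * (pre > 0)))
        a[l] = a[l] - lr * g

    # Weight updates: each layer solves its own local regression; given the
    # a's, these L updates are independent, hence parallelizable.
    for l in range(L):
        pre = W[l] @ a[l]
        g = -2.0 * ((a[l + 1] - relu(pre)) * (pre > 0)) @ a[l].T
        W[l] = W[l] - lr * g / x.shape[1]
    return W, a

# Example usage (W[l] maps layer l to l+1; columns are batch samples):
rng = np.random.default_rng(0)
W = [0.1 * rng.standard_normal((16, 8)), 0.1 * rng.standard_normal((4, 16))]
W, a = am_step(W, rng.standard_normal((8, 32)), rng.standard_normal((4, 32)))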

SLIDE 7

FULLY-CONNECTED NETS

AM greatly outperforms all offline methods (ADMM of Taylor et al., and offline AM) and often matches Adam and SGD (50 epochs)

[Plots: MNIST and CIFAR-10 results]

SLIDE 8

FASTER INITIAL LEARNING: POTENTIAL USE AS A GOOD INIT?

  • AM often learns faster than SGD and Adam (backprop-based) in the 1st epoch, then matches their performance

[Plots: MNIST and CIFAR-10 results]

SLIDE 9

RNN: SEQUENTIAL MNIST; CONVNETS: LENET5, MNIST; FULLY-CONNECTED: HIGGS DATASET

  • AM performs similarly to Adam and outperforms SGD
  • All methods greatly outperform offline ADMM (Taylor's 0.64 benchmark), using less than 0.01% of the 10.5M-sample HIGGS data

SLIDE 10

NONDIFFERENTIABLE (BINARY) NETS


  • Backprop replaced by the Straight-Through Estimator (STE); a sketch follows this list
  • Comparison with Difference Target Propagation (DTP)
  • DTP took about 200 epochs to reach 0.2 error, matching STE performance (Lee et al., 2015)
  • AM-Adam with binary activations reaches the same error in fewer than 20 epochs
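For reference, a minimal PyTorch sketch of a straight-through estimator (an illustrative variant, not the exact code used in these experiments): binarize on the forward pass, pass gradients through on the backward pass with the common hard-tanh clipping.

import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).float() * 2 - 1  # sign(x) in {-1, +1}

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Treat binarization as identity, but zero the gradient where
        # |x| > 1 (the standard clipped-STE variant).
        return grad_output * (x.abs() <= 1).float()

binarize = BinarizeSTE.apply  # use like any activation: h = binarize(W @ x)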

SLIDE 11
SUMMARY: CONTRIBUTIONS

  • Algorithm(s): a novel online (stochastic) auxiliary-variable approach for training neural networks (prior methods are offline/batch); two versions of the approach (memory-based and local-SGD-based)
  • Theory: the first general theoretical convergence guarantees for alternating minimization in the stochastic setting: the error decays at a sublinear rate in the number of iterations t
  • Extensive evaluations: a variety of architectures and datasets, demonstrating advantages of online over offline approaches and performance similar to SGD (Adam), with faster initial convergence