Deep Equilibrium Models
Shaojie Bai
Carnegie Mellon University
NeurIPS 2019
“DEQ”
joint work with J. Zico Kolter (CMU/Bosch) and Vladlen Koltun (Intel)
TL;DR: One (implicit) layer is all you need.
Outline of This Talk
[Figure: a deep stack x → z[1] → z[2] → … → z[L], collapsed into a single implicit layer x → z⋆]
- We can replace many classes of deep models with a single layer, keep the number of parameters the same, and lose no representational capacity.
- This requires us to (re-)consider deep networks implicitly, with an approach that we call the deep equilibrium (DEQ) model.
- DEQ works as well as (or better than) existing models on large-scale sequence tasks while using only constant memory.
Weight-Tied, Input-Injected Networks
[Figure: a traditional deep network x → z[1] → z[2] → … → z[L] with distinct weights θ0, θ1, θ2, …, θL−1 per layer, versus a weight-tied, input-injected network that applies the same θ at every layer]
Traditional layer: z[i+1] = fθi(z[i]) = σ(Wiz[i] + bi)
Weight-tied, input-injected layer: z[i+1] = fθ(z[i]; x) = σ(Wz[i] + Ux + b) (just a simple example)
Isn't weight-tying a big restriction? No: any deep feedforward network can be represented by a weight-tied, input-injected network of equivalent depth.
Recent successes of weight-tied models: TrellisNet [Bai et al., ICLR 2019], Universal Transformer [Dehghani et al., ICLR 2019], ALBERT [Lan et al., preprint].
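As a toy illustration (mine, not the talk's), here is a minimal NumPy sketch of a weight-tied, input-injected layer iterated to convergence; the tanh nonlinearity and all parameter values are arbitrary stand-ins:

    import numpy as np

    # Iterate z[i+1] = tanh(W z[i] + U x + b) until the hidden state
    # stops changing; W, U, b, x are hypothetical stand-ins.
    rng = np.random.default_rng(0)
    d = 8
    W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))  # small weights encourage convergence
    U = rng.normal(size=(d, d))
    b = np.zeros(d)
    x = rng.normal(size=d)

    def f(z, x):
        return np.tanh(W @ z + U @ x + b)

    z = np.zeros(d)
    for i in range(100):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < 1e-6:  # fixed point reached: z ≈ f(z; x)
            break
        z = z_next

In practice the iterates often stop changing after a few dozen steps, which is exactly the equilibrium behavior the next slides exploit.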
Equilibrium Points, and the DEQ Model
We can now think of a deep network as repeated applications of some function
z[i+1] = fθ(z[i]; x)
In practice (a bit more on this point shortly), these types of models converge to an equilibrium point (i.e., an "infinite-depth" network):
z⋆ = fθ(z⋆; x)
Deep Equilibrium (DEQ) Models: Find this equilibrium point directly via root-finding (e.g., Newton/quasi-Newton methods) rather than iterating the forward pass.
A Formal Summary of the DEQ Approach
Define a single layer fθ(z; x).
Forward pass: Given an input x, compute the equilibrium point z⋆ such that fθ(z⋆; x) − z⋆ = 0, i.e., z⋆ = RootFind(fθ − I; x) (via any black-box root solver; e.g., Broyden's method). Such an equilibrium virtually always exists in practice (examples later).
Backward pass: Implicitly differentiate through the equilibrium state to form gradients:
∂ℓ/∂(·) = ∂ℓ/∂z⋆ · (I − ∂fθ/∂z⋆)⁻¹ · ∂fθ/∂(·)
where ∂fθ/∂z⋆ is the Jacobian at the equilibrium and ∂fθ/∂(·) is the gradient of one layer (w.r.t. parameters or inputs).
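To make both passes concrete, here is a self-contained NumPy/SciPy sketch (my illustration, not the paper's implementation) using the toy tanh layer from before; the loss ℓ(z⋆) = ‖z⋆‖² is an arbitrary stand-in:

    import numpy as np
    from scipy.optimize import root

    # Toy layer f(z, x) = tanh(W z + U x + b); parameters are hypothetical.
    rng = np.random.default_rng(0)
    d = 8
    W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))
    U = rng.normal(size=(d, d))
    b = np.zeros(d)
    x = rng.normal(size=d)

    def f(z, x):
        return np.tanh(W @ z + U @ x + b)

    # Forward pass: black-box root solve of f(z, x) - z = 0 (Broyden's method).
    z_star = root(lambda z: f(z, x) - z, np.zeros(d), method='broyden1').x

    # Backward pass: implicit differentiation at the equilibrium.
    dl_dz = 2 * z_star                             # dl/dz* for l(z*) = ||z*||^2
    s = 1 - np.tanh(W @ z_star + U @ x + b) ** 2   # tanh'(.) at the equilibrium
    J = s[:, None] * W                             # df/dz* (Jacobian of f at z*)
    v = np.linalg.solve((np.eye(d) - J).T, dl_dz)  # v = dl/dz* (I - df/dz*)^(-1)
    dl_dx = v @ (s[:, None] * U)                   # dl/dx = v df/dx, df/dx = diag(s) U

Note that nothing from the solver's trajectory is needed for the gradient, only z⋆ itself.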
FAQs
Q: Why not stack these deep equilibrium "implicit" layers (with potentially different functions)?
A: Stacking doesn't help: a composition of DEQs can be represented by a single DEQ; i.e., "deep" DEQs don't give you more; it's only a matter of designing fθ. Intuitively, there exists a ΓΘ such that DEQ_ΓΘ = DEQ_hθ2 ∘ DEQ_fθ1.
Q: Is DEQ related to the decade-old attractor network and recurrent backprop (RBP) ideas?
A: Yes, but 1) we advocate for replacing general, modern, highly structured networks with single-layer equilibrium models, not using simple recurrent cells; and 2) we demonstrate that with these networks, the method can achieve SOTA performance with a vast reduction in memory.
FAQs
Q: What are the relative time/memory tradeoffs?
A: Forward pass: black-box root solving (e.g., fast quasi-Newton methods). Backward pass: one-step multiplication with the inverse Jacobian at equilibrium.
Memory: constant (i.e., no growth at all with "depth"; O(1)). Only need to store x, z⋆, θ.
Time: comparable (root-finding takes slightly longer than iterating a small fixed # of forward steps).
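One way to see the O(1) memory claim, as a hypothetical PyTorch-style sketch (mine, not the paper's code): the solver iterations run under no_grad, so no intermediate activations are stored, and the graph is attached only at the equilibrium.

    import torch

    class DEQLayer(torch.nn.Module):
        # Hypothetical module illustrating the constant-memory structure.
        def __init__(self, d):
            super().__init__()
            self.W = torch.nn.Linear(d, d)
            self.U = torch.nn.Linear(d, d)

        def f(self, z, x):
            return torch.tanh(self.W(z) + self.U(x))

        def forward(self, x, iters=50):
            z = torch.zeros_like(x)
            with torch.no_grad():          # solver steps store no activations
                for _ in range(iters):     # plain iteration stands in for Broyden
                    z = self.f(z, x)
            # Re-attach autograd with one application at the equilibrium; a full
            # DEQ backward would solve v(I - df/dz*) = dl/dz* here (implicit
            # function theorem) rather than backpropagating through the solver.
            return self.f(z.detach(), x)

So regardless of how many solver steps run, only x, z⋆, and θ are kept.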
DEQs for Sequence Modeling
DEQ can be instantiated with common sequence modeling architectures. We evaluate two SOTA sequence modeling architectures:
1) DEQ-TrellisNet: equilibrium version of the TrellisNet architecture [Bai et al., ICLR 2019], a type of weight-tied temporal convolution that generalizes RNNs.
2) DEQ-Transformer: equilibrium version of the Transformer architecture [Vaswani et al., NIPS 2017], with weight-tied multi-head self-attention [Dehghani et al., ICLR 2019].
[Figure: a sequence-level DEQ mapping inputs x1, x2, x3, …, xT through equilibrium states z⋆1, z⋆2, z⋆3, …, z⋆T to outputs y1, y2, y3, …, yT]
z⋆1:T = fθ(z⋆1:T; x1:T) = RootFind(gθ; x1:T), where gθ = fθ − I
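For intuition, a toy NumPy/SciPy sketch (mine, with a made-up causal layer rather than either architecture above) of solving for all T equilibrium states simultaneously:

    import numpy as np
    from scipy.optimize import root

    # Hypothetical causal layer: z_t depends on z_{t-1}, z_t, and x_t,
    # weight-tied across both time and "depth".
    T, d = 6, 4
    rng = np.random.default_rng(1)
    W_prev = rng.normal(scale=0.2, size=(d, d))
    W_self = rng.normal(scale=0.2, size=(d, d))
    U = rng.normal(size=(d, d))
    x = rng.normal(size=(T, d))

    def f(z):  # z: (T, d) -> (T, d)
        z_prev = np.vstack([np.zeros((1, d)), z[:-1]])  # causal shift by one step
        return np.tanh(z_prev @ W_prev.T + z @ W_self.T + x @ U.T)

    # RootFind(g; x_{1:T}) with g = f - I, over the flattened sequence.
    g = lambda v: (f(v.reshape(T, d)) - v.reshape(T, d)).ravel()
    z_star = root(g, np.zeros(T * d), method='broyden1').x.reshape(T, d)
    # z*_{1:T} = f(z*_{1:T}; x_{1:T}): one joint equilibrium for the whole sequence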
More details in the paper.
Large-Scale Benchmarks
Word-level Language Modeling on WikiText-103 (WT103)
Model (non-embedding params)         Perplexity   Memory (GB)
Transformer-XL Small (5M)            35.8         4.8
DEQ-Transformer Small (5M)           32.4         1.1
70-layer TrellisNet (45M)            29.2         24.7
DEQ-TrellisNet (45M)                 29.0         3.3
Transformer-XL Medium (70M)          23.6         9.0
DEQ-Transformer Medium (70M)         23.2         3.7
Transformer-XL XLarge (TPU) (224M)   18.7         12.0

Notes: 1) Benchmarked on sequence length 150. 2) Memory does not include word embeddings.
More results in the paper.
Summary, Thoughts and Challenges
- DEQ is among the first applications of implicit equilibrium models to large-scale learning of which we are aware.
- The forward pass uses direct root solving; the backward pass relies only on the equilibrium point, not on the path taken to reach it. Memory cost is therefore constant (i.e., equivalent to that of 1 layer).
- On large-scale sequence benchmarks, DEQs match SOTA performance with a significant reduction in memory cost.
Shaojie Bai shaojieb@cs.cmu.edu https://github.com/locuslab/deq @shaojieb
Interested in DEQ? Stop by our poster at Exhibition Hall B+C #137 (right after this talk) ;-)