Dynamic gradient estimation in machine learning Thomas Flynn

Abstract The optimization problems arising in machine learning form some of the most theoret- ically challenging and computationally demanding problems in numerical computing today. Due to the complexity of the models and the problem domains to which they are applied, approximation methods are required during optimization. This review focuses on optimization schemes involving dynamic gradient estimation. In these algorithms, gradient estimation runs in parallel with the parameter adaptation process. We survey a number of problems from machine learning that admit such approaches to optimization, including applications to deterministic and stochastic neural networks, and present these algorithms in a common framework of stochastic approximation.

Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Boltzmann machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Optimization problem . . . . . . . . . . . . . . . . . . . . . 7 2.3 Application: A Joint Model of Images and Text . . . . . . . . 8 2.4 Optimization algorithm . . . . . . . . . . . . . . . . . . . . . 9 2.5 Numerical experiment . . . . . . . . . . . . . . . . . . . . . 12 2.6 Variants of the Boltzmann Machine . . . . . . . . . . . . . . 13 3 Stochastic approximation . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1 Weak convergence to an ODE . . . . . . . . . . . . . . . . . 15 3.2 Applying SA to the Boltzmann machine . . . . . . . . . . . . 19 3.3 Application to online Bayesian learning . . . . . . . . . . . . 19 4 Sigmoid belief networks . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2 Optimization problem . . . . . . . . . . . . . . . . . . . . . 23 4.3 Optimization algorithm . . . . . . . . . . . . . . . . . . . . . 23 4.4 Numerical experiment . . . . . . . . . . . . . . . . . . . . . 25 4.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5 Attractor networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2 Optimization problem . . . . . . . . . . . . . . . . . . . . . 31 5.3 Optimization algorithm . . . . . . . . . . . . . . . . . . . . . 33 5.4 Numerical experiment . . . . . . . . . . . . . . . . . . . . . 35 6 Chemical reaction networks . . . . . . . . . . . . . . . . . . . . . . 36 6.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.2 Optimization problem . . . . . . . . . . . . . . . . . . . . . 39 6.3 Optimization algorithm . . . . . . . . . . . . . . . . . . . . . 40 6.4 Numerical experiment . . . . . . . . . . . . . . . . . . . . . 40 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2

. . . ∆ n − 1 ∆ n ∆ n +1 . . . . . . w n − 1 w n w n +1 . . . (a) . . . y n − 1 y n +1 . . . y n ∆ n − 1 ∆ n ∆ n +1 . . . . . . . . . w n − 1 w n +1 . . . w n (b) Figure 1: In standard gradient based optimization schemes (a), the search direction ∆ n at time n is calculated based solely on the parameter w n − 1 . In dynamic gradient estimation schemes (b), the search directions ∆ n are computed based on the current parameter and the state y n of an auxiliary system. 1 Introduction We will review several network-based models useful for applications in machine learning and other areas, touching upon a number of topics in each case. This includes how the networks operate, what they are used for, and issues related to optimization. The networks are diverse in terms of their dynamical features: some operate probabilisti- cally while others are deterministic; some run in a continuous state space and some have discrete states. In terms of optimization, we discuss what is the typical optimization problem associated with the network, describe the sensitivity analysis procedure (that is, how to compute the necessary gradients), and also mention what are some theoretical challenges associated with the optimization. Typically, the parameters of the model relate either to the local behavior of a unit or how units interact. These parameters determine things like affinity for a certain state, or how one unit inhibits or excites another. For several of the problems the results of numerical experiments are presented. Many of the models have the property that computing their derivatives is computationally difficult, and one must resort to (either deterministic or probabilistic) iterative procedures to do so. The resulting optimization algorithms then have a “two-timescale” 3

form, where derivative estimation and parameter update steps are parallel processes that must be calibrated correctly to achieve convergence. A schematic for this type of procedure is shown in Figure 1. For example, one situation where gradient estimation becomes non-trivial is when the optimization problem concerns the long-term behavior of a system. In this case, the sensitivity analysis procedure must discover the way long- term behavior is affected by changes to local parameters, but typically one only has a description of how the network evolves over the short term. A framework to analyze these multiple time-scale stochastic adaptive algorithms is provided by the theory of stochastic approximation, another topic which we review below. The remainder of this survey is organized as follows. In Section 2 we consider the Boltzmann machine, a discrete time, discrete state space, stochastic neural network. In Section 3 we review the theory of Stochastic Approximation. This provides a framework for analyzing the asymptotic and transient properties of stochastic optimization algorithms as parameters such as the step size are varied. In Section 4 we consider another model, the Sigmoid Belief Network, which is similar to the Boltzmann machine but has an acyclic and directed connectivity graph. Section 5 considers continuous state space models that may have cycles in the connectivity graph, known as attractor networks. These are also known as fixed-point or recurrent neural networks. The last model we consider, in Section 6, is a chemical reaction network. We finish with a discussion in Section 7 1.1 Notation For reference, we record some of the notation that is used in the rest of this survey. • n - dimensionality of the state space of a model. In a network based model, this will be the number of nodes in the network. • V - a subset of { 1 , . . . n } defining the indices of the visible units. • n V - number of visible units in a model. • n H - number of hidden or latent variables in a model. • X - state space of the model. • x # U - projection of the vector x onto the components U . Formally, the vector ( x U 1 , x U 2 , . . . , x U | U | ) . • m - number of training examples. • w (1) , w (2) , . . . - sequence of parameters generated by optimization algorithm. • w ǫ (1) , w ǫ (2) , . . . - sequence of parameters generated using specific step size ǫ . 4

Dynamic gradient estimation in machine learning Thomas Flynn - PDF document

Dynamic gradient estimation in machine learning Thomas Flynn Abstract The optimization problems arising in machine learning form some of the most theoret- ically challenging and computationally demanding problems in numerical computing today.

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Applied Machine Learning Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh

Applied Machine Learning Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

Reinforcement Learning for Continuous State and Action Spaces Gradient Methods 1 MACHINE LEARNING

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

Density Ratio Estimation Density Ratio Estimation in Machine Learning in Machine Learning

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Learning to learn by gradient descent by gradient descent Liyan Jiang July 18, 2019 1

Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent Sham M Kakade

Applied Machine Learning Applied Machine Learning Gradient Computation & Automatic

Frasers Property Thailand Opportunity Day 3/2019 3rd Quarter Fiscal Year 2019 Earnings Three

Modeling Land Competition Modeling Land Competition Modeling Land Competition Ron Sands Ron

Nhuong Tran, WorldFish Anh Van Thi Nguyen and Norbert Wilson, Auburn University Non-tariff

Investor Presentation December 2018 1 Proprietary and confidential Disclaimer This

Particle Filters: Convergence Results and High Dimensions Mark Coates mark.coates@mcgill.ca

Nonlinear Expectations and Stochastic Calculus under Uncertainty with Robust Central Limit

On the Resource/Performance Tradeoff in Large Scale Queueing Systems David Gamarnik MIT Joint

Q1 2015 EARNINGS RELEASE April 21, 2015 PENTAIR Q1 2015 Earnings Release FORWARD-LOOKING

Dynamic gradient estimation in machine learning Thomas Flynn - PDF document

Dynamic gradient estimation in machine learning Thomas Flynn Abstract The optimization problems arising in machine learning form some of the most theoret- ically challenging and computationally demanding problems in numerical computing today.

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Applied Machine Learning Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh

Applied Machine Learning Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

Reinforcement Learning for Continuous State and Action Spaces Gradient Methods 1 MACHINE LEARNING

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

Density Ratio Estimation Density Ratio Estimation in Machine Learning in Machine Learning

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Learning to learn by gradient descent by gradient descent Liyan Jiang July 18, 2019 1

Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent Sham M Kakade

Applied Machine Learning Applied Machine Learning Gradient Computation &amp; Automatic

Frasers Property Thailand Opportunity Day 3/2019 3rd Quarter Fiscal Year 2019 Earnings Three

Modeling Land Competition Modeling Land Competition Modeling Land Competition Ron Sands Ron

Nhuong Tran, WorldFish Anh Van Thi Nguyen and Norbert Wilson, Auburn University Non-tariff

Investor Presentation December 2018 1 Proprietary and confidential Disclaimer This

Particle Filters: Convergence Results and High Dimensions Mark Coates mark.coates@mcgill.ca

Nonlinear Expectations and Stochastic Calculus under Uncertainty with Robust Central Limit

On the Resource/Performance Tradeoff in Large Scale Queueing Systems David Gamarnik MIT Joint

Q1 2015 EARNINGS RELEASE April 21, 2015 PENTAIR Q1 2015 Earnings Release FORWARD-LOOKING

Applied Machine Learning Applied Machine Learning Gradient Computation & Automatic