Geometry of Boltzmann Machines


  1. Geometry of Boltzmann Machines
     Guido Montúfar, Max Planck Institute for Mathematics in the Sciences, Leipzig
     Talk at IGAIA IV, June 17, 2016, on the occasion of Shun-ichi Amari’s 80th birthday

  2. • Boltzmann Machines • Geometric Perspectives • Universal Approximation (new results) • Dimension (new results)

  3. Boltzmann Machines
     A Boltzmann machine is a network of stochastic units. It defines a set of probability vectors

     p_\theta(x) = \exp\Big( \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi(\theta) \Big), \quad x \in \{0,1\}^N,

     for all \theta \in \mathbb{R}^d.
     [Figure: a fully connected network on units x_1, ..., x_8 and the logistic activation \sigma(\alpha).]
     [Ackley, Hinton, Sejnowski ’85] [Geman & Geman ’84]
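As a concrete illustration of the definition above, here is a minimal brute-force sketch (my own, not from the talk; all variable names are made up) that evaluates p_\theta(x) for a small fully connected Boltzmann machine by enumerating all 2^N states:

```python
import itertools
import numpy as np

def boltzmann_probs(theta_lin, theta_quad):
    """Exact probabilities p_theta(x) of a fully connected Boltzmann machine.

    theta_lin  : (N,)   biases theta_i
    theta_quad : (N, N) pairwise weights theta_ij (only i < j is used)
    """
    N = len(theta_lin)
    states = np.array(list(itertools.product([0, 1], repeat=N)))
    # Unnormalized log-probability: sum_i theta_i x_i + sum_{i<j} theta_ij x_i x_j
    log_p = states @ theta_lin
    for i in range(N):
        for j in range(i + 1, N):
            log_p = log_p + theta_quad[i, j] * states[:, i] * states[:, j]
    # psi(theta) is the log-partition function
    psi = np.log(np.exp(log_p).sum())
    return states, np.exp(log_p - psi)

rng = np.random.default_rng(0)
N = 4
states, probs = boltzmann_probs(rng.normal(size=N), rng.normal(size=(N, N)))
print(probs.sum())  # ~1.0
```

Exact enumeration is only feasible for small N; for larger networks \psi(\theta) is intractable and sampling is used instead.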

  4. Boltzmann Machines
     Typical uses: generative models, modeling temporal sequences, learning representations, structured output
     prediction, learning modules for deep belief networks, recommender systems, classification, and stochastic
     controllers [Montufar, Zahedi, Ay ’15].
     [Figure: a deep layered architecture with layers X^1, ..., X^L, and a small network with inputs x_1, ..., x_4,
     hidden units h_1, ..., h_k, and outputs y_1, y_2.]

  5. Information Geometric Perspectives: without hidden units

     p_\theta(x) = \exp\Big( \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi(\theta) \Big)

     • The Boltzmann machine defines an e-linear manifold.
     • The MLE is the unique m-projection of the target distribution onto this manifold.
     • The natural gradient learning trajectory is the m-geodesic to the MLE.
     • The natural parameters have a stochastic interpretation.

     Expectation parameters: \eta = \nabla \psi(\theta). Natural gradient update: \Delta\theta = \epsilon\, G^{-1}(\eta_Q - \eta_R).
     [Figure: the Boltzmann manifold B in the probability simplex, with target Q, model point R, and projection P.]
     [Amari, Kurata, Nagaoka ’92]
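To make the update rule concrete, here is a small numerical sketch (my own, not from the talk) for a fully visible Boltzmann machine: the sufficient statistics are (x_i, x_i x_j), \eta_Q and \eta_R are their expectations under the target Q and the current model R, and G is their covariance under the current model (the Fisher information). All function and variable names are invented for the example.

```python
import itertools
import numpy as np

def suff_stats(states):
    """Sufficient statistics T(x) = (x_i for all i, x_i x_j for all i < j)."""
    N = states.shape[1]
    pairs = [states[:, i] * states[:, j] for i in range(N) for j in range(i + 1, N)]
    return np.column_stack([states] + pairs)

def model_probs(theta, T):
    logits = T @ theta
    p = np.exp(logits - logits.max())          # subtract max for numerical stability
    return p / p.sum()

def natural_gradient_step(theta, T, p_target, eps=0.5, ridge=1e-9):
    p_model = model_probs(theta, T)
    eta_R = T.T @ p_model                      # expectation parameters of the current model
    eta_Q = T.T @ p_target                     # expectation parameters of the target
    G = T.T @ (T * p_model[:, None]) - np.outer(eta_R, eta_R)   # Fisher information = Cov(T)
    delta = np.linalg.solve(G + ridge * np.eye(len(theta)), eta_Q - eta_R)
    return theta + eps * delta                 # Delta theta = eps * G^{-1} (eta_Q - eta_R)

N = 3
states = np.array(list(itertools.product([0, 1], repeat=N)))
T = suff_stats(states)
rng = np.random.default_rng(1)
p_target = model_probs(rng.normal(size=T.shape[1]), T)   # a target that lies in the model
theta = np.zeros(T.shape[1])
for _ in range(50):
    theta = natural_gradient_step(theta, T, p_target)
print(np.abs(model_probs(theta, T) - p_target).max())    # close to 0
```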

  6. Information Geometric Perspectives: with hidden units, x = (x_V, x_H)

     p_\theta(x_V) = \sum_{x_H} \exp\Big( \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi(\theta) \Big)

     • The Boltzmann machine defines a curved manifold with singularities.
     • The MLE minimizes the KL-divergence from the m-flat data manifold to the e-flat fully observable Boltzmann
       manifold. [Amari, Kurata, Nagaoka ’92]
     • Iterative optimization using m- and e-projections (EM algorithm): P_t, Q_t → P_{t+1}, Q_{t+1} → ... → P*, Q*.
       [Amari ’16] [Amari, Kurata, Nagaoka ’92]

     [Shown on the slide: an excerpt from Ackley, Hinton & Sejnowski ’85. At thermal equilibrium the gradient of the
     information gain G is \partial G / \partial w_{ij} = -\tfrac{1}{T}(p_{ij} - p'_{ij}), so G can be minimized by
     changing each weight by \Delta w_{ij} = \epsilon (p_{ij} - p'_{ij}), where p_{ij} is the probability of units i
     and j both being on when the environment clamps the visible units and p'_{ij} the corresponding probability
     when the network runs freely. The rule uses only locally available information; without hidden units G has no
     poor local minima, while with hidden units local minima can occur.]
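The quoted learning rule can be illustrated for a tiny network by computing the clamped and free statistics exactly by enumeration instead of by sampling at thermal equilibrium. The sketch below is my own toy construction (3 visible and 2 hidden units, made-up names), not code from the talk:

```python
import itertools
import numpy as np

def joint_probs(W, b, states):
    """Exact Boltzmann distribution over all joint states x = (x_V, x_H)."""
    logits = states @ b + 0.5 * np.einsum('si,ij,sj->s', states, W, states)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def second_moments(states, p):
    """E[x x^T] under distribution p over the rows of `states` (diagonal = E[x_i])."""
    return states.T @ (states * p[:, None])

nV, nH, n = 3, 2, 5                               # 3 visible + 2 hidden units
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
b = np.zeros(n)
states = np.array(list(itertools.product([0, 1], repeat=n)))
visible_patterns = list(itertools.product([0, 1], repeat=nV))
data = rng.dirichlet(np.ones(2 ** nV))            # target distribution on the visible units

eps = 0.2
for _ in range(200):
    p_free = joint_probs(W, b, states)
    free = second_moments(states, p_free)         # p'_ij: free-running statistics
    clamped = np.zeros((n, n))                    # p_ij: visible units clamped to the data
    for v_idx, v in enumerate(visible_patterns):
        mask = np.all(states[:, :nV] == v, axis=1)
        p_cond = p_free[mask] / p_free[mask].sum()         # p(x_H | x_V = v) under the model
        clamped += data[v_idx] * second_moments(states[mask], p_cond)
    W += eps * (clamped - free)                   # Delta w_ij = eps (p_ij - p'_ij)
    np.fill_diagonal(W, 0)
    b += eps * (np.diag(clamped) - np.diag(free)) # same rule for the biases

p_final = joint_probs(W, b, states)
p_visible = np.array([p_final[np.all(states[:, :nV] == v, axis=1)].sum() for v in visible_patterns])
print(np.abs(p_visible - data).max())             # small if the model is flexible enough
```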

  7. Algebraic Geometric Perspectives
     • A Boltzmann machine has a polynomial parametrization and defines a semialgebraic variety in the probability
       simplex.
     • The main invariants of interest are the expected dimension and the number of parameters of (Zariski) dense
       models. [Raicu ’11: 3 x 3 minors of 2-d flattenings]
     • Implicitization: find an ideal basis that cuts out the model from the probability simplex,
       \{ p = g(\theta) : \theta \in \mathbb{R}^d \} \cap \Delta  versus  \{ p \in \Delta : f(p) = 0 \text{ for all } f \in I \}.
       One such polynomial has degree 110 and more than 5.5 trillion monomials [Cueto, Tobis, Yu ’10].
     [Pistone, Riccomagno, Wynn ’01] [Garcia, Stillman, Sturmfels ’05] [Geiger, Meek, Sturmfels ’06]
     [Cueto, Morton, Sturmfels ’10]

  8. Questions

     p_\theta(x_V) = \sum_{x_H} \exp\Big( \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi(\theta) \Big), \quad x_V \in \{0,1\}^V

     • Universal approximation. What is the smallest number of hidden units such that any distribution on {0,1}^V
       can be represented to within any desired accuracy?
     • Dimension. What is the dimension of the set of distributions represented by a fixed network?
     • Approximation errors. MLE, maximum and expected KL-divergence, etc.
     • Support sets. Properties of the marginal polytopes.
     [Figure: a network on units x_1, ..., x_8, part visible and part hidden.]

  9. Various Possible Hierarchies
     Hierarchies by number of hidden units: fully connected, stack of layers, bipartite graph.
     [Figure: schematics of the three architectures.]

  10. Restricted Boltzmann Machine
      Bipartite graph between hidden units H and visible units V; #parameters = V \cdot H + V + H.
      The conditional distributions factorize over units:

      p(x_V | x_H) = \prod_{i \in V} p(x_i | x_H),  \qquad  p(x_H | x_V) = \prod_{j \in H} p(x_j | x_V)

      [Smolensky ’86] Harmony Theory
      [Freund & Haussler ’94] Influence Combination Machine:
      p(x_V) \propto \prod_{j \in H} q_j(x_V),  \quad  q_j(x_V) = \lambda_j \prod_{i \in V} r_{j,i}(x_i) + (1 - \lambda_j) \prod_{i \in V} s_{j,i}(x_i)
      [Hinton ’02] Products of Experts
      [Figure: bipartite graph with m = 3 hidden units h_1, h_2, h_3 and n = 5 input units x_1, ..., x_5.]
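Because the conditionals factorize as above, each layer can be resampled in one shot given the other layer, which is block Gibbs sampling. Here is a minimal sketch (my own illustration, with made-up parameter names) under the usual sigmoid parametrization p(x_j = 1 | x_V) = \sigma(c_j + \sum_i W_{ij} x_i):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample_rbm(W, b, c, n_steps, rng, v0=None):
    """Block Gibbs sampling in an RBM with weights W (V x H), visible biases b, hidden biases c."""
    nV, nH = W.shape
    v = rng.integers(0, 2, size=nV) if v0 is None else np.asarray(v0)
    h = np.zeros(nH, dtype=int)
    for _ in range(n_steps):
        p_h = sigmoid(c + v @ W)               # p(x_j = 1 | x_V) for all hidden units j at once
        h = (rng.random(nH) < p_h).astype(int)
        p_v = sigmoid(b + W @ h)               # p(x_i = 1 | x_H) for all visible units i at once
        v = (rng.random(nV) < p_v).astype(int)
    return v, h

rng = np.random.default_rng(0)
nV, nH = 5, 3
W = rng.normal(scale=0.5, size=(nV, nH))
b, c = np.zeros(nV), np.zeros(nH)
v, h = gibbs_sample_rbm(W, b, c, n_steps=1000, rng=rng)
print(v, h)
```

This conditional independence within each layer is what makes the bipartite architecture convenient as a learning module.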

  11. Universal Approximation

  12. Universal Approximation
      Let H_V := min{ H : RBM_{V,H} is a universal approximator on {0,1}^V }.
      • H_V ≥ (2^V − V − 1)/(V + 1)  (number of parameters ~ 2^V). Observation.
      • H_V ≤ 2^V. Theorem (Freund & Haussler ’94)
      • H_V ≤ 2^V. Theorem (Le Roux & Bengio ’10)
      • H_V ≤ 2^V − V − 1  (number of parameters ~ V·2^V). Theorem (Younes ’95)
      • H_V ≤ (1/2)·2^V − 1. Theorem (M. & Ay ’11)
      • H_V ≤ 2(log(V) + 1)/(V + 1) · (2^V − 1)  (number of parameters ~ log(V)·2^V). Theorem (M. & Rauh ’16)
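For a sense of scale, the short script below (my own, not part of the talk) evaluates the bounds listed above for a few visible-layer sizes. The base of the logarithm is an assumption on my part; the slide only writes log(V).

```python
import math

# Hidden-unit bounds from the list above, evaluated for a few visible-layer sizes V.
# Assumption: log is taken base 2 here.
def bounds(V):
    lower = (2**V - V - 1) / (V + 1)                               # parameter-counting lower bound
    le_roux_bengio = 2**V                                          # Le Roux & Bengio
    montufar_ay = 2**(V - 1) - 1                                   # M. & Ay '11
    montufar_rauh = 2 * (math.log2(V) + 1) / (V + 1) * (2**V - 1)  # M. & Rauh '16
    return lower, le_roux_bengio, montufar_ay, montufar_rauh

for V in (4, 8, 12, 16, 20):
    print(V, [round(b) for b in bounds(V)])
```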

  13. Comparison with mixtures of product distributions

      Theorem. Every distribution on {0,1}^V can be approximated arbitrarily well by a mixture of k product
      distributions if and only if k ≥ 2^{V−1}.  (Θ(V·2^V) parameters)  [M., Kybernetika ’13]

      Theorem. Every distribution on {0,1}^V can be approximated arbitrarily well by distributions from RBM_{V,H}
      whenever H ≥ 2(log(V−1) + 1)/(V + 1) · (2^V − (V+1) − 1) + 1.  (Ω(2^V), O(log(V)·2^V) parameters)  [M. & Rauh ’16]

  14. Proof I - Intuition
      Each hidden unit extends the RBM along some parameters of the simplex.
      [Figure: the visible model B_V (parameters θ) extended by hidden units toward B_{V∪H}, compared with a
      hierarchical model E_Λ (parameters ϑ) and with the previous approach.]
      [M. & Rauh ’16] [M. & Ay ’11] [Younes ’95] [Le Roux & Bengio ’08]

  15. Proof II: Hierarchical models
      Consider the set E_Λ of probability vectors

      q_\vartheta(x_V) = \exp\Big( \sum_{\lambda \in \Lambda} \vartheta_\lambda \prod_{i \in \lambda} x_i - \psi(\vartheta) \Big), \quad x_V \in \{0,1\}^V,

      for all \vartheta \in \mathbb{R}^\Lambda, where Λ is an inclusion-closed subset of 2^V.

      Natural parameters (\vartheta_\lambda)_{\lambda \in \Lambda} \in \mathbb{R}^\Lambda, with \vartheta_\lambda = 0 for \lambda \notin \Lambda, serve as coordinates
      for the visible probability simplex:

      q_\vartheta(x_V) \;\leftrightarrow\; -H(x_V) = \sum_{\lambda \in \Lambda} \vartheta_\lambda \prod_{i \in \lambda} x_i

      We will use each hidden unit to model a group of monomials.
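A hierarchical model of this form is straightforward to evaluate directly. The sketch below is mine (the particular choice of Λ, all subsets of size at most two with the empty set omitted since it only shifts \psi, is an arbitrary example):

```python
import itertools
import numpy as np

def hierarchical_model(V, Lambda, theta):
    """Probabilities q_theta(x_V) = exp( sum_{lam in Lambda} theta_lam * prod_{i in lam} x_i - psi )."""
    states = np.array(list(itertools.product([0, 1], repeat=V)))
    log_q = np.zeros(len(states))
    for lam, t in zip(Lambda, theta):
        monomial = states[:, list(lam)].prod(axis=1) if lam else np.ones(len(states), dtype=int)
        log_q += t * monomial
    log_q -= np.log(np.exp(log_q).sum())       # subtract psi(theta)
    return states, np.exp(log_q)

# Example: V = 3, Lambda = all nonempty subsets of size <= 2.
V = 3
Lambda = [lam for r in (1, 2) for lam in itertools.combinations(range(V), r)]
rng = np.random.default_rng(3)
states, q = hierarchical_model(V, Lambda, rng.normal(size=len(Lambda)))
print(q.sum())  # ~1.0
```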

  16. Proof III: Boltzmann Machine

      p_\theta(x_V) = \sum_{x_H} \exp\Big( \sum_i \theta_i x_i + \sum_{i \in V, j \in H} \theta_{ij} x_i x_j - \psi(\theta) \Big), \quad x_V \in \{0,1\}^V

      Free energy:

      p_\theta(x_V) \;\leftrightarrow\; -F(x_V) = \log \sum_{x_H} \exp\Big( \sum_i \theta_i x_i + \sum_{i \in V, j \in H} \theta_{ij} x_i x_j \Big)
                                               = \sum_{i \in V} \theta_i x_i + \sum_{j \in H} \log\Big( 1 + \exp\big( \theta_j + \sum_{i \in V} \theta_{ij} x_i \big) \Big)

      Natural parameters in the visible probability simplex:

      \vartheta_B(\theta) = \sum_{j \in H} \sum_{C \subseteq B} (-1)^{|B \setminus C|} \log\Big( 1 + \exp\big( \theta_j + \sum_{i \in C} \theta_{ij} \big) \Big), \quad B \in 2^V

      This is a sum of independent terms, one for each hidden unit.
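As a sanity check on these formulas, the following sketch (mine, not from the slides) computes the free energy of a tiny RBM both by summing over the hidden states and via the closed form, and then obtains the coordinate \vartheta_B by the alternating sum of free energies over subsets of B, comparing it against the closed form; the two agree for |B| ≥ 2 (for singletons B the visible bias \theta_i also contributes).

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
nV, nH = 3, 2
theta_v = rng.normal(size=nV)                  # visible biases theta_i, i in V
theta_h = rng.normal(size=nH)                  # hidden biases theta_j, j in H
W = rng.normal(size=(nV, nH))                  # interaction weights theta_ij

def neg_free_energy_sum(x):
    """-F(x_V) = log sum_{x_H} exp( sum_i theta_i x_i + sum_{ij} theta_ij x_i x_j )."""
    vals = []
    for h in itertools.product([0, 1], repeat=nH):
        h = np.array(h)
        vals.append(x @ theta_v + h @ theta_h + x @ W @ h)
    return np.log(np.sum(np.exp(vals)))

def neg_free_energy_closed(x):
    """Closed form: sum_{i in V} theta_i x_i + sum_{j in H} log(1 + exp(theta_j + sum_i theta_ij x_i))."""
    return x @ theta_v + np.sum(np.log1p(np.exp(theta_h + x @ W)))

x = np.array([1, 0, 1])
print(neg_free_energy_sum(x), neg_free_energy_closed(x))   # identical up to rounding

def vartheta(B):
    """Alternating sum over C subseteq B of (-1)^{|B minus C|} * (-F(1_C))."""
    total = 0.0
    for r in range(len(B) + 1):
        for C in itertools.combinations(B, r):
            xC = np.zeros(nV); xC[list(C)] = 1
            total += (-1) ** (len(B) - len(C)) * neg_free_energy_closed(xC)
    return total

def vartheta_closed(B):
    """Closed form from the slide: per hidden unit, the alternating log(1 + exp(...)) terms."""
    total = 0.0
    for j in range(nH):
        for r in range(len(B) + 1):
            for C in itertools.combinations(B, r):
                total += (-1) ** (len(B) - len(C)) * np.log1p(np.exp(theta_h[j] + W[list(C), j].sum()))
    return total

B = (0, 2)
print(vartheta(B), vartheta_closed(B))         # agree for |B| >= 2
```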
