SLIDE 1 Rao-Blackwellized Stochastic Gradients for Discrete Distributions
Runjing (Bryan) Liu
June 11, 2019
University of California, Berkeley
SLIDE 9 Objective
- We fit a discrete latent variable model.
- Fitting such a model involves finding
      argmin_η  E_{q_η(z)}[ f_η(z) ],
  where z is a discrete random variable with K categories.
- Two common approaches are:
  1. Analytically integrate out z.
     Problem: K might be large.
  2. Sample z ∼ q_η(z), and estimate the gradient with g(z).
     Problem: g(z) might have high variance.
- We propose a method that uses a combination of these two approaches to reduce the variance of any gradient estimator g(z).
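The two approaches can be contrasted in a small numeric sketch. The slides do not fix a particular estimator g, so this sketch uses the score-function (REINFORCE) estimator with a softmax parameterization of q_η; the variable names and the toy setup are illustrative assumptions, not the talk's model:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10
eta = rng.normal(size=K)              # logits parameterizing q_eta (assumed softmax form)
f = rng.normal(size=K)                # f(k); assumed not to depend on eta, for simplicity

def q(eta):
    """Softmax probabilities q_eta(z)."""
    e = np.exp(eta - eta.max())
    return e / e.sum()

probs = q(eta)

# Approach 1: integrate out z analytically -- O(K) work per gradient step.
exact_grad = probs * (f - probs @ f)  # d/d eta_i of sum_k q_eta(k) f(k), for softmax

# Approach 2: score-function (REINFORCE) estimator g(z) = f(z) * grad_eta log q_eta(z).
def g(z):
    score = -probs.copy()
    score[z] += 1.0                   # grad_eta log q_eta(z) for softmax logits
    return f[z] * score

# Monte Carlo average of g over samples z ~ q_eta.
mc_grad = np.mean([g(rng.choice(K, p=probs)) for _ in range(5000)], axis=0)
# mc_grad is unbiased for exact_grad, but individual g(z) draws can vary widely.
```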
SLIDE 13 Our method
Suppose g is an unbiased estimate of the gradient, so
      ∇_η L(η) = E_{q_η(z)}[ g(z) ] = Σ_{k=1}^{K} q_η(k) g(k).
Key observation: In many applications (e.g. variational Bayes), q_η(z) is concentrated on only a few categories.
Our idea: Analytically sum the categories where q_η(z) has high probability, and sample the remaining terms.
SLIDE 15 Our method
In math,
      Σ_{k=1}^{K} q_η(k) g(k) = Σ_{z ∈ C_α} q_η(z) g(z) + (1 − q_η(C_α)) · E_{q_η(z)}[ g(z) | z ∉ C_α ],
where C_α is the set of high-probability categories that we sum analytically.
The variance reduction is guaranteed by representing our estimator as an instance of Rao-Blackwellization.
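A minimal numeric sketch of this identity (the Dirichlet toy distribution, the choice of C_α as the α highest-probability categories, and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 1000
probs = rng.dirichlet(np.full(K, 0.05))  # q_eta: concentrated on a few categories
g = rng.normal(size=K)                   # stand-in values g(k) of a gradient estimator

exact = probs @ g                        # sum_{k=1}^K q_eta(k) g(k)

# Rao-Blackwellized estimator: sum the alpha highest-probability categories
# analytically, and sample the remainder from the conditional distribution.
alpha = 10
C = np.argsort(probs)[-alpha:]           # C_alpha: top-alpha categories
rest = np.setdiff1d(np.arange(K), C)
mass_C = probs[C].sum()                  # q_eta(C_alpha)

analytic = probs[C] @ g[C]               # sum over z in C_alpha of q(z) g(z)
cond = probs[rest] / probs[rest].sum()   # q_eta(z | z not in C_alpha)

def rb_estimate():
    z = rng.choice(rest, p=cond)         # one sample from the conditional
    return analytic + (1 - mass_C) * g[z]

est = np.mean([rb_estimate() for _ in range(2000)])
# est is unbiased for exact; its variance shrinks as mass_C approaches 1.
```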
SLIDE 17
Results: Generative semi-supervised classification
We train a classifier to predict the class label of MNIST digits, and learn a generative model for MNIST digits conditioned on the class label.
Our objective is to maximize the evidence lower bound (ELBO),
      log p_η(x) ≥ E_{q_η(z)}[ log p_η(x, z) − log q_η(z) ].
In this problem, the class label z has ten discrete categories.
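For discrete z, both sides of the bound can be computed exactly, which makes the inequality easy to verify. This is a toy sanity check with made-up numbers, not the model from the talk:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 10                                   # ten discrete categories, as in the class-label case
joint = rng.dirichlet(np.ones(K))        # stand-in values for p(x, z=k) at one fixed x
q = rng.dirichlet(np.ones(K))            # a variational distribution q(z)

log_px = np.log(joint.sum())             # log p(x) = log sum_z p(x, z)
elbo = q @ (np.log(joint) - np.log(q))   # E_q[ log p(x, z) - log q(z) ]
# elbo <= log_px, with equality iff q(z) equals the posterior p(z | x).
```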
SLIDE 18
Results: Generative semi-supervised classification
SLIDE 22
Results: moving MNIST
We train a generative model for non-centered MNIST digits. To do so, we must first learn the location of the MNIST digit. There are 68 × 68 discrete categories. Thus, computing the exact sum is intractable!
SLIDE 23
Results: moving MNIST
[Figures: trajectory of the negative ELBO; reconstructions of MNIST digits]
SLIDE 24
Our paper: Rao-Blackwellized Stochastic Gradients for Discrete Distributions
https://arxiv.org/abs/1810.04777
Our code: https://github.com/Runjing-Liu120/RaoBlackwellizedSGD
The collaboration: Bryan Liu, Jeffrey Regier, Nilesh Tripuraneni, Michael I. Jordan, Jon McAuliffe