Dropout as a Structured Shrinkage Prior

Eric Nalisnick, José Miguel Hernández-Lobato, Padhraic Smyth
University of Cambridge; University of California, Irvine
Dropout & Multiplicative Noise (2012)

[Figure: Standard Neural Network vs. After Applying Dropout; image from Srivastava et al., 2014]

Implementation as multiplicative noise: the hidden units h_{l-1} are multiplied by a diagonal matrix of random variables Λ_l before being passed through the weights W_l, so the layer computes

h_{l-1} Λ_l W_l,   with Λ_l = diag(λ_{1,1}, …, λ_{d_{l-1},d_{l-1}}),   λ_{i,i} ∼ p(λ).

Bernoulli noise corresponds to Dropout, but other noise distributions (Gaussian, Beta, uniform) have been shown to work as well.
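As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of one layer under multiplicative noise, with the noise distribution left swappable. The function name, the tanh activation, and the specific noise parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(h, W, noise="bernoulli", p=0.5):
    """One layer with multiplicative noise on its inputs: activation(h @ Lambda @ W).

    h : (d_in,) hidden units from the previous layer
    W : (d_in, d_out) weight matrix
    """
    d_in = h.shape[0]
    if noise == "bernoulli":       # classic dropout, rescaled so E[lambda] = 1
        lam = rng.binomial(1, 1 - p, size=d_in) / (1 - p)
    elif noise == "gaussian":      # Gaussian multiplicative noise around 1 (one common variance choice)
        lam = rng.normal(1.0, np.sqrt(p / (1 - p)), size=d_in)
    elif noise == "uniform":       # uniform noise with mean 1
        lam = rng.uniform(0.0, 2.0, size=d_in)
    else:
        raise ValueError(f"unknown noise: {noise}")
    Lam = np.diag(lam)             # diagonal matrix of random variables
    return np.tanh(h @ Lam @ W)

# Example: 4 hidden units feeding 3 units
h = np.ones(4)
W = np.full((4, 3), 0.1)
print(noisy_layer(h, W, noise="bernoulli"))
```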
Gaussian Scale Mixtures

Definition of a Gaussian Scale Mixture: a random variable θ is a Gaussian scale mixture iff it can be expressed as the product of a Gaussian random variable and an independent scalar random variable [Beale & Mallows, 1959]:

θ = λ z,   z ∼ N(0, σ_0²),   λ ∼ p(λ).

This expanded parametrization can be rewritten in an equivalent hierarchical form:

λ ∼ p(λ),   θ | λ ∼ N(0, λ² σ_0²).

Let's assume a Gaussian prior on the NN weights, w_{i,j} ∼ N(0, σ_0²), and treat the multiplicative noise λ_{i,i} ∼ p(λ) as the scale variable. Switching to the hierarchical parametrization, each noise–weight product is itself a Gaussian scale mixture:

w̃_{i,j} | λ_{i,i} ∼ N(0, λ_{i,i}² σ_0²),   λ_{i,i} ∼ p(λ).

The noise distribution becomes a scale prior, so we can translate noise distributions into the marginal prior they induce on the weights.
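To make the reparametrization concrete, here is a small Monte Carlo sketch (an illustrative addition, not from the slides) checking that the expanded and hierarchical parametrizations yield the same marginal on θ; the uniform noise distribution is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma0, n = 1.0, 200_000

# Expanded parametrization: theta = lambda * z with z ~ N(0, sigma0^2), lambda ~ p(lambda)
lam = rng.uniform(0.0, 2.0, size=n)       # any scalar noise distribution; uniform chosen arbitrarily
z = rng.normal(0.0, sigma0, size=n)
theta_expanded = lam * z

# Hierarchical parametrization: lambda ~ p(lambda), theta | lambda ~ N(0, lambda^2 * sigma0^2)
lam2 = rng.uniform(0.0, 2.0, size=n)
theta_hier = rng.normal(0.0, lam2 * sigma0)

# Both parametrizations define the same marginal distribution on theta
print(np.quantile(theta_expanded, [0.05, 0.5, 0.95]))
print(np.quantile(theta_hier, [0.05, 0.5, 0.95]))
```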
Sampling noise for each hidden unit induces a particular structure. For a layer with d_{l-1} hidden units feeding d_l units, the noise matrix Λ_l is (d_{l-1} × d_{l-1}) and the weight matrix W_l is (d_{l-1} × d_l), so the effective weights are

W̃_l = Λ_l W_l,   i.e.   w̃_{i,j} = λ_{i,i} w_{i,j},

where i indexes the rows of W_l: every weight in row i shares the same scale λ_{i,i}.
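A quick sketch of this induced structure (assumed example code, not from the slides): per-unit noise scales whole rows of the weight matrix, whereas per-weight noise (DropConnect-style, discussed below) shares no scales.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3
W = rng.normal(0.0, 1.0, size=(d_in, d_out))   # draw from the Gaussian prior on the weights

# Per-hidden-unit noise: one scale per *row* of W (dropout-style)
lam_unit = rng.uniform(0.0, 2.0, size=d_in)
W_unit = np.diag(lam_unit) @ W                 # w~_{i,j} = lambda_{i,i} * w_{i,j}

# Per-weight noise: one scale per *entry* of W (DropConnect-style)
lam_weight = rng.uniform(0.0, 2.0, size=(d_in, d_out))
W_weight = lam_weight * W

# Ratios against the original weights: constant within each row under per-unit noise,
# unstructured under per-weight noise
print(W_unit / W)
print(W_weight / W)
```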
Dropout has been shown to have a Bayesian interpretation [Gal & Ghahramani, 2016], but open questions remain. Viewing multiplicative noise as a Gaussian scale mixture decouples dropout's Bayesian interpretation from variational inference, allowing for any inference strategy, and translates noise distributions into the marginal priors they induce, better revealing their modeling assumptions.
Automatic Relevance Determination & Automatic Depth Determination

This scale structure is the same as that of automatic relevance determination (ARD) [MacKay, 1994]. The intuition is that all outgoing weights from a unit grow or shrink together, in a form of group regularization. DropConnect, which samples noise for each weight, does not have this structure. Residual networks (ResNets) allow scale sharing to be extended to whole layers, since information can still propagate via the skip connection. We term this natural analog of ARD automatic depth determination (ADD). A similar scale mixture analysis reveals connections to stochastic depth regularization [Huang et al., 2016].
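As an illustration of layer-wise scale sharing (a sketch under assumed notation, not code from the paper), a single random scale can multiply a residual block's entire non-skip branch; the skip connection keeps information flowing even when that scale is near zero.

```python
import numpy as np

rng = np.random.default_rng(2)

def residual_block(h, W, lam):
    """Residual block whose whole non-skip branch shares a single random scale lam."""
    return h + lam * np.tanh(h @ W)   # the skip path keeps information flowing even when lam = 0

d, n_blocks = 5, 3
h = rng.normal(size=d)
for _ in range(n_blocks):
    W = rng.normal(0.0, 0.3, size=(d, d))
    lam = rng.binomial(1, 0.8)        # Bernoulli noise on the entire layer (stochastic-depth-like)
    h = residual_block(h, W, lam)
print(h)
```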
UCI Regression Data Sets

The figure (right) shows heat maps of the hidden-to-hidden weight matrices. ARD induces row-structured shrinkage, ADD induces matrix-wide shrinkage, and ARD-ADD allows some rows to grow while preserving global shrinkage; it seems to balance having some row structure with strong global shrinkage.
References

Beale, E. M. L., and C. L. Mallows. "Scale Mixing of Symmetric Distributions with Zero Means." The Annals of Mathematical Statistics, 1959.
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian Approximation." ICML 2016.
Huang, Gao, et al. "Deep Networks with Stochastic Depth." ECCV 2016.
MacKay, David J. C. "Bayesian Nonlinear Modeling for the Prediction Competition." ASHRAE Transactions, 1994.
Srivastava, Nitish, et al. "Dropout." JMLR 2014.