

SLIDE 1

Dropout as a Structured Shrinkage Prior

Eric Nalisnick, José Miguel Hernández-Lobato, Padhraic Smyth

University of Cambridge · University of California, Irvine

SLIDES 2–5

Dropout & Multiplicative Noise (2012)

[Figure: a standard neural network vs. the same network after applying dropout.]

Implementation as multiplicative noise:

f_l(h_{n,l−1} Λ_l W_l)

where h_{n,l−1} are the hidden units, W_l are the weights, and Λ_l is a diagonal matrix of random variables with λ_{i,i} ∼ p(λ).

  • Dropout corresponds to p(λ) being Bernoulli.
  • Gaussian, beta, and uniform noise have been shown to work as well.
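To make the multiplicative-noise view concrete, here is a minimal NumPy sketch of one noisy layer f_l(h Λ W). This is not the authors' code; noisy_layer and the specific Gaussian/uniform noise parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(h, W, noise="bernoulli", keep_prob=0.5, f=np.tanh):
    """One layer f(h @ diag(lam) @ W): scale each hidden unit by a random
    lambda before the weight multiply (h * lam == h times a diagonal matrix)."""
    d = h.shape[-1]
    if noise == "bernoulli":
        # classic dropout; dividing by keep_prob is the common 'inverted' scaling
        lam = rng.binomial(1, keep_prob, size=d) / keep_prob
    elif noise == "gaussian":
        lam = rng.normal(1.0, 0.5, size=d)   # illustrative mean/std
    elif noise == "uniform":
        lam = rng.uniform(0.0, 2.0, size=d)  # illustrative range
    else:
        raise ValueError(noise)
    return f((h * lam) @ W)

h = rng.normal(size=(4, 8))             # a batch of hidden activations
W = rng.normal(scale=0.1, size=(8, 3))  # layer weights
out = noisy_layer(h, W, noise="bernoulli")
```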
SLIDES 6–13

Dropout as a Gaussian Scale Mixture

Definition of a Gaussian scale mixture: a random variable θ is a Gaussian scale mixture iff it can be expressed as the product of a Gaussian random variable and an independent scalar random variable [Beale & Mallows, 1959]:

θ = λz,  z ∼ N(0, σ_0^2),  λ ∼ p(λ)

This can be reparametrized into a hierarchical form:

θ | λ ∼ N(0, λ^2 σ_0^2),  λ ∼ p(λ)

Let's assume a Gaussian prior on the NN weights. The noisy layer is

f_l(h_{n,l−1} Λ_l W_l),  w_{i,j} ∼ N(0, σ_0^2),  λ_{i,i} ∼ p(λ).

Switching to the hierarchical parametrization absorbs the noise into the prior:

f_l(h_{n,l−1} W_l),  w_{i,j} ∼ N(0, λ_{i,i}^2 σ_0^2)

The noise distribution becomes a scale prior. We can therefore translate noise distributions into the marginal prior they induce on the NN weights.
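The expanded/hierarchical equivalence is easy to check by simulation. Below is a small sketch, assuming NumPy and a uniform p(λ) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma0 = 500_000, 1.0

def sample_lambda(size):
    # any scalar noise distribution p(lambda) works; uniform is just an example
    return rng.uniform(0.0, 2.0, size=size)

# Expanded parametrization: multiply a N(0, sigma0^2) weight by independent noise.
w_expanded = sample_lambda(n) * rng.normal(0.0, sigma0, size=n)

# Hierarchical parametrization: draw lambda, then w | lambda ~ N(0, lambda^2 sigma0^2).
w_hierarchical = rng.normal(0.0, sample_lambda(n) * sigma0)

# Both are the same Gaussian scale mixture: Var(w) = E[lambda^2] * sigma0^2 = 4/3 here.
print(w_expanded.var(), w_hierarchical.var())
```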
SLIDES 14–17

Dropout's Scale Structure

Sampling noise for each hidden unit induces a particular structure…

Expanded form: f_l(h_{n,l−1} Λ_l W_l),  w_{i,j} ∼ N(0, σ_0^2)

[Figure: the hidden units h (d_{l−1} of them) are scaled by the d_{l−1} × d_{l−1} diagonal noise matrix Λ before multiplying the d_{l−1} × d_l weight matrix W.]

Hierarchical form: f_l(h_{n,l−1} W_l),  w_{i,j} ∼ N(0, λ_{i,i}^2 σ_0^2)

Here i indexes the rows of W, so all weights in row i share the same scale λ_{i,i}.

This is the same structure as the automatic relevance determination (ARD) prior proposed by D. MacKay and R. Neal for Bayesian NNs (1994).
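To see the row-wise sharing, here is a sketch of sampling a weight matrix under the induced ARD structure (NumPy; the dimensions and uniform p(λ) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, sigma0 = 5, 10_000, 1.0

# ARD structure: one scale per input unit, shared across that unit's whole row.
lam = rng.uniform(0.0, 2.0, size=(d_in, 1))            # lambda_{i,i}, one per row
W = lam * rng.normal(0.0, sigma0, size=(d_in, d_out))  # w_ij ~ N(0, lambda_i^2 sigma0^2)

# Each row's empirical scale tracks its lambda_i: whole rows grow/shrink together.
print("lambda per row:", lam.ravel().round(2))
print("row RMS of W:  ", np.sqrt((W**2).mean(axis=1)).round(2))
```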
SLIDES 18–20

Summary

  • Under mild assumptions, multiplicative noise is equivalent to a Gaussian scale mixture prior with ARD structure.
  • This decouples dropout's Bayesian interpretation from variational inference, allowing for any inference strategy.
  • It provides a 'recipe' for translating noise distributions into priors, better revealing their modeling assumptions.

SLIDE 21

For more details, please visit our poster (#84)

DROPOUT AS A STRUCTURED SHRINKAGE PRIOR
Eric Nalisnick, José Miguel Hernández-Lobato, Padhraic Smyth

1. INTRODUCTION
Dropout has been shown to have a Bayesian interpretation [Gal & Ghahramani, 2016], but open questions remain:
  • Why is the noise drawn from a (fixed) Bernoulli distribution?
  • Why does dropping hidden units work best?
  • Is there a principled extension to ResNets?

2. BACKGROUND: MULTIPLICATIVE NOISE IN NNs (DROPOUT)
Multiplicative noise regularization is implemented as f_l(h_{n,l−1} Λ_l W_l), where Λ_l is a diagonal matrix of random variables. Bernoulli noise corresponds to dropout, but other noise distributions (Gaussian, beta, uniform) have been shown to work as well. [Image: standard neural net vs. applying dropout, from Srivastava et al., 2014.]

3. MULTIPLICATIVE NOISE AS A GAUSSIAN SCALE MIXTURE
A random variable is a Gaussian scale mixture (GSM) iff it can be expressed as the product of a Gaussian random variable and an independent scalar random variable [Beale & Mallows, 1959]. Assuming a Gaussian prior on a neural network's weights, we observe that switching from the expanded to the hierarchical parametrization turns multiplicative noise into a GSM prior. This insight allows us to translate noise distributions into their induced marginal prior on the weights.

4. INDUCED STRUCTURE
Sampling noise for each hidden unit endows the prior with structure: the scale structure is the same as that of automatic relevance determination (ARD) [MacKay, 1994]. The intuition is that all outgoing weights from a unit grow or shrink together in a form of group regularization. DropConnect, which samples noise for each weight, does not have this structure.

5. EXTENSION TO RESNETS
Residual networks (ResNets) allow scale sharing to be extended to whole layers, since information can still propagate via the skip connection. We term this natural analog of ARD automatic depth determination (ADD). A similar scale-mixture analysis reveals connections to stochastic depth regularization [Huang et al., 2016].
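A minimal sketch of the ADD idea, assuming one scalar scale shared by an entire residual branch; residual_block and the uniform λ are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def residual_block(h, W, lam_l, f=np.tanh):
    """Residual branch scaled by a single layer-wide lambda_l (ADD, the layer
    analog of ARD). If lambda_l shrinks toward 0 the whole branch is pruned,
    yet information still flows through the skip connection; a Bernoulli
    lambda_l recovers stochastic-depth-style regularization."""
    return h + lam_l * f(h @ W)

h = rng.normal(size=(4, 8))
W = rng.normal(scale=0.1, size=(8, 8))
lam_l = rng.uniform(0.0, 1.0)   # one scalar scale for the entire layer
out = residual_block(h, W, lam_l)
```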

6. EXPERIMENTS: UCI REGRESSION DATA SETS
Figure (right): heat maps of the hidden-to-hidden weight matrices. ARD induces row-structured shrinkage, ADD induces matrix-wide shrinkage, and ARD-ADD allows some rows to grow while preserving global shrinkage. MC dropout's heat map seems to balance some row structure with strong global shrinkage.

REFERENCES
  • Beale, E. M. L., and C. L. Mallows. "Scale Mixing of Symmetric Distributions with Zero Means." The Annals of Mathematical Statistics, 1959.
  • Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian Approximation." ICML 2016.
  • Huang, Gao, et al. "Deep Networks with Stochastic Depth." ECCV 2016.
  • MacKay, David J. C. "Bayesian Nonlinear Modeling for the Prediction Competition." ASHRAE Transactions, 1994.
  • Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." JMLR 2014.