Symbolic Differentiation for Rapid Model Prototyping in Machine Learning and Data Analysis — a Hands-on Tutorial
Yarin Gal
yg279@cam.ac.uk
November 13th, 2014
A TALK IN TWO ACTS, based on the online tutorial
◮ numpy.exp() – theano.tensor.exp()
◮ numpy.sum() – theano.tensor.sum() (a minimal sketch of the correspondence follows)
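Theano's tensor module mirrors the numpy API almost one-to-one, and expressions built with it can be differentiated symbolically. A minimal sketch of this correspondence (illustrative, not taken from the slides):

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic input: a vector of doubles with no fixed length
x = T.dvector('x')

# Build the expression with the numpy-like API
y = T.sum(T.exp(x))

# Symbolic differentiation: Theano derives dy/dx for us
gy = T.grad(y, x)

# Compile both expressions into callable functions
f = theano.function([x], y)
g = theano.function([x], gy)

print(f(np.array([0.0, 1.0])))  # e^0 + e^1 = 3.718...
print(g(np.array([0.0, 1.0])))  # [e^0, e^1] = [1.0, 2.718...]
```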
◮ We need to derive appropriate inference for each new model
◮ Implementation is often cumbersome and changes regularly
◮ "Quick fabrication of scale models of a physical part"
◮ Probabilistic programming can be used for rapid prototyping in machine learning
◮ With this we can take advantage of effective symbolic differentiation
◮ Models are often mathematically too cumbersome otherwise
◮ We approximate the posterior of the latent variables with an approximating variational distribution (the standard bound this optimises is given below)
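For reference, this is the standard variational lower bound (standard material, not reconstructed from the slides) that fitting an approximating distribution q(z) maximises:

$$\log p(\mathbf{x}) \;\ge\; \mathbb{E}_{q(\mathbf{z})}\big[\log p(\mathbf{x}, \mathbf{z})\big] \;-\; \mathbb{E}_{q(\mathbf{z})}\big[\log q(\mathbf{z})\big] \;=\; \mathcal{L}(q)$$

The expectations are rarely tractable in closed form, which is where the stochastic approximations on the next slide come in.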
◮ Often used to speed up inference using mini-batches
◮ But can also be used to approximate integrals through Monte Carlo:
$$\int f(x)\,p(x)\,\mathrm{d}x \;\approx\; \frac{1}{K} \sum_{k=1}^{K} f(x_k), \qquad x_k \sim p(x)$$
◮ Optimising these objectives relies on non-deterministic gradients (see the Theano sketch below)
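A minimal Theano sketch of such a Monte Carlo objective (illustrative assumptions: the variational parameters mu and log_sigma, and the integrand f(z) = z²; this is not code from the talk). Samples are redrawn on every call, so the returned gradients are non-deterministic:

```python
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=0)

# Hypothetical variational parameters of q(z) = N(mu, sigma^2)
mu = theano.shared(0.0, name='mu')
log_sigma = theano.shared(0.0, name='log_sigma')

K = 10  # number of Monte Carlo samples

# Reparameterise z = mu + sigma * eps with eps ~ N(0, 1), so
# gradients can flow through the samples to mu and log_sigma
eps = srng.normal((K,))
z = mu + T.exp(log_sigma) * eps

# Monte Carlo estimate of E_q[f(z)] with the illustrative f(z) = z^2
objective = T.mean(z ** 2)

# Symbolic gradients of the stochastic objective
g_mu, g_log_sigma = T.grad(objective, [mu, log_sigma])

step = theano.function([], [objective, g_mu, g_log_sigma])
print(step())  # a different estimate (and gradient) on every call
```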
◮ Use learning-rate free optimisation (again, from deep learning)
◮ AdaGrad [Duchi et al. 2011], AdaDelta [Zeiler 2012]
◮ RMSPROP [Tieleman and Hinton 2012, Lecture 6.5, COURSERA: Neural Networks for Machine Learning]
◮ These have been compared to each other and to other optimisers (a sketch of the RMSPROP update follows)
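To give a feel for these methods, here is the RMSPROP update from Tieleman and Hinton's lecture in plain numpy (the hyperparameter names are mine; note it still carries a step-size parameter, but scales each step by a running RMS of past gradients):

```python
import numpy as np

def rmsprop_update(param, grad, ms, decay=0.9, lr=0.01, eps=1e-8):
    # Running mean of squared gradients
    ms = decay * ms + (1.0 - decay) * grad ** 2
    # Scale the step by the root-mean-square of recent gradients
    param = param - lr * grad / (np.sqrt(ms) + eps)
    return param, ms

# Usage: minimise f(x) = x^2 starting from x = 5
x, ms = 5.0, 0.0
for _ in range(1000):
    x, ms = rmsprop_update(x, 2.0 * x, ms)  # gradient of x^2 is 2x
print(x)  # oscillates close to 0
```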
◮ Deriving variational inference by hand:
◮ Researching an appropriate bound in the statistics literature
◮ Derivations for the model
◮ Implementation (hundreds of lines of Python code)
◮ With symbolic differentiation:
◮ Derivations took a day
◮ Programming took a day (15 lines of Python; a hypothetical sketch of that scale follows)
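To give a feel for that scale, a complete model with symbolically derived gradients really does fit in about fifteen lines. The sketch below is a hypothetical stand-in (logistic regression, not the model from the talk):

```python
import numpy as np
import theano
import theano.tensor as T

X, y = T.dmatrix('X'), T.dvector('y')
w = theano.shared(np.zeros(5), name='w')   # assumes 5 input features
b = theano.shared(0.0, name='b')

p = T.nnet.sigmoid(T.dot(X, w) + b)                    # the model
nll = -T.mean(y * T.log(p) + (1 - y) * T.log(1 - p))   # the objective
gw, gb = T.grad(nll, [w, b])                           # gradients, derived symbolically

# One compiled function performs a full gradient step on a batch
step = theano.function([X, y], nll,
                       updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])
```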
◮ Careless implementation can take long to run
◮ But careful implementation (together with mini-batches) can run much faster
◮ To reduce the variance of the stochastic gradients:
◮ Either use more samples (slower to run; a quick numerical check follows below),
◮ Or use variance reduction techniques [Wang, Chen, Smola, and Xing 2013]
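A quick numerical check of the samples-versus-speed trade-off (not from the slides): the variance of a K-sample Monte Carlo gradient estimate falls like 1/K, so halving the noise costs four times the computation. The noisy gradient here is a made-up stand-in with true value 1.0:

```python
import numpy as np

np.random.seed(0)

def mc_gradient(K):
    # Hypothetical noisy gradient: true value 1.0 plus unit-variance noise
    return np.mean(1.0 + np.random.randn(K))

for K in [1, 10, 100]:
    estimates = [mc_gradient(K) for _ in range(10000)]
    print(K, np.var(estimates))  # roughly 1, 0.1, 0.01: variance ~ 1/K
```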