Edward: Deep Probabilistic Programming – Extended Seminar on Systems and Machine Learning


slide-1
SLIDE 1

Edward: Deep Probabilistic Programming

Extended Seminar – Systems and Machine Learning Steven Lang 13.02.2020

1

slide-2
SLIDE 2

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

2

slide-3
SLIDE 3

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Introduction 3

slide-4
SLIDE 4

Motivation

◮ Nature of deep neural networks is compositional
◮ Connect layers in creative ways
◮ No worries about
  – testing (forward propagation)
  – inference (gradient-based opt., with backprop. and auto-diff.)
◮ Leads to easy development of new successful architectures

Introduction 4

slide-5
SLIDE 5

Motivation

Example CNN architectures (figures omitted):
◮ LeNet-5 (Lecun et al. 1998)
◮ VGG16 (Simonyan and Zisserman 2014)
◮ ResNet-50 (He et al. 2015)
◮ Inception-v4 (Szegedy et al. 2014)

Introduction 5

slide-6
SLIDE 6

Motivation

Goal: Achieve the composability of deep learning for

  • 1. Probabilistic models
  • 2. Probabilistic inference

Introduction 6

slide-7
SLIDE 7

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Refresher on Probabilistic Modeling 7

slide-8
SLIDE 8

What is a Random Variable (RV)?

◮ Random number determined by chance, e.g. the outcome of a single die roll
◮ Drawn according to a probability distribution
◮ Typical random variables in statistical machine learning:
  – input data
  – output data
  – noise
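As a tiny illustration (my own, not from the slides), sampling such an RV, here a fair die roll, might look like this in Python:

# Illustrative only: simulate one roll of a fair six-sided die.
import numpy as np

rng = np.random.default_rng(seed=0)
roll = rng.integers(low=1, high=7)  # uniform over {1, ..., 6}
print(roll)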

Refresher on Probabilistic Modeling 8

slide-9
SLIDE 9

What is a Probability Distribution?

◮ Discrete: describes the probability that the RV takes a certain value
◮ Continuous: describes the probability density at a certain value of the RV

[Figure: densities p(X) of three normal distributions with (µ = 0, σ² = 1), (µ = 2, σ² = 2), (µ = 4, σ² = 3)]

Refresher on Probabilistic Modeling 9

slide-10
SLIDE 10

What is a Probability Distribution?

◮ Discrete: describes the probability that the RV takes a certain value
◮ Continuous: describes the probability density at a certain value of the RV

Example: Normal distribution

N(x | µ, σ²) = 1/√(2πσ²) · exp(−(1/2) · ((x − µ)/σ)²)

[Figure: densities p(X) of three normal distributions with (µ = 0, σ² = 1), (µ = 2, σ² = 2), (µ = 4, σ² = 3)]
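A small sketch (my own illustration, directly evaluating the density formula above for the three parameter settings shown in the figure):

# Evaluate the normal density N(x | mu, sigma^2) from the formula above.
import numpy as np

def normal_pdf(x, mu, sigma2):
    return 1.0 / np.sqrt(2 * np.pi * sigma2) * np.exp(-0.5 * (x - mu) ** 2 / sigma2)

xs = np.linspace(-2.0, 10.0, 200)
for mu, sigma2 in [(0.0, 1.0), (2.0, 2.0), (4.0, 3.0)]:  # the three curves
    print(mu, sigma2, normal_pdf(xs, mu, sigma2).max())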

Refresher on Probabilistic Modeling 9

slide-11
SLIDE 11

Common Probability Distributions

Discrete
◮ Bernoulli
◮ Binomial
◮ Hypergeometric
◮ Poisson
◮ Boltzmann

Refresher on Probabilistic Modeling 10

slide-12
SLIDE 12

Common Probability Distributions

Discrete
◮ Bernoulli
◮ Binomial
◮ Hypergeometric
◮ Poisson
◮ Boltzmann

Continuous
◮ Uniform
◮ Beta
◮ Normal
◮ Laplace
◮ Student-t

Refresher on Probabilistic Modeling 10

slide-13
SLIDE 13

What is Inference?

◮ Answer the query P(Q | E)

  – Q: query, the set of RVs we are interested in
  – E: evidence, the set of RVs whose state we know

Refresher on Probabilistic Modeling 11

slide-14
SLIDE 14

What is Inference?

◮ Answer the query P(Q | E)

  – Q: query, the set of RVs we are interested in
  – E: evidence, the set of RVs whose state we know

◮ Example: What is the probability that
  – it has rained (Q)
  – given that we know the grass is wet (E)?

P(Has Rained = true | Grass = wet)
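As a worked sketch of such a query via Bayes' rule, P(Q | E) = P(E | Q) P(Q) / P(E); the numbers below are illustrative assumptions, not from the slides:

# Illustrative numbers only: prior P(rain) and likelihoods P(wet | rain), P(wet | no rain).
p_rain = 0.3
p_wet_given_rain = 0.9
p_wet_given_no_rain = 0.2

p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet  # Bayes' rule
print(p_rain_given_wet)  # ~0.66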

Refresher on Probabilistic Modeling 11

slide-15
SLIDE 15

Probabilistic Models

Examples (figures omitted):
◮ Bayesian Networks
◮ Markov Networks
◮ Variational Autoencoder
◮ Deep Belief Networks

Refresher on Probabilistic Modeling 12

slide-16
SLIDE 16

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Deep Probabilistic Programming 13

slide-17
SLIDE 17

Key Ideas

Probabilistic programming lets users
◮ specify probabilistic models as programs
◮ compile those models down into inference procedures

Deep Probabilistic Programming 14

slide-18
SLIDE 18

Key Ideas

Probabilistic programming lets users
◮ specify probabilistic models as programs
◮ compile those models down into inference procedures

Two compositional representations as first-class citizens
◮ Random variables
◮ Inference

Deep Probabilistic Programming 14

slide-19
SLIDE 19

Key Ideas

Probabilistic programming lets users
◮ specify probabilistic models as programs
◮ compile those models down into inference procedures

Two compositional representations as first-class citizens
◮ Random variables
◮ Inference

Goal

Make probabilistic programming as flexible and efficient as deep learning!

Deep Probabilistic Programming 14

slide-20
SLIDE 20

Typical PPL Trade-offs

Probabilistic programming languages typically have the following trade-off:

Deep Probabilistic Programming 15

slide-21
SLIDE 21

Typical PPL Trade-offs

Probabilistic programming languages typically have the following trade-off:
◮ Expressiveness
  – allows a rich class of models beyond graphical models
  – scales poorly w.r.t. data and model size

Deep Probabilistic Programming 15

slide-22
SLIDE 22

Typical PPL Trade-offs

Probabilistic programming languages typically have the following trade-off:
◮ Expressiveness
  – allows a rich class of models beyond graphical models
  – scales poorly w.r.t. data and model size

◮ Efficiency
  – the PPL is restricted to a specific class of models
  – inference algorithms are optimized for this specific class

Deep Probabilistic Programming 15

slide-23
SLIDE 23

Edward

Edward (Tran et al. 2017) builds on two compositional representations
◮ Random variables
◮ Inference

Deep Probabilistic Programming 16

slide-24
SLIDE 24

Edward

Edward (Tran et al. 2017) builds on two compositional representations
◮ Random variables
◮ Inference

Edward allows fitting the same model using a variety of composable inference methods
◮ Point estimation
◮ Variational inference
◮ Markov chain Monte Carlo

Deep Probabilistic Programming 16

slide-25
SLIDE 25

Edward

Key concept: no distinct model or inference block
◮ Model: a composition/collection of random variables
◮ Inference: a way of modifying parameters in that collection subject to another

Deep Probabilistic Programming 17

slide-26
SLIDE 26

Edward

Edward gets computational benefits from TensorFlow "for free":
◮ distributed training
◮ parallelism
◮ vectorization
◮ GPU support

Deep Probabilistic Programming 18

slide-27
SLIDE 27

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Compositional Representations in Edward 19

slide-28
SLIDE 28

Criteria for Probabilistic Models

Edward poses the following criteria on compositional representations for probabilistic models:

  • 1. Integration with computational graphs

  – nodes represent operations on data
  – edges represent data communicated between nodes

Compositional Representations in Edward 20

slide-29
SLIDE 29

Criteria for Probabilistic Models

Edward poses the following criteria on compositional representations for probabilistic models:

  • 1. Integration with computational graphs

  – nodes represent operations on data
  – edges represent data communicated between nodes

  • 2. Invariance of the representation under the graph

– graph can be reused during inference

Compositional Representations in Edward 20

slide-30
SLIDE 30

Graph Example

Computational Graph

[Figure: computational graph with variables x, y, constants z and 2, and operations +, ·, pow]

Evaluation

Compositional Representations in Edward 21

slide-31
SLIDE 31

Graph Example

Computational Graph

[Figure: computational graph with variables x, y, constants z and 2, and operations +, ·, pow]

Evaluation

  • 1. x + y

Compositional Representations in Edward 21

slide-32
SLIDE 32

Graph Example

Computational Graph

[Figure: computational graph with variables x, y, constants z and 2, and operations +, ·, pow]

Evaluation

  • 1. x + y
  • 2. (x + y) · y · z

Compositional Representations in Edward 21

slide-33
SLIDE 33

Graph Example

Computational Graph

[Figure: computational graph with variables x, y, constants z and 2, and operations +, ·, pow]

Evaluation

  • 1. x + y
  • 2. (x + y) · y · z
  • 3. 2^((x + y) · y · z)
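A minimal sketch of building and evaluating such a graph in TensorFlow 1.x graph mode (the backend Edward uses); the exact wiring and the value of the constant z are reconstructed from the evaluation steps above, so treat them as assumptions:

import tensorflow as tf  # TF 1.x graph-mode API

x = tf.placeholder(tf.float32)  # variable
y = tf.placeholder(tf.float32)  # variable
z = tf.constant(3.0)            # constant

s = x + y                       # step 1: x + y
p = s * y * z                   # step 2: (x + y) * y * z
out = tf.pow(2.0, p)            # step 3: 2^((x + y) * y * z)

with tf.Session() as sess:
    print(sess.run(out, feed_dict={x: 1.0, y: 2.0}))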

Compositional Representations in Edward 21

slide-34
SLIDE 34

Example: Beta-Bernoulli Program

Beta-Bernoulli Model

p(x, θ) = Beta(θ | 1, 1) · ∏_{n=1}^{50} Bernoulli(x_n | θ)

Compositional Representations in Edward 22

slide-35
SLIDE 35

Example: Beta-Bernoulli Program

Beta-Bernoulli Model

p(x, θ) = Beta(θ | 1, 1) · ∏_{n=1}^{50} Bernoulli(x_n | θ)

Computation Graph

[Figure: graph with node θ and its sample θ*, and node x, fed by tf.ones(50) and θ*, with its sample x*]

Compositional Representations in Edward 22

slide-36
SLIDE 36

Example: Beta-Bernoulli Program

Beta-Bernoulli Model

p(x, θ) = Beta(θ | 1, 1) · ∏_{n=1}^{50} Bernoulli(x_n | θ)

Computation Graph

[Figure: graph with node θ and its sample θ*, and node x, fed by tf.ones(50) and θ*, with its sample x*]

Edward code

theta = Beta(a=1.0, b=1.0)            # Sample from Beta dist.
x = Bernoulli(p=tf.ones(50) * theta)  # Sample from Bernoulli dist.

Compositional Representations in Edward 22
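A hedged, self-contained completion of the snippet above (the imports and the session call are my assumptions based on the Edward 1.x / TensorFlow 1.x API; the keyword names a, b and p follow the early Edward releases used on the slide):

import tensorflow as tf
from edward.models import Beta, Bernoulli

theta = Beta(a=1.0, b=1.0)            # prior over the coin bias
x = Bernoulli(p=tf.ones(50) * theta)  # 50 coin flips sharing that bias

with tf.Session() as sess:
    theta_draw, x_draw = sess.run([theta, x])  # one joint draw of (theta, x)
    print(theta_draw, x_draw)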

slide-37
SLIDE 37

Criteria for Probabilistic Inference

Edward poses the following criteria on compositional representations for probabilistic inference:

  • 1. Support for many classes of inference

Compositional Representations in Edward 23

slide-38
SLIDE 38

Criteria for Probabilistic Inference

Edward poses the following criteria on compositional representations for probabilistic inference:

  • 1. Support for many classes of inference
  • 2. Invariance of inference under the computational graph

– posterior can be further composed as part of another model

Compositional Representations in Edward 23

slide-39
SLIDE 39

Inference in Edward

Goal: calculate the posterior p(z, β | x_train; θ), given
◮ data x_train
◮ model parameters θ
◮ local variables z
◮ global variables β

Compositional Representations in Edward 24

slide-40
SLIDE 40

Inference as Stochastic Graph Optimization

Edward formalizes this as the optimization problem

min_{λ,θ} L(p(z, β | x_train; θ), q(z, β; λ))

where
◮ L is a loss function w.r.t. p and q
◮ q(z, β; λ) is an approximation of the posterior p(z, β | x_train; θ)

Note

Choice of approximation q, loss L and rules to update parameters {θ, λ} are specified by an inference algorithm.

Compositional Representations in Edward 25

slide-41
SLIDE 41

Inference in Edward

◮ ed.Inference defines and solves min_{λ,θ} L(p(z, β | x_train; θ), q(z, β; λ))

# Construct inference object
inference = ed.Inference(latent_vars={beta: qbeta, z: qz},
                         data={x: x_train})

  – Posterior variables: qbeta, qz; observed random variables: x_train

Compositional Representations in Edward 26

slide-42
SLIDE 42

Inference in Edward

◮ ed.Inference defines and solves min_{λ,θ} L(p(z, β | x_train; θ), q(z, β; λ))

# Construct inference object
inference = ed.Inference(latent_vars={beta: qbeta, z: qz},
                         data={x: x_train})

  – Posterior variables: qbeta, qz; observed random variables: x_train

◮ Build a computational graph to update parameters

inference.initialize()

Compositional Representations in Edward 26

slide-43
SLIDE 43

Inference in Edward

◮ ed.Inference defines and solves min_{λ,θ} L(p(z, β | x_train; θ), q(z, β; λ))

# Construct inference object
inference = ed.Inference(latent_vars={beta: qbeta, z: qz},
                         data={x: x_train})

  – Posterior variables: qbeta, qz; observed random variables: x_train

◮ Build a computational graph to update parameters

inference.initialize()

◮ Run computations to update parameters

while not_converged:
    inference.update()

Compositional Representations in Edward 26
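ed.Inference itself is the abstract interface; below is a hedged end-to-end sketch with one concrete subclass, ed.KLqp (variational inference). The toy model, data, and hyperparameters are assumptions for illustration, not from the slides:

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

x_train = np.random.randn(100).astype(np.float32)  # toy data

# Model: unknown mean beta with a Normal prior, Normal likelihood.
beta = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(100) * beta, scale=1.0)

# Variational approximation q(beta; lambda).
qbeta = Normal(loc=tf.Variable(0.0),
               scale=tf.nn.softplus(tf.Variable(0.0)))

inference = ed.KLqp(latent_vars={beta: qbeta}, data={x: x_train})
inference.run(n_iter=500)  # wraps initialize() + update() loop + finalize()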

slide-44
SLIDE 44

Classes of Inference

Edward supports the following classes of inference:
◮ Variational Inference
◮ Monte Carlo
◮ Generative Adversarial Networks (GANs)

Compositional Representations in Edward 27

slide-45
SLIDE 45

Composing Inferences

Inference as a collection of separate inference programs, e.g. Variational EM:

qbeta = PointMass(...)   # Global variables
qz = Categorical(...)    # Local variables

Compositional Representations in Edward 28

slide-46
SLIDE 46

Composing Inferences

Inference as a collection of separate inference programs, e.g. Variational EM:

qbeta = PointMass(...)   # Global variables
qz = Categorical(...)    # Local variables

# E-Step over local variables
inf_e = ed.VariationalInference(latent_vars={z: qz},
                                data={x: x_train, beta: qbeta})

# M-Step over global variables
inf_m = ed.MAP(latent_vars={beta: qbeta},
               data={x: x_train, z: qz})

Compositional Representations in Edward 28

slide-47
SLIDE 47

Composing Inferences

Inference as a collection of separate inference programs, e.g. Variational EM:

qbeta = PointMass(...)   # Global variables
qz = Categorical(...)    # Local variables

# E-Step over local variables
inf_e = ed.VariationalInference(latent_vars={z: qz},
                                data={x: x_train, beta: qbeta})

# M-Step over global variables
inf_m = ed.MAP(latent_vars={beta: qbeta},
               data={x: x_train, z: qz})

# Expectation-Maximization loop
while not_converged:
    inf_e.update()
    inf_m.update()

Compositional Representations in Edward 28

slide-48
SLIDE 48

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Experiments 29

slide-49
SLIDE 49

Benchmarks

Logistic regression using Hamiltonian Monte Carlo

Probabilistic programming system            Runtime (s)
Handwritten NumPy (1 CPU)                   534
Stan (1 CPU) (Carpenter et al. 2017)        171
PyMC3 (12 CPU) (Salvatier et al. 2015)      30.0
Edward (12 CPU)                             8.2
Handwritten TensorFlow (GPU)                5.0
Edward (GPU)                                4.9

◮ 35x speedup over Stan (1 CPU)
◮ 6x speedup over PyMC3 (12 CPU)

(CPU: 12-core Intel i7-5930K at 3.50 GHz; GPU: NVIDIA Titan X, Maxwell)

Experiments 30
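For context, a hedged sketch of what a Bayesian logistic regression with HMC might look like in Edward 1.x; the data shapes, step size, and number of samples are illustrative assumptions, not the benchmark's actual configuration:

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal, Bernoulli, Empirical

N, D, T = 500, 10, 1000  # data points, features, HMC samples (assumed)
X_train = np.random.randn(N, D).astype(np.float32)
y_train = np.random.randint(0, 2, size=N).astype(np.int32)

X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))   # prior on the weights
y = Bernoulli(logits=ed.dot(X, w))              # logistic likelihood

qw = Empirical(params=tf.Variable(tf.zeros([T, D])))  # holds the HMC samples

inference = ed.HMC({w: qw}, data={X: X_train, y: y_train})
inference.run(step_size=0.01, n_steps=5)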

slide-50
SLIDE 50

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Alternatives 31

slide-51
SLIDE 51

Edward Successor: TensorFlow Probability (Dillon et al. 2017)

Integration into TensorFlow itself: 4-Layer architecture

  • 1. TensorFlow – Numerical operations
  • 2. Statistical Building Blocks – Distributions
  • 3. Model Building – Joint distributions, Probabilistic layers
  • 4. Probabilistic Inference – Markov Chain Monte Carlo, Variational inference, Optimizers
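For comparison with the Edward program from earlier, a minimal sketch of the Beta-Bernoulli model using TensorFlow Probability's distributions layer (assuming the tfp.distributions API and TF 2.x eager execution; not from the slides):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

theta = tfd.Beta(concentration1=1.0, concentration0=1.0).sample()  # prior draw
x = tfd.Bernoulli(probs=tf.ones(50) * theta).sample()              # 50 coin flips
print(theta.numpy(), x.numpy())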

Alternatives 32

slide-52
SLIDE 52

Pyro: PyTorch Probabilistic Programming (Bingham et al. 2018)

◮ PyTorch as backend
◮ Unifies modern deep learning and Bayesian modeling
◮ Focus on Stochastic Variational Inference
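The same Beta-Bernoulli model as a hedged Pyro sketch (assuming Pyro's pyro.sample / pyro.plate API; not from the slides):

import torch
import pyro
import pyro.distributions as dist

def model(data):
    theta = pyro.sample("theta", dist.Beta(1.0, 1.0))      # latent coin bias
    with pyro.plate("data", len(data)):
        pyro.sample("x", dist.Bernoulli(theta), obs=data)   # observed flips

model(torch.ones(50))  # run the generative program on dummy observations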

Alternatives 33

slide-53
SLIDE 53

Outline

Introduction
Refresher on Probabilistic Modeling
Deep Probabilistic Programming
Compositional Representations in Edward
Experiments
Alternatives
Conclusion

Conclusion 34

slide-54
SLIDE 54

Conclusion

Edward ...
◮ is a novel deep probabilistic programming language
◮ provides compositional representations for models and inference
◮ leverages computational graphs for fast, parallelizable computation

Conclusion 35

slide-55
SLIDE 55

References I

Bingham, Eli et al. (2018). “Pyro: Deep Universal Probabilistic Programming”. In: Journal of Machine Learning Research.
Carpenter, Bob et al. (2017). “Stan: A Probabilistic Programming Language”. In: Journal of Statistical Software, Articles 76.1, pp. 1–32. ISSN: 1548-7660. DOI: 10.18637/jss.v076.i01. URL: https://www.jstatsoft.org/v076/i01.
Dillon, Joshua V. et al. (2017). TensorFlow Distributions. arXiv: 1711.10604 [cs.LG].
He, Kaiming et al. (2015). “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385. arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.
Lecun, Yann et al. (1998). “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE, pp. 2278–2324.
Salvatier, John et al. (2015). Probabilistic Programming in Python using PyMC. arXiv: 1507.08050 [stat.CO].

References 36

slide-56
SLIDE 56

References II

Simonyan, Karen and Andrew Zisserman (2014). “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: CoRR abs/1409.1556. URL: http://arxiv.org/abs/1409.1556.
Szegedy, Christian et al. (2014). “Going Deeper with Convolutions”. In: CoRR abs/1409.4842. arXiv: 1409.4842. URL: http://arxiv.org/abs/1409.4842.
Tran, Dustin et al. (2017). Deep Probabilistic Programming. arXiv: 1701.03757 [stat.ML].

References 37

slide-57
SLIDE 57

Figure Sources

◮ CNNs: https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d
◮ Bayesian Networks: K. Kersting, Probabilistic Graphical Models Lecture (2.), 2018
◮ Markov Models: https://en.wikipedia.org/wiki/File:A_simple_Markov_network.png
◮ Variational Autoencoder: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html
◮ Deep Belief Networks: https://medium.com/analytics-army/deep-belief-networks-an-introduction-1d52bb867a25

References 38

slide-58
SLIDE 58

Example: Variational Auto-Encoder

N = 50  # mini-batch size

# Probabilistic model
z = Normal(loc=tf.zeros([N, 10]), scale=tf.ones([N, 10]))
h = Dense(256, activation="relu")(z)
x = Bernoulli(logits=Dense(28 * 28)(h))

# Variational model
qx = tf.placeholder(tf.float32, [N, 28 * 28])
qh = Dense(256, activation="relu")(qx)
qz = Normal(loc=Dense(10, activation=None)(qh),
            scale=Dense(10, activation="softplus")(qh))
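A hedged follow-up to the snippet above showing how training might be wired up: it assumes Edward's ed.KLqp accepts a data dictionary that binds both the observed variable x and the placeholder qx to the same mini-batch, and x_batch is a hypothetical [N, 784] array of binarized pixels.

# Assumed training wiring (Edward 1.x style); x_batch is hypothetical.
inference = ed.KLqp(latent_vars={z: qz}, data={x: x_batch, qx: x_batch})
inference.run(n_iter=1000)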

[Graphical model: plate over n = 1, ..., N with latent z_n and observed x_n; model parameters θ and variational parameters φ outside the plate]

Appendix 39