Bayesian Inference and Markov Chain Monte Carlo Algorithms on GPUs


SLIDE 1

Bayesian Inference and Markov Chain Monte Carlo Algorithms on GPUs

Alexander Terenin and David Draper

University of California, Santa Cruz
Joint work with Shawfeng Dong

May 11, 2017, Talk for the Nvidia GPU Technology Conference
arXiv:1608.04329
Special thanks to Nvidia and Akitio for providing hardware

SLIDE 2

What are we trying to do?

Statistical machine learning and artificial intelligence at scale:

arg min_θ { L(x, θ) + ||θ|| }

  • L(x, θ): loss function
  • ||θ||: regularization
  • Goal: minimize the loss

  • Typical approach: stochastic gradient descent

Alternative approach: rewrite loss as an instance of Bayes’ Rule

Alexander Terenin and David Draper 1 Bayesian Inference and MCMC on GPUs

SLIDE 3

Bayesian Representation of Statistical Machine Learning

Consider the exponential of the negative loss:

f(x | θ) ∝ exp{−L(x, θ)}
π(θ) ∝ exp{−||θ||}

  • f(x | θ): likelihood
  • π(θ): prior

loss function ⇐⇒ posterior distribution

New goal: draw samples from f(θ | x)

  • A lot like non-convex optimization
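To make the loss-posterior correspondence concrete, here is a small numerical check (an illustrative example of my own, not from the talk) using ridge regression: its squared-error loss with an L2 penalty equals, up to additive constants, the negative log posterior of a Gaussian likelihood paired with a Gaussian prior.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
theta_true = np.array([1.0, -2.0, 0.5])
x = X @ theta_true + rng.normal(size=N)   # observed data
theta = rng.normal(size=p)                # an arbitrary candidate parameter

# Loss view: squared error plus an L2 regularization term.
loss = 0.5 * np.sum((x - X @ theta) ** 2) + 0.5 * np.sum(theta ** 2)

# Bayesian view: Gaussian likelihood times Gaussian prior.
log_lik = -0.5 * np.sum((x - X @ theta) ** 2)   # log f(x | theta) + const
log_prior = -0.5 * np.sum(theta ** 2)           # log pi(theta) + const
log_post = log_lik + log_prior                  # log f(theta | x) + const

# Minimizing the loss is the same as maximizing the unnormalized log posterior.
assert np.isclose(loss, -log_post)
```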

SLIDE 4

What are we trying to do?

Goal: draw samples from f(θ | x).

Some difficulties at scale:

  • Big Data: large x, computations using all data will be slow
  • Complex Models: large θ (its dimension can exceed the size of x): curse of dimensionality

Why not just find the maximum?

  • Understand, quantify, and propagate uncertainty
  • Sampling algorithms are essentially global optimizers
  • Loss may have no analytic form, making SGD impractical

SLIDE 5

Hardware

Bayesian inference is inherently expensive: let’s parallelize it

  • Parallelizable: only has meaning in context
  • Different types of parallel hardware have different requirements

GPUs: main challenges

  • Memory bottleneck: limited RAM, may need to stream data
  • Warp divergence: fine-grained if/else → if, wait, else

GPUs: design goals

  • Expose fine-grained parallelism
  • Minimize branching to control warp divergence
  • Ideally: run out-of-core (i.e. on minibatches streaming off disk)

SLIDE 6

Gibbs Sampling

The canonical Bayesian sampling algorithm. Draws samples from a target with density f(x, y, z) sequentially.

  • Full conditionals: f(x | y, z), f(y | x, z), f(z | x, y)

Algorithm: Gibbs Sampling

  • Step 1: draw x1 | y0, z0
  • Step 2: draw y1 | x1, z0
  • Step 3: draw z1 | x1, y1
  • Repeat until convergence to f(x, y, z)
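The sequential scan above can be sketched in a few lines. The following toy sampler (my own illustrative example, in two variables rather than three) targets a bivariate normal with correlation ρ, whose full conditionals are x | y ∼ N(ρy, 1 − ρ²) and y | x ∼ N(ρx, 1 − ρ²).

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=0):
    """Gibbs sampler targeting (x, y) ~ N(0, [[1, rho], [rho, 1]]).

    Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x.
    """
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)
    x = y = 0.0
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, sd)   # Step 1: draw x | y
        y = rng.normal(rho * x, sd)   # Step 2: draw y | x
        draws[t] = (x, y)             # repeat until convergence
    return draws

draws = gibbs_bivariate_normal()
# After burn-in, the empirical correlation should be close to rho = 0.8.
print(np.corrcoef(draws[1000:].T)[0, 1])
```

Note that each draw depends on the previous one, which is exactly the sequential structure that makes parallelizing Gibbs sampling nontrivial.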

How do we parallelize this?

SLIDE 7

GPU-accelerated Gibbs Sampling

Start with an exchangeable model:

f(x | θ) = ∏_{i=1}^{N} f(xi | θ)

Example: Probit Regression

yi | zi = round[Φ(zi)]
zi | xi, β ∼ N(xiβ, 1)
β ∼ N(µ, λ²)

Data Augmentation Gibbs Sampler

zi | β ∼ TN(xiβ, 1, yi)
β | z ∼ N((XᵀX)⁻¹Xᵀz, (XᵀX)⁻¹)
SLIDE 8

GPU-accelerated Gibbs Sampling

Data Augmentation Gibbs Sampler

zi | β ∼ TN(xiβ, 1, yi)
β | z ∼ N((XᵀX)⁻¹Xᵀz, (XᵀX)⁻¹)

Both steps are amenable to GPU-based parallelism

  • Draw β | z in parallel: use the Cholesky decomposition
  • Draw z | β in parallel: zi ⊥⊥ z−i for all i by exchangeability
  • Sufficient fine-grained parallelism in Xβ, Xᵀz, Chol(XᵀX)
  • Some tricks used to control warp divergence in the TN kernel
  • Overlap computation and output: write β to disk while updating z
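A CPU-side sketch of this sampler, assuming for simplicity a flat prior on β so that the conditional matches the one above. Vectorization over i stands in for the GPU's fine-grained parallelism, and scipy's truncnorm plays the role of the truncated normal kernel; this is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_da_gibbs(X, y, n_iter=500, seed=0):
    """Data-augmentation Gibbs sampler for probit regression (flat prior on beta).

    z_i | beta ~ TN(x_i beta, 1, y_i)
    beta | z   ~ N((X'X)^{-1} X'z, (X'X)^{-1})
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)   # computed once, reused every iteration
    beta = np.zeros(p)
    betas = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # z_i is truncated to (0, inf) if y_i = 1 and to (-inf, 0) if y_i = 0.
        # The z_i are conditionally independent, hence trivially parallel.
        lo = np.where(y == 1, -mu, -np.inf)   # bounds in standardized units
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # One Cholesky-based multivariate normal draw for beta | z.
        beta = XtX_inv @ (X.T @ z) + chol @ rng.normal(size=p)
        betas[t] = beta
    return betas
```

On a GPU, the z update maps to one thread per observation and the β update to batched dense linear algebra; here both are simply vectorized numpy/scipy calls.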

SLIDE 9

GPU-accelerated Gibbs Sampling

Data Augmentation Gibbs Sampler

zi | β ∼ TN(xiβ, 1, yi)
β | z ∼ N((XᵀX)⁻¹Xᵀz, (XᵀX)⁻¹)

What if we add a hierarchical prior such as the Horseshoe?

β | λ ∼ N(0, λ²)
λ | ν ∼ π(ν)
ν ∼ π(η)

Hierarchical priors factorize: update λ | − and ν | − in parallel

  • If GPU is not saturated, the computation is essentially free
  • More complicated model: more available parallelism
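As an illustration of how a factorized hierarchical prior yields an essentially free parallel update, here is a vectorized conditional draw for per-coefficient scales. It uses an inverse-gamma prior on each λj² as a simplified, conjugate stand-in for the horseshoe's half-Cauchy scales (the hyperparameters a and b are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 1.0                  # hypothetical inverse-gamma hyperparameters
beta = rng.normal(size=1000)     # current values of 1,000 regression coefficients

# Conjugacy: if lambda_j^2 ~ InvGamma(a, b) and beta_j | lambda_j ~ N(0, lambda_j^2),
# then lambda_j^2 | beta_j ~ InvGamma(a + 1/2, b + beta_j^2 / 2).
# The lambda_j are conditionally independent, so one vectorized (embarrassingly
# parallel) draw updates all of them simultaneously.
lam2 = 1.0 / rng.gamma(a + 0.5, 1.0 / (b + 0.5 * beta ** 2))
```

On a GPU this is one kernel launch over the coefficients, which is why the update is essentially free when the device is not already saturated.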

SLIDE 10

GPU-accelerated Performance

Horseshoe Probit Regression

[Figure: CPU and GPU run times for 10,000 Monte Carlo iterations of Horseshoe Probit Regression. Time (minutes) is plotted against data size (10,000; 100,000; 1,000,000), with series for GPU, workstation, and laptop at dimensions 100, 1,000, and 10,000; annotated run times include 0:17, 0:41, 1:30, and 2:22.]

It’s lightning fast, and requires no new theory

  • N = 10,000, p = 1,000: 90 minutes → 41 seconds

SLIDE 11

Conclusions

Bayesian problems can benefit immensely from hardware acceleration

  • External GPUs, like the Akitio Node, are making this accessible

MCMC is both inherently sequential and massively parallelizable

  • Not well-studied, lots of potential for new results
  • Stay tuned: minibatch-based MCMC possible in continuous time
  • A. Terenin, S. Dong, and D. Draper. GPU-accelerated Gibbs Sampling. arXiv:1608.04329, 2016.
