Bayesian Inference and Markov Chain Monte Carlo Algorithms on GPUs
Alexander Terenin and David Draper, University of California, Santa Cruz
Joint work with Shawfeng Dong
Talk for the Nvidia GPU Technology Conference, May 11, 2017
arXiv:1608.04329
What are we trying to do?
Statistical machine learning and artificial intelligence at scale

arg min_θ { L(x, θ) + ||θ|| }

- L(x, θ): loss function
- ||θ||: regularization term

Goal: minimize the loss
- Typical approach: stochastic gradient descent
Alternative approach: rewrite the loss as an instance of Bayes' Rule
Bayesian Representation of Statistical Machine Learning
Consider the exponential of the loss:

f(x | θ) ∝ exp{−c L(x, θ)}
π(θ) ∝ exp{−||θ||}

- f(x | θ): likelihood
- π(θ): prior

loss function ⇔ posterior distribution

New goal: draw samples from f(θ | x)
- A lot like non-convex optimization
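Example (not on the slide): with squared-error loss and an ℓ2 penalty, exp{−L(x, θ)} is a Gaussian likelihood and exp{−||θ||²/(2λ²)} is a N(0, λ²) prior, so ridge regression is exactly the posterior mode of Bayesian linear regression with a Gaussian prior; sampling from f(θ | x) explores the whole posterior surface rather than just locating that mode.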
What are we trying to do?
Goal: draw samples from f(θ | x)

Some difficulties at scale:
- Big Data: large x, computations using all data will be slow
- Complex Models: large θ (its dimension can exceed that of x): curse of dimensionality
Why not just find the maximum?
- Understand, quantify, and propagate uncertainty
- Sampling algorithms are essentially global optimizers
- Loss may have no analytic form, making SGD impractical
Hardware
Bayesian inference is inherently expensive: let’s parallelize it
- Parallelizable: only has meaning in context
- Different types of parallel hardware have different requirements
GPUs: main challenges
- Memory bottleneck: limited RAM, may need to stream data
- Warp divergence: fine-grained if/else → if, wait, else
GPUs: design goals
- Expose fine-grained parallelism
- Minimize branching to control warp divergence
- Ideally: run out-of-core (i.e. on minibatches streaming off disk)
Gibbs Sampling
The canonical Bayesian sampling algorithm

Draws samples from a target with density f(x, y, z) sequentially
- Full conditionals: f(x | y, z), f(y | x, z), f(z | x, y)
Algorithm: Gibbs Sampling
- Step 1: draw x1 | y0, z0
- Step 2: draw y1 | x1, z0
- Step 3: draw z1 | x1, y1

Repeat until convergence to f(x, y, z)
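As a concrete toy instance of this scheme (not from the talk), the sketch below runs a two-variable Gibbs sampler whose full conditionals are the standard ones for a bivariate normal with correlation rho; all names and settings are illustrative only.

```python
# Toy Gibbs sampler: bivariate normal with correlation rho, sampled via its
# two full conditionals x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2).
import numpy as np

def gibbs_bivariate_normal(n_iter=5000, rho=0.8, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                      # arbitrary starting point
    s = np.sqrt(1.0 - rho ** 2)          # conditional standard deviation
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, s)       # Step 1: draw x | y
        y = rng.normal(rho * x, s)       # Step 2: draw y | x
        samples[t] = (x, y)              # after burn-in, (x, y) ~ the joint target
    return samples

samples = gibbs_bivariate_normal()
print(samples[1000:].mean(axis=0), np.corrcoef(samples[1000:].T)[0, 1])
```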
How do we parallelize this?
GPU-accelerated Gibbs Sampling
Start with an exchangeable model: f(x | θ) = ∏_{i=1}^N f(xi | θ)

Example: Probit Regression
yi | zi = round[Φ(zi)]
zi | xi, β ∼ N(xiβ, 1)
β ∼ N(µ, λ²)

Data Augmentation Gibbs Sampler
zi | β ∼ TN(xiβ, 1, yi)
β | z ∼ N((XᵀX)⁻¹Xᵀz, (XᵀX)⁻¹)
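A minimal CPU sketch of this data-augmentation sampler, using the β | z conditional exactly as written above (which amounts to a flat prior on β; the N(µ, λ²) prior on the slide would add the usual prior terms). Function and variable names (probit_gibbs, X, y, n_iter) are illustrative, not from the talk's GPU implementation.

```python
# CPU sketch of the data-augmentation Gibbs sampler for probit regression.
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX = X.T @ X
    L = np.linalg.cholesky(XtX)               # X^T X = L L^T, reused every iteration
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    # truncation region for z_i: (0, inf) if y_i = 1, (-inf, 0) if y_i = 0
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    for t in range(n_iter):
        mu = X @ beta
        # z_i | beta ~ TN(x_i beta, 1, y_i); the z_i are independent, so vectorize
        z = truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0, random_state=rng)
        # beta | z ~ N((X^T X)^{-1} X^T z, (X^T X)^{-1}), drawn via the Cholesky factor:
        # beta = m + L^{-T} eps has covariance (L L^T)^{-1} = (X^T X)^{-1}
        m = np.linalg.solve(XtX, X.T @ z)
        beta = m + np.linalg.solve(L.T, rng.standard_normal(p))
        draws[t] = beta
    return draws
```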
GPU-accelerated Gibbs Sampling
Data Augmentation Gibbs Sampler
zi | β ∼ TN(xiβ, 1, yi)
β | z ∼ N((XᵀX)⁻¹Xᵀz, (XᵀX)⁻¹)
Both steps are amenable to GPU-based parallelism
- Draw β | z in parallel: use Cholesky decomposition
- Draw z | β in parallel: zi ⊥⊥ z−i for all i by exchangeability
- Sufficient fine-grained parallelism in Xβ, Xᵀz, Chol(XᵀX)
- Some tricks used to control warp divergence in the TN kernel (one branch-free option is sketched below)
- Overlap computation and output: write β to disk while updating z
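The slide does not spell out the TN-kernel tricks; one standard branch-free formulation is the inverse-CDF draw below, in which every element executes the same arithmetic regardless of yi. This is an illustrative NumPy/SciPy sketch, not the talk's CUDA kernel.

```python
# Branch-free truncated-normal draw by inverse-CDF: identical instructions for
# every element, which avoids the if/else pattern that causes warp divergence.
import numpy as np
from scipy.special import ndtr, ndtri    # standard normal CDF and its inverse

def sample_z_branch_free(mu, y, rng):
    lower = np.where(y == 1, 0.0, -np.inf)   # truncation region per observation
    upper = np.where(y == 1, np.inf, 0.0)
    lo = ndtr(lower - mu)                    # CDF at the standardized bounds
    hi = ndtr(upper - mu)
    u = rng.uniform(size=mu.shape)
    return mu + ndtri(lo + u * (hi - lo))    # inverse-CDF sample in the region
```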
GPU-accelerated Gibbs Sampling
Data Augmentation Gibbs Sampler
zi | β ∼ TN(xiβ, 1, yi)
β | z ∼ N((XᵀX)⁻¹Xᵀz, (XᵀX)⁻¹)
What if we add a hierarchical prior such as the Horseshoe?

β | λ ∼ N(0, λ²)
λ | ν ∼ π(λ | ν)
ν ∼ π(ν | η)

Hierarchical priors factorize: update λ | − and ν | − in parallel (see the sketch after the bullets below)
- If GPU is not saturated, the computation is essentially free
- More complicated model: more available parallelism
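Because each local scale depends only on its own coefficient, these updates are element-wise and map naturally to one GPU thread per coefficient. The sketch below assumes the common inverse-gamma scale-mixture representation of the half-Cauchy Horseshoe prior (λj² | νj ∼ IG(1/2, 1/νj), νj ∼ IG(1/2, 1)); the slide does not give its exact parameterization, so these conditionals are illustrative rather than the talk's.

```python
# Vectorized Horseshoe-style local-scale updates under the inverse-gamma
# scale-mixture representation (an assumed parameterization, see lead-in).
import numpy as np

def update_local_scales(beta, nu, rng):
    # lambda_j^2 | beta_j, nu_j ~ IG(1, 1/nu_j + beta_j^2 / 2), independent over j;
    # an IG(a, b) draw is b / Gamma(a, 1)
    lam2 = (1.0 / nu + 0.5 * beta ** 2) / rng.gamma(1.0, size=beta.shape)
    # nu_j | lambda_j^2 ~ IG(1, 1 + 1/lambda_j^2), independent over j
    nu = (1.0 + 1.0 / lam2) / rng.gamma(1.0, size=beta.shape)
    return lam2, nu
```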
GPU-accelerated Performance
Horseshoe Probit Regression
[Figure: CPU and GPU run time (in minutes) for 10,000 Monte Carlo iterations of Horseshoe Probit Regression, plotted against data size (10,000 to 1,000,000), for dimensions 100, 1,000, and 10,000 on the GPU and dimensions 100 and 1,000 on a workstation and a laptop.]
It’s lightning fast and requires no new theory
- N = 10,000, p = 1,000: 90 minutes → 41 seconds
Conclusions
Bayesian problems can benefit immensely from hardware acceleration
- External GPUs, like the Akitio Node, are making this accessible
MCMC is both inherently sequential and massively parallelizable
- Not well-studied, lots of potential for new results
- Stay tuned: minibatch-based MCMC possible in continuous time
A. Terenin, S. Dong, and D. Draper. GPU-accelerated Gibbs Sampling. arXiv:1608.04329, 2016.