Introduction to Stan and Bayesian Inference


  1. Introduction to Stan and Bayesian Inference Paris Machine Learning Meetup Dataiku User Meetup 21 September 2016 Eric Novik enovik@stan.fit

  2. Outline ▸ Why should you bother with Bayes ▸ Why should you use Stan ▸ Introduction to modern Bayesian workflow ▸ Building up a Stan model ▸ Brief introduction to pooling and magic of multi-level models ▸ Pricing books using Stan and rstanarm package ▸ References and guide to getting started

  3. Why Bayes

  4. Benefits of Bayesian Approach ▸ Express your beliefs about parameters and the data generating process ▸ Properly account for uncertainty at the individual and group level ▸ Do not collapse grouping variables (e.g. sales of multiple products over time) and do not fit a separate model to each group ▸ Small data is fine if you have a strong model ▸ But what about Big Data?

  5. Big Data Needs Big Models [chart: Model Complexity vs. Size of Data — bigger data calls for bigger models]

  6. Traditional Machine Learning and Causal Models ▸ Problem A: A large retailer wants to know how many units of each product they are going to sell tomorrow ▸ Problem B: A large retailer wants to find a revenue-maximizing price for all of their products ▸ Data: We observe quantity sold of each product over time, metadata about the products, and price variation ▸ Question: Which one needs a causal model?

  7. Why Stan

  8. What Is Stan ▸ A mathematical specification of models: what it is → what it is for ▸ C++ Math/Stats Library → automatic calculation of gradients ▸ Imperative model specification language → fast and simple way to specify complex models ▸ Algorithm toolbox (HMC NUTS, ADVI, L-BFGS) → fit with full Bayes, approximate Bayes, or optimization ▸ Interfaces (Command Line, R, Python, Julia, Matlab, Stata, …) → work in the language of your choice ▸ Interpretation tools (shinystan) → model criticism, algorithm evaluation

  9. Who Is Using Stan ▸ 2,000+ members on the user list ▸ Over 10,000 manual downloads per new release ▸ Stan is used for fitting climate models, clinical drug trials, genomics and cancer biology, population dynamics, psycholinguistics, social networks, finance and econometrics, professional sports, publishing, recommender systems, educational testing, and many more.

  10. Stan vs Traditional Machine Learning ▸ The model is directly expressed in Stan ▸ When in MCMC mode, Stan produces draws from the posterior distribution, not point estimates (MLE) of the parameters ▸ Fit complex models with millions of parameters ▸ Express and fit hierarchical models

  11. Stan vs Gibbs and Metropolis ▸ 2-d projection of a highly correlated 250-d distribution ▸ 1M samples from Metropolis and 1M samples from Gibbs ▸ 1K samples from NUTS
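The comparison above can be illustrated with a minimal random-walk Metropolis sampler on a correlated 2-d Gaussian (a Python sketch with illustrative parameters; the slide's experiment used a 250-d distribution, reduced here to 2-d for brevity). With strong correlation the step size must stay small, so successive draws are highly autocorrelated — the problem NUTS avoids:

```python
import numpy as np

# Random-walk Metropolis on a highly correlated 2-d Gaussian target.
rng = np.random.default_rng(1)
rho = 0.99
cov_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))

def log_p(x):
    # Unnormalized log density of N(0, Sigma) with correlation rho.
    return -0.5 * x @ cov_inv @ x

x = np.zeros(2)
accepted = 0
draws = np.empty((5000, 2))
for i in range(5000):
    prop = x + rng.normal(scale=0.1, size=2)  # small steps forced by the narrow ridge
    if np.log(rng.uniform()) < log_p(prop) - log_p(x):
        x = prop
        accepted += 1
    draws[i] = x

print(accepted / 5000)  # acceptance rate is decent, but each move is tiny
```

The chain accepts often, yet explores the long axis of the distribution very slowly — exactly the pathology the slide's 1M-sample Metropolis/Gibbs runs exhibit versus 1K NUTS samples.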

  12. Hamiltonian Simulation

  13. Intro to Bayes with Modern Bayesian Workflow

  14. Bayesian Workflow ▸ Gather prior knowledge ▸ Formulate a generative model ▸ Simulate fake data ▸ Fit the fake data and recover the parameters ▸ Fit the model to real data ▸ Evaluate and criticize the model ▸ Good enough? If not, add structure to the model and repeat ▸ If yes, predict for new data / make a decision

  15. Bayesian Machinery ▸ The joint probability of data y and unknown parameter theta: p(y, θ) = p(y | θ) ∗ p(θ) = p(θ | y) ∗ p(y) ▸ The conditional probability of theta given y: p(θ | y) = p(y | θ) ∗ p(θ) / p(y) = p(y | θ) ∗ p(θ) / ∫ p(y | θ) ∗ p(θ) dθ ∝ p(y | θ) ∗ p(θ), where p(y | θ) is the likelihood, p(θ) is the prior, and p(y) = ∫ p(y | θ) ∗ p(θ) dθ is the marginal likelihood
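The arithmetic behind these formulas can be checked numerically on a grid (a Python sketch with illustrative data — 3 successes in 5 trials — and an arbitrary Beta(2, 2) prior):

```python
import numpy as np

# Grid approximation of Bayes' rule: posterior = likelihood * prior / marginal.
theta = np.linspace(0.001, 0.999, 1000)
dtheta = theta[1] - theta[0]

prior = theta * (1 - theta)             # unnormalized Beta(2, 2) prior
likelihood = theta**3 * (1 - theta)**2  # Bernoulli likelihood, 3 of 5 successes
unnorm_post = likelihood * prior

marginal = np.sum(unnorm_post) * dtheta  # ≈ ∫ p(y|θ) p(θ) dθ
posterior = unnorm_post / marginal

# Dividing by the marginal likelihood makes the posterior integrate to 1.
print(round(np.sum(posterior) * dtheta, 3))  # 1.0
```

This is why the marginal likelihood can be dropped when it is not needed: it is a constant that only rescales the curve, hence the "∝" form.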

  16. Bernoulli Model ▸ If we model each occurrence as independent, the joint model can be written as: p(y, θ) = ∏_{n=1}^{N} θ^{y_n} ∗ (1 − θ)^{1 − y_n} = θ^{∑_{n=1}^{N} y_n} ∗ (1 − θ)^{∑_{n=1}^{N} (1 − y_n)} ▸ What happened to the prior, p(θ)? (here it is implicitly flat, a constant that drops out) ▸ On the log scale: log(p(y, θ)) = ∑_{n=1}^{N} y_n ∗ log(θ) + ∑_{n=1}^{N} (1 − y_n) ∗ log(1 − θ)
data <- list(N = 5, y = c(0, 1, 1, 0, 1))
# log probability function
lp <- function(theta, data) {
  lp <- 0
  for (i in 1:data$N) {
    lp <- lp + log(theta) * data$y[i] + log(1 - theta) * (1 - data$y[i])
  }
  return(lp)
}
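For readers following along outside R, the same log-probability function translates directly (a Python rendering of the slide's R code, using the same data):

```python
import math

# Data from the slide: 5 Bernoulli observations.
data = {"N": 5, "y": [0, 1, 1, 0, 1]}

def lp(theta, data):
    """Log joint: sum_n y_n * log(theta) + (1 - y_n) * log(1 - theta)."""
    total = 0.0
    for y_n in data["y"]:
        total += y_n * math.log(theta) + (1 - y_n) * math.log(1 - theta)
    return total

print(round(lp(0.6, data), 6))  # -3.365058, matching the slide's R output
```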

  17. Bernoulli Model
# using dbinom()
lp_dbinom <- function(theta, d) {
  lp <- 0
  for (i in 1:length(theta))
    lp[i] <- sum(dbinom(d$y, size = 1, prob = theta[i], log = TRUE))
  return(lp)
}
> lp(c(0.6, 0.7), data)
[1] -3.365058 -3.477970
> lp_dbinom(c(0.6, 0.7), data)
[1] -3.365058 -3.477970

  18. Bernoulli Model
# generate the parameter grid
theta <- seq(0.001, 0.999, length.out = 250)
# log p(theta | y)
log_prob <- lp(theta = theta, data)
posterior <- exp(log_prob)
# normalize
posterior <- posterior / sum(posterior)
# sample from the posterior
post_draws <- sample(theta, size = 1e5, replace = TRUE,
                     prob = posterior)
post_dens <- density(post_draws)
mle <- sum(data$y) / data$N
> mle
[1] 0.6
[Plot: posterior density of Theta on (0, 1), peaked near 0.6]
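The same grid approximation works in any language; a NumPy sketch of the slide's steps (flat prior, so the posterior is proportional to the likelihood; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(123)
y = np.array([0, 1, 1, 0, 1])
theta = np.linspace(0.001, 0.999, 250)  # parameter grid

# log p(theta | y) up to a constant, then normalize over the grid
log_prob = y.sum() * np.log(theta) + (len(y) - y.sum()) * np.log(1 - theta)
posterior = np.exp(log_prob)
posterior /= posterior.sum()

# sample from the discretized posterior
post_draws = rng.choice(theta, size=100_000, replace=True, p=posterior)
mle = y.sum() / len(y)

print(mle)                          # 0.6
print(round(post_draws.mean(), 2))  # ≈ 0.57, the Beta(4, 3) posterior mean
```

Note the posterior mean (4/7 ≈ 0.571) differs from the MLE (0.6): even a flat prior shrinks the estimate slightly toward 0.5 in small samples.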

  19. Same Model in Stan ▸ With an explicit log-probability increment:
data {
  int<lower=1> N;
  int<lower=0, upper=1> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  for (n in 1:N)
    target += y[n] * log(theta) + (1 - y[n]) * log(1 - theta);
}
▸ Or, equivalently, with a sampling statement:
data {
  int<lower=1> N;
  int<lower=0, upper=1> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  y ~ bernoulli(theta);
}
▸ Both implement log(p(y, θ)) = ∑_{n=1}^{N} y_n ∗ log(θ) + ∑_{n=1}^{N} (1 − y_n) ∗ log(1 − θ)

  20. Product Pricing Basic Model and Data Simulation

  21. Analytical Problem ▸ A large publisher has hundreds of thousands of books in the catalog ▸ Every year, thousands of new books (products) are published ▸ There are also new authors, repeat authors, genres, seasonality, and so on ▸ The publisher wants to maximize revenue, but if uncertainty is high, maximize quantity sold ▸ How should we model this? (and what is "this"?)

  22. Basic Model for Quantity Sold ▸ qty = f(price, price², product age, …) ▸ For a Gaussian model and one product: qty_i ∼ N(X_i β, σ²) ▸ For products that sell thousands of units we would fit a log-log model ▸ For lower-volume products that sometimes sell zero units, we fit a count model that does not force the mean to be equal to the variance: qty ∼ NegativeBinomial(μ, φ), with μ = exp(α + β₁ ∗ product age + β₂ ∗ price + …) and σ² = μ + μ²/φ
simd2 <- hir_data_sim(N_prod = 1,
                      global_intercept = 8.5,
                      theta = 10,
                      qty_process = "negbinom",
                      primary_price_process = "none",
                      …,
                      linkinv = exp)
[Plot: simulated quantity sold declining with product age]
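The stated mean-variance relation σ² = μ + μ²/φ can be checked by simulation (a Python sketch; NumPy parameterizes the negative binomial by (n, p), and setting n = φ, p = φ/(φ + μ) recovers the (μ, φ) form used here — μ = 50 and φ = 10 are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, phi = 50.0, 10.0

# NegativeBinomial(mu, phi) via NumPy's (n, p) parameterization.
draws = rng.negative_binomial(n=phi, p=phi / (phi + mu), size=1_000_000)

print(round(draws.mean(), 1))  # ≈ 50.0, i.e. mu
print(round(draws.var(), -1))  # ≈ 300, i.e. mu + mu^2/phi = 50 + 250
```

The variance (≈300) far exceeds the mean (50), which is exactly the overdispersion a Poisson model would forbid.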

  23. Simulating Data
if (process == "normal") {
  data <- data %>%
    mutate(qty = linkinv(product_intercept +
                         product_beta_time * day +
                         product_beta_price * price +
                         error_sd * rnorm(sum(n)))) %>%
    mutate(qty = ifelse(qty <= 0, 0, round(qty)))
} else { # negative binomial
  data <- data %>%
    mutate(mu = linkinv(product_intercept +
                        product_beta_time * day +
                        product_beta_price * price),
           qty = MASS::rnegbin(n = sum(n), mu = mu, theta = theta))
}
[Plot: simulated quantity sold over product age for products 1-4]
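A stripped-down Python analogue of the negative-binomial branch for a single product (the intercept and time slope borrow the "truth" values quoted on a later slide; the flat price path and beta_price = 0 are simplifying assumptions to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(60)                           # product age in days
price = np.full(60, 20.0)                      # flat price path (assumption)

alpha, beta_time, beta_price, phi = 8.5, -0.10, 0.0, 10.0
mu = np.exp(alpha + beta_time * days + beta_price * price)  # log link

# Negative-binomial draws with mean mu and dispersion phi.
qty = rng.negative_binomial(n=phi, p=phi / (phi + mu))
print(qty[:3])  # simulated quantities decay with product age
```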

  24. What About the Real Data?

  25. Baseline Stan Model for Single Product
data {
  int<lower=0> N;
  int<lower=0> y[N];
  vector[N] t;
}
parameters {
  real alpha;         // overall mean
  real beta;          // time beta
  real<lower=0> phi;  // dispersion
}
model {
  vector[N] eta;
  // linear predictor
  eta = alpha + t * beta;
  // priors
  alpha ~ normal(0, 10);
  beta ~ normal(0, 1);
  phi ~ cauchy(0, 2.5);
  // likelihood
  y ~ neg_binomial_2_log(eta, phi);
}
simd2_m2 <- stan('m2_self_stan_nbinom.stan',
                 data = list(N = nrow(simd2$data),
                             y = simd2$data$qty,
                             t = simd2$data$day),
                 control = list(stepsize = 0.01,
                                adapt_delta = 0.99),
                 cores = 4, iter = 400)
# truth: alpha = 8.5, beta = -0.10, phi = 10
samples <- rstan::extract(simd2_m2, pars = c('alpha', 'beta', 'phi'))
> lapply(samples, quantile)
$alpha
   0%   25%   50%   75%  100%
  8.3   8.4   8.5   8.6   8.8
$beta
     0%    25%    50%    75%   100%
 -0.107 -0.102 -0.100 -0.099 -0.092
$phi
   0%   25%   50%   75%  100%
  6.2  10.1  11.5  13.0  24.1

  26. Looking at Posterior Draws > pairs(simd2_m2)

  27. Diagnostics with Shinystan

  28. Product Pricing Introduction to Pooling and Hierarchical Models

  29. We Have Multiple Products, Authors, Genres

  30. Hierarchical Pooling in One Slide ▸ Estimate of average sales for book j (j indexes books): ȳ_j^multilevel ≈ ( (n_j / σ_y²) ∗ ȳ_j + (1 / σ_α²) ∗ ȳ_all ) / ( n_j / σ_y² + 1 / σ_α² ) ▸ n_j — number of observations for book j ▸ ȳ_j — average sales for book j ▸ ȳ_all — average sales across all books ▸ σ_y² — within-book variance ▸ σ_α² — variance among average sales of different books
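Plugging numbers into this precision-weighted average makes the shrinkage concrete (all values below are hypothetical, chosen only for illustration):

```python
# Partial pooling: a precision-weighted average of the book's own mean
# and the grand mean across all books.
n_j = 5            # observations for book j
ybar_j = 120.0     # average sales for book j
ybar_all = 80.0    # average sales across all books
sigma_y2 = 400.0   # within-book variance
sigma_a2 = 100.0   # variance among book-level averages

w_book = n_j / sigma_y2  # precision of the book's own mean
w_pool = 1 / sigma_a2    # precision of the grand mean
ybar_multilevel = (w_book * ybar_j + w_pool * ybar_all) / (w_book + w_pool)

print(round(ybar_multilevel, 2))  # 102.22, pulled from 120 toward 80
```

With only 5 observations the estimate is pulled well toward the grand mean; as n_j grows, w_book dominates and the estimate approaches the book's own average.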

  31. Multi-Level Models Using lmer Syntax

  32. Fitting Multi-Level Models in rstanarm
fit <- stan_glmer.nb(qty ~ product_age + price + price_sqr +            # "fixed effects"
                       (1 + product_age + price + price_sqr | product), # varying intercepts and slopes
                     algorithm = "sampling",  # fit using MCMC
                     seed = 123,              # random seed
                     cores = 4,               # 1 core per chain
                     iter = 600,              # iterations per chain
                     data = data)
