Introduction to Stan and Bayesian Inference Paris Machine Learning - PowerPoint PPT Presentation

Introduction to Stan and Bayesian Inference Paris Machine Learning Meetup Dataiku User Meetup 21 September 2016 Eric Novik enovik@stan.fit

Outline ▸ Why should you bother with Bayes ▸ Why should you use Stan ▸ Introduction to modern Bayesian workflow ▸ Building up a Stan model ▸ Brief introduction to pooling and magic of multi-level models ▸ Pricing books using Stan and rstanarm package ▸ References and guide to getting started

Why Bayes

Benefits of Bayesian Approach ▸ Express your beliefs about parameters and the data generating process ▸ Properly account for uncertainty at the individual and group level ▸ Do not collapse grouping variables (e.g. sales for of multiple products over time) and do not fit a separate model to each group ▸ Small data is fine if you have a strong model ▸ But what about Big Data?

Big Data Need Big Models Model Complexity BIG MODELS BIG DATA Size of Data

Traditional Machine Learning and Causal Models ▸ Problem A : A large retailer ▸ Problem B : A large retailer wants to know how many units of wants to find a revenue each product they are going to sell maximizing price for all of their tomorrow products ▸ Data : We observe quantity sold of each product of time, meta data about the products, and price variation ▸ Question : Which one needs a causal model?

Why Stan

What Is Stan What What For Mathematical specification of models; C++ Math/Stats Library Automatic calculations of gradients Fast and simple way to specify complex Imperative Model Specification Language models Fit with full Bayes, approximate Bayes, Algorithm Toolbox optimization (HMC NUTS, ADVI, L-BFGS) Interfaces (Command Line, R, Python, Work in the language of your choice Julia, Matlab, Stata, …) Interpretation Tools (shinystan) Model critisism, algorithm evaluation

Who Is Using Stan ▸ 2,000+ members on the user list ▸ Over 10,000 manual downloads during the new release ▸ Stan is used for fitting climate models, clinical drug trials, genomics and cancer biology, population dynamics, psycholinguistics, social networks, finance and econometrics, professional sports, publishing, recommender systems, educational testing, and many more.

Stan vs Traditional Machine Learning ▸ Model is directly expressed in Stan ▸ When in MCMC mode Stan produces produces draws from posterior distribution, not point estimates (MLE) of the parameters ▸ Fit complex models with millions of parameters ▸ Express and fit hierarchical models

Stan vs Gibbs and Metropolis ▸ 2-d projection of a highly correlated 250-d distribution ▸ 1M samples from Metropolis and 1M samples from Gibbs ▸ 1K samples from NUTS

Hamiltonian Simulation

Intro to Bayes with Modern Bayesian Workflow

Bayesian Workflow FIT FAKE DATA AND GATHER PRIOR FORMULATE A FIT THE MODEL TO SIMULATE FAKE DATA RECOVER KNOWLEDGE GENERATIVE MODEL REAL DATA PARAMETERS EVALUATE AND GOOD ADD STRUCTURE TO CRITICIZE THE MODEL ENOUGH? THE MODEL yes PREDICT FOR NEW DATA / MAKE A DECISION

Bayesian Machinery ▸ The joint probability of data y and unknown parameter theta : p ( y, θ ) = p ( y | θ ) ∗ p ( θ ) p ( y, θ ) = p ( θ | y ) ∗ p ( y ) ▸ The conditional probability of theta given y : Likelihood Prior p ( θ | y ) = p ( y | θ ) ∗ p ( θ ) = p ( y | θ ) ∗ p ( θ ) p ( y | θ ) ∗ p ( θ ) p ( y, θ ) d θ = R R p ( y ) p ( y | θ ) ∗ p ( θ ) d θ Marginal Likelihood ∝ p ( y | θ ) ∗ p ( θ )

Bernoulli Model ▸ If we model each occurrence as independent, the data <- list(N = 5, joint model can be written as: y = c(0, 1, 1, 0, 1)) p ( y | θ ) Bernoulli Likelihood # log probability function N θ y n ∗ (1 − θ ) 1 − y n = θ P N P N Y n =1 (1 − y n ) lp <- function(theta, data) { n =1 y n ∗ (1 − θ ) p ( y, θ ) = lp <- 0 n =1 for (i in 1:data$N) { ▸ What happened to the prior, p ( θ ) lp <- lp + log(theta) * data$y[i] + log(1 - theta) * (1 - data$y[i]) ▸ On the log scale: } return(lp) N N X X log( p ( y, θ )) = y n ∗ log( θ ) + (1 − y n ) ∗ log(1 − θ ) } n =1 n =1

Bernoulli Model # using dbinom() lp_dbinom <- function(theta, d) { lp <- 0 for (i in 1:length(theta)) lp[i] <- sum(dbinom(d$y, size = 1, prob = theta[i], log = TRUE)) return(lp) } > lp(c(0.6, 0.7), data) [1] -3.365058 -3.477970 > lp_dbinom(c(0.6, 0.7), data) [1] -3.365058 -3.477970

Bernoulli Model # generate the parameter grid theta <- seq(0.001, 0.999, 2.0 length.out = 250) # log p(theta | y) posterior <- lp(theta = theta, data) 1.5 posterior <- exp(log_prob) # normalize posterior <- posterior / sum(posterior) 1.0 # sample from the posterior post_draws <- sample(theta, size = 1e5, replace = TRUE, 0.5 prob = posterior) post_dens <- density(post_draws) mle <- sum(data$y) / data$N > mle 0.0 [1] 0.6 0.00 0.25 0.50 0.75 1.00 Theta

Same Model in Stan data { data { int<lower=1> N; int<lower=1> N; int<lower=0, upper=1> y[N]; int<lower=0, upper=1> y[N]; } } parameters { parameters { real theta; real<lower=0, upper=1> theta; } } model { model { for (n in 1:N) target += y[n] * log(theta) + y ~ bernoulli(theta); (1 - y[n]) * log(1 - theta); } } N N X X log( p ( y, θ )) = y n ∗ log( θ ) + (1 − y n ) ∗ log(1 − θ ) n =1 n =1

Product Pricing Basic Model and Data Simulation

Anlytical Problem ▸ A large publisher has hundreds of thousands of books in the catalog ▸ Every year, thousands of new books (products) are published ▸ There are also new authors, repeat authors, genres, seasonality, and so on ▸ Publisher wants to maximize revenue but if uncertainty is high, maximize quantity sold ▸ How should we model this? (and what is this)?

Basic Model for Quantity Sold 1 4000 qty = f ( price, price 2 , product age, ... ) 3000 Simulated Quantity Sold ▸ For a Gaussian model, and one product: 2000 EL qty i ∼ N ( X i β , σ 2 ) 1000 ▸ For products that sell thousands of units we 0 would fit a log-log model 0 20 40 60 Product Age ▸ For lower volume products that sometimes sell simd2 <- hir_data_sim(N_prod = 1, zero units, we fit a count model that does not global_intercept = 8.5, force the mean to be equal to the variance theta = 10, qty_process = "negbinom", qty ∼ NegativeBinomial ( µ, φ ) primary_price_process = "none", … µ = exp ( α + β 1 ∗ product age + β 2 ∗ price + ... ) linkinv = exp) σ 2 = µ + µ 2 / φ

Simulating Data if (process == "normal") { data <- data %>% mutate(qty = linkinv(product_intercept + product_beta_time * days + product_beta_price * price + error_sd * rnorm(sum(n)))) %>% mutate(qty = ifelse(qty <= 0, 0, round(qty))) } else { # negative binomial data <- data %>% mutate(mu = linkinv(product_intercept + product_beta_time * day + product_beta_price * price) qty = MASS::rnegbin(n = sum(n), mu = mu, theta = theta)) } 1 2 3 4 3000 Simulated Quantity Sold 2000 EL 1000 0 0 20 40 60 0 20 40 60 0 20 40 60 0 20 40 60 Product Age

What About the Real Data?

Baseline Stan Model for Single Product simd2_m2 <- stan('m2_self_stan_nbinom.stan' data { data = list(N = nrow(simd2$data), int<lower=0> N; y = simd2$data$qty, int<lower=0> y[N]; t = simd2$data$day), vector[N] t; control = list(stepsize = 0.01, } adapt_delta = 0.99), parameters { cores = 4, iter = 400) real alpha; // overall mean real beta; // time beta # truth: alpha = 8.5, beta = -0.10, phi = 10 real<lower=0> phi; // dispersion samples <- rstan::extract(simd2_m2, pars = c('alpha', } 'beta', model { 'phi')) vector[N] eta; > lapply(samples, quantile) // linear predictor $alpha eta = alpha + t * beta; 0% 25% 50% 75% 100% // priors 8.3 8.4 8.5 8.6 8.8 alpha ~ normal(0, 10); $beta phi ~ cauchy(0, 2.5); 0% 25% 50% 75% 100% beta ~ normal(0, 1); -0.107 -0.102 -0.100 -0.099 -0.092 // likelihood $phi y ~ neg_binomial_2_log(eta, phi); 0% 25% 50% 75% 100% } 6.2 10.1 11.5 13.0 24.1

Looking at Posterior Draws > pairs(simd2_m2)

Diagnostics with Shinystan

Product Pricing Introduction to Pooling and Hirarchical Models

We Have Multiple Products, Authors, Genres

Hierarchical Pooling in One Slide Average sales across all books Average sales for book j Number of observations for book j n j 1 y ¯ y j + α ¯ y all σ 2 σ 2 α multilevel b ≈ n j j 1 y + σ 2 σ 2 α Indexes books Estimate of average sales for book j Within-book variance Variance among average sales of different books

Multi-Level Models using Lmer Syntax

Fitting Multi-Level Models in rstanarm “Fixed Effects” fit <- stan_glmer.nb(qty ~ product_age + price + price_sqr + (1 + product_age + price + price_sqr | product), algorithm = "sampling", Random Seed seed = 123, cores = 4, iter = 600, 1 Core per data = data) Varying Chain Intercepts Number of Varying Iterations per Slopes Fit using Chain MCMC

Introduction to Stan and Bayesian Inference Paris Machine Learning - PowerPoint PPT Presentation

Introduction to Stan and Bayesian Inference Paris Machine Learning Meetup Dataiku User Meetup 21 September 2016 Eric Novik enovik@stan.fit Outline Why should you bother with Bayes Why should you use Stan Introduction to modern

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

An Introduction to Stan and RStan Introduction I (MW) am not a developer of Stan , only a very

Stan Software Ecosystem for Modern Bayesian Inference Course materials:

Basics of Bayesian Inference A frequentist thinks of unknown parameters as fixed Basics of

Fast Bayesian modeling in Stan using StataStan mc-stan.org Robert Grant Kingston University +

Inference in Bayesian networks Chapter 14.45 Chapter 14.45 1 Outline Exact inference

Photobioreactor system case-study Tom a s Stan ek Tom a s Stan ek

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model

Analytics, Inference and Computation in Cosmology: Exercises on Bayesian Inference Roberto

EST5104 Bayesian Inference EST5803 Advanced Bayesian Inference Ricardo Ehlers ehlers@icmc.usp.br

Machine Learning: Foundations Lecturer: Yishay Mansour Lecture 2 Bayesian Inference Kfir Bar

Approximate Bayesian inference for latent Gaussian models avard Rue 1 H Department of

using Stan mc-stan.org About the speaker Robert Grant is senior lecturer in health & social

CS 730/730W/830: Intro AI Bayesian Networks Approx. Inference Exact Inference 1 handout: slides

The Political Economy of (De)Regulation: Theory and Evidence from the U.S. Electricity Market.

Moses Obinyeluaku & Christopher Sako 20 May 2014 1/ Manufacturing Led Growth for Employment

ACA Sustainability, Productivity Growth and the Complex Relationship between Medicare and Private

Embargo: 14 December 2017, 6:00 CET Dr Andreas Dombret Member of the Executive Board of the

Introduction to Bayesian Statistics Louis Raes Spring 2017 Table of contents Organisation,

An Introduction to Title Maximum Pawel Zabczyk Likelihood and

Estimation of behavioural Parameters of CGE Models For the 28 EU Countries Second Bwanakare

Methods for Small Area Analyses of Spatial and Space-time Data Evan Carey Robert Penfold

Introduction to Stan and Bayesian Inference Paris Machine Learning - PowerPoint PPT Presentation

Introduction to Stan and Bayesian Inference Paris Machine Learning Meetup Dataiku User Meetup 21 September 2016 Eric Novik enovik@stan.fit Outline Why should you bother with Bayes Why should you use Stan Introduction to modern

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

An Introduction to Stan and RStan Introduction I (MW) am not a developer of Stan , only a very

Stan Software Ecosystem for Modern Bayesian Inference Course materials:

Basics of Bayesian Inference A frequentist thinks of unknown parameters as fixed Basics of

Fast Bayesian modeling in Stan using StataStan mc-stan.org Robert Grant Kingston University +

Inference in Bayesian networks Chapter 14.45 Chapter 14.45 1 Outline Exact inference

Photobioreactor system case-study Tom a s Stan ek Tom a s Stan ek

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model

Analytics, Inference and Computation in Cosmology: Exercises on Bayesian Inference Roberto

EST5104 Bayesian Inference EST5803 Advanced Bayesian Inference Ricardo Ehlers ehlers@icmc.usp.br

Machine Learning: Foundations Lecturer: Yishay Mansour Lecture 2 Bayesian Inference Kfir Bar

Approximate Bayesian inference for latent Gaussian models avard Rue 1 H Department of

using Stan mc-stan.org About the speaker Robert Grant is senior lecturer in health &amp; social

CS 730/730W/830: Intro AI Bayesian Networks Approx. Inference Exact Inference 1 handout: slides

The Political Economy of (De)Regulation: Theory and Evidence from the U.S. Electricity Market.

Moses Obinyeluaku &amp; Christopher Sako 20 May 2014 1/ Manufacturing Led Growth for Employment

ACA Sustainability, Productivity Growth and the Complex Relationship between Medicare and Private

Embargo: 14 December 2017, 6:00 CET Dr Andreas Dombret Member of the Executive Board of the

Introduction to Bayesian Statistics Louis Raes Spring 2017 Table of contents Organisation,

An Introduction to Title Maximum Pawel Zabczyk Likelihood and

Estimation of behavioural Parameters of CGE Models For the 28 EU Countries Second Bwanakare

Methods for Small Area Analyses of Spatial and Space-time Data Evan Carey Robert Penfold

using Stan mc-stan.org About the speaker Robert Grant is senior lecturer in health & social

Moses Obinyeluaku & Christopher Sako 20 May 2014 1/ Manufacturing Led Growth for Employment