Computational Bayesian data analysis Bruno Nicenboim / Shravan - PowerPoint PPT Presentation

Computational Bayesian data analysis Bruno Nicenboim / Shravan Vasishth 2020-03-11 1

Bayesian Regression Models using ‘Stan’: brms Examples 1: A single participant pressing a button repeatedly (A simple linear model) Prior predictive distributions The influence of priors: sensitivity analysis Posterior predictive distributions Comparing different likelihoods: The log-normal likelihood 2

• Deriving the posterior distribution analytically is possible for only a very limited number of cases. • The denominator, the marginal likelihood, requires us to integrate the numerator: 𝑞(Θ|𝑧) = 𝑞(𝑧|Θ) ⋅ 𝑞(Θ) ∫ (1) 3 Θ 𝑞(𝑧|Θ) ⋅ 𝑞(Θ)𝑒Θ

Alternative: Deriving the posterior through sampling We want to derive the posterior distribution of the Cloze probability of “umbrella” , 𝜄 : • Data: a word (e.g., “umbrella” ) was answered 80 out of 100 times, • Likelihood: a binomial distribution • Prior for 𝜄 : 𝐶𝑓𝑢𝑏(𝑏 = 4, 𝑐 = 4) 4

We sample from the posterior distribution of 𝜄 : • We use a probabilistic programming language, • given enough samples we will have a good approximation of the real posterior distribution, • say we got 20000 samples from the posterior distribution of the Cloze probability, 𝜄 : 0.712, 0.838, 0.767, 0.743, 0.732, 0.804, 0.738, 0.832, 0.851, 0.816, 0.771, 0.817, 0.721, 0.705, 0.827, 0.808, 0.776, 0.823, 0.786, 0.78, … 5

The approximation of the posterior looks quite similar to the real posterior. 1 and -0.00000003 respectively 1 The difference between the true and the approximated mean and variance are 0.0002 density plot of the exact posterior in red. Figure 1: Histogram of the samples of 𝜄 from the posterior distribution calculated through sampling in gray; 6 10.0 7.5 density 5.0 2.5 0.0 0.6 0.7 0.8 0.9 theta

Computational Bayesian data analysis: Why now? • increase in computing power • appearance of probabilistic programming languages: WinBUGS (Lunn et al. 2000), JAGS (Plummer 2016), and more recently pymc3 (Salvatier, Wiecki, and Fonnesbeck 2016) and Stan (Carpenter et al. 2017). Easier alternatives based on Stan: • rstanarm (Goodrich et al. 2018) • brms (Bürkner 2019) 7

Bayesian Regression Models using ‘Stan’: brms

Load the following: library (brms) library (tictoc) library (bayesplot) options (mc.cores = parallel ::detectCores ()) ## Parallelize the chains using all the cores: rstan_options (auto_write = TRUE) ## Save compiled models: library (ggplot2) set.seed (42) library (readr) library (purrr) library (tidyr) library (dplyr) ## be careful to load dplyr after MASS library (MASS) 8

Examples 1: A single participant pressing a button repeatedly (A simple linear model)

We have data from a participant repeatedly pressing the space bar as fast as possible, without paying attention to any stimuli. Data: reaction times in milliseconds in each trial Question: How long does it take to press a key when there is no decision involved? 9

Assumptions: 1. There is a true underlying time, 𝜈 , that the participant needs to press the space bar. 2. There is some noise in this process. 3. The noise is normally distributed (this assumption is questionable given that reaction times are generally skewed; we fix this assumption later). 10

Formal model: Likelihood for each observation 𝑜 : (2) (Bad) priors: 𝜈 ∼ 𝑉𝑜𝑗𝑔𝑝𝑠𝑛(0, 60000) 𝜏 ∼ 𝑉𝑜𝑗𝑔𝑝𝑠𝑛(0, 2000) (3) 11 𝑠𝑢 𝑜 ∼ 𝑂𝑝𝑠𝑛𝑏𝑚(𝜈, 𝜏)

Fitting the model 138 ## # ... with 356 more rows 5 126 ## 5 4 132 ## 4 3 128 ## 3 2 ## 2 We’ll first load the data from data/button_press.csv : 1 141 ## 1 <dbl> <dbl> ## rt trialn ## ## # A tibble: 361 x 2 df_noreading_data read_csv ("./data/button_press.csv") df_noreading_data <- 12

ggplot (df_noreading_data, aes (rt)) + geom_density () + Figure 2: Visualizing the data 13 ggtitle ("Button-press data") Button−press data 0.020 0.015 density 0.010 0.005 0.000 100 200 300 400 rt

Specifying the model in brms fit_press <- brm (rt ~ 1, data = df_noreading_data, family = gaussian (), prior = c ( prior ( uniform (0, 60000), class = Intercept), prior ( uniform (0, 2000), class = sigma) ), chains = 4, iter = 2000, warmup = 1000 ) 14

Sampling and convergence in a nutshell 5. from that point onwards the samples will belong to the posterior. Figure 3: Trace plot of the brms model 1. Chains start in random locations; 15 4. eventually, the chains end up in the vicinity of the posterior distribution; 2. in each iteration they take one sample each; 3. samples at the beginning do not belong to the posterior distribution; mu 300 200 Warm−up 100 chain Sample value 1 0 2 sigma 300 3 4 200 Warm−up 100 0 0 500 1000 1500 2000 Iteration number

16 Figure 4: Trace plot of a model that did not converge. mu 2000 1500 1000 500 chain Warm−up Sample value 0 1 2 sigma 3 2000 4 1500 1000 500 Warm−up 0 0 500 1000 1500 2000 Iteration number

Output of brms $ sigma -1688 -1688 -1690 -1690 -1688 ... : num $ lp__ ## 24.9 25.2 24.3 23.6 25.2 ... : num ## posterior_samples (fit_press) %>% str () 167 168 171 171 168 ... $ b_Intercept: num ## 3 variables: 4000 obs. of ## 'data.frame': 17

Output of brms plot (fit_press) 18 b_Intercept b_Intercept 174 172 0.2 170 168 0.1 166 Chain 164 0.0 1 164 166 168 170 172 0 200 400 600 800 1000 2 sigma sigma 3 0.4 28 4 0.3 26 0.2 24 0.1 22 0.0 22 24 26 28 0 200 400 600 800 1000

Output of brms 23.26 2654 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat ## sigma 25.00 0.93 26.91 1.00 ## Intercept ## Bulk_ESS Tail_ESS ## sigma 4184 2769 ## ## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). 3295 Bulk_ESS Tail_ESS fit_press ## # posterior_summary(fit_press) is also useful ## Family: gaussian ## Links: mu = identity; sigma = identity ## Formula: rt ~ 1 ## Data: df_noreading_data (Number of observations: 361) ## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000 ## ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat ## Intercept 168.66 1.28 166.24 171.20 1.00 19

Output of brms Notice that the Estimate is just the mean of the posterior sample, and CI are the 95% quantiles: posterior_samples (fit_press) $ b_Intercept %>% mean () ## [1] 169 posterior_samples (fit_press) $ b_Intercept %>% quantile ( c (0.025, .975)) ## 2.5% 98% ## 166 171 20

Exercises 3.8.1.1 Fit the model fit_press with just a few of iterations? What happens? 3.8.1.2 Using uniform distributions, choose priors that represent better your assumptions about reaction times. What happens with the new model? 21

Important questions 1. What information are the priors encoding? Do the priors make sense? 2. Does the likelihood assumed in the model make sense for the data? 22

Prior predictive distributions

Prior predictive distributions We want to know the density 𝑞(⋅) of data points 𝑧 1 , … , 𝑜 , given a vector of priors Θ (e.g., Θ = ⟨𝜈, 𝜏⟩ ) The prior predictive density is: 𝑞(𝑧 1 , … , 𝑧 𝑜 ) = ∫ 𝑞(𝑧|Θ) ⋅ 𝑞(𝑧 2 |Θ) ⋯ 𝑞(𝑧 𝑜 |Θ)𝑞(Θ) 𝑒Θ (4) We avoid doing the integration by generating samples from the prior distribution. We repeat the following: 1. Take one sample from each of the priors. 2. Plug those samples in the likelihood and generate a dataset 𝑧 𝑞𝑠𝑓𝑒 1 , … , 𝑧 𝑞𝑠𝑓𝑒 𝑜 . 23

normal_predictive_distribution <- function (mu_samples, sigma_samples, N_obs) { df_pred, } df_pred } ) ) iter = i rt_pred = rnorm (N_obs, mu, sigma), trialn = seq_len (N_obs), # 1, 2,... N_obs tibble ( sigma <- sigma_samples[i] # empty data frame with headers: mu <- mu_samples[i] for (i in seq_along (mu_samples)) { # the length of the sigma_samples: # which we assume is identical to # i iterates from 1 to the length of mu_samples, ) iter = numeric (0) rt_pred = numeric (0), trialn = numeric (0), df_pred <- tibble ( 24 df_pred <- bind_rows (

This approach works, but it’s quite slow: 4 48271. 1 ## 3 3 50291. 1 ## 4 48784. ## 2 1 ## 5 5 49350. 1 ## # ... with 3.61e+05 more rows ## 3.711 sec elapsed 2 1 tic () ## N_obs <- nrow (df_noreading_data) normal_predictive_distribution (mu_samples = mu_samples, sigma_samples = sigma_samples, N_obs = N_obs) toc () 44587. ## # A tibble: 361,000 x 3 trialn rt_pred iter ## <dbl> <dbl> <dbl> ## 1 1 25 N_samples <- 1000 mu_samples <- runif (N_samples, 0, 60000) sigma_samples <- runif (N_samples, 0, 2000)

Computational Bayesian data analysis Bruno Nicenboim / Shravan - PowerPoint PPT Presentation

Computational Bayesian data analysis Bruno Nicenboim / Shravan Vasishth 2020-03-11 1 Bayesian Regression Models using Stan: brms Examples 1: A single participant pressing a button repeatedly (A simple linear model) Prior predictive

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

CS 331: Bayesian Networks 2 1 Bayesian Networks Youve heard about how Bayesian networks

Introduction to Bayesian Inference Frank Wood April 6, 2010 Introduction Overview of Topics

A simple Bayesian regression model Alicia Johnson Associate Professor, Macalester College

Part 7 Bayesian hierarchical modelling, simulation and MCMC by Gero Walter 252 Bayesian

Case Study: Bayesian Linear Regression and Sparse Bayesian Models Piyush Rai Dept. of CSE, IIT

AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG)

Bayesian Networks Youve heard about how Bayesian networks have revolutionized AI

Lecture 6. Bayesian estimation Lecture 6. Bayesian estimation 1 (172) 6. Bayesian estimation

Bayesian networks (2) Lirong Xia Last class Bayesian networks compact, graphical

Bayesian Phylogenetics Mark Holder (with big thanks to Paul Lewis) Outline Intro What is

Probabilistic Modeling: Bayesian Networks Bioinformatics: Sequence Analysis COMP 571 - Spring

Lognormals and friends Lognormals Empirical Confusability Principles of Complex Systems Random

Approximations of the Laplace Transform of a Lognormal Random Variable Leonardo Rojas Nandayapa

A Brief History of Lognormal and Power Law Distributions and an Application to File Size

Lecture 5 Linear Models Lin ZHANG, PhD School of Software Engineering Tongji University Fall

Welfare, Inequality & Poverty, # 2 1 Arthur CHARPENTIER - Welfare, Inequality and Poverty

Phase Fluctuations and Sign Problems Michael Wagman MIT Lattice 2018 East Lansing, Michigan

Mixture Models Simulation-based Estimation Michel Bierlaire michel.bierlaire@epfl.ch

New approaches for statistical modelling Jelena Jockovi c ADVISORS: Pepa Ram rez Cobo,

Computational Bayesian data analysis Bruno Nicenboim / Shravan - PowerPoint PPT Presentation

Computational Bayesian data analysis Bruno Nicenboim / Shravan Vasishth 2020-03-11 1 Bayesian Regression Models using Stan: brms Examples 1: A single participant pressing a button repeatedly (A simple linear model) Prior predictive

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

CS 331: Bayesian Networks 2 1 Bayesian Networks Youve heard about how Bayesian networks

Introduction to Bayesian Inference Frank Wood April 6, 2010 Introduction Overview of Topics

A simple Bayesian regression model Alicia Johnson Associate Professor, Macalester College

Part 7 Bayesian hierarchical modelling, simulation and MCMC by Gero Walter 252 Bayesian

Case Study: Bayesian Linear Regression and Sparse Bayesian Models Piyush Rai Dept. of CSE, IIT

AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG)

Bayesian Networks Youve heard about how Bayesian networks have revolutionized AI

Lecture 6. Bayesian estimation Lecture 6. Bayesian estimation 1 (172) 6. Bayesian estimation

Bayesian networks (2) Lirong Xia Last class Bayesian networks compact, graphical

Bayesian Phylogenetics Mark Holder (with big thanks to Paul Lewis) Outline Intro What is

Probabilistic Modeling: Bayesian Networks Bioinformatics: Sequence Analysis COMP 571 - Spring

Lognormals and friends Lognormals Empirical Confusability Principles of Complex Systems Random

Approximations of the Laplace Transform of a Lognormal Random Variable Leonardo Rojas Nandayapa

A Brief History of Lognormal and Power Law Distributions and an Application to File Size

Lecture 5 Linear Models Lin ZHANG, PhD School of Software Engineering Tongji University Fall

Welfare, Inequality &amp; Poverty, # 2 1 Arthur CHARPENTIER - Welfare, Inequality and Poverty

Phase Fluctuations and Sign Problems Michael Wagman MIT Lattice 2018 East Lansing, Michigan

Mixture Models Simulation-based Estimation Michel Bierlaire michel.bierlaire@epfl.ch

New approaches for statistical modelling Jelena Jockovi c ADVISORS: Pepa Ram rez Cobo,

Welfare, Inequality & Poverty, # 2 1 Arthur CHARPENTIER - Welfare, Inequality and Poverty