Part 7 Bayesian hierarchical modelling, simulation and MCMC (PowerPoint PPT Presentation)


SLIDE 1

Wednesday 14:00-17:30

Part 7 Bayesian hierarchical modelling, simulation and MCMC

by Gero Walter

SLIDE 2

Bayesian hierarchical modelling, simulation and MCMC

Outline

Bayesian hierarchical modelling / Bayesian networks / graphical models
Exercises I
Simulation & MCMC
Exercises II

SLIDE 3

Bayesian Hierarchical Modelling, a.k.a. Bayesian (Belief) Networks, a.k.a. Graphical Models

◮ many names for the same thing (it's a powerful tool);
  I will use the term Bayesian Networks (BNs)
◮ BNs as a unifying way to think about (Bayesian) statistical models
◮ how to build complex Bayesian models out of simple building blocks
◮ how to specify joint distributions (over many variables) via univariate distributions, using conditional independence assumptions
◮ conditional independence assumptions are visualized by a graph
◮ the graph can establish a hierarchy between variables

SLIDE 4

Bayesian Networks: Simple Example

[Figure: DAG with nodes x, θ and fixed values n, n(0), y(0)]

f(x | θ) ∼ Binomial(n, θ)
f(θ | n(0), y(0)) ∼ Beta(n(0), y(0))

Legend: circle = variables / parameters; square = fixed values; arrow a → y = "y depends on a"; plate around xi, i = 1, . . . , n = repeated nodes
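Because the Beta prior is conjugate to the Binomial likelihood, the posterior of θ is available in closed form. A minimal Python sketch with hypothetical counts; note it uses the standard shape parametrization Beta(α, β), which may differ from the slide's Beta(n(0), y(0)) parametrization:

```python
# Conjugate Beta-Binomial update with hypothetical numbers; the slide's
# Beta(n(0), y(0)) may use a prior-strength / prior-mean parametrization,
# while this sketch uses standard shape parameters (alpha, beta).
alpha0, beta0 = 2.0, 2.0      # prior pseudo-counts
n, successes = 20, 13         # Binomial data: 13 successes in n = 20 trials

# posterior shape parameters: add observed successes and failures
alpha_post = alpha0 + successes
beta_post = beta0 + (n - successes)
post_mean = alpha_post / (alpha_post + beta_post)   # posterior E[theta]
```

The posterior mean is a weighted compromise between the prior mean and the observed success fraction.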

SLIDE 9

Another Example: Linear Regression

[Figure: DAG with plate over i = 1, . . . , n containing xi and yi; parents β1, β2, τ; fixed hyperparameters a, b, m1, t1, m2, t2]
graph: conditional independence relations
variables: conditional distributions

yi = β1 + xi β2 + εi, where εi iid ∼ N(0, σ2)
yi | µi, τ ∼ N(µi, 1/τ), where µi = β1 + xi β2
◮ yi | β1, β2, τ ∼ N(β1 + xi β2, 1/τ)
τ | a, b ∼ Gamma(a, b), with a = b = 10^-3
β1 | m1, t1 ∼ N(m1, 1/t1)
β2 | m2, t2 ∼ N(m2, 1/t2), with m1 = m2 = 0, t1 = t2 = 10^4
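One way to read the DAG is as a recipe for forward (ancestral) simulation: sample each node given its parents, top of the graph first. A Python sketch with hypothetical covariate values; a = b = 2 is used as a stand-in for the slides' diffuse 10^-3, because Gamma(10^-3, 10^-3) draws can underflow to 0 in a quick simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in hyperparameters (the slides use a = b = 1e-3, which is numerically
# fragile for a one-off forward draw)
a = b = 2.0
m1 = m2 = 0.0
t1 = t2 = 1e4   # read here as precisions: beta_j ~ N(m_j, 1/t_j)

n = 5
x = np.linspace(0.0, 1.0, n)   # hypothetical covariate values

# ancestral sampling: each node is drawn conditional on its parents
tau = rng.gamma(shape=a, scale=1.0 / b)       # tau | a, b ~ Gamma(a, b)
beta1 = rng.normal(m1, np.sqrt(1.0 / t1))     # beta1 | m1, t1 ~ N(m1, 1/t1)
beta2 = rng.normal(m2, np.sqrt(1.0 / t2))     # beta2 | m2, t2 ~ N(m2, 1/t2)
mu = beta1 + x * beta2                        # mu_i = beta1 + x_i beta2
y = rng.normal(mu, np.sqrt(1.0 / tau))        # y_i | mu_i, tau ~ N(mu_i, 1/tau)
```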

SLIDE 13

Bayesian Networks: Why?

◮ conditional independence reduces model complexity
  (10 binary variables: joint distribution has 1023 parameters)
◮ BNs enable us to construct probability distributions that capture the important dependencies between the relevant variables in a given inference problem while keeping models (relatively) simple
◮ often: BN = discrete variables only, distributions defined via conditional probability tables (CPTs)
◮ What kind of graphs work for expressing conditional independence relations?
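The "1023 parameters" remark is simple counting, and the savings from conditional independence can be made concrete. A sketch comparing the full joint over 10 binary variables with a hypothetical BN in which each node has at most one parent (a chain):

```python
# A joint over K binary variables needs 2^K - 1 free probabilities.
# A chain-structured BN needs: 1 entry for the root's CPT, plus, for each
# of the K-1 children, one free probability per parent value (2 entries).
K = 10
full_joint = 2**K - 1            # 1023 free parameters
chain_bn = 1 + 2 * (K - 1)       # 19 free parameters
```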

SLIDE 16

Bayesian Networks: Directed Acyclic Graphs

Definition (Directed Graph)
A directed graph G = (V, E) consists of a set of vertices V and a set of edges E, where E ⊂ V × V. An arrow leads from u ∈ V to v ∈ V if and only if (u, v) ∈ E; u is the source and v is the target of edge (u, v).

Definition (Paths and Cycles)
A path in a graph is an ordered set of edges {ei} such that t(ei) = s(ei+1) (a chain of head-to-tail arrows). A cycle is a path such that t(eN) = s(e1), where N is the number of edges in the path.

Definition (Directed Acyclic Graph)
A directed acyclic graph (DAG) is a directed graph that does not contain a cycle, i.e. there does not exist a subset of edges that forms a cycle.
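The acyclicity condition can be checked mechanically: a directed graph is a DAG exactly when a topological order exists. A sketch using Kahn's algorithm (repeatedly peel off vertices with no remaining incoming edges); the function name and edge-list representation are illustrative choices:

```python
from collections import defaultdict

def is_dag(vertices, edges):
    """Kahn's algorithm: the graph is acyclic iff every vertex can be
    removed in topological order (in-degree reaching zero)."""
    indeg = {v: 0 for v in vertices}
    out = defaultdict(list)
    for u, v in edges:          # edge (u, v): arrow from source u to target v
        out[u].append(v)
        indeg[v] += 1
    queue = [v for v in vertices if indeg[v] == 0]
    removed = 0
    while queue:
        u = queue.pop()
        removed += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return removed == len(vertices)   # leftovers mean a cycle exists
```

For example, the chain 1 → 2 → 3 is a DAG, but adding the edge (3, 1) creates a cycle.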

SLIDE 18

Bayesian Networks: Formal Definition

Definition (Bayesian Network)
Given a DAG G = (V, E), and variables xV = {xv}v∈V, a Bayesian network with respect to G and xV is a joint probability distribution for the xV of the form

f(xV) = ∏v∈V f(xv | xpa(v))

where pa(v) is the set of parents of v, i.e. the set of vertices u such that (u, v) is an edge.

◮ the joint distribution factorizes according to the graph!
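The factorization can be verified on a tiny discrete example. A sketch of a three-node chain A → B → C with hypothetical CPTs: the joint is the product of each node's conditional given its parents, and summing it over all configurations returns 1:

```python
from itertools import product

# hypothetical CPTs for the chain A -> B -> C (binary variables)
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """f(a, b, c) = f(a) f(b | a) f(c | b): product over nodes given parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
```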

SLIDE 21

Bayesian Networks: Factorization of the joint

One can always factorize a joint distribution by

f(x1, . . . , xK) = f(xK | x1, . . . , xK−1) f(xK−1 | x1, . . . , xK−2) · · · f(x3 | x1, x2) f(x2 | x1) f(x1)

◮ corresponds to a fully connected graph: there is a link between every pair of vertices (each of the K vertices has incoming edges from all lower-numbered vertices)
◮ sparser graph ⇒ nodes have fewer parents ⇒ less complex joint
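The chain-rule factorization above holds for any joint, which can be checked numerically: take an arbitrary joint table over three binary variables, form the conditionals by normalization, and multiply them back together (a sketch with a randomly generated table):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()                    # an arbitrary joint f(x1, x2, x3)

f1 = p.sum(axis=(1, 2))                          # f(x1)
f2_g1 = p.sum(axis=2) / f1[:, None]              # f(x2 | x1)
f3_g12 = p / p.sum(axis=2, keepdims=True)        # f(x3 | x1, x2)

# reassemble: f(x1) f(x2 | x1) f(x3 | x1, x2) recovers the joint exactly
recon = f1[:, None, None] * f2_g1[:, :, None] * f3_g12
```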

SLIDE 25

Factorization of the joint: Example

[Figure: regression DAG with yi, τ, i = 1, . . . , n, xi, β1, β2, a, b, m1, t1, m2, t2]

Omitting the fixed values in notation: the joint distribution

f(y1, . . . , yn, β1, β2, τ) = [∏i=1,...,n f(yi | β1, β2, τ)] × f(β1) f(β2) f(τ)
                               (likelihood)                    (prior)

is ∝ posterior f(β1, β2, τ | y1, . . . , yn)

◮ it would be really useful to get posterior estimates based on the non-normalized density f(y1, . . . , yn, β1, β2, τ)!
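The non-normalized density is exactly what MCMC software evaluates, and it is easy to code up directly on the log scale. A sketch for the regression example with hypothetical data, using the slides' hyperparameter values as written and reading N(m, 1/t) as mean/variance (if the slides intend BUGS-style mean/precision, the role of t flips):

```python
import numpy as np

# hypothetical data
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# hyperparameters as on the slides
a = b = 1e-3
m1 = m2 = 0.0
t1 = t2 = 1e4

def norm_logpdf(z, mean, prec):
    return 0.5 * np.log(prec / (2 * np.pi)) - 0.5 * prec * (z - mean) ** 2

def gamma_logpdf_unnorm(tau, a, b):
    return (a - 1) * np.log(tau) - b * tau   # up to the normalizing constant

def log_joint(beta1, beta2, tau):
    """log of the non-normalized posterior density, i.e. the log joint
    f(y1, ..., yn, beta1, beta2, tau), fixed values omitted from notation."""
    mu = beta1 + x * beta2
    loglik = np.sum(norm_logpdf(y, mu, tau))
    logprior = (norm_logpdf(beta1, m1, t1) + norm_logpdf(beta2, m2, t2)
                + gamma_logpdf_unnorm(tau, a, b))
    return loglik + logprior
```

Only ratios (differences of logs) of this function are needed by the samplers later in the block, so the unknown normalization constant never matters.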

SLIDE 30

Bayesian Networks: Inference

◮ Markov Chain Monte Carlo: simulate samples from the joint ∝ posterior (◮ next block!)
◮ can get any distribution for any (set of) variables in the graph by conditioning and marginalizing the joint
◮ for a set of M samples from the joint, {β1^m, β2^m, τ^m}, m = 1, . . . , M:
◮ marginalizing = use only, e.g., {β1^m}, m = 1, . . . , M
◮ conditioning = use only samples m with the right value of the conditioning parameter(s) (or redo the sampling with fixed conditioned values)
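Marginalizing and conditioning on a sample array are one-liners. A sketch with a toy stand-in for MCMC output (correlated draws from a hypothetical multivariate normal, not a real sampler run), where conditioning on β2 visibly shifts the β1 summary because the two are correlated:

```python
import numpy as np

rng = np.random.default_rng(2)

# stand-in for M joint samples of (beta1, beta2, tau)
M = 10_000
samples = rng.multivariate_normal(
    mean=[1.0, 2.0, 0.5],
    cov=[[1.0, 0.6, 0.0], [0.6, 1.0, 0.0], [0.0, 0.0, 0.1]],
    size=M,
)
beta1, beta2, tau = samples.T

# marginalizing: simply ignore the other columns
beta1_marginal_mean = beta1.mean()

# conditioning: keep only samples where beta2 is near the conditioning value
keep = np.abs(beta2 - 2.5) < 0.1
beta1_conditional_mean = beta1[keep].mean()
```

Because continuous variables never hit a conditioning value exactly, the sketch keeps samples in a small window; in practice one would redo the sampling with the conditioned value fixed, as the slide notes.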

SLIDE 34

Bayesian Networks with Imprecise Probability

◮ use sets of conditional distributions at nodes: credal networks (see, e.g., [2, §10], [17])
◮ specific algorithms for discrete credal networks (see, e.g., [2, §10.5.3], or [14])
◮ conditional independence with IP gets very non-trivial (see, e.g., [2, §4] for the gory details)
◮ here: do sensitivity analysis by varying prior distributions in sets: f(β1) ∈ Mβ1, . . .
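The sensitivity-analysis idea can be sketched in a few lines: run the same conjugate update under every prior in a finite set and report the range of posterior answers. This uses the Beta-Binomial example as a hypothetical stand-in for varying f(β1) in a set Mβ1:

```python
# sensitivity analysis over a finite set of Beta priors for a Binomial
# proportion: report lower/upper posterior means over the prior set
n, successes = 20, 13

prior_set = [(a0, b0)
             for a0 in (0.5, 1.0, 2.0, 4.0)
             for b0 in (0.5, 1.0, 2.0, 4.0)]

# conjugate posterior mean under prior Beta(a0, b0)
post_means = [(a0 + successes) / (a0 + b0 + n) for a0, b0 in prior_set]
lower, upper = min(post_means), max(post_means)
```

The interval [lower, upper] is a crude finite-set version of the bounds a credal network would deliver; a genuine imprecise treatment would optimize over a full set of priors.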

SLIDE 35

Other Graph-Based Methods: SEM, Path Analysis

[Figure: customer satisfaction path diagram with latent constructs Image, Expectation, Quality, Value, Satisfaction, Complaints, Loyalty and indicators IMAG1-IMAG5, CUEX1-CUEX3, PERQ1-PERQ7, PERV1-PERV2, CUSA1-CUSA3, CUSCO, CUSL1-CUSL3]


SLIDE 44

Other Graph-Based Methods: SEM, Path Analysis

◮ Structural Equation Modeling (SEM, a.k.a. path modeling) uses graphs like BNs, but is something different
◮ used to estimate latent constructs by assuming linear relationships with measurements (measurement / outer model) and relationships between latent constructs (structural model)
◮ example: customer satisfaction is measured by survey questions 1 and 2 (measurement model); brand loyalty is a function of customer satisfaction (structural model)
◮ estimation of factor loadings (= regression coefficients)
◮ likelihood-based (R package lavaan): models expectations and (co)variances, not full distributions (→ multivariate normal)
◮ Bayesian SEM: R package blavaan
◮ partial least squares (R package semPLS): iterative fitting of latent variable values and regression coefficients via least squares
◮ path analysis: special case where a measurement can be linked to only one construct

SLIDE 45

Bayesian hierarchical modelling, simulation and MCMC

Outline

Bayesian hierarchical modelling / Bayesian networks / graphical models
Exercises I
Simulation & MCMC
Exercises II

SLIDE 46

Exercise 1: Factorization of a Joint

Which factorization of f({xi}i∈[1,...,7]) does this graph encode?

[Figure: DAG over x1, . . . , x7]

SLIDE 47

Exercise 2: Naive Bayes Classifier

The naive Bayes classifier from Part 6 assumes that the joint distribution of class c and attributes a1, . . . , ak can be factorized as

p(c, a) = p(c) p(a | c) = p(c) ∏i=1,...,k p(ai | c).

Draw the corresponding DAG! (Hint: use either a plate or consider two attributes a1 and a2 only.)

SLIDE 48

Exercise 3: Naive Bayes Classifier with Dirichlet Priors

We can introduce parameters for p(c) and p(ai | c):

(n(c))c∈C ∼ Multinomial(θc; c ∈ C)   (36)
∀c ∈ C : (n(ai, c))ai∈Ai | c ∼ Multinomial(θai|c; ai ∈ Ai)   (37)

where C denotes the set of all possible class values, and Ai denotes the set of all possible values of attribute i. The θ parameters can be estimated using a Dirichlet prior:

(θc)c∈C ∼ Dir(s, (t(c))c∈C)   (38)
∀c ∈ C : (θai|c)ai∈Ai | c ∼ Dir(s, (t(ai, c))ai∈Ai)   (39)

where we must have that ∑ai∈Ai t(ai, c) = t(c). [Note that t(c) is the prior expectation of θc and t(ai, c)/t(c) is the prior expectation of θai|c.]

Draw the corresponding graph!

SLIDE 49

Exercise 4: Sensitivity Analysis

[Figure: regression DAG with yi, τ, i = 1, . . . , n, xi, β1, β2, a, b, m1, t1, m2, t2]

In the linear regression example there are 6 hyperparameters m1, t1, m2, t2, a, b. How would you do sensitivity analysis over the prior in that example? What problems do you foresee?

SLIDE 50

Bayesian hierarchical modelling, simulation and MCMC

Outline

Bayesian hierarchical modelling / Bayesian networks / graphical models
Exercises I
Simulation & MCMC
Exercises II

SLIDE 59

Simulation & Markov Chain Monte Carlo: What? Why?

◮ BNs allow us to formulate complex models
◮ complex variance structures, . . .
◮ joint ∝ posterior usually intractable: how to do inference?
◮ simulate samples from joint / posterior: approximate . . .
◮ . . . posterior cdf by empirical cdf (density: kernel density estimate)
◮ . . . posterior expectation by sample mean
◮ . . . any function of posterior parameters by sample equivalent
◮ first: quick look at sampling from univariate distributions
◮ then: MCMC for sampling from multivariate distributions
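The "approximate by sample equivalent" bullets are all the same move: replace the posterior by the empirical distribution of the draws. A sketch using normal draws as a hypothetical stand-in for a posterior sample:

```python
import numpy as np

rng = np.random.default_rng(3)
draws = rng.normal(loc=1.5, scale=0.5, size=50_000)  # stand-in posterior sample

# empirical cdf at a point: the fraction of draws at or below it
ecdf_at_2 = np.mean(draws <= 2.0)

# posterior expectation ~ sample mean; any functional via its sample analogue
post_mean = draws.mean()
post_prob_positive = np.mean(draws > 0)
post_median = np.quantile(draws, 0.5)
```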

SLIDE 67

Monte Carlo Estimation: Why does it work?

◮ want to estimate E(g(X)) = ∫ g(x) f(x | . . .) dx
◮ Monte Carlo sample x1, . . . , xM (M samples drawn from f(x | . . .))
◮ estimate E(g(X)) by Ê(g(X)) = (1/M) ∑i=1,...,M g(xi)
◮ unbiased: E(Ê(g(X))) = E(g(X))
◮ variance: Var(Ê(g(X))) = (1/M) Var(g(X)), for independent samples only!
◮ precision of the MC estimate increases with M, independent of parameter dimension! (numeric integration: number of evaluation points increases exponentially with dimension)
◮ limM→∞ Ê(g(X)) → E(g(X)) almost surely (strong law of large numbers)
◮ Ê(g(X)) approximately ∼ N(E(g(X)), (1/M) Var(g(X))) (central limit theorem)
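The 1/M variance claim is easy to see empirically: replicate the estimator at two sample sizes and compare the spread of the replicates. A sketch estimating E(X²) = 1 for X ∼ N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(4)

def mc_estimate(M):
    """Monte Carlo estimate of E(g(X)) with g(x) = x^2, X ~ N(0, 1)."""
    x = rng.normal(size=M)
    return np.mean(x ** 2)

est = mc_estimate(100_000)   # close to the true value 1

# replicate the estimator: its spread should shrink roughly like 1/sqrt(M)
sd_small = np.std([mc_estimate(100) for _ in range(500)])
sd_large = np.std([mc_estimate(10_000) for _ in range(500)])
```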

SLIDE 73

Simulation & MCMC: Univariate Sampling

◮ assumption for all sampling algorithms: we can sample from the uniform U([0, 1])
◮ done by a pseudo-random number generator (PRNG), in R: ?RNG

[Figure: inversion method, mapping u ∈ [0, 1] through the cdf F(x)]

◮ does not work well in dimensions > 1
◮ needs F^-1(·)
◮ needs normalization factor
◮ rejection sampling
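The inversion method pictured above works whenever F^-1 is available in closed form. A sketch for the exponential distribution, whose cdf F(x) = 1 − exp(−λx) inverts analytically:

```python
import numpy as np

rng = np.random.default_rng(5)

# inversion method: if U ~ U([0, 1]) then F^{-1}(U) has cdf F
rate = 2.0                       # lambda of an Exp(rate) target
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / rate      # F^{-1}(u) for F(x) = 1 - exp(-rate * x)
```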

SLIDE 79

Simulation & MCMC: Rejection Sampling

[Figure: target density p̃(z) ∝ p(z) under envelope k q(z), with proposal density q(z); a proposed point (z0, u0) shown]

1. sample z (horizontal coordinate) from q(z)
2. sample u (vertical coordinate) from U([0, k q(z)])
3. reject all points in the grey area
4. forget about u: z is distributed ∝ p̃(z)!

This samples points uniformly from the union of the white and grey areas.
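The four steps translate directly into code. A sketch with a hypothetical unnormalized target p̃(z) = z(1 − z) on [0, 1] (the Beta(2, 2) shape), a uniform proposal, and envelope constant k = max p̃ = 0.25:

```python
import numpy as np

rng = np.random.default_rng(6)

def p_tilde(z):
    """unnormalized target density, proportional to Beta(2, 2) on [0, 1]"""
    return z * (1.0 - z)

k = 0.25                                  # max of z(1-z), so k*q(z) >= p_tilde(z)

z = rng.uniform(size=200_000)             # step 1: z from the proposal q
u = rng.uniform(0.0, k, size=z.shape)     # step 2: u from U([0, k q(z)])
accepted = z[u <= p_tilde(z)]             # step 3: reject points above the curve
# step 4: forget u; `accepted` is distributed proportional to p_tilde
```

The acceptance rate equals the area under p̃ divided by the envelope area, so a tight envelope matters in practice.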

SLIDE 84

Markov Chain Monte Carlo: General Idea

◮ need to sample from high-dimensional distributions
◮ idea: produce samples by a Markov chain: a random walk over the parameter space
◮ the random walk spends more time in high-probability regions
◮ if in each step we move in one dimension only: need to sample from a one-dimensional distribution only, and can use the previous algorithms for that!
◮ but: samples are not independent!

slide-92
SLIDE 92

Markov Chain Monte Carlo: Algorithms

◮ Metropolis-Hastings:
  ◮ propose a step (draw from an easy-to-sample proposal distribution)
  ◮ accept the step with a certain probability (tailored to make the chain approach the target distribution)
  ◮ Stan uses an improved variant called Hamiltonian MH

◮ Gibbs sampler:
  ◮ loop over the parameter vector (θ1, θ2, . . .)
  ◮ draw from the full conditionals f(θi | everything else) ∝ joint
  ◮ special case of MH where proposals are always accepted

277
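The Metropolis-Hastings recipe above fits in a few lines. A minimal sketch in Python (a random-walk variant with a standard normal stand-in target, not the Hamiltonian variant Stan uses; all names here are illustrative):

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        # propose a step (draw from an easy-to-sample proposal distribution)
        x_prop = x + rng.gauss(0.0, step_size)
        # accept with probability min(1, target(x') / target(x));
        # the symmetric proposal makes the Hastings correction cancel
        if math.log(rng.random()) < log_target(x_prop) - log_target(x):
            x = x_prop
        samples.append(x)  # note: successive samples are not independent
    return samples

# unnormalized log-density of a standard normal target
log_target = lambda x: -0.5 * x * x
samples = metropolis_hastings(log_target, x0=0.0, n_steps=50000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The chain's empirical mean and variance should approach 0 and 1, the moments of the target, even though the draws are autocorrelated.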

slide-95
SLIDE 95

Markov Chain Monte Carlo: Why does this work?

The algorithms create a
◮ stationary,
◮ irreducible and
◮ aperiodic
Markov chain which has the joint as its limiting (invariant) distribution:

X1 → X2 → · · · → XT−1 → XT,   with f(x2 | x1) = · · · = f(xT | xT−1)

278

slide-101
SLIDE 101

Markov Chain Monte Carlo: Why does this work?

[diagram: 4-state Markov chain over states 1–4, with transition probabilities 0.4, 0.6, 0.5, 0.5, 0.2, 0.8, 0.7, 0.3]

The algorithms create a
◮ stationary,
◮ irreducible and
◮ aperiodic
Markov chain which has the joint as its limiting (invariant) distribution: the state probabilities p1(t), p2(t), p3(t), p4(t) converge, as t → ∞, to a limiting distribution p1, p2, p3, p4.

278
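The convergence p(t) → p can be checked numerically. A plain-Python sketch with a hypothetical 4-state transition matrix (the slide's diagram probabilities 0.4/0.6, 0.5/0.5, 0.2/0.8, 0.7/0.3 read as a cycle with self-loops; the exact diagram is not recoverable here, so this matrix is an assumption):

```python
# hypothetical irreducible, aperiodic 4-state chain: each state either loops
# on itself or moves to the next state in the cycle 1 -> 2 -> 3 -> 4 -> 1
P = [
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.2, 0.8],
    [0.7, 0.0, 0.0, 0.3],
]

def step(p, P):
    """One step of the chain: p(t+1) = p(t) P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

# start deterministically in state 1; p(t) converges as t grows
p = [1.0, 0.0, 0.0, 0.0]
for _ in range(200):
    p = step(p, P)

# at the limit the distribution is invariant: p = p P
p_next = step(p, P)
```

After enough steps the state probabilities stop changing, which is exactly the stationary-distribution property the MCMC algorithms are engineered to have for the joint posterior.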
slide-103
SLIDE 103

MCMC: Warm-Up (= Burn-In), Mixing, Thinning

[figure: trace plots of beta1 over 1000 iterations for chains 1–4]

[figure: average autocorrelation of beta1 against lag]

279
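Mixing and the need for thinning can be quantified with the sample autocorrelation of a chain. A plain-Python sketch on a hypothetical AR(1) "chain" (its lag-k autocorrelation decays like φ^k, so keeping only every k-th draw leaves nearly uncorrelated samples):

```python
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of a sequence at a given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

# hypothetical AR(1) chain x_t = phi * x_{t-1} + noise, mimicking the
# autocorrelated output of an MCMC sampler
rng = random.Random(1)
phi = 0.9
x = [0.0]
for _ in range(100000):
    x.append(phi * x[-1] + rng.gauss(0.0, 1.0))

r1 = autocorrelation(x, 1)    # close to phi = 0.9
r20 = autocorrelation(x, 20)  # close to phi**20, i.e. about 0.12
```

This is the quantity plotted against lag in the autocorrelation panel above: a fast decay means good mixing, a slow decay suggests thinning or a longer run.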

slide-104
SLIDE 104

Bayesian hierarchical modelling, simulation and MCMC

Outline

Bayesian hierarchical modelling / Bayesian networks / graphical models
Exercises I
Simulation & MCMC
Exercises II

280

slide-106
SLIDE 106

Exercise: Quick start RStan

A Stan model is defined by six program blocks; the data and model blocks are required:

model1 <- "
  data { ... }                   // required
  transformed data { ... }
  parameters { ... }
  transformed parameters { ... }
  model { ... }                  // required
  generated quantities { ... }
"

281

slide-109
SLIDE 109

Exercise: Quick start RStan

[graph: nodes a, b → θ → x]

f(x | θ) ∼ Binomial(n, θ)
f(θ | a, b) ∼ Beta(a, b)

library(rstan)
model0 <- "
  data {
    int<lower=0> n;
    int<lower=0> x;
  }
  parameters {
    real<lower=0, upper=1> theta;
  }
  model {
    theta ~ beta(2, 2);
    x ~ binomial(n, theta);
  }
"
data0 <- list(n=10, x=5)

Running the model creates a stanfit object.

fit0 <- stan(model_code=model0, data=data0, iter=1000, chains=4)
print(fit0); plot(fit0)

The samples can be extracted by

samples0 <- extract(fit0, c("theta"))

http://mc-stan.org/documentation/
http://github.com/stan-dev/rstan/wiki/RStan-Getting-Started#how-to-use-rstan

282
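This Beta–Binomial model is conjugate, so the MCMC sample of theta can be checked against the exact posterior. Plain-Python arithmetic with the values from the slide (prior Beta(2, 2), x = 5 out of n = 10):

```python
# conjugate update: with a Beta(a, b) prior and x successes in n binomial
# trials, the posterior is Beta(a + x, b + n - x)
a, b = 2, 2
n, x = 10, 5
a_post, b_post = a + x, b + n - x  # Beta(7, 7)

# moments to compare the sampled theta against
post_mean = a_post / (a_post + b_post)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
```

The extracted samples0 should have mean close to 0.5 and variance close to 1/60, which is a useful sanity check that the sampler has converged.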

slide-110
SLIDE 110

Exercise: Linear regression in RStan

[graph: Bayesian network with nodes yi, xi (i = 1, . . . , n), τ, β1, β2 and hyperparameters a, b, m1, t1, m2, t2]

We want to estimate the parameters in the linear regression example, using RStan to sample from the posterior. The model assumptions are:

yi | β1, β2, τ ∼ N(β1 + xiβ2, 1/τ)
τ | a, b ∼ Gamma(a, b),        with a = b = 10⁻³
β1 | m1, t1 ∼ N(m1, 1/t1),     with m1 = 0, t1 = 10⁻⁴
β2 | m2, t2 ∼ N(m2, 1/t2),     with m2 = 0, t2 = 10⁻⁴

283

slide-111
SLIDE 111

Exercise: Linear regression in RStan

◮ Create an artificial data set x1, . . . , xn, y1, . . . , yn by

  data <- list()
  data$N <- 50
  data$x <- rnorm(data$N) + 30
  data$y <- 3 + 5*data$x + rnorm(data$N, sd=1/10)

What are thus the ‘true’ parameter values?

◮ Define the model in Stan. Include a transformed parameters block where you define σ = √(1/τ). (In Stan, the Normal distribution is parametrized with the standard deviation σ!)

◮ Simulate four chains with 1000 iterations each and use plot() and print() to get a first impression of the results. What point estimates do you get for β1, β2 and σ?

284
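As a cross-check outside Stan, data generated the same way can be fit by ordinary least squares, whose estimates the posterior means should roughly match under these vague priors. A Python sketch mirroring the R code above (the seed is arbitrary):

```python
import random

rng = random.Random(42)
N = 50
x = [rng.gauss(0.0, 1.0) + 30 for _ in range(N)]   # like rnorm(N) + 30
y = [3 + 5 * xi + rng.gauss(0.0, 0.1) for xi in x] # noise with sd = 1/10

# ordinary least squares for y = b1 + b2 * x
xbar = sum(x) / N
ybar = sum(y) / N
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b2 = sxy / sxx          # slope estimate, near 5
b1 = ybar - b2 * xbar   # intercept estimate, near 3
```

With only a little noise (sd = 1/10), both estimates land close to the generating values, which is what the Stan point estimates should reproduce.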

slide-112
SLIDE 112

Exercise: Linear regression in RStan

◮ The functions stan_trace(), stan_dens() and stan_ac() allow you to analyze your sample from the posterior distribution more closely. (You can include the warm-up phase in your plots by setting inc_warmup = TRUE.) How long is the warm-up phase? Do your chains mix well? Is thinning necessary?

◮ The function pairs() also works on stanfit objects. Plot pairwise scatterplots of your sample using pairs(). What do you observe about the relation between β1 and β2?

285

slide-113
SLIDE 113

Exercise: Linear regression in RStan

◮ The high correlation between β1 and β2 indicates that the Markov chain cannot move around freely. You can mitigate this problem by centering the data x1, . . . , xn. The mean for the Normal distribution of yi is then given by β1ᶜ + β2(xi − x̄), where β1ᶜ = β1 + β2x̄. Add the following block to your Stan model definition,

  transformed data {
    vector[N] xcentered;
    xcentered = x - mean(x);
  }

and edit the parameters and model blocks such that the model generates samples of β1ᶜ instead of β1.

◮ Edit the transformed parameters block to define β1 as β1 = β1ᶜ − β2x̄.

◮ Simulate four chains with 1000 iterations each from this new model, and analyze your sample from the posterior distribution like for the first model. What has changed?

286
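Why centering helps can be seen directly in the least-squares covariance structure: for the design matrix [1, x], the correlation between the intercept and slope estimators works out to −x̄/√(mean of x²), which is near −1 when x̄ ≈ 30 and drops to 0 after centering. A plain-Python sketch with hypothetical x values:

```python
import math

def intercept_slope_corr(x):
    """Correlation between the OLS estimators of intercept and slope,
    read off the 2x2 matrix (X^T X)^{-1} for the design X = [1, x]."""
    n = len(x)
    xbar = sum(x) / n
    mean_sq = sum(xi * xi for xi in x) / n  # mean of x^2
    return -xbar / math.sqrt(mean_sq)

# hypothetical covariate with mean around 30, as in the exercise
x = [30 + 0.1 * i for i in range(-25, 25)]
corr_raw = intercept_slope_corr(x)  # close to -1

xbar = sum(x) / len(x)
corr_centered = intercept_slope_corr([xi - xbar for xi in x])  # essentially 0
```

The same geometry governs the posterior under vague priors, which is why the Markov chain for (β1, β2) moves along a narrow ridge before centering and freely afterwards.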

slide-114
SLIDE 114

Exercise: Linear regression in RStan

◮ Choose an informative prior for one or both of β1 and β2. Try out different values for the mean and standard deviation. What is the effect on the chains and the posterior densities?

287