Lecture 25: Introduction to Bayesian Inference
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01
Jason Mezey jgm45@cornell.edu May 7, 2020 (Th) 8:40-9:55

Announcements
No official office hours on Mon. (May 11), but if you would like to meet I will zoom with you (!!)
The exam: you may use any material you may access, BUT ONCE THE EXAM STARTS YOU MAY NOT ASK ANYONE ABOUT ANYTHING THAT COULD RELATE TO THE EXAM (!!!!)
... 20 (Weds.) ... if you are well prepared
Today we will introduce Bayesian inference (you will also do this in lab!) plus a brief mention of additional / advanced topics
In a Bayesian framework, parameters have a probability distribution associated with them that reflects our belief in the values that might be the true value of the parameter
Since the parameter has a probability distribution, we can consider the joint distribution of the parameter AND a sample Y produced under a probability model: Pr(θ ∩ Y)
In Bayesian inference, we are interested in the probability that the parameter takes a certain value given a sample: Pr(θ|y)
Using Bayes' theorem (and the fact that Pr(y) is a constant for an observed sample), we can rewrite this as follows:
Pr(θ|y) = Pr(y|θ)Pr(θ) / Pr(y)
Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
Note that one term of this relationship is the likelihood (!!): Pr(y|θ) = L(θ|y)
The other term, Pr(θ), is the prior, which reflects our belief concerning the values the true parameter value may take
Performing Bayesian inference therefore requires two assumptions (versus a Frequentist framework, where we make one assumption): 1. the probability distribution that generated the sample, 2. the probability distribution of the parameter
Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
where Pr(θ|y), i.e. the probability distribution of the parameter given the sample, is the posterior, and all Bayesian inference is based on this posterior
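As an illustration of this relationship (not part of the lecture slides), the following Python sketch approximates a posterior on a grid by multiplying a likelihood by a prior and normalizing; the sample values, the normal likelihood with known σ, and the N(0, 2²) prior are all assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample y and an assumed known sigma (illustration only)
y = np.array([1.2, 0.8, 1.5, 1.1, 0.9])
sigma = 1.0

# Grid of candidate values for the parameter theta (here, a mean)
theta_grid = np.linspace(-3.0, 5.0, 1001)

# Pr(y | theta): likelihood of the whole sample at each grid value
likelihood = np.array([norm.pdf(y, loc=t, scale=sigma).prod() for t in theta_grid])

# Pr(theta): an assumed N(0, 2^2) prior reflecting belief about theta
prior = norm.pdf(theta_grid, loc=0.0, scale=2.0)

# Pr(theta | y) ∝ Pr(y | theta) Pr(theta); normalize numerically on the grid
unnorm_posterior = likelihood * prior
posterior = unnorm_posterior / np.trapz(unnorm_posterior, theta_grid)

print("approximate posterior mean:", np.trapz(theta_grid * posterior, theta_grid))
```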
As an example, consider a case where we are interested in knowing the mean human height in the US (what are the components of the statistical framework for this example!? Note the basic components are the same in Frequentist / Bayesian!)
For this example, assume the probability model for the random variable is Y ∼ N(µ, σ²), where we are interested in inferring the mean µ (why is this the parameter we are interested in inferring in this case?). In a Bayesian framework, we will at least need to define a prior for this parameter:
One option is to make the probability of every possible value of the parameter the same (what distribution are we assuming and what is a problem with this approach?), which defines an improper prior:
Pr(µ) = c
Another option is to incorporate what we know about the parameter (e.g. human heights are seldom infinite, etc.), where one choice for incorporating these observations is defining a prior with the same distributional form as our probability model, which is mathematically convenient given the form of the likelihood and defines a conjugate prior (which is also a proper prior):
Pr(µ) ∼ N(κ, φ²)
Pr(θ|y) ∝ Pr(y|θ)Pr(θ)
$$\Pr(\mu|\mathbf{y}) \propto \left(\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}}\right) \frac{1}{\sqrt{2\pi\phi^2}} e^{-\frac{(\mu-\kappa)^2}{2\phi^2}}$$

which, after rearranging, is itself a normal distribution:

$$\Pr(\mu|\mathbf{y}) \sim N\!\left(\frac{\frac{\kappa}{\phi^2} + \frac{\sum_{i=1}^{n} y_i}{\sigma^2}}{\frac{1}{\phi^2} + \frac{n}{\sigma^2}},\; \left(\frac{1}{\phi^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$$
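As a quick numerical check of this closed form (my addition, not from the slides), the sketch below compares the analytical posterior mean and variance to a grid approximation of Pr(µ|y) ∝ Pr(y|µ)Pr(µ); the values of κ, φ², σ² and the simulated heights are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative values (assumptions for this sketch)
kappa, phi2 = 170.0, 25.0      # prior mean and variance of mu
sigma2 = 100.0                 # assumed known variance of the heights
rng = np.random.default_rng(1)
y = rng.normal(172.0, np.sqrt(sigma2), size=50)
n = len(y)

# Closed-form posterior: Pr(mu | y) ~ N(post_mean, post_var)
post_var = 1.0 / (1.0 / phi2 + n / sigma2)
post_mean = (kappa / phi2 + y.sum() / sigma2) * post_var

# Grid approximation of Pr(mu | y) ∝ Pr(y | mu) Pr(mu) for comparison
mu_grid = np.linspace(160.0, 185.0, 2001)
log_post = (norm.logpdf(y[:, None], loc=mu_grid, scale=np.sqrt(sigma2)).sum(axis=0)
            + norm.logpdf(mu_grid, loc=kappa, scale=np.sqrt(phi2)))
w = np.exp(log_post - log_post.max())
w /= np.trapz(w, mu_grid)

print(post_mean, np.trapz(mu_grid * w, mu_grid))                    # means should agree
print(post_var, np.trapz((mu_grid - post_mean) ** 2 * w, mu_grid))  # variances should agree
```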
In a Bayesian framework, the posterior probability distribution is the basis of inference in both estimation and hypothesis testing
For estimation, we construct estimators using the posterior probability distribution, for example:
θ̂ = mean(θ|y) = ∫ θ Pr(θ|y) dθ
θ̂ = median(θ|y)
Note that these are generally not the same as estimators in a maximum likelihood (Frequentist) framework, since estimator construction is fundamentally different (!!)
For our height example, we can construct these estimators from the posterior Pr(µ|y) and consider what happens as we take the variance of the prior φ² to infinity (= same as the MLE under this condition):
$$\hat{\mu} = \mathrm{median}(\mu|\mathbf{y}) = \mathrm{mean}(\mu|\mathbf{y}) = \frac{\frac{\kappa}{\phi^2} + \frac{n\bar{y}}{\sigma^2}}{\frac{1}{\phi^2} + \frac{n}{\sigma^2}}$$

$$\frac{\frac{\kappa}{\phi^2} + \frac{n\bar{y}}{\sigma^2}}{\frac{1}{\phi^2} + \frac{n}{\sigma^2}} \;\approx\; \frac{\frac{n\bar{y}}{\sigma^2}}{\frac{n}{\sigma^2}} \approx \bar{y} \quad (\text{as } \phi^2 \to \infty)$$
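To see this limit numerically (a small sketch with made-up values, not from the slides), increase the prior variance φ² and compare the posterior mean to the sample mean ȳ:

```python
import numpy as np

# Illustrative values (assumptions for this sketch)
kappa, sigma2 = 170.0, 100.0
y = np.array([168.0, 175.0, 171.0, 169.0, 177.0])
n, ybar = len(y), y.mean()

# As phi^2 grows, the prior term vanishes and the posterior mean approaches ybar
for phi2 in [1.0, 10.0, 1e3, 1e6]:
    post_mean = (kappa / phi2 + n * ybar / sigma2) / (1.0 / phi2 + n / sigma2)
    print(f"phi^2 = {phi2}: posterior mean = {post_mean:.4f} (ybar = {ybar:.4f})")
```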
We can also use the posterior to assess a null hypothesis versus an alternative hypothesis framework:
H0 : θ ∈ Θ0   HA : θ ∈ ΘA
One approach is quite different from the frequentist framework, where we use a Bayes factor to indicate the relative support for one hypothesis versus the other:

$$\text{Bayes factor} = \frac{\int_{\theta \in \Theta_0} \Pr(\mathbf{y}|\theta)\Pr(\theta)\,d\theta}{\int_{\theta \in \Theta_A} \Pr(\mathbf{y}|\theta)\Pr(\theta)\,d\theta}$$

However, with Bayes factors it can be difficult to assign priors for hypotheses that have completely different ranges of support (e.g. the null is a point and the alternative is a range of values)
There is also an approach to Bayesian hypothesis testing that makes use of credible intervals (which is what we will use in this course)
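Before turning to credible intervals, here is a minimal sketch (my addition) of computing a Bayes factor by numerical integration under a simple assumed setup: a normal likelihood with known σ, a normal prior, and composite hypotheses Θ0 = {θ ≤ 0} and ΘA = {θ > 0} (a composite null is used here to sidestep the point-null issue just described).

```python
import numpy as np
from scipy.stats import norm

# Assumed sample, likelihood, and prior (illustration only)
y = np.array([0.6, 1.1, 0.3, 0.9, 0.7])
sigma = 1.0
theta_grid = np.linspace(-5.0, 5.0, 4001)

# Pr(y | theta) Pr(theta) evaluated on the grid
lik = np.array([norm.pdf(y, loc=t, scale=sigma).prod() for t in theta_grid])
prior = norm.pdf(theta_grid, loc=0.0, scale=2.0)
integrand = lik * prior

# Integrate over the null region (theta <= 0) and the alternative region (theta > 0)
null_region = theta_grid <= 0.0
bayes_factor = (np.trapz(integrand[null_region], theta_grid[null_region])
                / np.trapz(integrand[~null_region], theta_grid[~null_region]))
print("Bayes factor (H0 vs HA):", bayes_factor)
```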
Recall that in a frequentist framework we can construct a confidence interval at some level (say 0.95), which is an interval that will include the value of the parameter in 0.95 of the cases were we to perform the experiment an infinite number of times, calculating the confidence interval each time (note: a strange definition...)
A Bayesian credible interval has a completely different interpretation: this interval has a given probability of including the parameter value (!!)
We can use a credible interval for hypothesis testing, where we cannot reject the null hypothesis if this interval includes the value of the parameter under the null hypothesis (!!)
$$\mathrm{c.i.}(\theta) = \int_{-c_\alpha}^{c_\alpha} \Pr(\theta|\mathbf{y})\,d\theta = 1 - \alpha$$
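As a concrete sketch (my addition, with illustrative values), the snippet below computes an equal-tailed 0.95 credible interval from the closed-form normal posterior of the height example and uses it to assess a null value of µ:

```python
import numpy as np
from scipy.stats import norm

# Illustrative values (assumptions for this sketch)
kappa, phi2, sigma2 = 170.0, 25.0, 100.0
y = np.array([168.0, 175.0, 171.0, 169.0, 177.0, 174.0, 172.0])
n = len(y)

# Closed-form posterior Pr(mu | y) ~ N(post_mean, post_var) from above
post_var = 1.0 / (1.0 / phi2 + n / sigma2)
post_mean = (kappa / phi2 + y.sum() / sigma2) * post_var

# Equal-tailed 0.95 credible interval: the central 95% of the posterior
alpha = 0.05
lower, upper = norm.ppf([alpha / 2, 1 - alpha / 2], loc=post_mean, scale=np.sqrt(post_var))
print(f"0.95 credible interval for mu: ({lower:.2f}, {upper:.2f})")

# Cannot reject H0: mu = mu_0 if mu_0 falls inside the credible interval
mu_0 = 170.0
print("reject H0" if not (lower <= mu_0 <= upper) else "cannot reject H0")
```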
We can apply Bayesian inference to our genetic association framework (note that we will focus on the linear regression model, but we can perform Bayesian inference for any GLM!):

$$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + \epsilon, \qquad \epsilon \sim N(0, \sigma^2_\epsilon)$$

$$\mathbf{y} = \mathbf{x}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \text{multi}N(\mathbf{0}, \mathbf{I}\sigma^2_\epsilon)$$

For the purposes of mapping, we are interested in the two parameters βa and βd and in testing the hypotheses:

$$H_0: \beta_a = 0 \cap \beta_d = 0 \qquad H_A: \beta_a \neq 0 \cup \beta_d \neq 0$$
To perform Bayesian inference, we need a (joint) distribution for the prior on all of the parameters; here we will assume independent uniform (improper) priors, so the posterior is proportional to the likelihood, which is multivariate normal (!!):
Pr(βµ, βa, βd, σ²ε) = Pr(βµ)Pr(βa)Pr(βd)Pr(σ²ε)
Pr(βµ) = Pr(βa) = Pr(βd) = c,  Pr(σ²ε) = c
Pr(βµ, βa, βd, σ²ε|y) ∝ Pr(y|βµ, βa, βd, σ²ε)
Pr(θ|y) ∝ (σ²ε)^(−n/2) e^(−(y − xβ)ᵀ(y − xβ) / (2σ²ε))
The posterior we are interested in is:
Pr(βµ, βa, βd, σ²ε|y) ∝ Pr(y|βµ, βa, βd, σ²ε)Pr(βµ, βa, βd, σ²ε) ∝ Pr(y|βµ, βa, βd, σ²ε)
and to assess our genetic null hypothesis, we need the marginal posterior probability of βa and βd:

$$\Pr(\beta_a, \beta_d | \mathbf{y}) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon | \mathbf{y})\, d\beta_\mu\, d\sigma^2_\epsilon$$
We can then construct a credible interval for our genetic null hypothesis and test a marker for a phenotype association, and we can perform a GWAS by doing this for each marker (!!)
This marginal posterior has a closed form:

$$\Pr(\beta_a, \beta_d | \mathbf{y}) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \Pr(\beta_\mu, \beta_a, \beta_d, \sigma^2_\epsilon | \mathbf{y})\, d\beta_\mu\, d\sigma^2_\epsilon \;\sim\; \text{multi-t-distribution}$$

$$\mathrm{mean}(\Pr(\beta_a, \beta_d | \mathbf{y})) = \left[\hat{\beta}_a, \hat{\beta}_d\right]^T = \mathbf{C}^{-1}[\mathbf{X}_a, \mathbf{X}_d]^T \mathbf{y}$$

$$\mathrm{cov} = \frac{\left(\mathbf{y} - [\mathbf{X}_a, \mathbf{X}_d]\left[\hat{\beta}_a, \hat{\beta}_d\right]^T\right)^T \left(\mathbf{y} - [\mathbf{X}_a, \mathbf{X}_d]\left[\hat{\beta}_a, \hat{\beta}_d\right]^T\right)}{n - 6}\,\mathbf{C}^{-1}$$

$$\mathbf{C} = \begin{bmatrix} \mathbf{X}_a^T\mathbf{X}_a & \mathbf{X}_a^T\mathbf{X}_d \\ \mathbf{X}_d^T\mathbf{X}_a & \mathbf{X}_d^T\mathbf{X}_d \end{bmatrix}$$

$$\mathrm{d.f.}(\text{multi-t}) = n - 4$$
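To make these formulas concrete, here is a sketch (not from the slides) that simulates Xa, Xd genotype codings and a phenotype and then applies the formulas above for the posterior mean and covariance of (βa, βd); the effect sizes, sample size, and the omission of a βµ term in the simulated phenotype are assumptions made so that the formulas above (which involve only Xa and Xd) apply directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Simulate genotypes (0, 1, 2 copies of an allele) and the course's Xa, Xd codings
genotypes = rng.integers(0, 3, size=n)
Xa = genotypes - 1.0                 # -1, 0, 1
Xd = 1.0 - 2.0 * np.abs(Xa)          # -1 for homozygotes, 1 for heterozygotes
y = 0.5 * Xa + 0.2 * Xd + rng.normal(0.0, 1.0, size=n)   # assumed true effects

# Formulas from above: C, posterior mean, posterior covariance of (beta_a, beta_d)
X = np.column_stack([Xa, Xd])
C = X.T @ X
beta_hat = np.linalg.solve(C, X.T @ y)        # posterior mean
resid = y - X @ beta_hat
post_cov = (resid @ resid) / (n - 6) * np.linalg.inv(C)

print("posterior mean of (beta_a, beta_d):", beta_hat)
print("posterior covariance:\n", post_cov)
```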
[Figure: plots of the marginal posterior Pr(βa, βd|y) over βa and βd, each showing a 0.95 credible interval; in one case we cannot reject H0 and in the other we reject H0]
In this case, we could derive a simple closed form of the overall posterior (and the marginal posterior)
However, this will often not be possible when we want to put together more complex priors with our likelihood or consider a more complicated likelihood equation (e.g. for a logistic regression!)
In such cases, we still need to determine the credible interval from the posterior (or marginal) probability distribution, so we need some way to determine the form of this distribution
We will use a Markov chain Monte Carlo (MCMC) algorithm for this purpose
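As a preview of what such an algorithm looks like (a generic random-walk Metropolis-Hastings sketch, not the specific sampler developed later in the course), the code below draws samples from a posterior known only up to a constant; the target here is the unnormalized normal-normal posterior from the height example with illustrative parameter values. A credible interval can then be read directly off the retained samples.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative target: unnormalized log posterior of mu from the height example
kappa, phi2, sigma2 = 170.0, 25.0, 100.0
y = rng.normal(172.0, np.sqrt(sigma2), size=50)

def log_post(mu):
    log_lik = -0.5 * np.sum((y - mu) ** 2) / sigma2   # normal likelihood (up to a constant)
    log_prior = -0.5 * (mu - kappa) ** 2 / phi2       # normal prior (up to a constant)
    return log_lik + log_prior

# Random-walk Metropolis-Hastings: propose a move, accept with probability min(1, ratio)
mu_current, samples = 170.0, []
for _ in range(20000):
    mu_prop = mu_current + rng.normal(0.0, 1.0)
    if np.log(rng.uniform()) < log_post(mu_prop) - log_post(mu_current):
        mu_current = mu_prop
    samples.append(mu_current)

samples = np.array(samples[5000:])   # discard burn-in
print("MCMC posterior mean:", samples.mean())
print("0.95 credible interval:", np.quantile(samples, [0.025, 0.975]))
```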
To introduce MCMC algorithms, we need to consider models from another branch of probability (remember, probability is a field much larger than the components that we use for statistics / inference!): stochastic processes
A stochastic process is a collection of random vectors (variables) with defined conditional relationships, often indexed by an ordered set t
We are interested in a particular class of models from this probability sub-field: Markov processes (or more specifically, Markov chains)
A Markov chain is a set of random variables (more accurately, a set of random vectors), which we will index with t:

$$X_t, X_{t+1}, X_{t+2}, \ldots, X_{t+k} \qquad X_t, X_{t-1}, X_{t-2}, \ldots, X_{t-k}$$

where these random variables respect the Markov property:

$$\Pr(X_t \mid X_{t-1}, X_{t-2}, \ldots, X_{t-k}) = \Pr(X_t \mid X_{t-1})$$

Note that while the random variables of a Markov chain are in the same class of random variables (e.g. Bernoulli, normal, etc.), we allow the parameters of these random variables to be different, e.g. at time t and t+1
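To make the Markov property concrete (a toy example, not from the slides), here is a two-state Markov chain in which each Xt depends only on Xt−1 through an assumed transition matrix:

```python
import numpy as np

rng = np.random.default_rng(4)

# Transition probabilities Pr(X_t = j | X_{t-1} = i) for a two-state chain
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

x = 0            # X_0
chain = [x]
for _ in range(10000):
    x = rng.choice(2, p=P[x])    # X_t depends only on X_{t-1}
    chain.append(x)

# The long-run fraction of time spent in state 1 settles near a stationary value
print("fraction of time in state 1:", np.mean(chain))
```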