Probability Review
Applied Bayesian Statistics
Dr. Earvin Balderama
Department of Mathematics & Statistics Loyola University Chicago
August 31, 2017
Applied Bayesian Statistics Last edited September 8, 2017 by Earvin Balderama <ebalderama@luc.edu>
Mathematically, a random variable is a function that maps a sample space into the real numbers: X : S → ℝ. The sample space can be:

1. Countable (discrete).
2. Uncountable (continuous).

Example: 3 coin tosses.
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
We may want to create a random variable, X, defined as the number of tails. X ∈ {0, 1, 2, 3}
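The coin-toss example can be sketched in a few lines of Python; this is just an illustration of X : S → ℝ, enumerating the sample space and mapping each outcome to the number of tails.

```python
# Enumerate the sample space S of 3 coin tosses and define the random
# variable X = number of tails as a mapping from outcomes to numbers.
from itertools import product

S = ["".join(toss) for toss in product("HT", repeat=3)]  # 8 outcomes
X = {s: s.count("T") for s in S}  # X maps each outcome to a real number

print(len(S))                    # 8
print(sorted(set(X.values())))   # [0, 1, 2, 3]
```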
Mathematically, a probability function assigns numbers (between 0 and 1) to subsets of a sample space: P : B → [0, 1], for all B ⊆ S. Two interpretations:

1. (Frequentist) Based on long-run relative frequencies of possible outcomes.
2. (Bayesian) Based on belief about how likely each possible outcome is.

Regardless of interpretation, the same basic probability laws apply, e.g., P(A) ≥ 0, P(S) = 1, and P(A ∪ B) = P(A) + P(B) for mutually exclusive A and B.
A probability distribution is a list of all possible values of a random variable and their corresponding probabilities.

1. Discrete random variable: probability mass function (PMF)
   PMF: f(x) = Prob(X = x) ≥ 0
   Mean: E(X) = Σ_x x f(x)
   Variance: V(X) = Σ_x [x − E(X)]² f(x)

2. Continuous random variable: probability density function (PDF)
   Prob(X = x) = 0 for all x
   PDF: f(x) ≥ 0, Prob(X ∈ B) = ∫_B f(x) dx
   Mean: E(X) = ∫ x f(x) dx
   Variance: V(X) = ∫ [x − E(X)]² f(x) dx
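The discrete formulas above can be checked directly; a minimal sketch, assuming the fair-coin example (X = number of tails in 3 tosses, so f(0) = f(3) = 1/8 and f(1) = f(2) = 3/8):

```python
# Mean and variance of a discrete random variable computed from its PMF,
# using X = number of tails in 3 fair coin tosses.
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

mean = sum(x * p for x, p in pmf.items())            # E(X) = sum x f(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # V(X)

print(mean)  # 1.5
print(var)   # 0.75
```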
A statistical analysis typically proceeds by selecting a PMF (or PDF) that seems to match the distribution of a sample. We rarely know the PMF (or PDF) exactly, but we may assume it is from a parametric family of distributions, and estimate the parameters.
1. Discrete random variables:
   Binomial (Bernoulli is a special case), Poisson, Negative Binomial

2. Continuous random variables:
   Normal, Gamma (Exponential and χ² are special cases), Inverse-Gamma, Beta (Uniform is a special case)
Bernoulli distribution: only two outcomes (success/failure, 0/1, zero/nonzero, etc.), where θ is the probability of success.
X ∈ {0, 1}
PMF: f(x) = Prob(X = x) = 1 − θ if x = 0, and θ if x = 1.
Mean: E(X) = Σ_x x f(x) = 0(1 − θ) + 1(θ) = θ
Variance: V(X) = Σ_x [x − θ]² f(x) = (0 − θ)²(1 − θ) + (1 − θ)²θ = θ(1 − θ)
Binomial distribution: X = number of “successes” in n independent “Bernoulli trials,” where θ is the probability of success on each trial.
X ∈ {0, 1, . . . , n}
PMF: f(x) = Prob(X = x) = C(n, x) θ^x (1 − θ)^(n−x), where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient.
Mean: E(X) = nθ
Variance: V(X) = nθ(1 − θ)
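A quick numerical check of the Binomial PMF and its mean; n = 10 and θ = 0.3 are assumed example values, not from the slides:

```python
# Binomial PMF via math.comb; verify it sums to 1 and that E(X) = n*theta.
from math import comb

def binom_pmf(x, n, theta):
    """Prob(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

n, theta = 10, 0.3
probs = [binom_pmf(x, n, theta) for x in range(n + 1)]

print(abs(sum(probs) - 1) < 1e-12)                                       # True
print(abs(sum(x * p for x, p in enumerate(probs)) - n * theta) < 1e-12)  # True
```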
Poisson distribution: X = number of events that occur in a unit of time.
X ∈ {0, 1, . . .}
PMF: f(x) = Prob(X = x) = λ^x e^(−λ) / x!
Mean: E(X) = λ
Variance: V(X) = λ
Note: Can be parameterized with λ = nθ, where θ is the expected number of events per unit time. E(X) = V(X) = nθ.
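The equal-mean-and-variance property of the Poisson can be verified numerically; λ = 2.5 is an assumed example value, and the infinite support is truncated where the tail is negligible:

```python
# Poisson PMF; check E(X) = V(X) = lambda numerically.
from math import exp, factorial

def pois_pmf(x, lam):
    """Prob(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

lam = 2.5
probs = [pois_pmf(x, lam) for x in range(60)]  # tail beyond 60 is negligible
mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))

print(abs(mean - lam) < 1e-9)  # True: E(X) = lambda
print(abs(var - lam) < 1e-9)   # True: V(X) = lambda
```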
Negative Binomial distribution: X = number of “failures” until r “successes” in a sequence of independent “Bernoulli trials,” where θ is the probability of success on each trial.
X ∈ {0, 1, . . .}
PMF: f(x) = Prob(X = x) = C(x + r − 1, x) θ^r (1 − θ)^x
Mean: E(X) = r(1 − θ)/θ
Variance: V(X) = r(1 − θ)/θ²
Note: The geometric distribution is a special case: Geom(θ) = NB(1, θ).
Note: There are MANY different ways to specify the NB distribution. The important thing to note is that NB is a discrete count distribution that is a more flexible model than the Poisson.
Normal distribution:
X ∈ (−∞, ∞)
PDF: f(x) = (1/(σ√(2π))) exp{ −(1/2) ((x − µ)/σ)² }
Mean: E(X) = µ
Variance: V(X) = σ²
Gamma distribution:
X ∈ (0, ∞)
PDF: f(x) = (b^a / Γ(a)) x^(a−1) e^(−bx)
Mean: E(X) = a/b
Variance: V(X) = a/b²
Parameters: shape a > 0, rate b > 0.
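The Gamma mean formula can be checked with a simple midpoint-rule integral; a = 2 and b = 0.5 are assumed example values:

```python
# Verify E(X) = a/b for the Gamma(a, b) density by numerical integration.
from math import gamma, exp

def gamma_pdf(x, a, b):
    return b**a / gamma(a) * x**(a - 1) * exp(-b * x)

a, b = 2.0, 0.5
h = 0.001
xs = [(i + 0.5) * h for i in range(100_000)]  # midpoint grid on (0, 100)
mean = sum(x * gamma_pdf(x, a, b) * h for x in xs)

print(abs(mean - a / b) < 1e-3)  # True: E(X) = a/b = 4
```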
If Y ∼ Gamma(a, b), then X = 1/Y ∼ InverseGamma(a, b).
X ∈ (0, ∞)
PDF: f(x) = (b^a / Γ(a)) x^(−a−1) e^(−b/x)
Mean: E(X) = b/(a − 1), for a > 1.
Variance: V(X) = b²/((a − 1)²(a − 2)), for a > 2.
Parameters: shape a > 0, scale b > 0.
Beta distribution:
X ∈ [0, 1]
PDF: f(x) = (Γ(a + b) / (Γ(a)Γ(b))) x^(a−1) (1 − x)^(b−1)
Mean: E(X) = a/(a + b)
Variance: V(X) = ab/((a + b)²(a + b + 1))
Parameters: a > 0, b > 0.
A random vector of p random variables: X = (X1, X2, . . . , Xp). For now, suppose we have just p = 2 random variables, X and Y. (X, Y) can be discrete or continuous.
1. Discrete (X, Y)
   Joint PMF: f(x, y) = Prob(X = x, Y = y)
   Marginal PMF for X: fX(x) = Prob(X = x) = Σ_y f(x, y)
   Marginal PMF for Y: fY(y) = Prob(Y = y) = Σ_x f(x, y)

2. Continuous (X, Y)
   Joint PDF: f(x, y), with Prob[(X, Y) ∈ B] = ∫∫_B f(x, y) dx dy
   Marginal PDF for X: fX(x) = ∫ f(x, y) dy
   Marginal PDF for Y: fY(y) = ∫ f(x, y) dx
Example: Patients are randomly assigned a dose and followed to determine whether they develop a tumor. X ∈ {5, 10, 20} is the dose; Y ∈ {0, 1} is 1 if a tumor develops and 0 otherwise. The joint PMF f(x, y) is given by:

           X = 5   X = 10   X = 20
   Y = 0   0.469   0.124    0.049
   Y = 1   0.231   0.076    0.051
Example: Find the marginal PMFs of X and Y.

fY(0) = Σ_x f(x, 0) = 0.469 + 0.124 + 0.049 = 0.642
fY(1) = Σ_x f(x, 1) = 0.231 + 0.076 + 0.051 = 0.358
fX(5) = 0.7, fX(10) = 0.2, fX(20) = 0.1

           X = 5   X = 10   X = 20   fY(y)
   Y = 0   0.469   0.124    0.049    0.642
   Y = 1   0.231   0.076    0.051    0.358
   fX(x)   0.7     0.2      0.1      1
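The marginalization above (sum the joint over the other variable) can be sketched directly from the dose/tumor table:

```python
# Compute marginal PMFs fX and fY from the joint PMF table by summing
# over the other variable.
joint = {(5, 0): 0.469, (10, 0): 0.124, (20, 0): 0.049,
         (5, 1): 0.231, (10, 1): 0.076, (20, 1): 0.051}

fX, fY = {}, {}
for (x, y), p in joint.items():
    fX[x] = fX.get(x, 0) + p  # sum over y
    fY[y] = fY.get(y, 0) + p  # sum over x

print(round(fX[5], 3), round(fX[10], 3), round(fX[20], 3))  # 0.7 0.2 0.1
print(round(fY[0], 3), round(fY[1], 3))                     # 0.642 0.358
```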
Conditional PMF of Y given X:
f(y | x) = Prob(Y = y | X = x) = Prob(X = x, Y = y) / Prob(X = x) = f(x, y) / fX(x)
conditional = joint / marginal
Note: Here, x is treated as fixed, so f(y | x) is only a function of y.
Note: This is not Σ_x f(x, y) = fY(y), nor Σ_y f(x, y) = fX(x).
Note: To show that f(y | x) is a valid PMF,
Σ_y f(y | x) = Σ_y f(y, x) / fX(x) = (1/fX(x)) Σ_y f(y, x) = fX(x) / fX(x) = 1.
Example: Find f(y | x) and f(x | y) from the joint PMF above.

Prob(Y = 0 | X = 5) = 0.469/0.7 = 0.67
Prob(Y = 1 | X = 5) = 0.231/0.7 = 0.33
Prob(X = 5 | Y = 0) = 0.469/0.642 = 0.73
. . .
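The conditional = joint/marginal rule applied to the same table, as a minimal sketch:

```python
# Conditional PMF f(y | x) = f(x, y) / fX(x) from the joint PMF table.
joint = {(5, 0): 0.469, (10, 0): 0.124, (20, 0): 0.049,
         (5, 1): 0.231, (10, 1): 0.076, (20, 1): 0.051}

def cond_y_given_x(y, x):
    """f(y | x) = f(x, y) / fX(x)."""
    fX_x = sum(p for (xi, yi), p in joint.items() if xi == x)  # marginal
    return joint[(x, y)] / fX_x

print(round(cond_y_given_x(0, 5), 2))  # 0.67
print(round(cond_y_given_x(1, 5), 2))  # 0.33
```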
Example: Let X = birthweight, Y = gestational age. X ∈ (2, 10) pounds; Y ∈ (20, 50) weeks. The joint PDF is given by f(x, y) = 0.26 exp(−|x − 7| − |y − 40|).

Find Prob(X > 7, Y > 40) = ∫₄₀⁵⁰ ∫₇¹⁰ f(x, y) dx dy = . . . = 0.25
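The double integral above can be checked with a crude midpoint-rule approximation:

```python
# Numerically check Prob(X > 7, Y > 40) ~ 0.25 for the joint PDF
# f(x, y) = 0.26 exp(-|x - 7| - |y - 40|).
from math import exp

def f(x, y):
    return 0.26 * exp(-abs(x - 7) - abs(y - 40))

hx, hy = 0.01, 0.01
prob = 0.0
for i in range(int(3 / hx)):        # x from 7 to 10
    x = 7 + (i + 0.5) * hx
    for j in range(int(10 / hy)):   # y from 40 to 50
        y = 40 + (j + 0.5) * hy
        prob += f(x, y) * hx * hy

print(round(prob, 2))  # 0.25
```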
Example: For the same joint PDF, find the marginal PDF of X:

fX(x) = ∫₂₀⁵⁰ f(x, y) dy = . . . = 0.52 e^(−|x−7|)
Conditional PDF of Y given X:
f(y | x) = f(x, y) / fX(x)
conditional = joint / marginal
Note: Here, x is treated as fixed, so f(y | x) is only a function of y.
Note: This is not ∫ f(x, y) dx = fY(y), nor ∫ f(x, y) dy = fX(x).
Note: To show that f(y | x) is a valid PDF,
∫ f(y | x) dy = ∫ f(y, x) / fX(x) dy = (1/fX(x)) ∫ f(y, x) dy = fX(x) / fX(x) = 1.
Example: Let X = birthweight, Y = gestational age. X ∈ (2, 10) pounds; Y ∈ (20, 50) weeks. The joint PDF is given by f(x, y) = 0.26 exp(−|x − 7| − |y − 40|). Find f(y | x).
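One possible worked solution (not shown on the slide), using the approximate marginal fX(x) = 0.52 e^(−|x−7|) found earlier:

```latex
f(y \mid x) = \frac{f(x, y)}{f_X(x)}
            = \frac{0.26\, e^{-|x-7| - |y-40|}}{0.52\, e^{-|x-7|}}
            = 0.5\, e^{-|y-40|}, \qquad y \in (20, 50).
```

Note that the result does not depend on x, which reflects the fact that X and Y are independent under this joint PDF.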
The bivariate normal is the most common multivariate family. There are 5 parameters:

1. µX is the marginal mean of X.
2. µY is the marginal mean of Y.
3. σ²X is the marginal variance of X.
4. σ²Y is the marginal variance of Y.
5. ρ = ρXY is the correlation between X and Y.

The joint PDF is

f(x, y) = (1 / (2πσXσY√(1 − ρ²))) exp{ −(1 / (2(1 − ρ²))) [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ ((x − µX)/σX)((y − µY)/σY) ] }
Example: Suppose X and Y are bivariate normal with µX = µY = 0 and σX = σY = 1. Find the marginal distribution of X.
Example: Suppose X and Y are bivariate normal with µX = µY = 0 and σX = σY = 1. Find the conditional distribution of Y given X.
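The answers to these two examples are not shown on the slides; for the standardized case (µX = µY = 0, σX = σY = 1) the classical bivariate-normal results are:

```latex
X \sim \mathrm{N}(0, 1) \quad \text{(marginal)},
\qquad
Y \mid X = x \;\sim\; \mathrm{N}\!\left(\rho x,\; 1 - \rho^2\right) \quad \text{(conditional)}.
```

The conditional mean is linear in x and the conditional variance shrinks as |ρ| → 1, which is the usual regression interpretation of the bivariate normal.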
Recall conditional distributions:

f(y | x) = f(x, y) / f(x)
conditional = joint / marginal

This can be extended to

f(y | x) = f(x, y) / f(x) = f(x | y) f(y) / f(x) = f(x | y) f(y) / Σ_y f(x | y) f(y)

This is the form of the famous “Bayes’ theorem” (or “Bayes’ rule”). Note: the denominator is simply a normalizing constant.
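A minimal sketch of discrete Bayes' rule using the dose/tumor example: the conditional likelihoods f(Y = 1 | X = x) below are computed as f(x, 1)/fX(x) from the joint table.

```python
# Bayes' rule for discrete X: f(x | y) = f(y | x) fX(x) / sum_x f(y | x) fX(x).
fX = {5: 0.7, 10: 0.2, 20: 0.1}               # marginal (prior) for X
f_y1_given_x = {5: 0.33, 10: 0.38, 20: 0.51}  # f(Y=1 | X=x) from the table

unnorm = {x: f_y1_given_x[x] * fX[x] for x in fX}
Z = sum(unnorm.values())                      # normalizing constant f(Y=1)
posterior = {x: p / Z for x, p in unnorm.items()}

print(round(Z, 3))             # 0.358, matching fY(1) from the table
print(round(posterior[5], 2))  # 0.65
```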
In a Bayesian data analysis, we select:

1. the prior f(θ),
2. the likelihood f(y | θ).

Based on these, we must compute

3. the posterior f(θ | y).

Bayes’ theorem: the mathematical formula to convert the likelihood and prior to the posterior.

f(θ | y) = f(y | θ) f(θ) / f(y)

Posterior ∝ Likelihood × Prior
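The Posterior ∝ Likelihood × Prior recipe can be sketched on a grid; the Binomial data (y = 6 successes in n = 10 trials) and Uniform (i.e., Beta(1, 1)) prior are assumed example choices, for which the exact posterior is Beta(1 + y, 1 + n − y):

```python
# Grid approximation of Posterior ∝ Likelihood × Prior for a Binomial
# likelihood with a Uniform prior on theta.
from math import comb

n, y = 10, 6
thetas = [i / 1000 for i in range(1, 1000)]  # grid over (0, 1)
prior = [1.0 for _ in thetas]                # Uniform = Beta(1, 1) prior
like = [comb(n, y) * t**y * (1 - t)**(n - y) for t in thetas]

unnorm = [l * p for l, p in zip(like, prior)]
Z = sum(unnorm)                              # normalizing constant
post = [u / Z for u in unnorm]

post_mean = sum(t * p for t, p in zip(thetas, post))
print(round(post_mean, 3))  # 0.583, the exact Beta(7, 5) mean (1+y)/(2+n)
```

Normalizing by Z is exactly the role of f(y) in Bayes' theorem: it makes the posterior sum (or integrate) to 1.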