Biospecimen Assessment
Michelle Danaher
University of Maryland, Baltimore County and Eunice Kennedy Shriver National Institute of Child Health and Human Development
1/60
Biospecimen Assessment Michelle Danaher University of Maryland, - - PowerPoint PPT Presentation
Biospecimen Assessment Michelle Danaher University of Maryland, Baltimore County and Eunice Kennedy Shriver National Institute of Child Health and Human Development 1/60 Outline Biospecimen Assessment Chapter 1: Background Pooling variables
Michelle Danaher
University of Maryland, Baltimore County and Eunice Kennedy Shriver National Institute of Child Health and Human Development
1/60
Biospecimen Assessment
Chapter 1: Background Pooling variables in biomarker assessment models
Chapter 2: Gene-environment interactions Chapter 3: Estimation of interaction effects using pooled biospecimen
Constrained estimation in biomarker assessment models
Chapter 4: Minkowski-Weyl priors for models with parameter constraints
2/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
Biospecimen Assessment
Chapter 1: Background Pooling variables in biomarker assessment models
Chapter 2: Gene-environment interactions Chapter 3: Estimation of interaction effects using pooled biospecimen
Constrained estimation in biomarker assessment models
Chapter 4: Minkowski-Weyl priors for models with parameter constraints
3/60
Danaher, M.R., Schisterman, E.F ., Roy, A. and Albert, P .S. (2012), Estimation of gene-environment interaction by pooling biospecimens. Statistics in Medicine, 31: 3241–3252. DOI: 10.1002/sim.5357 Danaher, M.R., Roy, A., Chen, Z., Mumford, S.L. and Schisterman, E.F . (2012), Minkowski-Weyl priors for models with parameter constraints: an analysis of the BioCycle Study. Journal of the American Statistical Association, in press.
Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
4/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
Why are we interested in biospecimen assessment?
5/60
Why are we interested in biospecimen assessment? Epidemiologic studies increasingly depend on biomarkers to indicate exposure.
5/60
Why are we interested in biospecimen assessment? Epidemiologic studies increasingly depend on biomarkers to indicate exposure. Financial expense limits epidemiologists from measuring biomarkers
5/60
Why are we interested in biospecimen assessment? Epidemiologic studies increasingly depend on biomarkers to indicate exposure. Financial expense limits epidemiologists from measuring biomarkers We develop methods for proper statistical inference when biospecimens are pooled.
5/60
Why are we interested in biospecimen assessment? Epidemiologic studies increasingly depend on biomarkers to indicate exposure. Financial expense limits epidemiologists from measuring biomarkers We develop methods for proper statistical inference when biospecimens are pooled. Biomarkers have known relationships due to biology
5/60
Why are we interested in biospecimen assessment? Epidemiologic studies increasingly depend on biomarkers to indicate exposure. Financial expense limits epidemiologists from measuring biomarkers We develop methods for proper statistical inference when biospecimens are pooled. Biomarkers have known relationships due to biology We develop methods for effectively incorporating known biological constraints into a model.
5/60
Pooling variables in biomarker assessment models
In World War II Dorfman (1943) proposed pooling biospecimen to identify syphilis. Later, pooling was used to obtain a sum of individual components which are used in estimation of model parameters. (Brown and Fisher, 1972; Rhode, 1976)
6/60
Pooling variables in biomarker assessment models
The gene-environment interaction is a particularly important hypothesis. Traditional interaction estimators require genotyping a lot
We propose a novel pooling strategy to obtain a substantially less expensive estimator.
7/60
Pooling variables in biomarker assessment models
When two or more exposures are measured in pools, it is interesting to estimate interaction effects. Weinberg and Umbach (1999) proposed the set-based logistic model which,
estimates main effects of exposures measured in pools, estimates interaction effects when one exposure is used to form pooling strata. HOWEVER, it can not estimate interaction, quadratic, or higher order effects of exposures measured in pools.
We propose a novel method for estimating interactions of pooled exposures.
8/60
Constrained estimation in biomarker assessment models
20 40 60 80 100 120 2 4 6 8 10 12 14 16 18 20
1 2
20 40 60 80 100 120 2 4 6 8 10 12 14 16 18 20
1 2
(10,0) (98.89, 8.89)
5 . 1 5 . 89 . 8 89 . 98 10 ) , ( U V U V
Gelfand et al. (1992) proposed drawing samples from an unconstrained posterior, and retaining only those satisfying the constraints. Dunson et al. (2003) proposed a transformation approach for draws falling outside the constraint space. We propose a Bayesian model with a prior density on a constraint space using Minkowski-Weyl decomposition.
9/60
10/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
Estimation of interaction effects using pooled biospecimen
Introduction Notation and assumptions A cohort study
The EM algorithm Variance estimation Simulation
A case control study
The EM algorithm Variance estimation Simulation
Discussion
11/60
1 2 p
Intro Notation Cohort Cohort Simulation Case Control Case Control Simulation Discussion
Why are we interested in pooling biospecimen?
12/60
Why are we interested in pooling biospecimen?
Pooling saves financial resources, and invaluable biological resources.
12/60
Why are we interested in pooling biospecimen?
Pooling saves financial resources, and invaluable biological resources. When measuring a continuous biomarker, pooling may reduce the percentage of measurements under the LOD.
Pooled exposure assessment Individual exposure assessment
LOD LOD
12/60
n = the # of individuals in the study. p = the # of individuals’ biospecimen within each pool. np = n/p is the # of pools of biospecimen. xij = [xij1, xij2]′ are two continuous exposures of interest, of the jth person in the ith pool. xp
i = p j=1 xij, is the observed exposure within a pool.
yij is the binary response of the jth person in the ith pool.
13/60
We assume that the disease status and exposures have the following distributions, yij|xij, β ind ∼ Ber[π(xij, β)], xij|µ, Σ iid ∼ N2(µ, Σ), for i = 1, 2, . . . , np, j = 1, 2, . . . , p π(xij, β) is the logistic link given by, π(xij, β) = exp
, β = [β0, β1, β2, β3]′ are the unknown parameters of interest, and D(xij) = [1, xij1, xij2, xij1 × xij2]′. Define the unknown parameters by θ = [β′, γ′]′, where γ = [σ1, σ2, ρ, µ′]′.
14/60
The likelihood for a cohort study is as follows, L(θ|y1, . . . yp, X1 . . . Xp) =
np
p
Pr(Y = yij|xij, β)φ(xij|µ, Σ) When p = 2, and pooling is matched within disease status (i.e. y1 = y2), L(θ|y1, X1, X p) =
np
g(xi1, yi1, xp
i , θ),
where the function g is defined as, g(xi1, yi1, xp
i , θ)
= Pr(Y = yi1|xi1, β)φ(xi1|µ, Σ) ×Pr(Y = yi1|xp
i − xi1, β)φ(xp i − xi1|µ, Σ).
15/60
The EM algorithm
The first step of the EM algorithm is to evaluate Q, Q(θ|θt) = Ef{log[L(θ|y1, X1, X p)]} =
np
i , θ)]fX1|y1,Xp,θt (xi1|y1, X p, θt) dxi1,
where the density fX1|y1,X p,θt is, fX1|y1,X p,θt(xi1|y1, X p, θt) = g(xi1, yi1, xp
i , θt)
i , θt) dxi1
.
16/60
The EM algorithm
The second step in the EM algorithm is to maximize Q with respect to (w.r.t.) the unknown parameters θ. We use Newton Raphson to maximize Q w.r.t β, where,
βk+1 = βk −
δβδβ′ −1 δQ(θ|θt) δβ ,
and k increases until convergence. The values of µ and Σ which maximize Q are,
µt+1 = 1 np
np
xp
i
2 , Σt+1 = µt+1(µt+1)′ + 1 np
np
xp
i xp′ i
2 − xp
i (µt+1)′ + Ef
i1
i
2 Ef
i1
i
2 ,
17/60
Variance estimation
We use the variance estimator proposed by Louis (1982),
Iobs(ˆ θ) = Icomplete(ˆ θ) − Imiss|obs(ˆ θ),
Icomplete(ˆ θ) is the observed information matrix of the conditional expected complete data,
Icomplete(ˆ θ) = −δ2Q(θ|θt) δθδθ′
θ
= −
δ2Q(θ|θt ) δβδβ′ δ2Q(θ|θt ) δγδγ′
θ
,
Imiss|obs(ˆ θ) is the expected information for the distribution f.
Imiss|obs(ˆ θ) =
np
Ef
δlog
′
θ 18/60
Simulation
β0 = 0.2 β1 = 0.1 β2 = 0.2 β3 = 0.1 Traditional MLE, n = 1000, Without Pooling (i.e. p = 1) Bias
0.001 0.002
RMSE 0.070 0.071 0.066 0.067 Est SD 0.072 0.072 0.066 0.067 Obs SD 0.070 0.071 0.066 0.067 Power 0.810 0.278 0.857 0.339 MLE using the EM algorithm, n = 1000, p = 2 Bias 0.001 0.008
RMSE 0.070 0.077 0.065 0.096 Est SD 0.072 0.079 0.067 0.096 Obs SD 0.071 0.077 0.066 0.096 Power 0.780 0.200 0.878 0.173 Traditional MLE, n = 500, Without Pooling (i.e. p = 1) Bias
0.001
RMSE 0.102 0.104 0.096 0.102 Est SD 0.102 0.103 0.094 0.096 Obs SD 0.102 0.105 0.096 0.102 Power 0.521 0.157 0.560 0.199
Table : For this simulation γ = [1, 1, 0.02, 0.1, 0.5]. Obs SD is the observed standard
deviation from the 1000 Monte Carlo replicate estimates, and Est SD is the estimated standard deviation.
19/60
The likelihood for a case control study is as follows,
L(θ|y1, . . . yp, X1 . . . Xp, s1, . . . sp) =
np
p
Pr(Y = yij, xij|sij = 1, θ) =
np
p
Pr(Y = yij|xij, β)φ(xij|µ, Σ)τ(yij) α(yij, θ) ,
where τ(yij) = Pr(Y = yij|sij = 1), and α(yij, θ) is defined,
α(yij, θ) =
When p = 2, and pooling is within strata of the disease (i.e. y1 = y2),
L(θ|y1, X1, X p, s1) =
np
g(xi1, yi1, xp
i , θ)τ(yi1)2
α(yi1, θ)2 .
20/60
Define κ to be an estimate of disease prevalence. Approximate α(1, θ) using Gauss-Hermite quadratures, α(1, θ) ≈ (π)−1
j
exp
1 + exp
, where s∗(z) = (2Σ)1/2z + µ, and rj and wj are the Gauss-Hermite nodes and weights. We solve the following equation for β0, 0 = κ − (π)−1
j
exp
1 + exp
. Define β∗ =
0, β1, β2, β3
′ and θ∗ = [β∗′, γ′].
21/60
The EM algorithm
The first step of the EM algorithm is to evaluate Q∗,
Q∗(θ∗|θ∗t) = E{log[L(θ∗|y1, X1, X p, s1)]|y1, X p, θ∗t, s1} =
np
i , θ∗)]fX1|y1,Xp,θ∗t (xi1|y1, X p, θ∗t) dxi1
+2log(τ(yi1)) − 2log(α(yi1, θ∗)) = Q(θ∗|θ∗t) + 2np1 [log(τ(1)) − log(α(1, θ∗))] +2np0 [log(1 − τ(1)) − log(1 − α(1, θ∗))] .
22/60
The EM algorithm
The second step in the EM algorithm is to maximize Q with respect to (w.r.t.) the unknown parameters θ∗. We are not able to separately maximize Q∗ w.r.t. β∗ and γ as we did for the cohort design because of α(1, θ∗) Instead, we use the Optimization Toolbox in MATLAB to maximize Q∗ w.r.t. [β1, β2, β3, γ′].
23/60
Variance estimation
Imiss|obs(ˆ θ) is exactly the same as in the cohort design. Icomplete(ˆ θ) must be modified slightly as follows,
Icomplete(ˆ θ) = −δ2Q(θ|θt) δθδθ′
θ
+ δ2 2np1log(α(1, θ)) + 2np0log(1 − α(1, θ)
θ 24/60
Simulation
β0 = −3.000 β1 = 0.100 β2 = 0.100 β3 = 0.100 Traditional MLE, n = 1000, Without Pooling (i.e. p = 1) Bias
0.002 0.000 RMSE 2.913 0.075 0.067 0.066 Power 0.069 0.307 0.324 0.334 MLE using the EM algorithm, n = 1000, p = 2, κ = 0.05 Bias 0.034 0.005
RMSE 0.051 0.079 0.065 0.091 Est SD 0.060 0.082 0.067 0.091 Obs SD 0.038 0.079 0.065 0.091 Power 1.000 0.193 0.337 0.202 MLE using the EM algorithm, n = 1000, p = 2, κ = 0.15 Bias
0.003
RMSE 1.176 0.080 0.064 0.091 Est SD 0.064 0.082 0.067 0.093 Obs SD 0.037 0.080 0.064 0.091 Power 1.000 0.207 0.340 0.188 Traditional MLE, n = 500, Without Pooling (i.e. p = 1) Bias
0.000
RMSE 2.912 0.104 0.094 0.093 Power 0.018 0.165 0.185 0.193
Table : γ = [1.0, 1.0, 0.02, 0.1, 0.5], and the true disease prevalence is 0.0516.
25/60
The EM algorithm is a useful tool when biospecimen have been measured in pools. Through simulation we showed that sampling half of the individuals compared to pooling pairs,
loses power in regards to the main effects, the power to test the interaction effect is approximately the same.
Future work:
Use Box-Cox transformation to allow more flexible distributions for the exposure. For pools larger than two individuals, use the Monte Carlo EM (MCEM) algorithm.
26/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
27/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
Gene-environment interactions
Introduction Pooled GxE estimator The pooling process The pooled GxE estimator with measurement error Simulations Discussion
28/60
1 2 p
Why are we interested in pooling for estimating GxE?
GxE are critical hypotheses in studying diseases with complex etiologies. Case-control interaction estimator is known to require a large sample size for reasonable power. Geneotyping a large number of individuals can be expensive and time consuming. Access to sufficient quantities of biospecimens from each individual may not be available.
29/60
Notation
D = binary disease status. E = binary environmental exposure. Ai = binary allele status, i ∈ {1, 2}, each individual has two copies of the allele. G = number of copies of the allele of interest, G = A1 + A2.
30/60
Our propose pooling strategy
D = 0 D = 1 E = 0 f00 f10 E = 1 f01 f11 Pool biospecimens within all levels of D and E. Perform R replicate measurements on each level (4R
31/60
Model for D given A and E
Assume that the disease is related to the allele and environmental statuses by logit(Pr(D|A, E)) = α + βEE + βAA + βA×EAE. Therefore βA×E is the allele-environment interaction parameter.
32/60
Model for D given G and E
Assume independence between A1 and A2. The model for the disease given genotype and environmental status is logit(Pr(D|G, E)) = ˜ α + ˜ βEE + βAG + βA×EGE. The GxE parameter of interest, is equivalent to the allele-environment interaction (βG×E = βA×E).
33/60
The estimator!
The closed form MLE of βA×E is, ˜ βG×E(MLE) = ˜ γ11 − ˜ γ01 − ˜ γ10 + ˜ γ00 where ˜ γDE = log ˜
fDE 1−˜ fDE
fDE is the sample allele frequency.
34/60
Three stages
1
Pool formation
2
PCR (polymerase chain reaction)
3
Estimation
35/60
Estimation: as described in Germer et al. (2000)
Frequency PCR cycle Threshold
t
C
Each PCR amplification cycle doubles the frequency. The minor allele frequency can be estimated as, ˜ fDE = 1 1 + 2∆˜
CtDE ,
36/60
The estimator and it’s estimated variance
Define ∆ˆ CtDE(k) to be the estimated cycle difference, ∆ˆ CtDE(k) = ∆˜ CtDE + ǫDE(k) The estimator for GxE is, ˆ βG×E = − log(2) R
R
∆ˆ Ct11(k)−∆ˆ Ct01(k)−∆ˆ Ct10(k)+∆ˆ Ct00(k) The variance of the GxE estimator is, ˆ var
βG×E
1
1
1 R
R
rDE(1 − ˆ fDE(k))ˆ fDE(k)
R (log(2))2s2
DE
37/60
Simulation parameters
We define Pr(A = 1) = 0.37, Pr(E = 1) = 0.25, and Pr(D|E, G) as follows, logit(Pr(D = 1|E, G)) = α + βEE + βGG + βG×EGE where α = −2.26, βE = 1.06, and βG = 0.18. We found ∆˜ CtDE and add measurement error using ∆ˆ CtDE(k) = ∆˜ CtDE + ǫDE(k), where ǫDE(k)
iid
∼ N(0, σ2
DE).
We defined R to be 3 and σ2
DE = 0.05.
38/60
Simulation results
βGXE Mean
βGXE
Mean
var
βGXE
Variance Probability
βGXE (95% CI)
0.0764 0.0759 0.9416
0.0757 0.0757 0.9432 0.0023 0.0732 0.0756 0.9478 .3 0.2926 0.0744 0.0768 0.9508 .5 0.4915 0.0773 0.0785 0.9470
Table : Simulation results. The level for the empirical coverage probability is 0.05. The number of Monte Carlo replicates is 5000.
39/60
Simulation results
40/60
We proposed a powerful method to estimate a gene-environment interaction by pooling biospecimens. Benefits:
Pooling can greatly reduce the number of measurements. Pooling can save valuable biospecimens.
Limitations:
The disease and environment statuses, must be known before the pooling. We assume that A1 and A2 (i.e. the mother and father) are independent within disease and environmental status.
Although our method has limitations in its assumptions, it may be very useful in a pilot study.
41/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
42/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
Minkowski-Weyl priors for models with parameter constraints
Introduction The Minkowski-Weyl Prior Prior Specification Posterior Computation and Simulation Analysis of the BioCycle Study Data Conclusions
43/60
Scientific Question
Associated
44/60
The BioCycle Study
Prospective cohort of 259 healthy menstruating women aged 18-44 from western New York. Women were provided with fertility monitors to schedule 8 clinic visits.
45/60
For illustration, let yi = µ + ǫi, ǫi
iid
∼ Np(0, Σ). Suppose there is prior knowledge that, M = {µ ∈ Rp : Aµ < b, µ > 0}. The likelihood is then given by, L(µ, Σ|y1, . . . , yn) =
n
Np(yi; µ, Σ)I(µ ∈ M). The covariance parameter Σ is taken to belong to C, the cone of all p × p positive definite matrices. Bayesian analysis would require one to specify a prior on the restricted set M × C.
46/60
Definition (Minkowski-Weyl) A set P in Rp is a convex polyhedral iff P = conv(v1, . . . , vk)
(1) for some vectors v1, . . . , vk, u1, . . . , ul. v1, . . . , vk denote the k extreme points of P. u1, . . . , ul denote the extreme directions of P. Any vector θ in P can be represented as θ = (α1v1 + · · · + αkvk) + (γ1u1 + · · · + γlul), where α = (α1, . . . , αk)′ belongs to the k-dimensional simplex and γ = (γ1, . . . , γl)′ is a vector of positive coefficients.
47/60
20
+ = ) , ( U V γ α γ α μ
16 18
⎥ ⎤ ⎢ ⎡ = 89 . 98 10 V
12 14
2
⎥ ⎦ ⎤ ⎢ ⎣ ⎡ = ⎥ ⎦ ⎢ ⎣ 5 1 5 . 89 . 8 U V
8 10
(98.89,8.89)
⎥ ⎦ ⎢ ⎣ 5 . 1
4 6 20 40 60 80 100 120 2
(10,0) (0.5,0.5) (0,1) (0 0)
20 40 60 80 100 120
(10,0) (0,0)
48/60
We invoke a prior distribution for µ via prior distributions on α and γ. We assume that α, γ and Σ are a priori mutually independent, therefore, p(α, γ, Σ) = p(MW)(πα, πγ)p(Σ), α ∼ Dir(M, a1, . . . , ak), γ ∼
l
G(g1i, g2i), Σ ∼ IW(Ψ, κ).
49/60
Consider the case where the parameters are as follows, µ = 100.1 10.5
0.2 0.2 5
where prior knowledge restricts the mean, µ, to M. We generate the outcomes, yi
iid
∼ N2(µ, Σ) for i = 1, 2, . . . , n, n = 250. We assumed the following prior distributions, α ∼ Dir(M, a1, . . . , ak), γ ∼
l
G(1, 4), Σ ∼ IW(10I2, 10).
50/60
10 20 30 40 50 60 70 80 90 100 110 5 10 15 20 25 30
µ1 µ2
(a) M, a = (2, 0.45, 0.05, 0.5)
99 99.2 99.4 99.6 99.8 100 100.2 100.4 100.6 100.8 101 9.5 10 10.5 11 11.5
µ1 µ2
(b) M, a = (2, 0.45, 0.05, 0.5)
Figure : On the left is the prior density of the means, µ, when the Dirichlet distribution with parameters (M, a) is the prior for α. On the right is samples from the posterior distribution.
51/60
10 20 30 40 50 60 70 80 90 100 110 5 10 15 20 25 30
µ1 µ2
(a) M, a = (1, 0.05, 0.05, 0.9)
99 99.2 99.4 99.6 99.8 100 100.2 100.4 100.6 100.8 101 9.5 10 10.5 11 11.5
µ1 µ2
(b) M, a = (1, 0.05, 0.05, 0.9)
10 20 30 40 50 60 70 80 90 100 110 5 10 15 20 25 30
µ1 µ2
(c) M, a = (1, 0.9, 0.05, 0.05)
99 99.2 99.4 99.6 99.8 100 100.2 100.4 100.6 100.8 101 9.5 10 10.5 11 11.5
µ1 µ2
(d) M, a = (1, 0.9, 0.05, 0.05)
52/60
We assume the following linear mixed model for i = 1, . . . , n, n = 250, h ∈ {E, L, F, P}, and t ∈ {1, . . . , 8} , yiht = µht + w′
itφh + ξih + ζit + ǫiht,
ξiE ξiL ξiF ξiP
iid
∼ N(0, DH), ζi1 ζi2 . . . ζi8
iid
∼ N(0, DT ), ǫi
iid
∼ N(0, σ2
eI)
53/60
Let µ = (µE1, . . . , µE8, . . . , µP1, . . . , µP8)′, We partition µ = [µC, µU]′, and µC = [µCE, µCL, µCP]′. Transform µC to the positive space by defining a maximum such that, µC < K, where K = 10 × 19. The following is the constraint space for µC∗, where µC = K − µC∗, M = {µC∗ : AhµCh∗ < bh, µCh∗ > 0, h ∈ {E, L, P}}. Applying Minkowski decomposition of each partition of M, we find the corresponding V h and Uh matrices.
54/60
Time varying covariate effects on hormones covariate hormone mean 90% credible int. log(beta- estrogen 0.000
carotene) FSH 0.000
LH
progesterone 0.048 0.001 , 0.095 log(γ-vitamin) estrogen 0.021
FSH 0.012
LH
progesterone
cholesterol estrogen 0.027
FSH
LH 0.003
progesterone 0.042
homocysteine estrogen
FSH 0.038 0.006 , 0.070 LH 0.015
progesterone
log(F2- estrogen 0.038 0.002 , 0.074 isoprostane) FSH
LH
progesterone 0.072 0.027 , 0.116
55/60
We defined a new class of priors that permit full Bayesian analysis for models with linear inequality constraints. Analyzing the BioCycle data using MW prior provides new insight to the association between menstrual hormone levels and agents for oxidative stress. We are interested in extending exploring prior distributions for models with nonlinear inequality constraints.
56/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments
Funding: Intramural Research Training Award, Eunice Kennedy Shriver, National Institute of Child Health and Human Development
57/60
58/60
59/60
60/60 Chapter 1 Chapter 2 Chapter 3 Chapter 4 Acknowledgments