Regression Analysis for Probabilistic Cause-of-Disease Assignment



slide-1
SLIDE 1

Background Models Regression Simulations Results Discussion

Regression Analysis for Probabilistic Cause-of-Disease Assignment Using Case-Control Diagnostic Tests

Zhenke Wu

Assistant Professor of Biostatistics
Research Assistant Professor of Michigan Institute for Data Science (MIDAS)
University of Michigan, Ann Arbor
Twitter handle: @ZhenkeWu

R package “baker”: https://github.com/zhenkewu/baker

Zhenke Wu(zhenkewu@umich.edu) 2019 TAMU 1 / 55

slide-2
SLIDE 2

Motivating Application

Pneumonia Etiology Research for Child Health (PERCH) (PERCH Study Group, Lancet 2019, In Press)

Background:

  • > 30 possible infectious causes
  • Difficult to observe directly

Goal:

  • Population disease etiology estimation
  • Individual diagnosis

Study details:

  • $40 million, Gates-funded, 7-country study; sites in sub-Saharan Africa and South Asia
  • Diverse measures with variable precision
  • ∼5,000 cases and ∼5,000 controls

slide-3
SLIDE 3

Measurements of Different Quality

[Figure: schematic relating the latent health state (lung infection) to the measurements taken on each specimen (S), for cases (~5,000) and controls (~5,000). Measurements shown: nasopharyngeal PCR (bronze-standard, BrS), blood culture (silver-standard, SS) and lung aspirate (gold-standard, GS), with measurement precisions (θ, ψ). Some entries are marked NA.]

  • *NP: nasopharyngeal; PCR: polymerase chain reaction; LA: lung aspirate

slide-4
SLIDE 4

Data From A Random Case

slide-5
SLIDE 5

Problem and Data Features

Summary

Problem:

  • 1. To infer each individual's latent health state
  • 2. To estimate the population distribution of latent health states (CSCFs)

Features:

  • Case data:
    • 1. Few or no gold-standard measurements
    • 2. A large number of categories of latent health states
    • 3. Multiple sources of measurements of differing quality
  • Extra control data to integrate

No method has effectively estimated the etiologic distribution (“pie”) using such data.

slide-6
SLIDE 6

Previous Statistical Methods for Etiology Research

A Selected Review

  • Case-only, needs lots of GS data: verbal autopsy methods for areas without medical death certification; kernel smoothing for estimating the sparse probability contingency table $\Pr[M^{BrS} \mid I]$ (King and Lu, 2008, Stat. Sci.)
  • Case-only, BrS data: Bayesian nonparametric clustering (Hoff, 2004, Biometrics); subset clustering (Friedman and Meulman, 2004, JRSS-B). Neither uses pre-defined cluster labels.
  • Case-control, only allows BrS data, assumes perfect test sensitivities: attributable fraction method (Bruzzi et al., 1985, AJE) based on the logistic regression

$$\operatorname{logit} \Pr[Y_i = 1 \mid M_i^{BrS}, X_i] = \sum_{j=1}^{J} \beta_j M_{ij}^{BrS} + X_i' \gamma$$

slide-7
SLIDE 7

Case Measurement Model

Joint Distribution of (Health State, Measurements)

slide-8
SLIDE 8

Hierarchical Bayes Model for Etiology Research

slide-9
SLIDE 9

Partially-Latent Class Models (pLCM; Wu et al., 2015)

Notation

  • $Y_i = 0$ for a control; $Y_i = 1$ for a case
  • $I_i^L = 0$ for a control; $I_i^L = 1$ for pathogen 1; ...; $I_i^L = L$ for pathogen $L$
  • $M_i^S = (M_{i1}^S, \ldots, M_{iJ_S}^S)'$: measurement vector
    • Specimen $S$ on individual $i$
    • 1 for presence of pathogen from the test; 0 for absence

slide-10
SLIDE 10

Partially-Latent Class Models (pLCM; Wu et al., 2016)

Model Structure (Bronze-Standard Data Only)

[Figure: pLCM structure for five pathogens A–E. Controls have every pathogen detected at its false positive rate (FPR) ψ. Cases fall into disease “classes” A–E with population etiology π = (π_A, ..., π_E); within each class the causative pathogen is detected at its true positive rate (TPR) θ and the other four pathogens at their FPRs ψ. Annotations mark which parts of the diagram carry statistical information and where partial identifiability arises.]

slide-11
SLIDE 11

Assumptions

pLCM

  • Non-interference assumptions for BrS data:

$$P(M^{BrS}_{[-(j,j')]} \mid I^L = j, Y = 1) = P(M^{BrS}_{[-(j,j')]} \mid I^L = j', Y = 1), \quad j, j' = 1, \ldots, J,$$

$$P(M^{BrS}_{[-j]} \mid Y = 0) = P(M^{BrS}_{[-j]} \mid I^L = j, Y = 1), \quad j = 1, \ldots, J$$

  • Independence of measurements given class label ($I_i^L$)

slide-12
SLIDE 12

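Likelihood

pLCM

  • Bronze-standard

$$P_i^{0,BrS} = \prod_{j=1}^{J} \left(\psi_j^{BrS}\right)^{m_j} \left(1 - \psi_j^{BrS}\right)^{1-m_j},$$

$$P_{i'}^{1,BrS} = \sum_{j=1}^{J} \pi_j \cdot \left(\theta_j^{BrS}\right)^{m_j} \left(1 - \theta_j^{BrS}\right)^{1-m_j} \prod_{l \neq j} \left(\psi_l^{BrS}\right)^{m_l} \left(1 - \psi_l^{BrS}\right)^{1-m_l}, \quad m = m_{i'}^{BrS}$$

  • Silver-standard

$$P_{i'}^{1,SS} = \Pr(M_{i'}^{SS} = m \mid \pi, \theta^{SS}) = \sum_{j=1}^{J'} \pi_j \cdot \left(\theta_j^{SS}\right)^{m_j} \left(1 - \theta_j^{SS}\right)^{1-m_j} \mathbf{1}\Big\{\sum_{l=1}^{J'} m_l \le 1\Big\}, \quad m = m_{i'}^{SS}$$

  • Gold-standard

$$P_{i'}^{1,GS} = \Pr(M_{i'}^{GS} = m \mid \pi) = \prod_{j=1}^{J} \pi_j^{\mathbf{1}\{m_j = 1\}} \, \mathbf{1}\Big\{\sum_{j} m_j = 1\Big\}, \quad m = m_{i'}^{GS}.$$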
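As a rough illustration of the bronze-standard likelihood terms on this slide, the Python sketch below evaluates the control and case probabilities for one binary measurement vector. The function names and all numeric values are assumptions made for illustration; they are not taken from the baker package.

```python
import numpy as np

def product_bernoulli(m, s):
    """Probability of binary vector m under independent Bernoulli(s_j) entries."""
    m = np.asarray(m, dtype=float)
    s = np.asarray(s, dtype=float)
    return float(np.prod(s**m * (1.0 - s)**(1.0 - m)))

def plcm_control_prob(m, psi):
    """Control term: every pathogen is detected at its false positive rate."""
    return product_bernoulli(m, psi)

def plcm_case_prob(m, pi, theta, psi):
    """Case term: mixture over causes j; pathogen j uses its TPR, the rest use FPRs."""
    total = 0.0
    for j in range(len(pi)):
        rates = np.array(psi, dtype=float)
        rates[j] = theta[j]
        total += pi[j] * product_bernoulli(m, rates)
    return total

# Hypothetical parameter values for three pathogens
pi = [0.5, 0.3, 0.2]        # population etiology
theta = [0.9, 0.8, 0.7]     # true positive rates
psi = [0.1, 0.2, 0.3]       # false positive rates
m = [1, 0, 0]               # only the first pathogen is detected

p_control = plcm_control_prob(m, psi)
p_case = plcm_case_prob(m, pi, theta, psi)
```

With these made-up values the pattern (1, 0, 0) is far more likely for a case than for a control, which is the contrast the model exploits.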

slide-13
SLIDE 13

Partial Identifiability

Necessity of Informative Priors on True Positive Rate

  • pLCM implies:

$$\Pr\left(M_{ij}^{BrS} = 1\right) = \pi_j \theta_j^{BrS} + (1 - \pi_j) \psi_j^{BrS}$$

  • Formal argument: singular vectors and values of the Jacobian matrix of the model parametrization
  • Bayesian framework sidesteps the partial identifiability problem
  • Use TPR prior elicited from laboratory scientists (Cf. Wu et al., 2015, JRSS-C)
  • No Bayesian free lunch: the posterior of unidentified parameters does not shrink to a point mass as the sample size grows
  • Identified set of parameter values; valuable in epidemiology, econometrics, sociology (Cf. Greenland, 2005, JRSS-A; Gustafson, 2009, JASA; Gustafson, 2005, Stat. Sci.; Manski, 2010, PNAS)
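A small numeric sketch may help show why the TPR needs an informative prior. It uses only the marginal relation π_j θ_j + (1 − π_j) ψ_j stated on this slide and treats it as the observed positive rate among cases, which is an interpretive assumption; the function names and numbers are hypothetical.

```python
def case_positive_rate(pi_j, theta_j, psi_j):
    """Marginal positive rate implied by the pLCM: pi*theta + (1 - pi)*psi."""
    return pi_j * theta_j + (1.0 - pi_j) * psi_j

def etiology_given_tpr(case_rate, theta_j, psi_j):
    """Solve the relation above for pi_j, treating theta_j as known."""
    return (case_rate - psi_j) / (theta_j - psi_j)

psi_j = 0.2       # hypothetical FPR (learnable from controls)
case_rate = 0.5   # hypothetical positive rate among cases

# Two different assumed TPRs give two different etiology fractions...
pi_if_tpr_90 = etiology_given_tpr(case_rate, 0.9, psi_j)
pi_if_tpr_60 = etiology_given_tpr(case_rate, 0.6, psi_j)

# ...yet both reproduce exactly the same observable case rate.
rate_a = case_positive_rate(pi_if_tpr_90, 0.9, psi_j)
rate_b = case_positive_rate(pi_if_tpr_60, 0.6, psi_j)
```

In this toy setting the data cannot separate the pair (π_j, θ_j) without outside information on θ_j.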

slide-14
SLIDE 14

Priors

pLCM

  • Informative
    • $\theta_j^{BrS} \sim \text{Beta}(c_{1j}, c_{2j})$: true positive rates for BrS data
    • $\theta_j^{SS} \sim \text{Beta}(d_{1j}, d_{2j})$: true positive rates for SS data
  • Non-informative
    • $\pi \sim \text{Dirichlet}(0.5, \ldots, 0.5)$: population etiology
    • $\psi_j^{BrS} \sim \text{Beta}(1, 1)$: false positive rates for BrS data

Joint prior for $\gamma = (\pi, \psi^{BrS}, \theta^{BrS}, \theta^{SS})'$, a priori independent: $[\gamma] = [\pi][\psi^{BrS}][\theta^{BrS}][\theta^{SS}]$
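The a priori independent joint prior can be sketched as one draw per block of parameters. In the snippet below the Beta hyperparameters c1, c2, d1, d2 are placeholder values, and using the same number of pathogens for BrS and SS data is a simplifying assumption; neither comes from the talk.

```python
import numpy as np

def draw_plcm_prior(c1, c2, d1, d2, rng):
    """One draw of gamma = (pi, psi_BrS, theta_BrS, theta_SS) from independent priors."""
    J = len(c1)
    return {
        "pi": rng.dirichlet(np.full(J, 0.5)),    # population etiology
        "psi_brs": rng.beta(1.0, 1.0, size=J),   # BrS false positive rates
        "theta_brs": rng.beta(c1, c2),           # BrS true positive rates
        "theta_ss": rng.beta(d1, d2),            # SS true positive rates
    }

rng = np.random.default_rng(1)
# Hypothetical hyperparameters for three pathogens
c1, c2 = np.array([6.0, 6.0, 6.0]), np.array([2.0, 2.0, 2.0])
d1, d2 = np.array([2.0, 2.0, 2.0]), np.array([8.0, 8.0, 8.0])
gamma = draw_plcm_prior(c1, c2, d1, d2, rng)
```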

slide-15
SLIDE 15

Posterior Computing

  • Gibbs sampler: construct correlated samples to approximate the shape of the joint posterior distribution of the unknowns
  • Unknowns:
    • $\pi$: population etiology distribution
    • $(\psi^{BrS}, \theta^{BrS})'$: FPRs and TPRs for BrS measurements
    • $\theta^{SS}$: TPRs for SS measurements
    • $I_i^L$: latent health state for case $i$
  • Individual diagnosis: for a case with new measurements $m^*$, approximate by

$$\Pr(I_i^L = j \mid m^*, D) = \int \Pr(I_i^L = j \mid m^*, \gamma) \Pr(\gamma \mid m^*, D) \, d\gamma, \quad j = 1, \ldots, J$$
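The integral for individual diagnosis is typically approximated by averaging the conditional class probabilities over sampled values of γ. The sketch below assumes BrS data only and, for lack of real Gibbs output, fabricates placeholder draws from arbitrary distributions; the function names and the draws are illustrative assumptions, not output from the actual sampler.

```python
import numpy as np

def class_probs(m, pi, theta, psi):
    """Pr(I = j | m, gamma), j = 1..J, under the pLCM with BrS data only."""
    m = np.asarray(m, dtype=float)
    J = len(pi)
    w = np.empty(J)
    for j in range(J):
        rates = np.array(psi, dtype=float)
        rates[j] = theta[j]                      # causative pathogen uses its TPR
        w[j] = pi[j] * np.prod(rates**m * (1.0 - rates)**(1.0 - m))
    return w / w.sum()

def individual_diagnosis(m_new, draws):
    """Monte Carlo average of the conditional class probabilities over draws of gamma."""
    probs = [class_probs(m_new, d["pi"], d["theta"], d["psi"]) for d in draws]
    return np.mean(probs, axis=0)

# Placeholder draws standing in for posterior samples (hypothetical)
rng = np.random.default_rng(0)
draws = [
    {"pi": rng.dirichlet([0.5, 0.5, 0.5]),
     "theta": rng.beta(8, 2, size=3),
     "psi": rng.beta(2, 8, size=3)}
    for _ in range(200)
]
diagnosis = individual_diagnosis([1, 0, 0], draws)
```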

slide-16
SLIDE 16

Information for Correct Individual Diagnosis

  • Log relative probability of $I_i^L = j$ versus $I_i^L = \ell$ given others is

$$R_{j\ell} = \log\left(\frac{\pi_j}{\pi_\ell}\right) + \log\left\{\left(\frac{\theta_j^{BrS}}{\psi_j^{BrS}}\right)^{m_j^*} \left(\frac{1 - \theta_j^{BrS}}{1 - \psi_j^{BrS}}\right)^{1 - m_j^*}\right\} + \log\left\{\left(\frac{\psi_\ell^{BrS}}{\theta_\ell^{BrS}}\right)^{m_\ell^*} \left(\frac{1 - \psi_\ell^{BrS}}{1 - \theta_\ell^{BrS}}\right)^{1 - m_\ell^*}\right\}$$

  • Suppose $I_i^L = j$. Averaging over $m^*$:

$$E[R_{j\ell}] = \log\left(\pi_j / \pi_\ell\right) + I(\theta_j^{BrS}; \psi_j^{BrS}) + I(\psi_\ell^{BrS}; \theta_\ell^{BrS})$$

    where each $I(\cdot\,; \cdot)$ term is large and positive if its arguments are discrepant.

  • $I(v_1; v_2)$: expected amount of information in $m_j^* \sim \text{Bern}(v_1)$ for discriminating against $m_j^* \sim \text{Bern}(v_2)$.
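The information term can be made concrete with a few lines of Python. The sketch reads I(v1; v2) as the Kullback-Leibler divergence from Bern(v1) to Bern(v2), an interpretation consistent with its description as expected discriminating information; the function names and numeric inputs are assumptions for illustration.

```python
import math

def bernoulli_information(v1, v2):
    """I(v1; v2): Kullback-Leibler divergence from Bern(v1) to Bern(v2), 0 < v1, v2 < 1."""
    return (v1 * math.log(v1 / v2)
            + (1.0 - v1) * math.log((1.0 - v1) / (1.0 - v2)))

def expected_log_relative_prob(pi_j, pi_l, theta_j, psi_j, theta_l, psi_l):
    """E[R_jl] when the true class is j: prior log odds plus two information terms."""
    return (math.log(pi_j / pi_l)
            + bernoulli_information(theta_j, psi_j)
            + bernoulli_information(psi_l, theta_l))

# A test with well separated TPR and FPR is informative...
sharp = bernoulli_information(0.9, 0.1)
# ...while one with identical TPR and FPR carries no information.
useless = bernoulli_information(0.5, 0.5)

# Hypothetical two-class comparison with equal prior weights
e_r = expected_log_relative_prob(0.5, 0.5, 0.9, 0.1, 0.8, 0.2)
```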

slide-17
SLIDE 17

Inference with BrS+GS Data

Simulation: 3 Pathogens; 500 Cases/Controls; 5 Cases with GS Measure

slide-25
SLIDE 25

“nested” pLCM

Relax the LI and Non-interference Assumption

  • Direct evidence against local independence (LI): control measurements (Mi1, ..., MiJ)′
    • test cross-reactions (prevented in PERCH assays)
    • lab technician effects
    • heterogeneity in subjects’ immunity levels
  • Deviations from independence impact inference (Cf. Pepe and Janes, 2007, Biostatistics; Albert et al., 2001, Biometrics)
  • Modeling deviation from LI: modeling a cross-classified probability contingency table P[Mi1 = m1, ..., MiJ = mJ | Ii], ∀m = (m1, ..., mJ)′
    • Log-linear parameterization
    • Generalized linear mixed-effect models (GLMM)
    • Simplex factor model; similar to mixed-membership model (Cf. Bhattacharya and Dunson, 2012, JASA)
    • PARAFAC decomposition (Cf. Dunson and Xing, 2009, JASA)

slide-26
SLIDE 26

Nested Partially-Latent Class Models (npLCM; Wu and Zeger, 2016)

Example: 5 Pathogens, 2 Subclasses; BrS Data Only

slide-27
SLIDE 27

Nested Partially-Latent Class Models (npLCM; Wu and Zeger, 2016)

Example: 5 Pathogens, 3 Subclasses; BrS Data Only

slide-28
SLIDE 28

Encourage Few Subclasses: Stick-Breaking Prior

Vj ∼ Beta(1, α); Example: K = 10, α = 1

  • On average, the first several segments receive most of the weight
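A quick simulation can illustrate how the stick-breaking prior concentrates weight on the first few subclasses. The sketch uses the slide's example values K = 10 and α = 1; setting the last stick to 1 so the truncated weights sum to one is a common convention assumed here, not something stated in the talk, and the function name is hypothetical.

```python
import numpy as np

def stick_breaking_weights(K, alpha, rng):
    """Draw K weights: V_k ~ Beta(1, alpha), weight_k = V_k * prod_{l<k} (1 - V_l).

    The last stick is set to 1 (truncation assumption) so the weights sum to one.
    """
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining

rng = np.random.default_rng(2019)
draws = np.array([stick_breaking_weights(10, 1.0, rng) for _ in range(5000)])
mean_weights = draws.mean(axis=0)   # average weight of each of the 10 segments
```

Inspecting `mean_weights` shows the average weight decaying quickly across segments; a smaller α would concentrate the weight even more.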

slide-29
SLIDE 29

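npLCM: Likelihood and Prior

BrS Data Only

  • Likelihood

$$P_0(M_i = m) = \sum_{k=1}^{K} \nu_k \prod_{j=1}^{J} \left\{\psi_k^{(j)}\right\}^{m_j} \left\{1 - \psi_k^{(j)}\right\}^{1-m_j},$$

$$P_1(M_i = m) = \sum_{j=1}^{J} \pi_j \sum_{k=1}^{K} \left[ \eta_k \left\{\theta_k^{(j)}\right\}^{m_j} \left\{1 - \theta_k^{(j)}\right\}^{1-m_j} \prod_{\ell \neq j} \left\{\psi_k^{(\ell)}\right\}^{m_\ell} \left\{1 - \psi_k^{(\ell)}\right\}^{1-m_\ell} \right],$$

  • Prior:

$$\pi \sim \text{Dirichlet}(.5, \ldots, .5), \quad \psi_k^{(j)} \sim \text{Beta}(1, 1), \quad \theta_k^{(j)} \sim \text{Beta}(c_{1kj}, c_{2kj}), \quad j = 1, \ldots, J; \; k = 1, \ldots, \infty,$$

$$Z_{i'} \mid I_{i'}^L = j \sim \sum_{k=1}^{\infty} U_k \prod_{\ell < k} [1 - U_\ell] \, \delta_k, \quad U_k \sim \text{Beta}(1, \alpha_0), \quad \text{for all cases},$$

$$Z_i \sim \sum_{k=1}^{\infty} V_k \prod_{\ell < k} [1 - V_\ell] \, \delta_k, \quad V_k \sim \text{Beta}(1, \alpha_0), \quad \text{for all controls},$$

$$\alpha_0 \sim \text{Gamma}(0.25, 0.25).$$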
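The npLCM likelihood terms for BrS data can be sketched with a finite number of subclasses K. In the snippet below, `Psi[k, j]` and `Theta[k, j]` hold the subclass-specific FPR and TPR of pathogen j in subclass k; the non-causative pathogens are given their own subclass-specific FPRs, which is the reading assumed here. The array layout, function names and numbers are illustrative assumptions.

```python
import numpy as np

def nplcm_control_prob(m, nu, Psi):
    """P0(m): mixture over subclasses k of product-Bernoulli terms with FPRs Psi[k, :]."""
    m = np.asarray(m, dtype=float)
    nu = np.asarray(nu, dtype=float)
    Psi = np.asarray(Psi, dtype=float)
    per_subclass = np.prod(Psi**m * (1.0 - Psi)**(1.0 - m), axis=1)
    return float(nu @ per_subclass)

def nplcm_case_prob(m, pi, eta, Theta, Psi):
    """P1(m): mixture over causes j and, within each cause, over subclasses k."""
    m = np.asarray(m, dtype=float)
    eta = np.asarray(eta, dtype=float)
    Theta = np.asarray(Theta, dtype=float)
    Psi = np.asarray(Psi, dtype=float)
    total = 0.0
    for j in range(len(pi)):
        rates = Psi.copy()            # K x J matrix, start from the FPRs
        rates[:, j] = Theta[:, j]     # causative pathogen uses subclass-specific TPRs
        per_subclass = np.prod(rates**m * (1.0 - rates)**(1.0 - m), axis=1)
        total += pi[j] * float(eta @ per_subclass)
    return total

# Hypothetical example: J = 3 pathogens, K = 2 subclasses
pi = np.array([0.5, 0.3, 0.2])
nu = np.array([0.7, 0.3])      # control subclass weights
eta = np.array([0.4, 0.6])     # case subclass weights
Psi = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
Theta = np.array([[0.9, 0.8, 0.7], [0.7, 0.9, 0.8]])

p0 = nplcm_control_prob([1, 0, 0], nu, Psi)
p1 = nplcm_case_prob([1, 0, 0], pi, eta, Theta, Psi)
```

With a single subclass (K = 1) both functions reduce to the pLCM terms, matching the special case obtained by setting η1 = 1 and ν1 = 1.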

slide-30
SLIDE 30

Estimation Bias if Ignoring Local Dependence (LD)

Simulation: LD Truth (npLCM) Estimated by Working LI Models (pLCM)

[Figure: two simulation scenarios, (I) weak LD and (II) strong LD. For each scenario, panels show pairwise odds ratios (log scale, 0.2 to 4) among pathogens A–E for cases and for controls, by class (marginal, classes A–E; controls: marginal), and a panel (a) of percent relative asymptotic bias (PRAB, scale −120 to 120) for classes A–E against the cases' first subclass weight η1 (0.25 to 1).]

slide-31
SLIDE 31

So Far: A General Framework

Nested Partially Latent Class Models (npLCM)

For simplicity, we assume “single-pathogen causes”, or a single relevant feature per cluster, or, more visually, “one row of green boxes per disease class”.

slide-38
SLIDE 38

npLCM Framework (no Covariates)

Three components of a likelihood function:

  • a. Cause-specific case fractions (CSCF): $\pi = (\pi_1, \ldots, \pi_L)^\top = \{\pi_\ell = P(I = \ell \mid Y = 1), \ell = 1, \ldots, L\} \in S_{L-1}$;
  • b. $P_{1\ell} = \{P_{1\ell}(m)\} = \{P(M = m \mid I = \ell, Y = 1)\}$: a table of probabilities of making $J$ binary observations $M = m$ in a case class $\ell \neq 0$;
  • c. $P_0 = \{P_0(m)\} = \{P(M = m \mid I = 0, Y = 0)\}$: the same probability table as above but for controls.

Cases’ disease classes are unobserved, so the distribution of their measurements is a weighted finite-mixture model: $P_1 = \sum_{\ell=1}^{L} \pi_\ell P_{1\ell}$

The likelihood:

$$L = L_1 \cdot L_0 = \prod_{i: Y_i = 1} \left\{ \sum_{\ell=1}^{L} \pi_\ell \cdot P_{1\ell}(M_i; \Theta, \Psi, \eta) \right\} \times \prod_{i': Y_{i'} = 0} P_0(M_{i'}; \Psi, \nu)$$

slide-39
SLIDE 39

Special Case: pLCM (Wu et al., 2016)

Setting η1 = 1 and ν1 = 1

Control model for multivariate binary data $\{M_i : Y_i = 0\}$:

  • 1. $P_0(m) = \prod_{j=1}^{J} \{\psi_j\}^{m_j} \{1 - \psi_j\}^{1-m_j} = \Pi(m; \psi)$
    • 1a. $\Pi(m; s) = \prod_{j=1}^{J} \{s_j\}^{m_j} \{1 - s_j\}^{1-m_j}$ is the probability mass function for a product Bernoulli distribution given the success probabilities $s = (s_1, \ldots, s_J)^\top$, $0 \le s_j \le 1$
    • 1b. Parameters $\psi = (\psi_1, \ldots, \psi_J)^\top$ represent the positive rates absent disease, referred to as “false positive rates” (FPRs).

Local Independence: $M_{ij} \perp M_{ij'} \mid I = 0$

slide-45
SLIDE 45

Special Case: pLCM (Wu et al., 2016)

Model for the multivariate binary data in case class $\ell \neq 0$

  • 2. $P_{1\ell}(m)$ is a product of the probabilities of measurements made
    • 2a. on the causative pathogen $\ell$, $P(M_\ell \mid I = \ell, Y = 1, \theta) = \{\theta_\ell\}^{M_\ell} \{1 - \theta_\ell\}^{1 - M_\ell}$, where $\theta = (\theta_1, \ldots, \theta_J)^\top$ are “true positive rates” (TPRs), larger than FPRs.
    • 2b. on the non-causative pathogens, $P(M_{i[-\ell]} \mid I_i = \ell, Y_i = 1, \psi_{[-\ell]}) = \Pi(M_{[-\ell]}; \psi_{[-\ell]})$, where $a_{[-\ell]}$ represents all but the $\ell$-th element in a vector $a$.
    • 2c. Under the single-pathogen-cause assumption, pLCM uses $J$ TPRs $\theta$ for $L = J$ causes and $J$ FPRs $\psi$.

2a-2b: Local Independence (LI): $M_{ij} \perp M_{ij'} \mid I = \ell \neq 0$

2a-2b: Non-interference: disease-causing pathogen(s) are more frequently detected among cases than controls ($\theta_\ell > \psi_\ell$) and the non-causative pathogens are observed with the same rates among cases as in controls

slide-48
SLIDE 48

Regression Analysis in nested pLCM

In large-scale disease etiology studies:

  • Data: case-control diagnostic tests, multivariate binary observations
  • Scientific problem: estimate cause-specific case fractions (CSCF); think “pie chart” for cases
  • Statistical problem: using nested pLCM to estimate the mixing distribution among the cases
  • Motivation for regression analyses: CSCFs may vary by season, a child’s age, HIV status, disease severity

slide-54
SLIDE 54

Data (with Covariates)

  • D = {(Mi, Yi, XiYi, Wi), i = 1, . . . , N}
    • Mi = (Mi1, ..., MiJ)⊤: binary measurements; indicate the presence or absence of J pathogens for subject i = 1, . . . , N.
    • Yi: case (1) or control (0).
    • Xi = (Xi1, . . . , Xip)⊤: covariates that may influence case i’s etiologic fractions
    • Wi = (Wi1, . . . , Wiq)⊤: shared by cases and controls; possibly different from Xi; may influence control distribution [Mi | Wi, Yi = 0]. For example, healthy controls do not have disease severity information (which can be included in Xi).
    • Continuous covariates: the first p1 and q1 elements of Xi and Wi, respectively.

slide-58
SLIDE 58

Motivating Application Again: PERCH Study

Data: 494 cases and 944 controls from one site

Goal a.: Estimate CSCFs at all covariate values, and assign cause-specific probabilities for each case

Goal b.: Quantify overall cause-specific disease burdens in a population, i.e., overall CSCFs π∗ = (π∗1, . . . , π∗L)⊤ as an empirical average of the stratum-specific CSCFs (by X); of policy interest (vaccine/antibiotics development and manufacture)

Model:

  • J = 7: noisy presence/absence of 2 bacteria and 5 viruses in the nose
  • Causes: seven single-pathogen causes plus a “Not Specified” (NoS) cause; so L = J + 1
  • Xi: enrollment date, age (< or > 1 year), disease severity for cases (severe or very severe), HIV status (+/-)
  • Wi: Xi minus “disease severity”.
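Goal b can be illustrated with a toy calculation: the overall CSCFs are obtained by averaging the stratum-specific CSCF vectors over the cases' observed covariate strata. The strata, CSCF values and function name below are made up for illustration and are not PERCH estimates.

```python
import numpy as np

def overall_cscf(stratum_cscf, case_strata):
    """Average the stratum-specific CSCF vectors over the cases' observed strata."""
    rows = np.array([stratum_cscf[s] for s in case_strata], dtype=float)
    return rows.mean(axis=0)

# Hypothetical stratum-specific CSCFs for three causes
stratum_cscf = {
    "age<1": [0.6, 0.3, 0.1],
    "age>=1": [0.2, 0.5, 0.3],
}
# Hypothetical strata of four cases (three younger, one older)
case_strata = ["age<1", "age<1", "age<1", "age>=1"]

pi_star = overall_cscf(stratum_cscf, case_strata)
```

Because each stratum is weighted by its number of cases, the overall pie leans toward the better-populated stratum.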

slide-59
SLIDE 59

PERCH Data: Sparsely-Populated Strata

Table: The observed count (frequency) of cases and controls by age, disease severity and HIV status (1: yes; 0: no). The marginal fractions among cases and controls for each covariate are shown at the bottom. Regression results will be shown for the first two strata.

age ≥ 1 | very severe (VS) (case-only) | HIV positive | # cases (%) | # controls (%)
total | | | 524 (100) | 964 (100)
0 | 0 | 0 | 208 (39.7) | 545 (56.5)
1 | 0 | 0 | 72 (13.7) | 278 (28.8)
0 | 1 | 0 | 116 (22.1) | -
1 | 1 | 0 | 33 (6.3) | -
0 | 0 | 1 | 37 (7.1) | 85 (8.8)
1 | 0 | 1 | 24 (4.5) | 51 (5.3)
0 | 1 | 1 | 25 (4.8) | -
1 | 1 | 1 | 3 (0.6) | -

Marginal fractions (age ≥ 1 | very severe | HIV positive):
case: 25.2% | 34.5% | 17.0%
control: 34.3% | - | 14.1%

slide-60
SLIDE 60

Current Methods Fall Short

  • Fully-stratified analysis: fit an npLCM to the case-control data in each covariate stratum.

Like pLCM, the npLCM is partially-identified in each stratum, necessitating multiple sets of independent informative priors across multiple strata. Two primary issues:

Gap 1a: Unstable CSCF estimates due to sparsely-populated strata.

Gap 1b: Informative TPR priors are often elicited for a case population and rarely for each stratum; reusing independent prior distributions of the TPRs across all the strata will lead to overly-optimistic posterior uncertainty in $\pi^*$, hampering policy decisions.

slide-64
SLIDE 64

The Rest of Talk

More focus on model formulation; inference done by “baker”

Extend the npLCM to perform regression analysis in case-control disease etiology studies that
(a) incorporates controls to estimate the CSCFs ($\pi$),
(b) specifies parsimonious functional dependence of $\pi$ upon covariates, such as additivity, and
(c) correctly assesses the posterior uncertainty of the CSCF functions and the overall CSCFs $\pi^*$ by applying the TPR priors just once.

slide-68
SLIDE 68

Now, how to incorporate covariates, to which quantities?

Regression Extension for P0 and P1: letting πℓ, νk, ηk depend on covariates

slide-69
SLIDE 69

Roadmap

Let three sets of parameters in an npLCM (pg. 17) depend on the observed covariates:

  • 1x. Etiology regression function among cases, $\{\pi_\ell(x), \ell \neq 0\}$, which is of primary scientific interest
  • 2x. Conditional probability of measurements m given covariates w in controls: $P_0(m; w) = [M = m \mid W = w, I = 0]$
  • 3x. Same as 2x above, but in the case class $\ell$: $P_{1\ell}(m; w) = [M = m \mid W = w, I = \ell]$, $\ell = 1, \ldots, L$

Note: Keep the specifications for the TPRs and FPRs ($\Theta$, $\Psi$) as in the original npLCM.

slide-70
SLIDE 70

Etiology Regression πℓ(X)

$\pi_\ell(X)$ is the primary target of inference.

  • 1. Recall that $I_i = \ell$ represents case i’s disease being caused by pathogen $\ell$.
  • 2. Occurs with probability $\pi_{i\ell}$ that depends upon covariates.
  • 3. Over-parameterized multinomial logistic regression:

$$\pi_{i\ell} = \pi_\ell(X_i) = \frac{\exp\{\phi_\ell(X_i)\}}{\sum_{\ell'=1}^{L} \exp\{\phi_{\ell'}(X_i)\}}, \quad \ell = 1, \ldots, L,$$

where $\phi_\ell(X_i) - \phi_L(X_i)$ is the log odds of case i in disease class $\ell$ relative to L: $\log(\pi_{i\ell}/\pi_{iL})$.

  • 4. Without specifying a baseline category, we treat all the disease classes symmetrically, which simplifies prior specification.
  • 5. Additive models for $\phi_\ell$:

$$\phi_\ell(x; \Gamma^\pi_\ell) = \sum_{j=1}^{p_1} f^\pi_{\ell j}(x_j; \beta^\pi_{\ell j}) + \tilde{x}^\top \gamma^\pi_\ell$$

  • 5a. Use B-spline basis expansion to approximate $f^\pi_{\ell j}(\cdot)$ and use P-spline for estimating smooth functions.
  • 5b. $\tilde{x}$ is the subvector of the predictors x; $\Gamma^\pi_\ell = (\{\beta^\pi_{\ell j}\}, \gamma^\pi_\ell)$.
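The over-parameterized multinomial logistic form above can be checked numerically. The sketch below is a minimal illustration with made-up linear predictor values; the function name `cscf_softmax` is hypothetical and not part of any package. It shows that shifting every $\phi_\ell$ by the same constant leaves the CSCFs unchanged, and that the log odds relative to class L equal $\phi_\ell - \phi_L$.

```python
import math

def cscf_softmax(phi):
    """Map linear predictors phi_1..phi_L to CSCFs via the softmax."""
    m = max(phi)                        # subtract the max for numerical stability
    w = [math.exp(p - m) for p in phi]
    total = sum(w)
    return [x / total for x in w]

phi = [0.5, -1.0, 2.0]                  # hypothetical phi_l(X_i) for L = 3
pi = cscf_softmax(phi)

# Over-parameterization: adding the same constant to every phi_l
# leaves the CSCFs unchanged, so no baseline category is singled out.
pi_shifted = cscf_softmax([p + 10.0 for p in phi])

# Log odds of class 1 relative to class L equal phi_1 - phi_L.
log_odds_1_vs_L = math.log(pi[0] / pi[-1])
```

The invariance to a common shift is why the parameterization is called over-parameterized: only differences between the $\phi_\ell$ are determined by the likelihood, and the priors do the rest.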

slide-78
SLIDE 78

P0: Multivariate binary regression for controls

Desirable properties

Model Specification:

  • Model space large enough for complex conditional dependence of M given covariates W
  • Upward compatibility, or reproducibility (invariant parameter interpretation with increasing dimensions or complex patterns of missing responses)

Estimation:

  • Adaptivity: regularization to adapt to the difficulty of the problem, e.g., model residual dependence [M | W, I = 0] only if necessary; model the effect of covariates only if necessary

slide-79
SLIDE 79

Let P0 depend on Wi

Regression model for controls

  • The pmf for controls’ measurements:

$$\Pr(M_i = m \mid W_i, I_i = 0) = \sum_{k=1}^{K} \nu_k(W_i)\, \Pi(m; \Psi_k), \quad \Psi_k = (\psi^{(1)}_k, \ldots, \psi^{(J)}_k)'$$

  • The vector $(\nu_1(W_i), \ldots, \nu_K(W_i))$ lies in a (K − 1)-simplex
  • $\Pi(m; s) = \prod_{j=1}^{J} \{s_j\}^{m_{ij}} (1 - s_j)^{1 - m_{ij}}$
  • An equivalent generative process:

sample subclass indicator: $Z_i \mid W_i \sim \mathrm{Categorical}_K(\nu(W_i))$
generate measurements: $M_{ij} \mid Z_i = k \sim \mathrm{Bernoulli}(\psi^{(j)}_k)$, independently for j = 1, ..., J.
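The mixture pmf and its generative process can be sketched in a few lines. Everything below is illustrative: the function names are hypothetical, the weights and false positive rates are made up, and the covariate dependence is suppressed by fixing $\nu(W_i)$ at one value.

```python
import itertools
import random

def prod_bernoulli(m, s):
    """Pi(m; s): probability of binary vector m under independent Bernoulli(s_j)."""
    p = 1.0
    for mj, sj in zip(m, s):
        p *= sj if mj == 1 else 1.0 - sj
    return p

def control_pmf(m, nu, psi):
    """Mixture pmf: sum_k nu_k * Pi(m; Psi_k)."""
    return sum(nu_k * prod_bernoulli(m, psi_k) for nu_k, psi_k in zip(nu, psi))

def sample_control(nu, psi, rng):
    """Generative process: draw subclass Z, then J independent Bernoullis."""
    z = rng.choices(range(len(nu)), weights=nu)[0]
    return [1 if rng.random() < p else 0 for p in psi[z]]

# Hypothetical values: K = 2 subclasses, J = 3 measurements.
nu = [0.7, 0.3]                       # nu_k(W_i) at one fixed covariate value
psi = [[0.1, 0.2, 0.05],              # Psi_1: FPRs in subclass 1
       [0.6, 0.5, 0.40]]              # Psi_2: FPRs in subclass 2

patterns = list(itertools.product([0, 1], repeat=3))
total = sum(control_pmf(m, nu, psi) for m in patterns)   # sums to 1 over 2^J patterns

# Marginal positive rate of measurement 1: sum_k nu_k * psi_k^(1).
marginal_1 = sum(control_pmf(m, nu, psi) for m in patterns if m[0] == 1)

rng = random.Random(1)
draw = sample_control(nu, psi, rng)
```

Within a subclass the measurements are independent, but marginalizing over the subclass indicator induces dependence among them, which is the point of the nested structure.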

slide-84
SLIDE 84

Let P0 depend on Wi

Regression model for controls

Stick-breaking parametrization of weight functions $\nu_k(W_i) = P(Z_i = k \mid W_i)$ by

$$\underbrace{h_k(W_i; \Gamma^\nu_k)}_{\text{stick } k} = \begin{cases} g(\alpha^\nu_{ik}) \prod_{s<k} \{1 - g(\alpha^\nu_{is})\}, & \text{if } k < K, \\ \prod_{s<k} \{1 - g(\alpha^\nu_{is})\}, & \text{if } k = K, \end{cases} \qquad g(\cdot) = 1/(1 + \exp\{-(\cdot)\}).$$

We specify $\alpha^\nu_{ik}$ via additive models:

$$\alpha^\nu_{ik} = \mu_{k0} + \sum_{j=1}^{q_1} f_{kj}(W_{ij}; \beta^\nu_{kj}) + \tilde{W}_i^\top \gamma^\nu_k, \quad k = 1, \ldots, K - 1.$$

Expand the smooth functions by B-spline bases with coefficients $\beta^\nu_{kj}$; $\tilde{w}$ is a subvector of covariates w.
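The stick-breaking map from the K − 1 linear predictors to K weights is short enough to write out. This is a minimal sketch with made-up predictor values; `stick_breaking` is a hypothetical helper name, and the additive model for $\alpha^\nu_{ik}$ is replaced by fixed numbers.

```python
import math

def expit(a):
    """g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def stick_breaking(alpha):
    """Weights for K subclasses from K-1 linear predictors alpha_1..alpha_{K-1}."""
    weights, remaining = [], 1.0
    for a in alpha:
        g = expit(a)
        weights.append(g * remaining)   # g(alpha_k) * prod_{s<k} {1 - g(alpha_s)}
        remaining *= 1.0 - g
    weights.append(remaining)           # k = K: the leftover stick
    return weights

alpha = [0.0, 1.0, -0.5]                # hypothetical alpha_ik, so K = 4
nu = stick_breaking(alpha)
```

Each predictor decides what fraction of the remaining stick goes to its subclass, so the weights are nonnegative and sum to one by construction, with no constraint needed on the $\alpha$'s.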

slide-88
SLIDE 88

Adaptivity Considerations

Proposed Model

  • Prevent overfitting when the regression is easy, and improve interpretability
  • We a priori place substantial probabilities on models with the following two features:

a) Few subclasses with effective weights (in the sense that $\nu_k(\cdot)$ is bounded away from 0 and 1): a novel additive half-Cauchy prior for $\mu_{k0}$.
b) Smooth weight regression curves $\nu_k(\cdot)$: by Bayesian Penalized-Splines (P-Splines) combined with mixture priors on spline coefficients to sensitively distinguish constant $\alpha^\nu_k(\cdot)$ from flexible smooth curves

slide-89
SLIDE 89

On Consideration a) “Uniform Shrinkage over Simplex” for νk(W)

Proposed Model

  • We let $\mu_{k0} = \sum_{j=1}^{k} \mu^*_{j0}$, $\mu^*_{j0} > 0$. A large $\mu_{k0}$ for a large k.
  • $\mu_{k0}$ increases with k: making the stick-breaking a priori more likely to stop for a large k
  • We specify the prior distributions for $\mu^*_{j0}$ to be heavy-tailed:

$$\mu^*_{j0} \sim \mathrm{Cauchy}^+(0, s_j), \quad j = 1, \ldots, K,$$

  • A large $s_k$ produces a large $\mu^*_{k0}$ and helps stop the stick-breaking at class k.
  • Encourages using a small number of effective classes (< K) to approximate the observed $2^J$ probability contingency table in finite samples
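A prior draw under the additive half-Cauchy construction can be simulated directly. The sketch below is an illustration under simplifying assumptions: no covariate effects (intercepts only), unit scales $s_j$, and K − 1 intercepts feeding the stick-breaking map; the helper names are hypothetical.

```python
import math
import random

def half_cauchy(scale, rng):
    """Draw from Cauchy+(0, scale) by inverse CDF: scale * tan(pi * U / 2)."""
    return scale * math.tan(math.pi * rng.random() / 2.0)

def sample_intercepts(scales, rng):
    """mu_k0 = sum_{j<=k} mu*_j0 with mu*_j0 ~ Cauchy+(0, s_j)."""
    mu, running = [], 0.0
    for s in scales:
        running += half_cauchy(s, rng)
        mu.append(running)
    return mu

def stick_breaking(alpha):
    """K weights from K-1 linear predictors (intercepts only in this sketch)."""
    weights, remaining = [], 1.0
    for a in alpha:
        g = 1.0 / (1.0 + math.exp(-a))   # a >= 0 here, so exp(-a) cannot overflow
        weights.append(g * remaining)
        remaining *= 1.0 - g
    weights.append(remaining)
    return weights

rng = random.Random(2019)
scales = [1.0, 1.0, 1.0, 1.0]            # hypothetical s_1..s_{K-1}, so K = 5
mu = sample_intercepts(scales, rng)      # nondecreasing intercepts mu_k0
nu_prior_draw = stick_breaking(mu)       # subclass weights without covariate effects
```

Because every intercept is nonnegative in this intercept-only setting, each stick takes at least half of what remains, so the last weight is at most $2^{-(K-1)}$; heavy-tailed increments make it likely that some stick takes nearly everything, leaving later subclasses with negligible weight.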

slide-95
SLIDE 95

Background Models Regression Simulations Results Discussion

Inference of νk(x) at three hyperparameter values sj

Simulation: with a single continuous covariate; “—”: truth, “—”: posterior samples

X-axis: covariate values Y-axis: weight; 0 to 1.

Zhenke Wu(zhenkewu@umich.edu) 2019 TAMU 42 / 55

slide-96
SLIDE 96

Let P1 depend on X and W

Subclass Weight Regression: For Cases

The pmf for cases’ measurements:

$$\Pr(M_i = m) = \sum_{\ell=1}^{L} \pi_{i\ell} \sum_{k=1}^{K} \eta_{ik}\, \Pi(M_i; p_{k\ell})$$

  • $p_{k\ell} = \{p^{(j)}_{k\ell}, j = 1, \ldots, J\}$ are positive rates for J measurements in subclass k of disease class $\ell$:

$$p^{(j)}_{k\ell} = \{\theta^{(j)}_k\}^{I\{j=\ell\}} \cdot \{\psi^{(j)}_k\}^{1 - I\{j=\ell\}}$$

  • Equals the TPR $\theta^{(j)}_k$ for a causative pathogen and the FPR $\psi^{(j)}_k$ otherwise
  • Subclass weight regression $\eta_k(W)$ is also specified via stick-breaking: $\eta_{ik} = h_k(W_i; \Gamma^\eta_k)$, $k = 1, \ldots, K - 1$
  • $\alpha^\eta_{ik}$: GAMs
  • $\alpha^\eta_{ik} = \alpha^\eta_k(W_i; \Gamma^\eta_k) = \mu_{k0} + \sum_{j=1}^{q_1} f_{kj}(W_{ij}; \beta^\eta_{kj}) + \tilde{W}_i^\top \gamma^\eta_k$, where $\Gamma^\eta_k = \{\mu_{k0}, \{\beta^\eta_{kj}\}, \gamma^\eta_k\}$ are the regression parameters.
  • we use $\mu_{k0}$ from the controls (why?)
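The case pmf combines the CSCFs, the subclass weights and the TPR/FPR switch. The sketch below evaluates it for made-up parameter values with J = 2 measured pathogens, L = J + 1 causes (the last being NoS) and K = 2 subclasses. Function names are hypothetical, and treating the NoS cause as "every measurement responds at its FPR" is read off the indicator $I\{j=\ell\}$ with $\ell = J + 1$.

```python
import itertools

def prod_bernoulli(m, s):
    """Pi(m; s) for binary vector m and positive rates s."""
    p = 1.0
    for mj, sj in zip(m, s):
        p *= sj if mj == 1 else 1.0 - sj
    return p

def positive_rates(theta_k, psi_k, cause):
    """p_kl^(j): TPR if pathogen j is the cause (1-based), else FPR."""
    return [theta_k[j] if j + 1 == cause else psi_k[j] for j in range(len(psi_k))]

def case_pmf(m, pi, eta, theta, psi):
    """sum_l pi_l * sum_k eta_k * Pi(m; p_kl)."""
    total = 0.0
    for l, pi_l in enumerate(pi, start=1):
        for k, eta_k in enumerate(eta):
            total += pi_l * eta_k * prod_bernoulli(m, positive_rates(theta[k], psi[k], l))
    return total

# Hypothetical parameter values at one covariate value.
pi = [0.3, 0.3, 0.4]                  # CSCFs: pathogen 1, pathogen 2, NoS
eta = [0.6, 0.4]                      # case subclass weights
theta = [[0.9, 0.8], [0.85, 0.7]]     # TPRs theta_k^(j)
psi = [[0.1, 0.2], [0.3, 0.4]]        # FPRs psi_k^(j)

prob_10 = case_pmf([1, 0], pi, eta, theta, psi)
total = sum(case_pmf(m, pi, eta, theta, psi)
            for m in itertools.product([0, 1], repeat=2))
```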

slide-104
SLIDE 104

npLCM Regression Framework

The npLCM regression framework is then obtained as:

  • Control likelihood with covariates:

$$L^{reg}_0 = \prod_{i: Y_i = 0} \sum_{k=1}^{K} \nu_{ik}\, \Pi(M_i; \Psi_k).$$

  • Cases likelihood with covariates:

$$L^{reg}_1 = \prod_{i: Y_i = 1} \sum_{\ell=1}^{L} \left[ \underbrace{\pi_\ell(X_i; \Gamma^\pi_\ell)}_{\text{CSCF } \ell} \sum_{k=1}^{K} \{\eta_{ik} \cdot \Pi(M_i; p_{k\ell})\} \right] \qquad (2)$$

  • $\nu_{ik} = h_k(W_i; \Gamma^\nu_k)$: The S? ? ? ?-B? ? ? ? parameterization
  • $\eta_{ik} = h_k(W_i; \Gamma^\eta_k)$

The joint likelihood for the regression model can be written as: $L^{reg} = L^{reg}_1 \times L^{reg}_0$.
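On the log scale the joint likelihood is the sum of the case and control contributions. The sketch below evaluates it for one control and one case with made-up values; it takes the per-individual weights $\nu_{ik}$, $\eta_{ik}$ and $\pi_{i\ell}$ as given rather than computing them from covariates, and the function names are hypothetical.

```python
import math

def prod_bernoulli(m, s):
    """Pi(m; s) for binary vector m and positive rates s."""
    p = 1.0
    for mj, sj in zip(m, s):
        p *= sj if mj == 1 else 1.0 - sj
    return p

def control_loglik(controls, psi):
    """log L0: sum over controls of log sum_k nu_ik * Pi(M_i; Psi_k)."""
    return sum(
        math.log(sum(nu_k * prod_bernoulli(m, psi_k) for nu_k, psi_k in zip(nu, psi)))
        for m, nu in controls
    )

def case_loglik(cases, theta, psi):
    """log L1: sum over cases of log sum_l pi_il * sum_k eta_ik * Pi(M_i; p_kl)."""
    total = 0.0
    for m, pi, eta in cases:
        lik = 0.0
        for l, pi_l in enumerate(pi, start=1):
            for k, eta_k in enumerate(eta):
                # p_kl: TPR for the causative pathogen, FPR for the others
                p_kl = [theta[k][j] if j + 1 == l else psi[k][j] for j in range(len(m))]
                lik += pi_l * eta_k * prod_bernoulli(m, p_kl)
        total += math.log(lik)
    return total

# Hypothetical values: J = 2, L = 3 (last cause is NoS), K = 2.
theta = [[0.9, 0.8], [0.85, 0.7]]
psi = [[0.1, 0.2], [0.3, 0.4]]
controls = [([0, 0], [0.7, 0.3])]                      # (M_i, nu_i)
cases = [([1, 0], [0.3, 0.3, 0.4], [0.6, 0.4])]        # (M_i, pi_i, eta_i)

joint_loglik = case_loglik(cases, theta, psi) + control_loglik(controls, psi)
```

Note that the FPRs $\Psi$ appear in both terms, which is how the control data inform the case model.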

slide-108
SLIDE 108

Prior Specifications

Unknown parameters:

  • etiology regression coefficients ($\{\Gamma^\pi_\ell\}$),
  • subclass mixing weight parameters for cases ($\{\Gamma^\eta_k\}$) and controls ($\{\Gamma^\nu_k\}$),
  • true and false positive rates ($\Theta = \{\theta^{(j)}_k\}$, $\Psi = \{\psi^{(j)}_k\}$).

To avoid potential overfitting, we a priori introduce:

  • (a) few non-trivial subclasses via a novel additive half-Cauchy prior for the intercepts $\{\mu_{k0}\}$
  • (b) for continuous variables: smooth regression curves $\pi_\ell(\cdot)$, $\nu_k(\cdot)$ and $\eta_k(\cdot)$ by Bayesian Penalized-splines (Lang, 2004) combined with shrinkage priors on spline coefficients (Ni et al., 2015) (to encourage shrinkage towards constant values)
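One common way Bayesian P-splines encode smoothness is a random-walk prior on adjacent B-spline coefficients, which penalizes their differences. The sketch below shows the kernel of such a prior; the difference order, the precision `tau` and the function names are assumptions for illustration, and the talk does not state which variant or which shrinkage prior is actually used.

```python
def differences(beta, order=1):
    """Apply the first-difference operator `order` times."""
    for _ in range(order):
        beta = [b - a for a, b in zip(beta, beta[1:])]
    return beta

def rw_log_prior_kernel(beta, tau, order=1):
    """-(tau / 2) * sum of squared order-th differences (up to a constant)."""
    return -0.5 * tau * sum(d * d for d in differences(beta, order))

# Hypothetical spline coefficient vectors.
constant = [0.4, 0.4, 0.4, 0.4, 0.4]     # constant curve: no penalty
smooth = [0.0, 0.1, 0.2, 0.3, 0.4]       # gentle trend: small penalty
wiggly = [0.0, 0.8, -0.5, 0.9, -0.2]     # rough curve: large penalty

tau = 10.0                               # assumed smoothing precision
k_constant = rw_log_prior_kernel(constant, tau)
k_smooth = rw_log_prior_kernel(smooth, tau)
k_wiggly = rw_log_prior_kernel(wiggly, tau)
```

With a first-order penalty, equal coefficients incur no penalty at all, which is the sense in which such priors pull a curve towards a constant when the data do not demand more.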

slide-110
SLIDE 110

Posterior Inference

Use a Markov chain Monte Carlo (MCMC) algorithm to approximate the joint posterior distribution

  • Posterior inference is flexible and can be obtained for any function of model parameters and individual latent variables

Fit npLCMs (with or without covariates) using the R package baker (https://github.com/zhenkewu/baker)

  • calls Bayesian model fitting software JAGS 4.2.0 (Plummer et al., 2003) from within R
  • provides functions to visualize the posterior distributions of the unknowns
  • also performs posterior predictive model checking
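Inference for a function of the parameters comes from applying that function to each MCMC draw and summarizing the results. The sketch below illustrates this for an overall CSCF over two strata. The "posterior draws" are synthetic stand-ins generated from Beta distributions, not output from baker or JAGS; the helper names are hypothetical; the case counts 208 and 72 are those of the first two strata in the table above.

```python
import random

def quantile(sorted_x, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])

def summarize(draws):
    """Posterior mean and 95% credible interval from scalar draws."""
    xs = sorted(draws)
    return sum(xs) / len(xs), quantile(xs, 0.025), quantile(xs, 0.975)

def overall_cscf(pi_by_stratum, n_cases):
    """Case-count-weighted average of stratum-specific CSCFs for one cause."""
    return sum(p * n for p, n in zip(pi_by_stratum, n_cases)) / sum(n_cases)

rng = random.Random(0)
n_cases = [208, 72]

# Synthetic stand-ins for 2000 posterior draws of one cause's CSCF in each stratum.
pi_star_draws = [
    overall_cscf([rng.betavariate(30, 70), rng.betavariate(10, 40)], n_cases)
    for _ in range(2000)
]
post_mean, cri_lo, cri_hi = summarize(pi_star_draws)
```

The same pattern applies to any other functional, such as a CSCF curve evaluated on a grid of enrollment dates.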

slide-114
SLIDE 114

Simulation Results

  • Simulation I: flexible and valid statistical inferences about the CSCF functions $\{\pi_\ell(\cdot)\}$ (not shown here)
  • Simulation II: valid inferences about the overall CSCF $\pi^*$ (empirical average) to quantify disease burdens in a population (of policy interest)

slide-117
SLIDE 117

Background Models Regression Simulations Results Discussion

Simulation II: Regression Model Reduces the Percent Relative Bias in Recovering the Overall CSCFs π∗

Zhenke Wu(zhenkewu@umich.edu) 2019 TAMU 48 / 55

slide-118
SLIDE 118

Background Models Regression Simulations Results Discussion

Simulation II: Regression Model Produces More Valid 95% CrIs in Recovering the Overall CSCFs π∗

Zhenke Wu(zhenkewu@umich.edu) 2019 TAMU 49 / 55
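The two Simulation II criteria, percent relative bias and credible interval coverage, are conventionally computed across replications as below. The talk does not spell out its exact definitions, so these formulas are the usual ones and are an assumption here; the estimates and intervals are made up.

```python
def percent_relative_bias(estimates, truth):
    """100 * (mean estimate - truth) / truth across replications."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - truth) / truth

def coverage(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the truth."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return hits / len(intervals)

truth = 0.4                                        # hypothetical true pi*_l
estimates = [0.42, 0.38, 0.44, 0.40]               # hypothetical posterior means
intervals = [(0.30, 0.50), (0.41, 0.60),           # hypothetical 95% CrIs
             (0.35, 0.45), (0.20, 0.39)]

bias_pct = percent_relative_bias(estimates, truth)
cover = coverage(intervals, truth)
```

A valid 95% CrI procedure should give coverage close to 0.95 over many replications; overly narrow intervals, such as those produced by reusing the TPR priors in every stratum, show up as coverage well below that.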

slide-119
SLIDE 119

Background Models Regression Simulations Results Discussion

Regression analysis of PERCH data from one site: Age<1, Severe Pneumonia, HIV negative

Zhenke Wu(zhenkewu@umich.edu) 2019 TAMU 50 / 55

slide-120
SLIDE 120

Background Models Regression Simulations Results Discussion

Seasonal Trend for πRSV: Age<1, Severe Pneumonia, HIV negative

Zhenke Wu(zhenkewu@umich.edu) 2019 TAMU 51 / 55

slide-121
SLIDE 121

Summary of the Regression Approach

  • 1) allows analysts to specify a model that links important covariates to CSCFs
  • 2) produces covariate-dependent reference distribution for controls, which is critical for assigning cause-specific probabilities to a given case
  • because we can compare control measurements to case measurements with similar covariate values
  • 3) TPR priors are only used once; avoids overly-optimistic etiology uncertainty estimates.
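Assigning cause-specific probabilities to a given case follows from the case likelihood by Bayes' rule: $P(I_i = \ell \mid M_i) \propto \pi_{i\ell} \sum_k \eta_{ik} \Pi(M_i; p_{k\ell})$. The sketch below evaluates this at fixed, made-up parameter values; in a full analysis the result would be averaged over posterior draws of the parameters. Function names are hypothetical.

```python
def prod_bernoulli(m, s):
    """Pi(m; s) for binary vector m and positive rates s."""
    p = 1.0
    for mj, sj in zip(m, s):
        p *= sj if mj == 1 else 1.0 - sj
    return p

def cause_probabilities(m, pi, eta, theta, psi):
    """P(I = l | M = m) for l = 1..L at fixed parameter values."""
    joint = []
    for l, pi_l in enumerate(pi, start=1):
        lik = 0.0
        for k, eta_k in enumerate(eta):
            # TPR for the causative pathogen, FPR for all others
            p_kl = [theta[k][j] if j + 1 == l else psi[k][j] for j in range(len(m))]
            lik += eta_k * prod_bernoulli(m, p_kl)
        joint.append(pi_l * lik)
    total = sum(joint)
    return [x / total for x in joint]

# Hypothetical values: J = 2 pathogens, L = 3 causes (last is NoS), K = 2.
pi = [0.3, 0.3, 0.4]                  # CSCFs at this case's covariates
eta = [0.6, 0.4]                      # subclass weights at this case's covariates
theta = [[0.9, 0.8], [0.85, 0.7]]     # TPRs
psi = [[0.1, 0.2], [0.3, 0.4]]        # FPRs, informed by controls

# A case positive for pathogen 1 only.
post = cause_probabilities([1, 0], pi, eta, theta, psi)
```

Here a positive result for pathogen 1 alone moves most of the probability to cause 1, because its TPR is much higher than the FPR estimated from controls with similar covariates.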

slide-125
SLIDE 125

Main Points Once Again

Context: Modern large-scale etiology studies generate complex measurements of unobserved causes of disease, and have raised the analytic need to estimate cause-specific case fractions (CSCFs)

Gap: Despite recent methodological advances, the need to describe the relationship between covariates and CSCFs remains unmet

Contribution:

  • A general etiology regression framework building on the npLCM that is broadly applicable to case-control studies
  • A general framework for a class of statistical problems that can be formulated as estimating covariate-dependent class-mixing weights.

slide-128
SLIDE 128

Background Models Regression Simulations Results Discussion

Discussions

  • Related to restricted latent class models (RLCM; Xu, 2017, AOS; Wu, 2019);

  • "Restricted" means the response probability for a measurement depends on the latent state in a monotonic way (e.g., the TPR is greater than the FPR in the pneumonia example)

  • Established sufficient and necessary conditions for theoretical identifiability (based on likelihood only).

  • Also related to Boolean matrix decomposition (Rukat, 2017, ICML) and double feature allocation (Ni and Mueller, 2019, JASA)

  • Other applications: autoimmune disease subsetting (Wu et al., 2019, Biostatistics), electronic health records (Ni and Mueller, 2019), and verbal autopsy (King and Lu, 2008, Stat Sci; McCormick et al., 2016, JASA)


slide-129
SLIDE 129


Thank You!

Student: Irena Chen

Collaborators: Scott Zeger, Katherine O’Brien, Maria Deloria-Knoll, Laura Hammitt

Funding: Patient-Centered Outcomes Research Institute [PCORI ME-1408-20318]; Bill & Melinda Gates Foundation [48968]; Michigan Precision Health Investigator Award; National Cancer Institute (P30CA046592, U01CA229437)

Some References (More at: zhenkewu.com)

1. Wu Z and Chen I (2019+). Regression Analysis of Dependent Binary Data: Estimating Disease Etiology from Case-Control Studies. Submitted. https://arxiv.org/abs/1906.08436

2. PERCH Study Group (2019+). Causes of severe pneumonia requiring hospital admission in children without HIV infection from Africa and Asia: the PERCH multi-country case-control study. The Lancet. https://doi.org/10.1016/S0140-6736(19)30721-4

3. Wu Z, Deloria-Knoll M and Zeger SL (2019+). A Bayesian Approach to Restricted Latent Class Models for Scientifically-Structured Clustering of Multivariate Binary Outcomes. Submitted. https://doi.org/10.1101/400192

4. Wu Z, Deloria-Knoll M and Zeger SL (2017). Nested Partially-Latent Class Models for Estimating Disease Etiology from Case-Control Data. Biostatistics. 18 (2): 200-213.

5. Wu Z, Deloria-Knoll M, Hammitt LL, and Zeger SL, for the PERCH Core Team (2015). Partially Latent Class Models (pLCM) for Case-Control Studies of Childhood Pneumonia Etiology. Journal of the Royal Statistical Society: Series C (Applied Statistics). 65:97-114.

slide-130
SLIDE 130

Simulation I Results

  • $N_d = 500$ cases and $N_u = 500$ controls for each of two levels of $S$ (discrete covariate); uniformly sample the subjects’ enrollment dates over a period of 300 days.
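The sampling design in this bullet can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `simulate_design`, the seed, and the coding of enrollment time as a day offset between 0 and 300 are assumptions made here, not details taken from the slides or from the baker package.

```python
import random

def simulate_design(n_cases=500, n_controls=500, levels=(1, 2),
                    period_days=300, seed=2019):
    """Return one (is_case, S level, enrollment day) record per subject."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    subjects = []
    for s in levels:
        for is_case, n in ((1, n_cases), (0, n_controls)):
            for _ in range(n):
                # Enrollment time drawn uniformly over the study period.
                t = rng.uniform(0.0, period_days)
                subjects.append((is_case, s, t))
    return subjects

design = simulate_design()
print(len(design))  # 2 levels x (500 cases + 500 controls) = 2000
```

Only the design (case status, covariate level, enrollment time) is generated; the measurements themselves would come from the npLCM and are not simulated in this sketch.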


slide-131
SLIDE 131

Simulation I: Recovery of Truth $\pi^0_\ell(t, S = s)$

Figure (not reproduced): panels for causes A to I, plotted against enrollment date (2010 Feb-01 to Nov-01). Row 1) positive rate for cases and controls; Row 2) etiologic fraction. Each panel is annotated with the overall pie and its 95% CrI:

  • A: 50.5% (44%, 57.4%)
  • B: 26.2% (20.4%, 33.3%)
  • C: 11.8% (7.8%, 16.8%)
  • D: 6.3% (3.1%, 10.8%)
  • E: 1.7% (0.2%, 4.1%)
  • F: 1.1% (0.1%, 3.1%)
  • G: 0.6% (0%, 2.3%)
  • H: 0.9% (0%, 4.1%)
  • I: 0.9% (0%, 2.8%)


slide-132
SLIDE 132

Simulation I: Recovery of $\nu_k(t)$ and $\eta_k(t)$

True $K^0 = 2$; model fitted using a working number $K = 7$

Figure (not reproduced): panel (a) case


slide-133
SLIDE 133

Appendix: Simulation II Setup

  • npLCM regression analysis with $K^* = 3$, $R = 200$ replication data sets simulated under 48 different scenarios

  • $L = J = 3, 6, 9$ causes, under the single-pathogen-cause assumption; BrS measurements made on $N_d$ cases and $N_u$ controls for each level of $X$, where $N_d = N_u = 250$ or $500$.

  • $\phi_\ell(X) = \beta_{0\ell} + \beta_{1\ell} I\{X = 2\}$ takes two sets of values to reflect CSCF variability across $X$: i) $\beta_0^{i} = (0, 0, 0, 0, 0, 0)$, $\beta_1^{i} = (-1.5, 0, -1.5, -1.5, 0, -1.5)$; ii) $\beta_0^{ii} = (1, 0, 1, 1, 0, 1)$ and $\beta_1^{ii} = (-1.5, 1, -1.5, -1.5, 1, -1.5)$

  • TPRs $\theta_k^{(j)} = 0.95$ or $0.8$ and FPRs $(\psi_1^{(j)}, \psi_2^{(j)}) \in \{(0.5, 0.05), (0.5, 0.15)\}$, for $j = 1, \ldots, J$.

  • $\nu_k(W) = \eta_k(W) = \mathrm{logit}^{-1}(\gamma_{k0} + \gamma_{k1} I\{W = 2\})$ where $(\gamma_{10}, \gamma_{11}) = (-0.5, 1.5)$ and $(\gamma_{20}, \gamma_{21}) = (1, -1.5)$.
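To make the setup concrete, the sketch below enumerates the design factors listed above and evaluates the regression quantities for the six-cause coefficient sets. Two things are assumptions rather than statements from the slides: that the 48 scenarios are the full crossing of the five factors (the count 3 x 2 x 2 x 2 x 2 = 48 is consistent with this), and that CSCFs are obtained from $\phi_\ell(X)$ through a softmax (multinomial logistic) link. The slide prints six-entry coefficient vectors only, so the sketch does not guess values for 3 or 9 causes. All function and variable names are hypothetical.

```python
import math
from itertools import product

# Design factors listed on the slide; their full crossing is ASSUMED to
# give the 48 scenarios.
N_CAUSES = (3, 6, 9)
SAMPLE_SIZES = (250, 500)
BETA_SETS = ("i", "ii")
TPRS = (0.95, 0.8)
FPR_PAIRS = ((0.5, 0.05), (0.5, 0.15))
scenarios = list(product(N_CAUSES, SAMPLE_SIZES, BETA_SETS, TPRS, FPR_PAIRS))

# Coefficient sets (beta_0, beta_1) as printed on the slide, six entries each.
BETA = {
    "i": ((0, 0, 0, 0, 0, 0), (-1.5, 0, -1.5, -1.5, 0, -1.5)),
    "ii": ((1, 0, 1, 1, 0, 1), (-1.5, 1, -1.5, -1.5, 1, -1.5)),
}

def cscf(beta0, beta1, x):
    """CSCFs at covariate level x, ASSUMING a softmax link over phi_l(X)."""
    phi = [b0 + b1 * (x == 2) for b0, b1 in zip(beta0, beta1)]
    m = max(phi)  # subtract the maximum for numerical stability
    w = [math.exp(p - m) for p in phi]
    total = sum(w)
    return [v / total for v in w]

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Inverse-logit quantities for k = 1, 2 at W = 1, 2, as written on the slide.
# How they are combined into K* = 3 subclass weights is not stated there.
GAMMA = {1: (-0.5, 1.5), 2: (1.0, -1.5)}
inv_logit = {k: [expit(g0 + g1 * (w == 2)) for w in (1, 2)]
             for k, (g0, g1) in GAMMA.items()}

print(len(scenarios))
for x in (1, 2):
    print(x, [round(p, 3) for p in cscf(*BETA["i"], x)])
```

Under the softmax assumption, coefficient set i gives equal CSCFs at $X = 1$ and shifts mass toward the second and fifth causes at $X = 2$.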


slide-134
SLIDE 134

Appendix

Figure: Posterior distributions of the stratum-specific (Rows 1 and 2) and the overall (bottom row) CSCFs based on a simulation with a two-level discrete covariate and $L = J = 6$ causes. The vertical gray lines indicate the 2.5% and 97.5% posterior quantiles; the truths are indicated by vertical blue dashed lines. Rows 1-2) CSCFs by stratum (level = 1, 2; empirical weight 0.5 each) and cause (A-F); Bottom) $\pi^*_\ell$: overall population etiologic fraction for cause A-F (empirical average of the two CSCFs above). Each panel is a density plot on the interval from 0 to 1.
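The bottom-row quantity in the caption, an empirical-weight average of the stratum-specific CSCFs, can be sketched as follows. The posterior draws here are toy numbers generated for illustration only, and the function names are hypothetical; the sketch assumes that draw `i` in each stratum comes from the same posterior sample, so the draws can be averaged index by index.

```python
import random
import statistics

def overall_draws(draws_by_level, weights):
    """Weighted average of per-level draws, computed draw by draw."""
    n_draws = len(draws_by_level[0])
    return [
        sum(w * level[i] for w, level in zip(weights, draws_by_level))
        for i in range(n_draws)
    ]

def central_interval(draws):
    """2.5% and 97.5% quantiles (first and last of 39 cut points)."""
    cuts = statistics.quantiles(draws, n=40)
    return cuts[0], cuts[-1]

# Toy posterior draws for one cause in two strata (NOT results from the talk).
rng = random.Random(1)
level1 = [rng.betavariate(20, 100) for _ in range(2000)]
level2 = [rng.betavariate(40, 80) for _ in range(2000)]

# Empirical weights 0.5 and 0.5, as in the figure.
pi_star = overall_draws([level1, level2], weights=(0.5, 0.5))
print(statistics.mean(pi_star), central_interval(pi_star))
```

With unequal stratum sizes, the weights would be the observed proportions of cases in each stratum rather than 0.5 each.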


slide-135
SLIDE 135

Appendix

Figure (not reproduced): percent relative bias, (posterior mean − truth)/truth × 100%, by cause (A to F) for the two methods (Reg, No Reg). Panels are indexed by Nd (250, 500), pop_frac_scn (1, 2), psi_scn (1, 2), and theta (0.8, 0.95).

Figure: npLCM analyses with or without regression perform similarly in terms of percent relative bias (top) and empirical coverage rates (bottom) over R = 100 replications in simulations where the case and control subclass weights do not vary by covariates. Each panel corresponds to one of 16 combinations of true parameter values and sample sizes.
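The two performance measures named in the caption can be written out as below. This is a minimal sketch: the function names are hypothetical, and the inputs are toy numbers chosen for illustration, not simulation output from the talk.

```python
def percent_relative_bias(posterior_mean, truth):
    """(posterior mean - truth) / truth * 100, for one replication."""
    return (posterior_mean - truth) / truth * 100.0

def coverage_rate(intervals, truth):
    """Fraction of replications whose interval contains the truth."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)

# Toy inputs for a single cause across three replications.
truth = 0.25
posterior_means = [0.30, 0.20, 0.25]
intervals = [(0.10, 0.30), (0.26, 0.40), (0.20, 0.25)]

biases = [percent_relative_bias(m, truth) for m in posterior_means]
print(biases)
print(coverage_rate(intervals, truth))
```

In a simulation study, both functions would be applied per cause and per scenario, with one posterior mean and one 95% CrI from each of the R replications.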


slide-136
SLIDE 136

Simulation II: Regression Model Reduces the Percent Relative Bias in Recovering the Overall CSCFs π∗

Figure (not reproduced): percent relative bias, (posterior mean − truth)/truth × 100%, by cause (A to F) for the two methods (Reg, No Reg). Panels are indexed by Nd (250, 500), pop_frac_scn (1, 2), psi_scn (1, 2), and theta (0.8, 0.95).


slide-137
SLIDE 137

Simulation II: Regression Model Produces More Valid 95% CrIs in Recovering the Overall CSCFs π∗

Figure (not reproduced): coverage rate (intervals based on R replications), on a scale from 0 to 1, by cause (A to F) for the two methods (Reg, No Reg). Panels are indexed by Nd (250, 500), pop_frac_scn (1, 2), psi_scn (1, 2), and theta (0.8, 0.95).


slide-138
SLIDE 138

Appendix

Figure (not reproduced): coverage rate (intervals based on R replications), on a scale from 0.85 to 1.00, by cause (A to F) for the two methods (Reg, No Reg). Panels are indexed by Nd (250, 500), pop_frac_scn (1, 2), psi_scn (1, 2), and theta (0.8, 0.95).

Figure: npLCM analyses with or without regression perform similarly in terms of percent relative bias (top) and empirical coverage rates (bottom) over R = 100 replications in simulations where the case and control subclass weights do not vary by covariates. Each panel corresponds to one of 16 combinations of true parameter values and sample sizes.


slide-139
SLIDE 139

Appendix

Figure (not reproduced): two panels, (a) Cause: RSV and (b) Cause: NoS. Each plots the individual etiology pie (0 to 1) against enrollment date (2011 Sep-01 to 2013 Sep-01) for two groups: younger than 1 year, severe, HIV−; and older than 1 year, severe, HIV−.

Figure: Individual etiology fraction estimates for RSV (left) and NoS (right) differ by age and season among HIV-negative, severe pneumonia cases for whom all seven pathogens tested negative in the nasopharyngeal specimens.
