Estimating the Size of Hidden Populations based on Partially-Observed Network Data
Mark S. Handcock, Department of Statistics, University of California, Los Angeles
Krista J. Gile, Department of Mathematics, University of Massachusetts, Amherst
I Krista J. Gile, UMass
I Mark S. Handcock, UCLA
I Lisa G. Johnston, Tulane University, UCSF
I Corinne M. Mar, University of Washington
I http://hpmrg.org
I epidemiology
I CDC HIV surveillance program
I UNAIDS requires HIV prevalence estimates for all countries
I Most countries: concentrated in high-risk populations:
I Hard-to-reach networked populations.
I labor economics: Unregulated workers
I demography: displaced populations, immigrant populations
I Probability sample (e.g. simple random sampling, stratified
I Analyze data using sampling weights
I Each population joined by informal social network of
I Researchers can access some members of the population.
I Begin with a reachable convenience sample (the seeds)
I Expand sample by following social network ties
I Seed Dependence: Follow only a few links from each sampled
I Confidentiality: Respondents distribute uniquely identified
I Inference based on network positions: Under rapid development
I Effective at obtaining large varied samples in many populations.
I Widely used: over 100 studies, in over 30 countries. Often
I Challenges
I Sampling depends on (typically) partially-observed network data
I Convenience mechanism for initial sample leads to non-probability
I Unknown population size and unknown sampling frame
I Sampling designs have much in common, but no consensus on
I Many critics in subject fields (Wang et al. 2005, ...)
I Wejnert and Heckathorn (2008): compare in known population
I Gile (2008): Uses CDC data as basis for simulated population
I Goel and Salganik (2009) using a Markov chain model, effects of
I Gile and Handcock (2010): use realistic but simulated
I Goel and Salganik (2010): simulate RDS over (largely known)
I Thomas and Gile (2011): effect of differential recruitment,
I Salganik and Heckathorn (2004): simple Markov Chain model
I Volz and Heckathorn (2008): Markov Chain model over people
I Gile (2008, 2011): Adjusts for with-replacement effects
I Gile and Handcock (2008, 2011): a network model-assisted
I better performance, realistic representation of RDS process.
I Unlike other link-tracing methods, does not require initial
I is able to adjust for the bias from the selection of the seeds
I Still subject to many assumptions:
I Self-reported infected and uninfected contacts
I Known population size
I Adequate working network structure and sampling structure
I Measurement Error
I We want to know the size of the population under study
I We want to estimate population totals rather than averages
I We want to estimate population counts rather than proportions
I We need it to improve new estimators that require it (e.g. Gile
[Figure: degree of sampled node against time (order of being sampled); population size 555.]
[Figure: degree of sampled node against time (order of being sampled); population size 5000.]
I Goal: Estimate proportion “infected”:
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} z_i \]
I Generalized Horvitz-Thompson Estimator:
\[ \hat{\mu} = \frac{\sum_{i} S_i z_i / \pi_i}{\sum_{i} S_i / \pi_i} \]
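A minimal sketch of the ratio form of the estimator, under stated assumptions: `z` is read as the infection indicator of each sampled unit and `pi` as its inclusion probability (the slide text does not define the symbols, so these readings are assumptions). Summing over sampled units only means the sample indicator S_i drops out. The function name and the data are hypothetical.

```python
from typing import Sequence


def generalized_ht_proportion(z: Sequence[float], pi: Sequence[float]) -> float:
    """Ratio of inverse-probability-weighted sums over the sampled units."""
    if len(z) != len(pi) or len(z) == 0:
        raise ValueError("z and pi must be non-empty and of equal length")
    if any(p <= 0 for p in pi):
        raise ValueError("inclusion probabilities must be positive")
    numerator = sum(zi / p for zi, p in zip(z, pi))   # sum of z_i / pi_i
    denominator = sum(1.0 / p for p in pi)            # sum of 1 / pi_i
    return numerator / denominator


# Hypothetical sample: two infected units that were easy to reach (pi = 0.5)
# and two uninfected units that were harder to reach (pi = 0.25).
z = [1, 0, 0, 1]
pi = [0.5, 0.25, 0.25, 0.5]
estimate = generalized_ht_proportion(z, pi)
print(estimate)
```

With equal inclusion probabilities the estimate reduces to the plain sample proportion; here the harder-to-reach uninfected units are weighted up, so the estimate falls below the raw sample proportion of 0.5.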
I Begin with a population of N units, denoted by indices 1 . . . N
I Let G1, . . . , GN be the indices of the successively sampled
I Sample the first unit from the full population {1 . . . N} with
I Select each subsequent unit with probability proportional to size
\[ \frac{d_k}{\sum_{j \notin \{G_1, \ldots, G_{i-1}\}} d_j} \]
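The successive sampling steps above can be sketched as follows. This is an illustration under assumptions, not the authors' code: unit sizes are taken to be degrees, the degrees are drawn uniformly from 1 to 20 purely for demonstration, and the function name is hypothetical. Because large units tend to be drawn early, the mean size of sampled units falls over the sampling order.

```python
import random
from typing import List, Sequence


def successive_sample(sizes: Sequence[float], n: int, rng: random.Random) -> List[int]:
    """Draw n distinct indices; each draw is proportional to size among unsampled units."""
    if n > len(sizes):
        raise ValueError("cannot sample more units than the population holds")
    remaining = list(range(len(sizes)))
    order: List[int] = []
    for _ in range(n):
        weights = [sizes[j] for j in remaining]   # d_j for j not yet sampled
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        order.append(remaining.pop(pick))          # remove so it cannot recur
    return order


rng = random.Random(0)
degrees = [rng.randint(1, 20) for _ in range(555)]   # hypothetical degrees, N = 555
order = successive_sample(degrees, 500, rng)
sampled_degrees = [degrees[g] for g in order]
early = sum(sampled_degrees[:100]) / 100    # mean degree of the first 100 draws
late = sum(sampled_degrees[-100:]) / 100    # mean degree of the last 100 draws
print(early, late)
```

The drop from `early` to `late` is sharper when the sample is a large fraction of the population, which is the signal the population size estimate draws on.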
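[Equations lost in extraction. Recoverable fragments: an i.i.d. assumption; terms indexed over k = 1, ..., n, j = 1, ..., N, i = k, ..., N, i = k, ..., n and i = n+1, ..., N; sums over d_unobs ∈ D(d_obs).]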
I the sampling design is central to the likelihood
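[Equation lost in extraction. Recoverable fragments: terms indexed over k = 1, ..., n and j = 1, ..., N; a sum over d_unobs ∈ D_U(d_obs).]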
I the data effectively truncates the prior below the sample size.
I improper uniform prior
I natural parametric models (e.g., Negative Binomial,
I natural parametric models too thin in the tails
I consider specifying prior knowledge about the sample proportion
I based on the idea that n may not be chosen separately from N
I a simple prior is uniform on n/N.
I translates to a closed form for the prior on N which has infinite
I Generalize to n/N ∼ Beta(1, β)
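A change of variables gives the closed form for the prior on N implied by a prior on the sampling fraction. The sketch below treats N as continuous, which is an assumption made for illustration, and writes p = n/N.

```latex
\begin{align*}
p = \frac{n}{N} \sim \mathrm{Beta}(1,\beta)
  &\implies f_p(p) = \beta\,(1-p)^{\beta-1}, \qquad 0 < p < 1, \\
\pi(N) = f_p\!\left(\frac{n}{N}\right)\left|\frac{dp}{dN}\right|
  &= \beta\left(1-\frac{n}{N}\right)^{\beta-1}\frac{n}{N^{2}}
   = \frac{\beta\, n\,(N-n)^{\beta-1}}{N^{\beta+1}}, \qquad N > n .
\end{align*}
```

Setting β = 1 recovers the uniform case, π(N) = n/N², whose mean ∫ N π(N) dN diverges. For β > 1 the density has its mode at N = n(β+1)/2.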
[Figure: prior density for population size; truth = 1000.]
[Figure: density against population size.]
I 1000, 835, 715, 625, 555, or 525 nodes
I 20% “Infected”
I Mean degree 7
I Homophily on Infection: $R = \frac{E(\#\text{ infected to infected tie})}{E_{R=0}(\#\text{ infected to infected tie})} = 5$ (or
I Differential Activity: $w = \frac{\text{mean degree infected}}{\text{mean degree uninfected}} = 1$ (or other)
I 500 total samples
I 10 seeds, chosen proportional to degree
I 2 coupons each
I Coupons at random to relations
I Sample without replacement
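The sampling settings listed above can be sketched as a small simulation. Several parts are assumptions made for illustration: ties are generated independently (the slides do not say how the networks were built), infection status, homophily and differential activity are ignored, recruiters are processed in first-in-first-out order, and all function names are hypothetical.

```python
import random
from collections import deque
from typing import Dict, List, Set


def random_graph(n_nodes: int, mean_degree: float, rng: random.Random) -> Dict[int, Set[int]]:
    """Undirected graph with independent ties and roughly the requested mean degree."""
    p = mean_degree / (n_nodes - 1)
    nbrs: Dict[int, Set[int]] = {i: set() for i in range(n_nodes)}
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs


def rds_sample(nbrs: Dict[int, Set[int]], n_seeds: int, coupons: int,
               target: int, rng: random.Random) -> List[int]:
    """Link-tracing sample without replacement, seeds drawn proportional to degree."""
    sampled: List[int] = []
    seen: Set[int] = set()
    pool = [v for v in nbrs if nbrs[v]]            # nodes with at least one tie
    for _ in range(n_seeds):
        weights = [len(nbrs[v]) for v in pool]     # proportional to degree
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        seed = pool.pop(idx)
        sampled.append(seed)
        seen.add(seed)
    queue = deque(sampled)
    while queue and len(sampled) < target:
        recruiter = queue.popleft()
        eligible = sorted(u for u in nbrs[recruiter] if u not in seen)
        k = min(coupons, len(eligible), target - len(sampled))
        for recruit in rng.sample(eligible, k):    # coupons at random to relations
            sampled.append(recruit)
            seen.add(recruit)
            queue.append(recruit)
    return sampled


rng = random.Random(1)
graph = random_graph(1000, 7, rng)                 # 1000 nodes, mean degree 7
sample = rds_sample(graph, n_seeds=10, coupons=2, target=500, rng=rng)
print(len(sample))
```

The returned list preserves the order of sampling, which is the information the successive sampling model uses alongside the reported degrees.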
[Figure: posteriorsize() estimates of population size against differential activity level; panels: Homophily Ratio 1, Homophily Ratio 5. REVISION 178; 200 RDS samples, prior.size.mode = truth; circle is mode, triangle is mean.]
I Because of differential activity the Conway-Maxwell-Poisson
[Figure: Posterior for Population Size (density against population size); truth = 1000, mode = 763, median = 801, mean = 835.]
I Solution: Model the degree distributions of the diseased from the
[Figure: Posterior for Mean Degree (density against degree); true overall mean degree is 7; curves: No Disease, With Disease.]
[Figure: Posterior for s.d. degree; curves: No Disease, With Disease.]
I As we model many population characteristics, including the
I Example: disease prevalence
[Figure: Disease Prevalence against differential activity level; panels: Homophily Ratio 1, Homophily Ratio 5. REVISION 170; 200 RDS samples, prior.size.mode = truth. Population size: red is 525, green is 715, and blue is 1000.]
I Using the Bayesian framework, we can naturally compute the
I Examples: CI coverage for the population size and prevalence
[Figure: Population Size coverage against differential activity level; panels: Homophily Ratio 1, Homophily Ratio 5. REVISION 170; 200 RDS samples, prior.size.mode = truth. Coverage: proportion of samples whose 95% CI covered the true population size. True population size: red is 525, green is 715, blue is 1000.]
[Figure: prevalence coverage. REVISION 170; 200 RDS samples, prior sampling fraction = truth. Coverage: proportion of samples whose 95% CI covered the true prevalence. True prevalence = 0.2.]
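The coverage measure used in these figures is simple to compute once each simulated sample has produced an interval. A minimal sketch, with a hypothetical function name and made-up intervals standing in for the 95% intervals from each simulated sample:

```python
from typing import Sequence, Tuple


def coverage(intervals: Sequence[Tuple[float, float]], truth: float) -> float:
    """Proportion of (lower, upper) intervals that contain the true value."""
    if not intervals:
        raise ValueError("need at least one interval")
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


# Hypothetical intervals for a true population size of 1000.
intervals = [(700, 1300), (820, 1500), (1010, 1600), (600, 990)]
rate = coverage(intervals, truth=1000)
print(rate)
```

A well-calibrated 95% interval should give a rate close to 0.95 over many simulated samples.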
I The SS estimator in Gile JASA (2011) requires N to be known
I We use the posterior mode as a plug-in estimate of N
[Figure: Prevalence against Homophily (alpha); panels: DA 0.5, DA 0.8, DA 1.0, DA 1.5, DA 2.0. GileSS 95% CI using Posteriorsize Estimate (median reported); True Population Size 1000; Seedtype Random; mode = red square, median = green circle, mean = blue triangle.]
[Figure: RMSE Ratio against Homophily (alpha); panels: DA 0.5, DA 0.8, DA 1.0, DA 1.5, DA 2.0. Root Mean Square Error Ratio: 1000 Random; SS using true population size over SS using posteriorsize popsize estimate.]
I Surveillance surveys by San Francisco Department of Public
I Focus on African-American (AA) men-who-have-sex-with-men
I RDS study of size n = 256 in 2009.
I Intensive study provides a population size estimate of 4439.
I Census data indicated 21518 AA men in San Francisco.
[Figure: density against population size; legend: SF, mean.]
[Figure: density against HIV+ count; legend: SF, mean.]
[Figure: posterior for population size (density against population size); legend: SF, mean.]
[Figure: posterior for the number of AA MSM with HIV (density against HIV+ count); legend: SF, mean.]
I It is important to estimate the networked population size
I There is information on the population size implicit in the
I Using successive sampling model we can model the decrease
I We can incorporate prior information about the population size
I We can incorporate other features of the population
I We can estimate population means (e.g., prevalence and counts)
I In the Bayesian framework we can estimate uncertainty of the
I The difference between the model with disease and without
I The estimates depend on the prior distribution for population
I The estimates are biased because the successive sampling
I This approach is promising. It is designed to be combined with
I Fundamentally, RDS data typically does not contain much
I Krista J. Gile, Inference from Partially-Observed Network Data,
I Krista J. Gile and Mark S. Handcock (2010),
I Krista J. Gile (2011), “Improved Inference for Respondent-Driven
I Krista J. Gile and Mark S. Handcock, “Network Model-Assisted