
SLIDE 1

Estimating the Size of Hidden Populations based on Partially-Observed Network Data

Mark S. Handcock, Department of Statistics, University of California, Los Angeles

Krista J. Gile, Department of Mathematics, University of Massachusetts, Amherst

Corinne M. Mar, Center for Studies in Demography and Ecology, University of Washington

Supported by the DoD ONR MURI award N00014-08-1-1015.

Working Papers available at http://www.stat.ucla.edu/~handcock and http://arXiv.org

MURI Annual Review Meeting, Jan 10 2012

SLIDE 2

Hard-to-Reach Population Methods Research Group

I Krista J. Gile, UMass
I Mark S. Handcock, UCLA
I Lisa G. Johnston, Tulane University, UCSF
I Corinne M. Mar, University of Washington
I http://hpmrg.org

SLIDE 3

Sampling Hard-to-reach Populations

Many motivating fields:

I epidemiology
  I CDC HIV surveillance program
  I UNAIDS requires HIV prevalence estimates for all countries
  I In most countries HIV is concentrated in high-risk populations: injecting drug users, men who have sex with men, and sex workers
  I Hard-to-reach networked populations.
I labor economics: unregulated workers
I demography: displaced populations, immigrant populations

Traditional Survey Sampling:

I Probability sample (e.g. simple random sampling, stratified random sampling)

I Analyze data using sampling weights

Hard-to-reach populations: No practical conventional sampling frame.

SLIDE 4

Link-Tracing Sampling

Suppose:

I Each population is joined by an informal social network of relationships.

I Researchers can access some members of the population.

Then:

I Begin with a reachable convenience sample (the seeds)
I Expand sample by following social network ties

This is Link-tracing Network Sampling

SLIDE 5

Stylized population

SLIDE 6

Start with seeds . . .

SLIDE 7

Seeds recruit the first wave . . .

SLIDE 8

First wave recruit the second wave . . .

SLIDE 9

and so on . . .

SLIDE 10

SLIDE 11

SLIDE 12

(and with un-sampled)

SLIDE 13

Respondent-Driven Sampling - Link-tracing variant:

I Seed Dependence: Follow only a few links from each sampled person
I Confidentiality: Respondents distribute uniquely identified coupons. No names. (respondent-driven)
I Inference based on network positions: Under rapid development
I Effective at obtaining large varied samples in many populations.
I Widely used: over 100 studies, in over 30 countries. Often HIV-risk populations.

Heckathorn, D.D., “Respondent-driven sampling: A new approach to the study of hidden populations.” Social Problems, 1997.

Salganik, M.J. and D.D. Heckathorn, “Sampling and estimation in hidden populations using respondent-driven sampling.” Sociological Methodology, 2004.

SLIDE 14

Link-Tracing Sampling:

I Challenges
  I Sampling depends on (typically) partially-observed network data
  I Convenience mechanism for initial sample leads to non-probability sample
  I Unknown population size and unknown sampling frame
I Sampling designs have much in common, but no consensus on inferential approach

Respondent-Driven Sampling is subject to all of these.

SLIDE 15

Statistical Assessments of RDS

I Many critics in subject fields (Wang et al. 2005, ...)
I Wejnert and Heckathorn (2008): compare in known population (web-based)
I Gile (2008): Uses CDC data as basis for simulated population (ERGM). Evaluates the bias due to (1) the initial sample, (2) uncontrollable features of respondent behavior, and (3) the without-replacement structure of sampling.
I Goel and Salganik (2009): using a Markov chain model, effects of clustering, non-branching assumption.
I Gile and Handcock (2010): use realistic but simulated populations to show: (1) the number of sample waves typically used is too small; (2) preferential referral behavior leads to bias; (3) finite population effects can be large.
I Goel and Salganik (2010): simulate RDS over (largely known) friendship and IDU networks to show high variance of original estimators.
I Thomas and Gile (2011): effect of differential recruitment, non-response and non-recruitment.

SLIDE 16

Inferential approaches

The key is the modeling of the sampling process

I Salganik and Heckathorn (2004): simple Markov Chain model over classes
I Volz and Heckathorn (2008): Markov Chain model over people
I Gile (2008, 2011): Adjusts for with-replacement effects
I Gile and Handcock (2008, 2011): a network model-assisted estimator
  I better performance, realistic representation of RDS process.
  I Unlike other Link-tracing methods, does not require initial probability sample
  I is able to adjust for the bias from the selection of the seeds
I Still subject to many assumptions:
  I Self-reported infected and uninfected contacts
  I Known population size
  I Adequate working network structure and sampling structure
  I Measurement Error

SLIDE 17

Why estimate the population size?

I We want to know the size of the population under study
I We want to estimate population totals rather than averages
I We want to estimate population counts rather than proportions
I We need it to improve new estimators that require it (e.g., Gile (2008, 2011)).

SLIDE 18

Is there information in RDS data about population size?

Idea: the approach is to use Gile’s sequential sampling model and leverage the information in the ordered sequence of degrees in the sample. Intuitively, the change in the degree distribution of successive waves indicates the depletion of the population and this can be quantified to estimate N.

SLIDE 19

[Figure: degree of sampled node (y-axis, 5 to 20) against time, i.e. order of being sampled (x-axis, 100 to 500). Population size 555.]

SLIDE 20

[Figure, two panels: degree of sampled node (y-axis, 5 to 20) against time, i.e. order of being sampled (x-axis, 100 to 500). Panels: population size 555; population size 5000.]

SLIDE 21

Classic Design-Based Inference: Generalized Horvitz-Thompson Estimator

I Goal: Estimate proportion “infected”:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} z_i$$

where the population is labeled $1, 2, \ldots, N$ and

$$z_i = \begin{cases} 1 & i \text{ infected} \\ 0 & i \text{ uninfected.} \end{cases}$$

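I Generalized Horvitz-Thompson Estimator:

$$\hat{\mu} = \frac{1}{\hat{N}} \sum_i \frac{S_i z_i}{\pi_i}, \qquad \hat{N} = \sum_i \frac{S_i}{\pi_i}$$

where

$$S_i = \begin{cases} 1 & i \text{ sampled} \\ 0 & i \text{ not sampled} \end{cases} \qquad \pi_i = P(S_i = 1).$$

Key Point: Requires $\pi_i$ for all $i$ with $S_i = 1$.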
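As a minimal sketch of the estimator on this slide, the following Python function computes $\hat{N}$ and $\hat{\mu}$ from the sampled units only. The function name and the toy inputs are illustrative assumptions, not part of the talk; the inclusion probabilities are taken as given, which is exactly what is hard to obtain in RDS.

```python
def generalized_horvitz_thompson(z, pi):
    """Return (mu_hat, n_hat) for the sampled units.

    z  : 0/1 infection indicators of the sampled units (S_i = 1)
    pi : their inclusion probabilities pi_i = P(S_i = 1), assumed known
    """
    if len(z) != len(pi):
        raise ValueError("z and pi must have the same length")
    # N_hat = sum over sampled units of 1 / pi_i
    n_hat = sum(1.0 / p for p in pi)
    # mu_hat = (1 / N_hat) * sum over sampled units of z_i / pi_i
    mu_hat = sum(zi / p for zi, p in zip(z, pi)) / n_hat
    return mu_hat, n_hat


# Toy inputs (hypothetical): four sampled units with known pi_i.
mu_hat, n_hat = generalized_horvitz_thompson([1, 0, 0, 1], [0.5, 0.25, 0.25, 0.5])
print(mu_hat, n_hat)
```

Units with small inclusion probability receive large weight, so errors in the assumed $\pi_i$ feed directly into both estimates.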

SLIDE 22

Sequential Sampling Proportional to Size (SS)

Consider the following sequential sampling with probability proportional to size (SS) procedure:

I Begin with a population of $N$ units, denoted by indices $1, \ldots, N$, with varying sizes represented by $d_1, d_2, \ldots, d_N$.
I Let $G_1, \ldots, G_N$ be the indices of the successively sampled people.
I Sample the first unit from the full population $\{1, \ldots, N\}$ with probability proportional to size $d_i$. Assign the index of this unit to the random variable $G_1$.
I Select each subsequent unit with probability proportional to size from among the remaining units, such that

$$P(G_i = k \mid G_1, \ldots, G_{i-1}) = \begin{cases} \dfrac{d_k}{\sum_{j \notin \{G_1, \ldots, G_{i-1}\}} d_j} & k \notin \{G_1, \ldots, G_{i-1}\} \\ 0 & \text{else.} \end{cases}$$
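The SS procedure can be sketched in a few lines of Python. This is an illustrative simulation under assumed inputs (a hypothetical population of 555 units with degrees 1 to 20, all positive), not the authors' implementation. It shows the depletion effect that the method exploits: high-degree units tend to be drawn early, so the degrees of later draws are lower.

```python
import random


def successive_sample(degrees, n, rng):
    """Draw n distinct indices; each draw is proportional to degree
    among the units not yet sampled (degrees must be positive)."""
    remaining = list(range(len(degrees)))
    order = []
    for _ in range(n):
        weights = [degrees[k] for k in remaining]
        # Position within `remaining`, chosen with probability d_k / sum of remaining d_j
        position = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        order.append(remaining.pop(position))
    return order


rng = random.Random(1)
degrees = [1 + (k % 20) for k in range(555)]   # hypothetical degrees 1..20
order = successive_sample(degrees, 500, rng)
sampled_degrees = [degrees[k] for k in order]
early_mean = sum(sampled_degrees[:100]) / 100
late_mean = sum(sampled_degrees[-100:]) / 100
print(early_mean, late_mean)   # early draws have higher mean degree than late draws
```

When the population is much larger than the sample, the same code shows almost no decline, which is why the rate of decline carries information about N.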

SLIDE 23

Models for the Personal Network Sizes

We postulate a parametric model for the degrees. Specifically:

$$d_i \overset{\text{i.i.d.}}{\sim} f(\cdot \mid \eta)$$

where $f(\cdot \mid \eta)$ is a probability mass function (PMF) with support $d = 0, 1, \ldots$, and $\eta$ is a parameter.

SLIDE 24

Models for the Personal Network Sizes

The papers by Handcock and Jones cover this topic well. To specify f(·|η), we can consider:

1. Poisson.
2. Negative binomial. This allows Gamma over-dispersion over Poisson.
3. Yule, Waring. This allows power-law over-dispersion over Poisson.
4. Poisson-log-normal. This allows log-normal over-dispersion over Poisson. Its over-dispersion is more than that of the Negative Binomial but less than that of the power-law models.
5. Conway-Maxwell-Poisson distribution. This allows both under-dispersion and over-dispersion with a single additional parameter over a Poisson.
6. Non-parametric lower tails: To allow for poor fit in the lower degrees.

These are all coded up in the CRAN degreenet package and/or the size package.
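To illustrate item 5, here is a small Python sketch of the Conway-Maxwell-Poisson PMF in its standard $(\lambda, \nu)$ form, $P(d) \propto \lambda^d / (d!)^\nu$. This parametrization, the truncation point `max_d`, and the function names are assumptions made for illustration; they are not taken from the degreenet or size packages.

```python
import math


def cmp_pmf(lam, nu, max_d=200):
    """Conway-Maxwell-Poisson PMF on 0..max_d (normalizing sum truncated at max_d)."""
    # Work on the log scale to avoid overflow in d! for large d.
    log_w = [d * math.log(lam) - nu * math.lgamma(d + 1) for d in range(max_d + 1)]
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]


def mean_and_variance(pmf):
    mean = sum(d * p for d, p in enumerate(pmf))
    var = sum((d - mean) ** 2 * p for d, p in enumerate(pmf))
    return mean, var


print(mean_and_variance(cmp_pmf(7.0, 1.0)))   # nu = 1 is Poisson: variance equals mean
print(mean_and_variance(cmp_pmf(7.0, 2.0)))   # nu > 1: under-dispersed
print(mean_and_variance(cmp_pmf(2.0, 0.5)))   # nu < 1: over-dispersed
```

The single extra parameter $\nu$ moves the variance below or above the Poisson value, which is the property the slide refers to.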

SLIDE 25

Model the sequentially sampled network sizes

Without loss of generality, let $D_{obs} = (D_1, \ldots, D_n)$ be the random degrees of the successively sampled people and $D_{unobs} = (D_{n+1}, \ldots, D_N)$ be the unordered random degrees of the unknown members. Let $d_{obs} = (d_1, \ldots, d_n)$ and $d_{unobs} = (d_{n+1}, \ldots, d_N)$ be given observed and unobserved values. Similarly, let $G = (G_1, \ldots, G_n)$ be the indices of the successively sampled people and $g_{obs}$ be the observed sequence. The combined sequential sampling and super-population draws then result in:

$$P(G = g_{obs}, D_{obs} = d_{obs}, D_{unobs} = d_{unobs} \mid \eta) = \frac{N!}{(N-n)!} \prod_{k=1}^{n} \frac{d_k}{\lambda_k} \cdot \prod_{j=1}^{N} f(d_j \mid \eta) \qquad (1)$$

where

$$\lambda_k = \sum_{i=k}^{N} d_i = \sum_{i=k}^{n} d_i + \sum_{i=n+1}^{N} d_i, \qquad k = 1, \ldots, n,$$

depends on both $d_{obs}$ and $d_{unobs}$.
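Equation (1) is straightforward to evaluate once the unobserved degrees are filled in. The sketch below computes its logarithm in Python for given `d_obs` (in sampling order) and `d_unobs`. The function names and the choice of a Poisson PMF for $f$ are assumptions for illustration only.

```python
import math


def log_poisson_pmf(d, mean):
    """log f(d | eta) for a Poisson degree model (an illustrative choice of f)."""
    return d * math.log(mean) - mean - math.lgamma(d + 1)


def log_joint_probability(d_obs, d_unobs, log_f):
    """log of equation (1): d_obs in sampling order (all positive), d_unobs in any order."""
    n = len(d_obs)
    N = n + len(d_unobs)
    # log N! / (N - n)!
    log_p = math.lgamma(N + 1) - math.lgamma(N - n + 1)
    # lambda_1 is the total degree; each lambda_{k+1} removes the unit just sampled.
    lam = sum(d_obs) + sum(d_unobs)
    for d in d_obs:
        log_p += math.log(d) - math.log(lam)
        lam -= d
    # Super-population term: product of f(d_j | eta) over all N units.
    for d in list(d_obs) + list(d_unobs):
        log_p += log_f(d)
    return log_p


value = log_joint_probability([9, 7, 8], [3, 5, 2, 6], lambda d: log_poisson_pmf(d, 7.0))
print(value)
```

The observed-data likelihood on the following slides sums this quantity (on the probability scale) over all possible `d_unobs`, which is what makes direct maximization hard.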

SLIDE 26

Inference for the personal network size distribution

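Inference for η should be based on all the available observed data, including the sampling design information. The likelihood is any function of η proportional to $P(G, D_{obs} \mid \eta)$:

$$\begin{aligned} L[\eta \mid D_{obs} = d_{obs}, G = g_{obs}] &\propto P(G = g_{obs}, D_{obs} = d_{obs} \mid \eta) \\ &= \sum_{d_{unobs} \in \mathcal{D}(d_{obs})} P(G = g_{obs}, D_{obs} = d_{obs}, D_{unobs} = d_{unobs} \mid \eta) \\ &= \sum_{d_{unobs} \in \mathcal{D}(d_{obs})} P(G = g_{obs} \mid D_{obs} = d_{obs}, D_{unobs} = d_{unobs}) \prod_{j=1}^{N} f(d_j \mid \eta) \end{aligned}$$

where $\mathcal{D}(d_{obs})$ is the set of possible $d_{unobs}$ given $d_{obs}$.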

I the sampling design is central to the likelihood

SLIDE 27

Inference for the personal network size distribution

For SS the probability of the observed sequence $G = g_{obs}$ for a given population of degrees is then:

$$P(G = g_{obs} \mid D_{obs} = d_{obs}, D_{unobs} = d_{unobs}) = \prod_{k=1}^{n} \frac{d_k}{\lambda_k}$$

so that the full likelihood is:

$$\begin{aligned} L[\eta \mid D_{obs} = d_{obs}, G = g_{obs}] &\propto P(G = g_{obs}, D_{obs} = d_{obs} \mid \eta) \\ &= \frac{N!}{(N-n)!} \sum_{d_{unobs} \in \mathcal{D}_U(d_{obs})} \prod_{k=1}^{n} \frac{d_k}{\lambda_k} \cdot \prod_{j=1}^{N} f(d_j \mid \eta) \end{aligned} \qquad (2)$$

where $\mathcal{D}_U(d_{obs})$ is the set of possible unordered $d_{unobs}$ given $d_{obs}$. Thus $\mathcal{D}_U(d_{obs})$ is $\mathcal{D}(d_{obs})$ reduced by the different permutations of degrees. This likelihood can be the basis of maximum likelihood estimation for η. In general, this sum will be very difficult to compute because of the $N - n$ embedded sums over typically infinite spaces.

SLIDE 28

The Bayesian framework

In the Bayesian framework we can treat N and the degree distribution parameter (η) as unknown parameters and estimate them. Bayesian inference enables us to make probability statements about the unknown N (and η) based on a model for them and conditional on the observed RDS data. Explicitly, we want to calculate:

$$P(N, \eta \mid D_{obs} = d_{obs}, G = g_{obs}) \quad \text{and} \quad P(N \mid D_{obs} = d_{obs}, G = g_{obs}).$$

Note that:

$$P(N \mid D_{obs} = d_{obs}, G = g_{obs}) \propto P(D_{obs} = d_{obs}, G = g_{obs} \mid N) \, P(N)$$

where P(N) specifies our knowledge about N before the RDS data are taken into account. That is, we specify our knowledge about N via a probability distribution over the values it can take.

SLIDE 29

Prior for the population size

The prior on population size (N) is flexible.

I the data effectively truncates the prior below the sample size.
I improper uniform prior
I natural parametric models (e.g., Negative Binomial, Poisson-log-normal, Conway-Maxwell-Poisson).
  I natural parametric models too thin in the tails
I consider specifying prior knowledge about the sample proportion (i.e. n/N).
  I based on the idea that the sample size n may not be chosen separately from the population size N
  I a simple prior is uniform on n/N.

SLIDE 30

Prior for the population size

I translates to a closed form for the prior on N which has infinite mean
I Generalize to n/N ∼ Beta(1, β)

The density function on N (considered as a continuous variable) is:

$$f(N \mid n) = \frac{\beta n (N - n)^{\beta - 1}}{N^{\beta + 1}} \quad \text{for } N > n$$

The distribution has tail behavior $\approx 1/N^2$. The mode of the prior is at $0.5\,n(\beta + 1)$ and the median is given by $n / (1 - (1/2)^{1/\beta})$. The median or mode can be elicited from field researchers and translated to β. A uniform distribution on the sample proportion corresponds to a median of twice the sample size.
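The prior formulas above are easy to check numerically. The following Python sketch implements the density, its CDF (which follows from $n/N \sim \text{Beta}(1, \beta)$ as $(1 - n/N)^\beta$), the median, the mode, and the translation from an elicited mode back to β. All function names are hypothetical, and the mode formula is used under the assumption β ≥ 1.

```python
def size_prior_density(N, n, beta):
    """f(N | n) = beta * n * (N - n)^(beta - 1) / N^(beta + 1) for N > n."""
    if N <= n:
        return 0.0
    return beta * n * (N - n) ** (beta - 1) / N ** (beta + 1)


def size_prior_cdf(N, n, beta):
    """P(population size <= N), derived from n/N ~ Beta(1, beta)."""
    return 0.0 if N <= n else (1.0 - n / N) ** beta


def size_prior_median(n, beta):
    return n / (1.0 - 0.5 ** (1.0 / beta))


def size_prior_mode(n, beta):
    """Mode 0.5 * n * (beta + 1); meaningful for beta >= 1."""
    return 0.5 * n * (beta + 1.0)


def beta_from_mode(n, mode):
    """Invert the mode formula: translate an elicited prior mode into beta."""
    return 2.0 * mode / n - 1.0


n = 500
print(size_prior_median(n, 1.0))             # uniform on n/N: median is twice n
beta = beta_from_mode(n, 1000.0)             # elicited prior mode of 1000
print(beta, size_prior_mode(n, beta))
print(size_prior_cdf(size_prior_median(n, beta), n, beta))
```

With n = 500, an elicited mode of 1000 corresponds to β = 3, while an elicited median of 1000 corresponds to β = 1, so it matters which summary the field researchers are asked for.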

SLIDE 31

[Figure, two panels: prior for population size (x-axis: population size, 5000 to 15000; y-axis: prior density; truth = 1000). Panels: prior mode = 1000; prior mode = 10000.]

SLIDE 32

Example: N=1000, homophily=2, diff. activity=3

[Figure: posterior for population size (x-axis: population size, 500 to 2500; y-axis: density).]

SLIDE 33

Prior for the degree distribution model

Each of the degree distribution models is parametrized in terms of its mean and standard deviation. The prior is the joint conjugate family: the prior for the mean given the standard deviation is normal and the variance is scaled inverse Chi-squared.

$$\mu \mid \sigma \sim N(\mu_0, \sigma_0 / df_{mean}) \qquad \sigma \sim \text{Inv}\chi(\sigma_0; df_{sigma})$$

The default prior on the degree distribution model parameters is close to uninformative (equivalent sample size of $df_{mean} = 1$ for the mean of the degree distribution and $df_{sigma} = 5$ for the variance of the degree distribution).

SLIDE 34

Simulation Study

Simulate Population

I 1000, 835, 715, 625, 555, or 525 nodes
I 20% “Infected”

Simulate Social Network (from ERGM, using statnet)

I Mean degree 7
I Homophily on Infection: R = E(# infected to infected ties) / E_{R=0}(# infected to infected ties) = 5 (or other)
I Differential Activity: w = (mean degree infected) / (mean degree uninfected) = 1 (or other)

Simulate Respondent-Driven Sample

I 500 total samples
I 10 seeds, chosen proportional to degree
I 2 coupons each
I Coupons at random to relations
I Sample without replacement

Blue parameters varied in study.
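The recruitment part of this design can be sketched as follows. This is a simplified illustration only: it uses an Erdős-Rényi random graph as a stand-in for the ERGM networks generated with statnet, ignores infection status, homophily and differential activity, and all function names are hypothetical.

```python
import random


def random_graph(n_nodes, mean_degree, rng):
    """Erdos-Renyi graph as adjacency sets (a stand-in for the ERGM used in the study)."""
    p = mean_degree / (n_nodes - 1)
    adj = [set() for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def simulate_rds(adj, n_sample, n_seeds, n_coupons, rng):
    """Seeds drawn proportional to degree; each respondent passes up to
    n_coupons coupons at random to not-yet-sampled relations."""
    degrees = [len(a) for a in adj]
    candidates = [i for i in range(len(adj)) if degrees[i] > 0]
    seeds = []
    while len(seeds) < n_seeds:
        weights = [degrees[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        seeds.append(pick)
        candidates.remove(pick)            # seeds are distinct
    sampled = list(seeds)
    in_sample = set(seeds)
    head = 0                               # next recruiter, in order of being sampled
    while head < len(sampled) and len(sampled) < n_sample:
        recruiter = sampled[head]
        head += 1
        eligible = sorted(j for j in adj[recruiter] if j not in in_sample)
        rng.shuffle(eligible)              # coupons at random to relations
        for j in eligible[:n_coupons]:
            if len(sampled) >= n_sample:
                break
            in_sample.add(j)               # sampling without replacement
            sampled.append(j)
    return sampled


rng = random.Random(0)
adj = random_graph(1000, 7, rng)
sample = simulate_rds(adj, n_sample=500, n_seeds=10, n_coupons=2, rng=rng)
print(len(sample))
```

The returned list is in order of recruitment, so the degrees of its members can be inspected over time in the same way as the successive sampling sequence.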

SLIDE 35

[Figure: population size estimates from posteriorsize(), revision 178; 200 RDS samples, prior.size.mode = truth; circle is mode, triangle is mean. y-axis: population size (500 to 2500); x-axis: differential activity level (1, 2, 4); panels: homophily ratio 1 and homophily ratio 5.]

SLIDE 36

Misspecification of the degree distribution shape

I Because of differential activity the Conway-Maxwell-Poisson model class does not cover the bimodality of the degree distribution

SLIDE 37

[Figure: posterior for population size (x-axis: population size, 600 to 1400; y-axis: density). truth = 1000, mode = 763, median = 801, mean = 835.]

SLIDE 38

Misspecification of the degree distribution shape

I Because of differential activity the Conway-Maxwell-Poisson model class does not cover the bimodality of the degree distribution
I Solution: Model the degree distributions of the diseased and the non-diseased with separate Conway-Maxwell-Poisson models.

SLIDE 39

[Figure, two panels: posterior for mean degree (true overall mean degree is 7) and posterior for s.d. of degree, each with separate curves for No Disease and With Disease. x-axis: degree; y-axis: density.]

SLIDE 40

Modeling of other characteristics of the population

I As we model many population characteristics, including the disease status, we can compute estimates of them directly
I Example: disease prevalence

SLIDE 41

[Figure: disease prevalence estimates, revision 170; 200 RDS samples, prior.size.mode = truth. Population size: red is 525, green is 715, and blue is 1000. y-axis: disease prevalence (0.14 to 0.26); x-axis: differential activity level (1, 2, 4); panels: homophily ratio 1 and homophily ratio 5.]

SLIDE 42

Confidence intervals, design effects and standard errors

I Using the Bayesian framework, we can naturally compute the probability intervals for the population size and other characteristics
I Examples: CI coverage for the population size and prevalence

SLIDE 43

[Figure: coverage, i.e. proportion of samples whose 95% CI covered the true population size (y-axis, 0.88 to 0.96). Revision 170; 200 RDS samples, prior.size.mode = truth. True population size: red is 525, green is 715, blue is 1000. x-axis: differential activity level (1, 2, 4); panels: homophily ratio 1 and homophily ratio 5.]

[Figure: coverage, i.e. proportion of samples whose 95% CI covered the true prevalence. Revision 170; 200 RDS samples, prior sampling fraction = truth. True prevalence = 0.2.]

SLIDE 44

Comparison to the Gile SS estimator

I The SS estimator in Gile JASA (2011) requires N to be known
I We use the posterior mode as a plug-in estimate of N

SLIDE 45

[Figure: Gile SS prevalence estimates with 95% CI using the posteriorsize estimate (median reported); true population size 1000; seed type random. Mode = red square, median = green circle, mean = blue triangle. y-axis: prevalence (0.10 to 0.30); x-axis: homophily (alpha) at 1.0, 1.2, 1.4, 1.8, 3.0 within each of five panels: DA 0.5, DA 0.8, DA 1.0, DA 1.5, DA 2.0. Numbers printed on the plot, in extraction order: 1131, 1125, 1136, 1104, 1118, 1202, 1298, 1279, 1178, 1206, 1262, 1142, 1127, 1276, 1162, 1159, 1152, 1108, 1141, 1176, 1072, 1062, 1027, 1034, 1065.]

SLIDE 46

[Figure: root mean square error ratio, SS using true population size (1000, random seeds) over SS using posteriorsize population size estimate. y-axis: RMSE ratio (0.6 to 1.0); x-axis: homophily (alpha) at 1.0, 1.2, 1.4, 1.8, 3.0 within each of five panels: DA 0.5, DA 0.8, DA 1.0, DA 1.5, DA 2.0.]

SLIDE 47

Application: The study of HIV/AIDS in San Francisco

I Surveillance surveys by San Francisco Department of Public Health
I Focus on African-American (AA) men-who-have-sex-with-men (MSM)
I RDS study of size n = 256 in 2009.
I Intensive study provides a population size estimate of 4439.
I Census data indicated 21518 AA men in San Francisco.

SLIDE 48

SLIDE 49

[Figure: posterior for population size (x-axis: population size, 5000 to 20000; y-axis: density); markers labeled SF and mean.]

SLIDE 50

[Figure: posterior for the number of AA MSM with HIV (x-axis: HIV+ count, 2000 to 8000; y-axis: density); markers labeled SF and mean.]

SLIDE 51

[Figure: posterior for population size (x-axis: population size, 5000 to 20000; y-axis: density); markers labeled SF and mean.]

SLIDE 52

[Figure: posterior for the number of AA MSM with HIV (x-axis: HIV+ count, 2000 to 10000; y-axis: density); markers labeled SF and mean.]

SLIDE 53

Discussion

I It is important to estimate the networked population size
I There is information on the population size implicit in the decreasing degrees of the sample nodes over time
I Using the successive sampling model we can model the decrease
I We can incorporate prior information about the population size using the Bayesian framework
I We can incorporate other features of the population
I We can estimate population means (e.g., prevalence and counts)
I In the Bayesian framework we can estimate uncertainty of the estimates in a natural way

SLIDE 54

Cautions

I The difference between the model with disease and without highlights the importance of the specification of the model for the degree distribution.
I The estimates depend on the prior distribution for population size.
I The estimates are biased because the successive sampling model is not perfect, and will be increasingly misspecified as we get further from the configuration network.
I This approach is promising. It is designed to be combined with data from other methods (e.g., scale up) to provide the most accurate overall estimate.
I Fundamentally, RDS data typically does not contain much information about the population size. The Bayesian approach enables us to quantify this.

SLIDE 55

References:

I Krista J. Gile, Inference from Partially-Observed Network Data, Ph.D. Dissertation, Department of Statistics, University of Washington, 2008.
I Krista J. Gile and Mark S. Handcock (2010), “Respondent-Driven Sampling: An Assessment of Current Methodology,” Sociological Methodology, 40, 285-327, available on arXiv (http://arxiv.org/abs/0904.1855v1).
I Krista J. Gile (2011), “Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation,” Journal of the American Statistical Association, 106 (493), 135-146.

I Krista J. Gile and Mark S. Handcock, “Network Model-Assisted