Survey Sampling Theory & Small and Hard-to-reach Groups

SLIDE 1

Survey Sampling Theory & Small and Hard-to-reach Groups

Geert Molenberghs

Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat) Universiteit Hasselt & KU Leuven, Belgium

geert.molenberghs@uhasselt.be & geert.molenberghs@kuleuven.be www.ibiostat.be

InGRID: ‘Reaching out to hard-to-survey groups among the poor’, Leuven, May 31, 2016

SLIDE 2

Relevant References

  • Barnett, V. (2002). Sample Survey: Principles and Methods (3rd ed.). London: Arnold.
  • Billiet, J. (1990). Methoden van Sociaal-Wetenschappelijk Onderzoek: Ontwerp en Dataverzameling. Leuven: Acco.
  • Billiet, J., Loosveldt, G., and Waterplas, L. (1984). Het Survey-Interview Onderzocht. Sociologische Studies en Documenten, 19, Leuven.
  • Brinkman, J. (1994). De Vragenlijst. Groningen: Wolters-Noordhoff.
  • Chambers, R.L. and Skinner, C.J. (2003). Analysis of Survey Data. New York: Wiley.
  • Cochran, W.G. (1977). Sampling Techniques. New York: Wiley.
  • Foreman, E.K. (1991). Survey Sampling Principles. New York: Marcel Dekker.

SLIDE 3

  • Fowler, Jr., F.J. (1988). Survey Research Methods. Newbury Park, CA: Sage.
  • Groves, R.M., Fowler, F.J., Couper, M.P., Lepkowski, J.M., Singer, E., and Tourangeau, R. (2004). Survey Methodology. New York: Wiley.
  • Heeringa, S.G., West, B.T., and Berglund, P.A. (2010). Applied Survey Data Analysis. Boca Raton: Chapman & Hall/CRC.
  • Kish, L. (1965). Survey Sampling. New York: Wiley.
  • Knottnerus, P. (2003). Sample Survey Theory. New York: Springer.
  • Korn, E.L. and Graubard, B.I. (1999). Analysis of Health Surveys. New York: Wiley.
  • Lehtonen, R. and Pahkinen, E.J. (1995). Practical Methods for Design and Analysis of Complex Surveys. Chichester: Wiley.
  • Lessler, J.T. and Kalsbeek, W.D. (1992). Nonsampling Error in Surveys. New York: Wiley.

SLIDE 4

  • Levy, P. and Lemeshow, S. (1999). Sampling of Populations. New York: Wiley.
  • Little, R.J.A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237–250.
  • Little, R.J.A. (1985). Nonresponse adjustments in longitudinal surveys: models for categorical data. Bulletin of the International Statistical Institute, 15, 1–15.
  • Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data (2nd ed.). New York: Wiley.
  • Lumley, T. (2010). Complex Surveys. A Guide to Analysis Using R. Ch
  • Lynn, P. (2009). Methodology of Longitudinal Surveys. Chichester: New York: John Wiley & Sons.

SLIDE 5

  • Molenberghs, G. and Kenward, M.G. (2007). Missing Data in Clinical Studies. New York: Wiley.
  • Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. New York: Springer.
  • Moser, C.A. and Kalton, G. (1971). Survey Methods in Social Investigation. London: Heinemann.
  • Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
  • Scheaffer, R.L., Mendenhall, W., and Ott, L. (1990). Elementary Survey Sampling. Boston: Duxbury Press.
  • Skinner, C.J., Holt, D., and Smith, T.M.F. (1989). Analysis of Complex Surveys. New York: Wiley.

SLIDE 6

  • Som, R.J. (1996). Practical Sampling Techniques (3rd ed.). New York: Marcel Dekker.
  • Swyngedouw, M. (1993). Transitietabelanalyse en ML-schattingen voor partieel geclassificeerde verkiezingsdata via loglineaire modellen. Kwantitatieve Methoden, 43, 119–149.
  • Vehovar, V. (1999). Field substitution and unit nonresponse. Journal of Official Statistics, 15, 335–350.
  • Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer.

SLIDE 7

Setting the Scene

Schepers, Juchtmans, and Nicaise (2016)

  • How rare is the population?
  • How readily can members of the population be identified?
  • Is there a large-scale survey that can serve as a screener sample for identifying members of the target population?
  • Is the target population more concentrated in some parts of the sampling frame?
  • Are there one or more partial sampling frames of the hard-to-sample population that are available for use in sampling?
  • Is the target population accessible by sampling households?

SLIDE 8

The Belgian Health Interview Survey

  • Conducted in the years 1997, 2001, 2004, 2008, and 2013
  • Commissioned by:
    ⊲ Federal government
    ⊲ Flemish Community
    ⊲ French Community
    ⊲ German Community
    ⊲ Walloon Region
    ⊲ Brussels Region

SLIDE 9

Design At-a-Glance

  • Regional stratification: fixed a priori
  • Provincial stratification: for convenience
  • Three-stage sampling:
    ⊲ Primary sampling units (PSU): municipalities, selected proportional to size
    ⊲ Secondary sampling units (SSU): households
    ⊲ Tertiary sampling units (TSU): individuals
  • Over-representation of German Community
  • Over-representation of 4 (2) provinces in 2001 (2004): Limburg, Hainaut, Antwerpen, Luxembourg
  • Sampling done in 4 quarters: Q1, Q2, Q3, Q4

SLIDE 10

Regional Stratification

            1997                2001                             2004
Region      Goal     Obtained   Goal                  Obtained   Goal                                    Obtained
Flanders    3500     3536       3500+550=4050         4100       3500+450 + elderly +450=4400            4513
Wallonia    3500     3634       3500+1500=5000        4711       3500+900 + elderly +450=4850            4992
Brussels    3000     3051       3000                  3006       3000 + elderly +350=3350                3440
Belgium     10,000   10,221     10,000+2050=12,050    12,111     10,000+1350 + elderly +1250=12,600      12,945

SLIDE 11

Provincial Stratification in 1997

Province           Sample #   Sample %   Pop. %
Antwerpen             945       26.7      27.7
Oost-Vlaanderen       812       23.0      23.0
West-Vlaanderen       733       20.7      19.1
Vlaams-Brabant        593       16.8      17.0
Limburg               453       12.8      13.2
Hainaut              1325       36.5      38.7
Liège                1210       33.3      30.6
Namur                 465       12.8      13.2
Brabant-Wallon        356        9.8      10.3
Luxembourg            278        7.6       7.3
Brussels             3051

SLIDE 12

Provincial Stratification in 2001

                                  % in     # interviews                                    #        #       Rate per
Province            Pop.          region   Theor.   Round    Oversp.   Sum      Actual     groups   towns   1000
Antwerpen            1,640,966    27.7        969      950     350      1300     1302        26       19     0.79
Oost-Vlaanderen      1,359,702    22.9        803      850               850      874        17       17     0.63
West-Vlaanderen      1,127,091    19.0        665      650               650      673        13       13     0.58
Vlaams-Brabant       1,011,588    17.1        598      600               600      590        12       12     0.59
Limburg                787,491    13.3        465      450     200       650      661        13       13     0.83
Flanders             5,926,838    100        3500     3500     550      4050     4100        81       74     0.68
Hainaut              1,280,427    39.3       1256     1250     500      1750     1747        35       27     1.37
Liège                  947,787    29.0        929      950               950      935        19       19     1.00
Namur                  441,205    13.5        433      450               450      435         9        7     1.02
Brabant Wallon         347,423    10.7        341      300               300      291         6        6     0.86
Luxembourg             245,140     7.5        241      250    1000      1250     1303        25       21     5.10
Wallonia             3,261,982    100        3200     3200    1500      4700     4711        94       80     1.44
German comm.            70,472     1.1        300      300               300      294         6        6     4.26
Wallonia+German      3,332,454    100        3500     3500    1500      5000     5005       100       86     1.50
Brussels               954,460    100        3000     3000              3000     3006        60       18     3.14
Belgium             10,213,752    100      10,000   10,000    2050    12,050   12,111       241      178     1.18


SLIDE 14

Provincial Stratification in 2004

Province           Goal   Obtained
Antwerpen          1100     1171
Oost-Vlaanderen     900      944
West-Vlaanderen     750      814
Vlaams-Brabant      650      561
Limburg            1000     1023
Hainaut            1500     1502
Liège              1200     1181
Namur               550      531
Brabant-Wallon      400      446
Luxembourg         1200     1332
Brussels           3350     3440

SLIDE 15

Multi-Stage Sampling: Primary Sampling Units

Towns

  • Within each province, order the municipalities by size
  • Systematically sample in groups of 50:

    X     X     X     X     X     X
    ↑     ↑     ↑     ↑     ↑     ↑
    random start, then a fixed jump between successive selections

SLIDE 16

  • The above is also known as area probability sampling
  • Representation with certainty of larger cities. For 1997:
    ⊲ Antwerpen: 6 groups
    ⊲ Liège and Charleroi: 4 groups each
    ⊲ Gent: 3 groups
    ⊲ Mons and Namur: 2 groups each
    ⊲ All towns in Brussels
  • Representation of respondents living in smaller towns is ensured

SLIDE 17

Multi-Stage Sampling: Secondary Sampling Units

Households

  • List of households, ordered by
    ⊲ statistical sector
    ⊲ age of reference person
    ⊲ size of household
  • clusters of 4 households selected ← field substitution
  • households within clusters randomized
  • twice as many clusters as households needed, to further account for refusal and non-responders

SLIDE 18

Multi-stage Sampling: Tertiary Sampling Units

Individual Respondents

  • Households of size ≤ 4: all members
  • Households of size ≥ 5:
    ⊲ reference person and partner (if applicable)
    ⊲ other household members selected by the birthday rule in 1997, or by prior sampling from household members in 2001 and 2004
    ⊲ maximum of 4 interviews per household

SLIDE 19

Weights

  • Stratification → Region
  • Province
  • Age of reference person
  • Household size
  • Quarter
  • Multi-stage sampling → Selection probability of individual within household
  • Taking this into account is relatively easy, even with standard software

SLIDE 20

Incomplete Data

  • Types of incompleteness in this survey:
    ⊲ Household level
        ∗ Households with which no interview was realized
        ∗ Households which explicitly refused
        ∗ Households which could not be contacted
    ⊲ Individual level
    ⊲ Item level
  • In addition, the reason for missingness needs to be considered. For example, is missingness due to illness of the interviewer, or is it related to the income and social class of the potential respondent?
  • General missing data concepts as well as survey-specific missing data concepts need to be combined

SLIDE 21

Design → Analysis

  • Weights & selection probabilities
  • Stratification
  • Multi-stage sampling & clustering
  • Incomplete data

SLIDE 22

Definitions

Survey population: The collection of units (individuals) about which the researcher wants to make quantitative statements.

Sample frame: The set of units (individuals) that has non-zero probability of being selected.

Sample: The subset of units that have been selected.

Probability sampling: The family of probabilistic (stochastic) methods by which a subset of the units from the sample frame is selected.

Design properties: The entire collection of methodological aspects that leads to the selection of a sample. The probability sampling method is the most important design aspect.

SLIDE 23

Sample size: The number of units in the sample.

Analysis and inference: The collection of statistical techniques by which population estimands are estimated. Examples: estimation of means, averages, totals, linear regression, ANOVA, logistic regression, loglinear models.

Estimand: The true population quantity (e.g., the average body mass index of the Belgian population).

Estimator: A (stochastic) function of the sample data, with the aim to “come close” to the estimand.

Estimate: A particular realization of the estimator, for the particular sample taken (e.g., 22.37).

SLIDE 24

Population

  • A population can be physical and/or geographical, but does not have to be an entire country or region.
  • A population can be a cohort: all males born in a Brussels town in 1980.
  • There can be geographical, temporal, and definition characteristics at the same time: all females living in Brussels, diagnosed with breast cancer from 1990 until 1999 inclusive.
  • A hard-to-reach group can be a population in its own right: homeless, illegal immigrants,...

SLIDE 25

Sample Frame

  • The sample frame “operationalizes” the population.
    ⊲ Population: All females living in Brussels, diagnosed with breast cancer from 1990 until 1999 inclusive.
    ⊲ Sample frame: The National Cancer Register for the given years.

[Figure: Population versus Sample Frame]

SLIDE 26

  • There are three groups of units:
    ⊲ 1. Belonging to both the population and the sample frame: This fraction should be as large as possible. Their probability of being selected is > 0.
    ⊲ 2. Belonging to the population but not to the sample frame: Can be damaging if there are too many such units and/or they are too different. Their probability of selection is 0. Examples of frames that exclude part of the population:
        ∗ If a selection is based on households, then dormitories, prisons, elderly homes, and homeless people have no chance of being selected.
        ∗ Driving licenses (US)
        ∗ Registered voters
        ∗ House owners
        ∗ Phone directories: excludes those without phone and those unlisted.
    ⊲ 3. Belonging to the sample frame but not to the population: May contribute to cost, but is not so harmful otherwise.

SLIDE 27

For example, a survey on the elderly can be conducted as follows:
        ∗ select households from the general population
        ∗ retain those who are “sufficiently old”
        ∗ collect data on this subselected sample
        ∗ But this procedure is clearly inefficient.

If group 1 is sufficiently large, then the sample frame is sufficiently representative.

  • It is important to answer such questions as:
    ⊲ What percentage is excluded from selection?
    ⊲ How different are these groups?
  • It is possible to opt for a selection scheme with less than full coverage of the population, if it is sufficiently cheaper. → Statistical and economic arguments have to be balanced.

SLIDE 28

Types of Sample Frame

  • It is useful to think of a sample frame as a list.
  • A list is a broad concept; there are widely different types.
    ⊲ Static, exhaustive lists:
        ∗ A single list contains all sample frame units
        ∗ The list exists prior to the start of the study
    ⊲ Dynamic lists:
        ∗ The list is generated together with the sample
        ∗ For example: all patients visiting a general practitioner during the coming year
        ∗ There are implications for knowledge about the selection probability

SLIDE 29

    ⊲ Multi-stage lists:
        ∗ The natural companion to multi-stage sampling
  • If selection is undertaken based on a list, one has to consider the list’s quality:
    ⊲ How has the list been composed?
    ⊲ How does the updating take place?
    ⊲ Always report:
        ∗ who cannot be selected?
        ∗ in what way do those who have selection probability equal to zero differ from the others?
        ∗ who had an unknown selection probability?
      ⇒ trustworthy, useful results

SLIDE 30

Especially with Hard-to-reach Groups

Population ≠ Sample frame

SLIDE 31

Sampling Methods

Simple random sampling: the standard method; studied as the benchmark against which other methods are compared; seldom used in practice.

Systematic sampling: chosen to increase precision and/or to ensure sampling with certainty for a subgroup of units.

Stratification: performed:
    ⊲ to increase precision of population-level estimates
    ⊲ to allow for estimation at sub-population level
    ⊲ to reach generally hard-to-reach groups
    ⊲ a combination of these goals

SLIDE 32

Multi-stage procedures: decrease precision but facilitate fieldwork.

Differential rates: will often result from other sampling methods; the overall precision will decrease.

Benchmark estimation: using outside, reliable sources, may introduce some bias but is aimed to increase precision; there is a need for external sources.

SLIDE 33

Simple Random Sampling (SRS)

  • We need the following information:
    ⊲ Population P
    ⊲ Population size N
    ⊲ Sample size n
    ⊲ Whether sampling is done with or without replacement
  • The sample fraction:

$$f = \frac{n}{N}$$

  • Usually no adjustments needed, but...

SLIDE 34

  • Finite population correction:
    ⊲ If a (sub)group is small, n may be large relative to N
    ⊲ Say it again: the sample size may be large relative to the size of the population
    ⊲ It then pays off to correct estimates for sampling without replacement from a finite population
    ⊲ Example:
        ∗ Classical estimate of a proportion: $\hat{\pi} = 0.45$ (s.e. 0.10)
        ∗ Adjusted estimate of a proportion, when the sample is half of the population: $\hat{\pi} = 0.45$ (s.e. 0.07)
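The size of the correction in the example can be checked with a short sketch. It assumes the usual correction factor sqrt(1 - n/N) for sampling without replacement; the sizes n = 25 and N = 50 are hypothetical, chosen only so that the classical standard error comes out near the 0.10 quoted on the slide.

```python
import math


def se_proportion(p, n, N=None):
    """Standard error of a sample proportion under simple random sampling.

    If the population size N is given, the finite population correction
    sqrt(1 - n/N) for sampling without replacement is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return se


# Hypothetical sizes (not from the slides): n = 25 sampled units.
classical = se_proportion(0.45, n=25)
# Same sample, now treated as half of a population of N = 50 units.
adjusted = se_proportion(0.45, n=25, N=50)
print(round(classical, 2), round(adjusted, 2))
```

With these assumed sizes the two standard errors round to 0.10 and 0.07: sampling half of the population shrinks the standard error by a factor sqrt(0.5), roughly 0.71.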

SLIDE 35

  • Capture-recapture:
    ⊲ Suppose we want to estimate the size of a subgroup, immersed in a larger population:
    ⊲ Take a first sample: continue until you have $k_1$ members of the subgroup
    ⊲ Take a second, independent sample: continue until you again have $k_1$ members of the subgroup
    ⊲ Let there be $k_2$ members in the intersection of both samples
    ⊲ It then follows that the total size $k_0$ of the subgroup is estimated by:

$$\hat{k}_0 = \frac{k_1^2}{k_2}$$

    ⊲ Example:
        ∗ Let $k_1 = 100$: we sample until we have 100 group members
        ∗ Let $k_2 = 5$: we find 5 people in the intersection
        ∗ Then $\hat{k}_0 = 2000$, the estimated size of the subgroup
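The capture-recapture calculation above can be sketched as follows. The function is written for two samples of possibly different sizes (the standard Lincoln-Petersen form, a name the slides do not use); the slide's formula is the special case in which both samples contain $k_1$ subgroup members. The function name is illustrative only.

```python
def capture_recapture_size(n_first, n_second, n_both):
    """Estimate the size of a subgroup from two independent samples.

    n_first, n_second: subgroup members found in the first and second sample.
    n_both: subgroup members found in both samples (the intersection).
    """
    if n_both <= 0:
        raise ValueError("the intersection must contain at least one member")
    return n_first * n_second / n_both


# Slide example: k1 = 100 members in each sample, k2 = 5 in the intersection.
k0_hat = capture_recapture_size(100, 100, 5)
print(k0_hat)
```

The intuition: the fraction of the second sample that was already seen (5 out of 100) estimates the fraction of the whole subgroup covered by the first sample (100 out of $k_0$).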

SLIDE 36

Two Reasons for Stratification (STRAT)

Goal 1: to increase precision

  • Example: better precision for the population at large

Goal 2: to obtain inferences about the strata (as well)

  • Example: a (very) small and/or hard-to-reach subgroup should get a minimum sample size as well.

SLIDE 37

Stratification Ingredients

  • Population P
  • Population size N
  • Sample size n
  • Whether sampling is done with or without replacement
  • The strata indicators h = 1, . . . , H
  • The number of subjects in stratum h is $N_h$, indexed $i = 1, \ldots, N_h$, with

$$N = \sum_{h=1}^{H} N_h$$

  • This defines the subpopulations, or population strata, $P_h$

SLIDE 38

  • The way the sample of n units is allocated to the strata: $n_h$, with

$$n = \sum_{h=1}^{H} n_h$$

  • We can calculate the stratum-specific sample fraction:

$$f_h = \frac{n_h}{N_h}$$

  • $f_h$ variable → need for weights

SLIDE 39

Post-stratification

  • Stratification can be done at two levels:
    ⊲ design stage: stratify when selecting the sample
    ⊲ analysis stage: construct stratified estimators, by:
        ∗ first: constructing estimators for each stratum
        ∗ second: combining these in an estimator for the entire population
  • Whether or not the method is applied at either one of the stages can be used for characterizing a method:

                            At design stage
                            No                     Yes
At analysis stage   No      SRS                    Problematic
                    Yes     Post-stratification    Stratification

SLIDE 40

Multi-Stage Sampling and Clustering

  • Informal definition of both concepts:
    ⊲ Multi-stage sampling: a hierarchy of units is selected:
        ∗ starting with primary sampling units (PSU),
        ∗ within which secondary sampling units (SSU) are subselected,
        ∗ within which tertiary sampling units (TSU) are subselected,
        ∗ etc.
    ⊲ Clustering: refers to the fact that several non-independent units (stemming from a ‘cluster’) are simultaneously selected.

SLIDE 41

Multi-Stage Sampling: the Relative Approach

Selection probabilities

      Stage 1: f1       Stage 2: f2       Total: f = f1 · f2
a.    1/1          ×    1/10         =    1/10
b.    1/2          ×    1/5          =    1/10
c.    1/5          ×    1/2          =    1/10
d.    1/10         ×    1/1          =    1/10

SLIDE 42

Multi-Stage Sampling: the Absolute Approach

  • Assume N, n, and hence f are prespecified.
  • Fix the number of SSU taken per PSU: $n_c$.
  • Construct a cumulative list of the number of SSU per PSU.
  • Conduct systematic selection within the cumulative list, with jump $g = n_c / f$.

SLIDE 43

    ⊲ Sample fraction f = 10%
    ⊲ Number of respondents per neighborhood $n_c = 10$
    ⊲ Jump:

$$g = \frac{1}{f} \cdot n_c = 0.1^{-1} \cdot 10 = 100$$

Sample selection

block   # houses   cumulative   hits
1           43          43
2           87         130        70
3          109         239       170
4           27         266
5           15         281       270
...        ...         ...       ...
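The selection in the table can be reproduced with a small sketch. It walks the cumulative list with a fixed jump, starting from 70 (the first hit shown in the table; in practice the start would be drawn at random between 1 and the jump). The function name and the returned (block, hit) pairs are illustrative choices, not part of the slides.

```python
from fractions import Fraction
from itertools import accumulate


def systematic_hits(sizes, start, jump):
    """Systematic selection on the cumulative list of block sizes.

    Returns (block number, hit) pairs; a block is selected when a hit
    falls within its range on the cumulative list.
    """
    cumulative = list(accumulate(sizes))
    selected = []
    block = 0
    hit = start
    while hit <= cumulative[-1]:
        # Advance to the first block whose cumulative count reaches the hit.
        while cumulative[block] < hit:
            block += 1
        selected.append((block + 1, hit))
        hit += jump
    return selected


houses = [43, 87, 109, 27, 15]  # block sizes from the table
hits = systematic_hits(houses, start=70, jump=100)
print(hits)  # blocks 2, 3 and 5 are hit at 70, 170 and 270

# Block 2: chance of a hit (87/100) times chance within the block (10/87).
overall = Fraction(87, 100) * Fraction(10, 87)
print(overall)  # 1/10, the prespecified sample fraction
```

Larger blocks are more likely to be hit, but each house inside a large block is less likely to be chosen, so every house ends up with the same overall selection probability f.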

SLIDE 44

Selection probabilities

block   houses   prob.(1)   prob.(2)   prob.(tot)
2          87     87/100     10/87       1/10
3         109    109/100     10/109      1/10
5          15     15/100     10/15       1/10

SLIDE 45

Sample Sizes: The Belgian Health Interview Survey

Allocation for Belgian Health Interview Survey

                          Focus on
Region     N_h            population   strata            compromise
Brussels   1,000,000      1000         3333.33 ≃ 3000    1929.93 ≃ 2000
Flanders   6,000,000      6000         3333.33 ≃ 3500    4727.34 ≃ 4750
Wallonia   3,000,000      3000         3333.33 ≃ 3000    3342.73 ≃ 3250
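The three allocation columns can be reproduced with one formula, under two assumptions that the slide does not state explicitly: a total sample size of 10,000, and a compromise that allocates proportionally to the square root of the stratum size. The latter is an inference from the numbers in the table. In the sketch below, the hypothetical parameter `power` switches between the three schemes.

```python
def allocate(stratum_sizes, n_total, power):
    """Allocate n_total sample units proportionally to N_h ** power.

    power = 1   -> proportional allocation (focus on the population)
    power = 0   -> equal allocation (focus on the strata)
    power = 0.5 -> square-root allocation (assumed compromise)
    """
    shares = {h: size ** power for h, size in stratum_sizes.items()}
    total = sum(shares.values())
    return {h: n_total * share / total for h, share in shares.items()}


regions = {"Brussels": 1_000_000, "Flanders": 6_000_000, "Wallonia": 3_000_000}

for label, power in [("population", 1), ("strata", 0), ("compromise", 0.5)]:
    allocation = allocate(regions, n_total=10_000, power=power)
    print(label, {h: round(n_h, 2) for h, n_h in allocation.items()})
```

Under these assumptions the unrounded table entries are recovered; the rounded targets on the slide (for example 2000, 4750, and 3250) are then practical adjustments of those values.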

SLIDE 46

Weighting: General Concepts and Design

    ⊲ When do weights appear?
    ⊲ Weighting in the context of stratification
    ⊲ Weighting in the context of clustering
    ⊲ Examples

SLIDE 47

General Principles

  • Weighting arises naturally in a variety of contexts:
    ⊲ With stratification: different strata have different selection probabilities.
    ⊲ With clustering: weights differ within and between clusters.
    ⊲ Incomplete data: to correct for non-response.
    ⊲ In general: units are given probabilities of selection, e.g., proportional to their size.

SLIDE 48

Weighting and Stratification

  • There are two main reasons why selection probabilities are different between strata:
    ⊲ A (hard-to-reach) subgroup is of interest, and without oversampling its sample size would be too small. Example: German Community in the Belgian HIS.
    ⊲ Strata are given equal sample sizes for comparative purposes, but an estimate for the entire population is also required. Example: Brussels, Flanders, and Wallonia in the Belgian HIS.
  • Units are then reweighted to ensure proper representativity.

SLIDE 49

Example

  • Suppose a certain subgroup represents 10% of the population.
  • With an unweighted scheme (SRS or stratified), this group will also contribute 10% to the sample, on average.
  • If we need a sample which includes 100 individuals of the subgroup, then a total sample of 1000 individuals has to be selected.
  • Enlarging the subgroup by 50% implies scaling up from 100 to 150, and hence 500 additional interviews for the entire sample are needed.
  • It is perfectly possible that 50 extra interviews in the subgroup are essential, but that the other 450 are redundant.

SLIDE 50

  • A solution is to increase the selection probability for the subgroup, relative to the others.

Quantity                            Majority   Minority
Population                            4500       500
Percentage                              90        10
Sample portion                        1/10       1/5
Number selected                        450       100
Unweighted percentage in sample       81.8      18.2
Weight                                   1       1/2
Weighted number in sample              450        50
Weighted percentage in sample           90        10
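The rows of this table follow from the population sizes and the two sample portions alone. A minimal sketch, taking the weight to be the inverse selection probability rescaled so that the majority has weight 1 (which matches the weights 1 and 1/2 in the table); variable names are illustrative.

```python
from fractions import Fraction

population = {"majority": 4500, "minority": 500}
sample_portion = {"majority": Fraction(1, 10), "minority": Fraction(1, 5)}

# Expected number selected in each group.
selected = {g: population[g] * sample_portion[g] for g in population}

# Relative weight: inverse selection probability, scaled to the majority.
weight = {g: sample_portion["majority"] / sample_portion[g] for g in population}

# Weighted counts restore the population proportions.
weighted = {g: selected[g] * weight[g] for g in population}


def percentages(counts):
    """Convert counts per group into percentages of the total."""
    total = sum(counts.values())
    return {g: float(100 * c / total) for g, c in counts.items()}


print(percentages(selected))  # unweighted: minority is over-represented
print(percentages(weighted))  # weighted: back to 90% and 10%
```

The minority is oversampled to obtain enough interviews, and the weight of 1/2 undoes that oversampling when population-level estimates are computed.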

SLIDE 51

  • Unfortunately, it is not always possible to pre-determine whether a respondent belongs to the majority or to the minority.
  • This implies that determining the weight is difficult.
  • As a surrogate, entire quarters (or other geographical entities) which are known to have large minority populations can be oversampled ← area probability sampling
  • This procedure works, since the weighting is done at the quarter level, hence producing correct weights, such as in the example above.
  • If one calculates the subsample selection probability carefully, then it can be ensured that the sample will contain a sufficient number of minority members.

SLIDE 52

Example: Artificial Population

  • Consider a small, artificial population

Stratum   Unit   Income level
A          1         1
B          2         2
B          3         3
B          4         4
Average              2.5

  • We force hard-to-reach unit 1 into the sample.
  • Suppose we collect a sample, {1, 2}, say.

SLIDE 53

  • The average is:

$$\hat{y} = \frac{1 \cdot \frac{1}{1/1} + 2 \cdot \frac{1}{1/3}}{1 \cdot \frac{1}{1/1} + 1 \cdot \frac{1}{1/3}} = \frac{7}{4} = 1.75$$

and not:

$$\hat{y} = \frac{1 + 2}{2} = 1.50$$

Sample     Unweighted average   Weighted average
{1, 2}           1.50             7/4 = 1.75
{1, 3}           2.00            10/4 = 2.50
{1, 4}           2.50            13/4 = 3.25
Average          2.00                   2.50
                 wrong                  right
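The table can be verified by enumerating the three possible samples. The sketch assumes, as in the example, that unit 1 is selected with probability 1 and that one of the three units in stratum B is drawn at random (probability 1/3 each); each observation is weighted by the inverse of its selection probability.

```python
from fractions import Fraction

income = {1: 1, 2: 2, 3: 3, 4: 4}
# Unit 1 (stratum A) is forced into the sample; one unit of stratum B is drawn.
selection_prob = {1: Fraction(1), 2: Fraction(1, 3), 3: Fraction(1, 3), 4: Fraction(1, 3)}


def unweighted_mean(sample):
    """Plain sample average, ignoring the design."""
    return Fraction(sum(income[u] for u in sample), len(sample))


def weighted_mean(sample):
    """Average with each unit weighted by 1 / (selection probability)."""
    numerator = sum(income[u] / selection_prob[u] for u in sample)
    denominator = sum(1 / selection_prob[u] for u in sample)
    return numerator / denominator


samples = [(1, 2), (1, 3), (1, 4)]  # all possible samples, equally likely
expected_unweighted = sum(unweighted_mean(s) for s in samples) / len(samples)
expected_weighted = sum(weighted_mean(s) for s in samples) / len(samples)
print(expected_unweighted, expected_weighted)
```

Averaged over all possible samples, the unweighted estimator gives 2 while the weighted one gives 5/2, which equals the population average of 2.5.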

SLIDE 54

Weighting and Multi-Stage Sampling / Clustering

  • In multi-stage sampling and clustering, subunits may be selected with differential probabilities. Example: Household members in the Belgian HIS.
  • In addition, entire clusters may be selected with variable probabilities. Example: Towns in the Belgian HIS. Example: Hard-to-reach groups
  • Just like in the stratified case, this needs to be taken into account via weights.

SLIDE 55

Example

  • Consider a selection of households from a population with two household types:
    ⊲ 1000 2-person households of married couples.
    ⊲ 1000 1-person households of singles.
  • Obviously:
    ⊲ 50% of the households consist of married couples.
    ⊲ 66.7% of the people are married.
  • Select a sample of 100 households, and then one person per household.
  • We expect, on average, in the sample:
    ⊲ 50 married persons.
    ⊲ 50 unmarried persons.

SLIDE 56

  • If the survey question is: “Are you married?” then a naive estimate would produce: $\hat{z} = 50\%$ are married, which is wrong.
  • Weighting the answers by the inverse of the relative selection probabilities:

$$\hat{z}_1 = \frac{50 \cdot 1 \cdot \frac{1}{1/2} + 50 \cdot 0 \cdot \frac{1}{1/1}}{50 \cdot \frac{1}{1/2} + 50 \cdot \frac{1}{1/1}} = \frac{100}{150} = 0.667$$

  • In case we want to assess the proportion of married households, then no weighting is necessary:

$$\hat{z}_2 = \frac{50 \cdot 1 + 50 \cdot 0}{50 + 50} = \frac{50}{100} = 0.5$$
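Both estimates can be computed from the expected sample of 100 respondents. In the sketch below, a person in a 2-person household is selected within the household with probability 1/2, a single with probability 1; the function name is illustrative.

```python
def weighted_proportion(answers, selection_probs):
    """Proportion of 'yes' answers, weighting each by 1 / selection probability."""
    numerator = sum(a / p for a, p in zip(answers, selection_probs))
    denominator = sum(1 / p for p in selection_probs)
    return numerator / denominator


# Expected sample: 50 married respondents (answer 1), 50 singles (answer 0).
answers = [1] * 50 + [0] * 50
within_household_prob = [0.5] * 50 + [1.0] * 50

naive = sum(answers) / len(answers)  # proportion of married households
persons = weighted_proportion(answers, within_household_prob)
print(naive, round(persons, 3))
```

The unweighted figure answers the household-level question, while the weighted figure answers the person-level question. Which one is right depends on the estimand.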

SLIDE 57

Benchmark Estimation

  • SRS, SYS, STRAT, and MSS are sampling methods.
  • Benchmark estimation is an (enhanced) estimation method, in two steps:
    ⊲ Step 1: Estimate a population quantity using a conventional method (e.g., SRS).
    ⊲ Step 2: Construct a second estimator, using the first estimator and a so-called benchmark as input.
  • Variations on the theme are endless

SLIDE 58

Example of Benchmark Estimation

  • We conduct a small survey in a difficult-to-reach group and find:
    ⊲ Of the male adults, 60% are unemployed
    ⊲ Of the female adults, 90% are unemployed
    ⊲ We have 70% females in the sample
    ⊲ A sample-based average would be

$$\hat{\pi}_{\text{unemployed}} = 0.70 \times 0.90 + 0.30 \times 0.60 = 0.81$$

    ⊲ However, more reliable survey or census data tell us that about 50% of the adults in this population are female:

$$\hat{\pi}_{\text{unemployed,adjusted}} = 0.50 \times 0.90 + 0.50 \times 0.60 = 0.75$$

  • The sex ratio in the population is used as a benchmark
  • In fact, this is still similar to post-stratification

SLIDE 59

  • Let also the average income be surveyed twice:
    ⊲ From the sample: €1000
    ⊲ From the more reliable source: €1100
    ⊲ Adjustment:

$$\hat{\pi}_{\text{unemployed,benchmark}} = 0.81 \times \frac{1000}{1100} = 0.74$$
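The three unemployment figures of this example can be reproduced in a few lines. The sketch only restates the arithmetic of the slides: a sample-based average, a re-weighting to the benchmark sex distribution, and a ratio adjustment using the two income figures. Variable and function names are illustrative.

```python
unemployment_rate = {"female": 0.90, "male": 0.60}
sample_share = {"female": 0.70, "male": 0.30}     # composition of the sample
benchmark_share = {"female": 0.50, "male": 0.50}  # from the more reliable source


def combine(rates, shares):
    """Overall rate as a share-weighted average of the group rates."""
    return sum(rates[g] * shares[g] for g in rates)


sample_based = combine(unemployment_rate, sample_share)
adjusted = combine(unemployment_rate, benchmark_share)

# Ratio adjustment with average income: sample value over benchmark value.
income_sample, income_benchmark = 1000, 1100
ratio_adjusted = sample_based * income_sample / income_benchmark

print(round(sample_based, 2), round(adjusted, 2), round(ratio_adjusted, 2))
```

The first adjustment replaces the sample composition by the benchmark composition, as in post-stratification; the second rescales the estimate by the ratio of the sample value to the benchmark value of an auxiliary variable.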

slide-60
SLIDE 60

Non-response / Missing Data / Incompleteness / . . .

  • Three types in the Belgian Health Interview Survey (BHIS):

⊲ Household level
⊲ Individual level
⊲ Item level

  • Household level: reasons:

non-participation = non-contactable + non-response

slide-61
SLIDE 61

Methodological Issues

MCAR ⊆ MAR ⊆ MNAR

Mechanism  Missingness                                   Methods
MCAR       depends at most on covariates                 CC? AC? imputation?
MAR        depends on covariates and observed outcomes   direct likelihood! direct Bayes! multiple imputation (MI)! weighted GEE!
MNAR       further depends on unobserved outcomes        joint model!? sensitivity analysis?!

slide-62
SLIDE 62

CC, AC, and Simple Imputation

MCAR

Complete case analysis: ⇒ delete incomplete subjects

  • Standard statistical software
  • Loss of information
  • Impact on precision and power
  • Missingness ≠ MCAR ⇒ bias
  • (Case-wise deletion)

Available case analysis: ⇒ delete incomplete subjects per variable(s) studied

  • ± Standard statistical software
  • Loss of information
  • Impact on precision and power
  • Missingness ≠ MCAR ⇒ bias
  • (List-wise deletion)

Simple imputation: ⇒ impute missing values

  • Standard statistical software
  • Increase of information
  • Often unrealistic assumptions

  • Usually bias

slide-63
SLIDE 63

Original data

Obs  block  x   z   y
1    1      1   8   1
2    2      3   1   2
3    3      4   6   3
4    4      6   10  4
5    5      7   4   5
6    6      8   3   6
7    7      10  7   (7)
8    8      11  11  (8)

Complete case analysis

Obs  block  x   z   y
1    1      1   8   1
2    2      3   1   2
3    3      4   6   3
4    4      6   10  4
5    5      7   4   5
6    6      8   3   6

Available case analysis

Obs  block  x   z   y
1    1      1   8   1
2    2      3   1   2
3    3      4   6   3
4    4      6   10  4
5    5      7   4   5
6    6      8   3   6
7    7      10  7   .
8    8      11  11  .

Mean imputation

Obs  block  x   z   y
1    1      1   8   1.0
2    2      3   1   2.0
3    3      4   6   3.0
4    4      6   10  4.0
5    5      7   4   5.0
6    6      8   3   6.0
7    7      10  7   3.5
8    8      11  11  3.5
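The three simple strategies can be reproduced on the toy data with a short sketch. The tuple layout `(obs, x, z, y)` and the variable names are assumptions for illustration; `None` marks the two unobserved y values.

```python
from statistics import mean

# (obs, x, z, y) from the toy data set; None marks an unobserved y.
data = [
    (1, 1, 8, 1), (2, 3, 1, 2), (3, 4, 6, 3), (4, 6, 10, 4),
    (5, 7, 4, 5), (6, 8, 3, 6), (7, 10, 7, None), (8, 11, 11, None),
]

# Complete case analysis: drop every subject with any missing value.
complete_cases = [row for row in data if None not in row]

# Available case analysis: per variable, use all subjects observed on that variable.
mean_x_cc = mean(row[1] for row in complete_cases)   # based on 6 subjects
mean_x_ac = mean(row[1] for row in data)             # based on all 8 subjects
mean_y_ac = mean(row[3] for row in data if row[3] is not None)

# Mean imputation: replace each missing y by the mean of the observed y values.
imputed = [(o, x, z, mean_y_ac if y is None else y) for o, x, z, y in data]

print(len(complete_cases), mean_x_cc, mean_x_ac, mean_y_ac)
```

In this toy example the mean of x differs between the complete-case and available-case analyses because the two deleted subjects have the largest x values.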

slide-64
SLIDE 64

Multiple Imputation

  • Valid under MAR
  • Useful next to direct likelihood
  • Three steps:
  • 1. The missing values are filled in M times ⇒ M complete data sets

  • 2. The M complete data sets are analyzed by using standard procedures
  • 3. The results from the M analyses are combined into a single inference
  • Rubin (1987), Rubin and Schenker (1986), Little and Rubin (1987)
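Step 3 is usually carried out with Rubin's combination rules: the pooled estimate is the average of the M estimates, and the total variance adds the average within-imputation variance to (1 + 1/M) times the between-imputation variance. The sketch below is a minimal illustration; the function name `pool_rubin` and the input numbers are hypothetical and do not come from the survey.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical results from M = 5 imputed data sets (illustrative numbers only).
est = [0.50, 0.52, 0.48, 0.51, 0.49]
se = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled, pooled_se = pool_rubin(est, [s ** 2 for s in se])
print(round(pooled, 3), round(pooled_se, 4))
```

The pooled standard error exceeds each completed-data standard error, which reflects the extra uncertainty due to the missing values.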

slide-65
SLIDE 65

Use of MI in Practice

  • Many analyses of the same incomplete set of data
  • A combination of missing outcomes and missing covariates
  • MI can be combined with classical GEE
  • MI in SAS:

Imputation Task: PROC MI
↓
Analysis Task: PROC “MYFAVORITE”
↓
Inference Task: PROC MIANALYZE

slide-66
SLIDE 66

Incompleteness in the Belgian Health Interview Survey

  • Household level

⊲ Households with which no interview was realized
⊲ Households which explicitly refused
⊲ Households which could not be contacted

  • Individual level

⊲ Individual refuses to participate, in spite of HH agreement

  • Item level

⊲ A participating respondent leaves some questions unanswered

slide-67
SLIDE 67

Design Measures Towards Missing Data

  • Increased number of sampled households (HHs)
  • Replacement scheme for drop-outs

⊲ HHs sampled in clusters of 4
⊲ Oversampling of clusters

  • Proxy interviews
  • Invitation letter
  • Multiple attempts to contact a HH
  • Coding of the reasons for drop-outs

slide-68
SLIDE 68

Missing Data: HH-Level

  • 35,023 HHs sampled
  • Contact attempted with 11,568 HHs
  • Different reasons for a HH non-interview:

Type                   Description                                   #     %
NP: Non-Participation  no interview, regardless of reason            6904  59.7%
NA: Non-Availability   no interview due to difficulty in contacting  3546  30.7%
NR: Non-Response       no interview due to explicit HH refusal       3358  29.0%

slide-69
SLIDE 69

Individual-Level Missingness

  • 10,339 HH members selected for interview.
  • Similar reasons for missingness at this level:

Type                   Description               #    %     Proxy
NP: Non-Participation  no personal interview     785  7.6%  671
NA: Non-Availability   difficulty in contacting  408  3.9%  408
NR: Non-Response       explicit refusal          210  2.0%  96

slide-70
SLIDE 70

Item-Level Missingness

  • Only non-response
  • More than 1000 variables obtained for the interviewed individuals.
  • Frequency of NR depending on the item (question):

⊲ BMI: 2.1%
⊲ VOEG: 3.7%
⊲ Maximum observed: 11%

  • May be substantial when several variables are considered jointly.

slide-71
SLIDE 71

Factors Influencing Item-Level Missingness

  • Different across regions.
  • Missingness increases with HH size.
  • Effect of the age of the reference person.
  • Effect of nationality of reference person.
  • Effect of gender of reference person.

slide-72
SLIDE 72

Multiple Imputation for LNBMI

Effect        Level          AC (7272 obs.)   MI (8564 obs.)
Region        Brussels       (ref.)           (ref.)
              Flanders       0.007 (0.006)    0.009 (0.006)
              Wallonia       0.023 (0.007)    0.027 (0.006)
Gender        Male           (ref.)           (ref.)
              Female         −0.050 (0.004)   −0.054 (0.003)
Education     Primary        (ref.)           (ref.)
              Secondary      −0.011 (0.005)   −0.013 (0.004)
              Higher         −0.046 (0.005)   −0.045 (0.005)
Income level  < 40,000       (ref.)           (ref.)
              40,000–60,000  0.008 (0.004)    0.006 (0.004)
              > 60,000       0.003 (0.006)    −0.001 (0.006)
Smoking       Non-smoker     (ref.)           (ref.)
              Smoker         0.003 (0.004)    0.004 (0.004)
Age           Age-group      0.030 (0.001)    0.001 (0.001)

slide-73
SLIDE 73

Multiple Imputation for LNVOEG

Effect        Level          AC (7389 obs.)   MI (8564 obs.)
Region        Brussels       (ref.)           (ref.)
              Flanders       −0.264 (0.032)   −0.268 (0.031)
              Wallonia       0.015 (0.033)    0.002 (0.033)
Gender        Male           (ref.)           (ref.)
              Female         0.296 (0.019)    0.284 (0.018)
Education     Primary        (ref.)           (ref.)
              Secondary      −0.072 (0.023)   −0.069 (0.023)
              Higher         −0.099 (0.025)   −0.088 (0.025)
Income level  < 40,000       (ref.)           (ref.)
              40,000–60,000  −0.049 (0.021)   −0.039 (0.021)
              > 60,000       −0.107 (0.030)   −0.094 (0.034)
Smoking       Non-smoker     (ref.)           (ref.)
              Smoker         0.238 (0.019)    0.220 (0.019)
Age           Age-group      0.051 (0.006)    0.050 (0.005)

slide-74
SLIDE 74

The Slovenian Plebiscite

Rubin, Stern, and Vehovar (1995)

  • Slovenian Public Opinion (SPO) Survey
  • Four weeks prior to decisive plebiscite
  • Three questions:
  • 1. Are you in favor of Slovenian independence?
  • 2. Are you in favor of Slovenia’s secession from Yugoslavia?
  • 3. Will you attend the plebiscite?
  • Political decision: ABSENCE≡NO
  • Primary Estimand: θ: Proportion in favor of independence

slide-75
SLIDE 75
  • Slovenian Public Opinion Survey Data:

                         Independence
Secession  Attendance    Yes    No    ∗
Yes        Yes           1191   8     21
           No            8      0     4
           ∗             107    3     9
No         Yes           158    68    29
           No            7      14    3
           ∗             18     43    31
∗          Yes           90     2     109
           No            1      2     25
           ∗             19     8     96

slide-76
SLIDE 76

Slovenian Plebiscite ←→ Slovenian Public Opinion Survey

Plebiscite: θ = 0.885

Estimator            $\hat{\theta}$
Pessimistic bound    0.694
Optimistic bound     0.904
Complete cases       0.928 ?
Available cases      0.929 ?
MAR (2 questions)    0.892
MAR (3 questions)    0.883
MNAR                 0.782
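Three of the simple estimators can be recomputed from the survey counts with a short sketch. It assumes that θ refers to answering "yes" on both independence and attendance (absence counts as "no"). The dictionary layout is hypothetical, and the zero cell (secession yes, attendance no, independence no) is an assumption chosen so that the margins agree with the collapsed counts 1439, 78, 16, 16 reported for the observed data.

```python
# counts[secession][attendance] = (independence yes, independence no, independence missing);
# "*" denotes a missing answer. The 0 in the second row is an assumption (see lead-in).
counts = {
    "yes": {"yes": (1191, 8, 21), "no": (8, 0, 4), "*": (107, 3, 9)},
    "no":  {"yes": (158, 68, 29), "no": (7, 14, 3), "*": (18, 43, 31)},
    "*":   {"yes": (90, 2, 109), "no": (1, 2, 25), "*": (19, 8, 96)},
}

total = sum(sum(cell) for row in counts.values() for cell in row.values())

# "Yes" on independence and attendance, summed over the secession answers.
yes_yes = sum(counts[s]["yes"][0] for s in counts)

# Pessimistic bound: only the definite yes/yes answers count as "yes".
pessimistic = yes_yes / total

# Complete cases: respondents who answered all three questions.
cc_total = sum(counts[s][a][i] for s in ("yes", "no") for a in ("yes", "no") for i in (0, 1))
cc_yes = counts["yes"]["yes"][0] + counts["no"]["yes"][0]
complete_cases = cc_yes / cc_total

# Available cases: respondents who answered both independence and attendance.
ac_total = sum(counts[s][a][i] for s in counts for a in ("yes", "no") for i in (0, 1))
available_cases = yes_yes / ac_total

print(total, round(pessimistic, 3), round(complete_cases, 3), round(available_cases, 3))
```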

slide-77
SLIDE 77

Slovenian Public Opinion Survey: An MNAR Model Family

Baker, Rosenberger, and DerSimonian (1992)

  • Counts: $Y_{r_1 r_2 jk}$
  • Questions: $j, k = 1, 2$
  • Non-response: $r_1, r_2 = 0, 1$

$E(Y_{11jk}) = m_{jk}$
$E(Y_{10jk}) = m_{jk}\beta_{jk}$
$E(Y_{01jk}) = m_{jk}\alpha_{jk}$
$E(Y_{00jk}) = m_{jk}\alpha_{jk}\beta_{jk}\gamma_{jk}$

⊲ $\alpha_{jk}$: non-response on independence question
⊲ $\beta_{jk}$: non-response on attendance question
⊲ $\gamma_{jk}$: interaction between both non-response indicators

slide-78
SLIDE 78

Slovenian Public Opinion Survey: Identifiable Models

Model  Structure    d.f.  loglik     θ      C.I.
BRD1   (α, β)       6     −2495.29   0.892  [0.878; 0.906]
BRD2   (α, β_j)     7     −2467.43   0.884  [0.869; 0.900]
BRD3   (α_k, β)     7     −2463.10   0.881  [0.866; 0.897]
BRD4   (α, β_k)     7     −2467.43   0.765  [0.674; 0.856]
BRD5   (α_j, β)     7     −2463.10   0.844  [0.806; 0.882]
BRD6   (α_j, β_j)   8     −2431.06   0.819  [0.788; 0.849]
BRD7   (α_k, β_k)   8     −2431.06   0.764  [0.697; 0.832]
BRD8   (α_j, β_k)   8     −2431.06   0.741  [0.657; 0.826]
BRD9   (α_k, β_j)   8     −2431.06   0.867  [0.851; 0.884]

slide-79
SLIDE 79

Slovenian Public Opinion Survey: An MNAR “Interval”

Plebiscite: θ = 0.885

Estimator                   $\hat{\theta}$
[Pessimistic; optimistic]   [0.694; 0.904]
Complete cases              0.928
Available cases             0.929
MAR (2 questions)           0.892
MAR (3 questions)           0.883
MNAR                        0.782
MNAR “interval”             [0.741; 0.892]

slide-80
SLIDE 80

Slovenian Public Opinion Survey: Interval of Ignorance

Model     Structure       d.f.  loglik     θ               C.I.
BRD1      (α, β)          6     −2495.29   0.892           [0.878; 0.906]
BRD2      (α, β_j)        7     −2467.43   0.884           [0.869; 0.900]
BRD3      (α_k, β)        7     −2463.10   0.881           [0.866; 0.897]
BRD4      (α, β_k)        7     −2467.43   0.765           [0.674; 0.856]
BRD5      (α_j, β)        7     −2463.10   0.844           [0.806; 0.882]
BRD6      (α_j, β_j)      8     −2431.06   0.819           [0.788; 0.849]
BRD7      (α_k, β_k)      8     −2431.06   0.764           [0.697; 0.832]
BRD8      (α_j, β_k)      8     −2431.06   0.741           [0.657; 0.826]
BRD9      (α_k, β_j)      8     −2431.06   0.867           [0.851; 0.884]
Model 10  (α_k, β_jk)     9     −2431.06   [0.762; 0.893]  [0.744; 0.907]
Model 11  (α_jk, β_j)     9     −2431.06   [0.766; 0.883]  [0.715; 0.920]
Model 12  (α_jk, β_jk)    10    −2431.06   [0.694; 0.904]

slide-81
SLIDE 81

Slovenian Public Opinion Survey: Counterpart Added

Model     Structure       d.f.  loglik     θ               C.I.            $\hat{\theta}_{\text{MAR}}$
BRD1      (α, β)          6     −2495.29   0.892           [0.878; 0.906]  0.8920
BRD2      (α, β_j)        7     −2467.43   0.884           [0.869; 0.900]  0.8915
BRD3      (α_k, β)        7     −2463.10   0.881           [0.866; 0.897]  0.8915
BRD4      (α, β_k)        7     −2467.43   0.765           [0.674; 0.856]  0.8915
BRD5      (α_j, β)        7     −2463.10   0.844           [0.806; 0.882]  0.8915
BRD6      (α_j, β_j)      8     −2431.06   0.819           [0.788; 0.849]  0.8919
BRD7      (α_k, β_k)      8     −2431.06   0.764           [0.697; 0.832]  0.8919
BRD8      (α_j, β_k)      8     −2431.06   0.741           [0.657; 0.826]  0.8919
BRD9      (α_k, β_j)      8     −2431.06   0.867           [0.851; 0.884]  0.8919
Model 10  (α_k, β_jk)     9     −2431.06   [0.762; 0.893]  [0.744; 0.907]  0.8919
Model 11  (α_jk, β_j)     9     −2431.06   [0.766; 0.883]  [0.715; 0.920]  0.8919
Model 12  (α_jk, β_jk)    10    −2431.06   [0.694; 0.904]                  0.8919

slide-82
SLIDE 82

Slovenian Public Opinion Survey: Incomplete Data

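Counts are grouped by response pattern: both questions observed | independence missing | attendance missing | both missing.

Observed ≡ BRD7 ≡ BRD7(MAR) ≡ BRD9 ≡ BRD9(MAR):  1439 78 16 16 | 159 32 | 144 54 | 136
BRD1 ≡ BRD1(MAR):  1381.6 101.7 24.2 41.4 | 182.9 8.1 | 179.7 18.3 | 136.0
BRD2 ≡ BRD2(MAR):  1402.2 108.9 15.6 22.3 | 159.0 32.0 | 181.2 16.8 | 136.0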

slide-83
SLIDE 83

Slovenian Public Opinion Survey: Complete-data Prediction

One predicted 2×2 table per response pattern, in the order: both questions observed | independence missing | attendance missing | both missing.

BRD1 ≡ BRD1(MAR):        1381.6 101.7 24.2 41.4 | 170.4 12.5 3.0 5.1 | 176.6 13.0 3.1 5.3 | 121.3 9.0 2.1 3.6
BRD2:                    1402.2 108.9 15.6 22.3 | 147.5 11.5 13.2 18.8 | 179.2 13.9 2.0 2.9 | 105.0 8.2 9.4 13.4
BRD2(MAR):               1402.2 108.9 15.6 22.3 | 147.7 11.3 13.3 18.7 | 177.9 12.5 3.3 4.3 | 121.2 9.3 2.3 3.2
BRD7:                    1439 78 16 16 | 3.2 155.8 0.0 32.0 | 142.4 44.8 1.6 9.2 | 0.4 112.5 0.0 23.1
BRD9:                    1439 78 16 16 | 150.8 8.2 16.0 16.0 | 142.4 44.8 1.6 9.2 | 66.8 21.0 7.1 41.1
BRD7(MAR) ≡ BRD9(MAR):   1439 78 16 18 | 148.1 10.9 11.8 20.2 | 141.5 38.4 2.5 15.6 | 121.3 9.0 2.1 3.6

slide-84
SLIDE 84

Slovenian Public Opinion Survey: Collapsed (Marginalized) Predictions

BRD1 ≡ BRD1(MAR):        1849.9 136.2 32.4 55.4  ⇒  $\hat{\theta} = 89.2\%$
BRD2:                    1833.9 142.5 40.2 57.5  ⇒  $\hat{\theta} = 88.4\%$
BRD2(MAR):               1849.0 142.0 34.5 48.5  ⇒  $\hat{\theta} = 89.2\%$
BRD7:                    1585.0 391.1 17.6 80.3  ⇒  $\hat{\theta} = 76.4\%$
BRD9:                    1799.7 152.0 40.7 82.3  ⇒  $\hat{\theta} = 86.7\%$
BRD7(MAR) ≡ BRD9(MAR):   1849.9 136.3 30.4 57.4  ⇒  $\hat{\theta} = 89.2\%$
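The collapsed predictions and the resulting estimates can be traced with a small sketch: the four predicted 2×2 tables of a model are summed cell by cell, and the estimate is the share of the first cell. The sketch assumes the first cell corresponds to "yes" on both independence and attendance; the function names are illustrative, and only BRD1 and BRD7 are shown.

```python
def collapse(tables):
    """Sum the 2x2 tables (flattened to 4 cells) predicted for each response pattern."""
    return [round(sum(cells), 1) for cells in zip(*tables)]

def theta_hat(collapsed):
    """Share of the first cell in the collapsed table."""
    return collapsed[0] / sum(collapsed)

# Complete-data predictions per response pattern, as listed for BRD1 and BRD7.
brd1 = [(1381.6, 101.7, 24.2, 41.4), (170.4, 12.5, 3.0, 5.1),
        (176.6, 13.0, 3.1, 5.3), (121.3, 9.0, 2.1, 3.6)]
brd7 = [(1439, 78, 16, 16), (3.2, 155.8, 0.0, 32.0),
        (142.4, 44.8, 1.6, 9.2), (0.4, 112.5, 0.0, 23.1)]

results = {}
for name, tables in (("BRD1", brd1), ("BRD7", brd7)):
    marginal = collapse(tables)
    results[name] = (marginal, round(theta_hat(marginal), 3))
    print(name, marginal, results[name][1])
```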

slide-85
SLIDE 85

Sensitive Questions

  • “Do you cheat on your partner?”
  • Procedure:

⊲ The respondent tosses a coin, unbeknownst to the interviewer:
⊲ If heads: the response is “yes”
⊲ If tails: the response is the truthful one

  • The estimator for the proportion of genuine “yes” answers is:
  • $\hat{\pi} = \dfrac{2p - n}{n}$, with

⊲ n: number of respondents
⊲ p: number of respondents that say “yes”

slide-86
SLIDE 86
  • Example:

⊲ n = 100
⊲ p = 80
⊲ The estimate:

  • $\hat{\pi} = \dfrac{2 \times 80 - 100}{100} = 0.60$ (s.e. 0.08)
  • Price to pay: with a direct question, the standard error would be 0.05!
  • We trade precision for a more faithful response (i.e., less bias)
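The estimate and both standard errors can be reproduced with a minimal sketch. It assumes a fair coin and a binomial model for the number of "yes" answers; the function name `forced_yes_estimate` is hypothetical.

```python
from math import sqrt

def forced_yes_estimate(n, yes):
    """Estimate the true 'yes' proportion when a fair coin forces 'yes' on heads."""
    lam = yes / n                        # observed proportion of "yes" answers
    pi_hat = 2 * lam - 1                 # same as (2p - n) / n
    se = 2 * sqrt(lam * (1 - lam) / n)   # binomial s.e. of lam, scaled by 2
    return pi_hat, se

pi_hat, se = forced_yes_estimate(n=100, yes=80)

# Standard error if all 100 respondents answered a direct question truthfully.
se_direct = sqrt(pi_hat * (1 - pi_hat) / 100)
print(round(pi_hat, 2), round(se, 2), round(se_direct, 2))
```

The factor 2 in the standard error arises because only about half of the answers carry information about the true proportion.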

slide-87
SLIDE 87

Software Packages

  • SAS
  • STATA
  • SPSS
  • SUDAAN
  • ...

slide-88
SLIDE 88

Software for Design

  • Some software tools are constructed for design purposes.
  • The input data base is then the population or, stated more accurately, the sampling frame.
  • The output data base is then a sample selected from the input data base, taking 0, 1, or more design aspects into account.

  • SAS: PROC SURVEYSELECT
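To make the frame-in, sample-out idea concrete, here is a rough Python sketch of two selections: a simple random sample (no design aspect) and a stratified sample (one design aspect). It is not a translation of PROC SURVEYSELECT; the frame, the function names, and the sample sizes are all hypothetical.

```python
import random

def select_srs(frame, n, seed=None):
    """Simple random sample without replacement from a sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def select_stratified(frame, stratum_of, n_per_stratum, seed=None):
    """Stratified design: an SRS of fixed size within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {s: rng.sample(units, n_per_stratum) for s, units in strata.items()}

# Hypothetical frame: 1,000 household identifiers spread over three regions.
regions = ("Brussels", "Flanders", "Wallonia")
frame = [{"hh_id": i, "region": regions[i % 3]} for i in range(1000)]

srs = select_srs(frame, 50, seed=1)
strat = select_stratified(frame, lambda u: u["region"], 20, seed=1)
print(len(srs), {s: len(units) for s, units in strat.items()})
```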

slide-89
SLIDE 89

Software for Analysis

  • Not surprisingly, most software tools are geared towards analysis.
  • Several views can be taken:

Simple estimators versus model: Estimating a mean, total, or frequency ←→ Regression, ANOVA
Simple cross-sectional data structure versus complex data structure: Cross-sectional data ←→ Multivariate, multi-level, clustered, longitudinal data
To survey or not to survey: Non-survey data (or SRS) ←→ One or more survey-design aspects

slide-90
SLIDE 90

Further Topics

  • Network-based sampling
  • Respondent-driven sampling
  • Background of interviewers
  • Mixed-mode surveys
  • Follow-up survey on non-respondents
