Inferential Problems with Nonprobability Samples (Richard Valliant)



SLIDE 1

Inferential Problems with Nonprobability Samples

Richard Valliant

University of Michigan & University of Maryland

26 Feb 2016

(UMich & UMD) Ross-Royall Symposium 1 / 24

SLIDE 2

Types of samples

Probability v. Nonprobability samples

Goal in surveys is to use a sample to make estimates for an entire finite population: external validity
Probability samples became the touchstone in surveys after Neyman's (JRSS 1934) article
  ◮ design-based approach: model-free inference
  ◮ stratified sampling with Neyman allocation
  ◮ cluster sampling
  ◮ confidence intervals

Early failure of nonprobability sample

1936 Literary Digest: 2.3 million mail surveys to subscribers plus automobile and telephone owners
Predicted landslide win by Alf Landon over FDR

SLIDE 3

Types of samples

Nonprobability samples

Standard in experiments: no finite population
New sources of data
  ◮ Twitter, Facebook, Web-scraping
  ◮ Billion Prices Project @ MIT, http://bpp.mit.edu/
    ✎ Price indexes for 22 countries based on web-scraped data

Keiding & Louis (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. JRSS-A

Are these data good for anything?

SLIDE 6

Types of samples

Not all nonprobability samples are created equal

College sophomores in Psych 100
Mall intercepts
Volunteer samples, river samples, snowball samples
Probability samples with low response rates

Coalitions of the willing

AAPOR task force report on non-probability samples (2013)

SLIDE 9

Types of samples

Declining response rates

Pew Research response rates in typical telephone surveys dropped from 36% in 1997 to 9% in 2012 (Kohut et al. 2012)
With such low RRs, a sample initially selected randomly can hardly be called a probability sample
Low RRs raise the question of whether probability sampling is worthwhile, at least for some applications
  ◮ Non-probs are faster, cheaper
  ◮ No worse?

SLIDE 10

Types of samples

Polls that failed

✎ British parliamentary election May 2015

Party          Final   Ipsos/MORI       East Anglia/LSE/Durham U
                       (online panel)   (using poll aggregation)
Conservative   51%     36%              43%
Labour         36%     35%              41%

✎ Israeli March 2015 election (seats); online panels

Party           Final   Smith-/      TNS/    Maariv   Channel 1
                        Reshet Bet   Walla
Likud           30      21           23      21       25
Zionist Union   24      25           25      25       25

✎ Scottish independence referendum, Sep 2014

SLIDE 11

Types of samples

One that worked

Xbox gamers: 345,000 people surveyed in an opt-in poll for 45 days continuously before the 2012 US presidential election
Xboxers much different from overall electorate
18- to 29-year-olds were 65% of dataset, compared to 19% in national exit poll
93% male vs. 47% in electorate
Unadjusted data suggested landslide for Romney
Gelman et al. used multilevel regression and poststratification to get good estimates
Covariates: sex, race, age, education, state, party ID, political ideology, and candidate voted for in the 2008 presidential election

Wang, W., D. Rothschild, S. Goel, and A. Gelman. 2015. Forecasting Elections with Non-representative Polls. International Journal of Forecasting
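
To make the adjustment idea concrete, here is a minimal sketch of plain poststratification on a single variable (sex). It is a simplification for illustration, not the regression-based method of Wang et al. Only the 93% and 47% male shares come from the slide; the cell-level support rates are hypothetical.

```python
def poststratify(cell_means, pop_shares):
    """Weight each cell's sample mean by that cell's population share."""
    total_share = sum(pop_shares.values())
    return sum(cell_means[c] * pop_shares[c] for c in pop_shares) / total_share


# Shares of men and women in the opt-in sample and in the electorate (from the slide).
sample_shares = {"male": 0.93, "female": 0.07}
pop_shares = {"male": 0.47, "female": 0.53}

# Hypothetical support for one candidate within each cell (made-up numbers).
cell_means = {"male": 0.40, "female": 0.60}

# Unadjusted estimate: cells enter in proportion to their sample shares.
unadjusted = sum(cell_means[c] * sample_shares[c] for c in sample_shares)

# Poststratified estimate: cells enter in proportion to their population shares.
adjusted = poststratify(cell_means, pop_shares)

print(f"unadjusted = {unadjusted:.3f}, poststratified = {adjusted:.3f}")
```

With these made-up inputs the unadjusted figure is pulled toward the male cell; reweighting to the population shares moves it back. The adjustment helps only to the extent that people within a cell who joined the poll resemble those who did not.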

SLIDE 12

Inference problem

Universe & sample

[Figure: the universe $U$, showing the sample $s$, the covered part $F_c$, the potentially covered part $F_{pc}$, and the part not covered, $U - F$]

For example ...

$U$ = adult population
$F_{pc}$ = adults with internet access
$F_c$ = adults with internet access who visit some webpage(s)
$s$ = adults who volunteer for a panel

SLIDE 13

Inference problem

Ideas used in missing data literature

MCAR: Every unit has same probability of appearing in sample
MAR: Probability of appearing depends on covariates known for sample and nonsample cases
NINR: Probability of appearing depends on covariates and $y$'s

SLIDE 14

Inference problem

Table: Percentages of US households with Internet subscriptions; 2013 American Community Survey

                                           Percent of households
                                           with Internet subscription
Total households                           74
Race and Hispanic origin of householder
  White alone, non-Hispanic                77
  Black alone, non-Hispanic                61
  Asian alone, non-Hispanic                87
  Hispanic (of any race)                   67
Household income
  Less than $25,000                        48
  $25,000-$49,999                          69
  $50,000-$99,999                          85
  $100,000-$149,999                        93
  $150,000 and more                        95
Educational attainment of householder
  Less than high school graduate           44
  High school graduate                     63
  Some college or associate's degree       79
  Bachelor's degree or higher              90

SLIDE 15

Inference problem

Estimating a total

Pop total

$$ t = \sum_{s} y_i + \sum_{F_c - s} y_i + \sum_{F_{pc} - F_c} y_i + \sum_{U - F} y_i $$

To estimate $t$, predict 2nd, 3rd, and 4th sums
What if non-covered units are much different from covered?
  ◮ No 70+ year old Black women in a web panel
  ◮ No 18-21 year old Hispanic males in a phone survey
Difference from a bad probability sample with a good frame but low RR:
  ◮ No unit in $U - F$ or $F_{pc} - F_c$ had any chance of appearing in the sample

SLIDE 16

Inference problem

Full pop vs. Domains

If domain is completely or mostly in the uncovered part ($U - F$, $F_{pc} - F_c$), then direct domain estimates not possible
  ◮ Small area approach where $n_D = 0$ might be tried
Full pop estimates may be OK if uncovered are "like" covered

SLIDE 17

Methods of Inference Quasi-randomization

Quasi-randomization

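Model probability of appearing in sample

$$
\begin{aligned}
\Pr(i \in s) = {} & \Pr(\text{has Internet}) \\
& \times \Pr(\text{visits webpage} \mid \text{Internet}) \\
& \times \Pr(\text{volunteers for panel} \mid \text{Internet, visits webpage}) \\
& \times \Pr(\text{participates in survey} \mid \text{Internet, visits webpage, volunteers})
\end{aligned}
$$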
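
As a small numerical illustration of this factorization, the sketch below multiplies the four conditional probabilities for one hypothetical person. The 0.74 Internet figure echoes the household subscription rate in the ACS table; the other three probabilities are invented for the example.

```python
def inclusion_probability(p_internet, p_visit, p_volunteer, p_participate):
    """Pr(i in s) as the product of the four conditional stages."""
    return p_internet * p_visit * p_volunteer * p_participate


# Hypothetical stage probabilities for one person (illustrative values only).
p = inclusion_probability(
    p_internet=0.74,     # has Internet
    p_visit=0.10,        # visits the recruiting webpage, given Internet
    p_volunteer=0.05,    # volunteers for the panel, given a visit
    p_participate=0.50,  # takes part in the survey, given volunteering
)
print(f"Pr(i in s) = {p:.5f}")
```

If any single stage has probability zero for a person, the product is zero and no weight can represent that person.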

SLIDE 18

Methods of Inference Quasi-randomization

Reference sample

Select a probability sample from a frame with good coverage
Combine probability and non-probability samples together
Estimate probability of being in non-probability sample using logistic regression (or similar)
Use inverse probability as a Horvitz-Thompson-like weight
What does this probability mean?
In a volunteer sample, there are people who would never visit the recruiting webpage or never volunteer if they did visit
The probability has no relative frequency interpretation
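
The steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the procedure of any particular paper: there is one binary covariate, the logistic regression is fitted with a hand-written Newton-Raphson routine (`fit_logistic`, a hypothetical helper), reference cases carry their design weights while volunteer cases get weight 1 (one of several conventions), and all data are made up.

```python
import math


def fit_logistic(x, z, w, iters=25):
    """Weighted logistic regression of z on (1, x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi, wi in zip(x, z, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += wi * (zi - p)           # score for the intercept
            g1 += wi * (zi - p) * xi      # score for the slope
            v = wi * p * (1.0 - p)
            h00 += v                      # information matrix entries
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: x = 1 for "young", 0 otherwise.
nonprob_x = [1] * 8 + [0] * 2      # volunteer sample, mostly young
ref_x = [1, 1, 0, 0]               # reference probability sample
ref_w = [25.0] * 4                 # its design weights

# Stack the two samples; z = 1 marks membership in the volunteer sample.
x = nonprob_x + ref_x
z = [1] * len(nonprob_x) + [0] * len(ref_x)
w = [1.0] * len(nonprob_x) + ref_w

b0, b1 = fit_logistic(x, z, w)


def propensity(xi):
    """Estimated probability of being in the volunteer sample."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


# Inverse estimated probability used as a Horvitz-Thompson-like weight.
pseudo_weights = [1.0 / propensity(xi) for xi in nonprob_x]

raw_young = sum(nonprob_x) / len(nonprob_x)
weighted_young = sum(
    wi for wi, xi in zip(pseudo_weights, nonprob_x) if xi == 1
) / sum(pseudo_weights)
print(f"young share: raw {raw_young:.2f}, weighted {weighted_young:.3f}")
```

In this toy case the weights pull the share of young respondents from 0.80 toward the 0.50 implied by the reference sample. The caution on the slide still applies: the fitted value is a modelling device, not a probability with a relative frequency interpretation.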

SLIDE 21

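Methods of Inference Model for y

Superpopulation model

Use a model to predict the value for each nonsample unit
Linear model: $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$
If this model holds, then

$$
\hat{t} = \sum_{s} y_i + \sum_{F_c - s} \hat{y}_i + \sum_{F_{pc} - F_c} \hat{y}_i + \sum_{U - F} \hat{y}_i
= \sum_{s} y_i + \mathbf{t}_{(U-s),x}^T \hat{\boldsymbol{\beta}}
\doteq \mathbf{t}_{Ux}^T \hat{\boldsymbol{\beta}}, \qquad \hat{y}_i = \mathbf{x}_i^T \hat{\boldsymbol{\beta}}
$$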
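
A minimal numerical sketch of the prediction estimator follows, assuming a linear model with an intercept and one covariate fitted by ordinary least squares. The sample values and the population totals `N` and `T_x` are hypothetical, and `ols_simple` is an illustrative helper.

```python
def ols_simple(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


# Hypothetical sample data.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]

# Hypothetical population totals of the covariates (1, x): t_Ux = (N, T_x).
N, T_x = 10, 30.0

b0, b1 = ols_simple(x_s, y_s)
n = len(x_s)

# Observed sample sum plus predictions for every nonsample unit.
t_hat = sum(y_s) + (N - n) * b0 + (T_x - sum(x_s)) * b1

# Shortcut form: population covariate totals times the fitted coefficients.
t_hat_approx = N * b0 + T_x * b1

print(f"t_hat = {t_hat:.2f}, t_Ux' beta_hat = {t_hat_approx:.2f}")
```

Because this fit includes an intercept, the sample residuals sum to zero and the two forms agree exactly here; in general they are only approximately equal.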

SLIDE 22

Methods of Inference Model for y

Prediction Theory literature

Royall (BMKA 1976)
Royall (AJE 1976)
Royall (JASA 1976)
Royall & Pfeffermann (BMKA 1982)
Royall (Handbook of Stat 1988)
Valliant, Dorfman, Royall (2000). Finite Population Sampling and Inference: A Prediction Approach

SLIDE 23

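Methods of Inference Model for y

Unit-level weights

Prediction estimator does lead to weights

$$
w_i = 1 + \mathbf{t}_{(U-s),x}^T \left( \mathbf{X}_s^T \mathbf{X}_s \right)^{-1} \mathbf{x}_i
\doteq \mathbf{t}_{Ux}^T \left( \mathbf{X}_s^T \mathbf{X}_s \right)^{-1} \mathbf{x}_i
$$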
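
The weight formula can be checked numerically. The sketch below assumes a model with an intercept and one covariate, so $\mathbf{X}_s^T \mathbf{X}_s$ is 2 x 2 and can be inverted by hand. The data and the population totals `N` and `T_x` are hypothetical, and `prediction_weights` is an illustrative helper.

```python
def prediction_weights(x_s, N, T_x):
    """w_i = 1 + t_{(U-s),x}' (X_s' X_s)^{-1} x_i, with x_i = (1, x_i)."""
    n = len(x_s)
    sx = sum(x_s)
    sxx = sum(xi * xi for xi in x_s)
    det = n * sxx - sx * sx
    # Inverse of X_s' X_s = [[n, sx], [sx, sxx]].
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    # Covariate totals over the nonsample units.
    t0, t1 = N - n, T_x - sx
    # Row vector t' (X_s' X_s)^{-1}.
    a0 = t0 * inv[0][0] + t1 * inv[1][0]
    a1 = t0 * inv[0][1] + t1 * inv[1][1]
    return [1.0 + a0 + a1 * xi for xi in x_s]


# Hypothetical sample and population totals.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]
N, T_x = 10, 30.0

weights = prediction_weights(x_s, N, T_x)
t_hat = sum(wi * yi for wi, yi in zip(weights, y_s))

print("weights:", [round(wi, 3) for wi in weights])
print(f"sum of weights = {sum(weights):.1f}, weighted total = {t_hat:.2f}")
```

With these toy totals the weights happen to equal the $x_i$ values, they sum to `N`, and the weighted total of $y$ matches what fitting the regression and predicting the nonsample units would give. The same weights can be applied to any $y$, which is why one set of covariates is chosen to serve many variables.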

SLIDE 24

Methods of Inference Model for y

y's & Covariates

If $y$ is binary, a linear model is being used to predict a 0-1 variable
  ◮ Done routinely in surveys without thinking explicitly about a model
Every $y$ may have a different model $\Rightarrow$ pick a set of $x$'s good for many $y$'s
  ◮ Same thinking as done for GREG and other calibration estimators
Undercoverage: use $x$'s associated with coverage
  ◮ Also done routinely in surveys

SLIDE 25

Methods of Inference Model for y

Modeling considerations

Good modeling should consider how to predict $y$'s and how to correct for coverage errors
Covariate selection: LASSO, CART
Covariates: an extensive set of covariates needed

Dever, Rafferty, & Valliant (2008). Svy. Rsch. Meth.
Valliant, Dever (2011). Soc. Meth. Res.
Gelman, et al. (2015). Intl. Jnl. Forecasting

Model fit for sample needs to hold for nonsample
Proving that model estimated from sample holds for nonsample seems impossible

SLIDE 26

Methods of Inference Model for y

SE estimation

Variance estimator must be model-based
Replication is an option
Jackknife or bootstrap
Approximate jackknife
Large sample approximation to jackknife

$$
v_J = \sum_{s} \frac{w_i^2 r_i^2}{(1 - h_{ii})^2} - n^{-1} \left[ \sum_{s} \frac{w_i r_i}{1 - h_{ii}} \right]^2
$$

$r_i = y_i - \hat{y}_i$; $h_{ii}$ is a leverage
$v_J$ is consistent if the linear model holds with uncorrelated errors
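
A minimal sketch of this approximate jackknife follows, assuming a linear model with an intercept and one covariate, for which the leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_s (x_i - \bar{x})^2$. The data and the unit weights are hypothetical, and `approx_jackknife` is an illustrative helper.

```python
import math


def approx_jackknife(x, y, w):
    """v_J = sum (w r / (1 - h))^2 - (1/n) * (sum w r / (1 - h))^2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar

    sum_sq = 0.0
    sum_terms = 0.0
    for xi, yi, wi in zip(x, y, w):
        r = yi - (b0 + b1 * xi)                # residual r_i
        h = 1.0 / n + (xi - xbar) ** 2 / sxx   # leverage h_ii
        term = wi * r / (1.0 - h)
        sum_sq += term ** 2
        sum_terms += term
    return sum_sq - sum_terms ** 2 / n


# Hypothetical sample data and unit-level weights.
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]
w_s = [1.0, 2.0, 3.0, 4.0]

v_j = approx_jackknife(x_s, y_s, w_s)
print(f"v_J = {v_j:.4f}, SE = {math.sqrt(v_j):.4f}")
```

Dividing each residual by $1 - h_{ii}$ inflates the contribution of high-leverage units, whose own fitted values track them closely and whose raw residuals therefore understate their error.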

SLIDE 27

Methods of Inference Model for y

Variance estimation articles

Royall & Eberhardt (Sankhyā 1975)
Royall & Herson (JASA 1973a, b)
Royall & Cumberland (JASA 1978, 1981a, 1981b, 1985)
Royall (JASA 1986)
Royall (ISR 1986)
Royall & Cumberland (JRSS-B 1988)

SLIDE 28

Methods of Inference Future

Trouble Spots

How to use social media data: Twitter, Facebook, Snapchat, data scraped from web ...
Volunteer panels
Combining nonprobability samples with (better controlled) probability samples
Official statistics: are there faster, cheaper ways of producing unemployment rate, inflation rate, new job creation, prevalence of health conditions?

????
