

SLIDE 1

The Statistical Administrative Records System and Administrative Records Experiment 2000: System Design, Successes, and Challenges

Dean H. Judson Planning, Research and Evaluation Division U.S. Census Bureau

SLIDE 2

11/20/2000 U.S. CENSUS BUREAU 2

How Administrative Records Are Created and Used

Stages shown in the diagram:

  • Presentation (query results and displays)
  • Database
  • Recorded Events and Objects (administrative record)
  • Observed Events and Objects ("sampling frame")
  • Events and Objects (population)

Annotations on the diagram:

  • Policy changes which change the definition of events and objects
  • “Ontologies” and thresholds for observation
  • Data entry errors and coding schemes
  • Data management issues
  • Query structure and spurious structure
  • Data collection

SLIDE 3

Ontologies and Data Quality

[Figure: four state-mapping diagrams labeled Proper Representation, Incomplete Representation, Ambiguous Representation, and Meaningless States. Source: Wand and Wang, 1996:90]

SLIDE 4

Background and History

  • Statistical Administrative Records System

– Six large Federal input files: IRS 1040, IRS 1099, Selective Service, Medicare, Indian Health Service, HUD-TRACS
– One lookup file: SSA/Census Numident

  • AREX 2000

– Attempt to use STARS data to simulate administrative records census

SLIDE 5

A Diagrammatic Depiction of Files Used to Create the Final StARS Database

[Flowchart 5. The connections between boxes are not recoverable from the text; the labeled boxes, in step-number order, are:]

  • 5.05 Person Characteristic File (PCF) (aka 14.100)
  • 5.10 Concatenate, sort, and unduplicate
  • 5.15 SSS Person Edited File
  • 5.20 IRS 1040 Person Edited File
  • 5.25 IRS 1099 Person Edited File
  • 5.30 Medicare Person Edited File
  • 5.35 HUD-TRACS Person Edited File
  • 5.40 IHS Person Edited File
  • 5.45 Medicaid Person Edited File (future possibility)
  • 5.50 CHUMS Person Edited File (future possibility)
  • 5.55 FAFSA Person Edited File (future possibility)
  • 5.60 Composite Person Output
  • 5.65 Original Address Pointers
  • 5.70 Address Output (aka 4.25)
  • 5.75 Unduplicate & Reset Address Pointers
  • 5.80 Updated Address Pointers
  • 5.85 Merge
  • 5.90 Person Output

Off-page connector labels on the chart: Return to 4; 7; 9.

SLIDE 6

Characteristics of Files Included in the STARS System

  • IRS Individual Master 1040 File:

– Tax year data; April, 2000 refers to “tax year” 1999
– TY ‘99 file arrives October, 2000
– Business entities, estates, other institutions included
– 120 million records/year
– Households below the filing threshold do not need to file

  • Tax Filing Unit ≠ Housing Unit

– Czajka, 2000: 10-20% of addresses are PO Boxes, business addresses, tax preparers

  • Limited microdata content:

– TY95+: SSN’s of dependents requested, recorded
– Czajka, 2000: 1987 study: 0.5% of primary filer, 1.6% of secondary filer, 3.4% of dependents’ SSN’s in error
– Age, race, sex, Hispanic origin microdata not available

SLIDE 7

Characteristics of Files Included in the STARS System, cont.

  • IRS Information Returns (1099) File:

– Tax year data; April, 2000 refers to “tax year” 1999
– TY ‘99 file arrives October, 2000
– Business entities, estates, other institutions included
– 775 million records/year
– Recipient address ≠ Housing Unit
– Czajka, 2000: 10-20% of addresses are PO Boxes, business addresses, tax preparers
– Limited microdata content: Age, race, sex, Hispanic origin microdata not available
SLIDE 8

Characteristics of Files Included in the STARS System, cont.

  • Selective Service File:

– About 13 million records
– Registration required in 1940, suspended in 1975, resumed in 1980
– Presumably, males 18-25 are required to inform SSS when they move
– Females, non-immigrant aliens, hospitalized, incarcerated, and institutionalized males, and members of the armed forces are exempt
– Limited microdata content: Race, Hispanic origin microdata not available
– Address information may not be current

SLIDE 9

Characteristics of Files Included in the STARS System, cont.

  • Medicare Enrollment Database (EDB):

– Current and historical Medicare enrollment
– “Active” and “Inactive” cases
– 35-40 million records at any one point in time; September ‘93: 77 million records (active + inactive)
– Proxy recipients listed on the file (e.g., John Doe’s benefits c/o Jane Doe; John Doe’s benefits c/o nursing home)
– A small portion of records at any point in time are probably deceased (Kim and Sater, 2000)
– Used in population estimates system for 65+ household population estimates

SLIDE 10

Characteristics of Files Included in the STARS System, cont.

  • Medicare EDB, cont.:

– Recipient Address ≠ Housing Unit

  • Proxy recipients

– Coverage is believed high (93-102%) but not perfect and unevenly distributed geographically

  • “Snowbird” states appear to have lower ratios of Medicare to 65+ population than “non-snowbird” states

SLIDE 11

Characteristics of Files Included in the STARS System

  • Indian Health Service patient file:

– About 10 million patient/transaction records
– Transaction record ≠ person record
– Unduplication

  • about 10 million patient records, 2 million unduplicated SSN’s

– Many missing SSN’s

  • about 20% missing SSN’s
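
The unduplication step described on this slide (many transaction records collapsing to far fewer distinct SSNs, with some records lacking an SSN entirely) can be sketched as follows. This is a minimal illustration, not the StARS procedure: the record layout, the field names `ssn` and `visit`, and the sample values are all assumptions made up for the example.

```python
def unduplicate_by_ssn(transactions):
    """Collapse transaction records to one record per SSN.

    Records with a missing SSN cannot be unduplicated this way,
    so they are returned separately instead of being dropped.
    """
    persons = {}   # ssn -> first record seen for that SSN
    missing = []   # records with no usable SSN
    for rec in transactions:
        ssn = (rec.get("ssn") or "").strip()
        if not ssn:
            missing.append(rec)
            continue
        persons.setdefault(ssn, rec)
    return persons, missing


# Made-up illustrative records, not real data.
transactions = [
    {"ssn": "111-11-1111", "visit": "1999-01-05"},
    {"ssn": "111-11-1111", "visit": "1999-03-17"},
    {"ssn": "222-22-2222", "visit": "1999-02-02"},
    {"ssn": "", "visit": "1999-02-09"},
    {"ssn": None, "visit": "1999-04-01"},
]
persons, missing = unduplicate_by_ssn(transactions)
missing_share = len(missing) / len(transactions)
```

Keeping the first record per SSN is an arbitrary choice here; which transaction should represent the person is exactly the kind of decision the slide flags with "transaction record ≠ person record".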
SLIDE 12

Characteristics of Files Included in the STARS System, cont.

  • Housing and Urban Development Tenant Rental Assistance Certification System (HUD-TRACS):

– HUD subsidy payments
– Currently, about 3.3 million records
– Short form data for all members of household (Race/Hispanic only for head of household)
– Address information may represent project or landlord address

SLIDE 13

Characteristics of Files Included in the STARS System, cont.

  • Census NUMIDENT File:

– 750 million transaction records → 400 million individual SSN records
– Post 1985: Enumeration at birth
– For each SSN: Date of birth, gender, race, place of birth

  • About 50-60 million persons on the file are deceased but not identified as such
  • No current residence information on the file
  • Taxpayer ID Numbers (TINs) not on the file
  • About 35% of SSN’s on file have alternate names (marriage, divorce, etc.)
  • 6% missing gender
  • Race coding has changed (prior to 1980, 3 races: White, Black, Other); 20% either “unknown” or “other”
  • About 25% of SSN’s have transactions with different race codes
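
One way to find SSNs whose transactions carry alternate names or differing race codes is to group the transaction records by SSN and collect the distinct values seen for each field. The sketch below assumes a hypothetical record layout (`ssn`, `name`, `race`) and made-up single-letter race codes; it does not reflect the actual NUMIDENT file format.

```python
from collections import defaultdict


def summarize_transactions(transactions):
    """Group transaction records by SSN and collect distinct values per field."""
    by_ssn = defaultdict(lambda: {"names": set(), "race": set()})
    for t in transactions:
        entry = by_ssn[t["ssn"]]
        entry["names"].add(t["name"])
        if t["race"] is not None:   # a missing code is not a conflict
            entry["race"].add(t["race"])
    return dict(by_ssn)


def conflicting(summary, field):
    """Return the sorted SSNs whose transactions disagree on `field`."""
    return sorted(ssn for ssn, e in summary.items() if len(e[field]) > 1)


# Made-up illustrative transactions, not real data.
transactions = [
    {"ssn": "001", "name": "ANN LEE", "race": "O"},
    {"ssn": "001", "name": "ANN LEE-KIM", "race": "W"},
    {"ssn": "002", "name": "BOB RAY", "race": "W"},
    {"ssn": "002", "name": "BOB RAY", "race": "W"},
    {"ssn": "003", "name": "CY DOE", "race": None},
]
summary = summarize_transactions(transactions)
race_conflicts = conflicting(summary, "race")
name_conflicts = conflicting(summary, "names")
```

The sketch only detects disagreement. Choosing which value to keep for the person record is a separate decision that it does not attempt.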
SLIDE 14

STARS Processing Diagrams

  • Two Goals:

– For person data: One output record per person, assigned to an individual residence corresponding as closely as possible to Census residence definitions, in a household structure corresponding as closely as possible to Census household structure, containing microdata corresponding as closely as possible to Census short form microdata, and excluding persons which are not in the population of interest.
– For address data: One output record per individual housing unit at a Basic Street Address, geocoded to Census TIGER geography, with address microdata and concepts corresponding as closely as possible to DMAF address fields and concepts, and excluding locations which are not in the population of interest.

SLIDE 15

STARS Processing Overview

[Flowchart 15. The connections between boxes are not recoverable from the text; the labeled boxes, in step-number order, are listed below. A bracketed number is the bare cross-reference number printed next to the box.]

  • 15.05 Process file this cycle? (Yes / No)
  • 15.10 Hold for next cycle
  • 15.15 Program Development: Address Processing Program [8]
  • 15.20 Address Data Processing [10]
  • 15.25 Address Output
  • 15.30 Program Development: Person Editing Program [8]
  • 15.35 Person Editing [15]
  • 15.40 Edited IHS File
  • 15.45 Program Development: SSN Verification Program [8]
  • 15.50 Social Security Number (SSN) Verification [13]
  • 15.55 Verified IHS File
  • 15.60 Is current year’s PCF available? (Yes / No)
  • 15.65 Create Person Characteristic File (PCF) [14]
  • 15.70 Person Characteristic File (PCF)
  • 15.75 Process Person Data [16]
  • 15.80 Person Output
  • 15.85 Program Development: Household Processing Program [8]
  • 15.90 Household Data Processing [17]
  • 15.95 Household Output
  • 15.100 Program Development: Final Output Program [8]
  • 15.105 Final StARS Processing [18]
  • 15.110 Final StARS Output
  • 15.115 Data Delivery [5]
  • 15a Go To End
  • End

SLIDE 16

Administrative Records Experiment in 2000 (AREX 2000)

  • Five selected sites in Maryland and Colorado

– MD: Baltimore city, Baltimore county
– CO: El Paso county, Douglas county, Jefferson county

  • Attempt to simulate an Administrative Records Census
  • Not all aspects of an Administrative Records Census are simulated

– Group Quarters survey
– Coverage measurement survey

  • Special operations not included in StARS

– Request for physical address (PO boxes/RR’s)
– MAFGOR Geocoding
– Field verification of addresses not matched to DMAF

SLIDE 17

AREX 2000 Overview Flowchart

[Flowchart 17. The connections between boxes are not recoverable from the text. Regions of the chart are labeled "Methods 1 and 2" and "Method 2 Only (Bottom-Up)". The labeled boxes, in step-number order, are listed below; the responsible division is given in parentheses where the chart shows one, and a bracketed number is the bare cross-reference number printed next to the box.]

  • Start
  • 17.05 Planning & OMB Approval (PRED)
  • 17.10 Acquire National Administrative Records File (PRED)
  • 17.15 National Administrative Address Records File
  • 17.20 Computer geocode the National File (GEO)
  • 17.25 Maryland & Colorado (MD&CO) Geocoded Files (with test site records flagged)
  • 17.30 Receive MD&CO Files from GEO (PRED)
  • 17.35 Create StARS 1999 from MD&CO Files (PRED)
  • 17.40 StARS 1999 Master Housing File (MHF) for MD&CO
  • 17.45 Perform Exploratory Data Analysis (EDA) on test sites (PRED)
  • 17.50 Additional Geocoded Test Site Records
  • 17.55 Additional Ungeocoded Test Site Records
  • 17.60 Copy test site records to create AREX Address File (PRED)
  • 17.65 AREX Address File
  • 17.700 Extract test site records from MD&CO Files (GEO)
  • 17.75 Extract ungeocoded city-style records (GEO)
  • 17.80 Clerical Resolution of Ungeocoded Addresses (MAFGOR) (GEO/FLD/RCCs) [3]
  • 17.85 Update AREX Address File with MAFGOR results (PRED)
  • 17.90 Geocoded City-style AREX Addresses
  • 17.95 Copy P.O. Box and rural-style addresses (PRED)
  • 17.100 AREX P.O. Box and rural-style addresses (aka 2.40)
  • 17.110 Request for Physical Addresses Mailout & Processing (DSCMO/NPC/GEO/RCCs) [2]
  • 17.115 Update AREX Address File with Req. for Phys. Addr. results (PRED)
  • 17.120 DMAF
  • 17.125 Obtain DMAF from DSCMO (PRED)
  • 17.130 Pull off address records from DMAF by AREX test site counties (PRED)
  • 17.135 Match Geocoded City-style AREX Addresses to DMAF (PRED)
  • 17.140 Perform clerical review of match results (PRED)
  • 17.145 Unmatched Admin. Record Addresses
  • 17.150 Field Address Verification & Processing (FLD / DSCMO / NPC) [4]
  • 17.155 Update AREX Address File with Fld. Addr. Ver. & Proc. results (PRED)
  • 17.160 Unmatched DMAF Addresses
  • 17.165 Obtain person data from Census 2000 (DSCMO)
  • 17.170 GQ Person Data from Census
  • 17.175 StARS Person Data
  • 17.180 AREX Address File (after MAFGOR, Request for Physical Addresses, and Field Address Verification updates)
  • 17.185 Matched Addresses
  • 17.190 Census 2000 Person Data
  • 17.195 Post-Processing (for details, see AREX 2000: Administrative Records Research File Processing Flowcharts)

SLIDE 18

AREX 2000 Evaluation Plans

  • Evaluation 1: Comparison of both methods’ site and block level counts of population by race, Hispanic origin, age groups and gender, with comparable decennial census counts
  • Evaluation 2: Analyzing selected components of the AREX implementation processing
  • Evaluation 3: Comparison of “bottom up” housing unit and household level information with comparable Census 2000 housing unit and household information
  • Evaluation 4: Assessing the feasibility of using administrative records in lieu of a field interview to obtain data on nonresponding households
SLIDE 19

Major Analytic Issues with StARS Processing

  • Ontologies

– A delivery address suitable for receiving a payment check may not suffice for putting individuals at a street address
– Difficult to distinguish individual units within the Basic Street Address
– Race coding: Hispanic Origin is a separate race on NUMIDENT
– Transaction data ≠ person data
– How many names does a person have (and in what order)?

  • Proxies – IRS & Medicare records

– JOHN WILSON
– C/O MARY SMITH
– 1004 LAUREL LANE
– ROCKMONT, MD 22345

The address is for Mary Smith. John Wilson may or may not live there.

SLIDE 20

Major Analytic Issues with StARS Processing

  • Addresses that are difficult to place on the ground

– Huang and Kim, 2000: About 10% of addresses are rural style
– PO Boxes: 45% for IHS, 9.5% for Medicare, 7.5% for IRS 1040, 6.8% for SSS, 3.8% for IRS 1099, 0.4% for HUD-TRACS
– Sater, 1995 IRS/CPS match: 86.5% of tax return cases had the same address as residence address, 94% coded to same county

  • John Smith
  • H&R BLOCK
  • P.O. BOX 12
  • GREENWAY, MD 29752

– Addresses with both business and residential components

  • Dean H. Judson
  • JUDSON OLD GROWTH LOGGING & SPOTTED OWL EXTERMINATION SERVICES
  • 45850 BACKWOODS HIGHWAY
  • BOONDOCKS, OR 96432
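
A first screening pass over addresses like the ones on these slides could flag P.O. boxes, rural-route addresses, and "care of" proxy lines with a few pattern checks. The sketch below is only an illustration under stated assumptions: the regular expressions, the category names, and the function `classify_address` are hypothetical and far cruder than production address standardization.

```python
import re

# Hypothetical patterns; real address standardizers handle many more variants.
PO_BOX = re.compile(r"\b(P\.?\s*O\.?\s*BOX|POST\s+OFFICE\s+BOX)\b", re.IGNORECASE)
RURAL = re.compile(r"\b(RURAL\s+ROUTE|RR)\b", re.IGNORECASE)
CARE_OF = re.compile(r"^\s*C/O\b", re.IGNORECASE)
CITY_STYLE = re.compile(r"^\s*\d+\s+\S+")   # line starts with a house number


def classify_address(lines):
    """Return (style, is_proxy) for an address given as a list of lines."""
    text = " ".join(lines)
    is_proxy = any(CARE_OF.search(line) for line in lines)
    if PO_BOX.search(text):
        style = "po_box"
    elif RURAL.search(text):
        style = "rural"
    elif any(CITY_STYLE.match(line) for line in lines):
        style = "city_style"
    else:
        style = "unknown"
    return style, is_proxy


# Example addresses taken from the slides.
proxy_example = classify_address(
    ["JOHN WILSON", "C/O MARY SMITH", "1004 LAUREL LANE", "ROCKMONT, MD 22345"])
preparer_example = classify_address(
    ["John Smith", "H&R BLOCK", "P.O. BOX 12", "GREENWAY, MD 29752"])
rural_example = classify_address(
    ["SAM SMITH", "BOX 2 RURAL ROUTE 37", "WESTPORT, VA 32784"])
```

Note what the sketch cannot see: an address with both business and residential components looks like an ordinary city-style address to these patterns, so it would pass through unflagged.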
SLIDE 21

Major Analytic Issues with StARS Processing, cont.

  • Unduplication and matching

– When addresses or personal characteristics are measured with substantial variation, it is often not obvious whether a particular pair of records represent a duplicate or not. Yet, with multiple files, unduplication decisions must be made.

[Figure: two address lists shown side by side, the CHUMS-enhanced IMH File (entries labeled A-Q) and the MAF.]

CHUMS-enhanced IMH File:

  • A: Banana St
  • B: 17 Banana St
  • C: 19 Banana St Apt 5
  • D: 44 MLK, Jr. Blvd
  • E: 100 Route 4
  • F: 7 Marie Ln
  • G: Wife Mrs. Smith
  • H: 5 Apple St
  • I: 27 Apple St
  • J: Apple St
  • K: 9999 Apple St
  • L: 3 Apple St Apt 5
  • M: 1 Apple St
  • N: 3 Apple St Apt A
  • O: 3 Apple St ZZ
  • P: 3 Apple St
  • Q: 3 Apple St Apt 1

MAF:

  • 1 Apple St
  • 3 Apple St Apt 1
  • 3 Apple St Apt 2
  • 3 Apple St Apt 3
  • 3 Apple St Apt 4
  • 7 Apple St
  • 9 Apple St
  • # Apple St
  • # Martin Luther King, Jr. Blvd
  • # Pennsylvania Ave
  • 7 Maria Ln

SLIDE 22

Major Analytic Issues with StARS Processing, cont.

Outcome of "CHUMS-enhanced IMH File" / MAF Match, with possible explanations (example letters refer to the labeled IMH addresses)

  • Street match: NO; BSA match: N/A; BSA+Unit match: N/A

– 1. Street is not in MAF, either it was just missing or it's a new street (examples A, B, C)
– 2. Different, but valid representation of street name (examples D, E)
– 3. Misspelling of street name (example F)
– 4. Erroneous street name (example G)

  • Street match: YES; BSA match: NO; BSA+Unit match: N/A

– 1. BSA is not in MAF, either it was just missing or it's a new BSA: there is a "hole" in MAF (example H)
– 2. BSA is not in MAF, either it was just missing or it's a new BSA: a missing "street extension" (example I)
– 3. Existing street with no incoming street number (example J)
– 4. Erroneous street number (example K)

  • Street match: YES; BSA match: YES; BSA+Unit match: NO

– 1. Unit not in MAF, either it was just missing or it's a new unit (example L)
– 2. Valid match: a BSA without separate units (example M)
– 3. Different representation of a unit (example N)
– 4. Erroneous unit information (example O)
– 5. Missing unit information (example P)

  • Street match: YES; BSA match: YES; BSA+Unit match: YES

– 1. Valid match (example Q)
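
The three-level comparison in this table (street, then Basic Street Address, then BSA plus unit) can be sketched as a cascade of exact lookups. This is a simplified illustration, not the matching software used for StARS: it assumes addresses are already parsed into hypothetical `(street, number, unit)` tuples, and it compares them exactly, so it reproduces only the YES / NO / N/A pattern and cannot tell the possible explanations apart.

```python
# Hypothetical pre-parsed MAF entries: (street, house number, unit).
MAF = {
    ("Apple St", "1", None),
    ("Apple St", "3", "Apt 1"),
    ("Apple St", "3", "Apt 2"),
    ("Apple St", "3", "Apt 3"),
    ("Apple St", "3", "Apt 4"),
    ("Apple St", "7", None),
    ("Apple St", "9", None),
}


def match_outcome(addr, maf):
    """Return (street, BSA, BSA+unit) match flags for one parsed address."""
    street, number, unit = addr
    streets = {s for s, _, _ in maf}
    bsas = {(s, n) for s, n, _ in maf}
    if street not in streets:
        return ("NO", "N/A", "N/A")
    if (street, number) not in bsas:
        return ("YES", "NO", "N/A")
    if (street, number, unit) not in maf:
        return ("YES", "YES", "NO")
    return ("YES", "YES", "YES")


# Selected example addresses from the slides, parsed by hand.
outcomes = {
    "B": match_outcome(("Banana St", "17", None), MAF),
    "H": match_outcome(("Apple St", "5", None), MAF),
    "L": match_outcome(("Apple St", "3", "Apt 5"), MAF),
    "P": match_outcome(("Apple St", "3", None), MAF),
    "Q": match_outcome(("Apple St", "3", "Apt 1"), MAF),
}
```

Exact comparison is the weak point. Cases such as a misspelled street name or a differently written unit fall into a "NO" branch here even though they may refer to a MAF address, which is why the table lists several explanations for each outcome. A unit-less address such as example M comes out as a full match in this sketch, whereas the table files it under a BSA without separate units.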

SLIDE 23

Major Analytic Issues with StARS Processing, cont.

  • Variations in data from different sources

– Huang and Kim, 2000: Of the 50% of SSN’s found on multiple files,

  • about 1% have more than one gender recorded
  • about 32% have multiple addresses
  • about 2% have multiple races
  • “Imputation” from the NUMIDENT

– Many files have limited microdata. For those that are found on the NUMIDENT, we can “impute” microdata from the approximately equivalent NUMIDENT fields.
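
The "imputation" idea, filling a source record's missing demographic fields from the NUMIDENT entry for the same SSN, can be sketched as a keyed lookup. The field names (`dob`, `gender`, `race`), the dictionary layout, and the function `impute_from_numident` are assumptions for illustration; the sample values are made up.

```python
def impute_from_numident(record, numident, fields=("dob", "gender", "race")):
    """Fill missing fields of `record` from the NUMIDENT lookup.

    Returns a copy of the record plus the list of field names that were
    imputed, so imputed values stay distinguishable from reported ones.
    """
    out = dict(record)
    imputed = []
    lookup = numident.get(record.get("ssn"))
    if lookup is None:          # SSN not found on the lookup file
        return out, imputed
    for f in fields:
        if out.get(f) is None and lookup.get(f) is not None:
            out[f] = lookup[f]
            imputed.append(f)
    return out, imputed


# Made-up illustrative data, not real records.
numident = {"111-11-1111": {"dob": "1950-02-03", "gender": "F", "race": None}}
tax_record = {"ssn": "111-11-1111", "source": "IRS 1040",
              "dob": None, "gender": None, "race": None}
unmatched = {"ssn": "999-99-9999", "source": "IRS 1099",
             "dob": None, "gender": None, "race": None}

filled, filled_fields = impute_from_numident(tax_record, numident)
same, no_fields = impute_from_numident(unmatched, numident)
```

Returning the list of imputed fields is a design choice of the sketch: since the NUMIDENT fields are only "approximately equivalent", downstream users may want to treat imputed values differently.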

SLIDE 24

Major Analytic Issues with StARS Processing, cont.

  • Changing information states

– Distinct problem from “point in time” data collection – Information states change over time/over databases

  • Address information ages over time and varies over databases
– SAM SMITH, BOX 2 RURAL ROUTE 37, WESTPORT, VA 32784 (Dated 10/14/98 from Medicare)
– SAM SMITH, 486 MAIN STREET, FAIRFIELD, VA 33412 (From TY97 IRS file, filed sometime in 1998)

  • Mortality information ages over time and varies over databases
  • One database provides information about the other, provided that matching can be performed
  • Data processing requires complex, and substantively important, decision logic at each step
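
The Sam Smith example shows why "take the most recent address" is not a simple rule: the Medicare record has an exact date, while the IRS record is known only to have been filed sometime in 1998. One way to make that explicit is to give each observation a date interval and pick a winner only when its interval is strictly later than all the others. This is a sketch of that idea, not the decision logic used in StARS; the interval representation and the function `latest_address` are assumptions.

```python
from datetime import date


def latest_address(observations):
    """Return the address observed strictly after all others, else None.

    Each observation carries an 'earliest' and 'latest' possible date.
    If the intervals overlap, the ordering is ambiguous and None is returned.
    """
    best = max(observations, key=lambda o: o["earliest"])
    for other in observations:
        if other is not best and other["latest"] >= best["earliest"]:
            return None
    return best["address"]


medicare = {"address": "BOX 2 RURAL ROUTE 37, WESTPORT, VA 32784",
            "earliest": date(1998, 10, 14), "latest": date(1998, 10, 14)}
# "Filed sometime in 1998": any day of that year is possible.
irs = {"address": "486 MAIN STREET, FAIRFIELD, VA 33412",
       "earliest": date(1998, 1, 1), "latest": date(1998, 12, 31)}
ambiguous = latest_address([medicare, irs])

# Hypothetical extra knowledge: suppose the return was filed by April 15, 1998.
irs_by_april = dict(irs, latest=date(1998, 4, 15))
resolved = latest_address([medicare, irs_by_april])
```

With only the information on the slide, the sketch declines to choose. It resolves to the Medicare address only under the added (assumed) filing deadline.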

SLIDE 25

References

  • Bye, B. (1998). Race and ethnicity modeling with SSA Numident Data: Interim report: File development and tabulations. Unpublished document available from the U.S. Bureau of the Census.
  • Bryant, C. (1995). Comparing the LUCA address list to “local records.” Paper presented at the 1995 State Data Center Meeting, San Francisco, CA, April 4, 1995.
  • Czajka, J. (1999). Can we count on administrative records in future U.S. Censuses? Presentation at the Bureau of the Census, December 15, 1999.
  • Huang, E., and Kim, J. (2000). One Percent Sample Study Report (SRD-DRAFT). Unpublished document available from the U.S. Bureau of the Census, February 10, 2000.
  • Judson, D.H., and Popoff, C.L. (2000). Research Use of Administrative Records. Unpublished document.
  • Judson, D.H. (2000). The Statistical Administrative Records System: System Design, Successes, and Challenges. Unpublished document.
  • Kim, M.O., and Sater, D. (2000). Defining the Medicare Data Universe for the U.S. Census Bureau's Population Estimates Program. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
  • Sater, D. (1995). Differences in Location of Households and Tax Filing Units. Paper presented at the 1995 meeting of the Population Association of America, San Francisco, CA, April 6, 1995.
  • Wand, Y., and Wang, R.Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39: 86-95.
  • Zanutto, E. (1996). Estimating a population roster from an incomplete census using mailback questionnaires, administrative records, and sampled nonresponse followup. Presentation to the U.S. Bureau of the Census, August 6, 1996.
  • Zanutto, E., and Zaslavsky, A. (1999). Using Administrative Records to Impute for Nonresponse. Paper presented at the International Conference on Survey Nonresponse, Portland, OR, October 29, 1999.