Evaluating Data Quality to Support Evidence-Building Zachary H. - - PowerPoint PPT Presentation




SLIDE 1

Software Tools for Evaluating Data Quality to Support Evidence-Building

AcademyHealth Annual Research Meeting Washington, DC June 3, 2019 Zachary H. Seeskin NORC at the University of Chicago

Acknowledgments: Rupa Datta, Gabriel Ugarte, Evan Herring-Nathan, Andrew Latterner, NORC; Bob George, Emily Wiegand, Chapin Hall at the University of Chicago

Disclaimer: This research was supported by the Family Self-Sufficiency Research Consortium, Grant Number #90PD0272, funded by the Office of Planning, Research, and Evaluation in the Administration for Children and Families, U.S. Department of Health and Human Services to the University of Chicago, with NORC at the University of Chicago as a sub-grantee. The views expressed are solely those of the authors and do not necessarily represent the views of the Office of Planning, Research, and Evaluation.

SLIDE 2

Motivation

  • Increasing research use of administrative data, including for health and health care research
  • Report of the Commission on Evidence-Based Policymaking, 2017
  • Passage of the Foundations for Evidence-Based Policymaking Act in January 2019
  • Understanding data quality is critical for expanding informed use of such data sources for evidence-building
  • But resources are needed to inform evaluations of data quality
      • Literature largely focused on federal statistical agencies
  • NORC is developing software tools to fulfill this need
      • Provide best practices
      • Incorporate descriptive statistics and multivariate visualization

SLIDE 3

Overview

  • 1. Growing diversity of data sources for health research
  • 2. Overview of data quality and assessment
  • 3. Dimensions of data quality
  • 4. NORC’s Data File Orientation Toolkit with examples
  • 5. Conclusion

SLIDE 4

Administrative Data Sources Used in Health and Health Care Research

Range of data sources being used for research:

  • Medicare and Medicaid enrollment
  • State registries (ex: immunization)
  • Insurance claims
  • Electronic health records/Electronic medical records
  • E-prescription data
  • Consumer purchase data
  • Many others

Uses of data sources:

  • Directly for analysis/estimation
      • With or without linkage to other sources
  • Indirectly to support estimation with other sources (such as surveys)
      • Survey frames, imputation, calibration
  • Monitoring
  • Surveillance
  • Further background from Seeskin et al. (2018)

SLIDE 5

Challenges with Administrative Data Sources

  • Data collected for administration rather than to support statistical analyses
  • Common data quality concerns: Data entry errors, Missing data, Duplicate records
  • Varying quality for different variables based on importance for administration
  • Represent special populations for which official statistics are not readily available
  • Subject to changes over time and differential treatment for different groups
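
The common concerns listed above (data entry errors, missing data, duplicate records) can be screened with a few descriptive counts. The following is a minimal Python sketch using pandas, not code from the NORC toolkit (which the talk describes as based in R Markdown). The function name `profile_quality`, the column names, and the plausible age range are all hypothetical and chosen only for illustration.

```python
import pandas as pd

def profile_quality(df, key_cols, valid_ranges):
    """Summarize missing data, duplicate records, and out-of-range values."""
    report = {
        # Share of missing values in each column
        "missing_share": df.isna().mean().to_dict(),
        # Rows whose key repeats the key of an earlier row
        "duplicate_records": int(df.duplicated(subset=key_cols).sum()),
        "out_of_range": {},
    }
    # Values outside an assumed plausible range may point to entry errors
    for col, (low, high) in valid_ranges.items():
        values = df[col].dropna()
        report["out_of_range"][col] = int(((values < low) | (values > high)).sum())
    return report

# Hypothetical example records
records = pd.DataFrame({
    "person_id": [1, 2, 2, 3, 4],
    "year": [2018, 2018, 2018, 2018, 2018],
    "age": [34, 51, 51, 230, None],
})
report = profile_quality(records, ["person_id", "year"], {"age": (0, 120)})
print(report)
```

Counts like these only flag candidates for review; whether a flagged value is an error depends on the documentation and context of the data file.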

SLIDE 6

Principles for Data Quality Analyses: Know Your Data

  • Conduct careful review of metadata and documentation
  • Understand context in which data are collected and maintained
      • Including legal and compliance issues impacting measures in data file
  • Focus on data exploration to detect possible quality issues
  • Seek potential validation data related to measures in your file
      • By unit or in aggregate
      • If available, conduct detailed comparisons
  • Ask: Are your data fit for the purpose at hand?
      • Needs of data quality differ for different kinds of research questions (ex: cross-sectional, time series, longitudinal)
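
Comparison against validation data, by unit or in aggregate, might look like the following Python sketch with pandas. It is only an illustration under stated assumptions: the function names, the `county` and `enrolled` columns, and the benchmark counts are hypothetical, and a real comparison would need to account for differences in definitions and reference periods between sources.

```python
import pandas as pd

def compare_aggregate(admin, benchmark, group_col, count_col):
    """In aggregate: compare record counts per group with benchmark totals."""
    counts = admin.groupby(group_col).size().rename("admin_count").reset_index()
    merged = counts.merge(benchmark, on=group_col, how="outer")
    merged["admin_count"] = merged["admin_count"].fillna(0)
    # A ratio far from 1 suggests under- or over-coverage in that group
    merged["ratio"] = merged["admin_count"] / merged[count_col]
    return merged

def compare_by_unit(admin, validation, key, value_col):
    """By unit: share of linked units whose values agree across sources."""
    merged = admin.merge(validation, on=key, how="inner",
                         suffixes=("_admin", "_valid"))
    agree = merged[f"{value_col}_admin"] == merged[f"{value_col}_valid"]
    return float(agree.mean())

# Hypothetical administrative file, benchmark totals, and validation file
admin = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "county": ["A", "A", "B", "B"],
    "enrolled": [1, 1, 0, 1],
})
benchmark = pd.DataFrame({"county": ["A", "B"], "benchmark_count": [2, 4]})
validation = pd.DataFrame({"person_id": [1, 2, 3], "enrolled": [1, 0, 0]})

agg = compare_aggregate(admin, benchmark, "county", "benchmark_count")
rate = compare_by_unit(admin, validation, "person_id", "enrolled")
print(agg)
print(rate)
```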

SLIDE 7

Data Quality Dimensions from Literature

Dimension: Description

  • Relevance: Degree to which statistics meet needs of user, including whether data provide what is needed for use or research topic.
  • Accuracy: Whether data values reflect true values and are processed correctly.
  • Completeness: Whether data cover population of interest, include correct records, and do not contain duplicate or out-of-scope records. Additionally, whether cases have information filled in for all appropriate fields without missing data.
  • Timeliness: Whether the data are available in time to inform policy matters of interest.
  • Accessibility: The conditions in which users can obtain and work with the data, including physical conditions and legal requirements for access.
  • Clarity/Interpretability: Whether data are accompanied by sufficient and appropriate metadata to understand the data and their quality.
  • Coherence/Consistency: Data from different sources are based on the same approaches, classifications, and methodologies, with enough metadata available to support combining information from different sources.
  • Comparability: Extent to which differences between statistics reflect real phenomena rather than methodological differences. Types of comparability: over time, across geographies, among domains.

SLIDE 8

Recommended Checks from Literature

Accuracy

  • Validity of units: Assesses validity of identification keys for units in the dataset.
  • Validity of variable values: Assesses sensibility of values of single variables and among variables using the metadata.
  • Trustworthy variable values: Determines values in data that, while valid, are suspicious from judgment or experience.

Completeness

  • Coverage of units: Assesses whether there are units that are missing or not available for the analysis.
  • Duplicates: Looks at the occurrence of multiple registrations of identical units in the dataset.
  • Missing values: Looks at the absence of values for the variables and analyzes whether characteristics of the units with missing data are different from those of units with complete data.

Comparability

  • Distribution of variables: Assesses distribution of relevant variables to look for incongruences with expected distributions.
  • Relationships among variables: Looks for unexpected patterns in relationships among variables.
  • Consistency over time: Looks for unexpected patterns in variables over time.

Key sources on methods and frameworks: Daas et al. 2011, Laitila et al. 2011, Iwig et al. 2013, Office for National Statistics UK 2013, Statistics Canada 2018
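
Three of the recommended checks (validity of units, missing values compared across groups, and consistency over time) can be sketched in Python with pandas as follows. This is an illustrative sketch, not the checks as implemented in any of the cited frameworks or in the NORC toolkit. The key format, the column names, and the 50 percent change threshold are assumptions made for the example.

```python
import pandas as pd

def invalid_keys(df, key_col, pattern):
    """Validity of units: keys that are missing or do not match the expected format."""
    ok = df[key_col].astype(str).str.fullmatch(pattern)
    return df.loc[~ok, key_col]

def missingness_by_group(df, value_col, group_col):
    """Missing values: share of missing value_col within each group."""
    return df[value_col].isna().groupby(df[group_col]).mean()

def yearly_jumps(df, year_col, value_col, threshold=0.5):
    """Consistency over time: years whose mean shifts by more than the threshold."""
    means = df.groupby(year_col)[value_col].mean().sort_index()
    change = (means / means.shift(1) - 1).abs()
    return change[change > threshold]

# Hypothetical file; keys are assumed to be one capital letter plus three digits
data = pd.DataFrame({
    "case_id": ["A001", "A002", "B-3", None, "A005", "A006"],
    "region": ["N", "N", "S", "S", "S", "S"],
    "year": [2016, 2016, 2017, 2017, 2018, 2018],
    "income": [100.0, 120.0, None, 110.0, 300.0, None],
})

bad = invalid_keys(data, "case_id", r"[A-Z]\d{3}")
miss = missingness_by_group(data, "income", "region")
jumps = yearly_jumps(data, "year", "income")
print(bad, miss, jumps, sep="\n")
```

A difference in missingness between groups, or a flagged year, is a prompt to consult the metadata; it may reflect a real change in the program rather than a data problem.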

SLIDE 9

NORC Software Tools for Evaluating Data Quality

  • Data File Orientation Toolkit for Family Self-Sufficiency Data Center
      • Produces report applying data quality analyses to your data file
      • Provides detailed written guidance on how to interpret analyses, organized by dimension
      • Based in R Markdown; primarily designed for researchers and R programmers
      • Planned release upcoming at http://www.norc.org/Research/Projects/Pages/family-self-sufficiency-data-center.aspx
  • Future plans: Data Quality Dashboard
      • Designed for broader set of users
      • Load data file and use point-and-click interface to conduct recommended data quality checks

SLIDE 10

Example Analysis: Tableplots (Tennekes et al. 2011)

Note: From simulated data source with about 1.5 million observations representing five-year range with 100,000 cases
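
As described by Tennekes et al. (2011), a tableplot sorts the records by one variable, groups them into equally sized row bins, and summarizes every variable within each bin (means for numeric variables, category shares for categorical ones, with missing values shown as their own category). The Python sketch below reproduces only that binning and summarizing step with pandas; it is not the toolkit's code, and the function name `tableplot_summary` and the example columns are hypothetical.

```python
import numpy as np
import pandas as pd

def tableplot_summary(df, sort_col, n_bins=100):
    """Sort rows, cut them into n_bins equal-size row bins, summarize each bin."""
    ordered = (df.sort_values(sort_col, ascending=False, na_position="last")
                 .reset_index(drop=True))
    # Consecutive rows share a bin; bins differ in size by at most one row
    bin_id = pd.Series(np.arange(len(ordered)) * n_bins // len(ordered), name="bin")
    numeric, categorical = {}, {}
    for col in ordered.columns:
        if pd.api.types.is_numeric_dtype(ordered[col]):
            # Numeric variable: mean per bin (drawn as a bar in a tableplot)
            numeric[col] = ordered[col].groupby(bin_id).mean()
        else:
            # Categorical variable: category shares per bin, missing kept visible
            labels = ordered[col].astype(object).where(ordered[col].notna(), "missing")
            categorical[col] = pd.crosstab(bin_id, labels, normalize="index")
    return pd.DataFrame(numeric), categorical

# Hypothetical data: eight records, summarized in two bins
example = pd.DataFrame({
    "income": [10, 20, 30, 40, 50, 60, 70, 80],
    "sex": ["F", "M", "F", None, "M", "M", "F", "M"],
})
num, cat = tableplot_summary(example, "income", n_bins=2)
print(num)
print(cat["sex"])
```

Reading the bins side by side shows how each variable, including its missingness, varies along the sort variable.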

SLIDE 11

Example Analysis: Tableplots

Note: From simulated data source with about 1.5 million observations representing five-year range with 100,000 cases

SLIDE 12

Example Analysis: Tableplots

Note: From simulated data source with about 1.5 million observations representing five-year range with 100,000 cases

SLIDE 13

Example Analysis: Tableplots

Note: From simulated data source with about 1.5 million observations representing five-year range with 100,000 cases

SLIDE 14

Example Analysis: Tableplots

Note: From simulated data source with about 1.5 million observations representing five-year range with 100,000 cases

SLIDE 15

Example Analysis: Tableplots

Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 Records

SLIDE 16

Example Analysis: Tableplots

Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 Records

SLIDE 17

Example Analysis: Letter Value Plots (Hofmann et al. 2017)

Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 Records
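
A letter value plot extends the boxplot by showing, beyond the median (M) and fourths (F), successively more extreme quantile pairs: eighths (E), sixteenths (D), and so on, which makes tails visible in large files. The Python sketch below computes those pairs with NumPy. It is an approximation under stated assumptions: it uses interpolated quantiles rather than the depth-based definition, and the stopping rule (`min_tail`, at least five observations expected beyond each letter value) is a simplification chosen for this example, not one of the rules proposed by Hofmann et al.

```python
import numpy as np

def letter_values(values, min_tail=5):
    """Return (label, lower, upper) letter values: M, F, E, D, ..."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    n = len(x)
    median = float(np.median(x))
    rows = [("M", median, median)]
    k = 2
    for label in "FEDCBAZYX":
        tail = 1.0 / 2 ** k  # 1/4 for fourths, 1/8 for eighths, ...
        # Stop once too few observations are expected beyond this letter value
        if n * tail < min_tail:
            break
        lower, upper = np.quantile(x, [tail, 1 - tail])
        rows.append((label, float(lower), float(upper)))
        k += 1
    return rows

# Hypothetical data: the integers 1 through 101
rows = letter_values(np.arange(1, 102))
for label, lower, upper in rows:
    print(label, lower, upper)
```

With more records, more letter values pass the stopping rule, so the plot shows more of the tails as the file grows.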

SLIDE 18

Example Analysis: Letter Value Plots

Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 Records

SLIDE 19

Example Analysis: Letter Value Plots

Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 Records

SLIDE 20

Example Analysis: Letter Value Plots

Note: From CMS Medicare Claims Synthetic PUF, 2008-2010, 343,644 Records

SLIDE 21

Conclusion

  • Use of administrative data sources for health and health care research is expanding
      • Key issues and recommendations described in Report of the Commission on Evidence-Based Policymaking, 2017
  • Software tools provide a much-needed resource to support evaluations of the data quality of administrative data sources for evidence-building
      • Value of using software tools to explore your data
      • Advantages of R environment
  • Current tools geared toward researchers and programmers
      • In future, aim to develop and provide tools that are more broadly accessible

SLIDE 22

To Learn More

Commission on Evidence-Based Policymaking. The promise of evidence-based policymaking: Report of the commission on evidence-based policymaking. 2017. Retrieved from https://www.cep.gov/cep-final-report.html

Daas P, Ossen S, Tennekes M, Zhang LC, Hendriks C, Haugen KF, Laitila T, Wallgren A, Wallgren B, Bernardi A, Cerroni F. List of quality groups and indicators identified for administrative data sources. First deliverable of WP4 of the BLUE-ETS project, 2011 Mar 10. http://www.pietdaas.nl/beta/pubs/pubs/BLUE-ETS_WP4_Del1.pdf

Hofmann H, Wickham H, Kafadar K. Letter-Value Plots: Boxplots for Large Data. Journal of Computational and Graphical Statistics. 2017 Jul 3;26(3):469-77. https://doi.org/10.1080/10618600.2017.1305277

Iwig W, Berning M, Marck P, Prell M. Data quality assessment tool for administrative data. Prepared for a subcommittee of the Federal Committee on Statistical Methodology, Washington, DC. 2013 Feb. https://stats.bls.gov/osmr/datatool.pdf

Laitila T, Wallgren A, Wallgren B. Quality assessment of administrative data. Statistics Sweden; 2011. http://www.scb.se/statistik/_publikationer/ov9999_2011a01_br_x103br1102.pdf

Office for National Statistics UK. Guidelines for measuring statistical output quality. Office for National Statistics UK. 2013 Sep. https://www.statisticsauthority.gov.uk/wp-content/uploads/2017/01/Guidelines-for-Measuring-Statistical-Outputs-Quality.pdf

Seeskin ZH, LeClere F, Ahn J, Williams JA. Uses of alternative data sources for public health statistics and policymaking: Challenges and opportunities. JSM Proceedings, Government Statistics Section. 2018: 1822-1861. http://www.norc.org/PDFs/Publications/SeeskinZ_Uses%20of%20Alternative%20Data%20Sources_2018.pdf

Seeskin ZH, Ugarte G, Datta AR. Constructing a toolkit to evaluate quality of state and local administrative data. International Journal of Population Data Science. 2019 Jan 31;4(1): 1-11. https://ijpds.org/article/download/937/1031

Statistics Canada. Use of Administrative Data. 2018. Retrieved from Statistics Canada: http://www.statcan.gc.ca/pub/12-539-x/2009001/administrative-administratives-eng.htm

Tennekes M, de Jonge E, Daas PJ. Visual profiling of large statistical datasets. In New Techniques and Technologies for Statistics conference, Brussels, Belgium 2011 Feb 10. http://www.academia.edu/download/32828743/NTTS2011_Tableplot_paper.pdf

SLIDE 23

Thank You!

Zachary Seeskin Seeskin-Zachary@norc.org