Good Data Gone Bad, Bad Data Gone Worse
Renee Phillips pgconf.eu 2019
1
This is me.
2
Sakeeb Sabaaka Creative commons 2.0 license
3
@DataRenee https://2019.pgconf.eu/f
4
This is a talk about how good data goes bad, and how bad data gets worse
5
First, what is data?
format
by a computer
6
Data is: A representation of some aspect of the world
7
Next, what is good data? Data that is fit for its intended uses in operations, planning, and decision making
8
9
Why do we want good data?
10
Finally, what is bad data? Data that is not fit for its intended uses in operations, planning, and decision making
11
12
What can we assess to check if the data might work for the intended purpose?
13
The Six Primary Dimensions for Data Quality Assessment
PDF from UK International Data Management Association
14
Guidelines for quality assurance in health and health care research
PDF from Amsterdam Centre for Health and Healthcare Research
15
Dan Moyle creative commons 2.0 license
11 Things to Look At
16
17
Data Attributes
Assessing Data Quality
Data Actions
18
Data actions (columns): Acquisition/Entry | Cleaning | Storage | Analysis
Accuracy: x
Completeness: x x
Conformance: x x
Consistency: x
Timeliness: x x
Uniqueness: x x
Validity: x
19
Assessing Data Quality
Data Attributes
Data Actions
20
Have we stored the correct value?
21
Accuracy at Entry
Signing up for an airline rewards program, I entered my date of birth. Super easy.
22
But Wait
My birthday was not in the month they returned...
23
Ohhhh
The dreaded off-by-one error
24
Just to be sure
This really isn’t user error. August 31 happens in every year...
25
26
27
https://www.bitboost.com/pawsense/
28
29
30
Assessing Data Quality
Data Attributes
Data Actions
31
Are there gaps between expected data and the data we have?
32
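One way to look for gaps is to build the set of values you expect and subtract what you actually have. The sketch below assumes one reading per calendar day; the function name `missing_days` and the sample dates are hypothetical, not from the talk.

```python
from datetime import date, timedelta

def missing_days(observed, start, end):
    """Return the days in [start, end] that have no observation."""
    seen = set(observed)
    span = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(span))
    return [day for day in expected if day not in seen]

# Hypothetical readings: one per day is expected, two days are absent.
readings = [date(2019, 10, 14), date(2019, 10, 15), date(2019, 10, 18)]
gaps = missing_days(readings, date(2019, 10, 14), date(2019, 10, 18))
print(gaps)
```

The same idea works for any key with a known expected range (sequence numbers, store IDs, hours of the day), as long as "expected" can be written down.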
33
34
problem
35
36
Assessing Data Quality
Data Attributes
Data Actions
37
patricia m creative commons 2.0 license
38
patricia m creative commons 2.0 license
Conner McCall creative commons 2.0 license
39
40
41
Assessing Data Quality
Data Attributes
Data Actions
42
smcgee creative commons 2.0 license
43
United States Department of Agriculture License Creative Commons 2.0
Choose the right size storage for your database.
44
45
prevent loss
46
Assessing Data Quality
Data Attributes
Data Actions
47
Dear Rich Bastard,
48
Dear Rich Bastard,
\pset null '¯\\_(ツ)_/¯'
49
50
Null Unknown)
noticeable
51
Assessing Data Quality
Data Attributes
Data Actions
52
Is the data in a format that is expected and acceptable?
53
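A conformance check can be as small as matching each value against the expected pattern. This sketch assumes the expected format is an ISO 8601 calendar date (YYYY-MM-DD); that choice, the name `nonconforming`, and the sample entries are illustrative assumptions.

```python
import re

# Assumed expected format for this sketch: ISO 8601 calendar dates (YYYY-MM-DD).
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def nonconforming(values):
    """Return the values that do not match the expected pattern."""
    return [v for v in values if not ISO_DATE.fullmatch(v)]

# Hypothetical input mixing several date spellings.
entries = ["2019-10-17", "10/17/2019", "17 Oct 2019", "2019-10-18"]
bad = nonconforming(entries)
print(bad)
```

Note that this only checks shape: a string such as "2019-99-99" conforms to the pattern without being a real date, which is a validity question rather than a conformance one.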
Left: mister ebby creative commons 2.0 license
Right: Ann Althouse creative commons 2.0 license
Not to be Confused With Entropy
54
55
consistency
56
Assessing Data Quality
Data Attributes
Data Actions
57
anilmohabir creative commons 2.0
58
59
60
61
Sometimes data is machine generated
quisnovus creative commons 2.0 license
62
63
64
Assessing Data Quality
Data Attributes
Data Actions
65
66
People get creative
67
68
A black hole of data quality issues. Tony Hoare feels bad about it.
69
sufficient to capture respondent needs
70
71
Assessing Data Quality
Data Attributes
Data Actions
72
73
74
75
Does the database only change data in expected ways? Are there conflicts between data?
76
Camille Rose creative commons 2.0 license
77
78
database
79
method
80
Assessing Data Quality
Data Actions
Data Attributes
81
Changing granularity may make analysis unreliable or impossible
82
83
84
85
data model
column
86
Assessing Data Quality
Data Actions
Data Attributes
87
88
89
and what that does to the analysis
○ https://www.postgresql.org/docs/current/mvcc.html
90
Is there more recent data that is appropriate to the task? Is the data accessible quickly enough?
91
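The first timeliness question can be turned into a simple freshness test: compare the newest timestamp against a maximum acceptable age. The sketch below is a minimal version of that; the 24-hour threshold, the name `is_stale`, and the timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, now, max_age):
    """True when the newest record is older than the allowed age."""
    return now - last_updated > max_age

# Hypothetical timestamps, both timezone-aware so the subtraction is well defined.
now = datetime(2019, 10, 17, 12, 0, tzinfo=timezone.utc)
last = datetime(2019, 10, 15, 9, 0, tzinfo=timezone.utc)
stale = is_stale(last, now, timedelta(hours=24))
print(stale)
```

What counts as "too old" depends on the task, so the threshold is better passed in per use than hard-coded.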
Assessing Data Quality
Data Attributes
Data Actions
92
Google Maps Lost a Neighborhood. Again.
jeff creative commons 2.0 license
Via Slashdot
Really, this story is just like a greatest hits of problems.
93
94
95
Assessing Data Quality
Data Attributes
Data Actions
96
Erinn Simon creative commons 2.0 license
Jonathan Cristoferreti creative commons 2.0 license
97
98
99
Assessing Data Quality
Data Attributes
Data Actions
100
Michael Brace creative commons 2.0 license
101
102
103
104
Assessing Data Quality
Data Attributes
Data Actions
105
Are there duplicates in the dataset?
106
matthew venn creative commons 2.0 license
Uniqueness
107
# SELECT DISTINCT fruit FROM fruits ORDER BY fruit;
fruit
banana
banananana
grape
loom
naranja
(7 rows)
108
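An exact-match comparison like the DISTINCT query above treats every spelling as a different value. A rough way to surface likely duplicates is to group values by a normalized key. The sketch below normalizes by trimming whitespace and case-folding; the name `duplicate_groups` and the sample values are hypothetical, not the contents of the talk's table.

```python
def duplicate_groups(values):
    """Group values that are equal after trimming whitespace and case-folding."""
    groups = {}
    for v in values:
        key = v.strip().casefold()
        groups.setdefault(key, []).append(v)
    # Keep only keys that were reached by more than one distinct spelling.
    return {k: vs for k, vs in groups.items() if len(set(vs)) > 1}

# Hypothetical values; each spelling is distinct to an exact-match comparison.
fruits = ["banana", "Banana", "banana ", "grape", "naranja"]
dupes = duplicate_groups(fruits)
print(dupes)
```

This catches case and whitespace variants only. Typos such as "banananana" or translations such as "naranja" need fuzzy matching or a reference list, which this sketch does not attempt.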
sergio santos creative commons 2.0 license
109
110
111
Are the format, syntax, and type correct? Does the data have the potential to be accurate?
112
Assessing Data Quality
Data Attributes
Data Actions
113
patient | birth   | temperature
Susan   | 5/12/84 | 101.4
Meg     | 1/12/90 | 98.6
Julie   | 1/12/90 | 97.2
Fiona   | 3/31/65 | 970
Sally   | 4/3/01  | 111111
114
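The last two temperatures in the table above are numbers of the right type that could never be accurate. A range check is one simple way to flag them. The sketch below uses the patient names and temperatures from the slide; the bounds of 90 to 110 degrees Fahrenheit and the name `implausible` are assumptions for illustration, not clinical guidance.

```python
# Assumed plausible range for a human body temperature in degrees Fahrenheit;
# the bounds are illustrative only.
PLAUSIBLE_F = (90.0, 110.0)

def implausible(rows, low=PLAUSIBLE_F[0], high=PLAUSIBLE_F[1]):
    """Return (patient, temperature) pairs that fall outside [low, high]."""
    return [(name, temp) for name, temp in rows if not low <= temp <= high]

# Patient names and temperatures as shown on the slide.
rows = [
    ("Susan", 101.4),
    ("Meg", 98.6),
    ("Julie", 97.2),
    ("Fiona", 970),
    ("Sally", 111111),
]
flagged = implausible(rows)
print(flagged)
```

A value that passes this check is only potentially accurate: 101.4 is a valid temperature whether or not it is the one the patient actually had.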
115
116
117
Data actions (columns): Acquisition/Entry | Cleaning | Storage | Analysis
Accuracy: x
Completeness: x x
Conformance: x x
Consistency: x
Timeliness: x x
Uniqueness: x x
Validity: x
118
119
https://2019.pgconf.eu/f
120