Good Data Gone Bad, Bad Data Gone Worse Renee Phillips pgconf.eu - - PowerPoint PPT Presentation

good data gone bad bad data gone worse
SMART_READER_LITE
LIVE PREVIEW

Good Data Gone Bad, Bad Data Gone Worse Renee Phillips pgconf.eu - - PowerPoint PPT Presentation

Good Data Gone Bad, Bad Data Gone Worse Renee Phillips pgconf.eu 2019 1 This is me. 2 Sakeeb Sabaaka Creative commons 2.0 license 3 @DataRenee https://2019.pgconf.eu/f 4 This is a talk about how good data goes bad, and how bad data gets


slide-1
SLIDE 1

Good Data Gone Bad, Bad Data Gone Worse

Renee Phillips pgconf.eu 2019

1

slide-2
SLIDE 2

This is me.

2

slide-3
SLIDE 3

Sakeeb Sabaaka Creative commons 2.0 license

3

slide-4
SLIDE 4

@DataRenee https://2019.pgconf.eu/f

4

slide-5
SLIDE 5

This is a talk about how good data goes bad, and how bad data gets worse

5

slide-6
SLIDE 6

First, what is data?

  • Pieces of information, with a

format

  • Facts used for making decisions
  • Information stored in and/or used

by a computer

6

slide-7
SLIDE 7

Data is: A representation of some aspect of the world

7

slide-8
SLIDE 8

Next, what is good data? Fit for its intended uses in operations, planning, decision making

8

slide-9
SLIDE 9

9

slide-10
SLIDE 10

Why do we want good data?

  • Planning
  • Operations
  • Looking smart
  • Saving money
  • Decision making
  • Completing transactions

10

slide-11
SLIDE 11

Finally, what is bad data? Not fit for its intended uses in

  • perations, planning, decision

making

11

slide-12
SLIDE 12

12

slide-13
SLIDE 13

What can we assess to check if the data might work for the intended purpose?

13

slide-14
SLIDE 14

The Six Primary Dimensions for Data Quality Assessment

  • 1. Completeness
  • 2. Uniqueness
  • 3. Timeliness
  • 4. Validity
  • 5. Accuracy
  • 6. Consistency

PDF from UK International Data Management Association

14

slide-15
SLIDE 15

Guidelines for quality assurance in health and health care research

  • 1. Acquisition
  • 2. Entry
  • 3. Cleaning
  • 4. Storage
  • 5. Analysis

PDF from Amsterdam Centre for Health and Healthcare Research

15

slide-16
SLIDE 16

Dan Moyle creatvice commons 2.0 license

11 Things to Look At

16

slide-17
SLIDE 17

17

slide-18
SLIDE 18

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Assessing Data Quality

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

18

slide-19
SLIDE 19

Acquisition/ Entry Cleaning Storage Analysis Accuracy x Completeness x x Conformance x x Consistency x Timeliness x x Uniqueness x x Validity x

19

slide-20
SLIDE 20

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

20

slide-21
SLIDE 21

Accuracy

Have we stored the correct value?

21

slide-22
SLIDE 22

Accuracy at Entry

Signing up for airline rewards program, I entered my date of birth. Super easy.

22

slide-23
SLIDE 23

But Wait

My birthday was not in the month they return...

23

slide-24
SLIDE 24

Ohhhh

The dreaded off by

  • ne error.

24

slide-25
SLIDE 25

Just to be sure

This really isn’t user error. August 31 happens in every year...

25

slide-26
SLIDE 26

How does this happen?

  • Bad UX/UI

26

slide-27
SLIDE 27

What can we do?

  • Test the inputs when we design database
  • Talk to the Front End team

27

slide-28
SLIDE 28

https://www.bitboost.com/pawsense/

28

slide-29
SLIDE 29

How does this happen?

  • Errors in entry
  • Unauthorized entry

29

slide-30
SLIDE 30

What can we do?

  • Knowledge Elicitation (talk to domain experts)
  • Set database constraints
  • Check anticipated vs actual values
  • Maintain security

30

slide-31
SLIDE 31

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

31

slide-32
SLIDE 32

Completeness

Are there gaps between expected data and the data we have?

32

slide-33
SLIDE 33

33

slide-34
SLIDE 34

34

slide-35
SLIDE 35

How does this happen?

  • Incorrect sampling
  • Incomplete understanding of the business

problem

  • Not enough data available

35

slide-36
SLIDE 36

What can we do?

  • Ask more questions of domain experts
  • Find additional data sets

36

slide-37
SLIDE 37

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

37

slide-38
SLIDE 38

patricia m creative commons 2.0 license

Sometimes data is discarded for relevance or size

38

slide-39
SLIDE 39

patricia m creative commons 2.0 license Conner McCall creative commons 2.0 license

39

slide-40
SLIDE 40

How does this happen?

  • Dataset is too large
  • Dataset contains unnecessary columns

40

slide-41
SLIDE 41

What can we do?

  • Import selectively
  • Screen data carefully
  • Trim and filter as appropriate

41

slide-42
SLIDE 42

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

42

slide-43
SLIDE 43

smcgee creative commons 2.0 license

Storage

43

slide-44
SLIDE 44

United States Department of Agriculture License Creative Commons 2.0

Choose the right size storage for your database.

44

slide-45
SLIDE 45

How does this happen?

  • Storage size chosen incorrectly or not updated
  • Storage location or equipment chosen poorly
  • Column, table, or database dropped in error

45

slide-46
SLIDE 46

What can we do?

  • Be realistic about data needs, assess frequently
  • Have backups
  • Trust your users, apply alerts and process to

prevent loss

46

slide-47
SLIDE 47

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

47

slide-48
SLIDE 48

Dear Rich Bastard,

48

slide-49
SLIDE 49

Dear Rich Bastard,

  • r maybe try

\pset null '¯\\_(ツ)_/¯'

49

slide-50
SLIDE 50

How does this happen?

  • Null is a black hole of data problems

50

slide-51
SLIDE 51

What can we do?

  • Document the code
  • Be careful with null (Go see Lætitia’s talk about

Null Unknown)

  • Make NULL appear as something more

noticeable

51

slide-52
SLIDE 52

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

52

slide-53
SLIDE 53

Conformance

Is the data in a format that is expected and acceptable?

53

slide-54
SLIDE 54

left mister ebby creative commons 2.0 license right Ann Althouse creative commons 2.0 license

Entry:

Not to be Confused With Entropy

54

slide-55
SLIDE 55

How does this happen?

  • Errors in entry
  • Limitations in data collection/ availability
  • Transforming from one data type to another

55

slide-56
SLIDE 56

What can we do?

  • Test the inputs when we design database
  • Have multiple entries of subset of data, check for

consistency

  • Search out additional datasets

56

slide-57
SLIDE 57

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

57

slide-58
SLIDE 58

anilmohabir creative commons 2.0

58

slide-59
SLIDE 59
  • 2, -1, 1, 2

2BCE, 1BCE, 1CE, 2CE

59

slide-60
SLIDE 60

How does this happen?

  • Some databases have a year 0
  • Money is a challenge

60

slide-61
SLIDE 61

What can we do?

  • Beware of calendar challenges
  • Don’t use the money type
  • Know what jurisdictions recognize leap seconds
  • Daylight Savings
  • Offsets

61

slide-62
SLIDE 62

Sometimes data is machine generated

quisnovus creative commons 2.0 license

62

slide-63
SLIDE 63

How does this happen?

  • Improper machine calibration
  • Improper machine reading

63

slide-64
SLIDE 64

What can we do?

  • Set database constraints
  • Ensure machine generated data is tested
  • Ensure data collectors are well trained

64

slide-65
SLIDE 65

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

65

slide-66
SLIDE 66

NOTNULL

66

slide-67
SLIDE 67

NOTNULL

People get creative

67

slide-68
SLIDE 68

NULL

68

slide-69
SLIDE 69

NULL

A black hole of data quality issues Tony Hoar feels bad about it

69

slide-70
SLIDE 70

How does this happen?

  • Data entry is forced to provide a response
  • Radio button or check boxes provided are not

sufficient to capture respondent needs

70

slide-71
SLIDE 71

What can we do?

  • Consult domain experts
  • Provide free form “other” option and/or “unknown”
  • ption
  • Practice good knowledge elicitation

71

slide-72
SLIDE 72

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

72

slide-73
SLIDE 73

73

slide-74
SLIDE 74

How does this happen?

  • Text fields
  • Integer fields

74

slide-75
SLIDE 75

What can we do?

  • Use BOOLEAN data type
  • Cast BOOLEAN to TEXT or INT if you must

75

slide-76
SLIDE 76

Consistency

Does the database only change data in expected ways? Are there conflicts between data?

76

slide-77
SLIDE 77

Camille Rose creative commons 2.0 license

77

slide-78
SLIDE 78

78

slide-79
SLIDE 79

How does this happen?

  • Selecting datasets for different areas of the

database

  • Databases started with inconsistency

79

slide-80
SLIDE 80

What can we do?

  • Coalesce appropriately
  • Clean data with a documented, reproducible

method

  • Set database constraints to check consistency

80

slide-81
SLIDE 81

Assessing Data Quality

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

81

slide-82
SLIDE 82

Changing granularity may make analysis unreliable or impossible

82

slide-83
SLIDE 83

83

slide-84
SLIDE 84

84

slide-85
SLIDE 85

How does this happen?

  • Data models are changed
  • Database system is changed
  • Data is transferred inappropriately

85

slide-86
SLIDE 86

What can we do?

  • Exercise extreme caution when designing first

data model

  • Speak with domain experts
  • Use new column instead of changing existing

column

86

slide-87
SLIDE 87

Assessing Data Quality

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

87

slide-88
SLIDE 88

88

slide-89
SLIDE 89

How does this happen?

  • Improper database constraints
  • Database system is changed
  • Data is entered inappropriately
  • Data not available in emergency

89

slide-90
SLIDE 90

What can we do?

  • Have access to multiple datasets for emergencies
  • Speak with domain experts to plan permission
  • Be clear in analysis when the change happened

and what that does to the analysis

  • Be able to correct entries after initial input
  • Use concurrency control in PostgreSQL

○ https://www.postgresql.org/docs/current/mvcc.html

90

slide-91
SLIDE 91

Timeliness

Is there more recent data that is appropriate to the task? Is the data accessible quickly enough?

91

slide-92
SLIDE 92

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

92

slide-93
SLIDE 93

Google Maps Lost a Neighborhood. Again.

jeff creative commons 2.0 license

Via Slashdot

  • Acquisition
  • Entry
  • Cleaning
  • Analysis
  • Consistency
  • Accuracy

Really, this story is just like a greatest hits of problems.

93

slide-94
SLIDE 94

How does this happen?

  • A newly discovered dataset is outdated
  • A newly created dataset is not imported
  • User is not aware of the age of data

94

slide-95
SLIDE 95

What can we do?

  • Check provenance of data
  • Actively search for additional sources
  • Combine datasets where appropriate
  • Identify if more data is really needed

95

slide-96
SLIDE 96

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

96

slide-97
SLIDE 97

Erinn Simon creative commons 2.0 license Jonathan Cristoferreti creative commons 2.0 license

97

slide-98
SLIDE 98

How does this happen?

  • Infrastructure limitations

98

slide-99
SLIDE 99

What can we do?

  • Ensure enough storage on collector
  • Offline first design

99

slide-100
SLIDE 100

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

100

slide-101
SLIDE 101

Michael Brace creative commons 2.0 license

101

slide-102
SLIDE 102

How does this happen?

  • Data set is large
  • Data security prevents access
  • Data is stale

102

slide-103
SLIDE 103

103

slide-104
SLIDE 104

What can we do?

  • Build or select better indexes
  • Use partitioning
  • Store less data
  • Evaluate roles and users
  • Update materialized views in transactions

104

slide-105
SLIDE 105

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

105

slide-106
SLIDE 106

Uniqueness

Are there duplicates in the dataset?

106

slide-107
SLIDE 107

matthew venn creative commons 2.0 license

Uniqueness

107

slide-108
SLIDE 108

# SELECT DISTINCT fruit FROM fruits ORDER BY fruit; fruit

  • apple

banana banananana grape loom naranja

  • rangle

(7 rows)

108

slide-109
SLIDE 109

sergio santos creative commons 2.0 license

109

slide-110
SLIDE 110

How does this happen?

  • Assumptions about domain
  • Entry errors
  • Naming things
  • Duplicate entries

110

slide-111
SLIDE 111

What can we do?

  • Consult domain experts
  • Check entry
  • Set primary key
  • Clean based on information not instinct

111

slide-112
SLIDE 112

Validity

Are the format, syntax, and type correct? Does the data have the potential to be accurate?

112

slide-113
SLIDE 113

Assessing Data Quality

Data Attributes

  • 1. Accuracy
  • 2. Completeness
  • 3. Conformance
  • 4. Consistency
  • 5. Timeliness
  • 6. Uniqueness
  • 7. Validity

Data Actions

  • 1. Acquisition/Entry
  • 2. Cleaning
  • 3. Storage
  • 4. Analysis

113

slide-114
SLIDE 114

patient | birth | temperature

  • --------+---------+------

Susan | 5/12/84 | 101.4 Meg | 1/12/90 | 98.6 Julie | 1/12/90 | 97.2 Fiona | 3/31/65 | 970 Sally | 4/3/01 | 111111

114

slide-115
SLIDE 115

000861 == 861

115

slide-116
SLIDE 116

How does this happen?

  • Importing from various sources
  • Improper data type selection
  • Improper format changes to data
  • NOTNULL
  • Data Entry errors

116

slide-117
SLIDE 117

What can we do?

  • Select correct data type
  • Set value constraint on column
  • Figure out a solution for NULL?

117

slide-118
SLIDE 118

Acquisition/ Entry Cleaning Storage Analysis Accuracy x Completeness x x Conformance x x Consistency x Timeliness x x Uniqueness x x Validity x

118

slide-119
SLIDE 119

119

slide-120
SLIDE 120

https://2019.pgconf.eu/f

120