slide-1
SLIDE 1

Data Cleansing and Data Understanding

Best Practices and Lessons from the Field

Casey Stella @casey_stella 2017

slide-3
SLIDE 3

Hi, I’m Casey Stella!

  • I’m a Principal Software Engineer at Hortonworks, a Hadoop vendor, writing open source software
  • I work on Apache Metron (Incubating), constructing a platform to do advanced analytics and data science for cyber security at scale
  • Prior to this, I was
      • Doing data science consulting on the Hadoop ecosystem for Hortonworks
      • Doing data mining on medical data at Explorys using the Hadoop ecosystem
      • Doing signal processing on seismic data at Ion Geophysical
      • A graduate student in the Math department at Texas A&M in algorithmic complexity theory

slide-4
SLIDE 4

Garbage In ⟹ Garbage Out

“80% of the work in any data project is in cleaning the data.”

– DJ Patil in Data Jujitsu

slide-5
SLIDE 5

Data Cleansing ⟹ Data Understanding

There are two ways to understand your data:

  • Syntactic Understanding
  • Semantic Understanding

If you hope to get anything out of your data, you have to have a handle on both.

slide-7
SLIDE 7

Syntactic Understanding: True Types

A true type is a label applied to data points x_i such that the x_i are mutually comparable.

  • Schema type != true type
  • A single column can hold values of many different true types

“735” has a true type of integer but could have a schema type of string or double.
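
The gap between schema type and true type can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the helper name `true_type`, its labels, and the sample column are all assumptions made for the example. It labels each raw value with the narrowest type it parses as, then counts how many true types share one column.

```python
from collections import Counter


def true_type(value):
    """Return a hypothetical true-type label for one raw value."""
    if value is None:
        return "missing"
    text = str(value).strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "double"
    except ValueError:
        return "string"


# A made-up column whose schema type is "string".
column = ["735", "12.5", "n/a", "", "42", None]

# Count how many values fall under each true type.
type_counts = Counter(true_type(v) for v in column)
print(type_counts)
```

A column that reports more than one true type here is a hint that its values are not mutually comparable as stored.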

slide-14
SLIDE 14

Syntactic Understanding: Density

Data density is an indication of how data is clumped together.

  • For numerical data, distributions and statistical characteristics are informative.
  • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful.
  • For ALL data, an indication of how “empty” the data is.

Canonical representations give you an idea of the data format at a glance, for example by:

  • Replacing digits with the character ‘d’
  • Stripping whitespace
  • Normalizing punctuation
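
A minimal sketch of these three rules, assuming string input. The function name `canonicalize`, the choice of "-" as the single punctuation stand-in, and the sample dates are assumptions for illustration, not taken from the talk.

```python
import re
from collections import Counter


def canonicalize(value):
    """Map a raw string to a format signature (illustrative rules only)."""
    text = re.sub(r"\s+", "", value)             # strip whitespace
    text = re.sub(r"[^A-Za-z0-9]", "-", text)    # normalize punctuation to '-'
    text = re.sub(r"[0-9]", "d", text)           # replace each digit with 'd'
    return text


# Made-up date strings as they might arrive from several upstream systems.
values = ["2017-01-05", "2017/01/05", " 2017-1-5", "20170105", "2017-02-11"]

# Counts and distinct counts of the canonical form, not of the raw values.
format_counts = Counter(canonicalize(v) for v in values)
print(format_counts)
print("distinct formats:", len(format_counts))
```

Five distinct raw values collapse to three formats, which is usually the more useful number when deciding how to parse the column. One caveat of the 'd' convention is that a literal letter d in the data becomes indistinguishable from a digit.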

Data density is an assumption underlying any conclusions drawn from your data.
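
One way to make that assumption explicit is to compute a small density profile per column before any analysis. The sketch below uses only the Python standard library; the function name `density_profile`, the keys of the returned dictionary, and the sample data are hypothetical.

```python
import statistics


def density_profile(values):
    """Summarize how full and how spread out one column is."""
    # Treat None and blank strings as "empty".
    present = [v for v in values if v is not None and str(v).strip() != ""]
    profile = {
        "count": len(values),
        "empty_fraction": 1 - len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
    }
    # Add distribution statistics only when every present value is numeric.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if len(numeric) == len(present) and len(numeric) >= 2:
        profile.update(
            min=min(numeric),
            max=max(numeric),
            mean=statistics.mean(numeric),
            median=statistics.median(numeric),
            stdev=statistics.stdev(numeric),
        )
    return profile


# Made-up column of ages with two empty entries.
ages = [34, 41, None, 29, 41, "", 57, 38]
profile = density_profile(ages)
print(profile)
```

If a later model assumes this column is almost always populated, an `empty_fraction` of 0.25 is the number that contradicts it.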

slide-19
SLIDE 19

Syntactic Understanding: Density over Time

$\frac{\Delta \text{Density}}{\Delta t}$ is how data clumps change over time. This kind of analysis can show

  • Problems in the data pipeline
  • Whether the assumptions of your analysis are violated

$\frac{\Delta \text{Density}}{\Delta t} \implies$

  • Automation
  • Outlier Alerting
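
As a sketch of what automated outlier alerting on density could look like: compute one density number per day (here, the fraction of empty values) and flag the days that sit far from the typical day. The talk does not prescribe a method; the modified z-score based on the median absolute deviation is one common robust rule, swapped in here as an assumption. The function name, the threshold of 3.5, and the daily figures are made up.

```python
import statistics


def flag_density_shifts(daily_density, threshold=3.5):
    """Return the days whose density deviates strongly from the typical day."""
    values = list(daily_density.values())
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers cannot inflate.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # No spread at all: any day that differs from the median is a shift.
        return [day for day, v in daily_density.items() if v != median]
    return [
        day
        for day, v in daily_density.items()
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Made-up fraction of empty values in one column, per ingest day.
daily_empty_fraction = {
    "2017-03-01": 0.02,
    "2017-03-02": 0.03,
    "2017-03-03": 0.02,
    "2017-03-04": 0.04,
    "2017-03-05": 0.41,
    "2017-03-06": 0.03,
}

alerts = flag_density_shifts(daily_empty_fraction)
print(alerts)
```

A day flagged this way points either at a pipeline problem or at a real change in the data; both are worth knowing before trusting downstream results.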

slide-23
SLIDE 23

Story Time: A Summation over Time Saves Face

The stage: An unnamed startup analyzing clinical effectiveness measurements for hospitals.

  • Think of it as a rules engine that takes medical data and outputs how well doctors and departments are doing
  • Insights aren’t trusted if they’re wrong.
  • Correctness depends on good data.

slide-28
SLIDE 28

Semantic Understanding: “Do what I mean, not what I say”

Semantic understanding is understanding based on how the data is used rather than how it is stored.

  • Equivalences found through semantic understanding are often context sensitive.
  • May come from humans (e.g. domain experience and ontologies)
  • May come from machine learning (e.g. analyzing usage patterns to find synonyms)

Semantic understanding may require data science. At the same time, data science will require semantic understanding.
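
To make "analyzing usage patterns to find synonyms" concrete, here is a minimal sketch of one such approach, assumed for illustration rather than taken from the talk: terms that appear alongside the same other terms get similar context vectors, even if they never appear together. The function names and the records are made up.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(records):
    """Count, for each term, the other terms it appears alongside."""
    vectors = defaultdict(Counter)
    for record in records:
        for term in record:
            for other in record:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up records: each is the set of terms seen together in one note.
records = [
    ["tylenol", "fever", "headache"],
    ["acetaminophen", "fever", "headache"],
    ["tylenol", "fever"],
    ["acetaminophen", "headache"],
    ["insulin", "diabetes"],
]

vectors = context_vectors(records)
same_usage = cosine(vectors["tylenol"], vectors["acetaminophen"])
different_usage = cosine(vectors["tylenol"], vectors["insulin"])
print(same_usage, different_usage)
```

The two names for the same drug never co-occur in these records, yet they score high because they are used in the same contexts. A human with domain knowledge still has to confirm that a high score means "synonym" rather than "often used for the same thing".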

slide-33
SLIDE 33

Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon

The stage: An unnamed insurance company building models to predict disease

  • Doctors and Nurses are busy people
  • Humans suffer from confirmation bias
  • Machines can only interpret what they can see
  • Together we can fill in the gaps

slide-34
SLIDE 34

SummarizerCLI

slide-43
SLIDE 43

Implications for Team Structure

To be successful,

  • Your data science teams have to be integrally involved in the data transformation and understanding.

  • Your data science teams have to be willing to get their hands dirty
  • Your data science teams have to be allowed to get their hands dirty
  • Your data science teams need software engineering chops.

slide-45
SLIDE 45

Questions

Thanks for your attention! Questions?

  • Code & scripts for this talk available on my GitHub presentation page: http://github.com/cestella/presentations/
  • Find me at http://caseystella.com
  • Twitter handle: @casey_stella
  • Email address: cstella@hortonworks.com