Data Cleaning Tools Survey Final G1 Lukas Bodner, Daniel Geiger, - - PowerPoint PPT Presentation

data cleaning tools
SMART_READER_LITE
LIVE PREVIEW

Data Cleaning Tools Survey Final G1 Lukas Bodner, Daniel Geiger, - - PowerPoint PPT Presentation

Data Cleaning Tools Survey Final G1 Lukas Bodner, Daniel Geiger, Lorenz Leitner 1 of 28 Motivation Introduction Data Sets Feature Matrix 2 of 28 Motivation Why is clean data important? Low-quality data leads to: Incorrect results


slide-1
SLIDE 1

Data Cleaning Tools

Survey Final

G1 Lukas Bodner, Daniel Geiger, Lorenz Leitner

1 of 28

slide-2
SLIDE 2

Introduction

Motivation Data Sets Feature Matrix

2 of 28

slide-3
SLIDE 3

Motivation

Why is clean data important? Low-quality data leads to:

  • Incorrect results
  • Wrong conclusions

⇒ Costly for businesses ⇒ Material failure ⇒ Injury to people

3 of 28

slide-4
SLIDE 4

Data Sets - Parking (Task: Merging)

4 of 28

Source: http://data.graz.gv.at/katalog/verkehr und technik/Parkgaragen.csv http://data.graz.gv.at/katalog/verkehr und technik/ParkRide.csv

slide-5
SLIDE 5

Data Sets - Candy Ratings (Task: Standardization)

5 of 28

Source: https://www.scq.ubc.ca/so-much-candy-data-seriously/

slide-6
SLIDE 6

Data Sets - Green Area (Task: Filtering)

6 of 28

Source: https://data.world/unhabitat-guo/7babf915-12a0-4ceb-ad9c-7ee24b776614

slide-7
SLIDE 7

Feature Matrix - 12 Characteristics

  • Local/Web
  • Paid/Free
  • License
  • Platforms/OS
  • Data privacy
  • Input formats
  • Character encoding
  • Output formats
  • User-friendliness/ease-of-use
  • Documentation
  • Support
  • Other

7 of 28

slide-8
SLIDE 8

Feature Matrix - 25 Tools x 12 Features

8 of 28

Excerpt of the final feature matrix

slide-9
SLIDE 9

Tools

Data Cleaning Tools

Descriptions Evaluations Examples

9 of 28

slide-10
SLIDE 10

OpenRefine

  • Local web app
  • Free and open source (BSD)
  • Cross-platform
  • Freebase Gridwork ⇒ Google Refine ⇒ OpenRefine
  • Main features

○ Explore data ○ Clean and transform data ○ Match data ○ General Refine Expression Language (GREL) ○ History of applied operations Showcase video: https://youtu.be/Eqp1OMzW3oQ

10 of 28

slide-11
SLIDE 11

OpenRefine - Example 1: Merging

11 of 28

slide-12
SLIDE 12

OpenRefine - Example 2: Standardization

12 of 28

Demo

slide-13
SLIDE 13

OpenRefine - Example 2: Standardization

13 of 28

slide-14
SLIDE 14

OpenRefine - Example 3: Filtering (Pre-Processing 1)

14 of 28

slide-15
SLIDE 15

OpenRefine - Example 3: Filtering (Pre-Processing 2)

15 of 28

slide-16
SLIDE 16

OpenRefine - Example 3: Filtering (Actual Filtering)

16 of 28

slide-17
SLIDE 17

Trifacta

17 of 28

  • Web app
  • Paid / Free (limited functionality, 100mb upload limit, 1gb download limit)
  • Requirements: Chrome and at least 4gb ram (but also works with Firefox)
  • Originally called Stanford DataWrangler
  • Main features:

○ Suggestions ○ Many transformation functions ○ Preview of transformations ○ Scheduling

  • Limitations:

Online only Showcase video: https://youtu.be/HvFGO-U86t8

slide-18
SLIDE 18

Trifacta - Example 1: Merging

18 of 28

slide-19
SLIDE 19

Trifacta - Example 2: Standardization

19 of 28

slide-20
SLIDE 20

Trifacta - Example 3: Filtering (Pre-Processing)

20 of 28

1. 2. 3.

slide-21
SLIDE 21

Trifacta - Example 3: Filtering (Actual Filtering)

21 of 28

slide-22
SLIDE 22

DataCleaner

  • Standalone desktop application
  • Paid commercial edition and free and open-source community edition (LGPL-3.0)
  • Cross-platform
  • First released in 2008
  • Main features:

○ Data profiling (Discovering and analyzing quality of data) ○ Data wrangling (Transforming and cleaning data) ○ Community driven extensions

  • Limitations:

○ In practice many errors and crashes ○ Unintuitive usage Showcase video: https://youtu.be/bvLEYrTC6CY

22 of 28

slide-23
SLIDE 23

DataCleaner - Example 1: Merging

23 of 28

slide-24
SLIDE 24

DataCleaner - Example 2: Standardization

24 of 28

slide-25
SLIDE 25

DataCleaner - Example 3: Filtering

25 of 28

slide-26
SLIDE 26

Conclusion

Summary Recommendation

26 of 28

slide-27
SLIDE 27

Conclusion

  • Achieve high-quality data using data cleaning tools.
  • Different use cases call for different tools.

○ E.g. data analysis (DataCleaner), cleaning, transformation, (OpenRefine/Trifacta)...

  • Different user requirements call for different tools.

○ E.g. data privacy (non-online tools), platform (cross-platform tools), input formats, enterprise/private use (paid vs free), ...

  • Some tools cater to almost all requirements. (OpenRefine)
  • Others offer a subset. (Trifacta, Alteryx Designer, DataCleaner, ...)
  • Look at feature matrix for quick comparison according to needs.

27 of 28

slide-28
SLIDE 28

Recommendation

28 of 28

Tool Rating Limitations

OpenRefine +++ Trifacta ++ Online only, paid Alteryx Designer + Windows only, paid DataCleaner

  • Breaks, unintuitive

Honorable mentions:

  • Tabula (PDF data extraction) ++
  • Potter’s Wheel (Pioneer) -

Additional videos: Alteryx Designer, Tabula, Potter’s Wheel

slide-29
SLIDE 29

Thank you for your attention.