data cleaning tools
play

Data Cleaning Tools Survey Final G1 Lukas Bodner, Daniel Geiger, - PowerPoint PPT Presentation

Data Cleaning Tools Survey Final G1 Lukas Bodner, Daniel Geiger, Lorenz Leitner 1 of 28 Motivation Introduction Data Sets Feature Matrix 2 of 28 Motivation Why is clean data important? Low-quality data leads to: Incorrect results


  1. Data Cleaning Tools Survey Final G1 Lukas Bodner, Daniel Geiger, Lorenz Leitner 1 of 28

  2. Motivation Introduction Data Sets Feature Matrix 2 of 28

  3. Motivation Why is clean data important? Low-quality data leads to: ● Incorrect results Wrong conclusions ● ⇒ Costly for businesses ⇒ Material failure ⇒ Injury to people 3 of 28

  4. Data Sets - Parking (Task: Merging) Source: http://data.graz.gv.at/katalog/verkehr und technik/Parkgaragen.csv http://data.graz.gv.at/katalog/verkehr und technik/ParkRide.csv 4 of 28

  5. Data Sets - Candy Ratings (Task: Standardization) 5 of 28 Source: https://www.scq.ubc.ca/so-much-candy-data-seriously/

  6. Data Sets - Green Area (Task: Filtering) 6 of 28 Source: https://data.world/unhabitat-guo/7babf915-12a0-4ceb-ad9c-7ee24b776614

  7. Feature Matrix - 12 Characteristics Local/Web Character encoding ● ● Paid/Free Output formats ● ● License User-friendliness/ease-of-use ● ● Platforms/OS Documentation ● ● Data privacy Support ● ● Input formats Other ● ● 7 of 28

  8. Feature Matrix - 25 Tools x 12 Features Excerpt of the final feature matrix 8 of 28

  9. Descriptions Tools Evaluations Examples Data Cleaning Tools 9 of 28

  10. OpenRefine Local web app ● Free and open source (BSD) ● Cross-platform ● Freebase Gridwork ⇒ Google Refine ⇒ OpenRefine ● Main features ● Explore data ○ Clean and transform data ○ Match data ○ General Refine Expression Language (GREL) ○ History of applied operations ○ Showcase video: https://youtu.be/Eqp1OMzW3oQ 10 of 28

  11. OpenRefine - Example 1: Merging 11 of 28

  12. OpenRefine - Example 2: Standardization Demo 12 of 28

  13. OpenRefine - Example 2: Standardization 13 of 28

  14. OpenRefine - Example 3: Filtering (Pre-Processing 1) 14 of 28

  15. OpenRefine - Example 3: Filtering (Pre-Processing 2) 15 of 28

  16. OpenRefine - Example 3: Filtering (Actual Filtering) 16 of 28

  17. Trifacta Web app ● Paid / Free (limited functionality, 100mb upload limit, 1gb download limit) ● Requirements: Chrome and at least 4gb ram (but also works with Firefox) ● Originally called Stanford DataWrangler ● Main features: ● Suggestions ○ Many transformation functions ○ Preview of transformations ○ Scheduling ○ Limitations: ● Online only ○ Showcase video: https://youtu.be/HvFGO-U86t8 17 of 28

  18. Trifacta - Example 1: Merging 18 of 28

  19. Trifacta - Example 2: Standardization 19 of 28

  20. Trifacta - Example 3: Filtering (Pre-Processing) 1. 3. 2. 20 of 28

  21. Trifacta - Example 3: Filtering (Actual Filtering) 21 of 28

  22. DataCleaner Standalone desktop application ● Paid commercial edition and free and open-source community edition (LGPL-3.0) ● Cross-platform ● First released in 2008 ● Main features: ● Data profiling (Discovering and analyzing quality of data) ○ Data wrangling (Transforming and cleaning data) ○ Community driven extensions ○ Limitations: ● In practice many errors and crashes ○ Unintuitive usage ○ Showcase video: https://youtu.be/bvLEYrTC6CY 22 of 28

  23. DataCleaner - Example 1: Merging 23 of 28

  24. DataCleaner - Example 2: Standardization 24 of 28

  25. DataCleaner - Example 3: Filtering 25 of 28

  26. Conclusion Summary Recommendation 26 of 28

  27. Conclusion Achieve high-quality data using data cleaning tools. ● Different use cases call for different tools. ● E.g. data analysis (DataCleaner), cleaning, transformation, ○ (OpenRefine/Trifacta)... Different user requirements call for different tools. ● E.g. data privacy (non-online tools), platform (cross-platform tools), input ○ formats, enterprise/private use (paid vs free), ... Some tools cater to almost all requirements. (OpenRefine) ● Others offer a subset. (Trifacta, Alteryx Designer, DataCleaner, ...) ● Look at feature matrix for quick comparison according to needs. ● 27 of 28

  28. Recommendation Tool Rating Limitations OpenRefine +++ Trifacta ++ Online only, paid Alteryx Designer + Windows only, paid DataCleaner - Breaks, unintuitive Honorable mentions: Tabula (PDF data extraction) ++ ● Potter’s Wheel (Pioneer) - ● Additional videos: Alteryx Designer, Tabula, Potter’s Wheel 28 of 28

  29. Thank you for your attention.

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend