data cleaning

Data Cleaning February 6, 2020 Data Science CSCI 1951A Brown - PowerPoint PPT Presentation

Data Cleaning February 6, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter Thanks to C. Binning for some stolen slides. :) 1 Announcements Assignment 1: down! Assignment


  1. Data Cleaning February 6, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter Thanks to C. Binning for some stolen slides. :) 1

  2. Announcements • Assignment 1: down! Assignment 2: up! • Projects: • Let me know by today at 10:20 if you want to be N != 4 • Being thinking about your project data…the first deliverable is not just a “ceremonial” checkpoint • “Will we know how to do X by in time?” —> maybe/ probably/probably not but you should do it regardless! 2

  3. Today • 45 minutes—let’s just see how far we get…. • Problems with dirty data • Cleaning and string matching heuristics • Monday: bash commands (come with a command line…if you don’t know what that means, ask me) 3

  4. ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … 4

  5. Problems? ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … 5

  6. Inconsistent Problems? Representations ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … 6

  7. Inconsistent Problems? Representations ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … Missing Values 7

  8. Inconsistent Problems? Representations ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … Typos Missing Values 8

  9. Inconsistent Duplicates Problems? Representations ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … Typos Missing Values 9

  10. Inconsistent Duplicates Problems? Representations ID Name Street City State Zip Hours 1 N Aldroubi 123 University Ave Providence RI 98106 42 2 Natalie Delworth 245 3rd St Pawtucket RI 98052-1234 30 3 Nam Do 345 Broadway PVD Rhode Island 98101 19 4 N Dellworth 245 Third Street Pawtucket NULL 98052 299 5 Do Nam 345 Broadway St Providnce Rhode Island 98101 19 6 Nazem Aldroubi 123 Univ Ave PVD Rhode Island NULL 41 7 Minna Kimura-T 123 University Ave Providence Guyana 94305 NULL … Typos Missing Values Maybe Duplicates? 10

  11. Dirty Data… • Data is dirty on its own • Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions) • Data doesn’t “age well” (inflation, restricting) • Any combination of the above 11

  12. Dirty Data… • Data is dirty on its own • Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions) • Data doesn’t “age well” (inflation, restricting) • Any combination of the above 12

  13. Dirty Data… • Data is dirty on its own • Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions) • Data doesn’t “age well” (inflation, restricting) • Any combination of the above 13

  14. Dirty Data… • Data is dirty on its own • Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions) • Data doesn’t “age well” (inflation, redistricting) • Any combination of the above 14

  15. Dirty Data… • Data is dirty on its own • Data sets are clean on their own but combining them introduces errors (e.g. duplicates, different naming conventions) • Data doesn’t “age well” (inflation, redistricting) • Any combination of the above 15

  16. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 16

  17. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 17

  18. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 18

  19. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 19

  20. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 20

  21. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 21

  22. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 22

  23. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 23

  24. Dirty Data… • Parsing input data (e.g., separator issues) • Naming conventions: NYC vs New York • Formatting issues – esp. dates • Missing values and required fields (e.g., always use 0) • Different representations (2 vs Two) • Fields too long (get truncated) • Primary key violations (from data merging) • Redundant Records (from data merging) 24

  25. Clicker Questions! 25

Recommend


More recommend