Why do missing values exist?



  1. Why do missing values exist?
      FEATURE ENGINEERING FOR MACHINE LEARNING IN PYTHON
      Robert O'Callaghan, Director of Data Science, Ordergroove

  2. How gaps in data occur
      - Data not being collected properly
      - Collection and management errors
      - Data intentionally being omitted
      - Could be created due to transformations of the data
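
The last point on slide 2, gaps created by transformations, can be illustrated with a minimal sketch. The two small tables and their column names (`RespondentId`, `Salary`) are hypothetical and exist only for this example; a left join and a numeric coercion are two common transformations that introduce NaNs, not necessarily the ones the presenter had in mind.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
respondents = pd.DataFrame({"RespondentId": [1, 2, 3]})
salaries = pd.DataFrame({"RespondentId": [1, 3], "Salary": [70841.0, 21426.0]})

# A left join keeps every respondent; respondent 2 has no salary row,
# so the merged Salary column gets a NaN for that respondent.
merged = respondents.merge(salaries, on="RespondentId", how="left")
print(merged)

# Coercing text to numbers turns entries that cannot be parsed into NaN.
raw = pd.Series(["70841", "n/a", "21426"])
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed.isnull().sum())
```

In both cases the source data had no explicit null, yet the transformed result does.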

  3. Why do we care?
      - Some models cannot work with missing data (Nulls/NaNs)
      - Missing data may be a sign of a wider data issue
      - Missing data can be a useful feature

  4. Missing value discovery

          print(df.info())

          <class 'pandas.core.frame.DataFrame'>
          RangeIndex: 999 entries, 0 to 998
          Data columns (total 12 columns):
          SurveyDate                    999 non-null object
          ...
          StackOverflowJobsRecommend    487 non-null float64
          VersionControl                999 non-null object
          Gender                        693 non-null object
          RawSalary                     665 non-null object
          dtypes: float64(2), int64(2), object(8)
          memory usage: 93.7+ KB

  5. Finding missing values

          print(df.isnull())

             StackOverflowJobsRecommend  VersionControl  ...  \
          0                        True           False  ...
          1                       False           False  ...
          2                       False           False  ...
          3                        True           False  ...
          4                       False           False  ...

             Gender  RawSalary
          0   False       True
          1   False      False
          2    True       True
          3   False      False
          4   False      False

  6. Finding missing values

          print(df['StackOverflowJobsRecommend'].isnull().sum())

          512

  7. Finding non-missing values

          print(df.notnull())

             StackOverflowJobsRecommend  VersionControl  ...  \
          0                       False            True  ...
          1                        True            True  ...
          2                        True            True  ...
          3                       False            True  ...
          4                        True            True  ...

             Gender  RawSalary
          0    True      False
          1    True       True
          2   False      False
          3    True       True
          4    True       True
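
The single-column count from slide 6 extends naturally to every column at once. The sketch below assumes a tiny stand-in DataFrame (the values are invented; only the column names come from the slides) and shows both the count and the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data; values are invented.
df = pd.DataFrame({
    "VersionControl": ["Git", "Git", "Subversion", "Git"],
    "Gender": ["Male", None, "Female", None],
    "RawSalary": [np.nan, "70,841.00", np.nan, "21,426.00"],
})

# isnull() returns a boolean frame; summing counts the True entries per column.
missing_counts = df.isnull().sum()

# The mean of a boolean column is the fraction of True entries.
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

Sorting `missing_share` in descending order is a quick way to see which columns need attention first.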

  8. Go ahead and find missing values!

  9. Dealing with missing values (I)

  10. Listwise deletion

                SurveyDate  ConvertedSalary Hobby  ...  \
          0  2/28/18 20:20              NaN   Yes  ...
          1  6/28/18 13:26          70841.0   Yes  ...
          2    6/6/18 3:37              NaN    No  ...
          3    5/9/18 1:06          21426.0   Yes  ...
          4  4/12/18 22:41          41671.0   Yes  ...

  11. Listwise deletion in Python

          # Drop all rows with at least one missing value
          df.dropna(how='any')

  12. Listwise deletion in Python

          # Drop rows with missing values in a specific column
          df.dropna(subset=['VersionControl'])
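
A small runnable sketch of the two `dropna` calls from slides 11 and 12. The DataFrame is a hypothetical miniature of the survey data (values invented). Note that `dropna` returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey data; values are invented.
df = pd.DataFrame({
    "ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0],
    "Hobby": ["Yes", "Yes", "No", "Yes", "Yes"],
    "VersionControl": ["Git", None, "Git", "Git", "Git"],
})

# Listwise deletion: drop every row that has a missing value in any column.
complete_rows = df.dropna(how="any")

# Drop only the rows where VersionControl is missing.
has_version_control = df.dropna(subset=["VersionControl"])

# The original frame is unchanged because dropna returns a copy.
print(len(df), len(complete_rows), len(has_version_control))
```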

  13. Issues with deletion
      - It deletes valid data points
      - Relies on randomness
      - Reduces information
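
One way to gauge these issues before deleting anything is to measure how many rows would survive and whether the gaps look random. This is a rough sketch on invented data, reading "relies on randomness" as the assumption that values are missing at random; comparing missing rates across groups is only an informal check, not a statistical test.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "Hobby": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "ConvertedSalary": [70841.0, 21426.0, np.nan, np.nan, 41671.0, 55562.0],
})

# Share of rows that would remain after listwise deletion.
retained = len(df.dropna()) / len(df)

# Missing-salary rate within each Hobby group; very different rates
# hint that the gaps are not random with respect to Hobby.
missing_rate_by_hobby = df["ConvertedSalary"].isnull().groupby(df["Hobby"]).mean()

print(retained)
print(missing_rate_by_hobby)
```

Here every missing salary sits in the `No` group, so dropping those rows would skew the remaining data toward hobbyists.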

  14. Replacing with strings

          # Replace missing values in a specific column
          # with a given string
          df['VersionControl'].fillna(
              value='None Given', inplace=True
          )
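
In recent pandas versions, calling `fillna(..., inplace=True)` on a column selected from a DataFrame can emit a warning and, with Copy-on-Write enabled, may not update the parent DataFrame at all. A version-independent alternative is to assign the result back, as in this sketch on invented data.

```python
import pandas as pd

# Hypothetical column values, invented for illustration.
df = pd.DataFrame({"VersionControl": ["Git", None, "Subversion", None]})

# fillna returns a new Series; assigning it back updates the DataFrame
# without relying on inplace modification of a selected column.
df["VersionControl"] = df["VersionControl"].fillna("None Given")

print(df["VersionControl"].tolist())
```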

  15. Recording missing values

          # Record where the values are not missing
          df['SalaryGiven'] = df['ConvertedSalary'].notnull()

          # Drop a specific column (drop returns a new DataFrame,
          # so assign the result back)
          df = df.drop(columns=['ConvertedSalary'])
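
The indicator-column idea from slide 15 can be run end to end on a small invented DataFrame. Because `drop` returns a new DataFrame rather than modifying the existing one, the sketch assigns the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical salary values, invented for illustration.
df = pd.DataFrame({"ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0]})

# True where a salary was provided, False where it was missing.
df["SalaryGiven"] = df["ConvertedSalary"].notnull()

# Remove the original column, keeping only the indicator.
df = df.drop(columns=["ConvertedSalary"])

print(df)
```

Whether to drop the original column is a modeling choice; keeping both the indicator and a filled version of the column is also common.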

  16. Practice time

  17. Fill continuous missing values

  18. Deleting missing values
      - Can't delete rows with missing values in the test set

  19. What else can you do?
      - Categorical columns: Replace missing values with the most commonly occurring value or with a string that flags missing values such as 'None'
      - Numeric columns: Replace missing values with a suitable value
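
Both categorical options from slide 19 can be sketched in a few lines. The data is invented, and using `mode()` to find the most common value is one reasonable reading of "most commonly occurring value"; ties would need a deliberate choice, since `mode()` can return several values.

```python
import pandas as pd

# Hypothetical categorical column, invented for illustration.
df = pd.DataFrame({"Gender": ["Male", "Female", None, "Male", None]})

# mode() ignores missing values by default; take the first entry if there are ties.
most_common = df["Gender"].mode().iloc[0]

# Option 1: fill with the most common value.
filled_with_mode = df["Gender"].fillna(most_common)

# Option 2: fill with a string that flags the value as missing.
filled_with_flag = df["Gender"].fillna("None")

print(filled_with_mode.tolist())
print(filled_with_flag.tolist())
```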

  20. Measures of central tendency
      - Mean
      - Median

  21. Calculating the measures of central tendency

          print(df['ConvertedSalary'].mean())
          print(df['ConvertedSalary'].median())

          92565.16992481203
          55562.0

  22. Fill the missing values

          df['ConvertedSalary'] = df['ConvertedSalary'].fillna(
              df['ConvertedSalary'].mean()
          )
          df['ConvertedSalary'] = df['ConvertedSalary']\
              .astype('int64')

  23. Rounding values

          df['ConvertedSalary'] = df['ConvertedSalary'].fillna(
              round(df['ConvertedSalary'].mean())
          )
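
The mean and median fills from slides 21 to 23 can be compared side by side. The salaries below are invented, with one deliberately large value so that the mean lands above the median, similar in spirit to the gap shown on slide 21.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, invented for illustration; 100001.0 acts as a high outlier.
df = pd.DataFrame(
    {"ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0, 100001.0]}
)

# Both statistics skip missing values by default.
mean_salary = df["ConvertedSalary"].mean()
median_salary = df["ConvertedSalary"].median()

# Fill with the rounded mean, then cast to integers.
filled_mean = df["ConvertedSalary"].fillna(round(mean_salary)).astype("int64")

# Fill with the median, which is less sensitive to the outlier.
filled_median = df["ConvertedSalary"].fillna(median_salary).astype("int64")

print(mean_salary, median_salary)
print(filled_mean.tolist())
print(filled_median.tolist())
```

The casts to `int64` only succeed because no missing values remain after the fill; a float column that still contains NaN cannot be converted to a plain integer dtype.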

  24. Let's Practice!

  25. Dealing with other data issues

  26. Bad characters

          print(df['RawSalary'].dtype)

          dtype('O')

  27. Bad characters

          print(df['RawSalary'].head())

          0          NaN
          1    70,841.00
          2          NaN
          3    21,426.00
          4    41,671.00
          Name: RawSalary, dtype: object

  28. Dealing with bad characters

          df['RawSalary'] = df['RawSalary'].str.replace(',', '')
          df['RawSalary'] = df['RawSalary'].astype('float')
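
A runnable version of the comma clean-up on slide 28, using invented values in the same format as the slide 27 output. Passing `regex=False` is an addition here: it makes the replacement a literal string match regardless of the pandas version's default.

```python
import numpy as np
import pandas as pd

# Hypothetical values mirroring the format shown on slide 27.
df = pd.DataFrame(
    {"RawSalary": [np.nan, "70,841.00", np.nan, "21,426.00", "41,671.00"]}
)

# Remove thousands separators as a literal replacement, then convert to float.
# Missing entries stay missing through both steps.
df["RawSalary"] = df["RawSalary"].str.replace(",", "", regex=False)
df["RawSalary"] = df["RawSalary"].astype("float")

print(df["RawSalary"])
```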

  29. Finding other stray characters

          coerced_vals = pd.to_numeric(df['RawSalary'], errors='coerce')

  30. Finding other stray characters

          # Select the RawSalary column so the result is the Series shown below
          print(df.loc[coerced_vals.isna(), 'RawSalary'].head())

          0          NaN
          2          NaN
          4    $51408.00
          Name: RawSalary, dtype: object
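
The coercion trick from slides 29 and 30 flags both values that were already missing and values that failed to parse. The sketch below, on invented data, separates the two by also requiring the original value to be present; that extra filter is an addition for illustration, not something the slides do.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration; one entry has a stray "$".
df = pd.DataFrame(
    {"RawSalary": [np.nan, "70841.00", np.nan, "21426.00", "$51408.00"]}
)

# Entries that cannot be parsed become NaN instead of raising an error.
coerced_vals = pd.to_numeric(df["RawSalary"], errors="coerce")

# Keep only rows that had a value originally but failed to parse.
stray = df.loc[coerced_vals.isna() & df["RawSalary"].notna(), "RawSalary"]
print(stray)

# Once the offending character is known, strip it and convert.
df["RawSalary"] = pd.to_numeric(df["RawSalary"].str.replace("$", "", regex=False))
print(df["RawSalary"])
```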

  31. Chaining methods

          df['column_name'] = df['column_name'].method1()
          df['column_name'] = df['column_name'].method2()
          df['column_name'] = df['column_name'].method3()

      Same as:

          df['column_name'] = df['column_name']\
              .method1().method2().method3()
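
Applied to the salary clean-up, a chain might look like the following sketch. The values are invented, and wrapping the chain in parentheses is a stylistic alternative to the backslash continuation used on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration.
df = pd.DataFrame({"RawSalary": [np.nan, "$70,841.00", "21,426.00"]})

# Each method returns a new Series, so the calls can be chained
# and the final result assigned back once.
df["RawSalary"] = (
    df["RawSalary"]
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .astype("float")
)

print(df["RawSalary"])
```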

  32. Go ahead and fix bad characters!
