Handling missing data


  1. Handling missing data. PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN PYTHON. Lisa Stuart, Data Scientist

  2. Prerequisites: Supervised Learning with scikit-learn; Unsupervised Learning in Python

  3. Course outline. Chapter 1: Pre-processing and Visualization (missing data, outliers, normalization). Chapter 2: Supervised Learning (feature selection, regularization, feature engineering). Chapter 3: Unsupervised Learning (cluster algorithm selection, feature extraction, dimension reduction). Chapter 4: Model Selection and Evaluation (model generalization and evaluation, model selection).

  4. Machine learning (ML) pipeline

  5. Our ML pipeline

  6. Missing data: impact of different techniques, finding missing values, strategies to handle them

  7. Techniques
     1. Omission
        Removal of rows --> .dropna(axis=0)
        Removal of columns --> .dropna(axis=1)
     2. Imputation
        Fill with zero --> SimpleImputer(strategy='constant', fill_value=0)
        Impute mean --> SimpleImputer(strategy='mean')
        Impute median --> SimpleImputer(strategy='median')
        Impute mode --> SimpleImputer(strategy='most_frequent')
        Iterative imputation --> IterativeImputer()
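
A minimal sketch of these two approaches on a made-up DataFrame (the column names and values are illustrative, not from the course):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

df = pd.DataFrame({'age': [25, np.nan, 40, 31], 'income': [50000, 62000, np.nan, 58000]})

# Omission: drop rows (axis=0) or columns (axis=1) that contain missing values
rows_dropped = df.dropna(axis=0)
cols_dropped = df.dropna(axis=1)

# Imputation: fill with a constant, the mean, the median, or the most frequent value
zero_filled = SimpleImputer(strategy='constant', fill_value=0).fit_transform(df)
mean_filled = SimpleImputer(strategy='mean').fit_transform(df)
median_filled = SimpleImputer(strategy='median').fit_transform(df)
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(df)

# Iterative imputation: model each feature with missing values from the other features
iterative_filled = IterativeImputer().fit_transform(df)
```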

  8. Why bother? To reduce the probability of introducing bias, and because most ML algorithms require complete data.

  9. Effects of imputation depend on: the missing values, the original variance, the presence of outliers, and the size and direction of skew.
     Omission --> can remove too much
     Zero --> biases results downward
     Mean --> affected more by outliers
     Median --> better in case of outliers
     Mode and iterative imputation --> try them out
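
A small, made-up illustration of why the choice matters: the same series filled three different ways shifts the mean and variance differently (the values are arbitrary).

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 3.0, np.nan, 4.0, 100.0])  # one missing value, one outlier

for label, fill in [('zero', 0.0), ('mean', s.mean()), ('median', s.median())]:
    filled = s.fillna(fill)
    print(f"{label:>6}: mean={filled.mean():.1f}, variance={filled.var():.1f}")

# Zero drags the mean downward, the mean fill is inflated by the outlier,
# and the median fill is the most robust of the three here.
```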

  10. Function returns
      df.isna().sum()                       --> number missing per column
      df['feature'].mean()                  --> feature mean
      .shape                                --> row, column dimensions
      df.columns                            --> column names
      .fillna(0)                            --> fills missing values with 0
      .select_dtypes(include=[np.number])   --> numeric columns
      .select_dtypes(include=['object'])    --> string columns
      .fit_transform(numeric_cols)          --> fits and transforms
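
The helpers above chained together on a tiny made-up DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'feature': [1.0, np.nan, 3.0], 'label': ['a', 'b', None]})

print(df.isna().sum())        # number of missing values per column
print(df['feature'].mean())   # feature mean (NaNs are skipped)
print(df.shape)               # (rows, columns)
print(df.columns)             # column names
print(df.fillna(0))           # every missing value replaced with 0

numeric_cols = df.select_dtypes(include=[np.number])   # numeric columns only
string_cols = df.select_dtypes(include=['object'])     # string columns only

imputed = SimpleImputer(strategy='mean').fit_transform(numeric_cols)  # fit and transform in one step
```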

  11. Effects of missing values. What are the effects of missing values in a Machine Learning (ML) setting? Select the answer that is true:
      - Missing values aren't a problem since most of sklearn's algorithms can handle them.
      - Removing observations or features with missing values is generally a good idea.
      - Missing data tends to introduce bias that leads to misleading results, so they cannot be ignored.
      - Filling missing values with zero will bias results upward.

  12. Effects of missing values: answer. The correct answer is: Missing data tends to introduce bias that leads to misleading results, so they cannot be ignored. (Filling missing values by testing which strategy impacts the variance of a given dataset the least is the best approach.)

  13. Effects of missing values: incorrect answers
      - Missing values aren't a problem... (Most of sklearn's algorithms cannot handle missing values and will throw an error.)
      - Removing observations or features with missing values... (Unless your dataset is large and the proportion of missing values small, removing rows or columns with missing data usually shrinks your dataset too much to be useful in subsequent ML.)
      - Filling missing values with zero will bias results upward. (It's the opposite: filling with zero will bias results downward.)

  14. Let's practice! PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN PYTHON

  15. Data distributions and transformations. PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN PYTHON. Lisa Stuart, Data Scientist

  16. Different distributions. [Figure: training and test data sets drawn from different distributions.] Source: https://www.researchgate.net/figure/Bias-Training-and-test-data-sets-are-drawn-from-different-distributions_fig22_330485084

  17. Train/test split
      train, test = train_test_split(X, test_size=0.3)                          --> split a single array or DataFrame
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  --> split features and target together
      sns.pairplot()                                                            --> plot matrix of distributions and scatterplots
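
A minimal sketch of a 70/30 split followed by a pairplot of the training features (the feature matrix and column names here are made up):

```python
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'f1': range(10), 'f2': [v ** 2 for v in range(10)]})
y = pd.Series([0, 1] * 5)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pairplot: distributions on the diagonal, pairwise scatterplots off the diagonal
sns.pairplot(X_train)
```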

  18. Data transformation. [Figure: example of the effect of a log transformation on the distribution of a dataset.] Source: https://www.researchgate.net/figure/Example-of-the-effect-of-a-log-transformation-on-the-distribution-of-the-dataset_fig20_308007227

  19. Box-Cox transformations: scipy.stats.boxcox(data, lmbda=p)
      lmbda (p)    transform               name
      -2           x^-2 = 1/x^2            reciprocal square
      -1           x^-1 = 1/x              reciprocal
      -0.5         x^-0.5 = 1/sqrt(x)      reciprocal square root
      0.0          log(x)                  log
      0.5          x^0.5 = sqrt(x)         square root
      1            x^1 = x                 no transform
      2            x^2                     square
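
A minimal sketch of scipy's Box-Cox on made-up right-skewed (strictly positive) data; passing lmbda applies a fixed power from the table above, while omitting it estimates lambda by maximum likelihood:

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(0).exponential(scale=2.0, size=1000)  # strictly positive, right-skewed

log_transformed = stats.boxcox(data, lmbda=0.0)    # lmbda=0.0 is the log transform
sqrt_transformed = stats.boxcox(data, lmbda=0.5)   # lmbda=0.5 is the square-root transform
transformed, lmbda = stats.boxcox(data)            # no lmbda: estimate it by maximum likelihood

print(f"estimated lambda: {lmbda:.2f}")
```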

  20. Let's practice! PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN PYTHON

  21. Data outliers and scaling. PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN PYTHON. Lisa Stuart, Data Scientist

  22. Outliers: one or more observations that are distant from the rest of the observations in a given feature. Source: https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/

  23. Inter-quartile range (IQR). [Figure: boxplot and IQR.] By Jhguch at en.wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285
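
A minimal sketch of the usual IQR rule, flagging points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the data and the 1.5 multiplier are the conventional illustration, not from the course):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional outlier fences

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # flags only the 95
```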

  24. Line of best fit. Source: https://www.r-bloggers.com/outlier-detection-and-treatment-with-r/

  25. Outlier functions
      sns.boxplot(x= , y='Loan Status')        --> box plot conditioned on the target variable
      sns.distplot()                           --> histogram and kernel density estimate (KDE)
      np.abs()                                 --> absolute value
      stats.zscore()                           --> calculated z-score
      mstats.winsorize(limits=[0.05, 0.05])    --> floor and ceiling applied to outliers
      np.where(condition, true, false)         --> replaced values
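
A minimal sketch combining the numeric helpers above: absolute z-scores to flag outliers, winsorizing to cap them, and np.where to replace them (the data are made up: 19 typical points plus one extreme value).

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=12, scale=1.5, size=19), 95.0)

z = np.abs(stats.zscore(values))                            # absolute z-score of each observation
winsorized = mstats.winsorize(values, limits=[0.05, 0.05])  # cap the bottom and top 5% of values
replaced = np.where(z > 3, np.median(values), values)       # swap flagged points for the median

print(values[z > 3])  # only the injected 95.0 is flagged
```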

  26. High vs low variance. Source: https://machinelearningmastery.com/a-gentle-introduction-to-calculating-normal-summary-statistics/

  27. Standardization vs normalization
      Standardization: Z-score standardization; scales to mean 0 and standard deviation 1.
      Normalization: min/max normalizing; scales to between (0, 1).
      Source: https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc

  28. Scaling functions
      sklearn.preprocessing.StandardScaler() --> (mean=0, sd=1)
      sklearn.preprocessing.MinMaxScaler()   --> (0, 1)
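
A minimal sketch of both scalers on a made-up two-column feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])

standardized = StandardScaler().fit_transform(X)  # each column rescaled to mean 0, sd 1
normalized = MinMaxScaler().fit_transform(X)      # each column rescaled into [0, 1]

print(standardized.mean(axis=0).round(2), standardized.std(axis=0).round(2))
print(normalized.min(axis=0), normalized.max(axis=0))
```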

  29. Outliers and scaling. How should outliers be identified and properly dealt with? What result does min/max or z-score standardization have on data? Select the statement that is true:
      - An outlier is a point that is just outside the range of similar points in a feature.
      - In a given context, outliers considered anomalous are helpful in building a predictive ML model.
      - Min/max scaling gives data a mean of 0, an SD of 1, and increases variance.
      - Z-score standardization scales data to be in the interval (0, 1) and improves model fit.

  30. Outliers and scaling: answer. The correct answer is: In a given context, outliers considered anomalous are helpful in building a predictive ML model. (Data anomalies are common in fraud detection, cybersecurity events, and other scenarios where the goal is to find them.)
