Dropping Data: Feature Engineering with PySpark (PowerPoint PPT Presentation)


  1. Dropping Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  2. Where can data go bad?
  Recorded wrong
  Unique events
  Formatted incorrectly
  Duplications
  Missing
  Not relevant

  3. Dropping Columns

  df.select(['NO', 'UNITNUMBER', 'CLASS']).show()
  +----+----------+-----+
  |  NO|UNITNUMBER|CLASS|
  +----+----------+-----+
  |   1|      null|   SF|
  | 156|        A8|   SF|
  | 157|       207|   SF|
  | 158|      701G|   SF|
  | 159|        36|   SF|

  Multiple fields are not needed for our analysis:
  'NO': auto-generated record number
  'UNITNUMBER': irrelevant data
  'CLASS': all constant

  4. Dropping Columns
  drop(*cols)
  *cols: a column name to drop, or a list of column names to drop.
  Returns a new DataFrame that drops the specified columns.

  # List of columns to drop
  cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']
  # Drop the columns
  df = df.drop(*cols_to_drop)

  5. Text Filtering
  where(condition)
  condition: a Column of types.BooleanType or a string of SQL expression.
  Filters the DataFrame to rows where the condition is true.
  like(other)
  other: a SQL LIKE pattern. Returns a boolean Column.
  ~ is the NOT condition.

  df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))
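A note on LIKE patterns: like() accepts SQL-style wildcards, where '%' matches any sequence of characters. A minimal sketch (assuming ROOF is a free-text column in the same dataset; the pattern values here are made up for illustration):

  # Keep records whose ROOF description contains 'Shingle' anywhere in the text
  df = df.where(df['ROOF'].like('%Shingle%'))
  # Combine like() with ~ to exclude matches instead
  df = df.where(~df['ROOF'].like('%Metal%'))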

  6. Outlier Filtering
  Filter data to within three standard deviations (3σ) of the mean (μ).

  7. Value Filtering Example

  # Calculate values used for filtering
  std_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]
  mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]

  # Create three standard deviation (μ ± 3σ) upper and lower bounds for data
  hi_bound = mean_val + (3 * std_val)
  low_bound = mean_val - (3 * std_val)

  # Use where() to filter the DataFrame between values
  df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))

  8. Dropping NAs or NULLs
  DataFrame.dropna()
  how: 'any' or 'all'. If 'any', drop a record if it contains any nulls. If 'all', drop a record only if all its values are null.
  thresh: int, default None. If specified, drop records that have less than thresh non-null values. This overwrites the how parameter.
  subset: optional list of column names to consider.

  9. Dropping NAs or NULLs

  # Drop any records with NULL values
  df = df.dropna()

  # Drop records only if both LISTPRICE and SALESCLOSEPRICE are NULL
  df = df.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE'])

  # Drop records with fewer than two non-null values
  df = df.dropna(thresh=2)

  10. Dropping Duplicates
  What is a duplicate? Two or more records that contain all the same information.
  After dropping columns or joining datasets, check for duplicates.
  dropDuplicates()
  Can be run across the entire DataFrame or a list of columns.
  In PySpark there is no guarantee of which duplicate record is removed.

  # Entire DataFrame
  df.dropDuplicates()
  # Check only a column list
  df.dropDuplicates(['streetaddress'])
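Because dropDuplicates() makes no guarantee about which duplicate survives, a common workaround (not part of the slides) is to rank duplicates with a window and keep the preferred one. A sketch, assuming we want the most recent record per street address by OFFMARKETDATE:

  from pyspark.sql import Window
  from pyspark.sql.functions import col, row_number

  # Rank records within each street address, newest OFFMARKETDATE first
  w = Window.partitionBy('streetaddress').orderBy(col('OFFMARKETDATE').desc())
  # Keep only the top-ranked record per address, then drop the helper column
  df = df.withColumn('rn', row_number().over(w)).where(col('rn') == 1).drop('rn')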

  11. Let's practice!

  12. Adjusting Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  13. Why Transform Data?

  14. What is MinMax Scaling?
  Rescales values to the range [0, 1]: x_scaled = (x - min) / (max - min)

  15. MinMax Scaling

  # Define min and max values and collect them
  max_days = df.agg({'DAYSONMARKET': 'max'}).collect()[0][0]
  min_days = df.agg({'DAYSONMARKET': 'min'}).collect()[0][0]

  # Create a new column based off the scaled data
  df = df.withColumn("scaled_days",
                     (df['DAYSONMARKET'] - min_days) / (max_days - min_days))
  df[['scaled_days']].show(5)

  +--------------------+
  |         scaled_days|
  +--------------------+
  |0.044444444444444446|
  |0.017777777777777778|
  | 0.12444444444444444|
  | 0.08444444444444445|
  | 0.09333333333333334|
  +--------------------+
  only showing top 5 rows
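The slide scales one column by hand; pyspark.ml also ships a MinMaxScaler that works on a vector column, which can be handier when scaling many features at once. A sketch of that alternative (column names follow the slide; the output column names are arbitrary):

  from pyspark.ml.feature import VectorAssembler, MinMaxScaler

  # Pack the numeric column into a single vector column
  assembler = VectorAssembler(inputCols=['DAYSONMARKET'], outputCol='features')
  assembled = assembler.transform(df)

  # Fit the scaler, then transform to rescale values into [0, 1]
  scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features')
  scaled = scaler.fit(assembled).transform(assembled)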

  16. What is Standardization?
  Transform data to a standard normal distribution: z = (x - μ) / σ
  Mean, μ, of 0
  Standard deviation, σ, of 1

  17. Standardization

  mean_days = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
  stddev_days = df.agg({'DAYSONMARKET': 'stddev'}).collect()[0][0]

  # Create a new column with the scaled data
  df = df.withColumn("ztrans_days",
                     (df['DAYSONMARKET'] - mean_days) / stddev_days)

  df.agg({'ztrans_days': 'mean'}).collect()
  [Row(avg(ztrans_days)=-3.6568525985103407e-16)]
  df.agg({'ztrans_days': 'stddev'}).collect()
  [Row(stddev(ztrans_days)=1.0000000000000009)]

  18. What is Log Scaling?
  [Figure: unscaled distribution vs. log-scaled distribution]

  19. Log Scaling

  # Import the log function
  from pyspark.sql.functions import log

  # Calculate the log of SALESCLOSEPRICE
  df = df.withColumn('log_SalesClosePrice', log(df['SALESCLOSEPRICE']))
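If a model is later trained on the log-scaled target, its predictions can be mapped back to the original price scale with exp(), the inverse of the natural log. A minimal sketch (the preds DataFrame and its 'prediction' column are hypothetical, not from the slides):

  from pyspark.sql.functions import exp

  # Undo the natural log so results are interpretable in dollars again
  preds = preds.withColumn('pred_SalesClosePrice', exp(preds['prediction']))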

  20. Let's practice!

  21. Working with Missing Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  22. How does data go missing in the digital age?
  Data collection: broken sensors
  Data storage rules: 2017-01-01 vs. January 1st, 2017
  Joining disparate data: monthly to weekly
  Intentionally missing: privacy concerns

  23. Types of Missing
  Missing completely at random: missing data is just a completely random subset
  Missing at random: missing conditionally at random, based on another observation
  Missing not at random: data is missing because of how it is collected

  24. Assessing Missing Values
  isNull(): True if the current expression is null.

  df.where(df['ROOF'].isNull()).count()
  765

  When to drop rows with missing data?
  Missing values are rare
  Missing completely at random
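To judge whether missing values really are rare, it helps to look at the share of nulls per column rather than a single count. A sketch of one way to do that (the column list is just an example):

  from pyspark.sql.functions import col, sum as spark_sum

  total = df.count()
  # Fraction of null values in each listed column
  missing_rates = df.select([
      (spark_sum(col(c).isNull().cast('int')) / total).alias(c)
      for c in ['ROOF', 'LISTPRICE', 'DAYSONMARKET']
  ])
  missing_rates.show()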

  25. Plotting Missing Values

  # Import library
  import seaborn as sns

  # Subset the dataframe
  sub_df = df.select(['ROOMAREA1'])
  # Sample the dataframe
  sample_df = sub_df.sample(False, .5, 4)
  # Convert to a Pandas DataFrame
  pandas_df = sample_df.toPandas()
  # Plot it
  sns.heatmap(data=pandas_df.isnull())

  26. Missing Values Heatmap

  27. Imputation of Missing Values
  The process of replacing missing values.
  Rule-based: value based on business logic
  Statistics-based: using the mean, median, etc.
  Model-based: use a model to predict the value
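For a statistics-based fill that is more robust to outliers than the mean, the (approximate) median can be used. A sketch using DataFrame.approxQuantile, assuming DAYSONMARKET is the column being imputed:

  # approxQuantile(column, probabilities, relativeError) returns a list of quantiles
  col_median = df.approxQuantile('DAYSONMARKET', [0.5], 0.01)[0]

  # Replace missing values with the median
  df = df.fillna(col_median, subset=['DAYSONMARKET'])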

  28. Imputation of Missing Values
  fillna(value, subset=None)
  value: the value to replace missings with
  subset: the list of column names in which to replace missings

  # Replacing missing values with zero
  df = df.fillna(0, subset=['DAYSONMARKET'])

  # Replacing with the mean value for that column
  col_mean = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
  df = df.fillna(col_mean, subset=['DAYSONMARKET'])

  29. Let's practice!

  30. Getting More Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  31. Thoughts on External Data Sets
  PROS                        | CONS
  Add important predictors    | May 'bog' the analysis down
  Supplement / replace values | Easy to induce data leakage
  Cheap or easy to obtain     | Must become a data set subject matter expert

  32. About Joins
  Orienting our data directions:
  Left: our starting data set
  Right: new data set to incorporate

  33. PySpark DataFrame Joins

  DataFrame.join(
      other,     # Other DataFrame to merge
      on=None,   # The keys to join on
      how=None)  # Type of join to perform (default is 'inner')

  34. PySpark Join Example

  # Inspect dataframe head
  hdf.show(2)
  +----------+--------------------+
  |        dt|                  nm|
  +----------+--------------------+
  |2012-01-02|        New Year Day|
  |2012-01-16|Martin Luther Kin...|
  +----------+--------------------+
  only showing top 2 rows

  # Specify join condition
  cond = [df['OFFMARKETDATE'] == hdf['dt']]
  # Join hdf onto df
  df = df.join(hdf, on=cond, how='left')

  # How many sales occurred on bank holidays?
  df.where(~df['nm'].isNull()).count()
  0

  35. SparkSQL Join
  Apply SQL to your dataframe.

  # Register the dataframes as temp tables
  df.createOrReplaceTempView("df")
  hdf.createOrReplaceTempView("hdf")

  # Write a SQL statement
  sql_df = spark.sql("""
      SELECT *
      FROM df
      LEFT JOIN hdf ON df.OFFMARKETDATE = hdf.dt
      """)

  36. Let's Join Some Data!
