Where to Begin: Feature Engineering with PySpark

  1. Where to Begin
     Feature Engineering with PySpark
     John Hogue, Lead Data Scientist, General Mills

  2. Diving Straight to Analysis
     Here be Monsters
       Become your own expert
       Define goals of analysis
       Research your data
       Be curious, ask questions

  3. The Data Science Process

  4. Spark changes fast and frequently
     Latest documentation: https://spark.apache.org/docs/latest/
     Specific version (2.3.1): https://spark.apache.org/docs/2.3.1/
     Check your versions!
         # return spark version
         spark.version
         # return python version
         import sys
         sys.version_info
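One way to act on the version check is to compare the reported version against the one the documentation was written for. The snippet below is a minimal sketch in plain Python; `version_tuple` and `meets_minimum` are hypothetical helper names, not part of PySpark, and in a live session the first argument would be the string returned by `spark.version`.

```python
import itertools
import sys


def version_tuple(version):
    """Turn '2.3.1' into (2, 3, 1).

    Only the leading digits of each dot-separated piece are kept, so a
    suffix such as '-SNAPSHOT' is ignored. Parsing stops at the first
    piece that does not start with a digit.
    """
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version, minimum):
    """True when `version` is at least `minimum`, compared number by number."""
    return version_tuple(version) >= version_tuple(minimum)


# Hypothetical usage: a literal string stands in for spark.version here.
print(meets_minimum("2.3.1", "2.3.0"))
# sys.version_info already compares like a tuple.
print(sys.version_info >= (3,))
```

Comparing tuples of integers rather than raw strings avoids the trap where "2.10.0" sorts before "2.3.1" alphabetically.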

  5. Data Formats: Parquet
     Data is supplied as Parquet
     Stored Column-wise
       Fast to query column subsets
     Structured, defined schema
       Fields and Data Types defined
       Great for messy text data
     Industry Adopted
       Good skill to have!

  6. Getting the Data to Spark
     PySpark read methods
     PySpark supports many file types!
         # JSON
         spark.read.json('example.json')
         # CSV or delimited files
         spark.read.csv('example.csv')
         # Parquet
         spark.read.parquet('example.parq')
         # Read a parquet file to a PySpark DataFrame
         df = spark.read.parquet('example.parq')
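Since the three read methods on this slide share the same shape, a small dispatcher can pick one from the file extension. This is only a sketch: `READERS` and `read_any` are hypothetical names introduced for illustration, and the extension-to-format mapping is an assumption rather than anything Spark enforces. The function expects an object with a `read` attribute, such as a SparkSession.

```python
import os

# Hypothetical mapping from file extension to the name of a spark.read method.
READERS = {
    ".json": "json",
    ".csv": "csv",
    ".parq": "parquet",
    ".parquet": "parquet",
}


def read_any(spark, path):
    """Call spark.read.<format>(path), choosing the format from the extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"no reader configured for {ext!r}")
    # Look up the reader method by name, e.g. spark.read.parquet.
    return getattr(spark.read, READERS[ext])(path)
```

With a live session this would be called as `df = read_any(spark, 'example.parq')`. Options such as headers or delimiters for CSV files are deliberately left out of the sketch.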

  7. Let's Practice!

  8. Defining A Problem

  9. What's Your Problem?
     Predict the selling price of a house
       Given its listed price and features
         X, independent 'known' variables
       How much to buy the house for
         Y, dependent 'unknown' variable
         SALESCLOSEPRICE

  10. Context & Limitations of our Real Estate
      Homes sold
        St Paul, MN Area
        Includes several suburbs
      Real Estate Types
        Residential - Single
        Residential - Multi-Family
      Full Year of Data
        Impact of seasonality

  11. What types of attributes are available?
      Dates: Date Listed, Year Built
      Price: List Price, Sales Closing Price
      Location: City, School District, Address
      Amenities: Pool, Fireplace, Garage
      Size: # Bedrooms & Bathrooms, Living Area
      Construction Materials: Siding, Roofing

  12. Validating Your Data Load
      DataFrame.count() for row count
          df.count()
          5000
      DataFrame.columns for a list of columns
          df.columns
          ['No.', 'MLSID', 'StreetNumberNumeric', ... ]
      Length of DataFrame.columns for the number of columns
          len(df.columns)
          74

  13. Checking Datatypes
      DataFrame.dtypes
      Creates a list of (column name, data type) tuples
          df.dtypes
          [('No.', 'int'), ('MLSID', 'string'), ... ]
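The row count and the dtypes list can be combined into one load check. The sketch below assumes only that the object exposes `count()` and `dtypes` the way a PySpark DataFrame does. `validate_load` and `StandInFrame` are hypothetical names, and the expected types in the usage lines are illustrative assumptions, not the real schema of the housing data.

```python
def validate_load(df, expected_rows, expected_dtypes):
    """Return a list of problems found; an empty list means the load looks fine."""
    problems = []
    rows = df.count()
    if rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {rows}")
    actual = dict(df.dtypes)  # [(column, type), ...] -> {column: type}
    for column, wanted in expected_dtypes.items():
        found = actual.get(column)
        if found is None:
            problems.append(f"missing column {column}")
        elif found != wanted:
            problems.append(f"{column}: expected {wanted}, found {found}")
    return problems


class StandInFrame:
    """Hypothetical stand-in for a DataFrame so the sketch runs without Spark."""

    def __init__(self, n_rows, dtypes):
        self._n_rows = n_rows
        self.dtypes = dtypes

    def count(self):
        return self._n_rows


# Made-up schema for illustration only.
frame = StandInFrame(5000, [("No.", "int"), ("MLSID", "string")])
print(validate_load(frame, 5000, {"No.": "int", "MLSID": "string"}))
print(validate_load(frame, 5000, {"MLSID": "int", "LISTPRICE": "int"}))
```

Returning a list of messages instead of raising on the first mismatch shows every problem with a load at once.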

  14. Let's Practice

  15. Visually Inspecting Data

  16. Getting Descriptive with DataFrame.describe()
          df.describe(['LISTPRICE']).show()
          +-------+------------------+
          |summary|         LISTPRICE|
          +-------+------------------+
          |  count|              5000|
          |   mean|        263419.365|
          | stddev|143944.10818036905|
          |    min|            100000|
          |    max|             99999|
          +-------+------------------+

  17. Many descriptive functions are already available
      Mean: pyspark.sql.functions.mean(col)
      Skewness: pyspark.sql.functions.skewness(col)
      Minimum: pyspark.sql.functions.min(col)
      Covariance: cov(col1, col2)
      Correlation: corr(col1, col2)

  18. Example with mean()
      mean(col)
      Aggregate function: returns the average (mean) of the values in a group.
          df.agg({'SALESCLOSEPRICE': 'mean'}).collect()
          [Row(avg(SALESCLOSEPRICE)=262804.4668)]

  19. Example with cov()
      cov(col1, col2)
      Parameters:
        col1 – first column
        col2 – second column
          df.cov('SALESCLOSEPRICE', 'YEARBUILT')
          1281910.3840634783
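For intuition about the number `cov()` returns: the PySpark documentation describes `DataFrame.cov` as the sample covariance of two columns. The sketch below writes that formula out in plain Python on made-up numbers, not the housing data. `sample_cov` is a hypothetical helper name.

```python
def sample_cov(xs, ys):
    """Sample covariance: sum of paired deviations from the means, over n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both columns need the same number of values")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Made-up values: y rises by 2 whenever x rises by 1.
print(sample_cov([1, 2, 3], [2, 4, 6]))  # 2.0
```

Covariance carries the units of both columns multiplied together, which is why the value for a price and a year is hard to read on its own. Correlation rescales it to the range -1 to 1.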

  20. seaborn: statistical data visualization

  21. Notes on plotting
      Plotting PySpark DataFrames using standard libraries like Seaborn requires conversion to Pandas
      WARNING: Sample PySpark DataFrames before converting to Pandas!
      sample(withReplacement, fraction, seed=None)
        withReplacement: allow repeats in sample
        fraction: fraction of records to keep (0.5 keeps roughly 50%)
        seed: random seed for reproducibility
          # Sample 50% of the PySpark DataFrame and count rows
          df.sample(False, 0.5, 42).count()
          2504
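The count of 2504 rather than exactly 2500 follows from how sampling without replacement behaves: the PySpark documentation notes that `sample` is not guaranteed to return exactly the requested fraction. The sketch below is a simplified stand-in in plain Python, not Spark's implementation, so it will not reproduce 2504. `bernoulli_sample` is a hypothetical name.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [row for row in rows if rng.random() < fraction]


rows = list(range(5000))
first = bernoulli_sample(rows, 0.5, seed=42)
second = bernoulli_sample(rows, 0.5, seed=42)

# The size lands near, but not necessarily on, 2500.
print(len(first))
# Same seed, same sample.
print(first == second)
```

Because each row is decided independently, the sample size varies around `fraction * len(rows)`. Fixing the seed makes the result repeatable but does not make the size exact.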

  22. Prepping for plotting a distribution
      Seaborn distplot()
      seaborn.distplot(a)
        a: Series, 1d-array, or list. Observed data.
          # Import your favorite visualization library
          import seaborn as sns
          # Sample the dataframe
          sample_df = df.select(['SALESCLOSEPRICE']).sample(False, 0.5, 42)
          # Convert the sample to a Pandas DataFrame
          pandas_df = sample_df.toPandas()
          # Plot it
          sns.distplot(pandas_df)

  23. Distribution plot of sales closing price

  24. Relationship plotting
      Seaborn lmplot()
      seaborn.lmplot(x, y, data)
        x, y: strings, Input variables; these should be column names in data.
        data: Pandas DataFrame
          # Import your favorite visualization library
          import seaborn as sns
          # Select columns
          s_df = df.select(['SALESCLOSEPRICE', 'SQFTABOVEGROUND'])
          # Sample dataframe
          s_df = s_df.sample(False, 0.5, 42)
          # Convert to Pandas DataFrame
          pandas_df = s_df.toPandas()
          # Plot it
          sns.lmplot(x='SQFTABOVEGROUND', y='SALESCLOSEPRICE', data=pandas_df)
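By default `lmplot()` draws a scatter plot with a fitted straight line through it. As a rough illustration of what such a line is, the sketch below computes an ordinary least-squares slope and intercept in plain Python. It uses made-up numbers rather than the housing data, `fit_line` is a hypothetical helper, and it is not how seaborn computes its fit internally.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept through paired points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long columns with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

On the plot for this slide, the slope would read as the change in sales price per additional square foot above ground.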

  25. Linear model plot between SQFT above ground and sales price

  26. Let's practice!
