Where to Begin - FEATURE ENGINEERING WITH PYSPARK - PowerPoint PPT Presentation



SLIDE 1

Where to Begin

FEATURE ENGINEERING WITH PYSPARK

John Hogue

Lead Data Scientist, General Mills

SLIDE 2

Diving Straight to Analysis

Here be Monsters
Become your own expert
Define goals of analysis
Research your data
Be curious, ask questions

SLIDE 3

The Data Science Process

SLIDE 4

Spark changes fast and frequently

Latest documentation: https://spark.apache.org/docs/latest/
Specific version (2.3.1): https://spark.apache.org/docs/2.3.1/
Check your versions!

# Return the Spark version
spark.version

# Return the Python version
import sys
sys.version_info

SLIDE 5

Data Formats: Parquet

Data is supplied as Parquet files
Stored column-wise: fast to query column subsets
Structured, defined schema: fields and data types defined; great for messy text data
Industry adopted: good skill to have!
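The columnar advantage can be sketched without Parquet itself. This toy pure-Python comparison (field names and values are hypothetical, not the course data) shows why reading one column from column-wise storage is cheap:

```python
# Toy illustration of row-wise vs column-wise storage
# (not Parquet itself; field names are hypothetical).
rows = [
    {"MLSID": "A1", "LISTPRICE": 100000},
    {"MLSID": "B2", "LISTPRICE": 250000},
]
# Column-wise layout: one contiguous list per field
columns = {
    "MLSID": ["A1", "B2"],
    "LISTPRICE": [100000, 250000],
}
# Row storage: every record must be touched to collect one field
prices_from_rows = [r["LISTPRICE"] for r in rows]
# Column storage: the field is already materialized as one list
prices_from_columns = columns["LISTPRICE"]
assert prices_from_rows == prices_from_columns == [100000, 250000]
```

Parquet adds compression and schema metadata on top, but the column-subset speedup comes from exactly this layout difference.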

SLIDE 6

Getting the Data to Spark

PySpark read methods: PySpark supports many file types!

# JSON
spark.read.json('example.json')

# CSV or delimited files
spark.read.csv('example.csv')

# Parquet
spark.read.parquet('example.parq')

# Read a parquet file to a PySpark DataFrame
df = spark.read.parquet('example.parq')

SLIDE 7

Let's Practice!


SLIDE 8

Defining A Problem


John Hogue

Lead Data Scientist, General Mills

SLIDE 9

What’s Your Problem?

Predict the selling price of a house, given its listed price and features

X, independent 'known' variables: the listed price and features

Y, dependent 'unknown' variable: how much to buy the house for (SALESCLOSEPRICE)

SLIDE 10

Context & Limitations of our Real Estate

Homes sold in the St Paul, MN area, including several suburbs
Real estate types: Residential-Single, Residential-Multi-Family
A full year of data, to capture the impact of seasonality

SLIDE 11

What types of attributes are available?

Dates: Date Listed, Year Built
Location: City, School District, Address
Size: # Bedrooms & Bathrooms, Living Area
Price: List Price, Sales Closing Price
Amenities: Pool, Fireplace, Garage
Construction Materials: Siding, Roofing

SLIDE 12

Validating Your Data Load

DataFrame.count() for row count

df.count()
5000

DataFrame.columns for a list of columns

df.columns
['No.', 'MLSID', 'StreetNumberNumeric', ... ]

Length of DataFrame.columns for the number of columns

len(df.columns)
74
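These three checks can be bundled into one small load-validation step. A minimal sketch, assuming you already know the expected shape of the data; the helper name and the fake column list are hypothetical:

```python
# Hypothetical load-validation helper: row_count and columns mimic
# what df.count() and df.columns return in PySpark.
def validate_load(row_count, columns, expected_rows, expected_cols):
    """Return True only if the load matches the expected shape."""
    return row_count == expected_rows and len(columns) == expected_cols

# Mimicking the example above: 5000 rows, 74 columns
fake_columns = ['col%d' % i for i in range(74)]
assert validate_load(5000, fake_columns, expected_rows=5000, expected_cols=74)
# A truncated load (fewer rows than expected) fails the check
assert not validate_load(4999, fake_columns, expected_rows=5000, expected_cols=74)
```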

SLIDE 13

Checking Datatypes

DataFrame.dtypes

Returns a list of (column name, data type) tuples

df.dtypes
[('No.', 'integer'), ('MLSID', 'string'), ... ]
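A common next step is to pick out the columns of one type from that list, e.g. to apply string cleaning only where it makes sense. A minimal sketch over a toy dtypes list (the tuples below are assumed, not the full 74-column schema):

```python
# A df.dtypes-style list of (name, type) tuples; toy values assumed
dtypes = [('No.', 'integer'), ('MLSID', 'string'), ('LISTPRICE', 'integer')]

# Keep only the names of the string-typed columns
string_cols = [name for name, dtype in dtypes if dtype == 'string']
assert string_cols == ['MLSID']
```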

SLIDE 14

Let's Practice


SLIDE 15

Visually Inspecting Data


John Hogue

Lead Data Scientist, General Mills

SLIDE 16

Getting Descriptive with DataFrame.describe()

df.describe(['LISTPRICE']).show()
+-------+------------------+
|summary|         LISTPRICE|
+-------+------------------+
|  count|              5000|
|   mean|        263419.365|
| stddev|143944.10818036905|
|    min|            100000|
|    max|             99999|
+-------+------------------+
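Note the suspicious output: the max (99999) is smaller than the min (100000). One likely explanation is that the values were compared as strings, where ordering is lexicographic rather than numeric. A minimal sketch of the effect:

```python
# String comparison is character by character, so '9...' sorts
# after '2...' and '1...' regardless of numeric value.
prices = ['100000', '250000', '99999']
assert max(prices) == '99999'    # lexicographic "maximum"
assert min(prices) == '100000'   # lexicographic "minimum"

# Cast to numbers to get the numeric extremes
assert max(int(p) for p in prices) == 250000
assert min(int(p) for p in prices) == 99999
```

Whenever min/max look inverted like this, check the column's dtype before trusting the summary statistics.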

SLIDE 17

Many descriptive functions are already available

Mean

pyspark.sql.functions.mean(col)

Skewness

pyspark.sql.functions.skewness(col)

Minimum

pyspark.sql.functions.min(col)

Covariance

DataFrame.cov(col1, col2)

Correlation

DataFrame.corr(col1, col2)

SLIDE 18

Example with mean()

mean(col)

Aggregate function: returns the average (mean) of the values in a group.

df.agg({'SALESCLOSEPRICE': 'mean'}).collect()
[Row(avg(SALESCLOSEPRICE)=262804.4668)]
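The aggregate is easy to sanity-check locally on a small sample. A minimal sketch with made-up values (not the real SALESCLOSEPRICE data):

```python
# Mean by hand on toy values: sum divided by count, which is
# what the 'mean' aggregate computes per group.
values = [100000, 250000, 400000]
mean = sum(values) / len(values)
assert mean == 250000
```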

SLIDE 19

Example with cov()

cov(col1, col2)

Parameters:
col1: first column
col2: second column

df.cov('SALESCLOSEPRICE', 'YEARBUILT')
1281910.3840634783
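DataFrame.cov returns the sample covariance. The same quantity computed by hand on toy values (the sample convention divides by n - 1, not n):

```python
# Sample covariance by hand on toy values
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
mx = sum(xs) / len(xs)   # mean of xs
my = sum(ys) / len(ys)   # mean of ys
# Sum of co-deviations, divided by n - 1
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
assert cov == 2.0
```

A positive value, as with SALESCLOSEPRICE and YEARBUILT above, means the two variables tend to move in the same direction.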

SLIDE 20

seaborn: statistical data visualization

SLIDE 21

Notes on plotting

Plotting PySpark DataFrames with standard libraries like Seaborn requires conversion to Pandas.
WARNING: Sample PySpark DataFrames before converting to Pandas!

sample(withReplacement, fraction, seed=None)
withReplacement: allow repeats in the sample
fraction: % of records to keep
seed: random seed for reproducibility

# Sample 50% of the PySpark DataFrame and count rows
df.sample(False, 0.5, 42).count()
2504
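Why 2504 rows rather than exactly 2500? Each row is kept independently with probability `fraction`, so the resulting count is approximate; the seed is what makes the sample reproducible. A toy Bernoulli version of the same idea (pure Python, not Spark's actual sampler):

```python
import random

rows = list(range(5000))

# Keep each row independently with probability 0.5, seeded
rng = random.Random(42)
kept = [r for r in rows if rng.random() < 0.5]

# Roughly half the rows survive, but rarely exactly 2500
assert 2300 < len(kept) < 2700

# Same seed -> identical sample (reproducibility)
rng2 = random.Random(42)
kept2 = [r for r in rows if rng2.random() < 0.5]
assert kept == kept2
```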

SLIDE 22

Prepping for plotting a distribution

Seaborn distplot()

seaborn.distplot(a)
a: Series, 1d-array, or list. Observed data.

# Import your favorite visualization library
import seaborn as sns

# Sample the dataframe
sample_df = df.select(['SALESCLOSEPRICE']).sample(False, 0.5, 42)

# Convert the sample to a Pandas DataFrame
pandas_df = sample_df.toPandas()

# Plot it
sns.distplot(pandas_df)

SLIDE 23

Distribution plot of sales closing price

SLIDE 24

Relationship plotting

Seaborn lmplot()

seaborn.lmplot(x, y, data)
x, y: strings; input variables, should be column names in data
data: Pandas DataFrame

# Import your favorite visualization library
import seaborn as sns

# Select columns
s_df = df.select(['SALESCLOSEPRICE', 'SQFTABOVEGROUND'])

# Sample dataframe
s_df = s_df.sample(False, 0.5, 42)

# Convert to Pandas DataFrame
pandas_df = s_df.toPandas()

# Plot it
sns.lmplot(x='SQFTABOVEGROUND', y='SALESCLOSEPRICE', data=pandas_df)

SLIDE 25

Linear model plot between SQFT above ground and sales price

SLIDE 26

Let's practice!
