Feature Generation
FEATURE ENGINEERING WITH PYSPARK
John Hogue
Lead Data Scientist, General Mills
Why generate new features?
- Multiplying
- Summing
- Differencing
- Dividing
Multiplication
# Create a new feature, area, by multiplying width and length
df = df.withColumn('TSQFT', (df['WIDTH'] * df['LENGTH']))
from pyspark.sql.functions import datediff

# Sum two columns
df = df.withColumn('TSQFT', (df['SQFTBELOWGROUND'] + df['SQFTABOVEGROUND']))

# Divide two columns
df = df.withColumn('PRICEPERTSQFT', (df['LISTPRICE'] / df['TSQFT']))

# Difference two columns
df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))
Automation of Features
- FeatureTools & TSFresh
- Explosion of Features
- Higher Order & Beyond?
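Libraries like FeatureTools and TSFresh automate this by generating candidate features from combinations of existing columns, which is why the feature count explodes. A minimal sketch of that explosion in plain Python (this is not the FeatureTools API; the column names are hypothetical):

```python
from itertools import combinations

# Hypothetical numeric columns from a housing record
row = {'WIDTH': 50.0, 'LENGTH': 120.0, 'SQFTABOVEGROUND': 1800.0}

# For every pair of columns, generate product, sum, difference, and ratio
features = dict(row)
for a, b in combinations(row, 2):
    features[f'{a}_x_{b}'] = row[a] * row[b]
    features[f'{a}_plus_{b}'] = row[a] + row[b]
    features[f'{a}_minus_{b}'] = row[a] - row[b]
    if row[b] != 0:
        features[f'{a}_per_{b}'] = row[a] / row[b]

# 3 original columns -> 3 + C(3,2) * 4 = 15 candidate features
print(len(features))
```

Even three columns yield 15 candidates here; real datasets with dozens of columns quickly produce thousands, which is the "explosion" the slide warns about.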
from pyspark.sql.functions import to_date

# Cast the data type to Date
df = df.withColumn('LISTDATE', to_date('LISTDATE'))

# Inspect the field
df[['LISTDATE']].show(2)

+----------+
|  LISTDATE|
+----------+
|2017-07-14|
|2017-10-08|
+----------+
from pyspark.sql.functions import year, month

# Create a new column of year number
df = df.withColumn('LIST_YEAR', year('LISTDATE'))

# Create a new column of month number
df = df.withColumn('LIST_MONTH', month('LISTDATE'))

from pyspark.sql.functions import dayofmonth, weekofyear

# Create a new column of the day number within the month
df = df.withColumn('LIST_DAYOFMONTH', dayofmonth('LISTDATE'))

# Create a new column of the week number within the year
df = df.withColumn('LIST_WEEKOFYEAR', weekofyear('LISTDATE'))
from pyspark.sql.functions import datediff

# Calculate the difference between two date fields
df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))
window()
Returns a record based off a group of records.

lag(col, count=1)
Returns the value that is offset by `count` rows before the current row.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Create a window ordered by date
w = Window().orderBy(m_df['DATE'])

# Create a lagged column
m_df = m_df.withColumn('MORTGAGE-1wk', lag('MORTGAGE', count=1).over(w))

# Inspect results
m_df.show(3)

+----------+--------+------------+
|      DATE|MORTGAGE|MORTGAGE-1wk|
+----------+--------+------------+
|2013-10-10|    4.23|        null|
|2013-10-17|    4.28|        4.23|
|2013-10-24|    4.13|        4.28|
+----------+--------+------------+
ROOF becomes Roof>8yrs:

ROOF                                             Roof>8yrs
Asphalt Shingles, Pitched, Age 8 Years or Less   0
Asphalt Shingles, Age Over 8 Years               1
Asphalt Shingles, Age 8 Years or Less            0
from pyspark.sql.functions import when

# Create boolean filters
find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')
find_over_8 = df['ROOF'].like('%Age Over 8 Years%')

# Apply filters using when() and otherwise()
df = df.withColumn('old_roof', (when(find_over_8, 1)
                                .when(find_under_8, 0)
                                .otherwise(None)))

# Inspect results
df[['ROOF', 'old_roof']].show(3, truncate=100)

+----------------------------------------------+--------+
|                                          ROOF|old_roof|
+----------------------------------------------+--------+
|                                          null|    null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|       0|
|            Asphalt Shingles, Age Over 8 Years|       1|
+----------------------------------------------+--------+
ROOF becomes Roof_Material:

ROOF                                             Roof_Material
Asphalt Shingles, Pitched, Age 8 Years or Less   Asphalt Shingles
Null                                             Null
Asphalt Shingles, Age Over 8 Years               Asphalt Shingles
Metal, Age 8 Years or Less                       Metal
Tile, Age 8 Years or Less                        Tile
Asphalt Shingles                                 Asphalt Shingles
from pyspark.sql.functions import split

# Split the column on commas into a list
split_col = split(df['ROOF'], ',')

# Put the first value of the list into a new column
df = df.withColumn('Roof_Material', split_col.getItem(0))

# Inspect results
df[['ROOF', 'Roof_Material']].show(5, truncate=100)

+----------------------------------------------+----------------+
|                                          ROOF|   Roof_Material|
+----------------------------------------------+----------------+
|                                          null|            null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
|                                          null|            null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
|            Asphalt Shingles, Age Over 8 Years|Asphalt Shingles|
+----------------------------------------------+----------------+
Starting Record

NO   roof_list
2    [Asphalt Shingles, Pitched, Age 8 Years or Less]

Exploded Record

NO   ex_roof_list
2    Asphalt Shingles
2    Pitched
2    Age 8 Years or Less
Exploded Record

NO   ex_roof_list
2    Asphalt Shingles
2    Pitched
2    Age 8 Years or Less

Pivoted Record

NO   Age 8 Years or Less   Age Over 8 Years   Asphalt Shingles   Flat   Metal   Other   Pitched   ...
2    1                                        1                                         1         ...
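The explode-then-pivot pattern above can be sketched in plain Python to show what the Spark pipeline produces (a toy record mirroring the slides, not Spark itself; note that in the Spark version non-matching cells come out as null rather than 0):

```python
# Toy record: one listing whose ROOF value was split into a list
records = [{'NO': 2,
            'roof_list': ['Asphalt Shingles', 'Pitched', 'Age 8 Years or Less']}]

# Explode: one new row per element of the list
exploded = [{'NO': r['NO'], 'ex_roof_list': item}
            for r in records for item in r['roof_list']]

# Pivot: one boolean column per distinct value, grouped by NO
columns = sorted({row['ex_roof_list'] for row in exploded})
pivoted = {}
for row in exploded:
    pivoted.setdefault(row['NO'], {c: 0 for c in columns})
    pivoted[row['NO']][row['ex_roof_list']] = 1

print(pivoted[2])
```

Each listing ends up as a single row again, with a 1 in every column whose value appeared in its original ROOF string.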
from pyspark.sql.functions import split, explode, lit, coalesce, first

# Split the column on commas into a list
df = df.withColumn('roof_list', split(df['ROOF'], ', '))

# Explode the list into new records, one for each value
ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))

# Create a dummy column of constant value
ex_df = ex_df.withColumn('constant_val', lit(1))

# Pivot the values into boolean columns
piv_df = ex_df.groupBy('NO').pivot('ex_roof_list')\
              .agg(coalesce(first('constant_val')))
FIREPLACES becomes Has_Fireplace:

FIREPLACES   Has_Fireplace
1            1
3            1
1            1
2            1
from pyspark.ml.feature import Binarizer

# Cast the data type to double
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))

# Create the binarizing transformer
binarizer = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')

# Apply the transformer
df = binarizer.transform(df)

# Inspect the results
df[['FIREPLACES', 'FireplaceT']].show(3)

+----------+----------+
|FIREPLACES|FireplaceT|
+----------+----------+
|       0.0|       0.0|
|       1.0|       1.0|
|       2.0|       1.0|
+----------+----------+
from pyspark.ml.feature import Bucketizer

# Define how to split the data
splits = [0, 1, 2, 3, 4, float('Inf')]

# Create the bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')

# Apply the transformer
df = buck.transform(df)

# Inspect results
df[['BATHSTOTAL', 'baths']].show(4)

+----------+-----+
|BATHSTOTAL|baths|
+----------+-----+
|         2|  2.0|
|         3|  3.0|
|         1|  1.0|
|         5|  4.0|
+----------+-----+
CITY becomes one boolean column per value:

CITY                 LELM   MAPW   OAKD   STP   WB
LELM - Lake Elmo     1
MAPW - Maplewood            1
OAKD - Oakdale                     1
STP - Saint Paul                          1
WB - Woodbury                                  1
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Create the indexer transformer
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')

# Fit the transformer
model = stringIndexer.fit(df)

# Apply the transformer
indexed = model.transform(df)
# Create the encoder transformer
encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')

# Apply the encoder transformer
encoded_df = encoder.transform(indexed)

# Inspect results
encoded_df[['City_Vec']].show(4)

+-------------+
|     City_Vec|
+-------------+
|    (4,[],[])|
|    (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
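The `(4,[2],[1.0])` output is Spark's sparse-vector notation: a vector of size 4 with a 1.0 at index 2 and zeros everywhere else. A minimal sketch of densifying it without Spark (a hypothetical helper mirroring what `SparseVector.toArray()` does):

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# (4,[2],[1.0]) -> the city at index 2 is the active category
print(to_dense(4, [2], [1.0]))  # [0.0, 0.0, 1.0, 0.0]

# (4,[],[]) -> all zeros: the dropped last category
print(to_dense(4, [], []))  # [0.0, 0.0, 0.0, 0.0]
```

Note that with five cities the vector has length 4: Spark's OneHotEncoder drops the last category by default, so the all-zero vector `(4,[],[])` represents that dropped category rather than missing data.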