Feature Generation - FEATURE ENGINEERING WITH PYSPARK



SLIDE 1

Feature Generation

FEATURE ENGINEERING WITH PYSPARK

John Hogue

Lead Data Scientist

SLIDE 2

FEATURE ENGINEERING WITH PYSPARK

Why generate new features?

Multiplying
Summing
Differencing
Dividing

SLIDE 3

FEATURE ENGINEERING WITH PYSPARK

Why generate new features?

SLIDE 4

FEATURE ENGINEERING WITH PYSPARK

Combining Two Features

Multiplication

# Create a new feature, area, by multiplying width and length
df = df.withColumn('TSQFT', (df['WIDTH'] * df['LENGTH']))

SLIDE 5

FEATURE ENGINEERING WITH PYSPARK

Other Ways to Combine Two Features

from pyspark.sql.functions import datediff

# Sum two columns
df = df.withColumn('TSQFT', (df['SQFTBELOWGROUND'] + df['SQFTABOVEGROUND']))

# Divide two columns
df = df.withColumn('PRICEPERTSQFT', (df['LISTPRICE'] / df['TSQFT']))

# Difference two columns
df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))

SLIDE 6

FEATURE ENGINEERING WITH PYSPARK

What's the limit?

Automation of Features: FeatureTools & TSFresh
Explosion of Features
Higher Order & Beyond?
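
As a rough illustration of what "higher order" can mean here, interaction and polynomial terms are built the same way as the two-feature combinations above. A minimal sketch, reusing TSQFT and BATHSTOTAL from this deck and assuming a hypothetical BEDROOMS column:

# Second-order (squared) term from an existing feature
df = df.withColumn('TSQFT_SQUARED', df['TSQFT'] * df['TSQFT'])

# Higher-order interaction of several features (BEDROOMS is assumed for illustration)
df = df.withColumn('BED_BATH_SQFT', df['BEDROOMS'] * df['BATHSTOTAL'] * df['TSQFT'])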

SLIDE 7

Go forth and combine!

FEATURE ENGINEERING WITH PYSPARK

SLIDE 8

Time Features

FEATURE ENGINEERING WITH PYSPARK

John Hogue

Lead Data Scientist, General Mills

SLIDE 9

FEATURE ENGINEERING WITH PYSPARK

The Cyclical Nature of Things
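
The cyclical pattern can be checked directly in the data. A minimal sketch, assuming the LISTDATE and LISTPRICE columns used elsewhere in this deck, that looks for seasonality by averaging prices per listing month:

from pyspark.sql.functions import month, mean

# Average list price by month of listing, to look for a seasonal (cyclical) pattern
df.withColumn('LIST_MONTH', month('LISTDATE')) \
  .groupBy('LIST_MONTH') \
  .agg(mean('LISTPRICE').alias('AVG_LISTPRICE')) \
  .orderBy('LIST_MONTH') \
  .show(12)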

SLIDE 10

FEATURE ENGINEERING WITH PYSPARK

Choosing the Right Level

SLIDE 11

FEATURE ENGINEERING WITH PYSPARK

Choosing the Right Level
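
A hedged sketch of what choosing a level can look like in practice: rolling listing records up to a weekly grain (column names reused from this deck) so they line up with data that arrives weekly, such as mortgage rates:

from pyspark.sql.functions import year, weekofyear, mean

# Aggregate listings to the year/week level before joining weekly data
weekly_df = df.withColumn('LIST_YEAR', year('LISTDATE')) \
              .withColumn('LIST_WEEKOFYEAR', weekofyear('LISTDATE')) \
              .groupBy('LIST_YEAR', 'LIST_WEEKOFYEAR') \
              .agg(mean('LISTPRICE').alias('AVG_LISTPRICE'))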

SLIDE 12

FEATURE ENGINEERING WITH PYSPARK

Treating Date Fields as Dates...

from pyspark.sql.functions import to_date

# Cast the data type to Date
df = df.withColumn('LISTDATE', to_date('LISTDATE'))

# Inspect the field
df[['LISTDATE']].show(2)

+----------+
|  LISTDATE|
+----------+
|2017-07-14|
|2017-10-08|
+----------+
only showing top 2 rows

SLIDE 13

FEATURE ENGINEERING WITH PYSPARK

Time Components

from pyspark.sql.functions import year, month

# Create a new column of the year number
df = df.withColumn('LIST_YEAR', year('LISTDATE'))

# Create a new column of the month number
df = df.withColumn('LIST_MONTH', month('LISTDATE'))

from pyspark.sql.functions import dayofmonth, weekofyear

# Create a new column of the day number within the month
df = df.withColumn('LIST_DAYOFMONTH', dayofmonth('LISTDATE'))

# Create a new column of the week number within the year
df = df.withColumn('LIST_WEEKOFYEAR', weekofyear('LISTDATE'))

SLIDE 14

FEATURE ENGINEERING WITH PYSPARK

Basic Time Based Metrics

from pyspark.sql.functions import datediff

# Calculate the difference between two date fields
df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))

SLIDE 15

FEATURE ENGINEERING WITH PYSPARK

Lagging Features

window()

Returns a record based off a group of records

lag(col, count=1)

Returns the value that is offset by count rows before the current row

SLIDE 16

FEATURE ENGINEERING WITH PYSPARK

Lagging Features, the PySpark Way

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Create a window ordered by date
w = Window().orderBy(m_df['DATE'])

# Create the lagged column
m_df = m_df.withColumn('MORTGAGE-1wk', lag('MORTGAGE', count=1).over(w))

# Inspect results
m_df.show(3)

+----------+------------+----------------+
|      DATE|    MORTGAGE|    MORTGAGE-1wk|
+----------+------------+----------------+
|2013-10-10|        4.23|            null|
|2013-10-17|        4.28|            4.23|
|2013-10-24|        4.13|            4.28|
+----------+------------+----------------+
only showing top 3 rows

SLIDE 17

It's TIME to practice!

FEATURE ENGINEERING WITH PYSPARK

SLIDE 18

Extracting Features

FEATURE ENGINEERING WITH PYSPARK

John Hogue

Lead Data Scientist, General Mills

SLIDE 19

FEATURE ENGINEERING WITH PYSPARK

Extracting Age with Text Match

ROOF                                            →  Roof>8yrs
Asphalt Shingles, Pitched, Age 8 Years or Less  →  0   (Age 8 Years or Less)
Asphalt Shingles, Age Over 8 Years              →  1   (Age Over 8 Years)
Asphalt Shingles, Age 8 Years or Less           →  0   (Age 8 Years or Less)

SLIDE 20

FEATURE ENGINEERING WITH PYSPARK

Extracting Age with Text Match

from pyspark.sql.functions import when

# Create boolean filters
find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')
find_over_8 = df['ROOF'].like('%Age Over 8 Years%')

# Apply filters using when() and otherwise()
df = df.withColumn('old_roof', (when(find_over_8, 1)
                                .when(find_under_8, 0)
                                .otherwise(None)))

# Inspect results
df[['ROOF', 'old_roof']].show(3, truncate=100)

+----------------------------------------------+--------+
|                                          ROOF|old_roof|
+----------------------------------------------+--------+
|                                          null|    null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|       0|
|            Asphalt Shingles, Age Over 8 Years|       1|
+----------------------------------------------+--------+
only showing top 3 rows

SLIDE 21

FEATURE ENGINEERING WITH PYSPARK

Splitting Columns

ROOF                                            →  Roof_Material
Asphalt Shingles, Pitched, Age 8 Years or Less  →  Asphalt Shingles
Null                                            →  Null
Asphalt Shingles, Age Over 8 Years              →  Asphalt Shingles
Metal, Age 8 Years or Less                      →  Metal
Tile, Age 8 Years or Less                       →  Tile
Asphalt Shingles                                →  Asphalt Shingles

SLIDE 22

FEATURE ENGINEERING WITH PYSPARK

Splitting Columns

from pyspark.sql.functions import split

# Split the column on commas into a list
split_col = split(df['ROOF'], ',')

# Put the first value of the list into a new column
df = df.withColumn('Roof_Material', split_col.getItem(0))

# Inspect results
df[['ROOF', 'Roof_Material']].show(5, truncate=100)

+----------------------------------------------+----------------+
|                                          ROOF|   Roof_Material|
+----------------------------------------------+----------------+
|                                          null|            null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
|                                          null|            null|
|Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
|            Asphalt Shingles, Age Over 8 Years|Asphalt Shingles|
+----------------------------------------------+----------------+
only showing top 5 rows

SLIDE 23

FEATURE ENGINEERING WITH PYSPARK

Explode!

Starting Record
NO  roof_list
2   [Asphalt Shingles, Pitched, Age 8 Years or Less]

Exploded Record
NO  ex_roof_list
2   Asphalt Shingles
2   Pitched
2   Age 8 Years or Less

SLIDE 24

FEATURE ENGINEERING WITH PYSPARK

Pivot!

Exploded Record
NO  ex_roof_list
2   Asphalt Shingles
2   Pitched
2   Age 8 Years or Less

Pivoted Record
NO  Age 8 Years or Less  Age Over 8 Years  Asphalt Shingles  Flat  Metal  Other  Pitched  ...
2   1                                      1                                     1        ...

SLIDE 25

FEATURE ENGINEERING WITH PYSPARK

Explode & Pivot!

from pyspark.sql.functions import split, explode, lit, coalesce, first

# Split the column on commas into a list
df = df.withColumn('roof_list', split(df['ROOF'], ', '))

# Explode the list into a new record for each value
ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))

# Create a dummy column of constant value
ex_df = ex_df.withColumn('constant_val', lit(1))

# Pivot the values into boolean columns
piv_df = ex_df.groupBy('NO').pivot('ex_roof_list')\
    .agg(coalesce(first('constant_val')))
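
A common follow-up, not shown on the slide: the pivoted indicator columns are null for values a record does not have, so they are typically zero-filled and joined back onto the original DataFrame, assuming 'NO' uniquely identifies each record:

# Replace missing indicators with 0 and attach the dummy columns back to the original records
piv_df = piv_df.fillna(0)
df = df.join(piv_df, on='NO', how='left')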

SLIDE 26

Let's wrangle some features!

FEATURE ENGINEERING WITH PYSPARK

SLIDE 27

Binarizing, Bucketing & Encoding

FEATURE ENGINEERING WITH PYSPARK

John Hogue

Lead Data Scientist

SLIDE 28

FEATURE ENGINEERING WITH PYSPARK

Binarizing

FIREPLACES  →  Has_Fireplace
1           →  1
3           →  1
1           →  1
2           →  1

SLIDE 29

FEATURE ENGINEERING WITH PYSPARK

Binarizing

from pyspark.ml.feature import Binarizer

# Cast the data type to double
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))

# Create the binarizing transformer
bin = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')

# Apply the transformer
df = bin.transform(df)

# Inspect the results
df[['FIREPLACES', 'FireplaceT']].show(3)

+----------+----------+
|FIREPLACES|FireplaceT|
+----------+----------+
|       0.0|       0.0|
|       1.0|       1.0|
|       2.0|       1.0|
+----------+----------+
only showing top 3 rows

SLIDE 30

FEATURE ENGINEERING WITH PYSPARK

Bucketing

from pyspark.ml.feature import Bucketizer

# Define how to split the data
splits = [0, 1, 2, 3, 4, float('Inf')]

# Create the bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')

# Apply the transformer
df = buck.transform(df)

# Inspect results
df[['BATHSTOTAL', 'baths']].show(4)

+----------+-----+
|BATHSTOTAL|baths|
+----------+-----+
|         2|  2.0|
|         3|  3.0|
|         1|  1.0|
|         5|  4.0|
+----------+-----+
only showing top 4 rows

SLIDE 31

FEATURE ENGINEERING WITH PYSPARK

One Hot Encoding

CITY              →  LELM  MAPW  OAKD  STP  WB
LELM - Lake Elmo  →    1
MAPW - Maplewood  →          1
OAKD - Oakdale    →                1
STP - Saint Paul  →                      1
WB - Woodbury     →                           1

SLIDE 32

FEATURE ENGINEERING WITH PYSPARK

One Hot Encoding the PySpark Way

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Create the indexer transformer
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')

# Fit the transformer
model = stringIndexer.fit(df)

# Apply the transformer
indexed = model.transform(df)

SLIDE 33

FEATURE ENGINEERING WITH PYSPARK

One Hot Encoding the PySpark Way

# Create the encoder transformer
encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')

# Apply the encoder transformer
encoded_df = encoder.transform(indexed)

# Inspect results
encoded_df[['City_Vec']].show(4)

+-------------+
|     City_Vec|
+-------------+
|    (4,[],[])|
|    (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
only showing top 4 rows
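
Before modeling, the sparse city vector is usually combined with numeric columns into a single features vector. A minimal sketch, assuming the TSQFT and DAYSONMARKET columns created earlier are still present on encoded_df:

from pyspark.ml.feature import VectorAssembler

# Combine numeric features and the one hot encoded city vector into one column
assembler = VectorAssembler(inputCols=['TSQFT', 'DAYSONMARKET', 'City_Vec'],
                            outputCol='features')
model_df = assembler.transform(encoded_df)
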
SLIDE 34

Get Transforming!

FEATURE ENGINEERING WITH PYSPARK