SLIDE 1

Choosing the Algorithm

FEATURE ENGINEERING WITH PYSPARK

John Hogue

Lead Data Scientist, General Mills

SLIDE 2

Spark ML Landscape


SLIDE 6

PySpark Regression Methods

Methods in pyspark.ml.regression:

GeneralizedLinearRegression
IsotonicRegression
LinearRegression
DecisionTreeRegressor
GBTRegressor
RandomForestRegressor


SLIDE 9

Test and Train Splits for Time Series

https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/61408

SLIDE 10

Test and Train Splits for Time Series

# Create variables for max and min dates in our dataset
max_date = df.agg({'OFFMKTDATE': 'max'}).collect()[0][0]
min_date = df.agg({'OFFMKTDATE': 'min'}).collect()[0][0]

# Find how many days our data spans. The collected values are plain
# Python dates, so use timedelta arithmetic rather than the Spark
# column functions datediff/date_add, which operate on columns.
from datetime import timedelta
range_in_days = (max_date - min_date).days

# Find the date to split the dataset on
split_in_days = round(range_in_days * 0.8)
split_date = min_date + timedelta(days=split_in_days)

# Split the data into 80% train, 20% test
train_df = df.where(df['OFFMKTDATE'] < split_date)
test_df = df.where(df['OFFMKTDATE'] >= split_date) \
            .where(df['LISTDATE'] >= split_date)
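The same 80/20 chronological cutoff can be sketched in plain Python, without Spark, to show why the test set is filtered on both dates. The records below are made-up listings, each a (list date, off-market date) pair:

```python
from datetime import date, timedelta

def chronological_split_date(min_date, max_date, train_frac=0.8):
    """Return the cutoff date that puts roughly `train_frac`
    of the date range before it."""
    range_in_days = (max_date - min_date).days
    return min_date + timedelta(days=round(range_in_days * train_frac))

# Hypothetical listing records: (list_date, off_market_date)
records = [
    (date(2017, 1, 10), date(2017, 2, 1)),
    (date(2017, 4, 5), date(2017, 5, 20)),
    (date(2017, 9, 1), date(2017, 10, 15)),
    (date(2017, 11, 2), date(2017, 12, 10)),
]

split = chronological_split_date(date(2017, 1, 1), date(2017, 12, 31))

# Train: sold before the split. Test: listed AND sold after it,
# so no test listing was on the market during the training window.
train = [r for r in records if r[1] < split]
test = [r for r in records if r[1] >= split and r[0] >= split]
```

Filtering the test set on both LISTDATE and OFFMKTDATE mirrors the double `where` above: it keeps the evaluation strictly in the future relative to training.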

SLIDE 11

Time to practice!

SLIDE 12

Preparing for Random Forest Regression

John Hogue

Lead Data Scientist, General Mills

SLIDE 13

Assumptions Needed for Features

Random Forest Regression:
Skewed / non-normal data? OK
Unscaled data? OK
Missing data? OK
Categorical data? OK
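Why is unscaled, skewed data "OK" for a tree ensemble? A tree splits on thresholds, so only the ordering of a feature matters, not its scale. The sketch below is a toy one-split regression tree (not Spark's implementation) showing that a monotone rescaling of the feature leaves the chosen partition unchanged:

```python
def best_split_partition(xs, ys):
    """Return the set of row indices sent left by the single split
    that minimizes total squared error (a one-node regression tree)."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best, best_left = float('inf'), set()
    for t in sorted(set(xs)):
        left = {i for i, x in enumerate(xs) if x <= t}
        err = sse([ys[i] for i in left]) + \
              sse([ys[i] for i in range(len(xs)) if i not in left])
        if err < best:
            best, best_left = err, left
    return best_left

xs = [1.0, 2.0, 3.0, 10.0, 12.0]
ys = [5.0, 6.0, 5.5, 20.0, 22.0]

# The chosen partition depends only on the feature's ordering,
# so a monotone rescaling (here, x * 1000) yields the same split.
assert best_split_partition(xs, ys) == \
       best_split_partition([x * 1000 for x in xs], ys)
```

This is why the scaling and normalization steps required by linear models can be skipped here.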

SLIDE 14

Appended Features

Economic
  30 Year Mortgage Rates
Governmental
  Median Home Price for City
  Home Age Percentages for City
  Home Size Percentages for City
Social
  Walk Score
  Bike Score
Seasonal
  Bank Holidays

SLIDE 15

Engineered Features

Temporal Features
  Limited value with one year of data
  Holiday Weeks
Rates, Ratios, Sums
  Business Context
  Personal Context
Expanded Features
  Non-Free Form Text Columns
  Need to Remove Low Observations

# What is the shape of our data?
print((df.count(), len(df.columns)))

(5000, 126)

SLIDE 16

Dataframe Columns to Feature Vectors

from pyspark.ml.feature import VectorAssembler

# Replace missing values
df = df.fillna(-1)

# Define the columns to be converted to vectors
features_cols = list(df.columns)

# Remove the dependent variable from the list
features_cols.remove('SALESCLOSEPRICE')

SLIDE 17

Dataframe Columns to Feature Vectors

# Create the vector assembler transformer
vec = VectorAssembler(inputCols=features_cols, outputCol='features')

# Apply the vector transformer to data
df = vec.transform(df)

# Select only the feature vectors and the dependent variable
ml_ready_df = df.select(['SALESCLOSEPRICE', 'features'])

# Inspect results
ml_ready_df.show(5)

+---------------+--------------------+
|SALESCLOSEPRICE|            features|
+---------------+--------------------+
|         143000|(125,[0,1,2,3,5,6...|
|         190000|(125,[0,1,2,3,5,6...|
|         225000|(125,[0,1,2,3,5,6...|
|         265000|(125,[0,1,2,3,4,5...|
|         249900|(125,[0,1,2,3,4,5...|
+---------------+--------------------+
only showing top 5 rows
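The `(125,[0,1,2,3,5,6...` entries in the output are Spark's sparse-vector notation: the vector's length followed by the indices and values of its nonzero entries. A minimal pure-Python sketch of the idea (not Spark's actual `SparseVector` class):

```python
def to_sparse(dense):
    """Encode a dense vector as (size, indices, values),
    keeping only the nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

def to_dense(size, indices, values):
    """Invert the encoding back to a plain list."""
    dense = [0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# A length-5 vector with two nonzero entries
sparse = to_sparse([3.0, 0, 0, 1.5, 0])
```

With 125 mostly-zero features per row (think one-hot encoded categories), storing only the nonzero entries saves considerable memory.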
SLIDE 18

We are now ready for machine learning!

SLIDE 19

Building a Model

John Hogue

Lead Data Scientist, General Mills

SLIDE 20

RandomForestRegressor

Basic Model Parameters

featuresCol="features"
labelCol="label"
predictionCol="prediction"
seed=None

Our Model Parameter values

featuresCol="features"
labelCol="SALESCLOSEPRICE"
predictionCol="Prediction_Price"
seed=42

SLIDE 21

Training a Random Forest

from pyspark.ml.regression import RandomForestRegressor

# Initialize model with columns to utilize
rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42)

# Train model
model = rf.fit(train_df)

SLIDE 22

Predicting with a Model

# Make predictions
predictions = model.transform(test_df)

# Inspect results
predictions.select("Prediction_Price", "SALESCLOSEPRICE").show(5)

+------------------+---------------+
|  Prediction_Price|SALESCLOSEPRICE|
+------------------+---------------+
|426029.55463222397|         415000|
| 708510.8806005502|         842500|
| 164275.7116183204|         161000|
| 208943.4143642175|         200000|
|217152.43272221283|         205000|
+------------------+---------------+
only showing top 5 rows
SLIDE 23

Evaluating a Model

from pyspark.ml.evaluation import RegressionEvaluator

# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE",
                                predictionCol="Prediction_Price")

# Create evaluation metrics
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

# Print model metrics
print('RMSE: ' + str(rmse))
print('R^2: ' + str(r2))

RMSE: 22898.84041072095
R^2: 0.9666594402208077
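Both metrics the evaluator reports follow directly from their textbook definitions, which a few lines of plain Python make concrete (the actual/predicted lists here are hypothetical):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: typical prediction error,
    in the same units as the label (dollars here)."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r2(actual, predicted):
    """Coefficient of determination: fraction of label variance
    explained by the model (1.0 is a perfect fit)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and predictions
actual = [415000.0, 842500.0, 161000.0, 200000.0, 205000.0]
predicted = [426030.0, 708511.0, 164276.0, 208943.0, 217152.0]
print(rmse(actual, predicted))
print(r2(actual, predicted))
```

Because RMSE is in label units, the value above reads as roughly a $23k typical error on homes selling for hundreds of thousands of dollars.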

SLIDE 24

Let's model some data!

SLIDE 25

Interpreting, Saving & Loading Models

John Hogue

Lead Data Scientist, General Mills

SLIDE 26

Interpreting a Model

import pandas as pd

# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
                     columns=['importance'])

# Convert the list of feature names to a pandas column
fi_df['feature'] = pd.Series(features_cols)

# Sort the data based on feature importance
fi_df.sort_values(by=['importance'], ascending=False, inplace=True)
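The pandas steps above just pair each importance with its column name and sort, which plain Python shows in miniature (the numbers below are hypothetical; Spark normalizes the full `featureImportances` vector so it sums to 1.0):

```python
# Hypothetical importances for three columns
importances = [0.124239, 0.312101, 0.202142]
features = ['LIVINGAREA', 'LISTPRICE', 'ORIGINALLISTPRICE']

# Pair each column name with its importance and sort, descending
ranked = sorted(zip(features, importances),
                key=lambda pair: pair[1], reverse=True)

# The first entry is the model's most influential feature
top_feature = ranked[0][0]
```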

SLIDE 27

Interpreting a Model

# Interpret results
fi_df.head(9)

|                 feature | importance |
|-------------------------|------------|
|               LISTPRICE |   0.312101 |
|       ORIGINALLISTPRICE |   0.202142 |
|              LIVINGAREA |   0.124239 |
|              SQFT_TOTAL |   0.081260 |
| LISTING_TO_MEDIAN_RATIO |   0.075086 |
|                   TAXES |   0.048452 |
|         SQFTABOVEGROUND |   0.045859 |
|              BATHSTOTAL |   0.034397 |
|  LISTING_PRICE_PER_SQFT |   0.018253 |

SLIDE 28

Saving & Loading Models

# Save model
model.save('rfr_real_estate_model')

from pyspark.ml.regression import RandomForestRegressionModel

# Load model from disk
model2 = RandomForestRegressionModel.load('rfr_real_estate_model')

SLIDE 29

On to your last set of exercises!

SLIDE 30

Final Thoughts

John Hogue

Lead Data Scientist

SLIDE 31

What you learned!

Inspecting visually & statistically
Dropping rows and columns
Scaling and adjusting data
Handling missing values
Joining external datasets
Generating features
Extracting variables from messy fields
Binning, bucketing and encoding
Training and evaluating a model
Interpreting model results

SLIDE 32

Time to learn something new!