Choosing the Algorithm
FEATURE ENGINEERING WITH PYSPARK
John Hogue
Lead Data Scientist, General Mills
Spark ML Landscape
Methods in ml.regression:

- GeneralizedLinearRegression
- IsotonicRegression
- LinearRegression
- DecisionTreeRegressor
- GBTRegressor
- RandomForestRegressor
https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/61408
# Create variables for max and min dates in our dataset
max_date = df.agg({'OFFMKTDATE': 'max'}).collect()[0][0]
min_date = df.agg({'OFFMKTDATE': 'min'}).collect()[0][0]

# max_date and min_date are now plain Python dates, so use date arithmetic
# Find how many days our data spans
range_in_days = (max_date - min_date).days

# Find the date to split the dataset on
from datetime import timedelta
split_in_days = round(range_in_days * 0.8)
split_date = min_date + timedelta(days=split_in_days)

# Split the data into 80% train, 20% test
train_df = df.where(df['OFFMKTDATE'] < split_date)
test_df = df.where(df['OFFMKTDATE'] >= split_date) \
            .where(df['LISTDATE'] >= split_date)
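The split arithmetic itself is plain date math, so it can be sanity-checked without Spark. A minimal sketch, using a hypothetical dataset spanning 2017-01-01 to 2017-12-31 in place of the collected OFFMKTDATE bounds:

```python
from datetime import date, timedelta

# Hypothetical min/max dates standing in for the collected OFFMKTDATE bounds
min_date = date(2017, 1, 1)
max_date = date(2017, 12, 31)

# Days covered by the data
range_in_days = (max_date - min_date).days   # 364

# 80% of the range, rounded, gives the split point
split_in_days = round(range_in_days * 0.8)   # 291
split_date = min_date + timedelta(days=split_in_days)

print(split_date)   # 2017-10-19
```

Everything on or after `split_date` goes to the test set, which keeps the evaluation honest for time-ordered data like home sales.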
Random Forest Regression:

- Skewed / non-normal data? OK
- Unscaled data? OK
- Missing data? OK
- Categorical data? OK
- Economic: 30-Year Mortgage Rates
- Governmental: Median Home Price for City, Home Age Percentages for City, Home Size Percentages for City
- Social: Walk Score, Bike Score
- Seasonal: Bank Holidays
- Temporal Features: limited value with one year of data; Holiday Weeks
- Rates, Ratios, Sums: business context, personal context
- Expanded Features: non-free-form text columns; need to remove low observations
# What is the shape of our data?
print((df.count(), len(df.columns)))

(5000, 126)
from pyspark.ml.feature import VectorAssembler

# Replace missing values
df = df.fillna(-1)

# Define the columns to be converted to vectors
features_cols = list(df.columns)
# Remove the dependent variable from the list
features_cols.remove('SALESCLOSEPRICE')
# Create the vector assembler transformer
vec = VectorAssembler(inputCols=features_cols, outputCol='features')

# Apply the vector transformer to data
df = vec.transform(df)

# Select only the feature vectors and the dependent variable
ml_ready_df = df.select(['SALESCLOSEPRICE', 'features'])

# Inspect results
ml_ready_df.show(5)

+---------------+--------------------+
|SALESCLOSEPRICE|            features|
+---------------+--------------------+
|         143000|(125,[0,1,2,3,5,6...|
|         190000|(125,[0,1,2,3,5,6...|
|         225000|(125,[0,1,2,3,5,6...|
|         265000|(125,[0,1,2,3,4,5...|
|         249900|(125,[0,1,2,3,4,5...|
+---------------+--------------------+
Basic Model Parameters (Spark defaults):

featuresCol="features"
labelCol="label"
predictionCol="prediction"
seed=None

Our Model Parameter Values:

featuresCol="features"
labelCol="SALESCLOSEPRICE"
predictionCol="Prediction_Price"
seed=42
from pyspark.ml.regression import RandomForestRegressor

# Initialize model with columns to utilize
rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42)

# Train model
model = rf.fit(train_df)
# Make predictions
predictions = model.transform(test_df)

# Inspect results
predictions.select("Prediction_Price", "SALESCLOSEPRICE").show(5)

+------------------+---------------+
|  Prediction_Price|SALESCLOSEPRICE|
+------------------+---------------+
|426029.55463222397|         415000|
| 708510.8806005502|         842500|
| 164275.7116183204|         161000|
| 208943.4143642175|         200000|
|217152.43272221283|         205000|
+------------------+---------------+
from pyspark.ml.evaluation import RegressionEvaluator

# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE",
                                predictionCol="Prediction_Price")

# Create evaluation metrics
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

# Print model metrics
print('RMSE: ' + str(rmse))
print('R^2: ' + str(r2))

RMSE: 22898.84041072095
R^2: 0.9666594402208077
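Both metrics are simple enough to compute by hand, which helps when interpreting them. A sketch using only the five (prediction, actual) pairs displayed earlier; being a tiny sample, the numbers differ from the full-test-set figures above:

```python
from math import sqrt

# The five (prediction, actual) pairs shown in the earlier output
preds   = [426029.55463222397, 708510.8806005502, 164275.7116183204,
           208943.4143642175, 217152.43272221283]
actuals = [415000, 842500, 161000, 200000, 205000]

n = len(actuals)

# RMSE: square root of the mean squared residual
ss_res = sum((p - a) ** 2 for p, a in zip(preds, actuals))
rmse = sqrt(ss_res / n)

# R^2: 1 minus (residual sum of squares / total sum of squares)
mean_a = sum(actuals) / n
ss_tot = sum((a - mean_a) ** 2 for a in actuals)
r2 = 1 - ss_res / ss_tot

print(round(rmse), round(r2, 3))   # roughly 60520 and 0.944 on this sample
```

RMSE is in the units of the label (dollars here), so it reads as a typical prediction error; R² is the fraction of variance in sale price the model explains.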
import pandas as pd

# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
                     columns=['importance'])

# Convert list of feature names to a pandas column
fi_df['feature'] = pd.Series(features_cols)

# Sort the data based on feature importance
fi_df.sort_values(by=['importance'], ascending=False, inplace=True)
# Interpret results
fi_df.head(9)

| feature                 |importance|
|-------------------------|----------|
| LISTPRICE               | 0.312101 |
| ORIGINALLISTPRICE       | 0.202142 |
| LIVINGAREA              | 0.124239 |
| SQFT_TOTAL              | 0.081260 |
| LISTING_TO_MEDIAN_RATIO | 0.075086 |
| TAXES                   | 0.048452 |
| SQFTABOVEGROUND         | 0.045859 |
| BATHSTOTAL              | 0.034397 |
| LISTING_PRICE_PER_SQFT  | 0.018253 |
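Spark normalizes `featureImportances` to sum to 1 across all features, so summing the displayed rows shows how concentrated the model is. A quick check on the nine importances above (the remaining ~116 features share whatever is left):

```python
# The nine importances from the table above
top9 = [0.312101, 0.202142, 0.124239, 0.081260, 0.075086,
        0.048452, 0.045859, 0.034397, 0.018253]

# Fraction of total importance carried by the top nine of 125 features
covered = sum(top9)
print(round(covered, 3))   # ~0.942, i.e. about 94% of the importance
```

In other words, a handful of price- and size-related features drive nearly all of the model's predictions.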
# Save model
model.save('rfr_real_estate_model')

from pyspark.ml.regression import RandomForestRegressionModel

# Load the model back from disk
model2 = RandomForestRegressionModel.load('rfr_real_estate_model')
- Inspecting visually & statistically
- Dropping rows and columns
- Scaling and adjusting data
- Handling missing values
- Joining external datasets
- Generating features
- Extracting variables from messy fields
- Binning, bucketing and encoding
- Training and evaluating a model
- Interpreting model results