ACCT 420: Advanced linear regression
Project example
Dr. Richard M. Crowley

Weekly revenue prediction at Walmart

The question
How can we predict weekly departmental revenue for Walmart, leveraging our knowledge of Walmart, its business, and some limited historical information?
▪ Predict weekly revenue for 115,064 (Store, Department, Week) tuples
  ▪ From 2012-11-02 to 2013-07-26
▪ Using [incomplete] weekly revenue data from 2010-02-05 to 2012-10-26
  ▪ By department (some weeks missing for some departments)

More specifically…
▪ Consider time dimensions
  ▪ What matters:
    ▪ Time of the year?
    ▪ Holidays?
    ▪ Do different stores or departments behave differently?
▪ Wrinkles:
  ▪ Walmart won’t give us testing data
    ▪ But they’ll tell us how well the algorithm performs
  ▪ We can’t use past week sales for prediction because we won’t have it for most of the prediction…

The data
▪ Revenue by week for each department of each of 45 stores
  ▪ Department is just a number between 1 and 99
    ▪ We don’t know what these numbers mean
  ▪ Date of that week
  ▪ If the week is considered a holiday for sales purposes
    ▪ Super Bowl, Labor Day, Black Friday, Christmas
▪ Store data:
  ▪ Which store the data is for, 1 to 45
  ▪ Store type (A, B, or C)
    ▪ We don’t know what these letters mean
  ▪ Store size
▪ Other data, by week and location:
  ▪ Temperature, gas price, sales (by department), CPI, Unemployment rate, Holidays

Walmart’s evaluation metric
▪ Walmart uses MAE (mean absolute error), but with a twist:
  ▪ They care more about holidays, so any error on holidays has 5 times the penalty
  ▪ They call this WMAE, for weighted mean absolute error

$$\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \left| y_i - \hat{y}_i \right|$$

▪ $n$ is the number of test data points
▪ $\hat{y}_i$ is your prediction
▪ $y_i$ is the actual sales
▪ $w_i$ is 5 on holidays and 1 otherwise

    wmae <- function(actual, predicted, holidays) {
      sum(abs(actual - predicted) * (holidays * 4 + 1)) /
        (length(actual) + 4 * sum(holidays))
    }
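A small worked example may help make the weighting concrete. The numbers below are hypothetical toy values, not Walmart data; the `wmae()` function is repeated from the slide so the snippet runs on its own.

```r
# Weighted MAE as defined on the slide: holiday weeks get weight 5, others weight 1
wmae <- function(actual, predicted, holidays) {
  sum(abs(actual - predicted) * (holidays * 4 + 1)) /
    (length(actual) + 4 * sum(holidays))
}

# Hypothetical toy data (not from the Walmart files)
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
holidays  <- c(FALSE, TRUE, FALSE)

# Absolute errors are 10, 10, 30 and weights are 1, 5, 1,
# so WMAE = (10 + 50 + 30) / (1 + 5 + 1) = 90 / 7
score <- wmae(actual, predicted, holidays)
print(score)

# Plain MAE for comparison: (10 + 10 + 30) / 3 = 50 / 3
mae <- mean(abs(actual - predicted))
print(mae)
```

Here the holiday week has one of the smaller errors, so up-weighting it pulls WMAE below the plain MAE. A large miss on a holiday week would push it the other way.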

Before we get started…
▪ The data isn’t very clean:
  ▪ Markdowns are given by 5 separate variables instead of 1
  ▪ Date is text format instead of a date
  ▪ CPI and unemployment data are missing in around a third of the testing data
  ▪ There are some (week, store, department) groups missing from our training data!
We’ll have to fix these

Also…
▪ Some features to add:
  ▪ Year
  ▪ Week
  ▪ A unique ID for tracking (week, store, department) tuples
  ▪ The ID Walmart requests we use for submissions
  ▪ Average sales by (store, department)
  ▪ Average sales by (week, store, department)

Load data and packages

    library(tidyverse)  # we'll extensively use dplyr here
    library(lubridate)  # Great for simple date functions
    library(broom)

    weekly <- read.csv("../../Data/WMT_train.csv", stringsAsFactors=FALSE)
    weekly.test <- read.csv("../../Data/WMT_test.csv", stringsAsFactors=FALSE)
    weekly.features <- read.csv("../../Data/WMT_features.csv", stringsAsFactors=FALSE)
    weekly.stores <- read.csv("../../Data/WMT_stores.csv", stringsAsFactors=FALSE)

▪ weekly is our training data
▪ weekly.test is our testing data – no Weekly_Sales column
▪ weekly.features is general information about (week, store) pairs
  ▪ Temperature, pricing, etc.
▪ weekly.stores is general information about each store

Cleaning
Merge data, fix markdown, build time data

    preprocess_data <- function(df) {
      # Merge the data together (Pulled from outside of function -- "scoping")
      df <- inner_join(df, weekly.stores)
      df <- inner_join(df, weekly.features[,1:11])

      # Compress the weird markdown information to 1 variable
      # (each later non-missing MarkDown column overwrites the earlier ones)
      df$markdown <- 0
      df[!is.na(df$MarkDown1),]$markdown <- df[!is.na(df$MarkDown1),]$MarkDown1
      df[!is.na(df$MarkDown2),]$markdown <- df[!is.na(df$MarkDown2),]$MarkDown2
      df[!is.na(df$MarkDown3),]$markdown <- df[!is.na(df$MarkDown3),]$MarkDown3
      df[!is.na(df$MarkDown4),]$markdown <- df[!is.na(df$MarkDown4),]$MarkDown4
      df[!is.na(df$MarkDown5),]$markdown <- df[!is.na(df$MarkDown5),]$MarkDown5

      # Fix dates and add useful time variables
      df$date <- as.Date(df$Date)
      df$week <- week(df$date)
      df$year <- year(df$date)

      df
    }

    df <- preprocess_data(weekly)
    df_test <- preprocess_data(weekly.test)
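The sequential overwrites in the markdown step mean that the highest-numbered non-missing MarkDown column wins, with 0 as the fallback. As a sketch on hypothetical toy data (three markdown columns instead of five, values invented for illustration), the same result can be obtained in one step with `dplyr::coalesce()`. This is an alternative phrasing for comparison, not the code used in the slides.

```r
library(dplyr)

# Hypothetical toy rows; only three markdown columns to keep it short
toy <- data.frame(
  MarkDown1 = c(NA, 10, 10, NA),
  MarkDown2 = c(NA, NA, 20, NA),
  MarkDown3 = c(NA, NA, NA, 30)
)

# Sequential overwrite, mirroring the approach in preprocess_data()
toy$markdown <- 0
for (col in c("MarkDown1", "MarkDown2", "MarkDown3")) {
  has_value <- !is.na(toy[[col]])
  toy$markdown[has_value] <- toy[[col]][has_value]
}

# One-step equivalent: take the first non-missing value,
# checking the highest-numbered column first and falling back to 0
toy$markdown_coalesce <- coalesce(toy$MarkDown3, toy$MarkDown2, toy$MarkDown1, 0)

print(toy)
```

Both columns should agree row by row, which makes it easy to see that the approach keeps only one markdown value per row rather than summing them.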

What this looks like

    df[91:94,] %>%
      select(Store, date, markdown, MarkDown3, MarkDown4, MarkDown5) %>%
      html_df()

    row  Store  date        markdown  MarkDown3  MarkDown4  MarkDown5
    91   1      2011-10-28      0.00         NA         NA         NA
    92   1      2011-11-04      0.00         NA         NA         NA
    93   1      2011-11-11   6551.42     215.07    2406.62    6551.42
    94   1      2011-11-18   5988.57      51.98     427.39    5988.57

    df[1:2,] %>% select(date, week, year) %>% html_df()

    date        week  year
    2010-02-05     6  2010
    2010-02-12     7  2010
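The week values in the second table follow from how `lubridate::week()` counts: the number of complete seven-day periods since January 1, plus one. A minimal check on the two dates shown, assuming lubridate is installed:

```r
library(lubridate)

d <- as.Date(c("2010-02-05", "2010-02-12"))

# lubridate's week(): complete 7-day periods since January 1, plus one
wk <- week(d)
print(wk)

# The same numbers computed by hand from the day of the year
wk_manual <- (yday(d) - 1) %/% 7 + 1
print(wk_manual)
```

Because this count restarts on January 1 of each year, week 53 can appear at the very end of a year, which is worth keeping in mind when matching weeks across years.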

Cleaning: Missing CPI and Unemployment
Apply the (year, Store)’s CPI and Unemployment to missing data

    # Fill in missing CPI and Unemployment data
    df_test <- df_test %>%
      group_by(Store, year) %>%
      mutate(CPI=ifelse(is.na(CPI), mean(CPI, na.rm=T), CPI),
             Unemployment=ifelse(is.na(Unemployment),
                                 mean(Unemployment, na.rm=T),
                                 Unemployment)) %>%
      ungroup()
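To see what the group-wise fill does, here is the same pattern on a hypothetical five-row data frame (the CPI values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: two stores, one year, CPI missing in some weeks
toy <- data.frame(
  Store = c(1, 1, 1, 2, 2),
  year  = c(2013, 2013, 2013, 2013, 2013),
  CPI   = c(220, NA, 224, 130, NA)
)

# Replace each missing CPI with the mean of the observed CPI
# values within the same (Store, year) group
filled <- toy %>%
  group_by(Store, year) %>%
  mutate(CPI = ifelse(is.na(CPI), mean(CPI, na.rm = TRUE), CPI)) %>%
  ungroup()

print(filled)
```

Store 1's gap is filled with 222 (the mean of 220 and 224) and store 2's with 130. If every value in a group were missing, `mean(..., na.rm = TRUE)` would return `NaN`, so that case would need a separate fallback.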

Cleaning: Adding IDs
▪ Build a unique ID
  ▪ Since Store, week, and department are all 2 digits, make a 6 digit number with 2 digits for each
    ▪ sswwdd
▪ Build Walmart’s requested ID for submissions
  ▪ ss_dd_YYYY-MM-DD

    # Unique IDs in the data
    df$id <- df$Store * 10000 + df$week * 100 + df$Dept
    df_test$id <- df_test$Store * 10000 + df_test$week * 100 + df_test$Dept

    # Unique ID and factor building
    swd <- c(df$id, df_test$id)  # Pool all IDs
    swd <- unique(swd)           # Only keep unique elements
    swd <- data.frame(id=swd)    # Make a data frame
    swd$swd <- factor(swd$id)    # Extract factors for using later

    # Add unique factors to data -- ensures same factors for both data sets
    df <- left_join(df, swd)
    df_test <- left_join(df_test, swd)

    df_test$Id <- paste0(df_test$Store, '_', df_test$Dept, "_", df_test$date)

What the IDs look like

    html_df(df_test[c(20000,40000,60000), c("Store","week","Dept","id","swd","Id")])

    Store  week  Dept  id      swd     Id
    8      27    33    82733   82733   8_33_2013-07-05
    15     46    91    154691  154691  15_91_2012-11-16
    23     52    25    235225  235225  23_25_2012-12-28
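The first row of the table can be reproduced by hand, which also shows that the numeric ID is reversible. The variable names below are illustrative only:

```r
# Rebuild the IDs for the first row of the table above
store <- 8
week  <- 27
dept  <- 33
date  <- as.Date("2013-07-05")

id <- store * 10000 + week * 100 + dept    # sswwdd
Id <- paste0(store, "_", dept, "_", date)  # Walmart's submission format

print(id)
print(Id)

# The numeric ID can be decoded again with integer division and modulo
decoded <- c(store = id %/% 10000,
             week  = (id %/% 100) %% 100,
             dept  = id %% 100)
print(decoded)
```

The encoding is unique only because week and department never exceed two digits; a three-digit department code would spill into the week digits.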

Add in (store, department) average sales

    # Calculate average by store-dept and distribute to df_test
    df <- df %>%
      group_by(Store, Dept) %>%
      mutate(store_avg=mean(Weekly_Sales, na.rm=TRUE)) %>%
      ungroup()
    df_sa <- df %>%
      group_by(Store, Dept) %>%
      slice(1) %>%
      select(Store, Dept, store_avg) %>%
      ungroup()
    df_test <- left_join(df_test, df_sa)
    ## Joining, by = c("Store", "Dept")

    # 36 observations have messed up department codes -- ignore (set to 0)
    df_test[is.na(df_test$store_avg),]$store_avg <- 0

    # Calculate multipliers based on store_avg (and removing NaN and Inf)
    df$Weekly_mult <- df$Weekly_Sales / df$store_avg
    df[!is.finite(df$Weekly_mult),]$Weekly_mult <- NA
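The last step guards against division by a zero average. A short sketch with hypothetical values shows which results it catches:

```r
# Hypothetical toy values: the second and third averages are zero
weekly_sales <- c(10, 0, 5)
store_avg    <- c(5, 0, 0)

# 10 / 5 = 2, 0 / 0 = NaN, 5 / 0 = Inf
weekly_mult <- weekly_sales / store_avg
print(weekly_mult)

# is.finite() is FALSE for NaN, Inf, -Inf, and NA, so one test covers them all
weekly_mult[!is.finite(weekly_mult)] <- NA
print(weekly_mult)
```

Setting these to `NA` rather than 0 keeps them out of later averages or regressions instead of biasing them toward zero.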

Add in (week, store, dept) average sales

    # Calculate mean by week-store-dept and distribute to df_test
    df <- df %>%
      group_by(Store, Dept, week) %>%
      mutate(naive_mean=mean(Weekly_Sales, na.rm=TRUE)) %>%
      ungroup()
    df_wm <- df %>%
      group_by(Store, Dept, week) %>%
      slice(1) %>%
      ungroup() %>%
      select(Store, Dept, week, naive_mean)
    df_test <- df_test %>% arrange(Store, Dept, week)
    df_test <- left_join(df_test, df_wm)
    ## Joining, by = c("Store", "Dept", "week")

ISSUE: New (week, store, dept) groups
▪ This is in our testing data!
  ▪ So we’ll need to predict out groups we haven’t observed at all

    table(is.na(df_test$naive_mean))
    ##
    ##  FALSE   TRUE
    ## 113827   1237

▪ Fix: Fill with 1 or 2 lags where possible using ifelse() and lag()
▪ Fix: Fill with 1 or 2 leads where possible using ifelse() and lead()
▪ Fill with store_avg when the above fail
▪ Code is available in the code file – a bunch of code like:

    df_test <- df_test %>%
      arrange(Store, Dept, date) %>%
      group_by(Store, Dept) %>%
      mutate(naive_mean=ifelse(is.na(naive_mean), lag(naive_mean), naive_mean)) %>%
      ungroup()
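The full fallback order (lags, then leads, then store_avg) can be sketched in a single step with `dplyr::coalesce()`. This is an assumption about how the pieces might fit together, not the code from the course's code file: it swaps the repeated `ifelse()` calls for one `coalesce()`, and every lag and lead reads the original column, whereas repeated passes could also reuse values filled in by earlier passes. The toy data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: Dept 1 has some observed means, Dept 2 has none
toy <- data.frame(
  Store      = 1,
  Dept       = c(1, 1, 1, 1, 1, 2, 2),
  date       = c(as.Date("2013-01-04") + 7 * 0:4,
                 as.Date("2013-01-04") + 7 * 0:1),
  naive_mean = c(NA, 100, NA, NA, 130, NA, NA),
  store_avg  = c(rep(110, 5), rep(50, 2))
)

filled <- toy %>%
  arrange(Store, Dept, date) %>%
  group_by(Store, Dept) %>%
  mutate(naive_mean = coalesce(
    naive_mean,
    lag(naive_mean, 1),   # previous week
    lag(naive_mean, 2),   # two weeks back
    lead(naive_mean, 1),  # next week
    lead(naive_mean, 2),  # two weeks ahead
    store_avg             # last resort when no neighbor is available
  )) %>%
  ungroup()

print(filled)
```

In this toy case the first row of Dept 1 is filled from the following week, the third and fourth rows from one and two weeks back, and both Dept 2 rows fall through to store_avg.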

Cleaning is done
▪ Data is in order
  ▪ No missing values where data is needed
  ▪ Needed values created

    df %>%
      group_by(week, Store) %>%
      mutate(sales=mean(Weekly_Sales)) %>%
      slice(1) %>%
      ungroup() %>%
      ggplot(aes(y=sales, x=week, color=factor(Store))) +
      geom_line() +
      xlab("Week") +
      ylab("Sales for Store (dept average)") +
      theme(legend.position="none")

Tackling the problem

First try
▪ Ideal: Use last week to predict next week!
  ▪ No data for testing…
▪ First instinct: try to use a linear regression to solve this
  ▪ We have this

What to put in the model?
