Big data management and predictive analytics for customer - PowerPoint PPT Presentation

Big data management and predictive analytics for customer transactions
Large-Scale Distributed Machine Learning & Operations using Apache Spark and AWS
BDA761 Big Data Mgmt in a Supercomputing Environment
MUDIT UPPAL

Problem with Big Data
❖ Machine learning practices at scale for TB/PB data
❖ A framework that builds and computes models on virtual nodes, with processors and memory getting cheaper every year
❖ Using GPUs, multi-threading, and multiple cores
❖ Goal: thinking in "big data"; create a tool that can be used in any operations, sales, or customer analysis
❖ Parallelizing both DATA and MODEL

Case study
❖ Rossmann operates over 3,000 drug stores in 7 European countries
❖ Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality.
❖ DECISION TASKS:
  ❖ Forecast sales for up to 6 weeks using store, customer, and promotion data, et cetera
  ❖ Predict daily sales for ~1,000 stores
Data Source: Rossmann, Walmart via Kaggle

~2.8 million rows (data points)
Label: Sales
Features: Store, Sales, Customers, Open, Date, StateHoliday, SchoolHoliday, StoreType, Assortment, CompetitionDistance, CompetitionSinceMonth, sinceYear, Promo, Promo2, PromoInterval, Promo2SinceMonth, PromoSinceYear, DayOfWeek, etc.

Goal
❖ Aim is to create a predictive analytics framework
❖ Distributed machine learning for datasets of any size
❖ 3 demos in this presentation
  ❖ Apache Spark
  ❖ Exploratory data analysis with R
  ❖ XGBoost (Python) ML

Main Demos
❖ Exploratory data analysis
❖ Apache Spark (SQL and Spark contexts) - distributed ML across multiple machines/nodes
❖ Linear regression analysis
❖ Gradient descent
❖ Ensemble and boosting algorithms (XGBoost)

Tools
❖ R + Python for exploratory analysis
❖ Apache Spark for implementation
❖ Hadoop
❖ Spark MLlib
❖ AWS cloud (m4 large instances)
❖ Ganglia (distributed monitoring system to work with clusters)
❖ PySpark

Data Pipeline
Stages: Obtain Data → Split Data → Feature extraction → Find relationships → Supervised learning → Predict → Evaluation
Spark steps:
❖ Parse initial dataset
❖ Use the ‘LabeledPoint’ class
❖ Shift labels (starting from zero)
❖ Visualize features
❖ Train (via gradient descent) and evaluate a linear regression
❖ Create and evaluate a baseline model
❖ Use weights to make predictions
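A few of the pipeline steps (shifting labels to start from zero, a baseline model, and training a linear regression via gradient descent) can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark code from the demo: the toy rows, the feature meanings (a promo flag and a scaled customer count), the learning rate, and all function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def predict(weights, bias, x):
    """Linear model prediction: dot(weights, x) + bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_linear_gd(xs, ys, lr=0.1, epochs=5000):
    """Batch gradient descent on mean squared error."""
    n, d = len(xs), len(xs[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            for j in range(d):
                grad_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical toy rows: [promo flag, scaled customer count].
xs = [[0.0, 0.1], [1.0, 0.4], [0.0, 0.5], [1.0, 0.9], [0.0, 0.8], [1.0, 0.2]]
raw_labels = [3.0 * x[0] + 2.0 * x[1] + 1.0 for x in xs]

# Shift labels so the smallest label becomes zero.
min_label = min(raw_labels)
ys = [y - min_label for y in raw_labels]

# Baseline model: always predict the mean label.
mean_label = sum(ys) / len(ys)
baseline_rmse = rmse(ys, [mean_label] * len(ys))

# Linear regression trained by gradient descent, then used to predict.
weights, bias = train_linear_gd(xs, ys)
model_rmse = rmse(ys, [predict(weights, bias, x) for x in xs])
print("baseline RMSE:", baseline_rmse, "model RMSE:", model_rmse)
```

Comparing against the mean-label baseline shows whether the learned weights add anything at all. On real data the comparison would be made on a held-out split rather than on the training rows.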

Demo 1: exploratory analysis (R + Python)
❖ Data and source code can be downloaded from: http://muppal.com

Apache Spark <--> XGBoost
Apache Spark
❖ Spark excels at distributing operations across a cluster while abstracting away many of the underlying implementation details.
❖ Thinking in terms of RDDs (transformations and actions)
❖ Still under development
XGBoost
❖ XGBoost starts off with a rough prediction and then builds a series of decision trees, each trying to correct the prediction error of the one before
❖ Large-scale, distributed gradient boosting (GBDT, GBRT, or GBM) library that runs on a single node, Hadoop YARN, etc.
❖ XGBoost can also be distributed and scale to terascale data
❖ You can define threads with “nthread=..”
❖ https://github.com/dmlc/xgboost
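The boosting idea on this slide (start from a rough prediction, then add trees that each correct the error left by the previous ones) can be illustrated with a minimal sketch. It uses one-split decision stumps on a single feature and squared error, which is a deliberate simplification and not the XGBoost implementation. The data, the learning rate, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Pick the threshold split on xs that minimises squared error on the residuals."""
    best = None
    # The largest x is excluded as a threshold so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Start from the mean label, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)  # the rough first prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base prediction and every stump's shrunken correction."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: one feature, one target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [5.0, 6.0, 9.0, 12.0, 11.0, 15.0, 20.0, 24.0]

model = boost(xs, ys)
mse_base = mse(ys, [model[0]] * len(ys))
mse_boosted = mse(ys, [predict(model, x) for x in xs])
print("MSE with mean only:", mse_base, "MSE after boosting:", mse_boosted)
```

The learning rate shrinks each correction so that no single stump dominates. Real libraries add regularisation, deeper trees, and multi-feature splits on top of this loop.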

Data pipeline/Tools
[Diagram: HDFS and a Driver feeding four workers (Worker1 to Worker4), each running two ML algorithms (1 & 2, 3 & 4, 5 & 6, 7 & 8), for distributed ML analysis]
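The driver/worker layout in this diagram, where each worker runs different algorithms, can be mimicked on one machine with a thread pool. The sketch below is an assumption-laden stand-in rather than Spark code: the three "algorithms" are trivial constant predictors, and the names `CANDIDATES`, `evaluate`, and `run_driver` are hypothetical.

```python
import math
import statistics
from concurrent.futures import ThreadPoolExecutor

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Stand-ins for real ML algorithms: each maps training labels to one constant prediction.
CANDIDATES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "last_value": lambda train: train[-1],
}

def evaluate(name, train, holdout):
    """Worker task: fit one candidate and score it on the holdout labels."""
    prediction = CANDIDATES[name](train)
    return name, rmse(holdout, [prediction] * len(holdout))

def run_driver(train, holdout, workers=4):
    """Driver: fan the candidates out to the pool and gather their scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(evaluate, name, train, holdout) for name in CANDIDATES]
        return dict(future.result() for future in futures)

# Hypothetical toy labels with one outlier in the training part.
train = [10.0, 12.0, 11.0, 30.0, 13.0]
holdout = [12.0, 13.0, 11.0]

scores = run_driver(train, holdout)
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

Because the candidates are independent, the driver only needs to collect one score per task. On a cluster the same pattern would ship each task to a separate node instead of a thread.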

Demo 2/3: XGBoost + ML Spark on clusters
XGBoost + Apache Spark/PySpark
❖ Findings: [1199] train-rmspe:0.103954 eval-rmspe:0.094526
❖ Validation RMSPE: 0.094526
❖ Data and source code can be downloaded from: http://muppal.com
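The RMSPE figures above are root mean square percentage errors. A minimal sketch of the metric follows, assuming rows with zero actual sales are skipped so the division is defined. The function name and sample values are hypothetical and are not taken from the demo.

```python
import math

def rmspe(y_true, y_pred):
    """Root mean square percentage error over rows with a non-zero true value."""
    terms = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred) if t != 0]
    if not terms:
        raise ValueError("need at least one non-zero true value")
    return math.sqrt(sum(terms) / len(terms))

# Hypothetical values: the third row has zero sales and is skipped.
actual = [100.0, 200.0, 0.0, 400.0]
predicted = [110.0, 180.0, 5.0, 400.0]

score = rmspe(actual, predicted)
print(round(score, 6))
```

Because each error is divided by the true value, a miss of 10 on sales of 100 counts the same as a miss of 20 on sales of 200, which keeps large stores from dominating the score.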

Thank you!

References
❖ http://rcarneva.github.io/understanding-gradient-boosting-part-1.html
❖ https://www.kaggle.com/c/walmart-recruiting-trip-type-classification
