
Big data management and predictive analytics for customer transactions - PowerPoint PPT Presentation



  1. Big data management and predictive analytics for customer transactions
     Large Scale Distributed Machine Learning & operations using Apache Spark and AWS
     BDA761 Big Data Mgmt in a Supercomputing environment
     MUDIT UPPAL

  2. Problem with Big Data(s)
     ❖ Machine learning practices at scale for PB/TB data
     ❖ A framework that builds and computes models on virtual nodes, taking advantage of processors and memory getting cheaper every year
     ❖ Using GPUs, multi-threading, and multiple cores
     ❖ Goal: thinking in ‘big data’; create a tool that can be used in any operations/sales/customer analysis
     ❖ Parallelizing: DATA and MODEL

  3. Case study
     ❖ Rossmann: operates over 3,000 drug stores in 7 European countries
     ❖ Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality.
     ❖ DECISION TASKS:
     ❖ Forecast sales for up to 6 weeks using store data, customers, promotion data, et cetera
     ❖ Predict sales of ~1000 stores daily
     Data Source: Rossmann, Walmart via Kaggle

  4. ~2.8 million rows (data points)
     Label: Sales
     Features: Store, Sales, customers, Open, date, stateholiday, schoolHoliday, storetype, Assortment, competitionDistance, CompetitionSinceMonth, sinceYear, Promo, Promo2, PromoInterval, Promo2SinceMonth, PromoSinceYear, DayOfWeek, etc.

  5. Goal
     ❖ Aim is to create a Predictive Analytics Framework
     ❖ Distributed machine learning for any size dataset
     ❖ 3 demos in this presentation
     ❖ Apache Spark
     ❖ Exploratory DA with R
     ❖ XGBoost (Python) ML
     Data Source: Rossmann, Walmart via Kaggle

  6. Main Demos
     ❖ Exploratory data analysis
     ❖ Apache Spark (sql spark context): distributed ML across multiple machines/nodes
     ❖ Linear regression analysis
     ❖ Gradient descent
     ❖ Ensemble and boosting algorithms (XGBoost)

  7. Tools
     ❖ R + Python for exploratory analysis
     ❖ Apache Spark for implementation
     ❖ Hadoop
     ❖ Spark MLlib
     ❖ AWS cloud (m4 large instances)
     ❖ Ganglia (distributed monitoring system to work with clusters)
     ❖ pyspark

  8. Data Pipeline
     Stages: Obtain Data → Split Data → Feature extraction → Find relationships → Supervised learning → Predict → Evaluation
     Spark:
     ❖ Parse initial dataset
     ❖ Use ‘LabeledPoint’ class
     ❖ Shift labels (starting from zero)
     ❖ Visualize features
     ❖ Create and evaluate a baseline model
     ❖ Train (via gradient descent) and evaluate a linear regression
     ❖ Use weights to make predictions
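The Spark steps on slide 8 (shift labels, baseline model, linear regression trained by gradient descent, predictions from weights) can be sketched on a single machine. This is a minimal illustration in plain Python, not the presenter's Spark code: the toy rows, the feature meanings, the step size `alpha`, and every function name are assumptions, and Spark's `LabeledPoint` is stood in for by plain lists.

```python
import math

def shift_labels(labels):
    """Shift labels so the smallest label becomes zero."""
    lowest = min(labels)
    return [y - lowest for y in labels]

def rmse(preds, labels):
    """Root mean squared error between predictions and labels."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))

def predict(w, b, x):
    """Linear prediction: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_gd(features, labels, alpha=0.1, iters=5000):
    """Batch gradient descent on mean squared error; returns (weights, bias)."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - alpha * g / n for wi, g in zip(w, grad_w)]
        b -= alpha * grad_b / n
    return w, b

# Hypothetical toy rows: [promo flag, scaled day of week] -> sales.
# The labels are exactly linear in the features so the fit is easy to check.
features = [[0.0, 0.1], [1.0, 0.2], [0.0, 0.5], [1.0, 0.7], [0.0, 0.9], [1.0, 1.0]]
raw_labels = [102.0, 124.0, 110.0, 134.0, 118.0, 140.0]
labels = shift_labels(raw_labels)

# Baseline model: always predict the mean label.
baseline = sum(labels) / len(labels)
baseline_rmse = rmse([baseline] * len(labels), labels)

# Linear regression trained by gradient descent, then used for predictions.
w, b = train_linear_gd(features, labels)
model_rmse = rmse([predict(w, b, x) for x in features], labels)
```

Comparing `model_rmse` with `baseline_rmse` is the "evaluate" step: a trained model is only useful if it beats the always-predict-the-mean baseline.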

  9. Demo 1: exploratory analysis R + Python
     ❖ Data and source code can be downloaded from: http://muppal.com

  10. Apache Spark <-> XGBoost
      Apache Spark
      ❖ Spark excels at distributing operations across a cluster while abstracting away many of the underlying implementation details.
      ❖ Thinking in terms of RDDs (transformations and actions)
      ❖ Still under development
      XGBoost
      ❖ XGBoost starts off with a rough prediction and then builds a series of decision trees, with each trying to correct the prediction error of the one before
      ❖ Large-scale and distributed gradient boosting (GBDT, GBRT or GBM) library; runs on a single node, Hadoop YARN, etc.
      ❖ XGBoost can also be distributed and scale to terascale data
      ❖ You can define threads with “nthread = ..”
      ❖ https://github.com/dmlc/xgboost
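The boosting idea on slide 10 (start from a rough prediction, then add trees that each correct the error of the ones before) can be illustrated with one-split trees ("stumps") on a single feature. This is a sketch of plain gradient boosting for squared error, not XGBoost's actual algorithm, which adds regularization and second-order information; the data, the learning rate `eta`, and all function names are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of xs for predicting the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds, eta=0.3):
    """Return (base, eta, stumps): a rough prediction plus corrective stumps."""
    base = sum(ys) / len(ys)              # rough initial prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # current errors
        stump = fit_stump(xs, residuals)                 # tree fitted to the errors
        stumps.append(stump)
        preds = [p + eta * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, eta, stumps

def boosted_predict(model, x):
    base, eta, stumps = model
    return base + eta * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [5.0, 6.0, 5.5, 9.0, 9.5, 14.0, 13.0, 15.0]

base_mse = mse(boost(xs, ys, rounds=0), xs, ys)
one_round_mse = mse(boost(xs, ys, rounds=1), xs, ys)
final_mse = mse(boost(xs, ys, rounds=50), xs, ys)
```

Training error falls from `base_mse` to `one_round_mse` to `final_mse`, which is the "each tree corrects the one before" behaviour; on real data a held-out set is needed to see when extra rounds stop helping.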

  11. Data pipeline/Tools
      [Diagram: HDFS → Driver → Workers 1 to 4, each running a pair of ML algorithms (1 & 2, 3 & 4, 5 & 6, 7 & 8) → Distributed ML Analysis]
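The driver/worker layout on slide 11 follows a data-parallel pattern: the driver splits the data into partitions, each worker computes a partial result on its own partition, and the driver combines the results. The sketch below imitates that pattern with a local thread pool. It is not Spark code; the threads only stand in for workers and give no real speed-up in CPython, and the data, step size, and function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(partition, w, b):
    """Worker step: summed squared-error gradient over one data partition."""
    grad_w, grad_b = 0.0, 0.0
    for x, y in partition:
        err = w * x + b - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(partition)

def split(data, n_parts):
    """Driver step: deal rows out to n_parts partitions."""
    return [data[i::n_parts] for i in range(n_parts)]

def distributed_gradient(partitions, w, b, pool):
    """Driver step: collect partial gradients and average them."""
    results = list(pool.map(lambda part: partial_gradient(part, w, b), partitions))
    n = sum(r[2] for r in results)
    return sum(r[0] for r in results) / n, sum(r[1] for r in results) / n

# Hypothetical data generated from y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
partitions = split(data, 4)               # one partition per worker

w, b = 0.0, 0.0
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(2000):
        grad_w, grad_b = distributed_gradient(partitions, w, b, pool)
        w -= 0.02 * grad_w
        b -= 0.02 * grad_b
```

Because the partial sums add up to the full-data gradient, the result matches single-machine gradient descent; only the place where the work happens changes.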

  12. Demo 2/3: xgboost + ML Spark on clusters
      xgboost + Apache Spark/pyspark
      ❖ Findings: [1199] train-rmspe:0.103954 eval-rmspe:0.094526
      ❖ Validating: RMSPE 0.094526
      ❖ Data and source code can be downloaded from: http://muppal.com
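The findings on slide 12 are reported as RMSPE (root mean square percentage error). A minimal sketch of the metric follows, assuming the definition used by the Kaggle Rossmann competition, where rows with zero actual sales are ignored; the function name and sample numbers are made up for illustration.

```python
import math

def rmspe(actual, predicted):
    """Root mean square percentage error, skipping rows whose actual value is 0."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("RMSPE is undefined when every actual value is zero")
    return math.sqrt(sum(((a - p) / a) ** 2 for a, p in pairs) / len(pairs))

# Hypothetical sales figures; the third row is skipped because actual sales are 0.
actual = [100.0, 200.0, 0.0, 400.0]
predicted = [110.0, 180.0, 5.0, 400.0]
score = rmspe(actual, predicted)
```

Because each error is divided by the actual value, a miss of 10 on a store selling 100 counts as much as a miss of 20 on a store selling 200.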

  13. Thank you!

  14. References
      ❖ http://rcarneva.github.io/understanding-gradient-boosting-part-1.html
      ❖ https://www.kaggle.com/c/walmart-recruiting-trip-type-classification
