MLeap: Release Spark ML Pipelines
Hollin Wilkins & Mikhail Semeniuk
SATURDAY

Introduction

- Web Dev @ Cornell; studied some General Biology
- Rails consulting for TrueCar and other companies
- Erlang for highly-concurrent game servers
- C/C# game engines
- Implement ML model for ClearBook in Ruby, Python and Java
- Join TrueCar for Rails/mobile development
- Begin work on machine learning platform based on Spark/Hadoop
- How do we build real-time APIs based on Spark models? SparkContext prohibitive
- Try several ideas, fail, learn a ton, begin MLeap project

- Studied some Math and Economics @ University of Minnesota
- Predictive modeling of at risk patients @ UHG
- R/SAS batch pipelines
- Joins TrueCar to do new car price modeling
- SQL-based batch pipelines + python for Linear Regressions in API layer
- Massive pre-computation in Hadoop/ElasticSearch
- Continues to train models in SAS/R
- Portions of models still translated, now to Java
- Spark enables data processing and training of models in same environment
- MLeap

Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common source of friction at data-driven organizations.

Action: Data scientists write data pipelines to construct research datasets
Reaction: Engineers re-write the data pipelines for a production-ready system

Action: Engineers write scalable libraries for computing features and algorithms
Reaction: Data scientists largely don’t use those libraries and maintain/re-write their own copy of the code

Action: Data scientists largely focus on linear/logistic regressions due to engineering constraints
Reaction: Talented engineers get largely tired of coding up linear regressions and updating coefficients

Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.

Existing Solutions: You won’t believe how many companies are still deploying algorithms in a SQL environment! And these are billion dollar operations.

Solutions compared:
- Hard-Coded Models (SQL, Java, Ruby)
- PMML
- Emerging Solutions (yHat, DataRobot)
- Enterprise Solutions (Microsoft, IBM, SAS)
- MLeap

Comparison criteria:
- Quick to Implement
- Open Sourced
- Committed to Spark/Hadoop
- API Server Infrastructure

Lesson Learned: Push code down to where the data is, not the other way around!

MLeap Solution
• Born out of the need to deploy models quickly to a real-time API server
• Leverage the Hadoop/Spark ecosystem for training, get rid of the Spark dependency for execution
• Easily reuse models by serializing them and executing them without Spark

MLeap Components
• core (mleap-core) - provides linear algebra system, regression models, and feature builders
• runtime (mleap-runtime) - provides DataFrame-like “LeapFrame” and transformers for it
• spark (mleap-spark) - provides easy conversion from Spark transformers to MLeap transformers
• serialization (mleap-serialization) - common serialization format for Spark and MLeap (Bundle.ML)

New features: expanded serialization formats to include both JSON and protobuf for large models (e.g. random forests with thousands of features)

MLeap Core Components (mleap-core)
• Regressions: LinearRegression, RandomForest, Gradient Boosted Reg. Trees
• Classifiers: RandomForest, LogisticRegression
• Linear Algebra: Dense/Sparse Vectors, BLAS from Spark
• Feature Builders: VectorAssembler, StringIndexer, StandardScaler

MLeap Runtime (mleap-runtime)
• Provides LeapFrame, which stores data for transformations by MLeap transformers
• MLeap transformers use mleap-core building blocks to transform LeapFrame
• MLeap transformers correspond one-to-one with Spark transformers
• No dependencies on Spark
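To make the LeapFrame and transformer relationship concrete, here is a minimal sketch in plain Python. It is not the MLeap API: `ToyFrame` and `ScaleTransformer` are hypothetical names invented for illustration, and the real runtime lives on the JVM. The sketch only shows the idea from the slide, which is a frame that holds data, plus a transformer that wraps a core function and appends its result as a new column.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a LeapFrame: column names plus rows of values."""
    columns: List[str]
    rows: List[List[Any]]

    def with_column(self, name: str, inputs: List[str],
                    fn: Callable[..., Any]) -> "ToyFrame":
        # Look up the input columns once, then apply fn to every row.
        idx = [self.columns.index(c) for c in inputs]
        new_rows = [row + [fn(*(row[i] for i in idx))] for row in self.rows]
        return ToyFrame(self.columns + [name], new_rows)


class ScaleTransformer:
    """Hypothetical transformer: applies a core scaling function to a frame."""

    def __init__(self, input_col: str, output_col: str, mean: float, std: float):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean
        self.std = std

    def transform(self, frame: ToyFrame) -> ToyFrame:
        # The "core" logic is just arithmetic; the transformer wires it to columns.
        return frame.with_column(
            self.output_col, [self.input_col],
            lambda x: (x - self.mean) / self.std,
        )


frame = ToyFrame(["sqft"], [[1000.0], [2000.0]])
scaled = ScaleTransformer("sqft", "sqft_scaled", mean=1500.0, std=500.0).transform(frame)
print(scaled.columns)  # ['sqft', 'sqft_scaled']
print(scaled.rows)     # [[1000.0, -1.0], [2000.0, 1.0]]
```

Nothing in the sketch needs a cluster or a SparkContext, which is the property the slide emphasizes.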

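Feature Pipeline (diagram)
• Legend: Categorical Feature, Categorical Feature Index, Categorical Feature One Hot Vector, Continuous Feature, Continuous Feature Vector, Scaled Continuous Feature Vector, Final Feature Vector
• Continuous features: Continuous Feature -> VectorAssembler -> StandardScaler -> Scaled Continuous Feature Vector
• Categorical features (three in the diagram): Categorical Feature -> StringIndexer -> OneHotEncoder -> Categorical Feature One Hot Vector
• VectorAssembler combines the scaled continuous vector and the one-hot vectors into the Final Feature Vector

Regression Pipeline (diagram)
• Final Feature Vector -> LinearRegression -> Prediction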
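The feature and regression pipelines in the diagram can be sketched end to end in a few lines of plain Python. This is an illustrative toy, not MLeap or Spark code: the function names, the example feature values, and the coefficients are all assumptions made up for the example.

```python
from typing import List


def standard_scale(values: List[float], means: List[float],
                   stds: List[float]) -> List[float]:
    """Scale each continuous feature to zero mean and unit variance."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def assemble(*parts: List[float]) -> List[float]:
    """Concatenate several feature vectors into one, like a vector assembler."""
    out: List[float] = []
    for part in parts:
        out.extend(part)
    return out


def linear_predict(features: List[float], coefficients: List[float],
                   intercept: float) -> float:
    """Dot product of features and coefficients, plus the intercept."""
    return sum(f * c for f, c in zip(features, coefficients)) + intercept


# Hypothetical inputs: two continuous features and one one-hot encoded category.
continuous = standard_scale([2.0, 1200.0], means=[3.0, 1000.0], stds=[1.0, 400.0])
categorical = [0.0, 1.0, 0.0]
features = assemble(continuous, categorical)

# Hypothetical trained parameters.
prediction = linear_predict(
    features, coefficients=[10.0, 20.0, 5.0, 15.0, 25.0], intercept=100.0
)
print(features)    # [-1.0, 0.5, 0.0, 1.0, 0.0]
print(prediction)  # 115.0
```

Once the means, standard deviations, and coefficients are known, scoring a single row is a handful of arithmetic operations, which is why it can run outside Spark.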

Categorical Pipeline (diagram)
• LeapFrame with Categorical Feature -> StringIndexer -> LeapFrame with Categorical Feature Index -> OneHotEncoder -> LeapFrame with Categorical Feature One Hot Vector
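As a rough illustration of the two categorical steps, the sketch below indexes string labels and then one-hot encodes an index. It is plain Python with hypothetical helper names and example labels, not the Spark or MLeap implementation. It follows Spark's default StringIndexer ordering (most frequent label gets index 0) and, unlike Spark's OneHotEncoder default, keeps a slot for every category rather than dropping the last one.

```python
from typing import Dict, List


def fit_string_indexer(values: List[str]) -> Dict[str, int]:
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts: Dict[str, int] = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Hypothetical training column of a categorical feature.
training = ["apartment", "house", "apartment", "loft"]
labels = fit_string_indexer(training)
encoded = one_hot(labels["house"], len(labels))

print(labels)   # {'apartment': 0, 'house': 1, 'loft': 2}
print(encoded)  # [0.0, 1.0, 0.0]
```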

MLeap Serialization (Bundle.ML) (mleap-serialization)
• Provides common serialization for both Spark and MLeap
• 100% protobuf/JSON based for easy reading, compact data, and portability
• No dependencies on Parquet *
• Can be written to zip files, file system, HDFS, anywhere with an FS-like structure
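The idea of a JSON-based model bundle stored in a zip file can be sketched with the Python standard library. The file names and JSON fields below are assumptions chosen for illustration only and should not be read as the actual Bundle.ML layout.

```python
import io
import json
import zipfile


def write_bundle(model: dict) -> bytes:
    """Write a model description as JSON files inside an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Hypothetical layout: a small manifest plus one file per model.
        zf.writestr("manifest.json", json.dumps({"format": "json", "root": "model"}))
        zf.writestr("model/model.json", json.dumps(model))
    return buf.getvalue()


def read_bundle(data: bytes) -> dict:
    """Read the model description back out of the zip archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return json.loads(zf.read("model/model.json"))


# Hypothetical linear regression parameters.
model = {"op": "linear_regression", "coefficients": [10.0, 20.0], "intercept": 100.0}
restored = read_bundle(write_bundle(model))
print(restored == model)  # True
```

Because the payload is plain JSON inside a zip, any runtime that can unzip and parse JSON can load the model, which is the portability argument the slide makes.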

Example slides (images not included): String Indexer Model; Linear Regression Model; Linear Regression Model (Code)

MLeap Spark
• Train an ML pipeline with Spark then export it to MLeap
  Spark Estimator -> Spark Model -> MLeap Model
• Execute an MLeap pipeline against a Spark DataFrame
  Spark DataFrame -> Spark LeapFrame -> MLeap Transformer -> Spark LeapFrame -> Spark DataFrame

Benchmarks
• Spark: 23.4 ms/transform
• MLeap: 0.011 ms/transform
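For scale, the two benchmark figures can be turned into a speedup ratio and a rough sequential throughput. The short calculation below uses only the two numbers from the slide. The per-second figures assume transforms run strictly one after another on a single thread, which is an assumption and not something the slide states.

```python
spark_ms = 23.4    # milliseconds per transform, from the slide
mleap_ms = 0.011   # milliseconds per transform, from the slide

# How many times faster one MLeap transform is than one Spark transform.
speedup = spark_ms / mleap_ms

# Sequential transforms per second (1000 ms in a second), single-threaded assumption.
spark_per_sec = 1000.0 / spark_ms
mleap_per_sec = 1000.0 / mleap_ms

print(f"speedup: about {speedup:.0f}x")                # speedup: about 2127x
print(f"Spark: about {spark_per_sec:.1f} per second")  # Spark: about 42.7 per second
print(f"MLeap: about {mleap_per_sec:.0f} per second")  # MLeap: about 90909 per second
```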

Architecture (diagram, labels only)
• Consumers: Web Services, Mobile Apps, Algo 1 / Algo 2 / Algo n
• Interfaces: REST API, Java API
• MLeap API + Server
• MLeap Transformers + Serialization
• Spark Jobs, Map/Reduce
• MLlib, Scikit + other
• Spark + Hadoop + HDFS
• Pipeline Code, Notebooks

Demo Usage of MLeap
• Train a sample listing price model using linear regression and random forest against some Airbnb training data
• Deploy both models to a local API server
• Get real-time results
• IN UNDER 5 MINUTES!

Future of MLeap
• Unify linear algebra and core libraries with Spark
• Python/R interface
• Deploy easily to embedded systems and outside of JVM
• Full support for all Spark transformers

MLeap Development
• Currently 5 people working on projects across 4 different companies
• Talk to us if you are interested in deploying this technology at your company
• MLeap Demo Project: https://github.com/TrueCar/mleap-demo

Thank You!

Hollin Wilkins
email: hollinrwilkins@gmail.com
github: https://github.com/hollinwilkins
twitter: https://twitter.com/HollinWilkins

Mikhail Semeniuk
email: seme0021@gmail.com
github: https://github.com/seme0021
twitter: https://twitter.com/MikhailSemeniuk
