MLeap: Release Spark ML Pipelines, Hollin Wilkins & Mikhail Semeniuk



slide-1
SLIDE 1

MLeap: Release Spark ML Pipelines

Hollin Wilkins & Mikhail Semeniuk

SATURDAY

slide-2
SLIDE 2

Introduction

Studied some Math and Economics @ University of Minnesota

  • Predictive modeling of at-risk patients @ UHG
  • R/SAS batch pipelines
  • Joins TrueCar to do new car price modeling
  • SQL-based batch pipelines + Python for linear regressions in API layer
  • Continues to train models in SAS/R
  • SQL-based batch pipelines + Python for linear regressions in API layer
  • Massive pre-computation in Hadoop/ElasticSearch
  • Portions of models still translated, now to Java
  • Spark enables data processing and training of models in same environment

MLeap

  • Web Dev @ Cornell
  • Studied some General Biology
  • Rails consulting for TrueCar and other companies
  • Implement ML model for ClearBook in Ruby, Python and Java
  • Erlang for highly-concurrent game servers
  • C/C# game engines
  • Join TrueCar for Rails/mobile development
  • Begin work on machine learning platform based on Spark/Hadoop
  • How do we build real-time APIs based on Spark models? SparkContext prohibitive
  • Try several ideas, fail, learn a ton, begin MLeap project

slide-3
SLIDE 3

Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.

Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common source of friction at data-driven organizations.

Action / Reaction

  • Action: Data scientists write data pipelines to construct research datasets
  • Reaction: Engineers re-write the data pipelines for a production-ready system
  • Action: Engineers write scalable libraries for computing features and algorithms
  • Reaction: Data scientists largely don’t use those libraries and maintain/re-write their own copy of the code
  • Action: Data scientists largely focus on linear/logistic regressions due to engineering constraints
  • Reaction: Talented engineers get tired of coding up linear regressions and updating coefficients

slide-4
SLIDE 4

Existing Solutions: You won’t believe how many companies are still deploying algorithms in a SQL environment! And these are billion dollar operations.

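Solutions compared (the per-cell marks of the comparison table were not preserved):

  • Hard-Coded Models (SQL, Java, Ruby)
  • PMML
  • Emerging Solutions (yHat, DataRobot)
  • Enterprise Solutions (Microsoft, IBM, SAS)
  • MLeap

Comparison criteria:

  • Quick to Implement
  • Open Sourced
  • Committed to Spark/Hadoop
  • API Server Infrastructure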

Lesson Learned: Push code down to where the data is, not the other way around!

slide-5
SLIDE 5

MLeap Solution

  • Born out of the need to deploy models quickly to a real-time API server
  • Leverage Hadoop/Spark ecosystem for training, get rid of Spark dependency for execution
  • Easily reuse models by serializing them and executing them without Spark

slide-6
SLIDE 6

MLeap Components

  • core - provides linear algebra system, regression models, and feature builders
  • runtime - provides DataFrame-like “LeapFrame” and transformers for it
  • spark - provides easy conversion from Spark transformers to MLeap transformers
  • serialization - common serialization format for Spark and MLeap (Bundle.ML)

New features: expanded serialization formats to include both JSON and protobuf for large models (e.g. random forests with thousands of features)

Diagram: mleap-spark, mleap-runtime, mleap-core, Bundle.ML, mleap-serialization

slide-7
SLIDE 7

MLeap Core Components

Linear Algebra

  • Dense/Sparse Vectors
  • BLAS from Spark

Feature Builders

  • VectorAssembler
  • StringIndexer
  • StandardScaler

Regressions

  • LinearRegression
  • RandomForest
  • Gradient Boosted Reg. Trees

Classifiers

  • RandomForest
  • LogisticRegression

mleap-core
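
As a rough illustration of the building blocks listed above, the sketch below shows a sparse vector whose dot product feeds a linear regression prediction. It is plain Python, not MLeap code: the class names `SparseVector` and `LinearRegressionModel` and all numbers are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SparseVector:
    """Hypothetical sparse vector: only non-zero entries are stored."""
    size: int
    indices: Sequence[int]
    values: Sequence[float]

    def dot(self, dense: Sequence[float]) -> float:
        # Only the stored (non-zero) entries contribute to the dot product.
        if len(dense) != self.size:
            raise ValueError("dimension mismatch")
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))


@dataclass(frozen=True)
class LinearRegressionModel:
    """Hypothetical model: prediction = coefficients . features + intercept."""
    coefficients: Sequence[float]
    intercept: float

    def predict(self, features: SparseVector) -> float:
        return features.dot(self.coefficients) + self.intercept


# Made-up coefficients and features, for illustration only.
model = LinearRegressionModel(coefficients=[2.0, 0.0, -1.0, 0.5], intercept=10.0)
features = SparseVector(size=4, indices=[0, 3], values=[3.0, 4.0])
prediction = model.predict(features)  # 2.0*3.0 + 0.5*4.0 + 10.0
```

Keeping the model as a small value object with no cluster dependency is what makes it cheap to call from an API server.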

slide-8
SLIDE 8

MLeap Runtime

  • Provides LeapFrame, which stores data for transformations by MLeap transformers
  • MLeap transformers use mleap-core building blocks to transform LeapFrame
  • MLeap transformers correspond one-to-one with Spark transformers
  • No dependencies on Spark

mleap-runtime
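
The relationship between a LeapFrame and its transformers can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `Frame` is a hypothetical stand-in for a LeapFrame, `StandardScalerTransformer` is a hypothetical transformer, and neither mirrors the actual MLeap API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for a LeapFrame: column names plus rows."""
    columns: List[str]
    rows: List[List[Any]]

    def with_column(self, name: str, inputs: List[str],
                    fn: Callable[..., Any]) -> "Frame":
        # Compute a new column from the named input columns, row by row,
        # and return a new frame; the original frame is left untouched.
        idx = [self.columns.index(c) for c in inputs]
        new_rows = [row + [fn(*(row[i] for i in idx))] for row in self.rows]
        return Frame(self.columns + [name], new_rows)


@dataclass(frozen=True)
class StandardScalerTransformer:
    """Hypothetical transformer: (x - mean) / std applied to one column."""
    input_col: str
    output_col: str
    mean: float
    std: float

    def transform(self, frame: Frame) -> Frame:
        return frame.with_column(self.output_col, [self.input_col],
                                 lambda x: (x - self.mean) / self.std)


# Made-up data and statistics, for illustration only.
frame = Frame(columns=["bedrooms"], rows=[[1.0], [3.0]])
scaler = StandardScalerTransformer("bedrooms", "bedrooms_scaled", mean=2.0, std=1.0)
scaled = scaler.transform(frame)
```

The transformer only holds fitted parameters (mean and standard deviation), so applying it needs nothing beyond the frame itself.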

slide-9
SLIDE 9

Feature Pipeline / Regression Pipeline (diagram; flow reconstructed from the box labels)

  • Continuous features: Continuous Feature → VectorAssembler → Continuous Feature Vector → StandardScaler → Scaled Continuous Feature Vector
  • Categorical features: Categorical Feature → StringIndexer → Categorical Feature Index → OneHotEncoder → Categorical Feature One Hot Vector → VectorAssembler → Categorical Feature Vector
  • Regression: VectorAssembler → Final Feature Vector → LinearRegression → Prediction

slide-10
SLIDE 10

Categorical Pipeline

LeapFrame (Categorical Feature) → StringIndexer → LeapFrame (Categorical Feature Index) → OneHotEncoder → LeapFrame (Categorical Feature One Hot Vector)
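
The two steps of the categorical pipeline can be sketched in a few lines of plain Python. The function names and the sample labels are hypothetical. The indexer orders labels by descending frequency, which is also Spark's default for StringIndexer; unlike Spark's OneHotEncoder default, this sketch keeps a slot for every category instead of dropping the last one.

```python
from typing import Dict, List


def fit_string_indexer(values: List[str]) -> Dict[str, int]:
    """Map each label to an index, most frequent label first (ties by name)."""
    counts: Dict[str, int] = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of length `size` with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Made-up training labels, for illustration only.
training = ["apartment", "house", "apartment", "loft"]
index_of = fit_string_indexer(training)

# Categorical Feature -> Categorical Feature Index -> One Hot Vector
encoded = [one_hot(index_of[v], len(index_of)) for v in ["house", "apartment"]]
```

Once fitted, the label-to-index map is the only state the indexer needs at prediction time, which is what makes it straightforward to serialize.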

slide-11
SLIDE 11

MLeap Serialization (Bundle.ML)

  • Provides common serialization for both Spark and MLeap
  • 100% protobuf/JSON based for easy reading, compact data, and portability
  • No dependencies on Parquet *
  • Can be written to zip files, file system, HDFS, anywhere with an FS-like structure

mleap-serialization
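
To make the idea of a JSON-based bundle inside a zip file concrete, here is a small sketch using only the Python standard library. The layout (a `manifest.json` plus one `model.json` per stage) and the functions `save_bundle` and `load_bundle` are assumptions invented for this example; they are not the actual Bundle.ML format or API.

```python
import io
import json
import zipfile
from typing import Dict, List


def save_bundle(stages: List[Dict]) -> bytes:
    """Write one JSON file per stage plus a manifest into an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The manifest records stage order so the pipeline can be rebuilt.
        manifest = {"stages": [stage["name"] for stage in stages]}
        archive.writestr("manifest.json", json.dumps(manifest))
        for stage in stages:
            archive.writestr(f"{stage['name']}/model.json", json.dumps(stage))
    return buffer.getvalue()


def load_bundle(data: bytes) -> List[Dict]:
    """Read the stages back in the order recorded by the manifest."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        return [json.loads(archive.read(f"{name}/model.json"))
                for name in manifest["stages"]]


# Made-up pipeline stages, for illustration only.
stages = [
    {"name": "string_indexer", "labels": ["apartment", "house"]},
    {"name": "linear_regression", "coefficients": [1.5, -0.5], "intercept": 3.0},
]
restored = load_bundle(save_bundle(stages))
```

Because the archive is just bytes, the same payload could be written to a local file, HDFS, or any other store with a file-system-like interface.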

slide-12
SLIDE 12

Serialized examples (slide images not preserved): Linear Regression Model, String Indexer Model, Linear Regression Model (Code)

slide-13
SLIDE 13

MLeap Spark

  • Train an ML pipeline with Spark then export it to MLeap
  • Execute an MLeap pipeline against a Spark DataFrame

Diagram (export): Spark Estimator, Spark Model, MLeap Model, connected by MLeap Spark

Diagram (execution): Spark DataFrame, Spark LeapFrame, MLeap Transformer, Spark LeapFrame, Spark DataFrame, connected by MLeap Spark

slide-14
SLIDE 14

Benchmarks

  • MLeap: 0.011 ms/transform
  • Spark: 23.4 ms/transform
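
The two latencies translate into a speedup factor and a rough throughput figure. The short calculation below assumes transforms run one after another on a single thread, which the slide does not state.

```python
mleap_ms = 0.011  # per-transform latency reported on the slide
spark_ms = 23.4   # per-transform latency reported on the slide

# How many times faster a single MLeap transform is.
speedup = spark_ms / mleap_ms

# Sequential transforms per second (assumption: no batching or parallelism).
mleap_per_second = 1000.0 / mleap_ms
spark_per_second = 1000.0 / spark_ms

print(f"speedup: about {speedup:.0f}x")
print(f"MLeap: about {mleap_per_second:.0f} transforms/s")
print(f"Spark: about {spark_per_second:.1f} transforms/s")
```

On these numbers the difference is a little over three orders of magnitude, which is the gap between a model that can sit behind a real-time API and one that cannot.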

slide-15
SLIDE 15

MLeap API + Server

Architecture diagram, components as labeled on the slide:

  • Algo 1, Algo 2, Algo n
  • Spark + Hadoop + HDFS; MLlib; Scikit + other
  • MLeap Transformers + Serialization
  • Web Services
  • Java API; REST API
  • Mobile Apps; Spark Jobs; Map/Reduce
  • Pipeline Code; Notebooks

slide-16
SLIDE 16

Demo Usage of MLeap

  • Train a sample listing price model using linear regression and random forest against some Airbnb training data

  • Deploy both models to a local API server
  • Get real-time results
  • IN UNDER 5 MINUTES!
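
The serving side of such a demo might look roughly like the sketch below: a request handler that picks one of two deployed models by name and returns a JSON prediction. This is not the demo's code. The model functions, their coefficients, the request shape, and `handle_request` are all hypothetical, and the HTTP layer is left out so the example stays self-contained.

```python
import json
from typing import Callable, Dict, List


# Hypothetical stand-ins for two deployed models; all numbers are made up.
def linear_model(features: List[float]) -> float:
    return 50.0 + 25.0 * features[0] + 10.0 * features[1]


def forest_model(features: List[float]) -> float:
    # A single hand-written split standing in for a random forest.
    return 120.0 if features[0] >= 2 else 80.0


MODELS: Dict[str, Callable[[List[float]], float]] = {
    "linear_regression": linear_model,
    "random_forest": forest_model,
}


def handle_request(body: str) -> str:
    """Parse a JSON request, run the named model, return a JSON response."""
    request = json.loads(body)
    model = MODELS.get(request.get("model"))
    if model is None:
        return json.dumps({"error": "unknown model"})
    return json.dumps({"prediction": model(request["features"])})


response = handle_request('{"model": "linear_regression", "features": [2.0, 1.0]}')
```

Each call is an ordinary in-process function call, so latency is dominated by the model arithmetic rather than by job scheduling.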
slide-17
SLIDE 17

Future of MLeap

  • Unify linear algebra and core libraries with Spark
  • Python/R interface
  • Deploy easily to embedded systems and outside of JVM
  • Full support for all Spark transformers
slide-18
SLIDE 18

MLeap Development

  • Currently 5 people working on projects across 4 different companies
  • Talk to us if you are interested in deploying this technology at your company
  • MLeap Demo Project: https://github.com/TrueCar/mleap-demo

slide-19
SLIDE 19

Thank You!

Hollin Wilkins

  • email: hollinrwilkins@gmail.com
  • github: https://github.com/hollinwilkins
  • twitter: https://twitter.com/HollinWilkins

Mikhail Semeniuk

  • email: seme0021@gmail.com
  • github: https://github.com/seme0021
  • twitter: https://twitter.com/MikhailSemeniuk
