MLeap: Release Spark ML Pipelines
Hollin Wilkins & Mikhail Semeniuk
SATURDAY
Studied some Math and Economics @ University of Minnesota
UHG
pipelines
new car price modeling
pipelines + python for Linear Regressions in API layer
models in SAS/R
pipelines + python for Linear Regressions in API layer
computation in Hadoop/ElasticSearch
translated, now to Java
processing and training
environment
MLeap
Web Dev @ Cornell
Studied some General Biology
Rails Consulting for TrueCar and other companies
Implement ML model for ClearBook in Ruby, Python and Java
Erlang for highly-concurrent game servers
C/C# Game Engines
Join TrueCar for Rails/mobile development
Begin work on machine learning platform based on Spark/Hadoop
How do we build real-time APIs based on Spark Models? SparkContext prohibitive
Try several ideas, fail, learn a ton, begin MLeap project
Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.
Action Reaction
Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common source of friction at data-driven
construct research datasets
production-ready system
computing features and algorithms
libraries and maintain/re-write their own copy of the code
linear/logistic regressions due to engineering constraints
up linear regressions and updating coefficients
Existing Solutions: You won’t believe how many companies are still deploying algorithms in a SQL environment! And these are billion-dollar operations.
[Comparison table; cell values not recoverable]
Solutions compared: Hard-Coded Models (SQL, Java, Ruby); PMML; Emerging Solutions (yHat, DataRobot); Enterprise Solutions (Microsoft, IBM, SAS); MLeap
Criteria: Quick to Implement; Open Sourced; Committed to Spark/Hadoop; API Server Infrastructure
Lesson Learned: Push code down to where the data is, not the other way around!
regression models, and feature builders
“LeapFrame” and transformers for it
transformers to MLeap transformers
Spark and MLeap (Bundle.ML)
New features: expanded serialization formats to include both JSON and Protobuf, the latter for large models (e.g. random forests with thousands of features)
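A minimal sketch of why a binary format helps with large models, assuming a model that is just a long list of coefficients. The `struct`-packed layout below is a hypothetical stand-in for a binary encoding; it is not Protobuf and not the Bundle.ML layout, and the dictionary keys are made up for illustration.

```python
import json
import struct

# Hypothetical model: 1000 coefficients (illustrative values, not a trained model).
coefficients = [i / 7 for i in range(1000)]
model = {"type": "linear_regression", "intercept": 10.0, "coefficients": coefficients}

# Text serialization: readable and easy to diff, but each float costs many characters.
json_payload = json.dumps(model).encode("utf-8")
restored = json.loads(json_payload.decode("utf-8"))

# Binary stand-in: 8 bytes per little-endian double, no field names or delimiters.
binary_payload = struct.pack(f"<{len(coefficients)}d", *coefficients)
unpacked = list(struct.unpack(f"<{len(coefficients)}d", binary_payload))

print(len(json_payload), "bytes as JSON")
print(len(binary_payload), "bytes as packed doubles")
```

Both round trips return the original values; the difference is only in payload size, which grows with the number of trees or features in the model.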
mleap-spark mleap-runtime mleap-core Bundle.ML mleap-serialization
Linear Algebra: Dense/Sparse Vectors, BLAS from Spark
Feature Builders: VectorAssembler, StringIndexer, StandardScaler
Regressions: LinearRegression, RandomForest, Gradient Boosted Reg. Trees
Classifiers: RandomForest, LogisticRegression
mleap-core
transformers
LeapFrame
mleap-runtime
[Figure: Regression Pipeline]
Continuous Feature -> VectorAssembler -> Continuous Feature Vector -> StandardScaler -> Scaled Continuous Feature Vector
Categorical Feature -> StringIndexer -> Categorical Feature Index -> OneHotEncoder -> Categorical Feature One Hot Vector -> VectorAssembler -> Categorical Feature Vector
Scaled Continuous Feature Vector + Categorical Feature Vector -> VectorAssembler -> Final Feature Vector -> LinearRegression -> Prediction
[Figure: LeapFrame -> StringIndexer -> LeapFrame -> OneHotEncoder -> LeapFrame (Categorical Feature -> Categorical Feature Index -> Categorical Feature One Hot Vector)]
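The regression pipeline can be sketched in plain Python to show that scoring a single row needs no SparkContext. This is an illustration only: the function names, the `make` and `mileage` columns, and the hand-picked coefficients are all hypothetical, and none of this is the Spark ML or MLeap API. The sketch also keeps every one-hot slot and uses the population standard deviation, which may differ from what a given library does.

```python
import math

def fit_string_indexer(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense one-hot vector with 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

def fit_standard_scaler(values):
    """Return (mean, population standard deviation) of a continuous column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def transform_row(row, model):
    """Run one row through scaler, indexer, encoder, assembler and regression."""
    mean, std = model["scaler"]
    scaled = [(row["mileage"] - mean) / std]          # StandardScaler step
    index = model["indexer"][row["make"]]             # StringIndexer step
    encoded = one_hot(index, len(model["indexer"]))   # OneHotEncoder step
    features = scaled + encoded                       # VectorAssembler step
    prediction = model["intercept"] + sum(
        w * x for w, x in zip(model["coefficients"], features)
    )                                                 # LinearRegression step
    return {**row, "features": features, "prediction": prediction}

# Hypothetical training rows, used only to fit the indexer and the scaler.
training = [
    {"make": "ford", "mileage": 10.0},
    {"make": "ford", "mileage": 30.0},
    {"make": "ford", "mileage": 10.0},
    {"make": "audi", "mileage": 30.0},
]

model = {
    "indexer": fit_string_indexer([r["make"] for r in training]),
    "scaler": fit_standard_scaler([r["mileage"] for r in training]),
    "coefficients": [-2.0, 1.0, 3.0],  # hand-picked for illustration, not trained
    "intercept": 10.0,
}

scored = transform_row({"make": "audi", "mileage": 30.0}, model)
print(scored["features"], scored["prediction"])
```

Each step takes a row and returns a row with more columns, which mirrors the LeapFrame-in, LeapFrame-out shape of the transformers in the diagram.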
portability
structure
mleap-serialization
Linear Regression Model String Indexer Model Linear Regression Model (Code)
[Figures: converting between Spark and MLeap. Labels: Spark Estimator, Spark Model, MLeap Model; Spark DataFrame, Spark LeapFrame; Spark DataFrame, MLeap Transformer]
MLeap: 0.011 ms/transform
Spark: 23.4 ms/transform
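To put the two figures on the same scale, the arithmetic below derives the speedup and a rough per-second throughput. It assumes both numbers are average latencies for one transform run sequentially on a single thread; the slide does not state the benchmark conditions, so treat the throughput as a back-of-the-envelope estimate.

```python
# Latencies as reported on the slide, in milliseconds per transform.
mleap_ms = 0.011
spark_ms = 23.4

# How many times faster one MLeap transform is than one Spark transform.
speedup = spark_ms / mleap_ms

# Transforms per second if calls run back to back on one thread (assumption).
mleap_per_sec = 1000 / mleap_ms
spark_per_sec = 1000 / spark_ms

print(f"speedup: about {speedup:.0f}x")
print(f"MLeap: about {mleap_per_sec:.0f} transforms/s")
print(f"Spark: about {spark_per_sec:.0f} transforms/s")
```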
MLeap API + Server
[Figure: MLeap API + Server architecture. Labels: Algo 1, Algo 2, ..., Algo n; Spark + Hadoop + HDFS; MLlib; Scikit + other; MLeap Transformers + Serialization; Web Services; Java API; REST API; Mobile Apps; Spark Jobs; Map/Reduce; Pipeline Code; Notebooks]
Hollin Wilkins
email: hollinrwilkins@gmail.com
github: https://github.com/hollinwilkins
twitter: https://twitter.com/HollinWilkins
Mikhail Semeniuk
email: seme0021@gmail.com
github: https://github.com/seme0021
twitter: https://twitter.com/MikhailSemeniuk