ModelDB : a system for managing ML models Manasi Vartak , PhD - - PowerPoint PPT Presentation



slide-1
SLIDE 1

ModelDB: a system for managing ML models

Manasi Vartak, PhD Candidate MIT Database Group

mvartak@csail.mit.edu | @DataCereal

slide-2
SLIDE 2

Why Model Management?

slide-3
SLIDE 3

IMDB Prediction Task

  • Given data about movies (e.g. year made, studio, genres, actors)
  • Predict IMDB_score
slide-4
SLIDE 4

Accuracy: 62%

Model 1

LinearRegression

slide-5
SLIDE 5

Accuracy: 68%

Model 2

CrossValidation

slide-6
SLIDE 6

Accuracy: 75%

Model 3

CrossValidation RandomForest

slide-7
SLIDE 7

Accuracy: 80%

Model 4

CrossValidation RandomForest FeatureEngg

slide-8
SLIDE 8

Accuracy: 84%

Model 50

FeatureEngg CrossValidation GBDT

slide-9
SLIDE 9

Why is this a problem?

  • No record of experiments
  • Insights lost along the way
  • Difficult to reproduce results
  • Cannot search for or query models
  • Difficult to collaborate

  • Did my colleague do that already?
  • How did normalization affect my ROC?
  • How does someone review your model?
  • Where is the prod version of the model for churn?
  • What params did I use?

slide-10
SLIDE 10

ModelDB: an end-to-end model management system

  • Store and version modeling artifacts
  • Query
  • Ingest models, metadata
  • Collaborate, reproduce results
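To make the ingest and query capabilities concrete, here is a minimal sketch of an experiment log, populated with the five models from slides 4 to 8. The names `ModelRecord`, `ExperimentLog`, `ingest`, `query`, and `best` are hypothetical and chosen for illustration; this is not ModelDB's API, and the in-memory list stands in for real storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ModelRecord:
    """One logged experiment: a name, the techniques used, and its metrics."""
    name: str
    techniques: Tuple[str, ...]
    metrics: Dict[str, float] = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory store; a real system would persist records."""

    def __init__(self) -> None:
        self._records: List[ModelRecord] = []

    def ingest(self, name: str, techniques: List[str], **metrics: float) -> ModelRecord:
        """Record one model together with its metrics."""
        record = ModelRecord(name, tuple(techniques), dict(metrics))
        self._records.append(record)
        return record

    def query(self, predicate: Callable[[ModelRecord], bool]) -> List[ModelRecord]:
        """Return every record for which the predicate holds, in log order."""
        return [r for r in self._records if predicate(r)]

    def best(self, metric: str) -> ModelRecord:
        """Return the record with the highest value for the given metric."""
        return max(self._records, key=lambda r: r.metrics[metric])


# Technique labels and accuracies are copied from slides 4 to 8.
log = ExperimentLog()
log.ingest("Model 1", ["LinearRegression"], accuracy=0.62)
log.ingest("Model 2", ["CrossValidation"], accuracy=0.68)
log.ingest("Model 3", ["CrossValidation", "RandomForest"], accuracy=0.75)
log.ingest("Model 4", ["CrossValidation", "RandomForest", "FeatureEngg"], accuracy=0.80)
log.ingest("Model 50", ["FeatureEngg", "CrossValidation", "GBDT"], accuracy=0.84)

forest_runs = log.query(lambda r: "RandomForest" in r.techniques)
best_run = log.best("accuracy")
```

With a log like this, "What params did I use?" becomes a lookup rather than a memory exercise.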

slide-11
SLIDE 11

ModelDB Architecture

[Architecture diagram; labels shown:]

  • Clients: spark.ml, scikit-learn, Light Client
  • Client languages: Scala, Python, …
  • Events
  • thrift
  • ModelDB Backend
  • Storage
  • ModelDB Frontend: vis + query
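One way to picture the client side of this diagram is a thin wrapper that turns each modeling step into an event and hands it to a backend. The sketch below is an assumption about how such a client could look, not ModelDB's actual client: `FitEvent`, `LightClient`, `log_fit`, and `InMemoryBackend` are hypothetical names, and JSON strings stand in for the thrift transport named on the slide.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class FitEvent:
    """Hypothetical event describing one model-fitting step."""
    model_type: str
    hyperparameters: Dict[str, Any]
    timestamp: float


class InMemoryBackend:
    """Stand-in for a remote backend; it only keeps the payloads it receives."""

    def __init__(self) -> None:
        self.received: List[str] = []

    def send(self, payload: str) -> None:
        self.received.append(payload)


class LightClient:
    """Serializes events and forwards them to whatever backend it was given."""

    def __init__(self, backend: InMemoryBackend) -> None:
        self._backend = backend

    def log_fit(self, model_type: str, **hyperparameters: Any) -> FitEvent:
        event = FitEvent(model_type, hyperparameters, time.time())
        # JSON is used here only for illustration; the slide names thrift.
        self._backend.send(json.dumps(asdict(event), sort_keys=True))
        return event


backend = InMemoryBackend()
client = LightClient(backend)
client.log_fit("RandomForest", n_estimators=100, max_depth=8)
decoded = json.loads(backend.received[0])
```

Keeping the client this thin is what would let the same backend serve several languages: each client only has to produce the agreed event format.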

slide-12
SLIDE 12

Demo

slide-13
SLIDE 13

ML Infrastructure

Stages: Model Training, Model Management, Data Processing, Serving, Monitoring

Also: Visualizations, Interpretability, Debugging, A/B testing, Model Retraining

  • DBMSs
  • Spark
  • Hive
  • CSV
  • Spark.ml
  • sklearn
  • R
  • DL frameworks
  • H2O
  • Custom
  • TF-serving
  • Clipper

Custom Custom

slide-14
SLIDE 14

Benefits of model management

Offline Online

Developer Productivity

  • Provenance
  • Reproducibility
  • Meta-analyses

Increased Transparency

  • What models have been built?
  • How well do models work?
  • Auditability

Fast Failure Analyses

  • How was this model built?
  • What has changed?

Model Monitoring

  • Model performance over time
  • Anomaly detection
  • Trigger retraining
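The monitoring items above can be sketched as a simple check over a metric's history: compare the latest value with the average of the preceding window and flag retraining when the drop is too large. The function name `needs_retraining`, the window size, the threshold, and the sample values are all illustrative assumptions, not something taken from ModelDB.

```python
from statistics import mean
from typing import Sequence


def needs_retraining(
    history: Sequence[float], window: int = 5, max_drop: float = 0.05
) -> bool:
    """Flag retraining when the latest metric falls well below its recent baseline.

    The baseline is the mean of the `window` values that precede the latest one.
    With too little history to form a full window, nothing is flagged.
    """
    if len(history) <= window:
        return False
    baseline = mean(history[-window - 1:-1])
    return baseline - history[-1] > max_drop


# Made-up accuracy readings, oldest first.
stable = [0.84, 0.83, 0.85, 0.84, 0.84, 0.83]
degraded = [0.84, 0.83, 0.85, 0.84, 0.84, 0.70]

stable_flag = needs_retraining(stable)
degraded_flag = needs_retraining(degraded)
```

A fixed threshold is the simplest possible rule; a production monitor would likely account for noise in the metric before triggering anything.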

slide-15
SLIDE 15

At last NIPS

  • Initial version of ModelDB with sklearn, spark.ml support
  • Early adopters (banks, financial firms), early feedback
  • Focus on developer productivity
slide-16
SLIDE 16

Since last NIPS!

  • Initial release of ModelDB in early 2017 (February)
  • Adoption/evaluation at Adobe, banks, financial institutions, and tech companies
  • Won AIGrant for open-source projects
  • See papers at SIGMOD, NIPS workshops
slide-17
SLIDE 17

Since last NIPS!

  • Easy installation: docker, pip
  • Light clients (R, YAML, packages outside of sklearn)
  • Flexible metadata storage
  • Collecting metrics over time
  • Fine-grained visualizations
  • In the (research) pipeline
      • Data and intermediate storage
      • Model diagnosis
slide-18
SLIDE 18

ModelDB so far

  • Incredible inbound interest
      • Banks, finance, insurance, tech
      • Lots of feature requests (e.g. monitoring, diagnosis, DL). More than research resources can handle :)
  • Validation
      • Every data scientist building > 10 models needs model management and is looking for these tools
  • Vision: Industry standard tool for managing ML models and metadata

slide-19
SLIDE 19

Moving to Apache Incubation

  • With MIT, Adobe, other partners (*MLSys community)
  • Open development to wider community
  • Contributions across industry
  • Roadmap
      • Multiple storage backends, DL frameworks, R
      • Monitoring capabilities
slide-20
SLIDE 20

Call for Contributions!

  • Community over code
  • Build once, reuse many times
  • Why?
      • It will measurably improve your workflow
      • Pay it forward
      • Be part of larger open-source project
slide-21
SLIDE 21

How to Contribute

  • Test it out and give feedback
  • Share: teams, meetups, data science meetings, blogs
  • Documentation
  • Code:
      • Lots of issues on GitHub
      • Add support for your favorite ML frameworks
slide-22
SLIDE 22

Informal Meeting at MLSys

  • Interested in testing/adopting ModelDB?
  • Did you build such a system? Can you share lessons?
  • Open-source contributors!
  • How/when
      • Whova app (“Model Management Meetup”)
      • mvartak@csail.mit.edu
      • Poster
slide-23
SLIDE 23

People

slide-24
SLIDE 24

ModelDB

Manasi Vartak | @DataCereal

https://github.com/mitdbg/modeldb
http://modeldb.csail.mit.edu