SLIDE 1 DBMS + ML
Julian Oks Josh Sennett
SLIDE 2
Context + Problem Statement
SLIDE 3 Context: DBMS + ML
DBMS / RDBMS
- Prevalent in all industries
- Efficient
- Highly reliable & available
- Consistent & Transactional
- Provide concurrency
- Declarative API (SQL)
- Support for versioning, auditing, encryption
SLIDE 4 Context: DBMS + ML
Machine Learning
- Hardware is expensive and resource usage is non-uniform, but cloud
computing makes it affordable
- Data science expertise is expensive too, but ML services and tools aim to
make it accessible
- Becoming more mainstream; no longer exclusive to “unicorn” ML applications
- Growing focus on fairness, security, privacy, and auditability
“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”
SLIDE 5 Context: DBMS + ML
Challenges faced in employing machine learning:
- Many ML frameworks
- Many ML algorithms
- Heterogeneous hardware + runtime environments
- Complex APIs -- typically requiring data science experts
- Complexity of model selection, training, deployment, and improvement
- Lack of security, auditability, versioning
- Majority of time is spent on data cleaning, preprocessing, and integration
Can DBMS + ML integration address these challenges?
SLIDE 6 Life without DBMS + ML integration
1. Join tables, export the data, create features, and do ML offline.
   Problem: slow, memory- and runtime-intensive, redundant work.
2. Write highly complex SQL UDFs to implement ML models.
   Problem: large, complex models are hard (and often impossible) to implement in SQL; writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms.
SLIDE 7
Recent trends in DBMS + ML
SLIDE 8
Recent trends in DBMS + ML
SLIDE 9 Problem Statement
Some Big Picture Questions:
- How do you make ML accessible to a typical database user?
- How do you provide flexibility to use the right ML model?
- How do you support different frameworks, cloud service providers, and
hardware and runtime environments?
- Data is often split across tables; can we do prediction without needing to
materialize joins across these tables?
- Can DBMS + ML efficiency match (or outperform) ML alone?
Tradeoff: simplicity vs. flexibility
SLIDE 10
SLIDE 11 Rafiki: Motivation
Building an ML application is complex, even when using cloud platforms and popular frameworks.
- Training:
- Many models to choose from
- Many (critical) hyperparameters to tune
- Inference:
- Using a single model is faster, but less accurate than
using a model ensemble
- Need to select the right model(s) to trade off accuracy
and latency
SLIDE 12 Rafiki: Approach
Training Service: automate model selection and hyperparameter tuning
- automated search to find the “best” model
- highest accuracy
- lowest memory usage
- lowest runtime
Inference Service: online ensemble modeling
- automated selection of the “best” model (or ensemble)
- maximize accuracy
- minimize excess latency (time exceeding SLO)
SLIDE 13 Automate development and training of a new ML model:
- distributed model selection
- distributed hyper-parameter tuning
SLIDE 14 Rafiki: Training Service
SLIDE 15 Rafiki: Training Service
Rafiki parallelizes hyper-parameter tuning to reduce tuning time
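A minimal sketch of the idea (not Rafiki's actual implementation): candidate hyper-parameter configurations are scored in parallel across workers, and the best configuration wins. The search space and objective function below are toy stand-ins.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_and_score(params):
    # Stand-in for a real training run: scores a (learning_rate, batch_size)
    # pair against a toy objective peaked at lr=0.1, batch_size=64.
    lr, bs = params
    return 1.0 / (1.0 + abs(lr - 0.1) + abs(bs - 64) / 64.0)

def sample_trials(n, seed=0):
    # Random search over a hypothetical hyper-parameter space.
    rng = random.Random(seed)
    return [(rng.uniform(0.001, 1.0), rng.choice([16, 32, 64, 128]))
            for _ in range(n)]

trials = sample_trials(32)
# Evaluate trials in parallel, analogous to tuning across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, trials))
best_score, best_params = max(zip(scores, trials))
```

Rafiki additionally shares information across trials rather than sampling blindly, but the parallel evaluate-and-select loop is the core of the speedup.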
SLIDE 16 Rafiki: Inference Service
Automate model selection and scheduling:
- Maximize accuracy
- Minimize latency exceeding SLO
These are typically competing objectives.
Other optimizations (similar to Clipper):
- Parallel ensemble modeling
- Batch size selection (throughput vs latency)
- Parameter caching
Model ensembles improve accuracy, but are slower due to stragglers
SLIDE 17 Rafiki: Inference Service
How does Rafiki optimize model selection for its inference service?
- Pose as optimization problem
- Reinforcement Learning
- Objective is to maximize
accuracy while minimizing excess latency (beyond SLO)
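Rafiki's exact reward formulation is not reproduced here; the sketch below shows one plausible shape for such an objective, with a hypothetical penalty weight `beta`: reward equals accuracy minus a penalty proportional to latency beyond the SLO.

```python
def inference_reward(accuracy, latency_ms, slo_ms, beta=0.5):
    # Penalize only *excess* latency (time beyond the SLO); latency
    # under the SLO incurs no penalty at all.
    excess = max(0.0, latency_ms - slo_ms)
    return accuracy - beta * (excess / slo_ms)

# A more accurate ensemble that slightly exceeds the SLO can still beat
# a fast single model, depending on the penalty weight.
single = inference_reward(accuracy=0.90, latency_ms=40, slo_ms=100)
ensemble = inference_reward(accuracy=0.97, latency_ms=110, slo_ms=100)
```

The RL agent's job is then to pick the model set (and batch sizes) that maximizes this reward as the workload shifts.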
SLIDE 18 Rafiki: How does this integrate with DBMS?
CREATE FUNCTION food_name(image_path) RETURNS text AS
BEGIN ... END;

SELECT food_name(image_path) AS name, count(*)
FROM foodlog
WHERE age > 52
GROUP BY name;
SLIDE 19
Interfaces
- Web Interface
- Python UDF & Web API
SLIDE 20 Strengths & Weaknesses: Rafiki
Strengths:
- Allows users to specify an SLO that’s used for model selection and inference
- Easy to use for a large class of tasks (e.g., regression, classification, NLP)
- Automates and optimizes complex decisions in ML design and deployment
Weaknesses:
- Not very general: you have to use Rafiki’s limited set of models, model
selection + tuning algorithms, and model ensemble strategy
- Could be very expensive, since you have to train many models to find the best one
- Rafiki has to compete with other offerings for automated model tuning and
model selection services
SLIDE 21
Questions: Rafiki
SLIDE 22
SLIDE 23 Motivation
DBMSs have many advantages
- High Performance
- Mature and Reliable
- High availability, Transactional, Concurrent Access, ...
- Encryption, Auditing, Familiar & Prevalent, ….
Store and serve models in the DBMS.
Question: Can in-RDBMS scoring of ML models match (or outperform) the performance of dedicated frameworks?
Answer: Yes!
SLIDE 24 Raven
Supports in-DB model inference.
Key Features:
- Unified IR enables advanced
cross-optimizations between ML and DB operators
- Takes advantage of ML runtimes
integrated into Microsoft SQL Server
SLIDE 25 Background: ONNX (Open Neural Network Exchange)
Standardized ML model representation.
Enables portable models and cross-platform inference.
ONNX Runtime is integrated into SQL Server.
SLIDE 26 Background: MLflow Model Pipelines
Model Pipeline contains:
- Trained Model
- Preprocessing Steps
- Dependencies
SLIDE 27 Background: MLflow Model Pipelines
Model Pipeline contains:
- Trained Model
- Preprocessing Steps
- Dependencies
The pipeline is stored in the RDBMS.
A SQL query can invoke a Model Pipeline.
SLIDE 28
Defining a Model Pipeline
SLIDE 29
Using a Model Pipeline
SLIDE 30 Raven’s IR
Uses both ML and relational operators
SLIDE 31 Raven’s IR
Operator categories:
- Relational Algebra
- Linear Algebra
- Other ML and data featurizers (e.g., scikit-learn operations)
- UDFs, for when the static analyzer is unable to map operators
SLIDE 32
Cross-Optimization
SLIDE 33 Cross-Optimization
DBMS + ML optimizations:
- Predicate-based model pruning: conditions are pushed into the decision tree,
allowing subtrees to be pruned
- Model-projection pushdown: unused features are discarded in the query plan
- NN translation: transform operations/models into equivalent NNs, leverage the optimized runtime
- Model / query splitting: Model can be partitioned
- Model inlining: ML operators transformed to relational ones (e.g., small decision trees can be inlined)
Other standard DB and compiler optimizations (constant folding, join elimination, …)
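A toy illustration (not Raven's actual code) of predicate-based model pruning: a query predicate such as `age > 52` is pushed into a hand-built decision tree, making one subtree unreachable so it can be dropped before inference.

```python
# Toy decision tree: nodes are (feature, threshold, left, right),
# leaves are label strings. Left branch means feature <= threshold.
TREE = ("age", 50,
        ("income", 30_000, "reject", "review"),   # taken when age <= 50
        ("income", 60_000, "review", "approve"))  # taken when age > 50

def prune(node, known_lower):
    """Prune branches made unreachable by a query predicate of the
    form `age > known_lower` that the WHERE clause already guarantees."""
    if isinstance(node, str):          # leaf: nothing to prune
        return node
    feat, thresh, left, right = node
    if feat == "age" and known_lower >= thresh:
        # Every qualifying row has age > thresh, so the left
        # subtree can never be taken; replace the node with its right child.
        return prune(right, known_lower)
    return (feat, thresh, prune(left, known_lower), prune(right, known_lower))

pruned = prune(TREE, known_lower=52)   # query predicate: age > 52
```

After pruning, the scoring function evaluates a strictly smaller tree; Raven applies the same idea inside its unified IR, where the predicate and the model are visible to one optimizer.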
SLIDE 34 Runtime Code Generation
A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.
SLIDE 35
Raven: Full Pipeline
SLIDE 36 Query Execution
The generated SQL query is executed in one of three ways:
1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext.): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext., run within Docker
SLIDE 37 Results
Effects of some optimizations:
SLIDE 38
Results: does Raven outperform ONNX?
SLIDE 39 Strengths & Weaknesses: Raven
Strengths:
- Brings advantages of a DBMS to machine learning models
- Raven’s cross-optimizations and ONNX integration make inference faster
- Very generalized: natively supports many ML frameworks, runtimes,
specialized hardware, and provides containerized execution for all others
Weaknesses:
- Only compatible with SQL Server
- Limited to inference; it does not facilitate training
- Limits to static analysis (e.g., analysis of for loops & conditionals)
SLIDE 40
Questions: Raven
SLIDE 41
SLIDE 42 Motivation: a typical ML pipeline
Collect Data → Materialize Joins → ML / LA (Linear Algebra) Operations
SLIDE 43 Morpheus: Factorized ML
Idea: avoid materializing joins by pushing ML computations through joins.
Any linear algebra computation over the join output can be factorized in terms of the base tables.
Morpheus (2017): factor operations over a "normalized matrix" using a framework of rewrite rules.
SLIDE 44 The Normalized Matrix
A logical data type that represents joined tables.
Consider a PK-FK join between two tables S and R.
The normalized matrix is the triple (S, K, R), where K is an indicator matrix.
The output of the join, T, is then T = [S, KR] (column-wise concatenation).
Key: S = left table, R = right table, K = indicator matrix (0/1s), T = materialized join output
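A small NumPy sketch (illustrative, not the Morpheus implementation) of the construction: each row of K is one-hot, selecting which row of R the corresponding row of S joins with, so K @ R replicates dimension-table rows and T = [S, KR] is the materialized join.

```python
import numpy as np

# Left (fact) table features: 4 rows, 2 columns.
S = np.array([[1., 0.], [2., 1.], [3., 0.], [4., 1.]])
# Right (dimension) table features: 2 rows, 3 columns.
R = np.array([[10., 11., 12.], [20., 21., 22.]])
# Indicator matrix K (4x2): row i of S joins with the row of R
# marked by the 1 in K[i]. Rows 0 and 2 join R[0]; rows 1 and 3 join R[1].
K = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])

# Materialized join output: column-wise concatenation T = [S, KR].
T = np.hstack([S, K @ R])
```

Morpheus never builds T; it keeps (S, K, R) and rewrites downstream linear algebra against the triple instead.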
SLIDE 45 Rewrite Rules
- Element-wise operators: f(T) → (f(S), K, f(R))
- Aggregators:
  - rowSum(T) → rowSum(S) + K rowSum(R)
  - colSum(T) → [colSum(S), colSum(K) R]
  - sum(T) → sum(S) + colSum(K) rowSum(R)
- Left Matrix Multiplication: TX → SX[1 : dS, ] + K(RX[dS + 1 : d, ])
- Matrix Inversion:
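The aggregation and multiplication rules can be checked numerically. The NumPy sketch below (illustrative only) builds a toy normalized matrix and verifies three of the rewrites against the materialized join.

```python
import numpy as np

S = np.array([[1., 0.], [2., 1.], [3., 0.], [4., 1.]])   # left table
R = np.array([[10., 11., 12.], [20., 21., 22.]])          # right table
K = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])    # indicator
T = np.hstack([S, K @ R])                                  # materialized join
dS, d = S.shape[1], T.shape[1]

# rowSum(T) = rowSum(S) + K rowSum(R)
lhs_row = T.sum(axis=1)
rhs_row = S.sum(axis=1) + K @ R.sum(axis=1)

# colSum(T) = [colSum(S), colSum(K) R]
lhs_col = T.sum(axis=0)
rhs_col = np.concatenate([S.sum(axis=0), K.sum(axis=0) @ R])

# Left matrix multiplication: T X = S X[:dS] + K (R X[dS:])
X = np.arange(d * 2, dtype=float).reshape(d, 2)
lhs_mm = T @ X
rhs_mm = S @ X[:dS] + K @ (R @ X[dS:])
```

The factorized right-hand sides touch only S, K, and R; when R has many columns and K is tall and sparse, this avoids both the join and the redundant arithmetic over replicated rows.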
SLIDE 46 Rewrite Rules Applied to Logistic Regression
SLIDE 47 Performance
Key: F = factorized, M = materialized, FR = feature ratio, TR = tuple ratio
SLIDE 48 Performance
Key: Domain Size = number of unique values
SLIDE 49 MorpheusFI: Quadratic Feature Interactions for Morpheus
Limitation: Factorized Linear Algebra is restricted to linearity over feature vectors
SLIDE 50 MorpheusFI
Add quadratic feature interactions into factorized ML by adding two non-linear interaction operators:
- self-interaction within a matrix
- cross-interaction between matrices
participating in a join
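One way to read these operators (an illustrative NumPy sketch, not MorpheusFI's code) is as pairwise column products within one matrix and across the two joined matrices:

```python
import numpy as np

def self_interaction(A):
    """Quadratic features within one matrix: all column products
    A[:, i] * A[:, j] for i <= j (upper triangle, including squares)."""
    d = A.shape[1]
    cols = [A[:, i] * A[:, j] for i in range(d) for j in range(i, d)]
    return np.stack(cols, axis=1)

def cross_interaction(A, B):
    """Quadratic features across two matrices participating in a join:
    all products of one column of A with one column of B."""
    cols = [A[:, i] * B[:, j]
            for i in range(A.shape[1]) for j in range(B.shape[1])]
    return np.stack(cols, axis=1)

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5.], [6.]])
quad = self_interaction(A)       # columns: a1*a1, a1*a2, a2*a2
cross = cross_interaction(A, B)  # columns: a1*b1, a2*b1
```

MorpheusFI's contribution is that these interaction features can still be computed factorized, over the base tables of the normalized matrix, rather than over the materialized join.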
SLIDE 51 A new abstraction: Interacted Normalized Matrix
with the following relationships:
SLIDE 52
Formal proofs of algebraic rewrite rules
SLIDE 53
Rewrite rules are extremely complex
SLIDE 54 Results: Matrix multiplication
Key: LMM = left matrix multiplication, RMM = right matrix multiplication, p = # of joined tables, q = # of sparse dimension tables
SLIDE 55
Results: time to convergence
SLIDE 56 Strengths & Weaknesses: Morpheus / MorpheusFI
Strengths:
- Automatically rewrites any LA computation over a join's output as LA operations over the base tables
- Decouples factorization and execution; backend-agnostic
- In many cases, it can dramatically improve runtime
- MorpheusFI extends Morpheus to support quadratic feature interactions
Weaknesses:
- It cannot be generalized to support higher degree interactions
- At the time of publication, only a simple heuristics-based approach to optimizing the execution plan
- Only supports ML models that can be expressed in linear algebraic terms
SLIDE 57
Questions: Morpheus / MorpheusFI
SLIDE 58
Discussion
SLIDE 59 Commonalities
Overall objectives:
- Empower database users to use ML from their DBMS
- Avoid the high cost of doing “offline” ML
- Aim for flexibility + generalizability
- Improve efficiency
Optimization via Translation:
- Raven: Translate data and ML operations into a
unified IR
- MorpheusFI: Translate ML models into LA operations
SLIDE 60 Differences
Implementation:
- Raven: major modifications to DBMS engine
- Rafiki: cloud application, many interfaces
- MorpheusFI: lightweight Python library
Generalizability:
- MorpheusFI: works for any ML model built with LA operators, with linear or quadratic feature interactions
- Raven: native support for many popular models and
frameworks, and out-of-process execution for all others
- Rafiki: only supports a limited set of models + runtimes
SLIDE 61 Differences
Inference vs Training:
- Raven: training in the cloud, inference in the DBMS
- Rafiki: training & inference in the cloud, with a
DBMS interface
SLIDE 62
Questions & Discussion
SLIDE 63 Questions & Discussion
- How will ML be affected by stricter data governance? How can DBMSs play a role?
- What challenges remain (technical or not) for applying ML in an enterprise
setting?
- What challenges can DBMSs solve?
- What is the role of DBMSs in ML?
- Reasons not to invest in DBMS + ML?