 
DBMS + ML
Julian Oks, Josh Sennett
Jan. 29, 2020
Context + Problem Statement
Context: DBMS + ML
DBMS / RDBMS
- Prevalent in all industries
- Efficient
- Highly reliable & available
- Consistent & Transactional
- Provide concurrency
- Declarative API (SQL)
- Support for versioning, auditing, encryption

Context: DBMS + ML
Machine Learning
- Hardware is expensive and resource usage is non-uniform, but cloud computing makes it affordable
- Data science expertise is expensive too, but ML services and tools aim to make it accessible
- Becoming more mainstream; no longer exclusive to “unicorn” ML applications
- Growing focus on fairness, security, privacy, and auditability
“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”

Context: DBMS + ML
Challenges faced in employing machine learning:
- Many ML frameworks
- Many ML algorithms
- Heterogeneous hardware + runtime environments
- Complex APIs, typically requiring data science experts
- Complexity of model selection, training, deployment, and improvement
- Lack of security, auditability, versioning
- Majority of time is spent on data cleaning, preprocessing, and integration
Can DBMS + ML integration address these challenges?

Life without DBMS + ML integration
1. Join tables, export, create features, and do offline ML
   Problem: slow, memory + runtime intensive, redundant work
2. Write super-complex SQL UDFs to implement ML models
   Problem: large, complex models are hard and often impossible to implement in SQL. Writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms.
Recent trends in DBMS + ML
Problem Statement
Some Big Picture Questions:
- How do you make ML accessible to a typical database user?
- How do you provide flexibility to use the right ML model?
- How do you support different frameworks, cloud service providers, and hardware and runtime environments?
- Data is often split across tables; can we do prediction without needing to materialize joins across these tables?
- Can DBMS + ML efficiency match (or outperform) ML alone?
Tradeoff: simplicity vs. flexibility

Rafiki: Motivation
Building an ML application is complex, even when using cloud platforms and popular frameworks.
- Training:
  - Many models to choose from
  - Many (critical) hyperparameters to tune
- Inference:
  - Using a single model is faster, but less accurate than using a model ensemble
  - Need to select the right model(s) to trade off accuracy and latency

Rafiki: Approach
Training Service: automate model selection and hyperparameter tuning
- automated search to find the “best” model
  - highest accuracy
  - lowest memory usage
  - lowest runtime
Inference Service: online ensemble modeling
- automated selection of the “best” model (or ensemble)
  - maximize accuracy
  - minimize excess latency (time exceeding SLO)
Rafiki: Training Service
Automate development and training of a new ML model:
- distributed model selection
- distributed hyper-parameter tuning
Rafiki parallelizes hyper-parameter tuning to reduce tuning time.
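The idea of parallel hyper-parameter tuning can be illustrated with a minimal sketch. This is not Rafiki's algorithm or API: it swaps in plain random search, uses a thread pool in place of a cluster of workers, and `validation_score` is a made-up stand-in for "train a model and report validation accuracy". All names below are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def validation_score(params):
    # Stand-in for training a model and measuring validation accuracy.
    # The formula is invented; it peaks at lr=0.1, depth=6.
    lr, depth = params["lr"], params["depth"]
    return 1.0 - (lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2


def sample_params(rng):
    # Draw one hyper-parameter configuration from a hypothetical search space.
    return {"lr": rng.uniform(0.001, 0.5), "depth": rng.randint(2, 12)}


def parallel_random_search(n_trials=32, n_workers=4, seed=0):
    rng = random.Random(seed)
    trials = [sample_params(rng) for _ in range(n_trials)]
    # Each trial is independent, so trials can be evaluated concurrently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(validation_score, trials))
    best = max(range(n_trials), key=lambda i: scores[i])
    return trials[best], scores[best], scores


best_params, best_score, all_scores = parallel_random_search()
```

Because the trials do not depend on each other, wall-clock tuning time shrinks roughly with the number of workers; a real system would run each trial on a separate process or machine rather than a thread.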
Rafiki: Inference Service
Automate model selection and scheduling:
- Maximize accuracy
- Minimize latency exceeding SLO
These are typically competing objectives.
Other optimizations (similar to Clipper):
- Parallel ensemble modeling
- Batch size selection (throughput vs. latency)
- Parameter caching
Model ensembles improve accuracy, but are slower due to stragglers.

Rafiki: Inference Service
How does Rafiki optimize model selection for its inference service?
- Pose as optimization problem
- Reinforcement Learning
- Objective is to maximize accuracy while minimizing excess latency (beyond SLO)
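To make the accuracy vs. excess-latency trade-off concrete, here is a small sketch that scores every candidate ensemble and picks the best one by exhaustive search. It is only an illustration of the objective: Rafiki uses reinforcement learning, not this enumeration, and the model names, accuracy and latency numbers, the ensemble accuracy bonus, and `penalty_per_ms` are all assumptions made up for the example.

```python
from itertools import combinations

# Hypothetical per-model estimates: name -> (accuracy, latency in ms).
MODELS = {"small": (0.70, 20.0), "medium": (0.76, 45.0), "large": (0.80, 90.0)}


def ensemble_estimate(names):
    # Crude assumptions: members run in parallel, so ensemble latency is the
    # slowest member (the straggler); accuracy is the best member's accuracy
    # plus a small bonus per extra member.
    accs = [MODELS[n][0] for n in names]
    lats = [MODELS[n][1] for n in names]
    accuracy = min(1.0, max(accs) + 0.01 * (len(names) - 1))
    return accuracy, max(lats)


def score(names, slo_ms, penalty_per_ms=0.005):
    # Reward accuracy, penalize only the latency that exceeds the SLO.
    accuracy, latency = ensemble_estimate(names)
    excess = max(0.0, latency - slo_ms)
    return accuracy - penalty_per_ms * excess


def choose_ensemble(slo_ms):
    candidates = [
        combo
        for size in range(1, len(MODELS) + 1)
        for combo in combinations(sorted(MODELS), size)
    ]
    return max(candidates, key=lambda combo: score(combo, slo_ms))


tight = choose_ensemble(10)    # SLO too tight for anything but the fastest model
medium = choose_ensemble(50)   # room for a two-model ensemble
loose = choose_ensemble(100)   # every model fits within the SLO
```

With a loose SLO the full ensemble wins on accuracy; as the SLO tightens, slow members cost more than they add and are dropped.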
Rafiki: How does this integrate with DBMS?
- REST API via SQL UDF
- Python SDK

    CREATE FUNCTION food_name(image_path) RETURNS text AS
    BEGIN
      ... -- CALL RAFIKI API -- ...
    END;

    SELECT food_name(image_path) AS name, count(*)
    FROM foodlog
    WHERE age > 52
    GROUP BY name;
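The UDF pattern above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for the production DBMS, the table contents are invented, and the body of `food_name` is a placeholder where a real UDF would call the inference service (the actual Rafiki request is elided in the slides and is not reproduced here).

```python
import sqlite3


def food_name(image_path):
    # Placeholder prediction: a real UDF would send the image to the
    # inference service and return the predicted label.
    return "pizza" if "pizza" in image_path else "salad"


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
conn.create_function("food_name", 1, food_name)

conn.execute("CREATE TABLE foodlog (image_path TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO foodlog VALUES (?, ?)",
    [
        ("img/pizza_1.jpg", 60),
        ("img/salad_1.jpg", 55),
        ("img/pizza_2.jpg", 70),
        ("img/pizza_3.jpg", 30),  # filtered out by age > 52
    ],
)

rows = conn.execute(
    "SELECT food_name(image_path) AS name, count(*) "
    "FROM foodlog WHERE age > 52 GROUP BY name ORDER BY name"
).fetchall()
conn.close()
```

The query itself stays plain SQL: the database user calls the model like any other function, and filtering (`age > 52`) happens before prediction.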
Interfaces
[Figures: Web Interface; Python UDF & Web API]
Strengths & Weaknesses: Rafiki
Strengths:
- Allows users to specify an SLO that’s used for model selection and inference
- Easy to use for a large class of tasks (e.g., regression, classification, NLP)
- Automates and optimizes complex decisions in ML design and deployment
Weaknesses:
- Not very general: you have to use Rafiki’s limited set of models, model selection + tuning algorithms, and model ensemble strategy
- Could be very expensive, since you have to train many models to find the best one
- Rafiki has to compete with other offerings for automated model tuning and model selection services
Questions: Rafiki
Motivation
DBMSs have many advantages
● High Performance
● Mature and Reliable
● High availability, Transactional, Concurrent Access, ...
● Encryption, Auditing, Familiar & Prevalent, ...
Store and serve models in the DBMS
Question: Can in-RDBMS scoring of ML models match (outperform?) the performance of dedicated frameworks?
Answer: Yes!

Raven
Supports in-DB model inference
Key Features:
● Unified IR enables advanced cross-optimizations between ML and DB operators
● Takes advantage of ML runtimes
Raven integrated into Microsoft SQL Server

Background: ONNX (Open Neural Network Exchange)
Standardized ML model representation
Enables portable models, cross-platform inference
Integrated ONNX Runtime into SQL Server
Background: MLflow Model Pipelines
Model Pipeline contains:
● Trained Model
● Preprocessing Steps
● Dependencies
Pipeline is stored in the RDBMS
A SQL query can invoke a Model Pipeline
Defining a Model Pipeline
Using a Model Pipeline
Raven’s IR Uses both ML and relational operators
Raven’s IR
Operator categories:
● Relational Algebra
● Linear Algebra
● Other ML and data featurizers (e.g., scikit-learn operations)
● UDFs, for when the static analyzer is unable to map operators
Cross-Optimization
Cross-Optimization
DBMS + ML optimizations:
● Predicate-based model pruning: conditions are pushed into the decision tree, allowing subtrees to be pruned
● Model-projection pushdown: unused features are discarded in the query plan
● NN translation: transform operations/models into equivalent NNs, leverage optimized runtime
● Model / query splitting: model can be partitioned
● Model inlining: ML operators transformed to relational ones (e.g., small decision trees can be inlined)
Other standard DB and compiler optimizations (constant folding, join elimination, ...)
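Predicate-based model pruning can be sketched on a toy decision tree. This is an illustration of the idea only, not Raven's implementation: the tuple-based tree encoding, the feature names, and the `bounds` dictionary (the value range each query predicate implies) are assumptions invented for the example.

```python
# A node is ("leaf", value) or ("split", feature, threshold, left, right);
# the left branch is taken when x[feature] <= threshold.
TREE = (
    "split", "age", 50,
    ("leaf", "low"),
    ("split", "bmi", 30, ("leaf", "medium"), ("leaf", "high")),
)


def prune(node, bounds):
    """Drop branches that no row satisfying the predicates can reach.

    bounds maps feature -> (lo, hi), the inclusive range implied by the
    query's WHERE clause; features without predicates are unconstrained.
    """
    if node[0] == "leaf":
        return node
    _, feature, threshold, left, right = node
    lo, hi = bounds.get(feature, (float("-inf"), float("inf")))
    if hi <= threshold:   # every qualifying row goes left
        return prune(left, bounds)
    if lo > threshold:    # every qualifying row goes right
        return prune(right, bounds)
    return ("split", feature, threshold, prune(left, bounds), prune(right, bounds))


def predict(node, x):
    while node[0] == "split":
        _, feature, threshold, left, right = node
        node = left if x[feature] <= threshold else right
    return node[1]


# WHERE age > 52 on an integer column implies age >= 53.
pruned = prune(TREE, {"age": (53, float("inf"))})
row = {"age": 60, "bmi": 35}
same_answer = predict(pruned, row) == predict(TREE, row)
```

After pruning, the `age` test disappears entirely, so the smaller tree no longer needs the `age` column at scoring time; that is also where model-projection pushdown becomes possible.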
Runtime Code Generation A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.
Raven: Full Pipeline
Query Execution
The generated SQL query is executed in one of three ways:
1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext, run within Docker
Results Effects of some optimizations:
Results: does Raven outperform ONNX?
Strengths & Weaknesses: Raven
Strengths:
- Brings advantages of a DBMS to machine learning models
- Raven’s cross-optimizations and ONNX integration make inference faster
- Very general: natively supports many ML frameworks, runtimes, and specialized hardware, and provides containerized execution for all others
Weaknesses:
- Only compatible with SQL Server
- Limited to inference; it does not facilitate training
- Static analysis has limits (e.g., analysis of for loops & conditionals)
Questions: Raven
Motivation: a typical ML pipeline
[Figure: Collect Data, Materialize Joins; ML / LA (Linear Algebra) Operations]

Morpheus: Factorized ML
Idea: avoid materializing joins by pushing ML computations through joins.
Any Linear Algebra computation over the join output can be factorized in terms of the base tables.
Morpheus (2017): Factor operations over a “normalized matrix” using a framework of rewrite rules.

The Normalized Matrix
A logical data type that represents joined tables.
Consider a PK-FK join between two tables S, R.
The normalized matrix is the triple (S, K, R), where K is an indicator matrix.
The output of the join, T, is then T = [S, KR] (column-wise concatenation).
Key:
- S: left table
- R: right table
- K: indicator matrix (0/1s)
- T: normalized matrix
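A small worked example shows why T = [S, KR] allows computation to be pushed through the join. For a weight vector w split into the parts matching S's and R's columns, block multiplication gives T w = S w_S + K (R w_R), so the product can be computed without ever building T. The sketch below checks that identity in plain Python; the tables, foreign keys, and weights are invented, and the helper functions are illustrative rather than Morpheus's API.

```python
def matmul(a, b):
    # Dense matrix product on lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(a, b):
    # Column-wise concatenation [a, b].
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


S = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # left table: 3 rows, 2 features
R = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]   # right table: 2 rows, 3 features
fk = [0, 1, 0]                                 # foreign key of each S row into R

# Indicator matrix: K[i][j] = 1 exactly when row i of S joins row j of R.
K = [[1.0 if j == key else 0.0 for j in range(len(R))] for key in fk]

w = [[1.0], [2.0], [3.0], [4.0], [5.0]]        # one weight per column of T
w_S, w_R = w[:2], w[2:]

# Materialized: build the join output T = [S, KR], then multiply.
T = hconcat(S, matmul(K, R))
materialized = matmul(T, w)

# Factorized: work on the base tables only; R w_R is computed once per R row.
factorized = add(matmul(S, w_S), matmul(K, matmul(R, w_R)))
```

The saving comes from redundancy in the join: rows of R that many S rows reference are multiplied once in the factorized form, instead of once per matching row of T.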