SLIDE 1 DBMS + ML
Julian Oks Josh Sennett
SLIDE 2
Context + Problem Statement
SLIDE 3 Context: DBMS + ML
DBMS / RDBMS
- Prevalent in all industries
- Efficient
- Highly reliable & available
- Consistent & Transactional
- Provide concurrency
- Declarative API (SQL)
- Support for versioning, auditing, encryption
SLIDE 4 Context: DBMS + ML
Machine Learning
- Hardware is expensive and resource usage is non-uniform, but cloud
computing makes it affordable
- Data science expertise is expensive too, but ML services and tools aim to
make it accessible
- Becoming more mainstream; no longer exclusive to “unicorn” ML applications
- Growing focus on fairness, security, privacy, and auditability
“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”
SLIDE 5 Context: DBMS + ML
Challenges faced in employing machine learning:
- Many ML frameworks
- Many ML algorithms
- Heterogeneous hardware + runtime environments
- Complex APIs -- typically requiring data science experts
- Complexity of model selection, training, deployment, and improvement
- Lack of security, auditability, versioning
- Majority of time is spent on data cleaning, preprocessing, and integration
Can DBMS + ML integration address these challenges?
SLIDE 6 Life without DBMS + ML integration
1. Join tables, export the data, create features, and do ML offline.
   Problem: slow, memory- and runtime-intensive, redundant work.
2. Write highly complex SQL UDFs to implement ML models.
   Problem: large, complex models are hard (and often impossible) to implement in SQL; writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms.
SLIDE 7
Recent trends in DBMS + ML
SLIDE 8
Recent trends in DBMS + ML
SLIDE 9 Problem Statement
Some Big Picture Questions:
- How do you make ML accessible to a typical database user?
- How do you provide flexibility to use the right ML model?
- How do you support different frameworks, cloud service providers, and
hardware and runtime environments?
- Data is often split across tables; can we do prediction without needing to
materialize joins across these tables?
- Can DBMS + ML efficiency match (or outperform) ML alone?
Tradeoff: simplicity vs. flexibility
SLIDE 10
SLIDE 11 Rafiki: Motivation
Building an ML application is complex, even when using cloud platforms and popular frameworks.
- Training:
- Many models to choose from
- Many (critical) hyperparameters to tune
- Inference:
- Using a single model is faster, but less accurate than
using a model ensemble
- Need to select the right model(s) to trade off accuracy
and latency
SLIDE 12 Rafiki: Approach
Training Service: automate model selection and hyperparameter tuning
- automated search to find the “best” model
- highest accuracy
- lowest memory usage
- lowest runtime
Inference Service: online ensemble modeling
- automated selection of the “best” model (or ensemble)
- maximize accuracy
- minimize excess latency (time exceeding SLO)
SLIDE 13 Automate development and training of a new ML model:
- distributed model selection
- distributed hyper-parameter tuning
SLIDE 14 Rafiki: Training Service
SLIDE 15 Rafiki: Training Service
Rafiki parallelizes hyper-parameter tuning to reduce tuning time
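A minimal sketch of the idea (not Rafiki's actual implementation): candidate hyper-parameter configurations are scored in parallel across workers, and the best configuration wins. The search space and objective function below are toy stand-ins.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_and_score(params):
    # Stand-in for a real training run: scores a (learning_rate, batch_size)
    # pair against a toy objective peaked at lr=0.1, batch_size=64.
    lr, bs = params
    return 1.0 / (1.0 + abs(lr - 0.1) + abs(bs - 64) / 64.0)

def sample_trials(n, seed=0):
    # Random search over a hypothetical hyper-parameter space.
    rng = random.Random(seed)
    return [(rng.uniform(0.001, 1.0), rng.choice([16, 32, 64, 128]))
            for _ in range(n)]

trials = sample_trials(32)
# Evaluate trials in parallel, analogous to tuning across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, trials))
best_score, best_params = max(zip(scores, trials))
```

Rafiki additionally shares information across trials rather than sampling blindly, but the parallel evaluate-and-select loop is the core of the speedup.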
SLIDE 16 Rafiki: Inference Service
Automate model selection and scheduling:
- Maximize accuracy
- Minimize latency exceeding SLO
These are typically competing objectives.
Other optimizations (similar to Clipper):
- Parallel ensemble modeling
- Batch size selection (throughput vs latency)
- Parameter caching
Model ensembles improve accuracy, but are slower due to stragglers
SLIDE 17 Rafiki: Inference Service
How does Rafiki optimize model selection for its inference service?
- Pose as optimization problem
- Reinforcement Learning
- Objective is to maximize
accuracy while minimizing excess latency (beyond SLO)
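Rafiki's exact reward formulation is not reproduced here; the sketch below shows one plausible shape for such an objective, with a hypothetical penalty weight `beta`: reward equals accuracy minus a penalty proportional to latency beyond the SLO.

```python
def inference_reward(accuracy, latency_ms, slo_ms, beta=0.5):
    # Penalize only *excess* latency (time beyond the SLO); latency
    # under the SLO incurs no penalty at all.
    excess = max(0.0, latency_ms - slo_ms)
    return accuracy - beta * (excess / slo_ms)

# A more accurate ensemble that slightly exceeds the SLO can still beat
# a fast single model, depending on the penalty weight.
single = inference_reward(accuracy=0.90, latency_ms=40, slo_ms=100)
ensemble = inference_reward(accuracy=0.97, latency_ms=110, slo_ms=100)
```

The RL agent's job is then to pick the model set (and batch sizes) that maximizes this reward as the workload shifts.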
SLIDE 18 Rafiki: How does this integrate with DBMS?
CREATE FUNCTION food_name(image_path) RETURNS text AS
BEGIN ... END;

SELECT food_name(image_path) AS name, count(*)
FROM foodlog
WHERE age > 52
GROUP BY name;
SLIDE 19
Interfaces
- Web Interface
- Python UDF & Web API
SLIDE 20 Strengths & Weaknesses: Rafiki
Strengths:
- Allows users to specify an SLO that’s used for model selection and inference
- Easy to use for a large class of tasks (e.g., regression, classification, NLP)
- Automates and optimizes complex decisions in ML design and deployment
Weaknesses:
- Not very general: you have to use Rafiki’s limited set of models, model
selection + tuning algorithms, and model ensemble strategy
- Could be very expensive, since you have to train many models to find the best one
- Rafiki has to compete with other offerings for automated model tuning and
model selection services
SLIDE 21
Questions: Rafiki
SLIDE 22
SLIDE 23 Motivation
DBMSs have many advantages
- High Performance
- Mature and Reliable
- High availability, Transactional, Concurrent Access, ...
- Encryption, Auditing, Familiar & Prevalent, ….
Store and serve models in the DBMS.
Question: Can in-RDBMS scoring of ML models match (or outperform) the performance of dedicated frameworks?
Answer: Yes!
SLIDE 24 Raven
Supports in-DB model inference.
Key Features:
- Unified IR enables advanced
cross-optimizations between ML and DB operators
- Takes advantage of ML runtimes
integrated into Microsoft SQL Server
SLIDE 25 Background: ONNX (Open Neural Network Exchange)
Standardized ML model representation.
Enables portable models and cross-platform inference.
ONNX Runtime is integrated into SQL Server.
SLIDE 26 Background: MLflow Model Pipelines
Model Pipeline contains:
- Trained Model
- Preprocessing Steps
- Dependencies
SLIDE 27 Background: MLflow Model Pipelines
Model Pipeline contains:
- Trained Model
- Preprocessing Steps
- Dependencies
The pipeline is stored in the RDBMS.
A SQL query can invoke a Model Pipeline.
SLIDE 28
Defining a Model Pipeline
SLIDE 29
Using a Model Pipeline
SLIDE 30 Raven’s IR
Uses both ML and relational operators
SLIDE 31 Raven’s IR
Operator categories:
- Relational Algebra
- Linear Algebra
- Other ML and data featurizers (e.g., scikit-learn operations)
- UDFs, for when the static analyzer is unable to map operators
SLIDE 32
Cross-Optimization
SLIDE 33 Cross-Optimization
DBMS + ML optimizations:
- Predicate-based model pruning: conditions are pushed into the decision tree,
allowing subtrees to be pruned
- Model-projection pushdown: unused features are discarded in the query plan
- NN translation: transform operations/models into equivalent NNs, leverage the optimized runtime
- Model / query splitting: Model can be partitioned
- Model inlining: ML operators transformed to relational ones (e.g., small decision trees can be inlined)
Other standard DB and compiler optimizations (constant folding, join elimination, …)
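A toy illustration (not Raven's actual code) of predicate-based model pruning: a query predicate such as `age > 52` is pushed into a hand-built decision tree, making one subtree unreachable so it can be dropped before inference.

```python
# Toy decision tree: nodes are (feature, threshold, left, right),
# leaves are label strings. Left branch means feature <= threshold.
TREE = ("age", 50,
        ("income", 30_000, "reject", "review"),   # taken when age <= 50
        ("income", 60_000, "review", "approve"))  # taken when age > 50

def prune(node, known_lower):
    """Prune branches made unreachable by a query predicate of the
    form `age > known_lower` that the WHERE clause already guarantees."""
    if isinstance(node, str):          # leaf: nothing to prune
        return node
    feat, thresh, left, right = node
    if feat == "age" and known_lower >= thresh:
        # Every qualifying row has age > thresh, so the left
        # subtree can never be taken; replace the node with its right child.
        return prune(right, known_lower)
    return (feat, thresh, prune(left, known_lower), prune(right, known_lower))

pruned = prune(TREE, known_lower=52)   # query predicate: age > 52
```

After pruning, the scoring function evaluates a strictly smaller tree; Raven applies the same idea inside its unified IR, where the predicate and the model are visible to one optimizer.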
SLIDE 34 Runtime Code Generation
A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.
SLIDE 35
Raven: Full Pipeline
SLIDE 36 Query Execution
The generated SQL query is executed in one of three ways:
1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext.): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext., run within Docker
SLIDE 37 Results
Effects of some optimizations:
SLIDE 38
Results: does Raven outperform ONNX?
SLIDE 39 Strengths & Weaknesses: Raven
Strengths:
- Brings advantages of a DBMS to machine learning models
- Raven’s cross-optimizations and ONNX integration make inference faster
- Very generalized: natively supports many ML frameworks, runtimes,
specialized hardware, and provides containerized execution for all others
Weaknesses:
- Only compatible with SQL Server
- Limited to inference; it does not facilitate training
- Limits to static analysis (e.g., analysis of for loops & conditionals)
SLIDE 40
Questions: Raven
SLIDE 41
SLIDE 42 Motivation: a typical ML pipeline
Collect Data → Materialize Joins → ML / LA (Linear Algebra) Operations
SLIDE 43 Morpheus: Factorized ML
Idea: avoid materializing joins by pushing ML computations through joins.
Any linear algebra computation over the join output can be factorized in terms of the base tables.
Morpheus (2017): factor operations over a "normalized matrix" using a framework of rewrite rules.
SLIDE 44 The Normalized Matrix
A logical data type that represents joined tables.
Consider a PK-FK join between two tables S and R.
The normalized matrix is the triple (S, K, R), where K is an indicator matrix.
The output of the join, T, is then T = [S, KR] (column-wise concatenation).
Key: S = left table, R = right table, K = indicator matrix (0/1s), T = materialized join output
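A small NumPy sketch (illustrative, not the Morpheus implementation) of the construction: each row of K is one-hot, selecting which row of R the corresponding row of S joins with, so K @ R replicates dimension-table rows and T = [S, KR] is the materialized join.

```python
import numpy as np

# Left (fact) table features: 4 rows, 2 columns.
S = np.array([[1., 0.], [2., 1.], [3., 0.], [4., 1.]])
# Right (dimension) table features: 2 rows, 3 columns.
R = np.array([[10., 11., 12.], [20., 21., 22.]])
# Indicator matrix K (4x2): row i of S joins with the row of R
# marked by the 1 in K[i]. Rows 0 and 2 join R[0]; rows 1 and 3 join R[1].
K = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])

# Materialized join output: column-wise concatenation T = [S, KR].
T = np.hstack([S, K @ R])
```

Morpheus never builds T; it keeps (S, K, R) and rewrites downstream linear algebra against the triple instead.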
SLIDE 45 Rewrite Rules
- Element-wise operators: f(T) → (f(S), K, f(R))
- Aggregators:
  - rowSum(T) → rowSum(S) + K rowSum(R)
  - colSum(T) → [colSum(S), colSum(K) R]
  - sum(T) → sum(S) + colSum(K) rowSum(R)
- Left Matrix Multiplication: TX → SX[1 : dS, ] + K(RX[dS + 1 : d, ])
- Matrix Inversion:
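The aggregation and multiplication rules can be checked numerically. The NumPy sketch below (illustrative only) builds a toy normalized matrix and verifies three of the rewrites against the materialized join.

```python
import numpy as np

S = np.array([[1., 0.], [2., 1.], [3., 0.], [4., 1.]])   # left table
R = np.array([[10., 11., 12.], [20., 21., 22.]])          # right table
K = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])    # indicator
T = np.hstack([S, K @ R])                                  # materialized join
dS, d = S.shape[1], T.shape[1]

# rowSum(T) = rowSum(S) + K rowSum(R)
lhs_row = T.sum(axis=1)
rhs_row = S.sum(axis=1) + K @ R.sum(axis=1)

# colSum(T) = [colSum(S), colSum(K) R]
lhs_col = T.sum(axis=0)
rhs_col = np.concatenate([S.sum(axis=0), K.sum(axis=0) @ R])

# Left matrix multiplication: T X = S X[:dS] + K (R X[dS:])
X = np.arange(d * 2, dtype=float).reshape(d, 2)
lhs_mm = T @ X
rhs_mm = S @ X[:dS] + K @ (R @ X[dS:])
```

The factorized right-hand sides touch only S, K, and R; when R has many columns and K is tall and sparse, this avoids both the join and the redundant arithmetic over replicated rows.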
SLIDE 46 Rewrite Rules Applied to Logistic Regression
SLIDE 47 Performance
Key: F = factorized, M = materialized, FR = feature ratio, TR = tuple ratio
SLIDE 48 Performance
Key: Domain Size = number of unique values
SLIDE 49 MorpheusFI: Quadratic Feature Interactions for Morpheus
Limitation: Factorized Linear Algebra is restricted to linearity over feature vectors
SLIDE 50 MorpheusFI
Add quadratic feature interactions into factorized ML by adding two non-linear interaction operators:
- self-interaction within a matrix
- cross-interaction between matrices
participating in a join
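One way to read these operators (an illustrative NumPy sketch, not MorpheusFI's code) is as pairwise column products within one matrix and across the two joined matrices:

```python
import numpy as np

def self_interaction(A):
    """Quadratic features within one matrix: all column products
    A[:, i] * A[:, j] for i <= j (upper triangle, including squares)."""
    d = A.shape[1]
    cols = [A[:, i] * A[:, j] for i in range(d) for j in range(i, d)]
    return np.stack(cols, axis=1)

def cross_interaction(A, B):
    """Quadratic features across two matrices participating in a join:
    all products of one column of A with one column of B."""
    cols = [A[:, i] * B[:, j]
            for i in range(A.shape[1]) for j in range(B.shape[1])]
    return np.stack(cols, axis=1)

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5.], [6.]])
quad = self_interaction(A)       # columns: a1*a1, a1*a2, a2*a2
cross = cross_interaction(A, B)  # columns: a1*b1, a2*b1
```

MorpheusFI's contribution is that these interaction features can still be computed factorized, over the base tables of the normalized matrix, rather than over the materialized join.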
SLIDE 51 A new abstraction: Interacted Normalized Matrix
with the following relationships:
SLIDE 52
Formal proofs of algebraic rewrite rules
SLIDE 53
Rewrite rules are extremely complex
SLIDE 54 Results: Matrix multiplication
Key: LMM = left matrix multiplication, RMM = right matrix multiplication, p = # of joined tables, q = # of sparse dimension tables
SLIDE 55
Results: time to convergence
SLIDE 56 Strengths & Weaknesses: Morpheus / MorpheusFI
Strengths:
- Automatically rewrites any LA computation over a join's output as LA operations over the base tables
- Decouples factorization and execution; backend-agnostic
- In many cases, it can dramatically improve runtime
- MorpheusFI extends Morpheus to support quadratic feature interactions
Weaknesses:
- It cannot be generalized to support higher degree interactions
- At the time of publication, only a simple heuristics-based approach to optimizing the execution plan
- Only supports ML models that can be expressed in linear algebraic terms
SLIDE 57
Questions: Morpheus / MorpheusFI
SLIDE 58
Discussion
SLIDE 59 Commonalities
Overall objectives:
- Empower database users to use ML from their DBMS
- Avoid the high cost of doing “offline” ML
- Aim for flexibility + generalizability
- Improve efficiency
Optimization via Translation:
- Raven: Translate data and ML operations into a
unified IR
- MorpheusFI: Translate ML models into LA operations
SLIDE 60 Differences
Implementation:
- Raven: major modifications to DBMS engine
- Rafiki: cloud application, many interfaces
- MorpheusFI: lightweight Python library
Generalizability:
- MorpheusFI: works for any ML model built with LA operators, with linear or quadratic feature interactions
- Raven: native support for many popular models and
frameworks, and out-of-process execution for all others
- Rafiki: only supports a limited set of models + runtimes
SLIDE 61 Differences
Inference vs Training:
- Raven: training in the cloud, inference in the DBMS
- Rafiki: training & inference in the cloud, with a
DBMS interface
SLIDE 62
Questions & Discussion
SLIDE 63 Questions & Discussion
- How will ML be affected by stricter data governance? How can DBMSs play a role?
- What challenges remain (technical or not) for applying ML in an enterprise
setting?
- What challenges can DBMSs solve?
- What is the role of DBMSs in ML?
- Reasons not to invest in DBMS + ML?