
SLIDE 1

DBMS + ML

Julian Oks Josh Sennett

  • Jan. 29, 2020
SLIDE 2

Context + Problem Statement

SLIDE 3

Context: DBMS + ML

DBMS / RDBMS

  • Prevalent in all industries
  • Efficient
  • Highly reliable & available
  • Consistent & Transactional
  • Provide concurrency
  • Declarative API (SQL)
  • Support for versioning, auditing, encryption

SLIDE 4

Context: DBMS + ML

Machine Learning

  • Hardware is expensive and resource usage is non-uniform, but cloud computing makes it affordable
  • Data science expertise is expensive too, but ML services and tools aim to make it accessible

  • Becoming more mainstream; no longer exclusive to “unicorn” ML applications
  • Growing focus on fairness, security, privacy, and auditability

“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”

SLIDE 5

Context: DBMS + ML

Challenges faced in employing machine learning:

  • Many ML frameworks
  • Many ML algorithms
  • Heterogeneous hardware + runtime environments
  • Complex APIs -- typically requiring data science experts
  • Complexity of model selection, training, deployment, and improvement
  • Lack of security, auditability, versioning
  • Majority of time is spent on data cleaning, preprocessing, and integration

Can DBMS + ML integration address these challenges?

SLIDE 6

Life without DBMS + ML integration

1. Join tables, export, create features, and do offline ML.
   Problem: slow, memory- and runtime-intensive, redundant work.
2. Write super-complex SQL UDFs to implement ML models.
   Problem: large, complex models are hard, and often impossible, to implement in SQL. Writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms.

SLIDE 7

Recent trends in DBMS + ML

SLIDE 8

Recent trends in DBMS + ML

SLIDE 9

Problem Statement

Some Big Picture Questions:

  • How do you make ML accessible to a typical database user?
  • How do you provide flexibility to use the right ML model?
  • How do you support different frameworks, cloud service providers, and hardware and runtime environments?
  • Data is often split across tables; can we do prediction without materializing joins across these tables?

  • Can DBMS + ML efficiency match (or outperform) ML alone?

Tradeoff: simplicity vs. flexibility

SLIDE 10
SLIDE 11

Rafiki: Motivation

Building an ML application is complex, even when using cloud platforms and popular frameworks.

  • Training:
    ○ Many models to choose from
    ○ Many (critical) hyperparameters to tune
  • Inference:
    ○ Using a single model is faster, but less accurate than using a model ensemble
    ○ Need to select the right model(s) to trade off accuracy and latency


SLIDE 12

Rafiki: Approach

Training Service: automate model selection and hyperparameter tuning

  • automated search to find the “best” model
  • highest accuracy
  • lowest memory usage
  • lowest runtime

Inference Service: online ensemble modeling

  • automated selection of the “best” model (or ensemble)
  • maximize accuracy
  • minimize excess latency (time exceeding SLO)
SLIDE 13

Automate development and training of a new ML model:

  • distributed model selection
  • distributed hyper-parameter tuning
SLIDE 14

Rafiki: Training Service

Automate development and training of a new ML model:

  • distributed model selection
  • distributed hyper-parameter tuning
SLIDE 15

Rafiki: Training Service

Automate development and training of a new ML model:

  • distributed model selection
  • distributed hyper-parameter tuning

Rafiki parallelizes hyper-parameter tuning to reduce tuning time
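Rafiki's actual tuning algorithms are not shown on the slide; as a hedged sketch of the idea, independent trials parallelize trivially across workers. Everything here (the search space, the toy `evaluate` score, the worker count) is hypothetical, not Rafiki's implementation:

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search space; a real tuner would be far richer.
SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64],
}

def sample_config(rng):
    """Draw one random hyper-parameter configuration."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def evaluate(config):
    """Stand-in for training a model and returning validation accuracy."""
    # Toy score that favors mid-range settings; a real system trains here.
    return 1.0 - abs(config["learning_rate"] - 1e-3) - abs(config["batch_size"] - 32) / 1000

def tune(n_trials=12, n_workers=4, seed=0):
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_trials)]
    # Trials are independent, so a pool of workers evaluates them in parallel.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(evaluate, configs))
    return max(zip(scores, configs), key=lambda sc: sc[0])

if __name__ == "__main__":
    score, config = tune()
    print(config, score)
```

Because each trial is self-contained, adding workers shortens wall-clock tuning time roughly linearly until trials run out.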

SLIDE 16

Rafiki: Inference Service

Automate model selection and scheduling:

  • Maximize accuracy
  • Minimize latency exceeding SLO

These are typically competing objectives.

Other optimizations (similar to Clipper):

  • Parallel ensemble modeling
  • Batch size selection (throughput vs latency)
  • Parameter caching

Model ensembles improve accuracy, but are slower due to stragglers

SLIDE 17

Rafiki: Inference Service

How does Rafiki optimize model selection for its inference service?

  • Pose as optimization problem
  • Reinforcement Learning
  • Objective: maximize accuracy while minimizing excess latency (beyond SLO)
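A minimal illustrative version of such an objective (the normalization and penalty weight are my assumptions, not Rafiki's exact formula):

```python
def reward(accuracy, latency, slo, penalty_weight=1.0):
    """Accuracy minus a penalty for latency beyond the SLO.

    Latency under the SLO is free; only the excess (normalized by the
    SLO) is penalized. Hypothetical formula for illustration.
    """
    excess = max(0.0, latency - slo) / slo
    return accuracy - penalty_weight * excess

# Meeting the SLO: reward equals accuracy.
print(reward(accuracy=0.92, latency=80.0, slo=100.0))   # → 0.92
# Missing the SLO by 50%: reward drops well below accuracy.
print(reward(accuracy=0.95, latency=150.0, slo=100.0))
```

An RL agent maximizing this reward is pushed toward larger ensembles only while the latency budget allows it.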

SLIDE 18

Rafiki: How does this integrate with DBMS?

  • REST API via SQL UDF:

    CREATE FUNCTION food_name(image_path) RETURNS text AS
    BEGIN
        -- CALL RAFIKI API --
        ...
    END;

    SELECT food_name(image_path) AS name, count(*)
    FROM foodlog
    WHERE age > 52
    GROUP BY name;

  • Python SDK
SLIDE 19

Interfaces

  • Web Interface
  • Python UDF & Web API

SLIDE 20

Strengths & Weaknesses: Rafiki

Strengths:

  • Allows users to specify an SLO that’s used for model selection and inference
  • Easy to use for a large class of tasks (e.g., regression, classification, NLP)
  • Automates and optimizes complex decisions in ML design and deployment

Weaknesses:

  • Not very general: you have to use Rafiki’s limited set of models, model selection + tuning algorithms, and model ensemble strategy
  • Could be very expensive, since you have to train many models to find the best one
  • Rafiki has to compete with other offerings for automated model tuning and model selection services

SLIDE 21

Questions: Rafiki

SLIDE 22
SLIDE 23

Motivation

DBMSs have many advantages

  • High Performance
  • Mature and Reliable
  • High availability, Transactional, Concurrent Access, ...
  • Encryption, Auditing, Familiar & Prevalent, ….

Store and serve models in the DBMS.

Question: Can in-RDBMS scoring of ML models match, or even outperform, the performance of dedicated frameworks?
Answer: Yes!

SLIDE 24

Raven

Supports in-DB model inference.

Key Features:

  • Unified IR enables advanced cross-optimizations between ML and DB operators
  • Takes advantage of ML runtimes integrated into Microsoft SQL Server


SLIDE 25

Background: ONNX (Open Neural Network Exchange)

  • Standardized ML model representation
  • Enables portable models and cross-platform inference
  • ONNX Runtime has been integrated into SQL Server

SLIDE 26

Background: MLflow Model Pipelines

Model Pipeline contains:

  • Trained Model
  • Preprocessing Steps
  • Dependencies
SLIDE 27

Background: MLflow Model Pipelines

Model Pipeline contains:

  • Trained Model
  • Preprocessing Steps
  • Dependencies

The pipeline is stored in the RDBMS, and a SQL query can invoke a Model Pipeline.

SLIDE 28

Defining a Model Pipeline

SLIDE 29

Using a Model Pipeline

SLIDE 30

Raven’s IR

Uses both ML and relational operators.
SLIDE 31

Raven’s IR

Operator categories:

  • Relational Algebra
  • Linear Algebra
  • Other ML and data featurizers (e.g., scikit-learn operations)
  • UDFs, for when the static analyzer is unable to map operators
SLIDE 32

Cross- Optimization

SLIDE 33

Cross-Optimization

DBMS + ML optimizations:

  • Predicate-based model pruning: conditions are pushed into the decision tree, allowing subtrees to be pruned
  • Model-projection pushdown: unused features are discarded in the query plan
  • NN translation: transform operations/models into equivalent NNs to leverage an optimized runtime
  • Model / query splitting: the model can be partitioned
  • Model inlining: ML operators transformed to relational ones (e.g., small decision trees can be inlined)

Other standard DB and compiler optimizations also apply (constant folding, join elimination, …).
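As an illustration of predicate-based model pruning (the tree encoding here is a toy, not Raven's IR): if the query carries `WHERE age > 52`, any subtree reachable only when `age <= 52` can be dropped before inference.

```python
# A decision tree node is either a leaf label or a tuple
# (feature, threshold, left_subtree, right_subtree),
# where the left branch is taken when feature <= threshold.
TREE = ("age", 52,
        ("bmi", 25, "low", "medium"),     # reachable only if age <= 52
        ("bmi", 30, "medium", "high"))    # reachable only if age > 52

def prune(node, feature, lower_bound):
    """Prune branches unreachable given the predicate `feature > lower_bound`."""
    if not isinstance(node, tuple):
        return node  # leaf
    feat, threshold, left, right = node
    if feat == feature and threshold <= lower_bound:
        # The test `feat <= threshold` can never hold under the predicate,
        # so only the right subtree is reachable.
        return prune(right, feature, lower_bound)
    return (feat, threshold,
            prune(left, feature, lower_bound),
            prune(right, feature, lower_bound))

print(prune(TREE, "age", 52))  # → ('bmi', 30, 'medium', 'high')
```

The pruned tree answers the same queries for all rows that satisfy the predicate, while evaluating fewer nodes per prediction.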

SLIDE 34

Runtime Code Generation

A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.

SLIDE 35

Raven: Full Pipeline

SLIDE 36

Query Execution

The generated SQL query is executed in one of three modes:

1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext.): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext., run within Docker

SLIDE 37

Results

Effects of some optimizations:

SLIDE 38

Results: does Raven outperform ONNX?

SLIDE 39

Strengths & Weaknesses: Raven

Strengths:

  • Brings advantages of a DBMS to machine learning models
  • Raven’s cross-optimizations and ONNX integration make inference faster
  • Very general: natively supports many ML frameworks, runtimes, and specialized hardware, and provides containerized execution for all others

Weaknesses:

  • Only compatible with SQL Server
  • Limited to inference; it does not facilitate training
  • Limits to static analysis (e.g., analysis of for loops and conditionals)
SLIDE 40

Questions: Raven

SLIDE 41
SLIDE 42

Motivation: a typical ML pipeline

Collect Data → Materialize Joins → ML / LA (Linear Algebra) Operations

SLIDE 43

Morpheus: Factorized ML

Idea: avoid materializing joins by pushing ML computations through them. Any linear algebra computation over the join output can be factorized in terms of the base tables.

Morpheus (2017): factors operations over a “normalized matrix” using a framework of rewrite rules.


SLIDE 44

The Normalized Matrix

A logical data type that represents joined tables.

Consider a PK-FK join between two tables S and R. The normalized matrix is the triple (S, K, R), where K is an indicator matrix. The output of the join, T, is then T = [S, KR] (column-wise concatenation).

Key: S = left table; R = right table; K = indicator matrix (0/1s); T = join output
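The construction can be checked numerically with NumPy (toy shapes chosen for illustration; a sketch, not Morpheus's actual data type):

```python
import numpy as np

# Left table S: 4 tuples, 2 features; right table R: 2 tuples, 3 features.
S = np.array([[1., 2.],
              [3., 4.],
              [5., 6.],
              [7., 8.]])
R = np.array([[10., 11., 12.],
              [20., 21., 22.]])

# K is a 4x2 indicator matrix: row i of the join takes row j of R
# iff K[i, j] == 1 (exactly one 1 per row, encoding each S-tuple's FK).
K = np.array([[1., 0.],
              [0., 1.],
              [1., 0.],
              [0., 1.]])

# Materialized join output: T = [S, KR] (column-wise concatenation).
T = np.hstack([S, K @ R])
print(T.shape)  # → (4, 5)
```

Storing (S, K, R) instead of T avoids the redundancy of repeating each R-row once per matching S-row.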

SLIDE 45

Rewrite Rules

  • Element-wise operators: f(T) → (f(S), K, f(R))
  • Aggregators:
    ○ rowSum(T) → rowSum(S) + K rowSum(R)
    ○ colSum(T) → [colSum(S), colSum(K) R]
    ○ sum(T) → sum(S) + colSum(K) rowSum(R)

  • Left Matrix Multiplication: TX → SX[1 : dS, ] + K(RX[dS + 1 : d, ])
  • Matrix Inversion:

Key: S = left table; R = right table; K = indicator matrix (0/1s); T = join output
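These identities can be verified numerically; the following self-contained NumPy check uses random toy matrices with the notation above (shapes are arbitrary choices for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, dS, nR, dR = 6, 3, 2, 4
S = rng.random((nS, dS))               # left table features
R = rng.random((nR, dR))               # right table features
K = np.zeros((nS, nR))
K[np.arange(nS), rng.integers(0, nR, nS)] = 1.0  # one FK match per S-row

T = np.hstack([S, K @ R])              # materialized join output
d = dS + dR

# rowSum(T) = rowSum(S) + K rowSum(R)
assert np.allclose(T.sum(axis=1), S.sum(axis=1) + K @ R.sum(axis=1))
# colSum(T) = [colSum(S), colSum(K) R]
assert np.allclose(T.sum(axis=0),
                   np.concatenate([S.sum(axis=0), K.sum(axis=0) @ R]))
# sum(T) = sum(S) + colSum(K) rowSum(R)
assert np.isclose(T.sum(), S.sum() + K.sum(axis=0) @ R.sum(axis=1))
# Left matrix multiplication: T X = S X[1:dS, :] + K (R X[dS+1:d, :])
X = rng.random((d, 2))
assert np.allclose(T @ X, S @ X[:dS] + K @ (R @ X[dS:]))
print("all rewrite rules verified")
```

The factorized right-hand sides never build T, which is the source of the runtime savings when R-rows are shared by many S-rows.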

SLIDE 46

Rewrite Rules Applied to Logistic Regression

Key: S = left table; R = right table; K = indicator matrix (0/1s); T = join output

SLIDE 47

Performance

Key: F = factorized; M = materialized; FR = feature ratio; TR = tuple ratio

SLIDE 48

Performance

Key: domain size = number of unique values

SLIDE 49

MorpheusFI: Quadratic Feature Interactions for Morpheus

Limitation: Factorized Linear Algebra is restricted to linearity over feature vectors

SLIDE 50

MorpheusFI

Adds quadratic feature interactions to factorized ML via two non-linear interaction operators:

  • self-interaction within a matrix
  • cross-interaction between matrices participating in a join
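A hedged sketch of what these operators compute on dense matrices (illustrative definitions only; MorpheusFI's real operators are defined on the normalized matrix and come with factorized rewrite rules):

```python
import numpy as np

def self_interaction(M):
    """All pairwise products M[:, i] * M[:, j] for i <= j (quadratic features)."""
    n, d = M.shape
    cols = [M[:, i] * M[:, j] for i in range(d) for j in range(i, d)]
    return np.stack(cols, axis=1)

def cross_interaction(A, B):
    """All products A[:, i] * B[:, j] between two matrices' columns."""
    n = A.shape[0]
    # Broadcast to an (n, dA, dB) cube, then flatten the feature axes.
    return (A[:, :, None] * B[:, None, :]).reshape(n, -1)

M = np.array([[1., 2.],
              [3., 4.]])
print(self_interaction(M))  # columns: x1*x1, x1*x2, x2*x2
```

Applied to [S, KR], the quadratic expansion splits into self-interactions of S, self-interactions of R (pushed through K), and cross-interactions between the two, which is what makes factorization possible.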

SLIDE 51

A new abstraction: Interacted Normalized Matrix

with the following relationships:

SLIDE 52

Formal proofs of algebraic rewrite rules

SLIDE 53

Rewrite rules are extremely complex

SLIDE 54

Results: Matrix multiplication

Key: LMM = left matrix multiplication; RMM = right matrix multiplication; p = number of joined tables; q = number of sparse dimension tables

SLIDE 55

Results: time to convergence

SLIDE 56

Strengths & Weaknesses: Morpheus / MorpheusFI

Strengths:

  • Automatically rewrites any LA computation over a join’s output as LA operations over the base tables
  • Decouples factorization and execution; backend-agnostic
  • In many cases, it can dramatically improve runtime
  • MorpheusFI extends Morpheus to support quadratic feature interactions

Weaknesses:

  • It cannot be generalized to support higher-degree interactions
  • At the time of publication, only a simple heuristics-based approach to optimizing the execution plan
  • Only supports ML models that can be expressed in linear algebraic terms
SLIDE 57

Questions: Morpheus / MorpheusFI

SLIDE 58

Discussion

SLIDE 59

Commonalities

Overall objectives:

  • Empower database users to use ML from their DBMS
  • Avoid the high cost of doing “offline” ML
  • Aim for flexibility + generalizability
  • Improve efficiency

Optimization via Translation:

  • Raven: translate data and ML operations into a unified IR
  • MorpheusFI: translate ML models into LA operations


SLIDE 60

Differences

Implementation:

  • Raven: major modifications to DBMS engine
  • Rafiki: cloud application, many interfaces
  • MorpheusFI: lightweight Python library

Generalizability:

  • MorpheusFI: works for any ML model built with LA operators with linear or quadratic feature interactions
  • Raven: native support for many popular models and frameworks, and out-of-process execution for all others
  • Rafiki: only supports a limited set of models + runtimes


SLIDE 61

Differences

Inference vs Training:

  • Raven: training in the cloud, inference in the DBMS
  • Rafiki: training & inference in the cloud, with a DBMS interface


SLIDE 62

Questions & Discussion

SLIDE 63

Questions & Discussion

  • How will ML be affected by stricter data governance? How can DBs play a role?
  • What challenges remain (technical or not) for applying ML in an enterprise setting?
  • What challenges can DBs solve?
  • What's the role of DBs in ML?
  • Reasons not to invest in DBMS + ML?