CSE 291D/234 Data Systems for Machine Learning Fall 2020 Arun - PowerPoint PPT Presentation

CSE 291D/234 Data Systems for Machine Learning Fall 2020 Arun Kumar 1

About Myself 2009: Bachelors in CSE from IIT Madras, India Summers: 110F! 2009—16: MS and PhD in CS from UW-Madison PhD thesis area: Data systems for ML workloads Winters: —40F! 2016-: Asst. Prof. at UC San Diego CSE 2019-: + Asst. Prof. at UC San Diego HDSI Ahh! :) 2

My Current Research New abstractions, algorithms, and software systems to “democratize ” ML-based data analytics from a data management/systems standpoint System Efficiency Human Efficiency + Democratization = (Lower resource costs) (Higher productivity) Practical and scalable data systems for ML analytics ML/AI Inspired by relational database systems principles Data Systems Management Exploit insights from learning theory and optimization theory 3

My Current Research Research Abstract Formalize Automate Optimize Approach : + + + key steps computation grunt work execution https://adalabucsd.github.io/ 4 4

What is this course about? Why take it? 5

1. Netflix’s “spot-on” recommendations 6

How does Netflix know that? 7

Large datasets + Machine learning! Log all user behavior (views, clicks, pauses, searches, etc.) Recommender systems apply ML to TBs of data from all users and movies to deliver a tailored experience 8

2. Structured data with search results 9

How does Google know that? 10

Large datasets + Machine learning! Knowledge Base Construction (KBC) process extracts tabular/relational data from large amounts of text data 11

3. AlphaGo defeats human champion! 12

How did AlphaGo achieve that? 13

Breakthrough powered by deep learning! Deep CNNs to visually process board status in plays 14 https://www.slideshare.net/SanFengChang/mastering-the-game-of-go-with-deep-neural-networks-and-tree-search

Innumerable “enterprise” applications 15

“Domain sciences” and healthcare tech are also becoming data+ML intensive 17

Software systems for ML over large and complex datasets are now critical for digital applications in many domains 19

The Age of “Big Data”/“Data Science” 20

But what more is there to it than just taking a bunch of ML/AI courses? 21

Academic ML 101 Generalized Linear Models (GLMs); from statistics Bayesian Networks ; inspired by causal reasoning Decision Tree-based : CART, Random Forest, Gradient- Boosted Trees (GBT), etc.; inspired by symbolic logic Support Vector Machines (SVMs); inspired by psychology Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience 22

Real-World ML 101 GLMs Vast majority of ML applications Tree use off-the-shelf ML methods! learners Deep Learning 23 https://www.kaggle.com/c/kaggle-survey-2019

Real-World ML 101 Almost all of your ML / AI courses put together! :) 24

Real-World ML 101 80% of ML users’ time/effort (often more) spent on data issues! 25 https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

Real-World ML 101 “Building and managing data pipelines is typically one of the most costly pieces of a complete machine learning solution.” “Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.” https://eng.uber.com/michelangelo-machine-learning-platform/ 26 http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf

Real-World ML 101 1. System design 2. Structured ML modules 3. Software testing 4. Integrating with data infrastructure 5. Model serving 27 https://blog.insightdatascience.com/preparing-for-the-transition-to-applied-ai-8eaf53624079

CSE 291D/234 will get you to think about the data systems that power this new boom of ML/AI ML/AI Data Systems Management 1. “Data …”: How to organize, query, scale, and manage the analysis of large and complex datasets? 2. “… Systems …”: How to make the most effective use of all machine resources? 3. “… for ML” : 3.1. Source : Application’s raw data -> “ML-ready” data 3.2. Build : “ML-ready” data -> Prediction pipelines 3.3. Deploy : Productionize prediction pipelines 28

The Lifecycle of ML-based Analytics Feature Engineering Data acquisition Serving Training & Inference Data preparation Monitoring Model Selection 29

ML Systems Q: What is a Machine Learning (ML) System? ❖ A data processing system (aka data system ) for mathematically advanced data analysis operations (inferential or predictive): ❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.) ❖ High-level APIs to express ML computations over (large) datasets ❖ Execution engine to run ML computations efficiently 30

Categorizing ML Systems ❖ Orthogonal Dimensions of Categorization : 1. Scalability: In-memory libraries vs Scalable ML system (works on larger-than-memory datasets) 2. Target Workloads: General ML library vs Decision tree-oriented vs Deep learning, etc. 3. Implementation Reuse: Layered on top of scalable data system vs Custom from-scratch framework 31

Major Existing ML Systems General ML libraries: In-memory: Disk-based files: Layered on RDBMS/Spark: Cloud-native: “AutoML” platforms: Decision tree-oriented: Deep learning-oriented: 32

Data Systems Concerns in ML Key concerns in ML: Q: How do “ML Systems” relate to ML? Accuracy Runtime efficiency (sometimes) Additional key practical concerns in ML Systems: ML Systems : ML :: Computer Systems : TCS Scalability (and efficiency at scale) Long-standing Usability concerns in the Manageability DB systems Developability world! Q: How does it fit within production systems and workflows? Q: How to simplify the implementation of such systems? Q: What if the dataset is larger than single-node RAM? Can often trade off accuracy a bit to gain on the rest! Q: How are the features and models configured? 33

Conceptual System Stack Analogy Relational DB Systems ML Systems First-Order Logic Learning Theory Theory Optimization Theory Complexity Theory Program Matrix Algebra Relational Algebra Formalism Gradient Descent Program TensorFlow? SQL Specification Scikit-learn? Program Query Optimization ??? Modification Execution Parallel Relational Depends on ML Algorithm Primitives Operator Dataflows Hardware CPU, GPU, FPGA, NVM, RDMA, etc. 34

Real-World ML: Pareto Surfaces Q: Suppose you are given ad click-through prediction models A, B, C, and D with accuracies of 95%, 85%, 90%, and 85%, respectively. Which one will you pick? ❖ Real-world ML users must Q: What about now? grapple with multi-dimensional 95% A Pareto surfaces : accuracy, E Accuracy monetary cost, training time, 90% C Pareto scalability, inference latency, Frontier tool availability, interpretability, 85% D B fairness, etc. ❖ Multi-objective optimization criteria set by application $1K $10K needs / business policies. Monetary cost 35

Learning Outcomes of this Course ❖ View ML/AI algorithms as data-intensive programs and apply systems techniques to make them scalable and fast. ❖ Understand the myriad data management issues in the end- to-end ML lifecycle and how to handle them in practice. ❖ Reason about practical tradeoffs between accuracy, scalability, efficiency, usability, cost, etc. in ML applications. ❖ Think critically and objectively about research in this intersectional area and maybe identify gaps in the literature. 36

What this course is NOT about ❖ NOT a course on basics of ML, databases, or systems ❖ Sanity check! You should know what these terms mean: gradient descent, decision tree, neural network, schema, query optimization, memory hierarchy, and GPU. ❖ NOT a course on ML algorithmics ; we focus on ML systems ❖ NOT a course on how to use/apply ML algorithms or tools 37

Now for the (boring) logistics … 38

Prerequisites ❖ A course on ML algorithms , e.g., CSE 151. ❖ A course on either database systems (e.g., CSE 132C) or operating systems (e.g., CSE 120). ❖ The above courses could have been taken at UCSD or elsewhere. ❖ Industrial or substantial project experience on these topics may suffice in place of these courses. Email me if you are not sure if you satisfy the prerequisites. http://cseweb.ucsd.edu/classes/fa20/cse291-d 39

Components and Grading ❖ Quizzes : 4 x 6% = 24% ❖ Dates will be announced later; ~ 20min long each with likely 6hr time window ❖ Exams : 2 x 26% = 52% ❖ On 11/10 (Tue) and 12/12 (Sat); non-cumulative; 80min long each with 24hr time window ❖ All quizzes and exams delivered as Canvas Quizzes ❖ Paper Reviews (best 8 of 9): 8 x 3% = 24% ❖ See course homepage for more details http://cseweb.ucsd.edu/classes/fa20/cse291-d 40

Grading Scheme Hybrid of relative and absolute; grade is the better of the two Grade Relative Bin (Use strictest) Absolute Cutoff (>=) A+ Highest 10% 92 A Next 15% (10-25) 85 A- Next 15% (25-40) 80 B+ Next 15% (40-55) 75 B Next 15% (55-70) 70 B- Next 5% (70-75) 65 C+ Next 5% (75-80) 60 C Next 5% (80-85) 55 C- Next 5% (85-90) 50 D Next 5% (90-95) 45 F Lowest 5% < 45 Example: Score 82 but 43%le; Rel: B; Abs: A-; so, A- 41

CSE 291D/234 Data Systems for Machine Learning Fall 2020 Arun - PowerPoint PPT Presentation

CSE 291D/234 Data Systems for Machine Learning Fall 2020 Arun Kumar 1 About Myself 2009: Bachelors in CSE from IIT Madras, India Summers: 110F! 200916: MS and PhD in CS from UW-Madison PhD thesis area: Data systems for ML workloads

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 2: Deep Learning Systems DL

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 4: Data Sourcing and

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 3: Feature Engineering and Model

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 1: Classical ML Training at

Natural Backgrounds 1 U 238 Decay Chain U 234 U 238 92 92 4.5e9 245,500 Years Years

Relate Multiplication to Addition Multiplication Table Activity Multiply by 3 Multiply by 4

Relate Multiplication to Addition click to return to table of contents Slide 5 / 234 Relate

Relate Multiplication to Addition Multiplication Table Activity Multiply by 3 Multiply by 4

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

Welcome to CSE 506 Introduc/on & Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

A Machine Learning Approach A Machine Learning Approach A Machine Learning Approach A Machine

Machine learning for finance Nathan George Data Science Professor DataCamp Machine Learning

Adversarial Search (Game Playing) Chapter 5 Adapted from materials by Tim Finin, Marie

Welcome to CSCE 496/896: Deep Learning! Welcome to CSCE 496/896: Deep Learning! Please check

Deep Reinforcement Learning M. Soleymani Sharif University of Technology Spring 2020 Most

AI Methodology Theoretical aspects Mathematical formalizations, properties, algorithms

Deep Learning and Mixed Integer Optimization Matteo Fischetti, University of Padova 1 Designing

DS595/CS525: Reinforcement Learning --Introduction & Logistics Prof. Yanhua Li Time: 6:00pm

OOP Inheri -tance En- capsu- lation Learning to Play Tetris via Kuan-Ting Lai 2020/5/25

Classical Planning Systems Chapter 10 R&N ICS 271 Fall 2016 Outline: Planning Planning

Sambuz

Useful Links

Newsletter

Mail Us

CSE 291D/234 Data Systems for Machine Learning Fall 2020 Arun - PowerPoint PPT Presentation

CSE 291D/234 Data Systems for Machine Learning Fall 2020 Arun Kumar 1 About Myself 2009: Bachelors in CSE from IIT Madras, India Summers: 110F! 200916: MS and PhD in CS from UW-Madison PhD thesis area: Data systems for ML workloads

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 2: Deep Learning Systems DL

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 4: Data Sourcing and

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 3: Feature Engineering and Model

CSE 291D/234 Data Systems for Machine Learning Arun Kumar Topic 1: Classical ML Training at

Natural Backgrounds 1 U 238 Decay Chain U 234 U 238 92 92 4.5e9 245,500 Years Years

Relate Multiplication to Addition Multiplication Table Activity Multiply by 3 Multiply by 4

Relate Multiplication to Addition click to return to table of contents Slide 5 / 234 Relate

Relate Multiplication to Addition Multiplication Table Activity Multiply by 3 Multiply by 4

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

Welcome to CSE 506 Introduc/on &amp; Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

A Machine Learning Approach A Machine Learning Approach A Machine Learning Approach A Machine

Machine learning for finance Nathan George Data Science Professor DataCamp Machine Learning

Adversarial Search (Game Playing) Chapter 5 Adapted from materials by Tim Finin, Marie

Welcome to CSCE 496/896: Deep Learning! Welcome to CSCE 496/896: Deep Learning! Please check

Deep Reinforcement Learning M. Soleymani Sharif University of Technology Spring 2020 Most

AI Methodology Theoretical aspects Mathematical formalizations, properties, algorithms

Deep Learning and Mixed Integer Optimization Matteo Fischetti, University of Padova 1 Designing

DS595/CS525: Reinforcement Learning --Introduction &amp; Logistics Prof. Yanhua Li Time: 6:00pm

OOP Inheri -tance En- capsu- lation Learning to Play Tetris via Kuan-Ting Lai 2020/5/25

Classical Planning Systems Chapter 10 R&amp;N ICS 271 Fall 2016 Outline: Planning Planning

Sambuz

Useful Links

Newsletter

Mail Us

Welcome to CSE 506 Introduc/on & Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

DS595/CS525: Reinforcement Learning --Introduction & Logistics Prof. Yanhua Li Time: 6:00pm

Classical Planning Systems Chapter 10 R&N ICS 271 Fall 2016 Outline: Planning Planning