
CSE 232A Database System Implementation, Arun Kumar, Topic 8: Data Systems for ML Workloads



  1. CSE 232A Database System Implementation
     Arun Kumar
     Topic 8: Data Systems for ML Workloads
     Book: “Data Management in ML Systems” by Morgan & Claypool Publishing

  2. “Big Data” Systems
     ❖ Parallel RDBMSs and Cloud-Native RDBMSs
     ❖ Beyond RDBMSs: A Brief History
     ❖ “Big Data” Systems
     ❖ The MapReduce/Hadoop Craze
     ❖ Spark and Other Dataflow Systems
     ❖ Key-Value NoSQL Systems
     ❖ Graph Processing Systems
     ❖ Advanced Analytics/ML Systems

  3. Lifecycle/Tasks of ML-based Analytics
     [Figure: lifecycle diagram with stages labeled Data acquisition, Data preparation, Feature Engineering, Model Selection, Training, Inference, Monitoring]

  4. ML 101: Popular Forms of ML
     ❖ Generalized Linear Models (GLMs); from statistics
     ❖ Bayesian Networks; inspired by causal reasoning
     ❖ Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
     ❖ Support Vector Machines (SVMs); inspired by psychology
     ❖ Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience

  5. Advanced Analytics/ML Systems
     Q: What is a Machine Learning (ML) System?
     ❖ A data processing system (aka data system) for mathematically advanced data analysis ops (inferential or predictive), i.e., beyond just SQL aggregates
     ❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
     ❖ High-level APIs for expressing statistical/ML/DL computations over large datasets

  6. Data Management Concerns in ML
     Q: How do “ML Systems” relate to ML?
     ML Systems : ML :: Computer Systems : TCS
     ❖ Key concerns in ML:
       Accuracy
       Runtime efficiency (sometimes)
     ❖ Additional key practical concerns in ML Systems (long-standing concerns in the DB systems world!):
       Scalability (and efficiency at scale). Q: What if the dataset is larger than single-node RAM?
       Usability. Q: How are the features and models configured?
       Manageability. Q: How does it fit within production systems and workflows?
       Developability. Q: How to simplify the implementation of such systems?
     ❖ Can often trade off accuracy a bit to gain on the rest!

  7. Conceptual System Stack Analogy
     Layer: Relational DB Systems / ML Systems
     ❖ Theory: First-Order Logic, Complexity Theory / Learning Theory, Optimization Theory
     ❖ Program Formalism: Relational Algebra / Matrix Algebra, Gradient Descent
     ❖ Program Specification: Declarative Query Language / TensorFlow? R? Scikit-learn?
     ❖ Program Modification: Query Optimization / ???
     ❖ Execution Primitives: Parallel Relational Operator Dataflows / Depends on ML Algorithm
     ❖ Hardware (both): CPU, GPU, FPGA, NVM, RDMA, etc.

  8. Categorizing ML Systems
     ❖ Orthogonal Dimensions of Categorization:
       1. Scalability: In-memory libraries vs Scalable ML system (works on larger-than-memory datasets)
       2. Target Workloads: General ML library vs Decision tree-oriented vs Deep learning, etc.
       3. Implementation Reuse: Layered on top of scalable data system vs Custom from-scratch framework
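The scalability dimension above can be illustrated with a small sketch. It is not taken from the slides: the helper names `read_chunks` and `squared_error_sum`, the toy CSV data, and the one-parameter model `y ≈ w * x` are all hypothetical. The idea it shows is that a statistic such as a loss can be accumulated chunk by chunk, so only one chunk needs to sit in memory at a time.

```python
import csv
import io
from itertools import islice

def read_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def squared_error_sum(w, rows, chunk_size=2):
    """Sum of (y - w*x)^2 over all rows, holding one chunk in memory at a time."""
    total = 0.0
    for chunk in read_chunks(rows, chunk_size):
        total += sum((float(y) - w * float(x)) ** 2 for x, y in chunk)
    return total

# Toy data standing in for a file too large for RAM (hypothetical).
csv_text = "1,2\n2,4\n3,6\n4,8\n5,10\n"
total = squared_error_sum(2.0, csv.reader(io.StringIO(csv_text)))
```

An in-memory library would instead load every row before computing; a scalable system has to arrange its computation around this kind of streaming or partitioned access.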

  9. Major Existing ML Systems
     ❖ General ML libraries:
       In-memory:
       Disk-based files:
       Layered on RDBMS/Spark:
       Cloud-native:
       “AutoML” platforms:
     ❖ Decision tree-oriented:
     ❖ Deep learning-oriented:

  10. ML as Numeric Optimization
      ❖ Recall that an ML model is a parametric function: $f : D_W \times D_X \to D_Y$
      ❖ Training: Process of fitting model parameters from data
      ❖ Training can be expressed in this form for many ML models; aka “empirical risk minimization” (ERM):
        $L(W) = \sum_{i=1}^{n} l(y_i, f(W, x_i))$
        where $(x_i, y_i)$ is a training example and $l()$ is aka the “loss” function
      ❖ $l()$ is a differentiable function; can be compositions
      ❖ GLMs, linear SVMs, and ANNs fit the above template
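The ERM template above can be sketched in a few lines of plain Python. This is an illustrative assumption, not code from the slides: it picks a linear model for `f`, the logistic loss for `l` with labels in {-1, +1}, batch gradient descent as the optimizer, and a made-up four-example dataset. The names `grad_L` and `train`, the learning rate, and the step count are likewise hypothetical.

```python
import math

def f(W, x):
    """Parametric model f(W, x): here a linear score (dot product)."""
    return sum(w * xj for w, xj in zip(W, x))

def l(y, score):
    """Logistic loss for a label y in {-1, +1}; differentiable in score."""
    return math.log1p(math.exp(-y * score))

def L(W, data):
    """Empirical risk: sum of per-example losses l(y_i, f(W, x_i))."""
    return sum(l(y, f(W, x)) for x, y in data)

def grad_L(W, data):
    """Gradient of L with respect to W, via the chain rule through f."""
    g = [0.0] * len(W)
    for x, y in data:
        # d/ds log(1 + exp(-y*s)) = -y / (1 + exp(y*s))
        coeff = -y / (1.0 + math.exp(y * f(W, x)))
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return g

def train(data, dim, lr=0.1, steps=200):
    """Batch gradient descent starting from W = 0."""
    W = [0.0] * dim
    for _ in range(steps):
        g = grad_L(W, data)
        W = [w - lr * gj for w, gj in zip(W, g)]
    return W

# Toy examples (x_i, y_i); the first feature is a constant bias term.
data = [([1.0, 2.0], 1), ([1.0, -1.5], -1), ([1.0, 0.5], 1), ([1.0, -0.5], -1)]
W = train(data, dim=2)
```

Swapping in a different `f` or `l` changes the model family while the training loop stays the same, which is one way to read the claim that GLMs, linear SVMs, and ANNs fit a common template.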
