CSE 291D/234 Data Systems for Machine Learning
Arun Kumar

Topic 1: Classical ML Training at Scale
(Chapters 2, 5, and 6 of the MLSys book)

Academic ML 101

❖ Generalized Linear Models (GLMs); from statistics
❖ Bayesian Networks; inspired by causal reasoning
❖ Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
❖ Support Vector Machines (SVMs); inspired by psychology
❖ Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
"Classical" ML

Real-World ML 101

https://www.kaggle.com/c/kaggle-survey-2019

[Chart from the Kaggle 2019 survey: deep learning, GLMs, and tree learners are among the most commonly used model families in practice.]

Scalable ML Training in the Lifecycle

Data acquisition → Data preparation → Feature Engineering → Training & Inference → Model Selection → Serving → Monitoring


Scalable ML Training in the Big Picture

ML Systems

Q: What is a Machine Learning (ML) System?

❖ A data processing system (aka data system) for mathematically advanced data analysis operations (inferential or predictive):
❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
❖ High-level APIs to express ML computations over (large) datasets
❖ Execution engine to run ML computations efficiently and in a scalable manner


But what exactly does it mean for an ML system to be “scalable”?

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Background: Memory Hierarchy

❖ Cache: ~MBs capacity; ~$2/MB; ~100GB/s access speed
❖ Main Memory (DRAM): ~10GBs; ~$5/GB; ~10GB/s; ~100s of access cycles
❖ Flash Storage (SSD): ~TBs; ~$200/TB; ~GB/s; ~10^5 to 10^6 access cycles
❖ Magnetic Hard Disk Drive (HDD): ~10TBs; ~$40/TB; ~200MB/s; ~10^7 to 10^8 access cycles

Memory Hierarchy in Action

Q: What does this program do when run with 'python'? (Assume tmp.csv is in the current working directory)

tmp.py:
    import pandas as p
    m = p.read_csv('tmp.csv', header=None)
    s = m.sum().sum()
    print(s)

tmp.csv:
    1,2,3
    4,5,6

Memory Hierarchy in Action

[Diagram: Processor (Registers, CU, ALU, Caches) connected over the Bus to DRAM and Disk; data is retrieved, processed, and stored across these levels.]

Rough sequence of events when the program is executed:
❖ I/O for code: tmp.py is read from disk and the commands are interpreted
❖ I/O for data: tmp.csv is read from disk into DRAM
❖ Computations are done by the Processor
❖ I/O for display: '21' is printed to the monitor

Q: What if this does not fit in DRAM?

Scalable ML Systems

❖ ML systems that do not require the (training) dataset to fit entirely in main memory (DRAM) of one node
❖ Conversely, if the system thrashes when the data file does not fit in RAM, it is not scalable

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

Scalable ML Systems

4 main approaches to scale ML to large data:
❖ Single-node disk: Paged access from a file on local disk
❖ Remote read: Paged access from disk(s) over the network
❖ Distributed memory: Dataset fits in a cluster's total DRAM
❖ Distributed disk: Dataset fits on a cluster's full set of disks

Evolution of Scalable ML Systems

[Timeline: 1980s; mid 1990s; late 1990s to mid 2000s; late 2000s to early 2010s; mid 2010s onward. Stages include In-RDBMS ML Systems, ML System Abstractions, ML on Dataflow Systems, Parameter Server, Cloud ML, and Deep Learning Systems, driven by concerns of Scalability, Manageability, and Developability.]

Major Existing ML Systems

❖ General ML libraries
❖ Disk-based files
❖ In-memory
❖ Layered on RDBMS/Spark
❖ Cloud-native
❖ "AutoML" platforms
❖ Decision tree-oriented
❖ Deep learning-oriented
(Example systems in each category were shown as logos on the slide.)

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

ML Algorithm = Program Over Data

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

❖ To scale an ML program's computations, split them up to operate over "chunks" of data at a time
❖ How to split up an ML program this way can be non-trivial!
❖ Depends on the data access pattern of the algorithm
❖ A large class of ML algorithms do just sequential scans for iterative numerical optimization

Numerical Optimization in ML

❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem
❖ E.g., logistic and linear regression, linear SVM, etc.
❖ Aka "Empirical Risk Minimization" (ERM):

w* = argmin_w Σ_{i=1}^{n} l(y_i, f(w, x_i))

❖ GLMs define hyperplanes and use an f() that is a scalar function of the distance w^T x_i

[Table D: label column Y and feature columns X1, X2, X3, with rows (1b, 1c, 1d), (2b, 2c, 2d), (3b, 3c, 3d), (4b, 4c, 4d), ...]

Batch Gradient Descent for ML

❖ For many ML models, the loss function l() is convex; so is L(w) = Σ_{i=1}^{n} l(y_i, f(w, x_i))
❖ But closed-form minimization is typically infeasible
❖ Batch Gradient Descent (BGD): an iterative numerical procedure to find an optimal w
❖ Initialize w to some value w^(0)
❖ Compute the gradient: ∇L(w^(k)) = Σ_{i=1}^{n} ∇l(y_i, f(w^(k), x_i))
❖ Descend along the gradient (aka the Update Rule): w^(k+1) ← w^(k) − η ∇L(w^(k))
❖ Repeat until we get close to w*, aka convergence

Batch Gradient Descent for ML

[Figure: 1-D illustration of BGD on the loss surface L(w). Starting from w^(0), each step follows the gradient: w^(1) ← w^(0) − η ∇L(w^(0)), then w^(2) ← w^(1) − η ∇L(w^(1)), and so on toward the optimum w*.]

❖ The learning rate η is a hyper-parameter selected by the user or by "AutoML" tuning procedures
❖ The number of iterations/epochs of BGD is also a hyper-parameter
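
To make the procedure concrete before we turn to scaling, here is a minimal in-memory NumPy sketch of BGD for logistic regression (one common ERM instance); the tiny dataset, learning rate, and epoch count are illustrative assumptions, not from the slides.

import numpy as np

def bgd_logistic(X, y, lr=0.1, n_epochs=100):
    w = np.zeros(X.shape[1])              # w^(0): initialize the model
    for _ in range(n_epochs):             # number of epochs is a hyper-parameter
        p = 1.0 / (1.0 + np.exp(-X @ w))  # f(w, x_i) for all examples
        grad = X.T @ (p - y)              # sum of per-example gradients = full-batch gradient
        w -= lr * grad                    # update rule: w^(k+1) = w^(k) - eta * grad L(w^(k))
    return w

# Tiny illustrative dataset
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
print(bgd_logistic(X, y))
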
Data Access Pattern of BGD at Scale

∇L(w^(k)) = Σ_{i=1}^{n} ∇l(y_i, f(w^(k), x_i))

❖ The data-intensive computation in BGD is the gradient
❖ In scalable ML, dataset D may not fit in DRAM
❖ Model w is typically small and DRAM-resident
❖ The gradient is like a SQL SUM over vectors (one per example)
❖ At each epoch, 1 filescan over D suffices to get the gradient
❖ The update of w happens normally in DRAM
❖ Monitoring across epochs is needed to check convergence
❖ The loss function L() is also just a SUM, computed in a similar manner

Q: What SQL op is this reminiscent of?

Scaling BGD to Disk

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

[Figure: the process wants to read the file's pages P1, P2, ..., P6 from disk one by one and then discard them, aka the "filescan" access pattern. Suppose the OS cache in DRAM can hold 4 pages of the file: after reading P1 through P4 the cache is full, so older pages are replaced (evict P1, evict P2, ...) as P5 and P6 are read.]

Scaling BGD to Disk

❖ Sequential scan to read pages from disk to DRAM
❖ Modern DRAM sizes can be 10s of GBs; so we read a "chunk" of the file at a time (say, 1000s of pages)
❖ Compute a partial gradient on each chunk and add them all up:

∇L(w^(k)) = Σ_{i=1}^{n} ∇l(y_i, f(w^(k), x_i)) = ∇L1(w^(k)) + ∇L2(w^(k)) + ...

where ∇Lj is the partial gradient over the examples on chunk j of D.

[Figure: the table D (Y, X1, X2, X3) is stored on data pages, e.g., one page holding rows (0,1b,1c,1d) and (1,2b,2c,2d), the next holding (1,3b,3c,3d), etc.; each chunk of pages yields one partial gradient.]
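
A minimal sketch of this chunked filescan, assuming the dataset is a CSV with the label in the first column and using pandas' chunksize reader; one pass over the chunks accumulates the partial gradients ∇L1, ∇L2, ... of a logistic-regression loss. The file name and hyper-parameters are illustrative.

import numpy as np
import pandas as pd

def bgd_epoch_on_disk(csv_path, w, lr=0.1, chunk_rows=100_000):
    grad = np.zeros_like(w)
    # Each chunk is a Pandas DataFrame small enough to fit in DRAM.
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunk_rows):
        y = chunk.iloc[:, 0].to_numpy()
        X = chunk.iloc[:, 1:].to_numpy()
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad += X.T @ (p - y)       # partial gradient for this chunk
    return w - lr * grad            # the model update itself happens in DRAM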

DaskML's Scalable DataFrame

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

❖ A Dask DataFrame scales to on-disk files by splitting them into a bunch of Pandas DataFrames under the covers
❖ The Dask API is a "wrapper" around the Pandas API that scales ops to the splits and puts all the results together

https://docs.dask.org/en/latest/dataframe.html
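
A small example of the wrapper idea: the call below looks like Pandas, but Dask splits the (possibly larger-than-RAM) file into many partitions and combines the per-partition results. The file name, column name, and block size are illustrative assumptions.

import dask.dataframe as dd

# One logical DataFrame backed by many Pandas DataFrames (one per ~64MB block).
df = dd.read_csv('features.csv', blocksize='64MB')
total = df['x1'].sum().compute()   # per-partition sums are computed, then combined
print(total)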

Scaling with Remote Reads

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

❖ Similar to scaling to disk, but instead read pages/chunks over the network from remote disk(s) (e.g., from S3)
❖ Good in practice for a one-shot filescan access pattern
❖ For iterative ML, this means repeated reads over the network
❖ Can combine with caching on local disk / DRAM
❖ Increasingly popular for cloud-native ML workloads

Stochastic Gradient Descent for ML

❖ Two key cons of BGD:
❖ Slow to converge to the optimal (too many epochs)
❖ Costly full scan of D for each update of w
❖ Stochastic GD (SGD) mitigates both issues
❖ Basic Idea: Use a sample (called a mini-batch) of D to approximate the gradient instead of the "full batch" gradient
❖ Without-replacement sampling: randomly shuffle D before each epoch; one pass = a sequence of mini-batches
❖ SGD works well for non-convex loss functions too, unlike BGD; it is the "workhorse" of scalable ML

Data Access Pattern of Scalable SGD

Update rule: W^(t+1) ← W^(t) − η ∇̃L(W^(t)), where the approximate gradient ∇̃L is computed on a mini-batch B sampled from D without replacement:

∇̃L(W^(t)) = Σ_{(y_i, x_i) ∈ B ⊂ D} ∇l(y_i, f(W^(t), x_i))

[Figure: the original dataset is first given a random "shuffle" (e.g., ORDER BY RAND()). Epoch 1 is a sequential scan over the randomized dataset, processed as Mini-batch 1, Mini-batch 2, Mini-batch 3, ..., with a model update after each one: W^(0) → W^(1) → W^(2) → W^(3). Epoch 2 is another sequential scan (optionally after a new random shuffle), continuing W^(3) → W^(4) → ...]

Data Access Pattern of Scalable SGD

❖ Mini-batch gradient computations: 1 filescan per epoch
❖ The update of w happens in DRAM
❖ During the filescan, count the number of examples seen and update per mini-batch
❖ Typical mini-batch sizes: 10s to 1000s
❖ Orders of magnitude more updates than BGD!
❖ A random shuffle is not trivial to scale; it requires an "external merge sort" (roughly 2 scans of the file)
❖ ML practitioners often shuffle the dataset only once up front; good enough in most cases in practice
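
A minimal sketch of one SGD epoch as a single filescan, reusing the chunked reader from the BGD sketch above; it assumes the file was already shuffled once up front and that the label is in the first column. Mini-batch size and learning rate are illustrative.

import numpy as np
import pandas as pd

def sgd_epoch_on_disk(csv_path, w, lr=0.01, chunk_rows=100_000, batch_size=128):
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunk_rows):
        y = chunk.iloc[:, 0].to_numpy()
        X = chunk.iloc[:, 1:].to_numpy()
        for start in range(0, len(y), batch_size):       # many updates per filescan
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-Xb @ w))
            w = w - lr * (Xb.T @ (p - yb))               # one mini-batch update of w in DRAM
    return w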


Handling pages directly is so low-level! Is there a higher-level way to scale ML?

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
    ❖ In-RDBMS ML
    ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability


RDBMS Scales Queries over Data

❖ Mature software systems to scale to larger-than-RAM data

RDBMS User-Defined Aggregates

❖ Most industrial and open source RDBMSs allow extensibility of their SQL dialect with User-Defined Functions (UDFs)
❖ A User-Defined Aggregate (UDA) is a UDF API to specify custom aggregates over the whole dataset

Example with SQL AVG; the agg. state (S, C) is (partial sum, partial count):
❖ Initialize: Start by setting up the "agg. state" in DRAM
❖ Transition: the RDBMS gives a tuple from the table; update the agg. state: (S, C) ← (S, C) + (v_i, 1)
❖ Merge (Optional: in a parallel RDBMS, combine the agg. states of the workers): (S', C') ← Σ_{worker j} (S_j, C_j)
❖ Finalize: Post-process the agg. state and return the result: return S'/C'
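
A minimal Python sketch of the four-function UDA contract for the AVG example above; real RDBMSs expose the same hooks through their own UDA/extension APIs (e.g., PostgreSQL's CREATE AGGREGATE), so the class below only illustrates the control flow.

class AvgUDA:
    def initialize(self):
        self.s, self.c = 0.0, 0          # agg. state in DRAM: (partial sum, partial count)
    def transition(self, v):
        self.s += v                      # called once per tuple handed over by the RDBMS
        self.c += 1
    def merge(self, other):
        self.s += other.s                # parallel RDBMS: combine workers' agg. states
        self.c += other.c
    def finalize(self):
        return self.s / self.c           # post-process agg. state and return the result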

Bismarck: SGD as RDBMS UDA

❖ An SGD epoch is implemented as a UDA for in-RDBMS execution
❖ The data-intensive computation is scaled automatically by the RDBMS
❖ Commands for shuffling, running multiple epochs, checking convergence, and validation/test error measurements are issued from an external controller written in Python

❖ Initialize: Allocate memory for the model W(t) and the mini-batch gradient stats
❖ Transition: Given a tuple with (y, x), compute the gradient and update the stats; if the mini-batch limit is hit, update the model and reset the stats
❖ Merge (Optional: applies only to parallel RDBMSs): "Combine" the model parameters from independent workers
❖ Finalize: Return the model parameters
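
A rough sketch of the same four hooks instantiated for an SGD epoch over a logistic-regression model, in the spirit of Bismarck; the mini-batch bookkeeping and the weighted model-averaging Merge are illustrative simplifications, not the paper's exact code.

import numpy as np

class SGDEpochUDA:
    def initialize(self, d, lr=0.01, batch_size=128):
        self.w = np.zeros(d)              # model kept in DRAM as agg. state
        self.grad = np.zeros(d)           # mini-batch gradient stats
        self.seen, self.lr, self.batch_size = 0, lr, batch_size
        self.n_workers = 1
    def transition(self, y, x):
        p = 1.0 / (1.0 + np.exp(-x @ self.w))
        self.grad += (p - y) * x          # per-tuple gradient contribution
        self.seen += 1
        if self.seen == self.batch_size:  # mini-batch limit hit: update model, reset stats
            self.w -= self.lr * self.grad
            self.grad[:] = 0.0
            self.seen = 0
    def merge(self, other):
        # "Model Averaging" heuristic across independent workers (see the next slides)
        total = self.n_workers + other.n_workers
        self.w = (self.n_workers * self.w + other.n_workers * other.w) / total
        self.n_workers = total
    def finalize(self):
        return self.w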

Programming ML as RDBMS UDAs

❖ Many ML algorithms fit within Bismarck's template
❖ The Transition function is where the bulk of the ML logic is; just a few lines of code for SGD updates

Distributed SGD via RDBMS UDA

Q: How does the RDBMS parallelize the (SGD) UDA?

❖ Data is pre-sharded across workers in parallel RDBMSs
❖ Initialize and Transition run independently on each shard
❖ Merge "combines" the model params from the workers; this is tricky since an SGD epoch is not an algebraic agg. like SUM/AVG
❖ Common heuristic: "Model Averaging"

[Figure: Workers 1 through n each produce a local model W^(t)_1, ..., W^(t)_n on their shard; the Master averages them: W^(t) = (1/n) Σ_{i=1}^{n} W^(t)_i]

❖ Affects convergence
❖ Works OK for GLMs

Bottlenecks of RDBMS SGD UDA

❖ Model Averaging for distributed SGD has poor convergence for non-convex/ANN models
❖ Too many epochs, typically poor ML accuracy
❖ Model sizes can be too large (even 10s of GBs) for the UDA's aggregation state
❖ The UDA's Merge step is a choke point at scale (100s of workers)
❖ Bulk Synchronous Parallelism (BSP) of parallel RDBMSs forces all workers to wait at that barrier

The MADlib Library

❖ A decade-old library of scalable statistical and ML procedures on PostgreSQL and Greenplum (a parallel RDBMS)
❖ Many procedures are UDAs; some are written in pure SQL
❖ All can be invoked from the SQL console
❖ The RDBMS can be used for ETL (extract, transform, load) of features

https://madlib.apache.org/docs/latest/index.html

Tradeoffs of In-RDBMS ML

Pros:
❖ Usability: Some data analysts like ML from the SQL console
❖ Manageability: In-situ data governance and security/auth. of RDBMSs
❖ Efficiency: Faster in some cases
❖ Scalability: Massively parallel processing of RDBMSs like Greenplum
❖ Developability: SQL-based ETL

Cons:
❖ Usability: Most ML users want the full flexibility of a Python console
❖ Manageability: Many ML users want more hands-on data access in Jupyter notebooks
❖ Efficiency: Typically slower due to API restrictions; custom ML systems are typically faster; interference with OLTP workloads
❖ Scalability: BSP is a bottleneck for 100+ nodes (asynchrony needed)
❖ Developability: Unnatural APIs to write ML

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
    ❖ In-RDBMS ML
    ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

The MapReduce Abstraction

❖ A programming model for writing programs on sharded data + a distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ The developer only implements the logic of Map and Reduce
❖ The system implementation handles the orchestration of data distribution, parallelization, etc. under the covers

The MapReduce Abstraction

❖ Standard example: count word occurrences in a doc corpus
❖ Input: A set of text documents (say, webpages)
❖ Output: A dictionary of unique words and their counts
Hmmm, sounds suspiciously familiar … :)

Part of the MapReduce API (pseudocode):

function map(String docname, String doctext):
    for each word w in doctext:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit(word, sum)

How MapReduce Works

Parallel flow of control and data during MapReduce execution:
❖ Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
❖ Fault tolerance is achieved using data replication

What is Hadoop then?

❖ An open-source system implementation with MapReduce as its prog. model and HDFS as its distr. filesystem
❖ Map() and Reduce() functions in the API; input splitting, data distribution, shuffling, fault tolerance, etc. are all handled by the Hadoop library under the covers
❖ Mostly superseded by the Spark ecosystem these days, although HDFS is still the base

Abstract Semantics of MapReduce

❖ Map(): Operates independently on one "record" at a time
❖ Can batch multiple data examples into one record
❖ Dependencies across Mappers are not allowed
❖ Can emit 1 or more key-value pairs as output
❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list)
❖ An aggregation function is applied on the Iterator to produce the final output
❖ Input Split:
❖ A physical-level split/shard of the dataset that batches multiple examples into one file "block" (~128MB default on HDFS)
❖ Custom Input Splits can be written by the application user

Benefits of MapReduce

❖ Goal: A higher-level abstraction for parallel data processing
❖ Key Benefits:
❖ Out-of-the-box scalability and fault tolerance
❖ Map() and Reduce() can be highly general; no restrictions on data types; easier ETL
❖ Free and OSS (Hadoop)
❖ New burden on users: Cast the data-intensive computation into the Map() + Reduce() API
❖ But libraries exist in many PLs to mitigate coding pains: Java, C++, Python, R, Scala, etc.

Emulate MapReduce in SQL?

Q: How would you do the word counting in an RDBMS / in SQL?

❖ First step: Transform the text docs into relations and load them; part of the Extract-Transform-Load (ETL) stage. Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!

SELECT Word, COUNT(*)
FROM DocWords
GROUP BY Word
[ORDER BY Word]

Parallelism, scaling, etc. are done by the RDBMS under the covers

RDBMS UDA vs MapReduce

❖ Aggregation state: the data structure computed (independently) by workers and unified by the master
❖ Initialize: Set up info./initialize RAM for the agg. state; runs independently on each worker
❖ Transition: Per-tuple function run by a worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge: Function that combines agg. states from the workers; run by the master after the workers are done; analogous to Reduce()
❖ Finalize: Run once at the end by the master to return the final result

BGD via MapReduce

❖ Assume the data is sharded; a map partition has many examples
❖ The initial model W(t) is read from a file on HDFS by each mapper

function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    For each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
    emit(G)

function reduce(Iterator partialGradients):
    FullG = 0
    for each G in partialGradients:
        FullG += G
    emit(FullG)

SGD (Averaging) via MapReduce

❖ Similar to BGD, but with Model Averaging across the map partitions

function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    For each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
        Wj = Update W(t) with G
    emit(Wj)

function reduce(Iterator workerWeights):
    AvgW = 0
    for each Wj in workerWeights:
        AvgW += Wj
    AvgW /= workerWeights.length()
    emit(AvgW)

Apache Spark

❖ Extended dataflow programming model to subsume MapReduce and most relational operators
❖ Inspired by the Python Pandas style of function calls for ops
❖ Unified system to handle relations, text, etc.; supports more general distributed data processing
❖ Uses distributed memory for caching; faster than Hadoop
❖ New fault tolerance mechanism using lineage, not replication
❖ From UC Berkeley AMPLab; commercialized as Databricks

Spark's Dataflow Programming Model

❖ Transformations are relational ops, MR, etc. expressed as functions
❖ Actions are what force computation; aka lazy evaluation

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf


Word Count Example in Spark

Spark RDD API available in Python, Scala, Java, and R
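
The code image on this slide did not survive extraction; below is a standard PySpark RDD word count of the kind the slide shows. The HDFS path is illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("hdfs:///docs/*.txt")           # RDD of lines
            .flatMap(lambda line: line.split())        # like Map: emit each word
            .map(lambda w: (w, 1))                     # key-value pairs
            .reduceByKey(lambda a, b: a + b))          # like Reduce: sum counts per word
print(counts.take(10))                                 # action: forces the lazy computation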

Programming ML in MapReduce/Spark

❖ All ML procedures that can be cast as RDBMS UDAs can be cast to the MapReduce API of Hadoop or the Spark RDD APIs
❖ MapReduce is just easier to use for most developers
❖ Spark integrates better with the PyData ecosystem

https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf

❖ Apache Mahout is a library of classical ML algorithms written as MapReduce programs for Hadoop; it was expanded later into a DSL
❖ SparkML is a library of classical ML algorithms written using the Spark RDD API; common in enterprises for scalable ML
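
For a sense of what SparkML code looks like, here is a minimal logistic-regression training job using the newer DataFrame-based spark.ml API (the older RDD-based spark.mllib API is similar); the input path and hyper-parameters are illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ScalableLR").getOrCreate()
train = spark.read.format("libsvm").load("hdfs:///data/train.libsvm")  # sharded input
lr = LogisticRegression(maxIter=20, regParam=0.01)
model = lr.fit(train)             # distributed training handled under the covers
print(model.coefficients)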

Tradeoffs of ML on Dataflow Systems

Pros:
❖ Usability: SparkML integrates well with Python stacks
❖ Manageability: In-situ data governance and security/auth. of "data lakes"
❖ Efficiency: Comparable to in-RDBMS ML; no OLTP interference
❖ Scalability: Massively parallel processing
❖ Developability: SQL- & MR-based ETL; less code to write ML

Cons:
❖ Usability: Not all ML algorithms are scaled
❖ Manageability: Many ML users may not be familiar with Spark data ETL
❖ Efficiency: Custom ML systems are typically faster still
❖ Scalability: BSP is a bottleneck for 100+ nodes (asynchrony needed)
❖ Developability: Still somewhat unnatural APIs to write ML

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
    ❖ Parameter Server
    ❖ GraphLab
    ❖ XGBoost
❖ Advanced Issues in ML Scalability

Parameter Server for Distributed SGD

❖ Recall the bottlenecks of Model Averaging-based SGD in an RDBMS UDA or with MapReduce:
❖ BSP becomes a choke point (the Merge / Reduce stage)
❖ Often poor convergence, especially for non-convex models
❖ Hard to handle large models
❖ The Parameter Server (PS) mitigates all these issues:
❖ Breaks the synchronization barrier for merging: allows asynchronous updates from workers to the master
❖ Flexible communication frequency: can send updates at every mini-batch or after a set of a few mini-batches

Parameter Server for Distributed SGD

[Figure: Workers 1 through n each compute mini-batch gradients ∇̃L(W^(t)) on their shards and exchange them with a multi-server "master" (PS 1, PS 2, ..., PS k), where each server manages a part of W(t). Workers Push / Pull when ready/needed; there is no synchronization across workers or servers. Workers send gradients to the master for updates at each mini-batch (or at a lower frequency).]

❖ Model params may get out-of-sync or stale; but SGD turns out to be remarkably robust, and the many updates per epoch really help
❖ The communication cost per epoch is higher (per mini-batch)

Programming ML using PS

❖ Designed mainly for sparse feature vectors/updates
❖ Easy to parallelize updates to the model params
❖ The ML developer recasts the ML procedure into two parts: worker-side updates and server-side aggregation
❖ Loosely analogous to Map and Reduce, respectively
❖ But more complex due to flexible update schedules
❖ Supports 3 consistency models for the staleness of updates, with different tradeoffs on efficiency vs accuracy
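
A toy single-process sketch of the worker-side / server-side split: "servers" hold shards of W, and workers pull (possibly stale) parameters and push mini-batch gradients without any barrier. This only illustrates the control flow, not a real asynchronous implementation.

import numpy as np

class ToyParameterServer:
    def __init__(self, d, n_shards=2, lr=0.01):
        self.shards = np.array_split(np.zeros(d), n_shards)   # each server owns a part of W
        self.lr = lr
    def pull(self):
        return np.concatenate(self.shards)                    # workers fetch current params
    def push(self, grad):
        # Apply a (possibly stale) gradient immediately; no waiting for other workers.
        for shard, g in zip(self.shards, np.array_split(grad, len(self.shards))):
            shard -= self.lr * g

def worker_step(ps, Xb, yb):
    w = ps.pull()                                # pull
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    ps.push(Xb.T @ (p - yb))                     # push the mini-batch gradient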

Systems-level Advances in PS

❖ Workers and Servers can both be multi-node/multi-process; fine-grained task scheduling on nodes
❖ Range-based partitioning of params across the servers
❖ Timestamped and compressed updates are exchanged
❖ Replication and fault tolerance

Tradeoffs of Parameter Server

Pros:
❖ Usability: Supports billions of features and params
❖ Manageability: In-built fault tolerance
❖ Efficiency: Faster than in-RDBMS ML and ML-on-Dataflow
❖ Scalability: Highest; can work with 1000s of nodes
❖ Developability: Abstracts away many systems scaling issues

Cons:
❖ Usability: Not reproducible; not well integrated with ETL stacks
❖ Manageability: TensorFlow offers it natively; otherwise, hard to operate/govern
❖ Efficiency: Not suitable for dense updates; high comm. cost
❖ Scalability: Not suitable for smaller scales due to overheads
❖ Developability: Reasoning about ML (in)consistency is hard

Sample of Your Strong Points on PS

❖ "Abstracts away most system complexity such as asynchronous communication between nodes, fault tolerance, and data replication"
❖ "Allows addition and deletion of nodes to the framework without any restarts by means of a consistent hash ring with replication of data"
❖ "Relaxing consistency constraints gives the algorithm designer flexibility in defining different consistency models (going from sequential to eventual) by balancing system efficiency and algorithm convergence"
❖ "Outperforms other open source systems in Sparse Logistic Regression"
❖ "Range construct, while simple, seems fundamental to efficient communication and thus fault tolerance"
❖ "Generality. Both supervised learning and unsupervised learning"
❖ "Fault tolerance is crucial … in an unreliable environment like the cloud where the machines can fail or jobs can be preempted"

Sample of Your Weak Points on PS

❖ "Does not show where or when the bottlenecks of synchronization appear"
❖ "Could lead to a longer amount of time for an ML algorithm model to train if the model is sensitive to inconsistent data"
❖ "Does not specify how to set a reasonable threshold to achieve best trade-off between system efficiency and algorithm convergence rate"
❖ "Flexibility in choosing consistency models also adds in complexity"
❖ "Non-convex models, such as complex neural networks, may not be suitable for this pipeline"
❖ "Didn't show how the system performed in training a deep neural net"
❖ "Crucial and widely used non-parametric models like Decision Trees and K nearest neighbors do not fit the architecture"
❖ "Google is hardly representative of the broader industry"
❖ "There is no eval at all for … fault tolerance"


Innovativeness and Depth Ratings

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
    ❖ Parameter Server
    ❖ GraphLab
    ❖ XGBoost
❖ Advanced Issues in ML Scalability

Graph-Parallel Algorithms

❖ Some data analytics algorithms (not just ML) operate on graph data and have complex update dependencies; example: PageRank
❖ Not a simple sequential access pattern like SGD
❖ If viewed as a table: reads and writes to tuples, with each write depending on all neighboring tuples
❖ Does not scale well with RDBMS/UDA or MapReduce/Spark

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf

The GraphLab Abstraction

❖ "Think like a vertex" paradigm over a data graph (V, E, D)
❖ Arbitrary data state associated with vertices and edges
❖ 3-function API: Gather-Apply-Scatter
❖ Gather: Collect the latest states from neighbors
❖ Apply: Vertex-local state update function
❖ Scatter: Send the local state to neighbors
❖ The original single-node GraphLab assumed shared memory
❖ Scaled to single-node disk with careful sharding & caching of the graph

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
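
A toy single-node sketch of the Gather-Apply-Scatter pattern for PageRank (the slide's example); this is not the GraphLab API itself, just the vertex-centric structure. The tiny graph and damping factor are illustrative.

def pagerank_gas(out_edges, n_iters=20, d=0.85):
    # out_edges: dict mapping each vertex to its list of out-neighbors
    in_nbrs = {v: [] for v in out_edges}
    for u, outs in out_edges.items():
        for v in outs:
            in_nbrs[v].append(u)
    rank = {v: 1.0 / len(out_edges) for v in out_edges}
    for _ in range(n_iters):
        new_rank = {}
        for v in out_edges:
            acc = sum(rank[u] / len(out_edges[u]) for u in in_nbrs[v])   # Gather
            new_rank[v] = (1 - d) / len(out_edges) + d * acc             # Apply
        rank = new_rank                                                  # Scatter (publish states)
    return rank

print(pagerank_gas({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}))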

Distributed GraphLab Execution

❖ Enabled asynchronous updates of vertex and edge states
❖ Consistency-parallelism tradeoff
❖ Some algorithms seem robust to such inconsistency
❖ Sophisticated distributed locking protocols

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
    ❖ Parameter Server
    ❖ GraphLab
    ❖ XGBoost
❖ Advanced Issues in ML Scalability

Decision Tree Data Access Pattern

❖ CART has complex non-sequential data access patterns
❖ Compare candidate splits on each feature at each node
❖ Class-conditional aggregates are needed per candidate
❖ Repartition the data for sub-tree growth
❖ Does not scale well with RDBMS/UDA or MapReduce/Spark

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html

Decision Tree Ensembles

❖ RandomForest is very popular
❖ Just a bunch of independent trees on column subsets
❖ Tree Boosting is a popular adaptive ensemble
❖ Construct trees sequentially and weigh them
❖ "Weak" learners are aggregated into a strong learner
❖ Gradient-Boosted Decision Trees (GBDT) are very popular
❖ Also add trees to the ensemble sequentially
❖ Real-valued prediction; convex differentiable loss

GBDT Data Access Pattern

❖ More complex non-sequential access pattern than a single tree!
❖ An "iteration" adds a tree by exploring candidate splits
❖ Still needs recursive data re-partitioning
❖ Key difference: the scoring function has more statistics (1st and 2nd derivatives); this needs read-write access per example
❖ The access pattern over the per-example stats is random and depends on the split location
❖ Ideal if the whole stats and a whole column (at a time) can fit in RAM

[Figure: the Dataset table (Y, X1, X2, X3) alongside a per-example Stats table with columns Gi and Hi holding the 1st- and 2nd-derivative statistics.]

XGBoost

❖ A custom ML system to scale GBDT to larger-than-RAM data, both on single-node disk and on a cluster
❖ Very popular on tabular data; has won many Kaggle contests
❖ Key philosophy: Algorithm-system "co-design":
❖ Make the system implementation memory hierarchy-aware based on the algorithm's data access patterns
❖ Modify the ML algorithmics to better suit the system scale
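
For reference, typical single-node usage of the open-source xgboost library looks like the sketch below; the synthetic data and hyper-parameters are illustrative assumptions.

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)       # XGBoost's internal data format (sorted column blocks)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=50)
print(booster.predict(dtrain)[:5])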

XGBoost: Algorithm-level Ideas

❖ 2 key changes to make GBDT more scalable
❖ Bottleneck: Computing candidate split stats at scale
❖ Idea: Approximate the stats with a weighted quantile sketch
❖ Bottleneck: Sparse feature vectors and missing data
❖ Idea: Bake in a "default direction" for the child during learning

XGBoost: Systems-level Ideas

❖ 4 key choices to ensure efficiency at scale
❖ Goal: Reduce the overhead of evaluating candidates
❖ Idea: Pre-sort all columns independently
❖ Goal: Exploit parallelism to raise throughput
❖ Idea: Shard data into a column "block" per worker; workers compute their local columns' stats independently

XGBoost: Systems-level Ideas

❖ 4 key choices to ensure efficiency at scale (continued)
❖ Goal: Mitigate read-write randomness to the stats in RAM
❖ Idea: CPU cache-aware staging of the stats to reduce stalls
❖ Goal: Scale to on-disk and cluster data
❖ Idea: Shard blocks further on disk; stage reads to RAM; block-level compression to reduce I/O latency


XGBoost: Scalability Gains

❖ Gains from both algorithmic changes and systems ideas

Tradeoffs of Custom Distr. ML Sys.

Pros:
❖ Usability: Suitable for hyper-specialized ML use cases
❖ Manageability: More feasible in cloud-native / managed environments
❖ Efficiency: Often much lower runtimes and costs
❖ Scalability: Often more scalable to larger datasets
❖ Developability: More amenable if the ML algorithm is familiar; new open source / startup communities

Cons:
❖ Usability: Need to (re)learn new system APIs again
❖ Manageability: Extra overhead to add and maintain in the tools ecosystem
❖ Efficiency: Debugging runtime issues needs specialized knowhow
❖ Scalability: Arcane scalability issues may arise, e.g., inconsistency
❖ Developability: Need to (re)learn new implementation APIs and consistency tradeoffs; risk of lower technical support

Your Strong Points on XGBoost

❖ "The scalability of XGBoost algorithm enables much faster learning speed in model exploration by using parallel and distributed computing"
❖ "Versatile: it could be applied to a wide range of problems, including classification, ranking, rate prediction, and categorization"
❖ "Flexibility. The system supports the exact greedy algorithm and the approximate algorithm"
❖ "Handles the issue of sparsity in data"
❖ "Theoretically justified weighted quantile sketch"
❖ "Language Support and portability"
❖ "By making XGBoost open-source, they put their promises into action"
❖ "Make a point of cleaving to real-world issues faced by industry users"
❖ "This end-to-end system is widely used by data scientists"

Your Weak Points on XGBoost

❖ "Specificity: The system only works for gradient boosted tree algorithms"
❖ "In recent years deep neural networks can beat boosting tree algorithms"
❖ "Accuracy of XGBoost algorithms is generally lower than LightGBM"
❖ "Kaggle competitions and industry ML roles have quite a gap"
❖ "System introduces a lot of hyperparameters"
❖ "XGBoost tuning is difficult"
❖ "Sparsity-aware approach to split-finding introduces another layer of learned uncertainty … no data exploring effect on accuracy"
❖ "No evaluation on their weighted quantile sketch algorithm"
❖ "Finding correct block size for approximate algorithm … need to tune"
❖ "Not much detail on how the system parallelizes its workload"
❖ "Assumes readers have deep understanding about Boosting algorithm"


Innovativeness and Depth Ratings

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Advanced Issues in ML Scalability

❖ Streaming/Incremental ML at scale
❖ Scaling Massive Task Parallelism in ML
❖ Pushing ML Through Joins
❖ Larger-than-Memory Models
❖ Models with More Complex Data Access Patterns
❖ Delay-Tolerant / Geo-Distributed / Federated ML
❖ Scaling End-to-End ML Pipelines

Scalable Incremental ML

❖ Datasets keep growing in many real-world ML applications
❖ Incremental ML: update a learned model using only the new data
❖ Streaming ML is one variant (near real-time)
❖ It is non-trivial to make all ML algorithms incremental
❖ SGD-based procedures are "online" by default; just resume gradient descent on the new data
❖ ML/data mining folks have studied how to make other ML algorithms incremental; accuracy-runtime tradeoffs
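
As a small illustration of the "just resume SGD" point, scikit-learn's SGDClassifier exposes partial_fit for exactly this kind of incremental update; the random arrays here are only stand-ins for an old batch and a newly arrived batch.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                                   # linear model trained with SGD
X_old = np.random.rand(500, 5); y_old = np.random.randint(0, 2, 500)
clf.partial_fit(X_old, y_old, classes=[0, 1])           # initial training pass

X_new = np.random.rand(100, 5); y_new = np.random.randint(0, 2, 100)
clf.partial_fit(X_new, y_new)                           # incremental update on new data only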

Scalable Incremental ML: SageMaker

❖ Industrial cloud-native ML requirements:
❖ Incremental training and model freshness
❖ Predictable training runtime
❖ Elasticity and pause-resume
❖ Trainable on ephemeral (non-archived) data
❖ Automatic model/hyper-parameter tuning
❖ Design: Streaming ML algorithms that fit into a 3-function API of Initialize-Update-Finalize (akin to a UDA)
❖ All SGD-based procedures (GLMs, factorization machines)
❖ Variants of K-Means, PCA, topic models, forecasting


Scalable Incremental ML: SageMaker

❖ Parameter Server-based architecture with streaming workers

Massive Task Parallelism: Ray

❖ Advanced ML applications that use reinforcement learning produce large numbers of short-running tasks
❖ Training robots, self-driving cars, etc.
❖ ML-based cluster resource management
❖ Ray is an ML system that automatically scales such tasks from a single node to large clusters
❖ Tasks can have shared state for control
❖ Data is replicated/broadcast

https://www.usenix.org/system/files/osdi18-moritz.pdf
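
A minimal sketch of Ray's task API for launching a large number of short-running tasks; the squaring function is just a stand-in for a short simulation or rollout.

import ray

ray.init()

@ray.remote
def short_task(x):
    return x * x          # stand-in for a short-running rollout/simulation

futures = [short_task.remote(i) for i in range(1000)]   # scheduled across the cluster
print(sum(ray.get(futures)))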

Pushing ML Through Joins

❖ Most real-world tabular/relational datasets are multi-table
❖ A central fact table with the target; many dimension tables
❖ Key-foreign key joins are ubiquitous to denormalize such data into a single table before ML training
❖ The single table has a lot of data redundancy
❖ In turn, many ML algorithms end up having a lot of computational redundancy
❖ Pushing ML computations through joins enables computations directly over the normalized database

Pushing ML Through Joins

❖ Orion first showed how to rewrite ML computations to operate on the join input rather than the join output, aka "factorized ML"
❖ GLMs with BGD, L-BFGS, Newton methods
❖ The rewritten implementations fit within the UDA / MapReduce abstractions
❖ Morpheus offered a unified abstraction based on linear algebra to factorize many ML algorithms in one formalism
❖ K-Means clustering, matrix factorization, etc.

https://adalabucsd.github.io/papers/2015_Orion_SIGMOD.pdf
https://adalabucsd.github.io/papers/2017_Morpheus_VLDB.pdf

Larger-than-Memory Models

❖ Some ML algorithms may have state that does not fit in RAM
❖ Need to shard the model too (not just the data) to stage updates
❖ Specific to the ML algorithm's data access patterns
❖ Example: Matrix factorization trained with SGD

http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability


Evaluating “Scalability”

https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

❖ Message: Not just scalability but efficiency matters too; not just speedup curve but time vs strong (single-node) baselines