CSE 291D/234 Data Systems for Machine Learning
Arun Kumar

Topic 1: Classical ML Training at Scale
(Chapters 2, 5, and 6 of the MLSys book)

Academic ML 101

❖ Generalized Linear Models (GLMs); from statistics
❖ Bayesian Networks; inspired by causal reasoning
❖ Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
❖ Support Vector Machines (SVMs); inspired by psychology
❖ Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
"Classical" ML

Real-World ML 101

https://www.kaggle.com/c/kaggle-survey-2019

[Chart from the Kaggle 2019 survey: deep learning, GLMs, and tree learners are among the most commonly used model families in practice.]

Scalable ML Training in the Lifecycle

Data acquisition → Data preparation → Feature Engineering → Training & Inference → Model Selection → Serving → Monitoring


Scalable ML Training in the Big Picture

ML Systems

Q: What is a Machine Learning (ML) System?

❖ A data processing system (aka data system) for mathematically advanced data analysis operations (inferential or predictive):
❖ Statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
❖ High-level APIs to express ML computations over (large) datasets
❖ Execution engine to run ML computations efficiently and in a scalable manner


But what exactly does it mean for an ML system to be “scalable”?

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Background: Memory Hierarchy

❖ Cache: ~MBs capacity; ~$2/MB; ~100GB/s access speed
❖ Main Memory (DRAM): ~10GBs; ~$5/GB; ~10GB/s; ~100s of access cycles
❖ Flash Storage (SSD): ~TBs; ~$200/TB; ~GB/s; ~10^5 to 10^6 access cycles
❖ Magnetic Hard Disk Drive (HDD): ~10TBs; ~$40/TB; ~200MB/s; ~10^7 to 10^8 access cycles

Memory Hierarchy in Action

Q: What does this program do when run with 'python'? (Assume tmp.csv is in the current working directory)

tmp.py:
    import pandas as p
    m = p.read_csv('tmp.csv', header=None)
    s = m.sum().sum()
    print(s)

tmp.csv:
    1,2,3
    4,5,6

Memory Hierarchy in Action

[Diagram: Processor (Registers, CU, ALU, Caches) connected over the Bus to DRAM and Disk; data is retrieved, processed, and stored across these levels.]

Rough sequence of events when the program is executed:
❖ I/O for code: tmp.py is read from disk and the commands are interpreted
❖ I/O for data: tmp.csv is read from disk into DRAM
❖ Computations are done by the Processor
❖ I/O for display: '21' is printed to the monitor

Q: What if this does not fit in DRAM?

Scalable ML Systems

❖ ML systems that do not require the (training) dataset to fit entirely in main memory (DRAM) of one node
❖ Conversely, if the system thrashes when the data file does not fit in RAM, it is not scalable

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

Scalable ML Systems

4 main approaches to scale ML to large data:
❖ Single-node disk: Paged access from a file on local disk
❖ Remote read: Paged access from disk(s) over the network
❖ Distributed memory: Dataset fits in a cluster's total DRAM
❖ Distributed disk: Dataset fits on a cluster's full set of disks

Evolution of Scalable ML Systems

[Timeline: 1980s; mid 1990s; late 1990s to mid 2000s; late 2000s to early 2010s; mid 2010s onward. Stages include In-RDBMS ML Systems, ML System Abstractions, ML on Dataflow Systems, Parameter Server, Cloud ML, and Deep Learning Systems, driven by concerns of Scalability, Manageability, and Developability.]

Major Existing ML Systems

❖ General ML libraries
❖ Disk-based files
❖ In-memory
❖ Layered on RDBMS/Spark
❖ Cloud-native
❖ "AutoML" platforms
❖ Decision tree-oriented
❖ Deep learning-oriented
(Example systems in each category were shown as logos on the slide.)

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

ML Algorithm = Program Over Data

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

❖ To scale an ML program's computations, split them up to operate over "chunks" of data at a time
❖ How to split up an ML program this way can be non-trivial!
❖ Depends on the data access pattern of the algorithm
❖ A large class of ML algorithms do just sequential scans for iterative numerical optimization

Numerical Optimization in ML

❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem
❖ E.g., logistic and linear regression, linear SVM, etc.
❖ Aka "Empirical Risk Minimization" (ERM):

w* = argmin_w Σ_{i=1}^{n} l(y_i, f(w, x_i))

❖ GLMs define hyperplanes and use an f() that is a scalar function of the distance w^T x_i

[Table D: label column Y and feature columns X1, X2, X3, with rows (1b, 1c, 1d), (2b, 2c, 2d), (3b, 3c, 3d), (4b, 4c, 4d), ...]

Batch Gradient Descent for ML

❖ For many ML models, the loss function l() is convex; so is L(w) = Σ_{i=1}^{n} l(y_i, f(w, x_i))
❖ But closed-form minimization is typically infeasible
❖ Batch Gradient Descent (BGD): an iterative numerical procedure to find an optimal w
❖ Initialize w to some value w^(0)
❖ Compute the gradient: ∇L(w^(k)) = Σ_{i=1}^{n} ∇l(y_i, f(w^(k), x_i))
❖ Descend along the gradient (aka the Update Rule): w^(k+1) ← w^(k) − η ∇L(w^(k))
❖ Repeat until we get close to w*, aka convergence

Batch Gradient Descent for ML

[Figure: 1-D illustration of BGD on the loss surface L(w). Starting from w^(0), each step follows the gradient: w^(1) ← w^(0) − η ∇L(w^(0)), then w^(2) ← w^(1) − η ∇L(w^(1)), and so on toward the optimum w*.]

❖ The learning rate η is a hyper-parameter selected by the user or by "AutoML" tuning procedures
❖ The number of iterations/epochs of BGD is also a hyper-parameter
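
To make the procedure concrete before we turn to scaling, here is a minimal in-memory NumPy sketch of BGD for logistic regression (one common ERM instance); the tiny dataset, learning rate, and epoch count are illustrative assumptions, not from the slides.

import numpy as np

def bgd_logistic(X, y, lr=0.1, n_epochs=100):
    w = np.zeros(X.shape[1])              # w^(0): initialize the model
    for _ in range(n_epochs):             # number of epochs is a hyper-parameter
        p = 1.0 / (1.0 + np.exp(-X @ w))  # f(w, x_i) for all examples
        grad = X.T @ (p - y)              # sum of per-example gradients = full-batch gradient
        w -= lr * grad                    # update rule: w^(k+1) = w^(k) - eta * grad L(w^(k))
    return w

# Tiny illustrative dataset
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
print(bgd_logistic(X, y))
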
Data Access Pattern of BGD at Scale

∇L(w^(k)) = Σ_{i=1}^{n} ∇l(y_i, f(w^(k), x_i))

❖ The data-intensive computation in BGD is the gradient
❖ In scalable ML, dataset D may not fit in DRAM
❖ Model w is typically small and DRAM-resident
❖ The gradient is like a SQL SUM over vectors (one per example)
❖ At each epoch, 1 filescan over D suffices to get the gradient
❖ The update of w happens normally in DRAM
❖ Monitoring across epochs is needed to check convergence
❖ The loss function L() is also just a SUM, computed in a similar manner

Q: What SQL op is this reminiscent of?

Scaling BGD to Disk

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

[Figure: the process wants to read the file's pages P1, P2, ..., P6 from disk one by one and then discard them, aka the "filescan" access pattern. Suppose the OS cache in DRAM can hold 4 pages of the file: after reading P1 through P4 the cache is full, so older pages are replaced (evict P1, evict P2, ...) as P5 and P6 are read.]

Scaling BGD to Disk

❖ Sequential scan to read pages from disk to DRAM
❖ Modern DRAM sizes can be 10s of GBs; so we read a "chunk" of the file at a time (say, 1000s of pages)
❖ Compute a partial gradient on each chunk and add them all up:

∇L(w^(k)) = Σ_{i=1}^{n} ∇l(y_i, f(w^(k), x_i)) = ∇L1(w^(k)) + ∇L2(w^(k)) + ...

where ∇Lj is the partial gradient over the examples on chunk j of D.

[Figure: the table D (Y, X1, X2, X3) is stored on data pages, e.g., one page holding rows (0,1b,1c,1d) and (1,2b,2c,2d), the next holding (1,3b,3c,3d), etc.; each chunk of pages yields one partial gradient.]
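
A minimal sketch of this chunked filescan, assuming the dataset is a CSV with the label in the first column and using pandas' chunksize reader; one pass over the chunks accumulates the partial gradients ∇L1, ∇L2, ... of a logistic-regression loss. The file name and hyper-parameters are illustrative.

import numpy as np
import pandas as pd

def bgd_epoch_on_disk(csv_path, w, lr=0.1, chunk_rows=100_000):
    grad = np.zeros_like(w)
    # Each chunk is a Pandas DataFrame small enough to fit in DRAM.
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunk_rows):
        y = chunk.iloc[:, 0].to_numpy()
        X = chunk.iloc[:, 1:].to_numpy()
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad += X.T @ (p - y)       # partial gradient for this chunk
    return w - lr * grad            # the model update itself happens in DRAM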

DaskML's Scalable DataFrame

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

❖ A Dask DataFrame scales to on-disk files by splitting them into a bunch of Pandas DataFrames under the covers
❖ The Dask API is a "wrapper" around the Pandas API that scales ops to the splits and puts all the results together

https://docs.dask.org/en/latest/dataframe.html
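
A small example of the wrapper idea: the call below looks like Pandas, but Dask splits the (possibly larger-than-RAM) file into many partitions and combines the per-partition results. The file name, column name, and block size are illustrative assumptions.

import dask.dataframe as dd

# One logical DataFrame backed by many Pandas DataFrames (one per ~64MB block).
df = dd.read_csv('features.csv', blocksize='64MB')
total = df['x1'].sum().compute()   # per-partition sums are computed, then combined
print(total)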

Scaling with Remote Reads

Basic Idea: Split the data file (virtually or physically) and stage reads (and writes) of pages to DRAM and the processor

❖ Similar to scaling to disk, but instead read pages/chunks over the network from remote disk(s) (e.g., from S3)
❖ Good in practice for a one-shot filescan access pattern
❖ For iterative ML, this means repeated reads over the network
❖ Can combine with caching on local disk / DRAM
❖ Increasingly popular for cloud-native ML workloads

Stochastic Gradient Descent for ML

❖ Two key cons of BGD:
❖ Slow to converge to the optimal (too many epochs)
❖ Costly full scan of D for each update of w
❖ Stochastic GD (SGD) mitigates both issues
❖ Basic Idea: Use a sample (called a mini-batch) of D to approximate the gradient instead of the "full batch" gradient
❖ Without-replacement sampling: randomly shuffle D before each epoch; one pass = a sequence of mini-batches
❖ SGD works well for non-convex loss functions too, unlike BGD; it is the "workhorse" of scalable ML

Data Access Pattern of Scalable SGD

Update rule: W^(t+1) ← W^(t) − η ∇̃L(W^(t)), where the approximate gradient ∇̃L is computed on a mini-batch B sampled from D without replacement:

∇̃L(W^(t)) = Σ_{(y_i, x_i) ∈ B ⊂ D} ∇l(y_i, f(W^(t), x_i))

[Figure: the original dataset is first given a random "shuffle" (e.g., ORDER BY RAND()). Epoch 1 is a sequential scan over the randomized dataset, processed as Mini-batch 1, Mini-batch 2, Mini-batch 3, ..., with a model update after each one: W^(0) → W^(1) → W^(2) → W^(3). Epoch 2 is another sequential scan (optionally after a new random shuffle), continuing W^(3) → W^(4) → ...]

Data Access Pattern of Scalable SGD

❖ Mini-batch gradient computations: 1 filescan per epoch
❖ The update of w happens in DRAM
❖ During the filescan, count the number of examples seen and update per mini-batch
❖ Typical mini-batch sizes: 10s to 1000s
❖ Orders of magnitude more updates than BGD!
❖ A random shuffle is not trivial to scale; it requires an "external merge sort" (roughly 2 scans of the file)
❖ ML practitioners often shuffle the dataset only once up front; good enough in most cases in practice
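
A minimal sketch of one SGD epoch as a single filescan, reusing the chunked reader from the BGD sketch above; it assumes the file was already shuffled once up front and that the label is in the first column. Mini-batch size and learning rate are illustrative.

import numpy as np
import pandas as pd

def sgd_epoch_on_disk(csv_path, w, lr=0.01, chunk_rows=100_000, batch_size=128):
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunk_rows):
        y = chunk.iloc[:, 0].to_numpy()
        X = chunk.iloc[:, 1:].to_numpy()
        for start in range(0, len(y), batch_size):       # many updates per filescan
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-Xb @ w))
            w = w - lr * (Xb.T @ (p - yb))               # one mini-batch update of w in DRAM
    return w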


Handling pages directly is so low-level! Is there a higher-level way to scale ML?

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
    ❖ In-RDBMS ML
    ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability


RDBMS Scales Queries over Data

❖ Mature software systems to scale to larger-than-RAM data

RDBMS User-Defined Aggregates

❖ Most industrial and open source RDBMSs allow extensibility of their SQL dialect with User-Defined Functions (UDFs)
❖ A User-Defined Aggregate (UDA) is a UDF API to specify custom aggregates over the whole dataset

Example with SQL AVG; the agg. state (S, C) is (partial sum, partial count):
❖ Initialize: Start by setting up the "agg. state" in DRAM
❖ Transition: the RDBMS gives a tuple from the table; update the agg. state: (S, C) ← (S, C) + (v_i, 1)
❖ Merge (Optional: in a parallel RDBMS, combine the agg. states of the workers): (S', C') ← Σ_{worker j} (S_j, C_j)
❖ Finalize: Post-process the agg. state and return the result: return S'/C'
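
A minimal Python sketch of the four-function UDA contract for the AVG example above; real RDBMSs expose the same hooks through their own UDA/extension APIs (e.g., PostgreSQL's CREATE AGGREGATE), so the class below only illustrates the control flow.

class AvgUDA:
    def initialize(self):
        self.s, self.c = 0.0, 0          # agg. state in DRAM: (partial sum, partial count)
    def transition(self, v):
        self.s += v                      # called once per tuple handed over by the RDBMS
        self.c += 1
    def merge(self, other):
        self.s += other.s                # parallel RDBMS: combine workers' agg. states
        self.c += other.c
    def finalize(self):
        return self.s / self.c           # post-process agg. state and return the result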

Bismarck: SGD as RDBMS UDA

❖ An SGD epoch is implemented as a UDA for in-RDBMS execution
❖ The data-intensive computation is scaled automatically by the RDBMS
❖ Commands for shuffling, running multiple epochs, checking convergence, and validation/test error measurements are issued from an external controller written in Python

❖ Initialize: Allocate memory for the model W(t) and the mini-batch gradient stats
❖ Transition: Given a tuple with (y, x), compute the gradient and update the stats; if the mini-batch limit is hit, update the model and reset the stats
❖ Merge (Optional: applies only to parallel RDBMSs): "Combine" the model parameters from independent workers
❖ Finalize: Return the model parameters
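
A rough sketch of the same four hooks instantiated for an SGD epoch over a logistic-regression model, in the spirit of Bismarck; the mini-batch bookkeeping and the weighted model-averaging Merge are illustrative simplifications, not the paper's exact code.

import numpy as np

class SGDEpochUDA:
    def initialize(self, d, lr=0.01, batch_size=128):
        self.w = np.zeros(d)              # model kept in DRAM as agg. state
        self.grad = np.zeros(d)           # mini-batch gradient stats
        self.seen, self.lr, self.batch_size = 0, lr, batch_size
        self.n_workers = 1
    def transition(self, y, x):
        p = 1.0 / (1.0 + np.exp(-x @ self.w))
        self.grad += (p - y) * x          # per-tuple gradient contribution
        self.seen += 1
        if self.seen == self.batch_size:  # mini-batch limit hit: update model, reset stats
            self.w -= self.lr * self.grad
            self.grad[:] = 0.0
            self.seen = 0
    def merge(self, other):
        # "Model Averaging" heuristic across independent workers (see the next slides)
        total = self.n_workers + other.n_workers
        self.w = (self.n_workers * self.w + other.n_workers * other.w) / total
        self.n_workers = total
    def finalize(self):
        return self.w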

Programming ML as RDBMS UDAs

❖ Many ML algorithms fit within Bismarck's template
❖ The Transition function is where the bulk of the ML logic is; just a few lines of code for SGD updates

Distributed SGD via RDBMS UDA

Q: How does the RDBMS parallelize the (SGD) UDA?

❖ Data is pre-sharded across workers in parallel RDBMSs
❖ Initialize and Transition run independently on each shard
❖ Merge "combines" the model params from the workers; this is tricky since an SGD epoch is not an algebraic agg. like SUM/AVG
❖ Common heuristic: "Model Averaging"

[Figure: Workers 1 through n each produce a local model W^(t)_1, ..., W^(t)_n on their shard; the Master averages them: W^(t) = (1/n) Σ_{i=1}^{n} W^(t)_i]

❖ Affects convergence
❖ Works OK for GLMs

Bottlenecks of RDBMS SGD UDA

❖ Model Averaging for distributed SGD has poor convergence for non-convex/ANN models
❖ Too many epochs, typically poor ML accuracy
❖ Model sizes can be too large (even 10s of GBs) for the UDA's aggregation state
❖ The UDA's Merge step is a choke point at scale (100s of workers)
❖ Bulk Synchronous Parallelism (BSP) of parallel RDBMSs forces all workers to wait at that barrier

The MADlib Library

❖ A decade-old library of scalable statistical and ML procedures on PostgreSQL and Greenplum (a parallel RDBMS)
❖ Many procedures are UDAs; some are written in pure SQL
❖ All can be invoked from the SQL console
❖ The RDBMS can be used for ETL (extract, transform, load) of features

https://madlib.apache.org/docs/latest/index.html

Tradeoffs of In-RDBMS ML

Pros:
❖ Usability: Some data analysts like ML from the SQL console
❖ Manageability: In-situ data governance and security/auth. of RDBMSs
❖ Efficiency: Faster in some cases
❖ Scalability: Massively parallel processing of RDBMSs like Greenplum
❖ Developability: SQL-based ETL

Cons:
❖ Usability: Most ML users want the full flexibility of a Python console
❖ Manageability: Many ML users want more hands-on data access in Jupyter notebooks
❖ Efficiency: Typically slower due to API restrictions; custom ML systems are typically faster; interference with OLTP workloads
❖ Scalability: BSP is a bottleneck for 100+ nodes (asynchrony needed)
❖ Developability: Unnatural APIs to write ML

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
    ❖ In-RDBMS ML
    ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

The MapReduce Abstraction

❖ A programming model for writing programs on sharded data + a distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ The developer only implements the logic of Map and Reduce
❖ The system implementation handles the orchestration of data distribution, parallelization, etc. under the covers

The MapReduce Abstraction

❖ Standard example: count word occurrences in a doc corpus
❖ Input: A set of text documents (say, webpages)
❖ Output: A dictionary of unique words and their counts
Hmmm, sounds suspiciously familiar … :)

Part of the MapReduce API (pseudocode):

function map(String docname, String doctext):
    for each word w in doctext:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit(word, sum)

How MapReduce Works

Parallel flow of control and data during MapReduce execution:
❖ Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
❖ Fault tolerance is achieved using data replication

What is Hadoop then?

❖ An open-source system implementation with MapReduce as its prog. model and HDFS as its distr. filesystem
❖ Map() and Reduce() functions in the API; input splitting, data distribution, shuffling, fault tolerance, etc. are all handled by the Hadoop library under the covers
❖ Mostly superseded by the Spark ecosystem these days, although HDFS is still the base

Abstract Semantics of MapReduce

❖ Map(): Operates independently on one "record" at a time
❖ Can batch multiple data examples into one record
❖ Dependencies across Mappers are not allowed
❖ Can emit 1 or more key-value pairs as output
❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list)
❖ An aggregation function is applied on the Iterator to produce the final output
❖ Input Split:
❖ A physical-level split/shard of the dataset that batches multiple examples into one file "block" (~128MB default on HDFS)
❖ Custom Input Splits can be written by the application user

Benefits of MapReduce

❖ Goal: A higher-level abstraction for parallel data processing
❖ Key Benefits:
❖ Out-of-the-box scalability and fault tolerance
❖ Map() and Reduce() can be highly general; no restrictions on data types; easier ETL
❖ Free and OSS (Hadoop)
❖ New burden on users: Cast the data-intensive computation into the Map() + Reduce() API
❖ But libraries exist in many PLs to mitigate coding pains: Java, C++, Python, R, Scala, etc.

Emulate MapReduce in SQL?

Q: How would you do the word counting in an RDBMS / in SQL?

❖ First step: Transform the text docs into relations and load them; part of the Extract-Transform-Load (ETL) stage. Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!

SELECT Word, COUNT(*)
FROM DocWords
GROUP BY Word
[ORDER BY Word]

Parallelism, scaling, etc. are done by the RDBMS under the covers

RDBMS UDA vs MapReduce

❖ Aggregation state: the data structure computed (independently) by workers and unified by the master
❖ Initialize: Set up info./initialize RAM for the agg. state; runs independently on each worker
❖ Transition: Per-tuple function run by a worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge: Function that combines agg. states from the workers; run by the master after the workers are done; analogous to Reduce()
❖ Finalize: Run once at the end by the master to return the final result

BGD via MapReduce

❖ Assume the data is sharded; a map partition has many examples
❖ The initial model W(t) is read from a file on HDFS by each mapper

function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    For each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
    emit(G)

function reduce(Iterator partialGradients):
    FullG = 0
    for each G in partialGradients:
        FullG += G
    emit(FullG)

SGD (Averaging) via MapReduce

❖ Similar to BGD, but with Model Averaging across the map partitions

function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    For each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
        Wj = Update W(t) with G
    emit(Wj)

function reduce(Iterator workerWeights):
    AvgW = 0
    for each Wj in workerWeights:
        AvgW += Wj
    AvgW /= workerWeights.length()
    emit(AvgW)

Apache Spark

❖ Extended dataflow programming model to subsume MapReduce and most relational operators
❖ Inspired by the Python Pandas style of function calls for ops
❖ Unified system to handle relations, text, etc.; supports more general distributed data processing
❖ Uses distributed memory for caching; faster than Hadoop
❖ New fault tolerance mechanism using lineage, not replication
❖ From UC Berkeley AMPLab; commercialized as Databricks

Spark's Dataflow Programming Model

❖ Transformations are relational ops, MR, etc. expressed as functions
❖ Actions are what force computation; aka lazy evaluation

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf


Word Count Example in Spark

Spark RDD API available in Python, Scala, Java, and R
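
The code image on this slide did not survive extraction; below is a standard PySpark RDD word count of the kind the slide shows. The HDFS path is illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("hdfs:///docs/*.txt")           # RDD of lines
            .flatMap(lambda line: line.split())        # like Map: emit each word
            .map(lambda w: (w, 1))                     # key-value pairs
            .reduceByKey(lambda a, b: a + b))          # like Reduce: sum counts per word
print(counts.take(10))                                 # action: forces the lazy computation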

Programming ML in MapReduce/Spark

❖ All ML procedures that can be cast as RDBMS UDAs can be cast to the MapReduce API of Hadoop or the Spark RDD APIs
❖ MapReduce is just easier to use for most developers
❖ Spark integrates better with the PyData ecosystem

https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf

❖ Apache Mahout is a library of classical ML algorithms written as MapReduce programs for Hadoop; it was expanded later into a DSL
❖ SparkML is a library of classical ML algorithms written using the Spark RDD API; common in enterprises for scalable ML
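
For a sense of what SparkML code looks like, here is a minimal logistic-regression training job using the newer DataFrame-based spark.ml API (the older RDD-based spark.mllib API is similar); the input path and hyper-parameters are illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ScalableLR").getOrCreate()
train = spark.read.format("libsvm").load("hdfs:///data/train.libsvm")  # sharded input
lr = LogisticRegression(maxIter=20, regParam=0.01)
model = lr.fit(train)             # distributed training handled under the covers
print(model.coefficients)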

Tradeoffs of ML on Dataflow Systems

Pros:
❖ Usability: SparkML integrates well with Python stacks
❖ Manageability: In-situ data governance and security/auth. of "data lakes"
❖ Efficiency: Comparable to in-RDBMS ML; no OLTP interference
❖ Scalability: Massively parallel processing
❖ Developability: SQL- & MR-based ETL; less code to write ML

Cons:
❖ Usability: Not all ML algorithms are scaled
❖ Manageability: Many ML users may not be familiar with Spark data ETL
❖ Efficiency: Custom ML systems are typically faster still
❖ Scalability: BSP is a bottleneck for 100+ nodes (asynchrony needed)
❖ Developability: Still somewhat unnatural APIs to write ML

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
    ❖ Parameter Server
    ❖ GraphLab
    ❖ XGBoost
❖ Advanced Issues in ML Scalability

Parameter Server for Distributed SGD

❖ Recall the bottlenecks of Model Averaging-based SGD in an RDBMS UDA or with MapReduce:
❖ BSP becomes a choke point (the Merge / Reduce stage)
❖ Often poor convergence, especially for non-convex models
❖ Hard to handle large models
❖ The Parameter Server (PS) mitigates all these issues:
❖ Breaks the synchronization barrier for merging: allows asynchronous updates from workers to the master
❖ Flexible communication frequency: can send updates at every mini-batch or after a set of a few mini-batches

Parameter Server for Distributed SGD

[Figure: Workers 1 through n each compute mini-batch gradients ∇̃L(W^(t)) on their shards and exchange them with a multi-server "master" (PS 1, PS 2, ..., PS k), where each server manages a part of W(t). Workers Push / Pull when ready/needed; there is no synchronization across workers or servers. Workers send gradients to the master for updates at each mini-batch (or at a lower frequency).]

❖ Model params may get out-of-sync or stale; but SGD turns out to be remarkably robust, and the many updates per epoch really help
❖ The communication cost per epoch is higher (per mini-batch)

Programming ML using PS

❖ Designed mainly for sparse feature vectors/updates
❖ Easy to parallelize updates to the model params
❖ The ML developer recasts the ML procedure into two parts: worker-side updates and server-side aggregation
❖ Loosely analogous to Map and Reduce, respectively
❖ But more complex due to flexible update schedules
❖ Supports 3 consistency models for the staleness of updates, with different tradeoffs on efficiency vs accuracy
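
A toy single-process sketch of the worker-side / server-side split: "servers" hold shards of W, and workers pull (possibly stale) parameters and push mini-batch gradients without any barrier. This only illustrates the control flow, not a real asynchronous implementation.

import numpy as np

class ToyParameterServer:
    def __init__(self, d, n_shards=2, lr=0.01):
        self.shards = np.array_split(np.zeros(d), n_shards)   # each server owns a part of W
        self.lr = lr
    def pull(self):
        return np.concatenate(self.shards)                    # workers fetch current params
    def push(self, grad):
        # Apply a (possibly stale) gradient immediately; no waiting for other workers.
        for shard, g in zip(self.shards, np.array_split(grad, len(self.shards))):
            shard -= self.lr * g

def worker_step(ps, Xb, yb):
    w = ps.pull()                                # pull
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    ps.push(Xb.T @ (p - yb))                     # push the mini-batch gradient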

Systems-level Advances in PS

❖ Workers and Servers can both be multi-node/multi-process; fine-grained task scheduling on nodes
❖ Range-based partitioning of params across the servers
❖ Timestamped and compressed updates are exchanged
❖ Replication and fault tolerance

Tradeoffs of Parameter Server

Pros:
❖ Usability: Supports billions of features and params
❖ Manageability: In-built fault tolerance
❖ Efficiency: Faster than in-RDBMS ML and ML-on-Dataflow
❖ Scalability: Highest; can work with 1000s of nodes
❖ Developability: Abstracts away many systems scaling issues

Cons:
❖ Usability: Not reproducible; not well integrated with ETL stacks
❖ Manageability: TensorFlow offers it natively; otherwise, hard to operate/govern
❖ Efficiency: Not suitable for dense updates; high comm. cost
❖ Scalability: Not suitable for smaller scales due to overheads
❖ Developability: Reasoning about ML (in)consistency is hard

Sample of Your Strong Points on PS

❖ "Abstracts away most system complexity such as asynchronous communication between nodes, fault tolerance, and data replication"
❖ "Allows addition and deletion of nodes to the framework without any restarts by means of a consistent hash ring with replication of data"
❖ "Relaxing consistency constraints gives the algorithm designer flexibility in defining different consistency models (going from sequential to eventual) by balancing system efficiency and algorithm convergence"
❖ "Outperforms other open source systems in Sparse Logistic Regression"
❖ "Range construct, while simple, seems fundamental to efficient communication and thus fault tolerance"
❖ "Generality. Both supervised learning and unsupervised learning"
❖ "Fault tolerance is crucial … in an unreliable environment like the cloud where the machines can fail or jobs can be preempted"

Sample of Your Weak Points on PS

❖ "Does not show where or when the bottlenecks of synchronization appear"
❖ "Could lead to a longer amount of time for an ML algorithm model to train if the model is sensitive to inconsistent data"
❖ "Does not specify how to set a reasonable threshold to achieve best trade-off between system efficiency and algorithm convergence rate"
❖ "Flexibility in choosing consistency models also adds in complexity"
❖ "Non-convex models, such as complex neural networks, may not be suitable for this pipeline"
❖ "Didn't show how the system performed in training a deep neural net"
❖ "Crucial and widely used non-parametric models like Decision Trees and K nearest neighbors do not fit the architecture"
❖ "Google is hardly representative of the broader industry"
❖ "There is no eval at all for … fault tolerance"


Innovativeness and Depth Ratings

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
    ❖ Parameter Server
    ❖ GraphLab
    ❖ XGBoost
❖ Advanced Issues in ML Scalability

Graph-Parallel Algorithms

❖ Some data analytics algorithms (not just ML) operate on graph data and have complex update dependencies; example: PageRank
❖ Not a simple sequential access pattern like SGD
❖ If viewed as a table: reads and writes to tuples, with each write depending on all neighboring tuples
❖ Does not scale well with RDBMS/UDA or MapReduce/Spark

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf

The GraphLab Abstraction

❖ "Think like a vertex" paradigm over a data graph (V, E, D)
❖ Arbitrary data state associated with vertices and edges
❖ 3-function API: Gather-Apply-Scatter
❖ Gather: Collect the latest states from neighbors
❖ Apply: Vertex-local state update function
❖ Scatter: Send the local state to neighbors
❖ The original single-node GraphLab assumed shared memory
❖ Scaled to single-node disk with careful sharding & caching of the graph

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
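
A toy single-node sketch of the Gather-Apply-Scatter pattern for PageRank (the slide's example); this is not the GraphLab API itself, just the vertex-centric structure. The tiny graph and damping factor are illustrative.

def pagerank_gas(out_edges, n_iters=20, d=0.85):
    # out_edges: dict mapping each vertex to its list of out-neighbors
    in_nbrs = {v: [] for v in out_edges}
    for u, outs in out_edges.items():
        for v in outs:
            in_nbrs[v].append(u)
    rank = {v: 1.0 / len(out_edges) for v in out_edges}
    for _ in range(n_iters):
        new_rank = {}
        for v in out_edges:
            acc = sum(rank[u] / len(out_edges[u]) for u in in_nbrs[v])   # Gather
            new_rank[v] = (1 - d) / len(out_edges) + d * acc             # Apply
        rank = new_rank                                                  # Scatter (publish states)
    return rank

print(pagerank_gas({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}))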

Distributed GraphLab Execution

❖ Enabled asynchronous updates of vertex and edge states
❖ Consistency-parallelism tradeoff
❖ Some algorithms seem robust to such inconsistency
❖ Sophisticated distributed locking protocols

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
    ❖ Parameter Server
    ❖ GraphLab
    ❖ XGBoost
❖ Advanced Issues in ML Scalability

Decision Tree Data Access Pattern

❖ CART has complex non-sequential data access patterns
❖ Compare candidate splits on each feature at each node
❖ Class-conditional aggregates are needed per candidate
❖ Repartition the data for sub-tree growth
❖ Does not scale well with RDBMS/UDA or MapReduce/Spark

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html

Decision Tree Ensembles

❖ RandomForest is very popular
❖ Just a bunch of independent trees on column subsets
❖ Tree Boosting is a popular adaptive ensemble
❖ Construct trees sequentially and weigh them
❖ "Weak" learners are aggregated into a strong learner
❖ Gradient-Boosted Decision Trees (GBDT) are very popular
❖ Also add trees to the ensemble sequentially
❖ Real-valued prediction; convex differentiable loss

GBDT Data Access Pattern

❖ More complex non-sequential access pattern than a single tree!
❖ An "iteration" adds a tree by exploring candidate splits
❖ Still needs recursive data re-partitioning
❖ Key difference: the scoring function has more statistics (1st and 2nd derivatives); this needs read-write access per example
❖ The access pattern over the per-example stats is random and depends on the split location
❖ Ideal if the whole stats and a whole column (at a time) can fit in RAM

[Figure: the Dataset table (Y, X1, X2, X3) alongside a per-example Stats table with columns Gi and Hi holding the 1st- and 2nd-derivative statistics.]

XGBoost

❖ A custom ML system to scale GBDT to larger-than-RAM data, both on single-node disk and on a cluster
❖ Very popular on tabular data; has won many Kaggle contests
❖ Key philosophy: Algorithm-system "co-design":
❖ Make the system implementation memory hierarchy-aware based on the algorithm's data access patterns
❖ Modify the ML algorithmics to better suit the system scale
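
For reference, typical single-node usage of the open-source xgboost library looks like the sketch below; the synthetic data and hyper-parameters are illustrative assumptions.

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)       # XGBoost's internal data format (sorted column blocks)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=50)
print(booster.predict(dtrain)[:5])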

XGBoost: Algorithm-level Ideas

❖ 2 key changes to make GBDT more scalable
❖ Bottleneck: Computing candidate split stats at scale
❖ Idea: Approximate the stats with a weighted quantile sketch
❖ Bottleneck: Sparse feature vectors and missing data
❖ Idea: Bake in a "default direction" for the child during learning

XGBoost: Systems-level Ideas

❖ 4 key choices to ensure efficiency at scale
❖ Goal: Reduce the overhead of evaluating candidates
❖ Idea: Pre-sort all columns independently
❖ Goal: Exploit parallelism to raise throughput
❖ Idea: Shard data into a column "block" per worker; workers compute their local columns' stats independently

XGBoost: Systems-level Ideas

❖ 4 key choices to ensure efficiency at scale (continued)
❖ Goal: Mitigate read-write randomness to the stats in RAM
❖ Idea: CPU cache-aware staging of the stats to reduce stalls
❖ Goal: Scale to on-disk and cluster data
❖ Idea: Shard blocks further on disk; stage reads to RAM; block-level compression to reduce I/O latency


XGBoost: Scalability Gains

❖ Gains from both algorithmic changes and systems ideas

Tradeoffs of Custom Distr. ML Sys.

Pros:
❖ Usability: Suitable for hyper-specialized ML use cases
❖ Manageability: More feasible in cloud-native / managed environments
❖ Efficiency: Often much lower runtimes and costs
❖ Scalability: Often more scalable to larger datasets
❖ Developability: More amenable if the ML algorithm is familiar; new open source / startup communities

Cons:
❖ Usability: Need to (re)learn new system APIs again
❖ Manageability: Extra overhead to add and maintain in the tools ecosystem
❖ Efficiency: Debugging runtime issues needs specialized knowhow
❖ Scalability: Arcane scalability issues may arise, e.g., inconsistency
❖ Developability: Need to (re)learn new implementation APIs and consistency tradeoffs; risk of lower technical support

Your Strong Points on XGBoost

❖ "The scalability of XGBoost algorithm enables much faster learning speed in model exploration by using parallel and distributed computing"
❖ "Versatile: it could be applied to a wide range of problems, including classification, ranking, rate prediction, and categorization"
❖ "Flexibility. The system supports the exact greedy algorithm and the approximate algorithm"
❖ "Handles the issue of sparsity in data"
❖ "Theoretically justified weighted quantile sketch"
❖ "Language Support and portability"
❖ "By making XGBoost open-source, they put their promises into action"
❖ "Make a point of cleaving to real-world issues faced by industry users"
❖ "This end-to-end system is widely used by data scientists"

Your Weak Points on XGBoost

❖ "Specificity: The system only works for gradient boosted tree algorithms"
❖ "In recent years deep neural networks can beat boosting tree algorithms"
❖ "Accuracy of XGBoost algorithms is generally lower than LightGBM"
❖ "Kaggle competitions and industry ML roles have quite a gap"
❖ "System introduces a lot of hyperparameters"
❖ "XGBoost tuning is difficult"
❖ "Sparsity-aware approach to split-finding introduces another layer of learned uncertainty … no data exploring effect on accuracy"
❖ "No evaluation on their weighted quantile sketch algorithm"
❖ "Finding correct block size for approximate algorithm … need to tune"
❖ "Not much detail on how the system parallelizes its workload"
❖ "Assumes readers have deep understanding about Boosting algorithm"


Innovativeness and Depth Ratings

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability

Advanced Issues in ML Scalability

❖ Streaming/Incremental ML at scale
❖ Scaling Massive Task Parallelism in ML
❖ Pushing ML Through Joins
❖ Larger-than-Memory Models
❖ Models with More Complex Data Access Patterns
❖ Delay-Tolerant / Geo-Distributed / Federated ML
❖ Scaling End-to-End ML Pipelines

Scalable Incremental ML

❖ Datasets keep growing in many real-world ML applications
❖ Incremental ML: update a learned model using only the new data
❖ Streaming ML is one variant (near real-time)
❖ It is non-trivial to make all ML algorithms incremental
❖ SGD-based procedures are "online" by default; just resume gradient descent on the new data
❖ ML/data mining folks have studied how to make other ML algorithms incremental; accuracy-runtime tradeoffs
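
As a small illustration of the "just resume SGD" point, scikit-learn's SGDClassifier exposes partial_fit for exactly this kind of incremental update; the random arrays here are only stand-ins for an old batch and a newly arrived batch.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                                   # linear model trained with SGD
X_old = np.random.rand(500, 5); y_old = np.random.randint(0, 2, 500)
clf.partial_fit(X_old, y_old, classes=[0, 1])           # initial training pass

X_new = np.random.rand(100, 5); y_new = np.random.randint(0, 2, 100)
clf.partial_fit(X_new, y_new)                           # incremental update on new data only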

Scalable Incremental ML: SageMaker

❖ Industrial cloud-native ML requirements:
❖ Incremental training and model freshness
❖ Predictable training runtime
❖ Elasticity and pause-resume
❖ Trainable on ephemeral (non-archived) data
❖ Automatic model/hyper-parameter tuning
❖ Design: Streaming ML algorithms that fit into a 3-function API of Initialize-Update-Finalize (akin to a UDA)
❖ All SGD-based procedures (GLMs, factorization machines)
❖ Variants of K-Means, PCA, topic models, forecasting


Scalable Incremental ML: SageMaker

❖ Parameter Server-based architecture with streaming workers

Massive Task Parallelism: Ray

❖ Advanced ML applications that use reinforcement learning produce large numbers of short-running tasks
❖ Training robots, self-driving cars, etc.
❖ ML-based cluster resource management
❖ Ray is an ML system that automatically scales such tasks from a single node to large clusters
❖ Tasks can have shared state for control
❖ Data is replicated/broadcast

https://www.usenix.org/system/files/osdi18-moritz.pdf
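
A minimal sketch of Ray's task API for launching a large number of short-running tasks; the squaring function is just a stand-in for a short simulation or rollout.

import ray

ray.init()

@ray.remote
def short_task(x):
    return x * x          # stand-in for a short-running rollout/simulation

futures = [short_task.remote(i) for i in range(1000)]   # scheduled across the cluster
print(sum(ray.get(futures)))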

Pushing ML Through Joins

❖ Most real-world tabular/relational datasets are multi-table
❖ A central fact table with the target; many dimension tables
❖ Key-foreign key joins are ubiquitous to denormalize such data into a single table before ML training
❖ The single table has a lot of data redundancy
❖ In turn, many ML algorithms end up having a lot of computational redundancy
❖ Pushing ML computations through joins enables computations directly over the normalized database

Pushing ML Through Joins

❖ Orion first showed how to rewrite ML computations to operate on the join input rather than the join output, aka "factorized ML"
❖ GLMs with BGD, L-BFGS, Newton methods
❖ The rewritten implementations fit within the UDA / MapReduce abstractions
❖ Morpheus offered a unified abstraction based on linear algebra to factorize many ML algorithms in one formalism
❖ K-Means clustering, matrix factorization, etc.

https://adalabucsd.github.io/papers/2015_Orion_SIGMOD.pdf
https://adalabucsd.github.io/papers/2017_Morpheus_VLDB.pdf

Larger-than-Memory Models

❖ Some ML algorithms may have state that does not fit in RAM
❖ Need to shard the model too (not just the data) to stage updates
❖ Specific to the ML algorithm's data access patterns
❖ Example: Matrix factorization trained with SGD

http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf

Outline

❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability


Evaluating “Scalability”

https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

❖ Message: Not just scalability but efficiency matters too; not just speedup curve but time vs strong (single-node) baselines