DSC 102 Systems for Scalable Analytics Arun Kumar Topic 3: - PowerPoint PPT Presentation

Quantifying Benefit of Parallelism Runtime speedup (fixed data size) Runtime speedup 12 2 Linear Speedup 8 Linear Scaleup 1 Sublinear 4 0.5 Scaleup Sublinear 1 Speedup 1 4 8 12 1 4 8 12 Number of workers Factor (# workers, data size) Speedup plot / Strong scaling Scaleup plot / Weak scaling Q: Is superlinear speedup/scaleup ever possible? 30

Idle Times in Task Parallelism ❖ Due to varying task completion times and varying degrees of parallelism in workload, idle workers waste resources Example: Gantt Chart visualization of schedule: Given 3 workers T6 10 W1: T1 T1 T4 T6 T6 W2: T2 T5 T5 T5 T5 20 T4 T5 5 W3: T3 T3 T3 15 0 5 10 15 20 25 30 35 5 T1 T2 T3 10 Idle times D 31

Idle Times in Task Parallelism ❖ Due to varying task completion times and varying degrees of parallelism in workload, idle workers waste resources Example: ❖ In general, overall workload’s Given 3 completion time on task-parallel workers T6 10 setup is always lower bounded by the longest path in the task graph 20 T4 T5 ❖ Possibility: A task-parallel scheduler 5 can “release” a worker if it knows 15 5 that will be idle till the end T1 T2 T3 10 ❖ Can saves costs in cloud D 32

Calculating Task Parallelism Speedup ❖ Due to varying task completion times and varying degrees of parallelism in workload, idle workers waste resources Example: Given 3 Completion time 10+5+15+5+ workers T6 10 with 1 worker 20+10 = 65 Parallel 35 20 T4 T5 completion time 5 15 Speedup = 65/35 = 1.9x 5 T1 T2 T3 10 Ideal/linear speedup is 3x Q: Why is it only 1.9x? D 33

Outline ❖ Basics of Parallelism ❖ Task Parallelism ❖ Single-Node Multi-Core; SIMD; Accelerators ❖ Basics of Scalable Data Access ❖ Paged Access; I/O Costs; Layouts/Access Patterns ❖ Scaling Data Science Operations ❖ Data Parallelism: Parallelism + Scalability ❖ Data-Parallel Data Science Operations ❖ Optimizations and Hybrid Parallelism 34

Multi-core CPUs ❖ Modern machines often have multiple processors and multiple cores per processor; hierarchy of shared caches ❖ OS Scheduler now controls what cores/processors assigned to what processes/threads when 35

Single-Instruction Multiple-Data ❖ Single-Instruction Multiple-Data (SIMD): A fundamental form of parallel processing in which different chunks of data are processed by the “same” set of instructions shared by multiple processing units (PUs) ❖ Aka “vectorized” instruction processing (vs “scalar”) ❖ Data science workloads are very amenable to SIMD! Example for SIMD in data science: 36

SIMD Generalizations ❖ Single-Instruction Multiple Thread (SIMT): Generalizes notion of SIMD to different threads concurrently doing so ❖ Each thread may be assigned a PU or a whole core ❖ Single-Program Multiple Data (SPMD): A higher level of abstraction generalizing SIMD operations or programs ❖ Under the covers, may use multiple processes or threads ❖ Each chunk of data processed by one core/PU ❖ Applicable to any CPU, not just “vectorized” PUs ❖ Most common form of parallel programming 37

“Data Parallel” Multi-core Execution D1 D2 D3 D4 D1 D3 D2 D4 D1 D2 D3 D4 38

Quantifying Efficiency: Speedup Q: How do we quantify the performance benefits of multi-core parallelism? ❖ As with task parallelism, we measure the speedup: Completion time given only 1 core Speedup = Completion time given n (>1) core ❖ In data science computations, an often useful surrogate for completion time is the instruction throughput FLOP/s , which is the number of floating point operations per second ❖ Modern data processing programs, especially deep learning (DL) may have billions of FLOPs aka GFLOPs! 39

Amdahl’s Law Q: But given n cores, can we get a speedup of n? It depends! (Just like it did with task parallelism) ❖ Amdahl’s Law: Formula to upper bound possible speedup ❖ A program has 2 parts: one that benefits from multi-core parallelism and one that does not ❖ Non-parallel part could be for control, memory stalls, etc. 1 core: n cores: T yes + T no n(1 + f) T yes T yes /n Speedup = = T no n + f T no T yes /n + T no Denote T yes /T no = f 40

Amdahl’s Law Speedup = n(1 + f) n + f f = T yes /T no Parallel portion = f / (1 + f) 41

Hardware Trends on Parallelism ❖ Multi-core processors grew rapidly in early 2000s but hit physical limits due to packing efficiency and power issues ❖ End of “Moore’s Law” and End of “Dennard Scaling” ❖ Basic conclusion of hardware trends: it is hard for general- purposes CPUs to sustain FLOP-heavy programs (e.g., ML) ❖ Motivated the rise of “accelerators” for some classes of programs 42

Hardware Accelerators: GPUs ❖ Graphics Processing Unit (GPU) : Custom processor to run matrix/tensor operations faster ❖ Basic idea: use tons of ALUs; massive data parallelism (like SIMD “on steroids”); Titan X offers ~11 TFLOP/s! ❖ Popularized by NVIDIA in early 2000s for video games, graphics, and video/multimedia; now for deep learning ❖ CUDA released in 2007; later wrapper APIs on top: CuDNN, CuSparse, CuDF (RapidsAI), etc. 43

GPUs on the Market 44

Other Hardware Accelerators ❖ Tensor Processing Unit (TPU) : Even more specialized tensor processor for deep learning inference; ~45 TFLOP/s! ❖ An “application-specific integrated circuit” (ASIC) created by Google in mid 2010s; use in the AlphaGo game! ❖ Field-Programmable Gate Array (FPGA) : Configurable processors for any class of programs; ~0.5-3 TFLOPs/s ❖ Cheaper; h/w-s/w stacks for ML/DL; Azure/AWS support 45

Comparing Modern Parallel Hardware ASICs Multi-core CPU GPU FPGA (e.g., TPUs) Peak FLOPS/s Moderate High High Very High Power High Very High Very Low Low Consumption Cost Low High Very High Highest Generality / Highest Medium Very High Lowest Flexibility Fitness for DL Potential exists Poor Fit Best Fit Low Fit Training but unrealized Fitness for DL Moderate Moderate Good Fit Best Fit Inference Cloud Vendor All All AWS, Azure GCP Support 46 https://www.embedded.com/leveraging-fpgas-for-deep-learning/

Recap: Memory Hierarchy CPU ~MBs A C C E S S C Y C L E S ~100GB/s ~$2/MB Cache Access Speed 100s Capacity Main Price ~10GBs ~10GB/s ~$5/GB Memory 10 5 – 10 6 ~TBs ~GB/s Flash Storage ~$200/TB 10 7 – 10 8 Magnetic Hard Disk Drive ~10TBs (HDD) ~200MB/s ~$40/TB 48

Memory Hierarchy in Action Rough sequence of events when program is executed Arithmetic done within CPU CPU CU ALU Store; Retrieve Retrieve; ‘21’ Registers DRAM ‘21’ Process ‘21’ Caches Q: What if this does not fit in DRAM? Commands interpreted Bus Store; Retrieve I/O for Display I/O for code I/O for data Disk Monitor tmp.py tmp.csv 49

Scalable Data Access Central Issue : Large data file does not fit entirely in DRAM Basic Idea : “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes) 4 key regimes of scalability / staging reads: ❖ Single-node disk : Paged access from file on local disk ❖ Remote read : Paged access from disk(s) over a network ❖ Distributed memory : Fits on a cluster’s total DRAM ❖ Distributed disk : Fits on a cluster’s full set of disks 50

Paged Data Access to DRAM Basic Idea : “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes) Page in an F1P1 F2P1 F2P2 occupied frame OS Cache F1P3 Free frames in DRAM F1P2 F5P7 F3P1 F3P5 ❖ Recall that files are already virtually Disk split and stored as pages on both disk F2P2 … F1P1 F1P2 F1P3 F2P1 and DRAM! 52

Page Management in DRAM Cache ❖ Caching : The act of reading in page(s) from disk to DRAM ❖ Eviction : The act of removing page(s) from DRAM ❖ Spilling : The act of writing out page(s) from DRAM to disk ❖ If a page in DRAM is “ dirty ” (i.e., some bytes were written), eviction requires a spill; o/w, ignore that page ❖ Depending on what pages are needed for program, the set of memory-resident pages will change over time ❖ Cache Replacement Policy : The algorithm that chooses which page frame(s) to evict when a new page has to be cached but the OS cache in DRAM is full ❖ Popular policies include Least Recently Used, Most Recently Used, Clock, etc. (more shortly) 53

Quantifying I/O: Disk and Network ❖ Page reads/writes to/from DRAM from/to disk incur latency ❖ Disk I/O Cost : Abstract counting of number of page I/Os; can map to bytes given page size ❖ Sometimes, programs read/write data over network ❖ Communication/Network I/O Cost : Abstract counting of number of pages/bytes sent/received over network ❖ I/O cost is abstract ; mapping to latency is hardware-specific Example : Suppose a data file is 40GB; page size is 4KB I/O cost to read file = 10 million page I/Os Disk with I/O throughput: 800 MB/s 40K/800 = 50s Network with speed: 200 MB/s 40K/200 = 200s 54

Scaling to (Local) Disk Basic Idea : “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes) Suppose OS Cache has only 4 frames; initially empty Process wants to read file’s pages Evict P1 Evict P2 one by one and then discard: aka OS Cache “ filescan ” access pattern in DRAM Read P1 Cache is Read P2 full! Read P3 Cache Read P4 Disk Repl. Read P5 needed … … Read P6 P1 P1 P2 P2 P3 P3 P4 P4 P5 P5 P6 P6 Total I/O cost: 6 55

Scaling to (Local) Disk ❖ In general, scalable programs stage access to pages of file on disk and efficiently use available DRAM ❖ Recall that typically DRAM size << Disk size ❖ Modern DRAM sizes can be 10s of GBs; so we read a “chunk”/“block” of file at a time (say, 1000s of pages) ❖ On magnetic hard disks, such chunking leads to more sequential I/Os , raising throughput and lowering latency! ❖ Similarly, write a chunk of dirtied pages at a time 56

Generic Cache Replacement Policies Q: What to do if number of cache frames is too few for file? ❖ Cache Replacement Policy : Algorithm to decide which page frame(s) to evict to make space for new page reads ❖ Typical ranking criteria for frames: recency of use, frequency of use, number of processes reading it, etc. ❖ Typical optimization goal: Reduce overall page I/O costs ❖ A few well-known policies: ❖ Least Recently Used (LRU) : Evict page that was used the longest time ago ❖ Most Recently Used (MRU) : Opposite of LRU ❖ Clock Algorithm (lightweight approximation of LRU) ❖ ML-based caching policies are “hot” nowadays! :) Take CSE 132C for more cache replacement details 57

Data Layouts and Access Patterns ❖ Recall that data layouts and access patterns affect what data subset gets cached in higher level of memory hierarchy ❖ Recall matrix multiplication example and CPU caches! ❖ Key Principle: Optimizing data layout on disk for data file based on data access pattern can help reduce I/O costs ❖ Applies to both magnetic hard disk and flash SSDs ❖ But especially critical for former due to vast differences in latency of random vs sequential access! 58

Row-store vs Column-store Layouts ❖ A common dichotomy when serializing 2-D structured data (relations, matrices, DataFrames) to file on disk A B C D Say, a page can fit only 4 cell values 1a 1b 1c 1d 2a 2b 2c 2d 1a,1b, 2a,2b, 3a,3b, … Row-store: 1c,1d 2c,2d 3c,3d 3a 3b 3c 3d 4a 4b 4c 4d 1a,2a, 1b,2b, … Col-store: 5a,6a 5a 5b 5c 5d 3a,4a 3b,4b 6a 6b 6c 6d ❖ Based on data access pattern of program, I/O costs with row- vs col-store can be orders of magnitude apart! ❖ Can generalize to higher dimensions for storing tensors 59

Example: Dask’s Scalable Access Basic Idea : “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes) ❖ This is how Dask DF scales to disk-resident data files ❖ “Virtual” split: each split is under- the-covers a Pandas DF ❖ Dask API is a “wrapper” around Pandas API to scale ops to splits and put all results together ❖ If file is too large for DRAM, can do manual repartition() to get physically smaller splits (< ~1GB) 60 https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead

Hybrid/Tiled/“Blocked” Layouts ❖ Sometimes, it is beneficial to do a hybrid, especially in analytical RDBMSs and matrix/tensor processing systems A B C D Say, a page can fit only 4 cell values 1a 1b 1c 1d Hybrid store with 2x2 tiled layout: 2a 2b 2c 2d 3a 3b 3c 3d 4a 4b 4c 4d 1a,1b, 1c,1d, 3a,3b, … 2a,2b 2c,2d 4a,4b 5a 5b 5c 5d 6a 6b 6c 6d What data layout will yield lower I/O costs (row vs col vs tiled) depends on data access pattern of the program! 61

Scaling with Remote Reads Basic Idea : “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes) ❖ Same approach as scaling to local disk, except there may not be a local disk! ❖ Instead, scale by staging reads of pages over the network from remote disk/disks (e.g., from S3) ❖ Same issues of managing a DRAM cache, picking a cache replacement policy, etc. matter here too ❖ More restrictive than scaling with local disk, since spilling is not possible or requires costly network I/Os ❖ Good in practice for a one-shot filescan access pattern ❖ Better to combine with local disk to cache (like in PA1!) 62

Scaling Data Science Operations ❖ Scalable data access for key representative examples of programs/operations that are ubiquitous in data science: ❖ DB systems: ❖ Relational select ❖ Non-deduplicating project ❖ Simple SQL aggregates ❖ SQL GROUP BY aggregates ❖ ML systems: ❖ Matrix sum/norms ❖ Gramian matrix ❖ (Stochastic) Gradient Descent 64

<latexit sha1_base64="CFyZsUvqpr0p06CBht34mIB8QyY=">AB/3icbVDLSgMxFM34rPU1KrhxE1qEuikzVtGNUOrGZRX7gHaYZtK0DU0yQ5IRytiFv+LGhSJu/Q13/o1pOwtPXDhcM693HtPEDGqtON8W0vLK6tr65mN7ObW9s6uvbdfV2EsManhkIWyGSBFGBWkpqlmpBlJgnjASCMYXk/8xgORiobiXo8i4nHUF7RHMdJG8u3DtqJ9jvykAq9gp1MKcmNYuDvx7bxTdKaAi8RNSR6kqPr2V7sb4pgToTFDSrVcJ9JegqSmJFxth0rEiE8RH3SMlQgTpSXTO8fw2OjdGEvlKaEhlP190SCuFIjHphOjvRAzXsT8T+vFevepZdQEcWaCDxb1IsZ1CGchAG7VBKs2cgQhCU1t0I8QBJhbSLmhDc+ZcXSf206JaK57dn+XIljSMDjkAOFIALkAZ3IAqAEMHsEzeAVv1pP1Yr1bH7PWJSudOQB/YH3+ADpclEs=</latexit> Scaling to Disk: Relational Select σ B =“3 b ” ( R ) R A B C D 1a 1b 1c 1d 2a 2b 2c 2d 1a,1b, 2a,2b, 3a,3b, Row-store: 1c,1d 2c,2d 3c,3d 3a 3b 3c 3d 4a 4b 4c 4d 4a,4b, 5a,5b, 6a,6b, 5a 5b 5c 5d 4c,4d 5c,5d 6c,6d 6a 6b 6c 6d ❖ Straightforward filescan data access pattern ❖ Read pages/chunks from disk to DRAM one by one ❖ CPU applies predicate to tuples in pages in DRAM ❖ Copy satisfying tuples to temporary output pages ❖ Use LRU for cache replacement, if needed ❖ I/O cost: 6 (read) + output # pages (write) 65

<latexit sha1_base64="CFyZsUvqpr0p06CBht34mIB8QyY=">AB/3icbVDLSgMxFM34rPU1KrhxE1qEuikzVtGNUOrGZRX7gHaYZtK0DU0yQ5IRytiFv+LGhSJu/Q13/o1pOwtPXDhcM693HtPEDGqtON8W0vLK6tr65mN7ObW9s6uvbdfV2EsManhkIWyGSBFGBWkpqlmpBlJgnjASCMYXk/8xgORiobiXo8i4nHUF7RHMdJG8u3DtqJ9jvykAq9gp1MKcmNYuDvx7bxTdKaAi8RNSR6kqPr2V7sb4pgToTFDSrVcJ9JegqSmJFxth0rEiE8RH3SMlQgTpSXTO8fw2OjdGEvlKaEhlP190SCuFIjHphOjvRAzXsT8T+vFevepZdQEcWaCDxb1IsZ1CGchAG7VBKs2cgQhCU1t0I8QBJhbSLmhDc+ZcXSf206JaK57dn+XIljSMDjkAOFIALkAZ3IAqAEMHsEzeAVv1pP1Yr1bH7PWJSudOQB/YH3+ADpclEs=</latexit> Scaling to Disk: Relational Select σ B =“3 b ” ( R ) Reserved for writing output 3a,3b, data of program (may be 3c,3d OS Cache spilled to a temp. file) in DRAM CPU finds a matching tuple! Copies that to output page Need to evict some page Disk 1a,1b, 1a,1b, 2a,2b, 2a,2b, 3a,3b, 3a,3b, LRU says kick out page 1 1c,1d 1c,1d 2c,2d 2c,2d 3c,3d 3c,3d Then page 2 and so on … 4a,4b, 4a,4b, 5a,5b, 5a,5b, 6a,6b, 6a,6b, 4c,4d 4c,4d 5c,5d 5c,5d 6c,6d 6c,6d 66

Scaling to Disk: Non-dedup. Project SELECT C FROM R R A B C D 1a 1b 1c 1d 2a 2b 2c 2d 1a,1b, 2a,2b, 3a,3b, Row-store: 1c,1d 2c,2d 3c,3d 3a 3b 3c 3d 4a 4b 4c 4d 4a,4b, 5a,5b, 6a,6b, 5a 5b 5c 5d 4c,4d 5c,5d 6c,6d 6a 6b 6c 6d ❖ Again, straightforward filescan data access pattern ❖ Similar I/O behavior as the previous selection case ❖ I/O cost: 6 (read) + output # pages (write) 67

Scaling to Disk: Non-dedup. Project SELECT C FROM R R A B C D 1a 1b 1c 1d 1a,2a, 1b,2b, 2a 2b 2c 2d Col-store: 5a,6a 5b,6b 3a,4a 3b,4b 3a 3b 3c 3d 4a 4b 4c 4d 1c,2c, 1b,2b, 5c,6c 5b,6b 5a 5b 5c 5d 3c,4c 3b,4b 6a 6b 6c 6d ❖ Since we only need col C, no need to read other pages! ❖ I/O cost: 2 (read) + output # pages (write) ❖ Col-stores offer this advantage over row-store for SQL analytics queries (projects, aggregates, etc.) aka “OLAP” ❖ Main basis of column-store RDBMSs (e.g., Vertica) 68

Scaling to Disk: Simple Aggregates SELECT MAX(A) FROM R R A B C D 1a 1b 1c 1d 2a 2b 2c 2d 1a,1b, 2a,2b, 3a,3b, Row-store: 1c,1d 2c,2d 3c,3d 3a 3b 3c 3d 4a 4b 4c 4d 4a,4b, 5a,5b, 6a,6b, 5a 5b 5c 5d 4c,4d 5c,5d 6c,6d 6a 6b 6c 6d ❖ Again, straightforward filescan data access pattern ❖ Similar I/O behavior as the previous selection and non-deduplicating projection cases ❖ I/O cost: 6 (read) + output # pages (write) 69

Scaling to Disk: Simple Aggregates SELECT MAX(A) FROM R R A B C D 1a 1b 1c 1d 1a,2a, 1b,2b, 2a 2b 2c 2d Col-store: 5a,6a 5b,6b 3a,4a 3b,4b 3a 3b 3c 3d 4a 4b 4c 4d 1c,2c, 1b,2b, 5c,6c 5b,6b 5a 5b 5c 5d 3c,4c 3b,4b 6a 6b 6c 6d ❖ Similar to the non-dedup. project, we only need col A; no need to read other pages! ❖ I/O cost: 2 (read) + output # pages (write) 70

Scaling to Disk: Group By Aggregate SELECT A, SUM(D) R A B C D FROM R GROUP BY A a1 1b 1c 4 a2 2b 2c 3 ❖ Now it is not straightforward due to a1 3b 3c 5 the GROUP BY! a3 4b 4c 1 ❖ Need to “collect” all tuples in a group a2 5b 5c 10 and apply agg. func. to each a1 6b 6c 8 ❖ Typically done with a hash table maintained in DRAM Hash table (output) ❖ Has 1 record per group and A Running Info. maintains “running information” a1 17 for that group’s agg. func. a2 13 ❖ Built on the fly during filescan of a3 1 R; has the output in the end! 71

Scaling to Disk: Group By Aggregate SELECT A, SUM(D) R A B C D FROM R GROUP BY A a1 1b 1c 4 a2 2b 2c 3 a1,1b, a2,2b, a1,3b, Row-store: a1 3b 3c 5 1c,4 2c,3 3c,5 a3 4b 4c 1 a2 5b 5c 10 a3,4b, a2,5b, a1,6b, 4c,1 5c,10 6c,8 a1 6b 6c 8 Hash table in DRAM ❖ Note that the sum for each group is constructed incrementally A Running Info. ❖ I/O cost: 6 (read) + output # pages a1 4 -> 9 -> 17 (write); just one filescan again! a2 3 -> 13 a3 1 Q: But what if hash table > DRAM size?! 72

Scaling to Disk: Group By Aggregate SELECT A, SUM(D) FROM R GROUP BY A Q: But what if hash table > DRAM size? ❖ Program will likely just crash! :) OS may keep swapping pages of hash table to/from disk; aka “thrashing” Q: How to scale to large number of groups? ❖ Divide and conquer! Split up R based on values of A ❖ HT for each split may fit in DRAM alone ❖ Reduce running info. size if possible 73

<latexit sha1_base64="WP4FUq7RsObclJ4fi/3JkXyXLa0=">AB+nicbVDLSsNAFJ34rPWV6tLNYBFclaQquiy6cSNUsA9oY5hMJ+3QySTM3Cil9lPcuFDErV/izr9xmahrQcu93DOvcydEySCa3Ccb2tpeWV1b2wUdzc2t7ZtUt7TR2nirIGjUWs2gHRTHDJGsBsHaiGIkCwVrB8Grqtx6Y0jyWdzBKmBeRvuQhpwSM5NulbpMpwDc46371vurbZafiZMCLxM1JGeWo+/ZXtxfTNGISqCBad1wnAW9MFHAq2KTYTVLCB2SPusYKknEtDfOTp/gI6P0cBgrUxJwpv7eGJNI61EUmMmIwEDPe1PxP6+TQnjhjblMUmCSzh4KU4EhxtMcI8rRkGMDCFUcXMrpgOiCAWTVtGE4M5/eZE0qxX3pHJ2e1quXeZxFNABOkTHyEXnqIauUR01EWP6Bm9ojfryXqx3q2P2eiSle/soz+wPn8Au5KTBA=</latexit> Scaling to Disk: Matrix Sum/Norms k M k 2 M 6x4 2 1 0 0 2 2 1 0 0 2,1, 2,1 0,1, 0 1 0 2 Row-store: 0,0 0,0 0,2 0 0 1 2 3 0 1 0 0,0, 3,0, 3,0, 1,2 1,0 1,0 3 0 1 0 ❖ Again, straightforward filescan data access pattern ❖ Very similar to relational simple aggregate ❖ Running info. in DRAM for sum of squares of cells ❖ 0 -> 5 -> 10 -> 15 -> 20 -> 30 -> 40 ❖ I/O cost: 6 (read) + output # pages (write) ❖ Col-store and tiled-store have same I/O cost; why? 74

<latexit sha1_base64="OMhQmIb57LnfsVrkZPsysEXtJDw=">AB7HicbVBNSwMxEJ2tX7V+VT16CRbBU9n1Az0WvXgpVOi2hXYt2TbhmaTJckKZelv8OJBEa/+IG/+G9N2D1p9MPB4b4aZeWHCmTau+UVlbX1jeKm6Wt7Z3dvfL+QUvLVBHqE8ml6oRYU84E9Q0znHYSRXEctoOx7czv/1IlWZSNM0koUGMh4JFjGBjJb/+0ET1frniVt050F/i5aQCORr98mdvIEkaU2EIx1p3PTcxQYaVYTamXapgMsZD2rVU4JjqIJsfO0UnVhmgSCpbwqC5+nMiw7HWkzi0nTE2I73szcT/vG5qousgYyJDRVksShKOTISzT5HA6YoMXxiCSaK2VsRGWGFibH5lGwI3vLf0nrOqdVy/vLyq1mzyOIhzBMZyCB1dQgztogA8EGDzBC7w6wnl23pz3RWvByWcO4Recj2/xHY4f</latexit> Scaling to Disk: Gramian Matrix M T M M 6x4 2 1 0 0 2 1 0 0 2,1, 2,1 0,1, 0 1 0 2 Row-store: 0,0 0,0 0,2 0 0 1 2 3 0 1 0 0,0, 3,0, 3,0, 1,2 1,0 1,0 3 0 1 0 ❖ A bit tricky, since output may not fit entirely in DRAM ❖ Similar to GROUP BY difficult case ❖ Output here is 4x4, i.e., 4 pages; only 3 can be in DRAM! ❖ Each row will need to update entire output matrix ❖ Row-store can be a poor fit for such matrix algebra ❖ What about col-store or tiled-store? 75

<latexit sha1_base64="OMhQmIb57LnfsVrkZPsysEXtJDw=">AB7HicbVBNSwMxEJ2tX7V+VT16CRbBU9n1Az0WvXgpVOi2hXYt2TbhmaTJckKZelv8OJBEa/+IG/+G9N2D1p9MPB4b4aZeWHCmTau+UVlbX1jeKm6Wt7Z3dvfL+QUvLVBHqE8ml6oRYU84E9Q0znHYSRXEctoOx7czv/1IlWZSNM0koUGMh4JFjGBjJb/+0ET1frniVt050F/i5aQCORr98mdvIEkaU2EIx1p3PTcxQYaVYTamXapgMsZD2rVU4JjqIJsfO0UnVhmgSCpbwqC5+nMiw7HWkzi0nTE2I73szcT/vG5qousgYyJDRVksShKOTISzT5HA6YoMXxiCSaK2VsRGWGFibH5lGwI3vLf0nrOqdVy/vLyq1mzyOIhzBMZyCB1dQgztogA8EGDzBC7w6wnl23pz3RWvByWcO4Recj2/xHY4f</latexit> Scaling to Disk: Gramian Matrix M 6x4 M T M 2 1 0 0 2x2 Tiled store A B 2 1 0 0 0 1 0 2 O1 O2 A T C T E T A B C D 0 0 1 2 D C O3 O4 B T D T F T 3 0 1 0 F E F E 3 0 1 0 ❖ Read A, C, E one by one to get O1 = A T A + C T C + E T E; O1 is incrementally computed; write O1 out; I/O: 3 (r) + 1 (w) ❖ Likewise with B, D, F for O4; I/O: 3 (r) + 1 (w) ❖ Read A, B and put A T B in O2; read C, D to add C T D to O2; read E, F to add E T F to O2; write O2 out; I/O: 6 + 1 ❖ Likewise with B,A; D,C; F,E for O3; I/O: 6 + 1 ❖ Max I/O cost: 18 (r) + 4(w); scalable on both dimensions! 76

Scalable Matrix/Tensor Algebra ❖ Almost all basic matrix operations can be DRAM/cache- efficiently implemented using tiled operations ❖ Tile-based storage and/or processing is common in ML systems that support scalable matrix/tensor algebra ❖ SystemML, pBDR, and Dask Arrays (over NumPy) for matrix ops WRT disk-DRAM ❖ NVIDIA CUDA for tensor ops WRT DRAM-GPU caches 77

<latexit sha1_base64="uvgXp+dpvG7N5TecG7ye7m4Eg=">AB+XicbVC7TsMwFL0pr1JeAUYWiwqJqUp4CMYKFsYi9SW1IXJcp7XqOJHtFKqof8LCAEKs/Akbf4PTdoCWI1k6Oude3eMTJwp7TjfVmFldW19o7hZ2tre2d2z9w+aKk4loQ0S81i2A6woZ4I2NOcthNJcRw2gqGt7nfGlGpWCzqepxQL8J9wUJGsDaSb9vdCOtBEGaPk4c6evKZb5edijMFWibunJRhjpvf3V7MUkjKjThWKmO6yTay7DUjHA6KXVTRNMhrhPO4YKHFHlZdPkE3RilB4KY2me0Giq/t7IcKTUOArMZJ5TLXq5+J/XSXV47WVMJKmgswOhSlHOkZ5DajHJCWajw3BRDKTFZEBlphoU1bJlOAufnmZNM8q7nl8v6iXL2Z1GEIziGU3DhCqpwBzVoAIERPMrvFmZ9WK9Wx+z0YI13zmEP7A+fwCVhJOh</latexit> <latexit sha1_base64="wtingO0LqaDZ0Zshlk1cGJ54IY=">ACO3icbVDLSgMxFM3UV62vqks3wSJUKWXGB7opFN24rGJroa1DJs20oUlmSDJqGea/3PgT7ty4caGIW/emD6haDwROzrmXe+/xQkaVtu1nKzUzOze/kF7MLC2vrK5l1zdqKogkJlUcsEDWPaQIo4JUNdWM1ENJEPcYufZ6ZwP/+pZIRQNxpfshaXHUEdSnGkjudnLJke6/nxXKzB0tw+KU6RrLDqUjceOInsKki7sa05CQ3ArJ836UF6OcnFYV7l+7utmcXbSHgNPEGZMcGKPiZp+a7QBHnAiNGVKq4dihbpkVNMWMJlmpEiIcA91SMNQgThRrXh4ewJ3jNKGfiDNExoO1Z8dMeJK9blnKgd7qr/eQPzPa0TaP2nFVISRJgKPBvkRgzqAgyBhm0qCNesbgrCkZleIu0girE3cGROC8/fkaVLbLzoHxaOLw1z5dBxHGmyBbZAHDjgGZXAOKqAKMHgAL+ANvFuP1qv1YX2OSlPWuGcT/IL19Q0ZDK6q</latexit> Numerical Optimization in ML ❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem ❖ E.g., logistic and linear regression, linear SVM, etc. ❖ Aka “Empirical Risk Minimization” (ERM) approach ❖ Computes “loss” of predictions over labeled examples n w ∗ = argmin w X l ( y i , f ( w , x i )) i =1 ❖ Hyperplane-based models aka Generalized Linear Models (GLMs) use f() that is a scalar function of distances: w T x i 78

<latexit sha1_base64="JXVS3ntMe9/vLgGdAsMHDEbfGl8=">ACOXicbVDLSgMxFM34rPVdekmWIQWSpnxgW4KRTcuXFSwD+i0QybNtKGZzJBk1DLMb7nxL9wJblwo4tYfMG1noa0HAodziX3HjdkVCrTfDEWFpeWV1Yza9n1jc2t7dzObkMGkcCkjgMWiJaLJGUk7qipFWKAjyXUa7vBy7DfviJA04LdqFJKOj/qcehQjpSUnV7M5chmC1wXbR2rgevF90o0Lw2JShBVoy8h3Ylqxki6HaZIVRg4tQW9uoPTg0GLRyeXNsjkBnCdWSvIgRc3JPdu9AEc+4QozJGXbMkPViZFQFDOSZO1IkhDhIeqTtqYc+UR24snlCTzUSg96gdCPKzhRf0/EyJdy5Ls6Od5Wznpj8T+vHSnvBNTHkaKcDz9yIsYVAEc1wh7VBCs2EgThAXVu0I8QAJhpcvO6hKs2ZPnSeOobB2XT29O8tWLtI4M2AcHoAscAaq4ArUQB1g8AhewTv4MJ6MN+PT+JpGF4x0Zg/8gfH9A82Oq7Y=</latexit> <latexit sha1_base64="ayIl16A2siz4c7l2gSKsLvcCX4=">ACOXicbVDLahtBEJyVk1hRXop9zGWICEiEiF3HJj4K+5JDgpED9AqonfUKw2anV1mei3Eot/yxX+RW8AXHxJCrvmBjB6HSE7BQFVzXRXlClpyfe/e6WDBw8fHZYfV548fb8RfXlUdemuRHYEalKT8Ci0pq7JAkhf3MICSRwl40u1z5vSs0Vqb6Cy0yHCYw0TKWAshJo2o7TICmUVzMl1+L+uxt0FjyUGFMYEw657u897xEAl4qCFSwD/V9wONUbXmN/01+H0SbEmNbdEeVb+F41TkCWoSCqwdBH5GwIMSaFwWQlzixmIGUxw4KiGBO2wWF+5G+cMuZxatzTxNfqvxMFJNYuksglV4vafW8l/s8b5BSfDwups5xQi81Hca4pXxVIx9Lg4LUwhEQRrpduZiCAUGu7IorIdg/+T7pnjSD982z6e1sW2jJ7xV6zOgvYB9ZiH1mbdZhg1+yW/WA/vRvzvl/d5ES9525pjtwPvzF5+7rMc=</latexit> <latexit sha1_base64="Gn3z209XWihwvStV6+mufXVXz8s=">ACH3icbVDLSgMxFM3UV62vUZdugkVoZQZ35tC0Y0LFxXsA9o6ZNJMG5rJDElGLcP8iRt/xY0LRcRd/8b0saitBwKHc84l9x43ZFQqyxoaqaXldW19HpmY3Nre8fc3avJIBKYVHAtFwkSMclJVDHSCAVBvstI3e1fj/z6IxGSBvxeDULS9lGXU49ipLTkmOe3uZaPVM/14qckD0uwJSPfiWnJTh4ZLmBQwvQm8kUnh2aztm1ipaY8BFYk9JFkxRcyfVifAkU+4wgxJ2bStULVjJBTFjCSZViRJiHAfdUlTU458Itvx+L4EHmlA71A6McVHKuzEzHypRz4rk6O9pTz3kj8z2tGyrtsx5SHkSIcTz7yIgZVAEdlwQ4VBCs20ARhQfWuEPeQFjpSjO6BHv+5EVSOy7aJ8Wzu9Ns+WpaRxocgEOQAza4AGVwAyqgCjB4AW/gA3war8a78WV8T6IpYzqzD/7AGP4C0FChmg=</latexit> Batch Gradient Descent for ML n X L ( w ) = l ( y i , f ( w , x i )) i =1 ❖ In many cases, loss function l() is convex ; so is L() ❖ But closed-form minimization is typically infeasible ❖ Batch Gradient Descent: ❖ Iterative numerical procedure to find an optimal w ❖ Initialize w to some value w (0) n ❖ Compute gradient : X r L ( w ( k ) ) = r l ( y i , f ( w ( k ) , x i )) i =1 ❖ Descend along gradient: w ( k +1) w ( k ) � η r L ( w ( k ) ) (Aka Update Rule ) ❖ Repeat until we get close to w* , aka convergence 79

<latexit sha1_base64="g9I7Eq4QpMR9QRqIjWoJuHM2mqI=">ACN3icbVDLSiNBFK32bXxFXbopDEJcGLp9oMugm1kMomBUSMdwu3JbC6urm6rbhtDkr2Yzv+Fu3MxiBnHrH1iJWj0QMHhnHOpe0+UKWnJ9/94E5NT0zOzc/OlhcWl5ZXy6tqlTXMjsCFSlZrCwqbFBkhReZwYhiReRfcnA/qAY2Vqb6gXoatBG61jKUAclK7fBomQHdRXHT7N0V1d7vPQ4UxgTFpl3/yAuft8BAJeKghUsB/VscD2+1yxa/5Q/CvJBiRChvhrF1+DupyBPUJBRY2wz8jFoFGJCYb8U5hYzEPdwi01HNSRoW8Xw7j7fckqHx6lxTxMfqh8nCkis7SWRSw4WtePeQPzOa+YUH7UKqbOcUIv3j+JcUr5oETekQYFqZ4jIx0u3JxBwYEuapLroRg/OSv5HK3FuzVDs73K/XjUR1zbINtsioL2CGrsx/sjDWYL/YE/vH/nu/vb/es/fyHp3wRjPr7BO81zeJKquq</latexit> <latexit sha1_base64="TY7U6q5Tv+/ah6YJI3mw93bVcTs=">AB9HicbVC7TsMwFL0pr1JeBUYWiwqpLFVCQTBWsDAwFIk+pDaqHNdprTpOsJ2iKup3sDCAECsfw8bf4LQZoOVIlo7OuVf3+HgRZ0rb9reVW1ldW9/Ibxa2tnd294r7B0VxpLQBgl5KNseVpQzQRuaU7bkaQ48DhteaOb1G+NqVQsFA96ElE3wAPBfEawNpJ7V+4GWA89P3manvaKJbtiz4CWiZOREmSo94pf3X5I4oAKThWquPYkXYTLDUjnE4L3VjRCJMRHtCOoQIHVLnJLPQUnRilj/xQmic0mqm/NxIcKDUJPDOZRlSLXir+53Vi7V+5CRNRrKkg80N+zJEOUdoA6jNJieYTQzCRzGRFZIglJtr0VDAlOItfXibNs4pTrVzcn5dq1kdeTiCYyiDA5dQg1uoQwMIPMIzvMKbNbZerHfrYz6as7KdQ/gD6/MHZHeR3Q=</latexit> <latexit sha1_base64="Yg7pZgdEXbdSQ9HL4psFPomzX4=">AB+XicbVDLSsNAFL3xWesr6tLNYBHqpiRV0WXRjcsK9gFtLJPpB06mYSZSaWE/okbF4q49U/c+TdO2iy09cDA4Zx7uWeOH3OmtON8Wyura+sbm4Wt4vbO7t6+fXDYVFEiCW2QiEey7WNFORO0oZnmtB1LikOf05Y/us381phKxSLxoCcx9UI8ECxgBGsj9Wy7G2I9IP0afqYlqtn05dcirODGiZuDkpQY56z/7q9iOShFRowrFSHdeJtZdiqRnhdFrsJorGmIzwgHYMFTikyktnyafo1Ch9FETSPKHRTP29keJQqUnom8ksp1r0MvE/r5Po4NpLmYgTQWZHwoSjnSEshpQn0lKNJ8YgolkJisiQywx0asoinBXfzyMmlWK+5fL+olS7yesowDGcQBlcuIa3EdGkBgDM/wCm9War1Y79bHfHTFyneO4A+szx8/EpNo</latexit> <latexit sha1_base64="EX6+u+ozvgd46YH9lTyT79WA=">ACA3icbVDLSsNAFJ3UV62vqDvdDBah3ZTEB7osunHhoJ9QBPLZDph04mYWailBw46+4caGIW3/CnX/jpM1CWw9cOJxzL/fe40WMSmVZ30ZhYXFpeaW4Wlpb39jcMrd3WjKMBSZNHLJQdDwkCaOcNBVjHQiQVDgMdL2RpeZ374nQtKQ36pxRNwADTj1KUZKSz1z+HIYwheV5wAqaHnJw/pXVKxq2m1Z5atmjUBnCd2TsogR6Nnfjn9EMcB4QozJGXtiLlJkgoihlJS04sSYTwCA1IV1OAiLdZPJDCg+10od+KHRxBSfq74kEBVKOA093ZofKWS8T/O6sfLP3YTyKFaE4+kiP2ZQhTALBPapIFixsSYIC6pvhXiIBMJKx1bSIdizL8+T1lHNPq6d3pyU6xd5HEWwDw5ABdjgDNTBFWiAJsDgETyDV/BmPBkvxrvxMW0tGPnMLvgD4/MHpbqW4g=</latexit> <latexit sha1_base64="G17ClWIFMobSQkQcWLAcftDEbk=">AB+XicbVDLSsNAFL2pr1pfUZduBotQNyXxgS6LblxWsA9oa5lMJ+3QySTMTCol5E/cuFDErX/izr9x0mahrQcGDufcyz1zvIgzpR3n2yqsrK6tbxQ3S1vbO7t79v5BU4WxJLRBQh7KtocV5UzQhma03YkKQ48Tlve+DbzWxMqFQvFg5GtBfgoWA+I1gbqW/b3QDrkecnT+ljUnFP075dqrODGiZuDkpQ4563/7qDkISB1RowrFSHdeJdC/BUjPCaVrqxopGmIzxkHYMFTigqpfMkqfoxCgD5IfSPKHRTP29keBAqWngmcksp1r0MvE/rxNr/7qXMBHFmgoyP+THOkQZTWgAZOUaD41BPJTFZERlhiok1ZJVOCu/jlZdI8q7rn1cv7i3LtJq+jCEdwDBVw4QpqcAd1aACBCTzDK7xZifVivVsf89GCle8cwh9Ynz89jJNn</latexit> <latexit sha1_base64="l790wK2qXm9kyZmjZI9U8DgCRQ=">ACA3icbVDLSsNAFJ3UV62vqDvdDBah3ZTEB7osunHhoJ9QBPLZDph04mYWailBw46+4caGIW3/CnX/jpM1CWw9cOJxzL/fe40WMSmVZ30ZhYXFpeaW4Wlpb39jcMrd3WjKMBSZNHLJQdDwkCaOcNBVjHQiQVDgMdL2RpeZ374nQtKQ36pxRNwADTj1KUZKSz1z+HIYwheV5wAqaHnJw/pXVKxqm1Z5atmjUBnCd2TsogR6Nnfjn9EMcB4QozJGXtiLlJkgoihlJS04sSYTwCA1IV1OAiLdZPJDCg+10od+KHRxBSfq74kEBVKOA093ZofKWS8T/O6sfLP3YTyKFaE4+kiP2ZQhTALBPapIFixsSYIC6pvhXiIBMJKx1bSIdizL8+T1lHNPq6d3pyU6xd5HEWwDw5ABdjgDNTBFWiAJsDgETyDV/BmPBkvxrvxMW0tGPnMLvgD4/MHpDOW4Q=</latexit> <latexit sha1_base64="OxTqfbjDPbei9MJL9SITiklUMw=">AB+XicbVDLSsNAFL2pr1pfUZduBotQNyXxgS6LblxWsA9oa5lMJ+3QySTMTCol5E/cuFDErX/izr9x0mahrQcGDufcyz1zvIgzpR3n2yqsrK6tbxQ3S1vbO7t79v5BU4WxJLRBQh7KtocV5UzQhma03YkKQ48Tlve+DbzWxMqFQvFg5GtBfgoWA+I1gbqW/b3QDrkecnT+ljUnFO075dqrODGiZuDkpQ4563/7qDkISB1RowrFSHdeJdC/BUjPCaVrqxopGmIzxkHYMFTigqpfMkqfoxCgD5IfSPKHRTP29keBAqWngmcksp1r0MvE/rxNr/7qXMBHFmgoyP+THOkQZTWgAZOUaD41BPJTFZERlhiok1ZJVOCu/jlZdI8q7rn1cv7i3LtJq+jCEdwDBVw4QpqcAd1aACBCTzDK7xZifVivVsf89GCle8cwh9Ynz8BpNm</latexit> <latexit sha1_base64="HB3noaSvzO2mOiaEhFaq57JcQLw=">AB83icbVDLSsNAFL2pr1pfVZduBosgLkriA10W3bisYB/QxDKZTtqhk0mYmSgl5DfcuFDErT/jzr9x0mahrQcGDufcyz1z/JgzpW372yotLa+srpXKxubW9s71d29toSWiLRDySXR8rypmgLc0p91YUhz6nHb8U3udx6pVCwS93oSUy/EQ8ECRrA2kuGWI/8IH3KHk761Zpdt6dAi8QpSA0KNPvVL3cQkSkQhOleo5dqy9FEvNCKdZxU0UjTEZ4yHtGSpwSJWXTjNn6MgoAxRE0jyh0VT9vZHiUKlJ6JvJPKOa93LxP6+X6ODKS5mIE0FmR0KEo50hPIC0IBJSjSfGIKJZCYrIiMsMdGmpopwZn/8iJpn9ads/rF3XmtcV3UYDOIRjcOASGnALTWgBgRie4RXerMR6sd6tj9loySp29uEPrM8fHvyRvg=</latexit> <latexit sha1_base64="MZhyY1bDdwCYWm3cMV+ti7/2tw=">ACN3icbVDLSiNBFK12fMT4yoxLN4VBiAtDtw90GcaNCxEFo0I6htuV21pYXd1U3VZCk7+azfyGO924mEHc+gdWYhYaPVBwOdc6t4TZUpa8v0Hb+LH5NT0TGm2PDe/sLhU+fnrzKa5EdgUqUrNRQWldTYJEkKLzKDkEQKz6Ob/YF/fovGylSfUi/DdgJXWsZSADmpUzkKE6DrKC7u+pdFLVjv81BhTGBMesc/eb7zNniIBDzUECngh7XxwHqnUvXr/hD8KwlGpMpGO5U7sNuKvIENQkF1rYCP6N2AYakUNgvh7nFDMQNXGHLUQ0J2nYxvLvP15zS5XFq3NPEh+rHiQISa3tJ5JKDRe24NxC/81o5xXvtQuosJ9Ti/aM4V5xSPiRd6VBQarnCAgj3a5cXIMBQa7qsishGD/5KznbrAdb9Z2T7Wrj96iOElthq6zGArbLGuyAHbMmE+wPe2T/2H/vr/fkPXsv79EJbzSzD7Be30DhDerpw=</latexit> <latexit sha1_base64="q3pU+DFyxWArDM3ABiJpfISIf2Y=">AB8XicbVDLSgMxFL3js9ZX1aWbYBFclRkf6LoxmUF+8C2lEx6pw3NZIYko5Shf+HGhSJu/Rt3/o2ZdhbaeiBwOdecu7xY8G1cd1vZ2l5ZXVtvbBR3Nza3tkt7e03dJQohnUWiUi1fKpRcIl1w43AVqyQhr7Apj+6yfzmIyrNI3lvxjF2QzqQPOCMGis9dEJqhn6QPk16pbJbcacgi8TLSRly1Hqlr04/YkmI0jBtW57bmy6KVWGM4GTYifRGFM2ogNsWypiLqbThNPyLFV+iSIlH3SkKn6eyOlodbj0LeTWUI972Xif147McFVN+UyTgxKNvsoSAQxEcnOJ32ukBkxtoQyxW1WwoZUWZsSUVbgjd/8iJpnFa8s8rF3Xm5ep3XUYBDOIT8OASqnALNagDAwnP8ApvjnZenHfnYza65OQ7B/AHzucP/fiRIg=</latexit> Batch Gradient Descent for ML w (1) w (0) � η r L ( w (0) ) Gradient w (2) w (1) � η r L ( w (1) ) L ( w ) r L ( w (0) ) r L ( w (1) ) … w (1) w (0) w ∗ w (2) w ❖ Learning rate is a hyper-parameter selected by user or “AutoML” tuning procedures ❖ Number of iterations/epochs of BGD also hyper-parameter 80

<latexit sha1_base64="JXVS3ntMe9/vLgGdAsMHDEbfGl8=">ACOXicbVDLSgMxFM34rPVdekmWIQWSpnxgW4KRTcuXFSwD+i0QybNtKGZzJBk1DLMb7nxL9wJblwo4tYfMG1noa0HAodziX3HjdkVCrTfDEWFpeWV1Yza9n1jc2t7dzObkMGkcCkjgMWiJaLJGUk7qipFWKAjyXUa7vBy7DfviJA04LdqFJKOj/qcehQjpSUnV7M5chmC1wXbR2rgevF90o0Lw2JShBVoy8h3Ylqxki6HaZIVRg4tQW9uoPTg0GLRyeXNsjkBnCdWSvIgRc3JPdu9AEc+4QozJGXbMkPViZFQFDOSZO1IkhDhIeqTtqYc+UR24snlCTzUSg96gdCPKzhRf0/EyJdy5Ls6Od5Wznpj8T+vHSnvBNTHkaKcDz9yIsYVAEc1wh7VBCs2EgThAXVu0I8QAJhpcvO6hKs2ZPnSeOobB2XT29O8tWLtI4M2AcHoAscAaq4ArUQB1g8AhewTv4MJ6MN+PT+JpGF4x0Zg/8gfH9A82Oq7Y=</latexit> Data Access Pattern of BGD at Scale ❖ The data-intensive computation in BGD is the gradient ❖ In scalable ML, dataset D may not fit in DRAM ❖ Model w is typically small and DRAM-resident n X r L ( w ( k ) ) = r l ( y i , f ( w ( k ) , x i )) i =1 Q: What SQL op is this reminiscent of? ❖ Gradient is like SQL SUM over vectors (one per example) ❖ At each epoch, 1 filescan over D to get gradient ❖ Update of w happens normally in DRAM ❖ Monitoring across epochs for convergence needed ❖ Loss function L() is also just a SUM in a similar manner 81

<latexit sha1_base64="JXVS3ntMe9/vLgGdAsMHDEbfGl8=">ACOXicbVDLSgMxFM34rPVdekmWIQWSpnxgW4KRTcuXFSwD+i0QybNtKGZzJBk1DLMb7nxL9wJblwo4tYfMG1noa0HAodziX3HjdkVCrTfDEWFpeWV1Yza9n1jc2t7dzObkMGkcCkjgMWiJaLJGUk7qipFWKAjyXUa7vBy7DfviJA04LdqFJKOj/qcehQjpSUnV7M5chmC1wXbR2rgevF90o0Lw2JShBVoy8h3Ylqxki6HaZIVRg4tQW9uoPTg0GLRyeXNsjkBnCdWSvIgRc3JPdu9AEc+4QozJGXbMkPViZFQFDOSZO1IkhDhIeqTtqYc+UR24snlCTzUSg96gdCPKzhRf0/EyJdy5Ls6Od5Wznpj8T+vHSnvBNTHkaKcDz9yIsYVAEc1wh7VBCs2EgThAXVu0I8QAJhpcvO6hKs2ZPnSeOobB2XT29O8tWLtI4M2AcHoAscAaq4ArUQB1g8AhewTv4MJ6MN+PT+JpGF4x0Zg/8gfH9A82Oq7Y=</latexit> I/O Cost of Scalable BGD n r L ( w ( k ) ) = X r l ( y i , f ( w ( k ) , x i )) D i =1 Y X1 X2 X3 0 1b 1c 1d 0,1b, 1,2b, 1,3b, Row-store: 1c,1d 2c,2d 3c,3d 1 2b 2c 2d 1 3b 3c 3d 0 4b 4c 4d 0,4b, 1,5b, 0,6b, 4c,4d 5c,5d 6c,6d 1 5b 5c 5d 0 6b 6c 6d ❖ Straightforward filescan data access pattern for SUM ❖ Similar I/O behavior as relational select, non-dedup. project, and simple SQL aggregates ❖ I/O cost: 6 (read) + output # pages (write for final w ) 82

Stochastic Gradient Descent for ML ❖ Two key cons of BGD: ❖ Often takes too many epochs to get close to optimal ❖ Each update of w requires full scan of D: costly I/Os ❖ Stochastic GD (SGD) mitigates both issues ❖ Basic Idea: Use a sample (called mini-batch ) of D to approximate gradient instead of “full batch” gradient ❖ Sampling typically done without replacement ❖ Randomly reorder/shuffle D before every epoch ❖ Then do a sequential pass: sequence of mini-batches ❖ Another major pro of SGD: works well for non-convex loss functions too, especially ANNs/deep nets; BGD does not ❖ SGD is now standard for scalable GLMs and deep nets 83

<latexit sha1_base64="AfnLQWBOhwvN5z3/kLEnQiCekI=">ACVnicbVFNb9QwEHVSsvyFeDIZcQKaVeqVkBwaVSVThw4FAktq20WSLHO2mtZ3InlBWUf5ke4GfwgXhbHOALiNZenrvzXj8nFdKOorjn0G4dWf7s7uvcH9Bw8fPY6ePD1xZW0FTkWpSnuWc4dKGpySJIVnlUWuc4Wn+fJ9p59+Q+tkab7QqsK5udGFlJw8lQW6dTwXHFISaoFNp/aUao5XeRFc9l+bUbLcTuGA0hdrbNmtMrk3vdMjiGVBo46NndI8KGFfopaW6DYGLIHXd84i4bxJF4XbIKkB0PW13EWXaWLUtQaDQnFnZslcUXzhluSQmE7SGuHFRdLfo4zDw3X6ObNOpYWXnpmAUVp/TEa/bvjoZr51Y6985uXdb68j/abOainfzRpqJjTi5qKiVkAldBnDQloUpFYecGl3xXEBbdckP+JgQ8huf3kTXCyP0leTd58fj08POrj2GXP2Qs2Ygl7yw7ZR3bMpkywa/YrCIOt4EfwO9wOd26sYdD3PGP/VBj9AZJwsek=</latexit> Access Pattern and I/O Cost of SGD ❖ I/O cost of random shuffle is not trivial; requires a so- called “external merge sort” (skipped in this course) ❖ Typically requires 1 or 2 passes over large file ❖ Mini-batch gradient computations: 1 filescan per epoch r ˜ L ( w ( k ) ) = X r l ( y i , f ( w ( k ) , x i )) ( y i ,x i ) ∈ B ⊂ D ❖ Update of w happens in DRAM; as filescan proceeds, count number of examples seen and update for each mini-batch ❖ Typical Mini-batch sizes: 10s to 1000s ❖ Orders of magnitude more updates than BGD! ❖ So, I/O per epoch: 1 shuffle cost + 1 filescan cost ❖ Often, shuffling-only-once up front suffices! ❖ Loss function L() access pattern is the same before 84

Scaling Data Science Operations ❖ Scalable data access for key representative examples of programs/operations that are ubiquitous in data science: ❖ DB systems: ❖ Relational select ❖ Non-deduplicating project ❖ Simple SQL aggregates ❖ SQL GROUP BY aggregates ❖ ML systems: ❖ Matrix sum/norms ❖ Gramian matrix ❖ (Stochastic) Gradient Descent 85

Introducing Data Parallelism Basic Idea of Scalability : “Split” data file (virtually or physically) and stage reads/writes of its pages between disk and DRAM Q: What is “data parallelism”? Data Parallelism : Partition large data file physically across nodes/workers; within worker: DRAM-based or disk-based ❖ The most common approach to marrying parallelism and scalability in data systems! ❖ Generalization of SIMD and SPMD idea from parallel processors to large-data and multi-worker/node setting ❖ Distributed-memory vs Distributed-disk possibilities 87

3 Paradigms of Multi-Node Parallelism Interconnect Interconnect Interconnect Shared-Nothing Shared-Disk Shared-Memory Parallelism Parallelism Parallelism Data parallelism is technically orthogonal to these 3 paradigms but most commonly paired with shared-nothing 88

Shared-Nothing Data Parallelism ❖ Partitioning a data file across Interconnect nodes is aka “ sharding ” ❖ It is a part of a process in data D1 D2 systems called Extract- D3 Transform-Load (ETL) D4 D1 D3 D5 D5 ❖ ETL is an umbrella term for all D6 various processing done to the D5 D1 D3 data file before it is ready for D6 D2 D4 users to query, analyze, etc. Shared-Nothing ❖ Sharding, compression, file Parallel Cluster format conversions, etc. 89

Data Parallelism in Other Paradigms? D1 D1 D3 D5 D2 Contention D3 D1 D1 D2 D2 D3 D3 D4 D4 Interconnect Interconnect D5 D5 D6 D6 D1 D3 D5 D1 D3 D5 D2 D4 D6 D2 D4 D6 Shared-Disk Shared-Memory Parallel Cluster Parallel Cluster 90

Data Partitioning Strategies ❖ Row-wise/horizontal partitioning is most common (sharding) ❖ 3 common schemes (given k nodes): ❖ Round-robin : assign tuple i to node i MOD k ❖ Hashing-based : needs hash partitioning attribute(s) ❖ Range-based : needs ordinal partitioning attribute(s) ❖ Tradeoffs: Round-robin often inefficient for SQL queries later (why?); range-based good for range predicates in SQL; hashing-based is most common in practice and can be good for both SQL queries; for many ML workloads, all 3 are ok ❖ Replication of partition across nodes (e.g., 3x) often used for more availability and better parallel performance 91

Other Forms of Data Partitioning ❖ Just like with disk-aware data layout on single-node, we can partition a large data file across workers in other ways too: Columnar Partitioning R A B C D Node 1 Node 2 Node 3 1a 1b 1c 1d 1a,2a, 1b,2b, 1c,2c, 2a 2b 2c 2d 3a,4a 3b,4b 3c,4c 3a 3b 3c 3d … 4a 4b 4c 4d 5a,6a 5b,6b 5c,6c 5a 5b 5c 5d 6a 6b 6c 6d 92

Other Forms of Data Partitioning ❖ Just like with disk-aware data layout on single-node, we can partition a large data file across workers in other ways too: Hybrid/Tiled Partitioning R A B C D Node 1 Node 2 Node 3 1a 1b 1c 1d 1a,1b, 3a,3b, 5a,5b, 2a 2b 2c 2d 2a,2b 4a,4b 6a,6b 3a 3b 3c 3d … 4a 4b 4c 4d 1c,1d, 3c,3d, 5c,5d, 2c,2d 4c,4d 6c,6d 5a 5b 5c 5d 6a 6b 6c 6d 93

Cluster Architectures Q: What is the protocol for cluster nodes to talk to each other? Master-Worker Architecture Peer-to-Peer Architecture Worker Master Worker 1 k-1 … … Worker 1 Worker 2 Worker k Worker 2 Worker k ❖ 1 (or few) special node called Master ❖ No special master or Server ; many interchangeable ❖ Workers talk to each nodes called “ Workers ” other freely ❖ Master gives commands to workers, ❖ E.g., Horovod incl. talking to each other ❖ Aka Decentralized vs ❖ Most common in data systems (e.g., Centralized Dask, Spark, par. RDBMS, etc.) 94

Bulk Synchronous Parallelism (BSP) ❖ Most common protocol of data parallelism in data systems ❖ Combines shared-nothing sharding with master-worker ❖ Used by parallel RDBMSs, Hadoop, Spark, etc. 1. Given sharded data file on workers 2. Data processing program given by Master client to master (e.g., SQL query, ML training routine, etc.) Communication costs! 3. Master divides first piece of work of program among workers 4. A worker works independently D1 D3 D5 on its own data partition … D2 D4 D6 5. A worker sends its partial results Worker 1 Worker 2 Worker k to master after it is done 6. Master waits till all k are done Aka (Barrier) Synchronization 7. Go to step 3 for next piece 95

Speedup Analysis/Limits of of BSP Q: What is the speedup yielded by BSP? Completion time given only 1 worker Speedup = Completion time given k (>1) workers ❖ Many factors contribute to overheads on cluster: ❖ Per-worker parallel overheads: starting up worker processes; tearing down worker processes ❖ On-master overheads: dividing up work; collecting partial results and unifying them ❖ Communication costs: between master-worker and among workers (when commanded by master) ❖ Barrier synchronization also suffers from “ stragglers ” due to imbalances in data partitioning and/or resources 96

Quantifying Benefit of Parallelism Runtime speedup (fixed data size) Runtime speedup 12 2 Linear Speedup 8 Linear Scaleup 1 Sublinear 4 0.5 Scaleup Sublinear 1 Speedup 1 4 8 12 1 4 8 12 Number of workers Factor (# workers, data size) Speedup plot / Strong scaling Scaleup plot / Weak scaling Q: Is superlinear speedup/scaleup ever possible? 97

Distributed Filesystems ❖ Recall that a file is an OS abstraction for persistent data on local disk; distributed file generalizes this notion to a cluster of networked disks and OSs ❖ Distributed filesystem (DFS) is software that works with many networked disks and OSs to manage distr. files ❖ A layer of abstraction on top of local filesystems; a node manages local data (partition) as if it is local file ❖ Illusion of a one global file : DFS has APIs to let nodes access data (partitions) sitting on other nodes ❖ Two main variants: Remote DFS vs In-Situ DFS ❖ Remote DFS: Files reside elsewhere and read/written on demand by workers ❖ In-Situ DFS: Files resides on cluster where workers run 98

Network Filesystem (NFS) ❖ An old remote DFS (c. 1980s) with simple client-server architecture for replicating files over the network ❖ Main pro is simplicity of setup and usage ❖ But many cons: ❖ Not scalable to very large files ❖ Full data replication ❖ High contention for concurrent data reads/write ❖ Single-point of failure 99

Hadoop Distributed File System (HDFS) ❖ Most popular in-situ DFS (c. late 2000s); part of Hadoop; open source spinoff of Google File system (GFS) ❖ Highly scalable ; can do 10s of 1000s of nodes, PB files ❖ Designed for clusters of cheap commodity nodes ❖ Parallel reads/writes of sharded data “blocks” ❖ Replication of blocks improves fault tolerance ❖ Cons: Read-only + batch- append (no fine-grained updates/writes!) https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf 100 https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 3: - PowerPoint PPT Presentation

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 3: Parallel and Scalable Data Processing Ch. 9.4, 12.2, 14.1.1, 14.6, 22.1-22.3, 22.4.1, 22.8 of Cow Book Ch. 5, 6.1, 6.3, 6.4 of MLSys Book 1 Q: Why bother with large-scale data?

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 4: ML Data Preparation and Model

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 6: Deep Learning Systems 1 Outline

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 7: ML Deployment Not included for

Slide 7 / 102 Slide 8 / 102 4 Compare/Contrast Pulse and Wave. 5 In a transverse wave, compare

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 5: Dataflow Systems Chapter 2.2 of

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 1: Computer Organization; Operating

DSC 102 Systems for Scalable Analytics Winter 2020 Arun Kumar 1 About Myself 2009:

DSC 102 Systems for Scalable Analytics Arun Kumar Topic 2: Basics of Cloud Computing 1

Slide 1 / 102 Slide 2 / 102 8th Grade Wave Properties Classwork-Homwork Slides 2015-10-15

Slide 4 / 102 1 What causes a wave? Slide 5 / 102 2 In terms of wave motion, define medium.

How to do research in clinical practice Dr P S Shankar, MD, FRCP(Lond), FAMS, DSc(Gul),

3rd Grade Shapes and Perimeter 2015-11-10 www.njctl.org Slide 3 / 102 Slide 4 / 102 Table of

Analytics and Data Summit 2020 Analytics and Data Summit 2020 Analytics and Data Summit 2020

Cache Coherence in Scalable Machines Scalable Cache Coherent Systems Scalable, distributed

DSC 10: Lecture 1 Introduction Cause and Effect Credit: Anindita Adhikari and John DeNero

AP Physics C - Mechanics Simple Harmonic Motion 2015-12-05 www.njctl.org Slide 3 / 102 Slide 4

How recurrent networks implement contextual processing in sentiment analysis Niru

Cyber-Physical Event Processing Chao Wang CSE 520S References Core material of this lecture:

Internet measurement and the impact of big data Kenjiro Cho (IIJ/WIDE) Big Data everywhere

Frontiers at the interface of High Performance Computing Deep Learning and Multimessenger

Texas Tech University Jialin Liu, Yong Chen , Suren Byna September 1 st , 2015 P2S2 2015 Outline

Lecture 9: Data Abstraction Marvin Zhang 07/05/2016 Announcements Roadmap Introduction

Topic 5 Reminder: primitive expressions, means of Data Abstraction combination, means of

Data Abstraction Announcements Data Abstraction Data Abstraction 4 Data Abstraction