

SLIDE 1

Topic 3: Parallel and Scalable Data Processing

  • Ch. 9.4, 12.2, 14.1.1, 14.6, 22.1-22.3, 22.4.1, 22.8 of Cow Book
  • Ch. 5, 6.1, 6.3, 6.4 of MLSys Book

Arun Kumar

DSC 102: Systems for Scalable Analytics

SLIDE 2

Q: Why bother with large-scale data? Why does sampling not suffice?

SLIDE 3

SLIDE 4

Large-Scale Data in Astronomy

High-res. images: ~200 GB per day since 2000 (1PB+). Astronomers can study complex galactic evolution behaviors.

SLIDE 5

SLIDE 6

Large-Scale Data in Genomics

Precision Medicine is becoming a reality: analyze genomes across cohorts and prescribe targeted drugs and treatments. ~3GB genome per human; 900PB+ for the USA.

SLIDE 7

SLIDE 8

Large-Scale Data in E-commerce

Log all user behavior (views, clicks, pauses, searches, etc.). Recommender systems combine TBs of data from all users and movies to deliver a tailored experience.

SLIDE 9

Large-Scale Data in Computer Vision

10 million+ images labeled (20,000 classes) by crowdsourcing; >500GB uncompressed as tensors. Harbinger of the deep learning revolution.

SLIDE 10

“The Unreasonable Effectiveness of Data”

https://ai.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html

When prediction target complexity is high, more training data coupled with more complex models yield higher accuracy as the number of training examples grows.
SLIDE 11

Bias-Variance Tradeoff of ML

High Bias: Roughly, model is not rich enough to represent data
High Variance: Model overfits to given data; poor generalization
Large-scale training data lowers variance and raises accuracy!

SLIDE 12

Why Large-Scale Data?

❖ Large-scale data is a game changer for data science:
  ❖ Enables study of granular phenomena in sciences, businesses, etc. that were never before possible
  ❖ Enables fundamentally new applications and a high degree of personalization/customization
  ❖ Enables more complex ML prediction targets and mitigates variance to offer high accuracy
❖ Hardware has kept pace to power the above:
  ❖ Storage capacity has exploded (PB clusters)
  ❖ Compute capacity has grown (multi-core, GPUs, etc.)
  ❖ DRAM capacity has grown (10GBs to TBs)
  ❖ Cloud computing helps “democratize” access

SLIDE 13

“Big Data”

❖ Marketing term; think “Big” as in “Big Oil”, not “big building”
❖ Became popular in late 2000s to early 2010s
❖ Wikipedia says: “Data that is so large and complex that existing toolkits [read RDBMSs!] are not adequate”
❖ Typical characterization by 3 Vs:
  ❖ Volume: larger than single-node DRAM
  ❖ Variety: relations, docs, tweets, multimedia, etc.
  ❖ Velocity: high generation rate, e.g., sensors, surveillance

SLIDE 14

Why “Big Data” now? 1. Applications

❖ New “data-driven mentality” in almost all human endeavors:
  ❖ Web: search, e-commerce, e-mails, social media
  ❖ Science: satellite imagery, CERN’s LHC, document corpora
  ❖ Medicine: pharmacogenomics, precision medicine
  ❖ Logistics: sensors, GPS, “Internet of Things”
  ❖ Finance: high-throughput trading, monitoring
  ❖ Humanities: digitized books/literature, social media
  ❖ Governance: e-voting, targeted campaigns, NSA ☺
  ❖ …


SLIDE 15

Why “Big Data” now? 2. Storage


SLIDE 16

To analyze large-scale data, parallel and scalable data systems are indispensable!

SLIDE 17

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 18

Parallel Data Processing

Central Issue: Workload takes too long for one processor!
Basic Idea: Split up workload across processors and perhaps also across machines/workers (aka “Divide and Conquer”)
Key new concept in parallel processing:
❖ Threads: Generalization of the process abstraction of the OS
❖ A program/process can spawn many threads; each runs a part of the program’s computations simultaneously
❖ All threads share the process address space (so, data too; see the sketch below)
❖ In multi-core CPUs, a thread uses up 1 core
❖ “Hyper-threading”: Virtualizes a core to run 2 threads!
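A minimal sketch of the shared-address-space idea with Python's threading module (an illustration, not from the slides; note that CPython's GIL limits pure-Python threads to concurrency, not CPU speedup):

```python
import threading

data = list(range(1_000_000))
partials = [0, 0, 0, 0]   # shared across threads; disjoint slots, so no lock needed

def worker(tid, n_threads):
    chunk = data[tid::n_threads]   # this thread's share of the shared data
    partials[tid] = sum(chunk)     # write result into shared process memory

threads = [threading.Thread(target=worker, args=(t, 4)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(partials) == sum(data))  # True: all threads saw the same address space
```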

SLIDE 19

Multiple Threads in a Process

SLIDE 20

Parallel Data Processing

Central Issue: Workload takes too long for one processor!
Basic Idea: Split up workload across processors and perhaps also across machines/workers (aka “Divide and Conquer”)
Key new concept in parallel processing:
❖ Dataflow Graph: A directed graph representation of a program with vertices being abstract operations from a restricted set of computational primitives:
  ❖ Relational dataflows: RDBMS, Pandas
  ❖ Matrix/tensor dataflows: R, NumPy, TensorFlow, etc.
❖ Task Graph: Similar but coarse-grained; a vertex is a process

SLIDE 21

Example Relational Dataflow Graph

Aka Logical Query Plan in the DB systems world. [Figure: dataflow graph with input data at the leaves, operators from extended relational algebra as vertices, and intermediate data on the edges]

π(σ(R) ∪ (S ⋈ T))
SLIDE 22

Example Tensor Dataflow Graph

ReLU (WX + b)


Aka Neural Computational Graph in the ML systems world. [Figure: dataflow graph with input data at the leaves, operators from the LA/DL tool’s tensor algebra as vertices, and intermediate data on the edges]

SLIDE 23

Example Task Graph

NB: Dask conflates the concepts of Dataflow and Task graphs because an “operation” on a Dask DataFrame becomes its own separate process/program under the covers!

https://docs.dask.org/en/latest/graphviz.html

❖ More coarse-grained than operator-level dataflows
❖ Vertex: A full task/process
❖ Edge: A dependency between tasks
❖ Directed Acyclic Graph (DAG) model common; cycles?
❖ Data may not be shown

SLIDE 24

Parallel Data Processing

Key parallelism paradigms in data systems:
Central Issue: Workload takes too long for one processor!
Basic Idea: Split up workload across processors and perhaps also across machines/workers (aka “Divide and Conquer”)

❖ Dataset is Shared: within a node -> “SIMD”, “Pipelining”; across nodes -> N/A
❖ Dataset is Replicated: across nodes -> “Task Parallel” Systems
❖ Dataset is Partitioned: across nodes -> “Data Parallel” Systems

SLIDE 25

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 26

Task Parallelism

Basic Idea: Split up tasks across workers; if there is a common dataset that they read, just make copies of it (aka replication)

[Task graph: dataset D feeds T1, T2, and T3; T4 depends on T1; T5 depends on T2; T6 depends on T4 and T5]

Example: Given 3 workers
1) Copy whole D to all workers
2) Put T1 on worker 1 (W1), T2 on W2, T3 on W3; run all 3 in parallel
3) After T1 ends, run T4 on W1; after T2 ends, run T5 on W2; after T3 ends, W3 is idle
4) After T4 & T5 end, run T6 on W1; W2 is idle

This is your PA1 setup! Except, the Dask Scheduler puts tasks on workers for you (see the sketch below).
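A minimal sketch of this task graph in Dask (task bodies are hypothetical stand-ins; only the dependency structure matches the example):

```python
from dask import delayed

def task(name, *deps):
    return name   # stand-in for real work; deps only force scheduling order

D = "dataset"                      # the replicated input
t1 = delayed(task)("T1", D)
t2 = delayed(task)("T2", D)
t3 = delayed(task)("T3", D)
t4 = delayed(task)("T4", t1)       # T4 after T1
t5 = delayed(task)("T5", t2)       # T5 after T2
t6 = delayed(task)("T6", t4, t5)   # T6 after T4 and T5

# Dask's scheduler assigns ready tasks to free workers, as in steps 2)-4)
print(t6.compute(scheduler="threads", num_workers=3))
```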

SLIDE 27

Task Parallelism

Basic Idea: Split up tasks across workers; if there is a common dataset that they read, just make copies of it (aka replication)
❖ Topological sort of tasks in task graph for scheduling
❖ Notion of a “worker” in task parallelism can be at processor/core level, not just at node/server level
  ❖ Thread-level parallelism instead of process-level
  ❖ E.g., Dask: 4 worker nodes x 4 cores = 16 workers total
❖ Main pros of task parallelism:
  ❖ Simple to understand; easy to implement
  ❖ Independence of workers => low software complexity
❖ Main cons of task parallelism:
  ❖ Data replication across nodes; wastes memory/storage
  ❖ Idle times possible on workers

SLIDE 28

Degree of Parallelism

❖ The largest amount of concurrency possible in the task graph, i.e., how many tasks can be run simultaneously

[Task graph: dataset D feeds T1, T2, and T3; T4 depends on T1; T5 depends on T2; T6 depends on T4 and T5]

Example: Given 3 workers, the degree of parallelism is only 3, so more than 3 workers is not useful for this workload! But over time, the degree of parallelism keeps dropping in this example.

Q: How do we quantify the performance benefits of task parallelism?

SLIDE 29

Quantifying Benefit of Parallelism: Speedup

Speedup = (Completion time given only 1 worker) / (Completion time given n (>1) workers)

Q: But given n workers, can we get a speedup of n? It depends! (On degree of parallelism, task dependency graph structure, intermediate data sizes, etc.)

SLIDE 30

Quantifying Benefit of Parallelism

[Speedup plot / Strong scaling: runtime speedup vs. number of workers at fixed data size, contrasting linear and sublinear speedup. Scaleup plot / Weak scaling: runtime speedup vs. a common factor scaling both # workers and data size, contrasting linear and sublinear scaleup.]

Q: Is superlinear speedup/scaleup ever possible?

SLIDE 31

Idle Times in Task Parallelism

❖ Due to varying task completion times and varying degrees of parallelism in workload, idle workers waste resources

[Task graph: dataset D feeds T1, T2, and T3; T4 depends on T1; T5 depends on T2; T6 depends on T4 and T5]

Example: Given 3 workers, with task durations T1=10, T2=5, T3=15, T4=5, T5=20, T6=10.

Gantt Chart visualization of schedule:
W1: T1 (0-10), T4 (10-15), idle (15-25), T6 (25-35)
W2: T2 (0-5), T5 (5-25), idle (25-35)
W3: T3 (0-15), idle (15-35)

SLIDE 32

Idle Times in Task Parallelism

❖ Due to varying task completion times and varying degrees of parallelism in workload, idle workers waste resources

[Task graph: dataset D feeds T1, T2, and T3; T4 depends on T1; T5 depends on T2; T6 depends on T4 and T5]

Example: Given 3 workers (same task graph and Gantt chart as before)
❖ In general, the overall workload’s completion time on a task-parallel setup is always lower bounded by the longest path in the task graph
❖ Possibility: A task-parallel scheduler can “release” a worker if it knows that it will be idle till the end
  ❖ Can save costs in the cloud

SLIDE 33

Calculating Task Parallelism Speedup

❖ Due to varying task completion times and varying degrees of parallelism in workload, idle workers waste resources

[Task graph: dataset D feeds T1, T2, and T3; T4 depends on T1; T5 depends on T2; T6 depends on T4 and T5]

Example: Given 3 workers
Completion time with 1 worker: 10+5+15+5+20+10 = 65
Parallel completion time: 35
Speedup = 65/35 ≈ 1.9x; ideal/linear speedup is 3x
Q: Why is it only 1.9x?

SLIDE 34

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 35

Multi-core CPUs

❖ Modern machines often have multiple processors and multiple cores per processor; hierarchy of shared caches
❖ The OS Scheduler now controls which cores/processors are assigned to which processes/threads, and when

SLIDE 36

Single-Instruction Multiple-Data

❖ Single-Instruction Multiple-Data (SIMD): A fundamental form of parallel processing in which different chunks of data are processed by the “same” set of instructions shared by multiple processing units (PUs)
❖ Aka “vectorized” instruction processing (vs “scalar”)
❖ Data science workloads are very amenable to SIMD!
Example for SIMD in data science:
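A minimal NumPy sketch of the scalar vs. vectorized contrast (an illustration standing in for the slide's original example):

```python
import numpy as np

x = np.random.rand(1_000_000)

# Scalar: one element per interpreted Python instruction
def scalar_sum_sq(v):
    s = 0.0
    for e in v:
        s += e * e
    return s

# Vectorized: NumPy dispatches to compiled loops that exploit the CPU's
# SIMD units, processing multiple elements per instruction
vec = float(np.dot(x, x))

print(abs(scalar_sum_sq(x) - vec) < 1e-6)  # same result, far faster vectorized
```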

SLIDE 37

SIMD Generalizations

❖ Single-Instruction Multiple Thread (SIMT): Generalizes the notion of SIMD to different threads concurrently doing so
  ❖ Each thread may be assigned a PU or a whole core
❖ Single-Program Multiple Data (SPMD): A higher level of abstraction generalizing SIMD operations or programs
  ❖ Under the covers, may use multiple processes or threads
  ❖ Each chunk of data processed by one core/PU
  ❖ Applicable to any CPU, not just “vectorized” PUs
  ❖ Most common form of parallel programming

SLIDE 38

“Data Parallel” Multi-core Execution

[Figure: data chunks D1-D4 assigned to different cores; each core runs the same computation on its own chunk]

SLIDE 39

Quantifying Efficiency: Speedup

Q: How do we quantify the performance benefits of multi-core parallelism?
❖ As with task parallelism, we measure the speedup:
Speedup = (Completion time given only 1 core) / (Completion time given n (>1) cores)
❖ In data science computations, an often useful surrogate for completion time is the instruction throughput FLOP/s, i.e., the number of floating point operations per second
❖ Modern data processing programs, especially deep learning (DL), may have billions of FLOPs, aka GFLOPs!

SLIDE 40

Amdahl’s Law

❖ Amdahl’s Law: Formula to upper bound possible speedup
❖ A program has 2 parts: one that benefits from multi-core parallelism (taking time T_yes) and one that does not (taking time T_no)
  ❖ Non-parallel part could be for control, memory stalls, etc.
Q: But given n cores, can we get a speedup of n? It depends! (Just like it did with task parallelism)

1 core: completion time = T_yes + T_no
n cores: completion time = T_yes/n + T_no
Denote f = T_yes/T_no. Then:
Speedup = (T_yes + T_no) / (T_yes/n + T_no) = n(1 + f) / (n + f)
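A small sketch of the formula in code (illustrative numbers only):

```python
def amdahl_speedup(n, f):
    # Upper bound on speedup with n cores, where f = T_yes / T_no
    return n * (1 + f) / (n + f)

# f = 9 means a 90% parallel portion (f / (1 + f)); even then,
# speedup saturates at 1 + f = 10 as n grows:
for n in (1, 4, 8, 16, 10**6):
    print(n, round(amdahl_speedup(n, 9), 2))   # 1.0, 3.08, 4.71, 6.4, ~10
```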

SLIDE 41

Amdahl’s Law

Speedup = n(1 + f) / (n + f), where f = T_yes/T_no; parallel portion = f / (1 + f)

SLIDE 42

Hardware Trends on Parallelism

❖ Multi-core processors grew rapidly in early 2000s but hit physical limits due to packing efficiency and power issues
  ❖ End of “Moore’s Law” and end of “Dennard Scaling”
❖ Basic conclusion of hardware trends: it is hard for general-purpose CPUs to sustain FLOP-heavy programs (e.g., ML)
❖ Motivated the rise of “accelerators” for some classes of programs

SLIDE 43

Hardware Accelerators: GPUs

❖ Graphics Processing Unit (GPU): Custom processor to run matrix/tensor operations faster ❖ Basic idea: use tons of ALUs; massive data parallelism (like SIMD “on steroids”); Titan X offers ~11 TFLOP/s! ❖ Popularized by NVIDIA in early 2000s for video games, graphics, and video/multimedia; now for deep learning ❖ CUDA released in 2007; later wrapper APIs on top: CuDNN, CuSparse, CuDF (RapidsAI), etc.

SLIDE 44

GPUs on the Market

SLIDE 45

Other Hardware Accelerators

❖ Tensor Processing Unit (TPU): Even more specialized tensor processor for deep learning inference; ~45 TFLOP/s!
  ❖ An “application-specific integrated circuit” (ASIC) created by Google in the mid 2010s; used in the AlphaGo match!
❖ Field-Programmable Gate Array (FPGA): Configurable processors for any class of programs; ~0.5-3 TFLOP/s
  ❖ Cheaper; h/w-s/w stacks for ML/DL; Azure/AWS support

SLIDE 46

Comparing Modern Parallel Hardware

| | Multi-core CPU | GPU | FPGA | ASICs (e.g., TPUs) |
| Peak FLOP/s | Moderate | High | High | Very High |
| Power Consumption | High | Very High | Very Low | Low |
| Cost | Low | High | Very High | Highest |
| Generality / Flexibility | Highest | Medium | Very High | Lowest |
| Fitness for DL Training | Poor Fit | Best Fit | Low Fit | Potential exists but unrealized |
| Fitness for DL Inference | Moderate | Moderate | Good Fit | Best Fit |
| Cloud Vendor Support | All | All | AWS, Azure | GCP |

https://www.embedded.com/leveraging-fpgas-for-deep-learning/

SLIDE 47

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 48

Recap: Memory Hierarchy

| Level | Capacity | Price | Access Speed | Access Cycles |
| Cache | ~MBs | ~$2/MB | ~100GB/s | |
| Main Memory (DRAM) | ~10GBs | ~$5/GB | ~10GB/s | 100s |
| Flash Storage (SSD) | ~TBs | ~$200/TB | ~GB/s | 10^5 - 10^6 |
| Magnetic Hard Disk Drive (HDD) | ~10TBs | ~$40/TB | ~200MB/s | 10^7 - 10^8 |

SLIDE 49

Memory Hierarchy in Action

[Figure: rough sequence of events when a program is executed. The OS does I/O to load code (tmp.py) and data (tmp.csv) from disk into DRAM; commands are interpreted; the CPU (registers, CU, ALU, caches) retrieves and processes data over the bus; the result (‘21’) is written back and displayed via I/O to the monitor.]

Q: What if this does not fit in DRAM?

SLIDE 50

Scalable Data Access

Central Issue: Large data file does not fit entirely in DRAM
Basic Idea: “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes)
4 key regimes of scalability / staging reads:
❖ Single-node disk: Paged access from file on local disk
❖ Remote read: Paged access from disk(s) over a network
❖ Distributed memory: Fits on a cluster’s total DRAM
❖ Distributed disk: Fits on a cluster’s full set of disks

SLIDE 51

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 52

Paged Data Access to DRAM

Basic Idea: “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes)
❖ Recall that files are already virtually split and stored as pages on both disk and DRAM!
[Figure: file pages (F1P1, F1P2, F1P3, F2P1, F2P2, …) on disk; the OS cache in DRAM holds some pages in occupied frames (e.g., F1P1, F2P2, F1P3, F3P1, F5P7, F3P5) plus free frames]

SLIDE 53

Page Management in DRAM Cache

❖ Caching: The act of reading in page(s) from disk to DRAM
❖ Eviction: The act of removing page(s) from DRAM
❖ Spilling: The act of writing out page(s) from DRAM to disk
❖ If a page in DRAM is “dirty” (i.e., some bytes were written), eviction requires a spill; o/w, ignore that page
❖ Depending on what pages are needed for a program, the set of memory-resident pages will change over time
❖ Cache Replacement Policy: The algorithm that chooses which page frame(s) to evict when a new page has to be cached but the OS cache in DRAM is full
  ❖ Popular policies include Least Recently Used, Most Recently Used, Clock, etc. (more shortly)

SLIDE 54

Quantifying I/O: Disk and Network

❖ Page reads/writes to/from DRAM from/to disk incur latency
❖ Disk I/O Cost: Abstract counting of number of page I/Os; can map to bytes given page size
❖ Sometimes, programs read/write data over the network
❖ Communication/Network I/O Cost: Abstract counting of number of pages/bytes sent/received over network
❖ I/O cost is abstract; mapping to latency is hardware-specific

Example: Suppose a data file is 40GB and page size is 4KB
I/O cost to read the file = 10 million page I/Os
Disk with I/O throughput 800 MB/s: ~40K MB / 800 = ~50s
Network with speed 200 MB/s: ~40K MB / 200 = ~200s
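The same arithmetic in a short snippet (reproducing the example's numbers):

```python
file_gb, page_kb = 40, 4
print((file_gb * 1024 * 1024) // page_kb)  # 10485760 page I/Os (~10 million)

file_mb = file_gb * 1024                   # ~40K MB
print(file_mb / 800)                       # ~51 s at 800 MB/s disk throughput
print(file_mb / 200)                       # ~205 s at 200 MB/s network speed
```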

SLIDE 55

Scaling to (Local) Disk

Basic Idea: “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes)

Process wants to read the file’s pages (P1-P6) one by one and then discard them: aka “filescan” access pattern. Suppose the OS cache has only 4 frames, initially empty:
Read P1, P2, P3, P4 into the cache; the cache is full! Cache replacement is needed: evict P1 to read P5, evict P2 to read P6, …
Total I/O cost: 6

SLIDE 56

Scaling to (Local) Disk

❖ In general, scalable programs stage access to the pages of a file on disk and efficiently use available DRAM
❖ Recall that typically DRAM size << disk size
❖ Modern DRAM sizes can be 10s of GBs; so we read a “chunk”/“block” of the file at a time (say, 1000s of pages)
❖ On magnetic hard disks, such chunking leads to more sequential I/Os, raising throughput and lowering latency!
❖ Similarly, write a chunk of dirtied pages at a time

SLIDE 57

Generic Cache Replacement Policies

❖ Cache Replacement Policy: Algorithm to decide which page frame(s) to evict to make space for new page reads
❖ Typical ranking criteria for frames: recency of use, frequency of use, number of processes reading it, etc.
❖ Typical optimization goal: Reduce overall page I/O costs
❖ A few well-known policies:
  ❖ Least Recently Used (LRU): Evict the page that was used the longest time ago
  ❖ Most Recently Used (MRU): Opposite of LRU
  ❖ Clock Algorithm (lightweight approximation of LRU)
  ❖ ML-based caching policies are “hot” nowadays! :)
Q: What to do if the number of cache frames is too few for the file? Take CSE 132C for more cache replacement details
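A minimal LRU sketch (read_page below is a hypothetical I/O helper, not a real API):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()            # page_id -> page contents

    def get(self, page_id, read_page):
        if page_id in self.frames:             # hit: mark as most recently used
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:  # full: evict least recently used
            self.frames.popitem(last=False)
        self.frames[page_id] = read_page(page_id)  # miss: incur one page I/O
        return self.frames[page_id]
```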

SLIDE 58

Data Layouts and Access Patterns

❖ Recall that data layouts and access patterns affect what data subset gets cached in higher levels of the memory hierarchy
  ❖ Recall the matrix multiplication example and CPU caches!
❖ Key Principle: Optimizing the on-disk data layout of a data file based on the data access pattern can help reduce I/O costs
  ❖ Applies to both magnetic hard disks and flash SSDs
  ❖ But especially critical for the former due to vast differences in latency of random vs sequential access!

SLIDE 59

Row-store vs Column-store Layouts

❖ A common dichotomy when serializing 2-D structured data (relations, matrices, DataFrames) to file on disk

R (6 tuples; columns A, B, C, D; cell values 1a … 6d). Say, a page can fit only 4 cell values.

Row-store pages: [1a,1b,1c,1d] [2a,2b,2c,2d] [3a,3b,3c,3d] …
Col-store pages: [1a,2a,3a,4a] [5a,6a,1b,2b] [3b,4b,5b,6b] …

❖ Based on the data access pattern of the program, I/O costs with row- vs col-store can be orders of magnitude apart!
❖ Can generalize to higher dimensions for storing tensors

SLIDE 60

Example: Dask’s Scalable Access

Basic Idea: “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes)
❖ This is how Dask DF scales to disk-resident data files
❖ “Virtual” split: each split is under-the-covers a Pandas DF
❖ The Dask API is a “wrapper” around the Pandas API to scale ops to splits and put all results together
❖ If the file is too large for DRAM, can do a manual repartition() to get physically smaller splits (< ~1GB)

https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead
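A hedged sketch of this pattern with the Dask DataFrame API (file name and column are illustrative):

```python
import dask.dataframe as dd

df = dd.read_csv("data.csv", blocksize="256MB")  # virtual splits, each a Pandas DF
print(df.npartitions)

df = df.repartition(partition_size="100MB")      # physically smaller splits
print(df["A"].max().compute())                   # op runs per split; results combined
```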

SLIDE 61

Hybrid/Tiled/“Blocked” Layouts

❖ Sometimes, it is beneficial to do a hybrid, especially in analytical RDBMSs and matrix/tensor processing systems

Say, a page can fit only 4 cell values. Hybrid store of R with a 2x2 tiled layout:
[1a,1b,2a,2b] [1c,1d,2c,2d] [3a,3b,4a,4b] …

Which data layout will yield lower I/O costs (row vs col vs tiled) depends on the data access pattern of the program!

SLIDE 62

Scaling with Remote Reads

Basic Idea: “Split” data file (virtually or physically) and stage reads of its pages from disk to DRAM (vice versa for writes)
❖ Same approach as scaling to local disk, except there may not be a local disk!
❖ Instead, scale by staging reads of pages over the network from remote disk(s) (e.g., from S3)
❖ Same issues of managing a DRAM cache, picking a cache replacement policy, etc. matter here too
❖ More restrictive than scaling with local disk, since spilling is not possible or requires costly network I/Os
❖ Good in practice for a one-shot filescan access pattern
❖ Better to combine with local disk used as a cache (like in PA1!)

SLIDE 63

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 64

Scaling Data Science Operations

❖ Scalable data access for key representative examples of programs/operations that are ubiquitous in data science:
❖ DB systems:
  ❖ Relational select
  ❖ Non-deduplicating project
  ❖ Simple SQL aggregates
  ❖ SQL GROUP BY aggregates
❖ ML systems:
  ❖ Matrix sum/norms
  ❖ Gramian matrix
  ❖ (Stochastic) Gradient Descent

SLIDE 65

Scaling to Disk: Relational Select

Query: σ_{B=“3b”}(R)
Row-store pages of R: [1a,1b,1c,1d] [2a,2b,2c,2d] [3a,3b,3c,3d] [4a,4b,4c,4d] [5a,5b,5c,5d] [6a,6b,6c,6d]

❖ Straightforward filescan data access pattern
❖ Read pages/chunks from disk to DRAM one by one
❖ CPU applies the predicate to tuples in pages in DRAM
❖ Copy satisfying tuples to temporary output pages
❖ Use LRU for cache replacement, if needed
❖ I/O cost: 6 (read) + output # pages (write)
SLIDE 66

Scaling to Disk: Relational Select

Query: σ_{B=“3b”}(R)
[Figure: the 6 row-store pages are read from disk into the OS cache in DRAM one by one; one frame is reserved for writing the output data of the program (may be spilled to a temp. file). The CPU finds a matching tuple (3a,3b,3c,3d) and copies it to the output page. When the cache fills, LRU evicts page 1, then page 2, and so on.]

SLIDE 67

Scaling to Disk: Non-dedup. Project

Query: SELECT C FROM R
Row-store pages of R: [1a,1b,1c,1d] [2a,2b,2c,2d] … [6a,6b,6c,6d]

❖ Again, straightforward filescan data access pattern
❖ Similar I/O behavior as the previous selection case
❖ I/O cost: 6 (read) + output # pages (write)

SLIDE 68

Scaling to Disk: Non-dedup. Project

Query: SELECT C FROM R
Col-store pages of R: [1a,2a,3a,4a] [5a,6a,1b,2b] [3b,4b,5b,6b] [1c,2c,3c,4c] [5c,6c,1d,2d] …

❖ Since we only need col C, no need to read the other pages!
❖ I/O cost: 2 (read) + output # pages (write)
❖ Col-stores offer this advantage over row-stores for SQL analytics queries (projects, aggregates, etc.), aka “OLAP”
❖ Main basis of column-store RDBMSs (e.g., Vertica)

SLIDE 69

Scaling to Disk: Simple Aggregates

Query: SELECT MAX(A) FROM R
Row-store pages of R: [1a,1b,1c,1d] [2a,2b,2c,2d] … [6a,6b,6c,6d]

❖ Again, straightforward filescan data access pattern
❖ Similar I/O behavior as the previous selection and non-deduplicating projection cases
❖ I/O cost: 6 (read) + output # pages (write)

SLIDE 70

Scaling to Disk: Simple Aggregates

Query: SELECT MAX(A) FROM R
Col-store pages of R: [1a,2a,3a,4a] [5a,6a,1b,2b] …

❖ Similar to the non-dedup. project, we only need col A; no need to read the other pages!
❖ I/O cost: 2 (read) + output # pages (write)

SLIDE 71

Scaling to Disk: Group By Aggregate

R:
A  B  C  D
a1 1b 1c 4
a2 2b 2c 3
a1 3b 3c 5
a3 4b 4c 1
a2 5b 5c 10
a1 6b 6c 8

Query: SELECT A, SUM(D) FROM R GROUP BY A
❖ Now it is not straightforward due to the GROUP BY!
❖ Need to “collect” all tuples in a group and apply the agg. func. to each group
❖ Typically done with a hash table maintained in DRAM
  ❖ Has 1 record per group and maintains “running information” for that group’s agg. func.
  ❖ Built on the fly during the filescan of R; holds the output in the end!

Hash table (output), A -> Running Info.: a1 -> 17, a2 -> 13, a3 -> 1
SLIDE 72

Scaling to Disk: Group By Aggregate

Query: SELECT A, SUM(D) FROM R GROUP BY A
Row-store pages of R: [a1,1b,1c,4] [a2,2b,2c,3] [a1,3b,3c,5] [a3,4b,4c,1] [a2,5b,5c,10] [a1,6b,6c,8]

Hash table in DRAM (A -> Running Info.), built incrementally during the filescan:
a1: 4 -> 9 -> 17
a2: 3 -> 13
a3: 1

❖ Note that the sum for each group is constructed incrementally
❖ I/O cost: 6 (read) + output # pages (write); just one filescan again! (see the sketch below)
Q: But what if hash table > DRAM size?!
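A minimal sketch of this one-pass hash aggregation over a chunked filescan (assuming a CSV file R.csv with columns A and D):

```python
import pandas as pd

running = {}   # the DRAM-resident hash table: group value -> running SUM(D)
for chunk in pd.read_csv("R.csv", chunksize=10_000):   # staged, chunked filescan
    for a, d in zip(chunk["A"], chunk["D"]):
        running[a] = running.get(a, 0) + d             # incremental update
print(running)   # e.g., {'a1': 17, 'a2': 13, 'a3': 1}
```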
SLIDE 73

Scaling to Disk: Group By Aggregate

Query: SELECT A, SUM(D) FROM R GROUP BY A
Q: But what if hash table > DRAM size?
❖ Program will likely just crash! :) The OS may keep swapping pages of the hash table to/from disk; aka “thrashing”
Q: How to scale to a large number of groups?
❖ Divide and conquer! Split up R based on values of A
  ❖ The hash table for each split may fit in DRAM alone
❖ Reduce running info. size if possible

SLIDE 74

Scaling to Disk: Matrix Sum/Norms

Compute ‖M‖₂² for M (6x4).
Row-store pages of M: [2,1,0,0] [2,1,0,0] [0,1,0,2] [0,0,1,2] [3,0,1,0] [3,0,1,0]

❖ Again, straightforward filescan data access pattern
❖ Very similar to a relational simple aggregate
❖ Running info. in DRAM for the sum of squares of cells: 0 -> 5 -> 10 -> 15 -> 20 -> 30 -> 40
❖ I/O cost: 6 (read) + output # pages (write)
❖ Col-store and tiled-store have the same I/O cost; why?
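A sketch of this streaming computation (assuming M is stored row-wise in a hypothetical M.npy; memory-mapping stands in for paged reads):

```python
import numpy as np

M = np.load("M.npy", mmap_mode="r")        # pages read from disk on demand
total = 0.0                                # the only DRAM-resident running info
for start in range(0, M.shape[0], 2):      # one 2-row "chunk" at a time
    chunk = np.asarray(M[start:start+2])
    total += float((chunk * chunk).sum())  # running sum of squares
print(total)                               # equals the squared norm of M
```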
SLIDE 75

Scaling to Disk: Gramian Matrix

Compute the Gramian MᵀM for M (6x4).
Row-store pages of M: [2,1,0,0] [2,1,0,0] [0,1,0,2] [0,0,1,2] [3,0,1,0] [3,0,1,0]

❖ A bit tricky, since the output may not fit entirely in DRAM
❖ Similar to the GROUP BY difficult case
❖ The output here is 4x4, i.e., 4 pages; only 3 can be in DRAM!
❖ Each row will need to update the entire output matrix
❖ Row-store can be a poor fit for such matrix algebra
❖ What about col-store or tiled-store?

SLIDE 76

Scaling to Disk: Gramian Matrix

Compute MᵀM with a 2x2 tiled store of M (6x4): tiles A-F with M = [[A, B], [C, D], [E, F]], and output MᵀM = [[O1, O2], [O3, O4]]

❖ Read A, C, E one by one to get O1 = AᵀA + CᵀC + EᵀE; O1 is incrementally computed; write O1 out; I/O: 3 (r) + 1 (w)
❖ Likewise with B, D, F for O4; I/O: 3 (r) + 1 (w)
❖ Read A, B and put AᵀB in O2; read C, D to add CᵀD to O2; read E, F to add EᵀF to O2; write O2 out; I/O: 6 + 1
❖ Likewise with B,A; D,C; F,E for O3; I/O: 6 + 1
❖ Max I/O cost: 18 (r) + 4 (w); scalable on both dimensions!
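A simplified NumPy sketch of the same incremental idea, using full-width row blocks instead of 2x2 tiles (MᵀM is the sum of per-block Gramians, so each block is read once while only the small 4x4 output stays in DRAM; M.npy is a hypothetical file):

```python
import numpy as np

M = np.load("M.npy", mmap_mode="r")            # 6x4 matrix, paged from disk
out = np.zeros((M.shape[1], M.shape[1]))       # 4x4 Gramian, DRAM-resident
for start in range(0, M.shape[0], 2):          # one row block at a time
    block = np.asarray(M[start:start+2])
    out += block.T @ block                     # incremental update of M^T M
print(np.allclose(out, np.asarray(M).T @ np.asarray(M)))  # True
```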

SLIDE 77

Scalable Matrix/Tensor Algebra

❖ Almost all basic matrix operations can be DRAM/cache-efficiently implemented using tiled operations
❖ Tile-based storage and/or processing is common in ML systems that support scalable matrix/tensor algebra
  ❖ SystemML, pbdR, and Dask Arrays (over NumPy) for matrix ops WRT disk-DRAM
  ❖ NVIDIA CUDA for tensor ops WRT DRAM-GPU caches

SLIDE 78

Numerical Optimization in ML

❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem
  ❖ E.g., logistic and linear regression, linear SVM, etc.
❖ Aka the “Empirical Risk Minimization” (ERM) approach
❖ Computes the “loss” of predictions over n labeled examples:

w* = argmin_w Σ_{i=1..n} l(y_i, f(w, x_i))

❖ Hyperplane-based models, aka Generalized Linear Models (GLMs), use an f() that is a scalar function of the distance w^T x_i
SLIDE 79

Batch Gradient Descent for ML

❖ In many cases, the loss function l() is convex; so is L(w) = Σ_{i=1..n} l(y_i, f(w, x_i))
❖ But closed-form minimization is typically infeasible
❖ Batch Gradient Descent (BGD): Iterative numerical procedure to find an optimal w
  ❖ Initialize w to some value w^(0)
  ❖ Compute the gradient: ∇L(w^(k)) = Σ_{i=1..n} ∇l(y_i, f(w^(k), x_i))
  ❖ Descend along the gradient (aka Update Rule): w^(k+1) <- w^(k) − η ∇L(w^(k))
  ❖ Repeat until we get close to w*, aka convergence
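A minimal BGD sketch for linear regression (squared loss on synthetic data; η and the epoch count are the hyper-parameters discussed on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, eta = np.zeros(3), 0.001
for epoch in range(100):
    grad = X.T @ (X @ w - y)   # full-batch gradient: a SUM over all examples
    w = w - eta * grad         # update rule: w <- w - eta * grad
print(w)                       # close to w_true
```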
SLIDE 80

Batch Gradient Descent for ML

[Figure: a 1-D loss surface L(w) with its minimum at w*; starting from w^(0), each BGD step moves against the gradient (∇L(w^(0)), ∇L(w^(1)), …) through w^(1), w^(2), … toward w*]

❖ The learning rate η is a hyper-parameter selected by the user or by “AutoML” tuning procedures
❖ The number of iterations/epochs of BGD is also a hyper-parameter

w^(1) <- w^(0) − η ∇L(w^(0))
w^(2) <- w^(1) − η ∇L(w^(1))
SLIDE 81

Data Access Pattern of BGD at Scale

❖ The data-intensive computation in BGD is the gradient:
❖ In scalable ML, dataset D may not fit in DRAM
❖ Model w is typically small and DRAM-resident
❖ The gradient is like a SQL SUM over vectors (one per example)
  ❖ At each epoch, 1 filescan over D to get the gradient
  ❖ The update of w happens normally in DRAM
  ❖ Monitoring across epochs for convergence is needed
❖ Loss function L() is also just a SUM in a similar manner

∇L(w^(k)) = Σ_{i=1..n} ∇l(y_i, f(w^(k), x_i))

Q: What SQL op is this reminiscent of?

SLIDE 82

I/O Cost of Scalable BGD

Dataset D (one labeled example per tuple: Y, X1, X2, X3), row-store pages:
[0,1b,1c,1d] [1,2b,2c,2d] [1,3b,3c,3d] [0,4b,4c,4d] [1,5b,5c,5d] [0,6b,6c,6d]

∇L(w^(k)) = Σ_{i=1..n} ∇l(y_i, f(w^(k), x_i))

❖ Straightforward filescan data access pattern for the SUM
❖ Similar I/O behavior as relational select, non-dedup. project, and simple SQL aggregates
❖ I/O cost: 6 (read) + output # pages (write for final w)

SLIDE 83

Stochastic Gradient Descent for ML

❖ Two key cons of BGD:
  ❖ Often takes too many epochs to get close to the optimal
  ❖ Each update of w requires a full scan of D: costly I/Os
❖ Stochastic GD (SGD) mitigates both issues
  ❖ Basic Idea: Use a sample (called a mini-batch) of D to approximate the gradient instead of the “full batch” gradient
  ❖ Sampling typically done without replacement: randomly reorder/shuffle D before every epoch, then do a sequential pass: a sequence of mini-batches
❖ Another major pro of SGD: works well for non-convex loss functions too, especially ANNs/deep nets; BGD does not
❖ SGD is now standard for scalable GLMs and deep nets

SLIDE 84

Access Pattern and I/O Cost of SGD

❖ The I/O cost of the random shuffle is not trivial; requires a so-called “external merge sort” (skipped in this course)
  ❖ Typically requires 1 or 2 passes over the large file
❖ Mini-batch gradient computations: 1 filescan per epoch
❖ The update of w happens in DRAM; as the filescan proceeds, count the number of examples seen and update after each mini-batch
❖ Typical mini-batch sizes: 10s to 1000s
  ❖ Orders of magnitude more updates than BGD!
❖ So, I/O per epoch: 1 shuffle cost + 1 filescan cost
  ❖ Often, shuffling only once up front suffices!
❖ Loss function L() access pattern is the same as before

∇̃L(w^(k)) = Σ_{(y_i, x_i) ∈ B ⊂ D} ∇l(y_i, f(w^(k), x_i))
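A minimal SGD sketch in the same setup as the BGD sketch above (shuffle once per epoch, then a sequential pass over mini-batches):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, eta, batch = np.zeros(3), 0.01, 32
for epoch in range(10):
    order = rng.permutation(len(y))          # random reorder/shuffle of D
    for start in range(0, len(y), batch):    # sequential pass of mini-batches
        i = order[start:start+batch]
        grad = X[i].T @ (X[i] @ w - y[i])    # mini-batch gradient estimate
        w = w - eta * grad                   # many more updates per epoch than BGD
print(w)                                     # close to w_true
```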
SLIDE 85

Scaling Data Science Operations

❖ Scalable data access for key representative examples of programs/operations that are ubiquitous in data science:
❖ DB systems:
  ❖ Relational select
  ❖ Non-deduplicating project
  ❖ Simple SQL aggregates
  ❖ SQL GROUP BY aggregates
❖ ML systems:
  ❖ Matrix sum/norms
  ❖ Gramian matrix
  ❖ (Stochastic) Gradient Descent

SLIDE 86

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 87

Introducing Data Parallelism

Q: What is “data parallelism”?
❖ The most common approach to marrying parallelism and scalability in data systems!
❖ A generalization of the SIMD and SPMD ideas from parallel processors to the large-data, multi-worker/node setting
❖ Distributed-memory vs distributed-disk possibilities

Basic Idea of Scalability: “Split” data file (virtually or physically) and stage reads/writes of its pages between disk and DRAM
Data Parallelism: Partition large data file physically across nodes/workers; within a worker: DRAM-based or disk-based

SLIDE 88

3 Paradigms of Multi-Node Parallelism

Shared-Nothing Parallelism Shared-Memory Parallelism Shared-Disk Parallelism

[Figure: three cluster layouts, each with nodes connected by an interconnect]

Data parallelism is technically orthogonal to these 3 paradigms but most commonly paired with shared-nothing

SLIDE 89

Shared-Nothing Data Parallelism

[Figure: Shared-Nothing Parallel Cluster: partitions D1-D6 of the file sharded across the nodes’ local disks, nodes connected by an interconnect]

❖ Partitioning a data file across nodes is aka “sharding”
❖ It is part of a process in data systems called Extract-Transform-Load (ETL)
❖ ETL is an umbrella term for all the various processing done to the data file before it is ready for users to query, analyze, etc.
  ❖ Sharding, compression, file format conversions, etc.

SLIDE 90

Data Parallelism in Other Paradigms?

[Figure: Shared-Memory Parallel Cluster and Shared-Disk Parallel Cluster: partitions D1-D6 are accessed through the shared resource over the interconnect, causing contention]

SLIDE 91

Data Partitioning Strategies

❖ Row-wise/horizontal partitioning is most common (sharding)
❖ 3 common schemes (given k nodes), sketched below:
  ❖ Round-robin: assign tuple i to node i MOD k
  ❖ Hashing-based: needs hash partitioning attribute(s)
  ❖ Range-based: needs ordinal partitioning attribute(s)
❖ Tradeoffs: Round-robin is often inefficient for SQL queries later (why?); range-based is good for range predicates in SQL; hashing-based is most common in practice and can be good for many SQL queries; for many ML workloads, all 3 are OK
❖ Replication of partitions across nodes (e.g., 3x) is often used for more availability and better parallel performance
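Minimal sketches of the 3 schemes (illustrative helpers, not a real system's API):

```python
import hashlib

def round_robin(i, k):             # tuple index i -> node
    return i % k

def hash_part(key, k):             # hash partitioning attribute -> node
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % k

def range_part(key, bounds):       # ordinal attribute; bounds are sorted upper limits
    for node, upper in enumerate(bounds):
        if key <= upper:
            return node
    return len(bounds)             # last node takes the tail

print(round_robin(7, 3), hash_part("a1", 3), range_part(42, [10, 50, 100]))
```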

SLIDE 92

Other Forms of Data Partitioning

❖ Just like with disk-aware data layout on single-node, we can partition a large data file across workers in other ways too:

[Figure: Columnar partitioning of R across 3 nodes: column A’s pages ([1a,2a,3a,4a] [5a,6a]) on Node 1, column B’s ([1b,2b,3b,4b] [5b,6b]) on Node 2, column C’s ([1c,2c,3c,4c] [5c,6c]) on Node 3, etc.]

SLIDE 93

Other Forms of Data Partitioning

❖ Just like with disk-aware data layout on single-node, we can partition a large data file across workers in other ways too:

[Figure: Hybrid/tiled partitioning of R across 3 nodes: 2x2 tiles such as [1a,1b,2a,2b], [1c,1d,2c,2d], [3a,3b,4a,4b], [3c,3d,4c,4d], [5a,5b,6a,6b], [5c,5d,6c,6d] distributed across Nodes 1-3]

SLIDE 94

Cluster Architectures

Q: What is the protocol for cluster nodes to talk to each other?

Master-Worker Architecture:
❖ 1 (or a few) special node(s) called Master or Server; many interchangeable nodes called “Workers”
❖ Master gives commands to workers, incl. telling them to talk to each other
❖ Most common in data systems (e.g., Dask, Spark, parallel RDBMSs, etc.)

Peer-to-Peer Architecture:
❖ No special master; workers talk to each other freely
❖ E.g., Horovod
❖ Aka Decentralized (vs Centralized)

SLIDE 95

Bulk Synchronous Parallelism (BSP)

❖ Most common protocol of data parallelism in data systems ❖ Combines shared-nothing sharding with master-worker ❖ Used by parallel RDBMSs, Hadoop, Spark, etc.

[Figure: Master coordinating Workers 1..k, each holding its shard of D1-D6]

1. Given a sharded data file on the workers
2. A data processing program is given by the client to the master (e.g., SQL query, ML training routine, etc.)
3. Master divides the first piece of work of the program among the workers
4. Each worker works independently on its own data partition
5. A worker sends its partial results to the master after it is done
6. Master waits till all k are done: aka (barrier) synchronization; communication costs!
7. Go to step 3 for the next piece

SLIDE 96

Speedup Analysis/Limits of BSP

Q: What is the speedup yielded by BSP?
Speedup = (Completion time given only 1 worker) / (Completion time given k (>1) workers)

❖ Many factors contribute to overheads on a cluster:
  ❖ Per-worker parallel overheads: starting up worker processes; tearing down worker processes
  ❖ On-master overheads: dividing up work; collecting partial results and unifying them
  ❖ Communication costs: between master and workers and among workers (when commanded by the master)
❖ Barrier synchronization also suffers from “stragglers” due to imbalances in data partitioning and/or resources

SLIDE 97

Quantifying Benefit of Parallelism

[Speedup plot / Strong scaling: runtime speedup vs. number of workers at fixed data size, contrasting linear and sublinear speedup. Scaleup plot / Weak scaling: runtime speedup vs. a common factor scaling both # workers and data size, contrasting linear and sublinear scaleup.]

Q: Is superlinear speedup/scaleup ever possible?

SLIDE 98

Distributed Filesystems

❖ Recall that a file is an OS abstraction for persistent data on local disk; a distributed file generalizes this notion to a cluster of networked disks and OSs
❖ A distributed filesystem (DFS) is software that works with many networked disks and OSs to manage distributed files
  ❖ A layer of abstraction on top of local filesystems; a node manages its local data (partition) as if it is a local file
  ❖ Illusion of one global file: the DFS has APIs to let nodes access data (partitions) sitting on other nodes
❖ Two main variants: Remote DFS vs In-Situ DFS
  ❖ Remote DFS: Files reside elsewhere and are read/written on demand by workers
  ❖ In-Situ DFS: Files reside on the cluster where the workers run
SLIDE 99

Network Filesystem (NFS)

❖ An old remote DFS (c. 1980s) with a simple client-server architecture for replicating files over the network
❖ Main pro is simplicity of setup and usage
❖ But many cons:
  ❖ Not scalable to very large files
  ❖ Full data replication
  ❖ High contention for concurrent data reads/writes
  ❖ Single point of failure

SLIDE 100

Hadoop Distributed File System (HDFS)

❖ Most popular in-situ DFS (c. late 2000s); part of Hadoop; open-source spinoff of the Google File System (GFS)
❖ Highly scalable; can do 10s of 1000s of nodes, PB files
❖ Designed for clusters of cheap commodity nodes
❖ Parallel reads/writes of sharded data “blocks”
❖ Replication of blocks improves fault tolerance
❖ Cons: Read-only + batch-append (no fine-grained updates/writes!)

https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

SLIDE 101

Hadoop Distributed File System (HDFS)

❖ The NameNode’s roster maps data blocks to DataNodes/IPs
❖ A distributed file on HDFS is just a directory (!) with individual filenames for each data block, plus metadata files
❖ HDFS data block size and replication factor are configurable parameters; the defaults are 128 MB and 3x

SLIDE 102

Data-Parallel Dataflow/Workflow

❖ Data-Parallel Dataflow: A dataflow graph wherein each operation is executed in a data-parallel manner
❖ Data-Parallel Workflow: A generalization; each vertex is a whole task/process that is run in a data-parallel manner

Each of these extended relational ops has a scalable data-parallel implementation in parallel RDBMSs, Spark, etc. All input tables are sharded!

π(σ(R) ∪ (S ⋈ T))

Q: So how do we run data sci. ops in data-parallel manner?

SLIDE 103

Outline

❖ Basics of Parallelism
  ❖ Task Parallelism
  ❖ Single-Node Multi-Core; SIMD; Accelerators
❖ Basics of Scalable Data Access
  ❖ Paged Access; I/O Costs; Layouts/Access Patterns
  ❖ Scaling Data Science Operations
❖ Data Parallelism: Parallelism + Scalability
  ❖ Data-Parallel Data Science Operations
  ❖ Optimizations and Hybrid Parallelism

SLIDE 104

Data-Parallel Data Science Ops

❖ Scalable data access for key representative examples of programs/operations that are ubiquitous in data science: ❖ DB systems: ❖ Relational select ❖ Non-deduplicating project ❖ Simple SQL aggregates ❖ SQL GROUP BY aggregates ❖ ML systems: ❖ Matrix sum/norms ❖ Gramian matrix ❖ (Stochastic) Gradient Descent

slide-105
SLIDE 105

105

Data-Parallel Relational Select

σ_{B="3b"}(R)
1. After ETL, the sharded large input file sits on the cluster’s disks
2. When a query/program is given, the master broadcasts it to the workers as such
3. Each worker does node-local Select as explained before and writes its local output to a local file
4. Master reports the union of local files as the global output file; note that the output is also a sharded file!

We focus on BSP data-parallel execution.

Basic Idea: Master splits work -> node-local work -> master unifies results

[Figure: master plus 3 workers, each with Disk and DRAM; input pages 1a-1d, 2a-2d on Worker 1, 3a-3d, 4a-4d on Worker 2, 5a-5d, 6a-6d on Worker 3; each worker scans its local pages and writes qualifying tuples (e.g., 3a, 3b, 3c, 3d) to its local output file.]

I/O costs: Disk: 6 (pages) + output; Network: 0
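A minimal pure-Python simulation of this BSP pattern (the shard contents and predicate are made up for illustration; node-local "Select" is just a filter over each shard):

```python
# Each worker holds a shard (a list of (A, B) tuples); names are illustrative.
shards = {
    "worker1": [("a1", "1b"), ("a2", "2b")],
    "worker2": [("a1", "3b"), ("a3", "4b")],
    "worker3": [("a2", "5b"), ("a1", "6b")],
}

def node_local_select(shard, predicate):
    """Step 3: each worker filters only its own shard; no network I/O."""
    return [t for t in shard if predicate(t)]

# Step 2: master "broadcasts" the predicate; step 4: union of local outputs.
predicate = lambda t: t[1] == "3b"          # sigma_{B="3b"}(R)
local_outputs = {w: node_local_select(s, predicate) for w, s in shards.items()}
global_output = [t for out in local_outputs.values() for t in out]
print(global_output)                         # [('a1', '3b')]
```

Non-dedup. Project on the next slide follows the identical pattern with the filter swapped for a per-tuple column projection, which is why its I/O costs are the same.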

slide-106
SLIDE 106

106

Data-Parallel Non-dedup. Project

1. After ETL, the sharded large input file sits on the cluster’s disks
2. When a query/program is given, the master broadcasts it to the workers as such
3. Each worker does node-local Non-dedup. Project as explained before and writes its local output to a local file
4. Master reports the union of local files as the global output file

We focus on BSP data-parallel execution.

Basic Idea: Master splits work -> node-local work -> master unifies results

SELECT C FROM R

[Figure: same 3-worker setup as before; each worker scans its local pages (1a-1d through 6a-6d) and writes its projected column values (e.g., 1b,2b; 3b,4b; 5b,6b) to its local output file.]

I/O costs: Disk: 6 (pages) + output; Network: 0

slide-107
SLIDE 107

107

Data-Parallel Simple Aggregates

1. After ETL, the sharded large input file sits on the cluster’s disks
2. When a query/program is given, the master broadcasts it to the workers as such
3. Each worker computes a node-local simple partial aggregate as explained before and sends it to the master for unification
4. Master unifies the partial results based on the op’s semantics

We focus on BSP data-parallel execution.

Basic Idea: Master splits work -> node-local work -> master unifies results

SELECT MAX(A) FROM R

[Figure: same 3-worker setup; each worker scans its local pages and sends its partial maximum (2a, 4a, 6a) to the master, which unifies them into the global MAX, 6a.]

I/O costs: Disk: 6 (pages) + output; Network: 3 (#workers)

slide-108
SLIDE 108

108

Data-Parallel Simple Aggregates

❖ Based on how easy it is to split them up over shards, SQL aggs (aka descriptive stats) are categorized into 3 types: ❖ Distributive Aggs: A shard sends only 1 datum to master ❖ MIN, MAX, COUNT, SUM ❖ Algebraic Aggs: A shard sends O(1)-size stats to master ❖ AVG (send SUM and COUNT separately); VARIANCE and STDEV (send SUM, SUM of squares, COUNT); etc. ❖ Holistic Aggs: O(1)-size stats are not enough in general; may need larger intermediate stats ❖ MEDIAN, MODE, PERCENTILES, etc. Q: Are all SQL aggregates easy to split up on sharded data?
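A minimal sketch of the algebraic case: each shard ships only (SUM, SUM of squares, COUNT), and the master reconstructs AVG and VARIANCE exactly (the shard values below are made up):

```python
shards = [[4.0, 3.0], [5.0, 1.0], [10.0, 8.0]]   # illustrative local data

# Each worker sends O(1)-size partial stats to the master.
partials = [(sum(s), sum(x * x for x in s), len(s)) for s in shards]

# Master unifies: componentwise sums, then derives the global stats.
S  = sum(p[0] for p in partials)
SS = sum(p[1] for p in partials)
N  = sum(p[2] for p in partials)
avg = S / N
variance = SS / N - avg ** 2        # population variance
print(avg, variance)                # approx. 5.167, 9.139
```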

slide-109
SLIDE 109

109

Data-Parallel Group By Aggregate

SELECT A, SUM(D) FROM R GROUP BY A

R:
A   B   C   D
a1  1b  1c  4
a2  2b  2c  3
a1  3b  3c  5
a3  4b  4c  1
a2  5b  5c  10
a1  6b  6c  8

Output:
A   Running Info.
a1  17
a2  13
a3  1

[Figure: master plus 3 workers; each worker scans its local shard of R and builds a local hash table of partial sums (Worker 1: a1:4, a2:3; Worker 2: a1:5, a3:1; Worker 3: a2:10, a1:8), which the master merges into the global output a1:17, a2:13, a3:1.]

Similar to data-parallel simple agg. Workers send partial hash tables to the master based on their local shards. Master collects and unifies the local hash tables into the global output. Network I/O cost depends on data stats (domain size of A). Q: What if the Master’s DRAM is not enough to cache all the hash tables?
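A minimal sketch of this partial-hash-table pattern for the query above (shard contents mirror the slide’s example; `Counter` merging stands in for the master’s unification step):

```python
from collections import Counter

# (A, D) pairs per worker, matching the slide's example shards.
shards = {
    "worker1": [("a1", 4), ("a2", 3)],
    "worker2": [("a1", 5), ("a3", 1)],
    "worker3": [("a2", 10), ("a1", 8)],
}

def local_groupby_sum(shard):
    """Each worker builds a partial hash table over its own shard only."""
    table = Counter()
    for a, d in shard:
        table[a] += d
    return table

# Master unifies the partial tables; Counter addition merges by key.
global_table = Counter()
for partial in (local_groupby_sum(s) for s in shards.values()):
    global_table += partial
print(dict(global_table))   # {'a1': 17, 'a2': 13, 'a3': 1}
```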

slide-110
SLIDE 110

110

Data-Parallel Matrix Sum/Norm

‖M‖₂² over M (6x4); say a 2x2 tiled layout+partitioning

Similar to data-parallel simple agg. Disk I/O cost: 6 (pages); Network I/O cost: 3 (#workers).

[Figure: master plus 3 workers; each worker holds two 2x2 tiles of M and computes a partial sum of squared entries (Worker 1: 10; Worker 2: 10; Worker 3: 20); the master adds the partials to get the global result, 40.]
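A minimal NumPy sketch of the same computation (the tile values follow the slide’s figure; each worker squares-and-sums only its own local tiles):

```python
import numpy as np

# Each worker's local 2x2 tiles of M (values from the slide's example).
worker_tiles = {
    "worker1": [np.array([[2, 1], [2, 1]]), np.zeros((2, 2))],
    "worker2": [np.array([[0, 1], [0, 0]]), np.array([[0, 2], [1, 2]])],
    "worker3": [np.array([[3, 0], [3, 0]]), np.array([[1, 0], [1, 0]])],
}

# Node-local partial aggregates: sum of squared entries per worker.
partials = {w: sum(float((t ** 2).sum()) for t in tiles)
            for w, tiles in worker_tiles.items()}
print(partials)                # {'worker1': 10.0, 'worker2': 10.0, 'worker3': 20.0}

# Master unifies: the squared norm of the whole matrix.
print(sum(partials.values()))  # 40.0
```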

slide-111
SLIDE 111

111

MᵀM over M (6x4)

[Figure: M is split into 2x2 tiles labeled A-F (block rows: A,B / C,D / E,F); MᵀM is written over the transposed tiles (Aᵀ, Cᵀ, Eᵀ; Bᵀ, Dᵀ, Fᵀ) and yields four output tiles O1-O4, e.g., O1 = AᵀA + CᵀC + EᵀE.]

Data-Parallel Gramian Matrix

Say a 2x2 tiled layout+partitioning.

Basic Idea: Master splits work -> node-local work -> master commands workers to talk to others as needed -> more node-local work -> master unifies results

More complex in the data-parallel setting, since we may need to communicate data shards across workers!

slide-112
SLIDE 112

112

Data-Parallel Gramian Matrix

[Figure: master plus 3 workers; Worker 1 holds tiles A,B; Worker 2 holds C,D; Worker 3 holds E,F; partial tiles such as AᵀB are shipped between workers as needed to assemble the output tiles O1-O4 of MᵀM.]

1. Master assigns tiles of O to workers, e.g., W1: O1; W2: O2; W3: O3, O4
2. Workers compute the node-local partial tiles needed and cache them on local disk: W1: AᵀA, BᵀB, AᵀB, etc.
3. Master commands workers to send corresp. tiles to other workers based on step 2: W1 gets CᵀC from W2 and EᵀE from W3 for O1; W2 gets AᵀB from W1; etc.
4. More node-local work to finish all Oi
5. Union of local tiles is the sharded output!

Many ways to do this, with differing disk and network I/O costs!
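A minimal single-process NumPy sketch of the tiled Gramian, assuming the 3x2 tiling above (the inter-worker "shipping" of partial tiles is just dictionary lookups here; M’s values are random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.integers(0, 4, size=(6, 4)).astype(float)

# 2x2 tiles: block-rows 0..2, block-cols 0..1 (A..F in the slide's labels).
tiles = {(i, j): M[2*i:2*i+2, 2*j:2*j+2] for i in range(3) for j in range(2)}

# Each worker would compute partial tiles T_i^T T_j locally; here we
# form every needed product and sum along the block-row index.
O = {}
for p in range(2):
    for q in range(2):
        O[(p, q)] = sum(tiles[(i, p)].T @ tiles[(i, q)] for i in range(3))

# Union of the output tiles equals M^T M.
top = np.hstack([O[(0, 0)], O[(0, 1)]])
bot = np.hstack([O[(1, 0)], O[(1, 1)]])
assert np.allclose(np.vstack([top, bot]), M.T @ M)
```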

slide-113
SLIDE 113

113

Data-Parallel Gramian Matrix

❖ Not straightforward to determine I/O costs (both disk I/O and network I/O) of matrix mult., even simple Gramian! ❖ CPU costs can also differ based on whether workers repeat redundant work vs cache it to file ❖ Runtime is a complex function combining disk I/O cost, network I/O cost, and CPU/compute cost ❖ Different operator implementations exist in the parallel data systems literature: crossproduct-based multiply, replication-based multiply, etc.

https://sfu-db.github.io/dbsystems/Papers/systemML.pdf http://www.vldb.org/pvldb/vol9/p1425-boehm.pdf

slide-114
SLIDE 114

114

Data Access Pattern of Scalable SGD

W(t+1) <- W(t) - η ∇̃L(W(t))

∇̃L(W) = Σ_{i∈B} ∇l(y_i, f(W, x_i))

Sample each mini-batch from the dataset without replacement

[Figure: the original dataset is randomly shuffled once (ORDER BY RAND() -> randomized dataset); Epoch 1 is a sequential scan over it, updating the model after each mini-batch: W(0) -> Mini-batch 1 -> W(1) -> Mini-batch 2 -> W(2) -> Mini-batch 3 -> W(3); Epoch 2 is another sequential scan (optionally after a new random shuffle), continuing W(3) -> W(4) -> ...]
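A minimal NumPy sketch of this access pattern, assuming a linear model with squared loss (the loss, data, and hyperparameters are illustrative; the point is the shuffle-once, then sequential-scan-per-epoch structure):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(600, 5)), rng.normal(size=600)
W = np.zeros(5)
eta, batch_size, n_epochs = 0.001, 100, 2

order = rng.permutation(len(X))        # one random "shuffle" (ORDER BY RAND())
for epoch in range(n_epochs):
    # optionally: order = rng.permutation(len(X))   # new shuffle per epoch
    for start in range(0, len(X), batch_size):      # sequential scan
        B = order[start:start + batch_size]         # mini-batch w/o replacement
        grad = X[B].T @ (X[B] @ W - y[B])           # grad of squared loss on B
        W = W - eta * grad                          # W(t+1) = W(t) - eta * grad
```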

slide-115
SLIDE 115

115

Data Access Pattern of Scalable SGD

❖ An SGD epoch is similar to SQL aggs but also different: ❖ More complex agg. state (running info): the model params W(t) ❖ Multiple mini-batch updates to model params within a pass ❖ Sequential dependency across mini-batches in a pass ❖ Need to keep track of model params across epochs ❖ Not an algebraic aggregate; hard to parallelize! ❖ Not even commutative: different random shuffle orders give different results (very unlike relational ops)! ❖ (Optional) New random shuffling before each epoch Q: How to execute SGD in a data-parallel manner?

slide-116
SLIDE 116

116

Data-Parallel Scalable SGD: Take 1

❖ Given sharded dataset across workers ❖ Master broadcasts latest model parameters to workers ❖ Each worker uses that model to do local disk-scalable SGD ❖ Tricky part: How to “combine” indep. models from workers? ❖ Model Averaging: Treat it like SQL AVG! Master receives new models from each worker (parameters or gradients) and averages them!

[Figure: Workers 1..n plus a master; each worker i sends its locally updated model W_i(t) to the master, which averages them:]

W(t) = (1/n) Σ_{i=1..n} W_i(t)

❖ A bizarre heuristic that affects SGD convergence ❖ Works OK for GLMs ❖ Terrible for ANNs!
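A minimal sketch of one model-averaging round on top of the SGD epoch above (pure NumPy; `local_sgd_epoch` is an illustrative stand-in for each worker’s disk-scalable local run):

```python
import numpy as np

def local_sgd_epoch(W, X, y, eta=0.01, batch_size=50):
    """Worker-local pass over its own shard (squared loss, linear model)."""
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        W = W - eta * Xb.T @ (Xb @ W - yb)
    return W

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(200, 5)), rng.normal(size=200)) for _ in range(3)]
W = np.zeros(5)

for _ in range(5):                                   # 5 averaging rounds
    # Master broadcasts W; each worker runs a local epoch independently.
    local_models = [local_sgd_epoch(W, X, y) for X, y in shards]
    W = np.mean(local_models, axis=0)                # master averages models
```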

slide-117
SLIDE 117

117

Data-Parallel Scalable SGD: Take 2

❖ Disadvantages of Model Averaging for data-parallel SGD: ❖ Poor convergence for non-convex/ANN models; leads to too many epochs and typically poor ML accuracy ❖ Master’s averaging step is choke point at scale (n ~ 100s); model param. sizes can even be GBs; wastes resources! ❖ ParameterServer: A more flexible from-scratch design of an ML system specifically for distributed SGD: ❖ Breaks the synchronization barrier for aggregation: allows asynchronous updates from workers to master ❖ Flexible communication frequency: can send updates at every mini-batch or a set of few mini-batches

https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf

slide-118
SLIDE 118

118

ParameterServer for Scalable SGD

[Figure: Workers 1..n and parameter servers PS 1, PS 2, ...: a multi-server “master” where each server manages a part of W(t). Push/Pull happens when ready/needed; no sync. for workers or servers. Workers send (possibly stale) gradients, e.g., ∇̃L_1(W(t)), ∇̃L_2(W(t-1)), ∇̃L_n(W(t+1)), to the servers for updates at each mini-batch (or at a lower frequency).]

❖ Model params may get out-of-sync or stale; but SGD turns out to be remarkably robust! Multiple updates/epoch really helps ❖ Network I/O cost per epoch is higher (per mini-batch)
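A minimal single-process sketch of the asynchronous push/pull pattern (the `ParamServer` class and the staleness simulation are illustrative, not the paper’s API; a real deployment shards W across servers and runs workers concurrently):

```python
import numpy as np

class ParamServer:
    """Illustrative: holds the global params; applies pushed gradients."""
    def __init__(self, dim, eta=0.001):
        self.W, self.eta = np.zeros(dim), eta
    def pull(self):
        return self.W.copy()
    def push(self, grad):              # no barrier: applied on arrival
        self.W -= self.eta * grad

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(3)]
ps = ParamServer(dim=5)

cached = [ps.pull() for _ in range(3)]    # each worker's (possibly stale) copy
for step in range(30):
    w_id = step % 3                       # workers interleave arbitrarily
    X, y = shards[w_id]
    grad = X.T @ (X @ cached[w_id] - y)   # gradient on a stale W
    ps.push(grad)                         # async push, no barrier
    if step % 5 == 0:                     # pull only occasionally ("lower
        cached[w_id] = ps.pull()          #  frequency"); staleness grows
```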

slide-119
SLIDE 119

119

Data-Parallel Data Science Ops

❖ Scalable data access for key representative examples of programs/operations that are ubiquitous in data science: ❖ DB systems: ❖ Relational select ❖ Non-deduplicating project ❖ Simple SQL aggregates ❖ SQL GROUP BY aggregates ❖ ML systems: ❖ Matrix sum/norms ❖ Gramian matrix ❖ (Stochastic) Gradient Descent

slide-120
SLIDE 120

120

Outline

❖ Basics of Parallelism

❖ Task Parallelism ❖ Single-Node Multi-Core; SIMD; Accelerators

❖ Basics of Scalable Data Access

❖ Paged Access; I/O Costs; Layouts/Access Patterns ❖ Scaling Data Science Operations

❖ Data Parallelism: Parallelism + Scalability

❖ Data-Parallel Data Science Operations ❖ Optimizations and Hybrid Parallelism

slide-121
SLIDE 121

121

Data-Parallel System Optimizations

❖ Some systems-level optimizations are commonly used in data-parallel systems to reduce runtimes ❖ Replication: Put a shard on more than 1 worker; allows for more parallelism during execution ❖ Caching: Each worker stores as much of its data on its DRAM as possible; as cluster size goes up, whole input data and even intermediate data might all fit in DRAM! ❖ Asynchrony: Less common in DB systems but more common in ML systems (e.g., ParameterServer) ❖ Approximation: Use sampling in a careful manner ❖ Recent research explores use of ML-based approaches to decide data placement and caching strategies across memory hierarchy using past workload+system info.

slide-122
SLIDE 122

122

Hybrid Parallelism

❖ We saw 2 main parallelism paradigms: Task Parallelism vs Data Parallelism with different pros and cons ❖ Task-par. wastes memory/storage due to replication; remote reads waste network; but easy to implement ❖ Data-par. is painful to implement at op level; but scales without wasting memory/storage; more network costs Q: Is it possible to get the best of both these worlds? ❖ That is, we want to run task parallelism on sharded data, e.g., different SQL queries or different ML training procedures run on top of same sharded data regime ❖ Aka “Multi-Query Execution” in the DB world

slide-123
SLIDE 123

123

Task Par. vs BSP Data Par.

Example: Given 4 nodes

[Task graph: T1 -> D -> {T2, T3, T4}; task costs: T1: 6, T2: 15, T3: 6, T4: 9]

Suppose each task gets perfect linear speedup on its useful work on BSP, and a master overhead of just 1 unit each before/after

Fully task-par schedule: N1: T1, T3, T4; N2: T2; N3, N4 idle. N3 and N4 are both useless. Why? N2 has idle times too. Why?
[Gantt chart with time marks at 6, 15, 21, 30]

Fully BSP data-par schedule: Mstr: T1 T1 T2 T2 T3 T3 T4 T4; W1, W2, W3 each: T1 T2 T3 T4
[Gantt chart with time marks at 1, 6, 7, 8, 10, 12, 14, 16, 20]
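Under the slide’s assumptions (3 workers, perfect linear speedup, and 1 unit of master overhead before and after each task), the BSP makespan of 20 can be checked with a few lines of arithmetic; a minimal sketch:

```python
costs = {"T1": 6, "T2": 15, "T3": 6, "T4": 9}
n_workers, overhead = 3, 1

# Fully BSP data-par: tasks run one after another, each split across
# all workers, paying master overhead before and after every task.
bsp_makespan = sum(overhead + c / n_workers + overhead for c in costs.values())
print(bsp_makespan)   # 4 + 7 + 4 + 5 = 20.0, matching the slide
```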
slide-124
SLIDE 124

124

Hybrid of Task and Data Parallelism

Example: One possible hybrid schedule: T1 runs data-parallel across W1-W3, then T2 (on W1) and T3 (on W2) run task-parallel, then T4 runs data-parallel across W1-W3

[Gantt chart: Mstr: T1 T1 T4 T4; W1: T1, T2, T4; W2: T1, T3, T4; W3: T1, T4; time marks at 1, 6, 13, 14, 18; finishes at 18]

Q: Can we go faster if we hybridize task and data par?

Vs Fully task-par: 30. Vs Fully data-par: 20.

❖ Most scalable data systems today support only full task-par. (e.g., Dask) or full data-par. (e.g., RDBMSs); hybrid software complexity is high ❖ Some RDBMSs do internally exploit hybrid-par. for relational dataflows ❖ Spark is beginning to support task-par. too ❖ Later topic: Cerebro system for deep learning
