

SLIDE 1

DSC 102: Systems for Scalable Analytics
Topic 4: ML Data Preparation and Model Selection
Chapters 8, 8.1, 8.2, 8.3, 8.4 of the MLSys Book
Arun Kumar

SLIDE 2

DSC 102 will get you thinking about the fundamentals of scalable analytics systems

1. “Systems”: What resources does a computer have? How to store and compute efficiently over large data? What is cloud computing?

2. “Scalability”: How to scale and parallelize data-intensive computations?

3. Scalable Systems for “Analytics”:
   3.1. Source: Data acquisition & preparation for ML
   3.2. Build: Dataflow & Deep Learning systems
   3.3. Deploying ML models

4. Hands-on experience with tools for scalable analytics
SLIDE 3

The Lifecycle of ML-based Analytics

Data acquisition → Data preparation → Feature Engineering → Training & Inference → Model Selection → Model Serving → Monitoring

SLIDE 4

Data Science in the Real World

Q: How do real-world data scientists spend their time?

CrowdFlower Data Science Report 2016 https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

SLIDE 5

Data Science in the Real World

CrowdFlower Data Science Report 2016 https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

Q: How do real-world data scientists spend their time?

SLIDE 6

Data Science in the Real World

Q: How do real-world data scientists spend their time?

Kaggle State of ML and Data Science Survey 2018

SLIDE 7

Data Science in the Real World

IDC-Alteryx State of Data Science and Analytics Report 2019

Q: How do real-world data scientists spend their time?

SLIDE 8

Sourcing Stage of Data Science

❖ Data science does not exist in a vacuum. It must interplay with the data-generating process and the prediction application
❖ Sourcing: The stage of the data science lifecycle where you go from raw datasets to “analytics/ML-ready” datasets
❖ What makes sourcing challenging/time-consuming?
   ❖ Data access/availability constraints
   ❖ Heterogeneity of data sources/formats/types
   ❖ Messy, incomplete, ambiguous, and/or erroneous data
   ❖ Poor data governance in the organization
   ❖ Bespoke/diverse kinds of prediction applications
   ❖ Evolution of the data-generating process/application
   ❖ Large scale of data

SLIDE 9

Sourcing Stage of Data Science

❖ Sourcing: The stage of the data science lifecycle where you go from raw datasets to “analytics/ML-ready” datasets
❖ At a high level, roughly 5 kinds of activities:

[Pipeline: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

SLIDE 10

Acquiring Data

[Pipeline: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

SLIDE 11

Acquiring Data: Data Sources

❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources
❖ The data scientist must know how to access and get datasets!

❖ Structured data: Exported from RDBMSs (e.g., Redshift), often with SQL
❖ Semistructured data: Exported from “NoSQL” stores (e.g., MongoDB)
❖ Log files, text files, documents, multimedia files, etc.: Typically stored on HDFS, S3, etc.
❖ Graph/network data: Managed by Neo4j

Ad: Take DSC 104 to learn semistructured and graph databases

SLIDE 12

Acquiring Data: Examples

Example: Recommendation system (e.g., Netflix)
   Prediction app: Identify top movies to display for a user
   Data sources: User data and past click logs; movie data; movie images

Example: Social media analytics for social science
   Prediction app: Predict which tweets will go viral
   Data sources: Tweets as JSON; structured metadata; graph data; entity dictionaries

SLIDE 13

Acquiring Data

❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources
❖ The data scientist must know how to access and get datasets!

Potential challenges and mitigations:
❖ Access control: Learn your organization’s data security and authentication policies
❖ Data heterogeneity: Do you really need all data sources/types?
❖ Data volume: Do you really need all the data?
❖ Scale: Avoid sequential file copying
❖ Manual errors: Use automated “data pipeline” services such as Airflow (later)

SLIDE 14

Organizing Data

[Pipeline: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

SLIDE 15

(Re-)Organizing Data

❖ Given diverse data sources/file formats, the data scientist must reorganize them into a usable format for analytics/ML
❖ Organization is specific to the analytics/ML task at hand
❖ Might need SQL, MapReduce (later), and file handling
❖ Examples of usable organization:

Prediction app: Fraud detection in banking
   → Large single-table CSV file, say, on HDFS (joins to denormalize; flatten JSON records)

Prediction app: Image captioning on social media
   → Large binary file with 1 image tensor and 1 string per line (fuse JSON records; extract image tensors)

SLIDE 16

(Re-)Organizing Data: Tips

❖ Data re-organization these days often involves a lot of coding (Python, SQL, Java) and scripting (bash)

Some suggested best practices:
❖ Documentation: Maintain notes/READMEs with your code
❖ Automation: Use scripts (meta-programs) to automate orchestration of data re-org. code
❖ Provenance: Manage metadata on where your data records/variables come from and why they are there
❖ Versioning: You might do data re-org. many times; manage metadata on what version has what and when

SLIDE 17

(Re-)Organizing Data: Schematization

❖ Increasingly, “ML platforms” in industry are imposing more discipline on what re-organized data must look like
❖ Lightweight and flexible schemas are becoming common
❖ Makes it easier to automate data validation

https://www.tensorflow.org/tfx/guide

SLIDE 18

(Re-)Organizing Data

❖ Custom ML platforms are proliferating in industry, each with its own approach to organizing and cataloging ML data!

https://eng.uber.com/michelangelo/

SLIDE 19

Data Cleaning

[Pipeline: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

SLIDE 20

Data Cleaning

❖ Real-world datasets often have errors, ambiguity, incompleteness, inconsistency, and other quality issues
❖ Data cleaning: The process of fixing data quality issues to ensure errors do not cascade/corrupt analytics/ML results
❖ Diverse sources/causes of data quality issues:
   ❖ Human-generated data: Mistakes, misunderstandings
   ❖ Hardware-generated data: Noise, failures
   ❖ Software-generated data: Bugs, errors, semantic issues
   ❖ Attribute encoding/formatting conventions (e.g., dates)
   ❖ Attribute unit/semantics conventions (e.g., km vs. mi)
   ❖ Data integration: Duplicate entities, value differences
   ❖ Evolution of data schemas in the application

SLIDE 21

Data Cleaning Task: Missing Values

❖ Long-standing problem studied in statistics and DB/AI
❖ Various assumptions about the “missingness” property, in terms of correlations between missing and observed values in the dataset:
   ❖ Missing Completely at Random (MCAR): No (causal) relationship between missing and non-missing values
   ❖ Missing at Random (MAR): Systematic relationships between missing values and observed values
   ❖ Missing Not at Random (MNAR): Missingness itself depends on the value that is missing
❖ Add a 0/1 missingness indicator variable and impute missing values:
   ❖ Statistical approaches: distributional properties
   ❖ ML/DL-based approaches: self-supervised
❖ Some ML packages offer these at scale (e.g., Dask-ML)
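To make the indicator-plus-impute recipe concrete, here is a minimal sketch using pandas and scikit-learn; the toy Upvotes column is hypothetical.

import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with a missing value (hypothetical data)
df = pd.DataFrame({"Upvotes": [1539.0, None, 402.0]})

# Step 1: add a 0/1 missingness indicator variable
df["Upvotes_missing"] = df["Upvotes"].isna().astype(int)

# Step 2: statistical imputation, here with the column mean
imputer = SimpleImputer(strategy="mean")
df[["Upvotes"]] = imputer.fit_transform(df[["Upvotes"]])
print(df)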

SLIDE 22

Data Cleaning Task: Entity Matching

❖ A common cleaning task for multi-source datasets
❖ Duplications of real-world entities can arise when using data drawn from multiple sources
❖ Often need to match and deduplicate entities in the unified data; otherwise, query/ML answers can be wrong!
❖ Aka entity deduplication / record linkage / entity linkage

Customers1:
| FullName       | Age | City      | State |
| Aisha Williams | 27  | San Diego | CA    |

Customers2:
| LastName | FirstName | MI | Age | Zipcode |
| Williams | Aisha     | R  | 27  | 92122   |

Q: Are these the same person (“entity”)?

SLIDE 23

General Workflow of Entity Matching

❖ 3 main stages: Blocking → Pairwise check → Clustering
❖ Pairwise check:
   ❖ Given 2 records, how likely is it that they are the same entity? SOTA: “entity embeddings” + deep learning
❖ Blocking:
   ❖ Pairwise check cost for a whole table is too high: O(n^2)
   ❖ Create “blocks”/subsets of records; do pairwise checks only within a block
   ❖ Domain-specific heuristics prune “obvious” non-matches using similarity/distance metrics (e.g., edit distance on Name)
❖ Clustering:
   ❖ Given pairwise scores, consolidate records into entities
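A minimal sketch of blocking plus pairwise checking, assuming string records, a hypothetical first-letter blocking key, and a simple string-similarity score; production systems use learned pairwise models and a final clustering step.

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

names = ["Aisha Williams", "Aisha R. Williams", "Bob Chen"]

# Blocking: bucket records by a cheap key to avoid the O(n^2) all-pairs cost
blocks = defaultdict(list)
for name in names:
    blocks[name[0].lower()].append(name)

# Pairwise check: score candidate pairs only within each block
for block in blocks.values():
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a, b).ratio()  # similarity in [0, 1]
        if score > 0.8:
            print(f"Likely same entity ({score:.2f}): {a!r} <-> {b!r}")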

SLIDE 24

Data Cleaning

❖ Many approaches studied by the DB and AI communities:
   ❖ Integrity constraints: E.g., if ZipCode is the same across customer records, State must be the same too
   ❖ Business logic/rules: programs encoding domain knowledge
   ❖ Supervised ML: E.g., predict missing values
❖ Unfortunately, data quality issues are often so peculiar and specific to the dataset/application that human intervention (by the data scientist) is often the only reliable way in practice! ☺
❖ Crowdsourcing/expertsourcing is another alternative

Data cleaning in practice is “death by a thousand cuts”! :)
Q: How can we even hope to automate data cleaning?

SLIDE 25

Data Labeling

[Pipeline: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

SLIDE 26

Data Labeling

❖ Most recent AI successes are due to supervised ML
❖ Large datasets are not enough; we need labeled examples, i.e., pairs of inputs and outputs
❖ Labeling: The process of annotating an example (raw form or processed feature vector) with a ground-truth label for use by a given prediction task

Q: What is a label for this image?
   Dog (object recognition)
   Couch (object recognition)
   Dog w/ bounding box (object detection)
   Shiba Inu (dog breed classifier)
   Yes (meme classifier!)

SLIDE 27

Data Labeling: Application Need

❖ W.r.t. sources of labels, there are 3 kinds of prediction applications:

1. The data-generating process offers labels naturally, e.g., customer churn prediction, forecasting, etc.

2. Product/service users offer labels (in)directly, e.g., email spam filters, computational advertising, product recommendations, photo tagging, web search, etc.

3. Application-specific extra effort is needed to get labels, e.g., radiology, self-driving cars, species classification, video surveillance, machine translation, knowledge base construction, document summarization, etc.

SLIDE 28

Data Labeling Approaches

https://www.snorkel.org/blog/weak-supervision

SLIDE 29

Data Labeling Approaches

The 5 most common approaches to acquiring labels:

1. Manual supervision by subject matter experts (SMEs): Traditional approach; slow and expensive but common

2. Active learning with SMEs (less common): Prioritize which unlabeled examples the SME must label based on benefit; possible for some kinds of ML; pay-as-you-go

3. Crowdsourcing; expertsourcing: For tasks where lay-person intelligence suffices; otherwise, if the task is more technical, get workers with domain knowledge

4. Programmatic supervision

5. Transfer learning-based supervision
SLIDE 30

Programmatic Supervision

❖ Basic idea: Instead of manually labeling each example, write programs/rules/heuristics that encode some domain intuition to label examples en masse
❖ Pros: Improved labeling productivity; likely lower costs
❖ Cons: Need to write code; less reliable accuracy; unclear if complex prediction outputs are supportable

http://cidrdb.org/cidr2019/papers/p58-ratner-cidr19.pdf
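A minimal sketch of the idea, in the spirit of (but not using) the Snorkel API: hypothetical labeling functions encode keyword heuristics, and a naive majority vote aggregates them.

# Hypothetical labeling functions for a spam-detection task
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_has_link(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_polite_greeting(text: str) -> int:
    return NOT_SPAM if text.lower().startswith("dear") else ABSTAIN

def weak_label(text: str) -> int:
    # Naive aggregation: majority vote among the non-abstaining functions
    votes = [lf(text) for lf in (lf_has_link, lf_polite_greeting)]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("Win money now: https://example.com"))  # -> 1 (SPAM)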

SLIDE 31

Transfer Learning

❖ Basic Idea: Use an ML model that was pre-trained on a different but related task (maybe it had large labeled data) to reduce labeled data needs of your task

https://medium.com/the-official-integrate-ai-blog/transfer-learning-explained-7d275c1e34e2

❖ Works well for image/vision and text/NLP tasks and models
❖ If the target task is a subset of the source task, even better: use the source model’s outputs as pseudo-labels

[Diagram: a model pre-trained on the source task is adapted to the target task]

SLIDE 32

The Lifecycle of ML-based Analytics

Data acquisition → Data preparation → Feature Engineering → Training & Inference → Model Selection → Model Serving → Monitoring

SLIDE 33

Building Stage of Data Science

❖ Building: The stage where one performs model selection, i.e., goes from prepped data to prediction functions and/or other analytics outputs for the application
❖ What makes building challenging/time-consuming?
   ❖ Heterogeneity of data sources/formats/types
   ❖ Configuration complexity of ML models
   ❖ Large scale of data
   ❖ Long training runtimes for some ML models
   ❖ Multiple optimization criteria for the application
   ❖ Evolution of the data-generating process/application

SLIDE 34

Building Stage of Data Science

❖ Building: The stage where one performs model selection, i.e., goes from prepped data to prediction functions and/or other analytics outputs for the application
❖ The data scientist needs to steer 3 key activities that invoke ML training and inference as sub-routines:

1. Feature Engineering (FE): How to represent signals/variables appropriately for the ML model to consume?

2. Algorithm/Architecture Selection (AS): What class of prediction functions (model type/ANN architecture) to use?

3. Hyper-parameter Tuning (HT): How to improve prediction accuracy by configuring the ML “knobs” better?

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

SLIDE 35

Building Stage of Data Science

❖ Building: The stage where one performs model selection, i.e., goes from prepped data to prediction functions and/or other analytics outputs for the application
❖ Model selection is typically an iterative, exploratory process, with the data scientist making decisions on FE, AS, and HT

[Diagram: candidate configs FE1, FE2, …; AS1, AS2, …; HT1, HT2, … → train and test model config(s) on an ML system → post-process and consume results → next iteration]

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

SLIDE 36

Building Stage of Data Science

[Diagram: the iterative model selection loop over FE/AS/HT configs]

❖ Many constraints guide decisions on FE, AS, and HT: prediction accuracy, data/feature types, interpretability, tool availability, scalability, runtimes, legal issues, etc.
❖ Usually application-specific and dataset-specific; recall Pareto surfaces

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

SLIDE 37

Feature Engineering

[Diagram: the iterative model selection loop, with the FE choices highlighted]

SLIDE 38

Feature Engineering

❖ The overarching process of converting prepared data into a feature vector representation for ML training/inference
❖ Aka feature extraction, representation extraction, etc.
❖ An umbrella term for a wide variety of tasks that are informed by what kind of ML model will be trained:

1. Recoding and value conversions
2. Joins and/or aggregates
3. Feature interactions
4. Feature selection
5. Dimensionality reduction
6. Temporal feature extraction
7. Textual feature extraction and embeddings
8. Learned feature extraction in deep learning

SLIDE 39

1. Recoding and value conversions

❖ Common on relational/structured/tabular data
❖ Typically requires some global column stats + code to reconvert each tuple (an example’s feature values)

| UserID | State | Date    | Upvotes | Comment                        | Label |
| 143    | CA    | 4/3/19  | 1539    | “This restaurant is overrated” | −     |
| 337    | NY    | 11/7/19 | 5020    | “Not too bad!”                 | +     |
| 98     | WI    | 2/8/20  | 402     | “Pretty rad”                   | +     |
| …      | …     | …       | …       | …                              | …     |

Example: Decision trees can use categorical features directly, but GLMs support only numeric features; need a one-hot encoded 0/1 vector
Scaling global stats: “SELECT DISTINCT State”?
Reconversion: Tuple-level function to look up the domain hash table
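A minimal sketch of this recoding, with the global-stats pass (the distinct domain, akin to SELECT DISTINCT State) and the tuple-level lookup made explicit; the toy column follows the table above.

import pandas as pd

df = pd.DataFrame({"State": ["CA", "NY", "WI", "CA"]})

# Global stats pass: collect the distinct domain of the column
domain = sorted(df["State"].unique())
index = {v: i for i, v in enumerate(domain)}  # the "domain hash table"

# Tuple-level reconversion: map each value to a 0/1 vector
def one_hot(value):
    vec = [0] * len(domain)
    vec[index[value]] = 1
    return vec

df["State_onehot"] = df["State"].map(one_hot)
print(df)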

SLIDE 40

1. Recoding and value conversions

❖ Common on relational/structured/tabular data
❖ Typically requires some global column stats + code to reconvert each tuple

(Same example table as before)

Example: GLMs and ANNs need whitening of numeric features; dense: subtract the mean and divide by the stdev; sparse: divide by max − min
Scaling global stats: How to scale mean/stdev/max/min computations?
Reconversion: Tuple-level function to modify the number using the stats
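A minimal sketch of the dense-feature recipe: one pass for the global stats (mean, stdev), then a tuple-level reconversion; the toy numbers follow the table above.

import pandas as pd

upvotes = pd.Series([1539.0, 5020.0, 402.0])

# Global stats pass
mean, std = upvotes.mean(), upvotes.std()

# Tuple-level reconversion: standardize each value
whitened = upvotes.map(lambda x: (x - mean) / std)
print(whitened)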

SLIDE 41

1. Recoding and value conversions

❖ Common on relational/structured/tabular data
❖ Typically requires some global column stats + code to reconvert each tuple

(Same example table as before)

Example: Some models, like Bayesian networks or Markov logic networks, benefit from (or even need) binning/discretization of numerics
Scaling global stats: How to scale histogram computations?
Reconversion: Tuple-level function to convert a number to a bin ID
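A minimal sketch of equi-width binning with pandas; the bin count is a hypothetical choice, and pd.cut plays the role of both the histogram computation and the value-to-bin-ID reconversion.

import pandas as pd

upvotes = pd.Series([1539, 5020, 402, 87, 2210])

# Compute 3 equi-width bins over the column and map each value to its bin ID
bin_ids = pd.cut(upvotes, bins=3, labels=False)
print(bin_ids)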

SLIDE 42

2. Joins and Aggregates

❖ Common on relational/structured/tabular data
❖ Most real-world relational datasets are multi-table; require key-foreign-key joins, aggregations combined with key-key joins, etc.

Reviews:
| UserID | State | Date | Upvotes | Comment | Label |
| 143    | CA    | …    | …       | …       | −     |
| 337    | NY    | …    | …       | …       | +     |
| 143    | CA    | …    | …       | …       | +     |
| …      | …     | …    | …       | …       | …     |

Users:
| UserID | Age | Name |
| 304    | 40  | …    |
| 23     | 25  | …    |
| 143    | 33  | …    |
| …      | …   | …    |

Example: Join the tables on UserID; concatenate the user’s info as extra features!
Q: What kind of join is this? How to scale this computation?
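A minimal sketch of this key-foreign-key join in pandas (the same computation SQL would express as a join on UserID); the Age value for user 337 is hypothetical.

import pandas as pd

reviews = pd.DataFrame({"UserID": [143, 337, 143], "Label": ["-", "+", "+"]})
users = pd.DataFrame({"UserID": [143, 337, 304], "Age": [33, 51, 40]})

# Key-foreign-key join: attach each user's attributes as extra features
features = reviews.merge(users, on="UserID", how="left")
print(features)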

SLIDE 43

2. Joins and Aggregates

❖ Common on relational/structured/tabular data
❖ Most real-world relational datasets are multi-table; require key-foreign-key joins, aggregations combined with key-key joins, etc.

(Same Reviews table as before)

Example: Join the table with itself on UserID to count #reviews and avg #upvotes for each user in a new temporary table, and join that back to get more features!
Q: What kind of computation is this? How to scale it?
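A minimal sketch of the aggregate-then-join pattern in pandas; the column names follow the toy table.

import pandas as pd

reviews = pd.DataFrame({"UserID": [143, 337, 143],
                        "Upvotes": [1539, 5020, 402]})

# Aggregate per user into a temporary table: review count and average upvotes
stats = reviews.groupby("UserID").agg(
    n_reviews=("Upvotes", "count"),
    avg_upvotes=("Upvotes", "mean"),
).reset_index()

# Key-key join back onto the reviews to add the new features
features = reviews.merge(stats, on="UserID")
print(features)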

SLIDE 44

3. Feature Interactions

❖ Sometimes used on relational/structured/tabular data, especially when using simple (high-bias) models like GLMs
❖ Pairwise is common; ternary is not unheard of

| F1 | F2 | F3 | Label |
| 3  | 2  | …  | −     |
| 4  | 20 | …  | +     |
| 5  | 10 | …  | +     |
| …  | …  | …  | …     |

| F1 | F2 | F3 | F11 | F12 | F13 | F22 | F23 | F33 | Label |
| 3  | 2  | …  | 9   | 6   | …   | 4   | …   | …   | −     |
| 4  | 20 | …  | 16  | 80  | …   | 400 | …   | …   | +     |
| 5  | 10 | …  | 25  | 50  | …   | 100 | …   | …   | +     |
| …  | …  | …  | …   | …   | …   | …   | …   | …   | …     |

❖ No global stats needed, just a tuple-level function
❖ NB: The popularity of this has declined due to kernel SVMs
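A minimal sketch of pairwise interactions as a pure tuple-level function (no global stats), reproducing the expanded table above for two features.

import pandas as pd
from itertools import combinations_with_replacement

df = pd.DataFrame({"F1": [3, 4, 5], "F2": [2, 20, 10]})

# Tuple-level function: add all pairwise products, e.g., F1F1, F1F2, F2F2
for a, b in combinations_with_replacement(["F1", "F2"], 2):
    df[a + b] = df[a] * df[b]
print(df)  # row 1 gains F1F1=9, F1F2=6, F2F2=4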

SLIDE 45

4. Feature Selection

❖ Sometimes used on relational/structured/tabular data
❖ Basic idea: Instead of using the whole feature set, use a subset, e.g.:
   (UserID, State, Date, Upvotes, Comment) → (State, Upvotes, Comment) → (Upvotes, Comment)
❖ Formulated as a discrete optimization problem
❖ The general problem is NP-Hard in the #features
❖ Many heuristics exist in ML/data mining, typically based on some information-theoretic measures
❖ Typically scaled as “outer loops” over training/inference
❖ Some ML users also prefer a human-in-the-loop approach
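A minimal sketch of one common heuristic, univariate scoring with scikit-learn's SelectKBest; the synthetic data and the choice of k are hypothetical.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 5 candidate features
y = (X[:, 1] > 0).astype(int)  # the label truly depends only on feature 1

# Score each feature with an ANOVA F-test and keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask over the original features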

SLIDE 46

5. Dimensionality Reduction

❖ Often used on relational/structured/tabular data
❖ Basic idea: Transform the features to a different latent space
❖ Examples: PCA, SVD, LDA, matrix factorization
❖ Feature selection preserves the semantics of each feature, while dimensionality reduction may not and can combine features in unintuitive ways
❖ Scaling these is non-trivial! Similar to scaling individual ML training algorithms (later)

(Latent feature table: | F1 | F2 | F3 | Label |, with a row like | 0.3 | 4.2 | 29.2 | − |)

Q: How is this different from “feature selection”?
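A minimal sketch with scikit-learn's PCA; unlike feature selection, every latent column mixes all of the original features. The synthetic data is hypothetical.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 10 original features

# Transform to a 3-dimensional latent space
pca = PCA(n_components=3)
X_latent = pca.fit_transform(X)
print(X_latent.shape)  # (100, 3)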

SLIDE 47

6. Temporal Feature Extraction

❖ Many relational/structured/tabular datasets have time/date columns
❖ Per-example reconversion to extract numerics/categoricals
❖ Sometimes global stats are needed to calibrate time
❖ Complex temporal features are studied in time-series mining

(Same example table as before)

Example: Most classifiers cannot use Date directly; extract month (categorical), year (categorical?), day (categorical?), etc.
Reconversion: Tuple-level function to extract numbers/categories
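A minimal sketch of the tuple-level reconversion with pandas, using the date strings from the toy table.

import pandas as pd

dates = pd.Series(["4/3/19", "11/7/19", "2/8/20"])

# Parse each date string and extract calendar parts as new columns
parsed = pd.to_datetime(dates, format="%m/%d/%y")
features = pd.DataFrame({"month": parsed.dt.month,
                         "year": parsed.dt.year,
                         "day": parsed.dt.day})
print(features)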

SLIDE 48

7. Textual Feature Extraction

❖ Many relational/structured/tabular datasets have text columns; in NLP, the whole example is often just text
❖ Most classifiers cannot process text/strings directly
❖ Extracting numerics from text is studied in text mining

| Comment                       | Label |
| “This restaurant is not good” | −     |
| “Good good!”                  | +     |
| “Pretty rad”                  | +     |
| …                             | …     |

Example: Bag-of-words features: count the number of times each word in a given vocabulary arises; need to know the vocabulary first
Scaling global stats: How to get the vocabulary?
Reconversion: Tuple-level function to count words via an index lookup

(Bag-of-words table: one column per vocabulary word, e.g., “sucks”, “good”, …, plus Label; each row holds a comment’s word counts, e.g., “Good good!” → good = 2, Label +)
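A minimal sketch of bag-of-words with scikit-learn: fitting is the global-stats pass that learns the vocabulary, and transforming is the per-tuple counting; the comments follow the toy table.

from sklearn.feature_extraction.text import CountVectorizer

comments = ["This restaurant is not good", "Good good!", "Pretty rad"]

# Global stats pass: learn the vocabulary; then count words per comment
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(comments)
print(vectorizer.get_feature_names_out())
print(bow.toarray())  # e.g., "Good good!" has count 2 for the word "good"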

SLIDE 49

7. Textual Feature Extraction

❖ Knowledge base-based: Some apps use domain-specific knowledge bases like entity dictionaries (e.g., celebrity or chemical names) to extract more domain-specific features
❖ Embedding-based:
   ❖ An embedding is a dense numeric vector that represents text
   ❖ Aka “distributed representation”; popular in text mining
   ❖ The function from string to numeric vector is typically trained offline in an unsupervised way on large text corpora (e.g., news articles); the dimensionality of the embedding vector is a hyper-parameter
   ❖ Pre-trained word embeddings like Word2Vec and GloVe and sentence embeddings like Doc2Vec are available off-the-shelf! Just a tuple-level reconversion function
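A minimal sketch of embedding lookup as a tuple-level reconversion, assuming a tiny hypothetical pre-trained table; real applications load Word2Vec/GloVe vectors instead.

import numpy as np

# Hypothetical pre-trained word embeddings (real ones have 100s of dimensions)
embeddings = {"good": np.array([0.2, 0.7]), "rad": np.array([0.1, 0.9])}
DIM = 2

def embed(comment):
    # Average the vectors of in-vocabulary words; zeros if none are known
    vecs = [embeddings[w] for w in comment.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

print(embed("Pretty rad"))  # -> [0.1 0.9]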
SLIDE 50

8. Learned Feature Extraction in Deep Learning

❖ A key benefit of deep learning is a dramatically lower need (or no need!) for manual feature engineering on unstructured data
❖ NB: Deep nets are not common on structured data!
❖ Deep nets are astonishingly versatile: they can accept any data type/structure as input and/or output directly, e.g.:
   ❖ Convolutional NNs (CNNs) accept image tensors
   ❖ Recurrent NNs (RNNs) and “transformers” accept strings as sequences of character-level one-hot encodings
   ❖ Graph NNs (GNNs) accept graph-structured data
❖ The neural architecture specifies how to extract and transform features internally, with weights that are learned
❖ Software 2.0: Buzzword for such “learned feature extraction” programs vs. old hand-crafted feature extraction programs

https://medium.com/@karpathy/software-2-0-a64152b37c35

SLIDE 51

Hyper-Parameter Tuning

[Diagram: the iterative model selection loop, with the HT choices highlighted]

SLIDE 52

Hyper-Parameter Tuning

❖ Hyper-parameters: Knobs of an ML model or training algorithm that control the bias-variance tradeoff in a dataset-specific manner and make learning effective
❖ Examples:
   ❖ GLMs typically have a regularizer to constrain the weights
   ❖ All gradient methods have a learning rate
   ❖ Mini-batch SGD also has a batch size
❖ A common HT practice is grid search (pick some values for each hyper-parameter and take the cross-product); random search that subsamples a grid is also used; complex automated HT heuristics exist too (e.g., Hyperband)
❖ Typically, HT is an “outer loop” around training/inference
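A minimal sketch of grid search as an outer loop around training, using scikit-learn's GridSearchCV on a synthetic dataset; the grid values are hypothetical.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Cross-product grid over the regularization knob; each config is
# trained and cross-validated (the "outer loop" around training)
grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
search.fit(X, y)
print(search.best_params_)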

SLIDE 53

Algorithm Selection

[Diagram: the iterative model selection loop, with the AS choices highlighted]

❖ Not much to say; the data scientist typically picks the model(s)/algorithm(s) ab initio in “classical ML” (non-deep learning)
❖ Best practice: train some simple models (e.g., logistic regression) as baselines before trying complex models (e.g., XGBoost)
❖ Ensembles: Multiple models can be built and used together

SLIDE 54

Architecture Selection

[Diagram: the iterative model selection loop, with the AS choices highlighted]

❖ But: AS is a lot more critical in deep learning/Software 2.0!
❖ The neural architecture is the inductive bias, in classical ML parlance; it controls feature learning and the bias-variance tradeoff on the data
❖ Some apps enjoy rich off-the-shelf pre-trained neural architectures (recall transfer learning); others swap the pain of hand-crafted feature engineering for the pain of curating neural AS! ☺

SLIDE 55

Automated Model Selection / AutoML

Q: Can we just automate the ML/model selection process?

❖ It depends. HT and most of FE are already mostly automated in practice; (neural) AS is often application-dictated
❖ Automated ML (AutoML) tools/systems now aim to reduce the data scientist’s work; or even replace them?! ☺
❖ Pros: Ease of use; lower human cost; easier to audit; improves ML accessibility
❖ Cons: Higher resource cost; less user control; may waste domain knowledge
❖ Pareto-optima; hybrids possible

But: The Sourcing stage is still very hard to automate!

SLIDE 56

Scalable ML Training and Inference

[Diagram: the iterative model selection loop; this part focuses on the train/test step on the ML system]

SLIDE 57

Major ML Model Families

❖ Generalized Linear Models (GLMs); from statistics
❖ Bayesian Networks; inspired by causal reasoning
❖ Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
❖ Support Vector Machines (SVMs); inspired by psychology
❖ Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
❖ Unsupervised: Clustering (e.g., K-Means), Matrix Factorization, Latent Dirichlet Allocation (LDA), etc.

SLIDE 58

ML Models in Kaggle 2019 Survey

[Chart: ML model usage in the Kaggle 2019 survey, annotated with Deep Learning, GLMs, and tree learners]

SLIDE 59

Categorizing ML Systems

Orthogonal dimensions of categorization:

1. Scalability: In-memory libraries vs. scalable ML systems (which work on larger-than-memory datasets)

2. Target workloads: General ML library vs. decision tree-oriented vs. deep learning-oriented, etc.

3. Implementation reuse: Layered on top of a scalable data system vs. a custom from-scratch framework

SLIDE 60

Major Existing ML Systems

❖ General ML libraries:
   ❖ Disk-based files:
   ❖ In-memory:
   ❖ Layered on RDBMS/Spark:
   ❖ Cloud-native:
❖ “AutoML” platforms:
❖ Decision tree-oriented:
❖ Deep learning-oriented:

[The example systems for each category appear as product logos on the slide]

SLIDE 61

Scalable ML Inference

❖ A trained/learned ML model is just a prediction function:

f : D_X → D_Y

Q: Given a large dataset of examples, how to scale inference?

❖ Assumption 1: An example fits entirely in DRAM
❖ Assumption 2: f fits entirely in DRAM
❖ If both assumptions hold, the data access pattern is trivial: a single filescan, applying the per-tuple function f and writing the output
❖ If either assumption fails, the access pattern becomes more complex and depends on breaking up the internals of f:
   ❖ Stage partial computations with scalable data access
   ❖ Very rare in practice; a possible example is video inference on an ultra-HD video stream for activity recognition
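A minimal sketch of the single-filescan pattern under the two assumptions above, streaming a CSV in chunks so the dataset itself never needs to fit in DRAM; the model f (with a scikit-learn-style predict()) and the file paths are hypothetical.

import pandas as pd

def scan_and_predict(f, in_path, out_path, chunksize=100_000):
    # Single filescan: read a chunk, apply the per-tuple function f, append output
    with open(out_path, "w") as out:
        for chunk in pd.read_csv(in_path, chunksize=chunksize):
            preds = f.predict(chunk)  # batched application of the per-tuple f
            pd.Series(preds).to_csv(out, index=False, header=False)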

SLIDE 62

Scalable ML Training

❖ Scaling ML training is trickier and model family-dependent
❖ We will focus on the scalability of 2 groups of ML algorithms:
   ❖ MapReduce-amenable ML algorithms:
      ❖ GLMs with BGD and similar gradient methods
      ❖ K-Means clustering
      ❖ Matrix factorization
   ❖ Deep learning (mini-batch SGD-based)
❖ If interested in scalable tree learning (especially gradient-boosted trees), check out the XGBoost paper

http://dmlc.cs.washington.edu/data/pdf/XGBoostArxiv.pdf

Ad: CSE 234 covers scalable ML systems in more depth