CSE 291D/234 Data Systems for Machine Learning, Arun Kumar, Topic 3




SLIDE 1

CSE 291D/234 Data Systems for Machine Learning

Topic 3: Feature Engineering and Model Selection Systems (Readings: DL book; Chapters 8.2 and 8.3 of MLSys book)

Arun Kumar

SLIDE 2

Model Selection in the Lifecycle

Data acquisition -> Data preparation -> Feature Engineering -> Training & Inference -> Model Selection -> Serving -> Monitoring

SLIDE 3

Model Selection in the Big Picture

SLIDE 4

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 5

Bias-Variance-Noise Decomposition

ML (Test) Error = Bias + Variance + Bayes Noise
❖ Bayes noise: discriminability of examples, e.g., x = (a,b,c); y = +1 vs. x = (a,b,c); y = -1 (identical features, conflicting labels)
❖ Bias and variance: complexity of the model/hypothesis space

SLIDE 6

Hypothesis Space of Functions

❖ A trained ML model is a parametric prediction function:

f : D_W × D_X → D_Y


❖ Hypothesis Space: the set of all possible functions f that can be represented by a model
❖ Training: picks one f from the hypo. space; needs an estimation procedure (e.g., optimization, greedy search, etc.)
❖ Factors that determine the hypo. space:
  ❖ Feature representation
  ❖ Inductive bias of the model
  ❖ Regularization

(The hypothesis space is denoted H.)
SLIDE 7

Another View of Bias-Variance

❖ Bias arises because the hypo. space does not contain the "truth"
  ❖ Shrinking the hypo. space raises bias
❖ Variance arises due to the finite training sample
  ❖ Estimation only approximately reaches the best function in the hypo. space
  ❖ Shrinking the hypo. space lowers variance

SLIDE 8

3 Ways to Control Learning/Accuracy

❖ Reduce Bayes Noise:
  ❖ Augment with new useful features from the application
❖ Reduce Bias:
  ❖ Enhance the hypo. space: derive different features; use a more complex model
  ❖ Reduce shrinkage (less regularization)
❖ Reduce Variance:
  ❖ Shrink the hypo. space: derive different features; drop features; use a less complex model
  ❖ Enhance shrinkage (more regularization)

SLIDE 9

The Double Descent Phenomenon

❖ DL and some other ML model families can get arbitrarily complex
  ❖ Can "memorize" the entire training set
❖ Curiously, variance can drop after rising; bias goes to 0!
❖ The "interpolation regime" is an open question in ML theory

https://arxiv.org/pdf/1812.11118.pdf

SLIDE 10

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 11

Unpredictability of Model Selection

❖ Recall the 3 ways to control ML accuracy: reduce bias, reduce variance, reduce Bayes noise
❖ Alas, the exact rises/drops in error on a given training task and sample are not predictable
❖ Need empirical comparisons of configurations on the data
  ❖ Train-validation-test splits; cross-validation procedures
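The cross-validation procedure mentioned above can be sketched with a hand-rolled k-fold splitter (stdlib only; the function name and seed handling are illustrative, not from any particular library):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, valid) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)           # fixed seed: reproducible splits
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

# Each candidate config is scored on each fold's validation set and the
# k scores are averaged; the best-scoring config wins the comparison.
splits = list(kfold_indices(n=10, k=5))
```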

SLIDE 12

The Model Selection Triple

❖ The data scientist/AutoML procedure must steer 3 key activities to alter the Model Selection Triple (MST):
1. Feature Engineering (FE): What is/are the domain(s) of the hypo. space(s) to consider?
2. Algorithm/Architecture Selection (AS): What exact hypo. space to use (model type/ANN architecture)?
3. Hyper-parameter Tuning (HT): How to configure hypo. space shrinkage and the estimation procedure's approximation?

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

SLIDE 13

The Model Selection Triple

❖ The data scientist/AutoML procedure must steer 3 key activities to explore the Model Selection Triple (MST)

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

[Figure: the model selection loop. Candidate MSTs {FE1, FE2, ...} x {AS1, AS2, ...} x {HT1, HT2, ...} are sent to an ML system to train and test model config(s); results are post-processed and consumed, feeding the next iteration.]

❖ Stopping criterion is application-/user-specific on a Pareto surface: time, cost, accuracy, tiredness (!), etc.

SLIDE 14

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 15

Feature Engineering

❖ Process of converting prepared data into a feature vector representation for ML training/inference
❖ Aka feature extraction, representation extraction, etc.
❖ Activities vary based on data type:
  ❖ Temporal feature extraction
  ❖ Joins and Group Bys
  ❖ Feature interactions
  ❖ Feature selection
  ❖ Value recoding
  ❖ Dimensionality reduction

SLIDE 16

Feature Engineering

❖ Process of converting prepared data into a feature vector representation for ML training/inference
❖ Aka feature extraction, representation extraction, etc.
❖ Activities vary based on data type:
  ❖ Signal processing-based features
  ❖ Deep learning
  ❖ Transfer learning
  ❖ Bag of words
  ❖ N-grams
  ❖ Parsing-based features

SLIDE 17

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 18

Hyperparameter Tuning

❖ Most ML models have hyper-parameter knobs
  ❖ Example knobs across model types: learning rate; regularization; complexity; dropout prob.; number of trees; max height/min split
❖ Most of them raise bias slightly but reduce variance more
❖ No hyp.par. settings are universally best for all tasks/data

SLIDE 19

Hyperparameter Tuning

❖ Common methods to tune hyp.par. configs:

https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf http://gael-varoquaux.info/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html

Grid search “Random” search
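The two methods differ only in how configs are enumerated. A minimal sketch (stdlib only; the knob names, value lists, and sampling ranges are made up for illustration):

```python
import itertools
import random

# Hypothetical search space: learning rate and L2 regularization strength.
grid = {"lr": [1e-3, 1e-2, 1e-1], "l2": [1e-4, 1e-2, 1.0]}

# Grid search: the Cartesian product of the per-knob value lists.
grid_configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]

# Random search: sample each knob independently (log-uniform here), with
# the budget chosen by the user rather than fixed by the grid shape.
rng = random.Random(0)
def sample_config():
    return {"lr": 10 ** rng.uniform(-4, 0), "l2": 10 ** rng.uniform(-5, 1)}

random_configs = [sample_config() for _ in range(9)]
```

Random search covers each individual knob with more distinct values for the same budget, which is why it often beats grid search when only a few knobs matter.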

SLIDE 20

Hyperband

❖ An automated ML (AutoML) procedure for tuning hyp.par.
❖ Basic Idea: For iterative training procedures (e.g., SGD), stop non-promising hyp.par. configs at earlier epochs
❖ Based on the multi-armed bandit idea from gambling/RL
❖ Benefits:
  ❖ Reapportioning resources with early stopping may help reach better overall accuracy sooner
  ❖ Total resource use may be lower vs. grid/random search
❖ 2 knobs as input:
  ❖ R: max budget per config (e.g., # SGD epochs)
  ❖ η: stop rate for configs

https://arxiv.org/pdf/1603.06560.pdf
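The bracket schedules that Hyperband derives from R and η can be computed directly. A short sketch (stdlib only; assumes R is a power of η, as in the R = 81, η = 3 example used on the next slides):

```python
import math

def hyperband_schedule(R, eta):
    """For each bracket s, list the successive-halving rounds as (n_i, r_i):
    n_i = # surviving configs, r_i = budget (e.g., epochs) each one gets."""
    s_max = 0
    while eta ** (s_max + 1) <= R:      # s_max = floor(log_eta R), computed exactly
        s_max += 1
    brackets = {}
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial # configs
        r = R // eta ** s                                # initial budget per config
        brackets[s] = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
    return brackets

sched = hyperband_schedule(R=81, eta=3)
```

The most aggressive bracket (s = 4) starts 81 configs at 1 epoch each and repeatedly keeps the best 1/η of them with η-times the budget; the most conservative bracket (s = 0) runs a few configs at the full budget, like plain random search.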

SLIDE 21

Hyperband

Brackets: independent trials, akin to random search. Within a bracket: survival of the fittest!

https://arxiv.org/pdf/1603.06560.pdf

SLIDE 22

Hyperband

https://arxiv.org/pdf/1603.06560.pdf

R = 81; η = 3. n_i: # hyp.par. configs run; r_i: # epochs per config
❖ Still not as popular as grid/random search; the latter are simpler and easier to use (e.g., how to set R and η?)

SLIDE 23

Review Zoom Poll

SLIDE 24

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 25

Algorithm Selection

❖ Basic Goal: AutoML procedure to pick among a set of interchangeable models (hyp.par. tuning included) ❖ Automate a data scientist’s intuition on feature preprocessing, missing values, hyp.par. tuning, etc. ❖ Many heuristics: AutoWeka, AutoSKLearn, DataRobot, etc.

https://www.cs.ubc.ca/labs/beta/Projects/autoweka/papers/autoweka.pdf

AutoWeka

SLIDE 26

Algorithm Selection

❖ AutoSKLearn uses a more sequential Bayesian optimization approach

http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf

SLIDE 27

NAS and AutoKeras

❖ A DL neural computational graph (NCG) arch. is akin to a model family in classical ML
❖ Some AutoML tools aim to automate NCG design too (neural architecture search, NAS)

https://arxiv.org/pdf/1611.01578.pdf https://arxiv.org/pdf/1806.10282.pdf

Google's NAS uses RL to construct and evaluate NCGs; AutoKeras uses Bayesian optimization and has an optimized implementation
❖ Not that popular in practice; compute-intensive; hard to debug

SLIDE 28

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 29

Systems Aspects of Model Selection

❖ ML/data mining folks have studied model selection from an algorithmic automation/accuracy standpoint
❖ But its resource efficiency is a pressing ML systems issue:
  ❖ Long running times; need lots of CPUs/GPUs
  ❖ Cost and energy footprints are non-trivial
  ❖ If a user is in the loop, latency matters too
❖ Need to raise the throughput of exploring training configs with minimal resource expenses

SLIDE 30

Asynchronous Successive Halving (ASHA)

❖ Successor to Hyperband that uses resources more fully
❖ Issues -> New Ideas:
  ❖ Top-k evals in Hyperband are a sync. point bottleneck when configs are diverse -> asynchronous top-k check; better for diverse configs
  ❖ Fewer and fewer configs towards bracket end (lower degree of parallelism) -> add new hyp.par. configs on the fly; keep all workers busy
❖ ASHA adapts the AutoML procedure to the cluster setting for massively parallel hyp.par. tuning

https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
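The asynchronous top-k check can be sketched as a promotion rule: whenever a worker frees up, promote any config that is in the top 1/η of the configs that have finished its rung, without waiting for the rest of the bracket. A toy sketch (the data structure and function names are illustrative, not from the ASHA implementation):

```python
ETA = 3  # stop rate, as in Hyperband

def promotable(rungs, eta=ETA):
    """Return (rung, config) eligible for promotion, or None.
    rungs: list of dicts mapping config id -> validation loss at that rung."""
    for r in range(len(rungs) - 2, -1, -1):        # prefer higher rungs
        done = sorted(rungs[r], key=rungs[r].get)  # finished configs, best loss first
        top_k = len(done) // eta                   # top 1/eta of *completed* configs
        for cfg in done[:top_k]:
            if cfg not in rungs[r + 1]:            # not promoted yet
                return r, cfg
    return None                                    # idle worker starts a new config
```

When `promotable` returns None, the freed worker simply starts a brand-new config at the bottom rung, which is how ASHA keeps all workers busy.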

SLIDE 31

Asynchronous Successive Halving (ASHA)

[Figure: ASHA experiments with 25 and 500 workers; total time of weeks!]

SLIDE 32

Introducing Cerebro

❖ Key Observation: False dichotomy of 2 main parallelism paradigms in ML for scalable training / model selection

Task Parallelism (Dask, Hyperband, ASHA, Vizier, etc.):
+ High throughput model selection
+ Best accuracy from sequential SGD
- Low data scalability; wastes space (copy) or network (remote read)

Data Parallelism (RDBMS, Spark, PS, Horovod, etc.):
+ High data scalability via sharding
- BSP does not converge; mini-batch level has high communication costs
- Low throughput overall

SLIDE 33

Q: Can we get the best of both worlds?

SLIDE 34

Cerebro’s Model Hopper Parallelism

❖ A new hybrid of task- and data-parallelism for SGD

[Figure: 4 workers hold data shards D1-D4; DNN configs 1-4 start epoch 1 in parallel, one per worker, then hop between workers across sub-epochs (1.1, 1.2, ...).]

SLIDE 35

Cerebro's Model Hopper Parallelism

❖ Key Insight: SGD is robust to the randomness of data ordering
❖ Properties of Model Hopper Parallelism (MOP):
  ❖ All configs visit the dataset in some sequential order; ensures similar accuracy as task parallelism
  ❖ Scheduler keeps all workers busy on their shards, just like data parallelism
  ❖ No sync. point within an epoch of training all configs; very little worker idling due to 1 comm. step per epoch
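One way to realize the scheduler's guarantee (every config visits every shard exactly once per epoch, all workers busy) is a round-robin, Latin-square schedule. A toy sketch under the simplifying assumption of one config per worker (the function name is illustrative; Cerebro's actual scheduler is more general):

```python
def mop_schedule(n_workers, configs):
    """One MOP epoch as a list of sub-epochs: in sub-epoch t, worker w
    trains config (w + t) mod n on its local shard. The result is a Latin
    square: each config visits each worker/shard exactly once per epoch."""
    n = len(configs)
    assert n == n_workers  # simplest case: as many configs as workers
    return [[configs[(w + t) % n] for w in range(n_workers)] for t in range(n)]

sched = mop_schedule(4, ["A", "B", "C", "D"])
```

Between sub-epochs, only the model states "hop" across workers; the data shards never move, which is where the communication savings come from.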

SLIDE 36

Communication Cost Analysis of MOP

❖ p workers; |S| configs; k epochs; b batch size; m model size

Data-parallel (BSP) communication cost: 2km(p-1)|S|⌈|D|/(bp)⌉ (e.g., 72 PB)
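Plugging the symbols above into the formulas gives a quick comparison. A sketch (the BSP formula is the one on the slide; the MOP cost below is an assumption based on the one-model-transfer-per-hop description, roughly p hops of size m per config per epoch):

```python
import math

def bsp_data_parallel_cost(k, m, p, S, D, b):
    """Slide's formula: 2 model-sized transfers per non-master worker per
    mini-batch sync, over all k epochs and |S| configs."""
    return 2 * k * m * (p - 1) * S * math.ceil(D / (b * p))

def mop_cost(k, m, p, S):
    """Assumed MOP cost: each config's model hops across the p workers
    once per epoch (p transfers of size m), independent of batch size b."""
    return k * m * p * S
```

Because MOP's cost has no ⌈|D|/(bp)⌉ factor, it is independent of the SGD batch size, unlike mini-batch-level data parallelism.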
SLIDE 37

Empirical Results

❖ Cerebro/MOP is near Pareto-optimal on completion time, memory/space efficiency, and network cost

SLIDE 38

Discussion on Cerebro paper

SLIDE 39

Vision of Cerebro Platform

[Figure: CEREBRO platform stack.
Interfaces: CLIs; GUIs; Explanation Engine.
High-level Model Building APIs: Transfer Learning; Ablation Analysis; Sequence Analysis; Feature Transfer; Hyperparameter Tuning; Architecture Search; Grouped Learning; Multi-task Batching.
Optimization and Scheduling Layer: Model Hopper Parallelism (MOP); MOP Hybrids; Materialization and Memory Manager; Scheduler; Metadata Manager; Fault Tolerance and Elasticity Manager.
Execution and Storage Layer: AutoDiff and SGD Execution; Direct Filesystem Access (EXT + NFS; HDFS); Dataflow Engines; Cloud Native (EC2, EBS, Lambda, S3).]

SLIDE 40

Determined AI Training Platform

https://determined.ai/

SLIDE 41

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 42

Feature Engineering Systems

❖ Received less attention than model building systems
❖ Key issues they address:
  ❖ Usability: higher-level specification of feature eng. ops
  ❖ Efficiency: automated systems-level optimization
❖ Challenges:
  ❖ Feature eng. is very heterogeneous; tough for one tool to capture all ops, data types, etc.
  ❖ Turing-complete code is rampant in feature eng.; tough for automated optimization

SLIDE 43

Feature Engineering Systems

❖ Sample of feature engineering systems:
  ❖ Columbus (joins, feature interactions, feature selection)
  ❖ KeystoneML (textual / signal proc. features)
  ❖ Vista (deep transfer learning)

SLIDE 44

Feature Selection in Columbus

❖ Setting: exploratory feature subset selection for GLMs on tabular data in R (or NumPy/Pandas)
❖ Goal: reduce compute redundancy and data access at scale
❖ Approach: an embedded domain-specific language (DSL) with "logical" ops. [Figure: example program in the Columbus DSL.]

SLIDE 45

Feature Selection in Columbus

❖ Optimization techniques: some logical ops have alternate physical ops with different runtimes; Columbus picks automatically
  ❖ Exact: batching, subset materialization, QR decomposition
  ❖ Approx.: coreset sampling, warm starting
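The batching idea can be made concrete for linear regression: one pass over the data yields the Gram matrix and correlations, and then every feature subset's normal equations are just sub-blocks of those, with no re-scan of the data. A sketch in that spirit (assuming NumPy; the data and helper name are made up for illustration, and this is a simplification of Columbus's actual physical ops):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=1000)

# Shared precomputation over the full feature matrix: one data scan.
G = X.T @ X   # Gram matrix
c = X.T @ y   # feature-target correlations

def fit_subset(cols):
    """Solve least squares for a feature subset using only sub-blocks
    of the precomputed G and c (normal equations: G_SS w = c_S)."""
    cols = list(cols)
    return np.linalg.solve(G[np.ix_(cols, cols)], c[cols])

w = fit_subset([0, 2, 5])   # no pass over X needed per subset
```

Exploring hundreds of subsets then costs one scan of the data total, instead of one scan per subset.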

SLIDE 46

Feature Pipelines in KeystoneML

❖ Similar to Columbus but more general: a larger set of classical ML training and feat. eng. ops on top of Spark
❖ Supports text and signal proc.-based image features

https://amplab.cs.berkeley.edu/wp-content/uploads/2017/01/ICDE_2017_CameraReady_475.pdf

❖ Optimizations: different distributed linear solvers at the op level; at the full-pipeline level: materializing and caching intermediates, sampling, and common sub-expression elimination

SLIDE 47

Feature Transfer in Vista

❖ Setting: Pre-trained CNNs are commonly used to extract image feature repr. for multimodal analytics ❖ Issue: No single layer of CNN is universally best for downstream accuracy; need to compare multiple layers

SLIDE 48

Feature Transfer in Vista

[Figure: multimodal analytics. Structured data (brand, tags, price) is combined with image data featurized by a pre-trained deep CNN and fed to downstream ML model training. But no single CNN layer is always best for accuracy.]

SLIDE 49

Feature Transfer in Vista

❖ Approach: Vista casts feature transfer as a multi-query optimization problem and creates materialized views

❖ Optimizations: Staging out layer materializations avoids compute redundancy; automated memory management

[Figure: naive prior approach vs. Vista's multi-query optimization.]
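The staging idea can be shown with a toy stand-in for a CNN: a chain of layer functions whose intermediate outputs feed downstream models. The naive approach re-runs the prefix of the chain once per requested layer; the staged approach runs the chain once and materializes only the requested outputs (all names and the toy "layers" are illustrative, not Vista's implementation):

```python
# Toy "CNN": a chain of layer functions on a scalar input.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
wanted = {1, 3}   # indices of layers whose outputs feed downstream models

def naive(x):
    """One full prefix re-computation per requested layer (redundant work)."""
    outs = {}
    for j in wanted:
        h = x
        for layer in layers[: j + 1]:
            h = layer(h)
        outs[j] = h
    return outs

def staged(x):
    """Vista-style staging: a single pass, materializing requested outputs."""
    outs, h = {}, x
    for j, layer in enumerate(layers):
        h = layer(h)
        if j in wanted:
            outs[j] = h
    return outs
```

With L requested layers, the naive plan does O(L) partial forward passes per image while the staged plan does one, which is exactly the compute redundancy Vista's optimizer removes.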

SLIDE 50

Tradeoffs of Feature Eng. Systems

❖ Pros:
  ❖ High-level ops may help improve ML user productivity
  ❖ Automated resource optimization reduces costs
❖ Cons:
  ❖ Lack of sufficient generality
  ❖ ML user needs to (re)learn new APIs; may be complex
  ❖ Extra dependencies and maintenance issues
❖ Some companies now have in-house custom APIs/tools or general code/notebook orchestration for feat. eng. pipelines (not really optimized). More on "feature stores" in Topic 6.

SLIDE 51

Example Industrial Feat. Eng. Sys.

SLIDE 52

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 53

End-to-End AutoML

❖ Some tools claim to automate data preparation, feat. eng., and model building holistically

❖ Unclear how effective they are; no public benchmarks ❖ Unclear if they do any holistic optimizations, e.g., caching common intermediates, logical-physical separation ❖ Open questions on systematizing and optimizing end-to-end AutoML

SLIDE 54

Cloud-Native Model Selection

❖ ML resource availability is now flexible and heterogeneous
  ❖ Local machine -> on-premise cluster -> cloud
❖ Cloud-native offers new opportunities/challenges:
  ❖ Elasticity: upscale/downscale compute/RAM as needed
  ❖ Cheap decoupled storage (e.g., S3)
  ❖ Cheap ephemeral compute (e.g., Spot, Serverless)
❖ Need to redesign model sel. sys. to be cloud-native:
  ❖ Open questions on optimizing resource efficiency vs. runtimes vs. total cost

SLIDE 55

More Effective Architecture Selection

❖ Most DL users still hand-craft NCGs for AS
  ❖ Analogous to manual feat. eng. in classical ML
❖ NAS / AutoKeras still have only limited adoption

https://www.youtube.com/watch?v=r5aEkpEkDzI&feature=emb_title

❖ Open questions on bridging the usability gap
  ❖ Need fast human-in-the-loop tools
  ❖ Domain-specific GUI-based AS tools?

SLIDE 56

Review Questions

❖ Name 3 model sel. systems/approaches for SGD-based ML discussed in class whose communication complexity is independent of SGD batch size.
❖ Briefly explain 2 cons of building separate feat. eng. systems.
❖ Briefly explain one common systems-level optimization seen in many feat. eng. systems.
❖ Why bother redesigning model sel. systems for the cloud?