Scalable Machine Learning with Apache Spark
Introductions
▪ Instructor Introduction
▪ Student Introductions
  ▪ Name
  ▪ Professional Responsibilities
  ▪ Fun Personal Interest/Fact
▪ Expectations for the Course
Course Objectives
1. Create data processing pipelines with Spark
2. Build and tune machine learning models with Spark ML
3. Track, version, and deploy machine learning models with MLflow
4. Perform distributed hyperparameter tuning with Hyperopt
5. Scale the inference of single-node models with Spark
Agenda
Day 1
- 1. Spark Review*
- 2. Delta Lake Review*
- 3. ML Overview*
- 4. Break
- 5. Data Cleansing
- 6. Data Exploration Lab
- 7. Break
- 8. Linear Regression, pt. 1
Day 2
- 1. Linear Regression, pt. 1 Lab
- 2. Linear Regression, pt. 2
- 3. Break
- 4. Linear Regression, pt. 2 Lab
- 5. MLflow Tracking
- 6. Break
- 7. MLflow Model Registry
- 8. MLflow Lab
Day 3
- 1. Decision Trees
- 2. Break
- 3. Random Forest and Hyperparameter Tuning
- 4. Break
- 5. Hyperparameter Tuning Lab
- 6. Hyperopt
Day 4
- 1. Hyperopt Lab
- 2. MLlib Deployment Options*
- 3. XGBoost*
- 4. Break
- 5. Inference with Pandas UDFs
- 6. Training with Pandas UDFs
- 7. Pandas UDFs Lab
- 8. Koalas
- 9. Break
- 10. Capstone Project*
*Optional
Survey
▪ Apache Spark
▪ Machine Learning
▪ Programming Language
LET’S GET STARTED
Apache Spark™ Overview
Apache Spark Background
▪ Founded as a research project at UC Berkeley in 2009
▪ Open-source unified data analytics engine for big data
▪ Built-in APIs in SQL, Python, Scala, R, and Java
Have you ever counted the number of M&Ms in a jar?
Spark Cluster
Diagram: one Driver node and several Worker nodes; each Worker runs an Executor in its own JVM. One Driver, many Executor JVMs.
Spark’s Structured Data APIs
RDD (2011)
▪ Distributed collection of JVM objects
▪ Functional operators (map, filter, etc.)

DataFrame (2013)
▪ Distributed collection of Row objects
▪ Expression-based operations and UDFs
▪ Logical plans and optimizer
▪ Fast/efficient internal representations

Dataset (2015)
▪ Internally rows, externally JVM objects
▪ Almost the "best of both worlds": type safe + fast
▪ But still slower than DataFrames
▪ Not as good for interactive analysis, especially with Python
Spark DataFrame Execution
Diagram: Java/Scala, PySpark, and SparkR DataFrames all compile to the same logical plan, which the Catalyst Optimizer turns into physical plans for execution.
Under the Catalyst Optimizer’s Hood
Diagram: SQL Query / DataFrame → Unresolved Logical Plan → (Analysis) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs
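A quick way to see Catalyst at work is DataFrame.explain(). A minimal PySpark sketch (toy table and column names; assumes a SparkSession, as pre-created on Databricks):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # pre-created on Databricks

    # Toy query over a generated table; "price" is a made-up column name.
    df = spark.range(1000).withColumnRenamed("id", "price")
    query = df.filter(df.price > 100).groupBy().sum("price")

    # Prints the parsed, analyzed, and optimized logical plans,
    # plus the selected physical plan described above.
    query.explain(True)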
When to Use Spark
Scaling Out
▪ Data or model is too large to process on a single machine, commonly resulting in out-of-memory errors

Speeding Up
▪ Data or model is processing slowly and could benefit from shorter processing times and faster results
Delta Lake Overview
Open-source Storage Layer
Delta Lake’s Key Features
▪ ACID transactions
▪ Time travel (data versioning)
▪ Schema enforcement and evolution
▪ Audit history
▪ Parquet format
▪ Compatible with Apache Spark API
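A minimal sketch of writing and time-traveling a Delta table (hypothetical path; assumes a Spark environment with Delta Lake available, such as the Databricks Runtime):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)  # toy data

    # Write as a Delta table; the transaction log enables ACID and versioning.
    df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")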
Machine Learning Overview
What is Machine Learning
▪ Learn patterns and relationships in your data without explicitly programming them
▪ Derive an approximation function to map features to an output or relate them to each other

Diagram: Features → Machine Learning → Output
Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)
Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training
Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
Machine Learning Workflow
Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collection → Feature Engineering → Modeling → Deployment
Business Use Cases
What business use cases do you have?
Defining and Measuring Success
Baseline Models
▪ Simple, dummy model used as a point of reference
▪ Examples include:
  ▪ Most common case (e.g. predicting "not hot dog" every time)
  ▪ Target variable mean

Example: a baseline coin-flip model predicts Heads 50% of the time and Tails 50% of the time.
Algorithm Selection
How do we decide which machine learning algorithms to use?
▪ Data distribution
▪ Feature interactions
▪ Missing values
▪ Target variable type
▪ Deployment considerations
▪ Speed of training
▪ Need for accuracy
▪ Need for interpretability

Note: Be aware of any interpretability requirements due to data regulations like the General Data Protection Regulation.
How do we get this information?
▪ Exploratory data analysis
▪ Data visualization
▪ Data cleaning
▪ Data summaries
▪ Data relationships
DATA CLEANSING DEMO
Importance of Data Visualization
How do we build and evaluate models?
DATA EXPLORATION LAB
Linear Regression
Goal: Find the line of best fit.

ŷ = w₀ + w₁x
y = ŷ + ϵ

where...
▪ x: feature
▪ y: label
▪ ŷ: predicted label
▪ ϵ: error (residual)
▪ w₀: y-intercept
▪ w₁: slope of the line of best fit
Minimizing the Residuals
▪ Blue point: True value
▪ Green-dotted line: Positive residual
▪ Orange-dotted line: Negative residual
▪ Red line: Line of best fit

The goal is to draw a line that minimizes the sum of the squared residuals.
Regression Evaluators
Measure the "closeness" between the actual value and the predicted value.

Evaluation metrics:
▪ Loss: (y − ŷ)
▪ Absolute loss: |y − ŷ|
▪ Squared loss: (y − ŷ)²

Evaluation Metric: Root mean squared error (RMSE)

RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )
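A minimal Spark ML sketch of fitting a line and computing RMSE (toy data; evaluated on the training set purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.getOrCreate()

    # Toy data where y is roughly 2x + 1
    data = spark.createDataFrame(
        [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0)], ["x", "y"])
    train_df = VectorAssembler(inputCols=["x"], outputCol="features").transform(data)

    model = LinearRegression(featuresCol="features", labelCol="y").fit(train_df)
    print(model.intercept, model.coefficients)  # w0 and w1

    # Evaluate predictions with RMSE
    rmse = RegressionEvaluator(labelCol="y", metricName="rmse") \
        .evaluate(model.transform(train_df))
    print(f"RMSE: {rmse:.3f}")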
Linear Regression Assumptions
▪ Linear relationship between each feature and Y
▪ Observations are independent from one another
▪ Features are independent from one another
▪ The value of the residuals is not dependent on the feature values
Linear Regression Assumptions
So, which datasets are suited for linear regression?
Train vs. Test RMSE
Which is more important? Why?
Evaluation Metric: R2
What is the range of R2? Do we want it to be higher or lower?
Machine Learning Libraries
Scikit-learn is a popular single-node machine learning library. But what if our data or model gets too big?
Machine Learning in Spark
Scale Out and Speed Up
Machine learning in Spark allows us to work with bigger data and train models faster by distributing the data and computations across multiple workers.

MLlib
▪ Original ML API for Spark
▪ Based on RDDs
▪ In maintenance mode

Spark ML
▪ Newer ML API for Spark
▪ Based on DataFrames
▪ The supported API
LINEAR REGRESSION DEMO I
LINEAR REGRESSION LAB I
Non-numeric Features
Two primary types of non-numeric features:

Categorical Features
▪ A series of categories of a single feature
▪ No intrinsic ordering
▪ e.g. Dog, Cat, Fish

Ordinal Features
▪ A series of categories of a single feature
▪ Relative ordering, but not necessarily consistent spacing
▪ e.g. Infant, Toddler, Adolescent, Teen, Young Adult, etc.
Non-numeric Features in Linear Regression
Plot: Life Expectancy (y-axis) by Animal (x-axis).
How do we handle non-numeric features for linear regression?
▪ The x-axis is numeric, so features need to be numeric
▪ Convert our non-numeric features to numeric features?
Could we assign numeric values to each of the categories?
▪ "Dog" = 1, "Cat" = 2, "Fish" = 3, etc.
▪ Does this make sense?

This implies 1 Cat is equal to 2 Dogs!
Non-numeric Features in Linear Regression
Plot: Height (y-axis) by Life Stage (x-axis).

What about with ordinal variables?
▪ Since ordinal variables have an order just like numbers, could this work?
▪ "Infant" = 1, "Toddler" = 2, "Child" = 3, etc.
▪ Does this make sense?

Remember that the ordinal categories aren't necessarily evenly spaced, so it's still not perfect and not particularly scalable.
Non-numeric Features in Linear Regression
Instead, we commonly use a practice known as one-hot encoding (OHE).
▪ Creates a binary "dummy" feature for each category
▪ Doesn't force a uniformly-spaced, ordered numeric representation

OHE example:

Animal | Dog | Cat | Fish
Dog    |  1  |  0  |  0
Cat    |  0  |  1  |  0
Fish   |  0  |  0  |  1
One-hot Encoding at Scale
You might be thinking...
▪ Okay, I see what’s happening here … this works for a handful of animals. ▪ But what if we have an entire zoo of animals? That would result in really wide data!
Spark uses sparse vectors for this:

DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])

▪ Sparse vectors take the form: (number of elements, [indices of the non-zero elements], [values of the non-zero elements])
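The same vectors, constructed directly with pyspark.ml.linalg:

    from pyspark.ml.linalg import Vectors

    dense = Vectors.dense([0, 0, 0, 7, 0, 2, 0, 0, 0, 0])
    # 10 elements; non-zeros at indices 3 and 5 with values 7 and 2.
    sparse = Vectors.sparse(10, [3, 5], [7.0, 2.0])

    assert dense.toArray().tolist() == sparse.toArray().tolist()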
LINEAR REGRESSION DEMO II
LINEAR REGRESSION LAB II
MLflow Tracking
MLflow
▪ Open-source platform for the machine learning lifecycle
▪ Operationalizes machine learning
▪ Developed by Databricks
▪ Pre-installed on the Databricks Runtime for ML
Core Machine Learning Issues
▪ Keeping track of experiments or model development
▪ Reproducing code
▪ Comparing models
▪ Standardization of packaging and deploying models
MLflow addresses these issues.
MLflow Components
▪ MLflow Tracking
▪ MLflow Projects
▪ MLflow Models
▪ MLflow Plugins
▪ APIs: CLI, Python, R, Java, REST
MLflow Tracking
▪ Logging API
▪ Specific to machine learning
▪ Library and environment agnostic

Runs
▪ Executions of data science code, e.g. a model build or an optimization run

Experiments
▪ Aggregations of runs, typically corresponding to a data science project
What Gets Tracked
▪ Parameters: key-value pairs of input parameters (e.g. hyperparameters)
▪ Metrics: evaluation metrics (e.g. RMSE)
▪ Artifacts: arbitrary output files (e.g. images, pickled models, data files)
▪ Source: the source code from the run
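A minimal tracking sketch (the values and file name are hypothetical stand-ins for a real training run):

    import mlflow

    with mlflow.start_run(run_name="example-run"):
        mlflow.log_param("max_depth", 5)           # a hyperparameter
        mlflow.log_metric("rmse", 0.72)            # an evaluation metric
        mlflow.log_artifact("residuals_plot.png")  # any file already on disk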
Examining Past Runs
▪ Querying past runs via the API
  ▪ MlflowClient object
  ▪ List experiments
  ▪ Search runs
  ▪ Return run metrics
▪ MLflow UI
  ▪ Built into the Databricks platform
MLFLOW TRACKING DEMO
MLflow Model Registry
MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitates experimentation, testing, and production
▪ Integrates with approval and governance workflows
▪ Monitors ML deployments and their performance
Databricks MLflow Blog Post
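A minimal registration sketch (the run ID and model name are placeholders; assumes the run logged a model under the artifact path "model"):

    import mlflow

    run_id = "<run-id>"  # placeholder for a real MLflow run ID
    mlflow.register_model(f"runs:/{run_id}/model", "example-model")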
MLFLOW MODEL REGISTRY DEMO
MLFLOW LAB
Decision Trees
Decision Making
Example decision tree for evaluating a job offer:

Root Node: Salary > $50,000? → No: Decline Offer; Yes →
Decision Node: Commute > 1 hr? → Yes: Decline Offer; No →
Decision Node: Offers Free Coffee? → No: Decline Offer; Yes: Accept Offer

Each terminal outcome (Accept Offer / Decline Offer) is a leaf node.
Determining Splits
Candidate splits:
▪ Commute? (< 30 min, 30 min - 1 hr, > 1 hr)
▪ Bonus? (Yes, No)
Commute is a better choice because it provides information about the classification.
Creating Decision Boundaries
Diagram: a tree that splits on Salary > $50,000 and then Commute > 1 hr maps onto the Commute vs. Salary plane as boundaries at Salary = $50,000 and Commute = 1 hour, carving it into "Decline Offer" and "Accept Offer" regions.
Lines vs. Boundaries
Plot: Commute vs. Salary with decision boundaries at Salary = $50,000 and Commute = 1 hour.
Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships

Linear Regression
▪ Lines through data
▪ Assumed linear relationship
Linear Regression or Decision Tree?
It depends on the data...
Tree Depth
In the job-offer tree (Salary > $50,000 → Commute > 1 hr → Offers Free Coffee), the longest path from the root to a leaf passes through three decision nodes.

Tree Depth: the length of the longest path from a root node to a leaf node. The tree above has a depth of 3.
Note: shallow trees tend to underfit, and deep trees tend to overfit
Underfitting vs. Overfitting
Plots: an underfit model, an overfit model, and a "just right" fit.
Additional Resource
R2D3 has an excellent visualization of how decision trees work.
DECISION TREE DEMO
Random Forests
Decision Trees
Pros
▪ Interpretable
▪ Simple
▪ Handles classification and regression
▪ Captures nonlinear relationships

Cons
▪ Poor accuracy
▪ High variance
Bias vs. Variance
Bias-Variance Tradeoff
Plot: error vs. model complexity; variance rises and bias² falls as complexity grows, and total error is minimized at an optimum model complexity.

Error = Variance + Bias² + noise

▪ Reduce bias: build more complex models
▪ Reduce variance: use a lot of data; build simpler models
▪ What about the noise?
https://www.explainxkcd.com/wiki/index.php/2021:_Software_Development
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
Bootstrap Sampling
A method for simulating new datasets:
1. Take a sample with replacement from the original training set (same size as the original)
2. Repeat for as many new datasets as needed
Bootstrap Visualization
Diagram: Training Set (N = 100) → Bootstrap 1 (N = 100), Bootstrap 2 (N = 100), Bootstrap 3 (N = 100), Bootstrap 4 (N = 100)
Why are some points in the bootstrapped samples not selected?
Training Set Coverage
Assume we are bootstrapping N draws from a training set with N observations...
▪ Probability of an element getting picked in each draw: 1/N
▪ Probability of an element not getting picked in each draw: 1 − 1/N
▪ Probability of an element not getting drawn in the entire sample: (1 − 1/N)^N

As N → ∞, the probability for each element of not getting picked in a sample approaches 1/e ≈ 0.368.
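A quick numeric check of that limit:

    import math

    for n in (10, 100, 10_000, 1_000_000):
        print(n, (1 - 1 / n) ** n)  # approaches 1/e
    print(math.exp(-1))             # 0.3678794...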
Bootstrap Aggregating
▪ Train a tree on each sample, and average the predictions
▪ This is bootstrap aggregating, commonly referred to as bagging

Diagram: Full Training Data → Bootstrap 1 … Bootstrap K → one Decision Tree per bootstrap → predictions aggregated into a Final Prediction
At each split, a subset of features is considered to ensure each tree is different.
Random Forest Algorithm
Diagram: a scoring record is sent to every tree in the forest, and the trees' predictions are aggregated into a final prediction.
Random Forest Aggregation
▪ Majority voting for classification
▪ Mean for regression
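A minimal Spark ML random forest sketch (assumes train_df already has "features" and "label" columns):

    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(
        numTrees=500,                  # number of bootstrapped trees
        featureSubsetStrategy="auto",  # features considered at each split
        labelCol="label",
        featuresCol="features")
    model = rf.fit(train_df)
    # Classification aggregates by majority vote;
    # RandomForestRegressor averages predictions instead.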
RANDOM FOREST DEMO
Hyperparameter Tuning
What is a Hyperparameter?
A parameter whose value is used to control the training process.

Examples for random forests:
▪ Tree depth
▪ Number of trees
▪ Number of features to consider
Selecting Hyperparameter Values
▪ Build a model for each hyperparameter value
▪ Evaluate each model to identify the optimal hyperparameter value
▪ What dataset should we use to train and evaluate?

Split: Training | Validation | Test
What if there isn’t enough data to split into three separate sets?
K-Fold Cross Validation
Pass 1: Validation | Training | Training
Pass 2: Training | Validation | Training
Pass 3: Training | Training | Validation

Average the validation errors across passes to identify the optimal hyperparameter values.

Final Pass: retrain with the optimal hyperparameters on the training data and evaluate on the held-out test set.
Optimizing Hyperparameter Values
Grid Search
▪ Train and validate every unique combination of hyperparameters
Example grid: tree depth ∈ {5, 8} and number of trees ∈ {2, 4} gives four combinations:

Tree Depth | Number of Trees
    5      |        2
    5      |        4
    8      |        2
    8      |        4
Question: With 3-fold cross validation, how many models will this build?
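The same grid expressed with Spark ML's ParamGridBuilder and CrossValidator (a sketch; assumes train_df with "features" and "label" columns). Four combinations times three folds means twelve models, plus one final refit on the full training set:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    grid = (ParamGridBuilder()
            .addGrid(rf.maxDepth, [5, 8])  # tree depth
            .addGrid(rf.numTrees, [2, 4])  # number of trees
            .build())

    cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=3)
    cv_model = cv.fit(train_df)  # 4 combinations x 3 folds = 12 models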
HYPERPARAMETER TUNING DEMO
HYPERPARAMETER TUNING LAB
Hyperparameter Tuning with Hyperopt
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn't used
▪ So what do you do if…
  ▪ You have a training budget
  ▪ You have a non-parametric search space
  ▪ You want to pick your hyperparameters based on past results
Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces
  ▪ Serial or parallel
  ▪ Spark integration
▪ Three core algorithms for optimization:
  ▪ Random Search
  ▪ Tree of Parzen Estimators (TPE)
  ▪ Adaptive TPE

See the Hyperopt paper for details.
Optimizing Hyperparameter Values
Random Search
▪ Generally outperforms grid search
▪ Can struggle on some datasets (e.g. convex spaces)
Optimizing Hyperparameter Values
Tree of Parzen Estimators
▪ Meta-learner, Bayesian process
▪ Non-parametric densities
▪ Returns candidate hyperparameters based on best expected improvement
▪ Provide a range and distribution for continuous and discrete values
▪ Adaptive TPE better tunes the search space
  ▪ Freezes hyperparameters
  ▪ Tunes the number of random trials before TPE
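A minimal Hyperopt sketch (the objective's train_and_score helper is hypothetical; SparkTrials distributes trials across the cluster):

    from hyperopt import fmin, tpe, hp, SparkTrials

    def objective(params):
        # Hypothetical helper: train a model with these hyperparameters
        # and return its validation loss.
        return train_and_score(params)

    search_space = {
        "max_depth": hp.quniform("max_depth", 2, 10, 1),         # discrete range
        "learning_rate": hp.loguniform("learning_rate", -5, 0),  # continuous, log-scaled
    }

    best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                max_evals=20, trials=SparkTrials(parallelism=4))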
HYPEROPT DEMO
HYPEROPT LAB
MLlib Deployment Options
Data Science vs. Data Engineering
▪ Data Science != Data Engineering
▪ Data Science
  ▪ Scientific
  ▪ Art
  ▪ Business problems
  ▪ Model mathematically
  ▪ Optimize performance
▪ Data Engineering
  ▪ Reliability
  ▪ Scalability
  ▪ Maintainability
  ▪ SLAs
Model Operations (ModelOps)
▪ DevOps
  ▪ Software development and IT operations
  ▪ Manages deployments
  ▪ CI/CD of features, patches, updates, and rollbacks
  ▪ Agile vs. waterfall
▪ ModelOps
  ▪ Data modeling and deployment operations
  ▪ Java environments
  ▪ Containers
  ▪ Model performance monitoring
The Four ML Deployment Options
▪ Batch
  ▪ 80-90 percent of deployments
  ▪ Leverages databases and object storage
  ▪ Fast retrieval of stored predictions
▪ Continuous/Streaming
  ▪ 10-15 percent of deployments
  ▪ Moderately fast scoring on new data
▪ Real-time
  ▪ 5-10 percent of deployments
  ▪ Usually using REST (Azure ML, SageMaker, containers)
▪ On-device
ML DEPLOYMENT DEMO
Gradient Boosted Decision Trees
Diagram (bagging recap): Full Training Data → Bootstrap 1, Bootstrap 2, …, Bootstrap K
Decision Tree Ensembles
▪ Combine many decision trees
▪ Random Forest
  ▪ Bagging
  ▪ Independent trees
  ▪ Results aggregated to a final prediction
▪ There are other methods of ensembling decision trees
Boosting
Diagram: Full Training Data → Tree 1 → Tree 2 → … (trees built in sequence)

▪ Sequential (one tree at a time)
▪ Each tree learns from the last
▪ The sequence of trees is the final model
Gradient Boosted Decision Trees
▪ Common boosted trees algorithm
▪ Fits each tree to the residuals of the previous tree
▪ On the first iteration, residuals are the actual label values
Model 1 (fit to the labels):

  Y  | Prediction | Residual
 40  |     35     |    5
 60  |     67     |   −7
 30  |     28     |    2
 33  |     32     |    1

Model 2 (fit to Model 1's residuals):

  Y  | Prediction | Residual
  5  |      3     |    2
 −7  |     −4     |   −3
  2  |      3     |   −1
  1  |      0     |    1

Final Prediction (Model 1 + Model 2):

  Y  | Prediction
 40  |     38
 60  |     63
 30  |     31
 33  |     32
Boosting vs. Bagging
GBDT
▪ Starts with high bias, low variance
▪ Moves right along the model-complexity curve as trees are added

RF
▪ Starts with high variance, low bias
▪ Moves left along the model-complexity curve as trees are averaged

Plot: error vs. model complexity, with variance, bias², and total error curves and an optimum model complexity.
Gradient Boosted Decision Trees Implementations
▪ Spark ML
  ▪ Built into Spark
  ▪ Utilizes Spark's existing decision tree implementation
▪ XGBoost
  ▪ Designed and built specifically for gradient boosted trees
  ▪ Regularized to prevent overfitting
  ▪ Highly parallel
  ▪ Works nicely with Spark in Scala
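A minimal Spark ML gradient boosted trees sketch (assumes train_df with "features" and "label" columns):

    from pyspark.ml.regression import GBTRegressor

    gbt = GBTRegressor(labelCol="label", featuresCol="features",
                       maxIter=100,  # number of sequentially fitted trees
                       maxDepth=3)   # boosting typically uses shallow trees
    model = gbt.fit(train_df)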
XGBOOST DEMO
Appendix
Electives
The following electives are also available:
▪ Machine Learning Algorithms and Applications
  ▪ K-Means
  ▪ Logistic Regression Lab
  ▪ Time Series Forecasting
  ▪ Isolation Forests for Outlier and Fraud Detection
  ▪ Collaborative Filtering for Recommendation Systems Lab
▪ Tools
  ▪ Joblib
▪ Other
  ▪ Databricks Best Practices
Logistic Regression
Types of Supervised Learning
Regression
▪ Predicting a continuous output
Classification
▪ Predicting a categorical/discrete output
Types of Classification
Binary Classification
Two label classes
Multiclass Classification
▪ Three or more label classes

Model output is commonly the probability of a record belonging to each of the classes.
Binary Classification
Two label classes
▪ Outputs:
  ▪ Probability that the record is Red given a set of features
  ▪ Probability that the record is Blue given a set of features
▪ Reminders:
  ▪ Probabilities are bounded between 0 and 1
  ▪ Linear regression returns any real number
Bounding Binary Classification Probabilities
How can we keep model outputs between 0 and 1?
▪ Logistic function: σ(z) = 1 / (1 + e⁻ᶻ)
  ▪ Large positive inputs → 1
  ▪ Large negative inputs → 0
Converting Probabilities to Classes
But we need class predictions, not probability predictions
▪ In binary classification, the class probabilities are directly complementary
▪ So let's set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is P[y = 1 | x], where x represents the features
▪ Set a threshold on the probability predictions:
  ▪ P[y = 1 | x] < 0.5 → y = 0
  ▪ P[y = 1 | x] ≥ 0.5 → y = 1
Evaluating Binary Classification Models
▪ How can the model be wrong?
  ▪ Type I Error: False Positive
  ▪ Type II Error: False Negative
▪ These errors are commonly represented with a confusion matrix
Binary Classification Metrics
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
K-Means
Clustering
▪ Unsupervised learning
▪ Unlabeled data (no known function output)
▪ Categorize records based on features
K-Means Clustering
▪ Most common clustering algorithm
▪ Number of clusters, K, is manually chosen
▪ Each cluster has a centroid
▪ Objective: minimize the total distance between all of the points and their assigned centroids
K-Means Algorithm
▪ Step 1: Randomly create centroids for K clusters
▪ Repeat until convergence/stopping criteria:
  ▪ Step 2: Assign each data point to the cluster with the closest centroid
  ▪ Step 3: Move each cluster centroid to the average location of its assigned data points
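A minimal Spark ML K-means sketch (assumes df has a "features" vector column):

    from pyspark.ml.clustering import KMeans

    kmeans = KMeans(k=3, seed=42, featuresCol="features")
    model = kmeans.fit(df)

    print(model.clusterCenters())    # one centroid per cluster
    clustered = model.transform(df)  # adds a "prediction" (cluster) column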
Visualizing K-Means
Choosing the Number of Clusters
▪ K is a hyperparameter
▪ Methods of identifying the optimal K:
  ▪ Prior knowledge
  ▪ Visualizing the data
  ▪ Elbow method for within-cluster distance

Note: Error will always decrease as K increases, unless a penalty is imposed.
Issues with K-Means
▪ Local optima vs. global optima: a run can converge to a local minimum instead of the global minimum, depending on the initial centroids
▪ Straight-line (Euclidean) distance: not an appropriate measure for every dataset
Other Clustering Techniques
Collaborative Filtering
Recommendation Systems
Naive Approaches to Recommendation
▪ Hand-curated lists
▪ Aggregates

Question: What are problems with these approaches?
Content-based Recommendation
▪ Idea: Recommend items to a customer that are similar to other items the customer liked
▪ Creates a profile for each user or product
  ▪ User: demographic info, ratings, etc.
  ▪ Item: genre, flavor, brand, actor list, etc.
Content-based Recommendation
▪ Advantages
  ▪ No need for data from other users
  ▪ Can recommend new items
▪ Disadvantages
  ▪ Cold-start problem
  ▪ Determining appropriate feature comparisons
  ▪ Implicit information
Collaborative Filtering
▪ Idea: Make recommendations for one customer (filtering) by collecting and analyzing the interests of many users (collaboration)
▪ Advantages over content-based recommendation
  ▪ Relies only on past user behavior (no profile creation)
  ▪ Domain independent
  ▪ Generally more accurate
▪ Disadvantages
  ▪ Extremely susceptible to the cold-start problem (user and item)
Types of Collaborative Filtering
▪ Neighborhood Methods: Compute relationships between items or users
  ▪ Computationally expensive
  ▪ Not empirically as good
▪ Latent Factor Models: Explain the ratings by characterizing items and users by a small number of inferred factors
  ▪ Matrix factorization
  ▪ Characterizes both items and users by vectors of factors inferred from item-rating patterns
  ▪ Explicit feedback: sparse matrix
  ▪ Scalable
Latent Factor Approach
Ratings Matrix
Matrix Factorization
Alternating Least Squares
▪ Step 1: Randomly initialize the user and movie factors
▪ Step 2: Repeat the following until convergence:
  1. Fix the movie factors, and optimize the user factors
  2. Fix the user factors, and optimize the movie factors
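A minimal Spark ML ALS sketch (assumes ratings_df with userId, movieId, and rating columns):

    from pyspark.ml.recommendation import ALS

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=10,                   # number of latent factors
              maxIter=10,                # alternating optimization passes
              coldStartStrategy="drop")  # drop predictions for unseen users/items
    model = als.fit(ratings_df)

    recs = model.recommendForAllUsers(5)  # top-5 recommendations per user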
Why not SVD?
▪ The matrix is too sparse
▪ Imputation can be inaccurate
▪ Imputation can be expensive
Distributed ALS Implementation
▪ Naive approach: broadcast R, U, and V
  ▪ Problem? R is large, and we're duplicating copies for each worker
▪ Better approach: distribute R and broadcast U and V
  ▪ Problem? U and V might be large, too, and we're still duplicating copies
▪ Best approach: Join ALS
Join ALS
Blocked Join ALS
▪ Spark implements a smarter version of Join ALS
▪ Limits data shuffling
▪ ALS is a distributed model (i.e. stored across executors)