SLIDE 1

Hopsworks

  • Dr. Jim Dowling
  • Sina Sheikholeslami

Nov 2019

SLIDE 2

Hopsworks Technical Milestones

“If you’re working with big data and Hadoop, this one paper could repay your investment in the Morning Paper many times over.... HopsFS is a huge win.”

  • Adrian Colyer, The Morning Paper

  • World’s first Hadoop platform to support GPUs-as-a-Resource (2017)
  • World’s fastest HDFS: HopsFS, published at USENIX FAST 2017 with Oracle and Spotify; winner of the IEEE Scale Challenge 2017 with 1.2m ops/sec
  • World’s first open-source Feature Store for Machine Learning (2018)
  • World’s first distributed filesystem to store small files in metadata on NVMe disks (2018)
  • World’s most scalable POSIX-like hierarchical filesystem, with multi-data-center availability at 1.6m ops/sec on GCP (2019)
  • First non-Google ML platform with TensorFlow Extended (TFX) support through Beam/Flink (2019)
  • World’s first unified hyperparameter and ablation-study parallel programming framework (2019)

SLIDE 3
  • 0. Slides: http://hops.io/id2223.pdf
  • 1. Register for an account at: www.hops.site
  • 2. Follow the instructions here: https://bit.ly/2UEixTr
  • 3. Getting-started videos: https://bit.ly/2NnbKgu

SLIDE 4

Hopsworks hides the Complexity of Deep Learning

The prediction function φ(x), mapping Data to Prediction, is only one small box. Around it sit data collection, feature engineering, data validation, hyperparameter tuning, distributed training, model serving, A/B testing, monitoring, pipeline management, and hardware management, all exposed through the Hopsworks REST API and the Hopsworks Feature Store.

[Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”]

SLIDE 5

SLIDE 6

SLIDE 7

SLIDE 8

Machine Learning Pipelines

SLIDE 9

End-to-End ML Pipelines

SLIDE 10

ML Pipelines with a Feature Store

SLIDE 11

End-to-End ML Pipelines in Hopsworks

SLIDE 12

Roles in Machine Learning

Stage 1 (Data Engineer): feature engineering from raw sources (Redshift, S3, Cassandra, Hadoop) into the Feature Store.

Stage 2 (Data Scientist): model design over training and test data drawn from the Feature Store; hyperparameter trials and experiments drive feature selection and hyperparameter choices, and model candidates with their architectures are stored in the Model Repository.

Stage 3 (ML Ops Engineer): model deployment onto Kubernetes/serverless behind a model inference API; predictions are logged to Kafka, and a streaming or serverless monitoring app joins logged predictions with outcomes, feeding KPI dashboards, alerts, and actions.

Stage 4 (App Developer): the online application calls Predict on the model inference API and fetches online features from the Feature Store.

SLIDE 13

Running TensorFlow/Keras/PyTorch Apps in PySpark

Warning: micro-exposure to PySpark may cure you of distributed programming phobia

SLIDE 14

GPU(s) in PySpark Executor, Driver coordinates

PySpark makes it easier to write TensorFlow/Keras/PyTorch code that can either run on a single GPU or scale out to many GPUs for parallel experiments or distributed training.

[Diagram: a Driver coordinating multiple GPU-equipped Executors]

SLIDE 15

Need a Distributed Filesystem for Coordination

Executors 1..N and the Driver coordinate through HopsFS, which also backs Model Serving and TensorBoard. Shared state includes:

  • Training/test datasets
  • Model checkpoints, trained models
  • Experiment run data
  • Provenance data
  • Application logs

SLIDE 16

PySpark Hello World

SLIDE 17

PySpark – Hello World

With one Executor, the Driver’s experiment.launch(..) call runs print("Hello from GPU") on that single Executor.
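
A minimal sketch of this pattern, assuming the hops-util-py experiment API named on the slide (experiment.launch takes the function to run on the executors):

from hops import experiment

def hello():
    # Runs inside the (single) PySpark executor, which owns the GPU
    print("Hello from GPU")

# Driver side: ship `hello` to the executor and run it as a tracked experiment
experiment.launch(hello)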

SLIDE 18

Leave code unchanged, but configure 4 Executors

The same experiment.launch(..) call from the Driver now runs print("Hello from GPU") in each of the four Executors.
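
Only the resource configuration changes. On Hopsworks the executor count is set in the Jupyter/Job configuration UI; a hypothetical stand-alone Spark equivalent would be:

from pyspark.sql import SparkSession

# Request 4 executors; the training function itself is untouched
spark = (SparkSession.builder
         .config("spark.executor.instances", "4")
         .getOrCreate())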

SLIDE 19

Driver with 4 Executors

SLIDE 20

Same/Replica Conda Environment on all Executors

[Diagram: the same conda_env replicated on the Driver and every Executor]

SLIDE 21

A Conda Environment Per Project in Hopsworks

SLIDE 22

Use Pip or Conda to install Python libraries

SLIDE 23

TensorFlow Distributed Training with PySpark

def train():
    # Separate shard of the dataset per worker
    # create an Estimator with DistributionStrategy = CollectiveAllReduce
    # train model, evaluate
    return loss

# Driver code below here: builds TF_CONFIG and shares it with the workers
from hops import experiment
experiment.collective_allreduce(train)

More details: https://github.com/logicalclocks/hops-examples
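
A fuller sketch of the worker function, using Keras with tf.distribute’s MultiWorkerMirroredStrategy (the collective all-reduce implementation) in place of the Estimator named above; make_sharded_dataset is a hypothetical helper for this worker’s data shard:

import tensorflow as tf
from hops import experiment

def train():
    # TF_CONFIG has already been built for this worker by the launcher
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    dataset = make_sharded_dataset()  # hypothetical: this worker's shard
    history = model.fit(dataset, epochs=10)
    return history.history["loss"][-1]

experiment.collective_allreduce(train)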

SLIDE 24

Undirected Hyperparam Search with PySpark

def train(dropout):
    # Same dataset for all workers
    # create model and optimizer
    # add this worker's value of dropout
    # train model and evaluate
    return loss

# Driver code below here
from hops import experiment
args = {"dropout": [0.1, 0.4, 0.8]}
experiment.grid_search(train, args)

More details: https://github.com/logicalclocks/hops-examples
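
Concretely, grid_search runs one undirected trial per value in the grid, each as its own PySpark task; a sketch with hypothetical build_model and fit_and_evaluate helpers:

from hops import experiment

def train(dropout):
    model = build_model(dropout=dropout)  # hypothetical helper
    return fit_and_evaluate(model)        # hypothetical helper, returns loss

# Three trials, one per dropout value, run in parallel on the executors
experiment.grid_search(train, {"dropout": [0.1, 0.4, 0.8]})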

SLIDE 25

Directed Hyperparameter Search with PySpark

def train(dropout):
    # Same dataset for all workers
    # create model and optimizer
    optimizer.apply(dropout)
    # train model and evaluate
    return loss

# Driver code below here
from hops import experiment
args = {"dropout": "0.1-0.8"}
experiment.diff_ev(train, args)

More details: https://github.com/logicalclocks/hops-examples

SLIDE 26

Wasted Compute!

SLIDE 27

Parallel ML Trials with Maggy

SLIDE 28

Maggy: Unified Hparam Opt & Ablation Programming

[Diagram: the Hyperparameter Optimizer proposes new hyperparameter values and the Ablation Study Controller proposes new dataset/model architectures; the Machine Learning System evaluates each trial and feeds results back.]

  • Synchronous or asynchronous trials
  • Directed or undirected search
  • User-defined search/optimizers

SLIDE 29

Directed Hyperparameter Search with Maggy

def train(dropout, reporter):
    ...

from maggy import experiment, Searchspace
sp = Searchspace(dropout=('INTEGER', [2, 8]))
experiment.lagom(train, sp)

More details: https://github.com/logicalclocks/hops-examples

[Diagram: the Driver coordinates Tasks 1..N behind a barrier; workers stream metrics to the Driver, which replies with new trials or early-stop signals.]
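
A sketch of the same call with the reporter in use, assuming Maggy’s per-step metric API (reporter.broadcast) and a DOUBLE range for dropout; build_model and train_one_epoch are hypothetical helpers:

from maggy import experiment, Searchspace

sp = Searchspace(dropout=('DOUBLE', [0.1, 0.8]))

def train(dropout, reporter):
    model = build_model(dropout=dropout)  # hypothetical helper
    for epoch in range(10):
        loss = train_one_epoch(model)     # hypothetical helper
        # Stream the running metric to the driver so the optimizer can
        # early-stop this trial if it is clearly underperforming
        reporter.broadcast(metric=loss)
    return loss

experiment.lagom(train, sp)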

SLIDE 30

Parallel Ablation Studies with Maggy

def train(dataset_function, model_function):
    ...

from maggy import experiment
ablation_study = ...
experiment.lagom(train, experiment_type='ablation',
                 ablation_study=ablation_study, ablator='loco')

More details: https://github.com/logicalclocks/hops-examples

[Diagram: the Driver coordinates Tasks 1..N behind a barrier; workers stream metrics to the Driver, which replies with new trials or early-stop signals.]
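
A sketch of defining the ablation study itself, assuming Maggy’s early AblationStudy API; the dataset name 'titanic', label 'survived', feature names, and base_model_fn are all illustrative assumptions:

from maggy import experiment
from maggy.ablation import AblationStudy

ablation_study = AblationStudy('titanic', training_dataset_version=1,
                               label_name='survived')
ablation_study.features.include('pclass', 'fare')             # ablate these, one per trial
ablation_study.model.set_base_model_generator(base_model_fn)  # hypothetical generator

def train(dataset_function, model_function):
    # Maggy injects per-trial generators: each LOCO ("leave one component
    # out") trial rebuilds the dataset/model with one component removed
    model = model_function()
    dataset = dataset_function(64)  # assumed: takes a batch size
    # train and evaluate ...
    return loss

experiment.lagom(train, experiment_type='ablation',
                 ablation_study=ablation_study, ablator='loco')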

SLIDE 31

/Experiments

  • Executions add entries in /Experiments:

experiment.launch(…)
experiment.grid_search(…)
experiment.collective_allreduce(…)
experiment.lagom(…)

  • /Experiments contains:

○ logs (application, TensorBoard)
○ executed notebook file
○ conda environment used
○ checkpoints

/Projects/MyProj
└ Experiments
  └ <app_id>
    └ <type>
      ├─ checkpoints
      ├─ tensorboard_logs
      ├─ logfile
      └─ versioned_resources
        ├─ notebook.ipynb
        └─ conda_env.yml
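
For example, training code can route its TensorBoard logs and checkpoints into the run’s /Experiments entry via the hops tensorboard helper (a sketch, assuming the hops-util-py API):

from hops import experiment, tensorboard

def train():
    # logdir resolves to this run's entry under /Experiments, so summaries
    # and checkpoints written here are versioned with the experiment
    logdir = tensorboard.logdir()
    # ... write tf.summary events and checkpoints under logdir ...
    return loss

experiment.launch(train)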

SLIDE 32

/Models

  • Named/versioned model management for TensorFlow/Keras and Scikit-Learn
  • A Models dataset can be securely shared with other projects or the whole cluster
  • The provenance API returns the conda.yml and the execution used to train a given model

/Projects/MyProj
└ Models
  └ <name>
    └ <version>
      ├─ saved_model.pb
      └─ variables/ ...
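
A sketch of exporting a trained model into /Models, assuming a hops model-export helper; the module name and signature vary across hops-util-py versions, so the names below are illustrative:

import tensorflow as tf
from hops import model as hops_model  # assumed module name

# Save a SavedModel locally, then export it to /Models/<name>/<version>/
tf.saved_model.save(trained_model, "export_dir")
hops_model.export("export_dir", "mnist_classifier",
                  metrics={"accuracy": 0.95})  # attach evaluation metrics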

SLIDE 33

That was Hopsworks

Security & Governance
  • Secure multi-tenancy: project-based restricted access
  • Encryption at rest and in motion: TLS/SSL everywhere
  • AI-asset governance: models, experiments, data, GPUs
  • Data/model/feature lineage: discover and track dependencies

Development & Operations
  • Development environment: first-class Python support
  • Version everything: code, infrastructure, data
  • Model serving on Kubernetes: TF Serving, Scikit-Learn
  • End-to-end ML pipelines: orchestrated by Airflow

Efficiency & Performance
  • Feature Store: data warehouse for ML
  • Distributed deep learning: faster with more GPUs
  • HopsFS: NVMe speed with Big Data
  • Horizontally scalable: ingestion, data prep, training, serving

SLIDE 34

Acknowledgements and References

Slides and Diagrams from colleagues:

  • Maggy: Moritz Meister and Sina Sheikholeslami
  • Feature Store: Kim Hammar
  • Beam/Flink on Hopsworks: Theofilos Kakantousis

References

  • HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017.
  • Size matters: Improving the performance of small files …, ACM Middleware 2018.
  • ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
  • Hopsworks Demo, SysML 2019.

SLIDE 35

Thank you!

470 Ramona St, Palo Alto
https://www.logicalclocks.com
Register for a free account at www.hops.site

Twitter

@logicalclocks @hopsworks

GitHub

https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops