

  1. Hopsworks Dr. Jim Dowling Sina Sheikholeslami Nov 2019

  2. Hopsworks Technical Milestones
     2017:
     ● World's fastest HDFS: HopsFS, published at USENIX FAST 2017 with Oracle and Spotify; winner of the IEEE Scale Challenge 2017 with 1.2m ops/sec.
     2018:
     ● World's first Hadoop platform with GPUs-as-a-Resource.
     ● World's first open-source Feature Store for Machine Learning.
     ● World's first distributed filesystem to store small files in metadata on NVMe disks.
     2019:
     ● First non-Google ML platform to support TensorFlow Extended (TFX), through Beam/Flink.
     ● World's most scalable POSIX-like hierarchical filesystem, with multi-data-center availability and 1.6m ops/sec on GCP.
     ● World's first unified hyperparameter and ablation-study parallel programming framework.
     "If you're working with big data and Hadoop, this one paper could repay your investment in The Morning Paper many times over.... HopsFS is a huge win." - Adrian Colyer, The Morning Paper

  3. Getting Started
     0. Slides: http://hops.io/id2223.pdf
     1. Register for an account at: www.hops.site
     2. Follow the instructions here: https://bit.ly/2UEixTr
     3. Getting-started videos: https://bit.ly/2NnbKgu

  4. Hopsworks hides the Complexity of Deep Learning. Around the core training step, Hopsworks manages data collection, data validation, feature engineering, the Feature Store, distributed training, hyperparameter tuning, A/B testing, model serving of φ(x) behind a REST API, prediction monitoring, hardware management, and pipeline management. [Adapted from Sculley et al., "Hidden Technical Debt in Machine Learning Systems"]

  5. Machine Learning Pipelines

  6. End-to-End ML Pipelines

  7. ML Pipelines with a Feature Store

  8. End-to-End ML Pipelines in Hopsworks

  9. Roles in Machine Learning
     Stage 1. Data Engineer: feature engineering from raw data sources (Redshift, S3, Cassandra, Hadoop) into the Feature Store, via streaming or serverless pipelines.
     Stage 2. Data Scientist: feature selection from the Feature Store; test and training data; model design; trying out model architectures and hyperparameters as trials in experiments; promoting model candidates to the Model Repository.
     Stage 3. ML Ops Engineer: deploying models from the Model Repository behind a Model Inference API on Kubernetes/Serverless; monitoring, KPI dashboards, and alerts; logging predictions and joining them with outcomes.
     Stage 4. App Developer: the online application gets online features from the Feature Store, calls Predict on the Model Inference API, acts on the result, and logs predictions to Kafka.

  10. Running TensorFlow/Keras/PyTorch Apps in PySpark Warning: micro-exposure to PySpark may cure you of distributed programming phobia

  11. GPU(s) in PySpark: Executors do the work, the Driver coordinates. PySpark makes it easier to write TensorFlow/Keras/PyTorch code that can either run on a single GPU or scale to run on lots of GPUs for parallel experiments or distributed training.

  12. Need a Distributed Filesystem for Coordination
      • Training/test datasets
      • Model checkpoints, trained models
      • Experiment run data
      • Provenance data
      • Application logs
      (Diagram: the Driver and Executors 1..N coordinate through HopsFS, which also feeds TensorBoard and model serving.)

  13. PySpark Hello World

  14. PySpark – Hello World: the Driver calls experiment.launch(..), which runs print("Hello from GPU") on 1 Executor.
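As a sketch only (the keyword arguments of hops' experiment.launch may differ across library versions), the hello world above could look like:

      from hops import experiment

      def hello():
          # Runs on a PySpark executor, which may have a GPU allocated to it
          print("Hello from GPU")

      # The driver ships the function to one executor and collects its logs
      experiment.launch(hello)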

  15. Leave the code unchanged, but configure 4 Executors: each of the four executors now runs print("Hello from GPU").
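The executor count is configuration rather than code. A minimal sketch of one way to request it when building a Spark session (on Hopsworks itself this is typically set in the Jupyter/job configuration UI before the session starts):

      from pyspark.sql import SparkSession

      # Ask Spark for four executors; the training code itself is unchanged
      spark = SparkSession.builder \
          .appName("hello-gpu") \
          .config("spark.executor.instances", "4") \
          .getOrCreate()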

  16. Driver with 4 Executors

  17. Same/Replica Conda Environment on all Executors: the Driver and every Executor run an identical conda_env.

  18. A Conda Environment Per Project in Hopsworks

  19. Use Pip or Conda to install Python libraries

  20. TensorFlow Distributed Training with PySpark

      def train():
          # Separate shard of the dataset per worker
          # create an Estimator with DistributionStrategy
          # set to CollectiveAllReduce
          # train the model, evaluate it
          return loss

      # Driver code below here:
      # builds TF_CONFIG and shares it with the workers
      from hops import experiment
      experiment.collective_allreduce(train)

      More details: https://github.com/logicalclocks/hops-examples
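For illustration, a hedged sketch of what a filled-in train function could look like, using tf.distribute's MultiWorkerMirroredStrategy (which performs collective all-reduce); the model and dataset are assumptions, not the slide's code:

      import tensorflow as tf
      from hops import experiment

      def train():
          # TF_CONFIG is set on each worker by the launcher, so the strategy
          # can discover its peers for collective all-reduce
          strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
          with strategy.scope():
              model = tf.keras.Sequential([
                  tf.keras.layers.Dense(128, activation='relu'),
                  tf.keras.layers.Dense(10, activation='softmax'),
              ])
              model.compile(optimizer='adam',
                            loss='sparse_categorical_crossentropy',
                            metrics=['accuracy'])
          # Toy data; a tf.data input pipeline would be auto-sharded per worker
          (x, y), _ = tf.keras.datasets.mnist.load_data()
          x = x.reshape(-1, 784).astype('float32') / 255.0
          history = model.fit(x, y, epochs=1, batch_size=128)
          return history.history['loss'][-1]

      experiment.collective_allreduce(train)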

  21. Undirected Hyperparameter Search with PySpark

      def train(dropout):
          # Same dataset for all workers
          # create the model and optimizer
          # apply this worker's value of dropout
          # train the model and evaluate it
          return loss

      # Driver code below here
      from hops import experiment
      args = {"dropout": [0.1, 0.4, 0.8]}
      experiment.grid_search(train, args)

      More details: https://github.com/logicalclocks/hops-examples
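A grid over several hyperparameters is just more lists in the args dict: each combination becomes one trial that can run in parallel on the executors. A hedged sketch, with a toy objective standing in for real training:

      from hops import experiment

      def train(dropout, lr):
          # Stand-in for real training: a toy loss with a known minimum
          return (dropout - 0.4) ** 2 + (lr - 0.01) ** 2

      # 3 x 2 = 6 trials, one per hyperparameter combination
      experiment.grid_search(train, {"dropout": [0.1, 0.4, 0.8],
                                     "lr": [0.01, 0.1]})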

  22. Directed Hyperparameter Search with PySpark

      def train(dropout):
          # Same dataset for all workers
          # create the model and optimizer
          optimizer.apply(dropout)
          # train the model and evaluate it
          return loss

      # Driver code below here
      from hops import experiment
      args = {"dropout": "0.1 - 0.8"}
      experiment.diff_ev(train, args)

      More details: https://github.com/logicalclocks/hops-examples

  23. Wasted Compute!

  24. Parallel ML Trials with Maggy

  25. Maggy: Unified Hyperparameter Optimization & Ablation Programming. An Optimizer/Controller on the driver evaluates trials from the machine learning system, synchronously or asynchronously, using directed or undirected search with user-defined search spaces and optimizers. For hyperparameter optimization it hands out new hyperparameter values; for ablation studies it hands out new dataset/model architectures.

  26. Directed Hyperparameter Search with Maggy

      def train(dropout, reporter):
          ...

      from maggy import experiment, Searchspace
      sp = Searchspace(dropout=('INTEGER', [2, 8]))
      experiment.lagom(train, sp)

      (Diagram: the Driver sends new trials to Tasks 1..N, gathers their metrics past a barrier, and can early-stop poor trials.)

      More details: https://github.com/logicalclocks/hops-examples
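Slightly fleshed out, and hedged (maggy's keyword names have changed across versions; the search space and toy objective are assumptions, and 'randomsearch' is shown for simplicity where a directed optimizer would plug in the same way):

      from maggy import experiment, Searchspace

      # A continuous search space for the dropout rate
      sp = Searchspace(dropout=('DOUBLE', [0.01, 0.8]))

      def train(dropout, reporter):
          # Stand-in for real training: a toy loss minimized near dropout=0.3
          loss = (dropout - 0.3) ** 2
          # Stream the metric to the driver, enabling directed search and
          # early stopping of poor trials
          reporter.log('loss: {}'.format(loss))
          return loss

      experiment.lagom(train, searchspace=sp, optimizer='randomsearch',
                       direction='min', num_trials=10)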

  27. Parallel Ablation Studies with Maggy

      def train(dataset_function, model_function):
          ...

      from maggy import experiment
      ablation_study = ...
      experiment.lagom(train,
                       experiment_type='ablation',
                       ablation_study=ablation_study,
                       ablator='loco')

      (Diagram: as with hyperparameter search, the Driver issues new trials to Tasks 1..N, collects their metrics past a barrier, and supports early stopping.)

      More details: https://github.com/logicalclocks/hops-examples
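A hedged sketch of constructing the ablation study object, following the pattern in maggy's documentation at the time; the dataset name, label, features, and the dataset_function signature are illustrative assumptions:

      import tensorflow as tf
      from maggy import experiment
      from maggy.ablation import AblationStudy

      # The base experiment to ablate from
      ablation_study = AblationStudy('titanic_train_dataset',
                                     training_dataset_version=1,
                                     label_name='survived')

      # LOCO ('leave one component out'): each included feature is dropped
      # in turn, one trial per feature
      ablation_study.features.include('pclass', 'fare')

      def base_model_generator():
          # The unablated base model
          return tf.keras.Sequential([
              tf.keras.layers.Dense(64, activation='relu'),
              tf.keras.layers.Dense(1, activation='sigmoid'),
          ])

      ablation_study.model.set_base_model_generator(base_model_generator)

      def train(dataset_function, model_function):
          # maggy injects per-trial dataset/model constructors
          model = model_function()
          model.compile(optimizer='adam', loss='binary_crossentropy',
                        metrics=['accuracy'])
          dataset = dataset_function(32)  # batch size; signature is an assumption
          history = model.fit(dataset, epochs=5)
          return history.history['accuracy'][-1]

      experiment.lagom(train, experiment_type='ablation',
                       ablation_study=ablation_study, ablator='loco')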

  28. /Experiments
      ● Executions add entries in /Experiments:
        experiment.launch(…)
        experiment.grid_search(…)
        experiment.collective_allreduce(…)
        experiment.lagom(…)
      ● Layout:
        /Projects/MyProj
        └ Experiments
          └ <app_id>
            └ <type>
              ├─ checkpoints
              ├─ tensorboard_logs
              ├─ logfile
              └─ versioned_resources
                 ├─ notebook.ipynb
                 └─ conda_env.yml
      ● /Experiments contains:
        ○ logs (application, TensorBoard)
        ○ the executed notebook file
        ○ the conda environment used
        ○ checkpoints

  29. /Models
      ● Named/versioned model management for TensorFlow/Keras and Scikit-Learn:
        /Projects/MyProj
        └ Models
          └ <name>
            └ <version>
              ├─ saved_model.pb
              └─ variables/
                 ...
      ● A Models dataset can be securely shared with other projects or the whole cluster.
      ● The provenance API returns the conda.yml and the execution used to train a given model.
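Since a versioned model directory is a standard TensorFlow SavedModel (saved_model.pb plus variables/), it can be loaded directly; a small sketch, assuming the HopsFS path is readable from where the code runs and that a model 'mnist' with version 1 exists:

      import tensorflow as tf

      # Load version 1 of the model named 'mnist' from the layout above
      model = tf.saved_model.load('/Projects/MyProj/Models/mnist/1')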

  30. That was Hopsworks
      Efficiency & Performance:
      ● Feature Store: a data warehouse for ML
      ● Distributed Deep Learning: faster with more GPUs
      ● HopsFS: NVMe speed with Big Data
      ● Horizontally scalable: ingestion, data prep, training, serving
      Development & Operations:
      ● Development environment with first-class Python support
      ● Version everything: code, infrastructure, data
      ● Model serving on Kubernetes: TF Serving, Scikit-Learn
      ● End-to-end ML pipelines orchestrated by Airflow
      Security & Governance:
      ● Secure multi-tenancy: project-based restricted access
      ● Encryption at-rest and in-motion: TLS/SSL everywhere
      ● AI-asset governance: models, experiments, data, GPUs
      ● Data/model/feature lineage: discover and track dependencies

  31. Acknowledgements and References
      Slides and diagrams from colleagues:
      ● Maggy: Moritz Meister and Sina Sheikholeslami
      ● Feature Store: Kim Hammar
      ● Beam/Flink on Hopsworks: Theofilos Kakantousis
      References:
      ● HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, USENIX FAST 2017.
      ● Size Matters: Improving the Performance of Small Files in Hadoop, ACM Middleware 2018.
      ● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid 2019.
      ● Hopsworks Demo, SysML 2019.

  32. 470 Ramona St, Palo Alto. https://www.logicalclocks.com
      Register for a free account at www.hops.site
      Thank you!
      Twitter: @logicalclocks, @hopsworks
      GitHub: https://github.com/logicalclocks/hopsworks and https://github.com/hopshadoop/hops
