

  1. From Python to PySpark and Back Again - Unifying Single-host and Distributed Machine Learning with Maggy. Moritz Meister (@morimeister), Software Engineer, Logical Clocks; Jim Dowling (@jim_dowling), Associate Professor, KTH Royal Institute of Technology

  2. ML Model Development - a simplified view. [Diagram: Feature Exploration, Experimentation, Model Training, Explainability and Validation, Serving, Pipelines]

  3. ML Model Development - it's simple, only four steps: Explore and Design → Experimentation: Tune and Search → Explainability and Ablation Studies → Model Training (Distributed)

  4. Artifacts and Non-DRY Code: Explore and Design → Experimentation: Tune and Search → Explainability and Ablation Studies → Model Training (Distributed)

  5. What It’s Really Like … not linear but iterative

  6. What It’s Really Really Like … not linear but iterative

  7. Root Cause: Iterative Development of ML Models - Explore and Design → Experimentation: Tune and Search → Explainability and Ablation Studies → Model Training (Distributed)

  8. Iterative Development Is a Pain, We Need DRY Code! Each step requires different implementations of the training code: EDA → HParam Tuning → Ablation Studies → Training (Dist)

  9. The Oblivious Training Function (EDA → HParam Tuning → Ablation Studies → Training (Dist)):

     # RUNS ON THE WORKERS
     def train():
         def input_fn():  # return dataset
         model = …
         optimizer = …
         model.compile(…)
         rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
         keras_estimator = tf.keras.estimator.model_to_estimator(…)
         tf.estimator.train_and_evaluate(keras_estimator, input_fn)
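For contrast, a minimal sketch of what a distribution-oblivious training function can look like: plain, parametrized Python with no RunConfig, TF_CONFIG, or distribution strategy inside. The names train, model_fn, dataset_fn, hparams and the optional reporter hook are illustrative assumptions, not Maggy's API.

     def train(model_fn, dataset_fn, hparams, reporter=None):
         # Everything framework- and cluster-specific is left to the launcher.
         train_ds = dataset_fn(batch_size=hparams["batch_size"])
         model = model_fn(lr=hparams["lr"])
         history = model.fit(train_ds, epochs=hparams["epochs"], steps_per_epoch=100)
         final_loss = history.history["loss"][-1]
         if reporter is not None:
             # Optional hook injected at runtime, e.g. for metric logging or early stopping.
             reporter(final_loss)
         return final_loss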

  10. Challenge: Obtrusive Framework Artifacts Example: TensorFlow ▪ TF_CONFIG ▪ Distribution Strategy ▪ Dataset (Sharding, DFS) ▪ Integration in Python - hard from inside a notebook ▪ Keras vs. Estimator vs. Custom Training Loop
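To make the obtrusiveness concrete, this is roughly the per-worker boilerplate that TensorFlow 2.x multi-worker training expects; the hostnames, ports, and the toy model below are placeholders for illustration only.

     import json
     import os
     import tensorflow as tf

     # Every worker needs the cluster layout and its own role in TF_CONFIG
     # before the strategy is created.
     os.environ["TF_CONFIG"] = json.dumps({
         "cluster": {"worker": ["host1:12345", "host2:12345"]},  # placeholder hosts
         "task": {"type": "worker", "index": 0},                 # this worker's index
     })

     strategy = tf.distribute.MultiWorkerMirroredStrategy()
     with strategy.scope():
         # Model creation and compilation must happen inside the strategy scope.
         model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
         model.compile(loss="mse", optimizer="sgd")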

  11. Where is Deep Learning headed?

  12. Productive High-Level APIs, or why data scientists love Keras and PyTorch. [Diagram: from Idea to Results via Framework, Experiment Tracking, Visualization, and Infrastructure] (Francois Chollet, "Keras: The Next 5 Years")

  13. Productive High-Level APIs, or why data scientists love Keras and PyTorch. [Same Idea-to-Results diagram as slide 12, with the infrastructure layer as the open question: Hopsworks (open source), Databricks, Apache Spark, cloud providers] (Francois Chollet, "Keras: The Next 5 Years")

  14. How do we keep our high-level APIs transparent and productive?

  15. What Is Transparent Code? The single-host and the distributed version are the same code - NO CHANGES!

     import numpy as np
     import tensorflow as tf
     from tensorflow.keras.losses import SparseCategoricalCrossentropy
     from tensorflow.keras.optimizers import SGD

     def dataset(batch_size):
         (x_train, y_train) = load_data()  # user-defined helper returning the training split
         x_train = x_train / np.float32(255)
         y_train = y_train.astype(np.int64)
         train_dataset = tf.data.Dataset.from_tensor_slices(
             (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
         return train_dataset

     def build_and_compile_cnn_model(lr):
         model = tf.keras.Sequential([
             tf.keras.Input(shape=(28, 28)),
             tf.keras.layers.Reshape(target_shape=(28, 28, 1)),  # add channel dim for Conv2D
             tf.keras.layers.Conv2D(32, 3, activation='relu'),
             tf.keras.layers.Flatten(),
             tf.keras.layers.Dense(128, activation='relu'),
             tf.keras.layers.Dense(10)
         ])
         model.compile(
             loss=SparseCategoricalCrossentropy(from_logits=True),
             optimizer=SGD(learning_rate=lr))
         return model
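The same two functions can then be driven either single-host or wrapped by the launcher in a distribution strategy. A minimal sketch follows; the batch size, epoch count, and the choice of MultiWorkerMirroredStrategy are illustrative, and real multi-worker runs additionally need the TF_CONFIG set-up sketched after slide 10.

     # Single-host: call the unchanged functions directly.
     model = build_and_compile_cnn_model(lr=0.001)
     model.fit(dataset(batch_size=64), epochs=3, steps_per_epoch=100)

     # Distributed: the launcher only adds the strategy scope around the same functions.
     strategy = tf.distribute.MultiWorkerMirroredStrategy()
     with strategy.scope():
         model = build_and_compile_cnn_model(lr=0.001)
     model.fit(dataset(batch_size=64), epochs=3, steps_per_epoch=100)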

  16. Building Blocks for Distribution Transparency

  17. Distribution Context: single-host vs. parallel multi-host vs. distributed multi-host. [Diagram: a single-host experiment with one driver; a parallel multi-host set-up with a driver acting as controller for Worker 1 … Worker N; a distributed multi-host set-up with Worker 1 … Worker 8 coordinated via TF_CONFIG]

  18. Distribution Context: single-host vs. parallel multi-host vs. distributed multi-host, mapped onto the development steps Explore and Design → Experimentation: Tune and Search → Explainability and Ablation Studies → Model Training (Distributed). [Same diagram as slide 17]

  19. Model Development Best Practices ▪ Modularize ▪ Parametrize ▪ Higher-order training functions (Training Logic, Model Generation, Dataset Generation) ▪ Usage of callbacks at runtime
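"Usage of callbacks at runtime" can be illustrated with a standard Keras callback that forwards per-epoch metrics to whatever hook the surrounding system injects; MetricReporter and report_fn are hypothetical names for this sketch, not part of the deck or of Maggy.

     import tensorflow as tf

     class MetricReporter(tf.keras.callbacks.Callback):
         """Forwards the monitored metric to an injected reporting function after each epoch."""

         def __init__(self, report_fn, monitor="loss"):
             super().__init__()
             self.report_fn = report_fn  # hook supplied by the experiment driver at runtime
             self.monitor = monitor

         def on_epoch_end(self, epoch, logs=None):
             logs = logs or {}
             if self.monitor in logs:
                 self.report_fn(epoch, logs[self.monitor])

     # The training logic only appends the callback; what report_fn does
     # (logging, early stopping, ...) is decided by the launcher:
     # model.fit(train_ds, epochs=5, callbacks=[MetricReporter(report_fn=print)])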

  20. Oblivious Training Function as an Abstraction - let the system handle the complexities. The system takes care of
      … fixing parameters
      … launching trials (parametrized instantiations of the function)
      … setting up TF_CONFIG
      … wrapping the function in a Distribution Strategy
      … generating new trials
      … launching the function as workers
      … collecting and logging results
      … collecting results

  21. Maggy - make the Oblivious Training Function a core abstraction on Hopsworks (Spark+AI Summit 2019 → today). With Hopsworks and Maggy, we provide a unified development and execution environment for distribution-transparent ML model development.

  22. Hopsworks - Award-Winning Platform

  23. Recap: Maggy - Asynchronous Trials on Spark. Spark is bulk-synchronous: trials run as tasks (Task 11 … Task 3N) in stages separated by barriers, the driver collects metrics (Metrics 1-3) via HopsFS, and early-stopping a trial leaves the rest of its stage as wasted compute.

  24. Recap: The Solution - add communication and long-running tasks: a single stage of long-running tasks (Task 11 … Task 1N) stays alive behind one barrier, sends metrics to the driver, and receives new trials from it.
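The pattern can be sketched as a worker loop inside a long-running task; request_trial and report_metric are hypothetical stand-ins for the communication layer between the tasks and the driver, not Maggy's actual interface.

     def worker_loop(train_fun, request_trial, report_metric):
         # The task stays alive instead of ending at a stage barrier.
         while True:
             trial = request_trial()  # driver returns the next trial, or None when done
             if trial is None:
                 break
             metric = train_fun(**trial.params)  # run the (oblivious) training function
             report_metric(trial.trial_id, metric)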

  25. What's New? Worker discovery and distribution context set-up: the driver discovers the workers among the long-running tasks (Task 11 … Task 1N) and launches the Oblivious Training Function in the chosen distribution context.

  26. What's New: Distribution Context

     from maggy import experiment

     experiment.set_dataset_generator(gen_dataset)
     experiment.set_model_generator(gen_model)

     # Hyperparameter optimization
     experiment.set_context('optimization', 'randomsearch', searchspace)
     result = experiment.lagom(train_fun)
     params = result.get('best_hp')

     # Distributed Training
     experiment.set_context('dist_training', 'MultiWorkerMirroredStrategy', params)
     experiment.lagom(train_fun)

     # Ablation study
     experiment.set_context('ablation', 'loco', ablation_study, params)
     experiment.lagom(train_fun)

  27. DEMO: code changes required to go from standard Python code to scale-out hyperparameter tuning and distributed training.

  28. What's Next: extend the platform to provide a unified development and execution environment for distribution-transparent Jupyter Notebooks.

  29. Summary ▪ Moving between distribution contexts requires code rewriting ▪ Factor out obtrusive framework artifacts ▪ Let system handle distribution context ▪ Keep productive high-level APIs

  30. Thank You! Get Started: hopsworks.ai, github.com/logicalclocks/maggy. Thanks to the Logical Clocks Team! Contributions from colleagues Sina Sheikholeslami, Robin Andersson, Alex Ormenisan, and Kai Jeggle. Twitter: @morimeister, @jim_dowling, @logicalclocks, @hopsworks. Web: www.logicalclocks.com

  31. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
