From Python to PySpark and Back Again: Unifying Single-host and Distributed Machine Learning with Maggy

Moritz Meister, @morimeister, Software Engineer, Logical Clocks
Jim Dowling, @jim_dowling, Associate Professor, KTH Royal Institute of Technology


SLIDE 1-2

From Python to PySpark and Back Again

  • Unifying Single-host and Distributed Machine Learning with Maggy

Moritz Meister, @morimeister, Software Engineer, Logical Clocks
Jim Dowling, @jim_dowling, Associate Professor, KTH Royal Institute of Technology

SLIDE 3

ML Model Development

A simplified view

Exploration → Experimentation → Model Training → Explainability and Validation → Serving (fed by Feature Pipelines)

SLIDE 4

ML Model Development

1. Explore and Design
2. Experimentation: Tune and Search
3. Model Training (Distributed)
4. Explainability and Ablation Studies

It’s simple - only four steps

SLIDE 5

Artifacts and Non-DRY Code

1. Explore and Design
2. Experimentation: Tune and Search
3. Model Training (Distributed)
4. Explainability and Ablation Studies

SLIDE 6

What It’s Really Like

… not linear but iterative

SLIDE 7

What It’s Really Really Like

… not linear but iterative

SLIDE 8

Root Cause: Iterative Development of ML Models

1. Explore and Design
2. Experimentation: Tune and Search
3. Model Training (Distributed)
4. Explainability and Ablation Studies

SLIDE 9

EDA → HParam Tuning → Training (Dist) → Ablation Studies

Iterative Development Is a Pain, We Need DRY Code!

Each step requires different implementations of the training code

SLIDE 10

OBLIVIOUS TRAINING FUNCTION

# RUNS ON THE WORKERS
def train():
    def input_fn():
        # return dataset
        ...
    model = …
    optimizer = …
    model.compile(…)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(….)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

EDA → HParam Tuning → Training (Dist) → Ablation Studies

The Oblivious Training Function

SLIDE 11

Challenge: Obtrusive Framework Artifacts

▪ TF_CONFIG
▪ Distribution Strategy
▪ Dataset (Sharding, DFS)
▪ Integration in Python - hard from inside a notebook
▪ Keras vs. Estimator vs. Custom Training Loop

Example: TensorFlow
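One concrete artifact: before a process can join a `MultiWorkerMirroredStrategy` cluster, it must export a `TF_CONFIG` environment variable describing the whole cluster and its own role in it. A minimal sketch, with made-up host names and ports:

```python
import json
import os

# TF_CONFIG carries the cluster spec plus this process's role in it.
# All workers share the same "cluster" section; each gets its own
# "task" index. The host:port pairs below are placeholders.
tf_config = {
    "cluster": {
        "worker": ["host1:12345", "host2:12345"],
    },
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# TensorFlow reads the variable when the strategy is created, e.g.:
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```

Keeping this per-worker boilerplate out of the training function is exactly what the slide means by an obtrusive framework artifact.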

SLIDE 12

Where is Deep Learning headed?

SLIDE 13

Productive High-Level APIs

Or why data scientists love Keras and PyTorch

Idea → Experiment → Results
(supported by: Infrastructure, Framework, Tracking, Visualization)

François Chollet, “Keras: The Next 5 Years”

SLIDE 14

Productive High-Level APIs

Or why data scientists love Keras and PyTorch

Idea → Experiment → Results
(supported by: Infrastructure, Framework, Tracking, Visualization)

François Chollet, “Keras: The Next 5 Years”

?

Hopsworks (Open Source), Databricks, Apache Spark, Cloud Providers

SLIDE 15

How do we keep our high-level APIs transparent and productive?

SLIDE 16

What Is Transparent Code?

def dataset(batch_size):
    (x_train, y_train) = load_data()
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model

NO CHANGES! (the slide shows the same code twice: it runs unmodified single-host and distributed)

SLIDE 17

Building Blocks for Distribution Transparency

SLIDE 18

Distribution Context

Single-host vs. parallel multi-host vs. distributed multi-host

[Diagram: three distribution contexts. Single Host: a lone Driver. Parallel multi-host: a Driver acting as Experiment Controller coordinating Worker 1 … Worker N. Distributed multi-host: Worker 1 … Worker 8 wired together via TF_CONFIG.]

SLIDE 19

Distribution Context

Single-host vs. parallel multi-host vs. distributed multi-host

[Diagram: three distribution contexts. Single Host: a lone Driver. Parallel multi-host: a Driver acting as Experiment Controller coordinating Worker 1 … Worker N. Distributed multi-host: Worker 1 … Worker 8 wired together via TF_CONFIG.]

1. Explore and Design
2. Experimentation: Tune and Search
3. Model Training (Distributed)
4. Explainability and Ablation Studies

SLIDE 20

Model Development Best Practices

▪ Modularize
▪ Parametrize
▪ Higher-order training functions
▪ Usage of callbacks at runtime

Components: Dataset Generation, Model Generation, Training Logic
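Put together, these practices turn the training logic into a higher-order function: the dataset generator, model generator, and callbacks are all injected, so the surrounding system can swap them per context. A framework-free sketch (all names are illustrative, not Maggy's API):

```python
def gen_dataset(batch_size):
    # Stand-in for a real data pipeline: fixed toy data split into batches.
    data = list(range(8))
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

def gen_model(lr):
    # Stand-in for a compiled model: a dict recording its state.
    return {"lr": lr, "examples_seen": 0}

def train(dataset_gen, model_gen, batch_size=2, lr=0.01, callbacks=()):
    # Higher-order training function: every moving part is a parameter,
    # so tuning, distribution, or ablation can vary them at runtime.
    model = model_gen(lr)
    for batch in dataset_gen(batch_size):
        model["examples_seen"] += len(batch)
        for callback in callbacks:
            callback(model)  # e.g. metric reporting, early stopping
    return model

trained = train(gen_dataset, gen_model, batch_size=4, lr=0.1)
print(trained["examples_seen"])  # -> 8
```

Because nothing inside `train` names a specific dataset, model, or execution context, the same body can be launched once on a laptop or many times as parametrized trials.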

SLIDE 21

Oblivious Training Function as an Abstraction

Let the system handle the complexities

System takes care of …

▪ fixing parameters
▪ launching the function
▪ launching trials (parametrized instantiations of the function)
▪ generating new trials
▪ collecting and logging results
▪ setting up TF_CONFIG
▪ wrapping in Distribution Strategy
▪ launching function as workers
▪ collecting results
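The experimentation half of that list can be pictured as a small driver loop owned by the system. The sketch below is a deterministic, single-process grid-search stand-in (Maggy actually dispatches trials asynchronously to Spark executors; all names here are illustrative):

```python
import itertools

def run_trial(train_fn, params):
    # A trial is a parametrized instantiation of the training function.
    metric = train_fn(**params)
    return {"params": params, "metric": metric}

def search(train_fn, searchspace):
    # The driver fixes parameters, launches trials, collects the
    # results, and finally reports the best configuration.
    names = list(searchspace)
    trials = [run_trial(train_fn, dict(zip(names, combo)))
              for combo in itertools.product(*searchspace.values())]
    return min(trials, key=lambda trial: trial["metric"])

# Toy "training" function whose loss is minimal at lr=0.1, batch_size=32.
def train_fn(lr, batch_size):
    return (lr - 0.1) ** 2 + (batch_size - 32) ** 2 / 1000.0

best = search(train_fn, {"lr": [0.01, 0.1, 1.0], "batch_size": [16, 32, 64]})
print(best["params"])  # -> {'lr': 0.1, 'batch_size': 32}
```

The user supplies only `train_fn`; fixing parameters, launching trials, and collecting results all live in the driver, which is the division of labor the slide describes.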

SLIDE 22

Maggy

Timeline: Spark+AI Summit 2019 → Today

With Hopsworks and Maggy, we provide a unified development and execution environment for distribution transparent ML model development.

Make the Oblivious Training Function a core abstraction on Hopsworks

SLIDE 23

Hopsworks - Award-Winning Platform

SLIDE 24

Recap: Maggy - Asynchronous Trials on Spark

Spark is bulk-synchronous

[Diagram: three stages of tasks (Task11 … Task1N, Task21 … Task2N, Task31 … Task3N), each ending at a barrier where the Driver collects Metrics1/2/3 via HopsFS; trials that could be early-stopped keep running until the barrier, leaving wasted compute in every stage.]

SLIDE 25

Recap: The Solution

Add Communication and Long Running Tasks

[Diagram: one set of long-running tasks (Task11 … Task1N) with a single final barrier; tasks stream Metrics to the Driver and receive New Trials back.]

SLIDE 26

What’s New?

Worker discovery and distribution context set-up

[Diagram: tasks Task11 … Task1N behind a single barrier; the Driver discovers the Workers, sets up the distribution context, and launches the Oblivious Training Function in that context.]

SLIDE 27

What’s New: Distribution Context

from maggy import experiment

experiment.set_dataset_generator(gen_dataset)
experiment.set_model_generator(gen_model)

# Hyperparameter optimization
experiment.set_context('optimization', 'randomsearch', searchspace)
result = experiment.lagom(train_fun)
params = result.get('best_hp')

# Distributed Training
experiment.set_context('dist_training', 'MultiWorkerMirroredStrategy', params)
experiment.lagom(train_fun)

# Ablation study
experiment.set_context('ablation', 'loco', ablation_study, params)
experiment.lagom(train_fun)
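The same train_fun serves all three contexts because it is oblivious: it composes the registered generators into training logic and returns a metric, with no TF_CONFIG, distribution strategy, or trial bookkeeping inside. A schematic, framework-free stand-in (the toy model and exact signature are illustrative, not Maggy's real API):

```python
class ToyModel:
    # Stand-in for a compiled Keras model.
    def __init__(self, lr):
        self.lr = lr

    def fit(self, dataset):
        # Pretend the final loss shrinks with more data and a higher lr.
        return 1.0 / (self.lr * sum(len(batch) for batch in dataset))

def gen_dataset(batch_size):
    return [list(range(batch_size)) for _ in range(4)]

def gen_model(lr):
    return ToyModel(lr)

def train_fun(dataset_gen=gen_dataset, model_gen=gen_model,
              lr=0.01, batch_size=32):
    # Oblivious training function: hyperparameters arrive as arguments
    # (fixed, sampled, or ablated depending on the context), and the
    # returned metric is what the system logs and optimizes.
    dataset = dataset_gen(batch_size)
    model = model_gen(lr)
    return model.fit(dataset)

print(train_fun(lr=0.5, batch_size=5))  # -> 0.1
```

Switching context changes only the `set_context` call around the function, never the function body itself.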

SLIDE 28

DEMO

Code changes required to go from standard Python code to scale-out hyperparameter tuning and distributed training.

SLIDE 29

What’s Next

Extend the platform to provide a unified development and execution environment for distribution transparent Jupyter Notebooks.

SLIDE 30

Summary

▪ Moving between distribution contexts requires code rewriting
▪ Factor out obtrusive framework artifacts
▪ Let the system handle the distribution context
▪ Keep productive high-level APIs

SLIDE 31

Thank You!

Get Started: hopsworks.ai, github.com/logicalclocks/maggy
Twitter: @morimeister, @jim_dowling, @logicalclocks, @hopsworks
Web: www.logicalclocks.com

Contributions from colleagues:

Sina Sheikholeslami

Robin Andersson

Alex Ormenisan

Kai Jeggle

Thanks to the Logical Clocks Team!

SLIDE 32

Feedback

Your feedback is important to us.

Don’t forget to rate and review the sessions.
