

SLIDE 1

The AI Thunderdome

Using OpenStack to accelerate AI training with Sahara, Spark, and Swift

Sean Pryor, Sr. Cloud Consultant, RHCE
Red Hat | https://www.redhat.com | spryor@redhat.com

SLIDE 2

Overview

This talk will cover:

  • Brief explanations of ML, Spark, and Sahara
  • Some notes on preparation for Sahara
  • (And some issues we hit in our lab while preparing for this talk)
  • A look at Machine Learning concepts inside Spark
  • Cross Validation and Model Selection
  • Sparkflow architecture
  • Example code
SLIDE 3

Big Data and OpenStack

SLIDE 4

Big Data and OpenStack

A lot of data resides on OpenStack already

  • The data is already there. Why move it elsewhere to analyze it?
  • The tools to do the analysis are already there.

From the user survey: https://www.openstack.org/analytics

SLIDE 5

Sahara+Spark+Swift Architecture

Basic architecture outline

  • Sahara is a wrapper around Heat
    ○ It does more than just Spark too
  • The basic architecture involves just Spark on compute nodes
  • The Spark cluster can directly access Swift via swift://container/object URLs (see the sketch below)
  • Code deployed on Spark clusters can access things independently as well
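A minimal PySpark sketch of that Swift access path. The container and object names are illustrative (they mirror the swift://testdata/mnist_train.csv used later in this deck), and depending on how the Hadoop Swift driver is configured the container may need a provider suffix such as "testdata.sahara":

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SwiftRead").getOrCreate()

# Read an object straight out of a Swift container; no copy to HDFS needed
df = spark.read.option("inferSchema", "true").csv(
    "swift://testdata/mnist_train.csv")
df.show(5)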

SLIDE 6

Spark Architecture Overview

Basic architecture outline

  • Spark has a master/slave architecture
  • The cluster manager can be either the built-in one, Mesos, YARN, or Kubernetes
  • Spark is built on top of the traditional Map/Reduce framework, but has additional tools, notably ones that include Machine Learning
  • For TensorFlow, there are several frameworks that make training and deploying models on Spark a lot easier
  • Workers have an in-memory data cache - this is important to know when using TensorFlow
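As a small illustration (the master hostname is hypothetical), this is how driver code targets the built-in standalone cluster manager that a Sahara-deployed Spark cluster exposes:

from pyspark.sql import SparkSession

# "spark://spark-master:7077" is the standalone manager's default endpoint;
# mesos://, yarn, or k8s:// URLs select the other cluster managers
spark = (SparkSession.builder
         .appName("ClusterExample")
         .master("spark://spark-master:7077")
         .getOrCreate())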

SLIDE 7

Deploying Sahara

A few notes when deploying Spark clusters via Sahara

Image modifications are needed
  • guestmount works great here
  • pip install:
    ○ tensorflow or tensorflow-gpu
    ○ keras
    ○ sparkdl
    ○ sparkflow
  • Add supergroup to the ubuntu user

Ensure Hadoop Swift support is present
  • java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem not found
  • This error indicates support is missing; you may need to reinstall /usr/lib/hadoop-mapreduce/hadoop-openstack.jar (a sketch of the driver configuration follows below)

The OpenStack job framework doesn't support Python
  • The Job/Job Execution/Job Template framework assumes Java
  • In order to do Python, it likely means spark-submit
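For context, a hedged sketch of the Hadoop configuration the SwiftNativeFileSystem driver expects once hadoop-openstack.jar is in place. The service name "sahara", the Keystone endpoint, and the credentials are all illustrative, and the _jsc handle is a PySpark-private accessor:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SwiftConfig").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Map the swift:// scheme to the driver shipped in hadoop-openstack.jar
hconf.set("fs.swift.impl",
          "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
# Keystone credentials for a Swift service named "sahara"
hconf.set("fs.swift.service.sahara.auth.url",
          "http://keystone.example.com:5000/v2.0/tokens")
hconf.set("fs.swift.service.sahara.username", "demo")
hconf.set("fs.swift.service.sahara.password", "secret")
hconf.set("fs.swift.service.sahara.tenant", "demo")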

SLIDE 8

Machine Learning with Spark

SLIDE 9

Training AI

Basic overview of AI and AI training

  • For ML techniques, broadly, each iteration tries to fit a function to the data
  • Each new iteration refines the function
  • Features: characteristics of a single datapoint
  • Labels: outputs of a Machine Learning model
  • Learning rate: how much each new iteration changes the function
  • Loss: how far from reality each label is
  • Regularization: penalizes complex functions, which helps prevent overfitting

A toy worked example of this vocabulary follows.
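A toy sketch, not from the talk, that puts the terms above together: gradient descent iteratively refining a one-weight linear fit. All numbers are illustrative:

features = [1.0, 2.0, 3.0]        # characteristics of each datapoint
labels   = [2.0, 4.0, 6.0]        # the outputs we want the model to produce
w, learning_rate = 0.0, 0.1

for _ in range(100):               # each iteration refines the function w*x
    preds = [w * x for x in features]
    # Loss: mean squared distance between predictions and reality
    loss = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
    grad = sum(2 * (p - y) * x
               for p, y, x in zip(preds, labels, features)) / len(labels)
    w -= learning_rate * grad      # learning rate scales each update

print(w, loss)                     # w approaches 2.0 as the loss shrinks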

SLIDE 10

Spark Machine Learning

Important Components in Spark ML: DataFrame, Transformer, Estimator

DataFrame
  • Built on the regular Spark RDD/DataFrame API
  • SQL-like
  • Lazy evaluation: notably, transform() doesn't trigger evaluation; things like count() do
  • Supports a Vector type in addition to regular datatypes

Transformer
  • Transformers add/change data in a DataFrame
  • Transformers implement a transform() method which returns a modified DataFrame

Estimator
  • Estimators are Transformers that instead output a model
  • Estimators implement a fit() method which trains the algorithm on the data
  • Estimators can also give you data about the model, like weights and hyperparameters
  • Models can be saved and reused (see the sketch below)
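A minimal sketch of the three pieces working together; the tiny inline dataset and column names are illustrative, not from the talk:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLComponents").getOrCreate()
df = spark.createDataFrame(
    [(0, "spark swift sahara", 1.0), (1, "b d", 0.0)],
    ["id", "text", "label"])

# Transformers: each transform() lazily returns a DataFrame with a new column
words = Tokenizer(inputCol="text", outputCol="words").transform(df)
features = HashingTF(inputCol="words", outputCol="features").transform(words)

# Estimator: fit() trains on the data and returns a Model (itself a Transformer)
model = LogisticRegression(maxIter=10).fit(features)
print(model.coefficients)                         # inspect learned weights
model.write().overwrite().save("/tmp/lr_model")   # models can be saved/reused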

SLIDE 11

Cross Validation

Automatic selection of the best model

  • CrossValidator allows you to select model parameters based on the results of parallel training
  • It wraps a Pipeline, and executes several pipelines in parallel with different parameters
  • Requires a grid of parameters to train against
  • Splits the dataset into N folds; with 3 folds, each round trains on ⅔ of the data and tests on the remaining ⅓
  • Requires a loss metric to optimize against; Evaluator classes have these pre-baked
  • After evaluating all sets of parameters, the best one is re-fit against the entire dataset
  • The parameter grid should ideally be small
  • The folding of the dataset means that it's not ideal for small datasets
  • Still requires some expertise to make sure it doesn't overfit, or that other errors don't occur

SLIDE 12

Example Code

SLIDE 13

Spark CrossValidation Sample Code

Parallel Hyperparameter Training

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    ...
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(),
                      outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=2)  # use 3+ folds in practice
cvModel = crossval.fit(training)

test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

Right out of the manual: https://spark.apache.org/docs/2.3.0/ml-tuning.html

SLIDE 14

Spark CrossValidation Sample Code

Parallel Hyperparameter Training

  • The boilerplate start sets up the Spark Session and training data
  • Tokenizer takes in the input strings and outputs tokens
  • HashingTF generates features by hashing based on the frequency of the input
  • LogisticRegression is one of the pre-canned ML algorithms
  • Pipeline sets up all the stages

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.appName("SparkCV").getOrCreate()

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    ...
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(),
                      outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

SLIDE 15

Spark CrossValidation Sample Code

Parallel Hyperparameter Training

  • paramGrid is a grid of different parameters to plug into our Pipeline segments from before
  • CrossValidator is a wrapper around the pipeline it gets passed, and executes each pipeline with the values from the ParamGrid
  • The evaluator parameter is the function we use to measure the loss of each model
  • numFolds is how many ways we want to partition the dataset
  • cvModel is our best model result from the training
  • cvModel.bestModel exposes the underlying best model directly (see the sketch after the code)

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=2)  # use 3+ folds in practice
cvModel = crossval.fit(training)
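A short follow-on sketch (hypothetical, building on the cvModel above): pulling out the winning PipelineModel and inspecting which parameters won.

best_pipeline = cvModel.bestModel          # the re-fit best PipelineModel
for stage in best_pipeline.stages:
    print(stage.extractParamMap())         # e.g. the chosen numFeatures/regParam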

SLIDE 16

Spark CrossValidation Sample Code

Parallel Hyperparameter Training

  • The test dataset is simply an unlabeled dataset with strings similar to the training dataset
  • Predictions are generated by running transform on the test dataset
  • This adds the predicted values and their probability as new columns
  • Lastly, the code selects and prints several rows to show the behavior of the code

test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    ...
], ["id", "text"])

prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

SLIDE 17

Sparkflow Method

SLIDE 18

Parameter Server with Replicated Models

Alternative Parallel Training Methodology

  • The master node runs as a parameter server
  • The executor nodes all run copies of the TensorFlow graph
  • After a specified number of iterations, they aggregate the weight updates to the graph back on the master node

SLIDE 19

Sparkflow Method Sample Code

Alternative Parallel Training Model

from pyspark.sql import SparkSession
from sparkflow.graph_utils import build_graph
from sparkflow.tensorflow_async import SparkAsyncDL
import tensorflow as tf
from pyspark.ml.feature import VectorAssembler, OneHotEncoder
from pyspark.ml.pipeline import Pipeline

spark = SparkSession.builder.appName("SparkflowMNIST").getOrCreate()

def small_model():
    x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
    y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
    layer1 = tf.layers.dense(x, 256, activation=tf.nn.relu)
    layer2 = tf.layers.dense(layer1, 256, activation=tf.nn.relu)
    out = tf.layers.dense(layer2, 10)
    z = tf.argmax(out, 1, name='out')
    loss = tf.losses.softmax_cross_entropy(y, out)
    return loss

df = spark.read.option("inferSchema", "true").csv('mnist_train.csv')
mg = build_graph(small_model)
va = VectorAssembler(inputCols=df.columns[1:785],
                     outputCol='features')
encoded = OneHotEncoder(inputCol='_c0', outputCol='labels', dropLast=False)

spark_model = SparkAsyncDL(
    inputCol='features',
    tensorflowGraph=mg,
    tfInput='x:0',
    tfLabel='y:0',
    tfOutput='out:0',
    tfLearningRate=.001,
    iters=20,
    predictionCol='predicted',
    labelCol='labels',
    verbose=1
)

p = Pipeline(stages=[va, encoded, spark_model]).fit(df)
p.write().overwrite().save("location")

Straight off GitHub: https://github.com/lifeomic/sparkflow

SLIDE 20

MNIST

For reference, an example of the MNIST dataset

  • MNIST, for reference, is one of these kinds of datasets: it contains images of handwritten digits
  • In the example code, it's been transformed into a CSV (a hedged sketch of that conversion follows)

[Slide shows sample MNIST digit images. Image retrieved from https://chatbotslife.com/training-mxnet-part-1-mnist-6f0dc4210c62]
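A hedged sketch of how such a mnist_train.csv might be produced, assuming the Keras-bundled copy of MNIST; the label goes in the first column, followed by the 784 flattened pixel values, matching the columns the code above expects:

import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
# One row per image: label first, then the 28x28 pixels flattened to 784 ints
rows = np.column_stack([y_train, x_train.reshape(len(x_train), 784)])
np.savetxt("mnist_train.csv", rows, fmt="%d", delimiter=",")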

SLIDE 21

Sparkflow Method Deeper Dive

Alternative Parallel Training Model

  • This code is plain TensorFlow
  • A good option when your main skillset is TensorFlow
  • The function returns the loss metric to be minimized
  • The rest of the model is optimized later on in the code

import tensorflow as tf

def small_model():
    x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
    y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
    layer1 = tf.layers.dense(x, 256, activation=tf.nn.relu)
    layer2 = tf.layers.dense(layer1, 256, activation=tf.nn.relu)
    out = tf.layers.dense(layer2, 10)
    z = tf.argmax(out, 1, name='out')
    loss = tf.losses.softmax_cross_entropy(y, out)
    return loss

SLIDE 22

Sparkflow Method Deeper Dive

Alternative Parallel Training Model

  • spark.read pulls the MNIST CSV into a Spark dataframe. Note the inferSchema bit, since the data needs to be interpreted as integers, not strings (the default)
  • build_graph builds the actual graph and serializes it to reside on the parameter server. It takes our small_model function from earlier
  • The VectorAssembler cleans the input columns into feature vectors
  • Finally, it sets up a one-hot encoder pipeline stage

from sparkflow.graph_utils import build_graph
from pyspark.ml.feature import VectorAssembler, OneHotEncoder

df = spark.read.option("inferSchema", "true").csv(
    'swift://testdata/mnist_train.csv')
mg = build_graph(small_model)

# Assemble and one hot encode
va = VectorAssembler(inputCols=df.columns[1:785],
                     outputCol='features')
encoded = OneHotEncoder(inputCol='_c0', outputCol='labels', dropLast=False)

SLIDE 23

Sparkflow Method Deeper Dive

Alternative Parallel Training Model

  • SparkAsyncDL is the major piece of this code. It creates the parameter server, replicates the graph, and instructs the nodes to share updates
  • The pipeline step creates the regular Spark pipeline and applies our vectorizer, encoder, and TensorFlow model to the data
  • The last step just saves off the model
  • Note that this doesn't optimize the learning rate or other hyperparameters automatically

from sparkflow.tensorflow_async import SparkAsyncDL
from pyspark.ml.pipeline import Pipeline

spark_model = SparkAsyncDL(
    inputCol='features',
    tensorflowGraph=mg,
    tfInput='x:0',
    tfLabel='y:0',
    tfOutput='out:0',
    tfLearningRate=.001,
    iters=20,
    predictionCol='predicted',
    labelCol='labels',
    verbose=1
)

p = Pipeline(stages=[va, encoded, spark_model]).fit(df)
p.write().overwrite().save("location")

SLIDE 24

THANK YOU

plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews