
Accelerating Cross-Validation in Spark Using GPU
Minsik Cho, Rajesh Bordawekar
IBM TJW Research

Cross-Validation 101 [Wikipedia]
* Popular model validation technique
  - Helps avoid overfitting, for better generalization
  - Useful when there is not enough data
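
As a minimal sketch of the idea on this slide, the following Python splits sample indices into k folds, each serving once as the validation set while the rest are used for training. The function name `kfold_indices` and the contiguous, unshuffled split are assumptions made for illustration; the slides do not say how the folds are built.

```python
def kfold_indices(n_samples, n_folds):
    """Return a list of (train_indices, validation_indices) pairs."""
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    splits = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        # Everything outside the validation window is training data.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, val))
        start += size
    return splits


splits = kfold_indices(10, 4)
for train, val in splits:
    print("train:", train, "validate:", val)
```

Every sample lands in exactly one validation fold, so each model is scored on data it never saw during fitting.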

Cross-Validation + Elastic Net Regression: Tons of problems to crunch [Wikipedia]
* Cross-validation is popularly used with
  - Linear/logistic regression
  - Elastic net regularization
* A large number of problems to solve
  - Number of folds from cross-validation
  - Various lambdas to find the best prediction model
  - 4 folds x 1000 lambdas = 4000 regressions to fit
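
The problem count on this slide is the size of the Cartesian product of folds and lambdas. A short sketch of that enumeration follows; the log-spaced lambda grid from 1e-4 to 10 is a hypothetical choice, since the slides give only the number of lambdas.

```python
from itertools import product

n_folds = 4
n_lambdas = 1000

# Hypothetical log-spaced grid from 1e-4 to 1e1 (the slides state only the count).
lambdas = [10 ** (-4 + 5 * i / (n_lambdas - 1)) for i in range(n_lambdas)]

# One independent regression problem per (fold, lambda) pair.
problems = list(product(range(n_folds), lambdas))
print(len(problems))
```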

Apache Spark Overview
* In-memory engine for large-scale distributed data processing
  - Used in databases, streaming, machine/deep learning, and graph processing
  - Supports high-level APIs in Java, Scala, Python, and R
* RDD: resilient distributed dataset
  - Partitioned collection of records
  - Spread across the cluster
  - Dataset can be cached in memory

Spark GPU Acceleration [Rajesh, oreilly.com]
* Accelerated compute-intensive workloads with GPUs

Cross-Validation in Spark  For each problem – Create RDD – Distribute RDD – Call optimizer – Return Model [Berkeley] 6

Cross-Validation in Spark
[Figure: the dataset is split into a partitioned RDD (Dataset i, Dataset j, Dataset k) held by workers i, j, and k; a reduce step produces one model.]
* Is this best for GPU?

Proposed Cross-Validation in Spark Using GPU
* Broadcast the data
  - Cross-validation reuses the same mother dataset
* RDD of problem instances, not data
  - Tons of problems with different folds/lambdas
* Maximize the number of GPU streams to minimize downtime

Cross-Validation in Spark Using GPU
[Figure: alongside the problem set, the dataset is broadcast as an array, so workers i, j, and k each hold a full copy.]

Cross-Validation in Spark Using GPU
[Figure: the problems are distributed as an RDD; workers i, j, and k each hold the full dataset plus their own share of the problems (Problems i, Problems j, Problems k).]
* Problems in RDD

Code Snippet
* Build a problem set

Code Snippet (cont.)
* Input: dataset, problems
* Dataset broadcast
* Problem RDD
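
The code from these two slides did not survive in this transcript. The sketch below is not the authors' code; it is a pure-Python stand-in for the pattern they describe. Every worker sees the whole dataset, only the list of (fold, lambda) problems is partitioned, and the results are gathered at the end. In Spark the dataset would typically go through `SparkContext.broadcast` and the problems through `SparkContext.parallelize`. Here a thread pool plays the role of the workers, and a closed-form 1-D ridge fit replaces the GPU elastic net solver. The toy data, `solve`, and `solve_partition` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy 1-D dataset (hypothetical): y is roughly 2 * x.
DATASET = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
           (5.0, 10.1), (6.0, 11.9), (7.0, 14.2), (8.0, 15.8)]
N_FOLDS = 4
LAMBDAS = [0.0, 0.1, 1.0, 10.0]


def solve(problem, dataset):
    """Fit one (fold, lambda) problem and return its validation error."""
    fold, lam = problem
    train = [p for i, p in enumerate(dataset) if i % N_FOLDS != fold]
    val = [p for i, p in enumerate(dataset) if i % N_FOLDS == fold]
    # Closed-form 1-D ridge fit without intercept, a stand-in for the GPU solver.
    w = sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)
    mse = sum((y - w * x) ** 2 for x, y in val) / len(val)
    return fold, lam, mse


def solve_partition(partition):
    # Each worker reads the shared dataset; no worker-to-worker traffic is needed.
    return [solve(p, DATASET) for p in partition]


# Build the problem set, then partition the problems (not the data).
problems = [(f, lam) for f in range(N_FOLDS) for lam in LAMBDAS]
n_workers = 2
partitions = [problems[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = [r for part in pool.map(solve_partition, partitions) for r in part]

# Reduce: average the validation error per lambda and pick the best one.
mean_mse = {lam: sum(m for _, l2, m in results if l2 == lam) / N_FOLDS
            for lam in LAMBDAS}
best_lambda = min(mean_mse, key=mean_mse.get)
print(best_lambda, mean_mse[best_lambda])
```

Because each problem is independent, the only data movement is the one-time dataset copy and the final collection of results.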

Cross-Validation in Spark Using GPU
[Figure: worker i holds the dataset and its share of the problems (Problems i). Dataset folds 0 to 3 are spread over GPU0 and GPU1, with four cudaStreams; problems a and b are split per fold (a:0 to a:3, b:0 to b:3) and queued on the streams.]
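
One possible reading of this figure is that each fold of the dataset lives on one GPU and has its own stream, and each problem contributes one task per fold. The sketch below models only that bookkeeping in Python; it launches nothing on a GPU. The rule that folds alternate across GPUs, and the names `assign`, `N_GPUS`, and `N_FOLDS`, are assumptions for illustration, not details confirmed by the slides.

```python
from collections import defaultdict

N_GPUS = 2
N_FOLDS = 4


def assign(problem_ids):
    """Map each (problem, fold) task to a (gpu, stream) queue."""
    queues = defaultdict(list)
    for pid in problem_ids:
        for fold in range(N_FOLDS):
            gpu = fold % N_GPUS       # assumed: folds alternate across GPUs
            stream = fold // N_GPUS   # assumed: one stream per fold on each GPU
            queues[(gpu, stream)].append(f"{pid}:{fold}")
    return dict(queues)


queues = assign(["a", "b"])
for (gpu, stream), tasks in sorted(queues.items()):
    print(f"GPU{gpu} stream {stream}: {tasks}")
```

With two problems and four folds, each of the four streams receives two tasks, so both GPUs stay busy as long as problems keep arriving.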

Cross-Validation in Spark Using GPU
[Figure: the problems are distributed as an RDD; workers i, j, and k each hold the full dataset plus their own share of the problems, and a reduce step collects all models.]

Cross-Validation in Spark Using GPU (Advantages)
* Dataset broadcast
  - Efficient p2p protocol in Spark
  - One-time upfront overhead
  - Data reused within GPUs
* Problem RDD
  - No communication among workers
  - Multiple streams to maximize GPU utilization
* Multi-level parallelism
  - Functional parallelism from the problem RDD
  - Multiple GPUs
  - Multiple cudaStreams

Experimental Results
* System
  - 2-node cluster
  - Each node with 32 x86 cores
  - Each node with two K40m GPUs
* Software
  - Spark 2.0
  - OpenJDK 1.8
* Workload
  - Real Watson Health dataset
  - 5-fold cross-validation
  - Exploration of 1024 lambdas
* Algorithms
  - Logistic regression
  - Linear regression
* Measured end-to-end runtime, including the dataset broadcast

Result: GPU utilization
* Sustained over 97% multi-GPU utilization

Result: Logistic Regression
[Figure annotations: "No help: 2 problems"; "Help: more problems"; "Help: enough problems"]
* 114x speedup

Result: Linear Regression
* 94x speedup

Conclusion
* Cross-validation on Spark using GPU
  - New way of parallelization in Spark
    * Broadcast dataset
    * RDDmized problems
  - Reduce communication
* About 100x speedup for logistic/linear regression + elastic net
* Future work
  - Support out-of-core execution
