Accelerating Cross-Validation in Spark Using GPU. Minsik Cho, Rajesh Bordawekar, IBM TJW Research - PowerPoint PPT Presentation


  1. Accelerating Cross-Validation in Spark Using GPU. Minsik Cho, Rajesh Bordawekar, IBM TJW Research

  2. Cross-Validation 101 [Wikipedia]
     - Popular model validation technique
       - Avoids overfitting, for better generalization
       - Useful when there is not enough data
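As a minimal sketch of what k-fold cross-validation does with the data, the following splits row indices into folds so that every row is held out for validation exactly once. The function name `k_fold_indices` and the interleaved fold layout are assumptions made for illustration; the slides do not say how the folds were built.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k (train, validation) pairs."""
    # Interleaved folds: row i belongs to fold i % k (an illustrative choice).
    folds = [list(range(f, n_rows, k)) for f in range(k)]
    splits = []
    for f in range(k):
        validation = folds[f]
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, validation))
    return splits


splits = k_fold_indices(n_rows=8, k=4)
train0, val0 = splits[0]
print(val0)    # rows held out in fold 0
print(train0)  # rows used for fitting in fold 0
```

Each model is fitted on the training rows and scored on the held-out rows, which is why the same dataset is reused by every fold.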

  3. Cross-Validation + Elastic Net Regression: tons of problems to crunch [Wikipedia]
     - Cross-validation is popularly used with
       - Linear/logistic regression
       - Elastic net regularization
     - A large number of problems to solve
       - Number of folds from cross-validation
       - Various lambdas to find the best prediction model
       - 4 folds x 1000 lambdas = 4000 regressions to fit
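The problem count on slide 3 comes from pairing every fold with every lambda. A small sketch of that enumeration follows; `build_problems` and the log-spaced lambda grid are hypothetical and only serve to reproduce the arithmetic.

```python
from itertools import product


def build_problems(num_folds, lambdas):
    """Return one (fold, lambda) pair per regression that has to be fitted."""
    return [(fold, lam) for fold, lam in product(range(num_folds), lambdas)]


# Hypothetical grid: 1000 log-spaced values between 1e-6 and 1.
lambdas = [10 ** (-6 + 6 * i / 999) for i in range(1000)]
problems = build_problems(4, lambdas)
print(len(problems))  # 4 folds x 1000 lambdas = 4000
```

Every entry in `problems` is an independent regression, which is what makes the workload easy to spread across workers.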

  4. Apache Spark Overview
     - In-memory engine for large-scale distributed data processing
       - Used in databases, streaming, machine/deep learning, graph processing
       - Supports high-level APIs in Java, Scala, Python and R
     - RDD: resilient distributed dataset
       - Partitioned collection of records
       - Spread across the cluster
       - Dataset can be cached in memory

  5. Spark GPU Acceleration [Rajesh, oreilly.com]
     - Accelerating compute-intensive workloads with GPUs

  6. Cross-Validation in Spark [Berkeley]
     - For each problem
       - Create RDD
       - Distribute RDD
       - Call optimizer
       - Return model

  7. Cross-Validation in Spark
     [Diagram: the dataset becomes a partitioned RDD; workers i, j and k each hold one partition (Dataset i, Dataset j, Dataset k); a reduce step produces one model.]
     - Is this best for GPU?

  8. Proposed Cross-Validation in Spark Using GPU
     - Broadcast data
       - Cross-validation reuses the same mother dataset
     - RDD of problem instances, not data
       - Tons of problems with different foldings/lambdas
     - Maximize GPU streams to minimize down-time

  9. Cross-Validation in Spark Using GPU: Dataset Broadcast as Array
     [Diagram: a set of problems and a dataset; the dataset is broadcast as an array, so workers i, j and k each hold a full copy.]

  10. Cross-Validation in Spark Using GPU: Problems Distributed as RDD
      [Diagram: workers i, j and k each hold the full dataset and receive their own share of the problems (Problems i, Problems j, Problems k).]
      - Problems in RDD

  11. Code Snippet: build a problem set

  12. Code Snippet (cont.)
      - Input: dataset, problems
      - Dataset broadcast
      - Problem RDD
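The code shown on slides 11 and 12 does not survive in this transcript. What follows is not the authors' code but a minimal PySpark sketch of the pattern the slides describe: broadcast the dataset once, then parallelize the list of problems instead of the data. The helper names `fit_problem` and `run_on_spark` are hypothetical, and the one-dimensional ridge fit is only a CPU stand-in for the GPU elastic net solver, which the transcript does not show.

```python
def fit_problem(rows, problem, num_folds):
    """CPU stand-in for the GPU solver (assumption): 1-D ridge regression.

    rows is a list of (x, y) pairs; problem is a (fold, lam) pair.
    Returns (fold, lam, weight, validation MSE).
    """
    fold, lam = problem
    train = [r for i, r in enumerate(rows) if i % num_folds != fold]
    valid = [r for i, r in enumerate(rows) if i % num_folds == fold]
    # Closed form for min_w sum (y - w*x)^2 + lam * w^2.
    w = sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)
    mse = sum((y - w * x) ** 2 for x, y in valid) / len(valid)
    return fold, lam, w, mse


def run_on_spark(rows, problems, num_folds, num_slices=4):
    """Broadcast the dataset and distribute the problems as an RDD."""
    # Imported here so fit_problem can be tried without a Spark install.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    data_bc = sc.broadcast(rows)                        # one copy per worker
    problem_rdd = sc.parallelize(problems, num_slices)  # RDD of problems, not data
    return problem_rdd.map(
        lambda p: fit_problem(data_bc.value, p, num_folds)
    ).collect()


# Local check of the per-problem work on toy data where y = 2x.
rows = [(float(x), 2.0 * x) for x in range(1, 9)]
problems = [(fold, lam) for fold in range(4) for lam in (0.0, 1.0)]
local_results = [fit_problem(rows, p, 4) for p in problems]
```

With Spark available, `run_on_spark(rows, problems, 4)` would return the same tuples as `local_results`; workers only read the broadcast value, so they never need to exchange data with one another.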

  13. Cross-Validation in Spark Using GPU
      [Diagram: worker i, holding the dataset and Problems i, drives GPU0 and GPU1. Labels: dataset folds 0 to 3, four cudaStreams, and problems a:0 to a:3 and b:0 to b:3 queued across them.]
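The diagram on slide 13 suggests that each fold is served by its own stream and that problems for the same fold queue up behind one another. The transcript does not state the exact mapping, so the sketch below assumes one stream per fold and folds alternating across GPUs; `assign_to_streams` is a hypothetical helper that only models the scheduling, not the CUDA calls.

```python
from collections import defaultdict


def assign_to_streams(problems, num_gpus):
    """Map each (name, fold) problem to a (gpu, stream) work queue."""
    queues = defaultdict(list)
    for name, fold in problems:
        gpu = fold % num_gpus  # assumption: folds alternate across GPUs
        stream = fold          # assumption: one stream per fold
        queues[(gpu, stream)].append(f"{name}:{fold}")
    return dict(queues)


problems = [(name, fold) for name in ("a", "b") for fold in range(4)]
queues = assign_to_streams(problems, num_gpus=2)
for (gpu, stream), work in sorted(queues.items()):
    print(f"GPU{gpu} stream {stream}: {work}")
```

Keeping all problems for a fold on the device that already holds that fold's data is one way to get the data reuse the deck mentions; other mappings are possible.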

  14. Cross-Validation in Spark Using GPU: Problems Distributed as RDD
      [Diagram: workers i, j and k each hold the full dataset and their own problems (Problems i, Problems j, Problems k); a reduce step gathers all models.]
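Slide 14 ends with a reduce over all models but does not say how the winner is picked. A common choice in cross-validation, assumed here, is to average the validation error of each lambda across folds and keep the lambda with the lowest mean. The function `best_lambda` and the error values are made up for illustration.

```python
from collections import defaultdict


def best_lambda(results):
    """results: iterable of (fold, lam, validation_error) tuples."""
    errors = defaultdict(list)
    for _fold, lam, err in results:
        errors[lam].append(err)
    # Mean validation error per lambda, averaged over folds.
    mean_error = {lam: sum(e) / len(e) for lam, e in errors.items()}
    best = min(mean_error, key=mean_error.get)
    return best, mean_error[best]


# Illustrative numbers only, not measurements from the talk.
results = [
    (0, 0.1, 0.30), (1, 0.1, 0.34),
    (0, 1.0, 0.25), (1, 1.0, 0.27),
    (0, 10.0, 0.40), (1, 10.0, 0.44),
]
lam, err = best_lambda(results)
print(lam, err)
```

Because each result is a small tuple rather than a slice of the dataset, gathering them on the driver is cheap compared with the fitting itself.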

  15. Cross-Validation in Spark Using GPU (Advantages)
      - Dataset broadcast
        - Efficient p2p protocol in Spark
        - One-time upfront overhead
        - Data reused within GPUs
      - Problem RDD
        - No communication among workers
        - Multiple streams to maximize GPU utilization
      - Multi-level parallelism
        - Functional parallelism from problem RDD
        - Multiple GPUs
        - Multiple cudaStreams

  16. Experimental Results
      - System
        - 2-node cluster
        - Each node with thirty-two x86 cores
        - Each node with two K40m GPUs
      - Software
        - Spark 2.0
        - OpenJDK 1.8
      - Workload
        - Real Watson Health dataset
        - 5-fold cross-validation
        - 1024-lambda exploration
      - Algorithms
        - Logistic regression
        - Linear regression
      - Measured e2e runtime including dataset broadcast

  17. Result: GPU utilization
      - Sustained over 97% multi-GPU utilization

  18. Result: Logistic Regression
      [Chart annotations: "No help: 2 problems"; "Help: more problems"; "Help: enough problems"]
      - 114x speedup

  19. Result: Linear Regression
      - 94x speedup

  20. Conclusion
      - Cross-validation on Spark using GPU
        - New way of parallelization in Spark
          - Broadcast dataset
          - RDDmized problems
        - Reduce communication
      - About 100x speedup for logistic/linear regression + elastic net
      - Future work
        - Support out-of-core execution
