Accelerating Cross-Validation in Spark Using GPU
Minsik Cho, Rajesh Bordawekar
IBM TJW Research
Cross-Validation 101 [Wikipedia]
Popular model validation technique
– to avoid overfitting, for better generalization
– useful when there is not enough data
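As a minimal sketch of the k-fold idea above (not code from the talk): the dataset is cut into k folds, and each fold serves once as the validation set while the remaining folds form the training set. The function name `k_fold_indices` and the contiguous, unshuffled split are assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Return k (train, validation) index pairs over range(n_samples)."""
    # Fold sizes differ by at most one when k does not divide n_samples.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size

    splits = []
    for i in range(k):
        validation = folds[i]
        # Training set: every index that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


# Hypothetical example: 10 records, 4 folds.
splits = k_fold_indices(10, 4)
```

Every record is used for validation exactly once, so all of the data contributes to the generalization estimate even when the dataset is small.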
– Linear/Logistic Regression
– Elastic Net Regularization
– Number of folds from cross-validation
– Various lambdas to find the best prediction model
– 4 folds x 1000 lambdas = 4000 regressions to fit
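The fold-by-lambda count can be sketched as a plain enumeration of independent problems. The penalty shown is the common elastic net parameterization, lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2); the slides do not state which form the authors use, so treat it, the `alpha` mixing parameter, and the log-spaced lambda grid as assumptions.

```python
from itertools import product


def elastic_net_penalty(weights, lam, alpha):
    """Common elastic net penalty (assumed form, not taken from the slides)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


n_folds = 4
# Hypothetical grid: 1000 log-spaced lambdas between 1e-6 and 1.
lambdas = [10 ** (-6 + 6 * i / 999) for i in range(1000)]

# One regression problem per (fold, lambda) pair.
problems = list(product(range(n_folds), lambdas))
```

Each (fold, lambda) pair is an independent fit over the same underlying dataset, which is why the number of regressions grows multiplicatively.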
[Wikipedia]
– Used in databases, streaming, machine/deep learning, graph processing
– Supports high-level APIs in Java, Scala, Python and R
– Partitioned collection of records
– Spread across the cluster
– Caching dataset in memory
[Rajesh, oreilly.com]
– Create RDD
– Distribute RDD
– Call optimizer
– Return Model
[Berkeley]
Is this best for GPU?
– Cross-Validation reuses the same mother dataset
– Tons of problems with different folding/lambdas
[Diagram with four cudaStream labels]
– Efficient p2p protocol in Spark
– One-time upfront overhead
– Data reused within GPUs
– No communication among workers
– Multiple streams to maximize GPU utilization
– Functional parallelism from Problem RDD
– Multiple GPUs
– Multiple cudaStreams
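One way to picture the functional parallelism above is a round-robin assignment of independent (fold, lambda) problems to (GPU, stream) slots. This is only an illustrative sketch in plain Python: the slides do not describe the actual scheduling policy, and `assign_to_streams`, the round-robin rule, and four streams per GPU are all assumptions. The problem count uses the 5-fold, 1024-lambda setup and the two GPUs per node reported in the evaluation.

```python
from collections import defaultdict


def assign_to_streams(problems, n_gpus, streams_per_gpu):
    """Round-robin problems over (gpu, stream) slots (hypothetical policy)."""
    slots = [(g, s) for g in range(n_gpus) for s in range(streams_per_gpu)]
    schedule = defaultdict(list)
    for i, problem in enumerate(problems):
        schedule[slots[i % len(slots)]].append(problem)
    return dict(schedule)


# 5 folds x 1024 lambdas = 5120 independent regression problems.
problems = [(fold, lam_idx) for fold in range(5) for lam_idx in range(1024)]

# Two GPUs as on one evaluation node; 4 streams per GPU is an assumption.
schedule = assign_to_streams(problems, n_gpus=2, streams_per_gpu=4)
```

Because the problems share the same dataset and never exchange results, each slot can work through its queue without communicating with the others.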
– 2-node cluster
– Each node with 32 x86 cores
– Each node with two K40m GPUs
– Spark 2.0
– OpenJDK 1.8
– Real Watson Health dataset
– 5-fold cross-validation
– 1024-lambda exploration
– Logistic regression
– Linear regression
– New way of parallelization in Spark
– Reduce communication
– Support out of core execution