Integration of Spark parallelization in TMVA


  1. Integration of Spark parallelization in TMVA. Georgios Douzas. Supervisors: Enric Tejedor, Sergei Gleyzer

  2. Spark engine
     - A general-purpose framework for distributed data processing.
     - Implemented in Scala.
     - Provides a Python API called PySpark.
     - Two main concepts:
       - RDD (Resilient Distributed Dataset)
       - DAG (Directed Acyclic Graph)

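  3. Spark engine
     - An RDD is an immutable, partitioned data structure that is processed in parallel.
     - A DAG of RDD operations is Spark's execution model for distributed computation.
     - RDD operations: transformations (lazy, return a new RDD) and actions (trigger the computation and return a result).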
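The difference between transformations and actions can be illustrated with a minimal sketch. The function name `sum_of_even_squares` and the optional `sc` argument are assumptions made for this example; when `sc` is a PySpark `SparkContext` the RDD path is used, otherwise Python's built-in lazy iterators stand in for the RDD so the example runs without a cluster.

```python
from functools import reduce
from operator import add


def sum_of_even_squares(values, sc=None):
    """Square each value, keep the even squares, and sum them.

    If `sc` is a pyspark.SparkContext the work runs on an RDD;
    otherwise built-in lazy iterators stand in for the RDD.
    """
    if sc is not None:
        rdd = sc.parallelize(values)                  # distribute the data
        squares = rdd.map(lambda x: x * x)            # transformation (lazy)
        evens = squares.filter(lambda x: x % 2 == 0)  # transformation (lazy)
        return evens.reduce(add)                      # action: runs the job
    squares = map(lambda x: x * x, values)            # lazy, nothing computed yet
    evens = filter(lambda x: x % 2 == 0, squares)     # still lazy
    return reduce(add, evens)                         # consumes the pipeline


print(sum_of_even_squares(range(1, 7)))
```

In both paths nothing is computed until the final `reduce` call, which mirrors how Spark only schedules work once an action is requested.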

  4. Spark architecture

  5. Parallelization of the TMVA code
     - Identify opportunities for parallelism.
     - Examine whether the parallelism actually improves performance.
     - Target loops whose iterations are independent calculations.
     - Use the same interface as the C++ TMVA code.

  6. Parallelization in TMVA
     - Cross validation.
     - Optimization of tuning parameters.
     - Local search for the optimization of tuning parameters.

  7. Cross validation

  8. Cross validation [Figure panels: Validation; K-fold cross validation]
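As a rough illustration of K-fold cross validation, the sketch below splits event indices into k folds and pairs each test fold with the remaining training events. The helper name `k_fold_indices` and the interleaved assignment of events to folds are assumptions for this example; the slides do not say how TMVA assigns events to folds.

```python
def k_fold_indices(n_events, k):
    """Return k (train, test) pairs of event indices.

    Events are assigned to folds in an interleaved fashion
    (event i goes to fold i % k); this is an illustrative choice.
    """
    folds = [list(range(i, n_events, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # Training set: every event that is not in the current test fold.
        train = sorted(j for idx, fold in enumerate(folds) if idx != i for j in fold)
        splits.append((train, test))
    return splits


for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Each event appears in exactly one test fold, so every event is used for validation once and for training k - 1 times.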

  9. Parallelized CrossValidate
     - RDD = sc.parallelize([fold 0, fold 1, …, fold k-1]).
     - A map transformation is applied to the RDD.
     - A new RDD with an AUC value for each fold index is returned.
     - The average AUC is calculated.
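The four steps above can be sketched as follows. This is not the TMVA implementation: `cross_validate`, `fold_auc` and `toy_fold_auc` are hypothetical names, and the real fold function would train and evaluate a classifier on the given fold. The `sc` argument is assumed to be a PySpark `SparkContext`; without it the folds are evaluated sequentially so the sketch runs anywhere.

```python
def cross_validate(fold_auc, k, sc=None):
    """Evaluate `fold_auc` on folds 0..k-1 and average the AUC values.

    `fold_auc(fold)` is a user-supplied function returning the AUC
    obtained when `fold` is used as the test fold.
    """
    folds = list(range(k))
    if sc is not None:
        # One RDD element per fold; the map runs on the workers.
        pairs = sc.parallelize(folds).map(lambda f: (f, fold_auc(f))).collect()
    else:
        # Sequential fallback with the same result shape.
        pairs = [(f, fold_auc(f)) for f in folds]
    mean_auc = sum(auc for _, auc in pairs) / len(pairs)
    return pairs, mean_auc


def toy_fold_auc(fold):
    # Placeholder for "train on the other folds, test on this one".
    return 0.80 + 0.01 * fold


pairs, mean_auc = cross_validate(toy_fold_auc, k=4)
print(pairs, mean_auc)
```

Because the folds are independent of each other, the map needs no communication between tasks; only the final averaging happens on the driver.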

  10. [Diagram: data flow of the parallelized CrossValidate]
      - Driver program (SparkContext): read input data → DataLoader object → serialize → serialized DataLoader; Factory object; CrossValidate parameters; broadcast to the workers.
      - RDD [fold 0, fold 1, …] → map (fold AUC function) → RDD [(fold 0, AUC 0), …] → results.
      - Each of the three workers: serialized DataLoader object, task-local Factory object, input ROOT file.
      - Distributed file system: input ROOT file.
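The serialize-and-broadcast pattern in the diagram can be sketched as below. Everything specific here is an assumption: a plain dictionary stands in for the DataLoader, `pickle` stands in for whatever serialization the project uses, and the file name `input.root` and the variable names are made up. With a PySpark `SparkContext` passed as `sc`, the payload is shipped once per worker through `sc.broadcast`; without it the tasks run locally.

```python
import pickle


def make_fold_task(get_payload):
    """Build the per-fold task; `get_payload` returns the serialized loader."""
    def fold_task(fold):
        loader = pickle.loads(get_payload())  # task-local copy of the loader
        # A real task would create a local Factory here, then train and evaluate.
        return fold, loader["input_path"], len(loader["variables"])
    return fold_task


def run_folds(folds, loader, sc=None):
    payload = pickle.dumps(loader)  # serialize once on the driver
    if sc is not None:
        shared = sc.broadcast(payload)  # sent to each worker once
        task = make_fold_task(lambda: shared.value)
        return sc.parallelize(folds).map(task).collect()
    task = make_fold_task(lambda: payload)
    return [task(f) for f in folds]


# Hypothetical loader description: file name and variables are placeholders.
toy_loader = {"input_path": "input.root", "variables": ["var1", "var2"]}
print(run_folds([0, 1, 2], toy_loader))
```

Broadcasting the serialized object avoids resending it with every task, while each task still deserializes its own copy and so never shares mutable state with other tasks.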

  11. Optimization of tuning parameters

  12. Parallelized OptimizeTuningParameters (full search of parameter space)
      - A default parameter space is defined.
      - RDD = sc.parallelize([(fold 0, par 0), …, (fold k-1, par 0), …, (fold 0, par p-1), …, (fold k-1, par p-1)]).
      - A map transformation is applied to the RDD.
      - A new RDD with an AUC value for each (fold, parameter) index pair is returned.
      - The maximum AUC in each fold is calculated.
      - The cross validation AUC is calculated for each “fold winner” parameter.
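A minimal sketch of the full search, under stated assumptions: `full_search`, `fold_par_auc` and `cv_auc` are hypothetical names, parameters are identified by an integer index, and the toy AUC table is invented for the example. `sc` is assumed to be a PySpark `SparkContext`; without it the pairs are scored sequentially.

```python
from itertools import product


def full_search(fold_par_auc, cv_auc, k, n_params, sc=None):
    """Score every (fold, parameter) pair, then cross validate the fold winners."""
    # Same ordering as on the slide: all folds for par 0, then par 1, ...
    pairs = [(fold, par) for par, fold in product(range(n_params), range(k))]

    def score(fp):
        fold, par = fp
        return fold, par, fold_par_auc(fold, par)

    if sc is not None:
        scored = sc.parallelize(pairs).map(score).collect()
    else:
        scored = [score(fp) for fp in pairs]

    # Best parameter (maximum AUC) within each fold.
    winners = {}
    for fold, par, auc in scored:
        if fold not in winners or auc > winners[fold][1]:
            winners[fold] = (par, auc)

    # Cross validation AUC for each distinct "fold winner" parameter.
    candidates = sorted({par for par, _ in winners.values()})
    return winners, {par: cv_auc(par) for par in candidates}


# Invented AUC values for 2 folds and 3 parameter settings.
TOY_AUC = {(0, 0): 0.70, (0, 1): 0.90, (0, 2): 0.80,
           (1, 0): 0.75, (1, 1): 0.72, (1, 2): 0.88}


def toy_fold_par_auc(fold, par):
    return TOY_AUC[(fold, par)]


def toy_cv_auc(par):
    return (TOY_AUC[(0, par)] + TOY_AUC[(1, par)]) / 2


print(full_search(toy_fold_par_auc, toy_cv_auc, k=2, n_params=3))
```

The RDD holds k × p elements, so the amount of available parallelism grows with the size of the parameter space as well as with the number of folds.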

  13. [Diagram: data flow of the parallelized OptimizeTuningParameters]
      - Driver program (SparkContext): read input data → DataLoader object → serialize → serialized DataLoader; Factory object; OptimizeTuningParameters parameters; broadcast to the workers.
      - RDD [(fold 0, par 0), …] → map ((fold, parameter) AUC function) → RDD [(fold 0, par 0, AUC (0, 0)), …] → results.
      - Each of the three workers: serialized DataLoader object, task-local Factory object, input ROOT file.
      - Distributed file system: input ROOT file.

  14. Parallelized OptimizeTuningParameters (local search of parameter space)
      - A hill climbing (H.C.) algorithm is applied to each fold.
      - The calculations within each hill climbing iteration are parallelized.
      - The RDD contains only a subset of all (fold, parameter) pairs.
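One way to picture the local search is the sketch below. It makes several assumptions that the slides do not state: parameters form a one-dimensional index, the neighbours of a parameter are the adjacent indices, and the search runs for one fold at a time. `hill_climb` and `toy_auc` are hypothetical names, and `sc` is assumed to be a PySpark `SparkContext` (omit it to run locally).

```python
def hill_climb(fold, fold_par_auc, n_params, start=0, sc=None):
    """Greedy local search over parameter indices for one fold."""
    current = start
    current_auc = fold_par_auc(fold, current)
    evaluated = {current}
    while True:
        # Assumed neighbourhood: adjacent indices inside the parameter range.
        neighbours = [p for p in (current - 1, current + 1) if 0 <= p < n_params]
        pairs = [(fold, p) for p in neighbours]

        def score(fp):
            return fp[1], fold_par_auc(fp[0], fp[1])

        # The evaluations within one iteration are independent of each other.
        if sc is not None:
            scored = sc.parallelize(pairs).map(score).collect()
        else:
            scored = [score(fp) for fp in pairs]
        evaluated.update(neighbours)
        if not scored:
            return current, current_auc, evaluated
        best_par, best_auc = max(scored, key=lambda t: t[1])
        if best_auc <= current_auc:
            # No neighbour improves the AUC: local maximum reached.
            return current, current_auc, evaluated
        current, current_auc = best_par, best_auc


def toy_auc(fold, par):
    # Invented AUC values with a local maximum at index 2.
    return [0.60, 0.70, 0.85, 0.80, 0.75, 0.90][par]


print(hill_climb(0, toy_auc, n_params=6))
```

In this toy run only four of the six parameter settings are evaluated, which is the saving the local search aims for; the price is that it stops at the local maximum (index 2) and never reaches the higher value at index 5.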

  15. Spark cluster [Diagram labels: Node 1, Node 2; driver program (SparkContext); master; Worker 1 (4 cores); Worker 2 (4 cores)]

  16. Experimental results

  17. Experimental results

  18. Experimental results [Figure panels: Full search; Hill climbing]
