Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach
Vinícius Garcia Pinto, Lucas Mello Schnorr, Luka Stanisic, Arnaud Legrand, Samuel Thibault, Vincent Danjean
WSPPD Workshop, Porto Alegre, Brazil
Context

Current HPC architectures
- Moving from transistor scaling to heterogeneity
- Hybrid computing resources: CPUs, GPUs, MICs

Programming hybrid platforms
Traditional, explicit programming models (MPI, CUDA, OpenMP, pthreads, ...)
- Perfect control → maximal achievable performance
- Monolithic codes → hard to develop and maintain
- Hard to optimize → performance portability suffers
- Fixed scheduling → sensitive to variability

Recent task-based programming models (PaRSEC, OmpSs, Charm++, StarPU, ...)
- Single, abstract programming model based on a task DAG
- Runtime responsible for dynamic scheduling
- Portability of code and performance
- New challenge → choosing the scheduling heuristic

2 / 11
Visualization of Task Scheduling
Parallel simulation of superscalar scheduling. Haugen, Kurzak, YarKhan, Dongarra. ICPP 2014.

- The QR factorization of a matrix (size: 3960; tile size: 180) with the QUARK scheduler: 48 cores (one node).
- The Cholesky factorization of a matrix (size: 47040; tile size: 960) with the "MPI-aware" DMDAS scheduler of StarPU+MPI: 2 nodes with 4 cores and 4 GPUs each.

3 / 11
Related Work: Classical Analysis Tools
Space/time view (resources may be hierarchically organized) + bonus features:
- Paraver (100K SLOC) – https://tools.bsc.es/paraver
- Projections (35K SLOC) – http://charm.cs.uiuc.edu/software
- FrameSoC (300K SLOC + LTTng) – https://soctrace-inria.github.io/framesoc/
- Ravel (19K SLOC) – https://github.com/LLNL/ravel
- Paje (31K SLOC, in Objective-C) – https://github.com/schnorr/Paje
- ViTE (27K SLOC) – http://vite.gforge.inria.fr/
Tiled Cholesky Factorization from StarPU+MPI visualized with ViTE. 4 / 11
Related Work: Emerging Alternatives
- Ad hoc visualization of task dependencies (??? SLOC) – see VPA 2015
- Exploiting the DAG structure: DAGViz (??? SLOC) – see VPA 2015
- Entropy-aware aggregation: Ocelotl (3K + 300K SLOC) – https://github.com/soctrace-inria/ocelotl
5 / 11
Current Tools for Visual Performance Analysis
- Implemented in C/C++ to scale
- Interactive (depending on scale) and user-friendly (mouse interaction)
- Large and complex source code, difficult to extend
- Generally not designed for hybrid platforms and dynamic runtimes
- Flexible filtering calls for scripting capability
- Lack custom views exploiting application and platform structure
6 / 11
Our (Agile, Scriptable, Flexible) 2-Phase Workflow
Adopt modern data analysis tools for scripting → pj_dump + R + tidyverse + ggplot2 + plotly (≈ 3.5K SLOC) Workflow Execution: screen (1st phase) + org-mode (2nd phase)
[Figure: simplified two-phase workflow. Phase 1 (A: export, B: conversion, C: reading, D: cleaning/filtering/derivation, E: output): Chameleon/Cholesky execution traces (FXT) are converted by starpu_fxt_tool (C) into a Paje trace and a DOT task DAG; pjdump (C++) and dot2csv (shell) turn these into CSV tables (states, entities, links, variables, DAG), which are read into R, cleaned (outlier detection, left joins, tree filtering, Y-coordinate derivation), and written out as Feather files. Phase 2 (A: reading, B: data visualization, C: assembly, D: analysis): the Feather tables are read back for in-memory analysis and visualization, driven by a YAML user configuration (master), producing static plots (ggplot2) and interactive views (plotly): space/time, K-iteration, ABE, idleness, outliers, GPU transfers, GFlops, used memory, ready/submitted tasks, MPI transfers.]
Simplified 2-phase workflow (see our forthcoming paper).
Fail fast if an idea does not work Workflow can be shared to reproduce (and change) the analysis
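The cleaning/derivation phase computes metrics such as per-resource idleness from the state records of the trace. The deck's workflow does this in R with the tidyverse; below is a minimal Python sketch of the same derivation, where the record layout (resource, start, end) is an assumption for illustration, not the exact pj_dump schema:

```python
from collections import defaultdict

def derive_idleness(states):
    """states: list of (resource, start, end) task-execution records.
    Returns {resource: fraction of the makespan that resource was idle},
    assuming task executions on one resource do not overlap."""
    makespan = max(end for _, _, end in states) - min(start for _, start, _ in states)
    busy = defaultdict(float)
    for resource, start, end in states:
        busy[resource] += end - start
    return {resource: 1.0 - b / makespan for resource, b in busy.items()}

# Toy trace: CPU0 runs two tasks, GPU0 one short task.
idle = derive_idleness([("CPU0", 0.0, 4.0), ("CPU0", 5.0, 10.0),
                        ("GPU0", 0.0, 2.0)])
```

In the toy trace the makespan is 10; CPU0 is busy for 9 of it (10% idle) and GPU0 for 2 (80% idle), the kind of per-resource figure the idleness view plots.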
7 / 11
Experimental validation: application and platform
MORSE – Matrices Over Runtime Systems @ Exascale http://icl.cs.utk.edu/projectsdev/morse/ Tiled Cholesky factorization available in Chameleon
for (k = 0; k < N; k++) {
    DPOTRF(RW, A[k][k]);
    for (i = k+1; i < N; i++)
        DTRSM(RW, A[i][k], R, A[k][k]);
    for (i = k+1; i < N; i++) {
        DSYRK(RW, A[i][i], R, A[i][k]);
        for (j = k+1; j < i; j++)
            DGEMM(RW, A[i][j], R, A[i][k], R, A[j][k]);
    }
}
[Figure: task DAG of a 5×5 tiled Cholesky factorization; nodes are dpotrf, dtrsm, dsyrk, and dgemm tasks, each labeled with its iteration k (k = 0..4).]
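The loop nest above only submits tasks; the runtime builds the DAG from the RW/R access modes. A minimal Python sketch that enumerates the submitted tasks for N tiles (labels match the kernels on the slide; dependency edges are left to the runtime):

```python
from collections import Counter

def cholesky_tasks(N):
    """Enumerate (kernel, k) tasks in submission order for an N x N tile matrix,
    mirroring the tiled Cholesky loop nest on the slide."""
    tasks = []
    for k in range(N):
        tasks.append(("dpotrf", k))          # factor diagonal tile A[k][k]
        for i in range(k + 1, N):
            tasks.append(("dtrsm", k))       # solve panel tile A[i][k]
        for i in range(k + 1, N):
            tasks.append(("dsyrk", k))       # update diagonal tile A[i][i]
            for j in range(k + 1, i):
                tasks.append(("dgemm", k))   # update off-diagonal tile A[i][j]
    return tasks

counts = Counter(kernel for kernel, _ in cholesky_tasks(5))
# For N = 5: 5 dpotrf, 10 dtrsm, 10 dsyrk, 10 dgemm = 35 tasks in total.
```

For N = 5 this yields the 35 tasks of the 5×5 DAG figure; the dgemm count grows cubically with N, which is why the 60×60 runs later in the deck stress the scheduler.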
StarPU runtime on these platforms
idcin-2.grenoble.grid5000.fr (Digitalis, phased out in February 2017)
Two 14-core Intel Xeon E5-2697v3 CPUs with three NVIDIA Titan X GPUs
8 / 11
Scheduler Comparison (input: 60×60 tiles of size 960×960)
DMDAS WS Unconstrained Constrained DMDA
Small matrix + interaction (12×12)
→ try yourself at http://perf-ev-runtime.gforge.inria.fr/vpa2016/
9 / 11
Conclusion and Ongoing Work
Achievements
- Flexible analysis workflow in ≈ 3.5K SLOC
- Handles dynamic task-based applications
- Multi-node, multi-core, multi-GPU, ...
- Suitable for scheduling specialists

What's next? Immediate work:
- Investigate data dependencies and (scheduler) anomalies at scale
10 / 11
Thank you for your attention!
Questions? schnorr@inf.ufrgs.br / vgpinto@inf.ufrgs.br

Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach. 3rd Workshop on Visual Performance Analysis (VPA). https://hal.inria.fr/hal-01353962
11 / 11