Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach

Vinícius Garcia Pinto, Lucas Mello Schnorr, Luka Stanisic, Arnaud Legrand, Samuel Thibault, Vincent Danjean
WSPPD Workshop, Porto Alegre, Brazil


SLIDE 1

Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach

Vinícius Garcia Pinto, Lucas Mello Schnorr, Luka Stanisic, Arnaud Legrand, Samuel Thibault, Vincent Danjean
WSPPD Workshop, Porto Alegre, Brazil – September 4th, 2017

SLIDES 2-6

Context

Current HPC architectures
- Moving from transistors to heterogeneity scaling
- Hybrid computing resources: CPUs, GPUs, MICs

Programming hybrid platforms
- Traditional, explicit programming models (MPI, CUDA, OpenMP, pthreads, ...)
  - Perfect control: maximal achievable performance
  - Monolithic codes: hard to develop and maintain
  - Hard to optimize: performance portability
  - Fixed scheduling: sensitive to variability
- Recent task-based programming models (PaRSEC, OmpSs, Charm++, StarPU, ...)
  - Single, abstract programming model based on DAG
  - Runtime responsible for dynamic scheduling
  - Portability of code and performance
  - New challenge: scheduling heuristic
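To make the "runtime responsible for dynamic scheduling" point concrete, here is a minimal sketch of a greedy earliest-finish-time heuristic on a hybrid machine. It is not the algorithm of any runtime named on this slide; the task names, the `cost` table, and the worker names are all hypothetical values chosen for illustration.

```python
def greedy_eft(tasks, deps, kind, cost, workers):
    """Place each task on the worker that would finish it earliest.

    tasks   : task names in submission order (assumed to be a topological order)
    deps    : task -> set of predecessor tasks
    kind    : task -> kernel kind
    cost    : (kernel kind, worker kind) -> duration (hypothetical numbers)
    workers : worker name -> worker kind
    """
    free_at = {w: 0.0 for w in workers}   # time at which each worker becomes free
    finish = {}                           # task -> finish time
    placement = {}                        # task -> (worker, start, end)
    for t in tasks:
        # A task may start only after all of its predecessors have finished.
        ready = max((finish[p] for p in deps.get(t, ())), default=0.0)
        best = min(
            workers,
            key=lambda w: max(free_at[w], ready) + cost[(kind[t], workers[w])],
        )
        start = max(free_at[best], ready)
        finish[t] = start + cost[(kind[t], workers[best])]
        free_at[best] = finish[t]
        placement[t] = (best, start, finish[t])
    return placement, max(finish.values())


# Hypothetical diamond-shaped DAG: a -> {b, c} -> d
tasks = ["a", "b", "c", "d"]
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
kind = {"a": "panel", "b": "update", "c": "update", "d": "panel"}
cost = {
    ("panel", "cpu"): 2.0, ("panel", "gpu"): 4.0,
    ("update", "cpu"): 6.0, ("update", "gpu"): 1.0,
}
workers = {"cpu0": "cpu", "gpu0": "gpu"}

placement, makespan = greedy_eft(tasks, deps, kind, cost, workers)
for name, (worker, start, end) in placement.items():
    print(f"{name}: {worker} [{start}, {end}]")
print("makespan:", makespan)
```

With these made-up costs the two "update" tasks land on the GPU and the "panel" tasks on the CPU. Changing a single cost entry can change the placement, which is the sensitivity that makes the scheduling heuristic worth analyzing.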

SLIDE 7

Visualization of Task Scheduling

- Parallel simulation of superscalar scheduling, Haugen, Kurzak, YarKhan, Dongarra. ICPP 2014.
  - The QR factorization of a matrix (size: 3960; tile size: 180)
  - The QUARK scheduler: 48 cores (one node).
- The Cholesky factorization of a matrix (size: 47040; tile size: 960)
  - The "MPI-Aware" DMDAS scheduler of StarPU+MPI: 2 nodes with 4 cores and 4 GPUs each.

SLIDE 8

Related Work: Classical Analysis Tools

Space/time view (resources may be hierarchically organized) + bonus
- Paraver (100K) – https://tools.bsc.es/paraver
- Projections (35K) – http://charm.cs.uiuc.edu/software
- FrameSoC (300K+LTTNG) – https://soctrace-inria.github.io/framesoc/
- Ravel (19K) – https://github.com/LLNL/ravel
- Paje (31K in Objective-C) – https://github.com/schnorr/Paje
- ViTE (27K) – http://vite.gforge.inria.fr/

[Figure: Tiled Cholesky Factorization from StarPU+MPI visualized with ViTE.]

SLIDE 9

Related Work: Emerging Alternatives

- Ad hoc visualization of task dependencies (??? SLOC), see VPA 2015
- Exploiting DAG structure: DAGViz (??? SLOC), see VPA 2015
- Entropy-aware aggregation: Ocelotl (3K+300K), https://github.com/soctrace-inria/ocelotl

SLIDE 10

Current Tools for Visual Performance Analysis

- Implemented in C/C++ to scale
- Interactive (depending on scale) and user friendly (mouse interaction)
- Large and complex source code, difficult to extend
- Generally not designed for hybrid platforms and dynamic runtimes
- Flexible filtering calls for scripting capability
- Lack custom views exploiting application and platform structure

SLIDE 11

Our (Agile, Scriptable, Flexible) 2-Phase Workflow

- Adopt modern data analysis tools for scripting
  → pj_dump + R + tidyverse + ggplot2 + plotly (≈ 3.5K SLOC)
- Workflow execution: screen (1st phase) + org-mode (2nd phase)

[Figure: Simplified 2-phase workflow (see our forthcoming paper).
 Phase 1, steps A to E (Export; Conversion; Reading; Cleaning, filtering, derivation; Output): Chameleon/Cholesky execution traces (FXT) are converted by starpu_fxt_tool (C) into a DOT DAG and a Paje trace, then by dot2csv (shell) and pj_dump (C++) into CSV files (DAG, states, entities, links, variable); R steps (read, zero, outliers, tree_filter, y_coord., left_join) produce Feather files (DAG, states, entities, links, variable).
 Phase 2, steps A to D (Reading; Data visualization; Assembly; Analysis): the Feather files and a YAML user configuration are read for in-memory analysis and visualization, producing static plots (ggplot2) and interactive plots (plotly). Panels: space/time, k-iteration, ABE, idleness, outliers, GPU transfers, MPI transfers, GFlops, used memory, ready tasks, submitted tasks.]

- Fail fast if an idea does not work
- Workflow can be shared to reproduce (and change) the analysis
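As an illustration of what one scripted analysis step in the second phase can look like, the sketch below reads state records and derives an idleness ratio per resource. The deck's workflow does this in R with tidyverse; Python is used here only to keep the example self-contained. The comma-separated column layout is assumed to follow pj_dump's state lines, and the sample rows, resource names, and the "Idle" state label are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the assumed layout:
# nature, resource, type, start, end, duration, depth, value
SAMPLE = """\
State, CPU0, Worker State, 0.0, 4.0, 4.0, 0.0, dpotrf
State, CPU0, Worker State, 4.0, 6.0, 2.0, 0.0, Idle
State, CPU0, Worker State, 6.0, 10.0, 4.0, 0.0, dtrsm
State, CUDA0, Worker State, 0.0, 5.0, 5.0, 0.0, Idle
State, CUDA0, Worker State, 5.0, 10.0, 5.0, 0.0, dgemm
"""

COLUMNS = ["nature", "resource", "type", "start", "end", "duration", "depth", "value"]


def read_states(text):
    """Parse state records, keeping only 'State' rows and converting times to float."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS, skipinitialspace=True)
    rows = []
    for row in reader:
        if row["nature"] != "State":
            continue
        for col in ("start", "end", "duration"):
            row[col] = float(row[col])
        rows.append(row)
    return rows


def idleness(rows, idle_value="Idle"):
    """Return, per resource, the fraction of recorded time spent in the idle state."""
    idle = defaultdict(float)
    total = defaultdict(float)
    for row in rows:
        total[row["resource"]] += row["duration"]
        if row["value"] == idle_value:
            idle[row["resource"]] += row["duration"]
    return {res: idle[res] / total[res] for res in total}


states = read_states(SAMPLE)
ratios = idleness(states)
for resource, ratio in sorted(ratios.items()):
    print(f"{resource}: {ratio:.0%} idle")
```

Because each step is a small function over a table, a filter or a derived column can be swapped in a few lines, which is the "fail fast" property claimed above.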

SLIDE 12

Experimental validation: application and platform

- MORSE – Matrices Over Runtime Systems @ Exascale
  http://icl.cs.utk.edu/projectsdev/morse/
- Tiled Cholesky factorization available in Chameleon

    for (k = 0; k < N; k++) {
        DPOTRF(RW, A[k][k]);
        for (i = k+1; i < N; i++)
            DTRSM(RW, A[i][k], R, A[k][k]);
        for (i = k+1; i < N; i++) {
            DSYRK(RW, A[i][i], R, A[i][k]);
            for (j = k+1; j < i; j++)
                DGEMM(RW, A[i][j], R, A[i][k], R, A[j][k]);
        }
    }

[Figure: resulting task DAG for N = 5 tiles: 35 nodes labelled by kernel and iteration k (dpotrf 0-4, dtrsm 0-3, dsyrk 0-3, dgemm 0-2).]
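The loop nest above only submits tasks; the DAG comes from the R/RW access modes on each tile. The following sketch shows one way such dependencies can be inferred by tracking the last writer and the pending readers of every tile. It is an illustration under that assumption, not StarPU's or Chameleon's actual implementation, and the function names are hypothetical.

```python
from collections import Counter


def cholesky_tasks(n):
    """Yield (kernel, k, accesses) in the submission order of the loop nest."""
    for k in range(n):
        yield ("dpotrf", k, [("RW", (k, k))])
        for i in range(k + 1, n):
            yield ("dtrsm", k, [("RW", (i, k)), ("R", (k, k))])
        for i in range(k + 1, n):
            yield ("dsyrk", k, [("RW", (i, i)), ("R", (i, k))])
            for j in range(k + 1, i):
                yield ("dgemm", k, [("RW", (i, j)), ("R", (i, k)), ("R", (j, k))])


def build_dag(n):
    """Infer predecessors of each task from the access modes on tiles."""
    tasks = list(cholesky_tasks(n))
    last_writer = {}   # tile -> id of the last task that wrote it
    readers = {}       # tile -> ids of tasks that read it since the last write
    preds = [set() for _ in tasks]
    for tid, (_, _, accesses) in enumerate(tasks):
        for mode, tile in accesses:
            if tile in last_writer:
                preds[tid].add(last_writer[tile])         # read/write after write
            if mode == "RW":
                preds[tid].update(readers.get(tile, ()))  # write after read
        for mode, tile in accesses:
            if mode == "RW":
                last_writer[tile] = tid
                readers[tile] = []
            else:
                readers.setdefault(tile, []).append(tid)
    return tasks, preds


def critical_path_length(preds):
    """Longest dependency chain, counted in tasks (predecessors have smaller ids)."""
    depth = []
    for p in preds:
        depth.append(1 + max((depth[q] for q in p), default=0))
    return max(depth)


tasks, preds = build_dag(5)
kernels = Counter(kernel for kernel, _, _ in tasks)
critical = critical_path_length(preds)
print(len(tasks), dict(kernels), critical)
```

For N = 5 this yields the 35 tasks of the DAG on this slide. The longest chain alternates dpotrf, dtrsm, and dsyrk across iterations, so the available parallelism comes mostly from the dgemm updates that sit off that chain.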

StarPU runtime on these platforms
- idcin-2.grenoble.grid5000.fr (Digitalis, phased out in February 2017)
  - Two 14-core Intel Xeon E5-2697v3 with three NVIDIA Titan X

SLIDE 13

Scheduler Comparison (input: 60×60 tiles of 960×960)

[Figure: panel labels DMDAS, WS, Unconstrained, Constrained, DMDA.]

Small matrix + interaction (12×12)

→ try it yourself at http://perf-ev-runtime.gforge.inria.fr/vpa2016/

SLIDE 14

Conclusion and Ongoing Work

Achievements
- Flexible analysis workflow in ≈ 3.5K SLOC
  - Dynamic task-based applications
  - Multi-node, multi-core, multi-GPU
  - Suitable for scheduling specialists

What's next?
- Immediate work
  - Investigate data dependencies
  - (Scheduler) anomalies on scale

SLIDE 15

Thank you for your attention!

Questions?
schnorr@inf.ufrgs.br
vgpinto@inf.ufrgs.br

Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach. 3rd Workshop on Visual Performance Analysis (VPA). https://hal.inria.fr/hal-01353962