
C O M P U T E | S T O R E | A N A L Y Z E

Cray Tools, an overview

8th International Parallel Tools Workshop
Stuttgart, Germany, 1st October 2014
Stefan Andersson, Cray onsite support at HLRS


Introduction

  • Cray develops several tools for its XE/XK and XC computers
  • A lot of effort goes into their development
  • Several of the tools are ‘stand-alone’ solutions, developed for a specific problem
  • STAT, ATP
  • IOBUF (includes serial I/O monitoring)
  • MPIIO profiling
  • Other tools interact with each other to be more efficient or to create a new solution for a problem
  • CCE provides ‘hooks’ for profiling at loop level
  • Reveal uses CCE listing information and CrayPat profiling data

Which tools does Cray develop?

  • It doesn’t make sense to develop tools where a good tool already exists on the market; DDT and Totalview are good examples
  • Cray’s tools either
  • are something new, like Reveal
  • concentrate on a solution to a specific issue, like STAT
  • are part of the development process, like MPIIO Stats
  • come out of benchmarking, like IOBUF
  • Cray also collaborates with different sites in developing tools

CCE : Cray Compiler Environment

  • The compiler is generally not considered a ‘tool’, but it is in fact the most important piece of user software
  • Compiles and links the user application
  • Gives feedback about the application
  • Code errors
  • How optimization was done or not done (.lst file)
  • Provides ‘hooks’ into different levels of the application, to which other tools can attach
  • Functions
  • Loops
  • This makes CCE the ‘centerpiece’ of Cray’s tools strategy
  • CCE can adapt rather quickly to user/tool needs
  • All tools work with other compilers, but there might be some limitations. The goal is not to force users onto CCE, but to provide extensions where it makes sense


Overview : Tools infrastructure (selection)

Debugging: get your code up and running correctly. Profiling: locate performance bottlenecks.

Lightweight (at most relinking; gives a first picture of performance or of problems during execution):

  • ATP
  • STAT
  • CrayPAT-lite
  • Profiler library
  • IOBUF
  • MPIIO Stats

In-depth (recompile/relink; provides detailed information at user-routine level):

  • lgdb with ccdb
  • Fast track
  • DDT
  • Totalview
  • Intel Inspector
  • CrayPAT
  • Apprentice2
  • Reveal
  • Intel Vtune


Ease of Use : CrayPAT evolving over time

  • CrayPat is not easy to get started with:
  • The man pages for intro_craypat, pat_build and pat_report have ~4000 lines
  • ~70 environment variables
  • A lot of arguments available
  • Output is configurable to the very last character
  • Improvements in ‘ease of use’ over time:

1. Interactive help tool: pat_help
2. Introduction of the ‘Automatic Profiling Analysis’ approach, which guides the user to a traced run in two steps
  • First a sampling run is done
  • Based on this run, a traced application is generated and run
  • The user can interact with the process and make changes
3. Introduction of CrayPAT-lite
  • Profiling is transparent to the user: no changes in the build and execution process
  • Users can still use ‘plain’ CrayPAT

Debugging Tools on the Cray XC30


The Porting and Optimization Cycle

Port or update your application to the XC30. Debug your application (get correct results):

  • Stack Trace Analysis Tool (STAT)
  • Abnormal Termination Processing (ATP)
  • Fast Track Debugger (FTD)
  • Allinea DDT
  • lgdb (ccdb)

Profile your application for performance:

  • Cray performance analysis toolkit (CrayPat)
  • CrayPat-lite for easier profiling
  • Cray Profiler library


Stack Trace Analysis Tool (STAT)

For when nothing appears to be happening…


Stack Trace Analysis Tool (STAT)

  • Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison.
  • Gathers and merges stack traces from a running application’s parallel processes.
  • Creates a call-graph prefix tree
  • Compressed representation
  • Scalable visualization
  • Scalable analysis
  • It is very useful when an application seems to be stuck/hung
  • Full information, including use cases, is available at http://www.paradyn.org/STAT/STAT.html
  • Scales to many thousands of concurrent processes.


Stack Trace Merge Example



Merged Stack


STAT Advantages

  • Always available, as it is linked into the application
  • Doesn’t use CPU cycles if not needed/activated
  • Attaches to a running program at scale
  • Can create several snapshots during a run
  • No extra license costs

Abnormal Termination Processing (ATP)

For when things break unexpectedly… (Collecting back-trace information)


ATP Description

  • Abnormal Termination Processing (ATP) is a lightweight monitoring framework that detects crashes and provides more analysis instead of silently terminating.
  • Designed to be so lightweight that it can be used all the time with almost no impact on performance.
  • Almost completely transparent to the user
  • Requires the atp module to be loaded during compilation (usually included by default)
  • Output is controlled by the ATP_ENABLED environment variable (set by the user).
  • Tested at scale (tens of thousands of processors)
  • ATP rationalizes parallel debug information into three easier-to-use forms:

1. A single stack trace of the first failing process, written to stderr
2. A visualization of every process’s stack trace when it crashed
3. A selection of representative core files for analysis


ATP Usage

  • Job scripts must include the following settings:
  • export ATP_ENABLED=1
  • ulimit -c unlimited
  • After abnormal termination the application will not simply crash but will proceed with the ATP analysis instead.
  • A backtrace of the first crashing process is written to stderr, and the merged backtrace of all processes is written to atpMergedBT.dot.
  • Core files are generated; ATP respects the ulimits on core files.
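Put together, the settings above form a minimal job-script fragment. This is only a sketch: the scheduler directives are omitted, and the binary name and rank count are placeholders.

```shell
#!/bin/bash
# Sketch of an ATP-enabled job script (scheduler directives omitted).
export ATP_ENABLED=1   # let ATP intercept abnormal termination
ulimit -c unlimited    # allow core files; ATP respects this limit
aprun -n 1024 ./a.out  # placeholder binary; on a crash, atpMergedBT.dot is written
```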



Viewing the results after the crash

  • The merged backtrace is inspected via STAT:

> module load stat
> stat-view atpMergedBT.dot

  • The core files can be inspected with a debugger like gdb or Allinea DDT.


Fast Track Debugging

For getting to the problem more quickly…


The Problem

  • Debug compilations eliminate optimizations
  • Today’s machines really need those optimizations
  • Execution slows down
  • The problem might disappear
  • Solution: compile such that both debug and non-debug (optimized) versions of each routine are created. Use -Gfast instead of -g with the Cray compiler; check the man pages.
  • Link such that the optimized versions are used by default
  • The debugger overrides the default linkage when setting breakpoints and stepping into functions
  • Supported by DDT and lgdb.
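As a sketch, a fast-track build with the Cray compiler might look like the following; the source and binary names are placeholders, and the exact option behaviour is described in the ftn/cc man pages.

```shell
# Hypothetical fast-track-debugging build with CCE ('app.f90' is a placeholder).
ftn -Gfast app.f90 -o app.exe
# -Gfast emits both optimized and debug versions of each routine.
# Run under a supporting debugger (DDT or lgdb): the optimized code runs
# by default, and breakpoints are planted in the debug copies.
```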



A Closer Look at How FTD Works

  • Source code: subroutine difuze(…) and subroutine interf(…), with call sites call difuze(…) and call interf(…).
  • Optimized binary code: difuze() and interf() are called by default.
  • Debug code: dbg$difuze() and dbg$interf().
  • A jmp is inserted as part of breakpoint planting: a breakpoint requested in interf() is placed in its debug version.


Profiling : CrayPAT


CrayPAT’s Design Goals

  • Assist the user with application performance analysis and optimization
  • Help the user identify important and meaningful information in potentially massive data sets
  • Help the user identify problem areas instead of just reporting data
  • Bring optimization knowledge to a wider set of users
  • Focus on ease of use and intuitive user interfaces
  • Lightweight and automatic program instrumentation
  • Automatic Profiling Analysis mode to bootstrap the process
  • Target scalability issues in all areas of tool development
  • Work on user codes at realistic core counts, with thousands of processes/threads
  • Integrate into large codes with millions of lines of code
  • Be a universal tool
  • Basic functionality available with all compilers on the system
  • Additional functionality available with the Cray compiler


The Three Stages of CrayPAT

  • There are three fundamental stages with accompanying tools:

1. Instrumentation
  • Use pat_build to apply instrumentation to program binaries
2. Data Collection
  • Transparent collection via CrayPAT’s run-time library
3. Analysis
  • Interpreting and visualizing collected data using a series of post-mortem tools:
  1. pat_report: a command-line tool for generating text reports
  2. Cray Apprentice2: a graphical performance analysis tool
  3. Reveal: a graphical performance analysis and code restructuring tool

  • Documentation is provided via
  • the pat_help system
  • and the traditional man craypat


Example of CrayPat assisting the User

MPI grid detection: There appears to be point-to-point MPI communication in a 20 X 16 grid pattern. The 27.5% of the total execution time spent in MPI functions might be reduced with a rank order that maximizes communication between ranks on the same node. The effect of several rank orders is estimated below.

A file named MPICH_RANK_ORDER.Grid was generated along with this report and contains usage instructions and the Custom rank order from the following table.

        Rank    On-Node    On-Node    MPICH_RANK_REORDER_METHOD
       Order   Bytes/PE  Bytes/PE%
                          of Total
                          Bytes/PE
      Custom  8.092e+09    75.00%     3
         SMP  4.580e+09    42.45%     1
        Fold  2.290e+08     2.12%     2
  RoundRobin  0.000e+00     0.00%     0

When testing this, the run time went down only from 360 to 348 seconds, but the approach might become important when scaling higher.
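Following the usage instructions contained in the generated file, applying the suggested Custom rank order comes down to a few commands. This is a sketch: the binary name and rank count are placeholders, and the launch line depends on your batch system.

```shell
# Apply the rank order suggested by CrayPat (sketch).
cp MPICH_RANK_ORDER.Grid MPICH_RANK_ORDER  # file generated alongside the report
export MPICH_RANK_REORDER_METHOD=3         # 3 = custom order read from MPICH_RANK_ORDER
aprun -n 320 ./a.out                       # placeholder binary and rank count
```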



Auto-Generated MPI Rank Order File

# The 'USER_Time_hybrid' rank order in this file targets nodes with multi-core
# processors, based on Sent Msg Total Bytes collected for:
#
# Program:       /lus/nid00023/malice/craypat/WORKSHOP/bh2o-demo/Rank/sweep3d/src/sweep3d
# Ap2 File:      sweep3d.gmpi-u.ap2
# Number PEs:    768
# Max PEs/Node:  16
#
# To use this file, make a copy named MPICH_RANK_ORDER, and set the
# environment variable MPICH_RANK_REORDER_METHOD to 3 prior to
# executing the program.
#
0,532,64,564,32,572,96,540,8,596,72,524,40,604,24,588
104,556,16,628,80,636,56,620,48,516,112,580,88,548,120,612
(… the remaining lines list the rank order for all 768 PEs …)


Automatic Profiling Analysis

A two-step process to create a guided event-trace binary.


Program Instrumentation - Automatic Profiling Analysis

  • Automatic profiling analysis (APA)
  • Provides a simple procedure to instrument and collect performance data, as a first step for novice and expert users
  • Identifies the top time-consuming routines
  • Automatically creates an instrumentation template customized to the application, for future in-depth measurement and analysis
  • >90% of users don’t need more information


Steps to Using CrayPat “APA”, page 1

1. Access performance tools software:               > module load perftools
2. Build program, retaining .o files:               > make                      (produces a.out)
3. Instrument binary:                               > pat_build -O apa a.out    (produces a.out+pat)
4. Modify batch script and run program:             aprun a.out+pat             (produces a.out+pat*.xf)
5. Process raw performance data and create report:  > pat_report a.out+pat*.xf  (produces a.out+pat*.ap2, a.out+pat*.apa, MPICH_RANK_XXX files and a text report on stdout)


Steps to Using CrayPat “APA”, page 2

1. Check the *.apa file:                            > vi *.apa
2. Reinstrument binary:                             > pat_build -O *.apa        (produces a.out+apa)
3. Modify batch script and run program:             aprun a.out+apa             (produces a.out+apa*.xf)
4. Process raw performance data and create report:  > pat_report a.out+apa*.xf  (produces a.out+apa*.ap2, MPICH_RANK_XXX files and a text report on stdout)


Lightweight application profiling


Steps to Using CrayPat-lite

1. Access light version of performance tools software:  > module load perftools-lite
2. Build program:                                       > make        (produces an instrumented a.out)
3. Run program (no modification to batch script):       aprun a.out   (produces a condensed report on stdout, a.out*.rpt (same as stdout), a.out*.ap2 and MPICH_RANK_XXX files)


Performance Statistics Available

  • Job information
  • Number of MPI ranks, …
  • Wallclock time
  • Memory high-water mark
  • Performance counters (CPU only)
  • Profile of the top time-consuming routines, with load balance
  • Observations and instructions on how to get more information

CrayPAT’s API


API for adding user instrumentation

  • Users can define their own trace points via the region API.
  • #include <pat_api.h>
  • int PAT_region_begin (int id, char *label)
  • id is a unique identifier for the region
  • label is the description that will appear in the profiling output
  • int PAT_region_end (int id)
  • id is a unique identifier for the region; it must match the begin call
  • The Fortran equivalents, as with MPI, are subroutines with an extra final integer argument for the return value


Trace On / Trace Off Example

include "pat_apif.h"
! Turn data recording off at the beginning of execution.
call PAT_record( PAT_STATE_OFF, istat )
...
! Turn data recording on for two regions of interest.
call PAT_record( PAT_STATE_ON, istat )
...
call PAT_region_begin( 1, "step 1", istat )
...
call PAT_region_end( 1, istat )
...
call PAT_region_begin( 2, "step 2", istat )
...
call PAT_region_end( 2, istat )
...
! Turn data recording off again.
call PAT_record( PAT_STATE_OFF, istat )


Profiler library

Get a first overview of your application.


Usage

  • Cray Profiler is not a Cray-supported library
  • No official support
  • It is used and developed by Cray’s benchmarking team
  • Output is text based
  • Easy to use: load the module and relink your application

> module load tools/cray_profiler
> {ftn,cc,CC} app.{f90,c,cpp} -o app.exe

  • After running your application you should obtain a file profile.<qsub_id>.txt


The profile.*.txt file (System summary)

System summary                  ,           min,           max,           avg,  minPE,  maxPE
Wall clock time                 ,       558.000,       667.940,       574.445,  36322,  40464
User processor time             ,       519.264,       555.263,       552.060,      0,  65603
System processor time           ,         0.740,        21.061,         1.676,  76001, 103200
Current processor clock (GHz)   ,         2.500,         2.500,         2.500,      0,      0
Maximum processor clock (GHz)   ,         2.501,         2.501,         2.501,      0,      0
Turbo processor clock (GHz)     ,         2.481,         2.500,         2.500,  89499, 146953
Maximum memory usage (MB/proc)  ,         7.637,       203.570,        85.976,   1462,  91440
Memory usage at exit (MB/proc)  ,         7.629,       203.562,        85.968,   1462,  91440
Memory touched (MB/proc)        ,        13.395,       209.863,        93.292,   1800,  91440
Minor page faults               ,          3429,         53725,         23882,   1800,  91440
Core Cycles                     , 1284039321940, 1307470759007, 1300649360435, 101616,  65603
Reference cycles                ,   51719382445,   52297559645,   52025267684, 115248,  65603
Node memory size (MB/node)      ,    129024.000,    129024.000,    129024.000,      0,      0
User memory avail (MB/node)     ,    129069.590,    129069.590,    129069.590,      0,      0
Memory free NUMA node 0 (MB)    ,     43431.145,     44527.934,     43985.866, 184416,   4176
Memory free NUMA node 1 (MB)    ,     43911.270,     44990.508,     44444.119,   4176, 184416
Total huge pages (MB/node)      ,      1440.000,      1440.000,      1440.000,      0,      0
Used huge pages (MB/node)       ,      1440.000,      1440.000,      1440.000,      0,      0
Node Energy Use (Joules)        ,        138188,        166713,        153164,  89424,  13872
Average Power Use (Watts)       ,       224.798,       290.333,       266.821,  31584,  25782


The profile.*.txt file (STDIO summary)

STDIO summary       ,    min,    max,    avg,  minPE,  maxPE
Total I/O time      ,  0.001,  0.327,  0.003, 122902,  13392
fwrite total time   ,  0.000,  0.322,  0.000,   1463,  13392
fgets time          ,  0.001,  0.005,  0.003, 122902,  34128
fopen time          ,  0.001,  0.024,  0.006, 158833,      0
fclose time         ,  0.000,  0.332,  0.000,  47149, 173424
fflush time         ,  0.000,  0.001,  0.000,  44614, 151368
fwrite average time ,  0.000,  0.107,  0.000,   1463,  13392
Total I/O bytes     , 0.055M, 0.094M, 0.056M,    180,      0
fwrite total bytes  , 0.000M, 0.001M, 0.000M,      1,      0
fgets total bytes   , 0.055M, 0.093M, 0.056M,    180, 114960
Total I/O calls     ,    691,   1088,    703,     47,      0
fwrite total calls  ,      2,     35,      2,      1,      0
fgets calls         ,    689,   1053,    701,     47,      0
fopen calls         ,     45,     51,     47,      1,      0
fclose calls        ,     41,     47,     43,      1,      0
fflush calls        ,      1,      6,      1,      1,      0


Contents of the profile.*.txt file (MPI summary)

MPI summary                 ,     min,     max,     avg,  minPE,  maxPE
Init-Finalize elapsed time  , 521.661, 522.108, 521.879, 182828,   6078
Total MPI time              ,  95.617, 117.670, 111.278, 166464, 163630
Total communication time    ,   0.298,  73.124,  27.476,     45, 188527
Total Wait and Probe time   ,   7.359,  19.622,  12.947, 188635,  68930
Total collective sync time  ,   8.859, 101.966,  67.015, 187392,   2254
Total MPI other time        ,   3.298,   4.491,   3.840,  17826, 187823
Maximum sends posted        ,   2.000, 234.000,   3.814,      1,  74016
Average sends posted        ,   1.400, 125.723,   2.272,      1,  27648
Maximum recvs posted        ,   2.000, 244.000,   3.804,      1, 153408
Average recvs posted        ,   1.400, 126.545,   2.269,      1,  27648
Send total time             ,   0.000,   0.031,   0.000, 188639, 186287
  size 1B - 15B             ,   0.000,   0.031,   0.000, 188639, 186287
Recv total time             ,   0.000,  72.549,  27.119,      0, 188570
  size 1B - 15B             ,   0.000,  72.549,  27.119,      0, 188570
Isend total time            ,   0.014,   0.106,   0.042,   3584, 179880
  size 0B                   ,   0.000,   0.075,   0.022,   3216, 180125
  size 1MB - 16MB           ,   0.013,   0.046,   0.020,  56196, 152040
Irecv total time            ,   0.004,   0.404,   0.112,  56204, 145111
  size 1MB - 16MB           ,   0.004,   0.404,   0.112,  56204, 145111
Bcast total time            ,   0.003,   0.017,   0.008, 114704, 184150
  size 1B - 15B             ,   0.002,   0.004,   0.003,  13056, 184101


MPI-IO Stats

New since August 2014


Cray MPI-IO Performance Metrics

  • MPI-IO calls are often “black holes”, with little performance information available.
  • Cray’s MPI-IO library attempts collective buffering and stripe matching to improve bandwidth and performance.
  • The user can help performance by favouring larger contiguous reads/writes over smaller scattered ones.
  • The MPI-IO library now provides a way of collecting statistics on the actual read/write operations performed after collective buffering
  • Enable with: export MPICH_MPIIO_STATS=2
  • In addition to some information written to stdout, it also produces some CSV files which can be analysed by a provided tool called cray_mpiio_summary
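Put together, collecting the statistics for a run might look like the fragment below. This is a sketch: the binary name and core count are placeholders, and the exact invocation of cray_mpiio_summary may differ on your system.

```shell
export MPICH_MPIIO_STATS=2   # have Cray MPI-IO collect per-file statistics
aprun -n 19200 ./wrf.exe     # summary to stdout, details to CSV files
cray_mpiio_summary           # provided tool that analyses the CSV files
```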


MPI-IO Performance Stats

Example output, running WRF on 19200 cores:

| MPIIO write access patterns for wrfout_d01_2013-07-01_01_00_00
|   independent writes     = 2
|   collective writes      = 5932800
|   system writes          = 99871
|   stripe sized writes    = 99291
|   total bytes for writes = 104397074583 = 99560 MiB = 97 GiB
|   ave system write size  = 1045319
|   number of write gaps   = 2
|   ave write gap size     = 524284
| See "Optimizing MPI I/O on Cray XE Systems" S-0013-20 for explanations.

Best performance is achieved when the average write size is > 1 MB and there are few gaps. Careful selection of MPI types, file views and the ordering of data on disk can improve this.

Wrf, 19200-core run: number of MPI-IO calls over time

Wrf, 19200-core run: number of system writes and reads

Wrf, 19200-core run: number of stripe-size-aligned system write calls

Wrf, 19200-core run: number of stripe-size-aligned system read calls

Wrf, 19200-core run: number of files open at any time


Compiler Feedback and Variable Scoping.


Reveal

  • For an OpenMP port the developer has to understand the scoping of the variables, i.e. whether variables are shared or private.
  • Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.
  • Source code navigation using whole-program analysis (data provided by the Cray compilation environment).
  • Coupling with performance data collected during execution by CrayPAT: understand which high-level serial loops could benefit from parallelism.
  • Enhanced loopmark listing functionality.
  • Assists users in optimizing code by providing variable scoping feedback and suggested compiler directives.


Reveal with Loop Work Estimates

Cray Inc., CScADS 2012


Visualize CCE’s Loopmark with Performance Profile

Performance feedback; loopmark and optimization annotations; compiler feedback.


Visualize CCE’s Loopmark with Performance Profile (2)

Integrated message ‘explain support’.


View Pseudo Code for Inlined Functions

Inlined call sites are marked; expand to see the pseudo code.


Scoping Assistance – Review Scoping Results

Loops with scoping information are highlighted; red means user assistance is needed. The user addresses parallelization issues for unresolved variables, and parallelization inhibitor messages are provided to assist with the analysis.


Scoping Assistance – User Resolves Issues

Click on a variable to view all occurrences in the loop, and use Reveal’s OpenMP parallelization tips.


Scoping Assistance – Generate Directive

Reveal automatically generates an example OpenMP directive.


Thank You! Questions?