
C O M P U T E | S T O R E | A N A L Y Z E

Cray Tools, an overview

8th International Parallel Tools Workshop
Stuttgart, Germany, 1st October 2014
Stefan Andersson, Cray onsite support at HLRS


Introduction

  • Cray develops several tools for its XE/XK and XC computers
  • A lot of effort goes into their development
  • Several of the tools are ‘stand-alone’ solutions, developed for a specific problem
  • STAT, ATP
  • IOBUF (includes serial I/O monitoring)
  • MPIIO profiling
  • Other tools interact with each other to be more efficient or to create a new solution for a problem
  • CCE provides ‘hooks’ for profiling at loop level
  • Reveal uses CCE listing information and CrayPat profiling data

Which tools does Cray develop?

  • It doesn’t make sense to develop tools where a good tool already exists on the market; DDT and Totalview are good examples
  • Cray’s tools either
  • are something new, like Reveal
  • concentrate on a solution to a specific issue, like STAT
  • are part of the development process, like MPIIO Stats
  • come out of benchmarking, like IOBUF
  • Cray also collaborates with different sites in developing tools

CCE : Cray Compiler Environment

  • The compiler is generally not considered a ‘tool’, but it is in fact the most important piece of user software
  • Compiles and links the user application
  • Gives feedback about the application
  • Code errors
  • How optimization was done or not done (.lst file)
  • Provides ‘hooks’ into different levels of the application, to which other tools can attach
  • Functions
  • Loops
  • This makes CCE the ‘centerpiece’ of Cray’s tools strategy
  • CCE can adapt rather quickly to user/tool needs
  • All tools work with other compilers, but there might be some limitations. The goal is not to force users onto CCE, but to provide extensions where it makes sense


Overview : Tools infrastructure (selection)

Debugging: get your code up and running correctly. Profiling: locate performance bottlenecks.

Lightweight (at most relinking; gives a first picture of performance or of problems during execution):

  • ATP
  • STAT
  • CrayPAT-lite
  • Profiler library
  • IOBUF
  • MPIIO Stats

In-depth (recompile/relink; provides detailed information at user-routine level):

  • lgdb with ccdb
  • Fast track
  • DDT
  • Totalview
  • Intel Inspector
  • CrayPAT
  • Apprentice2
  • Reveal
  • Intel Vtune


Ease of Use : CrayPAT evolving over time

  • CrayPat is not easy to get started with:
  • The man pages for intro_craypat, pat_build and pat_report have ~4000 lines
  • ~70 environment variables
  • A lot of arguments available
  • Output is configurable to the very last character
  • Improvements in ‘ease of use’ over time:

1. Interactive help tool: pat_help
2. Introduction of the ‘Automatic Profiling Analysis’ approach, which guides the user to a traced run in two steps
  • First a sampling run is done
  • Based on this run, a traced application is generated and run
  • The user can interact with the process and make changes
3. Introduction of CrayPAT-lite
  • Profiling is transparent to the user: no changes in the build and execution process
  • Users can still use ‘plain’ CrayPAT

Debugging Tools on the Cray XC30


The Porting and Optimization Cycle

Port or update your application to the XC30. Debug your application (get correct results):

  • Stack Trace Analysis Tool (STAT)
  • Abnormal Termination Processing (ATP)
  • Fast Track Debugger (FTD)
  • Allinea DDT
  • lgdb (ccdb)

Profile your application for performance:

  • Cray performance analysis toolkit (CrayPat)
  • CrayPat-lite for easier profiling
  • Cray Profiler library


Stack Trace Analysis Tool (STAT)

For when nothing appears to be happening…


Stack Trace Analysis Tool (STAT)

  • Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison.
  • Gathers and merges stack traces from a running application’s parallel processes.
  • Creates a call-graph prefix tree
  • Compressed representation
  • Scalable visualization
  • Scalable analysis
  • It is very useful when an application seems to be stuck/hung
  • Full information, including use cases, is available at http://www.paradyn.org/STAT/STAT.html
  • Scales to many thousands of concurrent processes.


Stack Trace Merge Example



Merged Stack


STAT Advantages

  • Always available, as it is linked into the application
  • Doesn’t use CPU cycles if not needed/activated
  • Attaches to a running program at scale
  • Can create several snapshots during a run
  • No extra license costs

Abnormal Termination Processing (ATP)

For when things break unexpectedly… (Collecting back-trace information)


ATP Description

  • Abnormal Termination Processing (ATP) is a lightweight monitoring framework that detects crashes and provides more analysis instead of silently terminating.
  • Designed to be so lightweight that it can be used all the time with almost no impact on performance.
  • Almost completely transparent to the user
  • Requires the atp module to be loaded during compilation (usually included by default)
  • Output is controlled by the ATP_ENABLED environment variable (set by the user).
  • Tested at scale (tens of thousands of processors)
  • ATP rationalizes parallel debug information into three easier-to-use forms:

1. A single stack trace of the first failing process, written to stderr
2. A visualization of every process’s stack trace when it crashed
3. A selection of representative core files for analysis


ATP Usage

  • Job scripts must include the following settings:
  • export ATP_ENABLED=1
  • ulimit -c unlimited
  • After abnormal termination the application will not simply crash but will proceed with the ATP analysis instead.
  • A backtrace of the first crashing process is written to stderr, and the merged backtrace of all processes is written to atpMergedBT.dot.
  • Core files are generated; ATP respects the ulimits on core files.
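Put together, the settings above form a minimal job-script fragment. This is only a sketch: the scheduler directives are omitted, and the binary name and rank count are placeholders.

```shell
#!/bin/bash
# Sketch of an ATP-enabled job script (scheduler directives omitted).
export ATP_ENABLED=1   # let ATP intercept abnormal termination
ulimit -c unlimited    # allow core files; ATP respects this limit
aprun -n 1024 ./a.out  # placeholder binary; on a crash, atpMergedBT.dot is written
```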



Viewing the results after the crash

  • The merged backtrace is inspected via STAT:

> module load stat
> stat-view atpMergedBT.dot

  • The core files can be inspected with a debugger like gdb or Allinea DDT.


Fast Track Debugging

For getting to the problem more quickly…


The Problem

  • Debug compilations eliminate optimizations
  • Today’s machines really need those optimizations
  • Execution slows down
  • The problem might disappear
  • Solution: compile such that both debug and non-debug (optimized) versions of each routine are created. Use -Gfast instead of -g with the Cray compiler; check the man pages.
  • Link such that the optimized versions are used by default
  • The debugger overrides the default linkage when setting breakpoints and stepping into functions
  • Supported by DDT and lgdb.
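As a sketch, a fast-track build with the Cray compiler might look like the following; the source and binary names are placeholders, and the exact option behaviour is described in the ftn/cc man pages.

```shell
# Hypothetical fast-track-debugging build with CCE ('app.f90' is a placeholder).
ftn -Gfast app.f90 -o app.exe
# -Gfast emits both optimized and debug versions of each routine.
# Run under a supporting debugger (DDT or lgdb): the optimized code runs
# by default, and breakpoints are planted in the debug copies.
```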



A Closer Look at How FTD Works

  • Source code: subroutine difuze(…) and subroutine interf(…), with call sites call difuze(…) and call interf(…).
  • Optimized binary code: difuze() and interf() are called by default.
  • Debug code: dbg$difuze() and dbg$interf().
  • A jmp is inserted as part of breakpoint planting: a breakpoint requested in interf() is placed in its debug version.


Profiling : CrayPAT


CrayPAT’s Design Goals

  • Assist the user with application performance analysis and optimization
  • Help the user identify important and meaningful information in potentially massive data sets
  • Help the user identify problem areas instead of just reporting data
  • Bring optimization knowledge to a wider set of users
  • Focus on ease of use and intuitive user interfaces
  • Lightweight and automatic program instrumentation
  • Automatic Profiling Analysis mode to bootstrap the process
  • Target scalability issues in all areas of tool development
  • Work on user codes at realistic core counts, with thousands of processes/threads
  • Integrate into large codes with millions of lines of code
  • Be a universal tool
  • Basic functionality available with all compilers on the system
  • Additional functionality available with the Cray compiler


The Three Stages of CrayPAT

  • There are three fundamental stages with accompanying tools:

1. Instrumentation
  • Use pat_build to apply instrumentation to program binaries
2. Data Collection
  • Transparent collection via CrayPAT’s run-time library
3. Analysis
  • Interpreting and visualizing collected data using a series of post-mortem tools:
  1. pat_report: a command-line tool for generating text reports
  2. Cray Apprentice2: a graphical performance analysis tool
  3. Reveal: a graphical performance analysis and code restructuring tool

  • Documentation is provided via
  • the pat_help system
  • and the traditional man craypat


Example of CrayPat assisting the User

MPI grid detection: There appears to be point-to-point MPI communication in a 20 X 16 grid pattern. The 27.5% of the total execution time spent in MPI functions might be reduced with a rank order that maximizes communication between ranks on the same node. The effect of several rank orders is estimated below.

A file named MPICH_RANK_ORDER.Grid was generated along with this report and contains usage instructions and the Custom rank order from the following table.

        Rank    On-Node    On-Node    MPICH_RANK_REORDER_METHOD
       Order   Bytes/PE  Bytes/PE%
                          of Total
                          Bytes/PE
      Custom  8.092e+09    75.00%     3
         SMP  4.580e+09    42.45%     1
        Fold  2.290e+08     2.12%     2
  RoundRobin  0.000e+00     0.00%     0

When testing this, the run time went down only from 360 to 348 seconds, but the approach might become important when scaling higher.
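Following the usage instructions contained in the generated file, applying the suggested Custom rank order comes down to a few commands. This is a sketch: the binary name and rank count are placeholders, and the launch line depends on your batch system.

```shell
# Apply the rank order suggested by CrayPat (sketch).
cp MPICH_RANK_ORDER.Grid MPICH_RANK_ORDER  # file generated alongside the report
export MPICH_RANK_REORDER_METHOD=3         # 3 = custom order read from MPICH_RANK_ORDER
aprun -n 320 ./a.out                       # placeholder binary and rank count
```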



Auto-Generated MPI Rank Order File

# The 'USER_Time_hybrid' rank order in this file targets nodes with multi-core
# processors, based on Sent Msg Total Bytes collected for:
#
# Program:       /lus/nid00023/malice/craypat/WORKSHOP/bh2o-demo/Rank/sweep3d/src/sweep3d
# Ap2 File:      sweep3d.gmpi-u.ap2
# Number PEs:    768
# Max PEs/Node:  16
#
# To use this file, make a copy named MPICH_RANK_ORDER, and set the
# environment variable MPICH_RANK_REORDER_METHOD to 3 prior to
# executing the program.
#
0,532,64,564,32,572,96,540,8,596,72,524,40,604,24,588
104,556,16,628,80,636,56,620,48,516,112,580,88,548,120,612
(… the remaining lines list the rank order for all 768 PEs …)


Automatic Profiling Analysis

A two-step process to create a guided event-trace binary.


Program Instrumentation - Automatic Profiling Analysis

  • Automatic profiling analysis (APA)
  • Provides a simple procedure to instrument and collect performance data, as a first step for novice and expert users
  • Identifies the top time-consuming routines
  • Automatically creates an instrumentation template customized to the application, for future in-depth measurement and analysis
  • >90% of users don’t need more information


Steps to Using CrayPat “APA”, page 1

1. Access performance tools software:               > module load perftools
2. Build program, retaining .o files:               > make                      (produces a.out)
3. Instrument binary:                               > pat_build -O apa a.out    (produces a.out+pat)
4. Modify batch script and run program:             aprun a.out+pat             (produces a.out+pat*.xf)
5. Process raw performance data and create report:  > pat_report a.out+pat*.xf  (produces a.out+pat*.ap2, a.out+pat*.apa, MPICH_RANK_XXX files and a text report on stdout)


Steps to Using CrayPat “APA”, page 2

1. Check the *.apa file:                            > vi *.apa
2. Reinstrument binary:                             > pat_build -O *.apa        (produces a.out+apa)
3. Modify batch script and run program:             aprun a.out+apa             (produces a.out+apa*.xf)
4. Process raw performance data and create report:  > pat_report a.out+apa*.xf  (produces a.out+apa*.ap2, MPICH_RANK_XXX files and a text report on stdout)


Lightweight application profiling


Steps to Using CrayPat-lite

1. Access light version of performance tools software:  > module load perftools-lite
2. Build program:                                       > make        (produces an instrumented a.out)
3. Run program (no modification to batch script):       aprun a.out   (produces a condensed report on stdout, a.out*.rpt (same as stdout), a.out*.ap2 and MPICH_RANK_XXX files)


Performance Statistics Available

  • Job information
  • Number of MPI ranks, …
  • Wallclock time
  • Memory high-water mark
  • Performance counters (CPU only)
  • Profile of the top time-consuming routines, with load balance
  • Observations and instructions on how to get more information

CrayPAT’s API


API for adding user instrumentation

  • Users can define their own trace points via the region API.
  • #include <pat_api.h>
  • int PAT_region_begin (int id, char *label)
  • id is a unique identifier for the region
  • label is the description that will appear in the profiling output
  • int PAT_region_end (int id)
  • id is a unique identifier for the region; it must match the begin call
  • The Fortran equivalents, as with MPI, are subroutines with an extra final integer argument for the return value


Trace On / Trace Off Example

include "pat_apif.h"
! Turn data recording off at the beginning of execution.
call PAT_record( PAT_STATE_OFF, istat )
...
! Turn data recording on for two regions of interest.
call PAT_record( PAT_STATE_ON, istat )
...
call PAT_region_begin( 1, "step 1", istat )
...
call PAT_region_end( 1, istat )
...
call PAT_region_begin( 2, "step 2", istat )
...
call PAT_region_end( 2, istat )
...
! Turn data recording off again.
call PAT_record( PAT_STATE_OFF, istat )


Profiler library

Get a first overview of your application.


Usage

  • Cray Profiler is not a Cray-supported library
  • No official support
  • It is used and developed by Cray’s benchmarking team
  • Output is text based
  • Easy to use: load the module and relink your application

> module load tools/cray_profiler
> {ftn,cc,CC} app.{f90,c,cpp} -o app.exe

  • After running your application you should obtain a file profile.<qsub_id>.txt


The profile.*.txt file (System summary)

System summary                  ,           min,           max,           avg,  minPE,  maxPE
Wall clock time                 ,       558.000,       667.940,       574.445,  36322,  40464
User processor time             ,       519.264,       555.263,       552.060,      0,  65603
System processor time           ,         0.740,        21.061,         1.676,  76001, 103200
Current processor clock (GHz)   ,         2.500,         2.500,         2.500,      0,      0
Maximum processor clock (GHz)   ,         2.501,         2.501,         2.501,      0,      0
Turbo processor clock (GHz)     ,         2.481,         2.500,         2.500,  89499, 146953
Maximum memory usage (MB/proc)  ,         7.637,       203.570,        85.976,   1462,  91440
Memory usage at exit (MB/proc)  ,         7.629,       203.562,        85.968,   1462,  91440
Memory touched (MB/proc)        ,        13.395,       209.863,        93.292,   1800,  91440
Minor page faults               ,          3429,         53725,         23882,   1800,  91440
Core Cycles                     , 1284039321940, 1307470759007, 1300649360435, 101616,  65603
Reference cycles                ,   51719382445,   52297559645,   52025267684, 115248,  65603
Node memory size (MB/node)      ,    129024.000,    129024.000,    129024.000,      0,      0
User memory avail (MB/node)     ,    129069.590,    129069.590,    129069.590,      0,      0
Memory free NUMA node 0 (MB)    ,     43431.145,     44527.934,     43985.866, 184416,   4176
Memory free NUMA node 1 (MB)    ,     43911.270,     44990.508,     44444.119,   4176, 184416
Total huge pages (MB/node)      ,      1440.000,      1440.000,      1440.000,      0,      0
Used huge pages (MB/node)       ,      1440.000,      1440.000,      1440.000,      0,      0
Node Energy Use (Joules)        ,        138188,        166713,        153164,  89424,  13872
Average Power Use (Watts)       ,       224.798,       290.333,       266.821,  31584,  25782


The profile.*.txt file (STDIO summary)

STDIO summary       ,    min,    max,    avg,  minPE,  maxPE
Total I/O time      ,  0.001,  0.327,  0.003, 122902,  13392
fwrite total time   ,  0.000,  0.322,  0.000,   1463,  13392
fgets time          ,  0.001,  0.005,  0.003, 122902,  34128
fopen time          ,  0.001,  0.024,  0.006, 158833,      0
fclose time         ,  0.000,  0.332,  0.000,  47149, 173424
fflush time         ,  0.000,  0.001,  0.000,  44614, 151368
fwrite average time ,  0.000,  0.107,  0.000,   1463,  13392
Total I/O bytes     , 0.055M, 0.094M, 0.056M,    180,      0
fwrite total bytes  , 0.000M, 0.001M, 0.000M,      1,      0
fgets total bytes   , 0.055M, 0.093M, 0.056M,    180, 114960
Total I/O calls     ,    691,   1088,    703,     47,      0
fwrite total calls  ,      2,     35,      2,      1,      0
fgets calls         ,    689,   1053,    701,     47,      0
fopen calls         ,     45,     51,     47,      1,      0
fclose calls        ,     41,     47,     43,      1,      0
fflush calls        ,      1,      6,      1,      1,      0


Contents of the profile.*.txt file (MPI summary)

MPI summary                 ,     min,     max,     avg,  minPE,  maxPE
Init-Finalize elapsed time  , 521.661, 522.108, 521.879, 182828,   6078
Total MPI time              ,  95.617, 117.670, 111.278, 166464, 163630
Total communication time    ,   0.298,  73.124,  27.476,     45, 188527
Total Wait and Probe time   ,   7.359,  19.622,  12.947, 188635,  68930
Total collective sync time  ,   8.859, 101.966,  67.015, 187392,   2254
Total MPI other time        ,   3.298,   4.491,   3.840,  17826, 187823
Maximum sends posted        ,   2.000, 234.000,   3.814,      1,  74016
Average sends posted        ,   1.400, 125.723,   2.272,      1,  27648
Maximum recvs posted        ,   2.000, 244.000,   3.804,      1, 153408
Average recvs posted        ,   1.400, 126.545,   2.269,      1,  27648
Send total time             ,   0.000,   0.031,   0.000, 188639, 186287
  size 1B - 15B             ,   0.000,   0.031,   0.000, 188639, 186287
Recv total time             ,   0.000,  72.549,  27.119,      0, 188570
  size 1B - 15B             ,   0.000,  72.549,  27.119,      0, 188570
Isend total time            ,   0.014,   0.106,   0.042,   3584, 179880
  size 0B                   ,   0.000,   0.075,   0.022,   3216, 180125
  size 1MB - 16MB           ,   0.013,   0.046,   0.020,  56196, 152040
Irecv total time            ,   0.004,   0.404,   0.112,  56204, 145111
  size 1MB - 16MB           ,   0.004,   0.404,   0.112,  56204, 145111
Bcast total time            ,   0.003,   0.017,   0.008, 114704, 184150
  size 1B - 15B             ,   0.002,   0.004,   0.003,  13056, 184101


MPI-IO Stats

New since August 2014


Cray MPI-IO Performance Metrics

  • MPI-IO calls are often “black holes”, with little performance information available.
  • Cray’s MPI-IO library attempts collective buffering and stripe matching to improve bandwidth and performance.
  • The user can help performance by favouring larger contiguous reads/writes over smaller scattered ones.
  • The MPI-IO library now provides a way of collecting statistics on the actual read/write operations performed after collective buffering
  • Enable with: export MPICH_MPIIO_STATS=2
  • In addition to some information written to stdout, it also produces some CSV files which can be analysed by a provided tool called cray_mpiio_summary
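Put together, collecting the statistics for a run might look like the fragment below. This is a sketch: the binary name and core count are placeholders, and the exact invocation of cray_mpiio_summary may differ on your system.

```shell
export MPICH_MPIIO_STATS=2   # have Cray MPI-IO collect per-file statistics
aprun -n 19200 ./wrf.exe     # summary to stdout, details to CSV files
cray_mpiio_summary           # provided tool that analyses the CSV files
```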


MPI-IO Performance Stats

Example output, running WRF on 19200 cores:

| MPIIO write access patterns for wrfout_d01_2013-07-01_01_00_00
|   independent writes     = 2
|   collective writes      = 5932800
|   system writes          = 99871
|   stripe sized writes    = 99291
|   total bytes for writes = 104397074583 = 99560 MiB = 97 GiB
|   ave system write size  = 1045319
|   number of write gaps   = 2
|   ave write gap size     = 524284
| See "Optimizing MPI I/O on Cray XE Systems" S-0013-20 for explanations.

Best performance is achieved when the average write size is > 1 MB and there are few gaps. Careful selection of MPI types, file views and the ordering of data on disk can improve this.

Wrf, 19200-core run: number of MPI-IO calls over time

Wrf, 19200-core run: number of system writes and reads

Wrf, 19200-core run: number of stripe-size-aligned system write calls

Wrf, 19200-core run: number of stripe-size-aligned system read calls

Wrf, 19200-core run: number of files open at any time


Compiler Feedback and Variable Scoping.


Reveal

  • For an OpenMP port the developer has to understand the scoping of the variables, i.e. whether variables are shared or private.
  • Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.
  • Source code navigation using whole-program analysis (data provided by the Cray compilation environment).
  • Coupling with performance data collected during execution by CrayPAT: understand which high-level serial loops could benefit from parallelism.
  • Enhanced loopmark listing functionality.
  • Assists users in optimizing code by providing variable scoping feedback and suggested compiler directives.


Reveal with Loop Work Estimates

Cray Inc., CScADS 2012


Visualize CCE’s Loopmark with Performance Profile

Performance feedback; loopmark and optimization annotations; compiler feedback.


Visualize CCE’s Loopmark with Performance Profile (2)

Integrated message ‘explain support’.


View Pseudo Code for Inlined Functions

Inlined call sites are marked; expand to see the pseudo code.


Scoping Assistance – Review Scoping Results

Loops with scoping information are highlighted; red means user assistance is needed. The user addresses parallelization issues for unresolved variables, and parallelization inhibitor messages are provided to assist with the analysis.


Scoping Assistance – User Resolves Issues

Click on a variable to view all occurrences in the loop, and use Reveal’s OpenMP parallelization tips.


Scoping Assistance – Generate Directive

Reveal automatically generates an example OpenMP directive.


Thank You! Questions?