C O M P U T E | S T O R E | A N A L Y Z E
Cray Tools, an overview
8th International Parallel Tools Workshop
Stuttgart, Germany, 1st October 2014
Stefan Andersson, Cray onsite support at HLRS
Introduction
- Cray develops several tools for its XE/XK and XC computers
- A lot of effort is going into their development
- Several of the tools are ‘stand-alone’ solutions, developed for a specific problem
  - STAT, ATP
  - IOBUF (includes serial IO monitoring)
  - MPIIO profiling
- Other tools interact in order to be more efficient or to create a new solution for a problem
  - CCE providing ‘hooks’ for profiling at loop level
  - Reveal using CCE listing information and CrayPat profiling
Which tools does Cray develop
- It doesn’t make sense to develop tools where a good tool already exists on the market
  - DDT and Totalview are good examples
- Cray’s tools either
  - are something new, like Reveal
  - concentrate on a solution to a specific issue, like STAT
  - are part of the development process, like MPIIO Stats
  - come out of benchmarking, like IOBUF
- Cray also collaborates with different sites in developing tools
CCE : Cray Compiler Environment
- The compiler is in general not considered a ‘tool’, but in fact it is the most important piece of user software
  - Compiles and links the user application
  - Gives feedback about the application
    - Code errors
    - How optimization was done or not done (lst file)
  - Provides ‘hooks’ into different levels of the application, to which other tools can attach
    - Functions
    - Loops
- This makes CCE the ‘centerpiece’ in Cray’s tools strategy
  - CCE can adapt rather quickly to user/tool needs
- All tools will work with other compilers, but there might be some limitations
  - The goal is not to force a user to use CCE, but to provide extensions where it makes sense
Overview : Tools infrastructure (selection)
Debugging: Get your code up and running correctly.
Profiling: Locate performance bottlenecks.
Light weight: At most relinking. Get a first picture of performance or of problems during execution.
- Debugging: ATP, STAT
- Profiling: CrayPAT-lite, Profiler library, IOBUF, MPIIO Stats
In-depth: Recompile/relink. Provides detailed information at user routine level.
- Debugging: lgdb with ccdb, Fast Track, DDT, Totalview, Intel Inspector
- Profiling: CrayPAT, Apprentice2, Reveal, Intel Vtune
Ease of Use : CrayPAT evolving over time
- CrayPat is not easy to get started with :
  - Man pages for intro_craypat, pat_build and pat_report have ~4000 lines
  - ~70 environment variables
  - A lot of arguments available
  - Output is configurable to the very last character
- Improvement in ‘ease of use’ over time :
  1. Interactive help tool : pat_help
  2. Introduction of the ‘Automatic Profiling Analysis’ approach, which guides the user to a traced run in two steps
    - First a sampling run is done
    - Based on this run, a traced application is generated and run
    - User can interact with the process and make changes
  3. Introduction of CrayPAT-lite
    - Profiling is transparent to the user : no changes in the build and execution process
- Users can still use ‘plain’ CrayPAT
Debugging Tools on the Cray XC30
The porting and optimization cycle
Port or update your application to the XC30.
Debug your application (get right results).
- Stack Trace Analysis Tool (STAT)
- Abnormal Termination Processing (ATP)
- Fast Track Debugger (FTD)
- Allinea DDT
- lgdb, (ccdb)
Profile your application for performance.
- Cray performance analysis toolkit CrayPat
- CrayPat-lite for easier profiling
- Cray Profiler Library
Stack Trace Analysis Tool (STAT)
For when nothing appears to be happening…
Stack Trace Analysis Tool (STAT)
- Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison.
- Gathers and merges stack traces from a running application’s parallel processes.
- Creates a call graph prefix tree
  - Compressed representation
  - Scalable visualization
  - Scalable analysis
- It is very useful when an application seems to be stuck/hung
- Full information including use cases is available at http://www.paradyn.org/STAT/STAT.html
- Scales to many thousands of concurrent processes.
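The merge of per-process stack traces into a call graph prefix tree can be illustrated with a small sketch. This is not STAT's implementation; the function names `merge_stack_traces` and `render` and the four-rank sample traces are hypothetical, chosen only to show how identical call paths collapse into one branch annotated with the ranks that share it.

```python
def merge_stack_traces(traces):
    """Merge per-rank stack traces into a prefix tree.

    traces maps rank -> list of frame names, outermost frame first.
    Each tree node is {"ranks": set of ranks, "children": {frame: node}}.
    """
    root = {"ranks": set(), "children": {}}
    for rank, frames in traces.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            # Ranks with the same call path so far share the same node.
            node = node["children"].setdefault(
                frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)
    return root


def render(node, depth=0, lines=None):
    """Return an indented text view of the merged tree."""
    lines = [] if lines is None else lines
    for frame, child in sorted(node["children"].items()):
        lines.append("  " * depth + f"{frame} [{len(child['ranks'])} ranks]")
        render(child, depth + 1, lines)
    return lines


# Hypothetical traces: three ranks wait in a collective, one is elsewhere.
traces = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "solve", "MPI_Allreduce"],
    3: ["main", "solve", "compute_flux"],
}
tree = merge_stack_traces(traces)
print("\n".join(render(tree)))
```

In a hung job the interesting branch is usually the small one: here the single rank in `compute_flux` is the one the other three are waiting for.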
Stack Trace Merge Example
Merged Stack
STAT Advantages
- Always available as linked into an application
- Doesn’t use CPU cycles if not needed/activated
- Attaches to a running program at scale
- Can create several snapshots during a run
- No extra license costs
Abnormal Termination Processing (ATP)
For when things break unexpectedly… (Collecting back-trace information)
ATP Description
- Abnormal Termination Processing is a lightweight monitoring framework that detects crashes and provides more analysis instead of silently terminating.
  - Designed to be so lightweight that it can be used all the time with almost no impact on performance.
  - Almost completely transparent to the user
    - Requires the atp module to be loaded during compilation (usually included by default)
    - Output controlled by the ATP_ENABLED environment variable (set by user).
  - Tested at scale (tens of thousands of processors)
- ATP rationalizes parallel debug information into three easier-to-use forms:
  1. A single stack trace of the first failing process to stderr
  2. A visualization of every process’s stack trace when it crashed
  3. A selection of representative core files for analysis
ATP Usage
- Job scripts must include the following settings
  - export ATP_ENABLED=1
  - ulimit -c unlimited
- After abnormal termination the application will not simply crash but proceed with the ATP analysis instead.
- The backtrace of the first crashing process is passed to stderr and the merged backtrace of all processes is written to atpMergedBT.dot
Core files are being generated. ATP respects ulimits on core files.
Trace back of crashing process
Viewing the results after the crash
- The merged backtrace is inspected via STAT:
> module load stat
> stat-view atpMergedBT.dot
- The core files can be inspected with a debugger like gdb or Allinea DDT.
Fast Track Debugging
For getting to the problem more quickly…
The Problem
- Debug compilations eliminate optimizations
  - Today's machines really need optimizations
  - Slows down execution
  - Problem might disappear
- Compile such that both debug and non-debug (optimized) versions of each routine are created. Use -Gfast instead of -g with the Cray compiler. Check the man pages.
- Linkage such that optimized versions are used by default
- Debugger overrides default linkage when setting breakpoints and stepping into functions
- Supported by DDT and lgdb.
A Closer Look at How FTD Works
source code
  subroutine difuze(…)
    call difuze(…)
    call interf(…)
  subroutine interf(…)
optimized binary code
  difuze()
    call difuze(…)
    call interf(…)
  interf()
debug code
  dbg$difuze()
    call difuze(…)
    call interf(…)
  dbg$interf()
Jmp inserted as part of breakpoint planting
Breakpoint requested in interf(), placed in interf_debug()
Profiling : CrayPAT
CrayPAT’s Design Goals
- Assist the user with application performance analysis and optimization
  - Help user identify important and meaningful information from potentially massive data sets
  - Help user identify problem areas instead of just reporting data
  - Bring optimization knowledge to a wider set of users
- Focus on ease of use and intuitive user interfaces
  - Lightweight and automatic program instrumentation
  - Automatic Profiling Analysis mode to bootstrap the process
- Target scalability issues in all areas of tool development
  - Work on user codes at realistic core counts with thousands of processes/threads
  - Integrate into large codes with millions of lines of code
- Be a universal tool
  - Basic functionality available to all compilers on the system
  - Additional functionality available from the Cray compiler
The Three Stages of CrayPAT
- There are three fundamental stages with accompanying tools
  1. Instrumentation
    - Use pat_build to apply instrumentation to program binaries
  2. Data Collection
    - Transparent collection via CrayPAT’s run-time library
  3. Analysis
    - Interpreting and visualizing collected data using a series of post-mortem tools:
      1. pat_report: a command line tool for generating text reports
      2. Cray Apprentice2: a graphical performance analysis tool
      3. Reveal: a graphical performance analysis and code restructuring tool
- Documentation is provided via
  - The pat_help system
  - The traditional man craypat
Example of CrayPat assisting the User
MPI grid detection:
There appears to be point-to-point MPI communication in a 20 X 16 grid pattern. The 27.5% of the total execution time spent in MPI functions might be reduced with a rank order that maximizes communication between ranks on the same node. The effect of several rank orders is estimated below.
A file named MPICH_RANK_ORDER.Grid was generated along with this report and contains usage instructions and the Custom rank order from the following table.

Rank Order    On-Node Bytes/PE    On-Node Bytes/PE% of Total Bytes/PE    MPICH_RANK_REORDER_METHOD
Custom        8.092e+09           75.00%                                 3
SMP           4.580e+09           42.45%                                 1
Fold          2.290e+08           2.12%                                  2
RoundRobin    0.000e+00           0.00%                                  0

When this was tested, the run time only went down from 360 to 348 seconds, but the approach might become important when scaling higher.
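Why the rank order changes the on-node share of communication can be seen with a toy model. The sketch below does not reproduce CrayPat's estimate, which is based on measured bytes; it only counts nearest-neighbour pairs in a 20 x 16 grid that end up on the same node. The row-major rank numbering, the non-periodic four-neighbour stencil, 16 PEs per node and the 4 x 4 `blocked` placement are all assumptions made for illustration.

```python
def on_node_fraction(rows, cols, node_of):
    """Fraction of nearest-neighbour pairs whose two ranks share a node.

    Ranks are numbered row-major on a rows x cols grid (an assumption);
    node_of maps a rank to the node it is placed on.
    """
    on_node = total = 0
    for i in range(rows):
        for j in range(cols):
            rank = i * cols + j
            neighbours = []
            if j + 1 < cols:
                neighbours.append(rank + 1)      # right neighbour
            if i + 1 < rows:
                neighbours.append(rank + cols)   # neighbour below
            for other in neighbours:
                total += 1
                on_node += node_of(rank) == node_of(other)
    return on_node / total


ROWS, COLS, PES_PER_NODE = 20, 16, 16
NODES = ROWS * COLS // PES_PER_NODE


def smp(rank):
    """Consecutive ranks fill one node after another."""
    return rank // PES_PER_NODE


def round_robin(rank):
    """Consecutive ranks go to consecutive nodes."""
    return rank % NODES


def blocked(rank):
    """Hypothetical custom order: each node holds a 4 x 4 tile of the grid."""
    i, j = divmod(rank, COLS)
    return (i // 4) * (COLS // 4) + (j // 4)


for name, placement in [("blocked", blocked), ("smp", smp),
                        ("round_robin", round_robin)]:
    print(f"{name:12s} {on_node_fraction(ROWS, COLS, placement):.2%}")
```

The ordering of the three placements matches the table qualitatively: a tiled custom order keeps the most neighbours on-node, SMP keeps only the neighbours within a row, and round-robin keeps none.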
Auto-Generated MPI Rank Order File
# The 'USER_Time_hybrid' rank order in this file targets nodes with multi-core
# processors, based on Sent Msg Total Bytes collected for:
#
# Program: /lus/nid00023/malice/craypat/WORKSHOP/bh2o-demo/Rank/sweep3d/src/sweep3d
# Ap2 File: sweep3d.gmpi-u.ap2
# Number PEs: 768
# Max PEs/Node: 16
#
# To use this file, make a copy named MPICH_RANK_ORDER, and set the
# environment variable MPICH_RANK_REORDER_METHOD to 3 prior to
# executing the program.
#
0,532,64,564,32,572,96,540,8,596,72,524,40,604,24,588
104,556,16,628,80,636,56,620,48,516,112,580,88,548,120,612
1,403,65,435,33,411,97,443,9,467,25,499,105,507,41,475
73,395,81,427,57,459,17,419,113,491,49,387,89,451,121,483
6,436,102,468,70,404,38,412,14,444,46,476,110,508,78,500
86,396,30,428,62,460,54,492,118,420,22,452,94,388,126,484
129,563,193,531,161,571,225,539,241,595,233,523,249,603,185,555
153,587,169,627,137,635,201,619,177,515,145,579,209,547,217,611
7,405,71,469,39,437,103,413,47,445,15,509,79,477,31,501
111,397,63,461,55,429,87,421,23,493,119,389,95,453,127,485
134,402,198,434,166,410,230,442,238,466,174,506,158,394,246,474
190,498,254,426,142,458,150,386,182,418,206,490,214,450,222,482
128,533,192,541,160,565,232,525,224,573,240,597,184,557,248,605
168,589,200,517,152,629,136,549,176,637,144,621,208,581,216,613
5,439,37,407,69,447,101,415,13,471,45,503,29,479,77,511
53,399,85,431,21,463,61,391,109,423,93,455,117,495,125,487
2,530,34,562,66,538,98,522,10,570,42,554,26,594,50,602
18,514,74,586,58,626,82,546,106,634,90,578,114,618,122,610
135,315,167,339,199,347,259,307,231,371,239,379,191,331,247,299
175,363,159,323,143,355,255,291,207,275,183,283,151,267,215,223
133,406,197,438,165,470,229,414,245,446,141,478,237,502,253,398
157,510,189,462,173,430,205,390,149,422,213,454,181,494,221,486
130,316,260,340,194,372,162,348,226,308,234,380,242,332,250,300
202,364,186,324,154,356,138,292,170,276,178,284,210,218,268,146
4,535,36,543,68,567,100,527,12,599,44,575,28,559,76,607
52,591,20,631,60,639,84,519,108,623,92,551,116,583,124,615
3,440,35,432,67,400,99,408,11,464,43,496,27,472,51,504
19,392,75,424,59,456,83,384,107,416,91,488,115,448,123,480
132,401,196,441,164,409,228,433,236,465,204,473,244,393,188,497
252,505,140,425,212,457,156,385,172,417,180,449,148,489,220,481
131,534,195,542,163,566,227,526,235,574,203,598,243,558,187,606
251,590,211,630,179,638,139,622,155,550,171,518,219,582,147,614
761,660,737,652,705,668,745,692,673,700,641,684,713,644,753,724
729,732,681,756,721,716,764,676,697,748,689,657,740,665,649,708
760,528,736,536,704,560,744,520,672,568,712,592,752,552,640,600
728,584,680,624,720,512,696,632,688,616,664,544,608,656,648,576
762,659,738,651,706,667,746,643,714,691,674,699,754,683,730,723
722,731,763,658,642,755,739,675,707,650,682,715,698,666,690,747
257,345,265,313,281,305,273,337,609,369,577,377,617,329,513,529
545,297,633,361,625,321,585,537,601,289,553,353,593,521,569,561
256,373,261,341,264,349,280,317,272,381,269,309,285,333,277,365
352,301,320,325,288,357,328,304,360,312,376,293,296,368,336,344
258,338,266,346,282,314,274,370,766,306,710,378,742,330,678,362
646,298,750,322,718,354,758,290,734,662,686,670,726,702,694,654
262,375,263,343,270,311,271,351,286,319,278,342,287,350,279,374
294,318,358,383,359,310,295,382,326,303,327,367,366,335,302,334
765,661,709,663,741,653,711,669,767,655,743,671,749,695,679,703
677,727,751,693,647,701,717,687,757,685,733,725,719,735,645,759
Automatic Profile Analysis
A two-step process to create a guided event trace binary.
Program Instrumentation - Automatic Profiling Analysis
- Automatic profiling analysis (APA)
  - Provides a simple procedure to instrument and collect performance data as a first step for novice and expert users
  - Identifies top time consuming routines
  - Automatically creates an instrumentation template customized to the application for future in-depth measurement and analysis
- >90% of users don’t need more information
Steps to Using CrayPat “APA”, page 1
1. Access performance tools software
> module load perftools
2. Build program, retaining .o files
> make
Produces: a.out
3. Instrument binary
> pat_build -O apa a.out
Produces: a.out+pat
4. Modify batch script and run program
aprun a.out+pat
Produces: a.out+pat*.xf
5. Process raw performance data and create report
> pat_report a.out+pat*.xf
Produces: a.out+pat*.ap2, text report to stdout, a.out+pat*.apa, MPICH_RANK_XXX
Steps to Using CrayPat “APA”, page 2
1. Check the *apa file
> vi *.apa
2. Reinstrument binary
> pat_build -O *.apa
Produces: a.out+apa
3. Modify batch script and run program
aprun a.out+apa
Produces: a.out+apa*.xf
4. Process raw performance data and create report
> pat_report a.out+apa*.xf
Produces: a.out+apa*.ap2, text report to stdout, a.out+pat*.apa, MPICH_RANK_XXX
Light-weight application profiling
Steps to Using CrayPat-lite
1. Access light version of performance tools software
> module load perftools-lite
2. Build program
> make
Produces: a.out (instrumented program)
3. Run program (no modification to batch script)
aprun a.out
Produces: condensed report to stdout, a.out*.rpt (same as stdout), a.out*.ap2, MPICH_RANK_XXX files
Performance Statistics Available
- Job information
- Number of MPI ranks, …
- Wallclock
- Memory high water mark
- Performance counters (CPU only)
- Profile of top time consuming routines with load balance
- Observations and Instructions on how to get more info.
CrayPAT’s API
API for adding user instrumentation
- Users are able to define their own trace points via the region API.
  - #include <pat_api.h>
  - int PAT_region_begin (int id, char *label)
    - id is a unique identifier for the region,
    - label is the description that will appear in profiling output.
  - int PAT_region_end (int id)
    - id is a unique identifier for the region and must match the begin call.
Fortran equivalents, like MPI, are subroutines with an extra final integer argument for the return value
Trace On / Trace Off Example
include "pat_apif.h"
! Turn data recording off at the beginning of execution.
call PAT_record( PAT_STATE_OFF, istat )
...
! Turn data recording on for two regions of interest.
call PAT_record( PAT_STATE_ON, istat )
...
call PAT_region_begin( 1, "step 1", istat )
...
call PAT_region_end( 1, istat )
...
call PAT_region_begin( 2, "step 2", istat )
...
call PAT_region_end( 2, istat )
...
! Turn data recording off again.
call PAT_record( PAT_STATE_OFF, istat )
...
Profiler library
Get a first overview of your application.
Usage
- Cray Profiler is not a Cray supported library
  - No official support
  - It is used and developed by Cray’s benchmark team
- Output is text based
- Easy to use : Load the module and relink your application
> module load tools/cray_profiler
> {ftn,cc,CC} app.{f90,c,cpp} -o app.exe
- After running your application you should obtain a file profile.<qsub_id>.txt
The profile.*.txt file (System summary)
System summary, min, max, avg, minPE, maxPE
Wall clock time, 558.000, 667.940, 574.445, 36322, 40464
User processor time, 519.264, 555.263, 552.060, 0, 65603
System processor time, 0.740, 21.061, 1.676, 76001, 103200
Current processor clock (GHz), 2.500, 2.500, 2.500, 0, 0
Maximum processor clock (GHz), 2.501, 2.501, 2.501, 0, 0
Turbo processor clock (GHz), 2.481, 2.500, 2.500, 89499, 146953
Maximum memory usage (MB/proc), 7.637, 203.570, 85.976, 1462, 91440
Memory usage at exit (MB/proc), 7.629, 203.562, 85.968, 1462, 91440
Memory touched (MB/proc), 13.395, 209.863, 93.292, 1800, 91440
Minor page faults, 3429, 53725, 23882, 1800, 91440
Core Cycles, 1284039321940, 1307470759007, 1300649360435, 101616, 65603
Reference cycles, 51719382445, 52297559645, 52025267684, 115248, 65603
Node memory size (MB/node), 129024.000, 129024.000, 129024.000, 0, 0
User memory avail (MB/node), 129069.590, 129069.590, 129069.590, 0, 0
Memory free NUMA node 0 (MB), 43431.145, 44527.934, 43985.866, 184416, 4176
Memory free NUMA node 1 (MB), 43911.270, 44990.508, 44444.119, 4176, 184416
Total huge pages (MB/node), 1440.000, 1440.000, 1440.000, 0, 0
Used huge pages (MB/node), 1440.000, 1440.000, 1440.000, 0, 0
Node Energy Use (Joules), 138188, 166713, 153164, 89424, 13872
Average Power Use (Watts), 224.798, 290.333, 266.821, 31584, 25782
The profile.*.txt file (STDIO summary)
STDIO summary, min, max, avg, minPE, maxPE
Total I/O time, 0.001, 0.327, 0.003, 122902, 13392
fwrite total time, 0.000, 0.322, 0.000, 1463, 13392
fgets time, 0.001, 0.005, 0.003, 122902, 34128
fopen time, 0.001, 0.024, 0.006, 158833, 0
fclose time, 0.000, 0.332, 0.000, 47149, 173424
fflush time, 0.000, 0.001, 0.000, 44614, 151368
fwrite average time, 0.000, 0.107, 0.000, 1463, 13392
Total I/O bytes, 0.055M, 0.094M, 0.056M, 180, 0
fwrite total bytes, 0.000M, 0.001M, 0.000M, 1, 0
fgets total bytes, 0.055M, 0.093M, 0.056M, 180, 114960
Total I/O calls, 691, 1088, 703, 47, 0
fwrite total calls, 2, 35, 2, 1, 0
fgets calls, 689, 1053, 701, 47, 0
fopen calls, 45, 51, 47, 1, 0
fclose calls, 41, 47, 43, 1, 0
fflush calls, 1, 6, 1, 1, 0
Contents of the profile.*.txt file (MPI summary)
MPI summary, min, max, avg, minPE, maxPE
Init-Finalize elapsed time, 521.661, 522.108, 521.879, 182828, 6078
Total MPI time, 95.617, 117.670, 111.278, 166464, 163630
Total communication time, 0.298, 73.124, 27.476, 45, 188527
Total Wait and Probe time, 7.359, 19.622, 12.947, 188635, 68930
Total collective sync time, 8.859, 101.966, 67.015, 187392, 2254
Total MPI other time, 3.298, 4.491, 3.840, 17826, 187823
Maximum sends posted, 2.000, 234.000, 3.814, 1, 74016
Average sends posted, 1.400, 125.723, 2.272, 1, 27648
Maximum recvs posted, 2.000, 244.000, 3.804, 1, 153408
Average recvs posted, 1.400, 126.545, 2.269, 1, 27648
Send total time, 0.000, 0.031, 0.000, 188639, 186287
  size 1B - 15B, 0.000, 0.031, 0.000, 188639, 186287
Recv total time, 0.000, 72.549, 27.119, 0, 188570
  size 1B - 15B, 0.000, 72.549, 27.119, 0, 188570
Isend total time, 0.014, 0.106, 0.042, 3584, 179880
  size 0B, 0.000, 0.075, 0.022, 3216, 180125
  size 1MB - 16MB, 0.013, 0.046, 0.020, 56196, 152040
Irecv total time, 0.004, 0.404, 0.112, 56204, 145111
  size 1MB - 16MB, 0.004, 0.404, 0.112, 56204, 145111
Bcast total time, 0.003, 0.017, 0.008, 114704, 184150
  size 1B - 15B, 0.002, 0.004, 0.003, 13056, 184101
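Because the profile.*.txt summaries are comma-separated text, they are easy to post-process. The sketch below is not part of the Cray Profiler; `parse_summary` and `imbalance` are hypothetical helpers, and the max/avg ratio is one common way (assumed here) to express load imbalance. The sample rows are copied from the summaries above.

```python
def parse_summary(text):
    """Parse 'name , min, max, avg, minPE, maxPE' lines into a dict."""
    rows = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 6:
            continue
        name, lo, hi, avg, min_pe, max_pe = fields
        try:
            rows[name] = {
                "min": float(lo), "max": float(hi), "avg": float(avg),
                "minPE": int(min_pe), "maxPE": int(max_pe),
            }
        except ValueError:
            # Header lines and rows with unit suffixes (e.g. 0.055M)
            # are skipped in this sketch.
            continue
    return rows


def imbalance(row):
    """Relative excess of the slowest PE over the average."""
    return row["max"] / row["avg"] - 1.0


sample = """
System summary , min, max, avg, minPE, maxPE
Wall clock time , 558.000, 667.940, 574.445,36322,40464
Total MPI time , 95.617, 117.670, 111.278,166464,163630
Total collective sync time , 8.859, 101.966, 67.015,187392, 2254
"""

rows = parse_summary(sample)
for name, row in rows.items():
    print(f"{name:28s} imbalance {imbalance(row):6.1%}  (max on PE {row['maxPE']})")
```

Rows with a large imbalance, such as the collective sync time here, point at the PEs worth a closer look with CrayPat.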
MPI-IO Stats
New since August 2014
Cray MPI-IO Performance Metrics
- Many times MPI-IO calls are “black holes” with little performance information available.
  - Cray’s MPI-IO library attempts collective buffering and stripe matching to improve bandwidth and performance.
  - Users can help performance by favouring larger contiguous reads/writes over smaller scattered ones.
- The MPI-IO library now provides a way of collecting statistics on the actual read/write operations performed after collective buffering
  - Enable with: export MPICH_MPIIO_STATS=2
  - In addition to some information written to stdout, it also provides some csv files which can be analysed by a provided tool called cray_mpiio_summary
MPI-IO Performance Stats
Example output
- Running wrf on 19200 cores :
| MPIIO write access patterns for wrfout_d01_2013-07-01_01_00_00
| independent writes = 2
| collective writes = 5932800
| system writes = 99871
| stripe sized writes = 99291
| total bytes for writes = 104397074583 = 99560 MiB = 97 GiB
| ave system write size = 1045319
| number of write gaps = 2
| ave write gap size = 524284
| See "Optimizing MPI I/O on Cray XE Systems" S-0013-20 for explanations.
Best performance when avg write size > 1MB and few gaps. Careful selection of MPI types, file views and ordering of data on disk can improve this.
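The derived figures in this output follow directly from the raw counters, which a short sketch can confirm. The parser below is a hypothetical helper, not the provided cray_mpiio_summary tool; it assumes the `| key = value` layout shown above and keeps only the first integer after the equals sign.

```python
def parse_stats(text):
    """Parse '| key = value' lines, keeping the first integer after '='."""
    stats = {}
    for line in text.splitlines():
        line = line.strip().lstrip("|").strip()
        if "=" not in line:
            continue
        key, _, rest = line.partition("=")
        tokens = rest.split()
        if tokens and tokens[0].isdigit():
            stats[key.strip()] = int(tokens[0])
    return stats


sample = """
| MPIIO write access patterns for wrfout_d01_2013-07-01_01_00_00
| independent writes = 2
| collective writes = 5932800
| system writes = 99871
| stripe sized writes = 99291
| total bytes for writes = 104397074583 = 99560 MiB = 97 GiB
| ave system write size = 1045319
| number of write gaps = 2
| ave write gap size = 524284
"""

stats = parse_stats(sample)

# Average system write size: total bytes divided by the number of system writes.
avg_write = stats["total bytes for writes"] // stats["system writes"]
# Share of system writes that were exactly one stripe in size.
stripe_fraction = stats["stripe sized writes"] / stats["system writes"]
total_mib = stats["total bytes for writes"] // 2**20

print("average system write size:", avg_write, "bytes")
print(f"stripe sized writes: {stripe_fraction:.1%}")
print("total written:", total_mib, "MiB")
```

In this run almost every system write is stripe sized and only two gaps occur, which is the pattern the slide describes as giving the best performance.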
Wrf, 19200 cores run : Number of MPIIO calls over time
Wrf, 19200 cores : Number of system writes and reads
Wrf, 19200 cores : Number of stripesize aligned system write calls
Wrf, 19200 cores : Number of stripesize aligned system read calls
Wrf, 19200 cores run : Shows how many files are open at any time
Compiler Feedback and Variable Scoping.
Reveal
- For an OpenMP port the developer has to understand the scoping of the variables, i.e. whether variables are shared or private.
- Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.
  - Source code navigation using whole program analysis (data provided by the Cray compilation environment).
  - Coupling with performance data collected during execution by CrayPAT. Understand which high level serial loops could benefit from parallelism.
  - Enhanced loop mark listing functionality.
  - Assists users in optimizing code by providing variable scoping feedback and suggested compile directives.
Reveal with Loop Work Estimates
Cray Inc., CScADS 2012
Visualize CCE’s Loopmark with Performance Profile
Performance feedback
Loopmark and optimization annotations
Compiler feedback
Visualize CCE’s Loopmark with Performance Profile (2)
Integrated message ‘explain support’
View Pseudo Code for Inlined Functions
Inlined call sites marked
Expand to see pseudo code
Scoping Assistance – Review Scoping Results
User addresses parallelization issues for unresolved variables
Loops with scoping information are highlighted – red needs user assistance
Parallelization inhibitor messages are provided to assist user with analysis
Scoping Assistance – User Resolves Issues
Click on variable to view all occurrences in loop
Use Reveal’s OpenMP parallelization tips
Scoping Assistance – Generate Directive
Automatically generate OpenMP directive
Reveal generates example OpenMP directive