Efficient HPC Development and Production with Allinea Tools
Florent Lebeau Florent.Lebeau@arm.com 27/04/2017
Download the slides: https://goo.gl/GcNg8O
Agenda
09:30 - 10:00: Registration
10:00 - 10:30: Introduction and how to
Lunch Break
Coffee Break
– Allinea objective: continue to be the trusted leader in HPC tools across every platform
– The same team will continue to work with you, our customers and partners, and the wider HPC community
– Being part of ARM gives us strength to deliver on our roadmap faster
– We remain 100% committed to providing cross-platform tools for HPC
– Our engineering roadmap is aligned with upcoming architectures from every vendor
– Reduce HPC systems operating costs
– Resolve cutting-edge challenges
– Promote Efficiency (as opposed to Utilization)
– Transfer knowledge to HPC communities
– Reach highest levels of performance and scalability
– Improve scientific code quality and accuracy
– Forge Supercomputing – 64 tokens
– Performance Reports Supercomputing – 64 tokens
– Research Compilers: New compiler technology to support and evaluate next-generation ARM architecture.
– ARM Performance Libraries: Commercially-supported BLAS, LAPACK and FFT routines optimized for ARM-compatible microarchitectures.
– Userspace Performance Tools: New commercial tools to deliver actionable performance improvement advice to software developers.
– Open Source HPC: Identification of issues in ARM builds of open-source packages and the upstreaming
– Allinea Tools: Parallel debugger, profiler and performance analysis tools for HPC
The mission: enable the software ecosystem for large-scale ARM systems.
Current team of 50, from an initial team of 9 in July 2014.
Based in Manchester and Warwick, UK.
www.developer.arm.com/hpc
DDT
MAP
Performance Reports
… to CI tools (Jenkins, Bamboo, etc.)
In your opinion, what is the most critical step?
MODEL
ALGORITHM(S)
HIGH LEVEL CODE
BINARY
APPLICATION PROFILE
– Very simple start-up
– No source code needed
– Fully scalable, very low overhead
– Rich set of metrics
– Powerful data analysis
$ perf-report mpirun -n 8 ./myapp.exe arg1 arg2
$ cat myapp_8p_1t_YYYY-MM-DD_HH:MM.txt
$ firefox myapp_8p_1t_YYYY-MM-DD_HH:MM.html
$ perf-report --output="report.csv" mpirun -n 8 ./myapp.exe arg1 arg2
$> ssh -X <username>@salomon.it4i.cz
$> cp /home/flebeau/allinea_workshop.tar.gz .
$> tar xzvf allinea_workshop.tar.gz
$> cd allinea_workshop/
$> module load iimpi PerformanceReports/6.0.6 Forge/7.0.2
$> export ALLINEA_LICENCE_FILE=/home/flebeau/Licence.11373
(only necessary to use the temporary licence for the workshop)
OR
$> . common/env.sh
– Compilation flags
– Number of processes
– Number of nodes
$> cd allinea_workshop/1_*/c    (or: cd allinea_workshop/1_*/f90)
$> make
$> qsub ./job.sub    # Modify the job script accordingly
$> module load PerformanceReports/6.0.6
In the job script, prefix the mpirun/srun command with perf-report:
$> perf-report mpirun ./wave.exe
– Each process prints a message
– Diagnose the problem from evidence and intuition
– Analogous to bisection root finding
– Too much output – too many log files
dependable, I’ll be there for you.
BOHR BUG
debugging? Let me hide for a sec!
HEISEN BUG
name and you shall fear me.
MANDEL BUG
AND not
about that?
SCHRODIN BUG
Debugging a problem is much easier when you can:
Any technology sufficiently advanced is indistinguishable from magic. Unpredictable, dangerous, irresistible.
Identify and optimise bottlenecks
Flick to Allinea MAP to check the performance
Use Allinea DDT to check your code or find and fix the problem
Memory error? Deadlock?
Observe and debug your code step by step
A scalability issue prevents you from reaching performance goals
‒ Merges stacks from processes and threads
‒ Allinea DDT leaps to source automatically
‒ Detailed error message given to the user
‒ Some faults evident instantly from source
‒ Unique “Smart Highlighting”
‒ Sparklines comparing data across processes
– Run with Allinea tools
– Identify a problem
– Gather info: Who, Where, How, Why
– Fix
$ mpiicc -O0 -g myapp.c -o myapp.exe
$ ddt mpirun ./myapp.exe arg1 arg2
$ ddt --offline --output=report.html mpirun ./myapp.exe arg1 arg2
On the login node:
$ ddt & (or use the remote client)
In the job script to submit:
ddt --connect mpirun -n 8 ./myapp.exe arg1 arg2
Go to: http://www.allinea.com/products/downloads/
Connection name: VSC
Hostname: <username>@salomon.it4i.cz
Remote Installation Directory: /apps/all/Forge/7.0.2/
Remote script: <leave blank>
Click on “Test Remote Launch”, and if it works, click on “OK”
Connect to the remote cluster through the remote client
[Figure: matrices of dimension size, cut into slices. i, j, k: loop indexes; nslices = 4]
Algorithm
1- Master initialises matrices A, B & C
2- Master slices the matrices A & C, sends them to slaves
3- Master and Slaves perform the multiplication
4- Slaves send their results back to Master
5- Master writes the result Matrix C in an output file
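The slicing and multiplication steps above can be sketched serially, without MPI. This is a minimal illustration, not the workshop's mmult source: the function names `slice_bounds` and `mmult_slice` are hypothetical, the matrices are assumed to be square, stored row by row, and sliced by rows, and the kernel is assumed to accumulate into C.

```c
#include <stddef.h>

/* Hypothetical helper: rows [*first, *first + *count) of a size x size
   matrix that belong to one slice. Remainder rows go to the first slices. */
void slice_bounds(size_t size, size_t nslices, size_t slice,
                  size_t *first, size_t *count)
{
    size_t base = size / nslices;
    size_t extra = size % nslices;

    *count = base + (slice < extra ? 1 : 0);
    *first = slice * base + (slice < extra ? slice : extra);
}

/* Hypothetical kernel: C_slice += A_slice * B for `rows` rows.
   A and C point to the first row of the slice; B is the full matrix. */
void mmult_slice(size_t rows, size_t size,
                 const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t k = 0; k < size; k++)
            for (size_t j = 0; j < size; j++)
                C[i * size + j] += A[i * size + k] * B[k * size + j];
}
```

In an MPI version, each process would call `mmult_slice` on the slice it received, and the master would gather the slices of C before writing the output file.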
$ cd allinea_workshop/2*
$ make
mpirun ./mmult*_c.exe
$ qsub job.sub
$> module load Forge/7.0.2
$ ddt &
OR connect to your remote machine with the remote client
$ make clean
$ mpiicc -g -O0 -DDEBUG mmult1.c -o mmult1_c.exe
Or edit the Makefile
launch the debugger
ddt --connect mpirun ./mmult*_c.exe
$ qsub job.sub
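In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization. (Wikipedia)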
– Select representative test case(s)
– Profile
– Analyse and find bottlenecks
– Optimise
– Profile again to check performance results and iterate
– Tracing: records and timestamps all operations. Intrusive.
– Instrumenting: add instructions in the source code to collect data. Intrusive.
– Sampling: automatically collect data. Not intrusive.
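To make the "Instrumenting" approach concrete, here is a minimal sketch of hand-written instrumentation around one function. The names `region_stats`, `kernel` and `instrumented_kernel` are hypothetical, and the standard `clock()` call measures CPU time rather than wall-clock time. It also shows why the approach is intrusive: the measurement code sits in the source and runs on every call.

```c
#include <time.h>

/* Hypothetical record for one instrumented region. */
typedef struct {
    unsigned long calls;   /* number of times the region was entered */
    double seconds;        /* accumulated CPU time spent in the region */
} region_stats;

/* CPU time used so far, in seconds. */
double now_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical workload: partial sum of the harmonic series. */
double kernel(int n)
{
    double sum = 0.0;
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    return sum;
}

/* Same workload, with timing instructions added by hand. */
double instrumented_kernel(int n, region_stats *stats)
{
    double start = now_seconds();
    double result = kernel(n);

    stats->seconds += now_seconds() - start;
    stats->calls++;
    return result;
}
```

A sampling profiler collects comparable data by interrupting the unmodified program at intervals, so no such code has to be added.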
Hotspot
Spike
Flat
DEBUG PROFILE
Fine tune bottlenecks Find unexpected issues Resolve bugs Test codes
ACCESSIBLE
POWERFUL
INNOVATIVE
Low overhead measurement
Easy to use
Deep
Simple
with Allinea MAP
Prepare
strategy with Allinea MAP
Fine tune the code with tracing tool
$ mpiicc -O3 -g myapp.c -o myapp.exe
$ map mpirun ./myapp.exe arg1 arg2
$ map --profile mpirun ./myapp.exe arg1 arg2
$ map --profile --start-after=TIME mpirun ./myapp.exe arg1 arg2
$ map --profile --stop-after=TIME mpirun ./myapp.exe arg1 arg2
$ map myapp_8p_1t_YYYY-MM-DD_HH:MM.map
$> cd allinea_workshop/3_*/
$> vi ./job.sub    # Modify the submission script accordingly
$> vi ./Makefile   # Modify the Makefile accordingly
$> make
$> qsub ./job.sub
$> module load Forge/7.0.2
$> map --profile mpirun ./mmult*_c.exe
Example of Intel Sandy Bridge:

Level         Size (bytes)   Latency from next level (cycles)
Registers     192            4
L1 Cache      32k            12
L2 Cache      256k           26
L3 Cache      2M             230-360
Main memory   2G             ?
– There is an opportunity for cache re-use
– Data is local to the core for quick usage
– CPU gets data from memory to cache before it is actually needed
Registers, L1 Cache, L2 Cache, L3 Cache, Main memory
CPUs
DATA STREAM
– Temporal locality: use of data within a short time of its last use
– Spatial locality: use memory references close to memory already referenced
Temporal locality example
for (i = 0; i < N; i++) {
    for (loop = 0; loop < 10; loop++) {
        … = … x[i] …
    }
}
Spatial locality example
for (i = 0; i < N*s; i += s) {
    … = … x[i] …
}
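for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        A[i*n+j] = …
    }
}
[Figure: elements of A accessed for i=0, n=4, at j=0 and j=1]
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        A[j*n+i] = …
    }
}
[Figure: elements of A accessed for i=0, n=4, at j=0 and j=1, marked HIT and MISS]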
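The difference between the two loop orders can be illustrated with a toy cache model. This is a sketch under strong assumptions, not a description of real hardware: the cache is assumed to hold a single line of `LINE_ELEMS` contiguous elements, and both `LINE_ELEMS` and `count_misses` are hypothetical names introduced here.

```c
#include <stddef.h>

#define LINE_ELEMS 4   /* assumption: matrix elements per cache line */

/* Toy model: the cache remembers only the most recently loaded line.
   Returns the number of misses for a full traversal of an n x n matrix,
   using index i*n+j when row_major is non-zero and j*n+i otherwise. */
size_t count_misses(size_t n, int row_major)
{
    size_t misses = 0;
    size_t cached_line = (size_t)-1;   /* nothing cached yet */

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            size_t index = row_major ? i * n + j : j * n + i;
            size_t line = index / LINE_ELEMS;

            if (line != cached_line) {   /* line not in cache: load it */
                misses++;
                cached_line = line;
            }
        }
    }
    return misses;
}
```

With n = 4 in this model, the `A[i*n+j]` order misses 4 times (once per line), while the `A[j*n+i]` order misses on all 16 accesses because each access lands on a different line than the previous one.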
VERSION CONTROL DEVELOP REVIEW INTEGRATE TEST RELEASE
In weather and forecasting PRODUCTION : TEST ratio
FORGE
ANALYZE (Allinea Performance Reports) DEBUGGING (Allinea DDT) PERF OPTIMIZATION (Allinea MAP)
Demand for software efficiency
Debug/optimize, edit, commit, build, repeat
Demand for developer efficiency
Version Control (e.g. CVS)
Continuous Integration (e.g. Jenkins)
Open Interfaces (e.g. JSON APIs)
DB
NEW VERSION
Fast
  basic: … passed to memory functions (e.g. malloc, free, ALLOCATE, DEALLOCATE, ...)
  check-fence: … allocation has not been overwritten when it is freed.
  free-protect: … (using hardware memory protection) so subsequent read/writes cause a fatal error.
  Added goodness: … statistics, etc.
Balanced
  free-blank: … freed memory with a known value.
  alloc-blank: … new allocations with a known value.
  check-heap: … corruption (e.g. due to writes to invalid memory addresses).
  realloc-copy: … new pointer when re-allocating a memory allocation (e.g. due to realloc)
Thorough
  check-blank: … that was blanked when a pointer was allocated/freed has been overwritten.
  check-funcs: … (mostly string … pointers.
See user-guide: Chapter 12.3.2
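The idea behind a fence check can be sketched in a few lines of C. This is an illustration of the general technique only, not how Allinea DDT implements it: `fenced_malloc`, `fenced_free`, `FENCE_SIZE` and `FENCE_BYTE` are hypothetical names, and the sketch guards only the end of each allocation.

```c
#include <stdlib.h>
#include <string.h>

#define FENCE_SIZE 8      /* assumption: guard bytes after each allocation */
#define FENCE_BYTE 0xFD   /* assumption: known pattern stored in the fence */

/* Layout of each block: [size_t size][user bytes][fence bytes].
   Note: the returned pointer is only aligned for size_t in this sketch. */
void *fenced_malloc(size_t size)
{
    unsigned char *raw = malloc(sizeof(size_t) + size + FENCE_SIZE);

    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof size);                      /* remember size */
    memset(raw + sizeof(size_t) + size, FENCE_BYTE, FENCE_SIZE);
    return raw + sizeof(size_t);
}

/* Frees the block. Returns 0 if the fence is intact, or -1 if something
   wrote past the end of the allocation. */
int fenced_free(void *ptr)
{
    unsigned char *user = ptr;
    unsigned char *raw = user - sizeof(size_t);
    size_t size;
    int status = 0;

    memcpy(&size, raw, sizeof size);
    for (size_t i = 0; i < FENCE_SIZE; i++)
        if (user[size + i] != FENCE_BYTE)
            status = -1;
    free(raw);
    return status;
}
```

Writing 17 bytes into a 16-byte allocation from `fenced_malloc` overwrites the first fence byte, so the matching `fenced_free` returns -1. The overwrite is only reported when the memory is freed, not when it happens.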
$> cd allinea_workshop/4_*/
$> vi ./job.sub    # Modify the submission script accordingly
$> make DEBUG=1
$> qsub ./job.sub
$> ddt --offline --output=orig.html --mem-debug mpirun ./mmult*_c.exe
– Owner computes: balance done through data distribution
– Independent tasks: balance done through prediction/statistics
– A mix of various components: balance between scalar workload and communications (for instance)
– Processors may be idle for an extended period of time
– They could have been doing some work instead of burning energy
There is an asymmetry between processors having too much work and having not enough work. It is better to have one processor that finishes a task early than to have one that is overloaded so that all others wait for it.
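The asymmetry can be checked with a small calculation. The sketch below assumes that every processor must wait for the slowest one at the end of a step; the function names `makespan` and `idle_fraction` are hypothetical.

```c
#include <stddef.h>

/* Time for the whole step: every processor waits for the slowest one. */
double makespan(const double *work, size_t nprocs)
{
    double longest = 0.0;

    for (size_t p = 0; p < nprocs; p++)
        if (work[p] > longest)
            longest = work[p];
    return longest;
}

/* Fraction of the total processor time spent idle during the step. */
double idle_fraction(const double *work, size_t nprocs)
{
    double total = 0.0;
    double span = makespan(work, nprocs);

    if (span == 0.0)
        return 0.0;   /* no work at all: nothing is wasted */
    for (size_t p = 0; p < nprocs; p++)
        total += work[p];
    return 1.0 - total / (span * (double)nprocs);
}
```

For 16 units of work on 4 processors, the distribution {5, 5, 5, 1} (one processor finishes early) completes in 5 units, while {3, 3, 3, 7} (one processor overloaded) takes 7 units and leaves the other three waiting.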
Workload imbalance webinar video https://youtu.be/MScwYTNGOp0
$> cd allinea_workshop/5_*/
$> make
$> qsub ./job.sub
$> map --profile mpirun ./mmult*_c.exe
– Collective operation precedes write
– Single rank performs write
– Very low performance

– Ranks write into same file
– Problems with file locks?
– Strided or segmented?

– Every rank writes own data
– Issues with large file counts
– Ranks combine to write to a subset of files
IO webinar video https://youtu.be/2rjxgsYOG-E
Technical Support team: support@allinea.com
Questions: florent.lebeau@arm.com