
SLIDE 1

Efficient HPC Development and Production with Allinea Tools

Florent Lebeau Florent.Lebeau@arm.com 27/04/2017

SLIDE 2

Download the slides

https://goo.gl/GcNg8O

SLIDE 3

Agenda

09:30 - 10:00: Registration
10:00 - 10:30: Introduction and how to use Allinea tools on Salomon
10:30 - 11:00: Maximize application efficiency
11:00 - 12:00: Fix an application crash
12:00 - 13:00: Lunch Break

13:00 - 14:00: Optimize memory accesses
14:00 - 14:45: Detect memory leaks
14:45 - 15:00: Coffee Break

15:00 - 15:45: Resolve workload imbalances
15:45 - 16:00: Wrap-up and Q&A session

SLIDE 4

And now…

Let's talk about us!

SLIDE 5

Example: Weather and Forecasting models

SLIDE 6

Building blocks for better science

Scalability

Enable multi-physics simulations
Run larger, more accurate models
Resolve ground-breaking scientific problems

Efficiency

Reduce wasted resources (energy…)
Maximize science output per $
Minimize time to result

Simplicity

Pro-actively and automatically detect faults
Provide applications on various hardware
Facilitate technical dialogue with scientists

SLIDE 7

About Allinea

Allinea: leading toolkit for HPC application developers
Since December 2016, Allinea has been part of ARM

– Allinea objective: continue to be the trusted leader in HPC tools across every platform

This means:

– The same team will continue to work with you, our customers and partners, and the wider HPC community
– Being part of ARM gives us strength to deliver on our roadmap faster
– We remain 100% committed to providing cross-platform tools for HPC
– Our engineering roadmap is aligned with upcoming architectures from every vendor

SLIDE 8

They trust Allinea

SLIDE 9

Where to find Allinea tools

From small to very large tools provision

Over 65% of Top 100 HPC systems

From 1,000 to 700,000 core tools usage

6 of the Top 10 HPC systems

Millions of cores usage

Future leadership systems

SLIDE 10

Allinea Tools

Helping maximize HPC efficiency

– Reduce HPC systems operating costs
– Resolve cutting-edge challenges
– Promote Efficiency (as opposed to Utilization)
– Transfer knowledge to HPC communities

Helping the HPC community design the best applications

– Reach highest levels of performance and scalability
– Improve scientific code quality and accuracy

Available at VSB:

– Forge Supercomputing – 64 tokens
– Performance Reports Supercomputing – 64 tokens

SLIDE 11

ARM HPC Tools

Research Compilers: New compiler technology to support and evaluate next-generation ARM architecture.
ARM Performance Libraries: Commercially-supported BLAS, LAPACK and FFT routines optimized for ARM-compatible microarchitectures.
Userspace Performance Tools: New commercial tools to deliver actionable performance improvement advice to software developers.
Open Source HPC: Identification of issues in ARM builds of open-source packages and the upstreaming of fixes.
Allinea Tools: Parallel debugger, profiler and performance analysis tools for HPC

The mission: Enable the software ecosystem for large-scale ARM systems.
Current team of 50, from an initial team of 9 in July 2014
Based in Manchester and Warwick, UK.
www.developer.arm.com/hpc

SLIDE 12

Roadmap Update – 7.0 December 2016

DDT

Intel Knights Landing high-bandwidth memory debugging
IBM Spectrum MPI support
Reverse connect via gateway nodes

MAP

PAPI metrics (advanced metrics pack)
MPI_THREAD_MULTIPLE support (metrics on main thread only)
IBM Spectrum MPI support
Reverse connect via gateway nodes
Workflow integration: export function-level performance data to CI tools (Jenkins, Bamboo etc)

Performance Reports

Custom metrics – add section to your own reports
MPI_THREAD_MULTIPLE support (metrics on main thread only)
IBM Spectrum MPI support
Workflow integration: export all metrics data to CI tools (Jenkins, Bamboo etc)

SLIDE 13

Maximise Application Efficiency

SLIDE 14

Building a scientific application

In your opinion, what is the most critical step?

MODEL

Science

ALGORITHM(S)

Complexity
Parallelism
Scalability

HIGH LEVEL CODE

Libraries
Data

BINARY

Compilation

APPLICATION PROFILE

Profile
Tune

Criticality

Don’t start with code optimisation: the profile of the application depends on the earlier steps

SLIDE 15

“Learn” with Allinea Performance Reports

Very simple start-up
No source code needed
Fully scalable, very low overhead
Rich set of metrics
Powerful data analysis

SLIDE 16

Compile your application for production
Prefix your usual launch command with “perf-report”

$ perf-report mpirun -n 8 ./myapp.exe arg1 arg2

Open the result

$ cat myapp_8p_1t_YYYY-MM-DD_HH:MM.txt
$ firefox myapp_8p_1t_YYYY-MM-DD_HH:MM.html

Specify the format or the output name

$ perf-report --output=report.csv mpirun -n 8 ./myapp.exe arg1 arg2

Allinea Performance Reports Cheat sheet

SLIDE 17

Getting started for the workshop

Connect to the cluster from a terminal

$> ssh -X <username>@salomon.it4i.cz

Retrieve the workshop archive

$> cp /home/flebeau/allinea_workshop.tar.gz .
$> tar xzvf allinea_workshop.tar.gz
$> cd allinea_workshop/

Set the environment

$> module load iimpi PerformanceReports/6.0.6 Forge/7.0.2
$> export ALLINEA_LICENCE_FILE=/home/flebeau/Licence.11373
(only necessary to use the temporary licence for the workshop)
OR
$> . common/env.sh

SLIDE 18

Go to exercise 1

Exercise objectives
Generate a performance report of a simple code
Find the best parameters to maximize the application efficiency

– Compilation flags
– Number of processes
– Number of nodes

Commands to use:

$> cd allinea_workshop/1_*/c    (or: cd allinea_workshop/1_*/f90)
$> make
$> qsub ./job.sub   # Modify the job script accordingly

Key Allinea commands

$> module load PerformanceReports/6.0.6
In the job script, prefix the mpirun/srun command with perf-report:
$> perf-report mpirun ./wave.exe

SLIDE 19

Fix an Application Crash

SLIDE 20

Print statement debugging

The first debugger: print statements

– Each process prints a message or value at defined locations
– Diagnose the problem from evidence and intuition

A long slow process

– Analogous to bisection root finding

Broken at modest scale

– Too much output – too many log files

SLIDE 21

Typical types of bugs

BOHRBUG: Steady and dependable, I’ll be there for you.

HEISENBUG: Oh, you are debugging? Let me hide for a sec!

MANDELBUG: Chaos is my name and you shall fear me.

SCHRÖDINBUG: I am buggy AND not buggy. How about that?

SLIDE 22

Debugging by Discipline

Debugging a problem is much easier when you can:

Make and undo changes fearlessly
– Use source control (CVS, …)
Track what you’ve tried so far
– Write logbooks
Reproduce bugs with a single command
– Create and use test scripts

SLIDE 23

Debugging by Magic

Any technology sufficiently advanced is indistinguishable from magic. Unpredictable, dangerous, irresistible.

SLIDE 24

Allinea Forge One Unified Solution

Identify and optimise bottlenecks
Flick to Allinea MAP to check the performance
Use Allinea DDT to check your code or find and fix the problem: Memory error? Deadlock?
Observe and debug your code step by step
A scalability issue prevents you from reaching performance goals

SLIDE 25

Allinea DDT helps to understand

Who showed rogue behaviour?

‒ Merges stacks from processes and threads

Where did it happen?

‒ Allinea DDT leaps to source automatically

How did it happen?

‒ Detailed error message given to the user
‒ Some faults evident instantly from source

Why did it happen?

‒ Unique “Smart Highlighting”
‒ Sparklines comparing data across processes

Workflow: Run with Allinea tools, identify a problem, gather info (Who, Where, How, Why), fix

SLIDE 26

Prepare the code

$ mpiicc -O0 -g myapp.c -o myapp.exe

Start Allinea DDT in interactive mode

$ ddt mpirun ./myapp.exe arg1 arg2

Start Allinea DDT in offline mode

$ ddt --offline --output=report.html mpirun ./myapp.exe arg1 arg2

Use reverse connect

On the login node:

$ ddt & (or use the remote client)

In the job script to submit:

ddt --connect mpirun -n 8 ./myapp.exe arg1 arg2

Learn your spells

SLIDE 27

Allinea Remote Client

Install the Allinea Remote Client

Go to : http://www.allinea.com/products/downloads/

Connect to the cluster with the remote client

Connection name: VSC
Hostname: <username>@salomon.it4i.cz
Remote Installation Directory: /apps/all/Forge/7.0.2/
Remote script: <leave blank>
Click on “Test Remote Launch”, and if it works, click on “OK”
Connect to the remote cluster through the remote client

Connect to the cluster with a terminal to submit the job to connect

SLIDE 28

Exercise: Matrix Multiplication: C = A x B + C

[Figure: matrices A, B and C of dimension size, traversed with loop indexes i, j, k; nslices = 4]

Algorithm

1- Master initialises matrices A, B & C
2- Master slices the matrices A & C, sends them to slaves
3- Master and Slaves perform the multiplication
4- Slaves send their results back to Master
5- Master writes the result Matrix C in an output file

SLIDE 29

Exercise 2: Compile and run the application

In a terminal, change directory and compile the application

$ cd allinea_workshop/2_*/
$ make

Check the job script job.sub is running the application with

mpirun ./mmult*_c.exe

Submit the job

$ qsub job.sub

Check the output. What is the error message?

SLIDE 30

Exercise 2: Debug with Allinea DDT

Load the module

$> module load Forge/7.0.2

Launch the debugger from the terminal with

$ ddt &

OR connect to your remote machine with the remote client

Recompile the application for debugging

$ make clean
$ mpiicc -g -O0 -DDEBUG mmult1.c -o mmult1_c.exe
Or edit the Makefile

In the job script, prefix the original execution command with “ddt --connect” to launch the debugger

ddt --connect mpirun ./mmult*_c.exe

Submit the job

$ qsub job.sub

Accept the incoming connection in the GUI of Allinea DDT
Can you find and fix the bug?

SLIDE 31

Lunch break Back at 13:00!

SLIDE 33

Optimize memory accesses

SLIDE 34

Why profiling?

How to improve the performance of an application?
Profiling: a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization. (Wikipedia)

How?

– Select representative test case(s)
– Profile
– Analyse and find bottlenecks
– Optimise
– Profile again to check performance results and iterate

SLIDE 35

How to profile?

Different methods

– Tracing
  Records and timestamps all operations
  Intrusive
– Instrumenting
  Add instructions in the source code to collect data
  Intrusive
– Sampling
  Automatically collect data
  Not intrusive

SLIDE 36

Some types of profiles

Hotspot

One function accounts for more than 80% of the runtime
Large speed-up potential
Best optimisation scenario

Spike

The application spends most of the time in a few functions
Speed-up potential depends on the aggregated time
Variable optimisation time

Flat

Runtime split evenly between numerous functions, each one with a very small runtime
Little speed-up potential without algorithmic changes
Worst optimisation scenario

SLIDE 37

Allinea Forge: a toolkit to save developers’ time

Allinea Forge (DEBUG, PROFILE)

Fine tune bottlenecks
Find unexpected issues
Resolve bugs
Test codes

ACCESSIBLE

Unique single interface
Easy to start and use

POWERFUL

State of the art features
Fully scalable

INNOVATIVE

Tackles new challenges
For latest HPC systems

SLIDE 38

Allinea MAP Performance made easy

Low overhead measurement

Accurate, non-intrusive application performance profiling
Seamless – no recompilation or relinking required

Easy to use

Source code viewer pinpoints bottleneck locations
Zoom in to explore iterations, functions and loops

Deep

Measures CPU, communication, I/O and memory to identify problem causes
Identifies vectorization and cache performance

SLIDE 39

Allinea MAP and tracing tools: a great synergy

Simple optimization with Allinea MAP

Characterize performance at-scale with a lightweight tool
See which lines of code are hotspots
Identify common problems at once

Prepare optimization strategy with Allinea MAP

Identify loop(s) to instrument
Identify performance counter(s) to record
Document performance issues to communicate to profiling experts

Fine tune the code with tracing tool

Retrieve low-level details using traces
Fix up CPU usage to make the code fly

SLIDE 40

Prepare the code

$ mpiicc -O3 -g myapp.c -o myapp.exe

Start Allinea MAP in interactive mode

$ map mpirun ./myapp.exe arg1 arg2

Start Allinea MAP in profile or “offline” mode

$ map --profile mpirun ./myapp.exe arg1 arg2

Specify when to start/stop profiling (TIME in seconds)

$ map --profile --start-after=TIME mpirun ./myapp.exe arg1 arg2
$ map --profile --stop-after=TIME mpirun ./myapp.exe arg1 arg2

Open the result

$ map myapp_8p_1t_YYYY-MM-DD_HH:MM.map

Allinea MAP Cheat sheet

SLIDE 41

Exercise 3: Application Performance

Exercise objectives
Generate a profile with Allinea MAP
Find one (or more) performance bottlenecks
Make suggestions to improve the matrix multiplication kernel
Commands to use:

$> cd allinea_workshop/3_*/
$> vi ./job.sub    # Modify the submission script accordingly
$> vi ./Makefile   # Modify the Makefile accordingly
$> make
$> qsub ./job.sub

Key Allinea commands

$> module load Forge/7.0.2
$> map --profile mpirun ./mmult*_c.exe

SLIDE 42

Typical memory hierarchy nowadays

Example of Intel Sandy Bridge

Level          Size (bytes)   Latency from next level (cycles)
Registers      192            4
L1 Cache       32k            12
L2 Cache       256k           26
L3 Cache       2M             230-360
Main memory    2G             ?

SLIDE 43

Speeding up memory accesses

High performance is possible when:

– There is an opportunity for cache re-use
– Data is local to the core for quick usage
– CPU gets data from memory to cache before it is actually needed

[Figure: data stream from main memory through the L3, L2 and L1 caches to the registers and CPUs]

SLIDE 44

Memory access patterns

Data locality

– Temporal locality: use of data within a short time of its last use
– Spatial locality: use memory references close to memory already referenced

Temporal locality example

for (i=0 ; i < N; i++) {
  for (loop=0; loop < 10; loop++) {
    … = … x[i] …
  }
}

Spatial locality example

for (i=0 ; i < N*s; i+=s) {
  … = … x[i] …
}

SLIDE 45

Memory Accesses and Cache Misses

for(i=0; i<n; i++) {
  for(j=0; j<n; j++) {
    A[i*n+j]=…
  }
}

for(i=0; i<n; i++) {
  for(j=0; j<n; j++) {
    A[j*n+i]=…
  }
}

[Figure: elements of A accessed by each loop nest for i=0, n=4 at j=0 and j=1, marked HIT or MISS]

SLIDE 46

Detect Memory Leaks

SLIDE 47

Development workflow in the industry

VERSION CONTROL DEVELOP REVIEW INTEGRATE TEST RELEASE

In weather and forecasting, PRODUCTION : TEST ratio approx. 1:2 or 1:3

SLIDE 48

Software tools-centric view

FORGE

ANALYZE (Allinea Performance Reports)
DEBUGGING (Allinea DDT)
PERF OPTIMIZATION (Allinea MAP)

Demand for software efficiency
Debug/optimize, edit, commit, build, repeat
Demand for developer efficiency
Version Control (e.g. CVS, etc…)
Continuous Integration (e.g. Jenkins, etc.)
Open Interfaces (e.g. JSON APIs)

DB

NEW VERSION

SLIDE 49

Memory Bugs

A strange behaviour where the application “sometimes” crashes is a typical sign of a memory bug

Allinea DDT is able to force the crash to happen

SCHRÖDINBUG: I am buggy AND not buggy. How about that?

SLIDE 50

Memory debugging menu in Allinea DDT

SLIDE 51

Heap debugging options available

Fast

basic: Detect invalid pointers passed to memory functions (e.g. malloc, free, ALLOCATE, DEALLOCATE, ...)
check-fence: Check the end of an allocation has not been overwritten when it is freed.
free-protect: Protect freed memory (using hardware memory protection) so subsequent read/writes cause a fatal error.
Added goodness: Memory usage, statistics, etc.

Balanced

free-blank: Overwrite the bytes of freed memory with a known value.
alloc-blank: Initialise the bytes of new allocations with a known value.
check-heap: Check for heap corruption (e.g. due to writes to invalid memory addresses).
realloc-copy: Always copy data to a new pointer when re-allocating a memory allocation (e.g. due to realloc)

Thorough

check-blank: Check to see if space that was blanked when a pointer was allocated/freed has been overwritten.
check-funcs: Check the arguments of addition functions (mostly string operations) for invalid pointers.

See user-guide: Chapter 12.3.2

SLIDE 52

Exercise 4: Check Memory Defects

Exercise objectives
Use Allinea DDT in offline mode for continuous integration
Find the memory problem by configuring Allinea DDT correctly
Make suggestions to fix the bug
Check for memory leaks by generating an offline debugging report
Typical run commands to use:

$> cd allinea_workshop/4_*/
$> vi ./job.sub   # Modify the submission script accordingly
$> make DEBUG=1
$> qsub ./job.sub

Key Allinea commands

$> ddt --offline --output=orig.html --mem-debug mpirun ./mmult*_c.exe

SLIDE 53

Coffee break Back in 15 min!

SLIDE 55

Resolve Workload Imbalances

SLIDE 56

Load balancing in theory

Examples of load balancing

– Owner computes
  Balance done through data distribution
– Independent tasks
  Balance done through prediction/statistics
– A mix of various components
  Balance between scalar workload and communications (for instance)

Balancing the workload is critical because:

– Processors may be idle for an extended period of time
– They could have been doing some work instead of burning energy

SLIDE 57

Load balancing can be counter-intuitive

Corollary:

There is an asymmetry between processors having too much work and having not enough work. It is better to have one processor that finishes a task early than to have one that is overloaded so that all the others wait for it.

When it comes to load balancing, the “costliest” function shown by the profiler is not the bottleneck. The bottleneck is the “cheapest” one.

Workload imbalance webinar video https://youtu.be/MScwYTNGOp0

SLIDE 58

Exercise 5: Workload Imbalance

Exercise objectives
Expose the workload imbalance in the code (on 1 or 2 nodes)
Make suggestions to fix the problem
Commands to use:

$> cd allinea_workshop/5_*/
$> make
$> qsub ./job.sub

Key Allinea commands

$> map --profile mpirun ./mmult*_c.exe

SLIDE 59

Complexity of I/O Systems - Schemes

1 – 1 (Spokesperson):

– Collective operation precedes write
– Single rank performs write
– Very low performance

N – 1:

– Ranks write into same file
– Problems with file locks?
– Strided or segmented?

N – N:

– Every rank writes own data
– Issues with large file counts

N – M:

– Ranks combine to write to a subset of files

IO webinar video https://youtu.be/2rjxgsYOG-E

SLIDE 60

SLIDE 61

Summary

Help the scientific community

Allinea’s mission.

Strengthen professional development workflows

Reduce your costs

Prepare for application changes and migration

Change is inevitable

SLIDE 62

Thank you!

Technical Support team: support@allinea.com
Questions: florent.lebeau@arm.com