Efficient HPC Development and Production with Allinea Tools



slide-1
SLIDE 1

Efficient HPC Development and Production with Allinea Tools

Florent Lebeau Florent.Lebeau@arm.com 27/04/2017

slide-2
SLIDE 2

Download the slides

  • https://goo.gl/GcNg8O
slide-3
SLIDE 3

Agenda

  • 09:30 - 10:00: Registration
  • 10:00 - 10:30: Introduction and how to use Allinea tools on Salomon
  • 10:30 - 11:00: Maximize application efficiency
  • 11:00 - 12:00: Fix an application crash
  • 12:00 - 13:00: Lunch Break
  • 13:00 - 14:00: Optimize memory accesses
  • 14:00 - 14:45: Detect memory leaks
  • 14:45 - 15:00: Coffee Break
  • 15:00 - 15:45: Resolve workload imbalances
  • 15:45 - 16:00: Wrap-up and Q&A session
slide-4
SLIDE 4

And now…

Let's talk about us!

slide-5
SLIDE 5

Example: Weather and Forecasting models

slide-6
SLIDE 6

Building blocks for better science

Scalability

  • Enable multi-physics simulations
  • Run larger, more accurate models
  • Resolve ground-breaking scientific problems

Efficiency

  • Reduce wasted resources (energy…)
  • Maximize science output per $
  • Minimize time to result

Simplicity

  • Pro-actively and automatically detect faults
  • Provide applications on various hardware
  • Facilitate technical dialogue with scientists
slide-7
SLIDE 7

About Allinea

  • Allinea: leading toolkit for HPC application developers
  • Since December 2016, Allinea has been part of ARM
    – Allinea objective: continue to be the trusted leader in HPC tools across every platform
  • This means:
    – The same team will continue to work with you, our customers and partners, and the wider HPC community
    – Being part of ARM gives us strength to deliver on our roadmap faster
    – We remain 100% committed to providing cross-platform tools for HPC
    – Our engineering roadmap is aligned with upcoming architectures from every vendor

slide-8
SLIDE 8

They trust Allinea

slide-9
SLIDE 9

Where to find Allinea tools

  • Over 65% of Top 100 HPC systems: from small to very large tools provision
  • 6 of the Top 10 HPC systems: from 1,000 to 700,000 core tools usage
  • Future leadership systems: millions of cores usage

slide-10
SLIDE 10

Allinea Tools

  • Helping maximize HPC efficiency
    – Reduce HPC systems operating costs
    – Resolve cutting-edge challenges
    – Promote Efficiency (as opposed to Utilization)
    – Transfer knowledge to HPC communities
  • Helping the HPC community design the best applications
    – Reach highest levels of performance and scalability
    – Improve scientific code quality and accuracy
  • Available at VSB:
    – Forge Supercomputing – 64 tokens
    – Performance Reports Supercomputing – 64 tokens

slide-11
SLIDE 11

ARM HPC Tools

  • Research Compilers: new compiler technology to support and evaluate next-generation ARM architecture.
  • ARM Performance Libraries: commercially-supported BLAS, LAPACK and FFT routines optimized for ARM-compatible microarchitectures.
  • Userspace Performance Tools: new commercial tools to deliver actionable performance improvement advice to software developers.
  • Open Source HPC: identification of issues in ARM builds of open-source packages and the upstreaming of fixes.
  • Allinea Tools: parallel debugger, profiler and performance analysis tools for HPC.

The mission: enable the software ecosystem for large-scale ARM systems.

  • Current team of 50, from an initial team of 9 in July 2014
  • Based in Manchester and Warwick, UK
  • www.developer.arm.com/hpc

slide-12
SLIDE 12

Roadmap Update – 7.0 December 2016

DDT

  • Intel Knights Landing high-bandwidth memory debugging
  • IBM Spectrum MPI support
  • Reverse connect via gateway nodes

MAP

  • PAPI metrics (advanced metrics pack)
  • MPI_THREAD_MULTIPLE support (metrics on main thread only)
  • IBM Spectrum MPI support
  • Reverse connect via gateway nodes
  • Workflow integration: export function-level performance data to CI tools (Jenkins, Bamboo, etc.)

Performance Reports

  • Custom metrics – add sections to your own reports
  • MPI_THREAD_MULTIPLE support (metrics on main thread only)
  • IBM Spectrum MPI support
  • Workflow integration: export all metrics data to CI tools (Jenkins, Bamboo, etc.)

slide-13
SLIDE 13

Maximise Application Efficiency

slide-14
SLIDE 14

Building a scientific application

In your opinion, what is the most critical step?

MODEL

  • Science

ALGORITHM(S)

  • Complexity
  • Parallelism
  • Scalability

HIGH LEVEL CODE

  • Libraries
  • Data

BINARY

  • Compilation

APPLICATION PROFILE

  • Profile
  • Tune

Criticality

Don’t go for code optimisation first, as the profile of the application depends on the earlier steps.

slide-15
SLIDE 15

“Learn” with Allinea Performance Reports

  • Very simple start-up
  • No source code needed
  • Fully scalable, very low overhead
  • Rich set of metrics
  • Powerful data analysis

slide-16
SLIDE 16
Allinea Performance Reports cheat sheet

  • Compile your application for production
  • Prefix your usual launch command with “perf-report”

$ perf-report mpirun -n 8 ./myapp.exe arg1 arg2

  • Open the result

$ cat myapp_8p_1t_YYYY-MM-DD_HH:MM.txt
$ firefox myapp_8p_1t_YYYY-MM-DD_HH:MM.html

  • Specify the format or the output name

$ perf-report --output="report.csv" mpirun -n 8 ./myapp.exe arg1 arg2

slide-17
SLIDE 17

Getting started for the workshop

  • Connect to the cluster from a terminal

$> ssh -X <username>@salomon.it4i.cz

  • Retrieve the workshop archive

$> cp /home/flebeau/allinea_workshop.tar.gz .
$> tar xzvf allinea_workshop.tar.gz
$> cd allinea_workshop/

  • Set the environment

$> module load iimpi PerformanceReports/6.0.6 Forge/7.0.2
$> export ALLINEA_LICENCE_FILE=/home/flebeau/Licence.11373

(The export is only necessary to use the temporary licence for the workshop.)

OR

$> . common/env.sh

slide-18
SLIDE 18

Go to exercise 1

  • Exercise objectives
    – Generate a performance report of a simple code
    – Find the best parameters to maximize the application efficiency: compilation flags, number of processes, number of nodes
  • Commands to use:

$> cd allinea_workshop/1_*/c    (or: cd allinea_workshop/1_*/f90)
$> make
$> qsub ./job.sub    # Modify the job script accordingly

  • Key Allinea commands

$> module load PerformanceReports/6.0.6

In the job script, prefix the mpirun/srun command with perf-report:

$> perf-report mpirun ./wave.exe

slide-19
SLIDE 19

Fix an Application Crash

slide-20
SLIDE 20

Print statement debugging

[Figure: plot of f(x) against x]

  • The first debugger: print statements
    – Each process prints a message or value at defined locations
    – Diagnose the problem from evidence and intuition
  • A long, slow process
    – Analogous to bisection root finding
  • Broken at modest scale
    – Too much output, too many log files

slide-21
SLIDE 21

Typical types of bugs

  • Bohrbug: Steady and dependable, I’ll be there for you.
  • Heisenbug: Oh, you are debugging? Let me hide for a sec!
  • Mandelbug: Chaos is my name and you shall fear me.
  • Schrödinbug: I am buggy AND not buggy. How about that?

slide-22
SLIDE 22

Debugging by Discipline

Debugging a problem is much easier when you can:

  • Make and undo changes fearlessly
    – Use source control (CVS, …)
  • Track what you’ve tried so far
    – Write logbooks
  • Reproduce bugs with a single command
    – Create and use test scripts

slide-23
SLIDE 23

Debugging by Magic

Any technology sufficiently advanced is indistinguishable from magic. Unpredictable, dangerous, irresistible.

slide-24
SLIDE 24

Allinea Forge One Unified Solution

  • Observe and debug your code step by step
  • Use Allinea DDT to check your code or find and fix the problem: memory error? Deadlock?
  • A scalability issue prevents you from reaching performance goals
  • Flick to Allinea MAP to check the performance
  • Identify and optimise bottlenecks

slide-25
SLIDE 25
Allinea DDT helps to understand

  • Who showed rogue behaviour?
    – Merges stacks from processes and threads
  • Where did it happen?
    – Allinea DDT leaps to source automatically
  • How did it happen?
    – Detailed error message given to the user
    – Some faults evident instantly from source
  • Why did it happen?
    – Unique “Smart Highlighting”
    – Sparklines comparing data across processes

Workflow: run with Allinea tools, identify a problem, gather info (who, where, how, why), fix.

slide-26
SLIDE 26
Learn your spells

  • Prepare the code

$ mpiicc -O0 -g myapp.c -o myapp.exe

  • Start Allinea DDT in interactive mode

$ ddt mpirun ./myapp.exe arg1 arg2

  • Start Allinea DDT in offline mode

$ ddt --offline --output=report.html mpirun ./myapp.exe arg1 arg2

  • Use reverse connect

On the login node:

$ ddt &    (or use the remote client)

In the job script to submit:

ddt --connect mpirun -n 8 ./myapp.exe arg1 arg2

slide-27
SLIDE 27

Allinea Remote Client

  • Install the Allinea Remote Client

Go to: http://www.allinea.com/products/downloads/

  • Connect to the cluster with the remote client

Connection name: VSC
Hostname: <username>@salomon.it4i.cz
Remote Installation Directory: /apps/all/Forge/7.0.2/
Remote script: <leave blank>

Click on “Test Remote Launch” and, if it works, click on “OK”. Then connect to the remote cluster through the remote client.

  • Connect to the cluster with a terminal to submit the job that connects back to the client

slide-28
SLIDE 28

Exercise: Matrix Multiplication: C = A x B + C

[Figure: matrices A, B and C of dimension size; i, j, k are the loop indexes; nslices = 4]

Algorithm

1- Master initialises matrices A, B & C
2- Master slices the matrices A & C, sends them to slaves
3- Master and Slaves perform the multiplication
4- Slaves send their results back to Master
5- Master writes the result Matrix C in an output file
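
The slicing in steps 2 and 3 can be sketched serially. This is only an illustration, not the workshop's mmult source: the function names are hypothetical, the matrices are assumed to be square, stored row-major in flat arrays, and size is assumed to be divisible by nslices. Each loop iteration in mmult_sliced stands in for the work one MPI rank would do on its own slice of A and C.

```c
#include <stddef.h>

/* Compute C += A x B for rows [row0, row0 + nrows) only.
 * A, B and C are size x size matrices stored row-major. */
void mmult_slice(size_t size, size_t row0, size_t nrows,
                 const double *A, const double *B, double *C)
{
    for (size_t i = row0; i < row0 + nrows; i++) {
        for (size_t j = 0; j < size; j++) {
            double sum = C[i * size + j];
            for (size_t k = 0; k < size; k++)
                sum += A[i * size + k] * B[k * size + j];
            C[i * size + j] = sum;
        }
    }
}

/* Serial stand-in for the master/slave scheme: slice number "rank"
 * covers a contiguous band of rows of A and C (assumes size % nslices == 0). */
void mmult_sliced(size_t size, size_t nslices,
                  const double *A, const double *B, double *C)
{
    size_t rows_per_slice = size / nslices;
    for (size_t rank = 0; rank < nslices; rank++)
        mmult_slice(size, rank * rows_per_slice, rows_per_slice, A, B, C);
}
```

In an MPI version, only the rows of A and C belonging to a slice would be sent to each rank, while B is needed in full by every rank.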

slide-29
SLIDE 29

Exercise 2: Compile and run the application

  • In a terminal, change directory and compile the application

$ cd allinea_workshop/2_*
$ make

  • Check that the job script job.sub runs the application with

mpirun ./mmult*_c.exe

  • Submit the job

$ qsub job.sub

  • Check the output. What is the error message?
slide-30
SLIDE 30

Exercise 2: Debug with Allinea DDT

  • Load the module

$> module load Forge/7.0.2

  • Launch the debugger from the terminal with

$ ddt &

OR connect to your remote machine with the remote client

  • Recompile the application for debugging

$ make clean
$ mpiicc -g -O0 -DDEBUG mmult1.c -o mmult1_c.exe

Or edit the Makefile

  • In the job script, prefix the original execution command with “ddt --connect” to launch the debugger

ddt --connect mpirun ./mmult*_c.exe

  • Submit the job

$ qsub job.sub

  • Accept the incoming connection in the GUI of Allinea DDT
  • Can you find and fix the bug?
slide-31
SLIDE 31

Lunch break Back at 13:00!

slide-32
SLIDE 32

Agenda

  • 09:30 - 10:00: Registration
  • 10:00 - 10:30: Introduction and how to use Allinea tools on Salomon
  • 10:30 - 11:00: Maximize application efficiency
  • 11:00 - 12:00: Fix an application crash
  • 12:00 - 13:00: Lunch Break
  • 13:00 - 14:00: Optimize memory accesses
  • 14:00 - 14:45: Detect memory leaks
  • 14:45 - 15:00: Coffee Break
  • 15:00 - 15:45: Resolve workload imbalances
  • 15:45 - 16:00: Wrap-up and Q&A session
slide-33
SLIDE 33

Optimize memory accesses

slide-34
SLIDE 34

Why profiling?

  • How to improve the performance of an application?
  • Profiling: “a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization.” (Wikipedia)
  • How?
    – Select representative test case(s)
    – Profile
    – Analyse and find bottlenecks
    – Optimise
    – Profile again to check performance results and iterate

slide-35
SLIDE 35

How to profile?

  • Different methods
    – Tracing: records and timestamps all operations (intrusive)
    – Instrumenting: add instructions in the source code to collect data (intrusive)
    – Sampling: automatically collect data (not intrusive)
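
To make the instrumenting method concrete, here is a minimal sketch in standard C. The names region_stats, kernel and kernel_instrumented are hypothetical and chosen for illustration; the point is that the measurement code lives inside the program itself, which is why the method is called intrusive.

```c
#include <time.h>

/* Hypothetical per-region record: number of calls and accumulated CPU time. */
struct region_stats {
    unsigned long calls;
    double seconds;
};

/* The code being measured: a partial harmonic sum, used as a placeholder. */
double kernel(int n)
{
    double sum = 0.0;
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    return sum;
}

/* Instrumented wrapper: the extra statements around the call are the
 * added instructions that collect the data. */
double kernel_instrumented(int n, struct region_stats *stats)
{
    clock_t start = clock();
    double result = kernel(n);
    stats->seconds += (double)(clock() - start) / CLOCKS_PER_SEC;
    stats->calls++;
    return result;
}
```

A sampling profiler needs none of this: it interrupts the unmodified program periodically and records where it is.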

slide-36
SLIDE 36

Some types of profiles

Hotspot

  • One function corresponds to more than 80% of the runtime
  • Large speed-up potential
  • Best optimisation scenario

Spike

  • The application spends most of the time in a few functions
  • Speed-up potential depends on the aggregated time
  • Variable optimisation time

Flat

  • Runtime split evenly between numerous functions, each one with a very small runtime
  • Little speed-up potential without algorithmic changes
  • Worst optimisation scenario
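
The difference in speed-up potential between these profiles can be estimated with Amdahl's law. The sketch below is an illustration rather than part of the slides; the function name is hypothetical, and the 5% figure used for a flat profile is an assumed example value.

```c
/* Amdahl's law: overall speed-up when a fraction f of the runtime
 * (0 <= f <= 1) is made s times faster and the rest is unchanged. */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

Doubling the speed of a hotspot that takes 80% of the runtime gives overall_speedup(0.8, 2.0), about 1.67. Doubling the speed of one function in a flat profile that takes an assumed 5% of the runtime gives overall_speedup(0.05, 2.0), about 1.03.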
slide-37
SLIDE 37

Allinea Forge: a toolkit to save developers’ time

Allinea Forge (debug and profile)

  • Fine tune bottlenecks
  • Find unexpected issues
  • Resolve bugs
  • Test codes

ACCESSIBLE

  • Unique single interface
  • Easy to start and use

POWERFUL

  • State of the art features
  • Fully scalable

INNOVATIVE

  • Tackles new challenges
  • For latest HPC systems

slide-38
SLIDE 38

Allinea MAP Performance made easy

Low overhead measurement

  • Accurate, non-intrusive application performance profiling
  • Seamless – no recompilation or relinking required

Easy to use

  • Source code viewer pinpoints bottleneck locations
  • Zoom in to explore iterations, functions and loops

Deep

  • Measures CPU, communication, I/O and memory to identify problem causes
  • Identifies vectorization and cache performance
slide-39
SLIDE 39

Allinea MAP and tracing tools: a great synergy

Simple optimization with Allinea MAP

  • Characterize performance at-scale with a lightweight tool
  • See which lines of code are hotspots
  • Identify common problems at once

Prepare optimization strategy with Allinea MAP

  • Identify loop(s) to instrument
  • Identify performance counter(s) to record
  • Document performance issues to communicate to profiling experts

Fine tune the code with tracing tool

  • Retrieve low-level details using traces
  • Fix up CPU usage to make the code fly
slide-40
SLIDE 40
Allinea MAP cheat sheet

  • Prepare the code

$ mpiicc -O3 -g myapp.c -o myapp.exe

  • Start Allinea MAP in interactive mode

$ map mpirun ./myapp.exe arg1 arg2

  • Start Allinea MAP in profile or “offline” mode

$ map --profile mpirun ./myapp.exe arg1 arg2

  • Specify when to start/stop profiling (TIME in seconds)

$ map --profile --start-after=TIME mpirun ./myapp.exe arg1 arg2
$ map --profile --stop-after=TIME mpirun ./myapp.exe arg1 arg2

  • Open the result

$ map myapp_8p_1t_YYYY-MM-DD_HH:MM.map

slide-41
SLIDE 41

Exercise 3: Application Performance

  • Exercise objectives
    – Generate a profile with Allinea MAP
    – Find one (or more) performance bottlenecks
    – Make suggestions to improve the matrix multiplication kernel
  • Commands to use:

$> cd allinea_workshop/3_*/
$> vi ./job.sub     # Modify the submission script accordingly
$> vi ./Makefile    # Modify the Makefile accordingly
$> make
$> qsub ./job.sub

  • Key Allinea commands

$> module load Forge/7.0.2
$> map --profile mpirun ./mmult*_c.exe

slide-42
SLIDE 42

Typical memory hierarchy nowadays

Example: Intel Sandy Bridge

Level          Size (bytes)   Latency from next level (cycles)
Registers      192            4
L1 Cache       32k            12
L2 Cache       256k           26
L3 Cache       2M             230-360
Main memory    2G             ?

slide-43
SLIDE 43

Speeding up memory accesses

  • High performance is possible when:
    – There is an opportunity for cache re-use
    – Data is local to the core for quick usage
    – CPU gets data from memory to cache before it is actually needed

[Figure: data stream from main memory through the L3, L2 and L1 caches and the registers to the CPUs]

slide-44
SLIDE 44

Memory access patterns

  • Data locality
    – Temporal locality: use of data within a short time of its last use
    – Spatial locality: use of memory references close to memory already referenced

Temporal locality example

for (i = 0; i < N; i++) {
    for (loop = 0; loop < 10; loop++) {
        … = … x[i] …
    }
}

Spatial locality example

for (i = 0; i < N*s; i += s) {
    … = … x[i] …
}

slide-45
SLIDE 45

Memory Accesses and Cache Misses

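Row-wise traversal (consecutive accesses are adjacent in memory):

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        A[i*n+j] = …
    }
}

Column-wise traversal (consecutive accesses are n elements apart):

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        A[j*n+i] = …
    }
}

[Figure: elements of A touched for i=0, n=4 at j=0 and j=1 in each version, marked as cache HIT or MISS]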
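
The two access patterns above can be compared with a pair of small functions. This is a sketch with hypothetical names: both functions compute the same sum over an n x n row-major array, and only the order in which memory is visited differs.

```c
#include <stddef.h>

/* Row-wise traversal: the inner loop walks adjacent addresses (stride 1),
 * so each cache line that is fetched gets fully used. */
double sum_row_wise(const double *A, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j];
    return sum;
}

/* Column-wise traversal: the inner loop jumps n elements at a time,
 * so for large n most accesses land on a different cache line. */
double sum_column_wise(const double *A, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += A[j * n + i];
    return sum;
}
```

Timing both functions on an array much larger than the last-level cache is a simple way to observe the cost of the strided version; the size of the gap depends on the machine.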

slide-46
SLIDE 46

Detect Memory Leaks

slide-47
SLIDE 47

Development workflow in the industry

  • Version control
  • Develop, review, integrate, test, release

In weather and forecasting, the PRODUCTION : TEST ratio is approx. 1:2 or 1:3.
slide-48
SLIDE 48

Software tools-centric view

FORGE

  • ANALYZE (Allinea Performance Reports)
  • DEBUGGING (Allinea DDT)
  • PERF OPTIMIZATION (Allinea MAP)

Demand for software efficiency

  • Debug/optimize, edit, commit, build, repeat

Demand for developer efficiency

  • Version Control (e.g. CVS)
  • Continuous Integration (e.g. Jenkins)
  • Open Interfaces (e.g. JSON APIs)

[Diagram labels: DB, NEW VERSION]

slide-49
SLIDE 49

Memory Bugs

  • Strange behaviour where the application “sometimes” crashes is a typical sign of a memory bug
  • Allinea DDT is able to force the crash to happen

Schrödinbug: I am buggy AND not buggy. How about that?

slide-50
SLIDE 50

Memory debugging menu in Allinea DDT

slide-51
SLIDE 51

Heap debugging options available

Fast

  • basic: detect invalid pointers passed to memory functions (e.g. malloc, free, ALLOCATE, DEALLOCATE, ...)
  • check-fence: check that the end of an allocation has not been overwritten when it is freed
  • free-protect: protect freed memory (using hardware memory protection) so subsequent reads/writes cause a fatal error
  • Added goodness: memory usage, statistics, etc.

Balanced

  • free-blank: overwrite the bytes of freed memory with a known value
  • alloc-blank: initialise the bytes of new allocations with a known value
  • check-heap: check for heap corruption (e.g. due to writes to invalid memory addresses)
  • realloc-copy: always copy data to a new pointer when re-allocating a memory allocation (e.g. due to realloc)

Thorough

  • check-blank: check to see if space that was blanked when a pointer was allocated/freed has been overwritten
  • check-funcs: check the arguments of addition functions (mostly string operations) for invalid pointers

See user guide: Chapter 12.3.2
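
As a rough illustration of what leak detection reports, the sketch below counts live allocations by hand. It does not show how Allinea DDT works internally: tracked_malloc, tracked_free and the counter are hypothetical helpers, and slice_sum_leaky contains a deliberate bug for demonstration.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping: number of allocations not yet freed. */
static long live_allocations = 0;

void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_allocations++;
    return p;
}

void tracked_free(void *p)
{
    if (p != NULL)
        live_allocations--;
    free(p);
}

long outstanding_allocations(void)
{
    return live_allocations;
}

/* Deliberately buggy: the temporary buffer is never released. */
double slice_sum_leaky(size_t n)
{
    double *buf = tracked_malloc(n * sizeof *buf);
    double sum = 0.0;
    if (buf == NULL)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = (double)i;
        sum += buf[i];
    }
    return sum;               /* BUG: buf leaks on every call */
}

/* Fixed version: every allocation is paired with a free. */
double slice_sum_fixed(size_t n)
{
    double *buf = tracked_malloc(n * sizeof *buf);
    double sum = 0.0;
    if (buf == NULL)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = (double)i;
        sum += buf[i];
    }
    tracked_free(buf);
    return sum;
}
```

After a call to slice_sum_fixed the counter is back to its previous value, while each call to slice_sum_leaky leaves it one higher. A memory debugger provides the same kind of accounting without any change to the source.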

slide-52
SLIDE 52

Exercise 4: Check Memory Defects

  • Exercise objectives
    – Use Allinea DDT in offline mode for continuous integration
    – Find the memory problem by configuring Allinea DDT correctly
    – Make suggestions to fix the bug
    – Check for memory leaks by generating an offline debugging report
  • Typical run commands to use:

$> cd allinea_workshop/4_*/
$> vi ./job.sub    # Modify the submission script accordingly
$> make DEBUG=1
$> qsub ./job.sub

  • Key Allinea commands

$> ddt --offline --output=orig.html --mem-debug mpirun ./mmult*_c.exe

slide-53
SLIDE 53

Coffee break Back in 15 min!

slide-54
SLIDE 54

Agenda

  • 09:30 - 10:00: Registration
  • 10:00 - 10:30: Introduction and how to use Allinea tools on Salomon
  • 10:30 - 11:00: Maximize application efficiency
  • 11:00 - 12:00: Fix an application crash
  • 12:00 - 13:00: Lunch Break
  • 13:00 - 14:00: Optimize memory accesses
  • 14:00 - 14:45: Detect memory leaks
  • 14:45 - 15:00: Coffee Break
  • 15:00 - 15:45: Resolve workload imbalances
  • 15:45 - 16:00: Wrap-up and Q&A session
slide-55
SLIDE 55

Resolve Workload Imbalances

slide-56
SLIDE 56

Load balancing in theory

  • Examples of load balancing
    – Owner computes: balance done through data distribution
    – Independent tasks: balance done through prediction/statistics
    – A mix of various components: balance between scalar workload and communications (for instance)
  • Balancing the workload is critical because:
    – Processors may be idle for an extended period of time
    – They could have been doing some work instead of burning energy

slide-57
SLIDE 57

Load balancing can be counter intuitive

Corollary:

There is an asymmetry between processors having too much work and having not enough work. It is better to have one processor that finishes a task early than having one that is overloaded so that all others wait for it.

When it comes to load balancing, the “costliest” function shown by the profiler is not the bottleneck. The bottleneck is the “cheapest” one.

Workload imbalance webinar video https://youtu.be/MScwYTNGOp0
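
The asymmetry described above can be checked with a little arithmetic. In this sketch (hypothetical function names, made-up timings), each rank's compute time for one synchronised step is given in seconds, the step lasts as long as the slowest rank, and every other rank waits for the difference.

```c
#include <stddef.h>

/* A synchronised step ends when the slowest rank finishes. */
double step_wall_time(const double *rank_seconds, size_t nranks)
{
    double longest = 0.0;
    for (size_t r = 0; r < nranks; r++)
        if (rank_seconds[r] > longest)
            longest = rank_seconds[r];
    return longest;
}

/* Total time all ranks spend waiting for the slowest one. */
double total_idle_time(const double *rank_seconds, size_t nranks)
{
    double longest = step_wall_time(rank_seconds, nranks);
    double idle = 0.0;
    for (size_t r = 0; r < nranks; r++)
        idle += longest - rank_seconds[r];
    return idle;
}
```

With four ranks at {10, 10, 10, 5} seconds, the step still takes 10 s and only 5 s of processor time is idle. With {10, 10, 10, 15}, the step takes 15 s and 15 s of processor time is idle: one overloaded rank costs far more than one underloaded rank.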

slide-58
SLIDE 58

Exercise 5: Workload Imbalance

  • Exercise objectives
  • Expose the workload imbalance in the code (on 1 or 2 nodes)
  • Make suggestions to fix the problem
  • Commands to use:

$> cd allinea_workshop/5_*/
$> make
$> qsub ./job.sub

  • Key Allinea commands

$> map --profile mpirun ./mmult*_c.exe

slide-59
SLIDE 59

Complexity of I/O Systems - Schemes

  • 1 – 1 (Spokesperson):
    – Collective operation precedes write
    – Single rank performs write
    – Very low performance
  • N – 1:
    – Ranks write into same file
    – Problems with file locks?
    – Strided or segmented?
  • N – N:
    – Every rank writes own data
    – Issues with large file counts
  • N – M:
    – Ranks combine to write to a subset of files

IO webinar video https://youtu.be/2rjxgsYOG-E
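
One possible way to assign ranks to files in the N – M scheme is to give each file a contiguous group of ranks. The mapping below is an assumption made for illustration (the slides do not prescribe one), and the function name is hypothetical.

```c
/* Map a rank (0 <= rank < nranks) to a file index (0 <= index < nfiles)
 * so that contiguous groups of ranks share one file.
 * Assumes 1 <= nfiles <= nranks. */
int file_for_rank(int rank, int nranks, int nfiles)
{
    return (int)((long)rank * nfiles / nranks);
}
```

With nfiles equal to 1 this degenerates to the N – 1 scheme (every rank maps to file 0), and with nfiles equal to nranks it becomes N – N (each rank gets its own file).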

slide-60
SLIDE 60
slide-61
SLIDE 61

Summary

Help the scientific community

  • Allinea’s mission.

Strengthen professional development workflows

  • Reduce your costs

Prepare for application changes and migration

  • Change is inevitable
slide-62
SLIDE 62

Thank you!

Technical Support team: support@allinea.com
Questions: florent.lebeau@arm.com