Computing with Allinea 06 Nov 2015 VSB, Ostrava Florent Lebeau - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Towards Efficiency Computing with Allinea

06 Nov 2015 VSB, Ostrava Florent Lebeau flebeau@allinea.com

slide-2
SLIDE 2

Agenda

09:30-10:00 Registration

10:00-10:15 Introduction to Allinea tools

10:15-11:45 Getting started with Allinea Forge for profiling

11:45-13:00 Lunch break

13:00-14:15 Getting started with Allinea Forge for debugging

14:15-14:45 Coffee break

14:45-16:00 Allinea tools and Intel Xeon Phi coprocessors

16:00-16:30 Questions and wrap-up

slide-3
SLIDE 3

Introduction to Allinea Tools

slide-4
SLIDE 4

Allinea: an expanding company

  • HPC tools company since 2002

– Leading in the HPC software tools market worldwide
– Global customer base

  • Helping the HPC community design the best applications

– Unrivaled productive and easy-to-use development environment…
– … to help reach the highest level of performance and scalability

  • Helping HPC production make the most of their clusters

– Unique solutions to reduce HPC systems operating costs
– Innovative approach to help resolve cutting-edge challenges

slide-5
SLIDE 5

Improve cluster efficiency

  • “Optimization” is not always a synonym for “efficiency”

– Cluster productivity or cluster usage

  • Possible efficiency needs during production

– Define and enforce best practices (scale, parameters…)
– Provision and validate cluster upgrades and changes
– Detect and resolve hardware or software faults impacting performance

  • Effortless one-touch reports with Allinea

– Generates explicit and readable reports with metrics and explanations
– Understand optimized HPC applications effortlessly

slide-6
SLIDE 6

Better runs, quickly

  • No source code needed
  • Less than 5% runtime overhead
  • Fully scalable
  • Run regularly, or in regression tests
  • Explicit and usable output

slide-7
SLIDE 7
  • Allinea Forge: a modern integrated environment for HPC developers

‒ Rebranding of Allinea Unified (Allinea DDT + Allinea MAP)

  • Supporting the lifecycle of application development and improvement

‒ Productively debug code with Allinea DDT
‒ Enhance application performance with Allinea MAP

  • Designed for productivity

‒ Consistent, easy-to-use tools
‒ Fewer failed jobs

  • Available to you

Need to dive into the code?

slide-8
SLIDE 8

Allinea Forge One Unified Solution

  • Observe and debug your code step by step
  • Flick to Allinea DDT
  • Common interface and settings files
  • Increasing memory usage? Memory leak!
  • Workload imbalance? Possible partitioner bug!
  • Use Allinea MAP to find a bottleneck

slide-9
SLIDE 9

Allinea MAP Performance made easy

Low overhead measurement

  • Accurate, non-intrusive application performance profiling
  • Seamless – no recompilation or relinking required

Easy to use

  • Source code viewer pinpoints bottleneck locations
  • Zoom in to explore iterations, functions and loops

Deep

  • Measures CPU, communication, I/O and memory to identify problem causes
  • Identifies vectorization and cache performance
slide-10
SLIDE 10
  • Who showed rogue behaviour?

‒ Merges stacks from processes and threads

  • Where did it happen?

‒ Allinea DDT leaps to source automatically

  • How did it happen?

‒ Detailed error message given to the user
‒ Some faults evident instantly from source

  • Why did it happen?

‒ Unique “Smart Highlighting”
‒ Sparklines comparing data across processes

Allinea DDT helps to understand

Run with Allinea tools → Identify a problem → Gather info (Who, Where, How, Why) → Fix

slide-11
SLIDE 11

Latest Changes

slide-12
SLIDE 12

Reverse connect: the end of template files

slide-13
SLIDE 13

Horizontal & vertical zoom

slide-14
SLIDE 14

Energy Metrics: Quantify gains immediately

[Figure: energy metrics compared for a CPU run and a GPU run]

slide-15
SLIDE 15

DDT

  • Reverse connect (no more queue configuration)
  • ARMv8 and Power8 support

MAP

  • Energy metrics
  • Start/stop sampling by time
  • Zoomable metrics
  • Stdout/stderr in .map files
  • Limited ARMv8 support

Performance Reports

  • Energy metrics
  • Limited ARMv8 support

Allinea Forge and Performance Reports 5.1

  • Hardware (system + CPU) provides energy-related data (currently: IPMI-based power sensors)
  • API extracts this data and feeds Allinea’s tools (currently: Intel Energy Checker SDK)
  • Allinea’s tools process data at runtime to bring a unique perspective

slide-16
SLIDE 16

Profile and Optimize with Allinea Forge

slide-17
SLIDE 17

Code optimisation can be time-consuming. Efficient tools can help you focus on the most important bottlenecks.

The quest for the Holy Performance

slide-18
SLIDE 18

Getting Started with profiling on Salomon

  • Load the environment

$ module load iimpi/5.5.0
$ module load Forge/5.1-43967

  • Prepare the code for profiling

$ mpiicc -g -O3 myapp.c -o myapp.exe

  • Modify job script to prefix the mpirun command

map --profile mpirun myapp.exe

  • Submit job

$ qsub myjob.sub

  • View result

$ map myapp_Xp_Yt_YYYY-MM-DD-HH-MM.map
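
As a minimal sketch, the job script from the steps above could look like the following. The PBS directives (queue qexp, one 24-core node) are borrowed from the Xeon Phi slide later in this deck, and the account name youraccount is a placeholder; treat all of them as assumptions and adapt them to your project. The snippet only writes myjob.sub so that it can be inspected before submission.

```shell
# Write a hypothetical PBS job script for profiling with Allinea MAP.
# Queue, node selection and account are placeholders, not site defaults.
cat > myjob.sub <<'EOF'
#!/bin/bash
#PBS -q qexp
#PBS -l select=1:ncpus=24
#PBS -A youraccount

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

module load iimpi/5.5.0
module load Forge/5.1-43967

# Prefix mpirun with "map --profile" to collect a .map file without the GUI.
map --profile mpirun ./myapp.exe
EOF

echo "wrote myjob.sub"
```

Once the job has finished, the resulting .map file can be opened with the map command on the login node.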

slide-19
SLIDE 19

Hands-on Exercise: Matrix Multiplication C = A x B + C

[Figure: matrices A, B and C of dimension size; i, j, k are the loop indexes; nslices = 4]

Algorithm

1- Master initializes matrices A, B & C
2- Master slices the matrices A & C, sends them to slaves
3- Master and Slaves perform the multiplication
4- Slaves send their results back to Master
5- Master writes the result Matrix C in an output file
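
To make the slicing in step 2 concrete, here is a small sketch of how row ranges might be assigned to each slice. It assumes row-wise slicing of A and C and gives any leftover rows to the first slices; the exercise source code may slice differently, and the function name slice_bounds is hypothetical.

```shell
# Print the row range owned by each slice of a size x size matrix.
# $1: matrix dimension (size), $2: number of slices (nslices).
slice_bounds() {
  size=$1
  nslices=$2
  base=$((size / nslices))   # rows every slice gets
  rem=$((size % nslices))    # leftover rows, given to the first slices
  start=0
  i=0
  while [ "$i" -lt "$nslices" ]; do
    rows=$base
    if [ "$i" -lt "$rem" ]; then
      rows=$((rows + 1))
    fi
    end=$((start + rows - 1))
    echo "slice $i: rows $start-$end"
    start=$((end + 1))
    i=$((i + 1))
  done
}

# Example with nslices = 4, as on the slide, and an assumed size of 8.
slice_bounds 8 4
```

With size = 8 and nslices = 4 every slice holds two rows; an uneven split is a common source of workload imbalance, which is the kind of pattern to look for in the profile.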

slide-20
SLIDE 20

Exercise objectives:

– Load Allinea Forge environment
– Compile a code for Allinea MAP
– Submit the job through the queue
– Discover Allinea MAP interface and features
– Optimize a simple code

Content:

– Handout with step by step instructions
– Source code in C and F90 + Makefile
– Submission script

Tutorial archive on Salomon in:

/scratch/temp/flebeau/allinea_workshop.tar.gz

Profiling the application

slide-21
SLIDE 21

Resolving Bugs with Allinea Forge

slide-22
SLIDE 22

Debugging a problem is much easier when you can:

  • Make and undo changes fearlessly
  • Use source control (CVS, …)
  • Track what you’ve tried so far
  • Write logbooks
  • Reproduce bugs with a single command
  • Create and use test scripts

Debugging by Discipline

slide-23
SLIDE 23

Debugging by Magic

Any technology sufficiently advanced is indistinguishable from magic. Unpredictable, dangerous, irresistible.

slide-24
SLIDE 24

Debugging a problem is much easier if you know debuggers

  • Prepare the code

$ mpiicc -O0 -g myapp.c -o myapp

  • Start Allinea DDT in interactive mode

$ ddt mpirun -n 8 ./myapp arg1 arg2

  • Start Allinea DDT in offline mode

$ ddt --offline report.html mpirun -n 8 ./myapp arg1 arg2

Learn your spells

slide-25
SLIDE 25

Debugging by Inspiration

Look at the problem, see the solution. Trust your instincts. Check whether they are right.

slide-26
SLIDE 26

Debugging a problem is much easier if you are inspired:

  • Look to your sources of inspiration
  • Check your past logbooks
  • Explain the problem to a rubber duck
  • Test your instincts
  • Create tests (tracepoints, watchpoints, conditional breakpoints…)
  • Observe what the debugger is telling you
  • Analyse what the debugger communicates
  • Retrieve information from the debugger (advanced magic)

Debugging by Inspiration

slide-27
SLIDE 27

Debugging by Inspiration

  • Memory errors can be obvious (segfaults…)
  • Sometimes not
  • Allinea DDT's memory debugging tool enables automatic error detection
  • By activating the dmalloc library
  • By adding guard pages
  • On the host as well as on the Xeon Phi
  • Different levels of detection bring different debugger behaviour
slide-28
SLIDE 28

Getting Started on Salomon

  • Load the environment

$ module load iimpi/5.5.0
$ module load Forge/5.1-43967

  • Prepare the code for debugging

$ mpiicc -g -O0 myapp.c -o myapp.exe

  • Modify job script to prefix the mpirun command for reverse connect

ddt --connect mpirun myapp.exe

  • Launch Allinea DDT in the background on the login node (or use the remote client on your laptop)

$ ddt &

  • Submit job

$ qsub myjob.sub

  • When the job runs, it automatically connects to the GUI
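
The reverse connect workflow above might be scripted as follows. This is a sketch, not a site-validated script: the PBS directives and the account name youraccount are placeholders, and the snippet only writes the job script so it can be reviewed before running qsub.

```shell
# Write a hypothetical PBS job script that starts the application under
# Allinea DDT and connects back to a GUI that is already running.
cat > myjob.sub <<'EOF'
#!/bin/bash
#PBS -q qexp
#PBS -l select=1:ncpus=24
#PBS -A youraccount

cd "$PBS_O_WORKDIR"

module load iimpi/5.5.0
module load Forge/5.1-43967

# "ddt --connect" asks the waiting GUI to accept this debugging session.
ddt --connect mpirun ./myapp.exe
EOF

echo "wrote myjob.sub; start 'ddt &' on the login node, then run 'qsub myjob.sub'"
```

The order matters: the GUI has to be running before the job starts, otherwise there is nothing for the job to connect to.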
slide-29
SLIDE 29

Exercise objectives:

– Compile a code for Allinea Forge
– Discover underlying bugs with Allinea MAP
– Use Allinea DDT to debug issues

Content:

– Handout with step by step instructions
– Source code in C and F90 + Makefile
– Submission script

Exercise 2: Working on the optimized code

slide-30
SLIDE 30

Allinea Tools and Intel Xeon Phi Coprocessors

From Xeon to Xeon Phi

slide-31
SLIDE 31

Determine the right candidates for Intel Xeon Phi

  • Best applications for Intel Xeon Phi*

– Scalable to over 100 threads
– Heavy use of vectorization
– Heavy use of memory bandwidth

  • Scientific approach: make a decision based on facts

– How to retrieve relevant metrics to identify appropriate applications?
– How to benchmark and analyze lots of applications?
– How to speed up the benchmarking process to focus on the migration itself?

*Source : https://software.intel.com/en-us/articles/is-the-intel-xeon-phi-coprocessor-right-for-me

Allinea Performance Reports can help.

slide-32
SLIDE 32

Painless and quick benchmarks on Intel Xeon

  • Extremely simple to start
  • No source code needed
  • Fully scalable, very low overhead
  • Contains the relevant metrics
  • Helps make informed decisions

slide-33
SLIDE 33

Getting Started on Salomon

  • Load the environment

$ module load PerformanceReports/5.1-43967

  • Modify job script to prefix the mpirun command

perf-report mpirun -n 8 myapp.exe

  • Submit job

$ qsub myjob.sub

  • View result

$ firefox myapp_Xp_Yt_YYYY-MM-DD-HH-MM.html
$ cat myapp_Xp_Yt_YYYY-MM-DD-HH-MM.txt
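
Since each run produces a new time-stamped report, a small helper can pick the most recent one. The sketch below assumes the file naming pattern shown above (myapp_Xp_Yt_YYYY-MM-DD-HH-MM) and file names without unusual characters; the function name latest_report is hypothetical.

```shell
# Print the newest Performance Reports output for a given executable.
# $1: executable base name (e.g. myapp), $2: extension (txt or html).
latest_report() {
  # ls -t lists the most recently modified file first.
  ls -t "$1"_*p_*t_*."$2" 2>/dev/null | head -n 1
}

# Example: show the most recent text report, if one exists.
report=$(latest_report myapp txt)
if [ -n "$report" ]; then
  cat "$report"
else
  echo "no report found for myapp"
fi
```

This is handy when the tool is run regularly, for instance in regression tests, and only the latest result is of interest.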

slide-34
SLIDE 34

Preparing the application migration

  • Define development scope

– How should the application use the Intel Xeon Phi (symmetric, native, offload)?
– What regions should be rewritten/offloaded?

  • Analyze the source code

– Where are the hotspots that will specifically benefit from Intel Xeon Phi?
– What are the exact lines of code to spend time and energy on?

Use Allinea Forge to work on the code

slide-35
SLIDE 35
  • Analyze code line by line with Allinea MAP
  • Fix bugs with Allinea DDT

Preparing application migration with Allinea Forge

Perfect candidate for Intel Xeon Phi!

slide-36
SLIDE 36

Allinea MAP and Intel VTune: a great synergy

Simple optimization with Allinea MAP

  • Characterize performance at-scale with a lightweight tool
  • See which lines of code are hotspots
  • Identify common problems at once

Prepare optimization strategy with Allinea MAP

  • Identify loop(s) to instrument
  • Identify performance counter(s) to record
  • Document performance issues to communicate to VTune experts

Fine tune the code with VTune

  • Retrieve low-level details with VTune
  • Fix up CPU usage to make the code fly
slide-37
SLIDE 37

Intel Xeon Phi: Environment variables cheat sheet

For heterogeneous programs (#pragma offload)

– Set the number of threads to be started on the Intel Xeon Phi

  • export MIC_OMP_NUM_THREADS=32

– To make sure the host process isn’t killed when we enter a debugging session

  • export MYO_WATCHDOG_MONITOR=-1

– To make sure that debugging symbols are accessible on the host and the card

  • MPSS 3.*: export AMPLXE_COI_DEBUG_SUPPORT=true
  • MPSS 2.*: export COI_SEP_DISABLE=false

– To make sure Allinea DDT can attach to offloaded codes

  • unset OFFLOAD_MAIN

To run in symmetric mode (MPMD host+coprocessor)

  • export I_MPI_MIC=1
slide-38
SLIDE 38

Getting Started with Xeon Phi on Salomon (native MPI)

  • Request an interactive session with a Xeon Phi

$ qsub -IX -q qexp -l select=1:ncpus=24:accelerator=True -A <youraccount>

  • Load the environment

$ module load iimpi/7.3.5
$ module load Forge/5.1-43967

+ Cheat sheet environment variables

  • Prepare the code for debugging/profiling

$ mpiicc -mmic -g -O0 myapp.c -o myapp.exe

  • Launch Allinea Forge and configure a remote connection to the Xeon Phi by specifying the following:

– Host name: mic0 (may differ depending on which accelerator is allocated)
– Remote installation directory: /apps/all/Forge/5.1-43967/mic/
– Remote script: /home/<user>/remote-mic.sh

Where remote-mic.sh exports the variables of the cheat sheet and the Intel environment for the Xeon Phi, as below:

$ export LD_LIBRARY_PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187/mic/lib:/apps/all/ifort/2015.3.187/lib/mic:/apps/all/icc/2015.3.187/lib/mic:$LD_LIBRARY_PATH
$ export PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187/mic/bin:$PATH

  • Select the remote session and debug/profile your Xeon Phi executable myapp.exe
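
Putting the cheat sheet and the paths above together, remote-mic.sh might look like the sketch below. It assumes MPSS 3.* and the Salomon installation paths quoted on this slide; the helper variable MIC_MPI is introduced here only for readability.

```shell
#!/bin/sh
# remote-mic.sh (sketch): environment for Allinea Forge on the Xeon Phi.

# Cheat sheet variables
export MIC_OMP_NUM_THREADS=32          # threads started on the coprocessor
export MYO_WATCHDOG_MONITOR=-1         # keep the host process alive while debugging
export AMPLXE_COI_DEBUG_SUPPORT=true   # MPSS 3.*; on MPSS 2.* use COI_SEP_DISABLE=false
unset OFFLOAD_MAIN                     # lets Allinea DDT attach to offloaded code

# Intel MPI and compiler runtime libraries built for the coprocessor.
# MIC_MPI is a hypothetical helper variable, not part of the Intel tools.
MIC_MPI=/apps/all/impi/5.0.3.048-iccifort-2015.3.187/mic
MIC_LIBS="$MIC_MPI/lib:/apps/all/ifort/2015.3.187/lib/mic:/apps/all/icc/2015.3.187/lib/mic"

# Append the previous value only if it was set, to avoid a trailing colon.
export LD_LIBRARY_PATH="$MIC_LIBS${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH="$MIC_MPI/bin:$PATH"
```

Keeping these settings in one script means the same environment is applied every time Allinea Forge opens the remote session.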
slide-39
SLIDE 39

Summary

  • Develop your efficiency with Allinea Forge

– Optimize your code to reach your goals with Allinea MAP
– Reduce the number of failed jobs with Allinea DDT

  • Improve cluster usage with Allinea Performance Reports

– Squeeze more jobs within a given time frame
– Increase research by freeing machine time without hardware investment
– Help application support teams focus on the right issues

  • Allinea tools help for all steps of the migration to Xeon Phi

– Choose applications for Intel Xeon Phi with Allinea Performance Reports
– Prepare development scope with Allinea Forge
– Migrate easily to Intel Xeon Phi with Allinea Forge

slide-40
SLIDE 40

Thank you

Your contacts:

– Technical questions? flebeau@allinea.com
– Sales team: sales@allinea.com