Towards Efficiency Computing with Allinea
06 Nov 2015 VSB, Ostrava Florent Lebeau flebeau@allinea.com
Agenda
09:30-10:00 Registration
10:00-10:15 Introduction to Allinea tools
10:15-11:45 Getting started with Allinea Forge for profiling
11:45-13:00 Lunch break
14:15-14:45 Coffee break
14:45-16:00 Allinea tools and Intel Xeon Phi coprocessors
16:00-16:30 Questions and wrap-up
– Leading in the HPC software tools market worldwide
– Global customer base
– Unrivaled, productive and easy-to-use development environment…
– … to help reach the highest level of performance and scalability
– Unique solutions to reduce HPC systems operating costs
– Innovative approach to help resolve cutting-edge challenges
– Cluster productivity or cluster usage
– Define and enforce best practices (scale, parameters…)
– Provision and validate cluster upgrades and changes
– Detect & resolve hardware or software faults impacting performance
– Generates explicit and readable reports with metrics and explanations
– Understand optimized HPC applications effortlessly
No source code needed
Less than 5% runtime overhead
Fully scalable
Run regularly, or in regression tests
Explicit and usable output
‒ Rebranding of Allinea Unified (Allinea DDT + Allinea MAP)
‒ Productively debug code with Allinea DDT
‒ Enhance application performance with Allinea MAP
‒ Consistent, easy-to-use tools
‒ Fewer failed jobs
Observe and debug your code step by step
Flick to Allinea DDT
Common interface and settings files
Increasing memory usage? Memory leak!
Workload imbalance? Possible partitioner bug!
Use Allinea MAP to find a bottleneck
Low overhead measurement
Easy to use
Deep
‒ Merges stacks from processes and threads
‒ Allinea DDT leaps to source automatically
‒ Detailed error message given to the user
‒ Some faults evident instantly from source
‒ Unique “Smart Highlighting”
‒ Sparklines comparing data across processes
Run with Allinea tools
Identify a problem
Gather info: Who, Where, How, Why
Fix
CPU RUN GPU RUN
DDT
Reverse connect (no more queue configuration)
ARMv8 and Power8 support
MAP
Energy metrics
Start/stop sampling by time
Zoomable metrics
Stdout/stderr in .map files
Limited ARMv8 support
Performance Reports
Energy metrics
Limited ARMv8 support
Hardware (system + CPU) provides energy-related data (currently: IPMI-based power sensors)
API extracts this data and feeds Allinea’s tools (currently: Intel Energy Checker SDK)
Allinea’s tools process the data at runtime to provide a unique perspective
$ module load iimpi/5.5.0
$ module load Forge/5.1-43967
$ mpiicc -g -O3 myapp.c -o myapp.exe
map --profile mpirun myapp.exe
$ qsub myjob.sub
$ map myapp_Xp_Yt_YYYY-MM-DD-HH-MM.map
[Figure: matrix multiplication diagram. i, j, k: loop indexes; size: matrix dimension; nslices = 4]
Algorithm
1- Master initializes matrices A, B & C
2- Master slices the matrices A & C, sends them to slaves
3- Master and Slaves perform the multiplication
4- Slaves send their results back to Master
5- Master writes the result Matrix C in an output file
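The steps above can be sketched serially to show how the work is divided. This is only an illustration, not the tutorial's C/F90 source: it uses no MPI (each "slave" is simulated by a loop iteration), and it assumes that A and C are sliced by rows, that B is available in full to every worker, and that each worker accumulates its slice of A times B into its slice of C. All function names are hypothetical, and step 5 (writing C to a file) is left out.

```python
def init_matrices(size):
    """Step 1: build A, B and a zeroed C (deterministic values, chosen for illustration)."""
    a = [[float(i + j) for j in range(size)] for i in range(size)]
    b = [[float(i - j) for j in range(size)] for i in range(size)]
    c = [[0.0] * size for _ in range(size)]
    return a, b, c


def slice_rows(matrix, nslices):
    """Step 2: cut a matrix into nslices contiguous blocks of rows (assumed slicing scheme)."""
    rows_per_slice = len(matrix) // nslices
    return [matrix[s * rows_per_slice:(s + 1) * rows_per_slice]
            for s in range(nslices)]


def multiply_slice(a_rows, b, c_rows):
    """Step 3: one worker's share, c_rows += a_rows x B, using loop indexes i, j, k."""
    size = len(b)
    for i in range(len(a_rows)):
        for j in range(size):
            total = c_rows[i][j]
            for k in range(size):
                total += a_rows[i][k] * b[k][j]
            c_rows[i][j] = total
    return c_rows


def sliced_mmult(size, nslices):
    """Run steps 1 to 4 in a single process and return the gathered matrix C."""
    if size % nslices != 0:
        raise ValueError("size must be a multiple of nslices")
    a, b, c = init_matrices(size)
    a_slices = slice_rows(a, nslices)
    c_slices = slice_rows(c, nslices)
    # In the MPI version each pair of slices would be sent to a different rank.
    results = [multiply_slice(a_s, b, c_s)
               for a_s, c_s in zip(a_slices, c_slices)]
    # Step 4: gather the row blocks back in order.
    return [row for part in results for row in part]


if __name__ == "__main__":
    c = sliced_mmult(size=8, nslices=4)
    print(len(c), "rows computed in 4 slices")
```

With nslices = 4, each worker handles a quarter of the rows, so an imbalance between workers or time spent in the send and receive steps is the kind of behaviour a profiler would be expected to expose.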
Exercise objectives:
– Load Allinea Forge environment
– Compile a code for Allinea MAP
– Submit the job through the queue
– Discover Allinea MAP interface and features
– Optimize a simple code
Content:
– Handout with step by step instructions
– Source code in C and F90 + Makefile
– Submission script
Tutorial archive on Salomon in:
/scratch/temp/flebeau/allinea_workshop.tar.gz
Debugging a problem is much easier when you can:
Any technology sufficiently advanced is indistinguishable from magic. Unpredictable, dangerous, irresistible.
Debugging a problem is much easier if you know debuggers
$ mpiicc -O0 -g myapp.c -o myapp
$ ddt mpirun -n 8 ./myapp arg1 arg2
$ ddt --offline report.html mpirun -n 8 ./myapp arg1 arg2
Look at the problem, see the solution. Trust your instincts. Control if they are right.
Debugging a problem is much easier if you are inspired:
detection
$ module load iimpi/5.5.0
$ module load Forge/5.1-43967
$ mpiicc -g -O0 myapp.c -o myapp.exe
ddt --connect mpirun myapp.exe
client on your laptop)
$ ddt &
$ qsub myjob.sub
Exercise objectives:
– Compile a code for Allinea Forge
– Discover underlying bugs with Allinea MAP
– Use Allinea DDT to debug issues
Content:
– Handout with step by step instructions
– Source code in C and F90 + Makefile
– Submission script
Allinea Tools and Intel Xeon Phi Coprocessors
From Xeon to Xeon Phi
– Scalable to over 100 threads
– Heavy use of vectorization
– Heavy use of memory bandwidth
– How to retrieve relevant metrics to identify appropriate applications?
– How to benchmark and analyze lots of applications?
– How to speed up the benchmarking process to focus on the migration itself?
*Source : https://software.intel.com/en-us/articles/is-the-intel-xeon-phi-coprocessor-right-for-me
Allinea Performance Reports can help.
Extremely simple to start
No source code needed
Fully scalable, very low overhead
Contains the relevant metrics
Helps make informed decisions
$ module load PerformanceReports/5.1-43967
perf-report mpirun -n 8 myapp.exe
$ sbatch myjob.sub
$ firefox myapp_Xp_Yt_YYYY-MM-DD-HH-MM.html
$ cat myapp_Xp_Yt_YYYY-MM-DD-HH-MM.txt
– How should the application use the Intel Xeon Phi (symmetric, native, offload)?
– What regions should be rewritten/offloaded?
– Where are the hotspots that will specifically benefit from Intel Xeon Phi?
– What are the exact lines of code to spend time and energy on?
Use Allinea Forge to work on the code
Perfect candidate for Intel Xeon Phi!
Simple
with Allinea MAP
Prepare
strategy with Allinea MAP
Fine tune the code with VTune
For heterogeneous programs (#pragma offloads)
– Set the number of threads to be started on the Intel Xeon Phi
– To make sure the host process isn’t killed when we enter a debugging session
– To make sure that debugging symbols are accessible on the host and the card
– export AMPLXE_COI_DEBUG_SUPPORT=true
– export COI_SEP_DISABLE=false
– To make sure Allinea DDT can attach to offloaded codes
To run in symmetric mode (MPMD host+coprocessor)
$ qsub -IX -q qexp -l select=1:ncpus=24:accelerator=True -A <youraccount>
$ module load iimpi/7.3.5
$ module load Forge/5.1-43967
+ Cheat sheet environment variables
$ mpiicc -mmic -g -O0 myapp.c -o myapp.exe
– Host name: mic0 (may differ depending on which accelerator is allocated)
– Remote installation directory: /apps/all/Forge/5.1-43967/mic/
– Remote script: /home/<user>/remote-mic.sh
Where remote-mic.sh exports the variables of the cheat sheet and the Intel environment for the Xeon Phi as below:
$ export LD_LIBRARY_PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187/mic/lib:/apps/all/ifort/2015.3.187/lib/mic:/apps/all/icc/2015.3.187/lib/mic:$LD_LIBRARY_PATH
$ export PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187/mic/bin:$PATH
– Optimize your code to reach your goals with Allinea MAP
– Reduce the number of failed jobs with Allinea DDT
– Squeeze more jobs within a given time frame
– Increase research by freeing machine time without hardware investment
– Help application support teams focus on the right issues
– Choose applications for Intel Xeon Phi with Allinea Performance Reports
– Prepare development scope with Allinea Forge
– Migrate easily to Intel Xeon Phi with Allinea Forge
Your contacts:
– Technical questions? flebeau@allinea.com
– Sales team: sales@allinea.com