

SLIDE 1

XEON PHI BASICS

Adrian Jackson

adrianj@epcc.ed.ac.uk @adrianjhpc

SLIDE 2

Reusing this material

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_US

This means you are free to copy and redistribute the material, and to adapt and build on the material, under the following terms: you must give appropriate credit, provide a link to the license, and indicate if changes were made. If you adapt or build on the material, you must distribute your work under the same license as the original. Note that presentations may contain images owned by others; please seek their permission before reusing these images.

SLIDE 3

LESSON PLAN

  • Programming models
  • Parallelisation
  • Compilers and Tools
  • Performance Considerations
SLIDE 4

Programming models

SLIDES 5-6

[Diagram: Host (with Main Memory) connected to the Coprocessor]

3 Basic Programming Models:

  • Native mode
  • Offload execution
  • Symmetric execution

SLIDES 7-8

Native Mode: Xeon Phi only

  • Host used for preparation work (e.g. compiling, data copy)
  • User initiates run from the host, or can use the host to connect to the Xeon Phi via ssh
  • Program runs on the Xeon Phi from start to finish "as usual"

[Diagram: host connected via ssh over PCIe to the coprocessor, which runs int main() { do_stuff(); }]

SLIDES 9-10

Native Mode: Xeon Phi only

Pros:

  • Requires minimal effort to "port"
  • Works well with 'flat profile' applications
  • No memory copy required

Cons:

  • Poor performance on codes with large serial regions and 'complex codes'
  • Limited Xeon Phi memory
SLIDES 11-13

Offload Execution: Hotspot eliminator

  • Application is initiated on the host
  • Embarrassingly parallel hotspots are offloaded to the Xeon Phi
  • Results of the offload region are returned to the host, where execution continues

[Diagram: the host runs int main() { ... #pragma offload do_stuff() ... }; the coprocessor runs the offloaded do_stuff() { ... }; ssh access, PCIe link]

SLIDES 14-15

Offload Execution: Hotspot eliminator

Pros:

  • Serial code handled by advanced CPU cores
  • Embarrassingly parallel hotspots are executed efficiently on the Xeon Phi
  • More efficient use of (limited) Xeon Phi memory

Cons:

  • Data must be copied to and from the Xeon Phi via the (slow) PCIe bus
  • May lead to poor utilisation of CPU/Xeon Phi (idle time)

SLIDES 16-18

Symmetric Execution: Phi-as-a-node

  • Application is initiated on the host but...
  • Runs across both CPU and Xeon Phi cores
  • Effectively uses the Xeon Phi as just another node for MPI

[Diagram: host and coprocessor each run int main() { ... do_stuff() ... }; MPI_RANK=0...15 on the host, MPI_RANK=16...255 on the Xeon Phi]
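As a hedged sketch of how such a symmetric job is typically launched with Intel MPI: one binary is built per architecture, and a single mpirun spans both. The host name `node0`, coprocessor name `mic0`, rank counts, and binary names are all illustrative; this is a launch fragment, not a definitive recipe.

```shell
# Build the same source twice: once for the host, once for the coprocessor.
icc -qopenmp hello.c -o hello             # host binary
icc -mmic -qopenmp hello.c -o hello.mic   # Xeon Phi binary

# Enable coprocessor support in Intel MPI.
export I_MPI_MIC=1

# One MPI job across both: ranks 0-15 on the host, 16-255 on the Phi.
mpirun -n 16  -host node0 ./hello : \
       -n 240 -host mic0  ./hello.mic
```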

SLIDES 19-20

Symmetric Execution: Phi-as-a-node

Pros:

  • Promise of full hardware utilisation
  • No need for offloading pragmas and memory copies

Cons:

  • Tricky load-balancing
  • Code is rarely optimal for both CPU and Xeon Phi
SLIDE 21

Parallelisation

SLIDE 22

MPI and/or OpenMP

SLIDE 23

MPI+OpenMP with Offload

  • MPI runs only on hosts
  • MPI processes offload to the Xeon Phi
  • OpenMP in MPI processes
  • OpenMP in offload regions

Image from Colfax training material

SLIDE 24

Symmetric Pure MPI

  • MPI processes on the host
  • MPI processes (native) on the Xeon Phi
  • No OpenMP

Image from Colfax training material

SLIDE 25

Symmetric Hybrid MPI+OpenMP

  • MPI processes on the host
  • MPI processes (native) on the Xeon Phi
  • All MPI processes use OpenMP multithreading

Image from Colfax training material

SLIDE 26

What is best?

  • What is your goal?
  • What is your system?
  • What is your application?
  • Generally OpenMP is faster than MPI on the Xeon Phi
  • Poor performance of MPI on the Xeon Phi
  • Less memory (especially important on the Xeon Phi)
  • Worth checking affinity settings (more later)

SLIDE 27

Compilers & Tools

SLIDES 28-29

Compilers, in a word: Intel

  • Intel C Compiler
  • Intel C++ Compiler
  • Intel Fortran Compiler
SLIDE 30

Tools, in two words: Intel & Allinea (but mainly Intel)

SLIDE 31

Tools

Intel Parallel Studio XE:

  • Intel C, C++ and Fortran compilers (MIC-capable)
  • Intel Math Kernel Library (MKL)
  • Intel MPI Library (only in Cluster Edition)
  • Intel Trace Analyzer and Collector / ITAC (MPI profiler)
  • Intel VTune Amplifier XE (multi-threaded profiler)
  • Intel Inspector XE (memory and threading debugging)
  • Intel Threading Building Blocks / TBB (threading library)
  • Intel Performance Primitives / IPP (media and data)
  • Intel Advisor XE (guided parallelism design)

Allinea:

  • Map (lightweight profiler)
  • DDT (debugger)
  • Forge (unified UI for DDT & Map)

SLIDE 32

Runtime Tools

SLIDE 33

Runtime Tools:

  • MPSS (Intel Manycore Platform Software Stack)
  • Environment Variables
  • Linux Commands

SLIDE 34

MPSS:

  • micnativeloadex
  • micinfo
  • miccheck
  • micsmc (GUI)
  • micrasd (root)

Environment Variables:

  • MKL_MIC_ENABLE
  • MIC_ENV_PREFIX
  • MIC_LD_LIBRARY_PATH
  • I_MPI_MIC
  • I_MPI_MIC_POSTFIX
  • OFFLOAD_REPORT
  • KMP_AFFINITY
  • KMP_BLOCKTIME
  • MIC_USE_2MB_BUFFERS

Linux Commands:

  • lspci | grep Phi
  • cat /etc/hosts | grep mic
  • cat /proc/cpuinfo | grep proc | tail -n 3 ...

For more details:
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/xeon-phi-software-configuration-users-guide.pdf
https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-E1EC94AE-A13D-463E-B3C3-6D7A7205F5A1.htm
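A hedged sketch of how a few of these variables combine for an offload run (the values are illustrative; `MIC_ENV_PREFIX` makes any variable with the chosen prefix visible on the card without that prefix):

```shell
# Let MKL automatically offload suitable calls to the coprocessor.
export MKL_MIC_ENABLE=1

# Forward PHI_-prefixed variables to the coprocessor environment.
export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=240
export PHI_KMP_AFFINITY=balanced

# Print a summary (data transferred, time) after each offload region.
export OFFLOAD_REPORT=2
```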

SLIDE 35

Performance Considerations

SLIDE 36

Five things to consider first:

  • Execution mode
  • Vectorisation
  • Alignment
  • Affinity
  • Application design

SLIDE 37

Mode of execution:

  • Native
  • Offload
  • Symmetric

The mode chosen should depend on the application and system configuration (as discussed previously).
SLIDES 38-40

Vectorisation

  • Xeon Phi performance is greatly dependent on its vector units.
  • Intel Xeon CPUs also use (smaller) vector units → code optimised for Intel Xeon will run faster on Intel Xeon Phi.
  • KNL (the next-generation Xeon Phi) will also use 512-bit AVX vector units → code optimised for Intel Xeon Phi KNC will also run faster on Intel Xeon Phi KNL.*

*(KNC and KNL are not binary compatible)

SLIDES 41-43

Data Alignment

  • "Loop is vectorised" != faster
  • Data alignment is critical for vectorisation to be beneficial
  • Remember not only to align the data, but also to tell the compiler, at the loop, that the data is aligned.

SLIDES 44-47

Affinity

  • All data moves over the high-speed ring interconnect
  • Affinity is critical for good performance
  • Default settings are not always optimal
  • In offload mode, you may accidentally use poor settings, e.g. 240 threads competing for the use of 30 cores while 30 other cores are idle.
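The 240-threads-on-30-cores scenario above is usually avoided by pinning threads explicitly; a sketch using KMP_AFFINITY (the `balanced` value is specific to the Xeon Phi OpenMP runtime):

```shell
# Spread threads evenly across cores, keeping neighbouring thread ids
# on the same core; `verbose` prints the chosen binding at startup.
export KMP_AFFINITY=verbose,balanced

# Alternatives:
#   compact - fill all 4 hardware threads of a core before the next core
#   scatter - round-robin threads across cores
export OMP_NUM_THREADS=240
```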

SLIDE 48

Application Design

  • Design >> Optimisation
  • Consider all levels of parallelism available and adapt your algorithm to exploit as many of them, as much as possible

SLIDES 49-57

Levels of parallelism

[Diagram, built up over slides 49-57: the hierarchy Machine → Node → NUMA Region → Core / Co-processor → Thread → Vector Unit, with multiple nodes per machine, multiple NUMA regions per node, multiple cores per NUMA region, multiple threads per core, and a vector unit per thread]
SLIDE 58

Summary

SLIDES 59-62

Summary

  • Programming models
    • Native, Offload, Symmetric - what's best for you
  • Parallelisation
    • MPI, OpenMP -> OpenMP better on Xeon Phi
    • Many ways to mix and match
  • Compilers and Tools
    • Use Intel compilers (C, C++, Fortran)
    • Intel and Allinea tools: VTune, Map, etc.
    • Wide variety of runtime tools and environment variables: micinfo, KMP_AFFINITY
  • Performance Considerations
    • Programming model
    • Vectorisation - needed to exploit Xeon Phi compute
    • Data alignment - needed to make vectorisation useful
    • Thread/process affinity - can be critical for performance
    • Application design: consider levels of parallelism

SLIDE 63

Thank You!