

SLIDE 1

XEON PHI BASICS

Adrian Jackson

adrianj@epcc.ed.ac.uk @adrianjhpc

SLIDE 2

Reusing this material

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_US

This means you are free to copy and redistribute the material, and to adapt and build on the material, under the following terms: you must give appropriate credit, provide a link to the license, and indicate if changes were made. If you adapt or build on the material, you must distribute your work under the same license as the original. Note that presentations may contain images owned by others; please seek their permission before reusing these images.

SLIDE 3

LESSON PLAN

  • Programming models
  • Parallelisation
  • Compilers and Tools
  • Performance Considerations
SLIDE 4

Programming models

SLIDES 5-6

[Diagram: Host (with Main Memory) connected to the Coprocessor]

3 Basic Programming Models:

  • Native mode
  • Offload execution
  • Symmetric execution

SLIDES 7-8

Native Mode: Xeon Phi only

  • Host used for preparation work (e.g. compiling, data copy)
  • User initiates run from the host, or can use the host to connect to the Xeon Phi via ssh
  • Program runs on the Xeon Phi from start to finish "as usual"

[Diagram: host connected via ssh over PCIe to the coprocessor, which runs int main() { do_stuff(); }]

SLIDES 9-10

Native Mode: Xeon Phi only

Pros:

  • Requires minimal effort to "port"
  • Works well with 'flat profile' applications
  • No memory copy required

Cons:

  • Poor performance on codes with large serial regions and 'complex codes'
  • Limited Xeon Phi memory
SLIDES 11-13

Offload Execution: Hotspot eliminator

  • Application is initiated on the host
  • Embarrassingly parallel hotspots are offloaded to the Xeon Phi
  • Results of the offload region are returned to the host, where execution continues

[Diagram: the host runs int main() { ... #pragma offload do_stuff() ... }; the coprocessor runs the offloaded do_stuff() { ... }; ssh access, PCIe link]

SLIDES 14-15

Offload Execution: Hotspot eliminator

Pros:

  • Serial code handled by advanced CPU cores
  • Embarrassingly parallel hotspots are executed efficiently on the Xeon Phi
  • More efficient use of (limited) Xeon Phi memory

Cons:

  • Data must be copied to and from the Xeon Phi via the (slow) PCIe bus
  • May lead to poor utilisation of CPU/Xeon Phi (idle time)

SLIDES 16-18

Symmetric Execution: Phi-as-a-node

  • Application is initiated on the host but...
  • Runs across both CPU and Xeon Phi cores
  • Effectively uses the Xeon Phi as just another node for MPI

[Diagram: host and coprocessor each run int main() { ... do_stuff() ... }; MPI_RANK=0...15 on the host, MPI_RANK=16...255 on the Xeon Phi]
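As a hedged sketch of how such a symmetric job is typically launched with Intel MPI: one binary is built per architecture, and a single mpirun spans both. The host name `node0`, coprocessor name `mic0`, rank counts, and binary names are all illustrative; this is a launch fragment, not a definitive recipe.

```shell
# Build the same source twice: once for the host, once for the coprocessor.
icc -qopenmp hello.c -o hello             # host binary
icc -mmic -qopenmp hello.c -o hello.mic   # Xeon Phi binary

# Enable coprocessor support in Intel MPI.
export I_MPI_MIC=1

# One MPI job across both: ranks 0-15 on the host, 16-255 on the Phi.
mpirun -n 16  -host node0 ./hello : \
       -n 240 -host mic0  ./hello.mic
```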

SLIDES 19-20

Symmetric Execution: Phi-as-a-node

Pros:

  • Promise of full hardware utilisation
  • No need for offloading pragmas and memory copies

Cons:

  • Tricky load-balancing
  • Code is rarely optimal for both CPU and Xeon Phi
SLIDE 21

Parallelisation

SLIDE 22

MPI and/or OpenMP

SLIDE 23

MPI+OpenMP with Offload

  • MPI runs only on hosts
  • MPI processes offload to the Xeon Phi
  • OpenMP in MPI processes
  • OpenMP in offload regions

Image from Colfax training material

SLIDE 24

Symmetric Pure MPI

  • MPI processes on the host
  • MPI processes (native) on the Xeon Phi
  • No OpenMP

Image from Colfax training material

SLIDE 25

Symmetric Hybrid MPI+OpenMP

  • MPI processes on the host
  • MPI processes (native) on the Xeon Phi
  • All MPI processes use OpenMP multithreading

Image from Colfax training material

SLIDE 26

What is best?

  • What is your goal?
  • What is your system?
  • What is your application?
  • Generally OpenMP is faster than MPI on the Xeon Phi
  • Poor performance of MPI on the Xeon Phi
  • Less memory (especially important on the Xeon Phi)
  • Worth checking affinity settings (more later)

SLIDE 27

Compilers & Tools

SLIDES 28-29

Compilers, in a word: Intel

  • Intel C Compiler
  • Intel C++ Compiler
  • Intel Fortran Compiler
SLIDE 30

Tools, in two words: Intel & Allinea (but mainly Intel)

SLIDE 31

Tools

Intel Parallel Studio XE:

  • Intel C, C++ and Fortran compilers (MIC-capable)
  • Intel Math Kernel Library (MKL)
  • Intel MPI Library (only in Cluster Edition)
  • Intel Trace Analyzer and Collector / ITAC (MPI profiler)
  • Intel VTune Amplifier XE (multi-threaded profiler)
  • Intel Inspector XE (memory and threading debugging)
  • Intel Threading Building Blocks / TBB (threading library)
  • Intel Performance Primitives / IPP (media and data)
  • Intel Advisor XE (guided parallelism design)

Allinea:

  • Map (lightweight profiler)
  • DDT (debugger)
  • Forge (unified UI for DDT & Map)

SLIDE 32

Runtime Tools

SLIDE 33

Runtime Tools:

  • MPSS (Intel Manycore Platform Software Stack)
  • Environment Variables
  • Linux Commands

SLIDE 34

MPSS:

  • micnativeloadex
  • micinfo
  • miccheck
  • micsmc (GUI)
  • micrasd (root)

Environment Variables:

  • MKL_MIC_ENABLE
  • MIC_ENV_PREFIX
  • MIC_LD_LIBRARY_PATH
  • I_MPI_MIC
  • I_MPI_MIC_POSTFIX
  • OFFLOAD_REPORT
  • KMP_AFFINITY
  • KMP_BLOCKTIME
  • MIC_USE_2MB_BUFFERS

Linux Commands:

  • lspci | grep Phi
  • cat /etc/hosts | grep mic
  • cat /proc/cpuinfo | grep proc | tail -n 3 ...

For more details:
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/xeon-phi-software-configuration-users-guide.pdf
https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-E1EC94AE-A13D-463E-B3C3-6D7A7205F5A1.htm
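A hedged sketch of how a few of these variables combine for an offload run (the values are illustrative; `MIC_ENV_PREFIX` makes any variable with the chosen prefix visible on the card without that prefix):

```shell
# Let MKL automatically offload suitable calls to the coprocessor.
export MKL_MIC_ENABLE=1

# Forward PHI_-prefixed variables to the coprocessor environment.
export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=240
export PHI_KMP_AFFINITY=balanced

# Print a summary (data transferred, time) after each offload region.
export OFFLOAD_REPORT=2
```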

SLIDE 35

Performance Considerations

SLIDE 36

Five things to consider first:

  • Execution mode
  • Vectorisation
  • Alignment
  • Affinity
  • Application design

SLIDE 37

Mode of execution:

  • Native
  • Offload
  • Symmetric

The mode chosen should depend on the application and system configuration (as discussed previously).
SLIDES 38-40

Vectorisation

  • Xeon Phi performance is greatly dependent on its vector units.
  • Intel Xeon CPUs also use (smaller) vector units → code optimised for Intel Xeon will run faster on Intel Xeon Phi.
  • KNL (the next-generation Xeon Phi) will also use 512-bit AVX vector units → code optimised for Intel Xeon Phi KNC will also run faster on Intel Xeon Phi KNL.*

*(KNC and KNL are not binary compatible)

SLIDES 41-43

Data Alignment

  • "Loop is vectorised" != faster
  • Data alignment is critical for vectorisation to be beneficial
  • Remember not only to align the data, but also to tell the compiler, at the loop, that the data is aligned.

SLIDES 44-47

Affinity

  • All data moves over the high-speed ring interconnect
  • Affinity is critical for good performance
  • Default settings are not always optimal
  • In offload mode, you may accidentally use poor settings, e.g. 240 threads competing for the use of 30 cores while 30 other cores are idle.
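The 240-threads-on-30-cores scenario above is usually avoided by pinning threads explicitly; a sketch using KMP_AFFINITY (the `balanced` value is specific to the Xeon Phi OpenMP runtime):

```shell
# Spread threads evenly across cores, keeping neighbouring thread ids
# on the same core; `verbose` prints the chosen binding at startup.
export KMP_AFFINITY=verbose,balanced

# Alternatives:
#   compact - fill all 4 hardware threads of a core before the next core
#   scatter - round-robin threads across cores
export OMP_NUM_THREADS=240
```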

SLIDE 48

Application Design

  • Design >> Optimisation
  • Consider all levels of parallelism available and adapt your algorithm to exploit as many of them, as much as possible

SLIDES 49-57

Levels of parallelism

[Diagram, built up over slides 49-57: the hierarchy Machine → Node → NUMA Region → Core / Co-processor → Thread → Vector Unit, with multiple nodes per machine, multiple NUMA regions per node, multiple cores per NUMA region, multiple threads per core, and a vector unit per thread]
SLIDE 58

Summary

SLIDES 59-62

Summary

  • Programming models
    • Native, Offload, Symmetric - what's best for you
  • Parallelisation
    • MPI, OpenMP -> OpenMP better on Xeon Phi
    • Many ways to mix and match
  • Compilers and Tools
    • Use Intel compilers (C, C++, Fortran)
    • Intel and Allinea tools: VTune, Map, etc.
    • Wide variety of runtime tools and environment variables: micinfo, KMP_AFFINITY
  • Performance Considerations
    • Programming model
    • Vectorisation - needed to exploit Xeon Phi compute
    • Data alignment - needed to make vectorisation useful
    • Thread/process affinity - can be critical for performance
    • Application design: consider levels of parallelism

SLIDE 63

Thank You!