Parallel Programming and Heterogeneous Computing
SIMD: Integrated Accelerators
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel, and Andreas Polze
Operating Systems and Middleware Group

1 SIMD & AltiVec
ParProg20 C1: Integrated Accelerators, Sven Köhler

Definition SIMD
SIMD ::= Single Instruction Multiple Data
The same instruction is performed simultaneously on multiple data points, which makes SIMD a fit for data-level parallelism.
First proposed for ILLIAC IV, University of Illinois (1966).
Today many architectures provide SIMD instruction set extensions:
- Intel: MMX, SSE, AVX
- ARM: VFP, NEON, SVE
- POWER: AltiVec (VMX), VSX

Scalar vs. SIMD
How many instructions are needed to add four numbers from memory?
Scalar: A0 + B0 = C0, A1 + B1 = C1, A2 + B2 = C2, A3 + B3 = C3
    4 additions, 8 loads, 4 stores
4-element SIMD: [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]
    1 addition, 2 loads, 1 store
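The instruction counts above can be sketched in C. This is an illustration, not code from the slides: it uses the generic GCC/Clang `vector_size` extension instead of AltiVec types so that it builds on any target, and the names `v4si`, `add4_scalar`, and `add4_simd` are made up here. Whether the compiler really emits a single vector add depends on the target and flags.

```c
#include <string.h>

/* Four 32-bit ints packed into one 128-bit value (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar version: 4 additions, 8 loads, 4 stores. */
void add4_scalar(const int *a, const int *b, int *c)
{
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}

/* SIMD-style version: 2 vector loads, 1 vector addition, 1 vector store. */
void add4_simd(const int *a, const int *b, int *c)
{
    v4si va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load A0..A3 */
    memcpy(&vb, b, sizeof vb);   /* load B0..B3 */
    vc = va + vb;                /* one element-wise addition */
    memcpy(c, &vc, sizeof vc);   /* store C0..C3 */
}
```

Both functions produce the same four sums; only the number of operations expressed in the source differs.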

Vector Registers on POWER8 (1)
32 vector registers containing 128 bits each.
Register file layout (VSX addresses vsr0 … vsr63):
- fpr0 … fpr31 overlap vsr0 … vsr31
- vr0 … vr31 (AltiVec/VMX) are vsr32 … vsr63
Each 128-bit register can be viewed as:
- Quad Word 0
- Double Word 0, Double Word 1
- Word 0 … Word 3
- Half Word 0 … Half Word 7
- Byte 0 … Byte 15
These are also used by several coprocessors: VSX, SHA2, AES, …

Vector Registers on POWER8 (2)
32 vector registers containing 128 bits each. Depending on the instruction they are interpreted as
- 16 (un)signed bytes
- 8 (un)signed shorts
- 4 (un)signed integers of 32 bit
- 4 single precision floats
- 2 (un)signed long integers of 64 bit
- 2 double precision floats
- or 2, 4, 8, 16 logic values
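The different interpretations of the same 128 bits can be modelled with a plain C union. This is only a sketch for counting elements; the names `vreg128` and `lanes` are hypothetical and it says nothing about element order, which depends on endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* One 128-bit register image, viewed through each element type. */
union vreg128 {
    uint8_t  bytes[16];    /* 16 (un)signed bytes          */
    uint16_t halves[8];    /* 8 (un)signed shorts          */
    uint32_t words[4];     /* 4 (un)signed 32-bit integers */
    float    floats[4];    /* 4 single precision floats    */
    uint64_t dwords[2];    /* 2 (un)signed 64-bit integers */
    double   doubles[2];   /* 2 double precision floats    */
};

/* Number of elements of a given size (in bytes) that fit in 128 bits. */
size_t lanes(size_t element_size)
{
    return sizeof(union vreg128) / element_size;
}
```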

AltiVec Instruction Reference
For all instructions, registers and usage see PowerISA 2.07(B), chapter 6 & 7.

Excerpt (Version 2.07 B, 6.7.2 Vector Load Instructions):
The aligned byte, halfword, word, or quadword in storage addressed by EA is loaded into register VRT.
Programming Note: The Load Vector Element instructions load the specified element into the same location in the target register as the location into which it would be loaded using the Load Vector instruction.

Load Vector Element Byte Indexed X-form
lvebx VRT,RA,RB
Encoding: 31 | VRT | RA | RB | 7 | / (fields start at bits 0, 6, 11, 16, 21, 31)
    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    eb ← EA[60:63]
    VRT ← undefined
    if Big-Endian byte ordering then
        VRT[8×eb : 8×eb+7] ← MEM(EA,1)
    else
        VRT[120-(8×eb) : 127-(8×eb)] ← MEM(EA,1)
Let the effective address (EA) be the sum (RA|0)+(RB). Let eb be bits 60:63 of EA.
If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA are placed into byte eb of register VRT. The remaining bytes in register VRT are set to undefined values.

Load Vector Element Halfword Indexed X-form
lvehx VRT,RA,RB
Encoding: 31 | VRT | RA | RB | 39 | / (fields start at bits 0, 6, 11, 16, 21, 31)
    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← (b + (RB)) & 0xFFFF_FFFF_FFFF_FFFE
    eb ← EA[60:63]
    VRT ← undefined
    if Big-Endian byte ordering then
        VRT[8×eb : 8×eb+15] ← MEM(EA,2)
    else
        VRT[112-(8×eb) : 127-(8×eb)] ← MEM(EA,2)
Let the effective address (EA) be the result of ANDing 0xFFFF_FFFF_FFFF_FFFE with the sum (RA|0)+(RB). Let eb be bits 60:63 of EA.
If Big-Endian byte ordering is used for the storage access, …
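The address arithmetic in the lvebx and lvehx pseudocode can be followed with a small scalar model. This is a sketch of the address computation only, not of the instructions themselves; the function names are invented for illustration, and passing `base = 0` stands in for the RA = 0 case.

```c
#include <stdint.h>

/* EA for lvebx: (RA|0) + (RB). Pass base = 0 to model RA = 0. */
uint64_t lvebx_ea(uint64_t base, uint64_t index)
{
    return base + index;
}

/* EA for lvehx: same sum with the lowest bit cleared (halfword aligned). */
uint64_t lvehx_ea(uint64_t base, uint64_t index)
{
    return (base + index) & UINT64_C(0xFFFFFFFFFFFFFFFE);
}

/* eb = EA[60:63]: with bit 0 as the most significant bit of a 64-bit
 * value, these are the four least significant bits. */
unsigned element_byte(uint64_t ea)
{
    return (unsigned)(ea & 0xF);
}

/* First bit of the target field in VRT for a byte load, Big-Endian case. */
unsigned lvebx_first_bit_be(uint64_t ea)
{
    return 8u * element_byte(ea);
}

/* First bit of the target field in VRT for a byte load, Little-Endian case. */
unsigned lvebx_first_bit_le(uint64_t ea)
{
    return 120u - 8u * element_byte(ea);
}
```

For example, base 0x1000 and index 0x13 select byte 3 for lvebx, while lvehx clears the low bit first and selects the halfword starting at byte 2.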

2 C-Interface
#include <altivec.h>
gcc -maltivec -mabi=altivec
gcc -mvsx
xlc -qaltivec -qarch=auto
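Whether these flags took effect can be checked at compile time. The sketch below relies on the macros `__ALTIVEC__` and `__VSX__`, which GCC predefines when the corresponding facility is enabled; the function name `simd_target` is hypothetical, and other compilers may differ.

```c
/* Report which vector facility the compiler was told to target.
 * __VSX__ and __ALTIVEC__ are predefined by GCC when -mvsx or -maltivec
 * (or a -mcpu setting that implies them) is in effect. */
const char *simd_target(void)
{
#if defined(__VSX__)
    return "VSX";
#elif defined(__ALTIVEC__)
    return "AltiVec";
#else
    return "none";
#endif
}
```

The same macros can guard an `#include <altivec.h>` so that a source file still builds on targets without the extension.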

Vector Data Types
The C-Interface introduces new keywords and data types:
gcc -maltivec
- 16 x 1 byte: vector unsigned char, vector signed char, vector bool char
- 8 x 2 bytes: vector unsigned short, vector signed short, vector bool short, vector pixel
- 4 x 4 bytes: vector unsigned int, vector signed int, vector bool int, vector float
gcc -mvsx
- 2 x 8 bytes: vector unsigned long, vector signed long, vector double

Vector Data Types: Initialization, Loading and Storing
    vector int va = {1, 2, 3, 4};

    int data[] = {1, 2, 3, 4, 5, 6, 7, 8};
    vector int vb = *((vector int *)data);

    int output[4];
    *((vector int *)output) = va;
Can be very slow!
    printf("vb = {%d, %d, %d, %d};\n",
           vb[0], vb[1], vb[2], vb[3]);

Aligned Addresses
Historically, memory addresses were required to be aligned at 16-byte boundaries for efficiency reasons. (POWER8 has improved unaligned load/store, and modern compilers will support you.)
    int data[] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};   /* compiler specific */
    int *output = aligned_alloc(16, NUM * sizeof(int));
    vector int va = vec_ld(0, data);
    vec_st(va, 0, output);
The arguments of vec_ld and vec_st are an index and an address: the access goes to index + address, truncated to a multiple of 16.
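The truncation performed by vec_ld and vec_st can be modelled with ordinary integer arithmetic. This is a scalar sketch of the address calculation only; the helper names `truncated_address` and `is_aligned16` are made up for illustration.

```c
#include <stdint.h>

/* Address accessed by vec_ld(index, address) or vec_st(v, index, address):
 * the sum is rounded down to a multiple of 16. */
uintptr_t truncated_address(long index, const void *address)
{
    return ((uintptr_t)address + (uintptr_t)index) & ~(uintptr_t)0xF;
}

/* Nonzero if p is already 16-byte aligned. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}
```

A consequence worth noting: if the pointer is not aligned, the load silently reads from the enclosing 16-byte block rather than from the address that was passed.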

Vector Intrinsics
Operations are available through a rich set [1] of “overloaded functions” (actually intrinsics):
    vector int va = {4, 3, 2, 1};
    vector int vb = {1, 2, 3, 4};
    vector int vc = vec_add(va, vb);

    vector float vfa = {4, 3, 2, 1};
    vector float vfb = {1, 2, 3, 4};
    vector float vfc = vec_add(vfa, vfb);
[1] https://gcc.gnu.org/onlinedocs/gcc-8.4.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html

Vector Intrinsics: Lots of overloads
    vector signed char vec_add (vector bool char, vector signed char);
    vector signed char vec_add (vector signed char, vector bool char);
    vector signed char vec_add (vector signed char, vector signed char);
    vector unsigned char vec_add (vector bool char, vector unsigned char);
    vector unsigned char vec_add (vector unsigned char, vector bool char);
    vector unsigned char vec_add (vector unsigned char, vector unsigned char);
    vector signed short vec_add (vector bool short, vector signed short);
    vector signed short vec_add (vector signed short, vector bool short);
    vector signed short vec_add (vector signed short, vector signed short);
    vector unsigned short vec_add (vector bool short, vector unsigned short);
    vector unsigned short vec_add (vector unsigned short, vector bool short);
    vector unsigned short vec_add (vector unsigned short, vector unsigned short);
    vector signed int vec_add (vector bool int, vector signed int);
    vector signed int vec_add (vector signed int, vector bool int);
    vector signed int vec_add (vector signed int, vector signed int);
    vector unsigned int vec_add (vector bool int, vector unsigned int);
    vector unsigned int vec_add (vector unsigned int, vector bool int);
    vector unsigned int vec_add (vector unsigned int, vector unsigned int);
    vector float vec_add (vector float, vector float);
    vector double vec_add (vector double, vector double);
    vector long long vec_add (vector long long, vector long long);
    vector unsigned long long vec_add (vector unsigned long long, vector unsigned long long);
Attention: No implicit conversion! Also, not all types are available for every operation.
[1] https://gcc.gnu.org/onlinedocs/gcc-8.4.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html
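Because there is no implicit conversion, mixing element types needs an explicit conversion intrinsic. The sketch below adds four ints to four floats using `vec_ctf`; the function name `add_int4_to_float4` is hypothetical, and the scalar branch is an addition of this sketch so that it also builds on machines without AltiVec.

```c
#include <string.h>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* out[i] = (float)a[i] + b[i] for four elements. */
void add_int4_to_float4(const int *a, const float *b, float *out)
{
#if defined(__ALTIVEC__)
    vector signed int va;
    vector float vb, vc;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    /* vec_add(va, vb) would be rejected: there is no int/float overload.
     * Convert explicitly first; the second argument of vec_ctf is a scale
     * exponent, and 0 means no scaling. */
    vc = vec_add(vec_ctf(va, 0), vb);
    memcpy(out, &vc, sizeof vc);
#else
    /* Scalar fallback so the sketch also builds without AltiVec. */
    for (int i = 0; i < 4; i++)
        out[i] = (float)a[i] + b[i];
#endif
}
```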

Get Help: Programming Interface Manual
Highly helpful resource: http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf
For each operation it provides:
- Name of operation
- Pseudocode description
- Text description
- Graphical description
- Type table and corresponding assembly instruction

Excerpt (Generic and Specific AltiVec Operations):
vec_add: Vector Add
d = vec_add(a, b)
- Integer add:
      n ← number of elements
      do i = 0 to n-1
          d_i ← a_i + b_i
      end
- Floating-point add:
      do i = 0 to 3
          d_i ← a_i +fp b_i
      end
Each element of a is added to the corresponding element of b. Each sum is placed in the corresponding element of d.
For vector float argument types, if VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element is truncated to a 0 of the same sign.
The valid combinations of argument types and the corresponding result types for d = vec_add(a, b) are shown in Figure 4-12, Figure 4-13, Figure 4-14, and Figure 4-15.
Type table (excerpt; columns d, a, b, maps to):
    vector unsigned char | vector unsigned char | vector unsigned char | vaddubm d,a,b
    vector unsigned char | vector unsigned char | vector bool char     | vaddubm d,a,b
    vector unsigned char | vector bool char     | vector unsigned char | vaddubm d,a,b
    vector signed char   | vector signed char   | …
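The integer-add pseudocode can be written out as a scalar reference for the vector unsigned char case. This is a model for reasoning about results, not the intrinsic itself; the name `vec_add_u8_model` is invented here.

```c
#include <stdint.h>

/* Scalar model of d = vec_add(a, b) for vector unsigned char:
 * do i = 0 to n-1: d[i] = a[i] + b[i], with n = 16.
 * The matching instruction vaddubm is a modulo add, so sums wrap at 256. */
void vec_add_u8_model(const uint8_t a[16], const uint8_t b[16], uint8_t d[16])
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)(a[i] + b[i]);
}
```

With a[i] = 250 and b[i] = 10, the result element is 4 rather than 260, because the sum wraps around.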

Get Help: IBM Knowledge Center
IBM provides online documentation of the extended standard, which GCC does not fully implement.
