 
Parallel Programming and Heterogeneous Computing
SIMD: Integrated Accelerators
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel, and Andreas Polze
Operating Systems and Middleware Group

1 SIMD & AltiVec
ParProg20 C1, Integrated Accelerators, Sven Köhler, Chart 2

Definition SIMD
SIMD ::= Single Instruction Multiple Data
The same instruction is performed simultaneously on multiple data points (fit for data-level parallelism).
First proposed for ILLIAC IV, University of Illinois (1966).
Today many architectures provide SIMD instruction set extensions:
    Intel: MMX, SSE, AVX
    ARM: VFP, NEON, SVE
    POWER: AltiVec (VMX), VSX

Scalar vs. SIMD
How many instructions are needed to add four numbers from memory?

scalar:
    A0 + B0 = C0
    A1 + B1 = C1
    A2 + B2 = C2
    A3 + B3 = C3
    4 additions, 8 loads, 4 stores

4 element SIMD:
    [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]
    1 addition, 2 loads, 1 store

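The comparison above can be sketched in C. This is a minimal illustration, not code from the slides: the function names `add_scalar` and `add_simd` are hypothetical, the SIMD path assumes 16-byte aligned arrays whose length is a multiple of four, and on compilers that do not define `__ALTIVEC__` the SIMD function simply falls back to the scalar loop so the example still builds.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Scalar version: per element, two loads, one addition, one store. */
void add_scalar(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* SIMD version (hypothetical helper): per group of four elements,
 * two vector loads, one vector addition, one vector store.
 * Assumption: a, b, c are 16-byte aligned and n is a multiple of 4. */
void add_simd(const int *a, const int *b, int *c, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ALTIVEC__)
    for (size_t i = 0; i < n; i += 4) {
        vector signed int va = vec_ld(0, a + i);
        vector signed int vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, c + i);
    }
#else
    /* No AltiVec available: fall back to the scalar loop. */
    add_scalar(a, b, c, n);
#endif
}
```

For four elements the AltiVec path executes the loop body once, which matches the 2 loads, 1 addition, 1 store counted on the slide.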
Vector Registers on POWER8 (1)
32 vector registers containing 128 bits each.

Register file (figure):
    VSX:          vsr0, vsr1, …, vsr31 overlay fpr0, fpr1, …, fpr31
    AltiVec/VMX:  vsr32, vsr33, …, vsr63 overlay vr0, vr1, …, vr31

Layout of one 128-bit register (figure):
    Quad Word 0
    Double Word 0, Double Word 1
    Word 0 … Word 3
    Half Word 0 … Half Word 7
    Byte 0 … Byte 15

These are also used by several coprocessors: VSX, SHA2, AES, …

Vector Registers on POWER8 (2)
32 vector registers containing 128 bits each.
Depending on the instruction they are interpreted as
    16 (un)signed bytes
    8 (un)signed shorts
    4 (un)signed integers of 32 bit
    4 single precision floats
    2 (un)signed long integers of 64 bit
    2 double precision floats
    or 2, 4, 8, 16 logic values

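The idea that the same 128 bits are read differently by different instructions can be modelled in portable C with a union. This is only a sketch for illustration: `vreg128_model`, `lanes_for_width`, and `float_lane_bits` are hypothetical names, no AltiVec instruction is involved, and the code assumes a 32-bit IEEE 754 `float`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register. All members
 * share the same 16 bytes; which member is "right" depends on the
 * instruction that reads the register. */
typedef union {
    uint8_t  bytes[16];
    uint16_t halfwords[8];
    uint32_t words[4];
    uint64_t doublewords[2];
    float    floats[4];
    double   doubles[2];
} vreg128_model;

/* Number of elements for a given element width in bytes. */
unsigned lanes_for_width(unsigned element_bytes)
{
    assert(element_bytes == 1 || element_bytes == 2 ||
           element_bytes == 4 || element_bytes == 8);
    return 16u / element_bytes;
}

/* Writes a float into element 0 and reads the same bits back as a
 * 32-bit integer, i.e. a reinterpretation, not a value conversion. */
uint32_t float_lane_bits(float x)
{
    vreg128_model r = { .words = {0, 0, 0, 0} };
    r.floats[0] = x;
    return r.words[0];
}
```

The order in which bytes map onto the wider elements depends on the endianness of the machine, so the model only compares members of equal width.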
AltiVec Instruction Reference
For all instructions, registers and usage see PowerISA 2.07(B), chapter 6 & 7.

Excerpt (Version 2.07 B, 6.7.2 Vector Load Instructions):
The aligned byte, halfword, word, or quadword in storage addressed by EA is loaded into register VRT.
Programming Note: The Load Vector Element instructions load the specified element into the same location in the target register as the location into which it would be loaded using the Load Vector instruction.

Load Vector Element Byte Indexed X-form
    lvebx VRT,RA,RB
    Encoding: 31 | VRT | RA | RB | 7 | /    (bit positions 0, 6, 11, 16, 21, 31)

    if RA = 0 then b ← 0
    else b ← (RA)
    EA ← b + (RB)
    eb ← EA[60:63]
    VRT ← undefined
    if Big-Endian byte ordering then
        VRT[8×eb : 8×eb+7] ← MEM(EA,1)
    else
        VRT[120-(8×eb) : 127-(8×eb)] ← MEM(EA,1)

    Let the effective address (EA) be the sum (RA|0)+(RB). Let eb be bits 60:63 of EA.
    If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA are placed into byte eb of register VRT. The remaining bytes in register VRT are set to undefined values.

Load Vector Element Halfword Indexed X-form
    lvehx VRT,RA,RB
    Encoding: 31 | VRT | RA | RB | 39 | /    (bit positions 0, 6, 11, 16, 21, 31)

    if RA = 0 then b ← 0
    else b ← (RA)
    EA ← (b + (RB)) & 0xFFFF_FFFF_FFFF_FFFE
    eb ← EA[60:63]
    VRT ← undefined
    if Big-Endian byte ordering then
        VRT[8×eb : 8×eb+15] ← MEM(EA,2)
    else
        VRT[112-(8×eb) : 127-(8×eb)] ← MEM(EA,2)

    Let the effective address (EA) be the result of ANDing 0xFFFF_FFFF_FFFF_FFFE with the sum (RA|0)+(RB). Let eb be bits 60:63 of EA.
    If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA …

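The address part of the two pseudocode listings can be modelled in plain C. This is a sketch under stated assumptions, not an emulator: `element_address`, `lvebx_address`, and `lvehx_address` are hypothetical names, register contents are passed in as ordinary 64-bit values, and no memory access or register update is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Result of the address computation: effective address and the
 * element position eb inside the target register. */
typedef struct {
    uint64_t ea;
    unsigned eb;
} element_address;

/* lvebx: EA = (RA|0) + (RB), eb = bits 60:63 of EA.
 * In IBM bit numbering bit 0 is the most significant bit, so bits
 * 60:63 are the four least significant bits. */
element_address lvebx_address(int ra_regno, uint64_t ra_value, uint64_t rb_value)
{
    assert(ra_regno >= 0 && ra_regno < 32);
    uint64_t b = (ra_regno == 0) ? 0 : ra_value; /* register number 0 reads as 0 */
    element_address r;
    r.ea = b + rb_value;
    r.eb = (unsigned)(r.ea & 0xF);
    return r;
}

/* lvehx: same as above, but EA is forced to a halfword boundary. */
element_address lvehx_address(int ra_regno, uint64_t ra_value, uint64_t rb_value)
{
    assert(ra_regno >= 0 && ra_regno < 32);
    uint64_t b = (ra_regno == 0) ? 0 : ra_value;
    element_address r;
    r.ea = (b + rb_value) & 0xFFFFFFFFFFFFFFFEULL;
    r.eb = (unsigned)(r.ea & 0xF);
    return r;
}
```

With base 0x1000 and offset 7, the byte variant addresses 0x1007 (element 7), while the halfword variant drops the lowest bit and addresses 0x1006 (element 6).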
2 C-Interface

    #include <altivec.h>

    gcc -maltivec -mabi=altivec
    gcc -mvsx
    xlc -qaltivec -qarch=auto

Vector Data Types
The C-Interface introduces new keywords and data types:

gcc -maltivec
    16 x 1 byte     vector unsigned char, vector signed char, vector bool char
    8 x 2 bytes     vector unsigned short, vector signed short, vector bool short, vector pixel
    4 x 4 bytes     vector unsigned int, vector signed int, vector bool int, vector float

gcc -mvsx
    2 x 8 bytes     vector unsigned long, vector signed long, vector double

Vector Data Types: Initialization, Loading and Storing

    vector int va = {1, 2, 3, 4};

    int data[] = {1, 2, 3, 4, 5, 6, 7, 8};
    vector int vb = *((vector int *)data);

    int output[4];
    *((vector int *)output) = va;

Can be very slow!

    printf("vb = {%d, %d, %d, %d};\n",
           vb[0], vb[1], vb[2], vb[3]);

Aligned Addresses
Historically, memory addresses were required to be aligned at 16-byte boundaries for efficiency reasons. (POWER8 has improved unaligned load/store, and modern compilers will support you.)

    int data[] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};   /* compiler specific */
    int *output = aligned_alloc(16, NUM * sizeof(int));

    vector int va = vec_ld(0, data);
    vec_st(va, 0, output);

vec_ld and vec_st take an index and an address. The access goes to index + address, truncated to a multiple of 16.

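The truncation rule can be made concrete with a small portable model. This is a sketch only: `vec_ld_effective_address` and `alloc_int_vectors` are hypothetical helpers, not part of `altivec.h`. The allocation helper rounds the size up because C11 specifies `aligned_alloc` for sizes that are a multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Models the address that vec_ld(index, base) or vec_st(v, index, base)
 * actually accesses: index + address, with the low four bits cleared. */
uintptr_t vec_ld_effective_address(long index, const void *base)
{
    return ((uintptr_t)base + (uintptr_t)index) & ~(uintptr_t)0xF;
}

/* Allocates room for count ints on a 16-byte boundary.
 * The byte size is rounded up to a multiple of 16. Release with free(). */
int *alloc_int_vectors(size_t count)
{
    assert(count > 0);
    size_t bytes = count * sizeof(int);
    size_t rounded = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, rounded);
}
```

An index of 4 on an aligned base therefore still addresses the first vector, while an index of 16 addresses the second one. A misaligned pointer does not fault in this model; it silently reads from the enclosing aligned 16 bytes.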
Vector Intrinsics
Operations are available through a rich set [1] of "overloaded functions" (actually intrinsics):

    vector int va = {4, 3, 2, 1};
    vector int vb = {1, 2, 3, 4};
    vector int vc = vec_add(va, vb);

    vector float vfa = {4, 3, 2, 1};
    vector float vfb = {1, 2, 3, 4};
    vector float vfc = vec_add(vfa, vfb);

    [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]

[1] https://gcc.gnu.org/onlinedocs/gcc-8.4.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html

Vector Intrinsics: Lots of overloads [1]

    vector signed char vec_add (vector bool char, vector signed char);
    vector signed char vec_add (vector signed char, vector bool char);
    vector signed char vec_add (vector signed char, vector signed char);
    vector unsigned char vec_add (vector bool char, vector unsigned char);
    vector unsigned char vec_add (vector unsigned char, vector bool char);
    vector unsigned char vec_add (vector unsigned char, vector unsigned char);
    vector signed short vec_add (vector bool short, vector signed short);
    vector signed short vec_add (vector signed short, vector bool short);
    vector signed short vec_add (vector signed short, vector signed short);
    vector unsigned short vec_add (vector bool short, vector unsigned short);
    vector unsigned short vec_add (vector unsigned short, vector bool short);
    vector unsigned short vec_add (vector unsigned short, vector unsigned short);
    vector signed int vec_add (vector bool int, vector signed int);
    vector signed int vec_add (vector signed int, vector bool int);
    vector signed int vec_add (vector signed int, vector signed int);
    vector unsigned int vec_add (vector bool int, vector unsigned int);
    vector unsigned int vec_add (vector unsigned int, vector bool int);
    vector unsigned int vec_add (vector unsigned int, vector unsigned int);
    vector float vec_add (vector float, vector float);
    vector double vec_add (vector double, vector double);
    vector long long vec_add (vector long long, vector long long);
    vector unsigned long long vec_add (vector unsigned long long, vector unsigned long long);

Attention: No implicit conversion!
Also not all types for every operation.

[1] https://gcc.gnu.org/onlinedocs/gcc-8.4.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html

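Because there is no `vec_add` overload mixing `vector signed int` and `vector float`, operands have to be converted explicitly first. The following sketch shows one way to do that with `vec_ctf`, which converts integer elements to floats by value. The function name `add_int_to_float4` is hypothetical, the pointers are assumed to be 16-byte aligned, and without `__ALTIVEC__` a scalar fallback is compiled instead.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Adds four ints to four floats element-wise (hypothetical helper).
 * Assumption: i4, f4 and out4 each point to 16-byte aligned storage. */
void add_int_to_float4(const int *i4, const float *f4, float *out4)
{
    assert(((uintptr_t)i4 % 16) == 0);
    assert(((uintptr_t)f4 % 16) == 0);
    assert(((uintptr_t)out4 % 16) == 0);
#if defined(__ALTIVEC__)
    vector signed int vi = vec_ld(0, i4);
    vector float vf = vec_ld(0, f4);
    /* vec_add(vi, vf) is rejected: no overload takes mixed types.
     * vec_ctf converts each int to float; the second argument is a
     * scale exponent, 0 means the values are used unscaled. */
    vector float vi_as_float = vec_ctf(vi, 0);
    vec_st(vec_add(vi_as_float, vf), 0, out4);
#else
    /* No AltiVec available: scalar fallback with the same result. */
    for (int k = 0; k < 4; ++k)
        out4[k] = (float)i4[k] + f4[k];
#endif
}
```

A C-style cast such as `(vector float)vi` would compile as well, but it reinterprets the bits instead of converting the values, so it is not a substitute for `vec_ctf` here.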
Get Help: Programming Interface Manual
Highly helpful resource: http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf
Each entry under "Generic and Specific AltiVec Operations" contains:
    Name of operation
    Pseudocode description
    Text description
    Graphical description
    Type table and corresponding assembly instruction

Example entry: vec_add (Vector Add)

    d = vec_add(a, b)

    Integer add:
        n ← number of elements
        do i=0 to n-1
            d_i ← a_i + b_i
        end

    Floating-point add:
        do i=0 to 3
            d_i ← a_i +fp b_i
        end

    Each element of a is added to the corresponding element of b. Each sum is placed in the corresponding element of d.
    For vector float argument types, if VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element is truncated to a 0 of the same sign.
    The valid combinations of argument types and the corresponding result types for d = vec_add(a, b) are shown in Figure 4-12, Figure 4-13, Figure 4-14, and Figure 4-15.

    Graphical description (figure): elements 0 to 15 of a and b are added pairwise into elements 0 to 15 of d.

    Type table (excerpt):
    d                       a                       b                       maps to
    vector unsigned char    vector unsigned char    vector unsigned char    vaddubm d,a,b
                            vector unsigned char    vector bool char
                            vector bool char        vector unsigned char
    vector signed char      vector signed char      …

Get Help: IBM Knowledge Center
IBM provides online documentation of the extended standard, which GCC does not fully implement.