Parallel Programming and Heterogeneous Computing
SIMD: Integrated Accelerators
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel, and Andreas Polze
Operating Systems and Middleware Group

1 SIMD & AltiVec
ParProg 2019, SIMD: Integrated Accelerators, Sven Köhler, Chart 2

Definition SIMD
SIMD ::= Single Instruction Multiple Data
The same instruction is performed simultaneously on multiple data points (data-level parallelism).
First proposed for ILLIAC IV, University of Illinois (1966).
Today many architectures provide SIMD instruction set extensions.
Intel: MMX, SSE, AVX
ARM: VFP, NEON, SVE
POWER: AltiVec (VMX), VSX

Flynn's Taxonomy on Multiprocessors (1966)
Classification along the instruction and data processing dimensions:
Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)
(C) Blaise Barney

Scalar vs. SIMD
How many instructions are needed to add four numbers from memory?
scalar: A0 + B0 = C0, A1 + B1 = C1, A2 + B2 = C2, A3 + B3 = C3 (4 additions, 8 loads, 4 stores)
4 element SIMD: [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3] (1 addition, 2 loads, 1 store)
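The comparison above can be sketched in portable C. This is a minimal sketch, assuming GCC or Clang: it uses their generic vector extension (`vector_size`) instead of `<altivec.h>`, so that it also builds on non-POWER machines. The names `v4si`, `add_scalar`, and `add_simd` are illustrative and do not come from the slides.

```c
#include <string.h>

/* Assumption: GCC/Clang generic vector extension.
   v4si holds 4 x 32-bit int = 128 bits, like one AltiVec register. */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar version: conceptually 4 additions, 8 loads, 4 stores. */
void add_scalar(const int *a, const int *b, int *c)
{
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: conceptually 2 vector loads, 1 addition, 1 vector store. */
void add_simd(const int *a, const int *b, int *c)
{
    v4si va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load A0..A3 */
    memcpy(&vb, b, sizeof vb);   /* load B0..B3 */
    vc = va + vb;                /* one element-wise addition */
    memcpy(c, &vc, sizeof vc);   /* store C0..C3 */
}
```

The instruction counts in the comments describe the conceptual mapping only; an optimizing compiler may well vectorize the scalar loop on its own.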

Vector Registers on POWER8 (1)
32 vector registers containing 128 bits each.
VSX:          fpr0 / vsr0, fpr1 / vsr1, …, fpr31 / vsr31
AltiVec/VMX:  vr0 / vsr32, vr1 / vsr33, …, vr31 / vsr63
Layout of one register:
Quad Word 0
Double Word 0, Double Word 1
Word 0 … Word 3
Half Word 0 … Half Word 7
Byte 0 … Byte 15
These are also used by several coprocessors: VSX, SHA2, AES, …

Vector Registers on POWER8 (2)
32 vector registers containing 128 bits each.
Depending on the instruction they are interpreted as
16 (un)signed bytes
8 (un)signed shorts
4 (un)signed integers of 32 bit
4 single precision floats
2 (un)signed long integers of 64 bit
2 double precision floats
or 2, 4, 8, 16 logic values
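The interpretations above can be mimicked with a plain C union over 16 bytes. This is only an illustrative sketch: `vreg128`, `lanes_for`, and `from_floats` are hypothetical names, and the union models the idea of one bit pattern seen through different element layouts, not the actual register file.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: one 128-bit register image seen through
   the element layouts listed above. */
typedef union {
    uint8_t  bytes[16];
    uint16_t shorts[8];
    uint32_t ints[4];
    float    floats[4];
    uint64_t longs[2];
    double   doubles[2];
} vreg128;

/* Number of elements that fit for a given element size in bytes. */
unsigned lanes_for(unsigned element_size)
{
    return (unsigned)sizeof(vreg128) / element_size;
}

/* Build a register image from four single precision floats. */
vreg128 from_floats(float a, float b, float c, float d)
{
    vreg128 r;
    float tmp[4] = {a, b, c, d};
    memcpy(r.floats, tmp, sizeof tmp);
    return r;
}
```

Reading `ints[0]` after writing `floats[0]` yields the raw bit pattern of the float, which is the same effect as an instruction interpreting the register with a different element type.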

AltiVec Instruction Reference
For all instructions, registers and usage see PowerISA 2.07(B), chapter 6 & 7.
Excerpt (Version 2.07 B): 6.7.2 Vector Load Instructions
The aligned byte, halfword, word, or quadword in storage addressed by EA is loaded into register VRT.
Programming Note: The Load Vector Element instructions load the specified element into the same location in the target register as the location into which it would be loaded using the Load Vector instruction.

Load Vector Element Byte Indexed X-form
lvebx VRT,RA,RB
Encoding: 31 | VRT | RA | RB | 7 | / (fields start at bits 0, 6, 11, 16, 21, 31)
    if RA = 0 then b ← 0
    else b ← (RA)
    EA ← b + (RB)
    eb ← EA[60:63]
    VRT ← undefined
    if Big-Endian byte ordering then
        VRT[8×eb : 8×eb+7] ← MEM(EA,1)
    else
        VRT[120-(8×eb) : 127-(8×eb)] ← MEM(EA,1)
Let the effective address (EA) be the sum (RA|0)+(RB). Let eb be bits 60:63 of EA. If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA are placed into byte eb of register VRT. The remaining bytes in register VRT are set to undefined values.

Load Vector Element Halfword Indexed X-form
lvehx VRT,RA,RB
Encoding: 31 | VRT | RA | RB | 39 | / (fields start at bits 0, 6, 11, 16, 21, 31)
    if RA = 0 then b ← 0
    else b ← (RA)
    EA ← (b + (RB)) & 0xFFFF_FFFF_FFFF_FFFE
    eb ← EA[60:63]
    VRT ← undefined
    if Big-Endian byte ordering then
        VRT[8×eb : 8×eb+15] ← MEM(EA,2)
    else
        VRT[112-(8×eb) : 127-(8×eb)] ← MEM(EA,2)
Let the effective address (EA) be the result of ANDing 0xFFFF_FFFF_FFFF_FFFE with the sum (RA|0)+(RB). Let eb be bits 60:63 of EA. If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA …
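To make the pseudocode above easier to follow, here is a small software model in C of the Big-Endian case of both instructions. It is a sketch under stated assumptions, not an emulator: `model_lvebx` and `model_lvehx` are hypothetical names, `mem` stands for a flat memory image indexed by effective address, `b` is the already resolved value of (RA|0), and bytes that the ISA leaves undefined are simply left untouched.

```c
#include <stdint.h>

/* Model of lvebx, Big-Endian byte ordering.
   b  = contents of RA, or 0 if RA = 0
   rb = contents of RB
   Returns eb, the byte position that was written. */
unsigned model_lvebx(uint8_t vrt[16], const uint8_t *mem,
                     uint64_t b, uint64_t rb)
{
    uint64_t ea = b + rb;                /* EA <- b + (RB) */
    unsigned eb = (unsigned)(ea & 0xF);  /* bits 60:63 are the low 4 bits */
    vrt[eb] = mem[ea];                   /* byte eb <- MEM(EA,1) */
    return eb;
}

/* Model of lvehx, Big-Endian byte ordering.
   The effective address is forced to an even value first. */
unsigned model_lvehx(uint8_t vrt[16], const uint8_t *mem,
                     uint64_t b, uint64_t rb)
{
    uint64_t ea = (b + rb) & 0xFFFFFFFFFFFFFFFEull;
    unsigned eb = (unsigned)(ea & 0xF);  /* always even, so eb + 1 <= 15 */
    vrt[eb]     = mem[ea];               /* bytes eb, eb+1 <- MEM(EA,2) */
    vrt[eb + 1] = mem[ea + 1];
    return eb;
}
```

The model shows why a loaded element ends up at the same position inside the register that it occupies within its 16-byte block in memory.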

2 C-Interface
#include <altivec.h>
gcc -maltivec -mabi=altivec
gcc -mvsx
xlc -qaltivec -qarch=auto

Vector Data Types
The C-Interface introduces new keywords and data types:
16 x 1 byte   (gcc -maltivec):  vector unsigned char, vector signed char, vector bool char
8 x 2 bytes   (gcc -maltivec):  vector unsigned short, vector signed short, vector bool short, vector pixel
4 x 4 bytes   (gcc -maltivec):  vector unsigned int, vector signed int, vector bool int, vector float
2 x 8 bytes   (gcc -mvsx):      vector unsigned long, vector signed long, vector double

Vector Data Types
Initialization, Loading and Storing
    vector int va = {1, 2, 3, 4};
    int data[] = {1, 2, 3, 4, 5, 6, 7, 8};
    vector int vb = *((vector int *)data);
    int output[4];
    *((vector int *)output) = va;
Can be very slow!
    printf("vb = {%d, %d, %d, %d};\n", vb[0], vb[1], vb[2], vb[3]);

Aligned Addresses
Historically, memory addresses were required to be aligned at 16-byte boundaries for efficiency reasons. (POWER8 has improved unaligned load/store, and modern compilers will support you.)
    int data[] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};   /* compiler specific */
    int *output = aligned_alloc(16, NUM * sizeof(int));
    vector int va = vec_ld(0, data);
    vec_st(va, 0, output);
vec_ld and vec_st take the byte index before the address. The access goes to address + index (truncated to a multiple of 16).
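The address truncation and the aligned allocation can be illustrated in plain C. This is a hedged sketch: `truncated_address`, `is_aligned16`, and `alloc_ints_aligned16` are hypothetical helper names, and the rounding in the allocator is a conservative assumption because C11 as originally published expects the size passed to `aligned_alloc` to be a multiple of the alignment.

```c
#include <stdint.h>
#include <stdlib.h>

/* Address actually accessed by an aligned vector load/store:
   address + index, with the low four bits cleared. */
uintptr_t truncated_address(uintptr_t address, long index)
{
    return (address + (uintptr_t)index) & ~(uintptr_t)15;
}

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Allocate room for n ints on a 16-byte boundary.
   The size is rounded up to a multiple of 16 before the call.
   The caller releases the memory with free(). */
int *alloc_ints_aligned16(size_t n)
{
    size_t bytes   = n * sizeof(int);
    size_t rounded = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, rounded);
}
```

For example, an access at address 0x1003 with index 0 is served from 0x1000, so an unaligned pointer silently reads a different 16-byte block than intended.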

Vector Intrinsics
Operations are available through a rich set [1] of "overloaded functions" (actually intrinsics):
    vector int va = {4, 3, 2, 1};
    vector int vb = {1, 2, 3, 4};
    vector int vc = vec_add(va, vb);
    [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]
    vector float vfa = {4, 3, 2, 1};
    vector float vfb = {1, 2, 3, 4};
    vector float vfc = vec_add(vfa, vfb);
[1] https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html

Vector Intrinsics: Lots of overloads
    vector signed char vec_add (vector bool char, vector signed char);
    vector signed char vec_add (vector signed char, vector bool char);
    vector signed char vec_add (vector signed char, vector signed char);
    vector unsigned char vec_add (vector bool char, vector unsigned char);
    vector unsigned char vec_add (vector unsigned char, vector bool char);
    vector unsigned char vec_add (vector unsigned char, vector unsigned char);
    vector signed short vec_add (vector bool short, vector signed short);
    vector signed short vec_add (vector signed short, vector bool short);
    vector signed short vec_add (vector signed short, vector signed short);
    vector unsigned short vec_add (vector bool short, vector unsigned short);
    vector unsigned short vec_add (vector unsigned short, vector bool short);
    vector unsigned short vec_add (vector unsigned short, vector unsigned short);
    vector signed int vec_add (vector bool int, vector signed int);
    vector signed int vec_add (vector signed int, vector bool int);
    vector signed int vec_add (vector signed int, vector signed int);
    vector unsigned int vec_add (vector bool int, vector unsigned int);
    vector unsigned int vec_add (vector unsigned int, vector bool int);
    vector unsigned int vec_add (vector unsigned int, vector unsigned int);
    vector float vec_add (vector float, vector float);
    vector double vec_add (vector double, vector double);
    vector long long vec_add (vector long long, vector long long);
    vector unsigned long long vec_add (vector unsigned long long, vector unsigned long long);
Attention: No implicit conversion! Also, not all types are available for every operation.
[1] https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html
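Since the overload list contains no `vec_add` that mixes signed and unsigned operands, one operand has to be converted explicitly. The sketch below shows the idea with the GCC/Clang generic vector extension rather than `<altivec.h>`, so that it builds on any target; the names `v4si`, `v4su`, and `add_mixed` are assumptions made for illustration. A cast between vector types of the same size reinterprets the bits and does not change them.

```c
#include <string.h>

/* Assumption: GCC/Clang generic vector extension, 128-bit vectors. */
typedef int      v4si __attribute__((vector_size(16)));
typedef unsigned v4su __attribute__((vector_size(16)));

/* Add four signed and four unsigned ints element-wise.
   The unsigned operand is cast explicitly so that both operands
   of the addition have the same vector type. */
void add_mixed(const int *s, const unsigned *u, int *out)
{
    v4si vs;
    v4su vu;
    memcpy(&vs, s, sizeof vs);
    memcpy(&vu, u, sizeof vu);
    v4si sum = vs + (v4si)vu;   /* explicit cast, bit pattern unchanged */
    memcpy(out, &sum, sizeof sum);
}
```

With `<altivec.h>` the same pattern would presumably be written as a cast on the argument of `vec_add`, for example `vec_add(vs, (vector signed int)vu)`.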

Get Help: Programming Interface Manual
Highly helpful resource: http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf
Each entry provides:
□ Name of operation
□ Pseudocode description
□ Text description
□ Graphical description
□ Type table and corresponding assembly instruction

Excerpt (Generic and Specific AltiVec Operations): vec_add, Vector Add
d = vec_add(a, b)
• Integer add:
    n ← number of elements
    do i=0 to n-1
        d[i] ← a[i] + b[i]
    end
• Floating-point add:
    do i=0 to 3
        d[i] ← a[i] +fp b[i]
    end
Each element of a is added to the corresponding element of b. Each sum is placed in the corresponding element of d.
For vector float argument types, if VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element is truncated to a 0 of the same sign.
The valid combinations of argument types and the corresponding result types for d = vec_add(a, b) are shown in Figure 4-12, Figure 4-13, Figure 4-14, and Figure 4-15.
[Figure: elements 0 to 15 of a and b are added pairwise to give elements 0 to 15 of d]
d                      a                      b                      maps to
vector unsigned char   vector unsigned char   vector unsigned char   vaddubm d,a,b
vector unsigned char   vector unsigned char   vector bool char       vaddubm d,a,b
vector unsigned char   vector bool char       vector unsigned char   vaddubm d,a,b
vector signed char     vector signed char     …
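The integer pseudocode above translates directly into a scalar reference loop, which can be handy for checking vectorized results. A minimal sketch for the unsigned char case follows; `ref_vaddubm` is a hypothetical name chosen after the instruction the type table maps to (Vector Add Unsigned Byte Modulo), so sums wrap around modulo 256.

```c
#include <stdint.h>

/* Scalar reference for d = vec_add(a, b) on 16 unsigned bytes.
   Follows the pseudocode: do i=0 to n-1, d[i] <- a[i] + b[i].
   The cast keeps only the low 8 bits, i.e. modulo arithmetic. */
void ref_vaddubm(const uint8_t a[16], const uint8_t b[16], uint8_t d[16])
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)(a[i] + b[i]);
}
```

For instance, 250 + 6 yields 0 in this model because the result wraps around instead of saturating.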

Get Help: IBM Knowledge Center
IBM provides online documentation of the extended standard, which GCC does not fully implement.
