Assembly Language Programming
Parallel architectures

Zbigniew Jurkiewicz, Instytut Informatyki UW
December 4, 2017
Images: Song Ho Anh

Flynn taxonomy
Parallel architectures vary widely, so they can be partitioned by several different criteria. The oldest popular classification, due to Flynn, uses the concepts of instruction stream and data stream.
- SISD (Single Instruction, Single Data): the classical uniprocessor system, with a single stream of instructions and a single stream of data.
- SIMD (Single Instruction, Multiple Data): obtained by multiplying the data streams. Many processors execute the same instruction, each on a different data unit. This speeds up typical matrix operations, like multiplication or inversion. This mode was used in vector processors (e.g. Cray-1) and matrix processors.

Flynn taxonomy
- MISD (Multiple Instruction, Single Data): obtained by multiplying the instruction streams instead. Used only in very specific applications (e.g. nondeterministic or mirror processing).
- MIMD (Multiple Instruction, Multiple Data): obtained by multiplying both streams. This gives full concurrency: different processors can independently process different data.

Division by method of synchronization
- synchronous, e.g. SIMD;
- asynchronous, e.g. MIMD.

MMX
Intel's first approach to SIMD (Pentium MMX). It is "parasitic" on the FPU: it uses the same registers. It introduced the first 64-bit instructions, e.g.
    movq mm0,[vector]
which moves a vector of values from memory into an MMX register. These instructions are also available in SSE.

Checking for SSE technology
Before using advanced instructions we have to check whether our processor supports them. Our laboratories have different processors! The check is performed with the cpuid instruction.

CPUID
For function 1 (EAX = 1) we get in the EDX register:
- bit 0: is the FPU present?
- bit 11: do SYSENTER/SYSEXIT work?
- bit 15: is CMOV present?
- bit 23: is MMX present?
- bit 25: is SSE present?
- bit 26: is SSE2 present?
and in ECX:
- bit 0: is SSE3 present?
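
The bit tests for function 1 can be sketched as a small decoder. This is only a model in Python, not a call to cpuid: the register values `sample_edx` and `sample_ecx` are hypothetical, built by hand, and the table covers only some of the bits listed on the slide.

```python
# Some of the feature bits reported by CPUID function 1.
EDX_FEATURES = {0: "FPU", 15: "CMOV", 23: "MMX", 25: "SSE", 26: "SSE2"}
ECX_FEATURES = {0: "SSE3"}

def decode_features(edx, ecx):
    """Return the names of the listed features whose bits are set."""
    names = [name for bit, name in EDX_FEATURES.items() if (edx >> bit) & 1]
    names += [name for bit, name in ECX_FEATURES.items() if (ecx >> bit) & 1]
    return names

# Hypothetical register contents: FPU, MMX, SSE, SSE2 and SSE3 reported.
sample_edx = (1 << 0) | (1 << 23) | (1 << 25) | (1 << 26)
sample_ecx = 1 << 0
print(decode_features(sample_edx, sample_ecx))
# ['FPU', 'MMX', 'SSE', 'SSE2', 'SSE3']
```

In assembly the same test is usually a `test edx, 1 << 25` (or `bt edx, 25`) followed by a conditional jump.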

CPUID
For function 4 we have in EAX:
- bits 26..31: number of processors (cores) in the package, minus 1
Much more can be found in the documentation.
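
Extracting that field is a shift and a mask. The sketch below assumes the Intel definition of the field, which stores one less than the maximum number of addressable core IDs in the package. The value `sample_eax` is a hypothetical register content chosen for illustration.

```python
def cores_from_leaf4_eax(eax):
    """Extract bits 26..31 of EAX and add 1 (the field stores count - 1)."""
    field = (eax >> 26) & 0x3F   # six bits: 26, 27, ..., 31
    return field + 1

# Hypothetical EAX value; its top six bits are 000111 (decimal 7).
sample_eax = 0x1C004121
print(cores_from_leaf4_eax(sample_eax))  # 8
```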

XMM registers
Used by the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE5, Advanced Vector Extensions, ... instruction sets.
- In 32-bit mode: 8 128-bit registers, XMM0–XMM7.
- In 64-bit mode: 16 128-bit registers, XMM0–XMM15.

XMM registers
They can store
- a vector of 16 8-bit integer values,
- a vector of 8 16-bit integer values,
- a vector of 4 32-bit integer or floating-point values,
- a pair ("vector") of 2 64-bit integer or floating-point values.
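
To illustrate that these are just different views of the same 128 bits, here is a small Python sketch using the standard `struct` module. The 16-byte string `register` merely stands in for an XMM register; x86 is little-endian, hence the `<` in the format strings.

```python
import struct

# 16 bytes stand in for the contents of one 128-bit XMM register.
register = bytes(range(16))

as_bytes = struct.unpack("<16B", register)   # 16 x 8-bit integers
as_words = struct.unpack("<8H", register)    # 8 x 16-bit integers
as_dwords = struct.unpack("<4I", register)   # 4 x 32-bit integers
as_qwords = struct.unpack("<2Q", register)   # 2 x 64-bit integers

# The same 16 bytes can also hold floating-point lanes.
floats = struct.unpack("<4f", struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
doubles = struct.unpack("<2d", struct.pack("<2d", 1.5, -2.5))

print(len(as_bytes), len(as_words), len(as_dwords), len(as_qwords))  # 16 8 4 2
print(hex(as_dwords[0]))  # 0x3020100
```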

Data Transfers
Arithmetic instructions accept only memory operands aligned to 16 bytes (and, of course, registers).
Two kinds of transfers to/from registers:
- MOVDQU if the address is not aligned (Unaligned),
- MOVDQA if the address is aligned (Aligned).
Clearing (zeroing) a register:
    pxor xmm1,xmm1
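
An address is aligned to 16 bytes exactly when its four lowest bits are zero. The sketch below models that check in Python. The helper names are made up for illustration, and the addresses are arbitrary example values.

```python
def is_aligned16(address):
    """True if the address is a multiple of 16 (low four bits are zero)."""
    return (address & 0xF) == 0

def choose_move(address):
    """Pick the transfer instruction that is safe for this address."""
    return "movdqa" if is_aligned16(address) else "movdqu"

def align_up16(address):
    """Round an address up to the next multiple of 16."""
    return (address + 15) & ~15

print(choose_move(0x1000))      # movdqa
print(choose_move(0x1008))      # movdqu
print(hex(align_up16(0x1001)))  # 0x1010
```

MOVDQU works for any address, so it is the safe choice when the alignment is unknown.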

Addition
Already present in MMX for integer vectors.
SSE adds operations on (small) vectors of single-precision floating-point numbers, e.g. ADDPS (where P stands for Packed and S for Single).
SSE2 extends this to double-precision floating-point numbers, e.g. ADDPD.
There are also operations on single values, called scalars, e.g. ADDSS and ADDSD.

Example addition
We are to add two three-dimensional vectors, represented in single precision. Each has to occupy 128 bits, so we assume that the "highest" (fourth) element is cleared.
    movups xmm0,[eax]   ; first vector
    movups xmm1,[ebx]   ; second vector
    addps xmm0,xmm1     ; element-wise sum in xmm0
    movups [edx],xmm0   ; store the result
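
The difference between the packed and the scalar form can be modelled lane by lane. This is a Python sketch of the semantics only, with a register written as a list of four values and lane 0 as the lowest element. The helper `f32` is an illustrative assumption that rounds a result to single precision.

```python
import struct

def f32(x):
    """Round a Python float to single precision, as a 32-bit lane stores it."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def addps(a, b):
    """Packed add: all four lanes are added pairwise."""
    return [f32(x + y) for x, y in zip(a, b)]

def addss(a, b):
    """Scalar add: only lane 0 is added; lanes 1..3 are kept from a."""
    return [f32(a[0] + b[0])] + list(a[1:])

# Two three-dimensional vectors, padded with a zero in the fourth lane.
u = [1.0, 2.0, 3.0, 0.0]
v = [10.0, 20.0, 30.0, 0.0]
print(addps(u, v))  # [11.0, 22.0, 33.0, 0.0]
print(addss(u, v))  # [11.0, 2.0, 3.0, 0.0]
```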

Multiplication
Similar to addition: MULPS/MULPD for vectors, MULSS/MULSD for scalars.

Operations scalar/vector
When we add a scalar to a vector, we must first replicate the scalar so that it looks like a vector.
The simplest way is to use the "shuffling" operation:
    shufps xmm1,xmm2,mask
This operation places in the first argument two elements selected from the first argument (in the two low positions) and two elements selected from the second argument (in the two high positions). The selection is controlled by the 8-bit mask, two bits per result element.
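
The mask decoding can be modelled as follows. This is a Python sketch of the lane selection, not the instruction itself: registers are lists of four values with lane 0 as the lowest element, and the example values are arbitrary.

```python
def shufps(dst, src, mask):
    """Lane model of SHUFPS dst,src,mask (2 mask bits per result lane)."""
    return [
        dst[mask & 3],          # result lane 0: chosen from the first operand
        dst[(mask >> 2) & 3],   # result lane 1: chosen from the first operand
        src[(mask >> 4) & 3],   # result lane 2: chosen from the second operand
        src[(mask >> 6) & 3],   # result lane 3: chosen from the second operand
    ]

a = [0.0, 1.0, 2.0, 3.0]
b = [10.0, 11.0, 12.0, 13.0]
print(shufps(a, b, 0b11100100))  # [0.0, 1.0, 12.0, 13.0]
print(shufps(a, b, 0b00011011))  # [3.0, 2.0, 11.0, 10.0]
```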

Shuffling
[figure]

Adding scalar to vector
When shuffling, the first argument may be the same as the second one.
    movss xmm0,[fscalar]        ; scalar into the lowest element
    shufps xmm0,xmm0,00000000b  ; copy it to all four elements
    addps xmm0,xmm1             ; add to the vector in xmm1
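
Step by step, the sequence behaves as in this Python sketch. It is a lane model under the assumption that the scalar is loaded from memory, in which case MOVSS zeroes the three upper lanes. The function names and the values are illustrative.

```python
def movss_load(value):
    """MOVSS from memory: lane 0 receives the scalar, lanes 1..3 are zeroed."""
    return [value, 0.0, 0.0, 0.0]

def broadcast_lane0(reg):
    """Effect of SHUFPS reg,reg,0: every lane becomes a copy of lane 0."""
    return [reg[0]] * 4

def addps(a, b):
    """Packed add of four lanes."""
    return [x + y for x, y in zip(a, b)]

xmm1 = [1.0, 2.0, 3.0, 4.0]
xmm0 = movss_load(0.5)        # [0.5, 0.0, 0.0, 0.0]
xmm0 = broadcast_lane0(xmm0)  # [0.5, 0.5, 0.5, 0.5]
xmm0 = addps(xmm0, xmm1)
print(xmm0)  # [1.5, 2.5, 3.5, 4.5]
```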

Comparisons
The comparison operations in SSE do not set flags (there are not enough flags). Instead, the first argument is set to true (all ones) or false (all zeros), of course separately for each pair of elements.
Thus comparison operations are destructive!
    cmpeqpd xmm0,xmm1   ; result in xmm0 (equality used as the example predicate)
    movmskpd eax,xmm0   ; collect the sign bit of each element into eax
    cmp eax,0
    jne tam             ; jump if any pair compared true
Remark: for scalars the instructions COMISD and UCOMISD directly modify EFLAGS, and we can use conditional jumps as usual.
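
The mask-then-collect pattern can be modelled like this. It is a Python sketch with two 64-bit lanes per register. Equality is used as the predicate purely as an example, and the input values are arbitrary.

```python
ALL_ONES = 0xFFFFFFFFFFFFFFFF

def cmpeqpd(a, b):
    """Packed compare for equality: each lane becomes all ones or all zeros."""
    return [ALL_ONES if x == y else 0 for x, y in zip(a, b)]

def movmskpd(lanes):
    """Collect the top (sign) bit of each 64-bit lane into a small integer."""
    result = 0
    for i, lane in enumerate(lanes):
        result |= ((lane >> 63) & 1) << i
    return result

a = [1.0, 2.0]
b = [1.0, 3.0]
mask = cmpeqpd(a, b)             # lane 0 true, lane 1 false
bits = movmskpd(mask)
print(bits)                      # 1
print("jump taken" if bits != 0 else "fall through")  # jump taken
```

Comparing `bits` with 3 instead of 0 would test whether both pairs compared true.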

Comparisons
[figure]

SSE2
Scalar operations on single- and double-precision floating-point values stored in the lowest parts of XMM registers.
Because these floating-point operations work register-to-register (instead of on a stack), the FPU is needed much less; e.g. GCC uses the FPU only in rare, special cases.
Floating-point constants must be fetched from memory; they cannot be placed directly in instructions as immediate operands.
Comparing floating-point values has four possible results (also in C): <, =, > and "none of these" (for example when an argument is NaN).
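
The fourth outcome is easy to observe in any language with IEEE 754 floats. A minimal Python sketch follows; the function name `compare` and the label "unordered" are chosen here for illustration.

```python
import math

def compare(a, b):
    """Classify a floating-point comparison into one of four outcomes."""
    if a < b:
        return "<"
    if a == b:
        return "="
    if a > b:
        return ">"
    return "unordered"   # reached only when a or b is NaN

print(compare(1.0, 2.0))       # <
print(compare(2.0, 2.0))       # =
print(compare(3.0, 2.0))       # >
print(compare(math.nan, 2.0))  # unordered
```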

SSE3
Added some interesting (read "exotic") instructions, for example horizontal addition HADDPS/HADDPD, which adds neighbouring pairs of elements in its arguments and stores the result in the first argument.
We will use it to compute the scalar product of two vectors:
    movups xmm0,[eax]   ; first vector
    movups xmm1,[ebx]   ; second vector
    mulps xmm0,xmm1     ; products
    haddps xmm0,xmm0    ; half-sums
    haddps xmm0,xmm0    ; full sum in every element
    movss [edx],xmm0    ; store the lowest element
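
The two horizontal additions can be traced lane by lane. This is a Python model of the semantics with lane 0 as the lowest element. The input vectors are arbitrary examples, padded with a zero in the fourth lane.

```python
def mulps(a, b):
    """Packed multiply of four lanes."""
    return [x * y for x, y in zip(a, b)]

def haddps(a, b):
    """HADDPS a,b: sums of neighbouring pairs, first from a, then from b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

u = [1.0, 2.0, 3.0, 0.0]
v = [4.0, 5.0, 6.0, 0.0]

products = mulps(u, v)                 # [4.0, 10.0, 18.0, 0.0]
half_sums = haddps(products, products) # [14.0, 18.0, 14.0, 18.0]
total = haddps(half_sums, half_sums)   # [32.0, 32.0, 32.0, 32.0]
print(total[0])  # 32.0
```

After the second HADDPS every lane holds the full sum, which is why storing only the lowest element with MOVSS is enough.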
