SLIDE 1

Assembly Language Programming Parallel architectures

Zbigniew Jurkiewicz, Instytut Informatyki UW December 4, 2017

Images: Song Ho Anh

SLIDE 2

Flynn taxonomy

There is a large variety of parallel architectures, and hence many different criteria for partitioning them. The oldest popular classification, due to Flynn, uses the concepts of instruction stream and data stream.

SISD (Single Instruction, Single Data): a classical uniprocessor system with a single instruction stream and a single data stream.

SIMD (Single Instruction, Multiple Data): obtained by multiplying the data streams. Many processors execute the same instruction, each on a different data unit. This speeds up typical matrix operations, like multiplication or inversion. This mode was used in vector processors (e.g. Cray-1) and matrix processors.

SLIDE 3

Flynn taxonomy

Multiplying the instruction streams instead gives an MISD system (Multiple Instruction, Single Data), used only in very specific applications (e.g. nondeterministic or mirror processing).

Multiplying both streams gives an MIMD system (Multiple Instruction, Multiple Data), with full concurrency: different processors can independently process different data.

SLIDE 4

Division by method of synchronization

  • synchronous, e.g. SIMD;
  • asynchronous, e.g. MIMD.

SLIDE 5

MMX

Intel's first approach to SIMD (Pentium MMX). It is "parasitic" on the FPU: it uses the same registers. It brought the first 64-bit instructions, e.g.

    movq mm0,[vector]

which moves a vector of values into an MMX register. These instructions are also available in SSE.

SLIDE 6

Checking for SSE technology

Before using advanced instructions we have to check whether our processor supports them. Our laboratories have different processors! The check is performed with the cpuid instruction.

SLIDE 7

CPUID

For function 1 (EAX = 1) we get in the EDX register:

  • bit 0: is an FPU present?
  • bit 11: do SYSENTER/SYSEXIT work? (SYSCALL/SYSRET are reported by extended function 80000001h)
  • bit 15: is CMOV present?
  • bit 23: is MMX present?
  • bit 25: is SSE present?
  • bit 26: is SSE2 present?

and in ECX:

  • bit 0: is SSE3 present?
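The bit tests listed above can be sketched in C. This is a minimal illustration that only decodes register values already obtained from cpuid; the helper names `bit_set` and `has_sse_up_to_sse3` are hypothetical and not part of the lecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the EDX and ECX values returned by CPUID function 1. */
enum {
    EDX_FPU  = 0,
    EDX_CMOV = 15,
    EDX_MMX  = 23,
    EDX_SSE  = 25,
    EDX_SSE2 = 26,
    ECX_SSE3 = 0
};

/* Hypothetical helper: test a single bit of a register value. */
static bool bit_set(uint32_t reg, unsigned bit)
{
    return ((reg >> bit) & 1u) != 0;
}

/* Hypothetical helper: true when SSE, SSE2 and SSE3 are all reported. */
static bool has_sse_up_to_sse3(uint32_t edx, uint32_t ecx)
{
    return bit_set(edx, EDX_SSE) &&
           bit_set(edx, EDX_SSE2) &&
           bit_set(ecx, ECX_SSE3);
}
```

In a real program the `edx` and `ecx` arguments would come from executing cpuid with EAX = 1; here they are plain parameters so that the decoding logic stays portable.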

SLIDE 8

CPUID

For function 4 (on Intel processors) we have in EAX bits 26..31: the maximum number of addressable processor cores in the package, minus 1. Much more can be found in the documentation.
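Extracting this bit field is a shift and a mask. The sketch below assumes the Intel convention, where the field holds one less than the maximum number of addressable core IDs in the package; the function name `max_core_ids` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: decode bits 26..31 of the EAX value returned by
   CPUID function 4. Assumes the Intel encoding, where the stored value
   is the maximum number of addressable core IDs minus 1. */
static unsigned max_core_ids(uint32_t eax)
{
    return ((eax >> 26) & 0x3Fu) + 1u;
}
```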

SLIDE 9

XMM registers

Used by the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE5, Advanced Vector Extensions, ... instruction sets.

  • In 32-bit mode: 8 128-bit registers XMM0–XMM7.
  • In 64-bit mode: 16 128-bit registers XMM0–XMM15.

SLIDE 10

XMM registers

They can store

  • a vector of 16 8-bit integer values,
  • a vector of 8 16-bit integer values,
  • a vector of 4 32-bit integer or floating-point values,
  • a pair ("vector") of 2 64-bit integer or floating-point values.
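One way to picture these layouts is a C union in which every member covers the same 128 bits. This is only an illustrative model; the type name `xmm_model` is hypothetical, and `float` and `double` are assumed to be 32 and 64 bits wide.

```c
#include <stdint.h>

/* Illustrative model only: one 128-bit XMM register viewed as each of
   the element layouts listed above. The type name is hypothetical. */
typedef union {
    uint8_t  u8[16];   /* 16 x 8-bit integers  */
    uint16_t u16[8];   /*  8 x 16-bit integers */
    uint32_t u32[4];   /*  4 x 32-bit integers */
    float    f32[4];   /*  4 x single precision */
    uint64_t u64[2];   /*  2 x 64-bit integers */
    double   f64[2];   /*  2 x double precision */
} xmm_model;
```

The register itself carries no type information: which view applies is decided by the instruction that operates on it.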

SLIDE 11

Data Transfers

Arithmetic instructions accept memory operands only at addresses aligned to 16 bytes (and of course they operate on registers). There are two kinds of transfers to/from registers:

  • MOVDQU if the address is not aligned (Unaligned),
  • MOVDQA if the address is aligned (Aligned).

Clearing (zeroing) a register:

    pxor xmm1,xmm1
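The alignment condition that decides between MOVDQA and MOVDQU is simply "the address is a multiple of 16". A minimal C sketch of that check follows; the names `is_aligned_16` and `aligned_vec` are hypothetical, and `_Alignas` requires C11.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the address is a multiple of 16, i.e. suitable for MOVDQA
   or for a memory operand of a packed SSE arithmetic instruction. */
static bool is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical example data: _Alignas (C11) requests 16-byte alignment. */
static _Alignas(16) float aligned_vec[4] = {1.0f, 2.0f, 3.0f, 4.0f};
```

An address 4 bytes past `aligned_vec` fails the check, so data there would have to be loaded with the unaligned variant.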

SLIDE 12

Addition

Already present in MMX for integer vectors. SSE adds operations on (small) vectors of single-precision floating-point numbers, e.g. ADDPS (where P stands for Packed and S for Single). SSE2 extends this to double-precision floating-point numbers, e.g. ADDPD. There are also operations on single values, called scalars, e.g. ADDSS and ADDSD.

SLIDE 13

Example addition

We are to add two three-dimensional vectors, represented in single precision. Each must occupy 128 bits, so we assume that the "highest" (fourth) element is cleared.

    movups xmm0,[eax]    ;First vector
    movups xmm1,[ebx]    ;Second vector
    addps xmm0,xmm1      ;Element-wise sum in xmm0
    movups [edx],xmm0    ;Store the result
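What ADDPS computes can be modelled in plain C as four independent additions. This is a scalar sketch of the semantics, not the way the instruction is implemented; the function name `addps_model` is hypothetical.

```c
/* Scalar model of ADDPS dst, src: the four single-precision lanes are
   added independently and the result replaces dst. */
static void addps_model(float dst[4], const float src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];
}
```

For three-dimensional vectors padded with a zero fourth element, that element of the result stays zero.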

SLIDE 14

Multiplication

Similar to addition: MULPS/MULPD for vectors, MULSS/MULSD for scalars.

SLIDE 15

Operations scalar/vector

When we add a scalar to a vector, we must replicate the scalar to make it look like a vector. The simplest way is to use the "shuffling" operation:

    shufps xmm1,xmm2,mask

This operation fills the first argument with two elements selected from the first argument (placed in the two lower positions) and two elements selected from the second argument (placed in the two higher positions). The selection is controlled by the 8-bit mask: each 2-bit field of the mask chooses one source element.
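The selection rule of SHUFPS can be written out as a scalar C model. It is a sketch of the semantics only; the function name `shufps_model` is hypothetical. Element 0 is the lowest 32 bits of the register.

```c
#include <stdint.h>

/* Scalar model of SHUFPS dst, src, mask.
   mask bits 1..0 and 3..2 pick elements of dst for result positions 0 and 1;
   mask bits 5..4 and 7..6 pick elements of src for result positions 2 and 3. */
static void shufps_model(float dst[4], const float src[4], uint8_t mask)
{
    float out[4];
    out[0] = dst[mask & 3];
    out[1] = dst[(mask >> 2) & 3];
    out[2] = src[(mask >> 4) & 3];
    out[3] = src[(mask >> 6) & 3];
    /* Copy only after all reads, so dst and src may be the same array. */
    for (int i = 0; i < 4; i++)
        dst[i] = out[i];
}
```

With mask 0 and the same register used for both arguments, every position receives element 0, which is exactly the replication needed to turn a scalar into a vector.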

SLIDE 16

Shuffling

SLIDE 17

Adding scalar to vector

When shuffling, the first argument may be the same as the second one.

    movss xmm0,[fscalar]          ;Scalar into the lowest element
    shufps xmm0,xmm0,00000000B    ;Replicate it into all four elements
    addps xmm0,xmm1               ;Add the vector in xmm1

SLIDE 18

Comparisons

The comparison operations in SSE do not set flags (there are not enough flags). Instead each element of the first argument is set to true (all ones) or false (all zeros), separately for each pair of elements. Thus comparison operations are destructive!

    cmpeqpd xmm0,xmm1    ;result in xmm0 (equality chosen here for illustration)
    movmskpd eax,xmm0    ;sign bits of both elements into eax
    cmp eax,0
    jne tam              ;jump if any element compared true

Remark: for scalars the instructions COMISD and UCOMISD directly modify EFLAGS, and we can use conditional jumps as usual.
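The compare-then-extract pattern can be modelled in scalar C. The sketch below assumes an equality predicate, which is an illustrative choice, and writes the mask to a separate array instead of overwriting the first operand as the real instruction does; both function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of a packed double-precision "equal" comparison:
   each 64-bit lane becomes all ones (true) or all zeros (false). */
static void cmpeqpd_model(const double a[2], const double b[2],
                          uint64_t mask[2])
{
    for (int i = 0; i < 2; i++)
        mask[i] = (a[i] == b[i]) ? UINT64_MAX : 0;
}

/* Scalar model of MOVMSKPD: gather the top (sign) bit of each lane
   into the two lowest bits of an integer. */
static int movmskpd_model(const uint64_t mask[2])
{
    return (int)(mask[0] >> 63) | ((int)(mask[1] >> 63) << 1);
}
```

A nonzero result means that at least one pair of elements compared true, which is the condition tested by the `cmp`/`jne` pair.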

SLIDE 19

Comparisons

SLIDE 20

SSE3

  • Scalar operations on single- and double-precision floating-point values use the lowest part of the XMM registers.
  • Register-to-register floating-point operations (instead of stack-based ones) leave less use for the FPU, e.g. GCC uses the FPU only in rare, special cases.
  • Floating-point constants must be fetched from memory; they cannot be placed directly in instructions as immediates.
  • A floating-point comparison has four possible results (also in C): <, =, > and "none of these" (unordered, for example when an argument is NaN).

SLIDE 21

SSE3

SSE3 adds some interesting (read: "exotic") instructions, for example the horizontal additions HADDPS/HADDPD, which add neighbouring pairs of elements in the arguments and store the result in the first argument. We will use it to compute the scalar product of two vectors:

    movups xmm0,[eax]    ;First vector
    movups xmm1,[ebx]    ;Second vector
    mulps xmm0,xmm1      ;Products
    haddps xmm0,xmm0     ;Half-sums
    haddps xmm0,xmm0     ;Full sum in every element
    movss [edx],xmm0     ;Store the scalar product
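To see why two horizontal additions suffice, the sequence can be traced with a scalar C model. This is a sketch of the instruction semantics only; the function names `haddps_model` and `dot4_model` are hypothetical.

```c
/* Scalar model of HADDPS dst, src: neighbouring pairs of dst fill the
   two lower result elements, neighbouring pairs of src the two upper. */
static void haddps_model(float dst[4], const float src[4])
{
    float out[4];
    out[0] = dst[0] + dst[1];
    out[1] = dst[2] + dst[3];
    out[2] = src[0] + src[1];
    out[3] = src[2] + src[3];
    /* Copy only after all reads, so dst and src may be the same array. */
    for (int i = 0; i < 4; i++)
        dst[i] = out[i];
}

/* Model of the whole sequence: mulps, haddps, haddps, movss. */
static float dot4_model(const float a[4], const float b[4])
{
    float v[4];
    for (int i = 0; i < 4; i++)
        v[i] = a[i] * b[i];      /* mulps: element-wise products */
    haddps_model(v, v);          /* half-sums: p0+p1, p2+p3, p0+p1, p2+p3 */
    haddps_model(v, v);          /* full sum in every element */
    return v[0];                 /* movss: take the lowest element */
}
```

For the vectors (1, 2, 3, 4) and (5, 6, 7, 8) the products are (5, 12, 21, 32), the half-sums are (17, 53, 17, 53), and the second horizontal addition leaves 70 in every element.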
