slide-1
SLIDE 1

Assembly Language Programming Optimization

Zbigniew Jurkiewicz, Instytut Informatyki UW December 9, 2017

Zbigniew Jurkiewicz, Instytut Informatyki UW Assembly Language Programming Optimization

slide-2
SLIDE 2

Conditional transfer

Sometimes a comparison is made only to perform a single assignment that depends on its result. In that case we can use a conditional move instruction, which performs the assignment only if the indicated condition holds. For example, cmove eax,ebx copies ebx into eax only if the most recently compared operands were equal (CMOVcc takes a register or memory source, not an immediate). The main advantage is that no jump is involved, so there is no need to flush the pipeline or to rely on speculative execution.
Conditional set: SETcc.
Conditional move: CMOVcc.

slide-3
SLIDE 3

Conditional transfer: an example

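Find the maximum of two unsigned numbers (arguments in EAX and EBX, result in ECX):

    mov   ecx,eax     ;assume EAX is the maximum
    cmp   ebx,ecx     ;compare EBX with the candidate
    cmova ecx,ebx     ;if EBX is above (unsigned), take EBX instead

For signed numbers use cmovg in place of cmova.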

slide-4
SLIDE 4

Conditional transfers: errors

Assume we are compiling the C fragment

    int *xp;
    ...
    return (xp ? *xp : 0);

If xp is in rdi, we could try

    xor    eax,eax      ;Maybe we will return zero
    test   rdi,rdi      ;xp == 0 ?
    cmovne eax,[rdi]    ;Maybe we will return *xp

But then xp is always dereferenced (even when it is a NULL pointer), because CMOV reads its memory operand regardless of the condition, and this is exactly what we want to avoid.
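One common way around the problem above is to select between two valid pointers first and dereference afterwards, so the load is always safe. The sketch below is only an illustration: the function name deref_or_zero is hypothetical, and whether a compiler actually emits CMOV for the pointer selection depends on the compiler and its flags.

```c
#include <stddef.h>

/* Hypothetical helper: returns *xp, or 0 when xp is NULL. */
int deref_or_zero(const int *xp)
{
    static const int zero = 0;

    /* Select between two addresses; no memory is read here, so this
       selection is a candidate for a conditional move on the pointer. */
    const int *p = xp ? xp : &zero;

    /* The load is unconditional but always goes through a valid pointer. */
    return *p;
}
```

The conditional move now operates on register values only, and the single load that follows never touches address 0.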

slide-5
SLIDE 5

Jump avoidance

Avoiding jumps is a larger problem. Let us look at the computation of the absolute value of a number

    test eax,eax      ;We set flags
    jns  skip         ;Sign flag clear: the number is already non-negative
    neg  eax
    skip:

slide-6
SLIDE 6

Jump avoidance

There is a different way:

    mov ecx,eax
    sar ecx,31        ;sign bit everywhere
    xor eax,ecx       ;invert all bits if the number was negative
    sub eax,ecx       ;we subtract -1 and have 2-complement
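The same sequence can be sketched in C. This is a minimal illustration under one assumption that is labeled in the code: the right shift of a negative signed value must be an arithmetic shift, which C leaves implementation-defined. The function name abs_branchless is hypothetical, and the result is returned as an unsigned value so that the most negative input is handled without overflow.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the mov/sar/xor/sub sequence. */
uint32_t abs_branchless(int32_t x)
{
    /* Assumption: >> on a negative int32_t replicates the sign bit
       (arithmetic shift), giving 0 for x >= 0 and all ones for x < 0. */
    uint32_t mask = (uint32_t)(x >> 31);

    /* xor inverts the bits when mask is all ones; subtracting the mask
       (that is, subtracting -1) then adds 1: two's complement negation.
       Unsigned arithmetic keeps this well defined for INT32_MIN. */
    return ((uint32_t)x ^ mask) - mask;
}
```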

slide-7
SLIDE 7

Power of 2

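Another trick: how to check whether the number in EAX is a power of two?

    mov  ebx,eax      ;these two instructions can be replaced
    dec  ebx          ;by a single lea ebx,[eax - 1]
    test eax,ebx      ;a power of two shares no bits with its predecessor
    jnz  isnot

Note that zero also passes this test, so it has to be excluded separately when that matters.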
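For comparison, the same test can be written in C. This is a minimal sketch with a hypothetical function name; unlike the assembly fragment, it rejects zero explicitly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when x has exactly one bit set. */
bool is_power_of_two(uint32_t x)
{
    /* x - 1 clears the lowest set bit and sets all bits below it,
       so x & (x - 1) is zero only when x had a single bit set.
       The x != 0 check excludes zero, which would otherwise pass. */
    return x != 0 && (x & (x - 1)) == 0;
}
```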

slide-8
SLIDE 8

Hints

The processor tries to guess whether a conditional jump will be taken. With static prediction it is assumed that a “backward” jump will be taken. We can help it with hints: the prefixes HT (0x3e) and HNT (0x2e), for example

    test ecx,ecx
    db 3eh            ;HT = we will jump
    jz L9
    ...
    L9:

slide-9
SLIDE 9

Hints

Sometimes keeping data in the cache is not useful, because it is used only once. Non-temporal store instructions (MOVNTI, MOVNTPD, etc.) bypass the cache when writing.

slide-10
SLIDE 10

Conservativity of compiler

The C compiler must be conservative and generate code in such a way that all possible cases are covered. Example:

    void memclr (char *data, int n) {
        for (; n > 0; n--)
            *data++ = 0;
    }

If the compiler knew something about the alignment of data, it could generate code that zeroes 2, 4 or even 8 bytes in one step. However, it must assume the worst case.
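What the programmer can do by hand is sketched below: clear single bytes until the pointer is 8-byte aligned, then clear 8 bytes per step, then finish the tail. The name memclr_wide and the size_t length parameter are assumptions made for this illustration and do not come from the original slide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wide variant of memclr. */
void memclr_wide(char *data, size_t n)
{
    const uint64_t zero = 0;

    /* Head: single bytes until data is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)data & 7u) != 0) {
        *data++ = 0;
        n--;
    }

    /* Body: 8 bytes per step. memcpy of a fixed small size avoids
       aliasing problems and is usually compiled to one 8-byte store. */
    while (n >= 8) {
        memcpy(data, &zero, sizeof zero);
        data += 8;
        n -= 8;
    }

    /* Tail: the remaining 0..7 bytes. */
    while (n > 0) {
        *data++ = 0;
        n--;
    }
}
```

In production code the library memset is normally the better choice; the sketch only shows the kind of knowledge about alignment that the compiler lacks in the original loop.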

slide-11
SLIDE 11

Conservativity of compiler

There are a few constructs in C/C++ that are classic examples of slowing programs down. The group is led by the conversion (cast) from a real number to an integer, for example

    int i;
    float f;
    ...
    i = (int)f;

Such a conversion takes 50-100 processor cycles. Reason: C/C++ requires truncation toward zero, while the FPU rounds to nearest by default, so the coprocessor rounding mode has to be switched and then restored.

slide-12
SLIDE 12

Conservativity of compiler

Another Oscar nominee is pointer aliasing. In the code below the compiler will not hoist the evaluation of *p + 2 out of the loop

    void Func1 (int a[], int *p) {
        int i;
        for (i = 0; i < 100; i++)
            a[i] = *p + 2;
    }

And it is right, because (hooray for C and C++ :-)

    void Func2() {
        int list[100];
        Func1(list, &list[8]);
    }
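The effect of the aliasing call can be made visible with a small experiment. In the sketch below Func1 is the function from the slide, while Func1_hoisted is a hypothetical hand-optimized variant that is only equivalent when p does not point into a.

```c
/* Original form: *p is re-read on every iteration. */
void Func1(int a[], int *p)
{
    int i;
    for (i = 0; i < 100; i++)
        a[i] = *p + 2;
}

/* Hypothetical hand-hoisted form: *p + 2 is evaluated once.
   Equivalent to Func1 only if p does not alias any element of a. */
void Func1_hoisted(int a[], const int *p)
{
    const int v = *p + 2;
    int i;
    for (i = 0; i < 100; i++)
        a[i] = v;
}
```

With list[8] set to 5 and p = &list[8], Func1 stores 7 into elements 0 to 8 and 9 into the remaining ones, because the store to list[8] changes *p in the middle of the loop. Func1_hoisted stores 7 everywhere. In C99 the restrict qualifier is another way to promise the compiler that no such aliasing occurs.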

slide-13
SLIDE 13

Conservativity of compiler

Sometimes the remedies are simple. The code below fetches arg1->p1 from memory twice:

    struct S1 { int p1; };
    struct S2 { int p2, p3; };
    void f1 (struct S1 *arg1, struct S2 *arg2) {
        arg2->p2 += arg1->p1;
        arg2->p3 += arg1->p1;
    }

It must work this way, because arg2->p2 and arg1->p1 may be the same memory cell. But it is enough to introduce a local variable holding arg1->p1.
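The remedy described above can be sketched as follows. The function name f1_local is hypothetical; the structures are the ones from the slide.

```c
struct S1 { int p1; };
struct S2 { int p2, p3; };

/* Hypothetical variant of f1: arg1->p1 is read once into a local,
   which the compiler is free to keep in a register. */
void f1_local(struct S1 *arg1, struct S2 *arg2)
{
    const int p1 = arg1->p1;   /* single fetch from memory */
    arg2->p2 += p1;
    arg2->p3 += p1;
}
```

The two versions behave identically whenever the arguments do not overlap; by copying the value first, the programmer states explicitly that the later store to arg2->p2 is not meant to affect it.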

slide-14
SLIDE 14

Assembler

Assembler lets us take advantage of low-level facilities:
Registers and direct input/output.
Violating the compiler conventions: passing parameters differently, breaking the memory allocation rules, iterative call of procedures.
Linking incompatible code fragments, e.g. built by different compilers.
Hand optimization of code to adapt it to a very particular hardware configuration.

slide-15
SLIDE 15

Extreme example

Appetizer: the following C code

    float a[4], b[4], c[4];
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] > b[i] ? a[i] : b[i];
    }

can be optimally coded as follows (assuming a, b and c are 16-byte aligned, as movaps requires)

    movaps xmm0,[a]     ;Load a vector
    maxps  xmm0,[b]     ;max(a,b)
    movaps [c],xmm0     ;c = a > b ? a : b

slide-16
SLIDE 16

Not enough registers or “two in one”

We have two variables, index and increment, both 16-bit (short). On ARM they can be put into one register, with index in the upper half. Then the C code

    elem = tab[index];
    index += increment;

could be written in assembler as

    LDRB Relem, [Rtab, Rindincr, LSR#16]
    ADD  Rindincr, Rindincr, Rindincr, LSL#16
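The arithmetic behind the two ARM instructions can be checked in portable C. The sketch below is an illustration only; the function names pack_indincr and fetch_and_advance are hypothetical, and unsigned 32-bit arithmetic stands in for the ARM register.

```c
#include <stdint.h>

/* Hypothetical helper: index in the upper 16 bits, increment in the lower. */
uint32_t pack_indincr(uint16_t index, uint16_t increment)
{
    return ((uint32_t)index << 16) | increment;
}

/* Hypothetical helper: one step of  elem = tab[index]; index += increment; */
unsigned char fetch_and_advance(const unsigned char *tab, uint32_t *indincr)
{
    /* LDRB Relem, [Rtab, Rindincr, LSR#16]: the upper half is the index. */
    unsigned char elem = tab[*indincr >> 16];

    /* ADD Rindincr, Rindincr, Rindincr, LSL#16: shifting left by 16 moves
       the increment into the upper half and discards the old index, so the
       addition changes only the index; the lower half gets 0 added. */
    *indincr += *indincr << 16;

    return elem;
}
```

Starting from index 3 and increment 2, successive calls read tab[3], tab[5], tab[7] and so on, while the lower half of the packed word stays equal to 2.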

slide-17
SLIDE 17

Intel/AMD

The instruction set of CISC processors (x86) is not optimal, as confirmed by several changes of architectural philosophy. It must be preserved for backward compatibility with systems from the 1980s, when RAM and disk storage were small and costly. But CISC also has some advantages: compact code fits well within caches of restricted size. The main problem of x86 processors is the shortage of registers, alleviated a little in the design of x86-64.

slide-18
SLIDE 18

Graphics accelerators

Demanding graphics applications need platforms with a graphics coprocessor or an accelerator card. The computational power they contain can also be used for other tasks, but this is another story (and it depends heavily on the hardware).

slide-19
SLIDE 19

64-bit code

Advantages:
More registers: usually no need to store variables and intermediate results in RAM.
Efficient procedure calls: parameters are passed in registers.
64-bit registers for integers.
Better management of large memory blocks.
Built-in restricted SIMD (SSE).
Relative addressing of data, efficient relocatable code.

slide-20
SLIDE 20

64-bit code

Disadvantages:
Addresses and stack slots are twice as large, which puts more pressure on the cache.
Access to static and global arrays requires more instructions for large memory images (mostly on Windows and Mac).
Computing the effective memory address is more complicated when the image is larger than 2 GB.
Some instructions are longer.

slide-21
SLIDE 21

Intrinsic functions in C++

A new approach to joining code from different levels. Intrinsic functions represent processor instructions known to the compiler. Example: the addition of floating-point vectors, ADDPS, may be written in C++ as the function _mm_add_ps. We can also define an appropriate vector class and overload the + operator in it.

Intrinsic functions exist in the Microsoft, Intel and GNU compilers.
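A minimal sketch of the ADDPS example, written here in C rather than C++ and assuming an x86 or x86-64 target with the SSE header xmmintrin.h available. The wrapper name add4 is hypothetical; _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps are the SSE intrinsics for unaligned load, packed addition and unaligned store.

```c
#include <xmmintrin.h>   /* SSE intrinsics; x86 and x86-64 targets only */

/* Hypothetical helper: c[i] = a[i] + b[i] for four floats at once. */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* one ADDPS, then store 4 floats */
}
```

The unaligned load and store variants are used so that the sketch works for any float arrays; with data known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could be used instead.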

slide-22
SLIDE 22

Examining compiled code

Various reasons:
Checking for evident places to rewrite by hand in assembly language (or to switch a compiler flag, e.g. -O3 ;-)
Using the compiler as an intelligent typist, and the resulting code as a more comfortable base than starting from nothing. This code at least has correct interfaces with the environment, and these usually give us the most trouble.
And sometimes we will discover an error in the compiler.

slide-23
SLIDE 23

Examining compiled code

Let us look at the loop

    for (int i = 0; i <= 15; i++)
        T[i] = i;

The compiler should logically replace it by

    for (int i = 15; i >= 0; i--)
        T[i] = i;

Reason: we save the comparison instruction (with 15), because the decrement has already set the flags.

slide-24
SLIDE 24

Examining compiled code

But when the loop body is much more complicated, it may be difficult for the compiler to decide whether it is allowed to reverse the iteration order. Then we have to do it ourselves!

slide-25
SLIDE 25

Intel C++ compiler (parallel composer)

Intrinsics for vectors, automatic vectorization.
OpenMP and automatic parallelization of threads.
CPU dispatch: different code versions for different processors.
The best optimized mathematical libraries (but once they could not divide correctly).
Drawback: the code may run slower on AMD and VIA processors; in that case you should bypass the dispatcher.

slide-26
SLIDE 26

GNU compiler

Intrinsics for vectors, automatic vectorization.
OpenMP and automatic parallelization of threads.
Library optimization is still waiting for its turn. But the compiler accepts the mathematical vector libraries of AMD and Intel.

slide-27
SLIDE 27

Hardware restrictions

On classic ARM, registers are 32 bits wide. Avoid the types char and short for loop counters, because the range then has to be checked by hand. For example, for the instruction

    short i;
    ...
    i++;

the compiler must generate code that checks each time whether an overflow occurred and, if so, “rolls” the value to zero. Since the registers are 32 bits wide, there is no overflow/carry signalling for 16-bit numbers. Here too the compiler is defenceless. Of course, on x86 processors we do not have these problems (AL, AX).

slide-28
SLIDE 28

Dependent instructions

The total execution time of a sequence of dependent instructions (sharing arguments and/or results) is equal to the sum of their latencies, i.e. the number of cycles each of them needs. If the instructions are independent, the next instruction can start earlier and the total time decreases. For example, the code

    double list[100], sum = 0.0;
    for (int i = 0; i < 100; i++)
        sum += list[i];

should be replaced by

    double list[100], sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    for (int i = 0; i < 100; i += 4) {
        sum1 += list[i];
        sum2 += list[i+1];
        sum3 += list[i+2];
        sum4 += list[i+3];
    }
    sum1 = (sum1 + sum2) + (sum3 + sum4);

slide-29
SLIDE 29

Dependencies

Sometimes it looks strange. For example, the assignment

    y = a + b + c + d;

is better replaced by

    y = (a + b) + (c + d);

The specifications of many programming languages force the compiler to always evaluate expressions from left to right (e.g. to keep the rounding order the same), so the compiler cannot do this on its own.
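Why the compiler is not free to regroup the additions can be seen from a small floating-point experiment. The sketch below uses two hypothetical functions that differ only in grouping; for the extreme inputs mentioned afterwards they return different results.

```c
/* Strict left-to-right evaluation: three dependent additions. */
double sum_left_to_right(double a, double b, double c, double d)
{
    return ((a + b) + c) + d;
}

/* Pairwise evaluation: (a + b) and (c + d) are independent of each
   other and can be computed in parallel. */
double sum_pairwise(double a, double b, double c, double d)
{
    return (a + b) + (c + d);
}
```

For the arguments 1.0, 1e30, -1e30, 1.0 the first function returns 1.0: the leading 1.0 is absorbed by 1e30, the large terms cancel, and the final 1.0 survives. The second returns 0.0, because the trailing 1.0 is absorbed by -1e30 before the cancellation. The programmer may decide that such differences are acceptable; the compiler may not, unless it is explicitly told so.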

slide-30
SLIDE 30

Partial registers

Some CPUs implement out-of-order execution but are not able to rename partial registers (ax, ah, al). This causes a delay in the code below, because the third instruction has to wait for the upper 16 bits produced by the multiplication

    imul eax,6
    mov  [mem2],eax
    mov  ax,[mem3]      ;16-bit operands
    add  ax,2
    mov  [mem4],ax

If we replace this instruction by

    movzx eax,word [mem3]

the dependency is removed. This may be one of the reasons why, in 64-bit mode, a write to a 32-bit register automatically zeroes the upper 32 bits.

slide-31
SLIDE 31

Changing order of execution

Mostly relevant on strongly pipelined RISCs (e.g. ARM), and forced by the specifics of a given processor.
On ARM9TDMI, after a memory load instruction (e.g. LDR) the loaded value should not be used for two cycles.
Multiplication takes the same time as multiplication with accumulation (MLA). The conclusion is obvious.
On ARM10E the multiple load and store instructions work “in the background”. Superficially they take one cycle, unless we try to use one of their registers in the following instruction.
On Intel XScale the instruction LDRD loads two words at once (in one cycle). But the first register should not be used for the two following cycles, and the second one for three cycles.

slide-32
SLIDE 32

Jumps and procedures

Fetching code after an (unexpected) jump causes a delay on the order of 1–3 cycles.

The delay is largest when the destination address falls at the end of a 16-byte block (frame).

Paradox: it is sometimes worthwhile to replace the shorter form of an earlier instruction with a longer one to get the alignment.

To predict returns from procedures (ret) the processor uses a so-called return stack buffer, usually with 16 entries. Do not fool this mechanism by jumping out of procedures or secretly removing return addresses from the stack (or using ret as an indirect jump). Reduction calls (tail calls) are implemented with jumps!

slide-33
SLIDE 33

Metaprogramming

Instead of writing twisted assembler macros or abusing m4, it is better to write programs that generate other programs or their parts:
Table generators for sine, cosine or leap years.
Converters of bitmaps into fast display procedures.
Getting different aspects from the same code (aspect-oriented programming).
Specialized assembler code based on a script written in Scheme or another language and on additional constraints.
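As an illustration of the first item, a table generator for leap years could look like the sketch below. Everything in it is an assumption made for the example: the function names, the form of the generated table, and the choice to write the generated source text into a caller-supplied buffer.

```c
#include <stddef.h>
#include <stdio.h>

/* Gregorian leap-year rule. */
static int is_leap(int y)
{
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Hypothetical generator: writes the C definition of a 0/1 table for the
   years first..last into out. Returns the number of characters written
   (excluding the terminating NUL), or -1 if the buffer is too small. */
int emit_leap_table(char *out, size_t cap, int first, int last)
{
    size_t used;
    int n = snprintf(out, cap, "static const unsigned char leap_%d_%d[] = {",
                     first, last);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (int y = first; y <= last; y++) {
        /* Comma before every entry except the first one. */
        n = snprintf(out + used, cap - used, "%s%d",
                     y == first ? "" : ",", is_leap(y));
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, cap - used, "};\n");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

The generated text would be written to a file and compiled or assembled together with the rest of the program, so the table costs nothing at run time.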

slide-34
SLIDE 34

Tuning: tools

AMD CodeAnalyst
Intel VTune
New Jersey Machine-Code Toolkit (in ML) http://www.eecs.harvard.edu/~nr/toolkit/
