



Program Optimization
CSci 2021: Machine Architecture and Organization
April 6th-15th, 2020
Your instructor: Stephen McCamant
Based on slides originally by: Randy Bryant, Dave O'Hallaron
Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, Third Edition

Today
- Overview
- Generally Useful Optimizations
  - Code motion/precomputation
  - Strength reduction
  - Sharing of common subexpressions
  - Removing unnecessary procedure calls
- Optimization Blockers
  - Procedure calls
  - Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals

Performance Realities
- There's more to performance than asymptotic complexity
- Constant factors matter too!
  - Easily see 10:1 performance range depending on how code is written
  - Must optimize at multiple levels: algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
  - How programs are compiled and executed
  - How modern processors + memory systems operate
  - How to measure program performance and identify bottlenecks
  - How to improve performance without destroying code modularity and generality

Optimizing Compilers
- Provide efficient mapping of program to machine
  - register allocation
  - code selection and ordering (scheduling)
  - dead code elimination
  - eliminating minor inefficiencies
- Don't (usually) improve asymptotic efficiency
  - up to programmer to select best overall algorithm
  - big-O savings are (often) more important than constant factors
    - but constant factors also matter
- Have difficulty overcoming "optimization blockers"
  - potential memory aliasing
  - potential procedure side-effects

Limitations of Optimizing Compilers
- Operate under fundamental constraint
  - Must not cause any change in program behavior
  - Except, possibly, when the program makes use of nonstandard language features
  - Often prevents the compiler from making optimizations that would only affect behavior under pathological conditions
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  - e.g., data ranges may be more limited than variable types suggest
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
  - Newer versions of GCC do interprocedural analysis within individual files
  - But not between code in different files
- Most analysis is based only on static information
  - Compiler has difficulty anticipating run-time inputs
- When in doubt, the compiler must be conservative

Generally Useful Optimizations
- Optimizations that you or the compiler should do regardless of processor / compiler
- Code Motion
  - Reduce frequency with which a computation is performed
    - If it will always produce the same result
    - Especially moving code out of a loop

Before:

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

After (n*i computed once, outside the loop):

        long j;
        long ni = n*i;
        for (j = 0; j < n; j++)
            a[ni+j] = b[j];
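The code-motion transformation on set_row can be sketched as two standalone C functions. This is a minimal illustration, not the course's reference code: the names set_row_naive and set_row_hoisted are hypothetical, and the const qualifier on b is an added assumption that b is only read.

```c
/* Baseline, as on the slide: n*i is recomputed on every iteration. */
void set_row_naive(double *a, const double *b, long i, long n)
{
    for (long j = 0; j < n; j++)
        a[n * i + j] = b[j];
}

/* Code motion (hypothetical name): n*i does not depend on j,
 * so it is computed once before the loop. */
void set_row_hoisted(double *a, const double *b, long i, long n)
{
    long ni = n * i;        /* loop-invariant value, hoisted */
    double *rowp = a + ni;  /* address of the first element of row i */
    for (long j = 0; j < n; j++)
        rowp[j] = b[j];
}
```

Both functions should leave identical contents in a for the same arguments; only the amount of work per iteration differs. Declaring ni as long keeps the product in the same type as n and i.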

Compiler-Generated Code Motion (-O1)

Source:

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

Equivalent C for what the compiler generates:

        long j;
        long ni = n*i;
        double *rowp = a+ni;
        for (j = 0; j < n; j++)
            *rowp++ = b[j];

Generated assembly:

    set_row:
        testq   %rcx, %rcx              # Test n
        jle     .L1                     # If n <= 0, goto done
        imulq   %rcx, %rdx              # ni = n*i
        leaq    (%rdi,%rdx,8), %rdx     # rowp = A + ni*8
        movl    $0, %eax                # j = 0
    .L3:                                # loop:
        movsd   (%rsi,%rax,8), %xmm0    # t = b[j]
        movsd   %xmm0, (%rdx,%rax,8)    # M[A+ni*8 + j*8] = t
        addq    $1, %rax                # j++
        cmpq    %rcx, %rax              # j:n
        jne     .L3                     # if !=, goto loop
    .L1:                                # done:
        rep ; ret

Reduction in Strength
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide
      16*x --> x << 4
  - Utility is machine dependent
  - Depends on cost of multiply or divide instruction
    - On Intel Nehalem, integer multiply requires 3 CPU cycles
  - Most valuable when it can be done within a loop
    - "Induction variable" has value linear in loop execution count

Before:

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

After (the multiply is replaced by a running add):

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }

Share Common Subexpressions
- Reuse portions of expressions
- GCC will do this with -O1

Before (3 multiplications: i*n, (i-1)*n, (i+1)*n):

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;

    leaq   1(%rsi), %rax    # i+1
    leaq   -1(%rsi), %r8    # i-1
    imulq  %rcx, %rsi       # i*n
    imulq  %rcx, %rax       # (i+1)*n
    imulq  %rcx, %r8        # (i-1)*n
    addq   %rdx, %rsi       # i*n+j
    addq   %rdx, %rax       # (i+1)*n+j
    addq   %rdx, %r8        # (i-1)*n+j

After (1 multiplication: i*n):

    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;

    imulq  %rcx, %rsi           # i*n
    addq   %rdx, %rsi           # i*n+j
    movq   %rsi, %rax           # i*n+j
    subq   %rcx, %rax           # i*n+j-n
    leaq   (%rsi,%rcx), %rcx    # i*n+j+n

Optimization Blocker #1: Procedure Calls
- Procedure to Convert String to Lower Case

    void lower(char *s)
    {
        size_t i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

- Extracted from CMU 213 lab submissions, Fall, 1998
- Similar pattern seen in UMN 2018 HA1

Lower Case Conversion Performance
- Time quadruples when string length doubles
- Quadratic performance

[Plot: CPU seconds (0 to 250) versus string length (0 to 500000) for lower1]

Convert Loop To Goto Form

    void lower(char *s)
    {
        size_t i = 0;
        if (i >= strlen(s))
            goto done;
    loop:
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
        i++;
        if (i < strlen(s))
            goto loop;
    done:
        return;
    }

- strlen executed every iteration
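Since strlen has to scan the whole string to find the terminating null byte, calling it in the loop test makes the total work grow with the square of the string length. One way to avoid this, sketched below, is to apply code motion by hand and call strlen once before the loop. The name lower_hoisted is hypothetical, and the sketch assumes the loop body never changes the string's length (it only rewrites letters in place).

```c
#include <stddef.h>
#include <string.h>

/* Slide version: the loop test calls strlen(s) on every iteration. */
void lower1(char *s)
{
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Hypothetical variant: the length is computed once, outside the loop.
 * Valid only because the body never writes a '\0' into s. */
void lower_hoisted(char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

The two functions should produce the same output string. A compiler generally cannot make this change on its own, because it has to assume that a procedure call such as strlen might have side effects or return a different value on each call.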
