A Look at Gforth Performance



  1. A Look at Gforth Performance M. Anton Ertl TU Wien

  2. New performance features since Gforth 0.5.0 (2000)
     • primitive-centric direct threaded code (EuroForth 2001)
     • dynamic superinstructions with replication (PLDI 2003)
     • static superinstructions (EuroForth 2003)
     • multi-state static stack caching (IVME 2004, EuroForth 2005)
     • automatic build tuning (explicit register allocation)
     • workarounds for GCC performance bugs
     • branch target alignment
     • ...
     What is the big picture? How well do the performance features work relative to others? How well does it work across machines and GCC versions?

  3. Portability space
     • 7 architectures, 12 architecture/CPU combinations
     • up to 9 GCC versions per architecture
     How well do the performance features do in all variations?
     Measurements
     • 4 Gforth versions, 7 with options
     • 5 application benchmarks (geometric mean reported)
     • 3 runs (median reported)
     • logarithmic graphs

  4. Overall performance
     • Typical speedup factor: 3
     • Biggest contribution: dynamic superinstructions (in 0.6.2 for Alpha, IA32, PPC; in 0.7.0 for others)
     • Automatic tuning (0.7.0) helps IA32
     • Multi-state stack caching vs. static superinstructions
     • Branch target alignment helps Alpha (0.7.0)
     • Best performance per cycle: IA32, AMD64. Reason: indirect branch predictors
     [Figure: perf/cycle (geometric mean) by gforth version (0.5.0, 0.6.1, 0.6.1nd, 0.6.2, 0.6.2ns, 0.7.0, 0.7.0ssc) for IA32 Xeon 5450, AMD64 Xeon 5450, IA32 Opteron 270, IA32 Athlon MP, AMD64 Opteron 270, IA32 Pentium 4 Northwood, PPC 7447A, PPC 970, Alpha 21264B, PPC64 PPC970, IA64 Itanium II, ARM Xscale IOP80321]

  5. Dynamic Superinstructions
     Forth:
       : foo 5 ;
     Threaded code:
       lit
       5
       ;s
     Native code:
       mov [esi], ecx
       mov ecx, [ebx]
       add ebx, #4
       add esi, #-4
       add ebx, #4
       mov ebx, [edi]
       add edi, #4
       add ebx, #4
       jmp -4[ebx]
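As a rough illustration of the threaded code on this slide, the sketch below runs `lit 5 ;s` in a toy direct-threaded engine written with GNU C labels-as-values (a GCC/Clang extension). It is not Gforth's engine: the names `run_foo`, `do_lit` and `do_semis` are hypothetical, and the toy `;s` simply leaves the engine instead of popping a return stack. Each `goto **ip++` is one NEXT dispatch. Dynamic superinstructions, as the native code column suggests, go a step further and concatenate the primitives' machine code so that the dispatch between `lit` and `;s` disappears; that part cannot be shown in portable C and is not attempted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy engine (not Gforth's): direct threaded code using
   GNU C labels-as-values. */
typedef intptr_t Cell;

Cell run_foo(void)
{
    void *code[3];             /* threaded code for  : foo 5 ;  */
    Cell stack[8];
    Cell *sp = stack + 8;      /* data stack grows downwards */
    void **ip;                 /* instruction pointer into threaded code */

    code[0] = &&do_lit;        /* lit */
    code[1] = (void *)(Cell)5; /* inline literal 5 */
    code[2] = &&do_semis;      /* ;s */

    ip = code;
    goto **ip++;               /* NEXT: fetch code address, advance ip, jump */

do_lit:
    *--sp = (Cell)*ip++;       /* push the inline literal, skip over it */
    goto **ip++;               /* NEXT */

do_semis:
    assert(sp < stack + 8);    /* the stack must hold a result */
    return *sp;                /* toy ;s: leave the engine with top of stack */
}
```

Calling `run_foo()` should yield 5, the value that `foo` leaves on the data stack.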

  6. Engines
     • Benchmarking: gforth-fast
     • Debugging: gforth
       Error detection and reporting
     • Typical difference: factor 2
     • Debugging engine:
       dynamic superinstructions
       no static superinstructions
       no multi-state stack caching
       no automatic tuning
     [Figure: speed on PPC 7447A for gforth-fast and gforth by gforth version (0.5.0, 0.6.1, 0.6.1nd, 0.6.2, 0.6.2ns, 0.7.0, 0.7.0ssc)]

  7. Engines (2)
     [Figure, two panels: speed on IA32 Xeon 5450 and speed on AMD64 Xeon 5450, each for gforth-fast and gforth by gforth version (0.5.0, 0.6.1, 0.6.1nd, 0.6.2, 0.6.2ns, 0.7.0, 0.7.0ssc)]

  8. GCC versions (1)
     • GCC ≥ 3.4 disables dynamic superinstructions in gforth 0.6.x
     • Gforth 0.7.0 works around that
     • gcc-2.95 works well
     [Figure: speed on PPC 7447A by gcc version (2.95, 3.2, 3.3, 3.4, 4.0, 4.1, 4.3) for gforth 0.5.0, 0.6.1, 0.6.1nd, 0.6.2, 0.6.2ns, 0.7.0, 0.7.0ssc]

  9. GCC versions (2)
     • PR15242 lowers branch prediction accuracy (gcc-3.4, gcc-4.4.0)
     • Gforth 0.7.0 works around that
     • Gforth 0.7.0 affected by
       bad register allocation (4.1, 4.2)
       NEXT expansion (4.4.0)
     • gcc-2.95 works well
     [Figure: speed on IA32 Xeon 5450 by gcc version (2.95, 3.3, 3.4, 4.0, 4.1, 4.2, 4.4.0) for gforth 0.5.0, 0.6.1, 0.6.1nd, 0.6.2, 0.6.2ns, 0.7.0, 0.7.0ssc]

  10. GCC performance bug: PR15242
      Code 1+
        ( $804B6D8 ) add ebx, #4
        ( $804B6DB ) inc ebp
        ( $804B6DC ) mov esi, -4[ebx]
        ( $804B6DF ) mov eax, esi
        ( $804B6E1 ) jmp 804AE8C
        ...
        ( $804AE8C ) jmp eax
      instead of
      Code 1+
        ( $804B6D8 ) add ebx, #4
        ( $804B6DB ) inc ebp
        ( $804B6DC ) jmp -4[ebx]
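One way to see why routing every primitive through the single shared `jmp eax` can lower branch prediction accuracy is a small simulation. The sketch below is an assumption-laden model, not a description of any real CPU: it uses a last-target predictor with one entry per indirect jump (a simple BTB model), and the function name `count_mispredictions` and the constant `NPRIMS` are hypothetical. With one jump per primitive, each entry learns the successor of its own primitive; with one shared jump, a single entry has to predict every dispatch.

```c
#include <stddef.h>

#define NPRIMS 3   /* number of distinct primitives in the toy program */

/* Count mispredictions of a last-target predictor (simplified BTB model).
   prog: threaded code as primitive numbers in 0..NPRIMS-1, run in a loop.
   shared_jump != 0: all primitives dispatch through one indirect jump
   (one predictor entry); otherwise each primitive has its own jump. */
unsigned count_mispredictions(const int *prog, size_t len,
                              unsigned iterations, int shared_jump)
{
    int btb[NPRIMS];
    unsigned misses = 0;
    unsigned n;
    size_t i;

    for (i = 0; i < NPRIMS; i++)
        btb[i] = -1;                         /* -1: no prediction yet */

    for (n = 0; n < iterations; n++) {
        for (i = 0; i < len; i++) {
            int current = prog[i];
            int next = prog[(i + 1) % len];  /* actual jump target */
            int entry = shared_jump ? 0 : current;

            if (btb[entry] != next)
                misses++;                    /* predicted the wrong target */
            btb[entry] = next;               /* remember the last target */
        }
    }
    return misses;
}
```

For the three-primitive loop `{0, 1, 2}` run 100 times, this model mispredicts only the 3 warm-up dispatches with separate jumps, but all 300 dispatches with a shared jump. Real indirect branch predictors are more sophisticated, so the actual gap is smaller than in this model.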

  11. GCC performance bug: NEXT expansion
      before_goto: goto *real_ca;
      is compiled to:
        mov edx, 58 [esp]
        mov eax, esi
        mov 68 [esp], edx
        mov 6C [esp], edx
        mov 70 [esp], edx
        mov 74 [esp], edx
        jmp eax
      instead of:
        jmp esi

  12. Other Forth systems
      [Figure: speed on IA32 Opteron 270 for iforth, bigforth, gforth, vfxlin, spf4 on the benchmarks benchgc4, brew, fcp, brainless, cd16sim, lexex]

  13. Conclusion
      • Typical speedup factor: 3
      • Most important optimization: dynamic superinstructions
        New gcc versions often disable it ⇒ workarounds
      • Important on IA32: explicit register allocation
        Automatic enabling and testing to get it into Linux distributions
      • Other optimizations have small or architecture-specific effect
        But their combination is still significant
      • Gforth is very portable
        0.5.0 runs on architectures that were not available on release
      • Future work
        Inlining
        Compilation through C (independence from GCC)
        Native code generation
