A Look at Gforth Performance (M. Anton Ertl, TU Wien)



SLIDE 1

A Look at Gforth Performance

  • M. Anton Ertl

TU Wien

SLIDE 2

New performance features since Gforth 0.5.0 (2000)

  • primitive-centric direct threaded code (EuroForth 2001)
  • dynamic superinstructions with replication (PLDI 2003)
  • static superinstructions (EuroForth 2003)
  • multi-state static stack caching (IVME 2004, EuroForth 2005)
  • automatic build tuning (explicit register allocation)
  • workarounds for GCC performance bugs
  • branch target alignment
  • ...

What is the big picture? How well do the performance features work relative to each other? How well do they work across machines and GCC versions?

SLIDE 3

Portability space

  • 7 architectures, 12 architecture/CPU combinations
  • up to 9 GCC versions per architecture

How well do the performance features do in all variations?

Measurements

  • 4 Gforth versions (7 configurations counting option variants)
  • 5 application benchmarks (geometric mean reported)
  • 3 runs (median reported)
  • logarithmic graphs
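
The aggregation described above (median of 3 runs per benchmark, then the geometric mean across benchmarks) can be sketched as follows. This is a minimal illustration, not the scripts used for the talk: the helper names `median3` and `geometric_mean` are hypothetical, and the bisection is used only to keep the sketch free of library dependencies (the usual formulation is the exponential of the mean logarithm).

```c
#include <assert.h>
#include <stddef.h>

/* Median of three run times (hypothetical helper name). */
double median3(double a, double b, double c)
{
    double t;
    if (a > b) { t = a; a = b; b = t; }  /* ensure a <= b */
    if (b > c) { t = b; b = c; c = t; }  /* move the maximum into c */
    if (a > b) { t = a; a = b; b = t; }  /* restore a <= b */
    return b;
}

/* Geometric mean of n positive values (hypothetical helper name).
   The mean g satisfies prod(x[i] / g) == 1 and lies between the minimum
   and the maximum, so bisection on g finds it without needing libm. */
double geometric_mean(const double *x, size_t n)
{
    double lo, hi;
    size_t i;
    int step;

    assert(n > 0);
    lo = hi = x[0];
    for (i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    for (step = 0; step < 200; step++) {
        double mid = 0.5 * (lo + hi);
        double ratio = 1.0;
        for (i = 0; i < n; i++)
            ratio *= x[i] / mid;
        if (ratio > 1.0)
            lo = mid;   /* mid is below the geometric mean */
        else
            hi = mid;   /* mid is at or above the geometric mean */
    }
    return 0.5 * (lo + hi);
}
```

For example, three runs of 3.0, 1.0 and 2.0 would be reported as 2.0, and per-benchmark results of 1.0 and 4.0 would give a geometric mean of 2.0. The geometric mean suits ratios such as relative speed because it treats a factor-2 speedup and a factor-2 slowdown symmetrically.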
SLIDE 4

Overall performance

[Figure: performance per cycle (geometric mean, logarithmic scale from 0.1 to 1) by Gforth version (0.5.0, 0.6.1nd, 0.6.1, 0.6.2ns, 0.6.2, 0.7.0ssc, 0.7.0), one line per platform: IA32 Opteron 270, IA32 Xeon 5450, PPC 7447A, ARM Xscale IOP80321, IA32 Athlon MP, PPC 970, AMD64 Opteron 270, AMD64 Xeon 5450, IA64 Itanium II, Alpha 21264B, PPC64 PPC970, IA32 Pentium 4 Northwood.]

  • Typical speedup factor: 3
  • Biggest contribution: dynamic superinstructions
      • in 0.6.2 for Alpha, IA32, PPC
      • in 0.7.0 for others
  • Automatic tuning (0.7.0) helps IA32
  • Multi-state stack caching vs. static superinstructions
  • Branch target alignment helps Alpha (0.7.0)
  • Best performance per cycle: IA32, AMD64
      • Reason: indirect branch predictors

SLIDE 5

Dynamic Superinstructions

Forth:

    : foo 5 ;

Threaded code:

    lit
    5
    ;s

Native code:

    mov [esi], ecx
    mov ecx, [ebx]
    add ebx, #4
    add esi, #-4
    add ebx, #4
    mov ebx, [edi]
    add edi, #4
    add ebx, #4
    jmp -4[ebx]
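
To make the threaded-code column concrete, here is a minimal sketch of direct threaded code for `: foo 5 ;` using the GNU C labels-as-values extension (`&&label` and `goto *`). It is an illustration under simplifying assumptions, not Gforth's engine: the function name `run_foo` and the stack size are hypothetical, the top of stack is not cached in a register, and the sketch shows only threaded-code dispatch. It does not copy native code, which is what dynamic superinstructions add on top.

```c
#include <assert.h>
#include <stdint.h>

/* Runs the threaded code for ": foo 5 ;" and returns the value it leaves
   on the data stack. Requires GNU C (GCC or Clang) for &&label and goto *. */
intptr_t run_foo(void)
{
    intptr_t stack[8];
    intptr_t *sp = stack + 8;            /* data stack grows downward */

    /* Threaded code: addresses of primitives plus an inline literal. */
    void *code[] = { &&lit, (void *)(intptr_t)5, &&semis };
    void **ip = code;                    /* instruction pointer */

#define NEXT goto **ip++                 /* dispatch, replicated per primitive */

    NEXT;

lit:                                     /* push the literal that follows */
    *--sp = (intptr_t)*ip++;
    NEXT;

semis:                                   /* ";s": return from the definition */
    assert(sp < stack + 8);              /* foo must have pushed a value */
    return *sp;

#undef NEXT
}
```

Calling `run_foo()` returns 5. Each `NEXT` is a separate indirect jump in the source; in this sketch, the native-code column above corresponds to the primitives' bodies placed back to back so that only the final dispatch remains.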

SLIDE 6

Engines

[Figure: speed (logarithmic scale) of the gforth and gforth-fast engines by Gforth version (0.5.0, 0.6.1nd, 0.6.1, 0.6.2ns, 0.6.2, 0.7.0ssc, 0.7.0) on PPC 7447A.]

  • Benchmarking: gforth-fast
  • Debugging: gforth
      • Error detection and reporting
  • Typical difference: factor 2
  • Debugging engine:
      • dynamic superinstructions
      • no static superinstructions
      • no multi-state stack caching
      • no automatic tuning

SLIDE 7

Engines (2)

[Figure, two panels: speed (logarithmic scale) of the gforth and gforth-fast engines by Gforth version (0.5.0, 0.6.1nd, 0.6.1, 0.6.2ns, 0.6.2, 0.7.0ssc, 0.7.0) on IA32 Xeon 5450 and on AMD64 Xeon 5450.]

SLIDE 8

GCC versions (1)

[Figure: speed (logarithmic scale) by gcc version (2.95, 3.2, 3.3, 3.4, 4.0, 4.1, 4.3) on PPC 7447A, one line per Gforth version (0.5.0, 0.6.1nd, 0.6.1, 0.6.2ns, 0.6.2, 0.7.0ssc, 0.7.0).]

  • gcc ≥ 3.4 disables dynamic superinstructions in Gforth 0.6.x
  • Gforth 0.7.0 works around that
  • gcc-2.95 works well
SLIDE 9

GCC versions (2)

[Figure: speed (logarithmic scale) by gcc version (2.95, 3.3, 3.4, 4.0, 4.1, 4.2, 4.4.0) on IA32 Xeon 5450, one line per Gforth version (0.5.0, 0.6.1nd, 0.6.1, 0.6.2ns, 0.6.2, 0.7.0ssc, 0.7.0).]

  • PR15242 lowers branch prediction accuracy (gcc-3.4, gcc-4.4.0)
  • Gforth 0.7.0 works around that
  • Gforth 0.7.0 affected by
      • bad register allocation (4.1, 4.2)
      • NEXT expansion (4.4.0)
  • gcc-2.95 works well
SLIDE 10

GCC performance bug: PR15242

    Code 1+
    ( $804B6D8 ) add ebx, #4
    ( $804B6DB ) inc ebp
    ( $804B6DC ) mov esi, -4[ebx]
    ( $804B6DF ) mov eax, esi
    ( $804B6E1 ) jmp 804AE8C
    ...
    ( $804AE8C ) jmp eax

instead of

    Code 1+
    ( $804B6D8 ) add ebx, #4
    ( $804B6DB ) inc ebp
    ( $804B6DC ) jmp -4[ebx]
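
Why a single shared `jmp eax` can hurt branch prediction may be easier to see with a toy model. The sketch below assumes a deliberately simplified predictor in which each indirect branch has one entry that predicts its most recent target; real predictors are more sophisticated, so the numbers illustrate the effect and are not measurements. The function name `mispredictions` and the primitive numbering are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts mispredicted dispatches while executing a threaded-code loop
   `iterations` times. prims[] holds primitive numbers in execution order.
   Model (an assumption): one predictor entry per indirect branch, which
   predicts the target taken last time. With shared_dispatch != 0 all
   primitives jump through one common branch, as in the first listing;
   otherwise every primitive ends in its own indirect jump. */
size_t mispredictions(const int *prims, size_t len, size_t iterations,
                      int shared_dispatch)
{
    enum { MAX_PRIMS = 16 };
    int last_target[MAX_PRIMS];
    size_t misses = 0;
    size_t it, i;

    for (i = 0; i < MAX_PRIMS; i++)
        last_target[i] = -1;             /* cold predictor: no target yet */

    for (it = 0; it < iterations; it++) {
        for (i = 0; i < len; i++) {
            int current = prims[i];
            int next = prims[(i + 1) % len];
            int branch = shared_dispatch ? 0 : current;

            assert(current >= 0 && current < MAX_PRIMS);
            if (last_target[branch] != next)
                misses++;                /* predicted target was wrong */
            last_target[branch] = next;  /* remember the actual target */
        }
    }
    return misses;
}
```

For a loop over three distinct primitives executed 100 times, this model gives 3 mispredictions (the cold misses) with one branch per primitive and 300 with a shared branch, because the shared branch's target changes on every dispatch.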

SLIDE 11

GCC performance bug: NEXT expansion

    before_goto:
    goto *real_ca;

is compiled to:

    mov edx, 58 [esp]
    mov eax, esi
    mov 68 [esp], edx
    mov 6C [esp], edx
    mov 70 [esp], edx
    mov 74 [esp], edx
    jmp eax

instead of:

    jmp esi

SLIDE 12

Other Forth systems

[Figure: speed (logarithmic scale from 0.2 to 5.6) on IA32 Opteron 270 for the systems iforth, bigforth, gforth, vfxlin and spf4; benchmark labels: benchgc4, brainless, brew, cd16sim, fcp, lexex.]

SLIDE 13

Conclusion

  • Typical speedup factor: 3
  • Most important optimization: dynamic superinstructions
      • New gcc versions often disable it ⇒ workarounds
  • Important on IA32: explicit register allocation
      • Automatic enabling and testing to get it into Linux distributions
  • Other optimizations have small or architecture-specific effect
      • But their combination is still significant
  • Gforth is very portable
      • 0.5.0 runs on architectures that were not available on release
  • Future work
      • Inlining
      • Compilation through C (independence from GCC)
      • Native code generation