Speeding Up Thread-Local Storage Access from Dynamic Libraries



slide-1
SLIDE 1

Speeding Up Thread-Local Storage Access from Dynamic Libraries Alexandre Oliva

http://www.lsd.ic.unicamp.br/~oliva/
aoliva@redhat.com
oliva@lsd.ic.unicamp.br

Red Hat
University of Campinas
March 2008

slide-2
SLIDE 2

Summary

  • TLS?!?
  • Dynamic libraries
  • Thread-Local Storage
  • Optimizations
  • Performance numbers
  • ARM Port
  • Relaxations
slide-3
SLIDE 3

Background

  • Per-thread data
  • Stack, automatic variables
  • pthread_[gs]etspecific
  • TLS: __thread variables
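The two per-thread styles listed above can be compared in a minimal sketch. It assumes a GCC-compatible compiler that accepts the __thread keyword and a POSIX threads library. The names counter_key, tls_counter, bump_key, bump_tls and run_in_new_thread are invented for this illustration and do not come from the slides.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Style 1: a pthread key. Every access goes through a library call. */
static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    int rc = pthread_key_create(&counter_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Increment the calling thread's key-based counter and return the new value. */
long bump_key(void) {
    pthread_once(&counter_once, make_key);
    /* The slot stores the counter itself as a pointer-sized integer. */
    intptr_t v = (intptr_t)pthread_getspecific(counter_key) + 1;
    pthread_setspecific(counter_key, (void *)v);
    return (long)v;
}

/* Style 2: a __thread variable. The compiler emits the access directly. */
static __thread long tls_counter;

long bump_tls(void) { return ++tls_counter; }

/* Thread body: bump both counters once and return their sum. */
static void *worker(void *arg) {
    (void)arg;
    return (void *)(intptr_t)(bump_key() + bump_tls());
}

/* Run worker in a fresh thread; both of its counters start from zero. */
long run_in_new_thread(void) {
    pthread_t t;
    void *ret = NULL;
    int rc = pthread_create(&t, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, &ret);
    return (long)(intptr_t)ret;
}
```

Calling bump_key() or bump_tls() repeatedly in one thread counts up, while run_in_new_thread() sees fresh counters, since each thread has its own copy in both styles.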

slide-4
SLIDE 4

Dynamic Libraries

extern int i, g(void); int f(void) { return g() + i; }

PDC (exec)              PIC (shared lib)
                        copy next pc to %ebx
                        addl $_GOT_-., %ebx
call g                  call g@PLT
                        movl i@GOT(%ebx), %edx
addl i, %eax            addl (%edx), %eax
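The extra indirection in the PIC column can be modelled by hand in C. This is only a sketch: toy_got, toy_load and the body given to g are invented here, and a real GOT is filled in by the dynamic loader rather than by a function call.

```c
#include <assert.h>

int i = 5;                      /* the global from the slide */
int g(void) { return 37; }      /* stand-in body; the slide only declares g */

/* PDC: the address of i is a link-time constant, so one instruction reads it. */
int f_pdc(void) { return g() + i; }

/* PIC, modelled by hand: a table slot holds &i, and the code loads that
   address before it loads the value. */
static int *toy_got[1];

/* Stand-in for the loader filling in the GOT slot. */
void toy_load(void) { toy_got[0] = &i; }

int f_pic(void) {
    assert(toy_got[0] != 0);    /* toy_load() must have run first */
    int sum = g();
    int *addr = toy_got[0];     /* movl i@GOT(%ebx), %edx */
    return sum + *addr;         /* addl (%edx), %eax      */
}
```

After toy_load(), f_pdc() and f_pic() return the same value; the PIC version just pays one more load.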

slide-5
SLIDE 5

Thread-Local Storage

__thread int x; extern __thread int y;

[Diagram: the DTV, the static TLS block, and the dynamic TLS blocks holding x, y, and z; labels: module index, offset, TP, TP offsets]

slide-6
SLIDE 6

Local Exec

__thread int x; int getx() { return x; }

  • movl %gs:x@NTPOFF, %eax


slide-7
SLIDE 7

Initial Exec

extern __thread int y; int gety() { return y; }

  • movl y@GOTNTPOFF(%ebx), %eax
  • movl %gs:(%eax), %eax

_GOT_ + y@GOTNTPOFF:

  • .word y@NTPOFF
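The difference between Local Exec and Initial Exec can be sketched with a toy model in C. It assumes variables sit at negative offsets from the thread pointer, as the NTPOFF notation suggests. The struct toy_thread, the constant X_NTPOFF and the toy_ functions are all invented for this illustration; a real thread pointer lives in a segment register, not in a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy thread: a static TLS block with the thread pointer (TP) at its end. */
enum { STATIC_TLS_SIZE = 64 };
struct toy_thread {
    unsigned char static_tls[STATIC_TLS_SIZE];
};

static unsigned char *toy_tp(struct toy_thread *t) {
    return t->static_tls + STATIC_TLS_SIZE;
}

/* Helper for the demo: store an int at a given (negative) TP offset. */
void toy_store(struct toy_thread *t, ptrdiff_t off, int v) {
    assert(off <= -(ptrdiff_t)sizeof v && off >= -(ptrdiff_t)STATIC_TLS_SIZE);
    memcpy(toy_tp(t) + off, &v, sizeof v);
}

/* Local Exec: the offset is a constant baked into the instruction. */
#define X_NTPOFF ((ptrdiff_t)-8)

int toy_getx(struct toy_thread *t) {
    int v;
    memcpy(&v, toy_tp(t) + X_NTPOFF, sizeof v);   /* movl %gs:x@NTPOFF, %eax */
    return v;
}

/* Initial Exec: the offset sits in a GOT slot filled in at load time. */
static ptrdiff_t toy_got_y_ntpoff;

void toy_relocate(ptrdiff_t y_offset) { toy_got_y_ntpoff = y_offset; }

int toy_gety(struct toy_thread *t) {
    ptrdiff_t off = toy_got_y_ntpoff;             /* movl y@GOTNTPOFF(%ebx), %eax */
    int v;
    memcpy(&v, toy_tp(t) + off, sizeof v);        /* movl %gs:(%eax), %eax */
    return v;
}
```

Both getters use the same offsets for every thread; only the base (TP) changes, which is why two toy_thread objects give independent values.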
slide-8
SLIDE 8

General Dynamic

__thread int z; int getz() { return z; }

  • leal z@TLSGD(,%ebx,1), %eax
  • call __tls_get_addr@PLT
  • movl (%eax), %eax

void *__tls_get_addr(struct { long index, offset; } *);

_GOT_ + z@TLSGD:

  • .word index, offset
slide-9
SLIDE 9

__tls_get_addr

  • If generation count is not current, update() DTV
  • If dtv[index] not allocated, allocate() it
  • Return dtv[index] + offset
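The three steps above can be sketched as a toy function in C. This is not the dynamic loader's code: the DTV is passed in explicitly instead of being found through the thread pointer, and toy_update, toy_allocate, the fixed block size and the module limit are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

enum { TOY_MAX_MODULES = 8, TOY_BLOCK_SIZE = 32 };

/* Loader-side generation count, bumped whenever a module with TLS is loaded. */
static unsigned long toy_global_generation = 1;

void toy_module_loaded(void) { toy_global_generation++; }

struct toy_dtv {
    unsigned long generation;               /* generation this DTV was synced to */
    unsigned char *block[TOY_MAX_MODULES];  /* NULL until first use */
};

struct toy_tls_index { unsigned long index, offset; };

/* Hypothetical update(): a real one may resize the DTV or drop blocks of
   unloaded modules; this one only records the new generation. */
static void toy_update(struct toy_dtv *dtv) {
    dtv->generation = toy_global_generation;
}

/* Hypothetical allocate(): a zero-filled block of fixed size. */
static unsigned char *toy_allocate(void) {
    unsigned char *p = calloc(1, TOY_BLOCK_SIZE);
    if (p == NULL)
        abort();
    return p;
}

void *toy_tls_get_addr(struct toy_dtv *dtv, const struct toy_tls_index *ti) {
    assert(ti->index < TOY_MAX_MODULES && ti->offset < TOY_BLOCK_SIZE);
    if (dtv->generation != toy_global_generation)   /* step 1 */
        toy_update(dtv);
    if (dtv->block[ti->index] == NULL)              /* step 2 */
        dtv->block[ti->index] = toy_allocate();
    return dtv->block[ti->index] + ti->offset;      /* step 3 */
}
```

The first call for a module pays for the update and the allocation; later calls still pay for both checks, which is the cost the descriptor-based scheme tries to avoid in the common case.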


slide-10
SLIDE 10

Local Dynamic

static __thread int z1, z2; int getz() { return z1 + z2; }

  • leal z1@TLSLDM(%ebx), %eax
  • call __tls_get_addr@PLT
  • movl %eax, %esi
  • movl z1@DTPOFF(%eax), %eax
  • addl z2@DTPOFF(%esi), %eax
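The four access models shown so far can be requested per variable with the tls_model attribute that GCC and Clang provide. The sketch below is an illustration, not part of the talk: the variable and function names are invented, and GCC spells General Dynamic as "global-dynamic". When linked into an executable, the linker may relax the more general sequences into cheaper ones.

```c
#include <assert.h>

/* Each variable is pinned to one access model. */
static __thread int le_var __attribute__((tls_model("local-exec"))) = 1;
__thread int ie_var __attribute__((tls_model("initial-exec"))) = 2;
__thread int gd_var __attribute__((tls_model("global-dynamic"))) = 3;
static __thread int ld_var1 __attribute__((tls_model("local-dynamic"))) = 4;
static __thread int ld_var2 __attribute__((tls_model("local-dynamic"))) = 5;

int get_le(void) { return le_var; }
int get_ie(void) { return ie_var; }
int get_gd(void) { return gd_var; }

/* Two local-dynamic variables in one function can share a single
   module-base lookup, as in the getz() example on the slide. */
int get_ld(void) { return ld_var1 + ld_var2; }

/* Self-check: every model must yield the same values; only the
   generated access sequence differs. */
void check_models(void) {
    assert(get_le() == 1);
    assert(get_ie() == 2);
    assert(get_gd() == 3);
    assert(get_ld() == 9);
}
```

Compiling this with -S and -fPIC is a convenient way to see the instruction sequences from the preceding slides side by side.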
slide-11
SLIDE 11

TLS Descriptor-based General Dynamic

__thread int yz; int getyz() { return yz; }

  • leal yz@TLSDESC(%ebx), %eax
  • call *yz@TLSCALL(%eax) ;; == call *(%eax)
  • movl %gs:(%eax), %eax

_GOT_ + yz@TLSDESC:

  • .word resolver, argument
slide-12
SLIDE 12

Static Descriptor

_GOT_ + y@TLSDESC:

  • .word sresolver, y@NTPOFF

sresolver:

  • movl 4(%eax), %eax
  • ret
slide-13
SLIDE 13

Dynamic Descriptor

_GOT_ + z@TLSDESC:

  • .word dresolver, dyndesc(z)

dresolver:

  • If the generation count (GC) is current enough and dtv[index] is allocated, return dtv[index] + offset - TP
  • Otherwise, call __tls_get_addr preserving registers, then subtract TP

dyndesc(z): (allocated by the dynamic loader)

  • .word index, offset, generation
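The static and dynamic resolvers can be sketched together as a toy model in C. Everything named toy_ is invented for this illustration, the thread state is passed explicitly instead of coming from the thread pointer register, and the slow path is only a stand-in for the call to __tls_get_addr. In both cases the resolver returns an offset from TP, so the caller's final access is the same.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { TOY_MODULES = 4, TOY_BLOCK_SIZE = 32 };

/* Loader-wide generation count (a real loader bumps it on module load). */
static unsigned long toy_loader_generation = 1;

struct toy_thread {
    unsigned char *tp;                  /* thread pointer */
    unsigned long dtv_generation;
    unsigned char *dtv[TOY_MODULES];    /* NULL until allocated */
};

struct toy_desc;
typedef intptr_t (*toy_resolver)(struct toy_thread *, const struct toy_desc *);

/* Two-word GOT entry: .word resolver, argument */
struct toy_desc {
    toy_resolver resolver;
    intptr_t argument;
};

/* .word index, offset, generation */
struct toy_dyndesc { unsigned long index, offset, generation; };

/* Static descriptor: the argument already is the TP offset. */
intptr_t toy_sresolver(struct toy_thread *t, const struct toy_desc *d) {
    (void)t;
    return d->argument;
}

/* Stand-in for the __tls_get_addr slow path: sync and allocate on demand. */
static unsigned char *toy_slow_path(struct toy_thread *t,
                                    const struct toy_dyndesc *dd) {
    t->dtv_generation = toy_loader_generation;
    if (t->dtv[dd->index] == NULL) {
        t->dtv[dd->index] = calloc(1, TOY_BLOCK_SIZE);
        if (t->dtv[dd->index] == NULL)
            abort();
    }
    return t->dtv[dd->index] + dd->offset;
}

/* Dynamic descriptor: fast path when the DTV is current enough. */
intptr_t toy_dresolver(struct toy_thread *t, const struct toy_desc *d) {
    const struct toy_dyndesc *dd = (const struct toy_dyndesc *)d->argument;
    assert(dd->index < TOY_MODULES && dd->offset < TOY_BLOCK_SIZE);
    unsigned char *addr;
    if (t->dtv_generation >= dd->generation && t->dtv[dd->index] != NULL)
        addr = t->dtv[dd->index] + dd->offset;
    else
        addr = toy_slow_path(t, dd);
    /* Integer arithmetic, since addr and tp belong to different objects. */
    return (intptr_t)((uintptr_t)addr - (uintptr_t)t->tp);
}

/* Caller side: call through the descriptor, then add TP. */
int *toy_tls_addr(struct toy_thread *t, const struct toy_desc *d) {
    intptr_t off = d->resolver(t, d);                  /* call *(%eax)   */
    return (int *)((uintptr_t)t->tp + (uintptr_t)off); /* %gs:(%eax)     */
}
```

The caller never learns which resolver ran, which is what lets the loader pick the cheap static resolver whenever the variable ends up in the static TLS block.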
slide-14
SLIDE 14

Lazy Descriptor

_GOT_ + yz@TLSDESC:

  • .word lresolver, reloc

lresolver:

  • Acquire loader lock
  • If not resolved yet, apply relocation preserving registers

  • Release lock
  • Return into final resolver
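The lazy resolver's steps can be sketched in C as follows. This is a simplified model: toy_lookup_tp_offset is a hypothetical stand-in for applying the relocation, the final resolver is always the static one, and the sketch ignores the memory-ordering care a real implementation needs when other threads read the two-word descriptor without holding the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct toy_desc;
typedef intptr_t (*toy_resolver)(struct toy_desc *);

/* Two-word GOT entry: .word resolver, argument */
struct toy_desc {
    toy_resolver resolver;
    intptr_t argument;   /* a relocation id before resolution, a TP offset after */
};

static pthread_mutex_t toy_loader_lock = PTHREAD_MUTEX_INITIALIZER;
int toy_relocations_applied;   /* counts how often the slow work ran */

/* Final resolver for the static case: the argument is the TP offset. */
intptr_t toy_sresolver(struct toy_desc *d) { return d->argument; }

/* Hypothetical relocation processing: maps a relocation id to a TP offset. */
static intptr_t toy_lookup_tp_offset(intptr_t reloc) {
    return -8 * (reloc + 1);
}

intptr_t toy_lresolver(struct toy_desc *d) {
    assert(d != 0);
    pthread_mutex_lock(&toy_loader_lock);          /* acquire loader lock */
    if (d->resolver == toy_lresolver) {            /* not resolved yet    */
        d->argument = toy_lookup_tp_offset(d->argument);
        d->resolver = toy_sresolver;
        toy_relocations_applied++;
    }
    toy_resolver final = d->resolver;
    pthread_mutex_unlock(&toy_loader_lock);        /* release lock        */
    return final(d);                               /* into final resolver */
}
```

After the first call through the descriptor, later calls go straight to the final resolver and never touch the lock.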
slide-15
SLIDE 15

Speedups: Static

[Bar chart: t (CK) for (MinSt, MaxSt) × (P3, A64/32, A64/64); axis ticks 10 to 70. Speedup labels, in extraction order: 5.2x, 1.6x, 26.0x, 3.8x, 5.2x, 2.9x, 3.4x, 1.6x, 4.6x, 2.7x, 5.3x, 3.4x]

slide-16
SLIDE 16

Speedups: Dynamic

[Bar chart: t (CK) for (MinSt, MaxSt) × (P3, A64/32, A64/64); axis ticks 10 to 70. Speedup labels, in extraction order: 1.2x, 1.1x, 1.8x, 1.5x, 1.1x, 1.0x, 1.4x, 1.2x, 1.9x, 1.6x, 1.6x, 1.5x]

slide-17
SLIDE 17

ARM Port

        Original                      Optimized
        ldr r0, .Lt0                  ldr r0, .Lt0
.L1:    add r0, pc, r0
        bl __tls_get_addr(PLT)        bl foo(tlscall)
        ldr r0, [r0]                  ldr r0, [$tp, r0]
.Lt0:   .word foo(tlsgd) \            .word foo(tlsdesc) \
          + (. - .L1 - 8)               + (. - .L1)

slide-18
SLIDE 18

ARM Port (cont)

Original                          Optimized
bl tga(PLT)                       bl tramp
.word foo(tlsgd) \                .word foo(tlsdesc) \
  + (. - .L1 - 8)                   + (. - .L1 - 4)
add ip, pc, tga(got)[24:31]       add r0, lr, r0
add ip, ip, tga(got)[16:23]       ldr r1, [r0, #4]
ldr pc, [ip, tga(got)[0:15]]!     bx r1

slide-19
SLIDE 19

Relaxations

GD                      IE                      LE
ldr r0, .Lt0            ldr r0, .Lt0            ldr r0, .Lt0
bl foo(tlscall)         ldr r0, [pc, r0]        nop
.word foo(tlsdesc) \    foo(gottpoff) \         foo(tpoff)
  + (. - .L1 - 4)         + (. - .L1 - 8)         + 0

slide-20
SLIDE 20

Inlining the Trampoline

ldr rt, .Lt1            ldr rt, .Lt1            ldr rt, .Lt1
add rx, pc, rt          add rx, pc, rt          mov rx, rt
ldr ry, [rx, #4]        ldr ry, [rx]            nop
mov r0, rx              mov r0, rx              mov r0, rx
blx ry                  mov r0, ry              nop
.word foo(tlsdesc) \    foo(gottpoff) \         foo(tpoff)
  + (. - .L1 - 8)         + (. - .L1 - 8)         + 0

slide-21
SLIDE 21

Conclusions

  • Major speedups in the most common case
  • Small speedups even in the dlopen case

      – Compiler improvements could reduce them
      – Generation count
      – Calling conventions

  • Smaller code, same data space in static case
  • Lazy relocation
  • Ported to x86, x86_64, ARM and FR-V