Path Specialization: Reducing Phased Execution Overheads

slide-1
SLIDE 1

Path Specialization: Reducing Phased Execution Overheads

Filip Pizlo (Purdue), Erez Petrank (Technion/Microsoft), Bjarne Steensgaard (Microsoft)
ISMM’08, Tucson, AZ

slide-2
SLIDE 2
  • Real-time, concurrent, and incremental garbage collectors are becoming mainstream techniques.
  • But these collectors require barriers to be inserted, which causes execution to slow down.

slide-3
SLIDE 3
  • Barriers slow down execution of programs.
  • This talk focuses on increasing the throughput of programs that use expensive barriers.

slide-4
SLIDE 4

Types of Barriers

(a non-exhaustive list of expensive barriers that we’re familiar with)

slide-5
SLIDE 5
  • Stopless (ISMM’07)
  • Brooks read barrier (both lazy and eager)
  • Yuasa barrier for concurrent or incremental mark-sweep

slide-6
SLIDE 6

Stopless Barriers

  • “The write barrier from heck” (anonymous)
  • Stopless barriers potentially require multiple branches, loads, stores, and CASes, even on primitive reads and writes.
  • But the barriers are only active during the (short) copying phase.

slide-7
SLIDE 7
  • Brooks read barriers
  • Useful when the mutator may see the same object in both to-space and from-space
  • Idea: each object has a pointer in its header to the “correct” version of the object.
  • This pointer may be self-pointing

slide-8
SLIDE 8

Brooks Forwarding Pointer


slide-10
SLIDE 10

“Lazy” Brooks

Without the barrier:

object a = b.f
use a
use a

With the lazy Brooks barrier:

object a = b.forward.f
use a.forward
use a.forward
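The forwarding-pointer idea can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the names Obj, copy_to_tospace, read_field, and write_field are hypothetical and do not come from the talk, and a real collector would do the copy concurrently and with far more care.

```python
class Obj:
    """Heap object with a Brooks-style forwarding pointer in its header."""

    def __init__(self, **fields):
        self.forward = self          # self-pointing until the object is copied
        self.fields = dict(fields)


def copy_to_tospace(obj):
    """Hypothetical collector step: copy obj and redirect its forwarding pointer."""
    new = Obj(**obj.fields)          # the copy starts out self-pointing
    obj.forward = new                # the old copy now forwards to the new one
    return new


def read_field(obj, name):
    """Lazy Brooks read: follow the forwarding pointer at the point of use."""
    return obj.forward.fields[name]


def write_field(obj, name, value):
    """Lazy Brooks write: follow the forwarding pointer at the point of use."""
    obj.forward.fields[name] = value


b = Obj(f=1)
stale = b                            # a reference the mutator still holds
copy_to_tospace(b)
write_field(stale, "f", 42)          # lands in the new copy
print(read_field(b, "f"))
```

Because every access goes through forward, a mutator holding a stale from-space reference still reads and writes the current copy.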

slide-11
SLIDE 11

These barriers are only needed when copying is ongoing.

slide-13
SLIDE 13

Yuasa Write Barrier

The store a.f = b becomes:

if barrier active
    mark a.f
a.f = b

We use this barrier in concurrent and incremental mark-sweep collectors.
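A minimal Python sketch of the barrier above, for illustration only. The Node and Collector classes and the write_f helper are hypothetical names, and the mark stack is a simplified stand-in for whatever the real collector uses.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.f = None
        self.marked = False


class Collector:
    def __init__(self):
        self.marking = False         # the phase flag ("barrier active")
        self.mark_stack = []

    def mark(self, obj):
        if obj is not None and not obj.marked:
            obj.marked = True
            self.mark_stack.append(obj)


def write_f(gc, a, b):
    """Yuasa-style barrier for the store a.f = b."""
    if gc.marking:                   # phase check
        gc.mark(a.f)                 # mark the value about to be overwritten
    a.f = b


gc = Collector()
a, old, new = Node("a"), Node("old"), Node("new")
write_f(gc, a, old)                  # collector idle: plain store
gc.marking = True
write_f(gc, a, new)                  # marking: old value is marked first
print(old.marked, new.marked)
```

The phase check on the first line of write_f is the part that path specialization targets: outside the marking phase it is pure overhead.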

slide-14
SLIDE 14
  • Barriers for concurrent and incremental collectors tend to only be active during some phase of collector execution.
  • Even if the collector is always running, the barriers are only active a fraction of the time.
  • Concurrent mark-sweep: only active during the marking phase.
  • Metronome: Brooks only active during the (rare) copying phase.
  • Stopless: only active during the (rare and short) copying phase.

slide-15
SLIDE 15
  • What we want:
  • Make code run faster when the barriers are not needed.
  • Make code run not much slower when the barriers are needed.
  • Result: get better throughput.

slide-16
SLIDE 16

Path Specialization


slide-20
SLIDE 20

Simple Example

[Figure: code panels labeled Original, Fast, and Slow; the barriers are labeled in an earlier build of this slide]

slide-21
SLIDE 21
How It Really Works

  • We wish to provide the best throughput while still being sound.
  • Thus we need to allow code to switch from one version of the barrier to another when there is a phase change in the collector.
  • This is the crucial difference from previous work on specialization.

slide-22
SLIDE 22

GC points

  • Typically, concurrent and incremental collectors require that each mutator acknowledge changes in phase at GC points.
  • A GC point may be:
  • memory allocation
  • back branch (to ensure that GC points are reached in a timely fashion)
  • by proxy: any method call
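As a rough illustration of phase acknowledgement at GC points, here is a minimal Python sketch. All names (Mutator, Collector, request, gc_point, all_acknowledged) and the phase strings are hypothetical; the talk does not describe the handshake mechanism, so this only shows the general shape.

```python
class Mutator:
    def __init__(self):
        self.seen_phase = "idle"     # last phase this mutator acknowledged

    def gc_point(self, collector):
        """Called at allocations, back branches, and calls (the GC points)."""
        self.seen_phase = collector.requested_phase


class Collector:
    def __init__(self, mutators):
        self.mutators = mutators
        self.requested_phase = "idle"

    def request(self, phase):
        self.requested_phase = phase

    def all_acknowledged(self):
        """True once every mutator has passed a GC point since the request."""
        return all(m.seen_phase == self.requested_phase for m in self.mutators)


mutators = [Mutator(), Mutator()]
collector = Collector(mutators)
collector.request("marking")
mutators[0].gc_point(collector)
print(collector.all_acknowledged())  # one mutator has not reached a GC point yet
```

The point relevant to this talk is that a mutator only observes a phase change at a GC point, so code between two GC points can assume the phase is fixed.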

slide-23
SLIDE 23
How It Really Works

  • Three versions of code:
  • Unspecialized - code where we don’t care about GC phase
  • Fast - code where we know that we don’t need barriers
  • Slow - code where we need barriers

slide-24
SLIDE 24
  • The approach:
  • The “Unspecialized” code is the original code; it will check phase, and switch to either Fast or Slow, at every barrier.
  • Fast and Slow switch to Unspecialized at GC points (e.g. method call).
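The switching rule can be sketched as a tiny state machine in Python. This is an illustrative model, not the compiler pass itself: the run function, the operation names "barrier", "gc_point", and "other", and the barrier_needed callback are all assumptions made for the example.

```python
def run(ops, barrier_needed):
    """Simulate which code version executes each operation.

    ops: list of "barrier" (an access that needs a barrier), "gc_point",
         or "other".
    barrier_needed: callable returning the collector's current phase flag.
    Returns (trace, number_of_phase_checks).
    """
    version = "Unspecialized"
    checks = 0
    trace = []
    for op in ops:
        if op == "gc_point":
            version = "Unspecialized"        # the phase may change here
        elif op == "barrier" and version == "Unspecialized":
            checks += 1                      # one check, then commit to a version
            version = "Slow" if barrier_needed() else "Fast"
        trace.append((op, version))
    return trace, checks


# Two barriers, a call, then one more barrier.
ops = ["other", "barrier", "barrier", "gc_point", "barrier"]
trace, checks = run(ops, lambda: False)
print(checks)
```

In this model a run of barriers between two GC points costs a single phase check, where checking at every barrier would cost one per barrier.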

slide-27
SLIDE 27

A better example (Lazy Brooks)

int foo(object o) {
    int x = 2+2;
    o.f = x;        // needs barrier
    o.g = null;     // needs barrier
    o.bar();        // GC point
    return o.f;     // needs barrier
}

slide-28
SLIDE 28

Lazy Brooks: Without Specialization

int foo(object o) {
    int x = 2+2;
    o.forward.f = x;        // needs barrier
    o.forward.g = null;     // needs barrier
    o.bar();                // GC point
    return o.forward.f;     // needs barrier
}

slide-29
SLIDE 29

What happens with path specialization?


slide-32
SLIDE 32

Unspecialized:

int foo(object o) {
    int x = 2+2;
    o.f = x;
    o.g = null;
    o.bar();
    return o.f;
}

Fast:

int foo(object o) {
    int x = 2+2;
    o.f = x;
    o.g = null;
    o.bar();
    return o.f;
}

Slow:

int foo(object o) {
    int x = 2+2;
    o.forward.f = x;
    o.forward.g = null;
    o.bar();
    return o.forward.f;
}

slide-38
SLIDE 38

Lazy Brooks: With Specialization

int foo(object o) {
    int x = 2+2;                // Unspecialized
    if need barrier {
        o.forward.f = x;        // Slow
        o.forward.g = null;
    } else {
        o.f = x;                // Fast
        o.g = null;
    }
    o.bar();                    // Unspecialized (GC point)
    if need barrier
        return o.forward.f;     // Slow
    else
        return o.f;             // Fast
}
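To make the trade-off concrete, the two versions of foo can be modeled in Python, counting forwarding-pointer loads and phase checks. This is a sketch under assumptions: Obj, Phase, foo_always_barrier, and foo_specialized are hypothetical names, the phase is held fixed for the whole call, and the counters exist only for illustration.

```python
class Obj:
    """Object with a Brooks-style forwarding pointer; counts forwarding loads."""

    def __init__(self):
        self._forward = self          # self-pointing: no copy exists
        self.forward_loads = 0
        self.f = None
        self.g = None

    @property
    def forward(self):
        self.forward_loads += 1
        return self._forward

    def bar(self):
        pass                          # any call counts as a GC point


class Phase:
    """Hypothetical phase flag that counts how often it is checked."""

    def __init__(self, need_barrier):
        self.need_barrier = need_barrier
        self.checks = 0

    def check(self):
        self.checks += 1
        return self.need_barrier


def foo_always_barrier(o):
    """Lazy Brooks without specialization: every access forwards."""
    x = 2 + 2
    o.forward.f = x
    o.forward.g = None
    o.bar()
    return o.forward.f


def foo_specialized(o, phase):
    """Lazy Brooks with specialization: one check per run of barriers."""
    x = 2 + 2                         # Unspecialized
    if phase.check():
        o.forward.f = x               # Slow
        o.forward.g = None
    else:
        o.f = x                       # Fast
        o.g = None
    o.bar()                           # Unspecialized (GC point)
    if phase.check():
        return o.forward.f            # Slow
    return o.f                        # Fast


o, phase = Obj(), Phase(need_barrier=False)
result = foo_specialized(o, phase)
print(result, o.forward_loads, phase.checks)
```

With the barrier inactive, the specialized version pays two phase checks and no forwarding loads, while the unspecialized version pays three forwarding loads on every call. Whether that exchange wins in practice depends on the relative cost of a check and a barrier, which is what the results section measures.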

slide-39
SLIDE 39
Summary

  • Our algorithm aims to introduce the smallest number of “needs barrier” phase checks along any path...
  • ... while ensuring that code is not duplicated unnecessarily (example: any path from a GC point to a check is not duplicated).
  • See the paper for the complete algorithm.

slide-40
SLIDE 40

Implementation


slide-41
SLIDE 41
  • We have implemented Path Specialization in the Microsoft Bartok Research Compiler.
  • Path specialization exists as an optional pass that can be applied to any barrier that has a phase check.
  • We have tested this with our Yuasa barrier, our lazy and eager Brooks barriers, and our Stopless barriers.

slide-42
SLIDE 42

Results


slide-43
SLIDE 43
  • We test four internal MSR benchmarks (large PL-type programs) and three smaller traditional benchmarks ported to .NET.
  • Five barriers are used: CMS (Yuasa-type barrier), Brooks (lazy), Brooks (sunk eager), Stopless, and Stopless without any copying activity.

slide-44
SLIDE 44

Without Specialization


slide-48
SLIDE 48

Conclusion

  • For heavy barriers (Stopless), path specialization reduces code size and improves performance.
  • For barriers that are cheap but already have phase checks (like CMS), path specialization increases performance a bit without affecting code size.
  • For Brooks barriers, performance improves, but at the cost of a large code blow-up.
  • Performance improves for every barrier we tried.

slide-49
SLIDE 49

Questions/Comments
