Shenandoah: Theory and Practice (Christine Flood, Roman Kennke)



slide-1
SLIDE 1

1

Shenandoah: Theory and Practice

Christine Flood, Roman Kennke
Principal Software Engineers, Red Hat

slide-2
SLIDE 2

2

Shenandoah

Christine Flood, Roman Kennke
Principal Software Engineers, Red Hat

slide-3
SLIDE 3

3

Shenandoah

  • Why do we need it?
  • What does it do?
  • How does it work?
  • What's the current state?
  • What's left to do?
  • Performance
slide-4
SLIDE 4

4

GC is like an omniscient organizer for program memory.

I bet that's your messy pantry isn't it?

slide-5
SLIDE 5

5

[Figure: Java execution. Stack frames for methods Foo and Bar hold values (42, 6.847) and references that point to objects and arrays in the heap.]

slide-6
SLIDE 6

6

When we reorganize objects we need to copy the objects and update the stack locations to point to their new addresses.

[Figure: stack frames for methods Foo and Bar with references into the heap; one object and its copy are shown.]

slide-7
SLIDE 7

7

Why yet another garbage collector?

  • OpenJDK already has 4 collectors:
  • Serial
  • Parallel
  • Concurrent Mark Sweep
  • G1
slide-8
SLIDE 8

8

Why yet another garbage collector?

  • OpenJDK already has 4 collectors:
  • Serial (minimal collector)
  • Parallel (high throughput)
  • Concurrent Mark Sweep (low pause time, but...)
  • G1 (low/managed pause time, but...)
slide-9
SLIDE 9

9

But?

  • All existing collectors must (occasionally) compact old-gen or the whole heap
  • ... and therefore stop the world
  • ... for a long time, if the heap is large
slide-10
SLIDE 10

10

Shenandoah!

  • Aims to reduce GC pause times
  • Goal: <10ms pauses for >100GB heaps
  • More precisely:
  • Make GC pauses independent of heap size
  • Long-term goal: pauseless GC
slide-11
SLIDE 11

11

How do we do it?

  • Evacuate concurrently with Java threads
slide-12
SLIDE 12

12

Garbage-First (G1)

[Timeline figure. Phases shown: Init Mark, Concurrent Mark, Final Mark, Evacuation, interleaved with Java execution.]

slide-13
SLIDE 13

13

Shenandoah: Current implementation

[Timeline figure. Phases shown: Init Mark, Concurrent Mark, Final Mark, Concurrent Evacuation, interleaved with Java execution.]

We choose our collection set to minimize the amount of copying. We have a plan for removing all of the stop-the-world pauses.

slide-14
SLIDE 14

14

[Figure: stack frames for methods Foo and Bar with references into the heap; one object and its copy are shown.]

Wait, are you moving those objects while the program is running?

slide-15
SLIDE 15

15

How do we do that?

We recycle an idea from the 1980s and add a level of indirection.
slide-16
SLIDE 16

16

Forwarding Pointers based on Brooks Pointers

  • Rodney A. Brooks, “Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware”, 1984 Symposium on Lisp and Functional Programming

slide-17
SLIDE 17

17

Forwarding Pointer

  • Object layout inside the JVM remains the same.
  • Third party tools can still walk the heap.
  • Can choose GC algorithm at run time.
  • We hope to one day be able to take advantage of unused space in double word aligned objects when possible.

[Figure: object Foo with its indirection pointer]

slide-18
SLIDE 18

18

Forwarding Pointers

Any reads or writes of A will now be redirected to A'. We don't need to update Foo immediately.

[Figure: objects A and B in a From-Region, copy A' in a To-Region, and Foo referencing A]

slide-19
SLIDE 19

19

How to move an object while the program is running.

  • Read the forwarding pointer to from space.
  • Allocate a temporary copy of the object in to space.
  • Copy the data.
  • CAS the forwarding pointer.
  • If you succeed carry on.
  • If you fail, use the copy that was placed by the thread that beat you and recycle your temporary copy.
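The steps above can be sketched in plain Java. This is only a toy model under stated assumptions, not the JVM's implementation: `Obj`, `forwardee`, and `evacuate` are hypothetical names, the forwarding pointer is modelled with an `AtomicReference`, and "recycling" the losing copy is modelled by simply dropping it.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ConcurrentEvacuation {
    /** Toy object whose forwarding pointer refers to itself until it is copied. */
    static final class Obj {
        final AtomicReference<Obj> forwardee = new AtomicReference<>();
        final int payload;

        Obj(int payload) {
            this.payload = payload;
            forwardee.set(this); // not yet evacuated: forwards to itself
        }
    }

    /** Returns the to-space copy of {@code from}, racing safely with other threads. */
    static Obj evacuate(Obj from) {
        Obj seen = from.forwardee.get();           // read the forwarding pointer
        if (seen != from) {
            return seen;                           // somebody already copied it
        }
        Obj copy = new Obj(from.payload);          // allocate a temporary copy and copy the data
        if (from.forwardee.compareAndSet(from, copy)) {
            return copy;                           // CAS succeeded: carry on with our copy
        }
        return from.forwardee.get();               // CAS failed: use the winner's copy, drop ours
    }

    public static void main(String[] args) throws InterruptedException {
        Obj a = new Obj(42);
        Obj[] results = new Obj[2];
        Thread t1 = new Thread(() -> results[0] = evacuate(a));
        Thread t2 = new Thread(() -> results[1] = evacuate(a));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Both threads agree on a single copy, whichever one won the CAS.
        System.out.println(results[0] == results[1]); // true
    }
}
```

Because the CAS only succeeds while the forwarding pointer still refers to the original, at most one copy is ever published, and every later caller sees that same copy.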

slide-20
SLIDE 20

20

Forwarding Pointers

Reading an object in a From-region doesn't trigger an evacuation.

[Figure: objects A and B in a From-Region, next to a To-Region]

Note: If reads were to cause copying we might have a “read storm” where every operation required copying an object. Our intention is that since we are only copying on writes we will have less bursty behavior.

slide-21
SLIDE 21

21

Forwarding Pointers

Writing to an object in a From-Region will trigger an evacuation of that object to a To-Region, and the write will occur there.

[Figure: objects A and B in a From-Region, copy A' in a To-Region]

slide-22
SLIDE 22

22

How does Java code know where the real object is?

  • Reads, writes, acmps and some others are wrapped by code that ensures the correct objects are accessed:

  • Read barriers
  • Write barriers
  • Acmp / cmpxchg barriers
slide-23
SLIDE 23

23

Read Barriers

  • Read the forwarding pointer to access the forwarded object.
  • Does not trigger evacuation
  • If a write occurs concurrently, it's a race, but it's been a race before :-)
  • Usually compiles into a single mov instruction
slide-24
SLIDE 24

24

Write Barriers

  • Ensures that writes only happen in to-space
  • It does so by speculatively making a copy, then CASing the forwarding pointer in the object
  • If CAS succeeds, we win. If not, we roll back the allocation, and use whatever the other thread did
  • ... but only for objects in collection set, and only if evacuation is currently in progress
  • ... otherwise it's a simple read barrier
slide-25
SLIDE 25

25

Acmp barriers

  • If we compare a == a', we can get false negatives
  • Therefore, if an object comparison fails, we resolve both operands through a read barrier, then try again.
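A minimal sketch of that comparison, assuming a toy `Obj` class with a hypothetical `forwardee` field standing in for the forwarding pointer (none of these names come from the JVM sources):

```java
final class AcmpBarrier {
    /** Toy object; forwardee refers to itself until the object has been copied. */
    static final class Obj {
        Obj forwardee = this;
    }

    /** Read barrier: follow the forwarding pointer (null stays null). */
    static Obj resolve(Obj o) {
        return o == null ? null : o.forwardee;
    }

    /** Reference comparison that cannot report a false negative for a vs. a'. */
    static boolean acmp(Obj a, Obj b) {
        if (a == b) {
            return true;                   // fast path: raw pointers already match
        }
        return resolve(a) == resolve(b);   // comparison failed: resolve both, try again
    }

    public static void main(String[] args) {
        Obj a = new Obj();
        Obj aPrime = new Obj();
        a.forwardee = aPrime;                  // a has been evacuated to a'
        System.out.println(a == aPrime);       // false: the raw comparison is a false negative
        System.out.println(acmp(a, aPrime));   // true: both sides resolve to a'
    }
}
```

The extra work is only paid when the raw comparison fails, so comparisons that already succeed stay as cheap as before.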

slide-26
SLIDE 26

26

CmpXChg Barriers

  • compareAndSwapObject() combines all three, because it loads, compares and writes an object field
  • We insert a somewhat complex barrier that
    – Resolves the written value (read-barrier)
    – Ensures to-space copy (write-barrier)
    – Prevents false negative (acmp-barrier)
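One way to picture the combined barrier is the simplified sketch below. It is an assumption-laden illustration rather than the actual barrier: the field is modelled as an `AtomicReference`, `Obj`, `resolve`, and `casField` are hypothetical names, and the write barrier on the object that holds the field is left out to keep the example short.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CasBarrier {
    /** Toy object; forwardee refers to itself until the object has been copied. */
    static final class Obj {
        volatile Obj forwardee = this;
    }

    static Obj resolve(Obj o) {
        return o == null ? null : o.forwardee;
    }

    /** CAS on a reference field that tolerates stale from-space pointers. */
    static boolean casField(AtomicReference<Obj> field, Obj expected, Obj newValue) {
        Obj resolvedNew = resolve(newValue);          // read barrier on the written value
        Obj resolvedExpected = resolve(expected);
        while (true) {
            Obj witness = field.get();
            if (resolve(witness) != resolvedExpected) {
                return false;                         // genuinely a different object
            }
            if (field.compareAndSet(witness, resolvedNew)) {
                return true;                          // swapped against the raw value we saw
            }
            // The field changed under us; look again.
        }
    }

    public static void main(String[] args) {
        Obj a = new Obj();
        Obj aPrime = new Obj();
        a.forwardee = aPrime;                                  // a was evacuated to a'
        AtomicReference<Obj> field = new AtomicReference<>(a); // field still holds stale a
        Obj b = new Obj();
        System.out.println(field.compareAndSet(aPrime, b));    // false: false negative
        System.out.println(casField(field, aPrime, b));        // true
    }
}
```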
slide-27
SLIDE 27

27

How are barriers implemented?

  • Need two types of barriers:
    – Read barrier: read the Brooks pointer
    – Write barrier: maybe copy the object and update the Brooks pointer
  • oop read_barrier(oop obj)
  • oop write_barrier(oop obj)
slide-28
SLIDE 28

Shenandoah barriers

oop read_barrier(oop obj) {
    // Load the Brooks pointer stored in the word just before the object.
    return *(obj - 0x8);
}
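To make the `obj - 0x8` layout concrete, here is a toy model in Java that treats the heap as an array of 64-bit words and object addresses as indices into it. The class and method names (`BrooksHeap`, `allocate`, `forward`) are made up for this sketch; only the idea of keeping the forwarding word directly in front of the object comes from the slide.

```java
/** Toy word-addressed heap: each object is preceded by one forwarding word. */
final class BrooksHeap {
    static final int FWD_WORDS = 1;            // the slide's 0x8 bytes, i.e. one 64-bit word
    private final long[] words = new long[64];
    private int top = 0;

    /** Allocates an object of the given size and returns its address (a word index). */
    int allocate(int sizeInWords) {
        int obj = top + FWD_WORDS;
        words[obj - FWD_WORDS] = obj;          // a fresh object forwards to itself
        top = obj + sizeInWords;
        return obj;
    }

    /** Counterpart of read_barrier: load the word just before the object. */
    int readBarrier(int obj) {
        return (int) words[obj - FWD_WORDS];
    }

    /** Installs a forwarding pointer, as a collector would after copying. */
    void forward(int from, int to) {
        words[from - FWD_WORDS] = to;
    }

    public static void main(String[] args) {
        BrooksHeap heap = new BrooksHeap();
        int a = heap.allocate(2);
        int aPrime = heap.allocate(2);
        System.out.println(heap.readBarrier(a) == a);       // true: not moved yet
        heap.forward(a, aPrime);
        System.out.println(heap.readBarrier(a) == aPrime);  // true: redirected to the copy
    }
}
```

Since the forwarding word sits outside the object itself, the object's own layout is unchanged, which matches the earlier point that third-party tools can still walk the heap.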

slide-29
SLIDE 29

Shenandoah barriers

oop write_barrier(oop obj) {
    if (evacuation_in_progress) {
        // Slow path: the runtime may copy the object to to-space.
        return runtime_wbarrier(obj);
    }
    return obj;
}

slide-30
SLIDE 30

Shenandoah barriers

  • Read barriers:
    – getfield
    – Xaload
    – Intrinsics
    – Some esoteric stuff

slide-31
SLIDE 31

Shenandoah barriers

  • Write barriers:
    – putfield
    – Xastore
    – Intrinsics
    – Some esoteric stuff

slide-32
SLIDE 32

Shenandoah barrier example

// Method without barriers
void doStuff(TypeA a, TypeA b) {
    for (..) {
        a.x = 3;                   // putfield
        System.out.println(b.x);   // getfield
    }
}

// Same method with Shenandoah barriers
void doStuff(TypeA a, TypeA b) {
    for (..) {
        a = write_barrier(a);
        a.x = 3;                   // putfield
        b = read_barrier(b);
        System.out.println(b.x);   // getfield
    }
}

slide-33
SLIDE 33

Shenandoah barriers

  • Barriers are inserted by:
    – The interpreter
    – The C1 compiler
    – The C2 compiler
    – Us, hardcoded in the runtime

slide-34
SLIDE 34

Shenandoah barriers

  • Initial implementation showed disheartening performance: more than 50% slower than with other GCs
  • So how did we make it fast?
slide-35
SLIDE 35

Shenandoah barriers

  • How to optimize barriers?
    – Make barriers more efficient
    – Eliminate barriers
    – Optimize barrier placement

slide-36
SLIDE 36

Shenandoah barriers

  • Making barriers more efficient
    – Eliminate null-checks
    – Inline null-checks
    – Inline evacuation-in-progress checks
    – Inline in-collection-set checks

→ Only call runtime when really necessary
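The effect of inlining these checks can be illustrated with a small Java sketch. Everything here is hypothetical scaffolding (`Obj`, `inCollectionSet`, `slowPathCalls`, `runtimeWriteBarrier`), and the slow path uses a lock instead of the CAS protocol purely to keep the example short.

```java
final class FastPathWriteBarrier {
    /** Toy object; forwardee refers to itself until the object has been copied. */
    static final class Obj {
        volatile Obj forwardee = this;
        boolean inCollectionSet;
    }

    static volatile boolean evacuationInProgress;
    static int slowPathCalls;                     // counts calls into the "runtime"

    static Obj writeBarrier(Obj obj) {
        if (obj == null) {
            return null;                          // inlined null check
        }
        Obj fwd = obj.forwardee;                  // plain read barrier
        if (!evacuationInProgress) {
            return fwd;                           // inlined evacuation-in-progress check
        }
        if (!obj.inCollectionSet || fwd != obj) {
            return fwd;                           // not being evacuated, or already copied
        }
        return runtimeWriteBarrier(obj);          // only now call the runtime
    }

    /** Stand-in for the runtime slow path; a lock replaces the CAS for brevity. */
    static synchronized Obj runtimeWriteBarrier(Obj obj) {
        slowPathCalls++;
        if (obj.forwardee == obj) {
            obj.forwardee = new Obj();            // "copy" the object to to-space
        }
        return obj.forwardee;
    }

    public static void main(String[] args) {
        Obj o = new Obj();
        o.inCollectionSet = true;
        evacuationInProgress = true;
        Obj copy = writeBarrier(o);               // takes the slow path once
        System.out.println(writeBarrier(o) == copy);
        System.out.println(slowPathCalls);
    }
}
```

In this model a second write to the same object never reaches the runtime again, because the inlined check sees that the forwarding pointer no longer refers to the object itself.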

slide-37
SLIDE 37

Shenandoah barriers

  • Eliminate barriers
  • We don't need barriers:
    – For known NULL objects
    – For inlined constants
    – For newly allocated objects
    – After write barriers
  • Since we can only figure most of this out after parsing, this isn't possible to do with parse-time barriers

slide-38
SLIDE 38

Eliminate barriers on null objects

bool isNull(Type a) {
    Type b = null;
    a' = read_barrier(a);
    b' = read_barrier(b);
    return a' == b';
}

slide-39
SLIDE 39

Eliminate barriers on null objects

bool isNull(Type a) {
    Type b = null;
    a' = read_barrier(a);   // Don't care
    b' = read_barrier(b);   // Known null
    return a' == b';
}

slide-40
SLIDE 40

Eliminate barriers on null objects

bool isNull(Type a) {
    Type b = null;
    return a == b;
}

slide-41
SLIDE 41

Eliminate barriers on constants

static final Type A = ...;

int getFoo() {
    return A.foo;
}

slide-42
SLIDE 42

Eliminate barriers on constants

static final Type A = ...;

int getFoo() {
    Type A' = read_barrier(A);
    return A'.foo;
}

slide-43
SLIDE 43

Eliminate barriers on constants

static final Type A = ...;

int getFoo() {
    // Constants are always in to-space
    Type A' = read_barrier(A);
    return A'.foo;
}

slide-44
SLIDE 44

Eliminate barriers on new objects

int getFoo() {
    Type a = new Type();
    a' = read_barrier(a);
    return a'.foo;
}

slide-45
SLIDE 45

Eliminate barriers on new objects

int getFoo() {
    Type a = new Type();
    // New objects are always in to-space
    a' = read_barrier(a);
    return a'.foo;
}

slide-46
SLIDE 46

Eliminate barriers on new objects

int getFoo() {
    Type a = new Type();
    return a.foo;
}

slide-47
SLIDE 47

Eliminate barriers after write barriers

int getFoo(Type a) {
    a' = write_barrier(a);
    a'.bar = …;
    a'' = read_barrier(a');
    return a''.foo;
}

slide-48
SLIDE 48

Eliminate barriers after write barriers

int getFoo(Type a) {
    a' = write_barrier(a);
    a'.bar = …;
    // a' already in to-space
    a'' = read_barrier(a');
    return a''.foo;
}

slide-49
SLIDE 49

Eliminate barriers after write barriers

int getFoo(Type a) {
    a' = write_barrier(a);
    a'.bar = …;
    return a'.foo;
}

slide-50
SLIDE 50

Optimize barrier placement

  • Hoist barriers out of hot loops
slide-51
SLIDE 51

Example

void doStuff(TypeA a, TypeZ z) {
    for (…) {
        Call(); // Safepoint
        for (…) {
            a = write_barrier(a);
            a.x = foo;
            z = read_barrier(z);
            System.out.println(z.y);
        }
    }
}

slide-52
SLIDE 52

Example

void doStuff(TypeA a, TypeZ z) {
    a = write_barrier(a);   // hoisted out of both loops
    for (…) {
        Call(); // Safepoint
        for (…) {
            a.x = foo;
            z = read_barrier(z);
            System.out.println(z.y);
        }
    }
}

slide-53
SLIDE 53

Example

void doStuff(TypeA a, TypeZ z) {
    a = write_barrier(a);   // hoisted out of both loops
    z = read_barrier(z);    // hoisted out of both loops
    for (…) {
        Call(); // Safepoint
        for (…) {
            a.x = foo;
            System.out.println(z.y);
        }
    }
}

slide-54
SLIDE 54

Lessons learned

  • Basic algorithm pretty easy
  • Hard parts:
    – Finding all the right places to insert barriers
    – Supporting all JVM peculiarities:
        • Weak references
        • JNI critical regions
        • System.gc()
    – Compiler support and optimization

slide-55
SLIDE 55

55

Status

  • Feature complete
  • Stable (beta-quality)
  • Good performance (see later…)
  • Established OpenJDK project:

http://openjdk.java.net/projects/shenandoah/

  • Got nightly builds:
  • https://adopt-openjdk.ci.cloudbees.com/view/OpenJDK/

(Thanks Adopt-OpenJDK!!)

slide-56
SLIDE 56

56

Future Work (last year)

  • Finish big application testing.
  • Move the barriers to right before code generation.
  • Barrier-specific C2 opts?
  • Exploit Java Memory Model?
  • Heuristics tuning!
  • Generational Shenandoah?
  • Remembered Sets for updating roots and freeing memory sooner?

  • Round Robin Thread Stopping?
  • NUMA Aware?
slide-57
SLIDE 57

57

Future Work (now)

  • Finish big application testing.
  • Move the barriers to right before code generation.
  • Barrier-specific C2 opts?
  • Exploit Java Memory Model?
  • Heuristics tuning!
  • Generational Shenandoah?
  • Remembered Sets for updating roots and freeing memory sooner?

  • Round Robin Thread Stopping? (2.0)
  • NUMA Aware? (2.0)
slide-58
SLIDE 58

58

Releases?

  • First in Fedora 24
  • JDK 10
  • JEP 189: http://openjdk.java.net/jeps/189
slide-59
SLIDE 59

59

Performance

  • SPECjbb2015

[Bar chart: Shenandoah vs. G1 scores for compiler, compress, crypto, derby, mpegaudio, scimark.large, scimark.small, serial, startup, sunflow, xml, and total; axis from 200 to 1600.]

Throughput:
  • Shenandoah: 374 ops/m
  • G1: 393 ops/m
  • Shenandoah at 95% of G1 (min 80%, max 140%)

Pauses:
  • Shenandoah: avg 41ms, max 202ms
  • G1: avg 240ms, max 1126ms

  • 32 cores
  • 160GB RAM, 140GB heap
slide-60
SLIDE 60

60

Performance SPECjbb2015

  • Max-jops: maximum throughput
  • Critical-jops: throughput under response-time constraints (SLA)

                 G1        Shenandoah   Shenandoah vs. G1
Max-jops         18117     16899        93%
Critical-jops    4294      7990         186%
Pause avg        862ms     24.6ms
Pause max        2054ms    78.61ms

slide-61
SLIDE 61

61

Performance Radargun/Infinispan

Throughput:
  • G1: 940,065 reqs/s
  • Shenandoah: 1,202,925 reqs/s

slide-62
SLIDE 62

62

Performance Radargun/Infinispan

[Figure: response time percentiles]

Beware the scales!

slide-63
SLIDE 63

63

LRU test

  • Simple handwritten LRU cache benchmark
  • ParallelGC: 116091ms / 100000 ops
  • G1: 98598ms / 100000 ops
  • Shenandoah: 56698ms / 100000 ops
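The slides do not show the benchmark itself. For readers unfamiliar with the term, the following is a hypothetical sketch of what a simple LRU cache can look like in Java, built on `LinkedHashMap` in access order; it is not the authors' benchmark code, and the `LruCache` name and capacity are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache: evicts the least recently accessed entry once over capacity. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves an entry to the end
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // called after each put; drops the eldest entry
    }

    public static void main(String[] args) {
        LruCache<Integer, byte[]> cache = new LruCache<>(2);
        cache.put(1, new byte[16]);
        cache.put(2, new byte[16]);
        cache.get(1);               // touch 1, so 2 becomes the eldest entry
        cache.put(3, new byte[16]); // evicts 2
        System.out.println(cache.keySet()); // [1, 3]
    }
}
```

A workload like this keeps a large set of live objects while steadily turning older entries into garbage, which is why it is a common stress test for a collector.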
slide-64
SLIDE 64

64

Please test

  • Download and build:
  • http://hg.openjdk.java.net/shenandoah
  • Or use nightly builds:
  • https://adopt-openjdk.ci.cloudbees.com/view/OpenJDK/job/project-shenandoah-jdk9/
  • https://adopt-openjdk.ci.cloudbees.com/view/OpenJDK/job/project-shenandoah-jdk8/
  • Report issues or success stories to:
  • http://mail.openjdk.java.net/mailman/listinfo/shenandoah-dev
slide-65
SLIDE 65

65

References

  • http://openjdk.java.net/projects/shenandoah/