Limits of Parallel Marking Garbage Collection

...how parallel can a GC become?

Dr. Fridtjof Siebert, CTO, aicas

ISMM 2008, Tucson, 7 June 2008

Introduction

Parallel Hardware is becoming the norm

  • even for embedded computers
  • even for real-time systems

We need parallel garbage collection

  • That is not only optimized for max. throughput
  • But that gives guarantees on its performance
  • The worst-case GC timing must be predictable and fast

Terminology

[Figure: timelines of GC cycle 1 and cycle 2 for each kind of collector]

  • blocking GC
  • incremental GC
  • concurrent GC (CPU 1: Application, CPU 2: GC, CPU 3: Application)
  • parallel GC (CPU 1, CPU 2 and CPU 3 each run cycle 1 and cycle 2)
  • parallel & concurrent GC (CPU 1: Application, CPU 2: GC, CPU 3: GC)
  • parallel & concurrent GC, second variant (CPU 1, CPU 2, CPU 3)

Parallel Mark & Sweep

Incremental Mark & Sweep

  • uses three-color marking: white, grey and black
  • a mark phase step is
      • take a grey object o
      • mark all white objects referenced by o grey
      • mark o black
  • a sweep phase step is
      • take a white object
      • free its memory
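
The mark and sweep steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the JamaicaVM implementation: the names `Obj`, `mark_step`, `sweep_step` and `collect` are hypothetical, the heap is a plain list, and "freeing" an object simply removes it from that list.

```python
from enum import Enum


class Color(Enum):
    WHITE = 0   # not yet reached
    GREY = 1    # reached, but not yet scanned
    BLACK = 2   # reached and scanned


class Obj:
    """Hypothetical heap object: a name plus outgoing references."""

    def __init__(self, name):
        self.name = name
        self.refs = []
        self.color = Color.WHITE


def mark_step(grey):
    """One mark step. Returns False when no grey object is left."""
    if not grey:
        return False
    o = grey.pop()                      # take a grey object o
    for child in o.refs:                # mark all white objects referenced by o grey
        if child.color is Color.WHITE:
            child.color = Color.GREY
            grey.append(child)
    o.color = Color.BLACK               # mark o black
    return True


def sweep_step(heap, index):
    """One sweep step on heap[index]. Returns True if the object was freed."""
    if heap[index].color is Color.WHITE:
        heap[index] = None              # "free its memory" (assumption: drop the slot)
        return True
    return False


def collect(heap, roots):
    """Run one full cycle by repeating the steps; returns the names of freed objects."""
    grey = []
    for r in roots:                     # root objects start out grey
        if r.color is Color.WHITE:
            r.color = Color.GREY
            grey.append(r)
    while mark_step(grey):
        pass
    freed = []
    for i in range(len(heap)):
        name = heap[i].name
        if sweep_step(heap, i):
            freed.append(name)
    heap[:] = [o for o in heap if o is not None]
    return freed
```

An incremental collector would interleave single calls to `mark_step` and `sweep_step` with application work instead of looping as `collect` does.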

Parallel Mark & Sweep

Parallel Sweep Steps

  • not addressed here
  • sweeping can be performed fully in parallel by sweeping different regions of the heap on different CPUs
  • this needs parallel access to the free lists

Parallel Mark & Sweep

Parallel Mark

  • several threads may scan grey objects in parallel
  • a new color, anthracite, marks a grey object that is being scanned by one CPU
  • stalls are possible if the grey set is temporarily empty!
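
One way to picture the anthracite color is as a claim on a grey object: a worker recolors the object under a lock before scanning it, so no other worker scans the same object. The sketch below is an assumption-laden illustration using Python threads and a single condition variable. `Obj` and `parallel_mark` are hypothetical names, and a real collector would use far finer-grained synchronization.

```python
import threading
from enum import Enum


class Color(Enum):
    WHITE = 0
    GREY = 1
    ANTHRACITE = 2   # grey object currently being scanned by one worker
    BLACK = 3


class Obj:
    """Hypothetical heap object: a name plus outgoing references."""

    def __init__(self, name):
        self.name = name
        self.refs = []
        self.color = Color.WHITE


def parallel_mark(roots, workers=3):
    """Mark everything reachable from roots using several worker threads."""
    cond = threading.Condition()
    grey = []
    scanning = 0                        # number of anthracite objects

    for r in roots:
        if r.color is Color.WHITE:
            r.color = Color.GREY
            grey.append(r)

    def worker():
        nonlocal scanning
        while True:
            with cond:
                while not grey:
                    if scanning == 0:   # nothing grey, nothing anthracite: marking is done
                        cond.notify_all()
                        return
                    cond.wait()         # grey set temporarily empty: this worker stalls
                o = grey.pop()
                o.color = Color.ANTHRACITE   # claim o for this worker
                scanning += 1
            children = list(o.refs)     # the scan itself runs outside the lock
            with cond:
                for child in children:
                    if child.color is Color.WHITE:
                        child.color = Color.GREY
                        grey.append(child)
                o.color = Color.BLACK
                scanning -= 1
                cond.notify_all()       # wake stalled workers

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A worker may only conclude that marking is finished when the grey set is empty and no object is anthracite, because an object still being scanned can produce new grey objects.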

Worst Case: Linked List

[Figure: a single linked list reachable from root, marked by CPU1, CPU2 and CPU3]

  • CPU1 starts a mark step
  • the other CPUs find no grey object and stall!
  • the mark step is finished
  • all CPUs compete for one grey object!
  • e.g., CPU2 is successful, CPU1 + CPU3 stall!

Worst Case: Linked List

With n CPUs performing mark in parallel

  • there might be n-1 stalls for each mark step
  • only one CPU is performing a mark step at any time

Worst-case performance equal to non-parallel GC!
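
The linked-list argument can be checked with a small lock-step simulation. The model is a simplifying assumption, not the measurement setup of the talk: all CPUs act in synchronous rounds, each CPU claims at most one grey object per round, and a CPU that finds none counts as one stall. The function name `simulate_mark` and the dictionary representation of the graph are hypothetical.

```python
def simulate_mark(refs, roots, cpus):
    """Lock-step parallel mark.

    refs maps each object to the list of objects it references.
    Returns (rounds, stalls).
    """
    white = set(refs) - set(roots)
    grey = list(dict.fromkeys(roots))   # roots start out grey, duplicates dropped
    rounds = stalls = 0
    while grey:
        rounds += 1
        claimed = grey[:cpus]           # each CPU claims at most one grey object
        grey = grey[cpus:]
        stalls += cpus - len(claimed)   # CPUs that found no grey object stall
        for o in claimed:
            for child in refs[o]:
                if child in white:      # white objects referenced by o turn grey
                    white.discard(child)
                    grey.append(child)
    return rounds, stalls


# A linked list of 10 objects: 0 -> 1 -> ... -> 9
linked_list = {i: [i + 1] for i in range(9)}
linked_list[9] = []

print(simulate_mark(linked_list, [0], cpus=1))   # (10, 0)
print(simulate_mark(linked_list, [0], cpus=3))   # (10, 20)
```

With three CPUs the list still needs 10 rounds, the same as with one CPU, and every round has n-1 = 2 stalls. A root that references nine leaf objects directly would instead finish in 4 rounds on three CPUs under the same model.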

Can we find a better limit for real applications?

First, look at two-processor parallel mark only

  • what if the memory graph consists of two linked lists?

Two Linked Lists with two CPUs

[Figure: two linked lists reachable from root, scanned by CPU1 and CPU2]

  • we might be lucky and see no stalls
  • but we might have bad luck: if one list is scanned first, a single linked list is left!

Limit on stalls depends on object depth

[Figure: memory graph reachable from root, marked by CPU1 and CPU2; the objects are labelled with numbers from 1 to 13]

Limit on stalls depends on object depth (2-processors)

  • after 1st stall, all objects with depth ≤ 1 are black
  • after 2nd stall, all objects with depth ≤ 2 are black
  • etc.
  • after nth stall, all objects with depth ≤ n are black

Limit on stalls depends on object depth (2-processors)

The number of stalls s of a two-processor parallel mark is limited by the max. depth of the memory graph H.
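
Read as an induction, the "after nth stall" argument suggests that s is at most the maximum depth of H when root objects count as depth 1. That reading is an inference from the surrounding slides, not a formula taken from them. The sketch below checks it in a simplified lock-step model with two CPUs. The helper names `max_depth` and `stalls_two_cpus` are hypothetical, and depth is assumed to mean the shortest distance from a root.

```python
import random
from collections import deque


def max_depth(refs, roots):
    """Largest shortest-path depth of any reachable object; roots have depth 1."""
    depth = {r: 1 for r in roots}
    queue = deque(depth)
    while queue:
        o = queue.popleft()
        for child in refs[o]:
            if child not in depth:
                depth[child] = depth[o] + 1
                queue.append(child)
    return max(depth.values(), default=0)


def stalls_two_cpus(refs, roots, rng):
    """Lock-step mark with two CPUs that pick grey objects in random order."""
    seen = set(roots)                   # every non-white object
    grey = list(dict.fromkeys(roots))
    stalls = 0
    while grey:
        rng.shuffle(grey)
        claimed, grey = grey[:2], grey[2:]
        stalls += 2 - len(claimed)      # one stall if only one grey object exists
        for o in claimed:
            for child in refs[o]:
                if child not in seen:
                    seen.add(child)
                    grey.append(child)
    return stalls


# A linked list of 5 objects reaches the bound exactly: 5 stalls, depth 5.
chain = {i: [i + 1] for i in range(4)}
chain[4] = []
rng = random.Random(0)
print(stalls_two_cpus(chain, [0], rng), max_depth(chain, [0]))   # 5 5
```

In this model a stall can only occur in a round that starts with exactly one grey object, which is why each stall pushes the "all black up to depth n" frontier one level deeper.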

Generalization for more processors

# of stalls s on p-processor parallel mark is limited by:

Analysis and Measurements

Instrumented the JamaicaVM Java implementation to

  • measure the maximum depth of the heap graph,
  • take samples of the current heap graph every 10,000 reference store operations, and
  • output the maximum depths and the maximum ratios depth / heap size in # of objects

The instrumented VM was then used to run the SPECjvm98 benchmark suite
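
The sampling scheme described above could look roughly like the following sketch. It is a hypothetical stand-in, not the JamaicaVM instrumentation: `DepthSampler` and its methods are invented names, the heap is modelled as a dictionary of references, depth is assumed to be the shortest distance from a root (roots at depth 1), and heap size is assumed to be the number of objects known to the sampler.

```python
from collections import deque

SAMPLE_INTERVAL = 10_000   # reference store operations between two samples


class DepthSampler:
    """Tracks the maximum depth and the maximum depth / heap-size ratio."""

    def __init__(self, roots, interval=SAMPLE_INTERVAL):
        self.roots = list(roots)
        self.refs = {r: [] for r in self.roots}
        self.interval = interval
        self.stores = 0
        self.max_depth = 0
        self.max_ratio = 0.0

    def store_reference(self, src, dst):
        """Hook assumed to be called on every reference store src -> dst."""
        self.refs.setdefault(src, []).append(dst)
        self.refs.setdefault(dst, [])
        self.stores += 1
        if self.stores % self.interval == 0:
            self.sample()

    def sample(self):
        """Measure the depth of the current heap graph with a breadth-first search."""
        depth = {r: 1 for r in self.roots}
        queue = deque(depth)
        while queue:
            o = queue.popleft()
            for child in self.refs[o]:
                if child not in depth:
                    depth[child] = depth[o] + 1
                    queue.append(child)
        d = max(depth.values(), default=0)
        heap_size = len(self.refs)      # heap size in number of objects (assumption)
        self.max_depth = max(self.max_depth, d)
        if heap_size:
            self.max_ratio = max(self.max_ratio, d / heap_size)
```

Because samples are taken only at intervals, deeper graphs that exist briefly between two samples would go unnoticed, so the reported maxima are lower bounds under this scheme.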

Measurements

Maximum depths of SPECjvm98 benchmarks

[Chart: maximum depth for each benchmark (check, compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack); axis ticks from 250 to 1500]

Measurements

Maximum relative depths of SPECjvm98 benchmarks

[Chart: maximum relative depth for each benchmark (check, compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack); axis from 0.00% to 4.00%]

Measurements

Worst-case scalability of SPECjvm98 benchmarks

[Two charts, both over 1 to 256 in powers of two. The first has a logarithmic scale from 1 to 100 with the series ideal, check, compress, jess, raytrace, db, javac, mpegaudio, mtrt and jack. The second has a scale from 0.0 to 1.0 with the same series plus non-parallel.]

Conclusions

In the general case, the mark phase of a garbage collector cannot be parallelized: in the worst case, a single linked list, it performs no better than a non-parallel GC.

However, if the depth of the memory graph is limited, a parallel mark phase generally works well.

To be able to give real-time guarantees on the performance of the mark phase, we need a guarantee from the application on its maximum heap depth.