Boosting the Priority of Garbage: Scheduling Collection on - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Boosting the Priority of Garbage: Scheduling Collection on Heterogeneous Multicore Processors

Shoaib Akram, Jennifer B. Sartor, Kenzo Van Craeynest, Wim Heirman, Lieven Eeckhout Ghent University, Belgium Shoaib.Akram@UGent.be

slide-2
SLIDE 2

Popularity of Managed Languages

The 2015 Top Ten Programming Languages, spectrum.ieee.org.

slide-3
SLIDE 3

The Garbage Collection Advantage

  • Memory automatically reclaimed for reuse
  • Takes extra CPU cycles to provide the service
  • Concurrent collectors suited to multicores

slide-4
SLIDE 4

Heterogeneous Multicores

  • 600 Series: 4x ARM Cortex A72, 4x ARM Cortex A53
  • Exynos 8890: 4x ARM Cortex A53, 4x Exynos M1

[Figure: big (out-of-order) and LITTLE (in-order) cores on power and performance axes]

slide-5
SLIDE 5

Managed Language Applications on Heterogeneous Multicores

Application → Garbage Collector → big or LITTLE?

[Figure: big (out-of-order) and LITTLE (in-order) cores on power and performance axes]

slide-6
SLIDE 6

GC on big versus LITTLE

  • Application (on big): allocates objects on heap
  • Collector (on big or LITTLE): identifies live objects on heap and then reclaims memory taken up by remaining objects
  • Application and collector running concurrently
  • Run collector on big versus LITTLE and measure the difference in execution time

slide-7
SLIDE 7

GC on big versus LITTLE

[Figure: % increase in execution time]

slide-9
SLIDE 9

GC on big versus LITTLE

[Figure: % increase in execution time, benchmarks grouped as GC-Critical and GC-Uncritical]

  • Some applications exhibit GC-Criticality
  • GC on LITTLE is detrimental for GC-Critical applications

slide-10
SLIDE 10

GC on big versus LITTLE

What happens if GC runs on LITTLE for GC-Critical apps?

  • Application: allocates objects on heap
  • Collector: identifies live objects on heap and then reclaims memory taken up by remaining objects
  • Application is paused if there is no free memory on the heap because the collector is still running

[Timeline labels: Serial collection; Paused!]
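
The collector's job as described on this slide (identify live objects, then reclaim the rest) can be illustrated with a minimal mark-and-sweep sketch. This is an illustration only, not the collector used in the talk; the heap representation and the names `mark` and `sweep` are assumptions made for the example.

```python
def mark(roots, refs):
    """Return the set of object ids reachable from the roots.

    refs maps an object id to the list of object ids it points to.
    The traversal follows pointers one object at a time.
    """
    live = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        stack.extend(refs.get(obj, []))
    return live


def sweep(heap, live):
    """Return the objects that are not live, i.e. reclaimable memory."""
    return sorted(set(heap) - live)


# Hypothetical five-object heap: a -> b -> c is reachable, d -> e is not.
heap = ["a", "b", "c", "d", "e"]
refs = {"a": ["b"], "b": ["c"], "d": ["e"]}
live = mark(["a"], refs)
garbage = sweep(heap, live)
```

If the application exhausts free memory before `mark` and `sweep` finish, it has to wait, which is the pause shown on the slide.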

slide-11
SLIDE 11

Giving GC Fair Share of Big Core

  • gc-fair

– Equally share the big core among all threads
– Based on Van Craeynest et al. [PACT 2013]

  • Baseline is gc-on-LITTLE

– Pin the GC threads on LITTLE cores

  • Observe the % reduction in execution time
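
The difference between the two policies can be sketched as follows. This is a simplified model under stated assumptions: a single big core, round-robin rotation per scheduling quantum, and hypothetical thread names. It is not the scheduler from Van Craeynest et al.

```python
from collections import Counter
from itertools import cycle


def big_core_schedule(threads, gc_threads, policy, quanta):
    """Return which thread holds the single big core in each quantum."""
    if policy == "gc-on-LITTLE":
        # GC threads are pinned to LITTLE cores and never get the big core.
        eligible = [t for t in threads if t not in gc_threads]
    elif policy == "gc-fair":
        # Every thread, GC or application, takes an equal turn.
        eligible = list(threads)
    else:
        raise ValueError(f"unknown policy: {policy}")
    turn = cycle(eligible)
    return [next(turn) for _ in range(quanta)]


threads = ["app0", "app1", "gc0", "gc1"]
gc_threads = {"gc0", "gc1"}
fair = Counter(big_core_schedule(threads, gc_threads, "gc-fair", 8))
pinned = Counter(big_core_schedule(threads, gc_threads, "gc-on-LITTLE", 8))
```

Over eight quanta, `fair` gives each of the four threads two quanta on the big core, while `pinned` gives the two application threads four each and the GC threads none.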

slide-12
SLIDE 12

Giving GC Fair Share of Big Core

[Figure: % execution time reduction for GC-Uncritical benchmarks, with 1, 2, and 3 LITTLE cores]

slide-14
SLIDE 14

Giving GC Fair Share of Big Core

[Figure: % execution time reduction for GC-Uncritical benchmarks, with 1, 2, and 3 LITTLE cores]

  • gc-on-LITTLE for GC-Uncritical

slide-15
SLIDE 15

Giving GC Fair Share of Big Core

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks, with 1, 2, and 3 LITTLE cores]

  • gc-on-LITTLE for GC-Uncritical

slide-17
SLIDE 17

Giving GC Fair Share of Big Core

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks, with 1, 2, and 3 LITTLE cores]

  • gc-on-LITTLE for GC-Uncritical
  • gc-fair for GC-Critical

slide-18
SLIDE 18

Giving GC Fair Share of Big Core

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks, with 1, 2, and 3 LITTLE cores]

  • GC-Criticality depends on architecture, application, and runtime environment

slide-19
SLIDE 19

Our Contribution

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks, with 1, 2, and 3 LITTLE cores]

  • GC-Criticality depends on architecture, application, and runtime environment
  • GC-Criticality-Aware Scheduler: dynamically adjusts the number of big core cycles given to the concurrent collector

slide-20
SLIDE 20

GC-Criticality-Aware Scheduler

Runtime Activity → How Scheduler Reacts?

[Timeline figure: app and gc threads over time; app runs alone; scheduler state is gc-on-LITTLE]

slide-21
SLIDE 21

GC-Criticality-Aware Scheduler

gc-on-LITTLE to gc-fair

[Timeline figure: app and gc threads over time; app runs alone; scheduler state is gc-on-LITTLE]

slide-22
SLIDE 22

GC-Criticality-Aware Scheduler

gc-on-LITTLE to gc-fair

[Timeline figure: app and gc threads over time; phases App alone, Concurrent, Stop, Scan; scheduler state moves from gc-on-LITTLE to gc-fair]

  • Stop pause to do book-keeping is ignored
  • Scan stop pause: JVM signals the scheduler
  • gc-fair gives equal priority to GC and app

slide-23
SLIDE 23

GC-Criticality-Aware Scheduler

Boost States

Stop scan pauses are observed even with gc-fair.

How many quanta are scheduled on the big core, per scheduler state?

  • gc-on-LITTLE: First GC thread = 0, Second GC thread = 0
  • gc-fair: First GC thread = 1, Second GC thread = 1
  • gc-boost P0: First GC thread = 1, Second GC thread = 1
  • gc-boost P1: First GC thread = 1, Second GC thread = 2
  • …

Boost the priority of garbage:

  • Give GC more consecutive quanta on big
  • Degrade boost state when no longer critical
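
The boost and degrade behaviour can be sketched as a small state machine. This is a minimal sketch under assumptions, not the authors' implementation: gc-fair and gc-boost P0 are collapsed into one step because the slide lists the same quanta for both, the ladder stops at P1, and the `quiet_needed` threshold (a hypothetical name) is applied to every downward step.

```python
# (first GC thread, second GC thread) quanta on the big core, from the slide.
QUANTA = {
    "gc-on-LITTLE": (0, 0),
    "gc-boost:P0": (1, 1),
    "gc-boost:P1": (1, 2),
}
LADDER = ["gc-on-LITTLE", "gc-boost:P0", "gc-boost:P1"]


class BoostScheduler:
    def __init__(self, quiet_needed=1):
        self.level = 0                    # index into LADDER
        self.quiet = 0                    # consecutive intervals with no scan pause
        self.quiet_needed = quiet_needed  # assumed configurable threshold

    @property
    def state(self):
        return LADDER[self.level]

    def end_of_interval(self, scan_pause_seen):
        """Update the state from the JVM's signal for the last interval."""
        if scan_pause_seen:
            # GC is critical: boost one step, capped at the top state.
            self.level = min(self.level + 1, len(LADDER) - 1)
            self.quiet = 0
        else:
            self.quiet += 1
            if self.quiet >= self.quiet_needed and self.level > 0:
                # No longer critical: degrade one step.
                self.level -= 1
                self.quiet = 0
        return self.state


sched = BoostScheduler(quiet_needed=1)
trace = [sched.end_of_interval(p) for p in (True, True, False, False, False)]
```

With two signalled scan pauses followed by three quiet intervals, `trace` climbs to gc-boost:P1 and then steps back down to gc-on-LITTLE.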

slide-24
SLIDE 24

GC-Criticality-Aware Scheduler

gc-boost:P0 to gc-on-LITTLE

[Timeline figure: app and gc threads over time; phases App alone, Concurrent, Stop, App alone; scheduler state moves from gc-boost:P0 to gc-on-LITTLE]

  • JVM signals the scheduler
  • If no scan pause occurs in state P0, go to gc-on-LITTLE
  • The number of intervals with zero stop scan pauses before returning to gc-on-LITTLE is configurable

slide-25
SLIDE 25

GC-Criticality-Aware Scheduler

Summary

  • JVM detects GC-Criticality during runtime
  • Communicates criticality information down to the scheduler
  • Scheduler dynamically adapts big core cycles given to GC

slide-26
SLIDE 26

Experimental Setup

  • Java Virtual Machine

– Jikes Research Virtual Machine (Version 3.1.2)
– Full-heap concurrent collector with two threads
– Tackle non-determinism by warming up the JVM
– Heap size 2x of minimum

  • Benchmarks

– Ten benchmarks from DaCapo
– Vary the # threads: 1 to 4

  • Heterogeneous Multicore Setup

– Sniper multicore simulator (Version 4.0)
– Different four core heterogeneous architectures
– Varying # of big and LITTLE cores

slide-27
SLIDE 27

Performance of GC-Criticality-Aware Scheduler

3 big plus one LITTLE core

[Figure: % execution time reduction with gc-fair for GC-Uncritical and GC-Critical benchmarks]

slide-28
SLIDE 28

Performance of GC-Criticality-Aware Scheduler

3 big plus one LITTLE core

[Figure: % execution time reduction with gc-boost and gc-fair for GC-Uncritical and GC-Critical benchmarks]

  • gc-boost is performance neutral for GC-Uncritical

slide-29
SLIDE 29

Performance of GC-Criticality-Aware Scheduler

3 big plus one LITTLE core

[Figure: % execution time reduction with gc-boost and gc-fair for GC-Uncritical and GC-Critical benchmarks]

  • gc-boost is performance neutral for GC-Uncritical
  • Improves performance of GC-Critical by 14% on average

slide-30
SLIDE 30

Understanding the Performance Advantage of Big Core

[Figure: cycles per instruction for application and collector, broken down into Base, L1-I, L1-D Miss, L2 Miss, and L3 Miss]

slide-31
SLIDE 31

Understanding the Performance Advantage of Big Core

[Figure: cycles per instruction for application and collector on LITTLE, broken down into Base, L1-I, L1-D Miss, L2 Miss, and L3 Miss]

  • Collector performs a heap traversal chasing pointers

slide-32
SLIDE 32

Understanding the Performance Advantage of Big Core

[Figure: cycles per instruction for application and collector on LITTLE and big, broken down into Base, L1-I, L1-D Miss, L2 Miss, and L3 Miss]

  • Collector performs a heap traversal chasing pointers
  • Instruction-level parallelism ☺
  • Memory-level parallelism ☹

slide-33
SLIDE 33

Performance of GC-Criticality-Aware Scheduler

Lowering frequency of LITTLE core

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks; series: Similar freq.]

slide-34
SLIDE 34

Performance of GC-Criticality-Aware Scheduler

Lowering frequency of LITTLE core

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks; series: Similar freq., 1 GHz slower]

  • Lowering frequency increases GC-Criticality

slide-35
SLIDE 35

Performance of GC-Criticality-Aware Scheduler

Lowering frequency of LITTLE core

[Figure: % execution time reduction for GC-Uncritical and GC-Critical benchmarks; series: Similar freq., 1 GHz slower]

  • Lowering frequency increases GC-Criticality
  • Improves performance of GC-Critical by 20% on average

slide-36
SLIDE 36

Performance of GC-Criticality-Aware Scheduler

Different # LITTLE cores

[Figure: % execution time reduction for GC-Critical and GC-Uncritical benchmarks with 1, 2, and 3 LITTLE cores (1L, 2L, 3L)]

  • Allocation rate lowers with more LITTLE cores
  • gc-boost is beneficial for different # LITTLE

slide-37
SLIDE 37

Energy Efficiency of GC-Criticality-Aware Scheduler

3 big plus one LITTLE core

[Figure: % reduction in energy-delay product for GC-Critical and GC-Uncritical benchmarks]

  • Negligible change in EDP for GC-Uncritical
  • 20% average reduction in EDP for GC-Critical
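
For readers unfamiliar with the metric, energy-delay product (EDP) is energy multiplied by execution time, so lower is better. The sketch below shows how a % reduction in EDP is computed. The energy and time values are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the talk.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: energy multiplied by execution time."""
    return energy_joules * delay_seconds


def percent_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical numbers for illustration only.
baseline_edp = edp(energy_joules=10.0, delay_seconds=2.0)   # 20.0
improved_edp = edp(energy_joules=10.0, delay_seconds=1.6)   # 16.0
reduction = percent_reduction(baseline_edp, improved_edp)
```

Because both factors enter the product, a scheduler that shortens execution time lowers EDP as long as any added energy does not outweigh the saved time.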

slide-38
SLIDE 38

More in the Paper

  • Sensitivity studies

– Varying number of total cores
– Scheduling quantum and # zero scan intervals
– Heap size

  • GC-Criticality using OpenJDK’s collector

slide-39
SLIDE 39

Conclusions

  • Concurrent garbage collection benefits from out-of-order execution
  • Java applications that allocate rapidly exhibit GC-Criticality
  • GC-Criticality-Aware scheduler adjusts big core cycles given to GC on a heterogeneous multicore

– Uses information provided by the JVM
– Improves both performance and energy efficiency

slide-40
SLIDE 40

Thank You!

Shoaib.Akram@UGent.be
http://users.elis.ugent.be/~sakram

slide-41
SLIDE 41

GC Criticality with OpenJDK’s CMS

[Figure: % increase in execution time]

slide-42
SLIDE 42

Triggering Concurrent GC Every 32 MB of Allocation

[Figure: % reduction in energy-delay product]

slide-43
SLIDE 43

GC-Criticality-Aware Scheduler

gc-boost:P0 to gc-boost:P1

[Timeline figure: app and gc threads over time; phases App alone, Concurrent, Stop, Scan; scheduler state moves from gc-boost:P0 to gc-boost:P1]

  • JVM signals the scheduler
  • gc-boost:P1 gives GC two quanta on big

slide-44
SLIDE 44

GC-Criticality-Aware Scheduler

gc-boost:P1 to gc-boost:P0

[Timeline figure: app and gc threads over time; phases App alone, Concurrent, Stop, App alone; scheduler state moves from gc-boost:P1 to gc-boost:P0]

  • JVM signals the scheduler
  • Degrade boost state if no stop scan pause

slide-45
SLIDE 45

Energy Efficiency of GC-Criticality-Aware Scheduler

3 big plus one LITTLE core

[Figure: % reduction in energy-delay product]

slide-46
SLIDE 46

Energy Efficiency of GC-Criticality-Aware Scheduler

3 big plus one LITTLE core

[Figure: % reduction in energy-delay product for GC-Uncritical benchmarks]

  • Negligible change in EDP for GC-Uncritical