Transactional Garbage and how to collect it for fun and profit

SLIDE 1

Transactional Garbage

and how to collect it for fun and profit

Fadi Meawad, Ryan Macnak, Jan Vitek

S3Lab, Computer Science Dept., Purdue University

SLIDE 2

[Figure: C# STMBench7 on Bartok. % of time spent in GC vs. # of threads.]

8-core, 1.60GHz Intel Xeon E5310. 8GB RAM. Physical Address Extension enabled, running Windows Server 2003 SP2.

SLIDE 3

Up to 98%

[Figure: C# STMBench7 on Bartok. % of time spent in GC vs. # of threads.]

8-core, 1.60GHz Intel Xeon E5310. 8GB RAM. Physical Address Extension enabled, running Windows Server 2003 SP2.

SLIDE 4

What does this have to do with Transactional Memory?

SLIDE 5

Let’s Benchmark

GCBench

  • New micro-benchmark, creates a linked list of lists
  • A transaction traverses the lists and, at every node, does one of:
      • Update node
      • Allocate unreachable or live object
      • Allocate object unreachable after commit
      • Make object unreachable
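
The GCBench workload described above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the authors' benchmark: there is no STM here, the "transaction" is an ordinary function, and the names `Node`, `build_list_of_lists`, `run_transaction` and the even split between operations are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    value: int = 0
    payload: Optional[object] = None  # object kept live by this node, if any
    next: Optional["Node"] = None


def build_list_of_lists(outer: int, inner: int) -> List[Optional[Node]]:
    """Build `outer` singly linked lists of `inner` nodes each."""
    heads: List[Optional[Node]] = []
    for _ in range(outer):
        head: Optional[Node] = None
        for i in range(inner):
            head = Node(value=i, next=head)
        heads.append(head)
    return heads


# Hypothetical names for the four per-node operations listed on the slide.
OPS = ("update", "alloc", "alloc_tx_local", "drop")


def run_transaction(heads, rng: random.Random) -> Dict[str, int]:
    """Traverse every list and apply one randomly chosen operation per node."""
    counts = {op: 0 for op in OPS}
    tx_local = []  # reachable only until this "transaction" returns
    for head in heads:
        node = head
        while node is not None:
            op = rng.choice(OPS)
            if op == "update":
                node.value += 1
            elif op == "alloc":
                obj = object()
                if rng.random() < 0.5:
                    node.payload = obj  # live: reachable from the list
                # otherwise obj is unreachable once this iteration ends
            elif op == "alloc_tx_local":
                tx_local.append(object())  # unreachable after "commit"
            else:  # "drop": make a previously live object unreachable
                node.payload = None
            counts[op] += 1
            node = node.next
    return counts
```

Calling `run_transaction(build_list_of_lists(3, 4), random.Random(0))` visits 12 nodes and returns how often each operation was chosen; a real STM run would wrap the traversal in an atomic block.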

Wormbench

  • C#, designed for the Bartok STM
  • A “worm” with a triangular head and a tail lives in a matrix with other worms

  • 15 ops, such as move forward or turn right

SLIDE 6

Let’s Benchmark

STMBench7

  • Trees, graphs and indices as in CAD/CAM workloads
  • 500MB of data
  • Configurable as a read-dominated, read/write or write-dominated workload
  • Long or short traversals

LeeTM

  • Automatic circuit routing using Lee’s algorithm
  • Pairs of points on a grid connected with non-intersecting paths

SLIDE 7

[Figure: C# GCBench. Slowdown vs. list size.]

8-core, 1.60GHz Intel Xeon E5310. 8GB RAM. Physical Address Extension enabled, running Windows Server 2003 SP2.

SLIDE 8

Where did that time go?

C# GCBench, % of time spent in GC (rows: # of threads, columns: list size)

  threads    100   200   300   400   500   600   700   800
     1        0%    9%   23%   36%   48%   54%   61%   67%
     2       12%   35%   56%   69%   74%   82%   85%   88%
     3       25%   53%   72%   79%   87%   91%   93%   95%
     4       39%   66%   81%   88%   92%   95%   96%   97%
     5       40%   75%   87%   92%   95%   96%   97%   98%
     6       49%   81%   90%   95%   97%   98%   98%   99%
     7       60%   82%   93%   96%   97%   98%   99%   99%
     8       64%   85%   94%   97%   98%   99%   99%   99%

SLIDE 9

Maybe it is just in Bartok!

SLIDE 10

What is Bartok STM?

Optimizing ahead-of-time research compiler & runtime

STM

  • Object-based, in-place, optimistic updates
  • Read-enlistment, update-enlistment & undo-value logs
  • Logs allocated from the normal heap
  • Transaction ID is stored in the object's header

GC

  • Two-generation semispace copying collector
  • Stop the world

[Harris, Plesko, Shinnar, Tarditi, Optimizing memory transactions, PLDI 06]

SLIDE 11

GC % using Multiverse

[Table: Java GCBench size 800, % of time spent in GC; cell values garbled in extraction.]

SLIDE 13

Does it depend on the GC?

[Figure: GCBench, size 600. Execution time (seconds) vs. # of threads.]

SLIDE 14

Does the problem scale?

[Figure: GCBench size 800, Azul. Execution time (seconds) vs. # of threads.]

Azul Vega 3 3310B, two 54-core processors. 48GB of RAM. Azul VM, concurrent Pauseless GC.

SLIDE 15

Does the problem scale?

[Figure: GCBench size 800, Azul. Memory usage (GB) vs. # of threads.]

SLIDE 16

What can we do about it?

SLIDE 17

Logs in Bartok

Object-based in-place updates with undo logs

Reads

  • The read-object log records the STM word (version #) of each object read

Updates

  • An object opened for update has an updated-object log entry (previous STM word)
  • Upon update, the old value is maintained in an undo log

Chunks

  • Logs are allocated in chunks maintained by the STM
  • Discarded at end of transaction

[Diagram: a transaction manager holding an offset into a log chunk; a List object (Head, Tail, Sum = 42); a Node object (Value = 10, Next = null); an updated-object log entry.]

[Harris, Plesko, Shinnar, Tarditi, Optimizing memory transactions, PLDI 06]
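
The three logs can be illustrated with a small single-threaded sketch. It is not Bartok's implementation: ownership of the STM word, chunked log storage and concurrency are all left out, and `TxObject`, `Transaction` and their methods are hypothetical names chosen for the example. It only shows in-place updates, the undo log restoring old values on abort, and read validation against the version number at commit.

```python
class TxObject:
    """A heap object with an STM word (here just a version number)."""

    def __init__(self, **fields):
        self.stm_word = 0
        self.fields = dict(fields)


class Transaction:
    def __init__(self):
        self._reset_logs()

    def _reset_logs(self):
        # All logs are dropped wholesale at the end of the transaction.
        self.read_log = []    # (object, STM word seen at read time)
        self.update_log = []  # (object, previous STM word)
        self.undo_log = []    # (object, field name, overwritten value)

    def read(self, obj, name):
        self.read_log.append((obj, obj.stm_word))
        return obj.fields[name]

    def write(self, obj, name, value):
        if not any(o is obj for o, _ in self.update_log):
            self.update_log.append((obj, obj.stm_word))  # open for update
        self.undo_log.append((obj, name, obj.fields[name]))
        obj.fields[name] = value  # in-place update

    def abort(self):
        # Roll back in reverse order so the oldest value wins.
        for obj, name, old in reversed(self.undo_log):
            obj.fields[name] = old
        self._reset_logs()

    def commit(self):
        # Validate: every object read must still carry the version we saw.
        if any(obj.stm_word != seen for obj, seen in self.read_log):
            self.abort()
            return False
        for obj, previous in self.update_log:
            obj.stm_word = previous + 1  # publish a new version
        self._reset_logs()
        return True
```

Every `read` and `write` appends a log entry that becomes garbage the moment the transaction ends, which is the allocation pattern the following slides try to take off the collector.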

SLIDE 18

Log Reuse

  • At transaction end, preserve log chunks in a pool rather than leave them for the GC
  • When a transaction needs a new log chunk, try the pool first, otherwise allocate

Issues

  • Hard to decide when to deallocate log chunks
  • Large initializing transactions followed by small ones will result in a large unused pool
  • The pool will be traced by the GC
  • Weak references are expensive

What is log reuse, and is it enough?
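
A minimal sketch of log reuse, assuming fixed-capacity chunks and a lock-protected pool. `LogChunk`, `LogChunkPool` and in particular the `max_pooled` cap are assumptions for illustration: the slide names the problem of deciding when to deallocate chunks but does not prescribe a policy, and a fixed cap is only one possible answer.

```python
import threading


class LogChunk:
    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.used = 0

    def reset(self):
        # Clear references so stale log entries do not keep objects alive.
        for i in range(self.used):
            self.entries[i] = None
        self.used = 0


class LogChunkPool:
    def __init__(self, chunk_capacity=256, max_pooled=64):
        self.chunk_capacity = chunk_capacity
        self.max_pooled = max_pooled  # hypothetical cap on idle chunks
        self.fresh_allocations = 0
        self._free = []
        self._lock = threading.Lock()

    def acquire(self):
        """Try the pool first; otherwise allocate a new chunk."""
        with self._lock:
            if self._free:
                return self._free.pop()
            self.fresh_allocations += 1
        return LogChunk(self.chunk_capacity)

    def release(self, chunk):
        """Called at transaction end instead of dropping the chunk."""
        chunk.reset()
        with self._lock:
            if len(self._free) < self.max_pooled:
                self._free.append(chunk)
            # else: drop the chunk and let the GC reclaim it

    def pooled(self):
        with self._lock:
            return len(self._free)
```

The pooled chunks are still ordinary heap objects, so a tracing collector visits them on every cycle, which is the third issue listed above.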

SLIDE 19

Dedicated Nurseries

Generational-GC nurseries

  • Most objects either die young or live forever

Transaction nurseries

  • Objects allocated in a transaction are not visible to other threads until commit
  • Reclaim nursery in one step after abort
  • Can support nested transactions
  • Finalization?
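
The one-step reclamation and the nesting can be sketched with a bump allocator that remembers its allocation pointer at each transaction start. This is an illustration under assumptions, not the design from the talk: `TransactionNursery` and its methods are hypothetical, memory is modelled as offsets into a `bytearray`, and what an outermost commit does with surviving objects is not described on the slide and is left out.

```python
class TransactionNursery:
    """Per-thread bump allocator whose contents belong to open transactions."""

    def __init__(self, size):
        self.space = bytearray(size)
        self.top = 0      # next free offset
        self._marks = []  # value of `top` at each (nested) transaction start

    def begin(self):
        self._marks.append(self.top)

    def alloc(self, nbytes):
        if nbytes <= 0:
            raise ValueError("allocation size must be positive")
        if self.top + nbytes > len(self.space):
            raise MemoryError("nursery full")
        offset = self.top
        self.top += nbytes  # bump allocation
        return offset

    def abort(self):
        # Everything allocated since the matching begin() is reclaimed
        # in one step by resetting the allocation pointer.
        self.top = self._marks.pop()

    def commit(self):
        # A nested commit keeps its allocations for the enclosing transaction.
        self._marks.pop()
```

Aborting an inner transaction resets `top` only to the inner mark, so allocations made by the enclosing transaction survive.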

SLIDE 20

Dedicated Nurseries

SLIDE 21

Dedicated Heap

  • Much of transactional allocation is logs
  • Lifetime of logs bounded by the transaction they serve
  • Known lifetime allows manual memory management
  • Cheap to allocate/free in chunks from a mutex-protected free list
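
A minimal sketch of a dedicated heap, assuming fixed-size chunks carved out of one preallocated arena and handed out through a mutex-protected free list. The class and method names are hypothetical, and Python's `bytearray` merely stands in for memory that a real runtime would keep outside the garbage-collected heap.

```python
import threading


class DedicatedHeap:
    def __init__(self, chunk_size, chunk_count):
        self.chunk_size = chunk_size
        self.arena = bytearray(chunk_size * chunk_count)  # allocated once
        self._free = list(range(chunk_count))             # free chunk indices
        self._lock = threading.Lock()

    def alloc_chunk(self):
        """Pop a chunk index from the free list."""
        with self._lock:
            if not self._free:
                raise MemoryError("dedicated heap exhausted")
            return self._free.pop()

    def free_chunk(self, index):
        """Return a chunk; called when the owning transaction ends."""
        with self._lock:
            self._free.append(index)

    def chunk(self, index):
        """Writable view of one chunk's bytes."""
        start = index * self.chunk_size
        return memoryview(self.arena)[start:start + self.chunk_size]

    def free_count(self):
        with self._lock:
            return len(self._free)
```

Because a log's lifetime is bounded by its transaction, the transaction can simply remember the indices it took and call `free_chunk` on each at commit or abort, with no tracing involved.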

SLIDE 22

Dedicated Heap

SLIDE 23

Does it work?

[Figure: C# GCBench. % total speedup vs. list size.]

8-core, 1.60GHz Intel Xeon E5310. 8GB RAM. Physical Address Extension enabled, running Windows Server 2003 SP2.

SLIDE 24

Does it work?

[Figure: C# STMBench7. Ops per second vs. # of threads.]

SLIDE 25

Does it work?

[Figure: C# STMBench7. % increase in ops per second vs. # of threads.]

SLIDE 26

LeeTM, issues

Issues

  • Allocates a temp grid within the transaction
  • Bartok STM does not optimize multidimensional array access (excessive logging)

Workarounds

  • Allocate before the transaction (Opt)
  • Use RowMajor array access (RM)
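
Both workarounds can be sketched together: a grid stored as one flat array addressed in row-major order, allocated once before any routing transaction and cleared for reuse. `RowMajorGrid` is a hypothetical helper, not LeeTM code, and the sketch says nothing about how much logging Bartok saves; it only shows the index arithmetic and the allocate-outside pattern.

```python
class RowMajorGrid:
    """2-D grid backed by a single flat list, indexed as row * cols + col."""

    def __init__(self, rows, cols, fill=0):
        self.rows = rows
        self.cols = cols
        self.fill = fill
        self.cells = [fill] * (rows * cols)

    def index(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell out of range")
        return row * self.cols + col  # row-major offset

    def get(self, row, col):
        return self.cells[self.index(row, col)]

    def set(self, row, col, value):
        self.cells[self.index(row, col)] = value

    def clear(self):
        # Reuse the same storage instead of allocating a new temp grid.
        for i in range(len(self.cells)):
            self.cells[i] = self.fill


# Allocated once, before any transaction runs (the "Opt" workaround).
scratch = RowMajorGrid(rows=3, cols=4)
```

Each routing attempt would call `scratch.clear()` at its start rather than constructing a fresh grid inside the transaction.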

SLIDE 27

Results (cont’d)

[Figure: C# LeeTM. Total time (seconds) vs. # of threads.]

SLIDE 28

Results (cont’d)

[Figure: C# LeeTM. % of time saved vs. # of threads.]

8-core, 1.60GHz Intel Xeon E5310. 8GB RAM. Physical Address Extension enabled, running Windows Server 2003 SP2.

SLIDE 29

Conclusion

Memory Usage

  • Same overall allocated memory
  • Fewer demands on the GCed heap

Speed Up

Applying to other systems

  • Not for library-based STM systems, but with runtime support will work with most STM flavors
