 
Transactional Garbage and how to collect it for fun and profit
Fadi Meawad, Ryan Macnak, Jan Vitek
S3Lab, Computer Science Dept, Purdue University
[Figure: C# STMBench7 on Bartok; % of time spent in GC vs. # of threads, reaching up to 98%. 8-core, 1.60GHz Intel Xeon E5310, 8GB RAM, Physical Address Extension enabled, running Windows Server 2003 SP2.]
What does this have to do with Transactional Memory?
Let’s Benchmark

GCBench
‣ New micro-benchmark, creates a linked list of lists
‣ A transaction traverses lists and, at every node, does one of:
  • Update node
  • Allocate unreachable or live object
  • Allocate object unreachable after commit
  • Make object unreachable

Wormbench
‣ C#, designed for the Bartok STM
‣ A “worm” with a triangular head and a tail lives in a matrix with other worms
‣ 15 ops, such as move forward or turn right
Let’s Benchmark

STMBench7
‣ Trees, graphs and indices as in CAD/CAM workloads
‣ 500MB of data
‣ Configure either r-, r/w- or w-dominated workload
‣ Long or short traversals

LeeTM
‣ Automatic circuit routing using Lee’s algorithm
‣ Pairs of points on a grid connected with non-intersecting paths
[Figure: C# GCBench; slowdown vs. list size. 8-core, 1.60GHz Intel Xeon E5310, 8GB RAM, Physical Address Extension enabled, Windows Server 2003 SP2.]
Where did that time go?
C# GCBench, % of time spent in GC

                 list size
# of threads    100   200   300   400   500   600   700   800
     1           0%    9%   23%   36%   48%   54%   61%   67%
     2          12%   35%   56%   69%   74%   82%   85%   88%
     3          25%   53%   72%   79%   87%   91%   93%   95%
     4          39%   66%   81%   88%   92%   95%   96%   97%
     5          40%   75%   87%   92%   95%   96%   97%   98%
     6          49%   81%   90%   95%   97%   98%   98%   99%
     7          60%   82%   93%   96%   97%   98%   99%   99%
     8          64%   85%   94%   97%   98%   99%   99%   99%
Maybe it is just in Bartok!
What is Bartok STM?
Optimizing ahead-of-time research compiler & runtime

STM
‣ Object-based, in-place, optimistic updates
‣ Read-enlistment, update-enlistment & undo-value logs
  • Allocated from normal heap
‣ Transaction ID is stored in object's header

GC
‣ 2-generational semispace copying collector
‣ Stop the world

[Harris, Plesko, Shinnar, Tarditi, Optimizing memory transactions, PLDI 06]
GC % using Multiverse
[Table: Java GCBench, size 800; % of time spent in GC for 1 to 8 threads]
Does it depend on the GC?
[Figure: GCBench, size 600; execution time (seconds) vs. # of threads]
Does the problem scale?
[Figure: GCBench size 800, Azul; execution time (seconds) vs. # of threads. Azul Vega 3 3310B, two 54-core processors, 48GB of RAM, Azul VM, Concurrent Pauseless GC.]
Does the problem scale?
[Figure: GCBench size 800, Azul; memory usage (GB) vs. # of threads]
What can we do about it?
Logs in Bartok
Object-based in-place updates with undo logs
‣ The read-object log contains the read STM word (version #)
‣ An object opened for update has an updated-object log entry (previous STM word)
‣ Upon update, the old value is maintained in an undo log
‣ Logs are allocated in chunks maintained by the STM
‣ Discarded at end of transaction
[Figure labels: Reads, Updates, List, Node1, List VTable, Node VTable, Head, Tail, Sum = 42, Value = 10, Next = null, STM word versions v100 and v90, updated-object log entry, transaction manager, offset in log chunk, chunks]
[Harris, Plesko, Shinnar, Tarditi, Optimizing memory transactions, PLDI 06]
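The in-place update scheme described on this slide can be illustrated with a minimal Python sketch. It is not Bartok's implementation: the class and method names (`UndoLogTx`, `write`, `commit`, `abort`) are hypothetical, and the read log, version numbers, and conflict detection are left out entirely. It only shows why the undo log exists and why it becomes garbage once the transaction ends.

```python
class UndoLogTx:
    """Toy transaction with in-place updates and an undo log (hypothetical names)."""

    def __init__(self):
        self.undo_log = []  # entries of (object, field name, old value)

    def write(self, obj, field, value):
        # Save the old value first, then update the object in place.
        self.undo_log.append((obj, field, getattr(obj, field)))
        setattr(obj, field, value)

    def commit(self):
        # Updates are already in place, so the log is simply discarded.
        self.undo_log.clear()

    def abort(self):
        # Undo in reverse order so the oldest saved value is restored last.
        while self.undo_log:
            obj, field, old = self.undo_log.pop()
            setattr(obj, field, old)


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


node = Node(10)

tx = UndoLogTx()
tx.write(node, "value", 11)
tx.write(node, "value", 12)
tx.abort()
after_abort = node.value    # back to the original value

tx = UndoLogTx()
tx.write(node, "value", 42)
tx.commit()
after_commit = node.value   # the in-place update is kept
```

Every write produces a log entry whose useful life ends at commit or abort, which is the source of the transactional garbage the talk is about.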
What is log reuse, and is it enough?

Log Reuse
‣ At transaction end, preserve log chunks in a pool rather than leaving them for the GC
‣ When a transaction needs a new log chunk, try the pool first, otherwise allocate

Issues
‣ Hard to decide when to deallocate log chunks
‣ Large initializing Txs followed by small ones will result in a large unused pool
‣ The pool will be traced by the GC
‣ Weak references are expensive
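A small Python sketch may help make the pooling idea and its second issue concrete. The names (`LogChunkPool`, `acquire`, `release_all`, `run_tx`) and the chunk size are assumptions made for illustration; the slides do not show how the pool is implemented.

```python
class LogChunkPool:
    """Toy pool of reusable log chunks (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.free = []       # chunks preserved from finished transactions
        self.allocated = 0   # number of chunks freshly allocated so far

    def acquire(self):
        # Try the pool first, otherwise allocate a new chunk.
        if self.free:
            return self.free.pop()
        self.allocated += 1
        return [None] * self.chunk_size

    def release_all(self, chunks):
        # At transaction end, keep the chunks instead of leaving them for the GC.
        self.free.extend(chunks)


def run_tx(pool, entries):
    """Simulate a transaction that needs room for `entries` log entries."""
    n_chunks = -(-entries // pool.chunk_size)  # ceiling division
    chunks = [pool.acquire() for _ in range(n_chunks)]
    pool.release_all(chunks)
    return n_chunks


pool = LogChunkPool(chunk_size=4)
run_tx(pool, 400)        # large initializing transaction
for _ in range(10):
    run_tx(pool, 4)      # small transactions, one chunk each

fresh = pool.allocated   # no new chunks were needed after the first transaction
idle = len(pool.free)    # yet the whole pool stays alive between transactions
```

After the large transaction, the small ones reuse a single chunk while the rest of the pool sits idle, still reachable and therefore still traced by the GC.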
Dedicated Nurseries

Generational-GC nurseries
‣ Most objects either die young or live forever

Transaction nurseries
‣ Objects allocated in transaction not visible to other threads until commit
‣ Reclaim nursery in one step after abort
‣ Can support nested transactions
‣ Finalization?
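As a rough illustration of a transaction nursery, the following Python sketch uses a bump pointer over a fixed set of slots. All names (`TxNursery`, `alloc`, `abort`, `commit`) are hypothetical, and the commit behavior shown here (moving survivors to a shared list chosen by an `is_live` callback) is an assumption; the slides only state that the nursery is private until commit and reclaimed in one step after abort.

```python
class TxNursery:
    """Toy bump-pointer nursery private to one transaction (hypothetical)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.top = 0  # bump pointer: index of the next free slot

    def alloc(self, obj):
        if self.top == len(self.slots):
            raise MemoryError("nursery full")
        self.slots[self.top] = obj
        self.top += 1
        return obj

    def abort(self):
        # Nothing allocated here was visible to other threads,
        # so the whole nursery is reclaimed by resetting the pointer.
        self.top = 0

    def commit(self, shared_heap, is_live):
        # Assumed policy: survivors become visible, the rest is dropped.
        for obj in self.slots[:self.top]:
            if is_live(obj):
                shared_heap.append(obj)
        self.top = 0


shared = []
nursery = TxNursery(capacity=8)

nursery.alloc("a")
nursery.alloc("b")
nursery.abort()                       # one-step reclamation, nothing escapes
after_abort = (nursery.top, list(shared))

nursery.alloc("c")
nursery.alloc("tmp")
nursery.commit(shared, lambda o: o != "tmp")
after_commit = list(shared)
```

The abort path never inspects individual objects, which is what makes reclamation a single step.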
[Figure: Dedicated Nurseries]
Dedicated Heap
‣ Much of transactional allocation is logs
‣ Lifetime of logs bounded by the transaction they serve
‣ Known lifetime allows manual memory management
‣ Cheap to allocate/free in chunks from a mutexed freelist
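The mutexed freelist mentioned in the last bullet could look roughly like the Python sketch below. The names (`DedicatedLogHeap`, `alloc_chunk`, `free_chunks`, `transaction`) and the sizes are assumptions; Python objects are of course still garbage collected, so this only models the allocation discipline, not memory that truly lives outside the GC heap.

```python
import threading


class DedicatedLogHeap:
    """Toy manually managed chunk heap guarded by a mutex (hypothetical)."""

    def __init__(self, n_chunks, chunk_size):
        self._lock = threading.Lock()
        self._free = [bytearray(chunk_size) for _ in range(n_chunks)]

    def alloc_chunk(self):
        with self._lock:
            if not self._free:
                raise MemoryError("dedicated heap exhausted")
            return self._free.pop()

    def free_chunks(self, chunks):
        with self._lock:
            self._free.extend(chunks)

    def available(self):
        with self._lock:
            return len(self._free)


def transaction(heap, n_chunks):
    # Log chunks live exactly as long as the transaction they serve.
    chunks = [heap.alloc_chunk() for _ in range(n_chunks)]
    try:
        pass  # the transaction body would write log entries into the chunks
    finally:
        heap.free_chunks(chunks)  # explicit free at transaction end


heap = DedicatedLogHeap(n_chunks=16, chunk_size=64)
threads = [threading.Thread(target=transaction, args=(heap, 2)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

remaining = heap.available()  # every chunk is back on the freelist
```

Because the lifetime of each chunk is known, no tracing is needed to decide when it can be returned.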
[Figure: Dedicated Heap]
Does it work?
[Figure: C# GCBench; % total speedup vs. list size. 8-core, 1.60GHz Intel Xeon E5310, 8GB RAM, Physical Address Extension enabled, Windows Server 2003 SP2.]
Does it work?
[Figure: C# STMBench7; ops per second vs. # of threads]
Does it work?
[Figure: C# STMBench7; % increase in ops per second vs. # of threads]
LeeTM, issues

Issues
‣ Allocates a temp grid within the transaction
‣ Bartok STM does not optimize multidimensional array access (excessive logging)

Workarounds
‣ Allocate before the transaction (Opt)
‣ Use RowMajor array access (RM)
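The slides do not show how the RM workaround is written. One common reading of "row-major array access" is to store the grid in a single one-dimensional array and compute the index by hand, as in this Python sketch; `RowMajorGrid` and its methods are hypothetical names, and LeeTM itself is written in C#.

```python
class RowMajorGrid:
    """Toy 2D grid stored in one flat array in row-major order (hypothetical)."""

    def __init__(self, rows, cols, fill=0):
        self.rows = rows
        self.cols = cols
        self.cells = [fill] * (rows * cols)

    def index(self, r, c):
        # Row-major layout: all cells of row 0 first, then row 1, and so on.
        return r * self.cols + c

    def get(self, r, c):
        return self.cells[self.index(r, c)]

    def set(self, r, c, value):
        self.cells[self.index(r, c)] = value


# Allocated once, outside any transaction, in the spirit of the Opt workaround.
grid = RowMajorGrid(rows=3, cols=4)
grid.set(1, 2, 7)
flat_position = grid.index(1, 2)
value = grid.get(1, 2)
```

With a flat array, each cell access touches a single array object, which is the kind of access an STM can log without special support for multidimensional arrays.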
Results (cont’d)
[Figure: C# LeeTM; total time (seconds) vs. # of threads]
Results (cont’d)
[Figure: C# LeeTM; % of time saved vs. # of threads. Running on an 8-core, 1.60GHz Intel Xeon E5310 with 8GB of RAM and Physical Address Extension enabled, running Windows Server 2003 SP2.]
Conclusion

Memory Usage
‣ Same overall allocated memory
‣ Fewer demands on the GCed heap

Speed Up

Applying to other systems
‣ Not for library-based STM systems, but with runtime support it will work with most STM flavors