
SLIDE 1

Return Value Prediction in a Java Virtual Machine

Christopher J.F. Pickett and Clark Verbrugge
{cpicke,clump}@sable.mcgill.ca
School of Computer Science, McGill University
Montréal, Québec, Canada H3A 2A7

SLIDE 2

Overview

Introduction and Related Work
Contributions
Design
Benchmark Properties
Size Variation
Memory Usage
Hybrid Performance
Conclusions and Future Work

VPW2 – 1/39

SLIDE 3

Introduction and Related Work

Speculative method-level parallelism (SMLP) allows for dynamic parallelisation of single-threaded programs
  speculative threads are forked at callsites
  suitable for Java virtual machines
Perfect return value prediction can double performance of SMLP (Hu et al., 2003)
Goals
  Implement Hu’s predictors in SableVM
  Achieve higher accuracy

SLIDE 4

Speculative Method-Level Parallelism

// execute foo non-speculatively
r = foo(a, b, c);

// execute past return point
// speculatively in parallel with foo()
if (r > 10) {
    s = o1.f;   // buffer heap reads
    o2.f = r;   // buffer heap writes
}
...
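
The fork, speculate, and verify pattern on this slide can be sketched with ordinary Java threads. This is only an illustration under stated assumptions, not how SableVM implements SMLP: `foo`, `continuation`, and `callWithPrediction` are hypothetical names, the callee is pushed onto a pool thread so the calling thread can play the speculative role, and the continuation is side-effect free, so no heap reads or writes need buffering.

```java
import java.util.concurrent.CompletableFuture;

final class SmlpSketch {
    // Hypothetical callee standing in for foo(a, b, c) on the slide.
    static int foo(int a, int b, int c) {
        return a * b + c;
    }

    // Hypothetical code past the return point; it depends only on r.
    static int continuation(int r) {
        return (r > 10) ? r * 2 : 0;
    }

    // Runs foo and the continuation concurrently, using a predicted return value.
    static int callWithPrediction(int a, int b, int c, int predicted) {
        CompletableFuture<Integer> call =
                CompletableFuture.supplyAsync(() -> foo(a, b, c));
        int speculative = continuation(predicted); // overlaps with foo()
        int r = call.join();                       // actual return value
        // Commit the speculative result only if the prediction was correct;
        // otherwise discard it and re-execute with the real value.
        return (r == predicted) ? speculative : continuation(r);
    }

    public static void main(String[] args) {
        System.out.println(callWithPrediction(2, 5, 3, 13)); // correct prediction
        System.out.println(callWithPrediction(2, 5, 3, 0));  // misprediction, re-executed
    }
}
```

Both calls return the same result; the difference is that a misprediction pays for the continuation twice, which is why prediction accuracy matters for the speedup.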


SLIDE 5

Impact of Return Value Prediction

RVP strategy   return value   SMLP speedup
none           arbitrary      1.52
best           predicted      1.92
perfect        correct        2.76

26% speedup over no RVP with Hu’s best predictor
82% speedup over no RVP with perfect prediction
Improved hybrid accuracy is highly desirable

S. Hu, R. Bhargava, and L. K. John. The role of return value prediction in exploiting speculative method-level parallelism. Journal of Instruction-Level Parallelism, 5:1–21, Nov. 2003.
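
As a check on the two percentages quoted on this slide, dividing each speedup by the no-RVP baseline of 1.52 gives:

```latex
\frac{1.92}{1.52} \approx 1.26 \quad (\text{26\% over no RVP}), \qquad
\frac{2.76}{1.52} \approx 1.82 \quad (\text{82\% over no RVP})
```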


SLIDE 6

Contributions (1)

Expand on previous data collected
  Use S100 instead of S1 (size 1) for SPEC JVM98
  Report all return types, not just boolean, int, ref
  Explicitly account for exceptions (run jack)
  Predict all method calls (no inlining)
Implement existing predictors in JVM
  last value, stride, 2-delta stride
  parameter stride
  finite context method (FCM)
  hybrid

SLIDE 7

Contributions (2)

New memoization predictor
  Table-based, like context predictor
  Hashes together method arguments
  Performs well in a hybrid
Explore predictor performance limits
  Allocate storage until accuracy no longer improves
Reduce memory requirements
  Dynamically expand hashtables
  Exploit VM info about value widths

SLIDE 8

Design

Implement all predictors in software JVM
  not trace-based
  not simulated
Fixed size predictors:
  Last value – last callsite return value
  Stride – prediction = r1 + (r1 - r2)
  2-delta stride – update after two identical strides
  Parameter stride – search for and capture stride between r and one parameter
Focus on predictors with variable size
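
The first three fixed-size predictors can be sketched in a few lines of Java. This is a minimal illustration, not SableVM's code: the class and method names are hypothetical, return values are modelled as `long`, and the 2-delta rule is read as "adopt a new stride only once it has been observed twice in a row". Parameter stride is omitted.

```java
final class FixedSizePredictors {
    /** Last value: predict the previous return value seen at this callsite. */
    static final class LastValue {
        private long last;
        long predict() { return last; }
        void update(long actual) { last = actual; }
    }

    /** Stride: prediction = r1 + (r1 - r2), with r1 the most recent value. */
    static final class Stride {
        private long r1, r2;
        long predict() { return r1 + (r1 - r2); }
        void update(long actual) { r2 = r1; r1 = actual; }
    }

    /** 2-delta stride: the predicting stride changes only after the same
     *  new stride has been observed on two consecutive updates. */
    static final class TwoDeltaStride {
        private long last, stride, candidate;
        long predict() { return last + stride; }
        void update(long actual) {
            long observed = actual - last;
            if (observed == candidate) {
                stride = observed;   // seen twice in a row: adopt it
            }
            candidate = observed;
            last = actual;
        }
    }

    public static void main(String[] args) {
        Stride s = new Stride();
        TwoDeltaStride d = new TwoDeltaStride();
        long[] returns = {0, 2, 4, 6, 100};   // one outlier at the end
        for (long r : returns) {
            s.update(r);
            d.update(r);
        }
        System.out.println("stride predicts " + s.predict());
        System.out.println("2-delta predicts " + d.predict());
    }
}
```

After the outlier, the plain stride predictor extrapolates the jump, while the 2-delta variant keeps the established stride of 2 until a different stride repeats.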


SLIDE 9

Design

Variable-memory table-based predictors:
  Context – inputs are return value history
  Memoization – inputs are method parameters
Hash values together, use extra bits as tag
Rehash on tag collisions
Use direct addressing, not chaining
Use Jenkins’ fast hash to get even distribution
Attach context and memoization tables per callsite
Expand tables if load > 75%, up to a fixed maximum
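
A per-callsite memoization table along these lines might look like the sketch below. Several choices are assumptions made for illustration and are not taken from the slides: `java.util.Arrays.hashCode` stands in for Jenkins' hash, the full 32-bit hash is stored so that the bits above the index act as the tag, a colliding entry is simply overwritten, and the class name `MemoizationTable` is hypothetical. A context predictor would use the same table keyed on recent return values instead of arguments.

```java
import java.util.Arrays;
import java.util.OptionalLong;

final class MemoizationTable {
    private final int maxBits;
    private int bits;          // table holds 2^bits entries
    private int[] hashes;      // full hash; bits above the index serve as tag
    private long[] values;     // remembered return values
    private boolean[] used;
    private int occupied;

    MemoizationTable(int initialBits, int maxBits) {
        this.bits = initialBits;
        this.maxBits = maxBits;
        allocate();
    }

    private void allocate() {
        hashes = new int[1 << bits];
        values = new long[1 << bits];
        used = new boolean[1 << bits];
        occupied = 0;
    }

    // Direct addressing: the low 'bits' bits of the hash select the slot.
    private int index(int hash) {
        return hash & ((1 << bits) - 1);
    }

    /** Predicts a return value for these arguments, if the tag matches. */
    OptionalLong predict(long... args) {
        int h = Arrays.hashCode(args);
        int i = index(h);
        return (used[i] && hashes[i] == h)
                ? OptionalLong.of(values[i]) : OptionalLong.empty();
    }

    /** Records the actual return value; grows the table above 75% load. */
    void update(long returnValue, long... args) {
        put(Arrays.hashCode(args), returnValue);
        if (4L * occupied > 3L * (1 << bits) && bits < maxBits) {
            expand();
        }
    }

    private void put(int h, long v) {
        int i = index(h);
        if (!used[i]) { used[i] = true; occupied++; }
        hashes[i] = h;   // no chaining: a colliding entry is overwritten
        values[i] = v;
    }

    private void expand() {
        int[] oldHashes = hashes;
        long[] oldValues = values;
        boolean[] oldUsed = used;
        bits++;
        allocate();
        for (int i = 0; i < oldHashes.length; i++) {
            if (oldUsed[i]) put(oldHashes[i], oldValues[i]);
        }
    }

    int bits() { return bits; }

    public static void main(String[] args) {
        MemoizationTable t = new MemoizationTable(2, 4);
        for (long k = 0; k < 4; k++) t.update(k * 10, k);
        System.out.println(t.bits() + " bits, predict(2) = " + t.predict(2L));
    }
}
```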


SLIDE 10

Design

Hybrid predictor
  Chooses best sub-predictor over last 32 values
  LS2PC – all previous predictors
  LS2PCM – all previous predictors + memoization
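
One plausible way to pick the best sub-predictor over the last 32 values is a 32-bit hit history per sub-predictor, as sketched below. The slides do not say how the selection is implemented, so the shift-register history, the tie-break in favour of the earliest sub-predictor, and all names here are assumptions; only last value and stride are included to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

final class HybridSketch {
    interface Predictor {
        long predict();
        void update(long actual);
    }

    static final class LastValue implements Predictor {
        private long last;
        public long predict() { return last; }
        public void update(long actual) { last = actual; }
    }

    static final class Stride implements Predictor {
        private long r1, r2;
        public long predict() { return r1 + (r1 - r2); }
        public void update(long actual) { r2 = r1; r1 = actual; }
    }

    static final class Hybrid implements Predictor {
        private final List<Predictor> subs;
        // One int per sub-predictor: bit k is set if it was correct k updates
        // ago, so each int covers exactly the last 32 return values.
        private final int[] history;

        Hybrid(List<Predictor> subs) {
            this.subs = subs;
            this.history = new int[subs.size()];
        }

        public long predict() {
            int best = 0;
            for (int i = 1; i < subs.size(); i++) {
                if (Integer.bitCount(history[i]) > Integer.bitCount(history[best])) {
                    best = i;
                }
            }
            return subs.get(best).predict();
        }

        public void update(long actual) {
            for (int i = 0; i < subs.size(); i++) {
                Predictor p = subs.get(i);
                int hit = (p.predict() == actual) ? 1 : 0;
                history[i] = (history[i] << 1) | hit; // oldest result falls off
                p.update(actual);
            }
        }
    }

    static Hybrid lastValueAndStride() {
        return new Hybrid(Arrays.<Predictor>asList(new LastValue(), new Stride()));
    }

    public static void main(String[] args) {
        Hybrid h = lastValueAndStride();
        for (long v = 0; v <= 20; v += 5) h.update(v);
        System.out.println(h.predict());
    }
}
```

Every sub-predictor is updated on every call, whether or not it was the one chosen, so the histories stay comparable.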


SLIDE 11

Context and Memoization Predictors


SLIDE 12

Hashtable Lookup and Expansion


SLIDE 13

SPEC JVM98 Dynamic Properties

property      comp    db      jack    javac   jess    mpeg    mtrt
callsites     1.72K   1.89K   3.60K   5.12K   3.04K   2.17K   2.90K
forked        226M    170M    59.4M   127M    125M    111M    288M
aborted       36      18      608K    41.8K   290     114     62
void          93.4M   54.4M   24.4M   45.3M   23.3M   34.1M   20.5M
verified      133M    115M    34.4M   81.5M   102M    76.9M   267M
boolean Z     3.75K   11.1M   9.38M   17.5M   35.8M   24.3M   3.06M
byte B        -       -       580K    39.3K   -       -       -
char C        935     1.73K   1.55M   3.70M   6.65K   2.11K   9.84K
short S       -       -       0       73.3K   0       18.0M   -
int I         133M    48.0M   11.5M   36.5M   20.8M   34.6M   4.54M
long J        477     152K    1.23M   845K    101K    15.8K   2.13K
float F       -       -       -       96      280     7.81K   162M
double D      -       -       -       156     1.77M   56      188K
reference R   15.8K   56.2M   10.2M   22.9M   43.5M   32.7K   97.5M

SLIDE 19

SPEC JVM98 Dynamic Properties

property      comp    db      jack    javac   jess    mpeg    mtrt
callsites     1.72K   1.89K   3.60K   5.12K   3.04K   2.17K   2.90K
forked        226M    170M    59.4M   127M    125M    111M    288M
aborted       36      18      608K    41.8K   290     114     62
void          93.4M   54.4M   24.4M   45.3M   23.3M   34.1M   20.5M
verified      133M    115M    34.4M   81.5M   102M    76.9M   267M
boolean Z     0%      10%     27%     21%     35%     32%     1%
byte B        0%      0%      2%      0%      0%      0%      0%
char C        0%      0%      5%      5%      0%      0%      0%
short S       0%      0%      0%      0%      0%      23%     0%
int I         100%    42%     33%     45%     20%     45%     2%
long J        0%      0%      4%      1%      0%      0%      0%
float F       0%      0%      0%      0%      0%      0%      61%
double D      0%      0%      0%      0%      2%      0%      0%
reference R   0%      49%     30%     28%     43%     0%      37%
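
The percentages appear to be each return type's share of the verified (non-void) calls for that benchmark, rounded to the nearest percent. This reading is an inference from the absolute counts on the preceding properties slide, not something the slides state. For example:

```latex
\text{boolean, jack: } \frac{9.38\,\mathrm{M}}{34.4\,\mathrm{M}} \approx 27\%, \qquad
\text{int, db: } \frac{48.0\,\mathrm{M}}{115\,\mathrm{M}} \approx 42\%, \qquad
\text{reference, mtrt: } \frac{97.5\,\mathrm{M}}{267\,\mathrm{M}} \approx 37\%
```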


SLIDE 20

Size Variation

Vary hashtable maximum size from 4 to 26 bits
Graph accuracy against size for:
  Context
  Memoization
  LS2PCM hybrid (all sub-predictors)
Choose optimal points for context and memoization
  Use these in future hybrid experiments
Future: try to do this profiling dynamically

SLIDE 21

comp Size Variation

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits) for comp; series: context, memoization, hybrid]

SLIDE 24

mtrt Context Size Variation

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: mtrt_up, mtrt_mp]

SLIDE 25

mtrt Memoization Size Variation

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: mtrt_up, mtrt_mp]

SLIDE 26

mtrt Hybrid Size Variation

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: mtrt_up, mtrt_mp]

SLIDE 27

{jack,javac,jess} Context

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: jack, javac, jess]

SLIDE 28

{jack,javac,jess} Memoization

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: jack, javac, jess]

SLIDE 29

Context Size Variation (all)

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: comp, db, jack, javac, jess, mpeg, mtrt_up, mtrt_mp]

SLIDE 30

Memoization Size Variation (all)

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: comp, db, jack, javac, jess, mpeg, mtrt_up, mtrt_mp]

SLIDE 31

Hybrid Size Variation (all)

[Figure: accuracy (%) versus maximum per-callsite table size (4 to 26 bits); series: comp, db, jack, javac, jess, mpeg, mtrt_up, mtrt_mp]

SLIDE 32

Memory Usage

Allow hashtables to expand freely up to 26 bits
Look at final distributions
  On average, 87% never expand beyond 8 bits
  Expansion ∝ hash function input variability
Exploit VM level value width info to conserve memory
  Compare against using full 64-bit table values
Memoization requires less space than context
  Indicates suitability for hardware designs

SLIDE 33

Context Table Distribution

[Figure: number of callsites (1 to 2000) versus final table size (1 to 25 bits); series: comp, db, jack, javac, jess, mpeg, mtrt_up]

SLIDE 34

Memoization Table Distribution

[Figure: number of callsites (1 to 2000) versus final table size (1 to 25 bits); series: comp, db, jack, javac, jess, mpeg, mtrt_up]

SLIDE 35

Table Memory

            context                           memoization
benchmark   size (bits)  original  reduced    size (bits)  original  reduced
comp        24           313M      208M       18           9.60M     6.38M
db          24           541M      361M       24           345M      206M
jack        14           15.8M     10.7M      8            1.79M     1.11M
javac       20           291M      195M       14           103M      64.5M
jess        14           13.5M     9.59M      12           8.83M     5.62M
mpeg        12           3.72M     2.49M      12           1.46M     856K
mtrt        14           69.4M     46.4M      12           23.0M     15.3M
average     17           178M      119M       14           70.4M     42.8M

value width optimizations yield 35% space reduction
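
One way to arrive at a figure close to 35% is to combine the average context and memoization rows of the table. Which aggregate the slide actually uses is not stated, so treat this as an assumption:

```latex
1 - \frac{119\,\mathrm{M} + 42.8\,\mathrm{M}}{178\,\mathrm{M} + 70.4\,\mathrm{M}}
  = 1 - \frac{161.8}{248.4} \approx 0.35
```

Taken separately, the same averages give a reduction of about 33% for context tables (119M against 178M) and about 39% for memoization tables (42.8M against 70.4M).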


SLIDE 36

Hybrid Performance

Compare hybrid LS2PC and LS2PCM predictors
  Omit types with zero calls
Difficult to compare directly with Hu’s results
  S100 vs. S1
  non-inlined vs. inlined
Memoization complements context nicely in hybrid
  72% average accuracy with LS2PC
  81% average accuracy with LS2PCM
Hybrid fitness: 96%
  Ability to capture correct sub-predictions

SLIDE 37

LS2PC (light) vs. LS2PCM (dark)

[Figure: accuracy (%) by return type (R, D, F, J, I, S, C, B, Z, A); LS2PC in light bars, LS2PCM in dark bars]

SLIDE 38

LS2PC (light) vs. LS2PCM (dark)

[Figure: accuracy (%) by return type, one panel each for comp, db, jack, javac, jess, mpeg, mtrt, and the average; LS2PC in light bars, LS2PCM in dark bars]

SLIDE 39

Conclusions

Reported comprehensive data over all method calls
Achieved high prediction accuracy in software JVM
Introduced powerful memoization predictor
Cut memory costs without sacrificing accuracy

SLIDE 40

Future Work

Extend framework with new predictors (e.g. DFCM)
  Generalised load predictors
Implement compiler analyses in Soot
  Parameter dependence analysis
  Return value use analysis
Determine extent to which memoization compensates for concurrent update problems with context predictors
Expand hashtables only if accuracy increased on last expansion
Finish SMLP implementation in SableVM
  Study costs and benefits of RVP in this system