[PPT] - H.-S. Oh, B.-J. Kim, H.-K. Choi, S.-M. Moon School of Electrical PowerPoint Presentation

SLIDE 1

H.-S. Oh, B.-J. Kim, H.-K. Choi, S.-M. Moon

School of Electrical Engineering and Computer Science Seoul National University, Korea

SLIDE 2

Virtual Machine & Optimization Lab

Android apps are programmed in Java
Android uses the Dalvik VM (DVM) instead of a JVM to run Java
Some people believe that Android is successful partly due to DVM; is this really true?

How does DVM perform compared to JVM?
Evaluate on the same board using the same benchmarks

How does DVM affect the performance of Android apps?
Analyze runtime profile

SLIDE 3

Comparison of DVM and JVM
Evaluation of DVM and JVM
Evaluation of Android apps
Conclusion

SLIDE 4

VM for executing Java on the Android platform
Java code in applications, framework, and core libraries
Executes dex files instead of the class files of a Java VM (JVM)
DX (class-to-dex) converts class files to dex files
A dex file has a different bytecode ISA

SLIDE 5

DVM has a register-based bytecode, while JVM has a stack-based bytecode

JAVA SOURCE CODE

    public static int add(int a, int b) {
        int c = a + b;
        return c;
    }

JVM

    0: iload_0
    1: iload_1
    2: iadd
    3: istore_2
    4: iload_2
    5: ireturn

DVM

    |0000: add-int v0, v1, v2
    |0002: return v0
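The difference in instruction count can be made concrete with a minimal sketch of two toy interpreters for the add example. This is not either VM's real interpreter: the class name `StackVsRegister`, the `dispatches` counter, and the instruction encodings are assumptions made for illustration. Only the opcode names and the register numbering (arguments in v1 and v2) follow the listings on this slide.

```java
/** Toy interpreters for "a + b"; encodings are invented, opcode names mirror the slide. */
final class StackVsRegister {
    /** Number of instructions dispatched by the most recent run. */
    static int dispatches;

    // Hypothetical opcode numbers for the register machine.
    static final int ADD_INT = 0, RETURN = 1;

    /** Stack style: iload_0, iload_1, iadd, istore_2, iload_2, ireturn. */
    static int runStack(int a, int b) {
        String[] code = {"iload_0", "iload_1", "iadd", "istore_2", "iload_2", "ireturn"};
        int[] locals = {a, b, 0};
        int[] stack = new int[2];
        int sp = 0;
        dispatches = 0;
        for (String op : code) {
            dispatches++;
            switch (op) {
                case "iload_0": stack[sp++] = locals[0]; break;
                case "iload_1": stack[sp++] = locals[1]; break;
                case "iload_2": stack[sp++] = locals[2]; break;
                case "iadd": {
                    int right = stack[--sp];
                    int left = stack[--sp];
                    stack[sp++] = left + right;
                    break;
                }
                case "istore_2": locals[2] = stack[--sp]; break;
                case "ireturn": return stack[--sp];
                default: throw new IllegalStateException(op);
            }
        }
        throw new IllegalStateException("missing return");
    }

    /** Register style: add-int v0, v1, v2; return v0. */
    static int runRegister(int a, int b) {
        int[][] code = {{ADD_INT, 0, 1, 2}, {RETURN, 0}};
        int[] v = {0, a, b};  // the two arguments arrive in v1 and v2
        dispatches = 0;
        for (int[] insn : code) {
            dispatches++;
            switch (insn[0]) {
                case ADD_INT: v[insn[1]] = v[insn[2]] + v[insn[3]]; break;
                case RETURN: return v[insn[1]];
                default: throw new IllegalStateException("bad opcode");
            }
        }
        throw new IllegalStateException("missing return");
    }
}
```

In this sketch both versions compute the same sum, but the stack version dispatches six instructions and the register version two, which is the effect the bytecode listings above show.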

SLIDE 6

DVM interpreter is supposed to be faster than JVM’s, due to a lower bytecode count and fewer operand accesses

According to Shi’s “stack vs. register” paper [TACO’08]
DVM has two interpreters (assembly version, C version), while our JVM has a C version only

SLIDE 7

Higher performance requires just-in-time compilation, which translates bytecode to native code at runtime

Both VMs employ adaptive compilation
Interpret initially; when a hot spot is found, compile it
DVM’s JIT compilation unit is a hot path called a trace, while JVM’s is a hot method
For lower memory footprint, yet competitive performance
But, the reality is …

SLIDE 8

[Figure: a loop of basic blocks (BBs) 1 to 7 and the block sequences of the traces formed from it]

Interpret initially, counting at each trace entry
Trace entry: the target of a jump, or the bytecode following a trace
If counter > threshold, trace recording starts
Trace recording stops at a branch or a method call; the trace is enqueued for JITC
A join BB can be compiled multiple times
Chaining is used for control transfer at the end of a trace: chaining cells are added

[Jump to a VM internal function + address cache]
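The counting and recording steps listed on this slide can be sketched as follows. This is a simplified illustration, not Dalvik's implementation: the class `TraceSelector`, the `Kind` enum, the `THRESHOLD` value, and the choice to include the terminating branch or call in the trace are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of counter-based hot-trace selection. */
final class TraceSelector {
    /** Illustrative threshold; the slides do not state the real value. */
    static final int THRESHOLD = 3;

    /** A toy bytecode model: only the instruction kind matters here. */
    enum Kind { PLAIN, BRANCH, CALL }

    /** One counter per trace entry (bytecode offset). */
    private final Map<Integer, Integer> counters = new HashMap<>();

    /** Recorded traces waiting for the JIT compiler. */
    final List<List<Integer>> compileQueue = new ArrayList<>();

    /** Called by the interpreter whenever control reaches a trace entry. */
    void onTraceEntry(int pc, Kind[] code) {
        int count = counters.merge(pc, 1, Integer::sum);
        if (count > THRESHOLD) {
            counters.put(pc, 0);
            compileQueue.add(record(pc, code));
        }
    }

    /** Records offsets from pc up to and including the first branch or call. */
    static List<Integer> record(int pc, Kind[] code) {
        List<Integer> trace = new ArrayList<>();
        for (int i = pc; i < code.length; i++) {
            trace.add(i);
            if (code[i] != Kind.PLAIN) {
                break;  // recording stops at a branch or a method call
            }
        }
        return trace;
    }
}
```

Because recording ends at the first branch or call, a trace in this sketch can never span more than one straight-line run of bytecodes, which hints at why traces formed this way tend to be short.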

SLIDE 9

Code quality: traces are too short (~3 bytecodes)
Fewer optimizations, higher overhead of chaining cells
Precision of hot trace detection
Counters are shared among traces to reduce space
Register allocation
Cannot map virtual registers to physical registers globally

– v0=v0+v1 requires two loads (of v0 and v1) and a store to v0

These can negatively affect both performance and memory
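To see why shared counters make hot trace detection less precise, consider the following minimal sketch. It assumes a small counter table indexed by the trace entry offset modulo the table size; that indexing scheme, the class name `SharedCounters`, and the parameters are assumptions for illustration, since the slides only say that counters are shared.

```java
/** Hypothetical shared counter table: several trace entries map to one slot. */
final class SharedCounters {
    private final int[] table;
    private final int threshold;

    SharedCounters(int slots, int threshold) {
        this.table = new int[slots];
        this.threshold = threshold;
    }

    /**
     * Bumps the slot shared by this trace entry and reports whether the
     * entry is now considered hot. Entries that alias to the same slot
     * contribute to the same count.
     */
    boolean bump(int pc) {
        int slot = Math.floorMod(pc, table.length);
        if (++table[slot] > threshold) {
            table[slot] = 0;
            return true;
        }
        return false;
    }
}
```

With 4 slots and a threshold of 3, bumping offsets 1, 5, and 9 once each leaves their shared slot at 3, so a single visit to offset 13 is then reported as hot even though no individual entry was executed more than once.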

SLIDE 10

Java Source Code

    public static int factorial( ) {
        int result = 1;
        for(int i = 1 ; i < 10000 ; i++) {
            result = result * i;
        }
        return result;
    }

Dalvik Bytecode

    |0000: const/4 v0, #int 1 // #1
    |0001: move v1, v0
    |0002: const/16 v2, #int 10000 // #2710
    |0004: if-ge v0, v2, 000a // +0006
    |0006: add-int/2addr v1, v0
    |0007: add-int/lit8 v0, v0, #int 1 // #01
    |0009: goto 0002 // -0007
    |000a: return v1

Generated Machine code (12 instructions generated)

    // if-ge v0, v2, 000a
    LDR R3, [RFP, #0]
    CMP R3, R2
    STR R2, [RFP, #8]
    BGE label2
    B label1
    label2:
    ……
    label1:
    // add-int/2addr v1, v0
    LDR R0, [RFP, #4]
    LDR R1, [RFP, #0]
    ADDS R0, R0, R1
    STR R0, [RFP, #4]
    // add-int/lit8 v0, v0, #int 1
    ADDS R1, R1, #1
    // goto 0002
    STR R0,[RFP, #4]
    STR R1,[RFP, #0]

SLIDE 11

Java Source Code

    public static int factorial( ) {
        int result = 1;
        for(int i = 1 ; i < 10000 ; i++) {
            result = result * i;
        }
        return result;
    }

Generated Machine code (8 instructions generated), with the Java Bytecode shown as comments

    L2:
    // sipush 10000
    LDR v8, [pc, #+0] @const 10000
    // if_icmpge <21>
    CMP v4, v8 LSL #0
    BGE L1
    // iinc 1 1
    ADD v4, v4, #1
    STR v4, [rJFP, #-4]
    // goto <4>
    B L2
    L1:
    ……
    // iload_0
    // iload_1
    // iadd
    ADD v3, v3, v4 LSL #0
    // istore_0
    STR v3, [rJFP, #-8]

SLIDE 12

Tablet PC with ARM Cortex-A8 and 1 GB memory
Android 2.3 Gingerbread on Linux 2.6.35
PhoneME Advanced JVM (HotSpot) on Linux 2.6.32
EEMBC GrinderBench
DVM JITC generates Thumb2 code, while JVM JITC generates ARM code

Thumb2 reduces code size by 15% and performance by 6%

SLIDE 13

[Chart: JVM Interpreter, DVM Interpreter, and DVM C Interpreter compared on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM assembly interpreter is faster than JVM’s, but its C interpreter is similar

SLIDE 14

[Chart: JVM Dynamic Bytecode Count vs. DVM Dynamic Bytecode Count on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM executes 40% fewer bytecode instructions

SLIDE 15

[Chart: JVM Dynamic Bytecode Size vs. DVM Dynamic Bytecode Size on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM requires a 60% larger program than the JVM to do the same job

SLIDE 16

[Chart: JVM JITC vs. DVM JITC on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM with JITC is three times slower than JVM with JITC

SLIDE 17

[Chart: JVM Compiled Bytecode Size vs. DVM Compiled Bytecode Size on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM compiles a smaller amount of bytecode because of its trace-based JITC

SLIDE 18

[Chart: JVM Generated Code Size vs. DVM Generated Code Size on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM generates 35% larger machine code than the JVM

SLIDE 19

        Chess  kXML  Parallel  PNG   RegEx  Avg.
Ratio   1.18   1.08  1.15      1.15  1.13   1.13

How many times is a Dalvik bytecode translated redundantly?

SLIDE 20

[Chart: instructions generated per byte of bytecode on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

How many instructions are generated for 1 byte of bytecode?

JVM: ~1.3 instructions per byte of JVM bytecode
DVM: ~2.7 instructions per byte of DVM bytecode = ~4.5 instructions per byte of JVM bytecode

Chaining cell overhead

SLIDE 21

[Chart: JVM Compile Time vs. DVM Compile Time on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
[Chart: JVM Compile Overhead vs. DVM Compile Overhead (0% to 6%) on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

DVM compilation time is 4 times longer

SLIDE 22

[Chart: DVM Original, DVM Trace Extension, and DVM Trace Extension (Opt) on Chess, kXML, Parallel, PNG, RegEx, and Geomean]

Even if we extend the trace and add more optimizations, the impact is not high

SLIDE 23

Low code quality due to short traces and little optimization
Expanding the trace would not help much
Little difference for Jelly Bean JITC
A preliminary implementation of a naïve method-based JITC is included (but currently disabled)

One question: how come Android apps work fine?

SLIDE 24

Profile results based on OProfile
DVM portion (interpreter and JITC code)
Native portion (kernel+library and native app)
Run the apps for ~5 sec (since EEMBC runs ~5 sec)

Applications        Category        Running Details
AngryBirds          Game            Load the stage 1-1
DoodleJump          Game            Play for 5 seconds
Seesmic             SNS             Refresh Facebook feed
Twitter             SNS             Refresh timeline
Astro File Manager  File Navigator  Search file system
Google Sky Map      Navigation      Navigate constellations

SLIDE 25

Fortunately, the DVM portion is much smaller, so the slower DVM has much less effect

[Chart: share of running time from 0% to 100%; series: Native, Native app, DVM]

SLIDE 26

[Chart: breakdown from 0% to 100%; series: Interpreter (except GC), GC, JITC]

SLIDE 27

The garbage collection (GC) portion is way too high

GC for the benchmarks takes less than 2%
GC might be too frequent or take too long

The JITC portion is much smaller than the interpreter’s: Why?

Fewer hot spots than in the benchmarks?
Less reuse of JITC-generated code?

SLIDE 28

[Chart: loop iteration counts, axis from 1 to 1000000]

Numbers are log scale

App loops iterate far fewer times than benchmark loops

SLIDE 29

[Chart: method call counts, axis from 1000 to 10000000]

App methods are called far less often than benchmark methods

Numbers are log scale

SLIDE 30

[Chart: trace execution counts, axis from 1 to 1000000]

Numbers are log scale

App traces are executed far less often than benchmark traces

SLIDE 31

[Chart: number of traces generated, axis from 50 to 500]

Apps generate far more traces than benchmarks

SLIDE 32

Apps generate more traces, yet app traces are executed far less often than benchmark traces

Perhaps not even often enough to justify the JITC overhead

Is JITC really useful for app performance?
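The trade-off behind this question can be illustrated with a simple break-even model: compiling a trace pays off only if the trace runs often enough for the per-execution saving to cover the one-time compile cost. The sketch below is a hypothetical model, not a measurement; the class `JitBreakEven`, its method names, and the cost units are assumptions, and none of the numbers come from the slides.

```java
/** Hypothetical break-even model for JIT-compiling one trace; costs are abstract units. */
final class JitBreakEven {
    /**
     * Smallest number of executions at which compiling pays for itself,
     * given a one-time compile cost and the per-execution cost of the
     * interpreted and compiled versions.
     */
    static long breakEvenExecutions(long compileCost, long interpCost, long nativeCost) {
        if (nativeCost >= interpCost) {
            throw new IllegalArgumentException("compiled code must be cheaper per execution");
        }
        long saving = interpCost - nativeCost;
        return (compileCost + saving - 1) / saving;  // ceiling division
    }

    /** Net cost change from compiling; a positive value means compiling made things slower. */
    static long netCost(long compileCost, long interpCost, long nativeCost, long executions) {
        return compileCost - executions * (interpCost - nativeCost);
    }
}
```

For example, with a made-up compile cost of 1000 units and a saving of 8 units per execution, a trace that runs only 50 times leaves a net loss, while one that runs 1000 times recovers the compile cost several times over.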

SLIDE 33

[Chart: Interpreter vs. JITC on Angrybirds, DoodleJump, Seesmic, Twitter, Astro File Manager, Google Sky Map, and Geomean]

App performance goes down when we turn on the JIT compiler

Loading time only

SLIDE 34

We believe Dalvik’s trace-based JITC has a severe performance problem in its current form

We do not experience any critical problem in running the Android apps, though

The Dalvik portion of the total running time is not dominant
Android apps lack hot spots, unlike benchmarks
This calls for faster warm spot detection or ahead-of-time compilation

SLIDE 35