H.-S. Oh, B.-J. Kim, H.-K. Choi, S.-M. Moon
School of Electrical Engineering and Computer Science, Seoul National University, Korea
Virtual Machine & Optimization Lab
- Android apps are programmed using Java
- Android uses DVM instead of JVM for running Java
- Some people believe that Android is successful partly due to DVM; is this really true?
How does DVM perform compared to JVM?
- Evaluate on the same board using the same benchmarks
How does DVM affect the performance of Android apps?
- Analyze runtime profile
- Comparison of DVM and JVM
- Evaluation of DVM and JVM
- Evaluation of Android apps
- Conclusion
- VM for executing Java in Android platform
- Java code in applications, framework, and core libraries
- Executes dex files instead of class files of Java VM (JVM)
- DX (class-to-dex)
- Dex file has different bytecode ISA
- DVM has a register-based bytecode, while JVM has a stack-based bytecode

Java source code:
    public static int add(int a, int b) {
        int c = a + b;
        return c;
    }

JVM bytecode:
    0: iload_0
    1: iload_1
    2: iadd
    3: istore_2
    4: iload_2
    5: ireturn

DVM bytecode:
    |0000: add-int v0, v1, v2
    |0002: return v0
The DVM interpreter is supposed to be faster than the JVM’s, due to a lower bytecode count and fewer operand accesses
- According to Shi’s “stack vs. register” paper [TACO’08]
- DVM has two interpreters (assembly version, C version), while our JVM has a C version only
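
The bytecode-count argument can be illustrated with two toy interpreters for the add() example shown earlier. This is only a sketch: the class DispatchCount, its method names, and the integer instruction encoding are hypothetical, and neither loop reflects how the real DVM or JVM interpreter is written.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy interpreters; opcode names mimic the add() example only.
final class DispatchCount {
    // Stack-based: operands travel through an operand stack, one dispatch per bytecode.
    // Returns {result, number of dispatched bytecodes}.
    static int[] runStack(int a, int b) {
        String[] code = {"iload_0", "iload_1", "iadd", "istore_2", "iload_2", "ireturn"};
        int[] locals = {a, b, 0};
        Deque<Integer> stack = new ArrayDeque<>();
        int dispatches = 0;
        for (String op : code) {
            dispatches++;
            switch (op) {
                case "iload_0": stack.push(locals[0]); break;
                case "iload_1": stack.push(locals[1]); break;
                case "iload_2": stack.push(locals[2]); break;
                case "iadd": stack.push(stack.pop() + stack.pop()); break;
                case "istore_2": locals[2] = stack.pop(); break;
                case "ireturn": return new int[] {stack.pop(), dispatches};
                default: throw new IllegalStateException(op);
            }
        }
        throw new IllegalStateException("no return");
    }

    // Register-based: each instruction names its operands directly.
    // Assumed encoding per row: opcode, dest, src1, src2 (0 = add-int, 1 = return).
    static int[] runRegister(int a, int b) {
        int[][] code = {{0, 0, 1, 2}, {1, 0, 0, 0}};
        int[] v = {0, a, b};  // v1 and v2 hold the arguments, as in the slide's listing
        int dispatches = 0;
        for (int[] insn : code) {
            dispatches++;
            if (insn[0] == 0) {
                v[insn[1]] = v[insn[2]] + v[insn[3]];
            } else {
                return new int[] {v[insn[1]], dispatches};
            }
        }
        throw new IllegalStateException("no return");
    }

    public static void main(String[] args) {
        int[] s = runStack(3, 4);
        int[] r = runRegister(3, 4);
        System.out.println("stack:    result=" + s[0] + " dispatches=" + s[1]);
        System.out.println("register: result=" + r[0] + " dispatches=" + r[1]);
    }
}
```

Both versions compute the same sum, but the register version goes through the dispatch loop fewer times, which is the effect the "stack vs. register" argument relies on.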
Higher performance requires just-in-time compilation, which translates bytecode to native code at runtime
- Both VMs employ adaptive compilation
- Interpret initially; when a hot spot is found, compile it
- DVM’s JIT compilation unit is a hot path called a trace, while JVM’s is a hot method
- For lower memory footprint, yet competitive performance
- But, the reality is …
[Figure: basic blocks 1 to 7 forming a loop, and the traces built from them]
- Interpret initially, count at each trace entry
- Trace entry: target of jump, next bytecode of trace
- If counter > threshold, trace recording starts
- Trace recording stops when meeting a branch or a method call; the trace is enqueued for JITC
- A join BB can be compiled multiple times
- Chaining is used for control transfer at the end of a trace: chaining cells are added
- [Jump to a VM internal function + address cache]
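
The selection steps above can be sketched roughly as follows. This is a minimal illustration, not Dalvik's actual code: the class TraceSelector, the THRESHOLD value, and the boolean array marking branches and calls are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of counter-based trace selection; not Dalvik's implementation.
final class TraceSelector {
    static final int THRESHOLD = 3;  // assumed value, chosen only for illustration

    final Map<Integer, Integer> counters = new HashMap<>();  // one counter per trace entry
    final List<List<Integer>> jitQueue = new ArrayList<>();  // traces waiting for JITC

    // Called whenever the interpreter reaches a potential trace entry at pc.
    // isBranchOrCall[i] marks bytecodes that end trace recording.
    void onTraceEntry(int pc, boolean[] isBranchOrCall) {
        int count = counters.merge(pc, 1, Integer::sum);
        if (count <= THRESHOLD) {
            return;  // still cold: keep interpreting
        }
        counters.put(pc, 0);  // simplification: reset instead of patching in compiled code
        List<Integer> trace = new ArrayList<>();
        for (int i = pc; i < isBranchOrCall.length; i++) {
            trace.add(i);  // record this bytecode into the trace
            if (isBranchOrCall[i]) {
                break;  // recording stops at a branch or a method call
            }
        }
        jitQueue.add(trace);  // enqueue the recorded trace for compilation
    }
}
```

Because every branch or call ends recording, the traces produced this way stay short, which is the property the following slides examine.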
- Code quality: traces are too short (~3 bytecodes)
- Fewer optimizations, higher overhead of chaining cells
- Preciseness of hot trace detection
- Counters are shared among traces to reduce space
- Register allocation
- Cannot map virtual registers to physical registers globally
– v0=v0+v1 requires two loads from v0 and v1 and a store to v0
These can negatively affect both performance and memory
Java source code:
    public static int factorial( ) {
        int result = 1;
        for(int i = 1 ; i < 10000 ; i++) {
            result = result * i;
        }
        return result;
    }

Dalvik bytecode:
    |0000: const/4 v0, #int 1 // #1
    |0001: move v1, v0
    |0002: const/16 v2, #int 10000 // #2710
    |0004: if-ge v0, v2, 000a // +0006
    |0006: add-int/2addr v1, v0
    |0007: add-int/lit8 v0, v0, #int 1 // #01
    |0009: goto 0002 // -0007
    |000a: return v1

Generated machine code (12 instructions generated):
    // if-ge v0, v2, 000a
    LDR R3, [RFP, #0]
    CMP R3, R2
    STR R2, [RFP, #8]
    BGE label2
    B label1
    label2: ……
    label1:
    // add-int/2addr v1, v0
    LDR R0, [RFP, #4]
    LDR R1, [RFP, #0]
    ADDS R0, R0, R1
    STR R0, [RFP, #4]
    // add-int/lit8 v0, v0, #int 1
    ADDS R1, R1, #1
    // goto 0002
    STR R0, [RFP, #4]
    STR R1, [RFP, #0]
Java source code:
    public static int factorial( ) {
        int result = 1;
        for(int i = 1 ; i < 10000 ; i++) {
            result = result * i;
        }
        return result;
    }

Java bytecode:
    |0000: iconst_1
    |0001: istore_0
    |0002: iconst_1
    |0003: istore_1
    |0004: iload_1
    |0005: sipush 10000
    |0008: if_icmpge <21>
    |0011: iload_0
    |0012: iload_1
    |0013: iadd
    |0014: istore_0
    |0015: iinc 1 1
    |0018: goto <4>
    |0021: iload_0
    |0022: ireturn

Generated machine code (8 instructions generated):
    L2:
    // sipush 10000
    LDR v8, [pc, #+0] @const 10 000
    // if_icmpge <21>
    CMP v4, v8 LSL #0
    BGE L1
    // iload_0
    // iload_1
    // iadd
    ADD v3, v3, v4 LSL #0
    // istore_0
    STR v3, [rJFP, #-8]
    // iinc 1 1
    ADD v4, v4, #1
    STR v4, [rJFP, #-4]
    // goto <4>
    B L2
    L1: ……
- Tablet PC with ARM Cortex-A8 and 1GB memory
- Android 2.3 Gingerbread on Linux 2.6.35
- PhoneME advanced JVM (HotSpot) on Linux 2.6.32
- EEMBC GrinderBench
- DVM JITC generates Thumb2 code, while JVM JITC generates ARM code
- Thumb2 reduces code size by 15%, performance by 6%
[Chart: JVM Interpreter, DVM Interpreter, and DVM C Interpreter on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM assembly interpreter is faster than JVM’s, but its C interpreter is similar
[Chart: JVM Dynamic Bytecode Count vs. DVM Dynamic Bytecode Count on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM executes 40% fewer bytecode instructions
[Chart: JVM Dynamic Bytecode Size vs. DVM Dynamic Bytecode Size on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM requires a 60% larger program than the JVM for achieving the same job
[Chart: JVM JITC vs. DVM JITC on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM with JITC is three times slower than JVM with JITC
[Chart: JVM Compiled Bytecode Size vs. DVM Compiled Bytecode Size on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM compiles a smaller amount of bytecode because of its trace-based JITC
[Chart: JVM Generated Code Size vs. DVM Generated Code Size on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM generates 35% larger machine code than the JVM
How many times is a Dalvik bytecode translated redundantly?

Benchmark   Chess   kXML   Parallel   PNG    RegEx   Avg.
Ratio       1.18    1.08   1.15       1.15   1.13    1.13
[Chart: instructions generated per byte of bytecode on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
How many instructions are generated for 1 byte of bytecode?
JVM: ~1.3 instructions/1 byte of JVM
DVM: ~2.7 instructions/1 byte of DVM = ~4.5 instructions/1 byte of JVM
Chaining cell overhead
[Charts: JVM Compile Time vs. DVM Compile Time, and JVM Compile Overhead vs. DVM Compile Overhead (0% to 6%), on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
DVM compilation time is 4 times longer
[Chart: DVM Original, DVM Trace Extension, and DVM Trace Extension (Opt) on Chess, kXML, Parallel, PNG, RegEx, and Geomean]
Even if we extend the trace and add more optimizations, the impact is not high
- Low code quality due to short trace, low optimization
- Expanding the trace would not help much
- Little difference for Jelly Bean JITC
- A preliminary implementation of a naïve method-based JITC is included (but disabled currently)
- One question: how come Android apps work fine?
- Profile results based on OProfile
- DVM portion (interpreter and JITC code)
- Native portion (kernel+library and native app)
- Run the apps for ~5 sec (since EEMBC runs ~5 sec)
Application          Category         Running details
AngryBirds           Game             Load the stage 1-1
DoodleJump           Game             Play for 5 seconds
Seesmic              SNS              Refresh Facebook feed
Twitter              SNS              Refresh timeline
Astro File Manager   File Navigator   Search file system
Google Sky Map       Navigation       Navigate constellations
Fortunately, the DVM portion of the running time is much smaller, so the slower DVM affects app performance much less
[Chart: share of running time per app (0% to 100%), split into Native, Native app, and DVM]
[Chart: breakdown (0% to 100%) into Interpreter (except GC), GC, and JITC]
Garbage collection (GC) portion is way too high
- GC for benchmarks takes less than 2%
- GC might be too frequent or take too long
JITC portion is much smaller than interpreter’s: Why?
- Fewer hot spots than benchmarks?
- Reuse of JITC-generated code is lower?
[Chart: loop iteration counts for apps and benchmarks, log scale from 1 to 1,000,000]
App loops iterate far fewer times than benchmark loops.
[Chart: method call counts for apps and benchmarks, log scale from 1,000 to 10,000,000]
App methods are called far fewer times than benchmark methods
[Chart: trace execution counts for apps and benchmarks, log scale from 1 to 1,000,000]
App traces are executed far fewer times than benchmark traces
[Chart: number of generated traces for apps and benchmarks, scale from 50 to 500]
Apps generate many more traces than benchmarks
- Apps generate more traces, yet app traces are executed far fewer times than benchmark traces
- Perhaps even not enough to justify the JITC overhead
Is JITC really useful for app performance?
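
One way to reason about this question is a simple break-even model: compiling a trace pays off only if it is executed often enough to amortize the compilation cost. The sketch below is hypothetical throughout; the class JitBreakEven, its parameters, and the cost units are assumptions for illustration and are not measurements from this study.

```java
// Hypothetical cost model; all costs are in arbitrary units, not measured values.
final class JitBreakEven {
    // Smallest execution count n for which
    //   compileCost + n * nativeCost < n * interpCost
    // that is, n > compileCost / (interpCost - nativeCost).
    static long breakEvenExecutions(double compileCost, double interpCost, double nativeCost) {
        if (nativeCost >= interpCost) {
            return Long.MAX_VALUE;  // compiled code is no faster, so it never pays off
        }
        return (long) Math.floor(compileCost / (interpCost - nativeCost)) + 1;
    }

    // True when a trace executed this many times would have been worth compiling.
    static boolean worthCompiling(long executions, double compileCost,
                                  double interpCost, double nativeCost) {
        return executions >= breakEvenExecutions(compileCost, interpCost, nativeCost);
    }
}
```

Under this model, a trace that costs 1000 units to compile and saves 8 units per execution needs more than 125 executions before compilation wins, so traces that run only a handful of times add overhead instead of removing it.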
[Chart: Interpreter vs. JITC on Angrybirds, DoodleJump, Seesmic, Twitter, Astro File Manager, Google Sky Map, and Geomean]
App performance goes down when we turn on the JIT compiler
Loading time only
- We believe Dalvik’s trace-based JITC has a severe performance problem in its current form
- We do not experience any critical problem in running the Android apps, though
- Dalvik portion in the total running time is not dominant
- Android apps lack hot spots unlike benchmarks
- Requiring a faster warm spot detection or ahead-of-time