Igor Böhm
Processor Automated Synthesis by iTerative Analysis
The University of Edinburgh
Reducing Dynamic Compilation Latency
LLVM’12 - European Conference, London
What do we want to improve?
[Figure: execution timeline alternating between Interp (interpretation, slow) and Native (native code execution, fast) phases over time]
initially code is interpreted
frequently executed code is compiled on-the-fly
switch from interpretive to native code execution as soon as dynamically compiled code is available
Legend: Compile = Dynamic Compilation, Interp = Interpretation, Profile = Interpretation with Profiling, Native = Native Code Execution
1 Dynamic Compilation using one Concurrent JIT Compiler
[Figure: timeline. The main thread alternates Interp, Profile and Native phases while a single compiler thread (Thread 1) runs its Compile tasks one after another; the critical path is marked.]
2 Dynamic Compilation using Concurrent and Parallel JIT Compiler Task Farm
[Figure: timeline. The main thread alternates Profile and Native phases while compiler threads 1, 2 and 3 run Compile tasks in parallel; the critical path is marked.]
Solution To Dynamic Compilation Latency Problem
Concurrent and Parallel Dynamic Compilation
[Figure: timeline. The main thread alternates Profile and Native phases while compiler threads 1, 2 and 3 run Compile tasks in parallel.]
improve code discovery/profiling
improve dynamic compilation workload throughput
Java byte-code, JavaScript, CIL
x86 binary, ARCompact binary, ARM binary
“disassembly of computer programs is the identification of executable code, i.e. the separation of instructions from data. This problem, for most computer architectures, is equivalent to the Halting Problem and is therefore unsolvable in general.”
[Horspool and Marovac - 1980]
[Figure: basic blocks executed over time, grouped into trace intervals. Legend: Basic Block, CFG Edges, Return to Interpreter, Trace Interval. The recorded region grows with every interval:]
1 Region after t': A B C D
2 Region after t'': A B C D E F G
3 Region after t''': A B C D E F G H I
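The region growth shown on this slide can be sketched as a set union performed at the end of every trace interval. This is a minimal illustration only: the class name `RegionTracer`, its methods and the single-character block labels are hypothetical and not taken from the presented system.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical tracer: collects the basic blocks seen in the current
// trace interval and merges them into the region when the interval ends.
class RegionTracer {
public:
    // Called by the interpreter for every basic block it executes.
    void record(char block) { interval_.push_back(block); }

    // Called at the end of a trace interval. Merges the interval's blocks
    // into the region and returns how many blocks were newly discovered.
    std::size_t end_interval() {
        const std::size_t before = region_.size();
        region_.insert(interval_.begin(), interval_.end());
        interval_.clear();
        return region_.size() - before;
    }

    const std::set<char>& region() const { return region_; }

private:
    std::vector<char> interval_;  // blocks executed in the current interval
    std::set<char> region_;       // all blocks discovered so far
};

// Helper for inspection: the region's blocks as a sorted string.
std::string blocks_of(const RegionTracer& t) {
    return std::string(t.region().begin(), t.region().end());
}
```

Feeding the tracer one block sequence per interval and calling `end_interval()` after each reproduces the three region snapshots, provided the sequences contain the same blocks as the slide's intervals.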
Concurrent and Parallel JIT Compilation in Action
(reducing the critical path)
Legend: Trace = Dynamic Code Discovery, Native = Native Code Execution
[Figure: simulation timeline (Native, Trace, Trace, Trace, Native) split into trace intervals. Intervals 1 to 3 show regions 1 to 4, intervals 4 and 5 show regions 5 to 12. Each region is drawn inside a page, and regions are handed to Dynamic Compilation Worker Threads 1, 2 and 3.]
page is a fixed size container for translation
Page (size variable), example contents: ext r2,r9; xor r3,r12,r2; and r3,r3,0xf; asl r3,r3,0x3; and r2,r2,0x7
1 Hide Dynamic Compilation Latency
2 Exploit Task Parallelism
3 Temporal Region Partitioning
4 Spatial Region Partitioning
Execution Loop
[Flowchart: PC address -> Block Translated? (Yes: Native Code Execution) -> New Block? (Yes: Record Block in Region) -> Interpretive Block Simulation -> End of Trace Interval? (Yes: Analyse Recorded Regions) -> Hot Regions Present? (Yes: Enqueue Hot Regions and Continue)]
JIT Compilation Task Farm
1 Enqueue
2 Continue
3 Concurrent Shared Data-Structure: Translation Priority Queue (Region 1, Region 7, Region 2, Region 6, ..., Region N)
4 Dequeue and Farm Out
Concurrent and Parallel Dynamic Compilation Task Farm
JIT Compilation Thread 1, Thread 2, ..., Thread N; each thread: Create LLVM IR, Optimise, Compile, Link Native Code
adaptive hotspot selection
dynamic work scheduling
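One pass through the execution loop can be sketched as follows. The sketch assumes that a region is identified by the page number of the block address, that heat is counted as interpreted blocks per interval, and that the threshold is passed in by the caller. The `Simulator` type, the 8 KiB page size and all field names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Assumed page size of 8 KiB; a region is keyed by its page number here.
constexpr std::uint32_t kPageBits = 13;

struct Simulator {
    std::set<std::uint32_t> translated;             // blocks that already have native code
    std::map<std::uint32_t, unsigned> region_heat;  // page -> blocks interpreted this interval
    std::vector<std::uint32_t> queue;               // hot regions handed to the JIT task farm
    unsigned native_blocks = 0;
    unsigned interpreted_blocks = 0;

    // One iteration of the loop for the block starting at `pc`.
    void step(std::uint32_t pc) {
        if (translated.count(pc)) {      // Block Translated? -> native execution
            ++native_blocks;
            return;
        }
        ++region_heat[pc >> kPageBits];  // Record Block in Region
        ++interpreted_blocks;            // Interpretive Block Simulation
    }

    // End of Trace Interval: analyse recorded regions and enqueue the hot ones.
    void end_of_interval(unsigned threshold) {
        for (const auto& kv : region_heat) {
            if (kv.second >= threshold) queue.push_back(kv.first);
        }
        region_heat.clear();
    }
};
```

The main loop only appends to the queue and continues; compilation happens elsewhere, which is what keeps it off the critical path.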
Key Components:
llvm::LLVMContext - owns and manages core ‘global’ data of LLVM’s core infrastructure
llvm::ExecutionEngine - abstract, easy to use interface for implementation execution
state-of-the-art set of optimisation passes
Key Concepts:
dispatch of compilation units via thread-safe priority queue abstraction
each JIT compiler thread owns private llvm::ExecutionEngine instance, enabling parallel JIT compilation without explicit synchronisation
asynchronous registration of compiled native code
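The dispatch concept above can be illustrated without LLVM by a small task farm built on the C++ standard library. This is a sketch under stated assumptions: `WorkUnit`, `TranslationQueue` and `run_task_farm` are hypothetical names, a larger `priority` value is assumed to mean "compile sooner", and the "compilation" step only records the region id where a real worker would use its private engine.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct WorkUnit {
    int region;    // region identifier
    int priority;  // assumed policy: larger value is compiled first
    bool operator<(const WorkUnit& o) const { return priority < o.priority; }
};

// Thread-safe priority queue shared by the main loop and the workers.
class TranslationQueue {
public:
    void push(WorkUnit u) {
        { std::lock_guard<std::mutex> lock(mutex_); heap_.push(u); }
        ready_.notify_one();
    }
    // Signals that no further work will arrive.
    void close() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        ready_.notify_all();
    }
    // Blocks until work is available; returns false once closed and drained.
    bool pop(WorkUnit& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !heap_.empty(); });
        if (heap_.empty()) return false;
        out = heap_.top();
        heap_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::priority_queue<WorkUnit> heap_;
    bool closed_ = false;
};

// Starts `workers` threads that drain the queue; returns the compiled region ids.
std::vector<int> run_task_farm(TranslationQueue& q, unsigned workers) {
    std::mutex out_mutex;
    std::vector<int> compiled;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            WorkUnit u;
            while (q.pop(u)) {
                // A real worker would generate and compile code here using
                // its own private engine, so no lock is needed for that step.
                std::lock_guard<std::mutex> lock(out_mutex);
                compiled.push_back(u.region);  // stands in for registration
            }
        });
    }
    for (auto& t : pool) t.join();
    return compiled;
}
```

Only the queue and the final registration are shared, so those are the only places that take a lock; the order in which workers finish is not deterministic.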
class JITThread : public Thread {
private:
  llvm::LLVMContext*     CTX_;  // per thread LLVMContext
  llvm::Module*          MOD_;  // per thread main Module
  llvm::ExecutionEngine* ENG_;  // per thread ExecutionEngine
  ...
public:
  // Set up the thread-private LLVM state.
  void create() {
    CTX_ = new llvm::LLVMContext();
    MOD_ = new llvm::Module("module", *CTX_);
    ENG_ = llvm::EngineBuilder(MOD_)
             .setEngineKind(llvm::EngineKind::JIT)
             .create();
    ...
  }

  // Worker loop: dequeue compilation units and JIT-compile them.
  void run() {
    for ( ; /* ever */ ; ) {
      queue.mutex.acquire();
      while (queue.empty()) {           // wait for work if queue is empty
        queue.condvar.wait(queue.mutex);
      }
      WorkUnit* u = queue.top();        // retrieve compilation unit
      queue.pop();
      queue.mutex.release();
      llvm::Function* f = Codegen(u);                // generate IR
      void* native = ENG_->getPointerToFunction(f);  // run JIT
      // register native translation for execution
      ...
    }
  }
};
Extensive evaluation using over 60 industry standard benchmarks built for ARCompact RISC platform:
BioPERF
SPEC CPU 2006
EEMBC and CoreMark
Target Platform: ARCompact RISC ISA targeting ARC 700 processor
Simulation Platform: standard x86 Dell Intel Xeon quad-core machine
Speedup BioPerf
Measured on standard x86 quad-core machine
[Bar chart: speedup relative to the baseline (axis 0.5 to 2.5) for Interpreted-only Execution, Execution using concurrent JIT Compiler, and Execution using concurrent and parallel JIT Compiler. Benchmarks: clustalw, fasta-ssearch, promlk, grappa, hmmsearch, hmmpfam, tcoffee, blastp, glimmer, ce, average.]
Bar labels: 0.44 0.08 0.19 0.68 0.81 0.12 0.94 0.47 0.06 0.11 0.34
Bar labels: 1.38 1.01 1.43 2.08 1.29 1.03 1.52 1.35 1.02 1.07 2.01
Simulation speed labels: 181 MIPS, 329 MIPS, 466 MIPS, 81 MIPS, 50 MIPS, 258 MIPS, 60 MIPS, 143 MIPS, 234 MIPS, 360 MIPS
Summary label: 217 MIPS, 1.38x
Speedup SPEC CPU 2006
Measured on standard x86 quad-core machine
very long running CPU intensive benchmarks [worst-case scenario]
[Bar chart: speedup relative to the baseline (axis 0.2 to 2.0) for Interpreted-only Execution, Execution using concurrent JIT Compiler, and Execution using concurrent and parallel JIT Compiler. Benchmarks: perlbench, bzip2, gcc, mcf, milc, gobmk, soplex, povray, hmmer, sjeng, libquantum, h264ref, lbm, astar, sphinx3, xalancbmk, average.]
Bar labels: 0.28 1.22 0.09 0.09 0.26 0.15 0.09 0.07 0.18 0.08 0.15 0.11 0.17 0.08 0.21 0.85 0.10 0.88
Bar labels: 1.15 1.47 1.04 1.07 1.16 1.01 1.06 1.04 1.03 1.02 1.20 1.01 1.08 1.00 1.06 2.04 1.07 1.18
Simulation speed labels: 26 MIPS, 403 MIPS, 53 MIPS, 201 MIPS, 378 MIPS, 153 MIPS, 254 MIPS, 209 MIPS, 360 MIPS, 145 MIPS, 617 MIPS, 368 MIPS, 170 MIPS, 96 MIPS, 353 MIPS, 319 MIPS, 24 MIPS
Summary label: 243 MIPS, 1.15x
Speedup EEMBC CoreMark
Measured on standard x86 quad-core machine
very short running embedded benchmarks [worst-case scenario]
[Bar chart: speedup relative to the baseline (axis 0.2 to 2.0) for Interpreted-only Execution, Execution using concurrent JIT Compiler, and Execution using concurrent and parallel JIT Compiler. Benchmarks: a2time01, aifftr01, aifirf01, aiifft01, autcor00, basefp01, bezier01, bitmnp01, cacheb01, canrdr01, coremark, cjpeg, conven00, dither01, djpeg, fbital00, fft00, idctrn01, iirflt01, matrix01, pktflow, pntrch01, puwmod01, rgbcmy01, rgbhpg01, rgbyiq01, rotate01, routelookup, rspeed01, tblook01, text01, ttsprk01, viterb00, average.]
Bar labels: 1.13 1.02 1.15 1.06 1.31 1.00 1.00 1.00 1.01 1.00 1.00 1.00 1.04 1.00 1.00 1.17 1.46 1.01 1.28 1.00 1.05 1.00 1.03 1.06 1.35 1.01 1.00 1.57 1.00 1.31 1.00 1.70 1.00 1.54 1.16
Summary label: 392 MIPS, 1.13x
What is a sensible number of JIT compilation threads?
[Line charts: Speedup (1.0 to 3.5) against Number of JIT Compilation Threads (1 to 14) for blastp, tcoffee, gcc, perlbench and bzip2]
[Cumulative Histogram: % of Benchmarks Benefiting from N JIT Compilation Threads (0% to 100%) against Number of JIT Compilation Threads (1 to 14)]
Measured on a 16-core machine
[Chart: 403.gcc - Regions Compiled per Time Interval, series "Regions compiled T1" and "Regions compiled T3"; annotation: higher throughput]
[Chart: 403.gcc - Average Queue Length per Time Interval, series "Avg Queue Length T1" and "Avg Queue Length T3"; annotation: smaller queue length]
Concurrent and Parallel JIT Compilation in Action
(trace sharing)
[Figure: two threads, Thread 1 (T1) and Thread 2 (T2), each alternate Native (native code execution) and Tracing phases over intervals 1 to 5 and record regions 1 to 12. Regions traced by both threads are marked as shared regions A, B, C and D. The translation queue lists each region entry with its thread tag and feeds Dynamic Compilation Worker Threads 1, 2 and 3.]
1 Region In Translation: Tag Existing Entry
Region Tagged for Multiple Threads: Register Translation for T1 and T2
2 Regions Already Translated: Retrieve from Translation Cache
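The two trace sharing cases can be sketched as a lookup table keyed by region. This is a single-threaded illustration under stated assumptions: regions are identified by a plain integer key, and `SharedTranslations`, `Lookup` and `Entry` are hypothetical names. A real implementation would also have to guard the table against concurrent access.

```cpp
#include <cassert>
#include <map>
#include <set>

// Outcome of a thread asking for a region to be translated.
enum class Lookup { Enqueued, TaggedExisting, CacheHit };

struct Entry {
    bool translated = false;  // true once native code is available
    std::set<int> threads;    // threads tagged on this entry
};

class SharedTranslations {
public:
    // Thread `thread` requests a translation of `region`.
    Lookup request(int region, int thread) {
        auto it = table_.find(region);
        if (it == table_.end()) {
            table_[region].threads.insert(thread);  // new entry, goes to the queue
            return Lookup::Enqueued;
        }
        it->second.threads.insert(thread);
        // Case 2: already translated, retrieve from the translation cache.
        // Case 1: still in translation, tag the existing entry.
        return it->second.translated ? Lookup::CacheHit : Lookup::TaggedExisting;
    }

    // A worker finished `region`; returns the threads to register the code for.
    std::set<int> complete(int region) {
        Entry& e = table_[region];
        e.translated = true;
        return e.threads;
    }

private:
    std::map<int, Entry> table_;
};
```

With this shape a region traced by both T1 and T2 is compiled once, and the finished translation is registered for every thread tagged on its entry.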
Novel interval-based region code discovery scheme enables concurrent and parallel JIT compilation and is able to deliver:
average reduction of execution time of 11.5%, and up to 51.9%, across 60 industry standard benchmarks
We minimise JIT compilation overhead and effectively hide compilation latency by combining:
light-weight interval-based tracing
dynamic work scheduling
adaptive hotspot threshold selection
concurrent and parallel JIT compilation
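The two percentages can be related to the speedup figures on the results slides. The slides do not state how the reductions were computed; the working below assumes the usual relation between a speedup s and the relative reduction in execution time r.

```latex
r = 1 - \frac{1}{s}
\qquad
s = 2.08 \;\Rightarrow\; r = 1 - \frac{1}{2.08} \approx 1 - 0.481 = 51.9\%
\qquad
r = 11.5\% \;\Rightarrow\; s = \frac{1}{1 - 0.115} \approx 1.13
```

Under that assumption the 51.9% maximum matches the largest speedup label on the BioPerf chart (2.08), and the 11.5% average corresponds to a speedup of about 1.13x.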