Reducing Dynamic Compilation Latency, Igor Böhm (presentation slides)


slide-1
SLIDE 1

Igor Böhm

Processor Automated Synthesis by iTerative Analysis

The University of Edinburgh

Reducing Dynamic Compilation Latency

LLVM’12 - European Conference, London

slide-2
SLIDE 2

Concurrent and Parallel Dynamic Compilation


slide-7
SLIDE 7

What do we want to improve?

[Figure: execution timeline alternating between Interp (interpretation, slow) and Native (native code execution, fast) phases, with dynamic compilation in between.]

Initially, code is interpreted.

Frequently executed code is compiled on the fly.

Execution switches from interpretive to native code execution as soon as dynamically compiled code is available.

slide-8
SLIDE 8

What do we want to improve?

[Figure: execution timeline of Interp (interpretation, slow) and Native (native code execution, fast) phases, with dynamic compilation in between.]

Earlier transition from interpretive to native execution

slide-12
SLIDE 12

Dynamic Compilation using one Concurrent JIT Compiler

[Figure 1: timeline of the main thread and one JIT compiler thread (Thread 1). Legend: Interp = interpretation, Profile = interpretation with profiling, Compile = dynamic compilation, Native = native code execution. The critical path is marked.]

slide-17
SLIDE 17

Dynamic Compilation using one Concurrent JIT Compiler

[Figure 1: timeline of the main thread and one JIT compiler thread (Thread 1). The critical path is marked.]

Dynamic Compilation using Concurrent and Parallel JIT Compiler Task Farm

[Figure 2: timeline of the main thread and three JIT compiler threads (Thread 1, Thread 2, Thread 3). The critical path is marked.]

Legend for both figures: Interp = interpretation, Profile = interpretation with profiling, Compile = dynamic compilation, Native = native code execution.

slide-20
SLIDE 20

Solution To Dynamic Compilation Latency Problem

Concurrent and Parallel Dynamic Compilation

[Figure: timeline of the main thread (Profile and Native phases) and three JIT compiler threads (Thread 1, Thread 2, Thread 3) running Compile tasks.]

improve code discovery/profiling

improve dynamic compilation workload throughput

slide-23
SLIDE 23

How hard can Code Discovery be?

Static: Java byte-code, JavaScript, CIL

Dynamic: x86 binary, ARCompact binary, ARM binary

slide-24
SLIDE 24

How hard can Code Discovery be?

“A crucial problem in the decompilation or disassembly of computer programs is the identification of executable code, i.e. the separation of instructions from data. This problem, for most computer architectures, is equivalent to the Halting Problem and is therefore unsolvable in general.”

[Horspool and Marovac - 1980]

slide-25
SLIDE 25

Incremental Code Discovery

[Figure: a sequence of interpreted basic blocks (A, B, C, ...) laid out along a time axis and divided into trace intervals. Legend: basic block, CFG edges, return to interpreter, trace interval.]

slide-28
SLIDE 28

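Incremental Code Discovery

[Figure: the sequence of interpreted basic blocks along the time axis, divided into trace intervals, with the region discovered so far shown at the end of each interval. Legend: basic block, CFG edges, return to interpreter, trace interval.]

Region after t': A B C D

Region after t'': A B C D E F G

Region after t''': A B C D E F G H I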

slide-29
SLIDE 29

Incremental Code Discovery

Region == Dynamic CFG

[Figure: the sequence of interpreted basic blocks along the time axis, and the region after t''' covering basic blocks A B C D E F G H I. Legend: basic block, CFG edges, return to interpreter, trace interval.]
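
The growth of a region over successive trace intervals can be sketched as below. This is only an illustration under assumptions, not the simulator's code: `Region`, `record_block`, `record_interval` and the single-character block labels are hypothetical names chosen to mirror the figure.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical region type: a dynamic CFG that grows as blocks are interpreted.
struct Region {
  std::set<char> blocks;                 // basic blocks discovered so far
  std::map<char, std::set<char>> edges;  // control-flow edges observed so far
  bool has_prev = false;
  char prev = 0;

  // Record one interpreted basic block and the edge that led to it.
  void record_block(char block) {
    // Invariant: the previously executed block is already part of the region.
    assert(!has_prev || blocks.count(prev) == 1);
    blocks.insert(block);
    if (has_prev) edges[prev].insert(block);
    prev = block;
    has_prev = true;
  }

  // Discovered blocks in label order, e.g. "ABCD".
  std::string block_list() const {
    return std::string(blocks.begin(), blocks.end());
  }
};

// Feed one trace interval (a sequence of interpreted blocks) into the region.
inline void record_interval(Region& region, const std::string& interval) {
  for (char block : interval) region.record_block(block);
}
```

Feeding three illustrative intervals such as "AABCD", "EFGA" and "BIFGH" (not the exact sequences in the figure) grows the block set from ABCD to ABCDEFG to ABCDEFGHI, while `edges` accumulates the dynamic CFG.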

slide-31
SLIDE 31

Concurrent and Parallel JIT Compilation in Action (reducing the critical path)

trace right from the start

[Figure: simulation timeline alternating between Trace (dynamic code discovery) and Native (native code execution) phases.]

slide-32
SLIDE 32

Concurrent and Parallel JIT Compilation in Action (reducing the critical path)

break tracing into intervals

page is a fixed size container for translation

[Figure: simulation timeline alternating between Trace (dynamic code discovery) and Native (native code execution) phases, with tracing split into Interval 1, Interval 2 and Interval 3. Regions 1 to 4 are discovered during tracing, each contained in a page. A page, labelled "Page (size variable)", is shown holding the following instructions:]

    ext r2,r9
    xor r3,r12,r2
    and r3,r3,0xf
    asl r3,r3,0x3
    and r2,r2,0x7
    or r3,r3,r2
    asl r4,r3,0x8
    brcc.d r10,r13,0x2c
    or r4,r4,r3

slide-34
SLIDE 34

Concurrent and Parallel JIT Compilation in Action (reducing the critical path)

1. Hide Dynamic Compilation Latency

hide compilation latency

async registration of compiled regions

page is a fixed size container for translation

[Figure: simulation timeline alternating between Trace (dynamic code discovery) and Native (native code execution) phases over Interval 1, Interval 2 and Interval 3. Regions 1 to 4, each contained in a page of target instructions labelled "Page (size variable)", are handed to Dynamic Compilation Worker Threads 1 to 3.]

slide-36
SLIDE 36

Concurrent and Parallel JIT Compilation in Action (reducing the critical path)

1. Hide Dynamic Compilation Latency

2. Exploit Task Parallelism

3. Temporal Region Partitioning

4. Spatial Region Partitioning

hide compilation latency

exploit task parallelism

[Figure: simulation timeline alternating between Trace (dynamic code discovery) and Native (native code execution) phases over Intervals 1 to 5. Regions 1 to 12, each contained in a page of target instructions labelled "Page (size variable)", are compiled by Dynamic Compilation Worker Threads 1 to 3.]
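
One way to read temporal and spatial region partitioning is sketched below: blocks are grouped by the page that contains them (spatial), and each trace interval yields its own set of regions (temporal). Everything here is an assumption made for illustration, including the names `RegionRecorder` and `PageRegion` and the page size `kPageSize`; the slides do not give these details.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Assumed page size in bytes; the slides only say a page is a container for translation.
constexpr std::uint32_t kPageSize = 8 * 1024;

// A region: the blocks recorded in one page during one trace interval.
struct PageRegion {
  std::uint32_t page;              // page index = block address / kPageSize
  unsigned interval;               // trace interval in which it was recorded
  std::set<std::uint32_t> blocks;  // start addresses of the recorded blocks
};

class RegionRecorder {
 public:
  // Spatial partitioning: a block belongs to the region of its page.
  void record(std::uint32_t block_addr) {
    current_[block_addr / kPageSize].insert(block_addr);
  }

  // Temporal partitioning: closing an interval turns every page touched
  // during it into a separate region, then starts again from a clean slate.
  std::vector<PageRegion> end_interval() {
    std::vector<PageRegion> regions;
    for (const auto& entry : current_) {
      assert(!entry.second.empty());  // a page entry exists only once a block was recorded
      regions.push_back(PageRegion{entry.first, interval_, entry.second});
    }
    current_.clear();
    ++interval_;
    return regions;
  }

 private:
  std::map<std::uint32_t, std::set<std::uint32_t>> current_;
  unsigned interval_ = 1;
};
```

Because each `PageRegion` is independent of the others, the regions returned by `end_interval()` can be compiled as separate tasks, which is what makes the task parallelism in the figure possible.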

slide-41
SLIDE 41

Concurrent and Parallel JIT Compiler Design

Execution Loop

[Flowchart: the loop starts from the PC address. Block translated? Yes: native code execution. No: new block? Yes: record block in region. Then interpretive block simulation. End of trace interval? No: continue with the next block. Yes: analyse recorded regions. Hot regions present? No: continue. Yes: enqueue hot regions and continue.]

JIT Compilation Task Farm

1. Enqueue: hot regions enter the translation priority queue (Region 1, Region 7, Region 2, Region 6, ..., Region N).

2. Continue: the execution loop carries on.

3. Concurrent shared data structure: the translation priority queue.

4. Dequeue and farm out: JIT compilation threads 1 to N each run Create LLVM IR, Optimise, Compile, Link Native Code.

adaptive hotspot selection

dynamic work scheduling
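
The execution loop in the flowchart might be sketched as follows. This is a simplified illustration rather than the simulator's implementation: the type `ExecutionLoop`, the constants `kIntervalLength` and `kHotThreshold`, and the choice to count executions per block (instead of recording regions) are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <queue>
#include <set>

constexpr unsigned kIntervalLength = 4;  // blocks per trace interval (assumed)
constexpr unsigned kHotThreshold = 2;    // executions that make a block hot (assumed)

struct ExecutionLoop {
  std::set<std::uint32_t> translated;          // blocks with native code available
  std::map<std::uint32_t, unsigned> recorded;  // blocks recorded in this interval
  std::queue<std::uint32_t> compile_queue;     // hot work handed to the JIT
  unsigned blocks_in_interval = 0;
  unsigned native_runs = 0;
  unsigned interpreted_runs = 0;

  // One iteration of the loop for the block starting at `pc`.
  void step(std::uint32_t pc) {
    assert(blocks_in_interval < kIntervalLength);
    if (translated.count(pc)) {  // block translated: native code execution
      ++native_runs;
    } else {
      ++recorded[pc];            // record the block, then simulate it interpretively
      ++interpreted_runs;
    }
    if (++blocks_in_interval == kIntervalLength) end_of_trace_interval();
  }

  // Analyse what was recorded; enqueue hot blocks and continue.
  void end_of_trace_interval() {
    for (const auto& entry : recorded) {
      if (entry.second >= kHotThreshold) compile_queue.push(entry.first);
    }
    recorded.clear();
    blocks_in_interval = 0;
  }
};
```

The loop never waits for the compiler: it only enqueues work and keeps interpreting until a later lookup finds the block in `translated`.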

slide-42
SLIDE 42

Concurrent and Parallel JIT Compiler Design Based on LLVM

Key Components:

llvm::LLVMContext - owns and manages core ‘global’ data of LLVM’s core infrastructure

llvm::ExecutionEngine - abstract, easy to use interface for implementation execution of LLVM modules

state-of-the-art set of optimisation passes

slide-43
SLIDE 43

Concurrent and Parallel JIT Compiler Design Based on LLVM

Key Concepts:

dispatch of compilation units via thread-safe priority queue abstraction

each JIT compiler thread owns private llvm::ExecutionEngine instance, enabling parallel JIT compilation without explicit synchronisation

asynchronous registration of compiled native code
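
Asynchronous registration of compiled native code could look roughly like the sketch below, in which a worker publishes a function pointer and the execution loop picks it up on a later lookup without blocking. The table layout, the names `TranslationTable`, `register_native` and `lookup`, and the use of `std::atomic` are assumptions for illustration; the slides do not show how registration is implemented.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

using NativeFn = int (*)(int);

// Hypothetical translation table: one slot per region, nullptr = not compiled yet.
template <std::size_t N>
struct TranslationTable {
  std::atomic<NativeFn> slots[N];

  TranslationTable() {
    for (auto& slot : slots) slot.store(nullptr, std::memory_order_relaxed);
  }

  // Called by a JIT worker thread once native code for `region` exists.
  void register_native(std::size_t region, NativeFn fn) {
    assert(region < N);
    slots[region].store(fn, std::memory_order_release);
  }

  // Called by the execution loop; never blocks.
  NativeFn lookup(std::size_t region) const {
    assert(region < N);
    return slots[region].load(std::memory_order_acquire);
  }
};

// Stand-in for JIT-compiled code.
inline int compiled_region(int x) { return x + 1; }
```

Until a worker calls `register_native`, `lookup` returns `nullptr` and the region keeps being interpreted; afterwards the same lookup returns the native entry point.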

slide-46
SLIDE 46

Concurrent and Parallel JIT Compiler Design Based on LLVM

    class JITThread : public Thread {
    private:
      llvm::LLVMContext*     CTX_;  // per thread LLVMContext
      llvm::Module*          MOD_;  // per thread main Module
      llvm::ExecutionEngine* ENG_;  // per thread ExecutionEngine
      ...
    public:
      void create() {
        CTX_ = new llvm::LLVMContext();
        MOD_ = new llvm::Module("module", *CTX_);
        ENG_ = llvm::EngineBuilder(MOD_)
                 .setEngineKind(llvm::EngineKind::JIT)
                 .create();
        ...
      }

      void run() {
        for ( ; /* ever */ ; ) {
          queue.mutex.acquire();
          while (queue.empty()) {  // wait for work if queue is empty
            queue.condvar.wait(queue.mutex);
          }
          WorkUnit* u = queue.top();  // retrieve compilation unit
          queue.pop();
          queue.mutex.release();

          llvm::Function* f = Codegen(u);                // generate IR
          void* native = ENG_->getPointerToFunction(f);  // run JIT
          // register native translation for execution
          ...
        }
      }
    };
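
The queue side of the task farm can be sketched with the C++ standard library alone. This is a hedged illustration, not the code behind the slides: `TranslationQueue`, `WorkUnit`, `close` and `worker_loop` are hypothetical names, and the LLVM code generation step is replaced by appending the region id to a vector.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical compilation unit: a region id plus a hotness-based priority.
struct WorkUnit {
  int region;
  int priority;
  bool operator<(const WorkUnit& other) const { return priority < other.priority; }
};

// Thread-safe priority queue shared by the execution loop and the JIT workers.
class TranslationQueue {
 public:
  void push(WorkUnit unit) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(unit);
    }
    condvar_.notify_one();
  }

  // Signal that no more work will arrive so that idle workers can exit.
  void close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    condvar_.notify_all();
  }

  // Blocks until work is available; returns false once closed and drained.
  bool pop(WorkUnit& out) {
    std::unique_lock<std::mutex> lock(mutex_);
    condvar_.wait(lock, [this] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) return false;
    out = queue_.top();  // highest priority first
    queue_.pop();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable condvar_;
  std::priority_queue<WorkUnit> queue_;
  bool closed_ = false;
};

// Body of one JIT worker. `compiled` stands in for code generation and
// registration of the native translation.
inline void worker_loop(TranslationQueue& queue, std::vector<int>& compiled) {
  WorkUnit unit;
  while (queue.pop(unit)) {
    assert(unit.priority >= 0);
    compiled.push_back(unit.region);
  }
}
```

In a task farm, each worker would run `worker_loop` on its own `std::thread` with its own output vector, so that the queue is the only state shared between threads; the lock is held only while dequeuing, never while compiling.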

slide-47
SLIDE 47

Evaluation

Extensive evaluation using over 60 industry standard benchmarks built for the ARCompact RISC platform:

BioPerf

SPEC CPU 2006

EEMBC and CoreMark

Target Platform: ARCompact RISC ISA targeting ARC 700 processor

Simulation Platform: standard x86 Dell Intel Xeon quad-core machine

slide-52
SLIDE 52

Speedup BioPerf

Measured on standard x86 quad-core machine

[Figure: bar chart of speedup (y-axis 0.5 to 2.5) for clustalw, fasta-ssearch, promlk, grappa, hmmsearch, hmmpfam, tcoffee, blastp, glimmer, ce and the average. Legend: Interpreted-only Execution; Execution using concurrent JIT Compiler; Execution using concurrent and parallel JIT Compiler; Baseline.]

Bar value labels, first group: 0.44 0.08 0.19 0.68 0.81 0.12 0.94 0.47 0.06 0.11 0.34

Bar value labels, second group: 1.38 1.01 1.43 2.08 1.29 1.03 1.52 1.35 1.02 1.07 2.01

MIPS labels: 181 MIPS, 329 MIPS, 466 MIPS, 81 MIPS, 50 MIPS, 258 MIPS, 60 MIPS, 143 MIPS, 234 MIPS, 360 MIPS

Average: 217 MIPS, 1.38x

slide-57
SLIDE 57

Speedup SPEC CPU 2006

Measured on standard x86 quad-core machine

very long running CPU intensive benchmarks [worst-case scenario]

[Figure: bar chart of speedup (y-axis 0.2 to 2.0) for perlbench, bzip2, gcc, mcf, milc, gobmk, soplex, povray, hmmer, sjeng, libquantum, h264ref, lbm, omnetpp, astar, sphinx3, xalancbmk and the average. Legend: Interpreted-only Execution; Execution using concurrent JIT Compiler; Execution using concurrent and parallel JIT Compiler; Baseline.]

Bar value labels, first group: 0.28 1.22 0.09 0.09 0.26 0.15 0.09 0.07 0.18 0.08 0.15 0.11 0.17 0.08 0.21 0.85 0.10 0.88

Bar value labels, second group: 1.15 1.47 1.04 1.07 1.16 1.01 1.06 1.04 1.03 1.02 1.20 1.01 1.08 1.00 1.06 2.04 1.07 1.18

MIPS labels: 26 MIPS, 403 MIPS, 53 MIPS, 201 MIPS, 378 MIPS, 153 MIPS, 254 MIPS, 209 MIPS, 360 MIPS, 145 MIPS, 617 MIPS, 368 MIPS, 170 MIPS, 96 MIPS, 353 MIPS, 319 MIPS, 24 MIPS

Average: 243 MIPS, 1.15x

slide-62
SLIDE 62

Speedup EEMBC CoreMark

Measured on standard x86 quad-core machine

very short running embedded benchmarks [worst-case scenario]

[Figure: bar chart of speedup (y-axis 0.2 to 2.0) for a2time01, aifftr01, aifirf01, aiifft01, autcor00, basefp01, bezier01, bitmnp01, cacheb01, canrdr01, coremark, cjpeg, conven00, dither01, djpeg, fbital00, fft00, idctrn01, iirflt01, matrix01, ospf, pktflow, pntrch01, puwmod01, rgbcmy01, rgbhpg01, rgbyiq01, rotate01, routelookup, rspeed01, tblook01, text01, ttsprk01, viterb00 and the average. Legend: Interpreted-only Execution; Execution using concurrent JIT Compiler; Execution using concurrent and parallel JIT Compiler; Baseline.]

Bar value labels: 1.13 1.02 1.15 1.06 1.31 1.00 1.00 1.00 1.01 1.00 1.00 1.00 1.04 1.00 1.00 1.17 1.46 1.01 1.28 1.00 1.05 1.00 1.03 1.06 1.35 1.01 1.00 1.57 1.00 1.31 1.00 1.70 1.00 1.54 1.16

Average: 392 MIPS, 1.13x

slide-63
SLIDE 63

How far does it scale?

What is a sensible number of JIT compilation threads?

[Figure: five line plots of speedup (y-axis 1.0 to 3.5) against the number of JIT compilation threads (1 to 14) for blastp, tcoffee, gcc, perlbench and bzip2, plus a cumulative histogram of the percentage of benchmarks benefiting from N JIT compilation threads (N = 1 to 14, y-axis 0% to 100%).]

Measured on a 16-core machine

slide-72
SLIDE 72

Effect of Concurrent and Parallel JIT Compilation on Throughput

17

[Figure: 403.gcc - Regions Compiled. Series: Regions compiled T1, Regions compiled T3. Axis label: Time Interval. Annotation: higher throughput.]

[Figure: 403.gcc - Average Queue Length. Series: Avg Queue Length T1, Avg Queue Length T3. Axis label: Time Interval. Annotation: smaller queue length.]
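
The relationship between the two charts can be illustrated with a toy queue model. This is only a sketch under stated assumptions, not the measurement setup behind the 403.gcc data: it assumes that T1 and T3 stand for one and three compilation worker threads, that each worker translates one region per interval, and it uses made-up arrival counts. The names `simulate`, `arrivals` and `per_worker` are hypothetical.

```python
from collections import deque


def simulate(arrivals, workers, per_worker=1):
    """Toy model: return (regions compiled, queue length) per interval.

    arrivals[i] is the number of regions discovered in interval i.
    Each worker is assumed to translate `per_worker` regions per interval.
    """
    queue = deque()
    compiled, queue_len = [], []
    next_id = 0
    for n in arrivals:
        # Regions discovered during this interval are enqueued.
        for _ in range(n):
            queue.append(next_id)
            next_id += 1
        # Workers drain the queue up to their combined capacity.
        done = min(len(queue), workers * per_worker)
        for _ in range(done):
            queue.popleft()
        compiled.append(done)
        queue_len.append(len(queue))
    return compiled, queue_len


# Made-up workload: a burst of hot regions followed by a quiet phase.
arrivals = [4, 6, 3, 0, 2, 0, 0, 0]
t1 = simulate(arrivals, workers=1)
t3 = simulate(arrivals, workers=3)
print("T1 compiled:", t1[0], "queue:", t1[1])
print("T3 compiled:", t3[0], "queue:", t3[1])
```

In this model the three-worker configuration compiles more regions per interval during the burst and empties its queue, while the single worker still has a backlog at the end. That mirrors the "higher throughput" and "smaller queue length" annotations on the charts.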

slide-73
SLIDE 73

18

Does this scale to multi-threaded/multi-core applications?

slide-74
SLIDE 74

Concurrent and Parallel JIT Compilation in Action

(trace sharing)

19

[Figure: execution timelines of two threads, Thread 1 (T1) and Thread 2 (T2), across Intervals 1 to 5. Each timeline alternates between tracing phases and native code execution. The traced regions are numbered Region 1 to Region 12, and the markers A, B, C and D appear on the timelines.]

slide-75
SLIDE 75

Concurrent and Parallel JIT Compilation in Action

(trace sharing)

19

[Figure: execution timelines of Thread 1 (T1) and Thread 2 (T2) across Intervals 1 to 5, alternating between tracing phases and native code execution over Regions 1 to 12, shown together with Dynamic Compilation Worker Threads 1 to 3. The markers A, B, C and D are labelled as shared regions.]

slide-76
SLIDE 76

Concurrent and Parallel JIT Compilation in Action

(trace sharing)

19

[Figure: execution timelines of Thread 1 (T1) and Thread 2 (T2) across Intervals 1 to 5, with Dynamic Compilation Worker Threads 1 to 3 and shared regions A to D. Regions 1 to 4 appear with thread tags (T1 or T2). Callout 1, "Region In Translation: Tag Existing Entry": a region that is already in translation is tagged for multiple threads, and its translation is registered for both T1 and T2.]

slide-77
SLIDE 77

Concurrent and Parallel JIT Compilation in Action

(trace sharing)

19

[Figure: execution timelines of Thread 1 (T1) and Thread 2 (T2) across Intervals 1 to 5, with Dynamic Compilation Worker Threads 1 to 3 and shared regions A to D. Regions 1 to 12 appear with thread tags (T1 or T2). Callout 1, "Region In Translation: Tag Existing Entry": a region that is already in translation is tagged for multiple threads, and its translation is registered for both T1 and T2. Callout 2, "Regions Already Translated: Retrieve from Translation Cache": a region that has already been translated is retrieved from the translation cache.]
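
The two trace-sharing cases on this slide can be sketched as a small lookup structure. This is a minimal illustration under assumptions, not the simulator's actual data structure: the class `TranslationRegistry`, its method names and the string return values are all hypothetical, and a plain string stands in for translated native code.

```python
import threading


class TranslationRegistry:
    """Hypothetical registry for sharing region translations between threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}      # region id -> translated code (stand-in object)
        self._in_flight = {}  # region id -> set of threads tagged on the entry
        self.work_queue = []  # region ids handed to compilation workers

    def request(self, region, thread):
        """Handle a traced region; return 'cached', 'tagged' or 'enqueued'."""
        with self._lock:
            if region in self._cache:
                # Case 2: region already translated, retrieve from the cache.
                return "cached"
            if region in self._in_flight:
                # Case 1: region in translation, tag the existing entry.
                self._in_flight[region].add(thread)
                return "tagged"
            # First request: create an entry and dispatch it exactly once.
            self._in_flight[region] = {thread}
            self.work_queue.append(region)
            return "enqueued"

    def complete(self, region, code):
        """Called by a worker; returns the threads the translation is registered for."""
        with self._lock:
            self._cache[region] = code
            return self._in_flight.pop(region)

    def lookup(self, region):
        """Return the cached translation, or None if there is none yet."""
        with self._lock:
            return self._cache.get(region)


reg = TranslationRegistry()
first = reg.request(2, "T1")    # T1 traces Region 2 first
second = reg.request(2, "T2")   # T2 traces it while it is in translation
owners = reg.complete(2, "native-code-for-region-2")
third = reg.request(2, "T2")    # later requests hit the translation cache
print(first, second, sorted(owners), third)
```

The point of the sketch is that a shared region is dispatched to the compilation workers only once, however many threads trace it, so no translation work is duplicated.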

slide-78
SLIDE 78

Conclusions

20

Novel interval based region code discovery scheme enables concurrent and parallel JIT compilation and is able to deliver:

average reduction of execution time of 11.5%, and up to 51.9%, across 60 industry standard benchmarks

We minimise JIT compilation overhead and effectively hide compilation latency by combining:

light-weight interval based tracing
dynamic work scheduling
adaptive hotspot threshold selection
concurrent and parallel JIT compilation

slide-79
SLIDE 79

Demos

21

Video Decoding and Playback

slide-80
SLIDE 80

Demos

21

Video Decoding and Playback

Full System OS Simulation

slide-81
SLIDE 81

22

Thank You