Emulation
Michael Jantz
Acknowledgements
- Slides adapted from Chapter 2 of Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair
- Credit to Prasad A. Kulkarni: some slides were borrowed from his course on Virtual Machines at the University of Kansas
Outline
- Emulation
- Interpretation
- Basic, indirect threaded, and direct threaded
- Binary translation
- Code discovery, code location
- Other issues
- Control transfer optimizations
- Instruction set issues
Emulation vs. Simulation
- Emulation: process of implementing the interface / functionality of a (sub)system on a different system
- Applies specifically to an instruction set
- Different emulation techniques
- Interpretation (instruction-at-a-time)
- Binary translation (block-at-a-time)
- Simulation
- Method for modeling a system’s operation
- Goal is to study process – not to imitate function
Definitions
- Guest
- Environment supported by underlying platform
- Host
- Underlying platform used to provide an environment for the guest
[Figure: Guest supported by Host]
Definitions
- Source ISA or binary
- Original instruction set or binary
- The ISA to be emulated
- Target ISA or binary
- ISA of the host processor
- Underlying ISA
- Source / target refer to ISAs
- Guest / host refer to platforms
[Figure: Source emulated by Target]
Instruction Set Emulation
- Binaries in the source instruction set can be executed on a machine implementing the target instruction set
- Required for many VM implementations
- Example: IA-32 EL
Interpretation vs. Translation
- Interpretation
- Simple, easy to implement
- Low performance
- Binary translation
- Complex implementation
- Higher initial cost, better performance
- Techniques in between these extremes
- Predecoding
- Selective compilation
Interpreter State
- Must maintain state of machine implementing the source ISA
- Registers
- Memory
- Code
- Data
- Stack
[Figure: interpreter memory layout showing the source Code, Data, and Stack, the source context block (Program Counter, Condition Codes, Reg 0, Reg 1, ..., Reg n-1), and the Interpreter Code]
Decode-And-Dispatch Interpreter
- Decode-and-dispatch loop
- One instruction at a time
- Decode the current instruction
- Dispatch to corresponding interpreter routine
while (!halt && !interrupt) {
    inst = code[PC];
    opcode = extract(inst,31,6);
    switch(opcode) {
        case LoadWordAndZero: LoadWordAndZero(inst); break;
        case ALU: ALU(inst); break;
        case Branch: Branch(inst); break;
        . . .
    }
}
Decode-And-Dispatch Interpreter
LoadWordAndZero(inst){
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    displacement = extract(inst,15,16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32)>> 32;
    PC = PC + 4;
}
Decode-And-Dispatch Interpreter
ALU(inst){
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    RB = extract(inst,15,5);
    source1 = regs[RA];
    source2 = regs[RB];
    extended_opcode = extract(inst,10,10);
    switch(extended_opcode) {
        case Add: Add(inst); break;
        case AddCarrying: AddCarrying(inst); break;
        case AddExtended: AddExtended(inst); break;
        . . .
    }
    PC = PC + 4;
}
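The decode-and-dispatch pseudocode above can be turned into a small compilable sketch. This is only an illustration under stated assumptions: the `Machine` struct, the opcode values (loosely modeled on PowerPC encodings), the `enc_lwz` / `enc_add` helpers, and the treatment of the displacement as unsigned are all hypothetical choices, not taken from the slides. The `extract(inst, hi, len)` helper assumes bit 0 is the least significant bit, which matches the field positions used in the pseudocode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode values, loosely modeled on PowerPC encodings. */
enum { OP_LWZ = 32, OP_ALU = 31, XO_ADD = 266 };

typedef struct {
    uint32_t pc;        /* source program counter (byte address) */
    uint32_t regs[32];  /* source general-purpose registers */
    uint32_t data[64];  /* source data memory, indexed by byte address / 4 */
    int halt;
} Machine;

/* Extract `len` bits whose most significant bit is bit `hi` (bit 0 = LSB). */
static uint32_t extract(uint32_t inst, unsigned hi, unsigned len)
{
    uint32_t mask = (len < 32) ? ((1u << len) - 1u) : 0xFFFFFFFFu;
    return (inst >> (hi - len + 1)) & mask;
}

static void load_word_and_zero(Machine *m, uint32_t inst)
{
    uint32_t rt = extract(inst, 25, 5);
    uint32_t ra = extract(inst, 20, 5);
    uint32_t disp = extract(inst, 15, 16);      /* assumed unsigned here */
    uint32_t source = (ra == 0) ? 0 : m->regs[ra];
    uint32_t address = source + disp;
    assert(address / 4 < 64);
    m->regs[rt] = m->data[address / 4];
    m->pc += 4;
}

static void alu(Machine *m, uint32_t inst)
{
    uint32_t rt = extract(inst, 25, 5);
    uint32_t ra = extract(inst, 20, 5);
    uint32_t rb = extract(inst, 15, 5);
    switch (extract(inst, 10, 10)) {
    case XO_ADD: m->regs[rt] = m->regs[ra] + m->regs[rb]; break;
    default:     m->halt = 1; break;            /* unimplemented extended opcode */
    }
    m->pc += 4;
}

/* Central loop: fetch, decode the primary opcode, dispatch to a routine. */
void interpret(Machine *m, const uint32_t *code, uint32_t ncode)
{
    while (!m->halt && m->pc / 4 < ncode) {
        uint32_t inst = code[m->pc / 4];
        switch (extract(inst, 31, 6)) {
        case OP_LWZ: load_word_and_zero(m, inst); break;
        case OP_ALU: alu(m, inst); break;
        default:     m->halt = 1; break;        /* unknown opcode stops the machine */
        }
    }
}

/* Hypothetical encoders, used only to build small test programs. */
uint32_t enc_lwz(unsigned rt, unsigned ra, unsigned disp)
{
    return ((uint32_t)OP_LWZ << 26) | (rt << 21) | (ra << 16) | (disp & 0xFFFFu);
}

uint32_t enc_add(unsigned rt, unsigned ra, unsigned rb)
{
    return ((uint32_t)OP_ALU << 26) | (rt << 21) | (ra << 16) | (rb << 11)
         | ((uint32_t)XO_ADD << 1);
}
```

Every interpreted instruction passes through the `switch` in `interpret`, which is where the per-instruction branch overhead of this scheme comes from.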
Decode-And-Dispatch Efficiency
- Decode-and-dispatch loop
- Several branch instructions
- Indirect branch on switch statement
- Interpreting an add instruction
- Requires approximately 20 target instructions
- Several expensive loads/stores to memory
- Hand-coded assembly can improve performance
- Example: HotSpot JVM
Indirect Threaded Interpretation
- High number of branches in decode-and-dispatch loop reduces performance
- At least 5 branches per instruction
- Threaded interpretation
- Append dispatch code to each interpretation routine
- Removes 3 branches
- Threads interpretation routines together
Indirect Threaded Interpretation
LoadWordAndZero:
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    displacement = extract(inst,15,16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];
    opcode = extract(inst,31,6);
    extended_opcode = extract(inst,10,10);
    routine = dispatch[opcode,extended_opcode];
    goto *routine;
Indirect Threaded Interpretation
Add:
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    RB = extract(inst,15,5);
    source1 = regs[RA];
    source2 = regs[RB];
    sum = source1 + source2;
    regs[RT] = sum;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];
    opcode = extract(inst,31,6);
    extended_opcode = extract(inst,10,10);
    routine = dispatch[opcode,extended_opcode];
    goto *routine;
Indirect Threaded Interpretation
- Dispatch occurs indirectly through a table
- Interpretation routines can be modified and relocated independently
- Advantages
- Interpretation routines still portable
- Improves efficiency over decode-and-dispatch
- Disadvantages
- Increases interpreter code size
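A minimal sketch of indirect threading for a toy ISA, assuming the GNU C "labels as values" extension (`&&label` and `goto *ptr`, available in GCC and Clang but not in standard C). The opcodes, the instruction layout behind the `OPC` / `RT` / `RA` / `RB` macros, and the `INSTR` builder are hypothetical. Each routine ends with its own copy of the dispatch code, and the routine address is still found through the `dispatch` table.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_HALT, OP_LOADI, OP_ADD, NUM_OPS };    /* hypothetical toy opcodes */

/* Hypothetical layout: opcode[31:24] rt[23:16] ra[15:8] rb-or-immediate[7:0] */
#define OPC(i) ((i) >> 24)
#define RT(i)  (((i) >> 16) & 0xFF)
#define RA(i)  (((i) >> 8) & 0xFF)
#define RB(i)  ((i) & 0xFF)
#define INSTR(op, rt, ra, rb) \
    (((uint32_t)(op) << 24) | ((uint32_t)(rt) << 16) | ((uint32_t)(ra) << 8) | (uint32_t)(rb))

/* Runs `code` until OP_HALT; returns the final source PC (in words). */
uint32_t run_indirect_threaded(const uint32_t *code, uint32_t regs[8])
{
    /* Dispatch table: opcode -> address of its interpreter routine. */
    static void *const dispatch[NUM_OPS] = { &&do_halt, &&do_loadi, &&do_add };
    uint32_t pc = 0;
    uint32_t inst = code[pc];

    assert(OPC(inst) < NUM_OPS);
    goto *dispatch[OPC(inst)];

do_loadi:                                       /* regs[rt] = immediate */
    regs[RT(inst) & 7] = RB(inst);
    pc++;
    inst = code[pc];                            /* dispatch code appended to the routine */
    assert(OPC(inst) < NUM_OPS);
    goto *dispatch[OPC(inst)];

do_add:                                         /* regs[rt] = regs[ra] + regs[rb] */
    regs[RT(inst) & 7] = regs[RA(inst) & 7] + regs[RB(inst) & 7];
    pc++;
    inst = code[pc];
    assert(OPC(inst) < NUM_OPS);
    goto *dispatch[OPC(inst)];

do_halt:
    return pc;
}
```

There is no central loop: control threads from one routine directly to the next, at the cost of repeating the fetch and dispatch code in every routine.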
Indirect Threaded Interpretation
[Figure: Decode-dispatch (source code is read through "data" accesses by a dispatch loop, which calls the interpreter routines) vs. Threaded (source code is read directly by the interpreter routines)]
Predecoding
- Parse each instruction into a pre-defined data structure to facilitate interpretation
- Separate opcodes, operands, etc.
- Reduces shifts / masks for decoding
- More useful when source ISA is CISC
lwz r1, 8(r2)
add r3, r3,r1
stw r3, 0(r4)
Predecoding
struct instruction {
    unsigned long op;
    unsigned char dest, src1, src2;
} code [CODE_SIZE];

LoadWordandZero:
    RT = code[TPC].dest;
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    SPC = SPC + 4;
    TPC = TPC + 1;
    if (halt || interrupt) goto exit;
    opcode = code[TPC].op;
    routine = dispatch[opcode];
    goto *routine;
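The predecoding step itself is not shown on the slides; the sketch below is one way it might look. The 32-bit source layout (`op[31:26] dest[25:21] src1[20:16] src2[15:11]`) and the function names are assumptions made for illustration. With fixed 4-byte source instructions, entry `i` of the output corresponds to source PC `4*i`, which is why the routine above advances SPC by 4 and TPC by 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Predecoded (intermediate) form, following the struct shown above. */
struct instruction {
    unsigned long op;
    unsigned char dest, src1, src2;
};

/* Hypothetical source layout: op[31:26] dest[25:21] src1[20:16] src2[15:11]. */
static struct instruction predecode_one(uint32_t inst)
{
    struct instruction d;
    d.op   = (inst >> 26) & 0x3F;
    d.dest = (unsigned char)((inst >> 21) & 0x1F);
    d.src1 = (unsigned char)((inst >> 16) & 0x1F);
    d.src2 = (unsigned char)((inst >> 11) & 0x1F);
    return d;
}

/* Predecode n source words; out[i] corresponds to source PC 4*i. */
void predecode(const uint32_t *src, size_t n, struct instruction *out)
{
    assert(n == 0 || (src != NULL && out != NULL));
    for (size_t i = 0; i < n; i++)
        out[i] = predecode_one(src[i]);
}
```

All shifting and masking happens once per static instruction here, so the interpreter routines only read struct fields at run time.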
Direct Threaded Interpretation
- Replace table lookup with direct access to address of interpreter routine
- Requires predecoding
- Reduces portability
Direct Threaded Interpretation
LoadWordandZero:
    RT = code[TPC].dest;
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    SPC = SPC + 4;
    TPC = TPC + 1;
    if (halt || interrupt) goto exit;
    routine = code[TPC].op;
    goto *routine;
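A compilable sketch of direct threading for a toy ISA, again assuming the GNU C labels-as-values extension (GCC and Clang). The `struct raw` source form, the opcodes, and `MAX_CODE` are hypothetical. The predecode pass stores the routine address in each instruction's `op` field, so the run-time dispatch is a single `goto *code[tpc].op` with no table lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_HALT, OP_LOADI, OP_ADD, NUM_OPS };    /* hypothetical toy opcodes */
#define MAX_CODE 64                             /* hypothetical program size limit */

struct raw { unsigned char opcode, dest, src1, src2; };  /* hypothetical source form */

struct instruction {
    void *op;                                   /* address of the interpreter routine */
    unsigned char dest, src1, src2;
};

/* Predecodes `src`, then runs it; returns the final TPC. */
uint32_t run_direct_threaded(const struct raw *src, size_t n, uint32_t regs[8])
{
    static void *const routines[NUM_OPS] = { &&do_halt, &&do_loadi, &&do_add };
    struct instruction code[MAX_CODE];
    size_t tpc = 0;

    assert(n > 0 && n <= MAX_CODE);
    assert(src[n - 1].opcode == OP_HALT);       /* keeps this sketch from running off the end */

    /* Predecode: resolve each opcode to a routine address once, up front. */
    for (size_t i = 0; i < n; i++) {
        assert(src[i].opcode < NUM_OPS);
        code[i].op   = routines[src[i].opcode];
        code[i].dest = src[i].dest & 7;
        code[i].src1 = src[i].src1 & 7;
        code[i].src2 = src[i].src2;             /* register number or immediate */
    }
    goto *code[tpc].op;

do_loadi:                                       /* regs[dest] = immediate */
    regs[code[tpc].dest] = code[tpc].src2;
    tpc++;
    goto *code[tpc].op;                         /* direct jump, no dispatch table */

do_add:                                         /* regs[dest] = regs[src1] + regs[src2] */
    regs[code[tpc].dest] = regs[code[tpc].src1] + regs[code[tpc].src2 & 7];
    tpc++;
    goto *code[tpc].op;

do_halt:
    return (uint32_t)tpc;
}
```

Because the intermediate code now embeds addresses inside one particular interpreter binary, it cannot be saved and reused elsewhere, which is the portability cost noted above.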
Direct Threaded Interpretation
[Figure: source code passes through a predecoder to produce intermediate code, which is executed by the interpreter routines]
Binary Translation
- Convert source binary to target binary before execution
- Logical conclusion of predecoding
- Removes parsing and jumps altogether
- Allows optimizations on native code
- Achieves better performance than interpretation
- Generated code no longer portable
Binary Translation
[Figure: source code is converted by the binary translator into binary translated target code]
Binary Translation
x86 Source Binary
addl %edx,4(%eax)
movl 4(%eax),%edx
add %eax,4
Translate to PowerPC Target
r1 points to x86 register context block
r2 points to x86 memory image
r3 contains x86 ISA PC value
Binary Translation
lwz r4,0(r1)     ;load %eax from register block
addi r5,r4,4     ;add 4 to %eax
lwzx r5,r2,r5    ;load operand from memory
lwz r4,12(r1)    ;load %edx from register block
add r5,r4,r5     ;perform add
stw r5,12(r1)    ;put result into %edx
addi r3,r3,3     ;update PC (3 bytes)

lwz r4,0(r1)     ;load %eax from register block
addi r5,r4,4     ;add 4 to %eax
lwz r4,12(r1)    ;load %edx from register block
stwx r4,r2,r5    ;store %edx value into memory
addi r3,r3,3     ;update PC (3 bytes)

lwz r4,0(r1)     ;load %eax from register block
addi r4,r4,4     ;add immediate
stw r4,0(r1)     ;place result back into %eax
addi r3,r3,3     ;update PC (3 bytes)
Register Mapping
- Map source registers to target registers
- Reduces memory loads / stores
- If the target has fewer registers than the source
- Map some to memory
- Map on per-block basis
Register Mapping
r1 points to x86 register context block
r2 points to x86 memory image
r3 contains x86 ISA PC value
r4 holds x86 register %eax
r7 holds x86 register %edx
etc.

addi r16,r4,4    ;add 4 to %eax
lwzx r17,r2,r16  ;load operand from memory
add r7,r17,r7    ;perform add of %edx
addi r16,r4,4    ;add 4 to %eax
stwx r7,r2,r16   ;store %edx value into memory
addi r4,r4,4     ;increment %eax
addi r3,r3,9     ;update PC (9 bytes)
Code Discovery Problem
- May be difficult to statically predecode or translate the entire source program
- Code Discovery Problem: how to find the beginning of all source instructions?
- Consider the x86 code:
31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00
Decoding from 8b b5: movl %esi, 0x08030000(%ebp) ??
Decoding from b5 00: mov %ch,0 ??
Code Discovery Problem
- Contributors to the code discovery problem
- Variable length CISC instructions
- Indirect jumps
- Data interspersed with code
- Padding instructions to align branch targets
[Figure: a stream of source ISA instructions (inst. 1, inst. 2, inst. 3, jump reg., data, inst. 5, inst. 6, uncond. brnch, pad, inst. 8) annotated with: jump indirect to ???, data in instruction stream, pad for instruction alignment]
Code Location Problem
- How to map source PC to target PC for indirect jumps?
- Indirect jump addresses in the target code still refer to addresses in the source
x86 source code
movl %eax, 4(%esp)   ;load jump address from memory
jmp %eax             ;jump indirect through %eax
PowerPC target code
addi r16,r11,4       ;compute x86 address
lwzx r4,r2,r16       ;get x86 jump address
                     ; from x86 memory image
mtctr r4             ;move to count register
bctr                 ;jump indirect through ctr
Simplified Solutions
- Solutions are simpler for special cases
- Fixed-length instruction sets (RISC ISAs)
- ISAs designed to be emulated (Java)
- No jumps / branches to arbitrary locations
- No data / padding interspersed with instructions
- All code can then be discovered
Incremental Code Translation
- General solutions translate the code dynamically and incrementally
- Interpret, then translate
- Interpretation performs code discovery
- Emulation Manager (EM)
- Translated code placed in the code cache
- Employ a map table for SPC to TPC mapping
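A rough sketch of the map table and the hit / miss decision made by the emulation manager. Everything here is an assumption for illustration: the open-addressing hash table, its size, the `EmulationManager` fields, and the fact that "translation" is reduced to reserving the next code cache slot.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SIZE 64             /* hypothetical table size (power of two) */
#define NO_TPC   (-1)

typedef struct { uint32_t spc; int tpc; int used; } MapEntry;

typedef struct {
    MapEntry map[MAP_SIZE];     /* SPC to TPC map table */
    int next_tpc;               /* next free slot in the code cache */
    int translations;           /* number of translator invocations */
} EmulationManager;

static unsigned hash_spc(uint32_t spc) { return (spc >> 2) & (MAP_SIZE - 1); }

/* Look up spc; returns its TPC, or NO_TPC on a miss. */
int map_lookup(const EmulationManager *em, uint32_t spc)
{
    unsigned h = hash_spc(spc);
    for (unsigned i = 0; i < MAP_SIZE; i++) {
        const MapEntry *e = &em->map[(h + i) & (MAP_SIZE - 1)];
        if (!e->used) return NO_TPC;
        if (e->spc == spc) return e->tpc;
    }
    return NO_TPC;
}

/* Record an SPC to TPC mapping, using linear probing on collisions. */
void map_insert(EmulationManager *em, uint32_t spc, int tpc)
{
    unsigned h = hash_spc(spc);
    for (unsigned i = 0; i < MAP_SIZE; i++) {
        MapEntry *e = &em->map[(h + i) & (MAP_SIZE - 1)];
        if (!e->used || e->spc == spc) {
            e->spc = spc; e->tpc = tpc; e->used = 1;
            return;
        }
    }
    assert(!"map table full");
}

/* Hit: reuse the cached translation. Miss: "translate" the block (here this
   only reserves a code cache slot) and record the new mapping. */
int em_get_tpc(EmulationManager *em, uint32_t spc)
{
    int tpc = map_lookup(em, spc);
    if (tpc == NO_TPC) {
        tpc = em->next_tpc++;
        em->translations++;
        map_insert(em, spc, tpc);
    }
    return tpc;
}
```

A real system would interpret or translate the block on the miss path; the point of the sketch is only that each source block is translated once and found through the map table afterwards.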
Incremental Code Translation
[Figure: incremental translation system. Components: source binary, Interpreter, translator, Emulation Manager, SPC to TPC Lookup Table (hit / miss), Translation Memory]
Dynamic Basic Blocks
- Basic unit of incremental code translation
- Determined by runtime control flow
- Start at instruction following branch or jump
- Follow the sequential instruction stream
- End with the next branch or jump
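The rule above can be sketched as a scan over an instruction stream. The `InstKind` classification is a hypothetical stand-in for real instruction decoding. Note that the scan ignores branch targets inside the block, which is what distinguishes a dynamic basic block from a static one.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { INST_OTHER, INST_BRANCH, INST_JUMP } InstKind;   /* hypothetical */

/* Returns the index of the last instruction of the dynamic basic block that
   begins at `start`: follow the sequential stream up to and including the
   next branch or jump (or the end of the code). */
size_t dynamic_block_end(const InstKind *code, size_t n, size_t start)
{
    assert(start < n);
    size_t i = start;
    while (i + 1 < n && code[i] == INST_OTHER)
        i++;
    return i;
}
```

Starting the scan at a different entry point (for example a branch target in the middle of an earlier block) yields a different, overlapping block, so the same source instruction may be translated more than once.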
Dynamic Basic Blocks
[Figure: the code sequence below divided into Static Basic Blocks (block 1 to block 5) and into Dynamic Basic Blocks (block 1 to block 4); in the dynamic case the same instructions appear in more than one block]
add... load... store...
loop: load... add... store brcond skip
load... sub...
skip: add... store brcond loop
add... load... store... jump indirect
Flow of Control
- Load source binary into memory, and EM begins interpreting source instructions
- Dynamically translate blocks of source code
- Place translated code into code cache
- SPC-to-TPC mapping placed into the map table
- Translation for one block is finished when a branch or jump is encountered
Dynamic Translation Flowchart
Tracking the Source PC
- EM needs up-to-date SPC from interpreter or translated code
- How to transfer?
- Map SPC to a register
- Place SPC in a “stub” after B&L instruction
- EM uses link register to find the SPC
[Figure: a code block ends with a Branch and Link to EM followed by the Next Source PC; the Emulation Manager uses the Hash Table to reach the next code block]
Translation Chaining
- Similar to threading in interpreters
- Link blocks together into chains
- Replace branch to EM with branch to next block
- Address of successor is determined using the map table
- If successor block is not yet translated
- Insert stub code to branch back to the EM
- EM can replace the stub code later
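A toy model of chaining, assuming each translated block has a single successor and representing the patched branch as a `link` pointer (NULL stands for the stub that branches back to the EM). The `Block` struct, the linear `lookup`, and the step-counted driver are all hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Block {
    uint32_t spc;           /* source PC where this block begins */
    uint32_t next_spc;      /* source PC of the successor, kept in the stub */
    struct Block *link;     /* NULL: exit via stub to the EM; else chained */
} Block;

/* Hypothetical map table lookup over an array of translated blocks. */
static Block *lookup(Block *blocks, size_t n, uint32_t spc)
{
    for (size_t i = 0; i < n; i++)
        if (blocks[i].spc == spc) return &blocks[i];
    return NULL;            /* successor not translated yet */
}

/* Called when `pred` exits through its stub. If the successor is already
   translated, replace the branch to the EM with a branch to the successor. */
Block *em_handle_stub(Block *blocks, size_t n, Block *pred)
{
    assert(pred != NULL);
    Block *succ = lookup(blocks, n, pred->next_spc);
    if (succ != NULL)
        pred->link = succ;
    return succ;
}

/* Follows up to `steps` block transitions starting at `b`; returns how many
   of them had to go through the EM. */
int run_blocks(Block *blocks, size_t n, Block *b, int steps)
{
    int em_entries = 0;
    while (steps-- > 0 && b != NULL) {
        if (b->link != NULL) {
            b = b->link;                        /* chained: EM not involved */
        } else {
            em_entries++;
            b = em_handle_stub(blocks, n, b);
        }
    }
    return em_entries;
}
```

In a two-block loop the EM is entered once per block on the first pass; after both stubs are patched, later iterations never leave the translated code.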
Translation Chaining
[Figure: Without Chaining, each translation block returns to the VMM before the next block runs; With Chaining, translation blocks branch directly to one another]
Creating a Link
[Figure: creating a link from a Predecessor block to its Successor in four numbered steps. Labels: JAL TM / B&L EM, Next SPC, get next SPC, Lookup Successor, Set up chain, Jump TPC]
Translation Chaining
Indirect Jump Prediction
- Chaining cannot be used for blocks ending with an indirect jump
- In many cases, jump target seldom changes
- Use profiling to determine jump targets
- Inline frequently used source PC / target PC values that are the targets of the jump
- Most frequent source PCs are checked first
if Rx == addr_1 goto target_1
else if Rx == addr_2 goto target_2
else if Rx == addr_3 goto target_3
else hash_lookup(Rx) ; do it the slow way
Incremental Translation Issues
- Tracking the source PC
- See earlier slide
- Self-modifying code
- Already translated code may become invalid
- Self-referencing code
- Referenced data should correspond to source code, not translated target
- Precise traps
- Provide source state at traps and exceptions
Instruction Set Issues
- Register architectures
- Condition codes
- Data formats and arithmetic
- Byte Order
Register Architectures
- Target registers are used for:
- Holding general-purpose regs of the source ISA
- Holding special-purpose regs of the source ISA
- Pointing to register context block / memory image
- Holding intermediate values used by emulator
- # target regs < # source regs
- Must prioritize use of target registers
- Assign registers on a block-by-block basis
Condition Codes
- Condition codes are not used uniformly
- IA-32 CC are set implicitly
- SPARC and PowerPC set CC explicitly
- MIPS does not use CC
- Cases for CC emulation
- Neither source nor target ISA uses CC
- Source ISA does not use CC, target does
- Source ISA has explicit CC, no CC on target
- Source ISA has implicit CC, no CC on target
IA-32 Condition Codes
- IA-32 CC are a set of “flags” (EFLAGS)
- CC set by IA-32 add instruction
- OF: indicates whether integer overflow occurred
- SF: indicates the sign of the result
- ZF: indicates a zero result
- AF: indicates a carry or borrow out of bit 3 of the result (used for BCD arithmetic)
- CF: indicates a carry or borrow out of the most significant bit of the result
- PF: indicates parity of the least significant byte of the result
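To make the cost of eager condition code emulation concrete, here is a sketch that derives all six flags for a 32-bit add on a target without matching condition codes. The `Flags` struct and `add_flags` name are hypothetical; the flag definitions follow the list above, with PF set when the low byte contains an even number of 1 bits.

```c
#include <stdint.h>

typedef struct { int of, sf, zf, af, cf, pf; } Flags;

/* Computes the IA-32 style flags produced by the 32-bit add a + b. */
Flags add_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    Flags f;
    f.cf = r < a;                               /* carry out of bit 31 */
    f.of = (int)((~(a ^ b) & (a ^ r)) >> 31);   /* operands share a sign, result differs */
    f.sf = (int)(r >> 31);                      /* sign bit of the result */
    f.zf = (r == 0);
    f.af = (int)(((a ^ b ^ r) >> 4) & 1);       /* carry out of bit 3 into bit 4 */

    uint32_t low = r & 0xFF;                    /* fold the low byte to its parity */
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = !(low & 1);                          /* 1 when the count of 1 bits is even */
    return f;
}
```

Doing this after every add costs far more than the add itself, which motivates deferring the work until a flag is actually read.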
Lazy Condition Code Evaluation
- CC are seldom used
- Lazy evaluation of CC
- Save the instruction that sets the CC
- Only compute the CC when needed
- During binary translation
- Use analysis to determine cases where CC will never be used
Lazy Condition Code Evaluation
add %ebx, 0(%eax)
add %ecx,%ebx
jmp label1
. . .
label1: jz target

PPC to x86 register mappings
R4 ↔ eax
R5 ↔ ebx
R6 ↔ ecx
. .
R24 ↔ scratch register used by emulation code
R25 ↔ condition code operand 1   ;registers used for lazy condition emulation code
R26 ↔ condition code operand 2
R27 ↔ condition code operation
R28 ↔ jump table base address
Lazy Condition Code Evaluation
    mr r25,r6           ;save operands
    mr r26,r5           ;and opcode for
    li r27,“add”        ;lazy condition code emulation
    add r6,r6,r5        ;translation of add
    b label1
    ...
label1:
    bl genZF            ;branch and link genZF code
    beq cr0,target      ;branch on condition flag
    ...
genZF:
    add r29,r28,r27     ;add “opcode” to jump table base
    mtctr r29           ;copy to counter register
    bctr                ;branch via jump table
    ...
    ...
“add”:
    add. r24,r25,r26    ;perform PowerPC add, set cr0
    blr                 ;return
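The same idea in C, as a sketch only: the `LazyCC` struct plays the role of r25, r26, and r27 above, and `gen_zf` plays the role of the genZF routine. The names, the `CC_SUB` case, and the `evaluations` counter are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CC_ADD, CC_SUB } CcOp;   /* hypothetical set of CC-setting operations */

typedef struct {
    uint32_t op1, op2;  /* saved operands */
    CcOp op;            /* saved operation */
    int evaluations;    /* how many times a flag was actually computed */
} LazyCC;

/* Emulate the add, recording only what is needed to derive flags later. */
uint32_t emu_add(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc->op1 = a; cc->op2 = b; cc->op = CC_ADD;
    return a + b;
}

/* Emulate the subtract in the same lazy fashion. */
uint32_t emu_sub(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc->op1 = a; cc->op2 = b; cc->op = CC_SUB;
    return a - b;
}

/* Compute ZF on demand, for example when a jz is reached. */
int gen_zf(LazyCC *cc)
{
    cc->evaluations++;
    switch (cc->op) {
    case CC_ADD: return (uint32_t)(cc->op1 + cc->op2) == 0;
    case CC_SUB: return (uint32_t)(cc->op1 - cc->op2) == 0;
    }
    assert(!"unknown CC operation");
    return 0;
}
```

Two consecutive adds cost only three stores each; the flag is computed once, from the operands of the most recent add, when the conditional branch needs it.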
Data Formats and Arithmetic
- Most formats have been standardized
- Integers: two’s complement
- Floating point: IEEE standard
- Basic logical / arithmetic operations are mostly present
- Some exceptions
- IA-32 FP uses 80 bit intermediate results
- Integer divide vs FP divide and approximate
- Different immediate lengths
Byte Order
- Ordering of bytes within a word may differ
- Big endian vs. little endian
- Guest data image typically maintained in the byte order assumed by the source ISA
- Emulation SW modifies addresses when bytes within words are addressed
- Some targets support both orders
- MIPS, IA-64
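One common form of the address modification, sketched under explicit assumptions: the guest is little endian, the host is big endian, and guest words are kept in the host's word order so that aligned word accesses need no fix-up. Byte accesses then flip the low two address bits. The big-endian host is simulated with an explicit byte array so the sketch behaves the same on any machine; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit word into a byte array in big-endian order, simulating how a
   big-endian host lays the word out in memory. */
void host_store_word(uint8_t *mem, uint32_t addr, uint32_t v)
{
    assert((addr & 3u) == 0);                   /* aligned words only in this sketch */
    mem[addr]     = (uint8_t)(v >> 24);
    mem[addr + 1] = (uint8_t)(v >> 16);
    mem[addr + 2] = (uint8_t)(v >> 8);
    mem[addr + 3] = (uint8_t)v;
}

/* Byte load as seen by a little-endian guest: flip the low two address bits
   to pick the matching byte inside the host-ordered word. */
uint8_t guest_load_byte(const uint8_t *mem, uint32_t guest_addr)
{
    return mem[guest_addr ^ 3u];
}
```

Halfword accesses would need a similar adjustment (XOR with 2), and unaligned accesses need more care than this sketch provides.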
Same ISA Emulation
- Emulation SW can monitor source program at the instruction-level
- Applications
- Simulation
- OS call emulation
- Program shepherding
- Performance optimization