
SLIDE 1

Emulation

Michael Jantz

SLIDE 2

Acknowledgements

Slides adapted from Chapter 2 in Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair

Credit to Prasad A. Kulkarni: some slides were borrowed from his course on Virtual Machines at the University of Kansas

SLIDE 3

Outline

Emulation
Interpretation
Basic, indirect threaded, and direct threaded
Binary translation
Code discovery, code location
Other issues
Control transfer optimizations
Instruction set issues


SLIDE 4

Emulation vs. Simulation

Emulation: process of implementing the interface / functionality of a (sub)system on a different system

Applies specifically to an instruction set
Different emulation techniques
Interpretation (instruction-at-a-time)
Binary translation (block-at-a-time)
Simulation
Method for modeling a system’s operation
Goal is to study process – not to imitate function


SLIDE 5

Definitions

Guest
Environment supported by underlying platform
Host
Underlying platform used to provide an environment for the guest

[Figure: Guest, supported by Host]

SLIDE 6

Definitions

Source ISA or binary
Original instruction set or binary
The ISA to be emulated
Target ISA or binary
ISA of the host processor
Underlying ISA
Source / target refer to ISAs
Guest / host refer to platforms

[Figure: Source, emulated by Target]

SLIDE 7

Instruction Set Emulation

Binaries in source instruction set can be executed on machine implementing target instruction set

Required for many VM implementations
Example: IA-32 EL


SLIDE 8

Interpretation vs. Translation

Interpretation
Simple, easy to implement
Low performance
Binary translation
Complex implementation
Higher initial cost, better performance
Techniques in between these extremes
Predecoding
Selective compilation


SLIDE 9

Interpreter State

Must maintain state of machine implementing the source ISA
Registers
Memory
Code
Data
Stack

[Figure: interpreter code operating on the source memory state (Code, Data, Stack) and the source context block (Program Counter, Condition Codes, Reg 0, Reg 1, ..., Reg n-1)]

SLIDE 10

Decode-And-Dispatch Interpreter

Decode-and-dispatch loop
One instruction at a time
Decode the current instruction
Dispatch to corresponding interpreter routine

while (!halt && !interrupt) {
    inst = code[PC];
    opcode = extract(inst,31,6);
    switch(opcode) {
        case LoadWordAndZero: LoadWordAndZero(inst); break;
        case ALU: ALU(inst); break;
        case Branch: Branch(inst); break;
        . . .
    }
}
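The calls to extract appear to take a (most significant bit, width) pair, so extract(inst,31,6) would select bits 31 through 26. A minimal sketch of such a helper, assuming that convention and 32-bit instruction words; the body is an assumption and is not taken from the slides:

```c
#include <stdint.h>

/* Return the `len`-bit field of `inst` whose most significant bit is `hi`
 * (bit 0 is the least significant bit). The convention is inferred from
 * calls such as extract(inst,31,6) for the 6-bit primary opcode. */
uint32_t extract(uint32_t inst, unsigned hi, unsigned len)
{
    uint32_t mask = (len >= 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    return (inst >> (hi + 1u - len)) & mask;
}
```

For the PowerPC word 0x7C221A14 (add r1,r2,r3), extract(inst,31,6) gives primary opcode 31 and extract(inst,10,10) gives extended opcode 266.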

SLIDE 11

Decode-And-Dispatch Interpreter

LoadWordAndZero(inst){
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    displacement = extract(inst,15,16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32)>> 32;
    PC = PC + 4;
}

SLIDE 12

Decode-And-Dispatch Interpreter

ALU(inst){
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    RB = extract(inst,15,5);
    source1 = regs[RA];
    source2 = regs[RB];
    extended_opcode = extract(inst,10,10);
    switch(extended_opcode) {
        case Add: Add(inst); break;
        case AddCarrying: AddCarrying(inst); break;
        case AddExtended: AddExtended(inst); break;
        . . .
    }
    PC = PC + 4;
}
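The loop and routines above can be sketched as compilable C for a toy source ISA. Everything here (the three opcodes, the ENC_I / ENC_R encoders, the Machine struct, the field helper) is hypothetical and chosen only to keep the example small; it is not the PowerPC subset used in the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-instruction source ISA (not PowerPC). */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

/* Toy format: opcode[31:26] rt[25:21] ra[20:16] rb[15:11] or imm[15:0]. */
#define ENC_I(op, rt, imm) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | ((uint32_t)(imm) & 0xFFFFu))
#define ENC_R(op, rt, ra, rb) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | ((uint32_t)(ra) << 16) | ((uint32_t)(rb) << 11))

typedef struct {
    uint32_t regs[32];  /* source register file */
    uint32_t pc;        /* source PC, in bytes */
    int halt;
} Machine;

static uint32_t field(uint32_t inst, unsigned hi, unsigned len)
{
    return (inst >> (hi + 1u - len)) & ((1u << len) - 1u);
}

static void do_loadi(Machine *m, uint32_t inst)
{
    m->regs[field(inst, 25, 5)] = field(inst, 15, 16);
    m->pc += 4;
}

static void do_add(Machine *m, uint32_t inst)
{
    m->regs[field(inst, 25, 5)] =
        m->regs[field(inst, 20, 5)] + m->regs[field(inst, 15, 5)];
    m->pc += 4;
}

/* Decode-and-dispatch loop: fetch, decode the opcode, call the routine. */
void run(Machine *m, const uint32_t *code, size_t n_insts)
{
    while (!m->halt && m->pc / 4 < n_insts) {
        uint32_t inst = code[m->pc / 4];
        switch (field(inst, 31, 6)) {
        case OP_LOADI: do_loadi(m, inst); break;
        case OP_ADD:   do_add(m, inst);   break;
        default:       m->halt = 1;       break;  /* OP_HALT or unknown */
        }
    }
}
```

Each interpreted instruction pays for the loop test, the switch, a call, and a return, which is the branch overhead the threaded variants try to reduce.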

SLIDE 13

Decode-And-Dispatch Efficiency

Decode-and-dispatch loop
Several branch instructions
Indirect branch on switch statement
Interpreting an add instruction
Requires approximately 20 target instructions
Several expensive loads/stores to memory
Hand-coded assembly can improve performance
Example: HotSpot JVM


SLIDE 14

Indirect Threaded Interpretation

High number of branches in decode-and-dispatch loop reduces performance
At least 5 branches per instruction
Threaded interpretation
Append dispatch code to each interpretation routine
Removes 3 branches
Threads interpretation routines together

SLIDE 15

Indirect Threaded Interpretation

LoadWordAndZero:
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    displacement = extract(inst,15,16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];
    opcode = extract(inst,31,6);
    extended_opcode = extract(inst,10,10);
    routine = dispatch[opcode][extended_opcode];
    goto *routine;

SLIDE 16

Indirect Threaded Interpretation

Add:
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    RB = extract(inst,15,5);
    source1 = regs[RA];
    source2 = regs[RB];
    sum = source1 + source2;
    regs[RT] = sum;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];
    opcode = extract(inst,31,6);
    extended_opcode = extract(inst,10,10);
    routine = dispatch[opcode][extended_opcode];
    goto *routine;

SLIDE 17

Indirect Threaded Interpretation

Dispatch occurs indirectly through a table
Interpretation routines can be modified and relocated independently

Advantages
Interpretation routines still portable
Improves efficiency over decode-and-dispatch
Disadvantages
Increases interpreter code size
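A minimal sketch of indirect threading for a hypothetical toy ISA (the opcodes, encoders, and run_threaded are illustrative names, not from the slides). It relies on the GNU C "labels as values" extension (&&label and goto *), which GCC and Clang accept; standard C has no equivalent of goto *routine.

```c
#include <stdint.h>

/* Hypothetical toy ISA: opcode[31:26] rt[25:21] ra[20:16] rb[15:11] or imm[15:0]. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, N_OPS = 3 };
#define ENC_I(op, rt, imm) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | ((uint32_t)(imm) & 0xFFFFu))
#define ENC_R(op, rt, ra, rb) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | ((uint32_t)(ra) << 16) | ((uint32_t)(rb) << 11))

/* Interprets `code` until OP_HALT (or an unknown opcode, treated as halt)
 * and returns the final source PC in bytes. */
uint32_t run_threaded(const uint32_t *code, uint32_t *regs)
{
    /* Dispatch table: routines are reached indirectly through it. */
    static void *const dispatch[N_OPS] = { &&do_halt, &&do_loadi, &&do_add };
    uint32_t pc = 0, inst, op;

    /* Dispatch code is appended to every routine; there is no central loop. */
#define NEXT()                                                      \
    do {                                                            \
        inst = code[pc / 4];                                        \
        op = inst >> 26;                                            \
        goto *dispatch[op < (uint32_t)N_OPS ? op : (uint32_t)OP_HALT]; \
    } while (0)

    NEXT();

do_loadi:
    regs[(inst >> 21) & 31u] = inst & 0xFFFFu;
    pc += 4;
    NEXT();

do_add:
    regs[(inst >> 21) & 31u] = regs[(inst >> 16) & 31u] + regs[(inst >> 11) & 31u];
    pc += 4;
    NEXT();

do_halt:
    return pc;
#undef NEXT
}
```

Because every routine jumps through the table, a routine can be changed or moved by updating one table entry, at the cost of repeating the dispatch code in each routine.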


SLIDE 18

Indirect Threaded Interpretation

[Figure: decode-dispatch (source code, dispatch loop, interpreter routines, "data" accesses) compared with threaded interpretation (source code, interpreter routines)]

SLIDE 19

Predecoding

Parse each instruction into a pre-defined data structure to facilitate interpretation
Separate opcodes, operands, etc.
Reduces shifts / masks for decoding
More useful when source ISA is CISC

lwz r1, 8(r2)
add r3, r3, r1
stw r3, 0(r4)

SLIDE 20

Predecoding

struct instruction {
    unsigned long op;
    unsigned char dest, src1;
    unsigned int src2;    /* wide enough to hold a 16-bit displacement */
} code [CODE_SIZE];

LoadWordandZero:
    RT = code[TPC].dest;
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    SPC = SPC + 4;
    TPC = TPC + 1;
    if (halt || interrupt) goto exit;
    opcode = code[TPC].op;
    routine = dispatch[opcode];
    goto *routine;
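The predecoding pass itself is not shown on the slide. A minimal sketch of one, assuming 32-bit PowerPC words and handling only the two instruction shapes used in the lwz / add / stw example; the struct layout, the xop field, and the function names are illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_XFORM = 31, OP_LWZ = 32, OP_STW = 36 };  /* PowerPC primary opcodes */

struct predecoded {
    uint32_t op;    /* primary opcode */
    uint32_t xop;   /* extended opcode (register-register form only) */
    uint8_t  dest;  /* RT field (for stw this is RS, the value to store) */
    uint8_t  src1;  /* RA field */
    uint32_t src2;  /* RB register number or 16-bit displacement */
};

static uint32_t extract(uint32_t inst, unsigned hi, unsigned len)
{
    return (inst >> (hi + 1u - len)) & ((1u << len) - 1u);
}

/* Decode each 32-bit source word once, ahead of interpretation.
 * Assumption: anything that is not primary opcode 31 is treated as a
 * displacement-form instruction. */
void predecode(const uint32_t *src, size_t n, struct predecoded *out)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t inst = src[i];
        out[i].op   = extract(inst, 31, 6);
        out[i].dest = (uint8_t)extract(inst, 25, 5);
        out[i].src1 = (uint8_t)extract(inst, 20, 5);
        if (out[i].op == OP_XFORM) {          /* register-register form */
            out[i].xop  = extract(inst, 10, 10);
            out[i].src2 = extract(inst, 15, 5);
        } else {                              /* displacement form */
            out[i].xop  = 0;
            out[i].src2 = extract(inst, 15, 16);
        }
    }
}
```

After this pass the interpreter routines read fields directly and no longer shift or mask the raw instruction word.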


SLIDE 21

Direct Threaded Interpretation

Replace table lookup with direct access to address of interpreter routine

Requires predecoding
Reduces portability


SLIDE 22

Direct Threaded Interpretation

LoadWordandZero:
    RT = code[TPC].dest;
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    SPC = SPC + 4;
    TPC = TPC + 1;
    if (halt || interrupt) goto exit;
    routine = code[TPC].op;
    goto *routine;
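A minimal sketch of direct threading for a hypothetical toy ISA, again using the GNU C "labels as values" extension. The Inst struct, the interp / predecode names, and the trick of calling interp with NULL to obtain the routine addresses are all assumptions made for illustration. The predecoder stores routine addresses in the intermediate code, which is why that code is tied to one build of the interpreter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: opcode[31:26] rt[25:21] ra[20:16] rb[15:11] or imm[15:0]. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, N_OPS = 3 };
#define ENC_I(op, rt, imm) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | ((uint32_t)(imm) & 0xFFFFu))
#define ENC_R(op, rt, ra, rb) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | ((uint32_t)(ra) << 16) | ((uint32_t)(rb) << 11))

typedef struct {
    void    *routine;   /* address of the interpreter routine (replaces the opcode) */
    uint8_t  dest, src1;
    uint32_t src2;
} Inst;

/* Called with code == NULL, returns the table of routine addresses for the
 * predecoder; otherwise interprets predecoded code until the halt routine. */
void *const *interp(const Inst *code, uint32_t *regs)
{
    static void *const routines[N_OPS] = { &&do_halt, &&do_loadi, &&do_add };
    const Inst *ip = code;

    if (code == NULL)
        return routines;
    goto *(ip->routine);

do_loadi:
    regs[ip->dest] = ip->src2;
    ip++;
    goto *(ip->routine);      /* no table lookup: jump straight to the routine */

do_add:
    regs[ip->dest] = regs[ip->src1] + regs[ip->src2];
    ip++;
    goto *(ip->routine);

do_halt:
    return NULL;
}

/* Predecode: replace each opcode with the address of its routine. */
void predecode(const uint32_t *src, size_t n, Inst *out)
{
    void *const *routines = interp(NULL, NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t inst = src[i], op = inst >> 26;
        out[i].routine = routines[op < (uint32_t)N_OPS ? op : (uint32_t)OP_HALT];
        out[i].dest = (uint8_t)((inst >> 21) & 31u);
        out[i].src1 = (uint8_t)((inst >> 16) & 31u);
        out[i].src2 = (op == OP_ADD) ? ((inst >> 11) & 31u) : (inst & 0xFFFFu);
    }
}
```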


SLIDE 23

Direct Threaded Interpretation

[Figure: source code passes through the pre-decoder to produce intermediate code, which the interpreter routines execute]

SLIDE 24

Binary Translation

Convert source binary to target binary before execution
Logical conclusion of predecoding
Removes parsing and jumps altogether
Allows optimizations on native code
Achieves better performance than interpretation
Generated code no longer portable

SLIDE 25

Binary Translation

[Figure: source code passes through the binary translator to produce binary translated target code]

SLIDE 26

Binary Translation

x86 Source Binary

addl %edx,4(%eax)
movl 4(%eax),%edx
add %eax,4

Translate to PowerPC Target

r1 points to x86 register context block
r2 points to x86 memory image
r3 contains x86 ISA PC value

SLIDE 27

Binary Translation

lwz r4,0(r1) ;load %eax from register block
addi r5,r4,4 ;add 4 to %eax
lwzx r5,r2,r5 ;load operand from memory
lwz r4,12(r1) ;load %edx from register block
add r5,r4,r5 ;perform add
stw r5,12(r1) ;put result into %edx
addi r3,r3,3 ;update PC (3 bytes)

lwz r4,0(r1) ;load %eax from register block
addi r5,r4,4 ;add 4 to %eax
lwz r4,12(r1) ;load %edx from register block
stwx r4,r2,r5 ;store %edx value into memory
addi r3,r3,3 ;update PC (3 bytes)

lwz r4,0(r1) ;load %eax from register block
addi r4,r4,4 ;add immediate
stw r4,0(r1) ;place result back into %eax
addi r3,r3,3 ;update PC (3 bytes)

SLIDE 28

Register Mapping

Map source registers to target registers
Reduces memory loads / stores
If target registers < source registers
Map some to memory
Map on per-block basis

SLIDE 29

Register Mapping

r1 points to x86 register context block
r2 points to x86 memory image
r3 contains x86 ISA PC value
r4 holds x86 register %eax
r7 holds x86 register %edx
etc.

addi r16,r4,4 ;add 4 to %eax
lwzx r17,r2,r16 ;load operand from memory
add r7,r17,r7 ;perform add of %edx
addi r16,r4,4 ;add 4 to %eax
stwx r7,r2,r16 ;store %edx value into memory
addi r4,r4,4 ;increment %eax
addi r3,r3,9 ;update PC (9 bytes)

SLIDE 30

Code Discovery Problem

May be difficult to statically predecode or translate the entire source program
Code Discovery Problem: how to find the beginning of all source instructions?

Consider the x86 code:

31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00

Decoding from 8b b5: movl %esi, 0x08030000(%ebp) ??
Decoding from b5 00: mov %ch,0 ??
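The ambiguity can be reproduced with a small length decoder. This is a sketch that understands only the handful of IA-32 opcodes needed for the byte sequence above (00, 08, 31, 8B with a ModRM byte, and B0 through B7 with an 8-bit immediate), assuming 32-bit addressing and no prefixes; inst_len and modrm_len are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of a ModRM byte plus any SIB byte and displacement that
 * follow it (32-bit addressing). Returns 0 if the bytes run out. */
static size_t modrm_len(const uint8_t *p, size_t avail)
{
    if (avail < 1) return 0;
    unsigned mod = p[0] >> 6, rm = p[0] & 7u;
    size_t len = 1;
    if (mod != 3 && rm == 4) {                        /* SIB byte present */
        if (avail < 2) return 0;
        len += 1;
        if (mod == 0 && (p[1] & 7u) == 5) len += 4;   /* disp32, no base */
    }
    if (mod == 1) len += 1;                           /* disp8 */
    else if (mod == 2) len += 4;                      /* disp32 */
    else if (mod == 0 && rm == 5) len += 4;           /* disp32, no base */
    return len <= avail ? len : 0;
}

/* Length of the instruction starting at p, for a small opcode subset only;
 * returns 0 for any other opcode or for a truncated instruction. */
size_t inst_len(const uint8_t *p, size_t avail)
{
    if (avail < 1) return 0;
    uint8_t op = p[0];
    if (op == 0x00 || op == 0x08 || op == 0x31 || op == 0x8B) {
        size_t m = modrm_len(p + 1, avail - 1);       /* opcode + ModRM form */
        return m ? 1 + m : 0;
    }
    if (op >= 0xB0 && op <= 0xB7)                     /* mov r8, imm8 */
        return avail >= 2 ? 2 : 0;
    return 0;
}
```

Starting at offset 0 the decoder finds instructions of 2, 6, and 6 bytes. Starting at offset 3 it also finds a valid 2-byte instruction (b5 00), so the bytes alone do not reveal where instructions begin.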

SLIDE 31

Code Discovery Problem

Contributors to the code discovery problem
Variable length CISC instructions
Indirect jumps
Data interspersed with code
Padding instructions to align branch targets

[Figure: a source ISA instruction stream (inst. 1, inst. 2, inst. 3, jump, data, inst. 5, inst. 6, uncond. brnch, inst. 8) with the annotations "jump indirect to???", "data in instruction stream", and "pad for instruction alignment"]

SLIDE 32

Code Location Problem

How to map source PC to target PC for indirect jumps?
Indirect jump addresses in the target code still refer to addresses in the source

x86 source code

movl %eax, 4(%esp) ;load jump address from memory
jmp %eax ;jump indirect through %eax

PowerPC target code

addi r16,r11,4 ;compute x86 address
lwzx r4,r2,r16 ;get x86 jump address from x86 memory image
mtctr r4 ;move to count register
bctr ;jump indirect through ctr

SLIDE 33

Simplified Solutions

Solutions are simpler for special cases
Fixed-length instruction sets (RISC ISAs)
ISAs designed to be emulated (Java)
No jumps / branches to arbitrary locations
No data / padding interspersed with instructions
All code can then be discovered


SLIDE 34

Incremental Code Translation

General solutions translate the code dynamically and incrementally

Interpret, then translate
Interpretation performs code discovery
Emulation Manager (EM)
Translated code placed in the code cache
Employ a map table for SPC to TPC mapping
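The slides do not say how the map table is organized. A minimal sketch of one possibility, assuming 32-bit source PCs and a fixed-size open-addressing hash table; the type names, the table size, and the hash function are all assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 1024u           /* power of two; the size is an assumption */

typedef struct {
    uint32_t spc;                 /* source PC of a translated block */
    void    *tpc;                 /* address of its translation in the code cache */
    int      used;
} MapEntry;

typedef struct { MapEntry slot[MAP_SLOTS]; } MapTable;

static uint32_t hash_spc(uint32_t spc)
{
    return (spc * 2654435761u) & (MAP_SLOTS - 1u);   /* multiplicative hash */
}

/* Record spc -> tpc. Returns 0 on success, -1 if the table is full. */
int map_insert(MapTable *t, uint32_t spc, void *tpc)
{
    uint32_t i = hash_spc(spc);
    for (uint32_t probes = 0; probes < MAP_SLOTS; probes++) {
        MapEntry *e = &t->slot[i];
        if (!e->used || e->spc == spc) {
            e->spc = spc;
            e->tpc = tpc;
            e->used = 1;
            return 0;
        }
        i = (i + 1u) & (MAP_SLOTS - 1u);             /* linear probing */
    }
    return -1;
}

/* Returns the TPC for spc (a hit) or NULL if the block is untranslated (a miss). */
void *map_lookup(const MapTable *t, uint32_t spc)
{
    uint32_t i = hash_spc(spc);
    for (uint32_t probes = 0; probes < MAP_SLOTS; probes++) {
        const MapEntry *e = &t->slot[i];
        if (!e->used) return NULL;
        if (e->spc == spc) return e->tpc;
        i = (i + 1u) & (MAP_SLOTS - 1u);
    }
    return NULL;
}
```

A zero-initialized MapTable is empty. On a miss the emulation manager would interpret or translate the block and then call map_insert with the new code cache address.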


SLIDE 35

Incremental Code Translation

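[Figure: the Emulation Manager looks up the source PC in the SPC to TPC Lookup Table; on a hit, control transfers to translated code in Translation Memory; on a miss, the Interpreter and translator process the source binary]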

SLIDE 36

Dynamic Basic Blocks

Basic unit of incremental code translation
Determined by runtime control flow
Start at instruction following branch or jump
Follow the sequential instruction stream
End with the next branch or jump
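The three rules above can be sketched as a scan over a toy instruction stream. The opcodes and the INST encoder are hypothetical; the point is only that a dynamic block starts wherever control actually arrives and runs through the next branch or jump, even across a label that would start a new static basic block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes in bits 31..26; the last three transfer control. */
enum { OP_ADD = 1, OP_LOAD = 2, OP_STORE = 3, OP_BRCOND = 4, OP_JUMP = 5, OP_JUMP_IND = 6 };
#define INST(op) ((uint32_t)(op) << 26)

static int ends_block(uint32_t inst)
{
    uint32_t op = inst >> 26;
    return op == OP_BRCOND || op == OP_JUMP || op == OP_JUMP_IND;
}

/* Starting at instruction index `start` (wherever control arrived at run
 * time), follow the sequential stream and return the number of instructions
 * in the dynamic basic block, including the branch or jump that ends it. */
size_t dynamic_block_len(const uint32_t *code, size_t n, size_t start)
{
    size_t i = start;
    while (i < n) {
        if (ends_block(code[i++]))
            break;
    }
    return i - start;
}
```

If a loop label sits at index 3 of a stream whose first branch is at index 6, entering at index 0 gives one 7-instruction dynamic block, while entering at the label gives a separate 4-instruction block covering part of the same code.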


SLIDE 37

Dynamic Basic Blocks

[Figure: a code sequence containing a loop (add, load, store, brcond skip, brcond loop, jump indirect), divided first into static basic blocks and then into dynamic basic blocks]

SLIDE 38

Flow of Control

Load source binary into memory, and EM begins interpreting source instructions
Dynamically translate blocks of source code
Place translated code into code cache
SPC-to-TPC mapping placed into the map table
Translation for one block is finished when a branch or jump is encountered

SLIDE 39

Dynamic Translation Flowchart


SLIDE 40

Tracking the Source PC

EM needs up-to-date SPC from interpreter or translated code
How to transfer?
Map SPC to a register
Place SPC in a “stub” after B&L instruction
EM uses link register to find the SPC

[Figure: a code block ends with a Branch and Link to EM followed by the Next Source PC; the Emulation Manager uses its hash table to find the next code block]

SLIDE 41

Translation Chaining

Similar to threading in interpreters
Link blocks together into chains
Replace branch to EM w/ branch to next block
Address of successor is determined using the map table

If successor block is not yet translated
Insert stub code to branch back to the EM
EM can replace the stub code later


SLIDE 42

Translation Chaining

[Figure: without chaining, each translation block returns to the VMM; with chaining, translation blocks branch directly to one another]

SLIDE 43

Creating a Link

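[Figure: creating a link in four numbered steps. The predecessor block ends with JAL TM (B&L EM) and the Next SPC; the EM gets the next SPC, looks up the successor, and sets up the chain by replacing the branch with Jump TPC]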

SLIDE 44

Translation Chaining


SLIDE 45

Indirect Jump Prediction

Chaining cannot be used for blocks ending with an indirect jump
In many cases, jump target seldom changes
Use profiling to determine jump targets
Inline frequently used source PC / target PC values that are the targets of the jump
Most frequent source PCs are checked first

if Rx == addr_1 goto target_1
else if Rx == addr_2 goto target_2
else if Rx == addr_3 goto target_3
else hash_lookup(Rx) ; do it the slow way
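The compare chain above can be sketched in C as follows. A real translator would emit the compares inline in the translated block; here they are a loop over a small prediction array so the sketch stays compilable. Prediction, SlowLookup, resolve_indirect, and counting_lookup are hypothetical names, and counting_lookup only stands in for the map-table lookup.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t spc;   /* profiled source PC of a frequent jump target */
    void    *tpc;   /* corresponding translated-code address */
} Prediction;

/* Stand-in for the map-table lookup (the "slow way"). */
typedef void *(*SlowLookup)(uint32_t spc);

/* Resolve an indirect jump target: compare against the profiled source PCs,
 * most frequent first, and fall back to the slow lookup when none match. */
void *resolve_indirect(uint32_t spc, const Prediction *pred, size_t n_pred,
                       SlowLookup slow)
{
    for (size_t i = 0; i < n_pred; i++) {
        if (pred[i].spc == spc)
            return pred[i].tpc;
    }
    return slow(spc);
}

/* Example slow path that only counts how often it is taken. */
static unsigned slow_calls;
static void *counting_lookup(uint32_t spc)
{
    (void)spc;
    slow_calls++;
    return NULL;
}
```

The benefit depends on the prediction array being ordered by observed frequency, so that the common target is found after one compare.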

SLIDE 46

Incremental Translation Issues

Tracking the source PC
See earlier slide
Self-modifying code
Already translated code may become invalid
Self-referencing code
Referenced data should correspond to source code, not translated target
Precise traps
Provide source state at traps and exceptions

SLIDE 47

Instruction Set Issues

Register architectures
Condition codes
Data formats and arithmetic
Byte Order


SLIDE 48

Register Architectures

Target registers are used for:
Holding general-purpose regs of the source ISA
Holding special-purpose regs of the source ISA
Pointing to register context block / memory image
Holding intermediate values used by emulator
# target regs < # source regs
Must prioritize use of target registers
Assign registers on a block-by-block basis


SLIDE 49

Condition Codes

Condition codes are not used uniformly
IA-32 CC are set implicitly
SPARC and PowerPC set CC explicitly
MIPS does not use CC
Cases for CC emulation
Neither source nor target ISA uses CC
Source ISA does not use CC, target does
Source ISA has explicit CC, no CC on target
Source ISA has implicit CC, no CC on target


SLIDE 50

IA-32 Condition Codes

IA-32 CC are a set of “flags” (EFLAGS)
CC set by IA-32 add instruction
OF: indicates whether integer overflow occurred
SF: indicates the sign of the result
ZF: indicates a zero result
AF: indicates a carry or borrow out of bit 3 of the result (used for BCD arithmetic)
CF: indicates a carry or borrow out of the most significant bit of the result
PF: indicates parity of the least significant byte of the result
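To show what a target without implicit condition codes has to compute, here is a sketch that derives all six flags for a 32-bit add. The Flags struct and add_flags are illustrative names; the bit formulas follow the flag definitions listed above.

```c
#include <stdint.h>

typedef struct { unsigned of, sf, zf, af, cf, pf; } Flags;

/* Flags produced by a 32-bit IA-32 add of a and b. */
Flags add_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    uint32_t low = r & 0xFFu;
    Flags f;

    f.cf = r < a;                        /* carry out of bit 31 */
    f.zf = (r == 0);
    f.sf = r >> 31;                      /* sign bit of the result */
    f.of = ((a ^ r) & (b ^ r)) >> 31;    /* operands share a sign the result lacks */
    f.af = ((a ^ b ^ r) >> 4) & 1u;      /* carry out of bit 3 into bit 4 */

    low ^= low >> 4;                     /* fold the low byte down to one bit */
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = !(low & 1u);                  /* set when the low byte has an even number of 1s */
    return f;
}
```

Doing this work after every emulated add is costly when the flags are rarely read, which motivates evaluating them lazily.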


SLIDE 51

Lazy Condition Code Evaluation

CC are seldom used
Lazy evaluation of CC
Save the instruction that sets the CC
Only compute the CC when needed
During binary translation
Use analysis to determine cases where CC will never be used
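A minimal sketch of lazy evaluation in C, assuming only add and subtract set the condition codes and only ZF and CF are ever requested. LazyCC, emu_add, emu_sub, cc_zf, and cc_cf are hypothetical names. The emulated instruction saves its operands and operation; a flag is derived only when a later instruction asks for it.

```c
#include <stdint.h>

enum { CC_ADD, CC_SUB };    /* tags for the last CC-setting operation */

typedef struct {
    uint32_t op1, op2;      /* saved operands */
    int      op;            /* saved operation */
} LazyCC;

/* Done in place of flag computation: just remember what was executed. */
static void cc_record(LazyCC *cc, int op, uint32_t a, uint32_t b)
{
    cc->op = op;
    cc->op1 = a;
    cc->op2 = b;
}

uint32_t emu_add(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc_record(cc, CC_ADD, a, b);
    return a + b;
}

uint32_t emu_sub(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc_record(cc, CC_SUB, a, b);
    return a - b;
}

/* Recompute the result of the saved operation. */
static uint32_t cc_result(const LazyCC *cc)
{
    return cc->op == CC_ADD ? cc->op1 + cc->op2 : cc->op1 - cc->op2;
}

/* Called only when a later instruction (for example jz) reads the flag. */
int cc_zf(const LazyCC *cc)
{
    return cc_result(cc) == 0;
}

int cc_cf(const LazyCC *cc)
{
    if (cc->op == CC_ADD)
        return cc_result(cc) < cc->op1;   /* carry out of bit 31 */
    return cc->op1 < cc->op2;             /* borrow */
}
```

A sequence of adds with no intervening conditional branch then costs three stores per instruction instead of a full flag computation.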


SLIDE 52

Lazy Condition Code Evaluation

add %ebx, 0(%eax)
add %ecx,%ebx
jmp label1
. . .
label1: jz target

PPC to x86 register mappings

R4 ↔ eax
R5 ↔ ebx
R6 ↔ ecx
. .
R24 ↔ scratch register used by emulation code
R25 ↔ condition code operand 1 ;registers
R26 ↔ condition code operand 2 ;used for
R27 ↔ condition code operation ;lazy condition
                               ;emulation code
R28 ↔ jump table base address

SLIDE 53

Lazy Condition Code Evaluation

mr r25,r6 ;save operands
mr r26,r5 ;and opcode for
li r27,“add” ;lazy condition code emulation
add r6,r6,r5 ;translation of add
b label1
...
label1: bl genZF ;branch and link genZF code
beq cr0,target ;branch on condition flag
...
genZF: add r29,r28,r27 ;add “opcode” to jump table base
mtctr r29 ;copy to counter register
bctr ;branch via jump table
...
...
“add”: add. r24,r25,r26 ;perform PowerPC add, set cr0
blr ;return

SLIDE 54

Data Formats and Arithmetic

Most formats have been standardized
Integers: two’s complement
Floating point: IEEE standard
Basic logical / arithmetic operations are mostly present

Some exceptions
IA-32 FP uses 80 bit intermediate results
Integer divide vs FP divide and approximate
Different immediate lengths


SLIDE 55

Byte Order

Ordering of bytes within a word may differ
Big endian vs. little endian
Guest typically maintained in same order assumed by the source ISA
Emulation SW modifies addresses when bytes within words are addressed

Some targets support both orders
MIPS, IA-64
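One common way to realize the address modification is sketched below. It assumes a big-endian guest whose memory image is held as 32-bit words arranged so that an aligned word load on the host returns the guest's value without swapping; that storage choice and the function names are assumptions, not something the slides specify. Under that assumption only sub-word accesses need adjusting, and on a little-endian host the adjustment is an XOR of the byte address with 3.

```c
#include <stdint.h>

/* Returns 1 when the host stores the least significant byte first. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Load the byte at big-endian guest address `addr` from a guest memory image
 * held as host-order 32-bit words. On a little-endian host the byte index
 * within each word is reversed, which is the same as XORing the address
 * with 3; on a big-endian host no change is needed. */
uint8_t guest_load_byte(const uint32_t *image, uint32_t addr)
{
    const uint8_t *bytes = (const uint8_t *)image;
    return host_is_little_endian() ? bytes[addr ^ 3u] : bytes[addr];
}

/* Aligned word accesses need no adjustment under the assumption above. */
uint32_t guest_load_word(const uint32_t *image, uint32_t addr)
{
    return image[addr >> 2];
}
```

Halfword accesses would be handled the same way with an XOR of 2 on a little-endian host.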


SLIDE 56

Same ISA Emulation

Emulation SW can monitor source program at the instruction-level

Applications
Simulation
OS call emulation
Program shepherding
Performance optimization
