Emulation Michael Jantz Acknowledgements Slides adapted from - - PowerPoint PPT Presentation


Emulation Michael Jantz Acknowledgements Slides adapted from Chapter 2 in Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair Credit to Prasad A. Kulkarni some slides were borrowed from his


slide-1
SLIDE 1

Emulation

Michael Jantz

slide-2
SLIDE 2

Acknowledgements

  • Slides adapted from Chapter 2 in Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair
  • Credit to Prasad A. Kulkarni: some slides were borrowed from his course on Virtual Machines at the University of Kansas

slide-3
SLIDE 3

Outline

  • Emulation
  • Interpretation
  • Basic, indirect threaded, and direct threaded
  • Binary translation
  • Code discovery, code location
  • Other issues
  • Control transfer optimizations
  • Instruction set issues


slide-4
SLIDE 4

Emulation vs. Simulation

  • Emulation: process of implementing the interface / functionality of a (sub)system on a different system

  • Applies specifically to an instruction set
  • Different emulation techniques
  • Interpretation (instruction-at-a-time)
  • Binary translation (block-at-a-time)
  • Simulation
  • Method for modeling a system’s operation
  • Goal is to study process – not to imitate function


slide-5
SLIDE 5

Definitions

  • Guest
  • Environment supported by underlying platform
  • Host
  • Underlying platform used to provide an environment for the guest

[Figure: the guest is supported by the host]

slide-6
SLIDE 6

Definitions

  • Source ISA or binary
  • Original instruction set or binary
  • The ISA to be emulated
  • Target ISA or binary
  • ISA of the host processor
  • Underlying ISA
  • Source / target refer to ISAs
  • Guest / host refer to platforms

[Figure: the source is emulated by the target]

slide-7
SLIDE 7

Instruction Set Emulation

  • Binaries in source instruction set can be executed on machine implementing target instruction set

  • Required for many VM implementations
  • Example: IA-32 EL


slide-8
SLIDE 8

Interpretation vs. Translation

  • Interpretation
  • Simple, easy to implement
  • Low performance
  • Binary translation
  • Complex implementation
  • Higher initial cost, better performance
  • Techniques in between these extremes
  • Predecoding
  • Selective compilation


slide-9
SLIDE 9

Interpreter State

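  • Must maintain state of machine implementing the source ISA
  • Registers
  • Memory
  • Code
  • Data
  • Stack

[Figure: interpreter state held in memory: the source code, data, and stack images; a context block with the program counter, condition codes, and Reg 0 through Reg n-1; and the interpreter code]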

slide-10
SLIDE 10

Decode-And-Dispatch Interpreter

  • Decode-and-dispatch loop
  • One instruction at a time
  • Decode the current instruction
  • Dispatch to corresponding interpreter routine

    while (!halt && !interrupt) {
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        switch (opcode) {
            case LoadWordAndZero: LoadWordAndZero(inst); break;
            case ALU:             ALU(inst);             break;
            case Branch:          Branch(inst);          break;
            . . .
        }
    }

slide-11
SLIDE 11

Decode-And-Dispatch Interpreter

    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
    }

slide-12
SLIDE 12

Decode-And-Dispatch Interpreter

    ALU(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        extended_opcode = extract(inst, 10, 10);
        switch (extended_opcode) {
            case Add:         Add(inst);         break;
            case AddCarrying: AddCarrying(inst); break;
            case AddExtended: AddExtended(inst); break;
            . . .
        }
        PC = PC + 4;
    }
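
The decode-and-dispatch structure on the last three slides can be sketched as a small compilable C interpreter. This is a minimal sketch under stated assumptions, not the PowerPC interpreter from the slides: the toy instruction encoding and the names `Machine`, `OP_*`, `encode`, `load_imm`, `add`, and `run` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA (not PowerPC): op in bits 31..24, rt in 23..16,
   ra in 15..8, rb or an 8-bit immediate in 7..0. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

typedef struct {
    uint32_t regs[8];  /* source register file */
    uint32_t pc;       /* source PC in bytes */
    int halt;
} Machine;

/* Return the `len`-bit field whose most significant bit is bit `hi`. */
static uint32_t extract(uint32_t inst, unsigned hi, unsigned len) {
    return (inst >> (hi + 1 - len)) & ((1u << len) - 1u);
}

static uint32_t encode(uint32_t op, uint32_t rt, uint32_t ra, uint32_t rb) {
    return (op << 24) | (rt << 16) | (ra << 8) | rb;
}

/* One interpreter routine per opcode; each updates the PC itself. */
static void load_imm(Machine *m, uint32_t inst) {
    m->regs[extract(inst, 23, 8) % 8] = extract(inst, 7, 8);
    m->pc += 4;
}

static void add(Machine *m, uint32_t inst) {
    uint32_t rt = extract(inst, 23, 8) % 8;
    uint32_t ra = extract(inst, 15, 8) % 8;
    uint32_t rb = extract(inst, 7, 8) % 8;
    m->regs[rt] = m->regs[ra] + m->regs[rb];
    m->pc += 4;
}

/* Decode-and-dispatch loop: fetch, decode the opcode, call the routine. */
static void run(Machine *m, const uint32_t *code, size_t n_insts) {
    while (!m->halt && m->pc / 4 < n_insts) {
        uint32_t inst = code[m->pc / 4];
        switch (extract(inst, 31, 8)) {
        case OP_LOADI: load_imm(m, inst); break;
        case OP_ADD:   add(m, inst);      break;
        default:       m->halt = 1;       break;  /* OP_HALT or unknown opcode */
        }
    }
}
```

Every instruction passes through the central loop and the `switch`, which is the source of the branch overhead discussed on the next slide.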

slide-13
SLIDE 13

Decode-And-Dispatch Efficiency

  • Decode-and-dispatch loop
  • Several branch instructions
  • Indirect branch on switch statement
  • Interpreting an add instruction
  • Requires approximately 20 target instructions
  • Several expensive loads/stores to memory
  • Hand-coded assembly can improve performance
  • Example: HotSpot JVM


slide-14
SLIDE 14

Indirect Threaded Interpretation

  • High number of branches in decode-and-dispatch loop reduces performance
  • At least 5 branches per instruction
  • Threaded interpretation
  • Append dispatch code to each interpretation routine
  • Removes 3 branches
  • Threads interpretation routines together

slide-15
SLIDE 15

Indirect Threaded Interpretation

    LoadWordAndZero:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode][extended_opcode];
        goto *routine;

slide-16
SLIDE 16

Indirect Threaded Interpretation

    Add:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        sum = source1 + source2;
        regs[RT] = sum;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode][extended_opcode];
        goto *routine;

slide-17
SLIDE 17

Indirect Threaded Interpretation

  • Dispatch occurs indirectly through a table
  • Interpretation routines can be modified and relocated independently

  • Advantages
  • Interpretation routines still portable
  • Improves efficiency over decode-and-dispatch
  • Disadvantages
  • Increases interpreter code size
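
Indirect threading can be sketched in C with the GNU C labels-as-values extension (supported by GCC and Clang, not ISO C). The accumulator bytecode, the opcode names, and `run_threaded` are hypothetical, and the sketch assumes every opcode in the program is valid.

```c
#include <stdint.h>

enum { OP_HALT = 0, OP_LOADI = 1, OP_ADDI = 2 };

/* Interprets a hypothetical accumulator bytecode made of
   (opcode, operand) byte pairs and returns the accumulator. */
static int32_t run_threaded(const uint8_t *code) {
    /* Dispatch table of routine addresses, indexed by opcode. */
    static void *const dispatch[] = { &&do_halt, &&do_loadi, &&do_addi };
    int32_t acc = 0;
    uint32_t pc = 0;

/* Dispatch code appended to every routine: fetch the next opcode,
   look up its routine in the table, and jump there indirectly. */
#define DISPATCH() goto *dispatch[code[pc]]

    DISPATCH();

do_loadi:
    acc = code[pc + 1];
    pc += 2;
    DISPATCH();

do_addi:
    acc += code[pc + 1];
    pc += 2;
    DISPATCH();

do_halt:
    return acc;
#undef DISPATCH
}
```

There is no central loop: each routine ends by jumping straight to the next one, at the cost of repeating the dispatch code in every routine.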


slide-18
SLIDE 18

Indirect Threaded Interpretation

[Figure: decode-and-dispatch interpretation (source code, dispatch loop, interpreter routines, with the source code read through "data" accesses) compared with threaded interpretation (source code, interpreter routines)]

slide-19
SLIDE 19

Predecoding

  • Parse each instruction into a pre-defined data structure to facilitate interpretation
  • Separate opcodes, operands, etc.
  • Reduces shifts / masks for decoding
  • More useful when source ISA is CISC

    lwz r1, 8(r2)
    add r3, r3, r1
    stw r3, 0(r4)

slide-20
SLIDE 20

Predecoding

    struct instruction {
        unsigned long op;
        unsigned char dest, src1, src2;
    } code[CODE_SIZE];

    LoadWordAndZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        opcode = code[TPC].op;
        routine = dispatch[opcode];
        goto *routine;
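
The predecoding pass that fills such a structure might look like the following. This is a minimal sketch that reuses the slide's `struct instruction`; the 32-bit encoding (one byte per field) and the function name `predecode` are assumptions made for illustration, not the PowerPC format.

```c
#include <stddef.h>
#include <stdint.h>

struct instruction {
    unsigned long op;
    unsigned char dest, src1, src2;
};

/* Hypothetical encoding: op in bits 31..24, dest in 23..16,
   src1 in 15..8, src2 in 7..0. All shifting and masking happens
   once here instead of on every interpreted execution. */
static void predecode(const uint32_t *src, size_t n, struct instruction *out) {
    for (size_t i = 0; i < n; i++) {
        out[i].op   = (src[i] >> 24) & 0xFF;
        out[i].dest = (unsigned char)((src[i] >> 16) & 0xFF);
        out[i].src1 = (unsigned char)((src[i] >> 8) & 0xFF);
        out[i].src2 = (unsigned char)(src[i] & 0xFF);
    }
}
```

After this pass the interpreter indexes the predecoded array with TPC, while SPC continues to track the address in the original source binary.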


slide-21
SLIDE 21

Direct Threaded Interpretation

  • Replace table lookup with direct access to address of interpreter routine
  • Requires predecoding
  • Reduces portability

slide-22
SLIDE 22

Direct Threaded Interpretation

    LoadWordAndZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        routine = code[TPC].op;
        goto *routine;
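
A compilable sketch of direct threading, again relying on GNU C labels-as-values (GCC and Clang). The predecoder stores the address of the interpreter routine in place of the opcode, so no table lookup remains at run time. The accumulator bytecode, the `PInst` type, and the trick of calling `interp` with a null program to obtain the routine addresses are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_LOADI = 1, OP_ADDI = 2, OP_COUNT = 3 };

typedef struct {
    void *routine;    /* address of the interpreter routine (replaces the opcode) */
    int32_t operand;
} PInst;

/* With code == NULL, hands the routine addresses to the predecoder.
   Otherwise interprets predecoded code and returns the accumulator. */
static int32_t interp(const PInst *code, void *const **labels_out) {
    static void *const labels[OP_COUNT] = { &&do_halt, &&do_loadi, &&do_addi };
    int32_t acc = 0;
    size_t tpc = 0;

    if (code == NULL) {
        *labels_out = labels;
        return 0;
    }
    goto *code[tpc].routine;

do_loadi:
    acc = code[tpc].operand;
    tpc++;
    goto *code[tpc].routine;   /* direct jump, no dispatch table */

do_addi:
    acc += code[tpc].operand;
    tpc++;
    goto *code[tpc].routine;

do_halt:
    return acc;
}

/* Predecode (opcode, operand) byte pairs into routine addresses. */
static void predecode(const uint8_t *src, size_t n, PInst *out) {
    void *const *labels;
    interp(NULL, &labels);
    for (size_t i = 0; i < n; i++) {
        out[i].routine = labels[src[2 * i]];
        out[i].operand = src[2 * i + 1];
    }
}
```

The predecoded program now embeds addresses from one particular build of the interpreter, which illustrates the loss of portability noted on the previous slide.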


slide-23
SLIDE 23

Direct Threaded Interpretation

[Figure: source code → pre-decoder → intermediate code → interpreter routines]

slide-24
SLIDE 24

Binary Translation

  • Convert source binary to target binary before execution
  • Logical conclusion of predecoding
  • Removes parsing and jumps altogether
  • Allows optimizations on native code
  • Achieves better performance than interpretation
  • Generated code no longer portable

slide-25
SLIDE 25

Binary Translation

[Figure: source code → binary translator → binary translated target code]

slide-26
SLIDE 26

Binary Translation

x86 Source Binary

    addl %edx,4(%eax)
    movl 4(%eax),%edx
    add  %eax,4

Translate to PowerPC Target

    r1 points to x86 register context block
    r2 points to x86 memory image
    r3 contains x86 ISA PC value

slide-27
SLIDE 27

Binary Translation

    lwz   r4,0(r1)    ;load %eax from register block
    addi  r5,r4,4     ;add 4 to %eax
    lwzx  r5,r2,r5    ;load operand from memory
    lwz   r4,12(r1)   ;load %edx from register block
    add   r5,r4,r5    ;perform add
    stw   r5,12(r1)   ;put result into %edx
    addi  r3,r3,3     ;update PC (3 bytes)

    lwz   r4,0(r1)    ;load %eax from register block
    addi  r5,r4,4     ;add 4 to %eax
    lwz   r4,12(r1)   ;load %edx from register block
    stwx  r4,r2,r5    ;store %edx value into memory
    addi  r3,r3,3     ;update PC (3 bytes)

    lwz   r4,0(r1)    ;load %eax from register block
    addi  r4,r4,4     ;add immediate
    stw   r4,0(r1)    ;place result back into %eax
    addi  r3,r3,3     ;update PC (3 bytes)

slide-28
SLIDE 28

Register Mapping

  • Map source registers to target registers
  • Reduces memory loads / stores
  • If target registers < source registers
  • Map some to memory
  • Map on per-block basis

slide-29
SLIDE 29

Register Mapping

    r1 points to x86 register context block
    r2 points to x86 memory image
    r3 contains x86 ISA PC value
    r4 holds x86 register %eax
    r7 holds x86 register %edx
    etc.

    addi  r16,r4,4     ;add 4 to %eax
    lwzx  r17,r2,r16   ;load operand from memory
    add   r7,r17,r7    ;perform add of %edx
    addi  r16,r4,4     ;add 4 to %eax
    stwx  r7,r2,r16    ;store %edx value into memory
    addi  r4,r4,4      ;increment %eax
    addi  r3,r3,9      ;update PC (9 bytes)

slide-30
SLIDE 30

Code Discovery Problem

  • May be difficult to statically predecode or translate the entire source program
  • Code Discovery Problem: how to find the beginning of all source instructions?
  • Consider the x86 code:

    31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00

    Decoding from byte 8b:  movl %esi, 0x08030000(%ebp) ??
    Decoding from byte b5:  mov %ch,0 ??

slide-31
SLIDE 31

Code Discovery Problem

  • Contributors to the code discovery problem
  • Variable length CISC instructions
  • Indirect jumps
  • Data interspersed with code
  • Padding instructions to align branch targets

[Figure: a stream of source ISA instructions (inst. 1, inst. 2, inst. 3, jump reg., data, inst. 5, inst. 6, uncond. brnch, pad, inst. 8) annotated with "jump indirect to???", "data in instruction stream", and "pad for instruction alignment"]

slide-32
SLIDE 32

Code Location Problem

  • How to map source PC to target PC for indirect jumps?
  • Indirect jump addresses in the target code still refer to addresses in the source

x86 source code

    movl %eax, 4(%esp)   ;load jump address from memory
    jmp  %eax            ;jump indirect through %eax

PowerPC target code

    addi  r16,r11,4      ;compute x86 address
    lwzx  r4,r2,r16      ;get x86 jump address
                         ; from x86 memory image
    mtctr r4             ;move to count register
    bctr                 ;jump indirect through ctr

slide-33
SLIDE 33

Simplified Solutions

  • Solutions are simpler for special cases
  • Fixed-length instruction sets (RISC ISAs)
  • ISAs designed to be emulated (Java)
  • No jumps / branches to arbitrary locations
  • No data / padding interspersed with instructions
  • All code can then be discovered


slide-34
SLIDE 34

Incremental Code Translation

  • General solutions translate the code dynamically and incrementally

  • Interpret, then translate
  • Interpretation performs code discovery
  • Emulation Manager (EM)
  • Translated code placed in the code cache
  • Employ a map table for SPC to TPC mapping
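
The map table can be sketched as a small hash table keyed by source PC. This is one possible layout, assuming open addressing with linear probing and a fixed table that never fills up; `MapTable`, `map_insert`, `map_lookup`, and the hash constant are illustrative choices, not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE 256  /* assumed to be larger than the number of blocks */

typedef struct {
    uint32_t spc;   /* source PC of a translated block */
    void *tpc;      /* address of its translation in the code cache */
    int used;
} MapEntry;

typedef struct {
    MapEntry slots[MAP_SIZE];
} MapTable;

/* Multiplicative hash; the top 8 bits select one of 256 slots. */
static size_t hash_spc(uint32_t spc) {
    return (size_t)((uint32_t)(spc * 2654435761u) >> 24);
}

/* Record that the block starting at `spc` was translated to `tpc`. */
static void map_insert(MapTable *t, uint32_t spc, void *tpc) {
    size_t i = hash_spc(spc);
    while (t->slots[i].used && t->slots[i].spc != spc)
        i = (i + 1) % MAP_SIZE;
    t->slots[i].spc = spc;
    t->slots[i].tpc = tpc;
    t->slots[i].used = 1;
}

/* Hit: return the translated address. Miss: return NULL, meaning
   the block still has to be interpreted or translated. */
static void *map_lookup(const MapTable *t, uint32_t spc) {
    size_t i = hash_spc(spc);
    while (t->slots[i].used) {
        if (t->slots[i].spc == spc)
            return t->slots[i].tpc;
        i = (i + 1) % MAP_SIZE;
    }
    return NULL;
}
```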


slide-35
SLIDE 35

Incremental Code Translation

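[Figure: the Emulation Manager looks up the source PC in the SPC to TPC lookup table; on a hit, control goes to translated code in translation memory; on a miss, the interpreter and translator process the source binary]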

slide-36
SLIDE 36

Dynamic Basic Blocks

  • Basic unit of incremental code translation
  • Determined by runtime control flow
  • Start at instruction following branch or jump
  • Follow the sequential instruction stream
  • End with the next branch or jump
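
The rule above can be sketched as a scan over already classified instructions. The `Kind` enum and `dynamic_block_end` are hypothetical names; a real translator would classify instructions while decoding them.

```c
#include <stddef.h>

/* Hypothetical classification of predecoded source instructions. */
typedef enum { K_OTHER, K_BRANCH, K_JUMP } Kind;

/* Returns the index of the last instruction of the dynamic basic block
   that starts at `start`: follow the sequential stream until the next
   branch or jump (or the end of the code). */
static size_t dynamic_block_end(const Kind *code, size_t n, size_t start) {
    size_t i = start;
    while (i + 1 < n && code[i] == K_OTHER)
        i++;
    return i;
}
```

Because a block starts wherever control actually arrives, entering the same code at index 0 and at index 1 yields two overlapping dynamic blocks that end at the same branch, whereas a static basic block would be split at every branch target.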


slide-37
SLIDE 37

Dynamic Basic Blocks

[Figure: the same code sequence divided into static basic blocks (blocks 1 to 5) and into dynamic basic blocks (blocks 1 to 4)]

slide-38
SLIDE 38

Flow of Control

  • Load source binary into memory, and EM begins interpreting source instructions
  • Dynamically translate blocks of source code
  • Place translated code into code cache
  • SPC-to-TPC mapping placed into the map table
  • Translation for one block is finished when a branch or jump is encountered

slide-39
SLIDE 39

Dynamic Translation Flowchart


slide-40
SLIDE 40

Tracking the Source PC

  • EM needs up-to-date SPC from interpreter or translated code
  • How to transfer?
  • Map SPC to a register
  • Place SPC in a “stub” after B&L instruction
  • EM uses link register to find the SPC

[Figure: each code block ends with a branch and link to the EM followed by the next source PC; the Emulation Manager uses a hash table to find the next code block]

slide-41
SLIDE 41

Translation Chaining

  • Similar to threading in interpreters
  • Link blocks together into chains
  • Replace branch to EM w/ branch to next block
  • Address of successor is determined using the map table

  • If successor block is not yet translated
  • Insert stub code to branch back to the EM
  • EM can replace the stub code later
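
Chaining can be modeled in C by giving each translated block a link field that starts out as a stub and is patched by the EM the first time the successor is found. This is a simplified sketch: each block is assumed to have a single fixed successor, the blocks are assumed not to form a cycle, and `Block`, `em_lookup`, and `run_chained` are hypothetical names.

```c
#include <stddef.h>

typedef struct Block {
    unsigned spc;          /* source PC where the block starts */
    unsigned next_spc;     /* source PC of its successor; 0 ends the run */
    struct Block *chain;   /* direct link to the successor; NULL acts as the stub */
} Block;

/* Stand-in for the EM's map-table lookup; counts how often it is used. */
static Block *em_lookup(Block *blocks, size_t n, unsigned spc, int *lookups) {
    (*lookups)++;
    for (size_t i = 0; i < n; i++)
        if (blocks[i].spc == spc)
            return &blocks[i];
    return NULL;  /* successor not translated yet */
}

/* Runs from `spc` and returns the number of EM lookups that were needed. */
static int run_chained(Block *blocks, size_t n, unsigned spc) {
    int lookups = 0;
    Block *b = em_lookup(blocks, n, spc, &lookups);
    while (b != NULL && b->next_spc != 0) {
        if (b->chain == NULL)  /* stub: go back to the EM, which sets up the chain */
            b->chain = em_lookup(blocks, n, b->next_spc, &lookups);
        b = b->chain;          /* chained: branch straight to the successor */
    }
    return lookups;
}
```

On the first pass through three blocks the EM is consulted three times; on later passes only the initial lookup remains, because the links are already in place.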


slide-42
SLIDE 42

Translation Chaining

[Figure: translation blocks and the VMM without chaining (each block returns to the VMM) and with chaining (blocks branch directly to successor blocks)]

slide-43
SLIDE 43

Creating a Link

[Figure: creating a link from a predecessor block to a successor block. Labels: JAL TM, Next SPC, B&L EM, Jump TPC; the numbered steps 1 to 4 include get next SPC, lookup successor, and set up chain]

slide-44
SLIDE 44

Translation Chaining


slide-45
SLIDE 45

Indirect Jump Prediction

  • Chaining cannot be used for blocks ending with an indirect jump
  • In many cases, jump target seldom changes
  • Use profiling to determine jump targets
  • Inline frequently used source PC / target PC values that are the targets of the jump
  • Most frequent source PCs are checked first

    if Rx == addr_1 goto target_1
    else if Rx == addr_2 goto target_2
    else if Rx == addr_3 goto target_3
    else hash_lookup(Rx)   ; do it the slow way

slide-46
SLIDE 46

Incremental Translation Issues

  • Tracking the source PC
  • See earlier slide
  • Self-modifying code
  • Already translated code may become invalid
  • Self-referencing code
  • Referenced data should correspond to source code, not translated target
  • Precise traps
  • Provide source state at traps and exceptions

slide-47
SLIDE 47

Instruction Set Issues

  • Register architectures
  • Condition codes
  • Data formats and arithmetic
  • Byte Order


slide-48
SLIDE 48

Register Architectures

  • Target registers are used for:
  • Holding general-purpose regs of the source ISA
  • Holding special-purpose regs of the source ISA
  • Pointing to register context block / memory image
  • Holding intermediate values used by emulator
  • # target regs < # source regs
  • Must prioritize use of target registers
  • Assign registers on a block-by-block basis


slide-49
SLIDE 49

Condition Codes

  • Condition codes are not used uniformly
  • IA-32 CC are set implicitly
  • SPARC and PowerPC set CC explicitly
  • MIPS does not use CC
  • Cases for CC emulation
  • Neither source nor target ISA uses CC
  • Source ISA does not use CC, target does
  • Source ISA has explicit CC, no CC on target
  • Source ISA has implicit CC, no CC on target


slide-50
SLIDE 50

IA-32 Condition Codes

  • IA-32 CC are a set of “flags” (EFLAGS)
  • CC set by IA-32 add instruction
  • OF: indicates whether integer overflow occurred
  • SF: indicates the sign of the result
  • ZF: indicates a zero result
  • AF: indicates a carry or borrow out of bit 3 of the result (used for BCD arithmetic)
  • CF: indicates a carry or borrow out of the most significant bit of the result
  • PF: indicates parity of the least significant byte of the result
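
To show how much work eager emulation of these flags implies, here is a sketch that computes all six flags for a 32-bit add in C. The `Flags` struct and `add_flags` are hypothetical names; the flag definitions follow the list above, with PF set when the low byte of the result contains an even number of 1 bits.

```c
#include <stdint.h>

typedef struct { int of, sf, zf, af, cf, pf; } Flags;

/* Flags produced by the 32-bit addition a + b. */
static Flags add_flags(uint32_t a, uint32_t b) {
    uint32_t r = a + b;  /* wraps modulo 2^32 */
    uint32_t low = r & 0xFFu;
    Flags f;

    f.cf = r < a;                              /* carry out of bit 31 */
    f.of = (int)((~(a ^ b) & (a ^ r)) >> 31);  /* operands share a sign, result differs */
    f.sf = (int)(r >> 31);                     /* sign bit of the result */
    f.zf = (r == 0);
    f.af = (int)(((a ^ b ^ r) >> 4) & 1u);     /* carry out of bit 3 */

    low ^= low >> 4;                           /* fold the low byte down to its parity */
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = !(low & 1u);                        /* 1 when the low byte has even parity */
    return f;
}
```

Doing this after every emulated add costs far more than the add itself, which motivates the lazy scheme on the following slides.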


slide-51
SLIDE 51

Lazy Condition Code Evaluation

  • CC are seldom used
  • Lazy evaluation of CC
  • Save the instruction that sets the CC
  • Only compute the CC when needed
  • During binary translation
  • Use analysis to determine cases where CC will never be used
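
Lazy evaluation can be sketched in C by recording the operation and its operands instead of the flags, and recomputing a flag only when something reads it. `LazyCC`, `emu_add`, `emu_sub`, and `get_zf` are hypothetical names, and the `zf_evaluations` counter exists only to make the saving visible.

```c
#include <stdint.h>

typedef enum { CC_NONE, CC_ADD, CC_SUB } CcOp;

typedef struct {
    CcOp op;             /* operation that last set the condition codes */
    uint32_t a, b;       /* its operands */
    int zf_evaluations;  /* how many times ZF was actually computed */
} LazyCC;

/* Emulated add: save operands and operation, compute no flags. */
static uint32_t emu_add(LazyCC *cc, uint32_t a, uint32_t b) {
    cc->op = CC_ADD;
    cc->a = a;
    cc->b = b;
    return a + b;
}

/* Emulated subtract, handled the same way. */
static uint32_t emu_sub(LazyCC *cc, uint32_t a, uint32_t b) {
    cc->op = CC_SUB;
    cc->a = a;
    cc->b = b;
    return a - b;
}

/* Called only when a source instruction such as jz needs the zero flag. */
static int get_zf(LazyCC *cc) {
    cc->zf_evaluations++;
    switch (cc->op) {
    case CC_ADD: return (uint32_t)(cc->a + cc->b) == 0;
    case CC_SUB: return (uint32_t)(cc->a - cc->b) == 0;
    default:     return 0;
    }
}
```

Any number of flag-setting instructions can execute between two flag reads, and only the last one is ever re-evaluated.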


slide-52
SLIDE 52

Lazy Condition Code Evaluation

    add %ebx, 0(%eax)
    add %ecx,%ebx
    jmp label1
    . . .
    label1: jz target

PPC to x86 register mappings

    R4  ↔ eax
    R5  ↔ ebx
    R6  ↔ ecx
    . .
    R24 ↔ scratch register used by emulation code
    R25 ↔ condition code operand 1   ;registers
    R26 ↔ condition code operand 2   ;used for
    R27 ↔ condition code operation   ;lazy condition
                                     ;emulation code
    R28 ↔ jump table base address

slide-53
SLIDE 53

Lazy Condition Code Evaluation

        mr    r25,r6        ;save operands
        mr    r26,r5        ;and opcode for
        li    r27,“add”     ;lazy condition code emulation
        add   r6,r6,r5      ;translation of add
        b     label1
        ...
    label1:
        bl    genZF         ;branch and link genZF code
        beq   cr0,target    ;branch on condition flag
        ...
    genZF:
        add   r29,r28,r27   ;add “opcode” to jump table base
        mtctr r29           ;copy to counter register
        bctr                ;branch via jump table
        ...
        ...
    “add”:
        add.  r24,r25,r26   ;perform PowerPC add, set cr0
        blr                 ;return

slide-54
SLIDE 54

Data Formats and Arithmetic

  • Most formats have been standardized
  • Integers: two’s complement
  • Floating point: IEEE standard
  • Basic logical / arithmetic operations are mostly present

  • Some exceptions
  • IA-32 FP uses 80 bit intermediate results
  • Integer divide vs FP divide and approximate
  • Different immediate lengths


slide-55
SLIDE 55

Byte Order

  • Ordering of bytes within a word may differ
  • Big endian vs. little endian
  • Guest typically maintained in same order assumed by the source ISA
  • Emulation SW modifies addresses when bytes within words are addressed

  • Some targets support both orders
  • MIPS, IA-64
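
One possible form of the address modification is sketched below for a little-endian guest whose aligned 32-bit words sit in memory in big-endian layout, so that word accesses need no byte swap. That layout is an assumption of this sketch, and `host_store_word` and `guest_load_byte` are hypothetical names. A byte access then flips the two low-order address bits.

```c
#include <stdint.h>

/* Store a word the way a big-endian host lays it out in memory:
   most significant byte at the lowest address. */
static void host_store_word(uint8_t *mem, uint32_t addr, uint32_t value) {
    mem[addr]     = (uint8_t)(value >> 24);
    mem[addr + 1] = (uint8_t)(value >> 16);
    mem[addr + 2] = (uint8_t)(value >> 8);
    mem[addr + 3] = (uint8_t)value;
}

/* Little-endian guest byte load: guest byte 0 of a word is its least
   significant byte, which lives at offset 3 in the big-endian layout,
   so the emulator XORs the guest address with 3. */
static uint8_t guest_load_byte(const uint8_t *mem, uint32_t guest_addr) {
    return mem[guest_addr ^ 3u];
}
```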


slide-56
SLIDE 56

Same ISA Emulation

  • Emulation SW can monitor source program at the instruction level

  • Applications
  • Simulation
  • OS call emulation
  • Program shepherding
  • Performance optimization
