SLIDE 1

DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms

Ziyi Zhao, Zhang Jiang, Ximing Liu, Xiaoli Gong* (Nankai University); Pen-Chung Yew (University of Minnesota); Wenwen Wang (University of Georgia)

SLIDE 2

Introduction

Dynamic Binary Translation (DBT)

“A key enabling technology” for:

  • Cross-ISA virtualization
  • Dynamic instrumentation

SLIDE 3

Introduction

The scalability of DBT is limited by the computing resources of a single node

Speedup saturates at around 2.0x

  • QEMU is a widely used DBT
  • Parallel programs from PARSEC
  • On a dual-core x64 machine

SLIDE 4

Introduction

Goal: Enable DBT to utilize compute resources across nodes

[Diagram: a single guest application runs on a distributed DBT layered over the host OS and hardware of multiple nodes]

SLIDE 5

Introduction

Goal: Enable DBT to utilize compute resources across nodes

In a distributed emulator...

  • How to maintain guest cache coherence?
    • Transparently
  • How to emulate guest system calls?
    • Side effects on the host kernel
  • How to emulate guest atomic operations?
    • Equivalent atomic semantics between RISC and CISC

SLIDE 6

Introduction

How does DBT work?

Guest Code → Intermediate Code → Host Code

Tiny Code Generator (TCG)

SLIDE 7

Introduction

How does DBT work?

[Diagram: inside the host process, the DBT (TCG) thread alternates between translating guest code and executing the translated code, while the guest memory region lives in the host address space]
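
In essence, the DBT loops between looking up the current guest PC in a code cache, translating the next guest basic block through the intermediate code when it is missing, and jumping into the generated host code. Below is a minimal sketch of that loop; the types and helpers (translate_block, the code cache layout) are illustrative assumptions, not QEMU's actual TCG interface.

    /* Minimal sketch of a DBT translate-and-execute loop
     * (illustrative names, not QEMU's real TCG interface). */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t (*host_block_fn)(void *cpu_state);  /* one translated block */

    struct tb { uint64_t guest_pc; host_block_fn fn; };  /* code cache entry */
    #define CACHE_SIZE 4096
    static struct tb code_cache[CACHE_SIZE];

    /* Assumed helper: lift one guest basic block to IR, lower it to host code. */
    extern host_block_fn translate_block(uint64_t guest_pc);

    void dbt_run(void *cpu_state, uint64_t guest_pc)
    {
        for (;;) {
            struct tb *e = &code_cache[guest_pc % CACHE_SIZE];
            if (e->fn == NULL || e->guest_pc != guest_pc) {  /* translation miss */
                e->guest_pc = guest_pc;
                e->fn = translate_block(guest_pc);           /* guest -> IR -> host */
            }
            guest_pc = e->fn(cpu_state);     /* run the block, get next guest PC */
        }
    }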

SLIDE 8

Implementation

What should a distributed DBT look like?

[Diagram: the guest application spans a master node and worker nodes; each node runs TCG threads over a distributed shared memory holding the guest memory region, coordinated by a manager and a communicator]

SLIDE 9

Implementation

How to keep cache coherence?

For the distributed shared memory region...

  • At what granularity?
    • Cache line size? Page size? Larger?
  • How to check privilege?
    • Software-based: instrumentation, check on every memory access
    • Hardware-based: MMU, host-page-level check
  • Which type of protocol?
    • Distributed / centralized
    • MSI

SLIDE 10

Implementation

How to keep cache coherence?

State      Page Protection
Modified   RW
Shared     R-
Invalid    --

  • Utilize the host MMU to do the state check
  • Synchronization granularity = 4 KB (host page size)
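
A minimal sketch of how the host MMU can do this check is below: each coherence state is mapped onto host page permissions with mprotect(), and an access to a page the node does not own traps into a SIGSEGV handler, which runs the DSM protocol before restoring permissions. The dsm_* helper names are assumptions for illustration, not DQEMU's code.

    /* Sketch: host page protection as MSI state (illustrative, not DQEMU's code). */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Assumed DSM protocol helper: obtain exclusive ownership of one page. */
    extern void dsm_fetch_exclusive(void *page);      /* any state -> Modified */

    static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));

        /* A real handler decodes the faulting instruction to tell reads from
         * writes; this sketch conservatively takes exclusive ownership. */
        dsm_fetch_exclusive(page);
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* now Modified (RW) */
        (void)sig; (void)ctx;
    }

    void dsm_install_handler(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = dsm_fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }

    /* Invalidation request from another node: drop to Invalid (no access). */
    void dsm_invalidate(void *page)
    {
        mprotect(page, PAGE_SIZE, PROT_NONE);
    }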

SLIDE 11

Implementation

The problem of system calls

[Diagram: a guest fopen() of input.txt issued on one node, while the file state lives on another node: "File Missing"]

  • E.g., fopen() by a worker thread at node #2 affects
    • the user-space file descriptor
    • the kernel-space resource manager
  • Syscalls also affect the host kernel

SLIDE 12

Implementation

The problem of system calls – Syscall Delegate

Global syscalls (delegated to the master node): read, write, openat, open, fstat, close, stat64, lstat64, fstat64, futex, writev, brk, mmap2, mprotect, madvise, munmap, clone, vfork, gettimeofday, clock_gettime, exit, nanosleep, ...

Local syscalls (executed directly on the worker node): all the rest

Delegation from the slave node to the master node ships:

  • Syscall parameters
  • Guest CPU state
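
A sketch of how the delegation decision might look on a worker node is below: a global syscall is packed up (number, arguments, guest CPU state) and sent to the master node for execution, while a local syscall runs directly. The message layout and helper names are assumptions for illustration, not DQEMU's actual interface.

    /* Sketch of syscall delegation on a worker node (illustrative names only). */
    #include <stdint.h>
    #include <string.h>

    struct guest_cpu_state { uint64_t regs[32]; uint64_t pc; };

    struct syscall_msg {
        uint64_t nr;                     /* guest syscall number    */
        uint64_t args[6];                /* guest syscall arguments */
        struct guest_cpu_state cpu;      /* guest CPU state         */
    };

    /* Assumed helpers: classification table, RPC to the master, local execution. */
    extern int      is_global_syscall(uint64_t nr);
    extern uint64_t send_to_master(const struct syscall_msg *msg);  /* blocking RPC */
    extern uint64_t do_local_syscall(uint64_t nr, const uint64_t args[6]);

    uint64_t emulate_syscall(uint64_t nr, const uint64_t args[6],
                             const struct guest_cpu_state *cpu)
    {
        if (is_global_syscall(nr)) {
            /* Global syscall: run on the master node, which owns the canonical
             * file descriptors and guest memory layout. */
            struct syscall_msg msg;
            msg.nr = nr;
            memcpy(msg.args, args, sizeof msg.args);
            msg.cpu = *cpu;
            return send_to_master(&msg);
        }
        /* Local syscall: no side effect outside this node, execute it here. */
        return do_local_syscall(nr, args);
    }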

SLIDE 13

Implementation

The emulation of atomic operations

RISC (ARM, MIPS, ...): LL (Load-Linked) / SC (Store-Conditional)
CISC (x86): CAS (Compare-and-Swap)

How to translate between them?

SLIDE 14

Implementation

The emulation of atomic operations

Hierarchical lock

  • 1. Intra-node: consistency model translation [ArMOR]
  • 2. Inter-node: MSI coherence protocol – sequential consistency
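
As a rough illustration of the two levels, the sketch below emulates a guest LL/SC pair on an x86-like host: the inter-node level first pins the target page in the Modified state so no other node can intervene, and the intra-node level then relies on a host compare-and-swap so threads on the same node stay atomic. The dsm_* helpers are hypothetical, and the real intra-node mapping in DQEMU follows ArMOR.

    /* Sketch: emulating a guest LL/SC pair with a host CAS under the
     * hierarchical scheme (hypothetical helpers, not DQEMU's implementation). */
    #include <stdint.h>

    /* Assumed inter-node step: keep the page holding addr in the Modified
     * state until the matching store-conditional completes. */
    extern void dsm_lock_exclusive(void *addr);
    extern void dsm_unlock_exclusive(void *addr);

    static _Thread_local uint32_t *ll_addr;   /* address of the pending load-linked */
    static _Thread_local uint32_t  ll_value;  /* value observed by the load-linked  */

    uint32_t emulate_ll(uint32_t *addr)                 /* guest load-linked */
    {
        dsm_lock_exclusive(addr);                       /* inter-node: MSI */
        ll_addr  = addr;
        ll_value = __atomic_load_n(addr, __ATOMIC_SEQ_CST);
        return ll_value;
    }

    int emulate_sc(uint32_t *addr, uint32_t newval)     /* guest store-conditional */
    {
        int ok = 0;
        if (addr == ll_addr) {
            uint32_t expected = ll_value;
            /* intra-node: host CAS keeps threads on the same node atomic */
            ok = __atomic_compare_exchange_n(addr, &expected, newval,
                                             0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        }
        ll_addr = 0;
        dsm_unlock_exclusive(addr);
        return ok ? 0 : 1;    /* ARM convention: 0 means the store succeeded */
    }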

SLIDE 15

Optimization

Page Split: The false sharing overhead

  • Probability: the sharing granularity grows from a 64 B cache line to a 4096 B page
  • Cost: from a ~23-cycle cache miss to a network transfer + page fault (>= 120,000 cycles)

SLIDE 16

Optimization

Page Split: The false sharing overhead

  • Reduces the probability of false sharing
  • Compatible with the cache coherence protocol
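
One way to picture page splitting, under the assumption that coherence state is kept per sub-block of a page rather than per whole 4 KB page, is sketched below; the 256 B sub-block size and the helper names are illustrative, not DQEMU's exact mechanism.

    /* Sketch: tracking coherence state per sub-block instead of per 4 KB page
     * (illustrative; DQEMU's real page-split mechanism may differ). */
    #include <stdint.h>

    #define PAGE_SIZE   4096
    #define SUB_BLOCK   256                      /* assumed split granularity */
    #define SUB_PER_PG  (PAGE_SIZE / SUB_BLOCK)

    enum msi_state { MSI_INVALID, MSI_SHARED, MSI_MODIFIED };

    struct split_page {
        uintptr_t      base;                     /* page-aligned guest address  */
        enum msi_state state[SUB_PER_PG];        /* one MSI state per sub-block */
    };

    /* Assumed protocol helper: fetch/upgrade just one sub-block from its owner. */
    extern void dsm_fetch_subblock(uintptr_t addr, int write);

    /* Called on a faulting access once a page is known to be falsely shared. */
    void split_page_access(struct split_page *pg, uintptr_t addr, int is_write)
    {
        int idx = (int)((addr - pg->base) / SUB_BLOCK);
        enum msi_state need = is_write ? MSI_MODIFIED : MSI_SHARED;

        if (pg->state[idx] < need) {             /* Invalid < Shared < Modified */
            dsm_fetch_subblock(addr, is_write);  /* only this sub-block travels */
            pg->state[idx] = need;
        }
    }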

SLIDE 17

Optimization

Hint-based thread scheduling: data sharing among nodes

[Diagram: the same master/slave DSM architecture as before, with guest threads on different nodes sharing data through the distributed shared memory]

SLIDE 18

Optimization

Hint-based thread scheduling: data sharing among nodes

A hint in the source code tells the DBT to call DQEMU_scheduler.
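
One possible way to express such a hint, sketched below, is a macro in the guest source that issues an otherwise-unused syscall number; natively the call fails harmlessly, but under the DBT it is intercepted and routed to the DQEMU scheduler together with the address range the thread is about to use. The macro name and syscall number are made up for illustration.

    /* Sketch: a guest-side scheduling hint (macro name and syscall number are
     * made up for illustration; natively the call just returns ENOSYS). */
    #include <unistd.h>

    /* A syscall number assumed unused, reserved as "call the DQEMU scheduler". */
    #define DQEMU_HINT_SYSCALL 0x1dcf

    /* Placed before a thread starts working on a region so the emulator can
     * migrate the thread to the node that owns that data. */
    #define DQEMU_SCHED_HINT(region, len) \
        syscall(DQEMU_HINT_SYSCALL, (region), (len))

    /* Example use in the guest program: */
    void process_chunk(double *chunk, long n)
    {
        DQEMU_SCHED_HINT(chunk, n * (long)sizeof *chunk);
        for (long i = 0; i < n; i++)
            chunk[i] *= 2.0;
    }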

SLIDE 19

Optimization

Page forwarding: hiding the network latency

[Diagram: remote page accesses over a contiguous virtual memory range are recorded; once a sequential pattern is detected, batches of pages (10, then 20, ...) are forwarded/prefetched into a local page cache ahead of demand]
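
A sketch of the detection logic behind this is below: the worker records remote page faults, and once they form a sequential run over contiguous virtual addresses it asks the owner to forward a growing batch of upcoming pages (for example 10, then 20) into the local page cache before they are demanded. Thresholds and helper names are illustrative.

    /* Sketch: sequential-access detection driving page forwarding/prefetch
     * (thresholds and helpers are illustrative). */
    #include <stdint.h>

    #define PAGE_SIZE        4096
    #define SEQ_THRESHOLD    3        /* sequential faults before prefetching */
    #define INITIAL_BATCH    10       /* pages forwarded on the first trigger */
    #define MAX_BATCH        64

    /* Assumed helper: ask the owner node to push `count` pages starting at addr. */
    extern void dsm_forward_pages(uintptr_t start_addr, int count);

    static uintptr_t last_fault_page;
    static int       seq_run;
    static int       batch = INITIAL_BATCH;

    /* Called from the remote-page fault path, after the faulting page is fetched. */
    void record_remote_fault(uintptr_t fault_addr)
    {
        uintptr_t page = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);

        if (page == last_fault_page + PAGE_SIZE)
            seq_run++;                       /* continuing a sequential run */
        else {
            seq_run = 0;                     /* pattern broken: start over  */
            batch = INITIAL_BATCH;
        }
        last_fault_page = page;

        if (seq_run >= SEQ_THRESHOLD) {
            dsm_forward_pages(page + PAGE_SIZE, batch);  /* prefetch ahead */
            seq_run = 0;
            if (batch * 2 <= MAX_BATCH)
                batch *= 2;                  /* e.g., 10 pages, then 20, ... */
        }
    }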

SLIDE 20

Results

Experiment Setup

Network     TP-Link TL-SG1024DT Gigabit switch
Processor   Quad-core Intel i5-6500 @ 3.30 GHz
Memory      12 GB
OS/Kernel   Ubuntu 18.04, Linux 4.15.0
Workload    Micro-benchmarks, PARSEC-3.0
ISA         Guest: ARM → Host: x64
Baseline    QEMU-4.2.0

SLIDE 21

Results

Memory Access Performance

Access Type                 Throughput (MB/s)   Latency (us)
QEMU Sequential Access      173.06              -
Remote Sequential Access    7.88                410.5
Page Forwarding Enabled     108.01              83.2

SLIDE 22

Results

Memory Access Performance

Access Type                Throughput (MB/s)
QEMU Access of 128 bytes   20,259
False Sharing of 1 Page    2,216
Page Splitting Enabled     75,294

SLIDE 23

Results

Atomic Operation Performance

[Charts: elapsed time (s) of an atomic-operation micro-benchmark vs. number of slave nodes (1 to 6), DQEMU compared with QEMU]

SLIDE 24

Results

Scalability - Ideal

Slave Node(s)        1     2     3     4     5     6
Normalized Speedup   1.00  1.97  2.97  3.98  4.93  5.94

DQEMU scales almost linearly with the number of slave nodes on this benchmark.

SLIDE 25

Results

Scalability – Parallel Programs

[Chart: normalized speedup of PARSEC blackscholes on 1 to 6 slave nodes, DQEMU (origin) vs. qemu-4.2.0]

SLIDE 26

Results

Scalability – Parallel Programs

[Chart: normalized speedup of blackscholes on 1 to 6 slave nodes, comparing DQEMU variants (origin, forwarding, full) with qemu-4.2.0]

SLIDE 27

Results

Scalability – A heavy data-sharing program

[Charts: normalized time of x264 on 1 to 6 slave nodes, with a breakdown into page fault, syscall, and execution time]

SLIDE 28

Results

Discussion

  • A more scalable coherence protocol?
  • Random memory access hurts DSM.
  • What kinds of programs suit DQEMU? How to recognize them?
  • Support for various host ISAs → heterogeneous computing?

SLIDE 29

Thank you! Q&A
