DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
Ziyi Zhao, Zhang Jiang, Ximing Liu, Xiaoli Gong* (Nankai University); Pen-Chung Yew (University of Minnesota); Wenwen Wang (University of Georgia)
Introduction
“A Key Enabling Technology”: Cross-ISA Virtualization, Dynamic Instrumentation
Speedup saturates at around 2.0x
[Figure: one Guest Application on top of three Host OS / Hardware stacks.]
Tiny Code Generator (TCG): Guest Code → Intermediate Code → Host Code
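The two-stage flow above (guest code to intermediate code to host code) can be illustrated with a toy translator. This is only a sketch under stated assumptions: the two-instruction guest ISA, the IR opcodes, and the names `guest_to_ir`, `ir_to_host`, and `Translator` are all hypothetical, and the "host code" is a Python closure rather than real machine code. It is not TCG's actual API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy guest ISA: (opcode, destination register, immediate).
GuestInsn = Tuple[str, str, int]
# Hypothetical ISA-neutral intermediate op with the same shape.
IROp = Tuple[str, str, int]
HostCode = Callable[[Dict[str, int]], None]


def guest_to_ir(block: List[GuestInsn]) -> List[IROp]:
    """Front end: lower guest instructions to intermediate ops."""
    table = {"MOVI": "movi", "ADDI": "addi"}
    return [(table[op], dest, imm) for op, dest, imm in block]


def ir_to_host(ir: List[IROp]) -> HostCode:
    """Back end: emit 'host code', modelled here as a Python closure."""
    def host_code(regs: Dict[str, int]) -> None:
        for op, dest, imm in ir:
            if op == "movi":
                regs[dest] = imm
            elif op == "addi":
                regs[dest] = regs[dest] + imm
    return host_code


class Translator:
    """Translates a block on first use and caches it by guest PC."""

    def __init__(self) -> None:
        self.cache: Dict[int, HostCode] = {}
        self.translations = 0

    def run_block(self, pc: int, block: List[GuestInsn],
                  regs: Dict[str, int]) -> None:
        if pc not in self.cache:
            self.cache[pc] = ir_to_host(guest_to_ir(block))
            self.translations += 1
        self.cache[pc](regs)


if __name__ == "__main__":
    t = Translator()
    regs = {"r0": 0}
    block = [("MOVI", "r0", 5), ("ADDI", "r0", 3)]
    t.run_block(0x1000, block, regs)
    t.run_block(0x1000, block, regs)  # second run reuses the cached block
    print(regs["r0"], t.translations)
```

Keeping the intermediate form separate from both ends is what makes such a translator retargetable: a new guest needs only a new front end, and a new host only a new back end.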
[Figure: on a single Host OS, TCG threads run the DBT translate/execute loop for the Guest Application over the Guest Mem Region.]
Implementation
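[Figure: DQEMU architecture. The Guest Application runs as TCG threads on a Master Node and on Worker Node 1 and Worker Node 2, each with its own Host OS. The Guest Mem Region is backed by Distributed Shared Memory; a Communicator and a Manager connect the nodes.]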
State      Page Protection
Modified   RW
Shared     R-
Invalid
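The state table above reads like a page-granularity coherence protocol, where a page's state on a node decides its protection bits. The sketch below shows one way such a state machine could behave. It is an assumption-heavy illustration, not DQEMU's implementation: the `PageDirectory` class, the transition rules (MSI-style), and the `--` protection for Invalid are all hypothetical, since the slide lists only `RW` and `R-`.

```python
from enum import Enum
from typing import Dict


class State(Enum):
    MODIFIED = "Modified"
    SHARED = "Shared"
    INVALID = "Invalid"


# Protection per state. RW and R- come from the slide; "--" for Invalid
# is an assumption (no access, so any touch faults).
PROTECTION = {State.MODIFIED: "RW", State.SHARED: "R-", State.INVALID: "--"}


class PageDirectory:
    """Tracks the state of one guest page on every node (hypothetical)."""

    def __init__(self, nodes: int, owner: int = 0) -> None:
        self.state: Dict[int, State] = {n: State.INVALID for n in range(nodes)}
        self.state[owner] = State.MODIFIED

    def read(self, node: int) -> None:
        """Read fault on an Invalid page: downgrade the writer, share a copy."""
        if self.state[node] is State.INVALID:
            for n, s in list(self.state.items()):
                if s is State.MODIFIED:
                    self.state[n] = State.SHARED
            self.state[node] = State.SHARED

    def write(self, node: int) -> None:
        """Write fault: invalidate every other copy, then take ownership."""
        if self.state[node] is not State.MODIFIED:
            for n in list(self.state):
                if n != node:
                    self.state[n] = State.INVALID
            self.state[node] = State.MODIFIED

    def protection(self, node: int) -> str:
        return PROTECTION[self.state[node]]


if __name__ == "__main__":
    page = PageDirectory(nodes=3)
    page.read(1)   # node 0 and node 1 now share the page read-only
    page.write(2)  # node 2 becomes the only writer
    print([page.protection(n) for n in range(3)])
```

On a real host the protection column would typically be enforced with page permissions, so that a disallowed access raises a fault the emulator can handle.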
[Figure: Syscall fopen(). input.txt is present on one node but missing on node #2, so the node on which the syscall runs affects its result.]
Local Syscall / Global Syscall
read, write, openat, open, fstat, close, stat64, lstat64, fstat64, futex, writev, brk, mmap2, mprotect, madvise, munmap, clone, vfork
gettimeofday, clock_gettime, exit, nanosleep, ... all the rest
Master Node / Slave Node
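The split into local and global syscalls suggests a dispatch step that decides where each guest syscall executes. The sketch below is a hypothetical illustration: the `SyscallRouter` class and its field names are invented here, the flattened slide does not state unambiguously which list belongs to which class, and sending unclassified calls to the master is an assumption made for this example only.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SyscallRouter:
    """Chooses the node that executes a guest syscall (hypothetical helper)."""
    global_calls: FrozenSet[str]
    local_calls: FrozenSet[str]
    master: str = "master"

    def target(self, syscall: str, caller_node: str) -> str:
        if syscall in self.local_calls:
            # Local: handled on the node where the guest thread runs.
            return caller_node
        # Global calls go to the master. ASSUMPTION: anything that is not
        # classified is also forwarded to the master.
        return self.master


if __name__ == "__main__":
    # ASSUMPTION: a small illustrative classification, not the slide's table.
    router = SyscallRouter(
        global_calls=frozenset({"open", "read", "write"}),
        local_calls=frozenset({"gettimeofday", "nanosleep"}),
    )
    print(router.target("open", "slave2"))
    print(router.target("gettimeofday", "slave2"))
```

Routing file-related calls to a single node is one way to avoid the situation on the fopen() slide, where a file exists on one node but not on another.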
RISC (ARM, MIPS...): LL (Load-linked) / SC (Store-conditional)
CISC (x86): CAS (Compare and Swap)
How to translate LL/SC into CAS?
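One common way to emulate LL/SC on a CAS-only host is to remember the value seen by the load-linked and let the store-conditional succeed only if a compare-and-swap against that value succeeds. The sketch below illustrates that value-comparison scheme and its known weakness (an A-B-A write sequence goes unnoticed). It is not claimed to be DQEMU's approach; `Memory`, `GuestCPU`, and their methods are hypothetical, and a Python lock stands in for hardware atomicity.

```python
import threading
from typing import Dict, Optional


class Memory:
    """Word-addressed memory with an atomic compare-and-swap."""

    def __init__(self) -> None:
        self._words: Dict[int, int] = {}
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self, addr: int) -> int:
        with self._lock:
            return self._words.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        with self._lock:
            self._words[addr] = value

    def cas(self, addr: int, expected: int, new: int) -> bool:
        with self._lock:
            if self._words.get(addr, 0) != expected:
                return False
            self._words[addr] = new
            return True


class GuestCPU:
    """Per-vCPU state for a value-based LL/SC emulation."""

    def __init__(self, mem: Memory) -> None:
        self.mem = mem
        self.ll_addr: Optional[int] = None
        self.ll_value = 0

    def load_linked(self, addr: int) -> int:
        self.ll_addr = addr
        self.ll_value = self.mem.load(addr)
        return self.ll_value

    def store_conditional(self, addr: int, value: int) -> bool:
        if self.ll_addr != addr:
            return False  # no matching load-linked
        ok = self.mem.cas(addr, self.ll_value, value)
        self.ll_addr = None  # the link is consumed either way
        return ok


if __name__ == "__main__":
    mem = Memory()
    cpu = GuestCPU(mem)

    cpu.load_linked(0x10)
    mem.store(0x10, 7)                      # another thread changes the word
    print(cpu.store_conditional(0x10, 1))   # fails: value changed

    mem.store(0x10, 0)
    cpu.load_linked(0x10)
    mem.store(0x10, 5)                      # A -> B
    mem.store(0x10, 0)                      # B -> A
    print(cpu.store_conditional(0x10, 1))   # succeeds, unlike real LL/SC
```

The second case is why the translation is marked with a question mark: real LL/SC would fail after any intervening store, while a value comparison cannot tell that the word was written and restored.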
Hierarchical lock
Optimization
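[Figure: DQEMU architecture with data sharing. TCG threads of the Guest Application run on a Master Node and on Slave Node 1 and Slave Node 2, each with its own Host OS. The Guest Mem Region is backed by Distributed Shared Memory; a Communicator and a Manager connect the nodes.]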
Source code hint: means “call DQEMU_scheduler” to the DBT
[Figure: page forwarding over a continuous virtual memory space. Page accesses are recorded; a trigger forwards (prefetches) 10 pages into the page cache, and a later trigger forwards 20 pages.]
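The forwarding figure can be read as a sequential-access detector with a growing prefetch window. The sketch below shows one such policy under explicit assumptions: the `PageForwarder` class is hypothetical, the trigger of three consecutive pages is a guess based on the three "record" labels, and the window growing from 10 to 20 pages mirrors the page counts in the figure without claiming that this is how DQEMU chooses them.

```python
from typing import List, Optional, Set


class PageForwarder:
    """Detects sequential page accesses and forwards pages ahead of them."""

    def __init__(self, trigger: int = 3, window: int = 10,
                 max_window: int = 20) -> None:
        self.trigger = trigger        # ASSUMPTION: run length that triggers
        self.window = window          # pages forwarded per trigger
        self.max_window = max_window
        self.last_page: Optional[int] = None
        self.run = 0
        self.cache: Set[int] = set()  # pages already forwarded

    def access(self, page: int) -> List[int]:
        """Record one page access; return the pages forwarded because of it."""
        if self.last_page is not None and page == self.last_page + 1:
            self.run += 1
        else:
            self.run = 1
        self.last_page = page
        if self.run < self.trigger:
            return []
        # Sequential run detected: forward the next `window` pages that are
        # not in the page cache yet, then widen the window up to the cap.
        ahead = range(page + 1, page + 1 + self.window)
        forwarded = [p for p in ahead if p not in self.cache]
        self.cache.update(forwarded)
        self.run = 0
        self.window = min(self.window * 2, self.max_window)
        return forwarded


if __name__ == "__main__":
    fwd = PageForwarder()
    for page in range(3):
        sent = fwd.access(page)
    print(len(sent), fwd.window)
```

A non-sequential access resets the run counter, so random access patterns never trigger forwarding in this sketch.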
Results
Network: TP-Link TL-SG1024DT Gigabit Switch
Processor: Quad-core Intel i5-6500 @ 3.30GHz CPU
Memory: 12GB
Kernel: Linux 4.15.0, Ubuntu 18.04
Workload: micro bench, PARSEC-3.0
ISA: Guest: ARM, Host: X64
Baseline: QEMU-4.2.0
Access Type               Throughput(MB/s)   Latency(us)
QEMU Sequential Access    173.06
                          7.88               410.5
Page forwarding Enabled   108.01             83.2
[Figure: sequential memory access, comparing memory under QEMU with memory under DQEMU.]
False sharing

Access Type                Throughput(MB/s)
QEMU Access of 128 bytes   20,259
False Sharing of 1 Page    2,216
Page Splitting Enabled     75,294
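The throughput drop under false sharing comes from two nodes writing different bytes of the same page, so the page keeps changing owner. The sketch below models only that effect by counting ownership transfers at two granularities. It is a simplified illustration with hypothetical names (`count_transfers`, `whole_page`, `split_chunk`); tracking each 128-byte chunk separately is a stand-in for page splitting, not a description of how DQEMU implements it.

```python
from typing import Callable, Dict, Iterable, Tuple

PAGE_SIZE = 4096
CHUNK_SIZE = 128  # matches the 128-byte accesses in the table above


def count_transfers(writes: Iterable[Tuple[int, int]],
                    unit_of: Callable[[int], int]) -> int:
    """Count how often a coherence unit changes its writing node.

    `writes` is a sequence of (node, address); `unit_of` maps an address
    to the coherence unit that has to be owned before writing.
    """
    owner: Dict[int, int] = {}
    transfers = 0
    for node, addr in writes:
        unit = unit_of(addr)
        if unit in owner and owner[unit] != node:
            transfers += 1
        owner[unit] = node
    return transfers


def whole_page(addr: int) -> int:
    """Coherence at page granularity."""
    return addr // PAGE_SIZE


def split_chunk(addr: int) -> int:
    """ASSUMPTION: after splitting, each 128-byte chunk is its own unit."""
    return addr // CHUNK_SIZE


if __name__ == "__main__":
    # Node 0 writes offset 0 and node 1 writes offset 128, alternating.
    writes = [(i % 2, (i % 2) * CHUNK_SIZE) for i in range(100)]
    print(count_transfers(writes, whole_page))
    print(count_transfers(writes, split_chunk))
```

With page-granularity tracking every write after the first forces a transfer, while the split layout needs none, which is the qualitative gap the table reports.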
[Figure: two panels of Elapsed Time(s) vs. Slave Node(s), 1 to 6; legend: DQEMU-1, QEMU-1. Data labels, first panel: 5.2, 6.8, 9.5, 16.5, 21.3, 25.6 and 0.48. Data labels, second panel: 4.0, 2.1, 1.6, 1.4, 1.2, 1.2 and 3.4.]
[Figure: Normalized Speedup vs. Slave Node(s), 1 to 6; legend: DQEMU. Data labels: 1.00, 1.97, 2.97, 3.98, 4.93, 5.94 and 1.04.]
[Figure: blackscholes. Normalized Speedup, x-axis 1 to 6; legend: qemu-4.2.0.]
[Figure: blackscholes. Normalized Speedup, x-axis 1 to 6; legend: forwarding, full, qemu-4.2.0.]
[Figure: x264, two panels. Normalized Time vs. Slave Nodes, 1 to 6; legend: pagefault, syscall, exec.]