

SLIDE 1

Measuring PROOF Lite performance in (non)virtualized environment

Ioannis Charalampidis, Aristotle University of Thessaloniki Summer Student 2010

SLIDE 2

Overview

  • Introduction
  • Benchmarks: Overall execution time
  • Benchmarks: In-depth analysis
  • Conclusion
SLIDE 3

What am I looking for?

  • There is a known overhead caused by virtualization
    ▫ How big is it?
    ▫ Where is it located?
    ▫ How can we minimize it?
    ▫ Which hypervisor has the best performance?
  • I am using CernVM as the guest
SLIDE 4

What is CernVM?

  • It’s a baseline Virtual Software Appliance for use by the LHC experiments
  • It’s available for many hypervisors
    ▫ Hyper-V
    ▫ VMware
    ▫ VirtualBox
    ▫ KVM / QEMU
    ▫ XEN
SLIDE 5

How am I going to find the answers?

  • Using a standard data analysis application (ROOT + PROOF Lite) as the benchmark
  • Test it on different hypervisors
  • And on a varying number of workers/CPUs
  • Compare the performance (Physical vs. Virtualized)
SLIDE 6

Problem

  • The benchmark application requires too much time to complete (2 min ~ 15 min)
    ▫ At least 3 runs are required for reliable results
    ▫ The in-depth analysis overhead is about 40%
    ▫ It is not efficient to perform detailed analysis for every CPU / hypervisor configuration
  • Create the overall execution time benchmarks first
  • Find the best configuration to run the traces on
SLIDE 7

Benchmarks performed

  • Overall time
    ▫ Using the time utility and automated batch scripts
  • In-depth analysis
    ▫ Tracing system calls using
       STrace
       SystemTAP
    ▫ Analyzing the trace files using applications I wrote
       BASST (Batch Analyzer based on STrace)
       KARBON (General purpose application profiler based on trace files)
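The overall-time runs described above (the time utility driven by batch scripts) can be sketched as a small Python harness. This is an illustrative assumption, not the deck's actual script; the real runs drive a ROOT + PROOF Lite job instead of the placeholder command:

```python
import subprocess
import sys
import time

def run_trials(cmd, trials=3):
    """Run `cmd` repeatedly and return the wall-clock time of each run,
    mimicking what the `time` utility records for every benchmark pass.
    At least 3 trials are used, matching the deck's reliability rule."""
    durations = []
    for _ in range(trials):
        start = time.monotonic()
        subprocess.run(cmd, check=True)   # raise if the benchmark fails
        durations.append(time.monotonic() - start)
    return durations
```

Usage would be e.g. `run_trials(["root", "-b", "-q", "bench.C"])`, archiving the returned list per hypervisor / CPU-count configuration.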
SLIDE 8

Process description and results

SLIDE 9

Benchmark Configuration

  • Base machine
    ▫ Scientific Linux CERN 5
  • Guests
    ▫ CernVM 2.1
  • Software packages from SLC repositories
    ▫ Linux Kernel 2.6.18-194.8.1.el5
    ▫ XEN 3.1.2 + 2.6.18-194.8.1.el5
    ▫ KVM 83-194.8.1.el5
    ▫ Python 2.5.4p2 (from AFS)
    ▫ ROOT 5.26.00b (from AFS)
  • Base machine hardware
    ▫ 24 x Intel Xeon X7460 2.66 GHz with VT-x support (64-bit)
    ▫ No VT-d nor Extended Page Tables (EPT) hardware support
    ▫ 32 GB RAM

SLIDE 10

Benchmark Configuration

  • Virtual machine configuration
    ▫ 1 CPU, then 2 to 16 CPUs in steps of 2
    ▫ <CPU#> + 1 GB RAM for the physical disk and network tests
    ▫ <CPU#> + 17 GB RAM for the RAM disk tests
    ▫ Disk image for the OS
    ▫ Physical disk for the data + software
  • Important background services running
    ▫ NSCD (caching daemon)

SLIDE 11

Benchmark Configuration

  • Caches were cleared before every test
    ▫ Page cache, dentries and inodes
    ▫ Using the /proc/sys/vm/drop_caches flag
  • No swap memory was used
    ▫ Verified by periodically monitoring the free memory
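Clearing the caches via /proc/sys/vm/drop_caches, as described above, can be sketched like this. The `path` parameter is only there so the sketch can be exercised without root; on a real system it must be the /proc flag and the process needs root:

```python
import os

def drop_caches(path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop the page cache,
    dentries and inodes (value 3). Requires root on a real system."""
    os.sync()                      # write dirty pages back first
    with open(path, "w") as f:
        f.write("3\n")             # 3 = page cache + dentries + inodes
```

In the benchmark this is invoked on both the host and the guest before every run.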

SLIDE 12

Automated batch scripts

  • The VM batch script runs on the host machine
  • It repeats the following procedure:
    ▫ Create a new virtual machine
    ▫ Wait for the machine to finish booting
    ▫ Connect to the controlling script inside the VM
    ▫ Drop caches both on the host and the guest
    ▫ Start the job
    ▫ Receive and archive the results

  [Diagram: the client-side batch script driving benchmark runs on the server through the hypervisor]
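The batch procedure above can be sketched as a loop over hypervisors and CPU counts. Every helper name here is a hypothetical placeholder for the real hypervisor commands and the controlling script inside the guest; this only illustrates the sequencing:

```python
# Illustrative sketch of the host-side batch loop; `actions` is a
# hypothetical callback standing in for the real hypervisor / guest commands.

def benchmark_vm(hypervisor, cpus, actions, results):
    """Run one benchmark cycle and archive its result."""
    actions("create", hypervisor, cpus)        # create a fresh VM
    actions("wait_boot", hypervisor, cpus)     # wait for it to finish booting
    actions("connect", hypervisor, cpus)       # reach the controlling script
    actions("drop_caches", hypervisor, cpus)   # host *and* guest caches
    output = actions("start_job", hypervisor, cpus)
    results.append((hypervisor, cpus, output)) # receive and archive

def sweep(hypervisors, cpu_counts, actions):
    """Repeat the cycle for every hypervisor / CPU-count configuration."""
    results = []
    for hv in hypervisors:
        for n in cpu_counts:
            benchmark_vm(hv, n, actions, results)
    return results
```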

SLIDE 13

Problem

  • There was a bug in PROOF Lite that looked up a non-existing hostname during the startup of each worker
     Example: 0.2-plitehp24.cern.ch-1281241251-1271
  • Discovered by detailed system call tracing
     The hostname couldn’t be cached
     The application had to wait for the timeout
     The startup time was delayed randomly
     Call tracing applications made this delay even bigger, virtually hanging the application

SLIDE 14

Problem

  • The problem was resolved in two steps:
    ▫ A minimal DNS proxy was developed that fakes the existence of the buggy hostname
    ▫ It was later fixed in the PROOF source

  [Diagram: the application queries the fake DNS proxy, which forwards normal names (cernvm.cern.ch → 137.138.234.20) to the real DNS server and answers the buggy name (x.x-xxxxxx-xxx-xxx) with 127.0.0.1]
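The core of such a fake DNS proxy can be sketched with two pure functions: one that pulls the query name out of a DNS request, and one that builds an A-record answer pointing the buggy name at 127.0.0.1. This is a minimal sketch of the idea, not the proxy actually used; the worker-name pattern in `BUGGY` is an assumption based on the example hostname above:

```python
import re
import struct

# Assumed shape of the buggy PROOF Lite worker hostname, e.g.
# "0.2-plitehp24.cern.ch-1281241251-1271".
BUGGY = re.compile(rb"^\d+\.\d+-.*-\d+-\d+$")

def qname(packet):
    """Extract the query name from a DNS request packet."""
    labels, i = [], 12                     # skip the 12-byte DNS header
    while packet[i]:
        n = packet[i]
        labels.append(packet[i + 1:i + 1 + n])
        i += 1 + n
    return b".".join(labels)

def fake_reply(packet, ip="127.0.0.1"):
    """Build an A-record answer for the query in `packet`."""
    tid = packet[:2]                       # keep the transaction ID
    flags = struct.pack(">H", 0x8180)      # standard response, no error
    counts = struct.pack(">HHHH", 1, 1, 0, 0)
    question = packet[12:]                 # echo the question section
    answer = (b"\xc0\x0c"                  # pointer to the name at offset 12
              + struct.pack(">HHIH", 1, 1, 60, 4)  # type A, IN, TTL 60, 4 bytes
              + bytes(int(o) for o in ip.split(".")))
    return tid + flags + counts + question + answer
```

A real proxy would listen on UDP port 53, answer queries matching `BUGGY` with `fake_reply`, and forward everything else to the real DNS server.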

SLIDE 15

Problem

  Example: events/sec for different CPU settings, as reported by the buggy benchmark

  [Charts, before vs. after the DNS fix: events/sec against workers for RAM disk and physical disk, on XEN and on the host]

SLIDE 16

Results – Physical Disk

  [Chart: events/sec vs. workers (= CPUs, 1–16) for bare metal, XEN and KVM]

SLIDE 17

Results – Network (XROOTD)

  [Chart: events/sec vs. workers (= CPUs, 1–16) for bare metal, XEN and KVM]

SLIDE 18

Results – RAM Disk

  [Chart: events/sec vs. workers (= CPUs, 1–16) for bare metal, XEN and KVM]

SLIDE 19

Results – Relative values

  [Charts: VM/bare-metal ratio vs. workers (= CPUs) for RAM disk, network (XROOTD) and physical disk; series: bare metal, KVM, XEN]

SLIDE 20

Results – Absolute values

  [Charts: events/sec vs. workers (= CPUs) for RAM disk, network (XROOTD) and physical disk; series: bare metal, KVM, XEN]

SLIDE 21

Results – Comparison chart

  [Chart: events/sec vs. workers (= CPUs, 2–18) comparing physical disk, Xrootd and RAM disk, each on bare metal, XEN and KVM]

SLIDE 22

Procedure, problems and results

SLIDE 23

In-depth analysis

  • In order to get more details, the program execution was monitored and all the system calls were traced and logged
  • Afterwards, the analyzer extracted useful information from the trace files, such as
    ▫ The time spent in each system call
    ▫ The filesystem / network activity
  • The process of tracing adds some overhead, but it is cancelled out of the overall performance measurement

SLIDE 24

System call tracing utilities

  • STrace
    ▫ Traces application-wide system calls from user space
    ▫ Attaches to the traced process using the ptrace() system call and monitors its activity
  • Advantages
    ▫ Traces the application’s system calls in real time
    ▫ Has very verbose output
  • Disadvantages
    ▫ Creates a big overhead

  [Diagram: STrace intercepting calls between the process and the kernel]

SLIDE 25

System call tracing utilities

  • SystemTAP
    ▫ Traces system-wide kernel activity, asynchronously
    ▫ Runs as a kernel module
  • Advantages
    ▫ Can trace virtually everything on a running kernel
    ▫ Supports scriptable kernel probes
  • Disadvantages
    ▫ It is not simple to extract detailed information
    ▫ System calls can be lost under high CPU activity

  [Diagram: SystemTAP probing the kernel beneath the process]

SLIDE 26

System call tracing utilities

  • Sample STrace output:

  5266 1282662179.860933 arch_prctl(ARCH_SET_FS, 0x2b5f2bcc27d0) = 0 <0.000005>
  5266 1282662179.860960 mprotect(0x34ca54d000, 16384, PROT_READ) = 0 <0.000007>
  5266 1282662179.860985 mprotect(0x34ca01b000, 4096, PROT_READ) = 0 <0.000006>
  5266 1282662179.861009 munmap(0x2b5f2bc92000, 189020) = 0 <0.000011>
  5266 1282662179.861082 open("/usr/lib/locale/locale-archive", O_RDONLY) = 4 <0.000008>
  5266 1282662179.861113 fstat(4, {st_mode=S_IFREG|0644, st_size=56442560, ...}) = 0 <0.000005>
  5266 1282662179.861166 mmap(NULL, 56442560, PROT_READ, MAP_PRIVATE, 4, 0) = 0x2b5f2bcc3000 <0.000007>
  5266 1282662179.861192 close(4) = 0 <0.000005>
  5266 1282662179.861269 brk(0) = 0x1ad1f000 <0.000005>
  5266 1282662179.861290 brk(0x1ad40000) = 0x1ad40000 <0.000006>
  5266 1282662179.861444 open("/usr/share/locale/locale.alias", O_RDONLY) = 4 <0.000009>
  5266 1282662179.861483 fstat(4, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0 <0.000005>
  5266 1282662179.861944 read(4, "", 4096) = 0 <0.000006>
  5266 1282662179.861968 close(4) = 0 <0.000005>
  5266 1282662179.861989 munmap(0x2b5f2f297000, 4096) = 0 <0.000009>
  5264 1282662179.863063 wait4(-1, 0x7fff8d813064, WNOHANG, NULL) = -1 ECHILD (No child processes)
  ...
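Lines like these can be aggregated into per-syscall timings with a small parser. This is a minimal sketch assuming the `strace -f -tt -T` line shape seen in the sample above (PID, timestamp, call, `<duration>`):

```python
import re
from collections import defaultdict

# Matches lines of the form: "PID TIMESTAMP name(args) = ret <duration>"
LINE = re.compile(r"^\s*(\d+)\s+([\d.]+)\s+(\w+)\(.*<([\d.]+)>\s*$")

def aggregate(lines):
    """Sum the time spent in each system call across a trace.
    Lines without a <duration> field (e.g. unfinished calls) are skipped."""
    totals = defaultdict(float)
    for line in lines:
        m = LINE.match(line)
        if m:
            totals[m.group(3)] += float(m.group(4))
    return dict(totals)
```

Running this over the full trace produces exactly the kind of per-call totals shown in the Results tables later in the deck.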

SLIDE 27

KARBON – A trace file analyzer

SLIDE 28

KARBON – A trace file analyzer

  • A general purpose application profiler based on system call trace files
  • It tracks file descriptors and reports detailed I/O statistics for files, network sockets and FIFO pipes
  • It analyzes the child processes and creates process graphs and process trees
  • It can detect the “hot spots” of an application
  • Custom analysis tools can be created on demand using the development API
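The descriptor tracking described above can be illustrated with a toy version. The event format here is a simplified assumption for illustration, not KARBON's real input; it shows how a profiler can attribute read/write bytes to path names by replaying open/close events:

```python
def fd_io_stats(events):
    """Replay (syscall, fd, arg) events and accumulate per-file byte counts,
    the way a descriptor-tracking profiler attributes I/O to path names.
    `events` items: ("open", fd, path), ("read"/"write", fd, nbytes),
    ("close", fd, None)."""
    open_files = {}                       # fd -> path currently bound to it
    stats = {}                            # path -> {"read": n, "write": n}
    for call, fd, arg in events:
        if call == "open":
            open_files[fd] = arg
            stats.setdefault(arg, {"read": 0, "write": 0})
        elif call in ("read", "write") and fd in open_files:
            stats[open_files[fd]][call] += arg
        elif call == "close":
            open_files.pop(fd, None)      # fd numbers are reused after close
    return stats
```

Tracking the open/close pairing is what makes the numbers meaningful: the same fd number can refer to different files over the lifetime of a process.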

SLIDE 29

KARBON – Application block diagram

  [Block diagram: Source (file or TCP stream), Tokenizer, Router, Preprocessing, Tool/Analyzer, Filter, Presenters]

SLIDE 30

Results

  • Time utilization of the traced application

  [Charts: time spent (ms) on file I/O, network I/O and misc calls for RAM disk, physical disk and network (Xrootd), each on bare metal, XEN and KVM]
SLIDE 31

Results

  • Time utilization of the traced application

  [Charts: percentage of time (0–100%) spent on file I/O, UNIX sockets, TCP sockets and misc calls for physical disk, network (Xrootd) and RAM disk, each on bare metal, XEN and KVM]

SLIDE 32

Results

  • Time utilization of the traced application

  [Charts: percentage of time spent on file I/O, UNIX sockets, TCP sockets and misc calls for physical disk, RAM disk and network (Xrootd), each on bare metal, XEN and KVM]

SLIDE 33

Results

  • Overall system call time for filesystem I/O
  • Reminder: kernel buffers were dropped before every test
    ▫ Possible caching effect inside the hypervisor

  [ms]        Reading      Writing     Seeking      Total
  Bare metal  490,861.354  2,054.354   21,594.583   524,872.823
  KVM         38,391.715   36,422.440  122,769.518  244,406.512
  XEN         38,111.980   20,930.382  102,769.901  210,247.468

SLIDE 34

Results

  • Overall system call time for UNIX sockets

  [ms]        Receiving   Sending      Bind, Listen  Connecting  Total
  Bare metal  993.884     10,313.304   4.251         5.259       11,301.588
  KVM         59,637.942  164,655.077  7.412         13.656      223,872.164
  XEN         97,823.986  550,050.484  5.014         8.493       652,784.010

SLIDE 35

Results

  • Most time-consuming miscellaneous system calls

  System call     Bare metal   KVM          XEN
  wait4()         178,200.34   316,829.30   388,885.57
  gettimeofday()  (No trace)   219,780.33   218,018.63
  nanosleep()     (No trace)   12,250.12    12,029.30
  time()          (No trace)   (No trace)   9,081.94
  rt_sigreturn()  150.943      1,685.285    9,271.061
  setitimer()     23.245       698.785      223.669

SLIDE 36

Conclusion

  • Physical disk
    ▫ KVM can achieve better performance than XEN, reaching 70–98% of the native speed
    ▫ Best performance achieved at 6 CPUs / 6 workers (7 GB RAM) with 81% of the native speed
  • Network (Xrootd)
    ▫ XEN can achieve better performance than KVM, reaching 73–90% of the native speed
    ▫ Best performance achieved again at 6 CPUs / 6 workers (7 GB RAM) with 92% of the native speed

SLIDE 37

Conclusion

  • Some disk I/O operations (read) appear to be faster inside the virtual machine
  • Some of them appear to be slower (seek, write)
    ▫ Possible caching effect even on direct disk access
  • Network I/O
    ▫ TCP under XEN looks fine, whereas with KVM there are some issues
    ▫ UNIX sockets seem to carry a significant penalty inside the VMs
  • Some miscellaneous system calls take longer inside the VM
    ▫ Time-related functions (gettimeofday, nanosleep)
       Used for the paravirtualized implementation of other system calls?

SLIDE 38

Other uses of the tools

  • SystemTAP could be used by the nightly builds in order to detect hanging applications
  • KARBON can be used as a general log file analysis program

SLIDE 39

Future work

  • Benchmark VMs with a disk image file residing on a RAID array
  • Benchmark many concurrent KVM virtual machines with total memory exceeding the overall system memory – Exploit NPT
  • Test the PCI pass-through for network cards (KVM) – Test VT-d
  • Convert the benchmark application from Python to pure C
  • Repeat the benchmarks with the optimized ROOT input files
  • Test again the KVM network performance
  • Recompile the kernel with CONFIG_KVM_CLOCK