SLIDE 1

ICS'06 -- June 28th, 2006

A Case for High Performance Computing with Virtual Machines

Wei Huang*, Jiuxing Liu+, Bulent Abali+, and Dhabaleswar K. Panda*
*The Ohio State University    +IBM T. J. Watson Research Center

SLIDE 2

Presentation Outline

  • Virtual Machine environments and HPC
  • Background -- VMM-bypass I/O
  • A framework for HPC with virtual machines
  • A prototype implementation
  • Performance evaluation
  • Conclusion
SLIDE 3

What is a Virtual Machine Environment?

  • A Virtual Machine environment provides a virtualized hardware interface to VMs through a Virtual Machine Monitor (VMM)
  • A physical node may host several VMs, each running a separate OS
  • Benefits: ease of management, performance isolation, system security, checkpoint/restart, live migration …

SLIDE 4

Why HPC with Virtual Machines?

  • Ease of management
  • Customized OS
    – Light-weight OSes customized for applications can potentially gain performance benefits [FastOS]
    – Not widely adopted due to management difficulties
    – VMs make it possible

  • System security

[FastOS]: Forum to Address Scalable Technology for Runtime and Operating Systems

SLIDE 5

Why HPC with Virtual Machines?

  • Ease of management
  • Customized OS
  • System security

    – Currently, most HPC environments disallow users from performing privileged operations (e.g. loading customized kernel modules)
    – This limits productivity and convenience
    – Users can do ‘anything’ inside a VM; in the worst case they crash that VM, not the whole system

SLIDE 6

But Performance?

  • NAS Parallel Benchmarks (MPICH over TCP) in Xen VM environment

    – Communication-intensive benchmarks perform poorly

[Figure: Normalized execution time of NAS benchmarks (BT, CG, EP, IS, SP), VM vs. native]

  • Time profiling using Xenoprof
    – Many CPU cycles are spent in the VMM and the device domain to process network I/O requests

    Benchmark    DomU     VMM     Dom0
    SP           83.8%    6.5%    9.7%
    BT           89.9%    4.0%    6.1%
    EP           99.0%    0.3%    0.6%
    IS           68.8%   13.1%   18.1%
    CG           72.7%   10.7%   16.6%

SLIDE 7

Challenges

  • I/O virtualization overhead
  • A framework to virtualize the cluster environment

    – Jobs require multiple processes distributed across multiple physical nodes
    – Typically requires all nodes to have the same setup
    – How to allow customized OSes?
    – How to reduce other virtualization overheads (memory, storage, etc.)?
    – How to reconfigure nodes and start jobs efficiently?

SLIDE 8

Challenges

  • I/O virtualization overhead [USENIX ’06]
  • A framework to virtualize the cluster environment

    – Jobs require multiple processes distributed across multiple physical nodes
    – Typically requires all nodes to have the same setup
    – How to allow customized OSes?
    – How to reduce other virtualization overheads (memory, storage, etc.)?
    – How to reconfigure nodes and start jobs efficiently?

[USENIX ‘06]: J. Liu, W. Huang, B. Abali, D. K. Panda. High Performance VMM-bypass I/O in Virtual Machines

SLIDE 9

Challenges

  • I/O virtualization overhead [USENIX ’06]

– Evaluation of VMM-bypass I/O with HPC benchmarks

  • A framework to virtualize the cluster environment

    – Jobs require multiple processes distributed across multiple physical nodes
    – Typically requires all nodes to have the same setup
    – How to allow customized OSes?
    – How to reduce other virtualization overheads (memory, storage, etc.)?
    – How to reconfigure nodes and start jobs efficiently?

[USENIX ‘06]: J. Liu, W. Huang, B. Abali, D. K. Panda. High Performance VMM-bypass I/O in Virtual Machines

SLIDE 10

Presentation Outline

  • Virtual Machines and HPC
  • Background -- VMM-bypass I/O
  • A framework for HPC with virtual machines
  • A prototype implementation
  • Performance evaluation
  • Conclusion
SLIDE 11

VMM-Bypass I/O

  • VMM-bypass I/O: guest modules in guest VMs handle setup and management operations (privileged access)
    – Once things are set up properly, devices can be accessed directly from guest VMs (VMM-bypass access)
    – Requires the device to have OS-bypass features, e.g. InfiniBand
    – Can achieve native-level performance

[Figure: Xen I/O architecture — the backend module and privileged module (device drivers) run in Dom0; the guest module in a VM uses privileged access through the VMM for setup, and direct VMM-bypass access to the device for data transfer]

  • Original scheme: the guest module contacts the privileged domain to complete I/O
    – Packets are sent to the backend module and then sent out through the privileged module (e.g. device drivers)
    – The extra communication and domain switches are very costly
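
To make the privileged/bypass split concrete, here is a minimal sketch in C of InfiniBand verbs resource setup. It is our illustration, not Xen-IB's actual code: under Xen-IB, the setup calls below would be forwarded to the backend module in Dom0, while data-path calls such as ibv_post_send() operate on device queues mapped directly into the guest.

```c
/* Minimal sketch (not the actual Xen-IB code): InfiniBand verbs setup vs. data path.
 * The setup/management calls below are privileged -- under Xen-IB they are forwarded
 * to the backend module in Dom0. Data-path calls (ibv_post_send and friends)
 * go directly to hardware, bypassing the VMM entirely.
 * Build: gcc vmm_bypass_sketch.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);   /* privileged: device discovery */
    if (!devs || num == 0) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);      /* privileged: open the HCA */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                   /* privileged: protection domain */

    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,            /* privileged: pin + register memory */
                                   IBV_ACCESS_LOCAL_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);            /* privileged: create queue pair */

    /* Once the QP is connected, the data path is VMM-bypass:
     * ibv_post_send()/ibv_poll_cq() write to memory-mapped queues
     * without trapping into the VMM or Dom0. */
    printf("setup done: qp_num=%u lkey=%u\n", qp->qp_num, mr->lkey);

    ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dereg_mr(mr);
    free(buf); ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```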

SLIDE 12

Presentation Outline

  • Virtual Machines and HPC
  • Background -- VMM-bypass I/O
  • A framework for HPC with virtual machines
  • A prototype implementation
  • Performance evaluation
  • Conclusion
SLIDE 13

Framework for VM-based Computing

  • Physical nodes: each runs the VM environment
    – Typically no more VM instances than the number of physical CPUs
    – A customized OS is achieved through different versions of the images used to instantiate VMs
  • Front-end node: users submit jobs and customized versions of VMs
  • Management module: batch job processing; instantiates VMs and launches jobs
  • VM image manager: updates user VMs, matches user requests with VM image versions
  • Storage: stores different versions of VM images and application-generated data; enables fast distribution of VM images

[Figure: Framework overview — users interact with the front-end; the management module queries the VM image manager, then instantiates VMs and launches jobs on the physical resources (VMM hosting VMs); storage nodes handle image distribution and application data]

SLIDE 14

How Does It Work?

[Figure: Job submission flow — jobs/requests arrive at the front-end; the management module matches the request against images held by the VM image manager, the chosen image is distributed from the storage nodes, and VMs are instantiated and jobs launched on the physical resources]

  • User requests specify: number of VMs, number of VCPUs per VM, operating systems, kernels, libraries, etc.
    – Or: previously submitted versions of a VM image
  • Matching requests: many algorithms have been studied in grid environments, e.g. the Matchmaker in Condor (see the sketch below)
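
As a rough illustration of the matching step, the sketch below compares a user request against stored image metadata. The structs, field names, and exact-match rule are our assumptions for illustration, not the framework's actual data model (which could instead use Condor-style ClassAd matching):

```c
/* Hypothetical sketch of matching a user request against stored VM images.
 * The field names and the exact-match rule are illustrative assumptions. */
#include <stdio.h>
#include <string.h>

struct vm_image {
    const char *name;      /* image identifier in the VM image manager */
    const char *os;        /* e.g. "ttylinux", "AS4" */
    const char *kernel;    /* e.g. "2.6.12-xenU" (hypothetical version) */
    const char *mpi_lib;   /* e.g. "mvapich" */
};

struct request {
    const char *os, *kernel, *mpi_lib;
    int num_vms, vcpus_per_vm;
};

/* Return the first image whose attributes all satisfy the request,
 * or NULL if no stored image matches and a new one must be submitted. */
static const struct vm_image *match(const struct request *req,
                                    const struct vm_image *imgs, int n)
{
    for (int i = 0; i < n; i++)
        if (!strcmp(req->os, imgs[i].os) &&
            !strcmp(req->kernel, imgs[i].kernel) &&
            !strcmp(req->mpi_lib, imgs[i].mpi_lib))
            return &imgs[i];
    return NULL;
}

int main(void)
{
    struct vm_image imgs[] = {
        { "ttylinux-mvapich", "ttylinux", "2.6.12-xenU", "mvapich" },
        { "as4-mvapich",      "AS4",      "2.6.12-xenU", "mvapich" },
    };
    struct request req = { "ttylinux", "2.6.12-xenU", "mvapich", 16, 1 };
    const struct vm_image *hit = match(&req, imgs, 2);
    printf("matched image: %s\n", hit ? hit->name : "(none)");
    return 0;
}
```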

SLIDE 15

Challenges

  • I/O virtualization overhead [USENIX ’06]

– Evaluation of VMM-bypass I/O with HPC benchmarks

  • A framework to virtualize the cluster environment

    – Jobs require multiple processes distributed across multiple physical nodes
    – Typically requires all nodes to have the same setup
    – How to allow customized OSes?
    – How to reduce other virtualization overheads (memory, storage, etc.)?
    – How to reconfigure nodes and start jobs efficiently?

[USENIX ‘06]: J. Liu, W. Huang, B. Abali, D. K. Panda. High Performance VMM-bypass I/O in Virtual Machines

SLIDE 16

Prototype – Setup

  • A Xen-based VM environment on an eight-node SMP cluster with InfiniBand
    – Each node has dual Intel Xeon 3.0GHz CPUs and 2GB of memory
  • Xen-3.0.1: an open-source high performance VMM originally developed at the University of Cambridge
  • InfiniBand: a high performance interconnect with OS-bypass features

SLIDE 17

Prototype Implementation

  • Reducing virtualization overhead:
    – I/O overhead
      • Xen-IB, the VMM-bypass I/O implementation for InfiniBand in the Xen environment
    – Memory overhead, including the memory footprints of the VMM and the OSes in VMs:
      • VMM: can be as small as 20KB per extra domain
      • Guest OSes: specifically tuned for HPC; we reduce the footprint to 23MB at fresh boot-up in our prototype

SLIDE 18

Prototype Implementation

  • Reducing the VM image management cost
    – VM images must be as small as possible to be efficiently stored and distributed
      • Images created based on ttylinux can be as small as 30MB, containing:
      • Basic system calls
      • MPI libraries
      • Communication libraries
      • Any user-specific libraries
    – Image distribution: images are distributed through a binomial tree (see the sketch below)
    – VM image caching: the VM image is cached at the physical nodes as long as there is enough local storage
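
A binomial tree doubles the set of nodes holding the image in each round: the source sends to one node, then both send in parallel, and so on, so N nodes receive the image in ceil(log2 N) rounds instead of N sequential copies from one server. A minimal sketch of the send schedule (the function names are ours; the prototype's actual transfer mechanism is not shown):

```c
/* Sketch of a binomial-tree distribution schedule: in round r, every rank
 * that already holds the image sends it to rank + 2^r, so N nodes are
 * covered in ceil(log2 N) rounds. The transfer itself is stubbed out. */
#include <stdio.h>

static void send_image(int from, int to)   /* stub for the real image copy */
{
    printf("  rank %d -> rank %d\n", from, to);
}

static void binomial_distribute(int nranks)
{
    /* 'step' is 2^round: ranks [0, step) hold the image at round start. */
    for (int step = 1, round = 0; step < nranks; step <<= 1, round++) {
        printf("round %d:\n", round);
        for (int src = 0; src < step; src++)
            if (src + step < nranks)
                send_image(src, src + step);
    }
}

int main(void)
{
    binomial_distribute(8);   /* 8 nodes covered in 3 rounds */
    return 0;
}
```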

  • Left for future work:
    – VM-aware storage to further reduce the storage overhead
    – Matching and scheduling

SLIDE 19

Presentation Outline

  • Virtual Machines and HPC
  • Background -- VMM-bypass I/O
  • A framework for HPC with virtual machines
  • A prototype implementation
  • Performance evaluation
  • Conclusion
SLIDE 20

Performance Evaluation Outline

  • Focused on MPI applications
    – MVAPICH: a high performance MPI implementation over InfiniBand, from the Ohio State University; currently used by over 370 organizations across 30 countries
  • Micro-benchmarks
  • Application-level benchmarks (NAS & HPL)
  • Other virtualization overheads (memory overhead, startup time, image distribution, etc.)

SLIDE 21

Micro-benchmarks

  • Latency/bandwidth:
    – Measured between 2 VMs on 2 different nodes
    – Performance in the VM environment matches native performance
  • Registration cache in effect:
    – Data are sent from the same user buffer multiple times
    – InfiniBand requires memory registration; these tests benefit from the registration cache (a sketch follows the figure)
    – The registration cost (a privileged operation) is higher in the VM environment

[Figure: Latency (us) and bandwidth (MB/s) vs. message size, Xen vs. native — the two curves nearly overlap]
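
A registration cache remembers (address, length) → memory-region mappings so repeated sends from the same buffer skip the expensive registration. A minimal sketch over libibverbs, ignoring the eviction and invalidation logic that a production cache such as MVAPICH's must implement:

```c
/* Minimal registration-cache sketch over libibverbs: look up a previously
 * registered region covering [addr, addr+len); register and remember on a
 * miss. Real caches also evict entries and invalidate them when buffers
 * are freed; this sketch does not. */
#include <string.h>
#include <infiniband/verbs.h>

#define CACHE_SLOTS 64

struct reg_entry { void *addr; size_t len; struct ibv_mr *mr; };
static struct reg_entry cache[CACHE_SLOTS];
static int cache_used;

struct ibv_mr *reg_cached(struct ibv_pd *pd, void *addr, size_t len)
{
    /* Hit: an existing registration fully covers the requested range. */
    for (int i = 0; i < cache_used; i++)
        if (addr >= cache[i].addr &&
            (char *)addr + len <= (char *)cache[i].addr + cache[i].len)
            return cache[i].mr;

    /* Miss: pay the registration cost -- a privileged operation that is
     * more expensive inside a VM, which is what the cache amortizes. */
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr && cache_used < CACHE_SLOTS)
        cache[cache_used++] = (struct reg_entry){ addr, len, mr };
    return mr;
}
```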

SLIDE 22

Micro-benchmarks (2)

[Figure: Large-message latency (us) and bandwidth (MB/s) vs. message size without the registration cache, Xen vs. native]

  • This set of results is taken without the registration cache
  • For MVAPICH, small messages are sent through pre-registered buffers, so only for medium to large messages (>16k) do we see a difference
  • Latency: consistently around 200us higher in the VM environment
  • Bandwidth: the difference is smaller due to the potential overlap of registration and communication
  • This is the worst-case scenario: many applications show good buffer re-use

SLIDE 23

HPC Benchmarks (NAS)

  • The NAS Parallel Benchmarks achieve similar performance in the VM and native environments

  • Time profiling using Xenoprof
    – It is clear that most time is spent in effective computation in the DomUs

    Benchmark    DomU     VMM     Dom0
    MG           97.3%    1.0%    1.8%
    IS           94.5%    1.9%    3.6%
    FT           97.9%    0.5%    1.6%
    SP           99.6%    0.1%    0.3%
    LU           99.0%    0.3%    0.6%
    EP           99.3%    0.3%    0.6%
    CG           99.0%    0.3%    0.6%
    BT           99.4%    0.2%    0.4%

[Figure: Normalized execution time of NAS benchmarks (BT, CG, EP, FT, IS, LU, MG, SP), VM vs. native]

SLIDE 24

HPC Benchmarks (HPL)

[Figure: HPL performance (GFLOPS) on 2, 4, 8, and 16 processes, Xen vs. native]

  • HPL: the achievable GFLOPS in the VM and native environments is within a 1% difference

SLIDE 25

Management Overhead

  • VM image size: ~30MB
  • Reduced services allow VMs to be started very efficiently
  • The small image size and the binomial tree distribution make image distribution fast

    Image           Startup   Shutdown   Memory
    ttylinux-domU   5.3s      5.0s       23.6MB
    AS4-domU        24.1s     13.2s      77.1MB
    AS4-native      58.9s     18.4s      90.0MB

    Image distribution time by number of nodes:

    Scheme          1       2       4       8
    Binomial tree   1.3s    2.8s    3.7s    5.0s
    NFS             4.1s    6.2s    12.1s   16.1s

SLIDE 26

Conclusion

  • We proposed a framework to use a VM-based computing environment for HPC applications
  • We explained how the disadvantages of virtual machines can be addressed using current technologies, demonstrating our framework with a prototype implementation
  • We carried out detailed performance evaluations on the overhead of VM-based computing for HPC applications, showing that the virtualization cost is marginal
  • Our case study holds promise for bringing the benefits of VMs to the area of HPC

SLIDE 27

Future work

  • Migration support for VM-based computing environments with VMM-bypass I/O
  • Investigate scheduling and resource management schemes
  • More detailed evaluations of VM-based computing environments

SLIDE 28

Acknowledgements

Our research at the Ohio State University is supported by the following organizations:

  • Current funding support by [sponsor logos]
  • Current equipment support by [sponsor logos]
SLIDE 29

Thank you!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/